
The 3 of 12 Rule: Choosing Which Third-Party Dependencies to Monitor Synthetically

May 24, 2026 · 11 min read

Triage: Most SaaS apps run on 10 to 20 third-party dependencies. You cannot afford a custom monitor recipe for each one, and you should not try. The right move is dependency-graph triage. Rank every vendor on three axes: blast-radius (what breaks downstream when they fail), revenue-attribution (what dollars stop), and vendor-status-lag-history (how late their status badge tells the truth). The top 3 get a 5-minute synthetic of the exact call your app makes. The next 5 stay on the vendor's status page as a secondary signal. The bottom 4 you accept as quiet risk. This post walks the triage across the transactional email path, the vendor status page itself, and the webhook receiver you cannot host, using the Datadog External Provider Status launch from October 2025 as evidence that the largest SRE teams already treat the dependency-monitoring problem as tier-1.

You have 12 third-party dependencies. You should monitor 3 of them synthetically.

Count the third-party dependencies your app calls in production. Auth provider. Email vendor. Payments. Object storage. CDN. DNS. Search. Queue. Webhook senders. AI provider if you have one. SMS provider if you have another. CRM API. Most SaaS apps land between 10 and 20 once you write the list down honestly. A custom synthetic per vendor on a 5-minute interval is ten to twenty monitors, and on most paid uptime tools that is real money per month before you have caught a single incident.

The triage rule we run is this. Score every vendor on three axes from 0 to 3. Blast-radius is what breaks for your customers the moment the vendor goes dark. A payments vendor sits at 3. A vendor used for an offline weekly report sits at 0. Revenue-attribution is what dollars stop. A vendor in the checkout path is 3. A vendor in the marketing-email path is 1, because delayed marketing email rarely breaks the buy. Vendor-status-lag-history is how late their public status page has run on their last three incidents. A vendor that posted within 5 minutes of impact three times in a row scores 0. A vendor that has shown a 30-plus minute lag at least once scores 3.

Add the three axes. Vendors that score 7 or higher belong in the top bucket. Cap the bucket at 3, the budget you actually have. Vendors that score 4 to 6 sit in the next 5: their status page is a secondary signal, you accept the lag, you do not run your own probe. Vendors that score 3 or under sit in the bottom 4: you accept that you find out late. The dependency-graph triage is the same triage every SRE team eventually arrives at. The work is making the scoring explicit and revisiting the list every quarter when vendors change.
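The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not a Velprove feature: the Vendor class, the triage function, and the example scores are all hypothetical. One assumption goes beyond the text: vendors that score 7 or higher but do not fit under the cap of 3 fall through to the status-page bucket.

```python
from dataclasses import dataclass

TOP_BUCKET_CAP = 3  # the monitor budget the post assumes


@dataclass(frozen=True)
class Vendor:
    name: str
    blast_radius: int  # 0-3: what breaks for customers
    revenue: int       # 0-3: what dollars stop
    status_lag: int    # 0-3: how late the status page has run

    @property
    def score(self) -> int:
        return self.blast_radius + self.revenue + self.status_lag


def triage(vendors):
    """Sort vendors by total score and split them into the three buckets."""
    ranked = sorted(vendors, key=lambda v: v.score, reverse=True)
    qualifying = [v for v in ranked if v.score >= 7]
    return {
        # Top bucket: score 7+, capped at the budget.
        "synthetic": [v.name for v in qualifying[:TOP_BUCKET_CAP]],
        # Assumption: overflow from the cap joins the 4-to-6 vendors.
        "status_page": [v.name for v in qualifying[TOP_BUCKET_CAP:]]
        + [v.name for v in ranked if 4 <= v.score <= 6],
        # Bottom bucket: score 3 or under, accepted as quiet risk.
        "accept_risk": [v.name for v in ranked if v.score <= 3],
    }


# Hypothetical scores, for illustration only.
example = [
    Vendor("payments", 3, 3, 2),
    Vendor("auth", 3, 2, 3),
    Vendor("search", 2, 1, 1),
    Vendor("weekly-report", 0, 0, 1),
]
print(triage(example))
```

Keeping the scores in a file like this makes the quarterly review a diff instead of a debate.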

This is the generalized parent of two posts we already shipped. For the vendor-specific worked examples, see the LLM-scoped version of this triage and the Stripe-scoped version of this triage. Both posts walk one vendor through the triage at full depth. This post walks the framework across a portfolio.

What a synthetic of your own call actually buys you (the Datadog 2025 anchor)

On October 21, 2025, Datadog launched External Provider Status alongside a free public companion at Updog.ai. The launch pitch is dependency-graph monitoring as a product category. In Datadog's own words: “Datadog External Provider Status provides real-time visibility into the health of more than 40 third-party providers, including 13 AWS services across global regions and widely used SaaS APIs such as GitHub, Stripe, and OpenAI.”

The detection-time claim is the load-bearing number. From the same launch: “During a DynamoDB degradation on July 3, 2025, Datadog surfaced the issue 32 minutes before AWS acknowledged it on their status page.” Thirty-two minutes is the gap between when Datadog's customer telemetry started flagging the vendor and when the AWS status page caught up. The same 32-minute number appears in the companion Updog.ai launch post: “Instead of depending on provider updates, Updog.ai is powered by aggregated, anonymized observability data and AI models.”

Datadog's model is telemetry-derived: it needs APM agents installed and live customer traffic flowing to spot the vendor degradation in aggregate. Velprove's wedge is the opposite shape: a 5-minute synthetic of the exact dependency call your app makes, run from one region you pick out of the 5 available, with no agents and no instrumentation. It works on day one. If Datadog needed to ship a new product to close the 32-minute gap, you can ship the lightweight version yourself in 15 minutes per top-3 vendor.

This is the structural reason a plain HTTP probe on the vendor's documented endpoint is not the same primitive. The vendor's endpoint often returns 200 while the call your app actually makes (the one with your auth, your headers, your payload shape) fails. The broader version of that argument lives in why HTTP probes alone miss vendor-degradation outages.

What status-page-lag history tells you about a vendor

The third triage axis (vendor-status-lag-history) is the one teams skip because it requires looking at the vendor's last three incidents and measuring the delta between impact start and the first Investigating update. The work is worth it, because the lag varies wildly across vendors. Stripe routinely posts within 5 to 10 minutes. GitHub posted the May 15 2026 Actions degradation 30 minutes after impact. AWS posted the July 3 2025 DynamoDB degradation 32 minutes after Datadog's telemetry caught it. The lag is structural to the vendor, not random.

Third-party aggregator IsDown reports a wider gap across its provider pool. In their own product blog, they write: “In January 2026, IsDown detected outages up to 2.2 hours before vendors acknowledged them, and caught 101 incidents that vendors never reported at all.” Two caveats on that number. It is self-reported product telemetry from a competing monitoring tool, not third-party-audited. And it is an aggregate across the IsDown provider pool in a single month, not a per-vendor claim. Read it as an upper bound on how bad vendor status pages can run, not as the median.

The rule we use is rougher and easier to apply. Pull the vendor's status page incident history. Look at the last three incidents. If all three were posted within 5 minutes of impact, the status page is acceptable as a secondary signal. If any one of the three was posted 30 minutes or more after impact, the status page is not the monitor: a synthetic is. The broader structural argument that vendor status pages lag for the same reason customer-facing dashboards always lag the truth lives in vendor status-page lag is a structural problem.
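The three-incident rule can be written down as a small function. This is a sketch under stated assumptions: the function names and the verdict strings are hypothetical, and the "judgment-call" outcome is our own label for the case the rule leaves open (no incident hit 30 minutes, but not all three landed within 5).

```python
from datetime import datetime


def lag_minutes(impact_start, first_update):
    """Minutes between impact start and the first Investigating update."""
    return (first_update - impact_start).total_seconds() / 60


def status_page_verdict(lags):
    """Apply the three-incident rule to a list of three lag values (minutes)."""
    if len(lags) != 3:
        raise ValueError("expected the lag for exactly three incidents")
    if any(lag >= 30 for lag in lags):
        return "run-synthetic"      # the status page is not the monitor
    if all(lag <= 5 for lag in lags):
        return "status-page-ok"     # acceptable as a secondary signal
    return "judgment-call"          # assumption: the rule does not cover this


# Hypothetical timestamps, for illustration only.
lag = lag_minutes(datetime(2030, 1, 1, 10, 0), datetime(2030, 1, 1, 10, 32))
print(lag, status_page_verdict([3, lag, 4]))
```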

Triage worked example #1: the transactional email path

Vendor candidates: SendGrid, Postmark, Mailgun. Score the axes. Blast-radius is HIGH: password resets, invoice receipts, double-opt-in confirmations, and magic-link auth all die together when the send API fails. Revenue attribution is MEDIUM: rarely the direct buy path, but onboarding-email failures kill activation and churn-recovery email failures cost real dollars at the long tail. Vendor-status-lag-history is LOW for the three named vendors: their status pages have been generally honest. Combined score: high enough to land in the top 3 for most SaaS shops.

The Velprove recipe is a single API monitor. HTTP POST to the provider's send endpoint with a sentinel recipient address you own (something like monitor@yourdomain.com routed to /dev/null on your end). Assert status_code eq 202 and header_contains x-message-id on the response. Both assertions run on every Velprove plan, including Free. Each monitor run is one snapshot of the send path at that interval, not a poll: the monitor fires once, reads the response, and records the result. The synthetic catches API-side failures (rate limits, auth-key rotation breakage, provider 5xx). It does not catch deliverability failures (the message accepted by the vendor but never landing in the inbox). Velprove does not read inboxes and the browser login monitor cannot click an email link. For deliverability, layer a dedicated inbox-monitoring tool on top.
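To make the two assertions concrete, here is a rough Python equivalent of what the monitor checks on each run. It is not Velprove's implementation or configuration format. The function names are hypothetical, the send URL and auth headers are left as parameters because they differ per provider, and we read header_contains x-message-id as "a non-empty x-message-id header is present".

```python
import json
import urllib.error
import urllib.request


def check_send_response(status_code, headers):
    """Return the list of failed assertions (an empty list means healthy)."""
    failures = []
    if status_code != 202:
        failures.append("status_code eq 202")
    lowered = {key.lower(): value for key, value in headers.items()}
    if not lowered.get("x-message-id"):
        failures.append("header_contains x-message-id")
    return failures


def probe_send_api(url, auth_headers, payload, timeout=10):
    """Fire one POST at the provider's send endpoint and evaluate it once."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", **auth_headers},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return check_send_response(response.status, dict(response.headers.items()))
    except urllib.error.HTTPError as error:
        # 4xx and 5xx still carry a status code and headers worth asserting on.
        return check_send_response(error.code, dict(error.headers.items()))
    except OSError as error:
        # DNS failures, refused connections, and timeouts land here.
        return [f"transport error: {error}"]


print(check_send_response(202, {"X-Message-Id": "abc123"}))
```

The single-shot shape is deliberate: like the monitor, the probe reads one response and records one result, with no retry loop.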

A second pattern fits the same triage shape when the vendor exposes a customer dashboard you actually sign in to (Stripe Dashboard, AWS Console, SendGrid web app). Velprove's browser login monitor drives a real browser through the vendor's sign-in page with a dedicated low-privilege test account, then asserts on a post-login element only authenticated users see. If the vendor's auth backend is degraded but their public API returns 200, the API synthetic stays green and the browser login monitor flips red. The Free plan includes one browser login monitor, running every 15 minutes from whichever of the 5 available regions you pick. The two synthetics layer cleanly on the same vendor.

The shape generalizes to any vendor whose API you call to produce a side-effect (send, charge, upload, dispatch). The foundational primitive is the multi-step API monitor, which the Free plan supports at up to 3 steps (5 on Starter, 10 on Pro). For email, 1 step covers it. The Stripe-checkout pattern needs 3 steps to land the full flow.

Triage worked example #2: the vendor status page itself

The triage question for the vendor status page is narrower than it sounds. The question is not “should I scrape my vendor's status page?” The question is “when can I trust this vendor's status page enough to skip running my own probe against the vendor?” The answer is the third axis. If the vendor's last three incidents posted within 5 minutes of impact, treating the status page as the secondary signal is reasonable. If any one of the three posted 30 minutes late, the status page is not the monitor.

AWS sits in the second bucket. The July 3 2025 DynamoDB degradation is the Datadog anchor (32-minute gap) and the October 20 2025 AWS US-EAST-1 cascade is the worst-case example. Slack sits in the second bucket. Most SaaS APIs in the long tail (auth providers, search vendors, CRM webhooks) sit in the second bucket too: their status pages exist, but they update slowly because they are driven by manual SRE confirmation, not by customer telemetry. The right action is to scrape the status page as a secondary signal if you want richer context, but never to rely on it as the primary detection instrument for any vendor that scored 3 on the lag axis. The cluster-fold variant of this pattern (other failure modes that look green on the dashboard) lives in the silent-outage taxonomy.

Triage worked example #3: the webhook receiver you cannot host

The triage question for webhook-driven vendors is constrained by what Velprove can and cannot do. Velprove does not host an inbound webhook receiver. We cannot accept the vendor's POST, parse it, and assert on the payload. That is a category of product (webhook capture and replay) we do not ship. The pattern that works inside Velprove's primitive set is two monitors that compose: trigger the workflow on the vendor side with one monitor, then check the downstream effect on your own application endpoint with a second monitor on the next interval.

Stripe checkout is the canonical example. Monitor A is an API monitor that creates a test checkout session against Stripe's test mode. Monitor B is an API monitor that hits your own /api/orders/test-canary endpoint (or whatever you name it), with a json_path assertion that $.status equals the literal string paid. Your endpoint records the most recent test-canary state server-side, returns 200 with the static JSON if it has been updated by the Stripe webhook within your acceptable window, and returns 503 if it has not. Velprove does not need to know about the webhook itself. It only asserts on what your endpoint says about the webhook's effect. The deeper version of this pattern, including how to design the /api/orders/test-canary endpoint, lives in trigger-and-check-effect for webhooks. The shape applies to any vendor whose only outage signal is a webhook you cannot receive: SMS delivery callbacks, payment confirmations, CRM record-update events, build-finished notifications.
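The server-side half of Monitor B can be sketched as a pure function that your own endpoint would call. This is a minimal illustration with assumptions stated plainly: the function name, the 15-minute window, and the "stale" body on the 503 path are hypothetical, and how you store the last webhook timestamp is up to your application.

```python
import json
import time

MAX_WEBHOOK_AGE_SECONDS = 15 * 60  # assumption: pick your own acceptable window


def canary_response(last_webhook_at, now=None, max_age=MAX_WEBHOOK_AGE_SECONDS):
    """Return (http_status, json_body) for the test-canary endpoint.

    last_webhook_at is the Unix time the last test webhook was recorded,
    or None if none has ever arrived.
    """
    now = time.time() if now is None else now
    if last_webhook_at is not None and now - last_webhook_at <= max_age:
        # Fresh: static body, so a json_path check on $.status sees "paid".
        return 200, json.dumps({"status": "paid"})
    # Stale or missing: the monitor fails on the status code alone.
    return 503, json.dumps({"status": "stale"})


print(canary_response(last_webhook_at=1_000, now=1_300))
```

Freshness is computed here, on your side, which is why the monitor only ever needs a static expected value.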

The 5-region pattern and partial regional degradation

Velprove offers 5 regions to choose from on every plan, including Free. Each monitor runs from one region you pick, not from all five at once. The triage implication is straightforward. For a vendor with global failure modes (Cloudflare data plane, AWS US-EAST-1 cascades), a single monitor in any one region catches the incident. For a vendor with regional failure modes (a CDN with PoP-specific issues, an auth provider whose European cluster degrades independently of US-East), you create one monitor per region you want to cover.

The Cloudflare November 18 2025 outage is the clean global example. Cloudflare's own post-mortem at blog.cloudflare.com/18-november-2025-outage records 11:20 UTC to 17:06 UTC, roughly 5 hours and 46 minutes of global data plane impact. Core CDN and security services returned HTTP 5xx status codes across every Cloudflare region. A Velprove synthetic from any of the 5 regions would have flipped red inside one monitor interval. No region selection wisdom was required.

The October 20 2025 AWS DynamoDB cascade is the contrasting story. ThousandEyes' post-incident analysis at thousandeyes.com/blog/aws-outage-analysis-october-20-2025 documents the shape: a DynamoDB DNS race condition surfaced at 6:49 AM UTC October 20, AWS engineers identified the cause by 7:26 AM UTC, DNS was fully restored between 9:25 and 9:40 UTC, and EC2 instance launches continued failing until 8:50 PM UTC, with Redshift cluster backlogs not cleared until 11:05 AM UTC October 21. The customer-visible window ran over 15 hours. Many of the downstream phases hit US-EAST-1 specifically. A monitor from a non-US region would have stayed green for the EC2-launch phase while a US-region monitor turned red. That asymmetry is the case for putting your top-3 vendor monitors in two or three regions when you can spend the monitor budget.
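Because each monitor runs from one region, covering more regions multiplies the monitor count. The sketch below shows that arithmetic against the 10-monitor Free cap. The function, the vendor labels, and the region names are hypothetical placeholders, not Velprove's actual region identifiers.

```python
FREE_PLAN_MONITOR_CAP = 10  # the Free plan monitor count described in this post


def plan_monitors(vendors, regions):
    """Expand (vendor, failure_mode) pairs into (vendor, region) monitors.

    "global" vendors get one monitor in the first region listed;
    "regional" vendors get one monitor per region listed.
    """
    if not regions:
        raise ValueError("at least one region is required")
    plan = []
    for name, failure_mode in vendors:
        if failure_mode == "global":
            plan.append((name, regions[0]))
        elif failure_mode == "regional":
            plan.extend((name, region) for region in regions)
        else:
            raise ValueError(f"unknown failure mode: {failure_mode!r}")
    return plan


# Hypothetical top-3 vendors and region names.
plan = plan_monitors(
    [("cdn", "regional"), ("auth", "regional"), ("payments", "global")],
    ["us-east", "eu-west"],
)
print(len(plan), "monitors; fits Free cap:", len(plan) <= FREE_PLAN_MONITOR_CAP)
```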

When NOT to monitor a dependency synthetically (the third bucket)

The honest counterweight to the triage rule is the bottom-4 bucket. Some vendors do not justify a synthetic, because the cost of running the monitor (the time to set it up, the alert noise, the slot it takes in your monitor budget) exceeds the cost of finding out late.

Three concrete shapes land in the bottom bucket reliably. A vendor used for an offline batch report that runs nightly: a 10-minute outage at 4 AM costs nothing real, and your nightly job retries on its own. A vendor used for a low-traffic internal admin feature: you will find out the next time you click the button, which is rare enough that the monitor is overhead. A vendor with a fast, honest status page and an email-subscription pipeline where 5-minute-late detection is acceptable to your operation. Calling these out explicitly is part of the triage: the rule is “3 of 12,” not “all 12.” The point of triage is to spend your monitor budget where it earns its keep, and to consciously accept the risk on the rest.

Patterns to avoid (honest about Velprove's primitive set)

Five patterns commonly recommended for third-party API monitoring do not fit Velprove's primitive set. Naming them is faster than pretending they are options.

No polling primitive. Velprove's multi-step API monitor runs each step exactly once in sequence, then records the result. There is no “keep hitting this endpoint until X” option, no retry-until-success loop, no condition-wait. The replacement is the monitor interval itself. If you need 30-second granularity, set the interval to 30 seconds on the Pro plan.

No time-relative assertion type. The six assertions Velprove supports are status_code, body_contains, body_not_contains, json_path, response_time_ms, and header_contains. There is no “assert this timestamp field is within the last 60 seconds” primitive. The replacement is your endpoint computing freshness server-side and returning 200 or 503, or a json_path assertion against a static expected value.

No percentile latency thresholds. response_time_ms is a per-request budget, not a p95 or p99 aggregate. The replacement is to set a per-request threshold that allows for some single-request noise, and to configure your alert rule to fire on N consecutive failures. The same goal (catch sustained slowdown, not single slow requests) is met by the consecutive-failure rule.

No inbound webhook receiver. As described in worked example 3, Velprove does not host an endpoint that catches third-party webhooks. The replacement is trigger-and-check-effect: two monitors that compose, where the second asserts on your own application state after the vendor's webhook has had time to fire.

No distributed tracing, no RUM. Velprove is the outside-in synthetic layer. APM tracing (Datadog, Honeycomb) and Real User Monitoring (Splunk, LogicMonitor) are complementary categories, not replacements. The right view of the dependency call from inside your application is the trace; the right view from outside is the synthetic.

One final note on alert channels. Today, Velprove's alert channels are email (every plan); Slack, Discord, webhook, and Microsoft Teams (Starter and above); and PagerDuty (Pro). There is no mobile push channel on any plan today. Plan your alert routing around what exists, not what should exist. The opposite-prescription view of dependency monitoring lives in the inverse view, why your own /healthz should NOT deep-probe these dependencies inside a liveness probe. The complement holds: synthetic-from-outside, plain-liveness-from-inside.

ThousandEyes' analysis of the October 20 2025 AWS incident captures the recovery-shape implication well: “Recovery timelines are sums of dependent phases, not parallel operations.” The triage rule above tells you which vendors to monitor. The recovery shape tells you why your incident playbook should keep the monitor running through the all-clear: the vendor's status page going green is the first phase, not the last.

Frequently Asked Questions

How do I decide which of my third-party dependencies to monitor synthetically?

Score every vendor on three axes from 0 to 3. Blast-radius is what breaks for your customers when the vendor goes dark (payments 3, weekly report 0). Revenue-attribution is what dollars stop (checkout 3, marketing email 1). Vendor-status-lag-history is how late the vendor posted its last three incidents (within 5 minutes 0, 30+ minutes 3). Sum the axes. Vendors scoring 7 or higher belong in the top 3. Vendors scoring 4 to 6 belong in the next 5, with the vendor status page as secondary signal. Vendors scoring 3 or under sit in the bottom 4: you consciously accept that you find out late. Revisit the list every quarter when vendors and traffic patterns change.

What is the realistic monthly cost of running a synthetic monitor per third-party vendor?

On Velprove's Free plan, three synthetic monitors on your top-3 vendors cost $0 per month, assuming your total monitor count stays within the 10-monitor Free cap. Free includes 5-minute intervals, multi-step API monitors up to 3 steps, 1 browser login monitor (every 15 minutes), all six assertion types, and email alerts, with commercial use allowed. Starter at $19 per month unlocks 1-minute intervals plus Slack, Discord, webhook, and Teams channels. PagerDuty ships on Pro at $49 per month. By comparison, Datadog Synthetic prices per-test-per-region: three vendor synthetics from three regions run into low three figures monthly at Datadog's current list price.

How do I know when a vendor's status page is reliable enough that I do not need my own monitor?

Pull the vendor's status page incident history and look at the last three incidents. Measure the gap between impact start (usually disclosed in the Resolved update) and the first Investigating update. If all three incidents posted within 5 minutes of impact, the status page is acceptable as a secondary signal: you can lean on it instead of running your own probe. If any one of the three incidents posted 30 minutes or more after impact, the status page is not the monitor. Stripe sits in the first bucket. AWS, Slack, and most long-tail SaaS sit in the second. Status page subscriptions are still useful for downstream context, even for vendors in the second bucket. They just are not the primary detection instrument.

How do I monitor a vendor whose API is bursty and noisy on the happy path?

Bursty vendors generate single-request slow responses that are not real incidents. The response_time_ms assertion is a per-request budget, not an aggregate, so a single slow response will trip a raw threshold. The fix is two configuration choices. First, set response_time_ms to a threshold that allows for some single-request noise (often 2x or 3x the observed p50 from your own client telemetry). Second, configure the monitor's alert rule to fire on N consecutive failures instead of a single failure. Three consecutive 5-minute checks failing is 10 to 15 minutes of sustained degradation, which is the signal you actually want. The consecutive-failure rule is available on every plan.
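The two choices combine as follows. This is a sketch of the logic only, not Velprove's alerting code: the function names and the sample latencies are hypothetical, and the p50 is taken as the median of whatever samples your own client telemetry provides.

```python
from statistics import median


def threshold_from_p50(samples_ms, multiplier=3):
    """Per-request budget: a multiple of the observed p50 (median) latency."""
    return multiplier * median(samples_ms)


def alert_fires(check_times_ms, threshold_ms, consecutive_needed=3):
    """True once `consecutive_needed` checks in a row exceed the threshold."""
    streak = 0
    for elapsed in check_times_ms:
        # A slow check extends the streak; a healthy one resets it.
        streak = streak + 1 if elapsed > threshold_ms else 0
        if streak >= consecutive_needed:
            return True
    return False


# Hypothetical latencies in milliseconds.
budget = threshold_from_p50([180, 200, 220, 210, 190])   # median 200 -> 600 ms
print(alert_fires([250, 900, 300, 700, 300], budget))    # isolated spikes only
print(alert_fires([250, 900, 300, 700, 800, 950], budget))  # three slow in a row
```

In the first run the slow responses never line up three in a row, so nothing fires; in the second, the last three checks all exceed the budget and the alert triggers.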

How do I monitor a vendor whose only outage signal is a webhook I cannot receive in Velprove?

Velprove does not host an inbound webhook receiver. We cannot accept the vendor's POST and parse the payload. The pattern that works is trigger-and-check-effect with two composed monitors. Monitor A is an API monitor that triggers the workflow on the vendor side (POST a test checkout, dispatch a test SMS, kick off a test build). Monitor B is an API monitor that hits your own application endpoint (/api/canary/whatever) on the next monitor interval, with a json_path assertion against a static expected value. Your endpoint records the most recent webhook-driven state server-side and returns 200 with the static JSON when the webhook arrived, or 503 when it did not. The deeper version with the Stripe checkout shape lives in monitor Stripe webhooks.

Can I do this on the free plan?

Yes. Velprove's Free plan includes 10 monitors at a 5-minute interval, multi-step API monitors up to 3 steps, 1 browser login monitor (every 15 minutes), HTTP and API monitors with all six assertion types, and email alerts. Three synthetic API monitors on your top-3 vendors fit inside Free as long as your overall monitor count stays within the 10-monitor cap. No credit card. Commercial use allowed.