Anatomy of a Silent Outage: 10 Failures HTTP Misses
Bottom line: on June 12, 2025, Cloudflare's own Workers KV storage failed for 2 hours and 28 minutes. Cloudflare Access went to 100% identity-login failure. Turnstile widgets stopped resolving. WARP couldn't register new sessions. Marketing edge stayed up the entire time. Any HTTP monitor pointed at a customer's marketing page returned 200 OK throughout. The same customer's authenticated dashboard was unreachable. That gap, between what an HTTP probe sees and what a real user sees, is a silent outage. Below are 10 failure patterns where HTTP monitoring returns 200 while users see something broken, each with the public incident or vendor documentation that proves it.
The Cloudflare June 12 2025 Workers KV outage (and what your monitor saw)
A silent outage is when an HTTP monitor returns 200 OK while real users cannot complete the action they came to do. Cloudflare's June 12, 2025 Workers KV post-mortem is the cleanest public example from the last twelve months. Per Cloudflare's own engineering blog post on the June 12, 2025 service outage, the underlying storage provider for Workers KV failed for 2 hours and 28 minutes. The cascade reached Cloudflare Access, Turnstile, WARP, Workers AI, and parts of the dashboard. Marketing properties served from the edge kept serving.
That asymmetry is the point. A customer running a marketing site on Cloudflare and a separate Access-protected dashboard would have seen two different realities at the same time. The marketing URL returned 200 OK every minute. The Access redirect to the identity provider hung or failed. An HTTP monitor pointed at the marketing site never flipped. An HTTP monitor pointed at the dashboard URL might have returned 200 on the redirect page itself, then handed the user to a login flow that never resolved.
The customer-facing impact was binary. Anyone who only owned the marketing surface slept through it. Anyone who depended on Access for the actual product surface lost two and a half hours of authenticated traffic. The monitoring evidence on both sides was a green dashboard.
10 failures HTTP monitoring misses
Each pattern below is the same shape: the origin returns 200, the status code says healthy, and the user sees something broken. Where a named public incident exists, the source link goes to the post-mortem or vendor documentation. Two patterns are marked [REPRODUCIBLE-ONLY] because the failure mode is documented in vendor docs and bug trackers but no single named public incident serves as a clean anchor.
1. Identity provider dependency outage
Your application authenticates against an identity provider or a token store. The IdP has an outage. Public pages and your origin stay green. The login button takes users to a redirect URL that hangs or errors.
Cloudflare Access went to 100% failure for all identity-based logins during the June 12, 2025 Workers KV outage, while marketing properties on the same Cloudflare network kept serving, per Cloudflare's post-mortem. Xbox Live sign-in was unavailable for roughly 7 hours on July 2, 2024, while account services failed; see BleepingComputer's coverage and Variety's coverage.
The HTTP probe saw 200 on the marketing surface and on the login page itself. The user saw an IdP redirect that never returned a session.
A browser login monitor walks the redirect chain to the IdP and back. If the IdP hangs, the monitor times out at the IdP step and captures a screenshot of where the redirect died.
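The redirect-chain walk can be sketched as a loop that records the exact hop where the chain dies. A minimal Python sketch; the HOPS map and URLs are hypothetical stand-ins for real HTTP responses, not any monitor's actual implementation:

```python
# Hypothetical redirect map standing in for real HTTP hops; a value of
# None simulates a hop that never answers within the step timeout.
HOPS = {
    "https://app.example.com/login": "https://idp.example.com/authorize",
    "https://idp.example.com/authorize": None,  # IdP hangs here
}

def walk_redirect_chain(url, max_hops=10):
    """Follow each hop in turn, recording exactly where the chain dies."""
    chain = [url]
    while url in HOPS and len(chain) <= max_hops:
        nxt = HOPS[url]
        if nxt is None:  # this hop timed out
            return {"ok": False, "failed_at": url, "chain": chain}
        url = nxt
        chain.append(url)
    return {"ok": True, "chain": chain}

result = walk_redirect_chain("https://app.example.com/login")
print(result["failed_at"])  # https://idp.example.com/authorize
```

A status-code probe collapses this whole chain into one number; the walk preserves the step where the login actually died.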
2. Auth backend up, sessions broken (token mixup)
Web servers respond. The login form renders. The POST to the login handler returns 200. But the session token issued is invalid, mixed up across users, or the session store has flushed. Users either get bounced back to login or end up on the wrong account.
Meta's March 5, 2024 outage is the cleanest public case. Servers were reachable and network paths were clear. Users got incorrect-password errors on correct passwords, and some users reported being logged into other people's accounts, per ThousandEyes' analysis and Born's reporting on the token-mixup behavior.
The HTTP probe saw 200 on every public surface. The user saw a rejected password they knew was right.
A browser login monitor signs in as a known test user and asserts on a post-login element specific to that user (the test user's name in the navbar, not a generic dashboard string). A token mix-up trips that assertion immediately.
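That user-specific assertion is the whole trick. A minimal sketch of the check, assuming hypothetical markup and a hypothetical test-user name:

```python
def assert_post_login(html, expected_user):
    """Return an error string, or None if the page is healthy.
    Asserts on the specific test user's name, not a generic string."""
    if "Dashboard" not in html:
        return "page did not render"
    if expected_user not in html:
        return f"page does not show {expected_user}: possible token mix-up"
    return None

# Simulated post-login bodies (hypothetical markup):
healthy = "<nav>Dashboard - signed in as monitor-test-user</nav>"
mixed_up = "<nav>Dashboard - signed in as someone-else</nav>"

print(assert_post_login(healthy, "monitor-test-user"))   # None
print(assert_post_login(mixed_up, "monitor-test-user"))  # mix-up message
```

A generic "Dashboard" assertion passes in both cases; only the user-specific one catches the mix-up.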
3. Captcha vendor outage breaking form submission
The login or signup form embeds a captcha widget from Cloudflare Turnstile, hCaptcha, or reCAPTCHA. When the captcha vendor has elevated latency or a partial outage, the widget either fails to render or the verification token never returns. The login form HTML still loads with HTTP 200.
Cloudflare's June 12, 2025 outage explicitly listed Turnstile and Challenges among affected services. See the same Cloudflare post-mortem. hCaptcha's public outage history on StatusGator's tracking page shows recurring elevated-latency events that produce the same symptom.
The HTTP probe saw 200 on the form URL. The user saw a captcha that never resolved.
A browser login monitor waits for the captcha widget to render and for verification to complete before clicking submit. If the widget never resolves within the step timeout, the run fails with the captcha frame visible in the screenshot.
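The wait-then-fail behavior is a bounded poll. A minimal sketch with the captcha condition stubbed out; in a real monitor the condition would be "widget resolved and hidden token field populated":

```python
def wait_for(condition, timeout_ticks=10):
    """Poll a condition until it returns a truthy value or the step
    times out. Ticks stand in for real wall-clock polling intervals."""
    for tick in range(timeout_ticks):
        value = condition(tick)
        if value:
            return value
    raise TimeoutError("captcha widget never resolved within step timeout")

# Healthy vendor: the widget resolves on the fourth poll
token = wait_for(lambda t: "tok-abc" if t >= 3 else None)

# Vendor outage: the condition never becomes truthy
try:
    wait_for(lambda t: None)
except TimeoutError as exc:
    outcome = str(exc)
print(outcome)
```

The timeout, not a status code, is the signal; the screenshot of the unresolved frame is the evidence.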
4. Cookie domain mismatch on subdomain split [REPRODUCIBLE-ONLY]
The application is hosted on a subdomain such as app.example.com and signs cookies for .example.com. A configuration change switches the cookie scope to host-only. Existing sessions become orphaned and new logins from a sibling subdomain cannot see the cookie.
The closest documented case is the Backstage cross-subdomain refresh-token loop in Backstage Issue #28126, where a prod cookie on a sibling subdomain trapped users in an infinite redirect when signing in to dev.
The HTTP probe on the root domain saw 200 because the marketing page rendered. The user saw a redirect loop or a still-anonymous navigation bar.
A browser login monitor opens a fresh Chromium context, hits the login URL, types credentials, and asserts on a post-login element. If the cookie never lands or never gets sent on the redirect, the assertion fails and a screenshot of the redirect loop is captured.
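The underlying rule is RFC 6265 domain matching: a cookie with a Domain attribute is sent to the domain and all its subdomains, while a host-only cookie is sent only to the exact host that set it. A minimal sketch of the two scoping modes; domain and host names are hypothetical:

```python
def cookie_sent_to(cookie_domain, host_only, request_host):
    """Simplified RFC 6265 domain-match for a single cookie."""
    domain = cookie_domain.lstrip(".")
    if host_only:
        # Host-only cookies match the exact setting host and nothing else
        return request_host == domain
    # Domain cookies match the domain itself and every subdomain
    return request_host == domain or request_host.endswith("." + domain)

# Cookie scoped to .example.com: visible on the sibling subdomain
print(cookie_sent_to(".example.com", False, "app.example.com"))     # True
# After a config change makes the cookie host-only on auth.example.com:
print(cookie_sent_to("auth.example.com", True, "app.example.com"))  # False
```

The second case is exactly the orphaned-session state: the cookie exists in the browser but is never sent where the app expects it.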
5. CSRF token mismatch after deploy [REPRODUCIBLE-ONLY]
A deploy rotates the secret used to sign CSRF tokens, restarts the session store, or changes the session driver. Existing browser sessions hold a token signed with the old secret. The next form submission fails CSRF validation and the user gets bounced back to login.
The failure mode is widely documented. Laravel Issue #9531 documents CSRF token mismatch after session-store changes locking users out. IBM tracker IV91742 documents users being logged out of a dashboard after CSRF mismatch on widget interactions.
The HTTP probe saw 200 on the login page HTML. The returning user saw their submitted form rejected with a session error.
A browser login monitor handles CSRF token negotiation the way a real user would. A secret rotation breaks the next monitored login cycle and surfaces immediately on the following run.
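Why a secret rotation invalidates in-flight tokens is easiest to see with a signed-token sketch. This assumes an HMAC-style CSRF scheme, which is one common design, not any specific framework's implementation:

```python
import hmac
import hashlib

def sign_csrf(session_id, secret):
    """Issue a CSRF token bound to the session and the signing secret."""
    return hmac.new(secret, session_id.encode(), hashlib.sha256).hexdigest()

def verify_csrf(session_id, token, secret):
    """Constant-time check of a submitted token against the current secret."""
    return hmac.compare_digest(sign_csrf(session_id, secret), token)

old_secret, new_secret = b"secret-v1", b"secret-v2"
token = sign_csrf("sess-123", old_secret)  # issued before the deploy

print(verify_csrf("sess-123", token, old_secret))  # True: pre-deploy
print(verify_csrf("sess-123", token, new_secret))  # False: after rotation
```

Every browser holding a pre-deploy token fails its next form submission, which is exactly what a monitored login cycle reproduces.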
6. JS bundle 404 with HTML 200
A new release invalidates a hashed JS bundle filename. The HTML still references the old hash because the HTML response was cached at the edge. The browser fetches the old bundle, gets a 404, and the page renders as a blank shell.
Workbox Issue #1528 documents the exact pattern: a cached index.html references hashed assets that no longer exist after a new build is released.
The HTTP probe saw 200 on the page URL with a valid HTML body. The user saw a blank page with a console full of 404s.
A browser login monitor executes the script tags. A 404 on the main bundle leaves the page with no event handlers. The login click does nothing, the post-login assertion never resolves, and the screenshot shows the bare HTML shell.
7. Service worker stuck on broken cached version
A previous build registered a service worker that aggressively caches the app shell. The next build is backend-incompatible, but returning visitors are served the old shell from the service-worker cache before the update lifecycle completes. Visitors see the old broken UI; first-time visitors and your HTTP probe see the new working version.
Angular Issue #43163 documents the production failure mode where users get a broken cached shell that survives reload, with recovery requiring the kill-switch trick on ngsw.json.
The HTTP probe saw 200 on the new shell. The returning user saw the old shell wired to a backend that no longer matches.
A browser login monitor uses fresh contexts by default, so it sees the new build. To catch service-worker-stuck visitors, run a second monitor with persistent storage enabled. The failure surfaces as a login-form-handler not bound to the new API.
8. Plugin auto-update silently swaps to a different plugin slug
A platform allows automatic plugin updates. An update silently replaces one plugin with a forked or renamed one. Code that referenced the old plugin's internals breaks. Fatal PHP errors land after the 200 status has already been committed, so the body arrives partial or blank under HTTP 200.
The clean public anchor is the ACF to Secure Custom Fields auto-switch on October 12, 2024, which affected sites running Advanced Custom Fields with auto-updates enabled. The full details live in our ACF to Secure Custom Fields case study.
The HTTP probe saw 200 on the page URL. The user saw a partially rendered admin screen with the navigation missing.
A browser login monitor renders the actual page and the post-load DOM assertion fails when the body is empty or missing the expected admin nav.
9. Database read-replica lag returning empty results
The application reads from a replica that is lagging or has stopped replicating. Queries return empty result sets, not errors. A login API returns 200 with a null user object. The frontend renders an empty dashboard or a no-data state.
The class is documented by every major cloud provider. See AWS RDS read-replica troubleshooting docs and GCP Cloud SQL replication-lag docs.
The HTTP probe saw 200 with a valid JSON body. The user saw a dashboard with no data on an account that should have hundreds of records.
A browser login monitor logs in with a known test user that has known post-login content. If the replica returned empty for that user's record, the dashboard assertion fails.
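The same idea in API form: assert on the shape of the data, not the status code. A minimal sketch with hypothetical response bodies:

```python
import json

def check_login_response(status, body):
    """Return an error string, or None if the response is healthy.
    A replica-lag outage returns 200 with an empty payload, so the
    check must look past the status code."""
    if status != 200:
        return "http error"
    data = json.loads(body)
    if not data.get("user") or not data.get("records"):
        return "200 OK but empty payload: possible replica lag"
    return None

healthy = json.dumps({"user": {"id": 7}, "records": [1, 2, 3]})
lagging = json.dumps({"user": None, "records": []})

print(check_login_response(200, healthy))  # None
print(check_login_response(200, lagging))  # replica-lag message
```

Both responses are valid JSON under a 200; only the data-shape assertion separates them.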
10. Internal automation deletes data; monitoring blind to it
A control-plane script with the wrong inputs deletes customer data via a sanctioned workflow path. Internal monitoring sees the workflow as normal because it is a normal workflow. Customer-facing surfaces still respond with 200, but they 404 or empty-state for affected accounts.
Per Atlassian's April 2022 post-incident review, internal monitoring did not detect the issue because the sites were deleted via a standard workflow. The first impacted customer opened a support ticket at 07:46 UTC on April 5th, roughly 8 minutes after the deletion script started at 07:38 UTC. The incident also has a separate SLA-credit dimension, covered in our SLA-credit lens on the same Atlassian outage.
Atlassian's internal monitoring saw a normal workflow. The affected customer saw 404s on a workspace that no longer existed for them.
A browser login monitor that signs into the affected tenant flips to failed within one cycle, because the test account's tenant is gone and the login redirects to a workspace-not-found screen.
What a browser login monitor sees that HTTP doesn't
A browser login monitor, like the one Velprove runs on every tier including Free, executes the page in real headless Chromium. It loads the URL, waits for JavaScript to execute, fills the login form, clicks submit, and asserts that a known post-login element actually rendered. The 10 patterns above all surface as an assertion timeout, a screenshot of the broken state, or a redirect chain that dies at a named step. A status-code probe sees none of that.
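Conceptually, that flow is a sequence of named steps where the first failure is reported by step name. A minimal sketch of the step-runner shape; this is an illustration with stubbed steps, not Velprove's implementation:

```python
def run_login_monitor(steps):
    """Execute named steps in order; on the first failure, report the
    step name (where a real monitor would also attach a screenshot)."""
    for name, step in steps:
        try:
            if not step():
                return {"status": "failed", "step": name}
        except Exception as exc:
            return {"status": "failed", "step": name, "error": str(exc)}
    return {"status": "passed"}

# Simulated run where the post-login assertion never resolves
steps = [
    ("load login page", lambda: True),
    ("fill credentials", lambda: True),
    ("click submit", lambda: True),
    ("assert post-login element", lambda: False),  # the broken state
]
print(run_login_monitor(steps))
```

The report names the step that died, which is exactly the information a flat 200/non-200 probe throws away.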
Velprove includes a free browser login monitor on the Free plan. Every tier monitors from 5 global regions. Concretely:
- Free, $0: 10 monitors total, of which up to 1 can be a browser login monitor (15-minute interval). HTTP monitors run on a 5-minute interval. Multi-step API monitors up to 3 steps. Email alerts. No credit card required.
- Starter, $19/mo: 25 monitors total, of which up to 3 can be browser login monitors (10-minute interval). HTTP monitors run on a 1-minute interval. Multi-step API monitors up to 5 steps. Plus Slack, Discord, Teams, and webhook alerts.
- Pro, $49/mo: 100 monitors total, of which up to 10 can be browser login monitors (5-minute interval). HTTP monitors run on a 30-second interval. Multi-step API monitors up to 10 steps. Plus PagerDuty.
If you want a structured walkthrough before signing up, our post on deciding if a browser login monitor is the right tool asks seven binary questions and lands on a yes or no.
Why this keeps happening (and the broader case)
The structural reason is simple: HTTP probes confirm that a server answers with a status code, not that the application works. Modern applications fail at the application layer in ways that do not propagate to status codes. The failure surface has moved up the stack while the dominant monitor type stayed at the status-code check. Every pattern above is a different version of the same gap. The broader case for why HTTP monitors miss outages covers the underlying argument; this post is the receipts.
The other half of the gap is the API layer. Patterns 2 (token mixup) and 9 (replica lag) both have an API-shape variant where the broken response happens between two services with no human in the loop. For that surface, our guide on multi-step API monitoring catches token-refresh failures covers chained-call assertions that go beyond a single status check.
A practical monitoring stack that catches all 10 patterns
Three layers, in order of cost and depth.
- HTTP monitors with content assertions. Cheap, broad, and the right floor for every public URL. Add a keyword assertion against rendered content so a 200 with the wrong body does not pass.
- Multi-step API monitors on the auth and critical paths. Velprove includes 3-step on Free, 5-step on Starter, and 10-step on Pro. A chained call proves the auth-token-call-response shape across boundaries.
- One browser login monitor on the actual login flow. Velprove's Free tier includes one, and it catches the patterns HTTP and API checks cannot reach.
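The first layer's keyword assertion is the cheapest fix of all. A minimal sketch; the keyword and response bodies are hypothetical:

```python
def http_check(status, body, keyword):
    """Content assertion: a 200 whose body lacks the expected
    keyword still counts as DOWN."""
    if status != 200:
        return "down: bad status"
    if keyword not in body:
        return "down: 200 with wrong body"
    return "up"

print(http_check(200, "<h1>Pricing</h1>", "Pricing"))                 # up
print(http_check(200, "<h1>Error establishing DB</h1>", "Pricing"))   # down
```

Two lines of assertion turn a status probe into a body probe, which already closes part of the gap described above.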
For SaaS specifically, the recommended setup is one browser login monitor on your sign-in flow plus an HTTP layer for breadth. Start with the browser login monitor for SaaS page for the SaaS-shaped configuration; sign-up takes under five minutes.
Frequently Asked Questions
What happened during the Cloudflare June 12 2025 Workers KV outage?
Cloudflare's underlying storage provider for Workers KV failed for 2 hours and 28 minutes. The outage cascaded to Cloudflare Access (100% identity-login failure), Turnstile, WARP, Workers AI, and parts of the dashboard. Marketing properties served by Cloudflare's edge stayed online throughout because they don't depend on KV. Per Cloudflare's engineering blog write-up of the June 12 2025 outage, every authenticated surface failed while every static surface stayed green.
Why did Cloudflare Access fail while marketing sites stayed up?
Cloudflare Access stores identity tokens and session state in Workers KV. When KV's underlying storage went down, Access could not validate identities and rejected 100% of login attempts. Marketing sites on the same Cloudflare network continued serving cached HTML and assets from the edge. An HTTP monitor probing a marketing URL saw 200 OK throughout. A browser login monitor running through the Access flow would have failed at the IdP redirect.
What did Atlassian's April 2022 post-incident review say about internal monitoring?
Internal monitoring did not detect the issue because the sites were deleted via a standard workflow, per Atlassian's published review. The first impacted customer opened a support ticket at 07:46 UTC on April 5th, roughly 8 minutes after the deletion script started at 07:38 UTC. A control-plane script with the wrong inputs deleted hundreds of customer tenants through a sanctioned code path. Monitoring saw a normal workflow. The customer surface returned 404. For the deeper SLA-credit lens, see our SLA vs SLO vs SLI guide.
Posted by Velprove.