Platform

Monitor a Cloudflare Workers + Pages Site: KV, R2, D1, DO

13 min read

Quick rundown: A 200 OK on your Cloudflare Pages homepage proves the edge served HTML. It does not prove Workers KV is reachable, R2 is serving objects, D1 is honoring sequential consistency, or a Durable Object pinned to ENAM is healthy. The Nov 18 2025 global outage broke Turnstile-gated login flows for over two hours in two separate windows, and the dashboard itself was down while customers tried to debug. Velprove probes the Cloudflare platform with multi-step API monitors that chain write-then-read assertions across KV, R2, and D1, plus multi-region browser login monitors that catch what HTTP pings miss. The free plan covers it, no credit card required, and the multi-step monitor type is the same one we walk through in section six.

1. Why HTTP monitoring misses the Cloudflare platform surface

On November 18 2025, a Bot Management feature file roughly doubled in size after a database permissions change and propagated network-wide. Core traffic delivery on Cloudflare broke for about two hours and ten minutes starting 11:20 UTC. Recovery had two distinct windows. Turnstile-gated login flows failed from 11:30 to 13:10 UTC, recovered briefly, then failed again from 14:40 to 15:30 UTC. The Cloudflare dashboard itself was unreachable during portions of the event, so customers debugging their own apps could not log in to read their observability data. Full timeline: November 18 2025 post-mortem.

If your monitor was a single GET on https://example.pages.dev/ that day, one of two things happened. It returned the cached marketing HTML and went green throughout, or it returned a 5xx during the network event and recovered when the network recovered. Either way, you assumed everything was fine by 15:00 UTC. The second failure window was invisible because your dashboard already looked clean. Meanwhile real users could not sign in.

This is the structural problem with treating Cloudflare's developer platform as a single thing. It is not. Workers, Pages, Workers KV, R2, D1, Durable Objects, Queues, Workflows, Cron Triggers, Hyperdrive, and Workers AI are eleven different failure domains stitched into one product surface. The edge runtime can be healthy while KV's central store is down. R2 can serve reads while writes fail. A Durable Object pinned to Eastern North America can be unreachable while every other region returns 200 OK. None of those failure modes change what curl https://example.pages.dev/ returns.

If you are running on Cloudflare Pages today, this is the operational walkthrough. If you are an indie hacker on the free tier figuring out which monitor to set up first, also read Velprove's free-tier monitoring guide for indie hackers, which covers the provider comparison stack.

The rest of this post maps the Cloudflare platform into five failure tiers, builds eight probes that cover the subsystems most apps actually depend on, talks through what Cloudflare's own observability does and does not catch, and walks through setting up the first probe in Velprove.

2. The Cloudflare platform in five failure tiers

Group the platform by consistency model and failure character. Five tiers cover the surface area without listing every product.

Tier 1, Edge runtime. The Worker process plus the Cloudflare network layer that routes a request into your code. Failures show up as Cloudflare 1xxx error codes. A 1101 means your Worker threw an uncaught exception. A 1102 means it exceeded the CPU time limit or hit the 128 MB memory ceiling. A 1015 means a Cloudflare WAF or rate-limiting rule blocked the request, often a customer-side rule misfiring on monitor traffic. These look like generic 5xx to a naive uptime check. They are specific signals about what part of the pipeline broke. Custom domain SSL on Pages belongs here too. If a Worker or an Access policy blocks /.well-known/acme-challenge/*, the cert fails to renew and the domain serves a TLS error instead of HTML, which overlaps with the SSL certificate expiry monitoring guide.

Tier 2, Eventually consistent storage. Workers KV is the canonical entry. KV reads hit a POP-local edge cache; cold reads and all writes go to the central data store. When the central store fails, hot reads on already-cached keys keep returning, which is why marketing pages stay green while signup flows error. The textbook receipt is the June 12 2025 outage, which lasted two hours and 28 minutes and cascaded into Access (100% identity-login failure), Turnstile, WARP, Workers AI, Zaraz, Stream, Images, and AutoRAG. An uptime check on a hot path lies to you about KV health.

Tier 3, Object and SQL storage. R2 and D1. R2 is S3-compatible object storage; D1 is managed SQLite with global read replication and a Sessions API for sequential consistency. R2 had at least two operational incidents in 2025 that broke gateways without breaking marketing pages. The March 21 2025 incident failed 100% of writes and roughly 35% of reads for one hour and seven minutes after new credentials were deployed to a dev instance and the old credentials were deleted. The February 6 2025 incident took the gateway down for 59 minutes after a phishing-remediation action misfired. D1's failure mode is subtler: read replicas can lag the primary by roughly 350 ms in practice, so a write followed by a read without the Sessions bookmark can land on a stale replica and silently return the wrong row.

Tier 4, Strongly consistent regional. Durable Objects and Hyperdrive. A Durable Object lives in exactly one location. When that location degrades, every request to that object fails, even from healthy POPs in other geographies. Cloudflare's public status history includes incidents such as "Durable Objects startup errors in ENAM" and "Increased Durable Object Errors in AMS." Hyperdrive sits between a Worker and an external Postgres or MySQL origin, pooling connections. Free tier holds about 20 connections; paid tier defaults to about 100. A long-running transaction or a leaky Durable Object holding the pool open exhausts it and surfaces as origin-connect errors that look nothing like a Worker health issue. A one-region monitor never sees a one-region failure.

Tier 5, Scheduled and async. Cron Triggers, Queues, and Workflows. Cron fires within the minute but not at a precise second, and there are documented bugs where a service binding with a named entrypoint silently falls through to the default fetch handler when invoked from a scheduled handler. Queues retain messages for a configured window and silently drop them when consumers fall behind for too long. None of these subsystems have an external "did the job actually run?" signal. You manufacture one.

3. Eight probes for eight subsystems

Each probe below is a configuration on Velprove's multi-step API monitor or browser monitor. The multi-step monitor chains response variables across steps and runs body assertions on each step. The multi-step API monitoring guide walks through the chaining syntax. None of these probes require new product features. They are configuration.

Probe 1, KV write-then-read. Two steps. Step 1 POSTs to a Worker route like /__monitor/kv-write that generates a UUID + timestamp server-side, writes them into KV, and returns them in the JSON response. Velprove extracts $.key and $.value. Step 2 GETs /__monitor/kv-read?key={{key}} and asserts the response body contains {{value}}. Two steps fit the free plan's three-step cap with room to spare. This catches central-store failures (the June 12 class) because the write either fails or the read returns nothing. If your Worker also exposes a /healthz endpoint that pings KV internally, the API health check patterns guide covers what to return.

Probe 2, R2 object round-trip. Two steps again. Step 1 PUTs a small object (under 1 KB is fine) into a monitor-only R2 bucket via an authenticated Worker route. Step 2 GETs the object and asserts the body matches. This catches gateway failures of the March 21 and February 6 class. The asymmetry of the March 21 incident (100% of writes, 35% of reads) is the reason both directions matter. A read-only probe would have missed the writes-failing window.
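
A minimal sketch of the two route branches behind this probe, assuming the section-six monitor Worker gains an R2 bucket bound as MONITOR_R2 (the binding, route names, and cleanup strategy are illustrative; a bucket lifecycle rule can expire old monitor objects):

if (url.pathname === "/__monitor/r2-write" && request.method === "PUT") {
  // Write a tiny object; its body doubles as the value Step 2 asserts on.
  const key = `monitor/${crypto.randomUUID()}`;
  const value = Date.now().toString();
  await env.MONITOR_R2.put(key, value);
  return Response.json({ ok: true, key, value });
}

if (url.pathname === "/__monitor/r2-read" && request.method === "GET") {
  const key = url.searchParams.get("key");
  if (!key) return new Response("missing key", { status: 400 });
  const object = await env.MONITOR_R2.get(key);
  if (object === null) return new Response("not found", { status: 404 });
  // Return the stored body so the monitor's {{value}} assertion can match it.
  return new Response(await object.text());
}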

Probe 3, D1 sequential-consistency probe. Two or three steps. Step 1 issues an INSERT through a Worker route that returns the D1 Sessions bookmark in the JSON response; Velprove extracts it via JSONPath $.bookmark. Step 2 issues a SELECT that forwards the bookmark as {{bookmark}} and asserts the inserted row is visible. To also assert that the row is not visible on a read without the bookmark, add a third step. Three steps still fits the free plan's three-step ceiling. This proves D1 is honoring sequential consistency and catches read-replica lag that a one-shot SELECT silently absorbs.
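
Sketched as two more branches on the same monitor Worker, assuming a D1 binding named MONITOR_DB and a small monitor_probe table (both names are placeholders); the calls follow D1's documented withSession and getBookmark Sessions API:

if (url.pathname === "/__monitor/d1-write" && request.method === "POST") {
  // Route the write to the primary and hand the bookmark back so Velprove
  // can extract it via $.bookmark (and the value via $.value).
  const session = env.MONITOR_DB.withSession("first-primary");
  const value = crypto.randomUUID();
  await session
    .prepare("INSERT INTO monitor_probe (value, created_at) VALUES (?1, ?2)")
    .bind(value, Date.now())
    .run();
  return Response.json({ ok: true, value, bookmark: session.getBookmark() });
}

if (url.pathname === "/__monitor/d1-read" && request.method === "GET") {
  const bookmark = url.searchParams.get("bookmark");
  const value = url.searchParams.get("value");
  if (!bookmark || !value) return new Response("missing params", { status: 400 });
  // A session constrained by the bookmark only reads from replicas at least
  // as fresh as the write above, so a lagging replica cannot satisfy it.
  const session = env.MONITOR_DB.withSession(bookmark);
  const row = await session
    .prepare("SELECT value FROM monitor_probe WHERE value = ?1")
    .bind(value)
    .first();
  if (!row) return new Response("stale read", { status: 404 });
  return Response.json({ ok: true, value: row.value });
}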

Probe 4, Durable Object regional probe. A Velprove monitor probes from one region. To cover three geographies, create three monitors (North America, Europe, Asia-Pacific) all pointing at the same Worker route that fans out into your Durable Object. If the DO is pinned to ENAM and ENAM degrades, your NA monitor fails and your EU and APAC monitors pass. That asymmetry is the signal. Cloudflare's docs explain the single-region affinity of Durable Objects.
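
The fan-out route itself can stay tiny. A sketch, assuming a Durable Object namespace bound as MONITOR_DO whose own fetch handler answers a /ping with a health payload:

if (url.pathname === "/__monitor/do-ping") {
  // idFromName pins every probe to the same object, so the request always
  // traverses the object's home location no matter which POP served it.
  const id = env.MONITOR_DO.idFromName("monitor-probe");
  const stub = env.MONITOR_DO.get(id);
  return stub.fetch("https://do/ping");
}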

Probe 5, Cron freshness assertion. One step plus a discipline. Your Cron Trigger writes a last_run_at ISO timestamp to KV on every successful run. Deploy a Worker route /__monitor/cron-freshness that reads last_run_at, computes the age, and returns 200 if the cron is fresh and 503 otherwise. Velprove pings the freshness endpoint and asserts the HTTP response. Configure a Status Code assertion of 200, or a Body Contains assertion on a known token. This catches the silent-skip class without integrating a heartbeat library: the Worker route flips to 503 the moment the cron stops writing.
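
A sketch of both halves, assuming the MONITOR_KV binding from section six and an hourly cron (the key name and the two-times-interval threshold are illustrative):

// In the Worker that owns the Cron Trigger: record a heartbeat after the real work.
async scheduled(controller: ScheduledController, env: Env, ctx: ExecutionContext) {
  // ... the actual scheduled job runs here ...
  await env.MONITOR_KV.put("cron:last_run_at", new Date().toISOString());
},

// In the monitor Worker's fetch handler: turn heartbeat age into an HTTP status.
if (url.pathname === "/__monitor/cron-freshness" && request.method === "GET") {
  const lastRun = await env.MONITOR_KV.get("cron:last_run_at");
  const ageMs = lastRun ? Date.now() - Date.parse(lastRun) : Infinity;
  const fresh = ageMs < 2 * 60 * 60 * 1000; // twice an hourly interval
  return Response.json(
    { ok: fresh, last_run_at: lastRun, age_ms: ageMs },
    { status: fresh ? 200 : 503 }
  );
}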

Probe 6, Hyperdrive origin probe. One step. A Worker route hits Hyperdrive with a SELECT 1 and returns the round-trip latency. The Velprove HTTP monitor uses a body assertion to confirm the response contains the expected token and a response-time threshold to alert on slow pool acquisition. When the pool is exhausted, this route returns origin-connect errors immediately while your homepage continues returning HTML normally.
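
A sketch of the route, assuming a Hyperdrive binding named HYPERDRIVE and the postgres.js driver; any Postgres client that accepts a connection string works, and depending on the driver you may need the nodejs_compat flag in wrangler.toml:

import postgres from "postgres";

// Inside the fetch handler:
if (url.pathname === "/__monitor/hyperdrive") {
  const sql = postgres(env.HYPERDRIVE.connectionString, { max: 1 });
  const started = Date.now();
  try {
    await sql`SELECT 1 AS ok`;
    return Response.json({ ok: true, latency_ms: Date.now() - started });
  } catch (err) {
    // Pool exhaustion or an unreachable origin lands here long before the
    // homepage shows anything wrong.
    return Response.json({ ok: false, error: String(err) }, { status: 503 });
  } finally {
    await sql.end();
  }
}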

Probe 7, Workers AI rate-limit and cold-start probe. Workers AI bills in neurons. The free tier caps at 10,000 neurons per 24 hours and resets at 00:00 UTC. Once you hit the limit, requests return error code 4006 with no graceful degradation. There is a known dashboard bug where usage shows 0 of 10,000 while requests still return 4006, so the in-dashboard signal is unreliable. The probe is a tiny inference request from a monitor-only route, with a body assertion on the success-path JSON shape and a separate body assertion on the 4006 error string. If the assertion catches 4006 in production, you know the rate limit is the actual cause rather than a vague platform failure.
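
A sketch of the route, assuming an AI binding named AI; the model name is a placeholder for whichever small, cheap model you already run:

if (url.pathname === "/__monitor/ai") {
  try {
    const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      prompt: "ping",
      max_tokens: 1,
    });
    return Response.json({ ok: true, result });
  } catch (err) {
    // Surface the error text so a body assertion can match on 4006 and
    // distinguish the neuron cap from a genuine platform failure.
    return Response.json({ ok: false, error: String(err) }, { status: 503 });
  }
}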

Probe 8, Pages auth login. Browser monitor. Sign in to your Pages-served app and assert post-login content rendered. Velprove's browser monitor uses headless Chromium and identifies itself as VelproveBot/1.0 (Uptime Monitor) in the User-Agent header, so a Cloudflare Turnstile widget on your production login form will challenge or block it. Point the monitor at a dedicated monitor-only login URL (for example /login/monitor) that renders one of Cloudflare's Turnstile test sitekeys instead of your production sitekey. The test sitekey 1x00000000000000000000AA always passes verification, so the monitor signs in cleanly against a dedicated test account while your real /login keeps the production Turnstile widget for human visitors. This probe catches the Nov 18 2025 second-window pattern, the June 12 Access identity-login failure pattern, and any regression on your own form. On the free plan you get one browser login monitor on a 15-minute interval, enough to flag a multi-hour outage and a login regression within one window. If your Pages app uses Next.js running on Workers via OpenNext, the same browser pattern applies; the framework underneath does not change the probe.
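
One way to serve the monitor-only page is a Worker (or Pages Functions) route that renders the test sitekey; the markup below is a minimal sketch, and whatever handles its POST should verify tokens against the matching Turnstile test secret key rather than your production secret:

if (url.pathname === "/login/monitor" && request.method === "GET") {
  const html = `<!doctype html>
<html><body>
  <form method="POST" action="/login/monitor">
    <input name="email" type="email" autocomplete="username" />
    <input name="password" type="password" autocomplete="current-password" />
    <!-- Cloudflare's documented always-passing test sitekey -->
    <div class="cf-turnstile" data-sitekey="1x00000000000000000000AA"></div>
    <button type="submit">Sign in</button>
  </form>
  <script src="https://challenges.cloudflare.com/turnstile/v0/api.js" async defer></script>
</body></html>`;
  return new Response(html, { headers: { "content-type": "text/html; charset=utf-8" } });
}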

These eight probes cover KV, R2, D1, Durable Objects, Cron, Hyperdrive, Workers AI, and Pages auth. Together they replace a single GET on the homepage with eight signals that correspond to subsystem health.

4. What Cloudflare's own observability covers, and where it stops

Cloudflare ships a real observability surface, and an honest monitoring post says so. The Workers Observability dashboard unifies logs, metrics, and a Query Builder beta inside the Cloudflare dashboard. Workers Logs is included on Free and Paid plans; billing for retention beyond the included defaults began April 21 2025. wrangler tail streams real-time logs from a deployed Worker on the command line. Tail Workers consume execution events from other Workers but require Paid or Enterprise plans. Logpush ships Workers Trace Event Logs to S3, Datadog, or another sink, and is Enterprise-leaning in practice.

For day-to-day Worker development this is enough. You can find a 1101 in seconds, query for elevated error rates, and trace a slow request. Three gaps remain.

First, observability tools answer "what did my Worker do?" They cannot answer "what did my user see when KV failed before the request reached the Worker?" A request that 522'd at the edge or that broke on Turnstile rendering never reaches your code. Your Tail Worker has no event to consume. Your Logs query has no row to return.

Second, in-dashboard observability is itself a single point of failure during the events that matter most (see the Nov 18 and June 12 timelines above). If your only signal lives in the place that just went down, you have no signal during the incident.

Third, all of these tools probe from inside Cloudflare's network. They cannot witness a regional POP problem that affects users on one continent. They cannot tell you a Durable Object pinned to ENAM is broken from Sydney. They cannot tell you Turnstile is failing on the rendered login page from a user perspective.

The Cloudflare-specific point: the network-internal observability is good and you should still use it. Use Workers Observability for debugging your code and external multi-region monitors for proving the platform is alive end to end. The same structural argument applies to any single-network observability stack and is unpacked in ten silent outage patterns across the web.

5. Cloudflare versus Vercel: when this post is the right one

This post is not a Cloudflare-vs-Vercel deployment comparison. If you are already running on Vercel and want the platform monitoring playbook for that stack, read the Vercel platform monitoring guide. The structure is parallel: failure tiers, subsystem probes, observability gaps, monitor setup. The receipts and probes are different.

Short routing rule: read this post if your stack uses Workers KV, R2, D1, Durable Objects, Cloudflare Cron Triggers, Hyperdrive, Workers AI, or any combination. Read the Vercel post if your stack uses Vercel KV (now Upstash Redis after the December 2024 sunset), Vercel Blob, Vercel Postgres (Neon partner), or Vercel Cron Jobs. Cloudflare's storage is first-party with documented consistency models you can write monitor logic against. Vercel's storage in 2026 is mostly partner-mediated, so the monitoring story passes through the partner's surface.

6. Setting up your first Workers + Pages monitor in Velprove

End-to-end walkthrough for Probe 1, the KV write-then-read. The pattern generalizes to R2 and D1 with the same shape.

Step A. Deploy the monitor Worker route. Add a small Worker that exposes /__monitor/kv-write and /__monitor/kv-read. Use the current ES module export shape:

export interface Env {
  MONITOR_KV: KVNamespace;
  MONITOR_TOKEN: string;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);

    // Shared-secret guard so this isn't world-writable.
    if (request.headers.get("x-monitor-token") !== env.MONITOR_TOKEN) {
      return new Response("forbidden", { status: 403 });
    }

    if (url.pathname === "/__monitor/kv-write" && request.method === "POST") {
      // Generate key + value server-side so Velprove can extract them
      // from this response and forward to the read step.
      const key = crypto.randomUUID();
      const value = Date.now().toString();
      await env.MONITOR_KV.put(key, value, { expirationTtl: 300 });
      return Response.json({ ok: true, key, value });
    }

    if (url.pathname === "/__monitor/kv-read" && request.method === "GET") {
      const key = url.searchParams.get("key");
      if (!key) return new Response("missing key", { status: 400 });
      const value = await env.MONITOR_KV.get(key);
      if (value === null) return new Response("not found", { status: 404 });
      return Response.json({ ok: true, key, value });
    }

    return new Response("not found", { status: 404 });
  },
};

Bind a KV namespace called MONITOR_KV in your wrangler.toml and set the MONITOR_TOKEN secret with wrangler secret put MONITOR_TOKEN. Deploy with wrangler deploy. The 300-second TTL keeps the namespace from accumulating monitor garbage.
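
The binding stanza is short. A sketch of the relevant wrangler.toml lines, with placeholder values for the Worker name, compatibility date, and the namespace id you get back when you create the namespace:

name = "monitor-worker"
main = "src/index.ts"
compatibility_date = "2025-01-01"

[[kv_namespaces]]
binding = "MONITOR_KV"
id = "<your-kv-namespace-id>"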

Pick the Multi-Step tile on the new-monitor screen. This is the same type that chains response variables across steps.
[Screenshot: the Velprove new-monitor screen with the Multi-Step tile selected]

Step B. Create a multi-step API monitor in Velprove. Sign up at /signup if you have not already (the free plan covers this end to end). Create a new monitor and choose the Multi-Step type. The screenshots below substitute https://httpbin.org/anything/kv-write for the Worker URL because *.workers.dev does not resolve A records and fails Velprove's URL validator at create time; substitute your real Worker URL (https://your-worker.your-account.workers.dev/__monitor/kv-write) when you configure yours.

Step 1 is a POST to https://your-worker.example.com/__monitor/kv-write with the x-monitor-token header set, an empty JSON body {}, and a body assertion that the response contains the substring "ok":true (no space between the colon and true, since Response.json emits compact JSON). In the same step, add two response extracts: key from JSONPath $.key, and value from JSONPath $.value.

Step 1 POSTs to kv-write, asserts the response contains the ok and true substring, and extracts key and value for the next step.
[Screenshot: Step 1 in the multi-step editor, showing the POST to kv-write, the body assertion, and the key and value extracts]

Step 2 is a GET to https://your-worker.example.com/__monitor/kv-read?key={{key}} with the same auth header and a body assertion that the response contains {{value}}. The interpolation pulls the Worker-generated UUID and timestamp from Step 1 into Step 2, so every run proves a round-trip through the KV central store.

Step 2 GETs kv-read with the key from Step 1 interpolated into the URL, then asserts the response body contains the matching value.
[Screenshot: Step 2 in the multi-step editor, showing the templated key in the URL and the body assertion on the templated value]

Step C. Choose a region. The free plan lets you run from any of Velprove's 5 global regions, one region per monitor. Pick the geography closest to where the bulk of your users are. To cover three geographies for a Durable Object probe, create three monitors and assign one region each, which fits inside the free plan's ten-monitor limit.

Step D. Configure interval and alerts. Free plan multi-step API monitors run at a five-minute minimum interval. Alerting is email-only. For Slack, Discord, Microsoft Teams, or webhooks, the Starter plan at nineteen dollars per month adds those channels; the Pro plan at forty-nine dollars per month adds PagerDuty.

Step E. Extend the pattern to other subsystems. The same multi-step shape covers R2 (PUT then GET on an object) and D1 (INSERT returning a Sessions bookmark, then SELECT forwarding the bookmark). For Probe 8 (Pages auth login), create a browser login monitor; the free plan gives you one on a 15-minute minimum interval. Point it at a dedicated monitor-only login URL that renders Cloudflare's 1x00000000000000000000AA Turnstile test sitekey. The three-step ceiling on the free plan is enough for the D1 sessions probe. The Starter plan raises the multi-step ceiling to five steps and the Pro plan to ten.

That is the whole loop. Deploy a small monitor Worker, configure two steps in Velprove, choose a region close to your users, and you have a probe that proves KV is alive end to end rather than a probe that proves your homepage HTML cached. Sign up at /signup and the free plan covers this with no credit card required.

Frequently Asked Questions

How do you monitor a Cloudflare Workers and Pages site beyond HTTP 200?

You add probes that exercise the platform layer rather than only the page layer. A multi-step API monitor that writes a key into Workers KV and reads it back proves the central store is reachable. A second multi-step monitor that PUTs and GETs an R2 object proves the gateway is healthy. A third that inserts and selects through D1 with the Sessions bookmark proves replication is consistent. A browser login monitor proves your auth flow actually renders and accepts credentials end to end, pointed at a monitor-only login URL that uses Cloudflare's Turnstile test sitekey since the monitor identifies as a bot. None of these require new product features; they require a small monitor Worker plus monitor configuration that asserts real subsystem behavior rather than HTML delivery.

What does Cloudflare's own Workers Observability cover, and what does it miss?

Workers Observability covers logs, metrics, and queries for requests that reached your Worker. It is excellent for debugging code-level issues and for understanding traffic shape. It misses three things. It does not see requests that broke before they reached your code, including 522 errors at the edge and Turnstile failures on the login page. It lives inside the Cloudflare dashboard, so it is unavailable during the same incidents that take the dashboard down. And it probes from inside Cloudflare's network, so it cannot witness regional POP failures from an outside vantage point. Use it for debugging and external monitors for ground truth.

Can a single uptime check catch a Workers KV outage?

Not reliably. KV is eventually consistent, with a POP-local edge cache in front of a central store. A check that requests a hot, popular key hits the edge cache and returns successfully even when the central store is offline. The June 12 2025 KV outage broke writes and cold reads globally for two hours and 28 minutes while hot reads continued returning cached values, which is why marketing pages stayed green while signup flows failed. The reliable probe is a write-then-read multi-step monitor that forces the central path on every run.

How do you monitor a Cloudflare cron trigger that may have silently skipped?

Build a freshness signal. Have the cron write a last_run_at timestamp to KV on every successful execution. Then run a Velprove HTTP monitor against a Worker route that reads that timestamp, computes the age, and returns 503 if the age is more than two times the cron interval. There is no first-party "did my cron run?" signal you can subscribe to from outside Cloudflare, so this manufactured heartbeat is the cleanest pattern. It also catches the documented case where a service binding with a named entrypoint silently falls through to the default fetch handler when invoked from a scheduled handler.

Should you monitor Durable Objects from a single region?

No, because a Durable Object lives in one region by design. If the DO is pinned to Eastern North America and that region degrades, a monitor running from Europe will see a happy 200 from a fan-out path it never had to traverse, and a monitor running from North America will see the failure. Each Velprove monitor probes from one region. To cover three geographies, create three monitors and assign one region to each. The asymmetric pass/fail pattern across the three is the signal that a regional DO is the problem.

Why is this different from your "Anatomy of a Silent Outage" post?

The cross-platform silent-failure reference takes one storage failure and walks through ten distinct ways it leaks into user experience across any web stack. This post takes one platform, Cloudflare Workers and Pages, and walks subsystem by subsystem through the probes that prove each piece is alive. That post is a "what does silent failure look like" reference; this one is a "how do I monitor Cloudflare specifically" tutorial. Read the first if you want the conceptual map across platforms. Read this one if you are already on Cloudflare and want the configuration walkthrough.

How does Velprove probe Workers KV without a custom Worker?

You cannot probe KV reads or writes without a Worker, because KV is not addressable from outside Cloudflare's network. There is no public KV endpoint to point a monitor at directly. The lightweight pattern is in section six: deploy a tiny monitor-only Worker that exposes /__monitor/kv-write and /__monitor/kv-read guarded by a shared secret, then point a Velprove multi-step API monitor at those two routes. The Worker is about thirty lines of TypeScript and runs inside your existing Workers project, so the operational cost is negligible.

Is Cloudflare Workers + Pages monitoring covered on the free plan?

Yes for the eight probes in this post. The free plan includes ten total monitors, multi-step API monitors with up to three steps each on a five-minute minimum interval, and one browser login monitor on a 15-minute interval. That covers all eight probes; a three-region Durable Object probe uses three of your ten monitor slots. Alerts are email-only on free; Slack, Discord, Teams, and webhook delivery require Starter at nineteen dollars per month, and PagerDuty requires Pro at forty-nine dollars per month. No credit card required to start.

Start monitoring for free

Free browser login monitors. Multi-step API chains. No credit card required.

Start for free