
How to Monitor a Vercel Site at the Platform Layer


Direct answer: To monitor a Vercel site properly, you have to monitor the platform layer, not just the page layer. Vercel Cron Jobs, Vercel Blob, Marketplace storage (Upstash KV, Neon Postgres), and single-region functions all fail in ways a 200 OK on / cannot see. The October 20 2025 outage left customers pinned to iad1 down for nearly two hours longer than customers with multi-region functions. Four Velprove monitor patterns close the platform-layer gap: a Cron freshness probe, a Marketplace storage probe, a non-default region probe, and a browser login monitor on your checkout flow. Each probe can run from any of 5 global regions on the free plan. For Next.js render-layer monitoring (ISR, cold starts, auth-protected routes), read the Next.js render-layer guide instead.

Why a Next.js monitor misses half the Vercel surface

There are two monitoring scopes on a Vercel deployment, and they are not the same scope. The render layer is your Next.js app: ISR pages going stale, cold starts on archived functions, the dashboard route that returns 200 while the page is empty. The Next.js render-layer guide linked at the top of this post covers that surface. If that is what you came for, read it and come back.

The platform layer is everything else Vercel sells you. Cron Jobs that fire on a schedule and silently stop. Vercel Blob storing your user uploads, regional, with no global replication. Edge Config holding feature flags, global, replicated to every PoP. A Marketplace-provisioned Upstash KV or Neon Postgres that your function reads on every request. Function deployments pinned to one region (the default is iad1, US East), with the option on Pro to deploy to up to three regions. None of those failure modes show up on a status-code monitor pointed at /. A 200 OK on your marketing root says nothing about whether your nightly billing Cron fired six hours ago.
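Two of those regional knobs live in a single file. A minimal vercel.json sketch, with an illustrative Cron path and schedule (multiple function regions require Pro or above):

```json
{
  "crons": [
    { "path": "/api/cron/billing", "schedule": "0 5 * * *" }
  ],
  "regions": ["iad1", "fra1", "hnd1"]
}
```

Omit "regions" and every function defaults to iad1, US East.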

The split matters because the two layers fail differently. Render-layer failures degrade the page. Platform-layer failures degrade the product without changing the page. You can keep serving a fresh marketing site for hours after the function that processes signups has been silently dropping invocations. The platform layer is the rest of this post.

The October 20 2025 cascade: 20 hours, single-region penalty

Vercel's own postmortem opens with this line: "Vercel is fully accountable for this incident, even if it's now public that it was triggered by the unexpected outage of AWS us-east-1." That sentence is doing a lot of work. It admits the dependency. It refuses to hide behind it. And it sets up the timeline that every Vercel-hosted SaaS should have read by now.

First alerts paged Vercel's on-call at 06:55 UTC. Traffic rerouting from iad1 started at 07:15 UTC. The cascading failure had taken out roughly 22% of global caching capacity by 07:45 UTC. Static file serving came back at 08:18 UTC. Then the load-bearing detail. At 07:25 UTC, Vercel "began rerouting function invocations for customers with configured backup or secondary regions, restoring their full service." For everyone else, the postmortem reads: "For customers who do not have multiple function regions configured and who use us-east-1 as their only region, function invocations were restored at 09:21 UTC."

Read those two timestamps together. 07:25 UTC versus 09:21 UTC. One hour and 56 minutes. That is the multi-region penalty, stated plainly by Vercel, measured in the same incident, on the same day. A Pro customer with three configured regions was effectively back at 07:25. A Pro customer on the default single-region setup was down until 09:21. The full incident ran another 10 hours before a feature-flag provider outage cascaded into the dashboard and API at 19:20 UTC, and full control-plane restoration did not land until 03:00 UTC the next morning. Total elapsed time: roughly 20 hours.

The lesson is one sentence: if you are pinned to iad1 today, you are betting on us-east-1 staying up. Vercel will write the postmortem. You will write the apology to your customers.

The Vercel platform surface, in three failure tiers

Vercel's platform breaks into three tiers, and each one has a different failure shape and a different probe shape.

  • Tier 1, global products (Edge Config, the CDN itself). Replicated to every region in the Vercel CDN. Vercel describes Edge Config reads this way: "Most lookups return in less than 1ms, and 99% of reads will return under 10ms." The failure mode is rare and uniform: when these break, they break everywhere at once, and Vercel's own status page will catch it eventually. The signal a monitor can see is straightforward. A latency probe that asserts response time under a reasonable threshold from every region. If three regions go red simultaneously, a tier-1 product is involved.
  • Tier 2, regional products (Vercel Blob, Vercel Functions, Vercel Cron Jobs). These run in the region you configured. Failure is per-region, often silent, and looks identical to "nothing happened" on a monitor that only watches the marketing root. A function pinned to iad1 is dead when iad1 is dead. A Cron Job runs inside a function, so it inherits that region. The signal a monitor can see is a probe against the actual storage path or function endpoint, from a region that is not the function's home region. When the home region fails and the probe region stays up, the monitor flags it before your customers notice.
  • Tier 3, Marketplace products (Upstash KV, Neon Postgres, Supabase, and the rest). These are the partner products you provision from the Vercel Marketplace. Vercel KV and Vercel Postgres are not on the first-party storage page anymore. They live with their partner providers, billed through Vercel, but running on partner infrastructure. The failure mode that matters most here is the one nobody tells you about: Vercel's status page will not reflect a Marketplace-partner outage. Upstash goes down, your reads time out, your dashboard 500s, and vercel-status.com stays green the whole time. The signal a monitor can see is a small /api/health/kv route that reads a known fixed key, with an assertion on the expected value. When it returns the wrong value or times out, the partner is degraded.

That three-tier split is the spine of the four patterns below.

Four monitor patterns for the Vercel platform

Each tier needs its own probe. None of these overlap with render-layer monitors.

(a) Cron sentinel. Vercel Cron does not emit a native "did not fire" signal on Hobby or Pro. The Cron Job lives in a function, the function runs in one region, and if it never invokes, you find out from your customers. The freshness logic has to live on the endpoint, not in the monitor. Have the Cron write a timestamp to KV or Postgres when it succeeds. Then have /api/cron/_heartbeat compute freshness server-side and return 503 if the timestamp is stale:

// /api/cron/_heartbeat.ts
import { kv } from "@vercel/kv";

const STALE_MS = 25 * 60 * 60 * 1000; // 25h grace for a daily Cron

export async function GET() {
  const lastRun = await kv.get<string>("cron:billing:lastRun");
  // A missing key means the Cron never ran: treat that as stale too.
  const ageMs = lastRun ? Date.now() - new Date(lastRun).getTime() : Infinity;
  const stale = ageMs > STALE_MS;
  return new Response(stale ? "stale" : "ok", {
    status: stale ? 503 : 200,
  });
}

The Velprove HTTP monitor (free plan, 5 global regions) then only needs to assert Status Code Equals 200. When the Cron stops firing, the endpoint flips to 503 and the monitor goes red within one probe interval. A static Response Body Contains assertion against the date string does not work here: the monitor stores whatever you typed once, so tomorrow's probe still checks for today's date and false-alarms on a healthy Cron. Move the freshness check to the endpoint and let status code carry the signal. If you need a true freshness probe through the CDN, add a Cache-Control: no-cache request header so the monitor forces a pull rather than a cached response (see Vercel's cache-control headers reference for the request-header semantics).
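The write side is the half the snippet above does not show. A runnable sketch, with an in-memory Map standing in for your KV client and an illustrative job name:

```typescript
// In-memory stand-in for your KV client so the sketch runs anywhere.
const cronStore = new Map<string, string>();

// Placeholder for the real nightly job (illustrative name).
async function runNightlyBilling(): Promise<void> {}

// Write side of the sentinel: record success only *after* the job
// finishes, so a run that crashes mid-job leaves the timestamp stale
// and the heartbeat endpoint flips to 503 once the grace window passes.
async function billingCronHandler(): Promise<string> {
  await runNightlyBilling();
  cronStore.set("cron:billing:lastRun", new Date().toISOString());
  return "ok";
}
```

The ordering is the design choice: a Cron that writes the timestamp before doing the work reports success even when the work fails.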

(b) Storage path probe. HTTP monitor against a small function that reads a known key from your Marketplace KV or Blob store. /api/health/kv reads a fixed key like monitor:canary whose value you set once and never change. The monitor asserts Status Code Equals 200 and Response Body Contains <expected-value>. Use a fixed key, not a randomized one. Random keys can mask a regional cache miss as success because the partner's Redis layer is happy to return null quickly. A fixed known-good key catches both "partner down" and "partner returning wrong data." For the broader design of these endpoints, see the API health endpoint patterns we cover separately.
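A hedged sketch of the /api/health/kv logic. The reader function is injected so the shape is testable without a live store; in the real route it would be your Marketplace KV client, and the returned object would become a Response:

```typescript
type KvReader = (key: string) => Promise<string | null>;

async function kvHealth(read: KvReader): Promise<{ status: number; body: string }> {
  try {
    const value = await read("monitor:canary");
    // A null read means the fixed key is gone or the partner is
    // returning empty data; both are failures, not cache misses.
    if (value === null) return { status: 503, body: "missing" };
    // Echo the stored value so the monitor's Response Body Contains
    // assertion can match the exact expected string, not just a 200.
    return { status: 200, body: value };
  } catch {
    // A partner timeout or error surfaces as 503, flipping the monitor red.
    return { status: 503, body: "kv-error" };
  }
}
```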

(c) Non-default region probe. This one is conditional. If you have multi-region Functions configured on Pro (Pro allows up to three regions; Hobby is single-region only), deploy a tiny /api/health/region route that echoes process.env.VERCEL_REGION from the function. Then create two Velprove HTTP monitors from the 5 global regions: one pinned to Europe, one pinned to Asia. Each asserts the echoed region is not iad1. That is how you verify end to end that your backup-region routing does what you think it does. If both probes keep echoing iad1, your "multi-region" setup exists only in the config file, not in actual routing.
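The echo itself is one line of logic. A sketch, with the environment injected for testability; VERCEL_REGION is the env var Vercel sets on each function invocation, so outside Vercel the route reports "unknown":

```typescript
// Core of /api/health/region: echo the region the function actually ran in.
// In the real route, return this string in a Response with caching disabled
// (cache-control: no-store), or every probe sees one cached region.
function regionEcho(env: Record<string, string | undefined> = process.env): string {
  return env.VERCEL_REGION ?? "unknown";
}

// The two Velprove monitors then assert the echoed body is not "iad1".
```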

(d) Browser monitor on the production checkout flow. This is the pattern that catches a Marketplace-database outage, and it is the reason a render-layer browser monitor is not enough for a Vercel-hosted SaaS. The trick is the login redirect target. Velprove's browser login monitor signs in as a dedicated test user, follows the redirect to wherever your login flow lands, and asserts that a known string from that landing page is present. If your default post-login URL is /dashboard and that page is shell-only, point the login flow at a deeper URL (for example, /login?next=/billing) so the redirect lands on a page whose render depends on a real database read. Then set the Success Verification indicator to Page contains text with a string that only renders if the Postgres read succeeded: a customer name, an invoice ID, a known plan label. The render layer can return 200 with the page shell intact while the database read silently fails. A text-present check on post-database content catches it. HTTP probes do not. This is one of the clearest cases when a browser monitor beats an HTTP probe.

Configured together, the four patterns cover Cron silence, storage outage, regional pinning, and database-backed product flows. The same playbook scales beyond Vercel to other monitoring patterns for API surfaces. None of them require leaving the free plan to start.

Your monitor is the source of truth, not Vercel's status page

A Vercel community thread from 2024 opens with: "at least 3 outages where https://www.vercel-status.com claimed systems were operational, but it was clearly down," with the OP listing incidents that ran "10, 20, 40 minutes at a time or more sometimes." A Vercel staff response acknowledged the lag without explaining it. That single anecdote is small (one user, three outages), but read it next to the October 20 2025 timeline and the pattern stops being anecdotal. Function rerouting began at 07:25 UTC. The status page took longer. External monitors saw it first.

The broader pattern: vendor status pages are a lagging indicator on purpose. They publish after the vendor has confirmed and triaged. Your external monitor publishes the moment a probe fails. For a SaaS founder on Vercel, this is the difference between learning about your own outage from your monitor and learning about it from an angry support ticket. If you operate any kind of SaaS uptime monitoring for a paid customer base, the working assumption has to be that your monitor flags before Vercel does, sometimes by 30 minutes or more. See also a community thread on status-page lag for the original Vercel-side acknowledgement.

The honest probe-cost tradeoff on Vercel Pro

A 1-minute Velprove HTTP monitor from all 5 global regions hits a Vercel function 216,000 times per month (5 regions times 1,440 minutes per day times 30 days). Pro includes 1,000,000 function invocations per month. Three monitors at that frequency consume 648,000 invocations a month before your real user traffic is counted, almost two-thirds of the included quota. The overage rate is $0.60 per million extra invocations.

Practical recommendation: 1-minute interval from all 5 regions on the production root and on your Cron heartbeat endpoint. 5-minute interval on the KV and Blob health probes (most partner outages are minutes long, not seconds long, so 5 minutes is fine). 15-minute interval on the browser monitor against the checkout flow. That mix catches the failure modes that matter and stays well inside Pro's quota. The math is more forgiving on the indie-hacker free-stack writeup if you are on Hobby, but for a paid Pro deployment, treat probe cost as a real budget line.
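The mix above pencils out as follows. A sketch of the arithmetic, assuming a 30-day month and, as a simplification not stated above, a single region and one invocation per run for the browser monitor:

```typescript
// Invocations per month for one monitor: regions × probes per day × 30 days.
const perMonth = (intervalMin: number, regions: number): number =>
  regions * ((24 * 60) / intervalMin) * 30;

const budget =
  perMonth(1, 5) +  // production root, 1-min, 5 regions   -> 216,000
  perMonth(1, 5) +  // Cron heartbeat, 1-min, 5 regions    -> 216,000
  perMonth(5, 5) +  // KV health probe, 5-min, 5 regions   ->  43,200
  perMonth(5, 5) +  // Blob health probe, 5-min, 5 regions ->  43,200
  perMonth(15, 1);  // browser checkout, 15-min, 1 region  ->   2,880

// Total: 521,280 probe invocations, roughly half of Pro's 1M quota,
// leaving the other half for real user traffic.
```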

Setting up your first Vercel platform monitor in Velprove

Start with the Cron sentinel. The free plan covers it: 10 monitors, 5-minute HTTP interval, probes available from 5 global regions, no credit card required. Starter at $19 per month drops to 1-minute intervals if you need them. The Cron heartbeat usually does not.

In the Velprove dashboard, create a new HTTP monitor. On the Configure step, set the URL to https://<your-app>/api/cron/_heartbeat. Click Continue. The wizard advances to the Verify step. Add two Success Conditions in order:

  1. Status Code Equals 200. With the heartbeat endpoint from pattern (a) above, this is the freshness check. The endpoint returns 503 when the Cron is stale, so a 200 means the Cron ran recently.
  2. Header Contains with Header name x-vercel-id and Value ::. The two-colon substring confirms the header is present and shaped as <pop>::<region>::<hash>, which means the response came from a Vercel edge node and not a cached error page from upstream.
Two Success Conditions for a Vercel Cron sentinel. Status Code 200 confirms the heartbeat endpoint computed a fresh timestamp server-side. The x-vercel-id header confirms the response came from a Vercel edge node.

Click Continue again. The wizard advances to the Schedule and Alerts step. Pick one Velprove probe region for this monitor: North America, Europe, United Kingdom, Asia, or Oceania. Velprove's probes are independent of your Vercel deployment. If your Vercel function is pinned to iad1 (US East) and you want to catch the next us-east-1 outage cleanly, pick a non-US Velprove probe region like Europe or Asia. The greater the geographic distance between Velprove's probe and your Vercel function, the better the signal independence. Set the interval to 5 minutes on Free, or 1 minute if you are on Starter or Pro and the Cron is on a sub-hourly schedule.

Schedule and Alerts step with North America selected. Picking a probe region different from your function's home region is how you make the sentinel fail when that region fails.

Vercel's CDN sets the x-vercel-id header on every response. Its presence confirms the probe hit a Vercel edge node, not a stale CDN error page from somewhere upstream. To probe the same endpoint from all 5 regions, create 5 monitors (one per region). That uses half your Free-plan budget for one path, so be deliberate: a Cron sentinel needs only 1 or 2 monitors, and the storage path probe plus the regional-failover probes earn the rest of the budget.

That is the platform-layer baseline. The storage path probe and the browser checkout monitor are the next two.

Frequently Asked Questions

Did the October 20 2025 outage affect every Vercel site?

No. If you monitor a Vercel site today, the takeaway from Vercel's October 20 2025 postmortem is that the impact was uneven. Customers with backup or secondary function regions configured had service restored at 07:25 UTC. Customers using us-east-1 (iad1) as their only region waited until 09:21 UTC for function invocations to come back. That gap is 1 hour and 56 minutes. External monitors flagged the issue before vercel-status.com updated.

How do I monitor a Vercel Cron Job?

Vercel Cron Jobs do not emit a native "did not fire" alert on Hobby or Pro (the Vercel Cron Jobs docs cover pricing and scheduling precision, but no built-in failure alert is documented). The working pattern is to have the Cron write a timestamp on success, then have your heartbeat endpoint compute freshness server-side and return 503 when the timestamp is stale. A Velprove HTTP monitor then asserts Status Code Equals 200. The endpoint flips to 503 when the Cron stops firing, and the monitor catches it within one probe interval. A static Response Body Contains assertion against today's date does not work, because the monitor stores the value once and never updates it.

Should I monitor Vercel preview deployments?

No, preview URLs are not what production uptime monitoring is for. Preview functions are archived after 48 hours of inactivity and will cold-boot on most checks, which produces false-positive slowness signals on quiet weekends. Use a CI synthetic check as a deploy gate against the preview URL, then route your uptime monitors at the production domain. See the production-vs-preview discussion in the Next.js render-layer guide for the framework side of the same call.

Does Vercel auto-failover on the Pro plan?

Partially. The Pro plan allows you to deploy Functions to up to three regions, per the Vercel function runtime docs, but Vercel does not automatically reroute function traffic the way Enterprise does. During the October 20 2025 incident, customers with configured backup regions were rerouted at 07:25 UTC; single-region Pro customers were not. If you are on Pro and your product cannot afford a two-hour outage tied to a single AWS region, configure at least one backup region on your functions today.

How is monitoring a Vercel app different from monitoring a Next.js app?

Monitoring a Next.js app is the render layer: ISR freshness, cold starts on archived functions, auth-protected routes. Monitoring a Vercel-hosted site is the platform layer: Cron firing, regional health, Marketplace storage liveness. Most production teams need both, and the two posts are written to compose. If you have not already, read How to Monitor a Next.js App in Production for the render-layer half.

The free Velprove plan covers 10 monitors at 5-minute intervals, 1 browser login monitor at 15-minute intervals, and runs every probe from 5 global regions. That is enough to land the Cron sentinel, the storage probe, and the checkout browser monitor for a single production app. Start with the free plan. No credit card required.

Start monitoring for free

Free browser login monitors. Multi-step API chains. No credit card required.

Start for free