Case Study

Vercel Workflow Failures, June 2026: Runs Stuck Pending and Queued Messages Stalled

June 26, 2026 · 11 min read

The wedge: on June 25, 2026, Vercel's status page recorded a major-impact incident in which some recently deployed Workflow projects were stuck as pending, with queued messages not delivered as expected, for roughly six hours. This kind of failure hides for a structural reason: a durable run executes asynchronously, off to the side of the request that started it, so the page at the front door keeps returning a healthy 200 while the run that should advance behind it never leaves pending. The instrument that surfaces it is an endpoint you expose from your own app, call it /_freshness, that computes the age of your last successful run or queue drain server-side and flips to 503 when that age goes stale; a Velprove HTTP monitor asserts Status Code Equals 200 against it from any of the 5 probe regions available on every plan, and goes red within one probe interval when runs wedge. To be clear up front, Velprove did not detect this incident and makes no claim that it did, and the green-deploy framing below is our illustration of how the failure class behaves, not a Vercel-reported fact. Start free, no credit card required.

What broke on June 25, 2026 (and what Vercel did not say broke)

Vercel named exactly one component in this incident: Workflow. It marked the impact major. The verbatim opening line on its status page was that "some recently deployed Workflow projects are stuck as pending and not delivering queued messages as expected." That sentence is the entire confirmed scope, and the words doing the most work in it are "recently deployed." This was not all of Workflow, and it was not all of Vercel. It was a subset of projects that correlated with recent deploys, while projects still running an older, pre-incident deployment were the safe-rollback target Vercel later pointed people to.

It helps to know what a Workflow is before reading that symptom. Per Vercel's own docs, Vercel Workflows is "a fully managed platform for building durable applications and AI agents," code that can pause, resume, and keep its state across deployments and crashes. It is built on Vercel Queues, which the docs describe as "a durable event streaming system" where "you publish messages to topics" and consumers process them. So a Workflow run is one execution of a workflow, and "delivering queued messages" is that message-delivery layer handing work to the consumers that drive a run forward. When that hand-off stalls, runs sit at pending. What Vercel did not say is which layer stalled. The Functions that execute the code, the Queues message-delivery path, and the managed persistence that stores run state are three distinct pieces, and the status post named none of them as the fault. We describe the architecture from the docs and stop there; we do not diagnose a subsystem Vercel never named.

Not the FRA1 CDN incident, and not Cron

One disambiguation, because Vercel has had more than one June incident and the phrase "Vercel down" gets attached to all of them. This is not the June 14, 2026 FRA1 CDN event, where Frankfurt's edge degraded and Vercel rerouted serving to Paris; that was a regional serving problem at the front edge, a different failure class entirely, and we tore it down separately in the FRA1 CDN teardown. It is also not the October 20, 2025 us-east-1 platform incident, a different date with a far wider blast radius. And it is not a Vercel Cron Job stalling. Cron is a scheduled-invocation surface; Workflows is durable execution, a different product. Cron monitoring belongs to the Vercel platform-layer guide, not here. We name these only to keep them apart.

The timeline (UTC, primary-source)

The timeline below is taken from Vercel's status page incident permalink. All times UTC. Source: Vercel Status incident x4ngbgzymq4h.

Time (UTC)	Status	Update
Jun 25, 07:45	Investigating	"We are currently investigating an issue where some recently deployed Workflow projects are stuck as pending and not delivering queued messages as expected."
Jun 25, 08:27	Identified	"The issue has been identified and a fix is being implemented."
Jun 25, 09:25	Identified	"The issue has been identified. We are continuing to work on a fix for this issue."
Jun 25, 10:42	Identified	"We are still working on a fix for this issue. While we fix the issue, users experiencing pending Workflow runs can trigger an instant rollback to a working deployment created before Jun 24 21:00 UTC to resume operation on new runs."
Jun 25, 12:15	Identified	"We are currently rolling a fix for the issue. We are monitoring the results."
Jun 25, 13:30	Monitoring	"The fix has been fully rolled out and we are monitoring the results."
Jun 25, 13:43	Resolved	"This incident has been resolved."

The two bookend timestamps run from 2026-06-25T07:45:00Z to 2026-06-25T13:43:22Z, which works out to approximately 5 hours 58 minutes. That figure is our derivation from the two published timestamps, not a number Vercel stated. The path to resolution was staged: identified by 08:27, a rollback workaround offered at 10:42, a fix rolling by 12:15, monitoring at 13:30, and resolved at 13:43. The one piece of operational advice Vercel gave affected users is worth quoting exactly, because it is the only concrete remediation in the record. At 10:42 UTC it said users experiencing pending Workflow runs could "trigger an instant rollback to a working deployment created before Jun 24 21:00 UTC to resume operation on new runs." Read that precisely: 21:00 UTC on June 24 is the rollback target, the last known-good deploy window, not a stated time that a bad deploy shipped. Vercel did not say a deploy shipped at 21:00; it said to roll back to something created before then. We do not turn the rollback target into an onset time.
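The duration derivation can be checked in a few lines. This is a minimal sketch in TypeScript; the elapsed helper is our own illustrative name, and its only inputs are the two published timestamps quoted above.

```typescript
// Break the gap between two ISO 8601 UTC timestamps into hours, minutes, seconds.
// `elapsed` is a hypothetical helper written for this post, not part of any library.
function elapsed(
  startIso: string,
  endIso: string,
): { hours: number; minutes: number; seconds: number } {
  const totalSeconds = Math.round((Date.parse(endIso) - Date.parse(startIso)) / 1000);
  return {
    hours: Math.floor(totalSeconds / 3600),
    minutes: Math.floor((totalSeconds % 3600) / 60),
    seconds: totalSeconds % 60,
  };
}

// First and last timestamps on the incident feed.
const incidentSpan = elapsed("2026-06-25T07:45:00Z", "2026-06-25T13:43:22Z");
console.log(incidentSpan); // { hours: 5, minutes: 58, seconds: 22 }
```

The result is where "approximately 5 hours 58 minutes" comes from: it measures the status-page window only, not the unconfirmed onset of the underlying cause.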

Vercel published no root cause. The feed contains "the issue has been identified and a fix is being implemented," the rollback workaround, and the resolution, and nothing more. There is no postmortem at the incident URL as of writing, which is normal for a same-day incident of this size. We do not fill that gap. No bad-deploy story stated as fact, no queue-backlog theory, no database guess. This is also effectively a single source, Vercel's own status page, with no third-party corroboration, stated here so the post claims no more than the record supports.

Why a 200 at the front door sails past a wedged Workflow run

Here is the failure shape, and it is the reason this incident is worth a teardown rather than a status-page screenshot. A durable run is asynchronous by construction. A request comes in, enqueues the work that should advance a run, and returns. The page keeps serving 200s the whole time, because serving the page and advancing the run are two different jobs done by two different parts of the system. The status code on the front door tells you the front door answered. It tells you nothing about whether the run behind it ever left pending. Those are two separate facts, and on June 25 they diverged: front doors answering 200, runs stuck pending.

This is why the cheapest, most common way to watch a deployment misses it completely. A plain uptime check fetches your homepage or a health route, sees a 200, and reports green. It never enqueues a run, never waits for one to complete, never asks whether the queue is draining. It confirms the one thing that was not broken, that the site responds, while the thing that was broken, durable runs advancing, sits outside its field of view entirely.

Why a clean deploy makes it worse

Here is the part that is our framing, not a Vercel-reported fact, so we flag it as ours before we make it. A deploy is the worst case for this failure shape, and a clean deploy is worse than a broken one. Picture the ordinary good outcome: your build goes green, the new version serves a 200 at the front door, your smoke check passes, and you close the laptop. "The build shipped" and "the queue is draining" are different claims, and a green deploy only ever proves the first. In-flight durable runs can still wedge while every front-door signal you have stays healthy. We want to be clear that Vercel never described the incident in these terms; this green-deploy picture is our plausible illustration of how a deploy-correlated durable-run stall behaves, consistent with "some recently deployed Workflow projects," not a quote from the status page.

It is worth contrasting this against the deploy check most teams already know, the build-SHA or /version probe that asserts the deployment you intended is the one actually serving. That check catches the wrong build serving. This is the opposite problem: the right build is serving, the front door is correct, and the durable runs behind it are stalled anyway. A version assertion would have stayed green here, because the version was never the issue.

This puts the incident squarely in the silent-outage family, where the status code is true and useless at the same time. We catalog the broader pattern in the anatomy of a silent outage and argue the general case in why uptime monitors miss outages. A wedged durable run behind a healthy 200 is one of the quieter members of that family: nothing errors, nothing times out, the dashboard is green, and work simply stops moving.

How Velprove monitors Vercel Workflows

The instrument that catches this is not a cleverer ping at the front door. It is a small endpoint you expose from your own app, call it /_freshness, that computes server-side the age of your last successful Workflow run or queue drain and flips to 503 when that age crosses a staleness cutoff. A Velprove HTTP monitor then does the simplest possible thing: it asserts Status Code Equals 200 against that endpoint. While runs advance the endpoint returns 200; when the last successful run ages past your cutoff it flips to 503 and the monitor goes red within one probe interval.

The direction is worth stating once: Velprove has no inbound heartbeat your Workflow checks into, and nothing in your app pings Velprove. Detection is entirely outbound: a Velprove monitor reaches out and reads the status code your /_freshness endpoint returns, the same way a user's browser would. Your Workflow never has to know Velprove exists.

The endpoint is yours to write and it is small: a read-only handler that looks up your last successful run timestamp and returns 503 once it ages past your cutoff. That 503 response contract is not new and not specific to Vercel; we work through how to build and tune the freshness endpoint in API health check patterns, and the same instrument carries over to any platform with durable or background work. The fresh part of this post is not the recipe; it is the surface, durable runs wedged by a deploy behind a 200. For the rest of the Vercel platform surface, Cron, Marketplace storage, and regional Functions, see the Vercel platform-layer guide.
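To make the 200-or-503 contract concrete, here is a minimal sketch of the decision at the heart of such a handler, in TypeScript. Every name in it (evaluateFreshness, FreshnessResult, the response body fields) is hypothetical and ours; none of it is a Vercel or Velprove API. It assumes the last step of your workflow records a success timestamp somewhere your app can read, and it treats "no successful run recorded yet" as stale, which is a design choice you may want to reverse for brand-new projects.

```typescript
// Hypothetical shape of what a /_freshness handler would return.
interface FreshnessResult {
  status: number; // 200 while fresh, 503 once stale
  body: { fresh: boolean; ageSeconds: number | null; cutoffSeconds: number };
}

// Pure decision function: compare the age of the last successful run
// against the staleness cutoff. No I/O, so it is easy to test.
function evaluateFreshness(
  lastSuccessMs: number | null, // epoch ms of the last successful run, or null if none
  nowMs: number,                // current time in epoch ms
  cutoffMs: number,             // staleness cutoff in ms
): FreshnessResult {
  const cutoffSeconds = Math.floor(cutoffMs / 1000);
  if (lastSuccessMs === null) {
    // Assumption: an app with no recorded success is reported as stale.
    return { status: 503, body: { fresh: false, ageSeconds: null, cutoffSeconds } };
  }
  const ageMs = Math.max(0, nowMs - lastSuccessMs); // clamp small clock skew
  const fresh = ageMs <= cutoffMs;
  return {
    status: fresh ? 200 : 503,
    body: { fresh, ageSeconds: Math.floor(ageMs / 1000), cutoffSeconds },
  };
}

// Illustrative values only: a 15-minute cutoff and an arbitrary clock reading.
const exampleCutoffMs = 15 * 60 * 1000;
const exampleNow = Date.parse("2026-01-01T12:00:00Z");
const advancing = evaluateFreshness(exampleNow - 4 * 60 * 1000, exampleNow, exampleCutoffMs);
const wedged = evaluateFreshness(exampleNow - 75 * 60 * 1000, exampleNow, exampleCutoffMs);
console.log(advancing.status, wedged.status); // 200 503
```

In a real route you would look up the timestamp from wherever your workflow writes it and send back the computed status with the body as JSON. That lookup depends on your storage and is deliberately left out. Keeping the handler read-only means the monitor's probes never create work of their own.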

Figure: the Velprove HTTP monitor builder on the Verify step (heading Success Conditions), with a single success condition configured against the app's /_freshness endpoint: Type set to Status Code, Operator set to Equals, and Value set to 200. While durable runs are advancing the endpoint returns 200 and the monitor is green; when runs wedge and the last successful run ages past the staleness cutoff, the endpoint flips to 503 and this assertion turns the monitor red within one probe interval.

There is a sharper version of this check. A multi-step API monitor drives the loop end to end: trigger a canary run through your own endpoint in one step, then assert in a later step that your freshness endpoint advanced to reflect it, which proves a brand-new run can actually complete right now. Velprove's multi-step API monitors handle trigger-then-assert in up to 3 steps.

And if a Workflow drives a user-visible surface, say a dashboard view that only renders once a run completes, a free, no-code browser login monitor opens a real browser, signs into your own login, and asserts a known post-login string is present, so a view that never populates because its run is wedged surfaces as a red monitor rather than a support ticket.

Picking the staleness threshold

The one number that makes or breaks this monitor is the staleness cutoff, the threshold your endpoint compares the run age against. Set it to a small multiple of your expected run cadence or queue-drain interval, the same way you would size a Cron grace window: long enough that a normal idle gap between runs does not trip it, short enough that a genuinely wedged queue crosses it quickly. Workflows that run every few minutes want a ten- or fifteen-minute cutoff; hourly ones want a larger one. That calibration, not a round number copied from a docs page, is what separates a monitor that catches real stalls from one that false-pages on a slow afternoon.
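As a rough illustration of that sizing rule, the cutoff can be derived from the cadence rather than hard-coded. This is a sketch under our own assumptions: stalenessCutoffMs is a hypothetical helper, and the default multiple of 3 is an illustrative starting point, not a recommendation from Vercel or a Velprove setting.

```typescript
// Derive a staleness cutoff as a small multiple of the expected run cadence.
// Hypothetical helper; the default multiple of 3 is an assumption to tune.
function stalenessCutoffMs(expectedCadenceMs: number, multiple: number = 3): number {
  if (!(expectedCadenceMs > 0)) {
    throw new RangeError("expectedCadenceMs must be a positive number");
  }
  if (!(multiple >= 1)) {
    throw new RangeError("multiple must be at least 1");
  }
  return Math.ceil(expectedCadenceMs * multiple);
}

const MINUTE_MS = 60 * 1000;

// A workflow expected every 5 minutes gets a 15-minute cutoff.
console.log(stalenessCutoffMs(5 * MINUTE_MS) / MINUTE_MS); // 15

// An hourly workflow with a tighter multiple of 2 gets a 2-hour cutoff.
console.log(stalenessCutoffMs(60 * MINUTE_MS, 2) / MINUTE_MS); // 120
```

If your runs are bursty rather than periodic, base the cadence on the longest idle gap you normally see, not the average, so a quiet stretch does not read as a stall.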

The honesty boundary

The strongest version of this post is the one that names what it does not claim, so here are the boundaries plainly.

We did not detect this. Velprove did not monitor Vercel Workflows, and we are not claiming we caught this incident or would have caught it as a matter of fact. Everything above is the failure shape that a freshness endpoint plus an HTTP monitor is built to surface, presented as a worked example, not a detection war story. There is no detection-time lead to quote either: the incident opened on Vercel's status page at its own start time, with no published gap between impact and acknowledgment to measure against.

The green-deploy picture is ours, not Vercel's. The build-went-green, front-door-returned-200, runs-wedged-anyway framing is our illustration of how this class of failure behaves. It is defensible as a general statement and it is consistent with "some recently deployed Workflow projects," but Vercel never described the incident that way, and we do not attribute it to them.

Stall, not loss. Vercel's words were "not delivering queued messages as expected," which is delay and stall language. We do not claim messages were permanently lost, because the record does not say that, and a stalled queue that resumes is a different thing from data destroyed.

We do not know the root cause, and we do not name a failing subsystem. Vercel published no cause and named no faulty layer. We described the Workflows-on-Queues architecture from the docs, but we do not assert that Queues, or Functions, or persistence is what broke.

We do not claim a bad deploy shipped at 21:00 on June 24. That timestamp is only the rollback target Vercel pointed to, the last known-good window. The actual onset time of the cause is unconfirmed, and we leave it that way.

Detection surfaces, it does not prevent. A freshness monitor would tell you sooner that your runs stopped advancing. It would not have stopped Vercel's Workflow layer from stalling. And it watches exactly one endpoint from one vantage you point it at; it is not a blanket health claim about everything in your stack.

This pattern, not just this incident

A durable or asynchronous run wedged behind a healthy front door outlives this particular Vercel blip. The shape recurs anywhere work is enqueued and processed out of band: background jobs, queues, scheduled pipelines, durable workflows on any platform. The instrument is ready for the next one. A freshness endpoint that flips to 503 when your last successful run goes stale, watched by an HTTP monitor asserting 200, does not care whether the durable layer underneath it is Vercel Workflows or something else. Build it once and it is waiting the next time a run stalls while the page stays 200.

This teardown reads as a sibling to the other June 2026 Vercel incident we covered, and it is worth saying clearly how they differ. In the FRA1 CDN teardown, the failure was at the front edge: one region's CDN degraded and Vercel rerouted serving to Paris, and the catching instrument was coverage from more than one region. This incident is the opposite end of the stack. The front door was not the problem; the durable-execution layer behind a 200 was, and the catching instrument is a freshness endpoint, not regional coverage. Same vendor, same month, two failure classes that want two different instruments. The same freshness instrument applied to another platform is in the Render worker-monitoring guide, and the broader green-but-broken family is cataloged in the anatomy of a silent outage and why uptime monitors miss outages.

Monitor your Vercel Workflows

A wedged Workflow run behind a healthy 200 is invisible to a front-door uptime check and visible to a freshness endpoint watched from outside. The fix is small: expose a /_freshness endpoint that flips to 503 when your last successful run or queue drain goes stale, and point a Velprove HTTP monitor at it asserting Status Code Equals 200 from the regions you care about. Velprove's free plan covers the setup: 10 monitors, 5 regions on every plan, commercial use allowed, and no credit card. If a Workflow drives a post-login view, the no-code browser login monitor walks the real sign-in path and catches the user-facing half; if you want the sharper trigger-then-assert canary, multi-step API monitors are free up to 3 steps.

This post owns the durable-execution-wedge angle. For the full Vercel platform surface, Cron, Marketplace storage, and regional Functions, see the Vercel platform-layer guide, which is where Vercel Cron Job monitoring lives. Start free and point a monitor at your own freshness endpoint.

Frequently Asked Questions

What happened with Vercel Workflows on June 25, 2026?

Vercel opened a major-impact incident on its status page, incident x4ngbgzymq4h, affecting the Workflow component, running from 07:45 UTC to 13:43 UTC, roughly 5 hours 58 minutes. Vercel said some recently deployed Workflow projects were stuck as pending and not delivering queued messages as expected. It marked the issue identified by 08:27, offered a rollback-to-a-pre-incident-deployment workaround at 10:42, began rolling a fix by 12:15, moved to monitoring at 13:30, and resolved it at 13:43. Vercel published no root cause.

Was all of Vercel down during the June 2026 Workflow incident?

No. Vercel named only the Workflow component, and only a subset of it: some recently deployed Workflow projects. Projects still on an older, pre-incident deployment were the safe-rollback target. This was not a platform-wide outage; Functions, the CDN, and other regions were not named as affected. Do not conflate it with the separate June 14, 2026 FRA1 CDN incident or the October 2025 us-east-1 incident, which were different components, dates, and failure classes.

How long did the Vercel Workflow incident last?

Approximately 5 hours 58 minutes, from 2026-06-25T07:45:00Z to 2026-06-25T13:43:22Z. That figure is our derivation from the two published timestamps, not a number Vercel stated. The path to resolution was staged: identified by 08:27, a rollback workaround offered at 10:42, a fix rolling by 12:15, monitoring at 13:30, and resolved at 13:43.

What does "Workflow runs stuck as pending" mean?

A Workflow run is one execution of a workflow, and "stuck as pending" describes a run that never progressed past its initial pending state. Per Vercel's docs, Workflows is built on Vercel Queues, where you publish messages to topics and consumers process them, so "not delivering queued messages" refers to that message-delivery layer not handing work to the consumers that advance a run. We describe the architecture from the docs; Vercel did not publish which layer failed, so we do not name one.

Why doesn't a normal uptime check catch a stuck Workflow run?

Because a durable run executes asynchronously, off to the side of the request that started it, and the page at the front door keeps returning a healthy 200 while the run sits wedged. A normal uptime check sees that 200 and reports green; it never enqueues a run or asks whether the queue is draining. The catch is a freshness endpoint you expose that flips to 503 when your last successful run or queue drain goes stale, watched by an HTTP monitor asserting Status Code Equals 200. When runs wedge, the endpoint returns 503 and the monitor goes red.

Did Velprove detect the June 2026 Vercel Workflow failures?

No, and this post makes no such claim. Velprove did not monitor Vercel, and there is no detection-time lead to quote, because the incident opened on Vercel's status page at its own start time. The green-deploy framing in this post is our illustration of how the failure class behaves, not a detection. A freshness endpoint plus an HTTP monitor demonstrates the shape is catchable from outside; it does not claim it caught this specific incident.