Case Study

The GitHub Actions May 2026 Degradation: A Detection-Time Teardown

May 22, 20268 min read

The honest take: on May 15 2026, GitHub's status page took 30 minutes to acknowledge an Actions degradation that was already failing workflow runs. The incident ran 65 minutes start to finish. An API monitor on the runs list endpoint with a json_path assertion would have surfaced the failure within one monitor interval, roughly 25 to 29 minutes ahead of the public status update.

Thirty minutes is how long GitHub's status page took to acknowledge the May 15 2026 Actions degradation after workflow runs started failing. The window was 07:43 to 08:48 UTC, sixty-five minutes total, peak 42 percent of Actions runs failing. An external API monitor pointed at the runs list endpoint with a content assertion on the latest run's status saw the failure inside a single monitor interval.

What broke (and what didn't) on May 15 2026

The failure was selective. At peak, 42 percent of GitHub Actions runs failed or experienced delayed starts. GitHub Pages builds backed by Actions degraded alongside, because a Pages build is an Actions workflow. GitHub Coding Agent and GitHub Code Review Agent both showed elevated error rates, because both depend on Actions for orchestration.

What stayed up matters as much as what went down. Core git operations over HTTPS and SSH (clone, push, pull) continued serving normally. The github.com web surface (browsing repos, viewing pull requests, reading issues) was unaffected. The api.github.com endpoints unrelated to Actions, including repo metadata and user endpoints, kept returning 200. Even the api.github.com/.../actions/runs endpoint stayed reachable: it returned 200 with valid JSON throughout. The body told the truth that the status code hid.

This selective shape is the cleanest possible setup for the case this post makes. A customer running a plain HTTP monitor on https://github.com or on api.github.com saw 100 percent uptime through the entire window, because every endpoint kept returning 200. The Actions surface was broken behind the response, not in front of it. The only way to catch that externally is to assert on the response body, not the response code.

The timeline (UTC, primary-source)

The timeline below is taken verbatim from the GitHub Status incident permalink. All times UTC. Source: GitHub Status incident ctf7nxpq5jzn.

Time (UTC)	Event
07:43	Impact start. The Resolved update states impact began at approximately 07:43 UTC.
08:13	First Investigating update: “We are investigating reports of degraded availability for Actions”.
08:14	Pages added to the incident: “Pages is experiencing degraded availability. We are continuing to investigate.”
08:26	Actions mitigation announced: “The degradation affecting Actions has been mitigated. We are monitoring to ensure stability.”
08:27	Pages mitigation announced: “The degradation affecting Pages has been mitigated. We are monitoring to ensure stability.”
08:29	Downstream services acknowledged: “We are monitoring an issue that was affecting GitHub Actions and causing downstream issues in GitHub Coding Agent and GitHub Code Review Agent. The issue has resolved now but we are closely monitoring our systems for full recovery.”
08:41	Monitoring status posted across the affected surfaces.
08:48	Incident marked Resolved.

The Resolved update attributed the trigger to a planned failover of supporting infrastructure used by GitHub Actions, where an automated service discovery update did not propagate correctly. Traffic was routed incorrectly and request timeouts increased in a core dependency for workflow orchestration. That is the full published explanation. Anything more specific would be speculation.

GitHub publishes monthly availability reports roughly one month in arrears. A May 2026 report may add detail after this post launches; the body above is what the live status page records as of one week after the incident.

The 30-minute detection-time argument

The load-bearing number is 30 minutes. That is the gap between first user-visible impact (07:43 UTC) and the first Investigating update on GitHub Status (08:13 UTC). Anyone running a workflow that hit the bad 42 percent during those 30 minutes saw a failure with no upstream confirmation. The status page was still green.

A single monitor type is enough to close that gap: an API monitor on the runs list endpoint with a json_path assertion on the latest run's status. The table below shows what that monitor would have observed and how far ahead of GitHub Status it would have flipped red.

Monitor type	What it would see	Earliest detectable timestamp	Delta vs 08:13 UTC ack
API monitor on /repos/{owner}/{repo}/actions/runs with a json_path assertion that the latest run's status is not queued	Endpoint stays 200, but the json_path assertion fails because the latest run is stuck at status:queued instead of transitioning to in_progress or completed.	One interval after impact lands on the repo (07:48 UTC at free 5-min, 07:44 UTC at Starter 1-min, 07:43:30 UTC at Pro 30-sec)	25 to 29 minutes earlier

The interval math is straightforward. At the free 5-minute interval, the worst-case detection lag puts surfacing time at 07:48 UTC, 25 minutes ahead of the 08:13 status acknowledgment. On Starter's 1-minute interval, the worst case is 07:44 UTC, 29 minutes ahead. Pro's 30-second interval shaves another half-minute on top of that. Even the free interval gets you 25 minutes ahead of GitHub Status.

One honest caveat. The detection-time delta assumes there is at least one workflow run touching the repo during the impact window. If the repo had no workflow activity between 07:43 and 08:48 UTC, there is nothing for the monitor to assert against. For an agency monitoring multiple repos, or a team with scheduled workflows that fire on a tight cron, this is rarely a constraint. For a single low-traffic repo, the monitor might stay green through an incident that never touched it.

What an API monitor with json_path actually does

A single GET on /repos/{owner}/{repo}/actions/runs?per_page=1 returns the latest run as JSON. The relevant field is $.workflow_runs[0].status. On a healthy day, that value cycles through queued briefly, then in_progress, then completed. A well-shaped Velprove assertion is:

Assertion field	Value
type	`json_path`
path	`$.workflow_runs[0].status`
operator	`neq`
value	`queued`

The assertion passes when the latest run is anything other than stuck queued. It fails when the latest run is queued. Pair it with a standard status_code eq 200 assertion on the same step so the monitor also catches the case where api.github.com itself returns 5xx. Both assertions can run in one API monitor on Velprove's free plan.

The shape extends naturally to the multi-step API monitor pattern if you want to dispatch a fresh canary workflow and assert on the result. The single-step version above works when your repo already has scheduled or frequent workflow runs and you just need to detect the stuck-queued condition. Either way, the load-bearing primitive is the json_path body assertion, not the HTTP status code.

For the broader pattern of why a plain 200 OK probe misses this class of failure, see why HTTP probes miss this class of failure and the deeper assertion shapes in API health check patterns.

The honesty boundary

The strongest version of this post is the version that names what it does not claim.

First, detection is not prevention. An external monitor surfaces a failure faster. It does not stop the failure. The 25-to-29 minute delta is earlier detection, not prevention.

Second, the “5 regions” in Velprove's plan is a plan-level option. Each check runs from one region you pick, with 5 regions available to choose from. The GitHub Actions degradation was global, so a single monitor in any one region would have caught it. The detection math above assumes one monitor in one region.

Third, we do not claim to know what broke inside GitHub. The Resolved status update attributes the trigger to a planned failover where an automated service discovery update did not propagate correctly. That is the full published explanation, and we quote it verbatim above. Anything more specific would be guessing.

Fourth, a monitor sees its own canary or its own scheduled workflows. It does not see 42 percent of someone else's runs. The detection-time delta applies to any workflow that hit the bad 42 percent during the window. Repos with no workflow activity between 07:43 and 08:48 UTC had nothing to assert against.

Fifth, the assertion shape is calibrated to the failure mode of this specific incident (stuck queued). A future GitHub Actions degradation that manifests differently (for example, runs that start but never complete) would need a different assertion (for example, on $.workflow_runs[0].conclusion asserted equal to success, with the monitor interval doing the work of catching repeated failures over time). Monitoring is calibrated to the failure mode you can articulate.

The broader version of this argument, with a different vendor and a different dated incident, lives in the structural argument that status pages lag incidents.

This pattern, not just this incident

The May 15 2026 degradation is not the only recent GitHub Actions incident; on May 20 2026, runner-connect delays starting at 16:58 UTC produced a similar selective-failure shape, though with less complete published timeline detail. The two together read as a pattern, not a one-off.

The broader catalog of failure shapes that HTTP monitoring misses lives in the 10-pattern silent-outage taxonomy. The GitHub Actions degradation is a clean 2026-dated case study against that taxonomy: a CI/CD orchestration surface degrades, the web surface stays up, the API surface stays up, and only the response body tells the truth about what is broken. GitHub's auth plane can fail the same way one layer earlier, valid credentials drawing erroneous 401s, as torn down in our GitHub API authentication failures teardown.

What to monitor on GitHub Actions

The recommendation is a single API monitor on /repos/{owner}/{repo}/actions/runs?per_page=1 with two assertions:

status_code with operator eq and value 200. Catches the case where api.github.com itself goes down.
json_path on $.workflow_runs[0].status with operator neq and value queued. Catches the stuck-queued condition that defined the May 15 2026 incident.

On the Free plan: 5-minute interval, json_path assertions included, no credit card. On Starter: 1-minute interval. On Pro: 30-second interval. All run from one region you pick out of 5 regions available; the May 15 2026 degradation was global, so single-region detection was sufficient.

Frequently Asked Questions

What happened during the GitHub Actions May 15 2026 degradation?

GitHub Actions degraded between 07:43 UTC and 08:48 UTC on May 15 2026, a 65-minute window. At peak, 42 percent of Actions runs failed. The failure was selective: core git operations, github.com web, and api.github.com endpoints unrelated to Actions continued serving normally. GitHub Pages builds backed by Actions, GitHub Coding Agent, and GitHub Code Review Agent all degraded alongside the Actions surface. GitHub's Resolved status update attributed the trigger to a planned failover where an automated service discovery update did not propagate correctly.

How long did GitHub's status page take to acknowledge the May 15 2026 Actions outage?

Impact started at approximately 07:43 UTC. GitHub Status posted the first Investigating update at 08:13 UTC. That is a 30-minute gap between user-visible impact and public acknowledgment. Customers without their own monitoring spent that window debugging failed workflow runs without confirmation that the cause was upstream. Public mitigation for Actions was announced at 08:26 UTC, 43 minutes after impact start. Resolved status posted at 08:48 UTC.

Would a Velprove monitor have detected the May 15 2026 GitHub Actions degradation?

Yes, for any repo with active workflow runs during the impact window. An API monitor on the runs list endpoint with a json_path assertion that the latest run's status is not queued would have flipped red within one monitor interval of any canary or scheduled workflow getting stuck. At the free 5-minute interval, that puts detection roughly 25 minutes ahead of GitHub Status. On the Starter 1-minute interval, roughly 29 minutes ahead. The honest framing is that a monitor sees the runs it can observe, not the aggregate 42 percent. If the repo had no workflow activity during the 65-minute window, there is nothing for an external monitor to assert against.

Does GitHub Actions have an SLA, and would the May 15 2026 incident trigger a credit?

GitHub Actions ships under the GitHub Enterprise SLA. Consumer plans (Free, Pro, Team) do not carry a contractual SLA. Even on Enterprise, a 65-minute degradation is unlikely to breach a 99.9 percent monthly threshold, which allows 43 minutes and 50 seconds of downtime per month. The operational angle is the detection-time gap, not the credit. For the deeper SLA-credit lens, see the SLA-vs-SLO-vs-SLI breakdown.

What is the json_path assertion shape that catches this incident?

Velprove's API monitor supports json_path assertions on the response body. For GitHub Actions, the shape is: GET /repos/{owner}/{repo}/actions/runs?per_page=1, then assert json_path $.workflow_runs[0].status with operator neq and value queued. On a healthy day the latest run's status is completed (or in_progress mid-execution), so the assertion passes and the monitor stays green. On May 15 2026, runs dispatched during the impact window got stuck at queued, so the assertion failed and the monitor flipped red within one monitor interval. Velprove supports json_path with the eq, neq, gt, lt, contains, exists, and other operators on every plan, including Free.

Why didn't github.com or git clone break during the May 15 2026 incident?

The failure was scoped to the Actions surface and its direct dependencies. Core git operations over HTTPS or SSH, github.com web for browsing repos and PRs, and api.github.com endpoints unrelated to Actions all continued serving normally throughout the window. Only orchestration paths through the Actions workflow queue and its downstream services (Pages builds, Coding Agent, Code Review Agent) hit the increased timeouts. That selective shape is what makes a content-aware API monitor on the runs list the right detection instrument here: a plain 200 OK probe on api.github.com would have stayed green throughout, because the API endpoint itself was fine.