Your GraphQL API Returns 200 While It's Down. Here's How to Catch It.

The 30-second version: A GraphQL endpoint can return HTTP 200 while it is functionally down. When a field or a resolver fails on a well-formed request, GraphQL reports it inside the response body as a top-level errors array, often with null sitting in data where the real value should be. The HTTP status stays 200, so a plain status check stays green. The fix is a body-content assertion: confirm 200 AND that $.errors[*] has no matches AND that $.data.<criticalField> is not null. That is exactly what Velprove's free multi-step API monitor does: it asserts on the response body, not just the status code.

If you run a GraphQL API, your uptime monitor is probably lying to you. Not because the tool is bad, but because GraphQL breaks the assumption every status-only check is built on: that a 200 means the request worked. For REST that assumption mostly holds. For GraphQL it does not. A query can fail in a way that takes your whole feature down and still hand back a tidy 200 OK. To monitor GraphQL API uptime for real, you have to read the body.

This is the same blind spot we wrote about more generally in why a 200 OK can hide an outage. GraphQL is the sharpest example of it, because the 200-with-an-error is not an edge case here. It is the documented, default, by-design behavior.

Why GraphQL returns 200 when a field fails

Start with the scope, because the common overstatement ("GraphQL always returns 200") is wrong and will lead you to build the wrong monitor. The 200-with-errors behavior applies to one specific situation: a well-formed request that the server accepts and executes, responding with the application/json media type, where a field or a resolver fails during execution.

In that case the transport did its job. Your query parsed, validated, and ran. One of the resolvers threw, or returned null for a non-nullable field, or an upstream the resolver called timed out. The server has a valid HTTP response to send you, so it sends 200 and reports the failure in the body. The response carries a top-level errors array describing what broke, and a data object that holds null where the failed field should have been.

This is described in the graphql-over-http specification, and the clearest practitioner write-up is Nigel Sampson's "GraphQL and 200 Not OK" (2020), which frames the problem exactly the way a monitoring engineer runs into it. Sasha Solomon's "200 OK! Error Handling in GraphQL" (2019) covers the same ground from the schema-design side. The short version: in GraphQL, the HTTP status describes the transport, and the errors array describes your query. A monitor that only reads the status is reading the wrong layer.

The three failure modes a status check misses

There are three shapes this takes in production, and a status-only check is blind to all three. Each one returns 200. The samples below are responses your own GraphQL API would send back. Names and fields are illustrative.

1. Field error with a populated errors array

A resolver throws. The field it was responsible for comes back null, and the failure shows up in errors. The status is still 200.

{
  "data": { "currentUser": null },
  "errors": [
    {
      "message": "Failed to fetch user from accounts service",
      "path": ["currentUser"],
      "extensions": { "code": "INTERNAL_SERVER_ERROR" }
    }
  ]
}

Your dashboard's "who am I" query just failed for every signed-in user. The status check sees 200 and a non-empty body and reports green.
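To make the gap concrete, here is a minimal Python sketch that reads the sample above the way a body-aware check would. The function name error_entries is made up for illustration and is not part of any tool mentioned here.

```python
import json

def error_entries(body_text: str) -> list:
    """Return the entries of the top-level "errors" array, or [] if it is absent."""
    body = json.loads(body_text)
    if not isinstance(body, dict):
        return []
    errors = body.get("errors")
    return errors if isinstance(errors, list) else []

# The failure-mode-1 sample response from above.
sample = """{
  "data": { "currentUser": null },
  "errors": [
    {
      "message": "Failed to fetch user from accounts service",
      "path": ["currentUser"],
      "extensions": { "code": "INTERNAL_SERVER_ERROR" }
    }
  ]
}"""

status = 200  # all a status-only check ever looks at
entries = error_entries(sample)

# Healthy only if the status is 200 AND the errors array has no entries.
healthy = status == 200 and not entries
print(healthy)  # False
```

The status comparison alone would have passed. The second condition is what flips the result.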

2. A critical field returns null while siblings resolve

The query mostly works. One important field goes null because its resolver failed, while the cheap fields around it resolve fine. Sometimes there is an errors entry, sometimes the resolver swallowed the error and just returned null. Either way the body looks populated.

{
  "data": {
    "product": {
      "id": "prod_123",
      "name": "Standing Desk",
      "price": null
    }
  }
}

The product page renders with a name and no price. Nothing is "down" by any status-code measure, but you cannot sell the thing. An assertion on $.data.product.price being non-null is the only check that catches this.
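A not-null check on a nested field can be sketched in a few lines of Python. The helper value_at and its dotted-path argument are illustrative assumptions, a simplification rather than a real JSONPath implementation.

```python
import json

def value_at(body, dotted_path: str):
    """Walk a dotted path such as "data.product.price".

    Returns None when any step is missing or null, so "absent" and
    "present but null" are both treated as failures.
    """
    current = body
    for key in dotted_path.split("."):
        if not isinstance(current, dict):
            return None
        current = current.get(key)
    return current

# The failure-mode-2 sample response from above: no errors array at all.
sample = """{
  "data": {
    "product": {
      "id": "prod_123",
      "name": "Standing Desk",
      "price": null
    }
  }
}"""

body = json.loads(sample)
price_ok = value_at(body, "data.product.price") is not None
print(price_ok)  # False
```

The comparison uses "is not None" rather than truthiness on purpose: a price of 0 or an empty string is still a resolved value, and only null should fail the check.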

3. Partial data, the page half-loads

This is partial success: data and errors in the same response. The fields that worked are in data, the ones that failed are null, and errors explains the gaps. The graphql-over-http spec treats this as normal, expected behavior, not an error condition for the transport.

{
  "data": {
    "order": { "id": "ord_55", "total": 4200 },
    "recommendations": null
  },
  "errors": [
    {
      "message": "Recommendation engine unavailable",
      "path": ["recommendations"],
      "extensions": { "code": "INTERNAL_SERVER_ERROR" }
    }
  ]
}

Half the page loads. The order details are there, the recommendations rail is empty. The response is 200 with a healthy-looking data object, and the only signal that something broke is the errors array nobody is reading.
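One way to see partial success from the monitoring side is to list which paths errored and which top-level fields came back null. This is a sketch under the assumption that each error entry carries a path array, which the GraphQL response format allows but does not guarantee. The function names are hypothetical.

```python
import json

def failed_paths(body: dict) -> list:
    """Return each error entry's path as a dotted string, e.g. "recommendations"."""
    paths = []
    for entry in body.get("errors") or []:
        path = entry.get("path") or []
        # Path segments can be field names or list indexes, so stringify each one.
        paths.append(".".join(str(part) for part in path))
    return paths

def null_top_level_fields(body: dict) -> list:
    """Return the names of top-level data fields whose value is null."""
    data = body.get("data") or {}
    return [name for name, value in data.items() if value is None]

# The failure-mode-3 sample response from above.
sample = """{
  "data": {
    "order": { "id": "ord_55", "total": 4200 },
    "recommendations": null
  },
  "errors": [
    {
      "message": "Recommendation engine unavailable",
      "path": ["recommendations"],
      "extensions": { "code": "INTERNAL_SERVER_ERROR" }
    }
  ]
}"""

body = json.loads(sample)
print(failed_paths(body))           # ['recommendations']
print(null_top_level_fields(body))  # ['recommendations']
```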

When it's actually a 4xx or 5xx (and when it isn't)

GraphQL does use real HTTP error codes, just not for the failures above. Knowing where the line falls keeps you from building a monitor on a false assumption. There are three broad cases where the status does carry the signal, and one important divergence between the spec and what servers actually do.

Parse and validation errors. If your query is malformed, or asks for a field that does not exist, that is a request error caught before execution. Here the spec and real servers part ways. The graphql-over-http spec recommends returning 200 even for request errors when the response uses the application/json media type. In practice Apollo Server returns 400 for parse and validation errors. So do not assume 400 is universal, and do not assume 200 is either. It depends on the server. For a monitor this is fine, because a malformed canary query is your bug to fix before you ship the monitor, not a production signal.

Invalid variables. One trap worth flagging: older Apollo Server 4 returned 200 when a variable failed coercion, which meant a bad-input failure hid behind a success status. Current Apollo fixes this with status400ForVariableCoercionErrors, which returns 400 and is the default in Apollo Server 5.

Transport and server crashes. If the process is down, the load balancer has no healthy backend, or an upstream gateway times out, you get a real 5xx (or a connection failure). This is the one case a status-only check reliably catches, and it is the minority of GraphQL outages.

The newer media type. The spec defines a second media type, application/graphql-response+json, which may use non-200 statuses for errors, and the draft even sketches a non-standard 294 "Partial Success" code. Treat that as emerging, not deployed. The spec is still at Draft stage, and most servers in the wild still answer with application/json and 200. Build for what your server actually sends today.

Net of all of this: the real GraphQL blind spot is the field error that resolves to 200 with a populated errors array. No status code will surface it. You have to read the body.
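The cases in this section can be summarized as a rough triage function. This is a sketch, not a spec-compliant classifier: the category names are invented for illustration, and a server that follows the spec's application/json recommendation would land its request errors in the "errors_in_body" bucket rather than "request_error".

```python
import json
from typing import Optional

def classify(status: Optional[int], body_text: str = "") -> str:
    """Triage one probe result. status is None when the connection itself failed."""
    if status is None or status >= 500:
        # Process down, no healthy backend, gateway timeout.
        return "transport_or_server_failure"
    if 400 <= status < 500:
        # Parse, validation, or variable errors on servers that use 4xx for them.
        return "request_error"
    try:
        body = json.loads(body_text)
    except json.JSONDecodeError:
        return "unreadable_body"
    if not isinstance(body, dict):
        return "unreadable_body"
    if body.get("errors"):
        # The blind spot: a success status with a populated errors array.
        return "errors_in_body"
    return "ok"

print(classify(503))                                                         # transport_or_server_failure
print(classify(200, '{"data": {"a": null}, "errors": [{"message": "x"}]}'))  # errors_in_body
print(classify(200, '{"data": {"a": 1}}'))                                   # ok
```

Only the first two branches are reachable from the status code alone. Everything after them needs the body.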

How to monitor a GraphQL API for the 200-that-lies (Velprove)

Velprove's free multi-step API monitor asserts on the response body, not just the status code. That is the whole game for GraphQL. The browser login monitor is the differentiator we lead with for sign-in flows, but the right tool here is the API monitor with a JSON-path assertion on the errors array. Here is the shape, in four steps, no config files.

Step 1. POST your GraphQL endpoint with a small canary query. Create an API monitor that sends a POST to your single GraphQL URL (something like /graphql) with a small, read-only query in the request body. Keep it cheap and stable. Ask for the one or two fields you most need to be alive. Run it with a dedicated low-privilege monitoring account, never real admin credentials.

Step 2. Assert the status code is 200. This is the baseline that catches the transport and crash failures from the section above. It is necessary and, on its own, nowhere near sufficient.

Step 3. Assert the JSON path $.errors[*] has no matches. This is the assertion that turns a status check into a real GraphQL health check. The [*] matches the entries inside the array, so it passes when errors is absent or an empty [], and fails the moment any error entry appears, even though the status is still 200. It catches failure modes 1 and 3 above.

Step 4. Assert $.data.<criticalField> is not null. Point this at the field your product genuinely depends on, for example $.data.currentUser.id or $.data.product.price. Use a not-null assertion, not a bare existence check. A field can be present and still null, which is exactly failure mode 2, where a critical field quietly goes null and the resolver swallowed the error so errors stays empty. Belt and suspenders: assert no error entries and a non-null value for the field you care about.

The GraphQL assertion set in Velprove's multi-step monitor builder: status 200, $.errors[*] Not Exists so any error entry fails the check, and $.data.product.price Not Equals null so a critical field quietly going null still pages you.

That three-assertion pattern is the entire GraphQL-specific part. The mechanism underneath it, how a monitor sends a request body, reads the JSON response, and runs JSON-path assertions, is the same engine you would use for any API. If you want to extend this into a token-then-query flow, or chain several queries, that is just chaining and JSON-path assertions in a multi-step API monitor, and that guide teaches the mechanism end to end. This post only adds the GraphQL assertion shape on top of it. All of this runs on the free plan, from 5 regions, with commercial use allowed.
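If it helps to see the logic outside any particular tool, the status, errors, and critical-field assertions can be sketched as one Python function. This is an illustration of the pattern, not Velprove's implementation. The function name and the dotted-path argument (a simplification of a JSON path, written without the leading "$.") are assumptions.

```python
import json

def check_graphql_response(status: int, body_text: str, critical_path: str) -> list:
    """Return a list of failure reasons. An empty list means the check passed."""
    failures = []

    # Assertion 1: the transport-level baseline.
    if status != 200:
        failures.append(f"status is {status}, expected 200")

    try:
        body = json.loads(body_text)
    except json.JSONDecodeError:
        return failures + ["body is not valid JSON"]
    if not isinstance(body, dict):
        return failures + ["body is not a JSON object"]

    # Assertion 2: the errors array has no entries (absent or [] both pass).
    errors = body.get("errors")
    if isinstance(errors, list) and errors:
        failures.append(f"errors array is not empty ({len(errors)})")

    # Assertion 3: the critical field is present and not null.
    current = body
    for key in critical_path.split("."):
        current = current.get(key) if isinstance(current, dict) else None
    if current is None:
        failures.append(f"{critical_path} is null or missing")

    return failures

swallowed = '{"data": {"product": {"id": "prod_123", "name": "Standing Desk", "price": null}}}'
print(check_graphql_response(200, swallowed, "data.product.price"))
# ['data.product.price is null or missing']
```

Returning every failure reason, rather than stopping at the first, makes the alert more useful: a response can trip the errors assertion and the not-null assertion at the same time.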

A GraphQL data probe is a different layer than a /healthz endpoint

It is tempting to think you already cover this because you have a /healthz endpoint. You do not. They are different layers and you want both.

A /healthz probe is an endpoint you deliberately build to report health. It returns 200 and a small body that says "I am up," usually after checking a database connection and a couple of dependencies. It is a self-report. The patterns for designing one are covered in our note on why a /healthz probe is a different layer.

A GraphQL 200-with-errors is the opposite situation. It is not a special health endpoint. It is your normal data endpoint, the one your app actually queries, telling you it is fine while a field underneath it is broken. A green /healthz can sit right next to a GraphQL query that returns null for the field that pays your bills. The health endpoint reports the service's opinion of itself. The canary query reports what a real client actually gets back. Monitor both.

Frequently asked questions

Why does my GraphQL API return 200 when there's an error?

When the request itself is well-formed and the server responds with the application/json media type, GraphQL signals field-level and resolver-level failures inside the response body, not in the HTTP status. The transport succeeded, so the status stays 200. The failure is reported as an entry in a top-level errors array, usually alongside a data object that holds null where the failed field should have been. The graphql-over-http spec describes this behavior, and most servers, including Apollo Server in its default configuration, follow it.

What's in the GraphQL errors array?

The errors array is a top-level field in a GraphQL response. Each entry is an object with a human-readable message, and usually a locations array pointing at the spot in the query that failed, a path array naming the response field that errored, and an extensions object that servers like Apollo use to carry a machine-readable code such as INTERNAL_SERVER_ERROR or UNAUTHENTICATED. When errors is present and non-empty, at least one part of your query did not resolve correctly, even though the HTTP status is 200.
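Because only message is reliably present, code that reads an error entry should treat the other keys as optional. A minimal sketch, with a hypothetical GraphQLError dataclass and made-up sample values:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GraphQLError:
    message: str
    path: list = field(default_factory=list)
    locations: list = field(default_factory=list)
    code: Optional[str] = None  # taken from extensions.code when the server sets it

def parse_error(entry: dict) -> GraphQLError:
    """Build a GraphQLError from one errors entry, tolerating missing optional keys."""
    extensions = entry.get("extensions") or {}
    return GraphQLError(
        message=entry.get("message", ""),
        path=entry.get("path") or [],
        locations=entry.get("locations") or [],
        code=extensions.get("code"),
    )

# Illustrative entry; the values are invented for this example.
entry = {
    "message": "You must be signed in",
    "locations": [{"line": 2, "column": 3}],
    "path": ["currentUser"],
    "extensions": {"code": "UNAUTHENTICATED"},
}

parsed = parse_error(entry)
print(f"[{parsed.code}] {'.'.join(str(p) for p in parsed.path)}: {parsed.message}")
# [UNAUTHENTICATED] currentUser: You must be signed in
```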

Can a GraphQL response have both data and errors?

Yes. This is called partial success, and it is normal GraphQL behavior. If one field's resolver throws while its siblings resolve fine, the server returns a data object containing the fields that worked plus null for the field that failed, and an errors array describing what went wrong. A status check sees 200 and a non-empty body and reports the API as healthy. The page is half-broken. This is the single most important reason to assert on the body, not the status.

Does GraphQL ever return a 4xx or 5xx?

Yes, for failures that happen before execution or in the transport. A malformed query or a request body that fails to parse is a request error, and while the graphql-over-http spec recommends 200 under application/json, Apollo Server actually returns 400 for parse and validation errors. Invalid variable values return 400 in current Apollo when status400ForVariableCoercionErrors is on, which is the default in Apollo Server 5. A crashed server or an upstream that times out returns a 5xx. The gap a status check cannot see is the field error that resolves to 200 with a populated errors array.

How do I alert on a GraphQL errors array if the status is 200?

Use a monitor that asserts on the response body, not just the HTTP status. POST a small canary query to your GraphQL endpoint, then add three assertions: status code equals 200, the JSON path $.errors[*] has no matches, and the JSON path $.data.<criticalField> is not null. If the errors array fills in or your critical field goes null, the monitor fails even though the status is still 200. Velprove's free multi-step API monitor asserts on the response body, not just the status code.

What query should I use to monitor a GraphQL endpoint?

Use a small read-only canary query that touches the field you care about most, run with a dedicated low-privilege monitoring account rather than real admin credentials. A good canary asks for one critical field and maybe one stable identifier, for example a viewer or health-style query that returns a known id. Keep it cheap so it does not put load on your resolvers, keep it stable so it does not break on unrelated schema changes, and assert that the one critical field comes back non-null and that the errors array has no entries.
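As a rough sketch, a canary request built with Python's standard library might look like this. The URL, the query, the schema it implies, and the bearer-token auth scheme are all assumptions for illustration. Substitute whatever your API actually uses.

```python
import json
import urllib.request

GRAPHQL_URL = "https://api.example.com/graphql"  # hypothetical endpoint
# Hypothetical read-only query: one stable identifier plus the one critical field.
CANARY_QUERY = 'query Canary { product(id: "prod_123") { id price } }'

def build_canary_request(url: str, query: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST carrying the canary query as JSON."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            # Assumed auth scheme; use a dedicated low-privilege monitoring token.
            "Authorization": f"Bearer {token}",
        },
    )

request = build_canary_request(GRAPHQL_URL, CANARY_QUERY, token="monitoring-token-placeholder")

# Sending is left commented out so the sketch has no network side effects:
# with urllib.request.urlopen(request, timeout=10) as response:
#     status, body_text = response.status, response.read().decode("utf-8")
```

Note that urllib.request.urlopen raises urllib.error.HTTPError on 4xx and 5xx responses, so a hand-rolled probe has to catch that exception to record the status instead of crashing.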

Set up a free GraphQL monitor with Velprove. POST a canary query, assert 200 AND no errors entries AND a non-null critical field, free, from 5 regions, commercial use allowed. The next time a resolver fails behind a 200 OK, you hear about it before your users do.
