
Monitor a Railway App: Sleep, Private Net, Cron Services

May 20, 2026 · 13 min read

TL;DR: To monitor a Railway app properly, you have to probe it from outside Railway. Railway's own healthcheck only runs at deploy time and is explicitly not for continuous monitoring. Railway's sleep timer is outbound-driven, so an external probe does not keep a service awake; services on *.railway.internal are unreachable from the public internet; and a cron service has no native did-not-fire alert. Four Velprove patterns close those gaps: a public HTTP monitor on the web service, a /deps probe for private services, a heartbeat probe for crons, and a browser login monitor on the real signed-in path. Free-tier Railway spin-down economics are a separate hosting decision covered in the indie-hacker free-stack guide. Every monitor probes from one of 5 global regions you pick, on the Velprove free plan. No credit card required.

Why Railway's native healthcheck is not your uptime monitor

Railway has a built-in healthcheck and it is not what most people assume. Railway's healthchecks reference reads: "The healthcheck endpoint is currently not used for continuous monitoring as it is only called at the start of the deployment, to ensure it is healthy prior to routing traffic to it." That single sentence is the entire reason this post exists. Railway itself is telling you the native healthcheck is a deploy-time gate, not a runtime alert.

The mechanics are straightforward. When a new deployment is triggered, Railway repeatedly queries the configured healthcheck endpoint until it receives an HTTP 200, then activates the deployment and starts routing traffic to it. The default timeout is 300 seconds. The requests use the hostname healthcheck.railway.app, which matters only if your app restricts incoming requests by hostname; in that case, add it to your allowed hosts. Once the new instance is live, Railway stops calling the endpoint. It does not come back later to confirm the instance is still answering.

Two consequences follow. First, an instance that passes its deploy-time check and then stops responding an hour later produces zero signal from the native healthcheck, because the native healthcheck is not watching anymore. Second, an endpoint that returns 200 with an empty body or a stale cached error page will satisfy the gate just as easily as a real, working endpoint will. Deploy-time gating and runtime monitoring are two different problems, and Railway covers the first one. The second one is an external probe's job. This post is about the platform surface a 200 OK on your public URL cannot see.

Service sleep is outbound-driven, not inbound-idle

This is the Railway fact people get wrong most often, so it comes early. Railway's app-sleeping docs state the rule verbatim: "For Railway to put a service to sleep, a service must not send outbound traffic for at least 10 minutes." And, also verbatim: "Inbound traffic is excluded from considering when to sleep a service." The sleep clock runs on outbound traffic only. It has nothing to do with how many requests arrive at your service.
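The rule can be written down as a toy model. This is a sketch of the documented behavior, not Railway's implementation; the function and its parameters are hypothetical. The point is what the function never reads:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold from Railway's app-sleeping docs: no outbound traffic for >= 10 minutes.
SLEEP_AFTER = timedelta(minutes=10)


def eligible_to_sleep(
    last_outbound: datetime,
    now: datetime,
    last_inbound: Optional[datetime] = None,
) -> bool:
    """Toy model of the documented rule; a hypothetical helper, not a Railway API.

    `last_inbound` is accepted only to make the point explicit: it is never
    read, because inbound traffic (monitor probes included) does not count.
    """
    return now - last_outbound >= SLEEP_AFTER


now = datetime(2026, 5, 20, 12, 0, tzinfo=timezone.utc)
# Probed 30 seconds ago, last outbound packet 11 minutes ago: still sleep-eligible.
print(eligible_to_sleep(now - timedelta(minutes=11), now,
                        last_inbound=now - timedelta(seconds=30)))  # True
```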

What counts as outbound is broader than most people expect. Telemetry pushed to a logging or APM service, database connection pool keepalives, NTP queries, requests to another service in the same project over the private network, and external API calls all count as outbound traffic that keeps the service awake. What does not count is anything arriving at your service, including a request from a customer's browser and a probe from an external monitor.

That last point is the one to internalize: a Velprove HTTP probe is inbound traffic from Railway's perspective, so it does not keep your Railway service awake. A probe will wake a slept service on the first hit, because Railway wakes a slept service on any inbound request from the internet or from another service in the same project over the private network. But it will not prevent the next sleep. Ten minutes after your service's last outbound packet, it sleeps again, regardless of how often the probe arrives. If you need the service awake, do that with outbound activity originating inside the service: a periodic outbound heartbeat to a logging or telemetry endpoint is the honest way. Treat "monitoring keeps my service warm" as a trap on Railway, because the mechanic runs in the opposite direction from inbound-idle platforms.
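If you decide the service should stay awake, that logic belongs on the outbound side. The following is a minimal sketch. It assumes a hypothetical telemetry endpoint (TELEMETRY_URL is a placeholder, not a real service) and a send interval comfortably under Railway's 10-minute threshold. Keep in mind that this gives up the cost saving that sleep exists to provide.

```python
import threading
import urllib.request
from typing import Callable

# Hypothetical endpoint; point this at the logging/APM service you already use.
TELEMETRY_URL = "https://telemetry.example.com/ping"
# Keep this well under Railway's 10-minute outbound-idle threshold.
INTERVAL_SECONDS = 5 * 60


def send_ping(url: str = TELEMETRY_URL) -> None:
    """Send one small outbound POST. Errors are swallowed so the loop keeps going."""
    try:
        req = urllib.request.Request(
            url,
            data=b"{}",
            method="POST",
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=5):
            pass
    except Exception:
        pass  # in real code, log the failure instead of ignoring it


def start_keepalive(
    send: Callable[[], None] = send_ping,
    interval: float = INTERVAL_SECONDS,
) -> threading.Event:
    """Call `send` every `interval` seconds on a daemon thread.

    Returns an Event; call .set() on it to stop the loop.
    """
    stop = threading.Event()

    def loop() -> None:
        # Event.wait returns True as soon as stop is set, which ends the loop.
        while not stop.wait(interval):
            send()

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

Call `start_keepalive()` once at process startup. Passing the sender in as a parameter keeps the timing logic separate from the network call, so you can swap in whichever client your telemetry vendor provides.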

Private services on `*.railway.internal` are invisible from outside

Railway exposes a private network between services in the same project. Railway's private-networking docs define the hostname pattern: "<service-name>.railway.internal. For example, a service named api would be reachable at api.railway.internal." The transport is an encrypted WireGuard mesh, which is why the docs treat plain HTTP over the mesh as acceptable: the tunnel itself is encrypted. Isolation is per-project and per-environment, so services in a different project or a different environment cannot resolve your railway.internal hostnames at all.

The load-bearing consequence for monitoring is structural. *.railway.internal is unreachable from the public internet by design. An external probe, including a Velprove monitor running from any of the 5 global regions, cannot resolve those names and cannot reach those ports. There is no flag to flip and no header to add. The private network is private. That is the whole feature.

The pattern that works is a public companion route. Pick one of the public web services in the project and add a route, conventionally /deps, that exercises the actual dependency call you care about and returns 200 only if the private service responded correctly. A Velprove HTTP monitor against /deps then gives you an external signal for a structurally internal service. Be clear about what this measures: /deps is a userland convention, not a Railway primitive. The probe is watching the public companion, and the companion is watching the private dependency. If the companion lies or stops being deployed, the probe lies too. Keep the route's implementation small and obvious, and put the same WireGuard-side call your real code uses behind it.
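Below is a minimal /deps sketch that uses only the Python standard library. It assumes a hypothetical private service named api that listens on port 8080 and exposes a /health route; all three are placeholders, so substitute your own. Calls over the private network go directly to the port the private service listens on rather than through Railway's public proxy, which is why the port is explicit.

```python
import json
import os
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Tuple

# Hypothetical private dependency; the port must be the one the service listens on.
PRIVATE_HEALTH_URL = os.environ.get(
    "PRIVATE_HEALTH_URL", "http://api.railway.internal:8080/health"
)


def private_service_ok(url: str = PRIVATE_HEALTH_URL, timeout: float = 3.0) -> bool:
    """True only if the private service answered 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:  # DNS failure, refused connection, timeout, non-2xx (HTTPError)
        return False


def deps_result(check: Callable[[], bool]) -> Tuple[int, bytes]:
    """Map the dependency check to the status code the external monitor asserts."""
    if check():
        return 200, json.dumps({"private_service": "ok"}).encode()
    return 503, json.dumps({"private_service": "unreachable"}).encode()


class DepsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/deps":
            self.send_error(404)
            return
        status, body = deps_result(private_service_ok)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve() -> None:
    """Start the public companion; Railway injects PORT for the public service."""
    port = int(os.environ.get("PORT", "8080"))
    HTTPServer(("0.0.0.0", port), DepsHandler).serve_forever()
```

Call `serve()` from the public service's start command, then point the Velprove HTTP monitor at https://<your-public-domain>/deps. The route is public, so keep its response body free of secrets and internal detail.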

Cron-as-service has no didn't-fire alert

Railway crons are not a separate service type; a cron is a setting on a normal service. Railway's cron-jobs reference describes the model: you define a 5-field crontab expression, evaluated in UTC, in a service's settings. On schedule, Railway invokes the service's start command, and the service is expected to do its work and exit. The minimum frequency is every 5 minutes. On Render, by contrast, a cron is its own separately billed service, and that difference shapes everything downstream (see the Render platform-layer guide for the contrast).

The concurrency rule is the one that bites silently. Railway's docs state, verbatim: "If a previous execution is still running when the next scheduled execution is due, Railway will skip the new cron job." That is the silent-skip failure mode. A cron that hangs once because a downstream API is slow can quietly suppress every subsequent run until you notice the data is stale. From outside Railway, a hung cron and a skipped cron look identical: nothing happened. Railway surfaces no email and no webhook for "the next run did not start."
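Detecting a silent cron is the heartbeat's job, which is covered next. Separately, you can stop a single hang from suppressing every later run by giving the cron a hard deadline that is shorter than its cadence. The process then always exits before the next scheduled start. The sketch below assumes a Unix container and a job that runs in the main thread, because SIGALRM works only under those conditions. `run_with_deadline` is a hypothetical helper, not a Railway feature.

```python
import signal
from typing import Callable


class JobTimedOut(BaseException):
    """Raised inside the job at the deadline. It subclasses BaseException so a
    job's own `except Exception` blocks cannot swallow it."""


def _raise_timeout(signum, frame) -> None:
    raise JobTimedOut()


def run_with_deadline(job: Callable[[], None], deadline_seconds: int) -> int:
    """Run `job` under a wall-clock deadline and return a process exit code.

    0 = success, 1 = job raised an error, 2 = deadline exceeded.
    """
    previous = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.alarm(deadline_seconds)
    try:
        job()
        return 0
    except JobTimedOut:
        return 2
    except Exception:
        return 1
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)
```

In the cron's entrypoint, a call such as `sys.exit(run_with_deadline(do_work, 240))` gives a 5-minute cron a 4-minute ceiling. A run that times out never writes its heartbeat, so the freshness endpoint still reports it.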

The pattern that works is a heartbeat. The cron writes a timestamp on real success into Postgres or a key-value store, after the work is durably done, not on entry. A small companion web service reads that timestamp, computes its age, and returns 503 when the cron has been quiet longer than its expected cadence plus a grace window, 200 otherwise. A Velprove HTTP monitor against /last-run/<job> asserts status_code = 200. The endpoint flips to 503 the moment the cron goes stale, and the monitor catches that within one probe interval.
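Here is a sketch of both halves. It assumes a table named cron_heartbeat (hypothetical) and uses sqlite3 as a stand-in for Postgres so it runs without a database server. `record_success` is what the cron calls after its work commits, and `last_run_status` is the status code the /last-run/<job> handler returns.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# sqlite3 stands in for Postgres; the upsert has the same shape there
# (use your driver's placeholder style and a timestamptz column).
SCHEMA = """
CREATE TABLE IF NOT EXISTS cron_heartbeat (
    job TEXT PRIMARY KEY,
    last_success TEXT NOT NULL  -- ISO-8601 UTC timestamp
)
"""


def record_success(conn: sqlite3.Connection, job: str, when: datetime) -> None:
    """Called by the cron AFTER its work is durably committed, never on entry."""
    conn.execute(
        "INSERT INTO cron_heartbeat (job, last_success) VALUES (?, ?) "
        "ON CONFLICT(job) DO UPDATE SET last_success = excluded.last_success",
        (job, when.astimezone(timezone.utc).isoformat()),
    )
    conn.commit()


def last_run_status(
    conn: sqlite3.Connection,
    job: str,
    now: datetime,
    cadence: timedelta,
    grace: timedelta,
) -> int:
    """Status for GET /last-run/<job>: 200 if fresh, 503 if stale or never run."""
    row = conn.execute(
        "SELECT last_success FROM cron_heartbeat WHERE job = ?", (job,)
    ).fetchone()
    if row is None:
        return 503  # no recorded success yet
    age = now - datetime.fromisoformat(row[0])
    return 200 if age <= cadence + grace else 503
```

Set `cadence` to the cron's schedule interval, and size `grace` to cover a normal run's duration plus scheduling jitter.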

Velprove does not receive passive heartbeats from your cron. The freshness logic lives on your /last-run/<job> endpoint, on your service, and Velprove asserts the status code from outside. A static body_contains assertion that looks for today's date does not work for this; the monitor stores whatever string you type once at setup, then keeps asserting that stale value forever. Let the endpoint compute freshness server-side, and let the status code carry the signal.

[Screenshot: Velprove HTTP monitor wizard, Success Conditions step, for a Railway cron's /last-run/<job> freshness endpoint, asserting Status Code Equals 200.] A heartbeat HTTP monitor against a Railway cron's /last-run/<job> endpoint. The endpoint returns 503 when the cron is stale and 200 otherwise, so a status_code = 200 assertion is the entire check.

Verify your deploy actually came up: the `/version` SHA pattern

Railway auto-deploys on every push to the connected branch by default. A green deploy in the Railway dashboard means the native healthcheck returned 200 once at activation. It does not mean the build that came up is the build you intended, and it does not mean the build is still serving correct responses now.

The cheap fix is a /version endpoint that returns the current git SHA, read from an environment variable Railway sets at build time. There are two ways to assert it with Velprove, and the right one depends on where you want the SHA comparison to live.

The recommended form is a multi-step API monitor: Step 1 hits /version and captures $.build_sha into a variable, and Step 2 calls a second route that compares its own runtime SHA against the captured value and returns non-2xx on mismatch. The comparison lives server-side in your app, the monitor only orchestrates, and the setup survives every future deploy unchanged. Multi-step API monitors are available on every plan, including free, where they are limited to 3 steps.
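The logic of the two routes can be sketched in Python, with the HTTP wiring left to whatever framework you use. The sketch assumes the SHA comes from RAILWAY_GIT_COMMIT_SHA, which Railway sets for Git-triggered deploys; confirm this against Railway's variables reference for your setup. The /version/verify route name is hypothetical, and the X-Expected-SHA header follows the variant this section mentions.

```python
import json
import os
from typing import Mapping, Optional, Tuple

# Assumption: Railway exposes the deployed commit as RAILWAY_GIT_COMMIT_SHA.
SHA_VAR = "RAILWAY_GIT_COMMIT_SHA"


def current_sha(env: Mapping[str, str] = os.environ) -> str:
    return env.get(SHA_VAR, "unknown")


def version_response(env: Mapping[str, str] = os.environ) -> Tuple[int, str]:
    """GET /version. Step 1 of the monitor captures $.build_sha from this body."""
    return 200, json.dumps({"build_sha": current_sha(env)})


def verify_response(
    expected: Optional[str], env: Mapping[str, str] = os.environ
) -> Tuple[int, str]:
    """Hypothetical GET /version/verify for Step 2.

    `expected` is the SHA captured in Step 1, sent back in an X-Expected-SHA
    header. Any non-2xx status fails the step and the monitor alerts.
    """
    actual = current_sha(env)
    if not expected:
        return 400, json.dumps({"error": "missing X-Expected-SHA header"})
    if actual == "unknown" or expected != actual:
        return 409, json.dumps({"expected": expected, "actual": actual})
    return 200, json.dumps({"build_sha": actual})
```

Treating a missing SHA variable as a failure is deliberate: a build that cannot report its own commit should not pass the check.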

The shorter setup is a plain HTTP monitor with body_contains set to the SHA your build just produced. It works for the current deploy and goes stale on the next, because the deployed app starts returning the new SHA while the monitor keeps asserting the old one. Use this form only when your CI/CD pipeline updates the assertion on every deploy via Velprove's PUT /api/checks/<id> API. When a deploy reports green but serves a stale or wrong build, either assertion fails and the monitor pages you. Velprove does not provide a native deploy-skew detector; your /version assertion is the detector.

[Screenshot: Velprove multi-step API monitor builder configured as a 2-step Railway chain: Step 1 GET to /version, Step 2 GET to /version with a JSON Path assertion on $.build_sha.] The basic 2-step capture pattern. The X-Expected-SHA variant that actually detects deploy skew is in the API health-check guide referenced below.

The full multi-step capture-and-assert flow, including the X-Expected-SHA header variant for capturing a value from one step and asserting it in the next, is already walked through in the multi-step build_sha pattern in the API health-check guide. If multi-step is new, the same flow framed for API teams is in the multi-step API monitoring walkthrough. This section is the Railway-specific framing on top of that pattern, not a re-derivation of it.

The four Railway monitors in Velprove (free plan)

Put the patterns above together and the Railway-side coverage lands in four concrete monitors. All four fit inside the Velprove free plan: 10 monitors total, a 5-minute HTTP interval, one browser login monitor at a 15-minute interval, multi-step API monitors up to 3 steps, email alerts, and 1 status page. Each monitor probes from one of 5 global regions you pick at setup time. If your Railway service runs Next.js, pair this set with how to monitor a Next.js app in production for the render-layer half.

(a) Public web-service HTTP probe. A plain HTTP monitor against your public Railway URL, or its public custom domain, asserting status_code = 200 and a body_contains rule on a static string that only your real app emits (a footer tagline, a known marker in the HTML). The body_contains rule keeps a cached gateway error page that happens to return 200 from passing. Set the interval to 5 minutes on free or 1 minute on a paid plan, and pick whichever of the 5 global regions is closest to your real customers.
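To run a local smoke check before setting up the monitor, you can reproduce the combined condition as shown below. This models what the two assertions do together; it is not Velprove's code. PUBLIC_URL and MARKER are placeholders.

```python
import urllib.error
import urllib.request
from typing import Tuple

# Hypothetical values: your public Railway domain and a marker only your real app renders.
PUBLIC_URL = "https://your-app.up.railway.app/"
MARKER = "Built by the Acme team"


def passes(status: int, body: str, marker: str = MARKER) -> bool:
    """Both must hold, mirroring status_code = 200 plus body_contains."""
    return status == 200 and marker in body


def fetch(url: str = PUBLIC_URL, timeout: float = 10.0) -> Tuple[int, str]:
    """GET the page and return (status, body), including for 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", errors="replace")
```

`passes(*fetch())` fails on a cached gateway page that returns 200 without your marker, which is exactly the case the body_contains rule exists for.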

(b) Private-service /deps probe. You cannot point a Velprove monitor at db.railway.internal or worker.railway.internal, because those names resolve only inside your project's WireGuard mesh. Expose a /deps route on a public service in the same project that calls the private dependency and returns 200 only on real success. Point a Velprove HTTP monitor at /deps, assert status_code = 200, and the private service becomes externally observable without giving it a public surface.

(c) Cron heartbeat probe. On the companion route that reports cron freshness, set up an HTTP monitor against /last-run/<job> asserting status_code = 200. The endpoint returns 503 when the cron has gone stale, so a 200 is the whole check. Match the probe interval to the cron cadence: a 5-minute cron is comfortable on a 5-minute probe; a daily cron is comfortable on a slower probe with a generous grace window. Once the endpoint flips to 503, the detection lag is bounded by your probe interval, not by Railway's 5-minute cron minimum.
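You can put a rough number on the worst case. Assume the last good run started on schedule, and define:

- C: the cron cadence.
- R: the duration of the last good run. The heartbeat is written after the work, so it lands R after that run's start.
- G: the grace window.
- P: the probe interval.
- t_due: the scheduled start of the first run that never records success, whether it hung, failed, or was skipped.

The endpoint serves 503 once the heartbeat's age exceeds C + G, which gives:

$$
t_{\text{last}} = t_{\text{due}} - C + R, \qquad
t_{503} = t_{\text{last}} + C + G = t_{\text{due}} + R + G, \qquad
t_{\text{detect}} - t_{\text{due}} \le R + G + P
$$

For a 5-minute cron that runs for 1 minute, with a 5-minute grace window and a 5-minute probe, that is at most 11 minutes from the missed start to the alert. Only P is set in Velprove; R and G are determined by your cron and your endpoint.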

(d) Browser login monitor on the signed-in path. The three monitors above prove the platform's pieces are alive. They do not prove a real user can sign in and see their data. The browser login monitor opens a real browser, signs in as a dedicated low-privilege test user, follows the post-login redirect, and asserts the landing page looks right. By default it verifies success by confirming the URL changed; that catches a login that fails outright but not a login that lands on an empty shell because the database read behind it silently failed. Under Customize detection, switch Success verification from the default URL-change to "Page contains text" or "Element is visible", and set it to a string or selector that only renders when a real database read returned data: a customer name, an invoice ID, a known plan label. This is the clearest case of when a browser monitor beats an HTTP probe. Use a dedicated test account, never real admin credentials. The free plan includes one browser login monitor at a 15-minute interval, which is enough to catch a multi-hour database-backed outage and a login regression inside one window.

[Screenshot: Velprove browser login monitor wizard with the Customize detection panel expanded and Success verification set to Page contains text with the value Logout.] Under Customize detection, switch Success verification from the default URL-change to Page contains text. The default catches a login that fails outright; a text-present check is what catches a login that lands on an empty shell.

[Screenshot: Velprove monitor list view on the free plan showing four Railway monitors, each assigned to a different global region.] The four-monitor Railway set on the Velprove free plan: a public HTTP probe, a /deps probe for a private-service dependency, a cron heartbeat probe, and a browser login monitor on the signed-in path.

No credit card required. The set lands on free and stays on free unless you want sub-5-minute intervals or more than one browser login monitor.

The honest probe-cost tradeoff on Railway

In this post's framing, probes cost request volume on your service, not dollars on your Railway bill. The math is easy. A 1-minute probe from a single region hits your endpoint once per minute: 1,440 requests per day, or roughly 43,200 per 30-day month at that single endpoint. A 5-minute cron-heartbeat probe from a single region is 288 requests per day, roughly 8,640 per month. Both numbers are small relative to any real traffic, but they are not zero, and they are the load you add by deciding to probe continuously.
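The same arithmetic can be written as a tiny helper, assuming a 30-day month and one region per monitor:

```python
from typing import Tuple

SECONDS_PER_DAY = 24 * 60 * 60


def probe_volume(interval_seconds: int, days_per_month: int = 30) -> Tuple[int, int]:
    """Requests per day and per month that one single-region monitor adds."""
    per_day = SECONDS_PER_DAY // interval_seconds
    return per_day, per_day * days_per_month


print(probe_volume(60))   # (1440, 43200): 1-minute probe
print(probe_volume(300))  # (288, 8640): 5-minute probe
```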

The sane default for Railway on the Velprove free plan is HTTP probes at 300-second (5-minute) intervals, which is what the free plan includes. That is enough to catch a multi-minute outage and small enough to stay invisible on any real Railway service's billing. If you need the 1-minute interval, you need it for the customer-facing paths where one minute of detection lag is one minute of silent revenue loss, not for the cron heartbeat that fires hourly anyway.

The same probe-cost discipline applies across the Platform sibling guides: Render, Vercel, and Cloudflare Workers and Pages carry the same four-pattern shape, with platform-specific plumbing under each pattern.

Frequently Asked Questions

Does Velprove keep my Railway service from sleeping?

No. Railway's sleep timer is outbound-driven. Per Railway's app-sleeping docs, a service goes to sleep when it has not sent outbound traffic for at least 10 minutes, and inbound traffic is explicitly excluded from that decision. A Velprove HTTP probe arrives at your service as inbound traffic, so it does not reset the sleep clock. It will wake a slept service on the first request, then sleep again 10 minutes after your service stops sending outbound traffic. If you need the service awake, do that with outbound activity inside the service, not with an external monitor.

How do I monitor a Railway service that only runs on `railway.internal`?

You cannot reach *.railway.internal from outside. The private network is a WireGuard mesh scoped to a single project and environment, structurally unreachable from the public internet. The working pattern is a public companion route, for example /deps on a public web service in the same project, that exercises the internal call and returns 200 only if the private service responded. A Velprove HTTP monitor against /deps, probing from one of 5 global regions you pick, then tells you when the private service stops answering.

How do I detect a Railway cron that did not fire?

Heartbeat pattern. The cron writes a timestamp on success into Postgres or a key-value store. A small companion web service reads the timestamp, computes its age, and returns 503 when the cron has gone stale, 200 otherwise. A Velprove HTTP monitor asserts status_code = 200 on that endpoint. Railway's cron docs state that if a previous execution is still running when the next scheduled run is due, Railway skips the new run, so a hung cron looks identical to a missing cron from outside. Velprove does not receive passive heartbeats, so the freshness logic lives on your endpoint and Velprove asserts the status code.

Does Railway alert me when a deploy serves the wrong build SHA?

No. Railway's native healthcheck only gates a deploy at activation time; it never checks what the deploy serves afterwards. Expose /version returning the git SHA from a build-time environment variable, then assert it with Velprove. The recommended form is a multi-step API monitor: Step 1 captures $.build_sha from /version into a variable, and Step 2 hits a second route that compares its own runtime SHA against the captured value and returns non-2xx on mismatch. The comparison lives in your app, the monitor only orchestrates, and the setup survives every deploy unchanged. The lighter alternative is a plain HTTP monitor with body_contains set to the current SHA, but body_contains goes stale on your next deploy unless your CI/CD updates it via Velprove's PUT /api/checks/<id> API. When a deploy reports green but serves a stale or wrong build, either assertion fails and the monitor pages you.

[Screenshot: Velprove HTTP monitor wizard, Success Conditions step, for a Railway /version endpoint, with two assertions in order: Status Code Equals 200, and Response Body Contains build_sha.] Two Success Conditions on a Railway /version monitor: a 200 status code and a body-contains assertion on the build SHA marker, which is what catches a deploy that came up but serves the wrong build.

Is Railway's native healthcheck enough for uptime monitoring?

No, and Railway says so. The healthchecks reference page states, verbatim: "The healthcheck endpoint is currently not used for continuous monitoring as it is only called at the start of the deployment, to ensure it is healthy prior to routing traffic to it." It is a deploy-time gate that lets a new instance start receiving traffic once it returns 200, not a runtime alert that fires when the instance later stops responding. Continuous uptime needs an external probe.

What is the cheapest way to monitor a Railway app?

The Velprove free plan. It covers 10 monitors total, a 5-minute HTTP interval, one browser login monitor at a 15-minute interval, multi-step API monitors up to 3 steps, email alerts, and 1 status page, with each monitor probing from one of 5 global regions you pick. That is enough to land a public HTTP monitor on your web service, a /deps monitor on a private dependency, a heartbeat monitor on a cron, and one browser login monitor on the signed-in path. Start with the free plan. No credit card required.