
Solo Founder Outage Playbook: Survive the 3 AM Call


The honest take: Your phone goes off at 3:14 AM. The monitor says login is failing in three regions. You are about to be the Incident Commander, the Communications Lead, and the Operations Lead at the same time, because there is no one else awake. Here is the playbook for what happens next. Most of it is deciding which role to be in which minute, not which command to run. The commands are the easy part.

You are wearing all three core ICS roles right now

The Google SRE Book's Managing Incidents chapter (Chapter 14) defines four roles for any production incident under the Incident Command System (ICS): Incident Command (IC), Operational Work, Communications, and Planning. For a solo founder, Planning collapses into IC because there is no multi-day coordination to schedule. The three you actively cycle through are IC, Ops, and Comms. The IC owns the incident, makes decisions, and holds the timeline. Comms talks to customers and stakeholders. Ops touches the production system. In a team of five SREs at 3 AM, the roles go to different humans on purpose, because one person trying to do all of them at once is the most reliable way to make a 10-minute outage into a 4-hour one.

You do not have three humans. You have you. The playbook adaptation for a solo founder is not to pretend the roles do not exist. It is to wear them in sequence, not in parallel. For the first 10 minutes you are the IC and only the IC. You do not push fixes. You do not write status updates. You confirm the outage, you open the incident document, and you decide what happens next. Then you switch to Comms for two minutes and post the first update. Only then do you become the Operations Lead and touch the system.

That sequence sounds slow. It is faster than the alternative, because the alternative is you, SSHed into production with no incident log, no customer message posted, and a half-formed theory of what is broken. Founders skip the IC role at 3 AM because it feels like overhead. It is the opposite. It is the part that keeps the next 60 minutes from spiraling.

What you should have set up before 3 AM

You cannot fix the prep gap mid-incident. If you are reading this mid-outage, skip to the next section. If you are reading this on a calm Tuesday, the following list is the one that pays out at 3 AM, and the one most founders postpone because none of it feels urgent until it is.

  • A monitor that actually catches the outage. Most founders run one HTTP check on the homepage, and most real outages do not flip that check. The four layers your monitoring should cover post walks through what to monitor, layered by depth. The short version: HTTP plus a browser login monitor plus a multi-step API monitor plus a public status page.
  • A phone alert you cannot sleep through. Email at 3 AM is not enough. Use a phone-ringing channel for severity-1 alerts: a dedicated ringtone, a hardware pager, or a paid escalation tool once you have a teammate. For solo founders, an email-to-SMS bridge plus a Do Not Disturb allow-list that lets it ring is the floor.
  • A written runbook with three commands. One command to roll back the last release. One to check the database is reachable. One to flip the marketing site to a static maintenance page. If you cannot run those three at 3 AM without thinking, write them down now; a sketch of what that file can look like follows this list.
  • A public status page URL customers know about. Linked from your marketing footer and from your support replies. If customers have to search for it during the incident, the page is too late.
  • A pre-written first-message template. Three sentences, with blanks for what is broken and when you will update next. Drafting prose at 3 AM with adrenaline is how founders post sentences they later regret.
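
A minimal sketch of what that runbook file can look like, assuming a git-push deploy, a Postgres database, and Heroku-style hosting purely for illustration; swap in whatever your own stack uses:

  RUNBOOK.txt (keep it in the repo root)

  1. Roll back the last release
     git revert --no-edit HEAD && git push origin main
     (assumes deploy-on-push; use your platform's one-step rollback if it has one)

  2. Check the database is reachable
     pg_isready -h <your-db-host> -p 5432
     (exits 0 when Postgres is accepting connections)

  3. Flip the marketing site to a static maintenance page
     heroku maintenance:on -a <your-app>
     (platform-specific; write down your host's equivalent)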

How you find out you're down (and why your monitor lies)

The first lie your monitor tells you is that it knows. The second is that what it knows matches what your customers are seeing. Both lies have specific shapes, and both have specific fixes.

The most common false-green is a 200 OK on the marketing page while the authenticated dashboard is broken. The 200 OK can be a lie post walks through 10 documented incidents where HTTP monitoring returned green while real users could not log in, including the Cloudflare Workers KV outage of June 12, 2025, which drove Cloudflare Access to a 100% failure rate on identity-based logins while marketing properties on the same network kept serving. If your only monitor is a homepage HTTP check, your dashboard could be down right now and your monitor would not know.

The second lie is regional. A monitor in one city sees one network path. A regional fiber cut, a CDN routing change, or a single-AZ failure can look like a full outage from one probe and look fine from another. The green-while-down problem post covers why single-region and low-frequency probes miss the real failure surface. Multi-region probes at an interval short enough to catch the incident inside its lifetime are the structural answer.

Velprove's browser login monitor is the layer that catches what HTTP cannot. It opens a real Chromium context, signs in as a known test user, and asserts on a post-login element. If identity is broken, if a captcha vendor is down, if a session token rotated, the run fails with a screenshot of where it died. Every plan includes a browser login monitor; the Free tier includes one browser login monitor at a 15-minute interval, and every plan probes from 5 global regions.
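
For a sense of what that layer does under the hood, here is a minimal sketch of the same idea using Playwright's Python API. This is not Velprove's implementation, and the URL, selectors, and credentials are placeholders:

  # Synthetic login check: real browser, real sign-in, assert on a post-login
  # element, screenshot on failure. Every value below is a placeholder.
  from playwright.sync_api import sync_playwright

  def check_login() -> bool:
      with sync_playwright() as p:
          browser = p.chromium.launch()
          page = browser.new_page()
          try:
              page.goto("https://app.example.com/login", timeout=15_000)
              page.fill("#email", "monitor@example.com")        # dedicated test user
              page.fill("#password", "a-dedicated-test-secret")
              page.click("button[type=submit]")
              # The assertion: an element that only exists after a successful login.
              page.wait_for_selector("[data-testid=dashboard]", timeout=15_000)
              return True
          except Exception:
              page.screenshot(path="login-failure.png")         # where it died
              return False
          finally:
              browser.close()

  if __name__ == "__main__":
      raise SystemExit(0 if check_login() else 1)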

The first 10 minutes: triage before you touch a thing

You wake up. The phone is buzzing. Resist the urge to push a hotfix. Your job for the next 10 minutes is to be Incident Commander first, Operations Lead second. The IC-first sequence is:

  1. Confirm from a second source. A monitor saying down is one source. Open a private browser window on your phone and try the affected flow yourself. If you can sign in, the monitor may be lying. If you cannot, the monitor is right. 30 seconds.
  2. Open the incident document. One text file. Write the wall-clock time. Write what the monitor said. Write what you just confirmed. This is the timeline that becomes the post-mortem at 9 AM; a minimal skeleton follows this list. 60 seconds.
  3. Form one hypothesis, write it down. If your last release was within the last hour, the hypothesis is "recent change." If the database CPU graph is spiking, the hypothesis is "database." If the monitor failed in one region only, the hypothesis is "regional." You are not committing to the hypothesis. You are writing it down so you can test it. 60 seconds.
  4. Decide stop-the-bleed. Stop-the-bleed is never the same as root cause. If recent change is the hypothesis, roll back. If the database is the hypothesis, fail over or scale. If a third party is the hypothesis, flip the affected feature to a degraded fallback. The goal of the next move is not to fix the bug. It is to stop the customer-facing damage while you keep investigating. 5 minutes.
  5. Switch to Comms, post the first update. You have not solved it. Post anyway. The next section is the template. 2 minutes.
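
A minimal skeleton for that single text file, with hypothetical times purely as an illustration:

  incident-YYYY-MM-DD.txt

  03:14  Monitor: browser login check failing in 3 of 5 regions
  03:15  Confirmed: cannot sign in from my phone, private window
  03:17  Hypothesis: recent change (last release went out at 03:02)
  03:19  Decision: roll back the last release, keep investigating
  03:21  Comms: first status page update posted, next update by 03:50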

That whole sequence is roughly 10 minutes. The temptation is to skip steps 2 and 3 and go straight to step 4. The reason experienced ICs do not skip is that step 3 is what saves you from fixing the wrong thing at minute 12.

What to say to customers in the first 10 minutes

The 5-minute first-comms rule is the operating standard across mature incident response: post something within 5 minutes of confirming the outage, even if you do not know the cause. Customers do not need root cause in the first message. They need acknowledgement, scope, and a next checkpoint.

A three-sentence template that works at 3 AM, with a paste-ready version after the list:

  1. What is broken from the customer's point of view. "We are seeing failed logins for some users." Not "auth-service is throwing 503s." The customer language version.
  2. That you are aware and investigating. "Our on-call has acknowledged and is investigating now." You are the on-call. The phrasing still works.
  3. When you will post the next update. "Next update within 30 minutes." Then keep that promise. A 30-minute checkpoint with no new info is "Still investigating. Next update by 04:30 UTC."
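
Assembled with blanks, so it is ready to paste at 3:20 AM:

  We are seeing [what is broken, in customer language] for some users.
  Our on-call has acknowledged and is investigating now.
  Next update by [time, in UTC].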

Post the update to your status page first, then to the channel your customers actually watch. For most early-stage SaaS that is one email blast to active users, not a tweet that nobody will see for six hours. Do not speculate on cause. Do not commit to a fix time. Post-now beats perfect every time, because the alternative is customers filing tickets that you have to answer one by one while you are also fixing the system.

Mistakes that turn a 10-minute outage into a 4-hour one

The 7 pre-incident mistakes post covers what to configure before the outage. This list is different. These are the mid-incident behaviors that take a contained outage and unwind it into a bad night.

  • Panic-pushing a fix without a rollback plan. At 3 AM with degraded judgment, the fix that looks obvious often is not. If you push a change, know exactly how you roll it back, and rehearse the rollback command before you push the fix.
  • Fixing the wrong thing because the symptom looks like the cause. A failed login can be auth, can be a database replica, can be a captcha vendor, can be a CDN. The hypothesis from step 3 of the first-10-minutes block exists so you test it, not so you commit to it. If the rollback did not fix it, the hypothesis was wrong. Form a new one.
  • Going dark on comms because you are deep in the fix. Customers do not see the SSH session. They see the silence. If you committed to a 30-minute update, post it on time even if the only new information is "still investigating." The silence is what generates the angry support emails, not the downtime.
  • Staying up past judgment. A founder at minute 90 of an outage at 3 AM is making worse decisions than the same founder asleep. If stop-the-bleed is in place and customer impact is contained, sleep. Root cause can wait for daylight.
  • No pre-written templates. If you are drafting prose mid-incident, you are spending IC and Comms cycles on wordsmithing instead of triage. The template lives in your repo before the outage, not in your head during it.

The post-mortem you write at 9 AM (not the one in the SRE book)

The Google SRE Book talks about a "living incident document" updated in real time by the IC. That works when the IC is a different person from the Ops Lead. For a solo founder, the document gets written during triage (the incident log from step 2) and finished at 9 AM after sleep and coffee.

A solo-scaled post-mortem has five fields. No more. A skeleton of the file follows the list.

  1. Timeline. Wall-clock times. When the monitor flipped, when you confirmed, when you posted comms, when stop-the-bleed landed, when full recovery was verified. Lift this from the incident log you opened in step 2.
  2. Customer impact. Who saw what, for how long. "Roughly 40 users could not sign in for 22 minutes." If you do not know the number, write "unknown, follow-up in action items." A missing number is a learning, not a failure.
  3. Root cause, in one paragraph. Plain language. The five-whys version is fine if it fits. Do not write a 20-page narrative. The next-you will not read it.
  4. What worked, what did not. If the rollback worked, name it. If the monitor did not catch the failure for 8 minutes, name that. If the first comms post went out in 4 minutes, name that. Specific.
  5. One action item with a date. One. Not five. The post-mortem with five action items becomes the post-mortem with zero action items completed, because solo founders do not have the bandwidth to land five concurrent fixes. Pick the one that prevents this exact outage from happening again, put a date on it, and put it in your backlog.
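
Assembled into a file, the whole thing can be this short; every value below is a placeholder:

  postmortem-YYYY-MM-DD.txt

  Timeline:        03:14 alert, 03:15 confirmed, 03:21 first comms,
                   03:24 rollback landed, 03:38 recovery verified
  Customer impact: roughly 40 users could not sign in for 22 minutes
  Root cause:      one plain-language paragraph
  Worked / not:    rollback worked; monitor took 8 minutes to flip
  Action item:     one fix that prevents this exact outage, with a date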

One question that sometimes comes up at the post-mortem stage: do you owe customers a credit? For SaaS without a formal SLA, the answer is usually no, but the goodwill question is real. The SLA vs SLO vs SLI customer guide covers what a credit actually obligates you to and what a discretionary credit signals.

What an outage actually costs you at 5 customers

The headline numbers from enterprise outage cost surveys do not apply at 5 customers. The real cost of downtime for small businesses post covers the math at small scale, where the cost is not revenue per minute but trust per incident. A 30-minute outage on a $50 monthly plan is roughly three cents of pro-rated revenue if you credit it minute for minute. The actual cost is one customer deciding they have seen enough and starting to evaluate your competitor. That cost is unrecoverable and does not appear on any spreadsheet.

The practical implication: at low customer count, the incident response that pays out the most is the first comms message, not the technical recovery speed. A 45-minute outage with proactive updates every 15 minutes is a survivable story. A 12-minute outage that you never acknowledged is the one that loses the customer.

Tools a solo founder actually needs (and what you can skip)

The tools market for incident response is shaped for teams of 20 with on-call rotations. Most of it is not what a solo founder needs. Here is the honest floor.

What you actually need. A monitor that catches real outages from outside your network, a phone alert that wakes you up, a public status page, and a written runbook. Velprove's Free plan covers the first three at $0 with no credit card required: 10 monitors including a browser login monitor at 15-minute intervals, HTTP and API monitoring at 5-minute intervals, 5 global regions, email alerts, multi-step API monitors with up to 3 steps, 1 status page, and 30-day incident history. The browser login monitor is the layer that catches the silent outages an HTTP check will miss.
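
To make the multi-step part concrete, here is a rough sketch of what a 3-step API check does, written with Python's requests library. The endpoints and field names are placeholders, not Velprove's API:

  # A 3-step API check: sign in, reuse the returned token, assert on the body.
  # Every URL and field name below is a placeholder.
  import requests

  BASE = "https://api.example.com"

  def check_api() -> bool:
      try:
          # Step 1: authenticate as a dedicated test user.
          r = requests.post(f"{BASE}/auth/login",
                            json={"email": "monitor@example.com",
                                  "password": "a-dedicated-test-secret"},
                            timeout=10)
          r.raise_for_status()
          token = r.json()["token"]

          # Step 2: call an authenticated endpoint with the token from step 1.
          r = requests.get(f"{BASE}/v1/projects",
                           headers={"Authorization": f"Bearer {token}"},
                           timeout=10)
          r.raise_for_status()

          # Step 3: assert on the response body, not just the status code.
          return isinstance(r.json(), list)
      except (requests.RequestException, KeyError, ValueError):
          return False

  if __name__ == "__main__":
      raise SystemExit(0 if check_api() else 1)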

What you can skip at single-digit customer count. PagerDuty at $21 per user per month (billed annually), Atlassian Statuspage at $29 per month, Better Stack's on-call add-on at $29 per responder per month (billed annually). None of these are bad products. They are scoped for teams that have multiple humans to coordinate. Email alerts on a phone allow-list cover the same ground for one person.

When to upgrade. Velprove Starter at $19 per month adds Slack, Discord, Microsoft Teams, and webhook alerts at 1-minute intervals, 3 browser login monitors at 10-minute intervals, and 90-day dashboard incident history. Pro at $49 per month adds PagerDuty integration, 30-second HTTP intervals, 10 browser login monitors at 5-minute intervals, and 1-year incident history. The trigger to upgrade is the first time you miss a real outage because the interval was too slow or the channel was too quiet, not the first time someone tells you to.

If you are weighing the broader tool landscape rather than the Velprove plan ladder specifically, the uptime monitoring tool comparison for 2026 covers the major options side by side.

Frequently Asked Questions

What should a solo founder do first when their website goes down at 3 AM?

Be Incident Commander first, not Operations Lead. Open a single text file, write the timestamp, write what your monitor said, write what you confirmed from a second source, and do not touch production for the first 90 seconds. The most expensive solo-founder outage move is panic-pushing a hotfix into a system you have not finished diagnosing. Confirm the outage from an outside source, declare the incident to yourself in writing, then triage. Stop-the-bleed before root cause.

How do I write a status update during an outage when I don't know the cause yet?

Post a three-sentence update inside 10 minutes. Sentence one: what is broken from the customer's point of view. Sentence two: that you are aware and investigating. Sentence three: when you will post the next update. Do not speculate on cause. Do not promise a fix time. The 5-minute first-comms rule beats the perfect post-mortem update every time, because customers tolerate downtime and do not tolerate silence.

Should I roll back my last deploy or try to debug forward during an outage?

Roll back. If the outage started within an hour of your last release, the rollback hypothesis is correct often enough that it is the default move. Debugging forward at 3 AM with degraded judgment, no second pair of eyes, and a clock running against your customers is the wrong shape of work. Get the system to the last known good state, then debug at 9 AM when you have coffee and daylight.

Do I need a paid incident management tool like PagerDuty as a solo founder?

No, not at single-digit customer count. PagerDuty's value is on-call rotation, escalation policies, and team coordination. A solo founder has no rotation to schedule and no team to coordinate. Email plus a phone alert on a monitor that actually catches real outages is enough. The Velprove free plan includes a browser login monitor, 10 monitors from 5 regions, and email alerts at no charge. Add PagerDuty if and when you have a teammate to escalate to.

How long should I stay up during an outage before sleeping and escalating in the morning?

If you have rolled back to a known good state and customer impact is contained, sleep. A founder past the 90-minute mark at 3 AM is making worse decisions than a founder asleep. If you are still in active impact at 5 AM and the rollback did not work, the call is to post a final status update telling customers you are pausing until 8 AM, accept the downtime, and sleep. A 4-hour outage you fix with judgment beats an 8-hour outage you make worse with panic.

What goes in a post-mortem when I'm the only person on the team?

Five fields. Timeline with timestamps, customer impact (who saw what, for how long), root cause in one paragraph, what worked and what did not, and one action item with a date attached. Skip the blameless-culture section because there is no one to blame other than yourself, and skip the leadership-review section because there is no leadership. Save the document in your repo. The next outage will rhyme with this one, and the file is the only thing standing between you and repeating it.


The 3 AM call is going to happen. The question is which of the three core ICS roles you start in, what you have written down before the alert fires, and whether your monitor catches the outage at all. Start a free Velprove account. One browser login monitor, 10 monitors total, 5 global regions, email alerts, status page. No credit card required. The setup is five minutes. The next outage will be on its own schedule.

Start monitoring for free

Free browser login monitors. Multi-step API chains. No credit card required.

Start for free