Most runbooks are useless. Either they're too abstract ('check the logs') or they're a 40-page Confluence doc that nobody reads at 3 AM.
Here is the template I use. It fits on one page and works.
The template
1. Trigger. The exact alert name and what it means.
2. Impact. Who is affected? What are they seeing? Is this user-facing?
3. First 5 minutes. The single most useful command to run. One. Not five.
4. Common causes. The 3 things that most often cause this alert, in order of likelihood.
5. Fix per cause. For each common cause, the exact fix. Copy-paste-ready.
6. Escalation. Who to page if none of the above works. Include their timezone.
7. Post-incident. What to update after the incident is done (ticket, dashboard, doc).
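To make the seven sections concrete, here is a minimal sketch of the template as a Python data structure with a small self-check. The class names, the `problems` helper, the `HighApiLatency` alert, the `api` namespace, and the escalation contact are all hypothetical, invented for illustration; the check simply encodes the rules above (one first command, at most three causes, a fix for every cause).

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    description: str  # what most often goes wrong
    fix: str          # exact, copy-paste-ready fix


@dataclass
class Runbook:
    trigger: str         # 1. exact alert name and what it means
    impact: str          # 2. who is affected and what they see
    first_command: str   # 3. the one command for the first 5 minutes
    causes: list[Cause]  # 4 and 5. most likely first, each with its fix
    escalation: str      # 6. who to page, including timezone
    post_incident: list[str] = field(default_factory=list)  # 7. what to update

    def problems(self) -> list[str]:
        """Return the ways this runbook breaks the one-page template."""
        found = []
        if not self.first_command.strip() or "\n" in self.first_command.strip():
            found.append("first_command must be exactly one command")
        if not 1 <= len(self.causes) <= 3:
            found.append("list between 1 and 3 common causes")
        for cause in self.causes:
            if not cause.fix.strip():
                found.append(f"cause {cause.description!r} has no fix")
        if not self.escalation.strip():
            found.append("escalation contact is missing")
        return found


# Hypothetical example: alert name, namespace, and contact are made up.
example = Runbook(
    trigger="HighApiLatency: p99 latency above 2s for 5 minutes",
    impact="User-facing. Customers see slow page loads and timeouts.",
    first_command="kubectl top pods -n api",
    causes=[
        Cause("Bad deploy", "kubectl rollout undo deployment/api -n api"),
        Cause("Pods stuck or leaking memory",
              "kubectl rollout restart deployment/api -n api"),
    ],
    escalation="Platform on-call lead (UTC+1)",
    post_incident=["ticket", "dashboard", "this runbook"],
)

print(example.problems())
```

An empty list from `problems()` means the runbook fits the template. Keeping runbooks as data like this is only one option; the same seven sections work just as well as a one-page document.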
Why this works
At 3 AM, your brain is running at 60%. You need a runbook that gives you the next action in under 30 seconds. A 40-page doc makes you think. A one-pager tells you what to do.
Start with your noisiest alert. Write the runbook. Test it on a new team member. If they can follow it without you, it works.
Repeat for your top 10 alerts. That's 90% of your on-call load handled.
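To find your noisiest alerts and check how much of your load the top ones actually cover, a short script over your page history is enough. This is a sketch that assumes you can export one alert name per page received; the `top_alerts` function and the sample alert names are hypothetical.

```python
from collections import Counter


def top_alerts(alert_names, n=10):
    """Return the n most frequent alerts and the share of all pages they cover."""
    counts = Counter(alert_names)
    total = sum(counts.values())
    top = counts.most_common(n)  # sorted by count, highest first
    covered = sum(count for _, count in top)
    share = covered / total if total else 0.0
    return top, share


# Hypothetical page history: one entry per page received.
history = ["DiskFull"] * 5 + ["HighApiLatency"] * 3 + ["CertExpiring"] * 2

top, share = top_alerts(history, n=2)
for name, count in top:
    print(f"{name}: {count} pages")
print(f"Top {len(top)} alerts cover {share:.0%} of pages")
```

Write the first runbook for whichever alert comes out on top, then work down the list.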
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
United States