
Incident Response for Small Teams: On-Call, Runbooks, and Postmortems

·1420 words·7 mins·
Author
Maksim P.
DevOps Engineer / SRE

TL;DR

  • You do not need a dedicated SRE team to have functional incident response. You need a rotation, a runbook, and a postmortem process.
  • On-call without structure burns people out. A simple weekly rotation and clear escalation path fixes most of it.
  • Runbooks do not need to be comprehensive documentation. They need to be useful at 2 AM.
  • Blameless postmortems are not a culture exercise—they are the cheapest way to stop the same incident from happening twice.
  • The goal is boring. Boring means the system is working.

Who this is for

This guide is for engineering teams of three to ten people who ship software, own their infrastructure, and get paged when things break—without a dedicated reliability engineer in sight. If you are a startup where “incident response” currently means a frantic Slack message to whoever is awake, this is for you.


What incident response actually needs to cover

“Incident response” sounds like something that needs a war room, a runbook binder, and a team of specialists rotating through shifts. In practice, for a small team, it needs to cover four things:

Detection. You find out about the problem before your users do, or at least not much after. This means alerts. Not “alert on everything” alerts—useful alerts tied to things that actually matter to users.
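Alerting on a user-visible symptom rather than an internal metric can be reduced to a small rule: page on error rate over a window, and only when there is enough traffic for the rate to mean something. A minimal sketch in Python — the 2% threshold, the 100-request floor, and the function name are illustrative assumptions, not recommendations from this article:

```python
# Sketch: alert on a user-facing symptom (error rate), not internal noise.
# The threshold and minimum-traffic floor are illustrative values.

def should_alert(requests: int, errors: int, threshold: float = 0.02) -> bool:
    """Fire only when there is enough traffic to make the rate meaningful."""
    if requests < 100:  # too little traffic to judge an error *rate*
        return False
    return errors / requests > threshold

# A healthy window (1000 requests, 5 errors) stays quiet;
# a broken checkout (1000 requests, 80 errors) pages someone.
```

The same idea is what tools like Prometheus express declaratively; the point is that the alert condition is phrased in terms of what users experience.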

Communication. Someone is accountable for the incident. Everyone else knows who that is. There is a shared place where status gets updated, even if it is just a Slack thread.

Resolution. The on-call engineer can act without needing to wake up three other people to get context or credentials. This is where runbooks earn their keep.

Learning. After the incident, you write down what happened and why. Then you fix the underlying cause, not just the symptom.

That is the whole process. You do not need a framework. You do not need a vendor. You need these four things to work reliably.


On-call without burning out your team

The fastest way to burn out a small engineering team is an unstructured on-call rotation where the same two people always get paged because they are the ones who know how things work.

The fix is boring but effective: a formal weekly rotation, even on a team of four.

Rotation basics. Everyone on the team who touches production is in the rotation. One week on, three weeks off (on a team of four). Use PagerDuty, Opsgenie, or even a shared Google Calendar—whatever your team will actually maintain. The point is that “who is on call this week” has a single, unambiguous answer that anyone can look up.
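The "single, unambiguous answer" property is easy to get with a deterministic lookup, whatever tool displays it. A sketch, assuming a hypothetical team list and a fixed epoch date that anchors the rotation:

```python
from datetime import date

def on_call(team: list[str], today: date, epoch: date = date(2024, 1, 1)) -> str:
    """Deterministic weekly rotation: same inputs always give the same answer."""
    weeks_elapsed = (today - epoch).days // 7
    return team[weeks_elapsed % len(team)]

# Illustrative team of four: one week on, three weeks off.
team = ["alice", "bob", "carol", "dan"]
```

Because the answer is a pure function of the date, "who is on call this week" can be looked up by anyone, from anywhere, with no shared mutable state to drift out of date.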

Escalation. Define a secondary on-call. If the primary does not acknowledge an alert within fifteen minutes, the secondary gets paged. If both are unreachable, it goes to the engineering lead. Write this down. “I’ll just text someone” is not an escalation policy.
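Written down, the escalation path above is just a lookup on elapsed time. A sketch in Python — the 30-minute threshold for paging the lead is an assumption, since the text only specifies the 15-minute window for the primary:

```python
def who_to_page(minutes_since_alert: int, acked: bool) -> str:
    """Escalation path: primary -> secondary -> engineering lead.
    The 15-minute primary window is from the policy; the 30-minute
    lead threshold is an assumed value for illustration."""
    if acked:
        return "primary"  # primary owns the incident once acknowledged
    if minutes_since_alert < 15:
        return "primary"
    if minutes_since_alert < 30:
        return "secondary"
    return "engineering-lead"
```

The value of writing it as a rule, rather than "I'll just text someone," is that both the paging tool and the humans agree on what happens next.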

Alert hygiene. An on-call rotation is useless if the person on call gets thirty alerts per night, most of which are noise. Before you implement rotation, spend an afternoon auditing your alerts. If an alert has fired more than five times in the last month without resulting in action, it is either misconfigured or not worth alerting on. Delete it or fix it.
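The audit described above can be scripted against an export of alert firings. A sketch, assuming you can dump a month of alert names and a set of alerts that actually led to action — both inputs are hypothetical:

```python
from collections import Counter

def noisy_alerts(firings: list[str], actioned: set[str], limit: int = 5) -> list[str]:
    """Alerts that fired more than `limit` times in the audit window
    without anyone acting on them: delete or fix each one."""
    counts = Counter(firings)
    return sorted(name for name, n in counts.items()
                  if n > limit and name not in actioned)
```

For example, an alert that fired six times with no resulting action shows up in the list; one that fired eight times but was actioned does not.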

Compensate for the burden. On-call is work, and it disrupts sleep. If someone gets paged three times overnight, they should not be expected at standup at 9 AM. This is not a policy you need to write down—just be a reasonable manager.


Runbooks: what to write and what to skip

A runbook is not documentation. Documentation explains how a system works. A runbook tells the on-call engineer what to do when a specific thing breaks, in the order they should do it, with enough context to act without guessing.

Write a runbook for every alert that fires more than once. Skip writing runbooks for hypothetical scenarios that have never happened. You will never maintain them and they will be wrong when you need them.

A good runbook has five things: what the alert means, immediate impact to users, first steps to diagnose, steps to remediate, and who to escalate to if those steps do not work.

Here is a minimal template:

```markdown
# Runbook: [Alert Name]

## What this means
[One sentence: what is broken and why this alert fires.]

## User impact
[What users are experiencing right now. "Users cannot log in." "Checkout is failing." etc.]

## Diagnosis steps
1. Check [dashboard/log query/metric] for [what to look for]
2. Check [second thing]
3. Determine if this is [X] or [Y] by [how to tell]

## Remediation
### If X:
- Run: `[command]`
- Expected output: `[what success looks like]`

### If Y:
- Restart: `[service]`
- Verify: `[how to confirm it worked]`

## Escalate to
- [Name / team] if [condition]
- [Name / team] if [other condition]

## Last updated
[Date] by [who]
```

Keep runbooks in the same repository as your infrastructure code. If they live in a wiki that nobody checks in, they will rot. If they live next to the thing they describe, there is a chance someone will update them when the system changes.
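Keeping runbooks next to the code also makes them lintable: a small check can fail CI when an alert pages someone but has no runbook. A sketch, assuming the hypothetical convention that runbooks live at `runbooks/<alert-name>.md` in the repo:

```python
from pathlib import Path

def missing_runbooks(alerts: list[str], runbook_dir: Path) -> list[str]:
    """Alerts that page a human but have no runbook next to the infra code.
    Assumes one file per alert at <runbook_dir>/<alert-name>.md — a
    convention for this sketch, not a standard."""
    return sorted(a for a in alerts if not (runbook_dir / f"{a}.md").exists())
```

Wired into CI, this turns "write a runbook for every alert that fires more than once" from a good intention into a failing build.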


Postmortems that don’t feel like blame sessions

The goal of a postmortem is not to find out who broke it. It is to understand why the system allowed it to be broken, and what changes will make it less likely to happen again.

This is not a philosophical stance. It is practical. If your postmortems turn into blame sessions, engineers will start hiding incidents, routing around the process, and fixing things quietly without documentation. That is a worse outcome than the original incident.

Blameless does not mean no accountability. It means the focus is on system and process failures, not individual failures. “The deploy script did not validate the config before applying” is a system failure. “Alex forgot to check the config” is not a useful finding—it will happen again because humans forget things.

Run the postmortem within 48 hours while details are fresh. Keep it short. A good postmortem document does not need to be longer than one page.

```markdown
# Postmortem: [Incident Title]

**Date:** [YYYY-MM-DD]
**Duration:** [Start time] to [End time] ([X hours Y minutes])
**Severity:** [P1 / P2 / P3]
**On-call:** [Name]

## Summary
[Two or three sentences: what broke, what the impact was, how it was resolved.]

## Timeline
- HH:MM — Alert fired / issue detected
- HH:MM — On-call acknowledged
- HH:MM — Root cause identified
- HH:MM — Mitigation applied
- HH:MM — Incident resolved

## Root cause
[What actually caused the incident. Be specific. "Database ran out of connections because the connection pool was not configured with a max limit."]

## Contributing factors
- [Factor 1]
- [Factor 2]

## What went well
- [Thing that worked as intended]

## Action items
| Action | Owner | Due |
|--------|-------|-----|
| [Specific fix] | [Name] | [Date] |
| [Add alert for X] | [Name] | [Date] |
```
The action items are the point. If a postmortem ends without concrete follow-up, it was a meeting, not a process.


The minimal tooling setup

You do not need much. Here is what you actually need:

Alerting and on-call routing. PagerDuty and Opsgenie are the standards. Both have free tiers that are adequate for a small team. If you want something simpler, Betterstack (formerly Better Uptime) covers alerting, on-call, and status pages in one tool.

Incident communication. A dedicated Slack channel (#incidents or #outages) with a convention for opening and closing incidents. Write a two-line Slack workflow that posts a template when someone types /incident. That is enough.
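The template such a workflow posts can be generated by anything that can send JSON to a Slack incoming webhook. A sketch of the payload builder — the emoji, the field layout, and the `/incident resolve` convention are illustrative assumptions, not Slack requirements:

```python
import json

def incident_message(title: str, severity: str, commander: str) -> str:
    """JSON payload for a Slack incoming webhook announcing an incident.
    Slack's webhook API accepts a top-level "text" field; everything
    inside it here is this sketch's own convention."""
    return json.dumps({
        "text": (
            f":rotating_light: *INCIDENT OPEN* — {title}\n"
            f"Severity: {severity} | Incident commander: {commander}\n"
            f"Updates in this thread. Close with `/incident resolve`."
        )
    })
```

POSTing that string to the channel's webhook URL (with `Content-Type: application/json`) is the entire integration; the convention, not the tooling, is what makes it work.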

Status page. If you have customers who care about uptime, they need somewhere to check that is not your Slack. Betterstack, Statuspage.io, or even a static page updated manually works. Pick one and maintain it.

Runbook storage. A directory in your infrastructure repo, a Notion database, or a Confluence space—it does not matter, as long as it is one place and the on-call engineer knows where to find it.

Postmortem storage. Same principle. A folder in your repo, a Notion template, a Google Doc. One place. Every incident gets a document.

That is the full stack. Resist the urge to buy a purpose-built incident management platform until you have outgrown this setup. You probably have not.

