playbook - WIP

Oh maybe there was a typo right here... Wait when was the last time this was updated? Wait.. how much of S3 is down?... Can someone update the status... no? Hmmmmm

Credit to @bobtfish and @mbaxa

The Theoretical Safety Net

The runbook exists because someone had the right idea at some point. Incidents are stressful. Stressed people make bad decisions. A written procedure removes the cognitive load of figuring out what to do next when you cannot think straight. This is a good idea. The runbook is a good idea in theory.

In practice, the runbook is a historical artifact. It documents the system as it existed when the runbook was written, which was during or shortly after a previous incident, which means the author was also stressed, which means some of the steps are documented with the specificity of someone who wanted to get back to sleep. The flag that the runbook tells you to pass was deprecated in version 3. The service it tells you to restart was renamed when the team did the Great Kubernetes Migration. The dashboard link goes to a URL that now returns a 404 because someone reorganized Grafana last quarter.

You discover all of this at 3am, sequentially, one step at a time.

The runbook is not wrong. It is correct for a system that used to exist. The gap between that system and the current one is undocumented.

— What the postmortem will say, diplomatically

The Action Item That Does Not Close

Every postmortem from every incident where the runbook was wrong produces the same action item: update the runbook. This action item enters the backlog. It is labeled medium priority because the incident is resolved and the urgency has dissipated. It competes with feature work and tech debt and the other medium-priority items from the last three postmortems.

The person who knows what the runbook should say is the person who just resolved the incident. That person is now asleep. When they wake up, they have a full queue of other work. The runbook update is on the list but it is not at the top of the list. This is not negligence. It is the rational response to a backlog that is longer than any sprint can clear.

The next incident finds the runbook in exactly the same state. The inner monologue — wait, when was this last updated — plays out again. Another action item is written. The cycle is self-sustaining and nobody designed it that way.

I'm really sure I copy pasted all the commands exactly as written

The Theoretical Safety Net

The Action Item That Does Not Close