Untested runbooks are confidence theatre

The day the primary database failed, the team had a disaster recovery plan. Three pages of it, neatly formatted, approved by the auditor; nobody had ever opened the file before that morning.

It took six hours to bring the system back, and almost none of that time was spent on the technical recovery itself. The hours went into figuring out who had which credential, which one of the documented procedures still matched reality, and which of the two people listed as on-call was actually reachable.

So here is the gap I keep finding. A plan that has never been tested is a document, not a plan. The question is not “do we have a plan” but “when was the last time we ran it end to end,” and that second question is what separates resilient teams from teams that have a binder.

Plans like this get written when someone asks for compliance evidence, and they get written quickly because the deadline was the auditor’s, not the operator’s. They read well, because that was the goal; they execute badly, because that was never the goal.

What good actually looks like is much more modest. It is a clear acceptable downtime number written down somewhere everyone can find, a clear acceptable data-loss window alongside it, and one named owner per phase of the recovery, each with a backup owner and a primary phone number that someone has confirmed in the last quarter.

Good also means a communication plan that names the customer-facing voice and the internal voice; those are usually different people, and figuring out which one talks first is the kind of thing that should not be decided at 2 AM. The plan should also say what happens in the first thirty minutes, because the first thirty minutes are when the wrong decisions get made by default.
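
If it helps to see the shape of that, here is a minimal sketch in Python of the fields described above, plus a check for phone numbers nobody has confirmed recently. Everything in it is an assumption for illustration: the class and field names, the sample people and numbers, and the 90-day reading of "the last quarter" are hypothetical, not a standard format.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Phase:
    name: str
    owner: str
    backup_owner: str
    owner_phone: str
    phone_confirmed_on: date  # last date someone verified the number works


@dataclass
class RecoveryPlan:
    max_downtime_minutes: int   # the acceptable downtime number
    max_data_loss_minutes: int  # the acceptable data-loss window
    phases: list[Phase]


def stale_phases(plan: RecoveryPlan, today: date, max_age_days: int = 90) -> list[str]:
    """Return the names of phases whose phone number was confirmed
    more than max_age_days before today (90 days is an assumption)."""
    cutoff = today - timedelta(days=max_age_days)
    return [p.name for p in plan.phases if p.phone_confirmed_on < cutoff]


# Hypothetical sample data, not taken from any real plan.
plan = RecoveryPlan(
    max_downtime_minutes=240,
    max_data_loss_minutes=15,
    phases=[
        Phase("restore database", "Ana", "Ben", "+1-555-0100", date(2024, 5, 1)),
        Phase("customer comms", "Cy", "Dee", "+1-555-0101", date(2023, 11, 1)),
    ],
)

print(stale_phases(plan, today=date(2024, 6, 1)))
```

With this sample data the check flags only the customer comms phase, whose number was last confirmed seven months earlier. The point is not the tooling; a spreadsheet with the same columns does the same job, as long as someone looks at the confirmation dates.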

The single move that fixes most of this is unglamorous: run tabletop exercises, quarterly. The team gets in a room, somebody reads the scenario out loud, and everyone walks through what they would actually do. Nobody touches production, nobody fakes a failover; the cost is two hours on the calendar and a notetaker.

Every gap you find in that two-hour walk-through is one less surprise on a real Tuesday. The first time we ran one for a portfolio company, we found three credentials only one person knew, two services with no documented runbook at all, and a vendor escalation path that pointed to an email address that bounced.

The thing nobody plans for is the people layer: the on-call rotation that depends on one heroic engineer, the credential vault that only one founder can unlock, the runbook stored on a laptop that the team lead took to a cabin without service. So when you walk the plan, walk it with the assumption that the most knowledgeable person on the team is unreachable; the plan that works under that assumption is the plan that holds up.
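
The "assume this person is unreachable" walk can also be done on paper before anyone gets in a room. A rough sketch, assuming you can write down who is able to use each credential, runbook, or escalation path; the `access` map, the role names, and the function name below are all hypothetical:

```python
def blocked_without(access: dict[str, set[str]], unreachable: str) -> list[str]:
    """List the resources that nobody can use once `unreachable`
    is taken out of the picture."""
    return sorted(
        resource
        for resource, people in access.items()
        if not (people - {unreachable})
    )


# Hypothetical inventory: resource -> people who can actually use it.
access = {
    "credential vault": {"founder"},
    "database failover runbook": {"lead", "sre"},
    "vendor escalation contact": {"lead"},
}

# Remove each person in turn and see what stops working.
for person in sorted(set().union(*access.values())):
    print(person, blocked_without(access, person))
```

Any person whose removal leaves a non-empty list is a single point of failure, and that list is the agenda for the next tabletop.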

The cheapest version of a resilient operation is a tested plan; not a more sophisticated plan, not a more expensive tool, not a longer document. The most expensive version is the same plan, untested, learned for the first time on the worst day of the year.
