Runbooks for better incident management
An SRE's best friend, runbooks can help engineering teams stop putting out the same fires again and again.
Why runbooks are useful
Automation doesn't protect against every issue -- software still needs tens to hundreds of different activities carried out by skilled humans to keep the system running
"30-40% of procedures require human judgement to resolve safely so that's still a bunch of run books won't go away - even if large parts of deployment are push-button processes."
Runbooks prevent situations like this: "I recently ran into a situation where I spent 6 hours understanding how something works that would have taken 20 minutes if the relevant information was stored somewhere."
Ways that teams have set up their runbooks
Confluence -- not designed specifically for runbooks, but it is an open-ended tool that works well if you have a solid idea of runbook design
Jupyter Notebooks -- an open-source tool that combines text, images and live code snippets, so a decent option if you are happy to install and update it
Markdown files hosted in a git repo -- keeping them up to date might be an issue
Err… this ➝ "Sticky notes on someone's desk. We're thinking about getting a laminator to keep the coffee spills from being too serious of a problem."
Factors to consider in runbook setup
Make a standard runbook template -- a consistent layout makes information easier to process in a pinch, such as when resolving an urgent incident
Take a collaborative approach to building runbooks -- don't palm them off to technical writers; the people who design and build the systems should be the main authors, or at least participate
Explain why the component of the system was designed the way it appears to the runbook user
Some runbooks have sub-processes -- it's important to clarify what these are and how they relate to their parent process
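To make the template and sub-process points a bit more concrete, here is a rough Python sketch of what a standard runbook structure could look like. Everything in it is an assumption for illustration: the `Runbook` class, its field names and the database example are made up, not taken from any particular tool or team.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Runbook:
    """Hypothetical standard template: every runbook fills in the same fields."""
    title: str
    owner: str             # team that designed and built the component
    when_to_use: str       # symptom or alert that triggers this runbook
    design_rationale: str  # why the component works the way it does
    steps: List[str] = field(default_factory=list)
    sub_runbooks: List["Runbook"] = field(default_factory=list)  # child processes

    def render(self, depth: int = 0, parent: str = "") -> str:
        """Render the runbook and its children as an indented plain-text outline."""
        pad = "  " * depth
        lines = [f"{pad}{self.title} (owner: {self.owner})"]
        if parent:
            # State explicitly which process this one is a child of.
            lines.append(f"{pad}  Sub-process of: {parent}")
        lines.append(f"{pad}  When to use: {self.when_to_use}")
        lines.append(f"{pad}  Why it is built this way: {self.design_rationale}")
        for number, step in enumerate(self.steps, start=1):
            lines.append(f"{pad}  {number}. {step}")
        for child in self.sub_runbooks:
            lines.append(child.render(depth + 1, parent=self.title))
        return "\n".join(lines)


# Invented example data, only here to show the shape of the output.
failover = Runbook(
    title="Fail over the primary database",
    owner="storage team",
    when_to_use="Primary is unreachable for more than five minutes",
    design_rationale="Failover is manual because promoting a lagging replica can lose writes",
    steps=["Check replica lag", "Promote the replica", "Repoint the application"],
)
outage = Runbook(
    title="Database outage",
    owner="storage team",
    when_to_use="Database health alert fires",
    design_rationale="One primary with replicas keeps writes consistent",
    steps=["Confirm the alert is real", "Decide whether failover is needed"],
    sub_runbooks=[failover],
)
print(outage.render())
```

Because every runbook carries the same fields in the same order, a reader in the middle of an incident knows where to look, and each child runbook names its parent instead of leaving the relationship implicit.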
I originally posted this on Reddit — in case you see an identical post there