For any technical organization, it's critical to have a knowledge base, in conjunction with (ideally automated) runbooks, so that engineers are prepared to solve both common and uncommon production issues quickly.
Indeed, for engineers and support folks working in production, the use of well-designed, automated runbooks can give big wins in both efficiency and visibility.
Let's go over the basics of runbooks (both in their classic written form, and in the automated form) and why they are so valuable.
What is a runbook?
A runbook is a compilation of the concrete, repeatable steps needed to complete a technical operation or procedure within a company. Examples range from application actions (say, refunding a customer) to pure infrastructure actions (backing up a database). They might be everyday tasks (running a report), or rare and critical ones (failing over to a backup database).
They are generally available to every team member who might need to use them.
It's important to note that each runbook covers one specific task: a 1:1 mapping between runbook and task.
Types of runbooks
There are three different types of runbooks:
- Manual: Step by step instructions for an operator to follow.
- Mixed: Combination of step by step instructions and automated steps.
- Automated: No human intervention necessary.
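To make the three types concrete, here is a minimal sketch of how a runbook might be modeled in code. The `Step` and `Runbook` classes and the `snapshot_database` function are hypothetical names invented for illustration, not part of any particular runbook tool. The sketch assumes a step counts as automated when it carries a callable, and as manual otherwise.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    """One runbook step. A step with no `action` is done by a human."""
    description: str
    action: Optional[Callable[[], None]] = None  # None means manual

    @property
    def automated(self) -> bool:
        return self.action is not None


@dataclass
class Runbook:
    name: str
    steps: List[Step] = field(default_factory=list)

    def kind(self) -> str:
        """Classify the runbook as 'manual', 'mixed', or 'automated'."""
        flags = [step.automated for step in self.steps]
        if flags and all(flags):
            return "automated"
        if any(flags):
            return "mixed"
        return "manual"  # also covers a runbook with no steps yet


def snapshot_database() -> None:
    # Hypothetical placeholder for a real backup call.
    print("snapshot requested")


backup = Runbook("backup-database", [
    Step("Confirm no migration is in progress"),       # manual
    Step("Take a snapshot", action=snapshot_database),  # automated
])
print(backup.kind())  # mixed
```

A real system would also record inputs, outputs, and who ran each step, but the manual/mixed/automated distinction reduces to which steps still need a person.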
These runbooks can be split into two different categories:
- General runbooks: These runbooks are designed to hold the documentation and steps necessary to complete routine IT operations such as performing daily backups, or monitoring app performance.
- Specialized runbooks: These runbooks are for more complex or less common situations such as incident response, complex DevOps workflows, disaster recovery, etc.
Creating a runbook
First and foremost, an organization needs to understand what makes a runbook effective before planning to write one. A popular standard for judging a runbook's effectiveness is the five A's.
- Actionable: Documents everything that needs to be done.
- Accessible: Every team member knows where to access the runbook.
- Accurate: Content of the runbook solves the issue at hand and is kept up-to-date.
- Authoritative: One runbook for one single operations process.
- Adaptable: Easy to modify when necessary to make it more efficient.
It's also important to have each runbook be set up in a standard, consistent way so your employees are able to follow any runbook without requiring additional assistance.
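One way to enforce a consistent layout is to check every runbook against a shared template. The sketch below is only an illustration: the section names in `REQUIRED_SECTIONS` and the dictionary representation of a runbook are assumptions, so substitute whatever template your organization settles on.

```python
from typing import Dict, List

# Hypothetical template; these section names are an assumption, not a standard.
REQUIRED_SECTIONS = [
    "purpose", "owner", "prerequisites", "steps", "verification", "rollback",
]


def missing_sections(runbook: Dict[str, str]) -> List[str]:
    """Return the template sections that are absent or blank in a runbook."""
    return [
        section for section in REQUIRED_SECTIONS
        if not runbook.get(section, "").strip()
    ]


draft = {
    "purpose": "Fail over to the backup database",
    "owner": "db-team",
    "steps": "1. Pause writes\n2. Promote the replica",
}
print(missing_sections(draft))  # ['prerequisites', 'verification', 'rollback']
```

A check like this could run whenever a runbook is added or edited, so gaps are caught before someone relies on the runbook during an incident.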
After understanding these five A's, you can now start planning your first runbook.
Closely related to runbooks is the idea of a service guide, or service overview, which explains how to operate a discrete service. OpsReportCard describes seven parts, of which runbooks (ideally automated ones with a human in the loop) play a part:
- Service overview: Explains what the service is, why we have to perform this action, who the primary contacts are, and where to find design docs and other relevant information.
- Service build: How to build the software.
- Deploy: How to deploy the software.
- Common tasks: Step-by-step instructions for the common tasks that are related to the end goal.
- Pager playbook: Which alerts your monitoring system may raise during this procedure, and how to resolve them.
- Disaster recovery: Plans and procedures to follow if an incident happens.
- Service level agreements: Uptime goal, recovery point objective (RPO), and recovery time objective (RTO) all documented here.
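The numbers in the service level agreements entry are easier to reason about once they are converted into concrete budgets. The sketch below is a rough illustration with hypothetical function names: it turns an uptime goal into allowed downtime, and checks a backup schedule against an RPO under the simplifying assumption that worst-case data loss equals the interval between backups.

```python
def downtime_budget_minutes(uptime_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed in the period for a given uptime goal."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


def meets_rpo(backup_interval_hours: float, rpo_hours: float) -> bool:
    """Assumes worst-case data loss is one full backup interval."""
    return backup_interval_hours <= rpo_hours


print(f"{downtime_budget_minutes(99.9):.1f}")   # 43.2
print(f"{downtime_budget_minutes(99.99):.2f}")  # 4.32
print(meets_rpo(backup_interval_hours=6, rpo_hours=1))  # False
```

At a 99.9% goal, a 30-day month allows about 43 minutes of downtime, which is a useful yardstick for how quickly a disaster recovery runbook needs to complete.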
When composing a service description, talk to the subject matter experts on your team and identify the supporting material that is critical to include, such as diagrams and flow charts.
Writing (and composing) phase
While there are some things to keep in mind while you're creating a runbook that may differ from organization to organization, the ultimate goal should always be that users require little-to-no assistance while following the runbook to complete the task at hand.
Think from the perspective of a brand-new employee. Would they be able to follow? Are the steps too complicated? In an automated runbook, is it clear what every input is used for? Try to reduce technical language and abbreviations where possible to make the runbook easier to understand.
Before publishing, runbooks should be tested by multiple people to confirm two things: that the runbook actually solves the issue at hand, and that it is the best available solution and easy to follow.
Like software, which is updated constantly throughout its development lifecycle, existing runbooks should be reviewed every few months and updated when necessary.
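A periodic review is easier to keep up if stale runbooks are flagged automatically. The following sketch assumes each runbook records a last-reviewed date and treats anything older than 90 days as stale. The threshold, the function name, and the sample runbook names are all illustrative assumptions.

```python
from datetime import date, timedelta
from typing import Dict, List


def stale_runbooks(
    last_reviewed: Dict[str, date],
    today: date,
    max_age_days: int = 90,  # assumed review interval; tune to your needs
) -> List[str]:
    """Return names of runbooks last reviewed more than max_age_days ago."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name for name, reviewed in last_reviewed.items() if reviewed < cutoff
    )


reviews = {
    "failover-database": date(2024, 1, 10),
    "refund-customer": date(2024, 5, 20),
}
print(stale_runbooks(reviews, today=date(2024, 6, 1)))  # ['failover-database']
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the check easy to test.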
Advantages of runbooks
There are many advantages to making runbooks a regular part of every engineer's workflow.
They contribute to standardization - making operations more efficient and consistent by documenting them step by step.
They optimize your DevOps processes and increase engineering velocity. They reduce downtime.
They improve developer experience. By making common, tedious, or error-prone tasks easily repeatable or automated, developers have to do less work. Less work = happier developer.
They grant your team more ownership. Runbooks allow you to trust more people to run complex tasks or commands since the end result is already documented. Runbooks are great for allowing your developers to run queries without giving them direct production access.
They allow you to onboard new employees quicker. With common tasks automated, new team members can get up to speed faster in on-call roles, support roles, or general debugging.
They provide knowledge. Having tasks documented and automated within your organization allows your team to learn, and prompts them to think about what else they can automate in future projects.
Every technical organization should have runbooks integrated into their engineering workflow as a best practice. As for which type of runbook is best, automated runbooks are the gold standard. Human error is a major cause of software defects and production incidents. Reducing that source of errors will increase engineering productivity as well as improve developer experience.
Since automated runbooks are capable of allowing developers to run queries in production without direct production access, additional safety measures can and should be put in place in order to control who's doing what in production.
First, groups and permissions should be put in place to determine which developers can run which queries on which hosts. Second, the runbook workflow should include a review and approval step before the action is completed. For example, if a junior developer is trying to use a runbook that performs a MySQL query in production, that action should need at least one senior engineer to review and approve the step before it runs.
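The two safeguards can be sketched as a single gate that runs before any action executes. Everything named here (`RunRequest`, `RUNBOOK_GROUPS`, `USER_GROUPS`, the user names, the `senior-engineers` group) is hypothetical, and the sketch assumes the stricter policy that every run needs approval from a senior engineer other than the requester.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Hypothetical policy data; a real system would load this from its
# identity provider or access-control configuration.
RUNBOOK_GROUPS: Dict[str, Set[str]] = {
    "mysql-prod-query": {"developers", "senior-engineers"},
}
USER_GROUPS: Dict[str, Set[str]] = {
    "jun": {"developers"},
    "sam": {"developers", "senior-engineers"},
}
APPROVER_GROUP = "senior-engineers"


@dataclass
class RunRequest:
    runbook: str
    requester: str
    approvals: Set[str] = field(default_factory=set)


def may_execute(request: RunRequest) -> bool:
    """Allow a run only if the requester is permitted and it is approved."""
    allowed = RUNBOOK_GROUPS.get(request.runbook, set())
    requester_groups = USER_GROUPS.get(request.requester, set())
    if not (allowed & requester_groups):
        return False  # requester is in no group allowed to use this runbook
    # Require at least one approver in the approver group who is not
    # the requester (no self-approval).
    return any(
        approver != request.requester
        and APPROVER_GROUP in USER_GROUPS.get(approver, set())
        for approver in request.approvals
    )


request = RunRequest("mysql-prod-query", requester="jun")
print(may_execute(request))  # False: no approval yet
request.approvals.add("sam")
print(may_execute(request))  # True: approved by a senior engineer
```

Keeping the check in one function means every runbook goes through the same gate, and an audit log of requests and approvals can be added in one place.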
Runbooks make production work a lot faster and easier. By allowing your developers to work in production via runbooks without giving them direct production access, your engineering team can do more. Hopefully this guide points your organization in the right direction as you create runbooks to speed up the work of your engineering and IT teams.