Running Successful Incident Postmortems

“Failure should be our teacher, not our undertaker. Failure is delay, not defeat. It is a temporary detour, not a dead end. Failure is something we can avoid only by saying nothing, doing nothing, and being nothing.” – Denis Waitley

Mistakes are coming out of your paycheck. At the IT department of a national restaurant chain undergoing an agile transformation, this was exactly the case: each employee was compensated based on personal goal achievement and everybody was required to take on a goal entitled “perfect execution” with the aim of avoiding surprise incidents at all costs: no schedule delays, no production bugs, and no downtime.

The IT department also had the responsibility of delivering the future of the company: to use their existing chain restaurant kitchens to serve up new dishes only available via delivery to the “Netflix and Chill” generation. The IT department had built an API for their order management system that could be used by partners such as DoorDash, Caviar, and Uber Eats. When they ran their first tests in a few stores with live customers, they received so many delivery orders that the volume crashed the order management system used by thousands of restaurants. While marketers saw this incident as overwhelming evidence that people wanted delivery, IT and finance saw it as a catastrophic failure. What happened wasn’t “perfect execution”: time and money were lost.

Postmortems were used by management to identify the folks responsible. After all, how could we ever achieve “perfect execution” unless there are real consequences? How otherwise can people learn to be more careful?

Blame, the enemy of learning

Punishing those responsible for incidents didn’t seem to be working. Even after weighing the “perfect execution” goal more heavily and even letting people go, production failures continued to rise and the timeliness of project execution continued to fall. Why? Because “perfect execution” is a management philosophy built on a false premise, the fallacy that good employees do not cause failures.

When a team of people create anything of any sophistication together, it is prone to unexpected failure. Some percentage of errors inevitably will make it to customers. When incidents occur, seeking which individuals to blame rather than the underlying reasons for why the failure occurred encourages the following feedback loop:

  • Bad things happen ⇢
  • We seek who is to blame ⇢
  • Those to blame keep quiet, to avoid punishment ⇢
  • The underlying causes do not surface ⇢
  • We don’t learn ⇢
  • Even more bad things happen ⇢ (leading to more blame…)

Focusing on finding the individuals responsible removes the incentive to learn. And yet, we value the end-product of learning so highly: experience.

Consider the choice between two candidates: all things being equal, who would you rather hire? A candidate who “just graduated” or one with “5 years of experience”? What if an organization could provide the equivalent of 5 years of experience in just 1 year of work by providing the right working conditions? How much more quickly would the quality of the team improve?

How to learn from incidents

After a DevOps or IT failure, successful learning is no more complicated than:

  1. Hold an incident postmortem
  2. Focus on fixing the problem and getting to the root causes, not assigning blame

Fix the problem, not the blame

The only mistake that can be made when an incident occurs is not to learn from it. The goal is not to determine who is at fault but rather, through root cause analysis, which conditions and steps caused the incident to happen. The key is to assume positive intent. To assume positive intent, one must suspend the belief that other people deliberately cause failure. Instead, each participant accepts that everybody made decisions or took actions they believed were good at the time. Once people believe they will be free from judgement, it becomes much easier to get to the root causes of an incident.

To get at root causes, the 5 Whys process, first made popular within the Toyota Production System, is an excellent tool. Here’s an example Toyota offers of a potential 5 Whys that might be used at one of their plants:

  1. “Why did the robot stop?”
     The circuit has overloaded, causing a fuse to blow.
  2. “Why is the circuit overloaded?”
     There was insufficient lubrication on the bearings, so they locked up.
  3. “Why was there insufficient lubrication on the bearings?”
     The oil pump on the robot is not circulating sufficient oil.
  4. “Why is the pump not circulating sufficient oil?”
     The pump intake is clogged with metal shavings.
  5. “Why is the intake clogged with metal shavings?”
     Because there is no filter on the pump.

By running this process of inquiry, it’s possible to get past a symptom such as “the server crashed” to the root causes. For example:

  1. Initial symptoms: The server crashed
  2. Why: the server ran out of memory
  3. Why: the web microservice consumed all the memory
  4. Why: the monitoring service didn’t catch it
  5. Why: the monitoring service wasn’t configured to watch for a run-away memory condition

At each level within the 5 Whys process, it’s possible for individuals to take tasks to make this same sort of incident less likely in the future.
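As a rough illustration, the server-crash chain above could be recorded as a small data structure so that follow-up tasks stay attached to the level where they were found. This is only a sketch: the `Why` and `FiveWhys` names and the two example tasks are assumptions made up for illustration, not part of any tool mentioned in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Why:
    """One level of a 5 Whys chain: the answer found, plus any follow-up tasks."""
    answer: str
    tasks: list[str] = field(default_factory=list)


@dataclass
class FiveWhys:
    symptom: str
    whys: list[Why] = field(default_factory=list)

    def root_cause(self) -> str:
        """The deepest answer reached so far (the symptom itself if none yet)."""
        return self.whys[-1].answer if self.whys else self.symptom

    def all_tasks(self) -> list[str]:
        """Collect follow-up tasks from every level, not only the root cause."""
        return [task for why in self.whys for task in why.tasks]


# The chain from the example above; the tasks are hypothetical.
analysis = FiveWhys(
    symptom="The server crashed",
    whys=[
        Why("The server ran out of memory"),
        Why("The web microservice consumed all the memory",
            tasks=["Set a memory limit on the web microservice"]),
        Why("The monitoring service didn't catch it"),
        Why("The monitoring service wasn't configured to watch for a "
            "run-away memory condition",
            tasks=["Add a memory-growth alert to the monitoring service"]),
    ],
)

print("Root cause:", analysis.root_cause())
for task in analysis.all_tasks():
    print("-", task)
```

Keeping tasks on each level, rather than only on the final answer, mirrors the point that every level of the chain can yield an improvement.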

Running an Incident Postmortem

So now that you have an outline of the principles behind a blameless incident postmortem, how do you actually go about conducting one? There are 5 main steps:

  1. Invite anyone affected by the incident
  2. Select a meeting leader
  3. Ask “Why” five times
  4. Assign responsibilities for next steps toward solutions
  5. Share the knowledge in your organization

Step 1: Invite anyone affected by the incident

Once the immediate aftermath of an incident has been dealt with, invite anybody who was affected by or noticed the issue to take part in the incident postmortem. If you’re a remote team, gather on your favorite video conferencing solution. At Parabol we use our own integrated video conferencing, while using our Retro meeting to conduct the 5 Whys.

Step 2: Select a meeting leader

The meeting leader will facilitate the incident postmortem. Their responsibility is to keep the meeting moving forward, and remind the group of its purpose. For example, should something that sounds like blame start to come up they might say, “let’s stay focused on what happened, not pursuing who was responsible,” and then prompt the group, “so why did this happen?”

The meeting leader is also responsible for sending a summary of the meeting’s findings to a wider audience.

Step 3: Ask “Why” five times

When examining an incident together, it’s tempting to investigate broadly and try to fix the entire world. We might feel compelled to follow multiple paths of “why?” However, if a group goes too broad, it lowers the chance that it will go deep enough to make fundamental changes. How can this be avoided? By choosing one path, or as few as possible, to walk down and having the meeting leader keep the group focused.

We can make starting off on the right path easier by using a postmortem template and letting Parabol lead a prioritization process for what path to examine. When you start a Retro meeting in Parabol…

You can customize the template by using the menu in the meeting lobby:

Hit “Customize…” and set up a template like this:

Then, when you run the incident postmortem your team can generate and vote on the root causes it wants to dig into first.

When the votes are tallied your team can start discussing the top-voted cause and ask “why did this happen?”
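If you are running the vote by hand rather than in a tool, the tally itself is simple. The sketch below is not how Parabol works internally; the candidate causes and the `rank_causes` helper are hypothetical and only show the idea of picking the top-voted path first.

```python
from collections import Counter

# Hypothetical votes: each entry is one vote cast for a candidate cause.
votes = [
    "monitoring gaps",
    "no load testing",
    "monitoring gaps",
    "no rate limiting on the API",
    "no load testing",
    "monitoring gaps",
]


def rank_causes(votes):
    """Return (cause, vote_count) pairs, most-voted first."""
    return Counter(votes).most_common()


ranking = rank_causes(votes)
top_cause, top_votes = ranking[0]
print(f"Start with: {top_cause} ({top_votes} votes)")
```

Starting with the single top-voted cause keeps the group on one path of “why?” instead of spreading across several.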

Step 4: Assign responsibilities for next steps toward solutions

As your team asks itself “why did this happen?” it will find opportunities to make changes. Write down the changes you’d like to make and assign each task a single, clear owner. If a task seems too big to tackle, break it down into smaller steps (e.g. “process monitoring solutions researched” vs. “process monitoring solutions deployed across the entire cluster”).
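One way to hold yourselves to the “single, clear owner” rule is to check the task list before the meeting ends. This is a minimal sketch under stated assumptions: the `ActionItem` type, the `needing_single_owner` helper, and the owner names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str
    owners: list[str] = field(default_factory=list)


def needing_single_owner(items):
    """Return the items that have no owner or more than one owner."""
    return [item for item in items if len(item.owners) != 1]


# Hypothetical tasks; the names are placeholders, not real people.
items = [
    ActionItem("Process monitoring solutions researched", ["alex"]),
    ActionItem("Process monitoring solutions deployed across the entire cluster",
               ["alex", "sam"]),
    ActionItem("Memory alert thresholds documented"),
]

for item in needing_single_owner(items):
    print(f"Needs exactly one owner: {item.description} (has {len(item.owners)})")
```

A task flagged for having several owners is often also a sign that it is too big and should be split into smaller steps.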

Step 5: Share the knowledge in your organization

Recording the outcome of the meeting gives the organization an opportunity to learn. It may give stakeholders comfort and build confidence that the organization can handle the unexpected, bring useful suggestions, or, at the very least, encourage others to apply the incident postmortem practice to their own areas and make improvements. Saving the output also builds a knowledge base for future incidents. Make sure to archive it someplace accessible and searchable. When a similar problem occurs, it’s useful to find out that “it’s another one of those” vs. some new and unknown issue.

Parabol is able to automatically generate meeting summaries. Or, if you prefer, there are a number of useful templates available at this GitHub repository you can adapt to your own context.

Crib notes for a successful incident postmortem

Here is a summary of how to run an effective incident postmortem:

  • Make them blameless by assuming positive intent
  • Call together all stakeholders and appoint a meeting leader
  • Conduct the process of 5 Whys
  • Assign each other practical next steps
  • Send a summary to a wider audience, archive the summary

And finally, the most important point: the only wrong way to run an incident postmortem is not to hold one at all!

Jordan Husney

Co-founder, Parabol