Post-incident review best practices

By Christian Green, Kepner-Tregoe

 

Post-incident reviews offer an opportunity for self-reflection about both the event itself and the way the incident management team responded. But how do you keep the review from devolving into blame and finger-pointing? It is important to establish the right focus and tone for the review meeting so that the conversation stays constructive and produces effective continuous improvement ideas. The purpose of these reviews is to improve the process for future incidents.

Set a positive tone and involve the right people

Many companies keep the list of people involved in a post-incident review small – hoping to “save face” in the eyes of stakeholders and not draw attention to things that may not have been well-handled. Experience has shown that this is a mistake. You want to make sure that all the key people who participated in or observed the incident can contribute their ideas to the review process.

At a minimum, the post-incident review (PIR) should include the:

  • incident managers and support staff who worked on this issue
  • problem manager responsible for root cause analysis
  • service owner who is responsible for overall service assurance
  • representatives from the business function impacted by the incident.

Depending on the nature of the incident, the PIR might also include service provider representatives, communications staff and, in some cases, the user who initially reported the issue.

Review the full timeline

One of the biggest mistakes in post-incident reviews is focusing only on what happened after the ticket was opened. The timeline of an incident begins when the business function or user is first impacted, and there is likely some delay before the ticket is opened and the incident declared. This early period is one of the easiest parts of the process to improve, but it is often overlooked in PIRs. The following questions focus the PIR discussion on the early stage of the incident and help identify opportunities for improvement.

  1. When did the business impact really start?
  2. How did we find out about the issue? Did someone call the helpdesk or did automated monitors identify it?
  3. How long did it take to capture the issue and identify that it was an incident?
  4. Was escalation required? How well did the escalation process work?
  5. Were there any challenges getting support staff engaged (identifying contacts, availability, etc.)?
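
To put numbers behind these questions, a team could record a few timestamps per incident and compute the gaps between them. The following is only a minimal sketch: the field names, the example times and the choice of which gaps to measure are illustrative assumptions, not part of any standard or tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    # Hypothetical field names chosen for this sketch.
    impact_started: datetime     # business function or user first impacted
    issue_reported: datetime     # helpdesk call or automated monitor alert
    incident_declared: datetime  # ticket opened and incident declared
    support_engaged: datetime    # support staff actively working the issue
    service_restored: datetime   # service back to normal


def early_stage_gaps(t: IncidentTimeline) -> dict[str, timedelta]:
    """Return the delays that occur early in the incident."""
    return {
        "impact_to_report": t.issue_reported - t.impact_started,
        "report_to_declaration": t.incident_declared - t.issue_reported,
        "declaration_to_engagement": t.support_engaged - t.incident_declared,
    }


def pre_ticket_share(t: IncidentTimeline) -> float:
    """Fraction of the total incident duration spent before the ticket was opened."""
    total = t.service_restored - t.impact_started
    before_ticket = t.incident_declared - t.impact_started
    return before_ticket / total


# Made-up example incident, for illustration only.
example = IncidentTimeline(
    impact_started=datetime(2024, 1, 15, 9, 0),
    issue_reported=datetime(2024, 1, 15, 9, 40),
    incident_declared=datetime(2024, 1, 15, 10, 5),
    support_engaged=datetime(2024, 1, 15, 10, 20),
    service_restored=datetime(2024, 1, 15, 12, 0),
)

gaps = early_stage_gaps(example)
for name, gap in gaps.items():
    print(f"{name}: {gap}")
print(f"share of incident before ticket: {pre_ticket_share(example):.0%}")
```

In this made-up example, 65 of the 180 minutes of business impact passed before the ticket was opened, which is the kind of gap a review that starts at the ticket would miss.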

Evaluate the use of data and tools for diagnostics

Incident troubleshooting frequently comes down to having the right tools and data available for subject matter experts to use in understanding what is going on and diagnosing underlying issues. The post-incident review should include a review of what data and tools were available, which were used and what data and tools the team wishes had been available to make the process easier. Frequently the incident support team is unaware of tools and resources that actually were available. The PIR can help identify these resources and ensure that staff has access and knows how to use them during the next incident.
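
One simple way to structure this part of the review is to compare three lists: what was available, what was used, and what the team wished it had. The sketch below assumes the lists are kept as plain sets of names; the function name, the category labels and the tool names are hypothetical and only illustrate the comparison.

```python
def tool_review(
    available: set[str], used: set[str], wished: set[str]
) -> dict[str, set[str]]:
    """Compare tool lists gathered during a post-incident review."""
    return {
        # Wished for, actually available, but not used: an awareness or access problem.
        "awareness_gaps": (wished & available) - used,
        # Available but neither used nor wished for: possibly irrelevant to this incident.
        "unused_available": available - used - wished,
        # Wished for and not available at all: a real gap in tooling or data.
        "true_gaps": wished - available,
    }


# Made-up example lists, for illustration only.
available = {"apm dashboard", "log search", "network trace"}
used = {"log search"}
wished = {"network trace", "config diff"}

review = tool_review(available, used, wished)
for category, names in review.items():
    print(f"{category}: {sorted(names)}")
```

Items under awareness_gaps point to training or access follow-ups, while items under true_gaps point to tooling or data that would need to be acquired or built.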

Clarify decision making

Incidents can be high-stress, time- and cost-sensitive situations with intense pressure to restore service fast. As the incident is being resolved, many decisions must be made, such as assigning an impact/urgency classification, determining whom to communicate with and when, choosing which troubleshooting paths to pursue, and deciding which action steps to take to restore service. Many of these decisions will be made spontaneously by someone on the team (and that may be okay). But some decisions should be made by either a group or someone with designated authority. The PIR is a good time to review the decisions that were made and how they were reached, to ensure that there is no confusion about authority and decision-making processes in the future.

Strengthen future incident response

Post-incident reviews are one of the most powerful tools in your continuous improvement toolbox. Take advantage of this opportunity to make the best of incidents and gain valuable insights into your existing incident management processes and how they can be improved for the future.

 

Watch Kepner-Tregoe videos about post-incident reviews

Preparing to fail – What the Titanic teaches us about post-mortems

A review of the Apollo 13 incident told through a description of primary root cause analysis tools

 

About Kepner-Tregoe

Kepner-Tregoe has empowered thousands of companies to solve millions of problems. KT provides a data-driven, consistent, scalable approach to clients in Operations, Manufacturing, IT Service Management, Technical Support, and Learning & Development. We empower you to solve problems. KT provides a unique combination of skill development and consulting services, designed specifically to reveal the root cause of problems and permanently address organizational challenges. Our approach to problem solving will deliver measurable results to any company looking to improve quality and effectiveness while reducing overall costs.

 
