By Russell Whitehouse, Kepner-Tregoe

Major incidents aren’t caused only by system failures or external attacks. They can also result from a planned change gone bad. When this happens, your users are impacted, and you are also likely to encounter confusion about processes, roles and responsibilities. During a major incident caused by change, systems are often between states, many stakeholders are disengaged because the downtime was planned, and your change management and incident management processes are forced to operate in parallel. The resulting complexity makes the issue extremely difficult to manage at a time when business impact is elevated, creating a high-risk situation for your business.

If you do not consider this scenario in advance and work out how you will handle it, your change management and incident management processes are likely to collide, leaving technical staff and management wondering who is in control, who should be communicating, and how decisions should be made to restore service. As you analyze and prepare your company’s plans for handling major incidents caused by change, here are some of the key areas where the normal change management and incident management approaches differ and require reconciliation.

Communications – The first area of conflict is communications. In a normal change environment, communications are scripted, routine and infrequent, and they omit technical details about the situation, the activities taking place and the impact to operations. In a major incident, the communication needs are almost the opposite: frequent updates on impact, diagnosis activities and status that assure stakeholders the support team has the issue under control. When both situations happen at the same time, it is important to establish who should be communicating, how often, and what the focus should be. Most importantly, all requests for updates should be channeled through the designated contact to avoid creating confusion and distraction.

Decision structures – Decision authority in change management is typically highly structured, with a regular cadence, standardized documentation and a focus on confidence in the recommendations being made. In a major incident, decisions need to be made in real time, focusing on cost/benefit, trade-offs and risk. To avoid confusion, it should be clear when decision-making authority switches from change management to incident management and, once the incident is resolved, how the transition back to change management will occur.

Roll-Forward vs. Rollback – Most change management processes deal with a failed change by rolling it back and reverting to the previous working configuration of the system. When a major incident occurs, your company may need to consider a different option: rolling forward by developing and deploying a new change instead of rolling back. This decision can be driven by the rollback plan failing, the business criticality of implementing the planned changes, or an evaluation of the comparative cost/benefit/risk profile of the different options.
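
As a rough illustration of comparing the cost/benefit/risk profile of each option, the short Python sketch below scores two recovery options and skips any that are no longer feasible (for example, because the rollback plan has failed). The 0 to 10 scales, the simple "benefit minus cost minus risk" score, and all names and numbers are hypothetical assumptions made for this example, not a prescribed method; a real evaluation would use your own criteria and weightings.

```python
from dataclasses import dataclass


@dataclass
class RecoveryOption:
    """One way to restore service (all scales are illustrative, 0 = low, 10 = high)."""
    name: str
    cost: float            # estimated effort and expense
    benefit: float         # business value delivered if the option succeeds
    risk: float            # likelihood and severity of further disruption
    feasible: bool = True  # False if, e.g., the rollback plan has already failed


def score(option: RecoveryOption) -> float:
    """Hypothetical scoring rule: benefit minus cost minus risk."""
    return option.benefit - option.cost - option.risk


def choose_option(options: list[RecoveryOption]) -> RecoveryOption:
    """Return the highest-scoring option among those still feasible."""
    feasible = [o for o in options if o.feasible]
    if not feasible:
        raise ValueError("no feasible recovery option")
    return max(feasible, key=score)


# Made-up example values for illustration only.
rollback = RecoveryOption("rollback", cost=2, benefit=4, risk=3)
roll_forward = RecoveryOption("roll-forward", cost=5, benefit=9, risk=4)

print(choose_option([rollback, roll_forward]).name)
```

With these made-up numbers the planned change is valuable enough that rolling forward scores higher; lowering its benefit or raising its risk tips the result back to rollback. The value of writing the comparison down is that the incident manager can see which assumption is driving the decision.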

Root Cause Analysis – When a major incident occurs because of a change, two root-cause analysis efforts must be undertaken. The first is a technical analysis of the change to understand why it failed, what needs to be done to correct the problem, and how to prevent it from happening again. The second is an analysis of the change management process to determine how the defective change was approved for release. This second analysis is critical for uncovering issues in development and testing and for strengthening the controls within the change management process.

Major incidents caused by change are a high-risk scenario for your company: they are both highly likely to occur and very impactful when they happen. The best preparation is to plan how your teams will respond when these scenarios occur and to practice that response in advance. If you don’t already have a plan in place, focusing on the fundamentals (communication, decision making, technical options and continuous improvement/root-cause analysis) is a good place to start. If you do have an existing process, these same fundamentals can be used to refine and tune it to minimize business impact, provide targeted stakeholder communications and add steps to capture root-cause analysis information while diagnosis and decision making are taking place.