By Russell Whitehouse, Kepner-Tregoe
When a company develops a major incident process, it first considers what constitutes a major incident. Organizations tend to focus their attention externally, on the potential for hackers and cyber-attacks to wreak havoc on networks and create data security concerns. They define incident priority accordingly, with an emphasis on external causes.
This is a valid concern. In today's corporate environment, suffering a system breach is not necessarily a matter of "if" but rather "when". As a result, companies focus the responsibilities of the major incident team on external factors that could cause a service outage. However, they often forget that major incidents aren’t just caused by system failures or external attacks. They can also be the result of a planned change gone bad.
What happens when a major incident is caused by change management
First and foremost, you should recognize that your users are not the only ones impacted: you are also highly likely to encounter confusion about processes, roles and responsibilities.
When a major incident is caused by change, systems are often between states and many stakeholders are disengaged because of the planned downtime. Your change management and incident management processes are forced to operate in parallel, and each may have a different process owner. The resulting complexity makes managing the incident extremely difficult at a time when business impact is elevated – leading to a high-risk situation for your business.
Without considering this scenario in advance and figuring out how you will handle it, there is a high likelihood that your change management and incident management processes will collide. This will leave technical support staff and management wondering who is in control, who should be communicating, and how decisions should be made to restore services as quickly as possible.
How to align your change management and major incident management plans
As you are analyzing and preparing your company’s plans for handling major incidents caused by change, here are some of the key areas where the normal change management and incident management approaches differ and require reconciliation.
Communications – The first area of conflict is communications. In a normal change environment, communication plans are scripted, routine and infrequent – omitting technical details about the situation, the activities taking place and the impact to operations. In a major incident situation, the communication needs are almost the opposite: you keep your stakeholders frequently apprised of impact, diagnostic activities and status. This assures stakeholders that the support team has the issue under control.
Conflicting communication approaches can also lead to extended resolution times when these scenarios collide. When both situations happen at the same time, it is important to establish who should be communicating and how often. Define what the focus of each communication should be. Most importantly, all requests for updates should channel through the designated contact to avoid creating confusion and distraction.
Decision structures – The decision authority for change management is typically highly structured, with a regular cadence, standardized documentation and a focus on the confidence of the recommendations being made. In a major incident situation, decision making needs to take place in real time to resolve the incident, focusing on cost/benefit, trade-offs and risk.
To avoid confusion, it should be clear when decision-making authority switches from change management to incident management and, once the incident is resolved, how the transition back to change management will occur.
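One way to make this handoff explicit is to write it down as a small state model. The sketch below is only an illustration under assumed rules: the class and method names are hypothetical, and it assumes authority moves to incident management when a major incident is declared and returns to change management once the incident is resolved. Your own triggers may differ.

```python
from dataclasses import dataclass, field
from enum import Enum


class Authority(Enum):
    CHANGE_MANAGEMENT = "change management"
    INCIDENT_MANAGEMENT = "incident management"


@dataclass
class DecisionAuthority:
    """Tracks which process currently owns decisions for a change (hypothetical model)."""

    owner: Authority = Authority.CHANGE_MANAGEMENT
    history: list = field(default_factory=list)

    def declare_major_incident(self) -> None:
        # Assumed trigger: declaring a major incident moves authority
        # to incident management.
        self._hand_over(Authority.INCIDENT_MANAGEMENT, "major incident declared")

    def resolve_incident(self) -> None:
        # Assumed trigger: resolving the incident returns authority
        # to change management.
        self._hand_over(Authority.CHANGE_MANAGEMENT, "incident resolved")

    def _hand_over(self, new_owner: Authority, reason: str) -> None:
        if new_owner is self.owner:
            return  # no handoff needed, so nothing is recorded
        # Record who handed over to whom and why, for later review.
        self.history.append((self.owner, new_owner, reason))
        self.owner = new_owner


authority = DecisionAuthority()
authority.declare_major_incident()
print(authority.owner.value)
authority.resolve_incident()
print(authority.owner.value)
```

Recording each handoff with its reason also gives the later process review a timeline of who held decision authority at each point.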
Roll-Forward vs. Rollback – The common approach in most change management processes for dealing with a failed change is to roll back the change and revert to the previous working configuration of the system. This usually succeeds in restoring service, unless a major incident is involved.
When a major incident occurs, your company may need to consider a different option: rolling forward instead of rolling back. Developing and deploying a new change may lead to shorter resolution times. The decision to roll forward can also be a result of the rollback plan failing. It can also be due to redefining the business criticality of implementing the planned changes, or the result of an evaluation of the comparative cost/benefit/risk profile of different options.
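The comparison of options can be made concrete with a simple scoring exercise. The sketch below is a hypothetical illustration, not a prescribed method: the scoring formula (benefit discounted by the risk of failure, minus cost), the field names and the numbers are all assumptions chosen for the example. It also shows how a failed rollback plan removes that option from consideration.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    cost: float      # estimated effort or expense, arbitrary units (assumed)
    benefit: float   # estimated business value of restored service (assumed)
    risk: float      # estimated chance the option fails, from 0 to 1 (assumed)
    viable: bool = True  # set to False if, for example, the rollback plan failed


def score(option: Option) -> float:
    # Assumed formula: expected benefit after discounting for risk, minus cost.
    return option.benefit * (1 - option.risk) - option.cost


def choose(options: list) -> Option:
    # Only options that can still be executed are compared.
    viable = [o for o in options if o.viable]
    if not viable:
        raise ValueError("no viable option to restore service")
    return max(viable, key=score)


rollback = Option("rollback", cost=2, benefit=10, risk=0.1)
roll_forward = Option("roll forward", cost=4, benefit=10, risk=0.3)

print(choose([rollback, roll_forward]).name)

# If the rollback plan fails, rolling forward is the remaining option.
rollback.viable = False
print(choose([rollback, roll_forward]).name)
```

In practice the estimates would come from the incident team in real time, and redefining the business criticality of the planned change would show up as a higher benefit for rolling forward.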
Root Cause Analysis – When a major incident occurs because of a change, two root-cause analysis efforts must be undertaken. The first is a technical analysis of the change to understand why it failed, what needs to be done to correct the problem and how to prevent it from happening again. The second is an analysis of the change management process to determine how the defective change was approved for release. This second piece is critical to uncovering issues in the development and testing process and to strengthening the controls within the change management process.
Major incidents caused by change are a high-risk scenario for your company – both highly likely to occur and very impactful when they happen. They can lead to lost revenue and, more importantly, lost credibility with your customers.
Your best preparation for a major incident caused by change management comes from planning how your teams will respond when these scenarios occur and practicing in advance. If you don’t already have a plan in place, focusing on the fundamentals (communication, decision making, technical options and continuous improvement/root-cause analysis) is a good place to start. If you do have an existing process, refine and tune it to minimize business impact, provide targeted stakeholder communications and add steps to capture root-cause analysis information while diagnosis and decision making are taking place.