By Shane Chagpar
Recently, major operations hubs including the NYSE and United Airlines experienced wide-scale, nationally reported system outages. The cost of these outages in confusion, frustration, and lost money has not been calculated, but I can only guess that it will be astronomical and will linger in people's minds for some time to come.
According to press releases, the four-hour outage at the NYSE was apparently due to a software upgrade. Although the upgrade was planned for an off-hours maintenance window, it began causing havoc when traders logged in at 7 a.m. the next morning to resume their regular activities and had difficulty connecting. It’s unknown at the time of this writing when the upgrade was completed, but it stands to reason that with some additional planning this could have been avoided.
While it’s a bit of 20/20 hindsight to identify lack of planning or failure to apply preventative and contingent thinking as the cause of this problem, I would instead like to examine how the incident was handled after it occurred.
The difficulty with incident management is that it’s live and requires strong facilitation skills and intense direction. Compounding this is that everyone has visibility: there were surely more than 100 people on a conference call, many simply begging for quick action to prevent a late opening of the exchange. During this kind of firefight, it is very easy for a would-be leader to take the easiest action presented to them. In the case of the NYSE, the initial actions intended to restore services instead created a condition known as a secondary outage, where attempts to fix the problem only make it worse.
The true victory in this situation is that previous planning—which should occur when things are running smoothly—went into effect and enabled trades to resume later that same day. Orders were correctly suspended and cancelled according to plan and a datacenter in Mahwah, N.J. came online to resume trading. The issue was resolved by 3:10 p.m. that same day.
When we work with clients who face challenges in the incident management space, we approach these incidents using a combination of skill development, coaching, tooling integration and focused culture change. A strong incident management team should have roles and responsibilities defined well in advance and, like fighter pilots or rescue helicopter crews, use a series of checklists and an overall ‘playbook’ to remain calm and function well under pressure.
A playbook should at minimum help define the following:
- Methods of understanding and validating service degradation.
- Systematic methods to clarify and understand symptoms and user-reported errors so that the right people can be involved.
- Tools to help manage involvement, including current on-call numbers, backups and vendor engagement reps.
- Standardized tools and locations for conference call information, war rooms, use of dashboards or live tools.
- Methods of quickly and accurately determining priority, including understanding Current Impact, Future Impact and Timeframe.
- A decision-making methodology and objectives per application that are developed in advance.
- A risk management framework used to submit accurate and useful documentation to change management as well as the fix agents.
- A plan for how to validate systems have been restored and verify that a secondary outage has not been created.
- Handover requirements to update documentation and transfer the incident to problem management.
- Framework to raise and execute projects in order to prevent future incidents.
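To make the priority item above more concrete, here is a minimal sketch in Python of how a team might rank open incidents by Current Impact, Future Impact and Timeframe. The three-level scale, the equal weighting and every name in the code (`Level`, `Incident`, `priority_score`, `triage`) are assumptions made for illustration only; a real playbook would define its own scale and weights in advance.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step rating scale (an assumption, not a standard)."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Incident:
    name: str
    current_impact: Level  # how bad is it right now?
    future_impact: Level   # how bad will it get if nothing is done?
    timeframe: Level       # how soon must action be taken? HIGH = most urgent


def priority_score(incident: Incident) -> int:
    """Sum the three ratings; equal weighting is an illustrative assumption."""
    return int(incident.current_impact
               + incident.future_impact
               + incident.timeframe)


def triage(incidents):
    """Return incidents ordered from highest to lowest priority score."""
    return sorted(incidents, key=priority_score, reverse=True)


if __name__ == "__main__":
    open_incidents = [
        Incident("slow internal report", Level.LOW, Level.MEDIUM, Level.LOW),
        Incident("users cannot connect", Level.HIGH, Level.HIGH, Level.HIGH),
    ]
    for incident in triage(open_incidents):
        print(priority_score(incident), incident.name)
```

The value of agreeing on something like this ahead of time is that, during the firefight, the team rates three things it has already defined rather than debating what "urgent" means with 100 people on the call.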
At KT, experience tells us that setting up this framework and playbook-type structure in advance leads to quicker results as well as more confident and empowered teams, particularly at the junior level. It's amazing what a structured plan can do when your organization is under fire and you have to rely on your incident management team to think under pressure.
The KT problem-solving approach is used worldwide for root cause analysis and to improve IT stability.