By Christoph Goldenstern, Kepner-Tregoe

A human error—a very basic one—caused British Airways to suffer an IT outage on May 27, 2017, forcing it to cancel more than 400 flights and leaving 75,000 passengers stranded. An engineer had disconnected a power supply at a data center, and when it was plugged back in, a power surge caused major damage. Net cost to the airline: a whopping 80 million pounds (about $102 million).

This might sound like a lot of money, and it is, but according to Statista, it’s not unusual. For 86% of enterprises, an hour of downtime costs more than $300,000 on average. And the hours add up quickly.

The 2019 IT Outage Impact Study found that the typical organization experienced 10 brownouts (where infrastructure or software performs at a degraded level) or outright outages over the past three years. Those 10 incidents easily add up to millions of dollars.
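The arithmetic behind that claim can be sketched in a few lines of Python. The hourly cost and the incident count come from the figures cited above; the one-hour duration per incident is purely an assumption for illustration, since the average length of an outage is not given here.

```python
# Figures from the text: more than $300,000 per hour of downtime,
# and 10 brownouts or outages over three years.
COST_PER_HOUR_USD = 300_000
INCIDENTS = 10

# Assumption for illustration only: each incident lasts one hour.
ASSUMED_HOURS_PER_INCIDENT = 1.0


def estimated_downtime_cost(incidents: int, hours_per_incident: float,
                            cost_per_hour: float) -> float:
    """Return a rough downtime cost: incidents x hours x hourly cost."""
    return incidents * hours_per_incident * cost_per_hour


total = estimated_downtime_cost(
    INCIDENTS, ASSUMED_HOURS_PER_INCIDENT, COST_PER_HOUR_USD
)
print(f"Estimated three-year downtime cost: ${total:,.0f}")
```

Even under this conservative assumption the estimate reaches $3 million, and it grows in direct proportion to how long each incident actually lasts.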

Not surprisingly, then, 80% of companies reported that the performance and availability of their IT infrastructure top their list of concerns. More than half worry about experiencing an outage so devastating that it will make the mainstream news. And if such an event occurs, 53% expect that heads will roll and that someone will lose his or her job.

And as much as it would be nice to simply automate responses to IT issues, “Incident response needs people, because successful incident response requires thinking,” wrote Bruce Schneier, in his blog, Schneier on Security, back in 2014. What you need: an IT (major) incident management team with clearly defined roles and responsibilities, trained to fulfill those responsibilities by following a crisis-proven process while effectively communicating with managers, customers and subject-matter experts alike.

The human side to outages

Herein lies the problem. Nearly half (47%) of respondents to a SANS survey said that staff and skills shortages were their greatest challenge to responding effectively to incidents. Indeed, the Uptime Institute’s 2019 study now calls the IT staffing problem a crisis. Sixty-one percent of respondents said they had difficulty retaining or recruiting staff, up from 55% the previous year.

This matters because 60% of organizations believe that their most recent significant downtime event was preventable. If they had better management, processes, or configurations, the outage could have been avoided, they say. For outages that cost more than $1 million, this figure leapt to 74%.

“By under-investing in training, failing to enforce policies, allowing procedures to grow outdated, and underestimating the importance of qualified staff, management sets the stage for a cascade of circumstances that leads to downtime,” wrote Kevin Heslin, chief editor of the Uptime Institute Journal, in a September 2019 blog post about the survey.

Staffing the IT incident management team

An incident is any unexpected event that disrupts the normal operation of an IT service. IT incident management is the area of IT service management (ITSM) concerned with returning the service to normal as quickly as possible. Many IT incident management teams use established ITSM frameworks such as the Information Technology Infrastructure Library (ITIL®) or COBIT. Others use a combination of proprietary best practices established over time.

Here are some of the most common IT incident management roles to hire and train for.

(Major) Incident managers

These people need to be “in control.” When something goes wrong, they provide immediate structure and leadership, and they are ultimately responsible for bringing services back to normal.

  • Acts as the central command for an incident
  • Facilitates the process, end-to-end
  • Manages involvement of resources
  • Drives the issue-resolution process and tasks SMEs with specific analyses
  • Produces incident reports
  • Performs a post-mortem on critical incidents
  • Adds incidents to an ongoing knowledgebase of incidents and solutions
  • Oversees all the processes involved in the designated incident management workflow
  • Ensures that incidents are resolved to the point that designated SLAs are met

Process owners

This person is responsible for the overall incident response process, including modifying it when necessary to make sure it’s aligned with business goals.

  • Delineates key performance indicators (KPIs) for determining how operations should function normally
  • Makes sure KPIs meet business goals
  • Designs, documents, reviews, and improves processes
  • Continuously learns from incidents to adjust any aspects of the process to meet overarching business goals

Tier 1 service desk personnel

The Tier 1 service desk is the first point of contact when anyone (a user, customer, manager, or anyone else in the organization) reports an incident. It is staffed by people with a basic but broad working knowledge of the most common IT issues, such as password resets or printer problems, as well as solutions to known issues.

  • Does initial data gathering, assessment and diagnosis of any service report
  • Acts immediately to restore a failed IT service as quickly as possible
  • Escalates any issues that can’t be resolved immediately to the Tier 2 service desk
  • Records all service requests and resolution steps taken
  • Keeps the person who reported the incident informed about its status

Tier 2 support personnel

This level is typically staffed with people who have advanced knowledge of specific systems. Requests generally come when Tier 1 personnel escalate an issue that they can’t resolve.

  • Act as subject matter expert on a particular system, software, or technology
  • Diagnose the issue
  • Conduct RCA (root cause analysis)
  • Record everything done to resolve the incident for the knowledgebase
  • If the incident is resolved, confirm the resolution with the person who reported it
  • If the incident is unresolved, escalate it to Tier 3 and/or engineering
  • Deliver subject matter expertise

Conclusion

According to the 2019 IT Outage Impact Study, the top two missed opportunities to avoid outages were failing to identify when systems were near capacity and failing to identify when the performance of critical hardware, software, or network components was slowly but steadily degrading.
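Both signals named in the study lend themselves to simple checks. As a minimal sketch only, the Python below flags a resource that has crossed a utilization threshold and a metric whose readings are trending upward. The 85% threshold, the slope of 1 ms per interval, the function names, and the sample readings are all assumptions invented for illustration; none of them come from the study.

```python
def near_capacity(utilization: float, threshold: float = 0.85) -> bool:
    """True when utilization (0.0 to 1.0) has reached the alert threshold."""
    return utilization >= threshold


def trend_slope(samples: list[float]) -> float:
    """Least-squares slope of evenly spaced samples (change per interval)."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def steadily_degrading(latencies_ms: list[float], min_slope: float = 1.0) -> bool:
    """True when response time rises by at least min_slope ms per interval."""
    return trend_slope(latencies_ms) >= min_slope


# Hypothetical readings, invented for illustration.
disk_utilization = 0.88
weekly_latency_ms = [120.0, 124.0, 131.0, 135.0, 142.0, 149.0]

print("Near capacity:", near_capacity(disk_utilization))
print("Steadily degrading:", steadily_degrading(weekly_latency_ms))
```

Checks like these only help if someone is assigned to review them and act on the result.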

These missed opportunities are primarily people issues, which can be resolved by putting robust but scalable processes and practices in place and training your IT staff to apply them. Questions to ask yourself when putting together your incident management team include:

  • Are you building IT capacity faster than you are hiring the people to manage it?
  • Are you having difficulty hiring and retaining skilled IT workers?
  • Are your IT training and education programs suffering from lack of budget?

As systems are only getting more complex, especially with cloud entering the picture, outages are going to continue. But many can be avoided, and the others can be fixed much more quickly, by putting resources behind having the right skilled employees in the right positions, following proven best practices and processes.

Read more about how to manage major incidents

Achieving Service Excellence in Major Incident Management

Major Incident Essentials: Communication and Effective Action. Help! What do we do now?

About Kepner-Tregoe

Kepner-Tregoe has been the industry leader in problem-solving and service-excellence processes for more than 60 years. The experts at KT have helped companies raise their level of incident- and problem-management performance through tools, training and consulting – leading to highly effective service-management teams ready to respond to your company’s most critical issues.