By Christoph Goldenstern, Kepner-Tregoe
If a tree falls in the woods and no one is there to hear it, does it make a sound? Major incidents and technology outages happen every day, but few make the news or cause actual customer dissatisfaction. This is not because customers are numb to technology issues or have low expectations, and it is not because the incidents that occur aren’t major issues for companies. The reason you don’t hear about most major incidents and outages is that service providers and company IT departments are increasingly aware of the importance and impact of managing these situations and are taking preemptive action to make them non-issues. Some of the actions your company can take to keep your outages out of the news include:
Design services for resiliency – Technical issues and component outages will happen. A well-crafted service designed for resiliency includes capabilities for redundancy, monitoring, diagnosis and impact mitigation so that the service remains available to the end-user even if one or more components fail. Companies are increasingly adopting new architectures and technologies with built-in resiliency capabilities and actively analyzing legacy systems to assess vulnerability and risk.
Mitigate the impact to users – Even the best-designed services aren’t perfect, and because they depend on people and technology, they are susceptible to failure. A failure or event does not necessarily mean that the service will be unavailable to users. In many cases, companies can mitigate the impact to users through secondary processes and work-arounds that enable partial service availability, where critical features or full functionality operates at degraded performance levels. This partial service availability should be assessed and triggered via a rigorous (Major) Incident Management process to ensure the actions are effective and do not create secondary incidents.
Manage external visibility – The duration and impact of the service outage are critical in determining whether external parties are aware that a critical situation is occurring. The other major factor is how (and if) your company communicates with external stakeholders about the incident. As with the tree in the woods, most external parties will be unaware of the outage unless someone tells them about it. In some situations, contractual requirements mandate notification. Extended outage periods and/or significant end-user impact increase the likelihood that external parties become aware that an outage is occurring. If in doubt, communicate proactively. In these situations, communications should provide clear, specific and data-based updates on the most critical situation, impact, cause and resolution information captured during the incident-handling process, assuring stakeholders that the company has control of the situation and that a robust process is in place.
Restore services first – Because of the normal (expected) performance variability of technology, users are often unaware that an outage is occurring. It is important to differentiate between resolving the outage or issue and restoring service to users. Users are only aware of service availability, not the status of underlying components. If services to end-users are restored quickly, they may never become aware of the issue. Resolving the underlying issue often follows a separate timeline. This requires responders to understand whether they are in incident management mode or problem management mode.
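For readers who think in code, the ideas of redundancy, impact mitigation and restoring service ahead of resolving the underlying issue can be illustrated with a minimal sketch. It is only an illustration under stated assumptions, not a recommended implementation: the names serve, Response, failing_primary, healthy_secondary and cached_workaround are all hypothetical, and a real service would add monitoring, timeouts and alerting around each step.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Response:
    body: str
    degraded: bool  # True when the work-around answered instead of a full component


def serve(request: str,
          components: Sequence[Callable[[str], str]],
          workaround: Callable[[str], str]) -> Response:
    """Try each redundant component in order, then fall back to a work-around."""
    for component in components:
        try:
            return Response(body=component(request), degraded=False)
        except Exception:
            # A component failed. The user-facing service carries on with the
            # next one; diagnosing the failure is a separate, later activity.
            continue
    # Every component failed: keep the service partially available.
    return Response(body=workaround(request), degraded=True)


# Hypothetical components, for illustration only.
def failing_primary(request: str) -> str:
    raise ConnectionError("primary is down")


def healthy_secondary(request: str) -> str:
    return f"full result for {request}"


def cached_workaround(request: str) -> str:
    return f"cached result for {request}"


if __name__ == "__main__":
    print(serve("order-42", [failing_primary, healthy_secondary], cached_workaround))
    print(serve("order-42", [failing_primary], cached_workaround))
```

In the first call the user receives a full answer even though the primary component has failed. In the second, the service stays partially available through the work-around, and the degraded flag gives the incident team something to monitor while the root cause is investigated on its own timeline.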
Most of these actions are facilitated through an effective Major Incident Management process that enables company staff to be prepared, show situational awareness and be responsive and decisive when a critical issue or outage occurs. Major incidents must be addressed differently than normal day-to-day operational incidents because of the impact to users and the risk they pose to the business. As a part of your overall Service Excellence program, consider reviewing your Major Incident and Risk Management processes in addition to the design of the services you provide. With an effective strategy that is executed well, your service outages won’t become a news item and your end-users will be happy and productive.
Kepner-Tregoe is the industry leader in Problem Solving and Service Excellence processes for Operations and IT. With more than 60 years of experience working with organizations across industries and geographies, the experts at KT understand what is required to take your processes from effective to high-performing.