By Shane Chagpar, Kepner-Tregoe
The widespread internet outage last week impacting service providers and customers across the US is a harsh reminder of how important robust service management processes are in a connected digital world. In the incident last week, a routine technical change by infrastructure provider Level 3 resulted in a backbone network outage that disrupted service for major internet service providers including Comcast (Xfinity), AT&T and Verizon. The downstream impacts were so widespread that they were felt by most internet users across the US.
The disturbing part of the Level 3 incident isn’t that something went wrong – technology is fragile and will fail from time to time. It is that this isn’t the first time a configuration change has caused this kind of incident (a similar thing happened at Level 3 in 2015), and that as recently as last year, following the Dyn DDoS attack, security experts warned the industry of the risks and potential impacts of incidents like this at core infrastructure providers. Considering these two past incidents, one would expect Level 3 to have a comprehensive picture of their services’ dependencies on people, processes and technical components, as well as an understanding of the vulnerabilities that pose a significant risk to service continuity.
The fact that this outage happened is a good indicator that one (or more) of the service management processes at Level 3 has a deficiency; the likely candidates are change management, problem management and risk management. Additionally, the length of the outage and its widespread downstream impacts indicate that there may be opportunities to improve incident management and/or service design as well. Without an insider’s perspective, it is difficult to say which of the service management processes was the true root cause of the outage (the folks at Level 3 will be looking at that over the coming months). Those of us looking at this from the outside should be asking “what can we learn from this situation to keep it from happening at our company?”
The lesson that we can (and should) learn from the Level 3 outage is that “service management processes are not independent” – when something happens with one of them, you will need to look at the holistic system to identify where corrective and preventative actions should be applied. The 2015 incident should have alerted Level 3 to the potential for manual changes to cause a service disruption – prompting them to put extra controls in their change management processes to prevent human mistakes in network reconfiguration. The DDoS attack last year on competitor Dyn’s infrastructure should have prompted a review of Level 3’s service design and incident management processes to identify risks associated with common dependencies (points of failure).
Could the problem-solving processes from Kepner-Tregoe have helped Level 3 avoid this outage? We think so. I’m sure Level 3 has great people, solid processes and safeguards in place, but are they really prepared to handle situations like this? Breaking down a complex series of interdependencies to understand not only what caused the outage but also how to prevent something similar from happening again can be a daunting challenge. As the leader in problem-solving and service excellence for over 60 years, the experts at Kepner-Tregoe know that companies can (and should) do better when it comes to managing change and deriving lessons learned from major incidents. The recent outage provides an opportunity for Level 3 (and other companies observing) to review their service management processes with a critical eye towards understanding dependencies, processes and risk.