By Christoph Goldenstern, Kepner-Tregoe

Planning for failure seems like a pessimistic attitude, and some might caution you about creating a self-fulfilling prophecy. Planning for failure, however, is precisely what you must do if you want to achieve service excellence. The reality of technology is that it is prone to failure, and that won’t change, especially as the complexity of IT increases. Unfortunately, most IT professionals focus on designing systems and services for best-case scenarios. What if, instead of considering component failures as exceptions, we considered them normal and treated the brief periods when everything works fine as the exception?

Redefining “Normal”

If you were to survey the IT operations of any major company and ask, “What percentage of the time are all your IT systems up and running, functioning normally?”, the answer from most companies would be less than 15% of the time. This means that more than 85% of the time, something is not working as expected. Perhaps it is time to redefine “normal.” It is important to note that this question doesn’t account for the scope or impact of an outage, only whether a component, system or process isn’t working correctly. This is an important distinction, because it’s a clue to the actual service excellence opportunity. A broken component, for example, doesn’t necessarily mean the service is unavailable.
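A quick back-of-the-envelope calculation suggests why "everything up at once" is so rare. The sketch below assumes that components fail independently and that each one is up 99% of the time; both the independence assumption and the numbers (99%, 200 components) are hypothetical, chosen only for illustration, and do not come from any survey.

```python
def all_up_probability(component_availability: float, component_count: int) -> float:
    """Probability that every component is up at the same moment.

    Assumes components fail independently of one another, which is a
    simplification; real failures are often correlated.
    """
    return component_availability ** component_count


# Hypothetical estate: 200 components, each individually up 99% of the time.
probability = all_up_probability(0.99, 200)
print(f"All components up simultaneously: {probability:.1%}")  # roughly 13%
```

Even with every individual component looking healthy on paper, the chance that all of them are working at the same moment falls below 15% in this example. The more components an environment has, the smaller that figure becomes.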

Creating Services Designed To Fail (and Still Be Okay)

What if companies were to implement services that were designed to fail, assuming that components would break and that processing anomalies would happen? Could these services be architected in such a way that failures could occur and repairs could be made without impacting performance and availability for users? To both questions the answer is “yes.” It is possible to create services designed to fail, and some of the leading companies in the world are doing it today. The key lessons they have learned are to mitigate critical dependencies, to reduce the scope of releases (shipping small changes on a continuous basis instead), and to reduce the number of places where a single component can impact the whole system. Redundancy is essential.
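As a minimal sketch of what redundancy can look like in practice, the example below tries a list of redundant replicas in order and returns the first successful response, so a single broken replica does not take the service down. The names `call_with_failover`, `AllReplicasFailed`, `broken_replica` and `healthy_replica` are hypothetical and exist only for this illustration; a production design would also need health checks, timeouts and monitoring.

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised only when every redundant replica has failed."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful response."""
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # a failed component is treated as normal
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed")


def broken_replica() -> str:
    # Stand-in for a component that is currently down.
    raise ConnectionError("replica down")


def healthy_replica() -> str:
    # Stand-in for a component that is working.
    return "ok"


# The first replica is broken, yet the caller still gets an answer.
print(call_with_failover([broken_replica, healthy_replica]))  # prints "ok"
```

The design choice here is that failure of one component is handled inside the service rather than surfaced to the user; the service is unavailable only if every replica fails at once.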

The architectures necessary to deliver services designed to fail are also designed to apply “hot fixes,” eliminating the need to take the service offline to make changes. If you can solve problems and make changes without taking the system offline, then 100% availability is possible.

Driving for Service Excellence

Service excellence isn’t just fulfilling SLAs and minimum expectations. It is giving your business and your users the services they need, not just the minimum levels of quality and performance they demand. Modern technology architectures can enable your company to strive for more than “good enough” and to set its sights on services that just work – 100% of the time. It all starts with changing the way you think about failure: treating it as the new normal and planning for it.
