By Steve White

What keeps a gazelle awake at night? It might be the thought of the lurking crocodiles that inhabit rivers and waterholes waiting to pounce without warning. The wise gazelle avoids lingering at the edge of the herd and hopes that the crocodile numbers are low.

Staying in the middle of the herd is important for survival. In IT support, we recognize the effect of this survival instinct when a new piece of software is released. Early adopters will load it and play with it, but few will immediately use it as a core business tool. The clever gazelles wait until the waters have been tested. Clever gazelles also know to keep up and not become stragglers. The dawdling gazelles are at risk, using mission-critical applications that the vendor has ceased to support.

Without vigilance, it is easy to become vulnerable and separated from the herd. Forging ahead without clear risk management increases vulnerability: load newly released, untested code onto production equipment, or put untested hardware into a production environment, and the crocodiles start to circle. Falling behind happens when systems are not updated or when exotic solutions are used: running software or hardware that is no longer supported is penny-wise and pound-foolish. In addition, integrating hardware and software into a one-of-a-kind system, or changing core code to make it unique, can rob you of the protection of “the herd.” Vulnerability also increases with exotic loads or profiles that work the system beyond its capabilities, and with extreme tuning of software and firmware parameters for a given application.

Diagram 1 illustrates how these risky actions make IT organizations vulnerable. Once on the edge of the herd, it’s easy to be picked off by the lurking latent crocodiles.

Diagram 1: Risky Actions That Separate Organizations from the Herd

Unfortunately, just being in the middle of the herd–using standard configurations and software, staying up-to-date and within performance tolerances–is still no guarantee of survival. Reducing the number of hungry crocodiles is the real key to survival.

The very worst IT incidents we see, from our perspective as consultants, result from undiagnosed problems and poorly completed changes. Bringing undiagnosed problems together in just the right way can be miraculous (in a bad way) and will cause catastrophic failure.

For example, a global Fortune 500 company that, like everyone else, uses IT systems to receive orders, plan manufacturing, schedule deliveries and issue invoices, all on current hardware and popular software, lost the ability to know what to manufacture, ship and invoice for about three weeks. The incident did not reach the media, as it was handled well from a PR perspective, and the company continues to thrive. However, for three weeks the crocodiles were in the middle of the gazelles, acting in uncoordinated concert to bring the core IT systems down.

Pest control, that is, reducing the number of crocodiles, reduces the number of opportunities for them to mindlessly conspire to hurt you. But where do they lurk? They are waiting to pounce in your undiagnosed backlog of IT problems.

The higher the number of undiagnosed IT problems you have, the greater the opportunity for one, two or many of them to interact in some interesting way with an innocent change and bring your system down. Organizations that find the root causes of IT problems have a mathematically better chance of IT stability than those with undiagnosed problems. Problems that are both lurking (you know about them; they are in a queue somewhere, in a mass of uncontrolled changes or hiding in poor housekeeping) and latent (not affecting anything at the moment) eventually conspire to do unanticipated damage.
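One way to see why the size of the backlog matters is to count how many combinations of problems could possibly interact. The sketch below is illustrative only. It assumes, purely for the sake of argument, that every pair of undiagnosed problems has the same small, independent chance of interacting harmfully; the function names and the 1% figure are hypothetical and are not drawn from any measured data.

```python
from math import comb


def interaction_count(n_problems: int, max_group: int = 4) -> int:
    """Count the distinct groups of 2..max_group problems that could interact."""
    largest = min(max_group, n_problems)
    return sum(comb(n_problems, k) for k in range(2, largest + 1))


def chance_of_bad_interaction(n_problems: int, p_pair: float) -> float:
    """Chance that at least one pair interacts harmfully.

    ASSUMPTION: each pair interacts independently with probability p_pair.
    Real faults are not independent, so treat this as a rough illustration.
    """
    pairs = comb(n_problems, 2)
    return 1.0 - (1.0 - p_pair) ** pairs


if __name__ == "__main__":
    # The four latent faults in the case study below form 11 possible groups.
    print(interaction_count(4))  # 11

    # Hypothetical 1% chance per pair: watch the risk climb with backlog size.
    for backlog_size in (4, 10, 25, 50):
        risk = chance_of_bad_interaction(backlog_size, p_pair=0.01)
        print(backlog_size, round(risk, 3))
```

The number of pairs grows roughly with the square of the backlog size, so under these assumptions halving the backlog cuts the number of possible pairwise interactions by about three quarters.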

Case Study. Problems can randomly come together to cause prolonged IT outages. After Company A bought out a competitor, product lines needed to be integrated. Working with suppliers, Company A specified the hardware and software required and a project plan was created to implement the change. Unknown at the time, deeply buried in a backlog of undiagnosed problems, were four existing faults with the current production system, none of which were causing problems and so were not on the minds of the support staff. These included:

  • A slow database queue processing job (existing now for six months)
  • Slow logical input/output to a shared data storage device on other systems not obviously related to this one (logged with another part of the infrastructure several weeks ago)
  • A firmware upgrade to the data storage interconnect that did not apply correctly (made some weeks ago)
  • Database monitoring tools that occasionally stopped recording (ongoing for a year)

These problems had been logged and were awaiting action by either the supplier or staff.

The software upgrade and the required hardware installation went perfectly. The system resumed production, but no one had checked the performance overhead to be expected from the change. This was a very big crocodile.

Diagram 3

The load on the system was increased smoothly, one factory at a time, to ensure that each step was controlled. But two weeks after beginning this process a tipping point was reached and the system flipped from free flow to turbulence: from taking 20 hours to process a day’s work to taking 60 hours. The consequences were rapid and severe. Business managers began screaming that the business was dying. They severed factories from the batch jobs and rescheduled production runs from every day to once a week. Some depots had to invent from experience what customers were likely to order, and only the heroic actions of huge numbers of staff kept the business running without its IT systems.
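The arithmetic behind that tipping point can be sketched in a few lines. This is a simplified illustration, not a model of Company A's actual system: it assumes a constant daily workload and a hypothetical 24 hours of processing capacity per day, and the function name is invented for the example.

```python
def backlog_after(days: int, hours_needed_per_day: float,
                  hours_available_per_day: float = 24.0) -> float:
    """Unprocessed work, in processing hours, after the given number of days.

    ASSUMPTION: workload and capacity are constant from day to day.
    """
    backlog = 0.0
    for _ in range(days):
        # Each day adds a day's work and removes whatever capacity can clear.
        backlog = max(0.0, backlog + hours_needed_per_day - hours_available_per_day)
    return backlog


if __name__ == "__main__":
    print(backlog_after(7, 20))  # 0.0   (a day's work fits inside a day)
    print(backlog_after(7, 60))  # 252.0 (36 hours of work left over every day)
```

At 20 hours per day the system keeps up indefinitely. At 60 hours per day it falls a further day and a half behind every day, which is why daily production runs could not be sustained.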

Returning to the previous configuration was only possible if two weeks’ worth of invoices were sacrificed. The decision was made to forge ahead using the new configuration. During this process the latent lurking crocodiles were discovered. Not all crocodiles were immediately malicious–the database monitoring tool had simply stopped two weeks before, and so the problem-solving effort was extended by the lack of that information. The lurking latent crocodiles had been out there waiting, unobserved, to come together in a single, calamitous event.

How to survive
Clearly, there are lessons to be learned from mistakes. Staying in the middle of the IT crowd is a strategic IT decision to make. But reducing the likelihood of undiagnosed faults conspiring against you is rarely addressed with enough vigor. How many undiagnosed cases are in your IT support backlog? If you are clearing them away quickly and effectively and if you have plans to handle the interim fixes and the corrective actions for the ones that are genuinely hard to solve, all is well.

Most support organizations backlog large numbers of problems or routinely close cases without finding the root cause, lining their future with crocodiles.

In our engagements with clients who initially have a large backlog, we work with them to analyze the current state, calculate anticipated savings in time and money, identify leverage points, and complete a structured, well-managed implementation of good-quality issue-handling processes. This builds a better support organization with more effective work processes and more highly motivated engineers. In addition, there are fewer lurking latent crocodiles waiting, watching and ready to pounce.

The KT problem solving approach is used worldwide for root cause analysis and to improve IT stability.