Proactive Problem Management

By Christoph Goldenstern, VP of Innovation and Service Excellence, Kepner-Tregoe

What problems cause your organization the greatest aggravation and expense? Barring any existential business-threatening cataclysms you may have suffered, you’re likely to agree that they are the recurring incidents — essentially the same types of problems happening over and over again. 

These incidents typically are triaged in the short term with workarounds or “patches”, but never fully resolved. The organization has too many competing priorities; it never gets to the “structural causes” and never puts in place the measures to deal with them long-term. 

Too often, the approach to problem management is reactive, assuming we even get out of incident management mode. But it doesn’t have to be: problem management can be reactive, proactive, or both. Proactive Problem Management (PPM) is still a highly underrated part of ITIL®, but organizations are starting to adopt it, along with the tools and processes that support it.

Proactive Problem Management (PPM) takes a holistic view of incidents and aims to prevent future ones by identifying and eliminating their systemic root causes before those incidents occur or have an impact.

Visualizing the Problem

PPM has at least three major dimensions:

  • Monitoring – Business processes and technology designed to detect and record anomalies and head off incidents before they have an impact
  • Post-major-incident investigation – Getting to the underlying systemic issues that cause recurring incidents
  • Continual Service Improvement (CSI) – Looking at issues beyond the technical aspects: what does an incident tell us about our value streams, practices or services? What can we do better?

A critical factor in establishing a proactive problem management discipline is communication, so that everyone involved in IT support – those who deal with major incidents directly and those who must understand them from a more remote, strategic perspective – can understand what is going on.

A typical incident review, held either while a problem is occurring or as a post-mortem after resolution, is run as a facilitated conference. The facilitation document may run to five or six pages and essentially records the entire timeline of the incident. In such dense, detailed text, it can be difficult to separate the relevant from the irrelevant and the important from the unimportant.

A simpler and more practical way for the individuals to communicate is through a graphical representation called an Incident Map. An incident map is the product of a process that systematically describes the incident in terms of:

  • The main incident/problem and its impact
  • The cause-effect chain(s) that triggered it
  • The circumstances contributing to the incident’s effect: why the impact was or wasn’t as bad as it could have been
  • Barriers that have been breached (proven ineffective): measures that could have interrupted the cause-effect chain, and why they didn’t work
  • Actions that were taken
  • Actions that are proposed, chosen or implemented to prevent the issue’s recurrence
  • Actions to mitigate risks from suggested changes (so as to not cause new, separate incidents!)
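The elements listed above can also be captured as a simple data structure, which makes an incident map easier to store, share and query. The following is only a minimal sketch in Python: the class names, field names and status labels are assumptions made for illustration, not part of any published incident-mapping standard, and the example incident is invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Barrier:
    """A measure that could have interrupted the cause-effect chain."""
    name: str
    why_it_failed: str


@dataclass
class Action:
    """An action on the map. The status labels are hypothetical."""
    description: str
    status: str  # "taken", "proposed", "chosen" or "implemented"
    risk_mitigations: List[str] = field(default_factory=list)


@dataclass
class IncidentMap:
    """Hypothetical container for the elements of an incident map."""
    incident: str
    impact: str
    cause_effect_chain: List[str] = field(default_factory=list)  # root cause first
    contributing_circumstances: List[str] = field(default_factory=list)
    breached_barriers: List[Barrier] = field(default_factory=list)
    actions: List[Action] = field(default_factory=list)

    def preventive_actions(self) -> List[Action]:
        """Actions aimed at preventing recurrence (everything not already 'taken')."""
        return [a for a in self.actions if a.status != "taken"]

    def unmitigated_actions(self) -> List[Action]:
        """Preventive actions that have no recorded risk mitigation yet."""
        return [a for a in self.preventive_actions() if not a.risk_mitigations]


# Invented example data, for illustration only.
example_map = IncidentMap(
    incident="Checkout service outage",
    impact="Customers could not place orders",
    cause_effect_chain=[
        "Disk filled up",
        "Database stopped accepting writes",
        "Checkout requests failed",
    ],
    contributing_circumstances=["Outage began outside peak hours, which limited the impact"],
    breached_barriers=[Barrier("Disk-space alert", "Alert threshold was set too high")],
    actions=[
        Action("Freed disk space and restarted the database", "taken"),
        Action("Lower the disk-space alert threshold", "proposed",
               ["Review alert volume to avoid alert fatigue"]),
        Action("Add automatic log rotation", "proposed"),
    ],
)

for action in example_map.unmitigated_actions():
    print("Needs a risk review:", action.description)
```

Keeping risk mitigations attached to each proposed action makes it easy to spot changes that have not yet been checked for side effects, which is the purpose of the last item in the list.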

Organizations have begun incorporating proactive problem management and incident mapping into their post-incident reviews and, in some instances, into the management of ongoing incidents. The mapping methodology is also increasingly used in “Pre-Mortem” analysis, a type of proactive problem management that takes place before an incident, while a major change is being made, in which the stakeholders ask themselves, “If we make this change, what could go wrong?”

Pre-mortem risk assessment is not a new concept. Operations thought leaders have been writing about it for at least a dozen years. But it has only recently begun to take hold in IT support circles.

A Strategic Necessity

The risk-management benefit of proactive problem management is self-evident. It isn’t even necessary to quantify the returns to see the value in preventing major incidents before they happen. So why isn’t every organization committed to PPM?

There are multiple reasons, but probably the most important barrier is cultural. In infrastructure support circles, prevention isn’t sexy. We idolize heroic problem-solvers, the ones who put out the big fires. We can put a dollar figure on the impact to the business when someone relieves pain we’re actually feeling. It’s much harder to quantify the value of heading off a problem that hasn’t ever happened, even when it’s obvious that it would be catastrophic if it ever did. When organizations have asked support managers to invest time and budget in proactive problem management, those managers haven’t always seen it as a plum assignment.

But in the last 15 to 20 years, a subculture within the support community has become increasingly concerned about the complexity of infrastructures. In 1997, before networks became the vast, data-saturated and interconnected ecosystems they are now, it made sense to wait for things to break and then fix them. Systems today, however, are so large and complex that chaos effects make major outages unpredictable but inevitable. Proactive Problem Management has become a strategic necessity.

If you’re convinced your support process needs to become more proactive, a good first step is to level up your communications for Pre- or Post-Mortem incident reviews. We’ve just published a helpful white paper on Incident Mapping and PPA. 

Get your copy of “Put an end to firefighting with Proactive Problem Management”

About Kepner-Tregoe

Kepner-Tregoe has been the industry leader in problem-solving training and consulting for more than 60 years. Its experts have worked with companies large and small to implement best practices and techniques for solving problems quickly, tracking them consistently and making informed decisions.