By Christoph Goldenstern, Kepner-Tregoe
In IT service management, incidents and problems aren’t independent of each other. They are different dimensions of the same issue – opposite sides of the same coin. Incidents are the immediate impacts of the issue on normal business operations. Incident managers are challenged to understand the issue, act to mitigate its impact and restore service as quickly as possible.
Problems are the long-term impacts of the issue and the risks it poses to sustaining normal business activities in the future. Problem managers are challenged to investigate the root cause of the issue, along with the related dependencies and environmental factors that could lead to it recurring. This investigation also helps them determine the level of risk (impact and likelihood) the issue carries if the root cause is not addressed, and whether some form of corrective action (a long-term fix) is more appropriate than simply acknowledging and accepting the risk.
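The choice between a long-term fix and accepting the risk can be made explicit with a simple score. The sketch below is only an illustration of that reasoning: the 1-to-5 scales, the impact-times-likelihood product and the acceptance threshold are assumptions borrowed from a common risk-matrix convention, not a method described in this article.

```python
def risk_score(impact: int, likelihood: int) -> int:
    """Combine impact and likelihood, each rated on an assumed 1-5 scale."""
    for name, value in (("impact", impact), ("likelihood", likelihood)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return impact * likelihood


def recommend(impact: int, likelihood: int, accept_below: int = 6) -> str:
    """Suggest accepting the risk only when the score is under the threshold.

    accept_below is a hypothetical cut-off; each organization sets its own.
    """
    if risk_score(impact, likelihood) < accept_below:
        return "accept risk"
    return "corrective action"


# A severe, fairly likely issue versus a minor, occasional one.
print(recommend(impact=5, likelihood=4))
print(recommend(impact=1, likelihood=3))
```

In practice the threshold and the rating scales would come from your organization's own risk policy; the point is only that both impact and likelihood feed the decision.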
Why incident and problem management are separate in most organizations
If incident and problem management are both addressing performance issues, then why are they separate functions in the standard ITSM paradigm? The answer lies in the perceived conflicting priorities of these two functions. Incident management is focused on immediate impacts, not what could happen in the future. Your company needs incident managers to maintain this focus to prevent prolonged business disruptions and outages that could cause major damage to productivity or reputation.
Problem management, by contrast, usually takes time to do thoroughly and correctly, at least when the cause of the problem is genuinely unknown. You don’t want problem managers to rush to conclusions and miss a key detail.
Sharing data between incident and problem management
Even though incident and problem management are separate functions in many IT organizations, it is imperative that they share information and insights to create an end-to-end information flow. Incident managers are on the front lines when the issue is occurring. They see what is happening with the underperforming system or application. They see the impacts the issue is having on the business. They also see other events in the environment while the incident is occurring. This last point is especially important. Most incidents (and problems) are caused by a change, an action or some other occurrence involving a service component or dependency.
These factors are often difficult (if not impossible) to identify once service has been restored. As a result, incident managers are the “eyes and ears” of the problem-management function – collecting valuable information needed to aid in root-cause analysis and problem diagnosis. Rather than looking at incident and problem management as two functions, we should look at them as a continuum.
It is also a fallacy to believe that incident managers (especially major incident managers) never need to initiate, and sometimes even lead, some form of root-cause analysis. They do when the existing restoration options are ineffective or simply too risky to undertake without knowing the cause first. Handing this task off to problem management during a live incident can prolong the outage and lead to a lot of “ping-pong” behavior.
Problem managers, on the other hand, are the keepers of the “known-issues list” on which incident managers rely to diagnose issues and identify potential short-term fixes quickly. Even after problem managers identify the root cause of an issue and the remediation steps for a long-term fix, considerable time is often required to implement the corrective actions and fully resolve the issue. During this period, the company is at risk of the issue recurring and causing further business disruption. It is essential that problem managers give incident managers clear instructions about what to do if the issue recurs, so that renewed business disruption is avoided. If additional data collection is needed during a recurrence, incident managers must know that too, so they can collect the data before implementing a corrective action that could destroy or distort the clues problem management needs.
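One way to picture this handoff is as a known-issues entry that carries both the workaround and the evidence still needed, with the evidence always collected first. The following is a minimal sketch under stated assumptions: the `KnownIssue` fields, the `recurrence_runbook` helper and the sample values are all hypothetical and are not taken from any particular ITSM tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KnownIssue:
    """One hypothetical entry in a known-issues list."""
    issue_id: str
    symptoms: str
    workaround: str
    # Evidence problem management still needs from the next recurrence.
    collect_before_workaround: List[str] = field(default_factory=list)


def recurrence_runbook(issue: KnownIssue) -> List[str]:
    """Order the steps so evidence is captured before service is restored."""
    steps = [f"Collect: {item}" for item in issue.collect_before_workaround]
    steps.append(f"Apply workaround: {issue.workaround}")
    return steps


# Made-up example values, for illustration only.
issue = KnownIssue(
    issue_id="KI-001",
    symptoms="Checkout API times out under load",
    workaround="Restart the connection pool",
    collect_before_workaround=["thread dump", "recent change records"],
)
for step in recurrence_runbook(issue):
    print(step)
```

Putting the data-collection steps ahead of the workaround in the generated list reflects the point above: restoring service first may erase the clues needed for diagnosis.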
Incident and problem managers working together as a team
Companies that achieve high levels of service excellence know that incident and problem managers must work together as a team, as part of an end-to-end support process, to achieve the shared goal of minimizing business disruptions and their impact.
Kepner-Tregoe has been the industry leader in problem-solving and service-excellence processes for more than 60 years. The experts at KT have helped companies raise their level of incident- and problem-management performance through tools, training and consulting – leading to highly effective service-management teams ready to respond to your company’s most critical issues. To learn more, visit www.kepner-tregoe.com.