A Closer Look at How Problem Managers Measure Performance
Understanding how analysts and engineers work on problems, find root causes and follow up with appropriate actions sounds like an easy task. With access to the application used for documenting ITIL Problem Management, anyone can read the case content. All it seems to require is access to the case management tool and some skill in using it.
However, asking Problem Managers how they handle problems will typically uncover the actual procedures that describe the steps they take when finding and working on problems. These documented processes and procedures are most helpful when expectations are clearly set for the steps needed to progress problems that require attention.
Reading problem tickets or asking Problem Managers how they fill in the steps in the procedures seems a logical next step to find out more about how value is created in Problem Management. This is where information is really gathered, data is analyzed and conclusions are drawn. So, how does the performance of Problem Management get measured? Many organizations seem to measure timing-related parameters around problems, or count the number of problem tickets in a given state. Examples include:
- Number of open problem tickets (backlog) per group of applications, considered over time
- Average age of open problem tickets, often considered over time
- Average time to find root cause in problem tickets
- Number of recurring problems
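To make these measures concrete, here is a minimal Python sketch of how they might be computed from exported ticket data. The `ProblemTicket` record and all of its field names are hypothetical and not taken from any particular case management tool; a real export will look different. Note that every calculation needs only dates and states, never the content of the ticket.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ProblemTicket:
    # Hypothetical fields; a real tool will name and store these differently.
    app_group: str
    opened: date
    root_cause_found: Optional[date] = None
    closed: Optional[date] = None
    recurrence_of: Optional[str] = None  # id of an earlier ticket, if any


def backlog_per_group(tickets):
    """Number of open (not yet closed) tickets per application group."""
    return dict(Counter(t.app_group for t in tickets if t.closed is None))


def average_open_age_days(tickets, today):
    """Average age in days of open tickets, or None if none are open."""
    ages = [(today - t.opened).days for t in tickets if t.closed is None]
    return sum(ages) / len(ages) if ages else None


def average_days_to_root_cause(tickets):
    """Average days from opening to root cause, over tickets that have one."""
    durations = [
        (t.root_cause_found - t.opened).days
        for t in tickets
        if t.root_cause_found is not None
    ]
    return sum(durations) / len(durations) if durations else None


def recurring_count(tickets):
    """Number of tickets flagged as a recurrence of an earlier problem."""
    return sum(1 for t in tickets if t.recurrence_of is not None)


# Made-up sample data, for illustration only.
sample = [
    ProblemTicket("billing", date(2024, 1, 1),
                  root_cause_found=date(2024, 1, 11), closed=date(2024, 1, 20)),
    ProblemTicket("billing", date(2024, 2, 1)),
    ProblemTicket("network", date(2024, 2, 15),
                  root_cause_found=date(2024, 2, 19), recurrence_of="PRB-1"),
]

print(backlog_per_group(sample))
print(average_open_age_days(sample, today=date(2024, 3, 1)))
print(average_days_to_root_cause(sample))
print(recurring_count(sample))
```

None of these functions reads the summary, the updates or the resolution description, which is part of why such measures are easy to produce.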
Consider the goals of Problem Management: to find the causes of problems and proactively take action to avoid future incidents and problems. How well do the examples above show how successful a team is in achieving these goals? Are we asking for one thing and measuring something completely different?
A Real Life Experience
About two days into an assessment of how Problem Management was being handled in a global IT department of a worldwide company, we decided to take a break to compare our findings amongst the participants in the assessment. Fields that were considered for inspection included the ticket summary and problem description, as well as individual progress updates and the resolution descriptions.
The same pattern was seen in most problem tickets. The summary clearly indicated the affected application or hardware and what was wrong with it, followed by some underlying data in the detailed problem description. Further updates typically indicated how the problem traveled through the procedural steps of Problem Management over time, reaching a conclusion in the resolution description.
Although this seems like an individual case, it represents the pattern that was seen across the team doing the assessment. After talking through other experiences, the following picture was made to represent the observations.
This raises a set of questions on how conclusions were drawn and actions were taken or planned:
- What data needs to be gathered to find a cause effectively?
- How do experts make sure they have gathered the appropriate data at the appropriate time?
- What does the magic look like? What undocumented steps were taken? What undocumented thinking was done?
- What other causes were considered?
- What level of confidence did the resolving team have that the cause found really was the “true cause”?
- What side effects may actions taken to fix the problem have caused?
Answers to these questions may give good insight into how value was created in Problem Management for any given ticket. The answers are typically not related to timing or numeric parameters around the Problem Management procedures. They are about the quality of data gathering and the quality of the thought processes of the individuals involved.
In our next blog, learn what “magic” is really all about. Finding stability after recurring problems and realizing that there is no common way to handle a single problem will reveal exactly how Problem Management “magic” is performed.
The KT problem solving approach is used worldwide for root cause analysis and to improve IT stability.