By Russell Whitehouse, Kepner-Tregoe
The goal of IT organizations is to provide the best quality services and systems they can to support their users and business processes. Unfortunately, systems (like people) aren’t perfect and they sometimes break. When this happens, well-meaning IT staff will do all they can to quickly fix the issue and return the system to normal so the users who rely on it can continue doing their jobs. Ensuring business continuity is critical to the IT function, but anticipating problems and preventing them from recurring is just as important.
The challenge arises when these two co-dependent ITSM processes, incident management and problem management, come into conflict with each other (a common occurrence). When this happens, IT staff must often pick their poison – restore service quickly or resolve the problem permanently. Developing a set of guidelines for addressing this conflict can help staff be more effective and ensure the right balance for your organization. Here are six factors that you should consider when de-conflicting your incident and problem management processes.
- Business impact – Every incident and problem is different, and an informed decision on where to place priority depends largely on having an objective means of measuring and classifying current and future business impact. The number of people impacted, the criticality of the business processes impacted, the length of time the system or service will be impacted, and intangible impacts such as customer perception should all be considered.
- Resource utilization – The same IT staff and technical resources are often used to resolve active incidents as well as prevent problems. The goal is to achieve as much overall benefit to the organization as possible from the activities that your resources are working on. Applying portfolio management techniques to these constrained resources can help determine whether focusing resources on a few high-priority activities, spreading them across a lot of activities for coverage, or selective allocation based on skills and experience will yield the greatest returns.
- Destruction of breadcrumbs – Rebooting the server may restore service, but it might destroy the very information IT needs to resolve the problem. The symptoms and context data (what was going on in the environment, active dependencies, resources being used, what the user was trying to do, and how the system was performing when the incident occurred) are often lost when the incident is resolved. Providing clear instructions for capturing this data as part of managing the incident can avoid future delays in finding the root cause of the underlying problem.
- Recurrence & impact – Incidents are typically addressed independently; problem management, however, must take into account the risk of incidents happening repeatedly. Assessing the frequency, duration, and impact of incident recurrence can help IT staff and decision makers look more holistically at the situation. By “connecting the dots”, a closer analysis can help determine whether mitigating the immediate impact of the incident should take priority over preventing it from happening again.
- Reproducibility – Some problems can only be diagnosed through real-time analysis of the systems and environment as incidents are occurring – reproducing the incident at a later time may not be possible. When this is suspected to be the case, the activities of problem and incident management must be coordinated to enable live diagnosis as the incident is being resolved. The keys to doing this effectively are clear decision-making authority, effective communication, and coordination of activities.
- Deferring impact to a later time – One of the most difficult decisions in incident management is weighing two options: prolonging the current incident to diagnose the underlying problem, or resolving the current situation and accepting a higher risk of recurrence at a future time when the impact is unknown. Most companies have non-linear, yet predictable cycles of business activities. For example, impact may be lower in the middle of the night or higher during the last week of the quarter, in the final push to meet financial targets. Having a clear understanding of these cycles helps IT staff make decisions more in line with business needs when handling incidents.
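To make the breadcrumbs point above concrete, here is a minimal Python sketch of recording context before a restart wipes it out. Everything in it is an assumption for illustration: the function name `capture_snapshot`, the fields it records, and the JSON file layout are hypothetical, and a real environment would capture far more (logs, process lists, dependency state) using its own tooling.

```python
import json
import os
import platform
import socket
from datetime import datetime, timezone
from pathlib import Path


def _load_average():
    """Return the system load average, or None where it is unavailable."""
    try:
        return list(os.getloadavg())
    except (AttributeError, OSError):
        # os.getloadavg() does not exist on Windows and can fail elsewhere.
        return None


def capture_snapshot(incident_id: str, user_action: str, out_dir: str) -> Path:
    """Write a small JSON record of context that a restart would erase.

    Hypothetical helper: the field names are illustrative, not a standard.
    """
    snapshot = {
        "incident_id": incident_id,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "host": socket.gethostname(),
        "platform": platform.platform(),
        "load_average": _load_average(),
        "user_action": user_action,  # what the user was trying to do
    }
    path = Path(out_dir) / f"{incident_id}.json"
    path.write_text(json.dumps(snapshot, indent=2))
    return path
```

Calling something like `capture_snapshot("INC-001", "saving a report", "/tmp")` as the first step of the incident procedure leaves a record for problem management to examine after service has been restored.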
Your organization’s problem-solving methodology should include process steps for capturing diagnostic data as part of incident management and clearly defined decision criteria for investigating a problem. There is often no clear correct answer for when to just restore service or when to permanently resolve a problem, but taking into account the six factors listed will help.
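As one illustration of what "clearly defined decision criteria" might look like, the sketch below turns several of the factors into a simple weighted comparison. The class name, the 0-to-1 scores, and the weights are all hypothetical and untuned; resource utilization is left out for brevity. It is meant to show the shape of such a rule, not to prescribe one.

```python
from dataclasses import dataclass


@dataclass
class IncidentAssessment:
    """Hypothetical 0-to-1 scores for the factors discussed above."""
    current_impact: float       # business impact of the outage right now
    recurrence_risk: float      # likelihood and cost of the incident recurring
    evidence_at_risk: float     # how much diagnostic data a quick fix destroys
    reproducible_later: float   # 1.0 = easy to reproduce later, 0.0 = live only
    business_cycle_load: float  # 1.0 = peak period, 0.0 = quiet period


def recommend(a: IncidentAssessment) -> str:
    """Return 'restore' or 'diagnose' using illustrative, untuned weights."""
    # Pressure to restore grows with current impact and business-cycle load.
    restore_pressure = 0.6 * a.current_impact + 0.4 * a.business_cycle_load
    # Pressure to diagnose grows with recurrence risk, evidence loss,
    # and the difficulty of reproducing the incident later.
    diagnose_pressure = (
        0.4 * a.recurrence_risk
        + 0.3 * a.evidence_at_risk
        + 0.3 * (1.0 - a.reproducible_later)
    )
    return "restore" if restore_pressure >= diagnose_pressure else "diagnose"
```

With these assumed weights, a high-impact outage during quarter-end leans toward restoring service, while a low-impact, hard-to-reproduce incident in the middle of the night leans toward diagnosing first. Each organization would need to choose its own scores and weights.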
Sometimes you just have to pick your poison.
Kepner-Tregoe is an industry leader in problem-solving, helping companies refine their ITSM processes with the tools, techniques, skills and processes that enable IT staff to better support users and business processes.