By Christoph Goldenstern, Kepner-Tregoe
The speed of recovery for problems in our IT systems is one of the most important factors in providing business value as an IT provider. In this article we’ll look at five ways that speed of resolution can be improved.
1. Clearly define roles, responsibilities, processes, policies and metrics
It helps to have a plan. Making things up “on the fly” is not an optimal approach to solving problems. When you need to fix IT fast (major incidents, emergencies, crisis management, etc.) it’s critically important to know what you’re expected to do – and also what you’re NOT supposed to do – so that you’re not stepping on someone else’s toes. Understanding roles and responsibilities is a great first step, but it’s not always sufficient. Having clearly defined processes and procedures quickly focuses activity and can prevent a major incident from descending into chaos.
For example, when a major incident occurs, these are some predefined, appropriate actions:
- Have everyone on the Major Incident Team go to a “war room” (virtual and/or physical). This could involve starting a web meeting, opening at least one conference call “bridge,” using whiteboards, and using other helpful collaboration tools.
- The Major Incident Manager opens a new master ticket in the incident management system (later all related incidents will be linked to this new master record) and the clock will start ticking on the major incident process.
- Contact the Service Desk so they can create a new outgoing message on the helpline, so when users call in to complain about the issue they are immediately informed about the ongoing incident. Otherwise your Service Desk could get flooded with calls.
- Alert the Service Delivery Manager(s), or equivalent role, so they’re aware of the issue and know that it’s being addressed.
- Investigate and diagnose the issue by describing and documenting the issue in sufficient detail.
- Find the cause of the incident using best practice techniques that the team has been trained to use (see RCA, tip #5).
- After the cause of the major incident is determined, apply a fix (temporary or permanent resolution), engaging the emergency Change Management process where a change is required.
- Stop the clock on the incident and check compliance with established service agreements.
- Communicate incident resolution to users, the Service Delivery Manager(s), and other relevant stakeholders.
- Conduct a “post mortem” with documented lessons learned and a clearly diagrammed “incident map” within seven days of incident resolution.
- Be proactive and look for areas of improvement to prevent similar future incidents.
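To make the bookkeeping in these steps concrete, here is a minimal sketch of a master ticket that links related incidents, starts and stops the clock, checks the result against a service agreement, and computes the seven-day post mortem deadline. The class, its field names, the ticket IDs and the four-hour target are all hypothetical; a real incident management system will have its own data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class MajorIncident:
    """Hypothetical master ticket for a major incident."""
    title: str
    opened_at: datetime                      # the clock starts here
    sla_target: timedelta                    # assumed agreed resolution time
    linked_incidents: List[str] = field(default_factory=list)
    resolved_at: Optional[datetime] = None   # the clock stops here

    def link(self, incident_id: str) -> None:
        """Attach a related incident to the master record (no duplicates)."""
        if incident_id not in self.linked_incidents:
            self.linked_incidents.append(incident_id)

    def resolve(self, when: datetime) -> None:
        """Stop the clock once a fix has been applied."""
        if when < self.opened_at:
            raise ValueError("resolution time precedes opening time")
        self.resolved_at = when

    def duration(self) -> timedelta:
        """Elapsed time between opening and resolution."""
        if self.resolved_at is None:
            raise ValueError("incident is still open")
        return self.resolved_at - self.opened_at

    def met_sla(self) -> bool:
        """Check compliance with the assumed service agreement."""
        return self.duration() <= self.sla_target

    def post_mortem_due(self) -> datetime:
        """Post mortem deadline: seven days after resolution."""
        if self.resolved_at is None:
            raise ValueError("incident is still open")
        return self.resolved_at + timedelta(days=7)


# Illustrative values only.
incident = MajorIncident(
    title="Email service outage",
    opened_at=datetime(2024, 1, 8, 9, 0),
    sla_target=timedelta(hours=4),
)
incident.link("INC-1001")
incident.link("INC-1002")
incident.resolve(datetime(2024, 1, 8, 12, 30))

print(incident.duration(), incident.met_sla(), incident.post_mortem_due())
```

Keeping the open and resolve timestamps on the master record means the compliance check and the post mortem deadline can be derived from it rather than tracked by hand.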
2. Utilize Knowledge Management
If your people have the right skills, they will be able to resolve incidents more quickly. This reduces call-waiting times, improves availability and helps meet agreed service levels. A good knowledge management system helps your staff get the information they need “on demand”.
Knowledge management is the process responsible for sharing information and experience in the right place and at the right time. A good knowledge management system helps to take all the information that is rattling around in the heads of your experts, and puts those solutions into a searchable database that can be accessed online. Ideally, a few key words entered into the incident ticket will automatically pull up a few relevant knowledge articles that have a suggested solution to the issue.
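As a rough illustration of the key word lookup described above, the sketch below ranks knowledge articles by how many words they share with the text of an incident ticket. The article IDs, titles, stop-word list and the `suggest_articles` function are all invented for the example; commercial knowledge management tools use far more sophisticated search.

```python
import re
from typing import Dict, List, Set

# Hypothetical knowledge base: article ID -> title.
ARTICLES: Dict[str, str] = {
    "KB-001": "Reset a locked Active Directory account password",
    "KB-002": "Outlook cannot connect to the mail server",
    "KB-003": "VPN client fails to connect after password change",
}

# Common words that carry no search value (assumed list).
STOP_WORDS = {"a", "an", "the", "to", "after", "is", "on", "of", "and"}


def keywords(text: str) -> Set[str]:
    """Lower-case the text and keep only the meaningful words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def suggest_articles(ticket_text: str, limit: int = 3) -> List[str]:
    """Return up to `limit` article IDs, best key word overlap first."""
    wanted = keywords(ticket_text)
    scored = []
    for article_id, title in ARTICLES.items():
        overlap = len(wanted & keywords(title))
        if overlap:
            scored.append((overlap, article_id))
    # Highest overlap first; ties broken by article ID for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [article_id for _, article_id in scored[:limit]]


print(suggest_articles("User cannot connect to VPN after password reset"))
```

In this example the VPN article ranks first because it shares three key words with the ticket, while the other two share two each.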
A knowledge management system that’s integrated into your incident tools and processes can turn an inexperienced employee into a high performer with minimal training and effort. If everyone on the team uses this approach, then your organization can do more with fewer people while reducing incident resolution times and improving customer satisfaction.
3. Implement easy-to-use, integrated systems
User-friendly systems make it easier for IT staff to input data into key systems (e.g. an incident ticketing system). Information from multiple sources should be shared seamlessly across the enterprise in the form of data warehouses, on-demand reports, and real-time dashboards. It’s a best practice to integrate the following technologies into one holistic system:
- Incident & Problem Management ticketing system
- Event Management / monitoring / alerting tools
- Change & Release Management system
- Knowledge Management Solution Database
- Process workflow engines
- Configuration Management Database (CMDB)
- Asset Management Database
- Asset and “CI” discovery tools
- Service Level Agreements
- Online Service Catalog
- Software distribution tools
- Reporting and Dashboards
4. Create a critical service matrix and utilize monitoring with automation
It’s important to define your critical services, systems, and applications and document them in a Critical Service Matrix (CSM). A CSM is essentially a prioritized list of all your organization’s critical technology. The CSM should also briefly document all of the relevant key information that you need in order to monitor and manage these key systems.
The CSM is an enormously helpful guide when configuring your monitoring and alerting systems to detect high priority events and incidents. Once you have completed the CSM, the next step is to set up your automated alerting systems to monitor the key technologies listed in the CSM. If something breaks, your monitoring systems automatically detect and report the issue to the appropriate people in your organization – preferably integrating this alerting into your incident ticketing system. Ideally, all known potential incidents are fixed automatically (using scripted workarounds, event correlation, artificial intelligence, etc.) without the need for human intervention.
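One way to picture a CSM is as a small, prioritized table that the alerting configuration reads from. The sketch below is illustrative only: the services, priorities, owners, checks and the `route_alert` function are assumptions made up for the example, not a prescribed CSM format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CriticalService:
    """One row of a hypothetical Critical Service Matrix."""
    name: str
    priority: int   # 1 = most critical
    owner: str      # who gets alerted when the service breaks
    check: str      # what the monitoring tool watches


# Illustrative entries only.
CSM: List[CriticalService] = [
    CriticalService("Payroll", 2, "finance-apps@example.com", "batch job status"),
    CriticalService("Email", 1, "messaging@example.com", "SMTP port 25"),
    CriticalService("Intranet", 3, "web-team@example.com", "HTTP 200 on home page"),
]


def prioritized(csm: List[CriticalService]) -> List[CriticalService]:
    """Return the matrix ordered from most to least critical."""
    return sorted(csm, key=lambda service: service.priority)


def route_alert(service_name: str, csm: List[CriticalService]) -> Optional[str]:
    """Build an alert message for a failed service, or None if it is not in the CSM."""
    for service in csm:
        if service.name == service_name:
            return f"P{service.priority} alert for {service.name}: notify {service.owner}"
    return None


for service in prioritized(CSM):
    print(service.priority, service.name, service.check)

print(route_alert("Email", CSM))
```

Because the owner and priority live in the matrix, the alerting rules do not need to be rewritten when a service changes hands; only the CSM entry is updated.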
5. Troubleshooting technique and critical thinking skills
To be an effective troubleshooter and problem solver you need excellent critical thinking skills. These include (but are not limited to) the following:
- Root cause analysis (RCA) for complex problems
- Systematic decision making that aligns with business and operational priorities
- Ability to identify and plan for the resolution of high-priority issues
- Understanding and proactively managing risks and opportunities
- Asking the right questions in order to get the necessary information and derive meaningful insight
The KT problem-solving approach is used worldwide for root cause analysis and to improve IT stability.