By Andrew Vermes, Kepner-Tregoe
When facing a major incident in the service environment, the pressure is on to act fast. Yet most tech environments are complex, with interdependent products that make the way forward unclear. Incident managers face an enormous amount of information that could be fact or opinion. Add to this the potential involvement of diverse teams from various locations, including suppliers whose products may have affected the incident. As time passes, pressure mounts and costs climb.
In the service environment, major incident managers report these major stresses when something goes wrong:
- Time pressure
- Conflicting demands
- Potentially threatening environment
- Fear of making things worse
Emergency services that ensure public safety and health (police, rescue, fire) depend on clear processes, checklists and repeated training to make response fast and automatic. Incident managers similarly need an organized, planned approach that guides their action.
There are four major things incident management must do. The Kepner-Tregoe approach, which is used globally by service organizations, handles these four work-streams with specific Kepner-Tregoe processes.
- Situation Appraisal: Understand what is really happening.
- Problem Analysis: Analyze the cause. In some incidents, however, a workaround may resolve the situation without pursuing the cause.
- Decision Analysis: Resolve and/or restore service with an agreed upon solution.
- Potential Problem Analysis: Prevent trouble in the future. This is not only to avoid the same incident occurring again but also to avoid trouble caused by actions taken to resolve the current incident.
Amid the cacophony of mounting pressures, understanding these work-streams and having processes in place to guide the work give structure to an unraveling situation and provide a way forward. In addition, making these work-streams visible creates a framework for filtering out irrelevant information, communicating progress and keeping everyone involved.
A visible display of relevant information and progress helps everyone involved see what is going on and what lies ahead. A conference call is not enough. Customers and managers need meaningful progress reports, reassurances that the problem can be fixed, and no surprises. Similarly, colleagues need clear visibility of actions taken or underway, shared data, and an understanding of the path forward to resolution.
A shared dashboard or whiteboard that makes work visible and keeps it up to date provides this framework. Formatting the dashboard into the four work-streams/processes needed for incident management communicates information clearly to everyone.
The example just below shows an easily updated whiteboard with four quadrants that track progress of the four work-streams.
The KT Restore Dashboard just below shows the four quadrants with Kepner-Tregoe process templates to guide work and share progress.
Dashboards make the parallel work-streams visible while keeping each process separate. Managing incidents with real-time dashboard support organizes information as it comes in, guides progress and keeps everyone informed.
- SA: What’s the status now?
- PA: Diagnostic work done or underway
- DA: Restoration work or decisions/actions taken to restore service
- PPA: Managing risks created by the restoration work, risks to its success, and risks to other systems
This approach to both the work and the communication platform ensures that people contribute in the most effective and efficient way. It helps clarify what work needs to be done and how it fits into the incident management. Separating the processes/work-streams, and recording and updating within each area, helps everyone stay informed and guides relevant contributions.
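For teams that keep the board in software rather than on a physical whiteboard, the four-quadrant structure can be sketched as a small data structure. The Python below is only an illustration under stated assumptions: the `IncidentDashboard` and `Entry` classes, their fields, and the sample entries are hypothetical and are not part of the KT Restore Dashboard or any Kepner-Tregoe tool. It shows one way to keep each work-stream separate while rendering a single shared view.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Quadrant keys mirror the four work-streams named in the article.
QUADRANTS = {
    "SA": "Situation Appraisal",
    "PA": "Problem Analysis",
    "DA": "Decision Analysis",
    "PPA": "Potential Problem Analysis",
}


@dataclass
class Entry:
    text: str
    author: str
    logged_at: datetime


@dataclass
class IncidentDashboard:
    """Hypothetical four-quadrant board; not the KT Restore Dashboard itself."""
    title: str
    quadrants: dict = field(
        default_factory=lambda: {key: [] for key in QUADRANTS}
    )

    def log(self, quadrant, text, author):
        # Reject anything outside the four work-streams so that each
        # process stays separate.
        if quadrant not in QUADRANTS:
            raise ValueError(f"unknown work-stream: {quadrant}")
        entry = Entry(text, author, datetime.now(timezone.utc))
        self.quadrants[quadrant].append(entry)
        return entry

    def render(self):
        # Plain-text view that can be pasted into a status update.
        lines = [f"Incident: {self.title}"]
        for key, name in QUADRANTS.items():
            lines.append(f"[{key}] {name}")
            entries = self.quadrants[key]
            if not entries:
                lines.append("  (no entries yet)")
            for e in entries:
                lines.append(f"  - {e.text} ({e.author})")
        return "\n".join(lines)


# Invented sample data, for illustration only.
board = IncidentDashboard("Example outage")
board.log("SA", "Checkout service unavailable for some users", "incident manager")
board.log("DA", "Failover to standby under way", "ops")
print(board.render())
```

Rejecting entries that do not name a quadrant is a deliberate choice in this sketch: it forces each contribution to be filed under one work-stream, which is the filtering role the dashboard is meant to play.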
The KT problem-solving approach is used worldwide for root cause analysis and to improve IT stability.