by Andrew Vermes, Kepner-Tregoe
Every day we have those “Homer Simpson” moments: you’re looking at a new project and you feel something will probably go wrong, but for various reasons (time, stress, budget) you do nothing. And then it happens... D’oh!
What’s your guess about the percentage of issues/problems/negative events that you could have seen coming? When we posed this question to IT service professionals during a recent webcast, they overwhelmingly responded that a significant percentage of problems could have been prevented, with answers ranging from 20% to 85%.
IT incidents are like ticks: much of the action is under the skin. A tick bite may go unnoticed for weeks, months or even years before, in the worst cases, Lyme disease is diagnosed. Like tick bites, small pinpricks of issues are the forewarning of bigger problems. Prevention is best. You must be ready and prepared, not only to prevent the initial “bite,” but also to take contingent actions that ameliorate the problem once it has happened.
Why take unnecessary risks? Consider an IT incident and its effects, shown in this graphic. The risks are evolving and can be tremendous.
So why do people take risks? Because they fail to appreciate and properly measure the consequences. While incident-handling costs are often measured (parts, labor, travel to fix the issue), it is difficult to estimate the true cost of downtime to an organization’s reputation. Have you lost customers for good? What else has happened? Lost work is difficult to estimate, and the hit to productivity can be huge, yet the reasons not to be proactive and preventative are actually quite weak. When you add up the direct costs of incidents plus the costs of lost work, the total is staggering. Risk management is worth it.
Stability and consistency are the way to get the most value from work. To be consistent in how work is done, it is necessary to manage risks. Whether you use Kepner-Tregoe Potential Problem Analysis (PPA) or another method such as FMEA (also effective, but more time-consuming), it is essential to anticipate risk and manage it deliberately.
Sometimes it pays to step back and assess what needs to be addressed first, before jumping in to analyze risk. Do I want to do a risk assessment or should I review the decision that was made? Should we be anticipating risk or should we be solving an actual problem?
Assuming that risk analysis is the right thing to do, move ahead by asking these four questions:
1. What could go wrong with this activity or process step?
2. Why might that happen?
3. How can we stop it?
4. What contingent or backup plans are needed if the preventive action fails?
Too often the risk analysis is oversimplified. For example, when we ask, “What could go wrong?” our answer is one-dimensional: the upgrade may fail. When we ask, “What will we do?” we imagine a single course of action: back it out.
Regrettably this may not be enough. Planning for risk in a more granular, detailed way is much more effective. In the same example we state the task specifically: we have 12 hours to upgrade our storage management software to release 5.20.
1. What could go wrong?
- There is not enough time to make the upgrade in 12 hours
- The system administrators may make a mistake which costs time
- A bug in the upgrade script causes the upgrade to fail
- A latent fault in the customers’ machines causes the upgrade to fail
2. Why might that happen?
- The root file system is too small, we are unable to back out existing patches, there’s latent corruption to the patch files
- The system administrators are distracted, there are gaps in the upgrade procedures to follow, something unexpected occurs
3. How can we stop it (preventive actions)?
- Practice the upgrade in advance
- Make the upgrade a priority for system administrators, create and test procedures, have the system administrators run the practice
- Check support databases for known problems and review the upgrade script
- Check the machine, use a copy of the customer environment when performing the test, make the machine identical in disk layout and architecture, verify that the upgrade will run on the current operating system
4. What contingent actions/backup plans are needed if we don’t upgrade in 12 hours?
- Abandon the upgrade, reload the original and test for functionality
- Assess the gravity of the mistake and abandon the upgrade if it cannot be brought on track on time
- Gather as much data as possible about the upgrade failure and look in support databases for further information
- Attempt to fix the problem or abandon the upgrade
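The four questions above can be captured as a simple, structured record rather than a one-dimensional note. The sketch below models the upgrade example as minimal PPA entries in Python; the class, field names, and example wording are illustrative assumptions, not KT’s official PPA format.

```python
from dataclasses import dataclass, field

# Minimal sketch of a Potential Problem Analysis (PPA) record.
# Field names mirror the four questions; this is not an official KT schema.
@dataclass
class PPAEntry:
    risk: str                                              # 1. What could go wrong?
    likely_causes: list = field(default_factory=list)      # 2. Why might that happen?
    preventive_actions: list = field(default_factory=list) # 3. How can we stop it?
    contingent_actions: list = field(default_factory=list) # 4. Backup plans if prevention fails

ppa = [
    PPAEntry(
        risk="Not enough time to complete the upgrade in 12 hours",
        likely_causes=["System administrators are distracted by other work"],
        preventive_actions=["Practice the upgrade in advance",
                            "Make the upgrade the administrators' priority"],
        contingent_actions=["Abandon the upgrade, reload the original, test functionality"],
    ),
    PPAEntry(
        risk="A bug in the upgrade script causes the upgrade to fail",
        likely_causes=["Latent corruption in the patch files"],
        preventive_actions=["Check support databases and review the upgrade script"],
        contingent_actions=["Gather failure data and search support databases"],
    ),
]

# A vague or incomplete entry is a warning sign: flag any risk
# that has no contingent action recorded.
incomplete = [e.risk for e in ppa if not e.contingent_actions]
for risk in incomplete:
    print(f"Incomplete analysis: {risk}")
```

Writing each risk as its own entry, with its own causes and backup plans, makes gaps visible at a glance, which is the point of the granular approach described above.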
Effectively managing risk demands attention to detail. Before taking a significant action, it’s worth reviewing the detail of the risk assessment and taking action if any part is vague or open to interpretation.
When to do Risk Analysis
Within the ITIL framework, risk analysis is indicated once a fix or workaround has been identified and before it is implemented (see yellow PPA in ITIL graphic).
The same timing applies in manufacturing situations and in other enterprises. Before taking a significant action that is likely to change the nature of your process, do a detailed risk analysis and log the risks of the proposed preventive and contingent actions as well as those actually implemented. In addition to preparing for and preventing risk, you will have increased the richness of any risk analysis for the future.
But why wait? Risk management doesn’t need to begin only when planning for change. Small disturbances often precede major incidents, and they are the starting points for proactivity. In any complex system, small things happen all the time; almost every day something is a bit off. It pays to take note of them before the disturbances multiply and the situation becomes difficult to untangle. Once there is a real problem, it is much harder to go back and find these small signals if they went unrecorded.
Well-oiled event management systems will watch for issues, but what do you monitor? How do you know which variabilities are important? One way is to ask users to report when small things are happening, even if nothing is going wrong. By noticing the anomalies affecting users earlier, you can clean things up before distress sets in. IT support people who regularly process problems can use proactive tickets to record small disturbances: no problem reported, but we have noticed unusual behavior. Log it and do some risk management. Proactive tickets can have a huge effect on reliability. Proactivity is like insurance: it is better to have it than to cope without it.
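A proactive ticket can be as simple as a record that says “nothing has failed, but something looks off.” The sketch below is a hypothetical example of what such a record might look like; the class, field names, and example text are assumptions, not any specific ticketing system’s schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical sketch of a "proactive ticket": no problem reported yet,
# just an anomaly logged so it can feed later risk management.
@dataclass
class ProactiveTicket:
    observed: str           # the small disturbance that was noticed
    system: str             # where it was noticed
    problem_reported: bool  # False by definition for a proactive ticket
    logged_at: str          # timestamp, so patterns can be spotted later

def log_anomaly(observed: str, system: str) -> ProactiveTicket:
    """Record unusual behavior even though nothing has actually failed."""
    return ProactiveTicket(
        observed=observed,
        system=system,
        problem_reported=False,
        logged_at=datetime.now(timezone.utc).isoformat(),
    )

# Example: the nightly backup was slow, but no user has complained.
ticket = log_anomaly("Nightly backup took 3x longer than usual", "storage-mgmt")
```

The value is in the trail: when a major incident does occur, a stack of timestamped proactive tickets makes it far easier to trace the small disturbances that preceded it.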
Try it yourself: proactively managing risk is practical. Consider the three most important things you will do in your work, pick one and do some risk management. If something doesn’t seem right, open a case or record it. You may find it will be useful in the future.
The KT problem solving approach is used worldwide
for root cause analysis and to improve IT stability