By John Ager, Kepner-Tregoe
Sometimes the smallest change in a daily routine can have profound effects on a business. Consider the mystery of the weekly Thursday-afternoon system outage.
Situation – The IT systems at a stock brokerage house experienced slow transaction times followed by a complete outage of their transaction system at 3:20pm on a Thursday afternoon. Rebooting the transaction system resolved the issue and got everyone back to work … until the same thing happened on the following Thursday afternoon.
Identical symptoms occurred over the following weeks. The issue was always resolved with a system restart but frustration was mounting, particularly on the trading floor, where lost time can result in lost profits. When this weekly event came to the attention of the board, they instructed the director of IT to make this issue a priority. He formed a problem-solving team to find the cause and they used the Kepner-Tregoe RCA methodology to guide their efforts.
Identify the Problem – Recognizing that to solve a problem you first have to clearly state what the problem is, the team began by narrowing the problem from the generalized “transaction system is slow” to the more specific “transactions are timing out”. The team used this problem statement to focus their search for the information needed to find the true cause rather than wasting time examining interesting but irrelevant information.
Describe the problem – A clear problem statement is necessary but not sufficient to rule out false causes and suggest probable causes. So, the team began gathering information about what the problem was, where and when it was observed, and to what extent, and equally about where and when it was not observed.
- The problem occurred for all transactions run on the system – queries, reports and trades
- The problem was specifically timeouts – no error messages were generated
- The problem affected all staff; it was not restricted to any particular group of users or geographic location
- The problem first happened Thursday 6 September at 3:20pm – it had not been noticed previously
- The problem occurred only on Thursdays between 3pm and 3:30pm. There was one exception – the problem was not reported on Thursday 4 October
- The problem only happened once a day and once a week
Identify possible causes – With a robust problem description, the team was able to avoid the trap of considering every change that could possibly affect the system. The cause they were looking for affected the entire system, but only on Thursdays between 3pm and 3:30pm. The predictable timing of this deviation during normal working hours suggested that the cause was likely some human interaction with the system. This became their focus.
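The filtering step described here can be pictured as testing each candidate cause against the observed pattern. The following is only a minimal sketch under stated assumptions: the function name, the data layout, and the first two candidates (a nightly backup and a weekly batch job) are hypothetical illustrations and are not part of the case. Only the Thursday 3pm to 3:30pm window and the report's run times come from the problem description.

```python
from datetime import time

# Observed pattern from the problem description:
# outages only on Thursdays, between 3:00pm and 3:30pm.
WINDOW_START = time(15, 0)
WINDOW_END = time(15, 30)
OUTAGE_DAYS = {"Thu"}

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]


def explains_pattern(activity):
    """Return True if the candidate is active inside the outage window
    on the outage days and on no other day.

    `activity` is a list of (weekday, start_time) pairs.
    """
    days_in_window = {
        day for day, start in activity if WINDOW_START <= start <= WINDOW_END
    }
    return days_in_window == OUTAGE_DAYS


# Hypothetical candidates, invented for illustration only.
nightly_backup = [(day, time(23, 0)) for day in WEEKDAYS]
weekly_batch_job = [("Thu", time(9, 0))]

# Modelled on the case: a report started at 5:30pm on most days,
# but at about 3:15pm on Thursdays.
end_of_day_report = [
    (day, time(15, 15) if day == "Thu" else time(17, 30)) for day in WEEKDAYS
]

candidates = {
    "nightly backup": nightly_backup,
    "weekly batch job": weekly_batch_job,
    "end-of-day report": end_of_day_report,
}

for name, activity in candidates.items():
    print(f"{name}: fits pattern = {explains_pattern(activity)}")
```

A candidate that is active every day at the same time, or active on Thursdays outside the window, fails the check. Only something that differs on Thursdays between 3pm and 3:30pm survives, which is why the team's attention turned to what people did differently at that time.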
Examining work rosters did not give them any viable leads, but talking to team leaders eventually identified a possible link. There was one staff member in the invoicing team who left early every Thursday afternoon to take her daughter to ballet class. Members of the problem-solving team interviewed her to find out how she interacted with the system. They discovered that just as she was leaving every day, she began running a report that she needed to use the following morning. Normally this would be run at 5:30 pm, as this was when she typically left work. At this time of the day the stock exchange was closed and few other people used the system. On Thursdays, as usual, she set the report to run when she left, but she left around 3:15pm. The one Thursday the issue did not occur coincided with a day her daughter was on a school trip and did not go to ballet.
By taking the time to first describe the problem, the team was able to quickly find the proximate cause and then find the systemic cause: the report was run without parameters, so it searched the entire transaction database, and it had a higher priority than all other transactions. This was not a problem when the stock market was closed. But at 3:15pm, it caused the already busy system to run extremely slowly and ultimately time out, dropping the connection to the stock exchange.
Remove the root cause – The quick fix was to instruct the staff member not to run the report when the stock market was open. She showed someone else how to run the report for her at the end of the day on Thursdays and this removed the proximate cause of the issue and prevented it from happening in the future. The team went on to address the systemic cause: running a report without parameters and consuming more transaction system capacity than necessary.
To remove the systemic cause, a development team made changes to the system to ensure reports required specific parameters and any report that could potentially impact system performance could not be run during stock exchange trading hours. Now stock trades sail through the IT system, even on Thursdays, while across town, a class of small girls in pink tights learn ballet. Mystery solved.
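The two rules the development team introduced can be sketched as a simple pre-run check. This is an illustration under stated assumptions, not the brokerage's actual implementation: the function name `may_run_report`, the `heavy` flag, and the trading hours of 9:30am to 4:00pm are all hypothetical, since the article does not describe how the change was built or when the exchange opens and closes.

```python
from datetime import time

# Assumed trading hours; the article does not state them.
MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)


def may_run_report(params: dict, heavy: bool, now: time) -> tuple[bool, str]:
    """Decide whether a report may start at time `now`.

    Rule 1: a report must be given specific parameters.
    Rule 2: a report that could affect system performance (`heavy`)
            may not run during trading hours.
    """
    if not params:
        return False, "report requires specific parameters"
    if heavy and MARKET_OPEN <= now < MARKET_CLOSE:
        return False, "heavy reports are blocked during trading hours"
    return True, "ok"


# The Thursday scenario from the case: no parameters, started at 3:15pm.
print(may_run_report({}, heavy=True, now=time(15, 15)))

# A parameterised heavy report is still blocked while the market is open...
print(may_run_report({"account": "A1"}, heavy=True, now=time(15, 15)))

# ...but is allowed at 5:30pm, after the assumed close.
print(may_run_report({"account": "A1"}, heavy=True, now=time(17, 30)))
```

Checking both rules independently matters: requiring parameters limits how much of the database a report scans, while the time restriction protects trading even when a correctly parameterised report is still expensive.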
The KT problem solving approach is used worldwide for root cause analysis and to improve IT stability