An Abbreviated Use of Problem Analysis
Trouble Aboard Apollo XIII
Reprinted from The New Rational Manager, by Charles H. Kepner and Benjamin B. Tregoe
Princeton Research Press, Princeton, NJ, 1991, 1997.
The best use of Problem Analysis is the use that works best. There is no particular virtue attached to slavish adherence to every step in the entire process if a brief, informal use of the ideas can reveal the cause of the problem. In fact, the longer people use Problem Analysis, the more adept they become at singling out fragments of the process that apply to the kinds of problems they face every day. When people begin asking questions like “Has anything changed in the timing of this operation lately?” or “What stage was this process at just before you noticed the trouble?” they have made the transition between an academic appreciation of Problem Analysis techniques and internalization of their practical role in daily problem solving.
The vast majority of Problem Analyses never see pen and paper.
This is especially true of the abbreviated application of the process. The seriousness of a problem does not necessarily determine the length or complexity of the analysis required to resolve it. Some extremely serious problems have been solved through abbreviated uses of the process. These problems were so data-poor that full use of the process could not be undertaken. Fragments of the process had to be relied on and combined with educated speculation to arrive at a most likely cause.
Apollo XIII was on its way to the moon.
Fifty-four hours and fifty-two minutes into the mission, 205,000 miles from Earth, all was well. Then John L. Swigert, Jr., duty commander at the time, reported: “Houston, we’ve got a problem here…. We’ve had a Main Buss B undervolt.” This was an insider’s way of saying that electrical voltage on the second of two power generating systems had fallen off and a warning light had appeared. A moment later the power came up again. Swigert reported: “The voltage is looking good. And we had a pretty large bang associated with the caution and warning there.” Three minutes later, as the dimensions of the problem became clearer, he reported: “Yeah, we got a Main Buss A undervolt too…. It’s reading about 25½. Main B is reading zip right now.”
Apollo XIII, carrying three people toward the moon at incredible speed, was rapidly losing power and could shortly become a dead body. A disaster had occurred in space and no one was sure what had happened.
NASA engineers put Problem Analysis to work.
On the ground at Houston, NASA engineers put Problem Analysis questioning to work immediately. They began to build a specification of the deviation from the information that came in answer to their questions and from data displayed on their monitoring equipment.
Contingency actions are taken.
At the same time they started a number of contingency actions to reduce use of electrical power on board Apollo XIII. Thirteen minutes after the first report, Swigert reported: “Our O2 Cryo Number Two Tank is reading zero…and it looks to me, looking out the hatch, that we are venting something…out into space…it’s gas of some sort.”
What had begun as an electrical problem—loss of voltage—became a sudden loss of oxygen in the second of two tanks, with a more gradual loss of oxygen from the first. Since oxygen was used in the generation of electricity as well as directly in life-support systems, the situation could hardly be more serious.
Engineers find cause and take actions.
Although no one at the time could conceive of what might have caused the tank to burst, “Rupture of the Number Two Cryogenic Oxygen Tank” would explain the sudden loss of voltage and the subsequent loss of pressure.
Further actions were taken to conserve both oxygen and electricity. A number of “IS…COULD BE but IS NOT” questions were asked to get further data, and a series of system checks was undertaken to verify cause. In the end it was determined that the Number Two Tank had burst and vented all its oxygen, plus a large portion of the gas from the Number One Tank, through a damaged valve and out into space.
The three men returned successfully to Earth but only by the narrowest of margins. Had the cause remained unknown for very much longer, they would not have had enough oxygen left to survive.
So, what was the root cause?
It was weeks before the root cause of this problem was established through on-the-ground testing and experimentation. Two weeks before the launch, a ground crew had piped liquid oxygen into the tanks in a countdown demonstration. After the test they had had difficulty getting the oxygen out of the Number Two Tank. They had activated a heater inside the tank to vaporize some of the liquid oxygen, thus providing pressure to force it out. They had kept the heater on for eight hours, longer than it had ever been used before. Although a protective switch was provided to turn off the heater before it became too hot, the switch was fused in the ON position because the ground crew had connected it to a 65-volt power supply instead of the 28-volt supply used in Apollo XIII. Later, in flight, the crew turned the heater on briefly to get an accurate quantity reading. The fused switch created an arc that overheated the oxygen in the tank, raised the internal pressure tremendously, and blew the dome and much of the connecting piping off into space.
There was no time for NASA Houston to go through a complete listing of all the distinctions and changes they might observe. Instead, they asked, “What traumatic change could cause the sudden, total failure in electrical generation?” Cutting off the flow of oxygen to the fuel cells would have that effect. They knew which fuel cells were inoperative when Swigert reported that the Number Two Tank was reading zero.
Using what was known to test cause.
They tested the cause—that the Number Two Tank had ruptured—and found that this would explain the suddenness and totality described in the specification. It would also account for the bang reported at the time of the first undervolt indication, a shuddering of Apollo XIII felt by flight crew members, and the venting of “something…out into space.” It accounted for both the IS data they had amassed and for the IS NOT information that had come from their monitoring activities. More importantly, it explained a sudden, total failure within the system.
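The test for cause described above can be pictured as a simple comparison: a candidate cause passes if it would produce every IS fact and none of the IS NOT facts. The following is a minimal sketch of that idea, not a record of how NASA Houston actually worked. The names `Cause` and `check_cause` are hypothetical, the IS facts are paraphrased from the account above, and the single IS NOT fact and the "faulty instrumentation" alternative are illustrative assumptions added only for contrast.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cause:
    """A candidate cause and the observations it would be expected to produce."""
    name: str
    predicts: frozenset


def check_cause(cause, is_facts, is_not_facts):
    """Test one candidate cause against the specification.

    Returns a pair (unexplained, contradicted):
      unexplained  - IS facts the cause would not have produced
      contradicted - IS NOT facts the cause would have produced anyway
    A cause that leaves both sets empty fits the specification.
    """
    unexplained = is_facts - cause.predicts
    contradicted = is_not_facts & cause.predicts
    return unexplained, contradicted


# IS facts, paraphrased from the account of the incident.
IS_FACTS = frozenset({
    "sudden undervolt",
    "loud bang",
    "spacecraft shudder",
    "gas venting into space",
    "tank 2 reads zero",
})

# Illustrative assumption: the failure was sudden, not gradual.
IS_NOT_FACTS = frozenset({"gradual voltage decline"})

rupture = Cause("Rupture of Number Two oxygen tank", IS_FACTS)

# Hypothetical alternative, included only to show a cause that fails the test.
sensor_fault = Cause(
    "Faulty instrumentation (hypothetical)",
    frozenset({"sudden undervolt", "tank 2 reads zero"}),
)

for candidate in (rupture, sensor_fault):
    unexplained, contradicted = check_cause(candidate, IS_FACTS, IS_NOT_FACTS)
    print(candidate.name)
    print("  unexplained IS facts:", sorted(unexplained))
    print("  contradicted IS NOT facts:", sorted(contradicted))
```

In this sketch the rupture leaves nothing unexplained, while the hypothetical instrumentation fault cannot account for the bang, the shudder, or the venting gas. That is the sense in which a cause "explains" a specification: the fewer assumptions needed to close the gaps, the more likely the cause.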
For the NASA Houston engineers, this cause was difficult to accept.
They had unbounded faith in Apollo equipment, knowing that it was the best that could be devised. The idea of an oxygen tank bursting open in the depths of space was not credible. This faith was justified by their experience. Without the bungling that had occurred on the ground two weeks before the launch, the tank would have gone to the moon and back just as it was designed and built to do. However, the Houston engineers stuck to the Problem Analysis process despite their incredulity, believing that the test for cause they had carried out had provided the correct answer. In fact, they proved this cause in record time. What saved the day was their knowledge of Apollo XIII’s systems and of what could produce the exact kind of sudden failure that had occurred.
An analytic approach to enterprise-critical problems.
In a case such as this, Problem Analysis is rendered difficult by two factors: secondary effects and panic. Sudden failure in a complex system usually causes other deviations that may obscure the original deviation. The shock of a sudden failure often precipitates panic, making a careful review and use of the facts even more difficult. A disciplined and systematic investigation is difficult in any case, but discipline becomes essential when a top-speed search for cause is undertaken and there is no possibility of amassing all the data that would be optimal in the investigation.
In the NASA incident, the presence of a systematic approach enabled a team of people to work together as a single unit, even though they were separated from the deviation by nearly a quarter of a million miles.