I’d like to interrupt my personal recollections of the Columbia accident and its aftermath to give a few words from Admiral Gehman. You might as well know that there are still people out there who will tell you the CAIB got it wrong. I am not one of them. Of course there are minor details that are not exactly right; no report is perfect. But on the main points the CAIB did get it right, in my opinion. Admiral Gehman led the CAIB, and in December 2005 he made a remarkable speech to a group of Navy officers. His talk included a discussion of the terrorist attack on the USS Cole and the USS Iowa main gun battery explosion, which killed a number of sailors. He also mentioned various aspects of other naval programs and investigations. In the interest of brevity, I have deleted all of that other material and retained just his comments on the Columbia accident. You should read this carefully.
In the course of working closely with NASA engineers and NASA scientists as we tried to solve what had happened to the Columbia, we became aware of some organizational traits that caused our eyebrows to rise up on our heads. After not very long, we began to realize that some of these organizational traits were serious impediments to good engineering practices and to safe
and reliable operations. They were doing things that took our breath away.
We concluded and put in our report that the organizational traits, the organizational faults, management faults that we found in the space shuttle program were just as much to blame for the loss of the Columbia as was the famous piece of foam that fell off and knocked a hole in the wing. Now, that’s pretty strong language, and in our report, we grounded the shuttle until they fixed these organizational faults.
I need to give you the issue from the NASA point of view so you can understand the pressures that they were under. Consider a developmental program, any developmental program.
The program manager essentially has four areas to trade. The first one is money. Obviously, he can go get more money if he falls behind schedule. If he runs into technical difficulties or something goes wrong, he can go ask for more money. The second one is quantity. The third one is performance margin. If you are in trouble with your program, and it isn’t working, you shave the performance. You shave the safety margin. You shave the margins. The fourth one is time. If you are out of money, and you’re running into technical problems, or you need more time to solve a margin problem, you spread the program out, take more time. These are the four things that a program manager has. If you are a program manager for the shuttle, the option of quantity is eliminated. There are only four shuttles. You’re not going to buy any more. What you got is what you got. If money is being held constant, which it is—they’re on a fixed budget, and I’ll get into that later—then if you run into some kind of problem with your program, you can only trade time and margin. If somebody is making you stick to a rigid time schedule, then you’ve only got one thing left, and that’s margin. By margin, I mean either redundancy—making something 1.5 times stronger than it needs to be instead of 1.7 times stronger than it needs to be—or testing it twice instead of five times. That’s what I mean by margin.
It has always been amazing to me how many members of Congress, officials in the Department of Defense, and program managers in our services forget this little rubric. Any one of them will, for one reason or another, enforce a rigid standard against one or two of those parameters. They’ll either give somebody a fixed budget, or they’ll give somebody a fixed time, and they forget that when they do that, it’s like pushing on a balloon. You push in one place, and it pushes out in another, and it’s amazing how many smart people forget that.
The space shuttle Columbia was damaged at launch by a fault that had repeated itself in previous launches over and over and over again. Seeing this fault happen repeatedly with no harmful effects convinced NASA that something which was happening in violation of its design specifications must have been okay. Why was it okay? Because we got away with it. It didn’t cause a catastrophic failure in the past. You may think that this is ridiculous. This is hardly good
engineering. If something is violating the design specifications of your program and threatening your program, how could you possibly believe that sooner or later it isn’t going to catch up
with you? For you and me, we would translate this in our world into, “We do it this way, because this is the way we’ve always done it.”
The facts don’t make any difference to these people. Well, where were the voices of the engineers? Where were the voices that demanded facts and faced reality? What we found
was that the organization had other priorities. Remember the four things that a program manager can trade? This program manager had other priorities, and he was trading all right, and let me tell you how it worked. In the case of the space shuttle, the driving factor was the International Space Station.
In January of 2001, a new administration takes office, and the new administration learns in the spring of 2001 that the International Space Station, after two years of effort, is three years behind schedule and 100 percent over budget. They set about to get this program back under control. An independent study suggested that NASA and the International Space Station program ought to be required to pass through some gates. Now, gates are definite times, definite places, and definite performance factors that you have to meet before you can go on. The White House and the Office of Management and Budget agreed to this procedure, and the first gate that NASA had to meet was called U.S. Core Complete. The name doesn’t make any difference, but essentially it was an intermediate stage in the building of the International Space Station, where if we never did anything more, we could quit then. And the date set for Core Complete was February 2004. Okay, now this is the spring of 2001.
In the summer of 2001, NASA gets a new administrator. The new administrator is the Deputy Director of OMB, the same guy who just agreed to this gate theory. So now if you’re a worker at
NASA, and somebody is leveling these very strict schedule requirements on you that you are a little concerned about, and now the new administrator of NASA becomes essentially the author of this schedule, to you this schedule looks fairly inviolate. If you don’t meet the gate, the
program is shut down. That is how they took it: as a threat. If a program manager is faced with problems and shortfalls and challenges, and the schedule cannot be extended, he either needs money, or he needs to cut into margin. There were no other options, so guess what the people at NASA did? They started to cut into margins. No one directed them to do this. No one told them to do this. The organization did it, because the individuals in the organization thought they were defending the
organization. They thought they were doing what the organization wanted them to do. There weren’t secret meetings in which people found ways to make the shuttle unsafe, but the
organization responded the way organizations respond. They get defensive. We actually found the PowerPoint viewgraphs that were briefed to NASA leadership when the program for good, solid engineering reasons began to slip, and I’ll quote some of them. These were the measures that the managers proposed to take to get back on schedule. These are quotes. One, work over the Christmas holidays. Two, add a third shift at Kennedy Shuttle turnaround facility. Three, do safety checks in parallel rather than sequentially. Four, reduce structural inspection requirements. Five, defer requirements and apply the reserve, and six, reduce testing scope. They’re going to cut corners. That’s what they’re going to do. Nevertheless, for very good reasons, good engineering reasons, and to their credit, they stopped operations several times, because they found problems in the shuttle, and they got farther and farther behind schedule.
Well, two launches before the Columbia’s ill-fated flight—it was in October—a large piece of foam came off at launch and hit the solid rocket booster. The solid rocket boosters are recovered from the ocean and brought back and refurbished. They could look at the damage, and it was significant. So here we have a major piece of debris coming off, striking a part of the shuttle assembly. The rules and regulations say that, when that happens, it has to be classified as the highest level of anomaly, requiring serious engineering work to explain it away. It’s only happened six or seven times out of 111 launches. But the people at NASA understand that if they classify this event as a serious violation of their flight rules, they’re going to have to stop and fix it. So they classify it as essentially a mechanical problem, and they do not classify it as what they call an in-flight anomaly, which is their highest level of deficiency.
Okay, the next flight flies fine. No problem. Then we launch Columbia, and Columbia has a great big piece of foam come off. It hits the shuttle. This has happened two out of three times.
Now, we go to these meetings. Columbia is in orbit, hasn’t crashed, and we’re going to these meetings about what to do about this. The meetings are tape-recorded, so we have been listening to the tape recordings of these meetings, and we listen to these employees as they talk themselves into classifying the fact that foam came off two out of three times as a minor material
maintenance problem, not a threat to safety. Why did they talk themselves into this? Because they knew that, if they classified this as a serious safety violation, they would have to do all these
engineering studies. It would slow down the launch schedule. They could not possibly complete the International Space Station on time, and they would fail to meet the gate. No one told them
to do that. The organization came to that conclusion all by itself. They trivialized the work. They
demanded studies, analyses, reviews, meetings, conferences, working groups, and more data. They kept everybody working hard, and they avoided the central issue: Were the crew and the shuttle in danger? [This was] a classic case where individuals, well-meaning individuals, were swept along by the institution’s overpowering desire to protect itself. The system effectively blocked honest efforts to raise legitimate concerns. The individuals who raised concerns and did complain and tried to get some attention faced personal, emotional reactions from the people who were trying to defend the institution. The organization essentially went into a full defensive crouch, and the individuals who were concerned about safety were not able to overcome it.
So back to me. I would tell you nobody intentionally did anything to compromise safety. Everyone that I worked with always held the safety of the crew to be their highest priority. So the real question that you have to answer is this: how did so many hard-working, intelligent, safety-minded people come to make such a fundamentally unsafe set of decisions? And do you think that you or your organization is immune?