Pilot Error is Never Root Cause

Most accidents originate in actions committed by reasonable, rational individuals who were acting to achieve an assigned task in what they perceived to be a responsible and professional manner.
— Peter Harle, Director of Accident Prevention, Transportation Safety Board of Canada and former RCAF pilot, ‘Investigation of human factors: The link to accident prevention.’ In Johnston, N., McDonald, N., & Fuller, R. (Eds.), Aviation Psychology in Practice, 1994

Recent news stories have made me think about the STS-28 landing. That flight is special to me because it was the first shuttle flight where I sat in the big chair in the center of mission control. I was the Flight Director on the planning shift. New flight directors always start on the night shift when the crew is asleep. Get in less trouble that way. But your first time is always special, and I won’t forget that flight.

STS-28 was a ‘classified’ flight that carried a national security payload. Someday, perhaps a long time from now, they will declassify it and let me know what exactly it was we were carrying. But for now, all I know is that they told me it was ‘important’. Important enough, in that post-Challenger era, to put a flight crew at risk. Because every shuttle flight is risky.

Brewster Shaw was the commander of STS-28, his first time in that role. Brewster is a remarkable pilot, one of the best, and went on to demonstrate significant skills as Program Manager for the Space Shuttle and later as a leader in the Boeing Space and Defense organization. Not all astronauts make good managers, but Brewster certainly did. In those days, though, Brewster was best known for his piloting ability.

Immediately prior to the flight of STS-28, a problem was uncovered with the way the flight software worked in connection with the small sensors on the landing gear. These so-called ‘squat switches’ made contact as the landing gear compressed, and the software moded the flight controls from flying to rolling on the wheels. I’ve forgotten the particulars, but there was a failure mode where, if the switches made contact in a certain way, the computers would put the flight control system into the wrong mode – steering with the nose wheels when steering should be controlled by the rudder and elevons, or something like that. It could lead to catastrophic loss of control.

It was too late to modify the software, and the switches were inaccessible with the shuttle attached to the external tank. A manual workaround by the pilot was required to ensure safety. So the Commander and Pilot got briefed – multiple times – in the last few days before flight about the need to land very softly – with a low ‘sink rate’ at touchdown – so the switches and software would work properly.

On the last night of the flight, I supervised the team as we prepared the entry messages for the crew. One of those was a reminder to land ‘softly’. The Entry flight control team came on and I went home hoping for a good landing. One of the first calls that the Capcom made – the crew was waking up as I was leaving the MCC – was a reminder to land softly.

So we set Brewster up.

Nominal deorbit burn, nominal entry, TAEM and HAC acquisition all normal; the Commander took over flying manually as planned just as the orbiter decelerated to subsonic speeds. A perfect final glideslope. And now for the moment of truth: would the landing be soft enough to avoid the software glitch?

Normally an orbiter lands with a heavyweight payload in the bay at 205 kts – that is really fast for airplanes, but those stubby delta wings on the shuttle don’t create a lot of lift. With the payload bay empty – as it was for STS-28 – the lightweight landing speed is targeted at 195 kts. Under special circumstances, the pilots were allowed to land as slow as 185 kts. Brewster kept working and working to get the landing sink rate low and the speed kept dropping and dropping. At some point, as any fixed wing aircraft slows down, the wings will ‘stall’ and the aircraft will drop like a rock. Also, as the speed goes down, the pilot has to adjust the nose higher and higher – increasing the ‘angle of attack’ – to maintain lift. At some point with a high angle of attack at low altitude, the tail will scrape on the runway – always considered to be a catastrophic event for the shuttle.

The shuttle touched down at 154 kts. It is still the record for the slowest shuttle touchdown speed by a wide margin. It was less than 5 kts above stall speed. The tail avoided scraping by inches.

Oh, and by the way, the squat switches and software worked perfectly. No issues.

The post flight debriefings were all very positive and constructive – except for the entry and landing analysis. You can look back in my posts for the one called ‘Hockstein’s Law’ for a flavor.

I’ve never seen Brewster so embarrassed. In trying to avoid one hazard he nearly created another. In colorful pilot language (which I won’t repeat) he told us all that ‘on any given day the pilot can foul things up’. And it’s true. But I never blamed Brewster. We had set him up.

By concentrating on one issue to the exclusion of all others, and by not reminding him of the training – probably years earlier – about the hazards of very slow landings, we – the flight control team, the program office, NASA management – set him up.

When doing an accident (or close call) investigation, I’ve been told to ask ‘why’ seven times before getting to the root cause. The root cause, for example, can never be “the bolt broke”; a good accident investigator would ask “why did the bolt break?” Otherwise, the corrective action would not prevent the next problem. Simply putting in another bolt might lead to the same failure again. Finding out that the bolt was not strong enough for the application and putting in a stronger bolt – that is the better solution – and so on.
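For readers who like to see a process written down, here is a minimal sketch in Python (mine, not from any NASA procedure; the causal chain and every name in it are made up purely for illustration) of what asking ‘why’ repeatedly looks like: follow each intermediate cause to a deeper one, and stop only when you reach a human, procedural, or cultural factor.

    # Toy illustration of the "ask why up to seven times" discipline.
    # The causal chain below is hypothetical; a real investigation builds it
    # from evidence, not from a lookup table.
    CAUSE_OF = {
        "vehicle damaged on rollout": "the bolt broke",
        "the bolt broke": "the bolt was too weak for the landing loads",
        "the bolt was too weak for the landing loads": "the design review never checked that load case",
        "the design review never checked that load case": "the review checklist omitted landing loads",
    }

    # Causes treated as "root": human, procedural, or cultural issues.
    ROOT_CAUSES = {"the review checklist omitted landing loads"}

    def ask_why(event, max_whys=7):
        """Follow the causal chain from an event, asking 'why' up to max_whys times."""
        chain = [event]
        current = event
        for _ in range(max_whys):
            deeper = CAUSE_OF.get(current)
            if deeper is None:
                break  # ran out of evidence before reaching a root cause
            chain.append(deeper)
            current = deeper
            if current in ROOT_CAUSES:
                break  # stop at a human/procedural/cultural factor
        return chain

    for step, cause in enumerate(ask_why("vehicle damaged on rollout")):
        print(f"why #{step}: {cause}" if step else f"event: {cause}")

The only point of the sketch is that stopping at ‘the bolt broke’ leaves the chain unfinished; the questioning keeps going until the answer is something a person or an organization can actually change.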

The Russians had a spectacular failure of a Proton rocket a while back – check out the video on YouTube of a huge rocket lifting off and immediately flipping upside down to plunge straight into the ground. The announced ‘root cause’ was that some poor technician had installed the guidance gyro upside down. Reportedly the tech was fired. I wonder if they still send people to the gulag over things like that. But that is not the root cause: better to ask why the tech installed the gyro upside down. Were the blueprints wrong? Did the gyro box come from the manufacturer with the ‘this side up’ decal in the wrong spot? Then ask why the prints were wrong, or why the decal was in the wrong place. If you want to fix the problem, you have to dig deeper. And a real root cause is always a human, procedural, or cultural issue. Never ever hardware.

So it is with pilot error. Pilot error is never ever a root cause. Better to ask: was the training wrong? Were the controls wrong? Did the pilot get briefed on some other problem that caused a distraction and made him/her fly the plane badly?

Corrective actions must go to root causes, not intermediate causes. Really fixing the problem requires more work than simply blaming the pilot.

About waynehale

Wayne Hale is retired from NASA after 32 years. In his career he was the Space Shuttle Program Manager or Deputy for 5 years and a Space Shuttle Flight Director for 40 missions. He has since retired from consulting and is currently a full time grandpa. He might be available for speaking engagements for the right incentives (coffee and donuts work!)

26 Responses to Pilot Error is Never Root Cause

  1. rangerdon says:

    Well said, and important. Thanks.

  2. Yusef Johnson says:

    I can only imagine Hochstein in all of his glory during that postflight!

  3. David Fuller says:

    Good timing, right on the heels of the NTSB findings for the SpaceShipTwo mishap. (see http://www.ntsb.gov/news/events/Pages/2015_spaceship2_BMG.aspx for more info)

    But even though the Board came out strongly with a human factors/human performance related finding, some statements by Board members were disturbing. Bloomberg News quoted Robert Sumwalt as saying, “Humans will screw up anything if you give them enough opportunity.” This statement creates the stigma of a human doing “bad” things, as if they had a choice in the matter.

    As your opening quote from Peter Harle says, people make decisions because it seems like a good idea at the time. Using terms like “screw up” and, from an “expert” on cyber security “…No patch for human stupidity” is a way to blame the actor and not the latent and/or immediate factors.

    Worse, readers of these statements have the reaction of “I’d never be that dumb” or even the old fall back, “Well, they just need more training…”

    As every safety engineer should know, when a hazard is identified, the preferred solution is to remove the hazard. The least preferred solution is “train the operator.” Unfortunately training is always cheaper. At least, in the short term.

  4. Beth says:

    Wayne, I wondered if you had the recent SpaceShipTwo accident in mind when you wrote this.

    “SpaceShipTwo mishap due to pilot error and company training oversight.” Headline from SpaceFlightNow.com.

    Well written; I will remember to ask ‘why’ seven times.

  5. Mark Triplett says:

    While I’m not an engineer and I don’t work in the field of aviation, the “ask why 7 times” approach can apply to almost any profession. Thank you for another great insight into the black art of project and system management!

  6. numbers_guy101 says:

    I very much agree. Yet, I’m reminded of how often I found push-back in Shuttle operations when I would suggest that something or some process might be designed to be more “idiot-proof”, with the usual retort that if the error were ever committed, it would be because the technician had not read the instructions. Such attitudes seemed a blend of not asking “why” enough times when bad things did happen, being content to lay blame, and the notion that sometimes doing too much might actually complicate the thing. We let the validity of the latter – not wanting to overcomplicate things – obscure the former – not asking “why” enough times when considering critical systems and their failure modes.

  7. Charley S says:

    Great insight. Thank you.

  8. dphuntsman says:

    Humans make mistakes; that includes pilots, and flight controllers on the ground.

    I remember being interviewed for a section head job in Flight Control Division – competing against one Wayne Hale, who eventually won the job – and one of the questions I was asked was: What was the worst mistake you’ve ever made on the job? I wasn’t prepared for that, hemmed a little, and finally admitted that, on console, I had once sent the wrong command to the spacecraft. The branch chief interviewing me waved that off, saying, “Hell, everybody’s done that…”.

    We learned early on in the human space program that it’s important to try, where practical, to protect against hardware, software, and “procedural errors” – the latter being code for human error, whether on the ground or in the sky. That’s why I was surprised in the latest accident case that a single human error could have the consequences it did. It can be a hard call to make in the design process, of course; I’ve read reports about certain Airbus-built aircraft that are so automated to protect against “pilot error” that the protections themselves can end up causing a problem. But Wayne’s main point is spot-on: it’s not enough to know that a lever was moved early, or that a strut holding a He tank inside a LOX tank broke; in both of those cases, there remains the question: Why?

    Dave Huntsman

  9. Guillaume says:

    Thanks for sharing this insight. I’m a historian, so definitely not an engineer or a scientist, but the fields join here: There is never a monocausal explanation. Context is everything.

  10. John G. says:

    I agree; I think things went wrong with the open-ended direction to “land softly”. The need to land softly can be a constraint, but it isn’t a direction. Modifying physical parameters like landing speed, touchdown point, glide slope, etc. are the kinds of things that can be changed to accommodate a new constraint. They should have picked the most nominal numbers that least tickled the bug and said “hit those”.

  11. J says:

    You’re right that root causes will be human, but I also like the modern culture of keeping blame out of postmortems. People want to get to the root of things, and if they aren’t worried about the politics and blame, they can be much more honest about what’s broken.

  12. Tom Liverani says:

    One slight correction…this was Shaw’s second flight as CDR (following 61B).

  13. Michael Wright says:

    I sent this article to a friend who does hazards/safety training; he strongly agrees that the preferred solution is to remove the hazard.

    He sent me an illustration of the hierarchy of hazard controls, from most useful to least useful:
    1. Eliminate the hazard
    2. Substitute the hazard
    3. Isolate the hazard
    4. Use engineering controls
    5. Use administrative controls
    6. Use personal protective equipment.

    This looks simple enough, but it is difficult to begin with step one (pointless to get rid of airplanes). Looking deeply into this, each step has to be heavily considered, which is why HSF is so expensive and difficult to do. Obviously it could be done more easily, but it would be very dangerous.

  14. Dwight says:

    Outstanding article Wayne, thanks very much. One very minor nit – this was Shaw’s second command. He commanded 61B in late ’85.

  15. Daryl Schuck says:

    Wayne, I recall many years ago, when I was a fledgling new EMU spacesuit systems instructor, we had a rash of crews making the mistake of turning the switch labeled “WATER” on when they shouldn’t have. This switch fed water to the sublimator, which only worked at vacuum. Turning it on at any other time caused water to feed into the porous plate and just pass through, letting water into the PLSS in places it really shouldn’t go, possibly creating bigger problems later when the suit went to vacuum, when the water would freeze and expand and do very bad things. It was a serious problem, and I do recall being on the receiving end of criticism from the engineering side asking why we couldn’t train the crew to stop making such a stupid mistake. The truth is, the crew were confusing this switch labeled “WATER” with the one labeled “FAN” right next to it, and for a very explainable reason: the fan switch not only drove the fan but also the pump that moved water around the cooling loop. To us engineers, all so focused on just this suit system, it was a ‘stupid mistake,’ but to a crew, this was just a fraction of their entire training. A crewperson wanting to turn the suit pump on could incorrectly think the “WATER” switch was the right switch for that. My human factors courses taught me to think differently, and in time I helped convince folks this wasn’t a human error issue, but a problem with the design and labeling of the Human-Machine Interface. The fix ended up being a foam switch guard that we put on the switch labeled “WATER”, which was removed by procedure just prior to depress and replaced shortly after repress. To my knowledge this eliminated any further incidents of this “stupid mistake”, and it is still in place to this day. I’m glad the spacesuit community came to the right conclusion that human error is never the root cause.

  16. Daniel Woodard says:

    Wayne, perhaps you could settle a debate that has simmered for decades. I have read from several sources that the Shuttle, for most of the program, had an autoland system that was capable of operating the flight controls all the way to rollout. The crew would have to start the APUs, lower the landing gear, and deploy the drag chute, but the flight controls could be operated by the flight computer. According to the Wikipedia account, the system was not complete on STS-3, when it was engaged until late in final approach, and it produced incorrect inputs to the speed brake. https://en.wikipedia.org/wiki/STS-3 Were the problems resolved before STS-28? Was the autoland system fully tested on the STA? Was there any technical reason that a shuttle never landed automatically?

  17. John says:

    Didn’t you say once that there were issues with the STS-33 landing, too?

    • waynehale says:

      There were issues with every landing, some small, some greater. Under the microscope there were imperfections in them all.

      • John says:

        Wow, interesting. They all seemed so good, visually lol. I remember STS-91 kind of bounced after main gear touchdown; gusty winds maybe (?). I was fortunate enough to witness one in person – really something to see.

      • Michael Wright says:

        Reminds me of STS-9, which had a fire in the aft compartment during landing, but it extinguished before causing serious damage. I always wondered what would have happened if things had gone terribly wrong on that landing of the first Spacelab mission, with European astronauts plus John Young making his 6th spaceflight.

        Another scary landing was the intake of hydrazine and/or nitrogen tetroxide gases into the command module during atmosphere equalization on the 1975 Apollo-Soyuz mission. I believe Vance Brand passed out, and the other crew were able to don O2 masks and then get one on Brand. A tragedy could have resulted, with all the crew perishing on the last Apollo mission (a double whammy, as all the crew perished in the first Apollo mission, though that was a ground test).

      • waynehale says:

        And these issues bear on pilot error how?

  18. Dave H. says:

    Wayne,

    The Human Performance seminars my employer sent everyone to taught us that in the end EVERYTHING is caused by human error. The premise is simple: humans make everything, and nothing drops out of the sky.

    The class also taught us that disasters are rarely caused by a single event; they are caused by multiple events. The metaphor used was the layers of an onion. One by one, the layers are peeled back until the center is exposed. Even though the event was not specifically covered in the seminar (STS-51L and Apollo 13 were, STS-107 will be in the next edition), we used the “onion” to analyze the wreck of the Edmund Fitzgerald. We came to the conclusion that misplaced faith in the machine (“She’s always gotten us home before, she’ll get us home this time.”) was the final layer, exposed by the two huge rogue waves that nearly sent the Arthur Anderson to the bottom as well.

    It’s been my experience that every day, humans discover the flaws and repair them. Depending on the timing and severity, these events begin with preventive maintenance and end with near misses, or more accurately, near hits! Go one step further and you lose Air France Flight 447 and Fukushima.

    There are always going to be losses and disasters, but it’s up to us to not let it happen on our watch.

  19. Pingback: Branson Still Doesn’t Really Understand Why SpaceShipTwo Crashed – Parabolic Arc

  20. jlscott64 says:

    I know this is years later, but Michael Wright’s comment on the ASTP reentry problems bears directly on “pilot error”. The first “Why” on that incident was that the RCS was left on. The second “Why” was that the CMP missed a checklist item.
