Accident Investigations

This week’s crash of a commuter train in New Jersey plus other recent events got me thinking about the lessons I have learned the hard way through several accident investigations.

One of the toughest parts of investigating an accident is that everyone wants to know the answer, the cause, right away.  Folks, it simply does not happen quickly.  Accidents involving complex high-energy systems are tough to figure out.  Even little incidents can be hard; as we used to say in the shuttle program, “The first story is always wrong.”  It takes time to get an accurate picture and compile the evidence.  It will be long after the news cycle has moved on to other things before the true cause becomes apparent in these complex systems failures.  The boss and the media want answers right away and sometimes will make something up – or at least grasp at the first theory – to get a quick answer.  Real accident investigators know it takes time.  Typically the NTSB takes a year to finalize its reports; I’d have to say that is about right in most cases.

So I’d offer this explanation about why it takes so long.  And I would offer a few tips and rules from my experience.  This is not what they teach in accident investigation school where the emphasis is typically on preserving the data, making sure the evidence is uncontaminated, gathering witness statements, collecting the maintenance records, examining training records, and the other mechanics of the process – which are important.  These rules are in addition to those good practices which should be followed at all times.  Sort of ‘Gibbs Rules’ here:

Rule #1 for accident investigators:  keep an open mind.  Do not start making theories too early.  Stay away from quick conclusions and let the facts lead to the conclusion, not the other way around.  It is very hard not to start spinning theories right away.  Don’t speculate and don’t let others lead you into conversations that speculate on the cause.  It is amazing how biased thinking becomes and how easy it is to overlook evidence once your mind (consciously or subconsciously) believes it has a conclusion.  Keep an open mind as long as you can.  When we had shuttle ‘anomalies’ the first analysis was almost invariably wrong and sometimes it would take months before we got to the correct conclusion.  Fixing the wrong thing is never helpful.

Rule #2 for accident investigators:  make a good, comprehensive timeline of all the information around the time of the accident.  This is not as easy as it sounds.  In the rocket business we always have a stream of telemetry of pressures, temperatures, valve positions, operating speeds, etc., recorded in a central place.  But sometimes the telemetry is built up from different sources on the rocket or on the ground.  It is important to dig into the time sources and make sure every event is put on a master timeline with the correct, cross-correlated time.  When milliseconds count, as they often do in these investigations, make sure that time lags and adjustments in the system are fully understood and accounted for.  For other evidence, video from external cameras for instance, be certain about the timing source and make sure the frame rate of the video is accounted for.  It is vitally important that video or still photos give their evidence in the right time frame.  Knowing what happened when, and in the right order, is the most powerful tool in the investigator’s toolbox.  Getting all the evidence and timing right can take weeks.  Be patient and adjust the timeline as the evidence comes in.  Continue to keep an open mind at this stage.
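To make the timing point concrete, here is a minimal sketch of putting events from different sources onto one master timeline.  Everything in it is an assumption made for illustration: the names (Event, to_master, frame_to_master, build_timeline), the idea that each source has a single fixed clock offset and recording lag, and the sample numbers.  Real systems can have drifting clocks and variable lags, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    master_time_s: float  # time on the common (master) clock, in seconds
    source: str           # where the evidence came from
    description: str


def to_master(local_time_s, clock_offset_s, latency_s=0.0):
    """Convert a source's local timestamp to master time.

    clock_offset_s: master clock minus source clock at the same instant
                    (assumed constant here).
    latency_s:      known delay between the event and the moment it was
                    time-stamped (assumed constant here).
    """
    return local_time_s + clock_offset_s - latency_s


def frame_to_master(frame_index, first_frame_master_s, frames_per_second):
    """Convert a video frame number to master time using the frame rate."""
    return first_frame_master_s + frame_index / frames_per_second


def build_timeline(events):
    """Return the events in master-time order."""
    return sorted(events, key=lambda e: e.master_time_s)


# Hypothetical evidence from three sources with three different time bases.
events = [
    Event(frame_to_master(45, 99.0, 29.97), "external camera", "flash visible"),
    Event(to_master(98.400, 2.000), "ground sensor", "vibration spike"),
    Event(to_master(100.250, 0.120, latency_s=0.020), "telemetry", "pressure drop"),
]

for event in build_timeline(events):
    print(f"{event.master_time_s:9.3f} s  {event.source:16s} {event.description}")
```

In this made-up case the raw local timestamps would put the ground sensor first, but after correction the telemetry event comes first, then the ground sensor, then the video frame.  When a better offset or lag estimate arrives, the timeline is simply rebuilt from the corrected values.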

Rule #3 for accident investigators:  make sure the physical evidence is examined by a disinterested third party that is well qualified to evaluate it.  To avoid the appearance of impropriety that may cloud a final report, using a well-qualified lab that is not associated with the manufacturer or operator of the equipment is vital.  Well-qualified labs and experts are hard to find and often expensive, but it is worth the time and cost to come to a final conclusion that is as free of controversy as possible.

Rule #4 for accident investigators:  make a fault tree.  This helps the accident investigator build up knowledge about the system and makes sure that all possible causes are investigated.  Fault trees typically are of less use than the outside world may think.  The most important result from building the fault tree is that it makes the accident investigators aware of all the items that need to be examined.
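As a rough sketch of that last point, a fault tree can be held as a simple nested structure and walked to list every bottom-level item that needs to be examined.  The names FaultNode and items_to_examine, and the tree itself, are invented for illustration; a real fault tree would also carry AND/OR gates and probabilities, which are left out here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FaultNode:
    name: str
    children: List["FaultNode"] = field(default_factory=list)


def items_to_examine(node):
    """Return the leaf nodes of the tree, left to right.

    Leaves are the candidate causes that cannot be broken down further,
    so each one is something the investigators must look at.
    """
    if not node.children:
        return [node.name]
    leaves = []
    for child in node.children:
        leaves.extend(items_to_examine(child))
    return leaves


# Strictly hypothetical tree, not based on any real accident.
tree = FaultNode("Train did not stop", [
    FaultNode("Brakes failed to apply", [
        FaultNode("Part X failed"),
        FaultNode("Brake command not transmitted"),
    ]),
    FaultNode("Operator did not command braking"),
])

for item in items_to_examine(tree):
    print("examine:", item)
```

The value is in the checklist that falls out of the tree, not in the tree as a finished product.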

Rule #5 for accident investigators:  ask ‘why’ seven times.  It is much too easy to come to a first level conclusion and leave the investigation.  That is guaranteed to result in future accidents.

I have no idea why the train crash occurred, but let’s take an imaginary trip through the kind of questions that an accident investigator should ask later in the investigation when a proximate cause is identified.  Here is that strictly hypothetical example:

Q1:  Why did the train not stop?  A1:  The brakes failed to apply when commanded by the operator.
Q2:  Why did the brakes fail?  A2:  Part X in the braking system failed.
Q3:  Why did part X in the braking system fail?  A3:  It was installed improperly at the last maintenance period.
Q4:  Why was part X installed improperly?  A4:  The maintenance installation procedure was incorrect.
Q5:  Why was the maintenance procedure incorrect?  A5:  The procedure was not updated when a new part manufacturer was selected to build part X.
Q6:  Why was the procedure not updated?  A6:  The process for updating maintenance procedures did not allow for a change in part manufacturer.
Q7:  Why did the process not allow for a new manufacturer?  A7:  It was not foreseen that a new part manufacturer would make a part that needed new installation procedures.

Following this hypothetical case – and just note that I know nothing about the train crash, I am just making this up as a teaching tool – an accident investigator would find that the proximate cause of the accident was a braking failure, but the root cause was an inadequate process to account for new part manufacturers.  The corrective action is to update the maintenance procedure change process to ensure that when a new part is introduced, the maintenance procedures are updated properly.

It is easy to see that if one quit with the part failure, a band-aid fix would probably ensure that that particular part never failed again, but other failures could occur.  Similarly, if at step 3 the investigation were to blame the maintenance person who installed the part, not only would an injustice occur, but the way for other failures in the system would be left open.  It is important to get to root cause – which is almost always a process problem – and address that as well as the simpler corrective action for proximate cause.
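The hypothetical chain of ‘whys’ above can be written down as a plain list, which makes it easy to see what an investigation would end up fixing if it stopped at a given step.  This is only an illustrative sketch: the function name finding_at_step is invented, and the answers are shortened from the made-up train example.

```python
# Answers A1..A7 from the hypothetical example, in order.
why_chain = [
    "Brakes failed to apply when commanded",
    "Part X in the braking system failed",
    "Part X was installed improperly",
    "Maintenance installation procedure was incorrect",
    "Procedure not updated for the new part manufacturer",
    "Update process did not allow for a change in manufacturer",
    "New installation needs for a new manufacturer were not foreseen",
]


def finding_at_step(chain, step):
    """Return the answer an investigation ends on if it stops at `step`.

    Steps are counted from 1, matching Q1..Q7 in the text.
    """
    if not 1 <= step <= len(chain):
        raise ValueError(f"step must be between 1 and {len(chain)}")
    return chain[step - 1]


print("Stop at step 2:", finding_at_step(why_chain, 2))
print("Stop at step 3:", finding_at_step(why_chain, 3))
print("Keep going    :", finding_at_step(why_chain, len(why_chain)))
```

Stopping at step 2 leads to replacing a part, stopping at step 3 leads to blaming a person, and only the later steps point at the process that needs to change.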

All of this takes time and discipline.  Months may pass before the real cause of the accident can be established.  That is true for rocket ships and airplanes and I’m sure it’s true for trains.

Don’t expect the media to get this right, or even care about it when the final report comes out.  Short attention span there.  The important thing is that the source of future accidents has been cut off.

Rule #6 for accident investigators:  there will always be conflicting and confusing information.  It is very rare in these failures of complex systems to come to a single, final, guaranteed 100% conclusion.  There are always counterindications and other possibilities.  A good accident report will always give the most likely cause and list other causes which are less likely but cannot be completely ruled out.  Absolute certainty is not something that engineers or accident investigators deal in.

So be patient and let the investigators do their job.

Oh, and follow Gibbs rule #13.  Look it up.

About waynehale

Wayne Hale is retired from NASA after 32 years. In his career he was the Space Shuttle Program Manager or Deputy for 5 years, a Space Shuttle Flight Director for 40 missions, and is currently a consultant and full time grandpa. He is available for speaking engagements through Special Aerospace Services.

11 Responses to Accident Investigations

  1. Pete Goldie says:

    I, along with several dozen other amateur astronomers and space enthusiasts, took imagery of the STS-107 entry accident. One of my images appeared to show an interaction with the shuttle flightpath and the high atmosphere, and the timing happened to coincide exactly with initial loss of spacecraft telemetry. The mass media was eager to report that Columbia was downed by “space lightning”, but NASA did not take the bait. While some might have conceivably hoped that the accident could be blamed on an entirely unforeseen phenomenon, high-altitude physics could not support this explanation. To be certain, a special study group was formed to examine the image, as implausible as the theory was. Over the next two weeks, the telemetry timeline was refined to show spacecraft anomalies began earlier and other evidence pointed to the better conclusion. To be sure, neither the first answer nor the easy answer was the right one.

  2. rangerdon says:

    To be put in the Bible of such events.

  3. The best outcome is the Falcon 9 explosion was already determined to be an act of terrorism, they’re only deferring the announcement until after H-Rod’s election & flights can resume on Nov 9. The worst outcome is it takes 2 years of investigation, lots of people lose their jobs, & the whole COTS experiment becomes a failure. If flights can’t resume until 2018, they would have to buy Soyuz flights until long after 2030 & probably switch to a government vehicle for cargo. It would be a failure of commercial spaceflights for our generation.

  4. Dave H. says:

    You’re quite correct about jumping to conclusions. The El Faro, lost in a hurricane one year ago, has an incomplete investigation. All we know is the end result: for reasons yet unknown, they chose to sail into the path of a Cat 4 hurricane and 33 people were lost at sea.

    I was hoping that you’d seen “Deepwater Horizon” and had thoughts about its many teachable moments. I saw it, and it was instantly comparable to BP’s Texas City explosion.

    It’s tough to change a culture.

  5. Roy Smith says:

    Wayne, your article should be required reading for journalism students. Thanks for such a clearly written summary.

  6. USARocketman says:

    Timely thoughts, Wayne. I was running through similar thoughts when watching the recent movie Sully. I do not know how much was movie vs. fact in the NTSB making too quick a judgment on the big screen, but it reminded me to not jump to conclusions when additional data is forthcoming (i.e., engine not yet recovered), to consider the timeline with all available parameters (i.e., don’t look 2D, look 4D), and to recognize the reaction time required for the human-in-the-loop decisions that go into making life or death actions (i.e., situational awareness, then react).

  7. Dave H. says:

    Click to access 598645.pdf


    This is a link to the NTSB transcript from the El Faro’s Voyage Data Recorder. This device records bridge audio and the resulting transcript reads almost like a movie script. The only thing that’s missing is one-half of the ship’s in-house telephone conversations. The result is that we never get to know what exactly the ship’s Chief Engineer told the Captain about why they were unable to get the ship’s “plant”, or steam turbine propulsion unit, back into operation. What is known is that the engineers were unable to establish bearing oil pressure on the turbine, and if it was like every other turbine I’ve ever encountered in my career bearing oil pressure is a start permissive. No oil pressure, no run.

    I was taught that major disasters and accidents are rarely caused by one thing; they tend to result from several seemingly insignificant events and that one large unexpected event causes the disaster to take place. One instructor likened it to unpeeling an onion. In the El Faro’s case, the loss of propulsion and the inability to get it back was that large event. There were other events taking place…the ship was listing and water was entering the engine room through the ventilation system, but the disaster takes many hours to unfold and one can easily imagine what the ship and crew were enduring and feeling as the hurricane intensified.

  8. Jonathan says:

    I investigate mining accidents for a living and have had to work with ‘experts’ that were sure they knew the cause within moments of arriving on the scene, or even just a description over the phone – every time they were wrong with their initial diagnosis, though some refused to change their conclusion.
    I am also wary of initial reports from the scene – they often have the wrong equipment or misunderstand what happened. My investigations (usually) don’t take a year, but they are also much simpler than aircraft accidents.
