This week’s crash of a commuter train in New Jersey plus other recent events got me thinking about the lessons I have learned the hard way through several accident investigations.
One of the toughest parts of investigating an accident is that everyone wants to know the answer, the cause, right away. Folks, it simply does not happen quickly. Accidents involving complex, high-energy systems are tough to figure out. Even little incidents can be hard; as we used to say in the shuttle program, “The first story is always wrong.” It takes time to get an accurate picture and compile the evidence. In these complex system failures, the true cause becomes apparent only long after the news cycle has moved on to other things. The boss and the media want answers right away and sometimes will make something up – or at least grasp at the first theory – to get a quick answer. Real accident investigators know it takes time. Typically the NTSB takes a year to finalize its reports; I’d have to say that is about right in most cases.
So I’d offer this explanation about why it takes so long. And I would offer a few tips and rules from my experience. This is not what they teach in accident investigation school where the emphasis is typically on preserving the data, making sure the evidence is uncontaminated, gathering witness statements, collecting the maintenance records, examining training records, and the other mechanics of the process – which are important. These rules are in addition to those good practices which should be followed at all times. Sort of ‘Gibbs Rules’ here:
Rule #1 for accident investigators: keep an open mind. Do not start making theories too early. Stay away from quick conclusions and let the facts lead to the conclusion, not the other way around. It is very hard not to start spinning theories right away. Don’t speculate and don’t let others lead you into conversations that speculate on the cause. It is amazing how biased thinking becomes and how easy it is to overlook evidence once your mind (consciously or subconsciously) believes it has a conclusion. Keep an open mind as long as you can. When we had shuttle ‘anomalies’, the first analysis was almost invariably wrong, and sometimes it would take months before we got to the correct conclusion. Fixing the wrong thing is never helpful.
Rule #2 for accident investigators: make a good, comprehensive timeline of all the information around the time of the accident. This is not as easy as it sounds. In the rocket business we always have a stream of telemetry of pressures, temperatures, valve positions, operating speeds, etc., recorded in a central place. But sometimes the telemetry is built up from different sources on the rocket or on the ground. It is important to dig into the time sources and make sure every event is put on a master timeline with the correct, cross-correlated time. When milliseconds count, as they often do in these investigations, make sure that time lags and adjustments in the system are fully understood and accounted for. For other evidence, video from external cameras for instance, be certain about the timing source and make sure the frame rate of the video is accounted for. It is vitally important that video or still photos give their evidence in the right time frame. Knowing what happened when, and in the right order, is the most powerful tool in the investigator’s toolbox. Getting all the evidence and timing right can take weeks. Be patient and adjust the timeline as the evidence comes in. Continue to keep an open mind at this stage.
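To make the master timeline idea concrete, here is a minimal sketch in Python. It is not how any particular program does it. The source names, clock offsets, lag values and events are all invented for illustration, and it assumes each source's offset from the master clock has already been measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    master_time_s: float  # seconds on the shared master clock
    source: str
    description: str


def to_master_time(local_time_s, clock_offset_s, latency_s=0.0):
    """Map a source's own timestamp onto the master clock.

    clock_offset_s: amount to add to the source clock to match the master clock.
    latency_s: known lag between the event and the moment it was stamped.
    """
    return local_time_s + clock_offset_s - latency_s


def frame_to_master_time(frame_index, first_frame_master_s, frames_per_second):
    """Time of a video frame, given the master time of frame 0 and the frame rate."""
    return first_frame_master_s + frame_index / frames_per_second


def build_timeline(events):
    """Return events in master-clock order."""
    return sorted(events, key=lambda e: e.master_time_s)


# Hypothetical readings, invented purely for illustration.
timeline = build_timeline([
    Event(to_master_time(100.250, clock_offset_s=0.120, latency_s=0.020),
          "onboard_telemetry", "pressure drop recorded"),
    Event(to_master_time(100.300, clock_offset_s=-0.010),
          "ground_sensor", "vibration spike recorded"),
    Event(frame_to_master_time(9, first_frame_master_s=100.0, frames_per_second=30.0),
          "camera", "first visible plume"),
])

for event in timeline:
    print(f"{event.master_time_s:9.3f} s  {event.source:18s} {event.description}")
```

In this made-up data the telemetry event carries the earliest raw timestamp but lands last once its offset and lag are applied, which is exactly the kind of reordering that changes the story.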
Rule #3 for accident investigators: make sure the physical evidence is examined by a disinterested third party that is well qualified to evaluate it. To avoid the appearance of impropriety that may cloud a final report, using a well-qualified lab that is not associated with the manufacturer or operator of the equipment is vital. Well-qualified labs and experts are hard to find and often expensive, but it is worth the time and cost to come to a final conclusion that is as free of controversy as possible.
Rule #4 for accident investigators: make a fault tree. This helps the accident investigator build up knowledge about the system and makes sure that all possible causes are investigated. Fault trees typically are of less use than the outside world may think. The most important result from building the fault tree is that it makes the accident investigators aware of all the items that need to be examined.
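As a rough illustration of using a fault tree as a checklist, here is a small Python sketch. The tree contents are hypothetical and invented for this example. A real fault tree also distinguishes AND and OR gates; this simplified version treats every branch as an independent candidate cause.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FaultNode:
    name: str
    ruled_out: bool = False  # set True once evidence eliminates this branch
    children: List["FaultNode"] = field(default_factory=list)


def open_items(node):
    """List the leaf-level causes that still need to be examined."""
    if node.ruled_out:
        return []  # an eliminated branch takes everything beneath it along
    if not node.children:
        return [node.name]
    items = []
    for child in node.children:
        items.extend(open_items(child))
    return items


# Hypothetical tree, invented for illustration only.
tree = FaultNode("Vehicle did not stop", children=[
    FaultNode("Braking was never commanded", ruled_out=True),
    FaultNode("Brakes failed to apply when commanded", children=[
        FaultNode("Part X failed"),
        FaultNode("Command did not reach the brakes"),
    ]),
])

print(open_items(tree))
```

The useful output is the list of items still open, which matches the point above: the tree earns its keep by showing what has yet to be examined.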
Rule #5 for accident investigators: ask ‘why’ seven times. It is much too easy to come to a first-level conclusion and leave the investigation. That is guaranteed to result in future accidents.
I have no idea why the train crash occurred, but let’s take an imaginary trip through the kind of questions that an accident investigator should ask later in the investigation when a proximate cause is identified. Here is that strictly hypothetical example:
Q1: Why did the train not stop? A1: The brakes failed to apply when commanded by the operator.
Q2: Why did the brakes fail? A2: Part X in the braking system failed.
Q3: Why did part X in the braking system fail? A3: It was installed improperly at the last maintenance period.
Q4: Why was part X installed improperly? A4: The maintenance installation procedure was incorrect.
Q5: Why was the maintenance procedure incorrect? A5: The procedure was not updated when a new part manufacturer was selected to build part X.
Q6: Why was the procedure not updated? A6: The process for updating maintenance procedures did not allow for a change in part manufacturer.
Q7: Why did the process not allow for a new manufacturer? A7: It was not foreseen that a new part manufacturer would make a part that needed new installation procedures.
Following this hypothetical case – and just note that I know nothing about the train crash, I am just making this up as a teaching tool – an accident investigator would find that the proximate cause of the accident was a braking failure, but the root cause was an inadequate process to account for new part manufacturers. The corrective action is to update the maintenance procedure change process to ensure that when a new part is introduced, the maintenance procedures are updated properly.
It is easy to see that if one quit with the part failure, a band-aid fix would probably ensure that that particular part never failed again, but other failures could occur. Similarly, if at step 3 the investigation were to blame the maintenance person who installed the part, not only would an injustice occur, but the way would be left open for other failures in the system. It is important to get to the root cause – which is almost always a process problem – and address that as well as the simpler corrective action for the proximate cause.
All of this takes time and discipline. Months may pass before the real cause of the accident can be established. That is true for rocket ships and airplanes and I’m sure it’s true for trains.
Don’t expect the media to get this right, or even care about it when the final report comes out. Short attention span there. The important thing is that the source of future accidents has been cut off.
Rule #6 for accident investigators: there will always be conflicting and confusing information. It is very rare in these failures of complex systems to come to a single, final, 100% guaranteed conclusion. There are always counterindications and other possibilities. A good accident report will always give the most likely cause and list other causes which are less likely but cannot be completely ruled out. Absolute certainty is not something that engineers or accident investigators deal in.
So be patient and let the investigators do their job.
Oh, and follow Gibbs rule #13. Look it up.