
Are you playing Russian roulette? Learning from failure

I think most (if not all?) testers have witnessed situations like this: A new feature of the system is put into production, only to crash weeks, days or just hours later.
”Why didn’t anybody think of that?!”
Truth is, quite often, somebody did actually think about the problem, but the issue was not realised, communicated or accepted.
Below is the story about the space shuttle Challenger accident in 1986.
Twenty-nine years ago, space shuttle Challenger exploded 73 seconds into the flight, killing the seven astronauts aboard.
Theoretical physicist Richard Feynman was a member of the accident commission. During the hearings he commented that the whole decision making in the shuttle project was ”a kind of Russian roulette”.
The analogy is striking. Russian roulette is only played by someone willing to risk dying.
I don’t know anyone who deliberately wants to play Russian roulette, so why did they play that game?
Feynman explains: [The Shuttle] flies [with O-ring erosion] and nothing happens. Then it is suggested, therefore, that the risk is no longer so high for the next flights. We can lower our standards a little bit because we got away with it last time…. You got away with it but it shouldn’t be done over and over again like that.
The problem that caused the explosion was traced to leaking seals in one of the booster rockets. On this particular launch, ambient temperatures were lower than usual, and for that reason the seals all failed. The failed seals allowed very hot exhaust gases to leak out of the rocket combustion chamber, and eventually these hot gases ignited the many thousand litres of highly explosive rocket fuel.
Challenger blew up in a split second. The seven astronauts probably didn’t realise they were dying before their bodies were torn to pieces.
It was a horrible tragedy.
Chapter 6 of the official investigation report is titled: ”An accident rooted in history.”
The accident was made possible by consistent misjudgements, poor post-flight investigations, and technical reports that went unread or unheeded. The immediate cause was that three seals failed on this particular launch, but the problem was already known, and the failure could happen only because that problem was systematically ignored.
The tester’s fundamental responsibilities
As a tester, I have three fundamental responsibilities:

  1. Perform the best possible testing in the context.
  2. Do the best possible evaluation of what I’ve found and learnt during testing. Identify and qualify bugs and product risks.
  3. Do my best to communicate and advocate these bugs and product risks in the organisation.

The Challenger accident was not caused by a single individual who failed to detect or report a problem.
The accident was made possible by systemic factors, i.e. factors outside the control of any individual in the programme. Eventually, everyone fell into the trap of relying on what seemed to be “good experience”. The facts should have been taken seriously.
A root cause analysis should never identify only the individual and concrete factors. It should also identify the systemic factors that enabled the problem to survive into production.
Chapter 6 of the Challenger report reminds me that, when something goes wrong in production, performing a root cause analysis is a bigger task than just finding out the chain of events that led to the problem.
Many thanks to Chi Lieu @SomnaRev for taking time to comment on early drafts of this post.

Photo of the space shuttle Challenger accident Jan. 28, 1986. Photo credit: NASA