
Introducing STPA – a new Test Analysis Technique

At the core of innovation in IT is someone getting the idea of connecting existing services and data in new ways to create new and better services. The old wisdom behind it is this:

The Whole is Greater than the Sum of its parts
– Aristotle

There is a flipside to this type of innovation, though, because the opposite is also true: the whole can become more problematic than the sum of all the known risks.
My experience as a tester and test manager is that projects generally manage risks in individual subsystems and components quite well.
But I have on occasions found that we have difficulty imagining and properly taking care of things that might go wrong when a new system is connected to the infrastructure, subjected to real production data and actual business processes, and exposed to the dynamics of real users and the environment.

Safety, Accidents and Software Testing

Some years ago I came across the works of Dr. Nancy Leveson while researching, and found them very interesting. She approaches the problem of making complex systems safe in a different way than most.
Leveson is professor of aeronautical engineering at MIT and author of Safeware (1994) and Engineering A Safer World (2011).
In the 2011 book, she describes her Systems-Theoretic Accident Model and Process – STAMP. STAMP gives up the idea that accidents can be traced back to simple chains of causal events and instead perceives safety as an emergent property of a system.
I read the book a while ago, but have only recently managed to begin transforming her ideas into software testing.
It actually took a tutorial and some conversations with both Dr. Leveson and her colleague Dr. John Thomas at the 5th European STAMP/STPA workshop in Reykjavik, Iceland in September to completely wrap my head around these ideas.
I’m now working on an actual case and an article, but have decided to write this blog as a teaser for other testers to look into Leveson’s work. There are quality resources freely available which can help testers (I list them at the end of this blog).
The part of STAMP I’m looking at is the STPA technique for hazard analysis.
According to Leveson, hazard analysis can be described as “investigating an accident before it occurs”. Hazards can be thought of as a specific type of bug: one with potentially harmful consequences.
STPA is interesting to me as a tester for a few reasons:

  • As an analysis technique, STPA helps identify potential causes of complex problems before business, human, and societal assets are damaged.
  • One can analyze a system and figure out how individual parts need to behave for the whole system to be safe.
  • This means that we can test parts for total systems safety.
  • It works top-down and does not require access to knowledge of all implementation details.
  • Rather, it can even work on incomplete models of a system that’s in the process of being built.

To work, STPA requires a few assumptions to be made:

  • The complete system of human and automated processes can be modeled as a “control model”.
  • A control model consists of interconnected processes that issue control actions and receive feedback/input.
  • Safety is an emergent property of the actual system, including its users and operators; it is not something that is “hardwired” into the system.

I’d like to talk a bit about the processes and the control model. In IT we might think of the elements in the control model as user stories consisting of descriptions of actors controlling or triggering “something”, which in turn produces some kind of output. The output is fed as input either to other processes or back to the actor.
The actual implementation details should be left out initially. The control structure is mainly a model of the interconnections between user stories.
Once the control model is sufficiently developed, the STPA analysis itself is a two-step activity in which one iterates through each user story in the control structure to figure out exactly what is required from it individually to make the whole system safe. I won’t go into detail here about how it works, but I can say that it’s actually surprisingly simple – once you get the hang of it.
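To make this a bit more concrete, here is a minimal sketch in Python – my own illustration, not Leveson’s or Thomas’s notation – of how one element of a control model could be captured as data, and what the first analysis step asks about it. The guide words are the standard STPA categories of unsafe control actions; the example control action and the names in it are hypothetical.

```python
# A minimal sketch (not Leveson's notation): one control model element as data,
# plus the standard STPA guide words for unsafe control actions.
from dataclasses import dataclass, field

@dataclass
class ControlAction:
    name: str                     # e.g. "deploy trading algorithm to production"
    controller: str               # the actor in the user story issuing the action
    controlled_process: str       # the process receiving the action
    feedback: list = field(default_factory=list)  # what the controller sees back

UNSAFE_GUIDE_WORDS = [
    "not provided when needed",
    "provided when it should not be",
    "provided too early, too late, or in the wrong order",
    "stopped too soon or applied too long",
]

def unsafe_control_action_prompts(action: ControlAction) -> list:
    """Step one of the analysis: questions to ask about a single control action."""
    return [f"Is it hazardous if '{action.name}' is {guide}?" for guide in UNSAFE_GUIDE_WORDS]

# Hypothetical example element, loosely inspired by the Knight Capital story below.
deploy = ControlAction(
    name="deploy trading algorithm to production",
    controller="release manager",
    controlled_process="production trading environment",
    feedback=["deployment status", "monitoring alerts"],
)

for prompt in unsafe_control_action_prompts(deploy):
    print(prompt)
```

Roughly speaking, the second step then looks for scenarios that could cause each hazardous answer, and those scenarios translate naturally into component-level requirements – and test ideas.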

Dr. John Thomas presented an inspiring tutorial on STPA at the conference.

Safety in IT

I have mentioned Knight Capital Group’s new trading algorithm on this blog before as it’s a good example of a “black swan project” (thanks to Bernie Berger for facilitating the discussion about it at the first WOTBLACK workshop).
Knight was one of the more aggressive investment companies on Wall Street. In 2012 they developed a new trading algorithm, which was tested using a simulation engine. However, the deployment of the algorithm to the production environment turned out to be unsafe: although it was only meant to be used for testing, the simulation engine was deployed and started in production, resulting in fake data being fed to the trading algorithm. After 45 minutes of running this system on the market (without any kind of monitoring), Knight Capital Group was effectively bankrupt. Although no persons were harmed, the losses were massive.
Commonly, only some IT systems are considered “safety critical” because they have the potential to cause harm to someone or something. Cases like that of Knight Capital indicate to me that we need to expand this perspective and consider safety a property of all systems that are critical to a business, to society, to the environment or to individuals.
Safety is relevant to consider whenever there are risks that significant business, environmental, human, personal or societal assets can be damaged by actions performed by a system.

STAMP/STPA and the Future of Testing

So, STPA offers a way to analyze systems. Let’s get this back to testing.
Software testing relies fundamentally on testers’ critical thinking abilities to imagine scenarios and generate test ideas using systematic and exploratory approaches.
This type of testing is challenged at the moment by

  • Growing complexity of systems
  • Limited time to test
  • Problems performing in-depth, good coverage end-to-end testing

DevOps and CD (continuous delivery) attempt to address these issues, but they also amplify the challenges.
I find that, as professional testers, we more and more often find ourselves trapped in frustrating “races against the clock” because of the innovation of new and more complex designs.
Rapid Software Testing seems to be the only sustainable testing methodology out there that can deal with this, but we still need to get a good grip on the complexity of the systems we’re testing.
Cynefin is a set of theories which are already helping testers embrace new levels of complexity in both projects and products. I’m actively using Cynefin myself.
STAMP is another set of theories that I think are worth looking closely at. Compared to Cynefin, STAMP embraces a systems-theoretical perspective and offers processes for analyzing systems and identifying the component-level requirements that are necessary for safety. If phrased appropriately, these requirements are direct equivalents of test ideas.
STAMP/STPA has been around for more than a decade and is already in wide use in engineering. It is solid material from one of the world’s leading engineering universities.
At the Vrije Universiteit Amsterdam in the Netherlands, STPA is being taught to students of software testing.
The automobile industry is rapidly adopting STPA to manage the huge complexity of interconnected systems with millions of lines of code.
And there are many other cases.
If you are curious to know more, I suggest you take a look at the resources below. If you wish to discuss this or cooperate with me on it, please write to me on Twitter @andersdinsen or by e-mail, or join me at the second WOTBLACK workshop in New York on December 3rd, where we might find good time to talk about this and other emerging ideas.

Resources

Thanks to John Thomas and Jess Ingrassellino for reviewing drafts of this blog post. Errors you may find are mine, though.

This photo shows machinery in an Icelandic geothermal power plant. Water, heated to 300 °C by the underground magma, flows up, drives the turbines and produces warm water for Reykjavik.


Testing Hopes for 2014

Christmas is a ”Lichtfest” for us in the North. Daytime, at this time of year, lasts only a few hours, and the sun never really rises above the horizon. Christmas reminds us that lighter days will return, and that it’s time to look ahead to the year to come.
I have two hopes for software testing for 2014:

  1. I hope we will stop looking for simple explanations of why something failed: the product, the testing, the development.
  2. We cannot expect all managers to be testing experts, so we need better documented and qualified testing practices (in various contexts) in order to support better top management software testing decisions.

Looking back on 2013…

I had a busy 2013, privately as well as professionally. Let’s Test in May was fantastic! A few weeks later, I gave a successful lecture on Context Driven Testing in IDA-IT.
I have long wanted to link my favorite philosopher, Niels Bohr, to testing. Denmark celebrated the 100-year anniversary of Niels Bohr’s articles on the atomic model this year. Niels Bohr was a Nobel Prize-winning physicist, but more than anything he was a philosopher – my favorite philosopher by far.
My second favorite is Nassim Taleb. Taleb published his new book Antifragile in late 2012, and I read it this year. But it was his previous book, The Black Swan, that made me a fan.
In chapter 12 of The Black Swan, Taleb criticizes historicism: always wanting to find causes for why things happen. That happens a lot in testing too:

  • ”Why was that bug in the system!?”
  • ”Why didn’t test find it!?”
  • ”Who blundered!?”

Taleb points out that explaining an event is just as difficult as predicting the future. He argues that any logical deductions and computations involved in analysing an event will yield random results.
Good managers know that appreciating and handling a team’s frustration over something not going as planned is important, but we too often commit the error of turning a psychological healing process into a development system, mindlessly making up apparently deterministic explanations for the unexpected – the random.
Randomness and historicism
Randomness is actually two different things: (1) indeterministic, mathematical randomness, and (2) something that acts chaotically, but still according to deterministic laws.
The ”butterfly in India” is an example of a chaotic but deterministic chain of events: it is said that the beating of a butterfly’s wings in Delhi can cause a thunderstorm in North Carolina.
According to Newtonian and relativistic physics, determinism is a fundamental property of nature, but since most of the events involved in the forming of the thunderstorm are outside our reach, we won’t be able to reconstruct the event completely anyway.
This is perhaps where Taleb and Bohr might disagree, since Bohr did not believe in determinism as a fundamental property of nature.
With quantum physics, Bohr, Heisenberg, Pauli and other pioneers were able to show that events at the nuclear level do not follow rules of causality. An electron, for example, moves from one energy level to a lower one, releasing a photon, spontaneously.
”So what? We’re not living in the microcosmos. Butterflies don’t move electrons, they set complete molecules in motion. Causality should still apply on any observable level.”
This is a valid counterargument, but Bohr, in several of his philosophical writings, points out that the lack of causality at the subatomic level does in fact affect the macroscopic level: there are many amplification systems in nature which amplify single quanta of energy into macroscopic effects. One such is the human eye, which can detect a single photon and amplify it into a stream of information sent to the brain, where it can trigger actions. Obviously there are lots of such amplification systems in the brain and in our bodies, so maybe there’s no such thing as determinism in people? And in nature in general, for that matter.
Does having a bad childhood make someone bad?
So we’re essentially left with a world of repeatable patterns. Statisticians know that children of poor parents will usually be poor themselves. That is a well-known pattern, but does it work the other way too? Does a bad childhood make you a bad person?
Obviously not. The pattern cannot be linked to the individual per se.
But that doesn’t mean patterns aren’t useful: Patterns simplify reality, and simplification is necessary in all planning and management.
Many projects have contracts which are negotiated several years before the testers start. Such contracts often specify which kinds of testing should take place e.g. how acceptance testing should be carried out.
Now, we can’t expect all IT contract managers to be testing experts, but if we can document research evidence of the usefulness of e.g. exploratory testing, we’re much more likely to be able to convince them to use it constructively, even when they’re working on the early planning phases.
Happy 2014!


On Antifragility and Robustness

Some drinking glasses are very fragile, but fragility does not have to be a bad thing. I think most people would prefer a fragile but thin and beautiful champagne glass over a heavy, robust one. Thin glass just suits champagne better. A fat, fruity Barolo will go better in a thicker, more robust glass.
But both the thick, more robust glass and the thin, fragile glass share a property: there is no way you can make them stronger once they have been cast or blown. In fact, since glass is an amorphous solid material, it tends to become more fragile over time: vibrations of the molecules in the material will eventually distort the structure, making it weaker.
The diagram below illustrates how a glass subjected to daily use over a period of time can eventually break, when the force it is subjected to exceeds the ”breaking threshold” of the weakest point in the glass. The threshold is the maximum force the glass can take before it breaks, and in the case of the fragile champagne glass, the threshold decreases over time: the glass becomes more fragile with age.
A fragile glass subjected to forces in daily use eventually breaks
Now, let’s imagine that science discovers a way to strengthen the molecular bindings in amorphous materials by vibration. Let’s say we can somehow convert the kinetic energy in the vibrations of molecules into potential energy in the intramolecular bindings. The effect would be that we could make a type of glass which gets stronger with use.
The breaking threshold will now increase over time: The more the glass is bumped around during normal use, the stronger it will get. The diagram will look like this:
A drinking glass made out of a special antifragile material gains strength when used
Note that this does not mean that the glass has become unbreakable. The only thing that has changed is that the curve illustrating the strength of the glass now goes upwards instead of downwards, and the forces the glass is subjected to during daily use no longer cross that curve.
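As a toy illustration of the two diagrams – my own sketch, in arbitrary units, not taken from the original figures – the breaking threshold can be modelled as a curve that either decays with time or grows with use, checked against random daily forces:

```python
# A toy model (arbitrary units): a glass breaks when a daily force exceeds its
# breaking threshold; the threshold either decays (fragile) or grows (antifragile).
import random

def simulate(days, start_threshold, change_per_day, max_daily_force=8.0):
    threshold = start_threshold
    for day in range(1, days + 1):
        force = random.uniform(0, max_daily_force)  # everyday bumps and knocks
        if force > threshold:
            return f"breaks on day {day} (force {force:.1f} > threshold {threshold:.1f})"
        threshold += change_per_day                 # negative: weakens with age; positive: gains from use
    return f"survives {days} days, final threshold {threshold:.1f}"

random.seed(1)  # make the illustration repeatable
print("fragile glass:     ", simulate(days=3650, start_threshold=10.0, change_per_day=-0.01))
print("antifragile glass: ", simulate(days=3650, start_threshold=10.0, change_per_day=+0.01))
```

With a negative change per day the threshold eventually drops below the everyday forces and the glass breaks; with a positive one it climbs away from them, which is exactly what the second diagram shows.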
The term ‘antifragility’ is new, invented by Nassim Taleb and first used in a publication in the appendix to the second edition of his book The Black Swan.
Taleb says antifragility is a property of the organic and the complex. Human bones are antifragile: children’s bones are in fact quite fragile and grow stronger with use. But if you sit still (or live in weightlessness), they become fragile again. That does not mean that bones can’t break – of course they can. But like the imaginary antifragile glass above, the bones of a well-trained person are simply able to take a lot more “beating” before they do.
Dead things like champagne glasses, washing machines, computers and their software are inherently fragile, unless someone incorporates something which can turn the feedback from use into strength. Similarly, adding complex but still ordinary software to a system cannot make it antifragile, nor will any special kind of tool used in its development make either the project or the product antifragile.
Thanks to Jesper L. Ottosen for reviewing this blog post.

A robust glass and a fragile glass. Neither of these are antifragile.


Antifragility by Testing?

”There are two classes of things […] One class of things that gain from disorder, and one class of things that are harmed by disorder.”

Nassim Nicholas Taleb, author of the bestseller ”The Black Swan”, is out with a new book: ”Antifragile: How to live in a world we don’t understand”. He gave a lecture at the London School of Economics on December 6th, 2012, during his book tour. The lecture is available as a podcast here.

”Technology is inherently fragile.”

The words are Nassim Taleb’s, and the statement should not surprise any tester: testers can find bugs in even the best pieces of computer software. It is only a question of having useful testing heuristics, of how much effort we put in, and of observation skills, of course.

”In psychology, people talk about post traumatic disorder. But very few talk about post traumatic growth.”

I am a big fan of Nassim Taleb for his original philosophical thinking and his ability to think and speak clearly about subjects which are very complex and sometimes even counterintuitive.

Taleb has a lot to teach us in testing, and it is very obvious to me that fragility is something that we should start looking for.

”The difference between the cat and the washing machine […] is you need to fix [the washing machine] on every occasion. The organic self-heals.”

Computer systems do not self-heal – they are inherently fragile.

Nassim Nicholas Taleb (photo: Bloomberg)

But let’s step back for a moment, taking a broader look. Let’s look at the systems incorporating the computers and the software: Organizations using information technology to run their business, communities using IT to stay connected, factories with machinery, workers, managers and computers to run them. Can any of these systems be described as antifragile?

”You should never let an error go to waste.”

My question is ”Can testing be applied in such a way that it not only detects fragility, but instead facilitates the development of anti-fragility?”

I believe that the answer is yes. And yes: there are antifragile systems out there incorporating computers and IT.

Please consider a recent test activity you participated in. Now think about the system which was building the product you were testing (my apologies for using manufacturing terms here): the people, the project teams, the organization. Such a system is organized into layers, and while the bottom layer (where the technology is) is usually inherently fragile, some of the higher-level layers were perhaps antifragile?

This is where I see the role of testing coming in:

”It’s not about trial and error – it’s about trial and small error.”

In this very statement Nassim Taleb, in my humble opinion, speaks clearly about what testing is about. The antifragile system for developing products grows stronger when testers find problems, since not only will the system learn from experience; the antifragile system will prepare itself for things that are worse than what was experienced.

Put in another way: The antifragile software project does not just fix the bugs it encounters. The antifragile software project fundamentally eliminates problems based on knowledge from the small problems testers find.

So my message to project and program managers is this: don’t hire testers just to find the defects. Hire great testers to ensure your projects experience many small problems, allowing them to grow stronger and build better products. If your project systems are antifragile by structure, leadership and management, not only will the bugs found by testers not be found in production: the overall quality of the product will be better!

And that, to me, is where the real business value of testing is!

Thanks to Jesper L. Ottosen for very constructive reviewing and commenting on drafts of this blog post.


Turning up the Heat

– What are you doing?
– I’m turning up the heat. To see if anything catches fire. It’s an old CID trick.

Detective Chief Inspector Jack Frost in the TV series “A Touch of Frost”

In the follow-up post after my presentation at Let’s Test, I concluded that the next step in my work on black swan testing should be operationalisation. This post introduces a testing heuristic which I’ve successfully used myself. I call it “Turning up the Heat”.
The basic idea is that odd things happen when a system is put under pressure, and that we can learn stuff about the system by doing so.
There are basically three ways to do it:

  • Load testing, i.e. putting the system under exceptionally heavy load for a period
  • Soak testing, i.e. loading the system lightly for a long time
  • Destructive testing, e.g. crippling subsystems by disabling or forcing malfunctioning

They can of course be combined.
In the quote above, heat is a metaphor for the psychological pressure Jack Frost subjects suspected murderers to, and here ‘load’ is likewise a metaphor for any environmental change that can have an impact on the way the system works. It could be a high data load, e.g. 10 or 100 times the normal rate of requests to a service, but it could also be something quite different, e.g. a temperature in the server room 10 degrees higher than normal. The black swan domain is the systems domain, so any component in the complete system is a valid target for putting load on.
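As a hedged sketch of the first two bullets above – not a tool from this post, and both the endpoint URL and the numbers are made up – a simple load or soak run can be just a handful of lines of Python:

```python
# A minimal load/soak driver. TARGET_URL and all numbers are hypothetical; the
# 'requests' library is assumed. Only run this against environments you are
# allowed to load.
import concurrent.futures
import time
import requests

TARGET_URL = "http://test-env.example.local/api/ping"  # hypothetical test endpoint

def hit(_):
    """Send one request and return (status or error name, response time in seconds)."""
    start = time.monotonic()
    try:
        response = requests.get(TARGET_URL, timeout=10)
        return response.status_code, time.monotonic() - start
    except requests.RequestException as exc:
        return type(exc).__name__, time.monotonic() - start

def run(workers, requests_per_worker):
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(hit, range(workers * requests_per_worker)))
    slow = sum(1 for _, duration in results if duration > 1.0)
    print(f"{len(results)} requests sent, {slow} took longer than 1 second")

# Load test: many workers in a short burst. Soak test: few workers, kept going for hours.
run(workers=50, requests_per_worker=20)
```

The only real difference between a load test and a soak test here is the shape of the run: a big burst of workers for a short period versus a few workers kept going for a long time.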
Likewise, ‘crippling’ is a metaphor here for doing something to a subsystem that will cause it to work differently from normal: it could be as simple as just removing a component, or bugs could be deliberately introduced. In fact, the target doesn’t have to be one single subsystem: changing the same thing in several subsystems, e.g. compiling all software modules with a buggy version of some widely used library, can be a simple and efficient approach.
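Crippling can be sketched in a similarly small way. The decorator below – again my own illustration, and the wrapped function is hypothetical – randomly injects latency and failures into calls to a subsystem, which is one way of making it “work differently from normal”:

```python
# A minimal "crippling" sketch: a decorator that randomly injects latency and
# failures into calls to a subsystem. The wrapped function is hypothetical.
import functools
import random
import time

def cripple(failure_rate=0.2, max_extra_latency=2.0):
    """Make a call behave differently from normal: slower, and sometimes failing."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            time.sleep(random.uniform(0, max_extra_latency))  # injected latency
            if random.random() < failure_rate:                # injected failure
                raise RuntimeError("injected fault in " + fn.__name__)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@cripple(failure_rate=0.3)
def lookup_customer(customer_id):
    """Stands in for a real call to the subsystem we want to cripple."""
    return {"id": customer_id, "name": "Test Customer"}
```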
As testers, we often don’t have direct access to the tools needed to load and cripple subsystems and complete systems. I find that in order to practically “turn up the heat”, I often have to rely on the help of others, e.g. developers and system administrators. This leaves me with a communication and cooperation challenge which should not be taken lightly.
There are no right or wrong approaches in testing, but there is a risk of wasting time, i.e. spending time on preparations and never getting down to the actual testing – and thereby not learning. That’s one of the reasons I usually prefer simple techniques over planned approaches.
In one project I worked on, the testing tool we used had a simple load testing function. I managed to crash the test environment completely by just running the tool off my own PC, and this eventually gave us some important information about a vulnerable subsystem (the root cause of the crash was not what anyone expected when the system stopped responding). I spent less than an hour on this test – though getting the problem diagnosed and the environment recovered involved somewhat more work afterwards by the system admins, I’m afraid.
This actually brings me to another point related to cooperation: while load testing or soak testing is normally non-destructive, only do it if the project can afford to repair whatever might be affected by possible malfunctions of the system. That could include other testers not being able to complete their own testing activities!


The Next Problem in Black Swan Testing

The pervasive but intangible nature of Black Swans means that practical testing with the aim of demonstrating actual problems is probably either not going to give any useful evidence, or so many resources will be spent proving a point that one might be missing the point itself completely.
This was the fundamental problem I was facing when I finished reading Taleb’s book and wanted to apply his philosophy to actual testing. I realised that I needed a model, and the Skype incident of December 2010 led me in the right direction.
The model that I’ve come up with is that black swans are system errors. This may not be true in all circumstances, but it’s a good model and it’s helping me come up with solutions to the testing problems.
Unfortunately, treating black swans as system errors also means that instead of seeing Black Swan Testing as a practical testing activity, I’m moving it to a meta level, where the ‘root causes’ are of a more abstract nature and often not directly observable.
In my speech here at Let’s Test yesterday, I introduced three classes of system attributes and suggested that practical testing, with the aim of learning about potential black swan incidents in a system, should focus on these attributes:

  • Complex versus Linear Interactions
  • Tight versus Loose Couplings
  • Barriers

The first two come from the work of sociologist Charles Perrow, in particular his book Normal Accidents; the third one I owe to psychologist James Reason, author of Human Error. I’ll come back to these attributes in later blog posts, but for now you just have to accept them as system attributes that play a part in system errors and Black Swans.
But we’re at a conference with all sorts of things going on: My presentation was well received, the discussions afterwards were great, but Let’s not just talk… let’s do it, Let’s Test.
I think James Lyndsay and I got the idea at about the same time yesterday: Let’s take Black Swan Testing into The Test Lab.
So I did, and it was great. I had a great team of very brave testers, and the mission was clear: Find indications of Black Swans, look for tight couplings, complex interactions, and barriers.
Did we succeed? Not really. But it was loads of fun and we learned a lot!
In particular, I learned that while I think I have a very good idea of what Black Swan Testing is, I need to work on the practical aspects: Making useful charters, coaching and teaching testers efficiently on the subject, reporting… Black Swan testing must be communicated and operationalized.
That’s the next problem I’m going to address.

My very brave team of Black Swan Testers


Death by Virtual Memory

Every PC user knows that PCs become slower and slower over time, until the point where they are almost unusable. This is where upgrading the RAM will usually help – until eventually you have to buy a new PC. Apparently that’s just the way PCs wear out.
Actually, PCs don’t wear out – readers of my blog probably know that – but users (knowingly and unknowingly) add software which consumes resources, of which memory is the most important one.
RAM used to be very expensive and therefore a scarce resource. Programmers used to do all sorts of tricks to fit their increasingly complex programs in memory. To help them focus on the programming task and not worry too much about resource scarcity, operating system designers invented something called virtual memory or swap memory.
Swap memory allowed the operating system to remove running processes from the (expensive and therefore scarce) RAM and store the state of the process on disk (‘swap’ it out – hence the name), from where it could later be restored into RAM and resume running on the CPU. The technique is still employed by all modern operating systems, and while the amount of RAM has grown considerably, to a level where lack of it is usually not a problem, virtual memory techniques are still useful for long-running processes that only need to run once in a while and where it is not a problem if the initial response time is a second or more – and when they’re not running, the RAM can be used for caching file system data and other important things.
But what happens if the load increases, e.g. if the number of users grows or the system becomes otherwise loaded, and the processes running on the system start competing for memory? The good news is that functionally nothing changes: virtual memory is transparent to the process, so the code will execute just as it did before. The bad news is that execution time increases rapidly when real memory becomes exhausted and the OS has to start using virtual memory. If this only happens during nighttime, or at other times when users or external systems aren’t depending on the system, all is probably okay, but if not, you can be in real trouble. In fact, the problem can be so bad that the system becomes useless.
In fact, with much more RAM and larger programs in today’s computers, the relative performance penalty is much higher than it used to be. When the OS starts swapping, the amount of data that needs to be transferred to and from the hard disk(s) is probably a factor of 10 higher than it would have been, say, 10 years ago. In the same period, hard disk access speeds have only doubled, so roughly speaking the swap stalls are about five times longer (ten times the data at only twice the speed). Overall, the damage you risk when hitting the virtual memory “wall” is much higher now than it used to be.
An interesting factor which I have found useful to look for is that antivirus systems installed on your servers often make the virtual memory problem worse. They do so because they install hooks into the applications running on the system, monitoring all I/O. This monitoring performs well as long as the antivirus system can keep its database and code in memory, but when memory starvation starts occurring, it can turn into a really bad situation. How can we detect that situation (other than by the performance dropping)?
I’m not aware of any really useful tools that can sit in the background automatically detecting (or better: predicting) memory starvation problems on running servers or test systems. But there are ways to look for it: on Windows, I’ll be looking at the running processes, particularly focusing on the Page Faults Delta column, looking for processes consistently showing high numbers there.
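If you would rather script that check than watch a column by hand, here is a minimal sketch using the Python psutil library – my own, not a tool mentioned in this post, and the threshold is an arbitrary illustrative number. To my knowledge the cumulative num_page_faults counter is exposed on Windows; the getattr fallback simply skips processes where it isn’t available.

```python
# A minimal page-fault-delta monitor using psutil. SAMPLE_SECONDS and THRESHOLD
# are arbitrary; num_page_faults is read defensively since it is (to my
# knowledge) only exposed on Windows.
import time
import psutil

SAMPLE_SECONDS = 5
THRESHOLD = 10_000  # page faults per sample considered suspicious (illustrative)

def page_faults(proc):
    try:
        return getattr(proc.memory_info(), "num_page_faults", None)
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        return None

def snapshot():
    snap = {}
    for proc in psutil.process_iter(attrs=["pid", "name"]):
        faults = page_faults(proc)
        if faults is not None:
            snap[proc.info["pid"]] = (proc.info["name"], faults)
    return snap

while True:
    before = snapshot()
    time.sleep(SAMPLE_SECONDS)
    after = snapshot()
    for pid, (name, faults_now) in after.items():
        if pid in before:
            delta = faults_now - before[pid][1]
            if delta > THRESHOLD:
                print(f"{name} (pid {pid}): {delta} page faults in {SAMPLE_SECONDS}s")
    print(f"swap in use: {psutil.swap_memory().percent}%")
```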

This is an important performance testing subject. And one which is too often overlooked.


Peugeot's Black Swan at Le Mans 2010

This blog post is about motorsport. What does motorsport have to do with software engineering, you ask? Read on!
I’m a big fan of motorsport, and of the Le Mans 24 Hours in particular. Le Mans is a 24-hour motor race with about 50 race cars of four different classes competing in the same race. Le Mans is also a legend, first run in the 1920s. Running a race over 24 hours is very challenging for teams and machinery: an F1 race is only two hours, and the cars are only a bit faster. We’re 240,000 spectators, and about 40,000 Danes travel the 1,500 km to get there – including me and two of my boys – so it’s also a big, great party.

My son Aksel (very focused - and a bit tired from a long drive) at Le Mans 2010

 
But to me as an engineer, Le Mans is also intellectually inspiring. Le Mans is a reminder that while we can do a lot with technology, there’s also a lot that we can’t do and that the laws of physics will always set a limit on the track. In order to try to win, race car manufacturers and teams will constantly try to push that limit, but it will always be there.
When cars are withdrawn from an F1 race, it’s typically because of an accident – drivers making mistakes. At Le Mans, while driver mistakes are unavoidable over such a long race, withdrawals are actually more often due to technical reasons: the equipment breaking down, engines blowing up, or just electrical gremlins pulling the plug. The fascinating part is that it has been like this since the very beginning.
So failures are more or less expected: 50 cars at the start line, and usually only some 25 at the finish. But Le Mans 2010 was a little different: it was Peugeot’s ”Black Swan Year”.
The Peugeot 908s were again extremely fast, perfectly tuned, and ready to race. Audi had gone through a challenging development process with their new R15, which in 2009 had turned out not to be as fast as they had hoped, but it was improved for 2010, so we all thought 2010 would be the year Audi could compete with Peugeot on speed. But Peugeot again set impressive lap times never before seen at Le Mans. Couple that with the fact that their team finally seemed to be a well-working machine (proved by the 2009 overall win), and it seemed that Audi could only hope for a podium.
Until a conrod broke on the leading Peugeot at Tertre Rouge on Sunday morning. I was there with my camera, enjoying the early morning and the race, but I had left that area only 10 minutes before, so I didn’t get a chance to catch the action (aren’t you always in the wrong place at the right time?).
It came as a shock to everyone. I watched the TV pictures on the big screens around the track showing the team completely in shock about what had happened, and I looked down into the pit area where the Eurosport TV crew was trying to get comments from a team that seemed to be paralyzed. But it seemed to be a coincidence at the time. Until a few hours later, when another Peugeot failed in a similar way. We started wondering what was going on. And with only one hour remaining of the race, the customer-entered Peugeot 908 failed too, and the race was lost. Audi won 1-2-3 with their three R15+ cars.
It was devastating. The Peugeot Sport director was seen crying on TV. The French spectators and press went home early from the race. This was a nation losing a battle with their neighbors.
Of course we didn’t know the technical reason why all the Peugeots had failed in the race, but it seemed as if they had been ‘programmed’ to fail. About a month later, Peugeot released a statement that the three cars had suffered from the same failure, and that the fourth car (which retired before the others due to a broken suspension) would have suffered the same problem if it had still been running on Sunday. Peugeot said that the breakdowns came as a surprise – that they had tested the cars and engines and never expected this. I’m sure it was a surprise. I’m also sure their sports director didn’t expect this embarrassing disaster in front of a whole nation of supporters. I’m sure they thought everything was hunky-dory.
But at the same time, I have no doubt that the problem was rooted in history: an engineer somewhere knew that there was a risk but, for reasons probably rooted in groupthink and organisational behaviour, kept the knowledge to himself – or simply chose to ignore it. Conrods have failed in cars since the first reciprocating engine was built, but engineers have learnt to handle this, so today we have reliable engines that can easily do more than 300,000 km. When engineering has made something inherently unreliable reliable, people tend to forget about it. Management expects it to be under control.
This is true even for competition engines, even though they are of course pushed much harder and designed to be minimal and as light as possible in order to maximise power output: I’m sure Peugeot management thought the conrod supplier had everything under control, which they might have had – but they could have been working to meet the wrong specification. We won’t know the details, and it’s not important either.
To win Le Mans you have to be running at the end. The Peugeots weren’t. They obviously forgot what it takes to make something inherently unreliable reliable: it takes focus on what can possibly go wrong. Software is no different: when software fails, it’s often because someone forgot to raise an issue, or because someone chose to ignore it. Many disasters in systems are rooted in history, which also means that they could have been prevented.
This is where professional pessimists on a team can help – where testers’ negative attitude can mean the difference between success and failure.
For me, Peugeot’s black swan event at Le Mans 2010 is a reminder that we all share responsibility for seeking out and communicating these details: by testing, careful inspection, talking to developers and users, and by constantly focusing on problems. We’re on a mission to prevent disaster by making the risks known, so managers can make informed decisions.
Le Mans 2011 will be interesting in a new way, since the cars will be technologically different, with hybrid engines. This is new technology, so we should probably expect it to fail more at random, or just to affect performance during the race: longer pit stops and the like. But let’s see – it’s a big and long race. Anything can happen! I’ve got our tickets booked, so let me know if you’ll be at Le Mans in June – we can meet up and enjoy the cars. And perhaps discuss engineering and testing?
Leading Peugeot 908 caught by me and my Nikon on Sunday morning only about 30 minutes before it failed at the same location


Skype's first Black Swan

Skype was down for about 24 hours just before Christmas. Skype management is embarrassed and promises this will not happen again, which of course is true: the particular situation is now prevented. However, the question is: will Skype never be down again?
Skype’s CIO explains what went wrong in this post mortem of what I’d call Skype’s first Black Swan.
To summarise: a high load on Skype’s infrastructure triggered a bug in a certain version of the Windows client for Skype, which in turn increased the load on the infrastructure, rapidly taking the entire network down and making it almost impossible to get it up again.
The bug was always there, of course, and it was probably already known internally at Skype. It is also possible that the risk of server overloading and service degradation had been identified, but obviously not in a context that made a complete system crash seem a likely possibility (if it had been, they would have prevented it). Further, I’m quite certain that the risk of the client bug affecting the server load had not been identified. Humans are positive thinkers, as Taleb documents in his book The Black Swan: The Impact of the Highly Improbable.
So Skype’s challenge now is to prevent outages in general by identifying and preventing Black Swans in general. This will involve a cross-organisational, backwards-thinking process, which the innovation-driven company has probably not been focusing on at all until now. (I may actually be wrong here; Janus Friis, one of the founders of Skype, used to work in a support function at an ISP, so he may have been involved in preventing problems. But generally, startups think very positively, and even if Skype has millions of users, it’s still a very young company – a startup.)
One might think that this is going to be extremely expensive for Skype, since they will have to predict every possible way their system can go wrong. It does not have to be that expensive, although it will cost money.
When securing a nuclear facility, engineers don’t have to analyze every possible way a disaster can happen; instead they ask: how can we prevent failure at every level? This is what I mean by “backwards thinking” – start by assuming something is failing, then work backwards, identifying ways to prevent it from becoming worse.
This is done on multiple levels: on the component level, asking what can go wrong here and how we can prevent a bug or incident from affecting the rest of the system; and on the system level, assuming that disaster is happening and asking how we can prevent it from developing further.
I assume that’s what Skype is doing now.
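To give one concrete example of component-level backwards thinking – my own illustration, not Skype’s actual fix – a client that retries with exponential backoff and jitter stops itself from hammering an already struggling service, which is exactly the kind of feedback loop that took the network down:

```python
# A hedged illustration (not Skype's actual fix): retry with exponential backoff
# and jitter, so clients back away from a struggling service instead of
# amplifying the load on it.
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Run 'operation'; on failure, wait exponentially longer (with jitter) before retrying."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                             # give up and surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # jitter spreads the clients out in time

# Hypothetical usage: call_with_backoff(lambda: sign_in_to_network(credentials))
```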
Wishing everyone a happy 2011!