As a teenager, I spent hours reading datasheets for CPU components like the AM2901 bit-slice processor, and I designed hypothetical CPUs in block diagrams. My objective was a 16-bit CPU with a 20 MHz clock, and the instruction set was to be small and beautiful. My first computer was based on a National Semiconductor SC/MP processor, which had a very nice and simple instruction set, and I’ve always liked things which are simple, beautiful and logical.
I never did build my own CPU, but at university, I spent hours playing with the old RC4000 mainframe we had in the electronics club. It was (is!) a 24 bit machine from the late 60’s, a very nice and advanced design. Its concurrent programming system was among the first in the world (appearing at about the same time as Unix was developed in the US), and my father, who used to work at Regnecentralen, gave me one of the original books about the system, which I read with great interest. It was a message passing based system, and I think the design had a few drawbacks, but the beauty of it was fascinating.
Yesterday I saw my old friend again. She now resides in rooms belonging to a club of computer history enthusiasts. I visited the club with my 14-year-old son Frederik.
The RC4000 still works, but the enthusiasts are more focused now on its predecessor: A GIER computer, which has recently been brought back to life. The first GIER was delivered on July 31st 1961, so the design is turning 50 years old this year. About 50 of these machines were built and the club owns two: One early model and a late model. It is the late model which they are bringing back to life.
There are very few test programs for GIER, and this makes the repair process a bit more complicated. Poul-Henning Kamp, the author of the Varnish Cache, who is one of the active enthusiasts in the club, mentioned that it was probably because the designers found test program development to be too boring. These guys were innovators! Poul-Henning has unit tests for Varnish with great code coverage, but admits that writing tests is not the fun part of that project.
I used to program for a living and I can agree with this. Unit testing isn’t fun!
The smell of the old computers, their physical size, the noise they make, and their sheer lack of computing power (by today’s standards) suggest that these machines are of a different era. Technology has evolved a lot, but people are still people. And the people matter – the technology is just a means to an end.
I still love working hands-on with technology, but my view on IT has grown a lot wider since I was young.
Acceptance testing is a key method in Agile. One way of defining acceptance tests is Gojko Adzic’s ”Specification by example” paradigm, which has gained quite a bit of momentum lately. I personally found it both refreshing and nice when I heard him present it at Agile Testing Days 2009, and I also found his book Bridging the Communication Gap a nice read.
I’m sceptical of the concept of acceptance testing. Not because verification of agreed functionality is not a good thing, but because it tends to shift attention to verification instead of exploration.
Specification by example shifts attention from testing to problem prevention. Is that bad, you ask? Isn’t it better to prevent problems than to discover them?
Well, most people think ”why didn’t I prevent this from happening?” when problems do happen. Feelings of regret are natural in that situation and that feeling can lead you into thinking you should improve your problem prevention. And maybe you should, but more examples aren’t going to do it!
Real testing is still necessary.
To explain why, I’ll consult one of the great early 20th century mathematical philosophers: Kurt Gödel – in particular his first incompleteness theorem. It says that no consistent system of axioms whose theorems can be listed by an ”effective procedure”, and which is capable of expressing basic arithmetic, can prove all truths about the natural numbers.
What does this mean to us?
It means that we will never be able to list everything that can be done with a particular system and its data.
A specification is a kind of listing of ”valid things to do” with data, so Gödel’s theorem teaches us that there is infinitely more to a system than any list of requirements, however long, can capture. This also applies when the requirements are listed as examples.
If you’re in the business of delivering products of only ”agreed quality” to a customer, you can be all right verifying only the things which are explicitly agreed. If something goes wrong you can always claim: ”It wasn’t in the specification!”
But if you’re striving for quality in a broader sense, verifying that the system works according to specifications is never going to be enough.
Gojko has made a good contribution to agile. Examples can be useful and efficient communication tools, and if they are used correctly they can help make users and other interested parties better aware of what’s going on on the other side. His contribution can help bridge a communication gap. It can also produce excellent input for automated unit tests.
Just don’t let it consume your precious testing time. The real testing goes far beyond verification of documented requirements!
If you want to learn more about this, I recommend you sign up for one of the Rapid Software Testing courses offered by James Bach and Michael Bolton.
Covering test coverage
Rolf Østergaard @rolfostergaard suggested on Twitter, when I posted my previous blog post, that instead of counting defects and tests we take a look at test coverage. Certainly!
Mathematically, coverage is the size of an area fully contained in another area, relative to the size of that containing area. We could calculate the water coverage of the Earth, or even how much of a floor a carpet covers. Coverage can be expressed as a percentage.
But coverage is also a qualitative term. For example a book can cover a subject, or a piece of clothing can give proper (or improper!) body coverage.
So what is test coverage? Well, the term is often used to somehow describe how much of a system’s functionality is covered by testing.
Numbers are powerful and popular with some people, so a quantified coverage number would be nice to have. One such number is code coverage, which is calculated by dividing the number of code lines which have been executed at least once by the total number of code lines in a program.
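To make the arithmetic concrete, here is a minimal sketch (the class name and the numbers are made up for illustration; real coverage tools such as JaCoCo or Cobertura collect the executed-line counts for you):

```java
// Minimal sketch of the line-coverage arithmetic described above.
// The numbers below are made up; a real coverage tool collects the
// executed-line counts automatically.
public class CoverageSketch {

    static double lineCoveragePercent(int executedLines, int totalLines) {
        if (totalLines == 0) {
            throw new IllegalArgumentException("program has no lines");
        }
        return 100.0 * executedLines / totalLines;
    }

    public static void main(String[] args) {
        // e.g. the tests touched 1,200 of 12,000 lines: 10% code coverage
        System.out.printf("Code coverage: %.1f%%%n", lineCoveragePercent(1200, 12000));
    }
}
```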
Another measurement relies on the business requirements for the system being registered and numbered, and tests mapped to the requirements which they test. A suite of tests can then be said to cover a certain number of requirements.
Numbers can hint at something interesting. E.g. if your unit tests exercise only 10% of the code, and it tends to be the same 10% in all of them, chances are that something important is missing from the unit tests – or you could even have a lot of dead legacy code. It would be similar if you found that you actually only tested functionality in a few of the documented business requirements: Could the uncovered requirements be just noise?
No matter what, a coverage number can only give hints. It cannot give certainty.
Let’s imagine we can make a drawing of the functionality of a system, like a map. Everything on the map would be intended functionality; everything outside it would be unwanted. Let’s make another simplification and imagine for the moment that the map is the system, not just an image of it. Here is an example of such a simple system:
The blue area is the system. The red spots are checks carried out as part of testing. Some of the checks are within the system, others are outside it. The ones within are expected to pass, the ones outside are expected to fail.
Note that there is no way to calculate the test coverage of this imaginary system. Firstly, because the area outside the system is infinite, and we can’t calculate the coverage of an infinite area. Secondly, because the checks don’t have an area – they are merely points – so any coverage calculation over them comes out as zero.
Ah, you may argue, my tests aren’t composed of points but are scripts: They are linear!
Actually, a script is not a linear entity, it’s just a connected sequence of verification points. But even if it were linear, it wouldn’t have an area: Lines are one-dimensional.
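For the mathematically inclined, the same point can be sketched in measure-theory terms (just an illustration of the argument above; S and T are made-up symbols for the system region and the set of checks):

```latex
% S: the region of the plane representing the system.
% T: the set of checks - a finite collection of points, or even of
%    one-dimensional script paths.
\[
  \mathrm{coverage}(T, S) \;=\; \frac{\lambda(T \cap S)}{\lambda(S)}
\]
% Points and curves have two-dimensional (Lebesgue) measure \lambda(T) = 0,
% so the ratio is 0 for any system region S of positive area. For the area
% outside the system, \lambda(\mathbb{R}^2 \setminus S) = \infty, and no
% meaningful percentage can be formed over an infinite region at all.
```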
But my system is not a continuous entity, it is discrete and consists only of the features listed in the requirements document.
Well that’s an interesting point.
The problem is that testing which considers only documented requirements will never consider all functionality. Think about the 2.2250738585072012e-308 problem in Java string-to-float conversion. I’m certain that no requirements document for a system implemented in Java ever listed this number as a specifically valid (or invalid) entry in input fields or on external integrations. The documents probably just said the system should accept floats in certain fields. However, a program which stops responding because it enters an infinite loop is obviously not acceptable.
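For the curious, the offending input was an ordinary-looking conversion like the one below. On unpatched Java runtimes from early 2011 this call was reported to loop forever (CVE-2010-4476); on a patched runtime it simply returns a tiny denormalized number – the sketch is only here to show how innocent the code looks:

```java
// Sketch of the infamous conversion. On unpatched Java runtimes from early
// 2011 this parse was reported to hang in an infinite loop (CVE-2010-4476);
// on patched runtimes it just returns a denormalized double.
public class FloatBugDemo {
    public static void main(String[] args) {
        String input = "2.2250738585072012e-308";
        double d = Double.parseDouble(input); // the call that used to hang
        System.out.println("Parsed value: " + d);
    }
}
```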
A requirement document is always incomplete. It describes how you hope the system will work, yet there’s more to a system than can be explicitly described by a requirements document.
Thus any testing relying explicitly on documented requirements cannot be complete – or have a correctly calculated coverage.
My message to Rolf Østergaard is this: If a tester makes a coverage analysis of what he has done, remember that no matter how the coverage is measured, any quantified value will only give hints about the testing. And if he reports 100% coverage and looks satisfied, I strongly suggest you start looking into what kind of testing he has actually done. It will probably be flawed.
Intelligent testing assists those who are responsible for quality in finding out how a system is actually working, it doesn’t assure quality.
Thanks to Darren McMillan for helpful review of this post.
Michael posted the following two comments on Twitter shortly after I published this post:
There’s nothing wrong with using numbers to add colour or warrant to a story. Problems start when numbers *become* the story.
Just as the map is not the territory, the numbers are not the story. I don’t think we are in opposition there.
I agree, we’re not in opposition. Consider this post an elaboration of a different perspective – inspired by Michael’s tweets.
Michael Bolton posted some thought-provoking tweets over the last few days:
Trying to measure quality into a product is like measuring height into a basket ball player.
Counting yesterday’s passing test cases is as relevant to the project as counting yesterday’s good weather is to the picnic
Counting test cases is like counting stories in today’s newspaper: the number tells you *nothing* you need to know.
Michael is a Tester with a capital T, and he is correct. But in this blog post, I’ll be in opposition to Michael. Not to prove that he’s wrong, not out of disrespect, but to make the point that while counting will not make us happy (or good testers), it can be a useful activity.
Numbers illustrate things about reality. They can also illustrate something about the state of a project.
A number can be a very bold statement with a lot of impact. The following (made up) statement illustrates this: The test team executed 50 test cases and reported 40 defects. The defect reporting trend did not decline over time. We estimate there’s an 80% probability that there are still unfound critical defects in the system.
80%? Where did that come from? And what are critical bugs?
Actually, the exact number is not that important. Probabilities are often not correct at all, but people have learnt to relate the word “probability” to a certain meaning telling us something about a possible future (200 years ago it had a static meaning, by the way, but that’s another story).
But that’s okay: If this statement represents my gut feeling as a tester, then it’s my obligation to communicate it to my manager so he can use it to make an informed decision about whether it’s safe to release the product to production now.
After all, my manager depends on me as a tester to take these decisions. If he disagrees with me and says ”oh, but only few of the defects you found are really critical”, then that’s fine with me – he may have a much better view of what’s important with this product than I have as a test consultant – and in any case: he’s taking the responsibility. And if he rejects the statement, we can go through the testing and the issues we found together. I’ll be happy to do so. But often managers are too busy to do that.
Communicating test results in detail is usually easy, but assisting a project manager making a quality assessment is really difficult. The fundamental problem is that as testers, by the time we’ve finished our testing, we have only turned known unknowns into known knowns. The yet unknown unknowns are still left for future discovery.
Test leadership is to a large extent about leading testers into the unknown, mapping it as we go along, discovering as much of it as possible. Testers find previously unknown knowledge. A talented ”information digger” can also contribute by turning ”forgotten unknowns” into known unknowns. (I’ll get around to defining ”forgotten unknowns” in a forthcoming blog entry; for now you’ll have to believe that it’s something real.)
Counting won’t help much there. In fact it could lead us off the discovery path and into a state of false comfort, which will lead to missing discoveries.
But when I have an important message which I need to communicate rapidly, I count!
Death by Virtual Memory
Every PC user knows that PCs become slower and slower over time, until the point where they are almost unusable. This is where upgrading RAM will usually help – until eventually you have to buy a new PC. Apparently that’s just the way PCs wear out.
Actually, PCs don’t wear out – readers of my blog probably know that – but users (knowingly and unknowingly) add software which consumes resources, of which memory is the most important one.
RAM used to be very expensive and therefore a scarce resource. Programmers used to do all sorts of tricks to fit their increasingly complex programs in memory. To help them focus on the programming task and not worry too much about resource scarcity, operating system designers invented something called virtual memory or swap memory.
Swap memory allowed the operating system to remove running processes from the (expensive and therefore scarce) RAM and store the state of the process on disk (‘swap’ it out – hence the name), from where it could later be restored into RAM and start running on the CPU again. The technique is still employed by all modern operating systems, and while the amount of RAM has grown considerably, to a level where lack of it is usually not a problem, virtual memory techniques are still useful for long-running processes that only need to run once in a while and where an initial response time of a second or more is not a problem – and while they’re not running, the RAM can be used for caching file system data and other important things.
But what happens if load increases, e.g. if the number of users grows or the system becomes otherwise loaded, and the processes running on the system start competing for memory? The good news is that functionally nothing changes: Virtual memory is transparent to the process, so the code will execute the same as it did before. The bad news is that execution time increases rapidly when real memory becomes exhausted and the OS has to start using virtual memory. If this only happens during nighttime, or at other times when users or external systems aren’t depending on the system, all is probably okay, but if not, you can be in real trouble. In fact, the problem can be so bad that the system becomes useless.
With much more RAM and larger programs in today’s computers, the relative performance penalty is much higher than it used to be. When the OS starts swapping, the amount of data that needs to be transferred to and from the hard disk(s) is probably a factor of 10 higher than it would have been, say, 10 years ago. In the same period, hard disk access speeds have only doubled, so overall, the damage you risk when hitting the virtual memory ”wall” is much greater now than it used to be.
An interesting factor which I have found useful to look for is that anti-virus systems installed on your servers often make the virtual memory problem worse. They do so because they install hooks into the applications running on the system, monitoring all I/O. This monitoring performs well as long as the anti-virus system can keep its database and code in memory, but when memory starvation starts occurring, it can turn into a really bad situation. How can we detect that situation (other than by noticing performance dropping)?
I’m not aware of any really useful tools that can sit in the background automatically detecting (or better: predicting) memory starvation problems on running servers or test systems. But there are ways to look for it: On Windows, I check the running processes, particularly the Page Faults Delta column, watching for processes that consistently show high numbers here:
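Beyond eyeballing that column, the same signal can be sampled from a script. This is only a rough sketch (it assumes the standard Windows typeperf tool is available and uses the per-process ”Page Faults/sec” counter, which also counts soft faults – so high values are a hint, not proof of swapping):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Rough sketch: sample the per-process "Page Faults/sec" counter a few times
// via the standard Windows typeperf tool and print the raw CSV output.
// Processes that consistently show high values deserve a closer look -
// remember the counter includes soft faults, so it is only a hint.
public class PageFaultWatch {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
                "typeperf", "\\Process(*)\\Page Faults/sec", "-sc", "5");
        pb.redirectErrorStream(true);
        Process p = pb.start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // CSV: timestamp plus one column per process
            }
        }
        p.waitFor();
    }
}
```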
This is an important performance testing subject. And one which is too often overlooked.
I came across this on a web site today:
I think among the reasons [insert product name] works so well is that it was developed not by endless betas and adding too many features requested by wankers who do nothing but play on their computers, but instead by people who understand the importance of simplicity and usability.
[Insert product name] has exactly what I need in a simple enough form that I can figure it out.
The product in question is Apple’s Aperture, but that’s not important. What matters is that Ken Rockwell (who wrote this) says he “even dream[s] in Photoshop”, so coming from him it’s a big statement.
This is about shifting focus from technology to product value. Looking at myself in the mirror, I hope I won’t see a “wanker” who makes computer systems just for the fun of playing with technology. But I know the feeling. After all, it was the passion for technology that started my computer career almost 30 years ago. In that respect, I hope I’ve grown up a bit! 😉
I had the opportunity of being the teacher for my son Frederik’s 7th grade class today at Kvikmarken. The school had decided to send the ”real” teachers to a workshop together, so the job of teaching the children was left to volunteering parents. I’m enthusiastic about learning, so I felt obliged to volunteer.
I gave them a crash course in software testing.
These boys and girls are really smart. They quickly grasped what software testing is about: Exploring and learning. Remember, they have been brought up with laptops, mobile phones, and the internet, so they’ve learned to cope with all the shortcomings and bugs that come with today’s technology. I could have expected them to be blind to bugs. But no, they do see them – and they are splendid explorers and learners.
I started my lesson with a flashback to 40-50 years ago, when computers were as big as a classroom and software was written on note paper and stored on punched tape. I compared this with today’s state-of-the-art commodity computer: An iPad, smaller than a school book – and a lot more powerful than the computers of that time.
Complexity has increased by a factor of at least one million – and it shows! (Though I should add that the evolution of software development tools and methods has also improved a good deal.)
They now know what a bug is, why there are bugs in software, what testing is about and how testing is a learning and discovery process. They also have an idea of why we’re testing – they particularly liked this one: We test because it feels good to break stuff!
Finally, they had a chance to prove their collective exploration abilities in a testing exercise. They did splendidly!
The last slide of my presentation contained a quote from James Bach on Twitter yesterday, words of a true tester on bug hunt using his senses:
Say “it looks bad” and I hear you.
Say “it smells bad” and I taste BLOOD.
I’m happy that Frederik’s classmates liked my lesson: ”Great presentation, thanks!”, ”Wow, testing is fun!”, ”You’ve got a really cool job!”
I think there’s good reason to expect software engineering to improve a lot when these boys and girls get to be the ones responsible for it!
This blog post is about motorsport. What does motorsport have to do with software engineering, you ask? Read on!
I’m a big fan of motorsport, and of the Le Mans 24 Hours in particular. Le Mans is a 24-hour motor race with about 50 race cars of four different classes competing in the same race. Le Mans is also a legend, first run in the 1920s. Running a race over 24 hours is very challenging for teams and machinery – an F1 race is only 2 hours, and the cars are only a bit faster. We’re 240,000 spectators, and about 40,000 Danes travel the 1500 km to get there – including me and two of my boys – so it’s also a big, great party.
But to me as an engineer, Le Mans is also intellectually inspiring. Le Mans is a reminder that while we can do a lot with technology, there’s also a lot that we can’t do and that the laws of physics will always set a limit on the track. In order to try to win, race car manufacturers and teams will constantly try to push that limit, but it will always be there.
When cars are withdrawn from an F1 race, it’s typically because of an accident – drivers making mistakes. At Le Mans, while driver mistakes are unavoidable over such a long race, withdrawals are more commonly due to technical reasons: Equipment breaking down, engines blowing up, or just electrical gremlins pulling the plug. The fascinating part is that it has been like this since the very beginning.
So failures are more or less expected. 50 cars at the start line, and usually only some 25 at the finish. But Le Mans 2010 was a little different: It was Peugeot’s ”Black Swan Year”.
The Peugeot 908s were again extremely fast, perfectly tuned, and ready to race. Audi had gone through a challenging development process with their new R15, which turned out not to be as fast as they had hoped in 2009, but it was improved for 2010, so we all thought 2010 would be the year where Audi could compete with Peugeot on speed. But Peugeot again set impressive lap times never before seen at Le Mans. Couple that with the fact that their team finally seemed to be a well working machine (proven by the 2009 overall win), and it seemed that Audi could only hope for a podium.
Until a conrod broke on the leading Peugeot at Tertre Rouge on Sunday morning. I was there with my camera, enjoying the early morning and the race, but I had left that area only 10 minutes before, so I didn’t have a chance to catch the action (aren’t you always in the wrong place at the right time?).
It came as a shock to everyone. I watched the TV pictures on the big screens around the track showing the team completely in shock about what had happened, and I looked down into the pit area where the Eurosport TV crew was trying to get comments from a team which seemed paralyzed. At the time it seemed to be a coincidence. Until a few hours later, when another Peugeot failed in a similar way, and we started wondering what was going on. And with only one hour remaining of the race, the customer-entered Peugeot 908 failed too, and the race was lost. Audi won 1-2-3 with their three R15+ cars.
It was devastating. The Peugeot Sport director was seen crying on TV. The French spectators and press went home early from the race. This was a nation losing a battle with their neighbors.
Of course we didn’t know the technical reason why all the Peugeots had failed in the race, but it seemed as if they had been ‘programmed’ to fail. About a month later, Peugeot released a statement that the three cars had suffered the same failure, and that the fourth car (which retired before the others due to a broken suspension) would have suffered the same problem if it had still been running on Sunday. Peugeot said that the breakdowns came as a surprise: They had tested the cars and engines and never expected this. I’m sure it was a surprise. I’m also sure their sports director didn’t expect this embarrassing disaster in front of a whole nation of supporters. I’m sure they thought everything was hunky-dory.
But at the same time, I have no doubt that the problem was rooted in history: That an engineer somewhere knew there was a risk but, for reasons probably rooted in groupthink and organisational behaviour, kept the knowledge to himself – or simply chose to ignore it. Conrods have failed ever since the first reciprocating engine was built, but engineers have learnt to handle this, so today we have reliable engines that can easily do more than 300,000 km. When engineering has made something inherently unreliable reliable, people tend to forget about it. Management expects it to be under control.
This is true even for competition engines, even though they are of course pushed much harder and designed to be minimal and as light as possible in order to maximise power output: I’m sure Peugeot management thought the conrod supplier had everything under control – which they might have had, but they could have been working to the wrong specification. We won’t know the details, and they are not important either.
To win Le Mans you have to be running at the end. The Peugeots weren’t. They obviously forgot what it takes to make something inherently unreliable reliable: It takes focus on what can possibly go wrong. Software is no different: When software fails, it’s often also because someone forgot to raise an issue, or because someone chose to ignore it. Many disasters in systems are rooted in history, which also means that they could have been prevented.
This is where professional pessimists on a team can help. Where testers’ negative attitude can mean the difference between success and failure.
For me, Peugeot’s black swan event at Le Mans 2010 is a reminder that we all share responsibility for seeking out and communicating these details: By testing, careful inspection, talking to developers and users, and by constantly focusing on problems. We’re on a mission to prevent disaster by making the risks known, so managers can take informed decisions.
Le Mans 2011 will be interesting in a new way, since the cars will be technologically different with hybrid engines. This is new technology, so we should probably expect it to fail more at random, or just to affect performance during the race: Longer pit stops and the like. But let’s see – it’s a big and long race. Anything can happen! I’ve got our tickets booked, so let me know if you’ll be at Le Mans in June – then we can meet up and enjoy the cars. And perhaps discuss engineering and testing?
Skype's first Black Swan
Skype was out for about 24 hours just before Christmas. Skype management is embarrassed and promises this will not happen again, which of course is true: The particular situation is now prevented. However, the question is: Will Skype never be out again?
Skype’s CIO explains what went wrong in this post mortem of what I’d call Skype’s first Black Swan.
To summarise, a high load on Skype’s infrastructure triggered a bug in a certain version of the Windows client for Skype, which in turn increased the load on the infrastructure, rapidly taking the entire network down and making it almost impossible to get it up again.
The bug was always there of course, and it was probably already known internally at Skype. It is also possible that the risk of server overloading and service degradation had been identified, but obviously not in the context of making a complete system crash a likely possibility (if so, they would have prevented it). Further, I’m quite certain that the risk of the client bug affecting the server load had not been identified. Humans are positive thinkers, as Taleb documents in his book The Black Swan: The Impact of the Highly Improbable.
So Skype’s challenge now is to prevent outages in general by identifying and preventing Black Swans in general. This will involve a cross-organisational, backwards-thinking process, which the innovation-driven company has probably not been focusing on at all until now. (I may actually be wrong here: Janus Friis, one of the founders of Skype, used to work in a support function at an ISP, so he may have been involved in preventing problems. But generally, startups think very positively, and even if Skype has millions of users, it’s still a very young company – a startup.)
One may think that this is going to be extremely expensive for Skype since they will have to predict every possible way their system can go wrong. It does not have to be that expensive, although it will cost money.
When securing a nuclear facility, engineers don’t have to analyze every possible way a disaster can happen; instead they ask: How can we prevent failure at every level? This is what I mean by ”backwards thinking” – start by assuming something is failing, then work backwards, identifying ways to prevent it from becoming worse.
This is done on multiple levels: On the component level, asking what can go wrong here and how we can prevent a bug or incident from affecting the rest of the system. And on the system level, assuming that disaster is happening, how can we prevent it from developing?
I assume that’s what Skype is doing now.
Wishing everyone a happy 2011!
Bohr on testing
When Niels Bohr and his team of geniuses at his institute in Copenhagen developed quantum physics, they fundamentally changed science by proving wrong an assumption dating very far back in science: That everything happens for a reason. In science, building and verifying models to predict events based on knowledge of variables is an important activity, and while this is still meaningful in certain situations, quantum mechanics proved that on the microscopic level of atoms and particles, things don’t happen for a reason. Put another way: You can have effects without a cause. In fact, effects don’t have causes as such at this level.
This means that in particle physics, you cannot predict precisely what’s going to happen, even if you know all variables of a system. Well in fact you can, but only if you’re prepared to give up knowing anything about when and where the effect will happen.
This is counterintuitive to our everyday understanding of how the world works. But there’s more: According to quantum physics, it is impossible to separate knowledge of the variables of a system from the system itself. The observation of the system is always part of the system, and thus changes the system in an unpredictable way.
If you find this to be confusing, don’t be embarrassed. Even Einstein never accepted this lack of causality.
Bohr was a great scientist, but he was also a great philosopher. He did a lot of thinking about what this lack of causality, and the inseparability of observation from events, could teach us about our understanding of nature. On several occasions he pointed out that even on the macroscopic level, we cannot ignore what is happening on the atomic and particle level. First of all because quantum physics did away with causality as a fundamental principle, but also because quantum effects are in fact visible to us in our daily macroscopic life: He used the example of the eye being able to react to stimuli as small as a single photon, and argued that it is very likely that the entire organism contains other such amplification systems where microscopic events have macroscopic effects. In some of his philosophical essays he points out how psychology and quantum mechanics follow similar patterns of logic.
So does testing. In software testing we are working to find out how a computer system works. Computers are algorithmic machines designed in such a way that randomness is eliminated and data can be measured (read) without affecting the data, but the programs are written by humans and are used by humans, so the system in which the computer program is used is both complex and inherently unpredictable.
We’re also affecting what we’re testing. Not by affecting the computer system itself, but by affecting the development of the software by discovering facts about the system and how it works in relation to other systems and the users.
In some schools of software testing, the activity is reduced to a predictable one: Some advocate having ”a single point of truth” about what is going to be developed in an iteration, and that tests should verify that the implementation is correct – nothing more. They believe that it is possible to assemble ”all information” about a system before development starts, and that any information not present is not a requirement and as such should not be part of the delivery.
That is an incorrect approach to software engineering and to testing in particular. Testing is much more than verification of implementation, and the results of testing are as unpredictable as the development process itself is. We must also remember that it is fundamentally impossible to collect all requirements about a product: We can increase the probability of getting a well working system by collecting as much information as possible about how the system should work and how it is actually working (by testing), and comparing the two, but the information will always be fundamentally incomplete.
Fortunately, that is not because we’re stupid: It is consistent with quantum physics.
Studying the fundamental mechanisms of nature can lead to a better understanding of what we are working with as software engineers and as software testers in particular.