
Testing: Introduction


Hello, and welcome to the first episode of the Software Carpentry lecture on testing. In this episode, we'll look at what testing can actually do, and start exploring how you can go about it systematically.

Nobody actually enjoys testing software.

So if you'd like to skip this lecture, you can, provided that:

Your programs always work correctly the first time you run them, or

You don't actually care whether they're doing the right thing or not, as long as their output looks plausible.

You can also skip this lecture if you enjoy wasting time and taking longer to get things to work.

You might not think that has much to do with testing, but study after study has shown that the more you invest up front in quality, the sooner your program will be ready to use. Just as in manufacturing and medicine, slowing down a little is the best way to speed things up a lot.

Testing actually serves two purposes.

It tells you whether your program is doing what it's supposed to do.

But if it's done right, it will also tell you what your program actually is supposed to be doing.

Tests are runnable specifications of a program's behavior.

Unlike design documents or comments in the code, you can actually run your tests, so it's harder for them to fall out of sync with the program's actual behavior. In well-run projects, tests also act as examples to show newcomers how the code should be used, and how it's supposed to behave under different circumstances. We'll explore this idea in detail in a later episode.

Before we go on, though, it's important to understand that there's a lot more to software quality than testing. Testing doesn't create quality, it measures it.

As Steve McConnell said, trying to improve the quality of software by doing more testing is like trying to lose weight by weighing yourself more often.

But a good set of tests will help you track down bugs more quickly, which in turn speeds up development.

It's also important to understand that testing can only do so much. For example, suppose you're testing a function that compares two 7-digit phone numbers.

There are 10^7 such numbers…

…which means that there are 10^14 possible test cases for your function.

At a million tests per second, that's 10^8 seconds of testing, or more than three years, to run them all.

And that's only one simple function: exhaustively testing a real program with hundreds or thousands of functions, each taking half a dozen arguments, would take many times longer than the expected lifetime of the universe.

And how would you actually write 10^14 tests? More importantly, how would you check that the tests themselves were all correct?

In reality, "all" that testing can do is show that there might be a problem in a piece of code. If testing doesn't find a failure, there could still be bugs lurking there that we just didn't find. And if testing says there is a problem, it could well be a problem with the test rather than the program.

So why test? Because it's one of those cases where something that shouldn't work in theory is surprisingly effective in practice. It's a lot like mathematics: the proof of any theorem might contain a flaw that simply hasn't been noticed yet, but somehow we manage to make progress.

The obstacle to testing isn't actually whether or not it's useful, but whether or not it's easy to do. If it isn't, people will always find excuses to do something else.

It's therefore important to make things as painless as possible. In particular, it has to be easy for people to:

add or change tests

understand the tests that have already been written

run those tests

and understand those tests' results.

And test results must be reliable.

If a testing tool says that code is working when it's not, or reports problems when there actually aren't any, people will lose faith in it and stop using it.

Let's start with the simplest kind of testing. A unit test is a test that exercises one component, or unit, in a program.

Every unit test has five parts. The first is the fixture

…which is the thing the test is run on, such as the inputs to a function, or some data files to be processed.

The second part is the action

…which is what we do to the fixture. Ideally, this just involves calling a function, but some tests may involve more.

The third part of every unit test is its expected result

…which is what we expect the piece of code we're testing to do or return. If we don't know the expected result, we can't tell whether the test passed or failed. As we'll see in a couple of episodes, defining fixtures and expected results can be a good way to design software.

The first three parts of the unit test are used over and over again. The fourth part is the actual result

…which is what happens when we run the test on a particular day, with a particular version of our software.

The fifth and final part of our test is a report

…which tells us whether the test passed, or whether there's a failure of some kind that needs human attention. As with the actual result, this could be different each time we run the test.
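Before moving on, here's a minimal sketch in Python that labels those five parts on a trivial test of the built-in function max; the names and values are purely illustrative.

    # Fixture: the data the test will be run on.
    values = [1, 5, 3]

    # Action: what we do to the fixture.
    actual = max(values)

    # Expected result: what we believe the action should produce.
    expected = 5

    # Report: compare the actual result with the expected one and tell a human.
    if actual == expected:
        print('pass')
    else:
        print('fail: expected', expected, 'but got', actual)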

So much for terminology: what does this all look like in practice? Suppose we're testing a function called dna_starts_with.

It returns True if its second argument is a prefix of the first, i.e., if one sequence starts with another.

And it returns False otherwise.

For example, 'actggt' does start with 'act'

…but not with 'ctg'.
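The function's body isn't shown in this episode, but a minimal sketch of an implementation might look like this:

    def dna_starts_with(sequence, prefix):
        '''Return True if the DNA sequence begins with the given prefix.'''
        return sequence[:len(prefix)] == prefix

    # dna_starts_with('actggt', 'act') returns True
    # dna_starts_with('actggt', 'ctg') returns False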

We'll build a simple set of tests for this function from scratch to introduce some key ideas.

Later, we'll introduce a Python library that handles the things that are done the same way each time.

Let's start by testing our code directly using assert. Here, we call the function four times with different arguments, checking that the right value is returned each time.
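For example, the four checks might look something like this (the exact argument values are just illustrative):

    # Check dna_starts_with directly, using the sketch of the function above.
    assert dna_starts_with('a', 'a')
    assert dna_starts_with('at', 'a')
    assert dna_starts_with('at', 'at')
    assert not dna_starts_with('at', 't')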

This is much better than nothing, but it has several shortcomings.

First, there's a lot of repeated code: only a fraction of what's on each line is unique and interesting.

That repetition makes it easy to overlook things, like the not used to check that the last test returns False instead of True.

This code also stops at the first failure: if any of the tests doesn't produce the expected result, the assert statement will halt the program. It would be more helpful if we could get data from all of our tests every time they're run, since the more information we have, the faster we're likely to be able to track down bugs.

Here's a different approach. It requires a bit more typing to set up, but after that, it makes testing a lot easier. First, let's put the inputs and output of each test in a table. The first two entries in each row are the arguments to our function, and the third is what the function should return for those arguments.
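Using the same illustrative values as before, the table might look like this:

    Tests = [
        # sequence, prefix, expected result
        ('a',  'a',  True),
        ('at', 'a',  True),
        ('at', 'at', True),
        ('at', 't',  False),
    ]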

Right away, this is easier to read than line after line of asserts and function calls.

It's also easy to add new tests: just insert a line with the right values.

Of course, those tests won't run themselves, so here are five lines of Python (plus a comment) to do that. This code simply loops over the entries in the table, calling the function with the arguments provided, and counting how many times the function returned the right result. When the loop finishes, this code prints out a summary to tell us how many of our tests passed.
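A sketch of that code, assuming the Tests table above, might be:

    # Run every test, counting how many produce the expected result.
    passes = 0
    for (seq, prefix, expected) in Tests:
        if dna_starts_with(seq, prefix) == expected:
            passes += 1
    print(passes, 'tests passed out of', len(Tests))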

This is better than the previous pile of assert statements and function calls because no runnable code has to be copied to add a new test. That makes the pattern in our tests clearer, and reduces the chances of us introducing a bug by copying and pasting incorrectly. This code also runs all of our tests every time, so we always get a complete picture of how well we did.

However, if any of our tests fail, this code won't tell us which ones. If we had a hundred tests, and two were failing, figuring out which two would take some time.

This slightly modified version of our code solves that problem.
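A sketch of that modified version, again assuming the Tests table above, might look like this:

    # Run every test, reporting the index of each one that fails.
    passes = 0
    for (i, (seq, prefix, expected)) in enumerate(Tests):
        if dna_starts_with(seq, prefix) == expected:
            passes += 1
        else:
            print('test', i, 'failed')
    print(passes, 'tests passed out of', len(Tests))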

The built-in function enumerate takes a list (or any other sequence) as an argument, and produces one pair for each entry in that list. The first half of each pair is an element index, and the second is the element itself.

In our case, the elements of Tests have three parts. We can extract the index and those three parts in a single step as shown here. The first time through the loop, i will be assigned 0, while seq, prefix, and expected will be assigned the three parts of our test. The next time through the loop, i will be assigned 1, and so on.

The two lines that call dna_starts_with, check its result, and increment the counter of successful tests are exactly the same as before.

So is the line after the loop that summarizes how many tests passed.

But these two lines are new. If a test fails, we immediately print out its index to make it easy to find in the Tests table.

This pattern—creating fixtures, acting on them, and collecting and reporting results—is the heart of almost all testing tools.

Many good libraries have been written in many languages to help programmers write tests that follow this pattern.

We'll look at one such library for Python in a couple of episodes.

First though, we'll have a look at how you should go about handling errors in your programs.

Thank you.