Program Design: Testing

Hello, and welcome to the next episode of the Software Carpentry lecture on program design using invasion percolation as an example. In this episode, we'll take a closer look at how we test our invasion percolation program.

If you recall, in an earlier episode we found one bug…

…which makes us wonder, how many others are still lurking in our code?

More generally, how do we validate and verify a program like this?

Verification is the question, "Is our program free of bugs?" I.e., have we built it the right way?

Validation is the question, "Have we built the right program?" I.e., are we using a good model?

The second is a question for the scientists…

…so we'll concentrate on the first.

This is the first test case that we want to try. The grid shown on the left has 2's everywhere…

…except for three 1's that run in a straight line from the middle directly to the edge.

It should fill in as shown here.

And if it doesn't, it should be pretty easy for us to figure out what's gone wrong.

We restructured our program as shown here in order to make it easier to construct and run test cases.

If we take a closer look at the main body of the program, the first command-line argument specifies the scenario. If the name of the scenario is "random", then we get parameters, create a grid, fill it with random values, fill that grid to the edge, and report. Otherwise, we just say that we don't know what the scenario is.

Let's add another clause, so that if the scenario is "5×5 line", we will create a 5×5 grid, fill a line of cells from the center to the edge with lower values, and then check that fill_grid does the right thing.
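To make that dispatch concrete, here is a minimal sketch of what the main body might look like. The two scenario functions are placeholders that only return a description of what they would do, and the names run_random_scenario, run_5x5_line_test, and the scenario string "5x5_line" are assumptions made for illustration, not the names used on the slides.

```python
import sys


def run_random_scenario(arguments):
    """Placeholder: get parameters, create a random grid, fill it, and report."""
    return "random scenario with arguments {}".format(arguments)


def run_5x5_line_test():
    """Placeholder: build the 5x5 line grid, fill it, and check the result."""
    return "5x5 line test"


def main(arguments):
    """Choose what to do based on the scenario name (the first argument)."""
    if not arguments:
        return "error: no scenario specified"
    scenario = arguments[0]
    if scenario == "random":
        return run_random_scenario(arguments[1:])
    elif scenario == "5x5_line":  # assumed spelling of the "5×5 line" scenario
        return run_5x5_line_test()
    else:
        return "error: unknown scenario {!r}".format(scenario)


if __name__ == "__main__":
    print(main(sys.argv[1:]))
```

Each new test scenario adds one more clause here, which is part of why we will want a more compact way to describe tests.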

Let's expand that English description into a few lines of code. We want to create a 5×5 grid, initialize it with the values shown earlier (i.e., 2's everywhere except for 1's from the center to the edge), call the fill_grid function that we're testing, and then check that we get the right result.

The grid creation and fill_grid functions already exist—they're part of our regular program.

So we need to write two functions: one to initialize the 5×5 grid with the values that we need for the test, and one to check after filling that the grid has been filled correctly.

We're going to have to write a similar pair of functions for each of our tests.

We'll write the first pair, and then use that experience to guide us when we refactor to make it easier to add more tests later.

Here's the function that initializes a grid of N×N cells with a line running from the center to the edge.

It's just as easy to write this function for the N×N case as for the 5×5 case, so we generalize early.

This part of the function is easy to understand. We find the value of N by looking at the grid, and then fill all of the cells with the integer 2.

This part, which fills the cells from the center to the edge in a straight line with the lower value 1, isn't as easy to understand. It's not immediately obvious that i should go in the range from 0 to N/2+1, nor that the X coordinate should be N/2 and the Y coordinate should be i for the cells that we want to fill.

When we say "it's not obvious," what we mean is, "There's the possibility that it will contain bugs." If there are bugs in our test cases, then we're just making more work for ourselves.

We'll refactor this code later so that it's easier for us to see that it's doing the right thing.
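For reference, the initialization function described above might be written roughly as follows. This is a sketch rather than the code on the slide: it assumes the grid is a list of lists indexed as grid[x][y], the name init_grid_NxN_line is our own choice, and it uses Python 3's integer division operator // where the narration says N/2.

```python
def init_grid_NxN_line(grid):
    """Fill an N×N grid with 2's, then put 1's in a line from the center to the edge."""
    N = len(grid)

    # The easy part: every cell starts out holding the value 2.
    for x in range(N):
        for y in range(N):
            grid[x][y] = 2

    # The less obvious part: cells (N//2, 0) through (N//2, N//2) get the value 1.
    for i in range(N // 2 + 1):
        grid[N // 2][i] = 1


# Usage: build an empty 5×5 grid and initialize it in place.
grid = [[0] * 5 for _ in range(5)]
init_grid_NxN_line(grid)
```

For N equal to 5, this puts 1's in grid[2][0], grid[2][1], and grid[2][2], and 2's everywhere else.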

Here's the code that checks that an N×N grid with a line of cells from the center to the edge has been filled correctly.

Again, it's as easy to check for the N×N case as the 5×5 case, so we've generalized the function.

But take a look at this condition. Are we sure that the only cells that should be filled are the ones with X coordinate equal to N/2 and Y coordinate from 0 to N/2? Shouldn't that be N/2+1? Or maybe it's 1 to N/2.

Or maybe the X coordinate should be N/2+1.

In fact, these two functions are correct…

…and when they're run, they report that fill_grid behaves properly.
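A checking function along the lines just described could look like this. It is a sketch under assumptions: filled cells are marked with a constant FILLED equal to -1, the grid is indexed as grid[x][y], fill_grid returns the number of cells it filled, and the name check_grid_NxN_line is our own.

```python
FILLED = -1  # assumed marker value for a filled cell


def check_grid_NxN_line(grid, num_filled):
    """Check that exactly the line from the center to the edge was filled."""
    N = len(grid)
    assert num_filled == N // 2 + 1, "Wrong number filled"
    for x in range(N):
        for y in range(N):
            if x == N // 2 and y <= N // 2:
                # Cells on the line must have been filled.
                assert grid[x][y] == FILLED, "Not filled!"
            else:
                # Every other cell must have been left alone.
                assert grid[x][y] != FILLED, "Wrongly filled!"
```

Notice that the condition in the middle of the loop repeats the index arithmetic from the initialization function, so a mistake made in one is easy to repeat in the other.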

But writing and checking two functions like this for each test won't actually increase our confidence in our program…

…because the tests themselves might contain bugs.

We need a simpler way to create and check tests, so that our testing is actually helping us create a correct program rather than giving us more things to worry about. How do we do that?

Well, let's go back to our example. The grid on the left should fill in as shown on the right.

Why don't we just draw our test cases exactly as shown? The reason is that modern programming languages, including Python, don't actually let you draw things. But we can get close with a little bit of work.

Here are the values that we want to put in our test grid: 2's everywhere, except for 1's from the center to the edge. We've represented it as a multiline string, which is easy to read…

…and also easy to write…

…which means it'll be easy for us to create lots of other test cases. We won't have to write code: we can just write strings.
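As an illustration, the fixture could be written like this. The variable name FIXTURE_5X5_LINE is an assumption, as is the layout: each text line is one row of the grid, with values separated by spaces.

```python
# 2's everywhere, except for 1's running from the center to the edge.
FIXTURE_5X5_LINE = """\
2 2 2 2 2
2 2 2 2 2
1 1 1 2 2
2 2 2 2 2
2 2 2 2 2"""

# Quick shape check: five lines, each with five whitespace-separated values.
lines = FIXTURE_5X5_LINE.splitlines()
assert len(lines) == 5
assert all(len(line.split()) == 5 for line in lines)
```

The backslash after the opening quotes only suppresses the leading newline, so the first row of the grid is the first line of the string.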

The word "fixture" is the technical term for "the thing that the test is run on". It's the thing you set up in order to check whether a piece of code is working. We'll see this term a lot more in future lectures.

Here's the result that we expect when we fill in this grid. Again, it's a multiline string, so it's easy to write, and easy to read.

The '*' character means "this cell should be filled".

The '.' character means "this cell should hold whatever value it had at the start", i.e., it shouldn't have changed.
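Written out, the expected result for the 5×5 line case might look like the string below. The name RESULT_5X5_LINE and the one-row-per-line layout are assumptions for illustration.

```python
# '*' marks a cell that should be filled; '.' marks a cell that should be unchanged.
RESULT_5X5_LINE = """\
. . . . .
. . . . .
* * * . .
. . . . .
. . . . ."""

# The string also tells us how many cells we expect to be filled.
expected_num_filled = RESULT_5X5_LINE.count("*")
expected_unchanged = RESULT_5X5_LINE.count(".")
```

Here expected_num_filled is 3 and expected_unchanged is 22, which together account for all 25 cells.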

Here's how we would actually use these two strings in test code.

First, we're going to put the strings holding our fixtures and the expected results in a list of pairs. We can then loop over this list to check each fixture and result in turn. Again, this makes it very easy to add more tests: we just define two multiline strings, and then add one more pair to this list called TESTS.

We write a function called run_tests, and as the doc string says, it runs all of our tests at once.

Inside the loop, we get fixture and result…

…we use the values in fixture to initialize a grid by breaking that multiline string into pieces and converting those pieces into integers.

We then call the fill_grid function that we actually want to test…

…and then we take the actual result (which is in grid), the number of cells that were filled, the initial fixture, and the expected result, and pass them all to a function that checks that everything is right. We only have to write create_fixture_grid and check_result_grid once.

Doing that is left as an exercise for the viewer.
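If you would like something to compare your own attempt against, here is one possible sketch of the whole arrangement. Several things in it are assumptions: filled cells are marked with FILLED equal to -1, each line of a string is one row of the grid, the helper parse and the 3×3 test pair are our own additions, and fill_grid is a deliberately simple stand-in (it breaks ties by position rather than randomly) so that the example runs on its own.

```python
FILLED = -1  # assumed marker value for a filled cell

FIXTURE_5X5 = """\
2 2 2 2 2
2 2 2 2 2
1 1 1 2 2
2 2 2 2 2
2 2 2 2 2"""

RESULT_5X5 = """\
. . . . .
. . . . .
* * * . .
. . . . .
. . . . ."""

# A second, hypothetical test: adding it took two strings and one list entry.
FIXTURE_3X3 = """\
2 2 2
1 1 2
2 2 2"""

RESULT_3X3 = """\
. . .
* * .
. . ."""

TESTS = [
    (FIXTURE_5X5, RESULT_5X5),
    (FIXTURE_3X3, RESULT_3X3),
]


def parse(text):
    """Split a multiline string into rows of whitespace-separated tokens."""
    return [line.split() for line in text.strip().splitlines()]


def create_fixture_grid(fixture):
    """Build a grid of integers from a fixture string."""
    return [[int(token) for token in row] for row in parse(fixture)]


def check_result_grid(grid, num_filled, fixture, result):
    """Compare a filled grid against the expected '*' and '.' pattern."""
    original = create_fixture_grid(fixture)
    assert num_filled == result.count("*"), "Wrong number filled"
    for x, row in enumerate(parse(result)):
        for y, mark in enumerate(row):
            if mark == "*":
                assert grid[x][y] == FILLED, "Not filled at {}".format((x, y))
            else:
                assert grid[x][y] == original[x][y], "Changed at {}".format((x, y))


def fill_grid(grid):
    """Stand-in filler: fill the center, then keep filling the lowest-valued
    neighbor of the filled region until a cell on the edge has been filled."""
    N = len(grid)
    x = y = N // 2
    grid[x][y] = FILLED
    num_filled = 1
    while 0 < x < N - 1 and 0 < y < N - 1:  # last filled cell is not on the edge
        candidates = []
        for i in range(N):
            for j in range(N):
                if grid[i][j] == FILLED:
                    continue
                neighbors = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
                if any(0 <= a < N and 0 <= b < N and grid[a][b] == FILLED
                       for a, b in neighbors):
                    candidates.append((grid[i][j], i, j))
        _, x, y = min(candidates)  # lowest value wins; ties go to the lowest position
        grid[x][y] = FILLED
        num_filled += 1
    return num_filled


def run_tests():
    """Run all tests."""
    for fixture, result in TESTS:
        grid = create_fixture_grid(fixture)
        num_filled = fill_grid(grid)
        check_result_grid(grid, num_filled, fixture, result)


run_tests()
```

If every test passes, run_tests finishes silently; if one fails, the assertion message says which cell was wrong. The checking function reuses create_fixture_grid to recover the original values, which is why it needs the fixture as well as the expected result.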

Describing the fixtures and the results as strings is easy, but writing those two new functions might seem like a lot of work. The question is, when you say it's a lot of work…

…what are you comparing it to?

Are you comparing it to the time it would take to inspect printouts of real grids, or step through the program over and over again in the debugger?

And did you think to include the time it would take to re-do this after every change to your program?

Or are you comparing it to the time it would take to retract a published paper after you find a bug in your code? Because that's what we're trying to prevent.

In real applications, it's not unusual for test code to be anywhere from 20% to 200% of the size of the actual application code.

And yes, 200% does mean more test code than application code.

But that's no different from physical experiments. If you look at the size and cost of the machines used to create a space probe, it's many times greater than the size and cost of the space probe itself.

The good news is that there are frameworks to help you do this…

…and we will look at those in future lectures.

The other good news is, once your tests have been written, changing the program itself becomes much easier. In particular, we're now in a position to replace our fill_grid function with one that is harder to get right, but which will run many times faster. If our tests have been designed well, they shouldn't have to be rewritten because they'll all continue to work the same way. This is a common pattern in scientific programming. You create a simple version first, check it, and then replace the parts one by one with more sophisticated parts that are harder to check, but give you better performance.

We'll take a look at how to do this in the next episode.

Thank you.