Find us on GitHub

# Testing: Floating Point Hello, and welcome to the fifth episode of the Software Carpentry lecture on testing. In this episode, we’ll take a look at how computers represent real numbers, and some of the traps waiting for anyone who uses them. Let’s go back to the fields of Saskatchewan. Instead of working directly with pixels in photographs… …we would probably actually work with some kind of geographic information system that would store each field’s coordinates and ground cover. Those coordinates would probably be recorded as latitude and longitude. Which the computer would store as floating point numbers rather than integers. And that’s where our problems start. To understand, we have to step back and look at how computers store numbers. Finding a good representation for floating point numbers is hard. The root of the problem is that we cannot represent an infinite number of real values with a finite set of bit patterns. And unlike integers, no matter what values we do represent, there will be an infinite number of values between each of them that we can’t. The explanation that follows is simplified—possibly over-simplified—to keep it manageable. Please, if you’re doing any calculation on a computer at all, take half an hour to read Goldberg’s excellent paper, “What Every Computer Scientist Should Know About Floating-Point Arithmetic”. So: floating point numbers are usually represented using sign, magnitude, and an exponent. In a 32-bit word, the IEEE 754 standard calls for 1 bit of sign, 23 bits for the magnitude (or mantissa), and 8 bits for the exponent. To illustrate the problems with floating point, we’ll use a much dumber representation. We’ll only worry about positive values without fractional parts. And to make it all fit on our medias, we’ll only use 5 bits: 3 for the magnitude, and 2 for the exponent. Here are the possible values (in decimal) that we can represent this way. Each one is the mantissa times two to the exponent. For example, the decimal values 48 is binary 110 times 2 to the binary 11 power. Which is 6 times 2 to the third… …or 6 times 8. Of course, real floating point representations like the IEEE 754 standard don’t have all the redundancy that you see in this table. Again, if you’d like to know how they get rid of it, and what implications this has, please read Goldberg’s paper. Here’s a clearer view of the values our scheme can represent. The first thing you should notice is that there are a lot of values we can’t store. We can do 8 and 10, for example, but not 9. This is exactly like the problems hand calculators have with fractions like 1/3: in decimal, we have to round that to 0.3333 or 0.3334. But if this scheme has no representation for 9… …then 8+1 must be stored as either 8 or 10. Which raises an interesting question: if 8+1 is 8, what is 8+1+1? If we add from the left, 8+1 is 8, plus another 1 is 8 again. If we add from the right, though, 1+1 is 2, and 2+8 is 10. Changing the order of operations can make the difference between right and wrong. This is the sort of thing that numerical analysts spend their time on. In this case, if we sort the values we’re adding, then add from smallest to largest, it gives us the best chance of getting the best possible answer. In other situations, like inverting a matrix, the rules are much, much more complicated—so complicated that we’ll skip past them completely.  The spacing between the values we can represent is uneven. However, the relative spacing between each set of values stays the same, i.e., the first group is separated by 1, then the separation becomes 2, then 4, then 8, so that the ratio of the spacing to the values stays roughly constant. This happens because we’re multiplying the same fixed set of mantissas by ever-larger exponents, and it points us at a couple of useful definitions. The absolute error in some approximation to a value is simply the absolute value of the difference between the two. The relative error, on the other hand, is the ratio of the absolute error to the value we’re approximating. This example might help. If we’re off by 1 in approximating 8+1 and 56+1, that’s the same absolute error, but the relative error is much larger in the first case than in the second. When we’re thinking about floating point numbers, relative error is almost always more useful than absolute—after all, it makes little sense to say that we’re off by a hundredth when the value in question is a billionth. What does this have to do with testing? Well, let’s have a look at a little program The loop runs over the integers from 1 to 9 inclusive. Using those values, we create the numbers 0.9, 0.09, 0.009, and so on, and put them in the list `vals`. (Remember, the double star means “raised to the power” in Python.) We then calculate the sum of those numbers. Clearly, this should be 0.9, 0.99, 0.999, and so on. But is it? Let’s calculate the same value a different way, by subtracting .1 from 1, then subtracting .01 from 1, and so on. This should create exactly the same sequence of numbers. Let’s check by finding the difference and printing it out. Here’s our answer. The first column is the loop index; the second, what we got by totaling up 0.9, 0.99, and so on, and the third is the difference between that number and what we got by subtracting from 1.0. The first thing you should notice is that the very first value contributing to our sum is already slightly off. Even with 23 bits for a mantissa, we cannot exactly represent 0.9 in base 2, any more than we can exactly represent 1/3 in base 10. Doubling the size of the mantissa would reduce the error, but we can’t ever eliminate it. The good news is, 9 times 10-1 and 1 minus 0.1 are exactly the same—it might not be precisely right, but at least it’s consistent. The same cannot be said for some of the later values, though: at some points, there actually is a small difference between what we get from our two different formulas. And if you weren’t confused enough, sometimes the accumulation of addition errors actually makes the result more accurate once again. So what does this have to do with testing? Well, if the function you’re testing uses floating point numbers, what do you compare its result to? If we compared the sum of the first few numbers in `vals` to what it’s supposed to be, the answer could be `False`. Even if we’re initializing the list with the right values… …and calculating the sum correctly. This is a hard problem—a very hard problem. No one has a good generic answer, because the root of our problem is that we’re using approximations, and each approximation has to be judged on its own merits. So what can you do to test your programs? The first rule is, compare what you get to analytic solutions whenever you can. For example, if you’re looking at the behavior of drops of liquid helium, start by checking your program’s output on a stationary spherical drop in zero gravity. You should be able to calculate the right answer in that case, and if your program doesn’t work for it, it probably won’t work for anything else. The second rule is to compare more complex versions of your code to simpler ones. If you’re about to replace a simple algorithm for calculating heat transfer with one that’s more complex, but hopefully faster, don’t throw the old code away: use its output as a check on the correctness of the new code. And if you bump into someone at a conference who has a program that can calculate some of the same results as yours, swap data sets: it’ll help you both. The third rule is, never use `==` (or `!=`) on floating point numbers, because two numbers calculated in different ways will probably not have exactly the same bits. (It’s OK to use less than, greater than or equals, and other orderings, though—we’ll explore why in the exercises.) If you want to compare floating point numbers, use a function like `nose.assert_almost_equals`. As its documentation says: This checks whether two numbers are within some tolerance of each other. If they are, it declares them equal. And if you thought to wonder whether that tolerance is absolute or relative, you’re ready to start testing floating-point code. Thank you.