Hello, and welcome to the second episode of the Software Carpentry lecture on sets and dictionaries. In this episode, we're going to take a look at a little bit of computer science theory—just enough to help you understand how sets work, and what their limitations are.
Let's start by trying an experiment.
Create a set, add "a string" to it, and print the set. As expected, the string is in the set.
Now let's try adding a list to the same set.
Whoops: why doesn't that work?
And what does that word "unhashable" mean?
In order to understand what's going on, we have to take a look at how sets are stored in a computer's memory.
We could use a list.
To create a set, we'd just create an empty list.
To check whether a value is in the set, we would then loop over the values in that list, returning True
as soon as we found the one we were looking for, or False
if we hit the end of the list without finding it.
Adding a value to a set would work the same way: we'd loop over the values already in the list and return right away if the new value was already there, or append it to the end of the list if we didn't find it.
But how efficient would this implementation be?
With N items in the set, these versions of in
and add
would take between 1 and N steps to run, depending on how quickly they found the item in question (if at all).
With a bit of handwaving, we can say that the "average" time to do something with an N-element set would be N/2 steps.
If we change the way we store sets, we can do a lot better—in fact, we can get the time down to a constant, no matter how big the set is.
But to get this speedup, we have to accept a few constraints on our program.
Let's start with something simple: storing a set of integers.
If the range of possible values in the set is small, and fixed, we could just use a list of Booleans.
Here, True
at index i
means "the integer 'i' is in the set", and False
means it isn't.
But what if we want to store any integer at all? It isn't practical to create a list of 232 Booleans.
The solution is to use one of the great inventions of computer science, called a hash table. This is just a list of some fixed length, which we'll call L.
When we want to store an integer I, we calculate I mod L, and put it there.
Remember, '%' in Python is mod, the remainder operator. I mod L is always in the range from 0 to L minus 1, which is concidentally the range of legal indices for a list of length L, so this calculation always produces a valid index.
Here's an example. Our list has five slots, with indices from 0 to 4. Since 3378 mod 5 is 3, we store 3378 in the third place in our list. 1625 goes in slot zero (since 1625 mod 5 is 0), and 101 goes in slot 1.
So far, so good: to add or find a value, we do one simple bit of arithmetic, and look in exactly one place, regardless of how big the set is.
But what do we do when there's a collision?
For example, if we want to add 206 to our set, it ought to go in position 1, but that's already occupied by 101.
There are basically two ways to handle this. The first is to search forward from the location the value is supposed to be in until we find an empty slot, and store the value there.
The second is to store a list of values in each slot.
Each approach has pros and cons that we won't go into here, and each works well enough until the hash table is about 3/4 full.
After that, the time to look up values, or insert new ones, starts to climb rapidly.
When this happens, we can get the time back down to a constant (or close to a constant) by enlarging the table.
For example, here's our size-5 table with three values in it.
Adding another element would take us over the magic 3/4 threshold…
…so instead we'll resize the hash table so that it has nine slots…
…and then insert 206. In essence, we're spending memory to save time, a tradeoff that comes up again and again in program design.
All right, we can store integers. What about strings? Where should they go?
To find out, we'll use a hash function to generate an integer index from the characters in the string. Our function will always produce the same value for any particular string, which means we'll always know where to look for that string in our hash table.
Let's start with the string "zebra".
It consists of the five characters 'z', 'e', 'b', 'r', and 'a'.
Each character is stored in memory as a small integer: 97 for lower-case 'a', 98 for lower-case 'b', and so on up to 122 for lower-case 'z'.
We can add up these integers to produce a number that will be the same for every copy of the string "zebra".
And once we have this integer, we can use mod as before to figure out where in the hash table the string should be stored.
In general, if we can define a hash function for something—i.e., if we can figure out how to turn that "something" into an integer—we can store it in a hash table.
But this only works as long as nothing changes behind our backs, which is why we got that strange "unhashable" error message at the start of this episode.
Here's a picture of what's in memory when we put "zebra" in a hash table. The string's hash code is 532, and the hash table's length is 5, so since 532 mod 5 is 2, we put a reference to the string in location 2 in the hash table.
Let's see what happens if we try to put a list in the hash table instead.
We'll start with a list containing the same five characters 'z', 'e', 'b', 'r', and 'a'.
Our list is represented in memory like this…
…so we add up the characters' values, take the remainder mod 5…
…and put a reference to the list in location 2 in the hash table.
This is what's in memory when we're done.
What happens if we now change the values in the list?
We start with this…
…then change the first character in the list from 'z' to 's'.
If we recalculate the hash code, this list should be put in location 0…
…but it's actually still in location 2. Our list is the wrong place!
This is bad—very bad. If we go looking for the list ['s','e','b','r','a']
…
…we'll look in location 0 rather than location 2, and get the result False
when we should get True
.
And if we go looking for the original list ['z','e','b','r','a']
…
…we'll look in slot 2, and either get the wrong answer True
…
…or blow up completely, since there's a value there, but it's not the one we asked about.
This problem arises with any mutable structure, i.e., anything whose contents or value can be changed after its creation. Integers and strings are safe, since their values are fixed, but the whole point of lists is that we can grow them, shrink them, and overwrite their contents. What should we do?
One option would be to have each list keep track of the sets that it is in, and move itself whenever its values change.
However, this would be very expensive: every time our program touched a list, Python would have to recalculate its hash code and update all the sets that held references to the list.
Option number two is to shrug our shoulders and say, "It's the programmer's fault." This is what most languages do…
…but it's also very expensive. This time, though, the time that's wasted is the programmer's time, tracking down and fixing bugs.
Python uses a third option. It only allows programmers to put immutable values in sets.
After all, if something's value can't change, neither can its hash code, or its location in a hash table.
In practice, this turns out to be a fairly minor restriction: occasionally annoying, but never a show-stopper.
But if we can't store lists in sets, what do we do with values that naturally have several parts, like people's names?
Again, there are several options. The first is to concatenate those values somehow.
For example, if we want to store "Charles" and "Darwin", we'd create the string "Charles|Darwin", and store that.
We have to use a character like '|' instead of something more natural, like a space, because if we join "Paul Antoine" and "St. Cyr" using a space, there would be three possible ways to split it apart again.
Concatenating values is actually a pretty bad idea. First, we have to find a concatenator that can never come up in our data—essentially, make a bet on what's going to happen in the future.
Second, our code will wind up being littered with string joins and string splits, which will make it slower and harder to read.
The second option—the right one—is to use tuples instead of lists.
A tuple is just an immutable list…
…i.e., a sequence of values that cannot be changed after its creation.
Tuples are created exactly like lists…
…except we use parentheses instead of square brackets.
They are indexed the same way, too, and functions like len
do exactly what you'd expect.
But you cannot assign a new value to a tuple element, i.e., you cannot change the tuple after it has been created.
This means that a tuple's hash code never changes, and that means that tuples can be put in sets. We'll see other uses of tuples in later lectures.
Let's step back for a moment. This lecture has been about the "science" in computer science.
Things like the design of hash tables…
…and the tradeoffs between mutability, usability, and performance.
It's a lot to digest in one sitting…
…but sometimes the only way to understand why things work the way they do is to understand the theory they're based on.
We promise to get back to practicalities in the next episode.