Hello, and welcome to the shorter version of the second episode of the Software Carpentry lecture on sets and dictionaries. In this episode, we're going to explain why you cannot put a list in a set, and show you what you should do instead.
Let's start by trying an experiment.
Create a set, add "a string" to it, and print the set. As expected, the string is in the set.
Now let's try adding a list to the same set.
Whoops: why doesn't that work?
And what does that word "unhashable" mean?
In order to understand what's going on, we have to take a look at how sets are stored in a computer's memory.
When you create a set, the computer allocates a blob of memory to store references to the set's elements.
When you add something to the set, or try to look something up, the computer uses a hash function to figure out where to look. A hash function is any function that turns data values into a single integer that can be used as an index into another data structure.
For example, let's take a look at how a string is stored in a set.
We'll start with the string "zebra".
The string has five letters: 'z', 'e', 'b', 'r', and 'a'.
Each character is stored in memory as a small integer: 97 for lower-case 'a', 98 for lower-case 'b', and so on up to 122 for lower-case 'z'.
We can add up these integers to produce a number that will be the same for every copy of the string. This is our hash function.
Once we have this integer, we can use it to figure out where in the hash table the string should be stored.
Now let's take a look at how a list would be stored, if we were allowed to store lists in sets.
For the sake of this example, let's assume that the list contains the same five characters.
So it's represented like this.
For our hash function, we can add up the characters' values as before.
The final picture of what's in memory looks similar to what we had when we stored a string.
But what happens if we now change the contents of the list?
Suppose, for example, that we change the first letter from 'z' to 's'.
The hash function's value is now 523 instead of 532.
Which means that the modified list belongs in a different place in the set.
This is bad—very bad. If we now ask, "Is the list containing 's', 'e', 'b', 'r', and 'a' in the set S?" the answer will be "no", because the reference to the list isn't stored in the location that our hash function tells us to look. It's as if we changed our name from "Tom Riddle" to "Lord Voldemort", but left all the personnel records filed under 'R'.
This problem arises with any mutable structure, i.e., anything whose contents or value can be changed after its creation. Integers and strings are safe, since their values are fixed, but the whole point of lists is that we can grow them, shrink them, and overwrite their contents. What should we do?
One option would be to have each list keep track of the sets that it is in, and move itself whenever its values change.
However, this would be very expensive: every time our program touched a list, Python would have to recalculate its hash code and update all the sets that held references to the list.
Option number two is to shrug our shoulders and say, "It's the programmer's fault." This is what most languages do…
…but it's also very expensive. This time, though, the time that's wasted is the programmer's time, tracking down and fixing bugs.
Python uses a third option. It only allows programmers to put immutable values in sets.
After all, if something's value can't change, neither can its hash code, or its location in a hash table.
In practice, this turns out to be a fairly minor restriction: occasionally annoying, but never a show-stopper.
This is fine for basic types like integers and strings, which are immutable.
But what do we do with values that naturally have several parts, like people's names?
Again, there are several options. The first is to concatenate those values somehow.
For example, if we want to store "Charles" and "Darwin", we'd create the string "Charles|Darwin", and store that.
We have to use a character like '|' instead of something more natural, like a space, because if we join "Paul Antoine" and "St. Cyr" using a space, there would be three possible ways to split it apart again.
Concatenating values is actually a pretty bad idea. First, we have to find a concatenator that can never come up in our data—essentially, make a bet on what's going to happen in the future.
Second, our code will wind up being littered with string joins and string splits, which will make it slower and harder to read.
The second option—the right one—is to use tuples instead of lists.
A tuple is just an immutable list…
…i.e., a sequence of values that cannot be changed after its creation.
Tuples are created exactly like lists…
…except we use parentheses instead of square brackets.
They are indexed the same way, too, and functions like len
do exactly what you'd expect.
But you cannot assign a new value to a tuple element, i.e., you cannot change the tuple after it has been created.
This means that a tuple's hash code never changes, and that means that tuples can be put in sets. We'll see other uses of tuples in later lectures.
Tuples in sets are very useful, but one thing beginners often trip over is that you cannot look up partial values in tuples.
For example, there's no way to say, "Give me all the tuples in this set that end with the name 'Darwin'."
The next episode of this lecture will introduce another data structure that allows you to do something very much like this.
Thank you.