Find us on GitHub

Teaching basic lab skills
for research computing

Sets and Dictionaries: Introduction

Sets and Dictionaries/Introduction at YouTube

Hello, and welcome to the first episode of the Software Carpentry lecture on sets and dictionaries, two ways of storing data that can make many programs both simpler and faster. In this episode we'll have a look at sets, which are the simpler of the two.

Most programmers meet lists (or arrays) early in their careers, and use them for almost everything thereafter. However, they are not the only way to store multiple values in a program.

In mathematics, you're far more likely to encounter sets, and these come up in many algorithms as well.

A set is an unordered collection of distinct items.

The word "collection" means that a set can hold zero or more values.

The word "distinct" means that any particular value is either in the set or not: a set can't store two or more copies of the same thing.

And finally, "unordered" means that values are simply "in" the set. They're not in any particular order, and there's no first value or last value.

This third part is what novices trip over the most: the first few times people use sets, they tend to expect that values are stored in alphabetical or numerical order, or in the order they were added to the set, but they're not.

Sets were a relatively late addition to Python: the language was already ten years old when they put in.

That's still better than most languages, though. Despite how often sets come up in various algorithms, most languages still don't provide them as a built-in type.

If you're using Python 2.6 or earlier, you create a set by calling set and passing it a list of the values you want in the set.

In version 2.7 or later, you can simply write down the set as you would in a math class.

However, in both versions, you have to call set to create an empty set…

…because when sets were added to the language, empty curly braces—the mathematical notation for an empty set—were already being used for something else (which we'll see in a couple of episodes).

We'll use the Python 2.7 notation—the more natural one—in this lecture.

As I said a moment ago, the values in a set are distinct, i.e., any value is simply present or absent. This means that sets can easily be used to get rid of duplicate values.

For example, suppose we want to find out what letters are used in a word. If we create a set, then add the characters in the word to the set one by one, the set will automatically discard duplicates. When we print the set out, what we get are the unique values.

Notice that these values aren't printed in alphabetical order, or in the order they were added to the set.

This is because set elements are not ordered, and it's important to keep this in mind when using them.

Here's a much shorter way to find the distinct letters in a word:

Just create a set, passing the word in as an argument.

This works because Python knows how to loop over the characters in a string the same way that it loops over the items in a list. In general, if you can loop over it, you can build a set from it.

What you can't do is build a set from several distinct values in one go.

Many novices try to build a set like this, but Python insists that all the initial values be in a single container that can be looped over, like a list or string.

To show you what we can do with sets, let's create three holding the integers 0 through 9, the first half of that same range of numbers (0 through 4), and the odd values 1, 3, 5, 7, and 9. Notice that in the first case, the function range is returning a list of ten integers, which is passed directly into set to create the set we want.

We've already seen how to add elements using set's add method.

We can also empty out a set by calling its clear method. Just like the methods of strings and lists, we call a method using object dot method name instead of a "naked" function name.

We can use other methods to find the difference between two sets…

…or their intersection. In each case, the method creates and returns a new set instead of modifying either the set the method is called for, or the one passed in as an argument.

issubset doesn't produce a new set. Instead, it just reports whether all the elements in one set are present in another. Notice the order: lows.issubset(ten) checks whether lows is a subset of ten, not the other way around.

Of course, the complementary method issuperset also exists, and does what you'd expect.

Now let's remove the zero from lows, leaving the set of integers 1, 2, 3, and 4. This is similar to deleting an element from a list, but there's an important difference. When deleting from a list, you specify the location of the element to delete. When deleting from a set, you must specify the value you want to take out. If that value isn't in the set, this method does nothing.

Here's another method that creates a new set. symmetric_difference is sometimes called "exclusive or"; it returns the values that are in one set or another, but not in both. It's useful from time to time…

…but you'll probably use plain old union more often.

Finally, you can count how many values are in a set using len, just as you would to find the length of a list or string…

…and check whether a particular value is in the set or not using in.

To help make programs eaiser to type and read, most of the methods we've just seen can be written using arithmetic operators as well. For example, instead of lows.issubset(ten), we can write lows <= ten, just as we would if we were using pen and paper. There are even a couple of operators, like less than <, that don't have long-winded equivalents.

One operator that isn't in this list is "not".

Mathematicians are quite comfortable negating sets: for example, the negation of "the set {1, 2}" is "all integers that aren't 1 or 2".

This is a lot harder to do in a program, though. To continue with our example, is the string "pterodactyl" in the negation of the set {1, 2} or not?

We'll take another look at this problem when we get to classes and object-oriented programming in a later lecture.

Right now, let's use sets to clean up some field data.

We have two files. The first has the names of all the birds our supervisor considers uninteresting from a research point of view.

The second contains the names of all the birds we saw during a three-week field trip in a mosquito-infested hellhole in northern Ontario.

Our task is to copy those observations, removing the uninteresting birds as we do so.

This is the main body of our program.

We start by reading in the set of birds we're supposed to be subtracting from our observations. That's going to take a few lines of code, so for now, we'll pretend we can do this by calling a function called read_set, which we'll come back and write in a minute or two.

Next, we open our input file for reading, and our output file for writing.

We then read lines from the input file one at a time and strip off any leading or trailing whitespace. In particular, line.strip will remove the newline and/or carriage return characters at the ends of the lines we're reading in, so that we just have the names of the birds.

These two lines are the heart of our program. If the bird name we just read is not in the set of names we're supposed to ignore, we print it out. Everything else in our program is really just there to support these two lines—a realization that we'll return to later when we talk about program design.

Finally, once we've read and filtered all of our input, we close the input and output files.

So far, so good: all that's left is to write the function that reads in the set of names we're supposed to ignore.

We start by creating an empty set, which we will then fill up with bird names, and open the input file.

We then read lines from the input one at a time, stripping off whitespace…

…and put the strings left over in the set.

When we're done, we return the set that we created.

Stepping back, the main program and the read_set function have almost exactly the same structure. Each one initializes some data, opens a file or files, loops over input data, does something with that data, then closes the files it's using. We'll have a closer look at this pattern, and how we can save ourselves writing it over and over again, in a later lecture.

Finally, here's a few tests of our program. If there's nothing to remove, our output file is the same as our input. If we're removing just one name, it's subtracted as it should be, while if the name of every bird we saw is in the removals list, the output is empty. These three tests don't exhaust all the possibilities, but the odds are pretty good that if a program passes them, it's doing the right thing.

Thank you.