Find us on GitHub

Teaching basic lab skills
for research computing

Sets and Dictionaries: Nanotech Example

Sets and Dictionaries/Nanotech Example at YouTube

Hello, and welcome to the final episode of the Software Carpentry lecture on sets and dictionaries. In this episode, we'll build a solution to our original nanotechnology problem.

If you recall, our goal is to find out how many molecules of various kinds we could make using the atoms in our warehouse.

The formulas for the molecules we know how to make are stored in a file that's formatted like this.

And our inventory is stored in a file that's formatted like this.

Using what we've learned in the previous few episodes, we can now write a simple, efficient program to solve our problem.

Let's start with our inventory. Our input consists of pairs of strings and numbers, which naturally suggests using a dictionary for storage.

The keys will be atomic symbols…

…and the values will be the number of atoms of that kind we currently have in stock.

The result will look like this.

We can use this same structure to represent individual molecules…

…such as water.

So let's store our formulas as a dictionary of dictionaries.

The keys in the formulas dictionary will be the name of molecules…

…and the values will be dictionaries storing the formulas for particular molecules.

Here, for example, we have an outer dictionary that maps the word 'water' to a dictionary storing the number of oxygen and hydrogen atoms in a single water molecule, and the word 'ammonia' to a dictionary storing the number of nitrogen and hydrogen atoms in a single ammonia molecule.

The number of molecules of any particular type we can make is limited by the scarcest atom that molecule requires. In mathematical terms, we want the minimum over the atoms used in the molecule of the ratio of how many of that atom we have to how many of that atom we need.

As a special case, if the atom isn't explicitly listed in our inventory, its count is implicitly zero.

We'll store the results of our calculation in yet another dictionary, this one mapping…

… the names of molecules…

…to how many of that molecule we can make.

The main body of the program is simple: read in the input files, do our calculation, and print the result.

Reading the inventory file is simple: take each interesting line in the file, split it to get an atomic symbol and a count, and store them together in a dictionary.

For clarity's sake, we'll use this helper function to read a file and strip out blank lines and comments.

Using that same function, reading in a file of molecular formulas is only slightly more complex.

We create the dictionary we're going to store the results in.

And then for each line in the file that has some data…

…we split on the colon ':' to separate the molecule's name (which may contain spaces) from its formula.

We then split the formulas into a list of strings, which alternate between atomic symbols and counts…

…and loop over those values, moving forward two elements at a time…

…storing the atomic symbol and count in a dictionary.

Once we're done, we store that dictionary as the value for the molecule name in the main dictionary.

Now it's time to figure out how many molecules of each kind we can make.

inventory maps atomic symbols to counts, and so does formulas[name]. Let's loop over all the molecules in our formulas and "divide" those two dictionaries.

It might seem overkill to write yet another helper function in this case, but experience shows that if you have a choice between big functions in which nothing is obviously wrong…

…or little functions in which obviously nothing is wrong, you should choose the latter.

Here's the function that "divides" one dictionary by another. We loop over all the atoms in the molecule we're trying to build, see what limits the available inventory puts on us, and return the minimum of all those results. This function uses a few patterns that come up frequently in many kinds of programs, so let's have a closer look.

The input arguments have identical formats: each one maps atomic symbols to counts.

The first pattern is to initialize the value we're going to return to None, then test for that value inside the loop to make sure we re-set it to a legal value the first time we have real data. In this case, we could just as easily use -1 or some other "impossible" value as an "uninitialized" flag for number.

Since we're looping over the keys of molecule, we know that we can get the value stored in molecule[atom]. However, that atom might not be a key in inventory, so we use inventory.get(atom, 0) to get either the stored value, or a sensible default—in this case zero, because if the atom's symbol isn't in the dictionary, we don't have any of it. This is our second pattern.

The third is using calculate, test, and store to find a single value—in this case, the minimum—from a set of calculated values. We could calculate the list of available over required values, then find the minimum of the list, but doing the minimum test as we go along saves us having to store the list of intermediate values. It's probably not a noticeable time saving in this case, but it would be with larger data sets.

Finally, we need to show how many molecules of each kind we can make. We could just loop over our result dictionary, printing each molecule's name and the number of times we could make it, but to keep things interesting, let's print the results by descending count, with the molecules we can make most of at the top. First, we invert the dictionary using a helper function (which we'll write in a second). Then, for each molecule name associated with each count, we print out the count and name.

Here, we get the keys from the inverted dictionary—i.e., the counts of how many molecules we can make—and sort them. Notice that we pass in reverse=True to sort in descending order, i.e., with the greatest values first. We've also sorted the molecule names, just to be tidy.

And here's our function to invert a dictionary. For each key/value pair in the source dictionary, we use the value as the key in the new dictionary, and the original key as the value.

Here, we're relying on another common pattern: when the values in a dictionary are some kind of collection, such as a set or list, we store an empty collection of that with the key if the key isn't already present, so that we can add our new value whether the key was previously present or not.

And here's one afterthought: let's go back to our show_counts function and only print out molecules that we can actually make, i.e., ones whose counts are greater than zero. If we have a lot of possible molecules in our database, but not much inventory, this will save us a lot of less-than-useful output.

Time to test. Let's try an empty inventory against a single formula. There's no output, which is what we expect.

Add one atom of helium to our inventory: that seems right.

Add some hydrogen, but don't give the program any formulas that use hydrogen: still correct.

Add the formula for water, which does use hydrogen, but don't provide any oxygen: no water in the output, but helium is still appearing as it should.

Add the formula for molecular hydrogen, and sure enough, we can make a couple of those.

Put some oxygen in the warehouse, and sure enough, we can make a couple of water molecules.

There are quite a few other interesting tests still to run, but things are looking good so far.

And our code is a lot simpler than it would be if we used lists of pairs of atom names and counts, or some other data structure.

This is the end of our lecture on sets and dictionaries. We hope you'll take a few minutes to try a few of the exercises.