Find us on GitHub

Sets and Dictionaries: Phylogenetic Trees

Hello, and welcome to this episode of the Software Carpentry lecture on sets and dictionaries. In the next few minutes, we're going to look at how to reverse engineer the process of evolution, and how dictionaries can make that simpler.

You don't have to look at the natural world very hard to realize that some organisms are more alike than others.

For example, if you look at the appearance, anatomy, and lifecycles of these seven fish…

You can see that these two are closely related…

…as are these two…

…and these two.

But what about the seventh? Where does it fit in? And how do the pairs relate to each other.

As Theodosius Dobzhansky said almost a century ago, nothing in biology makes sense except in the light of evolution. Since mutations usually occur one at a time, the more similarities there are between the DNA of two species, the more recently they had a common ancestor. We can use this idea to reconstruct the evolutionary tree for a group of organisms using a hierarchical clustering algorithm.

The first step is to find the two species that are most similar, and construct their plausible common ancestor.

We then pair two more…

…and two more…

…and start joining pairs to individuals…

…or pairs with other pairs.

Eventually, all the organisms are connected.

We can redraw those connections as a tree…

…using the heights of branches to show the number of differences between the species we're joining up.

Let's turn this into an algorithm.

Initially, our universe U contains all the organisms we're interested in. While there are still organisms that haven't been connected to the tree, we find the two that are closest, calculate their common parent, remove the two we just paired up from the set, and insert the newly-created parent.

The set of ungrouped organisms shrinks by one each time, so this algorithm eventually terminates.

And we can keep track of the pairings on the side to reconstruct the tree when we're done.

Now, what does "closest" mean?

One simple algorithm is called unweighted pair-group method using arithmetic averages.

One simple algorithm is called unweighted pair-group method using arithmetic averages.

Let's illustrate it by calculating a phylogenetic tree for humans, vampires, werewolves, and mermaids. The distances between each pair of species is shown in this table. (We only show the lower triangle because it's symmetric.)

The closest entries—i.e., the pair with minimum distance—is human and werewolf.

We will replace this with a common ancestor, which we will call HW.

Its height will be 1/2 the value of the entry.

And for each other species X, we will calculate a new score for HW and X as (HX + WX - HW)/2.

For example, we will combine HV and VW which are in yellow, and HM and MW which are green.

Which leaves us with this new matrix.

The height of HW is half of the 5 we eliminated, or 2.5.

Repeating this step, we combine HW with V…

…and finally HWV with M.

Our final tree looks like this.

And the 'missing' heights are implied by the differences between branch values.

Right: how do we translate this into software?

We illustrated our algorithm with a triangular matrix…

…but the order of the rows and columns is arbitrary.

It's really just a lookup table mapping pairs of organisms to numbers.

And as soon as we think of 'lookup tables', we should think of dictionaries.

The keys are pairs of organisms.

which we will keep in alphabetical order so that there's no confusion between 'HW' and 'WH'

and the values are the distances between those organisms.

…becomes this dictionary.

Time to create some code. Let's write out the algorithm we discussed earlier.

While we have some scores left to process, find the closest pair… create a new parent… print it out (or record it somewhere to use later)… remove entries that refer to the pair we're combining… and add new entries to the scores table.

This code assumes that the scores are in a dictionary called 'scores'.

It also assumes that the names of the species are in a list called 'species'.

And yes, it took us a couple of false starts to come up with this code: we would have shown you the blind alleys, but this episode is already rather long.

Let's start writing the functions our overall algorithm assumes. The first is min_pair, which finds the closest pair of organisms in the scores table. The algorithm is simple…

…but it assumes we have a way to generate all valid combinations of organisms from the species list. We'll need to write that.

It's also worth noting the assert statements, which check that the data we're working with is sensible, and that we actually found a minimum value. Remember, good programs fail early and fail often.

Now let's write the function that generates pairs of species. This uses a double loop to construct a list of pairs.

Notice the starting index on the inner loop: if the outer loop is at position 'i', the inner loop starts at 'i+1', which ensures that each possible pair is only generated once.

Here's what our program looks like so far: there's a main body, and two functions find_min_pair and combos.

Looking back at the program's main body, the next function we need to write is 'create_new_parent'. It's pretty simple: we just concatenate the names of the two organisms we're combining. The new score is just half the score for the pair we're combining.

We put square brackets around the name to make it more readable, and so that we can easily see the order in which things were paired up.

Three functions down: what's next?

Looking back at our main program, we need to remove entries from the scores table that refer to the organisms we're pairing up. Oh, and we need to take their names out of the list of species, too. That's pretty easy to do…

…but notice that this function returns the old score, because we need it in the main program. This isn't a great design: there's nothing in the name 'remove_entries' to suggest that we're saving and returning the old score, and it isn't immediately obvious why we're doing this. As an exercise, try rewriting the main program so that 'remove_entries' doesn't have to return anything, and see if you think it's easier to understand.

So much for our fourth function…

Our fifth updates the scores table and species list. It's actually pretty complicated: for each species that isn't being paired up, we have to…

…calculate the combination of that species with the two halves of the new pair…

…saving the scores for later use…

…create a pair that includes the newly-constructed parent…

…and save our work.

When we're done, we have to add the newly-created parent to the list of species…

…and then there's this call to sort the species list. If we don't do this, our program will produce a wrong answer. After this episode is over, take a few minutes to review the code and see if you can figure out why. Again, this is an example of not very good design: if it isn't clear why something is being done where it is, it should be moved or explained.

Our program now has five functions and a main body.

But we're still not done: we need to write the 'tidy_up' function we referred to in 'update'.

It in turn assumes a function called 'make_pair' that combines a pair of species…

…which simply constructs a tuple of the species' names ordered alphabetically. Remember, we're ordering names so that each pair has a unique representation.

Our two new functions fit into our program as shown here.

We have nothing left to write, so let's try it out. If we run it with the four species we used as an example earlier, the output is as shown here. These are the same values we had in our example tree, so we seem to be on the right track. As exercises, try writing unit tests for this code, and then modify the program so that it displays the entire reconstructed tree. Oh, and try to figure out why the 'update' function has to sort things, or what symptoms you'll see if you don't do this.

Thank you for listening.