Hello, and welcome to the fifth and final episode of the Software Carpentry lecture on regular expressions. In this episode, we're going to work through a moderately complex problem using regular expressions, and introduce a few more tools in the regular expression library.
Our starting point is an archive of several thousand papers and theses written in LaTeX, a text-based document formatting program.
Our documents look like this.
They all use the same labels to refer to items in a shared bibliography.
Our job is to find out how often citations appear together, i.e., how often paper X is cited in the same document as paper Y.
As a first step, we need to extract the set of citation labels from each document, and that's the problem we'll tackle in this episode.
Let's have a closer look at our input. In LaTeX, citations are written using '\cite', with cross-reference labels in curly braces.
A single citation can include two or more labels separated by commas.
There may be white space before or after labels.
There can even be line breaks where a citation is split across two lines.
And there can be multiple citations per line.
Our first idea is to use a group to capture everything inside the curly braces following the word 'cite'. It seems to work in one simple case.
But what if there are multiple citations on a single line?
It looks like we're capturing the text between the citations.
The reason is that regular expression matching is greedy: it matches as much text as it can, and the '.' in '.+' will match all the characters from the first curly brace to the last one, including the intervening citations and curly braces.
Luckily, the diagnosis of our problem suggests its solution: let's have the regular expression match everything except a closing curly brace.
This is remarkably easy to do: if the first character in a set in square brackets is the circumflex '^', then the set is negated, i.e., it matches everything except the characters in the set. The expression '[^}]' therefore matches every character except a closing curly brace.
This works for a single citation.
All we've done is change '.' to the negated set.
What about multiple citations on a single line?
Well, it's not gobbling up text we don't want it to, but it's only capturing the first citation.
Somehow, we need to extract all matches, not just the first.
Luckily, the regular expression library has a function to do exactly this: if we use
re.findall instead of
re.search, it will give us back a list of all the substrings that matched our pattern.
Remember, whatever your problem is, someone has probably run into it before, and there's probably something in the library to help you. Knowing what's in the library is as important to a programmer as knowing what's in the literature is to a scientist.
findall a try. It seems to produce the right output.
Not bad for a 7-character change.
OK, what about spaces in citations?
The good news is, nothing breaks. The bad news is, the spaces are saved by
findall, which isn't really what we want.
We could tidy this up after the fact using
But let's modify the pattern instead.
This modification solves half our problem.
If you recall, '\s' is an abbrevation for the set of whitespace characters, so these uses of '\s*' match zero or more spaces immediately after the opening curly brace, or immediately before the closing one.
However, the space after the 'Y' is still being returned in the matched text. Once again, the problem is that regular expressions are greedy: the space after the 'Y' isn't a closing curly brace, so it's matched by the negated character set, and included in the returned string. The '\s*' that's supposed to match the trailing space is then matched against zero characters instead of one. It's not what we want, but it's legal.
OK, so let's force our match to line up with the break from word to non-word characters using '\b'.
The change is to put '\b' after the first unwanted spaces, and before the last ones. Since the curly braces around the citation labels are also non-word characters, the pattern matches even if there aren't any opening or trailing spaces.
The last hurdle is to handle multiple labels inside a single pair of curly braces.
The pattern we've built so far doesn't explode when there are two or more labels—it even handles spaces after the commas. But it returns all those labels as a single lump of text.
We actually could write a pattern that would break everything up on commas, but it would need some very advanced features of the regular expression library, and would be very difficult to read.
Instead, let's use another basic function,
re.split, to separate multiple labels.
re.split does the same thing as
string.split, but unlike its simpler cousin, it breaks things up everywhere that a pattern matches.
The best way to show
re.split in action is to write the function we originally set out to create. Let's start with a skeleton that includes some test data, a function that does nothing (but doesn't just fail), and a couple of lines that call that function and display the result. So far, so good.
Now let's write our function. For readability's sake, we'll put our patterns at the top and give them memorable names. Inside the function, we'll pull out all the citations using the first pattern, then split each result everywhere there's a comma with optional spaces before or after it. We'll stuff all the results into a set, and return that; if no matches were found, that set will be empty.
We can use one more trick from the regular expression library to make this function more efficient. Instead of turning the regular expression into a finite state machine over and over again, we can compile the regular expression and save the resulting object. That object has methods with the same names as the functions we've been using from the library, like
findall, but if we're using the same pattern over and over again, compiling it once and re-using the compiled object is much faster.
As you can see, the changes required are very small: instead of saving the textual representations of our expressions, we compile them, and then instead of calling the top-level functions from the regular expression library, we call the methods of those saved objects.
And here's the result: a set of all the citations in our test data, pulled out with just a dozen lines of code.
This is the end of our lecture on regular expressions. We'll touch on a few other ideas in the exercises, but we hope these five episodes have shown you enough to help you solve some everyday data crunching problems.