Hello, and welcome to the fourth episode of the Software Carpentry lecture on regular expressions. In this episode, we'll have a look at a few more patterns you can use to build up regular expressions.
If you recall, we're trying to parse data from notebooks recording background evil levels in millivaders at several sites in the Shire a couple of years after the explosion of the Death Star.
These records are in different formats, and a couple of episodes ago…
…we managed to build this function to extract the dates from those records. Inside this function, we're applying a regular expression to the record. If it matches, we're returning the matched groups, reordering them as necessary so that we always get back year, month, and day.
This version of the function does a better job of pulling data out of our records. First, it gets the site and reading, as well as the year, month, and day. Second, and more importantly, this function is more declarative: the variable patterns stores one entry for each format of record we think we have to parse.
The first element of each entry is a regular expression to match data in that format.
The remaining fields in the entry are a permutation of the indices of the groups in that pattern.
In our loop, we pull the pattern and the indices for the year, month, day, site, and reading out of each entry in the table in turn.
If the pattern matches…
we then return the matched groups, permuting them according to the indices so that the data always comes back in the same order: year, month, day, site, and reading.
Why is this better? Well, every time we have another data format to match, all we have to do is add one more entry. This makes this function very easy to extend, and very easy to test.
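As a rough sketch of the table-driven design just described, the function might look something like the code below. The two record formats, their patterns, and the names PATTERNS and get_fields are assumptions made up for illustration; they are not the ones from the lecture slides.

```python
import re

# Hypothetical table: each entry is [pattern, year, month, day, site, reading],
# where the numbers are the indices of the groups in that pattern.
PATTERNS = [
    # assumed format: 'Baker 1<TAB>2009-11-17<TAB>1223.0'
    ['(.+)\t([0-9]{4})-([0-9]{2})-([0-9]{2})\t(.+)', 2, 3, 4, 1, 5],
    # assumed format: 'Davison/May 22, 2010/1721.3'
    ['(.+)/([A-Z][a-z]+) ([0-9]{1,2}), ([0-9]{4})/(.+)', 4, 2, 3, 1, 5],
]


def get_fields(record):
    """Return (year, month, day, site, reading), or None if nothing matches."""
    for pattern, year, month, day, site, reading in PATTERNS:
        match = re.search(pattern, record)
        if match:
            # Permute the groups so the order is the same for every format.
            return (match.group(year), match.group(month), match.group(day),
                    match.group(site), match.group(reading))
    return None
```

Supporting another format in this sketch means adding one more entry to PATTERNS; the loop itself doesn't change.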
So let's take a look at Notebook #3. It has the date as three fields, the site name in parentheses, and then the reading. We know how to parse dates in this format…
…and the fields are separated by spaces…
…but how do we match against those parentheses?
So far, when we've seen parentheses in regular expressions, they haven't matched characters: they've created groups.
The way we solve this problem—i.e., the way we match a literal open parenthesis '(' or close parenthesis ')' using a regular expression—is to put backslash-open parenthesis '\(' or backslash-close parenthesis '\)' in the RE.
This is another example of an escape sequence. Just as we use the two-character sequence '\t' in a string to represent a literal tab character, we use the two-character sequence '\(' or '\)' in a regular expression to match the literal character '(' or ')'.
However, in order to get that backslash '\' into the string, we have to escape it by doubling it up.
So the string representation of the regular expression that matches an opening parenthesis is actually '\\('. This might be confusing, so let's take a look at how the various layers work.
Our program text, i.e., what's stored in our .py file, looks like this. Here, we have two backslashes, an open parenthesis, two backslashes, and a close parenthesis inside quotes.
When Python reads that file in, it turns the two-character sequence '\\' into a single literal '\' character in the string in memory. That's the first level of escaping.
When we hand the string '\(\)' to the regular expression library, it takes the two-character sequence '\(' and turns it into an arc in the finite state machine that matches a literal parenthesis. Turning this around, if we want a literal parenthesis to be matched, we have to give the regular expression library '\('. If we want to put '\(' in a string, we have to write it in our .py file as '\\('.
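The two layers of escaping can be checked directly in Python. This is a small sketch; the variable names and the sample text 'call()' are just for illustration.

```python
import re

# In the .py file there are six characters between the quotes...
pattern = '\\(\\)'

# ...but the string in memory has only four: backslash, '(', backslash, ')'.
in_memory_length = len(pattern)

# The regular expression library then reads '\(' and '\)' as
# "match a literal parenthesis".
match = re.search(pattern, 'call()')
matched_text = match.group(0)
```

Python also has raw strings, such as r'\(\)', in which backslashes are not processed by the first layer; this lecture sticks with doubled backslashes.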
With that out of the way, let's go back to Notebook #3. The regular expression that will extract the five fields from each record…
…looks like this: '([A-Z][a-z]+) ([0-9]{1,2}) ([0-9]{4}) \\((.+)\\) (.+)'
A word beginning with an upper-case character followed by one or more lower-case characters, a space, one or two digits, another space, four digits, another space, some stuff involving backslashes and parentheses, another space, and then one or more characters, which is the reading.
If we take a closer look at that "stuff", '\\(' and '\\)' are how we write the regular expressions that match a literal open parenthesis '(' or close parenthesis ')' character in our data.
The two inner parentheses that don't have backslashes in front of them create a group, but don't match any literal characters.
We create that group so that we can save the results of the match—in this case, the name of the site.
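To see that pattern at work, here is a short sketch that applies it to a single record. The record text is a made-up example in the Notebook #3 layout, not a line from the actual data.

```python
import re

# The Notebook #3 pattern from the lecture, with doubled backslashes.
NOTEBOOK_3 = '([A-Z][a-z]+) ([0-9]{1,2}) ([0-9]{4}) \\((.+)\\) (.+)'

# Hypothetical record: month, day, year, site in parentheses, reading.
record = 'May 2 2010 (Hartnell) 1029.3'

match = re.search(NOTEBOOK_3, record)
fields = match.groups()   # month, day, year, site, reading
```

The escaped parentheses are matched but not captured, so the fourth group holds only the site name, without the parentheses around it.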
Now that we know how to work with backslashes in regular expressions, we can take a look at character sets that come up frequently enough to deserve their own abbreviations.
If you use '\d' in a regular expression, it matches the digits 0 through 9.
If you use '\s', it matches whitespace characters (space, tab, carriage return, and newline, among others).
And '\w' matches word characters: it's equivalent to the set of upper-case letters, lower-case letters, digits, and the underscore, '[A-Za-z0-9_]'.
This might seem a funny definition of "word"; it's actually the set of characters that can appear in a variable name in a programming language like C or Python.
Again, in order to write one of these regular expressions as a string in Python, you have to double up the backslashes.
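Here is a brief sketch of the three abbreviations in use, written with doubled backslashes. The sample strings are invented for illustration.

```python
import re

# '\d' matches one digit at a time.
digits = re.findall('\\d', 'a1b22')

# '\s+' matches a run of whitespace, so it can be used to split on it.
pieces = re.split('\\s+', 'one two\tthree\nfour')

# '\w+' matches runs of letters, digits, and underscores.
words = re.findall('\\w+', 'evil_level = 42 mV')
```

Note that the underscore in 'evil_level' counts as a word character, so the name comes back in one piece.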
Now that we've seen these character sets, we can take a look at an example of really bad design.
'\S' means "non-space characters", i.e., everything that isn't a space, tab, carriage return, or newline. That might seem to contradict what I said a few seconds ago…
but that's an upper-case 'S', not a lower-case 's'.
Similarly, and unfortunately, '\W' means "non-word characters"…
…provided it's an upper-case 'W'. Upper- and lower-case 'S' and 'W' look very similar, particularly when there aren't other characters right next to them to give context.
This means that these sequences are very easy to mis-type…
…and what's worse, even easier to mis-read. Everyone eventually uses an upper-case 'S' when they meant to use a lower-case 's', or vice versa, and then wastes a few hours trying to track it down. So please, if you're ever designing a library that's likely to be widely used, try to choose a notation that doesn't make mistakes this easy.
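A small sketch shows how much the case of a single letter changes the result. The sample text is made up.

```python
import re

text = 'site 7:\t42 mV'

# Lower-case 's': whitespace characters only.
lower_s = re.findall('\\s', text)

# Upper-case 'S': runs of everything that is NOT whitespace.
upper_s = re.findall('\\S+', text)

# Upper-case 'W': everything that is NOT a word character.
upper_w = re.findall('\\W', text)
```

With this text, the lower-case pattern finds the two spaces and the tab, while the upper-case 'S' pattern finds the four chunks between them.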
Along with the abbreviations for character sets, the regular expression library recognizes a few shortcuts that match things that aren't actual characters.
For example, if you put a circumflex '^' at the start of a pattern, it matches the beginning of the input text.
So the pattern '^mask' will match the text 'mask size' because the letters 'mask' come at the start of the string.
But that same pattern will not match the word 'unmask'.
Going to the other end, if a dollar sign '$' is the last character in the pattern, it matches the end of the input text rather than a literal '$'.
So 'temp$' will match the string 'high-temp'…
…but it won't match the string 'temperature'.
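The four cases just described can be checked with a short sketch. It uses re.search, which looks for the pattern anywhere in the text, so the anchors alone decide where a match may occur; the variable names are only for illustration.

```python
import re

# '^' ties the match to the start of the text.
starts_with_mask = re.search('^mask', 'mask size')
unmask = re.search('^mask', 'unmask')

# '$' ties the match to the end of the text.
ends_with_temp = re.search('temp$', 'high-temp')
temperature = re.search('temp$', 'temperature')
```

A failed search gives back None, so 'unmask' and 'temperature' both produce None here.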
A third shortcut that's often useful is '\b', often called "break". It matches the boundary between word and non-word characters: it doesn't actually match any characters—it doesn't consume any input—but it matches the transition between non-word characters and letters, digits, and the underscore.
If we have '\\bage\\b', it will match the string 'the age of', because there's a non-word character right before the 'a', and another non-word character right after the 'e'.
That same pattern will not match the word 'phage'
because there isn't a transition from non-word to word characters, or vice versa, right before the 'a'.
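As a quick sketch, both cases can be tried in Python. The last line is there as a reminder of why the backslashes need doubling for this particular sequence; the variable names are invented for the example.

```python
import re

pattern = '\\bage\\b'   # in memory: \bage\b

# A space on each side of 'age' gives a word boundary on each side.
standalone = re.search(pattern, 'the age of')

# In 'phage' the 'a' follows another word character, so there is
# no boundary before it and the search fails.
embedded = re.search(pattern, 'phage')

# Without doubling, Python itself turns '\b' into a backspace character
# before the regular expression library ever sees it.
undoubled_is_backspace = ('\b' == '\x08')
```

Since '\b' consumes no input, the text matched in the first case is just 'age', without the surrounding spaces.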
We've now seen about a dozen of the atoms that are used to build regular expressions. There are many more, and every language or library adds a few of its own. In the next episode, we'll take a closer look at the functions in the regular expression library that are used to apply these to problems.