Hello, and welcome to the second episode of the Software Carpentry lecture on regular expressions. In this episode, we'll have a look at some operators you can use in your regular expressions.
If you recall, we have several notebooks full of data measuring background evil levels in millivaders. Notebook #1 has these as site, date, and background evil level…
… with single tabs as separators.
Some of the site names have spaces…
…and the dates are in the international standard format, with four digits for the year, two for the month, and two for the day.
The data in Notebook #2 also has three fields…
…but these are separated by slashes.
Months are reported using their names, and are of varying length. The days are also of varying length.
We saw in the previous episode that regular expressions are patterns that can be used to match text. Letters and digits match themselves; vertical bar '|' means OR; the dot '.' matches any single character; you can use parentheses '()' to enforce grouping; the re.search method returns a match object if a match is found, or None if one is not; and if a match is found, match.group(k) is the text that matched parenthesized group k.
Before we look at how to use regular expressions to extract data from Notebook #2, let's see how we would do it with simple strings. If our record is the string shown in the first line of code ('Davison/May 22, 2010/1721.3'), we could split on slashes to get the site, the date, and the reading. We could then split the middle field on spaces to get the month, day, and year, and finally remove the comma from the day if it is present, because, if you recall, some of our readings don't have a comma after the day.
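The steps just described can be sketched in a few lines of Python. This is only an illustration of the string-splitting approach; the variable names are chosen here and are not taken from the slides.

```python
record = 'Davison/May 22, 2010/1721.3'

# Split the record into its three slash-separated fields.
site, date, reading = record.split('/')

# Split the date on spaces to get month, day, and year.
month, day, year = date.split(' ')

# Remove the trailing comma from the day if there is one.
if day.endswith(','):
    day = day[:-1]

print(site, month, day, year, reading)
```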
This is a procedural way to solve the problem: we are telling the computer how to do something.
Regular expressions, by contrast, are declarative: we tell the computer what we want, and it figures out how to do it.
Our first attempt to parse this data relies on the star '*' operator.
'*' means "zero or more repetitions of the pattern that comes before it".
It is a postfix operator, just like the exponent 2 in x².
So '.*' means "zero or more characters", because '.' matches any character, and '*' forces the preceding pattern—the '.'—to match zero or more times.
In order for the entire pattern to match, the slashes '/' have to line up exactly, because '/' matches against itself. That's why this seems to grab the site name, the date, and the reading correctly.
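As a rough sketch of this first attempt, assuming the pattern is three '(.*)' groups separated by literal slashes, the match against our sample record looks like this:

```python
import re

record = 'Davison/May 22, 2010/1721.3'

# Three groups of "zero or more characters", separated by literal slashes.
match = re.search('(.*)/(.*)/(.*)', record)

if match:
    for k in (1, 2, 3):
        print(match.group(k))
```

Because the record contains exactly two slashes, there is only one way for the three groups to line up.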
Unfortunately, we've been over-generous. Our pattern also matches the string '//'. Here we're printing a '*' after each group so that you can see there really are three lines of output, one for each (empty) group.
'.*' can match the empty string, because that's zero or more occurrences of a character.
That means our pattern will accept badly-formatted data, which is likely to cause us headaches down the road.
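A small sketch makes the problem visible, assuming the same three-group '.*' pattern as before. Each group is printed with a trailing '*' so the empty matches show up as lines of output.

```python
import re

# A "record" that contains nothing but the two separators.
match = re.search('(.*)/(.*)/(.*)', '//')

# The pattern still matches, and every group is the empty string.
if match:
    for k in (1, 2, 3):
        print(match.group(k) + '*')
```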
Let's try a variation that uses '+' (plus) instead of '*' (star).
In a regular expression, '+' is a postfix operator meaning "one or more", i.e., it has to match at least one occurrence of the pattern that comes before it. As you can see, the pattern '(.+)/(.+)/(.+)' doesn't match a string containing only slashes because there aren't characters before, between, or after the slashes for the .+'s to match.
If we go back and check it against real data, it seems to be doing the right thing.
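Both checks can be sketched together. The variable names below are illustrative only.

```python
import re

pattern = '(.+)/(.+)/(.+)'

# No characters before, between, or after the slashes: no match.
only_slashes = re.search(pattern, '//')
print(only_slashes)

# Real data still matches, with one group per field.
real = re.search(pattern, 'Davison/May 22, 2010/1721.3')
print(real.groups())
```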
We're actually going to be matching a lot of patterns against a lot of strings, so let's write a function that will apply a pattern to a piece of text, report if there is no match, and if there is a match, print out all of the groups in order. Here, we're testing our little function against the record we were just using.
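A helper along these lines might look like the sketch below. The name show_groups and the 'NO MATCH' message are assumptions made for this example; returning the groups as well as printing them is a convenience added here so the result can be checked.

```python
import re

def show_groups(pattern, text):
    """Apply pattern to text; print each group, or report that nothing matched."""
    match = re.search(pattern, text)
    if match is None:
        print('NO MATCH')
        return None
    groups = match.groups()
    # Groups are numbered from 1, in the order their '(' appears in the pattern.
    for number, text_matched in enumerate(groups, start=1):
        print(number, text_matched)
    return groups

result = show_groups('(.+)/(.+)/(.+)', 'Davison/May 22, 2010/1721.3')
```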
If we're using regular expressions to extract the site, the date, and the reading, why not break up the date while we're at it? This pattern pulls out the month, the day, and the year at the same time as it pulls out the site and the reading.
But wait a second: why doesn't this work?
You probably didn't notice that this record does not have a comma after the day. The pattern does have one, so this pattern doesn't match this string.
Let's fix that by putting a question mark '?' after the comma. In a regular expression, '?' is a postfix operator meaning "0 or 1 of whatever comes before it".
I.e., the pattern that comes before the question mark is optional. Now, this pattern successfully matches data without a comma…
…and when we test on data with a comma, it still works.
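Here is a sketch of that check. It assumes the pattern at this point is five '(.+)' groups with ',?' after the day, which is a guess based on the narration rather than a copy of the slide.

```python
import re

# ',?' makes the comma after the day optional.
pattern = '(.+)/(.+) (.+),? (.+)/(.+)'

without_comma = re.search(pattern, 'Davison/May 22 2010/1721.3')
with_comma = re.search(pattern, 'Davison/May 22, 2010/1721.3')

print(without_comma.groups())
# ('Davison', 'May', '22', '2010', '1721.3')

print(with_comma.groups())
# ('Davison', 'May', '22,', '2010', '1721.3')
```

Both records match. One thing worth noticing: with this loose pattern, the greedy '.+' for the day absorbs the comma, so the day comes back as '22,'. Restricting the day to digits avoids that.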
Let's tighten up our pattern a little bit more. We don't want to match this record.
Somebody has mis-typed the year, and given us three digits instead of four—either that, or whoever took this reading was taking advantage of the physics department's time machine.
We could use four dots in a row to force the pattern to match exactly four characters for the year…
…but this won't win any awards for readability.
Instead, let's put the digit '4' in curly braces '{}' after the dot.
In a regular expression, curly braces with a number between them are a postfix operator meaning "match the pattern exactly this many times". Here, we mean "match '.' four times against the string".
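A short sketch of the effect, again assuming the rest of the pattern is unchanged from the earlier steps:

```python
import re

# '.{4}' means exactly four characters for the year.
pattern = '(.+)/(.+) (.+),? (.{4})/(.+)'

good = re.search(pattern, 'Davison/May 22, 2010/1721.3')
bad = re.search(pattern, 'Davison/May 22, 201/1721.3')

print(good.group(4))  # the year from the well-formed record
print(bad)            # the three-character year does not match
```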
Let's do a few more tests. Here are some records in which the dates are either correct or mangled. And here's a pattern that should match all the records that are correct, but should fail to match all the records that have been mangled. We are expecting four digits for the year…
…and we are allowing 1 or 2 digits for the day: the expression '{M,N}' matches a pattern from M to N times. Here, we're allowing from 1 to 2 characters for the day.
When we run this pattern against our test data, we see that three records match. The second and third make sense: 'May 2' is valid, and 'May 22' is valid.
But why does 'May' with no day at all match this pattern? Let's look at that test case more closely.
The groups are 'Davison' (that looks right), 'May' (looks good so far), a ',' on its own (which is clearly wrong), and then the right year and the right reading.
Here's what's happened. The space ' ' after 'May' matches the space ' ' in the pattern.
The expression "1 or 2 occurrences of any character" matches the comma ',' because ',' is a character and it's occurring once.
The expression ',?' is then not matched against anything, because it's allowed to match zero characters. '?' means "optional", and in this case, the regular expression pattern matcher is deciding not to match it against anything, because that's the only way to get the whole pattern to match the whole string.
And then of course the second space matches the second space in our data. This is obviously not what we want, so let's modify our pattern again.
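The surprising match can be reproduced with a sketch like this one. The record string is a reconstruction of the "no day" test case described above.

```python
import re

# '.{1,2}' allows one or two of ANY character for the day.
pattern = '(.+)/(.+) (.{1,2}),? (.{4})/(.+)'

# A mangled record: there is no day between 'May' and the comma.
match = re.search(pattern, 'Davison/May , 2010/1721.3')

print(match.groups())
# ('Davison', 'May', ',', '2010', '1721.3')
```

The comma ends up in the day group, and ',?' matches nothing.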
The pattern here ('(.+)/(.+) ([0-9]{1,2}),? (.{4})/(.+)') does the right thing: it rejects the record where there's no day, and it still matches the records that do have digits for the day.
What's going on? Well, instead of using '.', we're using '[0-9]'. In a regular expression, square brackets '[]' are used to create a set of characters.
For example, the expression '[aeiou]' will match exactly one vowel: it matches one instance of any character in the set. You can either write these sets out character by character, as we've done with vowels, or if the characters are in a contiguous range, write them as "first character '-' last character", as we've done with the digits.
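A few quick checks illustrate how sets behave. The test words are made up for this sketch; the longer pattern is the one quoted above.

```python
import re

# A set matches exactly one character drawn from the set.
first_vowel = re.search('[aeiou]', 'evil').group(0)
no_vowel = re.search('[aeiou]', 'rhythm')
print(first_vowel, no_vowel)

# With '[0-9]{1,2}' for the day, a comma can no longer stand in for it.
pattern = '(.+)/(.+) ([0-9]{1,2}),? (.{4})/(.+)'
no_day = re.search(pattern, 'Davison/May , 2010/1721.3')
with_day = re.search(pattern, 'Davison/May 22, 2010/1721.3')
print(no_day)
print(with_day.groups())
```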
Here's our completed pattern: '(.+)/([A-Z][a-z]+) ([0-9]{1,2}),? ([0-9]{4})/(.+)'
We've added one more feature to it: the name of the month has to begin with an upper-case letter, i.e., a character in the set '[A-Z]'…
…followed by one or more lower-case characters in the set '[a-z]'.
The day is one or two occurrences of the digits 0 through 9.
This will allow "days" like '0', '00', '99', and so on.
We're going to check for that after we convert the day to an integer…
…since the valid range depends on which month we're in, and that can't be done declaratively—think, for example, about how we would have to handle leap years.
Finally, the year is exactly four digits, so it's the set of characters '[0-9]' repeated four times.
Again, we'll check for invalid values like '0000' after we convert to integer.
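The division of labor might look something like the sketch below: the pattern checks the shape of the record, and ordinary code checks the values afterwards. The function looks_plausible and its limits are hypothetical and deliberately coarse; a real check would depend on the month and on leap years.

```python
import re

pattern = '(.+)/([A-Z][a-z]+) ([0-9]{1,2}),? ([0-9]{4})/(.+)'

def looks_plausible(day, year):
    """Hypothetical, coarse range check applied after conversion to integers."""
    return 1 <= day <= 31 and year > 0

# The shape is fine, so the pattern matches, even though the values are nonsense.
match = re.search(pattern, 'Davison/May 99, 0000/1721.3')
day = int(match.group(3))
year = int(match.group(4))

print(match.groups())
print(looks_plausible(day, year))
```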
With the tools we've seen so far, we can write a simple function that will extract the date from either of the notebooks we're looking at, and return the year, the month, and the day as strings. First, we test to see if the record has a match for an ISO-formatted date: four digits for the year, dash, two for the month, dash, two for the day. If it does, then we're done: we return those three fields. Otherwise, we test the record to see if we can find the name of a month, one or two digits for the day, and then four digits for the year, within slashes. If so, we return those, permuting the order so that it's year, month, day. If neither pattern matched, then we return None to signal that we can't do anything.
This is a very common way to use regular expressions: rather than trying to combine everything into one enormous pattern, we have one pattern for each valid format of data. We test, and if the test succeeds, we return what we found. If it doesn't, we move on to the next pattern. Working this way is more readable; it's also easier to extend if we have to handle other data formats.
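A function along those lines could be sketched as follows. The name get_date and the exact patterns are assumptions based on the description above, not a copy of the code on the slide.

```python
import re

def get_date(record):
    """Return (year, month, day) as strings, or None if no known format matches."""
    # Format 1: ISO date, e.g. 2009-11-17.
    match = re.search('([0-9]{4})-([0-9]{2})-([0-9]{2})', record)
    if match:
        return match.group(1), match.group(2), match.group(3)

    # Format 2: month name, day, optional comma, year, between slashes.
    match = re.search('/([A-Z][a-z]+) ([0-9]{1,2}),? ([0-9]{4})/', record)
    if match:
        # Reorder month, day, year into year, month, day.
        return match.group(3), match.group(1), match.group(2)

    # Neither format matched.
    return None

print(get_date('Baker 1\t2009-11-17\t1223.0'))
print(get_date('Davison/May 22, 2010/1721.3'))
print(get_date('no date in this line'))
```

Supporting another notebook format would mean adding one more pattern and one more test before the final return.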
Thank you.