Hello, and welcome to the first episode of the Software Carpentry lecture on regular expressions. In order to understand what regular expressions are for, let's have a look at the sort of cleanup job that comes up frequently when dealing with real field data.
A couple of years after the Death Star exploded…
…a hot-shot reporter at the Daily Planet heard that…
…children in the Shire…
…were starting to act a little bit strangely.
Your supervisor…
…sent some of his grad students off to collect some data.
Things didn't go so well for them…
…but their notebooks were recovered and later transcribed.
Your job is to read in 20 or 30 files, each of which contains several hundred measurements of background evil levels, and convert them into a uniform format so that they can be processed further.
Each of the readings has the name of the site where the reading was taken, the date the reading was taken on, and of course the background evil level in millivaders. The problem is, these files aren't formatted in the same way.
Some of them use tabs to separate fields, others use commas.
And the dates are written in several different styles.
Let's take a look at one of those files. As you can see…
…it uses a single tab between each column as a separator.
(The spaces in the site names look similar to tabs, but they're different characters.)
The good news is, the dates are written in the international standard format: four digits for the year, two for the month, and two for the day.
Let's have a look at the second notebook.
Here, they're using slashes as separators.
There don't appear to be spaces in the site names…
…but the month names and day numbers are of varying length. The months are text, and the order is month-day-year rather than year-month-day. Parsing these files using string search would be difficult and error-prone.
The right solution is to use regular expressions.
A regular expression is just a pattern that a string can match. You've probably used these already.
When you say "*.txt" to a computer, you're matching all of the filenames that end in ".txt". The "*" matches any number of characters: it's a pattern.
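If you'd like to try that wildcard idea from inside Python, here's a minimal sketch using the standard fnmatch module. The filenames are made up for illustration; they aren't from the lecture's data.

```python
from fnmatch import fnmatch

# Hypothetical filenames, invented for this example.
filenames = ["notebook-1.txt", "notebook-2.txt", "readme.md", "txt"]

# "*" stands for any run of characters (including none), so "*.txt"
# matches every name that ends in ".txt".
text_files = [name for name in filenames if fnmatch(name, "*.txt")]
print(text_files)
```

Notice that the bare name "txt" is left out: the pattern still requires the literal ".txt" at the end.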
A warning before we go any further: the notation for regular expressions is ugly, even by the standards of programming.
The problem is that we're writing patterns to match strings, but we're writing those patterns as strings…
…using only the symbols that are on the keyboard, instead of inventing new symbols the way mathematicians often do.
Let's start by reading in data from two files, and grabbing the first few lines of each.
When we print out the results in the list readings, we can see that we've got six lines from the first data file, and six from the second. We'll test our regular expressions against this data to see how well or how poorly we're matching different formats of records as we go along.
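Here's a rough sketch of what that reading step might look like. It assumes a helper called first_lines, which is a name invented for this example, and it uses in-memory stand-ins instead of the real notebook files. The records themselves are invented to resemble the two formats described above.

```python
import io
from itertools import islice

def first_lines(stream, count):
    """Return up to `count` lines from an open text stream, newlines stripped."""
    return [line.rstrip("\n") for line in islice(stream, count)]

# In-memory stand-ins for two notebook files; the contents are invented.
notebook_1 = io.StringIO(
    "Baker 1\t2009-11-17\t1223.0\n"
    "Baker 1\t2010-06-24\t1122.7\n"
    "Baker 2\t2009-07-24\t2819.0\n"
)
notebook_2 = io.StringIO(
    "Davison/May 23, 2010/1724.7\n"
    "Pertwee/June 19, 2010/2103.8\n"
    "Davison/July 6, 2010/1731.9\n"
)

# Grab the first two lines of each notebook into one list.
readings = []
for notebook in (notebook_1, notebook_2):
    readings.extend(first_lines(notebook, 2))

for record in readings:
    print(record)
```

With real files, you would pass the result of open(...) to first_lines in place of the StringIO objects.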
Without regular expressions, we can select records that have the month "06" just by saying, if "06" is in the record.
If we want to select data for two months, we have to say, if '06' in the record or if '07' in the record.
We should realize that there's a problem here. If we say, '05' in record, it isn't matching against the month: it's matching against the day.
Right now, we have no easy way to distinguish those two cases. This is a problem we'll come back to later.
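To make that concrete, here's a small sketch of the substring tests. The four records are invented for illustration, in the tab-separated format of the first notebook.

```python
# Sample records invented for illustration: site, date, reading, tab-separated.
readings = [
    "Baker 1\t2009-11-17\t1223.0",
    "Baker 1\t2010-06-24\t1122.7",
    "Baker 2\t2009-07-24\t2819.0",
    "Baker 1\t2011-01-05\t1410.0",
]

# Select records for June or July with plain substring tests.
june_or_july = [r for r in readings if "06" in r or "07" in r]

# The same kind of test for May finds a record, but only because
# the day of the month happens to be "05".
may = [r for r in readings if "05" in r]

print(june_or_july)
print(may)
```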
Let's try using a regular expression to do our matches instead of the simple string-in operator. We import the regular expressions library, and then say, for each record, if regular expression search can find a match for the string '06' in the record, then we'll print it out.
So far, this is matching exactly what "06" in r would match, so it's not much of an improvement.
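A minimal sketch of that loop might look like this, again using records invented for illustration. It also checks that, for a plain string like '06', re.search and the in operator select the same records.

```python
import re

# Sample records invented for illustration.
readings = [
    "Baker 1\t2009-11-17\t1223.0",
    "Baker 1\t2010-06-24\t1122.7",
    "Baker 2\t2009-07-24\t2819.0",
    "Baker 1\t2011-01-05\t1410.0",
]

# Print every record in which the pattern '06' can be found.
for record in readings:
    if re.search("06", record):
        print(record)

# For a pattern with no special characters, both approaches agree.
regex_hits = [r for r in readings if re.search("06", r)]
substring_hits = [r for r in readings if "06" in r]
```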
But look what happens if we want to match a month of "06" or a month of "07". We can combine the two in a single pattern. Let's take a closer look at this code.
The first argument to re.search is the pattern we are searching for.
That pattern is written down as a string.
The second argument is the data we are searching in.
It's quite common to get these reversed and put the data first and the pattern second. Python usually won't report an error when that happens: the search just quietly fails to match. This can be quite hard to track down, so please be careful.
The vertical bar in the pattern means "or".
We're telling the regular expression engine that we want to match either the text on the left of the vertical bar, or the text on the right, but we're going to do the match in a single search.
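Here's a short sketch of the combined search, with the pattern first and the data second. The records are invented for illustration. The last few lines show what happens when the two arguments are swapped by mistake.

```python
import re

# Sample records invented for illustration.
readings = [
    "Baker 1\t2009-11-17\t1223.0",
    "Baker 1\t2010-06-24\t1122.7",
    "Baker 2\t2009-07-24\t2819.0",
    "Baker 1\t2011-01-05\t1410.0",
]

# Pattern first, data second: match '06' or '07' in a single search.
pattern = "06|07"
matches = [record for record in readings if re.search(pattern, record)]
print(matches)

# Arguments reversed by mistake: the record is treated as the pattern
# and '06|07' as the data. No error is raised; nothing matches.
reversed_result = re.search(readings[1], pattern)
print(reversed_result)
```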
We're going to be trying to match a lot of patterns against our data, so let's write a function that will tell us which records match a particular pattern. Our function show_matches takes a pattern and a list of strings, and then for each of those strings, if the pattern matches, we print out two stars as a marker, otherwise we just print out some blanks.
Let's test our function right away. If we try to match '06|07' against the data that we read in earlier, it seems to be doing the right thing: we've got stars beside the two records that have month '06' or month '07'.
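One way to write a function like that is sketched below. It's reconstructed from the description, so it may differ in detail from the code on the slide, and the records are invented for illustration.

```python
import re

def show_matches(pattern, strings):
    """Print each string, marked with '**' if the pattern matches it."""
    for s in strings:
        # Two stars for a match, two blanks otherwise, so the records line up.
        marker = "**" if re.search(pattern, s) else "  "
        print(marker, s)

# Sample records invented for illustration.
readings = [
    "Baker 1\t2009-11-17\t1223.0",
    "Baker 1\t2010-06-24\t1122.7",
    "Baker 2\t2009-07-24\t2819.0",
    "Baker 1\t2011-01-05\t1410.0",
]

show_matches("06|07", readings)
```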
But why doesn't this work? If we match '06|7', it seems to be matching a lot of things that don't have the month '06' or '07'.
Think back to mathematics. The expression ab+c means, "a times b plus c."
Multiplication is implied simply by putting a and b next to each other, and it has higher precedence than addition: we always do multiplication before we do addition.
If we want to force the other meaning, we write, "a times (b plus c)."
The same thing happens with regular expressions. If we say '06|7', it means exactly that: either '06' or the digit '7'.
And if you look back at our data, there are a lot of 7's in our file.
If we want to match '06' or '07', we can parenthesize as shown here: '0(6|7)'.
Having said that, the expression '06|07' is probably more readable to most people anyway.
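To see the difference in precedence, here's a small sketch that runs all three patterns over the same records. The records are invented for illustration.

```python
import re

# Sample records invented for illustration.
readings = [
    "Baker 1\t2009-11-17\t1223.0",
    "Baker 1\t2010-06-24\t1122.7",
    "Baker 2\t2009-07-24\t2819.0",
    "Baker 1\t2011-01-05\t1410.0",
]

# '06|7' means: '06', or a lone '7' anywhere in the record.
loose = [r for r in readings if re.search("06|7", r)]

# Parentheses limit the alternation to the second digit.
grouped = [r for r in readings if re.search("0(6|7)", r)]

# Spelling out both alternatives selects the same records as grouping.
spelled_out = [r for r in readings if re.search("06|07", r)]

print(len(loose), len(grouped), len(spelled_out))
```

The first record is picked up by '06|7' only because its day, 17, contains a 7.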
Let's go back to our function and our data. If we do matches for '05', then as we said earlier, we're pulling up records that have '05' as the day, rather than as the month. We can force our match to do the right thing by taking advantage of context.
If we want to match a month…
…there should be a dash '-' before and after the numbers. So if we try to match '-05-'…
…we show no matches, which is the correct answer: we don't have any readings in this sample of our data set for May.
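Here's a sketch of that comparison. The records are invented for illustration, and one of them has '05' as its day.

```python
import re

# Sample records invented for illustration.
readings = [
    "Baker 1\t2009-11-17\t1223.0",
    "Baker 1\t2010-06-24\t1122.7",
    "Baker 2\t2009-07-24\t2819.0",
    "Baker 1\t2011-01-05\t1410.0",
]

# '05' alone matches the record whose day is the 5th.
may_anywhere = [r for r in readings if re.search("05", r)]

# Requiring a dash on both sides only accepts '05' in the month position.
may_as_month = [r for r in readings if re.search("-05-", r)]

print(may_anywhere)
print(may_as_month)
```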
Matching is all well and good, but what we really want to do is extract data: we want to pull the year, the month, and the day out of our data set so that we can reformat them.
When a regular expression matches a piece of text, the regular expression library remembers what matched against every parenthesized sub-expression. Parentheses aren't just used for grouping: they're also used to remember things.
Here's an example.
The pattern to match years, '(2009|2010|2011)', has been put in parentheses. This will match 2009, or 2010, or 2011, but it will remember which of those it matched.
The second string is just the first record from our data.
(If you recall, '\t' represents a tab.)
When re.search is called, it returns a match object if a match is found.
If no match is found it returns None, meaning, "There's no useful information."
The expression match.group returns the text that matched a particular parenthesized sub-expression. For example, match.group(1) returns whatever matched against the pattern inside the first pair of parentheses counting from the left.
It's important to note that the first sub-expression is extracted with match.group(1), the second with 2, and so forth. When we're looking at groups, we count from 1 to N, rather than from 0 to N-1, as is normal in the rest of Python.
The reason for this is that match.group(0) returns all of the text that the entire pattern matched.
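A minimal sketch of this looks like the following. The first record uses the date the lecture mentions, with made-up values for the site and the reading; the second record is invented to show the no-match case.

```python
import re

# Site name and reading are made up; '\t' is a tab.
record = "Baker 1\t2009-11-17\t1223.0"

match = re.search("(2009|2010|2011)", record)
if match:
    # group(1) is the text matched by the first parenthesized sub-expression.
    year = match.group(1)
    # group(0) is the text matched by the whole pattern.
    whole = match.group(0)
    print(year, whole)

# A record from a year the pattern doesn't list gives None instead.
no_match = re.search("(2009|2010|2011)", "Baker 1\t1999-01-01\t0.0")
print(no_match)
```

Here the whole pattern sits inside one pair of parentheses, so group(0) and group(1) hold the same text.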
What if we want to match the month as well as the year? A regular expression to match legal months would be '01' or '02' or '03' and so forth all the way up to '12'.
The expression to match the day would be three times longer. This is pretty cumbersome: it's hard to type, and more importantly, hard to read.
In a regular expression, you can use a dot (the period character '.') to match any single character.
So the expression '....-..-..' matches any four characters (exactly four characters), followed by a dash, followed by two more characters, followed by another dash, followed by two more characters.
If we put each set of dots in parentheses, we should get out three groups recording the year, month, and day every time there's a successful match.
Let's test that out. Here, we're calling re.search with the pattern we just described and the first record from our data. When we print out match.group(1), 2, and 3, sure enough, we get '2009', '11', and '17', just as we wanted.
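Here's a sketch of that call. The date is the one from the lecture's first record; the site name and reading are made up for illustration.

```python
import re

# Site name and reading are made up; the fields are separated by tabs.
record = "Baker 1\t2009-11-17\t1223.0"

# Four characters, a dash, two characters, a dash, two characters,
# with each run of dots in its own group.
match = re.search("(....)-(..)-(..)", record)

year, month, day = match.group(1), match.group(2), match.group(3)
print(year, month, day)
```

Because the dot matches any character at all, this pattern relies on the two dashes to find the date within the record.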
Try doing that with substring searches.
To recapitulate, letters and digits in a pattern match against themselves: the character 'A' in a pattern matches the character 'A' in the data, and so forth.
Vertical bar '|' means OR…
…dot '.' matches any single character…
…we use parentheses '()' to enforce grouping and to remember what matched…
…re.search returns a match object if the pattern matches, and None if there isn't a match…
…and if a match was found, match.group(k) is the text that matched group k.
More generally, stepping back from the details of regular expressions…
…the right way to build up patterns is to start with something simple that matches part of the data you're working with.
Test it against your data, but also test that it doesn't match things that it shouldn't, because it can be very hard to track down false positives.
Once you've done that, extend it piece by piece to handle other cases.
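As a rough sketch of that habit, you can keep two small lists, one of strings the pattern should match and one of strings it shouldn't, and check both every time you change the pattern. The pattern and the records below are invented for illustration.

```python
import re

# Pattern under test: June or July in the month position.
pattern = "-0(6|7)-"

# Records invented for illustration.
should_match = [
    "Baker 1\t2010-06-24\t1122.7",
    "Baker 2\t2009-07-24\t2819.0",
]
should_not_match = [
    "Baker 1\t2009-11-17\t1223.0",  # a 7 in the day, not the month
    "Baker 1\t2011-01-06\t1410.0",  # '06' as the day, not the month
]

# Records the pattern ought to catch but doesn't.
missed = [s for s in should_match if not re.search(pattern, s)]

# Records the pattern catches but shouldn't.
false_positives = [s for s in should_not_match if re.search(pattern, s)]

print("missed:", missed)
print("false positives:", false_positives)
```

If either list comes back non-empty, the pattern needs more work before you extend it.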
We'll take a look at how to do more of this in the next episode.
Thank you.