Hello, and welcome to the third episode of the Software Carpentry lecture on regular expressions. In this episode, we'll take a look at the mechanics of regular expressions, i.e., what the computer actually does when it's matching a regular expression against a piece of text. You don't need to know this in order to do a lot with regular expressions, but if you are composing or debugging complex REs, it helps to understand what's going on behind the scenes.

If you recall, we have some notebooks with recordings of background evil levels in millivaders, taken in the Shire a couple of years after the Death Star exploded. The data in these notebooks is in different formats…

…and we're using regular expressions to parse them.

In the last episode, one of the regular expressions we came up with was this complicated beast: `'(.+)/([A-Z][a-z]+) ([0-9]{1,2}),? ([0-9]{4})/(.+)'`

It matches one or more characters followed by a literal slash '/', followed by a single upper-case character and one or more lower-case characters, followed by a space, then one or two digits, an optional comma, another space, exactly four digits, another literal slash, and one or more characters at the end. That's a pretty complex match, and you might be asking yourself…

How does it do it?

The answer is that regular expressions are implemented using *finite state machines*.

Here's a very simple finite state machine that matches exactly, and only, the single character "lower case a".

We start matching with the incoming arrow on the left. It takes us to the first state in our finite state machine.

The only way to get from there to the second state is to match exactly one character, because the arc between state 1 and state 2 is labelled with the character 'a'.

The second state is marked with a dot in the middle, which means it's an end state. We must be in one of these states at the end of our match in order for the match to be valid.

So now we have a finite state machine that matches the very simple regular expression 'a'. Let's see if we can do something a little more interesting.

Here's a finite state machine that matches one 'a' followed by zero or more 'a's.

The first 'a' gets us from an initial state to an end state. But we don't have to stop there: the curved arc at the top allows us to match another 'a', and brings us back to the same state.

We can then match another 'a', and another 'a', and so on indefinitely.

Note that we don't have to stop in the end state the first time we reach it: we just have to be in an end state when we run out of input.

What is the pattern? It's 'a+': one 'a', followed by zero or more other 'a's, is the same as one or more 'a's.

Here's a finite state machine that matches against the letter 'a' or nothing.

The top arc isn't marked, so that transition is free. We can go from the first state to the second state without consuming any of our input.

This is "a or nothing"…

…which is the same as 'a?', i.e., the optional character 'a'.

So now we have three patterns in our library.

This regular expression looks like the one that matches one or more 'a's…

…except there's an extra arc to get us from the first state to the second without consuming any input.

So this will match 'a*', i.e., nothing at all (taking that free transition from the first state to the second) or one or more 'a's.

As you can see, we're building up a library of more and more complex patterns. The tools we've seen so far are enough to implement most of the regular expressions we've encountered in the previous two episodes.

Take a look, for example, at this finite state machine. What is the corresponding regular expression? We can either take the top route or the bottom. The top route is 'a+'; the bottom route is a 'b', followed by either a 'c' or a 'd'.

So this is `'a+|(b(c|d))'`

. An input string that matches any of these paths will leave us in that final end state.

The most important thing about a finite state machine is that the action it takes at a node depends on only…

…the arcs out of that node…

…and the characters in the target data.

Finite state machines have no memory: they do *not* keep track of how they got into a state. All decision-making is local to that state.

This means that there are many kinds of data that regular expressions *cannot* match. For example, it is impossible to write a regular expression to check if nested parentheses match.

If you want to see whether '(((…)))' is balanced, you need some kind of memory…

…or at least a counter, and there isn't any memory in a finite state machine.

Similarly, if you want to check whether a word contains each vowel only once, the only way to do it is to write a regular expression with 120 clauses, because you need to check for each possible permutation of 'aeiou' independently. You cannot write a regular expression that matches an arbitrary vowel, and then subtracts *that* vowel from the set of vowels yet to be matched, because once again, that would require some kind of memory, and finite state machines don't have any.

If regular expressions are limited this way, why do we use them? There are two answers.

The first is, they're really fast.

After you do some pre-calculation—essentially, after you compile the regular expression to create a finite state machine—a regular expression can be matched against input by looking at each input character only once. That means the execution time only grows with the size of the data. The time required for many other matching techniques grows much faster than the size of the input data. So, while regular expressions are limited, the things they do, they do very, very well.

Also, regular expressions are readable. Going back to our previous example, you might not think so,

But imagine writing out a piece of code that matched that same input. The regular expression is much easier to read, and probably much more efficient, than its procedural equivalent.

Regular expressions can do a lot more than the simple things we've seen so far.

We'll explore their power in the next episode.