Regular Expressions in Python

Regular Expressions in Python

You may already know how to read files and search for text by line number, word number, column number or by using find to search for specific text (if not, take a look here). This is all great, but it is not very flexible.

For example, imagine searching for all surnames and titles from the below text…

Dear Mr. Johnson, 
  Dear Miss. Jameson,
Dear   Ms.   Jackson, 
Dear Mrs.    Peterson, 
  Dear    Mr. Sampson
Dear Dr.Johanson,
Dear Rev Richardson,

How would you go about trying to write a program that can do this?

Searching and extracting text from files is remarkably complicated. Fortunately, computer scientists have solved this problem. The solution has been adopted by nearly all programming languages. The solution is to use what are called regular expressions.

Regular Expressions in Python

Regular expressions can look scary, but are pretty simple once you understand the rules. The syntax for regular expressions appeared and was standardised in the Perl language, and now nearly all programming languages support “Perl Compatible Regular Expressions” (PCRE). Python provides the re and regexp modules, that support most of PCRE. Let’s take a look using re. Open a new ipython session and type;

import re
help(re)

This will show you the help for the re module, which should look something like this;

Help on module re:

NAME
    re - Support for regular expressions (RE).

FILE
    /path/to/re.py

MODULE DOCS
    http://docs.python.org/library/re

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl.  It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.

Regular expressions can be used for three things; Searching, pattern extraction and replacing.

Regular Expression Searching

Searching is when you want to look some text in a file. Here is the text for Hamlet’s soliloquy. Copy and paste this into a text file called textfile.

To be, or not to be, that is the question:
Whether 'tis Nobler in the mind to suffer
The Slings and Arrows of outrageous Fortune,
Or to take Arms against a Sea of troubles,
And by opposing end them: to die, to sleep
No more; and by a sleep, to say we end
The Heart-ache, and the thousand Natural shocks
That Flesh is heir to? 'Tis a consummation
Devoutly to be wished. To die, to sleep,
To sleep, perchance to Dream; Aye, there's the rub,
For in that sleep of death, what dreams may come,
When we have shuffled off this mortal coil,
Must give us pause. There's the respect
That makes Calamity of so long life:
For who would bear the Whips and Scorns of time,
The Oppressor's wrong, the proud man's Contumely,
The pangs of despised Love, the Law's delay,
The insolence of Office, and the Spurns
That patient merit of the unworthy takes,
When he himself might his Quietus make
With a bare Bodkin? Who would Fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscovered Country, from whose bourn
No Traveller returns, Puzzles the will,
And makes us rather bear those ills we have,
Than fly to others that we know not of.
Thus Conscience does make Cowards of us all,
And thus the Native hue of Resolution
Is sicklied o'er, with the pale cast of Thought,
And enterprises of great pitch and moment,
With this regard their Currents turn awry,
And lose the name of Action. Soft you now,
The fair Ophelia? Nymph, in thy Orisons
Be all my sins remembered.

Now, open up a new ipython session and type;

from __future__ import print_function
import re
lines = open("textfile", "r").readlines()
for line in lines:
    if re.search(r"dream", line):
        print(line, end="")

This will search for the lines that contain the word dream and will print them, e.g.

For in that sleep of death, what dreams may come,

re.search is used to search, in this case for the string dream in string line. If the text is found, then re.search returns True, else it returns False. Note that we put an r in front of the search string. This is to tell Python that this is a raw string which should not be escaped (more about this later..)

The above was a simple, case-sensitive regular expression search. To perform a case-insensitive search, you use re.IGNORECASE, e.g. type;

for line in lines:
    if re.search(r"dream", line, re.IGNORECASE):
        print(line)

and you will see;

To sleep, perchance to Dream; Aye, there's the rub,
For in that sleep of death, what dreams may come,

So far, so the same as line.find().. Regular expressions are powerful as they provide a sub-language to control the search. Let’s say you want to find all lines containing the as a word. You can do that using the special character \s, which means space, e.g.

for line in lines:
    if re.search(r"\sthe\s", line):
        print(line,end="")

will print

To be, or not to be, that is the question:
Whether 'tis Nobler in the mind to suffer
The Heart-ache, and the thousand Natural shocks
To sleep, perchance to Dream; Aye, there's the rub,
Must give us pause. There's the respect
For who would bear the Whips and Scorns of time,
The Oppressor's wrong, the proud man's Contumely,
The pangs of despised Love, the Law’s delay,
The insolence of Office, and the Spurns
That patient merit of the unworthy takes,
But that the dread of something after death,
No Traveller returns, Puzzles the will,
And thus the Native hue of Resolution
Is sicklied o'er, with the pale cast of Thought,
And lose the name of Action. Soft you now,

Now, let’s search for all lines that contain the where the the is part of a word. We can do this by using \w which means any non-space character, e.g. type;

for line in lines:
    if re.search(r"the\w", line):
        print(line,end="")

and you will see;

Whether 'tis Nobler in the mind to suffer
And by opposing end them: to die, to sleep
To sleep, perchance to Dream; Aye, there's the rub,
And makes us rather bear those ills we have,
Than fly to others that we know not of.
With this regard their Currents turn awry,

And combining these, together, find lines containing words that start with the, type;

for line in lines:
   if re.search(r"\sthe\w", line):
       print(line,end="")

and you will see;

And by opposing end them: to die, to sleep
To sleep, perchance to Dream; Aye, there's the rub,
With this regard their Currents turn awry,

There are a lot of special characters. They are

  • \d Match any digit (number)
  • \s Match a space
  • \w Match any word character (alphanumeric and “_”)
  • \S Match any non-whitespace character
  • \D Match any non-digit character
  • . Match any character
  • \t Match a tab
  • \n Match a newline

Note that the backslash is a special character which is normally removed (escaped) in Python. The r in front of the string tells Python not to interpret, escape or remove the backslash. You must include the r or else your regular expressions may not compile.

As well as matching characters, you can match collections of characters, to match th followed by a, i or y, you would use square brackets, and need to type;

for line in lines:
    if re.search(r"th[aiy]", line):
        print(line,end="")

You should see;

To be, or not to be, that is the question:
For in that sleep of death, what dreams may come,
When we have shuffled off this mortal coil,
That patient merit of the unworthy takes,
But that the dread of something after death,
Than fly to others that we know not of.
With this regard their Currents turn awry,
The fair Ophelia? Nymph, in thy Orisons

You can control which characters are matched in the square brackets using;

  • [abc] Match a, b or c
  • [a-z] Match any character between a to z
  • [A-Z] Match any character between A to Z
  • [a-zA-Z] Match any character from a to z and A to Z (any letter)
  • [0-9] Match any digit
  • [02468] Match any even digit
  • [^0-9] Matches NOT digits (^ means NOT)

You can also use repetition in your matching.

  • * Match 0 or more times, e.g. \w* means match 0 or more word characters
  • + Match 1 or more times, e.g. \w+ means match 1 or more word characters
  • ? Match 0 or 1 times, e.g. \w? means match 0 or 1 word characters
  • {n} Match exactly n times, e.g. \w{3} means match exactly 3 word characters
  • {n,} Match at least n times, e.g. \w{5,} means match at least 5 word characters
  • {m,n} Match between m and n times, e.g. \w{5,7} means match 5-7 word characters

We can use this to find all lines that contain words with 10-12 characters, by typing;

for line in lines:
    if re.search(r"\w{10,12}", line):
        print(line,end="")

You should see;

The Slings and Arrows of outrageous Fortune,
That Flesh is heir to? 'Tis a consummation
The undiscovered Country, from whose bourn
Thus Conscience does make Cowards of us all,
And thus the Native hue of Resolution
And enterprises of great pitch and moment,
Be all my sins remembered.

Finally, flags can be attached to the match. To match only at the beginning of the line use a carat, ^, e.g. type;

for line in lines:
    if re.search(r"^the\s", line, re.IGNORECASE):
        print(line,end="")

will match the only at the beginning of the string, e.g. resulting in;

The Slings and Arrows of outrageous Fortune,
The Heart-ache, and the thousand Natural shocks
The Oppressor's wrong, the proud man's Contumely,
The pangs of despised Love, the Law’s delay,
The insolence of Office, and the Spurns
The undiscovered Country, from whose bourn
The fair Ophelia? Nymph, in thy Orisons

To match at the end of the line, using a dollar, $, e.g.

for line in lines:
    if re.search(r"on$", line):
        print(line,end="")

matches all lines that end in on, e.g.

That Flesh is heir to? 'Tis a consummation
And thus the Native hue of Resolution

Pattern extraction

Searching is great, but substring matching is the real power of regular expressions. You can group parts of the regular expression to let you extract the matching part of the string. You do this using round brackets. Try typing;

line = lines[0]
print(line)

This has put the first line of the text into the variable line, resulting in

To be, or not to be, that is the question:

being printed to the screen. Now type;

m = re.search(r"the\s(\w+)", line)

This matches the followed by a space, followed by 1 or more word characters. The returned object, m, contains information about the match. We can query this object by typing;

print(m.group(0))

This prints;

the question

m.group(0) returns the entire matched substring, in this case the question. However, we put \w+ into parentheses, and so this part is available as a sub-match, in m.group(1)

print(m.group(1))

will print question.

If we have added extra groups, these would be available as m.group(2), m.group(3) etc., e.g. try typing;

m = re.search(r"to (\w+), or not (\w+) (\w+)", line, re.IGNORECASE)
print(m.group(0))

to get To be, or not to be. Now look at the individual matches, e.g. type

print(m.group(1))

to get be, then type

print(m.group(2))

to get to, then finally type

print(m.group(3))

to get the last be.

For example, we could use this to extract all of the words that follow the in the text, e.g. try typing;

for line in lines:
    m = re.search(r"\sthe\s(\w+)", line, re.IGNORECASE)
    if m:
        print(line,end="")
        print(m.group(1))

and you should see;

To be, or not to be, that is the question:
question
Whether 'tis Nobler in the mind to suffer
mind
The Heart-ache, and the thousand Natural shocks
thousand
To sleep, perchance to Dream; Aye, there's the rub,
rub
Must give us pause. There's the respect
respect
For who would bear the Whips and Scorns of time,
Whips
The Oppressor's wrong, the proud man's Contumely,
proud
The pangs of despised Love, the Law’s delay,
Law
The insolence of Office, and the Spurns
Spurns
That patient merit of the unworthy takes,
unworthy
But that the dread of something after death,
dread

Pattern Replacing

As well as using regular expressions for searching for text, you can also use it to replace text. You do this using re.sub. Type;

line = re.sub(r"be", "code", line)
print(line)

You should now have printed;

To code, or not to code, that is the question:

As you can see, every match is replaced by code. We can replace n matches by passing that in as an extra argument. Try this by typing;

line = lines[0]
line = re.sub(r"be", "code", line, 1)
print(line)

and you should see;

To code, or not to be, that is the question:

In this case, we only replace 1 time, hence only the first match is replaced.

We can add some logic to the replacement, e.g. replace be or question with code. Try this by typing;

line = lines[0]
line = re.sub(r"be|question", "code", line)
print(line)

and you will see

To code, or not to code, that is the code:

If you want to do a case-insensitive match, you need to compile the first string, e.g. type

line = lines[0]
line = re.sub( re.compile(r"to be", re.IGNORECASE), "ice-cream", line )
print(line)

This should print;

ice-cream, or not ice-cream, that is the question:

You can also nest re.sub calls together if you want to perform multiple substitutions. Try this by typing;

line = lines[0]
line = re.sub( re.compile(r"to", re.IGNORECASE), "go", re.sub(r"be", "home", line) )
print(line)

and you will get printed;

go home, or not go home, that is the question:

Health Warning

Regular expressions are very powerful. You can use them to search for specific output from your programs and to do powerful text manipulation. However, as you have seen, they are very “write-only”. Extremely difficult to understand for non-experts, and complex regular expressions can be difficult even for your future-self to understand (i.e. “what was I thinking when I wrote that last year? What does it mean and what does it do?”). You should ALWAYS comment your regular expressions and explain in English exactly what you intended to match when you wrote them. Once you have memorised the rules, you will find regular expressions are very easy to read, use and are extremely powerful. However, without comments, they will be completely unintelligable to everyone else who looks at or relies on your code.


Exercise

Matching

Here is the list of surnames from above. Copy and paste these surnames into a file called greetings.txt.

Dear Mr. Johnson, 
  Dear Miss. Jameson,
Dear   Ms.   Jackson, 
Dear Mrs.    Peterson, 
  Dear    Mr. Sampson
Dear Dr.Johanson,
Dear Rev Richardson,

Can you write a regular expression that will match each line, extracting the title and surname for each person?

Note that you can match the . character using \\., e.g. to match Dr. use re.search(r"Dr\\.", line)

If you get stuck, an example output is here

Replacing

Find all words that follow “the” in “textfile” (the Hamlet soliloquy) and replace them with “banana”.

If you get stuck, take a look at the example output here

Previous Next