Chapter 9 - Text Pattern Matching with Regular Expressions (Python)
Here's a concise walkthrough of the main ideas in Chapter 9, each with a small example.
Manually finding patterns vs regex
You can detect patterns (like US phone numbers) with plain string logic, but it's verbose and rigid, while regex does the same with a short pattern.
Example (manual phone check):
def is_phone_number(text):
    if len(text) != 12:
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True
Regex equivalent pattern: r'\d{3}-\d{3}-\d{4}'.
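For a rough side-by-side comparison, the manual check can be approximated in a few lines. This is a minimal sketch, assuming fullmatch() is the closest counterpart because the manual version only accepts strings that are exactly a phone number; the names PHONE_RE and is_phone_number_re are made up for illustration.

```python
import re

# Hypothetical regex counterpart of is_phone_number(); names are illustrative.
PHONE_RE = re.compile(r'\d{3}-\d{3}-\d{4}')

def is_phone_number_re(text):
    # fullmatch() succeeds only if the entire string fits the pattern,
    # which mirrors the length check in the manual version.
    return PHONE_RE.fullmatch(text) is not None

print(is_phone_number_re('415-555-4242'))       # True
print(is_phone_number_re('Moshi moshi'))        # False
print(is_phone_number_re('Call 415-555-4242'))  # False (extra text around the number)
```

To find a number inside a longer string instead, search() is the better fit.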
Basic regex workflow with re
Steps: import re, compile a pattern, call search(), then group() on the match.
Example:
import re
phone_re = re.compile(r'\d{3}-\d{3}-\d{4}')
mo = phone_re.search('My number is 415-555-4242.')
print(mo.group()) # '415-555-4242'
Groups with parentheses
Parentheses create groups so you can pull out parts of a match (group 1, group 2, etc.).
Example:
import re
phone_re = re.compile(r'(\d{3})-(\d{3}-\d{4})')
mo = phone_re.search('My number is 415-555-4242.')
print(mo.group(1)) # '415'
print(mo.group(2)) # '555-4242'
print(mo.groups()) # ('415', '555-4242')
You can unpack groups into variables:
area, rest = mo.groups()
Escaping special characters
Characters like ()[]{}.+*?^$|\ have special meanings in regex; prefix with \ to match them literally.
Example (match (415) 555-4242):
import re
pattern = re.compile(r'(\(\d{3}\)) (\d{3}-\d{4})')
mo = pattern.search('My phone number is (415) 555-4242.')
print(mo.group(1)) # '(415)'
print(mo.group(2)) # '555-4242'
Alternation with |
| means "this or that", and you can combine it with groups for shared prefixes.
Example:
import re
pet_re = re.compile(r'Cat(erpillar|astrophe|ch|egory)')
mo = pet_re.search('Catch me if you can.')
print(mo.group()) # 'Catch'
print(mo.group(1)) # 'ch'
search() vs findall()
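search() returns the first match (or None). findall() returns all matches as a list of strings; if the pattern has groups, each item holds the group text instead of the whole match (a tuple when there are two or more groups).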
Example (no groups):
import re
pattern = re.compile(r'\d{3}-\d{3}-\d{4}')
print(pattern.findall('Cell: 415-555-9999 Work: 212-555-0000'))
# ['415-555-9999', '212-555-0000']
With groups:
pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
print(pattern.findall('Cell: 415-555-9999 Work: 212-555-0000'))
# [('415', '555', '9999'), ('212', '555', '0000')]
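A pattern with exactly one group is a case worth checking separately, because findall() then returns plain strings rather than tuples. A small sketch (the variable names are illustrative):

```python
import re

# One capturing group: findall() returns that group's text,
# not the whole match and not a tuple.
area_re = re.compile(r'(\d{3})-\d{3}-\d{4}')
areas = area_re.findall('Cell: 415-555-9999 Work: 212-555-0000')
print(areas)  # ['415', '212']
```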
Character classes [...] and negated [^...]
[...] matches any one character from the set; [^...] matches any character not in the set.
Example (vowels vs non-vowels):
import re
vowel_re = re.compile(r'[aeiouAEIOU]')
print(vowel_re.findall('RoboCop eats BABY FOOD.'))
# ['o', 'o', 'o', 'e', 'a', 'A', 'O', 'O']
consonant_re = re.compile(r'[^aeiouAEIOU]')
print(consonant_re.findall('RoboCop eats BABY FOOD.'))
# includes consonants, spaces, punctuation
Shorthand character classes \d \w \s and opposites
Useful built-ins:
- \d digits, \D non-digits
- \w letters/digits/underscore, \W non-\w
- \s whitespace, \S non-whitespace
Example:
import re
pattern = re.compile(r'\d+\s\w+')
print(pattern.findall('12 drummers, 11 pipers, 10 lords'))
# ['12 drummers', '11 pipers', '10 lords']
Dot . wildcard
. matches any character except newline.
Example:
import re
at_re = re.compile(r'.at')
print(at_re.findall('The cat in the hat sat on the flat mat.'))
# ['cat', 'hat', 'sat', 'lat', 'mat']
To match a literal dot, escape it with a backslash: \.
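The difference between the wildcard and an escaped dot is easy to see side by side. A short sketch with made-up sample text:

```python
import re

# Unescaped dot: any character may sit between the two digits.
loose = re.compile(r'\d.\d')
# Escaped dot: only a literal period is accepted.
strict = re.compile(r'\d\.\d')

text = 'v3.7 and 3x7'
print(loose.findall(text))   # ['3.7', '3x7']
print(strict.findall(text))  # ['3.7']
```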
Quantifiers: ?, *, +, {m,n}
Quantifiers say how many of the preceding piece to match.
- ? : 0 or 1 (optional)
- * : 0 or more
- + : 1 or more
- {m} : exactly m
- {m,n} : between m and n (inclusive); {m,} and {,n} are the open-ended forms
Examples:
import re
opt_ex = re.compile(r'42!?') # '42' or '42!'
star_ex = re.compile(r'Eggs( and spam)*') # 'Eggs', 'Eggs and spam', 'Eggs and spam and spam', ...
plus_ex = re.compile(r'(Ha)+') # 'Ha', 'HaHa', ...
count_ex = re.compile(r'(Ha){3,5}') # 3 to 5 'Ha'
Use parentheses if you want the quantifier to apply to a whole group, not just one char.
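To see what each quantifier actually matches, the patterns can be run against a few sample strings. This is a sketch with invented test text; note the leading space inside the star_ex group, which is what lets it match 'Eggs and spam'.

```python
import re

opt_ex = re.compile(r'42!?')               # optional '!'
star_ex = re.compile(r'Eggs( and spam)*')  # zero or more ' and spam'
plus_ex = re.compile(r'(Ha)+')             # one or more 'Ha'
count_ex = re.compile(r'(Ha){3,5}')        # three to five 'Ha'

opt_found = opt_ex.findall('42 42! 42!!')
star_found = star_ex.search('Eggs and spam and spam, please').group()
plus_found = plus_ex.search('HaHaHa!').group()
too_few = count_ex.search('HaHa')

print(opt_found)   # ['42', '42!', '42!']
print(star_found)  # 'Eggs and spam and spam'
print(plus_found)  # 'HaHaHa'
print(too_few)     # None (only two 'Ha', three are required)
```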
Greedy vs non-greedy (? after quantifier)
*, +, and {m,n} are greedy: they match as much as possible. Adding ? after the quantifier makes it lazy, so it matches as little as possible.
Example:
import re
greedy = re.compile(r'(Ha){3,5}')
print(greedy.search('HaHaHaHaHa').group()) # 'HaHaHaHaHa'
lazy = re.compile(r'(Ha){3,5}?')
print(lazy.search('HaHaHaHaHa').group()) # 'HaHaHa'
Similarly, .* is greedy, .*? is lazy.
.* and .*? (match "anything")
.* means "any characters except newline, zero or more times"; put it in a group to capture whatever text appears at that position.
Example (First/Last name):
import re
name_re = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = name_re.search('First Name: Al Last Name: Sweigart')
print(mo.group(1)) # 'Al'
print(mo.group(2)) # 'Sweigart'
Example of greedy vs lazy tags:
lazy_tag = re.compile(r'<.*?>')
greedy_tag = re.compile(r'<.*>')
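Running both tag patterns over the same text shows the difference. A brief sketch, with a sample string chosen for illustration:

```python
import re

lazy_tag = re.compile(r'<.*?>')
greedy_tag = re.compile(r'<.*>')

text = '<b>bold</b> and <i>italic</i>'

# Lazy: each match stops at the first '>' it can reach.
print(lazy_tag.findall(text))    # ['<b>', '</b>', '<i>', '</i>']

# Greedy: one match runs from the first '<' to the last '>'.
print(greedy_tag.findall(text))  # ['<b>bold</b> and <i>italic</i>']
```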
Matching newlines with re.DOTALL
Normally . does not match \n. Pass re.DOTALL to let . match newlines too.
Example:
import re
no_nl = re.compile('.*')
print(no_nl.search('Line1\nLine2').group()) # 'Line1'
with_nl = re.compile('.*', re.DOTALL)
print(with_nl.search('Line1\nLine2').group()) # whole string
Anchors: ^, $, word boundaries \b / \B
- ^ : match at start of string
- $ : match at end of string
- ^...$ : whole string must match
- \b : word boundary; \B : not a word boundary
Examples:
import re
begins_hello = re.compile(r'^Hello')
ends_digit = re.compile(r'\d$')
all_digits = re.compile(r'^\d+$')
word_re = re.compile(r'\bcat.*?\b')
print(word_re.findall('The cat found a catapult catalog in the catacombs.'))
# ['cat', 'catapult', 'catalog', 'catacombs']
middle_re = re.compile(r'\Bcat\B')
print(middle_re.findall('certificate')) # ['cat']
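The three anchored patterns above are compiled but never run, so here is a small sketch that exercises them; the sample strings are invented for illustration.

```python
import re

begins_hello = re.compile(r'^Hello')
ends_digit = re.compile(r'\d$')
all_digits = re.compile(r'^\d+$')

print(bool(begins_hello.search('Hello, world')))   # True
print(bool(begins_hello.search('He said Hello')))  # False (not at the start)
print(bool(ends_digit.search('Room 42')))          # True
print(bool(all_digits.search('12345')))            # True
print(bool(all_digits.search('123 45')))           # False (space is not a digit)
```

One detail to keep in mind: $ also matches just before a trailing newline, so '12345\n' still passes all_digits. When that matters, fullmatch() is the stricter choice.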
Case-insensitive matching
Pass re.IGNORECASE (or re.I) to ignore case.
import re
regex = re.compile(r'hello', re.I)
print(regex.search('HELLO World').group()) # HELLO
Substitution with sub()
Replace matches with a new string.
import re
censored = re.compile(r'Agent \w+')
result = censored.sub('REDACTED', 'Agent Alice gave the documents to Agent Bob.')
print(result) # REDACTED gave the documents to REDACTED.
Verbose mode with re.VERBOSE
Spread a complex regex across multiple lines with comments.
import re
phone_regex = re.compile(r"""
(\d{3}|\(\d{3}\)) # area code (with or without parens)
(\s|-|\.)? # optional separator
\d{3} # first 3 digits
(\s|-|\.) # separator
\d{4} # last 4 digits
""", re.VERBOSE)
Quick mental model
- Regex = mini-language for patterns over text (phone numbers, emails, etc.).
- Build a pattern (re.compile), then use search()/findall() and group/quantifier tools to extract exactly what you need.
Overall idea of the chapter
Chapter 9 shows how regular expressions give you powerful pattern-matching tools for finding, extracting, and replacing text. They look complex at first, but mastering a few building blocks (character classes, quantifiers, groups, anchors) covers most real-world use cases.