Chapter 9 - Text Pattern Matching with Regular Expressions (Python)

Here's a concise walkthrough of the main ideas in Chapter 9, each with a small example.


Manually finding patterns vs regex

You can detect patterns (like US phone numbers) with plain string logic, but it's verbose and rigid, while regex does the same with a short pattern.

Example (manual phone check):

def is_phone_number(text):
    if len(text) != 12:
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True

Regex equivalent pattern: r'\d{3}-\d{3}-\d{4}'.
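
To make the comparison concrete, here is a minimal sketch of the same check written with that pattern. The names PHONE_RE and is_phone_number_re are hypothetical, and fullmatch() is used on the assumption that the whole string must be a phone number, which mirrors the length check in the manual version.

```python
import re

# Hypothetical names, chosen for this illustration only.
PHONE_RE = re.compile(r'\d{3}-\d{3}-\d{4}')

def is_phone_number_re(text):
    # fullmatch() succeeds only if the entire string fits the pattern,
    # which plays the role of the len(text) != 12 check.
    return PHONE_RE.fullmatch(text) is not None

print(is_phone_number_re('415-555-4242'))   # True
print(is_phone_number_re('Moshi moshi'))    # False
print(is_phone_number_re('415-555-42422'))  # False
```

The 17-line function collapses to one pattern plus one call, and changing the accepted format means editing the pattern rather than the loop bounds.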


Basic regex workflow with re

Steps: import re, compile a pattern, call search(), then group() on the match.

Example:

import re

phone_re = re.compile(r'\d{3}-\d{3}-\d{4}')
mo = phone_re.search('My number is 415-555-4242.')
print(mo.group()) # '415-555-4242'

Groups with parentheses

Parentheses create groups so you can pull out parts of a match (group 1, group 2, etc.).

Example:

import re

phone_re = re.compile(r'(\d{3})-(\d{3}-\d{4})')
mo = phone_re.search('My number is 415-555-4242.')
print(mo.group(1)) # '415'
print(mo.group(2)) # '555-4242'
print(mo.groups()) # ('415', '555-4242')

You can unpack groups into variables:

area, rest = mo.groups()

Escaping special characters

Characters like ()[]{}.+*?^$|\ have special meanings in regex; prefix with \ to match them literally.

Example (match (415) 555-4242):

import re

pattern = re.compile(r'(\(\d{3}\)) (\d{3}-\d{4})')
mo = pattern.search('My phone number is (415) 555-4242.')
print(mo.group(1)) # '(415)'
print(mo.group(2)) # '555-4242'

Alternation with |

| means "this or that", and you can combine it with groups for shared prefixes.

Example:

import re

pet_re = re.compile(r'Cat(erpillar|astrophe|ch|egory)')
mo = pet_re.search('Catch me if you can.')
print(mo.group()) # 'Catch'
print(mo.group(1)) # 'ch'

search() vs findall()

  • search() returns the first match (or None).
  • findall() returns all matches as a list: strings if the pattern has no groups or exactly one group, tuples if it has two or more groups.

Example (no groups):

import re

pattern = re.compile(r'\d{3}-\d{3}-\d{4}')
print(pattern.findall('Cell: 415-555-9999 Work: 212-555-0000'))
# ['415-555-9999', '212-555-0000']

With groups:

pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
print(pattern.findall('Cell: 415-555-9999 Work: 212-555-0000'))
# [('415', '555', '9999'), ('212', '555', '0000')]
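
The single-group case is easy to overlook, so here is a small sketch of it. The pattern and the variable names are illustrative only: with exactly one group, findall() returns a list of that group's text rather than a list of tuples.

```python
import re

# One group: findall() returns the group's text for each match, not tuples.
area_re = re.compile(r'(\d{3})-\d{3}-\d{4}')
area_codes = area_re.findall('Cell: 415-555-9999 Work: 212-555-0000')
print(area_codes)  # ['415', '212']
```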

Character classes [...] and negated [^...]

[...] matches any one character from the set; [^...] matches any character not in the set.

Example (vowels vs non-vowels):

import re

vowel_re = re.compile(r'[aeiouAEIOU]')
print(vowel_re.findall('RoboCop eats BABY FOOD.'))
# ['o', 'o', 'o', 'e', 'a', 'A', 'O', 'O']

consonant_re = re.compile(r'[^aeiouAEIOU]')
print(consonant_re.findall('RoboCop eats BABY FOOD.'))
# includes consonants, spaces, punctuation

Shorthand character classes \d \w \s and opposites

Useful built-ins:

  • \d digits, \D non-digits
  • \w letters/digits/underscore, \W non-\w
  • \s whitespace, \S non-whitespace

Example:

import re

pattern = re.compile(r'\d+\s\w+')
print(pattern.findall('12 drummers, 11 pipers, 10 lords'))
# ['12 drummers', '11 pipers', '10 lords']

Dot . wildcard

. matches any character except newline.

Example:

import re

at_re = re.compile(r'.at')
print(at_re.findall('The cat in the hat sat on the flat mat.'))
# ['cat', 'hat', 'sat', 'lat', 'mat']

To match a literal dot, use \..
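
As a quick illustration of the difference, the sketch below runs an escaped and an unescaped dot over the same made-up text. The sample string and variable names are assumptions for this example.

```python
import re

text = 'Files: a.txt, b_txt, c.txt'

unescaped = re.compile(r'\w.txt')   # . matches any character, including '_'
escaped = re.compile(r'\w\.txt')    # \. matches only a literal dot

print(unescaped.findall(text))  # ['a.txt', 'b_txt', 'c.txt']
print(escaped.findall(text))    # ['a.txt', 'c.txt']
```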


Quantifiers: ?, *, +, {m,n}

Quantifiers say how many of the preceding piece to match.

  • ? – 0 or 1 (optional)
  • * – 0 or more
  • + – 1 or more
  • {m} – exactly m
  • {m,n} – between m and n (inclusive), {m,} / {,n} for open-ended

Examples:

import re

opt_ex = re.compile(r'42!?') # '42' or '42!'
star_ex = re.compile(r'Eggs( and spam)*') # 'Eggs', 'Eggs and spam', 'Eggs and spam and spam', ...
plus_ex = re.compile(r'(Ha)+') # 'Ha', 'HaHa', ...
count_ex = re.compile(r'(Ha){3,5}') # 3 to 5 'Ha'

Use parentheses if you want the quantifier to apply to a whole group, not just one char.
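
To see what each quantifier accepts and rejects, here is a short sketch that runs comparable patterns against sample strings. The test strings are made up for illustration. Note the leading space inside the ( and spam) group, which is needed for the repeated text to line up.

```python
import re

opt_ex = re.compile(r'42!?')
print(opt_ex.findall('42 and 42!'))          # ['42', '42!']

star_ex = re.compile(r'Eggs( and spam)*')
print(star_ex.search('Eggs').group())        # 'Eggs'
print(star_ex.search('Eggs and spam and spam').group())
# 'Eggs and spam and spam'

plus_ex = re.compile(r'(Ha)+')
print(plus_ex.search('HaHaHa!').group())     # 'HaHaHa'
print(plus_ex.search('Hello'))               # None (needs at least one 'Ha')

count_ex = re.compile(r'(Ha){3,5}')
print(count_ex.search('HaHa'))               # None (fewer than 3 repeats)
print(count_ex.search('HaHaHaHa').group())   # 'HaHaHaHa'
```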


Greedy vs non-greedy (? after quantifier)

?, *, +, and {m,n} are greedy by default: they match as much as possible. Adding ? after the quantifier makes it lazy, so it matches as little as possible.

Example:

import re

greedy = re.compile(r'(Ha){3,5}')
print(greedy.search('HaHaHaHaHa').group()) # 'HaHaHaHaHa'

lazy = re.compile(r'(Ha){3,5}?')
print(lazy.search('HaHaHaHaHa').group()) # 'HaHaHa'

Similarly, .* is greedy, .*? is lazy.


.* and .*? (match "anything")

.* means "any character except newline, zero or more times"; put it in a group to capture "whatever is here".

Example (First/Last name):

import re

name_re = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = name_re.search('First Name: Al Last Name: Sweigart')
print(mo.group(1)) # 'Al'
print(mo.group(2)) # 'Sweigart'

Example of greedy vs lazy tags:

lazy_tag = re.compile(r'<.*?>')
greedy_tag = re.compile(r'<.*>')
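
Running both tag patterns on the same input shows the difference. This is a minimal sketch; the sample text is an assumption chosen so that it contains two closing angle brackets.

```python
import re

text = '<To serve humans> for dinner.>'

lazy_tag = re.compile(r'<.*?>')
greedy_tag = re.compile(r'<.*>')

# Lazy stops at the first '>' it can; greedy runs on to the last one.
print(lazy_tag.search(text).group())    # '<To serve humans>'
print(greedy_tag.search(text).group())  # '<To serve humans> for dinner.>'
```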

Matching newlines with re.DOTALL

Normally . does not match \n. Pass re.DOTALL to let . match newlines too.

Example:

import re

no_nl = re.compile('.*')
print(no_nl.search('Line1\nLine2').group()) # 'Line1'

with_nl = re.compile('.*', re.DOTALL)
print(with_nl.search('Line1\nLine2').group()) # both lines: 'Line1\nLine2'

Anchors: ^, $, word boundaries \b / \B

  • ^ – match at start of string
  • $ – match at end of string
  • ^...$ – whole string must match
  • \b – word boundary; \B – not a word boundary

Examples:

import re

begins_hello = re.compile(r'^Hello')
ends_digit = re.compile(r'\d$')
all_digits = re.compile(r'^\d+$')

word_re = re.compile(r'\bcat.*?\b')
print(word_re.findall('The cat found a catapult catalog in the catacombs.'))
# ['cat', 'catapult', 'catalog', 'catacombs']

middle_re = re.compile(r'\Bcat\B')
print(middle_re.findall('certificate')) # ['cat']
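
The three anchored patterns are only compiled above, so here is a sketch of how they might behave on a few inputs. The sample strings are assumptions for illustration.

```python
import re

begins_hello = re.compile(r'^Hello')
ends_digit = re.compile(r'\d$')
all_digits = re.compile(r'^\d+$')

print(begins_hello.search('Hello, world!') is not None)   # True
print(begins_hello.search('He said Hello.') is not None)  # False: not at the start
print(ends_digit.search('Your number is 42').group())     # '2'
print(all_digits.search('1234567890') is not None)        # True
print(all_digits.search('12345xyz67890') is not None)     # False: letters in the middle
```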

Case-insensitive matching

Pass re.IGNORECASE (or re.I) to ignore case.

import re

regex = re.compile(r'hello', re.I)
print(regex.search('HELLO World').group()) # HELLO

Substitution with sub()

Replace matches with a new string.

import re

censored = re.compile(r'Agent \w+')
result = censored.sub('REDACTED', 'Agent Alice gave the documents to Agent Bob.')
print(result) # REDACTED gave the documents to REDACTED.

Verbose mode with re.VERBOSE

Spread a complex regex across multiple lines with comments.

import re

phone_regex = re.compile(r"""
(\d{3}|\(\d{3}\)) # area code (with or without parens)
(\s|-|\.)? # separator
\d{3} # first 3 digits
(\s|-|\.) # separator
\d{4} # last 4 digits
""", re.VERBOSE)

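As a usage sketch, the verbose pattern can be tried against a few phone number formats. The pattern is repeated so the snippet runs on its own; the sample strings and the names samples and matches are assumptions for this illustration.

```python
import re

phone_regex = re.compile(r"""
    (\d{3}|\(\d{3}\))   # area code (with or without parens)
    (\s|-|\.)?          # separator
    \d{3}               # first 3 digits
    (\s|-|\.)           # separator
    \d{4}               # last 4 digits
""", re.VERBOSE)

samples = ['415-555-4242', '(415) 555-4242', '415.555.4242']

# The pattern contains groups, so search().group() is used here
# to get each full match rather than findall()'s tuples.
matches = [phone_regex.search(s).group() for s in samples]
print(matches)  # ['415-555-4242', '(415) 555-4242', '415.555.4242']

# The second separator is required, so a bare run of ten digits does not match.
print(phone_regex.search('4155554242'))  # None
```

In verbose mode, whitespace and # comments inside the pattern are ignored, so the indentation above does not change what the pattern matches.
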
Quick mental model

  • Regex = mini-language for patterns over text (phone numbers, emails, etc.).
  • Build a pattern (re.compile), then use search() / findall() together with groups and quantifiers to extract exactly what you need.

Overall idea of the chapter

Chapter 9 shows how regular expressions give you powerful pattern-matching tools for finding, extracting, and replacing text. They look complex at first, but mastering a few building blocks (character classes, quantifiers, groups, anchors) covers most real-world use cases.