Chapter 9 - Text Pattern Matching with Regular Expressions (Python)

Here's a concise walkthrough of the main ideas in Chapter 9, each with a small example.

Manually finding patterns vs regex

You can detect patterns (like US phone numbers) with plain string logic, but the code is verbose and rigid; a regex expresses the same check in one short pattern.

Example (manual phone check):

def is_phone_number(text):
    if len(text) != 12:
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True

Regex equivalent pattern: r'\d{3}-\d{3}-\d{4}'.
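To make the comparison concrete, here is a minimal sketch of the same check written with the regex. The helper name is_phone_number_re is hypothetical, and fullmatch() is used on the assumption that, like the manual version, the whole string must be a phone number.

```python
import re

# Same pattern as above, compiled once for reuse.
PHONE_RE = re.compile(r'\d{3}-\d{3}-\d{4}')

def is_phone_number_re(text):
    # Hypothetical helper mirroring is_phone_number().
    # fullmatch() succeeds only if the entire string fits the pattern.
    return PHONE_RE.fullmatch(text) is not None

print(is_phone_number_re('415-555-4242'))   # True
print(is_phone_number_re('415-555-424'))    # False (too short)
print(is_phone_number_re('Moshi moshi'))    # False
```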

Basic regex workflow with `re`

Steps: import re, compile a pattern, call search(), then group() on the match.

Example:

import re

phone_re = re.compile(r'\d{3}-\d{3}-\d{4}')
mo = phone_re.search('My number is 415-555-4242.')
print(mo.group())    # '415-555-4242'
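search() returns None when nothing matches, so calling group() directly on the result can raise AttributeError. A small sketch of guarding against that (first_phone is a hypothetical helper name):

```python
import re

phone_re = re.compile(r'\d{3}-\d{3}-\d{4}')

def first_phone(text):
    # Hypothetical helper: return the first phone number, or None.
    mo = phone_re.search(text)
    if mo is None:          # no match found
        return None
    return mo.group()

print(first_phone('My number is 415-555-4242.'))  # 415-555-4242
print(first_phone('No digits here.'))             # None
```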

Groups with parentheses

Parentheses create groups so you can pull out parts of a match (group 1, group 2, etc.).

Example:

import re

phone_re = re.compile(r'(\d{3})-(\d{3}-\d{4})')
mo = phone_re.search('My number is 415-555-4242.')
print(mo.group(1))      # '415'
print(mo.group(2))      # '555-4242'
print(mo.groups())      # ('415', '555-4242')

You can unpack groups into variables:

area, rest = mo.groups()

Escaping special characters

Characters like ()[]{}.+*?^$|\ have special meanings in regex; prefix with \ to match them literally.

Example (match (415) 555-4242):

import re

pattern = re.compile(r'(\(\d{3}\)) (\d{3}-\d{4})')
mo = pattern.search('My phone number is (415) 555-4242.')
print(mo.group(1))   # '(415)'
print(mo.group(2))   # '555-4242'
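The same escaping rule applies to other special characters. As a further illustration (the sample sentence is made up), this sketch escapes both $ and . to match prices literally:

```python
import re

# \$ matches a literal dollar sign, \. a literal dot.
price_re = re.compile(r'\$\d+\.\d{2}')
prices = price_re.findall('Coffee is $3.50 and tea is $2.75.')
print(prices)   # ['$3.50', '$2.75']
```

Without the backslashes, $ would act as an end-of-string anchor and . would match any character.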

Alternation with `|`

| means "this or that", and you can combine it with groups for shared prefixes.

Example:

import re

pet_re = re.compile(r'Cat(erpillar|astrophe|ch|egory)')
mo = pet_re.search('Catch me if you can.')
print(mo.group())    # 'Catch'
print(mo.group(1))   # 'ch'

`search()` vs `findall()`

search() returns the first match (or None).
findall() returns all matches as a list of strings; if the pattern has exactly one group, the list holds that group's text, and with two or more groups it holds tuples.

Example (no groups):

import re

pattern = re.compile(r'\d{3}-\d{3}-\d{4}')
print(pattern.findall('Cell: 415-555-9999 Work: 212-555-0000'))
# ['415-555-9999', '212-555-0000']

With groups:

pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
print(pattern.findall('Cell: 415-555-9999 Work: 212-555-0000'))
# [('415', '555', '9999'), ('212', '555', '0000')]
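The single-group case is easy to overlook: findall() then returns a plain list of strings, one per match, containing only the grouped part. A short sketch using the same sample text:

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'

# Only the area code is grouped, so findall() returns just that part.
one_group = re.compile(r'(\d{3})-\d{3}-\d{4}')
area_codes = one_group.findall(text)
print(area_codes)   # ['415', '212']
```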

Character classes `[...]` and negated `[^...]`

[...] matches any one character from the set; [^...] matches any character not in the set.

Example (vowels vs non-vowels):

import re

vowel_re = re.compile(r'[aeiouAEIOU]')
print(vowel_re.findall('RoboCop eats BABY FOOD.'))
# ['o', 'o', 'o', 'e', 'a', 'A', 'O', 'O']

consonant_re = re.compile(r'[^aeiouAEIOU]')
print(consonant_re.findall('RoboCop eats BABY FOOD.'))
# ['R', 'b', 'C', 'p', ' ', 't', 's', ' ', 'B', 'B', 'Y', ' ', 'F', 'D', '.']
# (consonants, plus spaces and punctuation)

Shorthand character classes `\d \w \s` and opposites

Useful built-ins:

\d digits, \D non-digits
\w letters/digits/underscore, \W non-\w
\s whitespace, \S non-whitespace

Example:

import re

pattern = re.compile(r'\d+\s\w+')
print(pattern.findall('12 drummers, 11 pipers, 10 lords'))
# ['12 drummers', '11 pipers', '10 lords']
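The uppercase "opposite" classes can be sketched the same way. The sample string below is made up for illustration:

```python
import re

text = 'Order #42: 3 apples'

non_digits = re.findall(r'\D+', text)   # runs of non-digit characters
non_word = re.findall(r'\W+', text)     # runs of non-word characters
non_space = re.findall(r'\S+', text)    # runs of non-whitespace characters

print(non_digits)  # ['Order #', ': ', ' apples']
print(non_word)    # [' #', ': ', ' ']
print(non_space)   # ['Order', '#42:', '3', 'apples']
```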

Dot `.` wildcard

. matches any character except newline.

Example:

import re

at_re = re.compile(r'.at')
print(at_re.findall('The cat in the hat sat on the flat mat.'))
# ['cat', 'hat', 'sat', 'lat', 'mat']

To match a literal dot, escape it: \.

Quantifiers: `?`, `*`, `+`, `{m,n}`

Quantifiers say how many of the preceding piece to match.

? – 0 or 1 (optional)
* – 0 or more
+ – 1 or more
{m} – exactly m
{m,n} – between m and n (inclusive), {m,} / {,n} for open-ended

Examples:

import re

opt_ex = re.compile(r'42!?')        # '42' or '42!'
star_ex = re.compile(r'Eggs( and spam)*')  # 'Eggs', 'Eggs and spam', 'Eggs and spam and spam', ...
plus_ex = re.compile(r'(Ha)+')      # 'Ha', 'HaHa', ...
count_ex = re.compile(r'(Ha){3,5}') # 3 to 5 'Ha'

Use parentheses if you want the quantifier to apply to a whole group, not just one char.
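To see the quantifiers in action, here is a sketch that runs each kind against made-up inputs. Note the leading space inside the ( and spam) group, which the pattern needs in order to match 'Eggs and spam'.

```python
import re

opt_hits = re.findall(r'42!?', '42 and 42!')
print(opt_hits)                                   # ['42', '42!']

star_ex = re.compile(r'Eggs( and spam)*')
star_hit = star_ex.search('Eggs and spam and spam').group()
print(star_hit)                                   # Eggs and spam and spam
print(star_ex.search('Eggs only').group())        # Eggs

count_ex = re.compile(r'(Ha){3,5}')
print(count_ex.search('HaHa'))                    # None (fewer than 3)
print(count_ex.search('HaHaHaHa').group())        # HaHaHaHa

# Without parentheses, + applies only to the preceding character.
char_only = re.search(r'Ha+', 'HaHa').group()
whole_group = re.search(r'(Ha)+', 'HaHa').group()
print(char_only)                                  # Ha
print(whole_group)                                # HaHa
```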

Greedy vs non-greedy (`?` after quantifier)

*, +, {m,n} are greedy: they match as much as possible; adding ? makes them lazy (shortest possible).

Example:

import re

greedy = re.compile(r'(Ha){3,5}')
print(greedy.search('HaHaHaHaHa').group())   # 'HaHaHaHaHa'

lazy = re.compile(r'(Ha){3,5}?')
print(lazy.search('HaHaHaHaHa').group())     # 'HaHaHa'

Similarly, .* is greedy, .*? is lazy.

`.*` and `.*?` (match "anything")

.* means "any characters except newline, 0 or more times"; use it inside a group to capture "whatever is here".

Example (First/Last name):

import re

name_re = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = name_re.search('First Name: Al Last Name: Sweigart')
print(mo.group(1))   # 'Al'
print(mo.group(2))   # 'Sweigart'

Example of greedy vs lazy tags:

lazy_tag = re.compile(r'<.*?>')
greedy_tag = re.compile(r'<.*>')
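Running both tag patterns on the same input shows the difference. The HTML-like sample string is made up for illustration:

```python
import re

text = '<b>bold</b> and <i>italic</i>'

lazy_tag = re.compile(r'<.*?>')
greedy_tag = re.compile(r'<.*>')

lazy_hits = lazy_tag.findall(text)
greedy_hits = greedy_tag.findall(text)

# Lazy stops at the first '>' it can; greedy runs to the last '>'.
print(lazy_hits)    # ['<b>', '</b>', '<i>', '</i>']
print(greedy_hits)  # ['<b>bold</b> and <i>italic</i>']
```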

Matching newlines with `re.DOTALL`

Normally . does not match \n. Pass re.DOTALL to let . match newlines too.

Example:

import re

no_nl = re.compile('.*')
print(no_nl.search('Line1\nLine2').group())  # 'Line1'

with_nl = re.compile('.*', re.DOTALL)
print(with_nl.search('Line1\nLine2').group())  # whole string

Anchors: `^`, `$`, word boundaries `\b` / `\B`

^ – match at start of string
$ – match at end of string
^...$ – whole string must match
\b – word boundary; \B – not a word boundary

Examples:

import re

begins_hello = re.compile(r'^Hello')
ends_digit = re.compile(r'\d$')
all_digits = re.compile(r'^\d+$')

word_re = re.compile(r'\bcat.*?\b')
print(word_re.findall('The cat found a catapult catalog in the catacombs.'))
# ['cat', 'catapult', 'catalog', 'catacombs']

middle_re = re.compile(r'\Bcat\B')
print(middle_re.findall('certificate'))  # ['cat']
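The three anchor patterns are easier to judge against inputs that do and do not match. A short sketch with made-up test strings:

```python
import re

begins_hello = re.compile(r'^Hello')
ends_digit = re.compile(r'\d$')
all_digits = re.compile(r'^\d+$')

print(begins_hello.search('Hello, world!') is not None)   # True
print(begins_hello.search('He said Hello.') is not None)  # False
print(ends_digit.search('Your number is 42').group())     # 2
print(all_digits.search('1234567890') is not None)        # True
print(all_digits.search('12345xyz67890') is not None)     # False
```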

Case-insensitive matching

Pass re.IGNORECASE (or re.I) to ignore case.

import re

regex = re.compile(r'hello', re.I)
print(regex.search('HELLO World').group())  # HELLO

Substitution with `sub()`

Replace matches with a new string.

import re

censored = re.compile(r'Agent \w+')
result = censored.sub('REDACTED', 'Agent Alice gave the documents to Agent Bob.')
print(result)  # REDACTED gave the documents to REDACTED.
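The replacement string can also reuse part of the match through a backreference such as \1 (the text of group 1). As a sketch of that idea, this variant keeps each agent's first initial and masks the rest:

```python
import re

# Group 1 captures only the first letter of the name.
initials = re.compile(r'Agent (\w)\w*')
masked = initials.sub(r'\1****', 'Agent Alice gave the documents to Agent Bob.')
print(masked)  # A**** gave the documents to B****.
```

The replacement is written as a raw string so that \1 reaches the regex engine unchanged.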

Verbose mode with `re.VERBOSE`

Spread a complex regex across multiple lines with comments.

import re

phone_regex = re.compile(r"""
    (\d{3}|\(\d{3}\))   # area code (with or without parens)
    (\s|-|\.)?           # separator
    \d{3}                # first 3 digits
    (\s|-|\.)            # separator
    \d{4}                # last 4 digits
""", re.VERBOSE)

Quick mental model

Regex = mini-language for patterns over text (phone numbers, emails, etc.).
Build a pattern (re.compile), then use search() / findall() and group/quantifier tools to extract exactly what you need.

Overall idea of the chapter

Chapter 9 shows how regular expressions give you powerful pattern-matching tools for finding, extracting, and replacing text. They look complex at first, but mastering a few building blocks (character classes, quantifiers, groups, anchors) covers most real-world use cases.
