Chapter 17 - PDF and Word Documents (Python)

Here's a concise walkthrough of the main ideas in Chapter 17, each with a small example.

Overview

PDF and Word files are binary formats that store layout, fonts, and images along with the text, so you can't read them as plain text with open().

The chapter uses:

pypdf for PDFs (reading text, extracting images, combining/modifying pages, encrypt/decrypt).
pdfminer.high_level as a fallback text extractor.
python-docx (imported as docx) for .docx Word files.

PDF documents with PyPDF

Extracting text

Basic pattern:

Create a PdfReader from a filename.
Iterate over reader.pages (each item is a Page object).
Call page.extract_text() on each page and concatenate the results.

Example program extractpdftext.py:

import pypdf
import pdfminer.high_level

PDF_FILENAME = 'Recursion_Chapter1.pdf'
TEXT_FILENAME = 'recursion.txt'

text = ''
try:
    reader = pypdf.PdfReader(PDF_FILENAME)
    for page in reader.pages:
        text += page.extract_text()
except Exception:
    text = pdfminer.high_level.extract_text(PDF_FILENAME)

with open(TEXT_FILENAME, 'w', encoding='utf-8') as f:
    f.write(text)

The program tries pypdf first. If pypdf raises an exception (for example, on a malformed or unusual PDF), it falls back to pdfminer.high_level.extract_text(), which returns the entire document as one string.

Post-processing with AI

Extracted text has hard newlines, hyphenated line breaks, and headers/footers mixed in. The chapter suggests using an LLM with a prompt that rejoins hyphenated words, removes headers/footers, and puts each paragraph on one line.
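As a rough non-AI baseline for the same cleanup, the sketch below rejoins words hyphenated across line breaks and puts each paragraph on one line using regular expressions. The function name clean_extracted_text is hypothetical, it assumes paragraphs are separated by blank lines, it does not remove headers or footers, and it will wrongly merge genuinely hyphenated words that happen to break at a line end (such as 'well-' followed by 'known').

```python
import re

def clean_extracted_text(text):
    """Heuristic cleanup of text extracted from a PDF (not an LLM)."""
    # Rejoin words split by a hyphen at the end of a line:
    # 'recur-\nsion' becomes 'recursion'.
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)
    # Treat blank lines as paragraph breaks.
    paragraphs = re.split(r'\n\s*\n', text)
    # Inside each paragraph, collapse newlines and repeated spaces.
    cleaned = [' '.join(paragraph.split()) for paragraph in paragraphs]
    # Drop empty paragraphs and put each remaining one on its own line.
    return '\n'.join(paragraph for paragraph in cleaned if paragraph)
```

It could be applied to the text variable in extractpdftext.py just before the file is written.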

Extracting images

Each Page has an .images attribute, a list-like collection of image objects. Each image has a .name (including the file extension) and .data (the raw bytes).

Program extractpdfimages.py:

import pypdf

PDF_FILENAME = 'Recursion_Chapter1.pdf'
reader = pypdf.PdfReader(PDF_FILENAME)

image_num = 0

for i, page in enumerate(reader.pages):
    print(f'Reading page {i+1} - {len(page.images)} images found...')
    try:
        for image in page.images:
            filename = f'{image_num}_page{i+1}_{image.name}'
            with open(filename, 'wb') as file:
                file.write(image.data)
            print(f'Wrote {filename}...')
            image_num += 1
    except Exception as exc:
        print(f'Skipped page {i+1} due to error: {exc}')

Notes:

Uses enumerate so i+1 is the human page number.
Builds unique filenames from an image counter, page number, and image.name.
Output files must be opened in 'wb' (write-binary) mode because image.data is bytes, not text.

Creating PDFs from other pages

Writing PDFs with pypdf means copying and rearranging pages from existing PDFs, not drawing arbitrary text onto a page.

PdfWriter and append

import pypdf

writer = pypdf.PdfWriter()
writer.append('Recursion_Chapter1.pdf', (0, 5))

with open('first_five_pages.pdf', 'wb') as f:
    writer.write(f)

pypdf.PdfWriter() creates an empty PDF in memory.
append(filename, (start, stop)) copies pages [start, start+1, ..., stop-1] (like range(start, stop)).
Tuples are interpreted as range(start, stop[, step]), lists as explicit page lists.
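The tuple-versus-list rule can be modeled in plain Python. This is only an illustration of the rule as stated above, not pypdf's own code, and select_pages is a hypothetical name.

```python
def select_pages(page_spec):
    """Return the page indexes that a tuple or list page_spec would select."""
    if isinstance(page_spec, tuple):
        # A tuple is unpacked as range(start, stop[, step]).
        return list(range(*page_spec))
    # A list names each page index explicitly, in the order given.
    return list(page_spec)

print(select_pages((0, 5)))      # first five pages
print(select_pages((0, 10, 2)))  # every other page among the first ten
print(select_pages([0, 9, 4]))   # exactly these pages, in this order
```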

merge for inserting (not just appending)

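merge(insert_index, filename, pages=range_or_list) inserts the copied pages at an index in the writer instead of adding them at the end:

writer.merge(2, 'Recursion_Chapter1.pdf', pages=(0, 5))

Copies pages 0–4 from the source and inserts them starting at index 2 in the writer; pages already at index 2 or later shift back by five. Passing the range with the pages keyword keeps it from being taken for a different optional argument.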
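The shifting behavior is the same as inserting into a Python list with slice assignment. The sketch below is an analogy using strings as stand-in pages; merge_at is a hypothetical helper, not part of pypdf.

```python
def merge_at(existing_pages, position, new_pages):
    """Return a new list with new_pages inserted at position."""
    result = list(existing_pages)
    # Assigning to an empty slice inserts without replacing anything;
    # items from position onward move back by len(new_pages).
    result[position:position] = new_pages
    return result

writer_pages = ['A', 'B', 'C', 'D']
print(merge_at(writer_pages, 2, ['x0', 'x1']))
```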

Rotating pages

Use Page.rotate(degrees) where degrees is ±90, ±180, or ±270.

import pypdf

writer = pypdf.PdfWriter()
writer.append('Recursion_Chapter1.pdf')

for i in range(len(writer.pages)):
    writer.pages[i].rotate(90)

with open('rotated.pdf', 'wb') as f:
    writer.write(f)

No support for arbitrary angles (only 90° increments).
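One possible use of rotate() is turning only the landscape pages. The sketch below is written against any sequence of page objects that have a mediabox with width and height plus a rotate() method, which is how pypdf pages behave. The helper name is hypothetical, and it assumes the pages carry no rotation already, since it compares only the raw page box.

```python
def rotate_landscape_pages(pages, degrees=90):
    """Rotate every page that is wider than it is tall; return how many."""
    rotated = 0
    for page in pages:
        box = page.mediabox
        if box.width > box.height:
            page.rotate(degrees)  # must be a multiple of 90
            rotated += 1
    return rotated
```

With the rotation example above, it could be called as rotate_landscape_pages(writer.pages) in place of the for loop.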

Inserting blank pages

add_blank_page() appends; insert_blank_page(index=...) inserts.

import pypdf

writer = pypdf.PdfWriter()
writer.append('Recursion_Chapter1.pdf')

writer.add_blank_page()             # blank at end
writer.insert_blank_page(index=2)   # blank at logical page 3

with open('with_blanks.pdf', 'wb') as f:
    writer.write(f)

When no width and height are given, a blank page takes its size from an existing page in the writer.
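A common reason to add a blank page is padding a document to an even page count for double-sided printing. This is a minimal sketch with a hypothetical helper name; it relies only on the writer's pages sequence and add_blank_page() method shown above.

```python
def pad_to_even_page_count(writer):
    """Append one blank page if the writer has an odd number of pages.

    Returns True if a page was added, False otherwise.
    """
    if len(writer.pages) % 2 == 1:
        # With no size arguments, the blank page's size comes from
        # a page that is already in the writer.
        writer.add_blank_page()
        return True
    return False
```

It would be called as pad_to_even_page_count(writer) just before writer.write(f).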

Watermarks and overlays

Use Page.merge_page(other_page, over=...):

import pypdf

writer = pypdf.PdfWriter()
writer.append('Recursion_Chapter1.pdf')

watermark_page = pypdf.PdfReader('watermark.pdf').pages[0]

for page in writer.pages:
    page.merge_page(watermark_page, over=False)   # watermark/underlay

with open('with_watermark.pdf', 'wb') as f:
    writer.write(f)

over=False → watermark (underlay).
over=True → stamp/overlay (above existing content).
merge_page is a Page method; don't confuse with PdfWriter.merge.

Encrypting and decrypting PDFs

Encrypting

import pypdf

writer = pypdf.PdfWriter()
writer.append('Recursion_Chapter1.pdf')
writer.encrypt('swordfish', algorithm='AES-256')

with open('encrypted.pdf', 'wb') as f:
    writer.write(f)

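encrypt(password, algorithm='AES-256') encrypts the output with AES-256. AES support in pypdf depends on an additional cryptography library being installed.
encrypt() also accepts separate user and owner passwords.
PDFs have no password-reset mechanism: if you forget the password, the data is effectively lost.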

Detecting and decrypting

import pypdf

reader = pypdf.PdfReader('encrypted.pdf')
writer = pypdf.PdfWriter()

print(reader.is_encrypted)   # True

print(reader.decrypt('wrong').name)      # 'NOT_DECRYPTED'
print(reader.decrypt('swordfish').name)  # 'OWNER_PASSWORD' or 'USER_PASSWORD'

writer.append(reader)
with open('decrypted.pdf', 'wb') as f:
    writer.write(f)

reader.is_encrypted indicates encryption.
reader.decrypt(password) returns an object whose .name indicates success or failure.
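The check on .name can be wrapped in a small helper so that a wrong password stops the program instead of failing later when pages are read. This is a sketch with a hypothetical function name; it uses only is_encrypted, decrypt(), and the .name values shown above.

```python
def unlock(reader, password):
    """Decrypt reader in place and return the matched password type name.

    Returns None if the PDF was not encrypted to begin with.
    """
    if not reader.is_encrypted:
        return None  # nothing to unlock
    result = reader.decrypt(password)
    if result.name == 'NOT_DECRYPTED':
        raise ValueError('Wrong password; the PDF is still locked.')
    return result.name  # 'USER_PASSWORD' or 'OWNER_PASSWORD'
```

In the example above, unlock(reader, 'swordfish') would replace the two print(reader.decrypt(...)) lines before writer.append(reader).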

Project: Combine select pages from many PDFs

Goal: in a folder with many PDFs whose first page is a cover sheet, create one combined PDF that skips the first page of each input file.

# combine_pdfs.py - Combines all PDFs in the CWD into a single PDF,
# skipping the first page (the cover sheet) of each.
import os

import pypdf

OUTPUT_FILENAME = 'combined.pdf'

# Get all the PDF filenames, ignoring the output of an earlier run:
pdf_filenames = []
for filename in os.listdir('.'):
    if filename.endswith('.pdf') and filename != OUTPUT_FILENAME:
        pdf_filenames.append(filename)

pdf_filenames.sort(key=str.lower)

writer = pypdf.PdfWriter()

# Loop through all the PDF files:
for pdf_filename in pdf_filenames:
    reader = pypdf.PdfReader(pdf_filename)
    # Copy all pages after the first page:
    writer.append(reader, (1, len(reader.pages)))

# Save the resulting PDF to a file:
with open(OUTPUT_FILENAME, 'wb') as file:
    writer.write(file)

Ideas for similar programs:

Cut out specific page ranges.
Reverse/reorder pages.
Keep only pages that contain certain text (using extract_text() for filtering).
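The text-filtering idea could look like the sketch below. The function name is hypothetical; it expects a pypdf-style reader (with .pages whose items have extract_text()) and writer (with add_page()), and it matches case-insensitively as an assumption about what is wanted.

```python
def copy_pages_containing(reader, writer, phrase):
    """Copy to writer every page of reader whose text contains phrase.

    Returns the number of pages copied.
    """
    kept = 0
    for page in reader.pages:
        # Treat a page with no extractable text as an empty string.
        text = page.extract_text() or ''
        if phrase.lower() in text.lower():
            writer.add_page(page)
            kept += 1
    return kept
```

A call such as copy_pages_containing(pypdf.PdfReader('Recursion_Chapter1.pdf'), writer, 'recursion') would fill writer, which is then saved with writer.write() as in the other examples.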

Word documents with python-docx

Setup and data model

Install the package under the name python-docx, but import it with import docx.
A .docx document has:
- Document – entire file.
- Paragraph objects – one per paragraph (a new paragraph starts wherever ENTER or RETURN was pressed).
- Run objects – contiguous text with the same style (bold, italic, etc.).

Styles:

Paragraph styles (apply to Paragraph).
Character styles (apply to Run).
Linked styles (can apply to both; when applying one to a Run, add ' Char' to the end of the style name).

Reading Word documents

import docx

doc = docx.Document('demo.docx')

print(len(doc.paragraphs))      # e.g. 7
print(doc.paragraphs[0].text)   # 'Document Title'
print(doc.paragraphs[1].text)   # 'A plain paragraph with some bold and some italic'

print(len(doc.paragraphs[1].runs))     # 4
print(doc.paragraphs[1].runs[0].text)  # 'A plain paragraph with some '
print(doc.paragraphs[1].runs[1].text)  # 'bold'
print(doc.paragraphs[1].runs[2].text)  # ' and some '
print(doc.paragraphs[1].runs[3].text)  # 'italic'

The second paragraph has four runs because a new run starts wherever the character formatting changes.
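Because formatting lives on runs, walking the runs is a way to pull out just the emphasized text. The sketch below uses a hypothetical helper name and collects the text of every run whose bold attribute is set directly; text that is bold only through its style has bold equal to None and is skipped.

```python
def bold_phrases(paragraphs):
    """Return the text of every run that is explicitly bold."""
    phrases = []
    for paragraph in paragraphs:
        for run in paragraph.runs:
            if run.bold:  # True only when bold is set on the run itself
                phrases.append(run.text)
    return phrases
```

For the document above, it would be called as bold_phrases(doc.paragraphs).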

Getting full text from a .docx

Helper readDocx.py:

import docx

def get_text(filename):
    doc = docx.Document(filename)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    return '\n'.join(full_text)

Usage:

import readDocx

print(readDocx.get_text('demo.docx'))

Styling paragraphs and runs

Styles

View styles in Word/LibreOffice and note their names; defaults include 'Normal', 'Title', 'Heading 1', 'Quote', 'Intense Quote', 'List Bullet', etc.

Usage:

paragraph.style = 'Heading 1'

For linked style 'Quote' on a Run, use 'Quote Char':

run.style = 'Quote Char'

Run attributes (bold/italic/etc.)

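Each text attribute can be True (always on), False (always off), or None (inherit from the run's style). In python-docx, bold, italic, and underline are set directly on the Run; strike, double_strike, all_caps, small_caps, shadow, outline, rtl, imprint, and emboss are set through run.font (for example, run.font.strike = True).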
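The three-valued logic can be modeled in a few lines. This is an illustration of the inheritance rule only, with a hypothetical function name, not how python-docx resolves styles internally.

```python
def effective_setting(run_value, style_value):
    """Resolve a True/False/None run attribute against its style's value."""
    # None means the run has no opinion, so the style decides.
    if run_value is None:
        return style_value
    # True or False on the run overrides whatever the style says.
    return run_value

print(effective_setting(None, True))   # inherits the style's setting
print(effective_setting(False, True))  # the run overrides the style
```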

Example restyling demo.docx:

import docx

doc = docx.Document('demo.docx')

doc.paragraphs[0].style = 'Normal'       # change title to Normal

p = doc.paragraphs[1]
p.runs[0].style = 'Quote Char'   # quote style for the first part
p.runs[1].underline = True       # underline 'bold'
p.runs[3].underline = True       # underline 'italic'

doc.save('restyled.docx')

Overall idea of the chapter

Chapter 17 shows how to automate PDF and Word tasks: extracting text and images from PDFs with pypdf, combining/rotating/watermarking/encrypting PDFs with PdfWriter, and reading/writing .docx files with python-docx (paragraphs, runs, styles). The key pattern is: read the document object model, manipulate programmatically, save to a new file.
