Chapter 17 - PDF and Word Documents (Python)

Here's a concise walkthrough of the main ideas in Chapter 17, each with a small example.


Overview

PDF and Word files are binary documents with layout, fonts, and images, so you can't treat them like plain text files opened with open().

The chapter uses:

  • pypdf for PDFs (reading text, extracting images, combining/modifying pages, encrypt/decrypt).
  • pdfminer.high_level as a fallback text extractor.
  • python-docx (imported as docx) for .docx Word files.

PDF documents with PyPDF

Extracting text

Basic pattern:

  1. Create PdfReader from a filename.
  2. Iterate reader.pages (each is a Page).
  3. Call page.extract_text() and concatenate.

Example program extractpdftext.py:

import pypdf
import pdfminer.high_level

PDF_FILENAME = 'Recursion_Chapter1.pdf'
TEXT_FILENAME = 'recursion.txt'

text = ''
try:
    reader = pypdf.PdfReader(PDF_FILENAME)
    for page in reader.pages:
        text += page.extract_text()
except Exception:
    text = pdfminer.high_level.extract_text(PDF_FILENAME)

with open(TEXT_FILENAME, 'w', encoding='utf-8') as f:
    f.write(text)

The program tries PyPDF first; if that raises an exception (for example, on an unusual or malformed PDF), it falls back to pdfminer.high_level.extract_text(), which returns the entire document as one string.

Post-processing with AI

Extracted text has hard newlines, hyphenated line breaks, and headers/footers mixed in. The chapter suggests using an LLM with a prompt that rejoins hyphenated words, removes headers/footers, and puts each paragraph on one line.
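The chapter's suggestion is an LLM prompt. As a rough non-AI alternative for two of those cleanup steps (rejoining hyphenated words and putting each paragraph on one line), a regular-expression heuristic might look like the sketch below. The function name tidy_extracted_text is hypothetical, the sketch assumes paragraphs are separated by blank lines, and it makes no attempt to remove headers or footers.

```python
import re

def tidy_extracted_text(text):
    """Rejoin words hyphenated across lines and put each paragraph on one line."""
    # 'recur-\nsion' -> 'recursion'. Caveat: this also joins genuinely
    # hyphenated words that happen to break at the hyphen.
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)
    # Treat blank lines as paragraph breaks, then collapse the remaining
    # newlines and repeated spaces inside each paragraph.
    paragraphs = re.split(r'\n\s*\n', text)
    return '\n'.join(' '.join(p.split()) for p in paragraphs if p.strip())

sample = 'Recur-\nsion is a tech-\nnique that\ncalls itself.\n\nA second\nparagraph.'
print(tidy_extracted_text(sample))
# Recursion is a technique that calls itself.
# A second paragraph.
```

A heuristic like this is predictable and fast, but it cannot tell a page header from body text, which is where the LLM approach has the advantage.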


Extracting images

Each Page has an .images attribute, a list-like collection of image objects, each with .name (the filename, including its extension) and .data (the raw bytes).

Program extractpdfimages.py:

import pypdf

PDF_FILENAME = 'Recursion_Chapter1.pdf'
reader = pypdf.PdfReader(PDF_FILENAME)

image_num = 0

for i, page in enumerate(reader.pages):
    print(f'Reading page {i+1} - {len(page.images)} images found...')
    try:
        for image in page.images:
            filename = f'{image_num}_page{i+1}_{image.name}'
            with open(filename, 'wb') as file:
                file.write(image.data)
            print(f'Wrote {filename}...')
            image_num += 1
    except Exception as exc:
        print(f'Skipped page {i+1} due to error: {exc}')

Notes:

  • enumerate() yields a zero-based index i, so i+1 is the page number a human reader expects.
  • Builds unique filenames from an image counter, page number, and image.name.
  • Output files must be opened in 'wb' (write-binary) mode because image.data is bytes.

Creating PDFs from other pages

PyPDF writing is about reusing pages, not drawing arbitrary text.

PdfWriter and append

import pypdf

writer = pypdf.PdfWriter()
writer.append('Recursion_Chapter1.pdf', (0, 5))

with open('first_five_pages.pdf', 'wb') as f:
    writer.write(f)

  • pypdf.PdfWriter() creates an empty PDF in memory.
  • append(filename, (start, stop)) copies pages [start, start+1, ..., stop-1] (like range(start, stop)). Page indexes are zero-based, so (0, 5) means the first five pages.
  • Tuples are interpreted as range(start, stop[, step]); lists are interpreted as explicit lists of page indexes.
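The tuple-versus-list rule can be illustrated without any PDF at all. The helper below, selected_pages, is a hypothetical name used only for illustration; it mimics the rule as described above rather than calling pypdf.

```python
def selected_pages(spec):
    """Return the zero-based page indexes that a tuple or list page spec refers to."""
    if isinstance(spec, tuple):
        # A tuple is expanded like range(start, stop[, step]).
        return list(range(*spec))
    # A list is taken as the explicit page indexes, in the given order.
    return list(spec)

print(selected_pages((0, 5)))      # [0, 1, 2, 3, 4]
print(selected_pages((0, 10, 2)))  # [0, 2, 4, 6, 8]
print(selected_pages([0, 9, 3]))   # [0, 9, 3]
```

Note that (0, 5) and [0, 5] mean different things: the tuple selects five pages, the list selects exactly two.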

merge for inserting (not just appending)

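merge(position, filename, pages=range_or_list):

writer.merge(2, 'Recursion_Chapter1.pdf', pages=(0, 5))

This copies pages 0–4 from the source and inserts them starting at index 2 in the writer; the writer's existing pages from index 2 onward shift back to make room. The page range is passed by keyword here because, in current pypdf releases, the third positional parameter of merge() is an outline (bookmark) title rather than the page range.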
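To picture how inserting shifts the existing pages, here is a small model that uses plain Python lists in place of PDF pages. The names writer_pages and source_pages are made up for this illustration, and no PDF library is involved.

```python
# Plain lists stand in for PDF pages.
writer_pages = ['W0', 'W1', 'W2', 'W3']        # pages already in the writer
source_pages = [f'S{i}' for i in range(8)]     # pages of the source PDF

position = 2     # insert before the page currently at index 2
pages = (0, 5)   # source pages 0 through 4

# Slice assignment inserts without overwriting anything.
writer_pages[position:position] = [source_pages[i] for i in range(*pages)]

print(writer_pages)
# ['W0', 'W1', 'S0', 'S1', 'S2', 'S3', 'S4', 'W2', 'W3']
```

W2 and W3 end up at indexes 7 and 8, which is the "shift back" described above.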


Rotating pages

Use Page.rotate(degrees), where degrees is a multiple of 90 (±90, ±180, or ±270); positive values rotate clockwise.

import pypdf

writer = pypdf.PdfWriter()
writer.append('Recursion_Chapter1.pdf')

for i in range(len(writer.pages)):
    writer.pages[i].rotate(90)

with open('rotated.pdf', 'wb') as f:
    writer.write(f)

No support for arbitrary angles (only 90° increments).


Inserting blank pages

add_blank_page() appends; insert_blank_page(index=...) inserts.

import pypdf

writer = pypdf.PdfWriter()
writer.append('Recursion_Chapter1.pdf')

writer.add_blank_page()  # blank page at the end
writer.insert_blank_page(index=2)  # blank page becomes the third page (index 2)

with open('with_blanks.pdf', 'wb') as f:
    writer.write(f)

When no width and height are passed, a blank page takes its size from a page already in the writer.


Watermarks and overlays

Use Page.merge_page(other_page, over=...):

import pypdf

writer = pypdf.PdfWriter()
writer.append('Recursion_Chapter1.pdf')

watermark_page = pypdf.PdfReader('watermark.pdf').pages[0]

for page in writer.pages:
    page.merge_page(watermark_page, over=False)  # watermark/underlay

with open('with_watermark.pdf', 'wb') as f:
    writer.write(f)

  • over=False → watermark (underlay, beneath existing content).
  • over=True → stamp/overlay (above existing content).
  • merge_page is a Page method; don't confuse it with PdfWriter.merge.

Encrypting and decrypting PDFs

Encrypting

import pypdf

writer = pypdf.PdfWriter()
writer.append('Recursion_Chapter1.pdf')
writer.encrypt('swordfish', algorithm='AES-256')

with open('encrypted.pdf', 'wb') as f:
    writer.write(f)

  • encrypt(password, algorithm='AES-256') encrypts the PDF with AES-256.
  • You can provide user and owner passwords separately.
  • PDFs have no password-reset mechanism; if you forget the password, the data is effectively lost.

Detecting and decrypting

import pypdf

reader = pypdf.PdfReader('encrypted.pdf')
writer = pypdf.PdfWriter()

print(reader.is_encrypted) # True

print(reader.decrypt('wrong').name) # 'NOT_DECRYPTED'
print(reader.decrypt('swordfish').name) # 'OWNER_PASSWORD' or 'USER_PASSWORD'

writer.append(reader)
with open('decrypted.pdf', 'wb') as f:
    writer.write(f)

  • reader.is_encrypted is True when the PDF is encrypted.
  • reader.decrypt(password) returns an object whose .name indicates success or failure.

Project: Combine select pages from many PDFs

Goal: in a folder with many PDFs whose first page is a cover sheet, create one combined PDF that skips the first page of each input file.

# combine_pdfs.py - Combines all PDFs in CWD into a single PDF
import os
import pypdf

# Get all the PDF filenames.
pdf_filenames = []
for filename in os.listdir('.'):
    if filename.endswith('.pdf'):
        pdf_filenames.append(filename)

pdf_filenames.sort(key=str.lower)

writer = pypdf.PdfWriter()

# Loop through all the PDF files:
for pdf_filename in pdf_filenames:
    reader = pypdf.PdfReader(pdf_filename)
    # Copy all pages after the first page:
    writer.append(pdf_filename, (1, len(reader.pages)))

# Save the resulting PDF to a file:
with open('combined.pdf', 'wb') as file:
    writer.write(file)
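One thing to watch for with the filename loop: combined.pdf also ends in .pdf, so running the script a second time in the same folder would feed the previous output back in, and files named with an uppercase .PDF extension would be missed. A possible variation on the filename-gathering step is sketched below; pdfs_to_combine is a hypothetical helper, not part of the chapter's program.

```python
def pdfs_to_combine(filenames, output_name='combined.pdf'):
    """Pick PDF filenames (any letter case), skip the output file, sort case-insensitively."""
    picked = [
        name for name in filenames
        if name.lower().endswith('.pdf') and name.lower() != output_name.lower()
    ]
    return sorted(picked, key=str.lower)

names = ['b.pdf', 'A.PDF', 'combined.pdf', 'notes.txt', 'a2.pdf']
print(pdfs_to_combine(names))
# ['A.PDF', 'a2.pdf', 'b.pdf']
```

In the real script the list passed in would come from os.listdir('.'). Sorting with key=str.lower keeps 'A.PDF' next to 'a2.pdf' instead of placing all uppercase names first.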

Ideas for similar programs:

  • Cut out specific page ranges.
  • Reverse/reorder pages.
  • Keep only pages that contain certain text (using extract_text() for filtering).

Word documents with python-docx

Setup and data model

  • Install python-docx (package name), but import as import docx.
  • A .docx document has:
    • Document – entire file.
    • Paragraph objects – one per paragraph; a new paragraph begins wherever the author pressed ENTER or RETURN.
    • Run objects – contiguous text with the same style (bold, italic, etc.).

Styles:

  • Paragraph styles (apply to Paragraph).
  • Character styles (apply to Run).
  • Linked styles (can apply to both; when applying one to a Run, append ' Char' to the style name).

Reading Word documents

import docx

doc = docx.Document('demo.docx')

print(len(doc.paragraphs))  # e.g. 7
print(doc.paragraphs[0].text)  # 'Document Title'
print(doc.paragraphs[1].text)  # 'A plain paragraph with some bold and some italic'

print(len(doc.paragraphs[1].runs))  # 4
print(doc.paragraphs[1].runs[0].text)  # 'A plain paragraph with some '
print(doc.paragraphs[1].runs[1].text)  # 'bold'
print(doc.paragraphs[1].runs[2].text)  # ' and some '
print(doc.paragraphs[1].runs[3].text)  # 'italic'

The second paragraph has four runs because a new run starts each time the text style changes: plain, bold, plain, italic.

Getting full text from a .docx

Helper readDocx.py:

import docx

def get_text(filename):
    doc = docx.Document(filename)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    return '\n'.join(full_text)

Usage:

import readDocx

print(readDocx.get_text('demo.docx'))

Styling paragraphs and runs

Styles

View styles in Word/LibreOffice and note their names; defaults include 'Normal', 'Title', 'Heading 1', 'Quote', 'Intense Quote', 'List Bullet', etc.

Usage:

paragraph.style = 'Heading 1'

To apply the linked style 'Quote' to a Run, use 'Quote Char':

run.style = 'Quote Char'

Run attributes (bold/italic/etc.)

Run attributes can be True, False, or None (inherit style): bold, italic, underline, strike, double_strike, all_caps, small_caps, shadow, outline, rtl, imprint, emboss.

Example restyling demo.docx:

import docx

doc = docx.Document('demo.docx')

doc.paragraphs[0].style = 'Normal' # change title to Normal

p = doc.paragraphs[1]
p.runs[0].style = 'Quote Char' # quote style for the first part
p.runs[1].underline = True # underline 'bold'
p.runs[3].underline = True # underline 'italic'

doc.save('restyled.docx')

Overall idea of the chapter

Chapter 17 shows how to automate PDF and Word tasks: extracting text and images from PDFs with pypdf, combining/rotating/watermarking/encrypting PDFs with PdfWriter, and reading/writing .docx files with python-docx (paragraphs, runs, styles). The key pattern is: read the document object model, manipulate programmatically, save to a new file.