Chapter 17 - PDF and Word Documents (Python)
Here's a concise walkthrough of the main ideas in Chapter 17, each with a small example.
Overview
PDF and Word files are binary documents with layout, fonts, and images, so you can't treat them like plain text files opened with open().
The chapter uses:
- pypdf for PDFs (reading text, extracting images, combining/modifying pages, encrypt/decrypt).
- pdfminer.high_level as a fallback text extractor.
- python-docx (imported as docx) for .docx Word files.
PDF documents with PyPDF
Extracting text
Basic pattern:
- Create a PdfReader from a filename.
- Iterate over reader.pages (each is a Page).
- Call page.extract_text() and concatenate.
Example program extractpdftext.py:
import pypdf
import pdfminer.high_level
PDF_FILENAME = 'Recursion_Chapter1.pdf'
TEXT_FILENAME = 'recursion.txt'
text = ''
try:
    reader = pypdf.PdfReader(PDF_FILENAME)
    for page in reader.pages:
        text += page.extract_text()
except Exception:
    text = pdfminer.high_level.extract_text(PDF_FILENAME)
with open(TEXT_FILENAME, 'w', encoding='utf-8') as f:
    f.write(text)
The program tries PyPDF first; if that raises an exception (for example, on an unusual or malformed PDF), it falls back to pdfminer.high_level.extract_text(), which returns the entire document as one string.
Post-processing with AI
Extracted text has hard newlines, hyphenated line breaks, and headers/footers mixed in. The chapter suggests using an LLM with a prompt that rejoins hyphenated words, removes headers/footers, and puts each paragraph on one line.
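As a rough, non-AI approximation of part of that cleanup, the sketch below uses regular expressions instead of an LLM. The function name clean_extracted_text is hypothetical, and the approach is an assumption rather than the chapter's method: it rejoins words hyphenated across line breaks and puts each paragraph on one line, but it does not remove headers or footers, and it will wrongly merge genuinely hyphenated compounds that happen to break at a line end.

```python
import re

def clean_extracted_text(text):
    """Rejoin hyphenated line breaks and put each paragraph on one line."""
    # "tech-\nnique" -> "technique" (hypothetical heuristic, not from the chapter).
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)
    # One or more blank lines separate paragraphs.
    paragraphs = re.split(r'\n\s*\n', text)
    # Inside a paragraph, collapse line breaks and extra whitespace to single spaces.
    cleaned = [' '.join(p.split()) for p in paragraphs]
    # Drop empty paragraphs and emit one paragraph per line.
    return '\n'.join(p for p in cleaned if p)

raw = 'Recur-\nsion is a tech-\nnique that\ncalls itself.\n\nSecond para-\ngraph here.'
print(clean_extracted_text(raw))
```

The result could be written to the text file in place of the raw extracted string.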
Extracting images
Each Page has an .images attribute, a list-like collection of image objects, each with .name (the filename, including extension) and .data (the raw bytes).
Program extractpdfimages.py:
import pypdf
PDF_FILENAME = 'Recursion_Chapter1.pdf'
reader = pypdf.PdfReader(PDF_FILENAME)
image_num = 0
for i, page in enumerate(reader.pages):
    print(f'Reading page {i+1} - {len(page.images)} images found...')
    try:
        for image in page.images:
            filename = f'{image_num}_page{i+1}_{image.name}'
            with open(filename, 'wb') as file:
                file.write(image.data)
            print(f'Wrote {filename}...')
            image_num += 1
    except Exception as exc:
        print(f'Skipped page {i+1} due to error: {exc}')
Notes:
- Uses enumerate so i+1 is the human page number.
- Builds unique filenames from an image counter, page number, and image.name.
- Must open output files in 'wb' mode.
Creating PDFs from other pages
PyPDF writing is about reusing pages, not drawing arbitrary text.
PdfWriter and append
import pypdf
writer = pypdf.PdfWriter()
writer.append('Recursion_Chapter1.pdf', (0, 5))
with open('first_five_pages.pdf', 'wb') as f:
    writer.write(f)
- pypdf.PdfWriter() creates an empty PDF in memory.
- append(filename, (start, stop)) copies pages [start, start+1, ..., stop-1] (like range(start, stop)).
- Tuples are interpreted as range(start, stop[, step]), lists as explicit page lists.
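To make the tuple-versus-list rule concrete, here is a small pure-Python model of how a page specification maps to zero-based page indexes. The helper expand_page_spec is hypothetical and only illustrates the semantics described above; it is not pypdf's own code.

```python
def expand_page_spec(pages):
    """Return the zero-based page indexes selected by a tuple or a list."""
    if isinstance(pages, tuple):
        # (start, stop) or (start, stop, step): same meaning as range().
        return list(range(*pages))
    # A list names each page explicitly, in the order given.
    return list(pages)

print(expand_page_spec((0, 5)))      # [0, 1, 2, 3, 4]
print(expand_page_spec((0, 10, 2)))  # [0, 2, 4, 6, 8]
print(expand_page_spec([0, 3, 7]))   # [0, 3, 7]
```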
merge for inserting (not just appending)
merge(insert_index, filename, range_or_list):
writer.merge(2, 'Recursion_Chapter1.pdf', (0, 5))
Copies pages 0–4 from the source and inserts them starting at index 2 in the writer; existing pages shift back.
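The insert-and-shift behavior can be pictured with ordinary Python lists. This is only an analogy under the assumption that pages behave like list items; merge_pages is a hypothetical helper, not a pypdf function.

```python
def merge_pages(writer_pages, insert_index, source_pages, page_range):
    """Insert the selected source pages into writer_pages at insert_index."""
    start, stop = page_range
    selected = source_pages[start:stop]
    # Slice assignment inserts without overwriting; later items shift back.
    writer_pages[insert_index:insert_index] = selected
    return writer_pages

writer_pages = ['W0', 'W1', 'W2', 'W3']
source_pages = [f'S{n}' for n in range(8)]
print(merge_pages(writer_pages, 2, source_pages, (0, 5)))
# ['W0', 'W1', 'S0', 'S1', 'S2', 'S3', 'S4', 'W2', 'W3']
```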
Rotating pages
Use Page.rotate(degrees) where degrees is ±90, ±180, or ±270.
import pypdf
writer = pypdf.PdfWriter()
writer.append('Recursion_Chapter1.pdf')
for i in range(len(writer.pages)):
    writer.pages[i].rotate(90)
with open('rotated.pdf', 'wb') as f:
    writer.write(f)
No support for arbitrary angles (only 90° increments).
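Because only 90° increments are accepted, it can help to validate an angle before passing it to rotate(). The helper below is a hypothetical convenience, not part of pypdf; it assumes positive angles mean clockwise rotation and reduces any multiple of 90 to an equivalent value between 0 and 270.

```python
def normalize_rotation(degrees):
    """Return the equivalent rotation in {0, 90, 180, 270}, or raise ValueError."""
    if degrees % 90 != 0:
        raise ValueError(f'rotation must be a multiple of 90 degrees, got {degrees}')
    # Python's % always returns a non-negative result for a positive modulus.
    return degrees % 360

print(normalize_rotation(-90))  # 270
print(normalize_rotation(450))  # 90
```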
Inserting blank pages
add_blank_page() appends; insert_blank_page(index=...) inserts.
import pypdf
writer = pypdf.PdfWriter()
writer.append('Recursion_Chapter1.pdf')
writer.add_blank_page() # blank at end
writer.insert_blank_page(index=2) # blank at logical page 3
with open('with_blanks.pdf', 'wb') as f:
    writer.write(f)
Blank pages inherit size from existing pages.
Watermarks and overlays
Use Page.merge_page(other_page, over=...):
import pypdf
writer = pypdf.PdfWriter()
writer.append('Recursion_Chapter1.pdf')
watermark_page = pypdf.PdfReader('watermark.pdf').pages[0]
for page in writer.pages:
    page.merge_page(watermark_page, over=False) # watermark/underlay
with open('with_watermark.pdf', 'wb') as f:
    writer.write(f)
- over=False → watermark (underlay, beneath existing content).
- over=True → stamp/overlay (above existing content).
- merge_page is a Page method; don't confuse it with PdfWriter.merge.
Encrypting and decrypting PDFs
Encrypting
import pypdf
writer = pypdf.PdfWriter()
writer.append('Recursion_Chapter1.pdf')
writer.encrypt('swordfish', algorithm='AES-256')
with open('encrypted.pdf', 'wb') as f:
    writer.write(f)
- encrypt(password, algorithm='AES-256') uses strong AES-256 encryption.
- You can provide user and owner passwords separately.
- PDFs have no password reset; forget the password → data is effectively lost.
Detecting and decrypting
import pypdf
reader = pypdf.PdfReader('encrypted.pdf')
writer = pypdf.PdfWriter()
print(reader.is_encrypted) # True
print(reader.decrypt('wrong').name) # 'NOT_DECRYPTED'
print(reader.decrypt('swordfish').name) # 'OWNER_PASSWORD' or 'USER_PASSWORD'
writer.append(reader)
with open('decrypted.pdf', 'wb') as f:
    writer.write(f)
- reader.is_encrypted indicates whether the PDF is encrypted.
- reader.decrypt(password) returns an object whose .name indicates success or failure.
Project: Combine select pages from many PDFs
Goal: in a folder with many PDFs whose first page is a cover sheet, create one combined PDF that skips the first page of each input file.
# combine_pdfs.py - Combines all PDFs in CWD into a single PDF
import pypdf, os

# Get all the PDF filenames.
pdf_filenames = []
for filename in os.listdir('.'):
    if filename.endswith('.pdf'):
        pdf_filenames.append(filename)
pdf_filenames.sort(key=str.lower)

writer = pypdf.PdfWriter()

# Loop through all the PDF files:
for pdf_filename in pdf_filenames:
    reader = pypdf.PdfReader(pdf_filename)
    # Copy all pages after the first page:
    writer.append(pdf_filename, (1, len(reader.pages)))

# Save the resulting PDF to a file:
with open('combined.pdf', 'wb') as file:
    writer.write(file)
Ideas for similar programs:
- Cut out specific page ranges.
- Reverse/reorder pages.
- Keep only pages that contain certain text (using extract_text() for filtering).
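The text-filtering idea could be sketched as follows. The function find_pages_containing is a hypothetical helper; it only assumes each page object has an extract_text() method, so a stand-in FakePage class is used here to keep the example runnable without a PDF file.

```python
def find_pages_containing(pages, needle, case_sensitive=False):
    """Return zero-based indexes of pages whose extracted text contains needle."""
    if not case_sensitive:
        needle = needle.lower()
    matches = []
    for index, page in enumerate(pages):
        # Guard against pages that yield no text (e.g. image-only pages).
        text = page.extract_text() or ''
        if not case_sensitive:
            text = text.lower()
        if needle in text:
            matches.append(index)
    return matches


class FakePage:
    """Stand-in for a PDF page object; only extract_text() is modeled."""
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


demo_pages = [FakePage('Cover sheet'), FakePage('Recursion basics'),
              FakePage('More on RECURSION')]
print(find_pages_containing(demo_pages, 'recursion'))  # [1, 2]
```

With a real PDF, reader.pages would take the place of demo_pages, and the returned list of indexes could be passed to writer.append() as an explicit page list.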
Word documents with python-docx
Setup and data model
- Install python-docx (the package name), but import it with import docx.
- A .docx document has:
  - Document: the entire file.
  - Paragraph objects: each paragraph (a new one begins whenever the user presses ENTER/RETURN).
  - Run objects: contiguous text with the same style (bold, italic, etc.).
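To see where paragraphs and runs come from, it may help to look inside the file itself: a .docx is a ZIP archive whose word/document.xml stores paragraphs as w:p elements and runs as w:r elements. The sketch below uses only the standard library rather than python-docx, and the helper paragraphs_and_runs is hypothetical. It builds a tiny in-memory stand-in (not a complete Word file) and, unlike doc.paragraphs, would also pick up paragraphs nested inside tables.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

def paragraphs_and_runs(docx_file):
    """Return a list of paragraphs, each a list of run texts."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    result = []
    for para in root.iter(W + 'p'):
        runs = []
        for run in para.iter(W + 'r'):
            # A run's text lives in one or more w:t children.
            runs.append(''.join(t.text or '' for t in run.iter(W + 't')))
        result.append(runs)
    return result

# Minimal stand-in for a .docx: just the one XML part this sketch reads.
DOCUMENT_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body>'
    '<w:p><w:r><w:t>Document Title</w:t></w:r></w:p>'
    '<w:p>'
    '<w:r><w:t xml:space="preserve">A plain paragraph with some </w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r>'
    '</w:p>'
    '</w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('word/document.xml', DOCUMENT_XML)
buffer.seek(0)

print(paragraphs_and_runs(buffer))
# [['Document Title'], ['A plain paragraph with some ', 'bold']]
```

For real work, python-docx remains the more convenient interface; this only shows the structure it exposes.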
Styles:
- Paragraph styles (apply to Paragraph).
- Character styles (apply to Run).
- Linked styles (can apply to both; for a Run, you add ' Char' to the style name).
Reading Word documents
import docx
doc = docx.Document('demo.docx')
print(len(doc.paragraphs)) # e.g. 7
print(doc.paragraphs[0].text) # 'Document Title'
print(doc.paragraphs[1].text) # 'A plain paragraph with some bold and some italic'
print(len(doc.paragraphs[1].runs)) # 4
print(doc.paragraphs[1].runs[0].text) # 'A plain paragraph with some '
print(doc.paragraphs[1].runs[1].text) # 'bold'
print(doc.paragraphs[1].runs[2].text) # ' and some '
print(doc.paragraphs[1].runs[3].text) # 'italic'
The second paragraph has four runs matching style changes.
Getting full text from a .docx
Helper readDocx.py:
import docx
def get_text(filename):
    doc = docx.Document(filename)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    return '\n'.join(full_text)
Usage:
import readDocx
print(readDocx.get_text('demo.docx'))
Styling paragraphs and runs
Styles
View styles in Word/LibreOffice and note their names; defaults include 'Normal', 'Title', 'Heading 1', 'Quote', 'Intense Quote', 'List Bullet', etc.
Usage:
paragraph.style = 'Heading 1'
For linked style 'Quote' on a Run, use 'Quote Char':
run.style = 'Quote Char'
Run attributes (bold/italic/etc.)
Run attributes can be True, False, or None (inherit style): bold, italic, underline, strike, double_strike, all_caps, small_caps, shadow, outline, rtl, imprint, emboss.
Example restyling demo.docx:
import docx
doc = docx.Document('demo.docx')
doc.paragraphs[0].style = 'Normal' # change title to Normal
p = doc.paragraphs[1]
p.runs[0].style = 'Quote Char' # quote style for the first part
p.runs[1].underline = True # underline 'bold'
p.runs[3].underline = True # underline 'italic'
doc.save('restyled.docx')
Overall idea of the chapter
Chapter 17 shows how to automate PDF and Word tasks: extracting text and images from PDFs with pypdf, combining/rotating/watermarking/encrypting PDFs with PdfWriter, and reading/writing .docx files with python-docx (paragraphs, runs, styles). The key pattern is: read the document object model, manipulate programmatically, save to a new file.