Chapter 22 - Recognizing Text in Images (Python)
Here's a concise but thorough summary of Chapter 22, covering all sections and main ideas with small examples.
What OCR is and what this chapter uses
- Optical character recognition (OCR) means extracting text from an image so you can then use normal string/regex tools on it.
- The chapter uses Tesseract (an open-source OCR engine) via PyTesseract, plus Pillow for image handling and NAPS2 to produce searchable PDFs.
Installing Tesseract and PyTesseract
You must install the Tesseract engine separately, then install the Python wrapper PyTesseract (which also installs Pillow).
Platform specifics:
- Windows:
  - Download the installer from the UB Mannheim Tesseract page (github.com/UB-Mannheim/tesseract/wiki).
  - During install you can optionally add "Additional script data" and "Additional language data" for non-English languages (these .traineddata files add ~600 MB if you install all of them).
  - Add C:\Program Files\Tesseract-OCR (or your chosen folder) to your PATH so tesseract.exe is callable.
- macOS:
  - Install Homebrew, then run brew install tesseract in Terminal, plus brew install tesseract-lang for extra languages.
- Linux:
  - sudo apt install tesseract-ocr installs the core engine.
  - sudo apt install tesseract-ocr-all installs all languages, or tesseract-ocr-fra / tesseract-ocr-deu / tesseract-ocr-jpn etc. for specific ISO-code packs.
- PyTesseract:
  - Install via pip; PyTesseract expects tesseract to be on PATH and works with Pillow images.
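Since PyTesseract just shells out to the tesseract binary, a quick standard-library check (a sketch, not from the chapter; the function name is my own) can confirm the PATH setup worked before you start OCRing:

```python
import shutil

def tesseract_available():
    """Return True if the tesseract binary is findable on PATH."""
    return shutil.which('tesseract') is not None

print(tesseract_available())
```

If this prints False, PyTesseract calls will fail until the install folder is added to PATH.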
OCR fundamentals with PyTesseract
Basic pattern to get text from an image:
import pytesseract as tess
from PIL import Image
img = Image.open('ocr-example.png')
text = tess.image_to_string(img)
print(text)
- Image.open() loads the image; tess.image_to_string(img) returns a string with the recognized text.
- Works very well on clean, computer-generated text (like screenshots of a book), but you should always assume some imperfect recognition.
Typical OCR issues
Even on good images, OCR text often:
- Keeps end-of-line hyphenation (e.g. "dig-" + "ital", "pro-" + "grams").
- Drops layout info like fonts, sizes, columns, and exact whitespace.
- Misreads characters (e.g. lowercase j vs. i).
- Jumbles text from tables/multi-column layouts.
Numbers are especially easy to mis-scan and hard to visually catch.
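Some of these issues can be cleaned up with ordinary string tools. As a taste, here is a minimal sketch (the regex is an assumption; real scans vary) that rejoins words hyphenated across line breaks:

```python
import re

def undo_hyphenation(text):
    # Remove a hyphen immediately followed by a newline, rejoining the word.
    return re.sub(r'-\n', '', text)

print(undo_hyphenation('dig-\nital pro-\ngrams'))  # digital programs
```

Misread characters and jumbled columns, by contrast, usually can't be fixed by simple rules.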
Preprocessing images (to improve accuracy)
For better results, you may pre-edit images (manually or via tools like OpenCV) before OCR:
- Avoid multi-column pages; split each column into its own image.
- Use typewritten, not handwritten, text.
- Use conventional fonts (no cursive/stylized fonts).
- Rotate image so text lines are perfectly upright.
- Prefer dark text on light background (not white on black).
- Remove dark borders and add a small white margin if text touches the edge.
- Improve brightness/contrast so text stands out.
- Remove noise dots/pixels.
The chapter references an OpenCV tutorial ("Preprocessing Images for OCR with Python and OpenCV") for automating these steps.
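Several of these steps can also be scripted with Pillow alone; here is a rough sketch (the function name and parameters are my own, not from the chapter):

```python
from PIL import Image, ImageOps

def preprocess_for_ocr(img, tilt_degrees=0.0):
    """Rough cleanup sketch: grayscale, stretch contrast, deskew, white margin."""
    img = img.convert('L')               # grayscale: dark text on light background
    img = ImageOps.autocontrast(img)     # improve brightness/contrast
    if tilt_degrees:
        # Rotate so text lines sit upright; fill the new corners with white.
        img = img.rotate(tilt_degrees, expand=True, fillcolor=255)
    return ImageOps.expand(img, border=20, fill=255)  # small white margin
```

You would call this on the Pillow Image before passing it to image_to_string().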
Using LLMs to fix OCR mistakes
- OCR errors are often subtle character/spacing issues that spell-check can't catch.
- Large language models (ChatGPT, Gemini, LLaMA, etc.) are well-suited to cleaning OCR output given a carefully worded prompt.
Example workflow using a rough scan of a 19th-century Frankenstein page:
- Tesseract output has issues like Iv instead of It, capitalization errors, and hyphenated line breaks.
- You give the LLM a prompt instructing it to:
  - Fix spacing and character recognition errors only.
  - Not "correct" original spelling/grammar.
  - Put each paragraph on a single line.
  - Undo hyphenated line breaks.
- The LLM can turn the OCR text into a cleaner version (e.g., Iv → It, Motion → motion, rejoined words).
But:
- LLMs can miss errors, change the wrong things, or introduce new mistakes, so you still need human review (possibly two humans).
Recognizing non-English text with Tesseract
After installing language packs, you can see what you have with:
import pytesseract as tess
tess.get_languages()
# ['afr', 'amh', 'ara', ..., 'vie', 'yid', 'yor']
Codes are mostly ISO 639-3; some add script suffixes like _cyrl for specific writing systems.
To recognize a specific language:
import pytesseract as tess
from PIL import Image
img = Image.open('frankenstein_jpn.png')
text = tess.image_to_string(img, lang='jpn')
print(text) # Japanese text string
If you instead use lang='eng' on Japanese text, Tesseract returns English-looking gibberish.
To recognize multiple languages in one image:
tess.image_to_string(img, lang='eng+jpn')
Combine codes with + for multi-language OCR.
NAPS2: creating searchable PDFs with OCR
NAPS2 (Not Another PDF Scanner 2):
- Free, open source app for Windows/macOS/Linux.
- Can combine images/PDFs into a single PDF and embed Tesseract text layer at correct positions so the PDF is searchable and copyable.
- Can be run headlessly from Python using subprocess, with no GUI pop-ups.
Installing NAPS2
Download from https://www.naps2.com/download.
Per OS:
- Windows / macOS: run the installer normally.
- Linux: download the Flatpak file, then:
flatpak install naps2-X.X.X-linux-x64.flatpak
flatpak run com.naps2.Naps2 # to launch GUI
Running NAPS2 from Python to OCR images into PDF
Example: create output.pdf from frankenstein.png with embedded English OCR text:
import subprocess
naps2_path = [r'C:\Program Files\NAPS2\NAPS2.Console.exe'] # Windows
proc = subprocess.run(
naps2_path + [
'-i', 'frankenstein.png',
'-o', 'output.pdf',
'--install', 'ocr-eng',
'--ocrlang', 'eng',
'-n', '0',
'-f',
'-v',
],
capture_output=True,
)
macOS path:
naps2_path = ['/Applications/NAPS2.app/Contents/MacOS/NAPS2', 'console']
Linux path:
naps2_path = ['flatpak', 'run', 'com.naps2.Naps2', 'console']
The command generates output.pdf with your image, plus an invisible OCR text layer so you can select/search/copy text.
Meaning of key arguments:
- -i 'frankenstein.png' – input image(s).
- -o 'output.pdf' – output PDF file path.
- --install 'ocr-eng' – install the English OCR language pack if missing (ocr- + language code).
- --ocrlang 'eng' – language(s) for OCR; can be 'eng+jpn+rus', etc.
- -n '0' – do zero scans (don't use scanner hardware, just images/PDFs).
- -f – overwrite an existing output.pdf.
- -v – verbose mode.
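The three per-OS command prefixes above can be folded into one helper; a sketch (the helper name is mine, and the paths are the default install locations the chapter assumes):

```python
import sys

def naps2_command():
    """Pick the NAPS2 console invocation for the current OS (default install paths)."""
    if sys.platform == 'win32':
        return [r'C:\Program Files\NAPS2\NAPS2.Console.exe']
    elif sys.platform == 'darwin':
        return ['/Applications/NAPS2.app/Contents/MacOS/NAPS2', 'console']
    else:  # Linux via Flatpak
        return ['flatpak', 'run', 'com.naps2.Naps2', 'console']
```

Then naps2_path = naps2_command() replaces the hardcoded list in the subprocess example.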
Specifying input pages / multiple files
The -i argument accepts a mini-language:
- Multiple files separated by ;: 'cat.png;dog.png;moose.png' → cat = page 1, dog = page 2, moose = page 3.
- Specific pages in PDFs using Python-style indices/slices (0-based): 'spam.pdf[0:2];eggs.pdf[-1]' → spam pages 1–2, then the last page of eggs.
You can mix images and PDFs and choose precise pages, including negative indices from the end.
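Since the -i value is just file/page specs joined with semicolons, you can assemble it from a Python list; a small sketch (the function name is hypothetical):

```python
def build_input_arg(specs):
    """Join file/page specs into the single string NAPS2's -i flag expects."""
    return ';'.join(specs)

print(build_input_arg(['spam.pdf[0:2]', 'eggs.pdf[-1]']))  # spam.pdf[0:2];eggs.pdf[-1]
```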
If NAPS2 doesn't fit your needs, the chapter recommends the ocrmypdf Python package as an alternative.
Practice program: Browser Text Scraper
Task: ocrscreen.py should:
- Take a screenshot of the whole screen via PyAutoGUI.
- Crop to the region that contains the embedded PDF/text (using fixed coordinates).
- Run OCR on the cropped image with PyTesseract.
- Append the recognized text to output.txt.
Template given:
import pyautogui
# TODO - Add the additionally needed import statements.
# The coordinates for the text portion. Change as needed:
LEFT = 400
TOP = 200
RIGHT = 1000
BOTTOM = 800
# Capture a screenshot:
img = pyautogui.screenshot()
# Crop the screenshot to the text portion:
img = img.crop((LEFT, TOP, RIGHT, BOTTOM))
# Run OCR on the cropped image:
# TODO - Add the PyTesseract code here.
# Add the OCR text to the end of output.txt:
# TODO - Call open() in append mode and append the OCR text.
Using this, you can scroll embedded "uncopyable" documents into view, run the script repeatedly, and reconstruct the full text into a local file.
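The final append step of the exercise uses open() in append mode; a sketch of just that piece (the filename comes from the template, and the sample string is a stand-in for the OCR result):

```python
text = 'sample OCR text'  # stand-in for tess.image_to_string(img)

# 'a' mode appends, so repeated runs of the script accumulate text:
with open('output.txt', 'a', encoding='utf-8') as f:
    f.write(text + '\n')
```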
Overall idea of the chapter
Chapter 22 covers OCR: Tesseract (via PyTesseract) extracts text from images, works best on clean typewritten text with known language, and can handle multiple languages with lang='eng+jpn'. Preprocessing (rotation, contrast, noise removal) and LLMs can improve accuracy but always require human review. NAPS2 creates searchable PDFs with embedded OCR text layers, controllable from Python via subprocess.