Chapter 22 - Recognizing Text in Images (Python)
Here's a concise but thorough summary of Chapter 22, covering all sections and main ideas with small examples.
What OCR is and what this chapter uses
- Optical character recognition (OCR) means extracting text from an image so you can then use normal string/regex tools on it.
- The chapter uses Tesseract (an open-source OCR engine) via PyTesseract, plus Pillow for image handling and NAPS2 to produce searchable PDFs.
Installing Tesseract and PyTesseract
You must install the Tesseract engine separately, then install the Python wrapper PyTesseract (which also installs Pillow).
Platform specifics:
- Windows:
  - Download the installer from the UB Mannheim Tesseract page (github.com/UB-Mannheim/tesseract/wiki).
  - During install you can optionally add "Additional script data" and "Additional language data" for non-English languages (these .traineddata files add ~600 MB if you install all of them).
  - Add C:\Program Files\Tesseract-OCR (or your chosen folder) to your PATH so tesseract.exe is callable.
- macOS:
  - Install Homebrew, then run brew install tesseract in Terminal, plus brew install tesseract-lang for extra languages.
- Linux:
  - sudo apt install tesseract-ocr installs the core engine.
  - sudo apt install tesseract-ocr-all installs all languages, or tesseract-ocr-fra / tesseract-ocr-deu / tesseract-ocr-jpn etc. for specific ISO-code packs.
- PyTesseract:
  - Install via pip; PyTesseract expects tesseract to be on PATH and works with Pillow images.
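Since PyTesseract just shells out to the tesseract binary, a quick standard-library check (a sketch, not from the chapter; the function name is my own) can confirm the PATH setup worked before you start OCRing:

```python
import shutil

def tesseract_available():
    """Return True if the tesseract binary is findable on PATH."""
    return shutil.which('tesseract') is not None

print(tesseract_available())
```

If this prints False, PyTesseract calls will fail until the install folder is added to PATH.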
OCR fundamentals with PyTesseract
Basic pattern to get text from an image:
import pytesseract as tess
from PIL import Image
img = Image.open('ocr-example.png')
text = tess.image_to_string(img)
print(text)
- Image.open() loads the image; tess.image_to_string(img) returns a string with the recognized text.
- Works very well on clean, computer-generated text (like screenshots of a book), but you should always assume some imperfect recognition.
Typical OCR issues
Even on good images, OCR text often:
- Keeps end-of-line hyphenation (e.g. "dig-" + "ital", "pro-" + "grams").
- Drops layout info like fonts, sizes, columns, and exact whitespace.
- Misreads characters (e.g. lowercase j vs. i).
- Jumbles text from tables/multi-column layouts.
Numbers are especially easy to mis-scan and hard to visually catch.
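Some of these issues can be cleaned up with ordinary string tools. As a taste, here is a minimal sketch (the regex is an assumption; real scans vary) that rejoins words hyphenated across line breaks:

```python
import re

def undo_hyphenation(text):
    # Remove a hyphen immediately followed by a newline, rejoining the word.
    return re.sub(r'-\n', '', text)

print(undo_hyphenation('dig-\nital pro-\ngrams'))  # digital programs
```

Misread characters and jumbled columns, by contrast, usually can't be fixed by simple rules.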
Preprocessing images (to improve accuracy)
For better results, you may pre-edit images (manually or via tools like OpenCV) before OCR:
- Avoid multi-column pages; split each column into its own image.
- Use typewritten, not handwritten, text.
- Use conventional fonts (no cursive/stylized fonts).
- Rotate image so text lines are perfectly upright.
- Prefer dark text on light background (not white on black).
- Remove dark borders and add a small white margin if text touches the edge.
- Improve brightness/contrast so text stands out.
- Remove noise dots/pixels.
The chapter references an OpenCV tutorial ("Preprocessing Images for OCR with Python and OpenCV") for automating these steps.
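Several of these steps can also be scripted with Pillow alone; here is a rough sketch (the function name and parameters are my own, not from the chapter):

```python
from PIL import Image, ImageOps

def preprocess_for_ocr(img, tilt_degrees=0.0):
    """Rough cleanup sketch: grayscale, stretch contrast, deskew, white margin."""
    img = img.convert('L')               # grayscale: dark text on light background
    img = ImageOps.autocontrast(img)     # improve brightness/contrast
    if tilt_degrees:
        # Rotate so text lines sit upright; fill the new corners with white.
        img = img.rotate(tilt_degrees, expand=True, fillcolor=255)
    return ImageOps.expand(img, border=20, fill=255)  # small white margin
```

You would call this on the Pillow Image before passing it to image_to_string().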
Using LLMs to fix OCR mistakes
- OCR errors are often subtle character/spacing issues that spell-check can't catch.
- Large language models (ChatGPT, Gemini, LLaMA, etc.) are well-suited to cleaning OCR output given a carefully worded prompt.
Example workflow using a rough scan of a 19th-century Frankenstein page:
- Tesseract output has issues like Iv instead of It, capitalization errors, and hyphenated line breaks.
- You give the LLM a prompt instructing it to:
  - Fix spacing and character recognition errors only.
  - Not "correct" original spelling/grammar.
  - Put each paragraph on a single line.
  - Undo hyphenated line breaks.
- The LLM can turn the OCR text into a cleaner version (e.g., Iv → It, Motion → motion, rejoined words).
But:
- LLMs can miss errors, change the wrong things, or introduce new mistakes, so you still need human review (possibly two humans).
Recognizing non-English text with Tesseract
After installing language packs, you can see what you have with:
import pytesseract as tess
tess.get_languages()
# ['afr', 'amh', 'ara', ..., 'vie', 'yid', 'yor']
Codes are mostly ISO 639-3; some add script suffixes like _cyrl for specific writing systems.
To recognize a specific language:
import pytesseract as tess
from PIL import Image
img = Image.open('frankenstein_jpn.png')
text = tess.image_to_string(img, lang='jpn')
print(text) # Japanese text string
If you instead use lang='eng' on Japanese text, Tesseract returns English-looking gibberish.
To recognize multiple languages in one image:
tess.image_to_string(img, lang='eng+jpn')
Combine codes with + for multi-language OCR.
NAPS2: creating searchable PDFs with OCR
NAPS2 (Not Another PDF Scanner 2):
- Free, open source app for Windows/macOS/Linux.
- Can combine images/PDFs into a single PDF and embed Tesseract text layer at correct positions so the PDF is searchable and copyable.
- Can be run headlessly from Python using subprocess, with no GUI pop-ups.
Installing NAPS2
Download from https://www.naps2.com/download.
Per OS:
- Windows / macOS: run the installer normally.
- Linux: download the Flatpak file, then:
flatpak install naps2-X.X.X-linux-x64.flatpak
flatpak run com.naps2.Naps2 # to launch GUI
Running NAPS2 from Python to OCR images into PDF
Example: create output.pdf from frankenstein.png with embedded English OCR text:
import subprocess
naps2_path = [r'C:\Program Files\NAPS2\NAPS2.Console.exe'] # Windows
proc = subprocess.run(
naps2_path + [
'-i', 'frankenstein.png',
'-o', 'output.pdf',
'--install', 'ocr-eng',
'--ocrlang', 'eng',
'-n', '0',
'-f',
'-v',
],
capture_output=True,
)
macOS path:
naps2_path = ['/Applications/NAPS2.app/Contents/MacOS/NAPS2', 'console']
Linux path:
naps2_path = ['flatpak', 'run', 'com.naps2.Naps2', 'console']
The command generates output.pdf with your image, plus an invisible OCR text layer so you can select/search/copy text.
Meaning of key arguments:
- -i 'frankenstein.png' – input image(s).
- -o 'output.pdf' – output PDF file path.
- --install 'ocr-eng' – install the English OCR language pack if missing (ocr- + language code).
- --ocrlang 'eng' – language(s) for OCR; can be 'eng+jpn+rus', etc.
- -n '0' – do zero scans (don't use scanner hardware, just images/PDFs).
- -f – overwrite an existing output.pdf.
- -v – verbose mode.
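The three per-OS command prefixes above can be folded into one helper; a sketch (the helper name is mine, and the paths are the default install locations the chapter assumes):

```python
import sys

def naps2_command():
    """Pick the NAPS2 console invocation for the current OS (default install paths)."""
    if sys.platform == 'win32':
        return [r'C:\Program Files\NAPS2\NAPS2.Console.exe']
    elif sys.platform == 'darwin':
        return ['/Applications/NAPS2.app/Contents/MacOS/NAPS2', 'console']
    else:  # Linux via Flatpak
        return ['flatpak', 'run', 'com.naps2.Naps2', 'console']
```

Then naps2_path = naps2_command() replaces the hardcoded list in the subprocess example.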
Specifying input pages / multiple files
The -i argument accepts a mini-language:
- Multiple files separated by ;: 'cat.png;dog.png;moose.png' → cat = page 1, dog = page 2, moose = page 3.
- Specific pages in PDFs using Python-style indices/slices (0-based): 'spam.pdf[0:2];eggs.pdf[-1]' → spam pages 1–2, then the last page of eggs.
You can mix images and PDFs and choose precise pages, including negative indices from the end.
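Since the -i value is just file/page specs joined with semicolons, you can assemble it from a Python list; a small sketch (the function name is hypothetical):

```python
def build_input_arg(specs):
    """Join file/page specs into the single string NAPS2's -i flag expects."""
    return ';'.join(specs)

print(build_input_arg(['spam.pdf[0:2]', 'eggs.pdf[-1]']))  # spam.pdf[0:2];eggs.pdf[-1]
```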
If NAPS2 doesn't fit your needs, the chapter recommends the ocrmypdf Python package as an alternative.
Practice program: Browser Text Scraper
Task: ocrscreen.py should:
- Take a screenshot of the whole screen via PyAutoGUI.
- Crop to the region that contains the embedded PDF/text (using fixed coordinates).
- Run OCR on the cropped image with PyTesseract.
- Append the recognized text to output.txt.
Template given:
import pyautogui
# TODO - Add the additionally needed import statements.
# The coordinates for the text portion. Change as needed:
LEFT = 400
TOP = 200
RIGHT = 1000
BOTTOM = 800
# Capture a screenshot:
img = pyautogui.screenshot()
# Crop the screenshot to the text portion:
img = img.crop((LEFT, TOP, RIGHT, BOTTOM))
# Run OCR on the cropped image:
# TODO - Add the PyTesseract code here.
# Add the OCR text to the end of output.txt:
# TODO - Call open() in append mode and append the OCR text.
Using this, you can scroll embedded "uncopyable" documents into view, run the script repeatedly, and reconstruct the full text into a local file.
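The final append step of the exercise uses open() in append mode; a sketch of just that piece (the filename comes from the template, and the sample string is a stand-in for the OCR result):

```python
text = 'sample OCR text'  # stand-in for tess.image_to_string(img)

# 'a' mode appends, so repeated runs of the script accumulate text:
with open('output.txt', 'a', encoding='utf-8') as f:
    f.write(text + '\n')
```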
Overall idea of the chapter
Chapter 22 covers OCR: Tesseract (via PyTesseract) extracts text from images, works best on clean typewritten text with known language, and can handle multiple languages with lang='eng+jpn'. Preprocessing (rotation, contrast, noise removal) and LLMs can improve accuracy but always require human review. NAPS2 creates searchable PDFs with embedded OCR text layers, controllable from Python via subprocess.