
Chapter 22 - Recognizing Text in Images (JavaScript)

Here's a JavaScript-flavoured version of the same concepts, with small JS/Node examples for each idea.


What OCR is and what this chapter uses

Same idea: extract text from images so you can work with it as strings. In JavaScript:

  • tesseract.js – a pure JavaScript/WebAssembly port of Tesseract that runs in Node and the browser (no native binary needed).
  • sharp (from Chapter 21) for image preprocessing.
  • For searchable PDFs, you can still call NAPS2 via child_process, or use pdf-lib to embed text layers.
npm install tesseract.js

Installing and setup

Unlike Python's PyTesseract, which requires a separate Tesseract binary, tesseract.js is self-contained: it downloads language data automatically on first use.

const Tesseract = require("tesseract.js");

No PATH configuration needed. Language .traineddata files are fetched from a CDN by default (or you can bundle them locally).


OCR fundamentals with tesseract.js

Basic pattern to get text from an image:

const Tesseract = require("tesseract.js");

async function recognize(imagePath) {
  const { data } = await Tesseract.recognize(imagePath, "eng");
  console.log(data.text);
}

recognize("ocr-example.png");

  • Tesseract.recognize(image, lang) returns an object with data.text containing the recognized string.
  • Works with file paths, Buffers, or URLs.

Using a worker for multiple images (better performance)

const Tesseract = require("tesseract.js");

async function main() {
  const worker = await Tesseract.createWorker("eng");

  const { data: data1 } = await worker.recognize("page1.png");
  console.log(data1.text);

  const { data: data2 } = await worker.recognize("page2.png");
  console.log(data2.text);

  await worker.terminate();
}

main();

Creating a worker once and reusing it avoids reloading the language model each time.

Typical OCR issues

Same issues as Python:

  • End-of-line hyphenation preserved.
  • Layout info (fonts, columns) lost.
  • Character misreads (especially numbers).
  • Multi-column text gets jumbled.
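
The first issue — end-of-line hyphenation — can often be partly repaired with plain string processing before reaching for an LLM. A minimal heuristic sketch (the regexes are assumptions that work for simple prose, not a general fix):

```javascript
// Heuristic OCR cleanup: rejoin hyphenated line breaks and put each
// paragraph on one line, preserving blank-line paragraph breaks.
function cleanOcr(text) {
  return text
    .replace(/-\n(?=\w)/g, "")        // "auto-\nmate" -> "automate"
    .replace(/(?<!\n)\n(?!\n)/g, " ") // single newlines -> spaces
    .replace(/\n{2,}/g, "\n\n")       // normalize paragraph breaks
    .trim();
}

cleanOcr("auto-\nmate the\nboring stuff.\n\nNext para.");
// -> "automate the boring stuff.\n\nNext para."
```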

Preprocessing images (to improve accuracy)

Use sharp to preprocess before OCR:

const sharp = require("sharp");
const Tesseract = require("tesseract.js");

async function ocrWithPreprocessing(imagePath) {
  // Convert to grayscale, increase contrast, sharpen
  const processed = await sharp(imagePath)
    .greyscale()
    .normalize() // auto contrast
    .sharpen()
    .toBuffer();

  const { data } = await Tesseract.recognize(processed, "eng");
  return data.text;
}

Same guidelines apply:

  • Avoid multi-column pages.
  • Use typewritten, conventional fonts.
  • Rotate so text is perfectly upright.
  • Dark text on light background.
  • Remove borders and noise.

Using LLMs to fix OCR mistakes

Same approach works in JavaScript — send OCR output to an LLM API:

// Example using fetch with an LLM API
async function cleanOcrText(rawText) {
const response = await fetch("https://api.anthropic.com/v1/messages", {
method: "POST",
headers: {
"Content-Type": "application/json",
"x-api-key": process.env.ANTHROPIC_API_KEY,
"anthropic-version": "2023-06-01",
},
body: JSON.stringify({
model: "claude-sonnet-4-5-20250929",
max_tokens: 4096,
messages: [
{
role: "user",
content: `Fix OCR errors in this text. Only fix spacing and character recognition errors. Do not correct original spelling/grammar. Join hyphenated line breaks. Put each paragraph on one line.\n\n${rawText}`,
},
],
}),
});

const data = await response.json();
return data.content[0].text;
}

Always review LLM output — it can miss errors or introduce new ones.


Recognizing non-English text

tesseract.js supports the same language codes. Pass the language when creating a worker or calling recognize:

const Tesseract = require("tesseract.js");

// Japanese OCR
const { data } = await Tesseract.recognize("frankenstein_jpn.png", "jpn");
console.log(data.text);

Multiple languages:

// English + Japanese
const { data } = await Tesseract.recognize("mixed_text.png", "eng+jpn");
console.log(data.text);

Language data is downloaded automatically on first use for each language code.


Creating searchable PDFs with OCR

Option 1: Call NAPS2 via child_process

Same approach as Python — NAPS2 works the same regardless of calling language:

const { execFileSync } = require("child_process");

// macOS — pass arguments as an array so paths with spaces stay intact
execFileSync("/Applications/NAPS2.app/Contents/MacOS/NAPS2", [
  "console",
  "-i", "frankenstein.png",
  "-o", "output.pdf",
  "--install", "ocr-eng",
  "--ocrlang", "eng",
  "-n", "0",
  "-f",
]);

Option 2: tesseract.js + pdf-lib

Generate a searchable PDF programmatically by overlaying invisible text on the image:

const fs = require("fs");
const Tesseract = require("tesseract.js");

// tesseract.js can output PDF directly
async function createSearchablePdf(imagePath, outputPath) {
  const worker = await Tesseract.createWorker("eng");

  await worker.recognize(imagePath);

  // Get PDF output
  const pdf = await worker.getPDF("OCR Result");
  fs.writeFileSync(outputPath, Buffer.from(pdf.data));

  await worker.terminate();
}

createSearchablePdf("frankenstein.png", "output.pdf");

worker.getPDF() generates a PDF with the image and an invisible text layer — equivalent to what NAPS2 produces.


Practice program: Browser Text Scraper

JavaScript equivalent using screenshot-desktop and tesseract.js:

npm install screenshot-desktop tesseract.js sharp

const screenshot = require("screenshot-desktop");
const sharp = require("sharp");
const Tesseract = require("tesseract.js");
const fs = require("fs");

// Coordinates for the text portion (adjust as needed)
const LEFT = 400;
const TOP = 200;
const WIDTH = 600; // RIGHT - LEFT
const HEIGHT = 600; // BOTTOM - TOP

async function ocrScreen() {
  // Capture screenshot
  const imgBuffer = await screenshot({ format: "png" });

  // Crop to text region
  const cropped = await sharp(imgBuffer)
    .extract({ left: LEFT, top: TOP, width: WIDTH, height: HEIGHT })
    .toBuffer();

  // Run OCR
  const { data } = await Tesseract.recognize(cropped, "eng");

  // Append to output.txt
  fs.appendFileSync("output.txt", data.text + "\n", "utf-8");
  console.log("Text appended to output.txt");
}

ocrScreen();

Overall idea of the chapter

Chapter 22 in JavaScript: tesseract.js provides a self-contained OCR engine (no native binary install needed) that runs in Node and browser, supports multiple languages with 'eng+jpn' syntax, and can output searchable PDFs directly with worker.getPDF(). Preprocessing with sharp (greyscale, normalize, sharpen) improves accuracy. For batch PDF creation, NAPS2 can still be called via child_process. LLM APIs can clean OCR output but always need human review.