
Chapter 22 - Recognizing Text in Images (JavaScript)

Here's a JavaScript-flavoured version of the same concepts, with small JS/Node examples for each idea.


What OCR is and what this chapter uses

Same idea: extract text from images so you can work with it as strings. In JavaScript:

  • tesseract.js – a pure JavaScript/WebAssembly port of Tesseract that runs in Node and the browser (no native binary needed).
  • sharp (from Chapter 21) for image preprocessing.
  • For searchable PDFs, you can still call NAPS2 via child_process, or use pdf-lib to embed text layers.
npm install tesseract.js

Installing and setup

Unlike Python's PyTesseract, which requires a separate Tesseract binary, tesseract.js is self-contained: it downloads language data automatically on first use.

const Tesseract = require("tesseract.js");

No PATH configuration needed. Language .traineddata files are fetched from a CDN by default (or you can bundle them locally).


OCR fundamentals with tesseract.js

Basic pattern to get text from an image:

const Tesseract = require("tesseract.js");

async function recognize(imagePath) {
  const { data } = await Tesseract.recognize(imagePath, "eng");
  console.log(data.text);
}

recognize("ocr-example.png");

  • Tesseract.recognize(image, lang) returns an object with data.text containing the recognized string.
  • Works with file paths, Buffers, or URLs.

Using a worker for multiple images (better performance)

const Tesseract = require("tesseract.js");

async function main() {
  const worker = await Tesseract.createWorker("eng");

  const { data: data1 } = await worker.recognize("page1.png");
  console.log(data1.text);

  const { data: data2 } = await worker.recognize("page2.png");
  console.log(data2.text);

  await worker.terminate();
}

main();

Creating a worker once and reusing it avoids reloading the language model each time.

Typical OCR issues

Same issues as Python:

  • End-of-line hyphenation preserved.
  • Layout info (fonts, columns) lost.
  • Character misreads (especially numbers).
  • Multi-column text gets jumbled.
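
The first issue — end-of-line hyphenation — can often be partly repaired with plain string processing before reaching for an LLM. A minimal heuristic sketch (the regexes are assumptions that work for simple prose, not a general fix):

```javascript
// Heuristic OCR cleanup: rejoin hyphenated line breaks and put each
// paragraph on one line, preserving blank-line paragraph breaks.
function cleanOcr(text) {
  return text
    .replace(/-\n(?=\w)/g, "")        // "auto-\nmate" -> "automate"
    .replace(/(?<!\n)\n(?!\n)/g, " ") // single newlines -> spaces
    .replace(/\n{2,}/g, "\n\n")       // normalize paragraph breaks
    .trim();
}

cleanOcr("auto-\nmate the\nboring stuff.\n\nNext para.");
// -> "automate the boring stuff.\n\nNext para."
```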

Preprocessing images (to improve accuracy)

Use sharp to preprocess before OCR:

const sharp = require("sharp");
const Tesseract = require("tesseract.js");

async function ocrWithPreprocessing(imagePath) {
  // Convert to grayscale, increase contrast, sharpen
  const processed = await sharp(imagePath)
    .greyscale()
    .normalize() // auto contrast
    .sharpen()
    .toBuffer();

  const { data } = await Tesseract.recognize(processed, "eng");
  return data.text;
}

Same guidelines apply:

  • Avoid multi-column pages.
  • Use typewritten, conventional fonts.
  • Rotate so text is perfectly upright.
  • Dark text on light background.
  • Remove borders and noise.

Using LLMs to fix OCR mistakes

Same approach works in JavaScript — send OCR output to an LLM API:

// Example using fetch with an LLM API
async function cleanOcrText(rawText) {
const response = await fetch("https://api.anthropic.com/v1/messages", {
method: "POST",
headers: {
"Content-Type": "application/json",
"x-api-key": process.env.ANTHROPIC_API_KEY,
"anthropic-version": "2023-06-01",
},
body: JSON.stringify({
model: "claude-sonnet-4-5-20250929",
max_tokens: 4096,
messages: [
{
role: "user",
content: `Fix OCR errors in this text. Only fix spacing and character recognition errors. Do not correct original spelling/grammar. Join hyphenated line breaks. Put each paragraph on one line.\n\n${rawText}`,
},
],
}),
});

const data = await response.json();
return data.content[0].text;
}

Always review LLM output — it can miss errors or introduce new ones.


Recognizing non-English text

tesseract.js supports the same language codes. Pass the language when creating a worker or calling recognize:

const Tesseract = require("tesseract.js");

// Japanese OCR
const { data } = await Tesseract.recognize("frankenstein_jpn.png", "jpn");
console.log(data.text);

Multiple languages:

// English + Japanese
const { data } = await Tesseract.recognize("mixed_text.png", "eng+jpn");
console.log(data.text);

Language data is downloaded automatically on first use for each language code.


Creating searchable PDFs with OCR

Option 1: Call NAPS2 via child_process

Same approach as Python — NAPS2 works the same regardless of calling language:

const { execFileSync } = require("child_process");

// macOS — pass arguments as an array so paths with spaces stay intact
execFileSync("/Applications/NAPS2.app/Contents/MacOS/NAPS2", [
  "console",
  "-i", "frankenstein.png",
  "-o", "output.pdf",
  "--install", "ocr-eng",
  "--ocrlang", "eng",
  "-n", "0",
  "-f",
]);

Option 2: tesseract.js + pdf-lib

Generate a searchable PDF programmatically by overlaying invisible text on the image:

const fs = require("fs");
const Tesseract = require("tesseract.js");

// tesseract.js can output PDF directly
async function createSearchablePdf(imagePath, outputPath) {
  const worker = await Tesseract.createWorker("eng");

  await worker.recognize(imagePath);

  // Get PDF output
  const pdf = await worker.getPDF("OCR Result");
  fs.writeFileSync(outputPath, Buffer.from(pdf.data));

  await worker.terminate();
}

createSearchablePdf("frankenstein.png", "output.pdf");

worker.getPDF() generates a PDF with the image and an invisible text layer — equivalent to what NAPS2 produces.


Practice program: Browser Text Scraper

JavaScript equivalent using screenshot-desktop and tesseract.js:

npm install screenshot-desktop tesseract.js sharp

const screenshot = require("screenshot-desktop");
const sharp = require("sharp");
const Tesseract = require("tesseract.js");
const fs = require("fs");

// Coordinates for the text portion (adjust as needed)
const LEFT = 400;
const TOP = 200;
const WIDTH = 600; // RIGHT - LEFT
const HEIGHT = 600; // BOTTOM - TOP

async function ocrScreen() {
  // Capture screenshot
  const imgBuffer = await screenshot({ format: "png" });

  // Crop to text region
  const cropped = await sharp(imgBuffer)
    .extract({ left: LEFT, top: TOP, width: WIDTH, height: HEIGHT })
    .toBuffer();

  // Run OCR
  const { data } = await Tesseract.recognize(cropped, "eng");

  // Append to output.txt
  fs.appendFileSync("output.txt", data.text + "\n", "utf-8");
  console.log("Text appended to output.txt");
}

ocrScreen();

Overall idea of the chapter

Chapter 22 in JavaScript: tesseract.js provides a self-contained OCR engine (no native binary install needed) that runs in Node and browser, supports multiple languages with 'eng+jpn' syntax, and can output searchable PDFs directly with worker.getPDF(). Preprocessing with sharp (greyscale, normalize, sharpen) improves accuracy. For batch PDF creation, NAPS2 can still be called via child_process. LLM APIs can clean OCR output but always need human review.