Chapter 17 - PDF and Word Documents (JavaScript)
Here's a JavaScript-flavoured version of the same concepts, using pdf-parse and pdf-lib for PDFs, and mammoth and docx for Word files.
Overview
PDF and Word files are binary documents. In JavaScript, the main packages are:
- pdf-parse – extract text from PDFs.
- pdf-lib – create, modify, merge, and rotate PDFs (it does not encrypt them; see the encryption section).
- docx (npm package) – create .docx Word files.
- mammoth – extract text/HTML from existing .docx files.
PDF documents with pdf-parse and pdf-lib
Extracting text
Use pdf-parse to extract all text from a PDF:
npm install pdf-parse
const fs = require("fs");
const pdfParse = require("pdf-parse");

// CommonJS modules have no top-level await, so the work runs inside an async function.
async function main() {
  const dataBuffer = fs.readFileSync("Recursion_Chapter1.pdf");
  const data = await pdfParse(dataBuffer);
  console.log(data.numpages); // number of pages
  console.log(data.text); // all extracted text
  fs.writeFileSync("recursion.txt", data.text, "utf-8");
}

main().catch(console.error);
pdf-parse returns all the text as one string. For per-page extraction, the more direct route is pdfjs-dist (Mozilla's PDF.js).
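For per-page text, a minimal sketch with pdfjs-dist might look like the following. It assumes a pdfjs-dist release that still ships the CommonJS legacy build (the same require path this chapter uses for image extraction). `itemsToText` and `extractPages` are hypothetical helper names, and joining items with a space only approximates the original layout.

```javascript
const fs = require("fs");

// Join one page's text items into a string. Each item returned by
// getTextContent() carries its text in `str`; putting a space between
// items is only an approximation of the original spacing.
function itemsToText(items) {
  return items.map((item) => item.str ?? "").join(" ");
}

// Return an array with one string per page.
async function extractPages(pdfPath) {
  // Loaded lazily so itemsToText can be reused without pdfjs-dist installed.
  const pdfjsLib = require("pdfjs-dist/legacy/build/pdf.js");
  const data = new Uint8Array(fs.readFileSync(pdfPath));
  const doc = await pdfjsLib.getDocument({ data }).promise;
  const pages = [];
  for (let i = 1; i <= doc.numPages; i++) {
    const page = await doc.getPage(i); // pages are numbered from 1
    const content = await page.getTextContent();
    pages.push(itemsToText(content.items));
  }
  return pages;
}

// Example:
//   extractPages("Recursion_Chapter1.pdf").then((pages) => console.log(pages[0]));
```

Keeping the join logic in its own function makes it easy to swap in a smarter layout heuristic later.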
Post-processing with AI
The same issues apply as in the Python version: hard newlines, words hyphenated across line breaks, and repeated headers and footers. The same suggestion applies too: use an LLM to clean up the extracted text.
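Before handing text to an LLM, a deterministic first pass can remove the most mechanical problems. This is a rough sketch, not part of pdf-parse: `cleanExtractedText` is a hypothetical helper, and its regular expressions assume that paragraphs are separated by blank lines. It makes no attempt to strip headers or footers.

```javascript
// Rejoin words split across lines and unwrap hard newlines inside
// paragraphs. Blank lines (paragraph breaks) are preserved.
function cleanExtractedText(text) {
  return text
    .replace(/\r\n/g, "\n") // normalise Windows line endings
    // "recur-\nsion" -> "recursion". Caveat: a genuinely hyphenated
    // compound that happens to break at the hyphen loses its hyphen.
    .replace(/(\w)-\n(\w)/g, "$1$2")
    // A single newline inside a paragraph becomes a space.
    .replace(/([^\n])\n(?!\n)/g, "$1 ")
    // Collapse runs of spaces and tabs.
    .replace(/[ \t]+/g, " ")
    .trim();
}

// Example: cleanExtractedText(data.text) on the pdf-parse result.
```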
Extracting images
For image extraction, use pdfjs-dist (PDF.js) or pdf-lib to access embedded objects. This is more complex than Python's pypdf:
npm install pdfjs-dist
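// This require path matches pdfjs-dist 2.x/3.x; newer ESM-only releases use a different entry point.
const pdfjsLib = require("pdfjs-dist/legacy/build/pdf.js");
const fs = require("fs");

// CommonJS modules have no top-level await, so the work runs inside an async function.
async function main() {
  const data = new Uint8Array(fs.readFileSync("document.pdf"));
  const doc = await pdfjsLib.getDocument({ data }).promise;
  for (let i = 1; i <= doc.numPages; i++) {
    const page = await doc.getPage(i);
    const ops = await page.getOperatorList();
    // Image operators would need further processing
    console.log(`Page ${i}: ${ops.fnArray.length} operations`);
  }
}

main().catch(console.error);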
Image extraction in JS is significantly more involved than Python's simple page.images API.
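As a first step toward image extraction, the operator list can at least be scanned for image-painting operators. The sketch below assumes the `OPS` table exported by pdfjs-dist; the set of operator names varies somewhat between releases, hence the filter. `countImageOps` is a hypothetical helper, and an image painted twice on a page is counted twice.

```javascript
// Count the operators in fnArray that paint an image.
// OPS maps operator names to numeric codes (pdfjsLib.OPS in pdfjs-dist).
function countImageOps(fnArray, OPS) {
  const imageCodes = new Set(
    [OPS.paintImageXObject, OPS.paintInlineImageXObject, OPS.paintJpegXObject]
      // Drop names that this pdfjs-dist release does not define.
      .filter((code) => code !== undefined)
  );
  return fnArray.filter((fn) => imageCodes.has(fn)).length;
}

// Inside the page loop from the example above:
//   const ops = await page.getOperatorList();
//   console.log(`Page ${i}: ${countImageOps(ops.fnArray, pdfjsLib.OPS)} image(s) painted`);
```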
Creating PDFs from other pages (pdf-lib)
pdf-lib is the main library for PDF manipulation in Node.js:
npm install pdf-lib
Copying pages between PDFs
const { PDFDocument } = require("pdf-lib");
const fs = require("fs");

// CommonJS modules have no top-level await, so the work runs inside an async function.
async function main() {
  const srcBytes = fs.readFileSync("Recursion_Chapter1.pdf");
  const srcDoc = await PDFDocument.load(srcBytes);
  const newDoc = await PDFDocument.create();
  const pages = await newDoc.copyPages(srcDoc, [0, 1, 2, 3, 4]); // first 5
  for (const page of pages) {
    newDoc.addPage(page);
  }
  const pdfBytes = await newDoc.save();
  fs.writeFileSync("first_five_pages.pdf", pdfBytes);
}

main().catch(console.error);
- PDFDocument.create() makes an empty PDF.
- copyPages(srcDoc, indices) copies specific pages.
- addPage(page) appends; insertPage(index, page) inserts.
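Because copyPages takes zero-based indices, a small helper can translate the 1-based page numbers people usually quote. `pageRange` is a hypothetical name, not part of pdf-lib.

```javascript
// Convert an inclusive, 1-based page range into zero-based indices,
// e.g. pages 1 through 5 become [0, 1, 2, 3, 4].
function pageRange(firstPage, lastPage) {
  if (firstPage < 1 || lastPage < firstPage) {
    throw new RangeError(`invalid page range ${firstPage}-${lastPage}`);
  }
  return Array.from(
    { length: lastPage - firstPage + 1 },
    (_, i) => firstPage - 1 + i
  );
}

// Usage with pdf-lib, inside an async function:
//   const pages = await newDoc.copyPages(srcDoc, pageRange(1, 5));
//   pages.forEach((page) => newDoc.addPage(page));
```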
Rotating pages
const { PDFDocument, degrees } = require("pdf-lib");
const fs = require("fs");

// CommonJS modules have no top-level await, so the work runs inside an async function.
async function main() {
  const pdfBytes = fs.readFileSync("Recursion_Chapter1.pdf");
  const doc = await PDFDocument.load(pdfBytes);
  for (const page of doc.getPages()) {
    page.setRotation(degrees(90));
  }
  const rotatedBytes = await doc.save();
  fs.writeFileSync("rotated.pdf", rotatedBytes);
}

main().catch(console.error);
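Use degrees(90), degrees(180), or degrees(270) for clockwise rotation. setRotation sets the page's absolute rotation; it does not add to any rotation the page already has.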
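To turn a page relative to its current orientation, one option is to read the existing angle with getRotation() and add to it. A minimal sketch, where `addRotation` is a hypothetical helper:

```javascript
// Add deltaDegrees to currentDegrees and normalise the result into 0..359.
// PDF page rotation must be a multiple of 90.
function addRotation(currentDegrees, deltaDegrees) {
  if (deltaDegrees % 90 !== 0) {
    throw new RangeError("rotation must be a multiple of 90 degrees");
  }
  return (((currentDegrees + deltaDegrees) % 360) + 360) % 360;
}

// Usage with pdf-lib, inside the page loop:
//   const current = page.getRotation().angle;
//   page.setRotation(degrees(addRotation(current, 90)));
```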
Inserting blank pages
const { PDFDocument } = require("pdf-lib");
const fs = require("fs");

// CommonJS modules have no top-level await, so the work runs inside an async function.
async function main() {
  const pdfBytes = fs.readFileSync("Recursion_Chapter1.pdf");
  const doc = await PDFDocument.load(pdfBytes);
  doc.addPage(); // blank at end (pdf-lib's default size is A4, not Letter)
  doc.insertPage(2); // blank at index 2
  const newBytes = await doc.save();
  fs.writeFileSync("with_blanks.pdf", newBytes);
}

main().catch(console.error);
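addPage and insertPage also accept an explicit [width, height] pair, measured in PDF points (72 per inch). The sketch below uses a hypothetical `inchesToPoints` helper to build a US Letter size.

```javascript
const POINTS_PER_INCH = 72;

// Convert inches to PDF points.
function inchesToPoints(inches) {
  return inches * POINTS_PER_INCH;
}

// US Letter is 8.5 x 11 inches.
const letterSize = [inchesToPoints(8.5), inchesToPoints(11)];

// Usage with pdf-lib, after loading `doc`:
//   doc.addPage(letterSize);       // blank Letter page at the end
//   doc.insertPage(2, letterSize); // blank Letter page at index 2
```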
Watermarks and overlays
pdf-lib supports embedding pages from other PDFs as overlays:
const { PDFDocument } = require("pdf-lib");
const fs = require("fs");

// CommonJS modules have no top-level await, so the work runs inside an async function.
async function main() {
  const mainBytes = fs.readFileSync("document.pdf");
  const watermarkBytes = fs.readFileSync("watermark.pdf");
  const mainDoc = await PDFDocument.load(mainBytes);
  const watermarkDoc = await PDFDocument.load(watermarkBytes);
  // Embed page 0 of the watermark PDF so it can be drawn onto other pages
  const [watermarkPage] = await mainDoc.embedPdf(watermarkDoc, [0]);
  for (const page of mainDoc.getPages()) {
    const { width, height } = page.getSize();
    page.drawPage(watermarkPage, {
      x: 0,
      y: 0,
      width,
      height,
    });
  }
  const result = await mainDoc.save();
  fs.writeFileSync("with_watermark.pdf", result);
}

main().catch(console.error);
drawPage places the watermark page as an overlay on each page.
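Stretching the watermark to the page's full width and height distorts it when the two PDFs have different aspect ratios. One alternative, sketched here with a hypothetical `fitInside` helper, is to scale it uniformly and centre it. The `opacity` value in the usage comment is an illustrative choice.

```javascript
// Scale a (srcWidth x srcHeight) box uniformly so it fits inside
// (dstWidth x dstHeight), and centre it. Returns drawPage-style options.
function fitInside(srcWidth, srcHeight, dstWidth, dstHeight) {
  const scale = Math.min(dstWidth / srcWidth, dstHeight / srcHeight);
  const width = srcWidth * scale;
  const height = srcHeight * scale;
  return {
    x: (dstWidth - width) / 2,
    y: (dstHeight - height) / 2,
    width,
    height,
  };
}

// Usage with pdf-lib, inside the page loop:
//   const { width, height } = page.getSize();
//   page.drawPage(watermarkPage, {
//     ...fitInside(watermarkPage.width, watermarkPage.height, width, height),
//     opacity: 0.3, // make the watermark translucent
//   });
```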
Encrypting and decrypting PDFs
pdf-lib does not natively support encryption. For encryption in Node.js, use qpdf (command-line tool) or muhammara:
const { execSync } = require("child_process");

// Encrypt with qpdf (must be installed on system)
execSync(
  'qpdf --encrypt swordfish swordfish 256 -- input.pdf encrypted.pdf'
);

// Decrypt
execSync(
  'qpdf --password=swordfish --decrypt encrypted.pdf decrypted.pdf'
);
Alternatively, use the muhammara npm package for programmatic encryption.
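When file names or passwords come from user input, passing them through a shell command string is risky. A hedged alternative is execFileSync with an argument array, which bypasses the shell. `qpdfEncryptArgs` and `encryptPdf` are hypothetical helpers; the qpdf flags are the same ones used in the example above.

```javascript
const { execFileSync } = require("child_process");

// Build the argument list for 256-bit encryption, using the same value
// for the user and owner passwords (as in the example above).
function qpdfEncryptArgs(inputPath, outputPath, password) {
  return ["--encrypt", password, password, "256", "--", inputPath, outputPath];
}

// Run qpdf without a shell, so spaces or quotes in arguments are passed
// through untouched. qpdf must be installed and on the PATH.
function encryptPdf(inputPath, outputPath, password) {
  execFileSync("qpdf", qpdfEncryptArgs(inputPath, outputPath, password));
}

// Example: encryptPdf("input.pdf", "encrypted.pdf", "swordfish");
```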
Project: Combine select pages from many PDFs
Same goal: skip the first page (cover sheet) of each PDF and combine the rest.
const { PDFDocument } = require("pdf-lib");
const fs = require("fs");
const path = require("path");

const OUTPUT = "combined.pdf";

// CommonJS modules have no top-level await, so the work runs inside an async function.
async function main() {
  const pdfFilenames = fs
    .readdirSync(".")
    // Match .pdf case-insensitively and skip the output of a previous run
    .filter((f) => path.extname(f).toLowerCase() === ".pdf" && f !== OUTPUT)
    .sort((a, b) => a.toLowerCase().localeCompare(b.toLowerCase()));

  const combined = await PDFDocument.create();
  for (const filename of pdfFilenames) {
    const srcBytes = fs.readFileSync(filename);
    const srcDoc = await PDFDocument.load(srcBytes);
    const pageCount = srcDoc.getPageCount();
    if (pageCount > 1) {
      // Copy all pages except the first
      const indices = Array.from({ length: pageCount - 1 }, (_, i) => i + 1);
      const pages = await combined.copyPages(srcDoc, indices);
      for (const page of pages) {
        combined.addPage(page);
      }
    }
  }

  const resultBytes = await combined.save();
  fs.writeFileSync(OUTPUT, resultBytes);
}

main().catch(console.error);
Word documents
Reading .docx files with mammoth
mammoth extracts text or HTML from .docx files:
npm install mammoth
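const mammoth = require("mammoth");

// extractRawText returns a promise; CommonJS has no top-level await, so use .then()
mammoth
  .extractRawText({ path: "demo.docx" })
  .then((result) => console.log(result.value)) // plain text of the document
  .catch(console.error);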
For HTML output:
mammoth
  .convertToHtml({ path: "demo.docx" })
  .then((result) => console.log(result.value)) // HTML string
  .catch(console.error);
mammoth doesn't expose paragraph/run-level objects like python-docx.
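A rough paragraph list can still be recovered from the raw text, on the assumption that mammoth separates paragraphs with blank lines in extractRawText output. `splitParagraphs` is a hypothetical helper; run-level formatting is lost either way.

```javascript
// Split raw text into paragraphs on runs of two or more newlines,
// trimming each piece and discarding empty entries.
function splitParagraphs(rawText) {
  return rawText
    .split(/\n{2,}/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
}

// Usage with mammoth:
//   mammoth.extractRawText({ path: "demo.docx" })
//     .then((result) => console.log(splitParagraphs(result.value)));
```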
Creating .docx files with docx
The docx npm package creates Word documents:
npm install docx
const docx = require("docx");
const fs = require("fs");

const doc = new docx.Document({
  sections: [
    {
      children: [
        new docx.Paragraph({
          text: "Document Title",
          heading: docx.HeadingLevel.TITLE,
        }),
        new docx.Paragraph({
          children: [
            new docx.TextRun("A plain paragraph with some "),
            new docx.TextRun({ text: "bold", bold: true }),
            new docx.TextRun(" and some "),
            new docx.TextRun({ text: "italic", italics: true }),
          ],
        }),
      ],
    },
  ],
});

// Packer.toBuffer returns a promise; CommonJS has no top-level await, so use .then()
docx.Packer.toBuffer(doc)
  .then((buffer) => fs.writeFileSync("demo.docx", buffer))
  .catch(console.error);
- Paragraph ≈ Python's Paragraph.
- TextRun ≈ Python's Run: contiguous text with the same style.
- Style options: bold, italics, underline, strike, allCaps, smallCaps, font, size.
Styling with TextRun
new docx.TextRun({
  text: "Styled text",
  bold: true,
  italics: true,
  underline: { type: docx.UnderlineType.SINGLE },
  font: "Times New Roman",
  size: 28, // half-points (28 = 14pt)
});
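Because size is given in half-points, a tiny conversion helper can keep point sizes readable in the calling code. `halfPoints` is a hypothetical name, not part of the docx package.

```javascript
// Convert a font size in points to the half-point units used by TextRun's size option.
function halfPoints(points) {
  return Math.round(points * 2);
}

// Usage:
//   new docx.TextRun({ text: "Styled text", size: halfPoints(14) }); // size 28
```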
Overall idea of the chapter
Chapter 17 in JavaScript uses pdf-parse for text extraction, pdf-lib for PDF manipulation (copy pages, rotate, insert blanks, watermark), external tools for encryption, mammoth for reading .docx, and the docx package for creating Word files. The same document object model concepts apply: PDFs are page-based, Word docs are paragraph/run-based.