Chapter 17 - PDF and Word Documents (JavaScript)

This is a JavaScript-flavoured version of the same concepts, using pdf-parse and pdf-lib for PDFs, and mammoth and docx for Word files.

Overview

PDF and Word files are binary documents. In JavaScript, the main packages are:

pdf-parse – extract text from PDFs.
pdf-lib – create and modify PDFs: copy, merge, rotate, and overlay pages (no built-in encryption).
docx (npm package) – create .docx Word files.
mammoth – extract text/HTML from existing .docx files.

PDF documents with pdf-parse and pdf-lib

Extracting text

Use pdf-parse to extract all text from a PDF:

npm install pdf-parse

const fs = require("fs");
const pdfParse = require("pdf-parse");

// CommonJS files have no top-level await, so the work runs in an async function.
async function main() {
  const dataBuffer = fs.readFileSync("Recursion_Chapter1.pdf");
  const data = await pdfParse(dataBuffer);

  console.log(data.numpages);    // number of pages
  console.log(data.text);        // all extracted text

  fs.writeFileSync("recursion.txt", data.text, "utf-8");
}

main().catch(console.error);

pdf-parse returns all text as one string. For per-page extraction, you'd need pdfjs-dist (Mozilla's PDF.js).
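Per-page extraction with pdfjs-dist could look roughly like the sketch below. It assumes a 2.x or 3.x release of pdfjs-dist (which provides the legacy/build/pdf.js path). The helper names extractPages and itemsToText are hypothetical, and joining text items with single spaces is a simplification that ignores the page layout.

```javascript
const fs = require("fs");

// Join the text items of one page into a single string.
// Entries without a string `str` field (marked-content markers) are skipped.
function itemsToText(items) {
  return items
    .filter((item) => typeof item.str === "string")
    .map((item) => item.str)
    .join(" ")
    .replace(/\s+/g, " ")
    .trim();
}

// Return an array with one string per page.
async function extractPages(pdfPath) {
  // Loaded lazily so itemsToText can be used without the package installed.
  const pdfjsLib = require("pdfjs-dist/legacy/build/pdf.js");
  const data = new Uint8Array(fs.readFileSync(pdfPath));
  const doc = await pdfjsLib.getDocument({ data }).promise;

  const pages = [];
  for (let i = 1; i <= doc.numPages; i++) {
    const page = await doc.getPage(i);
    const content = await page.getTextContent();
    pages.push(itemsToText(content.items));
  }
  return pages;
}
```

A call such as extractPages("Recursion_Chapter1.pdf").then((pages) => console.log(pages[0])) would print the text of the first page.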

Post-processing with AI

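The same issues apply as in the Python version: the extracted text contains hard newlines at the end of every line, words hyphenated across line breaks, and repeated page headers and footers. The same suggestion applies too: use an LLM to clean up the extracted text.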
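Some of these artifacts can be reduced with plain string processing before (or instead of) sending the text to an LLM. The following is a minimal sketch using regular expressions, not LLM cleanup. The function name cleanExtractedText is hypothetical, and its heuristics are assumptions: every hyphen at a line end is treated as a word break, and a blank line is treated as a paragraph boundary. It will wrongly join genuinely hyphenated compounds, and it does not remove headers or footers.

```javascript
function cleanExtractedText(text) {
  return text
    .replace(/\r\n/g, "\n")
    // Rejoin words split across lines: "recur-\nsion" -> "recursion".
    .replace(/(\w)-\n(\w)/g, "$1$2")
    // Protect paragraph breaks (blank lines) with a placeholder character.
    .replace(/\n{2,}/g, "\u0000")
    // Remaining single newlines are hard line wraps: turn them into spaces.
    .replace(/[ \t]*\n[ \t]*/g, " ")
    // Restore the paragraph breaks.
    .replace(/\u0000/g, "\n\n")
    .trim();
}
```

It could be applied to data.text from pdf-parse before writing the result to a file.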

Extracting images

For image extraction, use pdfjs-dist (PDF.js) or pdf-lib to access embedded objects. This is more complex than Python's pypdf:

npm install pdfjs-dist@3

// Legacy CommonJS build; this path matches pdfjs-dist 2.x/3.x (newer releases ship .mjs files).
const pdfjsLib = require("pdfjs-dist/legacy/build/pdf.js");
const fs = require("fs");

async function main() {
  const data = new Uint8Array(fs.readFileSync("document.pdf"));
  const doc = await pdfjsLib.getDocument({ data }).promise;

  for (let i = 1; i <= doc.numPages; i++) {
    const page = await doc.getPage(i);
    const ops = await page.getOperatorList();
    // Image operators would need further processing
    console.log(`Page ${i}: ${ops.fnArray.length} operations`);
  }
}

main().catch(console.error);

Image extraction in JS is significantly more involved than Python's simple page.images API.
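As a first step toward that further processing, the operator list can be filtered for image-drawing operators. This is only a sketch under assumptions: it assumes pdfjs-dist 2.x or 3.x, the helper names countImageOps and countImagesPerPage are hypothetical, and it counts draw operations rather than unique images (an image drawn twice is counted twice). It does not extract any pixel data.

```javascript
const fs = require("fs");

// Count how many entries in an operator list are image-drawing operators.
// `imageOpCodes` is the list of numeric operator codes to look for.
function countImageOps(fnArray, imageOpCodes) {
  const wanted = new Set(imageOpCodes);
  return fnArray.filter((fn) => wanted.has(fn)).length;
}

// Return the number of image draw operations on each page.
async function countImagesPerPage(pdfPath) {
  // Loaded lazily; assumes pdfjs-dist 2.x/3.x is installed.
  const pdfjsLib = require("pdfjs-dist/legacy/build/pdf.js");
  const { OPS } = pdfjsLib;
  const imageOpCodes = [OPS.paintImageXObject, OPS.paintInlineImageXObject];

  const data = new Uint8Array(fs.readFileSync(pdfPath));
  const doc = await pdfjsLib.getDocument({ data }).promise;

  const counts = [];
  for (let i = 1; i <= doc.numPages; i++) {
    const page = await doc.getPage(i);
    const ops = await page.getOperatorList();
    counts.push(countImageOps(ops.fnArray, imageOpCodes));
  }
  return counts;
}
```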

Creating PDFs from other pages (pdf-lib)

pdf-lib is a widely used library for creating and modifying PDFs in Node.js:

npm install pdf-lib

Copying pages between PDFs

const { PDFDocument } = require("pdf-lib");
const fs = require("fs");

async function main() {
  const srcBytes = fs.readFileSync("Recursion_Chapter1.pdf");
  const srcDoc = await PDFDocument.load(srcBytes);

  const newDoc = await PDFDocument.create();
  // Page indices are zero-based, so this copies the first 5 pages.
  const pages = await newDoc.copyPages(srcDoc, [0, 1, 2, 3, 4]);

  for (const page of pages) {
    newDoc.addPage(page);
  }

  const pdfBytes = await newDoc.save();
  fs.writeFileSync("first_five_pages.pdf", pdfBytes);
}

main().catch(console.error);

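PDFDocument.create() makes an empty PDF.
copyPages(srcDoc, indices) copies the pages at the given zero-based indices and returns them; they are not part of the new document until they are added.
addPage(page) appends a page; insertPage(index, page) inserts it at the given index.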
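Because copyPages expects zero-based indices while people usually think in page numbers starting at 1, a small conversion helper can prevent off-by-one mistakes. This is a sketch; pageIndices and extractPageRange are hypothetical names, and clamping the last page to the document length is a design choice made for this example.

```javascript
const fs = require("fs");

// Convert 1-based, inclusive page numbers into pdf-lib's 0-based indices.
function pageIndices(firstPage, lastPage) {
  if (
    !Number.isInteger(firstPage) ||
    !Number.isInteger(lastPage) ||
    firstPage < 1 ||
    lastPage < firstPage
  ) {
    throw new RangeError(`Invalid page range: ${firstPage}-${lastPage}`);
  }
  return Array.from(
    { length: lastPage - firstPage + 1 },
    (_, i) => firstPage - 1 + i
  );
}

// Copy pages firstPage..lastPage (1-based, inclusive) into a new PDF file.
async function extractPageRange(srcPath, destPath, firstPage, lastPage) {
  const { PDFDocument } = require("pdf-lib"); // loaded lazily
  const srcDoc = await PDFDocument.load(fs.readFileSync(srcPath));
  // Clamp the end of the range to the number of pages in the source.
  const last = Math.min(lastPage, srcDoc.getPageCount());

  const newDoc = await PDFDocument.create();
  const pages = await newDoc.copyPages(srcDoc, pageIndices(firstPage, last));
  pages.forEach((page) => newDoc.addPage(page));

  fs.writeFileSync(destPath, await newDoc.save());
}
```

For example, extractPageRange("Recursion_Chapter1.pdf", "first_five_pages.pdf", 1, 5) would copy the same pages as the indices [0, 1, 2, 3, 4].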

Rotating pages

const { PDFDocument, degrees } = require("pdf-lib");
const fs = require("fs");

async function main() {
  const pdfBytes = fs.readFileSync("Recursion_Chapter1.pdf");
  const doc = await PDFDocument.load(pdfBytes);

  for (const page of doc.getPages()) {
    page.setRotation(degrees(90));
  }

  const rotatedBytes = await doc.save();
  fs.writeFileSync("rotated.pdf", rotatedBytes);
}

main().catch(console.error);

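Use degrees(90), degrees(180), or degrees(270) for clockwise rotation. setRotation sets the page's rotation to that angle; it does not add to a rotation the page already has.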
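To rotate a page relative to its current orientation, read the existing angle with getRotation() first and set the sum. A minimal sketch, assuming the input angles are multiples of 90; addRotation and rotateBy are hypothetical helper names.

```javascript
const fs = require("fs");

// Add `delta` degrees to an angle and normalize the result into [0, 360).
function addRotation(currentAngle, delta) {
  return (((currentAngle + delta) % 360) + 360) % 360;
}

// Rotate every page of a PDF by `delta` degrees relative to its current rotation.
async function rotateBy(srcPath, destPath, delta) {
  const { PDFDocument, degrees } = require("pdf-lib"); // loaded lazily
  const doc = await PDFDocument.load(fs.readFileSync(srcPath));

  for (const page of doc.getPages()) {
    const current = page.getRotation().angle;
    page.setRotation(degrees(addRotation(current, delta)));
  }

  fs.writeFileSync(destPath, await doc.save());
}
```

With this helper, rotateBy("rotated.pdf", "upright.pdf", -90) would undo a previous 90-degree clockwise rotation.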

Inserting blank pages

const { PDFDocument } = require("pdf-lib");
const fs = require("fs");

async function main() {
  const pdfBytes = fs.readFileSync("Recursion_Chapter1.pdf");
  const doc = await PDFDocument.load(pdfBytes);

  doc.addPage();                // blank at end (default letter size)
  doc.insertPage(2);            // blank at index 2

  const newBytes = await doc.save();
  fs.writeFileSync("with_blanks.pdf", newBytes);
}

main().catch(console.error);

Watermarks and overlays

pdf-lib supports embedding pages from other PDFs as overlays:

const { PDFDocument } = require("pdf-lib");
const fs = require("fs");

async function main() {
  const mainBytes = fs.readFileSync("document.pdf");
  const watermarkBytes = fs.readFileSync("watermark.pdf");

  const mainDoc = await PDFDocument.load(mainBytes);
  const watermarkDoc = await PDFDocument.load(watermarkBytes);

  // Embed the first page of the watermark PDF so it can be drawn on other pages.
  const [watermarkPage] = await mainDoc.embedPdf(watermarkDoc, [0]);

  for (const page of mainDoc.getPages()) {
    const { width, height } = page.getSize();
    // Stretch the watermark to cover the whole page.
    page.drawPage(watermarkPage, {
      x: 0,
      y: 0,
      width,
      height,
    });
  }

  const result = await mainDoc.save();
  fs.writeFileSync("with_watermark.pdf", result);
}

main().catch(console.error);

drawPage places the watermark page as an overlay on each page.
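Passing the full page width and height stretches the watermark, which distorts it when the two pages have different proportions. One possible alternative is to scale the watermark to fit while keeping its aspect ratio, and to center it. The sketch below assumes that approach; fitCentered and drawCentered are hypothetical helper names.

```javascript
// Scale a (srcWidth x srcHeight) box to fit inside a page while keeping its
// aspect ratio, and center it. Returns x/y/width/height for drawPage.
function fitCentered(srcWidth, srcHeight, pageWidth, pageHeight) {
  const scale = Math.min(pageWidth / srcWidth, pageHeight / srcHeight);
  const width = srcWidth * scale;
  const height = srcHeight * scale;
  return {
    x: (pageWidth - width) / 2,
    y: (pageHeight - height) / 2,
    width,
    height,
  };
}

// Draw an embedded page (as returned by embedPdf) centered on `page`.
function drawCentered(page, embeddedPage) {
  const { width, height } = page.getSize();
  page.drawPage(
    embeddedPage,
    fitCentered(embeddedPage.width, embeddedPage.height, width, height)
  );
}
```

In the watermark loop, drawCentered(page, watermarkPage) would take the place of the drawPage call.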

Encrypting and decrypting PDFs

pdf-lib does not natively support encryption. For encryption in Node.js, use qpdf (command-line tool) or muhammara:

const { execSync } = require("child_process");

// Encrypt with qpdf (must be installed on system)
execSync(
  'qpdf --encrypt swordfish swordfish 256 -- input.pdf encrypted.pdf'
);

// Decrypt
execSync(
  'qpdf --password=swordfish --decrypt encrypted.pdf decrypted.pdf'
);

Alternatively, use the muhammara npm package for programmatic encryption.
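When the password or file names come from user input, building a shell command string risks quoting problems. A variant of the qpdf calls that passes arguments as an array through execFileSync avoids the shell entirely. This is a sketch: the helper names are hypothetical, the qpdf options are the same ones used in the commands above, and qpdf must still be installed on the system.

```javascript
const { execFileSync } = require("child_process");

// Argument list for: qpdf --encrypt <user-pw> <owner-pw> 256 -- <in> <out>
function qpdfEncryptArgs(userPassword, ownerPassword, input, output) {
  return ["--encrypt", userPassword, ownerPassword, "256", "--", input, output];
}

// Argument list for: qpdf --password=<pw> --decrypt <in> <out>
function qpdfDecryptArgs(password, input, output) {
  return [`--password=${password}`, "--decrypt", input, output];
}

function encryptPdf(userPassword, ownerPassword, input, output) {
  // No shell is involved, so spaces or quotes in the arguments are passed as-is.
  execFileSync("qpdf", qpdfEncryptArgs(userPassword, ownerPassword, input, output));
}

function decryptPdf(password, input, output) {
  execFileSync("qpdf", qpdfDecryptArgs(password, input, output));
}
```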

Project: Combine select pages from many PDFs

The goal is the same as in the Python version: skip the first page (the cover sheet) of each PDF in the current directory and combine the remaining pages into one file.

const { PDFDocument } = require("pdf-lib");
const fs = require("fs");

async function main() {
  // All PDFs in the current directory, sorted case-insensitively.
  const pdfFilenames = fs
    .readdirSync(".")
    .filter((f) => f.endsWith(".pdf"))
    .sort((a, b) => a.toLowerCase().localeCompare(b.toLowerCase()));

  const combined = await PDFDocument.create();

  for (const filename of pdfFilenames) {
    const srcBytes = fs.readFileSync(filename);
    const srcDoc = await PDFDocument.load(srcBytes);
    const pageCount = srcDoc.getPageCount();

    if (pageCount > 1) {
      // Copy all pages except the first
      const indices = Array.from({ length: pageCount - 1 }, (_, i) => i + 1);
      const pages = await combined.copyPages(srcDoc, indices);
      for (const page of pages) {
        combined.addPage(page);
      }
    }
  }

  const resultBytes = await combined.save();
  fs.writeFileSync("combined.pdf", resultBytes);
}

main().catch(console.error);

Word documents

Reading .docx files with mammoth

mammoth extracts text or HTML from .docx files:

npm install mammoth

const mammoth = require("mammoth");

async function main() {
  const result = await mammoth.extractRawText({ path: "demo.docx" });
  console.log(result.value);  // plain text of the document
}

main().catch(console.error);

For HTML output:

mammoth.convertToHtml({ path: "demo.docx" }).then((result) => {
  console.log(result.value);  // HTML string
});

Unlike python-docx, mammoth doesn't expose paragraph-level or run-level objects.
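A rough paragraph list can still be recovered from the raw text, because extractRawText follows each paragraph with two newlines. The sketch below relies on that behaviour; splitParagraphs and readParagraphs are hypothetical helper names, and all run-level formatting is lost.

```javascript
// Split mammoth's raw text into paragraphs, dropping empty entries.
function splitParagraphs(rawText) {
  return rawText
    .split(/\n{2,}/)
    .map((paragraph) => paragraph.trim())
    .filter((paragraph) => paragraph.length > 0);
}

// Read a .docx file and return its paragraphs as an array of strings.
async function readParagraphs(docxPath) {
  const mammoth = require("mammoth"); // loaded lazily
  const result = await mammoth.extractRawText({ path: docxPath });
  // result.messages holds any warnings produced during conversion.
  result.messages.forEach((m) => console.warn(m.message));
  return splitParagraphs(result.value);
}
```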

Creating .docx files with docx

The docx npm package creates Word documents:

npm install docx

const docx = require("docx");
const fs = require("fs");

// One section containing a title and a paragraph made of four runs.
const doc = new docx.Document({
  sections: [
    {
      children: [
        new docx.Paragraph({
          text: "Document Title",
          heading: docx.HeadingLevel.TITLE,
        }),
        new docx.Paragraph({
          children: [
            new docx.TextRun("A plain paragraph with some "),
            new docx.TextRun({ text: "bold", bold: true }),
            new docx.TextRun(" and some "),
            new docx.TextRun({ text: "italic", italics: true }),
          ],
        }),
      ],
    },
  ],
});

// Packer.toBuffer returns a promise; CommonJS has no top-level await.
docx.Packer.toBuffer(doc)
  .then((buffer) => fs.writeFileSync("demo.docx", buffer))
  .catch(console.error);

Paragraph ≈ python-docx's Paragraph.
TextRun ≈ python-docx's Run: contiguous text with the same style.
Style options: bold, italics, underline, strike, allCaps, smallCaps, font, size.

Styling with TextRun

new docx.TextRun({
  text: "Styled text",
  bold: true,
  italics: true,
  underline: { type: docx.UnderlineType.SINGLE },
  font: "Times New Roman",
  size: 28,  // half-points (28 = 14pt)
});
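Since size is given in half-points, a small conversion helper keeps point sizes readable in calling code. This is a sketch; pointsToHalfPoints and styledParagraph are hypothetical names, and rounding to the nearest half-point is an assumption made for this example.

```javascript
// Convert a font size in points to the half-point units used by TextRun.
function pointsToHalfPoints(points) {
  return Math.round(points * 2);
}

// Build a paragraph containing one run of text at the given point size.
function styledParagraph(text, points) {
  const docx = require("docx"); // loaded lazily
  return new docx.Paragraph({
    children: [
      new docx.TextRun({ text, size: pointsToHalfPoints(points) }),
    ],
  });
}
```

For example, styledParagraph("Styled text", 14) creates a run with size 28, matching the value used above.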

Overall idea of the chapter

Chapter 17 in JavaScript uses pdf-parse for text extraction, pdf-lib for PDF manipulation (copy pages, rotate, insert blanks, watermark), external tools for encryption, mammoth for reading .docx, and the docx package for creating Word files. The same document object model concepts apply: PDFs are page-based, Word docs are paragraph/run-based.
