
Chapter 13 - Web Scraping (JavaScript)

Here's a JavaScript-flavoured version of the same concepts, with small JS/Node examples for each idea.


Web scraping overview

Web scraping means writing programs that download and process web content (HTML) instead of doing everything manually in a browser.

In JavaScript, you'll use:

  • open (Node child_process) or browser APIs – open pages in a browser.
  • fetch (built-in since Node 18) – download pages/files.
  • cheerio – parse HTML to extract data (like Beautiful Soup).
  • puppeteer or playwright – control a real browser for interactive pages.

HTTP and HTTPS basics

Same concepts apply: URLs use HTTPS (encrypted HTTP) to talk between browser and server. HTTPS hides page contents from eavesdroppers but not which host you're talking to.
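Node's built-in WHATWG URL class is handy for pulling apart URLs like these (a small sketch):

```javascript
// Pull apart a URL with the built-in WHATWG URL class (a global since Node 10)
const mapUrl = new URL("https://www.openstreetmap.org/search?query=main+st");
console.log(mapUrl.protocol); // "https:"
console.log(mapUrl.hostname); // "www.openstreetmap.org"
console.log(mapUrl.searchParams.get("query")); // "main st" (+ decodes to a space)
```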


Project: showmap.js with open

Goal: open OpenStreetMap for a given address from the command line or clipboard.

Step 1: Figure out the URL

Same URL pattern:

https://www.openstreetmap.org/search?query=<address_string>

Step 2: Handle command line arguments

const args = process.argv.slice(2);

if (args.length > 0) {
  const address = args.join(" ");
  // TODO: open browser
} else {
  // TODO: clipboard fallback
}

process.argv[0] is the path to the node executable, [1] is the script path, and [2] onward are your arguments.
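For instance, the argument slicing and URL-encoding can be sketched with a hard-coded argv (the paths here are made up):

```javascript
// A made-up process.argv for: node showmap.js 870 Valencia St
const argv = ["/usr/local/bin/node", "/home/al/showmap.js", "870", "Valencia", "St"];
const address = argv.slice(2).join(" ");
console.log(address); // "870 Valencia St"
console.log(encodeURIComponent(address)); // "870%20Valencia%20St"
```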

Step 3: Use clipboard fallback and open browser

const { execSync } = require("child_process");
const clipboard = require("clipboardy"); // clipboardy v2.x for require(); v3+ is ESM-only

const args = process.argv.slice(2);
const address = args.length > 0 ? args.join(" ") : clipboard.readSync();

const url = "https://www.openstreetmap.org/search?query=" + encodeURIComponent(address);

// Cross-platform open. On Windows, start treats the first quoted
// argument as a window title, so pass an empty title first.
const os = require("os");
const platform = os.platform();
if (platform === "darwin") {
  execSync(`open "${url}"`);
} else if (platform === "win32") {
  execSync(`start "" "${url}"`);
} else {
  execSync(`xdg-open "${url}"`);
}

Ideas for similar programs

Same suggestions apply:

  • Open all links on a page in separate tabs.
  • Open local weather page.
  • Open a set of social/bookmark sites you check daily.
For example, opening a local HTML file from Node (macOS's open command):

const { execSync } = require("child_process");
execSync('open "file:///Users/al/Desktop/help.html"');

Downloading with fetch

Node 18+ has built-in fetch (or use node-fetch for older versions).

Downloading web pages

// Top-level await works in ES modules (.mjs) and the Node REPL;
// in a CommonJS script, wrap these calls in an async function.
const response = await fetch(
  "https://automatetheboringstuff.com/files/rj.txt"
);
console.log(response.ok); // true
const text = await response.text();
console.log(text.length); // ~178k chars
console.log(text.slice(0, 210)); // first 210 chars

Checking for errors

Check response.ok or response.status after every fetch.

const response = await fetch(
  "https://inventwithpython.com/page_that_does_not_exist"
);
if (!response.ok) {
  console.log(`There was a problem: ${response.status} ${response.statusText}`);
}
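Note that fetch itself rejects only on network failures (DNS errors, refused connections); an HTTP error such as 404 resolves normally, so you must check response.ok yourself. A small helper captures this (a sketch; checkOk is a name of my choosing):

```javascript
// fetch() rejects only on network failure; HTTP error statuses
// arrive as normal responses, so promote them to thrown errors.
function checkOk(response) {
  if (!response.ok) {
    throw new Error(`HTTP ${response.status} ${response.statusText}`);
  }
  return response;
}

// Usage sketch:
// try {
//   const response = checkOk(await fetch(url));
//   const text = await response.text();
// } catch (err) {
//   console.log("There was a problem:", err.message);
// }
```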

Saving downloaded files

Use fs.writeFileSync with a Buffer for binary data.

const fs = require("fs");

const response = await fetch(
  "https://automatetheboringstuff.com/files/rj.txt"
);
if (!response.ok) throw new Error(`HTTP ${response.status}`);

const buffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync("RomeoAndJuliet.txt", buffer);

For streaming large files, use response.body (a ReadableStream) piped to a write stream:

const fs = require("fs");
const { Readable } = require("stream");
const { finished } = require("stream/promises");

const response = await fetch("https://example.com/large-file.zip");
const fileStream = fs.createWriteStream("large-file.zip");
await finished(Readable.fromWeb(response.body).pipe(fileStream));

Accessing a weather API (OpenWeather)

Same concepts: call HTTP APIs, parse JSON, interpret responses.

General API notes:

  • Most APIs require registration and an API key.
  • Free tiers often rate-limit requests.
  • Keep API keys secret — use environment variables or a .env file, not hard-coded strings.
  • JSON is parsed with response.json() or JSON.parse().
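A minimal sketch of the environment-variable approach (getApiKey and OPENWEATHER_KEY are names of my choosing):

```javascript
// Read the API key from the environment instead of hard-coding it.
function getApiKey(env) {
  const key = env.OPENWEATHER_KEY;
  if (!key) {
    throw new Error("Set the OPENWEATHER_KEY environment variable first.");
  }
  return key;
}

// const API_KEY = getApiKey(process.env);
```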

Geocoding: city to lat/lon

const cityName = "San Francisco";
const stateCode = "CA";
const countryCode = "US";
const API_KEY = process.env.OPENWEATHER_KEY;

// City names can contain spaces, so URL-encode the query.
const query = encodeURIComponent(`${cityName},${stateCode},${countryCode}`);
const response = await fetch(
  `https://api.openweathermap.org/geo/1.0/direct?q=${query}&appid=${API_KEY}`
);
const data = await response.json();
if (data.length === 0) throw new Error("Location not found");
const { lat, lon } = data[0];

Current weather from lat/lon

const response = await fetch(
  `https://api.openweathermap.org/data/2.5/weather` +
  `?lat=${lat}&lon=${lon}&appid=${API_KEY}`
);
const data = await response.json();

const desc = data.weather[0].description;
const tempK = data.main.temp;
const tempC = Math.round((tempK - 273.15) * 10) / 10;
const tempF = Math.round((tempK * (9 / 5) - 459.67) * 10) / 10;

Key fields:

  • data.weather[0].main – "Clear", "Rain", etc.
  • data.weather[0].description – longer description.
  • data.main.temp – temperature (K).
  • data.main.feels_like, data.main.humidity, etc.
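As a quick sanity check on the conversion formulas above (a throwaway sketch):

```javascript
// Kelvin → Celsius and Fahrenheit, rounded to one decimal place
const kToC = (k) => Math.round((k - 273.15) * 10) / 10;
const kToF = (k) => Math.round((k * (9 / 5) - 459.67) * 10) / 10;
console.log(kToC(300)); // 26.9
console.log(kToF(300)); // 80.3
```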

5-day forecast

const response = await fetch(
  `https://api.openweathermap.org/data/2.5/forecast` +
  `?lat=${lat}&lon=${lon}&appid=${API_KEY}`
);
const data = await response.json();

const entry = data.list[0];
const ts = new Date(entry.dt * 1000); // Unix seconds → JS milliseconds
const tempK = entry.main.temp;
console.log(ts.toLocaleString(), tempK);
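The forecast's list array holds one entry per three-hour step. A sketch of walking it, using hard-coded sample data in the shape shown above (the timestamps and temperatures are invented):

```javascript
// Invented sample in the forecast response's shape
const sample = {
  list: [
    { dt: 1700000000, main: { temp: 285.5 } },
    { dt: 1700010800, main: { temp: 287.1 } },
  ],
};

for (const entry of sample.list) {
  const when = new Date(entry.dt * 1000); // Unix seconds → JS milliseconds
  console.log(when.toISOString(), entry.main.temp);
}
```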

Exploring other APIs

Same advice: use fetch for raw HTTP, or search npm for higher-level client libraries.


Understanding HTML

Same primer applies regardless of language:

  • HTML is plaintext with tags <tag> ... </tag> forming elements.
  • Attributes like id and class identify and categorize elements.
  • CSS controls styling.
  • Don't parse HTML with regex; use a parser like cheerio.

Example element:

<b>Hello</b>, world!

A link with attributes:

<a href="https://inventwithpython.com">This text is a link</a>

Viewing page source

Right-click → "View Page Source" to see the raw HTML.

Developer Tools

Press F12 to open DevTools; right-click → "Inspect Element" to see the HTML for specific page parts. Essential for scraping: locate classes/ids, then target them in code.


Finding HTML elements (with DevTools, before code)

Same workflow as Python:

  1. Visit the page manually.
  2. Right-click the target text → Inspect Element.
  3. Note the tag and class/id (e.g. <p class="forecast-text">).
  4. Copy the CSS selector from DevTools.

You'll later use that selector with cheerio's $('selector') or Puppeteer's page.$('selector').

Example with cheerio (preview):

const cheerio = require("cheerio");

const html = "<p class='forecast-text'>Sunny, high near 64.</p>";
const $ = cheerio.load(html);
console.log($(".forecast-text").text()); // 'Sunny, high near 64.'

Overall idea of the chapter

Chapter 13 covers web scraping from the ground up: opening pages with child_process, downloading content with fetch, calling HTTP APIs (OpenWeather geocoding + weather), understanding HTML structure, and using DevTools to identify elements — all before introducing parsers like cheerio and browser automation with Puppeteer/Playwright.