Chapter 13 - Web Scraping (JavaScript)
Here's a JavaScript-flavoured version of the same concepts, with small JS/Node examples for each idea.
Web scraping overview
Web scraping means writing programs that download and process web content (HTML) instead of doing everything manually in a browser.
In JavaScript, you'll use:
- open (via Node's child_process) or browser APIs – open pages in a browser.
- fetch (built-in since Node 18) – download pages/files.
- cheerio – parse HTML to extract data (like Beautiful Soup).
- puppeteer or playwright – control a real browser for interactive pages.
HTTP and HTTPS basics
Same concepts apply: URLs use HTTPS (encrypted HTTP) to talk between browser and server. HTTPS hides page contents from eavesdroppers but not which host you're talking to.
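As an aside (not in the chapter), the parts of a URL can be pulled apart with the built-in URL class, available in both Node and browsers:

```javascript
// Break a URL into its parts with the built-in URL class.
const url = new URL("https://www.openstreetmap.org/search?query=San+Francisco");

console.log(url.protocol);                  // "https:"
console.log(url.hostname);                  // "www.openstreetmap.org"
console.log(url.pathname);                  // "/search"
console.log(url.searchParams.get("query")); // "San Francisco" ("+" decodes to a space)
```

The hostname is roughly what an observer of an HTTPS connection can still see; the path and query stay encrypted.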
Project: showmap.js with open
Goal: open OpenStreetMap for a given address from the command line or clipboard.
Step 1: Figure out the URL
Same URL pattern:
https://www.openstreetmap.org/search?query=<address_string>
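Addresses contain spaces and commas, so the query value should be percent-encoded. A minimal sketch (the example address is just an illustration):

```javascript
// Build the search URL, percent-encoding the address so spaces,
// commas, etc. survive the trip.
const address = "870 Valencia St, San Francisco, CA 94110";
const url = "https://www.openstreetmap.org/search?query=" + encodeURIComponent(address);
console.log(url);
// https://www.openstreetmap.org/search?query=870%20Valencia%20St%2C%20San%20Francisco%2C%20CA%2094110
```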
Step 2: Handle command line arguments
const args = process.argv.slice(2);
if (args.length > 0) {
  const address = args.join(" ");
  // TODO: open browser
} else {
  // TODO: clipboard fallback
}
process.argv[0] is node, [1] is the script path, [2]+ are your arguments.
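To make that concrete, here is roughly what process.argv holds for node showmap.js 870 Valencia St (the paths are illustrative):

```javascript
// process.argv for "node showmap.js 870 Valencia St" looks like:
const argv = ["/usr/local/bin/node", "/path/to/showmap.js", "870", "Valencia", "St"];

// Slicing off the first two entries leaves just the user's arguments.
const address = argv.slice(2).join(" ");
console.log(address); // "870 Valencia St"
```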
Step 3: Use clipboard fallback and open browser
const { execSync } = require("child_process");
const clipboard = require("clipboardy"); // clipboardy v2.x; v3+ is ESM-only
const os = require("os");

const args = process.argv.slice(2);
const address = args.length > 0 ? args.join(" ") : clipboard.readSync();
const url = "https://www.openstreetmap.org/search?query=" + encodeURIComponent(address);

// Cross-platform open. Windows' "start" treats the first quoted argument
// as a window title, so pass an empty title before the URL.
if (os.platform() === "win32") {
  execSync(`start "" "${url}"`);
} else {
  const cmd = os.platform() === "darwin" ? "open" : "xdg-open";
  execSync(`${cmd} "${url}"`);
}
Ideas for similar programs
Same suggestions apply:
- Open all links on a page in separate tabs.
- Open local weather page.
- Open a set of social/bookmark sites you check daily.
const { execSync } = require("child_process");
// "open" is the macOS command; use "start" on Windows or "xdg-open" on Linux.
execSync('open "file:///Users/al/Desktop/help.html"');
Downloading with fetch
Node 18+ has built-in fetch (or use node-fetch for older versions).
Downloading web pages
const response = await fetch(
  "https://automatetheboringstuff.com/files/rj.txt"
);
console.log(response.ok); // true
const text = await response.text();
console.log(text.length); // ~178k chars
console.log(text.slice(0, 210)); // first 210 chars
Checking for errors
Check response.ok or response.status after every fetch.
const response = await fetch(
  "https://inventwithpython.com/page_that_does_not_exist"
);
if (!response.ok) {
console.log(`There was a problem: ${response.status} ${response.statusText}`);
}
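One subtlety: fetch only rejects its promise on network-level failures (DNS errors, refused connections); an HTTP 404 still resolves successfully, which is why the ok check is needed. A small helper sketch that promotes bad statuses to exceptions (checkResponse is our name, not a standard API):

```javascript
// fetch() rejects only on network-level failures; HTTP errors like 404
// resolve normally, so promote them to exceptions yourself.
function checkResponse(response) {
  if (!response.ok) {
    throw new Error(`HTTP ${response.status} ${response.statusText}`);
  }
  return response;
}

// Usage:
// const response = checkResponse(await fetch(url));
```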
Saving downloaded files
Use fs.writeFileSync with a Buffer for binary data.
const fs = require("fs");
const response = await fetch(
  "https://automatetheboringstuff.com/files/rj.txt"
);
if (!response.ok) throw new Error(`HTTP ${response.status}`);
const buffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync("RomeoAndJuliet.txt", buffer);
For streaming large files, use response.body (a ReadableStream) piped to a write stream:
const fs = require("fs");
const { Readable } = require("stream");
const { finished } = require("stream/promises");
const response = await fetch("https://example.com/large-file.zip");
const fileStream = fs.createWriteStream("large-file.zip");
await finished(Readable.fromWeb(response.body).pipe(fileStream));
Accessing a weather API (OpenWeather)
Same concepts: call HTTP APIs, parse JSON, interpret responses.
General API notes:
- Most APIs require registration and an API key.
- Free tiers often rate-limit requests.
- Keep API keys secret — use environment variables or a .env file, not hard-coded strings.
- JSON is parsed with response.json() or JSON.parse().
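A tiny sketch of the environment-variable advice, failing fast when a key is missing (requireEnv is our own helper name):

```javascript
// Read a required setting from the environment, failing loudly if absent.
function requireEnv(name) {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing environment variable: ${name}`);
  }
  return value;
}

// Usage: const API_KEY = requireEnv("OPENWEATHER_KEY");
```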
Geocoding: city to lat/lon
const cityName = "San Francisco";
const stateCode = "CA";
const countryCode = "US";
const API_KEY = process.env.OPENWEATHER_KEY;
const response = await fetch(
  `https://api.openweathermap.org/geo/1.0/direct` +
  `?q=${cityName},${stateCode},${countryCode}&appid=${API_KEY}`
);
const data = await response.json();
const { lat, lon } = data[0];
Current weather from lat/lon
const response = await fetch(
  `https://api.openweathermap.org/data/2.5/weather` +
  `?lat=${lat}&lon=${lon}&appid=${API_KEY}`
);
const data = await response.json();
const desc = data.weather[0].description;
const tempK = data.main.temp;
const tempC = Math.round((tempK - 273.15) * 10) / 10;
const tempF = Math.round((tempK * (9 / 5) - 459.67) * 10) / 10;
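The Kelvin conversions above can be factored into small reusable helpers (a sketch; the names are ours):

```javascript
// Kelvin → Celsius / Fahrenheit, rounded to one decimal place.
const kelvinToC = (k) => Math.round((k - 273.15) * 10) / 10;
const kelvinToF = (k) => Math.round((k * 9 / 5 - 459.67) * 10) / 10;

console.log(kelvinToC(273.15)); // 0  (freezing point of water)
console.log(kelvinToF(273.15)); // 32
```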
Key fields:
- data.weather[0].main – "Clear", "Rain", etc.
- data.weather[0].description – longer description.
- data.main.temp – temperature (K).
- data.main.feels_like, data.main.humidity, etc.
5-day forecast
const response = await fetch(
  `https://api.openweathermap.org/data/2.5/forecast` +
  `?lat=${lat}&lon=${lon}&appid=${API_KEY}`
);
const data = await response.json();
const entry = data.list[0];
const ts = new Date(entry.dt * 1000); // Unix seconds → JS milliseconds
const tempK = entry.main.temp;
console.log(ts.toLocaleString(), tempK);
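As a reminder, dt is Unix time in seconds while JavaScript's Date expects milliseconds, hence the × 1000. A quick check with an arbitrary example timestamp:

```javascript
// OpenWeather's dt is Unix seconds; Date wants milliseconds.
const dt = 1700000000; // an arbitrary example timestamp
const when = new Date(dt * 1000);
console.log(when.toISOString()); // "2023-11-14T22:13:20.000Z"
```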
Exploring other APIs
Same advice: use fetch for raw HTTP, or search npm for higher-level client libraries.
Understanding HTML
Same primer applies regardless of language:
- HTML is plaintext with tags <tag> ... </tag> forming elements.
- Attributes like id and class identify and categorize elements.
- CSS controls styling.
- Don't parse HTML with regex; use a parser like cheerio.
Example element:
<b>Hello</b>, world!
A link with attributes:
<a href="https://inventwithpython.com">This text is a link</a>
Viewing page source
Right-click → "View Page Source" to see the raw HTML.
Developer Tools
Press F12 to open DevTools; right-click → "Inspect Element" to see the HTML for specific page parts. Essential for scraping: locate classes/ids, then target them in code.
Finding HTML elements (with DevTools, before code)
Same workflow as Python:
- Visit the page manually.
- Right-click the target text → Inspect Element.
- Note the tag and class/id (e.g. <p class="forecast-text">).
- Copy the CSS selector from DevTools.
You'll later use that selector with cheerio's $('selector') or Puppeteer's page.$('selector').
Example with cheerio (preview):
const cheerio = require("cheerio");
const html = "<p class='forecast-text'>Sunny, high near 64.</p>";
const $ = cheerio.load(html);
console.log($(".forecast-text").text()); // 'Sunny, high near 64.'
Overall idea of the chapter
Chapter 13 covers web scraping from the ground up: opening pages with child_process, downloading content with fetch, calling HTTP APIs (OpenWeather geocoding + weather), understanding HTML structure, and using DevTools to identify elements — all before introducing parsers like cheerio and browser automation with Puppeteer/Playwright.