
Chapter 13 - Web Scraping (Python)

Here's a concise walkthrough of the main ideas in Chapter 13, each with a small example.


Web scraping overview

Web scraping means writing programs that download and process web content (HTML) instead of doing everything manually in a browser.

This chapter uses:

  • webbrowser – open pages in a browser.
  • requests – download pages/files.
  • bs4 (Beautiful Soup) – parse HTML to extract data.
  • selenium and playwright – control a real browser for interactive pages.

HTTP and HTTPS basics

URLs like https://autbor.com/example3.html use HTTPS (encrypted HTTP) to talk between browser and server.

HTTPS hides the contents of pages from eavesdroppers but not which host you're talking to; VPNs, ISPs, and Tor are briefly discussed in terms of privacy and traffic visibility.


Project: showmap.py with webbrowser

Goal: open OpenStreetMap for a given address from the command line or clipboard.

Step 1: Figure out the URL

OpenStreetMap supports a search URL like:

https://www.openstreetmap.org/search?query=<address_string>

The #map=... part of the URL isn't required, and web browsers handle URL encoding automatically (e.g., spaces become %20).
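If you'd rather build the encoded URL yourself instead of relying on the browser, the standard library's urllib.parse.quote_plus can do the encoding (the address string here is just an example):

```python
from urllib.parse import quote_plus

# Percent-encode an address for a query string; spaces become '+'.
address = '870 Valencia St, San Francisco, CA'
url = 'https://www.openstreetmap.org/search?query=' + quote_plus(address)
print(url)
# https://www.openstreetmap.org/search?query=870+Valencia+St%2C+San+Francisco%2C+CA
```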

Step 2: Handle command line arguments

If there are arguments, treat everything after the script name as the address.

# showmap.py
import webbrowser, sys

if len(sys.argv) > 1:
    # Get address from command line arguments.
    address = ' '.join(sys.argv[1:])
# TODO: clipboard case
# TODO: open browser

sys.argv includes the filename at index 0 and each space-separated token afterward.
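For instance, running `python showmap.py 870 Valencia St` would give the argv list shown below (simulated here with a plain list so the snippet runs anywhere):

```python
# What sys.argv would look like for: python showmap.py 870 Valencia St
argv = ['showmap.py', '870', 'Valencia', 'St']

# Join everything after the filename back into one address string.
address = ' '.join(argv[1:])
print(address)  # 870 Valencia St
```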

Step 3: Use clipboard fallback and open browser

If no arguments, use clipboard; then open the browser.

# showmap.py - Launches map for address
import webbrowser, sys, pyperclip

if len(sys.argv) > 1:
    # Get address from command line arguments.
    address = ' '.join(sys.argv[1:])
else:
    # Get address from clipboard.
    address = pyperclip.paste()

webbrowser.open('https://www.openstreetmap.org/search?query=' + address)

Ideas for similar programs

Suggestions using webbrowser.open:

  • Open all links on a page in separate tabs.
  • Open local weather page.
  • Open a set of social/bookmark sites you check daily.
  • Open a local HTML help file via file://....

Example (local help):

import webbrowser
webbrowser.open('file:///C:/Users/al/Desktop/help.html')

Downloading with the requests module

requests simplifies downloading pages/files and handling common network issues.

Downloading web pages

Calling requests.get(url) returns a Response object; check its status_code attribute or call raise_for_status() to detect failures.

import requests

response = requests.get('https://automatetheboringstuff.com/files/rj.txt')
print(response.status_code == requests.codes.ok)  # True
print(len(response.text))      # ~178k chars
print(response.text[:210])     # first 210 chars

Checking for errors with raise_for_status()

response.raise_for_status() raises HTTPError on non-OK responses.

import requests

response = requests.get('https://inventwithpython.com/page_that_does_not_exist')
try:
    response.raise_for_status()
except Exception as exc:
    print(f'There was a problem: {exc}')

Always call raise_for_status() after get() unless failure is explicitly acceptable.

Saving downloaded files

Open a file in binary mode ('wb') and write chunks from iter_content().

import requests

response = requests.get('https://automatetheboringstuff.com/files/rj.txt')
response.raise_for_status()

# Write the download to disk in binary mode, 100,000 bytes at a time.
with open('RomeoAndJuliet.txt', 'wb') as f:
    for chunk in response.iter_content(100000):
        f.write(chunk)

Review steps:

  1. requests.get()
  2. open(..., 'wb')
  3. Loop over response.iter_content(chunk_size)
  4. write() each chunk.

For video downloads (YouTube etc.), the chapter points to yt-dlp in Chapter 24 instead of requests.


Accessing a weather API (OpenWeather)

Shows how to call HTTP APIs, parse JSON, and interpret responses.

General API notes:

  • Most APIs require registration and an API key.
  • Free tiers often rate-limit requests.
  • Keep API keys secret; don't hard-code them in public code — read from a file or env variable instead.
  • JSON is parsed with json.loads(response.text) into Python lists/dicts.
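As a quick illustration of the json.loads() step, here is a made-up geocoding-style response parsed into Python objects (the JSON string is fabricated for the example, not a real API reply):

```python
import json

# A made-up JSON string shaped like a geocoding response.
sample = '[{"name": "San Francisco", "lat": 37.7749, "lon": -122.4194}]'

data = json.loads(sample)  # JSON text -> list of dicts
print(data[0]['name'])     # San Francisco
print(data[0]['lat'])      # 37.7749
```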

Geocoding: city to lat/lon

Endpoint (geocoding):

https://api.openweathermap.org/geo/1.0/direct
?q={city_name},{state_code},{country_code}&appid={API_key}

Example:

import requests, json

city_name = 'San Francisco'
state_code = 'CA'
country_code = 'US'
API_key = 'YOUR_REAL_API_KEY'

response = requests.get(
    f'https://api.openweathermap.org/geo/1.0/direct'
    f'?q={city_name},{state_code},{country_code}&appid={API_key}'
)
response_data = json.loads(response.text)
lat = response_data[0]['lat']
lon = response_data[0]['lon']

If multiple matches, response_data is a list of dicts; if nothing matches, it's an empty list.

The chapter breaks the URL into scheme (https://), host, path (/geo/1.0/direct), and query string (?q=...&appid=...).
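You can take the same URL apart with the standard library, which is a handy way to double-check each piece (the appid value here is a placeholder):

```python
from urllib.parse import urlsplit, parse_qs

url = 'https://api.openweathermap.org/geo/1.0/direct?q=London,GB&appid=PLACEHOLDER'
parts = urlsplit(url)
print(parts.scheme)   # https
print(parts.netloc)   # api.openweathermap.org
print(parts.path)     # /geo/1.0/direct
print(parse_qs(parts.query))  # {'q': ['London,GB'], 'appid': ['PLACEHOLDER']}
```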

Current weather from lat/lon

Endpoint:

https://api.openweathermap.org/data/2.5/weather
?lat={lat}&lon={lon}&appid={API_key}

Example:

response = requests.get(
    f'https://api.openweathermap.org/data/2.5/weather'
    f'?lat={lat}&lon={lon}&appid={API_key}'
)
response_data = json.loads(response.text)

desc = response_data['weather'][0]['description']
temp_k = response_data['main']['temp']        # Kelvin
temp_c = round(temp_k - 273.15, 1)            # Celsius
temp_f = round(temp_k * (9/5) - 459.67, 1)    # Fahrenheit

Key fields:

  • response_data['weather'][0]['main'] – "Clear", "Rain", etc.
  • response_data['weather'][0]['description'] – longer description.
  • response_data['main']['temp'] – temperature (K).
  • ['feels_like'], ['humidity'] etc. for more info.
  • On invalid coordinates, you get error dict like {"cod":"400","message":"wrong latitude"}.
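One way to guard against that error shape is to check for the 'message' key before reading weather fields; the dictionaries below are hand-written samples, not live responses:

```python
def summarize(response_data):
    # Error replies carry a 'message' key; normal replies don't.
    if 'message' in response_data:
        return 'API error: ' + response_data['message']
    return response_data['weather'][0]['description']

ok_data = {'cod': 200, 'weather': [{'main': 'Clear', 'description': 'clear sky'}],
           'main': {'temp': 289.5}}
err_data = {'cod': '400', 'message': 'wrong latitude'}

print(summarize(ok_data))   # clear sky
print(summarize(err_data))  # API error: wrong latitude
```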

5-day forecast

Endpoint:

https://api.openweathermap.org/data/2.5/forecast
?lat={lat}&lon={lon}&appid={API_key}

Response structure:

  • response_data['list'] – list of ~40 forecast entries (every 3 hours).
  • Each entry:
    • ['dt'] – Unix timestamp (use datetime.datetime.fromtimestamp(dt)).
    • ['main'] – dict with temp, feels_like, humidity etc.
    • ['weather'][0] – dict with main, description etc.

Example:

from datetime import datetime

# response_data comes from the forecast request above.
entry = response_data['list'][0]
ts = datetime.fromtimestamp(entry['dt'])  # Unix timestamp -> local datetime
temp_k = entry['main']['temp']
print(ts, temp_k)

Exploring other APIs

Mentions other weather APIs (weather.gov, weatherapi.com) and suggests:

  • Using requests for raw HTTP JSON/XML.
  • Searching PyPI for higher-level client libraries that wrap these APIs.

Understanding HTML

Gives a quick HTML primer and stresses using proper parsers (like Beautiful Soup), not regex, for HTML.

Key ideas:

  • HTML is plaintext with tags <tag> ... </tag> forming elements.
  • Attributes like id and class identify and categorize elements.
  • CSS (Cascading Style Sheets) controls styling.
  • Don't parse HTML with regex; use a parser like bs4.

Example element:

<b>Hello</b>, world!

Renders "Hello" in bold; <b>...</b> is the element.

A link with attributes:

<a href="https://inventwithpython.com">This text is a link</a>

href attribute gives the URL.

Elements often have id for unique identification; this is what you'll frequently target when scraping.

Viewing page source

Right-click → "View Page Source" to see the raw HTML the browser received. You don't need to understand everything — just enough to locate the data you care about.

Developer Tools

Press F12 (Firefox/Chrome/Edge) to open DevTools; right-click → "Inspect Element" to see the HTML corresponding to specific page parts.

This is essential for scraping: you use DevTools to inspect and locate classes/ids for elements, then write code that finds them in downloaded HTML.


Finding HTML elements (with DevTools, before code)

Walkthrough: scraping a forecast from https://weather.gov for ZIP 94105.

Steps:

  1. Manually visit the forecast page for the ZIP code.

  2. Right-click the specific forecast text → Inspect Element.

  3. DevTools shows HTML like:

    <p class="forecast-text">
    Sunny, with a high near 64. West wind 11 to 16 mph, with gusts as high as 21 mph.
    </p>

    So the desired data is in a <p> with class="forecast-text".

  4. Right-click that element in DevTools → "Copy → CSS selector" to get a selector string such as:

    div.row-odd:nth-child(1) > div:nth-child(2)

You'll later use such selectors with Beautiful Soup's .select() or Selenium's .find_element() to programmatically locate this element.
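As a preview of how such a selector gets used, here's a minimal sketch with Beautiful Soup's select() on a hand-written HTML snippet (it assumes the simpler class selector p.forecast-text rather than the long copied one):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = '''<div class="row-odd">
  <p class="forecast-text">Sunny, with a high near 64.</p>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
matches = soup.select('p.forecast-text')  # CSS selector lookup
print(matches[0].get_text())  # Sunny, with a high near 64.
```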


Overall idea of the chapter

Chapter 13 covers web scraping from the ground up: opening pages with webbrowser, downloading content with requests, calling HTTP APIs (OpenWeather geocoding + weather), understanding HTML structure, and using DevTools to identify the elements you want to extract — all before introducing parsers like Beautiful Soup and browser automation with Selenium/Playwright.