
Chapter 13 - Web Scraping (Python)

Here's a concise walkthrough of the main ideas in Chapter 13, each with a small example.


Web scraping overview

Web scraping means writing programs that download and process web content (HTML) instead of doing everything manually in a browser.

This chapter uses:

  • webbrowser – open pages in a browser.
  • requests – download pages/files.
  • bs4 (Beautiful Soup) – parse HTML to extract data.
  • selenium and playwright – control a real browser for interactive pages.

HTTP and HTTPS basics

URLs like https://autbor.com/example3.html use HTTPS (encrypted HTTP) to talk between browser and server.

HTTPS hides the contents of pages from eavesdroppers but not which host you're talking to; VPNs, ISPs, and Tor are briefly discussed in terms of privacy and traffic visibility.


Project: showmap.py with webbrowser

Goal: open OpenStreetMap for a given address from the command line or clipboard.

Step 1: Figure out the URL

OpenStreetMap supports a search URL like:

https://www.openstreetmap.org/search?query=<address_string>

The #map=... part of the URL isn't required, and web browsers handle URL encoding automatically (e.g., spaces become %20).
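If you'd rather build the encoded URL yourself instead of relying on the browser, the standard library's urllib.parse.quote_plus can do the encoding (the address string here is just an example):

```python
from urllib.parse import quote_plus

# Percent-encode an address for a query string; spaces become '+'.
address = '870 Valencia St, San Francisco, CA'
url = 'https://www.openstreetmap.org/search?query=' + quote_plus(address)
print(url)
# https://www.openstreetmap.org/search?query=870+Valencia+St%2C+San+Francisco%2C+CA
```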

Step 2: Handle command line arguments

If there are arguments, treat everything after the script name as the address.

# showmap.py
import webbrowser, sys

if len(sys.argv) > 1:
    # Get address from command line arguments.
    address = ' '.join(sys.argv[1:])
# TODO: clipboard case
# TODO: open browser

sys.argv includes the filename at index 0 and each space-separated token afterward.
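For instance, running `python showmap.py 870 Valencia St` would give the argv list shown below (simulated here with a plain list so the snippet runs anywhere):

```python
# What sys.argv would look like for: python showmap.py 870 Valencia St
argv = ['showmap.py', '870', 'Valencia', 'St']

# Join everything after the filename back into one address string.
address = ' '.join(argv[1:])
print(address)  # 870 Valencia St
```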

Step 3: Use clipboard fallback and open browser

If no arguments, use clipboard; then open the browser.

# showmap.py - Launches map for address
import webbrowser, sys, pyperclip

if len(sys.argv) > 1:
    # Get address from command line arguments.
    address = ' '.join(sys.argv[1:])
else:
    # Get address from clipboard.
    address = pyperclip.paste()

webbrowser.open('https://www.openstreetmap.org/search?query=' + address)

Ideas for similar programs

Suggestions using webbrowser.open:

  • Open all links on a page in separate tabs.
  • Open local weather page.
  • Open a set of social/bookmark sites you check daily.
  • Open a local HTML help file via file://....

Example (local help):

import webbrowser
webbrowser.open('file:///C:/Users/al/Desktop/help.html')

Downloading with the requests module

requests simplifies downloading pages/files and handling common network issues.

Downloading web pages

Calling requests.get(url) returns a Response object; check its status_code attribute or call raise_for_status() to detect failures.

import requests

response = requests.get('https://automatetheboringstuff.com/files/rj.txt')
print(response.status_code == requests.codes.ok)  # True
print(len(response.text))      # ~178k chars
print(response.text[:210])     # first 210 chars

Checking for errors with raise_for_status()

response.raise_for_status() raises HTTPError on non-OK responses.

import requests

response = requests.get('https://inventwithpython.com/page_that_does_not_exist')
try:
    response.raise_for_status()
except Exception as exc:
    print(f'There was a problem: {exc}')

Always call raise_for_status() after get() unless failure is explicitly acceptable.

Saving downloaded files

Open a file in binary mode ('wb') and write chunks from iter_content().

import requests

response = requests.get('https://automatetheboringstuff.com/files/rj.txt')
response.raise_for_status()

# Write the download to disk in binary mode, 100,000 bytes at a time.
with open('RomeoAndJuliet.txt', 'wb') as f:
    for chunk in response.iter_content(100000):
        f.write(chunk)

Review steps:

  1. requests.get()
  2. open(..., 'wb')
  3. Loop over response.iter_content(chunk_size)
  4. write() each chunk.

For video downloads (YouTube etc.), the chapter points to yt-dlp in Chapter 24 instead of requests.


Accessing a weather API (OpenWeather)

Shows how to call HTTP APIs, parse JSON, and interpret responses.

General API notes:

  • Most APIs require registration and an API key.
  • Free tiers often rate-limit requests.
  • Keep API keys secret; don't hard-code them in public code — read from a file or env variable instead.
  • JSON is parsed with json.loads(response.text) into Python lists/dicts.
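As a quick illustration of the json.loads() step, here is a made-up geocoding-style response parsed into Python objects (the JSON string is fabricated for the example, not a real API reply):

```python
import json

# A made-up JSON string shaped like a geocoding response.
sample = '[{"name": "San Francisco", "lat": 37.7749, "lon": -122.4194}]'

data = json.loads(sample)  # JSON text -> list of dicts
print(data[0]['name'])     # San Francisco
print(data[0]['lat'])      # 37.7749
```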

Geocoding: city to lat/lon

Endpoint (geocoding):

https://api.openweathermap.org/geo/1.0/direct
?q={city_name},{state_code},{country_code}&appid={API_key}

Example:

import requests, json

city_name = 'San Francisco'
state_code = 'CA'
country_code = 'US'
API_key = 'YOUR_REAL_API_KEY'

response = requests.get(
    f'https://api.openweathermap.org/geo/1.0/direct'
    f'?q={city_name},{state_code},{country_code}&appid={API_key}'
)
response_data = json.loads(response.text)
lat = response_data[0]['lat']
lon = response_data[0]['lon']

If multiple matches, response_data is a list of dicts; if nothing matches, it's an empty list.

The chapter breaks the URL into scheme (https://), host, path (/geo/1.0/direct), and query string (?q=...&appid=...).
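You can take the same URL apart with the standard library, which is a handy way to double-check each piece (the appid value here is a placeholder):

```python
from urllib.parse import urlsplit, parse_qs

url = 'https://api.openweathermap.org/geo/1.0/direct?q=London,GB&appid=PLACEHOLDER'
parts = urlsplit(url)
print(parts.scheme)   # https
print(parts.netloc)   # api.openweathermap.org
print(parts.path)     # /geo/1.0/direct
print(parse_qs(parts.query))  # {'q': ['London,GB'], 'appid': ['PLACEHOLDER']}
```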

Current weather from lat/lon

Endpoint:

https://api.openweathermap.org/data/2.5/weather
?lat={lat}&lon={lon}&appid={API_key}

Example:

response = requests.get(
    f'https://api.openweathermap.org/data/2.5/weather'
    f'?lat={lat}&lon={lon}&appid={API_key}'
)
response_data = json.loads(response.text)

desc = response_data['weather'][0]['description']
temp_k = response_data['main']['temp']        # Kelvin
temp_c = round(temp_k - 273.15, 1)            # Celsius
temp_f = round(temp_k * (9/5) - 459.67, 1)    # Fahrenheit

Key fields:

  • response_data['weather'][0]['main'] – "Clear", "Rain", etc.
  • response_data['weather'][0]['description'] – longer description.
  • response_data['main']['temp'] – temperature (K).
  • ['feels_like'], ['humidity'] etc. for more info.
  • On invalid coordinates, you get error dict like {"cod":"400","message":"wrong latitude"}.
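One way to guard against that error shape is to check for the 'message' key before reading weather fields; the dictionaries below are hand-written samples, not live responses:

```python
def summarize(response_data):
    # Error replies carry a 'message' key; normal replies don't.
    if 'message' in response_data:
        return 'API error: ' + response_data['message']
    return response_data['weather'][0]['description']

ok_data = {'cod': 200, 'weather': [{'main': 'Clear', 'description': 'clear sky'}],
           'main': {'temp': 289.5}}
err_data = {'cod': '400', 'message': 'wrong latitude'}

print(summarize(ok_data))   # clear sky
print(summarize(err_data))  # API error: wrong latitude
```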

5-day forecast

Endpoint:

https://api.openweathermap.org/data/2.5/forecast
?lat={lat}&lon={lon}&appid={API_key}

Response structure:

  • response_data['list'] – list of ~40 forecast entries (every 3 hours).
  • Each entry:
    • ['dt'] – Unix timestamp (use datetime.datetime.fromtimestamp(dt)).
    • ['main'] – dict with temp, feels_like, humidity etc.
    • ['weather'][0] – dict with main, description etc.

Example:

from datetime import datetime

# response_data comes from the forecast request above.
entry = response_data['list'][0]
ts = datetime.fromtimestamp(entry['dt'])  # Unix timestamp -> local datetime
temp_k = entry['main']['temp']
print(ts, temp_k)

Exploring other APIs

Mentions other weather APIs (weather.gov, weatherapi.com) and suggests:

  • Using requests for raw HTTP JSON/XML.
  • Searching PyPI for higher-level client libraries that wrap these APIs.

Understanding HTML

Gives a quick HTML primer and stresses using proper parsers (like Beautiful Soup), not regex, for HTML.

Key ideas:

  • HTML is plaintext with tags <tag> ... </tag> forming elements.
  • Attributes like id and class identify and categorize elements.
  • CSS (Cascading Style Sheets) controls styling.
  • Don't parse HTML with regex; use a parser like bs4.

Example element:

<b>Hello</b>, world!

Renders "Hello" in bold; <b>...</b> is the element.

A link with attributes:

<a href="https://inventwithpython.com">This text is a link</a>

href attribute gives the URL.

Elements often have id for unique identification; this is what you'll frequently target when scraping.

Viewing page source

Right-click → "View Page Source" to see the raw HTML the browser received. You don't need to understand everything — just enough to locate the data you care about.

Developer Tools

Press F12 (Firefox/Chrome/Edge) to open DevTools; right-click → "Inspect Element" to see the HTML corresponding to specific page parts.

This is essential for scraping: you use DevTools to inspect and locate classes/ids for elements, then write code that finds them in downloaded HTML.


Finding HTML elements (with DevTools, before code)

Walkthrough: scraping a forecast from https://weather.gov for ZIP 94105.

Steps:

  1. Manually visit the forecast page for the ZIP code.

  2. Right-click the specific forecast text → Inspect Element.

  3. DevTools shows HTML like:

    <p class="forecast-text">
    Sunny, with a high near 64. West wind 11 to 16 mph, with gusts as high as 21 mph.
    </p>

    So the desired data is in a <p> with class="forecast-text".

  4. Right-click that element in DevTools → "Copy → CSS selector" to get a selector string such as:

    div.row-odd:nth-child(1) > div:nth-child(2)

You'll later use such selectors with Beautiful Soup's .select() or Selenium's .find_element() to programmatically locate this element.
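As a preview of how such a selector gets used, here's a minimal sketch with Beautiful Soup's select() on a hand-written HTML snippet (it assumes the simpler class selector p.forecast-text rather than the long copied one):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = '''<div class="row-odd">
  <p class="forecast-text">Sunny, with a high near 64.</p>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
matches = soup.select('p.forecast-text')  # CSS selector lookup
print(matches[0].get_text())  # Sunny, with a high near 64.
```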


Overall idea of the chapter

Chapter 13 covers web scraping from the ground up: opening pages with webbrowser, downloading content with requests, calling HTTP APIs (OpenWeather geocoding + weather), understanding HTML structure, and using DevTools to identify the elements you want to extract — all before introducing parsers like Beautiful Soup and browser automation with Selenium/Playwright.