Chapter 13 - Web Scraping (Python)
Here's a concise walkthrough of the main ideas in Chapter 13, each with a small example.
Web scraping overview
Web scraping means writing programs that download and process web content (HTML) instead of doing everything manually in a browser.
This chapter uses:
- webbrowser – open pages in a browser.
- requests – download pages and files.
- bs4 (Beautiful Soup) – parse HTML to extract data.
- selenium and playwright – control a real browser for interactive pages.
HTTP and HTTPS basics
URLs like https://autbor.com/example3.html use HTTPS (encrypted HTTP) to talk between browser and server.
HTTPS hides the contents of pages from eavesdroppers but not which host you're talking to; VPNs, ISPs, and Tor are briefly discussed in terms of privacy and traffic visibility.
Project: showmap.py with webbrowser
Goal: open OpenStreetMap for a given address from the command line or clipboard.
Step 1: Figure out the URL
OpenStreetMap supports a search URL like:
https://www.openstreetmap.org/search?query=<address_string>
The #map=... part isn't required and browsers handle URL encoding (spaces to %20).
Step 2: Handle command line arguments
If there are arguments, treat everything after the script name as the address.
# showmap.py
import webbrowser, sys

if len(sys.argv) > 1:
    address = ' '.join(sys.argv[1:])  # from CLI
# TODO: clipboard case
# TODO: open browser
sys.argv includes the filename at index 0 and each space-separated token afterward.
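For instance, running a hypothetical `python showmap.py 870 Valencia St` would produce an argv list like the one simulated here, and joining everything after index 0 rebuilds the address:

```python
# Simulating the argv list produced by: python showmap.py 870 Valencia St
argv = ['showmap.py', '870', 'Valencia', 'St']
address = ' '.join(argv[1:])  # skip the script name at index 0
print(address)  # 870 Valencia St
```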
Step 3: Use clipboard fallback and open browser
If no arguments, use clipboard; then open the browser.
# showmap.py - Launches map for address
import webbrowser, sys, pyperclip

if len(sys.argv) > 1:
    address = ' '.join(sys.argv[1:])  # from the command line
else:
    address = pyperclip.paste()       # from the clipboard

webbrowser.open('https://www.openstreetmap.org/search?query=' + address)
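A variant sketch that URL-encodes the address explicitly with the standard library's urllib.parse.quote, rather than relying on the browser to do it (the address here is a made-up example):

```python
import webbrowser
from urllib.parse import quote

address = '870 Valencia St'  # hypothetical address; normally from CLI or clipboard
url = 'https://www.openstreetmap.org/search?query=' + quote(address)
print(url)  # spaces become %20
# webbrowser.open(url)  # uncomment to actually launch the browser
```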
Ideas for similar programs
Suggestions using webbrowser.open:
- Open all links on a page in separate tabs.
- Open local weather page.
- Open a set of social/bookmark sites you check daily.
- Open a local HTML help file via a file://... URL.
Example (local help):
import webbrowser
webbrowser.open('file:///C:/Users/al/Desktop/help.html')
Downloading with the requests module
requests simplifies downloading pages/files and handling common network issues.
Downloading web pages
Use requests.get(url) → Response object; check status_code or use raise_for_status.
import requests
response = requests.get(
    'https://automatetheboringstuff.com/files/rj.txt'
)
print(response.status_code == requests.codes.ok)  # True
print(len(response.text))   # ~178k chars
print(response.text[:210])  # first 210 chars
Checking for errors with raise_for_status()
response.raise_for_status() raises HTTPError on non-OK responses.
import requests

response = requests.get('https://inventwithpython.com/page_that_does_not_exist')
try:
    response.raise_for_status()
except Exception as exc:
    print(f'There was a problem: {exc}')
Always call raise_for_status() after get() unless failure is explicitly acceptable.
Saving downloaded files
Open a file in binary mode ('wb') and write chunks from iter_content().
import requests

response = requests.get(
    'https://automatetheboringstuff.com/files/rj.txt'
)
response.raise_for_status()
with open('RomeoAndJuliet.txt', 'wb') as f:
    for chunk in response.iter_content(100000):
        f.write(chunk)
Review steps:
1. Call requests.get() to download the file.
2. Call open(..., 'wb') to open a file in write-binary mode.
3. Loop over response.iter_content(chunk_size).
4. Call write() on each chunk.
For video downloads (YouTube etc.), the chapter points to yt-dlp in Chapter 24 instead of requests.
Accessing a weather API (OpenWeather)
Shows how to call HTTP APIs, parse JSON, and interpret responses.
General API notes:
- Most APIs require registration and an API key.
- Free tiers often rate-limit requests.
- Keep API keys secret; don't hard-code them in public code — read from a file or env variable instead.
- JSON is parsed with json.loads(response.text) into Python lists/dicts.
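A minimal illustration of json.loads turning a JSON string into Python structures (the coordinates here are illustrative, not real API output):

```python
import json

# Illustrative JSON string shaped like a geocoding response
raw = '[{"name": "San Francisco", "lat": 37.779, "lon": -122.4199}]'
data = json.loads(raw)   # JSON array -> Python list of dicts
print(data[0]['name'])   # San Francisco
print(type(data))        # <class 'list'>
```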
Geocoding: city to lat/lon
Endpoint (geocoding):
https://api.openweathermap.org/geo/1.0/direct
?q={city_name},{state_code},{country_code}&appid={API_key}
Example:
import requests, json

city_name = 'San Francisco'
state_code = 'CA'
country_code = 'US'
API_key = 'YOUR_REAL_API_KEY'

response = requests.get(
    f'https://api.openweathermap.org/geo/1.0/direct'
    f'?q={city_name},{state_code},{country_code}&appid={API_key}'
)
response_data = json.loads(response.text)
lat = response_data[0]['lat']
lon = response_data[0]['lon']
If there are multiple matches, response_data is a list of several dicts; if nothing matches, it's an empty list (so indexing [0] would raise an IndexError).
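A defensive sketch for handling the empty and multiple-match cases; the sample data here is fabricated, not real API output:

```python
# response_data as parsed from the geocoding endpoint; sample values are made up
response_data = [
    {'name': 'San Francisco', 'lat': 37.779, 'lon': -122.4199},
]

if not response_data:
    print('No matches found for that city name.')
else:
    if len(response_data) > 1:
        print(f'Warning: {len(response_data)} matches; using the first.')
    lat = response_data[0]['lat']
    lon = response_data[0]['lon']
    print(lat, lon)
```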
The chapter breaks the URL into scheme (https://), host, path (/geo/1.0/direct), and query string (?q=...&appid=...).
Current weather from lat/lon
Endpoint:
https://api.openweathermap.org/data/2.5/weather
?lat={lat}&lon={lon}&appid={API_key}
Example:
response = requests.get(
    f'https://api.openweathermap.org/data/2.5/weather'
    f'?lat={lat}&lon={lon}&appid={API_key}'
)
response_data = json.loads(response.text)
desc = response_data['weather'][0]['description']
temp_k = response_data['main']['temp']  # Kelvin
temp_c = round(temp_k - 273.15, 1)
temp_f = round(temp_k * (9 / 5) - 459.67, 1)
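As a quick sanity check on the two conversion formulas, wrapping them in small functions and trying a round value like 300 K:

```python
def kelvin_to_c(k):
    """Kelvin to Celsius, rounded to one decimal place."""
    return round(k - 273.15, 1)

def kelvin_to_f(k):
    """Kelvin to Fahrenheit, rounded to one decimal place."""
    return round(k * 9 / 5 - 459.67, 1)

print(kelvin_to_c(300))  # 26.9
print(kelvin_to_f(300))  # 80.3
```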
Key fields:
- response_data['weather'][0]['main'] – "Clear", "Rain", etc.
- response_data['weather'][0]['description'] – longer description.
- response_data['main']['temp'] – temperature (K).
- ['feels_like'], ['humidity'], etc. for more info.
- On invalid coordinates, you get an error dict like {"cod":"400","message":"wrong latitude"}.
5-day forecast
Endpoint:
https://api.openweathermap.org/data/2.5/forecast
?lat={lat}&lon={lon}&appid={API_key}
Response structure:
- response_data['list'] – list of ~40 forecast entries (one every 3 hours).
- Each entry has:
  - ['dt'] – Unix timestamp (use datetime.datetime.fromtimestamp(dt)).
  - ['main'] – dict with temp, feels_like, humidity, etc.
  - ['weather'][0] – dict with main, description, etc.
Example:
from datetime import datetime

# response_data comes from the forecast request above
entry = response_data['list'][0]
ts = datetime.fromtimestamp(entry['dt'])
temp_k = entry['main']['temp']
print(ts, temp_k)
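To walk the whole forecast, loop over response_data['list']; the two entries below are fabricated stand-ins for the API's real output, so the sketch runs on its own:

```python
from datetime import datetime

# Fabricated stand-ins for two 3-hour forecast entries
response_data = {'list': [
    {'dt': 1700000000, 'main': {'temp': 285.2},
     'weather': [{'main': 'Clouds', 'description': 'scattered clouds'}]},
    {'dt': 1700010800, 'main': {'temp': 287.9},
     'weather': [{'main': 'Clear', 'description': 'clear sky'}]},
]}

for entry in response_data['list']:
    ts = datetime.fromtimestamp(entry['dt'])  # local time
    desc = entry['weather'][0]['description']
    temp_c = round(entry['main']['temp'] - 273.15, 1)
    print(f'{ts}: {desc}, {temp_c} C')
```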
Exploring other APIs
Mentions other weather APIs (weather.gov, weatherapi.com) and suggests:
- Using requests for raw HTTP JSON/XML.
- Searching PyPI for higher-level client libraries that wrap these APIs.
Understanding HTML
Gives a quick HTML primer and stresses using proper parsers (like Beautiful Soup), not regex, for HTML.
Key ideas:
- HTML is plaintext with tags <tag> ... </tag> forming elements.
- Attributes like id and class identify and categorize elements.
- CSS (Cascading Style Sheets) controls styling.
- Don't parse HTML with regex; use a parser like bs4.
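As a taste of why a real parser beats regex, even Python's built-in html.parser can pull attributes out of tags reliably (Beautiful Soup, introduced later, makes this much easier):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == 'a':
            self.links.append(dict(attrs).get('href'))

parser = LinkCollector()
parser.feed('<a href="https://inventwithpython.com">This text is a link</a>')
print(parser.links)  # ['https://inventwithpython.com']
```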
Example element:
<b>Hello</b>, world!
Renders "Hello" in bold; <b>...</b> is the element.
A link with attributes:
<a href="https://inventwithpython.com">This text is a link</a>
href attribute gives the URL.
Elements often have id for unique identification; this is what you'll frequently target when scraping.
Viewing page source
Right-click → "View Page Source" to see the raw HTML the browser received. You don't need to understand everything — just enough to locate the data you care about.
Developer Tools
Press F12 (Firefox/Chrome/Edge) to open DevTools; right-click → "Inspect Element" to see the HTML corresponding to specific page parts.
This is essential for scraping: you use DevTools to inspect and locate classes/ids for elements, then write code that finds them in downloaded HTML.
Finding HTML elements (with DevTools, before code)
Walkthrough: scraping a forecast from https://weather.gov for ZIP 94105.
Steps:
1. Manually visit the forecast page for the ZIP code.
2. Right-click the specific forecast text → Inspect Element.
3. DevTools shows HTML like:
   <p class="forecast-text">
   Sunny, with a high near 64. West wind 11 to 16 mph, with gusts as high as 21 mph.
   </p>
   So the desired data is in a <p> with class="forecast-text".
4. Right-click that element in DevTools → "Copy → CSS selector" to get a selector string such as:
   div.row-odd:nth-child(1) > div:nth-child(2)
You'll later use such selectors with Beautiful Soup's .select() or Selenium's .find_element() to programmatically locate this element.
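A short preview (assuming bs4 is installed) of what feeding such a selector to Beautiful Soup's .select() looks like; the HTML string is a minimal stand-in for the downloaded forecast page:

```python
import bs4

# Minimal stand-in for the forecast page's HTML
html = '<p class="forecast-text">Sunny, with a high near 64.</p>'
soup = bs4.BeautifulSoup(html, 'html.parser')
elems = soup.select('p.forecast-text')  # CSS selector, like the one copied from DevTools
print(elems[0].get_text())  # Sunny, with a high near 64.
```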
Overall idea of the chapter
Chapter 13 covers web scraping from the ground up: opening pages with webbrowser, downloading content with requests, calling HTTP APIs (OpenWeather geocoding + weather), understanding HTML structure, and using DevTools to identify the elements you want to extract — all before introducing parsers like Beautiful Soup and browser automation with Selenium/Playwright.