Chapter 18 - CSV, JSON, and XML Files (Python)
Here's a concise walkthrough of the main ideas in Chapter 18, each with a small example.
Overview: CSV, JSON, XML
These are data serialization formats: ways to turn structured data into text for saving, sending over networks, or copying.
- CSV: simple spreadsheet-like rows/columns, everything is text.
- JSON: JavaScript-style objects/arrays; good general data format, widely used in web APIs.
- XML: older, tag-based format; verbose and powerful but often more complex than needed.
Python standard library modules:
csvfor CSV.jsonfor JSON.xml(especiallyxml.etree.ElementTree) for XML.
CSV format and the csv module
What CSV looks like and limits
Each line = row; commas separate cells.
Example CSV for example3.xlsx:
4/5/2035 13:34,Apples,73
4/5/2035 3:41,Cherries,85
...
Limitations vs Excel:
- Only strings (no data types).
- No fonts, colors, multiple sheets, merged cells, images, charts, widths/heights.
- Simpler, widely supported, easy in text editors.
You shouldn't parse CSV with split(',') because commas can appear in values and must be escaped; use csv instead.
Reading CSV with csv.reader
import csv
example_file = open('example3.csv', encoding='utf-8')
example_reader = csv.reader(example_file)
example_data = list(example_reader)
example_file.close()
print(example_data[0][0]) # '4/5/2035 13:34'
print(example_data[0][1]) # 'Apples'
print(example_data[6][1]) # 'Strawberries'
csv.reader(file)returns an iterator;list(reader)→ list of row lists.- Access cell by
example_data[row][col](0-based).
Iterating in a for loop and line_num
Preferred for large files: iterate directly over the reader to avoid loading entire file at once.
import csv
example_file = open('example3.csv', encoding='utf-8')
example_reader = csv.reader(example_file)
for row in example_reader:
print('Row #' + str(example_reader.line_num) + ' ' + str(row))
reader.line_numis 1-based line number.- You can only iterate a
readeronce; reopen file for another pass. - Use
line_num == 1to skip headers.
Writing CSV with csv.writer
import csv
output_file = open('output.csv', 'w', newline='', encoding='utf-8')
output_writer = csv.writer(output_file)
output_writer.writerow(['spam', 'eggs', 'bacon', 'ham'])
output_writer.writerow(['Hello, world!', 'eggs', 'bacon', 'ham'])
output_writer.writerow([1, 2, 3.141592, 4])
output_file.close()
- Use
newline=''on Windows to avoid double-spaced rows. writerow(list)writes one row; return value = number of characters written.csvhandles quoting (e.g.,"Hello, world!").
Result:
spam,eggs,bacon,ham
"Hello, world!",eggs,bacon,ham
1,2,3.141592,4
Custom delimiters and line terminators (TSV)
import csv
output_file = open('output.tsv', 'w', newline='', encoding='utf-8')
output_writer = csv.writer(
output_file,
delimiter='\t',
lineterminator='\n\n',
)
output_writer.writerow(['spam', 'eggs', 'bacon', 'ham'])
output_writer.writerow(['Hello, world!', 'eggs', 'bacon', 'ham'])
output_writer.writerow([1, 2, 3.141592, 4])
output_file.close()
Produces TSV with tabs between cells and blank lines between rows.
DictReader and DictWriter (header rows)
For CSVs with headers, DictReader and DictWriter let you work with dicts keyed by column name.
Reading with DictReader (headers in file)
import csv
example_file = open('exampleWithHeader3.csv', encoding='utf-8')
dict_reader = csv.DictReader(example_file)
rows = list(dict_reader)
example_file.close()
print(rows[0])
# {'Timestamp': '4/5/2035 13:34', 'Fruit': 'Apples', 'Quantity': '73'}
example_file = open('exampleWithHeader3.csv', encoding='utf-8')
dict_reader = csv.DictReader(example_file)
for row in dict_reader:
print(row['Timestamp'], row['Fruit'], row['Quantity'])
DictReader reads header row automatically and skips it for data rows.
Supplying your own headers
For a CSV without headers, you can provide fieldnames:
import csv
example_file = open('example3.csv', encoding='utf-8')
dict_reader = csv.DictReader(example_file, ['time', 'name', 'amount'])
for row in dict_reader:
print(row['time'], row['name'], row['amount'])
Writing with DictWriter
import csv
output_file = open('output.csv', 'w', newline='', encoding='utf-8')
fieldnames = ['Name', 'Pet', 'Phone']
dict_writer = csv.DictWriter(output_file, fieldnames)
dict_writer.writeheader()
dict_writer.writerow({'Name': 'Alice', 'Pet': 'cat', 'Phone': '555-1234'})
dict_writer.writerow({'Name': 'Bob', 'Phone': '555-9999'})
dict_writer.writerow({'Phone': '555-5555', 'Name': 'Carol', 'Pet': 'dog'})
output_file.close()
Produces:
Name,Pet,Phone
Alice,cat,555-1234
Bob,,555-9999
Carol,dog,555-5555
- Missing keys → empty cells.
- Order determined by
fieldnames, not dict key order.
Project: Remove the header from CSV files
Task: in a folder full of CSVs, remove the first line from each and save result in a headerRemoved subfolder.
# removeCsvHeader.py
import csv, os
os.makedirs('headerRemoved', exist_ok=True)
for csv_filename in os.listdir('.'):
if not csv_filename.endswith('.csv'):
continue
print('Removing header from ' + csv_filename + '...')
# Read CSV file (skip first row).
csv_rows = []
csv_file_obj = open(csv_filename, encoding='utf-8')
reader_obj = csv.reader(csv_file_obj)
for row in reader_obj:
if reader_obj.line_num == 1:
continue
csv_rows.append(row)
csv_file_obj.close()
# Write new CSV without header.
out_path = os.path.join('headerRemoved', csv_filename)
csv_file_obj = open(out_path, 'w', newline='', encoding='utf-8')
csv_writer = csv.writer(csv_file_obj)
for row in csv_rows:
csv_writer.writerow(row)
csv_file_obj.close()
Ideas for similar CSV programs: compare rows/CSV files, convert CSV ↔ Excel, validate/clean data, or use CSV as input to other scripts.
JSON
What JSON looks like vs Python
JSON is JavaScript-like but very close to Python data syntax.
Differences vs Python:
None→nullTrue/False→true/false- Strings must use double quotes (no single quotes).
- No comments, no trailing commas.
- Only basic types: strings, numbers, Booleans, arrays (lists), objects (dicts),
null.
Many web APIs (OpenWeather, Facebook, Twitter, etc.) use JSON.
json.loads: JSON string → Python
import json
json_string = '{"name": "Alice Doe", "age": 30, "car": null, "programmer": true}'
python_data = json.loads(json_string)
print(python_data)
# {'name': 'Alice Doe', 'age': 30, 'car': None, 'programmer': True}
.loads() parses a JSON text string into Python dict/list/etc.
json.dumps: Python → JSON string
import json
python_data = {
'name': 'Alice Doe',
'age': 30,
'car': None,
'programmer': True,
'address': {
'street': '100 Larkin St.',
'city': 'San Francisco',
'zip': '94102',
},
'phone': [
{'type': 'mobile', 'number': '415-555-7890'},
{'type': 'work', 'number': '415-555-1234'},
],
}
json_string = json.dumps(python_data)
print(json_string) # single-line JSON
Pretty-printing:
json_string = json.dumps(python_data, indent=2)
print(json_string)
You can then write json_string to a .json file, send as an HTTP body, etc.
XML and xml.etree.ElementTree
XML basics
XML is tag-based, similar to HTML.
Example:
<person>
<name>Alice Doe</name>
<age>30</age>
<programmer>true</programmer>
<address>
<street>100 Larkin St.</street>
<city>San Francisco</city>
<zip>94102</zip>
</address>
<phone>
<phoneEntry>
<type>mobile</type>
<number>415-555-7890</number>
</phoneEntry>
<phoneEntry>
<type>work</type>
<number>415-555-1234</number>
</phoneEntry>
</phone>
</person>
Key points:
- Elements:
<tag>...</tag>or self-closing<tag />. - One root element (
<person>). - Elements can have attributes:
key="value". - Everything inside tags is text (strings only).
- Represents hierarchy via nested elements.
XML is verbose and flexible; JSON is usually preferred unless legacy systems require XML.
DOM vs SAX vs ElementTree
Parsing strategies:
- DOM: load entire document into memory (easy but memory-heavy).
- SAX: stream API, event-driven, for very large XML.
xml.etree.ElementTree: DOM-like; reads whole document, convenient for small/moderate XML.
Reading XML with ElementTree
Using fromstring for a string:
import xml.etree.ElementTree as ET
xml_string = """<person>...</person>"""
root = ET.fromstring(xml_string)
Alternatively from a file:
tree = ET.parse('my_data.xml')
root = tree.getroot()
Basic navigation:
print(root.tag) # 'person'
children = list(root)
print(root[0].tag) # 'name'
print(root[0].text) # 'Alice Doe'
print(root[4].tag) # 'address'
print(root[4][0].tag) # 'street'
print(root[4][0].text)# '100 Larkin St.'
Iterating immediate children:
for elem in root:
print(elem.tag, '--', elem.text)
Iterating all descendants:
for elem in root.iter():
print(elem.tag, '--', elem.text)
Filtering by tag:
for elem in root.iter('number'):
print(elem.tag, '--', elem.text)
# number -- 415-555-7890
# number -- 415-555-1234
Overall idea of the chapter
Chapter 18 covers three data serialization formats: CSV (rows/columns with csv.reader/csv.writer/DictReader/DictWriter), JSON (json.loads/json.dumps for parsing and generating), and XML (xml.etree.ElementTree for navigating tag-based documents). CSV is simplest for tabular data, JSON is the standard for web APIs and structured data, and XML is for legacy systems or document-oriented formats.