Skip to main content

Chapter 18 - CSV, JSON, and XML Files (Python)

Here's a concise walkthrough of the main ideas in Chapter 18, each with a small example.


Overview: CSV, JSON, XML

These are data serialization formats: ways to turn structured data into text for saving, sending over networks, or copying.

  • CSV: simple spreadsheet-like rows/columns, everything is text.
  • JSON: JavaScript-style objects/arrays; good general data format, widely used in web APIs.
  • XML: older, tag-based format; verbose and powerful but often more complex than needed.

Python standard library modules:

  • csv for CSV.
  • json for JSON.
  • xml (especially xml.etree.ElementTree) for XML.

CSV format and the csv module

What CSV looks like and limits

Each line = row; commas separate cells.

Example CSV for example3.xlsx:

4/5/2035 13:34,Apples,73
4/5/2035 3:41,Cherries,85
...

Limitations vs Excel:

  • Only strings (no data types).
  • No fonts, colors, multiple sheets, merged cells, images, charts, widths/heights.
  • Simpler, widely supported, easy in text editors.

You shouldn't parse CSV with split(',') because commas can appear in values and must be escaped; use csv instead.

Reading CSV with csv.reader

import csv

example_file = open('example3.csv', encoding='utf-8')
example_reader = csv.reader(example_file)
example_data = list(example_reader)
example_file.close()

print(example_data[0][0]) # '4/5/2035 13:34'
print(example_data[0][1]) # 'Apples'
print(example_data[6][1]) # 'Strawberries'
  • csv.reader(file) returns an iterator; list(reader) → list of row lists.
  • Access cell by example_data[row][col] (0-based).

Iterating in a for loop and line_num

Preferred for large files: iterate directly over the reader to avoid loading entire file at once.

import csv

example_file = open('example3.csv', encoding='utf-8')
example_reader = csv.reader(example_file)

for row in example_reader:
print('Row #' + str(example_reader.line_num) + ' ' + str(row))
  • reader.line_num is 1-based line number.
  • You can only iterate a reader once; reopen file for another pass.
  • Use line_num == 1 to skip headers.

Writing CSV with csv.writer

import csv

output_file = open('output.csv', 'w', newline='', encoding='utf-8')
output_writer = csv.writer(output_file)

output_writer.writerow(['spam', 'eggs', 'bacon', 'ham'])
output_writer.writerow(['Hello, world!', 'eggs', 'bacon', 'ham'])
output_writer.writerow([1, 2, 3.141592, 4])

output_file.close()
  • Use newline='' on Windows to avoid double-spaced rows.
  • writerow(list) writes one row; return value = number of characters written.
  • csv handles quoting (e.g., "Hello, world!").

Result:

spam,eggs,bacon,ham
"Hello, world!",eggs,bacon,ham
1,2,3.141592,4

Custom delimiters and line terminators (TSV)

import csv

output_file = open('output.tsv', 'w', newline='', encoding='utf-8')
output_writer = csv.writer(
output_file,
delimiter='\t',
lineterminator='\n\n',
)
output_writer.writerow(['spam', 'eggs', 'bacon', 'ham'])
output_writer.writerow(['Hello, world!', 'eggs', 'bacon', 'ham'])
output_writer.writerow([1, 2, 3.141592, 4])
output_file.close()

Produces TSV with tabs between cells and blank lines between rows.

DictReader and DictWriter (header rows)

For CSVs with headers, DictReader and DictWriter let you work with dicts keyed by column name.

Reading with DictReader (headers in file)

import csv

example_file = open('exampleWithHeader3.csv', encoding='utf-8')
dict_reader = csv.DictReader(example_file)
rows = list(dict_reader)
example_file.close()

print(rows[0])
# {'Timestamp': '4/5/2035 13:34', 'Fruit': 'Apples', 'Quantity': '73'}

example_file = open('exampleWithHeader3.csv', encoding='utf-8')
dict_reader = csv.DictReader(example_file)
for row in dict_reader:
print(row['Timestamp'], row['Fruit'], row['Quantity'])

DictReader reads header row automatically and skips it for data rows.

Supplying your own headers

For a CSV without headers, you can provide fieldnames:

import csv

example_file = open('example3.csv', encoding='utf-8')
dict_reader = csv.DictReader(example_file, ['time', 'name', 'amount'])
for row in dict_reader:
print(row['time'], row['name'], row['amount'])

Writing with DictWriter

import csv

output_file = open('output.csv', 'w', newline='', encoding='utf-8')
fieldnames = ['Name', 'Pet', 'Phone']
dict_writer = csv.DictWriter(output_file, fieldnames)

dict_writer.writeheader()
dict_writer.writerow({'Name': 'Alice', 'Pet': 'cat', 'Phone': '555-1234'})
dict_writer.writerow({'Name': 'Bob', 'Phone': '555-9999'})
dict_writer.writerow({'Phone': '555-5555', 'Name': 'Carol', 'Pet': 'dog'})
output_file.close()

Produces:

Name,Pet,Phone
Alice,cat,555-1234
Bob,,555-9999
Carol,dog,555-5555
  • Missing keys → empty cells.
  • Order determined by fieldnames, not dict key order.

Project: Remove the header from CSV files

Task: in a folder full of CSVs, remove the first line from each and save result in a headerRemoved subfolder.

# removeCsvHeader.py
import csv, os

os.makedirs('headerRemoved', exist_ok=True)

for csv_filename in os.listdir('.'):
if not csv_filename.endswith('.csv'):
continue
print('Removing header from ' + csv_filename + '...')

# Read CSV file (skip first row).
csv_rows = []
csv_file_obj = open(csv_filename, encoding='utf-8')
reader_obj = csv.reader(csv_file_obj)
for row in reader_obj:
if reader_obj.line_num == 1:
continue
csv_rows.append(row)
csv_file_obj.close()

# Write new CSV without header.
out_path = os.path.join('headerRemoved', csv_filename)
csv_file_obj = open(out_path, 'w', newline='', encoding='utf-8')
csv_writer = csv.writer(csv_file_obj)
for row in csv_rows:
csv_writer.writerow(row)
csv_file_obj.close()

Ideas for similar CSV programs: compare rows/CSV files, convert CSV ↔ Excel, validate/clean data, or use CSV as input to other scripts.


JSON

What JSON looks like vs Python

JSON is JavaScript-like but very close to Python data syntax.

Differences vs Python:

  • Nonenull
  • True / Falsetrue / false
  • Strings must use double quotes (no single quotes).
  • No comments, no trailing commas.
  • Only basic types: strings, numbers, Booleans, arrays (lists), objects (dicts), null.

Many web APIs (OpenWeather, Facebook, Twitter, etc.) use JSON.

json.loads: JSON string → Python

import json

json_string = '{"name": "Alice Doe", "age": 30, "car": null, "programmer": true}'

python_data = json.loads(json_string)
print(python_data)
# {'name': 'Alice Doe', 'age': 30, 'car': None, 'programmer': True}

.loads() parses a JSON text string into Python dict/list/etc.

json.dumps: Python → JSON string

import json

python_data = {
'name': 'Alice Doe',
'age': 30,
'car': None,
'programmer': True,
'address': {
'street': '100 Larkin St.',
'city': 'San Francisco',
'zip': '94102',
},
'phone': [
{'type': 'mobile', 'number': '415-555-7890'},
{'type': 'work', 'number': '415-555-1234'},
],
}

json_string = json.dumps(python_data)
print(json_string) # single-line JSON

Pretty-printing:

json_string = json.dumps(python_data, indent=2)
print(json_string)

You can then write json_string to a .json file, send as an HTTP body, etc.


XML and xml.etree.ElementTree

XML basics

XML is tag-based, similar to HTML.

Example:

<person>
<name>Alice Doe</name>
<age>30</age>
<programmer>true</programmer>
<address>
<street>100 Larkin St.</street>
<city>San Francisco</city>
<zip>94102</zip>
</address>
<phone>
<phoneEntry>
<type>mobile</type>
<number>415-555-7890</number>
</phoneEntry>
<phoneEntry>
<type>work</type>
<number>415-555-1234</number>
</phoneEntry>
</phone>
</person>

Key points:

  • Elements: <tag>...</tag> or self-closing <tag />.
  • One root element (<person>).
  • Elements can have attributes: key="value".
  • Everything inside tags is text (strings only).
  • Represents hierarchy via nested elements.

XML is verbose and flexible; JSON is usually preferred unless legacy systems require XML.

DOM vs SAX vs ElementTree

Parsing strategies:

  • DOM: load entire document into memory (easy but memory-heavy).
  • SAX: stream API, event-driven, for very large XML.
  • xml.etree.ElementTree: DOM-like; reads whole document, convenient for small/moderate XML.

Reading XML with ElementTree

Using fromstring for a string:

import xml.etree.ElementTree as ET

xml_string = """<person>...</person>"""
root = ET.fromstring(xml_string)

Alternatively from a file:

tree = ET.parse('my_data.xml')
root = tree.getroot()

Basic navigation:

print(root.tag)       # 'person'
children = list(root)
print(root[0].tag) # 'name'
print(root[0].text) # 'Alice Doe'

print(root[4].tag) # 'address'
print(root[4][0].tag) # 'street'
print(root[4][0].text)# '100 Larkin St.'

Iterating immediate children:

for elem in root:
print(elem.tag, '--', elem.text)

Iterating all descendants:

for elem in root.iter():
print(elem.tag, '--', elem.text)

Filtering by tag:

for elem in root.iter('number'):
print(elem.tag, '--', elem.text)
# number -- 415-555-7890
# number -- 415-555-1234

Overall idea of the chapter

Chapter 18 covers three data serialization formats: CSV (rows/columns with csv.reader/csv.writer/DictReader/DictWriter), JSON (json.loads/json.dumps for parsing and generating), and XML (xml.etree.ElementTree for navigating tag-based documents). CSV is simplest for tabular data, JSON is the standard for web APIs and structured data, and XML is for legacy systems or document-oriented formats.