Chapter 24 - Text-to-Speech and Speech Recognition Engines (Python)

Here's a concise but thorough summary of Chapter 24, covering all sections and main ideas with small examples.


Goal and tools

  • The chapter adds audio I/O to your programs: text-to-speech with pyttsx3, and speech-to-text with Whisper, plus yt-dlp for downloading audio/video to transcribe.
  • Both pyttsx3 and Whisper run locally and work for many languages, not just English; they only need internet to download models once.

Text-to-speech with pyttsx3

Engine basics

  • pyttsx3 uses the OS TTS engine: SAPI5 (Windows), NSSpeechSynthesizer (macOS), eSpeak (Linux; may require sudo apt install espeak).
  • Install with pip install pyttsx3.

Basic example (hello_tts.py):

import pyttsx3

engine = pyttsx3.init()
engine.say('Hello. How are you doing?')
engine.runAndWait() # speaks and blocks
feeling = input('>')
engine.say('Yes. I am feeling ' + feeling + ' as well.')
engine.runAndWait()
  • init() returns an Engine object.
  • say(text) queues speech; runAndWait() actually plays the queued speech and blocks until it finishes.

Voice properties: rate, volume, voices

You can inspect settings via getProperty:

import pyttsx3
engine = pyttsx3.init()

engine.getProperty('volume') # e.g. 1.0 (100%)
engine.getProperty('rate') # e.g. 200 words/min
engine.getProperty('voices') # list of Voice objects

Enumerate voices:

for voice in engine.getProperty('voices'):
    print(voice.name, voice.gender, voice.age, voice.languages)

Example: change the voice, rate, and volume:

engine.setProperty('rate', 300)        # faster
engine.setProperty('volume', 0.5) # 50%
voices = engine.getProperty('voices')
engine.setProperty('voice', voices[1].id)
engine.say('The quick brown fox jumps over the yellow lazy dog.')
engine.runAndWait()

Note the property names: plural 'voices' when reading list, singular 'voice' when setting.
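Building on that, a small helper can pick a voice id by language tag. This is my own sketch, not part of pyttsx3; the Voice class and ids below are illustrative stand-ins for the engine-specific Voice objects pyttsx3 returns:

```python
def find_voice(voices, language_prefix):
    """Return the id of the first voice whose language matches, or None."""
    for voice in voices:
        for lang in voice.languages:
            if str(lang).startswith(language_prefix):
                return voice.id
    return None

# Illustrative stand-ins for pyttsx3 Voice objects.
class Voice:
    def __init__(self, id, languages):
        self.id = id
        self.languages = languages

voices = [Voice('voice-en', ['en_US']), Voice('voice-fr', ['fr_CA'])]
print(find_voice(voices, 'fr'))  # voice-fr
```

With a real engine you would pass engine.getProperty('voices') as the first argument and hand the returned id to engine.setProperty('voice', ...).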

Saving speech to WAV

Use save_to_file + runAndWait:

import pyttsx3
engine = pyttsx3.init()
engine.save_to_file('Hello. How are you doing?', 'hello.wav')
engine.runAndWait() # actually writes hello.wav
  • Only WAV output is supported, not MP3 or others.
  • It can handle long text; the conversion is fast relative to the audio length.

Speech recognition with Whisper

Basic transcription

  • Install with pip install openai-whisper (package name is openai-whisper, not whisper).
  • First load_model() download can be hundreds of MB; models include 'tiny', 'base', 'small', 'medium', 'large-v3'.

Example:

import whisper

model = whisper.load_model('base')
result = model.transcribe('hello.wav')
print(result['text']) # "Hello. How are you doing?"
  • Smaller models → faster, less accurate; larger models → slower, more accurate.
  • The author recommends 'base' for most uses, 'medium' when you need more accuracy.
  • All models make errors; human review is always required.

You can explicitly set language:

model.transcribe('hello.wav', language='English')

result is a dict; the main field is 'text' (full transcription), and others contain timing and segment info.
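Each entry in the 'segments' list carries per-segment start/end times and text. A sketch of walking those segments (the result dict below is illustrative sample data, not real Whisper output):

```python
# Illustrative result dict shaped like Whisper's transcribe() output.
result = {
    'text': ' Hello. How are you doing?',
    'segments': [
        {'id': 0, 'start': 0.0, 'end': 1.4, 'text': ' Hello.'},
        {'id': 1, 'start': 1.4, 'end': 3.0, 'text': ' How are you doing?'},
    ],
}

lines = []
for seg in result['segments']:
    lines.append(f"[{seg['start']:.1f}s-{seg['end']:.1f}s]{seg['text']}")
print('\n'.join(lines))
```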

Using GPU

By default Whisper uses CPU. If you have an NVIDIA GPU and installed appropriate dependencies, you can use:

model = whisper.load_model('base', device='cuda')

This can greatly speed up transcription; setup steps are in Whisper's docs / Appendix A.
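A defensive way to pick the device is to ask PyTorch (Whisper's backend) whether CUDA is available, falling back to CPU when PyTorch or a compatible GPU is missing. A minimal sketch:

```python
# Fall back to CPU when PyTorch or a CUDA-capable GPU is unavailable.
try:
    import torch
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
except ImportError:
    device = 'cpu'

print('Using device:', device)
# model = whisper.load_model('base', device=device)
```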


Creating subtitle files (SRT/VTT/TSV/JSON)

Whisper can derive subtitles from timing info in result.

Example to create SRT subtitles:

import whisper

model = whisper.load_model('base')
result = model.transcribe('hello.wav')

from whisper.utils import get_writer
write_function = get_writer('srt', '.') # format, output folder
write_function(result, 'audio') # creates audio.srt
  • First arg to get_writer: 'srt', 'vtt', 'txt', 'tsv', or 'json'.
  • Second arg: output folder ('.' = current directory).
  • Call returned function with result and base filename to generate audio.srt, audio.vtt, etc.
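SRT timestamps follow an hh:mm:ss,mmm pattern. A formatting helper like the following (my own sketch, not part of whisper.utils) shows how segment times in seconds map onto it:

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT hh:mm:ss,mmm timestamp."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f'{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}'

print(srt_timestamp(83.5))      # 00:01:23,500
print(srt_timestamp(3725.042))  # 01:02:05,042
```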

Subtitle formats:

  • SRT: numbered blocks, hh:mm:ss,ms timestamps.
  • VTT: WEBVTT header, mm:ss.mmm timestamps, no numbering.
  • TSV: tab-separated start, end, and text columns (times in ms), useful for further processing with the csv module.
  • JSON: machine-readable structure.
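Because TSV rows are just tab-separated start/end/text fields, they are easy to post-process with the csv module. A sketch using illustrative sample data in place of a real Whisper-generated file:

```python
import csv
import io

# Illustrative TSV content like Whisper's writer produces (times in ms).
tsv_data = 'start\tend\ttext\n0\t1400\tHello.\n1400\t3000\tHow are you doing?\n'

rows = list(csv.DictReader(io.StringIO(tsv_data), delimiter='\t'))
for row in rows:
    duration = (int(row['end']) - int(row['start'])) / 1000
    print(f"{row['text']} ({duration:.1f}s)")
```

To process a real file, replace io.StringIO(tsv_data) with an open file object.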

Downloading videos/audio with yt-dlp

Basic video download

  • yt-dlp lets you download from YouTube and many other sites.
  • Install via Appendix A instructions, then:

import yt_dlp

video_url = 'https://www.youtube.com/watch?v=kSrnLbioN6w'
with yt_dlp.YoutubeDL() as ydl:
    ydl.download([video_url])
  • The filename is based on the video title and may be .mp4, .mkv, etc.
  • Sites may block downloads (age/login/geo restrictions, anti-scraping), so keep yt-dlp updated.

Extracting audio

Configure options to get audio only:

import yt_dlp

video_url = 'https://www.youtube.com/watch?v=kSrnLbioN6w'
options = {
    'quiet': True,        # suppress output
    'no_warnings': True,  # suppress warnings
    'outtmpl': 'downloaded_content.%(ext)s',
    'format': 'm4a/bestaudio/best',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'm4a',
    }],
}

with yt_dlp.YoutubeDL(options) as ydl:
    ydl.download([video_url])

This saves only the audio as downloaded_content.m4a (or similar extension).

Get the exact filename with glob:

from pathlib import Path

matching = list(Path().glob('downloaded_content.*'))
downloaded_filename = str(matching[0]) # e.g. 'downloaded_content.m4a'

You'd then feed this file into Whisper for transcription.

Downloading metadata only

import yt_dlp, json

video_url = 'https://www.youtube.com/watch?v=kSrnLbioN6w'
options = {
    'quiet': True,
    'no_warnings': True,
    'skip_download': True,  # no media file
}

with yt_dlp.YoutubeDL(options) as ydl:
    info = ydl.extract_info(video_url)
    json_info = ydl.sanitize_info(info)
    print('TITLE:', json_info['title'])
    print('KEYS:', json_info.keys())

with open('metadata.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(json_info))
  • skip_download avoids downloading the video; extract_info returns metadata; sanitize_info makes it JSON-safe before writing.

Overall idea of the chapter

Chapter 24 covers audio I/O: pyttsx3 for local text-to-speech (controllable rate, volume, voice; outputs to speakers or WAV), Whisper for local speech recognition (model sizes from tiny to large-v3, subtitle generation in SRT/VTT/TSV/JSON, multi-language support), and yt-dlp for downloading audio/video/metadata to feed into Whisper. All tools run locally after initial setup.