Chapter 24 - Text-to-Speech and Speech Recognition Engines (Python)
Here's a concise but thorough summary of Chapter 24, covering all sections and main ideas with small examples.
Goal and tools
- The chapter adds audio I/O to your programs: text-to-speech with pyttsx3, and speech-to-text with Whisper, plus yt-dlp for downloading audio/video to transcribe.
- Both pyttsx3 and Whisper run locally and work for many languages, not just English; they only need internet to download models once.
Text-to-speech with pyttsx3
Engine basics
- pyttsx3 uses the OS TTS engine: SAPI5 (Windows), NSSpeechSynthesizer (macOS), eSpeak (Linux; may require sudo apt install espeak).
- Install with pip install pyttsx3.
Basic example (hello_tts.py):
import pyttsx3
engine = pyttsx3.init()
engine.say('Hello. How are you doing?')
engine.runAndWait() # speaks and blocks
feeling = input('>')
engine.say('Yes. I am feeling ' + feeling + ' as well.')
engine.runAndWait()
init() returns an Engine object. say(text) queues speech; runAndWait() actually plays it and blocks until finished.
Voice properties: rate, volume, voices
You can inspect settings via getProperty:
import pyttsx3
engine = pyttsx3.init()
engine.getProperty('volume') # e.g. 1.0 (100%)
engine.getProperty('rate') # e.g. 200 words/min
engine.getProperty('voices') # list of Voice objects
Enumerate voices:
for voice in engine.getProperty('voices'):
    print(voice.name, voice.gender, voice.age, voice.languages)
Example change voice, rate, and volume:
engine.setProperty('rate', 300) # faster
engine.setProperty('volume', 0.5) # 50%
voices = engine.getProperty('voices')
engine.setProperty('voice', voices[1].id)
engine.say('The quick brown fox jumps over the yellow lazy dog.')
engine.runAndWait()
Note the property names: plural 'voices' when reading list, singular 'voice' when setting.
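Because voice names differ between operating systems, it can be safer to pick a voice by name than by hard-coded index. This is a sketch, not code from the chapter; find_voice_id and the sample voice data are made up, but pyttsx3's real Voice objects do carry .name and .id attributes:

```python
from types import SimpleNamespace

def find_voice_id(voices, name_fragment):
    """Return the id of the first voice whose name contains
    name_fragment (case-insensitive), or None if nothing matches."""
    for voice in voices:
        if name_fragment.lower() in voice.name.lower():
            return voice.id
    return None

# Demo with stand-in Voice objects:
fake_voices = [
    SimpleNamespace(name='Microsoft David Desktop', id='david-id'),
    SimpleNamespace(name='Microsoft Zira Desktop', id='zira-id'),
]
print(find_voice_id(fake_voices, 'zira'))  # zira-id

# With pyttsx3 itself you would write something like:
# engine = pyttsx3.init()
# voice_id = find_voice_id(engine.getProperty('voices'), 'Zira')
# if voice_id is not None:
#     engine.setProperty('voice', voice_id)
```

Returning None instead of raising lets the program fall back to the default voice when a named voice isn't installed.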
Saving speech to WAV
Use save_to_file + runAndWait:
import pyttsx3
engine = pyttsx3.init()
engine.save_to_file('Hello. How are you doing?', 'hello.wav')
engine.runAndWait() # actually writes hello.wav
- Only WAV output is supported, not MP3 or others.
- It can handle long text; the conversion is fast relative to the audio length.
Speech recognition with Whisper
Basic transcription
- Install with pip install openai-whisper (the package name is openai-whisper, not whisper).
- The first load_model() call downloads the model, which can be hundreds of MB; models include 'tiny', 'base', 'small', 'medium', and 'large-v3'.
Example:
import whisper
model = whisper.load_model('base')
result = model.transcribe('hello.wav')
print(result['text']) # "Hello. How are you doing?"
- Smaller models → faster, less accurate; larger models → slower, more accurate.
- The author recommends 'base' for most uses, 'medium' when you need more accuracy.
- All models make errors; human review is always required.
You can explicitly set language:
model.transcribe('hello.wav', language='English')
result is a dict; the main field is 'text' (full transcription), and others contain timing and segment info.
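The timing info can be turned into readable timestamps with a small helper. This sketch isn't from the chapter and srt_timestamp is a made-up name, but the segment keys it assumes ('start', 'end', 'text', with times in seconds) are the ones transcribe() returns:

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT-style hh:mm:ss,mmm timestamp."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f'{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}'

print(srt_timestamp(3725.5))  # 01:02:05,500

# With a real result dict from model.transcribe():
# for seg in result['segments']:
#     print(srt_timestamp(seg['start']), '-->',
#           srt_timestamp(seg['end']), seg['text'])
```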
Using GPU
By default Whisper uses CPU. If you have an NVIDIA GPU and installed appropriate dependencies, you can use:
model = whisper.load_model('base', device='cuda')
This can greatly speed up transcription; setup steps are in Whisper's docs / Appendix A.
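A defensive way to choose the device is to ask PyTorch (which Whisper is built on) whether a GPU is usable and fall back to the CPU otherwise. pick_device is a made-up helper name; this is a sketch, assuming torch is importable:

```python
def pick_device():
    """Return 'cuda' if PyTorch reports a usable NVIDIA GPU, else 'cpu'."""
    try:
        import torch  # Whisper depends on PyTorch, so it should be installed
        if torch.cuda.is_available():
            return 'cuda'
    except ImportError:
        pass
    return 'cpu'

print(pick_device())  # 'cpu' or 'cuda', depending on your machine

# model = whisper.load_model('base', device=pick_device())
```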
Creating subtitle files (SRT/VTT/TSV/JSON)
Whisper can derive subtitles from timing info in result.
Example to create SRT subtitles:
import whisper
model = whisper.load_model('base')
result = model.transcribe('hello.wav')
from whisper.utils import get_writer
write_function = get_writer('srt', '.') # format, output folder
write_function(result, 'audio') # creates audio.srt
- First arg to get_writer: 'srt', 'vtt', 'txt', 'tsv', or 'json'.
- Second arg: output folder ('.' = current directory).
- Call the returned function with result and a base filename to generate audio.srt, audio.vtt, etc.
Subtitle formats:
- SRT: numbered blocks, hh:mm:ss,ms timestamps.
- VTT: WEBVTT header, mm:ss.mmm timestamps, no numbering.
- TSV: start, end, text lines (times in milliseconds), useful for further processing with the csv module.
- JSON: machine-readable structure.
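As the TSV bullet suggests, that format is easy to post-process with the csv module. A sketch, using a made-up two-segment sample laid out the way Whisper's TSV writer emits it (a header row, then start/end times in milliseconds):

```python
import csv
import io

# Stand-in for the contents of an audio.tsv file produced by get_writer('tsv', ...):
sample_tsv = 'start\tend\ttext\n0\t2100\tHello.\n2100\t4000\tHow are you doing?\n'

reader = csv.DictReader(io.StringIO(sample_tsv), delimiter='\t')
segments = []
for row in reader:
    start = int(row['start']) / 1000  # milliseconds -> seconds
    end = int(row['end']) / 1000
    segments.append((start, end, row['text']))
    print(f'{start:.1f}s-{end:.1f}s: {row["text"]}')
```

With a real file you would pass open('audio.tsv', encoding='utf-8') to DictReader instead of the io.StringIO stand-in.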
Downloading videos/audio with yt-dlp
Basic video download
- yt-dlp lets you download from YouTube and many other sites.
- Install via Appendix A instructions, then:
import yt_dlp
video_url = 'https://www.youtube.com/watch?v=kSrnLbioN6w'
with yt_dlp.YoutubeDL() as ydl:
    ydl.download([video_url])
- The filename is based on the video title and may be .mp4, .mkv, etc.
- Sites may block downloads (age/login/geo restrictions, anti-scraping), so keep yt-dlp updated.
Extracting audio
Configure options to get audio only:
import yt_dlp
video_url = 'https://www.youtube.com/watch?v=kSrnLbioN6w'
options = {
    'quiet': True,         # suppress output
    'no_warnings': True,   # suppress warnings
    'outtmpl': 'downloaded_content.%(ext)s',
    'format': 'm4a/bestaudio/best',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'm4a',
    }],
}
with yt_dlp.YoutubeDL(options) as ydl:
    ydl.download([video_url])
This saves only the audio as downloaded_content.m4a (or similar extension).
Get the exact filename with glob:
from pathlib import Path
matching = list(Path().glob('downloaded_content.*'))
downloaded_filename = str(matching[0]) # e.g. 'downloaded_content.m4a'
You'd then feed this file into Whisper for transcription.
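The glob lookup can be wrapped in a small helper that fails loudly when the download produced nothing. find_download is a made-up name, and the demo below uses a placeholder file standing in for a real yt-dlp download:

```python
from pathlib import Path

def find_download(stem='downloaded_content'):
    """Return the name of the first file matching stem.*, whatever
    extension yt-dlp chose, or raise if no such file exists."""
    matching = sorted(Path().glob(stem + '.*'))
    if not matching:
        raise FileNotFoundError(f'no file matching {stem}.*')
    return str(matching[0])

# Demo with a placeholder file instead of a real download:
Path('downloaded_content.m4a').touch()
print(find_download())  # downloaded_content.m4a

# The real pipeline would then hand the file to Whisper:
# model = whisper.load_model('base')
# print(model.transcribe(find_download())['text'])
```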
Downloading metadata only
import yt_dlp, json
video_url = 'https://www.youtube.com/watch?v=kSrnLbioN6w'
options = {
    'quiet': True,
    'no_warnings': True,
    'skip_download': True,  # no media file
}
with yt_dlp.YoutubeDL(options) as ydl:
    info = ydl.extract_info(video_url)
    json_info = ydl.sanitize_info(info)
print('TITLE:', json_info['title'])
print('KEYS:', json_info.keys())
with open('metadata.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(json_info))
skip_download avoids downloading the media file; extract_info returns the metadata; sanitize_info makes it JSON-safe before writing.
Overall idea of the chapter
Chapter 24 covers audio I/O: pyttsx3 for local text-to-speech (controllable rate, volume, voice; outputs to speakers or WAV), Whisper for local speech recognition (model sizes from tiny to large-v3, subtitle generation in SRT/VTT/TSV/JSON, multi-language support), and yt-dlp for downloading audio/video/metadata to feed into Whisper. All tools run locally after initial setup.