Chapter 24 - Text-to-Speech and Speech Recognition Engines (JavaScript)
Here's a JavaScript-flavoured version of the same concepts, with small JS/Node examples for each idea.
Goal and tools
In JavaScript:
- say.js – cross-platform text-to-speech using OS engines (same engines as pyttsx3).
- whisper-node or @xenova/transformers – local speech recognition (Whisper models via ONNX/WebAssembly).
- ytdl-core, or calling yt-dlp via `child_process` – downloading audio/video.
Text-to-speech with say.js
Engine basics
say.js wraps the same OS TTS engines: SAPI5 (Windows), say command (macOS), eSpeak/Festival (Linux).
```bash
npm install say
```
Basic example:
```js
const say = require("say");

say.speak("Hello. How are you doing?", null, null, (err) => {
  if (err) return console.error(err);
  console.log("Done speaking.");
});
```
- `say.speak(text, voice, speed, callback)` – speaks and calls back when finished.
- Pass `null` for defaults.
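The callback style above can also be wrapped in a Promise so it works with `await`. A minimal sketch – `promisifySpeak` and the `fakeSpeak` stub are our own illustrations, not part of say.js; with the library installed you would pass `say.speak.bind(say)` instead of the stub:

```js
// Wrap a (text, voice, speed, callback)-style function in a Promise.
function promisifySpeak(speakFn) {
  return (text, voice = null, speed = null) =>
    new Promise((resolve, reject) => {
      speakFn(text, voice, speed, (err) => (err ? reject(err) : resolve()));
    });
}

// Demo with a stub "engine" (no audio); swap in say.speak.bind(say)
// to use the real TTS engine.
const fakeSpeak = (text, voice, speed, cb) => setImmediate(() => cb(null));
const speakAsync = promisifySpeak(fakeSpeak);

speakAsync("Hello. How are you doing?").then(() => console.log("Done speaking."));
```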
Voice, rate, and stopping
```js
const say = require("say");

// Speak with a specific voice and speed (1.0 = normal)
say.speak("The quick brown fox.", "Alex", 0.8);

// Stop speaking
say.stop();
```
List available voices by platform:
- macOS: `say -v '?'` in a terminal
- Windows: check SAPI5 voices in system settings
- Linux: `espeak --voices`
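On macOS, `say -v '?'` prints one voice per line as plain text, which a small parser can turn into objects. A sketch under the assumption that the output follows the common `Name   locale   # sample` layout – `parseSayVoices` is our own helper, and the exact format can vary between macOS versions:

```js
// Parse `say -v '?'` output into { name, locale, sample } entries.
// Assumes the common "Name   locale   # sample sentence" layout.
function parseSayVoices(output) {
  const voices = [];
  for (const line of output.split("\n")) {
    const match = line.match(/^(.+?)\s{2,}([a-zA-Z]{2}[_-][a-zA-Z]+)\s+#\s*(.*)$/);
    if (match) {
      voices.push({ name: match[1].trim(), locale: match[2], sample: match[3] });
    }
  }
  return voices;
}

// On macOS you could feed it live output:
// const { execSync } = require("child_process");
// console.log(parseSayVoices(execSync("say -v '?'", { encoding: "utf-8" })));
```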
Saving speech to file
```js
const say = require("say");

// Export to WAV file
say.export("Hello. How are you doing?", null, 1.0, "hello.wav", (err) => {
  if (err) return console.error(err);
  console.log("Saved to hello.wav");
});
```
Alternative: Google Cloud TTS or edge-tts
For higher-quality voices, edge-tts (Microsoft Edge's free TTS) can be called via child_process:
```bash
pip install edge-tts   # Python tool, but callable from Node
```

```js
const { execSync } = require("child_process");

execSync('edge-tts --text "Hello, how are you?" --write-media hello.mp3');
```
Or use the Web Speech API in browser-based JavaScript:
```js
// Browser only
const utterance = new SpeechSynthesisUtterance("Hello. How are you doing?");
utterance.rate = 1.0;
utterance.volume = 0.5;
speechSynthesis.speak(utterance);
```
Speech recognition with Whisper
Option 1: Call Python Whisper via child_process
The simplest approach — use the Python Whisper CLI:
```js
const { execSync } = require("child_process");

const result = execSync("whisper hello.wav --model base --output_format txt", {
  encoding: "utf-8",
});
console.log(result);
```
Option 2: @xenova/transformers (Whisper in JS)
Run Whisper models directly in Node via ONNX:
```bash
npm install @xenova/transformers
```

```js
// @xenova/transformers is an ES module, so from CommonJS it must be
// loaded with a dynamic import().
async function transcribe(audioPath) {
  const { pipeline } = await import("@xenova/transformers");
  const transcriber = await pipeline(
    "automatic-speech-recognition",
    "Xenova/whisper-base"
  );
  const result = await transcriber(audioPath);
  console.log(result.text); // "Hello. How are you doing?"
  return result;
}

transcribe("hello.wav");
```
Model sizes available: whisper-tiny, whisper-base, whisper-small, whisper-medium. Same trade-off: smaller = faster, larger = more accurate.
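Since the model ids above all follow a `Xenova/whisper-<size>` pattern, the size choice can be isolated in a tiny helper. A sketch – `whisperModelId` is our own illustration, not part of the library:

```js
// Map a Whisper size name to its @xenova/transformers model id,
// rejecting sizes that are not in the list above.
const WHISPER_SIZES = ["tiny", "base", "small", "medium"];

function whisperModelId(size) {
  if (!WHISPER_SIZES.includes(size)) {
    throw new Error(`Unknown Whisper size: ${size}`);
  }
  return `Xenova/whisper-${size}`;
}

console.log(whisperModelId("base")); // "Xenova/whisper-base"
```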
Specify language:
```js
// Inside the async function:
const result = await transcriber(audioPath, {
  language: "english",
  task: "transcribe",
});
```
Option 3: whisper-node
```bash
npm install whisper-node
```

```js
const whisper = require("whisper-node");

// CommonJS has no top-level await, so wrap the call in an async function.
async function run() {
  const transcript = await whisper("hello.wav", {
    modelName: "base",
    whisperOptions: { language: "en" },
  });
  for (const segment of transcript) {
    console.log(`[${segment.start} --> ${segment.end}] ${segment.speech}`);
  }
}

run();
```
Creating subtitle files
From Python Whisper CLI
The easiest way to generate subtitles from Node:
```js
const { execSync } = require("child_process");

// Generate SRT subtitles (creates hello.srt in the current directory)
execSync("whisper hello.wav --model base --output_format srt");

// Other formats: vtt, txt, tsv, json
execSync("whisper hello.wav --model base --output_format vtt");
```
Manual SRT generation from segments
If using @xenova/transformers with timestamps:
```js
const fs = require("fs");

function pad(n, len = 2) {
  return String(n).padStart(len, "0");
}

function formatTime(seconds) {
  const h = Math.floor(seconds / 3600);
  const m = Math.floor((seconds % 3600) / 60);
  const s = Math.floor(seconds % 60);
  const ms = Math.round((seconds % 1) * 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

async function generateSrt(audioPath, outputPath) {
  // @xenova/transformers is an ES module; load it with a dynamic
  // import() from CommonJS.
  const { pipeline } = await import("@xenova/transformers");
  const transcriber = await pipeline(
    "automatic-speech-recognition",
    "Xenova/whisper-base"
  );
  const result = await transcriber(audioPath, {
    return_timestamps: true,
  });

  let srt = "";
  result.chunks.forEach((chunk, i) => {
    const start = formatTime(chunk.timestamp[0]);
    const end = formatTime(chunk.timestamp[1]);
    srt += `${i + 1}\n${start} --> ${end}\n${chunk.text.trim()}\n\n`;
  });
  fs.writeFileSync(outputPath, srt, "utf-8");
}

generateSrt("hello.wav", "hello.srt");
```
Subtitle formats (same as Python):
- SRT: numbered blocks with `hh:mm:ss,ms` timestamps.
- VTT: `WEBVTT` header, `mm:ss.mmm` timestamps.
- TSV: tab-separated `start end text` (milliseconds).
- JSON: machine-readable structure.
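The SRT generator above adapts to VTT by swapping the header and using a dot instead of a comma before the milliseconds. A minimal sketch – `toVtt` and `formatVttTime` are our own helpers, operating on `{ start, end, text }` cues with times in seconds:

```js
// Format seconds as a WebVTT timestamp: hh:mm:ss.mmm
function formatVttTime(seconds) {
  const h = Math.floor(seconds / 3600);
  const m = Math.floor((seconds % 3600) / 60);
  const s = Math.floor(seconds % 60);
  const ms = Math.round((seconds % 1) * 1000);
  const pad = (n, len = 2) => String(n).padStart(len, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)}.${pad(ms, 3)}`;
}

// Build a minimal WebVTT file from an array of cues.
function toVtt(cues) {
  const body = cues
    .map((c) => `${formatVttTime(c.start)} --> ${formatVttTime(c.end)}\n${c.text}`)
    .join("\n\n");
  return `WEBVTT\n\n${body}\n`;
}

console.log(toVtt([{ start: 0, end: 2.5, text: "Hello." }]));
// WEBVTT
//
// 00:00:00.000 --> 00:00:02.500
// Hello.
```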
Downloading videos/audio with yt-dlp
Option 1: Call yt-dlp via child_process
Most reliable approach — same tool as Python:
```js
const { execSync } = require("child_process");

// Download video
execSync("yt-dlp https://www.youtube.com/watch?v=kSrnLbioN6w");

// Download audio only
execSync(
  'yt-dlp -x --audio-format m4a -o "downloaded_content.%(ext)s" https://www.youtube.com/watch?v=kSrnLbioN6w'
);
```
Option 2: ytdl-core (pure JS)
```bash
npm install @distube/ytdl-core
```

```js
const ytdl = require("@distube/ytdl-core");
const fs = require("fs");

const url = "https://www.youtube.com/watch?v=kSrnLbioN6w";

// Download audio stream
ytdl(url, { filter: "audioonly" })
  .pipe(fs.createWriteStream("downloaded_content.m4a"))
  .on("finish", () => console.log("Download complete"));
```
Note: ytdl-core can break when YouTube changes its API. yt-dlp via child_process is more reliable.
Downloading metadata only
```js
const { execSync } = require("child_process");
const fs = require("fs");

const result = execSync(
  'yt-dlp --skip-download --print-json "https://www.youtube.com/watch?v=kSrnLbioN6w"',
  { encoding: "utf-8" }
);

const info = JSON.parse(result);
console.log("TITLE:", info.title);
console.log("DURATION:", info.duration);
fs.writeFileSync("metadata.json", JSON.stringify(info, null, 2));
```
Overall idea of the chapter
Chapter 24 in JavaScript: text-to-speech with say.js (wraps OS engines, export to WAV) or the Web Speech API in browsers, speech recognition with @xenova/transformers (Whisper models in JS/ONNX) or by calling Python's Whisper CLI, subtitle generation (SRT/VTT) from timestamp data, and video/audio downloading with yt-dlp via child_process (most reliable) or ytdl-core (pure JS but fragile). All tools can run locally after initial setup.