Chapter 24 - Text-to-Speech and Speech Recognition Engines (JavaScript)
Here's a JavaScript-flavoured version of the same concepts, with small JS/Node examples for each idea.
Goal and tools
In JavaScript:
- say.js – cross-platform text-to-speech using OS engines (same engines as pyttsx3).
- whisper-node or @xenova/transformers – local speech recognition (Whisper models via ONNX/WebAssembly).
- ytdl-core, or calling yt-dlp via `child_process` – downloading audio/video.
Text-to-speech with say.js
Engine basics
say.js wraps the same OS TTS engines: SAPI5 (Windows), say command (macOS), eSpeak/Festival (Linux).
```bash
npm install say
```
Basic example:
```js
const say = require("say");

say.speak("Hello. How are you doing?", null, null, (err) => {
  if (err) return console.error(err);
  console.log("Done speaking.");
});
```
- `say.speak(text, voice, speed, callback)` – speaks and calls back when finished.
- Pass `null` for defaults.
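The callback style above can also be wrapped in a Promise so it works with `await`. A minimal sketch – `promisifySpeak` and the `fakeSpeak` stub are our own illustrations, not part of say.js; with the library installed you would pass `say.speak.bind(say)` instead of the stub:

```js
// Wrap a (text, voice, speed, callback)-style function in a Promise.
function promisifySpeak(speakFn) {
  return (text, voice = null, speed = null) =>
    new Promise((resolve, reject) => {
      speakFn(text, voice, speed, (err) => (err ? reject(err) : resolve()));
    });
}

// Demo with a stub "engine" (no audio); swap in say.speak.bind(say)
// to use the real TTS engine.
const fakeSpeak = (text, voice, speed, cb) => setImmediate(() => cb(null));
const speakAsync = promisifySpeak(fakeSpeak);

speakAsync("Hello. How are you doing?").then(() => console.log("Done speaking."));
```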
Voice, rate, and stopping
```js
const say = require("say");

// Speak with a specific voice and speed (1.0 = normal)
say.speak("The quick brown fox.", "Alex", 0.8);

// Stop speaking
say.stop();
```
List available voices by platform:
- macOS: `say -v '?'` in a terminal
- Windows: check SAPI5 voices in system settings
- Linux: `espeak --voices`
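On macOS, `say -v '?'` prints one voice per line as plain text, which a small parser can turn into objects. A sketch under the assumption that the output follows the common `Name   locale   # sample` layout – `parseSayVoices` is our own helper, and the exact format can vary between macOS versions:

```js
// Parse `say -v '?'` output into { name, locale, sample } entries.
// Assumes the common "Name   locale   # sample sentence" layout.
function parseSayVoices(output) {
  const voices = [];
  for (const line of output.split("\n")) {
    const match = line.match(/^(.+?)\s{2,}([a-zA-Z]{2}[_-][a-zA-Z]+)\s+#\s*(.*)$/);
    if (match) {
      voices.push({ name: match[1].trim(), locale: match[2], sample: match[3] });
    }
  }
  return voices;
}

// On macOS you could feed it live output:
// const { execSync } = require("child_process");
// console.log(parseSayVoices(execSync("say -v '?'", { encoding: "utf-8" })));
```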
Saving speech to file
```js
const say = require("say");

// Export to WAV file
say.export("Hello. How are you doing?", null, 1.0, "hello.wav", (err) => {
  if (err) return console.error(err);
  console.log("Saved to hello.wav");
});
```
Alternative: Google Cloud TTS or edge-tts
For higher-quality voices, edge-tts (Microsoft Edge's free TTS) can be called via child_process:
```bash
pip install edge-tts   # Python tool, but callable from Node
```

```js
const { execSync } = require("child_process");

execSync('edge-tts --text "Hello, how are you?" --write-media hello.mp3');
```
Or use the Web Speech API in browser-based JavaScript:
```js
// Browser only
const utterance = new SpeechSynthesisUtterance("Hello. How are you doing?");
utterance.rate = 1.0;
utterance.volume = 0.5;
speechSynthesis.speak(utterance);
```
Speech recognition with Whisper
Option 1: Call Python Whisper via child_process
The simplest approach — use the Python Whisper CLI:
```js
const { execSync } = require("child_process");

const result = execSync("whisper hello.wav --model base --output_format txt", {
  encoding: "utf-8",
});
console.log(result);
```
Option 2: @xenova/transformers (Whisper in JS)
Run Whisper models directly in Node via ONNX:
```bash
npm install @xenova/transformers
```

```js
// @xenova/transformers is an ES module, so from CommonJS it must be
// loaded with a dynamic import().
async function transcribe(audioPath) {
  const { pipeline } = await import("@xenova/transformers");
  const transcriber = await pipeline(
    "automatic-speech-recognition",
    "Xenova/whisper-base"
  );
  const result = await transcriber(audioPath);
  console.log(result.text); // "Hello. How are you doing?"
  return result;
}

transcribe("hello.wav");
```
Model sizes available: whisper-tiny, whisper-base, whisper-small, whisper-medium. Same trade-off: smaller = faster, larger = more accurate.
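Since the model ids above all follow a `Xenova/whisper-<size>` pattern, the size choice can be isolated in a tiny helper. A sketch – `whisperModelId` is our own illustration, not part of the library:

```js
// Map a Whisper size name to its @xenova/transformers model id,
// rejecting sizes that are not in the list above.
const WHISPER_SIZES = ["tiny", "base", "small", "medium"];

function whisperModelId(size) {
  if (!WHISPER_SIZES.includes(size)) {
    throw new Error(`Unknown Whisper size: ${size}`);
  }
  return `Xenova/whisper-${size}`;
}

console.log(whisperModelId("base")); // "Xenova/whisper-base"
```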
Specify language:
```js
// Inside the async function:
const result = await transcriber(audioPath, {
  language: "english",
  task: "transcribe",
});
```
Option 3: whisper-node
```bash
npm install whisper-node
```

```js
const whisper = require("whisper-node");

// CommonJS has no top-level await, so wrap the call in an async function.
async function run() {
  const transcript = await whisper("hello.wav", {
    modelName: "base",
    whisperOptions: { language: "en" },
  });
  for (const segment of transcript) {
    console.log(`[${segment.start} --> ${segment.end}] ${segment.speech}`);
  }
}

run();
```
Creating subtitle files
From Python Whisper CLI
The easiest way to generate subtitles from Node:
```js
const { execSync } = require("child_process");

// Generate SRT subtitles (creates hello.srt in the current directory)
execSync("whisper hello.wav --model base --output_format srt");

// Other formats: vtt, txt, tsv, json
execSync("whisper hello.wav --model base --output_format vtt");
```
Manual SRT generation from segments
If using @xenova/transformers with timestamps:
```js
const fs = require("fs");

function pad(n, len = 2) {
  return String(n).padStart(len, "0");
}

function formatTime(seconds) {
  const h = Math.floor(seconds / 3600);
  const m = Math.floor((seconds % 3600) / 60);
  const s = Math.floor(seconds % 60);
  const ms = Math.round((seconds % 1) * 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

async function generateSrt(audioPath, outputPath) {
  // @xenova/transformers is an ES module; load it with a dynamic
  // import() from CommonJS.
  const { pipeline } = await import("@xenova/transformers");
  const transcriber = await pipeline(
    "automatic-speech-recognition",
    "Xenova/whisper-base"
  );
  const result = await transcriber(audioPath, {
    return_timestamps: true,
  });

  let srt = "";
  result.chunks.forEach((chunk, i) => {
    const start = formatTime(chunk.timestamp[0]);
    const end = formatTime(chunk.timestamp[1]);
    srt += `${i + 1}\n${start} --> ${end}\n${chunk.text.trim()}\n\n`;
  });
  fs.writeFileSync(outputPath, srt, "utf-8");
}

generateSrt("hello.wav", "hello.srt");
```
Subtitle formats (same as Python):
- SRT: numbered blocks with `hh:mm:ss,ms` timestamps.
- VTT: `WEBVTT` header, `mm:ss.mmm` timestamps.
- TSV: tab-separated `start end text` (milliseconds).
- JSON: machine-readable structure.
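The SRT generator above adapts to VTT by swapping the header and using a dot instead of a comma before the milliseconds. A minimal sketch – `toVtt` and `formatVttTime` are our own helpers, operating on `{ start, end, text }` cues with times in seconds:

```js
// Format seconds as a WebVTT timestamp: hh:mm:ss.mmm
function formatVttTime(seconds) {
  const h = Math.floor(seconds / 3600);
  const m = Math.floor((seconds % 3600) / 60);
  const s = Math.floor(seconds % 60);
  const ms = Math.round((seconds % 1) * 1000);
  const pad = (n, len = 2) => String(n).padStart(len, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)}.${pad(ms, 3)}`;
}

// Build a minimal WebVTT file from an array of cues.
function toVtt(cues) {
  const body = cues
    .map((c) => `${formatVttTime(c.start)} --> ${formatVttTime(c.end)}\n${c.text}`)
    .join("\n\n");
  return `WEBVTT\n\n${body}\n`;
}

console.log(toVtt([{ start: 0, end: 2.5, text: "Hello." }]));
// WEBVTT
//
// 00:00:00.000 --> 00:00:02.500
// Hello.
```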
Downloading videos/audio with yt-dlp
Option 1: Call yt-dlp via child_process
Most reliable approach — same tool as Python:
```js
const { execSync } = require("child_process");

// Download video
execSync("yt-dlp https://www.youtube.com/watch?v=kSrnLbioN6w");

// Download audio only
execSync(
  'yt-dlp -x --audio-format m4a -o "downloaded_content.%(ext)s" https://www.youtube.com/watch?v=kSrnLbioN6w'
);
```
Option 2: ytdl-core (pure JS)
```bash
npm install @distube/ytdl-core
```

```js
const ytdl = require("@distube/ytdl-core");
const fs = require("fs");

const url = "https://www.youtube.com/watch?v=kSrnLbioN6w";

// Download audio stream
ytdl(url, { filter: "audioonly" })
  .pipe(fs.createWriteStream("downloaded_content.m4a"))
  .on("finish", () => console.log("Download complete"));
```
Note: ytdl-core can break when YouTube changes its API. yt-dlp via child_process is more reliable.
Downloading metadata only
```js
const { execSync } = require("child_process");
const fs = require("fs");

const result = execSync(
  'yt-dlp --skip-download --print-json "https://www.youtube.com/watch?v=kSrnLbioN6w"',
  { encoding: "utf-8" }
);

const info = JSON.parse(result);
console.log("TITLE:", info.title);
console.log("DURATION:", info.duration);
fs.writeFileSync("metadata.json", JSON.stringify(info, null, 2));
```
Overall idea of the chapter
Chapter 24 in JavaScript: text-to-speech with say.js (wraps OS engines, export to WAV) or the Web Speech API in browsers, speech recognition with @xenova/transformers (Whisper models in JS/ONNX) or by calling Python's Whisper CLI, subtitle generation (SRT/VTT) from timestamp data, and video/audio downloading with yt-dlp via child_process (most reliable) or ytdl-core (pure JS but fragile). All tools can run locally after initial setup.