Learn AI Series (#94) - Text-to-Speech (TTS)

@scipio 71

about 2 months ago

StemSocial

What will I learn

You will learn the TTS pipeline: converting text into natural-sounding speech;
text preprocessing: normalization, grapheme-to-phoneme conversion, and prosody assignment;
Tacotron 2 and its encoder-decoder architecture for predicting Mel spectrograms from text;
vocoders: WaveNet, WaveRNN, and HiFi-GAN for converting spectrograms to audio waveforms;
VITS: end-to-end TTS that eliminates the two-stage pipeline entirely;
voice cloning: synthesizing speech in a specific person's voice from just a few seconds of audio;
building a practical TTS system with Coqui TTS and XTTS v2;
TTS evaluation: MOS scores, MCD, and the challenges of measuring speech quality.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution;
The ambition to learn AI and machine learning.

Difficulty

Beginner

Curriculum (of the `Learn AI Series`):

Learn AI Series (#94) - Text-to-Speech (TTS)

Solutions to Episode #93 Exercises

Exercise 1: CTC alignment visualizer.

import numpy as np


class CTCAlignmentVisualizer:
    def __init__(self, n_frames=50, vocab_size=29):
        self.n_frames = n_frames
        self.vocab_size = vocab_size
        self.idx_to_char = {
            0: '<blank>', 1: 'a', 2: 'b', 3: 'c',
            4: 'd', 5: 'e', 6: 'f', 7: 'g',
            8: 'h', 9: 'i', 10: 'j', 11: 'k',
            12: 'l', 13: 'm', 14: 'n', 15: 'o',
            16: 'p', 17: 'q', 18: 'r', 19: 's',
            20: 't', 21: 'u', 22: 'v', 23: 'w',
            24: 'x', 25: 'y', 26: 'z', 27: ' ',
            28: "'"}
        self.rng = np.random.RandomState(42)

    def build_alignment(self):
        logits = self.rng.randn(
            self.n_frames, self.vocab_size) * 0.3
        segments = [
            (0, 8, 0),     # blank
            (9, 14, 8),    # h
            (15, 18, 0),   # blank
            (19, 24, 5),   # e
            (25, 28, 0),   # blank
            (29, 34, 12),  # l
            (35, 35, 0),   # blank (double-l sep)
            (36, 40, 12),  # l
            (41, 44, 0),   # blank
            (45, 49, 15),  # o
        ]
        for start, end, char_idx in segments:
            for t in range(start, min(end + 1,
                                      self.n_frames)):
                logits[t, char_idx] += 4.0
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        return probs

    def greedy_decode(self, probs):
        preds = probs.argmax(axis=1)
        chars = []
        prev = 0
        for p in preds:
            if p != 0 and p != prev:
                chars.append(
                    self.idx_to_char.get(p, '?'))
            prev = p
        return ''.join(chars)

    def collapse_no_blank(self, probs):
        preds = probs.argmax(axis=1)
        chars = []
        prev = -1
        for p in preds:
            if p != prev and p != 0:
                chars.append(
                    self.idx_to_char.get(p, '?'))
            elif p != prev and p == 0:
                pass
            prev = p
        return ''.join(chars)

    def collapse_ignore_blanks(self, probs):
        preds = probs.argmax(axis=1)
        non_blank = [p for p in preds if p != 0]
        chars = []
        prev = -1
        for p in non_blank:
            if p != prev:
                chars.append(
                    self.idx_to_char.get(p, '?'))
            prev = p
        return ''.join(chars)

    def run(self):
        probs = self.build_alignment()
        print("Frame-by-frame top-3 tokens:")
        print(f"{'Frame':>5}  {'#1':>12} "
              f"{'#2':>12} {'#3':>12}")
        print("-" * 48)
        for t in range(self.n_frames):
            top3 = np.argsort(probs[t])[::-1][:3]
            parts = []
            for idx in top3:
                name = self.idx_to_char[idx]
                parts.append(
                    f"{name:>6}({probs[t,idx]:.2f})")
            print(f"{t:>5}  {'  '.join(parts)}")

        correct = self.greedy_decode(probs)
        wrong = self.collapse_ignore_blanks(probs)
        print(f"\nWith blank separator: "
              f"'{correct}'")
        print(f"Without blank separator: "
              f"'{wrong}'")
        print(f"Correct: "
              f"{'YES' if correct == 'hello' else 'NO'}")


viz = CTCAlignmentVisualizer()
viz.run()

The blank token between the two l's is what makes double-l possible. Without that separator, consecutive frames emitting 'l' collapse into a single 'l', giving "helo" instead of "hello". This is CTC's fundamental mechanism for handling repeated characters.

Exercise 2: WER error analyzer.

import numpy as np


class WERAnalyzer:
    def compute_wer(self, reference, hypothesis):
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        n, m = len(ref), len(hyp)
        d = np.zeros((n + 1, m + 1), dtype=int)
        ops = [['' for _ in range(m + 1)]
               for _ in range(n + 1)]
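        # Boundary cells: reaching column 0 means
        # only deletions remain, reaching row 0
        # means only insertions remain. Without
        # these labels the backtrace below walks
        # off the table when a hypothesis starts
        # with inserted words.
        for i in range(n + 1):
            d[i, 0] = i
            if i > 0:
                ops[i][0] = 'del'
        for j in range(m + 1):
            d[0, j] = j
            if j > 0:
                ops[0][j] = 'ins'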
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if ref[i-1] == hyp[j-1]:
                    d[i,j] = d[i-1,j-1]
                    ops[i][j] = 'match'
                else:
                    choices = [
                        d[i-1,j-1] + 1,
                        d[i,j-1] + 1,
                        d[i-1,j] + 1]
                    d[i,j] = min(choices)
                    if d[i,j] == d[i-1,j-1] + 1:
                        ops[i][j] = 'sub'
                    elif d[i,j] == d[i,j-1] + 1:
                        ops[i][j] = 'ins'
                    else:
                        ops[i][j] = 'del'
        subs = ins = dels = 0
        i, j = n, m
        alignment = []
        while i > 0 or j > 0:
            if i > 0 and j > 0 and (
                    ops[i][j] == 'match'):
                alignment.append(
                    ('match', ref[i-1], hyp[j-1]))
                i -= 1; j -= 1
            elif i > 0 and j > 0 and (
                    ops[i][j] == 'sub'):
                alignment.append(
                    ('sub', ref[i-1], hyp[j-1]))
                subs += 1; i -= 1; j -= 1
            elif j > 0 and ops[i][j] == 'ins':
                alignment.append(
                    ('ins', '***', hyp[j-1]))
                ins += 1; j -= 1
            else:
                alignment.append(
                    ('del', ref[i-1], '***'))
                dels += 1; i -= 1
        alignment.reverse()
        return {
            'wer': d[n,m] / max(n, 1),
            'subs': subs, 'ins': ins,
            'dels': dels, 'alignment': alignment}

    def error_distribution(self, pairs):
        total_s = total_i = total_d = 0
        total_ref = 0
        for ref, hyp in pairs:
            r = self.compute_wer(ref, hyp)
            total_s += r['subs']
            total_i += r['ins']
            total_d += r['dels']
            total_ref += len(ref.split())
        total_err = total_s + total_i + total_d
        return {
            'overall_wer': total_err / max(
                total_ref, 1),
            'sub_pct': total_s / max(
                total_err, 1) * 100,
            'ins_pct': total_i / max(
                total_err, 1) * 100,
            'del_pct': total_d / max(
                total_err, 1) * 100}

    def run(self):
        pairs = [
            ("the cat sat on the mat",
             "the cat sat on the mat"),
            ("recognize speech",
             "wreck a nice beach"),
            ("i went to the uh store yesterday",
             "i went to the store yesterday"),
            ("hello world",
             "hello hello world okay"),
            ("the quick brown fox jumps",
             "the quik brown box jump"),
        ]
        for ref, hyp in pairs:
            r = self.compute_wer(ref, hyp)
            print(f"WER: {r['wer']:.1%} "
                  f"(S={r['subs']} I={r['ins']}"
                  f" D={r['dels']})")
            for op, rw, hw in r['alignment']:
                tag = op[0].upper()
                print(f"  [{tag}] {rw:>15} | "
                      f"{hw}")
            print()
        dist = self.error_distribution(pairs)
        print(f"Aggregate WER: "
              f"{dist['overall_wer']:.1%}")
        print(f"  Substitutions: "
              f"{dist['sub_pct']:.0f}%")
        print(f"  Insertions:    "
              f"{dist['ins_pct']:.0f}%")
        print(f"  Deletions:     "
              f"{dist['del_pct']:.0f}%")


analyzer = WERAnalyzer()
analyzer.run()

Substitutions typically dominate real ASR error distributions because the model produces something for every audio segment -- it rarely drops or hallucinates entire words, but frequently confuses similar-sounding words.

Exercise 3: Whisper model size advisor.

class WhisperAdvisor:
    def __init__(self):
        self.models = {
            'tiny': dict(
                params=39, wer_clean=7.6,
                wer_noisy=13.2, rtf_cpu=0.4,
                rtf_gpu=0.02, ram_gb=0.5,
                multilingual=True),
            'base': dict(
                params=74, wer_clean=5.0,
                wer_noisy=9.8, rtf_cpu=0.7,
                rtf_gpu=0.03, ram_gb=0.8,
                multilingual=True),
            'small': dict(
                params=244, wer_clean=4.2,
                wer_noisy=7.9, rtf_cpu=2.1,
                rtf_gpu=0.06, ram_gb=1.5,
                multilingual=True),
            'medium': dict(
                params=769, wer_clean=3.8,
                wer_noisy=6.2, rtf_cpu=6.5,
                rtf_gpu=0.12, ram_gb=4.5,
                multilingual=True),
            'large': dict(
                params=1550, wer_clean=3.0,
                wer_noisy=5.1, rtf_cpu=14.0,
                rtf_gpu=0.25, ram_gb=9.0,
                multilingual=True),
        }

    def throughput(self, name, gpu, hours):
        m = self.models[name]
        rtf = m['rtf_gpu'] if gpu else m['rtf_cpu']
        return hours * rtf

    def recommend(self, constraints):
        best = None
        for name, m in self.models.items():
            if ('max_ram_gb' in constraints
                    and m['ram_gb']
                    > constraints['max_ram_gb']):
                continue
            if 'max_rtf' in constraints:
                rtf = m['rtf_cpu']
                if rtf > constraints['max_rtf']:
                    continue
            if ('max_wer' in constraints
                    and m['wer_clean']
                    > constraints['max_wer']):
                continue
            best = name
        return best

    def run(self):
        scenarios = [
            ("Edge device (1GB, realtime, EN)",
             dict(max_ram_gb=1.0, max_rtf=1.0)),
            ("Laptop (8GB, multilingual, batch)",
             dict(max_ram_gb=8.0)),
            ("GPU server (16GB, lowest WER)",
             dict(max_ram_gb=16.0)),
            ("Mobile (2GB, 2x realtime, EN)",
             dict(max_ram_gb=2.0, max_rtf=0.5)),
        ]
        for desc, cons in scenarios:
            pick = self.recommend(cons)
            m = self.models[pick]
            hrs = self.throughput(pick, False, 10)
            print(f"{desc}")
            print(f"  -> {pick} ({m['params']}M, "
                  f"WER={m['wer_clean']}%, "
                  f"RAM={m['ram_gb']}G)")
            print(f"  10h audio on CPU: "
                  f"{hrs:.1f}h processing\n")


advisor = WhisperAdvisor()
advisor.run()

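With these illustrative numbers the advisor picks the largest model that satisfies every constraint. That is base for the edge device (tiny would fit too), tiny for the mobile case where the 0.5 RTF limit rules out everything else, medium for the laptop (large blows past the 8 GB RAM budget), and large on the server where RAM and batch processing time aren't constraints.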

On to today's episode

Here we go! Last episode we turned speech into text. Today we're reversing the entire process: turning text into speech. Every time Siri reads you a notification, Google Maps tells you to turn left, or an audiobook narrator reads you a novel -- that's TTS at work. And the quality difference between the robotic-sounding systems from the 2000s and modern neural TTS is, honestly, staggering. Today's best systems are nearly indistinguishable from human speech, and we're going to understand exactly how that works ;-)

This episode draws on quite a bit of prior knowledge from the series: the encoder-decoder paradigm from episodes #50-51 (sequence-to-sequence and attention), the adversarial training from episode #55 (GANs), the Mel spectrograms from episode #92 (audio fundamentals), and the speaker embeddings concept that relates to the face embeddings from episode #88 (face analysis). If you've been following along, every piece is going to click into place. If you haven't -- well, the episode references are there for a reason ;-)

The TTS pipeline

Traditional neural TTS splits the problem into two stages:

Acoustic model: converts text (or phonemes) into a Mel spectrogram -- a compact representation of what the speech should sound like
Vocoder: converts that Mel spectrogram into an actual audio waveform you can listen to

Text -> [Text Processing] -> phonemes -> [Acoustic Model] -> Mel spectrogram -> [Vocoder] -> waveform

But before any neural network gets involved, there's a text processing step that handles all the messy normalization work. Think about it: the string "$4.99" needs to become "four dollars and ninety nine cents". The abbreviation "Dr." is "doctor" before a name but "drive" in a street address. The number "1994" could be "nineteen ninety four" (a year) or "one thousand nine hundred ninety four" (a quantity). These are called text normalization problems and they're surprisingly tricky to get right for all edge cases.

After normalization, a grapheme-to-phoneme (G2P) module converts written text into pronunciation symbols. English spelling is famously inconsistent -- "read" (present) and "read" (past) are spelled identically but pronounced differently. G2P models handle this by mapping character sequences to phoneme sequences (usually using the International Phonetic Alphabet or a simpler system like ARPABET).
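To make the "read" example concrete, here is a minimal sketch of dictionary-based G2P. The `LEXICON` below is a tiny hand-written ARPABET table and the `tense` hint is a hypothetical stand-in for real context analysis; production systems combine a large pronunciation dictionary with a learned model for unknown words.

```python
# Tiny illustrative ARPABET lexicon (an assumption for this sketch,
# not loaded from a real pronunciation dictionary).
LEXICON = {
    "i": ["AY1"],
    "the": ["DH", "AH0"],
    "book": ["B", "UH1", "K"],
    # Homographs keep one pronunciation per context tag.
    "read": {"present": ["R", "IY1", "D"], "past": ["R", "EH1", "D"]},
}


def g2p(words, tense="present"):
    """Look up each word; mark unknown words letter by letter."""
    phonemes = []
    for word in words:
        entry = LEXICON.get(word.lower())
        if entry is None:
            # Hypothetical fallback: flag out-of-vocabulary letters.
            phonemes.extend(f"<{ch}>" for ch in word.lower())
        elif isinstance(entry, dict):
            phonemes.extend(entry[tense])
        else:
            phonemes.extend(entry)
    return phonemes


print(g2p("I read the book".split(), tense="present"))
print(g2p("I read the book".split(), tense="past"))
```

The two calls differ only in the vowel of "read" (IY1 versus EH1), which is exactly the information the spelling alone doesn't carry.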

Finally, prosody assignment determines which words get emphasized, where pauses go, and how the pitch contour should flow across the sentence. "I didn't say she stole the money" has seven different meanings depending on which word gets stress. Getting prosody right is what separates natural-sounding TTS from obviously synthetic speech.

import re


class TextNormalizer:
    """Normalize raw text for TTS input.
    Handles common patterns that trip up
    acoustic models."""

    def __init__(self):
        self.ones = [
            '', 'one', 'two', 'three', 'four',
            'five', 'six', 'seven', 'eight',
            'nine', 'ten', 'eleven', 'twelve',
            'thirteen', 'fourteen', 'fifteen',
            'sixteen', 'seventeen', 'eighteen',
            'nineteen']
        self.tens = [
            '', '', 'twenty', 'thirty', 'forty',
            'fifty', 'sixty', 'seventy', 'eighty',
            'ninety']

    def number_to_words(self, n):
        """Convert integer to English words."""
        if n < 0:
            return 'minus ' + self.number_to_words(-n)
        if n == 0:
            # ones[0] is '' so it can be skipped when
            # composing 'twenty' or 'two hundred';
            # a bare zero needs its own word.
            return 'zero'
        if n < 20:
            return self.ones[n]
        if n < 100:
            rest = self.ones[n % 10]
            return (self.tens[n // 10]
                    + (' ' + rest if rest else ''))
        if n < 1000:
            rest = (self.number_to_words(n % 100)
                    if n % 100 else '')
            return (self.ones[n // 100]
                    + ' hundred'
                    + (' ' + rest if rest else ''))
        return str(n)

    def normalize(self, text):
        """Apply common normalization rules."""
        text = re.sub(
            r'\$(\d+)\.(\d{2})',
            lambda m: (
                self.number_to_words(int(m.group(1)))
                + ' dollars and '
                + self.number_to_words(int(m.group(2)))
                + ' cents'),
            text)
        abbrevs = {
            'Mr.': 'mister', 'Mrs.': 'missus',
            'Dr.': 'doctor', 'Prof.': 'professor',
            'vs.': 'versus', 'etc.': 'et cetera',
        }
        for abbr, full in abbrevs.items():
            text = text.replace(abbr, full)
        text = re.sub(
            r'\b(\d+)\b',
            lambda m: self.number_to_words(
                int(m.group(1)))
            if int(m.group(1)) < 1000 else m.group(0),
            text)
        return text

    def run(self):
        tests = [
            "The price is $4.99 for 3 items.",
            "Dr. Smith saw 12 patients today.",
            "It costs $100.50 vs. $200.00.",
        ]
        for t in tests:
            norm = self.normalize(t)
            print(f"Input:  {t}")
            print(f"Output: {norm}\n")


normalizer = TextNormalizer()
normalizer.run()

This is a simplified version -- production TTS systems like eSpeak-ng or the normalization frontend of Festival have thousands of rules covering dates, times, currency, Roman numerals, abbreviations, and more. The point is: a LOT of work happens before the neural network even sees the text.
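The "1994" ambiguity mentioned earlier is a good example of such a rule. Below is a minimal sketch of reading a four-digit number as a year, assuming the caller has already decided from context that it is a year. The function names and the exact wording choices ("oh five", "two thousand five") are illustrative conventions, not taken from any particular TTS frontend.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen",
        "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
        "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]


def two_digits(n):
    """Words for 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (" " + ONES[ones] if ones else "")


def year_to_words(year):
    """Read a four-digit year as two pairs of digits."""
    if not 1000 <= year <= 9999:
        raise ValueError("expected a four-digit year")
    high, low = divmod(year, 100)
    round_millennium = high % 10 == 0
    if low == 0:
        # 2000 -> "two thousand", 1900 -> "nineteen hundred"
        if round_millennium:
            return ONES[high // 10] + " thousand"
        return two_digits(high) + " hundred"
    if low < 10:
        # 2005 -> "two thousand five", 1905 -> "nineteen oh five"
        if round_millennium:
            return ONES[high // 10] + " thousand " + ONES[low]
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)


for y in (1994, 1905, 2000, 2024):
    print(y, "->", year_to_words(y))
```

Deciding whether "1994" is a year or a quantity in the first place is the hard part, and that needs the surrounding words ("in 1994" versus "1994 items").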

Tacotron 2: text to Mel spectrogram

Tacotron 2 (Shen et al., 2018) is an encoder-decoder model with attention that produces Mel spectrograms from character sequences. If you followed episodes #50-51 on sequence-to-sequence models and attention, the architecture should feel very familiar. The encoder processes characters through an embedding layer and convolutions, the attention mechanism aligns text to audio frames, and the decoder generates Mel frames autoregressively -- one frame at a time, each conditioned on the previous frame:

import torch
import torch.nn as nn


class SimpleTacotronEncoder(nn.Module):
    """Character encoder: embeddings +
    convolutions + bidirectional LSTM."""

    def __init__(self, vocab_size=80,
                 embed_dim=512, enc_dim=512):
        super().__init__()
        self.embed = nn.Embedding(
            vocab_size, embed_dim)
        self.convs = nn.Sequential(
            nn.Conv1d(embed_dim, enc_dim,
                      kernel_size=5, padding=2),
            nn.BatchNorm1d(enc_dim),
            nn.ReLU(),
            nn.Conv1d(enc_dim, enc_dim,
                      kernel_size=5, padding=2),
            nn.BatchNorm1d(enc_dim),
            nn.ReLU(),
            nn.Conv1d(enc_dim, enc_dim,
                      kernel_size=5, padding=2),
            nn.BatchNorm1d(enc_dim),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(
            enc_dim, enc_dim // 2,
            batch_first=True, bidirectional=True)

    def forward(self, chars):
        x = self.embed(chars)
        x = self.convs(x.transpose(1, 2))
        x = x.transpose(1, 2)
        x, _ = self.lstm(x)
        return x


class SimpleTacotronDecoder(nn.Module):
    """Mel decoder: prenet + attention +
    LSTM + projection."""

    def __init__(self, enc_dim=512,
                 dec_dim=1024, n_mels=80):
        super().__init__()
        self.prenet = nn.Sequential(
            nn.Linear(n_mels, 256), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Dropout(0.5),
        )
        # Project the 256-dim prenet output up to
        # dec_dim so it can serve as the attention
        # query: nn.MultiheadAttention requires the
        # query to have embed_dim features.
        self.query_proj = nn.Linear(256, dec_dim)
        self.attention = nn.MultiheadAttention(
            dec_dim, num_heads=4,
            batch_first=True)
        self.enc_proj = nn.Linear(
            enc_dim, dec_dim)
        self.lstm = nn.LSTM(
            256 + dec_dim, dec_dim,
            num_layers=2, batch_first=True)
        self.mel_proj = nn.Linear(
            dec_dim, n_mels)
        self.stop_proj = nn.Linear(dec_dim, 1)

    def forward_step(self, prev_mel,
                     encoder_out, lstm_state):
        pre = self.prenet(
            prev_mel).unsqueeze(1)
        enc = self.enc_proj(encoder_out)
        ctx, attn_weights = self.attention(
            self.query_proj(pre), enc, enc)
        lstm_in = torch.cat(
            [pre, ctx], dim=-1)
        out, lstm_state = self.lstm(
            lstm_in, lstm_state)
        mel_out = self.mel_proj(
            out.squeeze(1))
        stop_token = self.stop_proj(
            out.squeeze(1))
        return (mel_out, stop_token,
                attn_weights, lstm_state)


encoder = SimpleTacotronEncoder()
decoder = SimpleTacotronDecoder()

chars = torch.randint(0, 80, (2, 50))
enc_out = encoder(chars)
print(f"Encoder output: {enc_out.shape}")

prev_mel = torch.zeros(2, 80)
mel, stop, attn, state = (
    decoder.forward_step(
        prev_mel, enc_out, None))
print(f"Mel frame: {mel.shape}")
print(f"Stop token: {stop.shape}")
print(f"Attention: {attn.shape}")

A couple of things worth highlighting here. The prenet applies dropout at both training AND inference time. This is unusual -- normally dropout is only used during training. In Tacotron 2, keeping prenet dropout at inference acts as a source of variation, preventing the model from producing overly deterministic (and boring-sounding) speech. It's one of those tricks that sounds wrong in theory but works beautifully in practice.

The stop token is essential: the model needs to decide when the sentence is finished. Without it, generation continues indefinitely, producing babbling or silence. During training the stop token is supervised with a binary label (1 at the last frame, 0 elsewhere).
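Here is a minimal sketch of how the stop token is used at inference time. `fake_decoder_step` is a hypothetical stand-in for a trained decoder step (it just flips the stop logit at a fixed frame), and the 0.5 threshold and `max_steps` cap are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fake_decoder_step(step, total_frames=40):
    """Hypothetical stand-in for a trained decoder step.

    Returns a dummy 80-band Mel frame and a stop logit that
    turns positive on the last frame of the utterance.
    """
    mel_frame = [0.0] * 80
    stop_logit = 5.0 if step >= total_frames - 1 else -5.0
    return mel_frame, stop_logit


def generate(max_steps=1000, threshold=0.5,
             step_fn=fake_decoder_step):
    """Autoregressive loop: emit frames until the stop token fires."""
    frames = []
    for step in range(max_steps):
        mel_frame, stop_logit = step_fn(step)
        frames.append(mel_frame)
        if sigmoid(stop_logit) > threshold:
            break  # the model says the sentence is finished
    return frames


print(len(generate()))              # stops via the stop token
print(len(generate(max_steps=10)))  # stops via the safety cap
```

The `max_steps` cap matters in practice: if the stop token never fires, the cap is the only thing that ends generation.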

And here's a key diagnostic during TTS training: visualize the attention alignment. Good TTS produces a nearly diagonal attention matrix -- the model generates speech left to right, attending to each word in order. If the attention is scattered or jumps around, the speech will have repeated words, skipped words, or garbled pronunciation. A clean diagonal alignment means the model has learned the mapping between text and audio frames correctly.
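Besides looking at the plot, you can put a rough number on "how diagonal" an alignment is. The sketch below uses a hypothetical metric, `diagonal_focus`, which measures the fraction of attention mass that falls within a band around the ideal diagonal. The band width of 0.1 is an arbitrary choice for illustration.

```python
def diagonal_focus(attn, band=0.1):
    """Fraction of attention mass near the ideal diagonal.

    attn: list of rows, one per decoder step; each row holds
    the attention weights over the encoder positions.
    """
    n_steps = len(attn)
    n_tokens = len(attn[0])
    inside = 0.0
    total = 0.0
    for t, row in enumerate(attn):
        rel_t = t / max(n_steps - 1, 1)       # 0..1 along the audio
        for s, weight in enumerate(row):
            rel_s = s / max(n_tokens - 1, 1)  # 0..1 along the text
            total += weight
            if abs(rel_t - rel_s) <= band:
                inside += weight
    return inside / total


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


n_steps, n_tokens = 20, 10
# Clean alignment: each decoder step attends to one text position,
# moving left to right.
clean = [one_hot(round(t * (n_tokens - 1) / (n_steps - 1)), n_tokens)
         for t in range(n_steps)]
# Failed alignment: attention spread evenly over the whole text.
scattered = [[1.0 / n_tokens] * n_tokens for _ in range(n_steps)]

print(f"clean:     {diagonal_focus(clean):.2f}")
print(f"scattered: {diagonal_focus(scattered):.2f}")
```

A score near 1.0 suggests a clean monotonic alignment; a low score is a hint to go look at the attention plot before listening to any samples.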

Vocoders: from spectrogram to waveform

The Mel spectrogram from Tacotron is a compact representation, but it's not audio you can play. Converting it to a waveform requires reconstructing the missing phase information (the Mel spectrogram only contains magnitude, as we discussed in episode #92) and generating 16,000+ samples per second of audio.

WaveNet (van den Oord et al., 2016) was the first neural vocoder. It generates audio one sample at a time using dilated causal convolutions. Each sample is conditioned on all previous samples and the Mel spectrogram. The output quality was revolutionary -- the first truly natural-sounding neural audio. But generating 16,000 samples per second, one at a time, is painfully slow. A 5-second clip at 16 kHz means 80,000 sequential neural network forward passes. Not exactly real-time friendly.
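Two bits of arithmetic make the WaveNet tradeoff tangible: how far back a stack of dilated causal convolutions can "see", and how many sequential steps a clip costs. The layer configuration below (kernel size 2, dilations doubling from 1 to 512, three stacks) is an assumed example, not a specific published model.

```python
def receptive_field(dilations, kernel_size=2):
    """Samples visible to the top layer of a causal conv stack.

    Each layer extends the receptive field by
    (kernel_size - 1) * dilation samples.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def sequential_steps(seconds, sample_rate=16000):
    """One forward pass per generated sample."""
    return int(seconds * sample_rate)


one_stack = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512

print(receptive_field(one_stack))         # 1024 samples
print(receptive_field(one_stack * 3))     # 3070 samples
print(sequential_steps(5))                # 80000 forward passes
```

Doubling the dilation grows the receptive field exponentially with depth, which is why ten layers cover 1,024 samples. It does nothing for generation speed though: every output sample still has to wait for the previous one.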

WaveRNN (Kalchbrenner et al., 2018) improved speed by using a single-layer RNN with clever tricks: dual softmax for coarse/fine bits of each sample, weight pruning, and subscaling. It's fast enough for real-time on mobile devices.
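The coarse/fine trick is easy to show in isolation. A 16-bit sample has 65,536 possible values, which is a lot for one softmax; splitting it into a high byte (coarse) and a low byte (fine) gives two 256-way predictions instead. The helper names below are made up for this sketch.

```python
def split_sample(sample):
    """Split a signed 16-bit sample into (coarse, fine) bytes."""
    if not -32768 <= sample <= 32767:
        raise ValueError("expected a signed 16-bit sample")
    unsigned = sample + 32768          # shift to 0..65535
    return unsigned // 256, unsigned % 256


def join_sample(coarse, fine):
    """Inverse of split_sample."""
    return coarse * 256 + fine - 32768


for s in (-32768, -1, 0, 12345, 32767):
    coarse, fine = split_sample(s)
    assert join_sample(coarse, fine) == s
    print(f"{s:>7} -> coarse={coarse:>3} fine={fine:>3}")
```

In the model the fine part is predicted after the coarse part and conditioned on it, so the two small softmaxes together still describe the full 16-bit value.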

HiFi-GAN (Kong et al., 2020) solved the speed problem definitively using a GAN architecture. The generator takes a Mel spectrogram and upsamples it to waveform resolution in a single forward pass -- real-time or faster on a GPU:

import torch
import torch.nn as nn


class HiFiGANGenerator(nn.Module):
    """Simplified HiFi-GAN generator:
    Mel spectrogram -> waveform via
    transposed convolution upsampling."""

    def __init__(self, n_mels=80,
                 upsample_rates=[8, 8, 2, 2]):
        super().__init__()
        channels = 512
        self.input_conv = nn.Conv1d(
            n_mels, channels, 7, padding=3)
        self.upsamples = nn.ModuleList()
        for rate in upsample_rates:
            self.upsamples.append(nn.Sequential(
                nn.LeakyReLU(0.1),
                nn.ConvTranspose1d(
                    channels, channels // 2,
                    kernel_size=rate * 2,
                    stride=rate,
                    padding=rate // 2),
            ))
            channels //= 2
        self.output_conv = nn.Sequential(
            nn.LeakyReLU(0.1),
            nn.Conv1d(channels, 1, 7, padding=3),
            nn.Tanh(),
        )

    def forward(self, mel):
        # mel: (batch, n_mels, time)
        x = self.input_conv(mel)
        for upsample in self.upsamples:
            x = upsample(x)
        return self.output_conv(x)


gen = HiFiGANGenerator()
mel = torch.randn(1, 80, 100)
waveform = gen(mel)
print(f"Input Mel: {mel.shape}")
print(f"  80 Mel bands x 100 time frames")
print(f"Output waveform: {waveform.shape}")
print(f"  Upsampling: 8*8*2*2 = "
      f"{8*8*2*2}x")
print(f"  100 frames -> "
      f"{100*256} audio samples")

The upsample rates [8, 8, 2, 2] multiply to 256, matching a commonly used Mel spectrogram hop length. So each Mel frame becomes exactly 256 audio samples. HiFi-GAN is trained adversarially with two kinds of discriminators: a multi-period discriminator that looks at the periodic structure of the waveform, and a multi-scale discriminator that judges it at different time resolutions. The generator learns to produce audio that passes their scrutiny. Combined with feature matching loss and Mel spectrogram reconstruction loss, you get fast, high-quality audio generation. Exactly the GAN training paradigm we studied in episode #55, but applied to 1D audio instead of 2D images.
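The relationship between upsample rates, hop length, and clip duration is worth checking with a few lines of arithmetic. The 22,050 Hz sample rate below is an assumption for the example (it is the rate of the LJSpeech recordings); the helper names are made up for this sketch.

```python
import math


def matches_hop(upsample_rates, hop_length):
    """The generator must expand each Mel frame to hop_length samples."""
    return math.prod(upsample_rates) == hop_length


def frames_to_seconds(n_frames, hop_length=256, sample_rate=22050):
    """Duration of audio produced from n_frames Mel frames."""
    return n_frames * hop_length / sample_rate


print(matches_hop([8, 8, 2, 2], 256))     # True
print(matches_hop([8, 8, 2], 256))        # False: only 128x
print(f"{frames_to_seconds(1) * 1000:.1f} ms per frame")
print(f"{frames_to_seconds(100):.2f} s for 100 frames")
```

If the product of the rates doesn't match the hop length used to compute the training spectrograms, the generated waveform won't line up with the target audio in time.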

VITS: end-to-end TTS

VITS (Kim et al., 2021) eliminates the two-stage pipeline entirely. It's a single model that goes directly from text to waveform, combining a VAE for latent representation, a normalizing flow for flexible distributions, and an adversarial training objective. The real benefit isn't just architectural elegance -- it's that the acoustic model and vocoder are optimized jointly, avoiding the error accumulation that happens when two separately trained models are chained together:

from TTS.api import TTS


# Using Coqui TTS library (VITS model)
tts = TTS(model_name=(
    "tts_models/en/ljspeech/vits"))

# Generate speech from text
tts.tts_to_file(
    text="Hello, this is a test of neural "
         "text to speech synthesis.",
    file_path="output.wav"
)

# List all available models
print("Available models:")
for m in TTS.list_models()[:10]:
    print(f"  {m}")

Coqui TTS wraps VITS (and many other architectures) in a nice API. The LJSpeech model was trained on a single female speaker reading passages from public domain books -- about 24 hours of clean audio. The quality is quite good for a single-speaker model. But the really interesting stuff happens when you want the model to speak in different voices.

Voice cloning: speaking in someone else's voice

Modern TTS can synthesize speech in a target speaker's voice from just a few seconds of reference audio. The key idea: speaker embeddings. A speaker encoder (similar to the face recognition embeddings we built in episode #88) maps a reference audio clip to a fixed-dimensional vector that captures the speaker's vocal characteristics -- pitch range, timbre, speaking rate, accent, vocal quality.

import numpy as np


class SpeakerEmbeddingDemo:
    """Demonstrate how speaker embeddings
    capture vocal identity."""

    def __init__(self, embed_dim=256):
        self.embed_dim = embed_dim
        self.rng = np.random.RandomState(42)

    def simulate_speaker_encoder(self,
                                  n_speakers=5):
        """Simulate speaker embeddings. In
        real systems these come from a trained
        encoder network (like GE2E or ECAPA)."""
        speakers = {}
        names = ['Alice', 'Bob', 'Carlos',
                 'Diana', 'Eve']
        for i, name in enumerate(names):
            base = self.rng.randn(self.embed_dim)
            base = base / np.linalg.norm(base)
            speakers[name] = {
                'base': base,
                'clips': []}
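            for j in range(3):
                # Keep per-clip noise small: at 256
                # dims, 0.01 per dimension gives a
                # same-speaker cosine similarity of
                # roughly 0.97 (0.05 would drop it
                # to about 0.6).
                noise = self.rng.randn(
                    self.embed_dim) * 0.01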
                clip_embed = base + noise
                clip_embed /= np.linalg.norm(
                    clip_embed)
                speakers[name]['clips'].append(
                    clip_embed)
        return speakers

    def cosine_similarity(self, a, b):
        return float(
            np.dot(a, b) / (
                np.linalg.norm(a)
                * np.linalg.norm(b)))

    def run(self):
        speakers = (
            self.simulate_speaker_encoder())
        names = list(speakers.keys())
        print("Same-speaker similarity "
              "(different clips):")
        for name in names[:3]:
            clips = speakers[name]['clips']
            sim = self.cosine_similarity(
                clips[0], clips[1])
            print(f"  {name} clip0 vs clip1: "
                  f"{sim:.4f}")

        print("\nCross-speaker similarity:")
        for i in range(3):
            for j in range(i + 1, min(i + 3, 5)):
                sim = self.cosine_similarity(
                    speakers[names[i]]['base'],
                    speakers[names[j]]['base'])
                print(
                    f"  {names[i]} vs {names[j]}: "
                    f"{sim:.4f}")

        print(f"\nSame speaker: ~0.95+ "
              f"(high similarity)")
        print(f"Diff speakers: ~0.0 "
              f"(near orthogonal)")
        print(f"Embedding dim: {self.embed_dim}")


demo = SpeakerEmbeddingDemo()
demo.run()

The speaker embedding is then fed into the TTS decoder as a conditioning signal -- concatenated with the text encoder output, or added to the hidden states, or passed through an adaptive layer normalization. The decoder learns to generate speech that matches both the text content AND the target speaker's voice characteristics.
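The first two conditioning options are simple enough to sketch with plain lists. The function names are hypothetical, and a real model would do this on tensors (and usually project the speaker vector through a learned layer first), but the shapes work the same way.

```python
def concat_speaker(encoder_out, speaker_embedding):
    """Append the same speaker vector to every encoder time step."""
    return [list(frame) + list(speaker_embedding)
            for frame in encoder_out]


def add_speaker(hidden_states, speaker_bias):
    """Add the same speaker vector to every hidden state.

    The speaker vector must already have the hidden size.
    """
    out = []
    for frame in hidden_states:
        if len(frame) != len(speaker_bias):
            raise ValueError(
                "speaker vector must match the hidden size")
        out.append([h + b for h, b in zip(frame, speaker_bias)])
    return out


# 3 time steps with 4 features each, plus a 2-dim speaker vector
encoder_out = [[0.1 * t] * 4 for t in range(3)]
speaker = [1.0, -1.0]

conditioned = concat_speaker(encoder_out, speaker)
print(len(conditioned), len(conditioned[0]))   # 3 time steps, 6 features
```

Concatenation widens every time step and leaves it to the next layer to mix the information; addition keeps the shape but requires the speaker vector to live in the same space as the hidden states.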

XTTS v2 (from Coqui) takes this further with multilingual voice cloning. Provide an English reference clip and generate speech in Spanish, French, or German -- with the same voice characteristics. The speaker embedding captures voice identity independent of language content:

from TTS.api import TTS


# XTTS v2: multilingual voice cloning
tts = TTS(model_name=(
    "tts_models/multilingual/"
    "multi-dataset/xtts_v2"))

# Clone a voice from reference audio
tts.tts_to_file(
    text="This is a demonstration of "
         "voice cloning technology.",
    speaker_wav="reference_speaker.wav",
    language="en",
    file_path="cloned_english.wav"
)

# Same voice, different language
tts.tts_to_file(
    text="Dit is een demonstratie van "
         "stemkloning technologie.",
    speaker_wav="reference_speaker.wav",
    language="nl",
    file_path="cloned_dutch.wav"
)

print("Generated speech in 2 languages "
      "with the same cloned voice")

The ethical implications here are significant and shouldn't be glossed over. Voice cloning enables audio deepfakes -- synthesized speech that sounds exactly like a real person saying things they never said. A few seconds of someone's voice from a YouTube video or podcast is enough to clone them convincingly. Detection methods exist (analyzing micro-prosody patterns, spectral artifacts, and temporal inconsistencies that humans can't hear but algorithms can), but the arms race between generation and detection parallels exactly what we discussed for visual deepfakes in episode #88. The technology itself is neutral -- it powers audiobooks, accessibility tools, voice assistants for people who've lost their voice -- but the potential for misuse is real.

Comparing TTS architectures

Let's put the different approaches in perspective:

class TTSArchitectureComparison:
    """Compare TTS system architectures
    across quality, speed, and complexity."""

    def __init__(self):
        self.systems = {
            'Tacotron2 + WaveNet': dict(
                quality_mos=4.53,
                rtf_gpu=0.01,
                rtf_cpu=50.0,
                trainable_params='28M + 6M',
                training_data='24h',
                end_to_end=False),
            'Tacotron2 + HiFi-GAN': dict(
                quality_mos=4.43,
                rtf_gpu=0.05,
                rtf_cpu=1.5,
                trainable_params='28M + 14M',
                training_data='24h',
                end_to_end=False),
            'VITS': dict(
                quality_mos=4.43,
                rtf_gpu=0.06,
                rtf_cpu=2.0,
                trainable_params='37M',
                training_data='24h',
                end_to_end=True),
            'XTTS v2': dict(
                quality_mos=4.20,
                rtf_gpu=0.15,
                rtf_cpu=8.0,
                trainable_params='467M',
                training_data='16000h+',
                end_to_end=True),
        }

    def run(self):
        print(f"{'System':>24} {'MOS':>5} "
              f"{'GPU RTF':>8} {'CPU RTF':>8} "
              f"{'Params':>12} {'E2E':>4}")
        print("-" * 66)
        for name, s in self.systems.items():
            print(
                f"{name:>24} "
                f"{s['quality_mos']:>5.2f} "
                f"{s['rtf_gpu']:>7.2f}x "
                f"{s['rtf_cpu']:>7.1f}x "
                f"{s['trainable_params']:>12} "
                f"{'Y' if s['end_to_end'] else 'N':>4}")
        print(f"\nMOS = Mean Opinion Score "
              f"(1-5, human = ~4.5-4.8)")
        print(f"RTF = Real-Time Factor "
              f"(< 1.0 = faster than realtime)")
        print(f"\nNote: Tacotron2+WaveNet has "
              f"highest MOS but WaveNet's")
        print(f"autoregressive generation is "
              f"50x slower than realtime")
        print(f"on CPU -- impractical for "
              f"anything but batch generation.")


comp = TTSArchitectureComparison()
comp.run()

The tradeoff is clear. WaveNet delivers the best quality because it models each audio sample individually, but the autoregressive generation is brutally slow. HiFi-GAN sacrifices a tiny bit of quality for a massive speedup by generating the entire waveform in a single forward pass. VITS offers similar quality to Tacotron 2 + HiFi-GAN but in a cleaner, jointly optimized package. And XTTS v2 trades some single-speaker quality for the ability to clone any voice across its supported languages -- a fundamentally different capability.
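To make the RTF column concrete: the real-time factor is simply the time spent synthesizing divided by the duration of the audio that comes out. Below is a minimal sketch of how you could measure it yourself. The names `real_time_factor`, `measure_rtf` and the `synthesize` callable are hypothetical helpers for illustration (not part of Coqui TTS or any other library), and the RTF values in the loop are just the illustrative numbers from the table above.

```python
import time


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize, text, sample_rate):
    """Time one call to `synthesize` and return its RTF.

    `synthesize` is a hypothetical callable that takes text and
    returns a 1-D sequence of audio samples.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


# Using the illustrative table values: how long does 10 s of audio take?
for name, rtf in [("Tacotron2 + WaveNet (CPU)", 50.0),
                  ("Tacotron2 + HiFi-GAN (CPU)", 1.5),
                  ("VITS (GPU)", 0.06)]:
    print(f"{name}: {rtf * 10:.1f} s to synthesize 10 s of speech")
```

An RTF of 50 means a ten-second sentence keeps the CPU busy for more than eight minutes, which is why the WaveNet row is only realistic for offline batch jobs.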

TTS evaluation

Unlike ASR's clear-cut Word Error Rate metric (last episode), TTS quality is harder to measure objectively. The fundamental problem: there are many valid ways to say a sentence. Two pronunciations can both be perfectly natural but sound quite different -- different emphasis, different pacing, slightly different pitch contour. So "distance from a reference recording" doesn't fully capture quality.

MOS (Mean Opinion Score): human listeners rate the naturalness of synthesized speech on a 1-5 scale. Expensive, slow, subjective -- but the gold standard. Modern neural TTS achieves MOS scores of 4.0-4.5. Natural human speech scores around 4.5-4.8 (not a perfect 5.0 because even human speech gets criticized for mumbling, background noise, or unusual pronunciation).
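Because MOS is an average over a limited panel of listeners, it is usually reported together with a confidence interval. Here is a small sketch of that computation. The function name `mean_opinion_score` is my own, the ratings are made up for illustration, and the interval uses the plain normal approximation (1.96 times the standard error), which is an assumption that only holds up reasonably for larger panels.

```python
import numpy as np


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    `ratings` is a flat list of 1-5 naturalness scores from listeners.
    The interval uses the normal approximation (1.96 * standard error).
    """
    r = np.asarray(ratings, dtype=float)
    if r.size < 2:
        raise ValueError("need at least two ratings")
    if r.min() < 1 or r.max() > 5:
        raise ValueError("ratings must be on the 1-5 scale")
    mos = r.mean()
    # Sample standard deviation (ddof=1) divided by sqrt(n)
    stderr = r.std(ddof=1) / np.sqrt(r.size)
    return float(mos), float(1.96 * stderr)


# Made-up ratings for illustration only
ratings = [5, 4, 4, 5, 3, 4, 4, 5, 4, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f} (n={len(ratings)})")
# prints: MOS = 4.20 +/- 0.39 (n=10)
```

With only ten ratings the interval is almost 0.4 wide, so a difference like 4.43 versus 4.53 would be invisible. That is part of why proper MOS studies need many listeners and many utterances.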

Objective metrics exist but correlate imperfectly with human judgment:

import numpy as np


class TTSEvaluator:
    """Objective TTS evaluation metrics."""

    def mel_cepstral_distortion(self,
                                 ref_mfcc,
                                 gen_mfcc):
        """MCD: lower is better. Measures
        spectral distance between reference
        and generated speech."""
        n_frames = min(
            ref_mfcc.shape[1],
            gen_mfcc.shape[1])
        ref = ref_mfcc[:, :n_frames]
        gen = gen_mfcc[:, :n_frames]
        diff = ref - gen
        frame_dist = np.sqrt(
            2.0 * np.sum(diff ** 2, axis=0))
        return float(np.mean(frame_dist))

    def f0_correlation(self, ref_f0, gen_f0):
        """Pitch correlation: how well does
        the generated F0 contour match the
        reference? Higher is better."""
        n = min(len(ref_f0), len(gen_f0))
        ref = ref_f0[:n]
        gen = gen_f0[:n]
        mask = (ref > 0) & (gen > 0)
        if mask.sum() < 2:
            return 0.0
        return float(np.corrcoef(
            ref[mask], gen[mask])[0, 1])

    def voiced_unvoiced_error(self, ref_f0,
                               gen_f0):
        """V/UV error: percentage of frames
        where voicing decision disagrees."""
        n = min(len(ref_f0), len(gen_f0))
        ref_voiced = ref_f0[:n] > 0
        gen_voiced = gen_f0[:n] > 0
        errors = np.sum(
            ref_voiced != gen_voiced)
        return float(errors / max(n, 1))

    def run(self):
        rng = np.random.RandomState(42)
        n_mfcc, n_frames = 13, 200

        ref_mfcc = rng.randn(n_mfcc, n_frames)
        good_gen = ref_mfcc + rng.randn(
            n_mfcc, n_frames) * 0.3
        bad_gen = rng.randn(n_mfcc, n_frames) * 2

        mcd_good = self.mel_cepstral_distortion(
            ref_mfcc, good_gen)
        mcd_bad = self.mel_cepstral_distortion(
            ref_mfcc, bad_gen)
        print(f"MCD (good TTS): {mcd_good:.2f} dB")
        print(f"MCD (bad TTS):  {mcd_bad:.2f} dB")
        print(f"(Typical good TTS: MCD < 5.0 dB)")

        ref_f0 = np.where(
            rng.rand(n_frames) > 0.3,
            rng.uniform(100, 300, n_frames),
            0)
        good_f0 = np.where(
            ref_f0 > 0,
            ref_f0 + rng.randn(n_frames) * 10,
            rng.choice([0, 150], n_frames,
                       p=[0.9, 0.1]))
        bad_f0 = np.where(
            rng.rand(n_frames) > 0.5,
            rng.uniform(80, 400, n_frames),
            0)

        print(f"\nF0 correlation (good): "
              f"{self.f0_correlation(ref_f0, good_f0):.3f}")
        print(f"F0 correlation (bad):  "
              f"{self.f0_correlation(ref_f0, bad_f0):.3f}")
        print(f"V/UV error (good): "
              f"{self.voiced_unvoiced_error(ref_f0, good_f0):.1%}")
        print(f"V/UV error (bad):  "
              f"{self.voiced_unvoiced_error(ref_f0, bad_f0):.1%}")


evaluator = TTSEvaluator()
evaluator.run()

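MCD (Mel Cepstral Distortion) measures the spectral distance between reference and generated cepstral coefficients (MFCCs here). One caveat about the code above: the conventional definition multiplies the per-frame distance by 10/ln(10) (about 4.34) to express it in decibels and usually leaves out the 0th (energy) coefficient. The simplified version here does neither, so its printed values are not directly comparable to published MCD figures such as the "< 5.0 dB" rule of thumb. F0 correlation checks whether the pitch contour of the generated speech matches the reference -- critical for naturalness because wrong pitch makes speech sound robotic or monotone. V/UV error catches cases where the model voices consonants that should be unvoiced (or vice versa), which produces audible artifacts.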
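For reference, this is roughly what the conventional dB-scaled MCD looks like: the per-frame Euclidean distance over the cepstral coefficients, multiplied by (10 / ln 10) * sqrt(2), with the 0th coefficient skipped. Treat it as a sketch under assumptions: `mcd_db` and `MCD_SCALE` are names I made up, the inputs are random arrays rather than real cepstra, and the frames are assumed to be time-aligned already (real evaluations typically align them first, for example with dynamic time warping, which this sketch does not do).

```python
import numpy as np

# (10 / ln 10) * sqrt(2): the usual constant that puts MCD in dB
MCD_SCALE = 10.0 / np.log(10.0) * np.sqrt(2.0)


def mcd_db(ref_mcep, gen_mcep, skip_c0=True):
    """MCD in dB between two (n_coeffs, n_frames) arrays.

    Assumes frames are already aligned; it only truncates both
    inputs to the shorter length.
    """
    n = min(ref_mcep.shape[1], gen_mcep.shape[1])
    start = 1 if skip_c0 else 0  # c0 mostly encodes overall energy
    diff = ref_mcep[start:, :n] - gen_mcep[start:, :n]
    per_frame = np.sqrt(np.sum(diff ** 2, axis=0))
    return float(MCD_SCALE * np.mean(per_frame))


# Random stand-ins for cepstral features (illustration only)
rng = np.random.RandomState(0)
ref = rng.randn(13, 200)
close = ref + rng.randn(13, 200) * 0.1
far = rng.randn(13, 200)
print(f"close: {mcd_db(ref, close):.2f} dB")
print(f"far:   {mcd_db(ref, far):.2f} dB")
```

Skipping c0 keeps a pure loudness difference from dominating the score, since the remaining coefficients describe the shape of the spectrum rather than its overall level.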

But here's the fundamental problem: a system can score well on objective metrics while sounding subtly wrong to human ears. A slightly wrong rhythm, an unnaturally smooth F0 contour, or a barely perceptible breathiness artifact might not register in MCD or F0 correlation but will make a human listener think "something's off." That's why MOS remains the gold standard for TTS evaluation despite its cost.

What TTS still gets wrong

Even with all the progress, current TTS systems have blind spots:

Long-form coherence: generating a paragraph sounds great. Generating a full chapter of an audiobook? The prosody starts drifting, emphasis patterns become repetitive, and the "performance" feels flat compared to a human narrator who varies their delivery based on emotional context across pages.
Expressive speech: neutral, informative speech is basically solved. But conveying sarcasm, whispering, shouting, crying, or speaking with a specific emotional state? Much harder. Some models support style tokens or emotion conditioning, but the results aren't consistently convincing.
Rare words and proper nouns: just like ASR struggles with unusual words, TTS can mispronounce names, technical jargon, or words from other languages embedded in English text. "Scipio" would almost certainly be mangled by most TTS systems ;-)
Breathing and pauses: human speakers breathe. They pause to think. They speed up when excited and slow down for emphasis. Current TTS systems model this only coarsely -- the pauses are mechanically regular rather than communicatively motivated.

The field is moving fast, though. Current research into duration prediction, hierarchical prosody models, and context-aware generation is steadily chipping away at these limitations. What was impossible two years ago might be routine by the time you read this.

Summary

Text-to-speech traditionally uses a two-stage pipeline: an acoustic model (Tacotron 2) that converts text to Mel spectrograms, and a vocoder (WaveNet, HiFi-GAN) that converts spectrograms to audio waveforms;
text preprocessing handles normalization ("$4.99" -> "four dollars and ninety nine cents"), grapheme-to-phoneme conversion, and prosody assignment -- a surprising amount of work happens before the neural network;
Tacotron 2 is an encoder-decoder with attention; the attention alignment (should be nearly diagonal) is the key diagnostic for TTS training quality; the stop token determines when generation ends;
WaveNet pioneered neural vocoders but is impractically slow (autoregressive, one sample at a time); HiFi-GAN generates waveforms in a single forward pass using transposed convolutions and adversarial training -- real-time or faster;
VITS combines everything into one end-to-end model using VAE + normalizing flows + adversarial training, avoiding the error accumulation of two-stage systems;
voice cloning uses speaker embeddings (fixed-dimensional vectors capturing vocal identity) to synthesize speech in any target voice from a few seconds of reference audio; XTTS v2 supports multilingual cloning;
MOS (human listening tests) remains the gold standard for TTS evaluation; objective metrics like MCD, F0 correlation, and V/UV error are useful but imperfect proxies for perceived naturalness.

We've now covered both directions of the speech pipeline: audio in (episode #93, speech recognition) and audio out (this episode, text-to-speech). The audio domain has more to explore though -- there are entire families of tasks beyond speech where audio AI is making major advances, from classifying sounds to generating music to recognizing who's speaking.

Exercises

Exercise 1: Build a text normalization benchmark. Create a class TextNormBenchmark that: (a) implements normalization rules for: numbers (0-9999 -> English words), currency ($X.XX -> "X dollars and XX cents"), ordinals (1st -> "first", 2nd -> "second", 3rd -> "third", 4th-20th mapped, 21st-99th derived), time (3:45 -> "three forty five", 12:00 -> "twelve o'clock"), and common abbreviations (Mr., Mrs., Dr., Prof., vs., etc., Jr., Sr.), (b) creates a test suite of at least 15 input/expected-output pairs covering all rule types including edge cases (midnight "12:00", "$0.01", "1st" vs "21st"), (c) runs all test cases and reports how many pass/fail, (d) implements a coverage_report() that checks which types of patterns appear in a given text and how many are handled vs unhandled, (e) tests coverage on the paragraph: "Dr. Smith paid $12.99 at 3:30 for his 2nd visit. Mr. Jones owes $0.50 etc." and prints which patterns were normalized and which weren't. Verify that all test cases pass with correct English word output.

Exercise 2: Build a Mel spectrogram reconstruction evaluator. Create a class MelReconstructionEval that: (a) generates a "reference" Mel spectrogram by creating a synthetic audio signal (a 2-second speech-like signal: sweep from 150 to 3000 Hz with amplitude modulation at 4 Hz to simulate syllable rhythm, at 16 kHz sample rate) and computing its Mel spectrogram (80 bands, n_fft=1024, hop=256), (b) creates three "generated" versions with different quality levels: (1) "excellent" -- the reference plus small Gaussian noise (std=0.5 dB), (2) "moderate" -- the reference with larger noise (std=2.0 dB) plus 3 random frequency bands zeroed out, (3) "poor" -- a completely different signal's Mel spectrogram (use a different frequency sweep), (c) implements compute_mcd(ref, gen) that computes Mel Cepstral Distortion between two Mel spectrograms (apply DCT to get 13 cepstral coefficients first, then compute frame-wise Euclidean distance, then average), (d) implements compute_spectral_convergence(ref, gen) that computes the Frobenius norm of the difference divided by the Frobenius norm of the reference (lower is better), (e) prints a comparison table showing MCD and spectral convergence for all three quality levels. Verify that excellent < moderate < poor for both metrics.

Exercise 3: Build a vocoder upsampling simulator. Create a class VocoderUpsampleSim that: (a) generates a simple 80-band Mel spectrogram of 100 frames (representing ~1.6 seconds at hop_length=256 and sr=16000), where each frame has energy concentrated in bands 10-30 (simulating a vowel sound's formant structure), (b) implements three upsampling strategies: (1) nearest neighbor -- repeat each Mel frame 256 times to get raw samples, then apply a simple smoothing filter, (2) linear interpolation -- linearly interpolate between consecutive Mel frames at the sample level, (3) transposed convolution simulation -- use numpy to simulate a learned transposed convolution: multiply each Mel frame by a random "learned" kernel (256 values) and overlap-add the results, (c) for each strategy computes: output length (should be 100 * 256 = 25,600 samples), spectral smoothness (mean absolute difference between consecutive samples -- lower means smoother transitions), dynamic range (max - min of output), and zero-crossing rate, (d) prints a comparison table of all three strategies. Verify that transposed convolution produces the richest output (highest dynamic range and zero-crossing rate) because the "learned" kernels introduce complex patterns that simple interpolation cannot produce.

Thanks for reading!

@scipio
