Learn AI Series (#101) - Mini Project: Voice-Controlled AI Assistant

What will I learn

  • You will build a complete local voice assistant from the components covered in this audio arc;
  • Whisper for speech recognition, a custom NLU module for intent parsing, and a TTS model for speech output;
  • custom wake word detection using a lightweight audio classifier;
  • streaming audio processing with real-time microphone capture and voice activity detection;
  • privacy-first design where everything runs on your machine with no cloud APIs;
  • performance benchmarking to understand latency bottlenecks in real-time voice pipelines;
  • a modular architecture that lets you swap any component independently.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#101) - Mini Project: Voice-Controlled AI Assistant

Solutions to Episode #100 Exercises

Exercise 1: The AV noise robustness analyzer generates 4 words with distinct frequency profiles (200/400/800, 150/350/700, 300/500/900, 180/450/750 Hz). Audio features are mean+std of mel bins (80-dim), visual features are mean+std of 8-dim lip aperture patterns (16-dim), combined AV features are 96-dim. Training a softmax classifier on 20 samples per word (80 total), then testing across 5 noise levels applied only to audio: at sigma=0.0 both audio-only and AV achieve 100%. At sigma=0.3, audio-only drops to 85% while AV holds at 95%. At sigma=0.6, audio-only falls to 60%, visual-only is steady at 70%, AV is at 85%. The crossover point (where audio-only drops below visual-only) happens around sigma=0.5 -- beyond this, the visual channel carries more reliable information than the corrupted audio. At sigma=2.0, audio-only is near chance (30%) while AV still achieves 65% thanks to the intact visual stream.

Exercise 2: The AV synchrony detector generates 5s of paired data with 10 impact events. Audio bursts are 50-sample Gaussian pulses; visual brightness spikes are aligned (sync) or shifted by 0.2s (desync). Cross-correlation features within a +/-0.3s window show: synchronized pairs have peak correlation at lag=0 (average peak position 0.0 +/- 0.01s), desynchronized pairs have peak at lag=0.2s (average 0.19 +/- 0.03s). A logistic regression binary classifier on the 15-bin correlation profile achieves 95% accuracy. The key discriminating feature is correlation at lag=0 (high for sync, low for desync) versus correlation at lag=+0.2s (low for sync, high for desync). False positives occur when random noise peaks happen to align at lag=0 despite the visual shift.

Exercise 3: The AV retrieval system defines 5 scenes with 6-dim audio and 6-dim visual features per scene. After training contrastive projections into an 8-dim shared space (200 epochs, margin=0.5, lr=0.05), audio-to-visual recall@1 reaches 82% and visual-to-audio recall@1 reaches 78%. The 5x5 confusion matrix shows strong diagonal (correct retrievals) with minor confusions between spectrally similar scenes (dog_barking vs clapping -- both broadband impulsive sounds). Guitar_playing and typing are the most distinct, achieving perfect retrieval in both directions. The shared embedding space clusters scenes by acoustic-visual similarity -- scenes with matching temporal dynamics (impulsive vs sustained) group together, demonstrating that the contrastive loss learns meaningful cross-modal correspondence.

On to today's episode

Here we go! Ten episodes of audio AI -- from the raw physics of waveforms through speech recognition, text-to-speech, classification, music generation, speaker identification, voice understanding, noise reduction, and multimodal fusion -- and now we put it all together into something you can actually use. This is the arc finale ;-)

We're building a voice-controlled AI assistant that runs entirely on your local machine. No cloud APIs, no data leaving your computer, no subscription fees. It listens for a wake word, transcribes your command, understands what you want, does something about it, and speaks a response back. Every single component is something we've already built or used in episodes #92 through #100.

This won't rival Siri or Alexa -- those systems have millions of training examples and thousands of engineers optimizing every millisecond of latency. But it will be yours. You'll understand every component, you'll control every parameter, and you'll be able to extend it in whatever direction you want. Having said that, the architecture is solid and the same pipeline structure is what the commercial assistants use under the hood. The difference is scale, not concept.

Architecture overview

The full pipeline looks like this:

Microphone -> VAD -> Wake Word -> ASR (Whisper) -> NLU -> Action -> TTS -> Speaker

Six components, each a direct application of a previous episode:

  1. Audio capture + VAD: continuous microphone input with voice activity detection (episode #97 covered Silero VAD)
  2. Wake word detector: a lightweight binary classifier that triggers on "hey assistant" (episode #95 technique -- audio classification with mel spectrograms)
  3. ASR: OpenAI's Whisper to transcribe the spoken command (episode #93)
  4. NLU: intent extraction and slot filling from text (episode #98)
  5. Action executor: maps structured intents to actual operations
  6. TTS: text-to-speech to speak the response (episode #94)

Let's build each one, then wire them together into a working system.

Component 1: Audio capture and VAD

We need a continuous audio stream that detects when someone is actually speaking versus ambient silence. Silero VAD (which we explored in episode #97) is lightweight and accurate -- it runs inference on 100ms audio chunks and returns a speech probability:

import numpy as np
import torch
import queue
import pyaudio


class AudioStream:
    """Continuous microphone capture with VAD."""

    def __init__(self, sample_rate=16000,
                 chunk_duration=0.1):
        self.sr = sample_rate
        self.chunk_size = int(
            sample_rate * chunk_duration)
        self.audio_queue = queue.Queue()

        # Load Silero VAD
        self.vad_model, self.vad_utils = (
            torch.hub.load(
                'snakers4/silero-vad',
                'silero_vad'))

        # PyAudio setup
        self.pa = pyaudio.PyAudio()
        self.stream = self.pa.open(
            format=pyaudio.paFloat32,
            channels=1,
            rate=self.sr,
            input=True,
            frames_per_buffer=self.chunk_size,
            stream_callback=self._callback)

    def _callback(self, in_data,
                  frame_count, time_info,
                  status):
        audio = np.frombuffer(
            in_data, dtype=np.float32)
        self.audio_queue.put(audio)
        return (None, pyaudio.paContinue)

    def get_speech_segment(
            self, silence_threshold=0.8,
            max_duration=10.0):
        """Record until speech ends.
        Returns the complete utterance."""
        speech_chunks = []
        silence_counter = 0
        chunk_dur = (
            self.chunk_size / self.sr)
        max_chunks = int(
            max_duration / chunk_dur)
        silence_limit = int(
            silence_threshold / chunk_dur)

        is_speaking = False
        for _ in range(max_chunks):
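            chunk = self.audio_queue.get()
            # Copy: np.frombuffer arrays are
            # read-only.
            tensor = torch.from_numpy(
                chunk.copy())

            # VAD inference. Recent Silero VAD
            # releases appear to accept only
            # 512-sample windows at 16kHz, so
            # score the chunk window by window
            # and keep the highest probability.
            speech_prob = max(
                self.vad_model(
                    tensor[i:i + 512],
                    self.sr).item()
                for i in range(
                    0, len(tensor) - 511, 512))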

            if speech_prob > 0.5:
                is_speaking = True
                silence_counter = 0
                speech_chunks.append(chunk)
            elif is_speaking:
                silence_counter += 1
                speech_chunks.append(chunk)
                if silence_counter >= silence_limit:
                    break

        if speech_chunks:
            return np.concatenate(
                speech_chunks)
        return None

    def cleanup(self):
        self.stream.stop_stream()
        self.stream.close()
        self.pa.terminate()

The callback architecture is important here. PyAudio fills a buffer in a background thread, which our main loop consumes. The VAD runs on each 100ms chunk (1600 samples at 16kHz) and we accumulate chunks while speech is active. Once we detect silence_threshold seconds of consecutive silence after speech started, we know the utterance is complete and return the full audio segment.
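The end-of-utterance rule described above can be separated from the audio hardware and checked on its own. The sketch below is a hypothetical helper (`find_utterance` is not part of the episode's code) that applies the same logic to a plain list of per-chunk speech probabilities, assuming 100ms chunks and the 0.5 speech threshold used in `get_speech_segment`:

```python
from typing import Optional, Sequence, Tuple


def find_utterance(
        probs: Sequence[float],
        chunk_dur: float = 0.1,
        silence_threshold: float = 0.8,
        speech_prob: float = 0.5,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) chunk indices of the first utterance.

    `end` is exclusive and includes the trailing silence chunks,
    mirroring what get_speech_segment accumulates.
    """
    # Number of consecutive silent chunks that ends the utterance.
    silence_limit = round(silence_threshold / chunk_dur)
    start = None
    silence = 0
    for i, p in enumerate(probs):
        if p > speech_prob:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= silence_limit:
                return (start, i + 1)
    # Ran out of chunks: return whatever was captured, if anything.
    return (start, len(probs)) if start is not None else None
```

With 0.8 seconds of required silence and 100ms chunks, eight silent chunks in a row close the segment. Feeding the function recorded probability traces is a cheap way to tune `silence_threshold` without speaking into the microphone each time.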

Component 2: Wake word detection

You don't want the assistant processing every sound in the room. That would be a privacy nightmare and a waste of compute. A wake word detector is a tiny binary classifier that runs continuously on the audio stream, only triggering the full pipeline when it hears the activation phrase.

The approach: train a small CNN on mel spectrograms of the wake phrase versus everything else (the exact technique from episode #95):

import torch.nn as nn
import torchaudio
import os


class WakeWordDetector:
    """Lightweight wake word detection
    using a small CNN on mel spectrograms."""

    def __init__(self,
                 model_path="wake_word.pt",
                 threshold=0.85):
        self.threshold = threshold
        self.sr = 16000

        # Mel spectrogram transform
        self.mel_transform = (
            torchaudio.transforms
            .MelSpectrogram(
                sample_rate=self.sr,
                n_mels=40, n_fft=400,
                hop_length=160))

        # Small CNN classifier
        self.model = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid())

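        if os.path.exists(model_path):
            self.model.load_state_dict(
                torch.load(model_path))
        else:
            # Untrained weights give
            # meaningless probabilities.
            print(f"Warning: {model_path} "
                  f"not found; wake word "
                  f"detector is untrained.")
        self.model.eval()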

    def check(self,
              audio_chunk: np.ndarray) -> bool:
        """Check if audio contains
        the wake word."""
        tensor = torch.from_numpy(
            audio_chunk).unsqueeze(0)
        mel = self.mel_transform(
            tensor).unsqueeze(0)

        with torch.no_grad():
            prob = self.model(mel).item()

        return prob > self.threshold

Training this detector requires positive examples (you saying the wake word 50-100 times in different conditions -- quiet room, with music, whispering, shouting) and negative examples (ambient noise, other speech, music -- thousands of clips from AudioSet or your own recordings). For a quick prototype, 100 positive recordings and 1000 random negatives are enough to get started. The threshold of 0.85 is a reasonable starting point for balancing false activations against missed wake words -- adjust up for fewer false triggers, down for fewer misses.
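One way to choose the threshold from data rather than by feel is to score a held-out set of positive and negative clips and count the errors at each candidate value. The helper below is a hypothetical sketch (`sweep_thresholds` is not part of the episode's code), and the scores in the usage note are made up for illustration. It uses the same `prob > threshold` rule as `WakeWordDetector.check`:

```python
from typing import List, Tuple


def sweep_thresholds(
        pos_scores: List[float],
        neg_scores: List[float],
        thresholds: List[float],
) -> List[Tuple[float, float, float]]:
    """For each threshold return (threshold, miss_rate, false_alarm_rate).

    pos_scores: detector outputs on clips that contain the wake word.
    neg_scores: detector outputs on clips that do not.
    """
    rows = []
    for th in thresholds:
        # A positive clip is missed when its score does not exceed th.
        misses = sum(1 for s in pos_scores if s <= th)
        # A negative clip is a false alarm when its score exceeds th.
        false_alarms = sum(1 for s in neg_scores if s > th)
        rows.append((th,
                     misses / len(pos_scores),
                     false_alarms / len(neg_scores)))
    return rows
```

For example, `sweep_thresholds([0.95, 0.9, 0.8, 0.6], [0.1, 0.3, 0.7, 0.9], [0.5, 0.85])` shows the trade-off directly: raising the threshold from 0.5 to 0.85 cuts the false alarm rate but starts missing genuine wake words.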

Component 3: ASR with Whisper

Once the wake word fires, we capture the full command and transcribe it. Whisper (episode #93) handles this beautifully:

import whisper


class ASREngine:
    """Whisper-based speech-to-text."""

    def __init__(self, model_size="base"):
        self.model = whisper.load_model(
            model_size)

    def transcribe(self,
                   audio: np.ndarray) -> str:
        """Transcribe audio numpy array."""
        audio = audio.astype(np.float32)

        # Pad or trim to 30s
        audio = whisper.pad_or_trim(audio)

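        # Compute log-mel spectrogram.
        # Keep it 2-D (n_mels, frames):
        # whisper.decode returns a single
        # result for unbatched input and a
        # list for batched input.
        mel = (whisper
               .log_mel_spectrogram(audio)
               .to(self.model.device))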

        # Decode
        options = whisper.DecodingOptions(
            language="en", fp16=False)
        result = whisper.decode(
            self.model, mel, options)

        return result.text.strip()

Whisper "base" (74M parameters) gives a solid balance between speed and accuracy for short voice commands. On a modern CPU it transcribes a 5-second clip in about 1-2 seconds. If latency is critical and you're willing to sacrifice some accuracy, use "tiny" (39M params, roughly 0.5s on CPU). If you have a GPU available, even "medium" (769M) runs in under a second.

The pad_or_trim call is Whisper-specific -- it expects exactly 30 seconds of audio input (480,000 samples at 16kHz). Short commands get zero-padded, which works fine because Whisper knows to stop transcribing when it encounters silence.

Component 4: NLU -- intent and slot extraction

From episode #98, we know Natural Language Understanding turns raw text into structured data: what does the user want (intent) and what are the specific details (slots). For a controlled vocabulary assistant, pattern matching is fast, predictable, and needs zero training data:

import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class NLUResult:
    intent: str
    slots: dict
    confidence: float


class NLUEngine:
    """Pattern-based NLU for voice commands."""

    def __init__(self):
        self.intent_patterns = {
            "set_timer": [
                r"(?:set|start)\s+(?:a\s+)?"
                r"timer\s+(?:for\s+)?(.+)",
                r"timer\s+(.+)",
            ],
            "get_time": [
                r"what\s+time\s+is\s+it",
                r"what's\s+the\s+time",
                r"current\s+time",
            ],
            "get_weather": [
                r"(?:what's|how's)\s+the\s+"
                r"weather(?:\s+in\s+(.+))?",
                r"weather\s+(?:in|for)\s+(.+)",
            ],
            "play_music": [
                r"play\s+(?:some\s+)?"
                r"(.+?)(?:\s+music)?$",
                r"put\s+on\s+(?:some\s+)?(.+)",
            ],
            "set_reminder": [
                r"remind\s+me\s+to\s+(.+?)"
                r"(?:\s+(?:in|at)\s+(.+))?$",
                r"set\s+(?:a\s+)?reminder"
                r"\s+(?:to\s+)?(.+)",
            ],
            "stop": [
                r"^(?:stop|cancel|nevermind"
                r"|shut\s+up)$",
            ],
        }

    def parse(self, text: str) -> NLUResult:
        """Extract intent and slots."""
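        # Whisper adds punctuation; strip
        # it so anchored patterns and slot
        # values stay clean.
        text_lower = (text.lower().strip()
                      .rstrip(".!?,"))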

        for intent, patterns in (
                self.intent_patterns.items()):
            for pattern in patterns:
                match = re.match(
                    pattern, text_lower)
                if match:
                    slots = self._extract_slots(
                        intent, match)
                    return NLUResult(
                        intent=intent,
                        slots=slots,
                        confidence=0.9)

        return NLUResult(
            intent="unknown",
            slots={"raw_text": text},
            confidence=0.3)

    def _extract_slots(self, intent: str,
                       match: re.Match) -> dict:
        """Pull slot values from groups."""
        groups = [g for g in match.groups()
                  if g is not None]
        slot_mappings = {
            "set_timer": ["duration"],
            "get_weather": ["location"],
            "play_music": ["query"],
            "set_reminder": ["task", "time"],
        }
        slots = {}
        for i, name in enumerate(
                slot_mappings.get(intent, [])):
            if i < len(groups):
                slots[name] = groups[i]
        return slots

Pattern matching isn't fancy, but for 10-20 controlled intents it's fast, predictable, and debuggable. When someone says something outside the patterns, it returns unknown with low confidence -- and you can add new patterns as needs emerge. For a more capable (but heavier) system, swap this out for the BERT-based joint intent+slot model from episode #98.
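Regex patterns are sensitive to the exact surface form of the transcript, and ASR output tends to come with capitalization, punctuation and sometimes spelled-out numbers. A small normalization step in front of the patterns can absorb most of that. The function below is a hypothetical sketch (`normalize_command` and its deliberately incomplete number table are assumptions, not part of the episode's code):

```python
import re

# Deliberately small, illustrative table; extend as needed.
_NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "fifteen": 15, "twenty": 20, "thirty": 30,
}


def normalize_command(text: str) -> str:
    """Lowercase, drop punctuation and turn number words into digits."""
    text = text.lower().strip()
    # Keep apostrophes so patterns like "what's" still match.
    text = re.sub(r"[^\w\s']", " ", text)
    # Replace known number words; leave every other token untouched.
    words = [str(_NUMBER_WORDS.get(w, w)) for w in text.split()]
    return " ".join(words)
```

With this in place, "Set a timer for five minutes." becomes "set a timer for 5 minutes", which the digit-based duration parsing in the next component can handle. The trade-off is that a blanket word-to-digit mapping also rewrites titles such as a song called "One", so it may be better to apply it per intent.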

Component 5: Action execution

The action layer maps NLU output to actual operations. This is where the assistant does things:

import datetime
import threading


class ActionExecutor:
    """Execute actions based on NLU results."""

    def __init__(self):
        self.timers = {}

    def execute(self,
                nlu_result: NLUResult) -> str:
        """Run the intent handler."""
        handlers = {
            "set_timer": self._timer,
            "get_time": self._time,
            "get_weather": self._weather,
            "play_music": self._music,
            "set_reminder": self._reminder,
            "stop": self._stop,
            "unknown": self._unknown,
        }
        handler = handlers.get(
            nlu_result.intent, self._unknown)
        return handler(nlu_result.slots)

    def _timer(self, slots: dict) -> str:
        duration_text = slots.get(
            "duration", "")
        seconds = self._parse_duration(
            duration_text)
        if seconds:
            timer_id = f"t_{len(self.timers)}"
            t = threading.Timer(
                seconds, self._timer_done,
                args=[timer_id])
            t.start()
            self.timers[timer_id] = t
            return (f"Timer set for "
                    f"{duration_text}.")
        return ("Sorry, I didn't catch "
                "the duration.")

    def _time(self, slots: dict) -> str:
        now = datetime.datetime.now()
        return f"It's {now.strftime('%I:%M %p')}."

    def _weather(self, slots: dict) -> str:
        loc = slots.get(
            "location", "your area")
        return (f"I'd check the weather for "
                f"{loc}, but I'm running "
                f"offline. Add a weather API "
                f"key to enable this.")

    def _music(self, slots: dict) -> str:
        query = slots.get("query", "music")
        return (f"Playing {query}. "
                f"Hook up your music player "
                f"integration here.")

    def _reminder(self, slots: dict) -> str:
        task = slots.get(
            "task", "something")
        when = slots.get("time", "later")
        return (f"I'll remind you to "
                f"{task} {when}.")

    def _stop(self, slots: dict) -> str:
        return "Okay, stopping."

    def _unknown(self, slots: dict) -> str:
        return ("I didn't understand that. "
                "Could you rephrase?")

    def _parse_duration(
            self, text: str) -> Optional[int]:
        """Parse '5 minutes' to seconds."""
        patterns = [
            (r"(\d+)\s*(?:second|sec)", 1),
            (r"(\d+)\s*(?:minute|min)", 60),
            (r"(\d+)\s*(?:hour|hr)", 3600),
        ]
        total = 0
        for pattern, mult in patterns:
            match = re.search(pattern, text)
            if match:
                total += (int(match.group(1))
                          * mult)
        return total if total > 0 else None

    def _timer_done(self, timer_id: str):
        print(f"\nTimer {timer_id} finished!")

Notice how thin this layer is. The NLU already did the hard work of extracting structured information -- the action executor just maps it to function calls. This separation of concerns is what makes the architecture extensible: adding a new capability means adding patterns to the NLU and a handler to the executor. The rest of the pipeline stays untouched.

Component 6: TTS response

For speaking the response back, we use a pretrained TTS model (the same family of models we explored in episode #94):

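from TTS.api import TTS
import os
import subprocess
import sys
import tempfile
from typing import Optional


class TTSEngine:
    """Text-to-speech output."""

    def __init__(self):
        self.tts = TTS(
            model_name=(
                "tts_models/en/"
                "ljspeech/vits"))

    def speak(self, text: str,
              output_path: Optional[str] = None):
        """Generate speech audio."""
        if output_path is None:
            # Portable temp location
            # (there is no /tmp on Windows).
            output_path = os.path.join(
                tempfile.gettempdir(),
                "response.wav")
        self.tts.tts_to_file(
            text=text,
            file_path=output_path)
        return output_path

    def speak_and_play(self, text: str):
        """Generate and play response."""
        path = self.speak(text)
        if sys.platform == "darwin":
            subprocess.run(["afplay", path])
        elif sys.platform.startswith("linux"):
            subprocess.run(["aplay", path])
        elif sys.platform == "win32":
            # winsound ships with CPython
            # on Windows.
            import winsound
            winsound.PlaySound(
                path, winsound.SND_FILENAME)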

The VITS model ("Variational Inference with adversarial learning for end-to-end Text-to-Speech") produces natural-sounding speech and runs reasonably fast on CPU. For even lower latency you could use a smaller model like piper (optimized for on-device inference), or pre-generate common responses ("yes?", "okay", "I didn't understand that") at startup and play them from cache instead of generating fresh audio every time.
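The caching idea can be sketched as a thin wrapper around any synthesis function. The class below is a hypothetical helper (`ResponseCache` is not part of the episode's code). It assumes the synthesis callable takes `(text, output_path)` and returns the path it wrote, which is the shape of `TTSEngine.speak` above:

```python
import hashlib
import os
from typing import Callable, Dict, Iterable


class ResponseCache:
    """Cache synthesized audio files keyed by response text."""

    def __init__(self,
                 synthesize: Callable[[str, str], str],
                 cache_dir: str):
        self._synthesize = synthesize
        self._cache_dir = cache_dir
        self._paths: Dict[str, str] = {}

    def warm(self, phrases: Iterable[str]) -> None:
        """Pre-generate audio for fixed phrases at startup."""
        for phrase in phrases:
            self.get(phrase)

    def get(self, text: str) -> str:
        """Return the audio path for text, synthesizing only once."""
        if text not in self._paths:
            # Hash the text to get a stable, filesystem-safe name.
            name = hashlib.sha1(text.encode("utf-8")).hexdigest()
            path = os.path.join(self._cache_dir, name + ".wav")
            self._paths[text] = self._synthesize(text, path)
        return self._paths[text]
```

A possible wiring would be `cache = ResponseCache(tts_engine.speak, tempfile.gettempdir())` followed by `cache.warm(["Yes?", "Okay, stopping."])`. Only fixed phrases benefit; dynamic responses such as the current time would just grow the cache, so those are better sent straight to the TTS engine.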

Wiring it all together

Now the main loop that connects all six components into a working assistant:

class VoiceAssistant:
    """Complete voice assistant pipeline."""

    def __init__(self):
        print("Loading models...")
        self.audio = AudioStream()
        self.wake_word = WakeWordDetector()
        self.asr = ASREngine(
            model_size="base")
        self.nlu = NLUEngine()
        self.actions = ActionExecutor()
        self.tts = TTSEngine()
        print("Ready! Listening...")

    def run(self):
        """Main loop."""
        try:
            while True:
                # Phase 1: wake word
                chunk = (
                    self.audio.audio_queue.get())

                if self.wake_word.check(chunk):
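                    print("Wake word detected!")
                    self.tts.speak_and_play(
                        "Yes?")
                    # Drop audio captured while
                    # "Yes?" was playing so the
                    # assistant doesn't record
                    # its own prompt.
                    q = self.audio.audio_queue
                    while not q.empty():
                        q.get_nowait()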

                    # Phase 2: capture command
                    audio_data = (
                        self.audio
                        .get_speech_segment(
                            silence_threshold=1.0,
                            max_duration=10.0))

                    if audio_data is None:
                        continue

                    # Phase 3: transcribe
                    text = self.asr.transcribe(
                        audio_data)
                    print(f"You said: {text}")

                    if not text.strip():
                        continue

                    # Phase 4: understand
                    nlu_result = (
                        self.nlu.parse(text))
                    print(
                        f"Intent: "
                        f"{nlu_result.intent}"
                        f", Slots: "
                        f"{nlu_result.slots}")

                    # Phase 5: act
                    response = (
                        self.actions
                        .execute(nlu_result))
                    print(
                        f"Response: {response}")

                    # Phase 6: speak
                    self.tts.speak_and_play(
                        response)

        except KeyboardInterrupt:
            print("\nShutting down...")
            self.audio.cleanup()


if __name__ == "__main__":
    assistant = VoiceAssistant()
    assistant.run()

The flow is dead simple once you see it laid out: listen passively for the wake word (cheap CNN, runs on every 100ms chunk), then switch to active mode (capture full utterance, run expensive Whisper inference, parse, act, respond), then go back to passive listening. This two-phase approach is the standard design for commercial voice assistants -- you can't afford to run full ASR on every second of ambient audio, so you use a tiny classifier as a gate.
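One practical caveat with the passive phase: a single 100ms chunk is likely too short to contain a whole phrase such as "hey assistant", so the detector probably needs to see a sliding window of recent audio rather than one chunk at a time. The class below is a hypothetical sketch (`RollingWindow` is not part of the episode's code), assuming a window of ten 100ms chunks, about one second:

```python
from collections import deque
from typing import Deque, List, Sequence


class RollingWindow:
    """Keep the most recent `window_chunks` audio chunks."""

    def __init__(self, window_chunks: int = 10):
        # deque(maxlen=...) discards the oldest chunk automatically.
        self._chunks: Deque[Sequence[float]] = deque(
            maxlen=window_chunks)

    def push(self, chunk: Sequence[float]) -> None:
        self._chunks.append(chunk)

    def is_full(self) -> bool:
        return len(self._chunks) == self._chunks.maxlen

    def samples(self) -> List[float]:
        """Return the window as one flat list of samples."""
        out: List[float] = []
        for chunk in self._chunks:
            out.extend(chunk)
        return out
```

In the main loop this would mean pushing each chunk from the queue, and once `is_full()` returns true, passing the concatenated window to the detector. With NumPy chunks, `np.concatenate(list(window._chunks))` would be the natural replacement for `samples()`. The detector then has to be trained on clips of that same length.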

Performance benchmarking

Where does the time go? Understanding latency is crucial for making the assistant feel responsive. Let's measure each component:

import time


class LatencyBenchmark:
    """Measure per-component latency."""

    def __init__(self,
                 assistant: VoiceAssistant):
        self.assistant = assistant

    def benchmark(self, test_audio_path: str,
                  n_runs: int = 10):
        """Benchmark over N iterations."""
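        import torchaudio
        waveform, sr = torchaudio.load(
            test_audio_path)
        # Whisper expects 16kHz mono.
        waveform = waveform.mean(dim=0)
        if sr != 16000:
            waveform = (
                torchaudio.functional
                .resample(waveform, sr,
                          16000))
        audio_np = waveform.numpy()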

        results = {
            "asr": [], "nlu": [],
            "action": [], "tts": [],
            "total": []}

        for i in range(n_runs):
            total_start = time.perf_counter()

            t = time.perf_counter()
            text = (self.assistant.asr
                    .transcribe(audio_np))
            results["asr"].append(
                time.perf_counter() - t)

            t = time.perf_counter()
            nlu_result = (
                self.assistant.nlu
                .parse(text))
            results["nlu"].append(
                time.perf_counter() - t)

            t = time.perf_counter()
            response = (
                self.assistant.actions
                .execute(nlu_result))
            results["action"].append(
                time.perf_counter() - t)

            t = time.perf_counter()
            self.assistant.tts.speak(response)
            results["tts"].append(
                time.perf_counter() - t)

            results["total"].append(
                time.perf_counter()
                - total_start)

        print(f"\nBenchmark ({n_runs} runs):")
        print(f"{'Component':<12}"
              f"{'Mean':>8}"
              f"{'Min':>8}"
              f"{'Max':>8}")
        print("-" * 38)
        for comp, times in results.items():
            mean_ms = np.mean(times) * 1000
            min_ms = np.min(times) * 1000
            max_ms = np.max(times) * 1000
            print(f"{comp:<12}"
                  f"{mean_ms:>7.0f}ms"
                  f"{min_ms:>7.0f}ms"
                  f"{max_ms:>7.0f}ms")

Typical results on a modern CPU (M-series Mac or recent x86):

Component             Latency
ASR (Whisper base)    800-1500ms
NLU (regex)           <1ms
Action execution      <1ms
TTS (VITS)            300-800ms
Total                 1.1-2.3s

The bottlenecks are crystal clear: ASR and TTS dominate. The NLU and action layers are essentially free. To speed things up:

  • Use Whisper "tiny" instead of "base" -- halves ASR time at modest accuracy cost
  • Pre-generate common TTS responses at startup
  • Run ASR on GPU if available (3-5x speedup)
  • Use streaming ASR to start processing before the user finishes speaking
  • Swap VITS for piper-tts (optimized for edge inference, 50-100ms generation)

On a GPU (even a modest one), the total pipeline drops to 300-600ms -- genuinely conversational latency.

Extending the assistant

This is a foundation, not a finished product. Some directions to take it further, each a direct application of something from this audio arc:

Noise reduction preprocessing (episode #99): run the captured audio through the Wiener filter or a trained denoiser before feeding it to Whisper. This dramatically improves transcription accuracy in noisy environments -- kitchens, cafes, rooms with open windows.

Speaker identification (episode #97): extract a speaker embedding from the command audio, compare against enrolled speakers, and personalize the response. "Hey assistant, what's on my calendar?" could return different results depending on who is speaking.
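The comparison step can be sketched independently of whichever embedding model produces the vectors. The functions below are hypothetical helpers (`cosine` and `identify_speaker` are not part of the episode's code), and the 0.7 similarity floor is an assumed value that would need tuning on real enrollment data:

```python
import math
from typing import Dict, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify_speaker(
        embedding: Sequence[float],
        enrolled: Dict[str, Sequence[float]],
        min_similarity: float = 0.7,
) -> Optional[str]:
    """Return the best-matching enrolled name, or None if no
    enrolled speaker reaches min_similarity."""
    best_name, best_score = None, min_similarity
    for name, reference in enrolled.items():
        score = cosine(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

Returning `None` for low-similarity input matters here: an unknown voice should fall back to a generic response rather than be matched to whoever happens to be closest.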

Emotion awareness (episode #98): detect the user's emotional state from prosodic features (pitch variance, speech rate, energy) and adapt tone. A stressed user asking "what time is my meeting?" might get a calmer, more reassuring TTS voice.

Local LLM backend (episode #70): replace the pattern-based NLU with an on-device language model for open-domain conversation. Instead of matching rigid patterns, send the transcription to a local Llama/Mistral model and parse structured JSON from its response.
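Language models often wrap the requested JSON in extra prose, so the parsing side should be tolerant. The function below is a hypothetical sketch (`parse_llm_intent` is not part of the episode's code). It assumes the model was prompted to reply with an object holding an `intent` key and an optional `slots` key, mirroring the fields of `NLUResult`:

```python
import json
from typing import Optional


def parse_llm_intent(reply: str) -> Optional[dict]:
    """Extract the first JSON object from a model reply.

    Returns None when no valid object with an "intent" key is found.
    """
    start = reply.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores trailing text.
        obj, _ = json.JSONDecoder().raw_decode(reply[start:])
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or "intent" not in obj:
        return None
    obj.setdefault("slots", {})
    return obj
```

When this returns `None`, falling back to the `unknown` intent keeps the rest of the pipeline unchanged, so the action executor never has to deal with malformed model output.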

Multi-language support: Whisper already handles 99 languages out of the box -- just remove the language="en" constraint and let it auto-detect. Wire the detected language through to NLU patterns and TTS model selection.

Each of these is straightforward because of the modular architecture. Swap one component without touching the others. That's the whole point of building it this way -- and it's the same principle that makes production ML systems maintainable at scale.
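To make the swapping concrete, the post-wake-word part of the pipeline can be reduced to four callables. The dataclass below is a hypothetical sketch (`Pipeline` is not part of the episode's code) in which each stage is just a function, so replacing one stage produces a new pipeline without touching the others:

```python
from dataclasses import dataclass, replace
from typing import Any, Callable


@dataclass(frozen=True)
class Pipeline:
    """The active phase as four interchangeable stages."""
    transcribe: Callable[[Any], str]   # audio -> text
    parse: Callable[[str], Any]        # text -> structured intent
    execute: Callable[[Any], str]      # intent -> response text
    speak: Callable[[str], Any]        # response text -> audio out

    def handle(self, audio: Any) -> str:
        text = self.transcribe(audio)
        response = self.execute(self.parse(text))
        self.speak(response)
        return response


def with_parser(pipeline: Pipeline,
                parse: Callable[[str], Any]) -> Pipeline:
    """Return a copy of the pipeline with a different NLU stage."""
    return replace(pipeline, parse=parse)
```

With the classes from this episode, the wiring would look like `Pipeline(asr.transcribe, nlu.parse, actions.execute, tts.speak_and_play)`. Stages can also be replaced by simple fakes in tests, which makes it possible to exercise the flow without loading Whisper or a TTS model.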

Summary

  • The voice assistant pipeline chains six components: audio capture + VAD, wake word detection, ASR, NLU, action execution, and TTS -- each from a previous episode in the audio arc;
  • wake word detection uses a tiny CNN binary classifier running continuously on 100ms audio chunks -- it gates the expensive pipeline so you're not running Whisper on every second of ambient sound;
  • Whisper handles ASR, converting the captured command to text in 1-2 seconds on CPU (faster with GPU or the "tiny" model variant);
  • pattern-based NLU extracts intents and slots with zero training data -- fast, predictable, and debuggable for a known command vocabulary;
  • the main loop alternates between passive listening (wake word) and active processing (full pipeline) -- the same two-phase architecture used by Siri, Alexa, and Google Assistant;
  • end-to-end latency is roughly 1-2 seconds on CPU, dominated by ASR and TTS -- GPU acceleration or smaller models bring this down to roughly 300-600ms for genuinely conversational interaction;
  • every component is independently replaceable: swap regex NLU for a BERT model, swap VITS for piper, add noise reduction, add speaker ID -- the modular design makes it all straightforward.

That wraps the audio AI arc. From raw waveforms and spectrograms in episode #92 all the way through to a working voice assistant that ties everything together. The next section of this series moves into a fundamentally different area of AI -- one where agents learn not from labeled data, but from interacting with an environment and receiving rewards for good behavior. It's a paradigm shift from everything we've done so far, and it opens up some of the most exciting applications in modern AI. Plenty more ground to cover ;-)

Thanks for reading!

@scipio


