Learn AI Series (#101) - Mini Project: Voice-Controlled AI Assistant

What will I learn
- You will build a complete local voice assistant from the components covered in this audio arc;
- Whisper for speech recognition, a custom NLU module for intent parsing, and a TTS model for speech output;
- custom wake word detection using a lightweight audio classifier;
- streaming audio processing with real-time microphone capture and voice activity detection;
- privacy-first design where everything runs on your machine with no cloud APIs;
- performance benchmarking to understand latency bottlenecks in real-time voice pipelines;
- a modular architecture that lets you swap any component independently.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs
- Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations
- Learn AI Series (#68) - Building AI Agents (Part 2) - Advanced Patterns
- Learn AI Series (#69) - Fine-Tuning Language Models
- Learn AI Series (#70) - Running Local Models
- Learn AI Series (#71) - Text Generation Techniques
- Learn AI Series (#72) - Tokenization Deep Dive
- Learn AI Series (#73) - LLM Evaluation
- Learn AI Series (#74) - The Hugging Face Ecosystem
- Learn AI Series (#75) - Multimodal Models - Text Meets Vision
- Learn AI Series (#76) - Mini Project - Your Own AI Assistant
- Learn AI Series (#77) - Image Processing Fundamentals
- Learn AI Series (#78) - Object Detection (Part 1) - Foundations
- Learn AI Series (#79) - Object Detection (Part 2) - Modern Approaches
- Learn AI Series (#80) - Image Segmentation
- Learn AI Series (#81) - Pose Estimation and Tracking
- Learn AI Series (#82) - Optical Character Recognition
- Learn AI Series (#83) - Video Understanding
- Learn AI Series (#84) - Generative Images - Diffusion Models (Part 1)
- Learn AI Series (#85) - Generative Images - Diffusion Models (Part 2)
- Learn AI Series (#86) - Image-to-Image and Editing
- Learn AI Series (#87) - 3D Vision
- Learn AI Series (#88) - Face Analysis
- Learn AI Series (#89) - Medical and Scientific Imaging
- Learn AI Series (#90) - Self-Supervised Learning for Vision
- Learn AI Series (#91) - Mini Project - Building a Visual AI System
- Learn AI Series (#92) - Audio Fundamentals for AI
- Learn AI Series (#93) - Speech Recognition
- Learn AI Series (#94) - Text-to-Speech (TTS)
- Learn AI Series (#95) - Audio Classification
- Learn AI Series (#96) - Music Generation
- Learn AI Series (#97) - Speaker Recognition and Diarization
- Learn AI Series (#98) - Natural Language Understanding for Voice
- Learn AI Series (#99) - Audio Enhancement
- Learn AI Series (#100) - Multimodal Audio-Visual Models
- Learn AI Series (#101) - Mini Project: Voice-Controlled AI Assistant (this post)
Solutions to Episode #100 Exercises
Exercise 1: The AV noise robustness analyzer generates 4 words with distinct frequency profiles (200/400/800, 150/350/700, 300/500/900, 180/450/750 Hz). Audio features are mean+std of mel bins (80-dim), visual features are mean+std of 8-dim lip aperture patterns (16-dim), combined AV features are 96-dim. Training a softmax classifier on 20 samples per word (80 total), then testing across 5 noise levels applied only to audio: at sigma=0.0 both audio-only and AV achieve 100%. At sigma=0.3, audio-only drops to 85% while AV holds at 95%. At sigma=0.6, audio-only falls to 60%, visual-only is steady at 70%, AV is at 85%. The crossover point (where audio-only drops below visual-only) happens around sigma=0.5 -- beyond this, the visual channel carries more reliable information than the corrupted audio. At sigma=2.0, audio-only is near chance (30%) while AV still achieves 65% thanks to the intact visual stream.
Exercise 2: The AV synchrony detector generates 5s of paired data with 10 impact events. Audio bursts are 50-sample Gaussian pulses; visual brightness spikes are aligned (sync) or shifted by 0.2s (desync). Cross-correlation features within a +/-0.3s window show: synchronized pairs have peak correlation at lag=0 (average peak position 0.0 +/- 0.01s), desynchronized pairs have peak at lag=0.2s (average 0.19 +/- 0.03s). A logistic regression binary classifier on the 15-bin correlation profile achieves 95% accuracy. The key discriminating feature is correlation at lag=0 (high for sync, low for desync) versus correlation at lag=+0.2s (low for sync, high for desync). False positives occur when random noise peaks happen to align at lag=0 despite the visual shift.
Exercise 3: The AV retrieval system defines 5 scenes with 6-dim audio and 6-dim visual features per scene. After training contrastive projections into an 8-dim shared space (200 epochs, margin=0.5, lr=0.05), audio-to-visual recall@1 reaches 82% and visual-to-audio recall@1 reaches 78%. The 5x5 confusion matrix shows strong diagonal (correct retrievals) with minor confusions between spectrally similar scenes (dog_barking vs clapping -- both broadband impulsive sounds). Guitar_playing and typing are the most distinct, achieving perfect retrieval in both directions. The shared embedding space clusters scenes by acoustic-visual similarity -- scenes with matching temporal dynamics (impulsive vs sustained) group together, demonstrating that the contrastive loss learns meaningful cross-modal correspondence.
On to today's episode
Here we go! Ten episodes of audio AI -- from the raw physics of waveforms through speech recognition, text-to-speech, classification, music generation, speaker identification, voice understanding, noise reduction, and multimodal fusion -- and now we put it all together into something you can actually use. This is the arc finale ;-)
We're building a voice-controlled AI assistant that runs entirely on your local machine. No cloud APIs, no data leaving your computer, no subscription fees. It listens for a wake word, transcribes your command, understands what you want, does something about it, and speaks a response back. Every single component is something we've already built or used in episodes #92 through #100.
This won't rival Siri or Alexa -- those systems have millions of training examples and thousands of engineers optimizing every millisecond of latency. But it will be yours. You'll understand every component, you'll control every parameter, and you'll be able to extend it in whatever direction you want. Having said that, the architecture is solid and the same pipeline structure is what the commercial assistants use under the hood. The difference is scale, not concept.
Architecture overview
The full pipeline looks like this:
Microphone -> VAD -> Wake Word -> ASR (Whisper) -> NLU -> Action -> TTS -> Speaker
Six components, each a direct application of a previous episode:
- Audio capture + VAD: continuous microphone input with voice activity detection (episode #97 covered Silero VAD)
- Wake word detector: a lightweight binary classifier that triggers on "hey assistant" (episode #95 technique -- audio classification with mel spectrograms)
- ASR: OpenAI's Whisper to transcribe the spoken command (episode #93)
- NLU: intent extraction and slot filling from text (episode #98)
- Action executor: maps structured intents to actual operations
- TTS: text-to-speech to speak the response (episode #94)
Let's build each one, then wire them together into a working system.
Component 1: Audio capture and VAD
We need a continuous audio stream that detects when someone is actually speaking versus ambient silence. Silero VAD (which we explored in episode #97) is lightweight and accurate -- it runs inference on 100ms audio chunks and returns a speech probability:
```python
import queue

import numpy as np
import pyaudio
import torch


class AudioStream:
    """Continuous microphone capture with VAD."""

    def __init__(self, sample_rate=16000,
                 chunk_duration=0.1):
        self.sr = sample_rate
        self.chunk_size = int(
            sample_rate * chunk_duration)
        self.audio_queue = queue.Queue()

        # Load Silero VAD
        self.vad_model, self.vad_utils = (
            torch.hub.load(
                'snakers4/silero-vad',
                'silero_vad'))

        # PyAudio setup
        self.pa = pyaudio.PyAudio()
        self.stream = self.pa.open(
            format=pyaudio.paFloat32,
            channels=1,
            rate=self.sr,
            input=True,
            frames_per_buffer=self.chunk_size,
            stream_callback=self._callback)

    def _callback(self, in_data,
                  frame_count, time_info,
                  status):
        # Runs on PyAudio's background thread.
        # Copy, because frombuffer returns a
        # read-only view of PyAudio's buffer.
        audio = np.frombuffer(
            in_data, dtype=np.float32).copy()
        self.audio_queue.put(audio)
        return (None, pyaudio.paContinue)

    def get_speech_segment(
            self, silence_threshold=0.8,
            max_duration=10.0):
        """Record until speech ends.
        Returns the complete utterance."""
        speech_chunks = []
        silence_counter = 0
        chunk_dur = (
            self.chunk_size / self.sr)
        max_chunks = int(
            max_duration / chunk_dur)
        silence_limit = int(
            silence_threshold / chunk_dur)
        is_speaking = False

        for _ in range(max_chunks):
            chunk = self.audio_queue.get()
            tensor = torch.from_numpy(chunk)

            # VAD inference
            speech_prob = self.vad_model(
                tensor, self.sr).item()

            if speech_prob > 0.5:
                is_speaking = True
                silence_counter = 0
                speech_chunks.append(chunk)
            elif is_speaking:
                # Keep trailing silence until
                # the limit is reached
                silence_counter += 1
                speech_chunks.append(chunk)
                if silence_counter >= silence_limit:
                    break

        if speech_chunks:
            return np.concatenate(
                speech_chunks)
        return None

    def cleanup(self):
        self.stream.stop_stream()
        self.stream.close()
        self.pa.terminate()
```
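The callback architecture is important here. PyAudio fills a queue from a background thread, and our main loop consumes it. The VAD runs on each 100ms chunk (1600 samples at 16kHz) and we accumulate chunks while speech is active. Once we detect silence_threshold seconds of consecutive silence after speech started, we know the utterance is complete and return the full audio segment. One caveat: recent Silero VAD releases only accept fixed 512-sample windows at 16kHz, so with those you either pin an older release or split each chunk into 512-sample windows before calling the model.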
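To make the end-of-utterance logic easier to reason about, here is a minimal sketch of the same silence-counter idea running on a plain list of speech probabilities instead of live audio. The function name `find_utterance` and the example probabilities are made up for illustration; the 0.5 cutoff and the 8-chunk silence limit (0.8s at 100ms per chunk) mirror the defaults above.

```python
def find_utterance(speech_probs, silence_limit=8, speech_cutoff=0.5):
    """Return (start, end) chunk indices of the first utterance.

    end is exclusive and includes the trailing silent chunks, the same
    way get_speech_segment keeps them. Returns None if no chunk ever
    crosses the speech cutoff.
    """
    start = None
    silence = 0
    for i, prob in enumerate(speech_probs):
        if prob > speech_cutoff:
            if start is None:
                start = i          # first speech chunk
            silence = 0            # any speech resets the counter
        elif start is not None:
            silence += 1
            if silence >= silence_limit:
                return (start, i + 1)
    if start is not None:
        # Input ended before enough silence accumulated
        return (start, len(speech_probs))
    return None


# Hypothetical VAD output: speech, a short dip, more speech, then silence
probs = [0.1, 0.1, 0.9, 0.8, 0.2, 0.9] + [0.1] * 8 + [0.9]
segment = find_utterance(probs)
```

Note that the short dip at index 4 does not end the utterance, because a single silent chunk is far below the limit. That is what keeps natural pauses between words from cutting a command in half.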
Component 2: Wake word detection
You don't want the assistant processing every sound in the room. That would be a privacy nightmare and a waste of compute. A wake word detector is a tiny binary classifier that runs continuously on the audio stream, only triggering the full pipeline when it hears the activation phrase.
The approach: train a small CNN on mel spectrograms of the wake phrase versus everything else (the exact technique from episode #95):
```python
import os

import numpy as np
import torch
import torch.nn as nn
import torchaudio


class WakeWordDetector:
    """Lightweight wake word detection
    using a small CNN on mel spectrograms."""

    def __init__(self,
                 model_path="wake_word.pt",
                 threshold=0.85):
        self.threshold = threshold
        self.sr = 16000

        # Mel spectrogram transform
        self.mel_transform = (
            torchaudio.transforms
            .MelSpectrogram(
                sample_rate=self.sr,
                n_mels=40, n_fft=400,
                hop_length=160))

        # Small CNN classifier
        self.model = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid())

        # Without a trained checkpoint the weights
        # stay random and the output is meaningless
        if os.path.exists(model_path):
            self.model.load_state_dict(
                torch.load(model_path))
        self.model.eval()

    def check(self,
              audio_chunk: np.ndarray) -> bool:
        """Check if audio contains
        the wake word."""
        # (samples,) -> (1, samples)
        tensor = torch.from_numpy(
            audio_chunk).unsqueeze(0)
        # (1, n_mels, frames) -> (1, 1, n_mels, frames)
        mel = self.mel_transform(
            tensor).unsqueeze(0)
        with torch.no_grad():
            prob = self.model(mel).item()
        return prob > self.threshold
```
Training this detector requires positive examples (you saying the wake word 50-100 times in different conditions -- quiet room, with music, whispering, shouting) and negative examples (ambient noise, other speech, music -- thousands of clips from AudioSet or your own recordings). For a quick prototype, 100 positive recordings and 1000 random negatives is enough to get started. A threshold of 0.85 is a reasonable starting point for balancing false activations against missed wake words -- adjust up for fewer false triggers, down for fewer misses.
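One way to pick the threshold with data instead of by feel is to score a held-out set of positive and negative clips and look at both error rates at a few candidate values. The sketch below assumes you already have the detector's output probabilities for each clip; the function name `error_rates` and the scores are invented for illustration, not measured results.

```python
def error_rates(pos_scores, neg_scores, threshold):
    """Return (miss_rate, false_alarm_rate) at a given threshold.

    A clip triggers when score > threshold, matching the comparison
    in WakeWordDetector.check.
    """
    missed = sum(1 for s in pos_scores if s <= threshold)
    false_alarms = sum(1 for s in neg_scores if s > threshold)
    return missed / len(pos_scores), false_alarms / len(neg_scores)


# Hypothetical detector scores on held-out clips
pos = [0.95, 0.90, 0.80, 0.99]        # clips containing the wake word
neg = [0.10, 0.40, 0.88, 0.05, 0.20]  # everything else

for threshold in (0.50, 0.85, 0.90):
    miss, fa = error_rates(pos, neg, threshold)
    print(f"threshold={threshold:.2f}  miss={miss:.2f}  false_alarm={fa:.2f}")
```

Raising the threshold trades false alarms for misses, so the right value depends on which error annoys you more in your own room.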
Component 3: ASR with Whisper
Once the wake word fires, we capture the full command and transcribe it. Whisper (episode #93) handles this beautifully:
```python
import numpy as np
import whisper


class ASREngine:
    """Whisper-based speech-to-text."""

    def __init__(self, model_size="base"):
        self.model = whisper.load_model(
            model_size)

    def transcribe(self,
                   audio: np.ndarray) -> str:
        """Transcribe audio numpy array."""
        audio = audio.astype(np.float32)

        # Pad or trim to 30s
        audio = whisper.pad_or_trim(audio)

        # Compute log-mel spectrogram. Keep it 2-D
        # (n_mels, frames): whisper.decode returns a
        # single result for 2-D input but a list of
        # results for a batched 3-D input.
        mel = (whisper
               .log_mel_spectrogram(audio)
               .to(self.model.device))

        # Decode
        options = whisper.DecodingOptions(
            language="en", fp16=False)
        result = whisper.decode(
            self.model, mel, options)
        return result.text.strip()
```
Whisper "base" (74M parameters) gives a solid balance between speed and accuracy for short voice commands. On a modern CPU it transcribes a 5-second clip in about 1-2 seconds. If latency is critical and you're willing to sacrifice some accuracy, use "tiny" (39M params, roughly 0.5s on CPU). If you have a GPU available, even "medium" (769M) runs in under a second.
The pad_or_trim call is Whisper-specific -- it expects exactly 30 seconds of audio input (480,000 samples at 16kHz). Short commands get zero-padded, which works fine because Whisper knows to stop transcribing when it encounters silence.
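The arithmetic behind that 480,000 figure, and what padding does to a short command, can be sketched without Whisper at all. This is only an illustration of the idea using plain Python lists; `pad_or_trim_list` is a made-up stand-in, not Whisper's own implementation, which works on arrays and tensors.

```python
SAMPLE_RATE = 16000                        # samples per second
CHUNK_SECONDS = 30                         # fixed input window
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS    # 480,000 samples


def pad_or_trim_list(samples, length=N_SAMPLES):
    """Zero-pad short audio or cut long audio to a fixed length."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


# A 5-second command is 80,000 samples; the rest becomes silence
command = [0.1] * (5 * SAMPLE_RATE)
padded = pad_or_trim_list(command)
n_padding = len(padded) - len(command)
```

For a 5-second command, five sixths of the input is padding. That is one reason short commands cost roughly as much encoder time as long ones.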
Component 4: NLU -- intent and slot extraction
From episode #98, we know Natural Language Understanding turns raw text into structured data: what does the user want (intent) and what are the specific details (slots). For a controlled vocabulary assistant, pattern matching is fast, predictable, and needs zero training data:
```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class NLUResult:
    intent: str
    slots: dict
    confidence: float


class NLUEngine:
    """Pattern-based NLU for voice commands."""

    def __init__(self):
        self.intent_patterns = {
            "set_timer": [
                r"(?:set|start)\s+(?:a\s+)?"
                r"timer\s+(?:for\s+)?(.+)",
                r"timer\s+(.+)",
            ],
            "get_time": [
                r"what\s+time\s+is\s+it",
                r"what's\s+the\s+time",
                r"current\s+time",
            ],
            "get_weather": [
                r"(?:what's|how's)\s+the\s+"
                r"weather(?:\s+in\s+(.+))?",
                r"weather\s+(?:in|for)\s+(.+)",
            ],
            "play_music": [
                r"play\s+(?:some\s+)?"
                r"(.+?)(?:\s+music)?$",
                r"put\s+on\s+(?:some\s+)?(.+)",
            ],
            "set_reminder": [
                r"remind\s+me\s+to\s+(.+?)"
                r"(?:\s+(?:in|at)\s+(.+))?$",
                r"set\s+(?:a\s+)?reminder"
                r"\s+(?:to\s+)?(.+)",
            ],
            "stop": [
                r"^(?:stop|cancel|nevermind"
                r"|shut\s+up)$",
            ],
        }

    def parse(self, text: str) -> NLUResult:
        """Extract intent and slots."""
        # Whisper transcripts usually end with
        # punctuation ("Stop."), which would break
        # the $-anchored patterns, so strip it.
        text_lower = (text.lower().strip()
                      .rstrip(".!?"))
        for intent, patterns in (
                self.intent_patterns.items()):
            for pattern in patterns:
                # re.match anchors at the start
                match = re.match(
                    pattern, text_lower)
                if match:
                    slots = self._extract_slots(
                        intent, match)
                    return NLUResult(
                        intent=intent,
                        slots=slots,
                        confidence=0.9)
        return NLUResult(
            intent="unknown",
            slots={"raw_text": text},
            confidence=0.3)

    def _extract_slots(self, intent: str,
                       match: re.Match) -> dict:
        """Pull slot values from groups."""
        groups = [g for g in match.groups()
                  if g is not None]
        slot_mappings = {
            "set_timer": ["duration"],
            "get_weather": ["location"],
            "play_music": ["query"],
            "set_reminder": ["task", "time"],
        }
        slots = {}
        for i, name in enumerate(
                slot_mappings.get(intent, [])):
            if i < len(groups):
                slots[name] = groups[i]
        return slots
```
Pattern matching isn't fancy, but for 10-20 controlled intents it's fast, predictable, and debuggable. When someone says something outside the patterns, it returns unknown with low confidence -- and you can add new patterns as needs emerge. For a more capable (but heavier) system, swap this out for the BERT-based joint intent+slot model from episode #98.
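To show what "adding new patterns as needs emerge" can look like, here is a small self-contained sketch for a volume command. The `set_volume` intent, the `level` slot, and the helper `match_volume` are hypothetical and not part of the engine above; the sketch also strips trailing punctuation first, since transcripts typically end in a period or question mark.

```python
import re

# Hypothetical patterns for a new "set_volume" intent
VOLUME_PATTERNS = [
    r"(?:set|turn)\s+(?:the\s+)?volume\s+to\s+(\d+)",
    r"volume\s+(\d+)",
]


def match_volume(text):
    """Return an intent dict for volume commands, else None."""
    cleaned = text.lower().strip().rstrip(".!?")
    for pattern in VOLUME_PATTERNS:
        match = re.match(pattern, cleaned)   # anchored at the start
        if match:
            return {"intent": "set_volume",
                    "slots": {"level": int(match.group(1))}}
    return None


result = match_volume("Set the volume to 40.")
```

In the real engine the same two regexes would simply become a new key in the pattern dictionary, plus a slot mapping entry so the captured group gets a name.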
Component 5: Action execution
The action layer maps NLU output to actual operations. This is where the assistant does things:
```python
import datetime
import re
import threading
from typing import Optional


class ActionExecutor:
    """Execute actions based on NLU results."""

    def __init__(self):
        self.timers = {}

    def execute(self,
                nlu_result: NLUResult) -> str:
        """Run the intent handler."""
        handlers = {
            "set_timer": self._timer,
            "get_time": self._time,
            "get_weather": self._weather,
            "play_music": self._music,
            "set_reminder": self._reminder,
            "stop": self._stop,
            "unknown": self._unknown,
        }
        # Unrecognized intents fall back to _unknown
        handler = handlers.get(
            nlu_result.intent, self._unknown)
        return handler(nlu_result.slots)

    def _timer(self, slots: dict) -> str:
        duration_text = slots.get(
            "duration", "")
        seconds = self._parse_duration(
            duration_text)
        if seconds:
            timer_id = f"t_{len(self.timers)}"
            t = threading.Timer(
                seconds, self._timer_done,
                args=[timer_id])
            # Daemon thread: a pending timer must
            # not keep the process alive on exit
            t.daemon = True
            t.start()
            self.timers[timer_id] = t
            return (f"Timer set for "
                    f"{duration_text}.")
        return ("Sorry, I didn't catch "
                "the duration.")

    def _time(self, slots: dict) -> str:
        now = datetime.datetime.now()
        return f"It's {now.strftime('%I:%M %p')}."

    def _weather(self, slots: dict) -> str:
        loc = slots.get(
            "location", "your area")
        return (f"I'd check the weather for "
                f"{loc}, but I'm running "
                f"offline. Add a weather API "
                f"key to enable this.")

    def _music(self, slots: dict) -> str:
        query = slots.get("query", "music")
        return (f"Playing {query}. "
                f"Hook up your music player "
                f"integration here.")

    def _reminder(self, slots: dict) -> str:
        task = slots.get(
            "task", "something")
        when = slots.get("time", "later")
        return (f"I'll remind you to "
                f"{task} {when}.")

    def _stop(self, slots: dict) -> str:
        return "Okay, stopping."

    def _unknown(self, slots: dict) -> str:
        return ("I didn't understand that. "
                "Could you rephrase?")

    def _parse_duration(
            self, text: str) -> Optional[int]:
        """Parse '5 minutes' to seconds."""
        patterns = [
            (r"(\d+)\s*(?:second|sec)", 1),
            (r"(\d+)\s*(?:minute|min)", 60),
            (r"(\d+)\s*(?:hour|hr)", 3600),
        ]
        # Sum all units found, so
        # "1 hour 30 minutes" gives 5400
        total = 0
        for pattern, mult in patterns:
            match = re.search(pattern, text)
            if match:
                total += (int(match.group(1))
                          * mult)
        return total if total > 0 else None

    def _timer_done(self, timer_id: str):
        print(f"\nTimer {timer_id} finished!")
```
Notice how thin this layer is. The NLU already did the hard work of extracting structured information -- the action executor just maps it to function calls. This separation of concerns is what makes the architecture extensible: adding a new capability means adding patterns to the NLU and a handler to the executor. The rest of the pipeline stays untouched.
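The "one pattern plus one handler" claim is easy to see in a stripped-down dispatch table. The sketch below is standalone rather than a patch to the executor above, and the `get_date` intent and its handler are hypothetical; the optional `today` argument exists only so the output can be checked with a fixed date.

```python
import datetime


def handle_get_date(slots, today=None):
    """Hypothetical handler for a new 'get_date' intent."""
    today = today or datetime.date.today()
    return f"Today is {today.strftime('%A, %B %d')}."


def handle_unknown(slots):
    return "I didn't understand that. Could you rephrase?"


# Adding a capability is one new entry in this table
HANDLERS = {
    "get_date": handle_get_date,
}


def execute(intent, slots):
    """Look up the handler, falling back to the unknown handler."""
    handler = HANDLERS.get(intent, handle_unknown)
    return handler(slots)


fixed = handle_get_date({}, today=datetime.date(2024, 1, 15))
fallback = execute("launch_rocket", {})
```

Because handlers only see a slots dictionary and return a string, none of them needs to know whether the text came from regex matching, a BERT model, or a typed command.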
Component 6: TTS response
For speaking the response back, we use a pretrained TTS model (the same family of models we explored in episode #94):
```python
import os
import subprocess
import sys
import tempfile

from TTS.api import TTS

# Portable temp location ("/tmp" does not
# exist on Windows)
DEFAULT_WAV = os.path.join(
    tempfile.gettempdir(), "response.wav")


class TTSEngine:
    """Text-to-speech output."""

    def __init__(self):
        self.tts = TTS(
            model_name=(
                "tts_models/en/"
                "ljspeech/vits"))

    def speak(self, text: str,
              output_path: str = DEFAULT_WAV):
        """Generate speech audio."""
        self.tts.tts_to_file(
            text=text,
            file_path=output_path)
        return output_path

    def speak_and_play(self, text: str):
        """Generate and play response."""
        path = self.speak(text)
        if sys.platform == "darwin":
            subprocess.run(["afplay", path])
        elif sys.platform == "linux":
            subprocess.run(["aplay", path])
        elif sys.platform == "win32":
            # Standard-library WAV playback
            import winsound
            winsound.PlaySound(
                path, winsound.SND_FILENAME)
```
The VITS model ("Variational Inference with adversarial learning for end-to-end Text-to-Speech") produces natural-sounding speech and runs reasonably fast on CPU. For even lower latency you could use a smaller model like piper (optimized for on-device inference), or pre-generate common responses ("yes?", "okay", "I didn't understand that") at startup and play them from cache instead of generating fresh audio every time.
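The pre-generation idea can be sketched as a thin wrapper around whatever synthesis function you use. Everything here is an assumption for illustration: `CachedSpeaker` is a made-up class, and `synthesize` stands for any callable that turns text into an audio handle (a file path, bytes, an array). A fake synthesizer is used so the sketch runs without a TTS model.

```python
class CachedSpeaker:
    """Serve pre-generated audio for a fixed set of common phrases."""

    def __init__(self, synthesize, common_phrases):
        # synthesize: hypothetical callable, text -> audio handle
        self._synthesize = synthesize
        # Pay the synthesis cost once, at startup
        self._cache = {
            phrase: synthesize(phrase)
            for phrase in common_phrases}

    def audio_for(self, text):
        """Return cached audio if available, else synthesize now."""
        cached = self._cache.get(text)
        if cached is not None:
            return cached
        # Dynamic responses (times, timers) are generated fresh
        # and not stored, so the cache cannot grow without bound
        return self._synthesize(text)


calls = []


def fake_synthesize(text):
    """Stand-in for a real TTS call; records what was synthesized."""
    calls.append(text)
    return f"wav:{text}"


speaker = CachedSpeaker(fake_synthesize, ["Yes?", "Okay, stopping."])
first = speaker.audio_for("Yes?")      # served from cache
second = speaker.audio_for("It's 3 PM.")  # synthesized on demand
```

The acknowledgement after the wake word benefits most from this, since it sits directly on the path between the user finishing "hey assistant" and starting the actual command.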
Wiring it all together
Now the main loop that connects all six components into a working assistant:
```python
class VoiceAssistant:
    """Complete voice assistant pipeline."""

    def __init__(self):
        print("Loading models...")
        self.audio = AudioStream()
        self.wake_word = WakeWordDetector()
        self.asr = ASREngine(
            model_size="base")
        self.nlu = NLUEngine()
        self.actions = ActionExecutor()
        self.tts = TTSEngine()
        print("Ready! Listening...")

    def _flush_audio(self):
        """Drop audio queued while the assistant
        was busy or speaking, so it does not hear
        itself or process stale chunks."""
        q = self.audio.audio_queue
        while not q.empty():
            q.get_nowait()

    def run(self):
        """Main loop."""
        try:
            while True:
                # Phase 1: wake word
                chunk = (
                    self.audio.audio_queue.get())
                if self.wake_word.check(chunk):
                    print("Wake word detected!")
                    self.tts.speak_and_play(
                        "Yes?")
                    self._flush_audio()

                    # Phase 2: capture command
                    audio_data = (
                        self.audio
                        .get_speech_segment(
                            silence_threshold=1.0,
                            max_duration=10.0))
                    if audio_data is None:
                        continue

                    # Phase 3: transcribe
                    text = self.asr.transcribe(
                        audio_data)
                    print(f"You said: {text}")
                    if not text.strip():
                        continue

                    # Phase 4: understand
                    nlu_result = (
                        self.nlu.parse(text))
                    print(
                        f"Intent: "
                        f"{nlu_result.intent}"
                        f", Slots: "
                        f"{nlu_result.slots}")

                    # Phase 5: act
                    response = (
                        self.actions
                        .execute(nlu_result))
                    print(
                        f"Response: {response}")

                    # Phase 6: speak
                    self.tts.speak_and_play(
                        response)
                    self._flush_audio()
        except KeyboardInterrupt:
            print("\nShutting down...")
            self.audio.cleanup()


if __name__ == "__main__":
    assistant = VoiceAssistant()
    assistant.run()
```
The flow is dead simple once you see it laid out: listen passively for the wake word (cheap CNN, runs on every 100ms chunk), then switch to active mode (capture full utterance, run expensive Whisper inference, parse, act, respond), then go back to passive listening. This two-phase approach is the same gating idea commercial voice assistants use -- you can't afford to run full ASR on every second of ambient audio, so you use a tiny classifier as a gate.
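One practical wrinkle with the passive phase: a phrase like "hey assistant" lasts much longer than a single 100ms chunk, so in practice the detector likely needs to see a sliding window of recent audio instead of one chunk at a time. The sketch below shows that buffering under stated assumptions: `RollingWindow` is a made-up helper, the 10-chunk (about 1 second) window is a guess, and plain lists stand in for NumPy arrays (with arrays you would concatenate instead of extending a list).

```python
from collections import deque


class RollingWindow:
    """Keep the most recent N chunks of audio for the wake word model."""

    def __init__(self, window_chunks=10):
        # deque with maxlen silently drops the oldest chunk when full
        self._chunks = deque(maxlen=window_chunks)

    def push(self, chunk):
        """Add a chunk; return True once the window is full."""
        self._chunks.append(chunk)
        return len(self._chunks) == self._chunks.maxlen

    def samples(self):
        """Return the buffered chunks as one flat sequence."""
        flat = []
        for chunk in self._chunks:
            flat.extend(chunk)
        return flat


# Toy example: 3-sample "chunks" and a 4-chunk window
window = RollingWindow(window_chunks=4)
ready_flags = [window.push([i] * 3) for i in range(12)]
latest = window.samples()
```

In the main loop you would push every incoming chunk and run the detector on the flattened window whenever `push` returns True, which keeps the per-chunk cost small while giving the model enough context to hear the whole phrase.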
Performance benchmarking
Where does the time go? Understanding latency is crucial for making the assistant feel responsive. Let's measure each component:
```python
import time

import numpy as np
import torchaudio


class LatencyBenchmark:
    """Measure per-component latency."""

    def __init__(self,
                 assistant: VoiceAssistant):
        self.assistant = assistant

    def benchmark(self, test_audio_path: str,
                  n_runs: int = 10):
        """Benchmark over N iterations."""
        waveform, sr = torchaudio.load(
            test_audio_path)
        # Whisper expects 16 kHz mono audio
        if sr != 16000:
            waveform = (
                torchaudio.functional.resample(
                    waveform, sr, 16000))
        audio_np = waveform.mean(dim=0).numpy()

        results = {
            "asr": [], "nlu": [],
            "action": [], "tts": [],
            "total": []}

        for i in range(n_runs):
            total_start = time.perf_counter()

            t = time.perf_counter()
            text = (self.assistant.asr
                    .transcribe(audio_np))
            results["asr"].append(
                time.perf_counter() - t)

            t = time.perf_counter()
            nlu_result = (
                self.assistant.nlu
                .parse(text))
            results["nlu"].append(
                time.perf_counter() - t)

            t = time.perf_counter()
            response = (
                self.assistant.actions
                .execute(nlu_result))
            results["action"].append(
                time.perf_counter() - t)

            # Synthesis only; playback is not timed
            t = time.perf_counter()
            self.assistant.tts.speak(response)
            results["tts"].append(
                time.perf_counter() - t)

            results["total"].append(
                time.perf_counter()
                - total_start)

        print(f"\nBenchmark ({n_runs} runs):")
        # Columns are 9 wide to match "NNNNNNNms"
        print(f"{'Component':<12}"
              f"{'Mean':>9}"
              f"{'Min':>9}"
              f"{'Max':>9}")
        print("-" * 39)
        for comp, times in results.items():
            mean_ms = np.mean(times) * 1000
            min_ms = np.min(times) * 1000
            max_ms = np.max(times) * 1000
            print(f"{comp:<12}"
                  f"{mean_ms:>7.0f}ms"
                  f"{min_ms:>7.0f}ms"
                  f"{max_ms:>7.0f}ms")
```
Typical results on a modern CPU (M-series Mac or recent x86):
| Component | Latency |
|---|---|
| ASR (Whisper base) | 800-1500ms |
| NLU (regex) | <1ms |
| Action execution | <1ms |
| TTS (VITS) | 300-800ms |
| Total | 1.1-2.3s |
The bottlenecks are crystal clear: ASR and TTS dominate. The NLU and action layers are essentially free. To speed things up:
- Use Whisper "tiny" instead of "base" -- halves ASR time at modest accuracy cost
- Pre-generate common TTS responses at startup
- Run ASR on GPU if available (3-5x speedup)
- Use streaming ASR to start processing before the user finishes speaking
- Swap VITS for piper-tts (optimized for edge inference, 50-100ms generation)
On a GPU (even a modest one), the total pipeline drops to 300-600ms -- genuinely conversational latency.
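The latency table is really just a budget you can add up and play with. The sketch below sums the per-component ranges and shows how the total moves when one component is swapped. The numbers are the ranges quoted above (with "<1ms" treated as 0-1ms), and the helper names are made up for illustration.

```python
# Per-component latency ranges in milliseconds: (best, worst)
BUDGET_MS = {
    "asr": (800, 1500),   # Whisper base on CPU
    "nlu": (0, 1),        # regex matching
    "action": (0, 1),
    "tts": (300, 800),    # VITS
}


def total_range(budget):
    """Sum best-case and worst-case latency over all components."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst


def worst_case_share(budget, names):
    """Fraction of worst-case latency spent in the named components."""
    _, worst = total_range(budget)
    return sum(budget[name][1] for name in names) / worst


cpu_total = total_range(BUDGET_MS)
heavy_share = worst_case_share(BUDGET_MS, ["asr", "tts"])

# What if VITS is swapped for a 50-100ms TTS engine?
faster_tts = dict(BUDGET_MS, tts=(50, 100))
swapped_total = total_range(faster_tts)
```

With these figures ASR and TTS account for over 99% of the worst case, and swapping only the TTS engine already pulls the worst case down from about 2.3s to about 1.6s. After that, ASR is the only component left worth optimizing.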
Extending the assistant
This is a foundation, not a finished product. Some directions to take it further, each a direct application of something from this audio arc:
Noise reduction preprocessing (episode #99): run the captured audio through the Wiener filter or a trained denoiser before feeding it to Whisper. This dramatically improves transcription accuracy in noisy environments -- kitchens, cafes, rooms with open windows.
Speaker identification (episode #97): extract a speaker embedding from the command audio, compare against enrolled speakers, and personalize the response. "Hey assistant, what's on my calendar?" could return different results depending on who is speaking.
Emotion awareness (episode #98): detect the user's emotional state from prosodic features (pitch variance, speech rate, energy) and adapt tone. A stressed user asking "what time is my meeting?" might get a calmer, more reassuring TTS voice.
Local LLM backend (episode #70): replace the pattern-based NLU with an on-device language model for open-domain conversation. Instead of matching rigid patterns, send the transcription to a local Llama/Mistral model and parse structured JSON from its response.
Multi-language support: Whisper already handles 99 languages out of the box -- just remove the language="en" constraint and let it auto-detect. Wire the detected language through to NLU patterns and TTS model selection.
Each of these is straightforward because of the modular architecture. Swap one component without touching the others. That's the whole point of building it this way -- and it's the same principle that makes production ML systems maintainable at scale.
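As a concrete taste of the local LLM direction, the fragile part is usually not the model call but turning its free-form reply into the same intent/slots structure the rest of the pipeline expects. The sketch below handles only that parsing step, under stated assumptions: the reply format (a JSON object with "intent" and "slots" keys, possibly surrounded by chatter) is hypothetical, and `parse_llm_reply` is a made-up helper rather than any library's API.

```python
import json


def parse_llm_reply(reply, known_intents):
    """Extract {"intent", "slots"} from an LLM reply, or fall back."""
    fallback = {"intent": "unknown", "slots": {"raw_text": reply}}

    # Models often wrap JSON in extra prose, so cut out the
    # outermost {...} span before parsing
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        return fallback
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return fallback

    # Only accept intents the action layer can actually handle
    intent = data.get("intent")
    if intent not in known_intents:
        return fallback
    slots = data.get("slots")
    if not isinstance(slots, dict):
        slots = {}
    return {"intent": intent, "slots": slots}


KNOWN = {"set_timer", "get_time", "play_music"}
reply = 'Sure! {"intent": "set_timer", "slots": {"duration": "5 minutes"}}'
parsed = parse_llm_reply(reply, KNOWN)
```

Validating against a known intent list keeps the action layer unchanged: anything the model invents simply lands in the existing unknown handler.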
To summarize
- The voice assistant pipeline chains six components: audio capture + VAD, wake word detection, ASR, NLU, action execution, and TTS -- each from a previous episode in the audio arc;
- wake word detection uses a tiny CNN binary classifier running continuously on 100ms audio chunks -- it gates the expensive pipeline so you're not running Whisper on every second of ambient sound;
- Whisper handles ASR, converting the captured command to text in 1-2 seconds on CPU (faster with GPU or the "tiny" model variant);
- pattern-based NLU extracts intents and slots with zero training data -- fast, predictable, and debuggable for a known command vocabulary;
- the main loop alternates between passive listening (wake word) and active processing (full pipeline) -- the same two-phase architecture used by Siri, Alexa, and Google Assistant;
- end-to-end latency is roughly 1-2.3 seconds on CPU, dominated by ASR and TTS -- GPU acceleration or smaller models bring this down to around 300-600ms for genuinely conversational interaction;
- every component is independently replaceable: swap regex NLU for a BERT model, swap VITS for piper, add noise reduction, add speaker ID -- the modular design makes it all straightforward.
That wraps the audio AI arc. From raw waveforms and spectrograms in episode #92 all the way through to a working voice assistant that ties everything together. The next section of this series moves into a fundamentally different area of AI -- one where agents learn not from labeled data, but from interacting with an environment and receiving rewards for good behavior. It's a paradigm shift from everything we've done so far, and it opens up some of the most exciting applications in modern AI. Plenty more ground to cover ;-)