Learn AI Series (#98) - Natural Language Understanding for Voice

@scipio 70

1 day ago

StemSocial

What will I learn

You will learn intent classification: determining what a user wants from their spoken words;
slot filling: extracting specific parameters (entities) from natural language commands using BIO sequence labeling;
joint intent + slot models: a single shared encoder that handles both tasks simultaneously;
dialogue state tracking: maintaining conversation context across multiple turns;
speech emotion recognition: detecting feelings from HOW something is said, not just the words;
building a from-scratch voice command NLU pipeline that parses raw utterances into structured actions;
evaluation metrics for NLU systems: intent accuracy, slot F1, and sentence-level accuracy.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution;
The ambition to learn AI and machine learning.

Difficulty

Beginner

Curriculum (of the `Learn AI Series`):

Learn AI Series (#98) - Natural Language Understanding for Voice

Solutions to Episode #97 Exercises

Exercise 1: Speaker enrollment and verification simulator -- generate 6 speakers (f0 ranges from 80-250Hz, distinct formant patterns), 5 enrollment + 3 test utterances each, compute spectral embeddings via Mel spectrogram mean+std, average enrollment embeddings, then run all 108 verification trials (18 test utterances x 6 enrolled speakers). Sweep thresholds 0.0-1.0 in steps of 0.001 to find EER. Result: EER around 8-12% with these simple spectral features. Same-speaker scores cluster around 0.75-0.90 and different-speaker scores around 0.20-0.50; the bulk of the two distributions is well separated, but their tails overlap, which is where the non-zero EER comes from. The EER threshold lands roughly at 0.60-0.65 depending on the random seed.
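The threshold sweep itself is only a few lines. Below is a minimal sketch, assuming you already have two plain lists of similarity scores (same-speaker and different-speaker trials). The function name `find_eer` and the toy scores are made up for illustration and are not the actual exercise output:

```python
def find_eer(genuine, impostor, step=0.001):
    """Sweep thresholds over [0, 1]; return (eer, threshold)
    at the point where FAR and FRR are closest."""
    best_gap, best_eer, best_thr = float("inf"), None, None
    n_steps = int(round(1.0 / step))
    for k in range(n_steps + 1):
        thr = k * step
        # False accepts: impostor scores at or above threshold
        far = sum(s >= thr for s in impostor) / len(impostor)
        # False rejects: genuine scores below threshold
        frr = sum(s < thr for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2
            best_thr = thr
    return best_eer, best_thr


# Toy scores, made up for illustration (not the exercise output)
genuine = [0.90, 0.80, 0.85, 0.55]
impostor = [0.30, 0.40, 0.60, 0.20]
eer, thr = find_eer(genuine, impostor)
print(f"EER={eer:.2f} at threshold {thr:.3f}")
```

With one genuine score sitting below one impostor score, no threshold separates the two sets perfectly, so the sweep settles where both error rates are equal.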

Exercise 2: Speaker change detector -- generate 15s of 4-segment audio (A, B, A, C), use sliding 1.5s windows at 0.25s hop, extract spectral features per window, compute cosine distance between consecutive windows, detect peaks above mean + 1.5*std. The three true boundaries (at 3.5s, 7.0s, 10.5s) should each produce a distance spike. A detected peak within +/-0.5s of a true boundary counts as correct. With clean synthetic audio you get precision near 1.0 and recall of 0.67-1.0 depending on how distinct the speaker profiles are.
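The scoring rule (a peak within +/-0.5s of a true boundary counts as correct) can be sketched like this. The helper name `match_boundaries` and the greedy one-to-one matching are assumptions on my part; the detected times in the demo are invented:

```python
def match_boundaries(detected, true_bounds, tol=0.5):
    """Greedy one-to-one matching of detected peaks to true
    boundaries; returns (precision, recall)."""
    unmatched = list(true_bounds)
    hits = 0
    for d in detected:
        for tb in unmatched:
            if abs(d - tb) <= tol:
                # Each true boundary can be claimed only once
                unmatched.remove(tb)
                hits += 1
                break
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(true_bounds) if true_bounds else 0.0
    return precision, recall


# Invented detections: two of the three boundaries are found
p, r = match_boundaries([3.4, 7.25], [3.5, 7.0, 10.5])
print(f"precision={p:.2f} recall={r:.2f}")
```

Missing one of three boundaries while raising no false alarms gives exactly the "precision near 1.0, recall 0.67" situation described above.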

Exercise 3: Multi-speaker overlap detector -- generate 12s of audio with two overlap regions (3-5s and 8-10s). Analyze energy in 4 frequency bands per frame, compute overlap score based on active bands and total energy. Overlap regions show higher total energy and more active bands than solo-speaker regions. Precision around 0.7-0.85, recall around 0.65-0.80, F1 around 0.70-0.80. The average overlap score in true overlap regions is significantly higher (0.5-0.7) than in solo regions (0.1-0.3).

On to today's episode

Here we go! You say "set a timer for five minutes" to your phone and it sets a timer. You say "play something chill" and music starts playing. Somewhere between your voice hitting the microphone and the device taking action, something figured out what you actually wanted. That something is Natural Language Understanding (NLU) -- and for voice interfaces specifically, it's the brain that sits between speech recognition and action execution.

In episode #93 we turned audio waveforms into text (ASR). In episode #94 we turned text back into audio (TTS). In #97 we figured out who is speaking. But raw transcribed text alone isn't enough to act on. The sentence "turn off the lights in the bedroom" is just a string of characters. NLU transforms it into structured information: the intent is control_device, the device is lights, the location is bedroom, and the action is off. That's what the system needs to actually do something useful ;-)
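To make "structured information" concrete, here is a minimal sketch of what the NLU output for that sentence could look like. The `NLUResult` container is a hypothetical name used only for illustration, and the field values are written by hand (a real model would predict them):

```python
from dataclasses import dataclass, field


@dataclass
class NLUResult:
    """Structured meaning of one utterance (illustrative)."""
    text: str
    intent: str
    slots: dict = field(default_factory=dict)


# Hand-written target output for the example sentence
result = NLUResult(
    text="turn off the lights in the bedroom",
    intent="control_device",
    slots={
        "device": "lights",
        "location": "bedroom",
        "action": "off",
    },
)
print(result.intent, result.slots)
```

Everything downstream (the dialogue manager, the action executor) can work with this object and never has to look at the raw string again.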

The voice assistant pipeline

Every voice assistant -- whether it's Siri, Alexa, Google Assistant, or some custom system you build yourself -- follows roughly the same flow:

Microphone -> ASR -> NLU -> Dialogue Manager -> Action/Response -> TTS -> Speaker

ASR (episode #93) gives us raw text. NLU gives us structured meaning. The dialogue manager decides what to do with that meaning (execute an action, ask a clarifying question, update context). And TTS (episode #94) speaks the response back. Today we focus on NLU and the dialogue manager -- these are where the actual intelligence lives.

Intent classification

Intent classification is a text classification problem: given a user utterance, predict which predefined intent category it belongs to. "What's the weather like?" maps to get_weather. "Set an alarm for 7 AM" maps to set_alarm. "Play some jazz" maps to play_music. If you've been following along since episode #12 (logistic regression) and episode #59 (BERT), you already know how to classify text. Intent classification is exactly that -- fine-tuning a text encoder on labeled utterance-intent pairs.

Let's build an intent classifier from scratch using a simple bag-of-words approach first, so we can see the fundamentals before bringing in neural networks:

import numpy as np


class IntentClassifier:
    """Bag-of-words intent classifier
    with TF-IDF features."""

    def __init__(self):
        self.vocab = {}
        self.idf = None
        self.weights = None
        self.intent_names = []
        self.rng = np.random.RandomState(42)

    def tokenize(self, text):
        """Simple whitespace tokenizer
        with lowercasing."""
        return text.lower().strip(
            ).replace("?", " ").replace(
            "!", " ").replace(
            "'", " ").split()

    def build_vocab(self, texts):
        """Build vocabulary from
        training texts."""
        word_set = set()
        for text in texts:
            for w in self.tokenize(text):
                word_set.add(w)
        self.vocab = {w: i for i, w
                      in enumerate(
                          sorted(word_set))}

    def text_to_tfidf(self, texts):
        """Convert texts to TF-IDF
        feature matrix."""
        n = len(texts)
        v = len(self.vocab)
        tf = np.zeros((n, v))
        for i, text in enumerate(texts):
            tokens = self.tokenize(text)
            for t in tokens:
                if t in self.vocab:
                    tf[i, self.vocab[t]] += 1
            if len(tokens) > 0:
                tf[i] /= len(tokens)

        # Compute IDF
        df = (tf > 0).sum(axis=0) + 1
        self.idf = np.log(n / df) + 1
        return tf * self.idf

    def train(self, texts, labels,
              lr=0.1, epochs=200):
        """Train with softmax regression."""
        self.intent_names = sorted(
            set(labels))
        intent_map = {name: i for i, name
                      in enumerate(
                          self.intent_names)}

        self.build_vocab(texts)
        X = self.text_to_tfidf(texts)
        y = np.array([intent_map[l]
                      for l in labels])

        n_classes = len(self.intent_names)
        n_features = X.shape[1]
        self.weights = self.rng.randn(
            n_features,
            n_classes) * 0.01
        self.bias = np.zeros(n_classes)

        for epoch in range(epochs):
            logits = X @ self.weights + (
                self.bias)
            # Softmax
            exp_l = np.exp(
                logits - logits.max(
                    axis=1, keepdims=True))
            probs = exp_l / exp_l.sum(
                axis=1, keepdims=True)

            # One-hot targets
            targets = np.zeros_like(probs)
            targets[np.arange(len(y)),
                    y] = 1.0

            # Gradient
            grad = probs - targets
            self.weights -= lr * (
                X.T @ grad) / len(y)
            self.bias -= lr * grad.mean(
                axis=0)

        # Training accuracy
        preds = (
            X @ self.weights + self.bias
            ).argmax(axis=1)
        acc = (preds == y).mean()
        print(f"Training accuracy: "
              f"{acc:.1%}")

    def predict(self, text):
        """Predict intent for one
        utterance."""
        tokens = self.tokenize(text)
        v = len(self.vocab)
        tf = np.zeros(v)
        for t in tokens:
            if t in self.vocab:
                tf[self.vocab[t]] += 1
        if len(tokens) > 0:
            tf /= len(tokens)
        features = tf * self.idf
        logits = features @ (
            self.weights) + self.bias
        exp_l = np.exp(
            logits - logits.max())
        probs = exp_l / exp_l.sum()
        idx = probs.argmax()
        return self.intent_names[idx], (
            probs[idx])


# Training data -- 5 intents
texts = [
    "what is the weather today",
    "is it going to rain tomorrow",
    "weather forecast for Amsterdam",
    "how cold is it outside",
    "will it snow this weekend",
    "set an alarm for seven AM",
    "wake me up at 6 30",
    "remind me in ten minutes",
    "create a timer for 5 minutes",
    "set a reminder for noon",
    "play some jazz music",
    "put on my workout playlist",
    "play the latest album by Radiohead",
    "shuffle my favorites",
    "play something relaxing",
    "turn off the kitchen lights",
    "dim the bedroom lamp to 50",
    "set thermostat to 21 degrees",
    "lock the front door",
    "close the garage",
    "what happened in tech today",
    "latest news headlines",
    "any updates on the stock market",
    "read me the morning news",
    "what is trending right now",
]
labels = (
    ["get_weather"] * 5
    + ["set_alarm"] * 5
    + ["play_music"] * 5
    + ["control_device"] * 5
    + ["get_news"] * 5
)

clf = IntentClassifier()
clf.train(texts, labels)

# Test
test_cases = [
    "what is the temperature outside",
    "play me some chill beats",
    "set a timer for 20 minutes",
    "turn on the living room fan",
    "any breaking news today",
]
print("\n--- Predictions ---")
for t in test_cases:
    intent, conf = clf.predict(t)
    print(f"  '{t}' -> {intent} "
          f"({conf:.0%})")

For production voice NLU systems, you'd typically have 20-100 intents with hundreds of training examples per intent. The SNIPS and ATIS datasets are standard benchmarks -- SNIPS covers 7 intents like music playback, weather queries, and restaurant booking, each with around 2,000 examples. Intent classification accuracy on these benchmarks exceeds 98% with fine-tuned transformer models. Having said that, real users say things your training set never anticipated, and that's where the fun begins.
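One common way to cope with utterances the training set never anticipated is to refuse to act when the classifier is unsure. A minimal sketch, assuming a classifier that returns an (intent, confidence) pair like `clf.predict` above; the `route_intent` helper, the "fallback" label and the 0.6 threshold are all assumptions you would tune on held-out data:

```python
def route_intent(intent, confidence, threshold=0.6):
    """Return the intent to act on, or "fallback" when the
    classifier is not confident enough.

    threshold is an assumed value, not a recommendation.
    """
    if confidence < threshold:
        return "fallback"
    return intent


print(route_intent("play_music", 0.91))  # confident prediction
print(route_intent("get_news", 0.34))    # probably out of scope
```

With the tiny toy classifier above, confidences may be low across the board, so the threshold would have to be set much lower there. The dialogue manager can answer a fallback with something like "Sorry, I didn't get that" instead of executing a wrong action.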

Slot filling with BIO tagging

Intent tells you what the user wants. Slot filling tells you the specifics. Consider "book a table for two at eight PM at that Italian place downtown." The intent is book_restaurant. The slots are:

Slot	Value
party_size	2
time	8:00 PM
cuisine	Italian
location	downtown

Slot filling is a sequence labeling problem -- assign a label to each token in the input. This is the same setup as Named Entity Recognition (which we touched on back in episode #59). The standard labeling scheme is BIO (Begin, Inside, Outside):

Book    a    table  for  two        at   eight     PM       at   that  Italian    place  downtown
O       O    O      O    B-party    O    B-time    I-time   O    O     B-cuisine  O      B-location

Each token gets tagged as O (not part of any slot), B-something (beginning of a slot), or I-something (continuation of a slot). Let's build a slot filler from scratch:

import numpy as np


class SlotFiller:
    """BIO sequence tagger for
    slot filling."""

    def __init__(self, embed_dim=32,
                 hidden_dim=64):
        self.embed_dim = embed_dim
        self.hidden_dim = hidden_dim
        self.rng = np.random.RandomState(
            42)
        self.word2idx = {}
        self.tag2idx = {}
        self.idx2tag = {}

    def build_vocab(self, sentences,
                     tag_seqs):
        words = set()
        tags = set()
        for sent in sentences:
            for w in sent:
                words.add(w.lower())
        for seq in tag_seqs:
            for t in seq:
                tags.add(t)
        self.word2idx = {"<UNK>": 0}
        for i, w in enumerate(
                sorted(words)):
            self.word2idx[w] = i + 1
        self.tag2idx = {
            t: i for i, t in enumerate(
                sorted(tags))}
        self.idx2tag = {
            i: t for t, i
            in self.tag2idx.items()}

    def init_params(self):
        v = len(self.word2idx)
        t = len(self.tag2idx)
        d = self.embed_dim
        h = self.hidden_dim
        s = 0.1
        self.W_embed = self.rng.randn(
            v, d) * s
        self.W_hidden = self.rng.randn(
            d * 3, h) * s
        self.b_hidden = np.zeros(h)
        self.W_out = self.rng.randn(
            h, t) * s
        self.b_out = np.zeros(t)

    def encode_sentence(self, tokens):
        indices = []
        for t in tokens:
            w = t.lower()
            idx = self.word2idx.get(
                w, 0)
            indices.append(idx)
        return np.array(indices)

    def forward(self, indices):
        """Forward pass with context
        window of 3."""
        n = len(indices)
        d = self.embed_dim
        embeds = self.W_embed[indices]
        padded = np.zeros((n + 2, d))
        padded[1:n+1] = embeds

        logits = np.zeros(
            (n, len(self.tag2idx)))
        hiddens = np.zeros(
            (n, self.hidden_dim))
        for i in range(n):
            window = padded[
                i:i+3].flatten()
            h = np.tanh(
                window @ self.W_hidden
                + self.b_hidden)
            hiddens[i] = h
            logits[i] = (
                h @ self.W_out
                + self.b_out)
        return logits, hiddens

    def train(self, sentences, tag_seqs,
              lr=0.05, epochs=300):
        self.build_vocab(
            sentences, tag_seqs)
        self.init_params()

        for epoch in range(epochs):
            correct = 0
            total = 0
            for sent, tags in zip(
                    sentences, tag_seqs):
                indices = (
                    self.encode_sentence(
                        sent))
                tag_ids = np.array([
                    self.tag2idx[t]
                    for t in tags])
                logits, hiddens = (
                    self.forward(indices))

                exp_l = np.exp(
                    logits - logits.max(
                        axis=1,
                        keepdims=True))
                probs = exp_l / exp_l.sum(
                    axis=1, keepdims=True)
                preds = logits.argmax(
                    axis=1)
                correct += (
                    preds == tag_ids
                    ).sum()
                total += len(tag_ids)

                targets = np.zeros_like(
                    probs)
                targets[
                    np.arange(len(
                        tag_ids)),
                    tag_ids] = 1.0
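                # Cross-entropy gradient
                # w.r.t. the logits. Note:
                # only the output layer is
                # updated below; embeddings
                # and the hidden layer stay
                # at their random init.
                d_logits = (
                    probs - targets)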
                d_W_out = (
                    hiddens.T @ d_logits)
                d_b_out = d_logits.sum(
                    axis=0)
                self.W_out -= (
                    lr * d_W_out
                    / len(indices))
                self.b_out -= (
                    lr * d_b_out
                    / len(indices))

            if (epoch + 1) % 100 == 0:
                acc = correct / max(
                    total, 1)
                print(f"Epoch {epoch+1}: "
                      f"acc={acc:.1%}")

    def predict(self, tokens):
        indices = self.encode_sentence(
            tokens)
        logits, _ = self.forward(
            indices)
        tag_ids = logits.argmax(axis=1)
        return [self.idx2tag[i]
                for i in tag_ids]

    def extract_slots(self, tokens,
                       tags):
        """Extract slot-value pairs
        from BIO tags."""
        slots = {}
        cur_slot = None
        cur_value = []
        for token, tag in zip(
                tokens, tags):
            if tag.startswith("B-"):
                if cur_slot:
                    slots[cur_slot] = (
                        " ".join(
                            cur_value))
                cur_slot = tag[2:]
                cur_value = [token]
            elif (tag.startswith("I-")
                  and cur_slot):
                cur_value.append(token)
            else:
                if cur_slot:
                    slots[cur_slot] = (
                        " ".join(
                            cur_value))
                cur_slot = None
                cur_value = []
        if cur_slot:
            slots[cur_slot] = (
                " ".join(cur_value))
        return slots


# Training data
sentences = [
    ["set", "alarm", "for", "seven",
     "AM"],
    ["play", "jazz", "music"],
    ["turn", "off", "the", "kitchen",
     "lights"],
    ["weather", "in", "Amsterdam",
     "tomorrow"],
    ["dim", "bedroom", "lamp", "to",
     "fifty", "percent"],
    ["wake", "me", "at", "six",
     "thirty"],
    ["play", "my", "workout",
     "playlist"],
    ["set", "thermostat", "to",
     "twenty", "one", "degrees"],
]
tag_seqs = [
    ["O", "O", "O", "B-time",
     "I-time"],
    ["O", "B-genre", "O"],
    ["O", "B-action", "O",
     "B-location", "B-device"],
    ["O", "O", "B-location",
     "B-time"],
    ["O", "B-location", "B-device",
     "O", "B-level", "I-level"],
    ["O", "O", "O", "B-time",
     "I-time"],
    ["O", "O", "B-genre", "O"],
    ["O", "B-device", "O", "B-level",
     "I-level", "I-level"],
]

filler = SlotFiller()
filler.train(sentences, tag_seqs)

print("\n--- Slot Extraction ---")
test_sent = ["set", "alarm", "for",
             "seven", "AM"]
tags = filler.predict(test_sent)
slots = filler.extract_slots(
    test_sent, tags)
print(f"Tokens: {test_sent}")
print(f"Tags:   {tags}")
print(f"Slots:  {slots}")
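Writing BIO tag sequences by hand (as in the training data above) gets error-prone quickly. Annotation tools usually store slots as token spans, and the tags are generated from those. A minimal sketch of that conversion, going the opposite direction of `extract_slots`; the `spans_to_bio` helper and its (start, end, slot) span format are assumptions for illustration:

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, slot) token spans to BIO tags.

    end is exclusive; spans are assumed not to overlap.
    """
    tags = ["O"] * len(tokens)
    for start, end, slot in spans:
        tags[start] = f"B-{slot}"
        for i in range(start + 1, end):
            tags[i] = f"I-{slot}"
    return tags


demo_tokens = ["set", "alarm", "for", "seven", "AM"]
demo_tags = spans_to_bio(demo_tokens, [(3, 5, "time")])
print(demo_tags)
```

For the alarm sentence this reproduces the tag sequence used in the training data: three O tags, then B-time and I-time.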

The elegant approach (used in production) is to do intent classification and slot filling simultaneously with a shared encoder. The [CLS] token representation classifies the intent, while each token's representation classifies its BIO slot tag. Joint training works because the two tasks share useful information -- knowing the intent is play_music makes it more likely that "chill" is a B-genre slot rather than a B-mood slot.

Joint intent + slot model

Let's build a joint model. We'll use PyTorch here since the shared encoder architecture maps naturally to nn.Module:

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class JointNLUModel(nn.Module):
    """Joint intent classifier +
    slot filler with shared encoder."""

    def __init__(self, vocab_size,
                 n_intents,
                 n_slot_tags,
                 embed_dim=64,
                 hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(
            vocab_size, embed_dim,
            padding_idx=0)
        # Shared bidirectional LSTM
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim,
            batch_first=True,
            bidirectional=True)
        # Intent head: uses last
        # hidden state
        self.intent_head = nn.Linear(
            hidden_dim * 2, n_intents)
        # Slot head: per-token
        # classification
        self.slot_head = nn.Linear(
            hidden_dim * 2, n_slot_tags)

    def forward(self, input_ids):
        emb = self.embedding(input_ids)
        lstm_out, (h_n, _) = self.lstm(
            emb)
        # Intent: concat final
        # forward + backward
        intent_repr = torch.cat(
            [h_n[0], h_n[1]], dim=1)
        intent_logits = (
            self.intent_head(
                intent_repr))
        # Slots: per-token from
        # full sequence
        slot_logits = self.slot_head(
            lstm_out)
        return intent_logits, slot_logits


# Demo with synthetic data
torch.manual_seed(42)
model = JointNLUModel(
    vocab_size=200,
    n_intents=5,
    n_slot_tags=9,
    embed_dim=32,
    hidden_dim=64)

# Fake batch: 4 utterances, max 8 tokens
x = torch.randint(1, 200, (4, 8))
intent_logits, slot_logits = model(x)

print(f"Input shape: {x.shape}")
print(f"Intent logits: "
      f"{intent_logits.shape}")
print(f"Slot logits: "
      f"{slot_logits.shape}")
print(f"Predicted intents: "
      f"{intent_logits.argmax(dim=1)}")
print(f"Predicted slot tags "
      f"(first utterance): "
      f"{slot_logits[0].argmax(dim=1)}")

total = sum(p.numel()
            for p in model.parameters())
print(f"Total parameters: {total:,}")

The loss function for joint training is simply the sum (or weighted sum) of the intent cross-entropy loss and the slot cross-entropy loss. In practice, joint models outperform training each task separately because the shared representations capture information useful for both tasks. A paper by Liu and Lane (2016) showed that joint training on the ATIS dataset improved slot F1 by 0.5-1.0% over separate training, which is significant at the 95%+ accuracy level.
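The combined loss can be sketched in plain Python for a single utterance. The function names, the `slot_weight` parameter and averaging the slot loss over tokens are assumptions for illustration; in PyTorch you would typically apply `F.cross_entropy` to both heads and add the results:

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example."""
    m = max(logits)
    log_sum = m + math.log(
        sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def joint_loss(intent_logits, intent_target,
               slot_logits, slot_targets, slot_weight=1.0):
    """Intent loss plus weighted mean per-token slot loss."""
    intent_l = cross_entropy(intent_logits, intent_target)
    slot_l = sum(
        cross_entropy(tok_logits, tag)
        for tok_logits, tag in zip(slot_logits, slot_targets)
    ) / len(slot_targets)
    return intent_l + slot_weight * slot_l


# One utterance: 3 intents, 2 tokens with 3 slot tags each
loss = joint_loss(
    [2.0, 0.1, -1.0], 0,
    [[1.5, 0.0, 0.0], [0.0, 2.0, 0.0]], [0, 1],
    slot_weight=0.5)
print(f"joint loss: {loss:.3f}")
```

The weight lets you rebalance the two tasks when one of them dominates the gradient; with `slot_weight=1.0` you get the plain sum.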

Dialogue state tracking

Single-turn NLU handles one command at a time. Real conversations are multi-turn. Consider this exchange:

User: "What's the weather?"
System: "Where?"
User: "Amsterdam"
System: "12 degrees and cloudy in Amsterdam."
User: "And tomorrow?"

That last utterance -- "And tomorrow?" -- makes zero sense without the previous context. The user means "What's the weather in Amsterdam tomorrow?" A dialogue manager tracks this context across turns:

import numpy as np


class DialogueState:
    """Tracks accumulated slots
    across conversation turns."""

    def __init__(self):
        self.current_intent = None
        self.slots = {}
        self.history = []
        self.turn_count = 0

    def update(self, intent, new_slots):
        """Merge new information into
        the current state."""
        self.turn_count += 1
        if intent and intent != (
                "follow_up"):
            self.current_intent = intent
        self.slots.update(new_slots)

    def reset(self):
        self.current_intent = None
        self.slots = {}
        self.turn_count = 0


class DialogueManager:
    """Rule-based dialogue manager
    with slot tracking."""

    def __init__(self):
        self.state = DialogueState()
        self.required_slots = {
            "get_weather": ["location"],
            "set_alarm": ["time"],
            "play_music": [],
            "control_device": [
                "device", "action"],
            "book_restaurant": [
                "party_size", "time"],
        }
        self.prompts = {
            "location": "Where?",
            "time": "When?",
            "party_size": (
                "For how many people?"),
            "device": "Which device?",
            "action": (
                "Turn it on or off?"),
        }

    def simulate_nlu(self, text):
        """Fake NLU for demonstration."""
        text_low = text.lower()
        intent = None
        slots = {}
        if "weather" in text_low:
            intent = "get_weather"
        elif "alarm" in text_low or (
                "wake" in text_low
                or "timer" in text_low):
            intent = "set_alarm"
        elif "play" in text_low:
            intent = "play_music"
        elif ("turn" in text_low
              or "dim" in text_low):
            intent = "control_device"
        else:
            intent = "follow_up"

        if "amsterdam" in text_low:
            slots["location"] = (
                "Amsterdam")
        if "tomorrow" in text_low:
            slots["time"] = "tomorrow"
        if "seven" in text_low:
            slots["time"] = "7:00 AM"
        if "lights" in text_low:
            slots["device"] = "lights"
        if "off" in text_low:
            slots["action"] = "off"

        return intent, slots

    def process_turn(self, user_text):
        intent, slots = (
            self.simulate_nlu(
                user_text))
        # A new explicit intent starts
        # a fresh task; follow-ups keep
        # the earlier intent and slots.
        if intent not in (
                "follow_up",
                self.state
                .current_intent):
            self.state.reset()
        self.state.update(
            intent, slots)
        self.state.history.append({
            "role": "user",
            "text": user_text})

        required = (
            self.required_slots.get(
                self.state
                .current_intent, []))
        missing = [
            s for s in required
            if s not in (
                self.state.slots)]

        if missing:
            response = self.prompts.get(
                missing[0],
                "Could you clarify?")
        else:
            ci = self.state.current_intent
            sl = dict(self.state.slots)
            response = (
                f"Executing {ci} "
                f"with {sl}")
            # No reset here, so that
            # "And tomorrow?" can reuse
            # the intent and location.

        self.state.history.append({
            "role": "system",
            "text": response})
        return response


dm = DialogueManager()
print("=== Multi-turn Dialogue ===\n")
turns = [
    "What's the weather?",
    "Amsterdam",
    "And tomorrow?",
]
for turn in turns:
    response = dm.process_turn(turn)
    print(f"User: {turn}")
    print(f"System: {response}\n")

Modern voice assistants use neural dialogue state tracking rather than rule-based slot merging. Models trained on the MultiWOZ dataset learn to predict the full dialogue state at each turn, handling coreference ("that place"), corrections ("actually, make it three people"), and implicit references ("tomorrow" inheriting the location from the previous turn). Nonetheless, the rule-based approach works surprisingly well for well-defined domains with a limited number of intents and slots -- and it has the advantage of being completely predictable and debuggable.

Speech emotion recognition

Words carry meaning, but how you say them carries emotion. "I'm fine" can mean contentment or frustration depending on pitch, speed, and tone. Speech Emotion Recognition (SER) extracts emotional cues directly from audio -- bypassing the text entirely. The approach mirrors audio classification from episode #95: extract spectral features, feed them through a model, predict emotion categories.

Let's build a simplified SER system from scratch using prosodic features:

import numpy as np


class EmotionRecognizer:
    """Prosody-based speech emotion
    recognizer."""

    def __init__(self, sr=16000):
        self.sr = sr
        self.emotions = [
            "neutral", "happy",
            "angry", "sad"]
        self.rng = np.random.RandomState(
            42)

    def extract_prosody(self, audio):
        """Extract prosodic features:
        pitch stats, energy stats,
        speaking rate proxy."""
        frame_len = int(
            self.sr * 0.025)
        hop = int(self.sr * 0.010)
        n_frames = max(1,
            (len(audio) - frame_len)
            // hop)
        energy = np.zeros(n_frames)
        zcr = np.zeros(n_frames)
        for i in range(n_frames):
            s = i * hop
            frame = audio[
                s:s + frame_len]
            energy[i] = np.sqrt(
                np.mean(frame ** 2))
            signs = np.sign(frame)
            zcr[i] = np.mean(
                np.abs(np.diff(signs))
                > 0)

        pitches = []
        for i in range(0,
                len(audio) - frame_len,
                hop * 4):
            frame = audio[
                i:i + frame_len * 2]
            if len(frame) < frame_len:
                break
            corr = np.correlate(
                frame, frame, "full")
            corr = corr[
                len(corr)//2:]
            min_lag = int(
                self.sr / 500)
            max_lag = int(
                self.sr / 60)
            if max_lag < len(corr):
                seg = corr[
                    min_lag:max_lag]
                if len(seg) > 0:
                    peak = (
                        np.argmax(seg)
                        + min_lag)
                    pitches.append(
                        self.sr / peak)

        pitches = np.array(
            pitches) if pitches else (
                np.array([150.0]))

        features = np.array([
            np.mean(energy),
            np.std(energy),
            np.max(energy),
            np.mean(zcr),
            np.std(zcr),
            np.mean(pitches),
            np.std(pitches),
            np.max(pitches)
            - np.min(pitches),
            len(audio) / self.sr,
            np.mean(
                np.abs(np.diff(energy))),
            np.percentile(energy, 90)
            - np.percentile(energy, 10),
            np.mean(pitches > 200),
        ])
        return features

    def synthesize_emotional(
            self, emotion, dur=2.0):
        n = int(self.sr * dur)
        t = np.arange(n) / self.sr
        profiles = {
            "neutral": {
                "f0": 150,
                "f0_var": 5,
                "energy": 0.3,
                "rate": 1.0},
            "happy": {
                "f0": 200,
                "f0_var": 30,
                "energy": 0.5,
                "rate": 1.3},
            "angry": {
                "f0": 170,
                "f0_var": 15,
                "energy": 0.6,
                "rate": 1.1},
            "sad": {
                "f0": 120,
                "f0_var": 8,
                "energy": 0.15,
                "rate": 0.7},
        }
        p = profiles[emotion]
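        f0 = p["f0"] + p["f0_var"] * (
            np.sin(2 * np.pi * 3 * t))
        # Integrate the instantaneous
        # frequency to get the phase;
        # sin(2*pi*f0*t) with a varying
        # f0 would make the pitch swing
        # grow with t.
        phase = 2 * np.pi * (
            np.cumsum(f0) / self.sr)
        signal = p["energy"] * np.sin(
            phase)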
        signal += 0.15 * np.sin(
            2 * np.pi * 800 * t)
        signal += 0.1 * np.sin(
            2 * np.pi * 1200 * t)
        env = (0.5 + 0.5 * np.sin(
            2 * np.pi * p["rate"]
            * 2 * t))
        signal *= env
        signal += self.rng.randn(
            n) * 0.02
        return signal

    def run(self):
        X_train = []
        y_train = []
        for emo in self.emotions:
            for _ in range(20):
                audio = (
                    self.synthesize_emotional(
                        emo,
                        dur=1.5 + self.rng
                        .random()))
                feat = (
                    self.extract_prosody(
                        audio))
                X_train.append(feat)
                y_train.append(
                    self.emotions.index(
                        emo))

        X_train = np.array(X_train)
        y_train = np.array(y_train)
        mean = X_train.mean(axis=0)
        std = X_train.std(axis=0) + 1e-9
        X_norm = (X_train - mean) / std

        # Test data
        X_test = []
        y_test = []
        for emo in self.emotions:
            for _ in range(5):
                audio = (
                    self.synthesize_emotional(
                        emo,
                        dur=1.5 + self.rng
                        .random()))
                feat = (
                    self.extract_prosody(
                        audio))
                X_test.append(feat)
                y_test.append(
                    self.emotions.index(
                        emo))

        X_test = np.array(X_test)
        y_test = np.array(y_test)
        X_test_n = (
            X_test - mean) / std

        # kNN (k=5)
        correct = 0
        k = 5
        conf = np.zeros((4, 4), dtype=int)
        for i in range(len(X_test_n)):
            dists = np.linalg.norm(
                X_norm - X_test_n[i],
                axis=1)
            nn_idx = np.argsort(
                dists)[:k]
            votes = y_train[nn_idx]
            pred = int(np.bincount(
                votes).argmax())
            conf[y_test[i], pred] += 1
            if pred == y_test[i]:
                correct += 1

        acc = correct / len(y_test)
        print(f"Emotion recognition "
              f"accuracy: {acc:.1%}")
        print(f"\nConfusion matrix:")
        print(f"{'':>10}", end="")
        for e in self.emotions:
            print(f"{e:>10}", end="")
        print()
        for i, e in enumerate(
                self.emotions):
            print(f"{e:>10}", end="")
            for j in range(
                    len(self.emotions)):
                print(f"{conf[i,j]:>10}",
                      end="")
            print()


rec = EmotionRecognizer()
rec.run()

Production SER systems use wav2vec2 or similar self-supervised models (episode #90 covered self-supervised learning) fine-tuned on emotional speech datasets like IEMOCAP and RAVDESS. Accuracy ranges from 65-80% depending on the emotion granularity, which reflects genuine ambiguity in how humans express (and perceive) emotion.

An interesting direction: combining text sentiment with speech emotion gives you a multimodal emotion signal. Someone saying "great" with an angry tone is probably sarcastic. The text says positive; the audio says negative. A fused model catches what neither modality detects alone.
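
The sarcasm example can be sketched as a simple late-fusion rule. This is only an illustration: the function name `fuse_sentiment`, the [-1, 1] valence scale and the 0.5 mismatch threshold are all assumptions made here, and a real fused model would learn the combination from data rather than apply a fixed rule.

```python
def fuse_sentiment(text_valence, audio_valence,
                   mismatch_threshold=0.5):
    """Late fusion of two valence scores in [-1, 1].

    Names, scale and threshold are illustrative
    assumptions, not part of any library.
    """
    # Opposite signs mean the words and the tone disagree
    disagree = text_valence * audio_valence < 0
    gap = abs(text_valence - audio_valence)

    if disagree and gap >= mismatch_threshold:
        # Trust the tone of voice over the literal words
        return {"label": "possible_sarcasm",
                "valence": audio_valence}

    # Otherwise simply average the two modalities
    fused = 0.5 * (text_valence + audio_valence)
    if fused > 0:
        label = "positive"
    elif fused < 0:
        label = "negative"
    else:
        label = "neutral"
    return {"label": label, "valence": fused}


# "great" (positive text) said in an angry tone (negative audio)
result = fuse_sentiment(text_valence=0.8,
                        audio_valence=-0.7)
print(result)
```

Neither score alone would raise the flag: the text score says positive and the audio score says negative, and only the comparison exposes the mismatch.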

NLU evaluation metrics

How do you know if your NLU system actually works? There are three levels of evaluation, and you really want to look at all of them:

import numpy as np


class NLUEvaluator:
    """Evaluate intent accuracy,
    slot F1, and sentence accuracy."""

    def intent_accuracy(
            self, y_true, y_pred):
        correct = sum(
            1 for t, p in zip(
                y_true, y_pred)
            if t == p)
        return correct / len(y_true)

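    def slot_f1(self, true_slots,
                 pred_slots):
        """F1 over slot-value pairs,
        matched per utterance."""
        true_set = set()
        pred_set = set()
        # Key each pair by utterance index
        # so repeated values are not merged
        # and a value predicted for the
        # wrong utterance is not a hit
        for i, slots in enumerate(
                true_slots):
            for k, v in slots.items():
                true_set.add((i, k, v))
        for i, slots in enumerate(
                pred_slots):
            for k, v in slots.items():
                pred_set.add((i, k, v))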

        tp = len(
            true_set & pred_set)
        fp = len(
            pred_set - true_set)
        fn = len(
            true_set - pred_set)

        prec = tp / max(tp + fp, 1)
        rec = tp / max(tp + fn, 1)
        f1 = (2 * prec * rec
              / max(prec + rec, 1e-9))
        return {"precision": prec,
                "recall": rec,
                "f1": f1}

    def sentence_accuracy(
            self, intents_true,
            intents_pred, slots_true,
            slots_pred):
        """Both intent AND all slots
        must be correct."""
        correct = sum(
            1 for it, ip, st, sp in zip(
                intents_true, intents_pred,
                slots_true, slots_pred)
            if it == ip and st == sp)
        return correct / len(intents_true)

    def run_demo(self):
        intent_true = [
            "get_weather", "set_alarm",
            "play_music",
            "control_device",
            "get_weather", "play_music",
            "set_alarm",
            "control_device",
            "get_weather", "play_music"]
        intent_pred = [
            "get_weather", "set_alarm",
            "play_music",
            "control_device",
            "get_weather", "play_music",
            "set_alarm", "get_news",
            "get_weather", "play_music"]
        slots_true = [
            {"location": "Amsterdam"},
            {"time": "7 AM"},
            {"genre": "jazz"},
            {"device": "lights",
             "action": "off"},
            {"location": "Berlin",
             "time": "tomorrow"},
            {},
            {"time": "noon"},
            {"device": "thermostat",
             "level": "21"},
            {"location": "London"},
            {"genre": "rock"}]
        slots_pred = [
            {"location": "Amsterdam"},
            {"time": "7 AM"},
            {"genre": "jazz"},
            {"device": "lights",
             "action": "off"},
            {"location": "Berlin",
             "time": "tomorrow"},
            {},
            {"time": "noon"},
            {"device": "thermostat",
             "level": "20"},
            {"location": "London"},
            {"genre": "rock"}]

        i_acc = self.intent_accuracy(
            intent_true, intent_pred)
        s_f1 = self.slot_f1(
            slots_true, slots_pred)
        s_acc = self.sentence_accuracy(
            intent_true, intent_pred,
            slots_true, slots_pred)

        print("=== NLU Evaluation ===")
        print(f"Intent accuracy: "
              f"{i_acc:.1%}")
        print(f"Slot precision: "
              f"{s_f1['precision']:.1%}")
        print(f"Slot recall: "
              f"{s_f1['recall']:.1%}")
        print(f"Slot F1: "
              f"{s_f1['f1']:.1%}")
        print(f"Sentence accuracy: "
              f"{s_acc:.1%}")
        print("\nNote: sentence accuracy "
              "is the hardest metric:")
        print("BOTH intent AND all "
              "slots must be correct.")


evaluator = NLUEvaluator()
evaluator.run_demo()
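
Working the demo by hand shows where each number comes from. Only the eighth utterance is wrong: its intent is predicted as get_news instead of control_device, and its level slot comes out as 20 instead of 21. The reference contains 12 slot-value pairs, the prediction also contains 12, and 11 of them match:

$$
\begin{aligned}
\text{intent accuracy} &= \frac{9}{10} = 90.0\% \\
P = R &= \frac{11}{12} \approx 91.7\% \\
F_1 &= \frac{2PR}{P + R} \approx 91.7\% \\
\text{sentence accuracy} &= \frac{9}{10} = 90.0\%
\end{aligned}
$$

Both errors happen to fall on the same utterance, so sentence accuracy equals intent accuracy here. Had the wrong intent and the wrong slot landed on two different utterances, sentence accuracy would have dropped to 80% while the other metrics stayed the same.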

Sentence accuracy is the most honest metric. On the SNIPS benchmark, state-of-the-art sentence accuracy sits around 92-95%, meaning 5-8% of utterances still have at least one error in either intent or slots. For dialogue systems, evaluation gets even harder: task completion rate (did the user achieve their goal?) and turn efficiency (how many turns did it take?) are the most meaningful metrics, but they require either human evaluation or user simulation.
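
Once dialogues have been labeled (by humans or by a user simulator), the two dialogue-level metrics reduce to simple bookkeeping. A minimal sketch, assuming each dialogue is logged as a dict with hypothetical `completed` and `turns` fields; the log format and the numbers are invented for illustration, and turn efficiency is computed here as the mean number of turns over completed dialogues, which is only one of several reasonable definitions.

```python
def dialogue_metrics(dialogues):
    """Task completion rate and mean turns
    per completed dialogue (assumed log format)."""
    n = len(dialogues)
    done = [d for d in dialogues if d["completed"]]
    completion_rate = len(done) / n if n else 0.0
    # Only completed dialogues count toward efficiency
    if done:
        avg_turns = sum(d["turns"] for d in done) / len(done)
    else:
        avg_turns = float("nan")
    return completion_rate, avg_turns


# Invented example logs, not real evaluation data
logs = [
    {"completed": True, "turns": 3},
    {"completed": True, "turns": 5},
    {"completed": False, "turns": 8},
    {"completed": True, "turns": 4},
]

rate, turns = dialogue_metrics(logs)
print(f"Task completion rate: {rate:.1%}")
print(f"Avg turns (completed): {turns:.1f}")
```

The hard part is not this arithmetic but obtaining a trustworthy `completed` label for every dialogue.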

Summary

Intent classification determines what the user wants -- it's text classification on user utterances, and even a simple TF-IDF + softmax model gets you surprisingly far (fine-tuned transformers push past 98% on standard benchmarks);
slot filling extracts the specifics -- a sequence labeling task using BIO tags, best done jointly with intent classification in a shared-encoder model;
joint training with a shared encoder (bidirectional LSTM or transformer) improves both tasks because they share useful information -- knowing the intent constrains which slot types are plausible;
dialogue state tracking maintains context across turns -- accumulating slot values, resolving coreferences ("that place"), and handling corrections;
speech emotion recognition detects how something is said (not just the words) using prosodic features like pitch variation, energy dynamics, and speaking rate;
the full voice pipeline chains ASR (episode #93) -> NLU -> Dialogue Manager -> Action -> TTS (episode #94), with speaker identification (#97) optionally identifying who's talking;
evaluation uses intent accuracy, slot F1 (exact match on type+value pairs), and sentence accuracy (both intent AND all slots correct) -- sentence accuracy is the hardest and most honest metric.

There's plenty more ground to cover in the audio domain -- enhancing noisy recordings, combining audio and visual streams into unified multimodal representations, and ultimately building systems that understand not just your words but your full communicative intent. The voice carries a remarkable amount of information that goes well beyond the literal transcript.

Exercises

Exercise 1: Build an intent confidence calibrator. Create a class IntentCalibrator that generates a synthetic NLU evaluation dataset: 6 intents, 50 test utterances each, with softmax probability outputs simulated as Dirichlet samples (alpha=5.0 for correct intent, alpha=0.5 for others, with 85% of samples having correct predictions). Compute: (a) raw accuracy, (b) a reliability diagram with 10 confidence bins (for each bin, compute average confidence and average accuracy), (c) Expected Calibration Error (ECE = weighted average of |accuracy - confidence| per bin). Print the reliability diagram as a table and the final ECE value.

Exercise 2: Build a slot conflict resolver. Create a class SlotConflictResolver that simulates a 5-turn dialogue where the user changes their mind: turn 1 sets time="7 AM" and location="Amsterdam", turn 2 corrects time="8 AM", turn 3 adds party_size="4", turn 4 contradicts location="Rotterdam", turn 5 confirms. Track a dialogue state that (a) detects when a new slot value conflicts with an existing one, (b) logs every conflict with both old and new values, (c) always accepts the newest value. Print the state after each turn, flag conflicts, and show the final resolved state.

Exercise 3: Build a prosodic emotion feature analyzer. Create a class ProsodyAnalyzer that synthesizes 2-second audio clips for 4 emotions (neutral, happy, angry, sad) using distinct f0 ranges (neutral: 130-160Hz, happy: 180-240Hz, angry: 150-200Hz, sad: 90-130Hz), energy profiles, and speaking rate modulations. Extract 8 prosodic features per clip (mean/std/range of f0, mean/std of energy, zero-crossing rate mean, duration-normalized energy variance, pitch contour direction changes). Generate 15 clips per emotion, compute the mean feature vector per emotion, and print a feature comparison table showing which features best separate each emotion pair (compute the separability score as |mean_A - mean_B| / sqrt(var_A + var_B) for each feature and each emotion pair).
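
For reference, the two formulas named in Exercises 1 and 3 can be written out as follows. The symbols are notation introduced here, not from the exercise text: $M$ confidence bins, $n_m$ samples in bin $m$, $N$ samples in total, and $\mu$, $\sigma^2$ the mean and variance of a single feature for emotions $A$ and $B$.

$$
\mathrm{ECE} = \sum_{m=1}^{M} \frac{n_m}{N}\,\bigl|\mathrm{acc}(m) - \mathrm{conf}(m)\bigr|
$$

$$
\mathrm{sep}(A, B) = \frac{|\mu_A - \mu_B|}{\sqrt{\sigma_A^2 + \sigma_B^2}}
$$

Empty bins contribute nothing to the ECE sum, so they can be skipped when building the reliability table.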

Thanks for your time!

@scipio

