Learn AI Series (#95) - Audio Classification

What will I learn

  • You will learn environmental sound classification: identifying sounds like sirens, dogs barking, and rain using spectrograms and CNNs;
  • audio data augmentation: SpecAugment and waveform-level techniques that make models robust to real-world noise;
  • AudioSet: the ImageNet of audio, with 2 million clips across 527 sound classes, and how to use pretrained models;
  • the Audio Spectrogram Transformer (AST): applying the Vision Transformer architecture to audio spectrograms;
  • multi-label audio tagging: handling clips where multiple sounds happen simultaneously using sigmoid activations and BCE loss;
  • sound event detection: predicting not just what sounds are present but when they occur in a recording;
  • building a complete audio classification pipeline from raw waveforms to predictions.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#95) - Audio Classification

Solutions to Episode #94 Exercises

Exercise 1: Text normalization benchmark.

import re


class TextNormBenchmark:
    def __init__(self):
        self.ones = [
            "", "one", "two", "three", "four",
            "five", "six", "seven", "eight",
            "nine", "ten", "eleven", "twelve",
            "thirteen", "fourteen", "fifteen",
            "sixteen", "seventeen", "eighteen",
            "nineteen"]
        self.tens = [
            "", "", "twenty", "thirty", "forty",
            "fifty", "sixty", "seventy", "eighty",
            "ninety"]
        self.ordinals_special = {
            1: "first", 2: "second", 3: "third",
            5: "fifth", 8: "eighth", 9: "ninth",
            12: "twelfth", 20: "twentieth"}
        self.abbrevs = {
            "Mr.": "mister", "Mrs.": "missus",
            "Dr.": "doctor", "Prof.": "professor",
            "vs.": "versus", "etc.": "et cetera",
            "Jr.": "junior", "Sr.": "senior"}

    def num_to_words(self, n):
        if n < 0 or n > 9999:
            return str(n)
        if n == 0:
            return "zero"
        if n < 20:
            return self.ones[n]
        if n < 100:
            t = self.tens[n // 10]
            o = self.ones[n % 10]
            return f"{t} {o}".strip() if o else t
        if n < 1000:
            h = self.ones[n // 100] + " hundred"
            rem = n % 100
            if rem == 0:
                return h
            return h + " " + self.num_to_words(rem)
        th = self.num_to_words(n // 1000)
        rem = n % 1000
        if rem == 0:
            return th + " thousand"
        return (th + " thousand "
                + self.num_to_words(rem))

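    def num_to_ordinal(self, n):
        if n in self.ordinals_special:
            return self.ordinals_special[n]
        # Compound numbers: only the last part
        # takes the ordinal form, so 21 ->
        # "twenty first" and 101 ->
        # "one hundred first".
        rem = n % 100 if n > 100 else n % 10
        if n > 20 and rem:
            return (self.num_to_words(n - rem)
                    + " "
                    + self.num_to_ordinal(rem))
        w = self.num_to_words(n)
        if w.endswith("y"):
            return w[:-1] + "ieth"
        return w + "th"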

    def normalize_currency(self, text):
        def repl(m):
            d = int(m.group(1))
            c = int(m.group(2)) if m.group(2) else 0
            parts = []
            if d > 0:
                parts.append(
                    self.num_to_words(d)
                    + " dollar"
                    + ("s" if d != 1 else ""))
            if c > 0:
                parts.append(
                    self.num_to_words(c)
                    + " cent"
                    + ("s" if c != 1 else ""))
            if not parts:
                return "zero dollars"
            return " and ".join(parts)
        return re.sub(
            r'\$(\d+)(?:\.(\d{2}))?', repl, text)

    def normalize_time(self, text):
        def repl(m):
            h, mi = int(m.group(1)), int(m.group(2))
            if mi == 0:
                return (self.num_to_words(h)
                        + " o'clock")
            return (self.num_to_words(h) + " "
                    + self.num_to_words(mi))
        return re.sub(
            r'(\d{1,2}):(\d{2})', repl, text)

    def normalize_ordinals(self, text):
        def repl(m):
            return self.num_to_ordinal(
                int(m.group(1)))
        return re.sub(
            r'(\d+)(?:st|nd|rd|th)\b', repl, text)

    def normalize_abbrevs(self, text):
        for abbr, full in self.abbrevs.items():
            text = text.replace(abbr, full)
        return text

    def normalize(self, text):
        text = self.normalize_currency(text)
        text = self.normalize_time(text)
        text = self.normalize_ordinals(text)
        text = self.normalize_abbrevs(text)
        return text

    def run_tests(self):
        cases = [
            ("$12.99", "twelve dollars and "
             "ninety nine cents"),
            ("$0.01", "one cent"),
            ("$1.00", "one dollar"),
            ("3:45", "three forty five"),
            ("12:00", "twelve o'clock"),
            ("1st", "first"),
            ("2nd", "second"),
            ("3rd", "third"),
            ("21st", "twenty first"),
            ("Mr.", "mister"),
            ("Dr.", "doctor"),
            ("etc.", "et cetera"),
            ("$100.50",
             "one hundred dollars and "
             "fifty cents"),
            ("5:09", "five nine"),
            ("4th", "fourth"),
        ]
        passed = 0
        for inp, expected in cases:
            got = self.normalize(inp)
            ok = got == expected
            passed += int(ok)
            status = "PASS" if ok else "FAIL"
            if not ok:
                print(f"  {status}: '{inp}' -> "
                      f"'{got}' (expected "
                      f"'{expected}')")
        print(f"Tests: {passed}/{len(cases)}")


bench = TextNormBenchmark()
bench.run_tests()
text = ("Dr. Smith paid $12.99 at 3:30 "
        "for his 2nd visit. Mr. Jones "
        "owes $0.50 etc.")
print(f"\nNormalized: {bench.normalize(text)}")

The normalizer handles the most common patterns you'd need for a TTS front-end. The tricky bit is ordinals -- "twenty first" requires combining the tens word with the special ordinal form. Real production normalizers handle many more edge cases (dates, addresses, phone numbers, Roman numerals), but this covers the core logic.

Exercise 2: Mel spectrogram reconstruction evaluator.

import numpy as np


class MelReconstructionEval:
    def __init__(self, sr=16000, n_mels=80,
                 n_fft=1024, hop=256):
        self.sr = sr
        self.n_mels = n_mels
        self.n_fft = n_fft
        self.hop = hop

    def generate_mel(self, freqs, duration=2.0):
        n = int(self.sr * duration)
        t = np.arange(n) / self.sr
        sig = np.zeros(n)
        for f0, f1 in freqs:
            phase = 2 * np.pi * (
                f0 * t + (f1 - f0)
                / (2 * duration) * t ** 2)
            sig += 0.3 * np.sin(phase)
        mod = 0.5 + 0.5 * np.sin(
            2 * np.pi * 4 * t)
        sig *= mod
        window = np.hanning(self.n_fft)
        n_frames = (n - self.n_fft) // self.hop + 1
        n_bins = self.n_fft // 2 + 1
        spec = np.zeros((n_bins, n_frames))
        for i in range(n_frames):
            s = i * self.hop
            frame = sig[s:s + self.n_fft] * window
            spec[:, i] = np.abs(
                np.fft.rfft(frame)) ** 2
        fb = self._mel_fb(n_bins)
        mel = fb @ spec
        return 10.0 * np.log10(mel + 1e-10)

    def _mel_fb(self, n_bins):
        fmax = self.sr // 2
        mel_lo = 2595 * np.log10(1 + 0 / 700)
        mel_hi = 2595 * np.log10(1 + fmax / 700)
        pts = np.linspace(
            mel_lo, mel_hi, self.n_mels + 2)
        hz = 700 * (10 ** (pts / 2595) - 1)
        bins = np.floor(
            (self.n_fft + 1) * hz / self.sr
        ).astype(int)
        fb = np.zeros((self.n_mels, n_bins))
        for i in range(self.n_mels):
            lo, mid, hi = (bins[i],
                bins[i+1], bins[i+2])
            if mid > lo:
                for k in range(lo, mid):
                    fb[i, k] = (
                        (k - lo) / (mid - lo))
            if hi > mid:
                for k in range(mid, hi + 1):
                    if k < n_bins:
                        fb[i, k] = (
                            (hi - k) / (hi - mid))
        return fb

    def compute_mcd(self, ref, gen):
        n_cep = 13
        n = min(ref.shape[0], n_cep)
        basis = np.zeros((n, ref.shape[0]))
        for k in range(n):
            for j in range(ref.shape[0]):
                basis[k, j] = np.cos(
                    np.pi * k * (2*j+1)
                    / (2 * ref.shape[0]))
        ref_c = basis @ ref
        gen_c = basis @ gen
        diff = ref_c[1:] - gen_c[1:]
        dist = np.sqrt(
            (diff ** 2).sum(axis=0))
        return float(dist.mean())

    def compute_sc(self, ref, gen):
        diff_norm = np.sqrt(
            ((ref - gen) ** 2).sum())
        ref_norm = np.sqrt((ref ** 2).sum())
        return float(diff_norm / ref_norm)

    def run(self):
        rng = np.random.RandomState(42)
        ref = self.generate_mel(
            [(150, 3000)])
        exc = ref + rng.randn(
            *ref.shape) * 0.5
        mod_noise = rng.randn(
            *ref.shape) * 2.0
        mod = ref + mod_noise
        # Zero out 3 random Mel bands; use
        # n_mels instead of a hard-coded 80
        for b in rng.choice(
                self.n_mels, 3, replace=False):
            mod[b, :] = ref.min()
        poor = self.generate_mel(
            [(2000, 500)])

        print(f"{'Quality':<12} {'MCD':>8} "
              f"{'SpConv':>8}")
        print("-" * 30)
        for name, gen in [("excellent", exc),
                ("moderate", mod),
                ("poor", poor)]:
            mcd = self.compute_mcd(ref, gen)
            sc = self.compute_sc(ref, gen)
            print(f"{name:<12} {mcd:>8.3f} "
                  f"{sc:>8.4f}")


evaluator = MelReconstructionEval()
evaluator.run()

MCD (Mel Cepstral Distortion) is the standard metric for evaluating vocoder quality in TTS. It measures the distance in cepstral space -- which corresponds roughly to perceptual similarity. Lower MCD means the generated audio sounds more like the reference. The "excellent" version should have the lowest values, confirming that small additive noise causes minimal perceptual distortion.
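One caveat on the solution above: compute_mcd returns a plain Euclidean distance between DCT coefficients, without the scaling constant that usually appears in published MCD numbers. Here is a minimal sketch of the conventionally scaled version, assuming you already have cepstral coefficients per frame. The function name and the toy 13 x 50 input are my own choices for illustration, not part of the exercise:

```python
import numpy as np


def mel_cepstral_distortion(ref_cep, gen_cep):
    """Frame-averaged MCD in dB.

    Both inputs have shape (n_coeffs, n_frames). Row 0 (overall
    energy) is skipped, as in compute_mcd above.
    """
    diff = ref_cep[1:] - gen_cep[1:]
    # Per-frame distance with the conventional factor of 2 inside the root
    per_frame = np.sqrt(2.0 * (diff ** 2).sum(axis=0))
    # 10 / ln(10) converts the distance to a dB-like scale
    return float((10.0 / np.log(10.0)) * per_frame.mean())


rng = np.random.default_rng(0)
ref = rng.normal(size=(13, 50))
gen = ref + 0.1 * rng.normal(size=ref.shape)

mcd_same = mel_cepstral_distortion(ref, ref)
mcd_noisy = mel_cepstral_distortion(ref, gen)
print(f"identical: {mcd_same:.3f} dB, noisy: {mcd_noisy:.3f} dB")
```

The ranking of the three quality levels does not change with the scaling, so the exercise's conclusion still holds. The constant only matters when you compare your numbers against values reported elsewhere.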

Exercise 3: Vocoder upsampling simulator.

import numpy as np


class VocoderUpsampleSim:
    def __init__(self, n_mels=80, n_frames=100,
                 hop=256, sr=16000):
        self.n_mels = n_mels
        self.n_frames = n_frames
        self.hop = hop
        self.sr = sr
        rng = np.random.RandomState(42)
        self.mel = np.zeros(
            (n_mels, n_frames))
        for f in range(n_frames):
            for b in range(10, 31):
                self.mel[b, f] = rng.uniform(
                    0.5, 1.0)
            self.mel[:, f] += rng.randn(
                n_mels) * 0.05

    def nearest_neighbor(self):
        samples = np.repeat(
            self.mel.mean(axis=0), self.hop)
        kernel = np.ones(16) / 16
        return np.convolve(
            samples, kernel, mode='same')

    def linear_interp(self):
        means = self.mel.mean(axis=0)
        x_mel = np.arange(
            self.n_frames) * self.hop
        x_out = np.arange(
            self.n_frames * self.hop)
        return np.interp(x_out, x_mel, means)

    def transposed_conv(self):
        rng = np.random.RandomState(123)
        out_len = self.n_frames * self.hop
        output = np.zeros(out_len)
        kernel = rng.randn(self.hop) * 0.1
        means = self.mel.mean(axis=0)
        for i in range(self.n_frames):
            start = i * self.hop
            end = start + self.hop
            output[start:end] += (
                means[i] * kernel)
        return output

    def analyze(self, signal, name):
        smoothness = np.mean(
            np.abs(np.diff(signal)))
        drange = signal.max() - signal.min()
        zc = np.sum(
            np.diff(np.sign(signal)) != 0)
        zcr = zc / len(signal)
        return {
            "name": name,
            "length": len(signal),
            "smoothness": smoothness,
            "drange": drange,
            "zcr": zcr}

    def run(self):
        methods = [
            ("Nearest", self.nearest_neighbor()),
            ("Linear", self.linear_interp()),
            ("TransConv", self.transposed_conv())]
        print(f"{'Method':<12} {'Length':>8} "
              f"{'Smooth':>8} {'DRange':>8} "
              f"{'ZCR':>8}")
        print("-" * 48)
        for name, sig in methods:
            r = self.analyze(sig, name)
            print(
                f"{r['name']:<12} "
                f"{r['length']:>8} "
                f"{r['smoothness']:>8.5f} "
                f"{r['drange']:>8.4f} "
                f"{r['zcr']:>8.5f}")


sim = VocoderUpsampleSim()
sim.run()

The transposed convolution produces the highest zero-crossing rate because its kernel (random here, standing in for learned weights) introduces high-frequency structure that simple interpolation cannot. The nearest-neighbor and linear outputs stay strictly positive, so they never cross zero at all. This is exactly why real vocoders (HiFi-GAN, WaveRNN) use learned upsampling -- they can synthesize the fine-grained waveform details that make audio sound natural rather than muffled.

On to today's episode

Here we go! The last three episodes covered the full speech pipeline: audio fundamentals (#92), speech recognition (#93), and text-to-speech (#94). We can now convert sound to numbers, turn speech into text, and generate speech from text. But speech is only one kind of sound. What about all the other sounds in the world?

A security camera needs to detect glass breaking. A wildlife monitoring station needs to identify bird species from their calls. A music streaming service needs to classify genre, detect mood, and tag instruments -- automatically, at scale, for millions of tracks. Your phone needs to recognize "Hey Siri" against a background of restaurant noise. None of these are speech recognition problems. They're audio classification problems.

The beautiful thing (and if you've been following along since episode #92, you might already see where this is going): since we convert audio to Mel spectrograms, audio classification is effectively image classification. All the CNN architectures from episodes #45-47 and the Vision Transformer from episode #54 apply directly. A spectrogram is a 2D image. A siren looks different from a dog bark in spectrogram space, just like a cat looks different from a car in pixel space. The core techniques transfer -- the differences are in the data, the augmentation, and the evaluation ;-)

CNN-based audio classification

The simplest and still surprisingly effective approach: compute a Mel spectrogram, treat it as a single-channel grayscale image, and feed it through a convolutional neural network. The architecture is identical to what we built for image classification:

import torch
import torch.nn as nn
import numpy as np


class AudioCNN(nn.Module):
    """CNN for audio classification operating
    on Mel spectrograms."""

    def __init__(self, n_mels=64,
                 num_classes=50):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, mel_spec):
        # mel_spec: (batch, 1, n_mels, time)
        x = self.features(mel_spec)
        return self.classifier(x)


model = AudioCNN(num_classes=50)
# Simulate a batch of 4 spectrograms
# 64 Mel bands, ~4 seconds at hop=512
mel_input = torch.randn(4, 1, 64, 173)
logits = model(mel_input)
print(f"Input: {mel_input.shape}")
print(f"Output: {logits.shape}")
print(f"Parameters: "
      f"{sum(p.numel() for p in model.parameters()):,}")

The architecture should feel very familiar from episode #45: conv -> batchnorm -> relu -> pool, repeated several times, then global average pooling collapses the spatial dimensions and a linear layer maps to class logits. The AdaptiveAvgPool2d((1, 1)) is key -- it lets the model accept spectrograms of any time length, which is essential because audio clips are rarely the exact same duration.
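To make the variable-length point concrete, here is a small NumPy sketch of what global average pooling does to feature maps of different widths. The shapes are assumptions based on the AudioCNN above (three 2x poolings turn 64 Mel bands into 8 rows, and 173 or 431 time frames into 21 or 53 columns):

```python
import numpy as np


def global_avg_pool(feature_map):
    """(batch, channels, freq, time) -> (batch, channels)."""
    return feature_map.mean(axis=(2, 3))


rng = np.random.default_rng(0)
# Feature maps as they would look just before the pooling layer
short_clip = rng.normal(size=(4, 256, 8, 21))  # from ~173 input frames
long_clip = rng.normal(size=(4, 256, 8, 53))   # from ~431 input frames

pooled_short = global_avg_pool(short_clip)
pooled_long = global_avg_pool(long_clip)
print(pooled_short.shape, pooled_long.shape)
```

Both results have 256 values per clip, so the same linear classifier works regardless of clip duration.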

Two design choices worth understanding: 64 Mel bands is a common choice for classification (Whisper uses 80, some systems use 128 -- more bands means finer frequency resolution but also more computation). And the number of time frames depends on your sample rate and hop length: a 4-second clip at 22050 Hz with hop 512 gives about 173 frames.
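The frame count is easy to check yourself. A minimal sketch, assuming the two common STFT conventions: with centered (padded) frames you get 1 + n_samples // hop, and without padding you get 1 + (n_samples - n_fft) // hop, which is the formula used in the Exercise 2 solution. The helper name count_frames is hypothetical:

```python
def count_frames(duration_s, sr, hop, n_fft=1024, center=True):
    """Number of STFT frames for a clip of the given duration."""
    n_samples = int(duration_s * sr)
    if center:
        # Signal is padded so every hop position gets a frame
        return 1 + n_samples // hop
    # No padding: only frames that fit entirely inside the signal
    return 1 + (n_samples - n_fft) // hop


frames_centered = count_frames(4.0, 22050, 512)
frames_unpadded = count_frames(2.0, 16000, 256, n_fft=1024, center=False)
print(frames_centered, frames_unpadded)
```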

This CNN approach achieves roughly 80-85% accuracy on standard benchmarks like ESC-50 (50 environmental sound classes: dog bark, rain, chainsaw, clock alarm, helicopter, etc.) and UrbanSound8K (10 urban sound classes). Not state-of-the-art, but remarkably good for how simple it is.

Audio data augmentation

Just like image classification benefits enormously from augmentations (random crops, flips, color jitter -- episode #14), audio classification needs its own augmentation toolkit. The two levels: waveform augmentation (before computing the spectrogram) and spectrogram augmentation (after):

class AudioAugmentor:
    """Two-level audio augmentation:
    waveform-level and spectrogram-level."""

    def __init__(self, sr=16000, seed=42):
        self.sr = sr
        self.rng = np.random.RandomState(seed)

    def add_noise(self, waveform, snr_db=20):
        """Add Gaussian noise at a given
        signal-to-noise ratio."""
        power = np.mean(waveform ** 2)
        noise_power = power / (
            10 ** (snr_db / 10))
        noise = self.rng.randn(
            len(waveform)) * np.sqrt(
            noise_power)
        return waveform + noise

    def time_shift(self, waveform,
                   max_shift=0.1):
        """Circular shift by random amount."""
        shift = self.rng.randint(
            -int(self.sr * max_shift),
            int(self.sr * max_shift))
        return np.roll(waveform, shift)

    def speed_perturb(self, waveform,
                      lo=0.9, hi=1.1):
        """Change speed (affects pitch too)."""
        factor = self.rng.uniform(lo, hi)
        indices = np.arange(
            0, len(waveform), factor)
        indices = indices[
            indices < len(waveform)
        ].astype(int)
        return waveform[indices]

    def spec_augment(self, mel, n_freq=2,
                     n_time=2, freq_w=10,
                     time_w=20):
        """SpecAugment: mask random frequency
        bands and time segments."""
        aug = mel.copy()
        n_mels, n_frames = aug.shape
        for _ in range(n_freq):
            f = self.rng.randint(0, freq_w)
            f0 = self.rng.randint(
                0, max(n_mels - f, 1))
            aug[f0:f0 + f, :] = 0
        for _ in range(n_time):
            t = self.rng.randint(0, time_w)
            t0 = self.rng.randint(
                0, max(n_frames - t, 1))
            aug[:, t0:t0 + t] = 0
        return aug

    def demo(self):
        n = self.sr * 2
        t = np.arange(n) / self.sr
        sig = 0.5 * np.sin(
            2 * np.pi * 440 * t)

        noisy = self.add_noise(sig, snr_db=20)
        shifted = self.time_shift(sig)
        fast = self.speed_perturb(sig, 1.1, 1.1)

        print("Waveform augmentations:")
        print(f"  Original:  {len(sig)} samples")
        print(f"  + noise:   {len(noisy)} "
              f"(SNR=20dB)")
        print(f"  Shifted:   {len(shifted)}")
        print(f"  Speed 1.1: {len(fast)}")

        mel = self.rng.randn(64, 200) * 10
        aug_mel = self.spec_augment(mel)
        # Count rows/columns that are entirely
        # zero (more robust than sum == 0)
        freq_masked = np.sum(
            np.all(aug_mel == 0, axis=1))
        time_masked = np.sum(
            np.all(aug_mel == 0, axis=0))
        print(f"\nSpecAugment on (64, 200):")
        print(f"  Freq bands zeroed: "
              f"{freq_masked}")
        print(f"  Time frames zeroed: "
              f"{time_masked}")


augmentor = AudioAugmentor()
augmentor.demo()

SpecAugment (Park et al., 2019) deserves special attention because it's one of the most impactful augmentation techniques for audio models. The idea is dead simple: randomly zero out horizontal stripes (frequency masking) and vertical stripes (time masking) in the spectrogram during training. Frequency masking forces the model not to rely on any single frequency band -- if it can only recognize a dog bark by the fundamental frequency, masking that band forces it to learn from harmonics and temporal patterns too. Time masking forces robustness to missing segments. We built this from scratch in episode #92's augmentation section, and here it shows up again because it's genuinely used everywhere in production audio systems.

The waveform-level augmentations serve different purposes: noise injection simulates real-world recording conditions (nobody records in a soundproof studio), time shifting teaches the model that a dog bark at the beginning vs end of a clip is still a dog bark, and speed perturbation adds pitch and tempo variation.
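A quick sanity check for noise injection is to measure the SNR you actually get and compare it to the one you asked for. This is a minimal sketch using the same power-ratio formula as add_noise; the helper names are mine and the 440 Hz test tone is just a convenient stand-in for real audio:

```python
import numpy as np


def add_noise_at_snr(clean, snr_db, rng):
    """Add white Gaussian noise scaled to hit the target SNR."""
    noise_power = np.mean(clean ** 2) / (10 ** (snr_db / 10))
    noise = rng.normal(scale=np.sqrt(noise_power), size=clean.shape)
    return clean + noise


def measured_snr_db(clean, noisy):
    """SNR in dB computed from the clean signal and the noisy mix."""
    noise = noisy - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))


rng = np.random.default_rng(0)
t = np.arange(32000) / 16000
tone = 0.5 * np.sin(2 * np.pi * 440 * t)

results = {}
for target in (30, 20, 10, 0):
    noisy = add_noise_at_snr(tone, target, rng)
    results[target] = measured_snr_db(tone, noisy)
    print(f"target {target:>2} dB -> measured {results[target]:.2f} dB")
```

The measured values land close to the targets but not exactly on them, because the noise power is only matched in expectation. In training you would typically draw the target SNR at random per clip rather than fix it.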

AudioSet: the ImageNet of audio

In the vision world, ImageNet (episode #46) was the dataset that enabled transfer learning and transformed the field. Audio has its equivalent: AudioSet (Gemmeke et al., 2017), a collection of 2 million 10-second YouTube clips annotated with 527 sound event classes. The classes range from "Speech" and "Music" to "Chainsaw", "Church bell", "Burping", and "Skateboard". It's multi-label -- a single clip might be tagged with both "Music" and "Singing" and "Guitar", because those sounds co-occur naturally.

AudioSet pretraining is the standard starting point for almost any audio classification task, the same way ImageNet pretraining is standard for vision. Here's how the pretrained models compare:

class AudioSetModels:
    """Compare AudioSet-pretrained model
    architectures and their performance."""

    def __init__(self):
        self.models = {
            "CNN6": {
                "params_m": 4.7,
                "mAP": 0.343,
                "type": "CNN",
                "year": 2020},
            "CNN10": {
                "params_m": 5.2,
                "mAP": 0.380,
                "type": "CNN",
                "year": 2020},
            "CNN14": {
                "params_m": 80.7,
                "mAP": 0.431,
                "type": "CNN",
                "year": 2020},
            "AST": {
                "params_m": 86.0,
                "mAP": 0.459,
                "type": "Transformer",
                "year": 2021},
            "BEATs": {
                "params_m": 90.0,
                "mAP": 0.498,
                "type": "Transformer",
                "year": 2023},
            "Audio-MAE": {
                "params_m": 86.0,
                "mAP": 0.473,
                "type": "Transformer",
                "year": 2022},
        }

    def compare(self):
        print(f"{'Model':<12} {'Type':<14} "
              f"{'Params':>7} {'mAP':>6} "
              f"{'Year':>5}")
        print("-" * 48)
        for name, m in self.models.items():
            print(
                f"{name:<12} {m['type']:<14} "
                f"{m['params_m']:>5.1f}M "
                f"{m['mAP']:>5.3f} "
                f"{m['year']:>5}")
        print(f"\nmAP = mean Average Precision")
        print(f"Higher is better, 1.0 = perfect")
        print(f"Human estimate: ~0.65 mAP")


models = AudioSetModels()
models.compare()

The PANNs family (Pretrained Audio Neural Networks, Kong et al., 2020) established the CNN baselines: CNN14 achieves 0.431 mAP with a relatively straightforward 14-layer convolutional architecture. Then the transformers arrived. AST (Audio Spectrogram Transformer, Gong et al., 2021) pushed to 0.459 by treating spectrograms as patch sequences -- exactly how ViT (episode #54) treats images. And BEATs (Chen et al., 2023) combined audio-specific pre-training with iterative self-supervised learning to reach 0.498.

That human estimate of ~0.65 mAP tells you something important: AudioSet's 527-class multi-label task is genuinely hard even for people. Many classes are subjective ("Music" vs "Musical instrument"), some clips have barely audible target sounds, and the YouTube-sourced labels aren't perfect.
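Since every number in the comparison is an mAP, it helps to see how the metric is computed. A minimal sketch, assuming the usual definition: per class, rank all clips by score and average the precision at each position where a true positive appears, then take the mean over classes. The function names and the tiny 4-clip, 2-class example are illustrative only:

```python
import numpy as np


def average_precision(y_true, y_score):
    """AP for one class: mean precision at each true-positive rank."""
    order = np.argsort(-y_score, kind="stable")
    hits = y_true[order].astype(float)
    n_pos = hits.sum()
    if n_pos == 0:
        return float("nan")  # class never occurs; skipped in the mean
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / n_pos)


def mean_average_precision(Y_true, Y_score):
    """Mean of per-class AP; inputs have shape (n_clips, n_classes)."""
    aps = [average_precision(Y_true[:, c], Y_score[:, c])
           for c in range(Y_true.shape[1])]
    return float(np.nanmean(aps))


Y_true = np.array([[1, 0], [0, 1], [1, 0], [0, 0]])
Y_score = np.array([[0.9, 0.2], [0.8, 0.9], [0.7, 0.1], [0.1, 0.3]])

ap_class0 = average_precision(Y_true[:, 0], Y_score[:, 0])
map_score = mean_average_precision(Y_true, Y_score)
print(f"AP class 0: {ap_class0:.4f}, mAP: {map_score:.4f}")
```

Note that no threshold appears anywhere: only the ranking of scores matters, which is why mAP is a natural fit for multi-label tagging.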

The Audio Spectrogram Transformer

Let's look at AST more carefully because it's the clearest example of how vision architectures transfer to audio. The key insight: if you can represent audio as a 2D spectrogram, then a spectrogram is just an image, and Vision Transformers work on images. AST literally takes a pretrained ViT (trained on ImageNet!) and fine-tunes it on AudioSet:

class ASTArchitecture:
    """Illustrate how AST converts
    spectrograms to patch sequences for
    a vision transformer."""

    def __init__(self, n_mels=128,
                 n_frames=1024,
                 patch_h=16, patch_w=16,
                 d_model=768):
        self.n_mels = n_mels
        self.n_frames = n_frames
        self.patch_h = patch_h
        self.patch_w = patch_w
        self.d_model = d_model

    def compute_patches(self):
        patches_h = self.n_mels // self.patch_h
        patches_w = self.n_frames // self.patch_w
        total = patches_h * patches_w
        return patches_h, patches_w, total

    def describe(self):
        ph, pw, total = self.compute_patches()
        print(f"Input spectrogram: "
              f"{self.n_mels} x {self.n_frames}")
        print(f"Patch size: "
              f"{self.patch_h} x {self.patch_w}")
        print(f"Grid: {ph} x {pw} = "
              f"{total} patches")
        print(f"Each patch: "
              f"{self.patch_h * self.patch_w} "
              f"values -> projected to "
              f"{self.d_model}d")
        print(f"Sequence length: "
              f"{total} + 1 (CLS token) = "
              f"{total + 1}")
        print(f"\nFor comparison, ViT-B/16 "
              f"on 224x224 images:")
        img_patches = (224 // 16) ** 2
        print(f"  Image patches: "
              f"{img_patches}")
        print(f"  Audio patches: "
              f"{total}")
        print(f"  Audio has "
              f"{total / img_patches:.1f}x "
              f"more patches")
        print(f"\nThis means AST processes "
              f"longer sequences than ViT,")
        print(f"so it needs positional "
              f"embedding interpolation.")


ast = ASTArchitecture()
ast.describe()

The gotcha here is sequence length. A 128 x 1024 Mel spectrogram produces 8 x 64 = 512 patches. A 224x224 image only produces 196 patches. That's 2.6x more tokens for the transformer to process, which means more memory and more computation (self-attention is quadratic in sequence length, as we covered in episode #52). AST handles this by interpolating the positional embeddings from ViT's 196 positions to 512 -- the same technique we discussed in episode #54 for applying ViT to higher-resolution images.
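The 512 figure assumes non-overlapping patches. As far as I know, the published AST configuration uses overlapping 16 x 16 patches with a stride of 10 in both directions, which makes the sequence even longer. Treat that stride value as an assumption here. A small sketch of the general patch-count formula and of the quadratic attention cost (helper names are hypothetical):

```python
def patch_grid(height, width, patch=16, stride=16):
    """Rows, columns and total patches for a sliding patch window."""
    rows = (height - patch) // stride + 1
    cols = (width - patch) // stride + 1
    return rows, cols, rows * cols


def attention_cost_ratio(tokens_a, tokens_b):
    """Relative self-attention cost; quadratic in sequence length."""
    return (tokens_a / tokens_b) ** 2


vit_grid = patch_grid(224, 224, stride=16)
ast_no_overlap = patch_grid(128, 1024, stride=16)
ast_overlap = patch_grid(128, 1024, stride=10)  # assumed AST setting

for name, grid in [("ViT 224x224", vit_grid),
                   ("AST, stride 16", ast_no_overlap),
                   ("AST, stride 10", ast_overlap)]:
    # +1 on both sides accounts for the CLS token
    ratio = attention_cost_ratio(grid[2] + 1, vit_grid[2] + 1)
    print(f"{name:<16} {grid[0]:>3} x {grid[1]:<4} = {grid[2]:>5} "
          f"patches, attention cost {ratio:.1f}x ViT")
```

Because the cost grows with the square of the token count, 2.6x more tokens translates into far more than 2.6x the attention compute.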

The practical result: AST is slower than CNN14 (more compute per sample) but more accurate (0.459 vs 0.431 mAP). For real-time on-device classification (smart speakers, phones), CNNs are still preferred. For batch processing where accuracy matters more than latency, transformers win.

Multi-label classification

Here's where audio classification diverges fundamentally from typical image classification. An image usually has one primary label: "cat", "dog", "car". An audio clip often contains multiple simultaneous sounds: speech over music, birds chirping with wind noise, a siren mixed with traffic. This makes audio tagging a multi-label problem:

class MultiLabelDemo:
    """Demonstrate multi-label vs single-label
    classification differences."""

    def __init__(self):
        self.classes = [
            "Speech", "Music", "Dog", "Siren",
            "Rain", "Traffic", "Bird", "Wind"]

    def single_label(self, logits):
        """Softmax: probabilities sum to 1.
        Forces exactly ONE prediction."""
        exp = np.exp(logits - logits.max())
        probs = exp / exp.sum()
        pred = np.argmax(probs)
        return probs, [self.classes[pred]]

    def multi_label(self, logits,
                    threshold=0.5):
        """Sigmoid: each class independent.
        Multiple classes can be active."""
        probs = 1 / (1 + np.exp(-logits))
        preds = [self.classes[i]
                 for i in range(len(logits))
                 if probs[i] > threshold]
        return probs, preds

    def demo(self):
        # A clip with speech AND music AND rain
        logits = np.array(
            [2.1, 1.8, -1.0, -2.5,
             1.5, -0.3, -1.2, -0.8])

        s_probs, s_preds = self.single_label(
            logits)
        m_probs, m_preds = self.multi_label(
            logits)

        print("Logits:", [f"{l:.1f}" for l
                          in logits])
        print(f"\nSingle-label (softmax):")
        print(f"  Prediction: {s_preds}")
        for i, c in enumerate(self.classes):
            bar = "#" * int(s_probs[i] * 40)
            print(f"  {c:<10} {s_probs[i]:.3f} "
                  f"{bar}")

        print(f"\nMulti-label (sigmoid):")
        print(f"  Predictions: {m_preds}")
        for i, c in enumerate(self.classes):
            bar = "#" * int(m_probs[i] * 40)
            act = " *" if m_probs[i] > 0.5 else ""
            print(f"  {c:<10} {m_probs[i]:.3f} "
                  f"{bar}{act}")

        print(f"\nSoftmax FORCES a single "
              f"winner.")
        print(f"Sigmoid lets Speech + Music + "
              f"Rain all be active.")


demo = MultiLabelDemo()
demo.demo()

The key differences from single-label classification (which is what we've mostly done in this series):

  1. Sigmoid instead of softmax: each class gets an independent probability between 0 and 1. Softmax forces probabilities to sum to 1 (one winner), sigmoid doesn't.
  2. Binary Cross-Entropy (BCE) instead of Cross-Entropy: BCE treats each class as a separate binary classification problem. The loss for "Speech" is computed independently from the loss for "Music".
  3. Threshold-based prediction: instead of argmax (pick the highest), you apply a threshold (typically 0.5) and everything above is considered active.
  4. Evaluation metric: mAP (mean Average Precision) instead of accuracy. mAP captures how well the model ranks positive classes above negative ones, which matters more than the exact threshold.
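To see point 2 in action, here is a minimal NumPy sketch of BCE computed directly on logits, using the same logits as the demo above and assuming the ground truth is Speech + Music + Rain. The numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)) is the standard rewrite of the sigmoid-plus-log formula; the function name is my own:

```python
import numpy as np


def bce_with_logits(logits, targets):
    """Per-class binary cross-entropy, computed stably from logits."""
    return (np.maximum(logits, 0) - logits * targets
            + np.log1p(np.exp(-np.abs(logits))))


classes = ["Speech", "Music", "Dog", "Siren",
           "Rain", "Traffic", "Bird", "Wind"]
logits = np.array([2.1, 1.8, -1.0, -2.5, 1.5, -0.3, -1.2, -0.8])
targets = np.array([1, 1, 0, 0, 1, 0, 0, 0], dtype=float)

per_class = bce_with_logits(logits, targets)
loss = per_class.mean()
for name, value in zip(classes, per_class):
    print(f"  {name:<10} {value:.4f}")
print(f"Mean BCE: {loss:.4f}")

# Changing the Music logit leaves the Speech loss untouched
changed = logits.copy()
changed[1] = -3.0
per_class_changed = bce_with_logits(changed, targets)
print(f"Speech loss before/after: "
      f"{per_class[0]:.4f} / {per_class_changed[0]:.4f}")
```

With softmax and cross-entropy the second experiment would behave differently: lowering one logit raises the probabilities of all the others, so the classes are coupled.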

Sound event detection: the temporal dimension

Classification tells you what sounds are in a clip. Sound Event Detection (SED) tells you when each sound occurs -- start time and end time. This is audio's equivalent of object detection (episode #78): instead of spatial bounding boxes around objects in an image, you predict temporal segments for sounds in a recording:

class SoundEventDetector:
    """Simulate frame-level sound event
    detection with post-processing."""

    def __init__(self, sr=16000, hop=512,
                 n_classes=5):
        self.sr = sr
        self.hop = hop
        self.n_classes = n_classes
        self.classes = [
            "Speech", "Music", "Dog",
            "Siren", "Bird"]

    def simulate_predictions(self, duration):
        """Create realistic frame-level
        predictions for demo purposes."""
        rng = np.random.RandomState(42)
        n_frames = int(
            duration * self.sr / self.hop)
        preds = np.zeros(
            (self.n_classes, n_frames))
        # Speech from 1-4s
        s1 = int(1.0 * self.sr / self.hop)
        s2 = int(4.0 * self.sr / self.hop)
        preds[0, s1:s2] = rng.uniform(
            0.7, 0.95, s2 - s1)
        # Dog bark at 2.5-3.5s
        d1 = int(2.5 * self.sr / self.hop)
        d2 = int(3.5 * self.sr / self.hop)
        preds[2, d1:d2] = rng.uniform(
            0.6, 0.85, d2 - d1)
        # Add noise everywhere
        preds += rng.uniform(
            0, 0.15,
            (self.n_classes, n_frames))
        return np.clip(preds, 0, 1)

    def detect_events(self, preds,
                      threshold=0.5,
                      min_dur_frames=3):
        """Convert frame-level predictions
        to discrete events."""
        events = []
        for c in range(self.n_classes):
            active = preds[c] > threshold
            in_event = False
            start = 0
            for f in range(len(active)):
                if active[f] and not in_event:
                    start = f
                    in_event = True
                elif not active[f] and in_event:
                    dur = f - start
                    if dur >= min_dur_frames:
                        events.append({
                            "class": self.classes[c],
                            "start": start * self.hop / self.sr,
                            "end": f * self.hop / self.sr,
                            "confidence": float(
                                preds[c, start:f].mean())})
                    in_event = False
            if in_event:
                dur = len(active) - start
                if dur >= min_dur_frames:
                    events.append({
                        "class": self.classes[c],
                        "start": start * self.hop / self.sr,
                        "end": len(active) * self.hop / self.sr,
                        "confidence": float(
                            preds[c, start:].mean())})
        return sorted(
            events, key=lambda e: e["start"])

    def run(self):
        duration = 6.0
        preds = self.simulate_predictions(
            duration)
        n_frames = preds.shape[1]
        print(f"Audio: {duration:.0f}s, "
              f"{n_frames} frames")
        print(f"Frame predictions: "
              f"{preds.shape}")

        events = self.detect_events(preds)
        print(f"\nDetected events:")
        for e in events:
            print(f"  [{e['start']:.1f}s - "
                  f"{e['end']:.1f}s] "
                  f"{e['class']} "
                  f"(conf: {e['confidence']:.2f})")


detector = SoundEventDetector()
detector.run()

The model produces per-frame, per-class probabilities. Post-processing converts these continuous probabilities into discrete events by: (1) applying a threshold to get binary frame-level decisions, (2) grouping consecutive positive frames into event segments, (3) filtering out events shorter than a minimum duration (to remove spurious detections), and (4) optionally merging events that are separated by very small gaps. This post-processing pipeline is analogous to Non-Maximum Suppression in object detection -- the raw model outputs need cleanup before they're useful.
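The detector above covers steps (1) to (3). Step (4), gap merging, could look something like the sketch below. It assumes the event dict layout produced by `detect_events` ("class", "start", "end", "confidence"); the function name `merge_close_events`, the `max_gap` default of 0.2 seconds, and the duration-weighted confidence are my own assumptions, not a standard.

```python
def merge_close_events(events, max_gap=0.2):
    """Merge same-class events separated by at most max_gap seconds.

    `events` is a list of dicts with the keys "class", "start",
    "end" and "confidence". The input list is not modified.
    """
    merged = []
    last_by_class = {}  # most recent merged event per class
    ordered = sorted(
        events, key=lambda e: (e["class"], e["start"]))
    for e in ordered:
        prev = last_by_class.get(e["class"])
        if (prev is not None
                and e["start"] - prev["end"] <= max_gap):
            # Combine confidences, weighted by event duration
            d_prev = prev["end"] - prev["start"]
            d_cur = e["end"] - e["start"]
            total = d_prev + d_cur
            if total > 0:
                prev["confidence"] = (
                    prev["confidence"] * d_prev
                    + e["confidence"] * d_cur) / total
            prev["end"] = max(prev["end"], e["end"])
        else:
            cur = dict(e)  # copy so the caller's dicts stay intact
            merged.append(cur)
            last_by_class[e["class"]] = cur
    return sorted(merged, key=lambda e: e["start"])


# Illustrative input: two dog barks 0.1s apart, one much later
events = [
    {"class": "Dog", "start": 1.0, "end": 1.5,
     "confidence": 0.8},
    {"class": "Dog", "start": 1.6, "end": 2.0,
     "confidence": 0.6},
    {"class": "Speech", "start": 0.5, "end": 3.0,
     "confidence": 0.9},
    {"class": "Dog", "start": 4.0, "end": 4.5,
     "confidence": 0.7},
]
merged = merge_close_events(events, max_gap=0.2)
for e in merged:
    print(f"  [{e['start']:.1f}s - {e['end']:.1f}s] "
          f"{e['class']} (conf: {e['confidence']:.2f})")
```

The first two barks collapse into one event, the third stays separate because the gap is far larger than `max_gap`. Choosing `max_gap` is a per-class judgment call: a value that suits speech pauses may glue together distinct door slams.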

The DCASE (Detection and Classification of Acoustic Scenes and Events) challenge series is the main benchmark for SED. Systems are commonly evaluated with an event-based F1 score that uses a tolerance window: a detected event counts as correct when it has the same class as a ground-truth event and its boundaries fall within a set time margin of the reference boundaries.
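As a rough illustration of tolerance-based scoring, here is a simplified sketch that matches events on class and onset only, with a 0.2-second collar and greedy one-to-one matching. The function name `event_based_f1`, the tuple layout, and the collar value are assumptions for this example; the official DCASE tooling is more elaborate (it can also check offsets, for instance).

```python
def event_based_f1(reference, detected, collar=0.2):
    """Simplified onset-only event-based F1.

    Events are (class, start, end) tuples in seconds. A detected
    event is a true positive if an unmatched reference event has
    the same class and an onset within `collar` seconds.
    """
    matched_ref = set()
    tp = 0
    for cls, start, _end in detected:
        for i, (r_cls, r_start, _r_end) in enumerate(reference):
            if i in matched_ref:
                continue  # each reference event matches once
            if cls == r_cls and abs(start - r_start) <= collar:
                matched_ref.add(i)
                tp += 1
                break
    fp = len(detected) - tp
    fn = len(reference) - tp
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * prec * rec / (prec + rec)
          if prec + rec else 0.0)
    return prec, rec, f1


# Illustrative ground truth and detections
reference = [("Speech", 1.0, 4.0), ("Dog", 2.5, 3.5)]
detected = [
    ("Speech", 1.1, 3.9),  # onset 0.1s off -> match
    ("Dog", 3.0, 3.5),     # onset 0.5s off -> miss
    ("Siren", 0.5, 1.0),   # no such reference event
]
prec, rec, f1 = event_based_f1(reference, detected)
print(f"Precision {prec:.2f}  Recall {rec:.2f}  F1 {f1:.2f}")
```

The dog detection overlaps the true bark for half a second and still counts as wrong here, because its onset is outside the collar. That strictness is why event-based scores tend to be much lower than frame-level ones.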

Complete classification pipeline

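Let me put everything together into a pipeline skeleton that goes from raw waveforms to evaluated predictions. Preprocessing and evaluation are implemented in full; the model's predictions are simulated in the demo, and the augmentation from earlier would slot in between preprocessing and the model: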

class AudioClassificationPipeline:
    """Complete pipeline: load audio ->
    preprocess -> augment -> classify ->
    evaluate."""

    def __init__(self, sr=16000, n_mels=64,
                 duration=4.0, n_fft=1024,
                 hop=512, num_classes=10):
        self.sr = sr
        self.n_mels = n_mels
        self.max_samples = int(sr * duration)
        self.n_fft = n_fft
        self.hop = hop
        self.num_classes = num_classes

    def preprocess(self, waveform):
        """Normalize, pad/truncate, compute
        Mel spectrogram."""
        peak = np.abs(waveform).max()
        if peak > 0:
            waveform = waveform / peak
        if len(waveform) > self.max_samples:
            waveform = waveform[
                :self.max_samples]
        else:
            pad = self.max_samples - len(
                waveform)
            waveform = np.pad(
                waveform, (0, pad))
        return self._mel_spec(waveform)

    def _mel_spec(self, signal):
        """Compute log-Mel spectrogram."""
        window = np.hanning(self.n_fft)
        n_frames = (len(signal) - self.n_fft
                    ) // self.hop + 1
        n_bins = self.n_fft // 2 + 1
        spec = np.zeros((n_bins, n_frames))
        for i in range(n_frames):
            s = i * self.hop
            frame = signal[s:s + self.n_fft]
            spec[:, i] = np.abs(
                np.fft.rfft(frame * window))
        fb = self._mel_fb(n_bins)
        mel = fb @ (spec ** 2)
        return 10.0 * np.log10(mel + 1e-10)

    def _mel_fb(self, n_bins):
        fmax = self.sr // 2
        mel_lo = 2595 * np.log10(1 + 0 / 700)
        mel_hi = 2595 * np.log10(
            1 + fmax / 700)
        pts = np.linspace(
            mel_lo, mel_hi, self.n_mels + 2)
        hz = 700 * (10 ** (pts / 2595) - 1)
        bins = np.floor(
            (self.n_fft + 1) * hz / self.sr
        ).astype(int)
        fb = np.zeros((self.n_mels, n_bins))
        for i in range(self.n_mels):
            lo, mid, hi = (bins[i],
                bins[i+1], bins[i+2])
            if mid > lo:
                for k in range(lo, mid):
                    fb[i, k] = (
                        (k - lo) / (mid - lo))
            if hi > mid:
                for k in range(mid, hi + 1):
                    if k < n_bins:
                        fb[i, k] = (
                            (hi - k)
                            / (hi - mid))
        return fb

    def evaluate_predictions(self, y_true,
                              y_pred):
        """Compute accuracy and per-class
        precision/recall."""
        correct = sum(
            t == p for t, p
            in zip(y_true, y_pred))
        acc = correct / len(y_true)
        # Per-class stats
        classes = set(y_true)
        print(f"Overall accuracy: "
              f"{acc:.1%} "
              f"({correct}/{len(y_true)})")
        print(f"\n{'Class':>6} {'Prec':>6} "
              f"{'Recall':>6} {'Count':>6}")
        print("-" * 28)
        for c in sorted(classes):
            tp = sum(1 for t, p
                     in zip(y_true, y_pred)
                     if t == c and p == c)
            fp = sum(1 for t, p
                     in zip(y_true, y_pred)
                     if t != c and p == c)
            fn = sum(1 for t, p
                     in zip(y_true, y_pred)
                     if t == c and p != c)
            prec = tp / (tp + fp) if (
                tp + fp) > 0 else 0
            rec = tp / (tp + fn) if (
                tp + fn) > 0 else 0
            count = sum(
                1 for t in y_true if t == c)
            print(f"{c:>6} {prec:>6.1%} "
                  f"{rec:>6.1%} {count:>6}")

    def run_demo(self):
        rng = np.random.RandomState(42)
        n = self.max_samples
        t = np.arange(n) / self.sr
        sigs = []
        for freq in [200, 500, 1000, 2000,
                     4000]:
            sig = np.sin(
                2 * np.pi * freq * t)
            sig += rng.randn(n) * 0.1
            sigs.append(sig)

        print("Preprocessing 5 signals:")
        for i, sig in enumerate(sigs):
            mel = self.preprocess(sig)
            print(f"  Signal {i}: waveform "
                  f"{len(sig)} -> Mel "
                  f"{mel.shape}")

        # Simulated evaluation
        y_true = [0, 0, 1, 1, 2, 2, 3, 3,
                  4, 4, 0, 1, 2, 3, 4]
        y_pred = [0, 0, 1, 2, 2, 2, 3, 4,
                  4, 4, 0, 1, 1, 3, 4]
        print()
        self.evaluate_predictions(
            y_true, y_pred)


pipeline = AudioClassificationPipeline()
pipeline.run_demo()

The pipeline encapsulates the standard workflow: load -> normalize -> pad/truncate to fixed length -> Mel spectrogram -> feed to model -> evaluate. The evaluation uses per-class precision and recall, which is more informative than overall accuracy (same lesson we learned way back in episode #13). A model might have 90% accuracy but completely fail on rare classes -- per-class metrics reveal that.

What makes audio classification hard?

Worth being honest about the challenges that make this problem harder than it might seem:

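Class imbalance: in AudioSet, "Music" appears in 40% of clips. "Skateboard" appears in 0.03%. A model can lean heavily on the frequent classes and still post a respectable clip-level score while almost never detecting the rare ones. Because mAP averages over classes, those rare classes count just as much as "Music", so the weakness does show up there. Stratified sampling, class weighting, and focal loss (techniques from episodes #14 and #40) are essential.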
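As a reminder of how focal loss helps here, below is a minimal NumPy sketch of the binary version, which fits the sigmoid + BCE setup. The function name `binary_focal_loss` and the example probabilities are mine; `gamma=2.0` and `alpha=0.25` are commonly used defaults, not values tuned for AudioSet.

```python
import numpy as np


def binary_focal_loss(probs, targets, gamma=2.0,
                      alpha=0.25, eps=1e-7):
    """Per-element binary focal loss.

    probs:   predicted probabilities (after sigmoid)
    targets: 0/1 labels, same shape as probs
    (1 - p_t) ** gamma shrinks the loss of easy examples;
    alpha balances positives against negatives.
    """
    p = np.clip(probs, eps, 1.0 - eps)
    # p_t is the probability assigned to the TRUE label
    p_t = np.where(targets == 1, p, 1.0 - p)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)


# Two positive examples: one easy (0.95), one hard (0.30)
probs = np.array([0.95, 0.30])
targets = np.array([1, 1])

bce = -np.log(probs)  # plain BCE for positive targets
fl = binary_focal_loss(probs, targets)

print(f"BCE   easy={bce[0]:.4f}  hard={bce[1]:.4f}  "
      f"ratio={bce[1] / bce[0]:.0f}x")
print(f"Focal easy={fl[0]:.6f}  hard={fl[1]:.4f}  "
      f"ratio={fl[1] / fl[0]:.0f}x")
```

With plain BCE the hard example matters a few dozen times more than the easy one; with focal loss the gap grows by orders of magnitude. For a class like "Music" where most examples are easy, that keeps the gradient budget free for the rare, hard cases.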

Label noise: AudioSet labels come from human annotators watching YouTube videos. They miss sounds, they mislabel sounds, they disagree with each other. Some clips are labeled "Speech" but the speech is barely audible under loud music. Training with noisy labels requires either label cleaning or noise-robust loss functions.

Domain shift: a model trained on YouTube audio (compressed, variable quality, background music) might fail on a security camera recording (high-quality, quiet background, specific reverb characteristics). This is the same domain shift problem we discussed in episode #35 -- your training distribution needs to match your deployment conditions, or you need domain adaptation techniques.

Temporal granularity: is a "door slam" a 0.3-second event or a 2-second event? It depends on whether you include the reverb. Different annotators draw different temporal boundaries, and your model needs to handle this ambiguity. This is why SED evaluation uses tolerance windows rather than exact match.

To summarize

  • Audio classification assigns labels to sound recordings; since Mel spectrograms can be treated as 2D images, CNN and transformer architectures from the vision arc transfer directly;
  • SpecAugment (frequency masking + time masking) is one of the most effective audio augmentations, forcing models to learn robust features instead of relying on specific spectral regions;
  • AudioSet (2 million YouTube clips, 527 classes) is the ImageNet of audio; pretrained models like CNN14 and AST provide strong starting points for transfer learning;
  • the Audio Spectrogram Transformer (AST) treats spectrograms as patch sequences (exactly like ViT treats images), achieving 0.459 mAP on AudioSet versus CNN14's 0.431;
  • audio tagging is inherently multi-label (sigmoid + BCE loss, not softmax + cross-entropy) because multiple sounds naturally co-occur in real recordings;
  • Sound Event Detection (SED) extends classification with temporal localization -- predicting per-frame per-class probabilities, then post-processing into discrete events with start/end times;
  • practical challenges include class imbalance, label noise, domain shift, and temporal granularity -- the same issues that affect classification in other domains, but amplified by the messy nature of real-world audio.

We've covered how to classify what sounds are in a recording and when they happen. Next time we'll look at a creative application of audio AI that takes this in a completely different direction: generating music. The models need to understand not just what music sounds like but concepts like rhythm, harmony, and structure over much longer time horizons than speech or environmental sounds.

Exercises

Exercise 1: Build a spectrogram-based audio feature comparator. Create a class AudioFeatureComparator that: (a) generates 5 synthetic audio signals (each 2 seconds at 16,000 Hz): (1) a pure 440 Hz sine wave, (2) white noise (random seed 42), (3) a frequency sweep from 200 to 4000 Hz, (4) a "speech-like" signal (sum of 150 Hz, 300 Hz, 900 Hz, 2500 Hz with amplitude modulation at 4 Hz), (5) a "music-like" signal (440 Hz + 554 Hz + 659 Hz chord with slow vibrato at 5 Hz), (b) computes for each signal: the Mel spectrogram (64 bands, n_fft=1024, hop=256), spectral centroid (the energy-weighted mean frequency across all frames), spectral flatness (geometric mean divided by arithmetic mean of magnitudes -- high for noise, low for tonal signals), temporal variation (mean absolute difference between consecutive Mel frames), (c) prints a comparison table with all 5 signals and their features. Verify that: white noise has the highest spectral flatness, the pure tone has the lowest spectral flatness, the speech-like signal has high temporal variation (because of amplitude modulation), and the sweep has a centroid between the pure tone and noise.

Exercise 2: Build a multi-label evaluation toolkit. Create a class MultiLabelEvaluator that: (a) implements compute_ap(y_true, y_scores) that computes Average Precision for a single class: sort samples by descending score, sweep through thresholds, compute precision and recall at each threshold, and compute the area under the precision-recall curve (using the trapezoidal rule), (b) implements compute_map(Y_true, Y_scores) that computes mean Average Precision across all classes, (c) creates a test scenario with 5 classes and 20 samples where: class 0 has perfect predictions (AP = 1.0), class 1 has near-perfect predictions (AP > 0.9), class 2 has moderate predictions (AP around 0.5-0.7), class 3 has random predictions (AP around 0.2-0.5), class 4 has inverted predictions (AP near 0.0), (d) prints the AP for each class and the overall mAP, (e) verifies the ordering: AP for class 0 > class 1 > class 2 > class 3 > class 4, and that the overall mAP is the arithmetic mean of per-class APs.

Exercise 3: Build a sound event post-processor. Create a class SEDPostProcessor that: (a) generates synthetic frame-level predictions for a 10-second clip (312 frames at 16 kHz with a hop of 512) with 3 classes: (1) "Speech" active from frames 20-120 with probability 0.8, (2) "Music" active from frames 50-250 with probability 0.7, (3) "Dog" with three short bursts at frames 40-55, 100-110, and 200-215 with probability 0.75, plus Gaussian noise (std=0.1) added to all predictions, (b) implements three post-processing stages: (1) threshold(preds, t=0.5) -- binary decisions per frame, (2) fill_gaps(binary, max_gap=5) -- merge events separated by fewer than max_gap frames, (3) remove_short(binary, min_dur=8) -- remove events shorter than min_dur frames, (c) applies all three stages in sequence and extracts events (class, start_frame, end_frame) after each stage, (d) prints the events after each stage for all 3 classes, showing how the post-processing cleans up the predictions. Verify that: after thresholding, the Dog class has 3 separate events; after gap-filling, some may merge if gaps are small enough; after minimum duration filtering, any very short noise-triggered events are removed.

Bye!

@scipio


