Learn AI Series (#95) - Audio Classification

What will I learn
- You will learn environmental sound classification: identifying sounds like sirens, dogs barking, and rain using spectrograms and CNNs;
- audio data augmentation: SpecAugment and waveform-level techniques that make models robust to real-world noise;
- AudioSet: the ImageNet of audio, with 2 million clips across 527 sound classes, and how to use pretrained models;
- the Audio Spectrogram Transformer (AST): applying the Vision Transformer architecture to audio spectrograms;
- multi-label audio tagging: handling clips where multiple sounds happen simultaneously using sigmoid activations and BCE loss;
- sound event detection: predicting not just what sounds are present but when they occur in a recording;
- building a complete audio classification pipeline from raw waveforms to predictions.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs
- Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations
- Learn AI Series (#68) - Building AI Agents (Part 2) - Advanced Patterns
- Learn AI Series (#69) - Fine-Tuning Language Models
- Learn AI Series (#70) - Running Local Models
- Learn AI Series (#71) - Text Generation Techniques
- Learn AI Series (#72) - Tokenization Deep Dive
- Learn AI Series (#73) - LLM Evaluation
- Learn AI Series (#74) - The Hugging Face Ecosystem
- Learn AI Series (#75) - Multimodal Models - Text Meets Vision
- Learn AI Series (#76) - Mini Project - Your Own AI Assistant
- Learn AI Series (#77) - Image Processing Fundamentals
- Learn AI Series (#78) - Object Detection (Part 1) - Foundations
- Learn AI Series (#79) - Object Detection (Part 2) - Modern Approaches
- Learn AI Series (#80) - Image Segmentation
- Learn AI Series (#81) - Pose Estimation and Tracking
- Learn AI Series (#82) - Optical Character Recognition
- Learn AI Series (#83) - Video Understanding
- Learn AI Series (#84) - Generative Images - Diffusion Models (Part 1)
- Learn AI Series (#85) - Generative Images - Diffusion Models (Part 2)
- Learn AI Series (#86) - Image-to-Image and Editing
- Learn AI Series (#87) - 3D Vision
- Learn AI Series (#88) - Face Analysis
- Learn AI Series (#89) - Medical and Scientific Imaging
- Learn AI Series (#90) - Self-Supervised Learning for Vision
- Learn AI Series (#91) - Mini Project - Building a Visual AI System
- Learn AI Series (#92) - Audio Fundamentals for AI
- Learn AI Series (#93) - Speech Recognition
- Learn AI Series (#94) - Text-to-Speech (TTS)
- Learn AI Series (#95) - Audio Classification (this post)
Solutions to Episode #94 Exercises
Exercise 1: Text normalization benchmark.
import re

class TextNormBenchmark:
    def __init__(self):
        self.ones = [
            "", "one", "two", "three", "four",
            "five", "six", "seven", "eight",
            "nine", "ten", "eleven", "twelve",
            "thirteen", "fourteen", "fifteen",
            "sixteen", "seventeen", "eighteen",
            "nineteen"]
        self.tens = [
            "", "", "twenty", "thirty", "forty",
            "fifty", "sixty", "seventy", "eighty",
            "ninety"]
        self.ordinals_special = {
            1: "first", 2: "second", 3: "third",
            5: "fifth", 8: "eighth", 9: "ninth",
            12: "twelfth", 20: "twentieth"}
        self.abbrevs = {
            "Mr.": "mister", "Mrs.": "missus",
            "Dr.": "doctor", "Prof.": "professor",
            "vs.": "versus", "etc.": "et cetera",
            "Jr.": "junior", "Sr.": "senior"}

    def num_to_words(self, n):
        if n < 0 or n > 9999:
            return str(n)
        if n == 0:
            return "zero"
        if n < 20:
            return self.ones[n]
        if n < 100:
            t = self.tens[n // 10]
            o = self.ones[n % 10]
            return f"{t} {o}".strip() if o else t
        if n < 1000:
            h = self.ones[n // 100] + " hundred"
            rem = n % 100
            if rem == 0:
                return h
            return h + " " + self.num_to_words(rem)
        th = self.num_to_words(n // 1000)
        rem = n % 1000
        if rem == 0:
            return th + " thousand"
        return (th + " thousand "
                + self.num_to_words(rem))

    def num_to_ordinal(self, n):
        if n in self.ordinals_special:
            return self.ordinals_special[n]
        # Compound numbers: only the last part takes
        # the ordinal form ("twenty first")
        if 100 < n <= 9999 and n % 100:
            return (self.num_to_words(n - n % 100)
                    + " "
                    + self.num_to_ordinal(n % 100))
        if 20 < n < 100 and n % 10:
            return (self.tens[n // 10] + " "
                    + self.num_to_ordinal(n % 10))
        w = self.num_to_words(n)
        if w.endswith("y"):
            return w[:-1] + "ieth"
        if w.endswith("e"):
            return w[:-1] + "th"
        return w + "th"

    def normalize_currency(self, text):
        def repl(m):
            d = int(m.group(1))
            c = int(m.group(2)) if m.group(2) else 0
            parts = []
            if d > 0:
                parts.append(
                    self.num_to_words(d)
                    + " dollar"
                    + ("s" if d != 1 else ""))
            if c > 0:
                parts.append(
                    self.num_to_words(c)
                    + " cent"
                    + ("s" if c != 1 else ""))
            if not parts:
                return "zero dollars"
            return " and ".join(parts)
        return re.sub(
            r'\$(\d+)(?:\.(\d{2}))?', repl, text)

    def normalize_time(self, text):
        def repl(m):
            h, mi = int(m.group(1)), int(m.group(2))
            if mi == 0:
                return (self.num_to_words(h)
                        + " o'clock")
            return (self.num_to_words(h) + " "
                    + self.num_to_words(mi))
        return re.sub(
            r'(\d{1,2}):(\d{2})', repl, text)

    def normalize_ordinals(self, text):
        def repl(m):
            return self.num_to_ordinal(
                int(m.group(1)))
        return re.sub(
            r'(\d+)(?:st|nd|rd|th)\b', repl, text)

    def normalize_abbrevs(self, text):
        for abbr, full in self.abbrevs.items():
            text = text.replace(abbr, full)
        return text

    def normalize(self, text):
        text = self.normalize_currency(text)
        text = self.normalize_time(text)
        text = self.normalize_ordinals(text)
        text = self.normalize_abbrevs(text)
        return text

    def run_tests(self):
        cases = [
            ("$12.99", "twelve dollars and "
                       "ninety nine cents"),
            ("$0.01", "one cent"),
            ("$1.00", "one dollar"),
            ("3:45", "three forty five"),
            ("12:00", "twelve o'clock"),
            ("1st", "first"),
            ("2nd", "second"),
            ("3rd", "third"),
            ("21st", "twenty first"),
            ("Mr.", "mister"),
            ("Dr.", "doctor"),
            ("etc.", "et cetera"),
            ("$100.50",
             "one hundred dollars and "
             "fifty cents"),
            ("5:09", "five nine"),
            ("4th", "fourth"),
        ]
        passed = 0
        for inp, expected in cases:
            got = self.normalize(inp)
            ok = got == expected
            passed += int(ok)
            status = "PASS" if ok else "FAIL"
            if not ok:
                print(f" {status}: '{inp}' -> "
                      f"'{got}' (expected "
                      f"'{expected}')")
        print(f"Tests: {passed}/{len(cases)}")

bench = TextNormBenchmark()
bench.run_tests()
text = ("Dr. Smith paid $12.99 at 3:30 "
        "for his 2nd visit. Mr. Jones "
        "owes $0.50 etc.")
print(f"\nNormalized: {bench.normalize(text)}")
The normalizer handles the most common patterns you'd need for a TTS front-end. The tricky bit is ordinals -- "twenty first" requires combining the tens word with the special ordinal form. Real production normalizers handle many more edge cases (dates, addresses, phone numbers, Roman numerals), but this covers the core logic.
Exercise 2: Mel spectrogram reconstruction evaluator.
import numpy as np

class MelReconstructionEval:
    def __init__(self, sr=16000, n_mels=80,
                 n_fft=1024, hop=256):
        self.sr = sr
        self.n_mels = n_mels
        self.n_fft = n_fft
        self.hop = hop

    def generate_mel(self, freqs, duration=2.0):
        n = int(self.sr * duration)
        t = np.arange(n) / self.sr
        sig = np.zeros(n)
        for f0, f1 in freqs:
            phase = 2 * np.pi * (
                f0 * t + (f1 - f0)
                / (2 * duration) * t ** 2)
            sig += 0.3 * np.sin(phase)
        mod = 0.5 + 0.5 * np.sin(
            2 * np.pi * 4 * t)
        sig *= mod
        window = np.hanning(self.n_fft)
        n_frames = (n - self.n_fft) // self.hop + 1
        n_bins = self.n_fft // 2 + 1
        spec = np.zeros((n_bins, n_frames))
        for i in range(n_frames):
            s = i * self.hop
            frame = sig[s:s + self.n_fft] * window
            spec[:, i] = np.abs(
                np.fft.rfft(frame)) ** 2
        fb = self._mel_fb(n_bins)
        mel = fb @ spec
        return 10.0 * np.log10(mel + 1e-10)

    def _mel_fb(self, n_bins):
        fmax = self.sr // 2
        mel_lo = 2595 * np.log10(1 + 0 / 700)
        mel_hi = 2595 * np.log10(1 + fmax / 700)
        pts = np.linspace(
            mel_lo, mel_hi, self.n_mels + 2)
        hz = 700 * (10 ** (pts / 2595) - 1)
        bins = np.floor(
            (self.n_fft + 1) * hz / self.sr
        ).astype(int)
        fb = np.zeros((self.n_mels, n_bins))
        for i in range(self.n_mels):
            lo, mid, hi = (bins[i],
                           bins[i+1], bins[i+2])
            if mid > lo:
                for k in range(lo, mid):
                    fb[i, k] = (
                        (k - lo) / (mid - lo))
            if hi > mid:
                for k in range(mid, hi + 1):
                    if k < n_bins:
                        fb[i, k] = (
                            (hi - k) / (hi - mid))
        return fb

    def compute_mcd(self, ref, gen):
        n_cep = 13
        n = min(ref.shape[0], n_cep)
        basis = np.zeros((n, ref.shape[0]))
        for k in range(n):
            for j in range(ref.shape[0]):
                basis[k, j] = np.cos(
                    np.pi * k * (2*j+1)
                    / (2 * ref.shape[0]))
        ref_c = basis @ ref
        gen_c = basis @ gen
        diff = ref_c[1:] - gen_c[1:]
        dist = np.sqrt(
            (diff ** 2).sum(axis=0))
        return float(dist.mean())

    def compute_sc(self, ref, gen):
        diff_norm = np.sqrt(
            ((ref - gen) ** 2).sum())
        ref_norm = np.sqrt((ref ** 2).sum())
        return float(diff_norm / ref_norm)

    def run(self):
        rng = np.random.RandomState(42)
        ref = self.generate_mel(
            [(150, 3000)])
        exc = ref + rng.randn(
            *ref.shape) * 0.5
        mod_noise = rng.randn(
            *ref.shape) * 2.0
        mod = ref + mod_noise
        for b in rng.choice(
                80, 3, replace=False):
            mod[b, :] = ref.min()
        poor = self.generate_mel(
            [(2000, 500)])
        print(f"{'Quality':<12} {'MCD':>8} "
              f"{'SpConv':>8}")
        print("-" * 30)
        for name, gen in [("excellent", exc),
                          ("moderate", mod),
                          ("poor", poor)]:
            mcd = self.compute_mcd(ref, gen)
            sc = self.compute_sc(ref, gen)
            print(f"{name:<12} {mcd:>8.3f} "
                  f"{sc:>8.4f}")

evaluator = MelReconstructionEval()
evaluator.run()
MCD (Mel Cepstral Distortion) is the standard metric for evaluating vocoder quality in TTS. It measures the distance in cepstral space -- which corresponds roughly to perceptual similarity. Lower MCD means the generated audio sounds more like the reference. The "excellent" version should have the lowest values, confirming that small additive noise causes minimal perceptual distortion.
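One detail worth knowing: the `compute_mcd` method above reports a raw Euclidean distance between cepstral frames, while MCD figures in the literature are usually quoted in dB using the conventional scaling constant (10 / ln 10) * sqrt(2). Here is a minimal sketch of that per-frame formula in plain Python. The helper name `mcd_db` and the toy coefficient values are made up for illustration, and it assumes the frames are already time-aligned with the 0th (energy) coefficient dropped:

```python
import math

def mcd_db(ref_frames, gen_frames):
    """Mean Mel Cepstral Distortion in dB over aligned frames.

    Each frame is a list of cepstral coefficients with the 0th
    (energy) coefficient already removed.
    """
    if len(ref_frames) != len(gen_frames) or not ref_frames:
        raise ValueError("need equal, non-zero frame counts")
    # Conventional MCD scaling: (10 / ln 10) * sqrt(2)
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, gen in zip(ref_frames, gen_frames):
        sq = sum((r - g) ** 2 for r, g in zip(ref, gen))
        total += scale * math.sqrt(sq)
    return total / len(ref_frames)

# Toy cepstral frames (illustrative values only)
ref = [[1.0, 0.5, -0.2], [0.8, 0.4, -0.1]]
same = mcd_db(ref, ref)
off = mcd_db(ref, [[1.1, 0.5, -0.2], [0.8, 0.3, -0.1]])
print(f"identical: {same:.3f} dB, perturbed: {off:.3f} dB")
```

The ranking of the three quality levels does not change with the constant; it only matters when you compare your numbers against published ones.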
Exercise 3: Vocoder upsampling simulator.
import numpy as np

class VocoderUpsampleSim:
    def __init__(self, n_mels=80, n_frames=100,
                 hop=256, sr=16000):
        self.n_mels = n_mels
        self.n_frames = n_frames
        self.hop = hop
        self.sr = sr
        rng = np.random.RandomState(42)
        self.mel = np.zeros(
            (n_mels, n_frames))
        for f in range(n_frames):
            for b in range(10, 31):
                self.mel[b, f] = rng.uniform(
                    0.5, 1.0)
            self.mel[:, f] += rng.randn(
                n_mels) * 0.05

    def nearest_neighbor(self):
        samples = np.repeat(
            self.mel.mean(axis=0), self.hop)
        kernel = np.ones(16) / 16
        return np.convolve(
            samples, kernel, mode='same')

    def linear_interp(self):
        means = self.mel.mean(axis=0)
        x_mel = np.arange(
            self.n_frames) * self.hop
        x_out = np.arange(
            self.n_frames * self.hop)
        return np.interp(x_out, x_mel, means)

    def transposed_conv(self):
        rng = np.random.RandomState(123)
        out_len = self.n_frames * self.hop
        output = np.zeros(out_len)
        kernel = rng.randn(self.hop) * 0.1
        means = self.mel.mean(axis=0)
        for i in range(self.n_frames):
            start = i * self.hop
            end = start + self.hop
            output[start:end] += (
                means[i] * kernel)
        return output

    def analyze(self, signal, name):
        smoothness = np.mean(
            np.abs(np.diff(signal)))
        drange = signal.max() - signal.min()
        zc = np.sum(
            np.diff(np.sign(signal)) != 0)
        zcr = zc / len(signal)
        return {
            "name": name,
            "length": len(signal),
            "smoothness": smoothness,
            "drange": drange,
            "zcr": zcr}

    def run(self):
        methods = [
            ("Nearest", self.nearest_neighbor()),
            ("Linear", self.linear_interp()),
            ("TransConv", self.transposed_conv())]
        print(f"{'Method':<12} {'Length':>8} "
              f"{'Smooth':>8} {'DRange':>8} "
              f"{'ZCR':>8}")
        print("-" * 48)
        for name, sig in methods:
            r = self.analyze(sig, name)
            print(
                f"{r['name']:<12} "
                f"{r['length']:>8} "
                f"{r['smoothness']:>8.5f} "
                f"{r['drange']:>8.4f} "
                f"{r['zcr']:>8.5f}")

sim = VocoderUpsampleSim()
sim.run()
The transposed convolution produces the highest zero-crossing rate and dynamic range because its kernel (random in this simulation, standing in for learned weights) introduces high-frequency structure that simple interpolation cannot. This is exactly why real vocoders (HiFi-GAN, WaveRNN) use learned upsampling -- they can synthesize the fine-grained waveform details that make audio sound natural rather than muffled.
On to today's episode
Here we go! The last three episodes covered the full speech pipeline: audio fundamentals (#92), speech recognition (#93), and text-to-speech (#94). We can now convert sound to numbers, turn speech into text, and generate speech from text. But speech is only one kind of sound. What about all the other sounds in the world?
A security camera needs to detect glass breaking. A wildlife monitoring station needs to identify bird species from their calls. A music streaming service needs to classify genre, detect mood, and tag instruments -- automatically, at scale, for millions of tracks. Your phone needs to recognize "Hey Siri" against a background of restaurant noise. None of these are speech recognition problems. They're audio classification problems.
The beautiful thing (and if you've been following along since episode #92, you might already see where this is going): since we convert audio to Mel spectrograms, audio classification is effectively image classification. All the CNN architectures from episodes #45-47 and the Vision Transformer from episode #54 apply directly. A spectrogram is a 2D image. A siren looks different from a dog bark in spectrogram space, just like a cat looks different from a car in pixel space. The core techniques transfer -- the differences are in the data, the augmentation, and the evaluation ;-)
CNN-based audio classification
The simplest and still surprisingly effective approach: compute a Mel spectrogram, treat it as a single-channel grayscale image, and feed it through a convolutional neural network. The architecture is identical to what we built for image classification:
import torch
import torch.nn as nn
import numpy as np

class AudioCNN(nn.Module):
    """CNN for audio classification operating
    on Mel spectrograms."""
    def __init__(self, n_mels=64,
                 num_classes=50):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, mel_spec):
        # mel_spec: (batch, 1, n_mels, time)
        x = self.features(mel_spec)
        return self.classifier(x)

model = AudioCNN(num_classes=50)
# Simulate a batch of 4 spectrograms
# 64 Mel bands, ~4 seconds at hop=512
mel_input = torch.randn(4, 1, 64, 173)
logits = model(mel_input)
print(f"Input: {mel_input.shape}")
print(f"Output: {logits.shape}")
print(f"Parameters: "
      f"{sum(p.numel() for p in model.parameters()):,}")
The architecture should feel very familiar from episode #45: conv -> batchnorm -> relu -> pool, repeated several times, then global average pooling collapses the spatial dimensions and a linear layer maps to class logits. The AdaptiveAvgPool2d((1, 1)) is key -- it lets the model accept spectrograms of any time length, which is essential because audio clips are rarely the exact same duration.
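To make the "any time length" point concrete, here is a framework-free sketch of what global average pooling does conceptually. The helper names `global_avg_pool` and `fake_maps` are invented for this illustration, and the toy feature maps stand in for the output of the last conv block (64 Mel bands and 173 or 431 input frames, each halved three times by the pooling layers):

```python
def global_avg_pool(feature_maps):
    """Average each channel's 2D map down to one number.

    feature_maps: list of channels, each a list of rows
    (frequency x time). Returns one value per channel,
    whatever the map size.
    """
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

def fake_maps(n_channels, n_freq, n_time):
    # Toy feature maps: channel c is filled with the constant c + 1.
    return [[[float(c + 1)] * n_time for _ in range(n_freq)]
            for c in range(n_channels)]

# 173 frames -> 21 after three 2x poolings; 431 frames -> 53
short_clip = global_avg_pool(fake_maps(4, 8, 21))
long_clip = global_avg_pool(fake_maps(4, 8, 53))
print(len(short_clip), len(long_clip))
```

Both clips end up as a vector with one entry per channel, so the linear layer after it never sees the original duration.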
Two design choices worth understanding: 64 Mel bands is a common choice for classification (Whisper uses 80, some systems use 128 -- more bands means finer frequency resolution but also more computation). And the number of time frames depends on your sample rate and hop length: a 4-second clip at 22050 Hz with hop 512 gives about 173 frames.
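The frame count is simple arithmetic, sketched below. The function name `n_stft_frames` is hypothetical, and the `center=True` branch assumes the common convention of padding the signal by half a window on each side (which gives 1 + n_samples // hop frames); the `n_fft=2048` default is likewise just an assumed value for the non-centered case:

```python
def n_stft_frames(duration_s, sr, hop, center=True, n_fft=2048):
    """Number of STFT frames for a clip.

    center=True assumes the signal is padded by n_fft // 2 on
    both sides, giving 1 + n_samples // hop frames.
    center=False counts only windows that fit entirely inside
    the clip (assumes the clip is at least n_fft samples long).
    """
    n_samples = int(duration_s * sr)
    if center:
        return 1 + n_samples // hop
    return 1 + (n_samples - n_fft) // hop

print(n_stft_frames(4.0, 22050, 512))         # 173
print(n_stft_frames(4.0, 16000, 512))         # 126
print(n_stft_frames(4.0, 22050, 512, False))  # 169
```

So the same 4-second clip produces a differently shaped "image" depending on sample rate and framing, which is one more reason the global pooling layer matters.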
This CNN approach achieves roughly 80-85% accuracy on standard benchmarks like ESC-50 (50 environmental sound classes: dog bark, rain, chainsaw, clock alarm, helicopter, etc.) and UrbanSound8K (10 urban sound classes). Not state-of-the-art, but remarkably good for how simple it is.
Audio data augmentation
Just like image classification benefits enormously from augmentations (random crops, flips, color jitter -- episode #14), audio classification needs its own augmentation toolkit. The two levels: waveform augmentation (before computing the spectrogram) and spectrogram augmentation (after):
import numpy as np

class AudioAugmentor:
    """Two-level audio augmentation:
    waveform-level and spectrogram-level."""
    def __init__(self, sr=16000, seed=42):
        self.sr = sr
        self.rng = np.random.RandomState(seed)

    def add_noise(self, waveform, snr_db=20):
        """Add Gaussian noise at a given
        signal-to-noise ratio."""
        power = np.mean(waveform ** 2)
        noise_power = power / (
            10 ** (snr_db / 10))
        noise = self.rng.randn(
            len(waveform)) * np.sqrt(
            noise_power)
        return waveform + noise

    def time_shift(self, waveform,
                   max_shift=0.1):
        """Circular shift by random amount."""
        shift = self.rng.randint(
            -int(self.sr * max_shift),
            int(self.sr * max_shift))
        return np.roll(waveform, shift)

    def speed_perturb(self, waveform,
                      lo=0.9, hi=1.1):
        """Change speed (affects pitch too)."""
        factor = self.rng.uniform(lo, hi)
        indices = np.arange(
            0, len(waveform), factor)
        indices = indices[
            indices < len(waveform)
        ].astype(int)
        return waveform[indices]

    def spec_augment(self, mel, n_freq=2,
                     n_time=2, freq_w=10,
                     time_w=20):
        """SpecAugment: mask random frequency
        bands and time segments."""
        aug = mel.copy()
        n_mels, n_frames = aug.shape
        for _ in range(n_freq):
            f = self.rng.randint(0, freq_w)
            f0 = self.rng.randint(
                0, max(n_mels - f, 1))
            aug[f0:f0 + f, :] = 0
        for _ in range(n_time):
            t = self.rng.randint(0, time_w)
            t0 = self.rng.randint(
                0, max(n_frames - t, 1))
            aug[:, t0:t0 + t] = 0
        return aug

    def demo(self):
        n = self.sr * 2
        t = np.arange(n) / self.sr
        sig = 0.5 * np.sin(
            2 * np.pi * 440 * t)
        noisy = self.add_noise(sig, snr_db=20)
        shifted = self.time_shift(sig)
        fast = self.speed_perturb(sig, 1.1, 1.1)
        print("Waveform augmentations:")
        print(f" Original: {len(sig)} samples")
        print(f" + noise: {len(noisy)} "
              f"(SNR=20dB)")
        print(f" Shifted: {len(shifted)}")
        print(f" Speed 1.1: {len(fast)}")
        mel = self.rng.randn(64, 200) * 10
        aug_mel = self.spec_augment(mel)
        # Count rows/columns that were fully zeroed
        freq_masked = np.sum(
            aug_mel.sum(axis=1) == 0)
        time_masked = np.sum(
            aug_mel.sum(axis=0) == 0)
        print(f"\nSpecAugment on (64, 200):")
        print(f" Freq bands zeroed: "
              f"{freq_masked}")
        print(f" Time frames zeroed: "
              f"{time_masked}")

augmentor = AudioAugmentor()
augmentor.demo()
SpecAugment (Park et al., 2019) deserves special attention because it's the single most impactful augmentation technique for audio models. The idea is dead simple: randomly zero out horizontal stripes (frequency masking) and vertical stripes (time masking) in the spectrogram during training. Frequency masking forces the model to not rely on any single frequency band -- if it can only recognize a dog bark by the fundamental frequency, masking that band forces it to learn from harmonics and temporal patterns too. Time masking forces robustness to missing segments. We built this from scratch in episode #92's augmentation section, and here it shows up again because it's genuinely used everywhere in production audio systems.
The waveform-level augmentations serve different purposes: noise injection simulates real-world recording conditions (nobody records in a soundproof studio), time shifting teaches the model that a dog bark at the beginning vs end of a clip is still a dog bark, and speed perturbation adds pitch and tempo variation.
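A quick sanity check for noise injection is to measure the SNR you actually get back. The sketch below redoes the noise-scaling arithmetic with only the standard library; `add_noise` here is a standalone function rather than the class method, and `measured_snr_db` is a hypothetical helper added for the check:

```python
import math
import random

def add_noise(signal, snr_db, rng):
    """Add Gaussian noise scaled to hit a target SNR in dB."""
    power = sum(s * s for s in signal) / len(signal)
    noise_std = math.sqrt(power / (10 ** (snr_db / 10)))
    return [s + rng.gauss(0.0, noise_std) for s in signal]

def measured_snr_db(clean, noisy):
    """SNR implied by the difference between noisy and clean."""
    sig_pow = sum(c * c for c in clean) / len(clean)
    noise_pow = sum((n - c) ** 2
                    for n, c in zip(noisy, clean)) / len(clean)
    return 10 * math.log10(sig_pow / noise_pow)

rng = random.Random(0)
sr = 16000
# One second of a 440 Hz tone, like the demo signal
clean = [0.5 * math.sin(2 * math.pi * 440 * i / sr)
         for i in range(sr)]
for target in (30, 20, 10, 0):
    noisy = add_noise(clean, target, rng)
    print(f"target {target:>2} dB -> "
          f"measured {measured_snr_db(clean, noisy):.2f} dB")
```

The measured value lands close to the target but not exactly on it, because the noise power is itself a random estimate. In training you would typically sample the target SNR from a range instead of fixing it.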
AudioSet: the ImageNet of audio
In the vision world, ImageNet (episode #46) was the dataset that enabled transfer learning and transformed the field. Audio has its equivalent: AudioSet (Gemmeke et al., 2017), a collection of 2 million 10-second YouTube clips annotated with 527 sound event classes. The classes range from "Speech" and "Music" to "Chainsaw", "Church bell", "Burping", and "Skateboard". It's multi-label -- a single clip might be tagged with both "Music" and "Singing" and "Guitar", because those sounds co-occur naturally.
AudioSet pretraining is the standard starting point for almost any audio classification task, the same way ImageNet pretraining is standard for vision. Here's how the pretrained models compare:
class AudioSetModels:
    """Compare AudioSet-pretrained model
    architectures and their performance."""
    def __init__(self):
        self.models = {
            "CNN6": {
                "params_m": 4.7,
                "mAP": 0.343,
                "type": "CNN",
                "year": 2020},
            "CNN10": {
                "params_m": 5.2,
                "mAP": 0.380,
                "type": "CNN",
                "year": 2020},
            "CNN14": {
                "params_m": 80.7,
                "mAP": 0.431,
                "type": "CNN",
                "year": 2020},
            "AST": {
                "params_m": 86.0,
                "mAP": 0.459,
                "type": "Transformer",
                "year": 2021},
            "BEATs": {
                "params_m": 90.0,
                "mAP": 0.498,
                "type": "Transformer",
                "year": 2023},
            "Audio-MAE": {
                "params_m": 86.0,
                "mAP": 0.473,
                "type": "Transformer",
                "year": 2022},
        }

    def compare(self):
        print(f"{'Model':<12} {'Type':<14} "
              f"{'Params':>7} {'mAP':>6} "
              f"{'Year':>5}")
        print("-" * 48)
        for name, m in self.models.items():
            print(
                f"{name:<12} {m['type']:<14} "
                f"{m['params_m']:>5.1f}M "
                f"{m['mAP']:>5.3f} "
                f"{m['year']:>5}")
        print(f"\nmAP = mean Average Precision")
        print(f"Higher is better, 1.0 = perfect")
        print(f"Human estimate: ~0.65 mAP")

models = AudioSetModels()
models.compare()
The PANNs family (Pretrained Audio Neural Networks, Kong et al., 2020) established the CNN baselines: CNN14 achieves 0.431 mAP with a relatively straightforward 14-layer convolutional architecture. Then the transformers arrived. AST (Audio Spectrogram Transformer, Gong et al., 2021) pushed to 0.459 by treating spectrograms as patch sequences -- exactly how ViT (episode #54) treats images. And BEATs (Chen et al., 2023) combined audio-specific pre-training with iterative self-supervised learning to reach 0.498.
That human estimate of ~0.65 mAP tells you something important: AudioSet's 527-class multi-label task is genuinely hard even for people. Many classes are subjective ("Music" vs "Musical instrument"), some clips have barely audible target sounds, and the YouTube-sourced labels aren't perfect.
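Since every number in that comparison is an mAP, it helps to see how the metric is computed. The sketch below uses one common definition (precision averaged at the rank of each true positive, then averaged over classes); the function names and the toy scores are made up for illustration, and evaluation toolkits can differ in details such as tie handling:

```python
def average_precision(scores, labels):
    """AP for one class: mean of the precision at each true
    positive, with clips ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(score_rows, label_rows):
    """mAP: average the per-class AP over all classes.
    Rows are clips, columns are classes."""
    n_classes = len(score_rows[0])
    aps = []
    for c in range(n_classes):
        aps.append(average_precision(
            [row[c] for row in score_rows],
            [row[c] for row in label_rows]))
    return sum(aps) / n_classes

# Four clips, two classes (toy values)
scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
labels = [[1, 0], [0, 1], [1, 0], [0, 1]]
print(f"mAP = {mean_average_precision(scores, labels):.3f}")
```

Note that no threshold appears anywhere: only the ranking of clips within each class matters, which is why mAP suits multi-label tagging where a sensible threshold differs per class.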
The Audio Spectrogram Transformer
Let's look at AST more carefully because it's the clearest example of how vision architectures transfer to audio. The key insight: if you can represent audio as a 2D spectrogram, then a spectrogram is just an image, and Vision Transformers work on images. AST literally takes a pretrained ViT (trained on ImageNet!) and fine-tunes it on AudioSet:
class ASTArchitecture:
    """Illustrate how AST converts
    spectrograms to patch sequences for
    a vision transformer."""
    def __init__(self, n_mels=128,
                 n_frames=1024,
                 patch_h=16, patch_w=16,
                 d_model=768):
        self.n_mels = n_mels
        self.n_frames = n_frames
        self.patch_h = patch_h
        self.patch_w = patch_w
        self.d_model = d_model

    def compute_patches(self):
        patches_h = self.n_mels // self.patch_h
        patches_w = self.n_frames // self.patch_w
        total = patches_h * patches_w
        return patches_h, patches_w, total

    def describe(self):
        ph, pw, total = self.compute_patches()
        print(f"Input spectrogram: "
              f"{self.n_mels} x {self.n_frames}")
        print(f"Patch size: "
              f"{self.patch_h} x {self.patch_w}")
        print(f"Grid: {ph} x {pw} = "
              f"{total} patches")
        print(f"Each patch: "
              f"{self.patch_h * self.patch_w} "
              f"values -> projected to "
              f"{self.d_model}d")
        print(f"Sequence length: "
              f"{total} + 1 (CLS token) = "
              f"{total + 1}")
        print(f"\nFor comparison, ViT-B/16 "
              f"on 224x224 images:")
        img_patches = (224 // 16) ** 2
        print(f" Image patches: "
              f"{img_patches}")
        print(f" Audio patches: "
              f"{total}")
        print(f" Audio has "
              f"{total / img_patches:.1f}x "
              f"more patches")
        print(f"\nThis means AST processes "
              f"longer sequences than ViT,")
        print(f"so it needs positional "
              f"embedding interpolation.")

ast = ASTArchitecture()
ast.describe()
The gotcha here is sequence length. A 128 x 1024 Mel spectrogram produces 8 x 64 = 512 patches. A 224x224 image only produces 196 patches. That's 2.6x more tokens for the transformer to process, which means more memory and more computation (self-attention is quadratic in sequence length, as we covered in episode #52). AST handles this by interpolating the positional embeddings from ViT's 196 positions to 512 -- the same technique we discussed in episode #54 for applying ViT to higher-resolution images.
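The patch arithmetic above assumes non-overlapping patches. The published AST configuration is usually described as sliding its 16 x 16 patches with a stride of 10 so that neighbours overlap, which pushes the token count up further; treat that stride as an assumption to verify against the paper. A small sketch (the function name `patch_grid` is hypothetical) shows how much the stride changes the sequence length and the quadratic attention cost:

```python
def patch_grid(n_mels, n_frames, patch=16, stride=16):
    """Patches along frequency and time for a square patch
    slid with the given stride (stride == patch: no overlap)."""
    n_f = (n_mels - patch) // stride + 1
    n_t = (n_frames - patch) // stride + 1
    return n_f, n_t, n_f * n_t

# Reference point: ViT-B/16 on a 224 x 224 image
vit_tokens = patch_grid(224, 224)[2]
for stride in (16, 10):
    n_f, n_t, total = patch_grid(128, 1024, stride=stride)
    # Self-attention cost grows with the square of sequence length
    cost = (total / vit_tokens) ** 2
    print(f"stride {stride}: {n_f} x {n_t} = {total} patches, "
          f"~{cost:.1f}x the attention cost of "
          f"{vit_tokens} tokens")
```

Even without overlap, 2.6x more tokens means close to 7x the attention cost, which is the practical reason transformers are heavier than CNNs on long clips.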
The practical result: AST is slower than CNN14 (more compute per sample) but more accurate (0.459 vs 0.431 mAP). For real-time on-device classification (smart speakers, phones), CNNs are still preferred. For batch processing where accuracy matters more than latency, transformers win.
Multi-label classification
Here's where audio classification diverges fundamentally from typical image classification. An image usually has one primary label: "cat", "dog", "car". An audio clip often contains multiple simultaneous sounds: speech over music, birds chirping with wind noise, a siren mixed with traffic. This makes audio tagging a multi-label problem:
import numpy as np

class MultiLabelDemo:
    """Demonstrate multi-label vs single-label
    classification differences."""
    def __init__(self):
        self.classes = [
            "Speech", "Music", "Dog", "Siren",
            "Rain", "Traffic", "Bird", "Wind"]

    def single_label(self, logits):
        """Softmax: probabilities sum to 1.
        Forces exactly ONE prediction."""
        exp = np.exp(logits - logits.max())
        probs = exp / exp.sum()
        pred = np.argmax(probs)
        return probs, [self.classes[pred]]

    def multi_label(self, logits,
                    threshold=0.5):
        """Sigmoid: each class independent.
        Multiple classes can be active."""
        probs = 1 / (1 + np.exp(-logits))
        preds = [self.classes[i]
                 for i in range(len(logits))
                 if probs[i] > threshold]
        return probs, preds

    def demo(self):
        # A clip with speech AND music AND rain
        logits = np.array(
            [2.1, 1.8, -1.0, -2.5,
             1.5, -0.3, -1.2, -0.8])
        s_probs, s_preds = self.single_label(
            logits)
        m_probs, m_preds = self.multi_label(
            logits)
        print("Logits:", [f"{l:.1f}" for l
                          in logits])
        print(f"\nSingle-label (softmax):")
        print(f" Prediction: {s_preds}")
        for i, c in enumerate(self.classes):
            bar = "#" * int(s_probs[i] * 40)
            print(f" {c:<10} {s_probs[i]:.3f} "
                  f"{bar}")
        print(f"\nMulti-label (sigmoid):")
        print(f" Predictions: {m_preds}")
        for i, c in enumerate(self.classes):
            bar = "#" * int(m_probs[i] * 40)
            act = " *" if m_probs[i] > 0.5 else ""
            print(f" {c:<10} {m_probs[i]:.3f} "
                  f"{bar}{act}")
        print(f"\nSoftmax FORCES a single "
              f"winner.")
        print(f"Sigmoid lets Speech + Music + "
              f"Rain all be active.")

demo = MultiLabelDemo()
demo.demo()
The key differences from single-label classification (which is what we've mostly done in this series):
- Sigmoid instead of softmax: each class gets an independent probability between 0 and 1. Softmax forces probabilities to sum to 1 (one winner), sigmoid doesn't.
- Binary Cross-Entropy (BCE) instead of Cross-Entropy: BCE treats each class as a separate binary classification problem. The loss for "Speech" is computed independently from the loss for "Music".
- Threshold-based prediction: instead of argmax (pick the highest), you apply a threshold (typically 0.5) and everything above is considered active.
- Evaluation metric: mAP (mean Average Precision) instead of accuracy. mAP captures how well the model ranks positive classes above negative ones, which matters more than the exact threshold.
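The first two points can be sketched in a few lines of plain Python: labels become a multi-hot vector, and the loss is an average of per-class binary cross-entropies. The helper names (`multi_hot`, `bce_loss`) are invented for this illustration, and the logits reuse the values from the demo above:

```python
import math

CLASSES = ["Speech", "Music", "Dog", "Siren",
           "Rain", "Traffic", "Bird", "Wind"]

def multi_hot(active, classes=CLASSES):
    """Encode a set of active labels as a 0/1 vector."""
    return [1.0 if c in active else 0.0 for c in classes]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_loss(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over classes; each class is
    scored as its own yes/no problem."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Clamp to avoid log(0) for extreme logits
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(logits)

logits = [2.1, 1.8, -1.0, -2.5, 1.5, -0.3, -1.2, -0.8]
good = multi_hot({"Speech", "Music", "Rain"})
bad = multi_hot({"Dog", "Siren"})
print(f"BCE vs matching labels:    {bce_loss(logits, good):.3f}")
print(f"BCE vs mismatching labels: {bce_loss(logits, bad):.3f}")
```

In PyTorch you would normally reach for `nn.BCEWithLogitsLoss`, which fuses the sigmoid and the loss in a numerically stable way; the clamping with `eps` here is only a crude stand-in for that.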
Sound event detection: the temporal dimension
Classification tells you what sounds are in a clip. Sound Event Detection (SED) tells you when each sound occurs -- start time and end time. This is audio's equivalent of object detection (episode #78): instead of spatial bounding boxes around objects in an image, you predict temporal segments for sounds in a recording:
import numpy as np

class SoundEventDetector:
    """Simulate frame-level sound event
    detection with post-processing."""
    def __init__(self, sr=16000, hop=512,
                 n_classes=5):
        self.sr = sr
        self.hop = hop
        self.n_classes = n_classes
        self.classes = [
            "Speech", "Music", "Dog",
            "Siren", "Bird"]

    def simulate_predictions(self, duration):
        """Create realistic frame-level
        predictions for demo purposes."""
        rng = np.random.RandomState(42)
        n_frames = int(
            duration * self.sr / self.hop)
        preds = np.zeros(
            (self.n_classes, n_frames))
        # Speech from 1-4s
        s1 = int(1.0 * self.sr / self.hop)
        s2 = int(4.0 * self.sr / self.hop)
        preds[0, s1:s2] = rng.uniform(
            0.7, 0.95, s2 - s1)
        # Dog bark at 2.5-3.5s
        d1 = int(2.5 * self.sr / self.hop)
        d2 = int(3.5 * self.sr / self.hop)
        preds[2, d1:d2] = rng.uniform(
            0.6, 0.85, d2 - d1)
        # Add noise everywhere
        preds += rng.uniform(
            0, 0.15,
            (self.n_classes, n_frames))
        return np.clip(preds, 0, 1)

    def detect_events(self, preds,
                      threshold=0.5,
                      min_dur_frames=3):
        """Convert frame-level predictions
        to discrete events."""
        events = []
        for c in range(self.n_classes):
            active = preds[c] > threshold
            in_event = False
            start = 0
            for f in range(len(active)):
                if active[f] and not in_event:
                    start = f
                    in_event = True
                elif not active[f] and in_event:
                    dur = f - start
                    if dur >= min_dur_frames:
                        events.append({
                            "class": self.classes[c],
                            "start": start * self.hop / self.sr,
                            "end": f * self.hop / self.sr,
                            "confidence": float(
                                preds[c, start:f].mean())})
                    in_event = False
            # Close an event still open at the end of the clip
            if in_event:
                dur = len(active) - start
                if dur >= min_dur_frames:
                    events.append({
                        "class": self.classes[c],
                        "start": start * self.hop / self.sr,
                        "end": len(active) * self.hop / self.sr,
                        "confidence": float(
                            preds[c, start:].mean())})
        return sorted(
            events, key=lambda e: e["start"])

    def run(self):
        duration = 6.0
        preds = self.simulate_predictions(
            duration)
        n_frames = preds.shape[1]
        print(f"Audio: {duration:.0f}s, "
              f"{n_frames} frames")
        print(f"Frame predictions: "
              f"{preds.shape}")
        events = self.detect_events(preds)
        print(f"\nDetected events:")
        for e in events:
            print(f" [{e['start']:.1f}s - "
                  f"{e['end']:.1f}s] "
                  f"{e['class']} "
                  f"(conf: {e['confidence']:.2f})")

detector = SoundEventDetector()
detector.run()
The model produces per-frame, per-class probabilities. Post-processing converts these continuous probabilities into discrete events by: (1) applying a threshold to get binary frame-level decisions, (2) grouping consecutive positive frames into event segments, (3) filtering out events shorter than a minimum duration (to remove spurious detections), and (4) optionally merging events that are separated by very small gaps. This post-processing pipeline is analogous to Non-Maximum Suppression in object detection -- the raw model outputs need cleanup before they're useful.
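Step (4), merging events separated by very small gaps, can be sketched on the event dictionaries directly. This is a minimal illustration, not a standard implementation: the function name `merge_close_events`, the 0.2-second default gap, and the duration-weighted confidence are all assumptions made for the example.

```python
def merge_close_events(events, max_gap=0.2):
    """Merge same-class events whose gap is at most max_gap seconds.

    Assumes each event is a dict with "class", "start", "end"
    (seconds) and "confidence" keys. max_gap=0.2 is an arbitrary
    illustrative default.
    """
    merged = []
    last = {}  # most recent merged event for each class
    for ev in sorted(events, key=lambda e: e["start"]):
        prev = last.get(ev["class"])
        if prev is not None and ev["start"] - prev["end"] <= max_gap:
            # Duration-weighted mean confidence of the two segments
            d_prev = prev["end"] - prev["start"]
            d_new = ev["end"] - ev["start"]
            total = d_prev + d_new
            if total > 0:
                prev["confidence"] = (
                    prev["confidence"] * d_prev
                    + ev["confidence"] * d_new) / total
            prev["end"] = max(prev["end"], ev["end"])
        else:
            new_ev = dict(ev)  # copy so the input list is not mutated
            merged.append(new_ev)
            last[ev["class"]] = new_ev
    return merged

# Made-up events: the first two Dog segments are 0.1 s apart.
events = [
    {"class": "Dog", "start": 2.5, "end": 2.9, "confidence": 0.70},
    {"class": "Dog", "start": 3.0, "end": 3.5, "confidence": 0.80},
    {"class": "Speech", "start": 1.0, "end": 4.0, "confidence": 0.82},
    {"class": "Dog", "start": 5.0, "end": 5.3, "confidence": 0.60},
]
merged = merge_close_events(events, max_gap=0.2)
for e in merged:
    print(f"[{e['start']:.1f}s - {e['end']:.1f}s] {e['class']} "
          f"(conf: {e['confidence']:.2f})")
```

The two nearby Dog segments collapse into one event from 2.5s to 3.5s, while the bark at 5.0s stays separate because the gap is far larger than `max_gap`.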
The DCASE (Detection and Classification of Acoustic Scenes and Events) challenge series is the main benchmark for SED. Systems are commonly evaluated using event-based F1 score with a tolerance window: a detected event counts as correct if it has the right class and its boundaries (the onset, and usually the offset too) fall within a certain time margin of a ground-truth event's boundaries.
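A simplified version of event-based scoring with a tolerance window could look like the sketch below. It is deliberately reduced: it only checks the class label and the onset, uses greedy one-to-one matching, and the 0.2-second collar is an assumed value for illustration. Official evaluation toolkits apply additional rules (offset tolerance, for instance), so treat this as an intuition aid rather than a reference scorer.

```python
def event_based_f1(reference, detected, collar=0.2):
    """Event-based precision, recall and F1 with an onset collar.

    Simplified sketch: a detection is a true positive if an unmatched
    reference event has the same class and an onset within `collar`
    seconds. Offsets are ignored here.
    """
    matched = set()
    tp = 0
    for det in detected:
        for i, ref in enumerate(reference):
            if i in matched:
                continue
            same_class = det["class"] == ref["class"]
            onset_ok = abs(det["start"] - ref["start"]) <= collar
            if same_class and onset_ok:
                matched.add(i)
                tp += 1
                break
    fp = len(detected) - tp
    fn = len(reference) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up ground truth and detections for a single clip.
reference = [
    {"class": "Speech", "start": 1.0, "end": 4.0},
    {"class": "Dog", "start": 2.5, "end": 3.5},
]
detected = [
    {"class": "Speech", "start": 1.1, "end": 4.0},  # onset 0.1 s off
    {"class": "Dog", "start": 2.9, "end": 3.5},     # onset 0.4 s off
    {"class": "Siren", "start": 5.0, "end": 5.5},   # not in reference
]
precision, recall, f1 = event_based_f1(reference, detected)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Note that the Dog detection overlaps the true bark for most of its duration but still counts as a miss, because its onset is outside the collar. That strictness is why event-based scores are usually much lower than frame-level ones.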
Complete classification pipeline
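Let me put everything together into a pipeline skeleton that goes from raw waveforms to evaluated predictions: normalization, padding/truncation, log-Mel features, and per-class evaluation. The classifier itself is simulated here with hard-coded predictions, so you can slot in any of the models and augmentations from the earlier sections: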
class AudioClassificationPipeline:
    """Complete pipeline: load audio ->
    preprocess -> augment -> classify ->
    evaluate."""

    def __init__(self, sr=16000, n_mels=64,
                 duration=4.0, n_fft=1024,
                 hop=512, num_classes=10):
        self.sr = sr
        self.n_mels = n_mels
        self.max_samples = int(sr * duration)
        self.n_fft = n_fft
        self.hop = hop
        self.num_classes = num_classes

    def preprocess(self, waveform):
        """Normalize, pad/truncate, compute
        Mel spectrogram."""
        peak = np.abs(waveform).max()
        if peak > 0:
            waveform = waveform / peak
        if len(waveform) > self.max_samples:
            waveform = waveform[
                :self.max_samples]
        else:
            pad = self.max_samples - len(
                waveform)
            waveform = np.pad(
                waveform, (0, pad))
        return self._mel_spec(waveform)

    def _mel_spec(self, signal):
        """Compute log-Mel spectrogram."""
        window = np.hanning(self.n_fft)
        n_frames = (len(signal) - self.n_fft
                    ) // self.hop + 1
        n_bins = self.n_fft // 2 + 1
        spec = np.zeros((n_bins, n_frames))
        for i in range(n_frames):
            s = i * self.hop
            frame = signal[s:s + self.n_fft]
            spec[:, i] = np.abs(
                np.fft.rfft(frame * window))
        fb = self._mel_fb(n_bins)
        mel = fb @ (spec ** 2)
        return 10.0 * np.log10(mel + 1e-10)

    def _mel_fb(self, n_bins):
        """Build a triangular Mel filterbank."""
        fmax = self.sr // 2
        mel_lo = 2595 * np.log10(1 + 0 / 700)
        mel_hi = 2595 * np.log10(
            1 + fmax / 700)
        pts = np.linspace(
            mel_lo, mel_hi, self.n_mels + 2)
        hz = 700 * (10 ** (pts / 2595) - 1)
        bins = np.floor(
            (self.n_fft + 1) * hz / self.sr
        ).astype(int)
        fb = np.zeros((self.n_mels, n_bins))
        for i in range(self.n_mels):
            lo, mid, hi = (bins[i],
                           bins[i+1], bins[i+2])
            if mid > lo:
                for k in range(lo, mid):
                    fb[i, k] = (
                        (k - lo) / (mid - lo))
            if hi > mid:
                for k in range(mid, hi + 1):
                    if k < n_bins:
                        fb[i, k] = (
                            (hi - k)
                            / (hi - mid))
        return fb

    def evaluate_predictions(self, y_true,
                             y_pred):
        """Compute accuracy and per-class
        precision/recall."""
        correct = sum(
            t == p for t, p
            in zip(y_true, y_pred))
        acc = correct / len(y_true)
        # Per-class stats
        classes = set(y_true)
        print(f"Overall accuracy: "
              f"{acc:.1%} "
              f"({correct}/{len(y_true)})")
        print(f"\n{'Class':>6} {'Prec':>6} "
              f"{'Recall':>6} {'Count':>6}")
        print("-" * 28)
        for c in sorted(classes):
            tp = sum(1 for t, p
                     in zip(y_true, y_pred)
                     if t == c and p == c)
            fp = sum(1 for t, p
                     in zip(y_true, y_pred)
                     if t != c and p == c)
            fn = sum(1 for t, p
                     in zip(y_true, y_pred)
                     if t == c and p != c)
            prec = tp / (tp + fp) if (
                tp + fp) > 0 else 0
            rec = tp / (tp + fn) if (
                tp + fn) > 0 else 0
            count = sum(
                1 for t in y_true if t == c)
            print(f"{c:>6} {prec:>5.1%} "
                  f"{rec:>5.1%} {count:>6}")

    def run_demo(self):
        rng = np.random.RandomState(42)
        n = self.max_samples
        t = np.arange(n) / self.sr
        sigs = []
        for freq in [200, 500, 1000, 2000,
                     4000]:
            sig = np.sin(
                2 * np.pi * freq * t)
            sig += rng.randn(n) * 0.1
            sigs.append(sig)
        print("Preprocessing 5 signals:")
        for i, sig in enumerate(sigs):
            mel = self.preprocess(sig)
            print(f" Signal {i}: waveform "
                  f"{len(sig)} -> Mel "
                  f"{mel.shape}")
        # Simulated evaluation
        y_true = [0, 0, 1, 1, 2, 2, 3, 3,
                  4, 4, 0, 1, 2, 3, 4]
        y_pred = [0, 0, 1, 2, 2, 2, 3, 4,
                  4, 4, 0, 1, 1, 3, 4]
        print()
        self.evaluate_predictions(
            y_true, y_pred)

pipeline = AudioClassificationPipeline()
pipeline.run_demo()
The pipeline encapsulates the standard workflow: load -> normalize -> pad/truncate to fixed length -> Mel spectrogram -> feed to model -> evaluate. The evaluation uses per-class precision and recall, which is more informative than overall accuracy (same lesson we learned way back in episode #13). A model might have 90% accuracy but completely fail on rare classes -- per-class metrics reveal that.
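The "90% accuracy but failing on rare classes" situation is easy to reproduce. The sketch below uses a hypothetical imbalanced test set and a degenerate model that always predicts the majority class; the helper name `per_class_recall` is made up for this example:

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Recall for every class that appears in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in sorted(totals)}

# Hypothetical test set: 90 clips of class 0, 10 clips of class 1.
y_true = [0] * 90 + [1] * 10
# A degenerate model that always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = per_class_recall(y_true, y_pred)
macro_recall = sum(recalls.values()) / len(recalls)

print(f"Accuracy:     {accuracy:.1%}")
for c, r in recalls.items():
    print(f"Recall cls {c}: {r:.1%}")
print(f"Macro recall: {macro_recall:.1%}")
```

Accuracy rewards the model for the 90 easy clips, while the macro-averaged recall gives both classes equal weight and exposes that the rare class is never detected.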
What makes audio classification hard?
Worth being honest about the challenges that make this problem harder than it might seem:
Class imbalance: in AudioSet, "Music" appears in 40% of clips. "Skateboard" appears in 0.03%. Your model will happily learn to favor frequent classes like "Music" and still post decent aggregate numbers while rare classes are barely learned. Stratified sampling, class weighting, and focal loss (all techniques from episodes #14 and #40) are essential.
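As one possible way to apply class weighting in the multi-label setting, the sketch below scales the positive term of BCE by the negative-to-positive ratio of each class, similar in spirit to the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`. The function names, the cap of 100, and the toy label matrix are assumptions for illustration:

```python
import numpy as np

def positive_weights(Y, max_weight=100.0):
    """Per-class weight = (#negatives / #positives), capped.

    Y is a (n_clips, n_classes) multi-hot label matrix. The cap is an
    assumed safeguard so that extremely rare classes don't dominate.
    """
    pos = Y.sum(axis=0)
    neg = Y.shape[0] - pos
    weights = neg / np.maximum(pos, 1)  # avoid division by zero
    return np.minimum(weights, max_weight)

def weighted_bce(probs, Y, pos_weight, eps=1e-7):
    """BCE where the positive term of each class is scaled."""
    p = np.clip(probs, eps, 1 - eps)
    loss = -(pos_weight * Y * np.log(p)
             + (1 - Y) * np.log(1 - p))
    return loss.mean()

# Toy labels: class 0 is common (4 of 10 clips), class 1 is rare (1 of 10).
Y = np.zeros((10, 2))
Y[:4, 0] = 1
Y[0, 1] = 1

w = positive_weights(Y)
print("pos_weight per class:", w)

# A lazy model that outputs a low probability for everything.
probs = np.full(Y.shape, 0.1)
plain = weighted_bce(probs, Y, np.ones(2))
weighted = weighted_bce(probs, Y, w)
print(f"Unweighted BCE: {plain:.3f}")
print(f"Weighted BCE:   {weighted:.3f}")
```

With the weights in place, missing the single positive of the rare class costs as much as missing several positives of the common class, so "predict nothing" stops being a cheap strategy.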
Label noise: AudioSet labels come from human annotators watching YouTube videos. They miss sounds, they mislabel sounds, they disagree with each other. Some clips are labeled "Speech" but the speech is barely audible under loud music. Training with noisy labels requires either label cleaning or noise-robust loss functions.
Domain shift: a model trained on YouTube audio (compressed, variable quality, background music) might fail on a security camera recording (high-quality, quiet background, specific reverb characteristics). This is the same domain shift problem we discussed in episode #35 -- your training distribution needs to match your deployment conditions, or you need domain adaptation techniques.
Temporal granularity: is a "door slam" a 0.3-second event or a 2-second event? It depends on whether you include the reverb. Different annotators draw different temporal boundaries, and your model needs to handle this ambiguity. This is why SED evaluation uses tolerance windows rather than exact match.
Summary
- Audio classification assigns labels to sound recordings; since Mel spectrograms are 2D images, CNN and transformer architectures from the vision arc transfer directly;
- SpecAugment (frequency masking + time masking) is the most effective audio augmentation, forcing models to learn robust features instead of relying on specific spectral regions;
- AudioSet (2 million YouTube clips, 527 classes) is the ImageNet of audio; pretrained models like CNN14 and AST provide strong starting points for transfer learning;
- the Audio Spectrogram Transformer (AST) treats spectrograms as patch sequences (exactly like ViT treats images), achieving 0.459 mAP on AudioSet versus CNN14's 0.431;
- audio tagging is inherently multi-label (sigmoid + BCE loss, not softmax + cross-entropy) because multiple sounds naturally co-occur in real recordings;
- Sound Event Detection (SED) extends classification with temporal localization -- predicting per-frame per-class probabilities, then post-processing into discrete events with start/end times;
- practical challenges include class imbalance, label noise, domain shift, and temporal granularity -- the same issues that affect classification in other domains, but amplified by the messy nature of real-world audio.
We've covered how to classify what sounds are in a recording and when they happen. Next time we'll look at a creative application of audio AI that takes this in a completely different direction: generating music. The models need to understand not just what music sounds like but concepts like rhythm, harmony, and structure over much longer time horizons than speech or environmental sounds.
Exercises
Exercise 1: Build a spectrogram-based audio feature comparator. Create a class AudioFeatureComparator that: (a) generates 5 synthetic audio signals (each 2 seconds at 16,000 Hz): (1) a pure 440 Hz sine wave, (2) white noise (random seed 42), (3) a frequency sweep from 200 to 4000 Hz, (4) a "speech-like" signal (sum of 150 Hz, 300 Hz, 900 Hz, 2500 Hz with amplitude modulation at 4 Hz), (5) a "music-like" signal (440 Hz + 554 Hz + 659 Hz chord with slow vibrato at 5 Hz), (b) computes for each signal: the Mel spectrogram (64 bands, n_fft=1024, hop=256), spectral centroid (the energy-weighted mean frequency across all frames), spectral flatness (geometric mean divided by arithmetic mean of magnitudes -- high for noise, low for tonal signals), temporal variation (mean absolute difference between consecutive Mel frames), (c) prints a comparison table with all 5 signals and their features. Verify that: white noise has the highest spectral flatness, the pure tone has the lowest spectral flatness, the speech-like signal has high temporal variation (because of amplitude modulation), and the sweep has a centroid between the pure tone and noise.
Exercise 2: Build a multi-label evaluation toolkit. Create a class MultiLabelEvaluator that: (a) implements compute_ap(y_true, y_scores) that computes Average Precision for a single class: sort samples by descending score, sweep through thresholds, compute precision and recall at each threshold, and compute the area under the precision-recall curve (using the trapezoidal rule), (b) implements compute_map(Y_true, Y_scores) that computes mean Average Precision across all classes, (c) creates a test scenario with 5 classes and 20 samples where: class 0 has perfect predictions (AP = 1.0), class 1 has near-perfect predictions (AP > 0.9), class 2 has moderate predictions (AP around 0.5-0.7), class 3 has random predictions (AP around 0.2-0.5), class 4 has inverted predictions (AP near 0.0), (d) prints the AP for each class and the overall mAP, (e) verifies the ordering: AP for class 0 > class 1 > class 2 > class 3 > class 4, and that the overall mAP is the arithmetic mean of per-class APs.
Exercise 3: Build a sound event post-processor. Create a class SEDPostProcessor that: (a) generates synthetic frame-level predictions for a roughly 10-second clip (320 frames at 16 kHz with a hop of 512, about 10.2 seconds) with 3 classes: (1) "Speech" active from frames 20-120 with probability 0.8, (2) "Music" active from frames 50-250 with probability 0.7, (3) "Dog" with three short bursts at frames 40-55, 100-110, and 200-215 with probability 0.75, plus Gaussian noise (std=0.1) added to all predictions, (b) implements three post-processing stages: (1) threshold(preds, t=0.5) -- binary decisions per frame, (2) fill_gaps(binary, max_gap=5) -- merge events separated by fewer than max_gap frames, (3) remove_short(binary, min_dur=8) -- remove events shorter than min_dur frames, (c) applies all three stages in sequence and extracts events (class, start_frame, end_frame) after each stage, (d) prints the events after each stage for all 3 classes, showing how the post-processing cleans up the predictions. Verify that: after thresholding, the Dog class has 3 separate events; after gap-filling, some may merge if gaps are small enough; after minimum duration filtering, any very short noise-triggered events are removed.