Learn AI Series (#100) - Multimodal Audio-Visual Models

What will I learn

  • You will learn audio-visual speech recognition: combining lip reading with audio for robust ASR in noisy environments;
  • the cocktail party problem: separating individual speakers from a mixture using both audio and visual cues;
  • sound source localization: identifying which visual region in a video produces a given sound;
  • video-to-audio generation: synthesizing plausible audio from silent video input;
  • audio-visual emotion recognition: reading feelings from face and voice simultaneously for richer signal than either modality alone;
  • cross-modal self-supervised learning: exploiting the natural correspondence between audio and video as free supervision;
  • building from-scratch audio-visual fusion systems using NumPy and PyTorch that combine spectral and spatial feature streams.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#100) - Multimodal Audio-Visual Models

Solutions to Episode #99 Exercises

Exercise 1: Multi-band spectral subtractor -- generate 3s of clean tonal audio (3 harmonics at 200/400/800 Hz), add broadband noise at SNR=5dB. Estimate noise spectrum from first 0.5s (noise-only region), then subtract per-band with oversubtraction alpha=2.0 and spectral floor beta=0.01. Result: SNR improves from 5dB to roughly 12-15dB. The spectral floor prevents negative magnitudes (which cause "musical noise" artifacts). Per-band analysis shows the most improvement in bands where the signal-to-noise ratio was already moderate (200-800 Hz region), while very high frequency bands (above 2kHz) see less improvement because they were mostly noise to begin with.

Exercise 2: Adaptive echo canceller -- generate a 3s "far-end" signal (sum of 3 sinusoids), convolve it with a synthetic room impulse response (10-tap FIR with exponential decay), add local "near-end" speech (different frequency), and run NLMS adaptive filter (step size mu=0.5, filter length=15). Track convergence by measuring echo-return loss enhancement (ERLE) per frame. Result: the filter converges within 0.5-1.0s, ERLE rises from 0dB to 15-25dB, and the residual error drops below -20dB. The adapted filter coefficients approximate the true room impulse response -- plotting both side by side shows the match improving over time, with the largest coefficient errors in the tail taps (smallest true values).

Exercise 3: Waveform-domain denoiser with learned encoder-decoder -- generate 2s clean speech (5 harmonics with amplitude envelope), add Gaussian noise at SNR=0dB. Build a simple encoder (3 conv1d layers with stride-2 downsampling + ReLU) and symmetric decoder (3 transposed conv1d layers). Train for 200 epochs on 20 noisy/clean pairs using L1 loss. Result: SNR improves from 0dB to 5-8dB after training. The learned encoder filters show bandpass-like patterns -- lower filters capture low-frequency harmonics, higher filters capture the noise floor. Comparing waveform-domain vs spectrogram-domain (from exercise 1): waveform models avoid phase reconstruction artifacts but need more data to converge; spectrogram models converge faster but introduce musical noise at the phase boundaries.

On to today's episode

Here we go! Episode one hundred. One hundred. When this series started with "what machine learning actually is" I genuinely did not expect we'd make it this far, but here we are and the topic fits perfectly for the milestone: models that combine two of the most fundamental human senses -- seeing and hearing ;-)

Humans are naturally multimodal creatures. You can understand someone in a noisy bar partly because you're reading their lips (even if you don't realize you're doing it). You turn toward a sudden sound because your eyes and ears triangulate together. A silent video of someone clapping is incomplete -- your brain fills in the expected sound whether you want it to or not. Audio-visual AI tries to capture exactly this kind of integration, and it unlocks capabilities that neither modality achieves on its own.

We've been building toward this for quite some time. Episode #75 covered multimodal text+vision. Episodes #92-99 covered the full audio domain -- from fundamentals through speech recognition, generation, classification, speaker identification, NLU, and enhancement. Today we bring audio and vision together into unified systems. This is where things get really interesting.

Why multimodal? The argument from noise

Consider speech recognition in a crowded room. The audio signal is a mess -- multiple people talking, background music, glasses clinking. An audio-only ASR system (the kind we built in episode #93) struggles badly in these conditions. But a camera pointed at the speaker's face provides a completely independent signal channel. Lip movements are NOT affected by acoustic noise. Combining both modalities keeps the system usable in conditions where audio alone would fail.

This is not theoretical hand-waving. The McGurk effect demonstrates how tightly humans fuse audio and visual speech perception: when you hear "ba" but see lips forming "ga," your brain perceives "da" -- a sound that was neither heard nor seen. Your brain fuses the two modalities into a single percept automatically. Audio-visual models aim for the same kind of integration.
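The noise argument can be made concrete with a toy calculation. The sketch below assumes the audio and visual channels are conditionally independent given the spoken syllable and that all syllables are equally likely a priori; the probability values are invented for illustration and are not measurements of the McGurk effect.

```python
import numpy as np


def fuse_posteriors(p_audio, p_visual):
    """Combine two per-class probability vectors by multiplying them
    (adding log-probabilities) and renormalizing. Assumes conditional
    independence of the channels and a uniform class prior."""
    log_p = np.log(p_audio + 1e-12) + np.log(p_visual + 1e-12)
    p = np.exp(log_p - log_p.max())
    return p / p.sum()


classes = ["ba", "da", "ga"]
# Hypothetical numbers: noisy audio only weakly prefers "ba" ...
p_audio = np.array([0.40, 0.35, 0.25])
# ... while the lips nearly rule out "ba" but cannot split "da"/"ga" well
p_visual = np.array([0.05, 0.50, 0.45])

fused = fuse_posteriors(p_audio, p_visual)
for c, p in zip(classes, fused):
    print(f"{c}: {p:.3f}")
print("fused decision:", classes[int(fused.argmax())])
```

Neither channel is confident on its own, yet the product of the two concentrates on a single class. That is the basic reason a second, independently corrupted channel helps.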

Let's start by building the fundamental building block: a system that extracts features from both modalities and learns to combine them.

Audio-visual feature fusion from scratch

Before we get into specific tasks, let's build the core fusion mechanism that every audio-visual system needs -- taking features from two different modalities and combining them into a unified representation:

import numpy as np


class AudioVisualFuser:
    """Combine audio spectral features with visual
    spatial features using attention-weighted fusion."""

    def __init__(self, audio_dim=40, visual_dim=32,
                 fused_dim=64):
        self.rng = np.random.RandomState(42)
        self.audio_dim = audio_dim
        self.visual_dim = visual_dim
        self.fused_dim = fused_dim
        # Projection matrices
        self.W_audio = self.rng.randn(
            audio_dim, fused_dim) * 0.1
        self.W_visual = self.rng.randn(
            visual_dim, fused_dim) * 0.1
        # Attention weights
        self.W_attn = self.rng.randn(fused_dim, 1) * 0.1

    def project(self, audio_feat, visual_feat):
        """Project both modalities into shared space."""
        a_proj = np.tanh(audio_feat @ self.W_audio)
        v_proj = np.tanh(visual_feat @ self.W_visual)
        return a_proj, v_proj

    def attention_fuse(self, a_proj, v_proj):
        """Attention-weighted fusion."""
        # Stack modalities: (2, n_frames, fused_dim)
        stacked = np.stack([a_proj, v_proj], axis=0)
        # Compute attention scores
        scores = np.tanh(stacked @ self.W_attn).squeeze(-1)
        # Softmax over the 2 modalities
        exp_s = np.exp(scores - scores.max(axis=0))
        weights = exp_s / exp_s.sum(axis=0, keepdims=True)
        # Weighted combination
        fused = weights[0:1].T * a_proj + weights[1:2].T * v_proj
        return fused, weights

    def run_demo(self):
        n_frames = 20
        audio = self.rng.randn(n_frames, self.audio_dim)
        visual = self.rng.randn(n_frames, self.visual_dim)

        # Simulate noisy audio in frames 5-10
        audio[5:10] += self.rng.randn(5, self.audio_dim) * 3.0

        a_proj, v_proj = self.project(audio, visual)
        fused, weights = self.attention_fuse(a_proj, v_proj)

        print("=== Audio-Visual Fusion ===")
        print(f"Audio features: {audio.shape}")
        print(f"Visual features: {visual.shape}")
        print(f"Fused features: {fused.shape}")
        print(f"\nAttention weights (audio, visual):")
        print(f"  Clean frames (0-4):  audio={weights[0,:5].mean():.3f}"
              f"  visual={weights[1,:5].mean():.3f}")
        print(f"  Noisy frames (5-9):  audio={weights[0,5:10].mean():.3f}"
              f"  visual={weights[1,5:10].mean():.3f}")
        print(f"  Clean frames (10-19): audio={weights[0,10:].mean():.3f}"
              f"  visual={weights[1,10:].mean():.3f}")


fuser = AudioVisualFuser()
fuser.run_demo()

The attention mechanism is the key idea here. When the audio signal is clean, the model can rely heavily on audio features (which carry more information than lip movements for speech). When the audio is noisy, the attention shifts toward visual features. This adaptive weighting is exactly what makes audio-visual systems more robust than either modality alone -- the model learns to dynamically trust the cleaner signal. Note that the weights in this demo are random and untrained, so the printed attention values only show the mechanics; the shift toward the visual stream appears once W_attn is learned from data.
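Why trusting the cleaner signal pays off can be checked without any training. The sketch below swaps the learned attention for classical inverse-variance weighting, which is a stand-in and not the AudioVisualFuser's method. It assumes both modalities are noisy observations of the same latent content and that the per-frame noise variances are known; a real model would have to estimate or learn them.

```python
import numpy as np

rng = np.random.RandomState(0)
n_frames, dim = 20, 8
latent = rng.randn(n_frames, dim)      # shared "true" content (assumption)

# Per-frame noise variance of each modality (assumed known here)
var_a = np.full(n_frames, 0.1)
var_a[5:10] = 9.0                      # noisy audio in frames 5-9
var_v = np.full(n_frames, 0.5)

audio = latent + rng.randn(n_frames, dim) * np.sqrt(var_a)[:, None]
visual = latent + rng.randn(n_frames, dim) * np.sqrt(var_v)[:, None]

# Inverse-variance weights: trust each modality in proportion to 1/variance
w_a = (1 / var_a) / (1 / var_a + 1 / var_v)
w_v = 1 - w_a
fused = w_a[:, None] * audio + w_v[:, None] * visual


def mse(x):
    """Mean squared error against the latent content."""
    return float(np.mean((x - latent) ** 2))


print(f"audio weight, clean frames: {w_a[:5].mean():.3f}")
print(f"audio weight, noisy frames: {w_a[5:10].mean():.3f}")
print(f"MSE audio={mse(audio):.3f}  visual={mse(visual):.3f}"
      f"  fused={mse(fused):.3f}")
```

The audio weight collapses in the noisy frames and the fused estimate has lower error than either stream. A trained attention layer would ideally approximate this behaviour.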

Audio-visual speech recognition

Audio-visual ASR (AV-ASR) combines a visual encoder (processing lip/face video) with an audio encoder (processing spectrograms), fusing their representations before decoding text. Let's build a simplified version that demonstrates the key insight -- multimodal robustness under noise:

import numpy as np


class SimpleAVASR:
    """Audio-visual speech recognizer demonstrating
    noise robustness through multimodal fusion."""

    def __init__(self, sr=16000, n_mels=40, visual_dim=32):
        self.sr = sr
        self.n_mels = n_mels
        self.visual_dim = visual_dim
        self.rng = np.random.RandomState(42)

    def synthesize_av_speech(self, word, duration=0.5):
        """Generate paired audio-visual data for a word."""
        n = int(self.sr * duration)
        t = np.arange(n) / self.sr
        word_freqs = {
            "hello": [200, 400, 800],
            "world": [150, 350, 700],
            "stop":  [300, 500, 900],
            "go":    [180, 450, 750],
        }
        freqs = word_freqs.get(word, [200, 400, 800])
        audio = sum(
            (0.3 / (i + 1)) * np.sin(2 * np.pi * f * t)
            for i, f in enumerate(freqs)
        )
        audio += self.rng.randn(n) * 0.02

        # Visual: simulate lip aperture over time
        n_frames = 10
        lip_features = np.zeros((n_frames, self.visual_dim))
        for i in range(n_frames):
            phase = i / n_frames
            aperture = 0.5 + 0.3 * np.sin(
                2 * np.pi * freqs[0] / 200 * phase)
            lip_features[i, :8] = aperture + self.rng.randn(8) * 0.1
            lip_features[i, 8:16] = freqs[0] / 1000 + self.rng.randn(8) * 0.05
            lip_features[i, 16:] = self.rng.randn(self.visual_dim - 16) * 0.1
        return audio, lip_features

    def mel_spectrogram(self, audio):
        """Simplified mel-style log spectrogram: FFT magnitudes averaged
        into linearly spaced bands (not a true mel filterbank)."""
        n_fft, hop = 512, 160
        n_frames = max(1, (len(audio) - n_fft) // hop)
        spec = np.zeros((self.n_mels, n_frames))
        window = np.hanning(n_fft)
        for i in range(n_frames):
            s = i * hop
            frame = audio[s:s + n_fft]
            if len(frame) < n_fft:
                break
            fft_mag = np.abs(np.fft.rfft(frame * window))
            mel_bins = np.linspace(0, len(fft_mag),
                                   self.n_mels + 1, dtype=int)
            for m in range(self.n_mels):
                lo, hi = mel_bins[m], mel_bins[m + 1]
                if hi > lo:
                    spec[m, i] = np.mean(fft_mag[lo:hi])
        return np.log(spec + 1e-9)

    def extract_av_features(self, mel, visual):
        """Extract combined audio-visual feature vector."""
        audio_feat = np.concatenate([
            np.mean(mel, axis=1), np.std(mel, axis=1)])
        visual_feat = np.concatenate([
            np.mean(visual, axis=0), np.std(visual, axis=0)])
        return np.concatenate([audio_feat, visual_feat])

    def run(self):
        words = ["hello", "world", "stop", "go"]
        # Build training set
        X_train, y_train = [], []
        for wi, word in enumerate(words):
            for _ in range(15):
                audio, visual = self.synthesize_av_speech(word)
                mel = self.mel_spectrogram(audio)
                feat = self.extract_av_features(mel, visual)
                X_train.append(feat)
                y_train.append(wi)

        X = np.array(X_train)
        y = np.array(y_train)
        mean, std = X.mean(axis=0), X.std(axis=0) + 1e-9
        X_n = (X - mean) / std

        # Train softmax classifier
        W = self.rng.randn(X_n.shape[1], len(words)) * 0.01
        b = np.zeros(len(words))
        for epoch in range(300):
            logits = X_n @ W + b
            exp_l = np.exp(logits - logits.max(axis=1, keepdims=True))
            probs = exp_l / exp_l.sum(axis=1, keepdims=True)
            targets = np.zeros_like(probs)
            targets[np.arange(len(y)), y] = 1.0
            grad = probs - targets
            W -= 0.1 * (X_n.T @ grad) / len(y)
            b -= 0.1 * grad.mean(axis=0)

        # Test across noise levels
        print("=== AV-ASR Noise Robustness ===")
        for noise in [0.0, 0.5, 2.0]:
            av_ok = ao_ok = total = 0
            for wi, word in enumerate(words):
                for _ in range(5):
                    audio, visual = self.synthesize_av_speech(word)
                    audio += self.rng.randn(len(audio)) * noise
                    mel = self.mel_spectrogram(audio)
                    av_feat = self.extract_av_features(mel, visual)
                    av_n = (av_feat - mean) / std
                    if (av_n @ W + b).argmax() == wi:
                        av_ok += 1
                    # Audio-only: zero the visual features *after*
                    # normalization, so they sit at the training mean
                    # instead of at a far-off (0 - mean) / std value
                    ao_n = av_n.copy()
                    ao_n[self.n_mels * 2:] = 0.0
                    if (ao_n @ W + b).argmax() == wi:
                        ao_ok += 1
                    total += 1
            print(f"  Noise={noise:.1f}: "
                  f"AV={av_ok/total:.0%}  Audio-only={ao_ok/total:.0%}")


asr = SimpleAVASR()
asr.run()

The comparison at different noise levels tells the whole story: when audio is clean, both approaches perform similarly. But as noise increases, the audio-visual system degrades much more gracefully because the visual channel remains unaffected by acoustic noise. This is the same reason research AV-ASR models like AV-HuBERT (Meta) report much lower word error rates than audio-only baselines in noisy conditions.
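The classifier above averages each stream over time before fusing, so the different frame rates (46 audio frames versus 10 video frames for a 0.5 s clip) never collide. Frame-level fusion needs the two streams on a common time grid first. Here is a minimal sketch of one way to do that, average-pooling audio frames per video frame; the helper name `align_audio_to_video` is hypothetical and the features are random placeholders.

```python
import numpy as np


def align_audio_to_video(audio_feats, n_video_frames):
    """Average-pool audio frames (n_audio, dim) into n_video_frames
    contiguous segments so both streams share one time axis."""
    n_audio = audio_feats.shape[0]
    # Integer segment boundaries: edges[i] = floor(i * n_audio / n_video)
    edges = (np.arange(n_video_frames + 1) * n_audio) // n_video_frames
    return np.stack([
        # max(...) keeps at least one frame per segment
        audio_feats[lo:max(hi, lo + 1)].mean(axis=0)
        for lo, hi in zip(edges[:-1], edges[1:])
    ])


rng = np.random.RandomState(0)
audio_feats = rng.randn(46, 40)    # 46 spectrogram frames x 40 bands
visual_feats = rng.randn(10, 32)   # 10 video frames x 32 dims

aligned = align_audio_to_video(audio_feats, len(visual_feats))
frame_level = np.concatenate([aligned, visual_feats], axis=1)
print("aligned audio:", aligned.shape)
print("frame-level AV features:", frame_level.shape)
```

Pooling is the simplest choice. Real systems often upsample the video features or let a cross-attention layer learn the alignment instead.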

Lip reading: visual-only speech recognition

Pure lip reading -- transcribing speech from video alone, with no audio at all -- is one of the hardest problems in multimodal AI. Humans are surprisingly bad at it (only about 10-30% of English phonemes are visually distinguishable -- "p", "b", and "m" all look identical on the lips). Yet modern models achieve remarkable accuracy on benchmark datasets.

The key insight is that even though individual phonemes are ambiguous, the temporal sequence of lip movements combined with language model priors can disambiguate quite a lot. A sequence of lip shapes that could be "bat", "pat", or "mat" gets resolved by context -- if the previous word was "rubber" then "mat" is far more likely ;-)
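The "rubber mat" example can be written out as a tiny Bayes update. The probabilities below are invented for illustration; a real system would take them from a visual model and a trained language model.

```python
candidates = ["bat", "pat", "mat"]

# Visual evidence: /b/, /p/ and /m/ look the same on the lips,
# so the lip stream alone cannot prefer any candidate
p_visual = {"bat": 1 / 3, "pat": 1 / 3, "mat": 1 / 3}

# Hypothetical language-model scores P(word | previous word = "rubber")
p_lm = {"bat": 0.05, "pat": 0.05, "mat": 0.90}

# Posterior is proportional to visual likelihood times language prior
unnorm = {w: p_visual[w] * p_lm[w] for w in candidates}
total = sum(unnorm.values())
posterior = {w: s / total for w, s in unnorm.items()}

best = max(posterior, key=posterior.get)
for w in candidates:
    print(f"{w}: {posterior[w]:.2f}")
print("decoded word:", best)
```

With uninformative visual evidence the decision is carried entirely by the prior, which is why lip reading models lean so heavily on context.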

The LRS2 and LRS3 datasets (built from BBC broadcasts and TED talks, respectively) are the standard benchmarks. State-of-the-art lip reading achieves around 20-25% WER on LRS3 -- not as good as audio ASR, but impressive considering the inherent ambiguity of visual speech. The typical architecture uses a 3D CNN frontend (to capture short-term spatiotemporal motion) followed by a transformer temporal model and CTC loss for text prediction.
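The CTC part is easy to show in isolation. A CTC-trained model emits one symbol (or a blank) per video frame, and decoding collapses that frame sequence into text. Below is a minimal greedy-decoding sketch; the vocabulary and frame predictions are made up, and real systems usually use beam search with a language model instead of this greedy rule.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse consecutive repeats, then drop blank symbols."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out


# Hypothetical vocabulary; id 0 is the CTC blank
vocab = {0: "_", 1: "m", 2: "a", 3: "t"}
# Per-frame argmax predictions for a 12-frame clip (invented)
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3, 0]

decoded = "".join(vocab[i] for i in ctc_greedy_decode(frames))
print(decoded)
```

The blank is what allows genuine double letters: `[1, 0, 1]` decodes to two symbols, while `[1, 1]` collapses to one.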

The cocktail party problem

You're at a party. Multiple conversations happening simultaneously. Yet you can focus on one speaker and tune out the rest. How? Your brain uses both audio cues (pitch, timbre, spatial location) and visual cues (who's moving their lips, who's facing you) to separate the target speaker from the mixture.

Audio-visual source separation does the same thing. Given a video of multiple speakers and a mixed audio signal, separate each speaker's voice. Let's build a simplified version:

import numpy as np


class CocktailPartySolver:
    """Separate speakers using visual identity
    as a guidance signal for spectral masking."""

    def __init__(self, sr=16000):
        self.sr = sr
        self.rng = np.random.RandomState(42)

    def synthesize_speaker(self, f0, formants, duration=3.0):
        """Generate speaker audio with distinct vocal identity."""
        n = int(self.sr * duration)
        t = np.arange(n) / self.sr
        signal = sum(a * np.sin(2 * np.pi * f * t) for f, a in formants)
        for h in range(2, 5):
            signal += (0.2 / h) * np.sin(2 * np.pi * f0 * h * t)
        signal *= 0.5 + 0.5 * np.sin(2 * np.pi * 3 * t)
        return signal

    def visual_identity(self, speaker_id):
        """Simulate face embedding for a speaker."""
        # Use a local generator so the shared self.rng is not reseeded
        rng = np.random.RandomState(speaker_id * 100)
        emb = rng.randn(32)
        return emb / np.linalg.norm(emb)

    def spectrogram(self, audio):
        """Simple magnitude spectrogram."""
        n_fft, hop = 512, 256
        n_frames = (len(audio) - n_fft) // hop
        spec = np.zeros((n_fft // 2 + 1, n_frames))
        window = np.hanning(n_fft)
        for i in range(n_frames):
            s = i * hop
            spec[:, i] = np.abs(np.fft.rfft(audio[s:s + n_fft] * window))
        return spec

    def spectral_mask(self, mixed_spec, target_spec):
        """Compute an oracle binary mask for the target speaker
        (thresholded, then smoothed along time); not a true ratio mask."""
        mask = np.zeros_like(mixed_spec)
        mask[target_spec > mixed_spec * 0.3] = 1.0
        kernel = np.ones(3) / 3
        for i in range(mask.shape[0]):
            mask[i] = np.convolve(mask[i], kernel, mode='same')
        return np.clip(mask, 0, 1)

    def run(self):
        spk_a = self.synthesize_speaker(
            f0=120, formants=[(500, 0.4), (1000, 0.3), (2200, 0.15)])
        spk_b = self.synthesize_speaker(
            f0=220, formants=[(800, 0.35), (1300, 0.25), (2700, 0.12)])

        min_len = min(len(spk_a), len(spk_b))
        mixed = spk_a[:min_len] + spk_b[:min_len]

        spec_a = self.spectrogram(spk_a[:min_len])
        spec_b = self.spectrogram(spk_b[:min_len])
        spec_mix = self.spectrogram(mixed)

        face_a = self.visual_identity(0)
        face_b = self.visual_identity(1)

        mask_a = self.spectral_mask(spec_mix, spec_a)
        separated_a = spec_mix * mask_a
        separated_b = spec_mix * (1 - mask_a)

        def sdr_proxy(clean, estimate):
            sig = np.sum(clean ** 2)
            noise = np.sum((clean - estimate) ** 2)
            return 10 * np.log10(sig / (noise + 1e-10))

        print("=== Cocktail Party Separation ===")
        print(f"Speaker A f0=120Hz, Speaker B f0=220Hz")
        print(f"Mixed spectrogram: {spec_mix.shape}")
        print(f"\nSeparation quality:")
        print(f"  Speaker A SDR: {sdr_proxy(spec_a, separated_a):.1f} dB")
        print(f"  Speaker B SDR: {sdr_proxy(spec_b, separated_b):.1f} dB")
        print(f"\nVisual identity similarity:")
        print(f"  face_A . face_A: {np.dot(face_a, face_a):.3f}")
        print(f"  face_A . face_B: {np.dot(face_a, face_b):.3f}")


solver = CocktailPartySolver()
solver.run()

The Looking to Listen paper (Google, 2018) demonstrated this convincingly in a real system: show the model a video of two people talking simultaneously, and it separates each person's voice based on their face. The key insight is that lip movements are temporally correlated with the corresponding speaker's audio, and the model learns this correspondence during training. Having said that, the mask we use above is an oracle: it is computed from the clean target spectrogram, and the face embeddings are only printed, not used for the separation itself. In practice the model has to predict the mask from the mixture and the speaker's face stream, which is harder but still remarkably effective.
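For reference, the two standard oracle masks can be compared directly. The sketch below uses random complex spectrograms as stand-ins for two speakers (not real speech) and reuses the same SDR proxy as the demo; the variable names are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
shape = (257, 50)   # (frequency bins, frames), as for n_fft=512

# Stand-in complex spectrograms for two speakers (assumption: random)
S_a = rng.randn(*shape) + 1j * rng.randn(*shape)
S_b = 0.7 * (rng.randn(*shape) + 1j * rng.randn(*shape))
mag_a, mag_b = np.abs(S_a), np.abs(S_b)
mag_mix = np.abs(S_a + S_b)   # mixing happens in the complex domain

# Ideal ratio mask: soft share of each time-frequency bin
irm_a = mag_a / (mag_a + mag_b + 1e-9)
# Ideal binary mask: the louder speaker takes the whole bin
ibm_a = (mag_a > mag_b).astype(float)


def sdr_proxy(clean, estimate):
    """Magnitude-domain signal-to-distortion ratio in dB."""
    sig = np.sum(clean ** 2)
    noise = np.sum((clean - estimate) ** 2)
    return 10 * np.log10(sig / (noise + 1e-10))


print(f"IRM estimate: {sdr_proxy(mag_a, irm_a * mag_mix):.1f} dB")
print(f"IBM estimate: {sdr_proxy(mag_a, ibm_a * mag_mix):.1f} dB")
```

Both masks need the clean sources, so they are upper-bound references rather than usable separators. A trained network is typically supervised to predict something like the ratio mask from the mixture plus visual features.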

Sound source localization

Given a video, where is the sound coming from? Is it the guitar on the left, the drums on the right, or the singer in the middle? Sound source localization identifies which visual regions correspond to the audio:

import numpy as np


class SoundLocalizer:
    """Locate sound sources in an image using
    audio-visual feature correspondence."""

    def __init__(self, grid_h=7, grid_w=7):
        self.grid_h = grid_h
        self.grid_w = grid_w
        self.rng = np.random.RandomState(42)
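        # Stand-in for a *trained* audio projection: near-identity, so
        # audio and visual features share a space (a purely random
        # matrix would make the heatmap arbitrary)
        self.W_audio = np.eye(40, 32) + self.rng.randn(40, 32) * 0.01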

    def synthesize_scene(self):
        """Create a scene with 3 objects, one sounding."""
        visual_map = self.rng.randn(self.grid_h, self.grid_w, 32) * 0.1

        # Object 1: guitar (rows 2-4, cols 0-2)
        guitar_feat = self.rng.randn(32) * 0.5
        guitar_feat[0:8] = 1.0
        for r in range(2, 5):
            for c in range(0, 3):
                visual_map[r, c] = guitar_feat + self.rng.randn(32) * 0.05

        # Object 2: drums (rows 1-3, cols 4-6)
        drum_feat = self.rng.randn(32) * 0.5
        drum_feat[8:16] = 1.0
        for r in range(1, 4):
            for c in range(4, 7):
                visual_map[r, c] = drum_feat + self.rng.randn(32) * 0.05

        # Object 3: person (rows 4-6, cols 2-4)
        person_feat = self.rng.randn(32) * 0.5
        person_feat[16:24] = 1.0
        for r in range(4, 7):
            for c in range(2, 5):
                visual_map[r, c] = person_feat + self.rng.randn(32) * 0.05

        # Audio signal matches guitar
        audio_feat = self.rng.randn(40) * 0.1
        audio_feat[:8] = 1.0
        return visual_map, audio_feat

    def localize(self, visual_map, audio_feat):
        """Compute spatial attention heatmap."""
        audio_proj = np.tanh(audio_feat @ self.W_audio)
        audio_proj /= np.linalg.norm(audio_proj) + 1e-9

        heatmap = np.zeros((self.grid_h, self.grid_w))
        for r in range(self.grid_h):
            for c in range(self.grid_w):
                v = visual_map[r, c]
                v_norm = v / (np.linalg.norm(v) + 1e-9)
                heatmap[r, c] = np.dot(audio_proj, v_norm)
        heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-9)
        return heatmap

    def run(self):
        visual_map, audio_feat = self.synthesize_scene()
        heatmap = self.localize(visual_map, audio_feat)

        print("=== Sound Localization ===")
        print("Heatmap (7x7 grid):")
        print("  Guitar=rows 2-4 cols 0-2  Drums=rows 1-3 cols 4-6")
        print("  Person=rows 4-6 cols 2-4  Audio=guitar sound\n")
        for r in range(self.grid_h):
            row_str = "  "
            for c in range(self.grid_w):
                v = heatmap[r, c]
                row_str += " ## " if v > 0.7 else (" ++ " if v > 0.4 else " .. ")
            print(row_str)

        print(f"\nGuitar region avg:  {heatmap[2:5, 0:3].mean():.3f}")
        print(f"Drums region avg:   {heatmap[1:4, 4:7].mean():.3f}")
        print(f"Person region avg:  {heatmap[4:7, 2:5].mean():.3f}")


localizer = SoundLocalizer()
localizer.run()

The training trick in real systems: you don't need manual annotations of "the guitar is at pixel (234, 156)." Instead you use contrastive learning -- matching video frames with their corresponding audio (positive pairs) and mismatched audio (negative pairs). The model learns that the sounding object's visual features correlate with the audio features, and localization emerges as a natural byproduct. No expensive annotation required ;-)
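The contrastive objective itself fits in a few lines. Here is a minimal NumPy sketch of a one-directional (video-to-audio) InfoNCE loss over a batch, assuming row i of each embedding matrix comes from the same clip. The embeddings are synthetic and the temperature value is an arbitrary choice for the example.

```python
import numpy as np


def info_nce(v_emb, a_emb, temperature=0.1):
    """Video-to-audio InfoNCE: for each video embedding, the audio in
    the same row is the positive, all other rows are negatives."""
    v = v_emb / np.linalg.norm(v_emb, axis=1, keepdims=True)
    a = a_emb / np.linalg.norm(a_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature              # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal as the correct "class"
    return float(-np.mean(np.diag(log_probs)))


rng = np.random.RandomState(0)
batch, dim = 8, 16
video = rng.randn(batch, dim)
matched_audio = video + rng.randn(batch, dim) * 0.1   # aligned pairs
mismatched_audio = np.roll(matched_audio, 1, axis=0)  # every pair is wrong

loss_matched = info_nce(video, matched_audio)
loss_mismatched = info_nce(video, mismatched_audio)
print(f"loss with aligned pairs:    {loss_matched:.3f}")
print(f"loss with mismatched pairs: {loss_mismatched:.3f}")
```

Minimizing this loss pulls each clip's audio and visual embeddings together and pushes other clips away. For localization, the same similarity would be computed per spatial cell of the visual feature map rather than per frame.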

Video-to-audio generation

The reverse problem: given a silent video, generate appropriate audio. A video of rain should produce rain sounds. A video of someone typing should produce keyboard clicks. A video of a dog barking should produce... well, a bark.

This combines video understanding (episode #83) with audio generation (episodes #94 and #96). The visual encoder extracts frame-level features, and an audio decoder generates a spectrogram conditioned on those features:

import numpy as np


class VideoToAudioGenerator:
    """Generate audio spectrograms from visual frame features."""

    def __init__(self, n_mels=40, visual_dim=32, hidden_dim=64):
        self.n_mels = n_mels
        self.rng = np.random.RandomState(42)
        self.W_enc = self.rng.randn(visual_dim, hidden_dim) * 0.1
        self.b_enc = np.zeros(hidden_dim)
        self.W_dec = self.rng.randn(hidden_dim, n_mels) * 0.1
        self.b_dec = np.zeros(n_mels)

    def simulate_video(self, event_type, n_frames=20):
        """Simulate visual features for sound-producing events."""
        features = np.zeros((n_frames, 32))
        if event_type == "clapping":
            for i in range(n_frames):
                phase = np.sin(2 * np.pi * 3 * i / n_frames)
                features[i, :8] = phase * 0.8 + self.rng.randn(8) * 0.1
                features[i, 8:16] = 0.3
        elif event_type == "drumming":
            for i in range(n_frames):
                phase = np.sin(2 * np.pi * 5 * i / n_frames)
                features[i, :8] = 0.2
                features[i, 8:16] = abs(phase) * 0.9 + self.rng.randn(8) * 0.1
        elif event_type == "waving":
            for i in range(n_frames):
                phase = np.sin(2 * np.pi * 2 * i / n_frames)
                features[i, :8] = phase * 0.4
                features[i, 16:24] = 0.5
        return features

    def generate_mel(self, visual_feats):
        """Predict mel spectrogram from visual features."""
        hidden = np.tanh(visual_feats @ self.W_enc + self.b_enc)
        return hidden @ self.W_dec + self.b_dec

    def run(self):
        events = ["clapping", "drumming", "waving"]
        print("=== Video-to-Audio Generation ===")
        mels = {}
        for event in events:
            visual = self.simulate_video(event)
            mel = self.generate_mel(visual)
            energy = np.mean(mel ** 2, axis=1)
            mels[event] = mel
            print(f"\n  {event}: mel={mel.shape}"
                  f"  mean_energy={np.mean(energy):.4f}"
                  f"  peak_frame={energy.argmax()}")

        print("\nCross-event mel similarity:")
        for i, e1 in enumerate(events):
            for e2 in events[i+1:]:
                m1, m2 = mels[e1].flatten(), mels[e2].flatten()
                cos = np.dot(m1, m2) / (np.linalg.norm(m1) * np.linalg.norm(m2) + 1e-9)
                print(f"  {e1} vs {e2}: {cos:.3f}")


gen = VideoToAudioGenerator()
gen.run()

Recent models like SpecVQGAN and Diff-Foley use codebook-based or diffusion-based generation rather than direct mel prediction, producing much higher quality output. The VGGSound dataset (roughly 200K video clips labelled across about 300 sound classes) is a widely used benchmark. Nonetheless, the principle remains the same: extract temporal visual features, condition an audio generator on them, and produce audio that matches the visual content.
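The generator in the demo uses random weights, so its output is untrained. To show what "conditioning on visual features" looks like once paired data is available, the sketch below fits a purely linear visual-to-mel decoder with closed-form ridge regression. This is a deliberately crude stand-in, far simpler than SpecVQGAN or Diff-Foley, and the paired data is synthetic, generated from a hypothetical linear ground truth.

```python
import numpy as np

rng = np.random.RandomState(0)
n_frames, visual_dim, n_mels = 200, 32, 40

# Synthetic paired data (assumption): mel frames are a fixed linear
# function of the visual features plus a little noise
visual = rng.randn(n_frames, visual_dim)
W_true = rng.randn(visual_dim, n_mels) * 0.3
mel = visual @ W_true + rng.randn(n_frames, n_mels) * 0.05

# Ridge regression in closed form: W = (X^T X + lam * I)^-1 X^T Y
lam = 1e-2
W_fit = np.linalg.solve(
    visual.T @ visual + lam * np.eye(visual_dim), visual.T @ mel)

# Baseline: an untrained random decoder, as in the demo above
W_random = rng.randn(visual_dim, n_mels) * 0.1
err_random = float(np.mean((visual @ W_random - mel) ** 2))
err_fit = float(np.mean((visual @ W_fit - mel) ** 2))
print(f"random decoder MSE: {err_random:.4f}")
print(f"fitted decoder MSE: {err_fit:.4f}")
```

Real video-to-audio mappings are neither linear nor deterministic (many sounds are plausible for one video), which is why codebook and diffusion models are used in practice.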

Audio-visual emotion recognition

In episode #98 we recognized emotion from speech alone using prosodic features. Adding the visual channel -- facial expressions -- creates a much richer signal. The face shows micro-expressions that the voice might mask, and vocal tone reveals emotion that a poker face conceals:

import numpy as np


class AVEmotionSystem:
    """Multimodal emotion recognition
    from face + audio features."""

    def __init__(self):
        self.emotions = ["neutral", "happy", "angry", "sad"]
        self.rng = np.random.RandomState(42)

    def synthesize_face_features(self, emotion, n_samples=15):
        """Simulate facial action unit features per emotion."""
        profiles = {
            "neutral": np.array([0.1, 0.1, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0]),
            "happy":   np.array([0.0, 0.8, 0.0, 0.0, 0.7, 0.6, 0.2, 0.0]),
            "angry":   np.array([0.7, 0.0, 0.8, 0.6, 0.0, 0.0, 0.0, 0.5]),
            "sad":     np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6, 0.7]),
        }
        base = profiles[emotion]
        return np.array([
            np.clip(base + self.rng.randn(8) * 0.15, 0, 1)
            for _ in range(n_samples)])

    def synthesize_audio_features(self, emotion, n_samples=15):
        """Simulate prosodic features per emotion."""
        profiles = {
            "neutral": np.array([150, 10, 0.3,  0.05, 1.0, 0.1]),
            "happy":   np.array([200, 30, 0.5,  0.1,  1.3, 0.3]),
            "angry":   np.array([170, 20, 0.6,  0.15, 1.1, 0.25]),
            "sad":     np.array([120, 8,  0.15, 0.03, 0.7, 0.05]),
        }
        base = profiles[emotion]
        return np.array([
            base + self.rng.randn(6) * base * 0.1
            for _ in range(n_samples)])

    def run(self):
        X_face, X_audio, y = [], [], []
        for ei, emo in enumerate(self.emotions):
            faces = self.synthesize_face_features(emo)
            audios = self.synthesize_audio_features(emo)
            for i in range(len(faces)):
                X_face.append(faces[i])
                X_audio.append(audios[i])
                y.append(ei)

        X_face = np.array(X_face)
        X_audio = np.array(X_audio)
        y = np.array(y)

        f_mean, f_std = X_face.mean(0), X_face.std(0) + 1e-9
        a_mean, a_std = X_audio.mean(0), X_audio.std(0) + 1e-9
        Xf_n = (X_face - f_mean) / f_std
        Xa_n = (X_audio - a_mean) / a_std
        X_combined = np.hstack([Xf_n, Xa_n])

        # kNN evaluation (k=5, leave-one-out)
        print("=== AV Emotion Recognition ===")
        for name, X_data in [("Face only", Xf_n), ("Audio only", Xa_n),
                              ("Audio+Visual", X_combined)]:
            correct = 0
            for i in range(len(X_data)):
                dists = np.linalg.norm(
                    np.delete(X_data, i, axis=0) - X_data[i], axis=1)
                y_rem = np.delete(y, i)
                pred = int(np.bincount(y_rem[np.argsort(dists)[:5]]).argmax())
                if pred == y[i]:
                    correct += 1
            print(f"  {name}: {correct / len(y):.1%}")

        # Sarcasm: happy words, angry voice, neutral face
        print("\n--- Sarcasm Scenario ---")
        sarcasm_face = np.array([0.1, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.0])
        sarcasm_audio = np.array([170, 20, 0.6, 0.15, 1.1, 0.25])
        sf_n = (sarcasm_face - f_mean) / f_std
        sa_n = (sarcasm_audio - a_mean) / a_std
        for name, feat, X_ref in [
            ("Face", sf_n, Xf_n), ("Audio", sa_n, Xa_n),
            ("Combined", np.concatenate([sf_n, sa_n]), X_combined),
        ]:
            dists = np.linalg.norm(X_ref - feat, axis=1)
            pred = int(np.bincount(y[np.argsort(dists)[:5]]).argmax())
            print(f"  {name} predicts: {self.emotions[pred]}")


system = AVEmotionSystem()
system.run()

On the IEMOCAP dataset, audio-visual models typically outperform single-modality approaches by 5-10 percentage points. The gain is largest for emotions that are ambiguous in one modality but clear in the other -- sarcasm being the classic example (happy words, angry tone, flat face). A fused model catches what neither modality detects alone, which is the effect the combined row and the sarcasm scenario in our demo are meant to illustrate.

Cross-modal self-supervised learning

One of the most powerful ideas in audio-visual AI: use the natural correspondence between audio and video as free supervision. A video of a dog barking inherently contains aligned audio-visual information -- no human annotation needed.

Audio-Visual Correspondence (AVC) training is beautifully simple: given a video frame and an audio clip, predict whether they come from the same video or not. This binary classification task teaches the model about audio-visual semantics without any labels:

import numpy as np


class AVCPretrainer:
    """Self-supervised audio-visual correspondence:
    does this audio match this video frame?"""

    def __init__(self, embed_dim=32):
        self.embed_dim = embed_dim
        self.rng = np.random.RandomState(42)
        self.W_visual = self.rng.randn(16, embed_dim) * 0.1
        self.W_audio = self.rng.randn(16, embed_dim) * 0.1

    def synthesize_pair(self, label, scene_type):
        """Generate matched (label=1) or mismatched (label=0)
        audio-visual pair for a scene."""
        scenes = {
            "dog":    {"visual": np.array([1,0,0,0,0.8,0.5,0,0,0.3,0,0,0,0,0,0,0]),
                       "audio":  np.array([1,0,0,0,0.7,0.6,0,0,0.2,0,0,0,0,0,0,0])},
            "guitar": {"visual": np.array([0,1,0,0,0,0,0.9,0.4,0,0,0.5,0,0,0,0,0]),
                       "audio":  np.array([0,1,0,0,0,0,0.8,0.5,0,0,0.6,0,0,0,0,0])},
            "rain":   {"visual": np.array([0,0,1,0,0,0,0,0,0,0.7,0,0.8,0,0,0,0]),
                       "audio":  np.array([0,0,1,0,0,0,0,0,0,0.6,0,0.7,0,0,0,0])},
        }
        scene = scenes[scene_type]
        visual = scene["visual"] + self.rng.randn(16) * 0.1
        if label == 1:
            audio = scene["audio"] + self.rng.randn(16) * 0.1
        else:
            other = self.rng.choice([s for s in scenes if s != scene_type])
            audio = scenes[other]["audio"] + self.rng.randn(16) * 0.1
        return visual, audio

    def train(self, n_epochs=200, lr=0.05):
        scene_types = ["dog", "guitar", "rain"]
        X_v, X_a, Y = [], [], []
        for _ in range(30):
            for scene in scene_types:
                for label in [1, 0]:
                    v, a = self.synthesize_pair(label, scene)
                    X_v.append(v)
                    X_a.append(a)
                    Y.append(float(label))

        X_v, X_a, Y = np.array(X_v), np.array(X_a), np.array(Y)

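        for epoch in range(n_epochs):
            # Forward pass: project, squash, L2-normalise, cosine similarity
            v_emb = np.tanh(X_v @ self.W_visual)
            a_emb = np.tanh(X_a @ self.W_audio)
            v_len = np.linalg.norm(v_emb, axis=1, keepdims=True) + 1e-9
            a_len = np.linalg.norm(a_emb, axis=1, keepdims=True) + 1e-9
            v_norm, a_norm = v_emb / v_len, a_emb / a_len
            sim = np.sum(v_norm * a_norm, axis=1)
            pred = 1 / (1 + np.exp(-sim * 5))
            loss = -np.mean(Y * np.log(pred + 1e-9) + (1 - Y) * np.log(1 - pred + 1e-9))

            # Backward pass: without this update the weights never change
            g_sim = (5 * (pred - Y) / len(Y))[:, None]               # dL/dsim
            g_v = g_sim * (a_norm - sim[:, None] * v_norm) / v_len   # dL/dv_emb
            g_a = g_sim * (v_norm - sim[:, None] * a_norm) / a_len   # dL/da_emb
            self.W_visual -= lr * X_v.T @ (g_v * (1 - v_emb ** 2))   # tanh' = 1 - tanh^2
            self.W_audio -= lr * X_a.T @ (g_a * (1 - a_emb ** 2))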

            if (epoch + 1) % 50 == 0:
                acc = np.mean((pred > 0.5) == Y)
                print(f"Epoch {epoch+1}: loss={loss:.3f}  acc={acc:.1%}")

        # Zero-shot retrieval demo
        print("\n--- Zero-shot Audio-Visual Retrieval ---")
        for scene in scene_types:
            v, _ = self.synthesize_pair(1, scene)
            v_e = np.tanh(v @ self.W_visual)
            v_e /= np.linalg.norm(v_e) + 1e-9
            print(f"  Query: {scene} (visual)")
            for s2 in scene_types:
                _, a2 = self.synthesize_pair(1, s2)
                a_e = np.tanh(a2 @ self.W_audio)
                a_e /= np.linalg.norm(a_e) + 1e-9
                cos = float(np.dot(v_e, a_e))
                match = " <--" if s2 == scene else ""
                print(f"    vs {s2} audio: {cos:.3f}{match}")


trainer = AVCPretrainer()
trainer.train()

ImageBind (Meta, 2023) takes this concept to its logical extreme: it learns a joint embedding space across six modalities (images, text, audio, depth, thermal, IMU sensor data) using only image-paired data, with the image as the anchor. Each of the other modalities is aligned to images, never directly to each other. Because images naturally co-occur with all other modalities, the image embedding becomes a bridge -- you can match audio to text even though the model was never explicitly trained on audio-text pairs. The audio of a barking dog maps to the same region as the image of a dog and the text "dog barking." That's a remarkably elegant solution to the multimodal alignment problem.
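The bridging effect can be sketched in a few lines of NumPy. This is a toy illustration, not ImageBind's actual code: the `info_nce` helper, the temperature of 0.07, and the trick of modelling "an encoder already aligned to images" as the image embedding plus small noise are all assumptions made for the example. The point is only that audio and text end up close to each other because both sit close to the same image anchor.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-9)


def info_nce(anchor, other, temperature=0.07):
    """Symmetric InfoNCE-style loss: row i of `anchor` should match row i
    of `other` and mismatch every other row in the batch."""
    logits = l2_normalize(anchor) @ l2_normalize(other).T / temperature

    def cross_entropy_diag(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


rng = np.random.default_rng(0)
n_concepts, dim = 5, 32

# One image embedding per concept (the anchor modality).
image_emb = rng.normal(size=(n_concepts, dim))

# Assumption: audio and text encoders were each trained against images only,
# modelled here as "image embedding plus independent noise".
audio_emb = image_emb + 0.2 * rng.normal(size=(n_concepts, dim))
text_emb = image_emb + 0.2 * rng.normal(size=(n_concepts, dim))

loss_aligned = info_nce(image_emb, audio_emb)
loss_random = info_nce(image_emb, rng.normal(size=(n_concepts, dim)))
print(f"image-audio loss, aligned: {loss_aligned:.4f}")
print(f"image-audio loss, random : {loss_random:.4f}")

# Audio -> text retrieval, although no audio-text pair was ever used.
sims = l2_normalize(audio_emb) @ l2_normalize(text_emb).T
retrieved = sims.argmax(axis=1)
print("audio query i retrieves text:", retrieved.tolist())
```

In a real system the noise term is replaced by whatever residual error the contrastive training leaves behind, and the quality of the emergent audio-text match depends on how tightly each modality was aligned to the image anchor.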

Audio-visual evaluation metrics

Let's build proper evaluation for multimodal systems. The core question is always: does combining modalities actually help compared to using each modality alone?

import numpy as np


class AVEvaluator:
    """Evaluate multimodal systems vs unimodal baselines."""

    def __init__(self):
        self.rng = np.random.RandomState(42)

    def modality_gap_analysis(self):
        """Compare modalities across different conditions."""
        conditions = {
            "Clean":          (0.92, 0.75, 0.95),
            "Noisy audio":    (0.60, 0.75, 0.82),
            "Occluded face":  (0.90, 0.45, 0.88),
            "Both degraded":  (0.55, 0.40, 0.68),
        }
        print("=== Modality Gap Analysis ===")
        print(f"{'Condition':<18}{'Audio':>8}{'Visual':>8}{'AV':>8}{'Gain':>8}")
        print("-" * 50)
        for name, (a, v, av) in conditions.items():
            gain = av - max(a, v)
            # Width 8 matches the header columns so the table lines up
            print(f"{name:<18}{a:>8.0%}{v:>8.0%}{av:>8.0%}{gain:>+8.0%}")

    def complementarity_score(self):
        """Measure how much modalities complement each other."""
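        n = 200
        # Simulated per-sample correctness. Each stream is drawn independently,
        # so the "rescued" count below reflects chance overlap, not real fusion.
        audio_ok = self.rng.random(n) > 0.15
        visual_ok = self.rng.random(n) > 0.30
        av_ok = self.rng.random(n) > 0.08
        neither = ~audio_ok & ~visual_ok   # both single modalities failed
        rescued = av_ok & neither          # ...but the AV system got it right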

        print("\n=== Complementarity ===")
        print(f"Audio correct:     {audio_ok.mean():.1%}")
        print(f"Visual correct:    {visual_ok.mean():.1%}")
        print(f"Both correct:      {(audio_ok & visual_ok).mean():.1%}")
        print(f"Audio-only right:  {(audio_ok & ~visual_ok).mean():.1%}")
        print(f"Visual-only right: {(~audio_ok & visual_ok).mean():.1%}")
        print(f"AV rescued:        {rescued.sum()}/{neither.sum()}")

    def run(self):
        self.modality_gap_analysis()
        self.complementarity_score()


evaluator = AVEvaluator()
evaluator.run()

The modality gap analysis table is one of the most informative tables in a multimodal paper. It shows the accuracy of each modality alone versus the combined system across different conditions. The "Gain" column shows improvement over the best single modality. If the gain is zero or negative, your fusion mechanism is not doing its job: a well-behaved combined system should at least match the best individual modality, especially under degraded conditions. The "Occluded face" row in the demo numbers above is such a case (audio alone 90%, AV 88%), where the degraded visual stream drags the fused result below audio-only.
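When you have real per-sample results rather than summary percentages, the same check can be computed directly. The sketch below assumes three boolean arrays of equal length (was the audio-only, visual-only, and AV prediction correct for each test sample). The function name `fusion_report` and the eight example outcomes are made up for illustration.

```python
import numpy as np


def fusion_report(audio_ok, visual_ok, av_ok):
    """Summarise how an AV system compares to its unimodal baselines,
    given per-sample correctness flags for each system."""
    audio_ok, visual_ok, av_ok = (
        np.asarray(x, dtype=bool) for x in (audio_ok, visual_ok, av_ok)
    )
    either = audio_ok | visual_ok   # at least one single modality is right
    neither = ~either               # both single modalities are wrong
    both = audio_ok & visual_ok     # both single modalities are right
    return {
        # Gain over the best single modality (negative means fusion hurts)
        "gain": float(av_ok.mean() - max(audio_ok.mean(), visual_ok.mean())),
        # Accuracy of an oracle that always picks the correct modality
        "oracle": float(either.mean()),
        # Samples both modalities missed but the AV system got right
        "rescued": int((av_ok & neither).sum()),
        # Samples both modalities got right but the AV system missed
        "broken": int((~av_ok & both).sum()),
    }


# Hypothetical outcomes for 8 test samples (1 = correct, 0 = wrong).
audio_ok = [1, 1, 0, 1, 0, 1, 1, 0]
visual_ok = [1, 0, 1, 1, 0, 0, 1, 0]
av_ok = [1, 1, 1, 1, 1, 1, 0, 0]

report = fusion_report(audio_ok, visual_ok, av_ok)
for key, value in report.items():
    print(f"{key:<8} {value}")
```

The "broken" count is worth watching alongside the gain: a fusion model can show a positive overall gain while still getting samples wrong that both unimodal baselines handled, which usually points to a problem in how the streams are combined.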

To summarize

  • Audio-visual fusion combines features from audio and visual streams using attention-weighted mechanisms that dynamically shift trust toward the cleaner modality -- this is the foundation all AV systems build on;
  • AV-ASR combines lip video with audio spectrograms for robust speech recognition in noisy environments; systems like AV-HuBERT use self-supervised pre-training on thousands of hours of talking-head video and significantly outperform audio-only Whisper in noise;
  • lip reading (video-only ASR) achieves ~20-25% WER on benchmarks like LRS3 using 3D CNN frontends and transformer temporal models -- impressive given that only 10-30% of phonemes are visually distinguishable;
  • the cocktail party problem is tackled by using each speaker's face track to guide separation from an audio mixture -- the Looking to Listen paper (Google, 2018) demonstrated that lip movements are temporally correlated with the corresponding speaker's audio;
  • sound source localization learns to match audio features with spatial visual features through contrastive learning, producing heatmaps that identify sounding objects without manual annotation;
  • video-to-audio generation synthesizes matching soundtracks from silent video using mel spectrogram prediction conditioned on visual frame features; modern systems use diffusion-based decoders for higher quality;
  • audio-visual emotion recognition outperforms single-modality approaches by 5-10 percentage points on IEMOCAP, with the largest gains on ambiguous cases like sarcasm;
  • cross-modal self-supervised learning (AVC, ImageBind) exploits the natural correspondence between audio and video as free supervision, building shared embedding spaces across modalities without labeled data.

We've covered a lot of ground in the audio domain over the last nine episodes -- from the raw physics of sound waves through speech recognition, synthesis, classification, speaker identification, voice NLU, enhancement, and now multimodal integration. The next milestone in this series is bringing all of that together into a practical project, and then we move into a completely different area of AI that involves learning through trial, error, and reward rather than labeled data. There's plenty more ground ahead.

Exercises

Exercise 1: Build an audio-visual noise robustness analyzer. Create a class AVNoiseAnalyzer that generates 4 "words" (4 distinct frequency profiles, 0.5s each at sr=16000), with both audio (sinusoidal harmonics) and visual features (8-dim lip aperture patterns). Train a softmax classifier on combined AV features (20 samples per word). Then test at 5 noise levels (sigma = 0.0, 0.3, 0.6, 1.0, 2.0) applied ONLY to the audio channel. For each noise level, compute and print accuracy for: (a) audio-only, (b) visual-only, (c) audio+visual combined. Show the crossover point where audio-only drops below visual-only accuracy.

Exercise 2: Build an audio-visual synchrony detector. Create a class AVSyncDetector that generates 5s of paired audio-visual data (sr=16000, 25 visual frames/sec): audio is a series of 10 "impacts" (short bursts at random times), visual is a brightness spike at each impact time. Create both synchronized (audio and visual align) and desynchronized versions (visual shifted by 0.2s). Extract per-frame cross-correlation features between audio energy and visual brightness within a +/-0.3s window. Train a binary classifier (synchronized vs not). Print accuracy, and show the average cross-correlation peak position for sync vs desync pairs.

Exercise 3: Build an audio-visual retrieval system. Create a class AVRetriever with 5 "scene" categories (dog_barking, guitar_playing, rain, typing, clapping), each with distinct audio features (6-dim: spectral centroid, bandwidth, energy, zero-crossing rate, pitch, harmonicity) and visual features (6-dim: motion magnitude, color histogram peaks, edge density, texture roughness, brightness variance, spatial frequency). Generate 10 samples per scene. Project both modalities into a shared 8-dim embedding space using learned projection matrices (train with contrastive loss: matched pairs should have high cosine similarity, mismatched pairs low). Evaluate with recall@1: given an audio query, retrieve the nearest visual match (and vice versa). Print the 5x5 retrieval confusion matrix and overall recall@1 for both directions.

Thanks, and until next time!

@scipio


