Learn AI Series (#88) - Face Analysis

What will I learn

  • You will learn face detection: from classical Viola-Jones cascades to modern RetinaFace;
  • face recognition: how FaceNet and ArcFace create identity embeddings from faces;
  • facial landmark detection: locating eyes, nose, mouth, and jawline with precision;
  • expression recognition: classifying emotions from facial features using CNNs;
  • deepfake detection: identifying synthetically generated faces and video face swaps;
  • the ethics of face recognition: surveillance, consent, bias, and civil liberties.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#88) - Face Analysis

Solutions to Episode #87 Exercises

Exercise 1: Stereo depth accuracy analyzer.

import numpy as np


class StereoDepthAnalyzer:
    """Analyze how stereo depth accuracy degrades
    with distance for a given camera setup."""

    def __init__(self, focal_length_px=1000,
                 baseline_m=0.12):
        self.f = focal_length_px
        self.B = baseline_m

    def disparity_to_depth(self, disparity):
        return self.f * self.B / (
            disparity + 1e-12)

    def depth_to_disparity(self, depth):
        return self.f * self.B / (
            depth + 1e-12)

    def depth_error_from_disparity_error(
            self, true_depth, disp_error_px):
        true_disp = self.depth_to_disparity(
            true_depth)
        err_disp = true_disp + disp_error_px
        err_depth = self.disparity_to_depth(
            err_disp)
        return abs(err_depth - true_depth)

    def run(self):
        depths = [1, 2, 5, 10, 20, 50, 100]
        disp_err = 1  # 1 pixel error

        print(f"Camera: f={self.f}px, "
              f"B={self.B}m")
        print(f"Disparity error: {disp_err}px")
        print()
        print(f"{'Depth':>6} {'TrueDisp':>9} "
              f"{'ErrDisp':>8} {'DepthErr':>9} "
              f"{'Rel%':>7}")
        print("-" * 44)

        prev_err = None
        for d in depths:
            true_disp = self.depth_to_disparity(d)
            err_disp = true_disp + disp_err
            err_depth = self.disparity_to_depth(
                err_disp)
            depth_err = abs(err_depth - d)
            rel_err = depth_err / d * 100

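            # Growth factor relative to the previous
            # row (an f-string field cannot be split
            # across two string literals)
            ratio = ""
            if prev_err is not None and prev_err > 0:
                ratio = f" ({depth_err / prev_err:.1f}x)"
            prev_err = depth_err

            print(f"{d:>5}m {true_disp:>9.2f}px "
                  f"{err_disp:>8.2f} "
                  f"{depth_err:>8.3f}m "
                  f"{rel_err:>6.2f}%{ratio}")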

        # Verify quadratic growth
        err_1m = self.depth_error_from_disparity_error(
            1, 1)
        err_2m = self.depth_error_from_disparity_error(
            2, 1)
        print(f"\nQuadratic check: error at 2m / "
              f"error at 1m = "
              f"{err_2m / err_1m:.2f}x "
              f"(expect ~4x)")


analyzer = StereoDepthAnalyzer()
analyzer.run()

At 1 meter the true disparity is 120px, so adding 1px of error gives only 0.008m of depth error (less than a centimeter). At 10 meters the disparity drops to 12px, and 1px of error causes about 0.77m of depth error. At 100 meters the disparity is only 1.2px -- a single pixel of error throws the depth off by roughly 45 meters. For small errors the depth error grows roughly in proportion to depth^2 / (f * B), which is the quadratic degradation the 2m / 1m check confirms (about 4x). At long range, where 1px is no longer small compared with the disparity itself, the exact error for the +1px case in this script is depth^2 / (f * B + depth). This is why stereo depth is reliable at close range but essentially useless for distant objects -- you'd need either a longer baseline or higher resolution sensors.
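
To see where those numbers come from, here's a minimal sketch (the helper names are my own, and the same default camera of f=1000px and B=0.12m is assumed) that compares the exact error for a +1px disparity error against the usual first-order approximation depth^2 / (f * B):

```python
def exact_depth_error(depth_m, f_px=1000.0, baseline_m=0.12,
                      disp_err_px=1.0):
    """Depth error when the measured disparity is off by
    disp_err_px pixels."""
    fb = f_px * baseline_m
    true_disp = fb / depth_m
    return abs(fb / (true_disp + disp_err_px) - depth_m)


def first_order_depth_error(depth_m, f_px=1000.0, baseline_m=0.12,
                            disp_err_px=1.0):
    """Linearized error: |dZ| ~= Z^2 / (f * B) * |d_disp|."""
    return depth_m ** 2 / (f_px * baseline_m) * disp_err_px


for d in (1, 10, 100):
    print(f"{d:>4}m  exact={exact_depth_error(d):8.3f}m  "
          f"first-order={first_order_depth_error(d):8.3f}m")
```

The two agree closely at short range and drift apart once the 1px error is no longer small compared with the disparity itself.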

Exercise 2: Point cloud statistics calculator.

import numpy as np
from scipy.spatial import KDTree


class PointCloudStats:
    """Generate a synthetic room point cloud and
    compute geometric statistics."""

    def __init__(self, pts_per_surface=500,
                 noise_std=0.02, seed=42):
        rng = np.random.RandomState(seed)
        clouds = []

        # Floor (y=0, x in [-2,2], z in [0,3])
        floor = np.column_stack([
            rng.uniform(-2, 2, pts_per_surface),
            np.zeros(pts_per_surface),
            rng.uniform(0, 3, pts_per_surface)])
        clouds.append(floor)

        # Back wall (z=3)
        wall_back = np.column_stack([
            rng.uniform(-2, 2, pts_per_surface),
            rng.uniform(0, 2.5, pts_per_surface),
            np.full(pts_per_surface, 3.0)])
        clouds.append(wall_back)

        # Left wall (x=-2)
        wall_left = np.column_stack([
            np.full(pts_per_surface, -2.0),
            rng.uniform(0, 2.5, pts_per_surface),
            rng.uniform(0, 3, pts_per_surface)])
        clouds.append(wall_left)

        # Right wall (x=2)
        wall_right = np.column_stack([
            np.full(pts_per_surface, 2.0),
            rng.uniform(0, 2.5, pts_per_surface),
            rng.uniform(0, 3, pts_per_surface)])
        clouds.append(wall_right)

        # Cube at (0, 0.25, 1.5), side=0.5
        cx, cy, cz, s = 0, 0.25, 1.5, 0.5
        hs = s / 2
        n_face = pts_per_surface // 6
        for axis in range(3):
            for sign in [-1, 1]:
                face = rng.uniform(
                    -hs, hs, (n_face, 3))
                face[:, axis] = sign * hs
                face += [cx, cy, cz]
                clouds.append(face)

        self.points = np.vstack(clouds)
        self.points += rng.randn(
            *self.points.shape) * noise_std

    def basic_stats(self):
        pts = self.points
        return {
            "count": len(pts),
            "bbox_min": pts.min(axis=0),
            "bbox_max": pts.max(axis=0),
            "centroid": pts.mean(axis=0),
        }

    def estimate_normals(self, k=20):
        tree = KDTree(self.points)
        normals = np.zeros_like(self.points)
        for i, pt in enumerate(self.points):
            _, idx = tree.query(pt, k=k)
            neighbors = self.points[idx]
            cov = np.cov(neighbors.T)
            eigvals, eigvecs = np.linalg.eigh(cov)
            # Smallest eigenvalue -> normal
            normals[i] = eigvecs[:, 0]
        return normals

    def classify(self, normals):
        ny = np.abs(normals[:, 1])
        horiz = ny > 0.8
        vert = ny < 0.3
        return horiz, vert

    def run(self):
        stats = self.basic_stats()
        print(f"Points: {stats['count']}")
        print(f"Bbox min: [{stats['bbox_min'][0]:.2f},"
              f" {stats['bbox_min'][1]:.2f},"
              f" {stats['bbox_min'][2]:.2f}]")
        print(f"Bbox max: [{stats['bbox_max'][0]:.2f},"
              f" {stats['bbox_max'][1]:.2f},"
              f" {stats['bbox_max'][2]:.2f}]")
        print(f"Centroid: [{stats['centroid'][0]:.2f},"
              f" {stats['centroid'][1]:.2f},"
              f" {stats['centroid'][2]:.2f}]")

        normals = self.estimate_normals()
        horiz, vert = self.classify(normals)

        n = len(self.points)
        print(f"\nHorizontal: {horiz.sum()} "
              f"({horiz.sum() / n * 100:.1f}%)")
        print(f"Vertical:   {vert.sum()} "
              f"({vert.sum() / n * 100:.1f}%)")
        print(f"Other:      "
              f"{n - horiz.sum() - vert.sum()}")

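        # Eigenvector signs are arbitrary, so flip
        # horizontal normals to point along +Y and
        # average absolute components for the walls
        # (they face different directions anyway)
        h_n = normals[horiz]
        h_avg = (h_n * np.sign(
            h_n[:, 1:2])).mean(axis=0)
        v_avg = np.abs(normals[vert]).mean(axis=0)
        print(f"\nAvg horiz normal: "
              f"[{h_avg[0]:.3f}, {h_avg[1]:.3f},"
              f" {h_avg[2]:.3f}]")
        print(f"Avg |vert normal|: "
              f"[{v_avg[0]:.3f}, {v_avg[1]:.3f},"
              f" {v_avg[2]:.3f}]")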


cloud = PointCloudStats()
cloud.run()

The floor points (500 of them, roughly 20% of the total) get classified as horizontal because their normal vectors point nearly straight up along the Y axis. The three wall surfaces (1500 points combined, ~60%) are classified as vertical since their normals are perpendicular to Y. The cube surfaces contribute a mix of both -- horizontal top and bottom faces, vertical side faces. The "other" category catches points near surface edges where the k-nearest-neighbor PCA picks up neighbors from two different surfaces, producing a normal that's neither clearly horizontal nor vertical.
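
The edge effect is easy to reproduce in isolation. A minimal sketch, assuming a hypothetical noise-free floor patch and wall patch that meet at the x=2 corner (the patch sizes and the `pca_normal` helper are illustrative, not part of the solution above):

```python
import numpy as np


def pca_normal(points):
    """Unit normal = eigenvector belonging to the smallest
    eigenvalue of the neighborhood covariance."""
    cov = np.cov(points.T)
    _, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
    return eigvecs[:, 0]


rng = np.random.default_rng(0)
n = 200

# Floor patch next to the right wall (y = 0)
floor = np.column_stack([
    rng.uniform(1.8, 2.0, n),
    np.zeros(n),
    rng.uniform(0.0, 0.2, n)])

# Wall patch just above the floor (x = 2)
wall = np.column_stack([
    np.full(n, 2.0),
    rng.uniform(0.0, 0.2, n),
    rng.uniform(0.0, 0.2, n)])

ny_floor = abs(pca_normal(floor)[1])
ny_mixed = abs(pca_normal(np.vstack([floor, wall]))[1])
print(f"|ny| floor only:   {ny_floor:.3f}")
print(f"|ny| floor + wall: {ny_mixed:.3f}")
```

The pure floor patch gives |ny| close to 1, while the mixed neighborhood lands between the two thresholds (0.3 and 0.8), which is exactly what ends up in the "other" bucket.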

Exercise 3: NeRF ray sampling analyzer.

import numpy as np


class RayAnalyzer:
    """Analyze uniform vs stratified ray sampling
    for NeRF-style volume rendering."""

    def __init__(self, focal=50, cx=2, cy=2,
                 near=0.5, far=5.0, seed=42):
        self.focal = focal
        self.cx = cx
        self.cy = cy
        self.near = near
        self.far = far
        self.rng = np.random.RandomState(seed)

    def cast_ray(self, origin, direction, n):
        t_vals = np.linspace(
            self.near, self.far, n)
        pts = origin + t_vals[:, None] * direction
        deltas = np.diff(t_vals,
                         prepend=self.near)
        deltas[0] = t_vals[1] - t_vals[0]
        return pts, deltas

    def stratified_sampling(self, origin,
                            direction, n):
        bins = np.linspace(
            self.near, self.far, n + 1)
        t_vals = np.array([
            self.rng.uniform(bins[i], bins[i + 1])
            for i in range(n)])
        pts = origin + t_vals[:, None] * direction
        deltas = np.diff(t_vals,
                         prepend=self.near)
        deltas[0] = t_vals[1] - t_vals[0]
        return pts, deltas

    def generate_rays(self):
        origin = np.array([0.0, 0.0, 0.0])
        rays = []
        for v in range(4):
            for u in range(4):
                dx = (u - self.cx) / self.focal
                dy = (v - self.cy) / self.focal
                dz = 1.0
                d = np.array([dx, dy, dz])
                d = d / np.linalg.norm(d)
                rays.append((origin, d))
        return rays

    def run(self):
        rays = self.generate_rays()
        sample_counts = [8, 16, 32, 64,
                         128, 256]
        ref_n = 64

        print(f"Rays: {len(rays)}, "
              f"near={self.near}, far={self.far}")
        print(f"\n{'N':>5} {'AvgDelta':>9} "
              f"{'Uni_std':>9} {'Strat_std':>9} "
              f"{'RelTime':>8}")
        print("-" * 44)

        for n in sample_counts:
            uni_deltas = []
            strat_deltas = []
            for orig, d in rays:
                _, ud = self.cast_ray(orig, d, n)
                _, sd = self.stratified_sampling(
                    orig, d, n)
                uni_deltas.extend(ud)
                strat_deltas.extend(sd)

            uni_deltas = np.array(uni_deltas)
            strat_deltas = np.array(strat_deltas)

            avg_d = uni_deltas.mean()
            u_std = uni_deltas.std()
            s_std = strat_deltas.std()
            rel_time = n / ref_n

            print(f"{n:>5} {avg_d:>9.5f} "
                  f"{u_std:>9.6f} "
                  f"{s_std:>9.6f} "
                  f"{rel_time:>7.2f}x")

        print(f"\nUniform delta std = 0: "
              f"{uni_deltas.std() < 1e-10}")
        print(f"Stratified delta std > 0: "
              f"{strat_deltas.std() > 0.001}")


analyzer = RayAnalyzer()
analyzer.run()

Uniform sampling produces perfectly regular deltas (standard deviation essentially zero) because every sample is exactly (far - near) / (N - 1) apart (np.linspace includes both endpoints). Stratified sampling introduces controlled randomness within each bin -- the standard deviation is nonzero because the sample position varies within each bin, but bounded because no sample can escape its bin. This randomized jittering is important for NeRF because it prevents aliasing artifacts that can appear when regular sampling aligns with periodic scene structures. The rendering cost scales linearly with sample count (the RelTime column is simply N / 64, not a measured time): 256 samples take 4x as long as 64 samples, and 8 samples are 8x faster -- but with only 8 samples you'll see obvious banding where the ray under-samples the scene.
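
The "can't escape its bin" property can be checked directly. A small sketch using only the standard library (the `stratified_samples` helper is illustrative; it draws the same kind of samples as `stratified_sampling` above, without the ray geometry):

```python
import random


def stratified_samples(near, far, n, rng):
    """One uniformly random sample inside each of n
    equal-width bins between near and far."""
    width = (far - near) / n
    return [near + (i + rng.random()) * width
            for i in range(n)]


rng = random.Random(42)
near, far, n = 0.5, 5.0, 8
width = (far - near) / n
t_vals = stratified_samples(near, far, n, rng)

for i, t in enumerate(t_vals):
    lo = near + i * width
    hi = near + (i + 1) * width
    assert lo <= t <= hi  # the sample stays inside bin i

# Neighboring samples can be anywhere from almost 0
# to almost 2 bin widths apart
gaps = [b - a for a, b in zip(t_vals, t_vals[1:])]
print(f"bin width: {width:.4f}")
print(f"min gap: {min(gaps):.4f}, max gap: {max(gaps):.4f}")
```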

On to today's episode

Here we go! We've spent the last eleven episodes building a comprehensive understanding of computer vision: image processing fundamentals (#77), object detection (#78-79), segmentation (#80), pose estimation (#81), OCR (#82), video understanding (#83), generative diffusion models (#84-85), image editing (#86), and 3D reconstruction (#87). That's a LOT of ground covered, and we're not done yet.

Today we zoom in -- literally -- on one of the oldest and most commercially consequential problems in computer vision: face analysis. Humans are absolute face-processing machines. Your brain has dedicated neural circuitry (the fusiform face area, if you want to get neuroscientific about it) that detects and recognizes faces within milliseconds. You walk into a crowded room and instantly spot your friend's face among hundreds of strangers, read their emotional state, estimate their age, and notice they got a haircut -- all before you've consciously decided to look. We do this so naturally that it feels trivial ;-)

For machines, face analysis has been one of the longest-running challenges. The Viola-Jones detector from 2001 was considered revolutionary for detecting faces at 15 frames per second on a Pentium III. Today we have models that simultaneously detect faces, identify individuals, locate precise facial landmarks, classify expressions, and even determine whether a face is real or synthetically generated -- all from a single image. The journey from "can barely find a face in good lighting" to "can identify you in a crowd from a security camera" happened in about twenty years, and the implications go far beyond engineering.

This episode covers the full face analysis pipeline. We'll build each component piece by piece, and we'll also address something that most ML tutorials skip entirely: the ethics of deploying these systems on real people.

Face detection: from Haar cascades to RetinaFace

Face detection is the entry point to any face analysis pipeline. Before you can recognize who someone is or what expression they're making, you first need to find the face in the image and extract it.

Viola-Jones (2001) was the landmark paper. It used hand-crafted Haar-like features -- simple rectangular patterns that capture contrast between adjacent regions (bright forehead above dark eye sockets, bright nose bridge between darker cheeks). These features were evaluated in a cascade of progressively more complex classifiers: the first stage rejects roughly half of the candidate windows using only one or two features, the second stage rejects a similar share of the survivors with a few more features, and so on. Only candidates that survive all stages get classified as faces. This cascading structure made it fast enough for real-time video on 2001-era hardware:

import numpy as np


class HaarCascadeSimulator:
    """Simulate the cascade reject logic of
    Viola-Jones face detection."""

    def __init__(self, reject_rates=None):
        # Per-stage fraction of the surviving
        # windows that gets rejected (the number
        # of stages follows from this list)
        if reject_rates is None:
            reject_rates = [0.5, 0.5, 0.4, 0.4,
                            0.3, 0.3, 0.2, 0.2,
                            0.1, 0.1]
        self.reject_rates = reject_rates
        self.num_stages = len(reject_rates)

    def evaluate_cascade(self, num_windows):
        survivors = num_windows
        print(f"{'Stage':>6} {'Reject%':>8} "
              f"{'Survivors':>10} {'Cum%':>7}")
        print("-" * 35)
        for i, rate in enumerate(
                self.reject_rates):
            rejected = int(survivors * rate)
            survivors -= rejected
            cum_pct = survivors / num_windows * 100
            print(f"{i + 1:>6} {rate * 100:>7.0f}% "
                  f"{survivors:>10,} "
                  f"{cum_pct:>6.2f}%")
        return survivors


# Sliding window on a 640x480 image at
# multiple scales: ~100,000 candidate windows
cascade = HaarCascadeSimulator()
final = cascade.evaluate_cascade(100_000)
print(f"\nFinal detections: {final:,} "
      f"(from 100,000 candidates)")

The cascade structure is what made Viola-Jones practical. Without it, you'd have to run a complex classifier on every single candidate window (100,000 per frame in the simulation above). With the cascade, most windows get rejected in the first few (cheap) stages: in the simulation, half are gone after stage 1 and only about 2% survive all ten stages. It's the same principle as a job application funnel -- reject the obvious mismatches early, spend time evaluating only the promising candidates.
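
To put a rough number on that saving, here's a back-of-the-envelope sketch. The per-stage feature counts are made up for illustration (they are not the counts from the Viola-Jones paper); the reject rates are the ones from the simulator above:

```python
# Hypothetical feature counts: cheap stages first,
# expensive stages last
features_per_stage = [2, 5, 10, 20, 25,
                      50, 50, 100, 100, 200]
reject_rates = [0.5, 0.5, 0.4, 0.4, 0.3,
                0.3, 0.2, 0.2, 0.1, 0.1]

surviving = 1.0      # fraction of windows reaching a stage
expected_cost = 0.0  # avg. features evaluated per window
for n_feat, rate in zip(features_per_stage, reject_rates):
    # Only windows that reached this stage pay for it
    expected_cost += surviving * n_feat
    surviving *= 1.0 - rate

# Cost if every window ran through every stage
full_cost = sum(features_per_stage)

print(f"Without cascade: {full_cost} features/window")
print(f"With cascade:    {expected_cost:.1f} features/window")
print(f"Speedup:         {full_cost / expected_cost:.1f}x")
```

Under these assumptions the average window costs a small fraction of the full classifier, because the expensive late stages only ever see the few windows that made it that far.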

Having said that, Viola-Jones has serious limitations. It struggles with profile faces, extreme lighting, heavy occlusion (sunglasses, scarves), and faces smaller than about 30x30 pixels. Modern deep learning detectors blow it out of the water on all of these.

RetinaFace (Deng et al., 2019) is one of the most widely used modern face detectors. It's a single-stage detector (built on the same Feature Pyramid Network architecture we covered in episode #78 for general object detection) that simultaneously predicts face bounding boxes, confidence scores, and 5-point facial landmarks in a single forward pass:

from retinaface import RetinaFace
import cv2

# Detect all faces in an image
faces = RetinaFace.detect_faces(
    "group_photo.jpg")

for face_id, face_data in faces.items():
    bbox = face_data['facial_area']
    confidence = face_data['score']
    landmarks = face_data['landmarks']

    print(f"{face_id}: bbox={bbox}, "
          f"conf={confidence:.3f}")
    print(f"  Right eye: "
          f"{landmarks['right_eye']}")
    print(f"  Left eye:  "
          f"{landmarks['left_eye']}")
    print(f"  Nose:      "
          f"{landmarks['nose']}")

RetinaFace detects faces at multiple scales, handles partial occlusion surprisingly well, and works on faces as small as 20x20 pixels. The 5-point landmarks it produces (two eye centers, nose tip, two mouth corners) are immediately useful for face alignment -- rotating and scaling the face crop so that the eyes are at consistent positions. This alignment step is critical because it dramatically improves downstream recognition accuracy. If you skip alignment, the recognition model has to learn to handle rotated, tilted, and off-center faces on its own, which wastes a huge amount of model capacity.
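
As a sketch of what that alignment step computes: given the two detected eye centers and the positions where we'd like them to end up in the crop, a similarity transform (rotation, uniform scale, shift) is fully determined. The eye coordinates, target positions, and crop size below are illustrative assumptions, not values prescribed by RetinaFace:

```python
import math


def eye_alignment_matrix(eye_a, eye_b, target_a, target_b):
    """2x3 similarity transform (rotation + uniform scale
    + shift) mapping eye_a -> target_a and eye_b -> target_b.
    All points are (x, y) pixel coordinates."""
    # Treat points as complex numbers: z' = a * z + b
    s_a, s_b = complex(*eye_a), complex(*eye_b)
    t_a, t_b = complex(*target_a), complex(*target_b)
    a = (t_b - t_a) / (s_b - s_a)
    b = t_a - a * s_a
    return [[a.real, -a.imag, b.real],
            [a.imag, a.real, b.imag]]


def transform_point(matrix, point):
    """Apply the 2x3 matrix to a single (x, y) point."""
    x, y = point
    return tuple(row[0] * x + row[1] * y + row[2]
                 for row in matrix)


# Hypothetical detection: a tilted face in the source image
eye_a, eye_b = (120.0, 140.0), (180.0, 160.0)
# Hypothetical eye positions in a 112x112 aligned crop
target_a, target_b = (38.0, 52.0), (74.0, 52.0)

M = eye_alignment_matrix(eye_a, eye_b, target_a, target_b)
scale = math.hypot(M[0][0], M[1][0])
angle = math.degrees(math.atan2(M[1][0], M[0][0]))
print(f"scale={scale:.3f}, rotation={angle:.1f} deg")
print("eye_a ->", transform_point(M, eye_a))
print("eye_b ->", transform_point(M, eye_b))
```

A 2x3 matrix in this form is what `cv2.warpAffine` takes to resample the actual pixels.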

Face recognition: identity as a vector

Face recognition is fundamentally an embedding problem -- the exact same concept we explored in episode #63, applied to faces instead of text. Instead of training a classifier with one output class per person (which would require retraining every time you add someone new), you train a network to produce a compact embedding vector such that:

  • Same person -> vectors are close (small distance)
  • Different people -> vectors are far apart (large distance)

FaceNet (Schroff et al., 2015) pioneered this approach with triplet loss: given an anchor face, a positive (same person), and a negative (different person), push the positive closer and the negative farther away:

import torch
import torch.nn as nn


class TripletLoss(nn.Module):
    """Triplet loss for face recognition.
    Same idea as metric learning from
    episode #63, applied to face images."""

    def __init__(self, margin=0.2):
        super().__init__()
        self.margin = margin

    def forward(self, anchor, positive,
                negative):
        # Each: (batch, embedding_dim)
        dist_pos = (anchor - positive).pow(
            2).sum(dim=1)
        dist_neg = (anchor - negative).pow(
            2).sum(dim=1)
        loss = torch.clamp(
            dist_pos - dist_neg + self.margin,
            min=0.0)
        return loss.mean()


# Quick sanity check
loss_fn = TripletLoss(margin=0.2)
anc = torch.randn(8, 128)
pos = anc + torch.randn(8, 128) * 0.1
neg = torch.randn(8, 128)
loss = loss_fn(anc, pos, neg)
print(f"Triplet loss: {loss.item():.4f}")

The margin parameter is the key -- it says "the negative must be farther than the positive by at least this much." Without the margin, the model could satisfy the constraint with an arbitrarily small gap between positive and negative distances. The margin forces a clear separation in embedding space.
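
A few hand-picked numbers make the effect of the margin concrete. This is a plain-Python sketch of the same hinge for a single triplet (the squared distances are made-up values, not model outputs):

```python
def triplet_loss(dist_pos, dist_neg, margin=0.2):
    """Hinge on squared distances: zero once the negative
    is at least `margin` farther away than the positive."""
    return max(dist_pos - dist_neg + margin, 0.0)


cases = [
    ("negative closer than positive", 0.50, 0.40),
    ("right order, gap too small", 0.50, 0.60),
    ("gap larger than the margin", 0.50, 0.90),
]
for name, d_pos, d_neg in cases:
    with_m = triplet_loss(d_pos, d_neg)
    without_m = triplet_loss(d_pos, d_neg, margin=0.0)
    print(f"{name:<30} margin=0.2: {with_m:.2f}  "
          f"margin=0.0: {without_m:.2f}")
```

The middle case is the interesting one: without a margin the loss is already zero, so the model gets no signal to widen the gap.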

ArcFace (Deng et al., 2019 -- same first author as RetinaFace, no coincidence) improved on triplet loss with additive angular margin loss. Instead of operating in Euclidean space, ArcFace normalizes both features and classifier weights onto the unit hypersphere and adds an angular margin penalty to the target class. The geometric intuition: on a unit sphere, the "distance" between points is the angle between them. ArcFace makes the model learn to separate identities by a minimum angle, which produces more discriminative embeddings:

class ArcFaceHead(nn.Module):
    """ArcFace: additive angular margin for
    face recognition training."""

    def __init__(self, embed_dim=512,
                 num_classes=10000,
                 margin=0.5, scale=64):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(num_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.margin = margin
        self.scale = scale

    def forward(self, embeddings, labels):
        # Normalize both features and weights
        # onto unit hypersphere
        embeddings = nn.functional.normalize(
            embeddings)
        weights = nn.functional.normalize(
            self.weight)

        # Cosine similarity = dot product of
        # normalized vectors
        cosine = torch.mm(
            embeddings, weights.t())

        # Convert to angle and add the margin to
        # the target class only
        theta = torch.acos(
            cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        rows = torch.arange(
            len(labels), device=labels.device)
        marginal_cos = torch.cos(
            theta[rows, labels] + self.margin)

        # Write into a copy: modifying `cosine`
        # in place would break autograd, because
        # clamp() needs the original values for
        # its backward pass
        logits = cosine.clone()
        logits[rows, labels] = marginal_cos

        # Scale up for cross-entropy
        return self.scale * logits


# Training: normal cross-entropy loss on
# the scaled logits
head = ArcFaceHead(embed_dim=512,
                   num_classes=100)
fake_emb = torch.randn(4, 512)
fake_labels = torch.tensor([0, 23, 45, 67])
logits = head(fake_emb, fake_labels)
print(f"ArcFace logits: {logits.shape}")
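
To get a feel for how much the margin costs the target class, here's a tiny numeric sketch using the same defaults as `ArcFaceHead` (margin=0.5 radians, scale=64). The angles are arbitrary example values, and the helper name is my own:

```python
import math


def arcface_target_logit(theta_deg, margin=0.5, scale=64):
    """Target-class logit before and after adding the
    angular margin (theta = angle to the class center)."""
    theta = math.radians(theta_deg)
    plain = scale * math.cos(theta)
    with_margin = scale * math.cos(theta + margin)
    return plain, with_margin


for theta_deg in (20, 40, 60):
    plain, penalized = arcface_target_logit(theta_deg)
    print(f"theta={theta_deg:>2} deg  "
          f"plain={plain:6.2f}  with margin={penalized:6.2f}")
```

Even an embedding that already sits fairly close to its class center gets a noticeably lower target logit, so the loss keeps pulling it closer.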

In practice, modern face recognition pipelines typically produce 512-dimensional embedding vectors. As a rough guide (exact values depend on the model and on image quality), two faces of the same person tend to have cosine similarity above 0.6, while different people tend to fall below 0.3. The threshold between "match" and "no match" is tuned based on your application's tolerance for false matches versus missed matches -- a phone unlock screen needs different tradeoffs than a border security system:

import numpy as np
from numpy.linalg import norm


def compare_faces(emb1, emb2, threshold=0.6):
    """Compare two face embeddings using
    cosine similarity."""
    similarity = np.dot(emb1, emb2) / (
        norm(emb1) * norm(emb2))
    return {
        'similarity': float(similarity),
        'same_person': similarity > threshold
    }


def analyze_thresholds(same_sims, diff_sims):
    """Find optimal threshold that maximizes
    accuracy given same-person and different-
    person similarity distributions."""
    thresholds = np.linspace(0.1, 0.9, 17)
    print(f"{'Thresh':>7} {'TPR':>6} "
          f"{'FPR':>6} {'Acc':>6}")
    print("-" * 28)
    best_acc, best_t = 0, 0
    for t in thresholds:
        tp = (same_sims >= t).mean()
        fp = (diff_sims >= t).mean()
        acc = ((same_sims >= t).sum()
               + (diff_sims < t).sum()) / (
                   len(same_sims) + len(diff_sims))
        if acc > best_acc:
            best_acc = acc
            best_t = t
        print(f"{t:>7.2f} {tp:>6.3f} "
              f"{fp:>6.3f} {acc:>6.3f}")
    print(f"\nBest: thresh={best_t:.2f}, "
          f"acc={best_acc:.3f}")


# Simulated distributions
rng = np.random.RandomState(42)
same = rng.normal(0.72, 0.12, 500)
diff = rng.normal(0.25, 0.15, 500)
same = np.clip(same, -1, 1)
diff = np.clip(diff, -1, 1)
analyze_thresholds(same, diff)

The gap between same-person and different-person distributions is where accuracy lives. A wider gap means the threshold is easier to set. A narrower gap (which happens with low-quality images, extreme poses, or aging between photos) means any threshold will produce more errors. This is why face recognition systems push so hard for well-lit, frontal, aligned face crops -- they're maximizing the gap.
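
Instead of maximizing accuracy, many deployments fix the tolerated false match rate first and accept whatever miss rate follows. A minimal sketch of that idea, using the same simulated distributions as above (the function name and the target rates are illustrative assumptions):

```python
import random


def threshold_for_fmr(impostor_sims, target_fmr):
    """Smallest observed threshold whose false match rate
    on the impostor scores stays at or below target_fmr.
    Match rule: similarity > threshold."""
    ranked = sorted(impostor_sims, reverse=True)
    allowed = int(target_fmr * len(ranked))
    return ranked[allowed]


def clip(x):
    """Keep simulated similarities inside [-1, 1]."""
    return max(-1.0, min(1.0, x))


rng = random.Random(42)
same = [clip(rng.gauss(0.72, 0.12)) for _ in range(5000)]
diff = [clip(rng.gauss(0.25, 0.15)) for _ in range(5000)]

results = {}
for target in (0.01, 0.001):
    t = threshold_for_fmr(diff, target)
    fmr = sum(s > t for s in diff) / len(diff)
    missed = sum(s <= t for s in same) / len(same)
    results[target] = (t, fmr, missed)
    print(f"target FMR={target:.3f}: threshold={t:.3f}, "
          f"FMR={fmr:.4f}, missed matches={missed:.3f}")
```

Tightening the false match target pushes the threshold up, and the share of missed genuine matches rises with it. That is the phone-unlock versus border-control tradeoff in two lines of output.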

Facial landmarks: mapping the geometry of a face

Beyond the 5 points from detection, full landmark models predict 68 (dlib standard), 98 (WFLW dataset), or even 468 (Google MediaPipe Face Mesh) keypoints covering the entire face geometry: eyebrow contours, eye contours and corners, nose bridge and nostrils, lip outline (both inner and outer edges), and the full jawline from ear to ear. Iris points are not part of the 468; MediaPipe adds 10 of them (478 in total) when you pass refine_landmarks=True.

import mediapipe as mp
import cv2

mp_face_mesh = mp.solutions.face_mesh

with mp_face_mesh.FaceMesh(
    static_image_mode=True,
    max_num_faces=1,
    min_detection_confidence=0.5
) as face_mesh:
    image = cv2.imread("portrait.jpg")
    rgb = cv2.cvtColor(
        image, cv2.COLOR_BGR2RGB)
    results = face_mesh.process(rgb)

    if results.multi_face_landmarks:
        landmarks = results.multi_face_landmarks[0]
        h, w = image.shape[:2]

        # 468 landmarks with (x, y, z)
        for i, lm in enumerate(
                landmarks.landmark):
            px = int(lm.x * w)
            py = int(lm.y * h)
            # Highlight eye corners
            if i in [33, 133, 362, 263]:
                cv2.circle(image, (px, py),
                           3, (0, 255, 0), -1)

        print(f"Detected "
              f"{len(landmarks.landmark)} "
              f"landmarks")

Landmarks enable a whole family of downstream applications:

  • Face alignment: rotating and scaling based on eye positions to normalize faces for recognition (as we discussed above)
  • Expression analysis: tracking how specific facial regions deform relative to a neutral position
  • Gaze estimation: computing where someone is looking from the eye and iris landmarks
  • Face swapping: mapping one person's expression onto another's facial geometry
  • Augmented reality: anchoring virtual objects (glasses, hats, masks, filters -- yes, those Snapchat filters are basically landmark detection plus texture mapping) to accurate face positions
  • Driver monitoring: detecting drowsiness from eye closure and head pose in automotive systems
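
For the drowsiness case, one widely used heuristic is the eye aspect ratio (EAR): the eyelid opening divided by the eye width, computed from six points around one eye. A minimal sketch with made-up coordinates; which landmark indices supply the six points depends on the landmark model, and the 0.2 threshold is only an illustrative assumption:

```python
import math


def eye_aspect_ratio(eye):
    """eye: six (x, y) points p1..p6 around one eye.
    p1 and p4 are the horizontal corners, p2 and p3 lie
    on the upper lid, p6 and p5 on the lower lid."""
    p1, p2, p3, p4, p5, p6 = eye
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)


# Hypothetical pixel coordinates for one eye
open_eye = [(0, 0), (10, -6), (20, -6),
            (30, 0), (20, 6), (10, 6)]
closed_eye = [(0, 0), (10, -1), (20, -1),
              (30, 0), (20, 1), (10, 1)]

EAR_THRESHOLD = 0.2  # assumption, tune per camera setup
for name, eye in [("open", open_eye),
                  ("closed", closed_eye)]:
    ear = eye_aspect_ratio(eye)
    state = "closed" if ear < EAR_THRESHOLD else "open"
    print(f"{name:>6}: EAR={ear:.3f} -> {state}")
```

A drowsiness monitor would track this value over time and raise an alert when it stays below the threshold for longer than a normal blink.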

The 468-point MediaPipe mesh is dense enough to reconstruct a full 3D face surface (remember NeRF from episode #87? same ballpark, different approach). Each landmark has a z-coordinate estimating depth, so you get a coarse 3D face model from a single 2D image. Not as precise as a dedicated depth sensor, but remarkably useful for AR overlays and head pose estimation.

Expression recognition: reading emotions from pixels

Facial expression recognition classifies faces into emotional categories. The classic taxonomy comes from Paul Ekman's research (1970s), which proposed six basic emotions: happy, sad, angry, surprised, disgusted, and fearful. Vision benchmarks usually add a neutral class to get seven. Whether these are truly "universal" emotions or culturally constructed categories is actively debated -- but the seven-class framework remains the standard benchmark in computer vision.

The pipeline is conceptually simple: detect face -> align -> extract features -> classify. The features come from a CNN (usually a pretrained backbone like ResNet, fine-tuned on expression datasets):

import torch
import torch.nn as nn
import torchvision.models as models


class ExpressionClassifier(nn.Module):
    """Expression recognition from aligned
    face crops using a pretrained backbone."""

    EXPRESSIONS = [
        "angry", "disgusted", "fearful",
        "happy", "neutral", "sad", "surprised"
    ]

    def __init__(self, num_expressions=7):
        super().__init__()
        resnet = models.resnet18(
            weights='DEFAULT')
        self.features = nn.Sequential(
            *list(resnet.children())[:-1])
        self.classifier = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(512, num_expressions))

    def forward(self, face_crops):
        # face_crops: (batch, 3, 224, 224)
        features = self.features(
            face_crops).flatten(1)
        return self.classifier(features)

    def predict(self, face_crop):
        self.eval()
        with torch.no_grad():
            logits = self(face_crop.unsqueeze(0))
            probs = torch.softmax(logits, dim=1)
            idx = probs.argmax().item()
        return self.EXPRESSIONS[idx], probs[0]


model = ExpressionClassifier()
fake_face = torch.randn(1, 3, 224, 224)
logits = model(fake_face)
print(f"Input: {fake_face.shape}")
print(f"Output: {logits.shape} "
      f"({len(model.EXPRESSIONS)} classes)")

Modern systems increasingly use continuous affect dimensions instead of (or in addition to) discrete categories. The two main dimensions are valence (positive vs negative emotion, from very unhappy to very happy) and arousal (calm vs excited, from sleepy to hyperactive). A terrified face has negative valence and high arousal. A content face has positive valence and low arousal. This two-dimensional representation captures nuances that the seven discrete classes miss -- like the difference between "mildly amused" and "ecstatic," which both get lumped into "happy" in the Ekman framework.
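
A toy sketch of the two-dimensional view. The coordinates are hand-picked for illustration (not predictions from any model), and the quadrant names are my own shorthand:

```python
import math


def affect_quadrant(valence, arousal):
    """Coarse quadrant label for a (valence, arousal)
    pair, both assumed to lie in [-1, 1]."""
    if valence >= 0:
        return "excited" if arousal >= 0 else "content"
    return "distressed" if arousal >= 0 else "depressed"


# Hypothetical coordinates, chosen by hand
examples = {
    "terrified": (-0.8, 0.9),
    "content": (0.6, -0.5),
    "mildly amused": (0.3, 0.1),
    "ecstatic": (0.9, 0.9),
}
for name, (v, a) in examples.items():
    intensity = math.hypot(v, a)
    print(f"{name:>14}: valence={v:+.1f} "
          f"arousal={a:+.1f} "
          f"quadrant={affect_quadrant(v, a)} "
          f"intensity={intensity:.2f}")
```

"Mildly amused" and "ecstatic" share a quadrant but sit at very different distances from the neutral origin, which is the nuance a single "happy" label throws away.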

Expression recognition has well-documented biases that anyone deploying these systems needs to understand. Models trained primarily on Western faces perform measurably worse on East Asian, South Asian, and African faces. The concept of "universal" expressions is itself culturally loaded -- the same facial movement (a smile) can signal happiness, politeness, discomfort, or embarrassment depending on cultural context. A system that confidently labels a polite Japanese smile as "happy" with 95% confidence is technically correct about the muscle configuration but potentially wrong about the underlying emotion.

Deepfake detection: is this face real?

GANs (episode #55) and diffusion models (episodes #84-85) can now generate photorealistic face images and swap faces in video so convincingly that human viewers can't reliably tell the difference. Detecting these fakes has become an active (and increasingly desperate) arms race.

Deepfake detectors look for artifacts that current generation models leave behind. The approach is conceptually straightforward -- train a binary classifier on real faces and fake faces:

import numpy as np
import torch
import torch.nn as nn
import torchvision.models as models


class DeepfakeDetector(nn.Module):
    """Binary classifier: real or generated
    face. Uses EfficientNet backbone for
    spatial feature extraction."""

    def __init__(self):
        super().__init__()
        backbone = models.efficientnet_b0(
            weights='DEFAULT')
        self.features = nn.Sequential(
            *list(backbone.children())[:-1])
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.3),
            nn.Linear(1280, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, x):
        return self.classifier(
            self.features(x))


class FrequencyAnalyzer:
    """Analyze frequency domain artifacts in
    face images. GANs leave telltale periodic
    patterns visible in Fourier space."""

    def analyze(self, image_gray):
        h, w = image_gray.shape
        f_transform = np.fft.fft2(image_gray)
        f_shift = np.fft.fftshift(f_transform)
        magnitude = np.log(
            np.abs(f_shift) + 1)

        # Radial average -- real images have
        # smooth falloff, GANs show periodic
        # peaks at specific frequencies
        cy, cx = h // 2, w // 2
        radii = np.sqrt(
            (np.arange(h)[:, None] - cy) ** 2
            + (np.arange(w)[None, :] - cx) ** 2
        ).astype(int)
        max_r = min(cy, cx)
        # Per-radius mean via bincount: one
        # pass instead of one mask per radius
        flat_r = radii.ravel()
        counts = np.bincount(flat_r)
        sums = np.bincount(
            flat_r, weights=magnitude.ravel())
        profile = sums[:max_r] / counts[:max_r]

        return profile, magnitude

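    def detect_peaks(self, profile,
                     threshold=2.0):
        """Find abnormal frequency peaks that
        suggest GAN generation.

        Caveat: the profile decays with
        radius, so this global cutoff tends
        to flag the low-frequency bins near
        DC. Detrend the profile first if
        that matters for your data."""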
        mean_val = profile.mean()
        std_val = profile.std()
        peaks = []
        for i, val in enumerate(profile):
            if val > mean_val + threshold * std_val:
                peaks.append((i, val))
        return peaks


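detector = DeepfakeDetector()
detector.eval()  # dropout off, BN uses running stats
fake_img = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    score = detector(fake_img)
# The classifier head is untrained here, so the
# value only demonstrates the output format
print(f"Deepfake score: {score.item():.4f} "
      f"(raw logit, >0 = likely fake)")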
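The detector above only defines the forward pass. As a rough sketch of what "train a binary classifier on real faces and fake faces" looks like in code, here is a single optimisation step with a binary cross-entropy loss on raw logits. Everything named here is an assumption for illustration: `TinyDetector` is a hypothetical stand-in for `DeepfakeDetector` (so the example runs in a second without downloading weights), `train_step` is my own helper name, the batch is random noise rather than face crops, and the label convention (1.0 = fake) simply matches the ">0 = likely fake" reading of the logit.

```python
import torch
import torch.nn as nn


class TinyDetector(nn.Module):
    """Hypothetical stand-in for DeepfakeDetector: same
    contract (image batch in, one raw logit per image out)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(8, 1),
        )

    def forward(self, x):
        return self.net(x)


def train_step(model, optimizer, images, labels):
    """One optimisation step. labels: 1.0 = fake, 0.0 = real."""
    model.train()
    optimizer.zero_grad()
    logits = model(images).squeeze(1)  # shape (N,)
    loss = nn.functional.binary_cross_entropy_with_logits(
        logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
model = TinyDetector()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Random tensors standing in for a batch of face crops
images = torch.randn(8, 3, 64, 64)
labels = torch.tensor([0., 1., 0., 1., 0., 1., 0., 1.])

loss = train_step(model, optimizer, images, labels)

# At inference time, a sigmoid turns the logit into P(fake)
model.eval()
with torch.no_grad():
    probs = torch.sigmoid(model(images)).squeeze(1)
print(f"loss={loss:.4f}, "
      f"P(fake) for first image={probs[0].item():.3f}")
```

Working with logits and `binary_cross_entropy_with_logits` (instead of a sigmoid followed by plain BCE) is the numerically safer choice, which is also why the model ends in a bare `nn.Linear` without an activation.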

What detectors actually look for:

  • Frequency artifacts: GANs (especially StyleGAN variants) produce subtle grid patterns visible in the frequency domain. A 2D Fourier transform of a GAN-generated face often reveals periodic peaks that real photographs lack. This works because the upsampling layers in the generator introduce regular patterns.
  • Blending boundaries: face swaps often have inconsistent lighting or texture at the boundary between the swapped and original regions. The skin tone, shading angle, or texture resolution changes abruptly.
  • Temporal inconsistencies: in video deepfakes, blinking patterns can be unnatural (early deepfakes barely blinked at all), head movements may lack proper parallax, and lip sync can drift.
  • Biological signals: real faces exhibit subtle color changes with each heartbeat (the basis of remote photoplethysmography -- blood volume under the skin rises and falls with each pulse, which shifts skin color slightly). Deepfakes typically lack a consistent version of this signal. This is one of the harder artifacts to fake because it requires modeling actual physiology.
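To make the frequency-artifact idea concrete, here is a minimal sketch on synthetic data. Note the assumptions: the "real-like" image is just noise shaped to a smooth spectral falloff, the "generated-like" image is the same picture plus a grid with an 8-pixel period (a crude stand-in for upsampling artifacts, not actual GAN output), and the function names, the median window, and the `min_excess` cutoff are all my own illustrative choices. Unlike a global mean/std cutoff, the sketch averages power per radius and compares each bin to a local median, so the natural decay of the spectrum is not mistaken for a peak.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def radial_power_profile(image_gray):
    """Log of the power spectrum averaged over each radius."""
    h, w = image_gray.shape
    power = np.abs(
        np.fft.fftshift(np.fft.fft2(image_gray))) ** 2
    cy, cx = h // 2, w // 2
    radii = np.sqrt(
        (np.arange(h)[:, None] - cy) ** 2
        + (np.arange(w)[None, :] - cx) ** 2
    ).astype(int)
    max_r = min(cy, cx)
    sums = np.bincount(radii.ravel(), weights=power.ravel())
    counts = np.bincount(radii.ravel())
    return np.log(sums[:max_r] / counts[:max_r] + 1e-12)


def find_spectral_peaks(profile, window=9,
                        min_excess=1.0, r_min=5):
    """Radii whose log-power exceeds the local median
    (over `window` bins) by more than `min_excess`."""
    half = window // 2
    padded = np.pad(profile, half, mode="edge")
    trend = np.median(
        sliding_window_view(padded, window), axis=1)
    residual = profile - trend
    # Skip the first bins (few pixels, noisy) and the
    # last ones (median is biased by the edge padding)
    return [int(r)
            for r in range(r_min, len(profile) - half)
            if residual[r] > min_excess]


rng = np.random.default_rng(0)
n = 128

# "Real-like" stand-in: white noise shaped to a smooth falloff
freq = np.fft.fftfreq(n) * n
r = np.sqrt(freq[:, None] ** 2 + freq[None, :] ** 2)
base = np.fft.ifft2(
    np.fft.fft2(rng.standard_normal((n, n))) / (1 + r)).real
base /= base.std()

# "Generated-like" stand-in: add a grid with an 8-pixel period
x = np.arange(n)
grid = np.cos(2 * np.pi * x / 8)
fake = base + 0.5 * (grid[:, None] + grid[None, :])

clean_peaks = find_spectral_peaks(radial_power_profile(base))
fake_peaks = find_spectral_peaks(radial_power_profile(fake))
print("clean image peaks:", clean_peaks)
print("grid image peaks: ", fake_peaks)
```

A pattern that repeats every 8 pixels in a 128-pixel image completes 128 / 8 = 16 cycles, so its energy lands at radius 16 in the shifted spectrum. Real photographs are messier than this toy pair (compression, resizing, and sensor noise all leave their own spectral fingerprints), so treat the thresholds as something to tune, not as constants.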

The fundamental problem with deepfake detection: it's always playing catch-up. Every new detection method gets published, the generator developers read it, and they train their models to avoid producing that specific artifact. The most robust detection approaches combine multiple signal types rather than relying on any single artifact -- but even those have a shelf life. In practice, the most reliable approach is establishing provenance (cryptographic proof of where an image came from) rather than trying to determine authenticity from pixels alone.
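The provenance idea can be sketched with nothing but the standard library. This is a deliberately simplified illustration, not a real provenance scheme: it uses an HMAC with a shared secret, whereas real-world systems generally rely on public-key signatures so that anyone can verify without holding the signing key. The key, the byte string standing in for image data, and the function names are all hypothetical.

```python
import hashlib
import hmac


def sign_image(image_bytes: bytes, key: bytes) -> str:
    """Return a hex tag that binds these exact bytes
    to whoever holds `key`."""
    return hmac.new(
        key, image_bytes, hashlib.sha256).hexdigest()


def verify_image(image_bytes: bytes, tag: str,
                 key: bytes) -> bool:
    """True only if the bytes are unchanged since signing."""
    expected = sign_image(image_bytes, key)
    # Constant-time comparison avoids timing leaks
    return hmac.compare_digest(expected, tag)


key = b"hypothetical-camera-secret"
original = b"\x89PNG...raw image bytes stand-in..."
tag = sign_image(original, key)

# Any modification of the bytes invalidates the tag
tampered = original.replace(b"stand-in", b"swapped!")

print(verify_image(original, tag, key))   # True
print(verify_image(tampered, tag, key))   # False
```

The point of the design: verification no longer asks "do these pixels look real?" but "are these the exact bytes that a trusted source signed?", which does not get harder as generators improve.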

Ethics: the elephant in the room

I usually keep these episodes focused on the technical side, but face recognition is one of those technologies where you genuinely CANNOT discuss the engineering without addressing the implications. This isn't a "wouldn't it be nice" ethics section that I'm bolting on for completeness. This is about understanding what the technology we just built actually does when you deploy it on real human beings.

Surveillance at scale: governments and corporations can (and do) identify and track individuals across camera networks without consent. China's social credit system, Clearview AI's scraping of billions of social media photos, and various law enforcement deployments in Western democracies are all public knowledge. The technology described in this episode -- RetinaFace detection + ArcFace embedding + vector search -- is sufficient to build a city-wide surveillance system. That's not hypothetical. It exists.

Bias: face recognition systems have consistently demonstrated higher error rates for darker-skinned individuals, women, and older people. The landmark Gender Shades study by Joy Buolamwini and Timnit Gebru (2018) examined a closely related face analysis task: it found that commercial gender classification systems from major vendors had error rates of up to 34.7% for darker-skinned women, against at most 0.8% for lighter-skinned men. A false match in a law enforcement context can lead to wrongful arrest -- and this has happened, multiple times, in documented cases.
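The practical lesson from that kind of audit is to never report a single aggregate error rate. Here is a small sketch of a disaggregated evaluation. The records are invented purely to exercise the function (they are not measurements of any system), and the group names and the `error_rate_by_group` helper are hypothetical.

```python
from collections import defaultdict


def error_rate_by_group(records):
    """records: iterable of (group, y_true, y_pred).
    Returns {group: fraction of wrong predictions}."""
    errors = defaultdict(int)
    totals = defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        errors[group] += int(y_true != y_pred)
    return {g: errors[g] / totals[g] for g in totals}


# Toy records, made up for illustration only
records = [
    ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 0),
    ("group_b", 1, 1), ("group_b", 0, 1),
]

overall = sum(t != p for _, t, p in records) / len(records)
per_group = error_rate_by_group(records)

print(f"overall error: {overall:.2f}")  # 0.25
print(per_group)  # {'group_a': 0.0, 'group_b': 0.5}
```

In this toy case the overall error of 25% hides the fact that every single mistake falls on one group. An evaluation that only prints the first number would never show it.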

Consent and the "public face" problem: unlike a password or an email address, your face is visible to anyone with a camera. You can change your password. You can get a new email address. You cannot change your face. Being recognized in a photo uploaded by someone else means your biometric data is processed without your knowledge or agreement. Traditional data protection frameworks (built around the concept of informed consent) struggle with biometric data that is, by its nature, publicly visible.

Chilling effect: the knowledge that cameras can identify you changes behavior. Research consistently shows that surveillance reduces participation in protests, political gatherings, religious congregations, and even routine public spaces. For vulnerable populations (undocumented immigrants, political dissidents, members of persecuted minorities), ubiquitous face recognition represents an existential threat, not a convenience feature.

The EU AI Act prohibits real-time remote biometric identification in publicly accessible spaces for law enforcement, apart from narrowly defined exceptions, and classifies other remote biometric identification systems as "high-risk" with strict requirements. Several US cities (San Francisco, Boston, Portland) have banned government use of face recognition. Other jurisdictions have no restrictions at all. The technology is globally available; the regulation is fragmented and lagging.

As AI practitioners, understanding both what the technology can do and what it should not do isn't optional -- it's part of competence. Building a face recognition system without considering its deployment context is like building a bridge without considering what traffic it'll carry ;-)

In summary

  • Face detection evolved from Haar cascades with cascaded classifiers (Viola-Jones, 2001) to deep learning detectors like RetinaFace (2019) that simultaneously predict bounding boxes, confidence scores, and 5-point landmarks; the cascade architecture remains conceptually useful for understanding how to reject easy negatives early;
  • face recognition maps faces to 512-dimensional embedding vectors using metric learning losses; FaceNet's triplet loss pulls same-person embeddings closer and pushes different-person embeddings apart; ArcFace's angular margin on the unit hypersphere produces more discriminative separations; the threshold between "match" and "no match" trades off false accepts vs false rejects;
  • facial landmarks (5 to 468 points) enable face alignment for recognition, expression tracking, gaze estimation, AR face filters, and coarse 3D face reconstruction from single images;
  • expression recognition classifies faces into Ekman's seven categories (or continuous valence/arousal dimensions) using CNN backbones fine-tuned on expression datasets; cultural bias in training data and the debated universality of emotions are documented concerns;
  • deepfake detection searches for frequency artifacts, blending boundaries, temporal inconsistencies, and missing biological signals; it's an ongoing arms race where detection methods have limited shelf life against improving generators;
  • face recognition ethics involve surveillance at scale, demographic bias (higher error rates for darker-skinned women), consent challenges with public biometric data, and chilling effects on civil liberties; regulation varies widely by jurisdiction and lags behind deployment.

We've now covered the major specialized domains within computer vision -- from raw pixels through detection, segmentation, tracking, OCR, video, generation, editing, 3D reconstruction, and face analysis. There's still territory to explore in how vision applies to specialized scientific and medical contexts, and in how machines can learn visual representations without explicit human-labeled supervision.

Exercises

Exercise 1: Build a face embedding similarity analyzer. Create a class EmbeddingSimilarityAnalyzer that: (a) generates synthetic 512-dimensional face embeddings for 5 "identities" with 10 "photos" each -- for each identity, generate a centroid vector (random unit vector), then create 10 variants by adding Gaussian noise (sigma=0.15) and re-normalizing to unit length, (b) computes the full 50x50 pairwise cosine similarity matrix, (c) computes and prints: the mean same-identity similarity (should be high), the mean different-identity similarity (should be low), the minimum same-identity similarity (the hardest positive pair), the maximum different-identity similarity (the hardest negative pair), and the "gap" between the hardest positive and hardest negative (this gap determines how easy it is to set a threshold), (d) for thresholds [0.3, 0.4, 0.5, 0.6, 0.7, 0.8], computes and prints the true positive rate, false positive rate, and accuracy, (e) identifies the threshold that maximizes accuracy. Verify that the gap between same-identity and different-identity distributions depends directly on the noise sigma -- lower noise = wider gap = easier recognition.

Exercise 2: Build a landmark-based face alignment tool. Create a class FaceAligner that: (a) takes 5-point landmarks (left_eye, right_eye, nose, mouth_left, mouth_right) as (x, y) coordinate pairs, (b) computes the angle of the line between the two eye centers relative to horizontal, (c) computes a 2D rotation matrix to align the eyes horizontally, (d) computes a scale factor to normalize the inter-eye distance to a target value (say 70 pixels), (e) computes a translation to center the midpoint between the eyes at a target position (say (112, 112) in a 224x224 output), (f) combines rotation, scaling, and translation into a single 2x3 affine matrix and applies it to the 5 landmarks (simulating what cv2.warpAffine would do to the image), (g) generates 5 test cases with varying eye positions, rotations (tilted faces from -30 to +30 degrees), and scales, applies the alignment, and prints: the original eye angle, the aligned eye angle (should be ~0), the original inter-eye distance, the aligned inter-eye distance (should be ~70), and the aligned eye center position (should be near (112, 112)). Verify that all test cases produce consistent aligned positions regardless of the input variation.

Exercise 3: Build an expression confusion matrix analyzer. Create a class ExpressionAnalyzer that: (a) simulates expression classification results for 7 classes (angry, disgusted, fearful, happy, neutral, sad, surprised) by generating a confusion matrix where each class has 100 test samples, the correct class gets probability 0.6-0.8 (varying per class -- "happy" is easiest at 0.8, "fearful" and "disgusted" are hardest at 0.6), and the remaining probability mass is distributed among specific confusion pairs: angry<->disgusted (high confusion), fearful<->surprised (high confusion), sad<->neutral (moderate confusion), (b) computes per-class precision, recall, and F1 score, (c) computes and prints the full 7x7 confusion matrix with actual counts, (d) identifies the two most-confused class pairs (highest off-diagonal counts), (e) computes an "overall difficulty" ranking of expressions by F1 score, (f) simulates the effect of adding more training data: re-run with the correct-class probability increased by 0.05 for the two worst-performing classes, and show how F1 scores change. Verify that the deliberately confused pairs (angry/disgusted, fearful/surprised) show up as the highest off-diagonal entries.

Greetings!

@scipio


