Learn AI Series (#81) - Pose Estimation and Tracking

What will I learn

  • You will learn human pose estimation: detecting body keypoints like joints and limbs in images;
  • heatmap-based keypoint prediction: why models predict probability maps instead of raw coordinates;
  • top-down vs bottom-up approaches and when each one works best;
  • practical pose estimation with MediaPipe and MMPose/HRNet for real-time and research-grade accuracy;
  • object tracking across video frames: SORT, DeepSORT, and ByteTrack;
  • evaluation metrics: OKS for pose quality, MOTA and IDF1 for tracking quality;
  • how pose estimation and tracking combine for action recognition, sports analytics, and motion capture.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#81) - Pose Estimation and Tracking

Solutions to Episode #80 Exercises

Exercise 1: Segmentation mask visualization toolkit.

import numpy as np


class SegmentationVisualizer:
    """Visualize and evaluate segmentation masks."""

    def __init__(self, class_colors):
        self.class_colors = class_colors
        self.num_classes = len(class_colors)

    def colorize(self, prediction):
        """Convert class label map to RGB image."""
        h, w = prediction.shape
        rgb = np.zeros((h, w, 3), dtype=np.uint8)
        for cls_id, color in self.class_colors.items():
            mask = prediction == cls_id
            rgb[mask] = color
        return rgb

    def overlay(self, image, prediction, alpha=0.5):
        """Blend colorized mask with original image."""
        colored = self.colorize(prediction)
        if image.dtype != np.uint8:
            image = (image * 255).astype(np.uint8)
        blended = (alpha * colored.astype(np.float32)
                   + (1 - alpha)
                   * image.astype(np.float32))
        return np.clip(blended, 0, 255).astype(np.uint8)

    def class_statistics(self, prediction):
        total = prediction.size
        print(f"Image: {prediction.shape[1]}x"
              f"{prediction.shape[0]} = {total} pixels")
        for cls_id, color in self.class_colors.items():
            count = (prediction == cls_id).sum()
            pct = count / total * 100
            present = "YES" if count > 0 else "no"
            print(f"  Class {cls_id}: {count:>6d} px "
                  f"({pct:>5.1f}%) [{present}]")

    def iou_per_class(self, prediction, ground_truth):
        ious = {}
        for cls_id in self.class_colors:
            pred_mask = prediction == cls_id
            gt_mask = ground_truth == cls_id
            inter = (pred_mask & gt_mask).sum()
            union = (pred_mask | gt_mask).sum()
            if union == 0:
                continue
            ious[cls_id] = inter / union
        miou = np.mean(list(ious.values())) if ious else 0
        print("Per-class IoU:")
        for c, v in ious.items():
            print(f"  Class {c}: {v:.4f}")
        print(f"  mIoU: {miou:.4f}")
        return ious, miou


# Test: 4 classes, 200x200
colors = {0: (0, 0, 0), 1: (255, 0, 0),
          2: (0, 255, 0), 3: (0, 0, 255)}
vis = SegmentationVisualizer(colors)

np.random.seed(42)
gt = np.random.randint(0, 4, (200, 200))
vis.class_statistics(gt)
ious, miou = vis.iou_per_class(gt, gt)
# mIoU = 1.0 when prediction == ground truth

When prediction equals ground truth exactly, every class gets IoU = 1.0 and mIoU = 1.0. The moment you introduce even small errors, mIoU drops quickly -- a class that covers only 5% of the image can have its IoU destroyed by a handful of misclassified pixels. This is why mIoU is a stricter metric than pixel accuracy: accuracy lets you get away with ignoring small classes, mIoU does not.
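
To make that last point concrete, here is a small toy case (the 100x100 image and the "predict background everywhere" model are made up for illustration): a class that covers 5% of the image is ignored completely, and the two metrics disagree sharply about how bad that is.

```python
import numpy as np

# Toy label map: class 1 is a small object covering 500 of 10,000 pixels (5%)
gt = np.zeros((100, 100), dtype=np.int64)
gt[:5, :] = 1

# A lazy "model" that predicts background (class 0) everywhere
pred = np.zeros_like(gt)

pixel_acc = float((pred == gt).mean())

ious = []
for cls_id in (0, 1):
    inter = ((pred == cls_id) & (gt == cls_id)).sum()
    union = ((pred == cls_id) | (gt == cls_id)).sum()
    ious.append(float(inter / union))
miou = float(np.mean(ious))

print(f"Pixel accuracy: {pixel_acc:.3f}")   # 0.950
print(f"IoU per class:  {ious}")            # [0.95, 0.0]
print(f"mIoU:           {miou:.3f}")        # 0.475
```

Pixel accuracy still reports 95%, while mIoU is cut in half because the missed class contributes an IoU of zero.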

Exercise 2: Flexible U-Net with configurable depth.

import torch
import torch.nn as nn


class FlexibleUNet(nn.Module):
    """U-Net with configurable depth."""

    def __init__(self, in_ch=3, num_classes=21,
                 depth=4, base_channels=64):
        super().__init__()
        self.depth = depth
        self.first_forward = True

        # Build encoder
        self.encoders = nn.ModuleList()
        self.pools = nn.ModuleList()
        ch = in_ch
        for i in range(depth):
            out_ch = base_channels * (2 ** i)
            self.encoders.append(self._block(ch, out_ch))
            self.pools.append(nn.MaxPool2d(2))
            ch = out_ch

        # Bottleneck
        bot_ch = base_channels * (2 ** depth)
        self.bottleneck = self._block(ch, bot_ch)

        # Build decoder
        self.upconvs = nn.ModuleList()
        self.decoders = nn.ModuleList()
        ch = bot_ch
        for i in range(depth - 1, -1, -1):
            out_ch = base_channels * (2 ** i)
            self.upconvs.append(
                nn.ConvTranspose2d(ch, out_ch, 2, stride=2)
            )
            self.decoders.append(
                self._block(out_ch * 2, out_ch)
            )
            ch = out_ch

        self.out_conv = nn.Conv2d(ch, num_classes, 1)

    def _block(self, in_ch, out_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(True),
        )

    def parameter_count(self):
        return sum(p.numel() for p in self.parameters()
                   if p.requires_grad)

    def forward(self, x):
        skips = []
        for enc, pool in zip(self.encoders, self.pools):
            x = enc(x)
            skips.append(x)
            if self.first_forward:
                print(f"  Enc: {x.shape}")
            x = pool(x)

        x = self.bottleneck(x)
        if self.first_forward:
            print(f"  Bot: {x.shape}")

        for up, dec, skip in zip(
            self.upconvs, self.decoders,
            reversed(skips)
        ):
            x = up(x)
            x = torch.cat([x, skip], dim=1)
            x = dec(x)
            if self.first_forward:
                print(f"  Dec: {x.shape}")

        self.first_forward = False
        return self.out_conv(x)


# Compare depths 2-5
print(f"{'Depth':<6} {'Params':>12} {'Output':>20}")
print("-" * 40)
for d in [2, 3, 4, 5]:
    model = FlexibleUNet(depth=d, base_channels=64)
    print(f"\n--- Depth {d} ---")
    out = model(torch.randn(1, 3, 256, 256))
    params = model.parameter_count()
    print(f"{d:<6} {params:>12,} {str(out.shape):>20}")

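Output spatial dimensions always match the input regardless of depth (as long as the input size is divisible by 2^depth, which 256 is for every depth tested here) -- the decoder upsamples back to exactly the original resolution. Parameter count grows geometrically with depth: each extra level doubles the channel count, which roughly quadruples the number of convolution weights at that level. With base_channels=64, depth 5 therefore has roughly 64x more parameters than depth 2 (on the order of 120M versus 2M). The practical sweet spot for most tasks is depth 3-4; depth 5 is overkill unless you have a very large dataset and significant GPU memory.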

Exercise 3: Dice coefficient monitor.

import torch
import torch.nn.functional as F
import numpy as np


class DiceMonitor:
    """Track Dice scores across training batches."""

    def __init__(self, num_classes):
        self.num_classes = num_classes
        self.reset()

    def reset(self):
        self.tp = np.zeros(self.num_classes)
        self.fp = np.zeros(self.num_classes)
        self.fn = np.zeros(self.num_classes)

    def update(self, predictions, targets):
        """predictions: (B,C,H,W), targets: (B,H,W)."""
        preds = predictions.argmax(dim=1)  # (B,H,W)
        for c in range(self.num_classes):
            pred_c = (preds == c).float()
            targ_c = (targets == c).float()
            self.tp[c] += (pred_c * targ_c).sum().item()
            self.fp[c] += (pred_c * (1 - targ_c)).sum().item()
            self.fn[c] += ((1 - pred_c) * targ_c).sum().item()

    def compute(self):
        dice = {}
        for c in range(self.num_classes):
            denom = 2 * self.tp[c] + self.fp[c] + self.fn[c]
            if denom == 0:
                continue
            dice[c] = 2 * self.tp[c] / denom
        mean_dice = (np.mean(list(dice.values()))
                     if dice else 0.0)
        return dice, mean_dice

    def report(self):
        dice, mean_d = self.compute()
        print(f"{'Class':>6} {'Dice':>8} {'Bar'}")
        print("-" * 40)
        for c, d in sorted(dice.items()):
            bar = "#" * int(d * 30)
            print(f"{c:>6} {d:>8.4f} {bar}")
        print(f"{'Mean':>6} {mean_d:>8.4f}")


# Simulate 10 batches with improving predictions
monitor = DiceMonitor(num_classes=4)
np.random.seed(42)
torch.manual_seed(42)

gt = torch.randint(0, 4, (2, 32, 32))

for batch in range(10):
    # Gradually improve: start noisy, converge to GT
    noise_level = 1.0 - batch * 0.1
    logits = F.one_hot(gt, 4).permute(0, 3, 1, 2).float()
    logits += torch.randn_like(logits) * noise_level
    monitor.update(logits, gt)

monitor.report()

# Verify perfect predictions -> Dice = 1.0
monitor.reset()
perfect = F.one_hot(gt, 4).permute(0, 3, 1, 2).float()
perfect *= 100  # high confidence
monitor.update(perfect, gt)
dice, mean_d = monitor.compute()
print(f"\nPerfect predictions: mean Dice = {mean_d:.4f}")

The Dice score increases as the noise decreases, which is exactly what you want to see during training. With perfect predictions the Dice is 1.0 for every class. In real training, you'd call reset() at the start of each epoch, update() for every batch, and report() at the end -- giving you a live view of which classes the model is learning well and which ones are lagging behind.

On to today's episode

Here we go! The last four episodes built up a complete visual understanding stack: image processing fundamentals (#77), object detection foundations and modern approaches (#78-79), and pixel-level segmentation (#80). We can now tell what is in an image, where it is (bounding box), and exactly which pixels belong to it (mask). That's incredibly powerful, but it still treats objects as blobs -- a person is just a labeled region of pixels.

Humans aren't blobs. A person has structure -- arms, legs, a head, joints that bend in specific ways. Whether someone is standing, sitting, waving, or falling depends entirely on the spatial arrangement of their body parts. Segmentation tells you "these pixels are a person." Pose estimation tells you "this person's left elbow is at coordinates (342, 187) and their right knee is at (295, 410)." That structural information unlocks capabilities that detection and segmentation simply cannot provide ;-)

What pose estimation actually outputs

A pose estimator takes an image and returns a set of keypoints -- specific anatomical landmarks on the body. The COCO keypoint format defines 17 points:

 0: nose            1: left_eye        2: right_eye
 3: left_ear        4: right_ear       5: left_shoulder
 6: right_shoulder  7: left_elbow      8: right_elbow
 9: left_wrist     10: right_wrist    11: left_hip
12: right_hip      13: left_knee      14: right_knee
15: left_ankle     16: right_ankle

Each keypoint is an (x, y, confidence) triplet. The model predicts not just where the joint is, but how sure it is about that prediction. A wrist hidden behind someone's back might get coordinates but a low confidence score -- the model is essentially saying "I think it's here, but I'm guessing."

The keypoints are connected by a skeleton -- predefined edges between keypoints that represent bones and limbs. Shoulder-to-elbow, elbow-to-wrist, hip-to-knee, knee-to-ankle. The skeleton turns a bag of 17 scattered points into a recognizable human figure.

import numpy as np


# COCO skeleton definition: pairs of keypoint indices
COCO_SKELETON = [
    (0, 1), (0, 2), (1, 3), (2, 4),       # head
    (5, 6),                                  # shoulders
    (5, 7), (7, 9),                          # left arm
    (6, 8), (8, 10),                         # right arm
    (5, 11), (6, 12),                        # torso
    (11, 12),                                # hips
    (11, 13), (13, 15),                      # left leg
    (12, 14), (14, 16),                      # right leg
]

KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye",
    "left_ear", "right_ear",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]


def draw_skeleton(image, keypoints, threshold=0.3):
    """Draw keypoints and skeleton on image.

    keypoints: (17, 3) array of (x, y, confidence)
    """
    import cv2  # local import: only needed for drawing
    img = image.copy()

    # Draw keypoints as circles
    for x, y, conf in keypoints:
        if conf < threshold:
            continue
        px, py = int(x), int(y)
        cv2.circle(img, (px, py), 4,
                   (0, 255, 0), -1)

    # Draw skeleton lines
    for i, j in COCO_SKELETON:
        if (keypoints[i, 2] < threshold
                or keypoints[j, 2] < threshold):
            continue
        pt1 = (int(keypoints[i, 0]),
               int(keypoints[i, 1]))
        pt2 = (int(keypoints[j, 0]),
               int(keypoints[j, 1]))
        cv2.line(img, pt1, pt2, (255, 0, 0), 2)

    return img


# Simulate a standing person
keypoints = np.array([
    [200, 50, 0.95],    # nose
    [190, 45, 0.90],    # left_eye
    [210, 45, 0.92],    # right_eye
    [180, 55, 0.80],    # left_ear
    [220, 55, 0.82],    # right_ear
    [170, 120, 0.95],   # left_shoulder
    [230, 120, 0.93],   # right_shoulder
    [150, 200, 0.88],   # left_elbow
    [250, 200, 0.87],   # right_elbow
    [140, 280, 0.85],   # left_wrist
    [260, 280, 0.83],   # right_wrist
    [180, 280, 0.92],   # left_hip
    [220, 280, 0.91],   # right_hip
    [175, 380, 0.89],   # left_knee
    [225, 380, 0.88],   # right_knee
    [170, 470, 0.85],   # left_ankle
    [230, 470, 0.84],   # right_ankle
])
print(f"Keypoints shape: {keypoints.shape}")
print(f"Mean confidence: {keypoints[:, 2].mean():.3f}")
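
Once you have keypoints, the "spatial arrangement of body parts" becomes something you can compute with. A minimal sketch, assuming 2D pixel coordinates and using a helper name (joint_angle) that is made up for this example: the angle at the elbow tells you whether an arm is straight or bent.

```python
import numpy as np


def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b->a and b->c.

    Each point is (x, y) or (x, y, confidence); only x and y are used.
    """
    a, b, c = (np.asarray(p, dtype=float)[:2] for p in (a, b, c))
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


# Left arm of the simulated standing person: shoulder, elbow, wrist
left_shoulder = (170, 120, 0.95)
left_elbow = (150, 200, 0.88)
left_wrist = (140, 280, 0.85)

angle = joint_angle(left_shoulder, left_elbow, left_wrist)
print(f"Left elbow angle: {angle:.1f} degrees")  # close to 180 = straight arm
```

Angles like this are invariant to where the person stands in the frame, which is why they are a popular input feature for things like rep counting in fitness apps.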

Top-down vs bottom-up

There are two fundamentally different strategies for multi-person pose estimation, and the choice between them has real practical consequences.

Top-down: first detect all people in the image (using an object detector like YOLO or Faster R-CNN from episodes #78-79), then run a single-person pose estimator on each detected bounding box. The pose model never has to deal with multiple people -- it gets a cropped image containing exactly one person and predicts 17 keypoints.

Bottom-up: first detect all keypoints in the entire image at once (all the left elbows, all the right knees, everything), then figure out which keypoints belong to which person. The grouping step is the hard part -- if there are 5 people in the image, there are 5 left elbows, and you need to match each elbow to the correct person.

Top-down (two-stage):
Image -> Detector -> [crop_1, crop_2, ...] -> Pose model -> keypoints per person

Bottom-up (also two-stage, but different):
Image -> Keypoint detector -> all keypoints -> Grouping -> keypoints per person

Top-down is generally more accurate because the pose model operates on clean, single-person crops. But it's slower when there are many people -- runtime scales linearly with the number of detections. If you have 30 people in the frame, you're running the pose model 30 times. Bottom-up is faster for crowded scenes because keypoint detection happens once regardless of how many people are present, but the grouping step can struggle when people overlap or stand very close together.
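
One detail of the top-down pipeline that is easy to get wrong: the pose model predicts keypoints in the coordinate system of the resized crop, so they have to be mapped back to the full image. A rough sketch of that bookkeeping, where dummy_pose_model is a hypothetical stand-in for a real network and the 192x256 crop size is just an assumed input resolution:

```python
import numpy as np


def crop_to_image_coords(kps_crop, box, crop_size):
    """Map (K, 3) keypoints from crop pixels back to image pixels.

    box: (x1, y1, x2, y2) of the person in the full image
    crop_size: (width, height) the crop was resized to
    """
    x1, y1, x2, y2 = box
    crop_w, crop_h = crop_size
    out = np.asarray(kps_crop, dtype=float).copy()
    out[:, 0] = x1 + out[:, 0] * (x2 - x1) / crop_w
    out[:, 1] = y1 + out[:, 1] * (y2 - y1) / crop_h
    return out  # the confidence column is left untouched


def dummy_pose_model(crop_size):
    """Hypothetical stand-in: 'predicts' one keypoint at the crop center."""
    w, h = crop_size
    return np.array([[w / 2, h / 2, 1.0]])


boxes = [(100, 50, 400, 500), (420, 80, 560, 480)]
crop_size = (192, 256)

# One pose-model call per detected person: cost grows with len(boxes)
poses = [
    crop_to_image_coords(dummy_pose_model(crop_size), box, crop_size)
    for box in boxes
]
for box, kps in zip(boxes, poses):
    print(box, "->", kps[0])
```

The list comprehension is where the linear scaling comes from: every extra box means one more forward pass through the pose model.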

For most practical applications (fitness apps, sports analytics, interactive installations), top-down with a fast detector is the right choice. You rarely have more than 10-15 people in the frame for these use cases, and accuracy matters more than throughput. For surveillance or crowd analysis where you might have 50+ people, bottom-up approaches like OpenPose make more sense.

Heatmap-based keypoint prediction

Most modern pose estimators predict heatmaps rather than direct coordinate regression. For each keypoint, the model outputs a 2D probability map where the brightest spot indicates the predicted location.

Why heatmaps instead of just regressing (x, y) coordinates directly? Regressing exact coordinates is hard -- the network has to map a whole image to two numbers per joint, and that highly non-linear mapping is difficult to learn. Heatmaps are a softer target: the ground truth is a 2D Gaussian centered on the true keypoint location, and the model learns to produce a similar blob. The predicted coordinate is just the argmax of the heatmap (scaled back to image resolution if the heatmap is smaller than the input). Much easier to learn, and it naturally provides a confidence measure (how bright is the peak?).

import torch
import torch.nn as nn
import numpy as np


def generate_heatmap(height, width, center_x,
                     center_y, sigma=2.0):
    """Generate a Gaussian heatmap centered at
    (center_x, center_y)."""
    y_grid, x_grid = np.mgrid[0:height, 0:width]
    dist_sq = ((x_grid - center_x) ** 2
               + (y_grid - center_y) ** 2)
    heatmap = np.exp(-dist_sq / (2 * sigma ** 2))
    return heatmap


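# Ground truth: left shoulder at image pixel (120, 85).
# On a 64x64 heatmap for a 256x256 input (stride 4),
# that is roughly (30, 21) in heatmap coordinates.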
gt_heatmap = generate_heatmap(
    64, 64, center_x=30, center_y=21, sigma=2.0
)
print(f"Heatmap shape: {gt_heatmap.shape}")
print(f"Peak value: {gt_heatmap.max():.3f}")
print(f"Peak location: "
      f"{np.unravel_index(gt_heatmap.argmax(), gt_heatmap.shape)}")

# The loss is MSE between predicted and GT heatmaps
# For 17 keypoints: model outputs (B, 17, H, W)
# Each channel is a heatmap for one keypoint
gt_batch = torch.from_numpy(
    gt_heatmap
).unsqueeze(0).unsqueeze(0).float()
pred = torch.randn(1, 1, 64, 64)

loss_fn = nn.MSELoss()
loss = loss_fn(pred, gt_batch)
print(f"Heatmap MSE loss: {loss.item():.4f}")
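
The example above builds one heatmap. For a full training target you need one channel per keypoint, plus a way to skip joints that are not labeled. A minimal sketch under a few assumptions: the helper name make_target_heatmaps is made up, keypoints are given in image pixels with a visibility flag, and unlabeled joints simply get an all-zero channel and a weight of 0.

```python
import numpy as np


def make_target_heatmaps(keypoints, image_size, heatmap_size, sigma=2.0):
    """Build (K, H, W) Gaussian targets plus (K,) per-keypoint weights.

    keypoints: (K, 3) array of (x, y, visibility) in image pixels
    image_size, heatmap_size: (width, height)
    """
    img_w, img_h = image_size
    hm_w, hm_h = heatmap_size
    num_kp = len(keypoints)
    targets = np.zeros((num_kp, hm_h, hm_w), dtype=np.float32)
    weights = np.zeros(num_kp, dtype=np.float32)
    ys, xs = np.mgrid[0:hm_h, 0:hm_w]

    for k, (x, y, vis) in enumerate(keypoints):
        if vis <= 0:
            continue  # unlabeled joint: zero target, zero weight
        # Scale image coordinates down to heatmap coordinates
        cx = x * hm_w / img_w
        cy = y * hm_h / img_h
        dist_sq = (xs - cx) ** 2 + (ys - cy) ** 2
        targets[k] = np.exp(-dist_sq / (2 * sigma ** 2))
        weights[k] = 1.0
    return targets, weights


kps = np.array([
    [120, 85, 1],   # labeled joint
    [200, 40, 0],   # unlabeled joint
])
targets, weights = make_target_heatmaps(kps, (256, 256), (64, 64))
peak = np.unravel_index(targets[0].argmax(), targets[0].shape)
print(f"Targets: {targets.shape}, weights: {weights}")
print(f"Peak (row, col) of channel 0: {peak}")
```

The weights array is what you would multiply into the per-keypoint loss so that missing annotations do not push the model toward predicting "nothing here".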

The architecture follows a pattern we've seen many times now: encoder (backbone) compresses the image into features, then a decoder (deconvolution layers) expands back to spatial resolution. It's essentially the U-Net thinking from episode #80 applied to keypoint prediction instead of pixel classification.

import torch
import torch.nn as nn


class SimplePoseNet(nn.Module):
    """Simplified pose estimation network.
    Backbone -> deconv upsampling -> heatmaps."""

    def __init__(self, num_keypoints=17):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(),
        )
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4,
                               stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4,
                               stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, num_keypoints, 4,
                               stride=2, padding=1),
        )

    def forward(self, x):
        features = self.backbone(x)
        heatmaps = self.deconv(features)
        return heatmaps


model = SimplePoseNet()
img = torch.randn(1, 3, 256, 256)
heatmaps = model(img)
print(f"Input:   {img.shape}")
print(f"Output:  {heatmaps.shape}")
# (1, 17, 256, 256) -- one heatmap per keypoint


def heatmaps_to_keypoints(heatmaps):
    """Extract (x, y, confidence) per keypoint."""
    batch, num_kp, h, w = heatmaps.shape
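    flat = heatmaps.reshape(batch, num_kp, -1)
    confidence, indices = flat.max(dim=-1)
    # Flat index -> (row, col); cast to float so all three
    # columns share a dtype when stacked
    y = (indices // w).float()
    x = (indices % w).float()
    # Stack into (batch, num_keypoints, 3)
    return torch.stack([x, y, confidence], dim=-1)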


kps = heatmaps_to_keypoints(heatmaps)
print(f"Keypoints: {kps.shape}")
# (1, 17, 3) -- x, y, confidence per keypoint
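
For completeness, this is roughly what a single training step on heatmaps looks like. It's a sketch, not a recipe: the one-layer model is only a stand-in for any network that maps (B, 3, H, W) to (B, 17, H, W), the targets are random instead of real Gaussian heatmaps, and the visibility mask and learning rate are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a pose network with the right input/output shapes
model = nn.Conv2d(3, 17, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.randn(2, 3, 64, 64)
targets = torch.rand(2, 17, 64, 64)   # Gaussian GT heatmaps in practice
visible = torch.ones(2, 17)           # 1 = labeled keypoint, 0 = skip
visible[0, 9] = 0                     # pretend one wrist is unlabeled

preds = model(images)
per_pixel = (preds - targets) ** 2
# Mask out unlabeled keypoints so they contribute no gradient
loss = (per_pixel * visible[:, :, None, None]).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"Masked heatmap MSE: {loss.item():.4f}")
```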

The real production models (HRNet, ViTPose) use much deeper backbones, but the data flow is identical: image in, heatmaps out, argmax to get coordinates. The difference is in how the backbone maintains spatial information -- and that's where HRNet gets interesting.

HRNet: maintaining resolution throughout

HRNet (High-Resolution Network, Sun et al., 2019) takes a radically different approach from the standard "shrink then expand" pattern. Instead of progressively reducing resolution and then upsampling (like ResNet + deconv, or U-Net's encoder-decoder), HRNet maintains high-resolution representations throughout the entire network, running parallel branches at different resolutions and repeatedly exchanging information between them.

Stage 1:  [High-res branch only]
             |
Stage 2:  [High-res] <-> [Medium-res]
             |               |
Stage 3:  [High-res] <-> [Medium-res] <-> [Low-res]
             |               |               |
Stage 4:  [High-res] <-> [Medium-res] <-> [Low-res] <-> [Very-low-res]

At each stage, a new lower-resolution branch is added, and all branches exchange information through multi-scale fusion: each branch receives features from all other branches (upsampled or downsampled as needed). The high-res branch never loses spatial detail, and the low-res branches provide the semantic context.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleFusion(nn.Module):
    """Fuse features across different resolutions."""

    def __init__(self, channels_list):
        """channels_list: [high_ch, medium_ch, low_ch]"""
        super().__init__()
        self.num_branches = len(channels_list)
        self.fuse_layers = nn.ModuleList()

        for i in range(self.num_branches):
            branch_fuse = nn.ModuleList()
            for j in range(self.num_branches):
                if i == j:
                    # Same resolution: identity
                    branch_fuse.append(nn.Identity())
                elif j > i:
                    # Lower-res to higher-res: upsample
                    branch_fuse.append(nn.Sequential(
                        nn.Conv2d(channels_list[j],
                                  channels_list[i], 1),
                        nn.BatchNorm2d(channels_list[i]),
                    ))
                else:
                    # Higher-res to lower-res: strided conv
                    layers = []
                    for k in range(i - j):
                        in_c = (channels_list[j] if k == 0
                                else channels_list[i])
                        layers.extend([
                            nn.Conv2d(in_c,
                                      channels_list[i],
                                      3, stride=2,
                                      padding=1),
                            nn.BatchNorm2d(
                                channels_list[i]),
                        ])
                    branch_fuse.append(
                        nn.Sequential(*layers)
                    )
            self.fuse_layers.append(branch_fuse)

    def forward(self, inputs):
        """inputs: list of tensors at different resolutions."""
        outputs = []
        for i in range(self.num_branches):
            fused = None
            for j in range(self.num_branches):
                x = self.fuse_layers[i][j](inputs[j])
                if j > i:
                    # Upsample to match target size
                    x = F.interpolate(
                        x, size=inputs[i].shape[2:],
                        mode="bilinear",
                        align_corners=True
                    )
                if fused is None:
                    fused = x
                else:
                    fused = fused + x
            outputs.append(F.relu(fused))
        return outputs


# Test: 3 branches at different resolutions
fusion = MultiScaleFusion([32, 64, 128])
inputs = [
    torch.randn(1, 32, 64, 64),   # high-res
    torch.randn(1, 64, 32, 32),   # medium-res
    torch.randn(1, 128, 16, 16),  # low-res
]
outputs = fusion(inputs)
for i, o in enumerate(outputs):
    print(f"Branch {i}: {inputs[i].shape} -> "
          f"{o.shape}")

Why does this matter for pose estimation specifically? Because keypoint localization needs both local precision (exactly which pixel is the elbow) and global context (understanding that the blob of pixels near the torso is an arm, not a background object). The standard approach of downsampling then upsampling inevitably loses some spatial precision even with skip connections. HRNet sidesteps this: after an initial stem that reduces the input to 1/4 resolution, the high-resolution branch is never downsampled again. The result: consistently more accurate keypoint predictions, especially for small or occluded joints.

On the COCO keypoint benchmark (validation set, 256x192 input), HRNet-W48 achieves around 75.1 AP compared to SimpleBaseline (ResNet-152 + deconv) at 72.0 AP. That 3-point improvement is largely attributable to the architectural choice to maintain resolution.

Practical pose estimation with MediaPipe

For production use, you don't build from scratch. The ecosystem has mature models ready to deploy.

MediaPipe (Google) is the go-to for real-time applications. It runs on CPU, on mobile devices, in the browser. It detects 33 body keypoints (more than COCO's 17), plus hands (21 keypoints each) and face (468 landmarks):

import mediapipe as mp
import cv2

mp_pose = mp.solutions.pose
mp_draw = mp.solutions.drawing_utils

with mp_pose.Pose(
    static_image_mode=True,
    min_detection_confidence=0.5
) as pose:
    image = cv2.imread("person.jpg")
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = pose.process(rgb)

    if results.pose_landmarks:
        h, w = image.shape[:2]
        for i, lm in enumerate(
            results.pose_landmarks.landmark
        ):
            px = int(lm.x * w)
            py = int(lm.y * h)
            print(f"Keypoint {i}: ({px}, {py}), "
                  f"vis={lm.visibility:.2f}")

        # Draw skeleton on image
        mp_draw.draw_landmarks(
            image, results.pose_landmarks,
            mp_pose.POSE_CONNECTIONS
        )
        cv2.imwrite("pose_result.jpg", image)

MediaPipe is remarkable because it runs at 30+ FPS on a mobile phone. The secret is a lightweight architecture (BlazePose) specifically designed for on-device inference. It uses a two-stage pipeline internally: a fast detector finds the person, then a regression network predicts keypoints from the crop. The regression network is tiny -- about 3.5 million parameters.

MMPose (from the OpenMMLab ecosystem) offers research-grade accuracy with a zoo of pretrained models -- HRNet, ViTPose, RTMPose. It's heavier than MediaPipe but more accurate, and supports a wider range of pose types (body, hand, face, animal, whole-body):

# MMPose config-based inference (MMPose 1.x API)
import numpy as np
from mmpose.apis import (
    inference_topdown, init_model
)
from mmpose.utils import register_all_modules

register_all_modules()

config = (
    "configs/body_2d_keypoint/"
    "topdown_heatmap/coco/"
    "td-hm_hrnet-w48_8xb32-210e_coco-"
    "256x192.py"
)
checkpoint = (
    "https://download.openmmlab.com/mmpose/v1/"
    "body_2d_keypoint/topdown_heatmap/coco/"
    "td-hm_hrnet-w48_8xb32-210e_coco-"
    "256x192-0e67c616_20220913.pth"
)
model = init_model(config, checkpoint, device="cpu")

# Top-down: provide bounding boxes from a detector as an
# (N, 4) array in (x1, y1, x2, y2) format
bboxes = np.array([[100, 50, 400, 500]])
results = inference_topdown(
    model, "person.jpg", bboxes, bbox_format="xyxy"
)
keypoints = results[0].pred_instances.keypoints
scores = results[0].pred_instances.keypoint_scores
print(f"Keypoints: {keypoints.shape}")
print(f"Scores: {scores.shape}")

The practical decision is straightforward: MediaPipe for real-time applications where speed matters (fitness apps, interactive installations, mobile deployment). HRNet/MMPose for offline processing where accuracy matters (sports analytics, research, medical assessment). Both produce the same kind of output -- keypoints with confidence scores -- so you can prototype with MediaPipe and swap in HRNet later if you need higher accuracy. Just keep in mind that the keypoint sets differ (MediaPipe returns 33 landmarks in normalized coordinates, COCO-trained models return 17 in pixels), so downstream code needs an index mapping between the two.
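
Swapping between the two is easier when both are reduced to the 17-keypoint COCO layout. A small sketch of that conversion: the index table below follows MediaPipe Pose's published landmark numbering as I understand it (0 = nose, 2 = left eye, 5 = right eye, 7/8 = ears, 11-16 = shoulders, elbows and wrists, 23-28 = hips, knees and ankles), so do verify it against the version you have installed. The function name and the fake input are made up for this example.

```python
import numpy as np

# MediaPipe landmark index for each of the 17 COCO keypoints, in COCO order
MEDIAPIPE_TO_COCO = [
    0,           # nose
    2, 5,        # left_eye, right_eye
    7, 8,        # left_ear, right_ear
    11, 12,      # left_shoulder, right_shoulder
    13, 14,      # left_elbow, right_elbow
    15, 16,      # left_wrist, right_wrist
    23, 24,      # left_hip, right_hip
    25, 26,      # left_knee, right_knee
    27, 28,      # left_ankle, right_ankle
]


def mediapipe_to_coco(landmarks_xyv, width, height):
    """Convert (33, 3) normalized (x, y, visibility) to (17, 3) pixel keypoints."""
    lm = np.asarray(landmarks_xyv, dtype=float)
    coco = lm[MEDIAPIPE_TO_COCO].copy()
    coco[:, 0] *= width
    coco[:, 1] *= height
    return coco


# Fake landmarks so the example runs without MediaPipe installed
fake = np.zeros((33, 3))
fake[:, 0] = np.arange(33) / 100   # x encodes the landmark index
fake[:, 1] = 0.5
fake[:, 2] = 0.9

coco = mediapipe_to_coco(fake, width=640, height=480)
print(f"COCO keypoints: {coco.shape}")
print(f"left_shoulder: {coco[5]}")
```

With real results you would first build the (33, 3) array from `lm.x`, `lm.y` and `lm.visibility` of each landmark.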

Object tracking across frames

Pose estimation tells you where people are in a single frame. Tracking tells you which person in frame t is the same person in frame t+1. This sounds trivial -- just match the closest detections, right? -- but it's genuinely hard. People look similar, they overlap, they walk behind pillars and reappear on the other side, they change direction abruptly.

The classic approach is SORT (Simple Online and Realtime Tracking, Bewley et al., 2016), which uses two components:

  1. Kalman filter: predict where each tracked object will be in the next frame based on its position and velocity
  2. Hungarian algorithm: optimally match predictions to new detections using IoU as the cost metric

import numpy as np
from scipy.optimize import linear_sum_assignment


def compute_iou(box1, box2):
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    a1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    a2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = a1 + a2 - inter
    return inter / max(union, 1e-6)


class SimpleTracker:
    """Simplified SORT-style tracker."""

    def __init__(self, max_age=5, iou_thresh=0.3):
        self.tracks = []
        self.next_id = 0
        self.max_age = max_age
        self.iou_thresh = iou_thresh

    def update(self, detections):
        """Match detections to existing tracks.

        detections: list of [x1, y1, x2, y2]
        Returns: list of (track_id, box) tuples
        """
        if not self.tracks:
            # First frame: create tracks for all dets
            results = []
            for det in detections:
                self.tracks.append({
                    "id": self.next_id,
                    "box": det,
                    "age": 0,
                })
                results.append(
                    (self.next_id, det)
                )
                self.next_id += 1
            return results

        # Build IoU cost matrix
        num_tracks = len(self.tracks)
        num_dets = len(detections)
        cost = np.zeros((num_tracks, num_dets))
        for i, trk in enumerate(self.tracks):
            for j, det in enumerate(detections):
                cost[i, j] = 1.0 - compute_iou(
                    trk["box"], det
                )

        # Hungarian matching
        row_idx, col_idx = linear_sum_assignment(cost)

        matched_tracks = set()
        matched_dets = set()
        results = []

        for r, c in zip(row_idx, col_idx):
            if cost[r, c] > (1 - self.iou_thresh):
                continue
            self.tracks[r]["box"] = detections[c]
            self.tracks[r]["age"] = 0
            matched_tracks.add(r)
            matched_dets.add(c)
            results.append(
                (self.tracks[r]["id"], detections[c])
            )

        # Unmatched detections: new tracks
        for j in range(num_dets):
            if j not in matched_dets:
                self.tracks.append({
                    "id": self.next_id,
                    "box": detections[j],
                    "age": 0,
                })
                results.append(
                    (self.next_id, detections[j])
                )
                self.next_id += 1

        # Age unmatched tracks, remove old ones
        for i in range(num_tracks):
            if i not in matched_tracks:
                self.tracks[i]["age"] += 1

        self.tracks = [
            t for t in self.tracks
            if t["age"] <= self.max_age
        ]

        return results


# Simulate 3 frames of tracking
tracker = SimpleTracker()

frame1 = [[100, 100, 200, 300],
           [400, 100, 500, 300]]
frame2 = [[110, 105, 210, 305],
           [410, 95, 510, 295]]
frame3 = [[120, 110, 220, 310],
           [420, 90, 520, 290],
           [250, 200, 350, 400]]

for i, dets in enumerate([frame1, frame2, frame3]):
    results = tracker.update(dets)
    print(f"Frame {i+1}:")
    for tid, box in results:
        print(f"  Track {tid}: {box}")

SORT is fast but fragile. When a person is briefly occluded (walks behind a pillar, crouches behind a desk) and reappears, SORT often assigns them a new ID because the IoU between the predicted and actual positions is too low after the gap. This is called an ID switch, and it's the most common failure mode.

DeepSORT (Wojke et al., 2017) fixes this by adding an appearance model. Each tracked object gets an appearance feature vector extracted by a small CNN trained on person re-identification data. When matching detections to tracks, DeepSORT uses both motion (how far a detection lies from the Kalman-predicted position) and appearance (cosine similarity of feature vectors). This lets it re-identify a person after occlusion -- even if they've moved significantly, their appearance features still match ;-)
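To make the motion-plus-appearance idea concrete, here is a minimal sketch of a blended cost matrix. It is a simplification and not the DeepSORT implementation: DeepSORT measures motion with a Mahalanobis distance on the Kalman state, while this sketch reuses 1 - IoU for the motion term. The function names, the `lam` weight, and the 4-dimensional feature vectors are all made up for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def cosine_distance(u, v):
    """1 - cosine similarity of two feature vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v) + 1e-9
    return 1.0 - float(u @ v) / denom


def combined_cost(track_boxes, track_feats, det_boxes, det_feats, lam=0.5):
    """Blend motion cost (1 - IoU) and appearance cost (cosine distance).

    lam is a hypothetical weight: 1.0 = IoU only, 0.0 = appearance only.
    """
    cost = np.zeros((len(track_boxes), len(det_boxes)))
    for i, (t_box, t_feat) in enumerate(zip(track_boxes, track_feats)):
        for j, (d_box, d_feat) in enumerate(zip(det_boxes, det_feats)):
            motion = 1.0 - iou(t_box, d_box)
            appearance = cosine_distance(t_feat, d_feat)
            cost[i, j] = lam * motion + (1.0 - lam) * appearance
    return cost


# Toy scene: two people crossed paths while occluded, so each detection
# sits near the OTHER person's last box but carries its own appearance.
track_boxes = [[100, 100, 200, 300], [400, 100, 500, 300]]
track_feats = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
det_boxes = [[390, 100, 490, 300], [110, 100, 210, 300]]
det_feats = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0]]

for lam in (1.0, 0.5):
    cost = combined_cost(track_boxes, track_feats, det_boxes, det_feats, lam)
    rows, cols = linear_sum_assignment(cost)
    pairs = dict(zip(rows.tolist(), cols.tolist()))
    print(f"lam={lam}: track -> detection {pairs}")
```

With `lam=1.0` (position only) each track grabs the nearest box and the identities swap; with `lam=0.5` the appearance term outweighs the position term and each track follows the person who actually looks like it.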

ByteTrack (Zhang et al., 2022) takes yet another approach: instead of using appearance features, it makes better use of low-confidence detections. Most trackers throw away detections below a confidence threshold (say 0.5). ByteTrack keeps them and uses them in a second matching round. A briefly occluded person might produce a low-confidence detection that other trackers would discard but ByteTrack correctly associates with the existing track. Simple idea, big improvement -- at the time of publication ByteTrack achieved state-of-the-art results on the MOT17 benchmark without any appearance model at all.
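The two-round matching can be sketched in a few lines. This is an illustration of the idea only, under simplifying assumptions: tracks are plain boxes with no Kalman prediction, and the thresholds (`high=0.5`, `low=0.1`, `iou_thresh=0.3`) and function names are placeholders rather than ByteTrack's actual settings.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def match(track_boxes, det_boxes, iou_thresh=0.3):
    """Hungarian matching on 1 - IoU; returns (track_idx, det_idx) pairs."""
    if not track_boxes or not det_boxes:
        return []
    cost = np.array([[1.0 - iou(t, d) for d in det_boxes]
                     for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols)
            if cost[r, c] <= 1.0 - iou_thresh]


def two_stage_associate(track_boxes, detections, high=0.5, low=0.1):
    """detections: list of (box, score). Returns (matches, new_tracks)."""
    high_dets = [box for box, score in detections if score >= high]
    low_dets = [box for box, score in detections if low <= score < high]

    # Round 1: all tracks against high-confidence detections
    first = match(track_boxes, high_dets)
    matches = {t: high_dets[d] for t, d in first}

    # Round 2: leftover tracks against low-confidence detections
    leftover = [t for t in range(len(track_boxes)) if t not in matches]
    second = match([track_boxes[t] for t in leftover], low_dets)
    for k, d in second:
        matches[leftover[k]] = low_dets[d]

    # Only unmatched HIGH-confidence detections may start new tracks;
    # unmatched low-confidence ones are treated as background and dropped.
    used_high = {d for _, d in first}
    new_tracks = [b for j, b in enumerate(high_dets) if j not in used_high]
    return matches, new_tracks


tracks = [[100, 100, 200, 300], [400, 100, 500, 300]]
detections = [
    ([105, 100, 205, 300], 0.90),  # clearly visible person
    ([405, 100, 505, 300], 0.25),  # partly occluded person, low score
    ([700, 50, 760, 200], 0.15),   # spurious low-score box
]
matches, new_tracks = two_stage_associate(tracks, detections)
print("matches:", matches)
print("new tracks:", new_tracks)
```

In this toy frame the occluded person's 0.25-score box is picked up in round 2 and keeps track 1 alive, while the stray 0.15-score box matches nothing and is dropped instead of spawning a new track.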

Where pose estimation meets tracking

The most powerful applications combine both. Detect poses in each frame, track the skeletons across frames, and you get a continuous motion signal for each person:

Video frames -> Person detection -> Pose estimation -> Pose tracking -> Action analysis
               (YOLO, ep78-79)    (HRNet/MediaPipe)   (DeepSORT)      (temporal model)

Each stage builds on what we've covered: detection (episodes #78-79), the CNN architectures that power pose estimation (#45-47), the U-Net-style encoder-decoder pattern (#80), and now tracking. The temporal model at the end might be an LSTM (episode #49), a transformer (episodes #52-53), or even a simple rule-based system ("if knee angle decreases below 90 degrees for more than 0.5 seconds -> classified as sitting down").
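The rule-based option is easy to prototype. Below is a minimal sketch of the "knee angle below 90 degrees for more than 0.5 seconds" rule; the helper names and the synthetic angle sequence are assumptions for illustration, and a real system would feed in per-frame angles computed from tracked hip, knee, and ankle keypoints.

```python
import numpy as np


def joint_angle(a, b, c):
    """Angle at point b (in degrees) formed by the segments b->a and b->c."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    v1, v2 = a - b, c - b
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    if norm == 0:
        return float("nan")  # degenerate: two keypoints coincide
    cos = np.clip(np.dot(v1, v2) / norm, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)))


def detect_sitting(knee_angles, fps, max_angle=90.0, min_seconds=0.5):
    """Return the first frame index at which the knee angle has stayed
    below max_angle for more than min_seconds, or None if it never does."""
    run = 0  # consecutive frames below the threshold
    for i, angle in enumerate(knee_angles):
        run = run + 1 if angle < max_angle else 0
        if run / fps > min_seconds:
            return i
    return None


# Knee angle from three (x, y) keypoints: hip, knee, ankle
straight = joint_angle((0, 0), (0, 50), (0, 100))   # leg fully extended
bent = joint_angle((0, 0), (50, 0), (50, 50))       # thigh horizontal
print(f"straight leg: {straight:.1f} deg, bent leg: {bent:.1f} deg")

# Synthetic 30 fps sequence: standing for 10 frames, then knee bent
angles = [170.0] * 10 + [80.0] * 20
print("sitting detected at frame:", detect_sitting(angles, fps=30))

# A short dip (10 frames, about 0.33 s) should not trigger the rule
brief = [170.0] * 10 + [80.0] * 10 + [170.0] * 10
print("brief dip:", detect_sitting(brief, fps=30))
```

The duration check is what makes the rule usable on real video: a single noisy frame with a bad knee estimate resets nothing permanently and cannot trigger a classification on its own.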

The applications are genuinely diverse:

  • Sports analytics: track player movements, analyze shooting form, measure stride length and cadence. The NBA uses optical tracking to generate the detailed player statistics you see in broadcasts.
  • Physical therapy: measure range of motion over time, detect compensation patterns (favoring one side), track rehabilitation progress objectively instead of relying on patient self-reporting.
  • Sign language recognition: track hand and body keypoints to decode signs. Each sign is a temporal sequence of poses, which maps naturally to sequence models from episodes #48-53.
  • Action recognition: classify the temporal sequence of poses -- walking, running, falling, fighting. Fall detection for elderly care is a major use case.
  • Motion capture: extract 3D body motion from 2D video. Much cheaper than marker-based mocap systems (which cost $50,000+), though less precise.

Evaluation metrics for pose and tracking

Pose estimation uses OKS (Object Keypoint Similarity) instead of IoU. OKS measures how close predicted keypoints are to ground truth, normalized by the object's scale and a per-keypoint constant that accounts for how precisely each keypoint can be annotated (the nose has a precise location; the hip center is ambiguous):

import numpy as np

# COCO per-keypoint sigmas (annotation noise)
COCO_SIGMAS = np.array([
    0.026, 0.025, 0.025,  # nose, eyes
    0.035, 0.035,           # ears
    0.079, 0.079,           # shoulders
    0.072, 0.072,           # elbows
    0.062, 0.062,           # wrists
    0.107, 0.107,           # hips
    0.087, 0.087,           # knees
    0.089, 0.089,           # ankles
])


def compute_oks(pred_kps, gt_kps, gt_area):
    """Object Keypoint Similarity.

    pred_kps, gt_kps: (17, 3) -- x, y, visibility
    gt_area: object area in pixels (COCO uses the segment area;
        bounding box area is a common stand-in)
    """
    visible = gt_kps[:, 2] > 0
    if visible.sum() == 0:
        return 0.0

    dx = pred_kps[:, 0] - gt_kps[:, 0]
    dy = pred_kps[:, 1] - gt_kps[:, 1]
    dist_sq = dx ** 2 + dy ** 2

    # COCO defines k = 2 * sigma, so the denominator is 2 * area * k^2
    scale = 2 * ((2 * COCO_SIGMAS) ** 2) * gt_area
    oks_per_kp = np.exp(-dist_sq / (scale + 1e-6))

    return float(oks_per_kp[visible].mean())


# Test: perfect prediction
gt = np.random.rand(17, 3)
gt[:, 2] = 1.0  # all visible
perfect_oks = compute_oks(gt, gt, gt_area=10000)
print(f"Perfect OKS: {perfect_oks:.4f}")  # 1.0

# Test: slightly noisy prediction
noisy = gt.copy()
noisy[:, :2] += np.random.randn(17, 2) * 3
noisy_oks = compute_oks(noisy, gt, gt_area=10000)
print(f"Noisy OKS:   {noisy_oks:.4f}")

The sigmas encode domain knowledge: shoulders and hips have high sigma (0.079, 0.107) because annotators disagree more about their exact center -- the shoulder is a broad area, not a precise point. The nose has low sigma (0.026) because everyone agrees on where the nose is. This normalization means a 5-pixel error on the hip is penalized less than a 5-pixel error on the nose, which correctly reflects annotation difficulty.
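A quick numeric check makes the hip-versus-nose claim tangible. The sketch below evaluates only the per-keypoint term, using the COCO form exp(-d² / (2 · area · k²)) with k = 2σ; the 5-pixel error and the 10,000-pixel object area are arbitrary illustration values, and `keypoint_similarity` is a helper name invented here.

```python
import numpy as np


def keypoint_similarity(error_px, sigma, area):
    """Per-keypoint OKS term: exp(-d^2 / (2 * area * k^2)), with k = 2 * sigma."""
    k = 2.0 * sigma
    return float(np.exp(-(error_px ** 2) / (2.0 * area * k ** 2)))


area = 10000.0   # illustrative object area in pixels
error_px = 5.0   # same pixel error applied to both keypoints

for name, sigma in [("nose", 0.026), ("hip", 0.107)]:
    sim = keypoint_similarity(error_px, sigma, area)
    print(f"{name:>4} (sigma={sigma}): similarity = {sim:.3f}")
```

The same 5-pixel miss costs the nose roughly a third of its score while barely denting the hip, which is exactly the asymmetry the sigmas are meant to produce.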

Tracking uses MOTA (Multiple Object Tracking Accuracy) and IDF1. MOTA penalizes missed detections, false positives, and ID switches. IDF1 measures how well the tracker maintains consistent identities -- it's the harmonic mean of identification precision and identification recall across the whole video. A tracker that detects everyone but keeps swapping IDs will have decent MOTA but poor IDF1.
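Both metrics reduce to simple ratios once the per-sequence counts are known. The sketch below uses the standard definitions, MOTA = 1 - (FN + FP + IDSW) / GT and IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN); the counts themselves are invented to illustrate the "decent MOTA, poor IDF1" case, and computing them from real tracks (which requires matching predictions to ground truth) is left out.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN): harmonic mean of
    identification precision and identification recall."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


# Hypothetical tracker: finds almost every box but keeps swapping IDs.
# 1000 ground-truth boxes over the whole video.
print(f"MOTA: {mota(false_negatives=20, false_positives=30, id_switches=50, num_gt=1000):.2f}")

# Under the best global identity mapping only half of the boxes carry
# the right ID (made-up counts).
print(f"IDF1: {idf1(idtp=500, idfp=500, idfn=500):.2f}")
```

Note that MOTA can go negative when the error count exceeds the number of ground-truth boxes, whereas IDF1 always stays between 0 and 1.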

In summary

  • Pose estimation detects body keypoints (joints) in images, outputting (x, y, confidence) per keypoint plus a skeleton structure -- going beyond detection (where) and segmentation (which pixels) to understand body structure;
  • heatmap-based prediction is the standard approach: for each keypoint, the model outputs a 2D heatmap (trained against a Gaussian centered on the true location) and the predicted coordinate is the argmax -- much easier to learn than direct coordinate regression;
  • top-down approaches detect people first then estimate pose per crop (more accurate, runtime scales with number of people); bottom-up detects all keypoints first then groups them (faster for crowds, harder grouping);
  • HRNet maintains high-resolution representations throughout the network with parallel multi-scale branches and repeated fusion -- avoiding the information loss from standard downsample-then-upsample architectures;
  • MediaPipe runs in real time on CPU and mobile (33 body + 21 per-hand + 468 face keypoints); MMPose/HRNet offers higher accuracy for offline processing;
  • SORT tracks objects using Kalman filters + Hungarian matching; DeepSORT adds appearance features for re-identification after occlusion; ByteTrack recovers low-confidence detections instead of discarding them;
  • OKS (Object Keypoint Similarity) evaluates pose accuracy with per-keypoint normalization based on annotation difficulty; MOTA and IDF1 evaluate tracking quality;
  • combining pose estimation with tracking enables action recognition, sports analytics, motion capture, physical therapy, and sign language recognition -- any application that needs to understand human movement over time.

With pose estimation and tracking, we've now covered the full spectrum of visual understanding: what's in the image (classification), where it is (detection), which pixels belong to it (segmentation), and how the body is structured and moving (pose + tracking). There's another dimension of visual perception we haven't explored yet though -- understanding text and characters within images. Machines reading signage, extracting text from documents, recognizing handwriting. That's a distinct set of techniques with its own architectures and challenges.

Exercises

Exercise 1: Build a pose analysis toolkit. Create a class PoseAnalyzer that: (a) takes a (17, 3) COCO-format keypoint array and computes joint angles using the arctangent of limb vectors -- specifically left elbow angle (shoulder-elbow-wrist), right elbow angle, left knee angle (hip-knee-ankle), and right knee angle, (b) implements a classify_pose method that uses simple angle thresholds to classify the pose as "standing" (both knee angles > 160), "sitting" (at least one knee angle < 120), "arms_raised" (both wrists above shoulders), or "unknown", (c) implements a symmetry_score method that compares left vs right side joint angles and returns a 0-1 score where 1.0 means perfectly symmetric, (d) test with three synthetic keypoint arrays representing standing, sitting, and arms-raised poses, verifying correct classification for each.

Exercise 2: Implement a multi-object tracker with ID management. Create a class MultiObjectTracker that: (a) maintains a dictionary of active tracks, each with an ID, current bounding box, velocity estimate (simple: difference between last two positions), age (frames since the last successful match), and hits (number of successful matches), (b) on each update(detections) call, predicts new positions using the velocity estimate, builds an IoU cost matrix between predictions and detections, solves the assignment with scipy.optimize.linear_sum_assignment, updates matched tracks, creates new tracks for unmatched detections, and increments age for unmatched tracks, (c) implements track lifecycle: a track is "tentative" until it accumulates 3 hits, then "confirmed"; a track is deleted after 5 frames without a match, (d) generates a synthetic 20-frame video sequence with 3 objects moving in straight lines (one disappears for 3 frames then reappears) and prints per-frame track states. Verify that confirmed tracks maintain consistent IDs, and check what happens to the reappearing object: after only 3 missed frames its old track is still alive (deletion takes 5), so it should keep its ID as long as the predicted box still overlaps the new detection. Lengthen the gap beyond 5 frames to see it get a new ID once the old track has been deleted.

Exercise 3: Build an OKS-based pose evaluation framework. Create a class PoseEvaluator that: (a) implements compute_oks(pred_kps, gt_kps, gt_area) using the full COCO sigma values (provided: nose=0.026, eyes=0.025, ears=0.035, shoulders=0.079, elbows=0.072, wrists=0.062, hips=0.107, knees=0.087, ankles=0.089), (b) implements compute_ap(predictions, ground_truths, oks_threshold) that sorts predictions by score, matches to ground truths using OKS >= threshold, computes precision-recall, and returns AP (area under the PR curve), (c) generates a synthetic evaluation set: 10 "images" each with 1-3 ground truth poses, plus three detector profiles -- "perfect" (exact GT keypoints, score=1.0), "good" (GT + 3px Gaussian noise, random scores 0.7-0.99), and "poor" (GT + 15px noise, 40% of poses missed, random false positives), (d) prints a comparison table showing AP at OKS thresholds 0.50, 0.75, and 0.50:0.05:0.95 (COCO-style AP) for each detector. Verify that the perfect detector achieves AP=1.0 and that the gap between "good" and "poor" is larger at OKS=0.75 than at OKS=0.50.

Regards!

@scipio


