Learn AI Series (#79) - Object Detection (Part 2) - Modern Approaches

What will I learn

  • You will learn how YOLO reframed detection as a single regression problem and why that was revolutionary;
  • the YOLO evolution from v1 through v8: anchor boxes, multi-scale prediction, and the move to anchor-free;
  • SSD and multi-scale single-shot detection from different feature map levels;
  • anchor-free detectors: FCOS and CenterNet -- simpler architectures with fewer hyperparameters;
  • training a custom object detector on your own dataset with transfer learning;
  • evaluation metrics: mAP, precision-recall curves, and what mAP50-95 actually measures.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#79) - Object Detection (Part 2) - Modern Approaches

Solutions to Episode #78 Exercises

Exercise 1: Detection dataset simulator.

import numpy as np
import cv2
import random


class DetectionDataset:
    """Generate synthetic detection datasets."""

    def __init__(self, num_images=50, img_size=300, seed=42):
        self.num_images = num_images
        self.img_size = img_size
        self.rng = random.Random(seed)
        self.np_rng = np.random.RandomState(seed)
        self.classes = {"red": (0, 0, 255),
                        "green": (0, 255, 0),
                        "blue": (255, 0, 0)}  # BGR
        self.data = self._generate()

    def _generate(self):
        dataset = []
        for _ in range(self.num_images):
            img = np.zeros(
                (self.img_size, self.img_size, 3),
                dtype=np.uint8
            )
            num_objects = self.rng.randint(1, 5)
            annotations = []
            placed_boxes = []

            for _ in range(num_objects):
                cls_name = self.rng.choice(
                    list(self.classes.keys())
                )
                color = self.classes[cls_name]

                for attempt in range(50):
                    w = self.rng.randint(30, 100)
                    h = self.rng.randint(30, 100)
                    x1 = self.rng.randint(
                        0, self.img_size - w
                    )
                    y1 = self.rng.randint(
                        0, self.img_size - h
                    )
                    x2, y2 = x1 + w, y1 + h

                    overlap = False
                    for bx1, by1, bx2, by2 in placed_boxes:
                        if not (x2 <= bx1 or x1 >= bx2
                                or y2 <= by1 or y1 >= by2):
                            overlap = True
                            break

                    if not overlap:
                        img[y1:y2, x1:x2] = color
                        annotations.append({
                            "class": cls_name,
                            "box": [x1, y1, x2, y2],
                        })
                        placed_boxes.append(
                            (x1, y1, x2, y2)
                        )
                        break

            dataset.append((img, annotations))
        return dataset

    def get_sample(self, index):
        return self.data[index]

    def visualize(self, index, save_path="det_vis.png"):
        img, anns = self.data[index]
        vis = img.copy()
        for ann in anns:
            x1, y1, x2, y2 = ann["box"]
            cv2.rectangle(vis, (x1, y1), (x2, y2),
                          (255, 255, 255), 2)
            cv2.putText(vis, ann["class"], (x1, y1 - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5,
                        (255, 255, 255), 1)
        cv2.imwrite(save_path, vis)
        return save_path

    def statistics(self):
        total_objects = 0
        class_counts = {c: 0 for c in self.classes}
        total_area = 0

        for _, anns in self.data:
            total_objects += len(anns)
            for ann in anns:
                class_counts[ann["class"]] += 1
                x1, y1, x2, y2 = ann["box"]
                total_area += (x2 - x1) * (y2 - y1)

        img_area = self.img_size ** 2
        avg_objs = total_objects / len(self.data)
        avg_pct = (total_area / total_objects
                   / img_area * 100)

        print(f"Total images:    {len(self.data)}")
        print(f"Avg objects/img: {avg_objs:.1f}")
        print(f"Class distribution:")
        for cls, cnt in class_counts.items():
            print(f"  {cls}: {cnt}")
        print(f"Avg object area: {avg_pct:.1f}% of image")


ds = DetectionDataset(num_images=50)
ds.statistics()

The non-overlap constraint is the interesting part. In real datasets, objects overlap constantly (people in a crowd, cars on a highway). Our simulator avoids it for simplicity, but that actually makes detection easier than real-world scenarios. If you extend this simulator to allow overlap, you'll immediately see why NMS (which we covered last episode) becomes so critical -- overlapping ground truth objects generate ambiguous training signals.

Exercise 2: NMS benchmarking suite.

import numpy as np
import random


def compute_iou(box1, box2):
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    a1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    a2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = a1 + a2 - inter
    return inter / max(union, 1e-6)


class NMSBenchmark:
    def __init__(self, seed=42):
        self.rng = random.Random(seed)

    def _make_scenario(self, mode):
        boxes, scores, gt = [], [], []
        if mode == "easy":
            for i in range(5):
                x = 50 + i * 120
                gt.append([x, 50, x + 80, 130])
                boxes.append([x + 2, 48, x + 82, 132])
                scores.append(0.9 - i * 0.05)
        elif mode == "moderate":
            for i in range(5):
                x = 30 + i * 100
                gt.append([x, 30, x + 70, 110])
                for j in range(3):
                    dx = self.rng.randint(-10, 10)
                    dy = self.rng.randint(-10, 10)
                    boxes.append([x + dx, 30 + dy,
                                  x + 70 + dx, 110 + dy])
                    scores.append(
                        0.9 - j * 0.15
                        + self.rng.uniform(-0.05, 0.05)
                    )
        else:  # hard
            positions = [
                (30, 30), (60, 40), (200, 30),
                (210, 50), (350, 100)
            ]
            for px, py in positions:
                gt.append([px, py, px + 60, py + 80])
            for gx1, gy1, gx2, gy2 in gt:
                for _ in range(10):
                    dx = self.rng.randint(-15, 15)
                    dy = self.rng.randint(-15, 15)
                    boxes.append([gx1 + dx, gy1 + dy,
                                  gx2 + dx, gy2 + dy])
                    scores.append(
                        self.rng.uniform(0.3, 0.95)
                    )
        return boxes, scores, gt

    def standard_nms(self, boxes, scores, thresh=0.5):
        order = sorted(range(len(scores)),
                       key=lambda i: scores[i],
                       reverse=True)
        keep = []
        while order:
            best = order[0]
            keep.append(best)
            order = [
                i for i in order[1:]
                if compute_iou(boxes[best],
                               boxes[i]) < thresh
            ]
        return keep

    def soft_nms(self, boxes, scores, sigma=0.5,
                 thresh=0.01):
        # Work on original indices so the returned list can
        # index straight into the caller's box list.
        scores = list(scores)
        remaining = list(range(len(scores)))
        kept = []
        while remaining:
            best = max(remaining, key=lambda i: scores[i])
            kept.append(best)
            remaining.remove(best)
            for i in remaining:
                iou = compute_iou(boxes[best], boxes[i])
                # Gaussian decay instead of hard removal
                scores[i] *= np.exp(
                    -(iou ** 2) / sigma
                )
            # Drop boxes whose score decayed below thresh
            remaining = [
                i for i in remaining
                if scores[i] >= thresh
            ]
        return kept

    def _detection_rate(self, kept_boxes, gt_boxes):
        detected = 0
        for g in gt_boxes:
            for k in kept_boxes:
                if compute_iou(k, g) >= 0.5:
                    detected += 1
                    break
        return detected / max(len(gt_boxes), 1)

    def run(self):
        print(f"{'Scenario':<10} {'Method':<10} "
              f"{'Kept':>5} {'DetRate':>8} {'FP':>5}")
        print("-" * 42)
        for mode in ["easy", "moderate", "hard"]:
            boxes, scores, gt = self._make_scenario(mode)
            for name, fn in [
                ("NMS", self.standard_nms),
                ("SoftNMS", self.soft_nms)
            ]:
                kept_idx = fn(list(boxes), list(scores))
                kept_b = [boxes[i] for i in kept_idx]
                dr = self._detection_rate(kept_b, gt)
                matched = sum(
                    1 for k in kept_b
                    if any(compute_iou(k, g) >= 0.5
                           for g in gt)
                )
                fp = len(kept_b) - matched
                print(f"{mode:<10} {name:<10} "
                      f"{len(kept_b):>5} {dr:>8.1%} "
                      f"{fp:>5}")


bench = NMSBenchmark()
bench.run()

In the "hard" scenario with closely-placed objects, Soft-NMS should retain more correct detections because it decays scores gradually instead of hard-deleting overlapping boxes. The two objects at positions (30,30) and (60,40) have substantial overlap -- standard NMS will likely suppress one of them, while Soft-NMS keeps both (with a reduced score for the secondary one).
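
To see the difference in isolation, here is a minimal sketch that compares how each method treats a neighbor box at a given IoU with the current best box. The helper names `hard_nms_weight` and `soft_nms_weight` are made up for this illustration; the formulas mirror the `thresh=0.5` and `sigma=0.5` defaults used in the benchmark above.

```python
import math


def hard_nms_weight(iou, thresh=0.5):
    """Standard NMS: a neighbor keeps its score or is removed outright."""
    return 1.0 if iou < thresh else 0.0


def soft_nms_weight(iou, sigma=0.5):
    """Gaussian Soft-NMS: the score is scaled by exp(-iou^2 / sigma)."""
    return math.exp(-(iou ** 2) / sigma)


for iou in (0.0, 0.3, 0.5, 0.7, 0.9):
    print(f"IoU {iou:.1f}: "
          f"hard x{hard_nms_weight(iou):.1f}, "
          f"soft x{soft_nms_weight(iou):.3f}")
```

Note that with these defaults even a perfectly overlapping box only gets multiplied by exp(-2), so a single decay step rarely pushes a score below the 0.01 cutoff. Soft-NMS trades fewer missed objects for more surviving boxes, and you would normally apply a final score threshold afterwards.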

Exercise 3: Simplified Faster R-CNN forward pass simulator.

import numpy as np
import random


def compute_iou(box1, box2):
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    a1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    a2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = a1 + a2 - inter
    return inter / max(union, 1e-6)


class SimplifiedFasterRCNN:
    def __init__(self, grid=7, num_classes=5, seed=42):
        self.grid = grid
        self.num_classes = num_classes
        self.rng = np.random.RandomState(seed)
        self.scales = [32, 64, 128]
        self.ratios = [0.5, 1.0, 2.0]

    def generate_anchors(self):
        anchors = []
        cell_size = 448 / self.grid  # assume 448px image
        for row in range(self.grid):
            for col in range(self.grid):
                cx = (col + 0.5) * cell_size
                cy = (row + 0.5) * cell_size
                for s in self.scales:
                    for r in self.ratios:
                        w = s * np.sqrt(r)
                        h = s / np.sqrt(r)
                        anchors.append([
                            cx - w / 2, cy - h / 2,
                            cx + w / 2, cy + h / 2
                        ])
        return anchors

    def rpn_scores(self, anchors):
        return self.rng.uniform(0, 1, len(anchors))

    def rpn_offsets(self, anchors):
        return self.rng.uniform(-5, 5,
                                (len(anchors), 4))

    def apply_offsets(self, anchors, offsets):
        proposals = []
        for a, o in zip(anchors, offsets):
            proposals.append([
                a[0] + o[0], a[1] + o[1],
                a[2] + o[2], a[3] + o[3]
            ])
        return proposals

    def top_k(self, proposals, scores, k=300):
        order = np.argsort(scores)[::-1][:k]
        return ([proposals[i] for i in order],
                [scores[i] for i in order])

    def nms(self, boxes, scores, thresh=0.7):
        order = sorted(range(len(scores)),
                       key=lambda i: scores[i],
                       reverse=True)
        keep = []
        while order:
            best = order[0]
            keep.append(best)
            order = [
                i for i in order[1:]
                if compute_iou(boxes[best],
                               boxes[i]) < thresh
            ]
        return keep

    def forward(self):
        # Stage 1: generate anchors
        anchors = self.generate_anchors()
        print(f"Stage 1 - Anchors: {len(anchors)}")

        # Stage 2: RPN scoring + offset
        obj_scores = self.rpn_scores(anchors)
        offsets = self.rpn_offsets(anchors)
        proposals = self.apply_offsets(anchors, offsets)

        # Stage 3: top-K proposals
        proposals, scores = self.top_k(
            proposals, obj_scores, k=300
        )
        print(f"Stage 2 - After top-K: {len(proposals)}")

        # Stage 4: NMS on proposals
        kept = self.nms(proposals, scores, thresh=0.7)
        proposals = [proposals[i] for i in kept]
        print(f"Stage 3 - After NMS: {len(proposals)}")

        # Stage 5: detection head
        detections = []
        for p in proposals:
            cls = self.rng.randint(0, self.num_classes)
            conf = self.rng.uniform(0.1, 1.0)
            detections.append({
                "box": p, "class": cls,
                "score": conf
            })

        # Stage 6: per-class NMS
        final = []
        for c in range(self.num_classes):
            class_dets = [
                d for d in detections if d["class"] == c
            ]
            if not class_dets:
                continue
            boxes_c = [d["box"] for d in class_dets]
            scores_c = [d["score"] for d in class_dets]
            kept_c = self.nms(boxes_c, scores_c,
                              thresh=0.3)
            for i in kept_c:
                final.append(class_dets[i])

        print(f"Stage 4 - Final detections: {len(final)}")
        return final


model = SimplifiedFasterRCNN()
results = model.forward()

The key takeaway is that each stage acts as a filter: 441 anchors (7 x 7 positions x 9 shapes) -> 300 top-K proposals -> a smaller set after NMS -> final detections after per-class NMS. The RPN is basically saying "there might be an object here" for 300 of the 441 anchors, then NMS suppresses the lower-scoring box in each heavily overlapping pair, and the detection head assigns classes and confidence scores. Because the scores and offsets here are random, the exact counts after NMS will vary with the seed. Real Faster R-CNN uses learned weights instead of random scores, but the data flow pattern is identical.
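
The 441 figure is easy to verify by hand. The short sketch below redoes the arithmetic with the same grid, scales and ratios as the simulator, and also checks a property of the `w = s * sqrt(r)`, `h = s / sqrt(r)` parameterization: the area stays at s^2 while the aspect ratio w/h equals r.

```python
import math

grid = 7
scales = [32, 64, 128]
ratios = [0.5, 1.0, 2.0]

anchors_per_cell = len(scales) * len(ratios)
total_anchors = grid * grid * anchors_per_cell
print(f"{grid}x{grid} cells x {anchors_per_cell} shapes "
      f"= {total_anchors} anchors")

# Collect (scale, ratio, width, height) for every anchor shape
shapes = []
for s in scales:
    for r in ratios:
        w = s * math.sqrt(r)
        h = s / math.sqrt(r)
        shapes.append((s, r, w, h))
        print(f"scale {s:>3}, ratio {r}: "
              f"{w:6.1f} x {h:6.1f} (area {w * h:.0f})")
```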

On to today's episode

Here we go! Last episode we traced the evolution of two-stage detectors from the brute-force sliding window all the way through to Faster R-CNN. We built IoU from scratch, implemented NMS, walked through the R-CNN family's systematic bottleneck removal, and ended up at ~5 FPS on a 2015-era GPU. Good, but not real-time.

The two-stage pipeline -- first propose regions, then classify them -- is inherently limited in speed. You're doing two separate jobs sequentially. In 2015, a paper came along with a title that said it all: "You Only Look Once." What if you could skip the proposal stage entirely and predict everything in a single forward pass? That's exactly what YOLO did, and it changed the entire field ;-)

YOLO: detection as regression

YOLO (Redmon et al., 2015) reframed detection as a single regression problem. Instead of the propose-then-classify pipeline, YOLO divides the image into an S x S grid (typically 7x7) and predicts bounding boxes and class probabilities directly from each grid cell in one forward pass through the network.

Each grid cell predicts B bounding boxes (each with 5 values: x, y, w, h, confidence) and C class probabilities. The output tensor has shape S x S x (B*5 + C). For PASCAL VOC with 20 classes, S=7, B=2, that's 7 x 7 x 30 -- a single tensor that encodes every detection in the image.

import torch
import torch.nn as nn


class SimpleYOLOHead(nn.Module):
    """Simplified YOLO detection head."""

    def __init__(self, in_channels, grid_size=7,
                 num_boxes=2, num_classes=20):
        super().__init__()
        self.S = grid_size
        self.B = num_boxes
        self.C = num_classes

        # Each cell: B boxes (x,y,w,h,conf) + C class probs
        out_features = (self.S * self.S
                        * (self.B * 5 + self.C))
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * (grid_size ** 2),
                      4096),
            nn.LeakyReLU(0.1),
            nn.Dropout(0.5),
            nn.Linear(4096, out_features),
        )

    def forward(self, x):
        return self.fc(x).view(
            -1, self.S, self.S, self.B * 5 + self.C
        )


# For PASCAL VOC: 7x7 grid, 2 boxes, 20 classes
head = SimpleYOLOHead(512, grid_size=7,
                       num_boxes=2, num_classes=20)
fake_features = torch.randn(1, 512, 7, 7)
output = head(fake_features)
print(f"Output shape: {output.shape}")
# torch.Size([1, 7, 7, 30])
# 30 = 2 boxes * 5 values + 20 classes

The box coordinates (x, y) are relative to the grid cell, and (w, h) are relative to the whole image. The confidence score represents both the probability that a box contains an object AND how good the box actually is: P(object) * IoU(predicted, truth).
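
As a concrete illustration of that encoding, the sketch below turns one cell's raw (x, y, w, h) prediction into pixel corner coordinates. The function name `decode_yolo_v1_box` is hypothetical, and the 448-pixel input size is an assumption carried over from the exercise above; this only shows the coordinate convention, not a full decoder.

```python
def decode_yolo_v1_box(row, col, x, y, w, h,
                       grid_size=7, img_w=448, img_h=448):
    """Convert a YOLOv1-style cell prediction to [x1, y1, x2, y2].

    x, y: box center as a 0-1 offset inside cell (row, col)
    w, h: box size as a 0-1 fraction of the whole image
    """
    cell_w = img_w / grid_size
    cell_h = img_h / grid_size
    # Center: cell origin plus the offset, scaled to pixels
    cx = (col + x) * cell_w
    cy = (row + y) * cell_h
    # Size: fraction of the full image
    bw = w * img_w
    bh = h * img_h
    return [cx - bw / 2, cy - bh / 2,
            cx + bw / 2, cy + bh / 2]


# Center cell of a 7x7 grid, box centered in that cell,
# a quarter of the image wide and half the image tall
box = decode_yolo_v1_box(row=3, col=3,
                         x=0.5, y=0.5, w=0.25, h=0.5)
print(box)  # [168.0, 112.0, 280.0, 336.0]
```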

YOLO's speed was revolutionary: 45 FPS on a GPU, compared to Faster R-CNN's ~5 FPS. The tradeoff? Lower accuracy, especially on small objects and objects that appear in groups. The coarse 7x7 grid means each cell can only predict a limited number of objects. Two small birds sitting in the same grid cell? YOLO v1 can only detect one of them.

Having said that, for most practical applications (security cameras, robotics, autonomous driving), that speed advantage was worth more than marginal accuracy improvements. A detector that runs at 5 FPS can't power a real-time system. One that runs at 45+ FPS absolutely can.

The YOLO evolution

YOLOv1 had clear limitations, and subsequent versions addressed them systematically. This progression is worth understanding because it shows how the field iterated toward the modern detectors we use today.

YOLOv2 (also called YOLO9000) added batch normalization everywhere, borrowed anchor boxes from Faster R-CNN (instead of free-form box prediction, start with predefined shapes and learn to refine them), introduced multi-scale training (randomly resize the input during training to handle different object sizes), and used a better backbone called Darknet-19.

YOLOv3 brought multi-scale detection: predictions at three different feature map scales (like a Feature Pyramid Network). This was a big deal for small objects -- the fine-grained feature map catches small things that the coarse grid missed. It also switched to Darknet-53 with residual connections (sound familiar from episode #46?).

YOLOv5 (Ultralytics) was a PyTorch reimplementation that prioritized usability over novelty. Easy training, ONNX/TensorRT export, excellent documentation, pip-installable. It became the de facto standard for anyone who needed to actually ship a detection system rather than publish a paper.

YOLOv8 (also Ultralytics, 2023) is the current practical standard. It's anchor-free (no predefined box shapes -- full circle back to the original YOLO philosophy of simplicity), uses a decoupled detection head (separate branches for classification and localization), and supports detection, segmentation, pose estimation, and classification all in one unified framework.

from ultralytics import YOLO

# Load pretrained model (COCO, 80 classes)
model = YOLO("yolov8n.pt")  # nano: fast, smaller

# Inference on an image
results = model("street_photo.jpg")

# Process results
for result in results:
    boxes = result.boxes
    for box in boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        confidence = box.conf[0].item()
        class_id = int(box.cls[0].item())
        class_name = model.names[class_id]
        print(f"{class_name}: {confidence:.2f} "
              f"at [{x1:.0f},{y1:.0f},"
              f"{x2:.0f},{y2:.0f}]")

# Visualize with boxes drawn on the image
results[0].save(filename="output_detections.jpg")

A handful of lines of code to go from an image to detected objects with bounding boxes, confidence scores, and class labels. That's the power of modern frameworks built on top of a decade of architectural iteration ;-)

The YOLO model sizes range from nano (3.2M parameters, ~640 FPS on GPU) through small, medium, large, to extra-large (68.2M parameters, highest accuracy). Pick based on your deployment constraint: edge device with limited compute? Use nano. Server with a beefy GPU? Use large or extra-large.

SSD: multi-scale single-shot detection

SSD (Single Shot MultiBox Detector, Liu et al., 2016) was published around the same time as YOLO and took a different approach to single-shot detection. Instead of predicting from a single grid, SSD predicts from multiple feature maps at different scales.

Input Image (300x300)
    |
VGG-16 Backbone
    |
Feature Map 38x38 -> predictions (small objects)
    |
Feature Map 19x19 -> predictions (medium objects)
    |
Feature Map 10x10 -> predictions
    |
Feature Map  5x5  -> predictions
    |
Feature Map  3x3  -> predictions (large objects)
    |
Feature Map  1x1  -> predictions (very large objects)

The insight is elegant: small feature maps (3x3) have large receptive fields and detect large objects. Large feature maps (38x38) have small receptive fields and detect small objects. By predicting from all levels simultaneously, SSD handles objects of varying sizes much better than original YOLO.

At each spatial position in each feature map, SSD predicts: class scores for each of several default boxes (anchor boxes) with different aspect ratios, and offset adjustments to refine those default boxes.

import torch
import torch.nn as nn


class SSDPredictionHead(nn.Module):
    """Prediction head for one SSD feature level."""

    def __init__(self, in_channels, num_anchors,
                 num_classes):
        super().__init__()
        # Classification: anchors * classes per position
        self.cls = nn.Conv2d(
            in_channels, num_anchors * num_classes,
            3, padding=1
        )
        # Localization: anchors * 4 coords per position
        self.loc = nn.Conv2d(
            in_channels, num_anchors * 4,
            3, padding=1
        )

    def forward(self, feature_map):
        cls_pred = self.cls(feature_map)
        loc_pred = self.loc(feature_map)
        return cls_pred, loc_pred


# Each feature level gets its own prediction head
# with the appropriate number of input channels
heads = nn.ModuleList([
    SSDPredictionHead(512, num_anchors=4,
                       num_classes=21),  # 38x38
    SSDPredictionHead(1024, num_anchors=6,
                       num_classes=21),  # 19x19
    SSDPredictionHead(512, num_anchors=6,
                       num_classes=21),  # 10x10
    SSDPredictionHead(256, num_anchors=6,
                       num_classes=21),  #  5x5
    SSDPredictionHead(256, num_anchors=4,
                       num_classes=21),  #  3x3
    SSDPredictionHead(256, num_anchors=4,
                       num_classes=21),  #  1x1
])

# Total anchors: 38*38*4 + 19*19*6 + 10*10*6
#   + 5*5*6 + 3*3*4 + 1*1*4 = 8732
total_anchors = (38*38*4 + 19*19*6 + 10*10*6
                 + 5*5*6 + 3*3*4 + 1*1*4)
print(f"Total anchor boxes: {total_anchors}")

8,732 anchor boxes total across all feature levels. That sounds like a lot, but the vast majority get classified as "background" very quickly. The network learns to only activate the anchors that actually overlap with objects.
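
To get a feel for how lopsided that background-to-object ratio is, here is a rough sketch that lays out one square default box per position on the 38x38 level and counts how many reach IoU >= 0.5 with a single ground truth box (the matching threshold used in the SSD paper). The 30-pixel box size, the ground truth coordinates and the helper names are assumptions for illustration; real SSD uses several aspect ratios per position.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / max(area_a + area_b - inter, 1e-6)


def default_boxes(grid, img_size, box_size):
    """One square default box centered on every grid position."""
    step = img_size / grid
    half = box_size / 2
    boxes = []
    for row in range(grid):
        for col in range(grid):
            cx = (col + 0.5) * step
            cy = (row + 0.5) * step
            boxes.append([cx - half, cy - half,
                          cx + half, cy + half])
    return boxes


boxes = default_boxes(grid=38, img_size=300, box_size=30)
gt = [100, 100, 140, 140]  # one hypothetical 40x40 object
positives = [b for b in boxes if iou(b, gt) >= 0.5]
print(f"{len(positives)} of {len(boxes)} default boxes "
      f"match the object; the rest are background")
```

Only the few boxes sitting almost directly on top of the object clear the threshold. Everything else is a negative example, which is why single-shot detectors need some way to deal with that imbalance during training.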

The multi-scale prediction idea from SSD became foundational. Feature Pyramid Networks (FPN, Lin et al., 2017) formalized this approach by adding top-down connections that pass semantic information from deeper layers back to shallower layers. Virtually every modern detector -- including recent YOLO versions -- uses FPN or something similar. The principle that different scales need different feature maps has become one of those things nobody questions anymore.

Anchor-free detection: simpler is better

Anchor-based detectors (Faster R-CNN, SSD, YOLO v2-v5) all require careful anchor design. What sizes? What aspect ratios? How many per position? Get the anchors wrong and performance suffers. There's even an entire research subfield about "anchor optimization" -- which is a sign that maybe the whole anchor concept is more trouble than it's worth.

FCOS (Fully Convolutional One-Stage, Tian et al., 2019) eliminates anchors entirely. For each position in the feature map, it directly predicts: the distances to the four sides of the bounding box (left, top, right, bottom), a classification score, and a centerness score that downweights predictions far from object centers.

import torch
import torch.nn as nn


class FCOSHead(nn.Module):
    """Anchor-free detection: predict distances
    to box edges from every feature position."""

    def __init__(self, in_channels, num_classes):
        super().__init__()
        # Classification branch
        self.cls_conv = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1),
            nn.GroupNorm(32, 256),
            nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1),
            nn.GroupNorm(32, 256),
            nn.ReLU(),
        )
        self.cls_score = nn.Conv2d(
            256, num_classes, 3, padding=1
        )

        # Regression branch
        self.reg_conv = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1),
            nn.GroupNorm(32, 256),
            nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1),
            nn.GroupNorm(32, 256),
            nn.ReLU(),
        )
        # 4 distances: left, top, right, bottom
        self.reg_pred = nn.Conv2d(256, 4, 3, padding=1)
        self.centerness = nn.Conv2d(256, 1, 3, padding=1)

    def forward(self, feature_map):
        cls_feat = self.cls_conv(feature_map)
        reg_feat = self.reg_conv(feature_map)

        cls_score = self.cls_score(cls_feat)
        # exp() ensures distances are positive
        reg_pred = torch.exp(self.reg_pred(reg_feat))
        centerness = self.centerness(reg_feat)

        return cls_score, reg_pred, centerness


# Test with a 32x32 feature map, 80 COCO classes
head = FCOSHead(256, num_classes=80)
feat = torch.randn(1, 256, 32, 32)
cls_out, reg_out, center_out = head(feat)
print(f"Classification: {cls_out.shape}")
# (1, 80, 32, 32) -- per-position class scores
print(f"Regression:     {reg_out.shape}")
# (1, 4, 32, 32) -- l,t,r,b distances
print(f"Centerness:     {center_out.shape}")
# (1, 1, 32, 32) -- how close to object center

The centerness trick is clever. Without it, positions at the edge of an object produce poor boxes (because the distances to opposite sides are very unequal). The centerness score is defined as sqrt(min(l,r)/max(l,r) * min(t,b)/max(t,b)) -- it's 1.0 at the exact center of an object and approaches 0 at the edges. Multiplying the classification score by centerness during inference naturally suppresses low-quality detections from edge positions.
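
The formula is easy to try out by hand. The sketch below computes the (l, t, r, b) distances from a few points inside a hypothetical 100x60 box and evaluates the centerness for each; the helper names are made up for this example.

```python
import math


def ltrb_from_point(px, py, box):
    """Distances from a point to the left, top, right, bottom sides."""
    x1, y1, x2, y2 = box
    return px - x1, py - y1, x2 - px, y2 - py


def centerness(l, t, r, b):
    """FCOS centerness: 1.0 at the box center, near 0 at the edges."""
    return math.sqrt((min(l, r) / max(l, r))
                     * (min(t, b) / max(t, b)))


box = [0, 0, 100, 60]
for px, py in [(50, 30), (25, 30), (5, 30), (95, 55)]:
    l, t, r, b = ltrb_from_point(px, py, box)
    print(f"point ({px:>2},{py:>2}): "
          f"centerness {centerness(l, t, r, b):.3f}")
```

The value drops off quickly as the point moves toward a side, so edge positions need a much higher classification score to survive ranking.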

CenterNet (Zhou et al., 2019) takes the simplification even further. It detects objects as center points. The model produces a heatmap where each peak represents an object center, then predicts width and height at each peak location. No anchors, no NMS needed (peaks in the heatmap naturally separate because each object produces exactly one peak).

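import torch


def centernet_decode(heatmap, wh_pred, top_k=100):
    """Extract detections from CenterNet outputs.

    heatmap: (1, C, H, W) -- per-class center heatmaps
    wh_pred: (1, 2, H, W) -- width,height at each pos
    Boxes are returned in feature-map coordinates.
    """
    batch, num_classes, h, w = heatmap.shape
    # topk() fails if k exceeds the number of candidates
    top_k = min(top_k, num_classes * h * w)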

    # Find local maxima (peaks) in the heatmap
    # using max-pooling with kernel 3
    pooled = torch.nn.functional.max_pool2d(
        heatmap, 3, stride=1, padding=1
    )
    # A position is a peak if it equals the pooled value
    peaks = (heatmap == pooled).float() * heatmap

    # Get top-K peaks across all classes
    flat = peaks.view(batch, -1)
    top_scores, top_indices = flat.topk(top_k)

    # Convert flat indices back to (class, y, x)
    top_classes = top_indices // (h * w)
    positions = top_indices % (h * w)
    top_y = positions // w
    top_x = positions % w

    # Look up width/height at each peak position
    detections = []
    for i in range(top_k):
        score = top_scores[0, i].item()
        if score < 0.3:
            break
        cx = top_x[0, i].item()
        cy = top_y[0, i].item()
        cls = top_classes[0, i].item()
        width = wh_pred[0, 0, cy, cx].item()
        height = wh_pred[0, 1, cy, cx].item()
        detections.append({
            "class": cls,
            "score": score,
            "box": [cx - width / 2, cy - height / 2,
                    cx + width / 2, cy + height / 2],
        })

    return detections

The trend in detection architectures is unmistakable: simpler is better. Fewer hyperparameters, fewer special components, more straightforward training. YOLOv8 adopted anchor-free detection, and DETR (Detection Transformer) formulates detection as a set prediction problem. The whole field is moving away from the complex multi-stage pipelines that dominated five years ago.

Training a custom detector

All these architectures are great for understanding the concepts, but the real power of modern detection is that you can train on your own objects with surprisingly little data and effort. Want to detect specific products on store shelves? Defects on a manufacturing line? Species of birds in a forest? The workflow is the same.

Step 1: annotate your data. You need images with bounding box annotations. Each box has coordinates and a class label. Tools like Label Studio, CVAT, or Roboflow provide annotation interfaces. For YOLO format, each image gets a text file with one line per object:

# class_id center_x center_y width height
# all values normalized to 0-1 relative to image size
0 0.45 0.38 0.12 0.25
2 0.73 0.62 0.08 0.15
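
If your annotation tool exports pixel corner coordinates instead, the conversion is a few lines. The sketch below is one possible way to do it; the function names and the six-decimal formatting are choices made for this example, not something the YOLO format requires.

```python
def xyxy_to_yolo_line(class_id, box, img_w, img_h):
    """Pixel [x1, y1, x2, y2] -> 'class cx cy w h' (all 0-1)."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


def yolo_line_to_xyxy(line, img_w, img_h):
    """Inverse: label line -> (class_id, pixel [x1, y1, x2, y2])."""
    parts = line.split()
    class_id = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:5])
    box = [(cx - w / 2) * img_w, (cy - h / 2) * img_h,
           (cx + w / 2) * img_w, (cy + h / 2) * img_h]
    return class_id, box


line = xyxy_to_yolo_line(0, [160, 120, 480, 360], 640, 480)
print(line)  # 0 0.500000 0.500000 0.500000 0.500000
print(yolo_line_to_xyxy(line, 640, 480))
```

Round-tripping a few boxes like this before training is a cheap way to catch swapped axes or unnormalized values in your labels.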

Step 2: organize your dataset in the standard YOLO directory structure:

dataset/
  images/
    train/
      img001.jpg
      img002.jpg
    val/
      img003.jpg
      img004.jpg
  labels/
    train/
      img001.txt
      img002.txt
    val/
      img003.txt
      img004.txt
  data.yaml

# data.yaml -- tells YOLO where to find everything
path: ./dataset
train: images/train
val: images/val

names:
  0: product_a
  1: product_b
  2: defect

Step 3: train.

from ultralytics import YOLO

# Start from pretrained COCO weights (transfer learning)
model = YOLO("yolov8n.pt")

# Train on your custom dataset
results = model.train(
    data="dataset/data.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    patience=20,        # early stopping if no improvement
    lr0=0.01,           # initial learning rate
    # training-time augmentations (mosaic, flips, HSV
    # jitter) are enabled by default; no flag needed
)

# Training outputs: best.pt and last.pt in runs/detect/

The built-in augmentations include mosaic (stitching four training images together -- a trick introduced in YOLOv4 that dramatically improves small object detection), random flip, random rotation, HSV jitter, and scale variation. You get a solid data augmentation pipeline without writing any extra code.
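
To make the mosaic idea less abstract, here is a deliberately simplified sketch of the label side of it: four equally sized images are placed in a fixed 2x2 layout and their boxes are shifted into the combined canvas. This is an assumption-laden toy (the function name, the fixed layout and the 320-pixel tile size are all made up here); real implementations pick a random mosaic center and crop or rescale the tiles.

```python
def mosaic_boxes(per_image_boxes, tile=320):
    """Remap boxes from four tile x tile images into a 2x2 mosaic.

    per_image_boxes: list of 4 lists of (cls, x1, y1, x2, y2)
    """
    if len(per_image_boxes) != 4:
        raise ValueError("mosaic needs exactly four images")
    # (x, y) offset of each quadrant: TL, TR, BL, BR
    offsets = [(0, 0), (tile, 0), (0, tile), (tile, tile)]
    merged = []
    for boxes, (ox, oy) in zip(per_image_boxes, offsets):
        for cls, x1, y1, x2, y2 in boxes:
            merged.append(
                (cls, x1 + ox, y1 + oy, x2 + ox, y2 + oy)
            )
    return merged


labels = [
    [(0, 10, 20, 110, 220)],   # top-left image
    [(1, 50, 50, 150, 100)],   # top-right image
    [],                        # bottom-left: no objects
    [(2, 0, 0, 320, 320),      # bottom-right image
     (0, 100, 100, 140, 160)],
]
merged = mosaic_boxes(labels)
for entry in merged:
    print(entry)
```

The combined canvas is twice as wide and tall as each tile, so once it is resized back to the network input every object covers a smaller share of the image than it did originally. The model therefore sees more small objects, and more objects per training sample.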

Step 4: evaluate and iterate.

# Evaluate on the validation set
metrics = model.val()
print(f"mAP50:    {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")
print(f"Precision: {metrics.box.mp:.3f}")
print(f"Recall:    {metrics.box.mr:.3f}")

# Run inference on a new image
results = model("new_test_image.jpg")
results[0].save("prediction_result.jpg")

# Export for deployment
model.export(format="onnx")         # ONNX (e.g. as input for TensorRT)
model.export(format="torchscript")  # TorchScript (e.g. for mobile)

For most custom detection tasks, 200-500 annotated images plus transfer learning from a COCO-pretrained model are enough to get decent results. The pretrained backbone already knows how to extract visual features -- edges, textures, shapes -- which is episode #45's CNN theory in practice. Your training mostly teaches the detection head to recognize your specific objects. More data still helps, but with transfer learning the returns diminish quickly.

A practical tip from experience: spend more time on annotation quality than on model tuning. Inconsistent annotations (different annotators drawing boxes at different tightness levels, missing objects in some images, mislabeled classes) hurt performance more than any hyperparameter choice. Clean data beats a bigger model every time -- the same lesson we learned back in episode #14 about data preparation.
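Some annotation problems can be caught mechanically before you ever start training. The sketch below checks the text of one YOLO label file for malformed lines, out-of-range class ids, boxes that stick out of the image, and exact duplicates. The function name `check_label_text` and the specific set of checks are my own choices for illustration. It cannot detect the subtler issues mentioned above, such as inconsistent box tightness, which still need a human review.

```python
def check_label_text(text, num_classes, eps=1e-6):
    """Return a list of (line_number, problem) for one label file's text."""
    problems = []
    seen = set()
    for n, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Assumption: '#' lines are skipped only because the format example
        # in this article uses them; real label files should not have them.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 5:
            problems.append((n, f"expected 5 fields, got {len(parts)}"))
            continue
        try:
            cls = int(parts[0])
            cx, cy, w, h = map(float, parts[1:])
        except ValueError:
            problems.append((n, "non-numeric value"))
            continue
        if not 0 <= cls < num_classes:
            problems.append((n, f"class id {cls} out of range"))
        if w <= 0 or h <= 0:
            problems.append((n, "box has zero or negative size"))
        elif (cx - w / 2 < -eps or cx + w / 2 > 1 + eps
              or cy - h / 2 < -eps or cy + h / 2 > 1 + eps):
            problems.append((n, "box extends outside the image"))
        key = (cls, cx, cy, w, h)
        if key in seen:
            problems.append((n, "duplicate of an earlier line"))
        seen.add(key)
    return problems


sample = "0 0.45 0.38 0.12 0.25\n7 0.73 0.62 0.08 0.15\n0 0.98 0.50 0.10 0.10\n"
for line_no, problem in check_label_text(sample, num_classes=3):
    print(f"line {line_no}: {problem}")
```

Running something like this over every label file takes seconds and tends to surface export glitches from annotation tools early.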

Evaluation: mean Average Precision

Detection evaluation is more involved than classification because you need to match predicted boxes to ground truth boxes. We introduced IoU in episode #78. Now we build the full evaluation pipeline on top of it.

The standard metric is mAP (mean Average Precision). For each class:

  1. Sort all detections by confidence score (descending)
  2. For each detection, check if it matches a ground truth box (IoU >= threshold)
  3. Mark each detection as true positive (matched) or false positive (unmatched or duplicate)
  4. Compute precision and recall at each step
  5. Plot the precision-recall curve
  6. Average Precision = area under this curve

import numpy as np


def compute_iou(box1, box2):
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    a1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    a2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = a1 + a2 - inter
    return inter / max(union, 1e-6)


def compute_ap(predictions, ground_truths,
               iou_threshold=0.5):
    """Average Precision for a single class."""
    preds = sorted(predictions,
                   key=lambda p: p["score"],
                   reverse=True)

    if len(ground_truths) == 0:
        return 0.0

    tp = np.zeros(len(preds))
    fp = np.zeros(len(preds))
    matched_gt = set()

    for i, pred in enumerate(preds):
        best_iou = 0
        best_gt = -1
        for j, gt in enumerate(ground_truths):
            iou = compute_iou(pred["box"], gt["box"])
            if iou > best_iou:
                best_iou = iou
                best_gt = j

        if (best_iou >= iou_threshold
                and best_gt not in matched_gt):
            tp[i] = 1
            matched_gt.add(best_gt)
        else:
            fp[i] = 1

    # Cumulative precision and recall
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(fp)

    precision = tp_cum / (tp_cum + fp_cum)
    recall = tp_cum / len(ground_truths)

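    # Area under the P-R curve (all-points interpolation):
    # anchor the curve at recall 0 and 1, make precision
    # monotonically non-increasing, then sum the rectangles.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    ap = float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
    return ap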


# Example: 4 predictions, 2 ground truth objects
predictions = [
    {"box": [100, 100, 200, 200], "score": 0.95},
    {"box": [102, 98, 205, 203], "score": 0.90},
    {"box": [300, 300, 400, 400], "score": 0.80},
    {"box": [50, 50, 100, 100], "score": 0.70},
]
ground_truths = [
    {"box": [105, 95, 205, 205]},
    {"box": [295, 295, 405, 405]},
]

ap = compute_ap(predictions, ground_truths)
print(f"Average Precision: {ap:.3f}")
# First pred matches GT[0] (TP)
# Second pred overlaps GT[0] too but it's already matched (FP)
# Third pred matches GT[1] (TP)
# Fourth pred matches nothing (FP)

mAP is just the mean AP across all object classes. But which IoU threshold do you use?

mAP50 (PASCAL VOC metric): AP at IoU threshold 0.5. Relatively lenient -- a prediction counts as correct once its intersection with the ground truth is at least half of their union (for two equal-sized boxes that means roughly two-thirds overlap). This is fine for "is the car roughly there" but doesn't reward precise localization.
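To get a feel for what an IoU of 0.5 means geometrically, here is a tiny experiment: take a square box and slide a copy of it sideways. The helper `shifted_iou` is a hypothetical function for this illustration only; `iou` repeats the same corner-format IoU logic used in the evaluation code above.

```python
def iou(box1, box2):
    """IoU for boxes given as [xmin, ymin, xmax, ymax]."""
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    a1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    a2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / max(a1 + a2 - inter, 1e-6)


def shifted_iou(shift_fraction, size=100):
    """IoU between a square box and a copy shifted sideways
    by shift_fraction of its width."""
    dx = shift_fraction * size
    original = [0, 0, size, size]
    shifted = [dx, 0, dx + size, size]
    return iou(original, shifted)


for frac in (0.1, 0.25, 1 / 3, 0.5):
    print(f"shift {frac:.2f} of width -> IoU {shifted_iou(frac):.3f}")
```

For equal-sized boxes shifted by a fraction f of their width, the IoU works out to (1 - f) / (1 + f). A shift of one third of the width lands exactly on the 0.5 threshold, while a box that only covers half of the object is already down to an IoU of one third.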

mAP50-95 (COCO metric): the mean of AP values computed at ten IoU thresholds from 0.5 to 0.95, in steps of 0.05. Much stricter. A model that localizes objects precisely (IoU > 0.9) scores significantly higher than one that draws sloppy boxes (IoU around 0.5). This is the standard metric today, and it's what you'll see reported in most modern detection papers.
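To make the difference concrete, the sketch below evaluates the same four toy predictions from the AP example at all ten thresholds. It is a compact, self-contained variant for a single class. The names `average_precision` and `thresholds` are my own, and it uses all-points interpolation, so treat it as an illustration of the averaging idea and not as a replacement for the official COCO tooling.

```python
import numpy as np


def iou(a, b):
    """IoU for boxes given as [xmin, ymin, xmax, ymax]."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / max(union, 1e-6)


def average_precision(preds, gts, iou_threshold):
    """All-points interpolated AP for one class (a sketch)."""
    if not gts:
        return 0.0
    preds = sorted(preds, key=lambda p: p["score"], reverse=True)
    tp = np.zeros(len(preds))
    matched = set()
    for i, pred in enumerate(preds):
        ious = [iou(pred["box"], gt["box"]) for gt in gts]
        best = int(np.argmax(ious))
        if ious[best] >= iou_threshold and best not in matched:
            tp[i] = 1
            matched.add(best)
    tp_cum = np.cumsum(tp)
    precision = tp_cum / np.arange(1, len(preds) + 1)
    recall = tp_cum / len(gts)
    # Anchor the curve at recall 0 and 1, then take the precision envelope
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))


predictions = [
    {"box": [100, 100, 200, 200], "score": 0.95},
    {"box": [102, 98, 205, 203], "score": 0.90},
    {"box": [300, 300, 400, 400], "score": 0.80},
    {"box": [50, 50, 100, 100], "score": 0.70},
]
ground_truths = [
    {"box": [105, 95, 205, 205]},
    {"box": [295, 295, 405, 405]},
]

thresholds = np.linspace(0.5, 0.95, 10)   # 0.50, 0.55, ..., 0.95
aps = [average_precision(predictions, ground_truths, t) for t in thresholds]

for t, ap in zip(thresholds, aps):
    print(f"IoU >= {t:.2f}: AP = {ap:.3f}")
print(f"AP50:    {aps[0]:.3f}")
print(f"AP50-95: {np.mean(aps):.3f}")
```

The two best-matching boxes have an IoU of about 0.83 with their ground truth, so they stop counting as true positives from the 0.85 threshold onward and the averaged score drops well below AP50. For a real mAP50-95 you would additionally average this value over all classes.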

Choosing your detector

If you're starting a new detection project, here's the practical decision framework:

Constraint | Recommendation
Real-time on edge (phone, Jetson) | YOLOv8n or YOLOv8s
Real-time on server GPU | YOLOv8m or YOLOv8l
Maximum accuracy, speed secondary | YOLOv8x or Faster R-CNN + ResNeXt
Small objects dominate | Faster R-CNN with FPN (two-stage still wins here)
Custom classes, few images | YOLOv8 + transfer learning from COCO
Quick prototype / demo | YOLOv8 pretrained on COCO (80 classes)

For the vast majority of real-world projects, YOLOv8 with transfer learning is the right starting point. Fast to train, easy to deploy (ONNX, TensorRT, CoreML, TFLite -- all supported), well-documented, and competitive with anything more complex. Start there, measure your performance, and only switch to a heavier architecture if you have a specific reason.

Two-stage detectors like Faster R-CNN still have an edge on small object detection because the RPN can propose very small regions that single-shot detectors might miss at their coarsest feature level. But that gap is narrowing with every new YOLO release.

In summary

  • YOLO reframed detection as single-pass regression: divide the image into a grid, predict boxes and classes directly, achieve real-time speeds (45+ FPS) -- a fundamentally different approach from the two-stage proposal-then-classify pipeline we covered in episode #78;
  • the YOLO evolution (v1 through v8) systematically added anchor boxes, multi-scale prediction, better backbones, and eventually circled back to anchor-free design -- YOLOv8 is the current practical standard for real-world detection;
  • SSD introduced multi-scale prediction from different feature map levels, handling objects of varying sizes by using small feature maps for large objects and large feature maps for small ones;
  • anchor-free detectors (FCOS, CenterNet) simplify architecture by eliminating anchor design entirely -- predict distances to box edges or detect center points directly, with fewer hyperparameters to tune;
  • custom detection with transfer learning requires as few as 200-500 annotated images: structure your dataset in YOLO format, start from COCO-pretrained weights, and the backbone's existing feature knowledge does most of the heavy lifting;
  • mAP (mean Average Precision) measures detection quality across the full precision-recall curve, with mAP50-95 as the strict modern standard that rewards precise localization at high IoU thresholds.

With the detection foundations and modern approaches both covered, we have the complete picture of how to find objects in images. But detection only draws rectangles around things. What if you need pixel-level precision -- knowing exactly which pixels belong to each object? That's a different problem with its own set of architectures, and it builds directly on the feature pyramid and multi-scale concepts we've been working with here.

Exercises

Exercise 1: Build a YOLO annotation format converter. Create a class AnnotationConverter that: (a) loads annotations from PASCAL VOC XML format (each XML file has <object> tags with <name>, <bndbox> containing <xmin>, <ymin>, <xmax>, <ymax>), (b) converts to YOLO format (class_id, center_x, center_y, width, height -- all normalized 0-1), (c) converts to COCO JSON format (image_id, category_id, bbox as [x, y, width, height] in pixels, area), (d) can convert in all three directions (VOC -> YOLO, YOLO -> COCO, COCO -> VOC). Simulate with 5 test images (hardcoded annotations) and verify round-trip conversion: VOC -> YOLO -> COCO -> VOC produces the same bounding boxes (within floating point tolerance).

Exercise 2: Implement a multi-scale detection simulator that demonstrates the SSD principle. Create a class MultiScaleDetector that: (a) generates a synthetic 300x300 image with 8 objects at three scales -- 2 large (100x100+), 3 medium (40x80), and 3 small (15x25), (b) creates three simulated feature maps: 38x38 (stride 8), 19x19 (stride 16), and 10x10 (stride 30), (c) at each feature map level, iterates over spatial positions and checks which ground truth objects have their center within that cell, (d) assigns each object to the feature level whose stride best matches the object size (large objects -> 10x10 map, small objects -> 38x38 map), (e) prints statistics showing how many objects each level detects and which objects would be missed if only a single scale were used. Demonstrate that no single scale catches all objects, but all three together provide complete coverage.

Exercise 3: Build a detection model benchmarking framework. Create a class DetectorBenchmark that: (a) generates a synthetic test set of 20 images with known ground truth (using the DetectionDataset class from Exercise 1 of episode #78, or a simplified version), (b) simulates three "detectors" with different accuracy profiles -- "perfect" (exact ground truth boxes with score 1.0), "noisy" (ground truth boxes with random IoU perturbation between 0.5-0.95 and random scores), and "poor" (50% of objects detected with high noise plus 30% false positives), (c) runs each detector through the full compute_ap evaluation pipeline, (d) prints a comparison table showing mAP50 and mAP75 for each detector, plus per-class AP. Verify that the "perfect" detector gets AP=1.0 and that the "poor" detector scores significantly lower at the stricter IoU=0.75 threshold than at IoU=0.5.

Thanks for reading!

@scipio


