Learn Zig Series (#29) - Inline Assembly and Low-Level Control

What will I learn

  • You will learn how to write solutions for the Episode 28 exercises;
  • You will learn Zig's inline assembly syntax with asm volatile;
  • You will learn input and output constraints for registers;
  • You will learn clobber lists and why the compiler needs them;
  • You will learn how to read CPU counters and special registers;
  • You will learn CPUID and feature detection at runtime;
  • You will learn when inline assembly is justified vs when Zig builtins suffice;
  • You will learn a practical example: precise cycle counting with RDTSC.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Zig 0.14+ distribution (download from ziglang.org);
  • The ambition to learn Zig programming.

Difficulty

  • Intermediate

Curriculum (of the Learn Zig Series):

Learn Zig Series (#29) - Inline Assembly and Low-Level Control

Solutions to Episode 28 Exercises

Exercise 1 - A Zig shared library exporting a RingBuffer:

const std = @import("std");

const RingBuffer = struct {
    data: []i32,
    head: u32,
    tail: u32,
    count: u32,
    capacity: u32,
};

export fn ringbuf_create(capacity: u32) ?*anyopaque {
    const allocator = std.heap.page_allocator;
    const buf = allocator.create(RingBuffer) catch return null;
    const data = allocator.alloc(i32, capacity) catch {
        allocator.destroy(buf);
        return null;
    };
    buf.* = .{
        .data = data,
        .head = 0,
        .tail = 0,
        .count = 0,
        .capacity = capacity,
    };
    return @ptrCast(buf);
}

export fn ringbuf_push(handle: *anyopaque, value: i32) bool {
    const buf: *RingBuffer = @ptrCast(@alignCast(handle));
    if (buf.count == buf.capacity) return false;
    buf.data[buf.tail] = value;
    buf.tail = (buf.tail + 1) % buf.capacity;
    buf.count += 1;
    return true;
}

export fn ringbuf_pop(handle: *anyopaque, out: *i32) bool {
    const buf: *RingBuffer = @ptrCast(@alignCast(handle));
    if (buf.count == 0) return false;
    out.* = buf.data[buf.head];
    buf.head = (buf.head + 1) % buf.capacity;
    buf.count -= 1;
    return true;
}

export fn ringbuf_destroy(handle: *anyopaque) void {
    const buf: *RingBuffer = @ptrCast(@alignCast(handle));
    const allocator = std.heap.page_allocator;
    allocator.free(buf.data);
    allocator.destroy(buf);
}

The ?*anyopaque return on ringbuf_create maps directly to C's void* -- if allocation fails, we return null. The C or Python caller checks for null before using the handle. Internally we cast back to *RingBuffer using @ptrCast(@alignCast(...)). The ring buffer itself is standard circular buffer logic with head/tail indices wrapping around via modulo.
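
If you want to sanity-check the exported API from the Zig side before any C or Python caller gets involved, a quick test appended to the same file does the trick (a sketch -- run it with zig test):

test "ring buffer round trip" {
    const handle = ringbuf_create(2) orelse return error.OutOfMemory;
    defer ringbuf_destroy(handle);

    try std.testing.expect(ringbuf_push(handle, 10));
    try std.testing.expect(ringbuf_push(handle, 20));
    try std.testing.expect(!ringbuf_push(handle, 30)); // buffer is full

    var out: i32 = undefined;
    try std.testing.expect(ringbuf_pop(handle, &out));
    try std.testing.expectEqual(@as(i32, 10), out); // FIFO order preserved
}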

Exercise 2 - Zig sorting library with ascending/descending flag:

const std = @import("std");

export fn zig_sort(arr: [*]i32, len: usize, ascending: bool) void {
    const slice = arr[0..len];
    if (ascending) {
        std.mem.sort(i32, slice, {}, struct {
            fn cmp(_: void, a: i32, b: i32) bool {
                return a < b;
            }
        }.cmp);
    } else {
        std.mem.sort(i32, slice, {}, struct {
            fn cmp(_: void, a: i32, b: i32) bool {
                return a > b;
            }
        }.cmp);
    }
}

export fn zig_is_sorted(arr: [*]const i32, len: usize, ascending: bool) bool {
    if (len <= 1) return true;
    const slice = arr[0..len];
    for (0..len - 1) |i| {
        if (ascending) {
            if (slice[i] > slice[i + 1]) return false;
        } else {
            if (slice[i] < slice[i + 1]) return false;
        }
    }
    return true;
}

Notice the bool parameter -- Zig's bool maps to C's _Bool (or bool from <stdbool.h>). The anonymous struct-with-function pattern for the comparison is standard Zig style for inline comparators; we used a similar approach back in episode 14 when discussing generics.
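
A quick Zig-side test exercising both exports (a sketch, appended to the same file):

test "zig_sort both directions" {
    var values = [_]i32{ 3, 1, 2 };
    zig_sort(&values, values.len, true);
    try std.testing.expect(zig_is_sorted(&values, values.len, true));
    zig_sort(&values, values.len, false);
    try std.testing.expect(zig_is_sorted(&values, values.len, false));
}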

Exercise 3 - Calculator protocol with extern struct:

const CalcRequest = extern struct {
    op: u8,
    a: f64,
    b: f64,
};

const CalcResult = extern struct {
    value: f64,
    error_code: i32,
};

export fn calc_execute(request: *const CalcRequest) CalcResult {
    return switch (request.op) {
        '+' => .{ .value = request.a + request.b, .error_code = 0 },
        '-' => .{ .value = request.a - request.b, .error_code = 0 },
        '*' => .{ .value = request.a * request.b, .error_code = 0 },
        '/' => if (request.b == 0.0)
            CalcResult{ .value = 0.0, .error_code = 1 }
        else
            CalcResult{ .value = request.a / request.b, .error_code = 0 },
        else => .{ .value = 0.0, .error_code = 2 },
    };
}

Both structs use extern struct because they cross the language boundary -- CalcRequest comes from Python/C, and CalcResult goes back. The op field is a raw u8 holding the ASCII value of the operator character. Error codes are plain integers (0 = ok, 1 = division by zero, 2 = unknown operator) because Zig's error unions can't cross the C ABI. This is the standard pattern: when you can't use Zig's error handling, fall back to C-style status codes.
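
A minimal Zig-side test of the protocol (a sketch; note the inline std import, since this file doesn't import it at the top):

test "calc_execute flags division by zero" {
    const req = CalcRequest{ .op = '/', .a = 1.0, .b = 0.0 };
    const res = calc_execute(&req);
    try @import("std").testing.expectEqual(@as(i32, 1), res.error_code);
    try @import("std").testing.expectEqual(@as(f64, 0.0), res.value);
}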

Here we go! Episodes 27 and 28 gave us the full C interop picture -- calling C from Zig, and exposing Zig to C. We went back and forth across that language boundary, learned about @cImport, export, extern struct, callconv(.c), and building shared libraries. Good stuff.

Today we go even lower. We're dropping past C, past function calls, past any kind of abstraction -- straight into the CPU itself. Inline assembly lets you embed raw machine instructions inside your Zig code. You tell the processor exactly what to do, which registers to use, what gets clobbered. No compiler interpretation, no optimization passes, no safety nets. Just you and the silicon ;-)

Now, I should be very clear upfront: you almost never need this. Zig already gives you @Vector for SIMD (we covered that in episode 19), @prefetch for cache hints, builtins for byte swapping and bit counting and all sorts of low-level operations. The compiler is really good at generating efficient machine code. But there are situations where you genuinely need inline assembly -- reading hardware registers, accessing CPU identification instructions, doing precise cycle timing, implementing OS-level primitives. That's what this episode is about.

Zig's inline assembly syntax with asm volatile

The basic syntax for inline assembly in Zig looks like this:

const std = @import("std");

pub fn main() void {
    // A simple no-op -- does absolutely nothing
    asm volatile ("nop");

    // Read the stack pointer into a Zig variable
    const stack_ptr = asm volatile (
        "mov %%rsp, %[result]"
        : [result] "=r" (-> usize)
        :
        :
    );

    std.debug.print("Stack pointer: 0x{x}\n", .{stack_ptr});

    // Execute two instructions in sequence on the same value.
    // Zig has no GCC-style "+" read-write constraint, so we tie the
    // input to output operand 0 with the matching constraint "0".
    var value: u64 = 42;
    value = asm volatile (
        \\add $10, %[val]
        \\shl $1, %[val]
        : [val] "=r" (-> u64)
        : [val_in] "0" (value)
        :
    );

    std.debug.print("After add 10 then shift left: {d}\n", .{value});
}

Let me break this down. The asm volatile expression takes a string (the assembly instructions) followed by up to three colon-separated sections: outputs, inputs, and clobbers. The volatile keyword tells the compiler "do NOT remove or reorder this assembly even if you think it does nothing useful". Without volatile, the compiler might decide the assembly has no visible side effects and optimize it away entirely. For anything involving hardware registers, timing, or I/O, you always want volatile.

The instruction string uses AT&T syntax by default on x86 (source on the left, destination on the right -- opposite of Intel syntax). Multi-line assembly uses Zig's \\ multi-line string syntax. Named operands like %[val] let you refer to Zig variables inside the assembly without hardcoding specific register names -- the compiler picks registers for you. And because a single % introduces an operand, literal register names must be written with a double percent, like %%rsp.

That double-backslash multi-line string is important. Each \\ starts a new line of the string literal, and each line becomes one assembly instruction. This is way more readable than cramming everything onto one line separated by \n\t like you'd do in C's __asm__.

Input and output constraints

The constraint system is where inline assembly gets interesting (and a bit tricky, to be honest). Constraints tell the compiler HOW to connect your Zig variables to the assembly instructions -- which registers to use, whether a variable is read, written, or both:

const std = @import("std");

pub fn main() void {
    // Output constraint: "=r" would mean "write to any general-purpose
    // register"; "={eax}" pins the output to EAX specifically.
    // The -> u32 tells Zig the type of the result.
    const cycles = asm volatile (
        "rdtsc"
        : [lo] "={eax}" (-> u32)
        :
        : "edx"
    );
    std.debug.print("TSC low 32 bits: {d}\n", .{cycles});

    // Input constraint: "r" means "any general-purpose register".
    // The "0" constraint ties an input to output operand 0 -- needed
    // because x86's add overwrites one of its inputs.
    const x: u64 = 100;
    const y: u64 = 200;
    const sum = asm volatile (
        "add %[b], %[a]"
        : [a] "=r" (-> u64)
        : [a_in] "0" (x),
          [b] "r" (y)
        :
    );
    std.debug.print("100 + 200 = {d}\n", .{sum});

    // Read-write operand: tie the input to the output with "0" so the
    // variable is read, modified, and written back in the same register
    var counter: u32 = 10;
    counter = asm volatile (
        \\dec %[c]
        \\dec %[c]
        \\dec %[c]
        : [c] "=r" (-> u32)
        : [c_in] "0" (counter)
        :
    );
    std.debug.print("10 decremented 3 times: {d}\n", .{counter});
}

The common constraints you'll use on x86_64:

  • "r" -- any general-purpose register (input)
  • "=r" -- any general-purpose register (output, write-only)
  • "+r" -- any general-purpose register (input AND output)
  • "={eax}" -- specifically the EAX register
  • "={rax}" -- specifically the RAX register (64-bit)
  • "i" -- an immediate (compile-time constant) value
  • "m" -- a memory location
  • "0" -- same register as operand 0 (ties an input to an output)

The "0" constraint in the addition example above is a neat trick. It says "put this input in the SAME register that was allocated for output operand 0". This is needed because x86's add instruction modifies its destination operand in place -- the result overwrites one of the inputs.

NB: on ARM architectures the constraint letters are different. ARM uses "r" for general registers too, but specific registers use different syntax ("{r0}", "{sp}", etc.), and the instruction mnemonics are obviously completely different. The concepts transfer -- outputs, inputs, clobbers -- but the details are platform-specific.
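
To make that concrete, here's a hedged sketch of the same pattern on AArch64: reading the virtual counter register CNTVCT_EL0, which (unlike most performance counters) is typically readable from user mode. Outputs, inputs, and clobbers work exactly as on x86 -- only the mnemonics and register names change:

const std = @import("std");
const builtin = @import("builtin");

fn readVirtualCounter() u64 {
    // MRS moves a system register into a general-purpose register
    return asm volatile ("mrs %[result], cntvct_el0"
        : [result] "=r" (-> u64)
    );
}

pub fn main() void {
    if (builtin.cpu.arch != .aarch64) {
        std.debug.print("This example requires aarch64\n", .{});
        return;
    }
    std.debug.print("CNTVCT_EL0: {d}\n", .{readVirtualCounter()});
}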

Clobber lists: telling the compiler what you'll modify

The clobber list is the third section after outputs and inputs, and it's arguably the most important for correctness. It tells the compiler which registers (or memory) your assembly instructions modify as a SIDE EFFECT -- things you change that aren't captured in the output constraints:

const std = @import("std");

fn divmod(dividend: u64, divisor: u64) struct { quotient: u64, remainder: u64 } {
    // x86 DIV divides the 128-bit value RDX:RAX by its operand and
    // leaves the quotient in RAX and the remainder in RDX.
    // DIV also leaves the flags in an undefined state, so we clobber "cc".
    var quotient: u64 = undefined;
    var remainder: u64 = undefined;
    asm volatile ("div %[divisor]"
        : [q] "={rax}" (quotient),
          [r] "={rdx}" (remainder)
        : [lo] "{rax}" (dividend),
          [hi] "{rdx}" (@as(u64, 0)),
          [divisor] "r" (divisor)
        : "cc"
    );
    return .{ .quotient = quotient, .remainder = remainder };
}

pub fn main() void {
    const result = divmod(17, 5);
    std.debug.print("17 / 5 = {d} remainder {d}\n", .{
        result.quotient,
        result.remainder,
    });

    // Example where clobbering "memory" is essential
    var buffer = [_]u8{ 'H', 'e', 'l', 'l', 'o' };
    const ptr: [*]u8 = &buffer;
    const len: usize = buffer.len;

    // This assembly touches memory through the pointer
    // We MUST declare "memory" as clobbered so the compiler
    // doesn't assume buffer[] is unchanged after this asm
    asm volatile (
        \\xor %%rcx, %%rcx
        \\.loop:
        \\  xorb $0x20, (%%rdi, %%rcx)
        \\  inc %%rcx
        \\  cmp %%rsi, %%rcx
        \\  jb .loop
        :
        : [ptr] "{rdi}" (ptr),
          [len] "{rsi}" (len)
        : "rcx", "memory", "cc"
    );

    std.debug.print("After XOR 0x20: {s}\n", .{&buffer});
}

Three common clobber entries:

  • "memory" -- tells the compiler that the assembly reads or writes memory that might alias with Zig variables. Without this, the compiler might cache a variable's value in a register across the asm block and miss the fact that the assembly changed it. This is the scariest one to forget because the bug only shows up in optimized builds where the compiler actually does register caching.
  • "cc" -- the condition codes (flags register). Most arithmetic and comparison instructions modify the CPU flags. If you don't declare "cc" as clobbered, the compiler might rely on flags it set before the assembly block, and your code breaks silently.
  • Specific registers like "rcx" -- if your assembly uses a register as a scratch value that isn't in the output list, you must clobber it. Otherwise the compiler might store something important in that register and expect it to still be there afterwards.

I want to stress how subtle these bugs can be. If you forget a clobber, your code might work perfectly in debug mode (where the compiler stores everything on the stack) and then break in ReleaseFast or ReleaseSmall (where the compiler aggressively uses registers). The bug is a misunderstanding between you and the compiler about what's changed, and the compiler doesn't check -- it trusts your clobber list completely.
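
One tiny but classic clobber application: an empty template with a "memory" clobber emits no instructions at all, yet acts as a compiler-level barrier. A minimal sketch:

// Emits zero machine code, but the "memory" clobber forces the compiler
// to write cached values back to memory before this point and reload
// them afterwards -- a pure compiler barrier.
inline fn compilerBarrier() void {
    asm volatile ("" : : : "memory");
}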

Reading CPU counters and special registers

One of the most common reasons to use inline assembly is accessing special CPU registers that have no Zig builtin equivalent. The Time Stamp Counter (TSC) is the classic example -- a 64-bit counter that increments with every clock cycle (or at a fixed frequency on modern CPUs):

const std = @import("std");

fn readTSC() u64 {
    var lo: u32 = undefined;
    var hi: u32 = undefined;
    asm volatile ("rdtsc"
        : [lo] "={eax}" (lo),
          [hi] "={edx}" (hi)
        :
        :
    );
    return (@as(u64, hi) << 32) | @as(u64, lo);
}

fn readTSCFenced() u64 {
    // RDTSCP serializes -- waits for all prior instructions to complete
    // Also returns the processor ID in ECX (we ignore it here)
    var lo: u32 = undefined;
    var hi: u32 = undefined;
    asm volatile ("rdtscp"
        : [lo] "={eax}" (lo),
          [hi] "={edx}" (hi)
        :
        : "ecx"
    );
    return (@as(u64, hi) << 32) | @as(u64, lo);
}

fn readCR3() u64 {
    // CR3 holds the page table base address -- ring 0 only!
    // This will crash in userspace with a protection fault
    // Shown here for educational purposes
    return asm volatile (
        "mov %%cr3, %[result]"
        : [result] "=r" (-> u64)
        :
        :
    );
}

pub fn main() void {
    const tsc1 = readTSC();

    // Do some work
    var sum: u64 = 0;
    for (0..1000) |i| {
        sum += i;
    }

    const tsc2 = readTSC();

    std.debug.print("TSC start:    {d}\n", .{tsc1});
    std.debug.print("TSC end:      {d}\n", .{tsc2});
    std.debug.print("Cycles spent: {d}\n", .{tsc2 - tsc1});
    std.debug.print("Sum:          {d}\n", .{sum});

    // Fenced version for more accurate measurements
    const fenced1 = readTSCFenced();
    var product: u64 = 1;
    for (1..21) |i| {
        product *%= i;
    }
    const fenced2 = readTSCFenced();

    std.debug.print("\nFenced TSC diff: {d} cycles\n", .{fenced2 - fenced1});
    std.debug.print("Product:         {d}\n", .{product});
}

The difference between rdtsc and rdtscp is subtle but important. Plain rdtsc can be executed out-of-order by the CPU -- the processor might actually execute it BEFORE some of the instructions that come before it in your code (modern CPUs reorder instructions for performance). rdtscp is a serializing instruction: it waits for all previous instructions to complete before reading the counter. This gives you more accurate timing, but at the cost of being slightly slower.

For benchmarking, the common pattern is: rdtscp before, rdtsc after. Or lfence; rdtsc before, rdtscp after. Different measurement protocols exist and people argue about which one is most accurate. For our purposes, rdtscp on both sides is fine.
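
For completeness, here's a sketch of that lfence; rdtsc variant -- the lfence keeps the CPU from hoisting rdtsc above earlier instructions:

fn readTSCOrdered() u64 {
    var lo: u32 = undefined;
    var hi: u32 = undefined;
    asm volatile (
        \\lfence
        \\rdtsc
        : [lo] "={eax}" (lo),
          [hi] "={edx}" (hi)
    );
    return (@as(u64, hi) << 32) | @as(u64, lo);
}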

The CR3 example is there to show what's possible in kernel-mode code. Control registers (CR0, CR2, CR3, CR4) contain processor configuration like the page table base, protection flags, and so on. You can only access them in ring 0 (kernel mode). If you're writing an operating system in Zig, you'd use inline assembly like this constantly. In userspace, that instruction would fault with a segmentation violation. Don't run it ;-)

CPUID and feature detection at runtime

The cpuid instruction is how you ask the CPU "what can you do?" at runtime. It returns processor identification, feature flags, cache sizes, and all sorts of hardware information. This is essential for writing code that adapts to the specific CPU it's running on:

const std = @import("std");

const CpuidResult = struct {
    eax: u32,
    ebx: u32,
    ecx: u32,
    edx: u32,
};

fn cpuid(leaf: u32, subleaf: u32) CpuidResult {
    var eax: u32 = undefined;
    var ebx: u32 = undefined;
    var ecx: u32 = undefined;
    var edx: u32 = undefined;
    asm volatile (
        "cpuid"
        : [eax] "={eax}" (eax),
          [ebx] "={ebx}" (ebx),
          [ecx] "={ecx}" (ecx),
          [edx] "={edx}" (edx)
        : [leaf] "{eax}" (leaf),
          [sub] "{ecx}" (subleaf)
        :
    );
    return .{ .eax = eax, .ebx = ebx, .ecx = ecx, .edx = edx };
}

fn getCpuVendor() [12]u8 {
    const r = cpuid(0, 0);
    var vendor: [12]u8 = undefined;
    @memcpy(vendor[0..4], std.mem.asBytes(&r.ebx));
    @memcpy(vendor[4..8], std.mem.asBytes(&r.edx));
    @memcpy(vendor[8..12], std.mem.asBytes(&r.ecx));
    return vendor;
}

fn hasSSE42() bool {
    const r = cpuid(1, 0);
    // SSE4.2 is bit 20 of ECX for leaf 1
    return (r.ecx & (1 << 20)) != 0;
}

fn hasAVX2() bool {
    const r = cpuid(7, 0);
    // AVX2 is bit 5 of EBX for leaf 7, subleaf 0
    return (r.ebx & (1 << 5)) != 0;
}

fn hasAVX512F() bool {
    const r = cpuid(7, 0);
    // AVX-512 Foundation is bit 16 of EBX for leaf 7
    return (r.ebx & (1 << 16)) != 0;
}

pub fn main() void {
    const vendor = getCpuVendor();
    std.debug.print("CPU Vendor: {s}\n", .{&vendor});

    const r1 = cpuid(1, 0);
    const family = (r1.eax >> 8) & 0xF;
    const model = (r1.eax >> 4) & 0xF;
    const stepping = r1.eax & 0xF;
    const ext_model = (r1.eax >> 16) & 0xF;
    const ext_family = (r1.eax >> 20) & 0xFF;

    std.debug.print("Family: {d}  Model: {d}  Stepping: {d}\n", .{
        family + ext_family,
        model + (ext_model << 4),
        stepping,
    });

    std.debug.print("\nFeature detection:\n", .{});
    std.debug.print("  SSE4.2:     {}\n", .{hasSSE42()});
    std.debug.print("  AVX2:       {}\n", .{hasAVX2()});
    std.debug.print("  AVX-512F:   {}\n", .{hasAVX512F()});
}

The cpuid instruction is one of those rare assembly instructions that uses four registers simultaneously -- EAX as input (the "leaf" number, which selects what information you're asking for) and EAX, EBX, ECX, EDX as outputs. Leaf 0 returns the vendor string (split across EBX, EDX, ECX in that weird order -- thanks Intel). Leaf 1 returns feature flags. Leaf 7 returns extended feature flags including AVX2 and AVX-512.

You might ask: why would I use this in Zig when I could just check @import("builtin").cpu.features? Good question. The comptime builtin info tells you what the TARGET CPU supports (based on the -Dcpu flag at build time). But the cpuid approach tells you what the ACTUAL CPU supports at runtime. If you're distributing a binary that should run on multiple CPU generations and pick the fastest code path dynamically, runtime detection via cpuid is what you want. If you're building specifically for one target, comptime feature checks are better because the compiler can optimize away the unused code paths.
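
For comparison, here's what the comptime flavor looks like -- a minimal sketch using std.Target's feature-set helpers:

const std = @import("std");
const builtin = @import("builtin");

pub fn main() void {
    // Resolved at compile time from the build target; the dead branch
    // is never analyzed, let alone compiled
    if (comptime std.Target.x86.featureSetHas(builtin.cpu.features, .avx2)) {
        std.debug.print("Built for a target with AVX2\n", .{});
    } else {
        std.debug.print("Built for a target without AVX2\n", .{});
    }
}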

When inline assembly is justified vs when Zig builtins suffice

This is the pragmatic section. Before you reach for inline assembly, check if Zig already has a builtin that does what you need. Zig's builtins compile to the same machine instructions but are portable, optimizable, and type-safe:

const std = @import("std");

// DON'T use inline assembly for this:
fn popcount_asm(val: u64) u64 {
    return asm volatile (
        "popcnt %[val], %[out]"
        : [out] "=r" (-> u64)
        : [val] "r" (val)
        :
    );
}

// DO use the Zig builtin:
fn popcount_builtin(val: u64) u7 {
    return @popCount(val);
}

// DON'T use inline assembly for byte swap:
fn bswap_asm(val: u32) u32 {
    // Parameters are immutable in Zig, so we can't use them as outputs;
    // tie the input to the output register with "0" instead
    return asm volatile ("bswap %[out]"
        : [out] "=r" (-> u32)
        : [in] "0" (val)
        :
    );
}

// DO use the builtin:
fn bswap_builtin(val: u32) u32 {
    return @byteSwap(val);
}

// DON'T use assembly for counting leading zeros:
fn clz_asm(val: u64) u64 {
    return asm volatile (
        "lzcnt %[val], %[out]"
        : [out] "=r" (-> u64)
        : [val] "r" (val)
        :
    );
}

// DO use the builtin:
fn clz_builtin(val: u64) u7 {
    return @clz(val);
}

pub fn main() void {
    const test_val: u64 = 0xFF00FF00FF00FF00;

    std.debug.print("popcount asm:     {d}\n", .{popcount_asm(test_val)});
    std.debug.print("popcount builtin: {d}\n", .{popcount_builtin(test_val)});

    const test32: u32 = 0xDEADBEEF;
    std.debug.print("\nbswap asm:     0x{X}\n", .{bswap_asm(test32)});
    std.debug.print("bswap builtin: 0x{X}\n", .{bswap_builtin(test32)});

    const test_clz: u64 = 0x0000FFFFFFFFFFFF;
    std.debug.print("\nclz asm:     {d}\n", .{clz_asm(test_clz)});
    std.debug.print("clz builtin: {d}\n", .{clz_builtin(test_clz)});
}

The builtins win every time for these operations. They work on all architectures (ARM, RISC-V, WASM, not just x86). The compiler can constant-fold them at comptime. They return proper Zig types (a u7 for the popcount of a u64, because the result ranges from 0 to 64 and needs seven bits). And the compiler generates the same machine instruction anyway -- @popCount on x86 with the right CPU features emits popcnt.

Here's my rule of thumb for when inline assembly IS justified:

  1. CPU identification (cpuid) -- no builtin for this
  2. Hardware counters (rdtsc, rdtscp, rdpmc) -- no builtin
  3. Serializing and fence instructions (lfence, mfence, sfence) -- atomic memory orderings cover some cases, but not all
  4. Privileged instructions in kernel code (mov cr3, wrmsr, lgdt, invlpg) -- OS development territory
  5. Very specific instruction sequences where the exact ordering matters and the compiler must not reorder anything
  6. Architecture-specific features not covered by builtins -- though Zig is adding more builtins all the time

If none of those apply, you probably don't need assembly. I've seen codebases where people write inline assembly for memcpy-style loops. Don't do that. The compiler's memcpy is almost certainly better than yours. Same goes for basic arithmetic, comparisons, branching -- the compiler has decades of optimization research behind it. You're not going to beat it by hand for general-purpose code. Having said that, for the specific cases listed above, there's simply no alternative.

@prefetch and cache control hints

Before we get to the final practical example, let me cover Zig's @prefetch builtin. This isn't assembly (it's a builtin), but it relates to the same low-level CPU control territory and connects nicely to what we did with cache lines and memory layout in episode 8:

const std = @import("std");

fn sumWithPrefetch(data: []const u64) u64 {
    var total: u64 = 0;
    const prefetch_distance = 8; // prefetch 8 elements (64 bytes) ahead

    for (data, 0..) |val, i| {
        // Tell the CPU to start loading data we'll need soon
        if (i + prefetch_distance < data.len) {
            @prefetch(&data[i + prefetch_distance], .{
                .rw = .read,
                .locality = 3,   // keep in all cache levels
                .cache = .data,
            });
        }
        total += val;
    }
    return total;
}

fn sumNoPrefetch(data: []const u64) u64 {
    var total: u64 = 0;
    for (data) |val| {
        total += val;
    }
    return total;
}

pub fn main() void {
    // Allocate a large array to make cache effects visible
    const allocator = std.heap.page_allocator;
    const size = 1024 * 1024; // 1M elements = 8MB
    const data = allocator.alloc(u64, size) catch {
        std.debug.print("Allocation failed\n", .{});
        return;
    };
    defer allocator.free(data);

    // Fill with values
    for (data, 0..) |*slot, i| {
        slot.* = i;
    }

    // Benchmark both approaches
    const reps = 50;
    var sum1: u64 = 0;
    var sum2: u64 = 0;

    var timer = std.time.Timer.start() catch unreachable;

    for (0..reps) |_| {
        sum1 +%= sumNoPrefetch(data);
    }
    const time_no_prefetch = timer.read();

    timer.reset();

    for (0..reps) |_| {
        sum2 +%= sumWithPrefetch(data);
    }
    const time_prefetch = timer.read();

    std.debug.print("No prefetch: {d}ms  (sum: {d})\n", .{
        time_no_prefetch / std.time.ns_per_ms,
        sum1,
    });
    std.debug.print("Prefetch:    {d}ms  (sum: {d})\n", .{
        time_prefetch / std.time.ns_per_ms,
        sum2,
    });
}

The @prefetch builtin compiles to the prefetcht0/prefetcht1/prefetcht2/prefetchnta instructions on x86 (or the ARM equivalent on ARM). It's a hint to the CPU: "I'm going to need this memory soon, please start loading it into cache." The CPU is free to ignore the hint if the cache is full or if it deems the prefetch unnecessary.

The locality parameter (0-3) controls how long the data stays in cache. 3 means "keep it in all cache levels" (L1, L2, L3). 0 means "non-temporal -- I only need this once, don't pollute the cache hierarchy." For sequential scans through large arrays, locality 0 is often better because you don't want to evict other useful data from L1/L2. For data you'll access repeatedly, locality 3 is the right choice.

Honestly, in most cases the CPU's hardware prefetcher already does a great job detecting sequential access patterns. You'll see the biggest improvement from manual prefetching when your access pattern is irregular -- linked lists, hash table lookups, tree traversals -- where the hardware prefetcher can't predict what memory you'll touch next.
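
Here's a sketch of that irregular-access case -- Node is a hypothetical singly-linked list type, not anything from the standard library:

const std = @import("std");

const Node = struct {
    value: u64,
    next: ?*Node,
};

// Prefetch the *next* node while we're still processing the current one.
// The hardware prefetcher can't predict pointer chases, so here the
// hint actually has a chance to help.
fn sumList(first: ?*Node) u64 {
    var total: u64 = 0;
    var node = first;
    while (node) |n| {
        if (n.next) |nxt| {
            @prefetch(nxt, .{ .rw = .read, .locality = 1, .cache = .data });
        }
        total += n.value;
        node = n.next;
    }
    return total;
}

test "sumList walks the chain" {
    var c = Node{ .value = 3, .next = null };
    var b = Node{ .value = 2, .next = &c };
    var a = Node{ .value = 1, .next = &b };
    try std.testing.expectEqual(@as(u64, 6), sumList(&a));
}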

Practical example: precise cycle counting with RDTSC

Let's put several of these concepts together into something you'd actually use in a real project. A precise cycle counter for benchmarking tight code sections:

const std = @import("std");
const builtin = @import("builtin");

const CycleMeasurement = struct {
    cycles: u64,
    overhead: u64,

    fn net(self: CycleMeasurement) u64 {
        if (self.cycles > self.overhead) {
            return self.cycles - self.overhead;
        }
        return 0;
    }
};

fn rdtscp() u64 {
    var lo: u32 = undefined;
    var hi: u32 = undefined;
    asm volatile ("rdtscp"
        : [lo] "={eax}" (lo),
          [hi] "={edx}" (hi)
        :
        : "ecx"
    );
    return (@as(u64, hi) << 32) | @as(u64, lo);
}

fn measureOverhead() u64 {
    // Measure the cost of the measurement itself
    var min_overhead: u64 = std.math.maxInt(u64);
    for (0..1000) |_| {
        const start = rdtscp();
        const end = rdtscp();
        const diff = end - start;
        if (diff < min_overhead) min_overhead = diff;
    }
    return min_overhead;
}

fn measureCycles(comptime func: anytype, args: anytype) CycleMeasurement {
    const overhead = measureOverhead();

    // Warm up
    for (0..100) |_| {
        _ = @call(.auto, func, args);
    }

    // Measure multiple times, take the minimum
    var min_cycles: u64 = std.math.maxInt(u64);
    for (0..1000) |_| {
        const start = rdtscp();
        _ = @call(.auto, func, args);
        const end = rdtscp();
        const diff = end - start;
        if (diff < min_cycles) min_cycles = diff;
    }

    return .{ .cycles = min_cycles, .overhead = overhead };
}

// Some functions to benchmark
fn sumArray(data: []const u32) u64 {
    var total: u64 = 0;
    for (data) |v| total += v;
    return total;
}

fn findMax(data: []const u32) u32 {
    var max: u32 = 0;
    for (data) |v| {
        if (v > max) max = v;
    }
    return max;
}

fn countOnes(data: []const u32) u64 {
    var total: u64 = 0;
    for (data) |v| {
        total += @popCount(v);
    }
    return total;
}

pub fn main() void {
    if (builtin.cpu.arch != .x86_64) {
        std.debug.print("This example requires x86_64\n", .{});
        return;
    }

    var data: [256]u32 = undefined;
    for (&data, 0..) |*slot, i| {
        slot.* = @as(u32, @truncate(i *% 7919 +% 104729));
    }

    const slice: []const u32 = &data;

    const m1 = measureCycles(sumArray, .{slice});
    const m2 = measureCycles(findMax, .{slice});
    const m3 = measureCycles(countOnes, .{slice});

    std.debug.print("Array size: {d} elements ({d} bytes)\n\n", .{
        data.len,
        data.len * @sizeOf(u32),
    });
    std.debug.print("sumArray:   {d} cycles (raw: {d}, overhead: {d})\n", .{
        m1.net(), m1.cycles, m1.overhead,
    });
    std.debug.print("findMax:    {d} cycles (raw: {d}, overhead: {d})\n", .{
        m2.net(), m2.cycles, m2.overhead,
    });
    std.debug.print("countOnes:  {d} cycles (raw: {d}, overhead: {d})\n", .{
        m3.net(), m3.cycles, m3.overhead,
    });
    std.debug.print("\nPer-element:\n", .{});
    std.debug.print("  sumArray:  {d:.2} cycles/elem\n", .{
        @as(f64, @floatFromInt(m1.net())) / @as(f64, @floatFromInt(data.len)),
    });
    std.debug.print("  findMax:   {d:.2} cycles/elem\n", .{
        @as(f64, @floatFromInt(m2.net())) / @as(f64, @floatFromInt(data.len)),
    });
    std.debug.print("  countOnes: {d:.2} cycles/elem\n", .{
        @as(f64, @floatFromInt(m3.net())) / @as(f64, @floatFromInt(data.len)),
    });
}

A couple of things worth noting about this benchmarking approach. We take the minimum of many measurements, not the average. Why? Because the minimum represents the best case -- when the code ran without interrupts, context switches, or cache misses from other processes. The average includes all that noise. For micro-benchmarks of tight loops, the minimum is the most reproducible and meaningful number.

We also measure and subtract the overhead of rdtscp itself. Two back-to-back rdtscp calls aren't free -- they take around 30-50 cycles depending on the CPU. Subtracting that overhead gives us a cleaner measurement of the actual function we're benchmarking.

The comptime func: anytype parameter with @call(.auto, func, args) is a nice Zig pattern for making the benchmark function generic. We covered comptime parameters thoroughly in episode 9 and generics in episode 14. The compiler monomorphizes measureCycles for each function we pass to it, so there's no function pointer overhead in the measurement loop.

The builtin.cpu.arch check at the top is important -- this code uses x86_64-specific instructions. On ARM or RISC-V, you'd need different assembly (AArch64 has MRS for reading counter registers, as we sketched earlier). Always gate platform-specific assembly behind architecture checks. Thread-level atomics and shared memory are where things get really interesting when you combine them with this kind of low-level timing work.

What we learned

  • asm volatile embeds raw machine instructions in Zig code. The volatile keyword prevents the compiler from removing or reordering the assembly, which is essential for I/O, timing, and hardware access.
  • Output constraints ("=r", "={eax}") tell the compiler where to put results. Input constraints ("r", "{rdi}") tell it where to find inputs. Read-write operands are expressed by tying an input to an output with the matching constraint "0".
  • Clobber lists declare side effects: registers modified as scratch space, "memory" for memory writes, "cc" for condition flags. Forgetting a clobber causes bugs that only appear in optimized builds.
  • RDTSC/RDTSCP reads the CPU's Time Stamp Counter for precise cycle-level timing. RDTSCP serializes instruction execution for more accurate measurements.
  • CPUID queries the CPU for feature flags, vendor string, model info, and more. Useful for runtime feature detection when compiling generic binaries.
  • Zig builtins (@popCount, @byteSwap, @clz, @prefetch) should be preferred over inline assembly for standard operations -- they're portable, optimizable, and type-safe.
  • @prefetch hints the CPU to preload memory into cache. Most useful for irregular access patterns where the hardware prefetcher can't predict what you need next.
  • Benchmarking with RDTSCP: measure many iterations, take the minimum, subtract measurement overhead. Use comptime generics to avoid function pointer overhead in the measurement loop.

Exercises

  1. Write a function cpuBrand() [48]u8 that uses cpuid with leaves 0x80000002, 0x80000003, and 0x80000004 to extract the full CPU brand string (e.g. "Intel(R) Core(TM) i7-12700K" or "AMD Ryzen 9 7950X"). Each leaf returns 16 bytes of the string across EAX, EBX, ECX, EDX. Call it from main and print the result.

  2. Using the rdtscp-based cycle measurement approach from this episode, benchmark three different ways to compute the sum of a [1024]u32 array: (a) a plain for loop, (b) an unrolled loop that processes 4 elements per iteration, and (c) using @Vector(4, u32) to SIMD-sum 4 elements at a time (we covered @Vector in episode 19). Print the cycle count for each approach and calculate the speedup ratio of (b) and (c) relative to (a).

  3. Write a spinWait(target_cycles: u64) function that uses rdtsc in a busy loop to wait for a precise number of CPU cycles. Test it by measuring 1000 cycles, 10000 cycles, and 100000 cycles with rdtscp and print how close each actual wait was to the target. Then explain in a comment why this approach isn't reliable for wall-clock timing and what std.time.Timer does differently.

Greetings!

@scipio



1 comment

It is a great topic to learn how to control hardware closely, thanks for explaining this so well.
