Learn Zig Series (#34) - Performance Profiling and Optimization

What will I learn
- You will learn how to write solutions for the Episode 33 exercises;
- You will learn measuring execution time with std.time.Timer;
- You will learn using std.debug.print timestamps for quick profiling;
- You will learn compiler optimization levels and their effects on generated code;
- You will learn reading disassembly with zig build -Doptimize=ReleaseFast and objdump;
- You will learn cache-friendly data structures: struct of arrays vs array of structs;
- You will learn branch prediction hints and @branchHint;
- You will learn avoiding unnecessary allocations: stack vs heap tradeoffs;
- You will learn a practical example: optimizing a hot loop from 10x slower to matching C.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Zig 0.14+ distribution (download from ziglang.org);
- The ambition to learn Zig programming.
Difficulty
- Intermediate
Curriculum (of the Learn Zig Series):
- Zig Programming Tutorial - ep001 - Intro
- Learn Zig Series (#2) - Hello Zig, Variables and Types
- Learn Zig Series (#3) - Functions and Control Flow
- Learn Zig Series (#4) - Error Handling (Zig's Best Feature)
- Learn Zig Series (#5) - Arrays, Slices, and Strings
- Learn Zig Series (#6) - Structs, Enums, and Tagged Unions
- Learn Zig Series (#7) - Memory Management and Allocators
- Learn Zig Series (#8) - Pointers and Memory Layout
- Learn Zig Series (#9) - Comptime (Zig's Superpower)
- Learn Zig Series (#10) - Project Structure, Modules, and File I/O
- Learn Zig Series (#11) - Mini Project: Building a Step Sequencer
- Learn Zig Series (#12) - Testing and Test-Driven Development
- Learn Zig Series (#13) - Interfaces via Type Erasure
- Learn Zig Series (#14) - Generics with Comptime Parameters
- Learn Zig Series (#15) - The Build System (build.zig)
- Learn Zig Series (#16) - Sentinel-Terminated Types and C Strings
- Learn Zig Series (#17) - Packed Structs and Bit Manipulation
- Learn Zig Series (#18) - Async Concepts and Event Loops
- Learn Zig Series (#18b) - Addendum: Async Returns in Zig 0.16
- Learn Zig Series (#19) - SIMD with @Vector
- Learn Zig Series (#20) - Working with JSON
- Learn Zig Series (#21) - Networking and TCP Sockets
- Learn Zig Series (#22) - Hash Maps and Data Structures
- Learn Zig Series (#23) - Iterators and Lazy Evaluation
- Learn Zig Series (#24) - Logging, Formatting, and Debug Output
- Learn Zig Series (#25) - Mini Project: HTTP Status Checker
- Learn Zig Series (#26) - Writing a Custom Allocator
- Learn Zig Series (#27) - C Interop: Calling C from Zig
- Learn Zig Series (#28) - C Interop: Exposing Zig to C
- Learn Zig Series (#29) - Inline Assembly and Low-Level Control
- Learn Zig Series (#30) - Thread Safety and Atomics
- Learn Zig Series (#31) - Memory-Mapped I/O and Files
- Learn Zig Series (#32) - Compile-Time Reflection with @typeInfo
- Learn Zig Series (#33) - Building a State Machine with Tagged Unions
- Learn Zig Series (#34) - Performance Profiling and Optimization (this post)
Learn Zig Series (#34) - Performance Profiling and Optimization
Solutions to Episode 33 Exercises
Exercise 1 - Math expression tokenizer using a state machine:
const std = @import("std");
const TokenTag = enum { number, op, paren_open, paren_close };
const Token = union(TokenTag) {
number: f64,
op: u8,
paren_open,
paren_close,
};
const TokState = union(enum) {
idle,
reading_int: usize,
reading_decimal: struct { start: usize, dot_pos: usize },
};
fn tokenize(input: []const u8, out: []Token) usize {
var state: TokState = .idle;
var count: usize = 0;
for (input, 0..) |ch, pos| {
switch (state) {
.idle => {
if (ch >= '0' and ch <= '9') {
state = .{ .reading_int = pos };
} else if (ch == '(') {
if (count < out.len) {
out[count] = .paren_open;
count += 1;
}
} else if (ch == ')') {
if (count < out.len) {
out[count] = .paren_close;
count += 1;
}
} else if (ch == '+' or ch == '-' or ch == '*' or ch == '/') {
if (count < out.len) {
out[count] = .{ .op = ch };
count += 1;
}
}
// spaces: stay idle
},
.reading_int => |start| {
if (ch == '.') {
state = .{ .reading_decimal = .{ .start = start, .dot_pos = pos } };
} else if (ch < '0' or ch > '9') {
// end of number
if (count < out.len) {
out[count] = .{ .number = parseNum(input[start..pos]) };
count += 1;
}
state = .idle;
// re-process current char
if (ch == '(') {
if (count < out.len) { out[count] = .paren_open; count += 1; }
} else if (ch == ')') {
if (count < out.len) { out[count] = .paren_close; count += 1; }
} else if (ch == '+' or ch == '-' or ch == '*' or ch == '/') {
if (count < out.len) { out[count] = .{ .op = ch }; count += 1; }
}
}
},
.reading_decimal => |info| {
if (ch < '0' or ch > '9') {
if (count < out.len) {
out[count] = .{ .number = parseNum(input[info.start..pos]) };
count += 1;
}
state = .idle;
if (ch == '(') {
if (count < out.len) { out[count] = .paren_open; count += 1; }
} else if (ch == ')') {
if (count < out.len) { out[count] = .paren_close; count += 1; }
} else if (ch == '+' or ch == '-' or ch == '*' or ch == '/') {
if (count < out.len) { out[count] = .{ .op = ch }; count += 1; }
}
}
},
}
}
// flush trailing number
switch (state) {
.reading_int => |start| {
if (count < out.len) {
out[count] = .{ .number = parseNum(input[start..]) };
count += 1;
}
},
.reading_decimal => |info| {
if (count < out.len) {
out[count] = .{ .number = parseNum(input[info.start..]) };
count += 1;
}
},
.idle => {},
}
return count;
}
fn parseNum(s: []const u8) f64 {
var result: f64 = 0;
var frac: f64 = 0;
var frac_div: f64 = 1;
var past_dot = false;
for (s) |c| {
if (c == '.') {
past_dot = true;
} else {
const d: f64 = @floatFromInt(c - '0');
if (past_dot) {
frac_div *= 10;
frac += d / frac_div;
} else {
result = result * 10 + d;
}
}
}
return result + frac;
}
pub fn main() void {
const input = "123 + 45.6 * (7 - 2)";
var buf: [32]Token = undefined;
const n = tokenize(input, &buf);
for (buf[0..n]) |tok| {
switch (tok) {
.number => |v| std.debug.print("number({d}) ", .{v}),
.op => |c| std.debug.print("op({c}) ", .{c}),
.paren_open => std.debug.print("paren_open ", .{}),
.paren_close => std.debug.print("paren_close ", .{}),
}
}
std.debug.print("\n", .{});
}
The tokenizer follows the same state machine pattern from the episode. idle dispatches on the first character it sees: digits enter reading_int, operators and parens emit tokens immediately. When reading_int hits a . it transitions to reading_decimal. Both number states flush their accumulated value when a non-digit appears and then re-process the current character in the idle handler. The parseNum function avoids using std.fmt.parseFloat here on purpose -- it's a manual parse to keep the example self-contained and illustrate how simple number parsing really is.
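If you'd rather verify the tokenizer mechanically than eyeball the printed output, a small test can be appended to the same file. This is just a sketch -- the expected count of 9 tokens comes from hand-tracing the sample input, so run main first if you want to double-check it:
test "tokenize sample expression" {
    var buf: [32]Token = undefined;
    const n = tokenize("123 + 45.6 * (7 - 2)", &buf);
    try std.testing.expectEqual(@as(usize, 9), n);
    // first token should be the number 123, the last one the closing paren
    switch (buf[0]) {
        .number => |v| try std.testing.expectEqual(@as(f64, 123), v),
        else => return error.TestUnexpectedResult,
    }
    try std.testing.expect(buf[n - 1] == .paren_close);
}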
Exercise 2 - Retry with backoff connection state machine:
const std = @import("std");
const testing = std.testing;
const ConnState = union(enum) {
idle,
connecting: struct { attempt: u8, delay: u8 },
connected: struct { uptime: u32 },
backing_off: struct { remaining: u8, attempt: u8, delay: u8 },
failed: []const u8,
};
const ConnEvent = enum { start, tick, success, failure, reset };
fn connTransition(state: ConnState, event: ConnEvent) ConnState {
return switch (state) {
.idle => switch (event) {
.start => ConnState{ .connecting = .{ .attempt = 1, .delay = 1 } },
else => state,
},
.connecting => |info| switch (event) {
.success => ConnState{ .connected = .{ .uptime = 0 } },
.failure => if (info.attempt >= 5)
ConnState{ .failed = "max attempts reached" }
else
ConnState{ .backing_off = .{
.remaining = info.delay,
.attempt = info.attempt,
.delay = info.delay,
} },
.tick => state,
else => state,
},
.connected => |info| switch (event) {
.tick => ConnState{ .connected = .{ .uptime = info.uptime + 1 } },
.failure => ConnState{ .connecting = .{ .attempt = 1, .delay = 1 } },
.reset => ConnState.idle,
else => state,
},
.backing_off => |info| switch (event) {
.tick => if (info.remaining <= 1)
ConnState{ .connecting = .{
.attempt = info.attempt + 1,
.delay = info.delay * 2,
} }
else
ConnState{ .backing_off = .{
.remaining = info.remaining - 1,
.attempt = info.attempt,
.delay = info.delay,
} },
.reset => ConnState.idle,
else => state,
},
.failed => switch (event) {
.reset => ConnState.idle,
else => state,
},
};
}
test "connect-fail-backoff-retry-succeed cycle" {
var s: ConnState = .idle;
s = connTransition(s, .start);
try testing.expect(s == .connecting);
s = connTransition(s, .failure);
try testing.expect(s == .backing_off);
// backoff delay = 1 tick
s = connTransition(s, .tick);
try testing.expect(s == .connecting);
s = connTransition(s, .failure);
try testing.expect(s == .backing_off);
// backoff delay = 2 ticks
s = connTransition(s, .tick);
try testing.expect(s == .backing_off);
s = connTransition(s, .tick);
try testing.expect(s == .connecting);
// third attempt succeeds
s = connTransition(s, .success);
try testing.expect(s == .connected);
// verify uptime ticks
s = connTransition(s, .tick);
switch (s) {
.connected => |info| try testing.expectEqual(@as(u32, 1), info.uptime),
else => return error.TestUnexpectedResult,
}
}
test "max attempts leads to failed" {
var s: ConnState = ConnState{ .connecting = .{ .attempt = 5, .delay = 16 } };
s = connTransition(s, .failure);
try testing.expect(s == .failed);
// only reset gets us out
s = connTransition(s, .start);
try testing.expect(s == .failed);
s = connTransition(s, .reset);
try testing.expect(s == .idle);
}
pub fn main() void {
std.debug.print("Run with: zig test this_file.zig\n", .{});
}
The backoff doubling happens in the backing_off -> tick transition: when remaining ticks hit zero, we transition back to connecting with attempt + 1 and delay * 2. So the sequence is: delay 1, delay 2, delay 4, delay 8, and if the 5th attempt also fails, we go to failed. The test walks through a partial failure sequence (2 failures with backoff, then success) and a max-attempt scenario.
Exercise 3 - Markdown inline parser for bold, italic, and code spans:
const std = @import("std");
const SpanStyle = enum { plain, bold, italic, code };
const Span = struct {
text: []const u8,
style: SpanStyle,
};
const MdState = union(enum) {
normal: usize,
saw_star: usize, // saw one *, might be italic or bold
italic: usize, // inside *...*
bold: usize, // inside **...**
bold_saw_star: usize, // inside bold, saw one closing *
code: usize, // inside `...`
};
fn parseInline(input: []const u8, out: []Span) usize {
var state: MdState = .{ .normal = 0 };
var count: usize = 0;
for (input, 0..) |ch, pos| {
state = switch (state) {
.normal => |start| blk: {
if (ch == '*') {
if (pos > start and count < out.len) {
out[count] = .{ .text = input[start..pos], .style = .plain };
count += 1;
}
break :blk MdState{ .saw_star = pos };
} else if (ch == '`') {
if (pos > start and count < out.len) {
out[count] = .{ .text = input[start..pos], .style = .plain };
count += 1;
}
break :blk MdState{ .code = pos + 1 };
} else break :blk state;
},
.saw_star => |star_pos| blk: {
if (ch == '*') {
// double star -> bold open
break :blk MdState{ .bold = pos + 1 };
} else {
// single star -> italic open
break :blk MdState{ .italic = star_pos + 1 };
}
},
.italic => |start| blk: {
if (ch == '*') {
if (count < out.len) {
out[count] = .{ .text = input[start..pos], .style = .italic };
count += 1;
}
break :blk MdState{ .normal = pos + 1 };
} else break :blk state;
},
.bold => |start| blk: {
if (ch == '*') {
break :blk MdState{ .bold_saw_star = start };
} else break :blk state;
},
.bold_saw_star => |start| blk: {
if (ch == '*') {
// closing **
const end = pos - 1;
if (count < out.len) {
out[count] = .{ .text = input[start..end], .style = .bold };
count += 1;
}
break :blk MdState{ .normal = pos + 1 };
} else {
// false alarm, the * was inside bold text
break :blk MdState{ .bold = start };
}
},
.code => |start| blk: {
if (ch == '`') {
if (count < out.len) {
out[count] = .{ .text = input[start..pos], .style = .code };
count += 1;
}
break :blk MdState{ .normal = pos + 1 };
} else break :blk state;
},
};
}
// flush remaining text as plain
const remaining_start = switch (state) {
.normal => |s| s,
.saw_star => |s| s,
.italic => |s| s -| 1,
.bold => |s| if (s >= 2) s - 2 else 0,
.bold_saw_star => |s| if (s >= 2) s - 2 else 0,
.code => |s| if (s >= 1) s - 1 else 0,
};
if (remaining_start < input.len and count < out.len) {
out[count] = .{ .text = input[remaining_start..], .style = .plain };
count += 1;
}
return count;
}
pub fn main() void {
const input = "hello **bold** and *italic* with `code` end";
var buf: [32]Span = undefined;
const n = parseInline(input, &buf);
for (buf[0..n]) |span| {
std.debug.print("[{s}] \"{s}\"\n", .{ @tagName(span.style), span.text });
}
}
The tricky part is the saw_star state -- when we see a single *, we don't know yet if it's italic (*text*) or the start of bold (**text**). We need to peek at the NEXT character. The state machine handles this naturally: saw_star consumes one character and decides. The bold_saw_star state does the same thing at the closing end -- one * inside bold might be a stray asterisk or the start of **. If the next char is also *, we close; otherwise we go back to bold. Unclosed markers at end-of-input get flushed as plain text.
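Same idea here -- a quick test pins the behavior down. By my trace, the sample input splits into 7 spans (plain/bold/plain/italic/plain/code/plain); treat that count as an assumption to verify, not gospel:
test "parse mixed inline markdown" {
    var buf: [32]Span = undefined;
    const n = parseInline("hello **bold** and *italic* with `code` end", &buf);
    try std.testing.expectEqual(@as(usize, 7), n);
    // spot-check the styled spans
    try std.testing.expectEqualStrings("bold", buf[1].text);
    try std.testing.expectEqual(SpanStyle.bold, buf[1].style);
    try std.testing.expectEqualStrings("italic", buf[3].text);
    try std.testing.expectEqual(SpanStyle.italic, buf[3].style);
    try std.testing.expectEqual(SpanStyle.code, buf[5].style);
}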
Alright, let's shift gears completely. Today's episode is all about performance -- measuring it, understanding it, and improving it. If you've been following along since episode 7 (memory management) and episode 8 (pointers and memory layout), you already have the foundational understanding of how Zig manages memory at a low level. Now we're going to use that knowledge to write code that's actually fast -- and more importantly, to PROVE it's fast with measurements rather than guesswork ;-)
Measuring execution time with std.time.Timer
The first rule of optimization: measure before you change anything. Zig's standard library gives you std.time.Timer for high-resolution timing. It uses the best available clock on your platform (CLOCK_MONOTONIC on Linux, mach_absolute_time on macOS, QueryPerformanceCounter on Windows).
const std = @import("std");
fn doWork(data: []u32) u64 {
var sum: u64 = 0;
for (data) |val| {
sum +%= val;
}
return sum;
}
pub fn main() !void {
var prng = std.Random.DefaultPrng.init(42);
var data: [1_000_000]u32 = undefined;
for (&data) |*slot| {
slot.* = prng.random().int(u32);
}
var timer = try std.time.Timer.start();
const result = doWork(&data);
const elapsed = timer.read();
std.debug.print("sum = {d}\n", .{result});
std.debug.print("elapsed: {d} ns ({d:.3} ms)\n", .{
elapsed,
@as(f64, @floatFromInt(elapsed)) / 1_000_000.0,
});
}
Timer.start() captures the current timestamp and timer.read() returns the elapsed time in nanoseconds as a u64. That's it -- no ceremony, no global state, no "begin/end" pairs to match up.
A couple of things to be aware of. First, the timer measures wall-clock time, not CPU time. If your OS scheduler suspends your thread mid-measurement, that time is included. For micro-benchmarks this can add noise, so you typically want to run multiple iterations and take the minimum (the minimum is least affected by scheduling jitter, unlike the average which includes all the outliers).
Second, be careful with Debug builds. The default build mode has safety checks, bounds checking, and zero optimizations. Timing unoptimized code tells you nothing about production performance. Always profile with ReleaseFast or ReleaseSafe (we'll cover these in detail below).
const std = @import("std");
fn benchmarkFn(comptime func: anytype, args: anytype, iterations: u32) u64 {
var best: u64 = std.math.maxInt(u64);
for (0..iterations) |_| {
var timer = std.time.Timer.start() catch unreachable;
const result = @call(.auto, func, args);
const elapsed = timer.read();
std.mem.doNotOptimizeAway(result);
if (elapsed < best) best = elapsed;
}
return best;
}
fn sumNaive(data: []const u32) u64 {
var acc: u64 = 0;
for (data) |v| acc += v;
return acc;
}
pub fn main() !void {
var prng = std.Random.DefaultPrng.init(12345);
var data: [500_000]u32 = undefined;
for (&data) |*slot| {
slot.* = prng.random().int(u32);
}
const ns = benchmarkFn(sumNaive, .{&data}, 100);
std.debug.print("best of 100: {d} ns ({d:.2} ms)\n", .{
ns,
@as(f64, @floatFromInt(ns)) / 1_000_000.0,
});
}
Note the std.mem.doNotOptimizeAway(result) call. In optimized builds, the compiler may realize you never USE the result of func and just... delete the entire computation. doNotOptimizeAway tells the optimizer "pretend this value escapes to somewhere you can't see" so it has to actually compute it. Without this, your benchmark might report 0 nanoseconds because the compiler optimized away the thing you're trying to measure. Ask me how I know ;-)
std.debug.print timestamps for quick profiling
Sometimes you don't need a precise benchmark framework -- you just want to know "which part of my program is slow." The quick-and-dirty approach is printing timestamps at interesting points:
const std = @import("std");
fn timestamp() i128 {
return std.time.nanoTimestamp();
}
fn slowSetup(allocator: std.mem.Allocator) ![]u8 {
const buf = try allocator.alloc(u8, 10_000_000);
for (buf, 0..) |*b, i| {
b.* = @truncate(i *% 17 +% 3);
}
return buf;
}
fn processData(data: []const u8) u64 {
var acc: u64 = 0;
for (data) |b| {
if (b > 128) acc += 1;
}
return acc;
}
pub fn main() !void {
const allocator = std.heap.page_allocator;
const t0 = timestamp();
const data = try slowSetup(allocator);
const t1 = timestamp();
const result = processData(data);
const t2 = timestamp();
allocator.free(data);
const t3 = timestamp();
std.debug.print("setup: {d:.2} ms\n", .{@as(f64, @floatFromInt(t1 - t0)) / 1e6});
std.debug.print("process: {d:.2} ms\n", .{@as(f64, @floatFromInt(t2 - t1)) / 1e6});
std.debug.print("free: {d:.2} ms\n", .{@as(f64, @floatFromInt(t3 - t2)) / 1e6});
std.debug.print("total: {d:.2} ms\n", .{@as(f64, @floatFromInt(t3 - t0)) / 1e6});
std.debug.print("result: {d}\n", .{result});
}
This is the "printf debugging" of performance work. It's not rigorous, but it tells you immediately whether your bottleneck is in setup, processing, or cleanup. I use this approach constantly during early development -- you can always replace it with proper benchmarks later once you know WHERE to focus.
The key insight: nanoTimestamp() returns an i128 (signed 128-bit integer) representing nanoseconds. The subtraction gives you elapsed nanoseconds. Dividing by 1e6 converts to milliseconds for human-readable output. Good enough for "is this function taking 2ms or 200ms?" questions.
Compiler optimization levels and their effects on generated code
Zig gives you four optimization modes, and the difference between them is MASSIVE -- we're talking 10x to 50x performance gaps for the same source code. Here's what each one does:
Debug (default): No optimizations. Full safety checks (bounds checking, integer overflow detection, null pointer checks). Debug info included. Stack traces on crashes. This is what you get with a plain zig build or zig run.
ReleaseSafe: Optimizations enabled (equivalent to LLVM -O2), BUT safety checks are still active. This gives you fast code that still catches undefined behavior. Good for production servers where you'd rather crash than corrupt data.
ReleaseFast: Aggressive optimizations (roughly LLVM -O3). Safety checks DISABLED -- undefined behavior is actual undefined behavior, just like C. Fastest possible code.
ReleaseSmall: Optimizations focused on binary size (LLVM -Oz). Safety checks disabled. Use this for embedded or Wasm where every byte counts.
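Here's a deliberately naive 256x256 matrix multiply you can compile in each mode to feel the difference: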
const std = @import("std");
fn matmul(a: []const f64, b: []const f64, c: []f64, n: usize) void {
for (0..n) |i| {
for (0..n) |j| {
var sum: f64 = 0;
for (0..n) |k| {
sum += a[i * n + k] * b[k * n + j];
}
c[i * n + j] = sum;
}
}
}
pub fn main() !void {
const n = 256;
var a: [n * n]f64 = undefined;
var b: [n * n]f64 = undefined;
var c: [n * n]f64 = undefined;
var prng = std.Random.DefaultPrng.init(99);
for (&a) |*v| v.* = prng.random().float(f64);
for (&b) |*v| v.* = prng.random().float(f64);
var timer = try std.time.Timer.start();
matmul(&a, &b, &c, n);
const elapsed = timer.read();
std.mem.doNotOptimizeAway(&c);
std.debug.print("256x256 matmul: {d:.2} ms\n", .{
@as(f64, @floatFromInt(elapsed)) / 1_000_000.0,
});
}
Try compiling this with each mode:
zig build-exe matmul.zig # Debug
zig build-exe matmul.zig -O ReleaseSafe # ReleaseSafe
zig build-exe matmul.zig -O ReleaseFast # ReleaseFast
zig build-exe matmul.zig -O ReleaseSmall # ReleaseSmall
On my machine the 256x256 matrix multiply takes roughly 120ms in Debug, 8ms in ReleaseSafe, 5ms in ReleaseFast, and 9ms in ReleaseSmall. The Debug-to-ReleaseFast gap is about 24x. That's not unusual -- bounds checks on every array access in a triple-nested loop are expensive.
Having said that, don't just slap ReleaseFast on everything and call it a day. The safety checks that Debug and ReleaseSafe provide are incredibly valuable during development. My workflow: develop and test in Debug, benchmark in ReleaseFast, deploy in ReleaseSafe (unless you've profiled and the safety overhead is unacceptable for your specific use case).
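A related trick, not from the episode but handy here (a minimal sketch): the active mode is visible at compile time through @import("builtin").mode, so a benchmark can warn you when it's accidentally being run in Debug:
const std = @import("std");
const builtin = @import("builtin");
pub fn main() void {
    // builtin.mode is the optimize mode this binary was compiled with
    if (builtin.mode == .Debug) {
        std.debug.print("warning: this is a Debug build -- benchmark numbers will be meaningless\n", .{});
    }
    std.debug.print("compiled with -O {s}\n", .{@tagName(builtin.mode)});
}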
Reading disassembly with zig build -Doptimize=ReleaseFast and objdump
Sometimes you need to see what the compiler is actually generating. Maybe your benchmarks show an unexpected performance cliff, or you want to verify the compiler is vectorizing a loop. Zig makes this straightforward because it produces native binaries via LLVM -- you can use standard tools like objdump to inspect the output.
First, build an optimized binary:
zig build-exe example.zig -O ReleaseFast
Then disassemble it:
objdump -d -M intel example | less
The -M intel flag gives you Intel syntax (which I find more readable than AT&T). Look for the function you care about -- it'll be labeled in the output.
Here's a concrete example. Consider these two functions that do the same thing -- sum an array:
const std = @import("std");
export fn sumLoop(ptr: [*]const u32, len: usize) u64 {
var acc: u64 = 0;
for (ptr[0..len]) |val| {
acc += val;
}
return acc;
}
export fn sumUnrolled(ptr: [*]const u32, len: usize) u64 {
var acc0: u64 = 0;
var acc1: u64 = 0;
var acc2: u64 = 0;
var acc3: u64 = 0;
var i: usize = 0;
const chunks = len / 4;
while (i < chunks * 4) : (i += 4) {
acc0 += ptr[i];
acc1 += ptr[i + 1];
acc2 += ptr[i + 2];
acc3 += ptr[i + 3];
}
while (i < len) : (i += 1) {
acc0 += ptr[i];
}
return acc0 + acc1 + acc2 + acc3;
}
Compile with -O ReleaseFast and check the disassembly. In many cases, LLVM will auto-vectorize sumLoop to use SIMD instructions (like vpaddd on x86), making both versions equally fast. But sometimes the manual unrolling helps the compiler use multiple accumulators, breaking the data dependency chain and improving throughput on out-of-order CPUs. The disassembly tells you what actually happened -- no guessing.
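One practical note: this snippet has no main, so build it as an object file and grep the disassembly for the exported symbols (the grep context widths are just my habit -- adjust to taste):
zig build-obj example.zig -O ReleaseFast
objdump -d -M intel example.o | grep -A 40 '<sumLoop>:'
objdump -d -M intel example.o | grep -A 60 '<sumUnrolled>:'
Zig can also emit the assembly directly via -femit-asm if you'd rather skip objdump entirely.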
If you're using a build.zig, you can set the optimization mode in the build script:
const std = @import("std");
pub fn build(b: *std.Build) void {
const target = b.standardTargetOptions(.{});
const optimize = b.standardOptimizeOption(.{});
const exe = b.addExecutable(.{
.name = "bench",
.root_source_file = b.path("src/main.zig"),
.target = target,
.optimize = optimize,
});
b.installArtifact(exe);
}
Then build with: zig build -Doptimize=ReleaseFast. The standardOptimizeOption call is what exposes the -Doptimize option on your build script's command line.
Cache-friendly data structures: struct of arrays vs array of structs
This is where understanding hardware really pays off. Modern CPUs don't read memory one byte at a time -- they read entire cache lines (typically 64 bytes on x86). When you iterate over data, the CPU prefetcher tries to pull the NEXT cache line into L1 cache before you need it. If your data layout matches your access pattern, the prefetcher keeps up and you get near-peak throughput. If it doesn't, you're hitting main memory on almost every access and that's 100x slower than L1.
The classic example: Array of Structs (AoS) vs Struct of Arrays (SoA).
const std = @import("std");
// Array of Structs -- each Particle is stored contiguously
const ParticleAoS = struct {
x: f32,
y: f32,
z: f32,
vx: f32,
vy: f32,
vz: f32,
mass: f32,
charge: f32, // 32 bytes total per particle
};
// Struct of Arrays -- each field is stored contiguously
const ParticlesSoA = struct {
x: []f32,
y: []f32,
z: []f32,
vx: []f32,
vy: []f32,
vz: []f32,
mass: []f32,
charge: []f32,
};
fn updatePositionsAoS(particles: []ParticleAoS, dt: f32) void {
for (particles) |*p| {
p.x += p.vx * dt;
p.y += p.vy * dt;
p.z += p.vz * dt;
}
}
fn updatePositionsSoA(ps: ParticlesSoA, dt: f32) void {
for (ps.x, ps.vx) |*x, vx| x.* += vx * dt;
for (ps.y, ps.vy) |*y, vy| y.* += vy * dt;
for (ps.z, ps.vz) |*z, vz| z.* += vz * dt;
}
pub fn main() !void {
const count = 1_000_000;
const allocator = std.heap.page_allocator;
// AoS setup
const aos = try allocator.alloc(ParticleAoS, count);
defer allocator.free(aos);
for (aos) |*p| {
p.* = .{ .x = 1.0, .y = 2.0, .z = 3.0, .vx = 0.1, .vy = 0.2, .vz = 0.3, .mass = 1.0, .charge = 0.5 };
}
// SoA setup
var soa: ParticlesSoA = undefined;
soa.x = try allocator.alloc(f32, count);
soa.y = try allocator.alloc(f32, count);
soa.z = try allocator.alloc(f32, count);
soa.vx = try allocator.alloc(f32, count);
soa.vy = try allocator.alloc(f32, count);
soa.vz = try allocator.alloc(f32, count);
soa.mass = try allocator.alloc(f32, count);
soa.charge = try allocator.alloc(f32, count);
defer {
allocator.free(soa.x);
allocator.free(soa.y);
allocator.free(soa.z);
allocator.free(soa.vx);
allocator.free(soa.vy);
allocator.free(soa.vz);
allocator.free(soa.mass);
allocator.free(soa.charge);
}
for (soa.x) |*v| v.* = 1.0;
for (soa.y) |*v| v.* = 2.0;
for (soa.z) |*v| v.* = 3.0;
for (soa.vx) |*v| v.* = 0.1;
for (soa.vy) |*v| v.* = 0.2;
for (soa.vz) |*v| v.* = 0.3;
// Benchmark AoS
var best_aos: u64 = std.math.maxInt(u64);
for (0..20) |_| {
var timer = try std.time.Timer.start();
updatePositionsAoS(aos, 0.016);
const e = timer.read();
if (e < best_aos) best_aos = e;
}
// Benchmark SoA
var best_soa: u64 = std.math.maxInt(u64);
for (0..20) |_| {
var timer = try std.time.Timer.start();
updatePositionsSoA(soa, 0.016);
const e = timer.read();
if (e < best_soa) best_soa = e;
}
std.debug.print("AoS: {d:.2} ms\n", .{@as(f64, @floatFromInt(best_aos)) / 1e6});
std.debug.print("SoA: {d:.2} ms\n", .{@as(f64, @floatFromInt(best_soa)) / 1e6});
std.debug.print("speedup: {d:.1}x\n", .{@as(f64, @floatFromInt(best_aos)) / @as(f64, @floatFromInt(best_soa))});
}
With AoS, each particle is 32 bytes. When you read p.x the CPU loads a cache line containing x, y, z, vx, vy, vz, mass, and charge -- but updatePositionsAoS only USES x, y, z, vx, vy, vz (6 of 8 fields). That's 75% utilization. Worse, the next particle is 32 bytes away, so you get fewer particles per cache line.
With SoA, all the x values are packed together. When the CPU loads a cache line of x values, it gets 16 consecutive floats (64 bytes / 4 bytes each), ALL of which will be used. The prefetcher is happy, the cache is fully utilized, and the compiler can auto-vectorize the loop with SIMD because the data is perfectly aligned and contiguous.
On a typical modern CPU with 1M particles, the SoA version runs 2-3x faster than AoS for this kind of "iterate and touch a few fields" workload. The exact ratio depends on your cache hierarchy and whether the compiler vectorizes, but the direction is always the same: SoA wins when you iterate over many elements touching a subset of fields.
This doesn't mean AoS is always wrong. If you frequently access ALL fields of a single particle (like in collision detection where you need x, y, z, mass together), AoS keeps related data on the same cache line and avoids bouncing between multiple arrays. The right choice depends on your access patterns -- which is why you measure first ;-)
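By the way, you don't have to hand-roll eight parallel slices to get SoA: the standard library ships std.MultiArrayList, which stores each struct field in its own backing array. A minimal sketch, under the assumption that the items(.field) accessor is how you grab one field's slice:
const std = @import("std");
const Particle = struct { x: f32, y: f32, vx: f32, vy: f32 };
pub fn main() !void {
    const allocator = std.heap.page_allocator;
    // MultiArrayList keeps one contiguous array per field (SoA layout)
    var particles = std.MultiArrayList(Particle){};
    defer particles.deinit(allocator);
    try particles.append(allocator, .{ .x = 1.0, .y = 2.0, .vx = 0.1, .vy = 0.2 });
    try particles.append(allocator, .{ .x = 3.0, .y = 4.0, .vx = 0.3, .vy = 0.4 });
    // items(.x) returns a []f32 over just the x values -- cache-friendly iteration
    const xs = particles.items(.x);
    const vxs = particles.items(.vx);
    for (xs, vxs) |*x, vx| x.* += vx * 0.016;
    std.debug.print("x[0]={d:.3} x[1]={d:.3}\n", .{ xs[0], xs[1] });
}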
Branch prediction hints and @branchHint
Modern CPUs are heavily pipelined. When they encounter a branch (if/else, switch, loop condition), they don't wait to evaluate it -- they PREDICT which way it'll go and speculatively execute down that path. If the prediction is right, no time wasted. If it's wrong, the CPU has to flush the pipeline and start over -- a "branch misprediction penalty" of roughly 10-20 cycles on modern hardware.
Zig gives you @branchHint to tell the compiler (and through it, the CPU) which branch is more likely:
const std = @import("std");
fn parseByte(b: u8) enum { ascii, extended } {
if (b < 128) {
@branchHint(.likely);
return .ascii;
} else {
return .extended;
}
}
fn processBuffer(data: []const u8) struct { ascii: u64, extended: u64 } {
var ascii_count: u64 = 0;
var ext_count: u64 = 0;
for (data) |b| {
switch (parseByte(b)) {
.ascii => ascii_count += 1,
.extended => ext_count += 1,
}
}
return .{ .ascii = ascii_count, .extended = ext_count };
}
pub fn main() !void {
// Generate mostly-ASCII data (realistic for English text)
var data: [1_000_000]u8 = undefined;
var prng = std.Random.DefaultPrng.init(42);
for (&data) |*b| {
// 95% ASCII, 5% extended
const r = prng.random().int(u8);
b.* = if (r < 242) r % 128 else 128 + (r % 128);
}
var timer = try std.time.Timer.start();
const result = processBuffer(&data);
const elapsed = timer.read();
std.debug.print("ascii={d} extended={d} time={d} ns\n", .{
result.ascii,
result.extended,
elapsed,
});
}
@branchHint(.likely) tells the compiler "this branch is taken most of the time." The compiler may rearrange the code so the likely path is the straight-through path (no taken jump), which keeps the hot path contiguous in the instruction cache and plays nicely with the CPU's branch predictor.
The flip side is @branchHint(.unlikely) for error paths or rare conditions:
fn safeDivide(a: i32, b: i32) !i32 {
if (b == 0) {
@branchHint(.unlikely);
return error.DivisionByZero;
}
return @divTrunc(a, b);
}
When to use branch hints: Only when you KNOW the probability distribution from profiling or domain knowledge. Don't sprinkle .likely on random branches hoping it helps -- if you're wrong, you're actively making the code slower by misleading the predictor. The classic good use cases are error checks (errors are rare, so .unlikely), hot loops where one case dominates (like ASCII text processing), and parsing where valid input is the common case.
Avoiding unnecessary allocations: stack vs heap tradeoffs
Every heap allocation has overhead: the allocator has to find a free block, update its bookkeeping, and the allocated memory might be far from what you're currently working on (cold cache). Stack allocations are essentially free -- the stack pointer is already there, and the memory is already in cache because you just returned from the previous function call.
const std = @import("std");
// Heap version -- allocates a buffer dynamically
fn buildStringHeap(allocator: std.mem.Allocator, parts: []const []const u8) ![]u8 {
var total_len: usize = 0;
for (parts) |p| total_len += p.len;
const buf = try allocator.alloc(u8, total_len);
var offset: usize = 0;
for (parts) |p| {
@memcpy(buf[offset..][0..p.len], p);
offset += p.len;
}
return buf;
}
// Stack version -- uses a fixed buffer on the stack
fn buildStringStack(parts: []const []const u8) struct { data: [4096]u8, len: usize } {
var buf: [4096]u8 = undefined;
var offset: usize = 0;
for (parts) |p| {
if (offset + p.len > buf.len) break;
@memcpy(buf[offset..][0..p.len], p);
offset += p.len;
}
return .{ .data = buf, .len = offset };
}
pub fn main() !void {
const allocator = std.heap.page_allocator;
const parts = [_][]const u8{ "Hello", " ", "World", "!", " Zig", " is", " fast" };
// Benchmark heap
var best_heap: u64 = std.math.maxInt(u64);
for (0..10_000) |_| {
var timer = try std.time.Timer.start();
const s = try buildStringHeap(allocator, &parts);
const e = timer.read();
allocator.free(s);
if (e < best_heap) best_heap = e;
}
// Benchmark stack
var best_stack: u64 = std.math.maxInt(u64);
for (0..10_000) |_| {
var timer = try std.time.Timer.start();
const s = buildStringStack(&parts);
const e = timer.read();
std.mem.doNotOptimizeAway(&s);
if (e < best_stack) best_stack = e;
}
std.debug.print("heap: {d} ns\n", .{best_heap});
std.debug.print("stack: {d} ns\n", .{best_stack});
}
The stack version avoids the allocator entirely. For small, bounded-size outputs this is often a huge win. The tradeoff is obvious: you need to know the maximum size at compile time (or accept truncation), and you can't return the buffer from the function without copying it (the stack frame gets reclaimed when the function returns).
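One way around the "can't return stack memory" limitation is to let the caller own the buffer and have the function return a slice into it -- the same shape as std.fmt.bufPrint. A minimal sketch (joinParts is just a made-up helper for illustration):
const std = @import("std");
// Caller provides the storage; we return the portion we actually used.
fn joinParts(buf: []u8, parts: []const []const u8) ![]u8 {
    var offset: usize = 0;
    for (parts) |p| {
        if (offset + p.len > buf.len) return error.NoSpaceLeft;
        @memcpy(buf[offset..][0..p.len], p);
        offset += p.len;
    }
    return buf[0..offset];
}
pub fn main() !void {
    var buf: [64]u8 = undefined;
    const joined = try joinParts(&buf, &.{ "Hello", " ", "World" });
    std.debug.print("{s}\n", .{joined});
    // std.fmt.bufPrint follows the same caller-owned-buffer pattern
    var fmt_buf: [32]u8 = undefined;
    const line = try std.fmt.bufPrint(&fmt_buf, "{d} bytes joined", .{joined.len});
    std.debug.print("{s}\n", .{line});
}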
Zig's std.heap.FixedBufferAllocator gives you a nice middle ground -- you pre-allocate a buffer (on the stack or as a static global) and hand it to code that expects an Allocator interface:
const std = @import("std");
fn doStuffWithAllocator(allocator: std.mem.Allocator) !void {
var list = std.ArrayList(u32).init(allocator);
defer list.deinit();
for (0..100) |i| {
try list.append(@intCast(i));
}
std.debug.print("list has {d} items\n", .{list.items.len});
}
pub fn main() !void {
// 8KB buffer on the stack -- zero heap allocations
var stack_buf: [8192]u8 = undefined;
var fba = std.heap.FixedBufferAllocator.init(&stack_buf);
try doStuffWithAllocator(fba.allocator());
}
This is a pattern I use constantly for short-lived operations: parsing a single request, formatting a log line, building a small lookup table. The FixedBufferAllocator is deterministic (no syscalls, no fragmentation), fast (bump pointer allocation), and stack-friendly. Just make sure the buffer is big enough -- if it runs out you get error.OutOfMemory, which is a lot better than silently corrupting memory like you'd get in C with a too-small buffer.
Practical example: optimizing a hot loop from 10x slower to matching C
Let's put it all together. We'll start with a naive implementation of a function that counts how many values in a large array fall within a range, and optimize it step by step until it matches what an equivalent C program produces.
Version 1: naive
const std = @import("std");
fn countInRangeV1(data: []const u32, low: u32, high: u32) u64 {
var count: u64 = 0;
for (data) |val| {
if (val >= low and val <= high) {
count += 1;
}
}
return count;
}
pub fn main() !void {
const n: usize = 10_000_000;
const allocator = std.heap.page_allocator;
const data = try allocator.alloc(u32, n);
defer allocator.free(data);
var prng = std.Random.DefaultPrng.init(12345);
for (data) |*v| {
v.* = prng.random().int(u32);
}
var best: u64 = std.math.maxInt(u64);
var result: u64 = 0;
for (0..50) |_| {
var timer = try std.time.Timer.start();
result = countInRangeV1(data, 1_000_000, 2_000_000_000);
const elapsed = timer.read();
if (elapsed < best) best = elapsed;
}
std.debug.print("V1 count={d} time={d:.2} ms\n", .{
result,
@as(f64, @floatFromInt(best)) / 1e6,
});
}
This is clean, readable, and correct. In Debug mode it might take 80ms+ for 10M elements. In ReleaseFast it drops dramatically because LLVM can auto-vectorize this pattern. But let's see if we can help it along.
Version 2: branchless comparison
The branch if (val >= low and val <= high) can cause mispredictions when the data is uniformly random (roughly 46% of values are in range -- nearly worst-case for the predictor). We can convert this to branchless arithmetic:
fn countInRangeV2(data: []const u32, low: u32, high: u32) u64 {
var count: u64 = 0;
for (data) |val| {
count += @intFromBool(val >= low and val <= high);
}
return count;
}
@intFromBool converts a bool to u1 (0 or 1) without branching. The compiler turns the comparison into a conditional move or a subtract-and-shift sequence that the CPU can execute without a branch at all. On random data with ~50% hit rate, this can be meaningfully faster because there are zero mispredictions.
Version 3: manual SIMD (from episode 19)
If the compiler isn't auto-vectorizing to your satisfaction, you can use @Vector explicitly (as we covered in episode 19):
fn countInRangeV3(data: []const u32, low: u32, high: u32) u64 {
const vec_len = 8;
const low_vec: @Vector(vec_len, u32) = @splat(low);
const high_vec: @Vector(vec_len, u32) = @splat(high);
var count: u64 = 0;
var i: usize = 0;
// Process 8 elements at a time
while (i + vec_len <= data.len) : (i += vec_len) {
const chunk: @Vector(vec_len, u32) = data[i..][0..vec_len].*;
const ge_low = chunk >= low_vec;
const le_high = chunk <= high_vec;
const in_range = @select(bool, ge_low, le_high, @as(@Vector(vec_len, bool), @splat(false)));
count += @popCount(@as(u8, @bitCast(in_range)));
}
// Handle remainder
while (i < data.len) : (i += 1) {
count += @intFromBool(data[i] >= low and data[i] <= high);
}
return count;
}
This processes 8 elements per iteration using SIMD comparisons. The @select combines the two comparison masks, and @popCount counts how many bits are set (i.e. how many elements passed both tests). The remainder loop handles the tail elements that don't fill a full vector.
Putting them head to head:
const std = @import("std");
fn bench(comptime f: anytype, data: []const u32, low: u32, high: u32) struct { result: u64, ns: u64 } {
var best: u64 = std.math.maxInt(u64);
var result: u64 = 0;
for (0..50) |_| {
var timer = std.time.Timer.start() catch unreachable;
result = f(data, low, high);
const elapsed = timer.read();
std.mem.doNotOptimizeAway(result);
if (elapsed < best) best = elapsed;
}
return .{ .result = result, .ns = best };
}
pub fn main() !void {
const n: usize = 10_000_000;
const allocator = std.heap.page_allocator;
const data = try allocator.alloc(u32, n);
defer allocator.free(data);
var prng = std.Random.DefaultPrng.init(12345);
for (data) |*v| v.* = prng.random().int(u32);
const low: u32 = 1_000_000;
const high: u32 = 2_000_000_000;
const r1 = bench(countInRangeV1, data, low, high);
const r2 = bench(countInRangeV2, data, low, high);
const r3 = bench(countInRangeV3, data, low, high);
std.debug.print("V1 (naive): {d:.2} ms count={d}\n", .{ @as(f64, @floatFromInt(r1.ns)) / 1e6, r1.result });
std.debug.print("V2 (branchless): {d:.2} ms count={d}\n", .{ @as(f64, @floatFromInt(r2.ns)) / 1e6, r2.result });
std.debug.print("V3 (SIMD): {d:.2} ms count={d}\n", .{ @as(f64, @floatFromInt(r3.ns)) / 1e6, r3.result });
}
With ReleaseFast, on a modern x86 machine you'll typically see V1 and V2 performing similarly (LLVM auto-vectorizes both), and V3 being comparable or slightly faster depending on whether your manual vector width matches what LLVM would have chosen. The real gains show up in Debug mode where V1 is dramatically slower due to safety checks and no optimization.
The lesson here: the compiler is good at its job. In ReleaseFast, naive idiomatic Zig often compiles to the same machine code as hand-tuned C. The cases where manual optimization matters are:
- When the compiler can't prove two pointers don't alias (Zig lets you mark parameters noalias to help it)
- When you need a specific SIMD width or instruction
- When the data layout is the bottleneck (SoA vs AoS) and the compiler can't restructure it for you
- When branch prediction is killing you and you need branchless code
For everything else, write clean code, compile with ReleaseFast, and let LLVM do its thing. That is genuinely the correct approach 90% of the time.
What we learned
- std.time.Timer gives you nanosecond-precision timing -- always measure before optimizing, and always use doNotOptimizeAway to prevent the compiler from deleting your benchmark
- Quick profiling with nanoTimestamp() differences helps identify which phase of your program is the bottleneck
- Debug builds can be 10-50x slower than ReleaseFast due to safety checks and zero optimizations -- develop in Debug, benchmark in ReleaseFast, deploy in ReleaseSafe
- objdump -d -M intel reveals what the compiler actually generated -- use it when benchmarks surprise you
- SoA (Struct of Arrays) beats AoS (Array of Structs) when iterating over many elements touching few fields, because it maximizes cache line utilization and enables auto-vectorization
- @branchHint(.likely) and @branchHint(.unlikely) help the compiler lay out code for the common path -- use them for error checks and known probability distributions, NOT for random guesses
- Stack allocations and FixedBufferAllocator avoid heap overhead entirely for short-lived, bounded-size operations
- Branchless code via @intFromBool eliminates branch mispredictions for nearly-50/50 conditions
- Zig with ReleaseFast produces code competitive with hand-tuned C in most cases -- the compiler is your friend, write clean code first
Exercises
Exercise 1 - Write a benchmark that compares linear search vs binary search on a sorted array of 1 million u32 values. Search for 10,000 random target values and measure the total time for each approach. Use std.time.Timer and doNotOptimizeAway. Run in both Debug and ReleaseFast and compare the ratios. At what array size does binary search start winning?
Exercise 2 - Implement a SoA version of a 2D particle simulation. Each particle has position (x, y), velocity (vx, vy), and a bool active flag. Write updatePositions (add velocity * dt to position) and countActive (count active particles) functions in both AoS and SoA layouts. Benchmark both with 2 million particles. Also try a "hybrid" layout where position and velocity are SoA but the active flag is a separate packed bit array (std.bit_set.ArrayBitSet) -- does the bit-packing help or hurt compared to a plain []bool?
Exercise 3 - Take the matrix multiply from the optimization levels section and write three versions: naive (ijk loop), transposed (transpose B first, then iterate cache-friendly), and blocked/tiled (process 32x32 sub-blocks to maximize L1 cache reuse). Benchmark all three at sizes 128, 256, and 512 in ReleaseFast. Check the disassembly of the tiled version -- does LLVM vectorize the inner loop?
Here we go, that was quite some material for one episode! See you in the next one ;-)