Learn Zig Series (#31) - Memory-Mapped I/O and Files



What will I learn

  • You will learn how to write solutions for the Episode 30 exercises;
  • You will learn what memory-mapped files are and why they matter for systems programming;
  • You will learn how to use std.posix.mmap to map files into your address space;
  • You will learn how to read large files efficiently without buffered I/O;
  • You will learn how to modify files through memory mappings with PROT.WRITE;
  • You will learn shared vs private mappings and when each applies;
  • You will learn anonymous mappings for large allocations outside the allocator;
  • You will learn how to unmap with munmap and why defer is your best friend here;
  • You will learn a practical example: building a fast byte frequency counter using mmap.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Zig 0.14+ distribution (download from ziglang.org);
  • The ambition to learn Zig programming.

Difficulty

  • Intermediate

Curriculum (of the Learn Zig Series):

Learn Zig Series (#31) - Memory-Mapped I/O and Files

Solutions to Episode 30 Exercises

Exercise 1 - Thread pool with 4 workers, task queue, submit and shutdown:

const std = @import("std");

const Task = struct {
    id: usize,
    sleep_ms: u64,
};

const TaskQueue = struct {
    buffer: [64]Task = undefined,
    head: usize = 0,
    tail: usize = 0,
    count: usize = 0,
    shutdown: bool = false,
    mutex: std.Thread.Mutex = .{},
    not_empty: std.Thread.Condition = .{},
    not_full: std.Thread.Condition = .{},

    fn push(self: *TaskQueue, task: Task) void {
        self.mutex.lock();
        defer self.mutex.unlock();
        while (self.count == 64 and !self.shutdown) {
            self.not_full.wait(&self.mutex);
        }
        if (self.shutdown) return;
        self.buffer[self.tail] = task;
        self.tail = (self.tail + 1) % 64;
        self.count += 1;
        self.not_empty.signal();
    }

    fn pop(self: *TaskQueue) ?Task {
        self.mutex.lock();
        defer self.mutex.unlock();
        while (self.count == 0 and !self.shutdown) {
            self.not_empty.wait(&self.mutex);
        }
        if (self.count == 0) return null;
        const task = self.buffer[self.head];
        self.head = (self.head + 1) % 64;
        self.count -= 1;
        self.not_full.signal();
        return task;
    }

    fn signalShutdown(self: *TaskQueue) void {
        self.mutex.lock();
        defer self.mutex.unlock();
        self.shutdown = true;
        self.not_empty.broadcast();
        self.not_full.broadcast();
    }
};

var active_tasks = std.atomic.Value(u32).init(0);

fn workerLoop(queue: *TaskQueue) void {
    while (queue.pop()) |task| {
        const count = active_tasks.fetchAdd(1, .seq_cst) + 1;
        std.debug.print("Task {d:>2} starting (active: {d})\n", .{ task.id, count });
        std.time.sleep(task.sleep_ms * std.time.ns_per_ms);
        _ = active_tasks.fetchSub(1, .seq_cst);
        std.debug.print("Task {d:>2} done\n", .{task.id});
    }
}

pub fn main() !void {
    var queue = TaskQueue{};
    var workers: [4]std.Thread = undefined;
    for (&workers) |*w| {
        w.* = try std.Thread.spawn(.{}, workerLoop, .{&queue});
    }

    var prng = std.Random.DefaultPrng.init(42);
    const rand = prng.random();
    for (0..20) |i| {
        const ms = 10 + rand.intRangeAtMost(u64, 0, 40);
        queue.push(.{ .id = i, .sleep_ms = ms });
    }

    // Wait for queue to drain then shut down
    while (true) {
        queue.mutex.lock();
        const empty = queue.count == 0;
        queue.mutex.unlock();
        if (empty and active_tasks.load(.seq_cst) == 0) break;
        std.time.sleep(5 * std.time.ns_per_ms);
    }
    queue.signalShutdown();
    for (&workers) |*w| w.join();
    std.debug.print("All tasks complete.\n", .{});
}

The key insight is the shutdown protocol: the main thread waits until both the queue is empty AND no tasks are active. Then it sets the shutdown flag and broadcasts on the conditions to wake any waiting workers. The atomic active_tasks counter lets us verify that no more than 4 tasks ever run simultaneously -- the pool itself is what enforces that bound, since there are exactly 4 worker threads.

Exercise 2 - Read-write lock with readers/writers:

const std = @import("std");

const RwLock = struct {
    mutex: std.Thread.Mutex = .{},
    readers_ok: std.Thread.Condition = .{},
    writers_ok: std.Thread.Condition = .{},
    active_readers: u32 = 0,
    active_writers: u32 = 0,
    waiting_writers: u32 = 0,

    fn readLock(self: *RwLock) void {
        self.mutex.lock();
        defer self.mutex.unlock();
        while (self.active_writers > 0 or self.waiting_writers > 0) {
            self.readers_ok.wait(&self.mutex);
        }
        self.active_readers += 1;
    }

    fn readUnlock(self: *RwLock) void {
        self.mutex.lock();
        defer self.mutex.unlock();
        self.active_readers -= 1;
        if (self.active_readers == 0) {
            self.writers_ok.signal();
        }
    }

    fn writeLock(self: *RwLock) void {
        self.mutex.lock();
        defer self.mutex.unlock();
        self.waiting_writers += 1;
        while (self.active_readers > 0 or self.active_writers > 0) {
            self.writers_ok.wait(&self.mutex);
        }
        self.waiting_writers -= 1;
        self.active_writers = 1;
    }

    fn writeUnlock(self: *RwLock) void {
        self.mutex.lock();
        defer self.mutex.unlock();
        self.active_writers = 0;
        self.readers_ok.broadcast();
        self.writers_ok.signal();
    }
};

var rw = RwLock{};
var shared_data: [100]u32 = [_]u32{0} ** 100;

fn reader(id: usize) void {
    for (0..50) |_| {
        rw.readLock();
        const first = shared_data[0];
        var consistent = true;
        for (shared_data) |v| {
            if (v != first) { consistent = false; break; }
        }
        rw.readUnlock();
        if (!consistent) {
            std.debug.print("Reader {d}: INCONSISTENCY DETECTED!\n", .{id});
        }
        std.time.sleep(1 * std.time.ns_per_ms);
    }
    std.debug.print("Reader {d}: done, all reads consistent\n", .{id});
}

fn writer(id: usize) void {
    for (0..10) |round| {
        std.time.sleep(5 * std.time.ns_per_ms);
        rw.writeLock();
        const val: u32 = @intCast(id * 1000 + round);
        for (&shared_data) |*slot| slot.* = val;
        rw.writeUnlock();
    }
    std.debug.print("Writer {d}: done\n", .{id});
}

pub fn main() !void {
    var threads: [8]std.Thread = undefined;
    for (0..6) |i| threads[i] = try std.Thread.spawn(.{}, reader, .{i});
    for (6..8) |i| threads[i] = try std.Thread.spawn(.{}, writer, .{i - 6});
    for (&threads) |*t| t.join();
}

The writer-priority design (readers wait when waiting_writers > 0) prevents writer starvation. Without it, a continuous stream of readers could starve writers indefinitely. The readers verify consistency by checking every element equals the first -- if a writer were able to partially update the array, some elements would differ and the assertion would catch it.

Exercise 3 - Relaxed vs SeqCst counter timing:

const std = @import("std");

var counter_relaxed = std.atomic.Value(u64).init(0);
var counter_seqcst = std.atomic.Value(u64).init(0);

fn incrementRelaxed() void {
    for (0..1_000_000) |_| {
        _ = counter_relaxed.fetchAdd(1, .relaxed);
    }
}

fn incrementSeqCst() void {
    for (0..1_000_000) |_| {
        _ = counter_seqcst.fetchAdd(1, .seq_cst);
    }
}

fn bench(comptime f: anytype) u64 {
    var threads: [8]std.Thread = undefined;
    var timer = std.time.Timer.start() catch unreachable;
    for (&threads) |*t| t.* = std.Thread.spawn(.{}, f, .{}) catch unreachable;
    for (&threads) |*t| t.join();
    return timer.read();
}

pub fn main() void {
    const ns_relaxed = bench(incrementRelaxed);
    const ns_seqcst = bench(incrementSeqCst);

    std.debug.print("Relaxed: {d} ns  (counter={d})\n", .{
        ns_relaxed, counter_relaxed.load(.seq_cst),
    });
    std.debug.print("SeqCst:  {d} ns  (counter={d})\n", .{
        ns_seqcst, counter_seqcst.load(.seq_cst),
    });
    std.debug.print("SeqCst / Relaxed = {d:.2}x\n", .{
        @as(f64, @floatFromInt(ns_seqcst)) / @as(f64, @floatFromInt(ns_relaxed)),
    });

    // Both counters should read exactly 8_000_000.
    // .relaxed only drops ordering guarantees relative to OTHER
    // memory operations -- the fetchAdd itself is still atomic.
    // So for a standalone counter where we don't care about
    // ordering vs surrounding writes, .relaxed is correct AND
    // faster because the CPU doesn't need to flush store buffers
    // or issue memory fence instructions.
}

Both counters produce 8,000,000 -- fetchAdd is atomic regardless of ordering. The difference is that .seq_cst forces a full memory fence after each operation, making the CPU synchronize its store buffer with main memory. .relaxed skips that fence since we don't need surrounding operations to be ordered. On x86_64 the difference is modest (1.2-1.5x typically) because x86 has a strong memory model already. On ARM the difference is larger since ARM has a weaker default ordering.

Alright, with the concurrency solutions behind us, we're shifting gears. Last episode was all about multiple threads touching shared state -- dangerous, exciting, and very much about the CPU's view of memory. Today we flip the perspective: instead of thinking about how THREADS see memory, we're thinking about how the OPERATING SYSTEM maps files into memory. And this changes everything about how you read and write files ;-)

If you've been following along since episode 10 where we did basic file I/O with std.fs, you know the traditional approach: open a file, read bytes into a buffer, process them, write bytes back. That works, but it involves explicit read and write system calls, buffer management, and keeping track of file positions. Memory mapping takes a completely different approach -- you ask the OS to make a file's contents appear directly in your process's virtual address space. After that, reading the file is just reading memory. No read() calls, no buffers, no seeking. The file IS memory.

What memory-mapped files are and why they matter

At the hardware level, your CPU accesses memory through virtual addresses that get translated to physical addresses by the MMU (memory management unit). The OS maintains page tables that map virtual address ranges to physical memory pages. Memory mapping is the OS extending this mechanism to files: it sets up page table entries so that a range of virtual addresses points to the file's contents on disk.

When you first access a mapped page, a page fault occurs -- the CPU tries to read an address that has no physical memory backing yet. The OS catches this fault, reads the corresponding file data from disk into a free physical page, updates the page table, and lets your program continue as if nothing happened. This is called demand paging -- data only gets loaded from disk when you actually touch it. If you map a 10 GB file but only read the first 4 KB, only one page gets loaded.

The benefits over traditional I/O:

  • No double buffering. With read(), the OS reads data into its page cache, then copies it into your userspace buffer. With mmap, your process accesses the page cache directly. One less copy.
  • Lazy loading. Only pages you actually touch get read from disk. Map a huge file, scan the first few kilobytes, and the OS never loads the rest.
  • The OS manages caching. The kernel's page cache is sophisticated -- it does read-ahead, evicts cold pages under memory pressure, shares pages between processes mapping the same file. You get all of this for free.
  • Simplified code. The mapped region is just a slice. You iterate over it, index into it, pass it to any function that takes []const u8. No state machine for buffered reads.

The downsides:

  • Page-aligned. Mappings must be page-aligned (typically 4 KB boundaries). Not an issue for whole files, but clumsy if you only want 100 bytes from the middle.
  • Error handling is weird. If the underlying file gets truncated while mapped, accessing beyond the new end triggers a SIGBUS signal (on Linux), which is much harder to handle than a read error.
  • Not portable everywhere. Windows has a different API (CreateFileMapping/MapViewOfFile). Zig's std.posix.mmap works on POSIX systems (Linux, macOS, BSDs). On Windows you'd need a different approach.

Using std.posix.mmap in Zig

In Zig 0.14+, the mmap wrapper lives in std.posix. Here's the simplest case -- mapping a file for reading:

const std = @import("std");
const posix = std.posix;

pub fn main() !void {
    // Open the file
    const file = try std.fs.cwd().openFile("test.txt", .{});
    defer file.close();

    // Get the file size
    const stat = try file.stat();
    const file_size = stat.size;

    if (file_size == 0) {
        std.debug.print("File is empty, nothing to map.\n", .{});
        return;
    }

    // Map the file into memory
    const mapped = try posix.mmap(
        null,                           // let OS choose address
        file_size,                      // length to map
        posix.PROT.READ,               // read-only access
        .{ .TYPE = .SHARED },          // shared mapping
        file.handle,                    // file descriptor
        0,                              // offset into file
    );
    defer posix.munmap(mapped);

    // Now 'mapped' is a page-aligned byte slice
    // ([]align(std.heap.page_size_min) u8 in Zig 0.14)
    // covering the entire file contents
    std.debug.print("File size: {d} bytes\n", .{file_size});
    std.debug.print("First 80 bytes: {s}\n", .{mapped[0..@min(80, mapped.len)]});
}

Let me walk through the parameters:

  • null -- we don't care where in our address space the mapping goes. The OS picks a suitable address. You CAN pass a specific address as a hint but there's rarely a reason to.
  • file_size -- how many bytes to map. This gets rounded up to the nearest page boundary internally. If your file is 5000 bytes on a system with 4096-byte pages, the mapping will actually be 8192 bytes, but only the first 5000 bytes correspond to file data. Accessing the padding bytes between file_size and the page boundary gives you zeros; they're not written back to disk.
  • posix.PROT.READ -- protection flags. READ for read-only, READ | WRITE for read-write. If you open the file read-only but request write protection, you get a permission error.
  • .{ .TYPE = .SHARED } -- mapping type. SHARED means changes are visible to other processes mapping the same file and get written back to disk. PRIVATE means you get a copy-on-write mapping -- writes modify your in-memory copy only.
  • file.handle -- the file descriptor. Just the .handle field from the opened file.
  • 0 -- byte offset into the file where the mapping starts. Must be page-aligned.

The return type is a page-aligned byte slice -- []align(std.heap.page_size_min) u8 in Zig 0.14 (std.mem.page_size from older releases is gone). You use it like any other slice. And the defer posix.munmap(mapped) right after the mmap call ensures we clean up when we're done. We talked about defer for memory cleanup back in episode 7 -- same principle applies here. Mapped memory is a resource, and failing to unmap it is a resource leak just like failing to free an allocation.
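One consequence of the page-aligned offset requirement is worth showing: to read an arbitrary range from the middle of a file, you align the offset down to a page boundary, map from there, and take a sub-slice. A minimal sketch -- the file name and sizes are invented for the demo, and std.heap.pageSize() is assumed as the Zig 0.14 way to query the page size at runtime:

```zig
const std = @import("std");
const posix = std.posix;

pub fn main() !void {
    // Build a demo file: 5000 filler bytes, then a marker we want to read.
    {
        const f = try std.fs.cwd().createFile("offset_demo.bin", .{});
        defer f.close();
        try f.writeAll(("A" ** 5000) ++ "MARKER");
    }

    const file = try std.fs.cwd().openFile("offset_demo.bin", .{});
    defer file.close();

    const want_offset: usize = 5000; // arbitrary, NOT page-aligned
    const want_len: usize = 6;

    // Align the mmap offset down to a page boundary and remember
    // how far into the mapping our real data starts.
    const page = std.heap.pageSize();
    const aligned_offset = want_offset - (want_offset % page);
    const delta = want_offset - aligned_offset;

    const mapped = try posix.mmap(
        null,
        delta + want_len, // enough length to cover the requested range
        posix.PROT.READ,
        .{ .TYPE = .SHARED },
        file.handle,
        aligned_offset, // page-aligned, as mmap requires
    );
    defer posix.munmap(mapped);

    // The bytes we actually asked for are a sub-slice of the mapping.
    const view = mapped[delta .. delta + want_len];
    std.debug.print("{s}\n", .{view}); // prints "MARKER"
}
```

The same trick is what higher-level mapping helpers do internally: the kernel only sees page-aligned offsets, and the unaligned part is absorbed by slicing.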

Reading large files efficiently

The real power of mmap shows when you're processing large files. Instead of reading chunks into a buffer and processing them, you just iterate over the mapped slice:

const std = @import("std");
const posix = std.posix;

fn countLines(data: []const u8) usize {
    var count: usize = 0;
    for (data) |byte| {
        if (byte == '\n') count += 1;
    }
    return count;
}

fn findPattern(data: []const u8, pattern: []const u8) usize {
    var count: usize = 0;
    var i: usize = 0;
    while (i + pattern.len <= data.len) : (i += 1) {
        if (std.mem.eql(u8, data[i..][0..pattern.len], pattern)) {
            count += 1;
        }
    }
    return count;
}

pub fn main() !void {
    const path = if (std.os.argv.len > 1)
        std.mem.span(std.os.argv[1])
    else
        "test.txt";

    const file = try std.fs.cwd().openFile(path, .{});
    defer file.close();

    const stat = try file.stat();
    if (stat.size == 0) {
        std.debug.print("Empty file.\n", .{});
        return;
    }

    const mapped = try posix.mmap(
        null,
        stat.size,
        posix.PROT.READ,
        .{ .TYPE = .SHARED },
        file.handle,
        0,
    );
    defer posix.munmap(mapped);

    const lines = countLines(mapped);
    const fn_count = findPattern(mapped, "fn ");

    std.debug.print("File: {s}\n", .{path});
    std.debug.print("Size: {d} bytes\n", .{stat.size});
    std.debug.print("Lines: {d}\n", .{lines});
    std.debug.print("Function-like patterns ('fn '): {d}\n", .{fn_count});
}

Notice how countLines and findPattern don't know or care that they're operating on a memory-mapped file. They just take []const u8. This is one of the best things about mmap -- it turns file I/O into regular memory operations, so all your existing functions that work on slices just work. No special "file reader" interface needed.

For performance, the OS handles read-ahead behind the scenes. When you touch page N, the kernel often prefetches pages N+1, N+2, etc., anticipating sequential access. On Linux you can hint the access pattern with posix.madvise -- MADV.SEQUENTIAL tells the kernel "I'm going to read this linearly, prefetch aggressively", MADV.RANDOM says "I'll jump around, don't bother with read-ahead", and MADV.WILLNEED asks the kernel to preload specific pages. For a file scanner like this, MADV.SEQUENTIAL can improve throughput on spinning disks quite a bit. On SSDs the difference is smaller since random access is cheap anyway.
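As a concrete sketch of that hint -- assuming the MADV constants that std.posix exposes on Linux and macOS, and a made-up demo file name -- the call slots in right after mmap:

```zig
const std = @import("std");
const posix = std.posix;

pub fn main() !void {
    // Create a small demo file to scan.
    {
        const f = try std.fs.cwd().createFile("madvise_demo.txt", .{});
        defer f.close();
        try f.writeAll("line one\nline two\nline three\n");
    }

    const file = try std.fs.cwd().openFile("madvise_demo.txt", .{});
    defer file.close();
    const stat = try file.stat();

    const mapped = try posix.mmap(
        null,
        stat.size,
        posix.PROT.READ,
        .{ .TYPE = .SHARED },
        file.handle,
        0,
    );
    defer posix.munmap(mapped);

    // Tell the kernel we'll scan linearly. This is purely advisory,
    // so treating a failure as non-fatal is reasonable.
    posix.madvise(mapped.ptr, mapped.len, posix.MADV.SEQUENTIAL) catch {};

    // The scan itself is unchanged -- madvise only affects prefetching.
    var lines: usize = 0;
    for (mapped) |b| {
        if (b == '\n') lines += 1;
    }
    std.debug.print("lines: {d}\n", .{lines});
}
```

On a file this small the hint is pointless, of course -- the payoff only shows on multi-megabyte sequential scans.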

Modifying files through memory mappings

To write through a mapping, open the file with write access and add PROT.WRITE to the protection flags:

const std = @import("std");
const posix = std.posix;

pub fn main() !void {
    // Create a test file first
    {
        const f = try std.fs.cwd().createFile("mmap_test.bin", .{});
        defer f.close();
        const data = "Hello, this is test data for mmap writing!\n";
        try f.writeAll(data);
    }

    // Open for read-write
    const file = try std.fs.cwd().openFile("mmap_test.bin", .{
        .mode = .read_write,
    });
    defer file.close();

    const stat = try file.stat();
    const mapped = try posix.mmap(
        null,
        stat.size,
        posix.PROT.READ | posix.PROT.WRITE,
        .{ .TYPE = .SHARED },
        file.handle,
        0,
    );
    defer posix.munmap(mapped);

    std.debug.print("Before: {s}\n", .{mapped});

    // Modify through the mapping -- uppercase the first 5 bytes
    for (mapped[0..5]) |*byte| {
        if (byte.* >= 'a' and byte.* <= 'z') {
            byte.* -= 32;  // ASCII lowercase to uppercase
        }
    }

    std.debug.print("After:  {s}\n", .{mapped});
    // Changes are written to disk automatically by the OS
    // (when it flushes dirty pages, or at munmap/close time)
}

With a SHARED mapping and PROT.WRITE, any modifications you make to the slice get written back to the file on disk. The OS does this lazily -- dirty pages are flushed to disk eventually, either when the kernel's writeback threads run, when you explicitly call msync, or when the mapping is removed. If your program crashes before the dirty pages are flushed, data can be lost. For critical writes, call posix.msync(mapped, posix.MSF.SYNC) to force an immediate flush.
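A minimal sketch of that flush-then-verify pattern -- the demo file name is invented, and posix.msync with the MSF.SYNC flag is assumed as described above:

```zig
const std = @import("std");
const posix = std.posix;

pub fn main() !void {
    // Seed a demo file.
    {
        const f = try std.fs.cwd().createFile("msync_demo.bin", .{});
        defer f.close();
        try f.writeAll("hello");
    }

    const file = try std.fs.cwd().openFile("msync_demo.bin", .{
        .mode = .read_write,
    });
    defer file.close();
    const stat = try file.stat();

    const mapped = try posix.mmap(
        null,
        stat.size,
        posix.PROT.READ | posix.PROT.WRITE,
        .{ .TYPE = .SHARED },
        file.handle,
        0,
    );
    defer posix.munmap(mapped);

    // Write through the mapping...
    mapped[0] = 'H';

    // ...and force the dirty page to disk right now instead of
    // whenever the kernel gets around to it.
    try posix.msync(mapped, posix.MSF.SYNC);

    // Re-read through the ordinary file API to confirm the write landed.
    var buf: [5]u8 = undefined;
    _ = try file.pread(&buf, 0);
    std.debug.print("{s}\n", .{buf});
}
```

Note that even without msync the pread would usually see the new byte, because reads and shared mappings go through the same page cache -- msync is about durability on disk, not visibility in the cache.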

Shared vs private mappings

The .TYPE flag controls what happens when you write to a mapping:

  • SHARED -- writes go back to the file. Other processes mapping the same file see your changes. This is how you do inter-process shared memory through files.
  • PRIVATE -- writes trigger copy-on-write. The OS gives you a private copy of the modified page. The original file is NOT changed. Other processes don't see your modifications.

Private mappings are useful when you want to load a file's data and then modify it in memory without affecting the original. Think of it like loading configuration: read the defaults from the file, then override values in memory for this process's runtime.

const std = @import("std");
const posix = std.posix;

pub fn main() !void {
    const file = try std.fs.cwd().openFile("mmap_test.bin", .{});
    defer file.close();
    const stat = try file.stat();

    // Private mapping -- copy-on-write
    const private = try posix.mmap(
        null,
        stat.size,
        posix.PROT.READ | posix.PROT.WRITE,
        .{ .TYPE = .PRIVATE },
        file.handle,
        0,
    );
    defer posix.munmap(private);

    // Shared mapping of the same file
    const shared = try posix.mmap(
        null,
        stat.size,
        posix.PROT.READ,
        .{ .TYPE = .SHARED },
        file.handle,
        0,
    );
    defer posix.munmap(shared);

    // Modify through private mapping
    @memset(private[0..5], 'X');

    std.debug.print("Private view: {s}\n", .{private[0..@min(40, private.len)]});
    std.debug.print("Shared view:  {s}\n", .{shared[0..@min(40, shared.len)]});
    // Private shows "XXXXX..." but shared still shows original data
}

The shared mapping still sees the original file data because the private mapping's writes went to a private copy of those pages. This is the same copy-on-write mechanism the OS uses for fork() -- the child process gets private copies of the parent's memory pages, and actual copying only happens when either side writes.

Anonymous mappings for large allocations

You can create a mapping without a file backing it -- that's an anonymous mapping. It gives you zero-filled memory directly from the OS, bypassing the allocator entirely:

const std = @import("std");
const posix = std.posix;

pub fn main() !void {
    // Allocate 16 MB of anonymous memory
    const size = 16 * 1024 * 1024;
    const mem = try posix.mmap(
        null,
        size,
        posix.PROT.READ | posix.PROT.WRITE,
        .{ .TYPE = .PRIVATE, .ANONYMOUS = true },
        -1,       // no file descriptor for anonymous mappings
        0,
    );
    defer posix.munmap(mem);

    // Memory is zero-initialized by the OS
    std.debug.print("First bytes: {d} {d} {d} {d}\n", .{
        mem[0], mem[1], mem[2], mem[3],
    });

    // Use it like any other memory
    @memset(mem[0..1024], 0xAB);
    std.debug.print("After write: {d} {d} {d} {d}\n", .{
        mem[0], mem[1], mem[2], mem[3],
    });
    std.debug.print("Mapped {d} MB of anonymous memory\n", .{size / 1024 / 1024});
}

Why would you use this instead of a regular allocator? A few reasons:

  1. Guaranteed page-aligned. Allocators may or may not give you page-aligned memory. mmap always does.
  2. Guaranteed zero-filled. The OS zeros anonymous pages for security (so you can't read another process's data). With page_allocator you get this too, but general-purpose allocators may return uninitialized memory.
  3. Large allocations. If you need hundreds of megabytes or gigabytes, mmap is what page_allocator uses under the hood anyway. Going direct cuts out the middleman.
  4. Memory protection. After filling your buffer, you can call mprotect to mark it read-only, catching accidental writes at the hardware level.

Having said that, for most allocations you should stick with Zig's allocator interface (episode 7). Anonymous mmap is for special cases where you need OS-level control over the memory's properties.
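The memory-protection trick from point 4 above can be sketched briefly -- assuming posix.mprotect accepts the mapped slice plus new PROT flags. Instead of letting the write trap fire, this demo restores write access afterwards:

```zig
const std = @import("std");
const posix = std.posix;

pub fn main() !void {
    // Anonymous mapping: 4 MB of zeroed, page-aligned memory.
    const size = 4 * 1024 * 1024;
    const mem = try posix.mmap(
        null,
        size,
        posix.PROT.READ | posix.PROT.WRITE,
        .{ .TYPE = .PRIVATE, .ANONYMOUS = true },
        -1,
        0,
    );
    defer posix.munmap(mem);

    // Fill the buffer while it's still writable.
    @memset(mem, 0x42);

    // Seal it: after this, any write to the region traps at the
    // MMU level (SIGSEGV), while reads keep working normally.
    try posix.mprotect(mem, posix.PROT.READ);
    std.debug.print("sealed, first byte: 0x{X}\n", .{mem[0]});

    // To write again, restore write permission first.
    try posix.mprotect(mem, posix.PROT.READ | posix.PROT.WRITE);
    mem[0] = 0;
    std.debug.print("unsealed, first byte: 0x{X}\n", .{mem[0]});
}
```

This is exactly the hardware-enforced read-only discipline that exercise 3 below asks you to explore, minus the deliberate crash.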

Unmapping and the importance of defer

Every mmap must be paired with munmap. If you forget to unmap, the mapping persists until the process exits. That's a resource leak -- the virtual address space is consumed, and the kernel keeps the file's page cache entries pinned.

const std = @import("std");
const posix = std.posix;

fn processFile(path: []const u8) !usize {
    const file = try std.fs.cwd().openFile(path, .{});
    defer file.close();

    const stat = try file.stat();
    if (stat.size == 0) return 0;

    const mapped = try posix.mmap(
        null,
        stat.size,
        posix.PROT.READ,
        .{ .TYPE = .SHARED },
        file.handle,
        0,
    );
    defer posix.munmap(mapped);

    // Even if we return early or an error propagates,
    // defer guarantees munmap runs
    var total: usize = 0;
    for (mapped) |byte| {
        total += byte;
    }
    return total;
}

pub fn main() !void {
    const result = try processFile("test.txt");
    std.debug.print("Byte sum: {d}\n", .{result});
}

The pattern is always the same: map, defer unmap, do your work. Same as file.close(), same as allocator .free(). Zig's defer makes resource management deterministic and hard to get wrong. In C you'd need to remember to call munmap on every exit path -- every error check, every early return. Forget one and you've got a leak that only shows up in long-running processes. With defer it's one line, right after the mmap call, and you never think about it again.

One subtlety: munmap can't fail in practice (the POSIX spec says it can return EINVAL for bad arguments, but if your arguments came from a successful mmap call, munmap won't fail). That's why Zig's posix.munmap returns void, not an error union. Safe to call in defer without worrying about error handling.

Practical example: fast byte frequency counter

Let's put it all together with a practical tool. We'll build a byte frequency counter that mmaps a file and counts the occurrence of every byte value (0-255). This is useful for file analysis, detecting binary vs text files, finding encoding issues, or just being curious about a file's contents:

const std = @import("std");
const posix = std.posix;

fn countFrequencies(data: []const u8) [256]u64 {
    var freq: [256]u64 = [_]u64{0} ** 256;
    for (data) |byte| {
        freq[byte] += 1;
    }
    return freq;
}

fn printableChar(byte: u8) u8 {
    if (byte >= 32 and byte < 127) return byte;
    return '.';
}

pub fn main() !void {
    const path = if (std.os.argv.len > 1)
        std.mem.span(std.os.argv[1])
    else {
        std.debug.print("Usage: freq <filename>\n", .{});
        return;
    };

    const file = std.fs.cwd().openFile(path, .{}) catch |err| {
        std.debug.print("Cannot open '{s}': {}\n", .{ path, err });
        return;
    };
    defer file.close();

    const stat = try file.stat();
    if (stat.size == 0) {
        std.debug.print("File is empty.\n", .{});
        return;
    }

    const mapped = try posix.mmap(
        null,
        stat.size,
        posix.PROT.READ,
        .{ .TYPE = .SHARED },
        file.handle,
        0,
    );
    defer posix.munmap(mapped);

    const freq = countFrequencies(mapped);

    // Print results
    std.debug.print("\nByte frequency analysis for: {s}\n", .{path});
    std.debug.print("Total bytes: {d}\n\n", .{stat.size});

    // Find top 20 most frequent bytes
    var indices: [256]u8 = undefined;
    for (&indices, 0..) |*slot, i| slot.* = @intCast(i);

    // Simple selection sort for top 20
    for (0..20) |i| {
        var max_idx = i;
        for (i + 1..256) |j| {
            if (freq[indices[j]] > freq[indices[max_idx]]) {
                max_idx = j;
            }
        }
        const tmp = indices[i];
        indices[i] = indices[max_idx];
        indices[max_idx] = tmp;
    }

    std.debug.print("Top 20 byte values:\n", .{});
    std.debug.print("{s:>5} {s:>4} {s:>12} {s:>8}\n", .{
        "Byte", "Char", "Count", "Percent",
    });
    std.debug.print("{s}\n", .{"-" ** 32});

    for (0..20) |i| {
        const byte = indices[i];
        const count = freq[byte];
        if (count == 0) break;
        const pct = @as(f64, @floatFromInt(count)) /
            @as(f64, @floatFromInt(stat.size)) * 100.0;
        std.debug.print("0x{X:0>2}  '{c}'  {d:>10}  {d:>6.2}%\n", .{
            byte, printableChar(byte), count, pct,
        });
    }

    // Print summary stats
    var unique_bytes: u16 = 0;
    var is_text = true;
    for (0..256) |b| {
        if (freq[b] > 0) unique_bytes += 1;
        if (freq[b] > 0 and b < 32 and b != '\n' and b != '\r' and b != '\t') {
            is_text = false;
        }
    }
    if (freq[0] > 0) is_text = false;

    std.debug.print("\nUnique byte values: {d}/256\n", .{unique_bytes});
    std.debug.print("Likely format: {s}\n", .{
        if (is_text) "text" else "binary",
    });
}

Run this on a source code file and you'll see that space (0x20) and newline (0x0A) dominate, followed by common ASCII letters. Run it on a compiled binary and you'll see a much more uniform distribution with null bytes (0x00) near the top. Run it on a compressed file and the distribution is nearly flat across all 256 values -- that's actually one way to detect encryption or compression: high entropy means evenly distributed byte values.

The mmap approach here is clean: open file, map it, pass the slice to countFrequencies, print results. No buffered reader, no chunk management, no file position tracking. For a file that fits in RAM this is the simplest and fastest way to scan the contents -- and thanks to demand paging, even files larger than RAM work: the kernel pages data in as you scan and evicts cold pages behind you.

What we learned

  • Memory-mapped files let the OS map file contents directly into your address space. Reading the file becomes reading memory -- no explicit read() calls, no buffer management.
  • std.posix.mmap takes an address hint, length, protection flags, mapping type, file descriptor, and offset. Returns a page-aligned []u8 slice.
  • Demand paging means only pages you actually access get loaded from disk. Map a huge file, touch one page, and only 4 KB gets read.
  • PROT.READ | PROT.WRITE enables write access. With .SHARED mappings, writes go back to disk. With .PRIVATE mappings, writes trigger copy-on-write and only affect your process.
  • Anonymous mappings (ANONYMOUS = true, fd = -1) give you zero-filled OS pages without a file. Good for large aligned allocations with guaranteed zeroing.
  • Always defer posix.munmap right after a successful mmap. Same resource management discipline as allocators and file handles.
  • mmap shines for sequential scans of large files, read-only access patterns, and sharing data between processes. It's less ideal for small random writes (every modification dirties a full page) or files that change size frequently.

The patterns we covered today -- mapping files for reading, modifying through writable mappings, using anonymous mappings for large allocations -- these are foundational for systems programming. Database engines, text editors, log processors, and virtual memory allocators all use mmap extensively. When your data is already a file and you need to process it in memory, reaching for mmap instead of buffered I/O is often simpler AND faster. How those mapped memory regions interact with Zig's comptime type introspection system opens up some very interesting possibilities for building type-safe file format parsers ;-)

Exercises

  1. Write a program that memory-maps two files and compares them byte-by-byte, printing the offset and differing bytes for the first 10 differences found (or "files are identical" if they match). Handle the case where the files have different sizes. Think about what happens at the boundary -- if file A is 1000 bytes and file B is 2000 bytes, you should compare the first 1000 bytes and then report "file B has 1000 extra bytes."

  2. Build a simple file patcher using mmap: the program takes a filename, a hex offset, and a hex byte value, then writes that byte at the specified offset in the file using a writable shared mapping. Verify the write by re-reading the file with a separate mapping (or regular read) and printing the byte before and after. Make sure to validate that the offset is within the file's bounds.

  3. Create a program that uses an anonymous mapping to allocate a large buffer (say 64 MB), fills it with a pattern, then uses std.posix.mprotect to mark the region read-only. Attempt a write after the protection change and observe what happens (the program should crash with a segfault -- catch this by wrapping the write attempt in a comment that explains the expected behavior, and print a message BEFORE the write attempt so the user knows what's about to happen). This demonstrates hardware-enforced memory protection.

Thanks for reading!

@scipio




Zig keeps surprising me with how simple it makes handling OS resources. Very well explained.
