Learn Zig Series (#22) - Hash Maps and Data Structures


What will I learn

  • You will learn how to use std.AutoHashMap and std.HashMap for key-value storage;
  • You will learn how Zig's hash maps handle allocation and resizing internally;
  • You will learn custom hash and equality functions for user-defined key types;
  • You will learn std.StringHashMap for string-keyed lookups;
  • You will learn std.BoundedArray for stack-allocated dynamic arrays;
  • You will learn std.PriorityQueue for heap-based ordering;
  • You will learn std.SegmentedList for stable-pointer collections;
  • You will learn how to choose the right data structure for your use case.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Zig 0.14+ distribution (download from ziglang.org);
  • The ambition to learn Zig programming.

Difficulty

  • Intermediate


Welcome back! Last episode we built TCP servers and clients -- binding sockets, accepting connections, reading streams, the whole networking deal. In the exercises I asked you to build a key-value store over TCP that accepts SET/GET/DEL commands. If you attempted that, you probably reached for std.StringHashMap or std.AutoHashMap and realized: "wait, I've never actually used these properly." Today we fix that.

Hash maps are THE fundamental data structure for anything involving lookups by key. Configuration stores, caches, symbol tables, counting occurrences, deduplication, routing tables -- if you need to answer the question "given this key, what's the value?" in O(1) time, you want a hash map. And Zig's standard library ships with several specialized variants, each designed for different use cases. Beyond hash maps, the standard library also has some interesting collection types that fill niches you'd normally need an external library for in other languages.

This episode is a tour of the data structures that std provides out of the box. We'll cover when to use each one, how they handle memory, and the tradeoffs between them. Here we go!

Solutions to Episode 21 Exercises

Exercise 1 -- Counting echo server (prefix each response with a message number):

const std = @import("std");
const net = std.net;

fn handleClient(connection: net.Server.Connection) void {
    defer connection.stream.close();
    std.debug.print("Client connected: {}\n", .{connection.address});

    var buf: [4096]u8 = undefined;
    var count: u32 = 0;

    while (true) {
        const n = connection.stream.read(&buf) catch |err| {
            std.debug.print("Read error: {}\n", .{err});
            return;
        };
        if (n == 0) {
            std.debug.print("Client disconnected after {d} messages\n", .{count});
            return;
        }

        count += 1;
        var response: [4128]u8 = undefined;
        const prefix = std.fmt.bufPrint(&response, "[{d}] ", .{count}) catch return;
        const total_len = prefix.len + n;
        if (total_len <= response.len) {
            @memcpy(response[prefix.len..][0..n], buf[0..n]);
            connection.stream.writeAll(response[0..total_len]) catch return;
        }
    }
}

pub fn main() !void {
    const address = net.Address.initIp4(.{ 127, 0, 0, 1 }, 7000);
    var server = try address.listen(.{ .reuse_address = true });
    defer server.deinit();

    std.debug.print("Counting echo server on port 7000\n", .{});
    while (true) {
        const conn = server.accept() catch continue;
        handleClient(conn);
    }
}

The key insight is that count is declared inside handleClient, so it resets for each new connection. The prefix is formatted with bufPrint into a separate buffer, then concatenated with the echoed data. Simple state tracking per connection.

Exercise 2 -- TCP key-value store with SET/GET/DEL commands:

const std = @import("std");
const net = std.net;

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    var store = std.StringHashMap([]const u8).init(allocator);
    defer {
        var it = store.iterator();
        while (it.next()) |entry| {
            allocator.free(entry.key_ptr.*);
            allocator.free(entry.value_ptr.*);
        }
        store.deinit();
    }

    const address = net.Address.initIp4(.{ 127, 0, 0, 1 }, 7001);
    var server = try address.listen(.{ .reuse_address = true });
    defer server.deinit();

    std.debug.print("KV store on port 7001\n", .{});
    while (true) {
        const conn = server.accept() catch continue;
        defer conn.stream.close();

        var buf: [4096]u8 = undefined;
        var line_buf: [1024]u8 = undefined;
        var line_len: usize = 0;

        while (true) {
            const n = conn.stream.read(&buf) catch break;
            if (n == 0) break;

            for (buf[0..n]) |byte| {
                if (byte == '\n' or byte == '\r') {
                    if (line_len > 0) {
                        processLine(
                            conn.stream,
                            line_buf[0..line_len],
                            &store,
                            allocator,
                        );
                        line_len = 0;
                    }
                } else if (line_len < line_buf.len) {
                    line_buf[line_len] = byte;
                    line_len += 1;
                }
            }
        }
    }
}

fn processLine(
    stream: net.Stream,
    line: []const u8,
    store: *std.StringHashMap([]const u8),
    allocator: std.mem.Allocator,
) void {
    var parts = std.mem.splitScalar(u8, line, ' ');
    const cmd = parts.next() orelse return;
    const key = parts.next() orelse {
        stream.writeAll("ERROR: missing key\n") catch {};
        return;
    };

    if (std.mem.eql(u8, cmd, "SET")) {
        const val = parts.rest();
        if (val.len == 0) {
            stream.writeAll("ERROR: missing value\n") catch {};
            return;
        }
        const owned_key = allocator.dupe(u8, key) catch return;
        const owned_val = allocator.dupe(u8, val) catch {
            allocator.free(owned_key);
            return;
        };
        if (store.fetchPut(owned_key, owned_val) catch null) |old| {
            allocator.free(old.key);
            allocator.free(old.value);
        }
        stream.writeAll("OK\n") catch {};
    } else if (std.mem.eql(u8, cmd, "GET")) {
        if (store.get(key)) |val| {
            stream.writeAll(val) catch {};
            stream.writeAll("\n") catch {};
        } else {
            stream.writeAll("NOT_FOUND\n") catch {};
        }
    } else if (std.mem.eql(u8, cmd, "DEL")) {
        if (store.fetchRemove(key)) |removed| {
            allocator.free(removed.key);
            allocator.free(removed.value);
            stream.writeAll("OK\n") catch {};
        } else {
            stream.writeAll("NOT_FOUND\n") catch {};
        }
    } else {
        stream.writeAll("ERROR: unknown command\n") catch {};
    }
}

The tricky part is memory ownership. The hash map stores pointers to string data, but the input buffer gets overwritten on each read. So we dupe both key and value into heap-allocated copies. fetchPut returns the old entry if one existed (so we can free it), and fetchRemove returns the removed entry (same reason). This is the reality of manual memory management -- every allocation needs a corresponding free.

Exercise 3 -- TCP port scanner:

const std = @import("std");
const net = std.net;

pub fn main() !void {
    const host = "127.0.0.1";
    const start_port: u16 = 1;
    const end_port: u16 = 1024;
    var open_count: u32 = 0;

    std.debug.print("Scanning {s} ports {d}-{d}...\n", .{
        host, start_port, end_port,
    });

    var port = start_port;
    while (port <= end_port) : (port += 1) {
        const stream = net.tcpConnectToHost(
            std.heap.page_allocator,
            host,
            port,
        ) catch {
            continue; // closed or filtered
        };
        stream.close();
        std.debug.print("  Port {d}: OPEN\n", .{port});
        open_count += 1;
    }

    std.debug.print("\nScan complete: {d} open ports found\n", .{open_count});
}

Sequential scanning is slow -- each failed connection attempt waits for a timeout. On localhost this is fast (immediate ConnectionRefused), but scanning a remote host takes noticeable time per port. You really feel why port scanners like nmap use concurrent connections. We'd need threads (or async, when Zig ships it) to scan efficiently. The sequential version is correct though, and the results come out sorted by port number naturally since we iterate in order.

Alright, on to hash maps and friends! ;-)

std.AutoHashMap -- the one you'll use most

std.AutoHashMap is Zig's convenience wrapper around the lower-level std.HashMap. It automatically generates hash and equality functions for your key type, which covers the vast majority of use cases. If your key is a primitive type (u32, i64, bool), an enum, or a pointer -- AutoHashMap just works.

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Map from u32 keys to []const u8 values
    var map = std.AutoHashMap(u32, []const u8).init(allocator);
    defer map.deinit();

    // Insert entries
    try map.put(1, "one");
    try map.put(2, "two");
    try map.put(3, "three");

    // Lookup
    if (map.get(2)) |val| {
        std.debug.print("Key 2 = {s}\n", .{val}); // "two"
    }

    // Check existence
    std.debug.print("Contains 5? {}\n", .{map.contains(5)}); // false

    // Remove
    _ = map.remove(2);
    std.debug.print("After remove, contains 2? {}\n", .{map.contains(2)});

    // Count
    std.debug.print("Size: {d}\n", .{map.count()});
}

The API should feel familiar if you've used hash maps in any language. put inserts (or overwrites if the key exists), get returns an optional value, remove deletes an entry, contains checks existence, count returns the number of entries.

Notice that put returns !void -- it can fail because inserting might trigger a resize, which means a heap allocation. This is Zig being explicit about the fact that "adding to a collection" is not a free operation. In languages with garbage collectors you never think about this, but in Zig the allocator is right there, reminding you that memory has a cost.

Iteration is straightforward:

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Note: AutoHashMap cannot auto-hash slice keys like []const u8
    // (use StringHashMap for strings), so we key on numeric player IDs
    var scores = std.AutoHashMap(u32, i32).init(allocator);
    defer scores.deinit();

    try scores.put(1, 95);
    try scores.put(2, 87);
    try scores.put(3, 92);

    // Iterate over all entries
    var it = scores.iterator();
    while (it.next()) |entry| {
        std.debug.print("player {d}: {d}\n", .{ entry.key_ptr.*, entry.value_ptr.* });
    }
}

The iterator yields entries with key_ptr and value_ptr -- these are pointers into the map's internal storage, not copies. This is important: if you store those pointers and then mutate the map (insert/remove), the pointers may become invalid because a resize could move the internal storage. Read, use, move on. Don't stash pointers across mutations.

Also note that iteration order is NOT guaranteed. Hash maps store entries based on their hash values, and the internal layout changes with resizes. If you need deterministic order, you need a different data structure or a separate sorted index.
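If you do need a deterministic order, one common pattern (a sketch, not the only option) is to pull the keys into an ArrayList, sort them, and look each one up -- keyIterator is a standard method on the hash map types:

```zig
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    var scores = std.StringHashMap(i32).init(allocator);
    defer scores.deinit();

    try scores.put("carol", 92);
    try scores.put("alice", 95);
    try scores.put("bob", 87);

    // Collect the keys into a list we can sort
    var keys = std.ArrayList([]const u8).init(allocator);
    defer keys.deinit();

    var key_it = scores.keyIterator();
    while (key_it.next()) |key_ptr| {
        try keys.append(key_ptr.*);
    }

    // Sort keys alphabetically
    std.mem.sort([]const u8, keys.items, {}, struct {
        fn lessThan(_: void, a: []const u8, b: []const u8) bool {
            return std.mem.lessThan(u8, a, b);
        }
    }.lessThan);

    // Deterministic iteration: look each key up in sorted order
    for (keys.items) |key| {
        std.debug.print("{s}: {d}\n", .{ key, scores.get(key).? });
    }
}
```

The sort is O(n log n), so this costs more than raw iteration -- do it once at output time, not on every access.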

How allocation and resizing work

Under the hood, AutoHashMap (and HashMap) uses open addressing with linear probing. The internal storage is a flat array of slots, each containing a key, a value, and a metadata byte. When you insert, the key is hashed to find a slot index. If that slot is occupied (collision), the map probes forward through consecutive slots until it finds an empty one.

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    var map = std.AutoHashMap(u32, u32).init(allocator);
    defer map.deinit();

    // Watch the capacity grow
    std.debug.print("Initial capacity: {d}\n", .{map.capacity()});

    for (0..20) |i| {
        try map.put(@intCast(i), @intCast(i * 10));
        std.debug.print("After insert {d}: count={d}, capacity={d}\n", .{
            i, map.count(), map.capacity(),
        });
    }
}

The map starts with capacity 0 (no allocation at all). The first put triggers an allocation. As the load factor (count / capacity) approaches the threshold (about 80%), the map allocates a new, larger backing array and rehashes all existing entries into it. This rehashing is O(n) but happens infrequently enough that the amortized cost per insertion is still O(1).

One practical detail: if you know roughly how many entries you'll need upfront, you can avoid the incremental resizing by calling ensureTotalCapacity:

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    var map = std.AutoHashMap(u32, []const u8).init(allocator);
    defer map.deinit();

    // Pre-allocate for 1000 entries -- single allocation, no resizing
    try map.ensureTotalCapacity(1000);
    std.debug.print("Pre-allocated capacity: {d}\n", .{map.capacity()});

    for (0..1000) |i| {
        try map.put(@intCast(i), "value");
    }
    std.debug.print("After 1000 inserts: capacity still {d}\n", .{map.capacity()});
}

This is a real performance win in hot paths. If you're building a map from known data (parsing a config file, loading a database result), pre-allocating avoids all the intermediate resize-and-rehash cycles. Remember from ep7 -- allocations have a cost, and in the worst case one hits the operating system with a syscall, which is expensive relative to in-memory operations.
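A related trick worth knowing: once ensureTotalCapacity has succeeded, the putAssumeCapacity variant inserts without any error return at all, because no resize can occur. A minimal sketch:

```zig
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    var map = std.AutoHashMap(u32, u32).init(allocator);
    defer map.deinit();

    // One allocation up front...
    try map.ensureTotalCapacity(100);

    // ...then inserts that cannot fail -- note there's no `try`
    for (0..100) |i| {
        map.putAssumeCapacity(@intCast(i), @intCast(i * 2));
    }
    std.debug.print("count: {d}\n", .{map.count()});
}
```

The "AssumeCapacity" suffix is a contract: if you exceed the reserved capacity, behavior is undefined (safety-checked in debug builds). Only use it when the capacity math is airtight.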

Custom hash and equality functions

AutoHashMap uses auto-generated hash functions. For most key types this is perfect. But sometimes you need custom behavior -- maybe you want case-insensitive string keys, or you're using a struct as a key and want to hash on only some fields.

This is where the full std.HashMap comes in. It takes explicit hash and equality function types as comptime parameters:

const std = @import("std");

// Custom context for case-insensitive string comparison
const CaseInsensitiveContext = struct {
    pub fn hash(_: @This(), key: []const u8) u64 {
        var h: u64 = 0;
        for (key) |c| {
            const lower = if (c >= 'A' and c <= 'Z') c + 32 else c;
            h = h *% 31 +% lower;
        }
        return h;
    }

    pub fn eql(_: @This(), a: []const u8, b: []const u8) bool {
        if (a.len != b.len) return false;
        for (a, b) |ca, cb| {
            const la = if (ca >= 'A' and ca <= 'Z') ca + 32 else ca;
            const lb = if (cb >= 'A' and cb <= 'Z') cb + 32 else cb;
            if (la != lb) return false;
        }
        return true;
    }
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    var map = std.HashMap(
        []const u8,
        i32,
        CaseInsensitiveContext,
        std.hash_map.default_max_load_percentage,
    ).init(allocator);
    defer map.deinit();

    try map.put("Content-Type", 1);
    try map.put("Accept", 2);

    // Lookup with different casing -- works because of our custom eql
    if (map.get("content-type")) |val| {
        std.debug.print("Found: {d}\n", .{val}); // 1
    }
    if (map.get("ACCEPT")) |val| {
        std.debug.print("Found: {d}\n", .{val}); // 2
    }
}

The CaseInsensitiveContext struct provides two functions: hash and eql. Both MUST agree -- if two keys are "equal" per eql, they MUST produce the same hash. If your eql says "Hello" == "hello" but your hash function gives them different hashes, the map will silently break (you'll insert duplicates, lookups will fail). This is the hash map contract, same as in every language.

The hash function uses a simple polynomial rolling hash with wrapping multiplication (*%). The eql function does byte-by-byte comparison after lowercasing. For HTTP headers, this is exactly what you want -- the spec says headers are case-insensitive, so Content-Type and content-type should map to the same entry.

Another common case is using structs as keys:

const std = @import("std");

const Point = struct {
    x: i32,
    y: i32,
};

const PointContext = struct {
    pub fn hash(_: @This(), p: Point) u64 {
        var h = std.hash.Wyhash.init(0);
        h.update(std.mem.asBytes(&p.x));
        h.update(std.mem.asBytes(&p.y));
        return h.final();
    }

    pub fn eql(_: @This(), a: Point, b: Point) bool {
        return a.x == b.x and a.y == b.y;
    }
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    var grid = std.HashMap(
        Point,
        []const u8,
        PointContext,
        std.hash_map.default_max_load_percentage,
    ).init(allocator);
    defer grid.deinit();

    try grid.put(.{ .x = 0, .y = 0 }, "origin");
    try grid.put(.{ .x = 1, .y = 3 }, "checkpoint");
    try grid.put(.{ .x = -2, .y = 5 }, "treasure");

    if (grid.get(.{ .x = 1, .y = 3 })) |val| {
        std.debug.print("At (1,3): {s}\n", .{val});
    }
}

std.hash.Wyhash is a fast, high-quality hash function from the standard library. Feed it the raw bytes of your struct fields and call .final(). This approach generalizes to any struct -- just hash all the fields you care about.

std.StringHashMap -- the convenience wrapper

If your keys are []const u8 strings (which is extremely common), std.StringHashMap saves you from typing out the full generic parameters:

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // StringHashMap is a HashMap specialized for []const u8 keys
    var config = std.StringHashMap([]const u8).init(allocator);
    defer config.deinit();

    try config.put("host", "localhost");
    try config.put("port", "8080");
    try config.put("debug", "true");

    if (config.get("host")) |host| {
        std.debug.print("Connecting to {s}\n", .{host});
    }

    // getOrPut -- insert if missing, get pointer if exists
    const result = try config.getOrPut("timeout");
    if (!result.found_existing) {
        result.value_ptr.* = "30";
        std.debug.print("Set default timeout to 30\n", .{});
    }

    // Print all config
    var it = config.iterator();
    while (it.next()) |entry| {
        std.debug.print("  {s} = {s}\n", .{ entry.key_ptr.*, entry.value_ptr.* });
    }
}

getOrPut is one of the most useful methods on the map. It looks up the key. If found, it returns a pointer to the existing value. If NOT found, it inserts a slot for the key and returns a pointer to the uninitialized value -- you then write the default value through the pointer. This avoids the double-lookup pattern (check if exists, then insert) that plagues hash map usage in other languages.

One important caveat with StringHashMap: the map does NOT own the string data. The keys and values are slices that point to memory owned by someone else. If you're storing strings that come from a transient buffer (like reading lines from a file), you need to dupe them into heap memory first, and free them when removing entries. We did exactly this in the exercise solution above. The map stores the slice headers (pointer + length), not the character data itself.

std.BoundedArray -- stack-allocated dynamic arrays

Sometimes you need a dynamic array but you know the maximum size at compile time, and you don't want to involve the heap at all. std.BoundedArray is exactly this: a fixed-capacity array on the stack with a runtime length field.

const std = @import("std");

pub fn main() !void {
    // Capacity of 64, stored entirely on the stack
    var buf = std.BoundedArray(u8, 64){};

    // Append bytes
    try buf.appendSlice("Hello, ");
    try buf.appendSlice("world!");

    std.debug.print("Content: {s}\n", .{buf.slice()});
    std.debug.print("Length: {d}, Capacity: {d}\n", .{
        buf.len, buf.buffer.len,
    });

    // Works as a small fixed-size collection too
    var nums = std.BoundedArray(i32, 16){};
    for (0..10) |i| {
        try nums.append(@intCast(i * i));
    }

    std.debug.print("Squares: ", .{});
    for (nums.slice()) |n| {
        std.debug.print("{d} ", .{n});
    }
    std.debug.print("\n", .{});
}

The capacity (64 in the first example, 16 in the second) is a comptime parameter baked into the type. The backing storage is a regular [64]u8 array -- no pointers, no heap, no allocator. append and appendSlice return errors if you exceed the capacity, which you handle with try.

When is BoundedArray the right choice? When you have a natural upper bound. Parsing command-line arguments (max 256), collecting error messages during validation (max 32), building a small temporary buffer for formatting -- situations where the maximum is known and modest. If the capacity needs to be large or truly unbounded, use std.ArrayList (which heap-allocates) instead.

The nice thing is that .slice() returns a regular []T slice, so you can pass it to any function that takes a slice. The bounded-ness is an implementation detail of the container, invisible to consumers.

std.PriorityQueue -- heap-based ordering

A priority queue lets you efficiently retrieve the minimum (or maximum) element. Internally it's a binary heap stored in a flat array. Inserting and removing the top element are both O(log n).

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Min-heap of i32 values
    var pq = std.PriorityQueue(i32, void, struct {
        fn compare(_: void, a: i32, b: i32) std.math.Order {
            return std.math.order(a, b);
        }
    }.compare).init(allocator, {});
    defer pq.deinit();

    // Insert some values (not in order)
    try pq.add(42);
    try pq.add(7);
    try pq.add(99);
    try pq.add(3);
    try pq.add(55);

    std.debug.print("Draining in priority order:\n", .{});
    while (pq.removeOrNull()) |val| {
        std.debug.print("  {d}\n", .{val});
    }
    // Output: 3, 7, 42, 55, 99 -- ascending order
}

The compare function determines the ordering. Return .lt if a should come out before b (min-heap), or flip it for a max-heap. The void context parameter means our comparator doesn't need any extra state -- if you needed to compare structs based on a configurable field, you could pass a context struct instead.
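The context becomes useful when the comparator needs runtime state. Here's a sketch (the Direction enum is my own, not from std) where the ordering direction is chosen when the queue is created:

```zig
const std = @import("std");

// Comparator state: ordering direction decided at runtime
const Direction = enum { ascending, descending };

fn compare(dir: Direction, a: i32, b: i32) std.math.Order {
    return switch (dir) {
        .ascending => std.math.order(a, b),
        .descending => std.math.order(b, a),
    };
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // The second init argument is the context value -- real state
    // instead of the empty {} we passed for a void context
    var pq = std.PriorityQueue(i32, Direction, compare).init(allocator, .descending);
    defer pq.deinit();

    try pq.add(7);
    try pq.add(42);
    try pq.add(3);

    while (pq.removeOrNull()) |val| {
        std.debug.print("{d}\n", .{val}); // 42, 7, 3 -- max first
    }
}
```

The same shape works for any state: a sort key selector, a reference table for tie-breaking, whatever the comparison needs.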

A more practical example -- a simple task scheduler:

const std = @import("std");

const Task = struct {
    name: []const u8,
    priority: u8, // lower = more urgent
    deadline: i64,
};

fn taskCompare(_: void, a: Task, b: Task) std.math.Order {
    // Sort by priority first, then by deadline
    if (a.priority != b.priority) {
        return std.math.order(a.priority, b.priority);
    }
    return std.math.order(a.deadline, b.deadline);
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    var scheduler = std.PriorityQueue(Task, void, taskCompare).init(allocator, {});
    defer scheduler.deinit();

    try scheduler.add(.{ .name = "backup", .priority = 3, .deadline = 1000 });
    try scheduler.add(.{ .name = "deploy", .priority = 1, .deadline = 500 });
    try scheduler.add(.{ .name = "cleanup", .priority = 3, .deadline = 200 });
    try scheduler.add(.{ .name = "alert", .priority = 0, .deadline = 100 });

    std.debug.print("Processing tasks:\n", .{});
    while (scheduler.removeOrNull()) |task| {
        std.debug.print("  [{d}] {s} (deadline: {d})\n", .{
            task.priority, task.name, task.deadline,
        });
    }
}

Priority queues are perfect for job schedulers, event-driven simulations, Dijkstra's algorithm, and any scenario where you need "give me the most urgent item." They're NOT good for arbitrary lookups (that's what hash maps are for) or ordered traversal of all elements (that's what sorted arrays or trees are for). Each data structure has its niche ;-)

std.SegmentedList -- stable pointers

Here's an interesting one. std.SegmentedList is a list where pointers to elements remain valid even after the list grows. With a normal ArrayList, when the backing array resizes, all existing pointers become dangling because the data moved to a new allocation. SegmentedList solves this by allocating in segments (chunks) and never moving previous segments.

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // SegmentedList with prealloc of 4 elements per segment
    var list = std.SegmentedList(i32, 4){};
    defer list.deinit(allocator);

    // Add some elements and capture a pointer to the first one
    const ptr0 = try list.addOne(allocator);
    ptr0.* = 100;

    const ptr1 = try list.addOne(allocator);
    ptr1.* = 200;

    // Add more elements -- this may allocate new segments
    // but ptr0 and ptr1 remain valid
    for (0..20) |i| {
        const p = try list.addOne(allocator);
        p.* = @intCast(i * 10);
    }

    // ptr0 still points to our original value
    std.debug.print("ptr0 still valid: {d}\n", .{ptr0.*}); // 100
    std.debug.print("ptr1 still valid: {d}\n", .{ptr1.*}); // 200
    std.debug.print("Total elements: {d}\n", .{list.len});
}

The pointer stability guarantee makes SegmentedList useful in situations where other parts of your program hold references to elements in the list. Think of a graph where nodes are stored in a list and edges point to nodes -- with an ArrayList, growing the node list invalidates all edge pointers. With SegmentedList, the edges remain valid.
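To make the graph idea concrete, here's a minimal sketch (the Node struct is mine): nodes live in a SegmentedList, an "edge" is just a pointer to a node, and that pointer survives the list growing past its first segments.

```zig
const std = @import("std");

const Node = struct {
    name: []const u8,
    visits: u32 = 0,
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Tiny prealloc (2) so growth kicks in quickly
    var nodes = std.SegmentedList(Node, 2){};
    defer nodes.deinit(allocator);

    const a = try nodes.addOne(allocator);
    a.* = .{ .name = "A" };
    const b = try nodes.addOne(allocator);
    b.* = .{ .name = "B" };

    // An edge from A to B is just a stable pointer
    const edge_a_to_b: *Node = b;

    // Grow the list well past the first segment -- with an
    // ArrayList this could move the data and dangle the pointer
    for (0..50) |_| {
        const n = try nodes.addOne(allocator);
        n.* = .{ .name = "filler" };
    }

    // The edge pointer still points at node B
    edge_a_to_b.visits += 1;
    std.debug.print("edge target: {s}, visits: {d}\n", .{
        edge_a_to_b.name, edge_a_to_b.visits,
    });
}
```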

The downside is that element access by index is slightly slower (you have to compute which segment the index falls in, then index within that segment), and iteration is a tiny bit slower too because of the segment boundary crossings. For most applications this overhead is negligible, but it's there. If you don't need pointer stability, ArrayList is the simpler and faster choice.

Choosing the right data structure

Let me give you a quick reference table. This is what I keep in my head when picking a collection:

Need                              | Data structure                  | Why
Key-value lookup (primitive keys) | std.AutoHashMap                 | O(1) lookup, auto-generated hash
Key-value lookup (string keys)    | std.StringHashMap               | Convenience wrapper for []const u8 keys
Key-value lookup (custom keys)    | std.HashMap with custom context | Full control over hash and equality
Dynamic array (heap)              | std.ArrayList                   | Growable, O(1) amortized append
Dynamic array (stack, bounded)    | std.BoundedArray                | No allocator needed, compile-time capacity
Priority ordering                 | std.PriorityQueue               | O(log n) insert and remove-min
List with stable pointers         | std.SegmentedList               | Element pointers survive growth
Fixed-size buffer                 | [N]T (plain array)              | Zero overhead, compile-time size
View into array                   | []T (slice)                     | Borrows, no ownership

The decision flowchart is roughly:

  1. Do you need key-value lookup? -> Hash map family
  2. Do you need ordered access to the minimum/maximum? -> PriorityQueue
  3. Do you need a growable list? -> ArrayList (heap) or BoundedArray (stack)
  4. Do other things hold pointers into your list? -> SegmentedList

And here's a practical example combining several of these together -- a simple word frequency counter that reports the top N most common words:

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const text =
        \\the quick brown fox jumps over the lazy dog
        \\the fox was quick but the dog was lazy
        \\brown fox brown dog quick quick quick
    ;

    // Step 1: Count word frequencies with StringHashMap
    var counts = std.StringHashMap(u32).init(allocator);
    defer counts.deinit();

    // Tokenize on spaces AND newlines -- splitScalar on ' ' alone
    // would leave tokens like "dog\nthe" glued across line breaks
    var words = std.mem.tokenizeAny(u8, text, " \n");
    while (words.next()) |word| {

        const result = try counts.getOrPut(word);
        if (result.found_existing) {
            result.value_ptr.* += 1;
        } else {
            result.value_ptr.* = 1;
        }
    }

    // Step 2: Collect all entries, then sort descending by count
    const Entry = struct { word: []const u8, count: u32 };

    var all = std.ArrayList(Entry).init(allocator);
    defer all.deinit();

    var it = counts.iterator();
    while (it.next()) |entry| {
        try all.append(.{
            .word = entry.key_ptr.*,
            .count = entry.value_ptr.*,
        });
    }

    // Sort descending by count
    std.mem.sort(Entry, all.items, {}, struct {
        fn cmp(_: void, a: Entry, b: Entry) bool {
            return a.count > b.count;
        }
    }.cmp);

    std.debug.print("Top words:\n", .{});
    for (all.items[0..@min(5, all.items.len)]) |entry| {
        std.debug.print("  \"{s}\": {d}\n", .{ entry.word, entry.count });
    }
}

HashMap for counting, ArrayList for sorting, slicing for the top N. Each data structure does what it's best at, and they compose cleanly because they all work with the same allocator model and slice conventions. This composability is one of Zig's strengths -- the standard library types are designed to play nice together without requiring adapters or wrapper types.

Exercises

  1. Build a contact book using std.StringHashMap. Store a Contact struct (with name, email, and phone fields) as the value. Implement add, find, remove, and list_all functions. Print all contacts sorted alphabetically by name (hint: extract keys into an ArrayList, sort them, then look up each one). This combines hash maps with sorting.

  2. Implement a simple "LRU cache" (Least Recently Used) with a maximum capacity of N entries. When the cache is full and a new entry is inserted, remove the least recently accessed entry. Use an AutoHashMap for O(1) lookups combined with a tracking mechanism for access recency. Test it by inserting more than N entries and verifying that the oldest unused entries get evicted.

  3. Build a "frequency-sorted unique words" tool. Read a hardcoded multi-line string, split it into words, count frequencies using StringHashMap, then use a PriorityQueue (max-heap) to extract words in decreasing frequency order. Print each word with its count. This combines three data structures: hash map for counting, priority queue for ordered extraction.

Soooooo, what did we learn?

  • std.AutoHashMap is the go-to for key-value storage with primitive or enum keys. It auto-generates hash and equality functions, handles dynamic resizing, and provides O(1) amortized lookups.
  • std.HashMap is the fully customizable variant -- bring your own hash and equality functions via a context struct. Essential for case-insensitive string maps, struct keys, or any non-standard equality semantics.
  • std.StringHashMap is a convenience wrapper for maps with []const u8 keys (plain AutoHashMap cannot hash slices). Convenient, but remember it does NOT own the string data -- dupe your keys if the source buffer is transient.
  • Hash maps resize automatically but you can pre-allocate with ensureTotalCapacity to avoid incremental rehashing. This matters in performance-critical code paths.
  • getOrPut is the most useful method on any hash map -- single lookup for "get existing or create new." Saves the double-lookup antipattern.
  • std.BoundedArray gives you a stack-allocated dynamic array with compile-time capacity. No allocator needed, no heap, just a [N]T with a length field.
  • std.PriorityQueue is a binary heap -- O(log n) insert and remove-min/max. Perfect for schedulers, pathfinding algorithms, and anything that needs ordered processing.
  • std.SegmentedList guarantees pointer stability across growth. Use it when other parts of your code hold references into the list.

Thanks for reading, see you next time!

@scipio


