Learn Zig Series (#91) - MessagePack Format

@scipio 71

24 days ago

StemSocial

What will I learn?

Why a self-describing binary format exists, and how it differs from the schema-driven protobuf we built last episode;
How MessagePack packs a type tag and (often) a small value into a single format byte;
The fixint / fixstr / fixarray / fixmap tricks that make small data almost free;
How to model dynamic, JSON-shaped data in Zig with a tagged union Value type;
How to write a recursive encoder and a recursive decoder that walk arbitrary nested structures;
Why MessagePack is big-endian, and how std.mem.writeInt/readInt make that a non-issue;
Where the lifetimes and allocations hide in a self-describing decoder, and how to free the tree you built;
How the Zig version stacks up against the C, Rust, and Go libraries you'd reach for in production.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Zig 0.14+ distribution (download from ziglang.org);
The ambition to learn Zig programming.

Difficulty

Intermediate

Curriculum (of the `Learn Zig Series`):

Learn Zig Series (#91) - MessagePack Format

Solutions to Episode 90 Exercises

Last episode we hand-built Protocol Buffers -- the varint, ZigZag, the tag that fuses a field number with a wire type, length-delimited framing, and a comptime encoder generated straight off a struct definition. The three exercises pushed Person from a flat record into something with repetition, nesting and full generic handling. Here are the solutions, each reusing writeVarint, readVarint, encodeTag, decodeTag, writeLenField, readLen and skipField from episode 90.

Exercise 1: Add a repeated field

A repeated string in protobuf is not a special container -- it is simply the same tag, written more than once. So phones becomes "emit field 4 once per entry", and decoding collects every field-4 occurrence into a list:

const std = @import("std");

const Person = struct {
    id: i64,
    name: []const u8,
    email: []const u8,
    phones: []const []const u8,
};

pub fn encodePerson(alloc: std.mem.Allocator, p: Person) ![]u8 {
    var out: std.ArrayListUnmanaged(u8) = .{};
    errdefer out.deinit(alloc);

    try writeVarint(&out, alloc, encodeTag(1, .varint));
    try writeVarint(&out, alloc, @bitCast(p.id));
    try writeLenField(&out, alloc, 2, p.name);
    try writeLenField(&out, alloc, 3, p.email);
    for (p.phones) |phone| try writeLenField(&out, alloc, 4, phone); // field 4, repeated
    return out.toOwnedSlice(alloc);
}

pub fn decodePerson(alloc: std.mem.Allocator, buf: []const u8) !Person {
    var id: i64 = 0;
    var name: []const u8 = "";
    var email: []const u8 = "";
    var phones: std.ArrayListUnmanaged([]const u8) = .{};
    errdefer phones.deinit(alloc);

    var off: usize = 0;
    while (off < buf.len) {
        const d = decodeTag(try readVarint(buf, &off));
        switch (d.field) {
            1 => id = @bitCast(try readVarint(buf, &off)),
            2 => name = try readLen(buf, &off),
            3 => email = try readLen(buf, &off),
            4 => try phones.append(alloc, try readLen(buf, &off)), // accumulate
            else => try skipField(buf, &off, d.wire_type),
        }
    }
    return .{ .id = id, .name = name, .email = email, .phones = try phones.toOwnedSlice(alloc) };
}

test "repeated phones round-trip" {
    const alloc = std.testing.allocator;
    const phones = [_][]const u8{ "555-1111", "555-2222", "555-3333" };
    const original = Person{ .id = 7, .name = "scipio", .email = "[email protected]", .phones = &phones };

    const bytes = try encodePerson(alloc, original);
    defer alloc.free(bytes);
    const back = try decodePerson(alloc, bytes);
    defer alloc.free(back.phones);

    try std.testing.expectEqual(@as(usize, 3), back.phones.len);
    try std.testing.expectEqualStrings("555-2222", back.phones[1]);
}

The detail that matters is the errdefer phones.deinit(alloc): if a later field is malformed and we bail, the list's backing array gets reclaimed. (The phone slices themselves borrow from buf, so there is nothing else to free.) Same allocator discipline from episode 7, same place it always shows up -- on the error path.

Exercise 2: Encode a nested message

A nested message is not a new wire type. It is an ordinary length-delimited field whose payload happens to be another fully encoded message. So we encode the Address into its own buffer first, then write that buffer as field 5 with wire type .len:

const Address = struct { street: []const u8, city: []const u8 };

pub fn encodeAddress(alloc: std.mem.Allocator, a: Address) ![]u8 {
    var out: std.ArrayListUnmanaged(u8) = .{};
    errdefer out.deinit(alloc);
    try writeLenField(&out, alloc, 1, a.street);
    try writeLenField(&out, alloc, 2, a.city);
    return out.toOwnedSlice(alloc);
}

pub fn decodeAddress(buf: []const u8) !Address {
    var a: Address = .{ .street = "", .city = "" };
    var off: usize = 0;
    while (off < buf.len) {
        const d = decodeTag(try readVarint(buf, &off));
        switch (d.field) {
            1 => a.street = try readLen(buf, &off),
            2 => a.city = try readLen(buf, &off),
            else => try skipField(buf, &off, d.wire_type),
        }
    }
    return a;
}

// Inside encodePerson, to embed the address as field 5:
//     const addr_bytes = try encodeAddress(alloc, p.address);
//     defer alloc.free(addr_bytes);
//     try writeLenField(&out, alloc, 5, addr_bytes);
//
// And inside decodePerson:
//     5 => p.address = try decodeAddress(try readLen(buf, &off)),

That is recursion at the wire level: readLen slices out the sub-message bytes, decodeAddress parses them in isolation. The genuinely nice property -- to an outer decoder that does not know field 5, the whole nested address is just one length-delimited blob it can skip. Nesting costs the format nothing.

Exercise 3: Write a generic decoder with comptime

Episode 90 gave us encodeStruct via @typeInfo. The matching decodeStruct walks the same fields, matching wire field numbers (1-based declaration order) and filling each slot by its Zig type:

pub fn decodeStruct(comptime T: type, buf: []const u8) !T {
    var result: T = undefined;
    // start every field at a sane zero so absent fields are well-defined
    inline for (@typeInfo(T).@"struct".fields) |field| {
        @field(result, field.name) = switch (@typeInfo(field.type)) {
            .int => 0,
            .pointer => "",
            else => @compileError("unsupported field: " ++ field.name),
        };
    }

    var off: usize = 0;
    while (off < buf.len) {
        const d = decodeTag(try readVarint(buf, &off));
        var matched = false;
        inline for (@typeInfo(T).@"struct".fields, 1..) |field, field_num| {
            if (d.field == field_num) {
                matched = true;
                switch (@typeInfo(field.type)) {
                    .int => @field(result, field.name) = @intCast(try readVarint(buf, &off)),
                    .pointer => @field(result, field.name) = try readLen(buf, &off),
                    else => unreachable, // already rejected above at compile time
                }
            }
        }
        if (!matched) try skipField(buf, &off, d.wire_type); // unknown field number
    }
    return result;
}

The inline for unrolls into one if per field, so field-number matching is straight-line code with zero runtime reflection -- exactly the property from episode 32. And because field numbers are unique, only one branch ever advances off, so the trailing iterations are harmless no-ops. Hand this a struct produced by encodeStruct and it round-trips. Three exercises, and protobuf went from "a hand-written codec for one struct" to "a generic, schema-shaped codec for any struct". Now for the format I pointed at when I closed last episode.

Learn Zig Series (#91) - MessagePack Format

Here we go ;-) At the very end of episode 90 I dangled a different family of binary formats in front of you: the ones that throw out the pre-shared schema and instead make the bytes self-describing, carrying their own type information inline. MessagePack -- "msgpack" if you're typing in a hurry -- is the cleanest, most widely deployed example of exactly that idea. Sadayuki Furuhashi created it, and today it's available inside Redis's Lua scripting (the bundled cmsgpack library), it's the wire format under Fluentd's log shipping and Neovim's whole editor API (msgpack-rpc), and it carries a startling amount of mobile and game traffic where every byte over the radio costs battery.

The one-line pitch is the one everybody uses, and it happens to be true: MessagePack is JSON, but binary. Same data model -- nulls, booleans, integers, floats, strings, arrays, maps -- but instead of {"id":1986,"name":"scipio"} taking 27 characters of text with quotes and braces and a re-parse on every read, msgpack ships a compact byte sequence where each value announces its own type in a single leading byte. No .proto file, no schema agreed in advance, no field-number map. You can decode a msgpack blob you have never seen before and recover its full structure, the same way you can JSON.parse an arbitrary document. That is the whole trade we're studying today.

Schema-driven versus self-describing

It's worth pinning down the contrast with last episode, because it is the entire design axis. Protobuf is schema-driven: the bytes are tiny (a field is a one-byte tag plus a value, the name lives only in the shared .proto), but you cannot make sense of a message without that schema, and both ends must agree on it ahead of time. MessagePack is self-describing: every value pays a small tax to carry its own type tag inline, so the bytes are a touch larger, but any decoder can reconstruct any message with nothing agreed up front.

Neither is "better". Protobuf wins when the two sides are your own services and you control both ends -- you happily ship a schema to buy the smaller bytes. MessagePack wins when the shape is dynamic or the peer is a stranger -- an editor plugin, a cache holding arbitrary user blobs, a log pipeline swallowing whatever JSON people throw at it. The same way you'd pick a struct when you know the fields and a hash map when you don't (episode 22), you pick protobuf when you know the schema and msgpack when you don't. Having said that, you'll notice the machinery is cousins: a leading tag byte, fixed-width integers with a pinned byte order (little-endian in protobuf, big-endian here), length-prefixed strings. We sharpened those instincts last episode; here they pay off again.

The format byte: one byte to rule them all

Everything in MessagePack starts with a format byte. For small values that one byte is the whole encoding -- no separate length, no payload. The spec carves the 256 possible first bytes into ranges, and the cleverness is in how those ranges are chosen so the common cases vanish into a single byte:

positive fixint 0x00..0x7f -- the byte 0xxxxxxx literally is the integer 0..127. A small count, an array length, an enum tag: one byte, zero overhead.
negative fixint 0xe0..0xff -- the byte 111xxxxx is a small negative integer -32..-1, recovered by reading it as a signed i8.
fixstr 0xa0..0xbf -- 101xxxxx, a string whose length (0..31) is baked into the low 5 bits; the bytes follow.
fixarray 0x90..0x9f -- 1001xxxx, an array of up to 15 elements.
fixmap 0x80..0x8f -- 1000xxxx, a map of up to 15 key/value pairs.
and a table of explicit markers for everything larger: 0xc0 nil, 0xc2/0xc3 false/true, 0xcc..0xcf uint8..uint64, 0xd0..0xd3 int8..int64, 0xca/0xcb float32/float64, 0xd9..0xdb str8/str16/str32, 0xdc/0xdd array16/array32, 0xde/0xdf map16/map32, 0xc4..0xc6 bin8/bin16/bin32.

Look at what those fixint and fixstr ranges buy you. A number under 128 costs one byte. The string "id" costs three: one format byte plus the two letters. JSON's "id" costs four characters just for the quotes and key, and that's before the colon. The whole format is tuned so that small, frequent data -- which is most real data -- pays almost nothing.
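To make those costs concrete, here is a hand-assembled encoding of the {"id":1986,"name":"scipio"} document from the intro, as a small sketch. The names packed_doc and json_doc are mine, purely for illustration; the bytes follow the ranges listed above (fixmap, fixstr, and the uint16 marker 0xcd, since 1986 does not fit in a fixint or a uint8).

```zig
const std = @import("std");

// Hand-assembled MessagePack bytes for {"id":1986,"name":"scipio"}.
// (packed_doc is a hypothetical name, not part of the episode's codec.)
const packed_doc = [_]u8{
    0x82, // fixmap with 2 key/value pairs
    0xa2, 'i', 'd', // fixstr, length 2
    0xcd, 0x07, 0xc2, // uint16 marker, then 1986 = 0x07c2 big-endian
    0xa4, 'n', 'a', 'm', 'e', // fixstr, length 4
    0xa6, 's', 'c', 'i', 'p', 'i', 'o', // fixstr, length 6
};

// The same document as JSON text.
const json_doc = "{\"id\":1986,\"name\":\"scipio\"}";

test "the msgpack form is smaller than the JSON text" {
    try std.testing.expectEqual(@as(usize, 19), packed_doc.len);
    try std.testing.expectEqual(@as(usize, 27), json_doc.len);
    // The two bytes after 0xcd really are 1986 in big-endian order.
    try std.testing.expectEqual(@as(u16, 1986), std.mem.readInt(u16, packed_doc[5..7], .big));
}
```

Nineteen bytes against twenty-seven characters, and the saving is all in the punctuation: no quotes, no colons, no braces.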

Let's pin a few of those constants down in Zig so the rest of the code reads cleanly:

const std = @import("std");

// A handful of the fixed markers we'll branch on. The fix* ranges are handled
// by range matching, not named constants, because they pack a value into the byte.
const NIL: u8 = 0xc0;
const FALSE: u8 = 0xc2;
const TRUE: u8 = 0xc3;
const UINT8: u8 = 0xcc;
const UINT16: u8 = 0xcd;
const UINT32: u8 = 0xce;
const UINT64: u8 = 0xcf;
const FLOAT64: u8 = 0xcb;
const STR8: u8 = 0xd9;
const STR16: u8 = 0xda;
const STR32: u8 = 0xdb;
const ARRAY16: u8 = 0xdc;
const MAP16: u8 = 0xde;
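As a quick sanity check on the range table, the carving of the first byte can be sketched as a single exhaustive switch. Family and familyOf are hypothetical names for this illustration only; the real decoder further down branches on the same ranges but produces values rather than labels.

```zig
const std = @import("std");

// Hypothetical labels for the six regions of the first byte.
const Family = enum { positive_fixint, fixmap, fixarray, fixstr, marker, negative_fixint };

// Classify a format byte by the range it falls in. The switch is exhaustive
// over u8, so the compiler confirms the ranges leave no gap and no overlap.
fn familyOf(b: u8) Family {
    return switch (b) {
        0x00...0x7f => .positive_fixint,
        0x80...0x8f => .fixmap,
        0x90...0x9f => .fixarray,
        0xa0...0xbf => .fixstr,
        0xc0...0xdf => .marker, // nil, bool, bin, ext, float, uint, int, str, array, map
        0xe0...0xff => .negative_fixint,
    };
}

test "every first byte belongs to exactly one family" {
    try std.testing.expectEqual(Family.positive_fixint, familyOf(0x2a));
    try std.testing.expectEqual(Family.fixstr, familyOf(0xa2));
    try std.testing.expectEqual(Family.marker, familyOf(0xcd));
    try std.testing.expectEqual(Family.negative_fixint, familyOf(0xff));
}
```

Half of the 256 possible first bytes go to positive fixint alone, which says a lot about what the format's designers expected real data to look like.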

A Value type for dynamic, JSON-shaped data

Because msgpack is self-describing, the natural Zig representation of "a decoded msgpack document" is a recursive tagged union -- exactly the tool from episode 6, and the same shape you'd reach for to model a JSON value. Each variant is one of the data-model types, and the container variants point at slices of more Values:

pub const Value = union(enum) {
    nil,
    boolean: bool,
    uint: u64,
    int: i64,
    float: f64,
    str: []const u8,
    bin: []const u8,
    array: []const Value, // recursive: an array of more Values
    map: []const Pair, // recursive: a list of key/value pairs

    pub const Pair = struct { key: Value, value: Value };
};

That array: []const Value is the recursion that gives the format its flexibility. A msgpack document is a tree, and this union is a faithful in-memory mirror of that tree. The thing Zig forces you to confront -- and JSON-in-a-garbage-collected-language lets you ignore -- is that someone owns those slices. When we decode an array, we allocate a []Value; freeing the document later means walking the tree and freeing every allocation. We'll get there, but hold the thought: self-describing means dynamically sized means heap allocation means a lifetime you're responsible for.
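To see the tree shape in action before any bytes are involved, here is a small sketch that builds a Value by hand and walks it recursively. countNodes is a hypothetical helper, not part of the codec; it only illustrates that every operation on a Value (encode, decode, free, print) ends up with this same switch-and-recurse outline.

```zig
const std = @import("std");

pub const Value = union(enum) {
    nil,
    boolean: bool,
    uint: u64,
    int: i64,
    float: f64,
    str: []const u8,
    bin: []const u8,
    array: []const Value,
    map: []const Pair,

    pub const Pair = struct { key: Value, value: Value };
};

// Count every node in the tree: the node itself plus everything below it.
fn countNodes(v: Value) usize {
    var n: usize = 1;
    switch (v) {
        .array => |items| {
            for (items) |item| n += countNodes(item);
        },
        .map => |pairs| {
            for (pairs) |p| n += countNodes(p.key) + countNodes(p.value);
        },
        else => {}, // scalars and strings are leaves
    }
    return n;
}

test "a nested document is a tree of Values" {
    // { "id": 1986, "tags": ["zig", "msgpack"] }
    const inner = [_]Value{ .{ .str = "zig" }, .{ .str = "msgpack" } };
    const pairs = [_]Value.Pair{
        .{ .key = .{ .str = "id" }, .value = .{ .uint = 1986 } },
        .{ .key = .{ .str = "tags" }, .value = .{ .array = &inner } },
    };
    // map + 2 keys + 1 uint + 1 array + 2 strings
    try std.testing.expectEqual(@as(usize, 7), countNodes(.{ .map = &pairs }));
}
```

Nothing is allocated here because the slices point at stack arrays; the ownership question only appears once a decoder has to conjure those slices at runtime.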

Encoding: writing the format bytes

Encoding is a recursive walk over the Value tree. For each node, pick the smallest format that fits, write its bytes, recurse into children. Integers are where the "smallest that fits" logic earns its keep -- we promote up the ladder only as far as the value demands:

fn appendBig(list: *std.ArrayListUnmanaged(u8), alloc: std.mem.Allocator, comptime T: type, n: T) !void {
    var tmp: [@sizeOf(T)]u8 = undefined;
    std.mem.writeInt(T, &tmp, n, .big); // msgpack is big-endian ("network byte order")
    try list.appendSlice(alloc, &tmp);
}

pub fn writeUint(list: *std.ArrayListUnmanaged(u8), alloc: std.mem.Allocator, n: u64) !void {
    if (n < 0x80) {
        try list.append(alloc, @intCast(n)); // positive fixint: the value IS the byte
    } else if (n <= 0xff) {
        try list.append(alloc, UINT8);
        try list.append(alloc, @intCast(n));
    } else if (n <= 0xffff) {
        try list.append(alloc, UINT16);
        try appendBig(list, alloc, u16, @intCast(n));
    } else if (n <= 0xffffffff) {
        try list.append(alloc, UINT32);
        try appendBig(list, alloc, u32, @intCast(n));
    } else {
        try list.append(alloc, UINT64);
        try appendBig(list, alloc, u64, n);
    }
}

The .big argument to std.mem.writeInt is the entire endianness story. MessagePack, like most of the network formats we've touched in this arc -- DNS in episode 82, the HTTP/2 frames in episode 85 -- is big-endian (protobuf's fixed64 from last episode is the odd one out: it is little-endian). Zig's writeInt takes the byte order as an explicit parameter, so there's no htons/htonl dance and no silent assumption about the host: you name .big, you get big-endian, on a little-endian laptop or a big-endian router alike. That explicitness is precisely the episode-35 cross-compilation lesson made concrete -- the code is correct on every target because nothing is left to the platform.
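A tiny sketch of that explicitness, using the same 1986 as elsewhere in this episode. The helper name be16 is hypothetical; the only library calls are std.mem.writeInt and std.mem.readInt.

```zig
const std = @import("std");

// Serialize a u16 in big-endian order, whatever the host happens to be.
fn be16(n: u16) [2]u8 {
    var out: [2]u8 = undefined;
    std.mem.writeInt(u16, &out, n, .big);
    return out;
}

test "the byte order is whatever you name" {
    // 1986 = 0x07c2: most significant byte first in big-endian.
    const big = be16(1986);
    try std.testing.expectEqualSlices(u8, &[_]u8{ 0x07, 0xc2 }, &big);

    // Ask for .little and the same value comes out mirrored.
    var little: [2]u8 = undefined;
    std.mem.writeInt(u16, &little, 1986, .little);
    try std.testing.expectEqualSlices(u8, &[_]u8{ 0xc2, 0x07 }, &little);

    // readInt with the matching order recovers the original.
    try std.testing.expectEqual(@as(u16, 1986), std.mem.readInt(u16, &big, .big));
}
```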

Strings follow the same "smallest format that fits" shape, but the small case dissolves into the format byte itself via a bitwise OR:

pub fn writeStr(list: *std.ArrayListUnmanaged(u8), alloc: std.mem.Allocator, s: []const u8) !void {
    const len = s.len;
    if (len <= 31) {
        try list.append(alloc, 0xa0 | @as(u8, @intCast(len))); // fixstr: 101xxxxx
    } else if (len <= 0xff) {
        try list.append(alloc, STR8);
        try list.append(alloc, @intCast(len));
    } else if (len <= 0xffff) {
        try list.append(alloc, STR16);
        try appendBig(list, alloc, u16, @intCast(len));
    } else {
        try list.append(alloc, STR32);
        try appendBig(list, alloc, u32, @intCast(len));
    }
    try list.appendSlice(alloc, s); // the raw bytes, verbatim
}

Now the dispatcher that ties it together. One switch over the union tag, recursing into arrays and maps:

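// Explicit error set: encode recurses, and the language reference warns that
// inferred error sets are incompatible with recursion. This assumes the helpers
// can only fail with OutOfMemory.
pub fn encode(list: *std.ArrayListUnmanaged(u8), alloc: std.mem.Allocator, v: Value) std.mem.Allocator.Error!void {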
    switch (v) {
        .nil => try list.append(alloc, NIL),
        .boolean => |b| try list.append(alloc, if (b) TRUE else FALSE),
        .uint => |n| try writeUint(list, alloc, n),
        .int => |n| try writeInt(list, alloc, n), // mirror of writeUint for signed values
        .float => |f| {
            try list.append(alloc, FLOAT64);
            try appendBig(list, alloc, u64, @bitCast(f)); // raw IEEE-754 bits, big-endian
        },
        .str => |s| try writeStr(list, alloc, s),
        .bin => |b| try writeBin(list, alloc, b), // same shape as writeStr, bin8/16/32 markers
        .array => |items| {
            if (items.len <= 15) {
                try list.append(alloc, 0x90 | @as(u8, @intCast(items.len))); // fixarray
            } else {
                try list.append(alloc, ARRAY16);
                try appendBig(list, alloc, u16, @intCast(items.len));
            }
            for (items) |item| try encode(list, alloc, item); // recurse
        },
        .map => |pairs| {
            if (pairs.len <= 15) {
                try list.append(alloc, 0x80 | @as(u8, @intCast(pairs.len))); // fixmap
            } else {
                try list.append(alloc, MAP16);
                try appendBig(list, alloc, u16, @intCast(pairs.len));
            }
            for (pairs) |p| {
                try encode(list, alloc, p.key); // key, then value, then next pair
                try encode(list, alloc, p.value);
            }
        },
    }
}

The @bitCast(f) on the float is the same move we made for ZigZag last episode: we want the raw IEEE-754 bit pattern reinterpreted as a u64, not a numeric conversion. @bitCast says "same bits, different type", and that's exactly what a serializer needs -- the bytes of the float, not a rounded integer. Note also that the container headers carry only a count, not a byte length: a decoder reading fixarray 3 knows to decode exactly three more values, whatever sizes they turn out to be. That's the self-describing property doing the work -- each child announces its own size as the decoder reaches it.
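The dispatcher leans on a writeInt that the episode describes only as "the mirror of writeUint". One possible shape, as a sketch: the INT8..INT64 constants come from the spec table above (0xd0..0xd3), non-negative values are handed to writeUint so they get the shortest unsigned form, and the List and Alloc aliases are just local shorthand of mine. Treat it as an assumption about how the missing helper could look, not as the series' own code.

```zig
const std = @import("std");

const List = std.ArrayListUnmanaged(u8);
const Alloc = std.mem.Allocator;

// Signed markers from the spec table: int8, int16, int32, int64.
const INT8: u8 = 0xd0;
const INT16: u8 = 0xd1;
const INT32: u8 = 0xd2;
const INT64: u8 = 0xd3;

fn appendBig(list: *List, alloc: Alloc, comptime T: type, n: T) !void {
    var tmp: [@sizeOf(T)]u8 = undefined;
    std.mem.writeInt(T, &tmp, n, .big);
    try list.appendSlice(alloc, &tmp);
}

// Same ladder as the episode's writeUint, repeated so the sketch stands alone.
pub fn writeUint(list: *List, alloc: Alloc, n: u64) !void {
    if (n < 0x80) {
        try list.append(alloc, @intCast(n));
    } else if (n <= 0xff) {
        try list.append(alloc, 0xcc);
        try list.append(alloc, @intCast(n));
    } else if (n <= 0xffff) {
        try list.append(alloc, 0xcd);
        try appendBig(list, alloc, u16, @intCast(n));
    } else if (n <= 0xffff_ffff) {
        try list.append(alloc, 0xce);
        try appendBig(list, alloc, u32, @intCast(n));
    } else {
        try list.append(alloc, 0xcf);
        try appendBig(list, alloc, u64, n);
    }
}

// Sketch of the signed counterpart: promote only as far as the value demands.
pub fn writeInt(list: *List, alloc: Alloc, n: i64) !void {
    if (n >= 0) return writeUint(list, alloc, @intCast(n));
    if (n >= -32) {
        // negative fixint: the i8 bit pattern IS the format byte (0xe0..0xff)
        try list.append(alloc, @bitCast(@as(i8, @intCast(n))));
    } else if (n >= std.math.minInt(i8)) {
        try list.append(alloc, INT8);
        try list.append(alloc, @bitCast(@as(i8, @intCast(n))));
    } else if (n >= std.math.minInt(i16)) {
        try list.append(alloc, INT16);
        try appendBig(list, alloc, i16, @intCast(n));
    } else if (n >= std.math.minInt(i32)) {
        try list.append(alloc, INT32);
        try appendBig(list, alloc, i32, @intCast(n));
    } else {
        try list.append(alloc, INT64);
        try appendBig(list, alloc, i64, n);
    }
}

test "negative integers pick the smallest format" {
    const alloc = std.testing.allocator;
    var out: List = .empty;
    defer out.deinit(alloc);

    try writeInt(&out, alloc, -1); // negative fixint
    try writeInt(&out, alloc, -33); // just below the fixint range: int8
    try writeInt(&out, alloc, -129); // just below i8: int16
    try std.testing.expectEqualSlices(u8, &[_]u8{ 0xff, 0xd0, 0xdf, 0xd1, 0xff, 0x7f }, out.items);
}
```

Note that a decoder has to grow matching 0xd0..0xd3 branches before values below -32 can round-trip.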

Decoding: reading the type back off the wire

Decoding is the mirror image, and it's where self-describing really shows its hand. We read one format byte, and that byte alone tells us everything -- which variant, and how many more bytes (or children) to consume. I'll wrap the cursor in a small struct so the bounds checks live in one place:

const Decoder = struct {
    buf: []const u8,
    off: usize = 0,

    fn byte(self: *Decoder) !u8 {
        if (self.off >= self.buf.len) return error.Truncated;
        defer self.off += 1;
        return self.buf[self.off];
    }

    fn take(self: *Decoder, n: usize) ![]const u8 {
        // Compare against the bytes remaining: self.off + n could overflow for a hostile length.
        if (n > self.buf.len - self.off) return error.Truncated; // bogus length must not run past the end
        defer self.off += n;
        return self.buf[self.off..][0..n];
    }

    fn readBig(self: *Decoder, comptime T: type) !T {
        const raw = try self.take(@sizeOf(T));
        return std.mem.readInt(T, raw[0..@sizeOf(T)], .big);
    }

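    // Explicit error set: decode recurses through decodeArray/decodeMap, and the
    // language reference warns that inferred error sets are incompatible with recursion.
    fn decode(self: *Decoder, alloc: std.mem.Allocator) (error{ Truncated, UnsupportedFormat } || std.mem.Allocator.Error)!Value {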
        const fmt = try self.byte();
        return switch (fmt) {
            0x00...0x7f => .{ .uint = fmt }, // positive fixint
            0xe0...0xff => .{ .int = @as(i8, @bitCast(fmt)) }, // negative fixint -> signed
            NIL => .nil,
            FALSE => .{ .boolean = false },
            TRUE => .{ .boolean = true },
            UINT8 => .{ .uint = try self.byte() },
            UINT16 => .{ .uint = try self.readBig(u16) },
            UINT32 => .{ .uint = try self.readBig(u32) },
            UINT64 => .{ .uint = try self.readBig(u64) },
            FLOAT64 => .{ .float = @bitCast(try self.readBig(u64)) },
            0xa0...0xbf => .{ .str = try self.take(fmt & 0x1f) }, // fixstr: low 5 bits = length
            STR8 => .{ .str = try self.take(try self.byte()) },
            STR16 => .{ .str = try self.take(try self.readBig(u16)) },
            0x90...0x9f => try self.decodeArray(alloc, fmt & 0x0f), // fixarray: low 4 bits = count
            ARRAY16 => try self.decodeArray(alloc, try self.readBig(u16)),
            0x80...0x8f => try self.decodeMap(alloc, fmt & 0x0f), // fixmap
            MAP16 => try self.decodeMap(alloc, try self.readBig(u16)),
            else => error.UnsupportedFormat,
        };
    }
};

That switch is most of the format on one screen (the signed int8..int64, bin, float32 and 32-bit length markers are left out, and they follow the same pattern), and Zig's range patterns (0x00...0x7f) make the fix* families read like the spec table rather than a pile of bit-masking ifs. Two small reinterpretations earn a comment. The negative fixint, 0xe0..0xff, is recovered by @as(i8, @bitCast(fmt)): the byte 0xff reinterpreted as a signed i8 is -1, and 0xe0 is -32, which is precisely the range the spec assigns. And fmt & 0x1f plucks the length straight back out of a fixstr byte -- the same five bits we OR-ed in on the way out. Encode and decode are mirror images, bit for bit.
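Those two reinterpretations are small enough to check in isolation. The three helper names below are hypothetical and exist only for this sketch; the operations inside them are exactly the @bitCast and the mask/OR pair described above.

```zig
const std = @import("std");

// Negative fixint: same eight bits, read as a signed i8.
fn negFixint(fmt: u8) i8 {
    return @bitCast(fmt);
}

// fixstr header: 101 in the top three bits, the length in the low five.
fn fixstrHeader(len: u5) u8 {
    return 0xa0 | @as(u8, len);
}

// And back again: mask off the top three bits to recover the length.
fn fixstrLen(fmt: u8) u8 {
    return fmt & 0x1f;
}

test "fix* bytes carry their value inside the format byte" {
    try std.testing.expectEqual(@as(i8, -1), negFixint(0xff));
    try std.testing.expectEqual(@as(i8, -32), negFixint(0xe0));

    try std.testing.expectEqual(@as(u8, 0xa2), fixstrHeader(2)); // the "id" header
    try std.testing.expectEqual(@as(u8, 31), fixstrLen(fixstrHeader(31)));
}
```

Taking the length as a u5 is a small design nicety: a length above 31 cannot even be passed in, so the OR can never spill into the tag bits.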

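The container decoders are methods of Decoder too (shown separately here, but they belong inside the struct so that self.decodeArray resolves). They allocate, then recurse, with an errdefer so a failure halfway through a malformed array doesn't leak the slice being filled: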

fn decodeArray(self: *Decoder, alloc: std.mem.Allocator, n: usize) !Value {
    const items = try alloc.alloc(Value, n);
    errdefer alloc.free(items); // shallow free if a child decode fails
    for (items) |*slot| slot.* = try self.decode(alloc); // recurse for each element
    return .{ .array = items };
}

fn decodeMap(self: *Decoder, alloc: std.mem.Allocator, n: usize) !Value {
    const pairs = try alloc.alloc(Value.Pair, n);
    errdefer alloc.free(pairs);
    for (pairs) |*p| {
        p.key = try self.decode(alloc);
        p.value = try self.decode(alloc);
    }
    return .{ .map = pairs };
}

Worth being honest about: that errdefer alloc.free(items) is a shallow free. If element 5 of a 10-element array fails to decode, it reclaims the items slice but not the sub-arrays that elements 0..4 may themselves have allocated. For a teaching codec that's an acceptable simplification (the usual production answer is an arena allocator -- episode 26 -- so the whole tree frees in one shot regardless). The string slices, note, borrow from the input buffer via take; they are not copied, so Value.str is only valid as long as the original buf lives. That's the same zero-copy lifetime contract our protobuf readLen had last episode, and Zig's slices carry no ownership of their own to warn you, so it's on you to remember.
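The arena answer is easy to demonstrate without the decoder at all. In this sketch buildNested is a hypothetical stand-in for "something that allocates a tree and never frees the pieces", which is exactly what decodeArray does on the happy path; handing arena.allocator() to Decoder.decode would work the same way.

```zig
const std = @import("std");

// Stand-in for a tree-building decoder: allocates an outer slice plus one
// inner slice per row, and deliberately frees nothing itself.
fn buildNested(a: std.mem.Allocator, rows: usize) ![][]u32 {
    const outer = try a.alloc([]u32, rows);
    for (outer, 0..) |*row, i| {
        row.* = try a.alloc(u32, i + 1);
        @memset(row.*, 0);
    }
    return outer;
}

test "an arena releases the whole tree in one deinit" {
    // std.testing.allocator fails the test if anything leaks.
    var arena = std.heap.ArenaAllocator.init(std.testing.allocator);
    defer arena.deinit(); // one call frees every allocation made through the arena

    const tree = try buildNested(arena.allocator(), 4);
    try std.testing.expectEqual(@as(usize, 4), tree.len);
    try std.testing.expectEqual(@as(usize, 3), tree[2].len);
}
```

The same property covers the error path: if a child fails halfway through, whatever was already allocated is still owned by the arena and goes away with it, so the shallow errdefer stops mattering.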

Testing a self-describing format

As ever, the highest-value test for a codec is the round-trip: encode a Value, decode it back, assert you recovered the same tree. Because msgpack documents nest, a good round-trip test should nest too:

test "msgpack round-trips a nested document" {
    const alloc = std.testing.allocator;

    // { "id": 1986, "tags": ["zig", "msgpack"] }
    const inner = [_]Value{ .{ .str = "zig" }, .{ .str = "msgpack" } };
    const pairs = [_]Value.Pair{
        .{ .key = .{ .str = "id" }, .value = .{ .uint = 1986 } },
        .{ .key = .{ .str = "tags" }, .value = .{ .array = &inner } },
    };
    const doc = Value{ .map = &pairs };

    var out: std.ArrayListUnmanaged(u8) = .{};
    defer out.deinit(alloc);
    try encode(&out, alloc, doc);

    var dec = Decoder{ .buf = out.items };
    const back = try dec.decode(alloc);
    defer freeValue(alloc, back); // walk the tree and free every allocation

    try std.testing.expect(back == .map);
    try std.testing.expectEqual(@as(usize, 2), back.map.len);
    try std.testing.expectEqualStrings("id", back.map[0].key.str);
    try std.testing.expectEqual(@as(u64, 1986), back.map[0].value.uint);
    try std.testing.expectEqualStrings("msgpack", back.map[1].value.array[1].str);
}

The freeValue helper is the recursive counterpart to decode -- the price of a self-describing format that allocates a tree. And beyond the happy path, point a fuzzer (episode 12's habit) at Decoder.decode with random bytes: every error.Truncated guard in byte and take is there so a hostile or corrupt buffer yields a clean error, never an out-of-bounds read. A fixarray header claiming 15 elements with zero bytes following must fail loudly rather than wander off the end -- and because every read funnels through those two bounds-checked helpers, it does.
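A negative test for that property can be sketched with a cut-down cursor. Cursor and readFixstr are hypothetical names standing in for the Decoder above, reduced to the one bounds-checked helper that matters, and error.NotFixstr is invented for this sketch.

```zig
const std = @import("std");

// Minimal stand-in for the episode's Decoder: one bounds-checked read helper.
const Cursor = struct {
    buf: []const u8,
    off: usize = 0,

    fn take(self: *Cursor, n: usize) error{Truncated}![]const u8 {
        if (n > self.buf.len - self.off) return error.Truncated;
        defer self.off += n;
        return self.buf[self.off..][0..n];
    }
};

// Read one fixstr: a header byte, then as many payload bytes as it claims.
fn readFixstr(c: *Cursor) error{ Truncated, NotFixstr }![]const u8 {
    const hdr = (try c.take(1))[0];
    if ((hdr & 0xe0) != 0xa0) return error.NotFixstr;
    return c.take(hdr & 0x1f);
}

test "a header that lies about its length fails cleanly" {
    // 0xa5 claims five bytes of string, but only two follow.
    var bad = Cursor{ .buf = &[_]u8{ 0xa5, 'z', 'i' } };
    try std.testing.expectError(error.Truncated, readFixstr(&bad));

    // An empty buffer fails on the header read itself.
    var empty = Cursor{ .buf = &[_]u8{} };
    try std.testing.expectError(error.Truncated, readFixstr(&empty));

    // The honest version decodes.
    var good = Cursor{ .buf = &[_]u8{ 0xa3, 'z', 'i', 'g' } };
    try std.testing.expectEqualStrings("zig", try readFixstr(&good));
}
```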

Performance considerations

MessagePack is already compact, so the wins are in how you drive the codec, not the format. First, the encoder grows an ArrayListUnmanaged and reallocs as it fills -- fine for the occasional message, wasteful in a hot loop. The fix from the allocator episodes applies unchanged: keep a reusable buffer warm and clearRetainingCapacity between messages, so a firehose of small documents allocates zero times after warm-up. Second, the decoder's tree allocation is the real cost on the read side; an arena (episode 26) collapses the whole document's frees into one deinit, and for documents you only walk once, decoding into borrowed string slices (as we do) beats copying every string into fresh memory.
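The warm-buffer habit looks roughly like this. encodeIdDoc is a hypothetical stand-in for a real encode call; it writes the five bytes of {"id": n} for a small n by hand, which is enough to show the capacity behaviour.

```zig
const std = @import("std");

// Stand-in for encoding one small message into a reusable buffer.
// Writes {"id": id}: fixmap(1), fixstr "id", positive fixint.
fn encodeIdDoc(out: *std.ArrayListUnmanaged(u8), alloc: std.mem.Allocator, id: u7) !void {
    out.clearRetainingCapacity(); // length back to 0, allocation kept
    try out.appendSlice(alloc, &[_]u8{ 0x81, 0xa2, 'i', 'd', id });
}

test "a warm buffer stops allocating after the first message" {
    const alloc = std.testing.allocator;
    var out: std.ArrayListUnmanaged(u8) = .empty;
    defer out.deinit(alloc);

    try encodeIdDoc(&out, alloc, 0);
    const warm_capacity = out.capacity;

    var i: usize = 1;
    while (i < 1000) : (i += 1) {
        try encodeIdDoc(&out, alloc, @intCast(i % 128));
    }
    // Same-sized messages never forced the buffer to grow again.
    try std.testing.expectEqual(warm_capacity, out.capacity);
    try std.testing.expectEqual(@as(usize, 5), out.items.len);
}
```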

Third, a structural note: because each fixint and fixstr is branch-then-done, the decode switch is overwhelmingly predictable -- the CPU's branch predictor loves it -- but a deeply nested document still costs you a recursive call per node, and very deep nesting is a stack-depth risk a paranoid decoder caps. Having said all that, the rule from last episode stands unchanged: do not optimise what you have not measured. Episode 34's profiler will, as reliably as ever, point the finger at the syscalls moving bytes in and out long before it blames your format-byte switch ;-)
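What "caps" might mean in practice, as a sketch: thread a depth counter through the recursion and refuse to go past a limit. Everything here is an assumption for illustration -- the max_depth value of 32, the TooDeep error name, and the fact that skipValue only understands positive fixints and fixarrays to keep the example short.

```zig
const std = @import("std");

const max_depth = 32; // hypothetical limit; pick one that suits your stack budget

const WalkError = error{ Truncated, TooDeep, UnsupportedFormat };

// Walk past one value without building a tree, tracking nesting depth.
// Only positive fixint and fixarray are handled in this sketch.
fn skipValue(buf: []const u8, off: *usize, depth: usize) WalkError!void {
    if (depth > max_depth) return error.TooDeep;
    if (off.* >= buf.len) return error.Truncated;
    const fmt = buf[off.*];
    off.* += 1;
    switch (fmt) {
        0x00...0x7f => {}, // positive fixint: nothing more to consume
        0x90...0x9f => {
            var i: usize = 0;
            while (i < (fmt & 0x0f)) : (i += 1) try skipValue(buf, off, depth + 1);
        },
        else => return error.UnsupportedFormat,
    }
}

test "nesting beyond the cap is rejected, not recursed into" {
    // [1, [2]] is shallow and walks cleanly.
    const shallow = [_]u8{ 0x92, 0x01, 0x91, 0x02 };
    var off: usize = 0;
    try skipValue(&shallow, &off, 0);
    try std.testing.expectEqual(@as(usize, 4), off);

    // Forty one-element arrays wrapped around a single integer.
    const deep = [_]u8{0x91} ** 40 ++ [_]u8{0x01};
    off = 0;
    try std.testing.expectError(error.TooDeep, skipValue(&deep, &off, 0));
}
```

Forty bytes of input are enough to demand forty stack frames, which is why the cap belongs in the decoder and not in the caller's good intentions.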

How this compares to C, Rust, and Go

In C, the reference library is msgpack-c, plus the lovely single-header mpack and cmp for embedded work. They hand you the same format-byte switch we just wrote, with every buffer and every length your responsibility -- and a forgotten bounds check on a str32 length is, once again, a remote read primitive. The code is what we wrote; the difference is nobody's checking the checks for you.

In Go, vmihailenco/msgpack and the tinylib/msgp code generator dominate. The reflection-based path is ergonomic and the codegen path is fast, but both lean on the garbage collector to own the decoded tree -- which is exactly the ownership question Zig made us answer out loud with freeValue.

In Rust, rmp and rmp-serde plug msgpack into serde, so a #[derive(Serialize, Deserialize)] on your struct gets you a codec for free, with the borrow checker enforcing the zero-copy lifetimes we had to track by hand. It's the most foolproof of the three, at the usual cost of fighting lifetimes when a decoded value needs to outlive its buffer.

Zig sits where it always does: we wrote the entire format -- every fix* range, big-endian integers, recursive arrays and maps, bounds-checked reads -- in a couple hundred lines, every allocation visible, every check ours. No hidden reflection, no GC owning your tree behind your back, no macro magic to peer through. For production you might still pull in a library, but you now know precisely which bytes it writes and reads, and why. Bam, jonguh -- that's the recurring payoff of building these from scratch.

Where this is heading

Step back and count what we've built across two episodes. We now have two complete serialization formats: protobuf, schema-driven and minimal, where both ends agree on a .proto; and MessagePack, self-describing and dynamic, where the bytes carry their own types. We've got varints, big-endian fixed-width integers, length-prefixed strings, tag bytes, recursive containers, and round-trip tests for all of it. That is a genuinely deep serialization toolkit, and it didn't come from a library -- it came from your own hands.

But serialization is only ever half a conversation. A format tells you how to turn a Person into bytes and back; it says nothing about how a client asks a remote server to do something with that person and gets an answer. Stand the protobuf encoder from episode 90 next to the TCP and HTTP/2 plumbing from earlier in this arc, and the missing piece almost names itself: a way to package "call this procedure, here are the arguments, send me the result" on top of a serialized payload and a streaming transport. The varint you've now written three times, the tag-then-value rhythm, the length-delimited framing -- they were never the destination. They were the vocabulary. Next time we start speaking it in full sentences.

The pieces were never separate tricks. Schema-driven or self-describing, big-endian or little, fixint or varint -- it's all the same instinct: treat the bytes as a contract, check every length like the peer is hostile, and make the type system carry the meaning. You've now written both sides of that contract, twice over, with your eyes wide open.

Exercises

Add the str32 and array32 paths. Our decoder handles fixstr/str8/str16 and fixarray/array16, but stops short of the 32-bit length markers (STR32 = 0xdb, 0xdd for array32). Add both branches to Decoder.decode, reading a u32 length big-endian, and write a round-trip test with a string longer than 65535 bytes to prove the str32 path actually fires.
Write the recursive freeValue. Implement freeValue(alloc, v) that walks a decoded Value and frees every allocation our decoder made -- the array slices and the map pair slices -- recursing into nested children first (free the leaves before the branch that owns them). The str/bin slices borrow from the input buffer and must not be freed. Run the nested round-trip test under std.testing.allocator and confirm it reports no leaks.
Convert msgpack to JSON. Write toJson(writer, v) that prints any Value as JSON text: nil becomes null, maps become {...}, arrays become [...], strings get quoted and escaped. Decode a msgpack blob and pipe it through toJson to a std.io writer. This proves the two formats share one data model -- and gives you a debugging tool for every msgpack byte stream you'll ever stare at.

Thanks, and until next time!

@scipio
