Learn Zig Series (#87) - WebSocket Protocol

What will I learn?

  • Why WebSocket exists, and what it does that plain HTTP simply cannot;
  • How the opening handshake upgrades an ordinary HTTP request into a persistent, two-way connection;
  • How to compute the Sec-WebSocket-Accept value with SHA-1 and base64;
  • How a WebSocket frame is laid out on the wire, bit by bit;
  • How to parse incoming frames, including the 16- and 64-bit extended length forms;
  • Why client-to-server frames are masked, and how the XOR masking actually works;
  • How to encode text, binary, and control frames (ping, pong, close);
  • How to unit-test a protocol implementation without a live browser anywhere in sight.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Zig 0.14+ distribution (download from ziglang.org);
  • The ambition to learn Zig programming.

Difficulty

  • Intermediate

Curriculum (of the Learn Zig Series):

Learn Zig Series (#87) - WebSocket Protocol

Solutions to Episode 86 Exercises

Last episode I left you three exercises on top of the TlsClient we wrapped around OpenSSL -- ALPN negotiation, a non-blocking client, and a tiny HTTPS server. They all build on that same struct, so keep the episode 86 file open beside this one.

Exercise 1: Add ALPN negotiation

// Reuses TlsClient and the `c` namespace from episode 86.

/// Advertise the protocols we speak, most-preferred first. The wire format is
/// NOT a comma-separated string: it's a sequence of length-prefixed entries,
/// each one byte of length followed by that many ASCII bytes.
pub fn setAlpn(self: *TlsClient) void {
    const protos = "\x02h2\x08http/1.1"; // "h2", then "http/1.1"
    _ = c.SSL_set_alpn_protos(self.ssl, protos, protos.len);
}

/// After a successful handshake, ask which protocol the server agreed to.
pub fn selectedAlpn(self: *TlsClient) []const u8 {
    var data: [*c]const u8 = null;
    var len: c_uint = 0;
    c.SSL_get0_alpn_selected(self.ssl, &data, &len);
    if (len == 0) return ""; // server ignored ALPN -> fall back to http/1.1
    return data[0..len];
}

The trap people fall into is the wire format. ALPN is not "h2,http/1.1" -- it's length-prefixed, so h2 becomes the two bytes 0x02 'h' '2'. Get that wrong and OpenSSL silently advertises garbage and the server picks nothing. After the handshake, SSL_get0_alpn_selected hands back a pointer-plus-length (no NUL terminator, hence the data[0..len] slice), and that single string is how a real client decides between speaking HTTP/2 or HTTP/1.1 over the same port 443.
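If hand-writing those length bytes feels error-prone, the list can be built from plain names. This is a minimal sketch: buildAlpnList and its error names are my own illustration, not part of the episode 86 TlsClient or of OpenSSL.

```zig
const std = @import("std");

/// Hypothetical helper: pack protocol names into the ALPN wire format,
/// i.e. one length byte followed by that many name bytes, per entry.
/// Returns the filled prefix of `out`.
pub fn buildAlpnList(names: []const []const u8, out: []u8) ![]u8 {
    var off: usize = 0;
    for (names) |name| {
        // A single length byte can only describe 1..255 bytes.
        if (name.len == 0 or name.len > 255) return error.InvalidProtocolName;
        if (out.len - off < 1 + name.len) return error.BufferTooSmall;
        out[off] = @intCast(name.len);
        @memcpy(out[off + 1 ..][0..name.len], name);
        off += 1 + name.len;
    }
    return out[0..off];
}
```

Feeding it the names h2 and http/1.1 yields the same 11 bytes as the hand-written literal in setAlpn, which makes a handy sanity test for the format.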

Exercise 2: Make the client non-blocking

const std = @import("std");
const posix = std.posix;

/// Flip the socket into non-blocking mode (recall O_NONBLOCK from the I/O episodes).
pub fn setNonBlocking(fd: posix.fd_t) !void {
    var flags = try posix.fcntl(fd, posix.F.GETFL, 0);
    flags |= 1 << @bitOffsetOf(posix.O, "NONBLOCK"); // the bit position differs between Linux and macOS
    _ = try posix.fcntl(fd, posix.F.SETFL, flags);
}

/// Park on poll() until the fd is ready in the direction OpenSSL asked for.
fn waitReady(fd: posix.fd_t, want_write: bool) !void {
    var pfd = [_]posix.pollfd{.{
        .fd = fd,
        .events = if (want_write) posix.POLL.OUT else posix.POLL.IN,
        .revents = 0,
    }};
    _ = try posix.poll(&pfd, -1);
}

/// Retry the handshake, suspending on the fd between WantRead/WantWrite, so a
/// single thread can drive many connections at once.
pub fn handshakeNonBlocking(self: *TlsClient) !void {
    while (true) {
        self.handshake() catch |err| switch (err) {
            error.WantRead => { try waitReady(self.socket, false); continue; },
            error.WantWrite => { try waitReady(self.socket, true); continue; },
            else => return err,
        };
        return;
    }
}

This is where last episode's decision to surface WantRead and WantWrite as distinct Zig errors pays off. A non-blocking socket never sleeps inside OpenSSL; instead SSL_connect returns "I need to read more" or "I need to write", and we translate that into a poll on the right event. The whole point is that between two such waits the thread is free to service other connections -- which is exactly the muscle we'll need once we have long-lived sockets that stay open for minutes.

Exercise 3: Build a tiny HTTPS server

/// The server side mirrors the client: a different method, a loaded cert+key,
/// and SSL_accept instead of SSL_connect.
pub fn initServerCtx(cert_path: [:0]const u8, key_path: [:0]const u8) !*c.SSL_CTX {
    const ctx = c.SSL_CTX_new(c.TLS_server_method()) orelse return error.ContextInit;
    errdefer c.SSL_CTX_free(ctx);
    if (c.SSL_CTX_use_certificate_file(ctx, cert_path, c.SSL_FILETYPE_PEM) != 1)
        return error.ContextInit;
    if (c.SSL_CTX_use_PrivateKey_file(ctx, key_path, c.SSL_FILETYPE_PEM) != 1)
        return error.ContextInit;
    return ctx;
}

pub fn serveOne(ctx: *c.SSL_CTX, client_fd: std.posix.socket_t) !void {
    const ssl = c.SSL_new(ctx) orelse return error.ContextInit;
    defer c.SSL_free(ssl);
    _ = c.SSL_set_fd(ssl, @intCast(client_fd));
    if (c.SSL_accept(ssl) != 1) return error.HandshakeFailed; // server-side handshake
    const resp = "HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello";
    _ = c.SSL_write(ssl, resp.ptr, @intCast(resp.len));
}

The whole asymmetry of a TLS server versus a client is two function names: TLS_server_method instead of TLS_client_method, and SSL_accept instead of SSL_connect. Everything else -- the SSL_CTX, the per-connection SSL, the defer/errdefer cleanup -- is identical. Generate the throwaway cert with openssl req -x509 -newkey rsa:2048 -nodes -keyout key.pem -out cert.pem -days 1 and point a browser at https://localhost:port (it'll warn about the self-signed cert, which is expected).


At the very end of episode 86 I wrote that the next step was "the upgrade dance that turns an ordinary HTTPS request into a persistent, bidirectional channel... the same wss:// way your browser does it." Well -- here we are ;-) Today we build exactly that channel: WebSocket. And the lovely thing is that every layer underneath it is already in our hands. TCP sockets came in episode 21, the HTTP request parsing in episode 84, binary framing in episode 85, and TLS in episode 86. WebSocket is the protocol that stitches them into something a chat app or a live dashboard can actually use.

Why WebSocket exists at all

Plain HTTP has one structural limitation that no amount of cleverness fully removes: the client asks, the server answers, and then the exchange is over. The server cannot speak first. If you want the server to push you a new chat message the instant it arrives, classic HTTP forces ugly workarounds -- polling every second (wasteful and laggy), or long-polling (a request that the server holds open until it has something to say, after which you immediately open another). Both fight the protocol instead of working with it.

WebSocket solves this honestly. You start with a normal HTTP request, ask the server to upgrade the connection, and if it agrees, that same TCP socket stops speaking HTTP and starts speaking a tiny, symmetric, message-oriented protocol where either side can send a message at any time. No new connection, no polling, no headers repeated on every message. One socket, kept open, bytes flowing both directions. That's it. Bam, jonguh!

The key mental shift: after the handshake, WebSocket is not request/response anymore. It's a bidirectional stream of discrete messages, each one wrapped in a small binary frame. So this episode has two halves -- the one-time HTTP handshake that opens the door, and the framing protocol that carries everything afterward.

The opening handshake

A WebSocket connection starts life as an ordinary HTTP/1.1 GET, the kind we parsed back in episode 84, but with a few special headers:

GET /chat HTTP/1.1
Host: example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13

The Upgrade: websocket and Connection: Upgrade headers signal intent. Sec-WebSocket-Version: 13 pins the protocol version (13 is the version standardised in RFC 6455). The interesting one is Sec-WebSocket-Key: 16 random bytes, base64-encoded by the client. It is not a security token (it's sent in the clear, so it secures nothing); its only job is to prove that the server actually understood the WebSocket handshake and didn't just blindly echo a cached HTTP response.
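For completeness, here is how a client might produce that key. It's only a sketch (generateKey is a name I made up for illustration): 16 random bytes in, 24 base64 characters out, which always end in == because 16 is one byte past a multiple of three.

```zig
const std = @import("std");

/// Hypothetical client-side helper: fill `out` with a fresh
/// Sec-WebSocket-Key. base64 of 16 bytes is always 24 characters.
pub fn generateKey(out: *[24]u8) void {
    var nonce: [16]u8 = undefined;
    std.crypto.random.bytes(&nonce);
    _ = std.base64.standard.Encoder.encode(out, &nonce);
}
```

The sample key in the request above, dGhlIHNhbXBsZSBub25jZQ==, has exactly this shape: 24 characters with two bytes of padding.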

The server proves comprehension by a fixed ritual. Take the client's key string, concatenate a magic GUID defined in the RFC, SHA-1 the result, base64-encode the 20-byte digest, and send it back in Sec-WebSocket-Accept. The magic string is a constant -- 258EAFA5-E914-47DA-95CA-C5AB0DC85B11 -- chosen precisely because no naive HTTP cache would ever append it on its own. Here's the computation in pure Zig, using the standard library's SHA-1 and base64 (no C interop needed this time):

const std = @import("std");

const ws_magic = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11";

/// Compute Sec-WebSocket-Accept from the client's Sec-WebSocket-Key.
/// SHA-1 of (key ++ magic) is 20 bytes; base64 of 20 bytes is exactly 28 chars.
pub fn computeAccept(key: []const u8, out: *[28]u8) void {
    var sha1 = std.crypto.hash.Sha1.init(.{});
    sha1.update(key);
    sha1.update(ws_magic);
    var digest: [20]u8 = undefined;
    sha1.final(&digest);
    _ = std.base64.standard.Encoder.encode(out, &digest);
}

Note how the update calls let us hash the key and the magic constant without first allocating a joined buffer -- a streaming hash is the natural fit (we met the same pattern with file hashing back in the sync-tool project). The server's reply is then a bog-standard 101 status with the upgrade headers:

pub fn writeHandshakeResponse(key: []const u8, out: []u8) ![]u8 {
    var accept: [28]u8 = undefined;
    computeAccept(key, &accept);
    return std.fmt.bufPrint(out,
        "HTTP/1.1 101 Switching Protocols\r\n" ++
        "Upgrade: websocket\r\n" ++
        "Connection: Upgrade\r\n" ++
        "Sec-WebSocket-Accept: {s}\r\n\r\n",
        .{accept},
    );
}

Once the client receives that 101 Switching Protocols, both sides forget HTTP entirely. The socket is now a WebSocket. From here on, every byte is part of the framing protocol.
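A client should not take that 101 on faith, though: before switching over, it recomputes the accept value from the key it sent and compares. A minimal sketch of that check follows; acceptMatches is a hypothetical name, and the hashing simply repeats the computeAccept recipe so the block stands on its own.

```zig
const std = @import("std");

const ws_magic = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11";

/// Hypothetical client-side check: does the server's Sec-WebSocket-Accept
/// match the Sec-WebSocket-Key we sent?
pub fn acceptMatches(sent_key: []const u8, received_accept: []const u8) bool {
    var sha1 = std.crypto.hash.Sha1.init(.{});
    sha1.update(sent_key);
    sha1.update(ws_magic);
    var digest: [20]u8 = undefined;
    sha1.final(&digest);

    var expected: [28]u8 = undefined;
    _ = std.base64.standard.Encoder.encode(&expected, &digest);
    // A length mismatch simply compares unequal.
    return std.mem.eql(u8, &expected, received_accept);
}
```

If this returns false, the peer did not really speak WebSocket and the client should drop the connection rather than start sending frames.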

The frame format, bit by bit

This is where episode 17 (packed structs and bit manipulation) and episode 85 (binary framing) come roaring back. A WebSocket frame is compact -- the header is as small as 2 bytes -- and it packs several fields into individual bits of the first two octets:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-------+-+-------------+-------------------------------+
|F|R|R|R| opcode|M| Payload len |    Extended payload length    |
|I|S|S|S|  (4)  |A|     (7)     |             (16/64)           |
|N|V|V|V|       |S|             |   (if payload len==126/127)   |
| |1|2|3|       |K|             |                               |
+-+-+-+-+-------+-+-------------+ - - - - - - - - - - - - - - - +
:                     Masking-key (4 bytes, if MASK set)        :
+---------------------------------------------------------------+
:                     Payload Data continued ...                :
+---------------------------------------------------------------+

Let me walk the fields. The first byte holds the FIN bit (1 means "this is the final fragment of a message" -- WebSocket can split one big message across frames), three reserved bits (RSV1-3, zero unless an extension negotiated otherwise), and a 4-bit opcode. The opcode is the whole vocabulary of the protocol, so it's a perfect non-exhaustive enum (episode 6):

pub const Opcode = enum(u4) {
    continuation = 0x0, // a continuation of a fragmented message
    text = 0x1, // UTF-8 text payload
    binary = 0x2, // raw binary payload
    close = 0x8, // closing handshake
    ping = 0x9, // heartbeat request
    pong = 0xA, // heartbeat reply
    _, // forward-compatible: unknown opcodes don't crash us
};

The second byte starts with the MASK bit, then a 7-bit payload length. That length has three forms, which is the one genuinely fiddly part of parsing: if it's 0-125, that's the actual length. If it's exactly 126, the real length is the next 2 bytes as a big-endian u16. If it's 127, the real length is the next 8 bytes as a big-endian u64. The RFC also requires the sender to use the shortest form that fits. This variable-length trick keeps small frames tiny while still allowing gigabyte payloads -- the same philosophy as varint encodings, just with three fixed buckets instead of a continuation bit.
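To make the three buckets concrete, here is a small sketch that works out the header size for a given payload length. The name headerSize is mine, purely for illustration.

```zig
const std = @import("std");

/// Illustrative helper: total header size in bytes for a payload of
/// `payload_len` bytes, with or without the 4-byte masking key.
pub fn headerSize(payload_len: u64, masked: bool) usize {
    // 2 bytes base; +2 for the u16 form; +8 for the u64 form.
    const base: usize = if (payload_len <= 125) 2 else if (payload_len <= 0xFFFF) 4 else 10;
    return base + @as(usize, if (masked) 4 else 0);
}
```

So a 300-byte payload lands in the middle bucket: the 7-bit field holds 126 and the next two bytes are 0x01 0x2C (300 in big-endian). The largest possible header is 14 bytes, a masked frame in the 64-bit form.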

Masking: why client frames are scrambled

Here's a rule that surprises everyone the first time: every frame a client sends to a server MUST be masked, and every frame a server sends back MUST NOT be. Masking means XOR-ing each payload byte with one of four rotating key bytes that the client picks at random and includes in the frame.

Why on earth? It's not encryption -- the mask key is right there in the frame, so anyone reading the bytes can trivially unmask. The real reason is a defence against a specific attack on intermediaries. Before WebSocket was hardened, a malicious page could craft payloads that, passing through an old caching proxy that half-understood HTTP, looked enough like a fake HTTP request to poison the proxy's cache for other users. Forcing the client to XOR its payload with a fresh random key makes the bytes-on-the-wire unpredictable, so an attacker can't reliably smuggle a chosen plaintext past a confused proxy. The server, sitting at the trusted end, has no such worry and never masks.

The masking itself is delightfully simple -- byte i is XOR-ed with mask[i % 4]:

fn applyMask(payload: []u8, mask: [4]u8) void {
    for (payload, 0..) |*byte, i| {
        byte.* ^= mask[i % 4];
    }
}

Because XOR is its own inverse, the same function both masks and unmasks. The server calls it to recover the plaintext a client sent; a client would call it (with a random key) before sending. Nota bene: that i % 4 is on the hot path for large payloads -- we'll come back to it in the performance section.

Parsing an incoming frame

Now we assemble the pieces. Our parser takes a byte buffer (whatever we've read off the socket so far) and either returns a fully-decoded frame plus how many bytes it consumed, or null to mean "not enough bytes yet, come back when you've read more" -- the same incremental-reader contract the HTTP/2 FrameReader used in episode 85. Honest error handling (episode 4) covers the malformed cases:

pub const Frame = struct {
    fin: bool,
    opcode: Opcode,
    payload: []u8, // already unmasked, points into the input buffer
};

pub const Parsed = struct { frame: Frame, consumed: usize };

pub fn parseFrame(buf: []u8) !?Parsed {
    if (buf.len < 2) return null; // need at least the 2-byte header
    const fin = (buf[0] & 0x80) != 0;
    const opcode: Opcode = @enumFromInt(@as(u4, @truncate(buf[0] & 0x0F)));
    const masked = (buf[1] & 0x80) != 0;

    var len: u64 = buf[1] & 0x7F;
    var off: usize = 2;
    if (len == 126) {
        if (buf.len < off + 2) return null;
        len = std.mem.readInt(u16, buf[off..][0..2], .big);
        off += 2;
    } else if (len == 127) {
        if (buf.len < off + 8) return null;
        len = std.mem.readInt(u64, buf[off..][0..8], .big);
        off += 8;
    }

    var mask: [4]u8 = .{ 0, 0, 0, 0 };
    if (masked) {
        if (buf.len < off + 4) return null;
        @memcpy(&mask, buf[off..][0..4]);
        off += 4;
    }

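    // A hostile peer can claim up to 2^64-1 bytes in the 64-bit form, so make
    // sure off + len cannot overflow usize before doing the addition.
    if (len > std.math.maxInt(usize) - off) return error.FrameTooLarge;
    const total = off + @as(usize, @intCast(len));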
    if (buf.len < total) return null; // payload not fully arrived yet
    const payload = buf[off..total];
    if (masked) applyMask(payload, mask);

    return Parsed{
        .frame = .{ .fin = fin, .opcode = opcode, .payload = payload },
        .consumed = total,
    };
}

Every return null is a "would block" -- the frame straddles more data than we've buffered, so the caller reads more and tries again. That pattern is what lets one parser sit on top of a stream socket where reads arrive in arbitrary chunks. The @truncate(buf[0] & 0x0F) pulls the low four bits into the u4 the opcode enum expects, and @enumFromInt lands on _ for any opcode we don't recognise rather than panicking.

Encoding frames to send

The server side (unmasked) is the mirror image, and shorter because we don't mask. We set FIN, write the opcode, choose the right length encoding, and copy the payload:

pub fn encodeFrame(opcode: Opcode, payload: []const u8, out: []u8) !usize {
    // Work out the header size first, so we can bounds-check before writing a single byte.
    const header_len: usize = if (payload.len <= 125) 2 else if (payload.len <= 0xFFFF) 4 else 10;
    if (out.len < header_len + payload.len) return error.BufferTooSmall;

    out[0] = 0x80 | @as(u8, @intFromEnum(opcode)); // FIN=1, single unfragmented frame
    if (payload.len <= 125) {
        out[1] = @intCast(payload.len);
    } else if (payload.len <= 0xFFFF) {
        out[1] = 126;
        std.mem.writeInt(u16, out[2..4], @intCast(payload.len), .big);
    } else {
        out[1] = 127;
        std.mem.writeInt(u64, out[2..10], payload.len, .big);
    }
    @memcpy(out[header_len..][0..payload.len], payload);
    return header_len + payload.len;
}

Because we're the server, the MASK bit in out[1] stays 0 -- we never set it, so we never write a masking key. The 0x80 on the first byte is the FIN flag, meaning "complete message in one frame", which is what you want 99% of the time (fragmentation is for streaming a message whose length you don't know up front).

Control frames: close, ping, pong

Three opcodes are control frames, and they have two extra rules: their payload must be 125 bytes or fewer, and they must never be fragmented. Ping and pong are the heartbeat -- either side sends a ping, the other must reply with a pong echoing the same payload, which lets you detect a half-dead connection that TCP hasn't noticed yet. Close is the polite shutdown: an optional 2-byte big-endian status code followed by a UTF-8 reason.

/// Build a close frame: a 2-byte big-endian status code plus an optional reason.
/// 1000 = normal, 1001 = going away, 1002 = protocol error (see RFC 6455 section 7.4).
pub fn encodeClose(code: u16, reason: []const u8, out: []u8) !usize {
    var payload: [125]u8 = undefined;
    if (reason.len > 123) return error.ReasonTooLong; // 2 bytes go to the code
    std.mem.writeInt(u16, payload[0..2], code, .big);
    @memcpy(payload[2..][0..reason.len], reason);
    return encodeFrame(.close, payload[0 .. 2 + reason.len], out);
}

/// A pong MUST echo the ping's payload verbatim.
pub fn encodePong(ping_payload: []const u8, out: []u8) !usize {
    return encodeFrame(.pong, ping_payload, out);
}

A correct WebSocket endpoint answers a close with its own close and then stops sending, and answers a ping with a pong promptly. Those little courtesies are what keep a long-lived connection healthy instead of silently rotting behind a NAT timeout.
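The two control-frame rules are cheap to enforce on the receiving side as well. Below is a sketch of such a check; isControl, checkControlFrame and the error names are assumptions of mine, and the Opcode enum is repeated only to keep the block self-contained.

```zig
const std = @import("std");

pub const Opcode = enum(u4) {
    continuation = 0x0,
    text = 0x1,
    binary = 0x2,
    close = 0x8,
    ping = 0x9,
    pong = 0xA,
    _,
};

/// Control opcodes are 0x8 through 0xF: the ones with the top bit of the
/// 4-bit opcode set (0xB-0xF are reserved for future control frames).
pub fn isControl(op: Opcode) bool {
    return (@intFromEnum(op) & 0x8) != 0;
}

/// Hypothetical validation a receiver could run on every parsed frame.
pub fn checkControlFrame(op: Opcode, fin: bool, payload_len: usize) !void {
    if (!isControl(op)) return; // data frames have no such limits
    if (!fin) return error.FragmentedControlFrame;
    if (payload_len > 125) return error.ControlFrameTooLong;
}
```

A natural place to call it is right after parseFrame returns a frame, passing frame.opcode, frame.fin and frame.payload.len, so a peer that breaks the rules gets a 1002 close instead of being processed.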

Testing without a browser

The beauty of pushing all of this into pure functions is that the entire protocol is testable with byte arrays -- no socket, no browser, no live peer. First, the handshake against the canonical example straight out of RFC 6455 (this exact key/accept pair is in the spec, so it's a perfect regression anchor):

test "accept value matches the RFC 6455 example" {
    var out: [28]u8 = undefined;
    computeAccept("dGhlIHNhbXBsZSBub25jZQ==", &out);
    try std.testing.expectEqualStrings("s3pPLMBiTxaQ9kYGzzhZRbK+xOo=", &out);
}

Then a masked client frame carrying the text Hi. The mask is 37 fa 21 3d; H (0x48) XOR 0x37 is 0x7f, i (0x69) XOR 0xfa is 0x93, so the masked bytes on the wire are 7f 93. Parsing must unmask them back to Hi:

test "parse a masked client text frame" {
    var buf = [_]u8{ 0x81, 0x82, 0x37, 0xfa, 0x21, 0x3d, 0x7f, 0x93 };
    // 0x81 = FIN+text, 0x82 = MASK bit + length 2
    const p = (try parseFrame(&buf)).?;
    try std.testing.expect(p.frame.fin);
    try std.testing.expectEqual(Opcode.text, p.frame.opcode);
    try std.testing.expectEqualStrings("Hi", p.frame.payload);
    try std.testing.expectEqual(@as(usize, 8), p.consumed);
}

test "encode then parse round-trips a server frame" {
    var out: [64]u8 = undefined;
    const n = try encodeFrame(.binary, "zig!", &out);
    const p = (try parseFrame(out[0..n])).?;
    try std.testing.expectEqual(Opcode.binary, p.frame.opcode);
    try std.testing.expectEqualStrings("zig!", p.frame.payload);
}

Notice that parseFrame handles both masked (client) and unmasked (server) frames, so the same parser tests both directions. The aforementioned "return null when short" contract is worth a test too -- feed it a single byte and assert you get null, proving the incremental reader won't read past the buffer.

Performance considerations

Two things matter once you're moving real traffic. The first is masking throughput. That tidy payload[i] ^ mask[i % 4] loop does a modulo per byte, and on a megabyte payload that's a million modulos. The fix is to recognise that the mask repeats every 4 bytes, so you can load the 4-byte key into a u32 and XOR a word at a time, or -- even better on modern hardware -- lean on the @Vector SIMD we covered in episode 19 to mask 16 or 32 bytes per instruction. The naive version is correct and fine for chat messages; the vectorised version is what you reach for when you're proxying video.
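Here is roughly what the vectorised variant could look like. Treat it as a sketch under assumptions: applyMaskWide is my name for it, the 16-byte width is a guess at a reasonable default rather than a measured optimum, and I haven't benchmarked it against the scalar loop.

```zig
const std = @import("std");

/// Sketch of a wider masking loop: XOR 16 payload bytes per step,
/// then finish the remainder one byte at a time.
pub fn applyMaskWide(payload: []u8, mask: [4]u8) void {
    const lanes = 16; // assumed width; must stay a multiple of 4
    const Vec = @Vector(lanes, u8);

    // Repeat the 4-byte key across all lanes: m0 m1 m2 m3 m0 m1 ...
    var key_bytes: [lanes]u8 = undefined;
    for (&key_bytes, 0..) |*b, lane| b.* = mask[lane % 4];
    const key: Vec = key_bytes;

    var i: usize = 0;
    while (i + lanes <= payload.len) : (i += lanes) {
        const chunk: Vec = payload[i..][0..lanes].*;
        payload[i..][0..lanes].* = chunk ^ key;
    }
    // i is a multiple of 16 here, hence of 4, so mask[i % 4] is still
    // in phase with the key for the scalar tail.
    while (i < payload.len) : (i += 1) {
        payload[i] ^= mask[i % 4];
    }
}
```

Whatever width you pick, keep a test that compares the wide version against the naive loop for every length from 0 up past two full vectors; off-by-one bugs in the tail are the classic failure here.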

The second is buffering and partial frames. A frame's payload can be larger than one read() returns, so your reader must accumulate bytes until a whole frame has arrived -- exactly the return null contract above. The mistake to avoid is reallocating that accumulation buffer on every read; size it once (16 KB is a sane default, matching a TLS record from episode 86) and grow only when a genuinely large frame demands it. Nota bene: also enforce a maximum frame size, or a hostile peer sending a 10-byte header that claims a u64 payload of 16 exabytes will happily make you try to allocate the universe.
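One way to enforce that limit is to peek at the header before growing any buffer. The sketch below assumes a 1 MiB cap, which is an arbitrary policy value of mine and not something the RFC prescribes; announcedLen and FrameTooLarge are illustrative names too.

```zig
const std = @import("std");

/// Assumed policy value; tune it per application.
pub const max_payload_len: u64 = 1 << 20; // 1 MiB

/// Peek at the header only. Returns the announced payload length, null if
/// the length field has not fully arrived yet, or an error if the peer
/// claims more than we are willing to buffer.
pub fn announcedLen(buf: []const u8) !?u64 {
    if (buf.len < 2) return null;
    var len: u64 = buf[1] & 0x7F;
    if (len == 126) {
        if (buf.len < 4) return null;
        len = std.mem.readInt(u16, buf[2..4], .big);
    } else if (len == 127) {
        if (buf.len < 10) return null;
        len = std.mem.readInt(u64, buf[2..10], .big);
    }
    if (len > max_payload_len) return error.FrameTooLarge;
    return len;
}
```

The read loop would call this as soon as the first few bytes arrive and answer FrameTooLarge with a close frame (status 1009, "message too big") instead of allocating anything.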

How this compares to C, Rust, and Go

In C, you'd hand-roll exactly this bit-twiddling -- and libraries like libwebsockets do, with a great deal of careful pointer arithmetic and manual length checking. The framing logic is identical; what C lacks is Zig's @enumFromInt landing safely on a non-exhaustive _, and slices that carry their length so an over-long payload claim can't walk off the end of your buffer unnoticed.

In Go, gorilla/websocket (and now nhooyr.io/websocket) gives you conn.ReadMessage() / conn.WriteMessage() and hides every byte we just decoded. It's productive and the goroutine-per-connection model makes the concurrency trivial. The cost is the usual one -- you're inside Go's runtime and its allocation patterns, with less control over exactly when and where buffers are reused.

In Rust, tungstenite (sync) and tokio-tungstenite (async) are the standard answer, memory-safe by construction and rigorous about the masking and UTF-8-validation rules the RFC demands. It's excellent, and arguably the most correct-by-default of the lot, at the price of Rust's steeper learning curve around async lifetimes.

Where does Zig sit? Right where it likes to: you write the protocol yourself, in maybe 150 lines, you see every bit, you control every allocation, and the result cross-compiles to a tiny static binary with no runtime. For a learning exercise it's unbeatable, because nothing is hidden. For production you might still reach for a hardened library -- but now you'll actually understand what it's doing under the hood, which is the entire point of building it from scratch ;-)

Where this is heading

We now have every piece of the WebSocket protocol as a set of pure, tested functions: the handshake, the frame parser, the encoder, masking, and the control frames. What we don't have yet is the thing that holds it all together over a live connection -- the loop that accepts a socket, performs the upgrade, then sits there reading frames and reacting to them, answering pings, honouring closes, and tracking per-connection state across many clients at once. All the non-blocking and state-machine groundwork from the last several episodes points straight at that. We've built the protocol; next we put it to work.

The handshake, the framing, the masking -- they aren't separate party tricks, they're the layers of one real-time channel you can now build with your eyes open.

Exercises

  1. Detect fragmentation. Extend the parser's caller to handle a message split across frames: a first frame with opcode = text and fin = false, followed by one or more opcode = continuation frames, ending with fin = true. Concatenate the payloads into a single message and assert the opcode of the assembled message is taken from the first frame, not the continuations.

  2. Write a client-side encoder. Add an encodeMaskedFrame that sets the MASK bit, generates 4 random key bytes (use std.crypto.random.bytes), writes the key into the frame, and masks the payload. Round-trip it through parseFrame and assert the recovered payload matches the original.

  3. Validate close codes. Write a function that takes a close frame's payload, extracts the 2-byte status code, and rejects the reserved/invalid codes (anything below 1000, plus 1004, 1005, 1006, and 1015, which the RFC says must never appear on the wire). Return a Zig error for an invalid code and the u16 for a valid one.

Thanks for reading -- De groeten!

@scipio


