Learn Zig Series (#37) - Markdown to HTML: Tokenizer and Lexer

Project A: Markdown to HTML Converter (1/3)

What will I learn

  • You will learn how to design token types for Markdown syntax elements;
  • You will learn how to scan input character by character to produce tokens;
  • You will learn how to handle inline formatting: bold, italic, code, and links;
  • You will learn about block-level tokens: headings, paragraphs, code blocks, and lists;
  • You will learn the state machine approach to lexing: normal, code-block, and link modes;
  • You will learn error recovery for malformed Markdown;
  • You will learn how to build a token stream with an ArrayList;
  • You will learn how to test the tokenizer with known Markdown inputs and expected token sequences.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Zig 0.14+ distribution (download from ziglang.org);
  • The ambition to learn Zig programming.

Difficulty

  • Intermediate

Curriculum (of the Learn Zig Series):

Learn Zig Series (#37) - Markdown to HTML: Tokenizer and Lexer

Here we go with our next multi-part project! ;-)

Over the past 36 episodes we've built up a serious toolkit -- from basic types and control flow all the way through state machines, performance profiling, and cross-compilation. Now we're going to put a big chunk of that to work on something every developer has used but few have actually built: a Markdown to HTML converter.

We're splitting this project across three episodes. In this first part, we build the tokenizer (also called a lexer) -- the component that reads raw Markdown text and breaks it down into a stream of meaningful tokens. Think of it like a human reader who scans through text and recognizes "oh, that's a heading", "that's a bold word", "that's the start of a code block". Next episode we'll build a parser that turns those tokens into an AST (abstract syntax tree), and the third part will render that tree to HTML with a proper CLI interface.

Why Markdown? Because the format looks simple on the surface but is surprisingly tricky in the edge cases. Asterisks can mean bold, italic, or a list item depending on context. Backticks can be inline code or the start of a fenced code block. Links have that weird [text](url) nesting. Building a tokenizer that handles all of this correctly is a genuinely fun systems programming challenge, and it exercises string processing, state machines (remember episode 33?), enums, memory management, and testing -- all in one project.

Designing token types for Markdown syntax elements

Before we write any scanning logic, we need to decide what our token vocabulary looks like. What are the "words" of Markdown?

I sat down and listed every Markdown construct I wanted to handle, then grouped them into categories. Block-level elements (headings, paragraphs, code blocks, lists, horizontal rules) exist on their own lines. Inline elements (bold, italic, code spans, links, images) exist within a line of text. And then there are structural tokens like line breaks and end-of-file that the scanner needs to communicate to the parser.

Here's the token type enum:

const std = @import("std");

pub const TokenType = enum {
    // Block-level tokens
    heading,          // # through ######
    paragraph_start,
    paragraph_end,
    code_block_start, // ``` or ~~~
    code_block_end,
    code_block_content,
    unordered_list_item, // - or * at line start
    ordered_list_item,   // 1. 2. etc at line start
    horizontal_rule,     // --- or *** or ___
    blockquote,          // > at line start

    // Inline tokens
    text,             // plain text content
    bold_start,       // **
    bold_end,
    italic_start,     // *
    italic_end,
    code_span,        // `inline code`
    link_text_start,  // [
    link_text_end,    // ]
    link_url_start,   // (
    link_url_end,     // )
    image_start,      // ![

    // Structural
    line_break,
    eof,
};

Each variant maps to exactly one Markdown syntactic construct. I deliberately chose to have separate bold_start / bold_end rather than a single bold token wrapping content. Why? Because the tokenizer doesn't know where the matching closing ** is -- it just sees characters one at a time. When it encounters **, it needs to decide whether that's an opening or closing delimiter. We'll track that with a simple boolean state.

Now the token itself:

pub const Token = struct {
    type: TokenType,
    text: []const u8,
    line: usize,
    col: usize,

    pub fn init(token_type: TokenType, text: []const u8, line: usize, col: usize) Token {
        return .{
            .type = token_type,
            .text = text,
            .line = line,
            .col = col,
        };
    }

    pub fn format(
        self: Token,
        comptime fmt: []const u8,
        options: std.fmt.FormatOptions,
        writer: anytype,
    ) !void {
        _ = fmt;
        _ = options;
        try writer.print("Token({s}, \"{s}\", {d}:{d})", .{
            @tagName(self.type),
            self.text,
            self.line,
            self.col,
        });
    }
};

Every token carries four pieces of information: its type (what kind of Markdown element it represents), the raw text it was produced from (so we can reconstruct the original or extract content like heading level from the number of # characters), and the line/column position for error reporting. The format function implements Zig's std.fmt interface so we can std.debug.print tokens directly -- super useful for debugging. We covered this formatting interface in episode 24.

Notice that the text field is a slice into the original input -- we're not copying strings. This is idiomatic Zig: the tokenizer borrows references into the source document rather than allocating copies. The source string must outlive the tokens, which is easy to guarantee since we read the entire file into memory before tokenizing (we covered file I/O back in episode 10).
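To make that lifetime requirement concrete, here's a minimal sketch of what src/main.zig could look like for this episode: read the whole file, tokenize, print the tokens. The file name and the one-megabyte size limit are placeholder choices, not part of the tokenizer itself:

const std = @import("std");
const Tokenizer = @import("tokenizer.zig").Tokenizer;

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Read the entire document up front; the token slices point into `source`,
    // so it has to stay alive for as long as we use the tokens.
    const source = try std.fs.cwd().readFileAlloc(allocator, "README.md", 1024 * 1024);
    defer allocator.free(source);

    var tokenizer = Tokenizer.init(allocator, source);
    defer tokenizer.deinit();

    const tokens = try tokenizer.tokenize();
    for (tokens) |token| {
        std.debug.print("{}\n", .{token}); // uses Token.format from above
    }
}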

Scanning input character by character to produce tokens

The tokenizer is fundamentally a loop that walks through the input one byte at a time (Markdown is ASCII-compatible -- UTF-8 multi-byte sequences just pass through as text). At each position, it looks at the current character (and sometimes the next few characters) to decide what token to emit.

Here's the core tokenizer structure:

pub const Tokenizer = struct {
    source: []const u8,
    pos: usize,
    line: usize,
    col: usize,
    tokens: std.ArrayList(Token),
    state: LexState,
    bold_open: bool,
    italic_open: bool,

    pub fn init(allocator: std.mem.Allocator, source: []const u8) Tokenizer {
        return .{
            .source = source,
            .pos = 0,
            .line = 1,
            .col = 1,
            .tokens = std.ArrayList(Token).init(allocator),
            .state = .normal,
            .bold_open = false,
            .italic_open = false,
        };
    }

    pub fn deinit(self: *Tokenizer) void {
        self.tokens.deinit();
    }

    fn peek(self: *const Tokenizer) ?u8 {
        if (self.pos >= self.source.len) return null;
        return self.source[self.pos];
    }

    fn peekAt(self: *const Tokenizer, offset: usize) ?u8 {
        const idx = self.pos + offset;
        if (idx >= self.source.len) return null;
        return self.source[idx];
    }

    fn advance(self: *Tokenizer) ?u8 {
        if (self.pos >= self.source.len) return null;
        const ch = self.source[self.pos];
        self.pos += 1;
        if (ch == '\n') {
            self.line += 1;
            self.col = 1;
        } else {
            self.col += 1;
        }
        return ch;
    }

    fn emit(self: *Tokenizer, token_type: TokenType, text: []const u8, line: usize, col: usize) !void {
        try self.tokens.append(Token.init(token_type, text, line, col));
    }
};

The peek, peekAt, and advance functions are the scanner's eyes. peek looks at the current character without consuming it. peekAt looks ahead by a given offset -- we need this for disambiguating * (italic) from ** (bold) and ` (code span) from ``` (code block). advance consumes a character and updates the line/column tracking.

This pattern -- a position cursor with peek/advance helpers -- is how every hand-written lexer works, whether it's for Markdown, JSON, or a full programming language compiler. The key insight is that lexing is a fundamentally sequential process: you can't know what the current character means without knowing what came before it (context) and sometimes what comes after it (lookahead).
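If you find yourself chaining peekAt calls to test for multi-character sequences, a tiny helper in the same spirit keeps the call sites readable. This isn't part of the tokenizer above, just an optional sketch that would live inside the Tokenizer struct:

fn lookingAt(self: *const Tokenizer, expected: []const u8) bool {
    // Check whether the upcoming bytes match `expected` without consuming them.
    if (self.pos + expected.len > self.source.len) return false;
    return std.mem.eql(u8, self.source[self.pos .. self.pos + expected.len], expected);
}

With it, a fence check could read as self.lookingAt("```") instead of three separate peekAt comparisons.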

Handling inline formatting: bold, italic, code, links

Inline elements are the trickiest part of Markdown tokenization because they can nest and overlap, and the same character (*) means different things depending on context. Here's how we handle each one:

Bold and italic

The rules: a single * toggles italic, a double ** toggles bold. When we see an asterisk, we peek ahead to check for a second one:

fn scanInlineFormatting(self: *Tokenizer) !bool {
    const ch = self.peek() orelse return false;
    const start_line = self.line;
    const start_col = self.col;

    if (ch == '*') {
        if (self.peekAt(1) == @as(u8, '*')) {
            // Double asterisk: bold
            const start = self.pos;
            _ = self.advance();
            _ = self.advance();
            if (self.bold_open) {
                try self.emit(.bold_end, self.source[start .. start + 2], start_line, start_col);
                self.bold_open = false;
            } else {
                try self.emit(.bold_start, self.source[start .. start + 2], start_line, start_col);
                self.bold_open = true;
            }
            return true;
        } else {
            // Single asterisk: italic
            const start = self.pos;
            _ = self.advance();
            if (self.italic_open) {
                try self.emit(.italic_end, self.source[start .. start + 1], start_line, start_col);
                self.italic_open = false;
            } else {
                try self.emit(.italic_start, self.source[start .. start + 1], start_line, start_col);
                self.italic_open = true;
            }
            return true;
        }
    }

    return false;
}

The bold_open and italic_open booleans track whether we've seen an opening delimiter without a matching close. This is admittedly a simplification -- real Markdown parsers (like CommonMark) have more complex rules about delimiter runs, left-flanking vs right-flanking, etc. But for our purposes, a toggle approach works for well-formed Markdown and gracefully falls through for malformed input (the delimiter just stays "open" and gets treated as text eventually).
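To see that fall-through in action, here's a small test in the same style as the suite at the end of this episode (a sketch written against the code above): an unmatched ** produces a lone bold_start and no bold_end, and nothing blows up:

test "unmatched bold delimiter still tokenizes" {
    var tokenizer = Tokenizer.init(std.testing.allocator, "**never closed\n");
    defer tokenizer.deinit();

    const tokens = try tokenizer.tokenize();

    // The opening ** becomes a bold_start; since no closing ** appears,
    // no bold_end is ever emitted -- that's the parser's problem to resolve.
    try std.testing.expectEqual(TokenType.bold_start, tokens[0].type);
    var bold_end_count: usize = 0;
    for (tokens) |t| {
        if (t.type == .bold_end) bold_end_count += 1;
    }
    try std.testing.expectEqual(@as(usize, 0), bold_end_count);
}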

Code spans

Backtick-delimited inline code is simpler because nothing is interpreted inside it -- asterisks, brackets, everything is literal:

fn scanCodeSpan(self: *Tokenizer) !bool {
    if (self.peek() != @as(u8, '`')) return false;

    // Check it's not a code block (triple backtick at line start is handled elsewhere)
    if (self.peekAt(1) == @as(u8, '`') and self.peekAt(2) == @as(u8, '`')) return false;

    const start_line = self.line;
    const start_col = self.col;
    _ = self.advance(); // consume opening backtick
    const content_start = self.pos;

    // Scan until closing backtick or end of line
    while (self.peek()) |c| {
        if (c == '`') {
            const content = self.source[content_start..self.pos];
            _ = self.advance(); // consume closing backtick
            try self.emit(.code_span, content, start_line, start_col);
            return true;
        }
        if (c == '\n') break; // unclosed code span -- don't cross lines
        _ = self.advance();
    }

    // Unclosed backtick -- emit the backtick as text and backtrack content
    // We'll handle this in error recovery
    self.pos = content_start;
    self.col = start_col + 1;
    try self.emit(.text, self.source[content_start - 1 .. content_start], start_line, start_col);
    return true;
}

Notice the guard at the top: if we see three backticks in a row, we bail out and let the block-level scanner handle it as a fenced code block instead. The code span scanner won't cross a newline -- if someone writes an unclosed backtick, we treat it as a literal backtick character rather than eating the rest of the document.
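A quick test (again a sketch in the style of the test section below) confirms that formatting characters inside a code span stay literal:

test "code span keeps formatting characters literal" {
    var tokenizer = Tokenizer.init(std.testing.allocator, "type `*const u8` here\n");
    defer tokenizer.deinit();

    const tokens = try tokenizer.tokenize();

    // tokens[0] is the leading text; tokens[1] must be the code span,
    // and the asterisk inside it must not open italic or bold.
    try std.testing.expectEqual(TokenType.code_span, tokens[1].type);
    try std.testing.expectEqualStrings("*const u8", tokens[1].text);
    for (tokens) |t| {
        try std.testing.expect(t.type != .italic_start and t.type != .bold_start);
    }
}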

Links

Links are [text](url) which means we need to scan four distinct pieces: the opening bracket, the link text, the closing bracket + opening paren, and the URL. We emit separate tokens for each piece so the parser can reconstruct the link properly:

fn scanLink(self: *Tokenizer) !bool {
    const ch = self.peek() orelse return false;
    const start_line = self.line;
    const start_col = self.col;

    // Check for image: ![  (emits image_start instead of link_text_start;
    // everything after that is shared with regular links)
    if (ch == '!' and self.peekAt(1) == @as(u8, '[')) {
        const start = self.pos;
        _ = self.advance(); // !
        _ = self.advance(); // [
        try self.emit(.image_start, self.source[start .. start + 2], start_line, start_col);
    } else if (ch == '[') {
        _ = self.advance();
        try self.emit(.link_text_start, "[", start_line, start_col);
    } else {
        return false;
    }

    // Scan link text until ]
    const text_start = self.pos;
    const text_line = self.line;
    const text_col = self.col;
    while (self.peek()) |c| {
        if (c == ']') break;
        if (c == '\n') {
            // Unclosed bracket -- emit everything scanned so far as plain text
            try self.emit(.text, self.source[text_start..self.pos], text_line, text_col);
            return true;
        }
        _ = self.advance();
    }

    if (self.peek() == null) {
        try self.emit(.text, self.source[text_start..self.pos], text_line, text_col);
        return true;
    }

    // Emit the link text
    if (self.pos > text_start) {
        try self.emit(.text, self.source[text_start..self.pos], text_line, text_col);
    }

    // Consume ]
    const bracket_line = self.line;
    const bracket_col = self.col;
    _ = self.advance();
    try self.emit(.link_text_end, "]", bracket_line, bracket_col);

    // Expect ( immediately after ]
    if (self.peek() != @as(u8, '(')) {
        // Not a link after all -- ] without ( is just text
        return true;
    }

    const paren_line = self.line;
    const paren_col = self.col;
    _ = self.advance();
    try self.emit(.link_url_start, "(", paren_line, paren_col);

    // Scan URL until )
    const url_start = self.pos;
    while (self.peek()) |c| {
        if (c == ')') break;
        if (c == '\n') break;
        _ = self.advance();
    }

    if (self.pos > url_start) {
        try self.emit(.text, self.source[url_start..self.pos], paren_line, paren_col + 1);
    }

    if (self.peek() == @as(u8, ')')) {
        const end_line = self.line;
        const end_col = self.col;
        _ = self.advance();
        try self.emit(.link_url_end, ")", end_line, end_col);
    }

    return true;
}

This is the longest single scanning function, and for good reason -- links are the most syntactically complex inline element in Markdown. The image check at the top is a nice bonus: image syntax (! followed by [alt text](image url)) is basically a link prefixed with an exclamation mark, so we handle both in the same function.

One design decision worth noting: we emit the link text as a regular .text token between .link_text_start and .link_text_end. This means the parser will see the link as a sequence like [link_text_start] [text "click here"] [link_text_end] [link_url_start] [text "https://..."] [link_url_end]. That's verbose but makes parsing unambiguous -- the parser doesn't have to re-scan the text to find where the URL starts.
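Since images ride on the same function, a sketch of a test for the image path (same style as the suite below) shows the token sequence differs only in the first token:

test "tokenize image" {
    var tokenizer = Tokenizer.init(std.testing.allocator, "![alt text](cat.png)\n");
    defer tokenizer.deinit();

    const tokens = try tokenizer.tokenize();

    // image_start replaces link_text_start; everything after that is identical
    // to a regular link.
    try std.testing.expectEqual(TokenType.image_start, tokens[0].type);
    try std.testing.expectEqualStrings("alt text", tokens[1].text);
    try std.testing.expectEqual(TokenType.link_text_end, tokens[2].type);
    try std.testing.expectEqual(TokenType.link_url_start, tokens[3].type);
    try std.testing.expectEqualStrings("cat.png", tokens[4].text);
    try std.testing.expectEqual(TokenType.link_url_end, tokens[5].type);
}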

Block-level tokens: headings, paragraphs, code blocks, lists

Block-level elements are detected at the start of each line. The scanning loop processes one line at a time for block-level detection, then switches to inline scanning for the contents:

fn scanBlockStart(self: *Tokenizer) !bool {
    // Only check block-level tokens at the start of a line
    const start_line = self.line;
    const start_col = self.col;

    const ch = self.peek() orelse return false;

    // Headings: # through ######
    if (ch == '#') {
        var level: usize = 0;
        var look = self.pos;
        while (look < self.source.len and self.source[look] == '#' and level < 6) {
            level += 1;
            look += 1;
        }
        // Must be followed by space or end of line
        if (look >= self.source.len or self.source[look] == ' ' or self.source[look] == '\n') {
            const heading_text = self.source[self.pos .. self.pos + level];
            self.pos = look;
            self.col += level;
            // Skip the space after #
            if (self.peek() == @as(u8, ' ')) {
                _ = self.advance();
            }
            try self.emit(.heading, heading_text, start_line, start_col);
            return true;
        }
    }

    // Horizontal rules: --- or *** or ___ (3+ of same char, nothing else on line)
    if (ch == '-' or ch == '*' or ch == '_') {
        var count: usize = 0;
        var look = self.pos;
        const rule_char = ch;
        while (look < self.source.len and self.source[look] == rule_char) {
            count += 1;
            look += 1;
        }
        // Skip trailing whitespace
        while (look < self.source.len and (self.source[look] == ' ' or self.source[look] == '\t')) {
            look += 1;
        }
        if (count >= 3 and (look >= self.source.len or self.source[look] == '\n')) {
            const rule_text = self.source[self.pos..look];
            try self.emit(.horizontal_rule, rule_text, start_line, start_col);
            self.col += look - self.pos;
            self.pos = look;
            return true;
        }
    }

    // Unordered list items: - or * followed by space
    if ((ch == '-' or ch == '*') and self.peekAt(1) == @as(u8, ' ')) {
        const start = self.pos;
        _ = self.advance(); // consume - or *
        _ = self.advance(); // consume space
        try self.emit(.unordered_list_item, self.source[start .. start + 2], start_line, start_col);
        return true;
    }

    // Ordered list items: digit(s) followed by . and space
    if (ch >= '0' and ch <= '9') {
        var look = self.pos;
        while (look < self.source.len and self.source[look] >= '0' and self.source[look] <= '9') {
            look += 1;
        }
        if (look < self.source.len and self.source[look] == '.') {
            if (look + 1 < self.source.len and self.source[look + 1] == ' ') {
                const marker = self.source[self.pos .. look + 2];
                try self.emit(.ordered_list_item, marker, start_line, start_col);
                self.pos = look + 2;
                self.col += marker.len;
                return true;
            }
        }
    }

    // Blockquote: > at line start
    if (ch == '>' and (self.peekAt(1) == @as(u8, ' ') or self.peekAt(1) == @as(u8, '\n') or self.peekAt(1) == null)) {
        const start = self.pos;
        _ = self.advance(); // consume >
        if (self.peek() == @as(u8, ' ')) {
            _ = self.advance(); // consume optional space
        }
        try self.emit(.blockquote, self.source[start..self.pos], start_line, start_col);
        return true;
    }

    // Fenced code blocks: ``` or ~~~
    if ((ch == '`' or ch == '~') and self.peekAt(1) == ch and self.peekAt(2) == ch) {
        const fence_char = ch;
        const start = self.pos;
        _ = self.advance();
        _ = self.advance();
        _ = self.advance();

        // Optional language tag after opening fence
        const lang_start = self.pos;
        while (self.peek()) |c| {
            if (c == '\n') break;
            _ = self.advance();
        }
        const lang = std.mem.trim(u8, self.source[lang_start..self.pos], " \t");
        _ = lang;

        try self.emit(.code_block_start, self.source[start..self.pos], start_line, start_col);

        // Consume newline after opening fence
        if (self.peek() == @as(u8, '\n')) {
            _ = self.advance();
        }

        // Scan code block content until closing fence
        const content_start = self.pos;
        while (self.pos < self.source.len) {
            // Check for closing fence at start of line
            if (self.col == 1 or (self.pos > 0 and self.source[self.pos - 1] == '\n')) {
                if (self.pos + 3 <= self.source.len and
                    self.source[self.pos] == fence_char and
                    self.source[self.pos + 1] == fence_char and
                    self.source[self.pos + 2] == fence_char)
                {
                    // Found closing fence
                    if (self.pos > content_start) {
                        try self.emit(.code_block_content, self.source[content_start..self.pos], start_line + 1, 1);
                    }
                    const close_line = self.line;
                    const close_col = self.col;
                    const fence_start = self.pos;
                    _ = self.advance();
                    _ = self.advance();
                    _ = self.advance();
                    // Skip the rest of the closing fence line
                    while (self.peek()) |c| {
                        if (c == '\n') break;
                        _ = self.advance();
                    }
                    try self.emit(.code_block_end, self.source[fence_start .. fence_start + 3], close_line, close_col);
                    return true;
                }
            }
            _ = self.advance();
        }

        // Unclosed code block -- emit everything remaining as content
        if (self.pos > content_start) {
            try self.emit(.code_block_content, self.source[content_start..self.pos], start_line + 1, 1);
        }
        try self.emit(.code_block_end, "", self.line, self.col);
        return true;
    }

    return false;
}

The headings scanner counts consecutive # characters (up to 6) and then checks for a space or newline. This prevents a word like #hashtag from being tokenized as a heading. The horizontal rule detection is interesting -- it needs to verify that the entire line consists of only the rule character (plus whitespace), which requires scanning ahead and then checking for a newline.
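The #hashtag case is worth pinning down with a test (a sketch alongside the others at the end of this episode):

test "#hashtag without a space is not a heading" {
    var tokenizer = Tokenizer.init(std.testing.allocator, "#hashtag\n");
    defer tokenizer.deinit();

    const tokens = try tokenizer.tokenize();

    // Without a space after the #, the whole word is emitted as plain text.
    try std.testing.expectEqual(TokenType.text, tokens[0].type);
    for (tokens) |t| {
        try std.testing.expect(t.type != .heading);
    }
}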

Code block scanning enters a different mode entirely: once we see the opening ```, everything until the closing fence is emitted as a single code_block_content token. No inline formatting scanning happens inside code blocks -- backticks, asterisks, brackets are all literal text. This is where the state machine approach really matters.

The state machine approach to lexing: normal, code-block, link modes

The lexer operates in distinct modes. In normal mode, we scan for both block-level and inline elements. In code block mode, everything is literal content. Link mode -- being inside [...] and looking for the closing bracket -- is handled as a self-contained sub-scan inside scanLink rather than as a separate top-level state, and the top-level enum instead carries a front_matter mode for skipping YAML front matter.

We model these modes with an enum (tagged unions from episode 33 would also work, but a simple enum is sufficient here):

const LexState = enum {
    normal,
    code_block,
    front_matter,
};

The main tokenization loop uses this state to decide which scanning functions to call:

pub fn tokenize(self: *Tokenizer) ![]const Token {
    while (self.pos < self.source.len) {
        switch (self.state) {
            .code_block => {
                // This state is handled within scanBlockStart
                // If we somehow land here, advance past it
                _ = self.advance();
            },
            .front_matter => {
                // Skip YAML front matter (--- ... ---)
                while (self.peek()) |c| {
                    if (c == '\n' and self.peekAt(1) == @as(u8, '-') and
                        self.peekAt(2) == @as(u8, '-') and self.peekAt(3) == @as(u8, '-'))
                    {
                        _ = self.advance(); // newline
                        _ = self.advance(); // -
                        _ = self.advance(); // -
                        _ = self.advance(); // -
                        // skip to end of line
                        while (self.peek()) |fc| {
                            if (fc == '\n') break;
                            _ = self.advance();
                        }
                        self.state = .normal;
                        break;
                    }
                    _ = self.advance();
                }
            },
            .normal => {
                const ch = self.peek() orelse break;

                // Handle newlines
                if (ch == '\n') {
                    const nl_line = self.line;
                    const nl_col = self.col;
                    _ = self.advance();
                    try self.emit(.line_break, "\n", nl_line, nl_col);

                    continue;
                }

                // At the start of a line, check block-level elements first
                if (self.col == 1 or (self.pos > 0 and self.source[self.pos - 1] == '\n')) {
                    if (try self.scanBlockStart()) continue;
                }

                // Try inline elements
                if (try self.scanCodeSpan()) continue;
                if (try self.scanInlineFormatting()) continue;
                if (try self.scanLink()) continue;

                // Plain text -- consume until we hit something interesting
                const text_start = self.pos;
                const text_line = self.line;
                const text_col = self.col;

                while (self.peek()) |c| {
                    // Stop at characters that might start a special token
                    if (c == '*' or c == '`' or c == '[' or c == '!' or c == '\n') break;
                    _ = self.advance();
                }

                if (self.pos > text_start) {
                    try self.emit(.text, self.source[text_start..self.pos], text_line, text_col);
                }
            },
        }
    }

    try self.emit(.eof, "", self.line, self.col);
    return self.tokens.items;
}

The normal-mode loop does something clever: after checking for all special characters, it falls through to a plain text scanner that consumes everything until it hits a character that might start a special token. This is important for performance -- instead of emitting one .text token per character, we batch up runs of plain text into single tokens. If you have a paragraph like "Hello world, this is some normal text" with no formatting, that entire string becomes one token.
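A quick sanity test for that batching behaviour (another sketch in the style of the test section below):

test "plain text is batched into a single token" {
    var tokenizer = Tokenizer.init(std.testing.allocator, "Hello world, this is some normal text\n");
    defer tokenizer.deinit();

    const tokens = try tokenizer.tokenize();

    // One text token for the whole line, then the line break.
    try std.testing.expectEqual(TokenType.text, tokens[0].type);
    try std.testing.expectEqualStrings("Hello world, this is some normal text", tokens[0].text);
    try std.testing.expectEqual(TokenType.line_break, tokens[1].type);
}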

Having said that, the stop-characters list (*, `, [, !, and \n) is critical. Miss one and you'll eat through a delimiter that should have been a separate token. Include too many and you'll fragment text into tiny tokens unnecessarily. I arrived at this particular set by working through the test cases -- which brings us to...

Error recovery for malformed Markdown

Real-world Markdown is messy. People write unclosed bold markers, leave dangling brackets, nest things in weird ways. A tokenizer that crashes on bad input is useless. Our approach is simple: when something looks malformed, treat it as literal text and keep going.

Here's the recovery strategy:

fn recoverFromError(self: *Tokenizer) !void {
    // When something goes wrong, emit the current character as text
    // and let the normal scanning loop continue
    const start_line = self.line;
    const start_col = self.col;
    const start = self.pos;
    _ = self.advance();
    try self.emit(.text, self.source[start..self.pos], start_line, start_col);
}

Specific recovery cases:

  1. Unclosed code span (backtick without matching backtick on same line): We emit the backtick as text. The user probably just wanted a literal backtick.

  2. Unclosed bold/italic: If we reach a line break with bold_open or italic_open still true, we don't error -- the parser will see mismatched start/end tokens and can decide what to do. Some Markdown renderers close implicit formatting at paragraph boundaries.

  3. Unclosed link bracket: If ] is not followed by (, we stop scanning the link right there. The parser sees link_text_start / text / link_text_end with no URL tokens and can render the brackets as literal text, so [this is not a link] but regular text still comes out fine.

  4. Unclosed code block: If we reach EOF without finding a closing fence, we emit all remaining text as code_block_content and then a synthetic code_block_end. This matches how most Markdown renderers behave -- an unclosed code block just captures everything to the end.

fn handleUnmatchedDelimiters(self: *Tokenizer) !void {
    // After tokenization, check for unclosed delimiters
    // This doesn't modify the token stream -- it just logs warnings
    if (self.bold_open) {
        std.log.warn("unclosed bold delimiter at end of input", .{});
    }
    if (self.italic_open) {
        std.log.warn("unclosed italic delimiter at end of input", .{});
    }
}
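Recovery case 4 is easy to verify with a sketch of a test (same conventions as the suite below): feed in a fence that never closes and check that we still get a well-formed start/content/end sequence:

test "unclosed code block captures everything to EOF" {
    const input =
        \\```zig
        \\const x = 42;
        \\
    ;
    var tokenizer = Tokenizer.init(std.testing.allocator, input);
    defer tokenizer.deinit();

    const tokens = try tokenizer.tokenize();

    // No closing fence: all remaining text becomes content, followed by a
    // synthetic, empty code_block_end token.
    try std.testing.expectEqual(TokenType.code_block_start, tokens[0].type);
    try std.testing.expectEqual(TokenType.code_block_content, tokens[1].type);
    try std.testing.expectEqual(TokenType.code_block_end, tokens[2].type);
    try std.testing.expectEqualStrings("", tokens[2].text);
}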

The philosophy here is: the tokenizer should always produce a valid token stream, even from garbage input. Errors are the parser's problem -- the tokenizer just faithfully reports what it sees. This separation of concerns makes both components easier to test and debug.

Building a token stream with an ArrayList

We've been using std.ArrayList(Token) to collect tokens, which is the natural choice for a growable sequence in Zig. But there are some subtleties worth discussing.

Memory layout matters. Each Token is 40 bytes on a 64-bit target: 16 for the slice (pointer + length), 8 each for line and col, and the enum tag -- a single byte -- padded out to 8 by alignment. An ArrayList(Token) stores these contiguously in memory, which means great cache locality when iterating through the token stream (and the parser will iterate linearly through every token).
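If you want to keep that claim honest, a tiny test can guard the size. This is a hedged check -- the exact padding is up to the compiler and the target, but on a typical 64-bit machine it comes out at exactly 40 bytes:

test "Token stays small enough to be cache friendly" {
    // 16 bytes of slice + 8 + 8 for line/col + a small enum tag, padded by alignment.
    try std.testing.expect(@sizeOf(Token) <= 40);
}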

const TokenStream = struct {
    tokens: std.ArrayList(Token),
    allocator: std.mem.Allocator,

    pub fn init(allocator: std.mem.Allocator) TokenStream {
        return .{
            .tokens = std.ArrayList(Token).init(allocator),
            .allocator = allocator,
        };
    }

    pub fn deinit(self: *TokenStream) void {
        self.tokens.deinit();
    }

    pub fn add(self: *TokenStream, token: Token) !void {
        try self.tokens.append(token);
    }

    pub fn getSlice(self: *const TokenStream) []const Token {
        return self.tokens.items;
    }

    pub fn debugPrint(self: *const TokenStream) void {
        for (self.tokens.items, 0..) |token, i| {
            std.debug.print("[{d:4}] {}\n", .{ i, token });
        }
    }
};

For a typical Markdown document (say, this very blog post), the tokenizer produces roughly 200-500 tokens. At 40 bytes each, that's 8-20 KB -- easily fits in L1 cache. The ArrayList will resize a few times as it grows (Zig's growth strategy is super-linear, so the number of reallocations stays logarithmic in the final length), but since we're processing the entire document in one pass, those resizes happen during tokenization and then the parser gets a nice contiguous array to iterate.

If we wanted to be fancier, we could pre-allocate based on the input size. A reasonable heuristic: one token per 10-15 characters of input. For a 10KB document, pre-allocating 700-1000 tokens would avoid all resizing:

pub fn initWithCapacity(allocator: std.mem.Allocator, estimated_tokens: usize) !TokenStream {
    var ts = TokenStream{
        .tokens = std.ArrayList(Token).init(allocator),
        .allocator = allocator,
    };
    try ts.tokens.ensureTotalCapacity(estimated_tokens);
    return ts;
}
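Calling it would look something like this fragment (assuming an allocator and the source text are already in scope; the divisor is just the heuristic above, not a magic constant):

var stream = try TokenStream.initWithCapacity(allocator, source.len / 12 + 16);
defer stream.deinit();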

But honestly, for a Markdown converter, the resize cost is negligible compared to the file I/O. Don't optimize what doesn't need optimizing -- something we talked about quite a bit in episode 34.

Testing the tokenizer with known Markdown inputs and expected token sequences

Testing a tokenizer is actually pretty satisfying because the input/output relationship is deterministic and easy to verify. You give it a Markdown string, it gives you tokens. Let's write some comprehensive tests:

test "tokenize heading" {
    var tokenizer = Tokenizer.init(std.testing.allocator, "# Hello World\n");
    defer tokenizer.deinit();

    const tokens = try tokenizer.tokenize();

    try std.testing.expectEqual(TokenType.heading, tokens[0].type);
    try std.testing.expectEqualStrings("#", tokens[0].text);
    try std.testing.expectEqual(TokenType.text, tokens[1].type);
    try std.testing.expectEqualStrings("Hello World", tokens[1].text);
}

test "tokenize bold text" {
    var tokenizer = Tokenizer.init(std.testing.allocator, "This is **bold** text\n");
    defer tokenizer.deinit();

    const tokens = try tokenizer.tokenize();

    try std.testing.expectEqual(TokenType.text, tokens[0].type);
    try std.testing.expectEqualStrings("This is ", tokens[0].text);
    try std.testing.expectEqual(TokenType.bold_start, tokens[1].type);
    try std.testing.expectEqual(TokenType.text, tokens[2].type);
    try std.testing.expectEqualStrings("bold", tokens[2].text);
    try std.testing.expectEqual(TokenType.bold_end, tokens[3].type);
    try std.testing.expectEqual(TokenType.text, tokens[4].type);
    try std.testing.expectEqualStrings(" text", tokens[4].text);
}

test "tokenize inline code" {
    var tokenizer = Tokenizer.init(std.testing.allocator, "Use `std.mem.eql` here\n");
    defer tokenizer.deinit();

    const tokens = try tokenizer.tokenize();

    try std.testing.expectEqual(TokenType.text, tokens[0].type);
    try std.testing.expectEqualStrings("Use ", tokens[0].text);
    try std.testing.expectEqual(TokenType.code_span, tokens[1].type);
    try std.testing.expectEqualStrings("std.mem.eql", tokens[1].text);
    try std.testing.expectEqual(TokenType.text, tokens[2].type);
    try std.testing.expectEqualStrings(" here", tokens[2].text);
}

test "tokenize link" {
    var tokenizer = Tokenizer.init(std.testing.allocator, "Click [here](https://example.com) now\n");
    defer tokenizer.deinit();

    const tokens = try tokenizer.tokenize();

    try std.testing.expectEqual(TokenType.text, tokens[0].type);
    try std.testing.expectEqualStrings("Click ", tokens[0].text);
    try std.testing.expectEqual(TokenType.link_text_start, tokens[1].type);
    try std.testing.expectEqual(TokenType.text, tokens[2].type);
    try std.testing.expectEqualStrings("here", tokens[2].text);
    try std.testing.expectEqual(TokenType.link_text_end, tokens[3].type);
    try std.testing.expectEqual(TokenType.link_url_start, tokens[4].type);
    try std.testing.expectEqual(TokenType.text, tokens[5].type);
    try std.testing.expectEqualStrings("https://example.com", tokens[5].text);
    try std.testing.expectEqual(TokenType.link_url_end, tokens[6].type);
}

test "tokenize fenced code block" {
    const input =
        \\```zig
        \\const x = 42;
        \\```
        \\
    ;
    var tokenizer = Tokenizer.init(std.testing.allocator, input);
    defer tokenizer.deinit();

    const tokens = try tokenizer.tokenize();

    try std.testing.expectEqual(TokenType.code_block_start, tokens[0].type);
    try std.testing.expectEqual(TokenType.code_block_content, tokens[1].type);
    try std.testing.expect(std.mem.indexOf(u8, tokens[1].text, "const x = 42;") != null);
    try std.testing.expectEqual(TokenType.code_block_end, tokens[2].type);
}

test "tokenize unordered list" {
    const input =
        \\- first item
        \\- second item
        \\- third item
        \\
    ;
    var tokenizer = Tokenizer.init(std.testing.allocator, input);
    defer tokenizer.deinit();

    const tokens = try tokenizer.tokenize();

    var list_count: usize = 0;
    for (tokens) |t| {
        if (t.type == .unordered_list_item) list_count += 1;
    }
    try std.testing.expectEqual(@as(usize, 3), list_count);
}

test "tokenize mixed inline formatting" {
    var tokenizer = Tokenizer.init(std.testing.allocator, "This is **bold and *italic* inside**\n");
    defer tokenizer.deinit();

    const tokens = try tokenizer.tokenize();

    // Should have: text, bold_start, text, italic_start, text, italic_end, text, bold_end
    var bold_count: usize = 0;
    var italic_count: usize = 0;
    for (tokens) |t| {
        if (t.type == .bold_start or t.type == .bold_end) bold_count += 1;
        if (t.type == .italic_start or t.type == .italic_end) italic_count += 1;
    }
    try std.testing.expectEqual(@as(usize, 2), bold_count);
    try std.testing.expectEqual(@as(usize, 2), italic_count);
}

test "tokenize horizontal rule" {
    var tokenizer = Tokenizer.init(std.testing.allocator, "---\n");
    defer tokenizer.deinit();

    const tokens = try tokenizer.tokenize();

    try std.testing.expectEqual(TokenType.horizontal_rule, tokens[0].type);
}

test "unclosed code span treated as text" {
    var tokenizer = Tokenizer.init(std.testing.allocator, "This has a `broken backtick\n");
    defer tokenizer.deinit();

    const tokens = try tokenizer.tokenize();

    // Should NOT have a code_span -- the backtick is treated as text
    for (tokens) |t| {
        try std.testing.expect(t.type != .code_span);
    }
}

Notice the pattern: create a tokenizer with std.testing.allocator (which detects leaks), tokenize the input, then check specific tokens by index or by counting token types. The std.testing.allocator is essential here -- if the tokenizer leaks memory (for example, if an error path skips a deinit), the test fails with a leak report. This is one of the things I genuinely love about Zig's test infrastructure.

The unclosed code span test at the end verifies our error recovery: a backtick without a matching close on the same line should NOT produce a code_span token. The tokenizer should degrade gracefully, not crash or produce nonsense.

The build.zig for this project

Let's set up the project structure. Since this is a multi-part project, we'll build it incrementally:

const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    const exe = b.addExecutable(.{
        .name = "md2html",
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    });
    b.installArtifact(exe);

    // Run step
    const run_cmd = b.addRunArtifact(exe);
    run_cmd.step.dependOn(b.getInstallStep());
    if (b.args) |args| run_cmd.addArgs(args);
    const run_step = b.step("run", "Run md2html");
    run_step.dependOn(&run_cmd.step);

    // Tokenizer tests
    const tokenizer_tests = b.addTest(.{
        .root_source_file = b.path("src/tokenizer.zig"),
        .target = target,
        .optimize = optimize,
    });
    const run_tok_tests = b.addRunArtifact(tokenizer_tests);
    const test_step = b.step("test", "Run unit tests");
    test_step.dependOn(&run_tok_tests.step);
}

For now, running zig build test exercises the tokenizer tests. As we add the parser and renderer in the next two episodes, we'll add more test targets. The md2html executable won't do much yet (just tokenize and print), but it gives us a way to run the tokenizer on real Markdown files from the command line for manual verification.

What we learned

  • How to design a token type system using Zig enums, mapping each Markdown construct to a distinct variant
  • Building a scanner that walks input byte-by-byte with peek/advance helpers and tracks line/column position for error messages
  • Disambiguating overlapping syntax: single * (italic) vs double ** (bold), single backtick (code span) vs triple backtick (code block), # heading vs #hashtag
  • The state machine approach to lexing: normal mode scans everything, code block mode passes through raw text
  • Error recovery that degrades gracefully -- unclosed delimiters become literal text rather than crashing the tokenizer
  • Using std.ArrayList for building a token stream with good cache locality, and why pre-allocation is usually not worth the complexity for documents under 100KB
  • Testing tokenizers by verifying the exact token sequence for known inputs, with std.testing.allocator catching memory leaks on error paths

This episode built the tokenizer -- the part that "sees" the raw text and identifies what's what. But a flat list of tokens isn't very useful on its own. Next time we'll build a parser that takes this token stream and constructs a tree structure (AST) representing the document's hierarchy: paragraphs contain inline elements, lists contain list items, headings are top-level nodes. That tree will make rendering to HTML straightforward.

Thanks, and until next time!

@scipio


