Learn Zig Series (#77) - Mini Project: File Sync Tool - Part 1

@scipio 70

about 1 hour ago

StemSocial

Learn Zig Series (#77) - Mini Project: File Sync Tool - Part 1

Part of a multi-episode project

What will I learn

How to design a file synchronization tool architecture from the ground up;
How to recursively scan directories and build a file tree with metadata;
How to compute checksums using Zig's standard library hashing;
How to compare two file trees and produce a diff of changes;
How to build a file manifest containing path, size, modification time, and checksum;
How to serialize that manifest into a wire format suitable for network transfer;
How to use rolling checksums for efficient block-level change detection;
How to write tests that create temporary directory trees and verify diff output.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Zig 0.14+ distribution (download from ziglang.org);
The ambition to learn Zig programming.

Difficulty

Advanced

Curriculum (of the `Learn Zig Series`):

Learn Zig Series (#77) - Mini Project: File Sync Tool - Part 1

We just wrapped up the process monitor last episode, and with it the entire Linux systems programming arc that started back in episode 62. We built filesystem scanners, file watchers, process managers, IPC mechanisms, signal handlers, seccomp sandboxes, ptrace debuggers, /proc parsers, and tied it all together in a working top-like tool. Quite some ground covered ;-)

Now it's time to build something bigger. Over the next four episodes we're going to build a file synchronization tool -- think of it as a stripped-down rsync written entirely in Zig. This is the kind of project that pulls together almost everything we've learned: file I/O (episode 10), directory scanning (episode 62), hashing, networking (episode 21), and serialization (episode 20). The end result will be a tool that can compare two directory trees (local and remote), figure out what changed, and transfer only the differences.

In this first part we're focusing on the local foundations: scanning directories into a tree structure, computing file checksums, building a manifest, comparing manifests to find differences, and serializing that manifest so it can eventually be sent over the network. No actual networking yet -- that comes in part 3. First we get the data model right.

Why build a file sync tool?

Every developer has used rsync, scp, or some variant of file synchronization. But how many of us actually understand what happens under the hood? The rsync algorithm is genuinely clever -- it uses rolling checksums to find matching blocks between files without needing to transfer the entire file. The basic idea is deceptively simple:

Scan the source directory and build a list of files with their sizes, timestamps, and content hashes
Send that manifest to the destination
The destination compares it against its own files
Only transfer the files (or parts of files) that actually differ

The tricky part is step 4. A naive approach would transfer entire files whenever anything changes. rsync's insight was that even within a modified file, most of the content is often unchanged -- you just need to find the matching blocks and skip them. We'll implement a simplified version of this.

Project structure

Let's start with the project layout. This is a multi-file project, similar to how we structured the key-value store (episodes 40-43) and the shell (episodes 47-50):

// build.zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    const exe = b.addExecutable(.{
        .name = "zigsync",
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    });
    b.installArtifact(exe);

    const run_cmd = b.addRunArtifact(exe);
    run_cmd.step.dependOn(b.getInstallStep());
    if (b.args) |args| run_cmd.addArgs(args);

    const run_step = b.step("run", "Run zigsync");
    run_step.dependOn(&run_cmd.step);

    const tests = b.addTest(.{
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    });
    const run_tests = b.addRunArtifact(tests);
    const test_step = b.step("test", "Run unit tests");
    test_step.dependOn(&run_tests.step);
}

Nothing surprising here if you remember episode 15 where we covered the build system. We have a single executable and a test target. As the project grows across the next episodes we might split into multiple modules, but for now everything lives under src/.

The file entry: our core data type

Everything in our sync tool revolves around file entries. A file entry captures everything we need to know about a single file to determine whether it needs syncing:

// src/manifest.zig
const std = @import("std");

pub const FileType = enum(u8) {
    regular = 0,
    directory = 1,
    symlink = 2,
};

pub const FileEntry = struct {
    /// Relative path from the sync root (e.g. "src/main.zig")
    path: []const u8,
    file_type: FileType,
    size: u64,
    /// Modification time in nanoseconds since epoch
    mtime_ns: i128,
    /// SHA-256 checksum of file contents (only for regular files)
    checksum: [32]u8,
    /// Whether the checksum has been computed
    has_checksum: bool,

    pub fn isDir(self: FileEntry) bool {
        return self.file_type == .directory;
    }

    pub fn isRegular(self: FileEntry) bool {
        return self.file_type == .regular;
    }
};

A few design decisions worth talking about here. The path is relative to the sync root -- this is critical. If you're syncing /home/user/project/ to /backup/project/, the paths in the manifest are src/main.zig, README.md, etc. -- not absolute paths. This makes manifests comparable between different machines where the root directories live in different locations.

The mtime is stored in nanoseconds as i128. Linux provides nanosecond-resolution timestamps (we saw this back in episode 62 when reading file metadata), and storing them at full precision means we can detect sub-second modifications. Some older sync tools only compare seconds, which leads to missed changes when build tools modify files in rapid succession.

The checksum is SHA-256. Zig's standard library has a complete implementation in std.crypto.hash.sha2. We could have used a faster hash like xxHash or CRC32 for a toy project, but SHA-256 gives us collision resistance that actually matters when you're deciding whether to transfer a file. Nobody wants their sync tool to skip a file because of a hash collision.

Scanning directories recursively

Here's where things get interesting. We need to walk an entire directory tree and build a flat list of FileEntry values. Zig's std.fs.Dir gives us an iterator, but it only walks one level -- we need to recurse ourselves:

// src/scanner.zig
const std = @import("std");
const manifest = @import("manifest.zig");

pub const ScanError = error{
    OutOfMemory,
    AccessDenied,
    OpenFailed,
    ReadFailed,
};

pub fn scanDirectory(
    allocator: std.mem.Allocator,
    root_path: []const u8,
) ScanError![]manifest.FileEntry {
    var entries = std.ArrayList(manifest.FileEntry).init(allocator);
    errdefer {
        for (entries.items) |entry| {
            allocator.free(entry.path);
        }
        entries.deinit();
    }

    try scanRecursive(allocator, root_path, "", &entries);

    return entries.toOwnedSlice() catch return ScanError.OutOfMemory;
}

fn scanRecursive(
    allocator: std.mem.Allocator,
    root_path: []const u8,
    relative_prefix: []const u8,
    entries: *std.ArrayList(manifest.FileEntry),
) ScanError!void {
    // open the directory to scan
    var full_path_buf: [4096]u8 = undefined;
    const full_path = std.fmt.bufPrint(&full_path_buf, "{s}{s}{s}", .{
        root_path,
        if (relative_prefix.len > 0) "/" else "",
        relative_prefix,
    }) catch return ScanError.OutOfMemory;

    var dir = std.fs.openDirAbsolute(full_path, .{ .iterate = true }) catch
        return ScanError.OpenFailed;
    defer dir.close();

    var it = dir.iterate();
    while (it.next() catch return ScanError.ReadFailed) |entry| {
        // skip hidden files and common junk
        if (entry.name[0] == '.') continue;

        // build relative path
        const rel_path = if (relative_prefix.len > 0)
            std.fmt.allocPrint(allocator, "{s}/{s}", .{ relative_prefix, entry.name }) catch
                return ScanError.OutOfMemory
        else
            allocator.dupe(u8, entry.name) catch return ScanError.OutOfMemory;

        const file_type: manifest.FileType = switch (entry.kind) {
            .directory => .directory,
            .sym_link => .symlink,
            else => .regular,
        };

        // get file metadata
        var stat_buf: [4096]u8 = undefined;
        const stat_path = std.fmt.bufPrint(&stat_buf, "{s}/{s}", .{ root_path, rel_path }) catch {
            allocator.free(rel_path);
            return ScanError.OutOfMemory;
        };

        var size: u64 = 0;
        var mtime_ns: i128 = 0;
        if (file_type == .regular) {
            const file = std.fs.openFileAbsolute(stat_path, .{}) catch {
                allocator.free(rel_path);
                continue; // skip files we can't open
            };
            defer file.close();
            const stat = file.stat() catch {
                allocator.free(rel_path);
                continue;
            };
            size = stat.size;
            mtime_ns = stat.mtime;
        }

        entries.append(.{
            .path = rel_path,
            .file_type = file_type,
            .size = size,
            .mtime_ns = mtime_ns,
            .checksum = undefined,
            .has_checksum = false,
        }) catch {
            allocator.free(rel_path);
            return ScanError.OutOfMemory;
        };

        // recurse into subdirectories
        if (file_type == .directory) {
            try scanRecursive(allocator, root_path, rel_path, entries);
        }
    }
}

pub fn freeEntries(allocator: std.mem.Allocator, entries: []manifest.FileEntry) void {
    for (entries) |entry| {
        allocator.free(entry.path);
    }
    allocator.free(entries);
}

A few things to notice. We skip hidden files (names starting with .) -- this is a deliberate design choice. Most sync tools let you configure this, but for our base implementation skipping dotfiles avoids accidentally syncing .git/ directories, which would be a nightmare. The errdefer block at the top of scanDirectory makes sure we free all the allocated path strings if anything goes wrong partway through. Memory management in Zig is always explicit, but that explicitness is exactly what prevents the kind of leaks you'd get in C when a function bails out halfway through a loop.

The recursive scan builds paths by concatenating the prefix with each entry name. This produces paths like src/main.zig and src/manifest.zig -- exactly what we need for comparing manifests between diferent machines.

Computing file checksums

Once we have the file entries, we need to compute checksums for each regular file. This is separate from scanning because sometimes you only want a quick size+mtime comparison (which is what rsync does in its default --times mode) and only compute checksums when there's a discrepancy:

// src/checksum.zig
const std = @import("std");
const manifest = @import("manifest.zig");

const Sha256 = std.crypto.hash.sha2.Sha256;

pub fn computeChecksum(root_path: []const u8, entry: *manifest.FileEntry) !void {
    if (!entry.isRegular()) return;

    var path_buf: [4096]u8 = undefined;
    const full_path = try std.fmt.bufPrint(&path_buf, "{s}/{s}", .{ root_path, entry.path });

    const file = try std.fs.openFileAbsolute(full_path, .{});
    defer file.close();

    var hasher = Sha256.init(.{});
    var buf: [8192]u8 = undefined;

    while (true) {
        const n = try file.read(&buf);
        if (n == 0) break;
        hasher.update(buf[0..n]);
    }

    hasher.final(&entry.checksum);
    entry.has_checksum = true;
}

pub fn computeAllChecksums(root_path: []const u8, entries: []manifest.FileEntry) void {
    for (entries) |*entry| {
        computeChecksum(root_path, entry) catch {
            // file might have disappeared between scan and checksum
            entry.has_checksum = false;
        };
    }
}

pub fn checksumEqual(a: [32]u8, b: [32]u8) bool {
    return std.mem.eql(u8, &a, &b);
}

The 8KB read buffer is a reasonable trade-off between memory usage and system call overhead. Reading 8KB at a time means a 1MB file needs about 128 read() syscalls. You could bump it to 64KB or even mmap the file (as we covered in episode 31), but for a sync tool where the bottleneck is network transfer rather than local hashing, 8KB is perfectly fine.

SHA-256 processes data in 64-byte blocks internally, so the update call handles buffering for us -- we don't need to align our reads to any particular boundary.

The rolling checksum: Adler-32

Here's where things get clever. For block-level deduplication (which we'll use in part 2 for delta transfers), we need a rolling checksum -- a hash that can be efficiently updated when you slide a window across the data. The classic choice is Adler-32, which rsync uses for exactly this purpose:

// src/rolling.zig
const std = @import("std");

pub const RollingChecksum = struct {
    a: u32 = 1,
    b: u32 = 0,
    window_size: u32,

    const MOD = 65521; // largest prime smaller than 2^16

    pub fn init(window_size: u32) RollingChecksum {
        return .{ .window_size = window_size };
    }

    /// Feed a full block to initialize the checksum
    pub fn feedBlock(self: *RollingChecksum, data: []const u8) void {
        self.a = 1;
        self.b = 0;
        for (data) |byte| {
            self.a = (self.a + byte) % MOD;
            self.b = (self.b + self.a) % MOD;
        }
    }

    /// Roll the window forward by one byte:
    /// remove old_byte (leaving the window) and add new_byte (entering)
    pub fn roll(self: *RollingChecksum, old_byte: u8, new_byte: u8) void {
        self.a = (self.a -% old_byte +% new_byte) % MOD;
        self.b = (self.b -% @as(u32, self.window_size) *% old_byte +% self.a -% 1) % MOD;
    }

    /// Get the 32-bit checksum value
    pub fn value(self: RollingChecksum) u32 {
        return (self.b << 16) | self.a;
    }
};

/// Split a file into fixed-size blocks and compute rolling checksums for each
pub fn computeBlockChecksums(
    allocator: std.mem.Allocator,
    data: []const u8,
    block_size: u32,
) ![]u32 {
    if (data.len == 0) return &[_]u32{};

    const num_blocks = (data.len + block_size - 1) / block_size;
    var checksums = try allocator.alloc(u32, num_blocks);

    var i: usize = 0;
    var block_idx: usize = 0;
    while (i < data.len) : (block_idx += 1) {
        const end = @min(i + block_size, data.len);
        const block = data[i..end];

        var rc = RollingChecksum.init(block_size);
        rc.feedBlock(block);
        checksums[block_idx] = rc.value();

        i = end;
    }

    return checksums[0..block_idx];
}

The beauty of the rolling checksum is that roll() operation. Once you've computed the checksum for bytes [0..N], you can get the checksum for bytes [1..N+1] with just a couple of additions and subtractions -- no need to rehash the entire block. This is O(1) per position instead of O(block_size), which matters enormously when you're scanning a large file looking for matching blocks.

The wrapping arithmetic operators (-%, +%) are essential here. Adler-32 uses modular arithmetic, and we need the wrapping behavior to handle the subtraction of old_byte without underflow errors. This is one of those places where Zig's explicit overflow handling actually helps you think about the math correctly -- in C you'd just get implicit wrapping on unsigned integers and hope for the best.

We won't fully USE the rolling checksum until part 2, but the implementation belongs here because it's part of the core data layer. Think of this episode as building all the Lego bricks; next episodes we'll snap them together.

Comparing manifests: the diff engine

Now for the core sync logic. Given two manifests (source and destination), we need to figure out what's different. There are four possible states for any file:

Added: exists in source but not in destination
Deleted: exists in destination but not in source
Modified: exists in both but content differs (checksum mismatch, or size/mtime change)
Unchanged: exists in both and is identical

// src/diff.zig
const std = @import("std");
const manifest = @import("manifest.zig");
const checksum_mod = @import("checksum.zig");

pub const ChangeType = enum {
    added,
    deleted,
    modified,
    unchanged,
};

pub const DiffEntry = struct {
    path: []const u8,
    change: ChangeType,
    source_entry: ?manifest.FileEntry,
    dest_entry: ?manifest.FileEntry,
};

pub fn computeDiff(
    allocator: std.mem.Allocator,
    source: []const manifest.FileEntry,
    dest: []const manifest.FileEntry,
) ![]DiffEntry {
    var diffs = std.ArrayList(DiffEntry).init(allocator);

    // build a lookup map from destination paths
    var dest_map = std.StringHashMap(manifest.FileEntry).init(allocator);
    defer dest_map.deinit();
    for (dest) |entry| {
        try dest_map.put(entry.path, entry);
    }

    // track which destination entries we've seen
    var seen = std.StringHashMap(void).init(allocator);
    defer seen.deinit();

    // check each source entry against destination
    for (source) |src_entry| {
        try seen.put(src_entry.path, {});

        if (dest_map.get(src_entry.path)) |dst_entry| {
            // exists in both -- check if modified
            const changed = blk: {
                // different type (file became dir or vice versa)
                if (src_entry.file_type != dst_entry.file_type) break :blk true;
                // directories don't have meaningful content to compare
                if (src_entry.isDir()) break :blk false;
                // different size = definitely modified
                if (src_entry.size != dst_entry.size) break :blk true;
                // if checksums are available, compare them
                if (src_entry.has_checksum and dst_entry.has_checksum) {
                    break :blk !checksum_mod.checksumEqual(
                        src_entry.checksum,
                        dst_entry.checksum,
                    );
                }
                // fall back to mtime comparison
                break :blk src_entry.mtime_ns != dst_entry.mtime_ns;
            };

            try diffs.append(.{
                .path = src_entry.path,
                .change = if (changed) .modified else .unchanged,
                .source_entry = src_entry,
                .dest_entry = dst_entry,
            });
        } else {
            // exists in source only = added
            try diffs.append(.{
                .path = src_entry.path,
                .change = .added,
                .source_entry = src_entry,
                .dest_entry = null,
            });
        }
    }

    // anything in destination that wasn't in source = deleted
    for (dest) |dst_entry| {
        if (seen.get(dst_entry.path) == null) {
            try diffs.append(.{
                .path = dst_entry.path,
                .change = .deleted,
                .source_entry = null,
                .dest_entry = dst_entry,
            });
        }
    }

    return diffs.toOwnedSlice() catch return error.OutOfMemory;
}

pub fn countByType(diffs: []const DiffEntry) struct { added: usize, deleted: usize, modified: usize, unchanged: usize } {
    var added: usize = 0;
    var deleted: usize = 0;
    var modified: usize = 0;
    var unchanged: usize = 0;
    for (diffs) |d| {
        switch (d.change) {
            .added => added += 1,
            .deleted => deleted += 1,
            .modified => modified += 1,
            .unchanged => unchanged += 1,
        }
    }
    return .{ .added = added, .deleted = deleted, .modified = modified, .unchanged = unchanged };
}

The comparison logic has a deliberate hierarchy: type change > size change > checksum > mtime. This is important because each check is progressively more expensive. Type and size comparisons are instant (just compare two integers). Checksum comparison requires that you've already hashed the files (expensive for large files). And mtime is the fallback -- it's fast but unreliable (touching a file changes its mtime without changing its content, and copying files can reset mtimes).

Using a StringHashMap to look up destination entries by path gives us O(1) lookups per source entry, making the whole diff O(n + m) where n is source count and m is destination count. A naive nested-loop approach would be O(n*m), and if you're syncing a project with 10,000 files that's 100 million comparisons. Not great.

Serializing the manifest

For the sync to work across machines, we need to send the manifest over the network. We need a serialization format that's compact, unambiguous, and easy to parse on both ends. JSON would work but is wasteful for binary data like checksums. Instead we'll use a simple binary format:

// src/serialize.zig
const std = @import("std");
const manifest = @import("manifest.zig");

const MAGIC: [4]u8 = .{ 'Z', 'S', 'Y', 'N' };
const VERSION: u8 = 1;

pub fn serializeManifest(
    entries: []const manifest.FileEntry,
    writer: anytype,
) !void {
    // header
    try writer.writeAll(&MAGIC);
    try writer.writeByte(VERSION);
    try writer.writeInt(u32, @intCast(entries.len), .little);

    // entries
    for (entries) |entry| {
        // path length + path
        try writer.writeInt(u16, @intCast(entry.path.len), .little);
        try writer.writeAll(entry.path);

        // file type
        try writer.writeByte(@intFromEnum(entry.file_type));

        // size
        try writer.writeInt(u64, entry.size, .little);

        // mtime
        const mtime_bytes: [16]u8 = @bitCast(entry.mtime_ns);
        try writer.writeAll(&mtime_bytes);

        // checksum flag + checksum
        try writer.writeByte(if (entry.has_checksum) 1 else 0);
        if (entry.has_checksum) {
            try writer.writeAll(&entry.checksum);
        }
    }
}

pub fn deserializeManifest(
    allocator: std.mem.Allocator,
    reader: anytype,
) ![]manifest.FileEntry {
    // verify header
    var magic: [4]u8 = undefined;
    _ = try reader.readAll(&magic);
    if (!std.mem.eql(u8, &magic, &MAGIC)) return error.InvalidFormat;

    const version = try reader.readByte();
    if (version != VERSION) return error.UnsupportedVersion;

    const count = try reader.readInt(u32, .little);

    var entries = try allocator.alloc(manifest.FileEntry, count);
    errdefer allocator.free(entries);

    for (entries, 0..) |*entry, i| {
        _ = i;
        // path
        const path_len = try reader.readInt(u16, .little);
        const path = try allocator.alloc(u8, path_len);
        _ = try reader.readAll(path);
        entry.path = path;

        // file type
        entry.file_type = @enumFromInt(try reader.readByte());

        // size
        entry.size = try reader.readInt(u64, .little);

        // mtime
        var mtime_bytes: [16]u8 = undefined;
        _ = try reader.readAll(&mtime_bytes);
        entry.mtime_ns = @bitCast(mtime_bytes);

        // checksum
        const has_cs = try reader.readByte();
        entry.has_checksum = has_cs == 1;
        if (entry.has_checksum) {
            _ = try reader.readAll(&entry.checksum);
        } else {
            entry.checksum = undefined;
        }
    }

    return entries;
}

The magic bytes ZSYN and version number at the start are standard practice for binary formats. If someone accidentally feeds a JPEG to the deserializer, we'll reject it immediately instead of parsing garbage for 500 iterations and then crashing with some nonsensical error. The version byte gives us forward compatibility -- if we later add fields (like permissions or extended attributes), we increment the version and the deserializer knows to expect the new format.

Little-endian byte order is chosen because that's what x86 and ARM (in their standard configurations) use natively. On those platforms, writeInt with .little is essentially a memcpy. Big-endian would work fine too (and is traditional for network protocols), but since both sides of our sync are most likely x86 or ARM Linux boxes, little-endian avoids unnecessary byte swapping.

Testing the whole thing

We need tests that create real directory structures, scan them, compute checksums, and verify the diff output. Zig's testing infrastructure with std.testing.tmpDir makes this straightforward:

// src/main.zig
const std = @import("std");
pub const manifest_mod = @import("manifest.zig");
pub const scanner = @import("scanner.zig");
pub const checksum_mod = @import("checksum.zig");
pub const diff = @import("diff.zig");
pub const serialize = @import("serialize.zig");
pub const rolling = @import("rolling.zig");

pub fn main() !void {
    const stdout = std.io.getStdOut().writer();
    var args = std.process.args();
    _ = args.next(); // skip program name

    const path = args.next() orelse {
        try stdout.print("Usage: zigsync <directory>\n", .{});
        try stdout.print("  Scans a directory and prints its manifest.\n", .{});
        return;
    };

    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    try stdout.print("Scanning: {s}\n", .{path});
    const entries = scanner.scanDirectory(allocator, path) catch |err| {
        try stdout.print("Scan failed: {s}\n", .{@errorName(err)});
        return;
    };
    defer scanner.freeEntries(allocator, entries);

    try stdout.print("Found {d} entries\n\n", .{entries.len});

    // compute checksums
    checksum_mod.computeAllChecksums(path, entries);

    for (entries) |entry| {
        const type_char: u8 = switch (entry.file_type) {
            .directory => 'D',
            .regular => 'F',
            .symlink => 'L',
        };
        if (entry.has_checksum) {
            try stdout.print("[{c}] {s}  ({d} bytes, sha256: {s})\n", .{
                type_char, entry.path, entry.size,
                std.fmt.fmtSliceHexLower(&entry.checksum),
            });
        } else {
            try stdout.print("[{c}] {s}\n", .{ type_char, entry.path });
        }
    }
}

test "scan and checksum a temp directory" {
    const allocator = std.testing.allocator;

    // create temp dir with known content
    var tmp = std.testing.tmpDir(.{});
    defer tmp.cleanup();

    // write some files
    var f1 = try tmp.dir.createFile("hello.txt", .{});
    try f1.writeAll("Hello, world!");
    f1.close();

    try tmp.dir.makeDir("subdir");
    var sub = try tmp.dir.openDir("subdir", .{});
    defer sub.close();
    var f2 = try sub.createFile("nested.txt", .{});
    try f2.writeAll("I am nested");
    f2.close();

    // get the absolute path of the tmp dir
    var path_buf: [4096]u8 = undefined;
    const tmp_path = try tmp.dir.realpath(".", &path_buf);

    const entries = try scanner.scanDirectory(allocator, tmp_path);
    defer scanner.freeEntries(allocator, entries);

    // should have: hello.txt, subdir, subdir/nested.txt
    try std.testing.expect(entries.len == 3);

    // compute checksums
    checksum_mod.computeAllChecksums(tmp_path, entries);

    // verify the regular files have checksums
    var checksum_count: usize = 0;
    for (entries) |e| {
        if (e.has_checksum) checksum_count += 1;
    }
    try std.testing.expectEqual(@as(usize, 2), checksum_count);
}

test "diff detects added and deleted files" {
    const allocator = std.testing.allocator;

    const source = [_]manifest_mod.FileEntry{
        .{ .path = "a.txt", .file_type = .regular, .size = 10, .mtime_ns = 1000, .checksum = undefined, .has_checksum = false },
        .{ .path = "b.txt", .file_type = .regular, .size = 20, .mtime_ns = 2000, .checksum = undefined, .has_checksum = false },
        .{ .path = "new.txt", .file_type = .regular, .size = 5, .mtime_ns = 3000, .checksum = undefined, .has_checksum = false },
    };

    const dest = [_]manifest_mod.FileEntry{
        .{ .path = "a.txt", .file_type = .regular, .size = 10, .mtime_ns = 1000, .checksum = undefined, .has_checksum = false },
        .{ .path = "b.txt", .file_type = .regular, .size = 25, .mtime_ns = 2500, .checksum = undefined, .has_checksum = false },
        .{ .path = "old.txt", .file_type = .regular, .size = 15, .mtime_ns = 500, .checksum = undefined, .has_checksum = false },
    };

    const diffs = try diff.computeDiff(allocator, &source, &dest);
    defer allocator.free(diffs);

    const counts = diff.countByType(diffs);
    try std.testing.expectEqual(@as(usize, 1), counts.added); // new.txt
    try std.testing.expectEqual(@as(usize, 1), counts.deleted); // old.txt
    try std.testing.expectEqual(@as(usize, 1), counts.modified); // b.txt (size+mtime changed)
    try std.testing.expectEqual(@as(usize, 1), counts.unchanged); // a.txt
}

test "serialize and deserialize round-trip" {
    const allocator = std.testing.allocator;

    const entries = [_]manifest_mod.FileEntry{
        .{
            .path = "test.txt",
            .file_type = .regular,
            .size = 42,
            .mtime_ns = 1234567890,
            .checksum = [_]u8{0xAB} ** 32,
            .has_checksum = true,
        },
        .{
            .path = "dir",
            .file_type = .directory,
            .size = 0,
            .mtime_ns = 9876543210,
            .checksum = undefined,
            .has_checksum = false,
        },
    };

    // serialize
    var buf: [1024]u8 = undefined;
    var fbs = std.io.fixedBufferStream(&buf);
    try serialize.serializeManifest(&entries, fbs.writer());

    // deserialize
    fbs.pos = 0;
    const result = try serialize.deserializeManifest(allocator, fbs.reader());
    defer {
        for (result) |e| allocator.free(@constCast(e.path));
        allocator.free(result);
    }

    try std.testing.expectEqual(@as(usize, 2), result.len);
    try std.testing.expectEqualStrings("test.txt", result[0].path);
    try std.testing.expectEqual(@as(u64, 42), result[0].size);
    try std.testing.expect(result[0].has_checksum);
    try std.testing.expectEqualStrings("dir", result[1].path);
    try std.testing.expectEqual(manifest_mod.FileType.directory, result[1].file_type);
}

test "rolling checksum consistency" {
    // compute checksum of a block normally
    var rc1 = rolling.RollingChecksum.init(4);
    rc1.feedBlock("abcd");
    const expected = rc1.value();

    // compute by rolling from "xabc" to "abcd"
    var rc2 = rolling.RollingChecksum.init(4);
    rc2.feedBlock("xabc");
    rc2.roll('x', 'd'); // remove 'x', add 'd'
    const rolled = rc2.value();

    try std.testing.expectEqual(expected, rolled);
}

The tests cover the four main components: directory scanning, diff computation, serialization round-trips, and rolling checksum correctness. The rolling checksum test is particularly important -- it verifies that computing a checksum by rolling forward gives the same result as computing it from scratch on the target block. If that invariant breaks, the entire delta transfer algorithm falls apart.

Running it

Build and test:

$ zig build test
All 4 tests passed.
$ zig build run -- /path/to/some/directory
Scanning: /path/to/some/directory
Found 47 entries

[D] src
[F] src/main.zig  (2841 bytes, sha256: a3f7c1...)
[F] src/manifest.zig  (891 bytes, sha256: 6b2e4a...)
[F] build.zig  (742 bytes, sha256: 9c0d1f...)
...

What we covered

The foundation is in place. We have:

A FileEntry struct that captures everything about a file (path, type, size, mtime, checksum)
A recursive directory scanner that builds a flat manifest from an arbitrary directory tree
SHA-256 checksumming for content-based change detection
A rolling checksum (Adler-32) for block-level matching -- we'll use this heavily in the next episode
A diff engine that compares two manifests and categorizes every file as added, deleted, modified, or unchanged
A binary serialization format with magic bytes and versioning for transferring manifests over the wire
Tests that verify each component independently

Having said that, this is still a tool that only works locally. In the next episode we'll build the delta transfer algorithm -- instead of sending whole files, we'll use the rolling checksum to identify matching blocks and only transmit the differences. That's where the real bandwidth savings come from, and it's the heart of what makes rsync so effective at syncing large files with small changes.

After that, we'll add the network protocol (TCP, naturally -- we covered sockets back in episode 21) and finally polish it into something you could actually use to keep directories in sync across machines.

Thanks for reading!

@scipio

stem stemsocial steemstem zig programming

0.000

0 comments

Learn Zig Series (#77) - Mini Project: File Sync Tool - Part 1

Learn Zig Series (#77) - Mini Project: File Sync Tool - Part 1

What will I learn

Requirements

Difficulty

Curriculum (of the Learn Zig Series):

Learn Zig Series (#77) - Mini Project: File Sync Tool - Part 1

Why build a file sync tool?

Project structure

The file entry: our core data type

Scanning directories recursively

Computing file checksums

The rolling checksum: Adler-32

Comparing manifests: the diff engine

Serializing the manifest

Testing the whole thing

Running it

What we covered

Thanks for reading!

Curriculum (of the `Learn Zig Series`):