Learn Zig Series (#77) - Mini Project: File Sync Tool - Part 1
Learn Zig Series (#77) - Mini Project: File Sync Tool - Part 1

Part of a multi-episode project
What will I learn
- How to design a file synchronization tool architecture from the ground up;
- How to recursively scan directories and build a file tree with metadata;
- How to compute checksums using Zig's standard library hashing;
- How to compare two file trees and produce a diff of changes;
- How to build a file manifest containing path, size, modification time, and checksum;
- How to serialize that manifest into a wire format suitable for network transfer;
- How to use rolling checksums for efficient block-level change detection;
- How to write tests that create temporary directory trees and verify diff output.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Zig 0.14+ distribution (download from ziglang.org);
- The ambition to learn Zig programming.
Difficulty
- Advanced
Curriculum (of the Learn Zig Series):
- Zig Programming Tutorial - ep001 - Intro
- Learn Zig Series (#2) - Hello Zig, Variables and Types
- Learn Zig Series (#3) - Functions and Control Flow
- Learn Zig Series (#4) - Error Handling (Zig's Best Feature)
- Learn Zig Series (#5) - Arrays, Slices, and Strings
- Learn Zig Series (#6) - Structs, Enums, and Tagged Unions
- Learn Zig Series (#7) - Memory Management and Allocators
- Learn Zig Series (#8) - Pointers and Memory Layout
- Learn Zig Series (#9) - Comptime (Zig's Superpower)
- Learn Zig Series (#10) - Project Structure, Modules, and File I/O
- Learn Zig Series (#11) - Mini Project: Building a Step Sequencer
- Learn Zig Series (#12) - Testing and Test-Driven Development
- Learn Zig Series (#13) - Interfaces via Type Erasure
- Learn Zig Series (#14) - Generics with Comptime Parameters
- Learn Zig Series (#15) - The Build System (build.zig)
- Learn Zig Series (#16) - Sentinel-Terminated Types and C Strings
- Learn Zig Series (#17) - Packed Structs and Bit Manipulation
- Learn Zig Series (#18b) - Addendum: Async Returns in Zig 0.16
- Learn Zig Series (#19) - SIMD with @Vector
- Learn Zig Series (#20) - Working with JSON
- Learn Zig Series (#21) - Networking and TCP Sockets
- Learn Zig Series (#22) - Hash Maps and Data Structures
- Learn Zig Series (#23) - Iterators and Lazy Evaluation
- Learn Zig Series (#24) - Logging, Formatting, and Debug Output
- Learn Zig Series (#25) - Mini Project: HTTP Status Checker
- Learn Zig Series (#26) - Writing a Custom Allocator
- Learn Zig Series (#27) - C Interop: Calling C from Zig
- Learn Zig Series (#28) - C Interop: Exposing Zig to C
- Learn Zig Series (#29) - Inline Assembly and Low-Level Control
- Learn Zig Series (#30) - Thread Safety and Atomics
- Learn Zig Series (#31) - Memory-Mapped I/O and Files
- Learn Zig Series (#32) - Compile-Time Reflection with @typeInfo
- Learn Zig Series (#33) - Building a State Machine with Tagged Unions
- Learn Zig Series (#34) - Performance Profiling and Optimization
- Learn Zig Series (#35) - Cross-Compilation and Target Triples
- Learn Zig Series (#36) - Mini Project: CLI Task Runner
- Learn Zig Series (#37) - Markdown to HTML: Tokenizer and Lexer
- Learn Zig Series (#38) - Markdown to HTML: Parser and AST
- Learn Zig Series (#39) - Markdown to HTML: Renderer and CLI
- Learn Zig Series (#40) - Key-Value Store: In-Memory Store
- Learn Zig Series (#41) - Key-Value Store: Write-Ahead Log
- Learn Zig Series (#42) - Key-Value Store: TCP Server
- Learn Zig Series (#43) - Key-Value Store: Client Library and Benchmarks
- Learn Zig Series (#44) - Image Tool: Reading and Writing PPM/BMP
- Learn Zig Series (#45) - Image Tool: Pixel Operations
- Learn Zig Series (#46) - Image Tool: CLI Pipeline
- Learn Zig Series (#47) - Build a Shell: Parsing Commands
- Learn Zig Series (#48) - Build a Shell: Process Spawning
- Learn Zig Series (#49) - Build a Shell: Built-in Commands
- Learn Zig Series (#50) - Build a Shell: Job Control and Signals
- Learn Zig Series (#51) - HTTP Server: Accept Loop and Parsing
- Learn Zig Series (#52) - HTTP Server: Router and Responses
- Learn Zig Series (#53) - HTTP Server: Static Files and MIME
- Learn Zig Series (#54) - HTTP Server: Middleware and Logging
- Learn Zig Series (#55) - ECS Game Engine: Architecture
- Learn Zig Series (#56) - ECS Game Engine: Component Storage
- Learn Zig Series (#57) - ECS Game Engine: Systems and Queries
- Learn Zig Series (#58) - ECS Game Engine: Terminal Rendering
- Learn Zig Series (#59) - Assembler: Instruction Encoding
- Learn Zig Series (#60) - Assembler: Two-Pass Assembly
- Learn Zig Series (#61) - Assembler: Disassembler and Binary Inspector
- Learn Zig Series (#62) - File Systems: Reading Directories and Metadata
- Learn Zig Series (#63) - File Watching: Detecting Changes
- Learn Zig Series (#64) - Process Management: Fork, Exec, Wait
- Learn Zig Series (#65) - Pipes and Inter-Process Communication
- Learn Zig Series (#66) - Shared Memory and Semaphores
- Learn Zig Series (#67) - Signal Handling Deep Dive
- Learn Zig Series (#68) - Unix Domain Sockets
- Learn Zig Series (#69) - Daemonization: Background Services
- Learn Zig Series (#70) - Timers and Scheduling
- Learn Zig Series (#71) - Resource Limits and Capabilities
- Learn Zig Series (#72) - System Call Wrappers
- Learn Zig Series (#73) - seccomp and Sandboxing
- Learn Zig Series (#74) - ptrace: Process Tracing
- Learn Zig Series (#75) - Reading Kernel State from /proc and /sys
- Learn Zig Series (#76) - Mini Project: Process Monitor
- Learn Zig Series (#77) - Mini Project: File Sync Tool - Part 1 (this post)
Learn Zig Series (#77) - Mini Project: File Sync Tool - Part 1
We just wrapped up the process monitor last episode, and with it the entire Linux systems programming arc that started back in episode 62. We built filesystem scanners, file watchers, process managers, IPC mechanisms, signal handlers, seccomp sandboxes, ptrace debuggers, /proc parsers, and tied it all together in a working top-like tool. Quite some ground covered ;-)
Now it's time to build something bigger. Over the next four episodes we're going to build a file synchronization tool -- think of it as a stripped-down rsync written entirely in Zig. This is the kind of project that pulls together almost everything we've learned: file I/O (episode 10), directory scanning (episode 62), hashing, networking (episode 21), and serialization (episode 20). The end result will be a tool that can compare two directory trees (local and remote), figure out what changed, and transfer only the differences.
In this first part we're focusing on the local foundations: scanning directories into a tree structure, computing file checksums, building a manifest, comparing manifests to find differences, and serializing that manifest so it can eventually be sent over the network. No actual networking yet -- that comes in part 3. First we get the data model right.
Why build a file sync tool?
Every developer has used rsync, scp, or some variant of file synchronization. But how many of us actually understand what happens under the hood? The rsync algorithm is genuinely clever -- it uses rolling checksums to find matching blocks between files without needing to transfer the entire file. The basic idea is deceptively simple:
- Scan the source directory and build a list of files with their sizes, timestamps, and content hashes
- Send that manifest to the destination
- The destination compares it against its own files
- Only transfer the files (or parts of files) that actually differ
The tricky part is step 4. A naive approach would transfer entire files whenever anything changes. rsync's insight was that even within a modified file, most of the content is often unchanged -- you just need to find the matching blocks and skip them. We'll implement a simplified version of this.
Project structure
Let's start with the project layout. This is a multi-file project, similar to how we structured the key-value store (episodes 40-43) and the shell (episodes 47-50):
// build.zig
const std = @import("std");
pub fn build(b: *std.Build) void {
const target = b.standardTargetOptions(.{});
const optimize = b.standardOptimizeOption(.{});
const exe = b.addExecutable(.{
.name = "zigsync",
.root_source_file = b.path("src/main.zig"),
.target = target,
.optimize = optimize,
});
b.installArtifact(exe);
const run_cmd = b.addRunArtifact(exe);
run_cmd.step.dependOn(b.getInstallStep());
if (b.args) |args| run_cmd.addArgs(args);
const run_step = b.step("run", "Run zigsync");
run_step.dependOn(&run_cmd.step);
const tests = b.addTest(.{
.root_source_file = b.path("src/main.zig"),
.target = target,
.optimize = optimize,
});
const run_tests = b.addRunArtifact(tests);
const test_step = b.step("test", "Run unit tests");
test_step.dependOn(&run_tests.step);
}
Nothing surprising here if you remember episode 15 where we covered the build system. We have a single executable and a test target. As the project grows across the next episodes we might split into multiple modules, but for now everything lives under src/.
The file entry: our core data type
Everything in our sync tool revolves around file entries. A file entry captures everything we need to know about a single file to determine whether it needs syncing:
// src/manifest.zig
const std = @import("std");
pub const FileType = enum(u8) {
regular = 0,
directory = 1,
symlink = 2,
};
pub const FileEntry = struct {
/// Relative path from the sync root (e.g. "src/main.zig")
path: []const u8,
file_type: FileType,
size: u64,
/// Modification time in nanoseconds since epoch
mtime_ns: i128,
/// SHA-256 checksum of file contents (only for regular files)
checksum: [32]u8,
/// Whether the checksum has been computed
has_checksum: bool,
pub fn isDir(self: FileEntry) bool {
return self.file_type == .directory;
}
pub fn isRegular(self: FileEntry) bool {
return self.file_type == .regular;
}
};
A few design decisions worth talking about here. The path is relative to the sync root -- this is critical. If you're syncing /home/user/project/ to /backup/project/, the paths in the manifest are src/main.zig, README.md, etc. -- not absolute paths. This makes manifests comparable between different machines where the root directories live in different locations.
The mtime is stored in nanoseconds as i128. Linux provides nanosecond-resolution timestamps (we saw this back in episode 62 when reading file metadata), and storing them at full precision means we can detect sub-second modifications. Some older sync tools only compare seconds, which leads to missed changes when build tools modify files in rapid succession.
The checksum is SHA-256. Zig's standard library has a complete implementation in std.crypto.hash.sha2. We could have used a faster hash like xxHash or CRC32 for a toy project, but SHA-256 gives us collision resistance that actually matters when you're deciding whether to transfer a file. Nobody wants their sync tool to skip a file because of a hash collision.
Scanning directories recursively
Here's where things get interesting. We need to walk an entire directory tree and build a flat list of FileEntry values. Zig's std.fs.Dir gives us an iterator, but it only walks one level -- we need to recurse ourselves:
// src/scanner.zig
const std = @import("std");
const manifest = @import("manifest.zig");
pub const ScanError = error{
OutOfMemory,
AccessDenied,
OpenFailed,
ReadFailed,
};
pub fn scanDirectory(
allocator: std.mem.Allocator,
root_path: []const u8,
) ScanError![]manifest.FileEntry {
var entries = std.ArrayList(manifest.FileEntry).init(allocator);
errdefer {
for (entries.items) |entry| {
allocator.free(entry.path);
}
entries.deinit();
}
try scanRecursive(allocator, root_path, "", &entries);
return entries.toOwnedSlice() catch return ScanError.OutOfMemory;
}
fn scanRecursive(
allocator: std.mem.Allocator,
root_path: []const u8,
relative_prefix: []const u8,
entries: *std.ArrayList(manifest.FileEntry),
) ScanError!void {
// open the directory to scan
var full_path_buf: [4096]u8 = undefined;
const full_path = std.fmt.bufPrint(&full_path_buf, "{s}{s}{s}", .{
root_path,
if (relative_prefix.len > 0) "/" else "",
relative_prefix,
}) catch return ScanError.OutOfMemory;
var dir = std.fs.openDirAbsolute(full_path, .{ .iterate = true }) catch
return ScanError.OpenFailed;
defer dir.close();
var it = dir.iterate();
while (it.next() catch return ScanError.ReadFailed) |entry| {
// skip hidden files and common junk
if (entry.name[0] == '.') continue;
// build relative path
const rel_path = if (relative_prefix.len > 0)
std.fmt.allocPrint(allocator, "{s}/{s}", .{ relative_prefix, entry.name }) catch
return ScanError.OutOfMemory
else
allocator.dupe(u8, entry.name) catch return ScanError.OutOfMemory;
const file_type: manifest.FileType = switch (entry.kind) {
.directory => .directory,
.sym_link => .symlink,
else => .regular,
};
// get file metadata
var stat_buf: [4096]u8 = undefined;
const stat_path = std.fmt.bufPrint(&stat_buf, "{s}/{s}", .{ root_path, rel_path }) catch {
allocator.free(rel_path);
return ScanError.OutOfMemory;
};
var size: u64 = 0;
var mtime_ns: i128 = 0;
if (file_type == .regular) {
const file = std.fs.openFileAbsolute(stat_path, .{}) catch {
allocator.free(rel_path);
continue; // skip files we can't open
};
defer file.close();
const stat = file.stat() catch {
allocator.free(rel_path);
continue;
};
size = stat.size;
mtime_ns = stat.mtime;
}
entries.append(.{
.path = rel_path,
.file_type = file_type,
.size = size,
.mtime_ns = mtime_ns,
.checksum = undefined,
.has_checksum = false,
}) catch {
allocator.free(rel_path);
return ScanError.OutOfMemory;
};
// recurse into subdirectories
if (file_type == .directory) {
try scanRecursive(allocator, root_path, rel_path, entries);
}
}
}
pub fn freeEntries(allocator: std.mem.Allocator, entries: []manifest.FileEntry) void {
for (entries) |entry| {
allocator.free(entry.path);
}
allocator.free(entries);
}
A few things to notice. We skip hidden files (names starting with .) -- this is a deliberate design choice. Most sync tools let you configure this, but for our base implementation skipping dotfiles avoids accidentally syncing .git/ directories, which would be a nightmare. The errdefer block at the top of scanDirectory makes sure we free all the allocated path strings if anything goes wrong partway through. Memory management in Zig is always explicit, but that explicitness is exactly what prevents the kind of leaks you'd get in C when a function bails out halfway through a loop.
The recursive scan builds paths by concatenating the prefix with each entry name. This produces paths like src/main.zig and src/manifest.zig -- exactly what we need for comparing manifests between diferent machines.
Computing file checksums
Once we have the file entries, we need to compute checksums for each regular file. This is separate from scanning because sometimes you only want a quick size+mtime comparison (which is what rsync does in its default --times mode) and only compute checksums when there's a discrepancy:
// src/checksum.zig
const std = @import("std");
const manifest = @import("manifest.zig");
const Sha256 = std.crypto.hash.sha2.Sha256;
pub fn computeChecksum(root_path: []const u8, entry: *manifest.FileEntry) !void {
if (!entry.isRegular()) return;
var path_buf: [4096]u8 = undefined;
const full_path = try std.fmt.bufPrint(&path_buf, "{s}/{s}", .{ root_path, entry.path });
const file = try std.fs.openFileAbsolute(full_path, .{});
defer file.close();
var hasher = Sha256.init(.{});
var buf: [8192]u8 = undefined;
while (true) {
const n = try file.read(&buf);
if (n == 0) break;
hasher.update(buf[0..n]);
}
hasher.final(&entry.checksum);
entry.has_checksum = true;
}
pub fn computeAllChecksums(root_path: []const u8, entries: []manifest.FileEntry) void {
for (entries) |*entry| {
computeChecksum(root_path, entry) catch {
// file might have disappeared between scan and checksum
entry.has_checksum = false;
};
}
}
pub fn checksumEqual(a: [32]u8, b: [32]u8) bool {
return std.mem.eql(u8, &a, &b);
}
The 8KB read buffer is a reasonable trade-off between memory usage and system call overhead. Reading 8KB at a time means a 1MB file needs about 128 read() syscalls. You could bump it to 64KB or even mmap the file (as we covered in episode 31), but for a sync tool where the bottleneck is network transfer rather than local hashing, 8KB is perfectly fine.
SHA-256 processes data in 64-byte blocks internally, so the update call handles buffering for us -- we don't need to align our reads to any particular boundary.
The rolling checksum: Adler-32
Here's where things get clever. For block-level deduplication (which we'll use in part 2 for delta transfers), we need a rolling checksum -- a hash that can be efficiently updated when you slide a window across the data. The classic choice is Adler-32, which rsync uses for exactly this purpose:
// src/rolling.zig
const std = @import("std");
pub const RollingChecksum = struct {
a: u32 = 1,
b: u32 = 0,
window_size: u32,
const MOD = 65521; // largest prime smaller than 2^16
pub fn init(window_size: u32) RollingChecksum {
return .{ .window_size = window_size };
}
/// Feed a full block to initialize the checksum
pub fn feedBlock(self: *RollingChecksum, data: []const u8) void {
self.a = 1;
self.b = 0;
for (data) |byte| {
self.a = (self.a + byte) % MOD;
self.b = (self.b + self.a) % MOD;
}
}
/// Roll the window forward by one byte:
/// remove old_byte (leaving the window) and add new_byte (entering)
pub fn roll(self: *RollingChecksum, old_byte: u8, new_byte: u8) void {
self.a = (self.a -% old_byte +% new_byte) % MOD;
self.b = (self.b -% @as(u32, self.window_size) *% old_byte +% self.a -% 1) % MOD;
}
/// Get the 32-bit checksum value
pub fn value(self: RollingChecksum) u32 {
return (self.b << 16) | self.a;
}
};
/// Split a file into fixed-size blocks and compute rolling checksums for each
pub fn computeBlockChecksums(
allocator: std.mem.Allocator,
data: []const u8,
block_size: u32,
) ![]u32 {
if (data.len == 0) return &[_]u32{};
const num_blocks = (data.len + block_size - 1) / block_size;
var checksums = try allocator.alloc(u32, num_blocks);
var i: usize = 0;
var block_idx: usize = 0;
while (i < data.len) : (block_idx += 1) {
const end = @min(i + block_size, data.len);
const block = data[i..end];
var rc = RollingChecksum.init(block_size);
rc.feedBlock(block);
checksums[block_idx] = rc.value();
i = end;
}
return checksums[0..block_idx];
}
The beauty of the rolling checksum is that roll() operation. Once you've computed the checksum for bytes [0..N], you can get the checksum for bytes [1..N+1] with just a couple of additions and subtractions -- no need to rehash the entire block. This is O(1) per position instead of O(block_size), which matters enormously when you're scanning a large file looking for matching blocks.
The wrapping arithmetic operators (-%, +%) are essential here. Adler-32 uses modular arithmetic, and we need the wrapping behavior to handle the subtraction of old_byte without underflow errors. This is one of those places where Zig's explicit overflow handling actually helps you think about the math correctly -- in C you'd just get implicit wrapping on unsigned integers and hope for the best.
We won't fully USE the rolling checksum until part 2, but the implementation belongs here because it's part of the core data layer. Think of this episode as building all the Lego bricks; next episodes we'll snap them together.
Comparing manifests: the diff engine
Now for the core sync logic. Given two manifests (source and destination), we need to figure out what's different. There are four possible states for any file:
- Added: exists in source but not in destination
- Deleted: exists in destination but not in source
- Modified: exists in both but content differs (checksum mismatch, or size/mtime change)
- Unchanged: exists in both and is identical
// src/diff.zig
const std = @import("std");
const manifest = @import("manifest.zig");
const checksum_mod = @import("checksum.zig");
pub const ChangeType = enum {
added,
deleted,
modified,
unchanged,
};
pub const DiffEntry = struct {
path: []const u8,
change: ChangeType,
source_entry: ?manifest.FileEntry,
dest_entry: ?manifest.FileEntry,
};
pub fn computeDiff(
allocator: std.mem.Allocator,
source: []const manifest.FileEntry,
dest: []const manifest.FileEntry,
) ![]DiffEntry {
var diffs = std.ArrayList(DiffEntry).init(allocator);
// build a lookup map from destination paths
var dest_map = std.StringHashMap(manifest.FileEntry).init(allocator);
defer dest_map.deinit();
for (dest) |entry| {
try dest_map.put(entry.path, entry);
}
// track which destination entries we've seen
var seen = std.StringHashMap(void).init(allocator);
defer seen.deinit();
// check each source entry against destination
for (source) |src_entry| {
try seen.put(src_entry.path, {});
if (dest_map.get(src_entry.path)) |dst_entry| {
// exists in both -- check if modified
const changed = blk: {
// different type (file became dir or vice versa)
if (src_entry.file_type != dst_entry.file_type) break :blk true;
// directories don't have meaningful content to compare
if (src_entry.isDir()) break :blk false;
// different size = definitely modified
if (src_entry.size != dst_entry.size) break :blk true;
// if checksums are available, compare them
if (src_entry.has_checksum and dst_entry.has_checksum) {
break :blk !checksum_mod.checksumEqual(
src_entry.checksum,
dst_entry.checksum,
);
}
// fall back to mtime comparison
break :blk src_entry.mtime_ns != dst_entry.mtime_ns;
};
try diffs.append(.{
.path = src_entry.path,
.change = if (changed) .modified else .unchanged,
.source_entry = src_entry,
.dest_entry = dst_entry,
});
} else {
// exists in source only = added
try diffs.append(.{
.path = src_entry.path,
.change = .added,
.source_entry = src_entry,
.dest_entry = null,
});
}
}
// anything in destination that wasn't in source = deleted
for (dest) |dst_entry| {
if (seen.get(dst_entry.path) == null) {
try diffs.append(.{
.path = dst_entry.path,
.change = .deleted,
.source_entry = null,
.dest_entry = dst_entry,
});
}
}
return diffs.toOwnedSlice() catch return error.OutOfMemory;
}
pub fn countByType(diffs: []const DiffEntry) struct { added: usize, deleted: usize, modified: usize, unchanged: usize } {
var added: usize = 0;
var deleted: usize = 0;
var modified: usize = 0;
var unchanged: usize = 0;
for (diffs) |d| {
switch (d.change) {
.added => added += 1,
.deleted => deleted += 1,
.modified => modified += 1,
.unchanged => unchanged += 1,
}
}
return .{ .added = added, .deleted = deleted, .modified = modified, .unchanged = unchanged };
}
The comparison logic has a deliberate hierarchy: type change > size change > checksum > mtime. This is important because each check is progressively more expensive. Type and size comparisons are instant (just compare two integers). Checksum comparison requires that you've already hashed the files (expensive for large files). And mtime is the fallback -- it's fast but unreliable (touching a file changes its mtime without changing its content, and copying files can reset mtimes).
Using a StringHashMap to look up destination entries by path gives us O(1) lookups per source entry, making the whole diff O(n + m) where n is source count and m is destination count. A naive nested-loop approach would be O(n*m), and if you're syncing a project with 10,000 files that's 100 million comparisons. Not great.
Serializing the manifest
For the sync to work across machines, we need to send the manifest over the network. We need a serialization format that's compact, unambiguous, and easy to parse on both ends. JSON would work but is wasteful for binary data like checksums. Instead we'll use a simple binary format:
// src/serialize.zig
const std = @import("std");
const manifest = @import("manifest.zig");
const MAGIC: [4]u8 = .{ 'Z', 'S', 'Y', 'N' };
const VERSION: u8 = 1;
pub fn serializeManifest(
entries: []const manifest.FileEntry,
writer: anytype,
) !void {
// header
try writer.writeAll(&MAGIC);
try writer.writeByte(VERSION);
try writer.writeInt(u32, @intCast(entries.len), .little);
// entries
for (entries) |entry| {
// path length + path
try writer.writeInt(u16, @intCast(entry.path.len), .little);
try writer.writeAll(entry.path);
// file type
try writer.writeByte(@intFromEnum(entry.file_type));
// size
try writer.writeInt(u64, entry.size, .little);
// mtime
const mtime_bytes: [16]u8 = @bitCast(entry.mtime_ns);
try writer.writeAll(&mtime_bytes);
// checksum flag + checksum
try writer.writeByte(if (entry.has_checksum) 1 else 0);
if (entry.has_checksum) {
try writer.writeAll(&entry.checksum);
}
}
}
pub fn deserializeManifest(
allocator: std.mem.Allocator,
reader: anytype,
) ![]manifest.FileEntry {
// verify header
var magic: [4]u8 = undefined;
_ = try reader.readAll(&magic);
if (!std.mem.eql(u8, &magic, &MAGIC)) return error.InvalidFormat;
const version = try reader.readByte();
if (version != VERSION) return error.UnsupportedVersion;
const count = try reader.readInt(u32, .little);
var entries = try allocator.alloc(manifest.FileEntry, count);
errdefer allocator.free(entries);
for (entries, 0..) |*entry, i| {
_ = i;
// path
const path_len = try reader.readInt(u16, .little);
const path = try allocator.alloc(u8, path_len);
_ = try reader.readAll(path);
entry.path = path;
// file type
entry.file_type = @enumFromInt(try reader.readByte());
// size
entry.size = try reader.readInt(u64, .little);
// mtime
var mtime_bytes: [16]u8 = undefined;
_ = try reader.readAll(&mtime_bytes);
entry.mtime_ns = @bitCast(mtime_bytes);
// checksum
const has_cs = try reader.readByte();
entry.has_checksum = has_cs == 1;
if (entry.has_checksum) {
_ = try reader.readAll(&entry.checksum);
} else {
entry.checksum = undefined;
}
}
return entries;
}
The magic bytes ZSYN and version number at the start are standard practice for binary formats. If someone accidentally feeds a JPEG to the deserializer, we'll reject it immediately instead of parsing garbage for 500 iterations and then crashing with some nonsensical error. The version byte gives us forward compatibility -- if we later add fields (like permissions or extended attributes), we increment the version and the deserializer knows to expect the new format.
Little-endian byte order is chosen because that's what x86 and ARM (in their standard configurations) use natively. On those platforms, writeInt with .little is essentially a memcpy. Big-endian would work fine too (and is traditional for network protocols), but since both sides of our sync are most likely x86 or ARM Linux boxes, little-endian avoids unnecessary byte swapping.
Testing the whole thing
We need tests that create real directory structures, scan them, compute checksums, and verify the diff output. Zig's testing infrastructure with std.testing.tmpDir makes this straightforward:
// src/main.zig
const std = @import("std");
pub const manifest_mod = @import("manifest.zig");
pub const scanner = @import("scanner.zig");
pub const checksum_mod = @import("checksum.zig");
pub const diff = @import("diff.zig");
pub const serialize = @import("serialize.zig");
pub const rolling = @import("rolling.zig");
pub fn main() !void {
const stdout = std.io.getStdOut().writer();
var args = std.process.args();
_ = args.next(); // skip program name
const path = args.next() orelse {
try stdout.print("Usage: zigsync <directory>\n", .{});
try stdout.print(" Scans a directory and prints its manifest.\n", .{});
return;
};
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
const allocator = gpa.allocator();
try stdout.print("Scanning: {s}\n", .{path});
const entries = scanner.scanDirectory(allocator, path) catch |err| {
try stdout.print("Scan failed: {s}\n", .{@errorName(err)});
return;
};
defer scanner.freeEntries(allocator, entries);
try stdout.print("Found {d} entries\n\n", .{entries.len});
// compute checksums
checksum_mod.computeAllChecksums(path, entries);
for (entries) |entry| {
const type_char: u8 = switch (entry.file_type) {
.directory => 'D',
.regular => 'F',
.symlink => 'L',
};
if (entry.has_checksum) {
try stdout.print("[{c}] {s} ({d} bytes, sha256: {s})\n", .{
type_char, entry.path, entry.size,
std.fmt.fmtSliceHexLower(&entry.checksum),
});
} else {
try stdout.print("[{c}] {s}\n", .{ type_char, entry.path });
}
}
}
test "scan and checksum a temp directory" {
const allocator = std.testing.allocator;
// create temp dir with known content
var tmp = std.testing.tmpDir(.{});
defer tmp.cleanup();
// write some files
var f1 = try tmp.dir.createFile("hello.txt", .{});
try f1.writeAll("Hello, world!");
f1.close();
try tmp.dir.makeDir("subdir");
var sub = try tmp.dir.openDir("subdir", .{});
defer sub.close();
var f2 = try sub.createFile("nested.txt", .{});
try f2.writeAll("I am nested");
f2.close();
// get the absolute path of the tmp dir
var path_buf: [4096]u8 = undefined;
const tmp_path = try tmp.dir.realpath(".", &path_buf);
const entries = try scanner.scanDirectory(allocator, tmp_path);
defer scanner.freeEntries(allocator, entries);
// should have: hello.txt, subdir, subdir/nested.txt
try std.testing.expect(entries.len == 3);
// compute checksums
checksum_mod.computeAllChecksums(tmp_path, entries);
// verify the regular files have checksums
var checksum_count: usize = 0;
for (entries) |e| {
if (e.has_checksum) checksum_count += 1;
}
try std.testing.expectEqual(@as(usize, 2), checksum_count);
}
test "diff detects added and deleted files" {
const allocator = std.testing.allocator;
const source = [_]manifest_mod.FileEntry{
.{ .path = "a.txt", .file_type = .regular, .size = 10, .mtime_ns = 1000, .checksum = undefined, .has_checksum = false },
.{ .path = "b.txt", .file_type = .regular, .size = 20, .mtime_ns = 2000, .checksum = undefined, .has_checksum = false },
.{ .path = "new.txt", .file_type = .regular, .size = 5, .mtime_ns = 3000, .checksum = undefined, .has_checksum = false },
};
const dest = [_]manifest_mod.FileEntry{
.{ .path = "a.txt", .file_type = .regular, .size = 10, .mtime_ns = 1000, .checksum = undefined, .has_checksum = false },
.{ .path = "b.txt", .file_type = .regular, .size = 25, .mtime_ns = 2500, .checksum = undefined, .has_checksum = false },
.{ .path = "old.txt", .file_type = .regular, .size = 15, .mtime_ns = 500, .checksum = undefined, .has_checksum = false },
};
const diffs = try diff.computeDiff(allocator, &source, &dest);
defer allocator.free(diffs);
const counts = diff.countByType(diffs);
try std.testing.expectEqual(@as(usize, 1), counts.added); // new.txt
try std.testing.expectEqual(@as(usize, 1), counts.deleted); // old.txt
try std.testing.expectEqual(@as(usize, 1), counts.modified); // b.txt (size+mtime changed)
try std.testing.expectEqual(@as(usize, 1), counts.unchanged); // a.txt
}
test "serialize and deserialize round-trip" {
const allocator = std.testing.allocator;
const entries = [_]manifest_mod.FileEntry{
.{
.path = "test.txt",
.file_type = .regular,
.size = 42,
.mtime_ns = 1234567890,
.checksum = [_]u8{0xAB} ** 32,
.has_checksum = true,
},
.{
.path = "dir",
.file_type = .directory,
.size = 0,
.mtime_ns = 9876543210,
.checksum = undefined,
.has_checksum = false,
},
};
// serialize
var buf: [1024]u8 = undefined;
var fbs = std.io.fixedBufferStream(&buf);
try serialize.serializeManifest(&entries, fbs.writer());
// deserialize
fbs.pos = 0;
const result = try serialize.deserializeManifest(allocator, fbs.reader());
defer {
for (result) |e| allocator.free(@constCast(e.path));
allocator.free(result);
}
try std.testing.expectEqual(@as(usize, 2), result.len);
try std.testing.expectEqualStrings("test.txt", result[0].path);
try std.testing.expectEqual(@as(u64, 42), result[0].size);
try std.testing.expect(result[0].has_checksum);
try std.testing.expectEqualStrings("dir", result[1].path);
try std.testing.expectEqual(manifest_mod.FileType.directory, result[1].file_type);
}
test "rolling checksum consistency" {
// compute checksum of a block normally
var rc1 = rolling.RollingChecksum.init(4);
rc1.feedBlock("abcd");
const expected = rc1.value();
// compute by rolling from "xabc" to "abcd"
var rc2 = rolling.RollingChecksum.init(4);
rc2.feedBlock("xabc");
rc2.roll('x', 'd'); // remove 'x', add 'd'
const rolled = rc2.value();
try std.testing.expectEqual(expected, rolled);
}
The tests cover the four main components: directory scanning, diff computation, serialization round-trips, and rolling checksum correctness. The rolling checksum test is particularly important -- it verifies that computing a checksum by rolling forward gives the same result as computing it from scratch on the target block. If that invariant breaks, the entire delta transfer algorithm falls apart.
Running it
Build and test:
$ zig build test
All 4 tests passed.
$ zig build run -- /path/to/some/directory
Scanning: /path/to/some/directory
Found 47 entries
[D] src
[F] src/main.zig (2841 bytes, sha256: a3f7c1...)
[F] src/manifest.zig (891 bytes, sha256: 6b2e4a...)
[F] build.zig (742 bytes, sha256: 9c0d1f...)
...
What we covered
The foundation is in place. We have:
- A FileEntry struct that captures everything about a file (path, type, size, mtime, checksum)
- A recursive directory scanner that builds a flat manifest from an arbitrary directory tree
- SHA-256 checksumming for content-based change detection
- A rolling checksum (Adler-32) for block-level matching -- we'll use this heavily in the next episode
- A diff engine that compares two manifests and categorizes every file as added, deleted, modified, or unchanged
- A binary serialization format with magic bytes and versioning for transferring manifests over the wire
- Tests that verify each component independently
Having said that, this is still a tool that only works locally. In the next episode we'll build the delta transfer algorithm -- instead of sending whole files, we'll use the rolling checksum to identify matching blocks and only transmit the differences. That's where the real bandwidth savings come from, and it's the heart of what makes rsync so effective at syncing large files with small changes.
After that, we'll add the network protocol (TCP, naturally -- we covered sockets back in episode 21) and finally polish it into something you could actually use to keep directories in sync across machines.