Learn Zig Series (#71) - Resource Limits and Capabilities

What will I learn
- How getrlimit/setrlimit control per-process resource usage and why the kernel enforces soft vs hard limits;
- How to query and modify common resource limits: open files, virtual memory, CPU time, and core dump size;
- How Linux capabilities provide fine-grained privilege control without requiring full root access;
- How to drop privileges after binding a privileged port so your daemon runs as an unprivileged user;
- How cgroups v2 isolate CPU, memory, and I/O for groups of processes;
- How Linux namespaces create isolated views of system resources -- the fundamental building block of containers;
- How chroot and pivot_root restrict a process's filesystem view;
- How to combine resource limits, capability dropping, and namespace isolation to sandbox a child process.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Zig 0.14+ distribution (download from ziglang.org);
- The ambition to learn Zig programming.
Difficulty
- Intermediate
Curriculum (of the Learn Zig Series):
- Zig Programming Tutorial - ep001 - Intro
- Learn Zig Series (#2) - Hello Zig, Variables and Types
- Learn Zig Series (#3) - Functions and Control Flow
- Learn Zig Series (#4) - Error Handling (Zig's Best Feature)
- Learn Zig Series (#5) - Arrays, Slices, and Strings
- Learn Zig Series (#6) - Structs, Enums, and Tagged Unions
- Learn Zig Series (#7) - Memory Management and Allocators
- Learn Zig Series (#8) - Pointers and Memory Layout
- Learn Zig Series (#9) - Comptime (Zig's Superpower)
- Learn Zig Series (#10) - Project Structure, Modules, and File I/O
- Learn Zig Series (#11) - Mini Project: Building a Step Sequencer
- Learn Zig Series (#12) - Testing and Test-Driven Development
- Learn Zig Series (#13) - Interfaces via Type Erasure
- Learn Zig Series (#14) - Generics with Comptime Parameters
- Learn Zig Series (#15) - The Build System (build.zig)
- Learn Zig Series (#16) - Sentinel-Terminated Types and C Strings
- Learn Zig Series (#17) - Packed Structs and Bit Manipulation
- Learn Zig Series (#18b) - Addendum: Async Returns in Zig 0.16
- Learn Zig Series (#19) - SIMD with @Vector
- Learn Zig Series (#20) - Working with JSON
- Learn Zig Series (#21) - Networking and TCP Sockets
- Learn Zig Series (#22) - Hash Maps and Data Structures
- Learn Zig Series (#23) - Iterators and Lazy Evaluation
- Learn Zig Series (#24) - Logging, Formatting, and Debug Output
- Learn Zig Series (#25) - Mini Project: HTTP Status Checker
- Learn Zig Series (#26) - Writing a Custom Allocator
- Learn Zig Series (#27) - C Interop: Calling C from Zig
- Learn Zig Series (#28) - C Interop: Exposing Zig to C
- Learn Zig Series (#29) - Inline Assembly and Low-Level Control
- Learn Zig Series (#30) - Thread Safety and Atomics
- Learn Zig Series (#31) - Memory-Mapped I/O and Files
- Learn Zig Series (#32) - Compile-Time Reflection with @typeInfo
- Learn Zig Series (#33) - Building a State Machine with Tagged Unions
- Learn Zig Series (#34) - Performance Profiling and Optimization
- Learn Zig Series (#35) - Cross-Compilation and Target Triples
- Learn Zig Series (#36) - Mini Project: CLI Task Runner
- Learn Zig Series (#37) - Markdown to HTML: Tokenizer and Lexer
- Learn Zig Series (#38) - Markdown to HTML: Parser and AST
- Learn Zig Series (#39) - Markdown to HTML: Renderer and CLI
- Learn Zig Series (#40) - Key-Value Store: In-Memory Store
- Learn Zig Series (#41) - Key-Value Store: Write-Ahead Log
- Learn Zig Series (#42) - Key-Value Store: TCP Server
- Learn Zig Series (#43) - Key-Value Store: Client Library and Benchmarks
- Learn Zig Series (#44) - Image Tool: Reading and Writing PPM/BMP
- Learn Zig Series (#45) - Image Tool: Pixel Operations
- Learn Zig Series (#46) - Image Tool: CLI Pipeline
- Learn Zig Series (#47) - Build a Shell: Parsing Commands
- Learn Zig Series (#48) - Build a Shell: Process Spawning
- Learn Zig Series (#49) - Build a Shell: Built-in Commands
- Learn Zig Series (#50) - Build a Shell: Job Control and Signals
- Learn Zig Series (#51) - HTTP Server: Accept Loop and Parsing
- Learn Zig Series (#52) - HTTP Server: Router and Responses
- Learn Zig Series (#53) - HTTP Server: Static Files and MIME
- Learn Zig Series (#54) - HTTP Server: Middleware and Logging
- Learn Zig Series (#55) - ECS Game Engine: Architecture
- Learn Zig Series (#56) - ECS Game Engine: Component Storage
- Learn Zig Series (#57) - ECS Game Engine: Systems and Queries
- Learn Zig Series (#58) - ECS Game Engine: Terminal Rendering
- Learn Zig Series (#59) - Assembler: Instruction Encoding
- Learn Zig Series (#60) - Assembler: Two-Pass Assembly
- Learn Zig Series (#61) - Assembler: Disassembler and Binary Inspector
- Learn Zig Series (#62) - File Systems: Reading Directories and Metadata
- Learn Zig Series (#63) - File Watching: Detecting Changes
- Learn Zig Series (#64) - Process Management: Fork, Exec, Wait
- Learn Zig Series (#65) - Pipes and Inter-Process Communication
- Learn Zig Series (#66) - Shared Memory and Semaphores
- Learn Zig Series (#67) - Signal Handling Deep Dive
- Learn Zig Series (#68) - Unix Domain Sockets
- Learn Zig Series (#69) - Daemonization: Background Services
- Learn Zig Series (#70) - Timers and Scheduling
- Learn Zig Series (#71) - Resource Limits and Capabilities (this post)
Solutions to Episode 70 Exercises
Exercise 1: Timerfd-based interval scheduler with jitter tracking
const std = @import("std");
const posix = std.posix;
const linux = std.os.linux;
const TaskInfo = struct {
name: []const u8,
interval_ms: u64,
tfd: posix.fd_t,
expected_fire_ns: u64,
fire_count: u32,
max_jitter_us: u64,
total_jitter_us: u64,
};
pub fn main() !void {
const stdout = std.io.getStdOut().writer();
const base = try std.time.Instant.now();
var tasks = [_]TaskInfo{
.{ .name = "heartbeat", .interval_ms = 200, .tfd = -1, .expected_fire_ns = 0, .fire_count = 0, .max_jitter_us = 0, .total_jitter_us = 0 },
.{ .name = "cleanup", .interval_ms = 1000, .tfd = -1, .expected_fire_ns = 0, .fire_count = 0, .max_jitter_us = 0, .total_jitter_us = 0 },
.{ .name = "report", .interval_ms = 3000, .tfd = -1, .expected_fire_ns = 0, .fire_count = 0, .max_jitter_us = 0, .total_jitter_us = 0 },
};
// create timerfds
for (&tasks) |*t| {
const result = linux.timerfd_create(linux.CLOCK.MONOTONIC, .{});
const signed: isize = @bitCast(@as(usize, result));
if (signed < 0) return error.TimerfdCreateFailed;
t.tfd = @intCast(result);
const ns: u64 = t.interval_ms * 1_000_000;
const sec_part: isize = @intCast(ns / 1_000_000_000);
const nsec_part: isize = @intCast(ns % 1_000_000_000);
const spec = linux.itimerspec{
.it_interval = .{ .sec = sec_part, .nsec = nsec_part },
.it_value = .{ .sec = sec_part, .nsec = nsec_part },
};
_ = linux.timerfd_settime(@intCast(t.tfd), .{}, &spec, null);
t.expected_fire_ns = ns;
}
try stdout.print("Timerfd scheduler running for 6 seconds...\n\n", .{});
const deadline_ns: u64 = 6_000_000_000;
while (true) {
const elapsed = (try std.time.Instant.now()).since(base);
if (elapsed >= deadline_ns) break;
var pollfds: [3]linux.pollfd = undefined;
for (&tasks, 0..) |*t, i| {
pollfds[i] = .{ .fd = t.tfd, .events = linux.POLL.IN, .revents = 0 };
}
_ = linux.poll(&pollfds, 3, 100);
for (&tasks, 0..) |*t, i| {
if (pollfds[i].revents & linux.POLL.IN != 0) {
var exp: u64 = 0;
_ = posix.read(t.tfd, std.mem.asBytes(&exp)) catch continue;
const now_ns = (try std.time.Instant.now()).since(base);
t.fire_count += 1;
const expected_at = t.expected_fire_ns * t.fire_count;
const jitter_ns = if (now_ns > expected_at) now_ns - expected_at else expected_at - now_ns;
const jitter_us = jitter_ns / 1000;
t.total_jitter_us += jitter_us;
if (jitter_us > t.max_jitter_us) t.max_jitter_us = jitter_us;
}
}
}
try stdout.print("Results:\n", .{});
for (tasks) |t| {
const avg = if (t.fire_count > 0) t.total_jitter_us / t.fire_count else 0;
try stdout.print(" {s}: fired {d}x, max jitter {d}us, avg jitter {d}us\n", .{
t.name, t.fire_count, t.max_jitter_us, avg,
});
posix.close(t.tfd);
}
}
Each task gets its own timerfd set to its specific interval. A single poll() multiplexes all three, and when any fires we calculate the jitter as the difference between actual and expected elapsed time. The kernel-managed timerfd typically shows sub-millisecond jitter vs the sleep-based approach which accumulates drift from processing time.
Exercise 2: Extended cron parser with commas and ranges
const std = @import("std");
const CronField = struct {
bits: u64,
fn matchesValue(self: CronField, val: u6) bool {
return (self.bits >> val) & 1 == 1;
}
};
fn parseField(field: []const u8, min: u6, max: u6) !CronField {
var bits: u64 = 0;
// handle comma-separated parts
var comma_iter = std.mem.splitScalar(u8, field, ',');
while (comma_iter.next()) |part| {
if (part.len == 0) continue;
if (std.mem.eql(u8, part, "*")) {
var v = min;
while (v <= max) : (v += 1) { bits |= @as(u64, 1) << v; if (v == max) break; }
} else if (std.mem.startsWith(u8, part, "*/")) {
const step = try std.fmt.parseInt(u6, part[2..], 10);
if (step == 0) return error.InvalidStep;
var v: u7 = min;
while (v <= max) { bits |= @as(u64, 1) << @as(u6, @intCast(v)); v += step; }
} else if (std.mem.indexOf(u8, part, "-")) |dash_pos| {
const lo = try std.fmt.parseInt(u6, part[0..dash_pos], 10);
const hi = try std.fmt.parseInt(u6, part[dash_pos + 1 ..], 10);
if (lo < min or hi > max or lo > hi) return error.OutOfRange;
var v = lo;
while (v <= hi) : (v += 1) { bits |= @as(u64, 1) << v; if (v == hi) break; }
} else {
const val = try std.fmt.parseInt(u6, part, 10);
if (val < min or val > max) return error.OutOfRange;
bits |= @as(u64, 1) << val;
}
}
return CronField{ .bits = bits };
}
test "comma separated" {
const f = try parseField("1,15,30", 0, 59);
try std.testing.expect(f.matchesValue(1));
try std.testing.expect(f.matchesValue(15));
try std.testing.expect(f.matchesValue(30));
try std.testing.expect(!f.matchesValue(0));
try std.testing.expect(!f.matchesValue(29));
}
test "range" {
const f = try parseField("9-17", 0, 23);
try std.testing.expect(f.matchesValue(9));
try std.testing.expect(f.matchesValue(13));
try std.testing.expect(f.matchesValue(17));
try std.testing.expect(!f.matchesValue(8));
try std.testing.expect(!f.matchesValue(18));
}
test "last minute of year on sunday" {
const min_f = try parseField("59", 0, 59);
const hr_f = try parseField("23", 0, 23);
const md_f = try parseField("31", 1, 31);
const mo_f = try parseField("12", 1, 12);
const wd_f = try parseField("0", 0, 6);
try std.testing.expect(min_f.matchesValue(59));
try std.testing.expect(hr_f.matchesValue(23));
try std.testing.expect(md_f.matchesValue(31));
try std.testing.expect(mo_f.matchesValue(12));
try std.testing.expect(wd_f.matchesValue(0));
}
test "every minute" {
const f = try parseField("*/1", 0, 59);
for (0..60) |i| {
try std.testing.expect(f.matchesValue(@intCast(i)));
}
}
test "combined weekday range" {
const f = try parseField("1-5", 0, 6);
try std.testing.expect(!f.matchesValue(0)); // Sunday
try std.testing.expect(f.matchesValue(1)); // Monday
try std.testing.expect(f.matchesValue(5)); // Friday
try std.testing.expect(!f.matchesValue(6)); // Saturday
}
The key addition is splitting on commas first, then handling each sub-expression (wildcard, step, range, or single value) independently. The bits OR together so 1,15,30 sets exactly those three bits.
Exercise 3: Hierarchical two-level timer wheel
const std = @import("std");
const FINE_SLOTS = 256;
const COARSE_SLOTS = 64;
const FINE_RES_MS = 10;
const COARSE_RES_MS = FINE_SLOTS * FINE_RES_MS; // 2560ms
const WheelEntry = struct {
id: u32,
target_tick: u64, // absolute fine-tick when this should fire
};
const HierarchicalWheel = struct {
fine: [FINE_SLOTS]std.BoundedArray(WheelEntry, 16),
coarse: [COARSE_SLOTS]std.BoundedArray(WheelEntry, 16),
current_fine: u32,
current_coarse: u32,
absolute_tick: u64,
fired: std.BoundedArray(struct { id: u32, actual_tick: u64 }, 128),
fn init() HierarchicalWheel {
var hw: HierarchicalWheel = undefined;
for (&hw.fine) |*s| s.* = .{};
for (&hw.coarse) |*s| s.* = .{};
hw.current_fine = 0;
hw.current_coarse = 0;
hw.absolute_tick = 0;
hw.fired = .{};
return hw;
}
fn insert(self: *HierarchicalWheel, id: u32, delay_ticks: u64) void {
const target = self.absolute_tick + delay_ticks;
if (delay_ticks < FINE_SLOTS) {
const slot = (self.current_fine + @as(u32, @intCast(delay_ticks))) % FINE_SLOTS;
self.fine[slot].append(.{ .id = id, .target_tick = target }) catch {};
} else {
const coarse_delay = delay_ticks / FINE_SLOTS;
const slot = (self.current_coarse + @as(u32, @intCast(coarse_delay))) % COARSE_SLOTS;
self.coarse[slot].append(.{ .id = id, .target_tick = target }) catch {};
}
}
fn tick(self: *HierarchicalWheel) void {
// fire all entries in current fine slot
const fslot = &self.fine[self.current_fine];
for (fslot.slice()) |entry| {
self.fired.append(.{ .id = entry.id, .actual_tick = self.absolute_tick }) catch {};
}
fslot.resize(0) catch {};
self.current_fine = (self.current_fine + 1) % FINE_SLOTS;
self.absolute_tick += 1;
// if fine wheel completed a revolution, cascade from coarse
if (self.current_fine == 0) {
self.current_coarse = (self.current_coarse + 1) % COARSE_SLOTS;
const cslot = &self.coarse[self.current_coarse];
for (cslot.slice()) |entry| {
const remaining = if (entry.target_tick > self.absolute_tick)
entry.target_tick - self.absolute_tick
else
0;
const fine_slot = (self.current_fine + @as(u32, @intCast(remaining))) % FINE_SLOTS;
self.fine[fine_slot].append(entry) catch {};
}
cslot.resize(0) catch {};
}
}
};
pub fn main() !void {
const stdout = std.io.getStdOut().writer();
var wheel = HierarchicalWheel.init();
var prng = std.Random.DefaultPrng.init(42);
const rand = prng.random();
var targets: [100]u64 = undefined;
for (&targets, 0..) |*t, i| {
const delay = rand.intRangeAtMost(u64, 1, 6000); // 10ms to 60s in ticks
t.* = delay;
wheel.insert(@intCast(i), delay);
}
// run for enough ticks
for (0..6100) |_| {
wheel.tick();
}
var max_dev: u64 = 0;
for (wheel.fired.slice()) |f| {
const expected = targets[f.id];
const dev = if (f.actual_tick > expected) f.actual_tick - expected else expected - f.actual_tick;
if (dev > max_dev) max_dev = dev;
}
try stdout.print("Fired {d}/100 timers. Max deviation: {d} ticks ({d}ms)\n", .{
wheel.fired.len, max_dev, max_dev * FINE_RES_MS,
});
}
The cascade happens exactly once per fine-wheel revolution: when current_fine wraps to 0, we advance the coarse pointer and redistribute all entries in that coarse slot into the fine wheel based on their remaining delay. This gives O(1) per tick for the common case and amortized O(1) for cascading.
Every daemon we've built so far -- the daemonized services from episode 69, the schedulers from episode 70 -- they all run with whatever resources the OS happens to give them. Open a million file descriptors? Sure, until you hit the default per-process limit and get a mysterious error.ProcessFdQuotaExceeded. Allocate 64 GB of memory on a 16 GB machine? The OOM killer shows up uninvited and shoots your process in the head. Burn 100% CPU forever? Nobody stops you, but your coworkers' processes starve.
"But who actually controls these things?"
The kernel does, and it gives you precise knobs to control them. Resource limits (rlimits) let you cap what a process can consume. Linux capabilities let you grant specific privileges without handing out full root. And namespaces and cgroups let you isolate processes from each other entirely -- which is, as you might have guessed, exactly how Docker works under the hood ;-)
Getrlimit/setrlimit: controlling process resource usage
Every process on Linux has a set of resource limits -- soft limits and hard limits. The soft limit is the actual enforced ceiling. The hard limit is the maximum the soft limit can be raised to. A regular user can lower their hard limit (permanently!) and can raise their soft limit up to the hard limit. Only root (or specifically, CAP_SYS_RESOURCE) can raise the hard limit.
The syscalls are getrlimit and setrlimit, and in Zig we access them through std.os.linux:
const std = @import("std");
const linux = std.os.linux;
const Resource = enum(u32) {
NOFILE = 7, // max open file descriptors
AS = 9, // max virtual memory (address space) in bytes
CPU = 0, // max CPU time in seconds
CORE = 4, // max core dump file size in bytes
NPROC = 6, // max number of processes for this uid
FSIZE = 1, // max file size that can be created
STACK = 3, // max stack size
DATA = 2, // max data segment size
};
const Rlimit = extern struct {
cur: u64, // soft limit
max: u64, // hard limit
};
fn getrlimit(resource: Resource) !Rlimit {
var rlim: Rlimit = undefined;
const result = linux.syscall2(.getrlimit, @intFromEnum(resource), @intFromPtr(&rlim));
const signed: isize = @bitCast(result);
if (signed < 0) return error.GetrlimitFailed;
return rlim;
}
fn setrlimit(resource: Resource, rlim: Rlimit) !void {
const result = linux.syscall2(.setrlimit, @intFromEnum(resource), @intFromPtr(&rlim));
const signed: isize = @bitCast(result);
if (signed < 0) return error.SetrlimitFailed;
}
fn formatLimit(val: u64) [20]u8 {
var buf: [20]u8 = [_]u8{' '} ** 20;
if (val == std.math.maxInt(u64)) {
@memcpy(buf[0..9], "unlimited");
} else {
_ = std.fmt.bufPrint(&buf, "{d}", .{val}) catch {};
}
return buf;
}
pub fn main() !void {
const stdout = std.io.getStdOut().writer();
try stdout.print("Current resource limits:\n", .{});
try stdout.print("{s:<12} {s:<20} {s:<20}\n", .{ "Resource", "Soft", "Hard" });
try stdout.print("{s}\n", .{"-" ** 52});
const resources = [_]struct { r: Resource, name: []const u8 }{
.{ .r = .NOFILE, .name = "Open files" },
.{ .r = .AS, .name = "Virt memory" },
.{ .r = .CPU, .name = "CPU time(s)" },
.{ .r = .CORE, .name = "Core dump" },
.{ .r = .NPROC, .name = "Max procs" },
.{ .r = .FSIZE, .name = "File size" },
.{ .r = .STACK, .name = "Stack size" },
};
for (resources) |item| {
const rlim = try getrlimit(item.r);
const soft_str = formatLimit(rlim.cur);
const hard_str = formatLimit(rlim.max);
try stdout.print("{s:<12} {s:<20} {s:<20}\n", .{ item.name, &soft_str, &hard_str });
}
}
On a typical system you'll see NOFILE (open files) at 1024 soft / 1048576 hard, AS (virtual memory) at unlimited, and NPROC (processes) at something like 63000. The difference between soft and hard is the key design decision -- programs can self-limit by lowering their soft limit, and the hard limit acts as a backstop that prevents escalation.
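The soft/hard rules are easy to get backwards, so here is a minimal sketch that encodes them as a pure function. It only models the rules described above and does not call the kernel; `isAllowed`, `raisedToHard` and the `privileged` flag (standing in for holding CAP_SYS_RESOURCE) are hypothetical names, and the numbers in `main` are made up.

```zig
const std = @import("std");

// Mirrors the struct passed to getrlimit/setrlimit earlier.
const Rlimit = extern struct { cur: u64, max: u64 };

/// Hypothetical helper: would a setrlimit(requested) call be accepted for a
/// process whose limits are currently `current`? `privileged` stands in for
/// holding CAP_SYS_RESOURCE.
fn isAllowed(current: Rlimit, requested: Rlimit, privileged: bool) bool {
    // The soft limit may never exceed the hard limit.
    if (requested.cur > requested.max) return false;
    // Raising the hard limit needs privilege; lowering it is always allowed.
    if (requested.max > current.max and !privileged) return false;
    return true;
}

/// Common startup idiom: lift the soft limit all the way up to the hard limit.
fn raisedToHard(current: Rlimit) Rlimit {
    return .{ .cur = current.max, .max = current.max };
}

pub fn main() void {
    const current = Rlimit{ .cur = 1024, .max = 4096 }; // made-up example values
    const wanted = raisedToHard(current);
    std.debug.print("raise soft limit to {d}: {s}\n", .{
        wanted.cur,
        if (isAllowed(current, wanted, false)) "allowed" else "refused",
    });
}
```

Because "unlimited" is simply the largest u64 value, the plain comparisons cover that case without special handling.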
Common limits: open files, memory, CPU time, core dump size
Let's actually SET some limits and see what happens when a process hits them. This is how you'd harden a service -- restrict what it can consume so a bug or attack can't take down the whole machine:
const std = @import("std");
const linux = std.os.linux;
const posix = std.posix;
const Rlimit = extern struct { cur: u64, max: u64 };
const RLIMIT_NOFILE: u32 = 7;
const RLIMIT_AS: u32 = 9;
const RLIMIT_FSIZE: u32 = 1;
fn setrlimit(resource: u32, rlim: Rlimit) !void {
const result = linux.syscall2(.setrlimit, resource, @intFromPtr(&rlim));
const signed: isize = @bitCast(result);
if (signed < 0) return error.SetrlimitFailed;
}
fn testFileDescriptorLimit() !void {
const stdout = std.io.getStdOut().writer();
try stdout.print("\n--- Test: RLIMIT_NOFILE (max open files) ---\n", .{});
// limit to 20 file descriptors (stdin/stdout/stderr take 3)
try setrlimit(RLIMIT_NOFILE, .{ .cur = 20, .max = 20 });
try stdout.print("Set NOFILE limit to 20\n", .{});
var opened: u32 = 0;
var fds: [25]posix.fd_t = undefined;
for (0..25) |i| {
const result = posix.open("/dev/null", .{ .ACCMODE = .RDONLY }, 0) catch |err| {
try stdout.print(" open() failed at fd #{d}: {s}\n", .{ i, @errorName(err) });
break;
};
fds[opened] = result;
opened += 1;
}
try stdout.print("Successfully opened {d} fds before hitting limit\n", .{opened});
// clean up
for (fds[0..opened]) |fd| posix.close(fd);
}
fn testFileSizeLimit() !void {
const stdout = std.io.getStdOut().writer();
try stdout.print("\n--- Test: RLIMIT_FSIZE (max file size) ---\n", .{});
// limit files to 1024 bytes
try setrlimit(RLIMIT_FSIZE, .{ .cur = 1024, .max = 1024 });
try stdout.print("Set FSIZE limit to 1024 bytes\n", .{});
const path = "/tmp/zig_rlimit_test.dat";
var file = std.fs.cwd().createFile(path, .{}) catch return;
defer {
file.close();
std.fs.cwd().deleteFile(path) catch {};
}
// try writing 2048 bytes
const data = [_]u8{'A'} ** 512;
var total_written: u64 = 0;
for (0..4) |_| {
const n = file.write(&data) catch |err| {
try stdout.print(" write() failed after {d} bytes: {s}\n", .{ total_written, @errorName(err) });
break;
};
total_written += n;
}
try stdout.print("Wrote {d} bytes total (limit was 1024)\n", .{total_written});
}
pub fn main() !void {
const stdout = std.io.getStdOut().writer();
try stdout.print("Resource limit enforcement demo\n", .{});
// run tests in child processes so limits don't affect us
const pid1 = try posix.fork();
if (pid1 == 0) {
testFileDescriptorLimit() catch {};
posix.exit(0);
}
_ = posix.waitpid(pid1, 0);
const pid2 = try posix.fork();
if (pid2 == 0) {
testFileSizeLimit() catch {};
posix.exit(0);
}
_ = posix.waitpid(pid2, 0);
try stdout.print("\nAll tests completed in child processes.\n", .{});
}
Notice we run each test in a child process (via fork()). That's because setrlimit applies to the calling process permanently -- once you lower your hard limit, you can't raise it back. By forking first, the parent keeps its original limits. This is the standard pattern for any process that needs to impose limits: fork, set limits in the child, then exec the target program.
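When you hit RLIMIT_NOFILE, open() returns EMFILE. When you hit RLIMIT_FSIZE, the kernel sends SIGXFSZ, whose default action terminates the process; write() only returns EFBIG if that signal is ignored or handled, so the file-size test above has to ignore SIGXFSZ before its error branch can run. When CPU time expires, the kernel sends SIGXCPU (soft limit) and then SIGKILL (hard limit). Each limit has its own enforcement mechanism -- some are signals, some are error returns, some are silent caps.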
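Since several limits are enforced with signals, the parent that forked a limited child can often tell which limit was hit just by looking at the wait status. The sketch below classifies a raw status word using the W helpers from std.os.linux. `Outcome` and `classify` are hypothetical names, the signal numbers are the x86-64/arm64 values (an assumption, other architectures may differ), and `main` feeds it hand-built status values rather than real children.

```zig
const std = @import("std");
const W = std.os.linux.W;

// Signal numbers as found on x86-64 and arm64 Linux (assumption: other
// architectures may number them differently).
const SIGXCPU: u32 = 24;
const SIGXFSZ: u32 = 25;

const Outcome = union(enum) {
    exited: u8, // normal exit, payload is the exit code
    cpu_limit, // killed by SIGXCPU (soft RLIMIT_CPU)
    fsize_limit, // killed by SIGXFSZ (RLIMIT_FSIZE)
    other_signal: u32, // any other fatal signal, e.g. SIGKILL
    unknown,
};

/// Hypothetical helper: interpret the status word returned by waitpid.
fn classify(status: u32) Outcome {
    if (W.IFEXITED(status)) return .{ .exited = W.EXITSTATUS(status) };
    if (W.IFSIGNALED(status)) {
        const sig = W.TERMSIG(status);
        if (sig == SIGXCPU) return .cpu_limit;
        if (sig == SIGXFSZ) return .fsize_limit;
        return .{ .other_signal = sig };
    }
    return .unknown;
}

pub fn main() void {
    // Hand-built sample statuses instead of real children, to keep this short.
    // 0x80 is the "core dumped" flag that can accompany a fatal signal.
    const samples = [_]u32{ 0, 3 << 8, SIGXFSZ, SIGXCPU | 0x80, 9 };
    for (samples) |s| {
        switch (classify(s)) {
            .exited => |code| std.debug.print("status {d}: exited with code {d}\n", .{ s, code }),
            .cpu_limit => std.debug.print("status {d}: killed by SIGXCPU\n", .{s}),
            .fsize_limit => std.debug.print("status {d}: killed by SIGXFSZ\n", .{s}),
            .other_signal => |sig| std.debug.print("status {d}: killed by signal {d}\n", .{ s, sig }),
            .unknown => std.debug.print("status {d}: stopped or unknown\n", .{s}),
        }
    }
}
```

In the demo program above you would pass it the `status` field of the value returned by posix.waitpid instead of discarding that value.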
Linux capabilities: fine-grained privilege control
Traditional Unix has a binary privilege model: you're root (uid 0) or you're not. Root can do everything. Everyone else is restricted. This is terrible for security because programs that need ONE privileged operation (like binding port 80) get ALL privileges.
Linux capabilities break root's power into ~40 individual capabilities. A process can have exactly the capabilities it needs and nothing more. The important ones for systems programming:
- CAP_NET_BIND_SERVICE -- bind to ports below 1024
- CAP_SYS_RESOURCE -- raise hard rlimits, override disk quota
- CAP_SETUID / CAP_SETGID -- change process uid/gid
- CAP_DAC_OVERRIDE -- bypass file permission checks
- CAP_NET_RAW -- use raw sockets (ping, packet capture)
- CAP_SYS_PTRACE -- trace other processes
- CAP_SYS_ADMIN -- the "catch-all" capability (mount, sethostname, etc.)
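Each thread carries several capability sets. The three classic ones are permitted (what you're allowed to have), effective (what's currently active), and inheritable (what can be carried across an execve); the kernel also tracks a bounding set and an ambient set. The permitted, effective and inheritable sets are changed with the capset syscall, while the bounding set is read and reduced through the prctl syscall, which is what the next example uses: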
const std = @import("std");
const linux = std.os.linux;
// capability constants
const CAP_NET_BIND_SERVICE: u32 = 10;
const CAP_SYS_RESOURCE: u32 = 24;
const CAP_SETUID: u32 = 7;
const CAP_SETGID: u32 = 6;
// prctl operations
const PR_CAPBSET_READ: u32 = 23;
const PR_CAPBSET_DROP: u32 = 24;
const PR_SET_KEEPCAPS: u32 = 8;
const PR_GET_KEEPCAPS: u32 = 7;
fn prctl(option: u32, arg2: u64, arg3: u64, arg4: u64, arg5: u64) isize {
const result = linux.syscall5(.prctl, option, arg2, arg3, arg4, arg5);
return @bitCast(result);
}
fn hasBoundingCap(cap: u32) bool {
return prctl(PR_CAPBSET_READ, cap, 0, 0, 0) == 1;
}
pub fn main() !void {
const stdout = std.io.getStdOut().writer();
const caps = [_]struct { id: u32, name: []const u8 }{
.{ .id = 0, .name = "CAP_CHOWN" },
.{ .id = 1, .name = "CAP_DAC_OVERRIDE" },
.{ .id = 6, .name = "CAP_SETGID" },
.{ .id = 7, .name = "CAP_SETUID" },
.{ .id = 10, .name = "CAP_NET_BIND_SERVICE" },
.{ .id = 13, .name = "CAP_NET_RAW" }, // 12 is CAP_NET_ADMIN
.{ .id = 21, .name = "CAP_SYS_ADMIN" },
.{ .id = 24, .name = "CAP_SYS_RESOURCE" },
.{ .id = 25, .name = "CAP_SYS_TIME" },
.{ .id = 38, .name = "CAP_PERFMON" },
};
try stdout.print("Bounding set capabilities:\n", .{});
for (caps) |cap| {
const has = hasBoundingCap(cap.id);
try stdout.print(" {s:<25} {s}\n", .{ cap.name, if (has) "YES" else "NO" });
}
// demonstrate PR_SET_KEEPCAPS (preserves capabilities across setuid)
const keepcaps = prctl(PR_GET_KEEPCAPS, 0, 0, 0, 0);
try stdout.print("\nPR_SET_KEEPCAPS: {s}\n", .{
if (keepcaps == 1) "enabled" else "disabled",
});
try stdout.print("\nNote: To actually use capabilities, the binary must be\n", .{});
try stdout.print("granted them via setcap(8) or inherited from a capable parent.\n", .{});
try stdout.print("Example: sudo setcap cap_net_bind_service=+ep ./my_server\n", .{});
}
The bounding set is the upper bound on the capabilities a process can ever acquire. Even if a binary has file capabilities set, the kernel intersects them with the bounding set. Dropping a capability from the bounding set (via PR_CAPBSET_DROP, which itself requires CAP_SETPCAP) permanently removes it -- it can't be re-acquired even through exec'ing a setuid binary. This is how container runtimes restrict what processes inside the container can ever do.
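The kernel also exposes the capability sets as hex bitmasks in /proc/self/status (the CapInh, CapPrm, CapEff, CapBnd and CapAmb lines), where bit N stands for capability number N. A minimal sketch for reading one of them follows; `parseCapLine` and `maskHas` are hypothetical helper names, and the 4 KiB buffer is an assumption that is normally large enough for that file.

```zig
const std = @import("std");

/// Hypothetical helper: find a line such as "CapBnd:\t000001ffffffffff" in the
/// text of /proc/<pid>/status and return the hex mask as a u64.
fn parseCapLine(status: []const u8, field: []const u8) ?u64 {
    var lines = std.mem.splitScalar(u8, status, '\n');
    while (lines.next()) |line| {
        if (!std.mem.startsWith(u8, line, field)) continue;
        const rest = line[field.len..];
        if (rest.len == 0 or rest[0] != ':') continue;
        const hex = std.mem.trim(u8, rest[1..], " \t\r");
        return std.fmt.parseInt(u64, hex, 16) catch null;
    }
    return null;
}

/// Bit N of the mask corresponds to capability number N.
fn maskHas(mask: u64, cap: u6) bool {
    return (mask >> cap) & 1 == 1;
}

pub fn main() !void {
    var buf: [4096]u8 = undefined;
    const file = try std.fs.openFileAbsolute("/proc/self/status", .{});
    defer file.close();
    const n = try file.readAll(&buf);

    const bnd = parseCapLine(buf[0..n], "CapBnd") orelse return error.FieldNotFound;
    const eff = parseCapLine(buf[0..n], "CapEff") orelse return error.FieldNotFound;
    // 10 is CAP_NET_BIND_SERVICE, as in the table above.
    std.debug.print("bounding  has CAP_NET_BIND_SERVICE: {s}\n", .{if (maskHas(bnd, 10)) "yes" else "no"});
    std.debug.print("effective has CAP_NET_BIND_SERVICE: {s}\n", .{if (maskHas(eff, 10)) "yes" else "no"});
}
```

For an ordinary unprivileged process you would expect the bounding mask to be mostly ones and the effective mask to be zero, which is a quick way to see the difference between "could ever gain" and "holds right now".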
Dropping privileges: running as non-root after binding port 80
Here's the classic scenario: your web server needs to bind port 80 (requires CAP_NET_BIND_SERVICE or root), but once the socket is bound, it should run as an unprivileged user for safety. The sequence is: start as root, bind the socket, then drop to a regular user. If an attacker exploits a vulnerability after the drop, they only get unprivileged access:
const std = @import("std");
const linux = std.os.linux;
const posix = std.posix;
const PR_SET_KEEPCAPS: u32 = 8;
fn prctl(option: u32, arg2: u64) isize {
const result = linux.syscall5(.prctl, option, arg2, 0, 0, 0);
return @bitCast(result);
}
pub fn main() !void {
const stdout = std.io.getStdOut().writer();
const uid = linux.getuid();
try stdout.print("Privilege drop demonstration\n", .{});
try stdout.print("Current UID: {d}\n", .{uid});
if (uid != 0) {
try stdout.print("\nNot running as root -- simulating the pattern instead.\n", .{});
try stdout.print("In production, run with: sudo ./priv_drop\n\n", .{});
// demonstrate the PATTERN without actually needing root
try stdout.print("The privilege drop sequence:\n", .{});
try stdout.print(" 1. Start as root (uid=0)\n", .{});
try stdout.print(" 2. Create and bind socket to port 80\n", .{});
try stdout.print(" 3. PR_SET_KEEPCAPS=1 (preserve caps across setuid)\n", .{});
try stdout.print(" 4. setgroups() -- drop supplementary groups\n", .{});
try stdout.print(" 5. setgid(nobody) -- change group first\n", .{});
try stdout.print(" 6. setuid(nobody) -- change user (point of no return)\n", .{});
try stdout.print(" 7. Drop all capabilities except what's needed\n", .{});
try stdout.print(" 8. Now running as nobody with a bound port-80 socket\n", .{});
// simulate with a non-privileged port
try stdout.print("\nBinding to port 8080 (no root needed) as demonstration:\n", .{});
const sock = try posix.socket(posix.AF.INET, posix.SOCK.STREAM, 0);
defer posix.close(sock);
const addr = std.net.Address.initIp4([4]u8{ 127, 0, 0, 1 }, 8080);
posix.bind(sock, &addr.any, addr.getOsSockLen()) catch |err| {
try stdout.print(" bind failed: {s} (port likely in use)\n", .{@errorName(err)});
return;
};
try posix.listen(sock, 5);
try stdout.print(" Bound and listening on 127.0.0.1:8080\n", .{});
try stdout.print(" Socket fd {d} survives privilege drop -- fd stays valid.\n", .{sock});
return;
}
// actual root path (only runs if started as root)
try stdout.print("Running as root -- performing actual privilege drop.\n", .{});
// step 1: bind privileged port
const sock = try posix.socket(posix.AF.INET, posix.SOCK.STREAM, 0);
const addr = std.net.Address.initIp4([4]u8{ 0, 0, 0, 0 }, 80);
try posix.bind(sock, &addr.any, addr.getOsSockLen());
try posix.listen(sock, 128);
try stdout.print("Bound to port 80 as root\n", .{});
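// step 2: keep the permitted capability set across the uid change
// (a real daemon must then trim it with capset; this demo does not)
if (prctl(PR_SET_KEEPCAPS, 1) < 0) return error.PrctlFailed;
// step 3: drop to nobody (uid/gid 65534 on most systems)
const nobody_gid = 65534;
const nobody_uid = 65534;
// supplementary groups first, then gid, then uid; never ignore a failure here
const rc_groups: isize = @bitCast(linux.syscall2(.setgroups, 0, 0));
if (rc_groups < 0) return error.SetgroupsFailed;
const rc_gid: isize = @bitCast(linux.syscall1(.setgid, nobody_gid));
if (rc_gid < 0) return error.SetgidFailed;
const rc_uid: isize = @bitCast(linux.syscall1(.setuid, nobody_uid));
if (rc_uid < 0) return error.SetuidFailed;
try stdout.print("Dropped to uid={d} gid={d}\n", .{ linux.getuid(), linux.getgid() });
try stdout.print("Socket fd {d} is still valid and listening!\n", .{sock});
// step 4: clear keepcaps (no longer needed)
_ = prctl(PR_SET_KEEPCAPS, 0);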
posix.close(sock);
}
The critical ordering here: setgid BEFORE setuid. If you call setuid first, you lose the privilege to call setgid (because you're no longer root). Also note that PR_SET_KEEPCAPS must be set BEFORE the uid change -- otherwise the kernel clears the permitted, effective and ambient capability sets once the process's user IDs have all changed from 0 to non-zero. Even with the flag set, only the permitted set survives: the effective set is still cleared and has to be raised again with capset. Forgetting the flag is a common mistake that results in a process with zero capabilities and no way to get them back.
Having said that, the socket file descriptor itself doesn't care about privileges. Once bind() succeeds, the fd is just a number. The kernel already associated it with port 80 -- dropping privileges afterwards doesn't un-bind it. This is the whole point of the pattern: acquire resources while privileged, then drop privileges for the rest of the process lifetime.
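A cheap sanity check after the drop is to try to become root again and insist that it fails. The sketch below does that with the same raw-syscall style used above; `rootIsGone` is a hypothetical name, and the code assumes a target such as x86-64 Linux where the plain setuid syscall exists. If the check ever reports that root is still reachable after a drop, the only safe reaction is to abort.

```zig
const std = @import("std");
const linux = std.os.linux;

/// Hypothetical post-drop check: returns true when the process can no longer
/// switch back to uid 0. Meant to be called right after the setuid() step.
fn rootIsGone() bool {
    const rc: isize = @bitCast(linux.syscall1(.setuid, 0));
    return rc < 0; // a negative result (typically -EPERM) means the drop held
}

pub fn main() void {
    if (linux.getuid() == 0) {
        std.debug.print("still root, nothing has been dropped yet\n", .{});
        return;
    }
    if (!rootIsGone()) {
        std.debug.print("BUG: process could regain root, aborting\n", .{});
        std.process.exit(1);
    }
    std.debug.print("uid {d} cannot switch back to root\n", .{linux.getuid()});
}
```

Note that a raw setuid syscall acts on the calling thread only, so a check like this belongs in the single-threaded startup phase, before any worker threads exist.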
Cgroups v2: resource isolation for groups of processes
Resource limits (rlimits) apply to individual processes. But what if you want to limit a GROUP of processes collectively? Say you're running 50 worker processes that each individually stay under their rlimits, but together they're consuming 90% of system memory. That's what cgroups (control groups) solve.
Cgroups v2 uses a filesystem interface mounted at /sys/fs/cgroup/. You create a directory (= a cgroup), write configuration to files in that directory, and then move process PIDs into it. The kernel enforces the limits on all processes in that group collectively:
const std = @import("std");

const CGROUP_BASE = "/sys/fs/cgroup";

fn writeCgroupFile(cgroup: []const u8, filename: []const u8, value: []const u8) !void {
    var path_buf: [512]u8 = undefined;
    const path = std.fmt.bufPrint(&path_buf, "{s}/{s}/{s}", .{ CGROUP_BASE, cgroup, filename }) catch return error.PathTooLong;
    // `const`, not `var`: the handle itself is never mutated
    const file = try std.fs.cwd().openFile(path, .{ .mode = .write_only });
    defer file.close();
    try file.writeAll(value);
}

fn readCgroupFile(cgroup: []const u8, filename: []const u8, buf: []u8) ![]u8 {
    var path_buf: [512]u8 = undefined;
    const path = std.fmt.bufPrint(&path_buf, "{s}/{s}/{s}", .{ CGROUP_BASE, cgroup, filename }) catch return error.PathTooLong;
    const file = try std.fs.cwd().openFile(path, .{});
    defer file.close();
    const n = try file.read(buf);
    return buf[0..n];
}

pub fn main() !void {
    const stdout = std.io.getStdOut().writer();
    try stdout.print("Cgroups v2 resource isolation\n", .{});
    try stdout.print("Base path: {s}\n\n", .{CGROUP_BASE});

    // read current cgroup membership (own buffer: `buf` is reused further down)
    var membership_buf: [512]u8 = undefined;
    const proc_cgroup = std.fs.cwd().openFile("/proc/self/cgroup", .{}) catch {
        try stdout.print("Cannot read /proc/self/cgroup (cgroups not available?)\n", .{});
        return;
    };
    defer proc_cgroup.close();
    const n = try proc_cgroup.read(&membership_buf);
    const membership = std.mem.trim(u8, membership_buf[0..n], "\n");
    try stdout.print("Current cgroup: {s}\n", .{membership});

    // demonstrate reading cgroup controllers
    var buf: [512]u8 = undefined;
    if (readCgroupFile("", "cgroup.controllers", &buf)) |controllers| {
        try stdout.print("Available controllers: {s}\n", .{std.mem.trim(u8, controllers, "\n")});
    } else |_| {
        try stdout.print("Cannot read controllers (need root or delegation)\n", .{});
    }

    // show the cgroup creation pattern (won't execute without root)
    try stdout.print("\nTo create a cgroup with memory limit:\n", .{});
    try stdout.print("  mkdir /sys/fs/cgroup/my_sandbox\n", .{});
    try stdout.print("  echo '+memory +cpu' > /sys/fs/cgroup/cgroup.subtree_control\n", .{});
    try stdout.print("  echo '100M' > /sys/fs/cgroup/my_sandbox/memory.max\n", .{});
    try stdout.print("  echo '50000 100000' > /sys/fs/cgroup/my_sandbox/cpu.max\n", .{});
    try stdout.print("  echo $PID > /sys/fs/cgroup/my_sandbox/cgroup.procs\n", .{});
    // Zig format strings only treat braces specially, so a single % is correct
    try stdout.print("\nThis limits the group to 100MB RAM and 50% of one CPU core.\n", .{});

    // memory.current exists only in non-root cgroups, so read it from our own
    // cgroup. On a pure cgroups v2 system the membership line is "0::/<path>".
    if (std.mem.startsWith(u8, membership, "0::/")) {
        const own = membership["0::/".len..];
        if (readCgroupFile(own, "memory.current", &buf)) |mem| {
            try stdout.print("\nOwn cgroup memory.current: {s}", .{mem});
        } else |_| {}
    }
    if (readCgroupFile("", "cpu.stat", &buf)) |stat| {
        try stdout.print("Root cgroup cpu.stat:\n{s}\n", .{stat});
    } else |_| {}
}
The filesystem interface is elegant -- everything is a plain text file. memory.max takes a byte count, optionally with a K, M, or G suffix (so 100M means 100 MiB), or "max" for unlimited. cpu.max takes two numbers: quota and period in microseconds, so 50000 100000 means "50ms of CPU time per 100ms period" = 50% of one core. cgroup.procs is where you write PIDs to move processes into the group.
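The quota/period arithmetic for cpu.max is easy to get wrong by a factor of ten, so here is a minimal sketch of it. formatCpuMax is a hypothetical helper name (not part of the kernel interface or of Zig's standard library); it only produces the text you would write into the file:

```zig
const std = @import("std");

/// Formats a cpu.max value ("<quota> <period>", both in microseconds)
/// for a given percentage of one core. Hypothetical helper for illustration.
fn formatCpuMax(buf: []u8, percent_of_core: u32, period_us: u32) ![]u8 {
    // quota = period * percent / 100, computed in u64 so it cannot overflow
    const quota: u64 = @as(u64, period_us) * percent_of_core / 100;
    return std.fmt.bufPrint(buf, "{d} {d}", .{ quota, period_us });
}

pub fn main() !void {
    var buf: [64]u8 = undefined;
    const half_core = try formatCpuMax(&buf, 50, 100_000);
    std.debug.print("cpu.max for half a core: {s}\n", .{half_core});
}
```

Percentages above 100 are valid and mean more than one core: 200 with a 100000 period yields a quota of 200000, i.e. two full cores' worth of time per period.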
In production, systemd creates cgroups for every service automatically (that's what MemoryMax= and CPUQuota= in unit files control). Container runtimes like Docker create a cgroup per container. The filesystem interface means you can inspect any container's resource usage just by reading files in /sys/fs/cgroup/.
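Files such as cpu.stat use a flat "key value" layout, one pair per line, so inspecting usage mostly comes down to a small text lookup. The sketch below assumes that layout; statField is a hypothetical helper name and the sample text stands in for real file content:

```zig
const std = @import("std");

/// Looks up one "key value" line in flat-keyed cgroup file content
/// (the layout used by cpu.stat) and parses the value as an integer.
fn statField(content: []const u8, key: []const u8) ?u64 {
    var lines = std.mem.tokenizeScalar(u8, content, '\n');
    while (lines.next()) |line| {
        var parts = std.mem.tokenizeScalar(u8, line, ' ');
        const name = parts.next() orelse continue;
        if (!std.mem.eql(u8, name, key)) continue;
        const value = parts.next() orelse return null;
        return std.fmt.parseInt(u64, value, 10) catch null;
    }
    return null;
}

pub fn main() void {
    // Sample content; real values come from reading <cgroup>/cpu.stat.
    const sample = "usage_usec 1250000\nuser_usec 900000\nsystem_usec 350000\n";
    if (statField(sample, "usage_usec")) |usec| {
        std.debug.print("CPU used: {d} ms\n", .{usec / 1000});
    }
}
```

Returning an optional keeps the caller honest: a missing key and a malformed number both surface as null rather than as a misleading zero.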
Namespaces: the building blocks of containers
While cgroups limit HOW MUCH of a resource processes can use, namespaces control WHAT a process can SEE. Each namespace type creates an isolated view of one system resource:
- PID namespace: process sees its own PID tree (init is PID 1 inside)
- Mount namespace: process has its own filesystem mount table
- Network namespace: process has its own network interfaces, routing, ports
- UTS namespace: process has its own hostname
- User namespace: process has its own uid/gid mapping (can be "root" inside but nobody outside)
- IPC namespace: process has its own System V IPC objects
- Cgroup namespace: process sees its own cgroup tree as root
You create them with the clone() or unshare() syscall:
const std = @import("std");
const linux = std.os.linux;
const posix = std.posix;

// clone flags for namespaces (usize, the type the raw syscall wrappers expect)
const CLONE_NEWUTS: usize = 0x04000000;
const CLONE_NEWPID: usize = 0x20000000;
const CLONE_NEWNS: usize = 0x00020000;
const CLONE_NEWNET: usize = 0x40000000;

pub fn main() !void {
    const stdout = std.io.getStdOut().writer();
    try stdout.print("Linux namespace demonstration\n\n", .{});

    // read our current namespace IDs from /proc
    try stdout.print("Current namespace IDs:\n", .{});
    const ns_types = [_][]const u8{ "pid", "mnt", "net", "uts", "ipc", "user", "cgroup" };
    for (ns_types) |ns| {
        var path_buf: [64]u8 = undefined;
        const path = std.fmt.bufPrint(&path_buf, "/proc/self/ns/{s}", .{ns}) catch continue;
        var link_buf: [128]u8 = undefined;
        const link = posix.readlink(path, &link_buf) catch continue;
        try stdout.print("  {s:<8}: {s}\n", .{ ns, link });
    }

    // demonstrate UTS namespace (hostname isolation) via fork + unshare
    try stdout.print("\nUTS namespace test (requires root for unshare):\n", .{});
    const pid = try posix.fork();
    if (pid == 0) {
        // child: try to unshare UTS namespace
        const result = linux.syscall1(.unshare, CLONE_NEWUTS);
        const signed: isize = @bitCast(result);
        if (signed < 0) {
            std.debug.print("  unshare(CLONE_NEWUTS) failed (need root)\n", .{});
            posix.exit(1);
        }

        // in new UTS namespace -- set hostname
        const new_name = "sandbox";
        const set_result = linux.syscall2(
            .sethostname,
            @intFromPtr(new_name),
            new_name.len,
        );
        const set_signed: isize = @bitCast(set_result);
        if (set_signed == 0) {
            std.debug.print("  Child hostname set to '{s}' (isolated!)\n", .{new_name});
        }

        // verify: uname() fills a utsname struct; nodename holds the hostname
        const uts = posix.uname();
        std.debug.print("  Child sees hostname: {s}\n", .{std.mem.sliceTo(&uts.nodename, 0)});
        std.debug.print("  Child done.\n", .{});
        posix.exit(0);
    }
    _ = posix.waitpid(pid, 0);

    // parent's hostname is unchanged
    var hostname_buf: [256]u8 = undefined;
    const hostname_file = std.fs.cwd().openFile("/proc/sys/kernel/hostname", .{}) catch {
        try stdout.print("  Parent hostname: (could not read)\n", .{});
        return;
    };
    defer hostname_file.close();
    const hn = try hostname_file.read(&hostname_buf);
    try stdout.print("  Parent hostname still: {s}", .{hostname_buf[0..hn]});
}
The namespace IDs in /proc/self/ns/ are inode numbers. When two processes share the same inode number for a namespace type, they're in the same namespace. When the numbers differ, they're isolated from each other for that resource. Docker containers typically get their own inode for most namespace types (PID, mount, network, UTS, IPC); the user namespace is shared with the host unless user-namespace remapping is enabled.
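Comparing two processes therefore reduces to comparing the number inside the brackets of each link target. A minimal sketch, assuming the link text has the "type:[inode]" shape printed by the program above; nsInode and sameNamespace are hypothetical helper names, and the sample inode values are made up for illustration:

```zig
const std = @import("std");

/// Extracts the inode number from a namespace link target such as
/// "uts:[4026531838]". Returns null if the text does not have that shape.
fn nsInode(link: []const u8) ?u64 {
    const open = std.mem.indexOfScalar(u8, link, '[') orelse return null;
    const close = std.mem.lastIndexOfScalar(u8, link, ']') orelse return null;
    if (close <= open) return null;
    return std.fmt.parseInt(u64, link[open + 1 .. close], 10) catch null;
}

/// Two links of the same namespace type refer to the same namespace
/// when they carry the same inode number.
fn sameNamespace(link_a: []const u8, link_b: []const u8) bool {
    const a = nsInode(link_a) orelse return false;
    const b = nsInode(link_b) orelse return false;
    return a == b;
}

pub fn main() void {
    // Sample link targets; real ones come from readlink on /proc/<pid>/ns/uts.
    const host = "uts:[4026531838]";
    const container = "uts:[4026532512]";
    std.debug.print("same UTS namespace: {}\n", .{sameNamespace(host, container)});
}
```

The comparison only makes sense for links of the same type (uts against uts, net against net); the sketch leaves that check to the caller.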
unshare(CLONE_NEWUTS) creates a new UTS namespace for the calling process. Changes to the hostname inside that namespace (via sethostname) are invisible to processes outside. The parent process's hostname remains unchanged. This is how containers get their own hostname without affecting the host.
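The flags argument of unshare is a bitmask, so several namespaces can be requested in one call by OR-ing the CLONE_NEW* values together. The sketch below only builds the mask and does not call unshare; the NsFlag table and combineFlags are hypothetical names, keyed by the same short names that appear under /proc/self/ns:

```zig
const std = @import("std");

const NsFlag = struct { name: []const u8, flag: usize };

// Names follow the entries in /proc/self/ns; values are the kernel's
// CLONE_NEW* constants.
const ns_flags = [_]NsFlag{
    .{ .name = "mnt", .flag = 0x00020000 }, // CLONE_NEWNS
    .{ .name = "cgroup", .flag = 0x02000000 }, // CLONE_NEWCGROUP
    .{ .name = "uts", .flag = 0x04000000 }, // CLONE_NEWUTS
    .{ .name = "ipc", .flag = 0x08000000 }, // CLONE_NEWIPC
    .{ .name = "user", .flag = 0x10000000 }, // CLONE_NEWUSER
    .{ .name = "pid", .flag = 0x20000000 }, // CLONE_NEWPID
    .{ .name = "net", .flag = 0x40000000 }, // CLONE_NEWNET
};

/// ORs together the flags for every requested namespace name.
fn combineFlags(names: []const []const u8) !usize {
    var mask: usize = 0;
    outer: for (names) |name| {
        for (ns_flags) |entry| {
            if (std.mem.eql(u8, entry.name, name)) {
                mask |= entry.flag;
                continue :outer;
            }
        }
        return error.UnknownNamespace;
    }
    return mask;
}

pub fn main() !void {
    const mask = try combineFlags(&.{ "uts", "pid", "mnt" });
    std.debug.print("unshare flags: 0x{x:0>8}\n", .{mask});
}
```

One caveat when passing such a mask to unshare: CLONE_NEWPID does not move the caller itself. Only children created afterwards land in the new PID namespace.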
Chroot and pivot_root: filesystem isolation
The oldest form of filesystem isolation on Unix is chroot -- it changes the process's root directory so it can't see anything above it. Combined with a mount namespace, this is how containers get their own filesystem view:
const std = @import("std");
const linux = std.os.linux;
const posix = std.posix;

pub fn main() !void {
    const stdout = std.io.getStdOut().writer();
    try stdout.print("Filesystem isolation with chroot\n\n", .{});

    const uid = linux.getuid();
    if (uid != 0) {
        try stdout.print("Not running as root -- demonstrating the pattern.\n\n", .{});
        try stdout.print("chroot requires CAP_SYS_CHROOT (effectively root).\n", .{});
        try stdout.print("The sequence for a minimal sandbox:\n\n", .{});
        try stdout.print("  1. Create a minimal root filesystem:\n", .{});
        try stdout.print("     mkdir -p /tmp/sandbox/{{bin,lib,lib64,proc,tmp}}\n", .{});
        try stdout.print("     cp /bin/sh /tmp/sandbox/bin/\n", .{});
        try stdout.print("     cp /lib/x86_64-linux-gnu/libc.so.6 /tmp/sandbox/lib/\n", .{});
        try stdout.print("     cp /lib64/ld-linux-x86-64.so.2 /tmp/sandbox/lib64/\n", .{});
        try stdout.print("\n  2. Fork a child process\n", .{});
        try stdout.print("  3. In the child:\n", .{});
        try stdout.print("     chroot(\"/tmp/sandbox\")\n", .{});
        try stdout.print("     chdir(\"/\")  <- CRITICAL: without this, cwd is outside!\n", .{});
        try stdout.print("     exec(\"/bin/sh\")\n", .{});
        try stdout.print("\n  4. The child now sees /tmp/sandbox as its entire filesystem.\n", .{});
        try stdout.print("     It cannot access /etc/passwd, /home, or anything on the host.\n", .{});
        try stdout.print("\nSecurity note: chroot alone is NOT a security boundary!\n", .{});
        try stdout.print("A root process inside a chroot can escape via:\n", .{});
        try stdout.print("  - Creating a device node and mounting the real root\n", .{});
        try stdout.print("  - Using ptrace on a process outside the chroot\n", .{});
        try stdout.print("  - The classic double-chroot escape\n", .{});
        try stdout.print("\npivot_root (used by containers) is stronger:\n", .{});
        try stdout.print("  - Requires a mount namespace (clone CLONE_NEWNS)\n", .{});
        try stdout.print("  - Actually moves the old root to a subdirectory\n", .{});
        try stdout.print("  - Combined with unmounting old root, truly isolates\n", .{});
        return;
    }

    // actual chroot (only if root)
    const sandbox = "/tmp/zig_chroot_demo";

    // create minimal filesystem
    std.fs.cwd().makePath(sandbox ++ "/tmp") catch {};
    std.fs.cwd().makePath(sandbox ++ "/proc") catch {};

    // write a file inside the sandbox
    {
        const f = try std.fs.cwd().createFile(sandbox ++ "/tmp/hello.txt", .{});
        defer f.close();
        try f.writeAll("Inside the sandbox!\n");
    }

    const pid = try posix.fork();
    if (pid == 0) {
        // child: chroot into sandbox (string literals are null-terminated)
        const result = linux.syscall1(.chroot, @intFromPtr(sandbox));
        const signed: isize = @bitCast(result);
        if (signed < 0) {
            std.debug.print("chroot failed\n", .{});
            posix.exit(1);
        }

        // CRITICAL: change working directory to new root
        _ = linux.syscall1(.chdir, @intFromPtr("/"));

        // verify isolation
        if (std.fs.cwd().openFile("/tmp/hello.txt", .{})) |file| {
            defer file.close();
            var buf: [64]u8 = undefined;
            const n = file.read(&buf) catch 0;
            std.debug.print("Read from sandbox: {s}", .{buf[0..n]});
        } else |_| {
            std.debug.print("Cannot read /tmp/hello.txt\n", .{});
        }

        // try to access host file (should fail)
        if (std.fs.cwd().openFile("/etc/passwd", .{})) |file| {
            file.close();
            std.debug.print("WARNING: /etc/passwd accessible (escape!)\n", .{});
        } else |_| {
            std.debug.print("Good: /etc/passwd not accessible\n", .{});
        }
        posix.exit(0);
    }
    _ = posix.waitpid(pid, 0);

    // cleanup
    std.fs.cwd().deleteTree(sandbox) catch {};
    try stdout.print("Sandbox cleaned up.\n", .{});
}
The critical step people forget: chdir("/") after chroot(). Without it, the process's current working directory is still the old path (outside the chroot), and it can use relative paths like ../../etc/passwd to escape. The chdir("/") resolves to the NEW root after chroot, trapping the process properly.
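One way to make the chdir hard to forget is to wrap both calls in a single function and treat a failure of either as fatal. This is a sketch under assumptions: enterChroot is a hypothetical helper name, it uses the same raw syscalls as the example above, and it reports only that a step failed rather than decoding the errno:

```zig
const std = @import("std");
const linux = std.os.linux;

/// Enters a chroot and immediately moves the working directory inside it.
/// Hypothetical helper: both steps succeed together or the caller gets an error.
fn enterChroot(new_root: [*:0]const u8) !void {
    const chroot_rc: isize = @bitCast(linux.syscall1(.chroot, @intFromPtr(new_root)));
    if (chroot_rc < 0) return error.ChrootFailed;

    // "/" now resolves to the new root, so this pulls cwd inside the jail
    const chdir_rc: isize = @bitCast(linux.syscall1(.chdir, @intFromPtr("/")));
    if (chdir_rc < 0) return error.ChdirFailed;
}

pub fn main() void {
    enterChroot("/tmp/sandbox") catch |err| {
        std.debug.print("could not enter sandbox: {s}\n", .{@errorName(err)});
        return;
    };
    std.debug.print("now confined to the sandbox\n", .{});
}
```

A caller that gets error.ChdirFailed should exit rather than continue, because at that point the root has changed but the working directory may still point outside it.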
Practical example: sandboxing a child process with resource limits
Let's combine the process-level pieces into a practical sandbox. We'll fork a child process and apply several layers of restriction: limits on file descriptors, virtual memory, CPU time, file size, and process count, followed by PR_SET_NO_NEW_PRIVS. A real container runtime adds capability dropping, namespaces, and filesystem isolation on top, but the fork-then-restrict skeleton is the same:
const std = @import("std");
const linux = std.os.linux;
const posix = std.posix;

const Rlimit = extern struct { cur: u64, max: u64 };

const RLIMIT_NOFILE: u32 = 7;
const RLIMIT_AS: u32 = 9;
const RLIMIT_CPU: u32 = 0;
const RLIMIT_FSIZE: u32 = 1;
const RLIMIT_NPROC: u32 = 6;
const PR_SET_NO_NEW_PRIVS: u32 = 38;

// Return values are ignored for brevity; production code should check them.
fn setrlimit(resource: u32, rlim: Rlimit) void {
    _ = linux.syscall2(.setrlimit, resource, @intFromPtr(&rlim));
}

fn prctl1(option: u32, arg: u64) void {
    _ = linux.syscall5(.prctl, option, arg, 0, 0, 0);
}

fn runSandboxed(allocator: std.mem.Allocator) !void {
    const stdout = std.io.getStdOut().writer();
    _ = allocator;

    // layer 1: resource limits
    setrlimit(RLIMIT_NOFILE, .{ .cur = 32, .max = 32 }); // max 32 open files
    setrlimit(RLIMIT_AS, .{ .cur = 128 * 1024 * 1024, .max = 128 * 1024 * 1024 }); // 128MB virtual memory
    setrlimit(RLIMIT_CPU, .{ .cur = 5, .max = 10 }); // 5s soft, 10s hard CPU
    setrlimit(RLIMIT_FSIZE, .{ .cur = 10 * 1024 * 1024, .max = 10 * 1024 * 1024 }); // 10MB max file
    setrlimit(RLIMIT_NPROC, .{ .cur = 4, .max = 4 }); // max 4 processes

    // layer 2: no new privileges (prevents exec of setuid/setcap binaries)
    prctl1(PR_SET_NO_NEW_PRIVS, 1);

    try stdout.print("  [sandbox] Limits applied:\n", .{});
    try stdout.print("    Max open files: 32\n", .{});
    try stdout.print("    Max memory:     128 MB\n", .{});
    try stdout.print("    Max CPU time:   5s soft / 10s hard\n", .{});
    try stdout.print("    Max file size:  10 MB\n", .{});
    try stdout.print("    Max processes:  4\n", .{});
    try stdout.print("    No new privs:   enabled\n", .{});

    // layer 3: verify limits are in effect
    try stdout.print("\n  [sandbox] Testing limits...\n", .{});

    // test: try to open many files
    var opened: u32 = 0;
    for (0..40) |_| {
        _ = posix.open("/dev/null", .{ .ACCMODE = .RDONLY }, 0) catch break;
        opened += 1;
    }
    try stdout.print("  Opened {d}/40 files (limit: 32, minus stdin/out/err)\n", .{opened});

    // test: try to allocate too much memory
    // (mmap will fail when AS limit is hit)
    const page_size: usize = 4096;
    var alloc_count: u32 = 0;
    for (0..40) |_| {
        const result = linux.mmap(
            null,
            page_size * 1024, // 4MB chunks
            linux.PROT.READ | linux.PROT.WRITE,
            .{ .TYPE = .PRIVATE, .ANONYMOUS = true },
            -1,
            0,
        );
        const signed: isize = @bitCast(result);
        if (signed < 0) break;
        alloc_count += 1;
    }
    try stdout.print("  Allocated {d}x4MB chunks before hitting memory limit\n", .{alloc_count});
    try stdout.print("\n  [sandbox] Sandbox operational. Process is restricted.\n", .{});
}

pub fn main() !void {
    const stdout = std.io.getStdOut().writer();
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();

    try stdout.print("Process sandbox demonstration\n", .{});
    try stdout.print("Forking child and applying restrictions...\n\n", .{});

    const pid = try posix.fork();
    if (pid == 0) {
        // child: apply sandbox
        runSandboxed(allocator) catch |err| {
            std.debug.print("Sandbox error: {s}\n", .{@errorName(err)});
        };
        posix.exit(0);
    }

    // parent: wait for child; the status is a raw u32 decoded with the W helpers
    const result = posix.waitpid(pid, 0);
    const status = result.status;
    if (posix.W.IFSIGNALED(status)) {
        const sig = posix.W.TERMSIG(status);
        try stdout.print("\nChild killed by signal {d}", .{sig});
        if (sig == 9) try stdout.print(" (SIGKILL -- likely OOM or CPU hard limit)", .{});
        if (sig == 24) try stdout.print(" (SIGXCPU -- CPU time exceeded)", .{});
        try stdout.print("\n", .{});
    } else {
        try stdout.print("\nChild exited normally with code {d}\n", .{posix.W.EXITSTATUS(status)});
    }
    try stdout.print("Parent process remains unrestricted.\n", .{});
}
The layering matters. PR_SET_NO_NEW_PRIVS is the final lock -- once set, the process (and all its descendants) can never gain new privileges, even by exec'ing a setuid binary. Combined with resource limits, this creates a box the child cannot escape from. The parent remains completely unrestricted because limits are per-process (inherited by children at fork time, but not retroactively applied to the parent).
This is the foundation of every container runtime, every browser sandbox, and every CI/CD job isolation system. Docker adds a few more layers (seccomp filters, AppArmor/SELinux profiles, read-only filesystems) but the core technique is the one we've built here: fork, restrict, then exec the untrusted program (our demo runs its checks directly in the child instead of exec'ing).
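The setrlimit helper in the sandbox example discards the syscall's return value, so a rejected limit would go unnoticed. A more defensive variant reads the limit back after setting it. The sketch below assumes the getrlimit and setrlimit wrappers in std.posix as found in Zig 0.14; lowerSoftLimit is a hypothetical helper name:

```zig
const std = @import("std");
const posix = std.posix;

/// Lowers the soft limit for a resource (never raising it), then reads the
/// limit back so the caller sees what the kernel actually accepted.
fn lowerSoftLimit(resource: posix.rlimit_resource, wanted: u64) !posix.rlimit {
    var lim = try posix.getrlimit(resource);
    lim.cur = @min(lim.cur, wanted);
    try posix.setrlimit(resource, lim);
    return posix.getrlimit(resource);
}

pub fn main() !void {
    const applied = try lowerSoftLimit(.NOFILE, 64);
    std.debug.print("NOFILE soft={d} hard={d}\n", .{ applied.cur, applied.max });
}
```

Only the soft limit changes here, so an unprivileged process can raise it again up to the hard limit. A sandbox that wants the restriction to stick lowers both values, as the example above does.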
Exercises
- Build a resource limit enforcer that reads limits from a config file and applies them before exec'ing a target program. The config format should be one limit per line: NOFILE 64, AS 256M, CPU 30, etc. Parse the file, apply each limit via setrlimit, then exec the given program with its arguments. Handle the M, G, K suffixes for byte-based limits. Test it by running a program that opens files in a loop and verifying it gets EMFILE at the expected count.
- Write a capability inspector that reads /proc/<pid>/status for a given PID and parses the CapPrm, CapEff, and CapBnd lines (which are hex-encoded capability bitmasks). Decode each bitmask into human-readable capability names (CAP_NET_BIND_SERVICE, CAP_SYS_ADMIN, etc.) and display them in a table. Run it on PID 1 (init/systemd) and on your own process to see the difference. The program should accept a PID as a command-line argument and handle the case where the process doesn't exist.
- Implement a process jail that combines three isolation layers: (a) resource limits (NOFILE=32, AS=64MB, CPU=5s, NPROC=2), (b) PR_SET_NO_NEW_PRIVS, and (c) a read-only filesystem view using a bind mount of /tmp into a new directory as the process's working directory. Fork a child, apply all three layers, then have the child attempt to: open 50 files, allocate 100MB, fork 5 grandchildren, write to a read-only path, and exec a setuid binary (/usr/bin/passwd). Report which operations succeeded and which were blocked by which layer.
So, that's it for now!
- getrlimit/setrlimit provide per-process resource caps with soft (enforced) and hard (maximum raisable) limits -- always apply them in a child process via fork to avoid permanently restricting the parent
- Common limits include NOFILE (open fds), AS (virtual memory), CPU (time in seconds), FSIZE (max file creation size), and NPROC (max child processes per uid)
- Linux capabilities split root's monolithic power into ~40 individual privileges -- grant only what's needed (like CAP_NET_BIND_SERVICE for port 80) instead of full root
- The privilege drop pattern is: bind privileged resources, PR_SET_KEEPCAPS, setgid, setuid, clear keepcaps -- ordering is critical because setuid clears capabilities by default
- Cgroups v2 limit resource usage for groups of processes collectively via the /sys/fs/cgroup filesystem interface -- memory.max, cpu.max, io.max control the three main resources
- Namespaces isolate what a process can see (PID tree, mounts, network, hostname, users) -- each namespace type creates an independent view of that resource
- chroot changes the visible root directory but is not a security boundary alone -- pivot_root in a mount namespace is stronger
- PR_SET_NO_NEW_PRIVS permanently prevents privilege escalation through exec of setuid/setcap binaries -- the final lock on a sandboxed process
Thanks for reading!