Learn Ethical Hacking (#65) - OSINT Automation - Large-Scale Intelligence Gathering

What will I learn
- Automating OSINT at scale -- going from manual Google searches to systematic intelligence pipelines;
- Subdomain enumeration pipelines -- chaining amass, subfinder, and httpx for comprehensive discovery;
- Certificate Transparency monitoring -- real-time alerts when a target registers new certificates;
- Social media intelligence -- automated collection from LinkedIn, Twitter, GitHub, and public forums;
- Dark web monitoring -- safely searching for leaked credentials, stolen data, and threat actor chatter;
- Data correlation and enrichment -- linking disparate OSINT sources into actionable intelligence;
- Building OSINT workflows -- Python scripts that collect, deduplicate, enrich, and report;
- Defense: monitoring your own attack surface, digital footprint reduction, takedown procedures.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- Python 3.11+ with requests and aiohttp;
- Understanding of OSINT fundamentals from Episode 47;
- The ambition to learn ethical hacking and security research.
Difficulty
- Advanced
Curriculum (of the Learn Ethical Hacking Series):
- Learn Ethical Hacking (#1) - Why Hackers Win
- Learn Ethical Hacking (#2) - Your Hacking Lab
- Learn Ethical Hacking (#3) - How the Internet Actually Works - For Attackers
- Learn Ethical Hacking (#4) - Reconnaissance - The Art of Not Being Noticed
- Learn Ethical Hacking (#5) - Active Scanning - Mapping the Attack Surface
- Learn Ethical Hacking (#6) - The AI Slop Epidemic - Why AI-Generated Code Is a Security Disaster
- Learn Ethical Hacking (#7) - Passwords - Why Humans Are the Weakest Cipher
- Learn Ethical Hacking (#8) - Social Engineering - Hacking the Human
- Learn Ethical Hacking (#9) - Cryptography for Hackers - What Protects Data (and What Doesn't)
- Learn Ethical Hacking (#10) - The Vulnerability Lifecycle - From Discovery to Patch to Exploit
- Learn Ethical Hacking (#11) - HTTP Deep Dive - Request Smuggling and Header Injection
- Learn Ethical Hacking (#12) - SQL Injection - The Bug That Won't Die
- Learn Ethical Hacking (#13) - SQL Injection Advanced - Extracting Entire Databases
- Learn Ethical Hacking (#14) - Cross-Site Scripting (XSS) - Injecting Code Into Browsers
- Learn Ethical Hacking (#15) - XSS Advanced - Bypassing Filters and CSP
- Learn Ethical Hacking (#16) - Cross-Site Request Forgery - Making Users Attack Themselves
- Learn Ethical Hacking (#17) - Authentication Bypass - Getting In Without a Password
- Learn Ethical Hacking (#18) - Server-Side Request Forgery - Making Servers Betray Themselves
- Learn Ethical Hacking (#19) - Insecure Deserialization - Code Execution via Data
- Learn Ethical Hacking (#20) - File Upload Vulnerabilities - When Users Upload Weapons
- Learn Ethical Hacking (#21) - API Security - The New Attack Surface
- Learn Ethical Hacking (#22) - Business Logic Flaws - When the Code Works But the Logic Doesn't
- Learn Ethical Hacking (#23) - Client-Side Attacks - Beyond XSS
- Learn Ethical Hacking (#24) - Content Management Systems - Hacking WordPress and Friends
- Learn Ethical Hacking (#25) - Web Application Firewalls - Bypassing the Guards
- Learn Ethical Hacking (#26) - The Full Web Pentest - Methodology and Reporting
- Learn Ethical Hacking (#27) - Bug Bounty Hunting - Getting Paid to Hack the Web
- Learn Ethical Hacking (#28) - The AI Web Attack Surface - AI Features as Vulnerabilities
- Learn Ethical Hacking (#29) - Network Sniffing - Seeing Everything on the Wire
- Learn Ethical Hacking (#30) - Wireless Network Attacks - Breaking Wi-Fi
- Learn Ethical Hacking (#31) - Privilege Escalation - Linux
- Learn Ethical Hacking (#32) - Privilege Escalation - Windows
- Learn Ethical Hacking (#33) - Active Directory Attacks - The Crown Jewels
- Learn Ethical Hacking (#34) - Pivoting and Lateral Movement - Spreading Through Networks
- Learn Ethical Hacking (#35) - Cloud Security - AWS Attack and Defense
- Learn Ethical Hacking (#36) - Cloud Security - Azure and GCP
- Learn Ethical Hacking (#37) - Container Security - Docker and Kubernetes Attacks
- Learn Ethical Hacking (#38) - Infrastructure as Code - Securing the Automation
- Learn Ethical Hacking (#39) - Email Security - Phishing Infrastructure and Defense
- Learn Ethical Hacking (#40) - DNS Attacks - Exploiting the Internet's Foundation
- Learn Ethical Hacking (#41) - Exploitation Frameworks - Metasploit and Cobalt Strike
- Learn Ethical Hacking (#42) - Custom Exploit Development - Writing Your Own
- Learn Ethical Hacking (#43) - Exploit Development Advanced - Modern Mitigations and Bypasses
- Learn Ethical Hacking (#44) - Reverse Engineering - Understanding Binaries
- Learn Ethical Hacking (#45) - Supply Chain Attacks - Poisoning the Source
- Learn Ethical Hacking (#46) - The Human Factor - Why Security Training Fails
- Learn Ethical Hacking (#47) - Physical Security and OSINT - The Forgotten Attack Vectors
- Learn Ethical Hacking (#48) - Insider Threats - When the Call Is Coming from Inside the House
- Learn Ethical Hacking (#49) - Deepfakes and AI Deception - The New Social Engineering
- Learn Ethical Hacking (#50) - Red Team Operations - Simulating Real Attacks
- Learn Ethical Hacking (#51) - Incident Response - When Things Go Wrong
- Learn Ethical Hacking (#52) - Threat Intelligence - Knowing Your Enemy
- Learn Ethical Hacking (#53) - Security Architecture - Designing Systems That Resist Attack
- Learn Ethical Hacking (#54) - Compliance and Governance - The Business of Security
- Learn Ethical Hacking (#55) - Privacy and Data Protection - GDPR, CCPA, and Beyond
- Learn Ethical Hacking (#56) - Cryptocurrency Security - Attacking and Defending Digital Assets
- Learn Ethical Hacking (#57) - IoT and Embedded Security - Hacking the Physical World
- Learn Ethical Hacking (#58) - The AI Security Landscape - Attacking and Defending AI Systems
- Learn Ethical Hacking (#59) - Python for Pentesters - Automating Everything
- Learn Ethical Hacking (#60) - Zig for Security Tools - When Speed and Memory Matter
- Learn Ethical Hacking (#61) - Writing Custom Scanners - Beyond Off-the-Shelf
- Learn Ethical Hacking (#62) - C2 Frameworks - Building Command and Control
- Learn Ethical Hacking (#63) - Payload Generation and Evasion - Defeating Antivirus
- Learn Ethical Hacking (#64) - Fuzzing - Finding Bugs at Machine Speed
- Learn Ethical Hacking (#65) - OSINT Automation - Large-Scale Intelligence Gathering (this post)
Solutions to Episode 64 Exercises
Exercise 1: AFL++ fuzzing of config parser.
Target: custom config parser (key=value, sections, comments)
Compiled: afl-clang-fast -fsanitize=address config_parser.c
Seeds: 3 valid config files (5-50 lines each)
  seed1.txt: # comment-only file (12 lines of #comments)
  seed2.txt: key=value pairs (8 entries, no sections)
  seed3.txt: [section1]\nhost=localhost\n[section2]\nport=8080
Fuzzing duration: 30 minutes
Results:
  Total executions: 4.2 million
  Executions/sec: ~2,300 (reasonable for a parser with ASAN overhead)
  Unique paths: 847
  Crashes: 3
    - Crash 1: heap-buffer-overflow reading past key length
      (key with no '=' delimiter, parser reads past buffer)
      ASAN: READ of size 1 at offset 64 beyond 64-byte alloc
    - Crash 2: stack-buffer-overflow on line >4096 chars
      (fixed-size line buffer without length check)
      ASAN: WRITE of size 1 at offset 4097 beyond stack frame
    - Crash 3: null pointer deref on empty section header "[]"
      Parser calls strlen on section name extracted between
      brackets -- empty string passes, but next function
      dereferences name[0] assuming non-empty.
All crashes reproduced with ASAN, minimized with afl-tmin.
Crash 1 minimized from 847 bytes to 9 bytes: "keynoequal"
The three crashes found here represent three of the most common parser bug categories. Crash 1 (missing delimiter) is the classic bounds check omission -- the parser searches for = using strchr, gets NULL back when the delimiter is absent, and then tries to compute key length by pointer subtraction against the NULL return. This is exactly the kind of input that no developer tests manually because every real config file has = signs in it. Crash 2 (oversized line) is the fixed buffer overflow -- the developer picked 4096 as "big enough" and never checked whether the input respected that assumption. Crash 3 (empty section header) is a logic error masked by convention -- every real config file has section names inside brackets, so the parser never had to handle the empty case. AFL generated [] as a mutation, which is structurally valid (the brackets match) but semantically empty, and the downstream code crashed. Three bugs in 30 minutes from a parser the developer probably considered "done" -- and that is precisely why fuzzing exists.
Exercise 2: ffuf web fuzzing comparison.
Target: DVWA on 192.168.1.100 (Apache + PHP)
Wordlist: raft-large-words.txt (119,600 entries, SecLists)
Directory brute force:
  ffuf: 14,200 req/sec, 119,600 tested in 8.4 seconds
    Found 31 paths (200/301/403 responses)
    Filter used: -fc 404 (exclude Not Found)
  gobuster: 9,100 req/sec, same wordlist in 13.1 seconds
    Found 31 paths (identical results)
  Ratio: ffuf 1.56x faster (Go HTTP client reuse vs gobuster)
Parameter discovery on /vulnerabilities/sqli/:
  ffuf -u "http://target/vulnerabilities/sqli/?FUZZ=test" \
       -w burp-parameter-names.txt -fs 1785
  Found: id (known), page (hidden redirect param), debug
    (returns verbose PHP error output when set to any value)
  Filter: -fs 1785 (baseline response size for unknown params)
Virtual host enumeration:
  ffuf -u http://192.168.1.100/ -H "Host: FUZZ.lab.local" \
       -w subdomains-top1million-5000.txt -fs 11321
  Found: admin.lab.local (302 -> /admin/login),
         dev.lab.local (200, development dashboard),
         staging.lab.local (200, copy of production app)
  Filter: -fs 11321 (default vhost response size)
The filter flags are the real skill here, not the tool itself. Without -fs 1785 on the parameter discovery run, every single one of the 6,453 tested parameter names returns a 200 response (the page renders regardless of what GET parameters you throw at it). The page is 1785 bytes when the parameter is ignored. When a parameter actually DOES something -- like debug changing the output to include PHP error messages -- the response size changes. The -fs (filter size) flag hides the 1785-byte baseline responses and only shows you the anomalies. Finding the right filter value is the first thing you do: send one request with a parameter you know does nothing (?aaaaaaa=test), note the response size, and that becomes your filter. The speed difference between ffuf and gobuster is real but not as important as getting the filters right -- a fast fuzzer with wrong filters gives you 119,600 false positives, and a slow fuzzer with correct filters gives you exactly the results that matter.
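The baseline-then-filter idea can be sketched in a few lines of Python. This is an illustration of the logic only, not how ffuf implements `-fs`; the function name and the sample sizes are made up for the example (the 1785-byte baseline is the one from the run above).

```python
from collections import Counter


def find_anomalies(responses, baseline_size=None):
    """Return the (name, size) pairs whose size differs from the baseline.

    responses: list of (parameter_name, response_size_in_bytes) tuples.
    If baseline_size is None, the most common size is assumed to be the baseline.
    """
    if not responses:
        return []
    if baseline_size is None:
        sizes = Counter(size for _, size in responses)
        baseline_size = sizes.most_common(1)[0][0]
    return [(name, size) for name, size in responses if size != baseline_size]


# Hypothetical response sizes: most parameters are ignored (1785 bytes),
# two of them change the page.
observed = [
    ("aaaaaaa", 1785),
    ("id", 1802),
    ("foo", 1785),
    ("debug", 4210),
    ("bar", 1785),
]

print(find_anomalies(observed, baseline_size=1785))
```

Passing the baseline explicitly mirrors the manual workflow (send one known-useless parameter first); letting the function pick the most common size is a convenience that only works when ignored parameters are the majority.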
Exercise 3: OSS-Fuzz research.
Total bugs found: 10,000+ across 1,000+ projects (as of 2025)
Top 5 projects by bugs found:
  1. Chrome/Chromium - 1,200+ bugs
  2. ffmpeg - 400+ bugs
  3. OpenSSL - 300+ bugs
  4. systemd - 250+ bugs
  5. curl - 200+ bugs
Most common CWE categories:
  heap-buffer-overflow: ~35% of all findings
  use-after-free: ~15%
  stack-buffer-overflow: ~12%
  integer-overflow: ~8%
  null-dereference: ~7%
  others: ~23%
Average time from report to fix:
  Critical severity: 14 days
  High severity: 30 days
  Medium severity: 45 days
Case study: CVE-2022-1292 (OpenSSL c_rehash command injection)
  Found by: OSS-Fuzz continuous fuzzing
  Bug class: OS command injection in the c_rehash Perl script
  Trigger: certificate filename containing shell metacharacters
  Impact: RCE when c_rehash processes untrusted cert directories
  Fix: replaced system() call with direct file operations
  Timeline: reported Mar 15, fixed Apr 20 (36 days), CVSSv3 9.8
The distribution of bug types is telling. Heap buffer overflows dominate at 35% because heap allocations are where variable-length data lives (strings, packets, file contents), and fuzzers are exceptionally good at generating inputs that push length fields past what the allocation expects. The use-after-free at 15% is more interesting -- these bugs require a specific temporal sequence (allocate, free, use) that random testing would almost never trigger, but coverage-guided fuzzers evolve inputs that exercise exactly those paths by building on mutations that reach the free-then-use code. The CVE-2022-1292 case is a good reminder that fuzzing finds bugs beyond just memory corruption -- the c_rehash injection is a shell metacharacter issue in a Perl script, not a buffer overflow in C, but the fuzzer found it because it generated filenames with characters like backticks and semicolons that the script passed unsanitized to system().
Episode 64 was about fuzzing -- throwing millions of garbage inputs at software at machine speed to find bugs that human review misses. We covered AFL++ for native code (coverage-guided, compile-time instrumentation, evolutionary input generation), libFuzzer for in-process targets, ffuf for web application fuzzing (directory discovery, parameter fuzzing, virtual host enumeration), protocol fuzzing at the binary level, and the triage process that turns a raw crash into a vulnerability assessment. The core insight was that developers test the happy path while fuzzers explore the hostile universe of malformed inputs that nobody thought to construct manually.
Today we pivot from finding bugs in code to finding information about targets. In episode 47 we covered OSINT fundamentals -- Google dorking, theHarvester, Sherlock, basic LinkedIn research, Shodan queries. Those techniques work beautifully when you have one target, one domain, one person of interest. But professional pentest engagements and red team operations almost never have just one target. You have a company with 50 domains, 10,000 employees, hundreds of web applications, shadow IT nobody documented, and the scope letter says "find everything." Manual OSINT does not scale to that. Automation does.
From Manual to Machine
The fundamental shift from manual to automated OSINT is not just about speed (though speed matters enormously -- what takes an analyst a full day of clicking and copy-pasting takes a pipeline five minutes). The real shift is about completeness. A human analyst checking subdomains will query crt.sh, maybe run subfinder, note down the results, and move on. An automated pipeline queries crt.sh, subfinder, amass, GitHub code search, DNS brute forcing, web archives, and six other sources -- deduplicates the combined results -- and produces a list that is genuinely comprehensive rather than "the first 30 results I found before I got tired."
In a penetration test, the one subdomain you missed is the one running an outdated Jenkins instance with default credentials. The one GitHub commit you did not search for is the one where a developer pushed an AWS secret key three years ago and deleted it in the next commit (but git history is forever). Automated OSINT is not optional for professional security work -- it is the difference between "I checked a few things" and "I systematically enumerated everything that is publicly discoverable about this target." The reconnaissance phase (episode 4) taught us that attackers invest heavily in information gathering before touching a single system. Automation is how they do it at scale.
Subdomain Enumeration Pipeline
The modern approach chains multiple tools, each pulling from different data sources, and deduplicates the combined output. No single tool covers everything -- subfinder is excellent at passive API-based enumeration (it queries 30+ services like VirusTotal, Censys, Shodan, SecurityTrails), amass adds DNS brute forcing and recursive enumeration, crt.sh provides Certificate Transparency log data, and GitHub code search sometimes reveals subdomains that exist only in source code (development URLs, staging environments, internal API endpoints that developers accidentally committed to public repos).
#!/bin/bash
# recon_pipeline.sh -- comprehensive subdomain enumeration
# Usage: ./recon_pipeline.sh example.com
DOMAIN="${1:?usage: $0 <domain>}"
OUTPUT_DIR="./recon/$DOMAIN"
mkdir -p "$OUTPUT_DIR"
echo "[*] Enumerating subdomains for $DOMAIN..."
# Tool 1: subfinder (passive, 30+ data sources)
subfinder -d "$DOMAIN" -silent -o "$OUTPUT_DIR/subfinder.txt"
# Tool 2: amass (passive mode as written; DNS brute forcing needs an active run without -passive)
amass enum -passive -d "$DOMAIN" -o "$OUTPUT_DIR/amass.txt" 2>/dev/null
# Tool 3: Certificate Transparency logs
curl -s "https://crt.sh/?q=%25.$DOMAIN&output=json" | \
    jq -r '.[].name_value' 2>/dev/null | sort -u > "$OUTPUT_DIR/crt.txt"
# Tool 4: GitHub code search (finds subdomains in source code)
# Requires GITHUB_TOKEN
github-subdomains -d "$DOMAIN" -t "$GITHUB_TOKEN" -o "$OUTPUT_DIR/github.txt" 2>/dev/null
# Combine and deduplicate. The four inputs are named explicitly so a re-run
# does not feed all_subdomains.txt or live_hosts.txt back into the list.
cat "$OUTPUT_DIR"/{subfinder,amass,crt,github}.txt 2>/dev/null | \
    sort -u > "$OUTPUT_DIR/all_subdomains.txt"
TOTAL=$(wc -l < "$OUTPUT_DIR/all_subdomains.txt")
echo "[*] Found $TOTAL unique subdomains"
# Resolve live subdomains
echo "[*] Checking which are live..."
httpx -l "$OUTPUT_DIR/all_subdomains.txt" -silent -status-code \
    -title -tech-detect -o "$OUTPUT_DIR/live_hosts.txt"
LIVE=$(wc -l < "$OUTPUT_DIR/live_hosts.txt")
echo "[*] $LIVE live hosts found"
# httpx appends status/title/tech columns; keep only the URL column for gowitness
awk '{print $1}' "$OUTPUT_DIR/live_hosts.txt" > "$OUTPUT_DIR/live_urls.txt"
# Screenshot live hosts
echo "[*] Taking screenshots..."
gowitness file -f "$OUTPUT_DIR/live_urls.txt" -P "$OUTPUT_DIR/screenshots/"
echo "[*] Done. Results in $OUTPUT_DIR/"
Typical results for a medium-sized company:
subfinder: 150 subdomains (API sources: VirusTotal, Shodan, Censys...)
amass: 280 subdomains (includes overlap + DNS brute force findings)
crt.sh: 120 subdomains (certificate names only)
GitHub: 30 subdomains (hardcoded URLs in public repos)
Combined: 340 unique subdomains (after dedup)
Live: 185 responding to HTTP/HTTPS
Without automation: manually finding 340 subdomains would take
an analyst an entire day of tab-switching and note-taking.
With the pipeline: under 5 minutes.
The deduplication step is more important than it might appear. Each tool returns results in a slightly different format -- some include wildcards (*.example.com), some include port numbers, some return both the raw domain and the www. prefix. The sort -u handles exact duplicates, but in production pipelines you would also normalize the data (strip ports, resolve wildcards, lowercase everything) before deduplication. The reason for running four tools instead of just the "best" one is that no single tool has access to all data sources. Subfinder might find staging-api.target.com through a VirusTotal passive DNS record that amass never sees because amass's VirusTotal integration uses a different API endpoint. Amass might discover internal-tools.target.com through recursive DNS resolution that subfinder's passive-only approach cannot perform. The union of all four tools is substantially larger than any individual tool's output -- typically 30-40% more unique subdomains compared to the best single tool.
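A minimal Python sketch of that normalization step, assuming each source has already been read into a list of strings. The function names are illustrative, and the rules (lowercase, strip scheme, path, port, wildcard prefix and trailing dot) are the ones named above rather than an exhaustive cleanup; IPv6 literals, for example, are not handled.

```python
def normalize_host(raw):
    """Reduce one raw tool-output entry to a bare lowercase hostname.

    Returns None when the entry does not look like a hostname.
    """
    host = raw.strip().lower()
    if not host:
        return None
    if "://" in host:                 # drop a URL scheme
        host = host.split("://", 1)[1]
    host = host.split("/", 1)[0]      # drop any path
    host = host.split(":", 1)[0]      # drop a port number
    while host.startswith("*."):      # drop wildcard prefixes
        host = host[2:]
    host = host.rstrip(".")           # drop a trailing root dot
    if "." not in host or " " in host:
        return None
    return host


def merge_sources(*sources):
    """Normalize every entry from every source and return a sorted unique list."""
    seen = set()
    for source in sources:
        for line in source:
            host = normalize_host(line)
            if host:
                seen.add(host)
    return sorted(seen)


# Hypothetical raw output from two tools
tool_a = ["API.example.com", "*.example.com"]
tool_b = ["api.example.com:8443", "https://www.example.com/login", "example.com.", ""]
print(merge_sources(tool_a, tool_b))
```

Note that `www.example.com` and `example.com` are kept as separate entries here, since they are different hosts and can serve different content.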
The httpx resolution step filters the enumerated subdomains down to those that actually respond to HTTP requests. Many enumerated subdomains are dead -- old DNS records pointing to decommissioned servers, expired certificates for services that no longer exist, development environments that were shut down but never cleaned up in DNS. httpx also extracts useful metadata (HTTP status codes, page titles, technology fingerprints via Wappalyzer signatures) that helps you prioritize which hosts to investigate further. A host returning a 200 with title "Jenkins [Jenkins]" is a much higher-priority finding than a host returning a generic 403 Forbidden page.
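The prioritization described here can be approximated with a small scoring function. This sketch assumes the httpx results have already been parsed into dictionaries with `url`, `status`, and `title` keys; that record shape, the keyword list, and the weights are all assumptions chosen for illustration, not httpx features.

```python
# Hypothetical keywords that usually deserve a closer look during triage
HIGH_INTEREST = ("jenkins", "grafana", "kibana", "phpmyadmin", "login", "admin")


def triage_score(host):
    """Score one host record; a higher score means look at it sooner."""
    title = (host.get("title") or "").lower()
    status = host.get("status")
    score = 0
    if any(keyword in title for keyword in HIGH_INTEREST):
        score += 10
    if status == 200:
        score += 3
    elif status in (401, 403):
        score += 1
    return score


def prioritize(hosts):
    """Return hosts ordered from most to least interesting."""
    return sorted(hosts, key=triage_score, reverse=True)


# Hypothetical parsed results
hosts = [
    {"url": "https://a.example.com", "status": 403, "title": "403 Forbidden"},
    {"url": "https://b.example.com", "status": 200, "title": "Jenkins [Jenkins]"},
    {"url": "https://c.example.com", "status": 200, "title": "Welcome"},
]

for record in prioritize(hosts):
    print(triage_score(record), record["url"], record["title"])
```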
The gowitness screenshots at the end are for the report -- and for your own sanity when dealing with 185 live hosts. Scrolling through a folder of screenshots is dramatically faster for initial triage than manually visiting each URL. You can immediately spot login pages, default installations, error pages that leak information, and admin panels that should not be publicly accessible. Visual triage at scale is one of those things that sounds trivial until you have actually tried to manually visit 185 URLs one by one.
Certificate Transparency Monitoring
CT logs are public records of every SSL/TLS certificate issued by participating Certificate Authorities. Certificate Transparency was created as a mechanism to detect mis-issued certificates (remember the DigiNotar incident from 2011? -- a compromised CA issued fraudulent certificates for google.com, and nobody knew until it was too late). For OSINT purposes, CT logs are an intelligence goldmine because they reveal infrastructure changes in near real time: when a certificate is issued for a new hostname under the target's domain, it usually means new infrastructure is being deployed.
#!/usr/bin/env python3
"""ct_monitor.py -- monitor Certificate Transparency for new certs"""
import sys
import time
from datetime import datetime, timezone
import requests
def utc_now():
    """Current UTC time as an ISO 8601 string comparable to crt.sh timestamps."""
    return datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S')
def check_new_certs(domain, last_check=None):
    """Query crt.sh for certificates logged since last_check."""
    try:
        # Passing the query via params lets requests URL-encode the % wildcard
        r = requests.get("https://crt.sh/",
                         params={'q': f"%.{domain}", 'output': 'json'},
                         timeout=30)
        r.raise_for_status()
        certs = r.json()
    except (requests.RequestException, ValueError):
        return []
    new_certs = []
    for cert in certs:
        issued = cert.get('entry_timestamp') or ''
        name = cert.get('name_value') or ''
        issuer = cert.get('issuer_name') or ''
        if last_check and issued > last_check:
            new_certs.append({
                'domain': name,
                'issued': issued,
                'issuer': issuer,
            })
    return new_certs
def monitor(domain, interval=3600):
    """Continuously monitor for new certificates."""
    last_check = utc_now()
    print(f"[*] Monitoring CT logs for {domain}")
    print(f"[*] Checking every {interval} seconds")
    while True:
        time.sleep(interval)
        # Record the time before querying so certs logged mid-request are not skipped
        check_started = utc_now()
        new = check_new_certs(domain, last_check)
        for cert in new:
            print(f"[!] NEW CERT: {cert['domain']} "
                  f"(issued {cert['issued']}, issuer: {cert['issuer'][:40]})")
        last_check = check_started
if __name__ == '__main__':
    monitor(sys.argv[1] if len(sys.argv) > 1 else 'example.com')
What CT monitoring reveals in practice:
- New subdomains being deployed (staging.target.com, api-v2.target.com)
- Shadow IT (departments spinning up services without central IT approval)
- Phishing infrastructure (attacker registers target-login.com or target-sso.com)
- Development/test environments exposed to the internet
- Partner integrations (partner-api.target.com)
- Certificate renewals (useful for tracking certificate management practices)
- Wildcard certificate usage (*.target.com -- a single cert covering everything)
The offensive value of CT monitoring is timing. When you see a new certificate for staging-v2.target.com appear in the logs, that infrastructure is probably being set up RIGHT NOW -- which means it is likely in its most vulnerable state (default configurations, incomplete hardening, maybe even temporary credentials for the deployment team). Attackers monitor CT logs for exactly this reason: fresh infrastructure is easy infrastructure. The defensive value is equally important: if you see a certificate for target-sso.com and your organization did not request it, that is almost certainly a phishing campaign in preparation, and you have a window to take action (report the domain, update your email filters, alert your users) before the campaign launches.
Having said that, the crt.sh API has rate limits and sometimes returns incomplete results for very large domains. For production-grade CT monitoring, tools like certstream provide real-time streaming of CT log entries via a WebSocket connection -- you see every certificate within seconds of it being logged, not on a polling interval. Facebook's Certificate Transparency monitoring tool does the same thing internally and has caught quite a few phishing domains targeting their users before any employee received a single phishing email.
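The two cases above (new infrastructure under the target's own domain versus a lookalike such as target-sso.com) need different matching logic, because a `%.target.com` query never returns lookalike domains. A minimal sketch of that classification, assuming the certificate names arrive from some feed such as certstream; the function name and the simple substring rule for lookalikes are illustrative assumptions and will produce false positives for short brand names.

```python
def classify_cert_name(name, domain, brand):
    """Classify a certificate name as 'owned', 'lookalike', or 'unrelated'.

    name:   hostname taken from a certificate (may carry a wildcard prefix)
    domain: the organization's real domain, e.g. 'target.com'
    brand:  keyword to watch for in foreign domains, e.g. 'target'
    """
    host = name.strip().lower().rstrip(".")
    if host.startswith("*."):
        host = host[2:]
    domain = domain.lower()
    if host == domain or host.endswith("." + domain):
        return "owned"
    if brand.lower() in host:
        return "lookalike"
    return "unrelated"


# Hypothetical names as they might appear in a CT feed
for seen in ["staging-v2.target.com", "*.target.com", "target-sso.com",
             "target.com.evil.net", "example.org"]:
    print(f"{seen:28} {classify_cert_name(seen, 'target.com', 'target')}")
```

The suffix check is done on `"." + domain` rather than on the bare domain so that a name like `nottarget.com` is not mistaken for an owned host.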
Automated Credential Monitoring
One of the highest-impact OSINT automation tasks is checking whether your organization's employee credentials have been exposed in data breaches. The logic is simple: if John's work email address appeared in the LinkedIn 2012 breach and John is still using the same password (or a predictable variation), that is a real attack vector -- no exploitation required, just credential stuffing.
#!/usr/bin/env python3
"""breach_monitor.py -- check employee emails against breach databases"""
import time
from urllib.parse import quote
import requests
def check_hibp(email, api_key):
    """Check Have I Been Pwned for breaches; returns a list of breach dicts."""
    # The account must be URL-encoded before it goes into the path
    url = f"https://haveibeenpwned.com/api/v3/breachedaccount/{quote(email)}"
    headers = {
        'hibp-api-key': api_key,
        'user-agent': 'breach-monitor'
    }
    r = requests.get(url, headers=headers, timeout=10)
    if r.status_code == 200:
        return r.json()
    if r.status_code == 429:
        print(f"  [?] rate limited while checking {email}; increase the delay")
    return []  # 404 means the address is not in any known breach
def monitor_company(email_list_file, api_key):
    """Check all company emails for breach exposure."""
    with open(email_list_file) as f:
        emails = [line.strip() for line in f if line.strip()]
    print(f"[*] Checking {len(emails)} emails against HIBP...")
    exposed = []
    for email in emails:
        breaches = check_hibp(email, api_key)
        if breaches:
            breach_names = [b['Name'] for b in breaches]
            print(f"  [!] {email}: exposed in {', '.join(breach_names)}")
            exposed.append({
                'email': email,
                'breaches': breach_names,
                'breach_count': len(breaches),
            })
        time.sleep(1.5)  # HIBP rate limit: 1 request per 1.5 seconds
    print(f"\n[*] {len(exposed)} / {len(emails)} emails found in breaches")
    return exposed
The 1.5-second delay between HIBP requests is worth noting because it defines the practical ceiling for automated checking: at that pace, 10,000 employee emails take approximately 4 hours and 10 minutes to process (the limit that applies to you depends on your HIBP subscription, so check what your API key allows). For large organizations, the HIBP commercial API (Enterprise subscription) provides higher rate limits and domain-level searching (check all emails at @company.com without knowing individual addresses). The alternative is maintaining your own breach database -- tools like h8mail aggregate results from multiple breach databases and paste sites -- but the legal and ethical considerations of possessing breach data are non-trivial, and for most pentest engagements the HIBP API (which never exposes actual passwords, only breach membership) is the appropriate tool.
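The runtime figure is plain arithmetic (number of emails times seconds per request), which is easy to recompute for other list sizes or delays. A small sketch; the 6-second case is a hypothetical slower limit of 10 requests per minute, included only to show how quickly the total grows.

```python
from datetime import timedelta


def estimated_runtime(num_emails, seconds_per_request=1.5):
    """Wall-clock time needed to check num_emails sequentially."""
    return timedelta(seconds=num_emails * seconds_per_request)


print(estimated_runtime(10_000))        # 4:10:00  (10,000 x 1.5 s = 15,000 s)
print(estimated_runtime(10_000, 6))     # 16:40:00 (hypothetical 10 requests/minute)
```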
The pentest value of breach data is enormous. If 40% of a company's employees appear in breaches, the probability that at least some of those passwords are still in use (or have been "rotated" to predictable variants like Password2024! -> Password2025!) is extremely high. Combined with the username enumeration techniques from earlier in this series, breach data enables credential stuffing attacks that bypass technical controls entirely -- you are not exploiting software, you are logging in with valid credentials.
GitHub OSINT
Developers leak secrets on GitHub constantly. API keys, database credentials, internal URLs, private SSH keys, cloud access tokens -- all committed to public repositories by accident and (usually) deleted in the next commit. The problem is that git never forgets. Deleting the secret in a follow-up commit leaves it in the history, and even after a force-push the old commit can survive in forks, existing clones, and cached views of the repository. Automated GitHub OSINT searches for these leaked secrets at scale.
#!/usr/bin/env python3
"""github_osint.py -- search GitHub for leaked company secrets"""
import requests
GITHUB_API = "https://api.github.com"
def search_code(query, token, max_results=100):
    """Search GitHub code for secrets."""
    headers = {'Authorization': f'token {token}'}
    results = []
    page = 1
    while len(results) < max_results:
        r = requests.get(f"{GITHUB_API}/search/code",
                         params={'q': query, 'per_page': 30, 'page': page},
                         headers=headers, timeout=10)
        if r.status_code != 200:
            break
        data = r.json()
        for item in data.get('items', []):
            results.append({
                'repo': item['repository']['full_name'],
                'file': item['path'],
                'url': item['html_url'],
            })
        if len(data.get('items', [])) < 30:
            break
        page += 1
    # A full page can overshoot the cap, so trim before returning
    return results[:max_results]
def hunt_secrets(domain, token):
    """Search for company secrets on GitHub."""
    queries = [
        f'"{domain}" password',
        f'"{domain}" api_key',
        f'"{domain}" secret_key',
        f'"{domain}" AWS_ACCESS_KEY',
        f'"{domain}" jdbc:mysql',
        f'org:{domain.split(".")[0]} password filename:.env',
        f'org:{domain.split(".")[0]} filename:credentials',
    ]
    all_findings = []
    for query in queries:
        print(f"  Searching: {query}")
        findings = search_code(query, token, max_results=20)
        for f in findings:
            print(f"    [!] {f['repo']} / {f['file']}")
            print(f"        {f['url']}")
        all_findings.extend(findings)
    return all_findings
The org: prefix in the GitHub search API is critical because it restricts results to repositories owned by the organization. Without it, you search ALL of GitHub and get drowned in false positives (every tutorial that uses example.com as a placeholder will match "example.com" password). Note that the script guesses the organization name from the first label of the domain, which will not always match the real GitHub organization, so confirm the name before trusting an empty result. The filename:.env filter is particularly effective because .env files are the standard location for environment variables (API keys, database URLs, secret keys) in modern web applications, and developers who forget to add .env to their .gitignore commit these files to version control regularly.
For more thorough secret detection, trufflehog (from Truffle Security) scans the ENTIRE git history of a repository -- not just the current HEAD -- looking for high-entropy strings and known secret patterns (AWS keys, Slack tokens, private keys). It catches secrets that were committed three years ago and deleted in the next commit. GitGuardian provides the same capability as a SaaS product with continuous monitoring -- useful for organizations that want to detect leaked secrets in real time rather than retroactively. On a pentest engagement, running trufflehog against the target organization's public GitHub repos is one of the highest-yield activities in the entire reconnaissance phase, because a valid AWS access key is not a vulnerability you need to exploit -- it is a door that is already open.
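To make "high-entropy strings and known secret patterns" concrete, here is a minimal sketch of both checks. It is not trufflehog's implementation; the threshold of 4.0 bits per character, the minimum token length, and the function names are assumptions picked for illustration. The sample credentials are the well-known placeholder values from AWS documentation.

```python
import math
import re
from collections import Counter

# AWS access key IDs: a 4-letter prefix followed by 16 uppercase letters/digits
AWS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")
TOKEN = re.compile(r"[A-Za-z0-9+/=_-]+")


def shannon_entropy(s):
    """Shannon entropy of a string in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def find_candidates(text, min_len=20, min_entropy=4.0):
    """Return (kind, value) pairs for likely secrets found in text."""
    hits = [("aws_access_key_id", m.group()) for m in AWS_KEY_ID.finditer(text)]
    for token in TOKEN.findall(text):
        if len(token) < min_len or AWS_KEY_ID.fullmatch(token):
            continue
        if shannon_entropy(token) >= min_entropy:
            hits.append(("high_entropy", token))
    return hits


sample = '''
aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"
aws_secret = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
padding = "aaaaaaaaaaaaaaaaaaaaaaaaaaaa"
'''

for kind, value in find_candidates(sample):
    print(kind, value)
```

Entropy alone is noisy (hashes, UUIDs, and minified code all score high), which is why real scanners pair it with provider-specific patterns like the key ID regex above.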
Beyond the Public Internet
The aforementioned techniques all operate on publicly accessible data -- DNS records, certificates, GitHub, breach databases. But OSINT extends beyond what is indexed by search engines.
Social media intelligence (SOCMINT) involves systematic collection from platforms like LinkedIn, Twitter, GitHub profiles, personal blogs, and public forums. For pentest engagements, LinkedIn is the most valuable source because it reveals organizational structure (who reports to whom), technology stack (job postings list specific technologies -- "experience with Kubernetes, Terraform, and AWS Lambda" tells you exactly what the target's infrastructure looks like), and individual targets for social engineering (the help desk employee, the new hire who is not yet security-aware, the sysadmin who posts about their work on Twitter). Tools like linkedin2username generate likely email formats from LinkedIn employee names, and Sherlock (which we covered in episode 47) maps a username across 300+ platforms to build a profile of a person's online presence.
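The email-format step works by expanding each employee name into the handful of address conventions companies commonly use. The sketch below is a simplified illustration of that idea, not the linkedin2username code; the function name and the particular list of formats are assumptions.

```python
def candidate_emails(full_name, domain):
    """Expand a person's name into common corporate email address formats."""
    # Keep only letters and digits so "O'Brien" becomes "obrien"
    parts = ["".join(ch for ch in part if ch.isalnum())
             for part in full_name.lower().split()]
    parts = [p for p in parts if p]
    if len(parts) < 2:
        return []
    first, last = parts[0], parts[-1]
    local_parts = [
        f"{first}.{last}",    # jane.doe
        f"{first}{last}",     # janedoe
        f"{first[0]}{last}",  # jdoe
        f"{first}{last[0]}",  # janed
        f"{first}_{last}",    # jane_doe
        f"{last}.{first}",    # doe.jane
    ]
    # dict.fromkeys removes duplicates while preserving order
    return [f"{lp}@{domain}" for lp in dict.fromkeys(local_parts)]


# Hypothetical employee name and domain
for address in candidate_emails("Jane Doe", "example.com"):
    print(address)
```

Once one real address is confirmed (from a breach record, a commit author field, or a press contact), the matching format can be applied to the whole employee list.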
Dark web monitoring is the other end of the spectrum. Threat intelligence teams and SOC analysts monitor underground forums, paste sites, and marketplace listings for mentions of their organization. The practical tools for this include IntelX (Intelligence X -- a search engine for the historical internet including leaked databases, paste sites, and Tor content), commercial platforms like Recorded Future and DarkTracer, and Telegram channel monitors (many threat actors have moved from dark web forums to Telegram groups, which are substantially easier to monitor). The defensive goal is early warning: if an employee's credentials appear for sale on a dark web marketplace, you want to know about it before the buyer uses them.
OSINT data source hierarchy (by accessibility):
1. Public internet (Google, Shodan, Censys, crt.sh)
2. Social media (LinkedIn, Twitter, GitHub, Reddit)
3. Breach databases (HIBP, IntelX, dehashed)
4. Code repositories (GitHub, GitLab, Bitbucket -- public repos)
5. Paste sites (Pastebin, Ghostbin, dpaste)
6. Deep web (forums requiring registration, gated content)
7. Dark web (Tor hidden services, I2P, marketplace listings)
Automation coverage:
Layers 1-5: fully automatable with public APIs and web scraping
Layer 6: partially automatable (account creation + scraping)
Layer 7: requires specialized tools and significant OPSEC
The key insight for OSINT automation is that layers 1-5 contain the large majority (roughly 90%) of the actionable intelligence for most engagements. Going deeper (layers 6-7) has diminishing returns and increasing legal and ethical complexity. Unless you are specifically doing threat intelligence work (episode 52) or investigating an active incident (episode 51), the public layers provide more than enough data to map an organization's attack surface comprehensively.
Building an OSINT Dashboard
With data flowing in from multiple sources -- subdomains, certificates, breaches, GitHub findings, social media profiles -- the challenge shifts from collection to correlation. A subdomain list and a breach list are useful individually, but linking them reveals patterns that neither source shows alone. If admin.target.com is in your subdomain list and admin@target.com appears in three breaches, that is a high-priority finding that combines infrastructure exposure with credential compromise.
#!/usr/bin/env python3
"""osint_report.py -- generate consolidated OSINT report"""

import json
from datetime import datetime, timezone


class OSINTReport:
    def __init__(self, target):
        self.report = {
            'target': target,
            # Timezone-aware UTC timestamp; datetime.utcnow() is
            # deprecated as of Python 3.12.
            'generated': datetime.now(timezone.utc).isoformat(),
            'sections': {}
        }

    def add_subdomains(self, subdomains):
        # Each entry is a dict; a truthy 'live' key marks a responding host.
        self.report['sections']['subdomains'] = {
            'total': len(subdomains),
            'live': len([s for s in subdomains if s.get('live')]),
            'data': subdomains
        }

    def add_breaches(self, breaches):
        # Each entry is one exposed account with a 'breach_count' field.
        self.report['sections']['breaches'] = {
            'exposed_accounts': len(breaches),
            'total_breaches': sum(b['breach_count'] for b in breaches),
            'data': breaches
        }

    def add_github_findings(self, findings):
        # Each entry is one finding with a 'repo' field.
        self.report['sections']['github'] = {
            'total_findings': len(findings),
            'repos_affected': len(set(f['repo'] for f in findings)),
            'data': findings
        }

    def add_ct_certs(self, certs):
        self.report['sections']['certificates'] = {
            'new_certs': len(certs),
            'data': certs
        }

    def generate(self):
        # Linear weight per finding, capped per section (25/30/30/15 = 100).
        risk = 0
        subs = self.report['sections'].get('subdomains', {})
        risk += min(subs.get('live', 0) * 0.5, 25)
        breaches = self.report['sections'].get('breaches', {})
        risk += min(breaches.get('exposed_accounts', 0) * 5, 30)
        github = self.report['sections'].get('github', {})
        risk += min(github.get('total_findings', 0) * 10, 30)
        certs = self.report['sections'].get('certificates', {})
        risk += min(certs.get('new_certs', 0) * 2, 15)

        self.report['risk_score'] = min(risk, 100)
        self.report['risk_level'] = (
            'CRITICAL' if risk > 75 else
            'HIGH' if risk > 50 else
            'MEDIUM' if risk > 25 else
            'LOW'
        )
        return self.report

    def save(self, filename):
        report = self.generate()
        with open(filename, 'w') as f:
            json.dump(report, f, indent=2)
        print(f"Report saved: {filename}")
        print(f"Risk: {report['risk_level']} ({report['risk_score']}/100)")
The risk scoring model here is deliberately simple -- linear weights per category, capped per section to prevent any single source from dominating the score. In practice you would weight the categories based on the engagement scope and the client's threat model: a financial institution might weight breach exposure at 3x compared to subdomain count (because credential compromise leads directly to fraud), while a SaaS company might weight GitHub findings at 3x (because leaked API keys grant direct access to customer data). The point of the automated scoring is not to replace analyst judgment -- it is to provide a consistent baseline that enables comparison over time. If you run the same pipeline weekly and the risk score jumps from 35 to 68, something changed and you need to investigate.
The dashboard approach also enables correlation across data sources. Python's set operations make this trivial: take the first label of each subdomain hostname (the admin in admin.target.com), take the set of email prefixes from the breach data, compute the intersection, and you have a list of hosts whose names match compromised user accounts. A name match is a lead to verify, not proof that the account has access to that host. Add the GitHub findings (which often contain database connection strings with hostnames), intersect those hostnames with the subdomain list, and you might discover that db-production.target.com is both publicly discoverable AND has its credentials in a public GitHub repo. Each data source is moderately useful alone. The combination is devastating.
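A minimal sketch of that correlation, assuming you already have three plain lists: discovered hostnames, breached email addresses, and hostnames extracted from GitHub findings. The function name and the keys of the returned dict are my own invention, and extracting the hostnames from connection strings is left out on purpose.

```python
def correlate(subdomains, breached_emails, github_hosts):
    """Cross-reference three OSINT lists with set operations.

    Returns a dict with:
      breached_name_matches -- hosts whose first DNS label equals the
                               local part of a breached email address
      hosts_in_github       -- discovered hosts that also appear in
                               GitHub findings
    """
    hosts = {h.strip().lower() for h in subdomains}
    # "admin@target.com" -> "admin"
    prefixes = {e.strip().lower().split("@", 1)[0] for e in breached_emails}
    leaked = {h.strip().lower() for h in github_hosts}

    # "admin.target.com" -> "admin", then test membership in the prefix set
    name_matches = sorted(h for h in hosts if h.split(".", 1)[0] in prefixes)
    return {
        "breached_name_matches": name_matches,
        "hosts_in_github": sorted(hosts & leaked),
    }


if __name__ == "__main__":
    result = correlate(
        ["admin.target.com", "db-production.target.com", "www.target.com"],
        ["admin@target.com", "jane@target.com"],
        ["db-production.target.com", "internal.other.com"],
    )
    print(result)
```

Note that internal.other.com is dropped from the second list because it was never in the subdomain set. That is a scoping decision: for a real report you probably want to review those out-of-set hostnames by hand as well.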
Defense: Attack Surface Management
Everything in this episode so far has been offensive -- techniques for mapping a target's exposure. The defensive application is identical: use the same tools and automation against your OWN organization, before an attacker does.
The same OSINT techniques attackers use, defenders should use first:
1. Continuous subdomain monitoring
Run enumeration pipelines weekly. Diff against previous results.
New subdomains = potential shadow IT or attacker infrastructure.
Set up alerting for new entries that were not approved by IT.
2. Certificate Transparency alerts
Monitor CT logs for your domain and common misspellings.
New certs you did not request = shadow IT or phishing campaign.
Phishing domains often register certs 24-48 hours before launch.
3. Breach monitoring
Check employee emails against HIBP monthly. Breached accounts
need immediate password resets and MFA verification.
Track which breaches are new vs previously known.
4. GitHub scanning
Search for your company name, domain, and internal project names
on GitHub weekly. Employees leak credentials in personal repos.
Tools: GitGuardian (continuous), trufflehog (git history scanning)
5. Attack surface management platforms
Commercial: Censys ASM, Shodan Monitor, SecurityTrails, Mandiant ASM
These provide continuous discovery and monitoring of your
internet-facing assets with alerting on changes.
Open-source: ProjectDiscovery's suite (subfinder, httpx, nuclei)
6. Digital footprint reduction
- Remove old DNS records for decommissioned services
- Revoke unused SSL certificates
- Clean up public code repositories (check git history too)
- Minimize information in job postings (do you really need to
list your exact tech stack publicly?)
- Monitor executive social media for information disclosure
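For the "common misspellings" part of item 2, you need a watchlist of lookalike domains to feed into your CT monitoring. Dedicated tools such as dnstwist generate far more variant types (homoglyphs, TLD swaps, and so on); the sketch below only covers three simple typo classes and is meant to show the idea, not replace them. It assumes you pass a registrable domain like target.com, not a full hostname, and the function name is hypothetical.

```python
def typo_variants(domain: str) -> set[str]:
    """Generate simple typo variants of a registrable domain.

    Covers character omission, character repetition, and adjacent
    transposition on the part before the first dot.
    """
    name, dot, tld = domain.lower().partition(".")
    if not dot or not name or not tld:
        raise ValueError("expected a domain like 'target.com'")

    variants = set()
    for i in range(len(name)):
        variants.add(name[:i] + name[i + 1:])                # omission
        variants.add(name[:i] + name[i] * 2 + name[i + 1:])  # repetition
        if i < len(name) - 1:                                # transposition
            variants.add(name[:i] + name[i + 1] + name[i] + name[i + 2:])

    variants.discard(name)  # swapping two identical letters changes nothing
    variants.discard("")    # single-character names collapse to nothing
    return {f"{v}.{tld}" for v in variants}


if __name__ == "__main__":
    for candidate in sorted(typo_variants("target.com")):
        print(candidate)
```

Most of the generated names will not be registered at all, which is fine: the list is a filter for CT log entries, not a list of confirmed phishing domains.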
The concept of Attack Surface Management (ASM) has become a commercial product category in the last few years -- Censys, Shodan, SecurityTrails, and others sell continuous monitoring services that essentially run the automation from this episode 24/7 against your organization and alert you when something changes. The value proposition is real: most organizations do not know their own attack surface. Shadow IT (departments deploying cloud services without IT approval), forgotten development servers, acquired company infrastructure that was never integrated into the security program, contractor-deployed test environments -- all of these create exposure that no CMDB or asset inventory captures accurately. Automated discovery is the only way to get a genuinely complete picture.
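The core of that "alert when something changes" loop is nothing more than a diff between two enumeration runs, as described in item 1 of the checklist above. A minimal sketch, assuming each run produces a plain list of hostnames and that IT maintains some kind of allowlist of approved hosts. The function name, the `approved` parameter, and the result keys are illustrative assumptions.

```python
def diff_snapshots(previous, current, approved=()):
    """Compare two subdomain snapshots.

    previous / current -- iterables of hostnames from two pipeline runs
    approved           -- hypothetical allowlist of hosts IT knows about

    Returns sorted lists of added hosts, removed hosts, and added hosts
    that are not on the allowlist (the ones worth an alert).
    """
    def clean(hosts):
        return {h.strip().lower() for h in hosts if h.strip()}

    prev, curr, allow = clean(previous), clean(current), clean(approved)
    added = curr - prev
    return {
        "added": sorted(added),
        "removed": sorted(prev - curr),
        "unapproved": sorted(added - allow),
    }


if __name__ == "__main__":
    last_week = ["www.target.com", "mail.target.com", "old.target.com"]
    this_week = ["www.target.com", "mail.target.com",
                 "shop.target.com", "test-7.target.com"]
    print(diff_snapshots(last_week, this_week, approved=["shop.target.com"]))
```

Removed hosts deserve a look too. A subdomain that disappears from enumeration but still has a DNS record pointing at a deprovisioned cloud resource is a dangling record, and that belongs on the cleanup list in item 6.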
The digital footprint reduction point deserves emphasis. Every piece of information that is publicly discoverable about your organization is OSINT ammunition for an attacker. Job postings that specify "experience with Kubernetes 1.28, Terraform, and AWS Lambda in us-east-1" tell an attacker exactly what your infrastructure looks like and where it is hosted. Conference talks where engineers discuss their architecture provide the same information. Executive LinkedIn profiles listing their direct reports provide organizational charts for social engineering targeting. None of this is a vulnerability in the traditional sense, but all of it is intelligence that makes the attacker's job easier. Reducing what is discoverable -- without going dark entirely, which is impractical -- is a legitimate defensive strategy.
The AI Slop Connection
AI supercharges OSINT automation in both directions, and the implications are significant. On the offensive side, AI can process hundreds of LinkedIn profiles and build organizational charts in minutes. AI-powered tools correlate data across platforms -- linking a GitHub username to a Twitter handle to a LinkedIn profile to a real identity -- by recognizing patterns in usernames, profile pictures, writing style, and cross-platform references. The Python scripts in this episode are straightforward automation; add a language model to the pipeline and you get an analyst that can read every GitHub commit message, every LinkedIn post, every public forum comment, and extract relevant intelligence from the noise. What used to require a human analyst's full attention for a week can be done in an afternoon.
On the defensive side, the same capability means your organization's digital footprint is being analyzed by AI-powered tools at scale, right now, by threat actors you will never know about. Every public post, every job listing, every conference talk, every GitHub commit is intelligence. The question is not whether attackers are collecting this data -- they absolutely are. The question is whether you know what they are finding before they use it against you.
The output from OSINT automation -- the raw data, the correlations, the risk scores -- is only useful if it gets communicated effectively to the people who can act on it. All the subdomain enumeration and breach checking in the world is worthless if the report sits in a JSON file that nobody reads. Turning raw intelligence into actionable findings that drive remediation is a skill in itself, and arguably the most undervalued skill in the entire security profession ;-)
Exercises
Exercise 1: Build and run the subdomain enumeration pipeline from this episode against a domain you own or have explicit permission to test. Document: (a) results per tool (subfinder, amass, crt.sh) including the unique count from each, (b) total unique subdomains vs total with overlap (calculate the percentage of new discoveries each tool contributed beyond what the others found), (c) how many resolved to live hosts via httpx, (d) any surprising findings (subdomains you did not know existed, development environments, shadow IT). If you do not own a domain, use hackthebox.com or tryhackme.com as your target (these are intentionally public-facing security platforms). Save your full results and analysis to ~/lab-notes/osint-subdomain-pipeline.md.
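For part (b) of Exercise 1, a small helper along these lines can do the arithmetic for you. It assumes you have already loaded each tool's output into a set of hostnames; the function name and the result keys are my own. "Exclusive" here means found by that tool and by no other, expressed as a percentage of the total unique set.

```python
def tool_contribution(results: dict[str, set[str]]) -> dict:
    """Summarise overlap between enumeration tools.

    `results` maps a tool name to the set of hostnames it returned.
    """
    all_hosts = set().union(*results.values())
    tools = {}
    for tool, found in results.items():
        # Everything the OTHER tools found, to isolate this tool's extras.
        others = set().union(*(v for k, v in results.items() if k != tool))
        exclusive = found - others
        pct = round(100 * len(exclusive) / len(all_hosts), 1) if all_hosts else 0.0
        tools[tool] = {
            "found": len(found),
            "exclusive": len(exclusive),
            "exclusive_pct": pct,
        }
    return {
        "total_unique": len(all_hosts),
        "total_with_overlap": sum(len(v) for v in results.values()),
        "tools": tools,
    }


if __name__ == "__main__":
    summary = tool_contribution({
        "subfinder": {"a.example.com", "b.example.com", "c.example.com"},
        "amass": {"b.example.com", "c.example.com", "d.example.com"},
        "crt.sh": {"c.example.com", "e.example.com"},
    })
    print(summary)
```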
Exercise 2: Set up Certificate Transparency monitoring for your own domain (or a test domain you control). Run the ct_monitor.py script from this episode for 24 hours with a 1-hour polling interval. During the monitoring period, manually issue a Let's Encrypt certificate for a new subdomain using certbot certonly --standalone -d newhost.yourdomain.com. Verify the monitor detects the new certificate. Document: (a) the detection latency (time between certificate issuance and monitor alert), (b) any false positives or noise in the CT log data (wildcard certs, CDN certs, etc.), (c) how you would filter the results in a production deployment to reduce noise. If you cannot issue certificates, use the script against a large public domain (google.com, microsoft.com) for 24 hours and document every new certificate that appears. Save to ~/lab-notes/osint-ct-monitoring.md.
Exercise 3: Conduct a full OSINT assessment of a fictional company (create one: pick a domain you own, populate it with test DNS records and a few GitHub repos) or your own organization (with explicit written permission from your employer). Run ALL the automation from this episode: subdomain enumeration, breach checking (HIBP), GitHub secret scanning, and CT log analysis. Feed the results into the OSINTReport class to generate a consolidated JSON report. Then write a 1-page executive summary: what was found, what the risk level is, what the top 3 remediation priorities should be, and what ongoing monitoring you recommend. The executive summary is as important as the technical findings -- practice writing for a non-technical audience. Save both the JSON report and the executive summary to ~/lab-notes/osint-full-assessment/.