Learn Ethical Hacking (#65) - OSINT Automation - Large-Scale Intelligence Gathering

What will I learn

  • Automating OSINT at scale -- going from manual Google searches to systematic intelligence pipelines;
  • Subdomain enumeration pipelines -- chaining amass, subfinder, and httpx for comprehensive discovery;
  • Certificate Transparency monitoring -- real-time alerts when a target registers new certificates;
  • Social media intelligence -- automated collection from LinkedIn, Twitter, GitHub, and public forums;
  • Dark web monitoring -- safely searching for leaked credentials, stolen data, and threat actor chatter;
  • Data correlation and enrichment -- linking disparate OSINT sources into actionable intelligence;
  • Building OSINT workflows -- Python scripts that collect, deduplicate, enrich, and report;
  • Defense: monitoring your own attack surface, digital footprint reduction, takedown procedures.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • Python 3.11+ with requests and aiohttp;
  • Understanding of OSINT fundamentals from Episode 47;
  • The ambition to learn ethical hacking and security research.

Difficulty

  • Advanced

Curriculum (of the Learn Ethical Hacking Series):

Learn Ethical Hacking (#65) - OSINT Automation - Large-Scale Intelligence Gathering

Solutions to Episode 64 Exercises

Exercise 1: AFL++ fuzzing of config parser.

Target: custom config parser (key=value, sections, comments)
Compiled: afl-clang-fast -fsanitize=address config_parser.c
Seeds: 3 valid config files (5-50 lines each)
  seed1.txt: # comment-only file (12 lines of #comments)
  seed2.txt: key=value pairs (8 entries, no sections)
  seed3.txt: [section1]\nhost=localhost\n[section2]\nport=8080
Fuzzing duration: 30 minutes

Results:
  Total executions: 4.2 million
  Executions/sec: ~2,300 (reasonable for a parser with ASAN overhead)
  Unique paths: 847
  Crashes: 3
    - Crash 1: heap-buffer-overflow reading past key length
      (key with no '=' delimiter, parser reads past buffer)
      ASAN: READ of size 1 at offset 64 beyond 64-byte alloc
    - Crash 2: stack-buffer-overflow on line >4096 chars
      (fixed-size line buffer without length check)
      ASAN: WRITE of size 1 at offset 4097 beyond stack frame
    - Crash 3: null pointer deref on empty section header "[]"
      Parser calls strlen on section name extracted between
      brackets -- empty string passes, but next function
      dereferences name[0] assuming non-empty.
  All crashes reproduced with ASAN, minimized with afl-tmin.
  Crash 1 minimized from 847 bytes to 10 bytes: "keynoequal"

The three crashes found here represent three of the most common parser bug categories. Crash 1 (missing delimiter) is the classic bounds check omission -- the parser searches for = using strchr, gets NULL back when the delimiter is absent, and then tries to compute key length by pointer subtraction against the NULL return. This is exactly the kind of input that no developer tests manually because every real config file has = signs in it. Crash 2 (oversized line) is the fixed buffer overflow -- the developer picked 4096 as "big enough" and never checked whether the input respected that assumption. Crash 3 (empty section header) is a logic error masked by convention -- every real config file has section names inside brackets, so the parser never had to handle the empty case. AFL generated [] as a mutation, which is structurally valid (the brackets match) but semantically empty, and the downstream code crashed. Three bugs in 30 minutes from a parser the developer probably considered "done" -- and that is precisely why fuzzing exists.

Exercise 2: ffuf web fuzzing comparison.

Target: DVWA on 192.168.1.100 (Apache + PHP)
Wordlist: raft-large-words.txt (119,600 entries, SecLists)

Directory brute force:
  ffuf:     14,200 req/sec, 119,600 tested in 8.4 seconds
            Found 31 paths (200/301/403 responses)
            Filter used: -fc 404 (exclude Not Found)
  gobuster: 9,100 req/sec, same wordlist in 13.1 seconds
            Found 31 paths (identical results)
  Ratio: ffuf 1.56x faster in this run (14,200 vs 9,100 req/sec)

Parameter discovery on /vulnerabilities/sqli/:
  ffuf -u "http://target/vulnerabilities/sqli/?FUZZ=test"
    -w burp-parameter-names.txt -fs 1785
  Found: id (known), page (hidden redirect param), debug
    (returns verbose PHP error output when set to any value)
  Filter: -fs 1785 (baseline response size for unknown params)

Virtual host enumeration:
  ffuf -u http://192.168.1.100/ -H "Host: FUZZ.lab.local"
    -w subdomains-top1million-5000.txt -fs 11321
  Found: admin.lab.local (302 -> /admin/login),
         dev.lab.local (200, development dashboard),
         staging.lab.local (200, copy of production app)
  Filter: -fs 11321 (default vhost response size)

The filter flags are the real skill here, not the tool itself. Without -fs 1785 on the parameter discovery run, every single one of the 6,453 tested parameter names returns a 200 response (the page renders regardless of what GET parameters you throw at it). The page is 1785 bytes when the parameter is ignored. When a parameter actually DOES something -- like debug changing the output to include PHP error messages -- the response size changes. The -fs (filter size) flag hides the 1785-byte baseline responses and only shows you the anomalies. Finding the right filter value is the first thing you do: send one request with a parameter you know does nothing (?aaaaaaa=test), note the response size, and that becomes your filter. The speed difference between ffuf and gobuster is real but not as important as getting the filters right -- a fast fuzzer with wrong filters gives you 119,600 false positives, and a slow fuzzer with correct filters gives you exactly the results that matter.
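The calibration step described above can be sketched in a few lines of Python. This is a minimal sketch, not part of ffuf: `pick_filter_size` and `anomalies` are hypothetical helpers, and the response sizes for `id` and `debug` are made-up illustration values. It assumes you have already recorded the response size for a handful of nonsense parameter names.

```python
from collections import Counter


def pick_filter_size(calibration_sizes):
    """Return the most common response size among the calibration requests."""
    if not calibration_sizes:
        raise ValueError("need at least one calibration response")
    size, _count = Counter(calibration_sizes).most_common(1)[0]
    return size


def anomalies(results, baseline):
    """Keep only (name, size) pairs whose size differs from the baseline."""
    return [(name, size) for name, size in results if size != baseline]


# Sizes observed for ?aaaaaaa=test, ?zzzzzzz=test, ?qqqqqqq=test
baseline = pick_filter_size([1785, 1785, 1785])

# Hypothetical fuzzing results: (parameter name, response size)
hits = anomalies([("id", 2104), ("foo", 1785), ("debug", 4870)], baseline)
print(baseline, hits)
```

ffuf can do a similar calibration on its own with the -ac flag, but knowing the baseline yourself makes it much easier to judge whether the filter is hiding something it should not.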

Exercise 3: OSS-Fuzz research.

Total bugs found: 10,000+ across 1,000+ projects (as of 2025)
Top 5 projects by bugs found:
  1. Chrome/Chromium  - 1,200+ bugs
  2. ffmpeg           - 400+ bugs
  3. OpenSSL          - 300+ bugs
  4. systemd          - 250+ bugs
  5. curl             - 200+ bugs

Most common CWE categories:
  heap-buffer-overflow:  ~35% of all findings
  use-after-free:        ~15%
  stack-buffer-overflow: ~12%
  integer-overflow:      ~8%
  null-dereference:      ~7%
  others:                ~23%

Average time from report to fix:
  Critical severity: 14 days
  High severity:     30 days
  Medium severity:   45 days

Case study: CVE-2022-1292 (OpenSSL c_rehash command injection)
  Found by: external researcher report (per the OpenSSL advisory), not OSS-Fuzz
  Bug class: OS command injection in the c_rehash Perl script
  Trigger: certificate filename containing shell metacharacters
  Impact: RCE when c_rehash processes untrusted cert directories
  Fix: replaced system() call with direct file operations
  Timeline: reported Apr 2, 2022, advisory published May 3, 2022 (31 days), CVSSv3 9.8

The distribution of bug types is telling. Heap buffer overflows dominate at 35% because heap allocations are where variable-length data lives (strings, packets, file contents), and fuzzers are exceptionally good at generating inputs that push length fields past what the allocation expects. The use-after-free at 15% is more interesting -- these bugs require a specific temporal sequence (allocate, free, use) that random testing would almost never trigger, but coverage-guided fuzzers evolve inputs that exercise exactly those paths by building on mutations that reach the free-then-use code. The CVE-2022-1292 case is a useful contrast, because it shows what this style of fuzzing does not cover: the c_rehash injection is a shell metacharacter issue in a Perl script, not a memory-safety bug in C, and according to the OpenSSL advisory it was reported by an external researcher rather than discovered by OSS-Fuzz. Filenames containing characters like backticks and semicolons were passed unsanitized to a shell, and catching that class of bug takes code review or a harness written specifically to feed hostile filenames to the script.


Episode 64 was about fuzzing -- throwing millions of garbage inputs at software at machine speed to find bugs that human review misses. We covered AFL++ for native code (coverage-guided, compile-time instrumentation, evolutionary input generation), libFuzzer for in-process targets, ffuf for web application fuzzing (directory discovery, parameter fuzzing, virtual host enumeration), protocol fuzzing at the binary level, and the triage process that turns a raw crash into a vulnerability assessment. The core insight was that developers test the happy path while fuzzers explore the hostile universe of malformed inputs that nobody thought to construct manually.

Today we pivot from finding bugs in code to finding information about targets. In episode 47 we covered OSINT fundamentals -- Google dorking, theHarvester, Sherlock, basic LinkedIn research, Shodan queries. Those techniques work beautifully when you have one target, one domain, one person of interest. But professional pentest engagements and red team operations almost never have just one target. You have a company with 50 domains, 10,000 employees, hundreds of web applications, shadow IT nobody documented, and the scope letter says "find everything." Manual OSINT does not scale to that. Automation does.

From Manual to Machine

The fundamental shift from manual to automated OSINT is not just about speed (though speed matters enormously -- what takes an analyst a full day of clicking and copy-pasting takes a pipeline five minutes). The real shift is about completeness. A human analyst checking subdomains will query crt.sh, maybe run subfinder, note down the results, and move on. An automated pipeline queries crt.sh (Certificate Transparency logs), subfinder, amass, GitHub code search, DNS brute forcing, web archives, and whatever other sources you wire in, deduplicates the combined results, and produces a list that is genuinely comprehensive rather than "the first 30 results I found before I got tired."

In a penetration test, the one subdomain you missed is the one running an outdated Jenkins instance with default credentials. The one GitHub commit you did not search for is the one where a developer pushed an AWS secret key three years ago and deleted it in the next commit (but git history is forever). Automated OSINT is not optional for professional security work -- it is the difference between "I checked a few things" and "I systematically enumerated everything that is publicly discoverable about this target." The reconnaissance phase (episode 4) taught us that attackers invest heavily in information gathering before touching a single system. Automation is how they do it at scale.

Subdomain Enumeration Pipeline

The modern approach chains multiple tools, each pulling from different data sources, and deduplicates the combined output. No single tool covers everything -- subfinder is excellent at passive API-based enumeration (it queries 30+ services like VirusTotal, Censys, Shodan, SecurityTrails), amass adds DNS brute forcing and recursive enumeration, crt.sh provides Certificate Transparency log data, and GitHub code search sometimes reveals subdomains that exist only in source code (development URLs, staging environments, internal API endpoints that developers accidentally committed to public repos).

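#!/bin/bash
# recon_pipeline.sh -- comprehensive subdomain enumeration
# Usage: GITHUB_TOKEN=... ./recon_pipeline.sh example.com

DOMAIN=$1
if [ -z "$DOMAIN" ]; then
    echo "Usage: $0 <domain>" >&2
    exit 1
fi

OUTPUT_DIR="./recon/$DOMAIN"
mkdir -p "$OUTPUT_DIR/screenshots"

echo "[*] Enumerating subdomains for $DOMAIN..."

# Tool 1: subfinder (passive, 30+ data sources)
subfinder -d "$DOMAIN" -silent -o "$OUTPUT_DIR/subfinder.txt"

# Tool 2: amass (passive mode here; drop -passive and add -brute
# for active DNS brute forcing if the scope allows it)
amass enum -passive -d "$DOMAIN" -o "$OUTPUT_DIR/amass.txt" 2>/dev/null

# Tool 3: Certificate Transparency logs
# name_value can hold several newline-separated names, some with a
# leading "*." wildcard, so strip that prefix before deduplicating
curl -s "https://crt.sh/?q=%25.$DOMAIN&output=json" | \
    jq -r '.[].name_value' 2>/dev/null | sed 's/^\*\.//' | \
    sort -u > "$OUTPUT_DIR/crt.txt"

# Tool 4: GitHub code search (finds subdomains in source code)
# Requires GITHUB_TOKEN
github-subdomains -d "$DOMAIN" -t "$GITHUB_TOKEN" -o "$OUTPUT_DIR/github.txt" 2>/dev/null

# Combine and deduplicate. List the four inputs explicitly: a *.txt glob
# would also pick up all_subdomains.txt and live_hosts.txt on reruns
(cd "$OUTPUT_DIR" && cat subfinder.txt amass.txt crt.txt github.txt 2>/dev/null) | \
    tr 'A-Z' 'a-z' | sort -u > "$OUTPUT_DIR/all_subdomains.txt"
TOTAL=$(wc -l < "$OUTPUT_DIR/all_subdomains.txt")
echo "[*] Found $TOTAL unique subdomains"

# Probe for live HTTP/HTTPS services
echo "[*] Checking which are live..."
httpx -l "$OUTPUT_DIR/all_subdomains.txt" -silent -status-code \
    -title -tech-detect -o "$OUTPUT_DIR/live_hosts.txt"

LIVE=$(wc -l < "$OUTPUT_DIR/live_hosts.txt")
echo "[*] $LIVE live hosts found"

# Screenshot live hosts. live_hosts.txt lines carry status/title/tech
# columns after the URL, so keep only the first field for gowitness
echo "[*] Taking screenshots..."
awk '{print $1}' "$OUTPUT_DIR/live_hosts.txt" > "$OUTPUT_DIR/live_urls.txt"
# gowitness v2 syntax; newer releases moved this under "gowitness scan file"
gowitness file -f "$OUTPUT_DIR/live_urls.txt" -P "$OUTPUT_DIR/screenshots/"

echo "[*] Done. Results in $OUTPUT_DIR/"

Typical results for a medium-sized company: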
  subfinder:  150 subdomains (API sources: VirusTotal, Shodan, Censys...)
  amass:      280 subdomains (includes overlap + DNS brute force findings)
  crt.sh:     120 subdomains (certificate names only)
  GitHub:     30 subdomains (hardcoded URLs in public repos)
  Combined:   340 unique subdomains (after dedup)
  Live:       185 responding to HTTP/HTTPS

Without automation: manually finding 340 subdomains would take
an analyst an entire day of tab-switching and note-taking.
With the pipeline: under 5 minutes.

The deduplication step is more important than it might appear. Each tool returns results in a slightly different format -- some include wildcards (*.example.com), some include port numbers, some return both the raw domain and the www. prefix. The sort -u handles exact duplicates only, so in production pipelines you would also normalize the data (strip ports and wildcard prefixes, lowercase everything) before deduplication. The reason for running four tools instead of just the "best" one is that no single tool has access to all data sources. Subfinder might find staging-api.target.com through a VirusTotal passive DNS record that amass never sees because amass's VirusTotal integration uses a different API endpoint. Amass might discover internal-tools.target.com through recursive DNS resolution that subfinder's passive-only approach cannot perform. The union of all four tools is larger than any individual tool's output: in the example above, 340 unique subdomains against 280 from amass alone, roughly 21% more than the best single tool.
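The normalization described above is easier to get right in Python than in a shell one-liner. The following is a minimal sketch under stated assumptions: `normalize_host` and `merge_sources` are hypothetical helper names, each source is a list of raw lines as the tools wrote them, and the target domain is passed in lowercase.

```python
def normalize_host(raw, domain):
    """Lowercase and strip scheme, path, port, wildcard prefix and trailing dot.

    Returns None for entries that fall outside the target domain.
    """
    host = raw.strip().lower()
    if "://" in host:
        host = host.split("://", 1)[1]
    host = host.split("/", 1)[0]   # drop any path
    host = host.split(":", 1)[0]   # drop any port
    if host.startswith("*."):
        host = host[2:]
    host = host.rstrip(".")
    if host == domain or host.endswith("." + domain):
        return host
    return None


def merge_sources(sources, domain):
    """Union several tool outputs into one sorted, deduplicated list."""
    merged = set()
    for lines in sources:
        for line in lines:
            host = normalize_host(line, domain)
            if host:
                merged.add(host)
    return sorted(merged)


subfinder = ["API.Example.com", "*.example.com"]
crt_sh = ["api.example.com:8443", "https://www.example.com/login", "evil.com"]
print(merge_sources([subfinder, crt_sh], "example.com"))
# ['api.example.com', 'example.com', 'www.example.com']
```

Dropping out-of-scope names at this stage also keeps stray entries (CDN hostnames, third-party domains that share a certificate) away from the later probing steps.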

The httpx resolution step filters the enumerated subdomains down to those that actually respond to HTTP requests. Many enumerated subdomains are dead -- old DNS records pointing to decommissioned servers, expired certificates for services that no longer exist, development environments that were shut down but never cleaned up in DNS. httpx also extracts useful metadata (HTTP status codes, page titles, technology fingerprints via Wappalyzer signatures) that helps you prioritize which hosts to investigate further. A host returning a 200 with title "Jenkins [Jenkins]" is a much higher-priority finding than a host returning a generic 403 Forbidden page.

The gowitness screenshots at the end are for the report -- and for your own sanity when dealing with 185 live hosts. Scrolling through a folder of screenshots is dramatically faster for initial triage than manually visiting each URL. You can immediately spot login pages, default installations, error pages that leak information, and admin panels that should not be publicly accessible. Visual triage at scale is one of those things that sounds trivial until you have actually tried to manually visit 185 URLs one by one.

Certificate Transparency Monitoring

CT logs are public records of every SSL/TLS certificate ever issued by participating Certificate Authorities. This was created as a transparency mechanism to detect mis-issued certificates (remember the DigiNotar incident from 2011? -- a compromised CA issued fraudulent certificates for google.com, and nobody knew until it was too late). For OSINT purposes, CT logs are an intelligence goldmine because they reveal infrastructure changes in real time: when a target registers a new certificate, it usually means new infrastructure is being deployed.

#!/usr/bin/env python3
"""ct_monitor.py -- monitor Certificate Transparency for new certs"""
import sys
import time
from datetime import datetime, timezone

import requests

TS_FORMAT = '%Y-%m-%dT%H:%M:%S'


def utc_now():
    """Current UTC time as an ISO string comparable with entry_timestamp."""
    return datetime.now(timezone.utc).strftime(TS_FORMAT)


def check_new_certs(domain, last_check):
    """Query crt.sh for certificates logged after last_check.

    Returns a list of new entries, or None if the query failed.
    """
    try:
        # params= lets requests URL-encode the % wildcard
        r = requests.get("https://crt.sh/",
                         params={'q': f'%.{domain}', 'output': 'json'},
                         timeout=30)
        r.raise_for_status()
        certs = r.json()
    except (requests.RequestException, ValueError) as exc:
        print(f"[-] crt.sh query failed: {exc}")
        return None

    new_certs = []
    for cert in certs:
        issued = cert.get('entry_timestamp') or ''
        # ISO timestamps in the same layout compare correctly as strings
        if issued > last_check:
            new_certs.append({
                # name_value may hold several newline-separated names
                'domain': (cert.get('name_value') or '').replace('\n', ', '),
                'issued': issued,
                'issuer': cert.get('issuer_name') or '',
            })
    return new_certs


def monitor(domain, interval=3600):
    """Continuously monitor for new certificates."""
    last_check = utc_now()
    print(f"[*] Monitoring CT logs for {domain}")
    print(f"[*] Checking every {interval} seconds")

    while True:
        time.sleep(interval)
        started = utc_now()  # taken before the query so no window is skipped
        new = check_new_certs(domain, last_check)
        if new is None:
            continue  # keep the old last_check and retry next round
        for cert in new:
            print(f"[!] NEW CERT: {cert['domain']} "
                  f"(issued {cert['issued']}, issuer: {cert['issuer'][:40]})")
        last_check = started


if __name__ == '__main__':
    monitor(sys.argv[1] if len(sys.argv) > 1 else 'example.com')

What CT monitoring reveals in practice:
- New subdomains being deployed (staging.target.com, api-v2.target.com)
- Shadow IT (departments spinning up services without central IT approval)
- Phishing infrastructure (attacker registers target-login.com or target-sso.com; lookalike names need their own queries, since %.target.com does not match them)
- Development/test environments exposed to the internet
- Partner integrations (partner-api.target.com)
- Certificate renewals (useful for tracking certificate management practices)
- Wildcard certificate usage (*.target.com -- a single cert covering everything)

The offensive value of CT monitoring is timing. When you see a new certificate for staging-v2.target.com appear in the logs, that infrastructure is probably being set up RIGHT NOW -- which means it is likely in its most vulnerable state (default configurations, incomplete hardening, maybe even temporary credentials for the deployment team). Attackers monitor CT logs for exactly this reason: fresh infrastructure is easy infrastructure. The defensive value is equally important: if you see a certificate for target-sso.com and your organization did not request it, that is almost certainly a phishing campaign in preparation, and you have a window to take action (report the domain, update your email filters, alert your users) before the campaign launches.

Having said that, the crt.sh API has rate limits and sometimes returns incomplete results for very large domains. For production-grade CT monitoring, tools like certstream provide real-time streaming of CT log entries via a WebSocket connection -- you see every certificate within seconds of it being logged, not on a polling interval. Facebook's Certificate Transparency monitoring tool does the same thing internally and has caught quite a few phishing domains targeting their users before any employee received a single phishing email.
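Whichever feed supplies the certificate names (polling or streaming), the offensive and defensive cases need to be separated: names under your own domain versus lookalike names that merely contain your brand. A minimal sketch of that triage follows. `classify_cert_name` is a hypothetical helper, and the substring check is a deliberately crude heuristic that will produce false positives for short or generic brand names.

```python
def classify_cert_name(name, domain):
    """Label a certificate name relative to the monitored domain.

    Returns "own" for the domain and its subdomains, "lookalike" for other
    names that contain the brand label, and "unrelated" for everything else.
    """
    name = name.strip().lower().removeprefix("*.")
    domain = domain.lower()
    if name == domain or name.endswith("." + domain):
        return "own"
    brand = domain.split(".")[0]  # assumption: first label is the brand
    if brand in name:
        return "lookalike"
    return "unrelated"


for candidate in ["staging-v2.target.com", "*.target.com",
                  "target-sso.com", "example.org"]:
    print(candidate, "->", classify_cert_name(candidate, "target.com"))
```

Entries labeled "own" feed the attack surface inventory. Entries labeled "lookalike" go to whoever handles phishing takedowns.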

Automated Credential Monitoring

One of the highest-impact OSINT automation tasks is checking whether your organization's employee credentials have been exposed in data breaches. The logic is simple: if John's corporate email address appeared in the LinkedIn 2012 breach and John is still using the same password (or a predictable variation), that is a real attack vector -- no exploitation required, just credential stuffing.

#!/usr/bin/env python3
"""breach_monitor.py -- check employee emails against breach databases"""
import time
from urllib.parse import quote

import requests

# Delay between requests. HIBP limits depend on the subscription tier,
# so adjust this to match what your API key allows.
REQUEST_DELAY = 1.5


def check_hibp(email, api_key):
    """Check Have I Been Pwned for breaches. Returns a list of breach dicts."""
    # The account must be URL-encoded (the @ becomes %40)
    url = f"https://haveibeenpwned.com/api/v3/breachedaccount/{quote(email)}"
    headers = {
        'hibp-api-key': api_key,
        'user-agent': 'breach-monitor'
    }
    r = requests.get(url, headers=headers, timeout=10)
    if r.status_code == 429:
        # Rate limited: wait as instructed, then retry once
        time.sleep(int(r.headers.get('retry-after', 2)) + 1)
        r = requests.get(url, headers=headers, timeout=10)
    if r.status_code == 200:
        return r.json()
    if r.status_code != 404:  # 404 simply means "not in any breach"
        print(f"  [-] {email}: unexpected HTTP {r.status_code}")
    return []


def monitor_company(email_list_file, api_key):
    """Check all company emails for breach exposure."""
    with open(email_list_file) as f:
        emails = [line.strip() for line in f if line.strip()]

    print(f"[*] Checking {len(emails)} emails against HIBP...")
    exposed = []

    for email in emails:
        breaches = check_hibp(email, api_key)
        if breaches:
            breach_names = [b['Name'] for b in breaches]
            print(f"  [!] {email}: exposed in {', '.join(breach_names)}")
            exposed.append({
                'email': email,
                'breaches': breach_names,
                'breach_count': len(breaches),
            })
        time.sleep(REQUEST_DELAY)

    print(f"\n[*] {len(exposed)} / {len(emails)} emails found in breaches")
    return exposed

The 1.5-second delay between requests is worth noting because it defines the practical ceiling for automated checking: 10,000 employee emails take approximately 4 hours and 10 minutes to process (10,000 x 1.5 s = 15,000 s). The actual limit depends on your HIBP subscription tier, so check what your key allows before picking a delay. For large organizations, the higher tiers raise the rate limit, and HIBP's domain search returns every breached address at @company.com without knowing individual addresses, provided you can verify control of the domain. The alternative is maintaining your own breach database -- tools like h8mail aggregate results from multiple breach databases and paste sites -- but the legal and ethical considerations of possessing breach data are non-trivial, and for most pentest engagements the HIBP API (which never exposes actual passwords, only breach membership) is the appropriate tool.
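The arithmetic behind that ceiling is worth keeping as a helper, because it tells you up front whether a run fits into the engagement window. A minimal sketch: `estimate_runtime` is a hypothetical helper, and the 6-second delay in the second call is only an illustration of a stricter limit of 10 requests per minute, not a statement about any particular HIBP plan.

```python
def estimate_runtime(num_emails, delay_seconds):
    """Return (hours, minutes, seconds) needed to check num_emails sequentially."""
    total = int(num_emails * delay_seconds)
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return hours, minutes, seconds


print(estimate_runtime(10_000, 1.5))  # (4, 10, 0)
print(estimate_runtime(10_000, 6))    # (16, 40, 0)
```

The estimate ignores network latency and retries, so treat it as a lower bound.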

The pentest value of breach data is enormous. If 40% of a company's employees appear in breaches, the probability that at least some of those passwords are still in use (or have been "rotated" to predictable variants like Password2024! -> Password2025!) is extremely high. Combined with the username enumeration techniques from earlier in this series, breach data enables credential stuffing attacks that bypass technical controls entirely -- you are not exploiting software, you are logging in with valid credentials.

GitHub OSINT

Developers leak secrets on GitHub constantly. API keys, database credentials, internal URLs, private SSH keys, cloud access tokens -- all committed to public repositories by accident and (usually) deleted in the next commit. The problem is that git never forgets: a follow-up commit that deletes the secret leaves the original commit in the history. Even a force-push that rewrites history does not reliably erase it, because the old commit can survive in forks, in existing clones, and in cached views on GitHub. The only safe response to a leak is to revoke and rotate the secret. Automated GitHub OSINT searches for these leaked secrets at scale.

#!/usr/bin/env python3
"""github_osint.py -- search GitHub for leaked company secrets"""
import time

import requests

GITHUB_API = "https://api.github.com"
PER_PAGE = 30


def search_code(query, token, max_results=100):
    """Search GitHub code for secrets. Returns at most max_results hits."""
    headers = {'Authorization': f'token {token}'}
    results = []
    page = 1

    while len(results) < max_results:
        r = requests.get(f"{GITHUB_API}/search/code",
                         params={'q': query, 'per_page': PER_PAGE, 'page': page},
                         headers=headers, timeout=10)
        if r.status_code != 200:
            # 403/429 usually means the search rate limit was hit
            print(f"    [-] HTTP {r.status_code} for query: {query}")
            break
        items = r.json().get('items', [])
        for item in items:
            results.append({
                'repo': item['repository']['full_name'],
                'file': item['path'],
                'url': item['html_url'],
            })
        if len(items) < PER_PAGE:
            break  # last page reached
        page += 1

    return results[:max_results]


def hunt_secrets(domain, token):
    """Search for company secrets on GitHub."""
    # Assumption: the GitHub organization is named after the first
    # label of the domain. Verify this for the real target.
    org = domain.split(".")[0]
    queries = [
        f'"{domain}" password',
        f'"{domain}" api_key',
        f'"{domain}" secret_key',
        f'"{domain}" AWS_ACCESS_KEY',
        f'"{domain}" jdbc:mysql',
        f'org:{org} password filename:.env',
        f'org:{org} filename:credentials',
    ]

    all_findings = []
    for query in queries:
        print(f"  Searching: {query}")
        findings = search_code(query, token, max_results=20)
        for finding in findings:
            print(f"    [!] {finding['repo']} / {finding['file']}")
            print(f"        {finding['url']}")
        all_findings.extend(findings)
        time.sleep(6)  # code search has its own, stricter rate limit

    return all_findings

The org: prefix in the GitHub search API is critical because it restricts results to repositories owned by the organization (the script guesses the organization name from the first label of the domain, so check that it matches the target's real GitHub organization). Without it, you search ALL of GitHub and get drowned in false positives (every tutorial that uses example.com as a placeholder will match "example.com" password). The filename:.env filter is particularly effective because .env files are the standard location for environment variables (API keys, database URLs, secret keys) in modern web applications, and developers who forget to add .env to their .gitignore commit these files to version control regularly.

For more thorough secret detection, trufflehog (from Truffle Security) scans the ENTIRE git history of a repository -- not just the current HEAD -- looking for high-entropy strings and known secret patterns (AWS keys, Slack tokens, private keys). It catches secrets that were committed three years ago and deleted in the next commit. GitGuardian provides the same capability as a SaaS product with continuous monitoring -- useful for organizations that want to detect leaked secrets in real time rather than retroactively. On a pentest engagement, running trufflehog against the target organization's public GitHub repos is one of the highest-yield activities in the entire reconnaissance phase, because a valid AWS access key is not a vulnerability you need to exploit -- it is a door that is already open.
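The two detection ideas named above, known patterns and high-entropy strings, can be illustrated in a few lines. This is a simplified sketch of the general technique, not trufflehog's actual implementation: the token regex, the 4.0 bits-per-character threshold and the `scan_line` helper are assumptions chosen for illustration. The AWS key used in the demo is the placeholder value from AWS's own documentation.

```python
import math
import re
from collections import Counter

# Known pattern: AWS access key IDs start with AKIA followed by 16 characters
AWS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")
# Candidate tokens: long runs of base64-style characters
TOKEN = re.compile(r"[A-Za-z0-9+/]{20,}")


def shannon_entropy(s):
    """Shannon entropy of a string in bits per character."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def scan_line(line, threshold=4.0):
    """Return (kind, value) findings for one line of text."""
    findings = [("aws_access_key_id", m.group(1))
                for m in AWS_KEY_ID.finditer(line)]
    for m in TOKEN.finditer(line):
        token = m.group(0)
        if AWS_KEY_ID.fullmatch(token):
            continue  # already reported as a known pattern
        if shannon_entropy(token) >= threshold:
            findings.append(("high_entropy", token))
    return findings


print(scan_line("AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE"))
# [('aws_access_key_id', 'AKIAIOSFODNN7EXAMPLE')]
```

To cover history rather than HEAD, a scanner runs this kind of check over every added line in every commit. Entropy alone is noisy (hashes and minified code also score high), so treat its hits as candidates to verify.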

Beyond the Public Internet

The aforementioned techniques all operate on publicly accessible data -- DNS records, certificates, GitHub, breach databases. But OSINT extends beyond what is indexed by search engines.

Social media intelligence (SOCMINT) involves systematic collection from platforms like LinkedIn, Twitter, GitHub profiles, personal blogs, and public forums. For pentest engagements, LinkedIn is the most valuable source because it reveals organizational structure (who reports to whom), technology stack (job postings list specific technologies -- "experience with Kubernetes, Terraform, and AWS Lambda" tells you exactly what the target's infrastructure looks like), and individual targets for social engineering (the help desk employee, the new hire who is not yet security-aware, the sysadmin who posts about their work on Twitter). Tools like linkedin2username generate likely email formats from LinkedIn employee names, and Sherlock (which we covered in episode 47) maps a username across 300+ platforms to build a profile of a person's online presence.
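Turning a list of employee names into candidate addresses is mostly string formatting. The sketch below shows the idea in the spirit of linkedin2username, but it is not that tool's code: `email_candidates` is a hypothetical helper and the list of formats is an assumption covering a few common corporate conventions.

```python
def email_candidates(full_name, domain):
    """Generate likely corporate email addresses for one employee name."""
    parts = full_name.lower().split()
    if len(parts) < 2:
        return []  # need at least a first and a last name
    first, last = parts[0], parts[-1]
    local_parts = [
        f"{first}.{last}",     # john.smith
        f"{first}{last}",      # johnsmith
        f"{first[0]}{last}",   # jsmith
        f"{first}{last[0]}",   # johns
        f"{first[0]}.{last}",  # j.smith
        f"{last}.{first}",     # smith.john
        first,                 # john
    ]
    return [f"{local}@{domain}" for local in local_parts]


for address in email_candidates("John Smith", "target.com"):
    print(address)
```

Once one real address confirms the format (from a breach hit, a commit author field, or a press contact), you can discard the other patterns and generate the whole employee list in that single format.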

Dark web monitoring is the other end of the spectrum. Threat intelligence teams and SOC analysts monitor underground forums, paste sites, and marketplace listings for mentions of their organization. The practical tools for this include IntelX (Intelligence X -- a search engine for the historical internet including leaked databases, paste sites, and Tor content), commercial platforms like Recorded Future and DarkTracer, and Telegram channel monitors (many threat actors have moved from dark web forums to Telegram groups, which are substantially easier to monitor). The defensive goal is early warning: if an employee's credentials appear for sale on a dark web marketplace, you want to know about it before the buyer uses them.

OSINT data source hierarchy (by accessibility):
  1. Public internet     (Google, Shodan, Censys, crt.sh)
  2. Social media        (LinkedIn, Twitter, GitHub, Reddit)
  3. Breach databases    (HIBP, IntelX, dehashed)
  4. Code repositories   (GitHub, GitLab, Bitbucket -- public repos)
  5. Paste sites         (Pastebin, Ghostbin, dpaste)
  6. Deep web            (forums requiring registration, gated content)
  7. Dark web            (Tor hidden services, I2P, marketplace listings)

Automation coverage:
  Layers 1-5: fully automatable with public APIs and web scraping
  Layer 6: partially automatable (account creation + scraping)
  Layer 7: requires specialized tools and significant OPSEC

The key insight for OSINT automation is that, as a rule of thumb, layers 1-5 contain the vast majority (around 90%) of the actionable intelligence for most engagements. Going deeper (layers 6-7) has diminishing returns and increasing legal and ethical complexity. Unless you are specifically doing threat intelligence work (episode 52) or investigating an active incident (episode 51), the public layers provide more than enough data to map an organization's attack surface comprehensively.

Building an OSINT Dashboard

With data flowing in from multiple sources -- subdomains, certificates, breaches, GitHub findings, social media profiles -- the challenge shifts from collection to correlation. A subdomain list and a breach list are useful individually, but linking them reveals patterns that neither source shows alone. If admin.target.com is in your subdomain list and the matching admin mailbox at target.com appears in three breaches, that is a high-priority finding that combines infrastructure exposure with credential compromise.

#!/usr/bin/env python3
"""osint_report.py -- generate consolidated OSINT report"""
import json
from datetime import datetime, timezone

class OSINTReport:
    def __init__(self, target):
        self.report = {
            'target': target,
            # datetime.utcnow() is deprecated since Python 3.12
            'generated': datetime.now(timezone.utc).isoformat(),
            'sections': {}
        }

    def add_subdomains(self, subdomains):
        self.report['sections']['subdomains'] = {
            'total': len(subdomains),
            'live': len([s for s in subdomains if s.get('live')]),
            'data': subdomains
        }

    def add_breaches(self, breaches):
        self.report['sections']['breaches'] = {
            'exposed_accounts': len(breaches),
            'total_breaches': sum(b['breach_count'] for b in breaches),
            'data': breaches
        }

    def add_github_findings(self, findings):
        self.report['sections']['github'] = {
            'total_findings': len(findings),
            'repos_affected': len(set(f['repo'] for f in findings)),
            'data': findings
        }

    def add_ct_certs(self, certs):
        self.report['sections']['certificates'] = {
            'new_certs': len(certs),
            'data': certs
        }

    def generate(self):
        risk = 0
        subs = self.report['sections'].get('subdomains', {})
        risk += min(subs.get('live', 0) * 0.5, 25)

        breaches = self.report['sections'].get('breaches', {})
        risk += min(breaches.get('exposed_accounts', 0) * 5, 30)

        github = self.report['sections'].get('github', {})
        risk += min(github.get('total_findings', 0) * 10, 30)

        certs = self.report['sections'].get('certificates', {})
        risk += min(certs.get('new_certs', 0) * 2, 15)

        self.report['risk_score'] = min(risk, 100)
        self.report['risk_level'] = (
            'CRITICAL' if risk > 75 else
            'HIGH' if risk > 50 else
            'MEDIUM' if risk > 25 else
            'LOW'
        )

        return self.report

    def save(self, filename):
        report = self.generate()
        with open(filename, 'w') as f:
            json.dump(report, f, indent=2)
        print(f"Report saved: {filename}")
        print(f"Risk: {report['risk_level']} ({report['risk_score']}/100)")

The risk scoring model here is deliberately simple -- linear weights per category, capped per section to prevent any single source from dominating the score. In practice you would weight the categories based on the engagement scope and the client's threat model: a financial institution might weight breach exposure at 3x compared to subdomain count (because credential compromise leads directly to fraud), while a SaaS company might weight GitHub findings at 3x (because leaked API keys grant direct access to customer data). The point of the automated scoring is not to replace analyst judgment -- it is to provide a consistent baseline that enables comparison over time. If you run the same pipeline weekly and the risk score jumps from 35 to 68, something changed and you need to investigate.
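When the score jumps between two weekly runs, the first question is what changed. A minimal sketch of that comparison follows, working on two report dictionaries in the shape produced by the generate method. It assumes each subdomain entry carries its hostname under a 'host' key, which the report class above does not prescribe. `new_hosts` and `score_delta` are hypothetical helpers.

```python
def new_hosts(previous_report, current_report):
    """Hostnames present in the current report but not in the previous one."""
    def hosts(report):
        data = report.get('sections', {}).get('subdomains', {}).get('data', [])
        return {entry['host'] for entry in data}  # 'host' key is an assumption
    return sorted(hosts(current_report) - hosts(previous_report))


def score_delta(previous_report, current_report):
    """Change in risk score between two runs."""
    return current_report.get('risk_score', 0) - previous_report.get('risk_score', 0)


last_week = {
    'risk_score': 35,
    'sections': {'subdomains': {'data': [{'host': 'www.target.com', 'live': True}]}},
}
this_week = {
    'risk_score': 68,
    'sections': {'subdomains': {'data': [
        {'host': 'www.target.com', 'live': True},
        {'host': 'staging-v2.target.com', 'live': True},
    ]}},
}
print(score_delta(last_week, this_week), new_hosts(last_week, this_week))
# 33 ['staging-v2.target.com']
```

The same set-difference approach works for breached accounts and GitHub findings, as long as each entry has a stable identifier to compare on.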

The dashboard approach also enables correlation across data sources. Python's set operations make this trivial: take the set of leftmost labels from the subdomain hostnames (the "jsmith" in jsmith.target.com), take the set of email prefixes from the breach data, compute the intersection, and you have a list of hosts whose names match compromised user accounts. Add the GitHub findings (which often contain database connection strings with hostnames), intersect those hostnames with the subdomain list, and you might discover that db-production.target.com is both publicly discoverable AND has its credentials in a public GitHub repo. Each data source is moderately useful alone. The combination is devastating.
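
A minimal sketch of that correlation, assuming the inputs are plain lists of hostnames, breached email addresses, and text snippets from GitHub findings. The function names and the hostname regex are illustrative and not taken from the scripts in this episode; the regex is deliberately loose and will need tuning against real data.

```python
import re

# Loose hostname pattern for pulling hosts out of connection strings.
# This is an assumption for illustration, not a full RFC-compliant matcher.
HOST_RE = re.compile(r'[a-z0-9][a-z0-9.-]*\.[a-z]{2,}', re.IGNORECASE)


def host_labels(subdomains):
    """Map the leftmost DNS label to its hostname (last one wins on collision)."""
    return {h.split('.', 1)[0].lower(): h.lower() for h in subdomains}


def email_prefixes(emails):
    """Return the lowercase local part of each email address."""
    return {e.split('@', 1)[0].lower() for e in emails}


def hosts_in_text(snippets):
    """Extract every hostname-looking string from the given text snippets."""
    found = set()
    for snippet in snippets:
        found.update(match.lower() for match in HOST_RE.findall(snippet))
    return found


def correlate(subdomains, breached_emails, github_snippets):
    """Intersect the three sources and return the overlapping hosts."""
    labels = host_labels(subdomains)
    shared = labels.keys() & email_prefixes(breached_emails)
    known_hosts = {h.lower() for h in subdomains}
    leaked = known_hosts & hosts_in_text(github_snippets)
    return {
        'hosts_matching_breached_accounts': sorted(labels[l] for l in shared),
        'hosts_with_leaked_credentials': sorted(leaked),
    }


if __name__ == '__main__':
    print(correlate(
        ['db-production.target.com', 'jsmith.target.com', 'www.target.com'],
        ['jsmith@target.com', 'alice@target.com'],
        ['postgres://app:hunter2@db-production.target.com:5432/app'],
    ))
```

Treat the output as leads for an analyst to verify, not as findings: a label match between a hostname and an email prefix is a hint, nothing more.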

Defense: Attack Surface Management

Everything in this episode so far has been offensive -- techniques for mapping a target's exposure. The defensive application is identical: use the same tools and automation against your OWN organization, before an attacker does.

The same OSINT techniques attackers use, defenders should use first:

1. Continuous subdomain monitoring
   Run enumeration pipelines weekly. Diff against previous results.
   New subdomains = potential shadow IT or attacker infrastructure.
   Set up alerting for new entries that were not approved by IT.

2. Certificate Transparency alerts
   Monitor CT logs for your domain and common misspellings.
   New certs you did not request = shadow IT or phishing campaign.
   Phishing domains often register certs 24-48 hours before launch.

3. Breach monitoring
   Check employee emails against HIBP monthly. Breached accounts
   need immediate password resets and MFA verification.
   Track which breaches are new vs previously known.

4. GitHub scanning
   Search for your company name, domain, and internal project names
   on GitHub weekly. Employees leak credentials in personal repos.
   Tools: GitGuardian (continuous), trufflehog (git history scanning)

5. Attack surface management platforms
   Commercial: Censys ASM, Shodan Monitor, SecurityTrails, Mandiant ASM
   These provide continuous discovery and monitoring of your
   internet-facing assets with alerting on changes.
   Open-source: ProjectDiscovery's suite (subfinder, httpx, nuclei)

6. Digital footprint reduction
   - Remove old DNS records for decommissioned services
   - Revoke unused SSL certificates
   - Clean up public code repositories (check git history too)
   - Minimize information in job postings (do you really need to
     list your exact tech stack publicly?)
   - Monitor executive social media for information disclosure

The concept of Attack Surface Management (ASM) has become a commercial product category in the last few years -- Censys, Shodan, SecurityTrails, and others sell continuous monitoring services that essentially run the automation from this episode 24/7 against your organization and alert you when something changes. The value proposition is real: most organizations do not know their own attack surface. Shadow IT (departments deploying cloud services without IT approval), forgotten development servers, acquired company infrastructure that was never integrated into the security program, contractor-deployed test environments -- all of these create exposure that no CMDB or asset inventory captures accurately. Automated discovery is the only way to get a genuinely complete picture.
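
The core of that "alert when something changes" loop is a diff between two snapshots. Here is a minimal sketch, assuming the enumeration pipeline hands you a list of hostnames and that you keep the previous run in a JSON file. The function name diff_snapshot, the state file layout, and the approved list are hypothetical choices for illustration.

```python
import json
from pathlib import Path


def diff_snapshot(current, state_file, approved=()):
    """Compare the current hostnames with the previous snapshot, then save.

    Returns the hosts that appeared, the hosts that disappeared, and the
    new hosts that are not on the (hypothetical) approved list.
    """
    path = Path(state_file)
    # First run: no snapshot yet, so every host counts as new
    previous = set(json.loads(path.read_text())) if path.exists() else set()
    current = {h.strip().lower() for h in current if h.strip()}

    added = current - previous
    result = {
        'added': sorted(added),
        'removed': sorted(previous - current),
        'unapproved': sorted(added - {a.lower() for a in approved}),
    }

    # Persist the current state for the next scheduled run
    path.write_text(json.dumps(sorted(current), indent=2))
    return result


if __name__ == '__main__':
    changes = diff_snapshot(
        ['www.example.com', 'dev.example.com'],
        'subdomains-snapshot.json',
        approved=['www.example.com'],
    )
    for host in changes['unapproved']:
        print(f"ALERT: unapproved new subdomain: {host}")
```

Run it from cron or a CI job after each enumeration pass. Hosts that disappear are worth a look too, since a removed host with a dangling DNS record is a subdomain takeover candidate.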

The digital footprint reduction point deserves emphasis. Every piece of information that is publicly discoverable about your organization is OSINT ammunition for an attacker. Job postings that specify "experience with Kubernetes 1.28, Terraform, and AWS Lambda in us-east-1" tell an attacker exactly what your infrastructure looks like and where it is hosted. Conference talks where engineers discuss their architecture provide the same information. Executive LinkedIn profiles listing their direct reports provide organizational charts for social engineering targeting. None of this is a vulnerability in the traditional sense, but all of it is intelligence that makes the attacker's job easier. Reducing what is discoverable -- without going dark entirely, which is impractical -- is a legitimate defensive strategy.

The AI Slop Connection

AI supercharges OSINT automation in both directions, and the implications are significant. On the offensive side, AI can process hundreds of LinkedIn profiles and build organizational charts in minutes. AI-powered tools correlate data across platforms -- linking a GitHub username to a Twitter handle to a LinkedIn profile to a real identity -- by recognizing patterns in usernames, profile pictures, writing style, and cross-platform references. The Python scripts in this episode are straightforward automation; add a language model to the pipeline and you get an analyst that can read every GitHub commit message, every LinkedIn post, every public forum comment, and extract relevant intelligence from the noise. What used to require a human analyst's full attention for a week can be done in an afternoon.

On the defensive side, the same capability means your organization's digital footprint is being analyzed by AI-powered tools at scale, right now, by threat actors you will never know about. Every public post, every job listing, every conference talk, every GitHub commit is intelligence. The question is not whether attackers are collecting this data -- they absolutely are. The question is whether you know what they are finding before they use it against you.

The output from OSINT automation -- the raw data, the correlations, the risk scores -- is only useful if it gets communicated effectively to the people who can act on it. All the subdomain enumeration and breach checking in the world is worthless if the report sits in a JSON file that nobody reads. Turning raw intelligence into actionable findings that drive remediation is a skill in itself, and arguably the most undervalued skill in the entire security profession ;-)

Exercises

Exercise 1: Build and run the subdomain enumeration pipeline from this episode against a domain you own or have explicit permission to test. Document: (a) results per tool (subfinder, amass, crt.sh) including the unique count from each, (b) total unique subdomains vs total with overlap (calculate the percentage of new discoveries each tool contributed beyond what the others found), (c) how many resolved to live hosts via httpx, (d) any surprising findings (subdomains you did not know existed, development environments, shadow IT). If you do not own a domain, use hackthebox.com or tryhackme.com as your target (these are intentionally public-facing security platforms). Save your full results and analysis to ~/lab-notes/osint-subdomain-pipeline.md.

Exercise 2: Set up Certificate Transparency monitoring for your own domain (or a test domain you control). Run the ct_monitor.py script from this episode for 24 hours with a 1-hour polling interval. During the monitoring period, manually issue a Let's Encrypt certificate for a new subdomain using certbot certonly --standalone -d newhost.yourdomain.com. Verify the monitor detects the new certificate. Document: (a) the detection latency (time between certificate issuance and monitor alert), (b) any false positives or noise in the CT log data (wildcard certs, CDN certs, etc.), (c) how you would filter the results in a production deployment to reduce noise. If you cannot issue certificates, use the script against a large public domain (google.com, microsoft.com) for 24 hours and document every new certificate that appears. Save to ~/lab-notes/osint-ct-monitoring.md.

Exercise 3: Conduct a full OSINT assessment of a fictional company (create one: pick a domain you own, populate it with test DNS records and a few GitHub repos) or your own organization (with explicit written permission from your employer). Run ALL the automation from this episode: subdomain enumeration, breach checking (HIBP), GitHub secret scanning, and CT log analysis. Feed the results into the OSINTReport class to generate a consolidated JSON report. Then write a 1-page executive summary: what was found, what the risk level is, what the top 3 remediation priorities should be, and what ongoing monitoring you recommend. The executive summary is as important as the technical findings -- practice writing for a non-technical audience. Save both the JSON report and the executive summary to ~/lab-notes/osint-full-assessment/.


Greetings!

@scipio


