Learn Ethical Hacking (#51) - Incident Response - When Things Go Wrong

@scipio 70

21 days ago

StemSocial

What will I learn

What incident response is and why every organization needs a plan before the breach happens;
The IR lifecycle -- preparation, identification, containment, eradication, recovery, lessons learned;
Triage and classification -- determining severity, scope, and priority under pressure;
Evidence collection -- forensically sound acquisition of memory, disk, logs, and network captures;
Containment strategies -- isolating compromised systems without alerting the attacker or destroying evidence;
Root cause analysis -- tracing the kill chain backwards from detection to initial access;
IR playbooks -- pre-built response procedures for common incident types (ransomware, BEC, data breach);
Defense: building an IR team, tabletop exercises, IR retainers, and continuous improvement.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
Understanding of attack techniques from episodes 1-50;
Familiarity with logging and monitoring concepts;
The ambition to learn ethical hacking and security research.

Difficulty

Intermediate

Curriculum (of the `Learn Ethical Hacking Series`):

Learn Ethical Hacking (#51) - Incident Response - When Things Go Wrong

Solutions to Episode 50 Exercises

Exercise 1: Red team operation plan (abbreviated).

Target: Apex Financial Services (fictional)
Objectives:
  1. Access the CEO's email inbox
  2. Exfiltrate customer database (prove access, do not copy real data)
  3. Demonstrate ransomware deployment capability on a test subnet

Kill chain mapping:
  Recon: LinkedIn OSINT (T1593), DNS enumeration (T1596)
  Weaponize: Sliver implant with HTTPS C2 (T1587.001)
  Deliver: spearphishing to HR (T1566.001) + credential harvest page
  Exploit: OAuth token theft from O365 (T1528)
  Install: scheduled task persistence (T1053.005)
  C2: HTTPS beaconing via domain-fronted CDN (T1071.001)
  Objective: Kerberoast -> DA -> DCSync -> CEO mailbox (T1558.003)

OPSEC: domain registered 45 days prior, redirector infrastructure
  on separate cloud provider, C2 profile mimicking Slack API traffic.

The key insight with this operation plan is that every phase maps directly to ATT&CK technique IDs -- which is not just academic rigor, it's operational necessity. When the engagement is over and you're writing the report (as we discussed in episode 50), those technique IDs are how you tell the blue team exactly where their detection gaps are. "We used T1558.003 and you didn't catch it" is infinitely more actionable than "we did some Kerberos stuff."

Exercise 2: ATT&CK Navigator coverage map (abbreviated).

APT29 techniques assessed (selection):
  T1566.002 Spearphishing Link:     GREEN (email gateway detects)
  T1059.001 PowerShell:             YELLOW (logged but no alert rule)
  T1003.001 LSASS Memory:           RED (no detection)
  T1053.005 Scheduled Task:         YELLOW (logged, high FP rate)
  T1021.006 WinRM:                  RED (no detection)

Top 5 gaps:
  1. T1003.001 -- no LSASS protection or dump detection
  2. T1021.006 -- WinRM lateral movement unmonitored
  3. T1070.001 -- log clearing undetected
  4. T1090.004 -- domain fronting undetectable at network layer
  5. T1558.003 -- Kerberoasting (no TGS request volume alerting)

The Navigator is one of those tools that looks like a toy until you use it on a real environment -- then it becomes genuinely terrifying. You overlay APT29's known techniques, color-code your detection coverage, and suddenly you're staring at a heatmap that says "if Russian intelligence targeted us tomorrow, we would detect exactly 30% of their playbook." That's a conversation-starter for the CISO budget meeting, I can tell you that much ;-)
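
To make the heatmap idea concrete, here is a minimal sketch that turns the colour-coded selection from Exercise 2 into coverage percentages. It only covers the five techniques listed above, so the numbers it prints describe that sample and not a full APT29 overlay; the function names are made up for illustration.

```python
from collections import Counter

# Detection status per technique, copied from the Exercise 2 selection above.
COVERAGE = {
    "T1566.002": "GREEN",
    "T1059.001": "YELLOW",
    "T1003.001": "RED",
    "T1053.005": "YELLOW",
    "T1021.006": "RED",
}

def coverage_summary(coverage):
    """Return the percentage of techniques in each colour bucket."""
    counts = Counter(coverage.values())
    total = len(coverage)
    return {colour: round(100 * counts.get(colour, 0) / total)
            for colour in ("GREEN", "YELLOW", "RED")}

def blind_spots(coverage):
    """Techniques with no detection at all, sorted by ID."""
    return sorted(t for t, status in coverage.items() if status == "RED")

summary = coverage_summary(COVERAGE)
print(summary)
print("No detection:", blind_spots(COVERAGE))
```

YELLOW entries are logged but not alerted on, so whether you count them as "covered" is a judgment call you should make explicit before quoting a single percentage to anyone.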

Exercise 3: Mini purple team exercise (abbreviated).

T1059.001 PowerShell: Sysmon Event 1 logged (process creation),
  but no alert rule. Built: alert on powershell.exe with
  -encodedcommand or -e flag.

T1003.001 LSASS dump: Sysmon Event 10 logged (process access to
  lsass.exe). Built: alert on any non-system process accessing
  lsass.exe with READ access.

T1053.005 Scheduled task: Sysmon Event 1 + Windows Event 4698.
  Built: alert on schtasks.exe creating tasks with encoded payloads.

T1021.006 WinRM: Windows Event 6 (WSMan session). No Sysmon event.
  Built: alert on Event 6 from non-admin workstations.

T1070.001 Log clearing: Windows Event 1102 (audit log cleared).
  Built: alert on 1102 -- this should NEVER happen in production.

Five techniques tested, five detection rules built. That's the purple team value proposition in miniature. In a traditional red team engagement, you would have tested these same five techniques, written them up in a 40-page report that arrives six weeks later, and the blue team would start building detection rules in Q3. In the purple team model, the detection rules exist before lunch. The feedback loop is everything.
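
As an illustration of what the first of those rules might look like in code, here is a rough sketch of the encoded-PowerShell check. The event is a hypothetical dict with `Image` and `CommandLine` keys, loosely modelled on Sysmon Event ID 1 fields; a real rule would live in your SIEM's query language, and this version only handles parameters prefixed with a plain hyphen.

```python
def is_encoded_powershell(event):
    """True if a process-creation event looks like PowerShell run with -EncodedCommand.

    `event` is a hypothetical dict, not a real Sysmon API object.
    """
    image = event.get("Image", "").lower()
    if not image.endswith(("powershell.exe", "pwsh.exe")):
        return False
    for token in event.get("CommandLine", "").split():
        if not token.startswith("-"):
            continue
        flag = token[1:].lower()
        # PowerShell accepts abbreviated parameter names, so -e, -en, -enc ...
        # all resolve to -EncodedCommand; -ec is its documented short alias.
        if flag and ("encodedcommand".startswith(flag) or flag == "ec"):
            return True
    return False

PS = r"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe"
suspicious = {"Image": PS, "CommandLine": "powershell.exe -NoP -enc SQBFAFgA"}
benign = {"Image": PS, "CommandLine": "powershell.exe -ExecutionPolicy Bypass -File deploy.ps1"}

print(is_encoded_powershell(suspicious))
print(is_encoded_powershell(benign))
```

Matching on prefixes rather than on the two literal strings `-encodedcommand` and `-e` matters because an attacker can type `-en` or `-enco` and get the same behaviour.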

Learn Ethical Hacking (#51) - Incident Response - When Things Go Wrong

Episode 50 covered red team operations -- the discipline of combining every offensive skill we've built over 49 episodes into a single coordinated operation that simulates a real-world attacker. We walked through the difference between pentests and red teams (pentests find bugs, red teams test whether your organization survives a determined attacker), rules of engagement (the piece of paper that separates "authorized security professional" from "computer criminal"), the Lockheed Martin Cyber Kill Chain mapped against our entire curriculum, adversary emulation using MITRE ATT&CK profiles for specific threat actors like APT29 and FIN7, operational security for staying undetected inside a target network, C2 infrastructure architecture with redirectors and team servers, purple teaming as a collaborative alternative, and the red team report as an attack narrative rather than a vulnerability list. The core takeaway: a red team doesn't just find weaknesses -- it demonstrates how those weaknesses chain together into a complete organizational compromise.

Today we switch sides. For real, this time.

For 50 episodes you have been the attacker. You found the SQL injection (episode 12), escalated privileges (episodes 31-32), moved laterally through the domain (episode 34), exfiltrated data, and in episode 50 you orchestrated all of those skills into a coordinated red team operation. Very satisfying. Very educational. And completely useless the moment your phone rings at 3 AM and it's the SOC analyst saying "we have ransomware on the file servers."

What do you do RIGHT NOW?

Incident response is the discipline of answering that question -- under pressure, with incomplete information, while the clock is ticking and the damage is compounding every minute you spend figuring out what happened. It is (I argue) the most stressful job in cybersecurity, and the one that matters most when every other defense has failed.

Here we go.

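The IR Lifecycle -- Six Phases

The six-phase lifecycle used in this episode is the one popularized by SANS (often abbreviated PICERL). NIST SP 800-61 Rev. 2 (Computer Security Incident Handling Guide) covers the same ground in four phases: Preparation; Detection and Analysis; Containment, Eradication, and Recovery; and Post-Incident Activity. The six phases, in order: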

#!/usr/bin/env python3
"""ir_lifecycle.py -- the six phases of incident response"""

IR_PHASES = {
    1: {
        'name': 'Preparation',
        'when': 'BEFORE the incident (this is where 90% of success lives)',
        'key_activities': [
            'IR plan: documented, tested, approved by leadership',
            'IR team: defined roles (incident commander, technical lead, '
            'communications, legal, HR)',
            'Contact list: IR retainer firm, legal counsel, law enforcement, '
            'cyber insurance carrier, PR firm',
            'Tooling: forensic workstation, write blockers, evidence bags, '
            'SIEM access, EDR console, network capture capability',
            'Playbooks: pre-written procedures for common incident types',
            'Communication templates: breach notification, press statement, '
            'internal comms, regulatory filing',
            'Tabletop exercises: quarterly simulations',
        ],
    },
    2: {
        'name': 'Identification',
        'when': 'Is this real? How bad is it?',
        'key_activities': [
            'Triage: true positive or false positive?',
            'Scope: what systems are affected?',
            'Timeline: when did it start?',
            'Classification: SEV-1 through SEV-4',
            'Active threat: is the attacker still inside?',
        ],
    },
    3: {
        'name': 'Containment',
        'when': 'Stop the bleeding without destroying evidence',
        'key_activities': [
            'Short-term: network isolation, account disabling, C2 blocking',
            'Long-term: rebuild from known-good images, segment networks',
            'The containment dilemma: observe vs isolate',
        ],
    },
    4: {
        'name': 'Eradication',
        'when': 'Remove the attacker completely',
        'key_activities': [
            'Find ALL compromised systems (not just the ones you know about)',
            'Remove ALL persistence mechanisms',
            'Reset ALL potentially compromised credentials',
            'Patch the initial access vector',
        ],
    },
    5: {
        'name': 'Recovery',
        'when': 'Return to normal operations',
        'key_activities': [
            'Restore from verified clean backups',
            'Validate restored systems are clean',
            'Heightened monitoring for 30-90 days',
            'Gradual return to normal operations',
        ],
    },
    6: {
        'name': 'Lessons Learned',
        'when': 'Make sure this does not happen again',
        'key_activities': [
            'Blameless post-incident review',
            'Incident report (timeline, scope, impact, root cause)',
            'Updated playbooks and detection rules',
            'Remediation projects for identified gaps',
        ],
    },
}

for phase_num, phase in IR_PHASES.items():
    print(f"Phase {phase_num}: {phase['name']}")
    print(f"  When: {phase['when']}")
    for activity in phase['key_activities']:
        print(f"  - {activity}")
    print()

Most organizations skip Phase 1. Then they're surprised when Phases 2 through 5 are a chaotic nightmare. You do not rise to the level of your expectations. You fall to the level of your preparation. I've seen this quote attributed to Archilochus, the Greek poet, though the internet attributes it to basically everyone who ever lived. Regardless of who said it first -- it is the single most important truth in incident response.

Phase 1: Preparation -- Before the Storm

This is the phase where 90% of incident response success is determined. Everything after this is execution under stress, and execution under stress is only as good as the plan you rehearsed before the stress started.

#!/usr/bin/env python3
"""ir_team_roles.py -- who does what when things go wrong"""

IR_TEAM = {
    'incident_commander': {
        'role': 'Overall decision authority during the incident',
        'responsibilities': [
            'Declares incident severity level',
            'Coordinates all response activities',
            'Makes containment/eradication decisions',
            'Manages escalation to executives',
            'Owns the incident timeline and log',
        ],
        'common_mistake': 'Too junior. The IC needs authority to '
                         'shut down production systems at 3 AM '
                         'without asking permission first.',
    },
    'technical_lead': {
        'role': 'Leads technical investigation and remediation',
        'responsibilities': [
            'Directs forensic evidence collection',
            'Coordinates SIEM queries and threat hunting',
            'Identifies scope of compromise',
            'Designs containment and eradication strategy',
            'Verifies clean state after recovery',
        ],
        'common_mistake': 'Only one person. You need at least '
                         'two -- incidents run 24/7 and one person '
                         'cannot sustain that without sleep.',
    },
    'communications_lead': {
        'role': 'Manages all internal and external messaging',
        'responsibilities': [
            'Drafts internal status updates (hourly during SEV-1)',
            'Coordinates with PR for external statements',
            'Manages customer notification if required',
            'Ensures consistent messaging across all channels',
        ],
        'common_mistake': 'Forgotten until a journalist calls. '
                         'By then the narrative is already written '
                         'for you -- and not in your favor.',
    },
    'legal_counsel': {
        'role': 'Manages legal and regulatory obligations',
        'responsibilities': [
            'Determines breach notification requirements',
            'Coordinates with law enforcement if appropriate',
            'Manages evidence preservation for potential litigation',
            'Advises on regulatory filings (GDPR 72h, etc.)',
            'Establishes attorney-client privilege over IR comms',
        ],
        'common_mistake': 'Not involved until day 3. Legal should '
                         'be in the room within the first hour of '
                         'a SEV-1 or SEV-2 incident.',
    },
}

print("=== Incident Response Team Roles ===\n")
for role_name, data in IR_TEAM.items():
    label = role_name.replace('_', ' ').title()
    print(f"{label}: {data['role']}")
    for resp in data['responsibilities']:
        print(f"  - {resp}")
    print(f"  WARNING: {data['common_mistake']}")
    print()

That attorney-client privilege point from legal counsel is critically important and almost universally overlooked. In many jurisdictions, if the legal team directs the IR investigation, the findings may be protected by attorney-client privilege. If the IT team runs it themselves, everything they find may be discoverable in court. This is not a technicality -- it's the difference between "our investigation found X and we fixed it" (protected) and "here are all the embarrassing details about our security failures, please use them in your class action lawsuit against us" (not protected). Get legal involved early. Having said that, this varies by jurisdiction, so consult YOUR legal team.

Phase 2: Identification -- Is This Real?

The hardest phase. You need to answer three questions simultaneously: is this a real incident or a false positive, how bad is it, and is the attacker still active?

#!/usr/bin/env python3
"""ir_triage.py -- initial triage and severity classification"""

SEVERITY_LEVELS = {
    'SEV-1': {
        'label': 'Critical',
        'criteria': [
            'Active data exfiltration confirmed',
            'Ransomware deployment in progress',
            'Domain controller compromised',
            'Customer PII confirmed exposed',
            'Active adversary with domain admin access',
        ],
        'response': 'All hands on deck. War room activated. '
                    'Executive notification within 30 minutes. '
                    'Legal engaged immediately.',
        'sla': 'Response: 15 minutes. Update: every 30 minutes.',
    },
    'SEV-2': {
        'label': 'High',
        'criteria': [
            'Confirmed intrusion but scope unclear',
            'Lateral movement detected',
            'Malware on multiple endpoints',
            'Privileged account compromised',
            'C2 beacon activity confirmed',
        ],
        'response': 'IR team activated. Containment is priority. '
                    'Manager notification within 1 hour.',
        'sla': 'Response: 30 minutes. Update: every hour.',
    },
    'SEV-3': {
        'label': 'Medium',
        'criteria': [
            'Single system compromised, contained',
            'Phishing with credential capture (single user)',
            'Suspicious but unconfirmed activity',
            'Malware detected and quarantined by EDR',
        ],
        'response': 'IR team investigates during business hours. '
                    'Ticket created and tracked.',
        'sla': 'Response: 4 hours. Update: daily.',
    },
    'SEV-4': {
        'label': 'Low',
        'criteria': [
            'Malware blocked by AV before execution',
            'Failed brute force attempts',
            'Policy violation (no security impact)',
            'Suspicious email reported but not clicked',
        ],
        'response': 'SOC analyst handles. Documented for trending.',
        'sla': 'Response: 24 hours. Update: as needed.',
    },
}

print("=== Incident Severity Classification ===\n")
for sev, data in SEVERITY_LEVELS.items():
    print(f"[{sev}] {data['label']}")
    print(f"  Response: {data['response']}")
    print(f"  SLA: {data['sla']}")
    print(f"  Criteria:")
    for criterion in data['criteria']:
        print(f"    - {criterion}")
    print()

The classification seems straightforward on paper. In practice, you're making this decision with incomplete information at 3 AM after being woken by a pager. The SIEM is showing 47 alerts. Three of them look related. Maybe. The endpoint in question belongs to someone in finance. The alert says "Cobalt Strike beacon detected" -- but is it the red team engagement that was supposed to start next week, or is it a real attacker using a cracked copy of Cobalt Strike (as we discussed in episode 50, real APT groups love pirated Cobalt Strike)?

This is where preparation pays off. If you have playbooks, if you have a decision tree, if you have a severity matrix taped to the wall of the SOC -- you don't have to think from first principles at 3 AM. You follow the procedure. Procedure is what keeps you functional when your brain wants to panic.
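
A severity matrix on the wall can also be written down as a lookup. The sketch below condenses the matrix above into one: the indicator names are hypothetical labels (not fields from any real SIEM), and treating unknown indicators as SEV-3 is an assumption chosen so that they get a human look instead of being dropped.

```python
# Indicator -> severity, condensed from the matrix above.
INDICATOR_SEVERITY = {
    "ransomware_in_progress": 1,
    "active_exfiltration": 1,
    "domain_controller_compromised": 1,
    "lateral_movement": 2,
    "c2_beacon_confirmed": 2,
    "privileged_account_compromised": 2,
    "single_host_contained": 3,
    "malware_quarantined_by_edr": 3,
    "malware_blocked_by_av": 4,
    "failed_brute_force": 4,
}

def classify(indicators):
    """Return the most severe level implied by any observed indicator."""
    if not indicators:
        raise ValueError("no indicators to classify")
    # Unknown indicators default to 3 ("suspicious but unconfirmed").
    level = min(INDICATOR_SEVERITY.get(i, 3) for i in indicators)
    return f"SEV-{level}"

print(classify(["failed_brute_force", "c2_beacon_confirmed"]))
print(classify(["malware_blocked_by_av"]))
```

Taking the worst indicator rather than an average is deliberate: one confirmed beacon outweighs any number of low-severity findings on the same host.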

Phase 3: Containment -- Stop the Bleeding

Containment has one goal: stop the damage from getting worse. But it has a critical constraint: don't destroy evidence, and if possible, don't tip off the attacker that you know they're there.

# Short-term containment actions (Windows environment)

# Network isolation -- CRITICAL: do NOT power off the machine!
# Powering off destroys volatile memory (running processes, network
# connections, encryption keys, RAM-only malware).
# Instead, isolate it from the network while keeping it running.

# Option 1: host firewall quarantine (block everything except forensics)
netsh advfirewall set allprofiles firewallpolicy blockinbound,blockoutbound
netsh advfirewall firewall add rule name="forensics-in" dir=in action=allow remoteip=10.10.14.5
netsh advfirewall firewall add rule name="forensics-out" dir=out action=allow remoteip=10.10.14.5

# Option 2: VLAN quarantine (move host to isolated VLAN at switch level)
# This is preferred because the host itself can't undo it.
# Requires coordination with network team.

# Disable compromised accounts -- do NOT delete them (evidence!)
net user compromised_user /active:no /domain
# PowerShell alternative:
# Set-ADUser -Identity jsmith -Enabled $false

# Block known C2 at the firewall and proxy
# Add attacker IPs and domains to blocklists immediately.
# But remember: sophisticated attackers have backup C2 channels.
# Blocking one domain might trigger a fallback to DNS C2.

# Credential rotation for affected scope
# Reset passwords for ALL compromised and POTENTIALLY compromised
# accounts. Not just the one you found -- the attacker may have
# harvested credentials from other accounts during lateral movement.
# Force re-authentication everywhere (revoke active sessions/tokens).

The "do NOT power off the machine" rule is one of those things that seems counterintuitive until you've lost evidence because of it. Your instinct when you find a compromised system is to pull the plug. Fight that instinct. A running system has volatile data in memory -- active network connections (showing where the attacker is connecting from), running processes (showing what malware is executing), decryption keys (if the attacker is using in-memory-only tools, those keys vanish the moment power is cut), and cached credentials. All of that disappears the instant you hit the power button.
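
The practical consequence is that evidence should be collected from most volatile to least volatile. The sketch below orders a list of evidence sources that way. The tiers loosely follow the order-of-volatility idea from RFC 3227, and the source names are illustrative labels, not the output of any particular tool.

```python
# Lower rank = more volatile = collect first.
VOLATILITY_RANK = {
    "memory_image": 1,
    "network_connections": 1,
    "running_processes": 1,
    "temp_files": 2,
    "disk_image": 3,
    "remote_logs": 4,
    "backups": 5,
}

def collection_order(sources):
    """Sort evidence sources so the most volatile ones come first.

    Sources not in the table are pushed to the end; ties are broken
    alphabetically so the result is deterministic.
    """
    return sorted(sources, key=lambda s: (VOLATILITY_RANK.get(s, 99), s))

plan = collection_order(["disk_image", "remote_logs", "memory_image", "temp_files"])
print(plan)
```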

The Containment Dilemma

This is the hardest decision in incident response, and there is no correct answer that works for every situation:

#!/usr/bin/env python3
"""containment_decision.py -- the fundamental IR tension"""

def containment_decision(scenario):
    """Model the contain-now vs observe-first tradeoff."""

    factors = {
        'active_exfiltration': {
            'weight': 50,
            'description': 'Data is actively leaving the network',
            'recommendation': 'CONTAIN IMMEDIATELY -- every second '
                            'of delay means more data lost',
        },
        'ransomware_spreading': {
            'weight': 50,
            'description': 'Ransomware is encrypting additional systems',
            'recommendation': 'CONTAIN IMMEDIATELY -- network isolation '
                            'of affected segments NOW',
        },
        'c2_beacon_dormant': {
            'weight': -20,
            'description': 'C2 beacon detected but sleeping (not active)',
            'recommendation': 'You may have time to investigate scope '
                            'before acting. The attacker is not currently '
                            'active -- use that window.',
        },
        'scope_unknown': {
            'weight': -30,
            'description': 'You found one compromised host but do not '
                          'know if others are also compromised',
            'recommendation': 'Investigate scope FIRST. If you contain '
                            'one host and miss three others, the attacker '
                            'still has access and now knows you are hunting.',
        },
        'business_critical_system': {
            'weight': -15,
            'description': 'The compromised system is production-critical',
            'recommendation': 'Containment may cause more business damage '
                            'than the incident itself. Coordinate with '
                            'business owners before isolating.',
        },
    }

    contain_score = 0
    print("=== Containment Decision Analysis ===\n")

    for factor_name, factor in factors.items():
        present = scenario.get(factor_name, False)
        if present:
            contain_score += factor['weight']
            marker = '+' if factor['weight'] > 0 else '-'
            print(f"[{marker}] {factor['description']}")
            print(f"    -> {factor['recommendation']}")
            print()

    decision = 'CONTAIN NOW' if contain_score > 0 else 'INVESTIGATE FIRST'
    print(f"Score: {contain_score}")
    print(f"Decision: {decision}")
    return decision

# Example: ransomware scenario -- contain immediately
print("--- Scenario 1: Active ransomware ---")
containment_decision({
    'ransomware_spreading': True,
    'scope_unknown': True,
    'business_critical_system': True,
})

print("\n--- Scenario 2: Dormant beacon discovered ---")
containment_decision({
    'c2_beacon_dormant': True,
    'scope_unknown': True,
})

The fundamental tension: if you contain too early, you might miss other compromised systems (the attacker has backup access you haven't found yet, and now they know you're looking). If you contain too late, more data gets stolen and more systems get encrypted. In ransomware cases the answer is always "contain immediately" -- every second of delay means more encrypted files. In a quiet, dormant C2 beacon scenario, you might have hours or days to investigate scope before the attacker's next check-in.
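
One caveat about the additive score: enough negative factors can outvote a +50. Ransomware spreading plus a dormant beacon, unknown scope, and a business-critical system sums to -15, which the model reads as "investigate first". A possible variant, sketched below with the same weights, treats active exfiltration and spreading ransomware as hard triggers that bypass the score. The name `decide` and the trigger set are assumptions for illustration.

```python
# Factors that force containment no matter what else is true.
HARD_TRIGGERS = {"active_exfiltration", "ransomware_spreading"}

# Same weights as the scoring model above.
WEIGHTS = {
    "active_exfiltration": 50,
    "ransomware_spreading": 50,
    "c2_beacon_dormant": -20,
    "scope_unknown": -30,
    "business_critical_system": -15,
}

def decide(scenario):
    """Return 'CONTAIN NOW' or 'INVESTIGATE FIRST' for a dict of factor flags."""
    present = {name for name, flag in scenario.items() if flag}
    if present & HARD_TRIGGERS:
        return "CONTAIN NOW"
    score = sum(WEIGHTS.get(name, 0) for name in present)
    return "CONTAIN NOW" if score > 0 else "INVESTIGATE FIRST"

messy_ransomware = {
    "ransomware_spreading": True,
    "c2_beacon_dormant": True,
    "scope_unknown": True,
    "business_critical_system": True,
}
print(decide(messy_ransomware))
print(decide({"c2_beacon_dormant": True, "scope_unknown": True}))
```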

Phase 4: Eradication -- Getting Rid of Them

Eradication means removing the attacker completely from your environment. This is harder than it sounds, because the most common failure mode is missing a persistence mechanism:

#!/usr/bin/env python3
"""persistence_hunt.py -- find everything the attacker left behind"""

# When eradicating an intrusion, you must check ALL known
# persistence mechanisms. Missing one means the attacker
# comes back in two weeks.

PERSISTENCE_CHECKLIST = {
    'windows': [
        {
            'mechanism': 'Scheduled Tasks',
            'check': 'schtasks /query /fo LIST /v | findstr /i /c:"TaskName" /c:"Status" /c:"Task To Run"',
            'attck_id': 'T1053.005',
            'notes': 'Check for tasks running encoded PowerShell, '
                    'tasks created outside business hours, tasks '
                    'with suspicious binary paths',
        },
        {
            'mechanism': 'Registry Run Keys',
            'check': 'reg query HKLM\\SOFTWARE\\Microsoft\\Windows\\'
                    'CurrentVersion\\Run',
            'attck_id': 'T1547.001',
            'notes': 'Check both HKLM and HKCU. Also check RunOnce, '
                    'RunServices, and Explorer\\Shell Folders',
        },
        {
            'mechanism': 'Services',
            'check': 'Get-CimInstance Win32_Service | Select-Object Name, State, StartName, PathName',
            'attck_id': 'T1543.003',
            'notes': 'Look for services with unusual binary paths, '
                    'services created recently, services running as SYSTEM '
                    'with non-standard executables',
        },
        {
            'mechanism': 'WMI Event Subscriptions',
            'check': 'Get-WMIObject -Namespace root\\Subscription '
                    '-Class __EventFilter',
            'attck_id': 'T1546.003',
            'notes': 'WMI persistence is sneaky -- no files on disk, '
                    'survives reboots, rarely checked by defenders',
        },
        {
            'mechanism': 'DLL Search Order Hijacking',
            'check': 'Check for unsigned DLLs in system directories, '
                    'DLLs with recent modification timestamps',
            'attck_id': 'T1574.001',
            'notes': 'The attacker places a malicious DLL where a '
                    'legitimate program will load it first',
        },
    ],
    'linux': [
        {
            'mechanism': 'Cron Jobs',
            'check': 'for user in $(cut -f1 -d: /etc/passwd); '
                    'do crontab -l -u $user 2>/dev/null; done',
            'attck_id': 'T1053.003',
            'notes': 'Check /etc/crontab, /etc/cron.d/*, and every '
                    'user crontab. Also check /var/spool/cron/',
        },
        {
            'mechanism': 'SSH Authorized Keys',
            'check': 'find / -name authorized_keys -type f 2>/dev/null',
            'attck_id': 'T1098.004',
            'notes': 'Attacker adds their public key for passwordless '
                    'SSH access. Check ALL users, not just root.',
        },
        {
            'mechanism': 'Systemd Services',
            'check': 'systemctl list-unit-files --type=service '
                    '| grep enabled',
            'attck_id': 'T1543.002',
            'notes': 'Look for recently created .service files in '
                    '/etc/systemd/system/ with unusual ExecStart paths',
        },
        {
            'mechanism': 'Web Shells',
            'check': 'find /var/www -name "*.php" -newer /var/www/index.php',
            'attck_id': 'T1505.003',
            'notes': 'Web shells are THE most common persistence method '
                    'after a web application compromise (episodes 12-25)',
        },
    ],
}

for os_type, mechanisms in PERSISTENCE_CHECKLIST.items():
    print(f"=== {os_type.upper()} Persistence Check ===\n")
    for mech in mechanisms:
        print(f"[{mech['attck_id']}] {mech['mechanism']}")
        print(f"  Check: {mech['check']}")
        print(f"  Notes: {mech['notes']}")
        print()

That web shell entry is worth emphasizing. If the initial access was through a web application vulnerability (SQL injection from episode 12, file upload from episode 20, deserialization from episode 19), the attacker almost certainly dropped a web shell as their first persistence mechanism. It's a tiny PHP or JSP file sitting in the web root that gives them command execution through a normal HTTP request. Web shells survive server reboots, they don't show up in process lists (they execute within the web server process), and they look like normal web traffic to network monitoring tools. If you eradicate everything else but miss the web shell, the attacker just browses to https://target.com/images/thumb_cache.php?cmd=whoami and they're right back in.
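
A first-pass hunt for web shells can be as simple as grepping script files for command-execution functions. The sketch below does that in Python against a throwaway directory. The pattern list is a small illustrative sample that will miss obfuscated shells and flag some legitimate code, so treat hits as leads to review by hand, not verdicts.

```python
import re
import tempfile
from pathlib import Path

# Function names that often appear in PHP web shells (illustrative, not complete).
SUSPICIOUS = re.compile(
    rb"\b(eval|system|passthru|shell_exec|base64_decode|assert)\s*\(",
    re.IGNORECASE,
)
SCRIPT_SUFFIXES = {".php", ".phtml", ".jsp", ".aspx"}

def find_candidate_web_shells(web_root):
    """Return script files under web_root that call a suspicious function."""
    hits = []
    for path in sorted(Path(web_root).rglob("*")):
        if path.is_file() and path.suffix.lower() in SCRIPT_SUFFIXES:
            if SUSPICIOUS.search(path.read_bytes()):
                hits.append(path)
    return hits

# Demo against a temporary directory rather than a real web root.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "images").mkdir()
    (Path(root) / "index.php").write_text("<?php echo 'hello'; ?>")
    (Path(root) / "images" / "thumb_cache.php").write_text(
        "<?php system($_GET['cmd']); ?>"
    )
    demo_hits = [p.name for p in find_candidate_web_shells(root)]

print(demo_hits)
```

On a real server you would point `find_candidate_web_shells` at the web root and combine the result with file timestamps and a diff against the deployed release.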

The nuclear option for eradication is rebuilding from scratch. If you're not 100% confident you've found every persistence mechanism -- and you rarely will be -- the safest option is to wipe the system and restore from a known-good backup that predates the compromise. This is expensive and time-consuming, but it's the only approach that guarantees a clean system.

Phase 5: Recovery

#!/usr/bin/env python3
"""recovery_priority.py -- what to restore first and how"""

RECOVERY_PRIORITIES = [
    {
        'priority': 1,
        'systems': 'Authentication infrastructure (AD, LDAP, IAM)',
        'reason': 'Nothing else works if authentication is compromised. '
                 'If the attacker has domain admin, restoring servers '
                 'is pointless -- they will re-compromise immediately.',
        'verification': 'Verify all admin accounts, reset KRBTGT twice '
                       '(with 12+ hour gap), audit all trust relationships',
    },
    {
        'priority': 2,
        'systems': 'Network infrastructure (DNS, DHCP, firewalls)',
        'reason': 'Network controls must be trusted before you bring '
                 'other systems back online. If DNS is compromised, '
                 'the attacker can redirect all traffic.',
        'verification': 'Verify firewall rules, check for rogue DNS '
                       'entries, validate routing tables',
    },
    {
        'priority': 3,
        'systems': 'Business-critical applications (ERP, email, CRM)',
        'reason': 'Revenue-generating and customer-facing systems. '
                 'The business needs these to operate.',
        'verification': 'Restore from pre-compromise backup, validate '
                       'data integrity, monitor for re-infection',
    },
    {
        'priority': 4,
        'systems': 'Everything else (workstations, dev environments, etc.)',
        'reason': 'Rebuild rather than restore where possible. '
                 'Workstations are cheap to reimage.',
        'verification': 'Deploy from gold image, apply latest patches, '
                       'verify EDR agent is active and reporting',
    },
]

print("=== Recovery Priority Order ===\n")
for item in RECOVERY_PRIORITIES:
    print(f"Priority {item['priority']}: {item['systems']}")
    print(f"  Why: {item['reason']}")
    print(f"  Verify: {item['verification']}")
    print()

# The KRBTGT reset deserves special attention.
# If the attacker performed DCSync (episode 33), they have the
# KRBTGT hash and can forge Golden Tickets -- which means they
# can impersonate ANY user, including accounts that don't exist,
# for up to 10 years (the default lifetime Mimikatz gives forged tickets).
# Resetting KRBTGT TWICE (with a 12+ hour gap) invalidates all
# existing Kerberos tickets. Skip this step and the attacker
# has a permanent backdoor regardless of what else you fix.
print("CRITICAL: If DCSync was performed, reset KRBTGT TWICE.")
print("Golden Tickets survive password resets of normal accounts.")
print("Only KRBTGT reset invalidates them. See episode 33.")

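That KRBTGT double-reset is one of those details that separates IR professionals from IT generalists. If the attacker performed a DCSync attack (episode 33), they have the KRBTGT password hash, which means they can forge Kerberos tickets for any user, including domain admin, for years after you think you've cleaned up. You can change every password in Active Directory, rebuild every server, and deploy new EDR -- and the attacker still walks back in with a Golden Ticket. The ONLY way to invalidate Golden Tickets is to reset the KRBTGT account password twice. Twice, because Active Directory keeps the current and the previous KRBTGT password, so after a single reset tickets signed with the stolen key still validate. And with at least 12 hours between resets, so the first change has replicated to every domain controller and legitimately issued tickets (10-hour default lifetime) have expired; resetting twice back-to-back can break authentication across the domain. Miss this step and your entire eradication was wasted effort.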
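
The waiting period is easy to get wrong at 3 AM, so it can help to compute it rather than eyeball it. This is a tiny sketch of that check: the 12-hour gap is taken from the guidance above, the timestamps are made-up examples, and nothing here touches Active Directory.

```python
from datetime import datetime, timedelta

# Assumed minimum gap between the two KRBTGT resets (from the text above).
MIN_GAP = timedelta(hours=12)

def earliest_second_reset(first_reset, min_gap=MIN_GAP):
    """Earliest time at which the second reset should be performed."""
    return first_reset + min_gap

def second_reset_allowed(first_reset, now, min_gap=MIN_GAP):
    """True once the waiting period after the first reset has elapsed."""
    return now >= earliest_second_reset(first_reset, min_gap)

first = datetime(2024, 1, 10, 9, 0)   # illustrative timestamp
print(earliest_second_reset(first))
print(second_reset_allowed(first, datetime(2024, 1, 10, 15, 0)))
print(second_reset_allowed(first, datetime(2024, 1, 10, 21, 30)))
```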

Phase 6: Lessons Learned -- The Phase Everyone Skips

Post-incident review format (blameless -- focus on process, not people):

Timeline reconstruction:
  T+0h:    Phishing email delivered to HR coordinator
  T+0h15m: Macro executed, PowerShell beacon deployed
  T+3h:    Attacker begins lateral movement via WinRM
  T+8h:    LSASS dump on file server (undetected)
  T+26h:   DCSync performed (undetected)
  T+48h:   Customer database accessed (undetected)
  T+72h:   SIEM alert fires on anomalous data volume
  T+73h:   SOC analyst triages alert as potential incident
  T+74h:   IR team activated, SEV-1 declared
  T+76h:   Containment initiated
  T+120h:  Full eradication achieved
  T+168h:  Recovery complete, systems restored

Questions to answer:
  1. Initial access: how did the phishing email bypass the
     email gateway? Answer: gateway was configured to allow
     macros from external senders. Fix: block by default.
  2. Detection gap: why 72 hours before detection?
     Answer: no alert on LSASS access, no Kerberoasting
     detection, no DCSync monitoring. The first alert fired
     on data volume, not on attacker activity.
  3. What worked: SOC analyst correctly triaged the alert
     and escalated within 1 hour. IR team responded within
     30 minutes of activation. Containment was effective.
  4. What failed: no detection for the first 72 hours. The
     attacker had complete freedom of movement for 3 days.
  5. Playbook adequacy: ransomware playbook was followed
     correctly but we did not have a playbook for "data
     exfiltration without ransomware." We improvised.

Remediation projects (prioritized):
  P1: Deploy LSASS protection + dump detection (1 week)
  P2: Implement DCSync monitoring via Sysmon (1 week)
  P3: Block macros in email from external senders (1 day)
  P4: Create data exfiltration playbook (2 weeks)
  P5: Conduct purple team exercise against detection gaps (1 month)

The blameless part is not optional. If your post-incident review turns into a blame session ("whose fault is it that the phishing email got through?"), people stop reporting incidents. They hide them. They cover them up. And you lose the only feedback mechanism that lets your organization learn and improve. The question is never "whose fault was this?" The question is "what in our process allowed this to happen, and how do we change the process?"
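The timeline in the review above already contains the numbers most post-incident reports ask for. A minimal sketch of deriving them follows; the event keys and metric names are my own labels, and only the hour offsets are taken from the timeline.

```python
# Offsets in hours from initial access, copied from the timeline above.
TIMELINE_HOURS = {
    'initial_access': 0,
    'first_alert': 72,
    'ir_activated': 74,
    'containment_started': 76,
    'eradication_complete': 120,
    'recovery_complete': 168,
}


def hours_between(start: str, end: str) -> int:
    """Elapsed hours between two named timeline events."""
    return TIMELINE_HOURS[end] - TIMELINE_HOURS[start]


# Metric names are illustrative labels, not a formal standard.
METRICS = {
    'Time to detect': hours_between('initial_access', 'first_alert'),
    'Alert to IR activation': hours_between('first_alert', 'ir_activated'),
    'Detection to containment': hours_between('first_alert', 'containment_started'),
    'Detection to recovery': hours_between('first_alert', 'recovery_complete'),
}

for name, hours in METRICS.items():
    print(f"{name}: {hours}h")
```

The contrast is the point of the exercise: 72 hours to detect against 4 hours from detection to containment says the response worked and the detection did not, which matches the remediation priorities.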

Evidence Collection -- Getting It Right

Evidence collection in incident response is governed by one overriding principle: forensic soundness. Evidence that is not collected properly is useless in court and unreliable for analysis.

# Evidence collection procedure (Linux)

# Step 1: memory acquisition FIRST (most volatile, most valuable)
# Using LiME kernel module (Linux Memory Extractor):
sudo insmod lime.ko "path=/evidence/memory.lime format=lime"
# Calculate hash immediately:
sha256sum /evidence/memory.lime > /evidence/memory.lime.sha256

# Step 2: volatile system state (before anything changes)
# Running processes:
ps auxwwf > /evidence/ps_output.txt
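# Network connections (-a includes established sockets; -l alone
# would list only listeners and miss active C2 sessions):
ss -tuanp > /evidence/network_connections.txt
netstat -anp > /evidence/netstat_output.txt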
# Logged-in users:
w > /evidence/who_is_logged_in.txt
last -a > /evidence/last_logins.txt
# Open files:
lsof > /evidence/open_files.txt

# Step 3: disk imaging (write the image to external evidence storage,
# never to the suspect disk; use a write blocker when imaging a removed drive)
# dc3dd is dd with built-in hashing and logging:
sudo dc3dd if=/dev/sda of=/evidence/disk.dd \
    hash=sha256 log=/evidence/disk_acquisition.log

# Step 4: log collection
tar czf /evidence/var_log.tar.gz /var/log/
sha256sum /evidence/var_log.tar.gz > /evidence/var_log.tar.gz.sha256

# Step 5: DOCUMENT EVERYTHING
# Chain of custody record for each piece of evidence:
# - What was collected (memory/disk/logs)
# - From which system (hostname, IP, serial number)
# - By whom (name, role)
# - When (timestamp, timezone)
# - How (tool name and version)
# - Hash value (SHA-256)
# - Where it is stored now (evidence locker, encrypted drive)
# Without chain of custody, evidence is inadmissible in court.

The order of volatility matters. Memory is the most volatile -- it changes every millisecond and disappears entirely when power is lost. Disk is less volatile but can be overwritten by normal system activity (log rotation, temp files). Network captures are only available in real time. You collect the most volatile evidence first because that's the evidence you're most likely to lose.
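The chain of custody fields from Step 5 can be captured at collection time instead of being written up afterwards. Below is a minimal sketch, assuming Python is available on the analysis workstation; `CustodyRecord`, `create_record`, and `verify` are hypothetical names, and the hostname, collector, and file are placeholder demo values.

```python
import hashlib
import tempfile
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large images need not fit in memory."""
    digest = hashlib.sha256()
    with path.open('rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass(frozen=True)
class CustodyRecord:
    evidence_type: str   # memory / disk / logs
    source_system: str   # hostname, IP, serial number
    collected_by: str    # name and role
    collected_at: str    # ISO 8601 timestamp with timezone
    tool: str            # tool name and version
    sha256: str
    stored_at: str


def create_record(path, evidence_type, source_system,
                  collected_by, tool, stored_at) -> CustodyRecord:
    """Build a custody record, hashing the evidence file as it stands now."""
    return CustodyRecord(
        evidence_type=evidence_type,
        source_system=source_system,
        collected_by=collected_by,
        collected_at=datetime.now(timezone.utc).isoformat(),
        tool=tool,
        sha256=sha256_file(path),
        stored_at=stored_at,
    )


def verify(record: CustodyRecord, path: Path) -> bool:
    """Re-hash the file and compare against the recorded value."""
    return sha256_file(path) == record.sha256


# Demo on a throwaway file; all values below are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'memory.lime'
    demo_path.write_bytes(b'demo evidence')
    demo_record = create_record(
        demo_path, 'memory', 'fileserver01 (hypothetical)',
        'J. Analyst, IR lead', 'LiME (record the exact version)',
        'encrypted evidence drive',
    )
    intact = verify(demo_record, demo_path)
    demo_path.write_bytes(b'tampered')       # simulate modification
    tampered = verify(demo_record, demo_path)

print(asdict(demo_record))
print("Intact:", intact, "| After modification:", tampered)
```

Re-running `verify` every time evidence changes hands gives you a documented point at which the hash still matched.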

IR Playbooks -- Your Cheat Sheet at 3 AM

Playbooks are pre-written response procedures for common incident types. They exist because you cannot think clearly at 3 AM when the CISO is calling and the ransomware is spreading. You follow the playbook.

#!/usr/bin/env python3
"""ir_playbooks.py -- pre-built response procedures"""

PLAYBOOKS = {
    'ransomware': {
        'first_15_minutes': [
            'DO NOT power off affected systems (preserve memory)',
            'Isolate affected systems from network immediately',
            'Identify ransomware variant from ransom note or file extension',
            'Check nomoreransom.org for available decryptors',
            'Activate IR team, notify IC on call',
        ],
        'first_hour': [
            'Determine scope: how many systems encrypted?',
            'Identify initial access vector (email? RDP? VPN?)',
            'Check backup status: are backups intact or also encrypted?',
            'Notify legal and cyber insurance carrier',
            'Begin forensic evidence collection on first affected system',
        ],
        'first_day': [
            'Complete scoping: all affected systems identified',
            'Begin eradication of persistence mechanisms',
            'Test backup restoration on isolated network',
            'Prepare internal communications',
            "Decide: pay or don't pay (legal + insurance + ethics)",
        ],
        'critical_warning': 'Attackers increasingly target backup systems '
                           'FIRST. If backups are encrypted, recovery time '
                           'jumps from days to weeks or months.',
    },
    'business_email_compromise': {
        'first_15_minutes': [
            'Disable compromised account immediately',
            'Revoke all active sessions and OAuth tokens',
            'Check for email forwarding rules (auto-forward to external)',
            'Check sent items for fraudulent messages',
        ],
        'first_hour': [
            'Notify recipients of any fraudulent emails sent',
            'If wire fraud: contact bank IMMEDIATELY for wire recall',
            'Reset credentials + enable phishing-resistant MFA',
            'Review audit logs for full scope of account access',
        ],
        'first_day': [
            'Complete audit log review',
            'Determine if other accounts were targeted or compromised',
            'Notify legal for regulatory obligations',
            'Update email gateway rules to block similar attacks',
        ],
        'critical_warning': 'Wire recall success drops dramatically after '
                           '24 hours. If money was transferred, the FIRST '
                           'call should be to the bank, not to the SOC.',
    },
    'data_breach': {
        'first_15_minutes': [
            'Identify what data was accessed or exfiltrated',
            'Contain the exfiltration channel',
            'Preserve evidence (do not wipe the system)',
            'Engage legal counsel IMMEDIATELY',
        ],
        'first_hour': [
            'Classify the exposed data (PII, PHI, financial, IP)',
            'Determine notification requirements by jurisdiction',
            'Begin preparing breach notification drafts',
            'Coordinate with law enforcement if appropriate',
        ],
        'first_day': [
            'File regulatory notifications (GDPR: 72 hours)',
            'Prepare customer notification',
            'Set up credit monitoring if PII exposed',
            'Coordinate public statement with legal and PR',
        ],
        'critical_warning': 'GDPR requires notification within 72 hours '
                           'of becoming AWARE of the breach. The clock '
                           'starts when you confirm it, not when you finish '
                           'investigating it. Do not delay notification '
                           'to wait for complete analysis.',
    },
}

for incident_type, playbook in PLAYBOOKS.items():
    label = incident_type.replace('_', ' ').title()
    print(f"=== PLAYBOOK: {label} ===\n")
    for phase, actions in playbook.items():
        if phase == 'critical_warning':
            print(f"  !! WARNING: {actions}")
        else:
            phase_label = phase.replace('_', ' ').title()
            print(f"  {phase_label}:")
            for action in actions:
                print(f"    - {action}")
        print()

The BEC wire recall point cannot be overstated. I've talked to IR consultants who have seen cases where the difference between recovering the money and losing it permanently was literally 4 hours. Once the money moves through two or three intermediary accounts (which happens automatically in many fraud operations), recall becomes near-impossible. If someone tells you a wire transfer was triggered by a compromised email, your FIRST action is to pick up the phone and call the bank. Not "open a ticket." Not "escalate to the IC." Call. The. Bank.
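The same kind of clock runs in the data breach playbook, where the 72-hour GDPR window starts at awareness. A minimal sketch of tracking it is below; the 72-hour figure is from the playbook above, while the function names and timestamps are hypothetical. It is date arithmetic only, not legal advice on when "awareness" begins.

```python
from datetime import datetime, timedelta, timezone

# Notification window from the data breach playbook above.
GDPR_WINDOW = timedelta(hours=72)


def notification_deadline(aware_at: datetime) -> datetime:
    """Latest time to notify, counted from when the breach became known."""
    return aware_at + GDPR_WINDOW


def hours_remaining(aware_at: datetime, now: datetime) -> float:
    """Hours left before the deadline; negative means it has passed."""
    return (notification_deadline(aware_at) - now).total_seconds() / 3600


# Hypothetical timestamps, for illustration only.
aware = datetime(2024, 3, 1, 9, 0, tzinfo=timezone.utc)
now = aware + timedelta(hours=30)
print("Deadline:", notification_deadline(aware).isoformat())
print(f"Hours remaining: {hours_remaining(aware, now):.1f}")
```

Putting the deadline in the incident channel header on day one keeps it visible while the investigation is still running.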

The AI Slop Connection

AI is changing incident response in both directions, and the asymmetry is not in the defender's favor.

On the defense side, AI-powered SOAR (Security Orchestration, Automation, and Response) platforms can automate initial triage, correlate thousands of alerts into clusters, execute containment playbooks faster than any human, and dramatically reduce the time between detection and response. An AI system that processes 10,000 alerts per hour and surfaces the 3 that actually matter is genuinely valuable. The SOC analyst spends their time on real incidents instead of drowning in false positives.

On the attack side, AI accelerates every phase of the kill chain we discussed in episodes 4 through 50. And here's the uncomfortable part for IR teams: AI-generated malware is polymorphic by default. Every sample is slightly different. Signature-based detection (which still underpins most antivirus products) was designed for a world where malware was written by humans and reused across campaigns. AI-generated malware breaks that assumption because generating a unique variant costs nothing. Your IOC (Indicator of Compromise) list from today's incident is useless against tomorrow's AI-generated variant because the hashes, strings, and network signatures will all be different.

The implication for IR: behavioral detection becomes the only reliable approach. You can't match signatures for malware that mutates every deployment. You CAN detect the behavior -- unusual process trees, anomalous network connections, credential access patterns that don't match baseline. The techniques we covered in episode 48 (UEBA behavioral baselines) are not just useful for insider threat detection -- they're the foundation of incident detection in an AI-powered threat landscape.
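To make "unusual process trees" concrete, here is a deliberately simplified sketch: it records which parent/child process pairs occur during normal operation and flags any pair never seen before. The process names and sample events are hypothetical, and a real UEBA product does far more (frequency, user context, time of day); this only illustrates the baseline-versus-observed idea.

```python
from collections import Counter

# Hypothetical baseline: (parent, child) pairs seen during normal operation.
BASELINE_EVENTS = [
    ('explorer.exe', 'winword.exe'),
    ('explorer.exe', 'chrome.exe'),
    ('services.exe', 'svchost.exe'),
    ('explorer.exe', 'winword.exe'),
]


def build_baseline(events):
    """Count how often each parent/child pair appears in normal activity."""
    return Counter(events)


def find_anomalies(baseline, observed):
    """Return observed pairs that never occur in the baseline."""
    return [pair for pair in observed if pair not in baseline]


# Hypothetical observations; the hash of the payload never enters into it.
OBSERVED = [
    ('explorer.exe', 'winword.exe'),
    ('winword.exe', 'powershell.exe'),   # Office spawning a shell
]

baseline = build_baseline(BASELINE_EVENTS)
anomalies = find_anomalies(baseline, OBSERVED)
for parent, child in anomalies:
    print(f"ANOMALY: {parent} -> {child} not in baseline")
```

The flagged pair is the same behavior as the macro-to-PowerShell step in the incident timeline, and it would be flagged regardless of how the payload itself was mutated.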

What Comes Next

Incident response is reactive by definition -- you're responding to something that already happened. The next logical question is: can you get ahead of the attack instead of always being behind it? Understanding who is targeting you, what tools they use, and what their objectives are BEFORE they breach your perimeter is the domain of threat intelligence. It transforms security from "react to what happened" into "prepare for what is likely to happen" -- and that shift in posture is where defenders finally get a structural advantage over attackers.

Exercises

Exercise 1: Write an IR playbook for a phishing-to-ransomware scenario. The scenario: an employee clicked a phishing link, downloaded malware, and 4 hours later ransomware begins encrypting file servers. Your playbook must include: (a) immediate actions (first 15 minutes), (b) containment steps, (c) investigation questions to answer, (d) recovery procedure, (e) communication plan (internal + external). Save to ~/lab-notes/ransomware-playbook.md.

Exercise 2: Conduct a tabletop exercise with yourself (or a study partner). Scenario: your SIEM alerts on a Cobalt Strike Beacon callback from an internal workstation at 2 AM. Walk through: (a) triage questions you would ask, (b) SIEM queries you would run, (c) containment decisions, (d) who you would notify and when. Document your decision tree including "what if" branches. Save to ~/lab-notes/tabletop-exercise.md.

Exercise 3: Practice evidence collection on a lab VM. Acquire: (a) a memory image using LiME (Linux) or WinPmem (Windows), (b) a disk image using dc3dd or dd, (c) Windows Event Logs or Linux /var/log/. Verify the hash of each evidence file. Document the chain of custody: who collected it, when, from which system, using which tool, and the SHA-256 hash. Save to ~/lab-notes/evidence-collection.md.

Thanks for reading!

@scipio
