Overview of an Incident Response Workflow
Incident response (IR) is most effective when it follows a clear, repeatable workflow. On Linux systems this workflow needs to respect both technical realities (evidence volatility, system behavior) and organizational constraints (business impact, policy, legal obligations).
A typical high‑level IR workflow:
- Preparation
- Detection & Triage
- Containment
- Eradication
- Recovery
- Post‑Incident Review
In this chapter we focus on how to execute those steps in practice, specifically in Linux environments, assuming you already understand basic forensics and IR concepts from the parent chapter.
1. Preparation (Before Anything Goes Wrong)
Preparation sets the conditions for everything else to work. Without preparation, later phases are slower, riskier, and more error‑prone.
1.1 Define Roles and Communication
Have a written IR plan that clearly states:
- Who leads the incident (IR lead / coordinator).
- Who owns affected systems (system owners).
- Who handles:
- Legal / privacy / regulatory notifications.
- Communication with management.
- Communication with users or customers.
- Escalation paths: when and how to involve:
- External IR consultants.
- Law enforcement.
- Cloud or hosting providers.
Create and maintain:
- A call tree or contact list (with multiple contact channels).
- A clear “who can declare an incident” rule.
- A decision matrix: when to isolate, when to keep systems online for monitoring, etc.
1.2 Technical Readiness on Linux Systems
Prepare systems so they are “investigation‑friendly” (a minimal verification sketch follows this list):
- Time sync: NTP/chrony configured and monitored; accurate timestamps are critical.
- Logging strategy:
- Centralized logging (e.g. `rsyslog`/`journalbeat` to a SIEM).
- Retention set to a sufficient length.
- Baseline configurations:
- Standardized OS builds.
- Standard locations for logs and critical data.
- Tooling in place:
- Pre‑approved IR tools (hash checked) stored locally or in a trusted repository.
- Secure remote access method (e.g. SSH via bastion, not direct root logins).
- Access controls:
- Clear procedure for granting IR team temporary elevated access.
- Audit logging for IR actions.
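To make the readiness items above verifiable, here is a minimal check sketch in shell. It assumes chrony for time sync and rsyslog for log forwarding, and it uses an illustrative tool directory (`/opt/ir-tools`) with a pre‑computed SHA‑256 manifest; adjust all of these to your own estate.

```bash
#!/usr/bin/env bash
# Minimal IR readiness check (sketch). Assumes chrony, rsyslog, and a
# pre-staged tool directory with a SHA-256 manifest; adapt to your estate.
set -u

echo "== Time synchronization =="
# chronyc tracking shows offset from the reference clock; timedatectl is a fallback.
chronyc tracking 2>/dev/null || timedatectl

echo "== Log forwarding service =="
# Confirm the syslog daemon is running and a forwarding target ('@' or '@@') is configured.
systemctl is-active rsyslog
grep -rh '@' /etc/rsyslog.conf /etc/rsyslog.d/ 2>/dev/null | grep -v '^#' | head

echo "== IR tool integrity =="
# Verify pre-approved tools against the stored hash manifest (illustrative path).
cd /opt/ir-tools 2>/dev/null && sha256sum -c SHA256SUMS
```

Running a check like this from configuration management or a periodic job turns “investigation‑friendly” from an intention into a verified property.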
1.3 Playbooks and Runbooks
Prepare playbooks (what to do for a type of incident) and runbooks (detailed command‑by‑command guides). For Linux, common playbooks include:
- Compromised web server.
- Suspected malware on a database host.
- Ransomware on a file server.
- Suspicious outbound connections from a container host.
Each should specify:
- Triggers (what events start the playbook).
- Initial triage actions.
- Containment options and their impact.
- Data to collect and how (commands, scripts).
- Recovery/validation steps.
2. Detection and Triage
Detection identifies potential incidents; triage quickly determines how bad it is and what to do next.
2.1 Detection Sources
Common Linux‑specific signals:
- System logs:
- `journalctl` output for auth failures, service restarts.
- `/var/log/auth.log`, `/var/log/secure`, `/var/log/messages`.
- Application logs:
- Web server logs (`/var/log/nginx/access.log`, Apache logs).
- Database audit logs.
- Security tooling:
- IDS/IPS alerts.
- EDR/endpoint agents.
- File integrity monitoring (FIM) alerts.
- User reports:
- “System is slow”, “unexpected processes”, “weird SSH fingerprints”.
- External notifications:
- ISP abuse reports, blacklists, upstream providers.
The workflow must define:
- How alerts are ingested (email, ticket system, SIEM).
- Which alerts are automatically escalated (e.g. a root login from an unknown country; one host‑side check is sketched below).
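Most of this logic belongs in the SIEM, but as a sketch of what such an escalation rule actually checks, the following assumes journald, an OpenSSH unit named `sshd` (it is `ssh` on some Debian‑based systems), and an illustrative allowlist of management networks.

```bash
#!/usr/bin/env bash
# Sketch: flag root SSH logins whose source IP is outside an expected range.
# Assumes journald and an "sshd" unit name; the allowlist is illustrative.
set -u

ALLOWLIST_REGEX='^10\.|^192\.168\.'   # expected management networks (assumption)

journalctl -u sshd --since "1 hour ago" --no-pager 2>/dev/null \
  | grep 'Accepted .* for root from' \
  | awk '{for (i = 1; i <= NF; i++) if ($i == "from") print $(i+1)}' \
  | while read -r ip; do
      if ! echo "$ip" | grep -Eq "$ALLOWLIST_REGEX"; then
        echo "ESCALATE: root login from unexpected source $ip"
      fi
    done
```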
2.2 Triage: Rapid Assessment
Your goal in triage is to quickly answer:
- Is this a real incident or a false positive?
- Rough severity and scope:
- One system or many?
- Privileged access gained?
- Data exposure likely?
- Urgency:
- Active data destruction or exfiltration?
- Lateral movement in progress?
Triage should be time‑boxed (e.g. 15–30 minutes per alert category) to avoid analysis paralysis.
2.2.1 Minimal Live Checks on Linux
Use low‑impact commands to get a snapshot without modifying too much state:
- Check processes and network:
- `ps auxf`
- `ss -tulpn` or `ss -tanp`
- Check recent log anomalies:
- `journalctl -xe`
- `grep 'Failed password' /var/log/auth.log` (or distro equivalent)
- Quick user/privilege check:
- `who`, `last`
- `sudo -l` (only if you must; it can change logs slightly)
Record what you did; command lists are part of the incident record. One way to capture both the output and the commands themselves is sketched below.
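A minimal sketch of that idea: wrap the low‑impact commands in a small collection script that saves each command's output along with a timestamped list of what was run. The output directory is illustrative; ideally write to removable or remote storage rather than the suspect disk, and prefer your pre‑approved tooling over ad‑hoc scripts.

```bash
#!/usr/bin/env bash
# Triage snapshot sketch: run low-impact commands and keep their output plus a
# record of what was run. Write somewhere other than the suspect disk if possible.
set -u

OUTDIR="/mnt/ir-evidence/$(hostname)-$(date -u +%Y%m%dT%H%M%SZ)"   # illustrative path
mkdir -p "$OUTDIR"

run() {
  # Record the command and a UTC timestamp, then save its output to a file.
  echo "$(date -u +%FT%TZ) $*" >> "$OUTDIR/commands.log"
  "$@" > "$OUTDIR/$(echo "$*" | tr ' /' '__').txt" 2>&1
}

run ps auxf
run ss -tulpn
run ss -tanp
run who
run last -n 50
run journalctl -xe -n 200 --no-pager

# Hash the collected files so later tampering can be detected.
( cd "$OUTDIR" && sha256sum ./* > SHA256SUMS )
```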
2.3 Classification and Escalation
Once triage is done, apply your classification scheme. A simple one:
- P1 (Critical): Active attack; data loss or service outage; attacker likely has root or domain‑level access.
- P2 (High): Confirmed compromise, but contained to limited systems; no confirmed data loss yet.
- P3 (Medium): Suspicious activity requiring investigation but not yet confirmed as compromise.
- P4 (Low): Minor policy violations, misconfigurations, or informational alerts.
Workflow rules:
- Who can classify incidents.
- Automatic actions for each level:
- P1: Immediate containment, 24/7 response, executive notification.
- P2: Contain within a defined SLA, notify system owners.
- P3/P4: Investigation during business hours, create tickets.
3. Containment
Containment aims to limit damage and prevent spread, while preserving evidence and minimizing business disruption.
3.1 Containment Strategies on Linux
Common approaches:
- Network isolation:
- Remove from network (VLAN change, hypervisor network disconnect, security group rules).
- Use a host‑based firewall (`iptables`, `nftables`, `ufw`, `firewalld`) to block inbound/outbound traffic while keeping SSH from a trusted IP (see the sketch after this list).
- Service isolation:
- Stop or disable compromised services: `systemctl stop nginx`, `systemctl stop sshd` (only if you retain other access).
- Restrict listening ports.
- Account and credential containment:
- Lock compromised accounts: `passwd -l username`.
- Remove/rotate SSH keys; revoke API tokens, credentials, VPN accounts.
- Logical isolation:
- Mark the system as “under investigation” and restrict who can log in.
- Mount certain filesystems read‑only (where possible) to reduce tampering.
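As an illustration of the host‑based firewall option, here is a sketch using `nftables` that drops everything except loopback, established flows, and new SSH from one trusted address. The bastion IP is a placeholder (TEST‑NET‑3), and in a real incident you would apply equivalent rules through your approved containment procedure.

```bash
#!/usr/bin/env bash
# Containment sketch: block all traffic except SSH from one trusted IP.
# Uses nftables; 203.0.113.10 stands in for your bastion address.
set -eu

BASTION="203.0.113.10"

nft add table inet quarantine
nft add chain inet quarantine input  '{ type filter hook input priority 0; policy drop; }'
nft add chain inet quarantine output '{ type filter hook output priority 0; policy drop; }'

# Keep loopback and already-established flows (so the current SSH session survives).
nft add rule inet quarantine input iif lo accept
nft add rule inet quarantine input ct state established,related accept
nft add rule inet quarantine output oif lo accept
nft add rule inet quarantine output ct state established,related accept

# Allow new SSH only from the bastion.
nft add rule inet quarantine input ip saddr "$BASTION" tcp dport 22 accept
```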
Your workflow must specify conditions to choose each strategy:
- When to fully disconnect vs keep online for observation.
- When legal/compliance requires you to preserve a running system instead of shutting it down.
3.2 Evidence Preservation During Containment
Containment must be done in a way that:
- Minimizes changes to disk and volatile memory.
- Preserves logs and artifacts for later analysis or legal use.
Typical actions:
- Before drastic changes:
- Execute a pre‑approved evidence collection script (e.g. one that collects `ps` and `netstat`/`ss` output, logs, and configuration snapshots); a minimal sketch appears at the end of this subsection.
- Take a snapshot if it’s a VM (with clear naming and notes).
- Avoid:
- `rm -rf` on suspected malware directories before capturing them.
- Rebooting unless specifically decided and documented.
Document:
- Exact containment steps.
- Commands/changes made, with timestamps.
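The following is a minimal evidence preservation sketch along these lines, assuming an illustrative case name and evidence mount point. It archives logs and configuration, captures scheduled tasks, and records every action with a UTC timestamp, which covers the documentation requirement above as a side effect.

```bash
#!/usr/bin/env bash
# Evidence preservation sketch: copy key logs and configs with hashes and
# timestamps before containment changes them. Paths and case name are illustrative.
set -u

CASE="IR-$(date -u +%Y%m%d)-web01"            # illustrative case name
DEST="/mnt/ir-evidence/$CASE"                 # ideally remote or removable storage
mkdir -p "$DEST"

log_action() {
  # Every action is recorded with a UTC timestamp for the incident record.
  echo "$(date -u +%FT%TZ) $*" >> "$DEST/actions.log"
}

log_action "Archiving /var/log and /etc"
tar -czf "$DEST/var_log.tar.gz" /var/log 2>> "$DEST/actions.log"
tar -czf "$DEST/etc.tar.gz" /etc 2>> "$DEST/actions.log"

log_action "Capturing scheduled tasks"
crontab -l > "$DEST/root_crontab.txt" 2>&1
cp -a /etc/crontab /etc/cron.* "$DEST/" 2>> "$DEST/actions.log"

log_action "Hashing collected evidence"
( cd "$DEST" && sha256sum ./* > SHA256SUMS ) 2>> "$DEST/actions.log"
```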
4. Eradication
Eradication removes the attacker’s footholds and malicious artifacts from the environment.
4.1 Deciding the Eradication Approach
Typical choices:
- Rebuild from trusted images:
- Preferred for heavily compromised hosts.
- Usually faster and safer than attempting a deep in‑place cleanup.
- In‑place cleanup:
- When rebuild is not immediately possible (legacy systems).
- Must be guided by thorough investigation to avoid missing persistence mechanisms.
Workflow should define:
- Criteria for “must rebuild” (e.g. rootkit detected, kernel tampering, unknown binaries in `/bin` or `/lib`, unauthorized `cron` entries).
- Who approves risky actions (e.g. immediate rebuild of a production DB).
4.2 Linux‑Specific Eradication Tasks
Executed after sufficient evidence is collected:
- Remove attacker accounts:
- Check `/etc/passwd`, `/etc/shadow`, `/etc/sudoers`, and `sudoers.d` for unauthorized entries.
- Remove persistence:
- Inspect `crontab -l`, `/etc/crontab`, `/etc/cron.*`, `systemd` unit files under `/etc/systemd/system`, and user units under `~/.config/systemd/user`.
- Delete or quarantine known malicious binaries (after hashing and copying for analysis).
- Reinstall critical packages if suspicion exists (
apt reinstall,dnf reinstall, etc.). - Credential rotation:
- Rotate passwords, SSH keys, database credentials, API tokens that might have been exposed.
- Update configs where those credentials are used.
Ensure all steps are scripted or repeatable where possible to reduce human error.
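As one example of making this repeatable, the following sketch sweeps the common Linux persistence points listed above and only reports what it finds; removal remains a deliberate, documented decision. The 14‑day window is an arbitrary illustration.

```bash
#!/usr/bin/env bash
# Persistence sweep sketch: list common Linux persistence points for review.
# It only reports; removal should be a deliberate, documented decision.
set -u

echo "== UID 0 accounts (expect only root) =="
awk -F: '$3 == 0 {print $1}' /etc/passwd

echo "== sudoers drop-ins =="
ls -l /etc/sudoers.d/ 2>/dev/null

echo "== System and user cron entries =="
cat /etc/crontab 2>/dev/null
ls -lR /etc/cron.* /var/spool/cron* 2>/dev/null

echo "== systemd units added or changed in the last 14 days =="
find /etc/systemd/system /usr/lib/systemd/system -type f -mtime -14 2>/dev/null

echo "== root shell startup files modified recently =="
find /root -maxdepth 1 -name '.bash*' -mtime -14 2>/dev/null

echo "== SSH authorized_keys files =="
find / -name authorized_keys -path '*/.ssh/*' 2>/dev/null
```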
5. Recovery
Recovery returns systems and services to normal operations in a controlled, monitored way.
5.1 Restore and Rebuild
Depending on your eradication choice:
- From backups:
- Restore from backups taken before compromise.
- Validate backups for integrity and age (ensure you’re not restoring compromised data).
- From golden images / templates:
- Deploy replacement instances (for servers, containers, VMs).
- Apply current patches and configuration management.
Keep a record of:
- Image/backup version.
- Hashes of critical files, if that is part of your process (a minimal sketch follows).
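If hashing critical files is part of your process, a minimal approach is to generate a manifest right after the rebuild and store it off‑host. The file list and manifest name below are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: record SHA-256 hashes of critical files on a freshly rebuilt host.
# Store the manifest off-host so it can be trusted during the next incident.
set -u

MANIFEST="web01-$(date -u +%Y%m%d)-baseline.sha256"   # illustrative name

# Illustrative set of paths worth baselining; extend per your build standard.
sha256sum /usr/sbin/sshd /usr/bin/sudo /etc/passwd /etc/shadow /etc/ssh/sshd_config \
  > "$MANIFEST"

# Later, on the same host, verify against the stored manifest:
#   sha256sum -c web01-YYYYMMDD-baseline.sha256
```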
5.2 Validation and Testing
Before declaring systems “back in service”:
- Functional tests:
- Application health checks.
- User login and common workflows.
- Security validation:
- Verify that indicators of compromise (IOCs) are absent (see the check sketched after this list).
- Confirm that security patches and hardening measures are applied.
- Run targeted scans (e.g. vulnerability scan on the rebuilt host).
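Here is a sketch of a simple IOC absence check, driven by two illustrative input files built from the incident's IOC list (`ioc_paths.txt` with known‑bad paths, `ioc_hashes.txt` with known‑bad SHA‑256 hashes). Real validation would normally also cover network indicators and run alongside the scans mentioned above.

```bash
#!/usr/bin/env bash
# Sketch: confirm known-bad paths and hashes from the incident are absent.
# ioc_paths.txt / ioc_hashes.txt are illustrative inputs built from your IOC list.
set -u

FOUND=0

# 1. Known-bad file paths must not exist.
while read -r path; do
  [ -z "$path" ] && continue
  if [ -e "$path" ]; then
    echo "IOC PRESENT: $path"
    FOUND=1
  fi
done < ioc_paths.txt

# 2. Hash files in common locations once, then look for known-bad hashes.
HASHLIST=$(mktemp)
find /usr/bin /usr/sbin /usr/local/bin /usr/local/sbin /tmp /var/tmp -type f \
  -exec sha256sum {} + 2>/dev/null > "$HASHLIST"
if grep -i -f ioc_hashes.txt "$HASHLIST"; then
  FOUND=1
fi
rm -f "$HASHLIST"

exit "$FOUND"   # non-zero exit fails the go/no-go gate
```

A non‑zero exit code makes the check easy to wire into a go/no‑go gate before traffic is switched back.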
Define go/no‑go criteria:
- What must pass before traffic is switched back (for load‑balanced or clustered systems).
- Who signs off (system owner, security, operations).
5.3 Monitoring After Recovery
Enable heightened monitoring for a defined “watch period”:
- Extra alerts on:
- Reappearance of known IOCs (hashes, domains, IPs, file paths).
- Anomalous logins or network traffic from/into the recovered system.
- Periodic reviews:
- Daily or more frequent log review during the watch period, supported by scheduled checks (see the sketch at the end of this subsection).
Document:
- Start and end of watch period.
- Any anomalies detected during it.
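Between manual reviews, a scheduled check for the incident's network IOCs can back this up. The indicator values and the cron schedule in the sketch below are placeholders.

```bash
#!/usr/bin/env bash
# Watch-period sketch: look for the incident's network IOCs in recent journal
# entries and current connections. Indicator values are placeholders.
# Example schedule (root crontab): 0 * * * * /usr/local/bin/watch-iocs.sh
set -u

IOC_PATTERN='203\.0\.113\.55|badexample\.invalid'   # placeholder IP and domain

echo "== Journal entries in the last hour mentioning known IOCs =="
journalctl --since "1 hour ago" --no-pager 2>/dev/null | grep -Ei "$IOC_PATTERN"

echo "== Current connections involving known IOC addresses =="
ss -tanp 2>/dev/null | grep -E "$IOC_PATTERN"
```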
6. Post‑Incident Review
The post‑incident phase turns experience into improvement.
6.1 Incident Documentation
Maintain an incident record that includes:
- Timeline:
- First detection, triage start, containment start/end, eradication, recovery, closure.
- Scope:
- Systems and data affected.
- Root cause and contributing factors.
- Tools and commands used (especially anything that might have altered evidence).
- Decisions made, with rationale.
On Linux, include:
- Important log excerpts.
- Config diffs (before/after; a diff sketch follows this list).
- IOCs (file paths, process names, hashes, IPs, domains).
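If you keep configuration captures (for example the `/etc` archive collected during containment), a before/after diff for the record can be produced with standard tools. The capture directories below are illustrative and assume both a pre‑incident and a post‑recovery copy exist.

```bash
#!/usr/bin/env bash
# Sketch: produce a config diff for the incident record from two /etc captures.
# The capture directories are illustrative; exclude noisy or secret-bearing files as needed.
set -u

BEFORE="/mnt/ir-evidence/IR-20240101-web01/etc-before"   # illustrative
AFTER="/mnt/ir-evidence/IR-20240101-web01/etc-after"     # illustrative

diff -ru \
  --exclude='*.dpkg-*' --exclude='shadow*' \
  "$BEFORE" "$AFTER" > config-diff.txt || true   # diff exits 1 when differences exist

wc -l config-diff.txt
```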
6.2 Root Cause and Lessons Learned
Root cause analysis (RCA) should answer:
- Technical root cause:
- Example: unpatched vulnerability in a PHP application; credential theft via reused SSH key; misconfigured firewall.
- Process root cause:
- Ineffective patch management, missing monitoring, no 2FA, lack of training.
From this, define:
- Immediate fixes (e.g. patch, configuration change).
- Medium‑term improvements (e.g. deploy centralized log management, FIM).
- Long‑term changes (e.g. redesign of access model, network segmentation).
6.3 Updating Playbooks, Controls, and Training
Use the incident to improve the workflow:
- Update:
- IR playbooks, runbooks (add or modify Linux commands that proved useful).
- Detection rules in SIEM or IDS (based on new IOCs and behaviors).
- Configuration management baselines (e.g. disallow weak ciphers, default accounts).
- Train:
- Walk through the incident in a blameless post‑mortem meeting.
- Incorporate scenarios into future tabletop exercises.
Set explicit due dates and owners for all improvement tasks, and track them to completion.
7. Putting It All Together: A Simple Linux IR Flow Example
To visualize how this works in practice, here is a concise example workflow for a suspected root compromise on a Linux web server:
- Detection:
- SIEM alert: unusual outbound connections from `web01` to an unknown IP on port 4444.
- Triage (within 30 minutes):
- IR on-call connects via bastion host (from a trusted IP).
- Runs pre-approved triage commands (`ps`, `ss`, quick log checks).
- Confirms an unknown process running as root with an established connection to that IP.
- Classifies as P1 (critical).
- Containment:
- Network team isolates `web01` into a quarantine VLAN, keeping bastion access.
- IR runs the evidence collection script; snapshots the VM.
- Locks service account credentials used by `web01`.
- Eradication:
- Decision: rebuild `web01` from a golden image due to a suspected rootkit.
- IR team documents all IOCs (file paths, hashes, IPs, domains).
- Recovery:
- Ops deploys a new `web01` VM from a hardened image.
- Restores web content from a clean backup, applies configuration management.
- Validates application behavior and confirms IOC absence.
- Switches load balancer traffic back to the new `web01`.
- Post-Incident:
- RCA finds initial access via outdated CMS plugin.
- Patches applied across all similar sites.
- New detection rule added for similar traffic patterns.
- Playbook updated with improved triage steps for web compromises.
This kind of structured, repeatable workflow is the goal: clear steps, clear responsibilities, and Linux‑specific details integrated where they matter.