Understanding Email Spam Filtering
Spam filtering systems analyze incoming (and sometimes outgoing) email and decide whether to accept, reject, quarantine, or tag it as spam. In a mail server stack, filters typically sit:
- Before the MTA (as a gateway or SMTP proxy),
- Inside the MTA’s policy/ACL engine,
- Or after delivery (e.g., in a user’s mailbox filter).
Common goals:
- Reduce spam and phishing reaching users.
- Avoid false positives (legitimate mail marked as spam).
- Help keep your server’s IP/domain reputation clean.
In this chapter, the focus is on how spam filtering systems work and how to integrate the main tools in Linux mail environments.
Types of Spam Filtering Approaches
Most systems combine several techniques:
1. Header and Content Rule-Based Filters
These rely on a large set of rules:
- Check message headers: From, Subject, Received, Message-ID, etc.
- Inspect body content: URLs, keywords, HTML patterns, attachments.
Rules give positive or negative scores. For example:
- +3.0 points if the message contains suspicious pharmaceutical terms.
- +2.5 points if the From domain doesn’t match Return-Path.
- -2.0 points if the sender is in a whitelist.
The final spam score is the sum:
$$
\text{score} = \sum_{i=1}^{n} w_i \cdot r_i
$$
where $r_i$ is rule $i$’s result (0 or 1, or sometimes a numeric measure) and $w_i$ its weight.
If score ≥ threshold (e.g. 5.0), mail is marked spam.
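As a minimal sketch of the weighted sum above, assuming three hypothetical rules whose names, weights, and matching logic are invented for illustration (they are not actual SpamAssassin rules):

```python
# Hypothetical rules as (name, weight, predicate). Not real SpamAssassin rules.
RULES = [
    ("PHARMA_TERMS", 3.0, lambda m: "cheap meds" in m["body"].lower()),
    ("FROM_MISMATCH", 2.5, lambda m: m["from_domain"] != m["return_path_domain"]),
    ("WHITELISTED", -2.0, lambda m: m["from_domain"] in {"trustedpartner.com"}),
]


def spam_score(msg, rules=RULES):
    """Return (score, fired_rule_names); score is the sum of w_i * r_i, r_i in {0, 1}."""
    fired = [(name, weight) for name, weight, test in rules if test(msg)]
    return sum(weight for _, weight in fired), [name for name, _ in fired]


def is_spam(msg, threshold=5.0):
    """Apply the threshold test: spam if score >= threshold."""
    score, _ = spam_score(msg)
    return score >= threshold


example = {
    "body": "Cheap meds, no prescription needed",
    "from_domain": "bank.example",
    "return_path_domain": "bulk.example",
}
score, fired = spam_score(example)
print(score, fired, is_spam(example))
```

Here two rules fire (3.0 + 2.5 = 5.5), so the example message crosses the 5.0 threshold.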
2. Bayesian and Statistical Filters
Bayesian filters learn from examples of spam and ham (legitimate mail):
- Admin/user feeds known spam and known ham to the filter.
- It builds probability tables for words and patterns.
- It later estimates $P(\text{spam} \mid \text{message})$ and adds/subtracts from the spam score.
Effect: adapts to your environment and languages.
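The learning step can be illustrated with a toy naive Bayes classifier. This is a sketch under simplifying assumptions (whitespace tokenization, add-one smoothing, at least one spam and one ham message learned); the class name `TinyBayes` is hypothetical, and production filters use more elaborate tokenization and probability combining.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy naive Bayes over lowercase word tokens, with add-one smoothing."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def learn(self, text, label):
        """Record the tokens of one message known to be 'spam' or 'ham'."""
        self.counts[label].update(text.lower().split())
        self.docs[label] += 1

    def p_spam(self, text):
        """Estimate P(spam | message); requires at least one spam and one ham learned."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        spam_total = sum(self.counts["spam"].values())
        ham_total = sum(self.counts["ham"].values())
        # Start from the prior odds, then add each token's log-likelihood ratio.
        log_odds = math.log(self.docs["spam"] / self.docs["ham"])
        for tok in text.lower().split():
            p_s = (self.counts["spam"][tok] + 1) / (spam_total + vocab)
            p_h = (self.counts["ham"][tok] + 1) / (ham_total + vocab)
            log_odds += math.log(p_s / p_h)
        return 1 / (1 + math.exp(-log_odds))


bayes = TinyBayes()
bayes.learn("win free money now", "spam")
bayes.learn("free prize claim now", "spam")
bayes.learn("meeting agenda for monday", "ham")
bayes.learn("lunch on monday", "ham")
print(bayes.p_spam("free money"), bayes.p_spam("monday meeting"))
```

With this tiny training set, "free money" scores above 0.5 and "monday meeting" below it, which is the signal a filter would translate into added or subtracted points.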
3. Reputation and Blacklist-Based Filtering
Uses external data sources:
- DNS-based blacklists (DNSBLs/RBLs): IP addresses known to send spam.
- URI blacklists: Domains/URLs commonly found in spam.
- Dynamic reputation services (commercial or community).
Lookups are done via DNS queries during spam scanning. Positive hits usually add significant score.
4. Sender Authentication Results (SPF, DKIM, DMARC)
Spam filters consume the results of sender-authentication checks (usually performed by a separate component):
- SPF result: pass, fail, softfail, neutral.
- DKIM verification: pass/fail.
- DMARC policy outcome: none, quarantine, or reject recommendation.
Spam filters then apply rules like:
- Add points if SPF fails and domain publishes a strict policy.
- Subtract points for a valid, aligned DKIM+DMARC.
5. Heuristics and Structural Analysis
Heuristics detect suspicious structure:
- HTML-only message with invisible text.
- Excessive use of obfuscation (e.g. v1@gr@ instead of “viagra”).
- Message claiming to be from a bank, but link hostnames differ from the visible text.
- Attachment types frequently used in malware campaigns.
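One of the heuristics above, a link whose visible text shows a different hostname than its target, can be sketched with the Python standard library. The helper names `LinkCollector` and `mismatched_links` are hypothetical, and the check only compares links whose visible text itself looks like a URL.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element with an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def mismatched_links(html):
    """Return (visible text, href) pairs where the two point at different hosts."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    suspicious = []
    for href, text in parser.links:
        if not text.lower().startswith(("http://", "https://")):
            continue  # only compare when the visible text looks like a URL
        shown = urlparse(text).hostname
        actual = urlparse(href).hostname
        if shown and actual and shown != actual:
            suspicious.append((text, href))
    return suspicious


sample = (
    '<p><a href="http://evil.example/login">https://www.mybank.example</a> '
    '<a href="https://ok.example/">https://ok.example/help</a></p>'
)
print(mismatched_links(sample))
```

A real filter would turn each hit into a score contribution rather than a hard verdict, since some legitimate mailers use tracking redirects.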
6. Collaborative and Cloud-Based Filters
Some systems submit message fingerprints or metadata to central services:
- Check if similar messages are broadly flagged as spam.
- Get real-time signatures against current campaigns.
Common in commercial gateways and some open-source projects with optional online services.
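To make the idea of a message fingerprint concrete, here is a rough sketch that hashes overlapping word sequences (shingles) of the normalized text and compares two messages by set overlap. This is an illustrative technique chosen for brevity, not the algorithm of any particular service; the function names are hypothetical.

```python
import hashlib
import re


def fingerprint(text, k=3):
    """Return a set of short hashes over k-word shingles of the normalized text."""
    words = re.findall(r"[a-z0-9]+", text.lower())  # drop case and punctuation
    count = max(len(words) - k + 1, 1)
    shingles = [" ".join(words[i:i + k]) for i in range(count)]
    return {hashlib.sha1(s.encode("utf-8")).hexdigest()[:12] for s in shingles}


def similarity(a, b):
    """Jaccard similarity between two message fingerprints (0.0 to 1.0)."""
    fa, fb = fingerprint(a), fingerprint(b)
    return len(fa & fb) / len(fa | fb)


original = "Claim your free prize now, click the link below today"
variant = "Claim your FREE prize now!!! Click the link below today"
unrelated = "Minutes from the Monday project meeting are attached"
print(similarity(original, variant), similarity(original, unrelated))
```

Only the hashes need to leave the server, which is why this style of check can work without sharing message bodies with the central service.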
Core Open-Source Spam Filtering Tools
SpamAssassin Overview
Apache SpamAssassin is one of the most widely used open-source spam filtering engines.
Key characteristics:
- Rule-based scoring with many built-in tests.
- Bayesian learner (optional but powerful).
- Integrates with DNSBLs, SPF/DKIM results, and more.
- Pluggable architecture via Perl plugins.
- Can run as:
- A standalone filter (spamc/spamd),
- A local milter for MTAs,
- A library called by other tools (e.g., Amavis, rspamd integration).
Typical outcomes:
- Adds headers like:
    X-Spam-Status: Yes, score=7.3 required=5.0 ...
    X-Spam-Flag: YES
- Optionally rewrites the subject: [SPAM] Original Subject.
Basic SpamAssassin Configuration Concepts
Main config files (paths depend on distro):
- Global: /etc/spamassassin/local.cf and other .cf files in /etc/spamassassin/
- User-specific overrides: ~/.spamassassin/user_prefs
Common settings to adjust:
- Required score threshold:
    required_score 5.0
- Subject tagging:
    rewrite_header Subject ***** SPAM *****
- Adding headers:
    add_header all Score _SCORE_
    add_header all Status _YESNO_, score=_SCORE_ required=_REQD_
- Enabling/disabling specific rules:
    score BAYES_99 4.0
    score HTML_MESSAGE 0.0
You normally start with defaults and adjust thresholds after monitoring for false positives/negatives.
Training the Bayesian Filter
Bayes is usually untrained by default, and it contributes nothing to the score until it has learned enough messages (by default, 200 spam and 200 ham). To enable and train:
- Enable Bayes:
    use_bayes 1
    bayes_auto_learn 1
- Feed sample spam and ham:
    sa-learn --spam /path/to/spam/
    sa-learn --ham /path/to/ham/
- Check stats:
    sa-learn --dump magic
For multi-user servers, be careful who controls training data; centralizing training or using per-domain training may be necessary.
Rspamd Overview
Rspamd is a newer, high-performance spam filtering system designed as a full policy and filtering engine.
Key properties:
- Very fast (written in C, asynchronous).
- Advanced plugin system (Lua-based).
- Web UI for stats, training, tuning.
- Tight integration with MTAs via milter (especially Postfix, Exim).
- Built-in modules for:
- Spam rules and Bayesian filtering,
- DKIM signing/verification,
- SPF/DMARC,
- Fuzzy checks and URL reputation.
Compared to SpamAssassin, Rspamd is more of an all-in-one policy engine than a pure content filter.
Rspamd Architecture Basics
Components:
- rspamd daemon: core scanning engine.
- Redis: often used for Bayes, fuzzy hashes, rate limiting.
- Web UI: configuration, statistics, and interactive learning.
Configuration is modular, usually in /etc/rspamd/:
- worker-* configs define listening sockets and roles.
- modules.d directory for individual modules (e.g. dkim.conf, multimap.conf).
- override.d for site-specific overrides.
Rspamd assigns symbols and scores; each message ends up with:
- An overall score.
- “Symbols” indicating which rules fired.
- Suggested action: no action, add header, rewrite subject, reject.
Example snippet from log-like output:
Action: reject
Score: 12.3 (required 7.0)
Symbols: BAYES_SPAM(4.20), RBL_SPAMHAUS(3.50), DKIM_INVALID(2.50), ...
Admins typically control thresholds for actions:
actions {
    reject = 15;
    add_header = 7;
    greylist = 4;
}
Amavis and Content Filter Stacks
Amavis (amavisd-new) is a “content filter controller” that often runs between the MTA (e.g., Postfix) and the actual engines:
- Accepts mail from the MTA via SMTP or LMTP.
- Sends the content to:
- SpamAssassin (spam scoring).
- Antivirus (e.g., ClamAV).
- Other scanners (DKIM verification, attachment policy checks).
- Makes a decision and returns result to MTA.
Typical email flow with Amavis:
- Incoming SMTP → Postfix.
- Postfix passes message to Amavis based on a content_filter setting.
- Amavis runs SpamAssassin + antivirus.
- Amavis returns cleaned and annotated message, or rejects/quarantines.
Amavis is configuration-heavy but standard in many “all-in-one” mail server solutions.
Integration with MTAs
Filtering at SMTP Time vs After Delivery
Two main strategies:
- SMTP-time filtering:
- Filter while the SMTP conversation is still open.
- Can reject spam before accepting the message.
- Saves disk space and avoids bounce messages.
- Requires efficient, low-latency filtering.
- After-delivery filtering:
- Accept all mail, run filters later (e.g., via LMTP, LDA, or mailbox filters).
- Less risk of accidentally rejecting legitimate messages.
- But spam still consumes bandwidth and storage.
High-volume and security-focused setups tend to prefer SMTP-time filtering.
Milter-Based Integration (Postfix, Sendmail, etc.)
Milters are filter daemons speaking the milter protocol:
- MTA connects to milter, streams message events (connect, headers, body).
- Milter returns actions: accept, reject, tempfail, modify headers, etc.
Typical setup:
- SpamAssassin Milter (e.g., spamass-milter) or Rspamd as a milter.
- Configure MTA to call the milter on incoming connections.
Example Postfix configuration (conceptual):
smtpd_milters = inet:localhost:11332
non_smtpd_milters = $smtpd_milters
milter_default_action = accept
milter_protocol = 6
Where localhost:11332 is an Rspamd or SpamAssassin milter socket.
Policy and Header-Based Delivery
Even when filtering is done pre-queue, delivery agents (Dovecot, local delivery programs) often use headers or flags to route spam:
- Common headers:
    X-Spam-Flag: YES
    X-Spam-Level: **
    X-Spam-Status: Yes, score=...
Mailbox rules can move spam-tagged messages into a “Junk” folder based on these headers.
Example Dovecot Sieve rule:
require ["fileinto"];
if header :contains "X-Spam-Flag" "YES" {
    fileinto "Junk";
    stop;
}
This separates the act of scoring from where the message finally lands.
Supporting Technologies in Spam Filtering
DNSBLs and DNS Lookups
Filters frequently use DNSBLs:
- Spamhaus, Barracuda, and many others.
- Query pattern: reverse the IP and append the zone, e.g. 2.0.0.127.zen.spamhaus.org (the query for the test address 127.0.0.2).
If the query returns an address (usually in 127.0.0.x), it’s a listing.
Spam filters cache these results to avoid excessive DNS load. Using DNSBLs at scale requires:
- Access rights (some providers require a subscription).
- Local or fast resolvers to minimize latency.
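The query pattern described above can be sketched as a small helper for IPv4 addresses. The function name `dnsbl_query_name` is hypothetical, and the sketch only builds the name; it performs no network lookup.

```python
import ipaddress


def dnsbl_query_name(ip, zone="zen.spamhaus.org"):
    """Build the DNSBL lookup name for an IPv4 address: reversed octets plus zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on invalid input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


print(dnsbl_query_name("127.0.0.2"))  # 2.0.0.127.zen.spamhaus.org
```

A filter would then resolve this name as an A record: an answer means the address is listed, and a "no such name" response means it is not.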
Greylisting
Greylisting is often used alongside content filters:
- On first contact from a new sender/IP, temporarily reject with a 4xx code.
- Legitimate MTAs retry later.
- Many spam bots don’t retry, so their messages never get through.
Filters track a triplet:
- Sender IP,
- Envelope sender,
- Envelope recipient.
Once seen retrying after a delay, mails from that triplet are accepted.
Effects:
- Reduces spam with very little content analysis.
- Adds delay to first-time correspondents.
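The triplet logic can be sketched as a toy in-memory greylist. The class name, the 300-second delay, and the return values are assumptions for illustration; real implementations persist state, expire old entries, and usually whitelist known senders.

```python
import time


class Greylist:
    """Toy in-memory greylist keyed on (client IP, envelope sender, envelope recipient)."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay  # seconds a sender must wait before a retry counts
        self.clock = clock          # injectable so the logic can be tested without sleeping
        self.first_seen = {}        # triplet -> timestamp of the first attempt
        self.passed = set()         # triplets that have already retried successfully

    def check(self, ip, sender, recipient):
        """Return 'accept', or 'tempfail' when the MTA should answer with a 4xx code."""
        triplet = (ip, sender.lower(), recipient.lower())
        if triplet in self.passed:
            return "accept"
        now = self.clock()
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            self.passed.add(triplet)
            return "accept"
        return "tempfail"


fake_now = [0]
greylist = Greylist(min_delay=300, clock=lambda: fake_now[0])
first_try = greylist.check("192.0.2.1", "a@sender.example", "b@rcpt.example")
fake_now[0] = 301
retry = greylist.check("192.0.2.1", "a@sender.example", "b@rcpt.example")
print(first_try, retry)
```

The first attempt is deferred and the retry after the delay is accepted, which matches the behaviour of a legitimate MTA's retry queue.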
Rate Limiting and Abuse Detection
Spam filtering stacks may also implement rate limits:
- Messages per time window per:
- IP,
- Authenticated user,
- Sender domain.
Helps control compromised accounts or abuse from a single client before content filters even run.
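A per-key limit of this kind can be sketched as a sliding-window counter. The class name and parameters are hypothetical, and the key can be whatever the policy tracks (an IP, an authenticated user, or a sender domain).

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` messages per `window` seconds for each key."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock                # injectable for testing
        self.events = defaultdict(deque)  # key -> timestamps of accepted messages

    def allow(self, key):
        """Return True and record the message if the key is under its limit."""
        now = self.clock()
        stamps = self.events[key]
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()              # forget messages outside the window
        if len(stamps) >= self.limit:
            return False
        stamps.append(now)
        return True


fake_now = [0.0]
limiter = SlidingWindowLimiter(limit=2, window=60, clock=lambda: fake_now[0])
results = [limiter.allow("user@example.com") for _ in range(3)]
print(results)  # [True, True, False]
```

Because rejected attempts are not recorded, a client that keeps hammering the server is not penalized beyond the window; stricter policies may choose to count those attempts as well.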
Tuning and Managing Spam Filters
Thresholds and Policies
Key spam policy questions:
- At what score do you:
- Reject outright?
- Add headers only?
- Quarantine?
Common patterns:
- score < 3: deliver normally.
- 3 ≤ score < 7: tag mail, deliver to inbox or “likely spam”.
- score ≥ 7: move to spam folder or reject.
- Higher thresholds in early deployment, then tighten as confidence grows.
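The score bands above map naturally onto a small decision function. This is a sketch with the example cut-offs of 3 and 7 as defaults; the function and action names are hypothetical.

```python
def choose_action(score, tag_at=3.0, junk_at=7.0):
    """Map a spam score to a delivery action using two thresholds."""
    if score >= junk_at:
        return "junk_or_reject"  # move to the spam folder or reject
    if score >= tag_at:
        return "tag"             # add headers, deliver to inbox or "likely spam"
    return "deliver"             # deliver normally


for s in (1.2, 3.0, 6.9, 7.0):
    print(s, choose_action(s))
```

Raising `junk_at` during early deployment and lowering it later corresponds to the "tighten as confidence grows" pattern.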
Whitelists and Blacklists
To reduce false positives:
- Whitelist trusted senders or domains:
    whitelist_from *@trustedpartner.com
- For Rspamd, configure multimap rules for internal or trusted ranges.
Use blacklists with care; better to rely on reputation systems than manual blacklists unless dealing with very persistent sources.
Per-User vs Global Settings
On multi-tenant or hosting servers:
- Global settings:
- Base rule set, DNSBLs, required scores, antivirus.
- Per-domain or per-user:
- Subject rewrite preferences.
- Per-user Bayes databases.
- User-level whitelists/blacklists.
IMAP clients often expose spam settings (like “Junk” folder) but the actual logic usually lives in server-side filters (Sieve, Rspamd, SpamAssassin).
Monitoring Effectiveness
Key metrics:
- Spam catch rate: percentage of spam that’s successfully tagged/blocked.
- False positive rate: legitimate messages marked as spam.
- False negative rate: spam delivered as ham.
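These metrics can be computed from a manually reviewed sample of messages. The sketch below assumes the false positive rate is measured as a fraction of legitimate mail and the other two as fractions of spam; the function name and the sample counts are made up for illustration.

```python
def filter_metrics(spam_caught, spam_missed, ham_flagged, ham_passed):
    """Compute catch rate, false negative rate, and false positive rate from counts."""
    spam_total = spam_caught + spam_missed
    ham_total = ham_flagged + ham_passed

    def ratio(part, total):
        return part / total if total else 0.0  # avoid dividing by zero on empty samples

    return {
        "catch_rate": ratio(spam_caught, spam_total),
        "false_negative_rate": ratio(spam_missed, spam_total),
        "false_positive_rate": ratio(ham_flagged, ham_total),
    }


# Invented sample: 1000 spam messages and 2000 legitimate ones reviewed by hand.
metrics = filter_metrics(spam_caught=950, spam_missed=50, ham_flagged=5, ham_passed=1995)
print(metrics)
```

In this invented sample the catch rate is 95% and the false positive rate is 0.25%; the catch rate and the false negative rate always sum to 1.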
Practical monitoring methods:
- Sample check: periodically review spam folders and server logs.
- Feedback loops: allow users to report:
- “This is spam” (train as spam),
- “Not spam” (train as ham).
- Rspamd or other web UIs: dashboards showing symbol statistics, top senders, scores.
Log and Header Analysis
Headers added by filters are your main diagnostic tool:
- Examine X-Spam-Status, X-Spam-Score, X-Spam-Report (SpamAssassin).
- Rspamd’s X-Spam-Status-like headers or Authentication-Results enhancements.
For example:
X-Spam-Status: Yes, score=8.1 required=5.0
    tests=BAYES_95,DKIM_INVALID,HTML_IMAGE_ONLY_32,URIBL_BLACK ...
From this, you can:
- Identify which tests are over-aggressive or too weak.
- Adjust rule scores or thresholds.
- Identify misconfigurations (e.g., DKIM invalid due to local signing error).
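For bulk analysis, the same header can be parsed programmatically. The sketch below assumes a SpamAssassin-style X-Spam-Status value shaped like the example above; the function name `parse_spam_status` is hypothetical, and other filters format their headers differently.

```python
import re
from email import message_from_string


def parse_spam_status(raw_message):
    """Extract verdict, score, threshold, and test names from X-Spam-Status, or None."""
    msg = message_from_string(raw_message)
    value = msg["X-Spam-Status"]
    if value is None:
        return None
    text = " ".join(value.split())  # unfold continuation lines into one line
    head = re.match(r"(Yes|No),\s*score=(-?[\d.]+)\s+required=(-?[\d.]+)", text)
    if head is None:
        return None
    tests = re.search(r"tests=([A-Z0-9_]+(?:,\s*[A-Z0-9_]+)*)", text)
    return {
        "is_spam": head.group(1) == "Yes",
        "score": float(head.group(2)),
        "required": float(head.group(3)),
        "tests": re.split(r",\s*", tests.group(1)) if tests else [],
    }


raw = (
    "From: a@example.com\n"
    "X-Spam-Status: Yes, score=8.1 required=5.0\n"
    "\ttests=BAYES_95,DKIM_INVALID,\n"
    "\tHTML_IMAGE_ONLY_32,URIBL_BLACK autolearn=no\n"
    "Subject: hi\n"
    "\n"
    "body\n"
)
print(parse_spam_status(raw))
```

Counting how often each test name appears across false positives is a quick way to spot rules whose scores need adjusting.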
Security and Anti-Abuse Considerations
Spam filtering interacts closely with security:
- Phishing and malware detection: Many filters recognize known phishing templates or suspicious attachment types.
- Outgoing spam control:
- If your server is compromised and sends spam, your IP/domain reputation suffers.
- Apply the same or lighter rules to outbound messages to detect anomalies.
- Privacy:
- Full-content scanning means filters see message bodies.
- For privacy-sensitive environments, be clear about storage, logs, and data retention policies.
Typical hardening measures:
- Use TLS between MTA and filter if on separate hosts.
- Limit who can access web UIs (e.g. Rspamd) to admins.
- Use strong authentication and role separation for training and configuration.
Deployment Patterns and Examples
Simple On-Box SpamAssassin on a Small Server
Scenario: Single Postfix server for a small organization.
- Run SpamAssassin as a daemon (spamd).
- Use a milter or content filter to send mail to spamd.
spamd. - Tag mail with spam headers and adjust Dovecot Sieve to move spam to user Junk folders.
- Optional: use greylisting and a DNSBL at the MTA level.
Advantages:
- Easy to understand.
- Lightweight for modest volume.
High-Performance Stack with Rspamd
Scenario: Larger installation or ISP.
- Postfix with Rspamd milter handling:
- Spam scoring,
- DKIM signing/verification,
- SPF/DMARC checks,
- Rate limiting, greylisting.
- Redis backend for Bayes and fuzzy storage.
- Web UI for monitoring and interactive training.
- Per-domain config overrides to offer different policies.
Advantages:
- Scales well.
- Single integrated policy engine.
- Fine-grained multi-tenant control.
Gateway Scanning with Amavis + Back-End Mail Store
Scenario: Separate MX gateway servers.
- Front-end MX runs Postfix + Amavis (SpamAssassin + antivirus).
- Back-end server (e.g., Exchange, or another Postfix/Dovecot) receives “cleaned” mail.
- Filtering decisions mostly done on gateway; back-end applies user-level folder rules.
Advantages:
- Isolates internet-facing components.
- Centralizes spam/AV control at the perimeter.
This chapter focused on the concepts, main tools, and integration patterns of spam filtering systems in Linux-based mail environments. Configuration details for the MTA itself and for other email components are covered in their respective chapters.