5.3.4 Spam filtering systems

Understanding Email Spam Filtering

Spam filtering systems analyze incoming (and sometimes outgoing) email and decide whether to accept, reject, quarantine, or tag it as spam. In a mail server stack, filters typically sit:

Before the MTA (as a gateway or SMTP proxy),
Inside the MTA’s policy/ACL engine,
Or after delivery (e.g., in a user’s mailbox filter).

Common goals:

Reduce spam and phishing reaching users.
Avoid false positives (legitimate mail marked as spam).
Help keep your server’s IP/domain reputation clean.

In this chapter, the focus is on how spam filtering systems work and how to integrate the main tools in Linux mail environments.

Types of Spam Filtering Approaches

Most systems combine several techniques:

1. Header and Content Rule-Based Filters

These rely on a large set of rules:

Check message headers: From, Subject, Received, Message-ID, etc.
Inspect body content: URLs, keywords, HTML patterns, attachments.

Rules give positive or negative scores. For example:

+3.0 points if the message contains suspicious pharmaceutical terms.
+2.5 points if From domain doesn’t match Return-Path.
-2.0 points if sender is in a whitelist.

The final spam score is the sum:

$$
\text{score} = \sum_{i=1}^{n} w_i \cdot r_i
$$

where $r_i$ is rule $i$’s result (0 or 1, or sometimes a numeric measure) and $w_i$ its weight.

If score ≥ threshold (e.g. 5.0), mail is marked spam.
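The weighted sum above can be sketched in a few lines of Python. This is only an illustration: the rule names, the keyword pattern, and the dictionary layout of the message are assumptions made for the example, not rules taken from any real filter.

```python
import re
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    name: str
    weight: float                  # w_i in the formula
    test: Callable[[dict], bool]   # r_i: True (1) if the rule fires


def domain_of(address: str) -> str:
    """Return the lower-cased domain part of an email address."""
    return address.rsplit("@", 1)[-1].strip("<> ").lower()


# Hypothetical rules mirroring the three examples in the text.
RULES = [
    Rule("PHARMA_TERMS", 3.0,
         lambda m: bool(re.search(r"\b(pills|pharmacy)\b", m["body"], re.I))),
    Rule("FROM_RETURN_PATH_MISMATCH", 2.5,
         lambda m: domain_of(m["from"]) != domain_of(m["return_path"])),
    Rule("SENDER_WHITELISTED", -2.0,
         lambda m: m["from"].lower() in m.get("whitelist", set())),
]


def score_message(msg: dict, rules=RULES, threshold: float = 5.0):
    """Sum the weights of all rules that fire and compare with the threshold."""
    hits = [rule for rule in rules if rule.test(msg)]
    score = sum(rule.weight for rule in hits)
    return score, score >= threshold, [rule.name for rule in hits]
```

A message that triggers the first two rules scores 3.0 + 2.5 = 5.5 and crosses the example threshold of 5.0; a whitelisted sender with no other hits ends at -2.0.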

2. Bayesian and Statistical Filters

Bayesian filters learn from examples of spam and ham (legitimate mail):

Admin/user feeds known spam and known ham to the filter.
It builds probability tables for words and patterns.
It later estimates $P(\text{spam} \mid \text{message})$ and adds/subtracts from the spam score.

Effect: adapts to your environment and languages.
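To make the learning step concrete, here is a toy naive-Bayes-style classifier. It is a simplified sketch, not the algorithm SpamAssassin or Rspamd actually implement: the class name, the tokenizer, and the add-one smoothing are assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy classifier: counts in how many spam/ham messages each token appeared."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text: str) -> set:
        # Each distinct token counts once per message.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def learn(self, text: str, label: str) -> None:
        """Feed one known message; label is 'spam' or 'ham'."""
        self.messages[label] += 1
        self.counts[label].update(self.tokens(text))

    def spam_probability(self, text: str) -> float:
        """Estimate P(spam | message) from the tokens present in the text."""
        # Prior odds from the number of training messages (add-one smoothing).
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for tok in self.tokens(text):
            p_spam = (self.counts["spam"][tok] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))
```

After training on a handful of spam and ham samples, tokens seen mostly in spam push the probability above 0.5 and tokens seen mostly in ham push it below. A real filter would then map that probability to a positive or negative score contribution.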

3. Reputation and Blacklist-Based Filtering

Uses external data sources:

DNS-based blacklists (DNSBLs/RBLs): IP addresses known to send spam.
URI blacklists: Domains/URLs commonly found in spam.
Dynamic reputation services (commercial or community).

Lookups are done via DNS queries during spam scanning. Positive hits usually add significant score.

4. Sender Authentication Results (SPF, DKIM, DMARC)

Spam filters consume the results of sender-authentication checks (usually performed by a separate component):

SPF result: pass, fail, softfail, neutral, none, temperror, or permerror.
DKIM verification: pass or fail for each signature, or none if the message is unsigned.
DMARC policy outcome: none, quarantine, or reject recommendation.

Spam filters then apply rules like:

Add points if SPF fails and domain publishes a strict policy.
Subtract points for a valid, aligned DKIM+DMARC.

5. Heuristics and Structural Analysis

Heuristics detect suspicious structure:

HTML-only message with invisible text.
Excessive use of obfuscation (e.g. v1@gr@ instead of “viagra”).
Message claiming to be from a bank, but link hostnames differ from the visible text.
Attachment types frequently used in malware campaigns.
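One of the heuristics above, a link whose visible text names a different host than its real target, can be sketched with Python's standard library. The function and class names are made up for this example, and the strict host comparison is an assumption: it would also flag harmless differences such as a missing "www." prefix, which a production rule would need to tolerate.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Matches visible text that itself looks like a hostname or URL.
HOST_RE = re.compile(r"(?:https?://)?([a-z0-9.-]+\.[a-z]{2,})", re.I)


def mismatched_links(html: str) -> list:
    """Return (shown_host, real_host) pairs where the two differ."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    result = []
    for href, text in parser.links:
        real = (urlparse(href).hostname or "").lower()
        shown = HOST_RE.match(text)
        if real and shown and shown.group(1).lower() != real:
            result.append((shown.group(1).lower(), real))
    return result
```

Links whose text is ordinary wording ("Click here") are ignored, because only text that looks like a hostname can contradict the real target.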

6. Collaborative and Cloud-Based Filters

Some systems submit message fingerprints or metadata to central services:

Check if similar messages are broadly flagged as spam.
Get real-time signatures against current campaigns.

Common in commercial gateways and some open-source projects with optional online services.
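The idea of a message fingerprint can be illustrated with a deliberately simple scheme: normalize away the parts spammers vary between copies (URLs, numbers, punctuation) and hash what remains. This is an assumption-laden sketch, not the method of any particular service; real collaborative filters generally use fuzzy hashes that also tolerate small wording changes, which an exact digest like this one does not.

```python
import hashlib
import re


def normalize(body: str) -> str:
    """Reduce a message body to a canonical form for fingerprinting."""
    text = body.lower()
    text = re.sub(r"https?://\S+", " url ", text)   # every URL becomes a placeholder
    text = re.sub(r"\d+", "0", text)                # every number becomes "0"
    text = re.sub(r"[^a-z0-9]+", " ", text)         # punctuation and whitespace runs
    return text.strip()


def fingerprint(body: str) -> str:
    """Hex digest that is identical for bodies with the same normalized form."""
    return hashlib.sha256(normalize(body).encode("utf-8")).hexdigest()
```

Two copies of a campaign that differ only in the amount promised and the landing URL yield the same digest, so only the digest (not the message content) would need to be shared with a central service.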

Core Open-Source Spam Filtering Tools

SpamAssassin Overview

Apache SpamAssassin is one of the most widely used open-source spam filtering engines.

Key characteristics:

Rule-based scoring with many built-in tests.
Bayesian learner (optional but powerful).
Integrates with DNSBLs, SPF/DKIM results, and more.
Pluggable architecture via Perl plugins.
Can run as:

A standalone filter daemon with a lightweight client (spamd/spamc),
Behind a milter for MTAs (e.g., spamass-milter),
A Perl library called by other tools (e.g., Amavis).

Typical outcomes:

Adds headers like:

X-Spam-Status: Yes, score=7.3 required=5.0 ...
X-Spam-Flag: YES

Optionally rewrites the subject: [SPAM] Original Subject.

Basic SpamAssassin Configuration Concepts

Main config files (paths depend on distro):

Global: /etc/spamassassin/local.cf and .cf files in /etc/spamassassin/
User-specific overrides: ~/.spamassassin/user_prefs

Common settings to adjust:

Required score threshold:

  required_score 5.0

Subject tagging:

  rewrite_header Subject ***** SPAM *****

Adding headers:

  add_header all  Score _SCORE_
  add_header all  Status _YESNO_, score=_SCORE_ required=_REQD_

Adjusting or disabling specific rules (a score of 0 disables the rule):

  score  BAYES_99  4.0
  score  HTML_MESSAGE  0.0

You normally start with defaults and adjust thresholds after monitoring for false positives/negatives.

Training the Bayesian Filter

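SpamAssassin enables Bayes by default, but the classifier contributes nothing to the score until it has learned enough mail (by default 200 spam and 200 ham messages). To make sure it is enabled and trained: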

Enable Bayes:

   use_bayes 1
   bayes_auto_learn 1

Feed sample spam and ham:

   sa-learn --spam /path/to/spam/
   sa-learn --ham  /path/to/ham/

Check stats:

   sa-learn --dump magic

For multi-user servers, be careful who controls training data; centralizing training or using per-domain training may be necessary.

Rspamd Overview

Rspamd is a newer, high-performance spam filtering system designed as a full policy and filtering engine.

Key properties:

Very fast (written in C, asynchronous).
Advanced plugin system (Lua-based).
Web UI for stats, training, tuning.
Tight integration with MTAs: via the milter protocol for Postfix and Sendmail, and via Exim's built-in spam-scanning interface.
Built-in modules for:

Spam rules and Bayesian filtering,
DKIM signing/verification,
SPF/DMARC,
Fuzzy checks and URL reputation.

Compared to SpamAssassin, Rspamd is more of an all-in-one policy engine than a pure content filter.

Rspamd Architecture Basics

Components:

rspamd daemon: core scanning engine.
Redis: often used for Bayes, fuzzy hashes, rate limiting.
Web UI: configuration, statistics, and interactive learning.

Configuration is modular, usually in /etc/rspamd/:

worker-* configs define listening sockets and roles.
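modules.d holds the stock configuration of individual modules (e.g. dkim.conf, multimap.conf).
local.d and override.d hold site-specific changes: files in local.d are merged with the defaults, files in override.d replace them.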

Rspamd assigns symbols and scores; each message ends up with:

An overall score.
“Symbols” indicating which rules fired.
Suggested action: no action, greylist, add header, rewrite subject, soft reject, or reject.

Example snippet from log-like output:

Action: reject
Score: 12.3 (required 7.0)
Symbols: BAYES_SPAM(4.20), RBL_SPAMHAUS(3.50), DKIM_INVALID(2.50), ...

Admins typically control thresholds for actions:

actions {
  reject = 15;
  add_header = 7;
  greylist = 4;
}
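How a score is turned into an action from such thresholds can be sketched as follows. This is a simplified model for illustration (the function name and the "no action" default are assumptions), not Rspamd's own code: the action with the highest threshold that the score reaches wins.

```python
def choose_action(score: float, thresholds: dict) -> str:
    """Return the action with the highest threshold that the score reaches."""
    action = "no action"
    # Walk thresholds from lowest to highest; the last one reached wins.
    for name, limit in sorted(thresholds.items(), key=lambda item: item[1]):
        if score >= limit:
            action = name
    return action


# Thresholds copied from the example configuration above.
ACTIONS = {"reject": 15, "add_header": 7, "greylist": 4}
```

With these values a score of 9.0 results in add_header, 4.0 in greylist, and anything below 4 passes without action.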

Amavis and Content Filter Stacks

Amavis (amavisd-new) is a “content filter controller” that often runs between the MTA (e.g., Postfix) and the actual engines:

Accepts mail from the MTA via SMTP or LMTP.
Sends the content to:

SpamAssassin (spam scoring).
Antivirus (e.g., ClamAV).
Other scanners (DKIM verification, attachment policy checks).

Makes a decision and returns result to MTA.

Typical email flow with Amavis:

Incoming SMTP → Postfix.
Postfix passes message to Amavis based on a content_filter setting.
Amavis runs SpamAssassin + antivirus.
Amavis returns cleaned and annotated message, or rejects/quarantines.

Amavis is configuration-heavy but standard in many “all-in-one” mail server solutions.

Integration with MTAs

Filtering at SMTP Time vs After Delivery

Two main strategies:

SMTP-time filtering:

Filter while the SMTP conversation is still open.
Can reject spam before accepting the message.
Saves queue space and avoids backscatter: a message rejected during the SMTP session is bounced by the sending server, not by yours.
Requires efficient, low-latency filtering.

After-delivery filtering:

Accept all mail, run filters later (e.g., via LMTP, LDA, or mailbox filters).
Less risk of accidentally rejecting legitimate messages.
But spam still consumes bandwidth and storage.

High-volume and security-focused setups tend to prefer SMTP-time filtering.

Milter-Based Integration (Postfix, Sendmail, etc.)

Milters are filter daemons speaking the milter protocol:

MTA connects to milter, streams message events (connect, headers, body).
Milter returns actions: accept, reject, tempfail, modify headers, etc.

Typical setup:

SpamAssassin Milter (e.g., spamass-milter) or Rspamd as a milter.
Configure MTA to call the milter on incoming connections.

Example Postfix configuration (conceptual):

smtpd_milters = inet:localhost:11332
non_smtpd_milters = $smtpd_milters
milter_default_action = accept
milter_protocol = 6

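Here localhost:11332 is the default address of Rspamd's milter-capable proxy worker; a SpamAssassin milter such as spamass-milter listens on whatever socket you configure for it.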

Policy and Header-Based Delivery

Even when filtering is done pre-queue, delivery agents (Dovecot, local delivery programs) often use headers or flags to route spam:

Common headers:

X-Spam-Flag: YES
X-Spam-Level: **
X-Spam-Status: Yes, score=...

Mailbox rules can move spam-tagged messages into a “Junk” folder based on these headers.

Example Dovecot Sieve rule:

require ["fileinto"];

if header :contains "X-Spam-Flag" "YES" {
  fileinto "Junk";
  stop;
}

This separates the act of scoring from where the message finally lands.

Supporting Technologies in Spam Filtering

DNSBLs and DNS Lookups

Filters frequently use DNSBLs:

Spamhaus, Barracuda, and many others.
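Query pattern: reverse the octets of the IPv4 address and append the zone; for the test address 127.0.0.2 the query name is 2.0.0.127.zen.spamhaus.org.

If the query returns an A record (usually in 127.0.0.x), the address is listed; an NXDOMAIN answer means it is not.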

Spam filters cache these results to avoid excessive DNS load. Using DNSBLs at scale requires:

Access rights (some providers require a subscription).
Local or fast resolvers to minimize latency.
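The DNSBL query pattern described above can be sketched like this. The helper names are invented for the example, only IPv4 is handled, and the lookup function uses the system resolver with no caching or timeout control, so treat it as an illustration rather than something to run in a mail path.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNSBL query name: reversed IPv4 octets plus the zone."""
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        raise ValueError("this sketch handles IPv4 only")
    return ".".join(reversed(str(addr).split("."))) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone answers with a 127.0.0.x address for this IP."""
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # No answer. Note this sketch cannot tell "not listed" from a resolver failure.
        return False
    return answer.startswith("127.0.0.")
```

For example, dnsbl_query_name("127.0.0.2", "zen.spamhaus.org") produces the query name used in the text. Checking the 127.0.0.x range matters because some providers return other addresses to signal errors, such as blocked or over-quota queries, rather than listings.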

Greylisting

Greylisting is often used alongside content filters:

On first contact from a new sender/IP, temporarily reject with a 4xx code.
Legitimate MTAs retry later.
Many spam bots don’t retry, so their messages never get through.

Filters track a triplet:

Sender IP,
Envelope sender,
Envelope recipient.

Once a triplet has retried after the configured delay, later mail from that triplet is accepted without further delay.
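The triplet logic can be sketched as a small in-memory class. The class name, the 300-second default delay, and the injectable clock are assumptions for this example; a real greylisting service would also persist its state and expire old entries.

```python
import time


class Greylist:
    """Defer the first attempts of an unknown triplet; accept retries after a delay."""

    def __init__(self, min_delay: float = 300, clock=time.time):
        self.min_delay = min_delay   # seconds a sender must wait before a retry counts
        self.clock = clock           # injectable so the logic can be tested
        self.first_seen = {}         # triplet -> time of first attempt
        self.passed = set()          # triplets that have retried successfully

    def check(self, client_ip: str, mail_from: str, rcpt_to: str) -> str:
        """Return 'accept' or 'defer' (the latter maps to a 4xx SMTP reply)."""
        key = (client_ip, mail_from.lower(), rcpt_to.lower())
        if key in self.passed:
            return "accept"
        now = self.clock()
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            self.passed.add(key)
            return "accept"
        return "defer"
```

The first attempt and any retry that arrives too early are deferred; a retry after the delay is accepted, and so is all later mail from the same triplet.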

Effects:

Reduces spam with very little content analysis.
Adds delay to first-time correspondents.

Rate Limiting and Abuse Detection

Spam filtering stacks may also implement rate limits:

Messages per time window per:

IP,
Authenticated user,
Sender domain.

Helps control compromised accounts or abuse from a single client before content filters even run.
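A per-key limit of this kind can be sketched with a sliding window. The class below is a hypothetical, in-memory illustration (names and parameters are assumptions); production systems typically keep these counters in a shared store such as Redis so that all workers see the same state.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most max_messages per key within a sliding time window."""

    def __init__(self, max_messages: int, window_seconds: float, clock=time.monotonic):
        self.max_messages = max_messages
        self.window = window_seconds
        self.clock = clock
        self.events = defaultdict(deque)   # key -> timestamps of accepted messages

    def allow(self, key: str) -> bool:
        """Record and allow the message, or return False if the key is over its limit."""
        now = self.clock()
        timestamps = self.events[key]
        # Drop timestamps that have fallen out of the window.
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()
        if len(timestamps) >= self.max_messages:
            return False
        timestamps.append(now)
        return True
```

The key can be an IP address, an authenticated user name, or a sender domain, matching the dimensions listed above.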

Tuning and Managing Spam Filters

Thresholds and Policies

Key spam policy questions:

At what score do you:

Reject outright?
Add headers only?
Quarantine?

Common patterns:

score < 3: deliver normally.
3 ≤ score < 7: tag mail, deliver to inbox or “likely spam”.
score ≥ 7: move to spam folder or reject.
Start with higher (more lenient) thresholds in early deployment, then lower them as confidence in the filter grows.

Whitelists and Blacklists

To reduce false positives:

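Whitelist trusted senders or domains. Note that whitelist_from trusts the sender address alone, which is easy to forge; whitelist_auth only matches when the message also passes SPF or DKIM:

  whitelist_from   *@trustedpartner.com
  whitelist_auth   *@trustedpartner.com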

For Rspamd, configure multimap rules for internal or trusted ranges.

Use manual blacklists sparingly; reputation systems are usually the better choice unless you are dealing with very persistent sources.

Per-User vs Global Settings

On multi-tenant or hosting servers:

Global settings:

Base rule set, DNSBLs, required scores, antivirus.

Per-domain or per-user:

Subject rewrite preferences.
Per-user Bayes databases.
User-level whitelists/blacklists.

IMAP clients often expose spam settings (like “Junk” folder) but the actual logic usually lives in server-side filters (Sieve, Rspamd, SpamAssassin).

Monitoring Effectiveness

Key metrics:

Spam catch rate: percentage of spam that’s successfully tagged/blocked.
False positive rate: legitimate messages marked as spam.
False negative rate: spam delivered as ham.
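These three rates follow directly from four counts obtained by reviewing a sample of mail. The function below is a small sketch with hypothetical parameter names; note that catch rate and false negative rate always add up to 1, since both are fractions of the same spam total.

```python
def filter_metrics(spam_caught: int, spam_missed: int,
                   ham_flagged: int, ham_passed: int) -> dict:
    """Compute catch rate, false negative rate, and false positive rate."""

    def rate(part: int, whole: int):
        # Return None instead of dividing by zero when a class is empty.
        return part / whole if whole else None

    total_spam = spam_caught + spam_missed
    total_ham = ham_flagged + ham_passed
    return {
        "catch_rate": rate(spam_caught, total_spam),
        "false_negative_rate": rate(spam_missed, total_spam),
        "false_positive_rate": rate(ham_flagged, total_ham),
    }
```

For a sample with 950 spam messages caught, 50 missed, 2 legitimate messages flagged, and 998 delivered normally, the catch rate is 95%, the false negative rate 5%, and the false positive rate 0.2%.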

Practical monitoring methods:

Sample check: periodically review spam folders and server logs.
Feedback loops: allow users to report:

“This is spam” (train as spam),
“Not spam” (train as ham).

Rspamd or other web UIs: dashboards showing symbol statistics, top senders, scores.

Log and Header Analysis

Headers added by filters are your main diagnostic tool:

Examine X-Spam-Status, X-Spam-Score, X-Spam-Report (SpamAssassin).
Rspamd's equivalents, such as X-Spamd-Result when its milter_headers module is configured to add it, and Authentication-Results.

For example:

X-Spam-Status: Yes, score=8.1 required=5.0
    tests=BAYES_95,DKIM_INVALID,HTML_IMAGE_ONLY_32,URIBL_BLACK ...

From this, you can:

Identify which tests are over-aggressive or too weak.
Adjust rule scores or thresholds.
Identify misconfigurations (e.g., DKIM invalid due to local signing error).
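For bulk analysis it helps to parse these headers programmatically. The sketch below reads a SpamAssassin-style X-Spam-Status header with Python's standard email module; the function name and the returned dictionary layout are assumptions, and the regular expressions cover only the fields shown in the example header.

```python
import email
import re


def parse_spam_status(raw_message: str):
    """Extract verdict, score, threshold, and fired tests from X-Spam-Status."""
    msg = email.message_from_string(raw_message)
    value = msg.get("X-Spam-Status")
    if value is None:
        return None
    flat = " ".join(value.split())        # unfold continuation lines
    flat = re.sub(r",\s+", ",", flat)     # rejoin test lists split across lines
    score = re.search(r"score=(-?\d+(?:\.\d+)?)", flat)
    required = re.search(r"required=(-?\d+(?:\.\d+)?)", flat)
    tests = re.search(r"tests=(\S+)", flat)
    return {
        "is_spam": flat.lower().startswith("yes"),
        "score": float(score.group(1)) if score else None,
        "required": float(required.group(1)) if required else None,
        "tests": tests.group(1).split(",") if tests else [],
    }
```

Counting how often each test name appears across false positives is a quick way to find rules whose scores need adjusting.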

Security and Anti-Abuse Considerations

Spam filtering interacts closely with security:

Phishing and malware detection: Many filters recognize known phishing templates or suspicious attachment types.
Outgoing spam control:

If your server is compromised and sends spam, your IP/domain reputation suffers.
Apply the same or lighter rules to outbound messages to detect anomalies.

Privacy:

Full-content scanning means filters see message bodies.
For privacy-sensitive environments, be clear about storage, logs, and data retention policies.

Typical hardening measures:

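Protect traffic between MTA and filter when they run on separate hosts: use TLS where the protocol supports it, otherwise a tunnel or a private network.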
Limit who can access web UIs (e.g. Rspamd) to admins.
Use strong authentication and role separation for training and configuration.

Deployment Patterns and Examples

Simple On-Box SpamAssassin on a Small Server

Scenario: Single Postfix server for a small organization.

Run SpamAssassin as a daemon (spamd).
Use a milter or content filter to send mail to spamd.
Tag mail with spam headers and adjust Dovecot Sieve to move spam to user Junk folders.
Optional: use greylisting and a DNSBL at the MTA level.

Advantages:

Easy to understand.
Lightweight for modest volume.

High-Performance Stack with Rspamd

Scenario: Larger installation or ISP.

Postfix with Rspamd milter handling:

Spam scoring,
DKIM signing/verification,
SPF/DMARC checks,
Rate limiting, greylisting.

Redis backend for Bayes and fuzzy storage.
Web UI for monitoring and interactive training.
Per-domain config overrides to offer different policies.

Advantages:

Scales well.
Single integrated policy engine.
Fine-grained multi-tenant control.

Gateway Scanning with Amavis + Back-End Mail Store

Scenario: Separate MX gateway servers.

Front-end MX runs Postfix + Amavis (SpamAssassin + antivirus).
Back-end server (e.g., Exchange, or another Postfix/Dovecot) receives “cleaned” mail.
Filtering decisions mostly done on gateway; back-end applies user-level folder rules.

Advantages:

Isolates internet-facing components.
Centralizes spam/AV control at the perimeter.

This chapter focused on the concepts, main tools, and integration patterns of spam filtering systems in Linux-based mail environments. Configuration details for the MTA itself and for other email components are covered in their respective chapters.
