rspamd-fosdem

Transcript of rspamd-fosdem

Page 1: rspamd-fosdem
Page 2: rspamd-fosdem

Part I: Introduction

Page 3: rspamd-fosdem

Why Rspamd? SpamAssassin → rspamd in “Rambler” Mail

Page 4: rspamd-fosdem

Rspamd in action Real-time stats

Page 5: rspamd-fosdem

Rspamd in action Web UI

Page 6: rspamd-fosdem

Rspamd in a nutshell

• Written in plain C

• Uses a non-blocking (event-driven) data processing model

• Performance-oriented

• Plugins and rules are written in Lua

• Supports all major filtering methods

• Packaged for most *nix systems

Page 7: rspamd-fosdem

Spam filtering

Page 8: rspamd-fosdem

Major spam classes

• Fraud and scam:

• “Nigerian” mail

• Phishing (banks, social networks)

• Advertisement:

• Plain text

• Pictures, obfuscated HTML, etc.

• Spam from social network notifications

Page 9: rspamd-fosdem

Scam: a typical example

Page 10: rspamd-fosdem

Advertisement: images

Page 11: rspamd-fosdem

Advertisement: images

Page 12: rspamd-fosdem

Filtering methods Basic principles

• Policies (particularly efficient against fraud)

• DNS lists (for IP and URLs)

• SPF, DKIM, DMARC

• Sender’s reputation

• Static content filtering (pattern matching)

• Statistics and ML (e.g. Bayesian)

Page 13: rspamd-fosdem

Filtering methods Difficulties

• Policies:

• Require network requests with unpredictable delay

• False positives (e.g. when forwarding)

• Static Content filtering:

• CPU intensive

• Unlikely to be useful long-term

• Statistics:

• Not very precise

• Hard to adapt for multi-user systems

Page 14: rspamd-fosdem

Rspamd special features

• Text lemmatisation and normalisation (snowball stemmer)

• Rspamd converts all texts to UTF-8

• Support for user settings in JSON

• Possibility to have multiple metrics for a message

• Web interface

• Rspamd can load SpamAssassin rules directly

• Open source (BSD 2-clause)

Page 15: rspamd-fosdem

Rspamd integration Using Rmilter

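[Diagram: MTA → Rmilter → Rspamd. The MTA passes the message and SMTP data to Rmilter, which sends an HTTP request to Rspamd. Rspamd replies with JSON (action, symbols), and Rmilter answers the MTA with delay, modify or reject. Rmilter uses a Redis cache (greylisting, limits); Rspamd uses storage (statistics, shared cache). Messages end up in Inbox, in Spam, or rejected.]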

Page 16: rspamd-fosdem

Rspamd integration milter

• rmilter + postfix/sendmail:

• supports all rspamd features (e.g. greylisting action)

• can split inbound and outbound messages

• can whitelist reply to own messages

• can add DKIM signatures to outgoing email

Page 17: rspamd-fosdem

Rspamd integration Other MTA

• exim - supports rspamd internally (from 4.86)

• haraka - supports rspamd internally

• any other MTA - LDA mode: rspamc checks the message with rspamd and passes the altered message to the delivery agent using Unix pipes

Page 18: rspamd-fosdem

Part II: Architecture

Page 19: rspamd-fosdem

Development principles

• Performance and scalability

• Filtering quality and extensions

• Security

• Simplicity and documentation coverage

Page 20: rspamd-fosdem

Rules in Rspamd

[Diagram: three stages, Pre-filters (settings, ratelimit), Filters (main scan) and Post-filters (postprocessing), each made of rules. Rule kinds shown: policy rules, content rules, statistics rules.]

Page 21: rspamd-fosdem

Message lifecycle

[Diagram: 📩 Message → Pre-filters → Filters → Post-filters → 📬 Result. Each stage runs its rules, with waits between the stages and a wait for dependencies.]

Page 22: rspamd-fosdem

Event driven model Never blocks*

• Advantages:

✅ Can process network replies asynchronously

✅ Sends all network requests simultaneously

✅ Can process many messages in parallel, scales well

• Disadvantages:

📛 Callback hell

⛔ Memory consumption can go wild

*almost never

Page 23: rspamd-fosdem

Sequential processing Traditional approach

[Diagram: Rule 1, Rule 2 and Rule 3 run one after another; processing pauses to wait for the DNS reply and again for the hashes reply.]

Page 24: rspamd-fosdem

Event-driven processing

[Diagram: Rule 1, Rule 2 and Rule 3 issue their DNS and hashes requests together; processing waits only once.]

Page 25: rspamd-fosdem

Event-driven processing What really happens?

[Diagram: many rules issue DNS and hashes requests at the same time, followed by a single wait.]

Page 26: rspamd-fosdem

Event-driven processing

• Rspamd can send thousands of DNS requests:

time: 5540.8ms real, 2427.4ms virtual, dns req: 120543

Page 27: rspamd-fosdem

Event-driven processing

• Rspamd can send thousands of DNS requests:

time: 5540.8ms real, 2427.4ms virtual, dns req: 120543

• For short messages, waiting dominates the processing time:

time: 996.140ms real, 22.000ms virtual,
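The gap between "real" and "virtual" time above is mostly waiting on the network. As a rough illustration only (Python's asyncio, not rspamd's C event loop), the sketch below compares issuing requests one by one with issuing them all at once. The check names and the 50 ms delays are made up.

```python
import asyncio
import time


async def fake_lookup(name: str, delay: float) -> str:
    """Stand-in for a network check (DNS, hash storage); it only sleeps."""
    await asyncio.sleep(delay)
    return name


async def run_sequential(checks: dict[str, float]) -> float:
    """Issue each request only after the previous reply has arrived."""
    start = time.perf_counter()
    for name, delay in checks.items():
        await fake_lookup(name, delay)
    return time.perf_counter() - start


async def run_event_driven(checks: dict[str, float]) -> float:
    """Issue every request up front, then wait once for all replies."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_lookup(n, d) for n, d in checks.items()))
    return time.perf_counter() - start


# Hypothetical checks and delays (seconds)
CHECKS = {"dns": 0.05, "hashes": 0.05, "rbl": 0.05}
sequential_time = asyncio.run(run_sequential(CHECKS))
event_driven_time = asyncio.run(run_event_driven(CHECKS))
print(f"sequential: {sequential_time:.3f}s, event-driven: {event_driven_time:.3f}s")
```

With these numbers the sequential run takes roughly the sum of the delays, while the event-driven run takes roughly the longest single delay.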

Page 28: rspamd-fosdem

Statistics in rspamd

• Uses Markov chains for statistical tokens (using a sparse bigram model)

• Considers metadata in statistics: MIME structure, headers and so on

• Statistics can be stored in Redis or sqlite3

• Rspamd supports per-user, per-domain statistics, multiple classifiers and autolearning

Page 29: rspamd-fosdem

Tokens generation

Quick brown fox jumps over lazy dog

[Diagram: words numbered 1 to 4 inside the window, and the resulting tokens 1 and 2.]
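A sparse bigram model pairs each word with the words that follow it inside a sliding window, so word order and distance both feed the classifier. The sketch below is one plausible reading of that idea, not rspamd's tokenizer. The default window of 5 and the `(first, second, distance)` token layout are assumptions.

```python
def sparse_bigrams(words: list[str], window: int = 5) -> list[tuple[str, str, int]]:
    """Pair each word with every later word inside the window.

    The third element is the distance between the two words, so
    ("quick", "fox", 2) and ("quick", "brown", 1) stay distinct tokens.
    """
    tokens = []
    for i, first in enumerate(words):
        for distance in range(1, window):
            j = i + distance
            if j >= len(words):
                break
            tokens.append((first, words[j], distance))
    return tokens


words = "quick brown fox jumps over lazy dog".split()
for token in sparse_bigrams(words, window=3):
    print(token)
```

In a real classifier each token would be hashed before it is stored; that step is left out here.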

Page 30: rspamd-fosdem

Statistics example Hard case: images spam

[Bar chart: share of correctly recognized and unrecognized messages]

Spam: 92% correctly recognized, 8% unrecognized

Ham: 95% correctly recognized, 5% unrecognized

Page 31: rspamd-fosdem

Fuzzy hashes Introduction

• Can find similar or close to similar messages

• Rspamd uses fuzzy logic for text and exact hashes for attachments

• Storage engine - sqlite3

Page 32: rspamd-fosdem

Shingles Algorithm

Quick brown fox jumps over lazy dog

[Diagram: a sliding window of three words (w1 w2 w3, w2 w3 w4, w3 w4 w5, w4 w5 w6) is fed to each of N hashes, giving one row of values per hash: h1 h2 h3 h4, h1’ h2’ h3’ h4’, …]

Page 33: rspamd-fosdem

Shingles Algorithm

[Diagram: N hash pipes (h1 h2 h3, h1’ h2’ h3’, h1’’ h2’’ h3’’, …); the minimum of each pipe is taken, giving N shingles.]

Page 34: rspamd-fosdem

Shingles Algorithm

• Probabilistic in general

• Uses siphash with different keys

• Window size - 3 words, number of hashes - 32
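The steps from the last three slides can be sketched as follows: slide a 3-word window over the text, hash every window with 32 differently keyed hashes, and keep the minimum per key. This is an illustration under stated assumptions, not rspamd's code. Keyed BLAKE2b stands in for siphash (the Python standard library does not expose siphash), and the keys are arbitrary.

```python
import hashlib

WINDOW = 3      # words per window, as on the slide
N_HASHES = 32   # number of keyed hashes, as on the slide


def keyed_hash(key: bytes, data: bytes) -> int:
    # Assumption: keyed BLAKE2b replaces siphash for this sketch.
    return int.from_bytes(hashlib.blake2b(data, digest_size=8, key=key).digest(), "big")


def shingles(text: str) -> list[int]:
    """Return N_HASHES values: the minimum of each keyed hash over all windows."""
    words = text.lower().split()
    windows = [" ".join(words[i:i + WINDOW]).encode()
               for i in range(max(1, len(words) - WINDOW + 1))]
    keys = [i.to_bytes(4, "big") for i in range(N_HASHES)]  # hypothetical keys
    return [min(keyed_hash(key, w) for w in windows) for key in keys]


def similarity(a: str, b: str) -> float:
    """Fraction of shingles two texts share (a probabilistic estimate)."""
    sa, sb = shingles(a), shingles(b)
    return sum(x == y for x, y in zip(sa, sb)) / N_HASHES


original = "quick brown fox jumps over lazy dog and runs far away from the angry farmer"
modified = "quick brown fox jumps over lazy dog and runs far away from the happy farmer"
unrelated = "completely unrelated words about mail filtering"
print(similarity(original, original))   # identical texts share every shingle
print(similarity(original, modified))
print(similarity(original, unrelated))
```

Changing one word only disturbs the windows that contain it, so most minima survive and the two texts still look close. That is why the method is probabilistic rather than exact.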

Page 35: rspamd-fosdem

Part III: Performance

Page 36: rspamd-fosdem

Main Principles

• Always consider performance.

• If something can be skipped, skip it

• Use a more specific solution if it is faster than a generic one (special FSM, asm fragments)

• If something can be done fast but not 100% accurately, prefer the faster method

Page 37: rspamd-fosdem

Rules Optimisation Global optimisation

• Stop checks if a message has reached threshold

• Prioritise rules with negative weight

• Prioritise faster and more frequent rules:

• Check weight, hit probability and average time

• Use greedy sorting algorithm
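As a rough sketch of how these points could fit together: run negative-weight rules first, order the rest by expected gain per unit of time, and stop once the threshold is reached. The priority formula, rule names, weights and timings below are all invented for illustration; the slides do not give rspamd's actual formula.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    weight: float           # score added when the rule hits
    hit_probability: float  # observed share of messages that hit the rule
    avg_time: float         # observed average execution time, seconds
    check: Callable[[str], bool]


def priority(rule: Rule) -> tuple[int, float]:
    # Hypothetical ordering: negative-weight rules first, then the rules
    # with the most expected score per second of work.
    gain = abs(rule.weight) * rule.hit_probability / rule.avg_time
    return (0 if rule.weight < 0 else 1, -gain)


def scan(message: str, rules: list[Rule], threshold: float) -> tuple[float, list[str]]:
    """Run rules in priority order and stop once the threshold is reached."""
    score, executed = 0.0, []
    for rule in sorted(rules, key=priority):
        executed.append(rule.name)
        if rule.check(message):
            score += rule.weight
        if score >= threshold:
            break  # the verdict is already known, skip the remaining rules
    return score, executed


RULES = [
    Rule("SLOW_BODY_RE", 2.0, 0.05, 0.200, lambda m: "pills" in m),
    Rule("BAD_URL", 8.0, 0.30, 0.010, lambda m: "http://spam.example" in m),
    Rule("WHITELISTED", -10.0, 0.10, 0.001, lambda m: "trusted" in m),
    Rule("FREE_OFFER", 7.0, 0.20, 0.005, lambda m: "free" in m),
]
score, executed = scan("free pills http://spam.example", RULES, threshold=15.0)
print(score, executed)
```

Running negative-weight rules first keeps the early stop safe: once they have all run, later rules can only raise the score.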

Page 38: rspamd-fosdem

Rules Optimisation Local optimisations

• Build Abstract Syntax Tree for each rule

• The tree is periodically re-sorted by hit probability and execution time

• Regular expressions are optimised using PCRE+JIT and Hyperscan (from 1.1)

• Lua code is optimised by means of LuaJIT

Page 39: rspamd-fosdem

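AST Optimisations Branches cut

[Diagram: expression tree with & at the root and children | and C; | has children ! and B; ! has child A. Nodes are annotated with their values and the order of processing. 4/6 branches skipped.]

A = 0, B = 1, C = 0

(!A or B) and C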
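A minimal sketch of branch cutting for this expression, assuming C is the cheapest atom and is therefore checked first. The node classes and cost values are hypothetical. Ordering here uses cost only, whereas the slides mention hit probability as well.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    name: str
    cost: float  # hypothetical average execution time of the rule


@dataclass
class Not:
    child: object


@dataclass
class Op:
    kind: str  # "and" or "or"
    children: list = field(default_factory=list)


def cost(node) -> float:
    if isinstance(node, Atom):
        return node.cost
    if isinstance(node, Not):
        return cost(node.child)
    return sum(cost(c) for c in node.children)


def evaluate(node, values: dict[str, bool], visited: list[str]) -> bool:
    """Evaluate cheapest branches first and cut the rest once the result is known."""
    if isinstance(node, Atom):
        visited.append(node.name)
        return values[node.name]
    if isinstance(node, Not):
        return not evaluate(node.child, values, visited)
    stop_on = node.kind == "or"  # "or" stops on True, "and" stops on False
    for child in sorted(node.children, key=cost):
        if evaluate(child, values, visited) == stop_on:
            return stop_on
    return not stop_on


# (!A or B) and C, with C assumed to be the cheapest atom
tree = Op("and", [Op("or", [Not(Atom("A", 2.0)), Atom("B", 3.0)]), Atom("C", 1.0)])
visited: list[str] = []
result = evaluate(tree, {"A": False, "B": True, "C": False}, visited)
print(result, visited)  # False ['C']
```

Only C is evaluated. The `|`, `!`, `A` and `B` nodes are never touched, which matches the 4 of 6 skipped branches on the slide.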

Page 40: rspamd-fosdem

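AST optimisations N-ary operations

[Diagram: expression tree with > at the root comparing a + node (operands !A, B, C, D, E) with the limit 2. Operands are checked in order, and processing stops once the limit is passed.]

(!A+B+C+D+E) > 2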
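The same idea applies to the n-ary sum: once the running total passes the limit, the remaining operands cannot change the answer. A small sketch, with the operand values chosen arbitrarily:

```python
from typing import Callable, Iterable


def sum_exceeds(operands: Iterable[Callable[[], bool]], limit: int) -> tuple[bool, int]:
    """Evaluate `sum(operands) > limit`, stopping once the sum exceeds the limit.

    Returns the result and how many operands were actually evaluated.
    """
    total = evaluated = 0
    for operand in operands:
        evaluated += 1
        total += bool(operand())
        if total > limit:
            return True, evaluated  # remaining operands cannot change the result
    return False, evaluated


# Hypothetical rule results
values = {"A": False, "B": True, "C": True, "D": False, "E": True}
operands = [
    lambda: not values["A"],  # !A
    lambda: values["B"],
    lambda: values["C"],
    lambda: values["D"],
    lambda: values["E"],
]
print(sum_exceeds(operands, limit=2))  # (True, 3)
```

Here D and E are never evaluated.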

Page 41: rspamd-fosdem

C Library optimisations

• Accelerated base32/base64

• Optimised string functions

• Fixed size strings

• Own printf implementation

• Own logger layer

• Assembly versions for crypto ciphers and hashes

Page 42: rspamd-fosdem

Hyperscan in Rspamd Multi-regexp match engine

Page 43: rspamd-fosdem

Hyperscan in Rspamd Problem statement

• Need to match many regular expressions against the same data:

len: 610591, time: 2492.457ms real, 882.251ms virtual

regexp statistics: 4095 pcre regexps scanned, 18 regexps matched, 694M bytes scanned using pcre

• No need to capture patterns

• Need to match both UTF8 and raw data

• Need to support Perl compatible syntax

Page 44: rspamd-fosdem

Naive solution

• Join all expressions with ‘|’:

/abc/ + /cde/ => /(?:abc)|(?:cde)/

• Won’t work:

/abc/ + /a/ => /(?:abc)|(?:a)/ -> the second rule won’t match for ‘abc’
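The failure of the naive approach is easy to reproduce with Python's `re` module (used here only for illustration; rspamd uses PCRE). The rule names are made up.

```python
import re

text = "abc"
rules = {"RULE_ABC": r"abc", "RULE_A": r"a"}

# Naive approach: one joined expression reports a single match per position,
# and nothing says which of the original rules fired.
joined = re.compile("|".join(f"(?:{p})" for p in rules.values()))
joined_matches = joined.findall(text)
print(joined_matches)  # ['abc']

# Checking each rule separately finds both, at the price of one pass per rule.
matched = [name for name, pattern in rules.items() if re.search(pattern, text)]
print(matched)  # ['RULE_ABC', 'RULE_A']
```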

Page 45: rspamd-fosdem

Hyperscan

[Diagram: the text is scanned once against /re1/ … /re7/ together; three of the expressions report a match.]

Page 46: rspamd-fosdem

Rules and Hyperscan

[Diagram: parts of a MIME message (From, Subject, Text Part, HTML Part, Attachment) are scanned with classes of Hyperscan regexps (Header RE class, Text Parts class, Raw Body class, Whole message), producing matches m1 … m6.]

Page 47: rspamd-fosdem

Problems with Hyperscan

• Slow compile time (about 100 times slower than PCRE+JIT)

• Many PCRE features are unsupported (lookahead, lookbehind)

• Not packaged for the major OSes (requires a relatively recent Boost)

Page 48: rspamd-fosdem

Lazy hyperscan compile

[Diagram: the main process runs the HS compiler and the scanners. The HS compiler compiles each class to disk while the scanners use PCRE only.]

Page 49: rspamd-fosdem

Lazy hyperscan compile

[Diagram: when all classes are compiled ("All done"), a broadcast message is sent to the scanners, which have been using PCRE only.]

Page 50: rspamd-fosdem

Lazy hyperscan compile

[Diagram: the scanners switch to hyperscan, with the compiled classes loaded from cache. The HS compiler waits for the ‘recompile’ command.]

Page 51: rspamd-fosdem

Lazy hyperscan compile

• Does not affect the starting time

• If nothing changes then scanners load pre-compiled expressions on startup (using crypto hashes)

• The recompile command forces recompilation: rspamadm recompile
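One way to decide whether a compiled class can be reused is to key the cache by a cryptographic hash of its expressions. The sketch below shows that idea only. The in-memory dictionary stands in for the on-disk cache, Python's `re.compile` stands in for the slow Hyperscan compiler, and the digest layout is an assumption.

```python
import hashlib
import re

# Hypothetical in-memory stand-in for the on-disk cache of compiled classes.
_cache: dict[str, list[re.Pattern]] = {}


def class_digest(patterns: list[str]) -> str:
    """Hash every expression of a class; any change yields a new digest."""
    h = hashlib.sha256()
    for p in patterns:
        data = p.encode()
        h.update(len(data).to_bytes(4, "big") + data)  # length-prefixed
    return h.hexdigest()


def load_or_compile(patterns: list[str]) -> tuple[list[re.Pattern], bool]:
    """Return (compiled class, loaded_from_cache)."""
    key = class_digest(patterns)
    if key in _cache:
        return _cache[key], True
    compiled = [re.compile(p) for p in patterns]  # the slow step
    _cache[key] = compiled
    return compiled, False


header_class = [r"^From:.*example", r"^Subject:.*offer"]
first = load_or_compile(header_class)[1]
second = load_or_compile(header_class)[1]
changed = load_or_compile(header_class + [r"^X-Mailer:"])[1]
print(first, second, changed)  # False True False
```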

Page 52: rspamd-fosdem

Unsupported features Pre-filter mode

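[Diagram: the text is scanned against /re1/ … /re7/, with /re1/ and /re3/ compiled in pre-filter mode. A hit on a pre-filter expression is only a "maybe match" and is checked with PCRE before it counts as a match; the other expressions match directly.]

Replaces unsupported constructs with "weaker" supported ones, guaranteeing a match superset.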
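The two-step check can be illustrated with plain Python regexes. The weakened pattern is written by hand for this example (Hyperscan derives it itself in pre-filter mode), and the rule is invented.

```python
import re

# Full rule: uses a lookahead, which the fast multi-pattern engine lacks.
full = re.compile(r"invoice(?=\.exe)")
# Hypothetical weakened form with the lookahead dropped: it matches every
# text the full rule matches, plus some that it does not.
weak = re.compile(r"invoice")


def matches(text: str) -> bool:
    if not weak.search(text):
        return False                      # cheap rejection, no PCRE needed
    return full.search(text) is not None  # "maybe match": confirm with the full regex


print(matches("open invoice.exe now"))   # True
print(matches("your invoice is ready"))  # False, rejected by the confirmation
print(matches("hello world"))            # False, rejected by the pre-filter
```

Because the weak pattern matches a superset, the pre-filter can produce false candidates but never misses a real match.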

Page 53: rspamd-fosdem

Hyperscan results

Use Hyperscan for the vast majority of RE rules:

len: 610591, time: 2492.457ms real, 882.251ms virtual

regexp statistics: 4095 pcre regexps scanned, 18 regexps matched, 694M bytes scanned using pcre

len: 610591, time: 654.596ms real, 309.785ms virtual

regexp statistics: 34 pcre regexps scanned, 41 regexps matched, 8.41M bytes scanned using pcre, 9.56M bytes scanned total

Page 54: rspamd-fosdem

Part IV: Security

Page 55: rspamd-fosdem

Main Principles

• Use secure C programming guidelines

• Perform static analysis of the code, including a special clang plugin that checks format strings for the custom printf

• Deal with malicious messages

• Limit and protect DNS subsystem

• Encrypt and protect all data passed over the network

Page 56: rspamd-fosdem

Examples

• An email can contain thousands of objects to check (e.g. URLs)

• SPF records could have infinite recursion

• Need limits for network checks and memoization of the results

Page 57: rspamd-fosdem

DNS protection

• Cryptographic DNS id generation

• Socket pool with constant rotation

• Source port randomisation

• Input filtering (+ IDN encoding)

Page 58: rspamd-fosdem

Traffic Encryption

• Security -> performance -> simplicity

• TLS is hard in non-blocking IO

• TLS performs many operations that rspamd does not need, which increases CPU load and latency

Page 59: rspamd-fosdem

HTTPCrypt Protocol

Page 60: rspamd-fosdem

Connection Establishment

TLS handshake: ClientHello → ServerHello, Certificates ⚠ → ClientChangeCipher → ServerChangeCipher → Data

HTTPCrypt handshake: ClientPK, Encrypted request → Encrypted reply, Validation (optional)

Page 61: rspamd-fosdem

Throughput Cached connections

Page 62: rspamd-fosdem

Throughput New connections

Page 63: rspamd-fosdem

Throughput New connections

Page 64: rspamd-fosdem

Connection Latency Cached connections

Page 65: rspamd-fosdem

Connection Latency New connections

Page 66: rspamd-fosdem

Why HTTPCrypt is fast

• Faster key exchange (ECDH)

• Server does not need to sign anything (needed for replay protection in TLS)

• Encryption is done in-place

• No replay protection for the first message - still suitable for idempotent scanning

Page 67: rspamd-fosdem

Conclusions

Page 68: rspamd-fosdem

Conclusions

• Rspamd is designed to work as fast as possible

• Rspamd has a rich and actively maintained collection of rules and plugins and can re-use existing SA rules

• It is possible to encrypt all data transferred over the network in rspamd ecosystem

Page 69: rspamd-fosdem

Open issues

• No Debian maintainer (any volunteers?)

• No dynamic rule updates yet (planned for 1.2)

• Rapid development (hard to keep compatibility)

Page 70: rspamd-fosdem

Questions?

Vsevolod Stakhov

https://rspamd.com

Page 71: rspamd-fosdem

Appendix

Page 72: rspamd-fosdem

Rspamd processes

[Diagram: the main process manages the scanner processes, the controller and the service processes. Other labels: Messages (✉), Results (📬), HTTP, Learn.]

Page 73: rspamd-fosdem

Configuration evolution

1. Grammar parser (lex + yacc)

⛔ Hard to manage

⛔ Hard to extend

2. XML

⛔ Unreadable

⛔ Problems with expressions (A > B)

3. UCL - universal configuration language

✅ Easy to manage (looks like nginx.conf)

✅ Macro support

✅ JSON data model (can be used as JSON parser)

Page 74: rspamd-fosdem

UCL building blocks

• Sections

    section { key = "value"; number = 10K; }

• Arrays

    upstreams = [ "localhost:80", "example.com:8080", ]

• Variables

    static_dir = "${WWWDIR}/"; filepath = "${CURDIR}/data";

• Macros

    .include "${CONFDIR}/workers.conf"

    .include (glob=true,priority=2) "${CONFDIR}/conf.d/*.conf"

    .lua { print("hey!"); }

• Comments

    key = value; // Single line comment
    /* Multiline comment /* can also be nested */ */

Page 75: rspamd-fosdem

Configuration components

Global options

Workers configuration

Scores (metrics)

Modules configuration

Statistics

• Each component is normally included in the main configuration

• rspamd.local.conf is used to extend configuration

• rspamd.override.conf is used to override values in the configuration

• It is possible to use numeric multipliers: “k/m/g” or “ms/s/m/h/d” for time values

Page 76: rspamd-fosdem

Lua rules

• Most rules are defined in the Lua configuration

• Two types of Lua rules:

• Regexp rules (look like strings)

• Lua functions (pure Lua code)

Page 77: rspamd-fosdem

Lua rules Some examples

• Regexp rule

    -- Outlook versions that should be excluded from summary rule
    local fmo_excl_o3416 = 'X-Mailer=/^Microsoft Outlook, Build 10.0.3416$/H'
    local fmo_excl_oe3790 = 'X-Mailer=/^Microsoft Outlook Express 6.00.3790.3959$/H'
    -- Summary rule for forged outlook
    reconf['FORGED_MUA_OUTLOOK'] = string.format('(%s | %s) & !%s & !%s & !%s',
      forged_oe, forged_outlook_dollars, fmo_excl_o3416, fmo_excl_oe3790, vista_msgid)

• Lua rule

    rspamd_config.R_EMPTY_IMAGE = function(task)
      local tp = task:get_text_parts() -- get text parts in a message
      for _,p in ipairs(tp) do -- iterate over text parts array using `ipairs`
        if p:is_html() then -- if the current part is html part
          local hc = p:get_html() -- we get HTML context
          local len = p:get_length() -- and part's length
          if len < 50 then -- if we have a part that has less than 50 bytes of text
            local images = hc:get_images() -- then we check for HTML images
            if images then -- if there are images
              for _,i in ipairs(images) do -- then iterate over images in the part
                if i['height'] + i['width'] >= 400 then -- if we have a large image
                  return true -- add symbol
                end
              end
            end
          end
        end
      end
    end

Page 78: rspamd-fosdem

Pure Lua functions Review

• Are very powerful

• Have access to all information from rspamd via the Lua API: https://rspamd.com/doc/lua/

• Are very fast since C <-> Lua interaction is cheap

• Can use zero-copy objects called rspamd{text} to avoid copying when moving data between C and Lua

Page 79: rspamd-fosdem

Pure Lua functions

• Variables:

    local ret = false -- Generic variable
    local rules = {} -- Empty table
    local rspamd_logger = require "rspamd_logger" -- Load rspamd module

• Conditionals:

    if not ret then -- can use 'not', 'and', 'or' here
      …
    elseif ret ~= 10 then -- note ~= for 'not equal' operator
    end

• Loops:

    for k,m in pairs(opts) do … end -- Iterate over keyed table a['key'] = value
    for _,i in ipairs(images) do … end -- Iterate over array table a[1] = value
    for i=1,10 do … end -- Count from 1 to 10

• Tables:

    local options = {
      [1] = 'value', ['key'] = 1, -- Numbers start from 1
      another_key = function(task) … end, -- Functions can be values
      [2] = {} -- Other tables can be values
    } -- Can have both numbers and strings as keys and anything as values

• Functions:

    local function something(task) -- Normal definition
      local cb = function(data) -- Functions can be nested
        …
      end
    end

• Closures:

    local function gen_closure(option)
      local ret = false -- Local variable
      return function(task)
        task:do_something(ret, option) -- Both 'ret' and 'option' are accessible here
      end
    end
    rspamd_config.SYMBOL = gen_closure('some_option')

Page 80: rspamd-fosdem

Pure Lua functions Generic recommendations

• Use local whenever possible (global variables are expensive)

• Callbacks, closures and recursion are generally cheap (when using LuaJIT)

• Do not mix string and number keys in tables; that makes them hard to iterate

• ipairs and pairs are not equivalent

• Strings are immutable in Lua

Page 81: rspamd-fosdem

Regexp rules Types

• Can work with the following elements:

• Headers: Message-Id=/^something$/H

• Mime parts: /some word/P

• Raw messages: /some pattern/M

• URLs: /example.com/U

• Some new flags are added:

• UTF8 flag: /u

Page 82: rspamd-fosdem

Regexp rules Generic information

• Can be combined using the following operators:

• AND: /something/P && Subject=/some/H

• OR: /something/P || Subject=/some/H

• NOT: !/something/P

• PLUS: /A/P + /B/P + /C/P >= 2

• Priority is as follows: NOT ➡ AND ➡ OR ➡ PLUS

• Parentheses can change priority: !A AND (B OR C)

Page 83: rspamd-fosdem

Regexp rules Performance considerations

• Avoid message regexps at any cost (use trie instead)

• Regexp expressions are highly optimised in rspamd and unnecessary evaluations are not performed

• UTF regexps are more expensive than default ones (but could be useful sometimes)

• Always use the appropriate type of expression (e.g. url for links and part for textual content)

Page 84: rspamd-fosdem

Trie matching

• Perfect for fast raw message and text pattern matching

• Scales almost linearly with input size (Aho-Corasick algorithm)

• Can handle thousands and hundreds of thousands of patterns (is the basis of all antivirus scanners)

• Highly optimised for 64 bit systems
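For reference, a compact textbook version of Aho-Corasick is sketched below. It shows why the cost grows with input length rather than with the number of patterns. It is not rspamd's optimised C implementation, and the sample patterns are arbitrary.

```python
from collections import deque


def build_automaton(patterns: list[str]):
    """Build the trie, failure links and output sets of an Aho-Corasick automaton."""
    goto: list[dict[str, int]] = [{}]
    fail: list[int] = [0]
    out: list[set[str]] = [set()]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pattern)
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:                     # breadth-first: shallower states first
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]  # inherit patterns ending at the suffix
    return goto, fail, out


def search(text: str, automaton) -> list[tuple[int, str]]:
    """Return (start offset, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = automaton
    state, found = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            found.append((i - len(pattern) + 1, pattern))
    return found


automaton = build_automaton(["he", "she", "his", "hers"])
hits = sorted(search("ushers", automaton))
print(hits)  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```

Each input character is consumed once and failure links replace backtracking, so adding more patterns grows the automaton but not the number of passes over the message.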