Filtron: A Learning-Based Anti-Spam Filter

Eirinaios Michelakis (…@iit.demokritos.gr), Ion Androutsopoulos (…@aueb.gr),
George Paliouras (…@iit.demokritos.gr), George Sakkis (…@rutgers.edu),
Panagiotis Stamatopoulos (…@di.uoa.gr)

First Conference on Email and Anti-Spam (CEAS)
Mountain View, CA, July 30th and 31st, 2004

Outline

– Spam Filtering: past, present and future
– Anti-spam filtering with Filtron
– In Vitro Evaluation
– In Vivo Evaluation
– Conclusions

Spam Filtering: past, present and future

Past:
– Black-lists and white-lists of e-mail addresses
– Handcrafted rules looking for suspicious keywords and patterns in headers

Present:
– Machine learning-based filters
  – Mostly using Naïve Bayes classifier
  – Examples: Mozilla’s spam filter, POPFILE, K9
– Signature based filtering (Vipul’s Razor)

Future:
– Combination of several techniques (SpamAssassin)

Filtron: An overview

A multi-platform learning-based anti-spam filter.

Features for the simple user:
– Personalized: based on her legitimate messages
– Automatically updating black/white lists
– Efficient: server-side filtering and interception rules

Features for the advanced user and the researcher:
– Customizable learning component
  – Through the Weka open source machine learning platform
– Support for creating publicly available message collections
  – Privacy-preserving encoding of messages and user profiles

Portable: implemented in Java and Tcl/Tk
– Currently supported under POSIX-compatible mail servers (MS Exchange Server port efforts under way)

Filtron’s Architecture

[Diagram. Components: Legitimate folders, Spam folders, Preprocessor, Attribute Selector, Vectorizer, Learner. Labels: attribute set, training vectors, induced classifier, User model, black list, white list.]

Preprocessing

1. Break down mailbox(es) into distinct messages
2. Remove from every message:
   – mail headers
   – html tags
   – attached files
3. Remove messages with no textual content
4. Store 5 messages per sender
   – Avoids bias towards regular correspondents.
5. Remove duplicates
6. Encode messages (optional)
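Steps 2 to 5 above can be sketched roughly as follows. This is a minimal illustration, not Filtron's code: it assumes messages have already been split out of the mailbox as (sender, body) pairs with headers and attachments dropped, and the function name `preprocess` and the regex-based tag stripping are hypothetical simplifications.

```python
import re
from collections import defaultdict

# Crude HTML tag matcher, used here for illustration only (an assumption,
# not how Filtron removes markup).
TAG_RE = re.compile(r"<[^>]+>")

def preprocess(messages, max_per_sender=5):
    """Filter an iterable of (sender, body) pairs.

    Applies, in the order of the list above: tag removal (step 2),
    dropping messages without text (step 3), the per-sender cap (step 4)
    and duplicate removal (step 5).
    """
    per_sender = defaultdict(int)
    seen = set()
    kept = []
    for sender, body in messages:
        # Strip tags and normalise whitespace.
        text = " ".join(TAG_RE.sub(" ", body).split())
        if not text:
            continue  # step 3: no textual content
        if per_sender[sender] >= max_per_sender:
            continue  # step 4: this sender already has 5 stored messages
        per_sender[sender] += 1
        if text in seen:
            continue  # step 5: duplicate of an earlier message
        seen.add(text)
        kept.append((sender, text))
    return kept
```

Because the cap is applied before duplicate removal, a duplicate still counts towards its sender's quota; swapping the two checks would be an equally plausible reading of the slide.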

Message Classification

[Diagram. Elements: Incoming e-mail, Unix Mail Server, Procmail, Filtron, Classification, User’s Profile, Address Book, Black List, Classifier, Classified e-mail, User’s Mailbox.]

Sample message shown in the diagram:

From: sender@provider
Dear Fred,
Thanks for the immediate reply. I am glad to hear...
Attachments: 1. File.zip
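One plausible way the pieces named in the diagram could fit together is sketched below. The order of the checks (address book first, then black list, then the learned classifier) is an assumption, as are the function name `classify_message` and the callable `classifier` interface; the slides do not spell out the decision order.

```python
def classify_message(sender, body, address_book, black_list, classifier):
    """Return "legitimate" or "spam" for one incoming message.

    address_book and black_list are sets of e-mail addresses;
    classifier is any callable that maps a message body to True (spam)
    or False (legitimate). All three are hypothetical stand-ins.
    """
    if sender in address_book:
        return "legitimate"  # known correspondent: accept without classifying
    if sender in black_list:
        return "spam"        # known spammer: reject without classifying
    return "spam" if classifier(body) else "legitimate"
```

In a Procmail setup, a recipe would typically pipe each message to such a function and file the message according to the returned label.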

In Vitro Evaluation

We investigated the effect of:
– Single-token versus multi-token attributes (n-grams for n=1,2,3)
– Number of attributes (40-3000)
– Learning algorithm (Naïve Bayes, Flexible Bayes, SVMs, LogitBoost)
– Training corpus size (~10%-100% of full training corpus)

Cost-Sensitive Learning Formulation
– Misclassifying a legitimate message as spam (L→S) is λ times more serious an error than misclassifying a spam message as legitimate (S→L)
– Two usage scenarios (λ = 1, 9)
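For a classifier that outputs a spam probability, one standard way to honour such a cost ratio is to call a message spam only when P(spam) / P(legitimate) > λ, which is equivalent to P(spam) > λ / (1 + λ). The sketch below illustrates that rule; the slides do not say how Filtron applies λ to each learner, so treat it as an assumption, and the function names are hypothetical.

```python
def spam_threshold(lam):
    """Probability threshold equivalent to P(spam)/P(legit) > lam."""
    return lam / (1.0 + lam)

def is_spam(p_spam, lam):
    """Cost-sensitive decision for a message with spam probability p_spam."""
    return p_spam > spam_threshold(lam)

# lam = 1 gives a threshold of 0.5; lam = 9 gives 0.9, so a message with
# p_spam = 0.8 is blocked in the first scenario but delivered in the second.
```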

In Vitro Evaluation (cont.)

Evaluation:
– Four message collections (PU1, PU2, PU3, PUA)
– Stratified 10-fold cross validation

Results:
– No clear winner among learning algorithms with respect to accuracy
  – Efficiency (or other criteria) more important for real usage.
  – Nevertheless, SVMs consistently among two best
– No substantial improvement with n-grams (for n>1)

Refer to the TR for more details: Learning to filter unsolicited commercial e-mail, TRN 2004/2, NCSR “Demokritos” (http://www.iit.demokritos.gr/skel/i-config/)
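Stratified k-fold cross validation splits the corpus so that every fold keeps roughly the same legitimate-to-spam proportion as the whole collection. A minimal pure-Python sketch follows; the function name `stratified_folds` and the round-robin dealing are illustrative choices, not the procedure used in the experiments.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split message indices into k folds with similar class proportions.

    labels[i] is the class of message i (for example "L" or "S").
    Returns a list of k lists of indices.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds
```

Each fold then serves once as the test set while the remaining k-1 folds are used for training, and the reported scores are averaged over the k runs.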

Summary of in Vitro Evaluation

(Pr = precision, Re = recall, WAcc = weighted accuracy)

                          λ = 1                      λ = 9
                    Pr      Re      WAcc       Pr      Re      WAcc
1-grams
  Naive Bayes       90.56   94.73   94.65      91.57   92.17   94.87
  Flexible Bayes    95.55   89.89   95.15      98.88   74.63   97.76
  LogitBoost        92.43   90.08   93.64      97.71   74.89   97.24
  SVM               94.95   91.43   95.42      98.12   78.33   97.60
1/2/3-grams
  Flexible Bayes    92.98   91.89   93.89      97.43   81.36   96.91
  SVM               94.73   91.70   95.05      98.70   76.40   97.67

In Vivo Evaluation

– Seven-month live evaluation by the third author
– Training collection: PU3
  – 2313 legitimate / 1826 spam
– Learning algorithm: SVM
– Cost scenario: λ = 1
– Retained attributes: 520 1-grams
  – Numeric values (term frequency)
– No black-list was used
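Turning a message into a numeric vector over a fixed attribute set can be sketched as below. Raw occurrence counts stand in for "term frequency" here, which is an assumption (the slides do not say whether counts are normalised), and the whitespace tokenizer, the helper names and the three-word attribute list are purely illustrative; selecting the 520 retained attributes is a separate step not shown.

```python
from collections import Counter

def ngrams(tokens, n_max=1):
    """All n-grams of the token list for n = 1..n_max, as space-joined strings."""
    out = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

def vectorize(text, attributes, n_max=1):
    """Map a message to a vector of occurrence counts, one per attribute."""
    counts = Counter(ngrams(text.lower().split(), n_max))
    return [counts[attr] for attr in attributes]

# Hypothetical attribute set; the evaluated filter kept 520 1-grams.
print(vectorize("Buy now buy cheap", ["buy", "cheap", "free"]))  # → [2, 1, 0]
```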

Summary of in Vivo Evaluation

Days used                       212
Messages received               6732 (avg. 31.75 per day)
Spam messages received          1623 (avg. 7.66 per day)
Legitimate messages received    5109 (avg. 24.10 per day)
Legitimate-to-Spam Ratio        3.15

Correctly classified legitimate messages (L→L)      5057
Incorrectly classified legitimate messages (L→S)    52 (avg. 1.72 per week)
Correctly classified spam messages (S→S)            1450
Incorrectly classified spam messages (S→L)          173 (avg. 5.71 per week)

Precision    96.54% (PU3: 96.43%)
Recall       89.34% (PU3: 95.05%)
WAcc         96.66% (PU3: 96.22%)
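The three in vivo scores can be recomputed from the four message counts reported above. The sketch uses the usual definitions of spam precision, spam recall and weighted accuracy, in which each legitimate message counts as λ messages; the function and argument names are illustrative.

```python
def spam_metrics(ll, ls, ss, sl, lam=1):
    """Spam precision, spam recall and weighted accuracy.

    ll: legitimate classified as legitimate   ls: legitimate classified as spam
    ss: spam classified as spam               sl: spam classified as legitimate
    lam: cost ratio; each legitimate message is weighted lam times.
    """
    precision = ss / (ss + ls)
    recall = ss / (ss + sl)
    wacc = (lam * ll + ss) / (lam * (ll + ls) + (ss + sl))
    return precision, recall, wacc

p, r, w = spam_metrics(ll=5057, ls=52, ss=1450, sl=173, lam=1)
print(f"{p:.2%} {r:.2%} {w:.2%}")  # → 96.54% 89.34% 96.66%
```

With λ = 1 the weighted accuracy reduces to plain accuracy, 6507 correct decisions out of 6732 messages.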

Post-Mortem Analysis: False Positives

52 false positives (out of 6732)

– 52%: Automatically generated messages
  – subscription verifications, virus warnings, etc.
– 22%: Very short messages
  – 3-5 words in message body
  – Along with attachments and hyperlinks
– 26%: Short messages
  – 1-2 lines
  – Written in casual style, often exploited by spammers
  – With no attachments or hyperlinks

Post-Mortem Analysis: False Negatives

173 false negatives (out of 6732)

– 30%: “Hard Spam”
  – Little textual information, avoiding common suspicious word patterns
  – Many images and hyperlinks
  – Tricks to confuse tokenizers
– 8%: Advertisements of pornographic sites with very casual and well chosen vocabulary
– 23%: Non-English messages
  – Under-represented in the training corpus
– 30%: Encoded messages
  – BASE64 format; Filtron could not process it at that time
– 6%: Hoax letters
  – Long formal letters (“tremendous business opportunity !”)
  – Many occurrences of the receiver’s full name
– 3%: Short messages with unusual content

Conclusions

– Signs of an arms race between spammers and content-based filters
– Filtron’s performance deemed satisfactory, though it can be improved with:
  – More elaborate preprocessing to tackle the usual countermeasures of spammers (misspellings, uncommon words, text on images)
  – Regular retraining
– Currently most promising approach: combination of different filtering approaches along with Machine Learning
  – Collaborative filtering
  – Filtering at the transport layer
  – …