Filtron: A Learning-Based Anti-Spam Filter

Eirinaios Michelakis (…@iit.demokritos.gr), Ion Androutsopoulos (…@aueb.gr),
George Paliouras (…@iit.demokritos.gr), George Sakkis (…@rutgers.edu),
Panagiotis Stamatopoulos (…@di.uoa.gr)

First Conference on Email and Anti-Spam (CEAS)
Mountain View, CA, July 30th and 31st, 2004

Outline
- Spam Filtering: past, present and future
- Anti-spam filtering with Filtron
- In Vitro Evaluation
- In Vivo Evaluation
- Conclusions

Spam Filtering: past, present and future
Past:
- Black-lists and white-lists of e-mail addresses
- Handcrafted rules looking for suspicious keywords and patterns in headers
Present:
- Machine learning-based filters
  - Mostly using the Naïve Bayes classifier
  - Examples: Mozilla’s spam filter, POPFILE, K9
- Signature-based filtering (Vipul’s Razor)
Future:
- Combination of several techniques (SpamAssassin)

Filtron: An overview
A multi-platform learning-based anti-spam filter.
Features for the simple user:
- Personalized: based on her legitimate messages
- Automatically updating black/white lists
- Efficient: server-side filtering and interception rules
Features for the advanced user and the researcher:
- Customizable learning component
  - Through the Weka open source machine learning platform
- Support for creating publicly available message collections
  - Privacy-preserving encoding of messages and user profiles
- Portable: implemented in Java and Tcl/Tk
- Currently supported under POSIX-compatible mail servers (MS Exchange Server port efforts under way)

Filtron’s Architecture
[Figure: Filtron’s training architecture. Inputs: the user’s legitimate folders and spam folders. Components: Preprocessor, Attribute Selector, Vectorizer, Learner. Intermediate data: attribute set, training vectors. Output: the user model (induced classifier, black list, white list).]

Preprocessing
1. Break down mailbox(es) into distinct messages
2. Remove from every message:
   - mail headers
   - HTML tags
   - attached files
3. Remove messages with no textual content
4. Store at most 5 messages per sender
   - Avoids bias towards regular correspondents.
5. Remove duplicates
6. Encode messages (optional)
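
Steps 2 to 5 above can be sketched in a few lines of Python. This is only an illustration, not Filtron's code (which the slides say is written in Java and Tcl/Tk): the function names, the regular expression used to strip HTML tags, and the choice of exact string equality to detect duplicates are all assumptions. Messages are assumed to arrive as (sender, body) pairs with headers and attachments already removed.

```python
import re
from collections import defaultdict

MAX_PER_SENDER = 5  # cap from step 4 of the slide

# Crude HTML tag pattern; an assumption, real HTML needs a proper parser.
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Drop anything that looks like an HTML tag and normalise whitespace."""
    return " ".join(TAG_RE.sub(" ", text).split())


def preprocess(messages, max_per_sender=MAX_PER_SENDER):
    """Apply steps 2 (HTML only), 3, 4 and 5 to (sender, body) pairs."""
    # Step 2 (partially) and step 3: strip tags, drop messages left empty.
    texts = [(sender, strip_html(body)) for sender, body in messages]
    texts = [(sender, text) for sender, text in texts if text]

    # Step 4: keep at most max_per_sender messages from each sender.
    per_sender = defaultdict(int)
    capped = []
    for sender, text in texts:
        if per_sender[sender] < max_per_sender:
            per_sender[sender] += 1
            capped.append((sender, text))

    # Step 5: remove duplicates (here: identical text, an assumption).
    seen = set()
    unique = []
    for sender, text in capped:
        if text not in seen:
            seen.add(text)
            unique.append((sender, text))
    return unique
```

The steps run in the order given on the slide, so the per-sender cap is applied before duplicates are removed.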

Message Classification
[Figure: message classification. Elements shown: Incoming e-mail, Unix Mail Server, Procmail, Filtron (Classification), User’s Profile (Address Book, Black List, Classifier), Classified e-mail, User’s Mailbox. The sample message in the figure reads “From: sender@provider / Dear Fred, Thanks for the immediate reply. I am glad to hear... / Attachments: 1. File.zip”.]
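
One way the three parts of the user's profile in the figure (address book, black list, classifier) might be combined is sketched below. The slides do not state the order in which they are consulted, so the order used here (address book, then black list, then the learned classifier) is an assumption, as are the function and parameter names.

```python
def classify(sender, body, address_book, black_list, is_spam):
    """Return "legitimate" or "spam" for one incoming message.

    address_book and black_list are sets of sender addresses;
    is_spam is any callable that maps a message body to True/False
    (a stand-in for the induced classifier).
    """
    if sender in address_book:   # known correspondent: accept (assumed rule)
        return "legitimate"
    if sender in black_list:     # known spammer: reject (assumed rule)
        return "spam"
    return "spam" if is_spam(body) else "legitimate"


# Toy usage with a keyword rule standing in for the learned classifier.
toy_classifier = lambda body: "free money" in body.lower()
print(classify("sender@provider", "Dear Fred, thanks for the reply.",
               {"sender@provider"}, set(), toy_classifier))
```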

In Vitro Evaluation
We investigated the effect of:
- Single-token versus multi-token attributes (n-grams for n=1,2,3)
- Number of attributes (40-3000)
- Learning algorithm (Naïve Bayes, Flexible Bayes, SVMs, LogitBoost)
- Training corpus size (~10%-100% of full training corpus)
Cost-Sensitive Learning Formulation
- Misclassifying a legitimate message as spam (L→S) is λ times more serious an error than misclassifying a spam message as legitimate (S→L)
- Two usage scenarios (λ = 1, 9)
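
The cost factor λ enters the evaluation through weighted accuracy (WAcc), which counts every legitimate message λ times. A sketch of the measure follows; the symbol names are chosen here for illustration: n_{X→Y} is the number of messages of class X classified as Y, and N_L, N_S are the numbers of legitimate and spam messages. For λ = 1 it reduces to ordinary accuracy.

```latex
\mathrm{WAcc} = \frac{\lambda \, n_{L \to L} + n_{S \to S}}{\lambda \, N_L + N_S},
\qquad N_L = n_{L \to L} + n_{L \to S}, \quad N_S = n_{S \to S} + n_{S \to L}
```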

In Vitro Evaluation (cont.)
Evaluation:
- Four message collections (PU1, PU2, PU3, PUA)
- Stratified 10-fold cross validation
Results:
- No clear winner among learning algorithms with respect to accuracy
  - Efficiency (or other criteria) more important for real usage.
  - Nevertheless, SVMs consistently among the two best
- No substantial improvement with n-grams (for n>1)
Refer to the TR for more details: Learning to filter unsolicited commercial e-mail, TRN 2004/2, NCSR “Demokritos” (http://www.iit.demokritos.gr/skel/i-config/)
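
Stratified 10-fold cross validation splits a collection into ten parts that each keep roughly the same legitimate-to-spam proportion as the whole. The sketch below shows the general technique in plain Python; it is not the code used for the experiments, and the function names, the round-robin dealing and the fixed random seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split indices 0..len(labels)-1 into k folds with similar class mix."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def cross_validation_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each message is used for testing exactly once and for training in the other nine rounds; the reported figures are then averaged over the ten rounds.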

Summary of in Vitro Evaluation
(Pr = precision, Re = recall, WAcc = weighted accuracy; all values in %)

                          λ = 1                     λ = 9
                   Pr     Re     WAcc        Pr     Re     WAcc
1-grams
  Naïve Bayes      90.56  94.73  94.65       91.57  92.17  94.87
  Flexible Bayes   95.55  89.89  95.15       98.88  74.63  97.76
  LogitBoost       92.43  90.08  93.64       97.71  74.89  97.24
  SVM              94.95  91.43  95.42       98.12  78.33  97.60
1/2/3-grams
  Flexible Bayes   92.98  91.89  93.89       97.43  81.36  96.91
  SVM              94.73  91.70  95.05       98.70  76.40  97.67

In Vivo Evaluation
- Seven-month live evaluation by the third author
- Training collection: PU3
  - 2313 legitimate / 1826 spam
- Learning algorithm: SVM
- Cost scenario: λ = 1
- Retained attributes: 520 1-grams
  - Numeric values (term frequency)
- No black-list was used
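
To make "520 1-grams with term-frequency values" concrete, the sketch below turns a message into a numeric vector over a fixed attribute list. It is illustrative only: the whitespace tokenizer, the use of raw counts as term frequency, and the frequency-based `select_attributes` placeholder are assumptions, since the slides do not describe Filtron's tokenizer or say how the attributes were selected.

```python
from collections import Counter


def tokenize(text):
    """Lower-case whitespace tokenizer (an assumption, not Filtron's)."""
    return text.lower().split()


def select_attributes(documents, size):
    """Placeholder selector: keep the `size` most frequent tokens.

    Plain frequency is used only to make the sketch runnable; it is
    not claimed to be the selection method used for the 520 attributes.
    """
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    return [tok for tok, _ in counts.most_common(size)]


def vectorize(text, attributes):
    """Return the term-frequency (raw count) vector of `text`."""
    counts = Counter(tokenize(text))
    return [counts[attr] for attr in attributes]
```

With 520 retained attributes, every message becomes a vector of 520 counts, which is the kind of input a learner such as an SVM expects.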

Summary of in Vivo Evaluation

Days used                        212
Messages received                6732 (avg. 31.75 per day)
Spam messages received           1623 (avg. 7.66 per day)
Legitimate messages received     5109 (avg. 24.10 per day)
Legitimate-to-Spam Ratio         3.15

Correctly classified legitimate messages (L→L)      5057
Incorrectly classified legitimate messages (L→S)    52 (avg. 1.72 per week)
Correctly classified spam messages (S→S)            1450
Incorrectly classified spam messages (S→L)          173 (avg. 5.71 per week)

Precision    96.54% (PU3: 96.43%)
Recall       89.34% (PU3: 95.05%)
WAcc         96.66% (PU3: 96.22%)
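
The three percentages follow directly from the four message counts. The short check below recomputes them for λ = 1, using the usual definitions of spam precision and recall and weighted accuracy with each legitimate message counted λ times; the function and argument names are hypothetical.

```python
def spam_metrics(n_ll, n_ls, n_ss, n_sl, lam=1):
    """Spam precision, spam recall and weighted accuracy.

    n_xy is the number of messages of class x classified as y
    (l = legitimate, s = spam); lam is the cost factor lambda.
    """
    precision = n_ss / (n_ss + n_ls)
    recall = n_ss / (n_ss + n_sl)
    wacc = (lam * n_ll + n_ss) / (lam * (n_ll + n_ls) + (n_ss + n_sl))
    return precision, recall, wacc


pr, re_, wacc = spam_metrics(n_ll=5057, n_ls=52, n_ss=1450, n_sl=173, lam=1)
print(f"Pr={pr:.2%} Re={re_:.2%} WAcc={wacc:.2%}")
# prints: Pr=96.54% Re=89.34% WAcc=96.66%
```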

Post-Mortem Analysis: False Positives
52 false positives (out of 6732 messages received)
- 52%: Automatically generated messages
  - subscription verifications, virus warnings, etc.
- 22%: Very short messages
  - 3-5 words in message body
  - Along with attachments and hyperlinks
- 26%: Short messages
  - 1-2 lines
  - Written in casual style, often exploited by spammers
  - With no attachments or hyperlinks

Post-Mortem Analysis: False Negatives
173 false negatives (out of 6732 messages received)
- 30%: “Hard Spam”
  - Little textual information, avoiding common suspicious word patterns
  - Many images and hyperlinks
  - Tricks to confuse tokenizers
- 8%: Advertisements of pornographic sites with very casual and well-chosen vocabulary
- 23%: Non-English messages
  - Under-represented in the training corpus
- 30%: Encoded messages
  - BASE64 format; Filtron could not process it at that time
- 6%: Hoax letters
  - Long formal letters (“tremendous business opportunity!”)
  - Many occurrences of the receiver’s full name
- 3%: Short messages with unusual content
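
The BASE64 category suggests decoding message bodies before tokenization. A minimal sketch using Python's standard `email` package follows; it only illustrates the idea and says nothing about how Filtron later handled such messages. The function name and the fallback to UTF-8 when no charset is declared are assumptions.

```python
import email


def extract_text(raw_message):
    """Return the decoded text/plain parts of a raw message as one string."""
    msg = email.message_from_string(raw_message)
    chunks = []
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        # decode=True undoes the Content-Transfer-Encoding
        # (base64 or quoted-printable) and returns bytes.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"  # assumed fallback
        chunks.append(payload.decode(charset, errors="replace"))
    return "\n".join(chunks)
```

The decoded text can then go through the same preprocessing and vectorization as any other message.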

Conclusions
- Signs of an arms race between spammers and content-based filters
- Filtron’s performance deemed satisfactory, though it can be improved with:
  - More elaborate preprocessing to tackle the usual countermeasures of spammers (misspellings, uncommon words, text on images)
  - Regular retraining
- Currently most promising approach: combination of different filtering approaches along with Machine Learning
  - Collaborative filtering
  - Filtering at the transport layer level
  - …