
1

Fighting Against Spam

2

How might we analyze email?

• Identify different parts
  – Reply blocks, signature blocks
• Integrate email with workflow tasks
• Build a social network
  – Who do you know, and what is their contact info?
  – Reputation analysis
    • Useful for anti-spam too

3

• Email analysis
• Spam filtering

4

Recognizing Email Structure

• Three tasks:
  – Does this message contain a signature block?
  – If so, which lines are in it?
  – Which lines are reply lines?
  – Three-way classification for each line

• Representation
  – A sequence of lines
  – Each line has features associated with it
  – Windows of lines are important for line classification

5

6

Features used for line classification:

• Contains email address patterns

• Contains URL patterns

• Contains a phone number

• Typical signature words: department, lab, university, college, etc.

• Person's name

• Quote symbols

• Large number of punctuation symbols
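
A minimal sketch of how these per-line features might be computed, plus a line-window wrapper since windows of lines matter for classification. The regexes, the signature-word list, and the punctuation threshold are illustrative assumptions; the person's-name feature is omitted because it needs a name list or NER, which the slides don't specify.

import re

# Illustrative patterns; the slides name the features but not their
# exact definitions, so these regexes and thresholds are assumptions.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PHONE_RE = re.compile(r"\+?\d{0,3}[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")
SIG_WORDS = {"department", "lab", "university", "college"}

def line_features(line):
    """Per-line features for signature-block / reply-line detection."""
    words = {w.strip(".,;:").lower() for w in line.split()}
    return {
        "has_email": bool(EMAIL_RE.search(line)),
        "has_url": bool(URL_RE.search(line)),
        "has_phone": bool(PHONE_RE.search(line)),
        "has_sig_word": bool(words & SIG_WORDS),
        "quote_symbol": line.lstrip().startswith(">"),
        "punct_heavy": sum(not c.isalnum() and not c.isspace() for c in line) > 5,
    }

def window_features(lines, i, k=1):
    """Features for line i and its k neighbors, since a window of
    surrounding lines matters when classifying each line."""
    feats = {}
    for off in range(-k, k + 1):
        j = i + off
        if 0 <= j < len(lines):
            for name, val in line_features(lines[j]).items():
                feats["%s@%+d" % (name, off)] = val
    return feats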

7

The Cost of Spam

• Most of the cost of spam is paid for by the recipients:
  – A typical spam batch is 1,000,000 spams
  – The spammer averages ~$250 commission per batch
  – Cost to recipients to delete the load of spam, at 2 seconds/spam and $7.25/hour: $4,028
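
The $4,028 figure follows from the slide's own numbers: 1,000,000 spams × 2 seconds = 2,000,000 seconds ≈ 555.6 hours, and 555.6 hours × $7.25/hour ≈ $4,028.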

8

The Cost of Spam

• Theft efficiency ratio of spammer:

    profit to thief      $250
    ----------------- = -------- ≈ 6%
    cost to victims      $4,028

• A 10% theft efficiency ratio is typical in many other lines of criminal activity, such as fencing stolen goods (jewellery, hubcaps, car stereos).

9

How to Recognize Spam?

10

Anti-spam Approaches

• Technology
  – White-listing of email addresses
  – Black-listing of email addresses/domains
  – Challenge-response mechanisms
  – Content filtering
• Learning techniques
  – "Bayesian filtering" for spam has received a lot of press
  – The "Bayesian filtering" is actually Naïve Bayes classification
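
A minimal sketch of the Naïve Bayes idea behind these filters, with words as features. This is a generic illustration under the usual bag-of-words and Laplace-smoothing assumptions, not the exact formulation of any particular system:

import math
from collections import Counter

class NaiveBayesSpamFilter:
    """Word-feature Naive Bayes: the model behind 'Bayesian filtering'."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, words, label):
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1

    def classify(self, words):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            # log prior P(class), plus Laplace-smoothed log P(word|class)
            score = math.log(self.doc_counts[label] / total_docs)
            n = sum(self.word_counts[label].values())
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / (n + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

filt = NaiveBayesSpamFilter()
filt.train("buy viagra now".split(), "spam")
filt.train("meeting at the lab tomorrow".split(), "ham")
print(filt.classify("viagra now".split()))  # -> "spam"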

11

Research in Spam Classification

• Spam filtering is really a classification problem
  – Each email needs to be classified as either spam or not spam ("ham")
• W. Cohen (1996):
  – RIPPER, a rule-learning system
  – Rules in a human-comprehensible format
• Pantel & Lin (1998):
  – Naïve Bayes with words as features
• Sahami, Dumais, Heckerman, Horvitz (1998):
  – Naïve Bayes with a mutual-information measure to select the features with the strongest resolving power
  – Words and domain-specific attributes of spam used as features

12

Research in Spam Classification

• Paul Graham (2002): "A Plan for Spam"
  – Very popular algorithm, credited with starting the craze for Bayesian filters
  – Uses Naïve Bayes with words as features
• Bill Yerazunis (2002): CRM114, a sparse binary polynomial hashing algorithm
  – Very accurate (over 99.7% accuracy)
  – Distinctive because of its powerful feature-extraction technique
  – Uses the Bayesian chain rule for combining weights
  – Available via SourceForge
• Others have used SVMs, etc.

13

Yerazunis’ CRM114 Algorithm

• Other Naïve Bayes approaches focused on single-word features

• CRM114 creates a huge number of n-grams and represents them efficiently

• The goal is to create a large number of features, many of which will be invariant over a large body of spam (or nonspam).

14

CRM114

1. Slide a window of N words over the incoming text

2. For each window position, generate a set of order-preserving sub-phrases containing combinations of the windowed words

3. Calculate 32-bit hashes of these order-preserving sub-phrases (for efficiency reasons)

15

CRM114 Feature Extraction Example

Step 1: slide a window of N words over the incoming text, e.g.:

  You can Click here to buy viagra online NOW!!!

With a 5-word window this yields (brackets mark each window position):
  – [You can Click here to] buy viagra online NOW!!!
  – You [can Click here to buy] viagra online NOW!!!
  – You can [Click here to buy viagra] online NOW!!!
  – You can Click [here to buy viagra online] NOW!!!

... and so on ... (on to step 2)

16

SBPH Example

Step 2: generate order-preserving sub-phrases from the words in each of the sliding windows.

Sliding window text: 'Click here to buy viagra'

...yields all these feature sub-phrases:

  Click
  Click here
  Click to
  Click here to
  Click buy
  Click here buy
  Click to buy
  Click here to buy
  Click viagra
  Click here viagra
  Click to viagra
  Click here to viagra
  Click buy viagra
  Click here buy viagra
  Click to buy viagra
  Click here to buy viagra

Note the binary counting pattern over the four trailing words; this is the 'binary' in 'sparse binary polynomial hashing'.

17

SBPH Example

Step 3: make 32-bit hash value "features" from the sub-phrases:

  Click                 -> E06BF8AA
  Click here            -> 12FAD10F
  Click to              -> 7B37C4F9
  Click here to         -> 113936CF
  Click buy             -> 1821F0E8
  Click here buy        -> 46B99AAD
  Click to buy          -> B7EE69BF
  Click here to buy     -> 19A78B4D
  Click viagra          -> 56626838
  Click here viagra     -> AE1B0B61
  Click to viagra       -> 5710DE73
  Click here to viagra  -> 33094DBB
  ... and so on
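
A compact sketch of steps 1-3 together, assuming a 5-word window and using Python's built-in zlib.crc32 as a stand-in for whatever 32-bit hash CRM114 actually uses:

from zlib import crc32

def sbph_features(text, n=5):
    """Steps 1-3: slide an n-word window over the text; for each window,
    emit every order-preserving sub-phrase that keeps the first word
    (a binary counting pattern over the n-1 trailing words), then hash
    each sub-phrase to a 32-bit feature."""
    words = text.split()
    features = []
    for start in range(len(words) - n + 1):
        window = words[start:start + n]
        head, rest = window[0], window[1:]
        for mask in range(1 << len(rest)):  # 2**(n-1) sub-phrases per window
            combo = [w for i, w in enumerate(rest) if mask & (1 << i)]
            phrase = " ".join([head] + combo)
            features.append(crc32(phrase.encode()) & 0xFFFFFFFF)
    return features

# The example window from the slides yields all 16 sub-phrase hashes:
print(len(sbph_features("Click here to buy viagra")))  # -> 16

Because the mask counts up in binary, the sub-phrases come out in exactly the order shown on the slide: Click, Click here, Click to, Click here to, and so on.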

18

How to use the terms

• For each phrase you can build:
  – Keep track of how many times you see that phrase in both the spam and nonspam categories.
• When you need to classify some text:
  – Build up the phrases
  – Count up how many times all of the phrases appear in each of the two different categories.
  – The category with the most phrase matches wins.
• But really it uses the Bayesian chain rule

19

Learning and Classifying

• Learning: each feature is bucketed into one of two bucket files (spam or nonspam)

• Classifying: the comparable bucket counts of the two files generate rough estimates of each feature's 'spamminess'
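
A minimal sketch of this learn/classify loop, with in-memory counters standing in for CRM114's bucket files and a simple smoothed log-ratio standing in for the Bayesian chain rule it actually uses:

import math
from collections import Counter

buckets = {"spam": Counter(), "nonspam": Counter()}

def learn(features, category):
    """Learning: drop each feature into the chosen category's bucket."""
    buckets[category].update(features)

def classify(features):
    """Classifying: compare the two buckets' counts for each feature to
    get a rough per-feature 'spamminess', then sum the evidence."""
    score = 0.0
    for f in features:
        s = buckets["spam"][f] + 1      # +1 smoothing keeps unseen
        n = buckets["nonspam"][f] + 1   # features close to neutral
        score += math.log(s / n)        # > 0 leans spam, < 0 leans nonspam
    return "spam" if score > 0 else "nonspam"

learn(["click", "click here", "buy viagra"], "spam")
learn(["lunch", "lunch friday"], "nonspam")
print(classify(["buy viagra", "click"]))  # -> "spam"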

20

Evaluation

The feature set created by the hash gives better performance than single-word Bayesian systems.

Phrases in colloquial English are much more standardized than words alone; this makes filter evasion much harder.

A bigger corpus of example text is better

With 400 Kbytes of selected spams and 300 Kbytes of selected nonspams trained in, and no blacklists, whitelists, or other shenanigans.

21

Results: >99.915%

The actual performance of CRM114 Mailfilter from Nov 1 to Dec 1, 2002.

5849 messages, (1935 spam, 3914 nonspam)

4 false accepts, ZERO false rejects (and 2 messages I couldn't make head nor tail of).

All messages were incoming mail 'fresh from the wild'. No canned spam.

For comparison, a human* is only about 99.84% accurate in classifying spam v. nonspam in a “rapid classification” environment.

22

Results Stats

Filtering speed: classification about 20 Kbytes per second; learning about 10 Kbytes per second

Memory required: about 5 megabytes

404K spam features, 322K nonspam features

23

Downsides?

The bad news: SPAM MUTATES

Even a perfectly trained Bayesian filter will slowly deteriorate.

New spams appear, with new topics, as well as old topics with creative twists to evade antispam filters.

24

Revenge of the Spammers

• How do the spammers game these algorithms?
  – Break the tokenizer
    • Split up words, use HTML tags, etc.
  – Throw in randomly ordered words
    • Throws off the n-gram-based statistics
  – Use few words
    • Harder for the classifier to work