Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST.

10
Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST

Transcript of Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST.

Page 1: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST.

Adapting Statistical Email Filtering

David KohlbrennerIT.com

TJHSST

Page 2: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST.

What is a statistical email filter?

Filters email for spam.

Supervised learning

Not heuristics

Bayesian filtering and Bayes

Page 3: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST.

Why is a statistical email filter better?

Not based on pre-set values by a human

Can use concepts not easily understood by

people Learns over time, and therefore adapts Real world tests put accuracy better than

99.9%

Page 4: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST.

How does a statistical email filter work?

Three parts

Tokenization / feature extraction

Training

Analysis

Also, it must store a persistent state

Page 5: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST.

Tokenization / Feature extraction

Emails are made of words Tokens are words, phrases, HTML,

timestamps, senders, etc. The goal is to get as many 'features' of the

email as is possible, the good ones rise to the top

“the orange ball” Becomes: “the”, “orange”, “ball”, “the orange”, “orange ball”, “the orange ball”, “*Font: Albany”, etc.

Page 6: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST.

Training

All filters begin blank

Trained with a corpus of spam / nonspam

Methods for training as email is seen

TEFT

TUNE

TOE

Page 7: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST.

Analysis

Email's tokens are compared to training data

Some aggregated percentage is created for

email Categorized based on that. Bayesian filtering gets its name from Bayes

theorem here.

Page 8: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST.

So how does this one work?

Designed to be highly modular. Currently has modules for:

TEFT Chi squared Robinson's Graham's

Corpus is non changing, just classifications change.

Page 9: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST.

Object Diagram

ExternalUser

AnalysisPackage

TrainingPackage

Message Database

Token Database

Marked Messages

Un-marked messages

Token Counts

Suggestions

SuggestionsUn-marked messages

ExternalDatabase(Optional)

All database information

EmailCorpus

Page 10: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST.

Does this one work?

To an extent Test data had very limited feature set Test data was based on personal writing style Little time to test/tune

56%-57% accuracy at best Measured by interesting predicted/interesting

actual Also mistakes/interesting marked

More testing will be done Other projects are more critical