
Spam Filtering Using Statistical Data Compression Models Andrej Bratko, Bogdan Filipič, Gordon V. Cormack, Thomas R. Lynam, Blaž Zupan Journal of Machine Learning Research 7 (2006) 2673-2698 Presenter: Kenneth Fung





Summary

Dynamic Markov compression (DMC) and prediction by partial matching (PPM) compression algorithms were used to build a spam filter.

Advantages: fast to construct; updates incrementally; classifies spam in linear time; resistant to random distortions in spam.

Disadvantages: large memory requirements.


Appreciative Comment

The experimenters used the area under the Receiver Operating Characteristic (ROC) curve (AUC), instead of a single value of the false positive rate (FPR) or the spam misclassification rate (SMR), to compare filters.

I believe this is better than reporting FPR alone, because FPR can easily be improved by letting more messages through the filter. In the extreme case, a filter that lets every message through has a very good FPR while catching no spam at all.

AUC is important because it summarizes performance over every trade-off between the two kinds of error, so it stays informative when the costs of misclassification are unbalanced.
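The point about FPR can be made concrete with a small sketch. It assumes each message receives a numeric spam score (higher means more spam-like) and that messages scoring above a threshold are blocked. The helper names `rates` and `auc` and all the scores are invented for illustration; none of them come from the paper.

```python
def rates(spam_scores, ham_scores, threshold):
    """Return (FPR, SMR) when messages scoring above threshold are blocked."""
    fpr = sum(h > threshold for h in ham_scores) / len(ham_scores)
    smr = sum(s <= threshold for s in spam_scores) / len(spam_scores)
    return fpr, smr


def auc(spam_scores, ham_scores):
    """Area under the ROC curve, computed as the fraction of (spam, ham)
    pairs in which the spam message scores higher (ties count one half)."""
    wins = 0.0
    for s in spam_scores:
        for h in ham_scores:
            if s > h:
                wins += 1.0
            elif s == h:
                wins += 0.5
    return wins / (len(spam_scores) * len(ham_scores))


# Hypothetical scores, for illustration only.
spam = [0.9, 0.8, 0.4]
ham = [0.7, 0.3, 0.1]

# A filter that lets every message through: perfect FPR, but every spam is missed.
print(rates(spam, ham, threshold=float("inf")))  # (0.0, 1.0)
# AUC does not depend on any single threshold: 8 of the 9 pairs are ranked correctly.
print(auc(spam, ham))
```

A single FPR figure rewards the pass-everything filter, while the pairwise ranking view used by AUC is unaffected by where the threshold is placed.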


Critical Comment: Poor Explanations

The experimenters have not clearly explained how a message is classified.

They build two models using the same compression algorithm, one trained on all the spam and the other trained on all the legitimate email. The message is fed to both models and is classified by how much the length of each model changes.


Explanation of classification

[Diagram: the message is fed to both the legitimate email model and the spam model, and the model lengths increase by A and B. If Score(A) - Score(B) is positive, the message is classified as SPAM; if negative, as legitimate email.]
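A minimal sketch of this scheme, under stated assumptions: `zlib` stands in for the DMC and PPM compressors used in the paper, each "model" is simply the training text, and the increase in model length is measured as the extra compressed bytes needed once the message is appended. The slide does not say which model A and B belong to; the sketch assumes A is the increase under the legitimate email model and B the increase under the spam model, so a positive difference means the spam model fits the message better. The toy corpora and the function names are hypothetical.

```python
import zlib


def length_increase(corpus: bytes, message: bytes) -> int:
    """Extra compressed bytes needed to encode message after corpus."""
    with_message = len(zlib.compress(corpus + message, 9))
    without_message = len(zlib.compress(corpus, 9))
    return with_message - without_message


def classify(message: bytes, legit_corpus: bytes, spam_corpus: bytes) -> str:
    a = length_increase(legit_corpus, message)  # assumed to be "A" on the slide
    b = length_increase(spam_corpus, message)   # assumed to be "B" on the slide
    return "spam" if a - b > 0 else "legitimate"


# Toy training data, made up for illustration.
LEGIT = (b"Hi team, the meeting is moved to Thursday at noon. "
         b"Please review the attached agenda before we meet. ")
SPAM = (b"Win money now! Click here for cheap pills. "
        b"Limited offer, buy now and win a free prize. ")

print(classify(b"Click here for cheap pills.", LEGIT, SPAM))
print(classify(b"Please review the attached agenda", LEGIT, SPAM))
```

Recompressing the whole corpus for every message is only acceptable in a toy; an adaptive model such as DMC or PPM keeps its state and can score a message in a single pass. Note also that `zlib` looks back at most 32 KB, so it would ignore most of a realistic training corpus.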


Explanation of classification

In this classification method, both models compress the message to the minimum length they can achieve. With this design, the strength of the filter cannot be controlled.

The authors do not clearly discuss how the trade-off between FPR and SMR was controlled in the experiments.

A possible design is to weight the scores: Score(A) - k * Score(B) for some value k.
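How the weight k would move the trade-off can be sketched as follows. The sketch assumes A is the length increase under the legitimate email model and B the increase under the spam model (the slides do not state the assignment), and that a positive weighted score means spam. The `rates` helper and the (A, B, is_spam) triples are invented for illustration.

```python
def rates(messages, k):
    """Return (FPR, SMR) for the rule: spam if A - k * B > 0."""
    ham = [(a, b) for a, b, is_spam in messages if not is_spam]
    spam = [(a, b) for a, b, is_spam in messages if is_spam]
    false_positives = sum(a - k * b > 0 for a, b in ham)
    missed_spam = sum(a - k * b <= 0 for a, b in spam)
    return false_positives / len(ham), missed_spam / len(spam)


# Hypothetical (A, B, is_spam) triples, for illustration only.
MESSAGES = [
    (40, 10, True), (30, 20, True), (26, 24, True),
    (10, 40, False), (22, 20, False), (25, 24, False),
]

for k in (1.0, 1.05, 1.2):
    fpr, smr = rates(MESSAGES, k)
    print(f"k={k}: FPR={fpr:.2f} SMR={smr:.2f}")
```

On this toy data, raising k makes the filter more reluctant to call a message spam, so FPR falls while SMR rises. Sweeping k over a range traces out the kind of curve that AUC summarizes.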


Explanation of Training

By assumption, the user corrects every misclassification made by the filter. The corresponding model is then updated with the correctly classified message.

[Diagram: an unclassified message enters the filter; the message classified by the filter goes to the user; the correctly classified message is returned to the filter.]
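The feedback loop can be sketched as below. This is a simplified illustration, not the paper's implementation: `CharModel` is a hypothetical order-1 character model with add-one smoothing that stands in for DMC or PPM, and the four-message stream is made up. The filter first classifies each message, then the user's true label is used to update the matching model.

```python
import math


class CharModel:
    """Adaptive order-1 character model with add-one smoothing.
    A simplified stand-in for illustration, not DMC or PPM."""

    def __init__(self):
        self.counts = {}  # previous char -> {next char -> count}
        self.totals = {}  # previous char -> number of observed successors

    def bits(self, text):
        """Code length in bits for text under the current model (no update)."""
        total, prev = 0.0, ""
        for ch in text:
            seen = self.counts.get(prev, {}).get(ch, 0)
            # Smoothing assumes an alphabet of 256 symbols.
            p = (seen + 1) / (self.totals.get(prev, 0) + 256)
            total -= math.log2(p)
            prev = ch
        return total

    def update(self, text):
        """Incrementally add the statistics of text to the model."""
        prev = ""
        for ch in text:
            successors = self.counts.setdefault(prev, {})
            successors[ch] = successors.get(ch, 0) + 1
            self.totals[prev] = self.totals.get(prev, 0) + 1
            prev = ch


class OnlineFilter:
    def __init__(self):
        self.models = {"spam": CharModel(), "legitimate": CharModel()}

    def classify(self, text):
        diff = self.models["legitimate"].bits(text) - self.models["spam"].bits(text)
        return "spam" if diff > 0 else "legitimate"

    def train(self, text, true_label):
        # Only the model of the user-confirmed class is updated.
        self.models[true_label].update(text)


# Hypothetical stream of (message, label supplied by the user).
stream = [
    ("win cash now", "spam"),
    ("meeting at noon", "legitimate"),
    ("win a free prize now", "spam"),
    ("agenda for the meeting", "legitimate"),
]

filt = OnlineFilter()
mistakes = 0
for text, true_label in stream:
    if filt.classify(text) != true_label:
        mistakes += 1
    filt.train(text, true_label)  # user feedback after every message

print(mistakes, filt.classify("win cash"), filt.classify("meeting agenda"))
```

Because `train` trusts the supplied label completely, a wrong label from the user would be learned just as readily as a right one.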


Critical Comment: Overfitting?

The authors did not discuss overfitting.

10-fold cross-validation was used in some of the experiments, but no details of the cross-validation are shown.

The data suggest that the model may have been overfitted.
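For reference, a 10-fold split and the kind of per-fold summary that could have been reported might look like the sketch below. The function names and the fold scores are hypothetical; this is not the authors' procedure.

```python
def kfold_indices(n, k=10):
    """Yield (train, test) index lists for k contiguous folds over n items."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def summarize(fold_scores):
    """Return the mean and the min-to-max spread of per-fold scores."""
    mean = sum(fold_scores) / len(fold_scores)
    spread = max(fold_scores) - min(fold_scores)
    return mean, spread


folds = list(kfold_indices(25, k=10))
print(len(folds), [len(test) for _, test in folds])

# Made-up per-fold scores, for illustration only.
print(summarize([0.5, 0.75, 1.0]))
```

Reporting the spread across folds next to the mean gives the reader some evidence of how stable the results are across different training sets.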


The model is probably overfitted.


Question

Would users always classify spam correctly? What happens if a user provides an incorrect spam report? Would you always go through the junk box to pick out every legitimate email among hundreds of spam messages?