A Comparison of Feature-Based and Feature-Free Case-Based Reasoning for Spam Filtering
Derek Bridge, University College Cork
work done with
Sarah Jane Delany, Dublin Institute of Technology
Overview
• Introduction
• Case-Based Spam Filtering
  – Feature-Based
  – Feature-Free
  – Experiments I
• Case Base Maintenance
  – Competence-Based Editing
  – Experiments II
• Concept Drift
  – Incremental & periodic solutions
  – Experiments III
• Conclusions
Introduction
• From the Spamhaus project (www.spamhaus.org)
  – “An electronic message is ‘spam’ IF: (1) the recipient's personal identity and context are irrelevant because the message is equally applicable to many other potential recipients; AND (2) the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for it to be sent.”
• “[It’s] about consent, not content”
• We focus on email spam
Spam Filtering
• Spam filtering is classification:
  – is an incoming email ham or spam?
• Spam filters
  – procedural
    • whitelists, blacklists, challenge-response systems,…
  – collaborative
    • sharing signatures
  – content-based
    • rules, decision trees, probabilities, case bases,…
  – hybrid.
Challenges of Spam Filtering
• Spam is subjective and personal;
• It is heterogeneous;
• There is a high cost to false positives (where ham is classified as spam); and
• It is constantly changing (‘concept drift’).
Case-Based Reasoning
[Figure: the CBR cycle. Steps: RETRIEVE, REUSE, REVISE, RETAIN. Labels: New problem, Retrieved Case, Adapted Case, Tested/Repaired Case, Learned Case, Previous Case, General knowledge.]
[Aamodt & Plaza 1994]
Case-Based Reasoning
[Figure: CBR diagram with a MAINTAIN step. Labels: General knowledge, Previous Case.]
Is Case-Based Reasoning (CBR) the answer?
• Spam is subjective and personal;
  – Users can have individual case bases created from their own emails
• It is heterogeneous;
  – It is known that CBR handles disjunctive concepts well
• There is a high cost to false positives (where ham is classified as spam); and
  – We can bias CBR away from false positives
• It is constantly changing (‘concept drift’).
  – Case bases can be updated incrementally
Email Classification Using Examples (ECUE)
• ECUE uses Case-Based Reasoning (CBR) to classify emails
• A case base contains a user’s email (both ham and spam)
• ECUE classifies an incoming email using the k-nearest neighbour algorithm:
  – It retrieves from the case base the k nearest neighbours (the k that are closest or most similar)
  – The cases it retrieves then vote to decide the class of the new email
  – To bias away from false positives, ECUE uses unanimous voting.
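The retrieval-and-vote step can be sketched as follows. This is a minimal illustration, not ECUE's actual code: the function name `classify_unanimous`, the `(representation, label)` case layout and the toy numeric distance are all assumptions, and "unanimous voting" is read here as "predict spam only if every one of the k neighbours is spam".

```python
from typing import Callable, List, Tuple

Case = Tuple[object, str]  # (email representation, "ham" or "spam"); hypothetical layout


def classify_unanimous(query: object,
                       case_base: List[Case],
                       distance: Callable[[object, object], float],
                       k: int = 3) -> str:
    """Retrieve the k nearest cases and predict spam only on a unanimous spam vote."""
    nearest = sorted(case_base, key=lambda case: distance(query, case[0]))[:k]
    if nearest and all(label == "spam" for _, label in nearest):
        return "spam"
    return "ham"  # any ham neighbour vetoes a spam prediction


# Toy example: emails stand in as numbers, distance is the absolute difference.
toy_case_base = [(1, "spam"), (2, "spam"), (3, "spam"), (10, "ham"), (11, "ham")]
toy_distance = lambda a, b: abs(a - b)

near_spam = classify_unanimous(2, toy_case_base, toy_distance)   # neighbours 2, 1, 3
mixed_vote = classify_unanimous(7, toy_case_base, toy_distance)  # neighbours 10, 3, 11
```

With a mixed neighbourhood the query falls back to ham, which is how the vote biases the filter away from false positives.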
Feature-Based ECUE
• Features extracted ($f_{ij}$)
  – words, characters, structural features
• Binary representation: $f_{i1} = 1$ or $f_{i1} = 0$
• Each email becomes a case: $e_i = \langle f_{i1}, f_{i2}, \ldots, f_{iN}, \text{class label} \rangle$
[Figure: Emails → Feature Extraction → Case base]
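A minimal sketch of the binary case representation, assuming word features only (the slides also mention character and structural features). The tokenising regular expression, the function name `binary_features` and the three-word vocabulary are hypothetical choices for illustration.

```python
import re
from typing import List, Tuple


def binary_features(text: str, vocabulary: List[str]) -> List[int]:
    """Return 1 for each vocabulary word that occurs in the email, else 0."""
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    return [1 if word in tokens else 0 for word in vocabulary]


# Hypothetical vocabulary; a real one would come from feature extraction.
vocabulary = ["free", "meeting", "viagra"]

# A case pairs the feature vector with its class label.
case: Tuple[List[int], str] = (
    binary_features("FREE offer, totally free!", vocabulary),
    "spam",
)
```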
Feature-Based ECUE
• Information Gain used to select the 700 most predictive features
[Figure: Emails → Feature Extraction → Case base → Feature Selection → Case base]
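Information Gain for a binary feature is the reduction in class entropy obtained by splitting the cases on that feature. The sketch below uses that standard definition; the function names and the four-case toy data are assumptions, and ECUE would keep the top 700 features rather than the top 1.

```python
import math
from collections import Counter
from typing import List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(column: Sequence[int], labels: Sequence[str]) -> float:
    """Entropy of the labels minus the weighted entropy after splitting on one binary feature."""
    present = [l for f, l in zip(column, labels) if f == 1]
    absent = [l for f, l in zip(column, labels) if f == 0]
    total = len(labels)
    remainder = (len(present) / total) * entropy(present) \
        + (len(absent) / total) * entropy(absent)
    return entropy(labels) - remainder


def select_features(vectors: List[List[int]], labels: List[str], n: int) -> List[int]:
    """Return the indices of the n features with the highest information gain."""
    num_features = len(vectors[0])
    gains = [information_gain([v[j] for v in vectors], labels) for j in range(num_features)]
    return sorted(range(num_features), key=lambda j: gains[j], reverse=True)[:n]


# Toy data: feature 0 separates the classes perfectly, feature 1 carries no information.
toy_vectors = [[1, 0], [1, 1], [0, 0], [0, 1]]
toy_labels = ["spam", "spam", "ham", "ham"]
best = select_features(toy_vectors, toy_labels, n=1)
```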
Feature-Based ECUE
• Competence-Based Editing used to edit case base
[Figure: Emails → Feature Extraction → Case base → Feature Selection → Case base → Case Selection → Case base]
Feature-Based ECUE
[Figure: Emails → Feature Extraction → Case base → Feature Selection → Case base → Case Selection → Case base. Runtime System: New Case → Classification → “spam!”]
Feature-Based ECUE
• The distance between cases is a count of the number of features that they do not share
• Naïve Bayes classifier thought to be among the best for spam filtering
• Feature-Based ECUE has accuracy comparable to, and sometimes slightly better than, Naïve Bayes
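On binary feature vectors, the feature-based distance described above amounts to counting the positions where two cases differ. A minimal sketch, with the hypothetical name `feature_distance`:

```python
from typing import Sequence


def feature_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the features on which two binary cases disagree."""
    if len(a) != len(b):
        raise ValueError("cases must use the same feature set")
    return sum(1 for x, y in zip(a, b) if x != y)


example_distance = feature_distance([1, 0, 1, 1], [1, 1, 0, 1])  # differs in positions 1 and 2
```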
Feature-Free ECUE
• Alternative to Feature-Based ECUE
• Inspired by theory of Kolmogorov Complexity
  – K(x) = size of smallest Turing machine that can output x to its tape
  – K(x|y) = size of smallest Turing machine that can output x when given y
• Basis for distance measure
  – if K(x|y) < K(x|z) then y is more similar to x than z
[Li et al. 2003]
Feature-Free ECUE
• Approximate K(x) by C(x)
  – C(x) = size of x after compression
• Text compression exploits intra-document redundancy
  – e.g. “Case based reasoning” → “Case b•d reasoning”
Using Compression
• Consider length of two documents allowing for inter-document redundancy
  = len(gzip(docX + docY))
  = C(xy)
Using Compression
• Consider length of two documents not allowing for inter-document redundancy
  = len(gzip(docX)) + len(gzip(docY))
  = C(x) + C(y)
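The two quantities can be compared directly with Python's standard `gzip` module. This is a sketch under assumptions: the helper name `C` mirrors the slide notation, and the two near-duplicate documents are made up for illustration.

```python
import gzip


def C(text: str) -> int:
    """Size in bytes of the gzip-compressed text, standing in for C(x) on the slides."""
    return len(gzip.compress(text.encode("utf-8")))


# Two made-up documents that share most of their content.
doc_x = "Cheap meds online, order now and save. " * 20
doc_y = "Cheap meds online, order today and save. " * 20

joint = C(doc_x + doc_y)        # C(xy): the compressor can reuse doc_x when encoding doc_y
separate = C(doc_x) + C(doc_y)  # C(x) + C(y): no sharing between the documents
```

The more two documents have in common, the further `joint` falls below `separate`, which is the signal the compression-based measure exploits.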
Compression-Based Dissimilarity (CDM)
• Max value ≤ 1 (furthest)
• Min value > 0.5 (nearest)
• However, CDM is not a metric: in general CDM(x,x) ≠ 0, CDM(x,y) ≠ CDM(y,x), and the triangle inequality CDM(x,y) + CDM(y,z) ≥ CDM(x,z) is not guaranteed to hold
$\mathrm{CDM}(x,y) = \dfrac{C(xy)}{C(x) + C(y)}$
[Keogh et al 2004]
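A minimal sketch of CDM, i.e. C(xy) / (C(x) + C(y)), using gzip as the compressor, as in the experiments. The function names and the three short example messages are assumptions made for illustration.

```python
import gzip


def compressed_size(data: bytes) -> int:
    """C(.): length of the gzip-compressed byte string."""
    return len(gzip.compress(data))


def cdm(x: str, y: str) -> float:
    """Compression-Based Dissimilarity Measure: C(xy) / (C(x) + C(y))."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    return compressed_size(bx + by) / (compressed_size(bx) + compressed_size(by))


# Made-up messages: two similar spam emails and one unrelated ham email.
spam_a = "Buy cheap watches now! Limited offer, click here to buy cheap watches today."
spam_b = "Buy cheap watches today! Limited offer, click here to buy cheap watches now."
ham = "Hi Sarah, the project meeting has moved to Thursday afternoon in room 204."

similar = cdm(spam_a, spam_b)
dissimilar = cdm(spam_a, ham)
```

Because no features are extracted, this function can be passed straight to a k-NN classifier as its distance.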
Feature-Based ECUE
[Figure: Emails → Feature Extraction → Case base → Feature Selection → Case base → Case Base Edit → Case base. Runtime System: New Email → Classification → “spam!”]
Feature-Free ECUE
[Figure: Emails → Case base → Case Base Edit → Case base. Runtime System: New Email → Classification → “spam!”]
Experiments I
• Created 4 datasets of 1000 emails from two years of email from two people
  – each dataset has 500 consecutive ham, 500 consecutive spam
• 10-fold cross-validation
• Settings:
  – k = 3
  – Feature-based: 700 features
  – Feature-free: GZip as text compressor
• Measures:
  – FPRate = #false positives / #ham
  – FNRate = #false negatives / #spam
  – Err = (FPRate + FNRate) / 2
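The three measures can be computed from lists of actual and predicted labels. A minimal sketch; the function name `error_rates` and the eight-email example are hypothetical.

```python
from typing import Sequence, Tuple


def error_rates(actual: Sequence[str], predicted: Sequence[str]) -> Tuple[float, float, float]:
    """Return (FPRate, FNRate, Err) as defined on the slide."""
    num_ham = sum(1 for a in actual if a == "ham")
    num_spam = sum(1 for a in actual if a == "spam")
    false_pos = sum(1 for a, p in zip(actual, predicted) if a == "ham" and p == "spam")
    false_neg = sum(1 for a, p in zip(actual, predicted) if a == "spam" and p == "ham")
    fp_rate = false_pos / num_ham
    fn_rate = false_neg / num_spam
    return fp_rate, fn_rate, (fp_rate + fn_rate) / 2


# Made-up outcome: 1 of 4 ham flagged as spam, 2 of 4 spam let through.
actual = ["ham"] * 4 + ["spam"] * 4
predicted = ["ham", "ham", "ham", "spam", "spam", "spam", "ham", "ham"]
fp_rate, fn_rate, err = error_rates(actual, predicted)
```

Averaging the two rates means Err is not dominated by whichever class happens to be larger.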
Results - % Error
| | Feature-Based | Feature-Free (GZip) |
|---|---|---|
| Dataset 1 | 5.7% | 2.4% |
| Dataset 2 | 4.0% | 0.2% |
| Dataset 3 | 9.8% | 2.2% |
| Dataset 4 | 13.2% | 1.5% |
Results - % False Positives
| | Feature-Based | Feature-Free (GZip) |
|---|---|---|
| Dataset 1 | 9.2% | 1.4% |
| Dataset 2 | 1.4% | 0.0% |
| Dataset 3 | 1.0% | 0.8% |
| Dataset 4 | 0.6% | 1.2% |
Case Base Maintenance
• Case base editing algorithms
  – remove redundant cases, and
  – remove noisy cases.
• Their goal is to
  – reduce retrieval time but
  – maintain or even improve accuracy.
Competence Model
• For each case c, compute
  – coverage set of c
    • cases that have c as one of their k-NN and which have same class as c
  – liability set of c
    • cases that have c as one of their k-NN and which have different class from c
[Figure: case c with a case x that is in the coverage set of c and a case y that is in the liability set of c]
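The two sets can be computed in one pass over the case base, following the definitions on this slide. The sketch below makes some assumptions: cases are `(representation, label)` pairs, a case is not counted as its own neighbour, and the toy data uses numbers with k = 1.

```python
from typing import Callable, Dict, List, Set, Tuple


def competence_sets(cases: List[Tuple[object, str]],
                    distance: Callable[[object, object], float],
                    k: int = 3) -> Tuple[Dict[int, Set[int]], Dict[int, Set[int]]]:
    """Return (coverage, liability) sets, indexed by each case's position in `cases`."""
    coverage: Dict[int, Set[int]] = {i: set() for i in range(len(cases))}
    liability: Dict[int, Set[int]] = {i: set() for i in range(len(cases))}
    for i, (rep_i, label_i) in enumerate(cases):
        others = [j for j in range(len(cases)) if j != i]
        neighbours = sorted(others, key=lambda j: distance(rep_i, cases[j][0]))[:k]
        for c in neighbours:
            if cases[c][1] == label_i:
                coverage[c].add(i)   # c is a same-class neighbour of case i
            else:
                liability[c].add(i)  # c is a different-class neighbour of case i
    return coverage, liability


# Toy case base: a spam case at 2 sits among ham cases.
toy_cases = [(0, "ham"), (1, "ham"), (2, "spam"), (10, "spam"), (11, "spam")]
coverage, liability = competence_sets(toy_cases, lambda a, b: abs(a - b), k=1)
```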
Competence-Based Editing
• Blame-Based Noise Reduction
  – For each case c with non-empty liability set (taken in descending order of size of liability set),
    • if the cases in c’s coverage set can still be correctly classified without c, then c can be deleted.
  – This emphasises removal of cases that cause misclassifications.
• Conservative Redundancy Reduction
  – For each remaining case c (taken in ascending order of size of coverage set)
    • retain c but delete the cases in c’s coverage set
  – This retains cases close to class boundaries
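The two editing passes can be sketched as below. This follows the slide wording only; it is not the authors' implementation. Assumptions: the competence sets are precomputed and not updated after deletions, `classify(i, available)` is a hypothetical callback that predicts the label of case i from the cases in `available`, and the toy data and its hand-built sets use 1-NN on numbers.

```python
from typing import Callable, Dict, List, Set


def bbnr(labels: List[str], coverage: Dict[int, Set[int]], liability: Dict[int, Set[int]],
         classify: Callable[[int, Set[int]], str]) -> Set[int]:
    """Blame-Based Noise Reduction: drop blamed cases whose coverage set survives without them."""
    kept = set(range(len(labels)))
    blamed = sorted((c for c in range(len(labels)) if liability[c]),
                    key=lambda c: len(liability[c]), reverse=True)
    for c in blamed:
        without_c = kept - {c}
        covered = [x for x in coverage[c] if x in without_c]
        if all(classify(x, without_c - {x}) == labels[x] for x in covered):
            kept = without_c
    return kept


def crr(kept: Set[int], coverage: Dict[int, Set[int]]) -> List[int]:
    """Conservative Redundancy Reduction: retain c, delete the cases in c's coverage set."""
    order = sorted(sorted(kept), key=lambda c: len(coverage[c] & kept))
    remaining = set(kept)
    retained = []
    for c in order:
        if c in remaining:
            retained.append(c)
            remaining -= coverage[c] | {c}
    return retained


# Toy data: case 2 is a spam case lying among ham cases.
values = [0.0, 1.0, 1.4, 10.0, 11.0]
labels = ["ham", "ham", "spam", "spam", "spam"]
coverage = {0: set(), 1: {0}, 2: set(), 3: {4}, 4: {3}}   # hand-built for k = 1
liability = {0: set(), 1: {2}, 2: {1}, 3: set(), 4: set()}


def classify_1nn(i: int, available: Set[int]) -> str:
    return labels[min(available, key=lambda j: abs(values[i] - values[j]))]


after_bbnr = bbnr(labels, coverage, liability, classify_1nn)
after_crr = crr(after_bbnr, coverage)
```

In this toy run, the first pass deletes case 2 and the second pass then drops one of the two mutually covering spam cases.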
Results - % Error
| | Feature-Based (full) | Feature-Based (edited) | Feature-Free (full) | Feature-Free (edited) |
|---|---|---|---|---|
| Dataset 1 | 5.7% | 3.8% | 2.4% | 2.2% |
| Dataset 3 | 9.8% | 7.0% | 2.2% | 2.6% |
• Feature-based edited size = 75% and 65%
• Feature-free edited size = 59% and 57%
Results - % False Positives
| | Feature-Based (full) | Feature-Based (edited) | Feature-Free (full) | Feature-Free (edited) |
|---|---|---|---|---|
| Dataset 1 | 9.2% | 3.4% | 1.4% | 1.0% |
| Dataset 3 | 1.0% | 2.2% | 0.8% | 0.4% |
Concept Drift
• The target concept is not static
  – it changes according to season
  – it changes according to world events
  – people’s interests and tolerances change
  – there is an arms race:
    • ever more devious spamouflage!
• We need to investigate behaviour over time
Experiments III
• Took ~10000 emails from two years of email from two people in date-order
• Created a case base for each person from earliest 500 consecutive ham & earliest 500 consecutive spam
• Remaining ~9000 emails presented chronologically as test cases
• Same settings and measures as before
  – k = 3
  – Feature-based: 700 features
  – Feature-free: GZip as text compressor
Retention policies
• CBR (and other lazy learners) can easily incorporate the most recent examples
  – retain-all: store all new emails in the case base
  – retain-misclassifieds: store a new email if our prediction is wrong
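The two retention policies differ only in the condition under which a newly labelled email is stored. A minimal sketch; the function name `retain` and the `(email, label)` case layout are assumptions.

```python
from typing import List, Tuple


def retain(case_base: List[Tuple[str, str]], email: str, true_label: str,
           predicted_label: str, policy: str = "retain-misclassifieds") -> bool:
    """Store the email under the given policy; return True if it was stored."""
    if policy == "retain-all":
        store = True
    elif policy == "retain-misclassifieds":
        store = predicted_label != true_label
    else:
        raise ValueError(f"unknown policy: {policy}")
    if store:
        case_base.append((email, true_label))
    return store


case_base: List[Tuple[str, str]] = []
stored_correct = retain(case_base, "hello", "ham", predicted_label="ham")
stored_wrong = retain(case_base, "win a prize", "spam", predicted_label="ham")
```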
Results - % Error
| | Feature-Free (GZip) | Feature-Free (GZip): retain-misclassifieds |
|---|---|---|
| Dataset A | 15.9% | 2.3% |
| Dataset B | 12.6% | 3.2% |
• With retain-misclassifieds, case bases increase in size by ~30%
Results - % False Positives
| | Feature-Free (GZip) | Feature-Free (GZip): retain-misclassifieds |
|---|---|---|
| Dataset A | 0.7% | 1.5% |
| Dataset B | 4.0% | 3.5% |
Retention
• Bigger case base reduces efficiency
• Obsolete cases may reduce accuracy
• Obsolete features may reduce accuracy
• Need a deletion policy
Incremental Solutions
• Consider add-1-delete-1
  – Case base size remains constant
  – retention policy
    • retain-all
    • retain-misclassified
  – forgetting policy
    • forget-oldest
    • forget-least-accurate
[Slide annotations: instance selection; instance weighting]
Incremental Solutions
• Consider add-1-delete-1
  – Case base size remains constant
  – retention policy
    • retain-all
    • retain-misclassified
  – forgetting policy
    • forget-oldest
    • forget-least-accurate
Accuracy = #successes / #retrievals
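An add-1-delete-1 update can be sketched as follows. Several details are assumptions not stated on the slide: the `StoredCase` fields, treating a never-retrieved case as having accuracy 1.0, and how a "success" is counted (here the counters are simply supplied by the caller).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StoredCase:
    email: str
    label: str
    added_at: int        # position in arrival order (hypothetical bookkeeping)
    retrievals: int = 0  # times the case was retrieved as a neighbour
    successes: int = 0   # retrievals counted as successful

    @property
    def accuracy(self) -> float:
        # Accuracy = #successes / #retrievals; unretrieved cases default to 1.0 (assumption).
        return self.successes / self.retrievals if self.retrievals else 1.0


def add_1_delete_1(case_base: List[StoredCase], new_case: StoredCase,
                   forgetting: str = "forget-oldest") -> StoredCase:
    """Add one case and forget one, so the case base size stays constant."""
    if forgetting == "forget-oldest":
        victim = min(case_base, key=lambda c: c.added_at)
    elif forgetting == "forget-least-accurate":
        victim = min(case_base, key=lambda c: c.accuracy)
    else:
        raise ValueError(f"unknown forgetting policy: {forgetting}")
    case_base.remove(victim)
    case_base.append(new_case)
    return victim


def example_case_base() -> List[StoredCase]:
    return [StoredCase("a", "ham", 1, retrievals=4, successes=4),
            StoredCase("b", "spam", 2, retrievals=4, successes=1),
            StoredCase("c", "spam", 3)]


cb_oldest = example_case_base()
forgot_oldest = add_1_delete_1(cb_oldest, StoredCase("d", "ham", 4), "forget-oldest")
cb_accuracy = example_case_base()
forgot_worst = add_1_delete_1(cb_accuracy, StoredCase("d", "ham", 4), "forget-least-accurate")
```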
Results - % Error
| Configuration | Dataset A | Dataset B |
|---|---|---|
| Feature-Free | 15.9% | 12.6% |
| Feature-Free: retain-misclassifieds, forget-oldest | 2.3% | 3.2% |
| Feature-Free: retain-all, forget-oldest | 1.7% | 2.8% |
| Feature-Free: retain-misclassifieds, forget-least-accurate | 1.8% | 4.0% |
| Feature-Free: retain-all, forget-least-accurate | 1.9% | 3.0% |
Results - % False Positives
| Configuration | Dataset A | Dataset B |
|---|---|---|
| Feature-Free | 0.7% | 4.0% |
| Feature-Free: retain-misclassifieds, forget-oldest | 1.3% | 3.5% |
| Feature-Free: retain-all, forget-oldest | 1.7% | 4.2% |
| Feature-Free: retain-misclassifieds, forget-least-accurate | 1.8% | 6.4% |
| Feature-Free: retain-all, forget-least-accurate | 2.4% | 5.0% |
• Negative effect on FPs?
Periodic Solutions
• Periodic
  – Feature-based:
    • retain-misclassified;
    • monthly, feature re-extraction, feature re-selection, case base rebuild and case base edit
  – Feature-free
    • retain-misclassified;
    • monthly, case base edit
Feature-Based ECUE
[Figure: Emails → Feature Extraction → Case base → Feature Selection → Case base → Case Base Edit → Case base]
Results - % Error
| Configuration | Dataset A | Dataset B |
|---|---|---|
| Feature-Based | 15.4% | 19.2% |
| Feature-Based: retain-misclassifieds, monthly reselect & edit | 4.5% | 6.1% |
| Feature-Free | 15.9% | 12.6% |
| Feature-Free: retain-misclassifieds, monthly edit | 2.3% | 2.6% |
Results - % False Positives
| Configuration | Dataset A | Dataset B |
|---|---|---|
| Feature-Based | 20.0% | 14.7% |
| Feature-Based: retain-misclassifieds, monthly reselect & edit | 2.0% | 2.4% |
| Feature-Free | 0.7% | 4.0% |
| Feature-Free: retain-misclassifieds, monthly edit | 0.9% | 2.5% |
Feature-Free ECUE: Advantages
• Accuracy
  – lower error rate than traditional feature-based methods
  – often lower false positive rate
• Costs
  – it uses the raw text
  – no need to extract, select or weight features
  – no need to update features as spam changes
• Concept drift
  – simple retention/forgetting policies can be effective
Feature-Free ECUE: Disadvantages
• No justification factors to explain results or drive adaptation
• Higher computation time
  – Time to classify email (with case base of 1000)
    • Feature-free = 2 secs
    • Feature-based = 0.01 secs
• Not a metric
Future Work
• Investigating algorithms to speed up retrieval time
• Application of measure to text other than emails
Thank you for your attention!
Spare slides
Normalized Compression Distance (NCD)
• Max value = 1 + ε (furthest)
• Min value = 0 (nearest)
• However, with a real compressor, in general NCD(x,x) ≠ 0, NCD(x,y) ≠ NCD(y,x), and the triangle inequality NCD(x,y) + NCD(y,z) ≥ NCD(x,z) is not guaranteed to hold
$\mathrm{NCD}(x,y) = \dfrac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}$
[Li et al 2003]
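A minimal sketch of NCD, i.e. (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), with gzip standing in for the compressor. The function names and example messages are assumptions made for illustration.

```python
import gzip


def compressed_size(data: bytes) -> int:
    """C(.): length of the gzip-compressed byte string."""
    return len(gzip.compress(data))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    return (compressed_size(bx + by) - min(cx, cy)) / max(cx, cy)


# Made-up messages: two similar spam emails and one unrelated ham email.
spam_a = "Buy cheap watches now! Limited offer, click here to buy cheap watches today."
spam_b = "Buy cheap watches today! Limited offer, click here to buy cheap watches now."
ham = "Hi Sarah, the project meeting has moved to Thursday afternoon in room 204."

ncd_similar = ncd(spam_a, spam_b)
ncd_dissimilar = ncd(spam_a, ham)
```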
Comparing Compression Algorithms
• The better the compression the better the measure?
• Compared GZip with Prediction by Partial Matching (PPM)
  – GZip = Lempel-Ziv variant
  – PPM = adaptive statistical compressor
Results - % Error
| | GZip | PPM(2) | PPM(4) | PPM(8) |
|---|---|---|---|---|
| Dataset 1 | 2.4% | 2.3% | 2.1% | 2.0% |
| Dataset 2 | 0.1% | 0.2% | 0.2% | 0.2% |
| Dataset 3 | 2.4% | 1.9% | 2.2% | 2.5% |
| Dataset 4 | 1.4% | 1.1% | 1.6% | 1.7% |
Results
• Little difference in classification error
  – Compressor choice does not greatly matter
• PPM is generally considered better at compression but on our datasets...
  – average of 59% compression for GZip
  – average of 57% compression for PPM
• PPM is computationally expensive
  – 180 times slower than GZip
GZip Speed Up
• GZip uses a 32 KByte sliding window
• Truncate each email to 16KB
• Achieves speed-ups of between 9.5% and 25%
[Figure: docX and docY side by side within a 32KB window]