Some Recent Work
ECOC for Text Classification
Hybrids of EM & Co-Training (with Kamal Nigam)
Learning to build a monolingual corpus from the web (with Rosie Jones)
Effect of Smoothing on Naive Bayes for text classification (with Tong Zhang)
Hypertext Categorization using link and extracted information (with Sean Slattery & Yiming Yang)

Using Error-Correcting Codes for Text Classification
Rayid Ghani
Center for Automated Learning & Discovery, Carnegie Mellon University
This presentation can be accessed at http://www.cs.cmu.edu/~rayid/talks/
Outline
Introduction to ECOC
Intuition & Motivation
Some Questions
Experimental Results
Semi-Theoretical Model
Types of Codes
Drawbacks
Conclusions
Introduction
Decompose a multiclass classification problem into multiple binary problems:
One-Per-Class approach (moderately expensive)
All-Pairs (very expensive)
Distributed Output Code (efficient, but what about performance?)
Error-Correcting Output Codes (?)
Is it a good idea?
Larger margin for error, since errors can now be “corrected”
One-per-class is a code with minimum Hamming distance (HD) = 2
Distributed codes have low HD
The individual binary problems can be harder than before
Useless unless the number of classes > 5
Training ECOC
Given m distinct classes:
1. Create an m x n binary matrix M.
2. Assign each class ONE row of M.
3. Each column of the matrix divides the classes into TWO groups.
4. Train the n base classifiers to learn the n binary problems (see the sketch below).

     f1 f2 f3 f4 f5
A    0  0  1  1  0
B    1  0  1  0  0
C    0  1  1  1  0
D    0  1  0  0  1
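For concreteness, here is a minimal sketch of this training loop, assuming scikit-learn-style base learners; the matrix is the toy 4 x 5 code from the slide above, not a code used in the experiments.

```python
import numpy as np
from sklearn.base import clone
from sklearn.naive_bayes import MultinomialNB

# m = 4 classes (A-D), n = 5 bits: the toy code matrix from the slide.
M = np.array([[0, 0, 1, 1, 0],   # A
              [1, 0, 1, 0, 0],   # B
              [0, 1, 1, 1, 0],   # C
              [0, 1, 0, 0, 1]])  # D

def train_ecoc(X, y, code=M, base=MultinomialNB()):
    """Train one binary classifier per column of the code matrix.

    X: (num_docs, num_features) term-count matrix
    y: integer class labels (numpy array) indexing rows of `code`
    """
    classifiers = []
    for bit in range(code.shape[1]):
        # Column `bit` relabels every example as 0 or 1 according to
        # the bit assigned to its true class in the code matrix.
        binary_labels = code[y, bit]
        classifiers.append(clone(base).fit(X, binary_labels))
    return classifiers
```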
Testing ECOC
To test a new instance:
Apply each of the n classifiers to the new instance.
Combine the predictions to obtain a binary string (codeword) for the new point.
Classify to the class with the nearest codeword (usually Hamming distance is used as the distance measure), as in the sketch below.
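A matching sketch of the decoding step, continuing the (assumed) names from the training sketch above:

```python
import numpy as np

def predict_ecoc(X, classifiers, code=M):
    # One 0/1 prediction per classifier gives each instance a codeword.
    bits = np.column_stack([clf.predict(X) for clf in classifiers])
    # Hamming distance from each predicted codeword to each class row.
    distances = (bits[:, None, :] != code[None, :, :]).sum(axis=2)
    # Classify to the class whose codeword is nearest.
    return distances.argmin(axis=1)
```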
ECOC - Picture
[Figure, shown in four build steps: the 4 x 5 code matrix above, with regions labeled A-D for the four classes and columns f1-f5. A test instance X receives the codeword 1 1 1 1 0 from the five classifiers; its nearest class codeword (Hamming distance 1) is C's 0 1 1 1 0.]
Single classifier – learns a complex boundary once.
Ensemble – learns a complex boundary multiple times.
ECOC – learns a “simple” boundary multiple times.
Questions
How well does it work?
How long should the code be?
Do we need a lot of training data?
What kind of codes can we use?
Are there intelligent ways of creating the code?
Previous Work
Combined with boosting – ADABOOST.OC (Schapire, 1997), (Guruswami & Sahai, 1999)
Local learners (Ricci & Aha, 1997)
Text classification (Berger, 1999)
Experimental Setup
Generate the code: BCH codes
Choose a base learner: naive Bayes classifier, as used in text classification tasks (McCallum & Nigam, 1998)
Dataset: Industry Sector dataset, consisting of company web pages classified into 105 economic sectors
Standard stoplist; no stemming; skip all MIME headers and HTML tags
Experimental approach similar to McCallum et al. (1998) for comparison purposes; a sketch of this kind of setup follows.
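As an illustration only, a comparable naive Bayes text pipeline in scikit-learn. The stoplist, the 10000-word vocabulary, and selection by Information Gain (approximated here by mutual information) mirror the setup described, but this is not the talk's implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    CountVectorizer(stop_words="english"),      # standard stoplist, no stemming
    SelectKBest(mutual_info_classif, k=10000),  # ~10000-word vocabulary
    MultinomialNB(),
)
# pipeline.fit(train_docs, train_labels)
# pipeline.score(test_docs, test_labels)
```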
Results
Industry Sector dataset:

Classifier:   Naive Bayes   Shrinkage(1)   ME(2)   ME w/ Prior(3)   ECOC 63-bit
Accuracy:     66.1%         76%            79%     81.1%            88.5%

ECOC reduces the error of the naive Bayes classifier by 66%.
(1) McCallum et al. (1998); (2), (3) Nigam et al. (1999)
The Longer the Better!

Classifier:    Naive Bayes   15-bit ECOC   31-bit ECOC   63-bit ECOC
Accuracy (%):  65.3          77.4          83.6          88.1

Table 2: Average classification accuracy on 5 random 50-50 train-test splits of the Industry Sector dataset, with a vocabulary size of 10000 words selected using Information Gain.
Longer codes mean larger codeword separation.
The minimum Hamming distance of a code C is the smallest distance between any pair of distinct codewords in C.
If the minimum Hamming distance is h, then the code can correct up to floor((h-1)/2) errors; see the sketch below.
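A small sketch (not from the talk) that computes a code's minimum Hamming distance and the resulting error-correction capability, using the toy 4 x 5 code from earlier:

```python
from itertools import combinations

def min_hamming_distance(code):
    # Smallest Hamming distance over all pairs of distinct codewords.
    return min(sum(a != b for a, b in zip(r1, r2))
               for r1, r2 in combinations(code, 2))

code = [(0, 0, 1, 1, 0), (1, 0, 1, 0, 0), (0, 1, 1, 1, 0), (0, 1, 0, 0, 1)]
h = min_hamming_distance(code)
# h == 1 for this toy code (rows A and C differ in a single bit), so it
# corrects floor((1-1)/2) = 0 errors; by contrast the 63-bit BCH code in
# the table below has h = 31 and corrects 15 errors.
print(h, (h - 1) // 2)
```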
Size Matters?
[Figure: variation of accuracy (40-100%) with training size per class, for SBC and 15-bit, 31-bit, and 63-bit ECOC.]
Size does NOT matter!
[Figure: percent decrease in error (30-70%) vs. training size, for 15-bit, 31-bit, and 63-bit codes.]
Semi-Theoretical Model
Model ECOC by a binomial distribution B(n, p):
n = length of the code
p = probability of each bit being classified correctly (P_ave below; the table is consistent with this reading)

# of Bits   Hmin   Emax   Pave   Accuracy
15          5      2      .85    .59
15          5      2      .89    .80
15          5      2      .91    .84
31          11     5      .85    .67
31          11     5      .89    .91
31          11     5      .91    .94
63          31     15     .89    .99

$$\text{Accuracy} = \sum_{i=0}^{E_{\max}} \binom{n}{i}\, p_{ave}^{\,n-i}\,(1-p_{ave})^{i}$$
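The model is easy to evaluate numerically; this sketch reproduces two of the table's entries (up to rounding) under the stated independence assumption:

```python
from math import comb

def ecoc_model_accuracy(n, e_max, p_ave):
    # Probability that at most e_max of the n independent bits are wrong,
    # i.e. that the code can still correct the prediction.
    return sum(comb(n, i) * p_ave ** (n - i) * (1 - p_ave) ** i
               for i in range(e_max + 1))

print(round(ecoc_model_accuracy(15, 2, 0.85), 2))  # ~0.60, cf. .59 in the table
print(round(ecoc_model_accuracy(31, 5, 0.85), 2))  # ~0.68, cf. .67 in the table
```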
Theoretical vs. Experimental Accuracy (vocabulary size = 10000)
[Figure: theoretical and experimental accuracy (0-100%) for code lengths 15, 15, 15, 31, 31, 31, 63.]
[Figure: binary-problem accuracies for different two-vs-two partitions of four newsgroups (Alt.atheism, Talk.misc.religion, Comp.os.windows, Comp.sys.ibm.hardware): 99%, 73%, 68%, 81%, 86%, 87%. Some partitions yield much easier binary problems than others.]
Types of Codes
Data-independent: algebraic, random, hand-constructed
Data-dependent: adaptive
What is a Good Code?
Row separation
Column separation (independence of errors for each binary classifier)
Efficiency (for long codes)
A sketch for checking the two separation criteria follows.
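A sketch (not from the talk) that measures both separations on a candidate code matrix; a column and its complement define the same binary problem, so column distance is taken up to complementation:

```python
from itertools import combinations

def hd(a, b):
    return sum(x != y for x, y in zip(a, b))

def separations(code):
    rows = [tuple(r) for r in code]
    cols = [tuple(c) for c in zip(*code)]
    # Row separation: minimum Hamming distance between class codewords.
    row_sep = min(hd(a, b) for a, b in combinations(rows, 2))
    # Column separation: distance between binary problems, counting a
    # column and its complement as the same problem.
    col_sep = min(min(hd(a, b), len(a) - hd(a, b))
                  for a, b in combinations(cols, 2))
    return row_sep, col_sep
```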
Choosing Codes

Criterion    Random                      Algebraic
Row sep.     On average, for long codes  Guaranteed
Col. sep.    On average, for long codes  Can be guaranteed
Efficiency   No                          Yes
Experimental Results

Code            Min Row HD   Max Row HD   Min Col HD   Max Col HD   Error Rate
15-bit BCH      5            15           49           64           20.6%
19-bit Hybrid   5            18           15           69           22.3%
15-bit Random   2 (1.5)      13           42           60           24.1%
Drawbacks
Can be computationally expensive.
Random codes throw away the real-world nature of the data by picking random partitions to create artificial binary problems.
Current Work
Combine ECOC with co-training to use unlabeled data.
Automatically construct optimal / adaptive codes.
Conclusion
Performs well on text classification tasks.
Can be used when training data is sparse.
Algebraic codes perform better than random codes for a given code length.
Hand-constructed codes may not be the answer.
Background
Co-training seems to be the way to go when there is (and maybe even when there isn't) a feature split in the data.
Reported results on co-training only deal with very small (toy) problems – mostly binary classification tasks (Blum & Mitchell, 1998; Nigam & Ghani, 2000).
Co-Training Challenge
Task: apply co-training to a 65-class dataset containing 130,000 training examples.
Result: co-training fails!
Solution?
ECOC seems to work well when there are a large number of classes.
ECOC decomposes a multiclass problem into several binary problems.
Co-training works well with binary problems.
Combine ECOC and Co-Training
Algorithm: learn each bit of the ECOC code using a co-trained classifier, as in the sketch below.
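A rough sketch of the hybrid, under the assumption that each code bit is learned by a co-trained classifier over the two feature views (e.g. job title vs. description). The `cotrain` argument is a hypothetical stand-in for any co-training routine, not the talk's exact algorithm.

```python
def train_ecoc_cotrain(X_view1, X_view2, y, X1_unlab, X2_unlab, code, cotrain):
    """Learn one co-trained binary classifier per bit of the ECOC code.

    cotrain: a routine (hypothetical here) that takes labeled data in two
    feature views plus unlabeled data in both views and returns a fitted
    binary classifier.
    """
    classifiers = []
    for bit in range(code.shape[1]):
        # Relabel the labeled examples according to this bit of the code,
        # then let the two views bootstrap each other on unlabeled data.
        binary_labels = code[y, bit]
        classifiers.append(
            cotrain(X_view1, X_view2, binary_labels, X1_unlab, X2_unlab))
    return classifiers
```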
Dataset (Job Descriptions)
65 classes, 32000 examples
Two feature sets: Title, Description
Class Distribution
[Figure: percentage of examples per class, ranging from 0 to about 12%.]
Results
10% train, 50% unlabeled, 40% test:

Method         Accuracy
NB             40.3%
ECOC           48.9%
EM             30.83%
CoTraining     -
ECOC-EM        -
ECOC-Cotrain   -
ECOC-CoEM      -