# Robust Feature Selection by Mutual Information Distributions

Transcript of the slides "Robust Feature Selection by Mutual Information Distributions".
## Robust Feature Selection by Mutual Information Distributions

Marco Zaffalon & Marcus Hutter
IDSIA, Galleria 2, 6928 Manno (Lugano), Switzerland
www.idsia.ch/~{zaffalon,marcus}
{zaffalon,marcus}@idsia.ch
## Mutual Information (MI)

Consider two discrete random variables taking values $i = 1, \ldots, r$ and $j = 1, \ldots, s$:

- $\pi_{ij}$ = joint chance of the pair $(i, j)$
- $\pi_{i+} = \sum_j \pi_{ij}$ = marginal chance of $i$; $\pi_{+j} = \sum_i \pi_{ij}$ = marginal chance of $j$

(In)dependence is often measured by MI:

$$I(\pi) = \sum_{ij} \pi_{ij} \log \frac{\pi_{ij}}{\pi_{i+}\pi_{+j}} \;\ge\; 0$$

- Also known as cross-entropy or information gain
- Examples: inference of Bayesian nets and classification trees; selection of relevant variables for the task at hand
## MI-Based Feature-Selection Filter (F) (Lewis, 1992)

Classification
- Predicting the class value given values of features
- Features (or attributes) and class = random variables
- Learning the rule 'features → class' from data

Filter's goal: removing irrelevant features
- More accurate predictions, simpler models

MI-based approach
- Remove a feature if the class does not depend on it: $I(\pi) = 0$
- Or: remove it if $I(\pi) < \varepsilon$, where $\varepsilon$ is an arbitrary threshold of relevance
## Empirical Mutual Information (a common way to use MI in practice)

Data ($n$) → contingency table:

| $j \backslash i$ | 1 | 2 | … | $r$ |
|---|---|---|---|---|
| 1 | $n_{11}$ | $n_{12}$ | … | $n_{1r}$ |
| 2 | $n_{21}$ | $n_{22}$ | … | $n_{2r}$ |
| ⋮ | | | | |
| $s$ | $n_{s1}$ | $n_{s2}$ | … | $n_{sr}$ |

- $n_{ij}$ = # of times $(i, j)$ occurred
- $n_{i+} = \sum_j n_{ij}$ = # of times $i$ occurred; $n_{+j} = \sum_i n_{ij}$ = # of times $j$ occurred
- $n = \sum_{ij} n_{ij}$ = dataset size
- Empirical (sample) probability: $\hat{\pi}_{ij} = n_{ij}/n$
- Empirical mutual information: $I(\hat{\pi})$

Problems of the empirical approach
- Is $I(\hat{\pi}) > 0$ due to random fluctuations? (finite sample)
- How to know if it is reliable, e.g. by $P(I > \varepsilon \mid n)$?
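As a concrete illustration (not from the slides), the empirical MI above can be computed directly from the contingency table. A minimal Python sketch, working in nats:

```python
import math

def empirical_mi(counts):
    """Empirical mutual information I(pi_hat), in nats, from a contingency
    table where counts[i][j] = n_ij = number of times (i, j) occurred."""
    n = sum(sum(row) for row in counts)                  # dataset size
    ni = [sum(row) for row in counts]                    # n_{i+}
    nj = [sum(row[j] for row in counts) for j in range(len(counts[0]))]  # n_{+j}
    mi = 0.0
    for i, row in enumerate(counts):
        for j, nij in enumerate(row):
            if nij > 0:  # empty cells contribute nothing
                # pi_hat_ij * log(pi_hat_ij / (pi_hat_{i+} * pi_hat_{+j}))
                mi += (nij / n) * math.log(nij * n / (ni[i] * nj[j]))
    return mi
```

For a table whose rows are proportional (independent variables) this returns 0, which is exactly the finite-sample reliability problem the slide raises: a small positive value may be pure fluctuation.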
## We Need the Distribution of MI

Bayesian approach
- Prior distribution $p(\pi)$ for the unknown chances $\pi$ (e.g., Dirichlet)
- Posterior: $p(\pi \mid n) \propto p(\pi) \prod_{ij} \pi_{ij}^{n_{ij}}$

Posterior probability density of MI:

$$p(I \mid n) = \int \delta\big(I(\pi) - I\big)\, p(\pi \mid n)\, d\pi$$

How to compute it?
- Fit a curve by the exact mean and an approximate variance
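The density $p(I \mid n)$ can also be explored by simple Monte Carlo: draw chance matrices from the Dirichlet posterior and evaluate $I(\pi)$ on each draw. A sketch, assuming a uniform Dirichlet prior (all parameters 1, so the posterior parameters are $n_{ij} + 1$); the slides do not fix a particular prior:

```python
import math
import random

def mi(pi):
    """Mutual information I(pi) of a joint chance matrix pi."""
    ri = [sum(row) for row in pi]                                # pi_{i+}
    cj = [sum(row[j] for row in pi) for j in range(len(pi[0]))]  # pi_{+j}
    return sum(p * math.log(p / (ri[i] * cj[j]))
               for i, row in enumerate(pi)
               for j, p in enumerate(row) if p > 0)

def sample_mi_posterior(counts, draws=500, seed=0):
    """Monte Carlo draws from p(I | n): sample pi ~ Dirichlet(n_ij + 1)
    (uniform prior, an illustrative choice) and evaluate I(pi)."""
    rng = random.Random(seed)
    out = []
    for _ in range(draws):
        # A Dirichlet draw is a vector of Gamma variates normalised to sum 1.
        g = [[rng.gammavariate(nij + 1, 1.0) for nij in row] for row in counts]
        tot = sum(sum(row) for row in g)
        out.append(mi([[x / tot for x in row] for row in g]))
    return out
```

A histogram of these draws approximates $p(I \mid n)$; the curve-fitting approach on the next slide avoids the sampling cost entirely.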
## Mean and Variance of MI (Hutter, 2001; Wolpert & Wolf, 1995)

Exact mean (with $\psi$ the digamma function):

$$E[I] = \frac{1}{n} \sum_{ij} n_{ij} \left[ \psi(n_{ij} + 1) - \psi(n_{i+} + 1) - \psi(n_{+j} + 1) + \psi(n + 1) \right]$$

Leading-order term for the variance (the next-to-leading-order (NLO) correction, of order $n^{-2}$, is given in the paper):

$$\mathrm{VAR}[I] = \frac{K - J^2}{n} + O(n^{-2}), \qquad
J = \sum_{ij} \frac{n_{ij}}{n} \log \frac{n_{ij}\, n}{n_{i+} n_{+j}}, \qquad
K = \sum_{ij} \frac{n_{ij}}{n} \left( \log \frac{n_{ij}\, n}{n_{i+} n_{+j}} \right)^2$$

Computational complexity $O(rs)$
- As fast as empirical MI
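These expressions are straightforward to implement. A sketch, using the identity $\psi(m) = -\gamma + H_{m-1}$ for positive-integer arguments (valid here because all counts are integers) and only the leading-order variance term; the harmonic sums are $O(n)$ per call for clarity, whereas the $O(rs)$ bound on the slide assumes precomputed digamma values:

```python
import math

EULER_GAMMA = 0.5772156649015329

def psi_int(m):
    """Digamma at a positive integer: psi(m) = -gamma + H_{m-1}."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, m))

def mi_mean_var(counts):
    """Exact posterior mean of MI and its leading-order variance."""
    n = sum(sum(row) for row in counts)
    ni = [sum(row) for row in counts]                    # n_{i+}
    nj = [sum(row[j] for row in counts) for j in range(len(counts[0]))]  # n_{+j}
    mean = 0.0
    J = K = 0.0
    for i, row in enumerate(counts):
        for j, nij in enumerate(row):
            if nij == 0:
                continue  # a zero count contributes nothing to either sum
            mean += nij * (psi_int(nij + 1) - psi_int(ni[i] + 1)
                           - psi_int(nj[j] + 1) + psi_int(n + 1))
            t = math.log(nij * n / (ni[i] * nj[j]))
            J += (nij / n) * t
            K += (nij / n) * t * t
    return mean / n, (K - J * J) / n
```

For large $n$ the mean approaches the empirical MI $J = I(\hat{\pi})$, and the variance shrinks as $1/n$, as expected.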
## MI Density Example Graphs

[Figure: "Distribution of Mutual Information for Dirichlet Priors". The density $p(I \mid n)$ plotted against $I$ for three contingency tables, $n = [(8,2),(4,16)]$, $[(20,5),(10,40)]$, and $[(40,10),(20,80)]$, comparing the exact density with Gauss, Gamma, and Beta approximations.]
## Robust Feature Selection

Filters: two new proposals
- FF: include a feature iff $P(I > \varepsilon \mid n) > 0.95$ (include iff "proven" relevant)
- BF: exclude a feature iff $P(I < \varepsilon \mid n) > 0.95$ (exclude iff "proven" irrelevant)

Examples (position of the posterior density of $I$ relative to the threshold $\varepsilon$):
- density mostly above $\varepsilon$: FF includes, BF includes
- density straddling $\varepsilon$: FF excludes, BF includes
- density mostly below $\varepsilon$: FF excludes, BF excludes
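A hypothetical FF/BF decision rule can be sketched by fitting a Gaussian to $p(I \mid n)$ (one of the curve fits shown earlier) from the leading-order mean and variance; $\varepsilon = 0.02$, the 0.95 level, and the Gaussian fit itself are illustrative choices, not the slides' exact procedure:

```python
import math

def ff_bf(counts, eps=0.02, level=0.95):
    """FF/BF decisions from a Gaussian fit to p(I | n), using the
    leading-order mean (empirical MI, J) and variance (K - J^2)/n."""
    n = sum(sum(row) for row in counts)
    ni = [sum(row) for row in counts]                    # n_{i+}
    nj = [sum(row[j] for row in counts) for j in range(len(counts[0]))]  # n_{+j}
    J = K = 0.0
    for i, row in enumerate(counts):
        for j, nij in enumerate(row):
            if nij > 0:
                t = math.log(nij * n / (ni[i] * nj[j]))
                J += (nij / n) * t
                K += (nij / n) * t * t
    sd = math.sqrt(max((K - J * J) / n, 1e-12))
    # P(I > eps | n) under the Gaussian fit N(J, sd^2)
    p_above = 0.5 * (1.0 + math.erf((J - eps) / (sd * math.sqrt(2))))
    return {"FF_includes": p_above > level,          # "proven" relevant
            "BF_excludes": (1.0 - p_above) > level}  # "proven" irrelevant
```

When the density straddles $\varepsilon$, both flags come back `False`: FF drops the feature while BF keeps it, which is exactly where the two filters disagree.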
## Comparing the Filters

Experimental set-up
- Filter (F, FF, BF) + Naive Bayes classifier
- Sequential learning and testing

Collected measures for each filter
- Average # of correct predictions (prediction accuracy)
- Average # of features used

[Diagram: sequential Naive Bayes classification. Instances up to $k$ form the learning data used by the filter; the test instance $k+1$ is classified and then stored after classification, up to instance $N$.]
## Results on 10 Complete Datasets

Average number of used features:

| # Instances | # Features | Dataset | FF | F | BF |
|---|---|---|---|---|---|
| 690 | 36 | Australian | 32.6 | 34.3 | 35.9 |
| 3196 | 36 | Chess | 12.6 | 18.1 | 26.1 |
| 653 | 15 | Crx | 11.9 | 13.2 | 15.0 |
| 1000 | 17 | German-org | 5.1 | 8.8 | 15.2 |
| 2238 | 23 | Hypothyroid | 4.8 | 8.4 | 17.1 |
| 3200 | 24 | Led24 | 13.6 | 14.0 | 24.0 |
| 148 | 18 | Lymphography | 18.0 | 18.0 | 18.0 |
| 5800 | 8 | Shuttle-small | 7.1 | 7.7 | 8.0 |
| 1101 | 21611 | Spam | 123.1 | 822.0 | 13127.4 |
| 435 | 16 | Vote | 14.0 | 15.2 | 16.0 |

Accuracies NOT significantly different
- Except Chess & Spam with FF
## Results on 10 Complete Datasets (continued)

[Bar chart: percentages of used features for FF, F, and BF on each dataset (Australian, Chess, Crx, German-org, Hypothyroid, Led24, Lymphography, Shuttle-small, Spam, Vote), on a 0%–100% scale.]
## FF: Significantly Better Accuracies

[Plots: prediction accuracy vs. instance number on Chess (FF vs. F, instances 0–3000) and on Spam (F, FF, BF, instances 0–1100); a further panel shows the average number of excluded features on Spam for F, FF, and BF (scale 0–22000).]
## Extension to Incomplete Samples

MAR assumption
- General case (missing features and class): EM + closed-form expressions
- Missing features only: closed-form approximate expressions for mean and variance
- Complexity still $O(rs)$

New experiments
- 5 datasets
- Similar behavior

[Plot: prediction accuracy vs. instance number on Hypothyroidloss (F vs. FF, instances 0–3000, accuracy 0.9–1).]
## Conclusions

Expressions for several moments of the MI distribution are available
- The distribution can be approximated well
- Safer inferences, same computational complexity as empirical MI
- Why not use it?

Robust feature selection shows the power of the MI distribution
- FF outperforms the traditional filter F

Many useful applications possible
- Inference of Bayesian nets
- Inference of classification trees
- …