1 Feature selection with conditional mutual information maximin in text categorization CIKM2004.
1
Feature selection with conditional mutual information maximin in text categorization
CIKM2004
2
Abstract
• Feature selection
  – Advantages
    • Increases a classifier's computational speed
    • Reduces the overfitting problem
  – Drawback: most methods do not consider the mutual relationships among the features
    • One feature's predictive power is weakened by others
    • The selected features tend to bias towards major categories
  – Contribution: CMIM (Conditional Mutual Information Maximin)
    • Selects a set of individually discriminating and weakly dependent features
3
Information Theory Review
• Assumptions
  – Discrete random variables X and Y
  – 1-of-n classification problem
• Entropy: $H(X) = -\sum_{x} p(x) \log p(x)$
  – quantifies how many bits are required, on average, to encode or describe the random variable X.
• Mutual information (MI): $I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$
  – quantifies how much information is shared between X and Y.
  – widely used as a feature selection criterion, $I(C; F)$.
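As an illustration of these two quantities, here is a minimal Python sketch (not from the paper; the function names and the toy co-occurrence counts are my own) that estimates entropy and mutual information from empirical counts of two discrete variables, e.g. a binary term-presence feature F against a category C.

```python
import numpy as np

def entropy(p):
    """Entropy H(X) = -sum_x p(x) log2 p(x) of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ].

    `joint` is a 2-D array of co-occurrence counts for X (rows) and Y (columns).
    """
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask]))

# Toy example: feature F (rows) vs. category C (columns) co-occurrence counts.
counts = np.array([[30, 10],
                   [ 5, 55]])
print(entropy(counts.sum(axis=1) / counts.sum()))  # H(F)
print(mutual_information(counts))                  # I(F; C)
```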
4
Information Theory Review
• Select a small number of features that can carry as much information as possible
  – H. Yang (1999): directly estimating the joint probability suffers from the curse of high dimensionality.
  – Assume that all random variables are discrete, and each of them may take one of M different values.
  – Maximize the joint MI (JMI), $I(F_1, \ldots, F_k; C)$, for k features.
• It can be shown that
  $$I(F_1, \ldots, F_k; C) = I(F_1, \ldots, F_{k-1}; C) + I(F_k; C \mid F_1, \ldots, F_{k-1}) \qquad (1)$$
  in which $I(F_k; C \mid F_1, \ldots, F_{k-1})$ is the CMI, which quantifies the information shared between $F_k$ and C given the features $F_1, \ldots, F_{k-1}$.
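A hedged sketch of Eq. (1): the helpers below (my own names, not the paper's implementation) estimate MI and conditional MI from samples of discrete variables, then check the chain-rule decomposition numerically for k = 2.

```python
import numpy as np
from collections import Counter

def mi_from_samples(xs, ys):
    """Empirical I(X;Y) in bits from paired samples of discrete variables."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * np.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def cmi_from_samples(xs, ys, zs):
    """Empirical I(X;Y|Z) = sum_z p(z) * I(X;Y | Z=z)."""
    n = len(xs)
    total = 0.0
    for z in set(zs):
        idx = [i for i in range(n) if zs[i] == z]
        total += len(idx) / n * mi_from_samples([xs[i] for i in idx],
                                                [ys[i] for i in idx])
    return total

# Chain-rule check of Eq. (1) with k = 2: I(F1, F2; C) = I(F1; C) + I(F2; C | F1)
rng = np.random.default_rng(0)
f1 = rng.integers(0, 2, 500)
f2 = f1 ^ rng.integers(0, 2, 500)            # partially redundant with f1
c  = f1 | f2
joint = [tuple(p) for p in zip(f1, f2)]
lhs = mi_from_samples(joint, list(c))
rhs = mi_from_samples(list(f1), list(c)) + cmi_from_samples(list(f2), list(c), list(f1))
print(round(lhs, 6), round(rhs, 6))          # the two sides agree
```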
5
Information Theory Review
• $I(F_k; C \mid F_1, \ldots, F_{k-1}) \ge 0 \;\Rightarrow\; I(F_1, \ldots, F_k; C) \ge I(F_1, \ldots, F_{k-1}; C)$
  – which suggests that adding a feature $F_k$ will never decrease the mutual information.
• $I(F_1, \ldots, F_k; C) = I(F_1, \ldots, F_{k-1}; C) + I(F_k; C \mid F_1, \ldots, F_{k-1})$
  – Approach
    • Current: the k-1 already selected features maximize the JMI.
    • Next: the feature that maximizes the CMI should be added to the feature set, which ensures the JMI of the k features is maximized (see the sketch after this list).
  – Benefit
    • Features can be selected one by one into the feature set through an iterative process.
    • In the beginning, the feature that maximizes the MI is selected first.
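Below is a minimal sketch of this greedy loop, assuming an exact conditional MI estimator is available (it reuses mi_from_samples and cmi_from_samples from the previous sketch; names and the dict-of-features representation are my own, not the paper's).

```python
def greedy_cmi_selection(features, c, k):
    """Greedy forward selection: first pick argmax I(F; C), then repeatedly
    add the feature maximizing I(F; C | already-selected features).
    `features` is a dict {name: list of discrete values}; `c` is the class labels."""
    selected = []
    while len(selected) < k:
        if not selected:
            score = lambda name: mi_from_samples(features[name], c)
        else:
            # Treat the tuple of selected feature values as one conditioning variable.
            z = list(zip(*(features[s] for s in selected)))
            score = lambda name: cmi_from_samples(features[name], c, z)
        best = max((f for f in features if f not in selected), key=score)
        selected.append(best)
    return selected
```

Conditioning on the full tuple of already selected features is exactly where the curse of dimensionality bites, which motivates the CMIM approximation on the following slides.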
6
CMIM Algorithm
• CMIM Algorithm
  – Deals with the computational problem when the dimension is high.
  – Because more information reduces uncertainty, $I(F^*; C \mid F_1, \ldots, F_k)$ is certain to be smaller than any CMI conditioned on fewer features, $I(F^*; C \mid F_i, \ldots, F_j)$.
  – Therefore, we estimate $I(F^*; C \mid F_1, \ldots, F_k)$ by the minimum of these values, i.e.,
    $$I(F^*; C \mid F_1, \ldots, F_k) \approx \min I(F^*; C \mid F_i, \ldots, F_j) \qquad (2)$$
7
CMIM Algorithm
• Use the triplet form $I(F^*; C \mid F_i)$:
  $$I(F^*; C \mid F_1, \ldots, F_k) \approx \min_{i} I(F^*; C \mid F_i) \qquad (3)$$
• Select a feature F by
  $$\max_{F} \min_{i} I(F; C \mid F_i)$$
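A minimal sketch of this max-min selection rule, again reusing mi_from_samples and cmi_from_samples from the earlier sketch; the function and variable names are assumptions for illustration, not the paper's code.

```python
def cmim_selection(features, c, k):
    """CMIM: score each candidate F by min over already-selected F_i of I(F; C | F_i)
    (for the first pick this reduces to I(F; C)), then take the max-min candidate."""
    selected = []
    while len(selected) < k:
        def score(name):
            if not selected:
                return mi_from_samples(features[name], c)
            return min(cmi_from_samples(features[name], c, features[s]) for s in selected)
        best = max((f for f in features if f not in selected), key=score)
        selected.append(best)
    return selected

# Usage with the toy variables from the chain-rule check above:
# feats = {"w1": list(f1), "w2": list(f2), "w3": list(rng.integers(0, 2, 500))}
# print(cmim_selection(feats, list(c), 2))
```

Only pairwise conditional MIs are ever estimated, which is what keeps the computation tractable compared with conditioning on all previously selected features.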
8
Experiment
9
Experiment
10
Conclusion and Future Work
• Presents a CMI method and a CMIM algorithm to select features
  – that are both individually discriminating and only weakly dependent on the features already selected.
• The experiments show that both micro-averaged and macro-averaged classification performance improves with this feature selection method, especially when the feature set is small and the number of categories is large.
11
Conclusion and Future Work
• CMIM's drawbacks
  – Cannot deal with integer-valued or continuous features.
  – Ignores dependencies among families of three or more features.
  – Although CMIM has greatly relieved the computational overhead,
    the complexity $O(NV^3)$ is still not very attractive.
• Future work
  – Decrease the complexity of CMIM.
  – Consider parametric density models to deal with continuous features, and investigate other conditional models to efficiently formulate features' mutual relationships.