1 Feature selection with conditional mutual information maximin in text categorization CIKM2004.
1
Feature selection with conditional mutual information maximin in text categorization
CIKM2004
2
Abstract
• Feature selection
  – Advantages
    • Increases a classifier's computational speed
    • Reduces the overfitting problem
  – Drawback: most methods do not consider the mutual relationships among the features
    • One feature's predictive power is weakened by others
    • The selected features tend to bias towards major categories
  – Contribution: CMIM (Conditional Mutual Information Maximin)
    • Selects a set of individually discriminating and weakly dependent features
3
Information Theory Review
• Assumptions
  – Discrete random variables X and Y
  – 1-of-n classification problem
• Entropy: $H(X) = -\sum_{x} p(x) \log p(x)$
  – quantifies how many bits are required, on average, to encode or describe the random variable X.
• Mutual information (MI): $I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$
  – quantifies how much information is shared between X and Y.
  – widely used as a feature selection criterion, $I(C; F)$.
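As an illustration of these two quantities, here is a minimal Python sketch (not from the paper; the function names and the toy co-occurrence counts are my own) that estimates entropy and mutual information from empirical counts of two discrete variables, e.g. a binary term-presence feature F against a category C.

```python
import numpy as np

def entropy(p):
    """Entropy H(X) = -sum_x p(x) log2 p(x) of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ].

    `joint` is a 2-D array of co-occurrence counts for X (rows) and Y (columns).
    """
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask]))

# Toy example: feature F (rows) vs. category C (columns) co-occurrence counts.
counts = np.array([[30, 10],
                   [ 5, 55]])
print(entropy(counts.sum(axis=1) / counts.sum()))  # H(F)
print(mutual_information(counts))                  # I(F; C)
```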
4
Information Theory Review
• Select a small number of features that can carry as much information as possible
  – H. Yang (1999): directly estimating the joint probability suffers from the curse of high dimensionality.
  – Assume that all random variables are discrete, and each of them may take one of M different values.
  – Maximize the joint MI (JMI), $I(F_1, \ldots, F_k; C)$, for k features.
• It can be shown that
  $$I(F_1, \ldots, F_k; C) = I(F_1, \ldots, F_{k-1}; C) + I(F_k; C \mid F_1, \ldots, F_{k-1}) \qquad (1)$$
  in which $I(F_k; C \mid F_1, \ldots, F_{k-1})$ is the CMI, which quantifies the information shared between $F_k$ and C given the features $F_1, \ldots, F_{k-1}$.
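A hedged sketch of Eq. (1): the helpers below (my own names, not the paper's implementation) estimate MI and conditional MI from samples of discrete variables, then check the chain-rule decomposition numerically for k = 2.

```python
import numpy as np
from collections import Counter

def mi_from_samples(xs, ys):
    """Empirical I(X;Y) in bits from paired samples of discrete variables."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * np.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def cmi_from_samples(xs, ys, zs):
    """Empirical I(X;Y|Z) = sum_z p(z) * I(X;Y | Z=z)."""
    n = len(xs)
    total = 0.0
    for z in set(zs):
        idx = [i for i in range(n) if zs[i] == z]
        total += len(idx) / n * mi_from_samples([xs[i] for i in idx],
                                                [ys[i] for i in idx])
    return total

# Chain-rule check of Eq. (1) with k = 2: I(F1, F2; C) = I(F1; C) + I(F2; C | F1)
rng = np.random.default_rng(0)
f1 = rng.integers(0, 2, 500)
f2 = f1 ^ rng.integers(0, 2, 500)            # partially redundant with f1
c  = f1 | f2
joint = [tuple(p) for p in zip(f1, f2)]
lhs = mi_from_samples(joint, list(c))
rhs = mi_from_samples(list(f1), list(c)) + cmi_from_samples(list(f2), list(c), list(f1))
print(round(lhs, 6), round(rhs, 6))          # the two sides agree
```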
5
Information Theory Review
• $I(F_k; C \mid F_1, \ldots, F_{k-1}) \ge 0 \;\Rightarrow\; I(F_1, \ldots, F_k; C) \ge I(F_1, \ldots, F_{k-1}; C)$
  – which suggests that adding a feature $F_k$ will never decrease the mutual information.
• $I(F_1, \ldots, F_k; C) = I(F_1, \ldots, F_{k-1}; C) + I(F_k; C \mid F_1, \ldots, F_{k-1})$
  – Approach
    • Current: the k-1 already selected features maximize the JMI.
    • Next: the feature that maximizes the CMI should be added to the feature set, which ensures the JMI of the k features is maximized (see the sketch after this list).
  – Benefit
    • Features can be selected one by one into the feature set through an iterative process.
    • In the beginning, the feature that maximizes the MI is selected first.
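Below is a minimal sketch of this greedy loop, assuming an exact conditional MI estimator is available (it reuses mi_from_samples and cmi_from_samples from the previous sketch; names and the dict-of-features representation are my own, not the paper's).

```python
def greedy_cmi_selection(features, c, k):
    """Greedy forward selection: first pick argmax I(F; C), then repeatedly
    add the feature maximizing I(F; C | already-selected features).
    `features` is a dict {name: list of discrete values}; `c` is the class labels."""
    selected = []
    while len(selected) < k:
        if not selected:
            score = lambda name: mi_from_samples(features[name], c)
        else:
            # Treat the tuple of selected feature values as one conditioning variable.
            z = list(zip(*(features[s] for s in selected)))
            score = lambda name: cmi_from_samples(features[name], c, z)
        best = max((f for f in features if f not in selected), key=score)
        selected.append(best)
    return selected
```

Conditioning on the full tuple of already selected features is exactly where the curse of dimensionality bites, which motivates the CMIM approximation on the following slides.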
6
CMIM Algorithm
• CMIM Algorithm
  – Deals with the computational problem when the dimension is high.
  – Because more information reduces uncertainty, $I(F^*; C \mid F_1, \ldots, F_k)$ is certain to be smaller than any CMI conditioned on fewer features, $I(F^*; C \mid F_i, \ldots, F_j)$.
  – Therefore, we estimate $I(F^*; C \mid F_1, \ldots, F_k)$ by the minimum of these values, i.e.,
    $$I(F^*; C \mid F_1, \ldots, F_k) \approx \min I(F^*; C \mid F_i, \ldots, F_j) \qquad (2)$$
7
CMIM Algorithm
• Use the triplet form $I(F^*; C \mid F_i)$:
  $$I(F^*; C \mid F_1, \ldots, F_k) \approx \min_{i} I(F^*; C \mid F_i) \qquad (3)$$
• Select a feature F by
  $$\max_{F} \min_{i} I(F; C \mid F_i)$$
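A minimal sketch of this max-min selection rule, again reusing mi_from_samples and cmi_from_samples from the earlier sketch; the function and variable names are assumptions for illustration, not the paper's code.

```python
def cmim_selection(features, c, k):
    """CMIM: score each candidate F by min over already-selected F_i of I(F; C | F_i)
    (for the first pick this reduces to I(F; C)), then take the max-min candidate."""
    selected = []
    while len(selected) < k:
        def score(name):
            if not selected:
                return mi_from_samples(features[name], c)
            return min(cmi_from_samples(features[name], c, features[s]) for s in selected)
        best = max((f for f in features if f not in selected), key=score)
        selected.append(best)
    return selected

# Usage with the toy variables from the chain-rule check above:
# feats = {"w1": list(f1), "w2": list(f2), "w3": list(rng.integers(0, 2, 500))}
# print(cmim_selection(feats, list(c), 2))
```

Only pairwise conditional MIs are ever estimated, which is what keeps the computation tractable compared with conditioning on all previously selected features.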
8
Experiment
9
Experiment
10
Conclusion and Future Work
• Presents a CMI method and a CMIM algorithm to select features
  – that are both individually discriminating and only weakly dependent on the features already selected.
• The experiments show that both micro-averaged and macro-averaged classification performance improves with this feature selection method, especially when the feature set is small and the number of categories is large.
11
Conclusion and Future Work
• CMIM's drawbacks
  – Cannot deal with integer-valued or continuous features.
  – Ignores dependencies among families of three or more features.
  – Although CMIM has greatly relieved the computational overhead,
    the complexity $O(NV^3)$ is still not very attractive.
• Future work
  – Decrease the complexity of CMIM.
  – Consider parametric density models to deal with continuous features, and investigate other conditional models to efficiently formulate features' mutual relationships.