
OCFS: Optimal Orthogonal Centroid Feature Selection for Text Categorization

Jun Yan, Ning Liu, Benyu Zhang, Shuicheng Yan, Zheng Chen, and Weiguo Fan et al.

Microsoft Research Asia; Peking University; Tsinghua University; Chinese University of Hong Kong; Virginia Polytechnic Institute and State University

Outline

Motivation

Problem formulation

Related works

The OCFS algorithm

Experiments

Conclusion and future works

Motivation

Dimension reduction (DR) is highly desired for web-scale text data.

DR can improve both efficiency and effectiveness.

Feature selection (FS) is more applicable than feature extraction (FE).

Most FS algorithms are greedy.

Goal: a simple, effective, efficient, and optimal FS algorithm.

Problem Formulation

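Dimension reduction: $f: \mathbb{R}^d \to \mathbb{R}^p$ ($p \ll d$).

Consider the linear case: suppose $x_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, n$, and

$$y_i = f(x_i) = W^T x_i \in \mathbb{R}^p,$$

where

$$W \in H^{d \times p} = \{ W \in \mathbb{R}^{d \times p},\ W^T W = I \}.$$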

Problem Formulation

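FS: $y_i = f(x_i) = \tilde{W}^T x_i$

We denote the discrete solution space as $\tilde{H}^{d \times p}$, with $\tilde{W} \in \tilde{H}^{d \times p}$.

The problem is: given a set of labeled training documents $X$, learn a transformation matrix $\tilde{W} \in \tilde{H}^{d \times p}$ such that it is optimal according to some criterion $J(\tilde{W})$ in the space $\tilde{H}^{d \times p}$.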

Related Works – IG

Information gain aims to select a group of optimal features $T = (t_{k_1}, t_{k_2}, \ldots, t_{k_p})$ by:

$$J(\tilde{W}) = IG(T) \quad \text{and} \quad \tilde{W}^* = \arg\max J(\tilde{W})$$

where

$$IG(T) = -\sum_{j=1}^{c} P_r(c_j) \log P_r(c_j) + P_r(T) \sum_{j=1}^{c} P_r(c_j \mid T) \log P_r(c_j \mid T) + P_r(\bar{T}) \sum_{j=1}^{c} P_r(c_j \mid \bar{T}) \log P_r(c_j \mid \bar{T})$$

Finding the global optimum is an NP problem, so the features are computed greedily.
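As a rough illustration of the greedy use of IG, each term can be scored on its own from document counts. The sketch below assumes binary term presence and estimates every probability by relative frequency; the function name `information_gain` and its argument layout are hypothetical and not taken from the slides.

```python
import math


def information_gain(docs_with_term, docs_per_class):
    """IG of one term from per-class document counts.

    docs_with_term[j]: documents of class j that contain the term.
    docs_per_class[j]: all documents of class j.
    Probabilities are relative frequencies (an assumption of this sketch).
    """
    n = sum(docs_per_class)
    n_t = sum(docs_with_term)          # documents containing the term
    n_not_t = n - n_t                  # documents without the term

    def plogp(p):
        # Convention: 0 * log 0 = 0.
        return p * math.log(p) if p > 0 else 0.0

    # -sum_j P(c_j) log P(c_j): the class entropy
    ig = -sum(plogp(nj / n) for nj in docs_per_class)
    # + P(t) sum_j P(c_j | t) log P(c_j | t)
    if n_t > 0:
        ig += (n_t / n) * sum(plogp(a / n_t) for a in docs_with_term)
    # + P(not t) sum_j P(c_j | not t) log P(c_j | not t)
    if n_not_t > 0:
        ig += (n_not_t / n) * sum(
            plogp((nj - a) / n_not_t)
            for a, nj in zip(docs_with_term, docs_per_class)
        )
    return ig
```

A term that appears in every document of one class and in none of the other recovers the full class entropy, while a term spread evenly over the classes scores zero.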

Related Works – CHI

CHI aims to select a group of features $T = (t_{k_1}, t_{k_2}, \ldots, t_{k_p})$ by:

$$J(\tilde{W}) = \chi^2(T) \quad \text{and} \quad \tilde{W}^* = \arg\max J(\tilde{W})$$

where

$$\chi^2(t, c_j) = \frac{n\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$
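The statistic is easy to evaluate from a 2 x 2 contingency table. The sketch below assumes the usual convention for the counts (the slides do not spell it out): A and B count documents containing the term inside and outside the class, C and D count documents without the term inside and outside the class. The function name `chi_square` is hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a term t and a class c_j.

    Contingency counts (the usual convention, assumed here):
      a: documents in c_j that contain t
      b: documents outside c_j that contain t
      c: documents in c_j that do not contain t
      d: documents outside c_j that do not contain t
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no evidence.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom
```

For example, `chi_square(10, 0, 0, 10)` describes a term that perfectly separates two classes of ten documents each, and `chi_square(5, 5, 5, 5)` a term that is independent of the class.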

Orthogonal Centroid Algorithm

Orthogonal centroid (OC) is an FE algorithm.

It is effective for DR in text classification problems.

Its computation is based on QR matrix decomposition.

Theorem: the solution of the OC algorithm equals the solution of the following optimization problem:

$$W^* = \arg\max J(W) = \arg\max \operatorname{trace}(W^T S_b W)$$

where

$$W \in H^{d \times p} = \{ W \in \mathbb{R}^{d \times p},\ W^T W = I \}.$$

Intuition of Our Work

OC from the FE perspective:

$$\arg\max J(W) = \arg\max \operatorname{trace}(W^T S_b W), \quad W \in H^{d \times p} = \{ W \in \mathbb{R}^{d \times p},\ W^T W = I \}$$

where

$$S_b = \sum_{j=1}^{c} \frac{n_j}{n} (m_j - m)(m_j - m)^T$$
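To make the scatter matrix concrete, here is a minimal pure-Python sketch that builds $S_b$ from labeled documents. It assumes $m_j$ is the centroid of class $j$, $m$ the centroid of all documents, and $n_j$ the number of documents in class $j$; the function name `between_class_scatter` is hypothetical. It is for illustration only, since a dense d x d matrix is impractical for large vocabularies.

```python
def between_class_scatter(X, labels):
    """S_b = sum_j (n_j / n) (m_j - m)(m_j - m)^T as a d x d list of lists.

    X: list of n documents, each a list of d feature values.
    labels: list of n class labels.
    """
    n, d = len(X), len(X[0])
    m = [sum(x[k] for x in X) / n for k in range(d)]  # global centroid

    S_b = [[0.0] * d for _ in range(d)]
    for c in set(labels):
        members = [x for x, y in zip(X, labels) if y == c]
        n_j = len(members)
        # Centroid of class c and its offset from the global centroid.
        m_j = [sum(x[k] for x in members) / n_j for k in range(d)]
        diff = [m_j[k] - m[k] for k in range(d)]
        # Accumulate the weighted outer product diff * diff^T.
        for r in range(d):
            for s in range(d):
                S_b[r][s] += (n_j / n) * diff[r] * diff[s]
    return S_b
```

The diagonal entry `S_b[k][k]` is the weighted squared distance between the class centroids and the global centroid along feature `k`.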

Intuition of Our Work

OC from the FS perspective: how to optimize $J$ in the discrete space $\tilde{H}^{d \times p}$?

[Figure: solution by FE vs. solution by FS]

The OCFS Algorithm

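FS problem: suppose we want to preserve the $m$-th and $n$-th features of $x_i \in \mathbb{R}^d$ and discard the others.

$$y_i = f(x_i) = \tilde{W}^T x_i = \begin{pmatrix} 0 & \cdots & 1 & \cdots & 0 \\ 0 & \cdots & 1 & \cdots & 0 \end{pmatrix} \begin{pmatrix} x_{i1} \\ \vdots \\ x_{im} \\ \vdots \\ x_{in} \\ \vdots \\ x_{id} \end{pmatrix} = \begin{pmatrix} x_{im} \\ x_{in} \end{pmatrix}$$

The 1 in the first row is in the $m$-th column and the 1 in the second row is in the $n$-th column.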
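The point of this slide is that multiplying by a 0/1 selection matrix is the same as picking entries of the document vector. A small sketch, with hypothetical helper names `selection_matrix_T` and `apply` and 0-based feature positions:

```python
def selection_matrix_T(indices, d):
    """Build W^T (p x d): row i has a single 1 in column indices[i]."""
    return [[1 if col == k else 0 for col in range(d)] for k in indices]


def apply(W_T, x):
    """Compute y = W^T x for a dense vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W_T]


x = [0.3, 0.0, 1.2, 0.7, 0.0]      # one document with d = 5 features
keep = [2, 3]                       # 0-based positions of the kept features
W_T = selection_matrix_T(keep, len(x))
y = apply(W_T, x)
# y holds exactly the kept entries of x, i.e. [x[k] for k in keep]
```

In practice the matrix never needs to be built: feature selection reduces to indexing, which is why the search can be carried out over index sets rather than over matrices.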

The OCFS Algorithm

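Optimization:

$$\arg\max J(\tilde{W}) = \arg\max \operatorname{trace}(\tilde{W}^T S_b \tilde{W}), \quad \tilde{W} \in \tilde{H}^{d \times p}$$

$$K = \{ k_i,\ 1 \le k_i \le d,\ i = 1, 2, \ldots, p \}$$

$$\tilde{w}_{ki} = \begin{cases} 1 & k = k_i \\ 0 & \text{otherwise} \end{cases}$$

$$\operatorname{trace}(\tilde{W}^T S_b \tilde{W}) = \sum_{i=1}^{p} \tilde{w}_i^T S_b \tilde{w}_i = \sum_{i=1}^{p} \sum_{j=1}^{c} \frac{n_j}{n} \left( m_j^{k_i} - m^{k_i} \right)^2$$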

The OCFS Algorithm

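$$\arg\max_{K} \sum_{i=1}^{p} \sum_{j=1}^{c} \frac{n_j}{n} \left( m_j^{k_i} - m^{k_i} \right)^2$$

Solution (OCFS): select the $p$ largest ones from

$$\sum_{j=1}^{c} \frac{n_j}{n} \left( m_j^{k} - m^{k} \right)^2, \quad k = 1, 2, \ldots, d.$$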

The OCFS Algorithm
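A minimal pure-Python sketch of the selection rule, assuming dense feature vectors, with $m_j$ the centroid of class $j$ and $m$ the centroid of all documents. The function names `ocfs_scores` and `ocfs_select` are hypothetical, and this is an illustration rather than the authors' implementation.

```python
def ocfs_scores(X, labels):
    """Score s(k) = sum_j (n_j / n) * (m_j[k] - m[k])**2 for every feature k."""
    n, d = len(X), len(X[0])
    m = [sum(x[k] for x in X) / n for k in range(d)]      # global centroid
    scores = [0.0] * d
    for c in set(labels):
        members = [x for x, y in zip(X, labels) if y == c]
        n_j = len(members)
        for k in range(d):
            m_jk = sum(x[k] for x in members) / n_j        # class centroid entry
            scores[k] += (n_j / n) * (m_jk - m[k]) ** 2
    return scores


def ocfs_select(X, labels, p):
    """Return the indices of the p highest-scoring features, best first."""
    scores = ocfs_scores(X, labels)
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return ranked[:p]
```

Because every feature is scored independently, taking the top `p` scores maximizes the sum exactly, with no greedy search involved.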

Algorithm Analysis

The number of selected features is

$$p^* = \arg\min E(p) \quad \text{subject to} \quad E(p) \ge T,$$

where the energy function is

$$E(p) = \frac{\sum_{j=1}^{p} s(k_j)}{\sum_{i=1}^{d} s(i)}.$$
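The rule can be sketched as a running sum over the sorted scores. This assumes $s(k)$ is the per-feature score, $k_1, \ldots, k_p$ index the $p$ largest scores, and $T$ is a user-chosen threshold in (0, 1]; the function name `choose_num_features` is hypothetical.

```python
def choose_num_features(scores, threshold):
    """Smallest p whose top-p scores hold at least `threshold` of the total.

    scores: per-feature scores s(k), in any order.
    threshold: target energy T in (0, 1].
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    running = 0.0
    for p, s in enumerate(sorted(scores, reverse=True), start=1):
        running += s
        if running / total >= threshold:   # E(p) >= T
            return p
    return len(scores)
```

With scores `[4, 3, 2, 1]`, a threshold of 0.5 is first met by the top two features, since their energy is 7/10.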

Algorithm Analysis

Complexity: the time complexity is O(cd).

OCFS only computes a simple square function, rather than a more expensive function such as the logarithm used by IG.

Experiments Setup

Datasets:

20 Newsgroups (5-class; 5,000-data; 131,072-d)

Reuters Corpus Volume 1 (4-class; 800,000-data; 500,000-d)

Open Directory Project (13-class)

Baseline: IG & CHI

Classifier: SVM (SMO)

Performance measurement: CPU runtime and

$$\text{Micro } F_1 = \frac{2PR}{P + R}$$
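For reference, micro-averaged F1 pools the counts over all classes before computing precision $P$ and recall $R$. The sketch below assumes predictions are given as one set of labels per document; the function name `micro_f1` is hypothetical and the slides do not describe the evaluation code.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label predictions.

    y_true, y_pred: lists of sets of class labels, one set per document.
    Counts are pooled over all classes before computing P and R.
    """
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))   # correct labels
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))   # spurious labels
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))   # missed labels
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Pooling means that frequent classes dominate the score, which is the usual reason to report micro F1 on skewed text collections.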

Experimental Results – 20NG

[Figure: 20NG results. Panels: F1; CPU runtime]

Experimental Results – RCV1

[Figure: RCV1 results. Panels: F1; CPU runtime]

Experimental Results – ODP

[Figure: ODP results. Panel: F1]

Results Analysis

OCFS performs better than IG and CHI.

It needs only half of the time.

It outperforms them most clearly when the reduced dimension is small: in a low-dimensional space the optimal solution outperforms the greedy ones, while as the number of selected features increases, the saturation of features makes additional features less valuable.

Conclusion

We proposed a novel efficient and effective feature selection algorithm for text categorization.

Main advantages: optimal; better performance; more efficient.

Future Works

Handle unbalanced data.

Combine OCFS with other approaches, e.g. OCFS + PCA.

The End

Thanks!

Q & A