OCFS: Optimal Orthogonal Centroid Feature Selection for Text Categorization
Jun Yan, Ning Liu, Benyu Zhang, Shuicheng Yan, Zheng Chen, Weiguo Fan, et al.
Microsoft Research Asia, Peking University, Tsinghua University,
Chinese University of Hong Kong, Virginia Polytechnic Institute and State University
Outline
Motivation
Problem formulation
Related works
The OCFS algorithm
Experiments
Conclusion and future works
Motivation
Dimension reduction (DR) is highly desired for web-scale text data.
DR can improve efficiency and effectiveness.
Feature selection (FS) is more applicable than feature extraction (FE).
Most FS algorithms are greedy.
Goal: a simple, effective, efficient and optimal FS algorithm.
Problem Formulation
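Dimension reduction: $f: \mathbb{R}^d \to \mathbb{R}^p$ ($p \ll d$).
Consider the linear case: suppose $x_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, n$, and
$$y_i = f(x_i) = W^T x_i \in \mathbb{R}^p$$
where
$$W \in H^{d \times p} = \{ W \in \mathbb{R}^{d \times p},\ W^T W = I \}$$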
Problem Formulation
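FS: $y_i = f(x_i) = \tilde{W}^T x_i$
We denote the discrete solution space as $\tilde{H}^{d \times p}$, with $\tilde{W} \in \tilde{H}^{d \times p}$.
The problem is: given a set of labeled training documents $X$, learn a transformation matrix $\tilde{W} \in \tilde{H}^{d \times p}$ such that it is optimal according to some criterion $J(\tilde{W})$ in the space $\tilde{H}^{d \times p}$.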
Related Works – IG
Information gain (IG) aims to select a group of optimal features $T = (t_{k_1}, t_{k_2}, \ldots, t_{k_p})$ by
$$\tilde{W}^* = \arg\max J(\tilde{W})$$
and
$$J(\tilde{W}) = IG(T)$$
where
$$IG(T) = -\sum_{j=1}^{c} P_r(c_j) \log P_r(c_j) + P_r(T) \sum_{j=1}^{c} P_r(c_j \mid T) \log P_r(c_j \mid T) + P_r(\bar{T}) \sum_{j=1}^{c} P_r(c_j \mid \bar{T}) \log P_r(c_j \mid \bar{T})$$
Finding the global optimum is an NP problem, so IG is computed greedily.
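As an illustration of the IG score for a single term, here is a minimal Python sketch that evaluates the three sums from per-class document counts. The function name `information_gain` and its two count arguments are hypothetical, chosen for this example; probabilities are estimated as simple relative frequencies, which is an assumption and not something the slides specify.

```python
import math


def information_gain(class_counts, class_counts_with_term):
    """IG of one term, estimated from document counts (illustrative sketch).

    class_counts[j]           -- number of documents in class j
    class_counts_with_term[j] -- number of those documents containing the term
    """
    n = sum(class_counts)
    n_t = sum(class_counts_with_term)   # documents containing the term
    n_not_t = n - n_t                   # documents without the term

    def plogp(p):
        # Convention: 0 * log(0) is taken as 0.
        return p * math.log(p) if p > 0 else 0.0

    # -sum_j P(c_j) log P(c_j)
    ig = -sum(plogp(n_j / n) for n_j in class_counts)
    # + P(T) sum_j P(c_j | T) log P(c_j | T)
    if n_t > 0:
        ig += (n_t / n) * sum(plogp(a / n_t) for a in class_counts_with_term)
    # + P(not T) sum_j P(c_j | not T) log P(c_j | not T)
    if n_not_t > 0:
        ig += (n_not_t / n) * sum(
            plogp((n_j - a) / n_not_t)
            for n_j, a in zip(class_counts, class_counts_with_term)
        )
    return ig
```

A term that appears in every document of one class and in none of the other gets the largest possible score for two balanced classes, while a term spread evenly over both classes scores approximately zero.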
Related Works – CHI
CHI aims to select a group of features $T = (t_{k_1}, t_{k_2}, \ldots, t_{k_p})$ by
$$\tilde{W}^* = \arg\max J(\tilde{W})$$
and
$$J(\tilde{W}) = \chi^2(T)$$
where
$$\chi^2(t, c_j) = \frac{n\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$
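The slide does not define A, B, C and D. The sketch below assumes the usual contingency-table convention for a term t and a class c_j: A counts documents in the class that contain the term, B documents outside the class that contain it, C documents in the class without it, and D documents with neither. The function name `chi_square` is hypothetical.

```python
def chi_square(A, B, C, D):
    """Chi-square statistic of a term/class pair from a 2x2 contingency table.

    Assumed convention (not stated on the slide):
      A: in class, term present      B: not in class, term present
      C: in class, term absent       D: not in class, term absent
    """
    n = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    if denom == 0:
        # Degenerate table (an empty row or column): treat as no association.
        return 0.0
    return n * (A * D - C * B) ** 2 / denom
```

For example, `chi_square(5, 0, 0, 5)` describes a term that perfectly separates ten documents into two classes, and `chi_square(2, 2, 3, 3)` describes a term that is independent of the class.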
Orthogonal Centroid Algorithm
Orthogonal centroid (OC) is an FE algorithm.
It is effective for DR in text classification problems.
Its computation is based on QR matrix decomposition.
Theorem: the solution of the OC algorithm equals the solution of the following optimization problem,
$$W^* = \arg\max J(W) = \arg\max \operatorname{trace}(W^T S_b W)$$
where
$$W \in H^{d \times p} = \{ W \in \mathbb{R}^{d \times p},\ W^T W = I \}$$
Intuition of Our Work
OC from the FE perspective:
$$\arg\max J(W) = \arg\max \operatorname{trace}(W^T S_b W), \quad W \in H^{d \times p} = \{ W \in \mathbb{R}^{d \times p},\ W^T W = I \}$$
where
$$S_b = \sum_{j=1}^{c} \frac{n_j}{n} (m_j - m)(m_j - m)^T$$
Intuition of Our Work
OC from the FS perspective: how to optimize $J$ in the discrete space $\tilde{H}^{d \times p}$?
[Figure: panels labeled "by FE" and "by FS"]
The OCFS Algorithm
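FS problem: suppose we want to preserve the $m$-th and $n$-th features of $x_i \in \mathbb{R}^d$ and discard the others.
$$y_i = f(x_i) = \tilde{W}^T x_i = \begin{pmatrix} 0 & \cdots & 1 & \cdots & 0 & \cdots & 0 \\ 0 & \cdots & 0 & \cdots & 1 & \cdots & 0 \end{pmatrix} \begin{pmatrix} x_{i1} \\ \vdots \\ x_{im} \\ \vdots \\ x_{in} \\ \vdots \\ x_{id} \end{pmatrix} = \begin{pmatrix} x_{im} \\ x_{in} \end{pmatrix}$$
The 1 in the first row sits in the $m$-th column and the 1 in the second row sits in the $n$-th column.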
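To make the selection-matrix view concrete, here is a small Python sketch that builds a 0/1 matrix with one unit column per kept feature and applies its transpose to a vector. The helper names `selection_matrix` and `project` are hypothetical, and indices are 0-based, unlike the 1-based indices on the slide.

```python
def selection_matrix(d, kept):
    """Return a d x p 0/1 matrix whose i-th column is the unit vector e_{kept[i]}."""
    W = [[0] * len(kept) for _ in range(d)]
    for col, k in enumerate(kept):
        W[k][col] = 1
    return W


def project(W, x):
    """Compute y = W^T x for a d x p matrix W and a length-d vector x."""
    d, p = len(W), len(W[0])
    return [sum(W[r][c] * x[r] for r in range(d)) for c in range(p)]


x = [0.5, 3.0, 2.0, 1.5, 0.0]
W = selection_matrix(len(x), kept=[1, 3])
y = project(W, x)  # keeps x[1] and x[3], discards the rest
```

Because the columns are distinct unit vectors, this matrix also satisfies the orthogonality constraint of the continuous problem, so feature selection searches a discrete subset of the same space.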
The OCFS Algorithm
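Optimization:
$$\arg\max J(\tilde{W}) = \arg\max \operatorname{trace}(\tilde{W}^T S_b \tilde{W})$$
With $K = \{ k_i,\ 1 \le k_i \le d,\ i = 1, 2, \ldots, p \}$, the $k$-th entry of the column $\tilde{w}_i$ is
$$\tilde{w}_i^{k} = \begin{cases} 1 & k = k_i \\ 0 & \text{otherwise} \end{cases}$$
so that
$$\operatorname{trace}(\tilde{W}^T S_b \tilde{W}) = \sum_{i=1}^{p} \tilde{w}_i^T S_b \tilde{w}_i = \sum_{i=1}^{p} \sum_{j=1}^{c} \frac{n_j}{n} \left( m_j^{k_i} - m^{k_i} \right)^2$$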
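The step from the trace to the double sum works because each column of the selection matrix is a unit vector, so each quadratic form picks out one diagonal entry of $S_b$. A short derivation, writing $e_{k_i}$ (a symbol introduced here, not used on the slide) for the unit vector with a 1 in position $k_i$:

$$\tilde{w}_i^T S_b \tilde{w}_i = e_{k_i}^T S_b\, e_{k_i} = (S_b)_{k_i k_i} = \sum_{j=1}^{c} \frac{n_j}{n} \left( m_j^{k_i} - m^{k_i} \right)^2$$

Summing over $i = 1, \ldots, p$ gives the objective as a sum of $p$ independent per-feature terms, which is why the discrete problem can be solved exactly by ranking features.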
The OCFS Algorithm
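OCFS:
$$\arg\max_{K} \sum_{i=1}^{p} \sum_{j=1}^{c} \frac{n_j}{n} \left( m_j^{k_i} - m^{k_i} \right)^2$$
Solution: the $p$ largest ones from
$$s(k) = \sum_{j=1}^{c} \frac{n_j}{n} \left( m_j^{k} - m^{k} \right)^2, \quad k = 1, 2, \ldots, d$$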
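The selection rule can be sketched in plain Python: compute the overall centroid and the class centroids, score every feature by the weighted squared centroid differences, and keep the p highest-scoring features. This is an illustrative sketch, not the authors' implementation. The names `ocfs_scores` and `ocfs_select` are hypothetical, and documents are assumed to be dense lists of numbers.

```python
def ocfs_scores(X, labels):
    """Score every feature k by sum_j (n_j / n) * (m_j[k] - m[k]) ** 2."""
    n, d = len(X), len(X[0])
    # Overall centroid m.
    m = [sum(row[k] for row in X) / n for k in range(d)]
    # Per-class feature sums and class sizes n_j.
    sums, sizes = {}, {}
    for row, c in zip(X, labels):
        acc = sums.setdefault(c, [0.0] * d)
        for k in range(d):
            acc[k] += row[k]
        sizes[c] = sizes.get(c, 0) + 1
    scores = [0.0] * d
    for c, acc in sums.items():
        n_j = sizes[c]
        for k in range(d):
            m_jk = acc[k] / n_j  # k-th entry of the class centroid m_j
            scores[k] += (n_j / n) * (m_jk - m[k]) ** 2
    return scores


def ocfs_select(X, labels, p):
    """Return the indices (0-based) of the p features with the largest scores."""
    scores = ocfs_scores(X, labels)
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return ranked[:p]
```

On a toy input such as `X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]` with `labels = ["a", "a", "b", "b"]`, the first two features separate the classes and the third does not, so `ocfs_select(X, labels, 2)` keeps the first two.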
Algorithm Analysis
The number of selected features is
$$p^* = \arg\min E(p)$$
subject to
$$E(p) \ge T$$
where the energy function is
$$E(p) = \frac{\sum_{j=1}^{p} s(k_j)}{\sum_{i=1}^{d} s(i)}$$
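One way to read this rule: rank the feature scores in decreasing order and take the smallest p whose cumulative share of the total score reaches T. The sketch below follows that reading. It assumes T is a user-chosen threshold between 0 and 1 and that the selected features are the top-ranked ones; the function name `choose_p` is hypothetical.

```python
def choose_p(scores, threshold):
    """Smallest p such that the top-p scores hold at least `threshold` of the total."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    cumulative = 0.0
    for p, s in enumerate(sorted(scores, reverse=True), start=1):
        cumulative += s
        if cumulative / total >= threshold:  # E(p) >= T
            return p
    # Reached only if rounding keeps E(d) just below the threshold.
    return len(scores)
```

Since E(p) never decreases as p grows, minimizing E(p) subject to the constraint is the same as picking the smallest feasible p, which is what the loop returns.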
Algorithm Analysis
Complexity: the time complexity is $O(cd)$.
OCFS only computes a simple square function, rather than a more expensive function such as the logarithm used by IG.
Experiments Setup
Datasets:
20 Newsgroups (5-class; 5,000-data; 131,072-d)
Reuters Corpus Volume 1 (4-class; 800,000-data; 500,000-d)
Open Directory Project (13-class)
Baseline: IG & CHI
Classifier: SVM (SMO)
Performance measurement: CPU runtime and
$$\text{Micro } F_1 = \frac{2 P R}{P + R}$$
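For reference, Micro F1 can be computed by pooling the per-class counts before forming precision P and recall R. The sketch below assumes that usual micro-averaging convention, which the slide does not spell out. The function name `micro_f1` and the `(tp, fp, fn)` tuple layout are hypothetical.

```python
def micro_f1(per_class_counts):
    """Micro-averaged F1 from one (tp, fp, fn) tuple per class."""
    counts = list(per_class_counts)
    tp = sum(c[0] for c in counts)  # pooled true positives
    fp = sum(c[1] for c in counts)  # pooled false positives
    fn = sum(c[2] for c in counts)  # pooled false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Pooling the counts first means frequent classes weigh more than rare ones, which is the usual reason for reporting the micro-averaged variant on skewed corpora.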
Results Analysis
Better performance than IG and CHI.
Only half of the running time.
The advantage is largest when the number of selected features is small, where the optimal solution outperforms the greedy ones.
As the number of selected features increases, the saturation of features makes additional features of less value.
Conclusion
We proposed a novel, efficient and effective feature selection algorithm for text categorization.
Main advantages: optimal, better performance, more efficient.