IEEE TIP-15403-2016 FINAL VERSION, VOL. XX, NO. X, DECEMBER 2017
Cost-Sensitive Feature Selection by Optimizing F-measures
Meng Liu, Chang Xu, Yong Luo, Chao Xu, Member, IEEE, Yonggang Wen, Senior Member, IEEE, and Dacheng Tao, Fellow, IEEE
Abstract: Feature selection is beneficial for improving the performance of general machine learning tasks by extracting an informative subset from the high-dimensional features. Conventional feature selection methods usually ignore the class imbalance problem, thus the selected features will be biased towards the majority class. Considering that F-measure is a more reasonable performance measure than accuracy for imbalanced data, this paper presents an effective feature selection algorithm that explores the class imbalance issue by optimizing F-measures. Since F-measure optimization can be decomposed into a series of cost-sensitive classification problems, we investigate the cost-sensitive feature selection (CSFS) by generating and assigning different costs to each class with rigorous theory guidance. After solving a series of cost-sensitive feature selection problems, features corresponding to the best F-measure will be selected. In this way, the selected features will fully represent the properties of all classes. Experimental results on popular benchmarks and challenging real-world datasets demonstrate the significance of cost-sensitive feature selection for the imbalanced data setting and validate the effectiveness of the proposed method.
Index Terms: feature selection, cost-sensitive, imbalanced data, F-measure optimization
I. INTRODUCTION
FEATURE selection aims to select a small subset of features from the input high-dimensional features by reducing noise and redundancy to maximize relevance to the target, such as class labels in image classification [1], [2], [3], [4], [5]. It can improve the generalization ability of learning models, reduce the computational complexity, and thus benefit the subsequent learning tasks. Considering these advantages, a large number of works have been done in the literature [6], [7], [8], [9], [10], [11], [12], [13], [14]. In general, these methods can be classified into three categories: (1) Filter methods, which evaluate features by examining only the characteristics of the data; ReliefF [15], mRMR [16], F-statistic [17] and Information Gain [18] are among the most representative algorithms. (2) Wrapper methods, which use the prediction method as a black box to score the feature subsets, such as correlation-based feature selection (CFS) [19] and support vector machine recursive feature elimination (SVM-RFE) [20]. (3) Embedded methods, which directly incorporate the feature selection procedure into the model training process, such as regularized regression-based feature selection methods [21], [22]. These methods introduce group sparse regularization into the optimization of the regression model and often have low computational complexity.

Meng Liu and Chao Xu are with the Key Laboratory of Machine Perception (Ministry of Education), Cooperative Medianet Innovation Center, School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, China.
Chang Xu and Dacheng Tao are with the UBTECH Sydney Artificial Intelligence Centre and the School of Information Technologies in the Faculty of Engineering and Information Technologies at University of Sydney, 6 Cleveland St, Darlington, NSW 2008, Australia.
Yong Luo and Yonggang Wen are with the School of Computer Science and Engineering, Nanyang Technological University, 639798, Singapore.
© 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Although all these aforementioned feature selection algorithms are promising and effective, most of them have been developed by implicitly or explicitly assuming ideal data sampling, where the numbers of samples for all classes do not differ significantly. However, in practice, the class imbalance issue, i.e., the minority class usually includes far fewer examples than the majority class does, is common in machine learning and pattern recognition. Specifically, the class imbalance of real-world datasets consists of two aspects: statistical distribution and algorithmic application. In the test set of the handwritten digit dataset USPS [23], the proportion of digit-0 is 18%, while that of digit-7 is 7%. Moreover, we often transform a multi-class/multi-label classification problem into multiple binary classification problems using the one-vs-all strategy, where negative samples far outnumber positive samples in each binary classification problem [2], [24], [25], [20], [26]. This encourages conventional feature selection methods to select the features describing the majority classes rather than those features representing the minority class. Given the subsequent classification or clustering task, it is then difficult to derive an optimal solution based on the biased selected features.
Furthermore, most classifier-dependent feature selection methods [21], [22], [27], [28], [29] face the same problem. These methods treat the misclassification costs of different classes equally and choose the feature subset with the highest classification accuracy, which is not a satisfying performance measure for imbalanced training data. In this situation, classifiers are usually overwhelmed by the majority class due to the neglect of the minority class. These methods can thus be called cost-blind feature selection methods [30].
arXiv:1904.02301v1 [cs.CV] 4 Apr 2019

Since F-measure favors high and balanced values of precision and recall [31], it is a more reasonable performance measure than accuracy for the class imbalance situation [31], [32], [33]. F-measure has been widely used for evaluating the performance of binary classification, and its variants for multi-class and multi-label classification have been studied recently [34], [35], [32], [36], [33]. Various methods of optimizing F-measures have been proposed in the literature, and generally they fall into two paradigms. The decision-theoretic approach (DTA) [37] first estimates a probability model, and then computes the optimal predictions according to the model. On the other hand, the empirical utility maximization (EUM) approach [38], [39] follows a structured risk minimization principle. Optimizing F-measure directly is often difficult since the F-measure is non-convex, thus approximation methods are often used instead, such as the algorithms for maximizing a convex lower bound of F-measure for support vector machines [39], and maximizing the expected F-measure of a probabilistic classifier using a logistic regression model [38]. A simpler method is to threshold the scores obtained by classifiers to maximize the empirical F-measure. Though simple, this method has been found effective and is commonly applied [31], [40]. Recent works [31], [36], [41], [42] study the pseudo-linear property of F-measures where they are functions of per-class false negative/false positive rate. In these scenarios, F-measure optimization can be reduced to a series of cost-sensitive classification problems, and an approximately optimal classifier for the F-measure is guaranteed.
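The score-thresholding baseline mentioned above can be illustrated with a minimal sketch. The function names `f_beta` and `best_threshold`, the toy scores and the rule "predict positive when the score is at least the threshold" are assumptions made for illustration and are not taken from [31] or [40].

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F_beta from confusion counts; returns 0.0 when the measure is undefined."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom > 0 else 0.0


def best_threshold(scores, labels, beta=1.0):
    """Try every observed score as a threshold and keep the one with the best F_beta.

    An example is predicted positive when its score is >= the threshold.
    """
    best_t, best_f = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y != 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        f = f_beta(tp, fp, fn, beta)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


# Hypothetical classifier scores with labels in {-1, +1}.
scores = [0.9, 0.8, 0.35, 0.3, 0.2, 0.1]
labels = [1, 1, -1, 1, -1, -1]
t, f = best_threshold(scores, labels)
```

In this toy case the threshold that recovers all three positives at the price of one false positive is preferred over the stricter thresholds, which is the behaviour the F-measure is meant to reward.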
Inspired by the recent successes on F-measure optimization [31], [36], we propose a novel approach, called cost-sensitive feature selection (CSFS), to optimize F-measures of the classifiers used in embedded feature selection [21], [43], [44] to overcome the class imbalance problem as mentioned. Through the reduction of F-measure optimization to cost-sensitive classification, the classifiers of regularized regression-based feature selection methods are modified into cost-sensitive ones. By solving a series of cost-sensitive feature selection problems, features will be selected according to the optimal classifier with the largest F-measure. Therefore, the class imbalance is taken into consideration, and the selected features will fully represent both the majority class and the minority class. We apply our method to both multi-class and multi-label tasks. Promising results on several benchmark datasets as well as some challenging real-world datasets have validated the efficiency of the proposed method.
The main contributions of our work are as follows.
• We propose that F-measure can be used to measure the performance of feature selection methods since F-measure is more reasonable than accuracy for imbalanced data classification. Although some previous works [24], [45] have been proposed to deal with the feature selection and class imbalance problems, our method is one of the first works that handle the class-imbalance problem in feature selection by optimizing F-measure.
• We modify the traditional feature selection methods into cost-sensitive ones by designing and assigning different costs to each class. Based on the general reduction framework of F-measure optimization, our method can obtain an optimal solution.
The rest of this paper is organized as follows. Section II introduces the notations and summarizes the preliminary knowledge on the reduction of F-measure optimization to cost-sensitive classification. The description and formulation of the algorithm are presented in Section III. Extensive experimental results on the synthetic data, binary-class, multi-class and multi-label datasets are reported in Section IV. We conclude our work in Section V.
II. NOTATIONS AND PRELIMINARY KNOWLEDGE
In this section, we introduce the notations used in this paper, and provide a brief introduction to F-measure optimization reduction.
A. Notations
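In this paper, we present matrices as bold uppercase letters and vectors as bold lowercase letters. Given a matrix $\mathbf{W} = [w_{ij}]$, we denote $\mathbf{w}^i$ as its $i$-th row and $\mathbf{w}_j$ as its $j$-th column. For $p > 0$, the $\ell_p$-norm of the vector $\mathbf{b} \in \mathbb{R}^n$ is defined as $\|\mathbf{b}\|_p = \left(\sum_{i=1}^{n} |b_i|^p\right)^{1/p}$. The $\ell_{p,q}$-norm of the matrix $\mathbf{W} \in \mathbb{R}^{n \times m}$ is defined as $\|\mathbf{W}\|_{p,q} = \left(\sum_{i=1}^{n} \|\mathbf{w}^i\|_p^q\right)^{1/q}$, where $p > 0$ and $q > 0$. The symbol $\odot$ denotes the element-wise multiplication. The transpose, trace, and inverse of a matrix $\mathbf{W}$ are denoted as $\mathbf{W}^T$, $\mathrm{tr}(\mathbf{W})$ and $\mathbf{W}^{-1}$, respectively.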
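As a small illustration of the $\ell_{p,q}$-norm defined above, the following sketch computes it with NumPy. The helper name `lpq_norm` and the example matrix are assumptions introduced here for illustration only.

```python
import numpy as np


def lpq_norm(W, p=2, q=1):
    """l_{p,q}-norm: the l_q-norm of the vector of row-wise l_p-norms of W."""
    row_norms = np.sum(np.abs(W) ** p, axis=1) ** (1.0 / p)
    return np.sum(row_norms ** q) ** (1.0 / q)


W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])

# Row l_2-norms are 5, 0 and 13, so the l_{2,1}-norm is their sum.
l21 = lpq_norm(W, p=2, q=1)
# With p = q = 2 the definition coincides with the Frobenius norm.
l22 = lpq_norm(W, p=2, q=2)
```

A zero row contributes nothing to the $\ell_{2,1}$-norm, which is why penalizing this norm tends to drive whole rows of a matrix to zero.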
B. F-Measure Optimization Reduction
Before presenting the formulation of the proposed method, we first summarize some preliminary knowledge on the reduction of F-measure optimization to cost-sensitive classification. There are four possible outcomes for a given binary classifier: true positives $tp$, false positives $fp$, false negatives $fn$, and true negatives $tn$. They are represented as a confusion matrix in Figure 1(a). F-measure can be defined in terms of the marginal probabilities of classes and the per-class false negative/false positive probabilities. The marginal probability of label $k$ is denoted by $P_k$, and the per-class false negative probability and false positive probability of a classifier $h$ are denoted by $FN_k(h)$ and $FP_k(h)$, respectively [31]. These probabilities of a classifier $h$ can be summarized by the error profile $\mathbf{e}(h)$:

$$\mathbf{e}(h) = (FN_1(h), FP_1(h), \ldots, FN_m(h), FP_m(h)), \tag{1}$$

where $m$ is the number of labels, $e_{2k-1}$ of $\mathbf{e}(h) \in \mathbb{R}^{2m}$ is the false negative probability of class $k$ and $e_{2k}$ is the false positive probability. In binary classification, we have $FN_2 = FP_1$. Thus, for any $\beta > 0$, F-measure can be written as a function of the error profile $\mathbf{e}$:

$$F_\beta(\mathbf{e}) = \frac{(1+\beta^2)(P_1 - e_1)}{(1+\beta^2)P_1 - e_1 + e_2}. \tag{2}$$
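As a quick numerical check of Eq. (2), the sketch below compares it with the familiar count-based F-measure. It assumes that $P_1$, $e_1$ and $e_2$ are joint probabilities estimated as fractions of all examples (so $tp/n = P_1 - e_1$); the function names and the confusion counts are hypothetical.

```python
def f_beta_from_profile(p1, e1, e2, beta=1.0):
    """Eq. (2): F_beta written in terms of the error profile (e1 = FN_1, e2 = FP_1)."""
    b2 = beta ** 2
    return (1 + b2) * (p1 - e1) / ((1 + b2) * p1 - e1 + e2)


def f_beta_from_counts(tp, fp, fn, beta=1.0):
    """Standard count-based definition of F_beta."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Hypothetical confusion counts for n = 100 examples.
n, tp, fp, fn, tn = 100, 15, 10, 5, 70
p1 = (tp + fn) / n   # marginal probability of the positive class
e1 = fn / n          # false negative probability
e2 = fp / n          # false positive probability

f1_profile = f_beta_from_profile(p1, e1, e2)
f1_counts = f_beta_from_counts(tp, fp, fn)
f2_profile = f_beta_from_profile(p1, e1, e2, beta=2.0)
f2_counts = f_beta_from_counts(tp, fp, fn, beta=2.0)
```

Both forms agree because the denominator of Eq. (2) equals $((1+\beta^2)tp + \beta^2 fn + fp)/n$ under the stated assumption.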
There are several definitions of F-measures in multi-class and multi-label classification. Specifically, we can transform the multi-class or multi-label classification into multiple binary classification problems, and the average over the $F_\beta$-measures of these binary problems is defined as the macro-F-measure. According to [31], we can also define the following micro-F-measure $mlF_\beta$ for multi-label classification:

$$mlF_\beta(\mathbf{e}) = \frac{(1+\beta^2)\sum_{k=1}^{m}(P_k - e_{2k-1})}{\sum_{k=1}^{m}\left((1+\beta^2)P_k + e_{2k} - e_{2k-1}\right)}. \tag{3}$$
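The multi-label micro-F-measure of Eq. (3) can be evaluated in the same spirit. The sketch below again assumes joint probabilities measured as fractions of all examples, uses 0-based list indices (so `e[2*k]` is $FN_{k+1}$ and `e[2*k + 1]` is $FP_{k+1}$), and the per-label counts are hypothetical.

```python
def micro_f_beta(P, e, beta=1.0):
    """Eq. (3) with P = [P_1, ..., P_m] and e = [FN_1, FP_1, ..., FN_m, FP_m]."""
    b2 = beta ** 2
    m = len(P)
    num = (1 + b2) * sum(P[k] - e[2 * k] for k in range(m))
    den = sum((1 + b2) * P[k] + e[2 * k + 1] - e[2 * k] for k in range(m))
    return num / den


# Two labels over n = 100 examples (hypothetical counts).
# Label 1: tp = 15, fp = 10, fn = 5.  Label 2: tp = 8, fp = 2, fn = 12.
P = [0.20, 0.20]
e = [0.05, 0.10, 0.12, 0.02]
ml_f1 = micro_f_beta(P, e)

# Reference value from pooled counts: TP = 23, FP = 12, FN = 17.
pooled_f1 = 2 * 23 / (2 * 23 + 17 + 12)
```

The agreement with the pooled-count value reflects that the micro-F-measure aggregates the confusion counts over all labels before forming the ratio.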
Fig. 1. (a) Confusion matrix and (b) cost matrix in the context of binary classification (rows: predicted positive/negative; columns: actual positive/negative). $F_\beta$-measure can be represented as the sum of the elements of the Hadamard product of confusion matrix and cost matrix [31].
Multi-class classification differs from multi-label classification in that a single class must be predicted for each example. According to [46], one definition of multi-class micro-F-measure, denoted as $mcF_\beta$, can be written as:

$$mcF_\beta(\mathbf{e}) = \frac{(1+\beta^2)\left(1 - P_1 - \sum_{k=2}^{m} e_{2k-1}\right)}{(1+\beta^2)(1 - P_1) - \sum_{k=2}^{m} e_{2k-1} + e_1}. \tag{4}$$
Note that the fractional-linear F-measures presented in Eqs. (2)-(4) are pseudo-linear functions with respect to $\mathbf{e}$. The important property of pseudo-linear functions is that their level sets, as functions of the false negative rate and the false positive rate, are linear. Based on this observation, a recent work [31] proposed maximizing the F-measure by reducing it to cost-sensitive classification, and proved that the optimal classifier obtained for a cost-sensitive classification problem with label-dependent costs is also an optimal classifier for the F-measure. This method can be separated into three steps. Firstly, the F-measure range $[0, 1]$ is discretized into a set of evenly spaced values $\{r_i\}$ without considering its variable property under label switching. Secondly, for each given F-measure value $r$, the cost function $\mathbf{a} : \mathbb{R}_+ \to \mathbb{R}_+^{2m}$ generates a cost vector $\mathbf{a}(r) \in \mathbb{R}^{2m}$ and assigns costs to the elements of the error profile $\mathbf{e}$. We will use the binary-class $F_\beta(\mathbf{e})$ as an example to illustrate the procedure of generating costs.
As a pseudo-linear function, the level sets of $F_\beta(\mathbf{e})$ are linear [41], [42]. For a given discretized F-measure value $r$, by setting the denominator of Eq. (2) strictly positive and $F_\beta(\mathbf{e}) \le r$, we can get

$$(1+\beta^2-r)e_1 + r e_2 + (1+\beta^2)P_1(r-1) \ge 0. \tag{5}$$
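For completeness, the algebra behind Eq. (5) can be spelled out as follows. This is a worked restatement under the stated condition that the denominator of Eq. (2) is strictly positive, so that multiplying both sides by it preserves the inequality.

```latex
\begin{align*}
F_\beta(\mathbf{e}) \le r
&\iff (1+\beta^2)(P_1 - e_1) \le r\big((1+\beta^2)P_1 - e_1 + e_2\big) \\
&\iff 0 \le (1+\beta^2)e_1 - r e_1 + r e_2 + r(1+\beta^2)P_1 - (1+\beta^2)P_1 \\
&\iff (1+\beta^2-r)\,e_1 + r\,e_2 + (1+\beta^2)P_1(r-1) \ge 0 .
\end{align*}
```

The last line is linear in $(e_1, e_2)$ for fixed $r$, which is the level-set linearity that the reduction relies on.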
The non-negative coefficients $1+\beta^2-r$ and $r$ can be interpreted as the costs of $e_1$ and $e_2$. Thus, the cost function $\mathbf{a}^{F_\beta}(r)$ for binary-class $F_\beta$ can be represented as follows:

$$a_i^{F_\beta}(r) = \begin{cases} 1+\beta^2-r & \text{if } i = 1 \\ r & \text{if } i = 2 \\ 0 & \text{otherwise.} \end{cases} \tag{6}$$

These costs are shown as a cost matrix in Figure 1(b) [31]. Therefore, the goal of optimization is changed to minimizing the total cost $\langle \mathbf{a}(r), \mathbf{e}(h) \rangle$, which is the inner product of the cost vector and the error profile [31].
Similarly, the cost function $\mathbf{a}^{mlF_\beta}(r)$ of the multi-label micro-F-measure $mlF_\beta$ can be represented as follows:

$$a_i^{mlF_\beta}(r) = \begin{cases} 1+\beta^2-r & \text{if } i \text{ is odd} \\ r & \text{if } i \text{ is even} \end{cases} \tag{7}$$
Fig. 2. Illustration of the $F_1$-measure surface with level sets and the total cost hyperplane for a given cost vector in the context of binary classification (axes: $fn$ ($FN_1$), $fp$ ($FP_1$) and $F_1$). $F_1$-measure is a nonlinear function with respect to $fn$ and $fp$. When the cost vector $\mathbf{a}(r)$ is fixed, the total cost hyperplane is a linear function of $fn$ and $fp$. The error profile space illustratively contains all possible values of $\mathbf{e}$. Intuitively, we can notice that higher values of $F_1$-measure entail lower values of total cost $\langle \mathbf{a}(r), \mathbf{e} \rangle$.
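and the cost function $\mathbf{a}^{mcF_\beta}(r)$ of the multi-class micro-F-measure $mcF_\beta$ is:

$$a_i^{mcF_\beta}(r) = \begin{cases} r & \text{if } i = 1 \\ 1+\beta^2-r & \text{if } i \text{ is odd and } i \ne 1 \\ 0 & \text{otherwise.} \end{cases} \tag{8}$$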
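The three cost functions in Eqs. (6), (7) and (8) can be written down directly. The sketch below is an illustration only: the function names, the `kind` argument and the example error profile are assumptions, and the returned list is ordered as $(FN_1, FP_1, \ldots, FN_m, FP_m)$ to match Eq. (1).

```python
def cost_vector(r, m, kind, beta=1.0):
    """Return a(r) as a list of length 2m, using the 1-based index i of the text."""
    fn_cost = 1.0 + beta ** 2 - r
    a = []
    for i in range(1, 2 * m + 1):
        if kind == "binary":          # Eq. (6)
            a.append(fn_cost if i == 1 else r if i == 2 else 0.0)
        elif kind == "multilabel":    # Eq. (7)
            a.append(fn_cost if i % 2 == 1 else r)
        elif kind == "multiclass":    # Eq. (8)
            a.append(r if i == 1 else fn_cost if i % 2 == 1 else 0.0)
        else:
            raise ValueError(f"unknown kind: {kind}")
    return a


def total_cost(a, e):
    """Inner product <a(r), e(h)> of cost vector and error profile."""
    return sum(ai * ei for ai, ei in zip(a, e))


a_bin = cost_vector(0.5, m=2, kind="binary")
a_ml = cost_vector(0.5, m=2, kind="multilabel")
a_mc = cost_vector(0.5, m=3, kind="multiclass")

# Hypothetical binary error profile with FN_2 = FP_1 and FP_2 = FN_1.
cost = total_cost(a_bin, [0.05, 0.10, 0.10, 0.05])
```

Only the entries with non-zero cost influence the total cost, so in the binary case the errors of the second class are counted once, through $FN_1$ and $FP_1$.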
Lastly, cost-sensitive classifiers for each $\mathbf{a}(r)$ are learned to minimize the total cost $\langle \mathbf{a}(r), \mathbf{e}(h) \rangle$, and the one with the largest F-measure on the validation set is selected as the optimal classifier. Figure 2 shows that the higher the F-measure value, the lower the total cost. This indicates that maximizing F-measure can be achieved by minimizing the corresponding total cost.
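The three steps of the reduction can be sketched end to end on a toy problem. The example below is not the classifier used in [31] or in this paper: it swaps in a one-dimensional threshold rule (a "stump") as the cost-sensitive learner, fixes $\beta = 1$, and uses made-up training and validation points; all names are hypothetical.

```python
def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p != 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fit_stump(xs, ys, cost_fn, cost_fp):
    """Pick the threshold t (predict +1 when x >= t) with the lowest total cost."""
    candidates = sorted(set(xs)) + [max(xs) + 1.0]  # last one predicts all negative

    def total_cost(t):
        return sum(cost_fn if (y == 1 and x < t)
                   else cost_fp if (y != 1 and x >= t)
                   else 0.0
                   for x, y in zip(xs, ys))

    return min(candidates, key=total_cost)


def optimise_f1(train, val, steps=10):
    (xtr, ytr), (xva, yva) = train, val
    results = []
    for i in range(1, steps + 1):
        r = i / steps                                   # step 1: evenly spaced r
        t = fit_stump(xtr, ytr, cost_fn=2.0 - r, cost_fp=r)  # step 2: costs from Eq. (6)
        preds = [1 if x >= t else -1 for x in xva]
        results.append((f1_score(yva, preds), r, t))
    return max(results)                                 # step 3: best validation F1


train = ([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
         [-1, -1, -1, -1, -1, 1, -1, -1, 1, 1])
val = ([0.15, 0.35, 0.55, 0.65, 0.75, 0.85, 0.95],
       [-1, -1, -1, 1, -1, 1, 1])
best_f, best_r, best_t = optimise_f1(train, val)
```

Smaller values of $r$ make false negatives relatively more expensive, which pushes the learned threshold down and trades precision for recall; the validation step then picks the trade-off with the best F-measure.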
The determination of misclassification costs usually requires experts or prior knowledge. These costs reflect how severe one kind of mistake is compared with other kinds of mistakes [30], which will result in numerous cost matrices based on different cost ratios. By contrast, the costs generated from the F-measure optimization reduction algorithm have explicit value ranges once the discretization step is determined.
III. COST-SENSITIVE FEATURE SELECTION
Inspired by the reduction of F-measure optimization to cost-sensitive classification, we modify the classifiers used in regularized regression-based feature selection methods into cost-sensitive ones by adding properly generated costs, and select features according to the classifier with the optimal F-measure performance. This leads to a novel cost-sensitive feature selection (CSFS) method by optimizing F-measures. A schematic illustration of the proposed method is shown in Figure 3.
A. Problem Formulation
Fig. 3. System diagram of the proposed cost-sensitive feature selection (CSFS) model in the case of binary classification. This model can be divided into four stages. (1) Discretize the F-measure interval to obtain a set of evenly spaced values $\{r_1, \ldots, r_n\}$. (2) For a given $r_i$, the cost function $\mathbf{a}(r_i)$ generates costs $1+\beta^2-r_i$ for the false negative and $r_i$ for the false positive, thus we can get a series of cost-sensitive classifiers. (3) Select the optimal classifier with the largest F-measure value on the validation set. (4) Select the top-ranking features according to the projection matrix $\mathbf{W}$ of the optimal classifier by sorting $\|\mathbf{w}^i\|$ ($1 \le i \le d$) in descending order.

Given training data, let $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$ denote the feature matrix with $n$ samples, where the feature dimension is $d$. The corresponding label matrix is given by $\mathbf{Y} = [\mathbf{y}^1; \ldots; \mathbf{y}^n] \in \{-1, 1\}^{n \times m}$, where $\mathbf{y}^i$ is a row vector of the labels for the $i$-th example, and $m$ is the number of class labels. The general formulation of regularized regression-based feature selection methods [21], [22], [47], which aim to obtain a projection matrix $\mathbf{W} \in \mathbb{R}^{d \times m}$, can be summarized as follows:

$$\min_{\mathbf{W}} L(\mathbf{X}^T\mathbf{W} - \mathbf{Y}) + \lambda R(\mathbf{W}), \tag{9}$$
where $L(\cdot)$ is the norm-based loss function of the prediction residual, $R(\cdot)$ is the regularizer that introduces sparsity to make $\mathbf{W}$ applicable for feature selection, and $\lambda$ is a trade-off parameter. For simplicity, the bias has been absorbed into $\mathbf{W}$ by adding a constant value 1 to the feature vector of each example. Such methods have been widely used in both multi-class and multi-label data situations [21], [44], [22], [43]. However, they are designed to maximize the classification accuracy, which is unsuitable for highly imbalanced class situations [31], since equal costs are assigned to different classes.
To deal with this problem, we now present a new feature selection method, which optimizes F-measure by modifying the classifiers of regularized regression-based feature selection into cost-sensitive ones. Without loss of generality, we start with the illustration of cost-sensitive feature selection under the binary-class setting, where the label vector is $[y_1; y_2; \ldots; y_n] \in \{-1, 1\}^{n \times 1}$. As mentioned previously, the misclassification cost for the positive class is $1+\beta^2-r$ and the misclassification cost for the negative class is $r$. Thus for each class, we obtain a cost vector $\mathbf{c} = [c_1, \ldots, c_n]^T \in \mathbb{R}^n$, where $c_i = 1+\beta^2-r$ if $y_i = 1$, and $c_i = r$ if $y_i = -1$. The formulation of the total cost for all samples can be given as follows:

$$\min_{\mathbf{w}} \sum_{i=1}^{n} L\big((\mathbf{x}_i^T\mathbf{w} - y_i) \cdot c_i\big) + \lambda R(\mathbf{w}), \tag{10}$$
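where $\mathbf{w} \in \mathbb{R}^{d \times 1}$ is the projection vector. In multi-class and multi-label scenarios, the cost vector $\mathbf{c}_i \in \mathbb{R}^n$ for the $i$-th class can be obtained according to the per-class false negative/false positive costs generated by the corresponding cost function $\mathbf{a}(r)$.
Denoting the cost matrix as $\mathbf{C} = [\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_m] \in \mathbb{R}^{n \times m}$, we obtain the following formulation:

$$\min_{\mathbf{W}} \sum_{i=1}^{n} L\big((\mathbf{x}_i^T\mathbf{W} - \mathbf{y}^i) \odot \mathbf{c}^i\big) + \lambda R(\mathbf{W}), \tag{11}$$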
where $\mathbf{c}^i$ is the $i$-th row of $\mathbf{C}$ corresponding to the $i$-th example. Due to its rotational invariance and robustness to outliers [21], we adopt the $\ell_2$-norm based loss function as the specific form of $L(\cdot)$ and the optimization problem becomes:

$$\min_{\mathbf{W}} \sum_{i=1}^{n} \big\|(\mathbf{x}_i^T\mathbf{W} - \mathbf{y}^i) \odot \mathbf{c}^i\big\|_2 + \lambda R(\mathbf{W}). \tag{12}$$
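By further considering that

$$\sum_{i=1}^{n} \big\|(\mathbf{x}_i^T\mathbf{W} - \mathbf{y}^i) \odot \mathbf{c}^i\big\|_2 = \big\|(\mathbf{X}^T\mathbf{W} - \mathbf{Y}) \odot \mathbf{C}\big\|_{2,1}, \tag{13}$$

and taking the commonly used $\ell_{2,1}$-norm as regularization [48], [22], [21], [49], we obtain the following compact form of the cost-sensitive feature selection (CSFS) optimization problem:

$$\min_{\mathbf{W}} \big\|(\mathbf{X}^T\mathbf{W} - \mathbf{Y}) \odot \mathbf{C}\big\|_{2,1} + \lambda \|\mathbf{W}\|_{2,1}. \tag{14}$$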
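To make the objective in Eq. (14) concrete, the following sketch builds a cost matrix and evaluates the objective with NumPy. It assumes the binary or multi-label cost assignment described above (cost $1+\beta^2-r$ where the label is $+1$ and $r$ where it is $-1$); the function names and the tiny example are hypothetical.

```python
import numpy as np


def cost_matrix(Y, r, beta=1.0):
    """C[i, k] = 1 + beta^2 - r where Y[i, k] == 1, and r where Y[i, k] == -1."""
    return np.where(Y == 1, 1.0 + beta ** 2 - r, r)


def csfs_objective(X, Y, W, C, lam):
    """||(X^T W - Y) o C||_{2,1} + lam * ||W||_{2,1}, with X of shape (d, n)."""
    R = (X.T @ W - Y) * C                      # cost-weighted residual, shape (n, m)
    loss = np.linalg.norm(R, axis=1).sum()     # sum of row-wise l_2-norms, Eq. (13)
    reg = np.linalg.norm(W, axis=1).sum()      # l_{2,1}-norm of W
    return loss + lam * reg


X = np.eye(2)                                  # d = 2 features, n = 2 samples
Y = np.array([[1.0], [-1.0]])                  # one positive, one negative example
W = np.zeros((2, 1))
C = cost_matrix(Y, r=0.5)
obj = csfs_objective(X, Y, W, C, lam=0.1)
```

With $\mathbf{W} = \mathbf{0}$ the residual is simply $-\mathbf{Y}$, so the objective reduces to the sum of the per-sample costs, here $1.5 + 0.5$.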
As shown in Figure 3, we can get a series of cost-sensitive feature selection problems with a different cost matrix $\mathbf{C}$ corresponding to each F-measure value $r$. After obtaining the optimal $\mathbf{W}$, features can be selected by sorting $\|\mathbf{w}^i\|$ ($1 \le i \le d$) in descending order. If $\|\mathbf{w}^i\|$ shrinks to zero, the $i$-th feature is less important and will not be selected.
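The ranking step described above amounts to sorting the rows of $\mathbf{W}$ by their norms. A minimal sketch, assuming the $\ell_2$-norm is used for $\|\mathbf{w}^i\|$ and that $\mathbf{W}$ contains no extra bias row; `select_features` and the example matrix are hypothetical.

```python
import numpy as np


def select_features(W, k):
    """Return the indices of the k rows of W with the largest l_2-norm, and all scores."""
    scores = np.linalg.norm(W, axis=1)
    order = np.argsort(-scores, kind="stable")   # descending order of row norms
    return order[:k], scores


W = np.array([[0.1, 0.0],
              [0.0, 0.0],    # a row shrunk to zero: this feature is never selected first
              [3.0, 4.0],
              [1.0, 1.0]])
idx, scores = select_features(W, k=2)
```

Here the third and fourth features are kept, while the feature whose row is exactly zero receives the lowest score.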
It is worth mentioning that the loss and regularization of Eq. (14) are not necessarily $\ell_{2,1}$-norm based terms. Other regularizations that have the effect of feature selection, such as ridge regularization and LASSO regularization [21], can also be applied to obtain the concrete form of the objective function.
B. Optimization Method
The difficulty of optimizing Eq. (14) lies in the $\ell_{2,1}$-norm imposed on both the loss term and the regularization term, which makes an explicit solution difficult to obtain. For a given F-measure value $r$, the corresponding cost matrix $\mathbf{C}$ is fixed and thus $\mathbf{W}$ is the only variable in Eq. (14). Thus, we develop an iterative algorithm to solve this problem. Taking the derivative of the objective function with respect to $\mathbf{w}_k$ ($1 \le k \le m$) and setting it to zero, we obtain:

$$\mathbf{X}\mathbf{U}_k\mathbf{G}\mathbf{U}_k\mathbf{X}^T\mathbf{w}_k - \mathbf{X}\mathbf{U}_k\mathbf{G}\mathbf{U}_k\mathbf{y}_k + \lambda\mathbf{D}\mathbf{w}_k = 0, \tag{15}$$
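where the diagonal matrix $\mathbf{U}_k = \mathrm{diag}(\mathbf{c}_k)$, $\mathbf{D}$ is a diagonal matrix with the $i$-th diagonal element as¹

$$d_{ii} = \frac{1}{2\|\mathbf{w}^i\|_2}, \tag{16}$$

and $\mathbf{G}$ is a diagonal matrix with the $i$-th diagonal element as

$$g_{ii} = \frac{1}{2\big\|\big((\mathbf{X}^T\mathbf{W} - \mathbf{Y}) \odot \mathbf{C}\big)^i\big\|_2}. \tag{17}$$

Each $\mathbf{w}_k$ can thus be solved in the closed form:

$$\mathbf{w}_k = (\lambda\mathbf{D} + \mathbf{X}\mathbf{U}_k\mathbf{G}\mathbf{U}_k\mathbf{X}^T)^{-1}(\mathbf{X}\mathbf{U}_k\mathbf{G}\mathbf{U}_k)\mathbf{y}_k. \tag{18}$$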
Since the solution of $\mathbf{W}$ is dependent on $\mathbf{D}$ and $\mathbf{G}$, we develop an iterative algorithm to obtain the ideal $\mathbf{D}$ and $\mathbf{G}$. The whole optimization procedure is described in Algorithm 1. In each iteration, $\mathbf{D}$ and $\mathbf{G}$ are calculated with the current $\mathbf{W}$, and then each column vector $\mathbf{w}_k$ of $\mathbf{W}$ is updated based on the newly solved $\mathbf{D}$ and $\mathbf{G}$. The iteration procedure is repeated until the convergence criterion is reached.
Fig. 4. The convergence property of the proposed CSFS method (x-axis: iterations; y-axis: objective value). The two lines show how the objective values change along with the iterations given the discretized F-measure values $r = 0.5$ and $r = 0.7$, respectively.
C. Convergence Analysis
The convergence of Algorithm 1 is guaranteed by the following theorem:
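¹When $\|\mathbf{w}^i\|_2 = 0$, Eq. (14) is not differentiable. Following [21], this problem can be solved by introducing a small perturbation to regularize $d_{ii}$ as $\frac{1}{2\sqrt{\|\mathbf{w}^i\|_2^2 + \zeta}}$. Similarly, the $i$-th diagonal element $g_{ii}$ of $\mathbf{G}$ can be regularized as $\frac{1}{2\sqrt{\|((\mathbf{X}^T\mathbf{W} - \mathbf{Y}) \odot \mathbf{C})^i\|_2^2 + \zeta}}$. It can be verified that the derived algorithm minimizes the following problem: $\sum_{i=1}^{n} \sqrt{\|((\mathbf{X}^T\mathbf{W} - \mathbf{Y}) \odot \mathbf{C})^i\|_2^2 + \zeta} + \lambda \sum_{i=1}^{d} \sqrt{\|\mathbf{w}^i\|_2^2 + \zeta}$, which apparently reduces to Eq. (14) when $\zeta \to 0$.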
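Algorithm 1 An iterative algorithm to solve the optimization problem in Eq. (14).
Input: feature matrix $\mathbf{X} \in \mathbb{R}^{d \times n}$, label matrix $\mathbf{Y} \in \mathbb{R}^{n \times m}$ and discretized F-measure value $r$.
Output: projection matrix $\mathbf{W} \in \mathbb{R}^{d \times m}$.
1: Generate cost function $\mathbf{a}(r)$ for binary-class, multi-label, multi-class tasks according to Eqs. (6), (7) and (8), respectively.
2: Calculate cost vector $\mathbf{c}_i \in \mathbb{R}^n$ ($1 \le i \le m$) for the $i$-th class according to the per-class $FN_i$/$FP_i$ cost in $\mathbf{a}(r)$.
3: Obtain cost matrix $\mathbf{C} = [\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_m] \in \mathbb{R}^{n \times m}$.
4: Initialize $\mathbf{W}_0$ as a random matrix, $t = 0$.
5: while not converged do
6:   update diagonal matrix $\mathbf{D}_{t+1}$ where the $i$-th diagonal element is $\frac{1}{2\|\mathbf{w}_t^i\|_2}$.
7:   update diagonal matrix $\mathbf{G}_{t+1}$ where the $i$-th diagonal element is $\frac{1}{2\|((\mathbf{X}^T\mathbf{W}_t - \mathbf{Y}) \odot \mathbf{C})^i\|_2}$.
8:   for $k \leftarrow 1$ to $m$ do
9:     $\mathbf{U}_k = \mathrm{diag}(\mathbf{c}_k)$.
10:    $(\mathbf{w}_{t+1})_k = (\lambda\mathbf{D}_{t+1} + \mathbf{X}\mathbf{U}_k\mathbf{G}_{t+1}\mathbf{U}_k\mathbf{X}^T)^{-1}$
11:        $\cdot\,(\mathbf{X}\mathbf{U}_k\mathbf{G}_{t+1}\mathbf{U}_k)\mathbf{y}_k$.
12:  end for
13:  $t = t + 1$.
14: end while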
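Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative reading of steps 4 to 14 under stated assumptions, not the authors' implementation: it uses the $\zeta$-smoothed diagonal entries from footnote 1, a fixed number of iterations in place of a convergence test, a linear solve instead of an explicit inverse, and hypothetical names and random toy data.

```python
import numpy as np


def csfs_solve(X, Y, C, lam, n_iter=50, zeta=1e-8, seed=0):
    """Iteratively reweighted solver for Eq. (14). X: (d, n); Y and C: (n, m)."""
    d, n = X.shape
    m = Y.shape[1]
    W = np.random.default_rng(seed).standard_normal((d, m))   # step 4
    history = []
    for _ in range(n_iter):                                   # steps 5-14
        R = (X.T @ W - Y) * C
        row_r = np.sqrt(np.sum(R ** 2, axis=1) + zeta)
        row_w = np.sqrt(np.sum(W ** 2, axis=1) + zeta)
        history.append(row_r.sum() + lam * row_w.sum())       # smoothed objective
        d_diag = 0.5 / row_w                                  # step 6 (smoothed)
        g_diag = 0.5 / row_r                                  # step 7 (smoothed)
        W_new = np.empty_like(W)
        for k in range(m):                                    # steps 8-12
            s = C[:, k] ** 2 * g_diag                         # diagonal of U_k G U_k
            A = lam * np.diag(d_diag) + (X * s) @ X.T
            b = (X * s) @ Y[:, k]
            W_new[:, k] = np.linalg.solve(A, b)               # Eq. (18)
        W = W_new
    return W, history


# Hypothetical imbalanced toy problem: the label depends on feature 0 only.
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 40))
Y = np.where(X[0] > 0.8, 1.0, -1.0).reshape(-1, 1)
C = np.where(Y == 1, 1.5, 0.5)                                # r = 0.5, beta = 1
W, history = csfs_solve(X, Y, C, lam=0.1)
```

The recorded values are those of the smoothed problem in footnote 1, which is the quantity the reweighting scheme decreases at every iteration; with a small $\zeta$ they track the objective of Eq. (14) closely.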
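Theorem 1. Algorithm 1 monotonically decreases the objective value of Eq. (14) in each iteration, that is,

$$\big\|(\mathbf{X}^T\mathbf{W}_{t+1} - \mathbf{Y}) \odot \mathbf{C}\big\|_{2,1} + \lambda\|\mathbf{W}_{t+1}\|_{2,1} \le \big\|(\mathbf{X}^T\mathbf{W}_t - \mathbf{Y}) \odot \mathbf{C}\big\|_{2,1} + \lambda\|\mathbf{W}_t\|_{2,1}. \tag{19}$$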
We put the convergence analysis of Algorithm 1 in the appendix. According to [21], the objective value of Eq. (14) monotonically decreases in iterations, which implies that Algorithm 1 can converge to a local minimum.
To demonstrate the convergence property of CSFS, we conduct an experiment on the USPS 1:10 subset (introduced in the following Section IV) to show in Figure 4 the evolution of the objective value w.r.t. iterations. We can notice that the objective values of CSFS rapidly reach their steady state when the iteration number is around 40.
D. Complexity Analysis
In Algorithm 1, steps 6 and 7 calculate the diagonal elements, which is computationally trivial, so the complexity mainly depends on the matrix multiplication and inversion in step 10. By using sparse matrix multiplication and avoiding dense intermediate matrices, the complexity of updating each $(\mathbf{w}_{t+1})_k$ is $O(d^2(n+d))$. Thus the complexity of the proposed algorithm is $O(Ttmd^2(n+d))$, where $t$ is the number of iterations, and $T$ is the number of discretized values of F-measure. Empirical results show that the convergence of Algorithm 1 is rapid and $t$ is usually less than 50. Besides, $T$ is usually less than 20. Therefore, the proposed algorithm is quite efficient. Since the update of each $(\mathbf{w}_{t+1})_k$ and the optimization for different costs are independent, they can both be rapidly solved when sufficient computational resources are available and parallel computing is implemented. Thereby the whole cost-sensitive feature selection framework can be solved efficiently.

TABLE I
SUMMARIZATION OF THE MULTI-CLASS AND MULTI-LABEL DATASETS.

| type | dataset | classes | samples | features |
|---|---|---|---|---|
| multi-class | LUNG | 5 | 203 | 3312 |
| multi-class | USPS | 10 | 9258 | 256 |
| multi-class | COIL20 | 20 | 1440 | 1024 |
| multi-class | UMIST | 20 | 575 | 10304 |
| multi-class | YaleB | 38 | 2414 | 1024 |
| multi-class | Isolet1 | 26 | 1560 | 617 |
| multi-label | Barcelona | 4 | 129 | 384 |
| multi-label | MSRC | 23 | 591 | 384 |
| multi-label | TRECVID | 39 | 3721 | 512 |
IV. EXPERIMENTS
In this section, we compare the performance of the proposed method with that of other feature selection methods. Specifically, the experiments consist of four parts. Firstly, we construct a two-dimensional binary-class synthetic dataset to show the influence of costs during the feature selection process. Secondly, multiple binary datasets with different sampling ratios are extracted to validate the effectiveness of our CSFS in comparison to other multi-class feature selection methods. Lastly, extensive experiments are conducted on real-world multi-class and multi-label datasets. For multi-class classification, we use six multi-class datasets: one cancer dataset LUNG², one object image dataset COIL20³, one spoken letter recognition dataset Isolet1³, one handwritten digit dataset USPS³, and two face image datasets YaleB³ and UMIST⁴. For multi-label classification, we use the Barcelona⁵, MSRCv2⁶ and TRECVID2005⁷ datasets. We extract 384-dimensional color moment features on the Barcelona and MSRCv2 datasets, and 512-dimensional GIST features on TRECVID. For each dataset, we randomly select 1/3 of the training samples for validation to tune the hyper-parameters. For datasets that do not have a separate test set, the data is first split to keep 1/4 for testing. Information on the multi-class and multi-label datasets is summarized in Table I.
On each multi-class dataset, our method CSFS is compared to several popular multi-class feature selection methods, which can be summarized as follows:
• ReliefF [15] selects features by finding the near-hit and near-miss instances using the $\ell_1$-norm.
• Information Gain (IG) [18] selects features by using information gain as the split criterion in decision tree induction.
• Minimum Redundancy Maximum Relevance (mRMR) [16] selects features by calculating the feature relevance and redundancy according to their mutual information.
• F-statistic [17] selects features by computing the scores for each feature according to between-group and within-group variabilities.
• Robust Feature Selection (RFS) [21] selects features by joint $\ell_{2,1}$-norm minimization on both loss function and regularization.

²http://www.ncbi.nlm.nih.gov/pubmed/11707567
³http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html
⁴http://www.sheffield.ac.uk/eee/research/iel/research/face
⁵http://mlg.ucd.ie/content/view/61
⁶http://research.microsoft.com/en-us/projects/objectclassrecognition/
⁷http://www-nlpir.nist.gov/projects/tv2005/
For multi-label classification, the proposed method CSFS is compared with five representative multi-label feature selection methods on each multi-label dataset. These multi-label feature selection methods are summarized as follows:
• Multi-Label ReliefF (MLReliefF) [50] improves the classical ReliefF by overcoming the ambiguity problem relating to multi-label classification.
• Multi-Label F-statistic (MLF-statistic) [50] incorporates the multi-label information into the conventional F-statistic algorithm.
• Information-Theoretic Feature Ranking (ITFR) [51] selects features by avoiding and reusing entropy calculations and identifying important label combinations based on information theory.
• Non-Convex Feature Learning (NCFS) [44] selects features by using the proximal gradient method to solve the $\ell_{p,\infty}$ operator problem.
• Robust Feature Selection (RFS) [21] can be naturally extended for multi-label feature selection tasks [44].
In addition, we also compare CSFS with two binary-class feature selection methods which are designed for imbalanced data:
• FAST [24] takes each feature as the output of a classifier, calculates AUC for each feature by sliding thresholds, and then selects features according to AUC results in descending order.
• FAIR [45] is a modification of FAST which chooses features according to their corresponding largest precision-recall curve values.
In the training process, the parameter $\lambda$ in our method is optimized in the range of $\{10^{-6}, 10^{-5}, \ldots, 10^{6}\}$, and the numbers of selected features are set as $\{20, 30, \ldots, 120\}$. To fairly compare all different feature selection methods, classification experiments are conducted on all datasets using 5-fold cross validation SVM with linear kernel and parameter $C = 1$. We repeat the experiments 10 times with random seeds for generating the validation sets. Both mean and standard deviation of the accuracy and $F_1$-measures are reported.
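The two grids described above can be written out explicitly. The sketch below only enumerates the stated values; combining them into a joint grid of $(\lambda, \text{number of features})$ pairs is an assumption made for illustration, and the variable names are hypothetical.

```python
# Trade-off parameter lambda: 10^-6, 10^-5, ..., 10^6.
lambdas = [10.0 ** e for e in range(-6, 7)]

# Numbers of selected features: 20, 30, ..., 120.
feature_counts = list(range(20, 121, 10))

# Assumed joint grid over both settings (one evaluation per pair).
grid = [(lam, k) for lam in lambdas for k in feature_counts]
```

Under this assumption there are 13 values of $\lambda$ and 11 feature counts, i.e. 143 configurations per dataset and per repetition.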
A. Synthetic Data
To demonstrate the advantage of cost-sensitive feature selection over traditional cost-blind feature selection, a toy experiment is conducted to show the influence of the costs on the selected features. We construct a two-dimensional binary-class synthetic dataset based on two different uniform distributions, as shown in Figure 5. The ratio of majority class to minority class is 3 : 1. In this experiment, the majority class is treated as the positive class, and the minority class as the negative class.
(b) r = 1 (c) r = 1.2(a) r = 0.8
1+ÎČ2-r 1+ÎČ2-r 1+ÎČ2-r
r r r
Wr=1|w1| : |w2| = 1 : 1.12
Wr=1.2|w1| : |w2| = 3.93 : 1
Wr=0.8 |w1| : |w2| = 1 : 8.75
{ }: majority class, cost=1.2{ }: minority class, cost=0.8
{ }: majority class, cost=1.0{ }: minority class, cost=1.0
{ }: majority class, cost=0.8{ }: minority class, cost=1.2
x1 x1 x1
x2 x2 x2
Fig. 5. Illustration of how CSFS influences the feature weights on a two-dimensional synthetic dataset. The feature weights, i.e., the absolute values of the coefficients of the projection vector w, change with the cost assigned to each class. When the discretized F-measure value r = 1.2, the cost of the minority class is larger than the cost of the majority class, which biases the projection vector w towards feature x1. In this case, the weight of feature x1 is larger than the weight of feature x2, which differs from the situations when r = 0.8 and r = 1.0.
[Figure 6 plots omitted. Legend: ReliefF, CSF-SVM, IG, mRMR, F-statistic, RFS, CSFS. Panels: (a) Top 20 features, (b) Top 70 features, (c) Top 120 features. x-axis: proportion of two classes (1:1 to 1:10); y-axes: F1-measure and accuracy.]
Fig. 6. Binary classification results using SVM in terms of F1-measure and accuracy on ten binary datasets with different levels of class imbalance. The numbers of selected features are set as 20, 70 and 120. The sampling ratio of the two classes varies from 1:1 to 1:10. For each fixed number of selected features, the chart above shows the F1-measure and the chart below shows the accuracy of classification. CSF-SVM is a baseline method which first trains cost-sensitive SVMs to optimize F-measure and then performs feature selection by thresholding the weights.
[Figure 7 plots omitted. Legend: FAST, FAIR, CSFS. x-axis: number of selected features (20 to 120); y-axes: F1-measure, accuracy and AUC.]
Fig. 7. Binary classification results using SVM in terms of F1-measure, accuracy and AUC on the USPS 1:10 subset. FAST and FAIR are two binary-class feature selection methods which take the imbalance issue into consideration by optimizing AUC via sliding the thresholds.
As mentioned above, regularized regression-based feature selection methods perform classifier learning and feature selection simultaneously. For a given linear classifier, each coefficient of its projection vector w corresponds to one feature weight, such as w1 for x1, and the features with larger coefficients are selected. The projection vector w varies with the costs assigned to the two classes. When r < 1, as shown in Figure 5(a), the cost of the majority class is larger than the cost of the minority class. When r = 1, as shown in Figure 5(b), the costs for both classes are the same. In this case, the cost-sensitive feature selection degenerates to the cost-blind feature selection. When r > 1, as shown in Figure 5(c), the cost of the majority class is smaller than the cost of the minority class. It is worth noting that in this last case the weight of feature x1 is larger than the weight of feature x2, which is different from the first two examples. Therefore, different features will be selected from different cost-sensitive feature selection problems.
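The relation between the discretized F-measure value r and the two class costs in Figure 5 can be written as a short sketch. It assumes, following the figure annotations, that the positive (here majority) class receives cost 1 + β² − r and the negative (minority) class receives cost r; the function names are hypothetical.

```python
def class_costs(r, beta=1.0):
    """Return (cost_positive, cost_negative) for a discretized F-measure value r."""
    return 1.0 + beta ** 2 - r, r


def sample_costs(labels, r, beta=1.0):
    """Per-sample cost: label 1 is the positive class, label 0 the negative class."""
    cost_pos, cost_neg = class_costs(r, beta)
    return [cost_pos if y == 1 else cost_neg for y in labels]


# The three settings of the toy experiment, with beta = 1.
settings = {r: class_costs(r) for r in (0.8, 1.0, 1.2)}
```

With β = 1 the two costs coincide only at r = 1, which is the cost-blind case; r = 0.8 and r = 1.2 swap which class is penalized more heavily.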
B. Binary-Class Datasets with Imbalanced Samplings
Re-sampling techniques have been used to deal with the class imbalance problem by over-sampling the minority class or under-sampling the majority class [52]. However, both over-sampling and under-sampling change the statistical distribution of the original data. Besides, the sampling ratio is hard to determine for binary-class datasets, and this problem becomes more difficult for real-world multi-class and multi-label datasets.
To evaluate the performance of our CSFS and other feature selection methods under varying conditions of class imbalance, we construct multiple binary-class datasets with different sampling ratios. These datasets are extracted from a two-class subset of the USPS dataset. The number of samples for one class is fixed at 150, while the number of samples for the other class varies in the set {150, 300, . . . , 1500}. In this way, we obtain 10 binary-class datasets with different levels of class imbalance.
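The construction of the ten datasets can be sketched as follows. The index pools stand in for the two USPS classes and are hypothetical; only the fixed size of 150 and the sizes 150, 300, . . . , 1500 come from the text.

```python
import random

# Sizes of the varying class: 150, 300, ..., 1500 (ratios 1:1 to 1:10).
SIZES = [150 * k for k in range(1, 11)]


def imbalanced_subset(idx_fixed, idx_varied, n_varied, n_fixed=150, seed=0):
    """Sample n_fixed indices from one class and n_varied from the other, without replacement."""
    rng = random.Random(seed)
    return rng.sample(idx_fixed, n_fixed), rng.sample(idx_varied, n_varied)


# Hypothetical index pools standing in for the two digit classes.
pool_fixed = list(range(200))
pool_varied = list(range(1000, 3000))
datasets = [imbalanced_subset(pool_fixed, pool_varied, n) for n in SIZES]
```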
Figure 6 compares the classification F1-measure and accuracy of the top 20, 70 and 120 features selected by ReliefF, IG, mRMR, F-statistic, RFS and the proposed CSFS on these ten datasets. We also add a baseline named CSF-SVM. It first trains cost-sensitive SVMs on the training set within the framework of [31], then slides the threshold from the minimum to the maximum prediction score of the validation samples with a step of 0.01 to obtain the optimal F-measure classifier, and finally uses the optimal classifier to perform feature selection on the test set after sorting the feature weights. This baseline shows how the proposed CSFS improves the quality of the selected features compared with the traditional cost-sensitive classifiers which are used for classification [31]. These results show that: (1) as the proportion of the two classes becomes larger, all the compared feature selection methods except the baseline achieve similar accuracy. This is because accuracy is not an appropriate performance measure when the classes are imbalanced; (2) the proposed CSFS outperforms the other methods significantly under the F1-measure criterion. The performance of the other methods decreases sharply as the proportion increases, while the curve of CSFS remains steady; and (3) although the F1-measure of CSF-SVM is relatively low, it descends slowly as the ratio becomes larger. This indicates that optimizing the F-measure through cost-sensitive classification benefits feature selection.
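The threshold-sliding step of the CSF-SVM baseline can be sketched as below. It covers only the sweep from the minimum to the maximum validation score with a step of 0.01; the cost-sensitive SVM training of [31] is not reproduced, and the scores and labels are made-up illustrative values.

```python
def f_measure(y_true, y_pred, beta=1.0):
    """F_beta from binary labels and predictions (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = (1 + beta ** 2) * tp + beta ** 2 * fn + fp
    return (1 + beta ** 2) * tp / denom if denom else 0.0


def best_threshold(scores, y_true, step=0.01, beta=1.0):
    """Slide a threshold over [min(scores), max(scores)] and keep the best F-measure."""
    lo, hi = min(scores), max(scores)
    n_steps = int(round((hi - lo) / step))
    best_thr, best_f = lo, -1.0
    for s in range(n_steps + 1):
        thr = lo + s * step
        pred = [1 if v >= thr else 0 for v in scores]
        f = f_measure(y_true, pred, beta)
        if f > best_f:  # strict comparison keeps the first (lowest) best threshold
            best_thr, best_f = thr, f
    return best_thr, best_f


# Illustrative validation scores and labels (not from the paper).
scores = [-1.0, -0.4, 0.1, 0.3, 0.9]
y_true = [0, 0, 1, 0, 1]
thr, f1 = best_threshold(scores, y_true)
```

On this toy input the best F1-measure is 0.8, reached once the threshold excludes the two lowest-scoring negatives while still keeping both positives.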
To obtain a more comprehensive understanding of the performance of our method, we compare CSFS with two competitive algorithms which take class imbalance into account, FAST [24] and FAIR [45]. Since FAST and FAIR are only designed for binary-class learning tasks, we conduct the comparison experiment in terms of F1-measure, accuracy and AUC on the USPS 1:10 subset. The classification results are shown in Figure 7. It can be seen that: (1) CSFS achieves AUC performance similar to that of FAST and FAIR when more than 80 features are selected; and (2) CSFS outperforms FAST and FAIR significantly on F1-measure, and is consistently superior on accuracy. These results can be explained from two aspects. On one hand, both the F-measure used in CSFS and the AUC used in FAST and FAIR are appropriate measures for imbalanced datasets, and thus these methods display advantages over other methods that neglect the imbalance issue. On the other hand, FAST and FAIR select each individual feature independently, and AUC is not involved in the optimization procedure. As a result, their superiority on AUC gradually drops as the number of selected features increases.
C. Comparisons with Multi-Class Feature Selection Methods
To compare the proposed algorithm with other multi-class feature selection methods, we conduct our experiments on the multi-class datasets LUNG, USPS, COIL20, UMIST, YaleB and Isolet1. On each dataset, our method CSFS is compared with five multi-class feature selection methods: ReliefF [15], Information Gain (IG) [18], mRMR [16], F-statistic [17] and RFS [21].
We follow the experimental setups of the previous works [22], [21], and employ classification accuracy and multi-class micro-F1-measure to evaluate the classification results of each feature set selected by various methods.
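The two evaluation measures can be contrasted with a small sketch. It assumes that both are computed entry-wise over 0/1 label-indicator matrices (one row per sample, one column per class); this section does not spell out the exact definitions, so that reading is an assumption, and the matrices below are made-up illustrative values.

```python
def micro_f1_and_accuracy(Y_true, Y_pred):
    """Micro-averaged F1 and entry-wise accuracy over 0/1 indicator matrices."""
    tp = fp = fn = tn = 0
    for row_t, row_p in zip(Y_true, Y_pred):
        for t, p in zip(row_t, row_p):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    micro_f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return micro_f1, accuracy


# Four samples, five classes; two of the four predictions are wrong.
Y_true = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0]]
Y_pred = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [1, 0, 0, 0, 0]]
micro_f1, accuracy = micro_f1_and_accuracy(Y_true, Y_pred)
```

Under this reading the many true negatives keep accuracy at 0.8 while micro-F1 is only 0.5, which is one way accuracy can stay high when the F-measure is low.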
Figure 8 shows the multi-class classification results in terms of micro-F1-measure and accuracy. Table II shows the results of different feature selection methods on their best dimensions. We observe that: (1) the proposed CSFS is consistently superior to the other multi-class feature selection methods in terms of the micro-F1-measure on all six datasets; (2) in terms of accuracy, CSFS outperforms the other methods for most numbers of selected features on all six datasets. We also note that, compared with CSFS, RFS is quite competitive on the spoken letter recognition dataset Isolet1. This can be explained from two aspects. On one hand, CSFS and RFS both use the ℓ2,1-norm on the loss term and the regularization term. When the discretized F-measure value is set to 1, the misclassification costs are equal and the cost-sensitive CSFS degenerates to the cost-blind RFS. On the other hand, we only set the number of discretized values of F-measure T to 20. The performance of CSFS can be further improved by using smaller discretization steps. Therefore, the competitiveness of RFS supports rather than undermines the effectiveness of CSFS.
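The outer loop over discretized F-measure values can be sketched as follows. The solver and the validation scorer are passed in as hypothetical callbacks, and ranking features by the ℓ2-norm of the rows of W is an assumption made for illustration. The stub weights reuse the |w1| : |w2| ratios reported in Figure 5, while the stub F-measure scores are arbitrary placeholders, not results.

```python
def select_by_f_measure(r_values, solve_csfs, f_measure_of, n_select):
    """Solve one cost-sensitive problem per r and keep the features of the best one.

    solve_csfs(r)      -> weight matrix W (list of d rows), hypothetical solver
    f_measure_of(r, W) -> validation F-measure of the classifier, hypothetical scorer
    """
    best = None
    for r in r_values:
        W = solve_csfs(r)
        score = f_measure_of(r, W)
        if best is None or score > best[0]:
            row_norms = [sum(v * v for v in row) ** 0.5 for row in W]
            top = sorted(range(len(W)), key=lambda j: row_norms[j], reverse=True)
            best = (score, r, top[:n_select])
    return best


# Stubs for illustration only.
stub_W = {0.8: [[1.0], [8.75]], 1.0: [[1.0], [1.12]], 1.2: [[3.93], [1.0]]}
stub_F = {0.8: 0.70, 1.0: 0.75, 1.2: 0.80}  # placeholder scores, not measured values
score, r, top = select_by_f_measure([0.8, 1.0, 1.2],
                                    lambda r: stub_W[r],
                                    lambda r, W: stub_F[r],
                                    n_select=1)
```

A finer grid of r values only adds iterations to this loop, which is why smaller discretization steps trade computation for a better chance of finding the best-performing cost setting.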
[Figure 8 plots omitted. Legend: ReliefF, IG, mRMR, F-statistic, RFS, CSFS. Panels: (a) LUNG, (b) USPS, (c) COIL20, (d) UMIST, (e) YaleB, (f) Isolet1. x-axis: number of selected features (20 to 120); y-axes: micro-F1-measure and accuracy.]
Fig. 8. Multi-class classification results using SVM in terms of multi-class micro-F1-measure and accuracy on the LUNG, USPS, COIL20, UMIST, YaleB and Isolet1 datasets. The proposed CSFS is compared with five multi-class feature selection methods: ReliefF, IG, mRMR, F-statistic and RFS. For each dataset, the chart above shows the multi-class micro-F1-measure and the chart below shows the accuracy of classification.
D. Comparisons with Multi-Label Feature Selection Methods
To compare our CSFS with other multi-label feature selection methods, we conduct the experiments on the multi-label datasets Barcelona, MSRC and TRECVID by following the setups of the previous works [21], [44], [50]. On each multi-label dataset, the proposed CSFS is compared with five competitive multi-label feature selection methods: MLReliefF [50], MLF-statistic [50], ITFR [51], NCFS [44] and RFS [21].
Figure 9 shows the multi-label classification results in terms of micro-F1-measure and accuracy on the Barcelona, MSRC and TRECVID datasets. Table III shows the results of each feature selection method on its best performing dimension. From the results, we observe that: (1) the methods using joint sparse regularization, such as CSFS, NCFS and RFS, perform better than the other feature selection methods that only use the statistical information of the original features. This is because the projection matrices of these methods are determined at the same time during the optimization procedure, so the corresponding features are selected to prevent high correlation [22]; (2) the feature selection methods based on accuracy optimization have poor F-measure performance because they tend to select features that are biased towards the majority class. Our proposed method outperforms these methods significantly under the F-measure criterion, and does not lead to an obvious decrease in accuracy. In particular, our method outperforms the other methods by a relative improvement of 2%-10% on all three datasets in terms of micro-F1-measure.

TABLE II
MULTI-CLASS MICRO-F1-MEASURE (% ± STD) AND ACCURACY (% ± STD) OF VARIOUS MULTI-CLASS FEATURE SELECTION METHODS ON SIX DATASETS. THE BEST RESULTS ARE IN BOLDFACE.

Method        LUNG         USPS         COIL20       UMIST        YaleB        Isolet1
Micro-F1-measure
ReliefF       78.02±7.35   87.25±0.71   67.78±4.47   25.94±6.95   39.09±6.55   71.65±1.21
IG            74.99±5.87   83.51±1.06   79.13±2.86   18.78±8.90   39.13±1.47   75.89±0.90
mRMR          79.46±6.03   88.30±0.87   81.83±1.15   56.38±1.60   49.80±9.44   77.96±1.19
F-statistic   78.18±7.07   88.36±0.84   72.00±3.31   24.53±2.41   42.64±1.13   74.45±2.40
RFS           81.29±9.33   89.54±0.62   88.15±2.00   73.33±4.18   48.68±8.54   85.43±0.94
CSFS          85.88±4.70   91.56±0.56   91.69±1.19   77.77±1.50   53.83±1.63   87.49±0.81
Accuracy
ReliefF       97.72±1.02   98.13±1.39   98.23±1.30   95.59±0.41   94.22±0.34   97.97±0.12
IG            97.97±0.56   97.35±2.96   98.85±3.66   95.41±2.06   93.44±1.22   98.31±0.28
mRMR          97.22±0.55   98.23±0.76   99.04±1.16   97.48±0.20   95.00±0.85   98.27±0.13
F-statistic   96.97±1.64   98.20±1.69   98.54±1.84   96.22±0.24   92.78±2.70   98.25±0.07
RFS           97.48±1.45   98.44±0.95   99.46±1.43   98.62±0.15   95.56±1.51   99.04±0.05
CSFS          98.23±0.45   98.50±0.56   99.51±1.04   98.82±0.14   96.72±0.41   99.11±0.07

[Figure 9 plots omitted. Legend: MLReliefF, MLF-statistic, ITFR, NCFS, RFS, CSFS. Panels: (a) Barcelona, (b) MSRC, (c) TRECVID. x-axis: number of selected features (20 to 120); y-axes: micro-F1-measure and accuracy.]
Fig. 9. Multi-label classification results using SVM in terms of multi-label micro-F1-measure and accuracy on the Barcelona, MSRC and TRECVID datasets. The proposed CSFS is compared with five multi-label feature selection methods: MLReliefF, MLF-statistic, ITFR, NCFS and RFS. For each dataset, the chart above shows the multi-label micro-F1-measure and the chart below shows the accuracy of classification.

TABLE III
MULTI-LABEL MICRO-F1-MEASURE (% ± STD) AND ACCURACY (% ± STD) OF VARIOUS MULTI-LABEL FEATURE SELECTION METHODS ON THREE DATASETS. THE BEST RESULTS ARE IN BOLDFACE.

                Micro-F1-measure                         Accuracy
Method          Barcelona    MSRC         TRECVID      Barcelona    MSRC         TRECVID
MLReliefF       73.80±5.70   63.96±0.49   45.25±0.71   83.42±8.42   63.57±1.41   62.46±0.88
MLF-statistic   75.10±3.61   59.99±1.87   43.51±0.36   80.43±0.89   62.15±1.50   59.55±4.27
ITFR            75.03±4.32   63.24±1.07   47.23±0.59   84.43±2.25   65.70±0.32   60.46±3.75
NCFS            76.32±5.08   67.50±1.96   49.23±0.80   87.43±1.18   69.32±1.64   64.04±3.57
RFS             77.65±2.70   68.29±0.93   49.54±0.62   86.64±3.05   70.75±1.02   63.53±0.77
CSFS            80.22±0.91   70.88±0.77   51.56±0.56   89.21±1.05   72.32±0.46   65.42±0.43
V. CONCLUSION
In this paper, we present a novel feature selection method that optimizes the F-measure instead of accuracy to tackle the class imbalance problem. Because they neglect the class imbalance issue, conventional feature selection methods usually select features that are biased towards the majority class. For instance, embedded methods select the feature subset by maximizing the classification accuracy, which is unsuitable when the classes are imbalanced. In this situation, the F-measure is a more appropriate performance measure than accuracy. Based on the reduction of F-measure optimization to cost-sensitive classification, we make the classifiers of regularized regression-based feature selection cost-sensitive by generating and assigning different costs to each class. After solving a series of cost-sensitive feature selection problems, features are selected according to the classifier with the optimal F-measure. Thus the selected features are more representative of all classes. Extensive experiments on both multi-class and multi-label datasets demonstrate the effectiveness of the proposed method.
APPENDIX A
PROOF OF THEOREM 1
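Proof. In each iteration t, the optimal $W_{t+1}$ is obtained by

\[
W_{t+1} = \arg\min_{W} \operatorname{Tr}\big(((X^T W - Y) \odot C)^T G_{t+1} ((X^T W - Y) \odot C)\big) + \lambda \operatorname{Tr}(W^T D_{t+1} W). \tag{20}
\]

Fixing $D_{t+1}$ and $G_{t+1}$, we have:

\[
\begin{aligned}
&\operatorname{Tr}\big(((X^T W_{t+1} - Y) \odot C)^T G_{t+1} ((X^T W_{t+1} - Y) \odot C)\big) + \lambda \operatorname{Tr}(W_{t+1}^T D_{t+1} W_{t+1}) \\
&\quad \le \operatorname{Tr}\big(((X^T W_t - Y) \odot C)^T G_{t+1} ((X^T W_t - Y) \odot C)\big) + \lambda \operatorname{Tr}(W_t^T D_{t+1} W_t).
\end{aligned} \tag{21}
\]

Substituting the diagonal matrices $D_{t+1}$ and $G_{t+1}$ with their corresponding definitions, we obtain:

\[
\begin{aligned}
&\sum_{i=1}^{n} \frac{\|((X^T W_{t+1} - Y) \odot C)^i\|_2^2}{2\|((X^T W_t - Y) \odot C)^i\|_2} + \lambda \sum_{i=1}^{d} \frac{\|w_{t+1}^i\|_2^2}{2\|w_t^i\|_2} \\
&\quad \le \sum_{i=1}^{n} \frac{\|((X^T W_t - Y) \odot C)^i\|_2^2}{2\|((X^T W_t - Y) \odot C)^i\|_2} + \lambda \sum_{i=1}^{d} \frac{\|w_t^i\|_2^2}{2\|w_t^i\|_2}.
\end{aligned} \tag{22}
\]

It can be easily verified that, for $a > 0$, the function $f(x) = x - \frac{x^2}{2a}$ achieves its maximum value at $x = a$, since $f'(x) = 1 - x/a$ vanishes there and $f''(x) = -1/a < 0$. Applying this to each row, we have

\[
\begin{aligned}
&\sum_{i=1}^{n} \|((X^T W_{t+1} - Y) \odot C)^i\|_2 - \sum_{i=1}^{n} \frac{\|((X^T W_{t+1} - Y) \odot C)^i\|_2^2}{2\|((X^T W_t - Y) \odot C)^i\|_2} \\
&\quad \le \sum_{i=1}^{n} \|((X^T W_t - Y) \odot C)^i\|_2 - \sum_{i=1}^{n} \frac{\|((X^T W_t - Y) \odot C)^i\|_2^2}{2\|((X^T W_t - Y) \odot C)^i\|_2},
\end{aligned} \tag{23}
\]

and

\[
\sum_{i=1}^{d} \|w_{t+1}^i\|_2 - \sum_{i=1}^{d} \frac{\|w_{t+1}^i\|_2^2}{2\|w_t^i\|_2} \le \sum_{i=1}^{d} \|w_t^i\|_2 - \sum_{i=1}^{d} \frac{\|w_t^i\|_2^2}{2\|w_t^i\|_2}. \tag{24}
\]

Adding Eqs. (22)-(24) on both sides, with Eq. (24) multiplied by λ, we get

\[
\sum_{i=1}^{n} \|((X^T W_{t+1} - Y) \odot C)^i\|_2 + \lambda \sum_{i=1}^{d} \|w_{t+1}^i\|_2 \le \sum_{i=1}^{n} \|((X^T W_t - Y) \odot C)^i\|_2 + \lambda \sum_{i=1}^{d} \|w_t^i\|_2, \tag{25}
\]

that is,

\[
\|(X^T W_{t+1} - Y) \odot C\|_{2,1} + \lambda \|W_{t+1}\|_{2,1} \le \|(X^T W_t - Y) \odot C\|_{2,1} + \lambda \|W_t\|_{2,1}. \tag{26}
\]

Therefore, Algorithm 1 decreases the objective value monotonically in each iteration.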
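The monotone behaviour stated above can be checked numerically with a minimal sketch of an iteratively reweighted update consistent with Eq. (20). It is not the authors' implementation: samples are stored as rows of `X`, the unreadable operator between the residual and C is taken to be the element-wise product, the diagonal weights are the reciprocals of twice the row norms (guarded by a small `eps`), and the toy data, the per-sample costs 0.8 and 1.2, and λ = 0.1 are hypothetical choices.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def norm(v):
    return sum(t * t for t in v) ** 0.5


def residual(X, W, Y, C):
    """Rows of the cost-weighted residual: (x_i . w_k - y_ik) * c_ik."""
    return [[(sum(xj * wj[k] for xj, wj in zip(x, W)) - Y[i][k]) * C[i][k]
             for k in range(len(Y[0]))] for i, x in enumerate(X)]


def objective(X, W, Y, C, lam):
    """Sum of residual row norms plus lam times the sum of W row norms (Eq. (26))."""
    return sum(norm(e) for e in residual(X, W, Y, C)) + lam * sum(norm(w) for w in W)


def reweighted_step(X, W, Y, C, lam, eps=1e-12):
    """One update: fix the diagonal weights at W, then minimize the quadratic surrogate."""
    d, c = len(W), len(W[0])
    g = [1.0 / (2.0 * max(norm(e), eps)) for e in residual(X, W, Y, C)]
    dd = [1.0 / (2.0 * max(norm(w), eps)) for w in W]
    W_new = [[0.0] * c for _ in range(d)]
    for k in range(c):  # the surrogate separates over the columns of W
        A = [[lam * dd[p] if p == q else 0.0 for q in range(d)] for p in range(d)]
        b = [0.0] * d
        for i, x in enumerate(X):
            wgt = g[i] * C[i][k] ** 2
            for p in range(d):
                b[p] += wgt * Y[i][k] * x[p]
                for q in range(d):
                    A[p][q] += wgt * x[p] * x[q]
        for p, val in enumerate(solve(A, b)):
            W_new[p][k] = val
    return W_new


# Hypothetical toy problem: 12 samples, 3 features, 2 imbalanced classes.
rng = random.Random(0)
labels = [0] * 9 + [1] * 3
X = [[rng.gauss(float(y), 1.0) for _ in range(3)] for y in labels]
Y = [[1.0 if k == y else 0.0 for k in range(2)] for y in labels]
cost = {0: 0.8, 1: 1.2}                      # assumed per-class costs
C = [[cost[y]] * 2 for y in labels]
lam = 0.1
W = [[1.0, 1.0] for _ in range(3)]
history = [objective(X, W, Y, C, lam)]
for _ in range(20):
    W = reweighted_step(X, W, Y, C, lam)
    history.append(objective(X, W, Y, C, lam))
```

Each entry of `history` should be no larger than the previous one, up to floating-point error, which mirrors Eq. (26).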
ACKNOWLEDGEMENTS
The authors would like to thank the handling associate editor Dr. Paul Rodriguez and both anonymous reviewers for their constructive comments. This research is partially supported by NSFC under Grant 61375026 and Grant 2015BAF15B00, Australian Research Council Projects FL-170100117, DP-180103424, DP-140102164, LP-150100671, DE-180101438, and Singapore MOE Tier 1 projects (RG 17/14, RG 26/16) and NRF2015ENC-GDCR01001-003 (administrated via IMDA).
REFERENCES
[1] H. Lang and H. Ling, “Covert photo classification by fusing image features and visual attributes,” IEEE Transactions on Image Processing, vol. 24, no. 10, pp. 2996–3008, 2015.
[2] S. Bahrampour, N. M. Nasrabadi, A. Ray, and W. K. Jenkins, “Multimodal task-driven dictionary learning for image classification,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 24–38, 2016.
[3] D. Wang, F. Nie, and H. Huang, “Feature selection via global redundancy minimization,” IEEE Transactions on Knowledge & Data Engineering, vol. PP, no. 99, pp. 2743–2755, 2015.
[4] F. Nie, S. Xiang, Y. Jia, C. Zhang, and S. Yan, “Trace ratio criterion for feature selection,” in National Conference on Artificial Intelligence, 2008, pp. 671–676.
[5] Y. Luo, T. Liu, D. Tao, and C. Xu, “Decomposition-based transfer distance metric learning for image classification,” IEEE Transactions on Image Processing, vol. 23, no. 9, pp. 3789–3801, 2014.
[6] M. J. Saberian and N. Vasconcelos, “Boosting algorithms for simultaneous feature extraction and selection,” in CVPR, 2012, pp. 2448–2455.
[7] M. D. Gupta and J. Xiao, “Non-negative matrix factorization as a feature selection tool for maximum margin classifiers,” in CVPR, 2011, pp. 2841–2848.
[8] A. Wang, J. Lu, J. Cai, G. Wang, and T.-J. Cham, “Unsupervised joint feature learning and encoding for rgb-d scene labeling,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 4459–4473, 2015.
[9] Y. Yang, H. T. Shen, Z. Ma, Z. Huang, and X. Zhou, “ℓ2,1-norm regularized discriminative feature selection for unsupervised learning,” in IJCAI, 2011, pp. 1589–1594.
[10] K. Shin and A. P. Angulo, “A geometric theory of feature selection and distance-based measures,” in IJCAI, 2015, pp. 3812–3819.
[11] S. M. Villela, S. de Castro Leite, and R. F. Neto, “Feature selection from microarray data via an ordered search with projected margin,” in IJCAI, 2015, pp. 3874–3881.
[12] W. Jiang, G. Er, Q. Dai, and J. Gu, “Similarity-based online feature selection in content-based image retrieval,” IEEE Transactions on Image Processing, vol. 15, no. 3, pp. 702–712, 2006.
[13] Y. Luo, Y. Wen, D. Tao, J. Gui, and C. Xu, “Large margin multi-modal multi-task feature extraction for image classification,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 414–427, 2016.
[14] H. Tao, C. Hou, F. Nie, Y. Jiao, and D. Yi, “Effective discriminative feature selection with nontrivial solution,” IEEE Transactions on Neural Networks & Learning Systems, vol. 27, no. 4, pp. 3013–3017, 2016.
[15] I. Kononenko, “Estimating attributes: analysis and extensions of RELIEF,” in ECML, 1994, pp. 171–182.
[16] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
[17] H. Liu and H. Motoda, Feature selection for knowledge discovery and data mining. Springer Science & Business Media, 2012, vol. 454.
[18] L. E. Raileanu and K. Stoffel, “Theoretical comparison between the gini index and information gain criteria,” Annals of Mathematics and Artificial Intelligence, vol. 41, no. 1, pp. 77–93, 2004.
[19] M. A. Hall and L. A. Smith, “Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper,” in FLAIRS, 1999, pp. 235–239.
[20] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classification using support vector machines,” Machine Learning, vol. 46, no. 1-3, pp. 389–422, 2002.
[21] F. Nie, H. Huang, X. Cai, and C. H. Ding, “Efficient and robust feature selection via joint ℓ2,1-norms minimization,” in NIPS, 2010, pp. 1813–1821.
[22] D. Han and J. Kim, “Unsupervised simultaneous orthogonal basis clustering feature selection,” in CVPR, 2015, pp. 5016–5023.
[23] J. J. Hull, “A database for handwritten text recognition research,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 5, pp. 550–554, 1994.
[24] X. Chen and M. Wasikowski, “Fast: a roc-based feature selection metric for small samples and imbalanced data classification problems,” in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2008, pp. 124–132.
[25] Q. Shi, B. Du, and L. Zhang, “Spatial coherence-based batch-mode active learning for remote sensing image classification,” IEEE Transactions on Image Processing, vol. 24, no. 7, pp. 2037–2050, 2015.
[26] J.-M. Guo and H. Prasetyo, “Content-based image retrieval using features extracted from halftoning-based block truncation coding,” IEEE Transactions on Image Processing, vol. 24, no. 3, pp. 1010–1024, 2015.
[27] Y. Chen and C. Lin, “Combining svms with various feature selection strategies,” in Feature Extraction. Springer, 2006, pp. 315–324.
[28] M. Qian and C. Zhai, “Robust unsupervised feature selection,” in IJCAI, 2013, pp. 1621–1627.
[29] Q. Gu, Z. Li, and J. Han, “Joint feature selection and subspace learning,” in IJCAI, 2011, pp. 1294–1299.
[30] Y. Zhang and Z.-H. Zhou, “Cost-sensitive face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 10, pp. 1758–1769, 2010.
[31] S. P. Parambath, N. Usunier, and Y. Grandvalet, “Optimizing F-measures by cost-sensitive classification,” in NIPS, 2014, pp. 2123–2131.
[32] K. J. Dembczynski, W. Waegeman, W. Cheng, and E. Hullermeier, “An exact algorithm for F-measure maximization,” in NIPS, 2011, pp. 1404–1412.
[33] I. Pillai, G. Fumera, and F. Roli, “F-measure optimisation in multi-label classifiers,” in ICPR, 2012, pp. 2424–2427.
[34] W. Cheng, K. Dembczynski, E. Hullermeier, A. Jaroszewicz, and W. Waegeman, “F-measure maximization in topical classification,” in Rough Sets and Current Trends in Computing, 2012, pp. 439–446.
[35] K. J. Dembczynski, A. Jachnik, W. Kotlowski, W. Waegeman, and E. Hullermeier, “Optimizing the F-measure in multi-label classification: Plug-in rule approach versus structured loss minimization,” in ICML, 2013, pp. 1130–1138.
[36] N. Ye, K. M. A. Chai, W. S. Lee, and H. L. Chieu, “Optimizing F-measures: a tale of two approaches,” in ICML, 2012.
[37] D. D. Lewis, “Evaluating and optimizing autonomous text classification systems,” in SIGIR, 1995, pp. 246–254.
[38] M. Jansche, “Maximum expected F-measure training of logistic regression models,” in Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, 2005, pp. 692–699.
[39] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, “Large margin methods for structured and interdependent output variables,” JMLR, vol. 6, pp. 1453–1484, 2005.
[40] Y. Yang, “A study of thresholding strategies for text categorization,” in SIGIR, 2001, pp. 137–145.
[41] O. O. Koyejo, N. Natarajan, P. K. Ravikumar, and I. S. Dhillon, “Consistent binary classification with generalized performance metrics,” in NIPS, 2014, pp. 2744–2752.
[42] H. Narasimhan, R. Vaish, and S. Agarwal, “On the statistical consistency of plug-in classifiers for non-decomposable performance measures,” in NIPS, 2014, pp. 1493–1501.
[43] X. Zhu, H. I. Suk, and D. Shen, “Matrix-similarity based loss function and feature selection for Alzheimer’s disease diagnosis,” in CVPR, 2014, pp. 3089–3096.
[44] D. Kong and C. Ding, “Non-convex feature learning via ℓp,∞ operator,” in AAAI, 2014.
[45] M. Wasikowski and X. Chen, “Combating the small sample class imbalance problem using feature selection,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1388–1400, 2010.
[46] J. Kim, Y. Wang, and Y. Yasunori, “The genia event extraction shared task, 2013 edition-overview,” in Proceedings of the BioNLP Shared Task 2013 Workshop, 2013, pp. 8–15.
[47] S. Xiang, F. Nie, G. Meng, C. Pan, and C. Zhang, “Discriminative least squares regression for multiclass classification and feature selection,” IEEE Transactions on Neural Networks & Learning Systems, vol. 23, no. 11, pp. 1738–1754, 2012.
[48] R. He, T. Tan, L. Wang, and W.-S. Zheng, “ℓ2,1 regularized correntropy for robust feature selection,” in CVPR, 2012, pp. 2504–2511.
[49] X. Cai, F. Nie, and H. Huang, “Exact top-k feature selection via l2,0-norm constraint,” in IJCAI, 2013, pp. 1240–1246.
[50] D. Kong, C. Ding, H. Huang, and H. Zhao, “Multi-label ReliefF and F-statistic feature selections for image annotation,” in CVPR, 2012, pp. 2352–2359.
[51] J. Lee and D. Kim, “Fast multi-label feature selection based on information-theoretic feature ranking,” Pattern Recognition, vol. 48, no. 9, pp. 2761–2771, 2015.
[52] C. Elkan, “The foundations of cost-sensitive learning,” in International joint conference on artificial intelligence, vol. 17, no. 1, 2001, pp. 973–978.
Meng Liu received the B.S. degree in 2012 and the D.Sc. degree in 2017 from the School of Electronics Engineering and Computer Science, Peking University, Beijing, China. He is currently working as a research engineer in Laboratory 2012 of Huawei Corporation, Beijing, China. He was a Visiting Scholar with the Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia. His current research interests include machine learning and its applications on classification and imbalance issues.
Yong Luo received the B.E. degree in Computer Science from the Northwestern Polytechnical University, Xi’an, China, in 2009, and the D.Sc. degree in the School of Electronics Engineering and Computer Science, Peking University, Beijing, China, in 2014. He is currently a Research Fellow with the School of Computer Science and Engineering, Nanyang Technological University. He was a visiting student in the School of Computer Engineering, Nanyang Technological University, and the Faculty of Engineering and Information Technology, University of Technology Sydney. His research interests are primarily on machine learning and data mining with applications to visual information understanding and analysis. He has authored several scientific articles at top venues including IEEE T-NNLS, IEEE T-IP, IEEE T-KDE, IJCAI and AAAI. He received the IEEE Globecom 2016 Best Paper Award, and was nominated for the IJCAI 2017 Distinguished Best Paper Award.
Chao Xu received the B.E. degree from Tsinghua University, Beijing, China, the M.S. degree from the University of Science and Technology of China, Hefei, China, and the Ph.D. degree from the Institute of Electronics, Chinese Academy of Sciences, Beijing, in 1988, 1991, and 1997, respectively. He was an Assistant Researcher with the University of Science and Technology of China from 1991 to 1994. Since 1997, he has been with the Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing, where he has been a Professor since 2005. He has authored or co-authored more than 100 papers in journals and conferences, and holds six patents. His current research interests include image and video processing and multimedia technology.
Yonggang Wen (S’99-M’08-SM’14) received the Ph.D. degree in electrical engineering and computer science (with a minor in western literature) from the Massachusetts Institute of Technology, Cambridge, MA, USA, in 2008. He is currently an Associate Professor with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. He has been with Cisco, San Jose, CA, USA, where he led product development in content delivery network, which had a revenue impact of $3 billion globally. His work in multiscreen cloud social TV has been featured by global media (over 1600 news articles from over 29 countries). He has authored or co-authored over 140 papers in top journals and prestigious conferences. His current research interests include cloud computing, green data centers, big data analytics, multimedia networks, and mobile computing. Dr. Wen was a recipient of the ASEAN ICT Award 2013 (Gold Medal) and the Data Center Dynamics Awards 2015-APAC for his work on cloud 3-D view, as the only academia entry, and a co-recipient of the 2015 IEEE Multimedia Best Paper Award, and the Best Paper Award at EAI/ICST Chinacom 2015, IEEE WCSP 2014, IEEE Globecom 2013, and IEEE EUC 2012. He was the Chair for the IEEE ComSoc Multimedia Communication Technical Committee (2014-2016).
Chang Xu is a Lecturer in Machine Learning and Computer Vision at the School of Information Technologies, University of Sydney. He obtained the B.E. degree from Tianjin University, China, and the Ph.D. degree from Peking University, China. While pursuing the PhD degree, he received the PhD fellowships from IBM and Baidu. His research interests lie in Machine Learning algorithms and related applications in Computer Vision, including multi-view learning, multi-label learning, and visual search.

Dacheng Tao (F’15) is Professor of Computer Science and ARC Laureate Fellow in the School of Information Technologies and the Faculty of Engineering and Information Technologies, and the Inaugural Director of the UBTECH Sydney Artificial Intelligence Centre, at the University of Sydney. He mainly applies statistics and mathematics to Artificial Intelligence and Data Science. His research interests spread across computer vision, data science, image processing, machine learning, and video surveillance. His research results have been expounded in one monograph and 500+ publications at prestigious journals and prominent conferences, such as IEEE T-PAMI, T-NNLS, T-IP, JMLR, IJCV, NIPS, ICML, CVPR, ICCV, ECCV, ICDM; and ACM SIGKDD, with several best paper awards, such as the best theory/algorithm paper runner up award in IEEE ICDM’07, the best student paper award in IEEE ICDM’13, the distinguished student paper award in the 2017 IJCAI, the 2014 ICDM 10-year highest-impact paper award, and the 2017 IEEE Signal Processing Society Best Paper Award. He received the 2015 Australian Scopus-Eureka Prize, the 2015 ACS Gold Disruptor Award and the 2015 UTS Vice-Chancellor’s Medal for Exceptional Research. He is a Fellow of the IEEE, AAAS, OSA, IAPR and SPIE.