
Semi-supervised Feature Selection via Rescaled Linear Regression

Xiaojun Chen1, Feiping Nie2*, Guowen Yuan1, Joshua Zhexue Huang1
1College of Computer Science and Software, Shenzhen University, Shenzhen 518060, P.R. China

2School of Computer Science and Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi'an 710072, P.R. China

[email protected], [email protected], [email protected], [email protected]

Abstract

With the rapid increase of complex and high-dimensional sparse data, demands for new methods that select features by exploiting both labeled and unlabeled data have increased. Least squares regression based feature selection methods usually learn a projection matrix and evaluate the importance of features with it, which lacks a theoretical explanation. Moreover, these methods cannot find a solution of the projection matrix that is both global and sparse. In this paper, we propose a novel semi-supervised feature selection method which can learn such a global and sparse solution. The new method extends the least squares regression model by rescaling the regression coefficients with a set of scale factors, which are used for ranking the features. We show that the new model can learn a global and sparse solution. Moreover, the introduction of scale factors provides a theoretical explanation for why the projection matrix can be used to rank the features. A simple yet effective algorithm with proven convergence is proposed to optimize the new model. Experimental results on eight real-life data sets show the superiority of the method.

*Corresponding Author.

1 Introduction

Feature selection is an effective means to identify relevant features from high-dimensional data [Liu and Yu, 2005]. During the past ten years, many feature selection methods have been proposed, and various studies show that feature selection can help to remove irrelevant features without performance deterioration [Huang, 2015]. Feature selection can be conducted in a supervised or an unsupervised manner, depending on whether label information is available. In supervised feature selection, feature relevance can be evaluated according to the correlations of the features with the class labels, e.g., Fisher score [Richard et al., 2010], Relief-F [Kira and Rendell, 1992; Kononenko, 1994], RFS [Nie et al., 2010], CSFS [Chang et al., 2014] and GRM [Wang et al., 2015]. In unsupervised feature selection, without label information, feature relevance can be evaluated by feature dependency or similarity, e.g., Laplacian Score [He et al., 2005] and RSFS [Shi et al., 2014].

With the rapid increase of data size, it is often costly to obtain labeled data [Luo et al., 2013]. Therefore, in most machine learning and data mining applications, such as image annotation and categorization, we often have a small set of labeled data together with a large collection of unlabeled data. Under such circumstances, it is desirable to develop feature selection methods that are capable of exploiting both labeled and unlabeled data. The task of conducting feature selection from mixed labeled and unlabeled data is called "semi-supervised feature selection".

Various semi-supervised feature selection methods have been proposed recently. Most of them are filter-based: the features are ranked, and the highly ranked features are selected and fed to a predictor [Zhao and Liu, 2007; Zhao et al., 2008; Doquire and Verleysen, 2013; Xu et al., 2016]. However, as argued in [Guyon and Elisseef, 2003], filter-based feature selection can discard important features that are less informative by themselves but are informative when combined with other features. Ren et al. proposed a wrapper-type forward semi-supervised feature selection framework [Ren et al., 2008], which performs supervised sequential forward feature selection on both labeled and unlabeled data. However, this method is time consuming for high-dimensional data because it involves iterative feature subset searching. Embedded semi-supervised methods take feature selection as part of the training process and are therefore superior to the others in many respects. Kong and Yu proposed a semi-supervised feature selection algorithm for graph data [Kong and Yu, 2010]. Xu et al. proposed a discriminative semi-supervised feature selection method based on the idea of manifold regularization, but their method has a high computational complexity of O(n^2.5), where n is the number of objects [Xu et al., 2010].

Least square regression is a widely-used statistical analysis technique. It has been used in many real-world applications due to its effectiveness for data analysis as well as its completeness in statistics theory. Many variants have been developed, including weighted LSR [Strutz, 2010], partial LSR [Wold et al., 1984], ridge regression [Cristianini and Shawe-Taylor, 2000], and discriminative LSR [Xiang et al., 2012].


Least squares regression based feature selection methods usually learn a projection matrix $W$ and evaluate the importance of features according to $\{\|w^1\|_2, \ldots, \|w^d\|_2\}$. Nie et al. proposed a sparse least squares regression model for supervised feature selection [Nie et al., 2010], which introduces the $\ell_{2,1}$ norm to enforce row sparsity of $W$ and is thus particularly suitable for feature selection. However, it lacks a theoretical explanation for why $\{\|w^1\|_2, \ldots, \|w^d\|_2\}$ can be used to rank the features.

In this paper, we propose a novel semi-supervised feature selection method, named Rescaled Linear Square Regression (RLSR). We first propose a new convex model that extends least squares regression by rescaling the regression coefficients with a set of scale factors, which are used to evaluate the importance of features. The new model is proved to be equivalent to a sparse model that uses an $\ell_{2,1}$ norm regularization term. Therefore, the new model can learn a solution that is both global and sparse. Moreover, the optimal solution of the scale factors provides a theoretical explanation for why we can use $\{\|w^1\|_2, \ldots, \|w^d\|_2\}$ to rank the features. We propose a simple yet effective algorithm with proven convergence to solve the new feature selection objective function. A series of experiments has been performed on real-life data sets. The experimental results show that the new method outperforms seven commonly used feature selection methods, including semi-supervised, supervised and unsupervised feature selection methods.

The rest of this paper is organized as follows. Section 2 presents the notations and definitions used in this paper. In Section 3, the new method RLSR is proposed. The experimental results are presented in Section 4. Conclusions and future work are given in Section 5.

2 Notations and Definitions

We summarize the notations and the definitions of the norms used in this paper. Matrices are written as boldface uppercase letters and vectors as boldface lowercase letters. For a matrix $M = (m_{ij})$, its $i$-th row is denoted as $m^i$ and its $j$-th column is denoted by $m_j$. The Frobenius norm of a matrix $M \in \mathbb{R}^{n \times m}$ is defined as
$$\|M\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} m_{ij}^2}.$$
The $\ell_{2,1}$-norm of a matrix $M \in \mathbb{R}^{n \times m}$ is defined as
$$\|M\|_{2,1} = \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{m} m_{ij}^2}.$$
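To make the two norms concrete, here is a minimal NumPy sketch (ours, not part of the paper); the function names are arbitrary.

```python
import numpy as np

def frobenius_norm(M):
    # ||M||_F = sqrt(sum_ij m_ij^2)
    return np.sqrt(np.sum(M ** 2))

def l21_norm(M):
    # ||M||_{2,1} = sum_i sqrt(sum_j m_ij^2), i.e. the sum of the l2-norms of the rows
    return np.sum(np.sqrt(np.sum(M ** 2, axis=1)))

M = np.array([[3.0, 4.0],
              [0.0, 5.0]])
print(frobenius_norm(M))  # sqrt(9 + 16 + 25) = 7.0710...
print(l21_norm(M))        # 5 + 5 = 10.0
```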

3 The Proposed Method

In semi-supervised learning, a data set $X \in \mathbb{R}^{d \times n}$ with $c$ classes consists of two subsets: a set of $l$ labeled objects $X_L = (x_1, \ldots, x_l)$ associated with class labels $Y_L = \{y_1, \ldots, y_l\}^T \in \mathbb{R}^{l \times c}$, and a set of $u = n - l$ unlabeled objects $X_U = (x_{l+1}, \ldots, x_{l+u})$ whose labels $Y_U = \{y_{l+1}, \ldots, y_{l+u}\}^T \in \mathbb{R}^{u \times c}$ are unknown. Here, $y_i \in \mathbb{R}^c$ $(1 \le i \le l)$ is a binary vector in which $y_i^j = 1$ if $x_i$ belongs to the $j$-th class.

Let $F_1, \ldots, F_d$ denote the $d$ features of $X$; semi-supervised feature selection uses both $X_L$ and $X_U$ to rank $F$. To measure the importance of the $d$ features, we introduce $d$ scale factors $\theta$ in which $\theta_j > 0$ $(1 \le j \le d)$ measures the importance of the $j$-th feature. We use $\theta$ to evaluate the importance of the $d$ features, and the $k$ most important features can be selected according to the $k$ largest values in $\theta$. To learn $\Theta$ and $Y_U$ simultaneously, we form the following convex problem
$$\min_{W,\, b,\, \theta,\, Y_U}\; \|X^T \Theta W + \mathbf{1}b^T - Y\|_F^2 + \gamma \|W\|_F^2 \quad \text{s.t. } \theta > 0,\; \mathbf{1}^T\theta = 1,\; Y_U \ge 0,\; Y_U\mathbf{1} = 1 \quad (1)$$
where $Y_U$ is relaxed to take values in $[0, 1]$, and $\Theta \in \mathbb{R}^{d \times d}$ is a diagonal rescaling matrix with $\Theta_{jj} = \sqrt{\theta_j}$.

Then we rewrite problem (1) as a sparse problem according to the following theorem.

Theorem 1. Problem (1) is equivalent to the following problem
$$\min_{W,\, b,\, Y_U \ge 0,\, Y_U\mathbf{1} = 1}\; \|X^T W + \mathbf{1}b^T - Y\|_F^2 + \gamma \|W\|_{2,1}^2 \quad (2)$$
where $\theta_j$ can be computed as
$$\theta_j = \frac{\|w^j\|_2}{\sum_{j'=1}^{d} \|w^{j'}\|_2} \quad (3)$$

Proof. Let $\tilde{W} = \Theta W$; then $W = \Theta^{-1}\tilde{W}$. Problem (1) can be rewritten as
$$\min_{\tilde{W},\, b,\, \theta,\, Y_U}\; \|X^T \tilde{W} + \mathbf{1}b^T - Y\|_F^2 + \gamma \sum_{j=1}^{d} \frac{\|\tilde{w}^j\|_2^2}{\theta_j} \quad \text{s.t. } \theta > 0,\; \mathbf{1}^T\theta = 1,\; Y_U \ge 0,\; Y_U\mathbf{1} = 1 \quad (4)$$
If $\tilde{W}$ and $Y$ are fixed, we can obtain the optimal solution of $\theta$ by solving the following problem
$$\min_{\theta > 0,\, \theta^T\mathbf{1} = 1}\; \sum_{j=1}^{d} \frac{\|\tilde{w}^j\|_2^2}{\theta_j} \quad (5)$$
It can be verified that the optimal solution of $\theta$ is
$$\theta_j = \frac{\|\tilde{w}^j\|_2}{\sum_{j'=1}^{d} \|\tilde{w}^{j'}\|_2} \quad (6)$$
With this optimal $\theta$, the minimum of problem (5) equals
$$\|\tilde{W}\|_{2,1}^2 \quad (7)$$
So problem (4) can be rewritten as
$$\min_{\tilde{W},\, b,\, Y_U \ge 0,\, Y_U\mathbf{1} = 1}\; \|X^T \tilde{W} + \mathbf{1}b^T - Y\|_F^2 + \gamma \|\tilde{W}\|_{2,1}^2 \quad (8)$$
which is equivalent to problem (2).
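The step from problem (5) to Eq. (6) can be spelled out with a Lagrange multiplier; the following short derivation is added here for completeness and is not part of the original proof. Introducing a multiplier $\lambda$ for the constraint $\theta^T\mathbf{1} = 1$ (the positivity constraint is inactive whenever $\tilde{w}^j \neq 0$) and setting the derivative of
$$\sum_{j=1}^{d} \frac{\|\tilde{w}^j\|_2^2}{\theta_j} + \lambda\left(\sum_{j=1}^{d} \theta_j - 1\right)$$
with respect to $\theta_j$ to zero gives $-\|\tilde{w}^j\|_2^2/\theta_j^2 + \lambda = 0$, i.e. $\theta_j \propto \|\tilde{w}^j\|_2$. Normalizing so that $\sum_{j'}\theta_{j'} = 1$ yields Eq. (6), and substituting back into (5) gives the optimal value $\left(\sum_{j=1}^{d} \|\tilde{w}^j\|_2\right)^2 = \|\tilde{W}\|_{2,1}^2$, which is exactly (7).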


Since the new problem in Eq. (2) uses $\ell_{2,1}$ norm regularization, the learned $W$ is sparse in rows. The scale factor $\theta_j$ is explicitly computed as $\|w^j\|_2 / \sum_{j'=1}^{d} \|w^{j'}\|_2$, which provides a clear theoretical explanation for why the $j$-th feature can be ranked by $\|w^j\|_2$. Note that problem (1) is convex, and therefore problem (2) is also convex. In the following, we propose an effective algorithm to solve problem (2).
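As a concrete illustration of Eq. (3), the following NumPy sketch (ours, not from the paper) turns a learned projection matrix $W$ into scale factors $\theta$ and a top-$k$ feature ranking.

```python
import numpy as np

def rank_features(W, k):
    """Compute theta_j = ||w^j||_2 / sum_j' ||w^j'||_2 (Eq. (3)) and
    return the indices of the k features with the largest theta."""
    row_norms = np.linalg.norm(W, axis=1)   # ||w^j||_2 for every row (feature) j
    theta = row_norms / row_norms.sum()     # scale factors; they sum to one
    top_k = np.argsort(theta)[::-1][:k]     # indices of the k largest entries
    return theta, top_k

# toy example: d = 4 features, c = 2 classes
W = np.array([[0.9, 0.1],
              [0.0, 0.0],
              [0.2, 0.3],
              [0.05, 0.0]])
theta, top2 = rank_features(W, k=2)
print(theta, top2)   # features 0 and 2 receive the largest weights
```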

3.1 Update b with Y and W Fixed

When $Y$ and $W$ are fixed, problem (2) becomes
$$\min_{b}\; \|X^T W + \mathbf{1}b^T - Y\|_F^2 \quad (9)$$
By setting the partial derivative of the above function with respect to $b$ to zero, we get the optimal solution of $b$ as
$$b = \frac{1}{n}\left(Y^T\mathbf{1} - W^T X\mathbf{1}\right) \quad (10)$$
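A one-line NumPy version of the closed-form update in Eq. (10); the helper is our own sketch, with $X$ stored as a $d \times n$ matrix as in the paper.

```python
import numpy as np

def update_b(X, W, Y):
    # b = (1/n) (Y^T 1 - W^T X 1), Eq. (10); X: d x n, W: d x c, Y: n x c
    n = X.shape[1]
    return (Y.T.sum(axis=1) - W.T @ X.sum(axis=1)) / n
```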

3.2 Update W with b and Y_U Fixed

When $b$ and $Y_U$ are fixed, problem (2) becomes
$$\min_{W}\; \|X^T W + \mathbf{1}b^T - Y\|_F^2 + \gamma \|W\|_{2,1}^2 \quad (11)$$
Obviously, $\|W\|_{2,1}$ can be zero in theory, which would make Eq. (11) non-differentiable. To avoid this condition, we regularize $\|W\|_{2,1}^2$ as $\left(\sum_{j=1}^{d} \sqrt{\|w^j\|_2^2 + \epsilon}\right)^2$, where $\epsilon$ is a small enough constant. Therefore, we have
$$\min_{W}\; \|X^T W + \mathbf{1}b^T - Y\|_F^2 + \gamma \left(\sum_{j=1}^{d} \sqrt{\|w^j\|_2^2 + \epsilon}\right)^2 \quad (12)$$
which is equal to problem (11) when $\epsilon$ is infinitely close to zero.

The Lagrangian function of problem (12) is
$$\mathcal{L}(W) = \|X^T W + \mathbf{1}b^T - Y\|_F^2 + \gamma \left(\sum_{j=1}^{d} \sqrt{\|w^j\|_2^2 + \epsilon}\right)^2 \quad (13)$$
Taking the derivative of $\mathcal{L}(W)$ with respect to $W$ and setting it to zero, we have
$$\frac{\partial \mathcal{L}(W)}{\partial W} = 2X(X^T W + \mathbf{1}b^T - Y) + 2\gamma QW = 0 \quad (14)$$
where $Q \in \mathbb{R}^{d \times d}$ is a diagonal matrix whose $j$-th diagonal element is
$$q_{jj} = \frac{\sum_{v=1}^{d} \sqrt{\|w^v\|_2^2 + \epsilon}}{\sqrt{\|w^j\|_2^2 + \epsilon}} \quad (15)$$

Note that $Q$ is unknown and depends on $W$, so we solve $Q$ and $W$ iteratively. With $W$ fixed, $Q$ can be obtained by Eq. (15). With $Q$ fixed, we turn to solve the following problem, which will be shown later to be equivalent to problem (12):
$$\min_{W}\; \|X^T W + \mathbf{1}b^T - Y\|_F^2 + \gamma\, \mathrm{Tr}(W^T Q W) \quad (16)$$
With $Y$ and $Q$ fixed, substituting $b$ from Eq. (10) into Eq. (16), we get
$$\min_{W}\; \|HX^T W - HY\|_F^2 + \gamma\, \mathrm{Tr}(W^T Q W) \quad (17)$$
where $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$. Setting the partial derivative of the above problem with respect to $W$ to zero gives
$$2XH^T(HX^T W - HY) + 2\gamma QW = 0 \quad (18)$$
Then we get the optimal solution of $W$ as
$$W = (XH^T H X^T + \gamma Q)^{-1} XH^T HY \quad (19)$$
Since $H$ is an idempotent matrix, the optimal solution of $W$ can be rewritten as
$$W = (XHX^T + \gamma Q)^{-1} XHY \quad (20)$$
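For completeness (this verification is ours, not in the paper), the centering matrix $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ is symmetric and idempotent, which is the fact used to pass from Eq. (19) to Eq. (20):
$$H^T = I - \tfrac{1}{n}(\mathbf{1}\mathbf{1}^T)^T = H, \qquad H^2 = I - \tfrac{2}{n}\mathbf{1}\mathbf{1}^T + \tfrac{1}{n^2}\mathbf{1}(\mathbf{1}^T\mathbf{1})\mathbf{1}^T = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T = H,$$
so $XH^THX^T = XHX^T$ and $XH^THY = XHY$.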

We propose an iterative algorithm in this paper to obtain the optimal solution of $W$ such that Eq. (20) is satisfied. The algorithm is described in Algorithm 1. In each iteration, $W$ is calculated with the current $Q$, and then $Q$ is updated based on the currently calculated $W$. The iteration procedure is repeated until the algorithm converges.

Algorithm 1 Algorithm to solve problem (12)

1: Input: Data matrix $X \in \mathbb{R}^{d \times n}$, labels $Y \in \mathbb{R}^{n \times c}$.
2: Output: $W \in \mathbb{R}^{d \times c}$ and $Q \in \mathbb{R}^{d \times d}$.
3: Set $t = 0$.
4: Initialize $Q$ as an identity matrix.
5: repeat
6:   Update $W_{t+1} = (XHX^T + \gamma Q_t)^{-1} XHY$.
7:   Update the diagonal matrix $Q_{t+1}$, where the $j$-th diagonal element is $\dfrac{\sum_{v=1}^{d} \sqrt{\|w^v_{t+1}\|_2^2 + \epsilon}}{\sqrt{\|w^j_{t+1}\|_2^2 + \epsilon}}$.
8:   $t := t + 1$
9: until convergence
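Below is a minimal NumPy sketch of Algorithm 1 under our own reading of the pseudocode; the function name, the convergence test on the change of $W$, and the default values of γ, ϵ and the iteration cap are our choices, not the paper's.

```python
import numpy as np

def solve_w(X, Y, gamma=1.0, eps=1e-8, max_iter=50, tol=1e-6):
    """Iteratively solve problem (12): b is eliminated with the centering
    matrix H = I - (1/n) 1 1^T (Eq. (17)), W is updated by Eq. (20) and
    Q by Eq. (15)."""
    d, n = X.shape
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    A = X @ H @ X.T                              # X H X^T (H is idempotent)
    B = X @ H @ Y                                # X H Y
    Q = np.eye(d)                                # step 4: Q initialized to identity
    W = np.zeros((d, Y.shape[1]))
    for _ in range(max_iter):
        W_new = np.linalg.solve(A + gamma * Q, B)          # step 6, Eq. (20)
        row = np.sqrt(np.sum(W_new ** 2, axis=1) + eps)    # sqrt(||w^j||_2^2 + eps)
        Q = np.diag(row.sum() / row)                       # step 7, Eq. (15)
        if np.linalg.norm(W_new - W) <= tol * max(np.linalg.norm(W), 1.0):
            W = W_new
            break
        W = W_new
    return W, Q
```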

3.3 Update Y_U with b and W Fixed

When $W$ and $b$ are fixed, problem (2) decouples over the unlabeled objects $l+1 \le i \le l+u$, so we can solve the following problem individually for each $y_i \in Y_U$:
$$\min_{y_i \ge 0,\, y_i^T\mathbf{1} = 1}\; \|W^T x_i + b - y_i\|_2^2 \quad (21)$$
The Lagrangian function of the above problem is

$$\mathcal{L}(y_i) = \|W^T x_i + b - y_i\|_2^2 + \eta\,(y_i^T\mathbf{1} - 1) - y_i^T\beta_i \quad (22)$$
where $\eta$ and $\beta_i \ge 0$ are the Lagrangian multipliers. It can be verified that the optimal solution of $y_i$ is
$$y_i = (W^T x_i + b + \eta)_+ \quad (23)$$
where $\eta$ can be obtained by solving $y_i^T\mathbf{1} = 1$.
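Eq. (23) together with $y_i^T\mathbf{1} = 1$ amounts to a Euclidean projection of $z_i = W^T x_i + b$ onto the probability simplex. The paper only states that $\eta$ is obtained from $y_i^T\mathbf{1} = 1$; the sort-based search below is one standard way to find it and is our choice, not necessarily the authors'.

```python
import numpy as np

def update_yi(W, x_i, b):
    """Solve Eq. (21): project z = W^T x_i + b onto {y >= 0, y^T 1 = 1},
    i.e. y = (z + eta)_+ with eta chosen so that the entries sum to one."""
    z = W.T @ x_i + b
    u = np.sort(z)[::-1]                     # entries of z in descending order
    css = np.cumsum(u)
    k = np.arange(1, z.size + 1)
    rho = np.nonzero(u + (1.0 - css) / k > 0)[0][-1] + 1
    eta = (1.0 - css[rho - 1]) / rho         # the multiplier eta in Eq. (23)
    return np.maximum(z + eta, 0.0)
```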

The detailed algorithm to solve problem (1), named Rescaled Linear Square Regression (RLSR), is summarized in Algorithm 2. In this algorithm, $W$, $b$ and $Y_U$ are alternately updated until convergence. Finally, $\theta$ is computed from the learned $W$, and the $k$ most important features are selected according to $\theta$.

Algorithm 2 Algorithm to solve problem (1): RLSR

1: Input: Data matrix $X \in \mathbb{R}^{d \times n}$, labels $Y_L \in \mathbb{R}^{l \times c}$, number of selected features $k$, regularization parameter $\gamma$.
2: Output: $k$ selected features.
3: $t := 0$.
4: repeat
5:   Update $W_{t+1}$ via Algorithm 1.
6:   Update $b = \frac{1}{n}(Y^T\mathbf{1} - W^T X\mathbf{1})$.
7:   Update $Y_U$, in which each $y_i \in Y_U$ is calculated from Eq. (23) individually.
8:   $t := t + 1$.
9: until convergence
10: Compute $\theta \in \mathbb{R}^d$, where $\theta_j = \dfrac{\|w^j\|_2}{\sum_{j'=1}^{d} \|w^{j'}\|_2}$.
11: Sort $\theta$ in descending order and select the top $k$ ranked features as the final result.
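Putting the pieces together, here is a hedged end-to-end sketch of Algorithm 2; it reuses the `solve_w` and `update_yi` helpers sketched above. The uniform initialization of $Y_U$ and the convergence test on the objective of problem (2) are not specified in the paper and are our assumptions.

```python
import numpy as np

def rlsr(X, Y_L, k, gamma=1.0, max_iter=30, tol=1e-5):
    """RLSR (Algorithm 2): X is d x n with the first l columns labeled,
    Y_L is l x c one-hot. Returns the indices of the k selected features.
    Relies on solve_w (Algorithm 1) and update_yi (Eq. (23)) defined above."""
    d, n = X.shape
    l, c = Y_L.shape
    Y_U = np.full((n - l, c), 1.0 / c)       # assumption: uniform initial labels
    obj_prev = np.inf
    for _ in range(max_iter):
        Y = np.vstack([Y_L, Y_U])
        W, _ = solve_w(X, Y, gamma=gamma)                     # step 5
        b = (Y.T.sum(axis=1) - W.T @ X.sum(axis=1)) / n       # step 6, Eq. (10)
        Y_U = np.array([update_yi(W, X[:, i], b)              # step 7, Eq. (23)
                        for i in range(l, n)])
        # monitor the objective of problem (2) for convergence (our choice)
        Y = np.vstack([Y_L, Y_U])
        residual = X.T @ W + np.outer(np.ones(n), b) - Y
        obj = np.sum(residual ** 2) + gamma * np.sum(np.linalg.norm(W, axis=1)) ** 2
        if np.isfinite(obj_prev) and abs(obj_prev - obj) <= tol * max(obj, 1.0):
            break
        obj_prev = obj
    theta = np.linalg.norm(W, axis=1)                         # step 10, Eq. (3)
    theta = theta / theta.sum()
    return np.argsort(theta)[::-1][:k]                        # step 11: top-k features
```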

3.4 Convergence Analysis of Algorithm 2

To prove the convergence of Algorithm 2, we first prove the following lemma.

Lemma 1. The following inequality holds for any positive vectors $u \in \mathbb{R}^d$ and $v \in \mathbb{R}^d$:
$$\left(\sum_{j=1}^{d} u_j\right)^2 - \sum_{j=1}^{d} v_j \sum_{j=1}^{d} \frac{u_j^2}{v_j} \;\le\; \left(\sum_{j=1}^{d} v_j\right)^2 - \sum_{j=1}^{d} v_j \sum_{j=1}^{d} \frac{v_j^2}{v_j} \quad (24)$$

Proof. According to the Cauchy-Schwarz inequality, we have
$$\left(\sum_{j=1}^{d} u_j\right)^2 = \left(\sum_{j=1}^{d} \frac{u_j}{\sqrt{v_j}}\sqrt{v_j}\right)^2 \le \sum_{j=1}^{d} \frac{u_j^2}{v_j} \sum_{j=1}^{d} v_j \quad (25)$$
Then we have
$$\left(\sum_{j=1}^{d} u_j\right)^2 - \sum_{j=1}^{d} \frac{u_j^2}{v_j} \sum_{j=1}^{d} v_j \;\le\; 0 \;=\; \left(\sum_{j=1}^{d} v_j\right)^2 - \sum_{j=1}^{d} v_j \sum_{j=1}^{d} \frac{v_j^2}{v_j} \quad (26)$$
which completes the proof.

The convergence of Algorithm 1 can be proven by the following theorem.

Theorem 2. In Algorithm 1, updating $W$ will decrease the objective function of problem (2) until the algorithm converges.

Proof. In the $t$-th iteration, suppose we have obtained the optimal solution $W_{t+1}$ by solving problem (16):
$$W_{t+1} = \arg\min_{W}\; \|X^T W + \mathbf{1}b_t^T - Y_t\|_F^2 + \gamma\, \mathrm{Tr}(W^T Q_t W) \quad (27)$$
which indicates that
$$\|X^T W_{t+1} + \mathbf{1}b_t^T - Y_t\|_F^2 + \gamma \sum_{j=1}^{d} \sqrt{\|w_t^j\|_2^2 + \epsilon} \sum_{j=1}^{d} \frac{\|w_{t+1}^j\|_2^2 + \epsilon}{\sqrt{\|w_t^j\|_2^2 + \epsilon}} \;\le\; \|X^T W_t + \mathbf{1}b_t^T - Y_t\|_F^2 + \gamma \sum_{j=1}^{d} \sqrt{\|w_t^j\|_2^2 + \epsilon} \sum_{j=1}^{d} \frac{\|w_t^j\|_2^2 + \epsilon}{\sqrt{\|w_t^j\|_2^2 + \epsilon}} \quad (28)$$
Based on Lemma 1, we know
$$\gamma\left[\left(\sum_{j=1}^{d} \sqrt{\|w_{t+1}^j\|_2^2 + \epsilon}\right)^2 - \sum_{j=1}^{d} \sqrt{\|w_t^j\|_2^2 + \epsilon} \sum_{j=1}^{d} \frac{\|w_{t+1}^j\|_2^2 + \epsilon}{\sqrt{\|w_t^j\|_2^2 + \epsilon}}\right] \;\le\; \gamma\left[\left(\sum_{j=1}^{d} \sqrt{\|w_t^j\|_2^2 + \epsilon}\right)^2 - \sum_{j=1}^{d} \sqrt{\|w_t^j\|_2^2 + \epsilon} \sum_{j=1}^{d} \frac{\|w_t^j\|_2^2 + \epsilon}{\sqrt{\|w_t^j\|_2^2 + \epsilon}}\right] \quad (29)$$
From the above two inequalities, we get
$$\|X^T W_{t+1} + \mathbf{1}b_t^T - Y_t\|_F^2 + \gamma\left(\sum_{j=1}^{d} \sqrt{\|w_{t+1}^j\|_2^2 + \epsilon}\right)^2 \;\le\; \|X^T W_t + \mathbf{1}b_t^T - Y_t\|_F^2 + \gamma\left(\sum_{j=1}^{d} \sqrt{\|w_t^j\|_2^2 + \epsilon}\right)^2 \quad (30)$$
which completes the proof.

The convergence of Algorithm 2 can be proven by the following theorem.

Theorem 3. Algorithm 2 decreases the objective function of problem (1) at each iteration, and the solution converges to its global optimum.

Proof. In the $t$-th iteration, suppose we have obtained the solution $W_{t+1}$ by solving problem (16); we then fix $W_{t+1}$ and update $b_{t+1}$ and $Y_{t+1}$ separately. According to Theorem 2, we have
$$\|X^T W_{t+1} + \mathbf{1}b_{t+1}^T - Y_{t+1}\|_F^2 + \gamma\left(\sum_{j=1}^{d} \sqrt{\|w_{t+1}^j\|_2^2 + \epsilon}\right)^2 \;\le\; \|X^T W_t + \mathbf{1}b_t^T - Y_t\|_F^2 + \gamma\left(\sum_{j=1}^{d} \sqrt{\|w_t^j\|_2^2 + \epsilon}\right)^2 \quad (31)$$


Therefore, Algorithm 2 decreases the objective function of Eq. (2) at each iteration. Since Theorem 1 shows that problem (2) is equivalent to problem (1), Algorithm 2 also decreases the objective function of Eq. (1) at each iteration. Note that problem (1) is convex; therefore, the solution obtained by Algorithm 2 converges to its global optimum.

According to Theorem 3, our model can learn a global optimal solution of $W$ which is also a sparse solution. Moreover, it can use unlabeled data to improve performance.

Table 1: Characteristics of 8 benchmark data sets.

Name      #Samples  #Features  #Classes
Glass     214       9          6
Segment   2310      19         7
LM        360       90         15
USPS      2007      256        10
Binalpha  1404      320        36
Ecoli     336       343        8
CNAE-9    1080      856        9
Colon     62        2000       2

4 Experimental Results

In order to validate the performance of our proposed method, we have conducted a series of experiments on 8 benchmark data sets. The experimental results are reported in this section.

4.1 Benchmark Data Sets and Comparison Scheme

The 8 benchmark data sets were selected from Feiping Nie's page¹. The characteristics of these 8 data sets are summarized in Table 1.

To validate the effectiveness of RLSR, we compared it with seven state-of-the-art feature selection approaches, including three semi-supervised feature selection methods, sSelect [Zhao and Liu, 2007], LSDF [Zhao et al., 2008] and RRPC [Xu et al., 2016]; three unsupervised feature selection methods, Laplacian Score (LS) [He et al., 2005], UDFS [Yang et al., 2011] and MCFS [Cai et al., 2010]; and a supervised feature selection method, RFS [Nie et al., 2010]. We also used all features to train an SVM as a baseline. We set the regularization parameter γ of LS, LSDF, RFS, UDFS, sSelect and RLSR as {10^-3, 10^-2, 10^-1, 1, 10^2, 10^3}, and λ of sSelect as {0.1, 0.2, 0.3, 0.4, 0.5, 0.6}. The projection dimensions for LS, LSDF, sSelect and UDFS were set empirically around d/3 to 2d/3 in our experiments, where d is the number of features in the data. For the selected features, we first performed 10-fold cross-validation to select the best SVM model, and then tested the selected SVM model on the test part.

For each of the 8 data sets, the training examples were randomly selected with the given ratio {10%, 20%, 30%, 40%, 50%}. The remaining examples were then used as the test data. The test data were also used as the unlabeled data for the semi-supervised feature selection methods.

¹http://www.escience.cn/system/file?fileId=82035
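For reproducibility, the "10-fold cross-validation to select the best SVM model" step can be realized, for instance, with scikit-learn as below; the parameter grid is a placeholder of ours, not the grid used in the paper, and X_train is assumed to be stored samples-by-features.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def best_svm_on_selected_features(X_train, y_train, selected):
    """10-fold CV model selection for an SVM restricted to the selected feature columns."""
    grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10, 100]}, cv=10)
    grid.fit(X_train[:, selected], y_train)
    return grid.best_estimator_
```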

[Figure 1: The maximum accuracies versus the percentage of labeled records by 8 feature selection methods on 8 data sets. Panels: (a) Glass, (b) Segment, (c) LM, (d) USPS, (e) Binalpha, (f) Ecoli, (g) CNAE-9, (h) Colon. Each panel plots accuracy against the percentage of labeled data (0.1 to 0.5) for LS, LSDF, sSelect, PRPC, MCFS, UDFS, RFS, RLSR and SVM.]

4.2 Results and Analysis

The maximum accuracies of the eight feature selection methods versus the percentage of labeled records are shown in Figure 1. In general, the more labeled data we have, the better accuracy we can achieve. This indicates that we are able to select features of higher quality if more labeled data is available. Overall, the proposed method RLSR outperformed the other methods in most cases. For example, on the CNAE-9 data set, RLSR has a 4% average improvement over the second best approach, RFS. A 3% average improvement was achieved by RLSR on the LM data set, compared to the second best approach, LSDF. We also notice that RLSR outperformed RFS in almost all cases, which indicates that RLSR can improve on RFS by using the unlabeled data. This verifies the effectiveness of the semi-supervised feature selection method.

[Figure 2: θ versus different γ. (a) Binalpha data set; (b) CNAE-9 data set.]

4.3 Parameter Sensitivity Study

We show the effect of γ on the performance of RLSR in this section. The relationship between γ and θ is shown in Figure 2. For brevity, we only show the results on two data sets (i.e., the Binalpha and CNAE-9 data sets). As γ increased from 0.001 to 1000, the high weights in θ concentrated on fewer features; that is, with larger γ, θ contains more values that are close to zero. In real applications, we hope that θ assigns high weights to only a few features. Therefore, we can use γ to control the distribution of θ.

In RLSR, γ is used to control the row sparsity of W, and its value can influence the final performance. Varying the value of γ, the average accuracies on two data sets are shown in Figure 3. The results indicate that the performance of RLSR did not change much with the change of γ. In real-life applications, we can perform a hierarchical grid search to obtain better results.
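The "hierarchical grid search" over γ is not detailed in the paper; one simple coarse-to-fine variant (our assumption) is sketched below, where `evaluate` stands for any cross-validated accuracy of the features selected with a given γ.

```python
import numpy as np

def coarse_to_fine_gamma(evaluate, coarse=(1e-3, 1e-2, 1e-1, 1, 1e2, 1e3), levels=2):
    """Pick the best gamma on a coarse grid, then refine the grid around it."""
    best = max(coarse, key=evaluate)
    for _ in range(levels):
        grid = best * np.logspace(-1, 1, 7)   # finer grid spanning one decade around best
        best = max(grid, key=evaluate)
    return best
```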

4.4 Convergence Study

We have proposed Algorithm 1 to iteratively solve problem (12) and proved its convergence in the previous section. Now we experimentally show the speed of its convergence. The convergence curves of the objective value on two data sets are shown in Figure 4. We can see that Algorithm 1 converged very fast, which ensures the speed of the whole proposed approach.

5 Conclusions

In this paper, we have proposed a novel semi-supervised feature selection approach named RLSR. The new method extends least squares regression by rescaling the regression coefficients with a set of scale factors, which are used to rank the features. We have proved that the new model can learn a solution that is both global and sparse. Moreover, the optimal solution of the scale factors provides a theoretical explanation for why we can use $\{\|w^1\|_2, \ldots, \|w^d\|_2\}$ to rank the features. We have proposed a simple yet effective algorithm with proven convergence to solve the new model. Empirical studies have been performed on eight data sets to demonstrate the superior performance of our method over seven commonly used feature selection methods. In future work, we will improve this method to handle large-scale data.

[Figure 3: Average accuracies versus γ and the number of selected features. (a) Binalpha data set; (b) CNAE-9 data set.]

[Figure 4: The objective value versus the number of iterations. (a) Binalpha data set; (b) CNAE-9 data set.]

Acknowledgements

This research was supported by NSFC under Grants no. 61305059, 61473194 and U1636202.


References

[Cai et al., 2010] Deng Cai, Chiyuan Zhang, and Xiaofei He. Unsupervised feature selection for multi-cluster data. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 333–342, 2010.

[Chang et al., 2014] Xiaojun Chang, Feiping Nie, Yi Yang, and Heng Huang. A convex formulation for semi-supervised multi-label feature selection. In Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 1171–1177, 2014.

[Cristianini and Shawe-Taylor, 2000] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines: And Other Kernel-based Learning Methods. Cambridge University Press, New York, NY, USA, 2000.

[Doquire and Verleysen, 2013] Gauthier Doquire and Michel Verleysen. A graph Laplacian based approach to semi-supervised feature selection for regression problems. Neurocomputing, 121:5–13, 2013.

[Guyon and Elisseef, 2003] Isabelle Guyon and André Elisseef. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

[He et al., 2005] Xiaofei He, Deng Cai, and Partha Niyogi. Laplacian score for feature selection. In Advances in Neural Information Processing Systems, pages 507–514, 2005.

[Huang, 2015] Samuel H. Huang. Supervised feature selection: A tutorial. Artificial Intelligence Research, 4(2):22, 2015.

[Kira and Rendell, 1992] Kenji Kira and Larry A. Rendell. A practical approach to feature selection. In The Ninth International Workshop on Machine Learning, pages 249–256, 1992.

[Kong and Yu, 2010] Xiangnan Kong and Philip S. Yu. Semi-supervised feature selection for graph classification. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 793–802. ACM, 2010.

[Kononenko, 1994] Igor Kononenko. Estimating attributes: Analysis and extensions of Relief. In Proceedings of ECML-94, pages 171–182, Italy, 1994. Springer Berlin Heidelberg.

[Liu and Yu, 2005] Huan Liu and Lei Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491–502, 2005.

[Luo et al., 2013] Yong Luo, Dacheng Tao, Chang Xu, Dongchen Li, and Chao Xu. Vector-valued multi-view semi-supervised learning for multi-label image classification. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, AAAI'13, pages 647–653. AAAI Press, 2013.

[Nie et al., 2010] Feiping Nie, Heng Huang, Xiao Cai, and Chris H. Ding. Efficient and robust feature selection via joint ℓ2,1-norms minimization. In Advances in Neural Information Processing Systems, pages 1813–1821, 2010.

[Ren et al., 2008] Jiangtao Ren, Zhengyuan Qiu, Wei Fan, Hong Cheng, and Philip S. Yu. Forward semi-supervised feature selection. In Proceedings of the 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 970–976, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.

[Richard et al., 2010] D. Richard, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, 2010.

[Shi et al., 2014] Lei Shi, Liang Du, and Yi-Dong Shen. Robust spectral learning for unsupervised feature selection. In IEEE International Conference on Data Mining, pages 977–982, Dec 2014.

[Strutz, 2010] Tilo Strutz. Data Fitting and Uncertainty: A Practical Introduction to Weighted Least Squares and Beyond. Vieweg and Teubner, Germany, 2010.

[Wang et al., 2015] D. Wang, F. Nie, and H. Huang. Feature selection via global redundancy minimization. IEEE Transactions on Knowledge and Data Engineering, 27(10):2743–2755, Oct 2015.

[Wold et al., 1984] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn, III. The collinearity problem in linear regression: the partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.

[Xiang et al., 2012] Shiming Xiang, Feiping Nie, Gaofeng Meng, Chunhong Pan, and Changshui Zhang. Discriminative least squares regression for multiclass classification and feature selection. IEEE Transactions on Neural Networks and Learning Systems, 23(11):1738–1754, 2012.

[Xu et al., 2010] Zenglin Xu, Irwin King, Michael Rung-Tsong Lyu, and Rong Jin. Discriminative semi-supervised feature selection via manifold regularization. IEEE Transactions on Neural Networks, 21(7):1033–1047, July 2010.

[Xu et al., 2016] J. Xu, B. Tang, H. He, and H. Man. Semisupervised feature selection based on relevance and redundancy criteria. IEEE Transactions on Neural Networks and Learning Systems, PP(99):1–11, 2016.

[Yang et al., 2011] Yi Yang, Heng Tao Shen, Zhigang Ma, Zi Huang, and Xiaofang Zhou. ℓ2,1-norm regularized discriminative feature selection for unsupervised learning. In International Joint Conference on Artificial Intelligence, pages 1589–1594, 2011.

[Zhao and Liu, 2007] Zheng Zhao and Huan Liu. Semi-supervised feature selection via spectral analysis. In Proceedings of the 2007 SIAM International Conference on Data Mining, pages 641–646, 2007.

[Zhao et al., 2008] Jidong Zhao, Ke Lu, and Xiaofei He. Locality sensitive semi-supervised feature selection. Neurocomputing, 71:1842–1849, 2008.
