Weighted principal component extraction with genetic algorithms

Nan Liu, Han Wang
School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore
Applied Soft Computing 12 (2012) 961–974
E-mail addresses: [email protected] (N. Liu), [email protected] (H. Wang)

Article history: Received 1 December 2009; Received in revised form 10 April 2011; Accepted 14 August 2011; Available online 6 October 2011

Keywords: Feature extraction; Principal component analysis; Linear discriminant analysis; Kernel principal component analysis; Genetic algorithms

Abstract

Pattern recognition techniques have been widely used in a variety of scientific disciplines including computer vision, artificial intelligence, and biology. Although many methods achieve satisfactory performance, they still have weaknesses that leave room for further improvement. In this paper, we propose two performance-driven subspace learning methods by extending principal component analysis (PCA) and kernel PCA (KPCA). Both methods adopt a common structure in which genetic algorithms are employed to pursue optimal subspaces. Because the proposed feature extractors aim at achieving high classification accuracy, enhanced generalization ability can be expected. Extensive experiments evaluate the effectiveness of the proposed algorithms on real-world problems including object recognition and a number of machine learning tasks. Comparative studies with other state-of-the-art techniques show that the proposed methods are capable of enhancing the generalization ability of pattern recognition systems.

1. Introduction

In many applications of pattern recognition and information retrieval, system performance depends heavily on the particular choice of features. Feature extraction should derive salient features by reducing redundant information in the data and providing enhanced discriminatory power [1]. Jain et al. [2] pointed out that the probability of misclassification does not increase with the feature dimension as long as the class-conditional densities are completely known. If the feature dimension is much larger than the number of training samples, however, system performance may be poor due to redundancy within the feature vectors. Therefore, increasing the ratio of the number of training samples to the feature dimension is suggested so that degradation of classification accuracy can be avoided [3]. Moreover, a limited set of salient features can alleviate the burden of classifier design. As a consequence, many feature extraction algorithms [4,5] have been proposed in the past few years, by which compact representations can be obtained that reflect the intrinsic characteristics of the original data.

The feature extraction algorithms are broadly categorized into supervised and unsupervised methods depending on whether or not class information is deployed. Among unsupervised approaches, principal component analysis (PCA) [6,7] is one representative technique and has been widely used. Due to its simplicity in both theory and implementation, PCA is commonly used for face recognition [8], ECG signal processing [9], and many other applications.
In image analysis, undesirable artifacts, noise, and occlusions may exist, and one solution is to compensate for pixels with weights. As a result, weighted and robust learning [10] and robust PCA [11] were proposed, where the weights are estimated through optimization algorithms. To fit the data well, several extensions of conventional PCA such as two-dimensional PCA [12] and probabilistic PCA [13] have been developed. Furthermore, Kumar et al. [14] proposed a technique to alleviate the computational complexity of PCA. The aforementioned approaches promise good results only if the samples are linearly separable. However, real-world data such as face images are nonlinearly distributed. Because linear feature extractors are not always suitable for expressing such data, nonlinear techniques are required to handle the nonlinearity, among which kernel PCA (KPCA) [15] is the most prominent. By using the kernel trick, KPCA maps data into a high-dimensional feature space in which the original samples may become linearly separable. Instead of utilizing the kernel trick, a few PCA variations were also developed to tackle nonlinear systems [16,17]. Although nonlinear PCA algorithms have presented impressive results on various tests, the limitations arising from their algorithmic complexity make their success rely heavily on prior knowledge for determining the parameters. Recent advances in nonlinear dimensionality reduction are manifold learning methods, such as locally linear embedding [18,19] and isometric mapping [20]. These techniques discover the inherent nonlinear structure from the observed space directly and have shown their superiority on both artificial and real-world data. However, their unsupervised nature and intrinsic limitations (e.g., the out-of-sample problem [21]) prohibit effective implementation for pattern recognition. Therefore, Chang and Yeung [22] and Li et al. [23] have attempted to improve the standard manifold learning techniques from different aspects.

Unsupervised methods ignore the labels of the input data and maximize the total scatter across all variables by projecting the original data onto a low-dimensional feature space. Theoretically, the projections are optimal for reconstruction from a low-dimensional basis, but they may not be optimal from a discrimination standpoint. On the contrary, supervised dimensionality reduction strategies take class information into account for extracting discriminative features, thus promising to enhance classification ability [23]. Linear discriminant analysis (LDA) [6,24] is a typical supervised algorithm, which extracts features by maximizing the between-class scatter and minimizing the within-class scatter. As a kernel-based extension, kernel Fisher discriminant analysis (KFDA) [25] was proposed to handle nonlinearly distributed data. Nevertheless, although LDA-based methods are able to enhance class separability to obtain discriminative features, they still suffer from various difficulties such as the small sample size (SSS) problem [26]. For example, LDA is not applicable when the number of training samples is smaller than the feature dimension, because the within-class scatter matrix is singular. Wang [27] pointed out that features leading to large class separability do not necessarily result in high predictive accuracy. Moreover, the assumption under which LDA finds the optimal subspace is that the samples from different classes follow a homoscedastic Gaussian model [28], which cannot be satisfied in most real-world problems due to the lack of samples and/or outliers [29]. As a consequence, Tang et al. [29] introduced relevance weights to tackle outliers and improve the overall performance, where the weights are determined either by numerical calculation or through an evolution strategy. In the past few years, evolution strategies were discussed extensively for feature selection [30,31] and classification [32]. Representative works include evolutionary pursuit [1], evolutionary discriminant analysis [33], relevance weighted LDA [29], and GA-Fisher [34]. In addition to genetic algorithm (GA) based frameworks, particle swarm optimization (PSO) was also proposed to be coupled with the random subspace method for feature extraction [35]. In general, researchers attempt to extract discriminative features that represent the original patterns, and meanwhile look for a trade-off between minimizing the training error and maximizing the testing accuracy.

In this paper, a performance-driven framework is proposed for building two novel feature extractors based on PCA and KPCA, in which a modified Fisher criterion is employed to express class separability. Within this strategy, class-dependent weights are introduced to represent differences among the samples from various categories, and genetic algorithms are implemented to seek suitable weights by optimizing the fitness function. As a result, the evolutionary weighted PCA (EWPCA) and the evolutionary weighted KPCA (EWKPCA) are proposed as supervised techniques for dimensionality reduction. Generally speaking, our proposals are motivated by the facts that PCA lacks class information in extracting features and that LDA suffers from the SSS problem in many real-world applications. In the proposed methods, the fitness function acts dominantly in determining the "optimum". Unlike most evolution-based pattern recognition systems, where learning accuracy is considered the sole component of the fitness, we use a modified Fisher criterion as an additional term to express class separability. By utilizing this new fitness, the extracted features can present enhanced discriminatory power in terms of achieving small training error and large class separation. To demonstrate the effectiveness of the proposed feature extractors for classification, EWPCA and EWKPCA are implemented and evaluated in a diverse spectrum of applications such as object recognition, face authentication, and UCI benchmark data sets.


2. Proposed EWPCA algorithm

As mentioned by Jain et al. [2], when the number of training samples is small relative to the dimensionality of the feature vector, added features risk degrading the generalization performance of the classifier. Dimensionality reduction of the features is therefore essential to recognition.

2.1. Weighted principal component analysis

To obtain the best possible performance, prior knowledge about a problem ought to be incorporated into the training phase [36]. If there is no additional knowledge at hand, we should rely on the information embedded in the training samples [11]. As an unsupervised dimensionality reduction algorithm, conventional PCA does not take the characteristics of the training samples into account. Therefore, PCA treats all samples equally regardless of their diverse contributions to generalization.

PCA is a linear feature extractor, whereas most data sets are suspected to be nonlinear. Consequently, features extracted by PCA may not retain the information of the original samples well. Hiden et al. [17] noted that a common procedure to overcome this problem is to "linearize" the data with suitable transformations prior to analysis. In our proposal, weights are employed to compensate the original data so as to facilitate improving classification accuracy. By introducing randomly generated weights that are later adaptively tuned, the proposed weighted PCA (WPCA) is able to incorporate category properties into the subspace from which discriminative features can be extracted. One weighted version of PCA [11] adds weights to each pixel and each individual image so that the learned models fit the training data well. However, in the context of pattern recognition, generalization performance may be poor because subtle changes in the weights impact the outputs dramatically. Moreover, optimizing such a huge number of weights is also a challenge. Details of the proposed WPCA algorithm are as follows.

Assume that there are N patterns, each with M observations in the feature set, so that we have the data set X = [x_1, x_2, ..., x_N] as shown in Fig. 1. Note that each pattern x_k in X belongs to one of c classes {l_1, l_2, ..., l_c}. We define the weight vector as [ω_1, ω_2, ..., ω_N], and assign each entry one of c values {ω_1, ω_2, ..., ω_c} according to its class:

$\omega_i = \omega_j \quad \text{if } x_i \in j\text{th class}$    (1)

where i = 1, ..., N and j = 1, ..., c. The weights are class-dependent so that they can represent the dissimilarities among different classes.

From the original data set, subtracting the grand mean $\bar{X} = (1/N)\sum_{i=1}^{N} x_i$ and multiplying by ω_i creates a weighted data set X_W:

$X_W = [\omega_1 x_1 - \omega_1\bar{X},\ \ldots,\ \omega_N x_N - \omega_N\bar{X}]$    (2)

Subsequently, the covariance matrix of the weighted data set is calculated:

$\Sigma_{X_W} = \frac{1}{N} X_W X_W^T = \frac{1}{N}\sum_{i=1}^{N}(\omega_i x_i - \omega_i\bar{X})(\omega_i x_i - \omega_i\bar{X})^T = \frac{1}{N}\sum_{i=1}^{N}\omega_i^2 (x_i - \bar{X})(x_i - \bar{X})^T$    (3)

Defining the weight $\delta_i = \omega_i^2$, the covariance matrix $\Sigma_{X_W}$ can alternatively be represented as

$\Sigma_{X_W} = \frac{1}{N}\sum_{i=1}^{N}\delta_i (x_i - \bar{X})(x_i - \bar{X})^T = \frac{1}{N}\sum_{i=1}^{N} f(\delta_i, \Sigma_{x_i})$    (4)

where δ_i indicates how strongly the corresponding pattern contributes to the covariance matrix. The function f(a, b) describes the relationship between the weight δ_i and the covariance matrix of pattern x_i. In this study, f(a, b) is an inner product operator, that is, f(a, b) = ⟨a, b⟩. Therefore, the formulation of the covariance matrix shares a similar format with the relevance weighted within-class covariance matrix in reference [29], where the weights are usually estimated based on relationships among classes. Decomposition is then applied to the covariance matrix to calculate its eigenvalues and corresponding eigenvectors:

$\Lambda U_W = \Sigma_{X_W} U_W$    (5)

where $U_W = [u_1, u_2, \ldots, u_N]$ are the orthonormal eigenvectors and Λ is the diagonal matrix containing the eigenvalues of the covariance matrix $\Sigma_{X_W}$. The vectors $u_1, u_2, \ldots, u_m$ corresponding to the m largest eigenvalues form a weighted feature space $U_W = [u_1, u_2, \ldots, u_m]$, where m ≤ N. By projecting a pattern x_k onto the weighted feature space U_W, an m-dimensional feature vector is computed as

$y_k = U_W^T (x_k - \bar{X})$    (6)

The feature vector calculated using (6) is used to represent the original sample, where y_k captures the most dominant features of the original data x_k.

The reconstruction of a variable in data set X is given straightforwardly as $\hat{x}_k = U_W y_k + \bar{X}$, and the reconstruction error ε on the training data is calculated as $\varepsilon = \sum_{k=1}^{N}\|x_k - \hat{x}_k\|^2$. From a statistical point of view, small errors are expected. Nevertheless, a small reconstruction error has no direct relevance to generalization performance in pattern analysis, and thus cannot guarantee high recognition accuracy. As a consequence, minimizing the reconstruction error will not help WPCA extract expressive features. Instead, the determination of the weights can be considered an optimization problem in which the recognition rate is to be maximized. Moreover, the modified Fisher criterion is embedded into the process of feature extraction to make WPCA supervised, so that high predictive accuracy can be obtained. It is known from (1) that the number of weights is determined by c, which is unlikely to be a small number in many real-world applications. Consequently, traditional numerical optimization methods such as gradient descent may not approximate the target function well, as there are too many variables to be optimized.

Fig. 1. The visual structure of the feature set X, where $x_k = [x_k(1), x_k(2), \ldots, x_k(M)]^T$. In this example, each face image belongs to one of c persons.
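For concreteness, the WPCA construction in (1)–(6) can be sketched numerically as follows (Python/NumPy; an illustrative sketch rather than the authors' implementation, with class_weights standing for the class-dependent values that the genetic algorithm of Section 2.2 will supply).

import numpy as np

def wpca_fit(X, labels, class_weights, m):
    """Weighted PCA, eqs. (1)-(5). X is N x M (one pattern per row),
    labels take values 0..c-1, class_weights is the length-c weight vector."""
    N = X.shape[0]
    w = class_weights[labels]                    # omega_i, eq. (1)
    X_bar = X.mean(axis=0)                       # grand mean
    Xw = w[:, None] * (X - X_bar)                # weighted, centered data, eq. (2)
    Sigma = Xw.T @ Xw / N                        # weighted covariance, eq. (3)
    evals, evecs = np.linalg.eigh(Sigma)
    order = np.argsort(evals)[::-1][:m]          # m largest eigenvalues, eq. (5)
    return evecs[:, order], X_bar                # weighted feature space U_W

def wpca_project(X, U_w, X_bar):
    """Project patterns onto the weighted subspace, eq. (6)."""
    return (X - X_bar) @ U_w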

2.2. Optimal weights selection with genetic algorithms

In the evolutionary computation community, genetic algorithms [37] are a class of optimization procedures inspired by the biological mechanisms of reproduction. In this study, GAs are used as the search algorithm to discover the most fitting weights for WPCA, where optimality is defined with respect to classification accuracy and class separability. Given appropriate weights, it is possible not only to enhance training performance but also to overcome the drawback of conventional PCA that between-class and within-class variations cannot be distinguished. In the rest of this section, we describe the representation of the chromosome and subsequently introduce the customized fitness function. Finally, the entire evolving process is illustrated with real-world data, where three basic genetic operators (selection, crossover, and mutation) guide the optimization.

2.2.1. Chromosome encoding and population initialization

We employ a simple encoding scheme in which the chromosome is defined as a set of weights ω_1, ω_2, ..., ω_c. In the chromosome, 10 bits (the resolution) are used to represent each variable. Each chromosome represents a combination of the weights, from which the y_k in (6) are calculated. Fig. 2 depicts the components of each individual (chromosome) in the population. In order to initialize the population we need: (a) the number of bits in a solution candidate, which depends on the required precision; and (b) the total number of solution candidates (the population size).

Fig. 2. The mechanism of chromosome coding in genetic algorithms.

In general, the initial population is randomly generated. Because the population size influences the computation cost and the diversity of individuals, it is usually determined heuristically to give a trade-off between efficiency and variety.
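The encoding above can be illustrated with the following sketch (Python; the decoding range [w_min, w_max] is an assumption, since the text fixes only the 10-bit resolution).

import numpy as np

BITS = 10                                        # bits per weight (resolution)

def random_population(pop_size, c, rng):
    """Initial population: each chromosome concatenates c ten-bit genes."""
    return rng.integers(0, 2, size=(pop_size, c * BITS))

def decode(chromosome, c, w_min=0.0, w_max=2.0):
    """Map each ten-bit gene to a real-valued class weight in [w_min, w_max]."""
    genes = chromosome.reshape(c, BITS)
    ints = genes @ (2 ** np.arange(BITS)[::-1])  # binary -> integer
    return w_min + (w_max - w_min) * ints / (2 ** BITS - 1)

rng = np.random.default_rng(0)
population = random_population(50, 5, rng)       # e.g. 50 individuals, c = 5
class_weights = decode(population[0], 5)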

2.2.2. The fitness function

The chromosome selection for the next generation is done based on the fitness. In most evolution-based methods, training accuracy is employed as the fitness measure [1,29,31]. Taking advantage of Liu and Wechsler's work [1], the fitness function ξ(F), consisting of two terms, is defined as

$\xi(F) = \xi_{ca}(F) + \lambda\,\xi_d(F) = CA + \lambda\,R_{BW}$    (7)

where ξ_ca(F) is the performance accuracy term and ξ_d(F) is the discrimination term. In the fitness function, CA represents how good the extracted features are in terms of classification accuracy, and R_BW is the modified Fisher criterion used to indicate class separability. Since better classification requires large class separability but the latter does not necessarily lead to the former, a positive constant λ is introduced to determine the importance of the second term relative to the first. Maximizing the fitness function will lead to high classification accuracy and large class separability. By combining these two terms with a proper λ, GAs can evolve more discriminative principal components than standard PCA, so good generalization performance can be obtained on both training and testing data sets. In pattern recognition, because classification accuracy is dominant in evaluating the effectiveness of a feature extraction algorithm [1], λ is chosen empirically so that CA contributes more to the fitness than the discrimination term does.

To determine the performance accuracy term ξ_ca(F) of the fitness, the overall accuracy in the training phase could be used. In early studies of evolutionary dimensionality reduction, the accuracy on the testing set was used as the fitness function [38]. However, as emphasized by Sun et al. [30], enrolling the test set is not appropriate when determining fitness, because bias can be introduced into the classification phase. In many recently developed evolution-based pattern recognition systems, a separate validation set is used to overcome the overtraining problem [31,30,39]. Generally speaking, the data set is partitioned into three disjoint sets: training, validation (tuning), and testing. During the learning procedure, the training and validation sets are used to train classifiers and provide feedback to the evolutionary algorithm, respectively. The testing set is then employed to evaluate the effectiveness of learning with the best tuned parameters determined through validation. Although an extra validation set can prevent overfitting, in many real-world applications the training samples are not enough to form a validation set. In such a scenario it is not wise to sacrifice training data to create a validation set, as the decrease of learning data may substantially degrade generalization performance. Therefore, we implement a cross-validation scheme to estimate CA for the fitness. By doing so, the accuracy and stability of testing are likely to be improved, and the overfitting problem may be avoided due to the reduced variance. In detail, let L be the training set; when calculating CA using K-fold cross-validation, approximately (K − 1)N/K instances composing L_trn = {(y_n, l_n)}, n = 1, 2, ..., (K − 1)N/K, with labels l_n ∈ {l_1, l_2, ..., l_c}, are used for training, and the remaining N/K samples constitute the corresponding L_val for validation. h(y, L_trn) is defined as the predictor (classifier) used across training, validation, and testing. Since there are K sets of L_trn, the training procedure is repeated K times, and the mean of all validation accuracies, calculated as $CA = (1/K)\sum_{k=1}^{K} CA_k$, is used for the fitness. Because there are no overlaps between each pair of L_trn and L_val, high classification accuracy can be achieved in the evolving stage and good performance is likely to be observed in testing.
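The estimation of the fitness in (7) can be sketched as follows (Python with scikit-learn; a hypothetical helper, assuming a 1-NN classifier as in the experiments, a precomputed R_BW value, and a user-chosen trade-off lam).

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(features, labels, r_bw, lam=0.1, k_folds=5):
    """Fitness of one chromosome, eq. (7): CA + lam * R_BW.
    `features` are the projections y_k obtained with the decoded weights;
    `r_bw` is the modified Fisher criterion of the weighted data."""
    knn = KNeighborsClassifier(n_neighbors=1)
    ca = cross_val_score(knn, features, labels, cv=k_folds).mean()   # K-fold CV accuracy
    return ca + lam * r_bw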


The second component of the fitness function (7) is the discrimination term ξ_d(F). It is commonly agreed that the features obtained by conventional PCA do not distinguish the different roles of the between-class and within-class variations, because PCA treats them equally. Therefore, the Fisher criterion [24] could be a choice for the class separability term in the fitness function. Nevertheless, S_W could be singular because of the small sample size (SSS) problem [40]. Thus a distance-based modified Fisher criterion R_BW is proposed to indicate the class separability. We define the between-class difference as D_B and the within-class difference as D_W. As described in the definition of the fitness function, R_BW should be maximized during the evolving process in order to improve generalization ability. In this new criterion, rather than using scatter matrices like S_B and S_W in standard LDA, D_B and D_W are calculated to represent the class separation, so that matrix decomposition is not necessary and the singularity of S_W is no longer an issue. To provide a clear description of the calculation of R_BW, Z = [z_1, z_2, ..., z_N] is employed to represent the zero-mean data set with $z_k = \omega_k x_k - \omega_k\bar{X}$, where z_k belongs to one of the c classes {l_1, l_2, ..., l_c}. The differences are then described as the cost of matching, given by the χ² test statistic [6], where M is the dimension of x_k:

$\chi^2(z_i, z_j) = \frac{1}{2}\sum_{t=1}^{M}\frac{[z_i(t) - z_j(t)]^2}{z_i(t) + z_j(t)}$    (8)

The between-class difference can then be calculated as

$D_B = \sum_{i=1}^{c} N_i\,\big|\chi^2(\bar{Z}_i, \bar{Z})\big|$    (9)

and the within-class variation is

$D_W = \sum_{i=1}^{c}\sum_{z_k \in l_i}\big|\chi^2(z_k, \bar{Z}_i)\big|$    (10)

where $\bar{Z}_i$ is the mean of the weighted samples of class l_i, $\bar{Z}$ is the mean of the entire data set, and N_i is the number of samples in class l_i. In order to increase the discrimination of the weighted data set, we want to maximize R_BW, which is defined as

$R_{BW} = \frac{D_B}{D_W} = \frac{\sum_{i=1}^{c} N_i\,|\chi^2(\bar{Z}_i, \bar{Z})|}{\sum_{i=1}^{c}\sum_{z_k\in l_i}|\chi^2(z_k, \bar{Z}_i)|}$    (11)

From (6), we notice that the dimensionality of the original data is reduced by discarding those principal components that do not contribute significantly to the overall variation. However, some information used for distinguishing among different patterns is thereby lost. In the next section, we show how to extract statistically dominant and discriminative features through the GA-based evolutionary process.
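A direct transcription of (8)–(11) might look as follows (Python; illustrative only, with a small epsilon added to guard the denominator of the chi-square statistic, a detail the paper does not discuss).

import numpy as np

def chi2_cost(a, b, eps=1e-12):
    """Chi-square matching cost between two vectors, eq. (8)."""
    return 0.5 * np.sum((a - b) ** 2 / (a + b + eps))

def modified_fisher(Z, labels):
    """R_BW = D_B / D_W over the weighted, zero-mean data Z (N x M), eqs. (9)-(11)."""
    grand_mean = Z.mean(axis=0)
    d_b = d_w = 0.0
    for cls in np.unique(labels):
        Zc = Z[labels == cls]
        class_mean = Zc.mean(axis=0)
        d_b += len(Zc) * abs(chi2_cost(class_mean, grand_mean))      # eq. (9)
        d_w += sum(abs(chi2_cost(z, class_mean)) for z in Zc)        # eq. (10)
    return d_b / d_w                                                 # eq. (11)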

2.2.3. Evolutionary procedure

Three genetic operators, selection, crossover, and mutation, are used to guide the search to pick out the fittest individual. After initializing the population, these genetic operators are applied repeatedly, and the elite chromosome is obtained once the termination criterion is met. In this work, the criterion to stop the evolution is that a pre-defined number of generations N_G has been completed. The individual having the maximum fitness in the population of the last generation provides the optimal weights for WPCA.

According to the results of fitness evaluation, the reproduction process is implemented if the termination criterion is not achieved. The selection operator chooses chromosomes from the current generation into a mating pool by taking into consideration the fitness values of the individuals. Chromosomes with higher fitness values will likely appear more often in the mating pool.
As one of the popular selection techniques, the roulette wheel is widely used. However, it may not yield the expected performance if the population size is small. Therefore, the stochastic universal sampling method [37] is adopted.

Based on the probabilistic selection of individuals, the crossover that exchanges information between two parent chromosomes to generate two offspring is carried out on the mating pool. There are three basic types of crossover: one-point crossover, two-point crossover, and uniform crossover. Different from the other two schemes, in uniform crossover each gene of the offspring is selected randomly from the corresponding genes of the parents. In the proposed WPCA algorithm, since how the weights correlate with each other is unknown, we choose uniform crossover to produce offspring for the next generation.

Subsequently, the mutation operator is applied to the offspring with mutation probability p_m. Mutation of the chromosomes introduces a degree of diversity into the population to prevent premature convergence and helps to sample unexplored regions of the search space. As mutation usually plays a minor role in the evolutionary procedure, p_m is chosen as a small value.
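One generation of this procedure can be sketched as below (Python; a hypothetical illustration using stochastic universal sampling, uniform crossover, and bit-flip mutation as described above, with the probabilities p_c = 0.65 and p_m = 0.004 later used in Section 4; non-negative fitness values are assumed).

import numpy as np

def sus_select(pop, fits, rng):
    """Stochastic universal sampling: evenly spaced pointers on the fitness wheel."""
    cum = np.cumsum(fits / fits.sum())
    start = rng.uniform(0, 1.0 / len(pop))
    pointers = start + np.arange(len(pop)) / len(pop)
    return pop[np.searchsorted(cum, pointers)]

def next_generation(pop, fits, rng, pc=0.65, pm=0.004):
    pool = sus_select(pop, fits, rng)
    rng.shuffle(pool)
    children = pool.copy()
    for i in range(0, len(children) - 1, 2):
        if rng.random() < pc:                                  # uniform crossover
            mask = rng.integers(0, 2, size=children.shape[1]).astype(bool)
            a, b = children[i].copy(), children[i + 1].copy()
            children[i][mask], children[i + 1][mask] = b[mask], a[mask]
    flips = rng.random(children.shape) < pm                    # bit-flip mutation
    children[flips] ^= 1
    return children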

After the pre-defined number of generations, a set of weights $\omega_1^{opt}, \omega_2^{opt}, \ldots, \omega_c^{opt}$ is selected, from which the weighted feature space is constructed. We then project the samples in both the training and testing sets onto the GA-optimized subspace to extract features.

To show the effectiveness of the EWPCA algorithm, a subset of the UCI optical recognition of handwritten digits set (Optdigits) is used for illustration. The data set contains 500 randomly selected training samples and 896 testing patterns of the digits 1, 3, 5, 7, and 8. EWPCA starts its evolution with an initialized population of 50 individuals. Guided by the fitness function, the GA finishes its search for optimal weights after 50 generations. The weights encoded in the best chromosome are then used to construct a new subspace. The results show that the testing accuracies and their corresponding R_BW with features obtained by PCA and EWPCA are 0.8214, 1.0549 and 0.8717, 1.7946, respectively. Apparently, EWPCA provides more discriminative features than PCA in terms of achieving higher predictive accuracy. Moreover, the class separability is maximized through the evolving process. Fig. 3 depicts the two-dimensional mappings of the 500 training samples obtained by implementing PCA and EWPCA. It can be observed in Fig. 3(a) that the projection by PCA cannot clearly distinguish digit 1 and digit 7, whereas the clusters of the patterns in Fig. 3(b) are well separated with the help of EWPCA. The results and figures reveal that although the introduced weights are not able to create a compact within-class scatter, they no doubt enhance the class separability and yield good performance in both training and testing.

Fig. 3. Two-dimensional mappings of a subset of the UCI Optdigits data set: (a) PCA and (b) EWPCA.


3. Proposed EWKPCA algorithm

EWPCA takes advantage of the linear feature extraction algorithms PCA and LDA, whilst most data sets to be tackled in pattern recognition are nonlinear. Therefore, we extend EWPCA from linear to nonlinear by means of the kernel trick [41]. In this section, the weighted kernel PCA (WKPCA) is proposed. The key issue to be addressed in WKPCA is where and how the weights are incorporated. Next, we describe the basic idea of nonlinear mapping, and subsequently elaborate the proposed WKPCA algorithm. GAs are again employed as the optimization tool for weight selection.

3.1. Data manipulation in high dimensional space

First of all, we investigate the procedure of KPCA for nonlinear mapping. Let us consider a feature space F related to the input domain by a map

$\Phi : X \to F, \quad x \mapsto \Phi(x)$    (12)

In this paper, the RBF (Gaussian) kernel function is employed for computation; it is formulated as $k(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$.

Assuming that the input data are centered, i.e., $\sum_{i=1}^{N}\Phi(x_i) = 0$, the covariance matrix has the same formulation as in PCA:

$\Sigma_\Phi = \frac{1}{N}\sum_{j=1}^{N}\Phi(x_j)\Phi(x_j)^T$    (13)

The next step is to find eigenvalues λ ≥ 0 and nonzero eigenvectors U so that $\lambda U = \Sigma_\Phi U$ is satisfied. Because all solutions U with λ ≠ 0 lie in the span of Φ(x_1), ..., Φ(x_N), we can instead consider

$\lambda\langle\Phi(x_n), U\rangle = \langle\Phi(x_n), \Sigma_\Phi U\rangle$    (14)

where n = 1, ..., N and there exist coefficients α_i such that $U = \sum_{i=1}^{N}\alpha_i\Phi(x_i)$. Consequently, we have

$\lambda\sum_{i=1}^{N}\alpha_i\langle\Phi(x_n), \Phi(x_i)\rangle = \frac{1}{N}\sum_{i=1}^{N}\alpha_i\Big\langle\Phi(x_n), \sum_{j=1}^{N}\Phi(x_j)\langle\Phi(x_j), \Phi(x_i)\rangle\Big\rangle$    (15)

By defining an N × N Gram matrix as

$K_{ij} = \langle\Phi(x_i), \Phi(x_j)\rangle$    (16)

we obtain

$N\lambda K\alpha = K^2\alpha$    (17)


where α is a column vector with entries α_1, ..., α_N. As K is nonsingular, (17) can be simply represented as $N\lambda\alpha = K\alpha$.
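For reference, the plain KPCA of (12)–(17) can be sketched as follows (Python; an RBF-kernel illustration independent of the weighted variant developed next).

import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), eq. (16)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma ** 2))

def kpca(X, m, sigma=1.0):
    """Solve N*lambda*alpha = K*alpha and return the first m projections."""
    N = X.shape[0]
    K = rbf_gram(X, sigma)
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one            # center in feature space
    evals, alphas = np.linalg.eigh(Kc)
    order = np.argsort(evals)[::-1][:m]
    evals, alphas = evals[order], alphas[:, order]
    alphas /= np.sqrt(np.maximum(evals, 1e-12))           # scale by 1/sqrt(eigenvalue)
    return Kc @ alphas                                    # nonlinear principal components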

3.2. Weighted kernel construction

So far, we have obtained the kernel matrix. Since the distributions of all the sample classes are assumed to be the same, K deals with each pattern equally. However, the patterns from the same class are quite different from the patterns in other categories. It therefore becomes promising to enhance the discriminatory power of the extracted features if class information can be incorporated into the covariance matrix or the kernel matrix. Recalling our motivation, although LDA introduces the between-class and within-class scatters to show variations among different classes, it may still encounter difficulties such as linearity and the small sample size problem. The EWPCA method introduced previously has the capability of utilizing proper weights to integrate class information into PCA with the aim of yielding good generalization performance.

3.2.1. Can the same trick as WPCA be used?

The rationale behind WPCA is to compensate each pattern by incorporating the weights into the covariance matrix. Is it possible to use a similar method to extend KPCA to a weighted formulation? To answer this question, we have the following theorem.

Theorem 1. If function f(a, b) in (4) is used to formulate the weighted KPCA, the resultant Gram matrix is not a Hermitian matrix.

Proof. By incorporating the same weights used by WPCA into the covariance matrix of KPCA, we have

$\Sigma_\Phi^W = \frac{1}{N}\sum_{j=1}^{N}\omega_j\Phi(x_j)\Phi(x_j)^T$    (18)

Then (15) takes the following form:

$N\lambda\sum_{i=1}^{N}\alpha_i\langle\Phi(x_n), \Phi(x_i)\rangle = \sum_{i=1}^{N}\alpha_i\Big\langle\Phi(x_n), \sum_{j=1}^{N}\omega_j\Phi(x_j)\langle\Phi(x_j), \Phi(x_i)\rangle\Big\rangle$    (19)


The right side of (19) can be read as (20). After expansion, (20) becomes (21). According to the definition of the kernel matrix, the right side of (19) then simplifies to KK′α, where K′, shown in (22), is a variation of the conventional Gram matrix.

$\begin{pmatrix} \vdots & & \vdots \\ \Big\langle\Phi(x_n),\sum_{j=1}^{N}\omega_j\Phi(x_j)\langle\Phi(x_j),\Phi(x_1)\rangle\Big\rangle & \cdots & \Big\langle\Phi(x_n),\sum_{j=1}^{N}\omega_j\Phi(x_j)\langle\Phi(x_j),\Phi(x_N)\rangle\Big\rangle \\ \vdots & & \vdots \end{pmatrix}\begin{pmatrix}\alpha_1\\ \vdots\\ \alpha_N\end{pmatrix}$    (20)

$\begin{pmatrix} \vdots & & \vdots \\ \big\{\omega_1\langle\Phi(x_n),\Phi(x_1)\rangle\langle\Phi(x_1),\Phi(x_1)\rangle + \cdots + \omega_N\langle\Phi(x_n),\Phi(x_N)\rangle\langle\Phi(x_N),\Phi(x_1)\rangle\big\} & \cdots & \big\{\omega_1\langle\Phi(x_n),\Phi(x_1)\rangle\langle\Phi(x_1),\Phi(x_N)\rangle + \cdots + \omega_N\langle\Phi(x_n),\Phi(x_N)\rangle\langle\Phi(x_N),\Phi(x_N)\rangle\big\} \\ \vdots & & \vdots \end{pmatrix}\begin{pmatrix}\alpha_1\\ \vdots\\ \alpha_N\end{pmatrix}$    (21)

$K' = \begin{pmatrix} \omega_1\langle\Phi(x_1),\Phi(x_1)\rangle & \omega_1\langle\Phi(x_1),\Phi(x_2)\rangle & \cdots & \omega_1\langle\Phi(x_1),\Phi(x_N)\rangle \\ \omega_2\langle\Phi(x_2),\Phi(x_1)\rangle & \omega_2\langle\Phi(x_2),\Phi(x_2)\rangle & \cdots & \omega_2\langle\Phi(x_2),\Phi(x_N)\rangle \\ \vdots & \vdots & \ddots & \vdots \\ \omega_N\langle\Phi(x_N),\Phi(x_1)\rangle & \omega_N\langle\Phi(x_N),\Phi(x_2)\rangle & \cdots & \omega_N\langle\Phi(x_N),\Phi(x_N)\rangle \end{pmatrix}$    (22)

Then we can obtain (23).

$N\lambda\alpha = K'\alpha$    (23)

In evaluating the kernel matrix, each entry K′_ij of the Gram matrix reflects the relationship between patterns x_i and x_j, and the weight of K′_ij in K′ is ω_i. However, the weights are expected to correlate closely with K_ij rather than with ω_i alone. In addition, because K′ is not symmetric and thus cannot be considered a Hermitian matrix, the decomposition of K′ will not give exact solutions to (23). Therefore, it is important to find an approach that incorporates the weights while keeping the kernel matrix positive semidefinite. □
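The asymmetry stated in Theorem 1 is easy to verify numerically, since (22) amounts to K′ = diag(ω)K; a small hypothetical check in Python:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
w = rng.uniform(0.5, 2.0, size=6)                         # weights omega_i

sq = np.sum(X ** 2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / 2.0)   # RBF Gram matrix
K_prime = w[:, None] * K                                  # K'_ij = omega_i * K_ij, eq. (22)

print(np.allclose(K_prime, K_prime.T))                    # False: K' is not Hermitian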

3.2.2. Formulation of the weighted kernel matrix

To unify the formulations of WPCA and WKPCA, we intend to present the weighted Gram matrix as $K'_{ij} = f_\Phi(\beta_{ij}, K_{ij})$ using a customized function f_Φ, where β_ij is one of N × N weights. In addition, like δ_i in (4), β_ij should be derived exclusively from the weights ω_i and ω_j.

A first attempt is to use the compensated data X_W to infer the weighted kernel matrix, which would appear as

$K'_{ij} = \langle\Phi(\omega_i x_i), \Phi(\omega_j x_j)\rangle$    (24)

Since the Gaussian kernel is used, K′_ij becomes

$K'_{ij} = \exp\left(-\frac{\|\omega_i x_i - \omega_j x_j\|^2}{2\sigma^2}\right)$    (25)

Obviously, it is not possible to express β_ij in terms of ω_i and ω_j alone.

Therefore, we have to construct a weight matrix and use it to build the weighted kernel matrix K′ through a simple function f_Φ(a, b) that associates β_ij with ω_i and ω_j. As a result, the weight matrix W is designed to have the same dimension as the kernel matrix:

Page 7: Weighted principal component extraction with genetic algorithms

ft Com

mm

W

T

f

K

K

wpmeib

is.rtdie

Ipo

HFp

K

wt

3

fwtvc

ws

N. Liu, H. Wang / Applied So

$W = \begin{pmatrix}\beta_{11} & \beta_{12} & \cdots & \beta_{1N}\\ \beta_{21} & \beta_{22} & \cdots & \beta_{2N}\\ \vdots & \vdots & \ddots & \vdots\\ \beta_{N1} & \beta_{N2} & \cdots & \beta_{NN}\end{pmatrix}$    (26)

Then $K'_{ij} = K_{ij}^{\beta_{ij}}$ is obtained by simply implementing the function $f_\Phi(a_{ij}, b_{ij}) = b_{ij}^{a_{ij}}$, and subsequently a new weighted kernel matrix K′ is derived from the original K:

$K' = \begin{pmatrix}K_{11}^{\beta_{11}} & K_{12}^{\beta_{12}} & \cdots & K_{1N}^{\beta_{1N}}\\ K_{21}^{\beta_{21}} & K_{22}^{\beta_{22}} & \cdots & K_{2N}^{\beta_{2N}}\\ \vdots & \vdots & \ddots & \vdots\\ K_{N1}^{\beta_{N1}} & K_{N2}^{\beta_{N2}} & \cdots & K_{NN}^{\beta_{NN}}\end{pmatrix}$    (27)

where $K'_{ij} = \langle\Phi(x_i), \Phi(x_j)\rangle^{\beta_{ij}}$. In the task of feature extraction for pattern recognition, the major objective is to construct an optimal subspace for extracting discriminative features from which enhanced generalization performance can be obtained. By selecting appropriate weights β_ij, discriminative features are likely to be extracted that attain high classification accuracy.

Having K′, the eigenvalues λ_1, ..., λ_N of the weighted kernel matrix in decreasing order are calculated, and α^1, ..., α^N are the corresponding eigenvectors. The subspace is spanned by U_n, where n = 1, ..., m and m is the index of the last nonzero eigenvalue. We normalize the corresponding eigenvectors so that λ_n⟨α^n, α^n⟩ = 1. By constructing the nonlinear subspace, we can project both training and testing data onto it to extract features. Given a pattern x_k, we calculate its feature vector of reduced dimension by projecting x_k onto the eigenvectors U_n:

$y_k(n) = \langle U_n, \Phi(x_k)\rangle = \sum_{i=1}^{N}\alpha_i^n\langle\Phi(x_i), \Phi(x_k)\rangle$    (28)

In other words, we can extract the first m (1 ≤ m ≤ N) nonlinear principal components, which carry more variance than any other m orthogonal directions, by using the kernel function.

We have so far made the assumption that the data are centered. However, it is difficult to center the input data in the feature space F. Schölkopf and Smola [36] proposed an approach to solve this problem by calculating K̃ from K′:

$\tilde{K}_{ij} = (K' - 1_N K' - K'1_N + 1_N K'1_N)_{ij}$    (29)

where $(1_N)_{ij} := 1/N$ for all i, j. Then K̃ is used instead of K′ for building the weighted nonlinear subspace.
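Combining (27)–(29), the weighted nonlinear subspace can be sketched as follows (Python; the exponent matrix B of the β_ij is assumed given here, and its construction is described in Section 3.3).

import numpy as np

def weighted_kpca(K, B, m):
    """Weighted KPCA: K is the N x N Gram matrix, B holds the exponents beta_ij."""
    N = K.shape[0]
    Kw = K ** B                                           # K'_ij = K_ij ** beta_ij, eq. (27)
    one = np.full((N, N), 1.0 / N)
    Kc = Kw - one @ Kw - Kw @ one + one @ Kw @ one        # centering, eq. (29)
    evals, alphas = np.linalg.eigh(Kc)
    order = np.argsort(evals)[::-1][:m]
    evals, alphas = evals[order], alphas[:, order]
    alphas /= np.sqrt(np.maximum(evals, 1e-12))           # normalize the eigenvectors
    return Kc @ alphas                                    # features y_k(n), eq. (28)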

3.3. Estimation of the weights

Our idea for calculating β_ij is as follows. If x_i and x_j are from the same class, they should be kept as close as possible; thus we define the weight β_ij as 1, so that the relationship between the two patterns is retained as in KPCA. Otherwise, β_ij should be variable so as to express the dissimilarity between two different classes.

Next, the calculation of W is discussed. As described in [42], the weights should represent some information about their corresponding patterns. For instance, in WPCA the weight δ_i shows the dominance of pattern x_i within the entire training set; in WKPCA, the weight β_ij should reflect the relationship between x_i and x_j. Suppose that there are several face images of c classes; it is easy for a human to recognize different persons, but hard for a computer to distinguish them. Unlike supervised methods such as LDA, neither PCA nor KPCA is able to incorporate the class information into the subspaces. Therefore, weights carrying the class information are proposed to be embedded into KPCA. Suppose that there are N weights ω_1, ..., ω_N corresponding to the N patterns and each of them has been assigned one of c values according to its class; we can then construct W as another N × N matrix

$W = \begin{pmatrix}\langle\Phi(\omega_1), \Phi(\omega_1)\rangle & \cdots & \langle\Phi(\omega_1), \Phi(\omega_N)\rangle\\ \vdots & \ddots & \vdots\\ \langle\Phi(\omega_N), \Phi(\omega_1)\rangle & \cdots & \langle\Phi(\omega_N), \Phi(\omega_N)\rangle\end{pmatrix}$    (30)

so that $\beta_{ij} = \langle\Phi(\omega_i), \Phi(\omega_j)\rangle$. Because the Gaussian kernel is used, the weighted kernel matrix K′ has the following formulation:

$K'_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)^{\beta_{ij}} = \exp\left(-\frac{\|x_i - x_j\|^2\,\exp\!\big(-\|\omega_i - \omega_j\|^2/2\sigma^2\big)}{2\sigma^2}\right)$    (31)

From (31), we can analyze the effects of the weights under two conditions:

• If ω_i = ω_j, patterns i and j belong to the same class, so β_ij equals 1. It can be observed in (31) that the weighted kernel matrix then remains the same as the conventional kernel matrix, i.e., K′_ij = K_ij.
• If ω_i ≠ ω_j, β_ij is a positive value (less than 1), in which the correlation between x_i and x_j from different categories is embedded.

The relevance among different patterns results in diversified generalization performance. Therefore, the estimation of the weights ω is crucial to the recognition system. GAs are again employed to select the weights iteratively, driven by the same fitness function (7) used by EWPCA.
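The weight estimation of (30)–(31) can be sketched as follows (Python; the kernel width sigma_w applied to the weights is an assumption, as the text does not name it separately).

import numpy as np

def exponent_matrix(labels, omega, sigma_w=1.0):
    """beta_ij = exp(-(omega_i - omega_j)^2 / (2 sigma_w^2)), eqs. (30)-(31);
    beta_ij = 1 whenever patterns i and j carry the same class weight."""
    w = omega[labels]                                     # per-pattern weights omega_i
    d2 = (w[:, None] - w[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma_w ** 2))

# Usage with the sketches above (all names hypothetical):
# B = exponent_matrix(labels, decode(best_chromosome, c))
# Y = weighted_kpca(rbf_gram(X, sigma), B, m)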

4. Experiments

In this section, we validate EWPCA and EWKPCA on various applications and compare them with PCA, KPCA, and LDA to show their effectiveness in extracting discriminative features. We first show results on face recognition [43], a challenging problem in which feature dimensions are extremely high. Moreover, three UCI data sets [44] are used to examine the appropriateness of the proposed weighted feature extractors on a broad spectrum of problems. Finally, our methods are extensively tested on the task of object categorization using the Columbia Object Image Library (COIL-20) [45].

In problems related to pattern recognition, classifiers should be implemented for categorizing patterns in conjunction with feature extraction methods that find low-dimensional representations of the data. Numerous classifiers, such as the Bayesian classifier [6], the k-nearest neighbor algorithm (k-NN) [46], support vector machines (SVMs) [36], neural networks (NN), and the recently introduced extreme learning machine (ELM) [47], have successfully demonstrated their merits in predicting properties of unseen data. Putting the classification task into the framework of artificial evolution, k-NN has been well investigated by Raymer et al. [31] and has shown its appropriateness for use in evolution-based pattern recognition systems. Because of k-NN's easy implementation and nonparametric nature (assumptions on the distributions of features are not required [31]), we combine the proposed weighted principal component extractors with k-NN to constitute a complete recognition system.

Table 1. Recognition rate on face databases.

Database  m   PCA    KPCA   LDA    EWPCA  EWKPCA
Combo     70  91.48  91.65  90.78  92.17  92.70
          65  91.30  91.65  90.78  91.83  92.00
          60  91.13  91.30  90.78  92.52  91.65
          55  90.09  90.09  90.96  91.13  92.18
          50  90.78  90.61  90.26  91.83  91.13
          45  89.74  89.91  91.13  90.96  91.13
          40  89.74  90.09  90.78  91.30  90.96
          35  89.74  89.91  91.48  92.18  90.78
          30  90.43  90.09  91.83  92.00  91.83
          25  90.09  90.26  91.30  90.96  90.78
          20  87.48  87.48  89.22  90.78  90.09
          15  86.43  86.43  89.57  87.83  89.21
          10  81.57  81.57  81.22  85.91  85.91
          5   62.96  63.13  66.96  72.70  73.22
ORL       35  88.50  90.00  88.00  92.00  93.00
          30  87.00  88.00  87.00  92.00  91.50
          25  88.00  88.00  87.50  91.00  91.00
          20  86.50  87.00  86.50  88.00  89.00
          15  84.00  83.50  84.50  88.00  86.50
          10  82.50  82.50  84.00  85.00  85.00
          5   70.50  70.50  75.00  71.50  72.00
GTFD      45  60.29  64.29  61.43  62.86  70.29
          40  61.71  65.71  60.86  63.43  70.86
          35  63.43  67.43  61.43  66.57  70.29
          30  61.71  64.57  62.86  64.29  68.00
          25  65.14  64.29  65.71  65.71  68.29
          20  62.86  64.57  66.29  67.14  64.86
          15  61.43  62.29  64.00  65.43  63.71
          10  54.57  55.71  65.14  58.00  62.86
          5   48.00  44.29  53.43  55.43  52.86
It is noted that some parameters should be specified for the experimental setup. In the experiments, both the population size Np and the generation number NG are set to 50 for the GAs. Np and NG may seem insufficiently large for discovering the optimal solutions from the traditional viewpoint of the evolutionary computation community. However, in pattern recognition, mature convergence may result in overfitting to the training instances and subsequently degrade the generalization ability in testing. The other GA parameters pc and pm are heuristically set to 0.65 and 0.004, respectively. Furthermore, in determining the fitness values of chromosomes, a five-fold cross-validation scheme is applied so as to give a trade-off between computational time and learning effects. Practically, different combinations of parameters may produce diversified outputs. We keep the aforementioned variables at constant values for all testing cases so as to provide fair comparisons in algorithm evaluation.
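For reference, these experimental settings can be gathered in one place (an illustrative Python configuration; only the values are taken from the text):

ga_settings = {
    "population_size": 50,    # Np
    "generations": 50,        # NG
    "crossover_prob": 0.65,   # pc
    "mutation_prob": 0.004,   # pm
    "cv_folds": 5,            # folds used to estimate CA in the fitness
}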

.1. Face recognition

Machine recognition of faces is emerging as an active research area spanning several disciplines such as image processing, pattern recognition, and computer vision. Face recognition from still images of a scene is the problem of identifying one or more persons in the scene using a stored database of faces. The solution to the problem involves segmentation of faces (face detection) from cluttered scenes, feature extraction from the face region, and recognition. EWPCA and EWKPCA are implemented to extract facial features, and the recognition results are compared with those of PCA, LDA, and KPCA. In face recognition, the number of training samples N is always much smaller than the number of pixels in each image (the image dimension M). Hence the within-class scatter matrix S_W may be singular when LDA is applied. In such a case, we are not able to calculate the optimal subspace directly. As an alternative, the fisherface [24] can be used, in which PCA is first deployed to reduce the dimension of the feature space to N − c, and LDA is then applied to further reduce the feature dimension to c − 1. Thus, the feature vector y_k for any query variable x_k can be calculated as $y_k = \Phi_{opt}^T U^T x_k$. Another problem that should be addressed is the high computational complexity of the proposed methods, as decomposition of the covariance matrix is involved in the evolutionary process. In face recognition, the dimensionality of the feature vector is usually very high. For example, an image of 112 pixels by 92 pixels has 10304 (112 × 92) attributes. Therefore, when PCA-based methods are applied, a 10304 × 10304 covariance matrix is generated for finding optimal basis representations. Our solution is to apply the discrete cosine transform (DCT) to convert each 2D face image to a low-dimensional vector of DCT coefficients, to which the feature extractors can then be applied. This approach is practical as the DCT is a data-independent technique and has been widely used in face recognition [48]. In addition, the framework of combining the DCT with feature extractors is commonly accepted [49,50].
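One common way to realize this DCT step is sketched below (Python with SciPy; the choice of 64 retained low-frequency coefficients is an assumption for illustration, not the paper's exact setting).

import numpy as np
from scipy.fftpack import dct

def dct_features(img, n_coeffs=64):
    """2D DCT of a face image; keep the upper-left (low-frequency) block
    of coefficients and flatten it into a feature vector."""
    d = dct(dct(img.astype(float), axis=0, norm='ortho'), axis=1, norm='ortho')
    k = int(np.sqrt(n_coeffs))
    return d[:k, :k].ravel()

# e.g. a 112 x 92 face image is reduced to 64 DCT coefficients before WPCA/WKPCA.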

4.1.1. Data sets

To assess the performance of our proposals, three sets of face images are used: the ORL database [51], a combo face data set, and the Georgia Tech face database (GTFD) [52]. The combo data set encompasses the ORL, UMIST [53], and Yale [24] databases. Therefore, four stand-alone face image sets are used; their descriptions are as follows. The ORL database contains 400 images of 40 individuals with a high degree of variability in expression and pose. In the experiments, we randomly select 5 samples from each of the 40 persons as the training set, and the remaining 200 samples are used as the test set.

The UMIST Face Database consists of 564 images of 20 people, each covering a range of poses from profile to frontal views. In the experiments, 280 images are used for training and the remaining images for testing. The Yale database contains 165 frontal face images covering 15 individuals taken under 11 different conditions. The training set for the experiments consists of 75 randomly selected images (15 persons with 5 images each), and the remaining 90 images are included in the testing set. The Georgia Tech face database consists of 50 subjects with 15 images each. All the color images, with cluttered backgrounds, are taken at a resolution of 640 × 480 pixels. In the experiments, eight samples per subject are randomly selected for training and the remaining seven images are used for validation. In summary, the combo set consists of 555 training samples and 575 testing images in total, and all the images belong to 75 different classes with large variations of illumination, pose, and facial expression. In the experiments, images in the Yale and GTFD databases are manually cropped and resized to 112 × 92 to make their dimensions identical to those of the samples in ORL and UMIST. Additional pre-processing on GTFD includes converting the cropped images to grey scale.

4.1.2. Results

The testing results are summarized in Table 1, in which predictive accuracies for different feature dimensions are given. It can be observed that both EWPCA and EWKPCA generally perform better than their corresponding origins in terms of extracting discriminative facial features that lead to high classification rates on the testing data. The improvements are particularly significant when low-dimensional feature vectors are extracted for recognition. In analyzing the results on the different data sets, we can see that the increase in CA on GTFD is slightly more obvious. This is reasonable, as the number of training samples in the GTFD database is seven, which is larger than those of ORL and the combo set. With more training images per class, EWPCA and EWKPCA are possibly able to learn sufficient details of the intrinsic structures of the data, and consequently provide more accurate hypotheses on unseen data.


Fig. 4. Results on ORL database: (a) LDA, PCA, and EWPCA; (b) LDA, KPCA, and EWKPCA.

However, maturely trained evolutionary systems on a large number of examples may also suffer from overfitting [32].

From the results on the ORL database in Table 1, we notice that the maximum dimension of the feature vector is constrained to 35, as LDA can extract at most c − 1 attributes, where c is the total number of classes (c is 40 for ORL). Nevertheless, our proposed weighted principal component extractors do not have such constraints. Ideally, it is possible to achieve a higher CA when high-dimensional yet discriminative features are used. Therefore, we apply EWPCA and EWKPCA to extract more features for classification, to illustrate that they can present better generalization performance than LDA and are not limited by constraints on the feature dimension. Fig. 4 depicts the comparison results on the ORL database, in which LDA is not applicable once the feature dimension exceeds 35 (m increases with an interval of 5), while both EWPCA and EWKPCA present high testing accuracies when m > 35. It is also observed that CA does not always improve as the feature dimension increases, because high-dimensional feature vectors may contain redundant information.

Since the population size Np is a key factor in controlling the evolutionary process, it is worth investigating the influence of variations of Np on the classification results. To this end, we experiment with EWPCA over a wide range of population sizes. The results are visualized in Fig. 5.

Fig. 5. Recognition rate on the GTFD data set using EWPCA with different population sizes, where the feature dimension is five.

From the standpoint of evolution strategies, a large Np brings greater variety to the candidate solutions and helps avoid getting stuck in a local optimum. Nevertheless, as claimed in [32], overtraining may occur with a big population while evolving a pattern recognition system. Consequently, parameters acquired from a maturely evolved phase may put the testing on unseen data into an unstable situation in which generalization performance could be very poor. Referring to the results in Fig. 5, it is observed that a large Np neither degrades predictive accuracy nor improves performance. It is also noted that only small changes in the testing outcomes appear whether 20 individuals or 300 chromosomes comprise the population. Moreover, both population sizes help to improve the validation performance significantly. The above observations support the rationale of setting Np to 50 to obtain a trade-off between system performance and the consumption of computational resources.

The small sample size (SSS) problem [26,54] is faced by many applications such as face recognition and microarray data analysis, where feature vectors are of high dimension but limited samples are available for training. Generally speaking, LDA particularly suffers from the SSS problem, so it is not guaranteed to outperform PCA [55]. In consequence, several methods have been developed to address the SSS problem, such as the regularized LDA (RLDA) [56], the direct LDA (DLDA) [57], and the nullspace LDA (NLDA) [58]. To compare these state-of-the-art algorithms with ours, we conduct experiments on both the ORL and GTFD face databases. The experimental results are presented in Table 2. It is observed that DLDA, NLDA, and EWKPCA achieve comparable classification accuracies at higher dimensionality, while DLDA performs poorly when the feature dimension is low. Furthermore, although EWPCA does not improve the classification performance as much as DLDA, NLDA, and EWKPCA, it still outperforms LDA significantly.

Table 2. Comparisons with the state-of-the-art methods in handling the SSS problem on ORL and GTFD face databases.

Database  m   LDA    DLDA   RLDA   NLDA   EWPCA  EWKPCA
ORL       35  88.00  93.00  91.00  93.00  92.00  93.00
          30  87.00  92.00  90.00  91.00  92.00  91.50
          25  87.50  92.00  91.00  90.50  91.00  91.00
          20  86.50  87.50  90.00  90.50  88.00  89.00
          15  84.50  85.50  90.00  86.00  88.00  86.50
          10  84.00  75.00  84.50  86.00  85.00  85.00
GTFD      45  61.43  68.57  63.14  70.50  62.86  70.29
          40  60.86  67.71  63.71  70.00  63.43  70.86
          35  61.43  63.43  62.57  70.29  66.57  70.29
          30  62.86  60.29  63.43  70.86  64.29  68.00
          25  65.71  56.00  65.43  70.00  65.71  68.29
          20  66.29  47.71  63.43  69.71  67.14  64.86


Fig. 6. Experimental results by combining the state-of-the-art LDA-based methods with EWPCA on ORL and GTFD databases: (a) DLDA-based methods on ORL, (b) RLDA-based methods on ORL, (c) DLDA-based methods on GTFD, and (d) RLDA-based methods on GTFD.

Since the proposed methods are derived from PCA and KPCA, it is feasible to combine the PCA-based methods with the LDA-based techniques, as the PCA+LDA method [24] is an effective strategy for extracting discriminative features for classification. In this study, we incorporate PCA and EWPCA with DLDA and RLDA, and validate the resulting algorithms on the ORL and GTFD databases. As seen in Fig. 6, EWPCA+DLDA and EWPCA+RLDA outperform PCA+DLDA and PCA+RLDA in general. Moreover, the combined methods are capable of achieving higher classification accuracies than DLDA and RLDA alone. These observations reveal that better prediction performances can be obtained by integrating EWPCA and EWKPCA with other feature extractors in a straightforward manner.
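Such a combination can be prototyped as a plain two-stage pipeline: a PCA front end (EWPCA would additionally rescale the retained components by the evolved weights), an LDA-style projection, and a simple classifier. The sketch below uses scikit-learn's shrinkage-regularized LDA and a 1-NN classifier purely for illustration; the component count and shrinkage value are placeholders, not the settings used in the experiments.

    from sklearn.pipeline import make_pipeline
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.neighbors import KNeighborsClassifier

    # Two-stage combination in the spirit of PCA+LDA [24]: reduce with PCA, project
    # with a (shrinkage-)regularized LDA, then classify with 1-NN.
    pipeline = make_pipeline(
        PCA(n_components=50),
        LinearDiscriminantAnalysis(solver="eigen", shrinkage=0.1),
        KNeighborsClassifier(n_neighbors=1),
    )
    # pipeline.fit(X_train, y_train)
    # accuracy = pipeline.score(X_test, y_test)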

4.2. UCI data sets

4.2.1. Data description
Three data sets from the UCI machine learning repository [44] are used for the evaluation. Their specifications are listed in Table 3.

Table 3
Summary of UCI data sets. All data sets have been partitioned into two parts for training and testing.

Data set       Feature dimension   Class number   Sample size   Training size   Testing size
Landsat        36                  6              2900          900             2000
Optdigits      64                  10             5620          3823            1797
Segmentation   19                  7              2310          210             2100


The Landsat satellite data set (Landsat) consists of the multi-spectral values of pixels in 3 × 3 neighborhoods of a satellite image. The instances in Landsat are originally separated into training and testing sets with 4435 and 2000 samples, respectively. To alleviate the computational burden in the learning phase, 150 instances per class are randomly selected for evaluating our proposals. The second database is the optical recognition of handwritten digits set (Optdigits), where the training and testing sets are partitioned by default. The third test is carried out on the image segmentation (Segmentation) database, in which the training set is small compared with the testing set.
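The per-class subsampling used for Landsat can be expressed as a short helper; the random seed and variable names are illustrative, since the paper only states that 150 instances per class are drawn at random.

    import numpy as np

    def sample_per_class(X, y, n_per_class=150, seed=0):
        """Randomly keep at most n_per_class training instances from each class."""
        rng = np.random.default_rng(seed)
        keep = []
        for c in np.unique(y):
            idx = np.flatnonzero(y == c)
            keep.extend(rng.choice(idx, size=min(n_per_class, idx.size), replace=False))
        keep = np.sort(np.asarray(keep))
        return X[keep], y[keep]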

4.2.2. Results
Table 4 presents the recognition rates of the five feature extraction algorithms. Similar to the results on the face databases, EWPCA and EWKPCA generally outperform PCA, KPCA, and LDA. Since the class number is small for each UCI data set, more examples per class than for the face images can be used for training.



Table 4
Recognition rate on UCI databases.

Database       m   PCA     KPCA    LDA     EWPCA   EWKPCA
Landsat        5   76.85   79.50   78.10   79.65   80.55
               4   74.75   78.50   76.60   78.15   79.50
               3   73.85   76.90   76.95   77.40   81.90
               2   67.55   74.20   67.70   76.00   74.35
               1   57.35   58.05   56.10   60.60   59.25
Optdigits      9   95.44   95.44   95.44   96.11   96.27
               8   94.16   94.49   94.88   94.71   94.88
               7   93.38   93.54   93.60   94.10   94.60
               6   92.54   92.43   92.26   92.82   92.43
               5   90.26   90.09   89.71   90.76   91.21
               4   81.75   81.97   82.25   84.75   85.14
               3   74.51   75.68   71.40   75.24   75.40
               2   56.04   57.48   53.53   61.99   64.00
               1   31.50   30.94   28.44   35.45   35.56
Segmentation   6   81.43   80.52   80.86   83.38   83.90
               5   81.00   80.57   80.43   81.29   83.81
               4   80.71   80.52   73.43   81.52   82.19
               3   60.71   77.38   72.29   78.67   78.91
               2   61.81   57.38   57.90   63.05   75.19
               1   39.38   50.00   55.90   55.76   59.48

This observation demonstrates that our methods can extract discriminative features for classification in a wide spectrum of applications. It is also noted that the improvements obtained with EWPCA and EWKPCA on the Segmentation and Landsat data sets are significant, especially when low-dimensional representations are used. Given that there are more training samples in the Optdigits set, it is concluded that our proposed weighted feature extractors are able to select discriminative features for classification particularly well when few examples are used for learning. This discovery provides evidence that EWPCA and EWKPCA are suitable for building pattern recognition systems where training samples are very limited, as in many real-world applications. In addition, low-dimensional representations of the original data can reduce the computing requirements significantly.

To obtain discriminative representations through evolution, a suitable fitness function is needed to guide the whole procedure. Therefore, it is important to know the effects of the parameters of the fitness function on the final results. To this end, the subset of Optdigits described in Section 2.2 is chosen for illustration. In detail, EWPCA is applied using the same parameters as those used in the previous demonstration. First of all, we would like to investigate the relationship between the performance accuracy term μca(F) and the discrimination term μd(F). Fig. 7(a)–(c) depict the mean value of the fitness, the class separability criterion RBW, and the testing accuracy, respectively, over different numbers of generations. In this example, it is clear that the evolving process converges after 25 iterations. Moreover, RBW and the testing accuracy are 1.0549 and 0.8214 when PCA is utilized with m = 2. In comparison, it is seen from Fig. 7 that EWPCA achieves better generalization performance and obtains larger class separability than PCA by means of the optimal weights.

A bigger RBW can be obtained by adjusting λ in the fitness, but it should be emphasized that our goal is not to find the "best" low-dimensional representation of the patterns such that the training error is minimized and the underlying distribution of each class can be approximated. Rather, the features are expected to achieve a small training error and, at the same time, perform well in generalization. As a consequence, λ is determined empirically so that the predictive accuracy term in the fitness contributes more than the class separability criterion does. Table 5 presents the results of the evolutionary process with different values of λ, where CAtest indicates the testing accuracy. It is obvious that RBW becomes larger with the increase of λ, while the testing accuracy decreases accordingly. This observation reflects that λ controls the contribution of the class separability criterion to the fitness function. An appropriately selected λ is capable of enlarging RBW as well as achieving good generalization performance. We also notice that CAtest usually reaches its peak value within several generations, and then decreases until becoming stable after 50 iterations of genetic operations. Referring to the previous discussion, mature convergence can yield a small training error but weaken the ability to generalize to unseen data. On the contrary, un-optimized weights may perform poorly in learning and yet achieve high classification accuracy in testing. That is the reason why the values of CAtest at the beginning of the evolution process are normally larger than the final testing results.

Fig. 7. (a–c) Recorded variations of different parameters of EWPCA during evolution, where λ is 0.02. The results are based on a subset of the UCI Optdigits data set.
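The interplay between the two fitness terms can be sketched as follows. The separability function uses a generic trace ratio of between-class to within-class scatter as a stand-in for the weighted criterion RBW defined earlier in the paper, and the fitness adds λ times that ratio to an accuracy estimate; only the intent (a small λ lets the accuracy term dominate) is taken from the paper, not the exact functional form.

    import numpy as np

    def class_separability(F, y):
        """Trace ratio of between- to within-class scatter in the extracted feature
        space (a generic Fisher-type criterion standing in for RBW)."""
        mean = F.mean(axis=0)
        sb = sw = 0.0
        for c in np.unique(y):
            Fc = F[y == c]
            mc = Fc.mean(axis=0)
            sb += Fc.shape[0] * np.sum((mc - mean) ** 2)   # between-class scatter (trace)
            sw += np.sum((Fc - mc) ** 2)                   # within-class scatter (trace)
        return sb / sw

    def fitness(accuracy_estimate, F, y, lam=0.02):
        """Two-term trade-off: accuracy estimate plus lam times the separability term."""
        return accuracy_estimate + lam * class_separability(F, y)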

4.3. Object categorization

4.3.1. Database
The Columbia Object Image Library (COIL-20) [45] is adopted as the object data set for algorithm evaluation. COIL-20 contains 1440 gray-scale images of 20 objects, in which a wide variety of geometric and reflectance characteristics can be observed across different images of the same subject. After normalization and histogram equalization of each picture, 1440 gray-scale images (128 × 128) are obtained, from which 36 images of each subject are selected to construct the training set, and the remaining 720 samples are used for testing. As the 72 images per object are taken in a sequence at every 5 degrees of rotation, there are great variations in pose among the images of a single object. Hence, random selection of training and testing samples may introduce bias into the data distribution. In practice, if the image index is an even number, the corresponding image is chosen for the training set, so that images in the testing set have pose orientations similar to the samples in the training set. Furthermore, DCT is also applied to reduce the dimensionality prior to implementing the feature extraction algorithms.
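The even-index split described above can be written as a small helper; the paper does not state whether pose indices are numbered from 0 or from 1, so the convention in this sketch is an assumption.

    import numpy as np

    def split_coil20(images, object_ids, pose_indices):
        """Images with an even pose index go to training, the rest to testing, so both
        sets cover similar pose orientations."""
        pose_indices = np.asarray(pose_indices)
        train = pose_indices % 2 == 0
        return (images[train], object_ids[train],
                images[~train], object_ids[~train])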

4.3.2. Results
In this section, we are interested in investigating the abilities of EWPCA and EWKPCA in handling the SSS problem with different numbers of training samples.



Table 5
Recognition rate on a subset of the UCI Optdigits database where the feature dimension is two. The class separability RBW of the original data set is 1.0549, and the testing accuracy using features extracted by PCA is 82.14%.

NG     λ = 0.002                  λ = 0.02                   λ = 0.2                    λ = 2
       μ(F)    RBW     CAtest     μ(F)    RBW     CAtest     μ(F)    RBW     CAtest     μ(F)    RBW     CAtest
5      0.9359  1.3697  0.8984     0.9521  1.6199  0.8862     1.2797  2.5986  0.8147     4.7241  2.7013  0.7913
10     0.9546  1.2708  0.9051     0.9808  1.7497  0.8795     1.3958  2.9748  0.8315     6.0721  3.3663  0.6339
15     0.9638  1.2879  0.9085     0.9862  1.8632  0.8795     1.4716  3.4149  0.7623     7.4458  3.9954  0.6787
20     0.9653  1.3035  0.9051     0.9910  1.7747  0.8728     1.4886  3.5337  0.7567     8.4518  4.0363  0.6842
25     0.9668  1.3003  0.9063     0.9904  1.7754  0.8728     1.5248  3.5443  0.7545     8.5733  4.0557  0.6830
30     0.9662  1.3019  0.9040     0.9899  1.7764  0.8727     1.5394  3.5990  0.7478     8.6777  4.0729  0.6775
35     0.9680  1.3019  0.9040     0.9914  1.7792  0.8717     1.5423  3.5988  0.7500     8.7514  4.0793  0.6775
40     0.9676  1.3019  0.9040     0.9944  1.7861  0.8717     1.5492  3.6088  0.7489     8.7546  4.0798  0.6763
45     0.9691  1.3019  0.9040     0.9950  1.7872  0.8717     1.5519  3.6088  0.7489     8.8773  4.1006  0.6730
50     0.9686  1.3038  0.9051     0.9951  1.7946  0.8717     1.5478  3.6542  0.7411     8.8531  4.1006  0.6730

Fig. 8. Recognition rate on the COIL-20 database where the dimensionality of the feature vector is three.

Toward this end, six subsets are constructed using the original training data with various numbers of samples from each object. As there are 36 images per class, according to their indexes, the first six images form the first subset, the second subset comprises the first 12 images, and so forth. The results in Fig. 8 reveal that LDA performs poorly in achieving high classification accuracy, and the effect of the SSS problem is obvious when only six examples are used for training. Meanwhile, PCA and KPCA, as well as our methods, are less sensitive to the number of training samples. These observations support our idea that EWPCA and EWKPCA can inherit the advantages of PCA, KPCA, and LDA while avoiding their drawbacks. Furthermore, Table 6 summarizes the classification results using the various feature extractors when the training is carried out with six samples per object.

Table 6
Recognition rate on the COIL-20 database where six samples per class are used for training.

m    PCA     KPCA    LDA     EWPCA   EWKPCA
19   90.69   90.56   72.08   91.53   91.81
18   91.39   90.28   74.86   92.08   91.81
17   91.25   89.86   73.19   91.53   90.83
16   90.28   89.58   73.89   91.39   90.83
15   90.42   89.58   73.89   91.11   90.69
14   89.17   89.03   75.28   90.83   90.56
13   89.86   88.19   76.11   90.28   90.14
12   89.44   88.06   79.17   90.56   90.28
11   89.31   88.33   78.19   90.00   89.86
10   88.89   87.64   78.75   90.28   89.31

It can also be observed that the discriminatory power of the features extracted by LDA is degraded significantly due to the limited training samples.

5. Discussion

The proposed weighted principal component extraction methods have proven effective on various benchmark data sets. Apparently, EWPCA and EWKPCA outperform PCA, KPCA, and LDA in terms of achieving higher testing accuracies. The results in Table 1 reveal that our methods can tackle high-dimensional data effectively, because face recognition is a challenging task in which redundancy exists among a huge number of attributes. The evaluations on the UCI data sets show that EWPCA and EWKPCA are suitable for a broad spectrum of machine learning applications. Furthermore, both the face databases and the COIL-20 image library are employed to demonstrate the capability of the proposed algorithms in solving the SSS problem. Other than the generalization performance, the issue of algorithmic complexity should be addressed as well. Our methods are difficult to extend to online learning, a limitation also faced by most evolutionary pattern recognition systems. Instead of calculating the complexities quantitatively, we describe them analytically. The major obstacle to accelerating the extraction of evolved features is introduced by the GA, because the feature extraction and classification procedures must be repeated at least Np · NG times in the training phase. Although several measures, such as decreasing the values of Np and NG or choosing simple classifiers, can alleviate the computational burden, the time-consuming procedure originating from the population-based random search remains a weak point of evolutionary pattern recognition systems. In the remainder of this section, some interesting observations obtained from the results are summarized as follows.

• The combination of the merits of PCA, KPCA, and LDA is considered a key advantage of the weighted principal components extractors. Moreover, the proposals also benefit from the ability of GAs to seek a global optimum. In the training phase, the evolution strategy integrates feature extraction and classification so as to determine the optimal weights with respect to providing high training accuracy while simultaneously increasing the class separability. Evidence supporting the superiority of EWPCA and EWKPCA over other feature extraction methods has been shown in the experiments.

• It is worth mentioning that the modified Fisher criterion plays a crucial role, because it not only utilizes class information to make EWPCA and EWKPCA supervised, but also helps to avoid the SSS problem usually faced by LDA-based techniques. The experimental results in Section 4.1.2 show that EWPCA and EWKPCA achieve generalization performances comparable with RLDA, DLDA, and NLDA. In addition, the classification results are significantly improved when the proposed EWPCA algorithm is incorporated with LDA-based techniques.

• We also notice that the features extracted by EWPCA and EWKPCA are compact and discriminative. This observation is strongly supported by the results in Tables 1 and 6. From the classification results on ORL in Table 1, it is noted that only 15 principal components extracted by EWPCA achieve an accuracy of 88%, whereas 25-, 25-, and 35-dimensional features are required to obtain the same testing performance when PCA, KPCA, and LDA are applied, respectively. To the best of our knowledge, compact representations of the original data not only speed up the learning and testing processes but also relieve the burden of storage.

• Since the feature extraction scheme is enhanced by GAs, several parameters, such as Np, pc, and pm, are introduced and should be decided beforehand. Therefore, it is useful to evaluate our methods under different parameter settings. In the experiments, the population size Np and λ in the fitness function are particularly discussed to show their effects on the final results. Fig. 5 depicts the testing accuracies over a large range of population sizes. The results convince us that even a moderate value of Np can lead to good generalization performance. This finding is fairly important, as a small number of individuals in the population can save learning time substantially. Moreover, the results in Table 5 reveal that λ is of primary importance in controlling the trade-off between accuracy and class separability in order to generalize well on unknown data.

In general, both EWPCA and EWKPCA can be employed in applications where improving the classification accuracy is the major task while the time consumption is not the main concern. When conducting the experiments to evaluate our methods, we attempt to present the advantages of the weighted principal components extractors for pattern analysis rather than to achieve the highest possible classification accuracy. Hence, better results can be expected by selecting proper parameters. In practical settings, it is suggested to choose combinations of parameters that are specifically designed for the evolutionary system at hand so that satisfactory outcomes can be obtained.

6. Conclusion

This paper introduces two novel feature extraction algorithms, EWPCA and EWKPCA, and shows their superiority in extracting compact yet discriminative representations from the original data. The fundamental principles behind the proposals are the introduction of weights to PCA and KPCA, and the use of an evolution strategy for selecting optimal parameters. The weighted feature extractor and the classifier are integrated by GAs to build a training framework, from which optimal weights are obtained in terms of achieving high training accuracy. Because PCA and KPCA are unsupervised methods, we take advantage of the weights in creating a modified Fisher criterion RBW and incorporate it into the fitness function as the class separation term. As a consequence, we employ the fitness to evolve a balanced performance, achieving good learning results as well as enhanced class separability.

One challenge that should be addressed is that minimizing the overall learning error may make the GA-based prediction model fit the training data very well and consequently degrade the generalization ability. To deal with this difficulty, we use a five-fold cross-validation scheme, rather than the entire training set, to estimate the learning error. Moreover, we adopt a moderate population size to avoid premature convergence.
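The cross-validated error estimate can be sketched as below; the 1-NN classifier is only a stand-in for whichever classifier is evolved together with the features, and the helper name is illustrative.

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def cv_accuracy(F_train, y_train):
        """Estimate the learning performance of a candidate feature set with five-fold
        cross-validation instead of the resubstitution accuracy on the full training set."""
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), F_train, y_train, cv=5)
        return scores.mean()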


Another difficulty is the great consumption of time for training, which is an inevitable issue faced by almost all other evolutionary systems for pattern analysis. Therefore, fast evolution strategies are worth further investigation to improve the efficiency of the system. Overall, EWPCA and EWKPCA have been demonstrated on a variety of applications to be able to extract discriminative features. The impacts of the parameters of our methods are also well studied and illustrated in the experiments. Moreover, better testing performances than the reported results can be obtained; toward that end, one is suggested to select parameters specifically designed for the application.

References

[1] C. Liu, H. Wechsler, Evolutionary pursuit and its application to face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 570–582.
[2] A.K. Jain, R.P.W. Duin, J.C. Mao, Statistical pattern recognition: a review, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 4–37.
[3] A.K. Jain, B. Chandrasekaran, Dimensionality and sample size considerations in pattern recognition practice, in: P.R. Krishnaiah, L.N. Kanal (Eds.), Handbook of Statistics, vol. 2, North-Holland, Amsterdam, 1982, pp. 835–855.
[4] H. Liu, L. Yu, Toward integrating feature selection algorithms for classification and clustering, IEEE Transactions on Knowledge and Data Engineering 17 (2005) 491–502.
[5] A.R. Webb, Statistical Pattern Recognition, 2nd ed., John Wiley & Sons, New York, 2002.
[6] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley & Sons, New York, 2001.
[7] K.I. Diamantaras, S.Y. Kung, Principal Component Neural Networks: Theory and Applications, Wiley, New York, 1996.
[8] M. Turk, A. Pentland, Eigenfaces for recognition, Journal of Cognitive Neuroscience 3 (1991) 71–86.
[9] F. Castells, P. Laguna, L. Sörnmo, A. Bollmann, J.M. Roig, Principal component analysis in ECG signal processing, EURASIP Journal on Advances in Signal Processing 2007 (2007).
[10] F.D.L. Torre, M.J. Black, A framework for robust subspace learning, International Journal of Computer Vision 54 (2003) 117–142.
[11] D. Skocaj, A. Leonardis, H. Bischof, Weighted and robust learning of subspace representations, Pattern Recognition 40 (2007) 1556–1569.
[12] J. Yang, D. Zhang, A.F. Frangi, J.Y. Yang, Two-dimensional PCA: a new approach to appearance-based face representation and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2004) 131–137.
[13] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.
[14] K.V. Kumar, A. Negi, SubXPCA and a generalized feature partitioning approach to principal component analysis, Pattern Recognition 41 (2008) 1398–1409.
[15] B. Schölkopf, A. Smola, K.R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10 (1998) 1299–1319.
[16] R. Saegusa, H. Sakano, S. Hashimoto, Nonlinear principal component analysis to preserve the order of principal components, Neurocomputing 61 (2004) 57–70.
[17] H.G. Hiden, M.J. Willis, M.T. Tham, G.A. Montague, Non-linear principal components analysis using genetic programming, Computers & Chemical Engineering 23 (1999) 413–425.
[18] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000) 2323–2326.
[19] L.K. Saul, S.T. Roweis, Think globally, fit locally: unsupervised learning of low dimensional manifolds, Journal of Machine Learning Research 4 (2003) 119–155.
[20] J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000) 2319–2323.
[21] B. Li, D.S. Huang, C. Wang, K.H. Liu, Feature extraction using constrained maximum variance mapping, Pattern Recognition 41 (2008) 3287–3294.
[22] H. Chang, D.Y. Yeung, Robust locally linear embedding, Pattern Recognition 39 (2006) 1053–1065.
[23] B. Li, C.H. Zheng, D.S. Huang, Locally linear discriminant embedding: an efficient method for face recognition, Pattern Recognition 41 (2008) 3813–3821.
[24] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (1997) 711–720.
[25] S. Mika, G. Ratsch, J. Weston, B. Schölkopf, A. Smola, K.R. Müller, Constructing descriptive and discriminative nonlinear features: Rayleigh coefficients in kernel feature spaces, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2003) 623–628.
[26] P. Howland, J. Wang, H. Park, Solving the small sample size problem in face recognition using generalized discriminant analysis, Pattern Recognition 39 (2006) 277–287.
[27] L. Wang, Feature selection with kernel class separability, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2008) 1534–1546.
[28] S. Petridis, S.J. Perantonis, On the relation between discriminant analysis and mutual information for supervised linear feature extraction, Pattern Recognition 37 (2004) 857–874.
[29] E.K. Tang, P.N. Suganthan, X. Yao, A.K. Qin, Linear dimensionality reduction using relevance weighted LDA, Pattern Recognition 38 (2005) 485–493.


[30] Z.H. Sun, G. Bebis, R. Miller, Object detection using feature subset selection, Pattern Recognition 37 (2004) 2165–2176.
[31] M.L. Raymer, W.F. Punch, E.D. Goodman, L.A. Kuhn, A.K. Jain, Dimensionality reduction using genetic algorithms, IEEE Transactions on Evolutionary Computation 4 (2000) 164–171.
[32] X. Wang, H. Wang, Classification by evolutionary ensembles, Pattern Recognition 39 (2006) 595–607.
[33] A.A. Sierra, A. Echeverría, Evolutionary discriminant analysis, IEEE Transactions on Evolutionary Computation 10 (2006) 81–92.
[34] W.S. Zheng, J.H. Lai, P.C. Yuen, GA-Fisher: a new LDA-based face recognition algorithm with selection of principal components, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 35 (2005) 1065–1078.
[35] L. Nanni, A. Lumini, Evolved feature weighting for random subspace classifier, IEEE Transactions on Neural Networks 19 (2008) 363–366.
[36] B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, The MIT Press, Cambridge, MA, 2002.
[37] D. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, 1989.
[38] J. Yang, V. Honavar, Feature subset selection using a genetic algorithm, IEEE Intelligent Systems 13 (1998) 44–49.
[39] Q.Y. Zhu, A.K. Qin, P.N. Suganthan, G.B. Huang, Evolutionary extreme learning machine, Pattern Recognition 38 (2005) 1759–1763.
[40] D. Masip, J. Vitrià, Shared feature extraction for nearest neighbor face recognition, IEEE Transactions on Neural Networks 19 (2008) 586–595.
[41] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, UK, 2004.
[42] M. Loog, R.P.W. Duin, R. Haeb-Umbach, Multiclass linear dimension reduction by weighted pairwise Fisher criteria, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2001) 762–766.
[43] S.Z. Li, A.K. Jain (Eds.), Handbook of Face Recognition, Springer, New York, 2005.
[44] A. Frank, A. Asuncion, UCI Machine Learning Repository, 2010. http://archive.ics.uci.edu/ml.
[45] S.A. Nene, S.K. Nayar, H. Murase, Columbia Object Image Library (COIL-20), Technical Report CUCS-005-96, Columbia University, New York, NY, 1996.


[46] G. Shakhnarovich, T. Darrell, P. Indyk (Eds.), Nearest-Neighbor Methods in Learning and Vision: Theory and Practice, The MIT Press, Cambridge, MA, 2006.
[47] G.B. Huang, Q.Y. Zhu, C.K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (2006) 489–501.
[48] Z.M. Hafed, M.D. Levine, Face recognition using the discrete cosine transform, International Journal of Computer Vision 43 (2001) 167–188.
[49] M.J. Er, W.L. Chen, S.Q. Wu, High-speed face recognition based on discrete cosine transform and RBF neural networks, IEEE Transactions on Neural Networks 16 (2005) 679–691.
[50] D. Ramasubramanian, Y.V. Venkatesh, Encoding and recognition of faces based on the human visual model and DCT, Pattern Recognition 34 (2001) 2447–2458.
[51] F.S. Samaria, A.C. Harter, Parameterization of a stochastic model for human face identification, in: Proceedings of IEEE Workshop on Applications of Computer Vision, Florida, 1994, pp. 138–142.
[52] L. Chen, H. Man, A.V. Nefian, Face recognition based on multi-class mapping of Fisher scores, Pattern Recognition 38 (2005) 799–811.
[53] D.B. Graham, N.M. Allinson, Characterizing virtual eigensignatures for general purpose face recognition, in: H. Wechsler, P.J. Phillips, V. Bruce, F. Fogelman-Soulie, T.S. Huang (Eds.), Face Recognition: From Theory to Applications, NATO ASI Series F, Computer and Systems Sciences, vol. 163, 1998, pp. 446–456.
[54] J. Liu, S.C. Chen, X.Y. Tan, A study on three linear discriminant analysis based methods in small sample size problem, Pattern Recognition 41 (2008) 102–116.
[55] A.M. Martínez, A.C. Kak, PCA versus LDA, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2001) 228–233.
[56] D.H. Lin, X.O. Tang, Recognize high resolution faces: from macrocosm to microcosm, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[57] H. Yu, J. Yang, A direct LDA algorithm for high-dimensional data – with application to face recognition, Pattern Recognition 34 (2001) 2067–2070.
[58] W. Liu, Y.H. Wang, S.Z. Li, T.N. Tan, Null space approach of Fisher discriminant analysis for face recognition, in: Proceedings of Biometric Authentication Workshop, 2004, pp. 32–44.