
Supervised Sparse Representation with Coefficients’ Group Constraint

Xin Guo, Zhicheng Zhao, Anni Cai
Beijing University of Posts and Telecommunications

Beijing, China, 100876
{guoxin, zhaozc, annicai}@bupt.edu.cn

Abstract

Sparse representation has recently gained much attention from researchers due to its powerful ability to represent and compress the original sample. The sparse representation based classification (SRC) method was proposed for face recognition and has been applied to many other fields; it sparsely represents a test sample on the training set and minimizes the reconstruction error. For a better representation, it is desirable that the original sample be represented as much as possible by samples from its own class. Based on this assumption, in this paper a group constraint on the coefficients is introduced into the objective function of sparse representation to penalize non-zero coefficients associated with classes different from that of the original sample. The resulting function is solved efficiently by the conventional subgradient method. Experiments on several databases from three fields, namely face recognition, digit recognition and natural image classification, demonstrate the effectiveness of the proposed algorithm.

1. Introduction

Recently, a great deal of interest, enthusiasm and progress has been devoted to sparse representation (or coding) due to its powerful ability to represent and compress the original signal. Based on this technique, many applications have been developed, such as face recognition [14], digit and texture classification [4, 17], image denoising [1], image super-resolution [16] and image classification [11, 15]. The success of sparse representation mainly stems from the fact that signals such as audio or images can naturally be sparsely represented w.r.t. a fixed basis that is over-complete [14]. Furthermore, efficient and effective convex optimization algorithms make the problem easy to solve.

B. Olshausen et al. [9] produced a set of spatially localized, oriented and bandpass filters by imposing a sparsity constraint on the coefficients over natural images. The filters are very similar to the receptive fields of simple cells in the mammalian primary visual cortex.

To apply sparse representation to classification tasks, J. Wright et al. proposed the Sparse Representation Classification (SRC) method for robust face recognition in [14]. Based on a dictionary formed by the observed samples, each new sample is sparsely represented as a linear combination of a few atoms from the dictionary in the learning stage, and residual errors between the new (test) sample and the approximations reconstructed from atoms associated with different classes are computed. The test sample is then classified to the class with the minimum residual error.

Following this basic idea, sparse representation has found several other applications. J. Gemmeke et al. [4] applied it to automatic speech recognition (ASR) to reduce the influence of noise and obtained better performance than an HMM-based speech decoder. A. Yang et al. also used it for sensor networks and human activity classification in [15]. J. Yang et al. sparsely represented features extracted from images on a well-learned dictionary and formed a fixed-dimension vector with a "max pooling" strategy. This method is similar to the bag-of-words (BoW) model, which is a conventional method for image classification tasks.

A number of extensions have also been made based on the ℓ1-norm constraint, known as regularization in some of the literature. One extension is robust sparse coding (RSC), proposed by M. Yang et al. in [18]. Considering that the distribution of the residual error is no longer Laplacian or Gaussian (as assumed for ℓ2-norm minimization), they introduced a weight term into the loss function to modify the residual error. Another extension is related to graph-based methods [10, 2]. The key idea is to interpret the coefficients as weights in a directed graph. This approach can be used to handle noise, outliers as well as missing data. There are also other extensions for different applications, such as object recognition using attributes [3]. Here, the sparse coefficients computed by ℓ1-norm minimization are used to characterize relationships between an object category and its attributes.

In this paper, we consider that a sample should be represented by atoms from few or even a single class (the same class as the sample's) and that the corresponding coefficients of these atoms should carry most of the weight. More details are given in Section 3.

Motivated by this idea, we propose a sparse representation algorithm with a group constraint on the coefficients that penalizes the coefficients associated with classes different from the sample's class.

The remainder of this paper is organized as follows. In the next section, we briefly review the sparse representation classification algorithm. In Section 3, we present the proposed sparse representation algorithm with a group constraint on the coefficients. In Sections 4 and 5, we discuss the experimental results and give the conclusions, respectively.

2. Sparse representation classification

Sparse representation represents a sample x ∈ R^m as a linear combination of a few atoms of a dictionary, which is formulated as:

x = Dα    (1)

where D ∈ R^{m×k} is the dictionary with each column representing a basis vector (atom), and α ∈ R^k is the sparse coefficient vector. Taking the reconstruction error into account, sparse representation can be seen as seeking the optimal solution of a cost function:

argmin_α ‖α‖_0   s.t.  ‖x − Dα‖_2^2 < ε    (2)

or

argmin_α ‖x − Dα‖_2^2 + λ‖α‖_0    (3)

where λ is a constant that trades off the reconstruction error against the sparsity of the coefficients. However, the problem with the ℓ0-norm constraint is non-convex and NP-hard to solve, so the ℓ0-norm constraint on the sparse coefficient α is relaxed to an ℓ1-norm constraint, as shown in Eq. (4). The resulting convex problem with the ℓ1-norm constraint can then be solved efficiently, e.g., by the Lasso algorithm [9].

argmin_α ‖x − Dα‖_2^2 + λ‖α‖_1    (4)
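As an illustration, the ℓ1 problem in Eq. (4) can be solved with an off-the-shelf Lasso solver. The following is a minimal sketch using scikit-learn (not the implementation used in the paper), assuming the dictionary D stores one training sample per column; since scikit-learn's Lasso minimizes (1/(2m))‖x − Dα‖_2^2 + alpha‖α‖_1, its regularization parameter is a rescaled version of λ.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_code(x, D, lam=0.01):
    """Solve Eq. (4): min_a ||x - D a||_2^2 + lam * ||a||_1.

    x : (m,) sample, D : (m, k) dictionary (one atom per column).
    scikit-learn's Lasso minimizes (1/(2m))||x - D a||^2 + alpha*||a||_1,
    so alpha = lam / (2 m) gives an equivalent problem.
    """
    m = D.shape[0]
    lasso = Lasso(alpha=lam / (2 * m), fit_intercept=False, max_iter=10000)
    lasso.fit(D, x)
    return lasso.coef_  # sparse coefficient vector alpha, shape (k,)

# toy usage: a random dictionary and a sample built from two of its atoms
rng = np.random.default_rng(0)
D = rng.standard_normal((30, 100))
x = 0.7 * D[:, 3] + 0.3 * D[:, 42]
alpha = sparse_code(x, D)
print(np.nonzero(np.abs(alpha) > 1e-6)[0])  # indices of active atoms
```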

Suppose that we have training samples {x_{1,1}, ..., x_{1,N_1}, ..., x_{i,j}, ..., x_{C,1}, ..., x_{C,N_C}} ∈ R^{m×k} (i ∈ [1, ..., C], j ∈ [1, ..., N_i], k = Σ_i N_i). In the sparse representation classification approach proposed by Wright et al. [14], the dictionary D = [D_1, ..., D_C] is initialized by taking the samples {x_{i,1}, ..., x_{i,N_i}} of the i-th class as the columns of D_i, and the sparse coefficient vector α = [α_1^T, ..., α_C^T]^T is solved from Eq. (4). Then the test sample x is classified to the class with the minimum reconstruction error:

label(x) = argmin_i ‖x − D_i α_i‖_2^2    (5)

The reconstruction error measures how similar the test sample is to the training samples, so the test sample is classified to the class that contains similar training samples.

Figure 1. To increase the power of the representation, we expect the non-zero coefficients to be grouped into one class. Suppose sample x belongs to class 1; we expect the non-zero coefficients (colored blue in the top figure) to move to the positions of the coefficients α_1 (colored blue in the bottom figure), which are associated with the dictionary atoms D_1 that also come from class 1.
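For concreteness, the SRC decision rule in Eq. (5) can be written in a few lines of NumPy. The sketch below is only an illustration: it assumes the coefficient vector alpha has already been obtained (e.g., with the Lasso sketch above) and that class_index maps each dictionary column to its class label; these names are illustrative, not from the paper.

```python
import numpy as np

def src_classify(x, D, alpha, class_index):
    """Eq. (5): assign x to the class with the smallest reconstruction error.

    x : (m,) test sample
    D : (m, k) dictionary, one training sample per column
    alpha : (k,) sparse coefficients of x over D
    class_index : (k,) integer class label of each column of D
    """
    residuals = {}
    for c in np.unique(class_index):
        mask = (class_index == c)                # keep only coefficients of class c
        x_hat = D[:, mask] @ alpha[mask]         # reconstruction D_c alpha_c
        residuals[c] = np.sum((x - x_hat) ** 2)  # ||x - D_c alpha_c||_2^2
    return min(residuals, key=residuals.get)
```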

3. Supervised Sparse Representation with Coefficients' Group Constraint

In sparse representation, a sample is usually represented by a few atoms of the dictionary due to the sparsity constraint on the coefficients. However, we would like to further push the coefficients representing the sample into the same class as the sample itself. In other words, it is better to represent the original sample only by atoms coming from the original sample's class. The change in the coefficients that we expect is illustrated in Figure 1. With this motivation, we penalize the coefficients belonging to classes different from the original sample's class and expect to concentrate most of the non-zero coefficients on the original sample's class. The sparse representation problem with a group constraint on the coefficients is then formulated as:


argmin_α ‖x − Dα‖_2^2 + λ_1 ‖α‖_1 + λ_2 Σ_{i=1}^{C} ω_i ‖α_i‖_2    (6)

ω_i = 0 if i = c, and ω_i = 1 otherwise    (7)

where λ_1 and λ_2 are scalar constants that balance the reconstruction error and the coefficient penalties, and both are empirically set to 0.01; c is the predefined class label. The problem in Eq. (6) is a convex optimization problem, and the conventional coordinate descent method can be used to solve it.
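As a small illustration of Eqs. (6) and (7), the sketch below evaluates the objective for a candidate coefficient vector; the group boundaries (class_index) and the function name are illustrative, not from the paper.

```python
import numpy as np

def group_sparse_objective(x, D, alpha, class_index, c, lam1=0.01, lam2=0.01):
    """Value of Eq. (6) for a given alpha.

    class_index : (k,) class label of each dictionary column
    c           : predefined class label of x (its own group is not penalized)
    """
    recon = np.sum((x - D @ alpha) ** 2)      # ||x - D alpha||_2^2
    l1 = lam1 * np.sum(np.abs(alpha))         # lambda_1 * ||alpha||_1
    group = 0.0
    for i in np.unique(class_index):
        w = 0.0 if i == c else 1.0            # omega_i from Eq. (7)
        group += w * np.linalg.norm(alpha[class_index == i])
    return recon + l1 + lam2 * group
```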

For class c, we denote the dictionary by D_c = Z = (Z_1, Z_2, ..., Z_{N_c}), the coefficients by α_c = θ = (θ_1, θ_2, ..., θ_{N_c}), and the residual over the dictionaries of the other classes by r = x − Σ_{k≠c} D_k α_k. The subgradient equation of Eq. (6) over {θ_j} (j ∈ {1, 2, ..., N_c}) is

−2 Z_j^T (r − Σ_j Z_j θ_j) + λ_1 α_j + λ_2 ω_c b_j = 0,  for j = 1, 2, ..., N_c    (8)

where

α_j = sign(θ_j) if θ_j ≠ 0, and α_j ∈ [−1, 1] if θ_j = 0    (9)

b_j = θ_j / ‖θ‖_2 if θ ≠ 0, and b_j is any vector element with ‖b‖_2 ≤ 1 if θ = 0    (10)

Hence, we have θ = 0 if and only if Eq. (8) has a solution with α_j ∈ [−1, 1] and ‖b‖_2 ≤ 1. So we can first check Eq. (11), and conclude θ = 0 if J(a) ≤ 1.

J(a) = Σ_j (1 / (λ_2 ω_c)^2) (−Z_j^T (r − Σ_j Z_j θ_j) + λ_1 α_j)^2 = Σ_j b_j^2    (11)

If J(a) > 1, we have to optimize the original function:

‖r − Σ_j Z_j θ_j‖_2^2 + λ_1 Σ_j |θ_j| + λ_2 ω_c ‖θ‖_2    (12)

This is a convex problem and we can obtain the global minimum through the coordinate descent method. For each j, we get θ_j = 0 if |2 Z_j^T (r − Σ_{i≠j} Z_i θ_i)| < λ_1, which follows from the subgradient of Eq. (12) over θ_j shown in Eqs. (13) and (14).

−2 Z_j^T (r − Σ_{i≠j} Z_i θ_i − Z_j θ_j) + λ_1 c_j + λ_2 ω_c θ_j / ‖θ‖_2 = 0    (13)

c_j = sign(θ_j) if θ_j ≠ 0, and c_j ∈ [−1, 1] if θ_j = 0    (14)

If |2 Z_j^T (r − Σ_{i≠j} Z_i θ_i)| ≥ λ_1, we minimize Eq. (12) over θ_j by a one-dimensional optimization. For the other classes, we repeat the steps above until convergence. At the classification stage, the criterion of minimum reconstruction error is still adopted to classify the test sample to the class with the minimum error.

4. Experiments

We verify the performance of the proposed algorithm on five databases, namely the Extended Yale B, MNIST, Caltech-101, Caltech-256 and PASCAL 2007 databases, which cover the applications of face recognition, digit recognition and natural image classification. For each database, we compare our approach with the Sparse Representation Classification (SRC) method, and with other conventional methods such as the holistic Eigenfaces and Laplacianfaces in face recognition and 1-Nearest Neighbor (1-NN) in image classification.

4.1. Face Recognition

For face recognition, we use the Extended Yale Face Database B and the AR database for evaluation.

4.1.1 The Extended Yale B database

This database consists of 2,414 frontal-face images of 38 individuals [5]. The cropped and normalized 192 x 168 face images were captured under various laboratory-controlled lighting conditions.

For each subject, we randomly select half of the images for training and the other half for testing. The images are first downsampled with ratios of 1/32, 1/24, 1/16 and 1/8, so a set of feature spaces is formed with 30, 56, 120 and 504 dimensions. Eigenfaces [12] and Laplacianfaces [6] are extracted to form the dictionaries. Two classification methods, namely 1-Nearest Neighbor (1-NN) and SRC, are compared with our approach. Classification accuracy is adopted to evaluate the performance of the different methods.

The classification results are shown in Figure 2. From the figure, we can see that our approach outperforms both the 1-NN and SRC methods on the two dictionaries formed by Eigenfaces and Laplacianfaces. The SRC method achieves a large improvement over 1-NN, about 9.8% with Eigenfaces and 7.795% with Laplacianfaces. Moreover, compared with the SRC method, our approach achieves a higher recognition rate, i.e., 1.69% for 30D, 1.4% for 56D, 1.41% for 120D and 0.33% for 504D based on Eigenfaces, and 1.74% for 30D, 1.99% for 56D, 2.18% for 120D and 1.29% for 504D based on Laplacianfaces.

size  | 30%   | 50%   | 80%   | 100%
1-NN  | 85.78 | 90.14 | 91.32 | 93.5
SRC   | 81.26 | 86.64 | 92.13 | 93.19
ours  | 84.58 | 91.13 | 93.64 | 94.3

Table 1. Classification accuracy (%) on MNIST database.

4.1.2 The AR database

This database [8] consists of over 4,000 frontal images of 126 individuals. For each individual, 26 pictures were taken in two separate sessions. These images include more facial variations than the Extended Yale B database, including illumination changes, expressions, and facial disguises. As in [14], a subset of the dataset consisting of 50 male and 50 female subjects is chosen in our experiments, in which only images with illumination and expression changes are included. For each subject, the seven images from Session 1 are used for training, and the other seven from Session 2 for testing. The images are cropped to dimension 165 x 120 and converted to gray scale. Four feature space dimensions are adopted: 30, 54, 130 and 540, which correspond to downsampling ratios of 1/24, 1/18, 1/12 and 1/6, respectively.

Results of classification accuracy on the AR database are shown in Figure 3. The 1-NN method achieves its highest accuracy of 89.7% with 540D features. The SRC method outperforms 1-NN for every feature dimension and dictionary, with improvements ranging from 0.61% to 11.69%. Our method makes a slight further improvement over SRC, about 2.197% on average for Eigenfaces and 1.82% for Laplacianfaces.

4.2. Digit Recognition

For handwritten digit recognition, we apply our approach to the large-scale MNIST database [7]. The database consists of 70,000 handwritten digits, of which 60,000 are designated for training and 10,000 for testing. The digits have been size-normalized and centered in a fixed-size image.

Due to the limitation of computer memory, a subset of 1,000 training images and 150 test images is selected randomly. In our experiments, different sizes of the training set, namely 30%, 50%, 80% and 100%, are used to compare 1-NN, SRC and our approach. The results are listed in Table 1. In all cases, our approach outperforms SRC with improvements ranging from 1.11% to 3.32%, and slightly outperforms the 1-NN method in some cases.

Caltech-101:
size | 30%   | 50%   | 80%
SRC  | 50.83 | 53.42 | 58.91
ours | 53.44 | 54.95 | 60.05

Caltech-256:
size | 30%   | 50%   | 80%
SRC  | 53.59 | 57.28 | 63.84
ours | 56.35 | 59.15 | 64.10

Table 2. Classification accuracy (%) on Caltech-101 and Caltech-256 databases.

4.3. Natural image classification

4.3.1 Caltech-101 and Caltech-256 databases

Images from the Caltech-101 dataset belong to 101 categories, including vehicles, animals and people. Each category has more than 40 images and most of them have about 50 images. The size of each image is roughly 300 x 200 pixels.

The Caltech-256 dataset is similar to Caltech-101 but includes more images and categories. It holds 30,607 images in 256 categories and presents much higher variability in object size, location, pose, etc. than Caltech-101. Each class contains at least 80 images.

The data for training and testing are dense SIFT feature vectors, which are extracted from each image and quantized into 300-dimensional vectors with the bag-of-words method. The implementations of feature extraction, clustering and quantization were downloaded from [13].
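A rough sketch of this bag-of-words quantization step is given below. It assumes the dense SIFT descriptors have already been extracted (the descriptors_per_image input is hypothetical) and uses scikit-learn's k-means rather than the VLFeat implementation [13] actually used in the paper.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def bow_features(descriptors_per_image, vocab_size=300):
    """Quantize dense SIFT descriptors into vocab_size-dimensional BoW histograms.

    descriptors_per_image : list of (n_i, 128) arrays, one per image.
    Returns an (n_images, vocab_size) array of L1-normalized histograms.
    """
    all_desc = np.vstack(descriptors_per_image)
    codebook = MiniBatchKMeans(n_clusters=vocab_size, random_state=0).fit(all_desc)
    feats = []
    for desc in descriptors_per_image:
        words = codebook.predict(desc)                     # nearest visual word per descriptor
        hist = np.bincount(words, minlength=vocab_size).astype(float)
        feats.append(hist / max(hist.sum(), 1.0))          # normalize histogram
    return np.vstack(feats)
```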

First of all, the sparse coefficients obtained with and without the group constraint on the coefficients are shown in Figure 4. It can be seen that the non-zero coefficients move together and become more concentrated after applying the group constraint.

Due to the limitation on computer memory, only part of each database is used for training and testing rather than the whole database. For the Caltech-101 and Caltech-256 databases, samples belonging to 50 classes are randomly selected, and 30%, 50% and 80% of the samples in each class are respectively used for training, while 10% of the samples are used for testing.

Accuracy over all classes is adopted to compare the performance of the proposed algorithm with the sparse representation classification (SRC) algorithm. The results are listed in Table 2. From the table, we can see that our algorithm outperforms SRC with an average improvement of about 1.63% on the Caltech-101 database and 1.96% on the Caltech-256 database.

4.3.2 PASCAL 2007 database

Figure 2. Face classification accuracy (%) on the Extended Yale B database. The left figure is based on Eigenfaces and the right one on Laplacianfaces.

Figure 3. Face classification accuracy (%) on the AR database. The left figure is based on Eigenfaces and the right one on Laplacianfaces.

The PASCAL 2007 dataset [2] consists of 9,963 images from 20 classes. These images range between indoor and outdoor scenes, close-ups and landscapes, and strange viewpoints. The dataset is extremely challenging because

all the images are daily photos obtained from Flickr, where the size, viewing angle, illumination and appearance of objects and their poses vary significantly, with frequent occlusions.

Again, not all samples are involved in our experiments, since there is a huge number of images in the database. After downsampling, the numbers of training samples are 50, 100 and 300 in each of the 20 classes; meanwhile, 100 samples in each class are selected for testing. SRC and our approach are compared in this experiment, and classification accuracy is employed to evaluate the algorithms' performance. The results are listed in Table 3. From this table, we can see that our approach outperforms SRC with an average improvement of 1.78%.

size | 50    | 100   | 300
SRC  | 60.14 | 62.07 | 70.51
ours | 61.78 | 63.21 | 73.06

Table 3. Classification accuracy (%) on PASCAL 2007 database.

5. Conclusion

In this paper, we proposed a sparse representation algorithm with a group constraint on the coefficients. With the motivation that a sample should be represented by samples with the same label, we constrain the non-zero coefficients into the group associated with the same label as the original sample, and penalize the coefficients from other classes. The resulting formulation of sparse representation with the coefficients' group constraint turns out to be a convex optimization problem, which can be solved efficiently. Experiments on several databases show the effectiveness of the proposed algorithm.

Figure 4. The coefficients become more concentrated on fewer classes with the constraint than without it. This is a toy experiment that shows how the algorithm with the coefficients' group constraint works; 20 classes with 10 samples each are used for training.

Acknowledgement

This work was supported by the National Natural Science Foundation of China under Projects 61101212 and 90920001, by the Fundamental Research Funds for the Central Universities, and by the Network System and Network Culture Foundation of Beijing.

References

[1] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. Image Processing, IEEE Transactions on, 15(12):3736–3745, 2006.
[2] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[3] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1778–1785. IEEE, 2009.
[4] J. Gemmeke and B. Cranen. Noise robust digit recognition using sparse representations. Proceedings of ISCA 2008 ITRW Speech Analysis and Processing for Knowledge Discovery, 2008.
[5] A. Georghiades, P. Belhumeur, and D. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 23(6):643–660, 2001.
[6] X. He, S. Yan, Y. Hu, P. Niyogi, and H. Zhang. Face recognition using Laplacianfaces. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(3):328–340, 2005.
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[8] A. M. Martinez and R. Benavente. The AR face database. Technical report, CVC, June 1998.
[9] B. Olshausen et al. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.
[10] S. Rao, R. Tron, R. Vidal, and Y. Ma. Motion segmentation via robust subspace separation in the presence of outlying, incomplete, or corrupted trajectories. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
[11] R. Rigamonti, M. Brown, and V. Lepetit. Are sparse representations really relevant for image classification? In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1545–1552. IEEE, 2011.
[12] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
[13] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.
[14] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(2):210–227, 2009.
[15] A. Yang, R. Jafari, S. Sastry, and R. Bajcsy. Distributed recognition of human actions using wearable motion sensor networks. Journal of Ambient Intelligence and Smart Environments, 1(2):103–115, 2009.
[16] J. Yang, J. Wright, T. Huang, and Y. Ma. Image super-resolution as sparse representation of raw image patches. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
[17] J. Yang, K. Yu, and T. Huang. Supervised translation-invariant sparse coding. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3517–3524. IEEE, 2010.
[18] M. Yang, L. Zhang, J. Yang, and D. Zhang. Robust sparse coding for face recognition. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 625–632. IEEE, 2011.
