Open Research Online: The Open University's repository of research publications and other research outputs.

How to cite: Jolliffe, I. T.; Trendafilov, N. T.; and Uddin, M. (2003), "A Modified Principal Component Technique Based on the LASSO," Journal of Computational and Graphical Statistics, 12(3), pp. 531–547. Link to article on publisher's website: http://dx.doi.org/doi:10.1198/1061860032148

Copyright and Moral Rights for the articles on this site are retained by the individual authors and/or other copyright owners. For more information on Open Research Online's data policy on reuse of materials, please consult the policies page at oro.open.ac.uk.


A Modified Principal Component Technique Based on the LASSO

Ian T. JOLLIFFE, Nickolay T. TRENDAFILOV, and Mudassir UDDIN

In many multivariate statistical techniques, a set of linear functions of the original p variables is produced. One of the more difficult aspects of these techniques is the interpretation of the linear functions, as these functions usually have nonzero coefficients on all p variables. A common approach is to effectively ignore (treat as zero) any coefficients less than some threshold value, so that the function becomes simple and the interpretation becomes easier for the users. Such a procedure can be misleading. There are alternatives to principal component analysis which restrict the coefficients to a smaller number of possible values in the derivation of the linear functions, or replace the principal components by "principal variables." This article introduces a new technique, borrowing an idea proposed by Tibshirani in the context of multiple regression, where similar problems arise in interpreting regression equations. This approach is the so-called LASSO, the "least absolute shrinkage and selection operator," in which a bound is introduced on the sum of the absolute values of the coefficients, and in which some coefficients consequently become zero. We explore some of the properties of the new technique, both theoretically and using simulation studies, and apply it to an example.

Key Words: Interpretation; Principal component analysis; Simplification.

1 INTRODUCTION

Principal component analysis (PCA), like several other multivariate statistical techniques, replaces a set of p measured variables by a small set of derived variables. The derived variables, the principal components, are linear combinations of the p variables. The dimension reduction achieved by PCA is especially useful if the components can be readily interpreted, and this is sometimes the case; see, for example, Jolliffe (2002, chap. 4). In other examples, particularly where a component has nontrivial loadings on a substantial

Ian T. Jolliffe is Professor, Department of Mathematical Sciences, University of Aberdeen, Meston Building, King's College, Aberdeen AB24 3UE, Scotland, UK (E-mail: itj@maths.abdn.ac.uk). Nickolay T. Trendafilov is Senior Lecturer, Faculty of Computing, Engineering and Mathematical Sciences, University of the West of England, Bristol BS16 1QY, UK (E-mail: Nickolay.Trendafilov@uwe.ac.uk). Mudassir Uddin is Associate Professor, Department of Statistics, University of Karachi, Karachi-75270, Pakistan (E-mail: mudassir2000@hotmail.com).

© 2003 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America

Journal of Computational and Graphical Statistics, Volume 12, Number 3, Pages 531–547. DOI: 10.1198/1061860032148



proportion of the p variables, interpretation can be difficult, detracting from the value of the analysis.

A number of methods are available to aid interpretation. Rotation, which is commonplace in factor analysis, can be applied to PCA, but has its drawbacks (Jolliffe 1989, 1995). A frequently used informal approach is to ignore all loadings smaller than some threshold absolute value, effectively treating them as zero. This can be misleading (Cadima and Jolliffe 1995). A more formal way of making some of the loadings zero is to restrict the allowable loadings to a small set of values, for example $-1, 0, 1$ (Hausman 1982); Vines (2000) described a variation on this theme. One further strategy is to select a subset of the variables themselves which satisfy a similar optimality criterion to the principal components, as in McCabe's (1984) "principal variables."

This article introduces a new technique which shares an idea central to both Hausman's (1982) and Vines's (2000) work. This idea is that we choose linear combinations of the measured variables which successively maximize variance, as in PCA, but we impose extra constraints which sacrifice some variance in order to improve interpretability. In our technique the extra constraint is in the form of a bound on the sum of the absolute values of the loadings in that component. This type of bound has been used in regression (Tibshirani 1996), where similar problems of interpretation occur, and is known there as the LASSO (least absolute shrinkage and selection operator). As with the methods of Hausman (1982) and Vines (2000), and unlike rotation, our technique usually produces some exactly zero loadings in the components. In contrast to Hausman (1982) and Vines (2000), it does not restrict the nonzero loadings to a discrete set of values. This article shows, through simulations and an example, that the new technique is a valuable additional tool for exploring the structure of multivariate data.

Section 2 establishes the notation and terminology of PCA and introduces an example in which interpretation of principal components is not straightforward. The most usual approach to simplifying interpretation, the rotation of PCs, is shown to have drawbacks. Section 3 introduces the new technique and describes some of its properties. Section 4 revisits the example of Section 2 and demonstrates the practical usefulness of the technique. A simulation study, which investigates the ability of the technique to recover known underlying structures in a dataset, is summarized in Section 5. The article ends with further discussion in Section 6, including some modifications, complications, and open questions.

2 A MOTIVATING EXAMPLE

Consider the classic example first introduced by Jeffers (1967), in which a PCA was done on the correlation matrix of 13 physical measurements, listed in Table 1, made on a sample of 180 pitprops cut from Corsican pine timber.

Let $x_i$ be the vector of 13 variables for the $i$th pitprop, where each variable has been standardized to have unit variance. What PCA does, when based on the correlation matrix, is to find linear functions $a_1'x, a_2'x, \ldots, a_p'x$ which successively have maximum sample variance, subject to $a_h'a_k = 0$ for $k \ge 2$ and $h < k$. In addition, a normalization constraint


Table 1. Definitions of Variables in Jeffers' Pitprop Data

Variable   Definition
x1         Top diameter in inches
x2         Length in inches
x3         Moisture content, % of dry weight
x4         Specific gravity at time of test
x5         Oven-dry specific gravity
x6         Number of annual rings at top
x7         Number of annual rings at bottom
x8         Maximum bow in inches
x9         Distance of point of maximum bow from top in inches
x10        Number of knot whorls
x11        Length of clear prop from top in inches
x12        Average number of knots per whorl
x13        Average diameter of the knots in inches

$a_k'a_k = 1$ is necessary to get a bounded solution. The derived variable $a_k'x$ is the kth principal component (PC). It turns out that $a_k$, the vector of coefficients or loadings for the kth PC, is the eigenvector of the sample correlation matrix $R$ corresponding to the kth largest eigenvalue $l_k$. In addition, the sample variance of $a_k'x$ is equal to $l_k$. Because of the successive maximization property, the first few PCs will often account for most of the sample variation in all the standardized measured variables. In the pitprop example, Jeffers (1967) was interested in the first six PCs, which together account for 87% of the total variance. The loadings in each of these six components are given in Table 2, together with the individual and cumulative percentage of variance in all 13 variables accounted for by 1, 2, …, 6 PCs.
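Since each vector of loadings is simply an eigenvector of $R$, the PCA baseline that SCoTLASS modifies is easy to reproduce. The following minimal NumPy sketch is ours, not the authors' code, and the function name is an assumption for illustration:

    import numpy as np

    def pca_from_correlation(R, m):
        # Eigendecomposition of the correlation matrix; eigh returns
        # eigenvalues in ascending order, so reverse to take the m largest.
        eigvals, eigvecs = np.linalg.eigh(R)
        order = np.argsort(eigvals)[::-1][:m]
        l, A = eigvals[order], eigvecs[:, order]
        # trace(R) = p for a correlation matrix, so l_k / p is the
        # proportion of total variance accounted for by the kth PC.
        pct = 100.0 * l / R.shape[0]
        return A, pct

Applied to the 13 × 13 correlation matrix of the pitprop data, the columns of A should reproduce the loadings of Table 2 up to sign, and pct the Variance (%) row.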

PCs are easiest to interpret if the pattern of loadings is clear-cut with a few large

Table 2. Loadings for Correlation PCA for Jeffers' Pitprop Data

                                    Component
Variable     (1)      (2)      (3)      (4)      (5)      (6)

x1          0.404    0.212   −0.219   −0.027   −0.141   −0.086
x2          0.406    0.180   −0.245   −0.025   −0.188   −0.111
x3          0.125    0.546    0.114    0.015    0.433    0.120
x4          0.173    0.468    0.328    0.010    0.361   −0.090
x5          0.057   −0.138    0.493    0.254   −0.122   −0.560
x6          0.284   −0.002    0.476   −0.153   −0.269    0.032
x7          0.400   −0.185    0.261   −0.125   −0.176    0.030
x8          0.294   −0.198   −0.222    0.294    0.203    0.103
x9          0.357    0.010   −0.202    0.132   −0.117    0.103
x10         0.379   −0.252   −0.120   −0.201    0.173   −0.019
x11        −0.008    0.187    0.021    0.805   −0.302    0.178
x12        −0.115    0.348    0.066   −0.303   −0.537    0.371
x13        −0.112    0.304   −0.352   −0.098   −0.209   −0.671

Simplicity factor (varimax)   0.059   0.103   0.082   0.397   0.086   0.266
Variance (%)                  32.4    18.2    14.4     8.9     7.0     6.3
Cumulative variance (%)       32.4    50.7    65.0    74.0    80.9    87.2


Table 3. Loadings for Rotated Correlation PCA, Using the Varimax Criterion, for Jeffers' Pitprop Data

                                    Component
Variable     (1)      (2)      (3)      (4)      (5)      (6)

x1         −0.019    0.074    0.043   −0.027   −0.519   −0.077
x2         −0.018    0.015    0.048   −0.024   −0.540   −0.102
x3         −0.024    0.705   −0.128    0.003   −0.059    0.107
x4          0.029    0.689    0.112    0.001    0.014   −0.087
x5          0.258    0.009    0.477    0.218    0.205   −0.524
x6         −0.185    0.061    0.604   −0.005   −0.032    0.012
x7          0.031   −0.069    0.512   −0.102   −0.151    0.092
x8          0.440   −0.042   −0.072    0.083   −0.221    0.239
x9          0.097   −0.058    0.045    0.094   −0.408    0.141
x10         0.271   −0.054    0.129   −0.367   −0.216    0.135
x11         0.057   −0.022   −0.029    0.882   −0.137    0.075
x12        −0.776   −0.056    0.091    0.079   −0.123    0.145
x13        −0.120   −0.049   −0.280   −0.077   −0.269   −0.748

Simplicity factor (varimax)   0.362   0.428   0.199   0.595   0.131   0.343
Variance (%)                  13.0    14.6    18.4     9.7    23.9     7.6
Cumulative variance (%)       13.0    27.6    46.0    55.7    79.6    87.2

(absolute) values and many small loadings in each PC. Although Jeffers (1967) makes an attempt to interpret all six components, some are, to say the least, messy, and he ignores some intermediate loadings. For example, PC2 has the largest loadings on x3, x4, with small loadings on x6, x9, but a whole range of intermediate values on other variables.

A traditional way to simplify loadings is by rotation. If $A$ is the $(13 \times 6)$ matrix whose kth column is $a_k$, then $A$ is post-multiplied by a matrix $T$ to give rotated loadings $B = AT$. If $b_k$ is the kth column of $B$, then $b_k'x$ is the kth rotated component. The matrix $T$ is chosen so as to optimize some simplicity criterion. Various criteria have been proposed, all of which attempt to create vectors of loadings whose elements are close to zero or far from zero, with few intermediate values. The idea is that each variable should be either clearly important or clearly unimportant in a rotated component, with as few cases as possible of borderline importance. Varimax is the most widely used rotation criterion and, like most other such criteria, it tends to drive at least some of the loadings in each component towards zero. This is not the only possible type of simplicity. A component whose loadings are all roughly equal is easy to interpret, but will be avoided by most standard rotation criteria. It is difficult to envisage any criterion which could encompass all possible types of simplicity, and we concentrate here on simplicity as defined by varimax.

Table 3 gives the rotated loadings for six components in the correlation PCA of the pitprop data, together with the percentage of total variance accounted for by each rotated PC (RPC). The rotation criterion used in Table 3 is varimax (Krzanowski and Marriott 1995, p. 138), which is the most frequent choice (often the default in software), but other criteria give similar results. Varimax rotation aims to maximize the sum, over rotated components, of a criterion which takes values between zero and one. A value of zero occurs when all loadings in the component are equal, whereas a component with only one nonzero loading produces a value of unity. This criterion, or "simplicity factor," is given for each component in Tables 2 and 3, and it can be seen that its values are larger for most of the rotated components than
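The "simplicity factor" reported in Tables 2 and 3 can be computed directly from a vector of loadings. For a unit-length vector $b$ with $p$ elements, the normalized varimax criterion $(p\sum_j b_j^4 - 1)/(p - 1)$ is 0 when all loadings are equal in absolute value and 1 when exactly one loading is nonzero; this normalization is consistent with the values reported in the tables. A small sketch of ours:

    import numpy as np

    def simplicity_factor(b):
        # Normalized varimax simplicity of one unit-length loading vector:
        # 0 for equal absolute loadings, 1 for a single nonzero loading.
        b = np.asarray(b, dtype=float)
        p = b.size
        return (p * np.sum(b ** 4) - 1.0) / (p - 1)

    # First unrotated PC of the pitprop data (Table 2, column 1):
    a1 = [0.404, 0.406, 0.125, 0.173, 0.057, 0.284, 0.400,
          0.294, 0.357, 0.379, -0.008, -0.115, -0.112]
    print(round(simplicity_factor(a1), 3))   # 0.059, as in Table 2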


they are for the unrotated components.

There are, however, a number of disadvantages associated with rotation. In the context of interpreting the results in this example, we note that we have lost the "successive maximization of variance" property of the unrotated components, so what we are interpreting after rotation are not the "most important sources of variation" in the data. The RPC with the highest variance appears, arbitrarily, as the fifth, and this accounts for 24% of the total variation, compared to 32% in the first unrotated PC. In addition, a glance at the loadings and simplicity factors for the RPCs shows that, more generally, those components which are easiest to interpret among the six in Table 3 are those which have the smallest variance; RPC5 is still rather complicated. Other problems associated with rotation were discussed by Jolliffe (1989, 1995). A simplified component technique (SCoT), in which the two steps of RPCA (PCA followed by rotation) are combined into one, was discussed by Jolliffe and Uddin (2000). The technique is based on a similar idea proposed by Morton (1989) in the context of projection pursuit. It maximizes variance but adds a penalty function which is a multiple of one of the simplicity criteria, such as varimax. SCoT has some advantages compared to standard rotation, but shares a number of its disadvantages.

The next section introduces an alternative to rotation which has some clear advantages over rotated PCA and SCoT. A detailed comparison of the new technique, SCoT, rotated PCA, and Vines' (2000) simple components is given for an example involving sea surface temperatures in Jolliffe, Uddin, and Vines (2002).

3 MODIFIED PCA BASED ON THE LASSO

Tibshirani (1996) studied the difficulties involved in the interpretation of multiple regression equations. These problems may occur due to the instability of the regression coefficients in the presence of collinearity, or simply because of the large number of variables included in the regression equation. Some current alternatives to least squares regression, such as shrinkage estimators, ridge regression, principal component regression, or partial least squares, handle the instability problem by keeping all variables in the equation, whereas variable selection procedures find a subset of variables and keep only the selected variables in the equation. Tibshirani (1996) proposed a new method, the "least absolute shrinkage and selection operator," LASSO, which is a compromise between variable selection and shrinkage estimators. The procedure shrinks the coefficients of some of the variables not simply towards zero, but exactly to zero, giving an implicit form of variable selection. LeBlanc and Tibshirani (1998) extended the idea to regression trees. Here we adapt the LASSO idea to PCA.

3.1 THE LASSO APPROACH IN REGRESSION

In standard multiple regression we have the equation

$$y_i = \alpha + \sum_{j=1}^{p} \beta_j x_{ij} + e_i, \qquad i = 1, 2, \ldots, n,$$


where $y_1, y_2, \ldots, y_n$ are measurements on a response variable $y$; $x_{ij}$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, p$, are corresponding values of $p$ predictor variables; $e_1, e_2, \ldots, e_n$ are error terms; and $\alpha, \beta_1, \beta_2, \ldots, \beta_p$ are parameters in the regression equation. In least squares regression, these parameters are estimated by minimizing the residual (or error) sum of squares

$$\sum_{i=1}^{n}\Bigl(y_i - \alpha - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2}.$$

The LASSO imposes an additional restriction on the coefficients, namely

$$\sum_{j=1}^{p} |\beta_j| \le t$$

for some "tuning parameter" $t$. For suitable choices of $t$, this constraint has the interesting property that it forces some of the coefficients in the regression equation to zero. An equivalent way of deriving LASSO estimates is to minimize the residual sum of squares with the addition of a penalty function based on $\sum_{j=1}^{p} |\beta_j|$. Thus we minimize

$$\sum_{i=1}^{n}\Bigl(y_i - \alpha - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2} + \lambda \sum_{j=1}^{p} |\beta_j|$$

for some multiplier $\lambda$. For any given value of $t$ in the first LASSO formulation, there is a value of $\lambda$ in the second formulation that gives equivalent results.
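To illustrate how the penalized ($\lambda$) formulation drives coefficients exactly to zero, here is a minimal coordinate-descent sketch for centered data. This is our own illustration of the mechanism, not the algorithm of Tibshirani (1996), who computed LASSO estimates by quadratic programming; the soft-thresholding update is what produces exact zeros.

    import numpy as np

    def lasso_coordinate_descent(X, y, lam, n_iter=200):
        # Minimize ||y - X b||^2 + lam * sum_j |b_j| by cyclic coordinate
        # descent.  X and y are assumed centered, so the intercept alpha
        # can be fitted separately as the mean of the response.
        n, p = X.shape
        b = np.zeros(p)
        col_ss = np.sum(X ** 2, axis=0)
        for _ in range(n_iter):
            for j in range(p):
                r_j = y - X @ b + X[:, j] * b[j]   # partial residual without j
                rho = X[:, j] @ r_j
                # Soft-thresholding: coefficients within lam/2 of zero are
                # set exactly to zero, giving implicit variable selection.
                b[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_ss[j]
        return b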

3.2 THE LASSO APPROACH IN PCA (SCoTLASS)

PCA on a correlation matrix finds linear combinations $a_k'x$ $(k = 1, 2, \ldots, p)$ of the $p$ measured variables $x$, each standardized to have unit variance, which successively have maximum variance

$$a_k' R a_k, \qquad (3.1)$$

subject to

$$a_k'a_k = 1 \quad \text{and (for } k \ge 2\text{)} \quad a_h'a_k = 0, \; h < k. \qquad (3.2)$$

The proposed method of LASSO-based PCA performs the maximization under the extra constraints

$$\sum_{j=1}^{p} |a_{kj}| \le t \qquad (3.3)$$

for some tuning parameter $t$, where $a_{kj}$ is the $j$th element of the $k$th vector $a_k$ $(k = 1, 2, \ldots, p)$. We call the new technique SCoTLASS (Simplified Component Technique-LASSO).


[Figure 1. The Two-Dimensional SCoTLASS. The figure is not reproduced here; it shows the unit circle $a_1'a_1 = 1$ in the $(a_{11}, a_{12})$ plane, together with the LASSO constraint boundaries $|a_{11}| + |a_{12}| = t$ for $t = 1$ and $t = \sqrt{p}$.]

3.3 SOME PROPERTIES

SCoTLASS differs from PCA in the inclusion of the constraints defined in (3.3), so a decision must be made on the value of the tuning parameter $t$. It is easy to see that:

(a) for $t \ge \sqrt{p}$, we get PCA;
(b) for $t < 1$, there is no solution; and
(c) for $t = 1$, we must have exactly one nonzero $a_{kj}$ for each $k$.

As $t$ decreases from $\sqrt{p}$, we move progressively away from PCA, and eventually to a solution where only one variable has a nonzero loading on each component. All other variables will shrink (not necessarily monotonically) with $t$ and ultimately reach zero. Examples of this behavior are given in the next section.
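For $p = 2$, properties (a) and (c) are easy to verify numerically by searching the unit circle directly. The following brute-force sketch is ours (the example correlation matrix is an assumption), and is quite separate from the projected gradient algorithm described in Section 3.4:

    import numpy as np

    def first_scotlass_component_2d(R, t, n_grid=100000):
        # Maximize a'Ra over unit vectors a with |a1| + |a2| <= t, by
        # scanning a fine grid of angles on the half circle (the sign of
        # a loading vector is arbitrary, so half the circle suffices).
        theta = np.linspace(0.0, np.pi, n_grid)
        A = np.column_stack([np.cos(theta), np.sin(theta)])
        feasible = np.abs(A).sum(axis=1) <= t
        var = np.einsum('ij,jk,ik->i', A[feasible], R, A[feasible])
        return A[feasible][np.argmax(var)]

    R = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
    print(first_scotlass_component_2d(R, np.sqrt(2)))  # ~(0.707, 0.707): the first PC
    print(first_scotlass_component_2d(R, 1.0))         # ~(1, 0): a single nonzero loading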

The geometry of SCoTLASS in the case when $p = 2$ is shown in Figure 1, where we plot the elements $a_1, a_2$ of the vector $a$. For PCA, in Figure 1 the first component $a_1'x$, where $a_1' = (a_{11}, a_{12})$, corresponds to the point on the circumference of the shaded circle ($a_1'a_1 = 1$) which touches the "largest" possible ellipse $a_1'Ra_1 = \text{constant}$. For SCoTLASS with $1 < t < \sqrt{p}$, we are restricted to the part of the circle $a_1'a_1 = 1$ inside the dotted square $\sum_{j=1}^{2} |a_{1j}| \le t$. For $t = 1$, corresponding to the inner shaded square, the optimal (only) solutions are on the axes. Figure 1 shows a special case in which the axes of the ellipses are at 45° to the $a_1, a_2$ axes, corresponding to equal variances for $x_1, x_2$ (or a correlation matrix). This gives two


optimal solutions for SCoTLASS in the first quadrant, symmetric about the 45° line. If PCA or SCoTLASS is done on a covariance rather than a correlation matrix (see Remark 1 in Section 6) with unequal variances, there will be a unique solution in the first quadrant.

3.4 IMPLEMENTATION

PCA reduces to an easily implemented eigenvalue problem, but the extra constraint in SCoTLASS means that it needs numerical optimization to estimate parameters, and it suffers from the problem of many local optima. A number of algorithms have been tried to implement the technique, including simulated annealing (Goffe, Ferrier, and Rogers 1994), but the results reported in the following example were derived using the projected gradient approach (Chu and Trendafilov 2001; Helmke and Moore 1994). The LASSO inequality constraint (3.3) in the SCoTLASS problem (3.1)–(3.3) is eliminated by making use of an exterior penalty function. Thus, the SCoTLASS problem (3.1)–(3.3) is transformed into a new maximization problem subject to the equality constraint (3.2). The solution of this modified maximization problem is then found as an ascent gradient vector flow onto the p-dimensional unit sphere, following the standard projected gradient formalism (Chu and Trendafilov 2001; Helmke and Moore 1994). Detailed consideration of this solution will be reported separately.

We have implemented SCoTLASS using MATLAB. The code requires as input the correlation matrix R, the value of the tuning parameter t, and the number of components to be retained (m). The MATLAB code returns a loading matrix and calculates a number of relevant statistics. To achieve an appropriate solution, a number of parameters of the projected gradient method (e.g., starting points, absolute and relative tolerances) also need to be defined.
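The following sketch conveys the flavor of this exterior-penalty, projected gradient approach, but it is only a simplified stand-in for the authors' MATLAB implementation: the fixed step size, the penalty weight mu, the iteration count, and the use of sign(a) as a subgradient of the absolute-value term are all our own choices.

    import numpy as np

    def scotlass(R, t, m, mu=50.0, step=0.01, n_iter=5000, seed=0):
        # Successively maximize a'Ra - mu * max(0, sum|a_j| - t)^2 over unit
        # vectors a, each orthogonal to the previously found loading vectors.
        rng = np.random.default_rng(seed)
        p = R.shape[0]
        loadings = []
        for _ in range(m):
            a = rng.standard_normal(p)
            for b in loadings:                 # start orthogonal to earlier vectors
                a -= (b @ a) * b
            a /= np.linalg.norm(a)
            for _ in range(n_iter):
                grad = 2.0 * R @ a             # gradient of the variance term
                excess = np.sum(np.abs(a)) - t
                if excess > 0.0:               # exterior penalty for constraint (3.3)
                    grad -= 2.0 * mu * excess * np.sign(a)
                for b in loadings:             # keep the search direction orthogonal
                    grad -= (b @ grad) * b
                grad -= (a @ grad) * a         # project onto the tangent of the sphere
                a += step * grad
                for b in loadings:             # re-impose the equality constraints (3.2)
                    a -= (b @ a) * b
                a /= np.linalg.norm(a)
            loadings.append(a)
        return np.column_stack(loadings)

In line with Remark 4 of Section 6, several random starts would be needed in practice, since this kind of algorithm becomes increasingly prone to local optima as t decreases.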

4 EXAMPLE: PITPROPS DATA

Here we revisit the pitprop data from Section 2. We have studied many other examples, not discussed here, and in general the results are qualitatively similar. Table 4 gives loadings for SCoTLASS with t = 2.25, 2.00, 1.75, and 1.50. Table 5 gives variances, cumulative variances, "simplicity factors," and number of zero loadings for the same values of t, as well as corresponding information for PCA (t = √13) and RPCA. The simplicity factors are values of the varimax criterion for each component.

It can be seen that, as the value of t is decreased, the simplicity of the components increases, as measured by the number of zero loadings and by the varimax simplicity factor, although the increase in the latter is not uniform. The increase in simplicity is paid for by a loss of variance retained. By t = 2.25, 1.75 the percentage of variance accounted for by the first component in SCoTLASS is reduced from 32.4% for PCA to 26.7%, 19.6%, respectively. At t = 2.25 this is still larger than the largest contribution (23.9%) achieved by a single component in RPCA. The comparison with RPCA is less favorable when the variation accounted for by all six retained components is examined. RPCA necessarily retains the


Table 4. Loadings for SCoTLASS for Four Values of t, Based on the Correlation Matrix for Jeffers' Pitprop Data

                                         Component
Technique     Variable    (1)      (2)      (3)      (4)      (5)      (6)

SCoTLASS      x1         0.558    0.085   −0.093   −0.107    0.056    0.017
(t = 2.25)    x2         0.580    0.031   −0.087   −0.147    0.073    0.047
              x3         0.000    0.647   −0.129    0.215   −0.064   −0.101
              x4         0.000    0.654   −0.000    0.211   −0.080    0.127
              x5        −0.000    0.000    0.413   −0.000    0.236    0.747
              x6         0.001    0.208    0.529   −0.022   −0.108    0.033
              x7         0.266   −0.000    0.385    0.000   −0.121    0.020
              x8         0.104   −0.098    0.000    0.584    0.127   −0.188
              x9         0.372   −0.000   −0.000    0.019    0.142   −0.060
              x10        0.364   −0.154    0.000    0.212   −0.296    0.000
              x11       −0.000    0.099   −0.000    0.000    0.879   −0.156
              x12       −0.000    0.241   −0.001   −0.699   −0.044   −0.186
              x13       −0.000    0.026   −0.608   −0.026   −0.016    0.561

SCoTLASS      x1         0.623    0.041   −0.049    0.040    0.051   −0.000
(t = 2.00)    x2         0.647    0.076   −0.001    0.072    0.059   −0.007
              x3         0.000    0.000   −0.684   −0.128   −0.054    0.106
              x4         0.000   −0.001   −0.670   −0.163   −0.063   −0.100
              x5        −0.000   −0.267    0.000    0.000    0.228   −0.772
              x6         0.000   −0.706   −0.044    0.011   −0.003    0.001
              x7         0.137   −0.539    0.001   −0.000   −0.096    0.001
              x8         0.001   −0.000    0.053   −0.767    0.095    0.098
              x9         0.332    0.000    0.000   −0.001    0.065    0.014
              x10        0.254   −0.000    0.124   −0.277   −0.309   −0.000
              x11        0.000    0.000   −0.065    0.000    0.902    0.194
              x12       −0.000    0.000   −0.224    0.533   −0.069    0.137
              x13        0.000    0.364   −0.079    0.000    0.000   −0.562

SCoTLASS      x1         0.664   −0.000    0.000   −0.025    0.002   −0.035
(t = 1.75)    x2         0.683   −0.001    0.000   −0.040    0.001   −0.018
              x3         0.000    0.641    0.195    0.000    0.180   −0.030
              x4         0.000    0.701    0.001    0.000   −0.000   −0.001
              x5        −0.000    0.000   −0.000    0.000   −0.887   −0.056
              x6         0.000    0.293   −0.186    0.000   −0.373    0.044
              x7         0.001    0.107   −0.658   −0.000   −0.051    0.064
              x8         0.001   −0.000   −0.000    0.735    0.021   −0.168
              x9         0.283   −0.000   −0.000    0.000   −0.000   −0.001
              x10        0.113   −0.000   −0.001    0.388   −0.017    0.320
              x11        0.000    0.000    0.000   −0.000   −0.000   −0.923
              x12       −0.000    0.001    0.000   −0.554    0.016    0.004
              x13        0.000   −0.000    0.703    0.001   −0.197    0.080

SCoTLASS      x1         0.701   −0.000   −0.001   −0.001    0.001   −0.000
(t = 1.50)    x2         0.709   −0.001   −0.001   −0.001    0.001   −0.000
              x3         0.001    0.698   −0.068    0.002   −0.001    0.001
              x4         0.001    0.712   −0.001    0.002   −0.001   −0.001
              x5        −0.000    0.000    0.001   −0.001    0.093   −0.757
              x6         0.000    0.081    0.586   −0.031    0.000    0.000
              x7         0.001    0.000    0.807   −0.001    0.000    0.002
              x8         0.001   −0.000    0.001    0.044   −0.513   −0.001
              x9         0.079    0.000    0.001    0.001   −0.001   −0.000
              x10        0.002   −0.000    0.027    0.660   −0.000   −0.002
              x11        0.000    0.001   −0.000   −0.749   −0.032   −0.000
              x12       −0.000    0.001   −0.000   −0.001    0.853    0.083
              x13        0.000    0.000   −0.001    0.000   −0.001    0.648


Table 5. Simplicity Factor, Variance, Cumulative Variance, and Number of Zero Loadings for Individual Components in PCA, RPCA, and SCoTLASS for Four Values of t, Based on the Correlation Matrix for Jeffers' Pitprop Data

                                                   Component
Technique          Measure                      (1)     (2)     (3)     (4)     (5)     (6)

PCA                Simplicity factor (varimax)  0.059   0.103   0.082   0.397   0.086   0.266
(= SCoTLASS with   Variance (%)                 32.4    18.2    14.4     8.9     7.0     6.3
t = √13)           Cumulative variance (%)      32.4    50.7    65.1    74.0    80.9    87.2

RPCA               Simplicity factor (varimax)  0.362   0.428   0.199   0.595   0.131   0.343
                   Variance (%)                 13.0    14.6    18.4     9.7    23.9     7.6
                   Cumulative variance (%)      13.0    27.6    46.0    55.7    79.6    87.2

SCoTLASS           Simplicity factor (varimax)  0.190   0.312   0.205   0.308   0.577   0.364
(t = 2.25)         Variance (%)                 26.7    17.2    15.9     9.7     8.9     6.7
                   Cumulative variance (%)      26.7    43.9    59.8    69.4    78.4    85.0
                   Number of zero loadings      6       3       5       3       0       1

SCoTLASS           Simplicity factor (varimax)  0.288   0.301   0.375   0.387   0.646   0.412
(t = 2.00)         Variance (%)                 23.1    16.4    16.2    11.2     8.9     6.5
                   Cumulative variance (%)      23.1    39.5    55.8    67.0    75.9    82.3
                   Number of zero loadings      7       6       2       4       1       2

SCoTLASS           Simplicity factor (varimax)  0.370   0.370   0.388   0.360   0.610   0.714
(t = 1.75)         Variance (%)                 19.6    16.0    13.2    13.0     9.2     9.1
                   Cumulative variance (%)      19.6    35.6    48.7    61.8    71.0    80.1
                   Number of zero loadings      7       7       7       7       3       0

SCoTLASS           Simplicity factor (varimax)  0.452   0.452   0.504   0.464   0.565   0.464
(t = 1.50)         Variance (%)                 16.1    14.9    13.8    10.2     9.9     9.6
                   Cumulative variance (%)      16.1    31.0    44.9    55.1    65.0    74.5
                   Number of zero loadings      5       7       2       1       3       5

same total percentage variation (87.2%) as PCA, but SCoTLASS drops to 85.0% and 80.1% for t = 2.25, 1.75, respectively. Against this loss, SCoTLASS has the considerable advantage of retaining the successive maximization property. At t = 2.25, apart from switching of components 4 and 5, the SCoTLASS components are nicely simplified versions of the PCs, rather than being something different as in RPCA. A linked advantage is that, if we decide to look only at five components, then the SCoTLASS components will simply be the first five in Table 4, whereas RPCs based on (m − 1) retained components are not necessarily similar to those based on m.

A further "plus" for SCoTLASS is the presence of zero loadings, which aids interpretation, but even where there are few zeros the components are simplified compared to PCA. Consider specifically the interpretation of the second component, which we noted earlier was "messy" for PCA. For t = 1.75 this component is now interpreted as measuring mainly moisture content and specific gravity, with small contributions from numbers of annual rings, and all other variables negligible. This gain in interpretability is achieved by reducing the percentage of total variance accounted for from 18.2% to 16.0%. For other components, too, interpretation is made easier because in the majority of cases the contribution of a variable is clearcut: either it is important or it is not, with few equivocal contributions.


Table 6. Specified Eigenvectors of a Six-Dimensional Block Structure

                                    Eigenvectors
Variable     (1)      (2)      (3)      (4)      (5)      (6)

x1          0.096   −0.537    0.759   −0.120    0.335   −0.021
x2          0.082   −0.565   −0.599    0.231    0.511   −0.013
x3          0.080   −0.608   −0.119   −0.119   −0.771    0.016
x4          0.594    0.085   −0.074   −0.308    0.069    0.731
x5          0.584    0.096   −0.114   −0.418    0.052   −0.678
x6          0.533    0.074    0.180    0.805   −0.157   −0.069

Variance    1.8367   1.640    0.751    0.659    0.607    0.506

For t = 2.25, 2.00, 1.75, 1.50, the numbers of zeros are as follows: 18, 22, 31, and 23. It seems surprising that we obtain fewer zeros with t = 1.50 than with t = 1.75; that is, the solution with t = 1.75 appears to be simpler than the one with t = 1.50. In fact this impression is misleading (see also the next paragraph). The explanation of this anomaly is in the projected gradient method used for numerical solution of the problem, which approximates the LASSO constraint with a certain smooth function, and thus the zero loadings produced may also be approximate. One can see that the solution with t = 1.50 contains a total of 56 loadings with less than 0.005 magnitude, compared to 42 in the case t = 1.75.

Another interesting comparison is in terms of average varimax simplicity over the first six components. This is 0.343 for RPCA, compared to 0.165 for PCA. For t = 2.25, 2.00, 1.75, 1.50, the average simplicity is 0.326, 0.402, 0.469, 0.487, respectively. This demonstrates that, although the varimax criterion is not an explicit part of SCoTLASS, by taking t small enough we can do better than RPCA with respect to its own criterion. This is achieved by moving outside the space spanned by the retained PCs, and hence settling for a smaller amount of overall variation retained.

5 SIMULATION STUDIES

One question of interest is whether SCoTLASS is better at detecting underlying simple structure in a dataset than is PCA or RPCA. To investigate this question, we simulated data from a variety of known structures. Because of space constraints, only a small part of the results is summarized here; further details can be found in Uddin (1999).

Given a vector l of positive real numbers and an orthogonal matrix A, we can attempt to find a covariance matrix or correlation matrix whose eigenvalues are the elements of l and whose eigenvectors are the columns of A. Some restrictions need to be imposed on l and A, especially in the case of correlation matrices, but it is possible to find such matrices for a wide range of eigenvector structures. Having obtained a covariance or correlation matrix, it is straightforward to generate samples of data from multivariate normal distributions with the given covariance or correlation matrix. We have done this for a wide variety of eigenvector structures (principal component loadings) and computed the PCs, RPCs, and SCoTLASS components from the resulting sample correlation matrices. Various structures have been


Table 7. Specified Eigenvectors of a Six-Dimensional Intermediate Structure

                                    Eigenvectors
Variable     (1)      (2)      (3)      (4)      (5)      (6)

x1          0.224   −0.509    0.604    0.297   −0.327    0.361
x2          0.253   −0.519   −0.361   −0.644   −0.341   −0.064
x3          0.227   −0.553   −0.246    0.377    0.608   −0.267
x4          0.553    0.249   −0.249   −0.052    0.262    0.706
x5          0.521    0.254   −0.258    0.451   −0.509   −0.367
x6          0.507    0.199    0.561   −0.384    0.281   −0.402

Variance    1.795    1.674    0.796    0.618    0.608    0.510

investigated, which we call block structure, intermediate structure, and uniform structure. Tables 6–8 give one example of each type of structure. The structure in Table 6 has blocks of nontrivial loadings and blocks of near-zero loadings in each underlying component. Table 8 has a structure in which all loadings in the first two components have similar absolute values, and the structure in Table 7 is intermediate to those of Tables 6 and 8. An alternative approach to the simulation study would be to replace the near-zero loadings by exact zeros and the nearly equal loadings by exact equalities. However, we feel that in reality underlying structures are never quite that simple, so we perturbed them a little.
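For instance, the block structure of Table 6 can be turned into a population matrix and sampled as follows. This is a sketch of ours under the setup just described; the sample size and seed are arbitrary, and because the printed eigenvectors are rounded to three decimals, the reconstructed matrix is only approximately the intended one.

    import numpy as np

    # Sigma = A diag(l) A', with eigenvalues l and eigenvectors from Table 6.
    l = np.array([1.8367, 1.640, 0.751, 0.659, 0.607, 0.506])
    A = np.array([[0.096, -0.537,  0.759, -0.120,  0.335, -0.021],
                  [0.082, -0.565, -0.599,  0.231,  0.511, -0.013],
                  [0.080, -0.608, -0.119, -0.119, -0.771,  0.016],
                  [0.594,  0.085, -0.074, -0.308,  0.069,  0.731],
                  [0.584,  0.096, -0.114, -0.418,  0.052, -0.678],
                  [0.533,  0.074,  0.180,  0.805, -0.157, -0.069]])
    Sigma = A @ np.diag(l) @ A.T

    rng = np.random.default_rng(1)
    X = rng.multivariate_normal(np.zeros(6), Sigma, size=100)  # one simulated dataset
    R_sample = np.corrcoef(X, rowvar=False)   # input to PCA, RPCA, and SCoTLASS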

It might be expected that, if the underlying structure is simple, then sampling variation is more likely to take sample PCs away from simplicity than to enhance this simplicity. It is of interest to investigate whether the techniques of RPCA and SCoTLASS, which increase simplicity compared to the sample PCs, will do so in the direction of the true underlying structure. The closeness of a vector of loadings from any of these techniques to the underlying true vector is measured by the angle between the two vectors of interest. These angles are given in Tables 9–11 for single simulated datasets from three different types of six-dimensional structure; they typify what we found in other simulations. Three values of t (apart from that for PCA) are shown in the tables. Their exact values are unimportant and are slightly different in different tables. They are chosen to illustrate typical behavior in our simulations.
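Since a loading vector is defined only up to sign, the angle is computed from the absolute value of the inner product of the two unit-length vectors; a minimal sketch of ours:

    import numpy as np

    def angle_degrees(a_true, a_est):
        # Angle between two unit-length loading vectors, ignoring sign.
        cos = abs(np.dot(a_true, a_est))
        return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))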

The results illustrate that, for each structure, RPCA is, perhaps surprisingly and certainly disappointingly, bad at recovering the underlying structure. SCoTLASS, on the other hand,

Table 8. Specified Eigenvectors of a Six-Dimensional Uniform Structure

                                    Eigenvectors
Variable     (1)      (2)      (3)      (4)      (5)      (6)

x1         −0.455    0.336   −0.087    0.741   −0.328    0.125
x2         −0.439    0.370   −0.212   −0.630   −0.445   −0.175
x3         −0.415    0.422    0.378   −0.110    0.697    0.099
x4          0.434    0.458    0.040   −0.136   −0.167    0.744
x5          0.301    0.435   −0.697    0.114    0.356   −0.306
x6          0.385    0.416    0.563    0.104   −0.234   −0.545

Variance    1.841    1.709    0.801    0.649    0.520    0.480


Table 9. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS With Various Values of t, for a Specified "Block" Structure of Correlation Eigenvectors

                                Vectors
Technique      t          (1)     (2)     (3)     (4)

PCA            √6        12.9    12.0    15.4    79.5
RPCA                     37.7    45.2    45.3    83.1
SCoTLASS       t = 2.00  12.9    12.0    15.4    78.6
               t = 1.82  11.9    11.4    13.8    73.2
               t = 1.75   9.4    10.0    12.5    85.2

is capable of improvement over PCA. For example, for t = 1.75 it not only improves over PCA in terms of angles in Table 9, but it also has 3, 3, and 2 zero loadings in its first three components, thus giving a notably simpler structure. None of the methods manages to reproduce the underlying structure for component 4 in Table 9.

The results for intermediate structure in Table 10 are qualitatively similar to those in Table 9, except that SCoTLASS does best for higher values of t than in Table 9. For uniform structure (Table 11), SCoTLASS does badly compared to PCA for all values of t. This is not unexpected, because although uniform structure is simple in its own way, it is not the type of simplicity which SCoTLASS aims for. It is also the case that the varimax criterion is designed so that it stands little chance of finding uniform structure. Other rotation criteria, such as quartimax, can in theory find uniform vectors of loadings, but they were tried and also found to be unsuccessful in our simulations. It is probable that a uniform structure is more likely to be found by the techniques proposed by Hausman (1982) or Vines (2000). Although SCoTLASS will usually fail to find such structures, their existence may be indicated by a large drop in the variance explained by SCoTLASS as decreasing values of t move it away from PCA.

6 DISCUSSION

A new technique, SCoTLASS, has been introduced for discovering and interpreting the major sources of variability in a dataset. We have illustrated its usefulness in an example and have also shown through simulations that it is capable of recovering certain types of

Table 10. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS With Various Values of t, for a Specified "Intermediate" Structure of Correlation Eigenvectors

                                Vectors
Technique      t          (1)     (2)     (3)     (4)

PCA            √6        13.7    15.1    23.6    78.3
RPCA                     53.3    42.9    66.4    77.7
SCoTLASS       t = 2.28   5.5    10.4    23.7    78.3
               t = 2.12   9.5    12.0    23.5    77.8
               t = 2.01  17.5    19.1    22.5    71.7


Table 11. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS With Various Values of t, for a Specified "Uniform" Structure of Correlation Eigenvectors

                                Vectors
Technique      t          (1)     (2)     (3)     (4)

PCA            √6         6.0     6.4    23.2    31.8
RPCA                     52.0    54.1    39.4    43.5
SCoTLASS       t = 2.15  24.3    24.6    23.4    30.7
               t = 1.91  34.5    34.6    21.8    23.5
               t = 1.73  41.0    41.3    22.5    19.0

underlying structure. It is preferred in many respects to rotated principal components as a means of simplifying interpretation, compared to principal component analysis. Although we are convinced of the value of SCoTLASS, there are a number of complications and open questions, which are now listed as a set of remarks.

Remark 1. In this article we have carried out the techniques studied on correlation matrices. Although it is less common in practice, PCA and RPCA can also be implemented on covariance matrices. In this case PCA successively finds uncorrelated linear functions of the original unstandardized variables. SCoTLASS can also be implemented in this case, the only difference being that the sample correlation matrix R is replaced by the sample covariance matrix S in equation (3.1). We have investigated covariance-based SCoTLASS, both for real examples and using simulation studies. Some details of its performance are different from the correlation-based case, but qualitatively they are similar. In particular, there are a number of reasons to prefer SCoTLASS to RPCA.

Remark 2. In PCA the constraint $a_h'a_k = 0$ (orthogonality of vectors of loadings) is equivalent to $a_h'Ra_k = 0$ (different components are uncorrelated). This equivalence is special to the PCs and is a consequence of the $a_k$ being eigenvectors of $R$. When we rotate the PC loadings, we lose at least one of these two properties (Jolliffe 1995). Similarly, in SCoTLASS, if we impose $a_h'a_k = 0$ we no longer have uncorrelated components. For example, Table 12 gives the correlations between the six SCoTLASS components when t = 1.75 for the pitprop data. Although most of the correlations in Table 12 are small in absolute value, there are also nontrivial ones ($r_{12}$, $r_{14}$, $r_{34}$, and $r_{35}$).

Table 12. Correlation Matrix for the First Six SCoTLASS Components for t = 1.75, Using Jeffers' Pitprop Data

                             Correlation matrix
Components    (1)      (2)      (3)      (4)      (5)      (6)

(1)          1.000    0.375   −0.234    0.443   −0.010    0.061
(2)                   1.000   −0.114   −0.076   −0.145   −0.084
(3)                            1.000   −0.438    0.445   −0.187
(4)                                     1.000   −0.105    0.141
(5)                                              1.000   −0.013
(6)                                                       1.000


It is possible to replace the constraint $a_h'a_k = 0$ in SCoTLASS by $a_h'Ra_k = 0$, thus choosing to have uncorrelated components rather than orthogonal loadings, but this option is not explored in the present article.
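For standardized variables, the correlation between components $a_h'x$ and $a_k'x$ is $a_h'Ra_k / \sqrt{(a_h'Ra_h)(a_k'Ra_k)}$, so a table like Table 12 can be computed directly from a loading matrix. A sketch of ours; applied to the t = 1.75 loadings of Table 4, it should reproduce Table 12 up to rounding:

    import numpy as np

    def component_correlations(A, R):
        # Correlation matrix of the components A'x, where x has correlation
        # matrix R and A holds one loading vector per column.
        C = A.T @ R @ A                 # covariance matrix of the components
        d = np.sqrt(np.diag(C))
        return C / np.outer(d, d)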

Remark 3. The choice of t is clearly important in SCoTLASS. As t decreases, simplicity increases, but variation explained decreases, and we need to achieve a suitable tradeoff between these two properties. The correlation between components noted in the previous remark is another aspect of any tradeoff: correlations are small for t close to $\sqrt{p}$, but have the potential to increase as t decreases. It might be possible to construct a criterion which defines the "best tradeoff," but there is no unique construction, because of the difficulty of deciding how to measure simplicity and how to combine variance, simplicity, and correlation. At present it seems best to compute the SCoTLASS components for several values of t and judge subjectively at what point a balance between these various aspects is achieved. In our example we used the same value of t for all components in a data set, but varying t for different components is another possibility.

Remark 4. Our algorithms for SCoTLASS are slower than those for PCA. This is because SCoTLASS is implemented subject to an extra restriction on PCA, and we lose the advantage of calculation via the singular value decomposition, which makes the PCA algorithm fast. Sequential-based PCA with an extra constraint requires a good optimizer to produce a global optimum. In the implementation of SCoTLASS a projected gradient method is used, which is globally convergent and preserves accurately both the equality and inequality constraints. It should be noted that, as t is reduced from $\sqrt{p}$ downwards towards unity, the CPU time taken to optimize the objective function remains generally the same (11 sec on average for a 1GHz PC), but as t decreases, the algorithm becomes progressively prone to hit local minima, and thus more (random) starts are required to find a global optimum. Osborne, Presnell, and Turlach (2000) gave an efficient procedure based on convex programming and a dual problem for implementing the LASSO in the regression context. Whether or not this approach can be usefully adapted to SCoTLASS will be investigated in further research. Although we are reasonably confident that our algorithm has found global optima in the example of Section 4 and in the simulations, there is no guarantee. The jumps that occur in some coefficients, such as the change in $a_{36}$ from −0.186 to 0.586 as t decreases from 1.75 to 1.50 in the pitprop data, could be due to one of the solutions being a local optimum. However, it seems more likely to us that it is caused by the change in the nature of the earlier components, which, together with the orthogonality constraint imposed on the third component, opens up a different range of possibilities for the latter component. There is clearly much scope for further work on the implementation of SCoTLASS.

Remark 5. In a number of examples not reported here, several of the nonzero loadings in the SCoTLASS components are exactly equal, especially for large values of p and small values of t. At present we have no explanation for this, but it deserves further investigation.


Remark 6. The way in which SCoTLASS sets some coefficients to zero is different in concept from simply truncating to zero the smallest coefficients in a PC. The latter attempts to approximate the PC by a simpler version, and can have problems with that approximation, as shown by Cadima and Jolliffe (1995). SCoTLASS looks for simple sources of variation and, like PCA, aims for high variance, but because of simplicity considerations the simplified components can in theory be moderately different from the PCs. We seek to replace PCs rather than approximate them, although, because of the shared aim of high variance, the results will often not be too different.

Remark 7. There are a number of other recent developments which are relevant to interpretation problems in multivariate statistics; Jolliffe (2002), Chapter 11, reviews these in the context of PCA. Vines's (2000) use of only a discrete number of values for loadings has already been mentioned. It works well in some examples, but for the pitprops data the components 4–6 are rather complex. A number of aspects of the strategy for selecting a subset of variables were explored by Cadima and Jolliffe (1995, 2001) and by Tanaka and Mori (1997). The LASSO has been generalized, in the regression context, to so-called bridge estimation, in which the constraint $\sum_{j=1}^{p} |\beta_j| \le t$ is replaced by $\sum_{j=1}^{p} |\beta_j|^{\gamma} \le t$, where $\gamma$ is not necessarily equal to unity; see, for example, Fu (1998). Tibshirani (1996) also mentioned the nonnegative garotte, due to Breiman (1995), as an alternative approach. Translation of the ideas of the bridge and nonnegative garotte to the context of PCA, and comparison with other techniques, would be of interest in future research.

ACKNOWLEDGMENTS

An earlier draft of this article was prepared while the first author was visiting the Bureau of Meteorology Research Centre (BMRC), Melbourne, Australia. He is grateful to BMRC for the support and facilities provided during his visit, and to the Leverhulme Trust for partially supporting the visit under their Study Abroad Fellowship scheme. Comments from two referees and the editor have helped to improve the clarity of the article.

[Received November 2000. Revised July 2002.]

REFERENCES

Breiman, L. (1995), "Better Subset Regression Using the Nonnegative Garotte," Technometrics, 37, 373–384.

Cadima, J., and Jolliffe, I. T. (1995), "Loadings and Correlations in the Interpretation of Principal Components," Journal of Applied Statistics, 22, 203–214.

——— (2001), "Variable Selection and the Interpretation of Principal Subspaces," Journal of Agricultural, Biological, and Environmental Statistics, 6, 62–79.

Chu, M. T., and Trendafilov, N. T. (2001), "The Orthogonally Constrained Regression Revisited," Journal of Computational and Graphical Statistics, 10, 1–26.

Fu, J. W. (1998), "Penalized Regression: The Bridge Versus the Lasso," Journal of Computational and Graphical Statistics, 7, 397–416.

Goffe, W. L., Ferrier, G. D., and Rogers, J. (1994), "Global Optimizations of Statistical Functions with Simulated Annealing," Journal of Econometrics, 60, 65–99.

Hausman, R. (1982), "Constrained Multivariate Analysis," in Optimization in Statistics, eds. S. H. Zanckis and J. S. Rustagi, Amsterdam: North Holland, pp. 137–151.

Helmke, U., and Moore, J. B. (1994), Optimization and Dynamical Systems, London: Springer.

Jeffers, J. N. R. (1967), "Two Case Studies in the Application of Principal Component Analysis," Applied Statistics, 16, 225–236.

Jolliffe, I. T. (1989), "Rotation of Ill-Defined Principal Components," Applied Statistics, 38, 139–147.

——— (1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied Statistics, 22, 29–35.

——— (2002), Principal Component Analysis (2nd ed.), New York: Springer-Verlag.

Jolliffe, I. T., and Uddin, M. (2000), "The Simplified Component Technique—An Alternative to Rotated Principal Components," Journal of Computational and Graphical Statistics, 9, 689–710.

Jolliffe, I. T., Uddin, M., and Vines, S. K. (2002), "Simplified EOFs—Three Alternatives to Rotation," Climate Research, 20, 271–279.

Krzanowski, W. J., and Marriott, F. H. C. (1995), Multivariate Analysis, Part II, London: Arnold.

LeBlanc, M., and Tibshirani, R. (1998), "Monotone Shrinkage of Trees," Journal of Computational and Graphical Statistics, 7, 417–433.

McCabe, G. P. (1984), "Principal Variables," Technometrics, 26, 137–144.

Morton, S. C. (1989), "Interpretable Projection Pursuit," Technical Report 106, Department of Statistics, Stanford University.

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000), "On the LASSO and its Dual," Journal of Computational and Graphical Statistics, 9, 319–337.

Tanaka, Y., and Mori, Y. (1997), "Principal Component Analysis Based on a Subset of Variables: Variable Selection and Sensitivity Analysis," American Journal of Mathematical and Management Sciences, 17, 61–89.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.

Uddin, M. (1999), "Interpretation of Results from Simplified Principal Components," PhD thesis, University of Aberdeen, Aberdeen, Scotland.

Vines, S. K. (2000), "Simple Principal Components," Applied Statistics, 49, 441–451.

Page 2: Open Research Onlineoro.open.ac.uk/3949/1/Article_JCGS.pdfDepartmentof Statistics,University ofKarachi, Karachi-75270, Pakistan (E-mail: mudassir2000@hotmail.com). ®c2003 American

A Modi ed Principal Component TechniqueBased on the LASSO

Ian T JOLLIFFE Nickolay T TRENDAFILOV and Mudassir UDDIN

In many multivariate statistical techniques a set of linear functions of the original p

variables is produced One of the more dif cult aspects of these techniques is the inter-pretation of the linear functions as these functions usually have nonzero coef cients onall p variables A common approach is to effectively ignore (treat as zero) any coef cientsless than some threshold value so that the function becomes simple and the interpretationbecomes easier for the users Such a procedure can be misleading There are alternatives toprincipal componentanalysiswhich restrict the coef cients to a smaller number of possiblevalues in the derivationof the linear functionsor replace the principalcomponentsby ldquoprin-cipal variablesrdquo This article introduces a new technique borrowing an idea proposed byTibshirani in the context of multiple regressionwhere similar problems arise in interpretingregression equations This approach is the so-called LASSO the ldquoleast absolute shrinkageand selection operatorrdquo in which a bound is introduced on the sum of the absolute valuesof the coef cients and in which some coef cients consequentlybecome zero We exploresome of the propertiesof the new techniqueboth theoreticallyand using simulationstudiesand apply it to an example

Key Words InterpretationPrincipal component analysis Simpli cation

1 INTRODUCTION

Principal component analysis (PCA) like several other multivariate statistical tech-niques replaces a set of p measured variables by a small set of derived variables Thederived variables the principal components are linear combinationsof the p variables Thedimension reduction achieved by PCA is especially useful if the components can be readilyinterpreted and this is sometimes the case see for example Jolliffe (2002 chap 4) Inother examples particularly where a component has nontrivial loadings on a substantial

Ian T Jolliffe is Professor Department of Mathematical Sciences University of Aberdeen Meston BuildingKingrsquos College Aberdeen AB24 3UE Scotland UK (E-mail itjmathsabdnacuk) Nickolay T Trenda lov isSenior Lecturer Faculty of Computing Engineering and Mathematical Sciences University of the West of Eng-land Bristol BS16 1QY UK (E-mail NickolayTrenda lovuweacuk)Mudassir Uddin is Associate ProfessorDepartment of Statistics University of Karachi Karachi-75270 Pakistan (E-mail mudassir2000hotmailcom)

creg 2003 American Statistical Association Institute of Mathematical Statisticsand Interface Foundation of North America

Journal of Computational and Graphical Statistics Volume 12 Number 3 Pages 531ndash547DOI 1011981061860032148

531

532 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

proportion of the p variables interpretationcan be dif cult detracting from the value of theanalysis

A number of methods are available to aid interpretation Rotation which is common-place in factor analysis can be applied to PCA but has its drawbacks (Jolliffe 1989 1995)A frequently used informal approach is to ignore all loadings smaller than some thresh-old absolute value effectively treating them as zero This can be misleading (Cadima andJolliffe 1995) A more formal way of making some of the loadings zero is to restrict theallowable loadings to a small set of values for example iexcl 1 0 1 (Hausman 1982) Vines(2000) described a variation on this theme One further strategy is to select a subset of thevariables themselveswhich satisfy similar optimalitycriterion to the principalcomponentsas in McCabersquos (1984) ldquoprincipal variablesrdquo

This article introducesa new techniquewhich shares an idea central to both Hausmanrsquos(1982) and Vinesrsquos (2000) work This idea is that we choose linear combinations of themeasured variables which successively maximizes variance as in PCA but we imposeextra constraints which sacri ces some variance in order to improve interpretability Inour technique the extra constraint is in the form of a bound on the sum of the absolutevalues of the loadings in that component This type of bound has been used in regression(Tibshirani 1996) where similar problems of interpretationoccur and is known there as theLASSO (least absolute shrinkage and selection operator) As with the methods of Hausman(1982) and Vines (2000) and unlike rotation our technique usually produces some exactlyzero loadings in the components In contrast to Hausman (1982) and Vines (2000) it doesnot restrict the nonzero loadings to a discrete set of values This article shows throughsimulationsand an example that the new techniqueis a valuableadditionaltool for exploringthe structure of multivariate data

Section 2 establishes the notation and terminology of PCA and introduces an exam-ple in which interpretation of principal components is not straightforward The most usualapproach to simplifying interpretation the rotation of PCs is shown to have drawbacksSection 3 introduces the new technique and describes some of its properties Section 4 re-visits the example of Section 2 and demonstrates the practical usefulness of the techniqueA simulation study which investigates the ability of the technique to recover known un-derlying structures in a dataset is summarized in Section 5 The article ends with furtherdiscussion in Section 6 including some modi cations complications and open questions

2 A MOTIVATING EXAMPLE

Consider the classic example rst introduced by Jeffers (1967) in which a PCA wasdone on the correlation matrix of 13 physical measurements listed in Table 1 made on asample of 180 pitprops cut from Corsican pine timber

Let xi be the vector of 13 variables for the ith pitprop where each variable has beenstandardized to have unit variance What PCA does when based on the correlation matrixis to nd linear functions a

0

1x a0

2x a0

px which successively have maximum samplevariance subject to a

0

hak = 0 for k para 2 and h lt k In addition a normalization constraint

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 533

Table 1 Denitions of Variables in Jeffersrsquo Pitprop Data

Variable De nition

x1 Top diameter in inchesx2 Length in inchesx3 Moisture content of dry weightx4 Specic gravity at time of testx5 Oven-dry specic gravityx6 Number of annual rings at topx7 Number of annual rings at bottomx8 Maximum bow in inchesx9 Distance of point of maximum bow from top in inchesx10 Number of knot whorlsx11 Length of clear prop from top in inchesx12 Average number of knots per whorlx13 Average diameter of the knots in inches

a0

kak = 1 is necessary to get a bounded solution The derived variable a0

kx is the kthprincipal component (PC) It turns out that ak the vector of coef cients or loadings forthe kth PC is the eigenvector of the sample correlation matrix R corresponding to the kthlargest eigenvalue lk In addition the sample variance of a

0

kx is equal to lk Because ofthe successive maximization property the rst few PCs will often account for most of thesample variation in all the standardized measured variables In the pitprop example Jeffers(1967) was interested in the rst six PCs which together account for 87 of the totalvariance The loadings in each of these six components are given in Table 2 together withthe individual and cumulative percentage of variance in all 13 variables accounted for by1 2 6 PCs

PCs are easiest to interpret if the pattern of loadings is clear-cut with a few large

Table 2 Loadings for Correlation PCA for Jeffersrsquo Pitprop Data

Component

Variable (1) (2) (3) (4) (5) (6)

x1 0404 0212 iexcl0219 iexcl0027 iexcl0141 iexcl0086x2 0406 0180 iexcl0245 iexcl0025 iexcl0188 iexcl0111x3 0125 0546 0114 0015 0433 0120x4 0173 0468 0328 0010 0361 iexcl0090x5 0057 iexcl0138 0493 0254 iexcl0122 iexcl0560x6 0284 iexcl0002 0476 iexcl0153 iexcl0269 0032x7 0400 iexcl0185 0261 iexcl0125 iexcl0176 0030x8 0294 iexcl0198 iexcl0222 0294 0203 0103x9 0357 0010 iexcl0202 0132 iexcl0117 0103x10 0379 iexcl0252 iexcl0120 iexcl0201 0173 iexcl0019x11 iexcl0008 0187 0021 0805 iexcl0302 0178x12 iexcl0115 0348 0066 iexcl0303 iexcl0537 0371x13 iexcl0112 0304 iexcl0352 iexcl0098 iexcl0209 iexcl0671

Simplicity factor (varimax) 0059 0103 0082 0397 0086 0266Variance () 324 182 144 89 70 63

Cumulative Variance () 324 507 650 740 809 872

534 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Table 3 Loadings for Rotated Correlation PCA Using the Varimax Criterion for Jeffersrsquo Pitprop Data

Component

Variable (1) (2) (3) (4) (5) (6)

x1 iexcl0019 0074 0043 iexcl0027 iexcl0519 iexcl0077x2 iexcl0018 0015 0048 iexcl0024 iexcl0540 iexcl0102x3 iexcl0024 0705 iexcl0128 0003 iexcl0059 0107x4 0029 0689 0112 0001 0014 iexcl0087x5 0258 0009 0477 0218 0205 iexcl0524x6 iexcl0185 0061 0604 iexcl0005 iexcl0032 0012x7 0031 iexcl0069 0512 iexcl0102 iexcl0151 0092x8 0440 iexcl0042 iexcl0072 0083 iexcl0221 0239x9 0097 iexcl0058 0045 0094 iexcl0408 0141x10 0271 iexcl0054 0129 iexcl0367 iexcl0216 0135x11 0057 iexcl0022 iexcl0029 0882 iexcl0137 0075x12 iexcl0776 iexcl0056 0091 0079 iexcl0123 0145x13 iexcl0120 iexcl0049 iexcl0280 iexcl0077 iexcl0269 iexcl0748

Simplicity factor (varimax) 0362 0428 0199 0595 0131 0343variance ( ) 130 146 184 97 239 76

cumulative variance ( ) 130 276 460 557 796 872

(absolute) values and many small loadings in each PC Although Jeffers (1967) makes anattempt to interpret all six components some are to say the least messy and he ignoressome intermediate loadingsFor example PC2 has the largest loadingson x3 x4 with smallloadings on x6 x9 but a whole range of intermediate values on other variables

A traditional way to simplify loadings is by rotation. If A is the (13 × 6) matrix whose kth column is $a_k$, then A is post-multiplied by a matrix T to give rotated loadings B = AT. If $b_k$ is the kth column of B, then $b_k' x$ is the kth rotated component. The matrix T is chosen so as to optimize some simplicity criterion. Various criteria have been proposed, all of which attempt to create vectors of loadings whose elements are close to zero or far from zero, with few intermediate values. The idea is that each variable should be either clearly important or clearly unimportant in a rotated component, with as few cases as possible of borderline importance. Varimax is the most widely used rotation criterion and, like most other such criteria, it tends to drive at least some of the loadings in each component towards zero. This is not the only possible type of simplicity. A component whose loadings are all roughly equal is easy to interpret, but will be avoided by most standard rotation criteria. It is difficult to envisage any criterion which could encompass all possible types of simplicity, and we concentrate here on simplicity as defined by varimax.

Table 3 gives the rotated loadings for six components in the correlation PCA of the pitprop data, together with the percentage of total variance accounted for by each rotated PC (RPC). The rotation criterion used in Table 3 is varimax (Krzanowski and Marriott 1995, p. 138), which is the most frequent choice (often the default in software), but other criteria give similar results. Varimax rotation aims to maximize the sum, over rotated components, of a criterion which takes values between zero and one. A value of zero occurs when all loadings in the component are equal, whereas a component with only one nonzero loading produces a value of unity. This criterion, or "simplicity factor," is given for each component in Tables 2 and 3, and it can be seen that its values are larger for most of the rotated components than they are for the unrotated components.
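The article does not write out the normalization of this per-component criterion; the sketch below assumes the common form in which, for a unit-length column of loadings b, the simplicity factor is (Σ b_j⁴ − 1/p)/(1 − 1/p). Under that assumption the value is zero for equal loadings, one for a single nonzero loading, and it reproduces the tabulated figures closely.

```python
# Hedged sketch of a per-component varimax simplicity factor, assuming
# the normalization (sum(b_j^4) - 1/p) / (1 - 1/p) for unit-length b.
import numpy as np

def simplicity_factor(b):
    b = np.asarray(b, dtype=float)
    b = b / np.linalg.norm(b)                # ensure unit length
    p = b.size
    return (np.sum(b**4) - 1.0 / p) / (1.0 - 1.0 / p)

# First unrotated PC from Table 2; the result is close to the
# tabulated 0.059 (the rounded loadings introduce a little error).
pc1 = [0.404, 0.406, 0.125, 0.173, 0.057, 0.284, 0.400,
       0.294, 0.357, 0.379, -0.008, -0.115, -0.112]
print(round(simplicity_factor(pc1), 3))
```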

There are, however, a number of disadvantages associated with rotation. In the context of interpreting the results in this example, we note that we have lost the "successive maximization of variance" property of the unrotated components, so what we are interpreting after rotation are not the "most important sources of variation" in the data. The RPC with the highest variance appears, arbitrarily, as the fifth, and this accounts for 24% of the total variation, compared to 32% in the first unrotated PC. In addition, a glance at the loadings and simplicity factors for the RPCs shows that, more generally, those components which are easiest to interpret among the six in Table 3 are those which have the smallest variance; RPC5 is still rather complicated. Other problems associated with rotation were discussed by Jolliffe (1989, 1995). A simplified component technique (SCoT), in which the two steps of RPCA (PCA followed by rotation) are combined into one, was discussed by Jolliffe and Uddin (2000). The technique is based on a similar idea proposed by Morton (1989) in the context of projection pursuit. It maximizes variance, but adds a penalty function which is a multiple of one of the simplicity criteria, such as varimax. SCoT has some advantages compared to standard rotation, but shares a number of its disadvantages.

The next section introduces an alternative to rotation which has some clear advantages over rotated PCA and SCoT. A detailed comparison of the new technique, SCoT, rotated PCA, and Vines' (2000) simple components is given for an example involving sea surface temperatures in Jolliffe, Uddin, and Vines (2002).

3 MODIFIED PCA BASED ON THE LASSO

Tibshirani (1996) studied the difficulties involved in the interpretation of multiple regression equations. These problems may occur due to the instability of the regression coefficients in the presence of collinearity, or simply because of the large number of variables included in the regression equation. Some current alternatives to least squares regression, such as shrinkage estimators, ridge regression, principal component regression, or partial least squares, handle the instability problem by keeping all variables in the equation, whereas variable selection procedures find a subset of variables and keep only the selected variables in the equation. Tibshirani (1996) proposed a new method, the "least absolute shrinkage and selection operator" LASSO, which is a compromise between variable selection and shrinkage estimators. The procedure shrinks the coefficients of some of the variables not simply towards zero, but exactly to zero, giving an implicit form of variable selection. LeBlanc and Tibshirani (1998) extended the idea to regression trees. Here we adapt the LASSO idea to PCA.

3.1 THE LASSO APPROACH IN REGRESSION

In standard multiple regression we have the equation

$$ y_i = \alpha + \sum_{j=1}^{p} \beta_j x_{ij} + e_i, \qquad i = 1, 2, \ldots, n, $$


where $y_1, y_2, \ldots, y_n$ are measurements on a response variable y; $x_{ij}$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, p$, are corresponding values of p predictor variables; $e_1, e_2, \ldots, e_n$ are error terms; and $\alpha, \beta_1, \beta_2, \ldots, \beta_p$ are parameters in the regression equation. In least squares regression these parameters are estimated by minimizing the residual (or error) sum of squares

$$ \sum_{i=1}^{n} \Big( y_i - \alpha - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 . $$

The LASSO imposes an additional restriction on the coefficients, namely

$$ \sum_{j=1}^{p} |\beta_j| \leq t $$

for some "tuning parameter" t. For suitable choices of t, this constraint has the interesting property that it forces some of the coefficients in the regression equation to zero. An equivalent way of deriving LASSO estimates is to minimize the residual sum of squares with the addition of a penalty function based on $\sum_{j=1}^{p} |\beta_j|$. Thus we minimize

$$ \sum_{i=1}^{n} \Big( y_i - \alpha - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| $$

for some multiplier $\lambda$. For any given value of t in the first LASSO formulation, there is a value of $\lambda$ in the second formulation that gives equivalent results.
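The zero-forcing behavior is easy to demonstrate with off-the-shelf software. The sketch below uses scikit-learn's Lasso, an assumption of this illustration rather than a tool used in the article; its alpha parameter plays the role of the multiplier $\lambda$, up to scikit-learn's 1/(2n) scaling of the residual sum of squares.

```python
# Sketch: the penalized form of the LASSO drives some coefficients
# exactly to zero. scikit-learn minimizes RSS/(2n) + alpha * sum|beta_j|.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]                  # only two active predictors
y = X @ beta_true + 0.5 * rng.standard_normal(n)

for alpha in (0.01, 0.1, 0.5):               # larger alpha = tighter bound t
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    print(alpha, int(np.sum(coef == 0.0)), "coefficients exactly zero")
```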

3.2 THE LASSO APPROACH IN PCA (SCoTLASS)

PCA on a correlation matrix finds linear combinations $a_k' x$ $(k = 1, 2, \ldots, p)$ of the p measured variables x, each standardized to have unit variance, which successively have maximum variance

$$ a_k' R a_k, \qquad (3.1) $$

subject to

$$ a_k' a_k = 1 \quad \text{and (for } k \geq 2) \quad a_h' a_k = 0, \quad h < k. \qquad (3.2) $$

The proposed method of LASSO-based PCA performs the maximization under the extra constraints

$$ \sum_{j=1}^{p} |a_{kj}| \leq t \qquad (3.3) $$

for some tuning parameter t, where $a_{kj}$ is the jth element of the kth vector $a_k$ $(k = 1, 2, \ldots, p)$. We call the new technique SCoTLASS (Simplified Component Technique-LASSO).


[Figure: the unit circle $a_1' a_1 = 1$ in the $(a_1, a_2)$ plane, ellipses $a_1' R a_1 = $ constant, and the constraint squares for t = 1 and t = √p; graphic not reproduced.]

Figure 1. The Two-Dimensional SCoTLASS.

3.3 SOME PROPERTIES

SCoTLASS differs from PCA in the inclusion of the constraints defined in (3.3), so a decision must be made on the value of the tuning parameter t. It is easy to see that

(a) for $t \geq \sqrt{p}$, we get PCA;
(b) for $t < 1$, there is no solution; and
(c) for $t = 1$, we must have exactly one nonzero $a_{kj}$ for each k.

As t decreases from $\sqrt{p}$, we move progressively away from PCA, and eventually reach a solution where only one variable has a nonzero loading on each component. All other loadings will shrink (not necessarily monotonically) with t, and ultimately reach zero. Examples of this behavior are given in the next section.
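Properties (a) and (b) follow from standard norm inequalities: for any unit-length vector a, the Cauchy-Schwarz inequality gives $\sum_j |a_j| \leq \sqrt{p}$, while $\sum_j |a_j| \geq (\sum_j a_j^2)^{1/2} = 1$. A quick numerical confirmation (an illustrative sketch, not from the article):

```python
# Check that 1 <= sum(|a_j|) <= sqrt(p) for unit-length vectors, so (3.3)
# is inactive for t >= sqrt(p) and infeasible for t < 1.
import numpy as np

rng = np.random.default_rng(1)
p = 13
for _ in range(1000):
    a = rng.standard_normal(p)
    a /= np.linalg.norm(a)                   # random unit-length loadings
    l1 = np.abs(a).sum()
    assert 1.0 - 1e-12 <= l1 <= np.sqrt(p) + 1e-12
print("bounds hold for all sampled vectors")
```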

The geometry of SCoTLASS in the case when p = 2 is shown in Figure 1, where we plot the elements $a_1, a_2$ of the vector a. For PCA, in Figure 1 the first component $a_1' x$, where $a_1' = (a_{11}, a_{12})$, corresponds to the point on the circumference of the shaded circle ($a_1' a_1 = 1$) which touches the "largest" possible ellipse $a_1' R a_1 = $ constant.

For SCoTLASS with $1 < t < \sqrt{p}$, we are restricted to the part of the circle $a_1' a_1 = 1$ inside the dotted square $\sum_{j=1}^{2} |a_{1j}| \leq t$.

For t = 1, corresponding to the inner shaded square, the optimal (only) solutions are on the axes.

Figure 1 shows a special case in which the axes of the ellipses are at 45° to the $a_1, a_2$ axes, corresponding to equal variances for $x_1, x_2$ (or a correlation matrix). This gives two optimal solutions for SCoTLASS in the first quadrant, symmetric about the 45° line. If PCA or SCoTLASS is done on a covariance rather than correlation matrix (see Remark 1 in Section 6) with unequal variances, there will be a unique solution in the first quadrant.

3.4 IMPLEMENTATION

PCA reduces to an easily implemented eigenvalue problem, but the extra constraint in SCoTLASS means that it needs numerical optimization to estimate parameters, and it suffers from the problem of many local optima. A number of algorithms have been tried to implement the technique, including simulated annealing (Goffe, Ferrier, and Rogers 1994), but the results reported in the following example were derived using the projected gradient approach (Chu and Trendafilov 2001; Helmke and Moore 1994). The LASSO inequality constraint (3.3) in the SCoTLASS problem (3.1)-(3.3) is eliminated by making use of an exterior penalty function. Thus the SCoTLASS problem (3.1)-(3.3) is transformed into a new maximization problem, subject to the equality constraint (3.2). The solution of this modified maximization problem is then found as an ascent gradient vector flow onto the p-dimensional unit sphere, following the standard projected gradient formalism (Chu and Trendafilov 2001; Helmke and Moore 1994). Detailed consideration of this solution will be reported separately.

We have implemented SCoTLASS using MATLAB. The code requires as input the correlation matrix R, the value of the tuning parameter t, and the number of components to be retained (m). The MATLAB code returns a loading matrix and calculates a number of relevant statistics. To achieve an appropriate solution, a number of parameters of the projected gradient method (e.g., starting points, absolute and relative tolerances) also need to be defined.
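The authors' MATLAB implementation is not reproduced here. The following is a minimal illustrative sketch, in Python, of the same general idea for the first component only: maximize $a' R a$ on the unit sphere with an exterior quadratic penalty on the LASSO constraint (3.3). The penalty weight mu, step size, and iteration count are arbitrary choices of this sketch; a simple subgradient replaces the smooth approximation of |·| used by the authors, and the orthogonality constraints (3.2) for later components are omitted.

```python
# Toy sketch (NOT the authors' MATLAB code) of the first SCoTLASS
# component: projected gradient ascent of a'Ra on the unit sphere with an
# exterior quadratic penalty on sum(|a_j|) <= t.
import numpy as np

def scotlass_first_component(R, t, mu=50.0, step=0.01, iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(R.shape[0])
    a /= np.linalg.norm(a)
    for _ in range(iters):
        excess = max(np.abs(a).sum() - t, 0.0)
        g = 2.0 * R @ a - 2.0 * mu * excess * np.sign(a)  # ascent direction
        g -= (a @ g) * a              # project onto tangent space of sphere
        a = a + step * g
        a /= np.linalg.norm(a)        # retract back onto the unit sphere
    return a

# Hypothetical example; in practice several random starts would be
# compared, since the problem has many local optima (see Remark 4).
rng = np.random.default_rng(2)
R = np.corrcoef(rng.standard_normal((100, 6)), rowvar=False)
a = scotlass_first_component(R, t=1.5)
print(np.round(a, 3), round(np.abs(a).sum(), 3))
```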

4 EXAMPLE: PITPROPS DATA

Here we revisit the pitprop data from Section 2. We have studied many other examples not discussed here, and in general the results are qualitatively similar. Table 4 gives loadings for SCoTLASS with t = 2.25, 2.00, 1.75, and 1.50. Table 5 gives variances, cumulative variances, "simplicity factors," and number of zero loadings for the same values of t, as well as corresponding information for PCA (t = √13) and RPCA. The simplicity factors are values of the varimax criterion for each component.

It can be seen that, as the value of t is decreased, the simplicity of the components increases, as measured by the number of zero loadings and by the varimax simplicity factor, although the increase in the latter is not uniform. The increase in simplicity is paid for by a loss of variance retained. By t = 2.25, 1.75, the percentage of variance accounted for by the first component in SCoTLASS is reduced from 32.4% for PCA to 26.7%, 19.6%, respectively. At t = 2.25 this is still larger than the largest contribution (23.9%) achieved by a single component in RPCA. The comparison with RPCA is less favorable when the variation accounted for by all six retained components is examined.


Table 4. Loadings for SCoTLASS for Four Values of t, Based on the Correlation Matrix for Jeffers' Pitprop Data

                                      Component
Technique     Variable     (1)      (2)      (3)      (4)      (5)      (6)
SCoTLASS      x1          0.558    0.085   −0.093   −0.107    0.056    0.017
(t = 2.25)    x2          0.580    0.031   −0.087   −0.147    0.073    0.047
              x3          0.000    0.647   −0.129    0.215   −0.064   −0.101
              x4          0.000    0.654   −0.000    0.211   −0.080    0.127
              x5         −0.000    0.000    0.413   −0.000    0.236    0.747
              x6          0.001    0.208    0.529   −0.022   −0.108    0.033
              x7          0.266   −0.000    0.385    0.000   −0.121    0.020
              x8          0.104   −0.098    0.000    0.584    0.127   −0.188
              x9          0.372   −0.000   −0.000    0.019    0.142   −0.060
              x10         0.364   −0.154    0.000    0.212   −0.296    0.000
              x11        −0.000    0.099   −0.000    0.000    0.879   −0.156
              x12        −0.000    0.241   −0.001   −0.699   −0.044   −0.186
              x13        −0.000    0.026   −0.608   −0.026   −0.016    0.561

SCoTLASS      x1          0.623    0.041   −0.049    0.040    0.051   −0.000
(t = 2.00)    x2          0.647    0.076   −0.001    0.072    0.059   −0.007
              x3          0.000    0.000   −0.684   −0.128   −0.054    0.106
              x4          0.000   −0.001   −0.670   −0.163   −0.063   −0.100
              x5         −0.000   −0.267    0.000    0.000    0.228   −0.772
              x6          0.000   −0.706   −0.044    0.011   −0.003    0.001
              x7          0.137   −0.539    0.001   −0.000   −0.096    0.001
              x8          0.001   −0.000    0.053   −0.767    0.095    0.098
              x9          0.332    0.000    0.000   −0.001    0.065    0.014
              x10         0.254   −0.000    0.124   −0.277   −0.309   −0.000
              x11         0.000    0.000   −0.065    0.000    0.902    0.194
              x12        −0.000    0.000   −0.224    0.533   −0.069    0.137
              x13         0.000    0.364   −0.079    0.000    0.000   −0.562

SCoTLASS      x1          0.664   −0.000    0.000   −0.025    0.002   −0.035
(t = 1.75)    x2          0.683   −0.001    0.000   −0.040    0.001   −0.018
              x3          0.000    0.641    0.195    0.000    0.180   −0.030
              x4          0.000    0.701    0.001    0.000   −0.000   −0.001
              x5         −0.000    0.000   −0.000    0.000   −0.887   −0.056
              x6          0.000    0.293   −0.186    0.000   −0.373    0.044
              x7          0.001    0.107   −0.658   −0.000   −0.051    0.064
              x8          0.001   −0.000   −0.000    0.735    0.021   −0.168
              x9          0.283   −0.000   −0.000    0.000   −0.000   −0.001
              x10         0.113   −0.000   −0.001    0.388   −0.017    0.320
              x11         0.000    0.000    0.000   −0.000   −0.000   −0.923
              x12        −0.000    0.001    0.000   −0.554    0.016    0.004
              x13         0.000   −0.000    0.703    0.001   −0.197    0.080

SCoTLASS      x1          0.701   −0.000   −0.001   −0.001    0.001   −0.000
(t = 1.50)    x2          0.709   −0.001   −0.001   −0.001    0.001   −0.000
              x3          0.001    0.698   −0.068    0.002   −0.001    0.001
              x4          0.001    0.712   −0.001    0.002   −0.001   −0.001
              x5         −0.000    0.000    0.001   −0.001    0.093   −0.757
              x6          0.000    0.081    0.586   −0.031    0.000    0.000
              x7          0.001    0.000    0.807   −0.001    0.000    0.002
              x8          0.001   −0.000    0.001    0.044   −0.513   −0.001
              x9          0.079    0.000    0.001    0.001   −0.001   −0.000
              x10         0.002   −0.000    0.027    0.660   −0.000   −0.002
              x11         0.000    0.001   −0.000   −0.749   −0.032   −0.000
              x12        −0.000    0.001   −0.000   −0.001    0.853    0.083
              x13         0.000    0.000   −0.001    0.000   −0.001    0.648


Table 5. Simplicity Factor, Variance, Cumulative Variance, and Number of Zero Loadings for Individual Components in PCA, RPCA, and SCoTLASS for Four Values of t, Based on the Correlation Matrix for Jeffers' Pitprop Data

                                                      Component
Technique            Measure                        (1)     (2)     (3)     (4)     (5)     (6)
PCA                  Simplicity factor (varimax)   0.059   0.103   0.082   0.397   0.086   0.266
(= SCoTLASS with     Variance (%)                   32.4    18.2    14.4     8.9     7.0     6.3
t = √13)             Cumulative variance (%)        32.4    50.7    65.1    74.0    80.9    87.2

RPCA                 Simplicity factor (varimax)   0.362   0.428   0.199   0.595   0.131   0.343
                     Variance (%)                   13.0    14.6    18.4     9.7    23.9     7.6
                     Cumulative variance (%)        13.0    27.6    46.0    55.7    79.6    87.2

SCoTLASS             Simplicity factor (varimax)   0.190   0.312   0.205   0.308   0.577   0.364
(t = 2.25)           Variance (%)                   26.7    17.2    15.9     9.7     8.9     6.7
                     Cumulative variance (%)        26.7    43.9    59.8    69.4    78.4    85.0
                     Number of zero loadings           6       3       5       3       0       1

SCoTLASS             Simplicity factor (varimax)   0.288   0.301   0.375   0.387   0.646   0.412
(t = 2.00)           Variance (%)                   23.1    16.4    16.2    11.2     8.9     6.5
                     Cumulative variance (%)        23.1    39.5    55.8    67.0    75.9    82.3
                     Number of zero loadings           7       6       2       4       1       2

SCoTLASS             Simplicity factor (varimax)   0.370   0.370   0.388   0.360   0.610   0.714
(t = 1.75)           Variance (%)                   19.6    16.0    13.2    13.0     9.2     9.1
                     Cumulative variance (%)        19.6    35.6    48.7    61.8    71.0    80.1
                     Number of zero loadings           7       7       7       7       3       0

SCoTLASS             Simplicity factor (varimax)   0.452   0.452   0.504   0.464   0.565   0.464
(t = 1.50)           Variance (%)                   16.1    14.9    13.8    10.2     9.9     9.6
                     Cumulative variance (%)        16.1    31.0    44.9    55.1    65.0    74.5
                     Number of zero loadings           5       7       2       1       3       5

RPCA necessarily retains the same total percentage variation (87.2%) as PCA, but SCoTLASS drops to 85.0% and 80.1% for t = 2.25, 1.75, respectively. Against this loss, SCoTLASS has the considerable advantage of retaining the successive maximization property. At t = 2.25, apart from switching of components 4 and 5, the SCoTLASS components are nicely simplified versions of the PCs, rather than being something different as in RPCA. A linked advantage is that, if we decide to look only at five components, then the SCoTLASS components will simply be the first five in Table 4, whereas RPCs based on (m − 1) retained components are not necessarily similar to those based on m.

A further "plus" for SCoTLASS is the presence of zero loadings, which aids interpretation, but even where there are few zeros the components are simplified compared to PCA. Consider specifically the interpretation of the second component, which we noted earlier was "messy" for PCA. For t = 1.75 this component is now interpreted as measuring mainly moisture content and specific gravity, with small contributions from numbers of annual rings, and all other variables negligible. This gain in interpretability is achieved by reducing the percentage of total variance accounted for from 18.2% to 16.0%. For other components, too, interpretation is made easier because in the majority of cases the contribution of a variable is clear-cut: either it is important or it is not, with few equivocal contributions.


Table 6. Specified Eigenvectors of a Six-Dimensional Block Structure

                               Eigenvectors
Variable      (1)      (2)      (3)      (4)      (5)      (6)
x1           0.096   −0.537    0.759   −0.120    0.335   −0.021
x2           0.082   −0.565   −0.599    0.231    0.511   −0.013
x3           0.080   −0.608   −0.119   −0.119   −0.771    0.016
x4           0.594    0.085   −0.074   −0.308    0.069    0.731
x5           0.584    0.096   −0.114   −0.418    0.052   −0.678
x6           0.533    0.074    0.180    0.805   −0.157   −0.069

Variance    1.8367    1.640    0.751    0.659    0.607    0.506

For t = 2.25, 2.00, 1.75, 1.50 the numbers of zeros are, respectively, 18, 22, 31, and 23. It seems surprising that we obtain fewer zeros with t = 1.50 than with t = 1.75; that is, the solution with t = 1.75 appears to be simpler than the one with t = 1.50. In fact this impression is misleading (see also the next paragraph). The explanation of this anomaly is in the projected gradient method used for numerical solution of the problem, which approximates the LASSO constraint with a certain smooth function, and thus the zero loadings produced may also be approximate. One can see that the solution with t = 1.50 contains a total of 56 loadings with magnitude less than 0.005, compared to 42 in the case t = 1.75.

Another interesting comparison is in terms of average varimax simplicity over the first six components. This is 0.343 for RPCA, compared to 0.165 for PCA. For t = 2.25, 2.00, 1.75, 1.50 the average simplicity is 0.326, 0.402, 0.469, 0.487, respectively. This demonstrates that, although the varimax criterion is not an explicit part of SCoTLASS, by taking t small enough we can do better than RPCA with respect to its own criterion. This is achieved by moving outside the space spanned by the retained PCs, and hence settling for a smaller amount of overall variation retained.
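These averages can be recomputed directly from the per-component simplicity factors in Table 5 (a simple arithmetic check; small discrepancies can arise from the rounding of the tabulated factors):

```python
# Average varimax simplicity over six components, from Table 5.
rows = {
    "PCA":             [0.059, 0.103, 0.082, 0.397, 0.086, 0.266],
    "RPCA":            [0.362, 0.428, 0.199, 0.595, 0.131, 0.343],
    "SCoTLASS t=2.25": [0.190, 0.312, 0.205, 0.308, 0.577, 0.364],
    "SCoTLASS t=1.75": [0.370, 0.370, 0.388, 0.360, 0.610, 0.714],
}
for name, v in rows.items():
    # prints approximately 0.165, 0.343, 0.326, 0.469, as quoted above
    print(name, round(sum(v) / len(v), 3))
```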

5 SIMULATION STUDIES

One question of interest is whether SCoTLASS is better at detecting underlying simple structure in a dataset than is PCA or RPCA. To investigate this question, we simulated data from a variety of known structures. Because of space constraints, only a small part of the results is summarized here; further details can be found in Uddin (1999).

Given a vector l of positive real numbers and an orthogonal matrix A, we can attempt to find a covariance matrix or correlation matrix whose eigenvalues are the elements of l and whose eigenvectors are the columns of A. Some restrictions need to be imposed on l and A, especially in the case of correlation matrices, but it is possible to find such matrices for a wide range of eigenvector structures. Having obtained a covariance or correlation matrix, it is straightforward to generate samples of data from multivariate normal distributions with the given covariance or correlation matrix. We have done this for a wide variety of eigenvector structures (principal component loadings) and computed the PCs, RPCs, and SCoTLASS components from the resulting sample correlation matrices.


Table 7. Specified Eigenvectors of a Six-Dimensional Intermediate Structure

                               Eigenvectors
Variable      (1)      (2)      (3)      (4)      (5)      (6)
x1           0.224   −0.509    0.604    0.297   −0.327    0.361
x2           0.253   −0.519   −0.361   −0.644   −0.341   −0.064
x3           0.227   −0.553   −0.246    0.377    0.608   −0.267
x4           0.553    0.249   −0.249   −0.052    0.262    0.706
x5           0.521    0.254   −0.258    0.451   −0.509   −0.367
x6           0.507    0.199    0.561   −0.384    0.281   −0.402

Variance     1.795    1.674    0.796    0.618    0.608    0.510

Various structures have been investigated, which we call block structure, intermediate structure, and uniform structure. Tables 6-8 give one example of each type of structure. The structure in Table 6 has blocks of nontrivial loadings and blocks of near-zero loadings in each underlying component. Table 8 has a structure in which all loadings in the first two components have similar absolute values, and the structure in Table 7 is intermediate to those of Tables 6 and 8. An alternative approach to the simulation study would be to replace the near-zero loadings by exact zeros and the nearly equal loadings by exact equalities. However, we feel that in reality underlying structures are never quite that simple, so we perturbed them a little.

It might be expected that, if the underlying structure is simple, then sampling variation is more likely to take sample PCs away from simplicity than to enhance this simplicity. It is of interest to investigate whether the techniques of RPCA and SCoTLASS, which increase simplicity compared to the sample PCs, will do so in the direction of the true underlying structure. The closeness of a vector of loadings from any of these techniques to the underlying true vector is measured by the angle between the two vectors of interest. These angles are given in Tables 9-11 for single simulated datasets from three different types of six-dimensional structure; they typify what we found in other simulations. Three values of t (apart from that for PCA) are shown in the tables. Their exact values are unimportant, and are slightly different in different tables; they are chosen to illustrate typical behavior in our simulations.
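The simulation recipe just described can be sketched as follows. This is an illustrative reconstruction only: a random orthogonal A is used rather than the specific structures of Tables 6-8 (though the eigenvalues of Table 6 are borrowed), and the angle function measures closeness as in Tables 9-11.

```python
# Sketch of the simulation recipe: fix eigenvalues l and an orthogonal A,
# form the covariance A diag(l) A', sample multivariate normal data, and
# measure the angle between each true and sample eigenvector.
import numpy as np

rng = np.random.default_rng(3)
p, n = 6, 180
l = np.array([1.8367, 1.640, 0.751, 0.659, 0.607, 0.506])  # as in Table 6
A, _ = np.linalg.qr(rng.standard_normal((p, p)))  # random orthogonal A
cov = A @ np.diag(l) @ A.T

X = rng.multivariate_normal(np.zeros(p), cov, size=n)
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
evecs = evecs[:, np.argsort(evals)[::-1]]         # sample PC loadings

def angle_deg(u, v):
    c = abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(min(c, 1.0)))

print([round(angle_deg(A[:, k], evecs[:, k]), 1) for k in range(4)])
```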

The results illustrate that, for each structure, RPCA is, perhaps surprisingly and certainly disappointingly, bad at recovering the underlying structure.

Table 8. Specified Eigenvectors of a Six-Dimensional Uniform Structure

                               Eigenvectors
Variable      (1)      (2)      (3)      (4)      (5)      (6)
x1          −0.455    0.336   −0.087    0.741   −0.328    0.125
x2          −0.439    0.370   −0.212   −0.630   −0.445   −0.175
x3          −0.415    0.422    0.378   −0.110    0.697    0.099
x4           0.434    0.458    0.040   −0.136   −0.167    0.744
x5           0.301    0.435   −0.697    0.114    0.356   −0.306
x6           0.385    0.416    0.563    0.104   −0.234   −0.545

Variance     1.841    1.709    0.801    0.649    0.520    0.480


Table 9. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS With Various Values of t, for a Specified "Block" Structure of Correlation Eigenvectors

                               Vectors
Technique        t         (1)     (2)     (3)     (4)
PCA             √6        12.9    12.0    15.4    79.5
RPCA                      37.7    45.2    45.3    83.1
SCoTLASS     t = 2.00     12.9    12.0    15.4    78.6
             t = 1.82     11.9    11.4    13.8    73.2
             t = 1.75      9.4    10.0    12.5    85.2

SCoTLASS, on the other hand, is capable of improvement over PCA. For example, for t = 1.75 it not only improves over PCA in terms of angles in Table 9, but it also has 3, 3, and 2 zero loadings in its first three components, thus giving a notably simpler structure. None of the methods manages to reproduce the underlying structure for component 4 in Table 9.

The results for intermediate structure in Table 10 are qualitatively similar to those in Table 9, except that SCoTLASS does best for higher values of t than in Table 9. For uniform structure (Table 11), SCoTLASS does badly compared to PCA for all values of t. This is not unexpected, because although uniform structure is simple in its own way, it is not the type of simplicity which SCoTLASS aims for. It is also the case that the varimax criterion is designed so that it stands little chance of finding uniform structure. Other rotation criteria, such as quartimax, can in theory find uniform vectors of loadings, but they were tried and also found to be unsuccessful in our simulations. It is probable that a uniform structure is more likely to be found by the techniques proposed by Hausman (1982) or Vines (2000). Although SCoTLASS will usually fail to find such structures, their existence may be indicated by a large drop in the variance explained by SCoTLASS as decreasing values of t move it away from PCA.

6 DISCUSSION

A new technique, SCoTLASS, has been introduced for discovering and interpreting the major sources of variability in a dataset. We have illustrated its usefulness in an example, and have also shown through simulations that it is capable of recovering certain types of underlying structure.

Table 10. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS With Various Values of t, for a Specified "Intermediate" Structure of Correlation Eigenvectors

                               Vectors
Technique        t         (1)     (2)     (3)     (4)
PCA             √6        13.7    15.1    23.6    78.3
RPCA                      53.3    42.9    66.4    77.7
SCoTLASS     t = 2.28      5.5    10.4    23.7    78.3
             t = 2.12      9.5    12.0    23.5    77.8
             t = 2.01     17.5    19.1    22.5    71.7


Table 11. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS With Various Values of t, for a Specified "Uniform" Structure of Correlation Eigenvectors

                               Vectors
Technique        t         (1)     (2)     (3)     (4)
PCA             √6         6.0     6.4    23.2    31.8
RPCA                      52.0    54.1    39.4    43.5
SCoTLASS     t = 2.15     24.3    24.6    23.4    30.7
             t = 1.91     34.5    34.6    21.8    23.5
             t = 1.73     41.0    41.3    22.5    19.0

It is preferred in many respects to rotated principal components as a means of simplifying interpretation compared to principal component analysis. Although we are convinced of the value of SCoTLASS, there are a number of complications and open questions, which are now listed as a set of remarks.

Remark 1. In this article we have carried out the techniques studied on correlation matrices. Although it is less common in practice, PCA and RPCA can also be implemented on covariance matrices. In this case PCA successively finds uncorrelated linear functions of the original unstandardized variables. SCoTLASS can also be implemented in this case, the only difference being that the sample correlation matrix R is replaced by the sample covariance matrix S in equation (3.1). We have investigated covariance-based SCoTLASS, both for real examples and using simulation studies. Some details of its performance are different from the correlation-based case, but qualitatively they are similar. In particular, there are a number of reasons to prefer SCoTLASS to RPCA.

Remark 2. In PCA the constraint $a_h' a_k = 0$ (orthogonality of vectors of loadings) is equivalent to $a_h' R a_k = 0$ (different components are uncorrelated). This equivalence is special to the PCs, and is a consequence of the $a_k$ being eigenvectors of R. When we rotate the PC loadings, we lose at least one of these two properties (Jolliffe 1995). Similarly, in SCoTLASS, if we impose $a_h' a_k = 0$ we no longer have uncorrelated components. For example, Table 12 gives the correlations between the six SCoTLASS components when t = 1.75 for the pitprop data. Although most of the correlations in Table 12 are small in absolute value, there are also nontrivial ones ($r_{12}$, $r_{14}$, $r_{34}$, and $r_{35}$).

Table 12. Correlation Matrix for the First Six SCoTLASS Components for t = 1.75, Using Jeffers' Pitprop Data

                           Correlation matrix
Components     (1)      (2)      (3)      (4)      (5)      (6)
(1)           1.000    0.375   −0.234    0.443   −0.010    0.061
(2)                    1.000   −0.114   −0.076   −0.145   −0.084
(3)                             1.000   −0.438    0.445   −0.187
(4)                                      1.000   −0.105    0.141
(5)                                               1.000   −0.013
(6)                                                        1.000


It is possible to replace the constraint $a_h' a_k = 0$ in SCoTLASS by $a_h' R a_k = 0$, thus choosing to have uncorrelated components rather than orthogonal loadings, but this option is not explored in the present article.
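For completeness, the correlations in Table 12 can be recovered directly from the loading matrix, since for standardized x the covariance between components is $\text{cov}(a_h' x, a_k' x) = a_h' R a_k$. A small sketch (A is a hypothetical matrix holding loading vectors as columns):

```python
# Correlations between components z_k = a_k' x, given loadings A (p x m)
# and correlation matrix R: a_h'Ra_k normalized by the component
# standard deviations.
import numpy as np

def component_correlations(A, R):
    C = A.T @ R @ A                    # covariance matrix of the components
    d = np.sqrt(np.diag(C))
    return C / np.outer(d, d)

# With orthonormal loadings and R = I, the components are uncorrelated:
print(component_correlations(np.eye(3), np.eye(3)))
```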

Remark 3. The choice of t is clearly important in SCoTLASS. As t decreases, simplicity increases, but variation explained decreases, and we need to achieve a suitable tradeoff between these two properties. The correlation between components noted in the previous remark is another aspect of any tradeoff. Correlations are small for t close to $\sqrt{p}$, but have the potential to increase as t decreases. It might be possible to construct a criterion which defines the "best tradeoff," but there is no unique construction, because of the difficulty of deciding how to measure simplicity, and how to combine variance, simplicity, and correlation. At present it seems best to compute the SCoTLASS components for several values of t and judge subjectively at what point a balance between these various aspects is achieved. In our example we used the same value of t for all components in a dataset, but varying t for different components is another possibility.

Remark 4. Our algorithms for SCoTLASS are slower than those for PCA. This is because SCoTLASS is implemented subject to an extra restriction on PCA, and we lose the advantage of calculation via the singular value decomposition, which makes the PCA algorithm fast. Sequential-based PCA with an extra constraint requires a good optimizer to produce a global optimum. In the implementation of SCoTLASS, a projected gradient method is used which is globally convergent and preserves accurately both the equality and inequality constraints. It should be noted that, as t is reduced from $\sqrt{p}$ downwards towards unity, the CPU time taken to optimize the objective function remains generally the same (11 sec on average for a 1GHz PC), but as t decreases the algorithm becomes progressively prone to hit local minima, and thus more (random) starts are required to find a global optimum. Osborne, Presnell, and Turlach (2000) gave an efficient procedure, based on convex programming and a dual problem, for implementing the LASSO in the regression context. Whether or not this approach can be usefully adapted to SCoTLASS will be investigated in further research. Although we are reasonably confident that our algorithm has found global optima in the example of Section 4 and in the simulations, there is no guarantee. The jumps that occur in some coefficients, such as the change in $a_{36}$ from −0.186 to 0.586 as t decreases from 1.75 to 1.50 in the pitprop data, could be due to one of the solutions being a local optimum. However, it seems more likely to us that it is caused by the change in the nature of the earlier components, which, together with the orthogonality constraint imposed on the third component, opens up a different range of possibilities for the latter component. There is clearly much scope for further work on the implementation of SCoTLASS.

Remark 5. In a number of examples not reported here, several of the nonzero loadings in the SCoTLASS components are exactly equal, especially for large values of p and small values of t. At present we have no explanation for this, but it deserves further investigation.


Remark 6. The way in which SCoTLASS sets some coefficients to zero is different in concept from simply truncating to zero the smallest coefficients in a PC. The latter attempts to approximate the PC by a simpler version, and can have problems with that approximation, as shown by Cadima and Jolliffe (1995). SCoTLASS looks for simple sources of variation and, like PCA, aims for high variance, but because of simplicity considerations the simplified components can in theory be moderately different from the PCs. We seek to replace PCs rather than approximate them, although, because of the shared aim of high variance, the results will often not be too different.

Remark 7. There are a number of other recent developments which are relevant to interpretation problems in multivariate statistics; Jolliffe (2002), Chapter 11, reviews these in the context of PCA. Vines's (2000) use of only a discrete number of values for loadings has already been mentioned. It works well in some examples, but for the pitprops data the components 4-6 are rather complex. A number of aspects of the strategy for selecting a subset of variables were explored by Cadima and Jolliffe (1995, 2001) and by Tanaka and Mori (1997). The LASSO has been generalized in the regression context to so-called bridge estimation, in which the constraint $\sum_{j=1}^{p} |\beta_j| \leq t$ is replaced by $\sum_{j=1}^{p} |\beta_j|^{\gamma} \leq t$, where $\gamma$ is not necessarily equal to unity; see, for example, Fu (1998). Tibshirani (1996) also mentioned the nonnegative garotte, due to Breiman (1995), as an alternative approach. Translation of the ideas of the bridge and nonnegative garotte to the context of PCA, and comparison with other techniques, would be of interest in future research.

ACKNOWLEDGMENTS

An earlier draft of this article was prepared while the first author was visiting the Bureau of Meteorology Research Centre (BMRC), Melbourne, Australia. He is grateful to BMRC for the support and facilities provided during his visit, and to the Leverhulme Trust for partially supporting the visit under their Study Abroad Fellowship scheme. Comments from two referees and the editor have helped to improve the clarity of the article.

[Received November 2000. Revised July 2002.]

REFERENCES

Breiman, L. (1995), "Better Subset Regression Using the Nonnegative Garotte," Technometrics, 37, 373-384.

Cadima, J., and Jolliffe, I. T. (1995), "Loadings and Correlations in the Interpretation of Principal Components," Journal of Applied Statistics, 22, 203-214.

Cadima, J., and Jolliffe, I. T. (2001), "Variable Selection and the Interpretation of Principal Subspaces," Journal of Agricultural, Biological, and Environmental Statistics, 6, 62-79.

Chu, M. T., and Trendafilov, N. T. (2001), "The Orthogonally Constrained Regression Revisited," Journal of Computational and Graphical Statistics, 10, 1-26.

Fu, J. W. (1998), "Penalized Regression: The Bridge Versus the Lasso," Journal of Computational and Graphical Statistics, 7, 397-416.

Goffe, W. L., Ferrier, G. D., and Rogers, J. (1994), "Global Optimizations of Statistical Functions with Simulated Annealing," Journal of Econometrics, 60, 65-99.

Hausman, R. (1982), "Constrained Multivariate Analysis," in Optimization in Statistics, eds. S. H. Zanckis and J. S. Rustagi, Amsterdam: North Holland, pp. 137-151.

Helmke, U., and Moore, J. B. (1994), Optimization and Dynamical Systems, London: Springer.

Jeffers, J. N. R. (1967), "Two Case Studies in the Application of Principal Component Analysis," Applied Statistics, 16, 225-236.

Jolliffe, I. T. (1989), "Rotation of Ill-Defined Principal Components," Applied Statistics, 38, 139-147.

Jolliffe, I. T. (1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied Statistics, 22, 29-35.

Jolliffe, I. T. (2002), Principal Component Analysis (2nd ed.), New York: Springer-Verlag.

Jolliffe, I. T., and Uddin, M. (2000), "The Simplified Component Technique - An Alternative to Rotated Principal Components," Journal of Computational and Graphical Statistics, 9, 689-710.

Jolliffe, I. T., Uddin, M., and Vines, S. K. (2002), "Simplified EOFs - Three Alternatives to Rotation," Climate Research, 20, 271-279.

Krzanowski, W. J., and Marriott, F. H. C. (1995), Multivariate Analysis, Part II, London: Arnold.

LeBlanc, M., and Tibshirani, R. (1998), "Monotone Shrinkage of Trees," Journal of Computational and Graphical Statistics, 7, 417-433.

Morton, S. C. (1989), "Interpretable Projection Pursuit," Technical Report 106, Department of Statistics, Stanford University.

McCabe, G. P. (1984), "Principal Variables," Technometrics, 26, 137-144.

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000), "On the LASSO and its Dual," Journal of Computational and Graphical Statistics, 9, 319-337.

Tanaka, Y., and Mori, Y. (1997), "Principal Component Analysis Based on a Subset of Variables: Variable Selection and Sensitivity Analysis," American Journal of Mathematical and Management Sciences, 17, 61-89.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267-288.

Uddin, M. (1999), "Interpretation of Results from Simplified Principal Components," PhD thesis, University of Aberdeen, Aberdeen, Scotland.

Vines, S. K. (2000), "Simple Principal Components," Applied Statistics, 49, 441-451.

Page 3: Open Research Onlineoro.open.ac.uk/3949/1/Article_JCGS.pdfDepartmentof Statistics,University ofKarachi, Karachi-75270, Pakistan (E-mail: mudassir2000@hotmail.com). ®c2003 American

532 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

proportion of the p variables interpretationcan be dif cult detracting from the value of theanalysis

A number of methods are available to aid interpretation Rotation which is common-place in factor analysis can be applied to PCA but has its drawbacks (Jolliffe 1989 1995)A frequently used informal approach is to ignore all loadings smaller than some thresh-old absolute value effectively treating them as zero This can be misleading (Cadima andJolliffe 1995) A more formal way of making some of the loadings zero is to restrict theallowable loadings to a small set of values for example iexcl 1 0 1 (Hausman 1982) Vines(2000) described a variation on this theme One further strategy is to select a subset of thevariables themselveswhich satisfy similar optimalitycriterion to the principalcomponentsas in McCabersquos (1984) ldquoprincipal variablesrdquo

This article introducesa new techniquewhich shares an idea central to both Hausmanrsquos(1982) and Vinesrsquos (2000) work This idea is that we choose linear combinations of themeasured variables which successively maximizes variance as in PCA but we imposeextra constraints which sacri ces some variance in order to improve interpretability Inour technique the extra constraint is in the form of a bound on the sum of the absolutevalues of the loadings in that component This type of bound has been used in regression(Tibshirani 1996) where similar problems of interpretationoccur and is known there as theLASSO (least absolute shrinkage and selection operator) As with the methods of Hausman(1982) and Vines (2000) and unlike rotation our technique usually produces some exactlyzero loadings in the components In contrast to Hausman (1982) and Vines (2000) it doesnot restrict the nonzero loadings to a discrete set of values This article shows throughsimulationsand an example that the new techniqueis a valuableadditionaltool for exploringthe structure of multivariate data

Section 2 establishes the notation and terminology of PCA and introduces an exam-ple in which interpretation of principal components is not straightforward The most usualapproach to simplifying interpretation the rotation of PCs is shown to have drawbacksSection 3 introduces the new technique and describes some of its properties Section 4 re-visits the example of Section 2 and demonstrates the practical usefulness of the techniqueA simulation study which investigates the ability of the technique to recover known un-derlying structures in a dataset is summarized in Section 5 The article ends with furtherdiscussion in Section 6 including some modi cations complications and open questions

2 A MOTIVATING EXAMPLE

Consider the classic example rst introduced by Jeffers (1967) in which a PCA wasdone on the correlation matrix of 13 physical measurements listed in Table 1 made on asample of 180 pitprops cut from Corsican pine timber

Let xi be the vector of 13 variables for the ith pitprop where each variable has beenstandardized to have unit variance What PCA does when based on the correlation matrixis to nd linear functions a

0

1x a0

2x a0

px which successively have maximum samplevariance subject to a

0

hak = 0 for k para 2 and h lt k In addition a normalization constraint

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 533

Table 1 Denitions of Variables in Jeffersrsquo Pitprop Data

Variable De nition

x1 Top diameter in inchesx2 Length in inchesx3 Moisture content of dry weightx4 Specic gravity at time of testx5 Oven-dry specic gravityx6 Number of annual rings at topx7 Number of annual rings at bottomx8 Maximum bow in inchesx9 Distance of point of maximum bow from top in inchesx10 Number of knot whorlsx11 Length of clear prop from top in inchesx12 Average number of knots per whorlx13 Average diameter of the knots in inches

a0

kak = 1 is necessary to get a bounded solution The derived variable a0

kx is the kthprincipal component (PC) It turns out that ak the vector of coef cients or loadings forthe kth PC is the eigenvector of the sample correlation matrix R corresponding to the kthlargest eigenvalue lk In addition the sample variance of a

0

kx is equal to lk Because ofthe successive maximization property the rst few PCs will often account for most of thesample variation in all the standardized measured variables In the pitprop example Jeffers(1967) was interested in the rst six PCs which together account for 87 of the totalvariance The loadings in each of these six components are given in Table 2 together withthe individual and cumulative percentage of variance in all 13 variables accounted for by1 2 6 PCs

PCs are easiest to interpret if the pattern of loadings is clear-cut with a few large

Table 2 Loadings for Correlation PCA for Jeffersrsquo Pitprop Data

Component

Variable (1) (2) (3) (4) (5) (6)

x1 0404 0212 iexcl0219 iexcl0027 iexcl0141 iexcl0086x2 0406 0180 iexcl0245 iexcl0025 iexcl0188 iexcl0111x3 0125 0546 0114 0015 0433 0120x4 0173 0468 0328 0010 0361 iexcl0090x5 0057 iexcl0138 0493 0254 iexcl0122 iexcl0560x6 0284 iexcl0002 0476 iexcl0153 iexcl0269 0032x7 0400 iexcl0185 0261 iexcl0125 iexcl0176 0030x8 0294 iexcl0198 iexcl0222 0294 0203 0103x9 0357 0010 iexcl0202 0132 iexcl0117 0103x10 0379 iexcl0252 iexcl0120 iexcl0201 0173 iexcl0019x11 iexcl0008 0187 0021 0805 iexcl0302 0178x12 iexcl0115 0348 0066 iexcl0303 iexcl0537 0371x13 iexcl0112 0304 iexcl0352 iexcl0098 iexcl0209 iexcl0671

Simplicity factor (varimax) 0059 0103 0082 0397 0086 0266Variance () 324 182 144 89 70 63

Cumulative Variance () 324 507 650 740 809 872

534 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Table 3 Loadings for Rotated Correlation PCA Using the Varimax Criterion for Jeffersrsquo Pitprop Data

Component

Variable (1) (2) (3) (4) (5) (6)

x1 iexcl0019 0074 0043 iexcl0027 iexcl0519 iexcl0077x2 iexcl0018 0015 0048 iexcl0024 iexcl0540 iexcl0102x3 iexcl0024 0705 iexcl0128 0003 iexcl0059 0107x4 0029 0689 0112 0001 0014 iexcl0087x5 0258 0009 0477 0218 0205 iexcl0524x6 iexcl0185 0061 0604 iexcl0005 iexcl0032 0012x7 0031 iexcl0069 0512 iexcl0102 iexcl0151 0092x8 0440 iexcl0042 iexcl0072 0083 iexcl0221 0239x9 0097 iexcl0058 0045 0094 iexcl0408 0141x10 0271 iexcl0054 0129 iexcl0367 iexcl0216 0135x11 0057 iexcl0022 iexcl0029 0882 iexcl0137 0075x12 iexcl0776 iexcl0056 0091 0079 iexcl0123 0145x13 iexcl0120 iexcl0049 iexcl0280 iexcl0077 iexcl0269 iexcl0748

Simplicity factor (varimax) 0362 0428 0199 0595 0131 0343variance ( ) 130 146 184 97 239 76

cumulative variance ( ) 130 276 460 557 796 872

(absolute) values and many small loadings in each PC Although Jeffers (1967) makes anattempt to interpret all six components some are to say the least messy and he ignoressome intermediate loadingsFor example PC2 has the largest loadingson x3 x4 with smallloadings on x6 x9 but a whole range of intermediate values on other variables

A traditionalway to simplify loadings is by rotation If A is the (13 pound 6) matrix whosekth column is ak then A is post-multipliedby a matrix T to give rotated loadingsB = ATIf bk is the kth column of B then b

0

kx is the kth rotated componentThe matrix T is chosenso as to optimize some simplicitycriterionVarious criteria have been proposedall of whichattempt to create vectors of loadings whose elements are close to zero or far from zero withfew intermediate values The idea is that each variable should be either clearly importantor clearly unimportant in a rotated component with as few cases as possible of borderlineimportance Varimax is the most widely used rotation criterion and like most other suchcriteria it tends to drive at least some of the loadings in each component towards zero Thisis not the only possible type of simplicity A component whose loadings are all roughlyequal is easy to interpret but will be avoided by most standard rotation criteria It is dif cultto envisage any criterion which could encompass all possible types of simplicity and weconcentrate here on simplicity as de ned by varimax

Table 3 gives the rotated loadings for six components in the correlation PCA of thepitprop data togetherwith the percentageof total variance accounted for by each rotated PC(RPC) The rotation criterion used in Table 3 is varimax (Krzanowski and Marriott 1995 p138) which is the most frequent choice (often the default in software) but other criteria givesimilar results Varimax rotation aims to maximize the sum over rotated components of acriterion which takes values between zero and one A value of zero occurs when all loadingsin the component are equal whereas a component with only one nonzero loading producesa value of unity This criterion or ldquosimplicity factorrdquo is given for each component in Tables2 and 3 and it can be seen that its values are larger for most of the rotated components than

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 535

they are for the unrotated componentsThere are however a number of disadvantages associated with rotation In the context

of interpreting the results in this example we note that we have lost the ldquosuccessive max-imization of variancerdquo property of the unrotated components so what we are interpretingafter rotation are not the ldquomost important sources of variationrdquo in the data The RPC withthe highest variance appears arbitrarily as the fth and this accounts for 24 of the totalvariation compared to 32 in the rst unrotated PC In addition a glance at the loadingsand simplicity factors for the RPCs shows that more generally those components whichare easiest to interpret among the six in Table 3 are those which have the smallest varianceRPC5 is still rather complicated Other problems associated with rotation were discussedby Jolliffe (1989 1995) A simpli ed component technique (SCoT) in which the two stepsof RPCA (PCA followed by rotation) are combined into one was discussed by Jolliffe andUddin (2000) The technique is based on a similar idea proposed by Morton (1989) in thecontext of projection pursuit It maximizes variance but adds a penalty function which isa multiple of one of the simplicity criteria such as varimax SCoT has some advantagescompared to standard rotation but shares a number of its disadvantages

The next section introduces an alternative to rotation which has some clear advantagesover rotated PCA and SCoT A detailed comparison of the new technique SCoT rotatedPCA and Vinesrsquo (2000) simple components is given for an example involving sea surfacetemperatures in Jolliffe Uddin and Vines (2002)

3 MODIFIED PCA BASED ON THE LASSO

Tibshirani (1996) studied the dif culties involved in the interpretation of multiple re-gression equationsThese problems may occur due to the instability of the regression coef- cients in the presence of collinearity or simply because of the large number of variablesincluded in the regression equation Some current alternatives to least squares regressionsuch as shrinkage estimators ridge regression principal component regression or partialleast squares handle the instabilityproblemby keepingall variablesin the equationwhereasvariable selection procedures nd a subset of variables and keep only the selected variablesin the equation Tibshirani (1996) proposed a new method the ldquoleast absolute shrinkageand selection operatorrdquo LASSO which is a compromise between variable selection andshrinkage estimators The procedure shrinks the coef cients of some of the variables notsimply towards zero but exactly to zero giving an implicit form of variable selectionLeBlanc and Tibshirani (1998) extended the idea to regression trees Here we adapt theLASSO idea to PCA

31 THE LASSO APPROACH IN REGRESSION

In standard multiple regression we have the equation

yi = not +

p

j = 1

shy jxij + ei i = 1 2 n

536 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

where y1 y2 yn are measurements on a response variable y xij i = 1 2 n j =

1 2 p are corresponding values of p predictor variables e1 e2 en are error termsand not shy 1 shy 2 shy p are parameters in the regression equation In least squares regressionthese parameters are estimated by minimizing the residual (or error) sum of squares

n

i = 1

yi iexcl not iexclp

j = 1

shy jxij

2

The LASSO imposes an additional restriction on the coef cients namely

p

j = 1

jshy j j micro t

for some ldquotuning parameterrdquo t For suitable choices of t this constraint has the interestingproperty that it forces some of the coef cients in the regression equation to zero An equiv-alent way of deriving LASSO estimates is to minimize the residual sum of squares with theaddition of a penalty function based on p

j = 1 jshy jj Thus we minimize

n

i = 1

yi iexcl not iexclp

j = 1

shy jxij

2

+ para

p

j = 1

jshy jj

for some multiplier para For any given value of t in the rst LASSO formulation there is avalue of para in the second formulation that gives equivalent results

32 THE LASSO APPROACH IN PCA (SCOTLASS)

PCA on a correlation matrix nds linear combinations a0

kx(k = 1 2 p) of thep measured variables x each standardized to have unit variance which successively havemaximum variance

a0

kR ak (31)

subject to

a0

kak = 1 and (for k para 2) a0

hak = 0 h lt k (32)

The proposed method of LASSO-based PCA performs the maximization under the extraconstraints

p

j = 1

jakjj micro t (33)

for some tuning parameter t where akj is the jth element of the kth vector ak (k =

1 2 p) We call the new technique SCoTLASS (Simpli ed Component Technique-LASSO)

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 537

= 1+

2

1 2aa

a

22

t = 1

t = sqrt (p)

Figure 1 The Two-Dimensional SCoTLASS

33 SOME PROPERTIES

SCoTLASS differs from PCA in the inclusion of the constraints de ned in (33) so adecision must be made on the value of the tuning parameter t It is easy to see that

(a) for t para pp we get PCA

(b) for t lt 1 there is no solution and(c) for t = 1 we must have exactly one nonzero akj for each k

As t decreases fromp

p we move progressivelyaway from PCA and eventuallyto a solutionwhere only one variable has a nonzero loading on each component All other variables willshrink (not necessary monotonically) with t and ultimately reach zero Examples of thisbehavior are given in the next section

The geometry of SCoTLASS in the case when p = 2 is shown in Figure 1 where weplot the elements a1 a2 of the vectora For PCA in Figure 1 the rst component a

0

1x wherea

0

1 = (a11 a12) corresponds to the point on circumference of the shaded circle (a0

1a1 = 1)

which touches the ldquolargestrdquo possible ellipse a0

1Ra1 = constantFor SCoTLASS with 1 lt t lt

pp we are restricted to the part of the circle a

0

1a1 = 1inside the dotted square 2

j = 1 ja1jj micro tFor t = 1 corresponding to the inner shaded square the optimal (only) solutions are

on the axesFigure 1 shows a special case in which the axes of the ellipses are at 45macr to the a1 a2

axes corresponding to equal variances for x1 x2 (or a correlation matrix) This gives two

538 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

optimal solutions for SCoTLASS in the rst quadrant symmetric about the 45macr line IfPCA or SCoTLASS is done on a covariance rather than correlation matrix (see Remark 1in Section 6) with unequal variances there will be a unique solution in the rst quadrant

34 IMPLEMENTATION

PCA reduces to an easily implemented eigenvalue problem but the extra constraintin SCoTLASS means that it needs numerical optimization to estimate parameters and suf-fers from the problem of many local optima A number of algorithms have been tried toimplement the technique including simulated annealing (Goffe Ferrier and Rogers 1994)but the results reported in the following example were derived using the projected gradientapproach (Chu and Trenda lov 2001 Helmke and Moore 1994) The LASSO inequalityconstraint (33) in the SCoTLASS problem (31)ndash(33) is eliminated by making use of anexterior penalty function Thus the SCoTLASS problem (31)ndash(33) is transformed intoa new maximization problem subject to the equality constraint (32) The solution of thismodi ed maximization problem is then found as an ascent gradient vector ow onto thep-dimensional unit sphere following the standard projected gradient formalism (Chu andTrenda lov 2001 Helmke and Moore 1994) Detailed considerationof this solution will bereported separately

We have implemented SCoTLASS using MATLAB The code requires as input thecorrelation matrix R the value of the tuning parameter t and the number of componentsto be retained (m) The MATLAB code returns a loading matrix and calculates a numberof relevant statistics To achieve an appropriate solution a number of parameters of theprojected gradient method (eg starting points absolute and relative tolerances) also needto be de ned

4 EXAMPLE PITPROPS DATA

Here we revisit the pitprop data from section 2 We have studied many other examplesnot discussed here and in general the results are qualitativelysimilar Table 4 gives loadingsfor SCoTLASS with t = 225 200 175 and 150 Table 5 gives variances cumulativevariances ldquosimplicity factorsrdquo and number of zero loadings for the same values of t aswell as corresponding information for PCA (t =

p13) and RPCA The simplicity factors

are values of the varimax criterion for each componentIt can be seen that as the value of t is decreased the simplicity of the components

increases as measured by the number of zero loadings and by the varimax simplicity factoralthough the increase in the latter is not uniform The increase in simplicity is paid for by aloss of variance retained By t = 225 175 the percentage of variance accounted for by the rst component in SCoTLASS is reduced from 324 for PCA to 267 196 respectivelyAt t = 225 this is still larger than the largest contribution (239) achieved by a singlecomponent in RPCA The comparison with RPCA is less favorable when the variationaccounted for by all six retained components is examined RPCA necessarily retains the

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 539

Table 4 Loadings for SCoTLASS for Four Values of t Based on the Correlation Matrix for JeffersrsquoPitprop Data

Component

Technique Variable (1) (2) (3) (4) (5) (6)

SCoTLASS x1 0558 0085 iexcl0093 iexcl0107 0056 0017(t = 225) x2 0580 0031 iexcl0087 iexcl0147 0073 0047

x3 0000 0647 iexcl0129 0215 iexcl0064 iexcl0101x4 0000 0654 iexcl0000 0211 iexcl0080 0127x5 iexcl0000 0000 0413 iexcl0000 0236 0747x6 0001 0208 0529 iexcl0022 iexcl0108 0033x7 0266 iexcl0000 0385 0000 iexcl0121 0020x8 0104 iexcl0098 0000 0584 0127 iexcl0188x9 0372 iexcl0000 iexcl0000 0019 0142 iexcl0060x10 0364 iexcl0154 0000 0212 iexcl0296 0000x11 iexcl0000 0099 iexcl0000 0000 0879 iexcl0156x12 iexcl0000 0241 iexcl0001 iexcl0699 iexcl0044 iexcl0186x13 iexcl0000 0026 iexcl0608 iexcl0026 iexcl0016 0561

SCoTLASS x1 0623 0041 iexcl0049 0040 0051 iexcl0000(t = 200) x2 0647 0076 iexcl0001 0072 0059 iexcl0007

x3 0000 0000 iexcl0684 iexcl0128 iexcl0054 0106x4 0000 iexcl0001 iexcl0670 iexcl0163 iexcl0063 iexcl0100x5 iexcl0000 iexcl0267 0000 0000 0228 iexcl0772x6 0000 iexcl0706 iexcl0044 0011 iexcl0003 0001x7 0137 iexcl0539 0001 iexcl0000 iexcl0096 0001x8 0001 iexcl0000 0053 iexcl0767 0095 0098x9 0332 0000 0000 iexcl0001 0065 0014x10 0254 iexcl0000 0124 iexcl0277 iexcl0309 iexcl0000x11 0000 0000 iexcl0065 0000 0902 0194x12 iexcl0000 0000 iexcl0224 0533 iexcl0069 0137x13 0000 0364 iexcl0079 0000 0000 iexcl0562

SCoTLASS x1 0664 iexcl0000 0000 iexcl0025 0002 iexcl0035(t = 175) x2 0683 iexcl0001 0000 iexcl0040 0001 iexcl0018

x3 0000 0641 0195 0000 0180 iexcl0030x4 0000 0701 0001 0000 iexcl0000 iexcl0001x5 iexcl0000 0000 iexcl0000 0000 iexcl0887 iexcl0056x6 0000 0293 iexcl0186 0000 iexcl0373 0044x7 0001 0107 iexcl0658 iexcl0000 iexcl0051 0064x8 0001 iexcl0000 iexcl0000 0735 0021 iexcl0168x9 0283 iexcl0000 iexcl0000 0000 iexcl0000 iexcl0001x10 0113 iexcl0000 iexcl0001 0388 iexcl0017 0320x11 0000 0000 0000 iexcl0000 iexcl0000 iexcl0923x12 iexcl0000 0001 0000 iexcl0554 0016 0004x13 0000 iexcl0000 0703 0001 iexcl0197 0080

SCoTLASS x1 0701 iexcl0000 iexcl0001 iexcl0001 0001 iexcl0000(t = 150) x2 0709 iexcl0001 iexcl0001 iexcl0001 0001 iexcl0000

x3 0001 0698 iexcl0068 0002 iexcl0001 0001x4 0001 0712 iexcl0001 0002 iexcl0001 iexcl0001x5 iexcl0000 0000 0001 iexcl0001 0093 iexcl0757x6 0000 0081 0586 iexcl0031 0000 0000x7 0001 0000 0807 iexcl0001 0000 0002x8 0001 iexcl0000 0001 0044 iexcl0513 iexcl0001x9 0079 0000 0001 0001 iexcl0001 iexcl0000x10 0002 iexcl0000 0027 0660 iexcl0000 iexcl0002x11 0000 0001 iexcl0000 iexcl0749 iexcl0032 iexcl0000x12 iexcl0000 0001 iexcl0000 iexcl0001 0853 0083x13 0000 0000 iexcl0001 0000 iexcl0001 0648

540 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Table 5. Simplicity Factor, Variance, Cumulative Variance, and Number of Zero Loadings for Individual Components in PCA, RPCA, and SCoTLASS for Four Values of t, Based on the Correlation Matrix for Jeffers' Pitprop Data

                                                    Component
Technique           Measure                        (1)     (2)     (3)     (4)     (5)     (6)
PCA                 Simplicity factor (varimax)   0.059   0.103   0.082   0.397   0.086   0.266
(= SCoTLASS with    Variance (%)                   32.4    18.2    14.4     8.9     7.0     6.3
t = √13)            Cumulative variance (%)        32.4    50.7    65.1    74.0    80.9    87.2

RPCA                Simplicity factor (varimax)   0.362   0.428   0.199   0.595   0.131   0.343
                    Variance (%)                   13.0    14.6    18.4     9.7    23.9     7.6
                    Cumulative variance (%)        13.0    27.6    46.0    55.7    79.6    87.2

SCoTLASS            Simplicity factor (varimax)   0.190   0.312   0.205   0.308   0.577   0.364
(t = 2.25)          Variance (%)                   26.7    17.2    15.9     9.7     8.9     6.7
                    Cumulative variance (%)        26.7    43.9    59.8    69.4    78.4    85.0
                    Number of zero loadings           6       3       5       3       0       1

SCoTLASS            Simplicity factor (varimax)   0.288   0.301   0.375   0.387   0.646   0.412
(t = 2.00)          Variance (%)                   23.1    16.4    16.2    11.2     8.9     6.5
                    Cumulative variance (%)        23.1    39.5    55.8    67.0    75.9    82.3
                    Number of zero loadings           7       6       2       4       1       2

SCoTLASS            Simplicity factor (varimax)   0.370   0.370   0.388   0.360   0.610   0.714
(t = 1.75)          Variance (%)                   19.6    16.0    13.2    13.0     9.2     9.1
                    Cumulative variance (%)        19.6    35.6    48.7    61.8    71.0    80.1
                    Number of zero loadings           7       7       7       7       3       0

SCoTLASS            Simplicity factor (varimax)   0.452   0.452   0.504   0.464   0.565   0.464
(t = 1.50)          Variance (%)                   16.1    14.9    13.8    10.2     9.9     9.6
                    Cumulative variance (%)        16.1    31.0    44.9    55.1    65.0    74.5
                    Number of zero loadings           5       7       2       1       3       5

same total percentage variation (87.2%) as PCA, but SCoTLASS drops to 85.0% and 80.1% for t = 2.25, 1.75, respectively. Against this loss, SCoTLASS has the considerable advantage of retaining the successive maximization property. At t = 2.25, apart from switching of components 4 and 5, the SCoTLASS components are nicely simplified versions of the PCs, rather than being something different as in RPCA. A linked advantage is that if we decide to look only at five components, then the SCoTLASS components will simply be the first five in Table 4, whereas RPCs based on (m − 1) retained components are not necessarily similar to those based on m.

A further "plus" for SCoTLASS is the presence of zero loadings, which aids interpretation, but even where there are few zeros the components are simplified compared to PCA. Consider specifically the interpretation of the second component, which we noted earlier was "messy" for PCA. For t = 1.75 this component is now interpreted as measuring mainly moisture content and specific gravity, with small contributions from numbers of annual rings, and all other variables negligible. This gain in interpretability is achieved by reducing the percentage of total variance accounted for from 18.2% to 16.0%. For other components, too, interpretation is made easier because in the majority of cases the contribution of a variable is clearcut. Either it is important or it is not, with few equivocal contributions.


Table 6. Specified Eigenvectors of a Six-Dimensional Block Structure

                              Eigenvectors
Variable     (1)      (2)      (3)      (4)      (5)      (6)
x1          0.096   −0.537    0.759   −0.120    0.335   −0.021
x2          0.082   −0.565   −0.599    0.231    0.511   −0.013
x3          0.080   −0.608   −0.119   −0.119   −0.771    0.016
x4          0.594    0.085   −0.074   −0.308    0.069    0.731
x5          0.584    0.096   −0.114   −0.418    0.052   −0.678
x6          0.533    0.074    0.180    0.805   −0.157   −0.069

Variance   1.8367    1.640    0.751    0.659    0.607    0.506

For t = 2.25, 2.00, 1.75, 1.50 the numbers of zeros are as follows: 18, 22, 31, and 23. It seems surprising that we obtain fewer zeros with t = 1.50 than with t = 1.75; that is, the solution with t = 1.75 appears to be simpler than the one with t = 1.50. In fact this impression is misleading (see also the next paragraph). The explanation of this anomaly is in the projected gradient method used for numerical solution of the problem, which approximates the LASSO constraint with a certain smooth function, and thus the zero loadings produced may also be approximate. One can see that the solution with t = 1.50 contains a total of 56 loadings with magnitude less than 0.005, compared to 42 in the case t = 1.75.
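Because such counts depend on the threshold used to declare a loading "zero", the comparison can be made explicit. The following is a minimal sketch in Python/NumPy, not part of the original analysis; the loading matrices A175 and A150 are hypothetical stand-ins for SCoTLASS solutions at t = 1.75 and t = 1.50.

    import numpy as np

    def count_small_loadings(A, tol):
        # Count entries of the loading matrix whose magnitude is below tol.
        # With a smoothed LASSO constraint, "zeros" are only approximate,
        # so the count depends on the tolerance chosen.
        return int(np.sum(np.abs(np.asarray(A)) < tol))

    # Hypothetical 13 x 6 loading matrices from SCoTLASS fits:
    # count_small_loadings(A175, tol=0.0005)   # stricter threshold
    # count_small_loadings(A150, tol=0.005)    # looser threshold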

Another interesting comparison is in terms of average varimax simplicity over the first six components. This is 0.343 for RPCA, compared to 0.165 for PCA. For t = 2.25, 2.00, 1.75, 1.50 the average simplicity is 0.326, 0.402, 0.469, 0.487, respectively. This demonstrates that, although the varimax criterion is not an explicit part of SCoTLASS, by taking t small enough we can do better than RPCA with respect to its own criterion. This is achieved by moving outside the space spanned by the retained PCs, and hence settling for a smaller amount of overall variation retained.
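For concreteness, here is a sketch of one plausible form of the per-component varimax simplicity factor, scaled so that a normalized loading vector with all loadings equal scores 0 and one with a single nonzero loading scores 1, consistent with the endpoints described for the criterion; the exact scaling constant used in the paper is an assumption here.

    import numpy as np

    def varimax_simplicity(a):
        # Simplicity factor for one normalized loading vector a
        # (sum of squared loadings equal to 1): 0 when all loadings
        # have equal magnitude, 1 when only one loading is nonzero.
        a = np.asarray(a, dtype=float)
        p = a.size
        return (p * np.sum(a**4) - np.sum(a**2)**2) / (p - 1)

    def average_simplicity(A):
        # Average of the factor over the columns (components) of A.
        A = np.asarray(A, dtype=float)
        return float(np.mean([varimax_simplicity(A[:, k])
                              for k in range(A.shape[1])]))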

5 SIMULATION STUDIES

One question of interest is whether SCoTLASS is better at detecting underlying simple structure in a data set than is PCA or RPCA. To investigate this question we simulated data from a variety of known structures. Because of space constraints, only a small part of the results is summarized here; further details can be found in Uddin (1999).

Given a vector l of positive real numbers and an orthogonal matrix A, we can attempt to find a covariance matrix or correlation matrix whose eigenvalues are the elements of l and whose eigenvectors are the columns of A. Some restrictions need to be imposed on l and A, especially in the case of correlation matrices, but it is possible to find such matrices for a wide range of eigenvector structures. Having obtained a covariance or correlation matrix, it is straightforward to generate samples of data from multivariate normal distributions with the given covariance or correlation matrix. We have done this for a wide variety of eigenvector structures (principal component loadings) and computed the PCs, RPCs, and SCoTLASS components from the resulting sample correlation matrices.
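A minimal sketch of this construction in Python/NumPy follows. The eigenvalues are those of Table 6, but the orthogonal matrix here is randomly generated purely for illustration; the study instead used the specified vectors of Tables 6-8, and further restrictions would be needed for Sigma to be an exact correlation matrix.

    import numpy as np

    rng = np.random.default_rng(0)

    # Eigenvalues (Table 6) and an orthogonal matrix of eigenvectors;
    # a random orthogonal Q stands in for the specified vectors here.
    l = np.array([1.8367, 1.640, 0.751, 0.659, 0.607, 0.506])
    Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))

    Sigma = Q @ np.diag(l) @ Q.T     # eigenvalues l, eigenvectors Q

    # Sample from the multivariate normal and form the sample
    # correlation matrix, from which the PCs etc. are computed.
    X = rng.multivariate_normal(np.zeros(6), Sigma, size=100)
    R = np.corrcoef(X, rowvar=False)

    # Sample PC loadings: eigenvectors of R, ordered by eigenvalue.
    evals, evecs = np.linalg.eigh(R)
    pcs = evecs[:, np.argsort(evals)[::-1]]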


Table 7. Specified Eigenvectors of a Six-Dimensional Intermediate Structure

                              Eigenvectors
Variable     (1)      (2)      (3)      (4)      (5)      (6)
x1          0.224   −0.509    0.604    0.297   −0.327    0.361
x2          0.253   −0.519   −0.361   −0.644   −0.341   −0.064
x3          0.227   −0.553   −0.246    0.377    0.608   −0.267
x4          0.553    0.249   −0.249   −0.052    0.262    0.706
x5          0.521    0.254   −0.258    0.451   −0.509   −0.367
x6          0.507    0.199    0.561   −0.384    0.281   −0.402

Variance    1.795    1.674    0.796    0.618    0.608    0.510

Various structures have been investigated, which we call block structure, intermediate structure, and uniform structure. Tables 6-8 give one example of each type of structure. The structure in Table 6 has blocks of nontrivial loadings and blocks of near-zero loadings in each underlying component; Table 8 has a structure in which all loadings in the first two components have similar absolute values; and the structure in Table 7 is intermediate to those of Tables 6 and 8. An alternative approach to the simulation study would be to replace the near-zero loadings by exact zeros, and the nearly equal loadings by exact equalities. However, we feel that in reality underlying structures are never quite that simple, so we perturbed them a little.

It might be expected that, if the underlying structure is simple, then sampling variation is more likely to take sample PCs away from simplicity than to enhance this simplicity. It is of interest to investigate whether the techniques of RPCA and SCoTLASS, which increase simplicity compared to the sample PCs, will do so in the direction of the true underlying structure. The closeness of a vector of loadings from any of these techniques to the underlying true vector is measured by the angle between the two vectors of interest. These angles are given in Tables 9-11 for single simulated datasets from three different types of six-dimensional structure; they typify what we found in other simulations. Three values of t (apart from that for PCA) are shown in the tables. Their exact values are unimportant, and are slightly different in different tables; they are chosen to illustrate typical behavior in our simulations.
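A sketch of the angle computation, assuming the standard definition and taking the absolute value of the inner product so that the arbitrary sign of a loading vector does not affect the result:

    import numpy as np

    def angle_degrees(a, b):
        # Angle between loading vectors a and b, in degrees; the
        # absolute value makes the result invariant to sign flips.
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        c = abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return float(np.degrees(np.arccos(min(1.0, c))))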

The results illustrate that, for each structure, RPCA is perhaps surprisingly, and certainly disappointingly, bad at recovering the underlying structure. SCoTLASS, on the other hand,

Table 8. Specified Eigenvectors of a Six-Dimensional Uniform Structure

                              Eigenvectors
Variable     (1)      (2)      (3)      (4)      (5)      (6)
x1         −0.455    0.336   −0.087    0.741   −0.328    0.125
x2         −0.439    0.370   −0.212   −0.630   −0.445   −0.175
x3         −0.415    0.422    0.378   −0.110    0.697    0.099
x4          0.434    0.458    0.040   −0.136   −0.167    0.744
x5          0.301    0.435   −0.697    0.114    0.356   −0.306
x6          0.385    0.416    0.563    0.104   −0.234   −0.545

Variance    1.841    1.709    0.801    0.649    0.520    0.480


Table 9. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS With Various Values of t, for a Specified "Block" Structure of Correlation Eigenvectors

                                Vectors
Technique        t        (1)     (2)     (3)     (4)
PCA             √6       12.9    12.0    15.4    79.5
RPCA                     37.7    45.2    45.3    83.1
SCoTLASS     t = 2.00    12.9    12.0    15.4    78.6
             t = 1.82    11.9    11.4    13.8    73.2
             t = 1.75     9.4    10.0    12.5    85.2

is capable of improvement over PCA. For example, for t = 1.75 it not only improves over PCA in terms of angles in Table 9, but it also has 3, 3, and 2 zero loadings in its first three components, thus giving a notably simpler structure. None of the methods manages to reproduce the underlying structure for component 4 in Table 9.

The results for intermediate structure in Table 10 are qualitatively similar to those in Table 9, except that SCoTLASS does best for higher values of t than in Table 9. For uniform structure (Table 11), SCoTLASS does badly compared to PCA for all values of t. This is not unexpected, because although uniform structure is simple in its own way, it is not the type of simplicity which SCoTLASS aims for. It is also the case that the varimax criterion is designed so that it stands little chance of finding uniform structure. Other rotation criteria, such as quartimax, can in theory find uniform vectors of loadings, but they were tried and also found to be unsuccessful in our simulations. It is probable that a uniform structure is more likely to be found by the techniques proposed by Hausman (1982) or Vines (2000). Although SCoTLASS will usually fail to find such structures, their existence may be indicated by a large drop in the variance explained by SCoTLASS as decreasing values of t move it away from PCA.

6 DISCUSSION

A new technique, SCoTLASS, has been introduced for discovering and interpreting the major sources of variability in a dataset. We have illustrated its usefulness in an example, and have also shown through simulations that it is capable of recovering certain types of

Table 10. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS With Various Values of t, for a Specified "Intermediate" Structure of Correlation Eigenvectors

                                Vectors
Technique        t        (1)     (2)     (3)     (4)
PCA             √6       13.7    15.1    23.6    78.3
RPCA                     53.3    42.9    66.4    77.7
SCoTLASS     t = 2.28     5.5    10.4    23.7    78.3
             t = 2.12     9.5    12.0    23.5    77.8
             t = 2.01    17.5    19.1    22.5    71.7


Table 11. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS With Various Values of t, for a Specified "Uniform" Structure of Correlation Eigenvectors

                                Vectors
Technique        t        (1)     (2)     (3)     (4)
PCA             √6        6.0     6.4    23.2    31.8
RPCA                     52.0    54.1    39.4    43.5
SCoTLASS     t = 2.15    24.3    24.6    23.4    30.7
             t = 1.91    34.5    34.6    21.8    23.5
             t = 1.73    41.0    41.3    22.5    19.0

underlying structure. It is preferred in many respects to rotated principal components as a means of simplifying interpretation, compared to principal component analysis. Although we are convinced of the value of SCoTLASS, there are a number of complications and open questions, which are now listed as a set of remarks.

Remark 1. In this article we have carried out the techniques studied on correlation matrices. Although it is less common in practice, PCA and RPCA can also be implemented on covariance matrices, in which case PCA successively finds uncorrelated linear functions of the original unstandardized variables. SCoTLASS can also be implemented in this case, the only difference being that the sample correlation matrix R is replaced by the sample covariance matrix S in equation (3.1). We have investigated covariance-based SCoTLASS, both for real examples and using simulation studies. Some details of its performance are different from the correlation-based case, but qualitatively they are similar. In particular, there are a number of reasons to prefer SCoTLASS to RPCA.

Remark 2. In PCA the constraint a_h'a_k = 0 (orthogonality of vectors of loadings) is equivalent to a_h'R a_k = 0 (different components are uncorrelated). This equivalence is special to the PCs, and is a consequence of the a_k being eigenvectors of R. When we rotate the PC loadings, we lose at least one of these two properties (Jolliffe 1995). Similarly, in SCoTLASS, if we impose a_h'a_k = 0, we no longer have uncorrelated components. For example, Table 12 gives the correlations between the six SCoTLASS components when t = 1.75 for the pitprop data. Although most of the correlations in Table 12 are small in absolute value, there are also nontrivial ones (r12, r14, r34, and r35).

Table 12. Correlation Matrix for the First Six SCoTLASS Components for t = 1.75, Using Jeffers' Pitprop Data

                            Correlation matrix
Components     (1)      (2)      (3)      (4)      (5)      (6)
(1)          1.000    0.375   −0.234    0.443   −0.010    0.061
(2)                   1.000   −0.114   −0.076   −0.145   −0.084
(3)                            1.000   −0.438    0.445   −0.187
(4)                                     1.000   −0.105    0.141
(5)                                              1.000   −0.013
(6)                                                       1.000


It is possible to replace the constraint a_h'a_k = 0 in SCoTLASS by a_h'R a_k = 0, thus choosing to have uncorrelated components rather than orthogonal loadings, but this option is not explored in the present article.
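The entries of Table 12 can be reproduced from the loadings: with the loading vectors as columns of A and sample correlation matrix R, the components z_k = a_k'x have covariance matrix A'RA, which is then standardized. A minimal sketch, assuming Python/NumPy:

    import numpy as np

    def component_correlations(A, R):
        # Covariances of the components z_k = a_k' x are the entries
        # of A' R A; divide by the standard deviations to obtain the
        # correlation matrix of the components.
        C = A.T @ R @ A
        d = np.sqrt(np.diag(C))
        return C / np.outer(d, d)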

Remark 3. The choice of t is clearly important in SCoTLASS. As t decreases, simplicity increases but variation explained decreases, and we need to achieve a suitable tradeoff between these two properties. The correlation between components noted in the previous remark is another aspect of any tradeoff: correlations are small for t close to √p, but have the potential to increase as t decreases. It might be possible to construct a criterion which defines the "best tradeoff," but there is no unique construction, because of the difficulty of deciding how to measure simplicity and how to combine variance, simplicity, and correlation. At present it seems best to compute the SCoTLASS components for several values of t and judge subjectively at what point a balance between these various aspects is achieved. In our example we used the same value of t for all components in a data set, but varying t for different components is another possibility.

Remark 4. Our algorithms for SCoTLASS are slower than those for PCA. This is because SCoTLASS is implemented subject to an extra restriction on PCA, and we lose the advantage of calculation via the singular value decomposition which makes the PCA algorithm fast. Sequential-based PCA with an extra constraint requires a good optimizer to produce a global optimum. In the implementation of SCoTLASS a projected gradient method is used, which is globally convergent and preserves accurately both the equality and inequality constraints. It should be noted that, as t is reduced from √p downwards towards unity, the CPU time taken to optimize the objective function remains generally the same (11 sec on average for a 1GHz PC), but as t decreases the algorithm becomes progressively prone to hit local minima, and thus more (random) starts are required to find a global optimum. Osborne, Presnell, and Turlach (2000) gave an efficient procedure, based on convex programming and a dual problem, for implementing the LASSO in the regression context. Whether or not this approach can be usefully adapted to SCoTLASS will be investigated in further research. Although we are reasonably confident that our algorithm has found global optima in the example of Section 4 and in the simulations, there is no guarantee. The jumps that occur in some coefficients, such as the change in a36 from −0.186 to 0.586 as t decreases from 1.75 to 1.50 in the pitprop data, could be due to one of the solutions being a local optimum. However, it seems more likely to us that it is caused by the change in the nature of the earlier components, which, together with the orthogonality constraint imposed on the third component, opens up a different range of possibilities for the latter component. There is clearly much scope for further work on the implementation of SCoTLASS.
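A schematic illustration of such a multistart strategy follows; local_solve is a hypothetical stand-in for a single projected-gradient run returning a loading vector and its objective value, not the authors' actual implementation.

    import numpy as np

    def multistart(local_solve, p=13, n_starts=25, seed=0):
        # Run a local optimizer from several random points on the
        # unit sphere and keep the best solution found; more starts
        # are needed as t decreases and local optima multiply.
        rng = np.random.default_rng(seed)
        best, best_val = None, -np.inf
        for _ in range(n_starts):
            a0 = rng.standard_normal(p)
            a0 /= np.linalg.norm(a0)      # start on the unit sphere
            a, val = local_solve(a0)      # one hypothetical local run
            if val > best_val:
                best, best_val = a, val
        return best, best_val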

Remark 5. In a number of examples not reported here, several of the nonzero loadings in the SCoTLASS components are exactly equal, especially for large values of p and small values of t. At present we have no explanation for this, but it deserves further investigation.


Remark 6. The way in which SCoTLASS sets some coefficients to zero is different in concept from simply truncating to zero the smallest coefficients in a PC. The latter attempts to approximate the PC by a simpler version, and can have problems with that approximation, as shown by Cadima and Jolliffe (1995). SCoTLASS looks for simple sources of variation and, like PCA, aims for high variance, but because of simplicity considerations the simplified components can in theory be moderately different from the PCs. We seek to replace PCs rather than approximate them, although, because of the shared aim of high variance, the results will often not be too different.

Remark 7. There are a number of other recent developments which are relevant to interpretation problems in multivariate statistics; Jolliffe (2002), Chapter 11, reviews these in the context of PCA. Vines's (2000) use of only a discrete number of values for loadings has already been mentioned. It works well in some examples, but for the pitprop data the components 4-6 are rather complex. A number of aspects of the strategy for selecting a subset of variables were explored by Cadima and Jolliffe (1995, 2001) and by Tanaka and Mori (1997). The LASSO has been generalized in the regression context to so-called bridge estimation, in which the constraint Σ_{j=1}^p |β_j| ≤ t is replaced by Σ_{j=1}^p |β_j|^γ ≤ t, where γ is not necessarily equal to unity; see, for example, Fu (1998). Tibshirani (1996) also mentioned the nonnegative garotte, due to Breiman (1995), as an alternative approach. Translation of the ideas of the bridge and nonnegative garotte to the context of PCA, and comparison with other techniques, would be of interest in future research.

ACKNOWLEDGMENTS

An earlier draft of this article was prepared while the first author was visiting the Bureau of Meteorology Research Centre (BMRC), Melbourne, Australia. He is grateful to BMRC for the support and facilities provided during his visit, and to the Leverhulme Trust for partially supporting the visit under their Study Abroad Fellowship scheme. Comments from two referees and the editor have helped to improve the clarity of the article.

[Received November 2000. Revised July 2002.]

REFERENCES

Breiman, L. (1995), "Better Subset Regression Using the Nonnegative Garotte," Technometrics, 37, 373-384.

Cadima, J., and Jolliffe, I. T. (1995), "Loadings and Correlations in the Interpretation of Principal Components," Journal of Applied Statistics, 22, 203-214.

(2001), "Variable Selection and the Interpretation of Principal Subspaces," Journal of Agricultural, Biological, and Environmental Statistics, 6, 62-79.

Chu, M. T., and Trendafilov, N. T. (2001), "The Orthogonally Constrained Regression Revisited," Journal of Computational and Graphical Statistics, 10, 1-26.

Fu, J. W. (1998), "Penalized Regression: The Bridge Versus the Lasso," Journal of Computational and Graphical Statistics, 7, 397-416.

Goffe, W. L., Ferrier, G. D., and Rogers, J. (1994), "Global Optimizations of Statistical Functions with Simulated Annealing," Journal of Econometrics, 60, 65-99.

Hausman, R. (1982), "Constrained Multivariate Analysis," in Optimization in Statistics, eds. S. H. Zanckis and J. S. Rustagi, Amsterdam: North Holland, pp. 137-151.

Helmke, U., and Moore, J. B. (1994), Optimization and Dynamical Systems, London: Springer.

Jeffers, J. N. R. (1967), "Two Case Studies in the Application of Principal Component Analysis," Applied Statistics, 16, 225-236.

Jolliffe, I. T. (1989), "Rotation of Ill-Defined Principal Components," Applied Statistics, 38, 139-147.

(1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied Statistics, 22, 29-35.

(2002), Principal Component Analysis (2nd ed.), New York: Springer-Verlag.

Jolliffe, I. T., and Uddin, M. (2000), "The Simplified Component Technique - An Alternative to Rotated Principal Components," Journal of Computational and Graphical Statistics, 9, 689-710.

Jolliffe, I. T., Uddin, M., and Vines, S. K. (2002), "Simplified EOFs - Three Alternatives to Rotation," Climate Research, 20, 271-279.

Krzanowski, W. J., and Marriott, F. H. C. (1995), Multivariate Analysis, Part II, London: Arnold.

LeBlanc, M., and Tibshirani, R. (1998), "Monotone Shrinkage of Trees," Journal of Computational and Graphical Statistics, 7, 417-433.

Morton, S. C. (1989), "Interpretable Projection Pursuit," Technical Report 106, Department of Statistics, Stanford University.

McCabe, G. P. (1984), "Principal Variables," Technometrics, 26, 137-144.

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000), "On the LASSO and its Dual," Journal of Computational and Graphical Statistics, 9, 319-337.

Tanaka, Y., and Mori, Y. (1997), "Principal Component Analysis Based on a Subset of Variables: Variable Selection and Sensitivity Analysis," American Journal of Mathematical and Management Sciences, 17, 61-89.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267-288.

Uddin, M. (1999), "Interpretation of Results from Simplified Principal Components," Ph.D. thesis, University of Aberdeen, Aberdeen, Scotland.

Vines, S. K. (2000), "Simple Principal Components," Applied Statistics, 49, 441-451.

Page 4: Open Research Onlineoro.open.ac.uk/3949/1/Article_JCGS.pdfDepartmentof Statistics,University ofKarachi, Karachi-75270, Pakistan (E-mail: mudassir2000@hotmail.com). ®c2003 American

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 533

Table 1 Denitions of Variables in Jeffersrsquo Pitprop Data

Variable De nition

x1 Top diameter in inchesx2 Length in inchesx3 Moisture content of dry weightx4 Specic gravity at time of testx5 Oven-dry specic gravityx6 Number of annual rings at topx7 Number of annual rings at bottomx8 Maximum bow in inchesx9 Distance of point of maximum bow from top in inchesx10 Number of knot whorlsx11 Length of clear prop from top in inchesx12 Average number of knots per whorlx13 Average diameter of the knots in inches

a0

kak = 1 is necessary to get a bounded solution The derived variable a0

kx is the kthprincipal component (PC) It turns out that ak the vector of coef cients or loadings forthe kth PC is the eigenvector of the sample correlation matrix R corresponding to the kthlargest eigenvalue lk In addition the sample variance of a

0

kx is equal to lk Because ofthe successive maximization property the rst few PCs will often account for most of thesample variation in all the standardized measured variables In the pitprop example Jeffers(1967) was interested in the rst six PCs which together account for 87 of the totalvariance The loadings in each of these six components are given in Table 2 together withthe individual and cumulative percentage of variance in all 13 variables accounted for by1 2 6 PCs

PCs are easiest to interpret if the pattern of loadings is clear-cut with a few large

Table 2 Loadings for Correlation PCA for Jeffersrsquo Pitprop Data

Component

Variable (1) (2) (3) (4) (5) (6)

x1 0404 0212 iexcl0219 iexcl0027 iexcl0141 iexcl0086x2 0406 0180 iexcl0245 iexcl0025 iexcl0188 iexcl0111x3 0125 0546 0114 0015 0433 0120x4 0173 0468 0328 0010 0361 iexcl0090x5 0057 iexcl0138 0493 0254 iexcl0122 iexcl0560x6 0284 iexcl0002 0476 iexcl0153 iexcl0269 0032x7 0400 iexcl0185 0261 iexcl0125 iexcl0176 0030x8 0294 iexcl0198 iexcl0222 0294 0203 0103x9 0357 0010 iexcl0202 0132 iexcl0117 0103x10 0379 iexcl0252 iexcl0120 iexcl0201 0173 iexcl0019x11 iexcl0008 0187 0021 0805 iexcl0302 0178x12 iexcl0115 0348 0066 iexcl0303 iexcl0537 0371x13 iexcl0112 0304 iexcl0352 iexcl0098 iexcl0209 iexcl0671

Simplicity factor (varimax) 0059 0103 0082 0397 0086 0266Variance () 324 182 144 89 70 63

Cumulative Variance () 324 507 650 740 809 872

534 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Table 3 Loadings for Rotated Correlation PCA Using the Varimax Criterion for Jeffersrsquo Pitprop Data

Component

Variable (1) (2) (3) (4) (5) (6)

x1 iexcl0019 0074 0043 iexcl0027 iexcl0519 iexcl0077x2 iexcl0018 0015 0048 iexcl0024 iexcl0540 iexcl0102x3 iexcl0024 0705 iexcl0128 0003 iexcl0059 0107x4 0029 0689 0112 0001 0014 iexcl0087x5 0258 0009 0477 0218 0205 iexcl0524x6 iexcl0185 0061 0604 iexcl0005 iexcl0032 0012x7 0031 iexcl0069 0512 iexcl0102 iexcl0151 0092x8 0440 iexcl0042 iexcl0072 0083 iexcl0221 0239x9 0097 iexcl0058 0045 0094 iexcl0408 0141x10 0271 iexcl0054 0129 iexcl0367 iexcl0216 0135x11 0057 iexcl0022 iexcl0029 0882 iexcl0137 0075x12 iexcl0776 iexcl0056 0091 0079 iexcl0123 0145x13 iexcl0120 iexcl0049 iexcl0280 iexcl0077 iexcl0269 iexcl0748

Simplicity factor (varimax) 0362 0428 0199 0595 0131 0343variance ( ) 130 146 184 97 239 76

cumulative variance ( ) 130 276 460 557 796 872

(absolute) values and many small loadings in each PC Although Jeffers (1967) makes anattempt to interpret all six components some are to say the least messy and he ignoressome intermediate loadingsFor example PC2 has the largest loadingson x3 x4 with smallloadings on x6 x9 but a whole range of intermediate values on other variables

A traditionalway to simplify loadings is by rotation If A is the (13 pound 6) matrix whosekth column is ak then A is post-multipliedby a matrix T to give rotated loadingsB = ATIf bk is the kth column of B then b

0

kx is the kth rotated componentThe matrix T is chosenso as to optimize some simplicitycriterionVarious criteria have been proposedall of whichattempt to create vectors of loadings whose elements are close to zero or far from zero withfew intermediate values The idea is that each variable should be either clearly importantor clearly unimportant in a rotated component with as few cases as possible of borderlineimportance Varimax is the most widely used rotation criterion and like most other suchcriteria it tends to drive at least some of the loadings in each component towards zero Thisis not the only possible type of simplicity A component whose loadings are all roughlyequal is easy to interpret but will be avoided by most standard rotation criteria It is dif cultto envisage any criterion which could encompass all possible types of simplicity and weconcentrate here on simplicity as de ned by varimax

Table 3 gives the rotated loadings for six components in the correlation PCA of thepitprop data togetherwith the percentageof total variance accounted for by each rotated PC(RPC) The rotation criterion used in Table 3 is varimax (Krzanowski and Marriott 1995 p138) which is the most frequent choice (often the default in software) but other criteria givesimilar results Varimax rotation aims to maximize the sum over rotated components of acriterion which takes values between zero and one A value of zero occurs when all loadingsin the component are equal whereas a component with only one nonzero loading producesa value of unity This criterion or ldquosimplicity factorrdquo is given for each component in Tables2 and 3 and it can be seen that its values are larger for most of the rotated components than

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 535

they are for the unrotated componentsThere are however a number of disadvantages associated with rotation In the context

of interpreting the results in this example we note that we have lost the ldquosuccessive max-imization of variancerdquo property of the unrotated components so what we are interpretingafter rotation are not the ldquomost important sources of variationrdquo in the data The RPC withthe highest variance appears arbitrarily as the fth and this accounts for 24 of the totalvariation compared to 32 in the rst unrotated PC In addition a glance at the loadingsand simplicity factors for the RPCs shows that more generally those components whichare easiest to interpret among the six in Table 3 are those which have the smallest varianceRPC5 is still rather complicated Other problems associated with rotation were discussedby Jolliffe (1989 1995) A simpli ed component technique (SCoT) in which the two stepsof RPCA (PCA followed by rotation) are combined into one was discussed by Jolliffe andUddin (2000) The technique is based on a similar idea proposed by Morton (1989) in thecontext of projection pursuit It maximizes variance but adds a penalty function which isa multiple of one of the simplicity criteria such as varimax SCoT has some advantagescompared to standard rotation but shares a number of its disadvantages

The next section introduces an alternative to rotation which has some clear advantagesover rotated PCA and SCoT A detailed comparison of the new technique SCoT rotatedPCA and Vinesrsquo (2000) simple components is given for an example involving sea surfacetemperatures in Jolliffe Uddin and Vines (2002)

3 MODIFIED PCA BASED ON THE LASSO

Tibshirani (1996) studied the dif culties involved in the interpretation of multiple re-gression equationsThese problems may occur due to the instability of the regression coef- cients in the presence of collinearity or simply because of the large number of variablesincluded in the regression equation Some current alternatives to least squares regressionsuch as shrinkage estimators ridge regression principal component regression or partialleast squares handle the instabilityproblemby keepingall variablesin the equationwhereasvariable selection procedures nd a subset of variables and keep only the selected variablesin the equation Tibshirani (1996) proposed a new method the ldquoleast absolute shrinkageand selection operatorrdquo LASSO which is a compromise between variable selection andshrinkage estimators The procedure shrinks the coef cients of some of the variables notsimply towards zero but exactly to zero giving an implicit form of variable selectionLeBlanc and Tibshirani (1998) extended the idea to regression trees Here we adapt theLASSO idea to PCA

31 THE LASSO APPROACH IN REGRESSION

In standard multiple regression we have the equation

yi = not +

p

j = 1

shy jxij + ei i = 1 2 n

536 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

where y1 y2 yn are measurements on a response variable y xij i = 1 2 n j =

1 2 p are corresponding values of p predictor variables e1 e2 en are error termsand not shy 1 shy 2 shy p are parameters in the regression equation In least squares regressionthese parameters are estimated by minimizing the residual (or error) sum of squares

n

i = 1

yi iexcl not iexclp

j = 1

shy jxij

2

The LASSO imposes an additional restriction on the coef cients namely

p

j = 1

jshy j j micro t

for some ldquotuning parameterrdquo t For suitable choices of t this constraint has the interestingproperty that it forces some of the coef cients in the regression equation to zero An equiv-alent way of deriving LASSO estimates is to minimize the residual sum of squares with theaddition of a penalty function based on p

j = 1 jshy jj Thus we minimize

n

i = 1

yi iexcl not iexclp

j = 1

shy jxij

2

+ para

p

j = 1

jshy jj

for some multiplier para For any given value of t in the rst LASSO formulation there is avalue of para in the second formulation that gives equivalent results

32 THE LASSO APPROACH IN PCA (SCOTLASS)

PCA on a correlation matrix nds linear combinations a0

kx(k = 1 2 p) of thep measured variables x each standardized to have unit variance which successively havemaximum variance

a0

kR ak (31)

subject to

a0

kak = 1 and (for k para 2) a0

hak = 0 h lt k (32)

The proposed method of LASSO-based PCA performs the maximization under the extraconstraints

p

j = 1

jakjj micro t (33)

for some tuning parameter t where akj is the jth element of the kth vector ak (k =

1 2 p) We call the new technique SCoTLASS (Simpli ed Component Technique-LASSO)

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 537

= 1+

2

1 2aa

a

22

t = 1

t = sqrt (p)

Figure 1 The Two-Dimensional SCoTLASS

33 SOME PROPERTIES

SCoTLASS differs from PCA in the inclusion of the constraints de ned in (33) so adecision must be made on the value of the tuning parameter t It is easy to see that

(a) for t para pp we get PCA

(b) for t lt 1 there is no solution and(c) for t = 1 we must have exactly one nonzero akj for each k

As t decreases fromp

p we move progressivelyaway from PCA and eventuallyto a solutionwhere only one variable has a nonzero loading on each component All other variables willshrink (not necessary monotonically) with t and ultimately reach zero Examples of thisbehavior are given in the next section

The geometry of SCoTLASS in the case when p = 2 is shown in Figure 1 where weplot the elements a1 a2 of the vectora For PCA in Figure 1 the rst component a

0

1x wherea

0

1 = (a11 a12) corresponds to the point on circumference of the shaded circle (a0

1a1 = 1)

which touches the ldquolargestrdquo possible ellipse a0

1Ra1 = constantFor SCoTLASS with 1 lt t lt

pp we are restricted to the part of the circle a

0

1a1 = 1inside the dotted square 2

j = 1 ja1jj micro tFor t = 1 corresponding to the inner shaded square the optimal (only) solutions are

on the axesFigure 1 shows a special case in which the axes of the ellipses are at 45macr to the a1 a2

axes corresponding to equal variances for x1 x2 (or a correlation matrix) This gives two

538 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

optimal solutions for SCoTLASS in the rst quadrant symmetric about the 45macr line IfPCA or SCoTLASS is done on a covariance rather than correlation matrix (see Remark 1in Section 6) with unequal variances there will be a unique solution in the rst quadrant

34 IMPLEMENTATION

PCA reduces to an easily implemented eigenvalue problem but the extra constraintin SCoTLASS means that it needs numerical optimization to estimate parameters and suf-fers from the problem of many local optima A number of algorithms have been tried toimplement the technique including simulated annealing (Goffe Ferrier and Rogers 1994)but the results reported in the following example were derived using the projected gradientapproach (Chu and Trenda lov 2001 Helmke and Moore 1994) The LASSO inequalityconstraint (33) in the SCoTLASS problem (31)ndash(33) is eliminated by making use of anexterior penalty function Thus the SCoTLASS problem (31)ndash(33) is transformed intoa new maximization problem subject to the equality constraint (32) The solution of thismodi ed maximization problem is then found as an ascent gradient vector ow onto thep-dimensional unit sphere following the standard projected gradient formalism (Chu andTrenda lov 2001 Helmke and Moore 1994) Detailed considerationof this solution will bereported separately

We have implemented SCoTLASS using MATLAB The code requires as input thecorrelation matrix R the value of the tuning parameter t and the number of componentsto be retained (m) The MATLAB code returns a loading matrix and calculates a numberof relevant statistics To achieve an appropriate solution a number of parameters of theprojected gradient method (eg starting points absolute and relative tolerances) also needto be de ned

4 EXAMPLE PITPROPS DATA

Here we revisit the pitprop data from section 2 We have studied many other examplesnot discussed here and in general the results are qualitativelysimilar Table 4 gives loadingsfor SCoTLASS with t = 225 200 175 and 150 Table 5 gives variances cumulativevariances ldquosimplicity factorsrdquo and number of zero loadings for the same values of t aswell as corresponding information for PCA (t =

p13) and RPCA The simplicity factors

are values of the varimax criterion for each componentIt can be seen that as the value of t is decreased the simplicity of the components

increases as measured by the number of zero loadings and by the varimax simplicity factoralthough the increase in the latter is not uniform The increase in simplicity is paid for by aloss of variance retained By t = 225 175 the percentage of variance accounted for by the rst component in SCoTLASS is reduced from 324 for PCA to 267 196 respectivelyAt t = 225 this is still larger than the largest contribution (239) achieved by a singlecomponent in RPCA The comparison with RPCA is less favorable when the variationaccounted for by all six retained components is examined RPCA necessarily retains the

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 539

Table 4 Loadings for SCoTLASS for Four Values of t Based on the Correlation Matrix for JeffersrsquoPitprop Data

Component

Technique Variable (1) (2) (3) (4) (5) (6)

SCoTLASS x1 0558 0085 iexcl0093 iexcl0107 0056 0017(t = 225) x2 0580 0031 iexcl0087 iexcl0147 0073 0047

x3 0000 0647 iexcl0129 0215 iexcl0064 iexcl0101x4 0000 0654 iexcl0000 0211 iexcl0080 0127x5 iexcl0000 0000 0413 iexcl0000 0236 0747x6 0001 0208 0529 iexcl0022 iexcl0108 0033x7 0266 iexcl0000 0385 0000 iexcl0121 0020x8 0104 iexcl0098 0000 0584 0127 iexcl0188x9 0372 iexcl0000 iexcl0000 0019 0142 iexcl0060x10 0364 iexcl0154 0000 0212 iexcl0296 0000x11 iexcl0000 0099 iexcl0000 0000 0879 iexcl0156x12 iexcl0000 0241 iexcl0001 iexcl0699 iexcl0044 iexcl0186x13 iexcl0000 0026 iexcl0608 iexcl0026 iexcl0016 0561

SCoTLASS x1 0623 0041 iexcl0049 0040 0051 iexcl0000(t = 200) x2 0647 0076 iexcl0001 0072 0059 iexcl0007

x3 0000 0000 iexcl0684 iexcl0128 iexcl0054 0106x4 0000 iexcl0001 iexcl0670 iexcl0163 iexcl0063 iexcl0100x5 iexcl0000 iexcl0267 0000 0000 0228 iexcl0772x6 0000 iexcl0706 iexcl0044 0011 iexcl0003 0001x7 0137 iexcl0539 0001 iexcl0000 iexcl0096 0001x8 0001 iexcl0000 0053 iexcl0767 0095 0098x9 0332 0000 0000 iexcl0001 0065 0014x10 0254 iexcl0000 0124 iexcl0277 iexcl0309 iexcl0000x11 0000 0000 iexcl0065 0000 0902 0194x12 iexcl0000 0000 iexcl0224 0533 iexcl0069 0137x13 0000 0364 iexcl0079 0000 0000 iexcl0562

SCoTLASS x1 0664 iexcl0000 0000 iexcl0025 0002 iexcl0035(t = 175) x2 0683 iexcl0001 0000 iexcl0040 0001 iexcl0018

x3 0000 0641 0195 0000 0180 iexcl0030x4 0000 0701 0001 0000 iexcl0000 iexcl0001x5 iexcl0000 0000 iexcl0000 0000 iexcl0887 iexcl0056x6 0000 0293 iexcl0186 0000 iexcl0373 0044x7 0001 0107 iexcl0658 iexcl0000 iexcl0051 0064x8 0001 iexcl0000 iexcl0000 0735 0021 iexcl0168x9 0283 iexcl0000 iexcl0000 0000 iexcl0000 iexcl0001x10 0113 iexcl0000 iexcl0001 0388 iexcl0017 0320x11 0000 0000 0000 iexcl0000 iexcl0000 iexcl0923x12 iexcl0000 0001 0000 iexcl0554 0016 0004x13 0000 iexcl0000 0703 0001 iexcl0197 0080

SCoTLASS x1 0701 iexcl0000 iexcl0001 iexcl0001 0001 iexcl0000(t = 150) x2 0709 iexcl0001 iexcl0001 iexcl0001 0001 iexcl0000

x3 0001 0698 iexcl0068 0002 iexcl0001 0001x4 0001 0712 iexcl0001 0002 iexcl0001 iexcl0001x5 iexcl0000 0000 0001 iexcl0001 0093 iexcl0757x6 0000 0081 0586 iexcl0031 0000 0000x7 0001 0000 0807 iexcl0001 0000 0002x8 0001 iexcl0000 0001 0044 iexcl0513 iexcl0001x9 0079 0000 0001 0001 iexcl0001 iexcl0000x10 0002 iexcl0000 0027 0660 iexcl0000 iexcl0002x11 0000 0001 iexcl0000 iexcl0749 iexcl0032 iexcl0000x12 iexcl0000 0001 iexcl0000 iexcl0001 0853 0083x13 0000 0000 iexcl0001 0000 iexcl0001 0648

540 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Table 5 Simplicity Factor Variance Cumulative Variance and Number of Zero Loadings for IndividualComponents in PCA RPCA and SCoTLASS for Four Values of t Based on the CorrelationMatrix for Jeffersrsquo Pitprop Data

Component

Technique Measure (1) (2) (3) (4) (5) (6)

PCA Simplicity factor (varimax) 0059 0103 0082 0397 0086 0266( = SCoTLASS with Variance () 324 182 144 89 70 63

t =p

13) Cumulative variance () 324 507 651 740 809 872

RPCA Simplicity factor (varimax) 0362 0428 0199 0595 0131 0343Variance () 130 146 184 97 239 76

Cumulative variance () 130 276 460 557 796 872

SCoTLASS Simplicity factor (varimax) 0190 0312 0205 0308 0577 0364(t = 225) Variance () 267 172 159 97 89 67

Cumulative variance () 267 439 598 694 784 850Number of zero loadings 6 3 5 3 0 1

SCoTLASS Simplicity factor (varimax) 0288 0301 0375 0387 0646 0412(t = 200) Variance () 231 164 162 112 89 65

Cumulative variance () 231 395 558 670 759 823Number of zero loadings 7 6 2 4 1 2

SCoTLASS Simplicity factor (varimax) 0370 0370 0388 0360 0610 0714(t = 175) Variance () 196 160 132 130 92 91

Cumulative variance () 196 356 487 618 710 801Number of zero loadings 7 7 7 7 3 0

SCoTLASS Simplicity factor (varimax) 0452 0452 0504 0464 0565 0464(t = 150) Variance () 161 149 138 102 99 96

Cumulative variance () 161 310 449 551 650 745Number of zero loadings 5 7 2 1 3 5

same total percentage variation (872) as PCA but SCoTLASS drops to 850 and 801 fort = 225 175 respectively Against this loss SCoTLASS has the considerable advantageof retaining the successive maximisation property At t = 225 apart from switching ofcomponents 4 and 5 the SCoTLASS componentsare nicely simpli ed versions of the PCsrather than being something different as in RPCA A linked advantage is that if we decideto look only at ve components then the SCoTLASS components will simply be the rst ve in Table 4 whereas RPCs based on (m iexcl 1) retained components are not necessarilysimilar to those based on m

A further ldquoplusrdquo for SCoTLASS is the presence of zero loadings which aids inter-pretation but even where there are few zeros the components are simpli ed compared toPCA Consider speci cally the interpretation of the second component which we notedearlier was ldquomessyrdquo for PCA For t = 175 this component is now interpreted as measuringmainly moisture content and speci c gravity with small contributions from numbers ofannual rings and all other variables negligible This gain in interpretability is achieved byreducing the percentage of total variance accounted for from 182 to 160 For other com-ponents too interpretation is made easier because in the majority of cases the contributionof a variable is clearcut Either it is important or it is not with few equivocal contributions

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 541

Table 6 Speci ed Eigenvectors of a Six-Dimensional Block Structure

Eigenvectors

Variable (1) (2) (3) (4) (5) (6)

x1 0096 iexcl0537 0759 iexcl0120 0335 iexcl0021x2 0082 iexcl0565 iexcl0599 0231 0511 iexcl0013x3 0080 iexcl0608 iexcl0119 iexcl0119 iexcl0771 0016x4 0594 0085 iexcl0074 iexcl0308 0069 0731x5 0584 0096 iexcl0114 iexcl0418 0052 iexcl0678x6 0533 0074 0180 0805 iexcl0157 iexcl0069

Variance 18367 1640 0751 0659 0607 0506

For t = 225 200 175 150 the number of the zeros is as follows 18 22 31 and 23 Itseems surprising that we obtain fewer zeros with t = 150 than with t = 175 that is thesolution with t = 175 appears to be simpler than the one with t = 150 In fact this impres-sion is misleading (see also the next paragraph) The explanation of this anomaly is in theprojected gradient method used for numerical solution of the problem which approximatesthe LASSO constraint with a certain smooth function and thus the zero-loadings producedmay be also approximate One can see that the solution with t = 150 contains a total of 56loadings with less than 0005 magnitude compared to 42 in the case t = 175

Another interesting comparison is in terms of average varimax simplicity over the rst six components This is 0343 for RPCA compared to 0165 for PCA For t =

225 200 175 150 the average simplicityis 0326 0402 0469 0487 respectivelyThisdemonstrates that although the varimax criterion is not an explicit part of SCoTLASS bytaking t small enough we can do better than RPCA with respect to its own criterion This isachieved by moving outside the space spanned by the retained PCs and hence settling fora smaller amount of overall variation retained

5 SIMULATION STUDIES

One question of interest is whether SCoTLASS is better at detecting underlying simplestructure in a data set than is PCA or RPCA To investigate this question we simulated datafrom a variety of known structures Because of space constraints only a small part of theresults is summarized here further details can be found in Uddin (1999)

Given a vector l of positive real numbers and an orthogonal matrix A we can attemptto nd a covariance matrix or correlation matrix whose eigenvalues are the elements of land whose eigenvectorsare the column of A Some restrictions need to be imposed on l andA especially in the case of correlation matrices but it is possible to nd such matrices for awide range of eigenvectorstructuresHavingobtaineda covarianceor correlationmatrix it isstraightforward to generate samples of data from multivariate normal distributions with thegiven covariance or correlation matrix We have done this for a wide variety of eigenvectorstructures (principal component loadings) and computed the PCs RPCs and SCoTLASScomponents from the resulting sample correlation matrices Various structures have been

542 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Table 7 Speci ed Eigenvectors of a Six-Dimensional Intermediate Structure

Eigenvectors

Variable (1) (2) (3) (4) (5) (6)

x1 0224 iexcl0509 0604 0297 iexcl0327 0361x2 0253 iexcl0519 iexcl0361 iexcl0644 iexcl0341 iexcl0064x3 0227 iexcl0553 iexcl0246 0377 0608 iexcl0267x4 0553 0249 iexcl0249 iexcl0052 0262 0706x5 0521 0254 iexcl0258 0451 iexcl0509 iexcl0367x6 0507 0199 0561 iexcl0384 0281 iexcl0402

Variance 1795 1674 0796 0618 0608 0510

investigated which we call block structure intermediate structure and uniform structureTables 6ndash8 give one example of each type of structure The structure in Table 6 has blocks ofnontrivial loadings and blocks of near-zero loadings in each underlying component Table8 has a structure in which all loadings in the rst two components have similar absolutevalues and the structure in Table 7 is intermediate to those of Tables 6 and 8 An alternativeapproach to the simulation study would be to replace the near-zero loadings by exact zerosand the nearly equal loadingsby exact equalitiesHowever we feel that in reality underlyingstructures are never quite that simple so we perturbed them a little

It might be expected that if the underlying structure is simple then sampling variationis more likely to take sample PCs away from simplicity than to enhance this simplicityIt is of interest to investigate whether the techniques of RPCA and SCoTLASS whichincrease simplicity compared to the sample PCs will do so in the direction of the trueunderlying structure The closeness of a vector of loadings from any of these techniquesto the underlying true vector is measured by the angle between the two vectors of interestThese anglesare given in Tables9ndash11 for single simulateddatasets from three different typesof six-dimensional structure they typify what we found in other simulations Three valuesof t (apart from that for PCA) are shown in the tables Their exact values are unimportantand are slightly different in different tables They are chosen to illustrate typical behaviorin our simulations

The results illustratethat for eachstructureRPCA isperhapssurprisinglyand certainlydisappointinglybad at recovering the underlying structure SCoTLASS on the other hand

Table 8 Specied Eigenvectors of a Six-Dimensional Uniform Structure

Eigenvectors

Variable (1) (2) (3) (4) (5) (6)

x1 iexcl0455 0336 iexcl0087 0741 iexcl0328 0125x2 iexcl0439 0370 iexcl0212 iexcl0630 iexcl0445 iexcl0175x3 iexcl0415 0422 0378 iexcl0110 0697 0099x4 0434 0458 0040 iexcl0136 iexcl0167 0744x5 0301 0435 iexcl0697 0114 0356 iexcl0306x6 0385 0416 0563 0104 iexcl0234 iexcl0545

Variance 1841 1709 0801 0649 0520 0480

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 543

Table 9 Angles Between the Underlying Vectors and the Sample Vectors of PCA RPCA and SCoT-LASS With Various Values of t for a Speci ed ldquoBlockrdquo Structure of Correlation Eigenvectors

Vectors

Technique t (1) (2) (3) (4)

PCAp

6 129 120 154 795RPCA 377 452 453 831

SCoTLASS t = 200 129 120 154 786t = 182 119 114 138 732t = 175 94 100 125 852

is capable of improvement over PCA For example for t = 175 it not only improvesover PCA in terms of angles in Table 9 but it also has 3 3 and 2 zero loadings in its rstthree components thus giving a notably simpler structure None of the methods managesto reproduce the underlying structure for component 4 in Table 9

The results for intermediate structure in Table 10 are qualitatively similar to thosein Table 9 except that SCoTLASS does best for higher values of t than in Table 9 Foruniform structure (Table 11) SCoTLASS does badly compared to PCA for all values of tThis is not unexpected because although uniform structure is simple in its own way it isnot the type of simplicity which SCoTLASS aims for It is also the case that the varimaxcriterion is designedso that it stands littlechanceof ndinguniform structureOther rotationcriteria such as quartimax can in theory nd uniform vectors of loadings but they weretried and also found to be unsuccessful in our simulations It is probable that a uniformstructure is more likely to be found by the techniques proposed by Hausman (1982) orVines (2000) Although SCoTLASS will usually fail to nd such structures their existencemay be indicated by a large drop in the variance explained by SCoTLASS as decreasingvalues of t move it away from PCA

6 DISCUSSION

A new techniqueSCoTLASS has been introduced for discovering and interpreting themajor sources of variability in a dataset We have illustrated its usefulness in an exampleand have also shown through simulations that it is capable of recovering certain types of

Table 10 Angles Between the Underlying Vectors and the Sample Vectors of PCA RPCA and SCoT-LASS With Various Values of t for a Speci ed ldquoIntermediaterdquo Structure of Correlation Eigen-vectors

Vectors

Technique t (1) (2) (3) (4)

PCAp

6 137 151 236 783RPCA 533 429 664 777

SCoTLASS t = 228 55 104 237 783t = 212 95 120 235 778t = 201 175 191 225 717

544 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Table 11 Angles Between the Underlying Vectors and the Sample Vectors of PCA RPCA and SCoT-LASS with VariousValues of t for a Speci ed ldquoUniformrdquo Structure of Correlation Eigenvectors

Vectors

Technique t (1) (2) (3) (4)

PCAp

6 60 64 232 318RPCA 520 541 394 435

SCoTLASS t = 215 243 246 234 307t = 191 345 346 218 235t = 173 410 413 225 190

underlying structure It is preferred in many respects to rotated principal components as ameans of simplifying interpretation compared to principal component analysis Althoughwe are convincedof the value of SCoTLASS there are a number of complicationsand openquestions which are now listed as a set of remarks

Remark 1 In this article we have carried out the techniques studied on correlationmatrices Although it is less common in practice PCA and RPCA can also be implementedon covariance matrices In this case PCA successively nds uncorrelated linear functions ofthe original unstandardized variables SCoTLASS can also be implemented in this casethe only difference being that the sample correlation matrix R is replaced by the samplecovariance matrix S in equation (31) We have investigated covariance-based SCoTLASSboth for real examples and using simulation studies Some details of its performance aredifferent from the correlation-based case but qualitatively they are similar In particularthere are a number of reasons to prefer SCoTLASS to RPCA

Remark 2 In PCA the constraint a0

hak = 0 (orthogonality of vectors of loadings)is equivalent to a

0

hRak = 0 (different components are uncorrelated) This equivalence isspecial to the PCs and is a consequenceof the ak being eigenvectors of R When we rotatethe PC loadings we lose at least one of these two properties (Jolliffe 1995) Similarly inSCoTLASS if we impose a

0

hak = 0 we no longer have uncorrelated components Forexample Table 12 gives the correlations between the six SCoTLASS components whent = 175 for the pitprop data Although most of the correlations in Table 12 are small inabsolute value there are also nontrivial ones (r12 r14 r34 and r35)

Table 12 Correlation Matrix for the First Six SCoTLASS Components for t = 175 using Jeffersrsquo PitpropData

Correlation matrix

Components (1) (2) (3) (4) (5) (6)

(1) 1000 0375 iexcl0234 0443 iexcl0010 0061(2) 1000 iexcl0114 iexcl0076 iexcl0145 iexcl0084(3) 1000 iexcl0438 0445 iexcl0187(4) 1000 iexcl0105 0141(5) 1000 iexcl0013(6) 1000

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 545

It is possible to replace the constraint a0

hak = 0 in SCoTLASS by a0

hR ak = 0 thuschoosing to have uncorrelated components rather than orthogonal loadings but this optionis not explored in the present article

Remark 3 The choice of t is clearly important in SCoTLASS As t decreases simplic-ity increases but variation explained decreases and we need to achieve a suitable tradeoffbetween these two properties The correlation between components noted in the previousremark is another aspect of any tradeoff Correlations are small for t close to

pp but have

the potential to increase as t decreases It might be possible to construct a criterion whichde nes the ldquobest tradeoffrdquo but there is no unique construction because of the dif culty ofdeciding how to measure simplicity and how to combine variance simplicity and correla-tion At present it seems best to compute the SCoTLASS components for several values oft and judge subjectively at what point a balance between these various aspects is achievedIn our example we used the same value of t for all components in a data set but varying t

for different components is another possibility

Remark 4 Our algorithms for SCoTLASS are slower than those for PCA This isbecause SCoTLASS is implemented subject to an extra restriction on PCA and we losethe advantage of calculation via the singular value decomposition which makes the PCAalgorithm fast Sequential-based PCA with an extra constraint requires a good optimizerto produce a global optimum In the implementation of SCoTLASS a projected gradientmethod is used which is globally convergent and preserves accurately both the equalityand inequality constraints It should be noted that as t is reduced from

pp downwards

towards unity the CPU time taken to optimize the objective function remains generallythe same (11 sec on average for 1GHz PC) but as t decreases the algorithm becomesprogressively prone to hit local minima and thus more (random) starts are required to nd a global optimum Osborne Presnell and Turlach (2000) gave an ef cient procedurebased on convex programming and a dual problem for implementing the LASSO in theregression context Whether or not this approach can be usefully adapted to SCoTLASSwill be investigated in further research Although we are reasonably con dent that ouralgorithm has found global optima in the example of Section 4 and in the simulations thereis no guarantee The jumps that occur in some coef cients such as the change in a36 fromiexcl 0186 to 0586 as t decreases from 175 to 150 in the pitprop data could be due to one ofthe solutions being a local optimum However it seems more likely to us that it is caused bythe change in the nature of the earlier components which together with the orthogonalityconstraint imposed on the third component opens up a different range of possibilities forthe latter component There is clearly much scope for further work on the implementationof SCoTLASS

Remark 5 In a number of examples not reportedhere several of the nonzero loadingsin the SCoTLASS components are exactly equal especially for large values of p and smallvalues of t At present we have no explanation for this but it deserves further investigation

546 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Remark 6 The way in which SCoTLASS sets some coef cients to zero is different inconcept from simply truncating to zero the smallest coef cients in a PC The latter attemptsto approximate the PC by a simpler version and can have problems with that approximationas shown by Cadima and Jolliffe (1995) SCoTLASS looks for simple sources of variationand likePCA aims for highvariancebut becauseof simplicityconsiderationsthesimpli edcomponents can in theory be moderately different from the PCs We seek to replace PCsrather than approximate them although because of the shared aim of high variance theresults will often not be too different

Remark 7 There are a number of other recent developments which are relevant tointerpretation problems in multivariate statistics Jolliffe (2002) Chapter 11 reviews thesein the context of PCA Vinesrsquos (2000) use of only a discrete number of values for loadingshas already been mentioned It works well in some examples but for the pitprops datathe components 4ndash6 are rather complex A number of aspects of the strategy for selectinga subset of variables were explored by Cadima and Jolliffe (1995 2001) and by Tanakaand Mori (1997) The LASSO has been generalized in the regression context to so-calledbridge estimation in which the constraint p

j = 1 jshy j j micro t is replaced by pj = 1 jshy j jreg micro t

where reg is not necessarily equal to unitymdashsee for example Fu (1998) Tibshirani (1996)also mentioned the nonnegativegarotte due to Breiman (1995) as an alternative approachTranslation of the ideas of the bridge and nonnegative garotte to the context of PCA andcomparison with other techniques would be of interest in future research

ACKNOWLEDGMENTS

An earlier draft of this article was prepared while the first author was visiting the Bureau of Meteorology Research Centre (BMRC), Melbourne, Australia. He is grateful to BMRC for the support and facilities provided during his visit, and to the Leverhulme Trust for partially supporting the visit under their Study Abroad Fellowship scheme. Comments from two referees and the editor have helped to improve the clarity of the article.

[Received November 2000. Revised July 2002.]

REFERENCES

Breiman, L. (1995), "Better Subset Regression Using the Nonnegative Garotte," Technometrics, 37, 373–384.

Cadima, J., and Jolliffe, I. T. (1995), "Loadings and Correlations in the Interpretation of Principal Components," Journal of Applied Statistics, 22, 203–214.

(2001), "Variable Selection and the Interpretation of Principal Subspaces," Journal of Agricultural, Biological, and Environmental Statistics, 6, 62–79.

Chu, M. T., and Trendafilov, N. T. (2001), "The Orthogonally Constrained Regression Revisited," Journal of Computational and Graphical Statistics, 10, 1–26.

Fu, J. W. (1998), "Penalized Regression: The Bridge Versus the Lasso," Journal of Computational and Graphical Statistics, 7, 397–416.

Goffe, W. L., Ferrier, G. D., and Rogers, J. (1994), "Global Optimizations of Statistical Functions with Simulated Annealing," Journal of Econometrics, 60, 65–99.

Hausman, R. (1982), "Constrained Multivariate Analysis," in Optimization in Statistics, eds. S. H. Zanckis and J. S. Rustagi, Amsterdam: North Holland, pp. 137–151.

Helmke, U., and Moore, J. B. (1994), Optimization and Dynamical Systems, London: Springer.

Jeffers, J. N. R. (1967), "Two Case Studies in the Application of Principal Component Analysis," Applied Statistics, 16, 225–236.

Jolliffe, I. T. (1989), "Rotation of Ill-Defined Principal Components," Applied Statistics, 38, 139–147.

(1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied Statistics, 22, 29–35.

(2002), Principal Component Analysis (2nd ed.), New York: Springer-Verlag.

Jolliffe, I. T., and Uddin, M. (2000), "The Simplified Component Technique – An Alternative to Rotated Principal Components," Journal of Computational and Graphical Statistics, 9, 689–710.

Jolliffe, I. T., Uddin, M., and Vines, S. K. (2002), "Simplified EOFs – Three Alternatives to Rotation," Climate Research, 20, 271–279.

Krzanowski, W. J., and Marriott, F. H. C. (1995), Multivariate Analysis, Part II, London: Arnold.

LeBlanc, M., and Tibshirani, R. (1998), "Monotone Shrinkage of Trees," Journal of Computational and Graphical Statistics, 7, 417–433.

Morton, S. C. (1989), "Interpretable Projection Pursuit," Technical Report 106, Department of Statistics, Stanford University.

McCabe, G. P. (1984), "Principal Variables," Technometrics, 26, 137–144.

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000), "On the LASSO and its Dual," Journal of Computational and Graphical Statistics, 9, 319–337.

Tanaka, Y., and Mori, Y. (1997), "Principal Component Analysis Based on a Subset of Variables: Variable Selection and Sensitivity Analysis," American Journal of Mathematical and Management Sciences, 17, 61–89.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.

Uddin, M. (1999), "Interpretation of Results from Simplified Principal Components," PhD thesis, University of Aberdeen, Aberdeen, Scotland.

Vines, S. K. (2000), "Simple Principal Components," Applied Statistics, 49, 441–451.

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 535

they are for the unrotated componentsThere are however a number of disadvantages associated with rotation In the context

of interpreting the results in this example we note that we have lost the ldquosuccessive max-imization of variancerdquo property of the unrotated components so what we are interpretingafter rotation are not the ldquomost important sources of variationrdquo in the data The RPC withthe highest variance appears arbitrarily as the fth and this accounts for 24 of the totalvariation compared to 32 in the rst unrotated PC In addition a glance at the loadingsand simplicity factors for the RPCs shows that more generally those components whichare easiest to interpret among the six in Table 3 are those which have the smallest varianceRPC5 is still rather complicated Other problems associated with rotation were discussedby Jolliffe (1989 1995) A simpli ed component technique (SCoT) in which the two stepsof RPCA (PCA followed by rotation) are combined into one was discussed by Jolliffe andUddin (2000) The technique is based on a similar idea proposed by Morton (1989) in thecontext of projection pursuit It maximizes variance but adds a penalty function which isa multiple of one of the simplicity criteria such as varimax SCoT has some advantagescompared to standard rotation but shares a number of its disadvantages

The next section introduces an alternative to rotation which has some clear advantagesover rotated PCA and SCoT A detailed comparison of the new technique SCoT rotatedPCA and Vinesrsquo (2000) simple components is given for an example involving sea surfacetemperatures in Jolliffe Uddin and Vines (2002)

3 MODIFIED PCA BASED ON THE LASSO

Tibshirani (1996) studied the dif culties involved in the interpretation of multiple re-gression equationsThese problems may occur due to the instability of the regression coef- cients in the presence of collinearity or simply because of the large number of variablesincluded in the regression equation Some current alternatives to least squares regressionsuch as shrinkage estimators ridge regression principal component regression or partialleast squares handle the instabilityproblemby keepingall variablesin the equationwhereasvariable selection procedures nd a subset of variables and keep only the selected variablesin the equation Tibshirani (1996) proposed a new method the ldquoleast absolute shrinkageand selection operatorrdquo LASSO which is a compromise between variable selection andshrinkage estimators The procedure shrinks the coef cients of some of the variables notsimply towards zero but exactly to zero giving an implicit form of variable selectionLeBlanc and Tibshirani (1998) extended the idea to regression trees Here we adapt theLASSO idea to PCA

31 THE LASSO APPROACH IN REGRESSION

In standard multiple regression we have the equation

yi = not +

p

j = 1

shy jxij + ei i = 1 2 n

536 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

where y1 y2 yn are measurements on a response variable y xij i = 1 2 n j =

1 2 p are corresponding values of p predictor variables e1 e2 en are error termsand not shy 1 shy 2 shy p are parameters in the regression equation In least squares regressionthese parameters are estimated by minimizing the residual (or error) sum of squares

n

i = 1

yi iexcl not iexclp

j = 1

shy jxij

2

The LASSO imposes an additional restriction on the coef cients namely

p

j = 1

jshy j j micro t

for some ldquotuning parameterrdquo t For suitable choices of t this constraint has the interestingproperty that it forces some of the coef cients in the regression equation to zero An equiv-alent way of deriving LASSO estimates is to minimize the residual sum of squares with theaddition of a penalty function based on p

j = 1 jshy jj Thus we minimize

n

i = 1

yi iexcl not iexclp

j = 1

shy jxij

2

+ para

p

j = 1

jshy jj

for some multiplier para For any given value of t in the rst LASSO formulation there is avalue of para in the second formulation that gives equivalent results

32 THE LASSO APPROACH IN PCA (SCOTLASS)

PCA on a correlation matrix nds linear combinations a0

kx(k = 1 2 p) of thep measured variables x each standardized to have unit variance which successively havemaximum variance

a0

kR ak (31)

subject to

a0

kak = 1 and (for k para 2) a0

hak = 0 h lt k (32)

The proposed method of LASSO-based PCA performs the maximization under the extraconstraints

p

j = 1

jakjj micro t (33)

for some tuning parameter t where akj is the jth element of the kth vector ak (k =

1 2 p) We call the new technique SCoTLASS (Simpli ed Component Technique-LASSO)

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 537

= 1+

2

1 2aa

a

22

t = 1

t = sqrt (p)

Figure 1 The Two-Dimensional SCoTLASS

33 SOME PROPERTIES

SCoTLASS differs from PCA in the inclusion of the constraints de ned in (33) so adecision must be made on the value of the tuning parameter t It is easy to see that

(a) for t para pp we get PCA

(b) for t lt 1 there is no solution and(c) for t = 1 we must have exactly one nonzero akj for each k

As t decreases fromp

p we move progressivelyaway from PCA and eventuallyto a solutionwhere only one variable has a nonzero loading on each component All other variables willshrink (not necessary monotonically) with t and ultimately reach zero Examples of thisbehavior are given in the next section

The geometry of SCoTLASS in the case when p = 2 is shown in Figure 1 where weplot the elements a1 a2 of the vectora For PCA in Figure 1 the rst component a

0

1x wherea

0

1 = (a11 a12) corresponds to the point on circumference of the shaded circle (a0

1a1 = 1)

which touches the ldquolargestrdquo possible ellipse a0

1Ra1 = constantFor SCoTLASS with 1 lt t lt

pp we are restricted to the part of the circle a

0

1a1 = 1inside the dotted square 2

j = 1 ja1jj micro tFor t = 1 corresponding to the inner shaded square the optimal (only) solutions are

on the axesFigure 1 shows a special case in which the axes of the ellipses are at 45macr to the a1 a2

axes corresponding to equal variances for x1 x2 (or a correlation matrix) This gives two

538 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

optimal solutions for SCoTLASS in the rst quadrant symmetric about the 45macr line IfPCA or SCoTLASS is done on a covariance rather than correlation matrix (see Remark 1in Section 6) with unequal variances there will be a unique solution in the rst quadrant

34 IMPLEMENTATION

PCA reduces to an easily implemented eigenvalue problem but the extra constraintin SCoTLASS means that it needs numerical optimization to estimate parameters and suf-fers from the problem of many local optima A number of algorithms have been tried toimplement the technique including simulated annealing (Goffe Ferrier and Rogers 1994)but the results reported in the following example were derived using the projected gradientapproach (Chu and Trenda lov 2001 Helmke and Moore 1994) The LASSO inequalityconstraint (33) in the SCoTLASS problem (31)ndash(33) is eliminated by making use of anexterior penalty function Thus the SCoTLASS problem (31)ndash(33) is transformed intoa new maximization problem subject to the equality constraint (32) The solution of thismodi ed maximization problem is then found as an ascent gradient vector ow onto thep-dimensional unit sphere following the standard projected gradient formalism (Chu andTrenda lov 2001 Helmke and Moore 1994) Detailed considerationof this solution will bereported separately

We have implemented SCoTLASS using MATLAB The code requires as input thecorrelation matrix R the value of the tuning parameter t and the number of componentsto be retained (m) The MATLAB code returns a loading matrix and calculates a numberof relevant statistics To achieve an appropriate solution a number of parameters of theprojected gradient method (eg starting points absolute and relative tolerances) also needto be de ned

4 EXAMPLE PITPROPS DATA

Here we revisit the pitprop data from section 2 We have studied many other examplesnot discussed here and in general the results are qualitativelysimilar Table 4 gives loadingsfor SCoTLASS with t = 225 200 175 and 150 Table 5 gives variances cumulativevariances ldquosimplicity factorsrdquo and number of zero loadings for the same values of t aswell as corresponding information for PCA (t =

p13) and RPCA The simplicity factors

are values of the varimax criterion for each componentIt can be seen that as the value of t is decreased the simplicity of the components

increases as measured by the number of zero loadings and by the varimax simplicity factoralthough the increase in the latter is not uniform The increase in simplicity is paid for by aloss of variance retained By t = 225 175 the percentage of variance accounted for by the rst component in SCoTLASS is reduced from 324 for PCA to 267 196 respectivelyAt t = 225 this is still larger than the largest contribution (239) achieved by a singlecomponent in RPCA The comparison with RPCA is less favorable when the variationaccounted for by all six retained components is examined RPCA necessarily retains the

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 539

Table 4 Loadings for SCoTLASS for Four Values of t Based on the Correlation Matrix for JeffersrsquoPitprop Data

Component

Technique Variable (1) (2) (3) (4) (5) (6)

SCoTLASS x1 0558 0085 iexcl0093 iexcl0107 0056 0017(t = 225) x2 0580 0031 iexcl0087 iexcl0147 0073 0047

x3 0000 0647 iexcl0129 0215 iexcl0064 iexcl0101x4 0000 0654 iexcl0000 0211 iexcl0080 0127x5 iexcl0000 0000 0413 iexcl0000 0236 0747x6 0001 0208 0529 iexcl0022 iexcl0108 0033x7 0266 iexcl0000 0385 0000 iexcl0121 0020x8 0104 iexcl0098 0000 0584 0127 iexcl0188x9 0372 iexcl0000 iexcl0000 0019 0142 iexcl0060x10 0364 iexcl0154 0000 0212 iexcl0296 0000x11 iexcl0000 0099 iexcl0000 0000 0879 iexcl0156x12 iexcl0000 0241 iexcl0001 iexcl0699 iexcl0044 iexcl0186x13 iexcl0000 0026 iexcl0608 iexcl0026 iexcl0016 0561

SCoTLASS x1 0623 0041 iexcl0049 0040 0051 iexcl0000(t = 200) x2 0647 0076 iexcl0001 0072 0059 iexcl0007

x3 0000 0000 iexcl0684 iexcl0128 iexcl0054 0106x4 0000 iexcl0001 iexcl0670 iexcl0163 iexcl0063 iexcl0100x5 iexcl0000 iexcl0267 0000 0000 0228 iexcl0772x6 0000 iexcl0706 iexcl0044 0011 iexcl0003 0001x7 0137 iexcl0539 0001 iexcl0000 iexcl0096 0001x8 0001 iexcl0000 0053 iexcl0767 0095 0098x9 0332 0000 0000 iexcl0001 0065 0014x10 0254 iexcl0000 0124 iexcl0277 iexcl0309 iexcl0000x11 0000 0000 iexcl0065 0000 0902 0194x12 iexcl0000 0000 iexcl0224 0533 iexcl0069 0137x13 0000 0364 iexcl0079 0000 0000 iexcl0562

SCoTLASS x1 0664 iexcl0000 0000 iexcl0025 0002 iexcl0035(t = 175) x2 0683 iexcl0001 0000 iexcl0040 0001 iexcl0018

x3 0000 0641 0195 0000 0180 iexcl0030x4 0000 0701 0001 0000 iexcl0000 iexcl0001x5 iexcl0000 0000 iexcl0000 0000 iexcl0887 iexcl0056x6 0000 0293 iexcl0186 0000 iexcl0373 0044x7 0001 0107 iexcl0658 iexcl0000 iexcl0051 0064x8 0001 iexcl0000 iexcl0000 0735 0021 iexcl0168x9 0283 iexcl0000 iexcl0000 0000 iexcl0000 iexcl0001x10 0113 iexcl0000 iexcl0001 0388 iexcl0017 0320x11 0000 0000 0000 iexcl0000 iexcl0000 iexcl0923x12 iexcl0000 0001 0000 iexcl0554 0016 0004x13 0000 iexcl0000 0703 0001 iexcl0197 0080

SCoTLASS x1 0701 iexcl0000 iexcl0001 iexcl0001 0001 iexcl0000(t = 150) x2 0709 iexcl0001 iexcl0001 iexcl0001 0001 iexcl0000

x3 0001 0698 iexcl0068 0002 iexcl0001 0001x4 0001 0712 iexcl0001 0002 iexcl0001 iexcl0001x5 iexcl0000 0000 0001 iexcl0001 0093 iexcl0757x6 0000 0081 0586 iexcl0031 0000 0000x7 0001 0000 0807 iexcl0001 0000 0002x8 0001 iexcl0000 0001 0044 iexcl0513 iexcl0001x9 0079 0000 0001 0001 iexcl0001 iexcl0000x10 0002 iexcl0000 0027 0660 iexcl0000 iexcl0002x11 0000 0001 iexcl0000 iexcl0749 iexcl0032 iexcl0000x12 iexcl0000 0001 iexcl0000 iexcl0001 0853 0083x13 0000 0000 iexcl0001 0000 iexcl0001 0648

540 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Table 5 Simplicity Factor Variance Cumulative Variance and Number of Zero Loadings for IndividualComponents in PCA RPCA and SCoTLASS for Four Values of t Based on the CorrelationMatrix for Jeffersrsquo Pitprop Data

Component

Technique Measure (1) (2) (3) (4) (5) (6)

PCA Simplicity factor (varimax) 0059 0103 0082 0397 0086 0266( = SCoTLASS with Variance () 324 182 144 89 70 63

t =p

13) Cumulative variance () 324 507 651 740 809 872

RPCA Simplicity factor (varimax) 0362 0428 0199 0595 0131 0343Variance () 130 146 184 97 239 76

Cumulative variance () 130 276 460 557 796 872

SCoTLASS Simplicity factor (varimax) 0190 0312 0205 0308 0577 0364(t = 225) Variance () 267 172 159 97 89 67

Cumulative variance () 267 439 598 694 784 850Number of zero loadings 6 3 5 3 0 1

SCoTLASS Simplicity factor (varimax) 0288 0301 0375 0387 0646 0412(t = 200) Variance () 231 164 162 112 89 65

Cumulative variance () 231 395 558 670 759 823Number of zero loadings 7 6 2 4 1 2

SCoTLASS Simplicity factor (varimax) 0370 0370 0388 0360 0610 0714(t = 175) Variance () 196 160 132 130 92 91

Cumulative variance () 196 356 487 618 710 801Number of zero loadings 7 7 7 7 3 0

SCoTLASS Simplicity factor (varimax) 0452 0452 0504 0464 0565 0464(t = 150) Variance () 161 149 138 102 99 96

Cumulative variance () 161 310 449 551 650 745Number of zero loadings 5 7 2 1 3 5

same total percentage variation (872) as PCA but SCoTLASS drops to 850 and 801 fort = 225 175 respectively Against this loss SCoTLASS has the considerable advantageof retaining the successive maximisation property At t = 225 apart from switching ofcomponents 4 and 5 the SCoTLASS componentsare nicely simpli ed versions of the PCsrather than being something different as in RPCA A linked advantage is that if we decideto look only at ve components then the SCoTLASS components will simply be the rst ve in Table 4 whereas RPCs based on (m iexcl 1) retained components are not necessarilysimilar to those based on m

A further ldquoplusrdquo for SCoTLASS is the presence of zero loadings which aids inter-pretation but even where there are few zeros the components are simpli ed compared toPCA Consider speci cally the interpretation of the second component which we notedearlier was ldquomessyrdquo for PCA For t = 175 this component is now interpreted as measuringmainly moisture content and speci c gravity with small contributions from numbers ofannual rings and all other variables negligible This gain in interpretability is achieved byreducing the percentage of total variance accounted for from 182 to 160 For other com-ponents too interpretation is made easier because in the majority of cases the contributionof a variable is clearcut Either it is important or it is not with few equivocal contributions

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 541

Table 6 Speci ed Eigenvectors of a Six-Dimensional Block Structure

Eigenvectors

Variable (1) (2) (3) (4) (5) (6)

x1 0096 iexcl0537 0759 iexcl0120 0335 iexcl0021x2 0082 iexcl0565 iexcl0599 0231 0511 iexcl0013x3 0080 iexcl0608 iexcl0119 iexcl0119 iexcl0771 0016x4 0594 0085 iexcl0074 iexcl0308 0069 0731x5 0584 0096 iexcl0114 iexcl0418 0052 iexcl0678x6 0533 0074 0180 0805 iexcl0157 iexcl0069

Variance 18367 1640 0751 0659 0607 0506

For t = 225 200 175 150 the number of the zeros is as follows 18 22 31 and 23 Itseems surprising that we obtain fewer zeros with t = 150 than with t = 175 that is thesolution with t = 175 appears to be simpler than the one with t = 150 In fact this impres-sion is misleading (see also the next paragraph) The explanation of this anomaly is in theprojected gradient method used for numerical solution of the problem which approximatesthe LASSO constraint with a certain smooth function and thus the zero-loadings producedmay be also approximate One can see that the solution with t = 150 contains a total of 56loadings with less than 0005 magnitude compared to 42 in the case t = 175

Another interesting comparison is in terms of average varimax simplicity over the first six components. This is 0.343 for RPCA, compared to 0.165 for PCA. For t = 2.25, 2.00, 1.75, 1.50 the average simplicity is 0.326, 0.402, 0.469, 0.487, respectively. This demonstrates that, although the varimax criterion is not an explicit part of SCoTLASS, by taking t small enough we can do better than RPCA with respect to its own criterion. This is achieved by moving outside the space spanned by the retained PCs, and hence settling for a smaller amount of overall variation retained.
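The varimax simplicity factor itself is not written out in this part of the article; the sketch below assumes one standard form, a scaled variance of the squared loadings of a component, which is consistent with the tabulated values for the pitprop data. A Python sketch under that assumption:

    import numpy as np

    def varimax_simplicity(a):
        """Simplicity factor of one component: the scaled variance of its
        squared loadings, (p * sum(a_j^4) - (sum(a_j^2))^2) / (p - 1).
        Larger values indicate a simpler (more varimax-like) component."""
        a = np.asarray(a, dtype=float)
        p = a.size
        sq = a ** 2
        return (p * np.sum(sq ** 2) - np.sum(sq) ** 2) / (p - 1)

    def average_simplicity(L):
        """Average simplicity over the columns (components) of a loading matrix."""
        return float(np.mean([varimax_simplicity(L[:, k]) for k in range(L.shape[1])]))

For a normalized loading vector (a'a = 1) this measure ranges from 0, when all squared loadings are equal, to 1, when a single loading is nonzero.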

5 SIMULATION STUDIES

One question of interest is whether SCoTLASS is better at detecting underlying simple structure in a data set than is PCA or RPCA. To investigate this question we simulated data from a variety of known structures. Because of space constraints, only a small part of the results is summarized here; further details can be found in Uddin (1999).

Given a vector l of positive real numbers and an orthogonal matrix A, we can attempt to find a covariance matrix or correlation matrix whose eigenvalues are the elements of l and whose eigenvectors are the columns of A. Some restrictions need to be imposed on l and A, especially in the case of correlation matrices, but it is possible to find such matrices for a wide range of eigenvector structures. Having obtained a covariance or correlation matrix, it is straightforward to generate samples of data from multivariate normal distributions with the given covariance or correlation matrix. We have done this for a wide variety of eigenvector structures (principal component loadings) and computed the PCs, RPCs, and SCoTLASS components from the resulting sample correlation matrices. Various structures have been


Table 7. Specified Eigenvectors of a Six-Dimensional Intermediate Structure

                              Eigenvector
Variable     (1)      (2)      (3)      (4)      (5)      (6)
x1          0.224   -0.509    0.604    0.297   -0.327    0.361
x2          0.253   -0.519   -0.361   -0.644   -0.341   -0.064
x3          0.227   -0.553   -0.246    0.377    0.608   -0.267
x4          0.553    0.249   -0.249   -0.052    0.262    0.706
x5          0.521    0.254   -0.258    0.451   -0.509   -0.367
x6          0.507    0.199    0.561   -0.384    0.281   -0.402

Variance    1.795    1.674    0.796    0.618    0.608    0.510

investigated, which we call block structure, intermediate structure, and uniform structure. Tables 6-8 give one example of each type of structure. The structure in Table 6 has blocks of nontrivial loadings and blocks of near-zero loadings in each underlying component; Table 8 has a structure in which all loadings in the first two components have similar absolute values; and the structure in Table 7 is intermediate to those of Tables 6 and 8. An alternative approach to the simulation study would be to replace the near-zero loadings by exact zeros and the nearly equal loadings by exact equalities. However, we feel that in reality underlying structures are never quite that simple, so we perturbed them a little.
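A minimal sketch of this simulation step in Python (numpy assumed; the eigenstructure shown is a small hypothetical one rather than Tables 6-8, and the restrictions needed to make the diagonal exactly one for a correlation matrix are not enforced here):

    import numpy as np

    def matrix_from_eigenstructure(l, A):
        """Build the symmetric matrix A diag(l) A' whose eigenvalues are l and
        whose eigenvectors are the columns of the orthogonal matrix A."""
        return A @ np.diag(l) @ A.T

    rng = np.random.default_rng(0)
    l = np.array([1.8, 0.7, 0.5])                 # hypothetical eigenvalues
    A, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # a random orthogonal matrix
    sigma = matrix_from_eigenstructure(l, A)

    # Sample from the multivariate normal and recover sample PC loadings.
    x = rng.multivariate_normal(np.zeros(3), sigma, size=200)
    r_sample = np.corrcoef(x, rowvar=False)   # sample correlation matrix
    evals, evecs = np.linalg.eigh(r_sample)   # sample PCA (eigh: ascending order)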

It might be expected that, if the underlying structure is simple, then sampling variation is more likely to take sample PCs away from simplicity than to enhance it. It is of interest to investigate whether the techniques of RPCA and SCoTLASS, which increase simplicity compared to the sample PCs, will do so in the direction of the true underlying structure. The closeness of a vector of loadings from any of these techniques to the underlying true vector is measured by the angle between the two vectors of interest. These angles are given in Tables 9-11 for single simulated datasets from three different types of six-dimensional structure; they typify what we found in other simulations. Three values of t (apart from that for PCA) are shown in the tables. Their exact values are unimportant, and are slightly different in different tables; they are chosen to illustrate typical behavior in our simulations.
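Since each loading vector is defined only up to sign, the natural measure is the acute angle between the two vectors; the exact convention used in the tables is an assumption here. A minimal Python sketch (numpy assumed):

    import numpy as np

    def angle_degrees(a, b):
        """Acute angle (in degrees) between loading vectors a and b,
        ignoring the arbitrary sign of each vector."""
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        cos = abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return float(np.degrees(np.arccos(np.clip(cos, 0.0, 1.0))))

    # e.g. angle_degrees(true_eigenvector, estimated_loading_vector)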

The results illustrate that, for each structure, RPCA is perhaps surprisingly, and certainly disappointingly, bad at recovering the underlying structure. SCoTLASS, on the other hand,

Table 8. Specified Eigenvectors of a Six-Dimensional Uniform Structure

                              Eigenvector
Variable     (1)      (2)      (3)      (4)      (5)      (6)
x1         -0.455    0.336   -0.087    0.741   -0.328    0.125
x2         -0.439    0.370   -0.212   -0.630   -0.445   -0.175
x3         -0.415    0.422    0.378   -0.110    0.697    0.099
x4          0.434    0.458    0.040   -0.136   -0.167    0.744
x5          0.301    0.435   -0.697    0.114    0.356   -0.306
x6          0.385    0.416    0.563    0.104   -0.234   -0.545

Variance    1.841    1.709    0.801    0.649    0.520    0.480


Table 9. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS With Various Values of t, for a Specified "Block" Structure of Correlation Eigenvectors

                            Vector
Technique      t       (1)    (2)    (3)    (4)
PCA           √6      12.9   12.0   15.4   79.5
RPCA                  37.7   45.2   45.3   83.1
SCoTLASS      2.00    12.9   12.0   15.4   78.6
              1.82    11.9   11.4   13.8   73.2
              1.75     9.4   10.0   12.5   85.2

is capable of improvement over PCA. For example, for t = 1.75 it not only improves over PCA in terms of angles in Table 9, but it also has 3, 3, and 2 zero loadings in its first three components, thus giving a notably simpler structure. None of the methods manages to reproduce the underlying structure for component 4 in Table 9.

The results for intermediate structure in Table 10 are qualitatively similar to those in Table 9, except that SCoTLASS does best for higher values of t than in Table 9. For uniform structure (Table 11), SCoTLASS does badly compared to PCA for all values of t. This is not unexpected: although uniform structure is simple in its own way, it is not the type of simplicity which SCoTLASS aims for. It is also the case that the varimax criterion is designed so that it stands little chance of finding uniform structure. Other rotation criteria, such as quartimax, can in theory find uniform vectors of loadings, but they were tried and also found to be unsuccessful in our simulations. It is probable that a uniform structure is more likely to be found by the techniques proposed by Hausman (1982) or Vines (2000). Although SCoTLASS will usually fail to find such structures, their existence may be indicated by a large drop in the variance explained by SCoTLASS as decreasing values of t move it away from PCA.

6 DISCUSSION

A new technique, SCoTLASS, has been introduced for discovering and interpreting the major sources of variability in a dataset. We have illustrated its usefulness in an example, and have also shown through simulations that it is capable of recovering certain types of

Table 10. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS With Various Values of t, for a Specified "Intermediate" Structure of Correlation Eigenvectors

                            Vector
Technique      t       (1)    (2)    (3)    (4)
PCA           √6      13.7   15.1   23.6   78.3
RPCA                  53.3   42.9   66.4   77.7
SCoTLASS      2.28     5.5   10.4   23.7   78.3
              2.12     9.5   12.0   23.5   77.8
              2.01    17.5   19.1   22.5   71.7


Table 11. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS With Various Values of t, for a Specified "Uniform" Structure of Correlation Eigenvectors

                            Vector
Technique      t       (1)    (2)    (3)    (4)
PCA           √6       6.0    6.4   23.2   31.8
RPCA                  52.0   54.1   39.4   43.5
SCoTLASS      2.15    24.3   24.6   23.4   30.7
              1.91    34.5   34.6   21.8   23.5
              1.73    41.0   41.3   22.5   19.0

underlying structure. In many respects it is preferable to rotated principal components as a means of simplifying the interpretation of a principal component analysis. Although we are convinced of the value of SCoTLASS, there are a number of complications and open questions, which are now listed as a set of remarks.

Remark 1. In this article we have carried out the techniques studied on correlation matrices. Although it is less common in practice, PCA and RPCA can also be implemented on covariance matrices, in which case PCA successively finds uncorrelated linear functions of the original unstandardized variables. SCoTLASS can also be implemented in this case, the only difference being that the sample correlation matrix R is replaced by the sample covariance matrix S in equation (3.1). We have investigated covariance-based SCoTLASS, both for real examples and using simulation studies. Some details of its performance are different from the correlation-based case, but qualitatively they are similar; in particular, there are a number of reasons to prefer SCoTLASS to RPCA.

Remark 2. In PCA, the constraint a_h' a_k = 0 (orthogonality of vectors of loadings) is equivalent to a_h' R a_k = 0 (different components are uncorrelated). This equivalence is special to the PCs, and is a consequence of the a_k being eigenvectors of R. When we rotate the PC loadings, we lose at least one of these two properties (Jolliffe 1995). Similarly, in SCoTLASS, if we impose a_h' a_k = 0 we no longer have uncorrelated components. For example, Table 12 gives the correlations between the six SCoTLASS components when t = 1.75 for the pitprop data. Although most of the correlations in Table 12 are small in absolute value, there are also nontrivial ones (r12, r14, r34, and r35).

Table 12. Correlation Matrix for the First Six SCoTLASS Components for t = 1.75, Using Jeffers' Pitprop Data

Component     (1)      (2)      (3)      (4)      (5)      (6)
(1)         1.000    0.375   -0.234    0.443   -0.010    0.061
(2)                  1.000   -0.114   -0.076   -0.145   -0.084
(3)                           1.000   -0.438    0.445   -0.187
(4)                                    1.000   -0.105    0.141
(5)                                             1.000   -0.013
(6)                                                      1.000

It is possible to replace the constraint a_h' a_k = 0 in SCoTLASS by a_h' R a_k = 0, thus choosing to have uncorrelated components rather than orthogonal loadings, but this option is not explored in the present article.

Remark 3. The choice of t is clearly important in SCoTLASS. As t decreases, simplicity increases but variation explained decreases, and we need to achieve a suitable tradeoff between these two properties. The correlation between components noted in the previous remark is another aspect of any tradeoff: correlations are small for t close to √p, but have the potential to increase as t decreases. It might be possible to construct a criterion which defines the "best tradeoff," but there is no unique construction, because of the difficulty of deciding how to measure simplicity and how to combine variance, simplicity, and correlation. At present it seems best to compute the SCoTLASS components for several values of t and judge subjectively at what point a balance between these various aspects is achieved. In our example we used the same value of t for all components in a data set, but varying t for different components is another possibility.

Remark 4. Our algorithms for SCoTLASS are slower than those for PCA. This is because SCoTLASS is implemented subject to an extra restriction on PCA, and we lose the advantage of calculation via the singular value decomposition, which makes the PCA algorithm fast. Sequential-based PCA with an extra constraint requires a good optimizer to produce a global optimum. In the implementation of SCoTLASS a projected gradient method is used, which is globally convergent and preserves accurately both the equality and inequality constraints. It should be noted that, as t is reduced from √p downwards towards unity, the CPU time taken to optimize the objective function remains generally the same (11 sec on average on a 1GHz PC), but as t decreases the algorithm becomes progressively prone to hitting local minima, and thus more (random) starts are required to find a global optimum. Osborne, Presnell, and Turlach (2000) gave an efficient procedure, based on convex programming and a dual problem, for implementing the LASSO in the regression context. Whether or not this approach can be usefully adapted to SCoTLASS will be investigated in further research. Although we are reasonably confident that our algorithm has found global optima in the example of Section 4 and in the simulations, there is no guarantee. The jumps that occur in some coefficients, such as the change in a36 from -0.186 to 0.586 as t decreases from 1.75 to 1.50 in the pitprop data, could be due to one of the solutions being a local optimum. However, it seems more likely to us that it is caused by the change in the nature of the earlier components, which, together with the orthogonality constraint imposed on the third component, opens up a different range of possibilities for the latter component. There is clearly much scope for further work on the implementation of SCoTLASS.
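The projected gradient algorithm itself is not reproduced here, but the role of multiple random starts can be illustrated with a generic constrained optimizer. The sketch below is an assumption-laden substitute, not the method used in the article: it hands the nonsmooth LASSO constraint directly to scipy's SLSQP routine rather than smoothing it, and simply keeps the best of several random starts.

    import numpy as np
    from scipy.optimize import minimize

    def scotlass_component(R, t, prev=(), n_starts=20, seed=0):
        """Maximize a'Ra subject to a'a = 1, sum_j |a_j| <= t, and
        orthogonality to earlier loading vectors in `prev`.
        A multi-start SQP sketch, not the article's projected gradient method."""
        p = R.shape[0]
        rng = np.random.default_rng(seed)
        cons = [{'type': 'eq',   'fun': lambda a: a @ a - 1.0},
                {'type': 'ineq', 'fun': lambda a: t - np.abs(a).sum()}]
        for a_h in prev:  # impose a_h' a = 0 for each earlier component
            cons.append({'type': 'eq', 'fun': lambda a, a_h=a_h: a @ a_h})
        best = None
        for _ in range(n_starts):
            a0 = rng.normal(size=p)
            a0 /= np.linalg.norm(a0)
            res = minimize(lambda a: -(a @ R @ a), a0, method='SLSQP',
                           constraints=cons)
            if res.success and (best is None or res.fun < best.fun):
                best = res
        return None if best is None else best.x

Sequential components are obtained by calling this repeatedly, extending prev with each new vector. Consistent with the behavior described above, smaller values of t typically need more starts before the best local optimum stops improving.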

Remark 5. In a number of examples not reported here, several of the nonzero loadings in the SCoTLASS components are exactly equal, especially for large values of p and small values of t. At present we have no explanation for this, but it deserves further investigation.


Remark 6. The way in which SCoTLASS sets some coefficients to zero is different in concept from simply truncating to zero the smallest coefficients in a PC. The latter attempts to approximate the PC by a simpler version, and can have problems with that approximation, as shown by Cadima and Jolliffe (1995). SCoTLASS looks for simple sources of variation and, like PCA, aims for high variance, but because of simplicity considerations the simplified components can in theory be moderately different from the PCs. We seek to replace PCs rather than approximate them, although, because of the shared aim of high variance, the results will often not be too different.

Remark 7. There are a number of other recent developments which are relevant to interpretation problems in multivariate statistics; Jolliffe (2002), Chapter 11, reviews these in the context of PCA. Vines's (2000) use of only a discrete number of values for loadings has already been mentioned. It works well in some examples, but for the pitprop data the components 4-6 are rather complex. A number of aspects of the strategy for selecting a subset of variables were explored by Cadima and Jolliffe (1995, 2001) and by Tanaka and Mori (1997). The LASSO has been generalized in the regression context to so-called bridge estimation, in which the constraint sum_{j=1}^p |β_j| <= t is replaced by sum_{j=1}^p |β_j|^γ <= t, where γ is not necessarily equal to unity; see, for example, Fu (1998). Tibshirani (1996) also mentioned the nonnegative garotte, due to Breiman (1995), as an alternative approach. Translation of the ideas of the bridge and nonnegative garotte to the context of PCA, and comparison with other techniques, would be of interest in future research.

ACKNOWLEDGMENTS

An earlier draft of this article was prepared while the first author was visiting the Bureau of Meteorology Research Centre (BMRC), Melbourne, Australia. He is grateful to BMRC for the support and facilities provided during his visit, and to the Leverhulme Trust for partially supporting the visit under their Study Abroad Fellowship scheme. Comments from two referees and the editor have helped to improve the clarity of the article.

[Received November 2000. Revised July 2002.]

REFERENCES

Breiman, L. (1995), "Better Subset Regression Using the Nonnegative Garotte," Technometrics, 37, 373-384.
Cadima, J., and Jolliffe, I. T. (1995), "Loadings and Correlations in the Interpretation of Principal Components," Journal of Applied Statistics, 22, 203-214.
(2001), "Variable Selection and the Interpretation of Principal Subspaces," Journal of Agricultural, Biological, and Environmental Statistics, 6, 62-79.
Chu, M. T., and Trendafilov, N. T. (2001), "The Orthogonally Constrained Regression Revisited," Journal of Computational and Graphical Statistics, 10, 1-26.
Fu, J. W. (1998), "Penalized Regression: The Bridge Versus the Lasso," Journal of Computational and Graphical Statistics, 7, 397-416.
Goffe, W. L., Ferrier, G. D., and Rogers, J. (1994), "Global Optimizations of Statistical Functions with Simulated Annealing," Journal of Econometrics, 60, 65-99.
Hausman, R. (1982), "Constrained Multivariate Analysis," in Optimization in Statistics, eds. S. H. Zanckis and J. S. Rustagi, Amsterdam: North Holland, pp. 137-151.
Helmke, U., and Moore, J. B. (1994), Optimization and Dynamical Systems, London: Springer.
Jeffers, J. N. R. (1967), "Two Case Studies in the Application of Principal Component Analysis," Applied Statistics, 16, 225-236.
Jolliffe, I. T. (1989), "Rotation of Ill-Defined Principal Components," Applied Statistics, 38, 139-147.
(1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied Statistics, 22, 29-35.
(2002), Principal Component Analysis (2nd ed.), New York: Springer-Verlag.
Jolliffe, I. T., and Uddin, M. (2000), "The Simplified Component Technique - An Alternative to Rotated Principal Components," Journal of Computational and Graphical Statistics, 9, 689-710.
Jolliffe, I. T., Uddin, M., and Vines, S. K. (2002), "Simplified EOFs - Three Alternatives to Rotation," Climate Research, 20, 271-279.
Krzanowski, W. J., and Marriott, F. H. C. (1995), Multivariate Analysis, Part II, London: Arnold.
LeBlanc, M., and Tibshirani, R. (1998), "Monotone Shrinkage of Trees," Journal of Computational and Graphical Statistics, 7, 417-433.
McCabe, G. P. (1984), "Principal Variables," Technometrics, 26, 137-144.
Morton, S. C. (1989), "Interpretable Projection Pursuit," Technical Report 106, Department of Statistics, Stanford University.
Osborne, M. R., Presnell, B., and Turlach, B. A. (2000), "On the LASSO and its Dual," Journal of Computational and Graphical Statistics, 9, 319-337.
Tanaka, Y., and Mori, Y. (1997), "Principal Component Analysis Based on a Subset of Variables: Variable Selection and Sensitivity Analysis," American Journal of Mathematical and Management Sciences, 17, 61-89.
Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267-288.
Uddin, M. (1999), "Interpretation of Results from Simplified Principal Components," PhD thesis, University of Aberdeen, Aberdeen, Scotland.
Vines, S. K. (2000), "Simple Principal Components," Applied Statistics, 49, 441-451.

Page 8: Open Research Onlineoro.open.ac.uk/3949/1/Article_JCGS.pdfDepartmentof Statistics,University ofKarachi, Karachi-75270, Pakistan (E-mail: mudassir2000@hotmail.com). ®c2003 American

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 537

= 1+

2

1 2aa

a

22

t = 1

t = sqrt (p)

Figure 1 The Two-Dimensional SCoTLASS

33 SOME PROPERTIES

SCoTLASS differs from PCA in the inclusion of the constraints de ned in (33) so adecision must be made on the value of the tuning parameter t It is easy to see that

(a) for t para pp we get PCA

(b) for t lt 1 there is no solution and(c) for t = 1 we must have exactly one nonzero akj for each k

As t decreases fromp

p we move progressivelyaway from PCA and eventuallyto a solutionwhere only one variable has a nonzero loading on each component All other variables willshrink (not necessary monotonically) with t and ultimately reach zero Examples of thisbehavior are given in the next section

The geometry of SCoTLASS in the case when p = 2 is shown in Figure 1 where weplot the elements a1 a2 of the vectora For PCA in Figure 1 the rst component a

0

1x wherea

0

1 = (a11 a12) corresponds to the point on circumference of the shaded circle (a0

1a1 = 1)

which touches the ldquolargestrdquo possible ellipse a0

1Ra1 = constantFor SCoTLASS with 1 lt t lt

pp we are restricted to the part of the circle a

0

1a1 = 1inside the dotted square 2

j = 1 ja1jj micro tFor t = 1 corresponding to the inner shaded square the optimal (only) solutions are

on the axesFigure 1 shows a special case in which the axes of the ellipses are at 45macr to the a1 a2

axes corresponding to equal variances for x1 x2 (or a correlation matrix) This gives two

538 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

optimal solutions for SCoTLASS in the rst quadrant symmetric about the 45macr line IfPCA or SCoTLASS is done on a covariance rather than correlation matrix (see Remark 1in Section 6) with unequal variances there will be a unique solution in the rst quadrant

34 IMPLEMENTATION

PCA reduces to an easily implemented eigenvalue problem but the extra constraintin SCoTLASS means that it needs numerical optimization to estimate parameters and suf-fers from the problem of many local optima A number of algorithms have been tried toimplement the technique including simulated annealing (Goffe Ferrier and Rogers 1994)but the results reported in the following example were derived using the projected gradientapproach (Chu and Trenda lov 2001 Helmke and Moore 1994) The LASSO inequalityconstraint (33) in the SCoTLASS problem (31)ndash(33) is eliminated by making use of anexterior penalty function Thus the SCoTLASS problem (31)ndash(33) is transformed intoa new maximization problem subject to the equality constraint (32) The solution of thismodi ed maximization problem is then found as an ascent gradient vector ow onto thep-dimensional unit sphere following the standard projected gradient formalism (Chu andTrenda lov 2001 Helmke and Moore 1994) Detailed considerationof this solution will bereported separately

We have implemented SCoTLASS using MATLAB The code requires as input thecorrelation matrix R the value of the tuning parameter t and the number of componentsto be retained (m) The MATLAB code returns a loading matrix and calculates a numberof relevant statistics To achieve an appropriate solution a number of parameters of theprojected gradient method (eg starting points absolute and relative tolerances) also needto be de ned

4 EXAMPLE PITPROPS DATA

Here we revisit the pitprop data from section 2 We have studied many other examplesnot discussed here and in general the results are qualitativelysimilar Table 4 gives loadingsfor SCoTLASS with t = 225 200 175 and 150 Table 5 gives variances cumulativevariances ldquosimplicity factorsrdquo and number of zero loadings for the same values of t aswell as corresponding information for PCA (t =

p13) and RPCA The simplicity factors

are values of the varimax criterion for each componentIt can be seen that as the value of t is decreased the simplicity of the components

increases as measured by the number of zero loadings and by the varimax simplicity factoralthough the increase in the latter is not uniform The increase in simplicity is paid for by aloss of variance retained By t = 225 175 the percentage of variance accounted for by the rst component in SCoTLASS is reduced from 324 for PCA to 267 196 respectivelyAt t = 225 this is still larger than the largest contribution (239) achieved by a singlecomponent in RPCA The comparison with RPCA is less favorable when the variationaccounted for by all six retained components is examined RPCA necessarily retains the

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 539

Table 4 Loadings for SCoTLASS for Four Values of t Based on the Correlation Matrix for JeffersrsquoPitprop Data

Component

Technique Variable (1) (2) (3) (4) (5) (6)

SCoTLASS x1 0558 0085 iexcl0093 iexcl0107 0056 0017(t = 225) x2 0580 0031 iexcl0087 iexcl0147 0073 0047

x3 0000 0647 iexcl0129 0215 iexcl0064 iexcl0101x4 0000 0654 iexcl0000 0211 iexcl0080 0127x5 iexcl0000 0000 0413 iexcl0000 0236 0747x6 0001 0208 0529 iexcl0022 iexcl0108 0033x7 0266 iexcl0000 0385 0000 iexcl0121 0020x8 0104 iexcl0098 0000 0584 0127 iexcl0188x9 0372 iexcl0000 iexcl0000 0019 0142 iexcl0060x10 0364 iexcl0154 0000 0212 iexcl0296 0000x11 iexcl0000 0099 iexcl0000 0000 0879 iexcl0156x12 iexcl0000 0241 iexcl0001 iexcl0699 iexcl0044 iexcl0186x13 iexcl0000 0026 iexcl0608 iexcl0026 iexcl0016 0561

SCoTLASS x1 0623 0041 iexcl0049 0040 0051 iexcl0000(t = 200) x2 0647 0076 iexcl0001 0072 0059 iexcl0007

x3 0000 0000 iexcl0684 iexcl0128 iexcl0054 0106x4 0000 iexcl0001 iexcl0670 iexcl0163 iexcl0063 iexcl0100x5 iexcl0000 iexcl0267 0000 0000 0228 iexcl0772x6 0000 iexcl0706 iexcl0044 0011 iexcl0003 0001x7 0137 iexcl0539 0001 iexcl0000 iexcl0096 0001x8 0001 iexcl0000 0053 iexcl0767 0095 0098x9 0332 0000 0000 iexcl0001 0065 0014x10 0254 iexcl0000 0124 iexcl0277 iexcl0309 iexcl0000x11 0000 0000 iexcl0065 0000 0902 0194x12 iexcl0000 0000 iexcl0224 0533 iexcl0069 0137x13 0000 0364 iexcl0079 0000 0000 iexcl0562

SCoTLASS x1 0664 iexcl0000 0000 iexcl0025 0002 iexcl0035(t = 175) x2 0683 iexcl0001 0000 iexcl0040 0001 iexcl0018

x3 0000 0641 0195 0000 0180 iexcl0030x4 0000 0701 0001 0000 iexcl0000 iexcl0001x5 iexcl0000 0000 iexcl0000 0000 iexcl0887 iexcl0056x6 0000 0293 iexcl0186 0000 iexcl0373 0044x7 0001 0107 iexcl0658 iexcl0000 iexcl0051 0064x8 0001 iexcl0000 iexcl0000 0735 0021 iexcl0168x9 0283 iexcl0000 iexcl0000 0000 iexcl0000 iexcl0001x10 0113 iexcl0000 iexcl0001 0388 iexcl0017 0320x11 0000 0000 0000 iexcl0000 iexcl0000 iexcl0923x12 iexcl0000 0001 0000 iexcl0554 0016 0004x13 0000 iexcl0000 0703 0001 iexcl0197 0080

SCoTLASS x1 0701 iexcl0000 iexcl0001 iexcl0001 0001 iexcl0000(t = 150) x2 0709 iexcl0001 iexcl0001 iexcl0001 0001 iexcl0000

x3 0001 0698 iexcl0068 0002 iexcl0001 0001x4 0001 0712 iexcl0001 0002 iexcl0001 iexcl0001x5 iexcl0000 0000 0001 iexcl0001 0093 iexcl0757x6 0000 0081 0586 iexcl0031 0000 0000x7 0001 0000 0807 iexcl0001 0000 0002x8 0001 iexcl0000 0001 0044 iexcl0513 iexcl0001x9 0079 0000 0001 0001 iexcl0001 iexcl0000x10 0002 iexcl0000 0027 0660 iexcl0000 iexcl0002x11 0000 0001 iexcl0000 iexcl0749 iexcl0032 iexcl0000x12 iexcl0000 0001 iexcl0000 iexcl0001 0853 0083x13 0000 0000 iexcl0001 0000 iexcl0001 0648

540 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Table 5. Simplicity Factor, Variance, Cumulative Variance, and Number of Zero Loadings for Individual Components in PCA, RPCA, and SCoTLASS for Four Values of t, Based on the Correlation Matrix for Jeffers' Pitprop Data

                                                        Component
Technique            Measure                        (1)     (2)     (3)     (4)     (5)     (6)

PCA                  Simplicity factor (varimax)   0.059   0.103   0.082   0.397   0.086   0.266
(= SCoTLASS with     Variance (%)                   32.4    18.2    14.4     8.9     7.0     6.3
 t = √13)            Cumulative variance (%)        32.4    50.7    65.1    74.0    80.9    87.2

RPCA                 Simplicity factor (varimax)   0.362   0.428   0.199   0.595   0.131   0.343
                     Variance (%)                   13.0    14.6    18.4     9.7    23.9     7.6
                     Cumulative variance (%)        13.0    27.6    46.0    55.7    79.6    87.2

SCoTLASS             Simplicity factor (varimax)   0.190   0.312   0.205   0.308   0.577   0.364
(t = 2.25)           Variance (%)                   26.7    17.2    15.9     9.7     8.9     6.7
                     Cumulative variance (%)        26.7    43.9    59.8    69.4    78.4    85.0
                     Number of zero loadings           6       3       5       3       0       1

SCoTLASS             Simplicity factor (varimax)   0.288   0.301   0.375   0.387   0.646   0.412
(t = 2.00)           Variance (%)                   23.1    16.4    16.2    11.2     8.9     6.5
                     Cumulative variance (%)        23.1    39.5    55.8    67.0    75.9    82.3
                     Number of zero loadings           7       6       2       4       1       2

SCoTLASS             Simplicity factor (varimax)   0.370   0.370   0.388   0.360   0.610   0.714
(t = 1.75)           Variance (%)                   19.6    16.0    13.2    13.0     9.2     9.1
                     Cumulative variance (%)        19.6    35.6    48.7    61.8    71.0    80.1
                     Number of zero loadings           7       7       7       7       3       0

SCoTLASS             Simplicity factor (varimax)   0.452   0.452   0.504   0.464   0.565   0.464
(t = 1.50)           Variance (%)                   16.1    14.9    13.8    10.2     9.9     9.6
                     Cumulative variance (%)        16.1    31.0    44.9    55.1    65.0    74.5
                     Number of zero loadings           5       7       2       1       3       5

same total percentage variation (87.2%) as PCA, but SCoTLASS drops to 85.0% and 80.1% for t = 2.25, 1.75, respectively. Against this loss, SCoTLASS has the considerable advantage of retaining the successive maximisation property. At t = 2.25, apart from the switching of components 4 and 5, the SCoTLASS components are nicely simplified versions of the PCs, rather than being something different as in RPCA. A linked advantage is that if we decide to look only at five components, then the SCoTLASS components will simply be the first five in Table 4, whereas RPCs based on (m - 1) retained components are not necessarily similar to those based on m.

A further "plus" for SCoTLASS is the presence of zero loadings, which aids interpretation, but even where there are few zeros the components are simplified compared to PCA. Consider specifically the interpretation of the second component, which we noted earlier was "messy" for PCA. For t = 1.75 this component is now interpreted as measuring mainly moisture content and specific gravity, with small contributions from numbers of annual rings, and all other variables negligible. This gain in interpretability is achieved by reducing the percentage of total variance accounted for from 18.2% to 16.0%. For other components, too, interpretation is made easier because in the majority of cases the contribution of a variable is clearcut: either it is important or it is not, with few equivocal contributions.


Table 6. Specified Eigenvectors of a Six-Dimensional Block Structure

                                Eigenvectors
Variable     (1)      (2)      (3)      (4)      (5)      (6)
x1          0.096   -0.537    0.759   -0.120    0.335   -0.021
x2          0.082   -0.565   -0.599    0.231    0.511   -0.013
x3          0.080   -0.608   -0.119   -0.119   -0.771    0.016
x4          0.594    0.085   -0.074   -0.308    0.069    0.731
x5          0.584    0.096   -0.114   -0.418    0.052   -0.678
x6          0.533    0.074    0.180    0.805   -0.157   -0.069

Variance   1.8367    1.640    0.751    0.659    0.607    0.506

For t = 2.25, 2.00, 1.75, 1.50 the numbers of zeros are, respectively, 18, 22, 31, and 23. It seems surprising that we obtain fewer zeros with t = 1.50 than with t = 1.75; that is, the solution with t = 1.75 appears to be simpler than the one with t = 1.50. In fact this impression is misleading (see also the next paragraph). The explanation of this anomaly lies in the projected gradient method used for numerical solution of the problem, which approximates the LASSO constraint with a certain smooth function, so the zero loadings produced may also be approximate. One can see that the solution with t = 1.50 contains a total of 56 loadings with magnitude less than 0.005, compared with 42 in the case t = 1.75.

Another interesting comparison is in terms of average varimax simplicity over the first six components. This is 0.343 for RPCA, compared to 0.165 for PCA. For t = 2.25, 2.00, 1.75, 1.50 the average simplicity is 0.326, 0.402, 0.469, 0.487, respectively. This demonstrates that, although the varimax criterion is not an explicit part of SCoTLASS, by taking t small enough we can do better than RPCA with respect to its own criterion. This is achieved by moving outside the space spanned by the retained PCs, and hence settling for a smaller amount of overall variation retained.
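The simplicity factors reported in Table 5 are values of the varimax criterion evaluated for each component separately. As a minimal sketch in Python/NumPy (the scaling below is our reading of the table: for unit-norm loading vectors it reproduces tabulated values such as 0.452 and 0.714; the function names and the 0.0005 threshold are illustrative, not from the article), the simplicity factor and the zero-loading counts can be computed as:

    import numpy as np

    def varimax_simplicity(a):
        # Rescaled variance of the squared loadings of one component:
        # 0 when all |loadings| of a unit-norm vector are equal,
        # 1 when a single loading is nonzero.
        a = np.asarray(a, dtype=float)
        p = a.size
        return (p * np.sum(a**4) - np.sum(a**2)**2) / (p - 1)

    def count_zero_loadings(A, tol=5e-4):
        # Per-column count of loadings that would print as 0.000 in Table 4.
        return (np.abs(np.asarray(A)) < tol).sum(axis=0)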

5 SIMULATION STUDIES

One question of interest is whether SCoTLASS is better at detecting underlying simple structure in a data set than is PCA or RPCA. To investigate this question we simulated data from a variety of known structures. Because of space constraints, only a small part of the results is summarized here; further details can be found in Uddin (1999).

Given a vector l of positive real numbers and an orthogonal matrix A, we can attempt to find a covariance matrix or correlation matrix whose eigenvalues are the elements of l and whose eigenvectors are the columns of A. Some restrictions need to be imposed on l and A, especially in the case of correlation matrices, but it is possible to find such matrices for a wide range of eigenvector structures. Having obtained a covariance or correlation matrix, it is straightforward to generate samples of data from multivariate normal distributions with the given covariance or correlation matrix. We have done this for a wide variety of eigenvector structures (principal component loadings) and computed the PCs, RPCs, and SCoTLASS components from the resulting sample correlation matrices.


Table 7. Specified Eigenvectors of a Six-Dimensional Intermediate Structure

                                Eigenvectors
Variable     (1)      (2)      (3)      (4)      (5)      (6)
x1          0.224   -0.509    0.604    0.297   -0.327    0.361
x2          0.253   -0.519   -0.361   -0.644   -0.341   -0.064
x3          0.227   -0.553   -0.246    0.377    0.608   -0.267
x4          0.553    0.249   -0.249   -0.052    0.262    0.706
x5          0.521    0.254   -0.258    0.451   -0.509   -0.367
x6          0.507    0.199    0.561   -0.384    0.281   -0.402

Variance    1.795    1.674    0.796    0.618    0.608    0.510

Various structures have been investigated, which we call block structure, intermediate structure, and uniform structure. Tables 6-8 give one example of each type of structure. The structure in Table 6 has blocks of nontrivial loadings and blocks of near-zero loadings in each underlying component; Table 8 has a structure in which all loadings in the first two components have similar absolute values; and the structure in Table 7 is intermediate to those of Tables 6 and 8. An alternative approach to the simulation study would be to replace the near-zero loadings by exact zeros and the nearly equal loadings by exact equalities. However, we feel that in reality underlying structures are never quite that simple, so we perturbed them a little.
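The construction just described is easy to sketch in code. The following Python/NumPy fragment is a minimal illustration, not the authors' implementation: the orthogonal matrix is random here, the eigenvalues are those of Table 6, and building a valid correlation matrix would additionally require the restrictions on l and A mentioned above, so the sketch works with a covariance matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    l = np.array([1.8367, 1.640, 0.751, 0.659, 0.607, 0.506])  # eigenvalues (Table 6)
    A, _ = np.linalg.qr(rng.standard_normal((6, 6)))           # an orthogonal matrix

    sigma = A @ np.diag(l) @ A.T          # covariance with eigenpairs (l, columns of A)
    X = rng.multivariate_normal(np.zeros(6), sigma, size=100)  # multivariate normal sample
    R = np.corrcoef(X, rowvar=False)      # sample correlation matrix

    evals, evecs = np.linalg.eigh(R)      # sample PCs: eigenvectors of R,
    sample_pcs = evecs[:, np.argsort(evals)[::-1]]  # in decreasing eigenvalue order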

It might be expected that if the underlying structure is simple, then sampling variation is more likely to take sample PCs away from simplicity than to enhance this simplicity. It is of interest to investigate whether the techniques of RPCA and SCoTLASS, which increase simplicity compared to the sample PCs, will do so in the direction of the true underlying structure. The closeness of a vector of loadings from any of these techniques to the underlying true vector is measured by the angle between the two vectors of interest. These angles are given in Tables 9-11 for single simulated datasets from three different types of six-dimensional structure; they typify what we found in other simulations. Three values of t (apart from that for PCA) are shown in the tables. Their exact values are unimportant and are slightly different in different tables; they are chosen to illustrate typical behavior in our simulations.
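A small helper makes the angle computation concrete. This is a sketch; the use of the absolute inner product reflects our assumption that loading vectors are treated as identical up to sign, as is usual for eigenvectors:

    import numpy as np

    def angle_degrees(a, b):
        # Angle between two loading vectors, treating a and -a as equivalent.
        c = abs(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))
        return float(np.degrees(np.arccos(np.clip(c, 0.0, 1.0))))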

The results illustrate that, for each structure, RPCA is perhaps surprisingly, and certainly disappointingly, bad at recovering the underlying structure.

Table 8. Specified Eigenvectors of a Six-Dimensional Uniform Structure

                                Eigenvectors
Variable     (1)      (2)      (3)      (4)      (5)      (6)
x1         -0.455    0.336   -0.087    0.741   -0.328    0.125
x2         -0.439    0.370   -0.212   -0.630   -0.445   -0.175
x3         -0.415    0.422    0.378   -0.110    0.697    0.099
x4          0.434    0.458    0.040   -0.136   -0.167    0.744
x5          0.301    0.435   -0.697    0.114    0.356   -0.306
x6          0.385    0.416    0.563    0.104   -0.234   -0.545

Variance    1.841    1.709    0.801    0.649    0.520    0.480


Table 9. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS With Various Values of t, for a Specified "Block" Structure of Correlation Eigenvectors

                             Vectors
Technique         t       (1)     (2)     (3)     (4)
PCA              √6      12.9    12.0    15.4    79.5
RPCA                     37.7    45.2    45.3    83.1
SCoTLASS      t = 2.00   12.9    12.0    15.4    78.6
              t = 1.82   11.9    11.4    13.8    73.2
              t = 1.75    9.4    10.0    12.5    85.2

SCoTLASS, on the other hand, is capable of improvement over PCA. For example, for t = 1.75 it not only improves over PCA in terms of the angles in Table 9, but it also has 3, 3, and 2 zero loadings in its first three components, thus giving a notably simpler structure. None of the methods manages to reproduce the underlying structure for component 4 in Table 9.

The results for intermediate structure in Table 10 are qualitatively similar to those in Table 9, except that SCoTLASS does best for higher values of t than in Table 9. For uniform structure (Table 11), SCoTLASS does badly compared to PCA for all values of t. This is not unexpected because, although uniform structure is simple in its own way, it is not the type of simplicity which SCoTLASS aims for. It is also the case that the varimax criterion is designed so that it stands little chance of finding uniform structure. Other rotation criteria, such as quartimax, can in theory find uniform vectors of loadings, but they were tried and also found to be unsuccessful in our simulations. It is probable that a uniform structure is more likely to be found by the techniques proposed by Hausman (1982) or Vines (2000). Although SCoTLASS will usually fail to find such structures, their existence may be indicated by a large drop in the variance explained by SCoTLASS as decreasing values of t move it away from PCA.

6 DISCUSSION

A new technique, SCoTLASS, has been introduced for discovering and interpreting the major sources of variability in a dataset. We have illustrated its usefulness in an example, and have also shown through simulations that it is capable of recovering certain types of underlying structure.

Table 10. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS With Various Values of t, for a Specified "Intermediate" Structure of Correlation Eigenvectors

                             Vectors
Technique         t       (1)     (2)     (3)     (4)
PCA              √6      13.7    15.1    23.6    78.3
RPCA                     53.3    42.9    66.4    77.7
SCoTLASS      t = 2.28    5.5    10.4    23.7    78.3
              t = 2.12    9.5    12.0    23.5    77.8
              t = 2.01   17.5    19.1    22.5    71.7


Table 11. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS With Various Values of t, for a Specified "Uniform" Structure of Correlation Eigenvectors

                             Vectors
Technique         t       (1)     (2)     (3)     (4)
PCA              √6       6.0     6.4    23.2    31.8
RPCA                     52.0    54.1    39.4    43.5
SCoTLASS      t = 2.15   24.3    24.6    23.4    30.7
              t = 1.91   34.5    34.6    21.8    23.5
              t = 1.73   41.0    41.3    22.5    19.0

It is preferred in many respects to rotated principal components as a means of simplifying interpretation, compared to principal component analysis. Although we are convinced of the value of SCoTLASS, there are a number of complications and open questions, which are now listed as a set of remarks.

Remark 1. In this article we have carried out the techniques studied on correlation matrices. Although it is less common in practice, PCA and RPCA can also be implemented on covariance matrices. In this case PCA successively finds uncorrelated linear functions of the original unstandardized variables. SCoTLASS can also be implemented in this case, the only difference being that the sample correlation matrix R is replaced by the sample covariance matrix S in equation (3.1). We have investigated covariance-based SCoTLASS, both for real examples and using simulation studies. Some details of its performance are different from the correlation-based case, but qualitatively they are similar. In particular, there are a number of reasons to prefer SCoTLASS to RPCA.
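For reference, the SCoTLASS problem (3.1)-(3.3) referred to here successively maximizes component variance under the usual PCA constraints plus the LASSO bound; restated in LaTeX form:

    \max_{a_k}\; a_k' R\, a_k
    \quad\text{subject to}\quad
    a_k' a_k = 1, \qquad
    a_h' a_k = 0 \;\;(h < k), \qquad
    \sum_{j=1}^{p} |a_{kj}| \le t,

with R replaced by S for the covariance-based version, as the remark notes.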

Remark 2. In PCA the constraint a'_h a_k = 0 (orthogonality of vectors of loadings) is equivalent to a'_h R a_k = 0 (different components are uncorrelated). This equivalence is special to the PCs, and is a consequence of the a_k being eigenvectors of R. When we rotate the PC loadings, we lose at least one of these two properties (Jolliffe 1995). Similarly, in SCoTLASS, if we impose a'_h a_k = 0 we no longer have uncorrelated components. For example, Table 12 gives the correlations between the six SCoTLASS components when t = 1.75 for the pitprop data. Although most of the correlations in Table 12 are small in absolute value, there are also nontrivial ones (r12, r14, r34, and r35).

Table 12. Correlation Matrix for the First Six SCoTLASS Components for t = 1.75, Using Jeffers' Pitprop Data

                              Correlation matrix
Components     (1)      (2)      (3)      (4)      (5)      (6)
(1)           1.000    0.375   -0.234    0.443   -0.010    0.061
(2)                    1.000   -0.114   -0.076   -0.145   -0.084
(3)                             1.000   -0.438    0.445   -0.187
(4)                                      1.000   -0.105    0.141
(5)                                               1.000   -0.013
(6)                                                        1.000


It is possible to replace the constraint a'_h a_k = 0 in SCoTLASS by a'_h R a_k = 0, thus choosing to have uncorrelated components rather than orthogonal loadings, but this option is not explored in the present article.
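For completeness, the correlations in Table 12 follow directly from the loadings and R: the covariance between components z_h and z_k is a'_h R a_k. A minimal sketch (names are illustrative; A holds the loading vectors as columns):

    import numpy as np

    def component_correlations(A, R):
        # Correlations between components z_k = X a_k; with orthogonal
        # loadings that are not eigenvectors of R these are generally
        # nonzero, which is what Table 12 reports.
        C = A.T @ R @ A                 # component covariance matrix
        d = np.sqrt(np.diag(C))
        return C / np.outer(d, d)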

Remark 3. The choice of t is clearly important in SCoTLASS. As t decreases, simplicity increases but variation explained decreases, and we need to achieve a suitable tradeoff between these two properties. The correlation between components noted in the previous remark is another aspect of any tradeoff: correlations are small for t close to √p, but have the potential to increase as t decreases. It might be possible to construct a criterion which defines the "best tradeoff," but there is no unique construction, because of the difficulty of deciding how to measure simplicity and how to combine variance, simplicity, and correlation. At present it seems best to compute the SCoTLASS components for several values of t and judge subjectively at what point a balance between these various aspects is achieved. In our example we used the same value of t for all components in a data set, but varying t for different components is another possibility.

Remark 4. Our algorithms for SCoTLASS are slower than those for PCA. This is because SCoTLASS is implemented subject to an extra restriction on PCA, and we lose the advantage of calculation via the singular value decomposition, which makes the PCA algorithm fast. Sequential-based PCA with an extra constraint requires a good optimizer to produce a global optimum. In the implementation of SCoTLASS a projected gradient method is used, which is globally convergent and preserves accurately both the equality and inequality constraints. It should be noted that, as t is reduced from √p downwards towards unity, the CPU time taken to optimize the objective function remains generally the same (11 sec on average on a 1 GHz PC), but as t decreases the algorithm becomes progressively more prone to hitting local minima, and thus more (random) starts are required to find a global optimum. Osborne, Presnell, and Turlach (2000) gave an efficient procedure, based on convex programming and a dual problem, for implementing the LASSO in the regression context. Whether or not this approach can be usefully adapted to SCoTLASS will be investigated in further research. Although we are reasonably confident that our algorithm has found global optima in the example of Section 4 and in the simulations, there is no guarantee. The jumps that occur in some coefficients, such as the change in a36 from -0.186 to 0.586 as t decreases from 1.75 to 1.50 in the pitprop data, could be due to one of the solutions being a local optimum. However, it seems more likely to us that this is caused by the change in the nature of the earlier components which, together with the orthogonality constraint imposed on the third component, opens up a different range of possibilities for the latter component. There is clearly much scope for further work on the implementation of SCoTLASS.
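The multi-start strategy just described is straightforward to wrap around any single-start solver. In the sketch below, solve_one is a hypothetical function (standing in for one projected-gradient run, which is not reproduced here) that returns a loading vector and its objective value from a given starting point:

    import numpy as np

    def best_of_random_starts(solve_one, R, t, n_starts=25, seed=0):
        # Keep the best solution found over several random starts
        # on the unit sphere.
        rng = np.random.default_rng(seed)
        best_a, best_obj = None, -np.inf
        for _ in range(n_starts):
            a0 = rng.standard_normal(R.shape[0])
            a0 /= np.linalg.norm(a0)
            a, obj = solve_one(R, t, a0)
            if obj > best_obj:
                best_a, best_obj = a, obj
        return best_a, best_obj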

Remark 5. In a number of examples not reported here, several of the nonzero loadings in the SCoTLASS components are exactly equal, especially for large values of p and small values of t. At present we have no explanation for this, but it deserves further investigation.


Remark 6. The way in which SCoTLASS sets some coefficients to zero is different in concept from simply truncating to zero the smallest coefficients in a PC. The latter attempts to approximate the PC by a simpler version, and can have problems with that approximation, as shown by Cadima and Jolliffe (1995). SCoTLASS looks for simple sources of variation and, like PCA, aims for high variance, but because of simplicity considerations the simplified components can in theory be moderately different from the PCs. We seek to replace PCs rather than approximate them, although because of the shared aim of high variance the results will often not be too different.

Remark 7. There are a number of other recent developments which are relevant to interpretation problems in multivariate statistics; Jolliffe (2002), Chapter 11, reviews these in the context of PCA. Vines's (2000) use of only a discrete number of values for loadings has already been mentioned. It works well in some examples, but for the pitprop data the components 4-6 are rather complex. A number of aspects of the strategy for selecting a subset of variables were explored by Cadima and Jolliffe (1995, 2001) and by Tanaka and Mori (1997). The LASSO has been generalized in the regression context to so-called bridge estimation, in which the constraint Σ_{j=1}^p |β_j| ≤ t is replaced by Σ_{j=1}^p |β_j|^γ ≤ t, where γ is not necessarily equal to unity; see, for example, Fu (1998). Tibshirani (1996) also mentioned the nonnegative garotte, due to Breiman (1995), as an alternative approach. Translation of the ideas of the bridge and nonnegative garotte to the context of PCA, and comparison with other techniques, would be of interest in future research.
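For concreteness, the bridge analogue of the SCoTLASS constraint would presumably take the form below (our extrapolation of the suggestion above, not a result from this article):

    \sum_{j=1}^{p} |a_{kj}| \le t
    \quad\longrightarrow\quad
    \sum_{j=1}^{p} |a_{kj}|^{\gamma} \le t, \qquad \gamma > 0,

which recovers SCoTLASS at γ = 1.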

ACKNOWLEDGMENTS

An earlier draft of this article was prepared while the first author was visiting the Bureau of Meteorology Research Centre (BMRC), Melbourne, Australia. He is grateful to BMRC for the support and facilities provided during his visit, and to the Leverhulme Trust for partially supporting the visit under their Study Abroad Fellowship scheme. Comments from two referees and the editor have helped to improve the clarity of the article.

[Received November 2000. Revised July 2002.]

REFERENCES

Breiman, L. (1995), "Better Subset Regression Using the Nonnegative Garotte," Technometrics, 37, 373-384.

Cadima, J., and Jolliffe, I. T. (1995), "Loadings and Correlations in the Interpretation of Principal Components," Journal of Applied Statistics, 22, 203-214.

Cadima, J., and Jolliffe, I. T. (2001), "Variable Selection and the Interpretation of Principal Subspaces," Journal of Agricultural, Biological, and Environmental Statistics, 6, 62-79.

Chu, M. T., and Trendafilov, N. T. (2001), "The Orthogonally Constrained Regression Revisited," Journal of Computational and Graphical Statistics, 10, 1-26.

Fu, J. W. (1998), "Penalized Regression: The Bridge Versus the Lasso," Journal of Computational and Graphical Statistics, 7, 397-416.

Goffe, W. L., Ferrier, G. D., and Rogers, J. (1994), "Global Optimizations of Statistical Functions with Simulated Annealing," Journal of Econometrics, 60, 65-99.

Hausman, R. (1982), "Constrained Multivariate Analysis," in Optimization in Statistics, eds. S. H. Zanckis and J. S. Rustagi, Amsterdam: North Holland, pp. 137-151.

Helmke, U., and Moore, J. B. (1994), Optimization and Dynamical Systems, London: Springer.

Jeffers, J. N. R. (1967), "Two Case Studies in the Application of Principal Component Analysis," Applied Statistics, 16, 225-236.

Jolliffe, I. T. (1989), "Rotation of Ill-Defined Principal Components," Applied Statistics, 38, 139-147.

Jolliffe, I. T. (1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied Statistics, 22, 29-35.

Jolliffe, I. T. (2002), Principal Component Analysis (2nd ed.), New York: Springer-Verlag.

Jolliffe, I. T., and Uddin, M. (2000), "The Simplified Component Technique—An Alternative to Rotated Principal Components," Journal of Computational and Graphical Statistics, 9, 689-710.

Jolliffe, I. T., Uddin, M., and Vines, S. K. (2002), "Simplified EOFs—Three Alternatives to Rotation," Climate Research, 20, 271-279.

Krzanowski, W. J., and Marriott, F. H. C. (1995), Multivariate Analysis, Part II, London: Arnold.

LeBlanc, M., and Tibshirani, R. (1998), "Monotone Shrinkage of Trees," Journal of Computational and Graphical Statistics, 7, 417-433.

Morton, S. C. (1989), "Interpretable Projection Pursuit," Technical Report 106, Department of Statistics, Stanford University.

McCabe, G. P. (1984), "Principal Variables," Technometrics, 26, 137-144.

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000), "On the LASSO and its Dual," Journal of Computational and Graphical Statistics, 9, 319-337.

Tanaka, Y., and Mori, Y. (1997), "Principal Component Analysis Based on a Subset of Variables: Variable Selection and Sensitivity Analysis," American Journal of Mathematical and Management Sciences, 17, 61-89.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267-288.

Uddin, M. (1999), "Interpretation of Results from Simplified Principal Components," Ph.D. thesis, University of Aberdeen, Aberdeen, Scotland.

Vines, S. K. (2000), "Simple Principal Components," Applied Statistics, 49, 441-451.

Page 9: Open Research Onlineoro.open.ac.uk/3949/1/Article_JCGS.pdfDepartmentof Statistics,University ofKarachi, Karachi-75270, Pakistan (E-mail: mudassir2000@hotmail.com). ®c2003 American

538 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

optimal solutions for SCoTLASS in the rst quadrant symmetric about the 45macr line IfPCA or SCoTLASS is done on a covariance rather than correlation matrix (see Remark 1in Section 6) with unequal variances there will be a unique solution in the rst quadrant

34 IMPLEMENTATION

PCA reduces to an easily implemented eigenvalue problem but the extra constraintin SCoTLASS means that it needs numerical optimization to estimate parameters and suf-fers from the problem of many local optima A number of algorithms have been tried toimplement the technique including simulated annealing (Goffe Ferrier and Rogers 1994)but the results reported in the following example were derived using the projected gradientapproach (Chu and Trenda lov 2001 Helmke and Moore 1994) The LASSO inequalityconstraint (33) in the SCoTLASS problem (31)ndash(33) is eliminated by making use of anexterior penalty function Thus the SCoTLASS problem (31)ndash(33) is transformed intoa new maximization problem subject to the equality constraint (32) The solution of thismodi ed maximization problem is then found as an ascent gradient vector ow onto thep-dimensional unit sphere following the standard projected gradient formalism (Chu andTrenda lov 2001 Helmke and Moore 1994) Detailed considerationof this solution will bereported separately

We have implemented SCoTLASS using MATLAB The code requires as input thecorrelation matrix R the value of the tuning parameter t and the number of componentsto be retained (m) The MATLAB code returns a loading matrix and calculates a numberof relevant statistics To achieve an appropriate solution a number of parameters of theprojected gradient method (eg starting points absolute and relative tolerances) also needto be de ned

4 EXAMPLE PITPROPS DATA

Here we revisit the pitprop data from section 2 We have studied many other examplesnot discussed here and in general the results are qualitativelysimilar Table 4 gives loadingsfor SCoTLASS with t = 225 200 175 and 150 Table 5 gives variances cumulativevariances ldquosimplicity factorsrdquo and number of zero loadings for the same values of t aswell as corresponding information for PCA (t =

p13) and RPCA The simplicity factors

are values of the varimax criterion for each componentIt can be seen that as the value of t is decreased the simplicity of the components

increases as measured by the number of zero loadings and by the varimax simplicity factoralthough the increase in the latter is not uniform The increase in simplicity is paid for by aloss of variance retained By t = 225 175 the percentage of variance accounted for by the rst component in SCoTLASS is reduced from 324 for PCA to 267 196 respectivelyAt t = 225 this is still larger than the largest contribution (239) achieved by a singlecomponent in RPCA The comparison with RPCA is less favorable when the variationaccounted for by all six retained components is examined RPCA necessarily retains the

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 539

Table 4 Loadings for SCoTLASS for Four Values of t Based on the Correlation Matrix for JeffersrsquoPitprop Data

Component

Technique Variable (1) (2) (3) (4) (5) (6)

SCoTLASS x1 0558 0085 iexcl0093 iexcl0107 0056 0017(t = 225) x2 0580 0031 iexcl0087 iexcl0147 0073 0047

x3 0000 0647 iexcl0129 0215 iexcl0064 iexcl0101x4 0000 0654 iexcl0000 0211 iexcl0080 0127x5 iexcl0000 0000 0413 iexcl0000 0236 0747x6 0001 0208 0529 iexcl0022 iexcl0108 0033x7 0266 iexcl0000 0385 0000 iexcl0121 0020x8 0104 iexcl0098 0000 0584 0127 iexcl0188x9 0372 iexcl0000 iexcl0000 0019 0142 iexcl0060x10 0364 iexcl0154 0000 0212 iexcl0296 0000x11 iexcl0000 0099 iexcl0000 0000 0879 iexcl0156x12 iexcl0000 0241 iexcl0001 iexcl0699 iexcl0044 iexcl0186x13 iexcl0000 0026 iexcl0608 iexcl0026 iexcl0016 0561

SCoTLASS x1 0623 0041 iexcl0049 0040 0051 iexcl0000(t = 200) x2 0647 0076 iexcl0001 0072 0059 iexcl0007

x3 0000 0000 iexcl0684 iexcl0128 iexcl0054 0106x4 0000 iexcl0001 iexcl0670 iexcl0163 iexcl0063 iexcl0100x5 iexcl0000 iexcl0267 0000 0000 0228 iexcl0772x6 0000 iexcl0706 iexcl0044 0011 iexcl0003 0001x7 0137 iexcl0539 0001 iexcl0000 iexcl0096 0001x8 0001 iexcl0000 0053 iexcl0767 0095 0098x9 0332 0000 0000 iexcl0001 0065 0014x10 0254 iexcl0000 0124 iexcl0277 iexcl0309 iexcl0000x11 0000 0000 iexcl0065 0000 0902 0194x12 iexcl0000 0000 iexcl0224 0533 iexcl0069 0137x13 0000 0364 iexcl0079 0000 0000 iexcl0562

SCoTLASS x1 0664 iexcl0000 0000 iexcl0025 0002 iexcl0035(t = 175) x2 0683 iexcl0001 0000 iexcl0040 0001 iexcl0018

x3 0000 0641 0195 0000 0180 iexcl0030x4 0000 0701 0001 0000 iexcl0000 iexcl0001x5 iexcl0000 0000 iexcl0000 0000 iexcl0887 iexcl0056x6 0000 0293 iexcl0186 0000 iexcl0373 0044x7 0001 0107 iexcl0658 iexcl0000 iexcl0051 0064x8 0001 iexcl0000 iexcl0000 0735 0021 iexcl0168x9 0283 iexcl0000 iexcl0000 0000 iexcl0000 iexcl0001x10 0113 iexcl0000 iexcl0001 0388 iexcl0017 0320x11 0000 0000 0000 iexcl0000 iexcl0000 iexcl0923x12 iexcl0000 0001 0000 iexcl0554 0016 0004x13 0000 iexcl0000 0703 0001 iexcl0197 0080

SCoTLASS x1 0701 iexcl0000 iexcl0001 iexcl0001 0001 iexcl0000(t = 150) x2 0709 iexcl0001 iexcl0001 iexcl0001 0001 iexcl0000

x3 0001 0698 iexcl0068 0002 iexcl0001 0001x4 0001 0712 iexcl0001 0002 iexcl0001 iexcl0001x5 iexcl0000 0000 0001 iexcl0001 0093 iexcl0757x6 0000 0081 0586 iexcl0031 0000 0000x7 0001 0000 0807 iexcl0001 0000 0002x8 0001 iexcl0000 0001 0044 iexcl0513 iexcl0001x9 0079 0000 0001 0001 iexcl0001 iexcl0000x10 0002 iexcl0000 0027 0660 iexcl0000 iexcl0002x11 0000 0001 iexcl0000 iexcl0749 iexcl0032 iexcl0000x12 iexcl0000 0001 iexcl0000 iexcl0001 0853 0083x13 0000 0000 iexcl0001 0000 iexcl0001 0648

540 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Table 5 Simplicity Factor Variance Cumulative Variance and Number of Zero Loadings for IndividualComponents in PCA RPCA and SCoTLASS for Four Values of t Based on the CorrelationMatrix for Jeffersrsquo Pitprop Data

Component

Technique Measure (1) (2) (3) (4) (5) (6)

PCA Simplicity factor (varimax) 0059 0103 0082 0397 0086 0266( = SCoTLASS with Variance () 324 182 144 89 70 63

t =p

13) Cumulative variance () 324 507 651 740 809 872

RPCA Simplicity factor (varimax) 0362 0428 0199 0595 0131 0343Variance () 130 146 184 97 239 76

Cumulative variance () 130 276 460 557 796 872

SCoTLASS Simplicity factor (varimax) 0190 0312 0205 0308 0577 0364(t = 225) Variance () 267 172 159 97 89 67

Cumulative variance () 267 439 598 694 784 850Number of zero loadings 6 3 5 3 0 1

SCoTLASS Simplicity factor (varimax) 0288 0301 0375 0387 0646 0412(t = 200) Variance () 231 164 162 112 89 65

Cumulative variance () 231 395 558 670 759 823Number of zero loadings 7 6 2 4 1 2

SCoTLASS Simplicity factor (varimax) 0370 0370 0388 0360 0610 0714(t = 175) Variance () 196 160 132 130 92 91

Cumulative variance () 196 356 487 618 710 801Number of zero loadings 7 7 7 7 3 0

SCoTLASS Simplicity factor (varimax) 0452 0452 0504 0464 0565 0464(t = 150) Variance () 161 149 138 102 99 96

Cumulative variance () 161 310 449 551 650 745Number of zero loadings 5 7 2 1 3 5

same total percentage variation (872) as PCA but SCoTLASS drops to 850 and 801 fort = 225 175 respectively Against this loss SCoTLASS has the considerable advantageof retaining the successive maximisation property At t = 225 apart from switching ofcomponents 4 and 5 the SCoTLASS componentsare nicely simpli ed versions of the PCsrather than being something different as in RPCA A linked advantage is that if we decideto look only at ve components then the SCoTLASS components will simply be the rst ve in Table 4 whereas RPCs based on (m iexcl 1) retained components are not necessarilysimilar to those based on m

A further ldquoplusrdquo for SCoTLASS is the presence of zero loadings which aids inter-pretation but even where there are few zeros the components are simpli ed compared toPCA Consider speci cally the interpretation of the second component which we notedearlier was ldquomessyrdquo for PCA For t = 175 this component is now interpreted as measuringmainly moisture content and speci c gravity with small contributions from numbers ofannual rings and all other variables negligible This gain in interpretability is achieved byreducing the percentage of total variance accounted for from 182 to 160 For other com-ponents too interpretation is made easier because in the majority of cases the contributionof a variable is clearcut Either it is important or it is not with few equivocal contributions

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 541

Table 6 Speci ed Eigenvectors of a Six-Dimensional Block Structure

Eigenvectors

Variable (1) (2) (3) (4) (5) (6)

x1 0096 iexcl0537 0759 iexcl0120 0335 iexcl0021x2 0082 iexcl0565 iexcl0599 0231 0511 iexcl0013x3 0080 iexcl0608 iexcl0119 iexcl0119 iexcl0771 0016x4 0594 0085 iexcl0074 iexcl0308 0069 0731x5 0584 0096 iexcl0114 iexcl0418 0052 iexcl0678x6 0533 0074 0180 0805 iexcl0157 iexcl0069

Variance 18367 1640 0751 0659 0607 0506

For t = 225 200 175 150 the number of the zeros is as follows 18 22 31 and 23 Itseems surprising that we obtain fewer zeros with t = 150 than with t = 175 that is thesolution with t = 175 appears to be simpler than the one with t = 150 In fact this impres-sion is misleading (see also the next paragraph) The explanation of this anomaly is in theprojected gradient method used for numerical solution of the problem which approximatesthe LASSO constraint with a certain smooth function and thus the zero-loadings producedmay be also approximate One can see that the solution with t = 150 contains a total of 56loadings with less than 0005 magnitude compared to 42 in the case t = 175

Another interesting comparison is in terms of average varimax simplicity over the rst six components This is 0343 for RPCA compared to 0165 for PCA For t =

225 200 175 150 the average simplicityis 0326 0402 0469 0487 respectivelyThisdemonstrates that although the varimax criterion is not an explicit part of SCoTLASS bytaking t small enough we can do better than RPCA with respect to its own criterion This isachieved by moving outside the space spanned by the retained PCs and hence settling fora smaller amount of overall variation retained

5 SIMULATION STUDIES

One question of interest is whether SCoTLASS is better at detecting underlying simplestructure in a data set than is PCA or RPCA To investigate this question we simulated datafrom a variety of known structures Because of space constraints only a small part of theresults is summarized here further details can be found in Uddin (1999)

Given a vector l of positive real numbers and an orthogonal matrix A we can attemptto nd a covariance matrix or correlation matrix whose eigenvalues are the elements of land whose eigenvectorsare the column of A Some restrictions need to be imposed on l andA especially in the case of correlation matrices but it is possible to nd such matrices for awide range of eigenvectorstructuresHavingobtaineda covarianceor correlationmatrix it isstraightforward to generate samples of data from multivariate normal distributions with thegiven covariance or correlation matrix We have done this for a wide variety of eigenvectorstructures (principal component loadings) and computed the PCs RPCs and SCoTLASScomponents from the resulting sample correlation matrices Various structures have been

542 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Table 7 Speci ed Eigenvectors of a Six-Dimensional Intermediate Structure

Eigenvectors

Variable (1) (2) (3) (4) (5) (6)

x1 0224 iexcl0509 0604 0297 iexcl0327 0361x2 0253 iexcl0519 iexcl0361 iexcl0644 iexcl0341 iexcl0064x3 0227 iexcl0553 iexcl0246 0377 0608 iexcl0267x4 0553 0249 iexcl0249 iexcl0052 0262 0706x5 0521 0254 iexcl0258 0451 iexcl0509 iexcl0367x6 0507 0199 0561 iexcl0384 0281 iexcl0402

Variance 1795 1674 0796 0618 0608 0510

investigated which we call block structure intermediate structure and uniform structureTables 6ndash8 give one example of each type of structure The structure in Table 6 has blocks ofnontrivial loadings and blocks of near-zero loadings in each underlying component Table8 has a structure in which all loadings in the rst two components have similar absolutevalues and the structure in Table 7 is intermediate to those of Tables 6 and 8 An alternativeapproach to the simulation study would be to replace the near-zero loadings by exact zerosand the nearly equal loadingsby exact equalitiesHowever we feel that in reality underlyingstructures are never quite that simple so we perturbed them a little

It might be expected that if the underlying structure is simple then sampling variationis more likely to take sample PCs away from simplicity than to enhance this simplicityIt is of interest to investigate whether the techniques of RPCA and SCoTLASS whichincrease simplicity compared to the sample PCs will do so in the direction of the trueunderlying structure The closeness of a vector of loadings from any of these techniquesto the underlying true vector is measured by the angle between the two vectors of interestThese anglesare given in Tables9ndash11 for single simulateddatasets from three different typesof six-dimensional structure they typify what we found in other simulations Three valuesof t (apart from that for PCA) are shown in the tables Their exact values are unimportantand are slightly different in different tables They are chosen to illustrate typical behaviorin our simulations

The results illustratethat for eachstructureRPCA isperhapssurprisinglyand certainlydisappointinglybad at recovering the underlying structure SCoTLASS on the other hand

Table 8 Specied Eigenvectors of a Six-Dimensional Uniform Structure

Eigenvectors

Variable (1) (2) (3) (4) (5) (6)

x1 iexcl0455 0336 iexcl0087 0741 iexcl0328 0125x2 iexcl0439 0370 iexcl0212 iexcl0630 iexcl0445 iexcl0175x3 iexcl0415 0422 0378 iexcl0110 0697 0099x4 0434 0458 0040 iexcl0136 iexcl0167 0744x5 0301 0435 iexcl0697 0114 0356 iexcl0306x6 0385 0416 0563 0104 iexcl0234 iexcl0545

Variance 1841 1709 0801 0649 0520 0480

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 543

Table 9 Angles Between the Underlying Vectors and the Sample Vectors of PCA RPCA and SCoT-LASS With Various Values of t for a Speci ed ldquoBlockrdquo Structure of Correlation Eigenvectors

Vectors

Technique t (1) (2) (3) (4)

PCAp

6 129 120 154 795RPCA 377 452 453 831

SCoTLASS t = 200 129 120 154 786t = 182 119 114 138 732t = 175 94 100 125 852

is capable of improvement over PCA For example for t = 175 it not only improvesover PCA in terms of angles in Table 9 but it also has 3 3 and 2 zero loadings in its rstthree components thus giving a notably simpler structure None of the methods managesto reproduce the underlying structure for component 4 in Table 9

The results for intermediate structure in Table 10 are qualitatively similar to thosein Table 9 except that SCoTLASS does best for higher values of t than in Table 9 Foruniform structure (Table 11) SCoTLASS does badly compared to PCA for all values of tThis is not unexpected because although uniform structure is simple in its own way it isnot the type of simplicity which SCoTLASS aims for It is also the case that the varimaxcriterion is designedso that it stands littlechanceof ndinguniform structureOther rotationcriteria such as quartimax can in theory nd uniform vectors of loadings but they weretried and also found to be unsuccessful in our simulations It is probable that a uniformstructure is more likely to be found by the techniques proposed by Hausman (1982) orVines (2000) Although SCoTLASS will usually fail to nd such structures their existencemay be indicated by a large drop in the variance explained by SCoTLASS as decreasingvalues of t move it away from PCA

6 DISCUSSION

A new techniqueSCoTLASS has been introduced for discovering and interpreting themajor sources of variability in a dataset We have illustrated its usefulness in an exampleand have also shown through simulations that it is capable of recovering certain types of

Table 10 Angles Between the Underlying Vectors and the Sample Vectors of PCA RPCA and SCoT-LASS With Various Values of t for a Speci ed ldquoIntermediaterdquo Structure of Correlation Eigen-vectors

Vectors

Technique t (1) (2) (3) (4)

PCAp

6 137 151 236 783RPCA 533 429 664 777

SCoTLASS t = 228 55 104 237 783t = 212 95 120 235 778t = 201 175 191 225 717

544 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Table 11 Angles Between the Underlying Vectors and the Sample Vectors of PCA RPCA and SCoT-LASS with VariousValues of t for a Speci ed ldquoUniformrdquo Structure of Correlation Eigenvectors

Vectors

Technique t (1) (2) (3) (4)

PCAp

6 60 64 232 318RPCA 520 541 394 435

SCoTLASS t = 215 243 246 234 307t = 191 345 346 218 235t = 173 410 413 225 190

underlying structure It is preferred in many respects to rotated principal components as ameans of simplifying interpretation compared to principal component analysis Althoughwe are convincedof the value of SCoTLASS there are a number of complicationsand openquestions which are now listed as a set of remarks

Remark 1 In this article we have carried out the techniques studied on correlationmatrices Although it is less common in practice PCA and RPCA can also be implementedon covariance matrices In this case PCA successively nds uncorrelated linear functions ofthe original unstandardized variables SCoTLASS can also be implemented in this casethe only difference being that the sample correlation matrix R is replaced by the samplecovariance matrix S in equation (31) We have investigated covariance-based SCoTLASSboth for real examples and using simulation studies Some details of its performance aredifferent from the correlation-based case but qualitatively they are similar In particularthere are a number of reasons to prefer SCoTLASS to RPCA

Remark 2 In PCA the constraint a0

hak = 0 (orthogonality of vectors of loadings)is equivalent to a

0

hRak = 0 (different components are uncorrelated) This equivalence isspecial to the PCs and is a consequenceof the ak being eigenvectors of R When we rotatethe PC loadings we lose at least one of these two properties (Jolliffe 1995) Similarly inSCoTLASS if we impose a

0

hak = 0 we no longer have uncorrelated components Forexample Table 12 gives the correlations between the six SCoTLASS components whent = 175 for the pitprop data Although most of the correlations in Table 12 are small inabsolute value there are also nontrivial ones (r12 r14 r34 and r35)

Table 12 Correlation Matrix for the First Six SCoTLASS Components for t = 175 using Jeffersrsquo PitpropData

Correlation matrix

Components (1) (2) (3) (4) (5) (6)

(1) 1000 0375 iexcl0234 0443 iexcl0010 0061(2) 1000 iexcl0114 iexcl0076 iexcl0145 iexcl0084(3) 1000 iexcl0438 0445 iexcl0187(4) 1000 iexcl0105 0141(5) 1000 iexcl0013(6) 1000

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 545

It is possible to replace the constraint a0

hak = 0 in SCoTLASS by a0

hR ak = 0 thuschoosing to have uncorrelated components rather than orthogonal loadings but this optionis not explored in the present article

Remark 3 The choice of t is clearly important in SCoTLASS As t decreases simplic-ity increases but variation explained decreases and we need to achieve a suitable tradeoffbetween these two properties The correlation between components noted in the previousremark is another aspect of any tradeoff Correlations are small for t close to

pp but have

the potential to increase as t decreases It might be possible to construct a criterion whichde nes the ldquobest tradeoffrdquo but there is no unique construction because of the dif culty ofdeciding how to measure simplicity and how to combine variance simplicity and correla-tion At present it seems best to compute the SCoTLASS components for several values oft and judge subjectively at what point a balance between these various aspects is achievedIn our example we used the same value of t for all components in a data set but varying t

for different components is another possibility

Remark 4 Our algorithms for SCoTLASS are slower than those for PCA This isbecause SCoTLASS is implemented subject to an extra restriction on PCA and we losethe advantage of calculation via the singular value decomposition which makes the PCAalgorithm fast Sequential-based PCA with an extra constraint requires a good optimizerto produce a global optimum In the implementation of SCoTLASS a projected gradientmethod is used which is globally convergent and preserves accurately both the equalityand inequality constraints It should be noted that as t is reduced from

pp downwards

towards unity the CPU time taken to optimize the objective function remains generallythe same (11 sec on average for 1GHz PC) but as t decreases the algorithm becomesprogressively prone to hit local minima and thus more (random) starts are required to nd a global optimum Osborne Presnell and Turlach (2000) gave an ef cient procedurebased on convex programming and a dual problem for implementing the LASSO in theregression context Whether or not this approach can be usefully adapted to SCoTLASSwill be investigated in further research Although we are reasonably con dent that ouralgorithm has found global optima in the example of Section 4 and in the simulations thereis no guarantee The jumps that occur in some coef cients such as the change in a36 fromiexcl 0186 to 0586 as t decreases from 175 to 150 in the pitprop data could be due to one ofthe solutions being a local optimum However it seems more likely to us that it is caused bythe change in the nature of the earlier components which together with the orthogonalityconstraint imposed on the third component opens up a different range of possibilities forthe latter component There is clearly much scope for further work on the implementationof SCoTLASS

Remark 5 In a number of examples not reportedhere several of the nonzero loadingsin the SCoTLASS components are exactly equal especially for large values of p and smallvalues of t At present we have no explanation for this but it deserves further investigation

546 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Remark 6 The way in which SCoTLASS sets some coef cients to zero is different inconcept from simply truncating to zero the smallest coef cients in a PC The latter attemptsto approximate the PC by a simpler version and can have problems with that approximationas shown by Cadima and Jolliffe (1995) SCoTLASS looks for simple sources of variationand likePCA aims for highvariancebut becauseof simplicityconsiderationsthesimpli edcomponents can in theory be moderately different from the PCs We seek to replace PCsrather than approximate them although because of the shared aim of high variance theresults will often not be too different

Remark 7 There are a number of other recent developments which are relevant tointerpretation problems in multivariate statistics Jolliffe (2002) Chapter 11 reviews thesein the context of PCA Vinesrsquos (2000) use of only a discrete number of values for loadingshas already been mentioned It works well in some examples but for the pitprops datathe components 4ndash6 are rather complex A number of aspects of the strategy for selectinga subset of variables were explored by Cadima and Jolliffe (1995 2001) and by Tanakaand Mori (1997) The LASSO has been generalized in the regression context to so-calledbridge estimation in which the constraint p

j = 1 jshy j j micro t is replaced by pj = 1 jshy j jreg micro t

where reg is not necessarily equal to unitymdashsee for example Fu (1998) Tibshirani (1996)also mentioned the nonnegativegarotte due to Breiman (1995) as an alternative approachTranslation of the ideas of the bridge and nonnegative garotte to the context of PCA andcomparison with other techniques would be of interest in future research

ACKNOWLEDGMENTSAn earlier draft of this article was prepared while the rst author was visiting the Bureau of Meteorology

Research Centre (BMRC) Melbourne Australia He is grateful to BMRC for the support and facilities providedduring his visit and to the Leverhulme Trust for partially supporting the visit under their Study Abroad Fellowshipscheme Comments from two referees and the editor have helped to improve the clarity of the article

[Received November 2000 Revised July 2002]

REFERENCES

Breiman L (1995) ldquoBetter Subset Regression Using the Nonnegative Garotterdquo Technometrics 37 373ndash384

Cadima J and Jolliffe I T (1995) ldquoLoadings and Correlations in the Interpretation of Principal ComponentsrdquoJournal of Applied Statistics 22 203ndash214

(2001) ldquoVariable Selection and the Interpretation of Principal Subspacesrdquo Journal of AgriculturalBiological and Environmental Statistics 6 62ndash79

Chu M T and Trenda lov N T (2001) ldquoThe Orthogonally Constrained Regression Revisitedrdquo Journal ofComputationaland Graphical Statistics 10 1ndash26

Fu J W (1998) ldquoPenalized Regression The Bridge Versus the Lassordquo Journal of Computationaland GraphicalStatistics 7 397ndash416

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 547

Goffe W L Ferrier G D and Rogers J (1994) ldquoGlobal Optimizations of Statistical Functions with SimulatedAnnealingrdquo Journal of Econometrics 60 65ndash99

Hausman R (1982) ldquoConstrained Multivariate Analysisrdquo in Optimization in Statistics eds S H Zanckis and JS Rustagi Amsterdam North Holland pp 137ndash151

Helmke U and Moore J B (1994) Optimization and Dynamical Systems London Springer

Jeffers J N R (1967)ldquoTwo Case Studies in the Applicationof Principal ComponentAnalysisrdquo AppliedStatistics16 225ndash236

Jolliffe I T (1989) ldquoRotation of Ill-De ned Principal Componentsrdquo Applied Statistics 38 139ndash147

(1995) ldquoRotation of Principal Components Choice of Normalization Constraintsrdquo Journal of AppliedStatistics 22 29ndash35

(2002) Principal Component Analysis (2nd ed) New York Springer-Verlag

Jolliffe I T and Uddin M (2000) ldquoThe Simpli ed Component TechniquemdashAn Alternative to Rotated PrincipalComponentsrdquo Journal of Computational and Graphical Statistics 9 689ndash710

Jolliffe I T Uddin M and Vines S K (2002) ldquoSimpli ed EOFsmdashThree Alternatives to Rotationrdquo ClimateResearch 20 271ndash279

Krzanowski W J and Marriott F H C (1995) Multivariate Analysis Part II London Arnold

LeBlanc M and TibshiraniR (1998) ldquoMonotoneShrinkageof Treesrdquo Journalof ComputationalandGraphicalStatistics 7 417ndash433

Morton S C (1989) ldquoInterpretable Projection Pursuitrdquo Technical Report 106 Department of Statistics StanfordUniversity

McCabe G P (1984) ldquoPrincipal Variablesrdquo Technometrics 26 137ndash144

Osborne M R Presnell B and Turlach B A (2000) ldquoOn the LASSO and its Dualrdquo Journal of Computationaland Graphical Statistics 9 319ndash337

Tanaka Y and Mori Y (1997)ldquoPrincipal ComponentAnalysisBased on a Subset ofVariables Variable Selectionand Sensitivity Analysisrdquo American Journal of Mathematical and Management Sciences 17 61ndash89

Tibshirani R (1996)ldquoRegression Shrinkageand Selection via the Lassordquo Journalof the Royal StatisticalSocietySeries B 58 267ndash288

Uddin M (1999) ldquoInterpretation of Results from Simpli ed Principal Componentsrdquo PhD thesis University ofAberdeen Aberdeen Scotland

Vines S K (2000) ldquoSimple Principal Componentsrdquo Applied Statistics 49 441ndash451

Page 10: Open Research Onlineoro.open.ac.uk/3949/1/Article_JCGS.pdfDepartmentof Statistics,University ofKarachi, Karachi-75270, Pakistan (E-mail: mudassir2000@hotmail.com). ®c2003 American

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 539

Table 4 Loadings for SCoTLASS for Four Values of t Based on the Correlation Matrix for JeffersrsquoPitprop Data

Component

Technique Variable (1) (2) (3) (4) (5) (6)

SCoTLASS x1 0558 0085 iexcl0093 iexcl0107 0056 0017(t = 225) x2 0580 0031 iexcl0087 iexcl0147 0073 0047

x3 0000 0647 iexcl0129 0215 iexcl0064 iexcl0101x4 0000 0654 iexcl0000 0211 iexcl0080 0127x5 iexcl0000 0000 0413 iexcl0000 0236 0747x6 0001 0208 0529 iexcl0022 iexcl0108 0033x7 0266 iexcl0000 0385 0000 iexcl0121 0020x8 0104 iexcl0098 0000 0584 0127 iexcl0188x9 0372 iexcl0000 iexcl0000 0019 0142 iexcl0060x10 0364 iexcl0154 0000 0212 iexcl0296 0000x11 iexcl0000 0099 iexcl0000 0000 0879 iexcl0156x12 iexcl0000 0241 iexcl0001 iexcl0699 iexcl0044 iexcl0186x13 iexcl0000 0026 iexcl0608 iexcl0026 iexcl0016 0561

SCoTLASS x1 0623 0041 iexcl0049 0040 0051 iexcl0000(t = 200) x2 0647 0076 iexcl0001 0072 0059 iexcl0007

x3 0000 0000 iexcl0684 iexcl0128 iexcl0054 0106x4 0000 iexcl0001 iexcl0670 iexcl0163 iexcl0063 iexcl0100x5 iexcl0000 iexcl0267 0000 0000 0228 iexcl0772x6 0000 iexcl0706 iexcl0044 0011 iexcl0003 0001x7 0137 iexcl0539 0001 iexcl0000 iexcl0096 0001x8 0001 iexcl0000 0053 iexcl0767 0095 0098x9 0332 0000 0000 iexcl0001 0065 0014x10 0254 iexcl0000 0124 iexcl0277 iexcl0309 iexcl0000x11 0000 0000 iexcl0065 0000 0902 0194x12 iexcl0000 0000 iexcl0224 0533 iexcl0069 0137x13 0000 0364 iexcl0079 0000 0000 iexcl0562

SCoTLASS x1 0664 iexcl0000 0000 iexcl0025 0002 iexcl0035(t = 175) x2 0683 iexcl0001 0000 iexcl0040 0001 iexcl0018

x3 0000 0641 0195 0000 0180 iexcl0030x4 0000 0701 0001 0000 iexcl0000 iexcl0001x5 iexcl0000 0000 iexcl0000 0000 iexcl0887 iexcl0056x6 0000 0293 iexcl0186 0000 iexcl0373 0044x7 0001 0107 iexcl0658 iexcl0000 iexcl0051 0064x8 0001 iexcl0000 iexcl0000 0735 0021 iexcl0168x9 0283 iexcl0000 iexcl0000 0000 iexcl0000 iexcl0001x10 0113 iexcl0000 iexcl0001 0388 iexcl0017 0320x11 0000 0000 0000 iexcl0000 iexcl0000 iexcl0923x12 iexcl0000 0001 0000 iexcl0554 0016 0004x13 0000 iexcl0000 0703 0001 iexcl0197 0080

SCoTLASS x1 0701 iexcl0000 iexcl0001 iexcl0001 0001 iexcl0000(t = 150) x2 0709 iexcl0001 iexcl0001 iexcl0001 0001 iexcl0000

x3 0001 0698 iexcl0068 0002 iexcl0001 0001x4 0001 0712 iexcl0001 0002 iexcl0001 iexcl0001x5 iexcl0000 0000 0001 iexcl0001 0093 iexcl0757x6 0000 0081 0586 iexcl0031 0000 0000x7 0001 0000 0807 iexcl0001 0000 0002x8 0001 iexcl0000 0001 0044 iexcl0513 iexcl0001x9 0079 0000 0001 0001 iexcl0001 iexcl0000x10 0002 iexcl0000 0027 0660 iexcl0000 iexcl0002x11 0000 0001 iexcl0000 iexcl0749 iexcl0032 iexcl0000x12 iexcl0000 0001 iexcl0000 iexcl0001 0853 0083x13 0000 0000 iexcl0001 0000 iexcl0001 0648

540 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Table 5 Simplicity Factor Variance Cumulative Variance and Number of Zero Loadings for IndividualComponents in PCA RPCA and SCoTLASS for Four Values of t Based on the CorrelationMatrix for Jeffersrsquo Pitprop Data

Component

Technique Measure (1) (2) (3) (4) (5) (6)

PCA Simplicity factor (varimax) 0059 0103 0082 0397 0086 0266( = SCoTLASS with Variance () 324 182 144 89 70 63

t =p

13) Cumulative variance () 324 507 651 740 809 872

RPCA Simplicity factor (varimax) 0362 0428 0199 0595 0131 0343Variance () 130 146 184 97 239 76

Cumulative variance () 130 276 460 557 796 872

SCoTLASS Simplicity factor (varimax) 0190 0312 0205 0308 0577 0364(t = 225) Variance () 267 172 159 97 89 67

Cumulative variance () 267 439 598 694 784 850Number of zero loadings 6 3 5 3 0 1

SCoTLASS Simplicity factor (varimax) 0288 0301 0375 0387 0646 0412(t = 200) Variance () 231 164 162 112 89 65

Cumulative variance () 231 395 558 670 759 823Number of zero loadings 7 6 2 4 1 2

SCoTLASS Simplicity factor (varimax) 0370 0370 0388 0360 0610 0714(t = 175) Variance () 196 160 132 130 92 91

Cumulative variance () 196 356 487 618 710 801Number of zero loadings 7 7 7 7 3 0

SCoTLASS Simplicity factor (varimax) 0452 0452 0504 0464 0565 0464(t = 150) Variance () 161 149 138 102 99 96

Cumulative variance () 161 310 449 551 650 745Number of zero loadings 5 7 2 1 3 5

same total percentage variation (872) as PCA but SCoTLASS drops to 850 and 801 fort = 225 175 respectively Against this loss SCoTLASS has the considerable advantageof retaining the successive maximisation property At t = 225 apart from switching ofcomponents 4 and 5 the SCoTLASS componentsare nicely simpli ed versions of the PCsrather than being something different as in RPCA A linked advantage is that if we decideto look only at ve components then the SCoTLASS components will simply be the rst ve in Table 4 whereas RPCs based on (m iexcl 1) retained components are not necessarilysimilar to those based on m

A further ldquoplusrdquo for SCoTLASS is the presence of zero loadings which aids inter-pretation but even where there are few zeros the components are simpli ed compared toPCA Consider speci cally the interpretation of the second component which we notedearlier was ldquomessyrdquo for PCA For t = 175 this component is now interpreted as measuringmainly moisture content and speci c gravity with small contributions from numbers ofannual rings and all other variables negligible This gain in interpretability is achieved byreducing the percentage of total variance accounted for from 182 to 160 For other com-ponents too interpretation is made easier because in the majority of cases the contributionof a variable is clearcut Either it is important or it is not with few equivocal contributions

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 541

Table 6 Speci ed Eigenvectors of a Six-Dimensional Block Structure

Eigenvectors

Variable (1) (2) (3) (4) (5) (6)

x1 0096 iexcl0537 0759 iexcl0120 0335 iexcl0021x2 0082 iexcl0565 iexcl0599 0231 0511 iexcl0013x3 0080 iexcl0608 iexcl0119 iexcl0119 iexcl0771 0016x4 0594 0085 iexcl0074 iexcl0308 0069 0731x5 0584 0096 iexcl0114 iexcl0418 0052 iexcl0678x6 0533 0074 0180 0805 iexcl0157 iexcl0069

Variance 18367 1640 0751 0659 0607 0506

For t = 225 200 175 150 the number of the zeros is as follows 18 22 31 and 23 Itseems surprising that we obtain fewer zeros with t = 150 than with t = 175 that is thesolution with t = 175 appears to be simpler than the one with t = 150 In fact this impres-sion is misleading (see also the next paragraph) The explanation of this anomaly is in theprojected gradient method used for numerical solution of the problem which approximatesthe LASSO constraint with a certain smooth function and thus the zero-loadings producedmay be also approximate One can see that the solution with t = 150 contains a total of 56loadings with less than 0005 magnitude compared to 42 in the case t = 175

Another interesting comparison is in terms of average varimax simplicity over the rst six components This is 0343 for RPCA compared to 0165 for PCA For t =

225 200 175 150 the average simplicityis 0326 0402 0469 0487 respectivelyThisdemonstrates that although the varimax criterion is not an explicit part of SCoTLASS bytaking t small enough we can do better than RPCA with respect to its own criterion This isachieved by moving outside the space spanned by the retained PCs and hence settling fora smaller amount of overall variation retained

5 SIMULATION STUDIES

One question of interest is whether SCoTLASS is better at detecting underlying simplestructure in a data set than is PCA or RPCA To investigate this question we simulated datafrom a variety of known structures Because of space constraints only a small part of theresults is summarized here further details can be found in Uddin (1999)

Given a vector l of positive real numbers and an orthogonal matrix A we can attemptto nd a covariance matrix or correlation matrix whose eigenvalues are the elements of land whose eigenvectorsare the column of A Some restrictions need to be imposed on l andA especially in the case of correlation matrices but it is possible to nd such matrices for awide range of eigenvectorstructuresHavingobtaineda covarianceor correlationmatrix it isstraightforward to generate samples of data from multivariate normal distributions with thegiven covariance or correlation matrix We have done this for a wide variety of eigenvectorstructures (principal component loadings) and computed the PCs RPCs and SCoTLASScomponents from the resulting sample correlation matrices Various structures have been


Table 7. Specified Eigenvectors of a Six-Dimensional Intermediate Structure

Eigenvectors

Variable      (1)      (2)      (3)      (4)      (5)      (6)

x1          0.224   -0.509    0.604    0.297   -0.327    0.361
x2          0.253   -0.519   -0.361   -0.644   -0.341   -0.064
x3          0.227   -0.553   -0.246    0.377    0.608   -0.267
x4          0.553    0.249   -0.249   -0.052    0.262    0.706
x5          0.521    0.254   -0.258    0.451   -0.509   -0.367
x6          0.507    0.199    0.561   -0.384    0.281   -0.402

Variance    1.795    1.674    0.796    0.618    0.608    0.510

investigated, which we call block structure, intermediate structure, and uniform structure. Tables 6-8 give one example of each type of structure. The structure in Table 6 has blocks of nontrivial loadings and blocks of near-zero loadings in each underlying component. Table 8 has a structure in which all loadings in the first two components have similar absolute values, and the structure in Table 7 is intermediate to those of Tables 6 and 8. An alternative approach to the simulation study would be to replace the near-zero loadings by exact zeros and the nearly equal loadings by exact equalities. However, we feel that in reality underlying structures are never quite that simple, so we perturbed them a little.

It might be expected that if the underlying structure is simple, then sampling variation is more likely to take sample PCs away from simplicity than to enhance this simplicity. It is of interest to investigate whether the techniques of RPCA and SCoTLASS, which increase simplicity compared to the sample PCs, will do so in the direction of the true underlying structure. The closeness of a vector of loadings from any of these techniques to the underlying true vector is measured by the angle between the two vectors of interest. These angles are given in Tables 9-11 for single simulated datasets from three different types of six-dimensional structure; they typify what we found in other simulations. Three values of t (apart from that for PCA) are shown in the tables. Their exact values are unimportant and are slightly different in different tables; they are chosen to illustrate typical behavior in our simulations.
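A concrete version of this closeness measure is sketched below in numpy. Sign invariance of loading vectors is assumed, so the absolute inner product is used; the example vector is taken from Table 6 and perturbed artificially.

    import numpy as np

    def loading_angle_deg(a, b):
        # Loadings are defined only up to sign, so use the absolute cosine.
        cos = abs(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))
        return float(np.degrees(np.arccos(np.clip(cos, 0.0, 1.0))))

    # e.g., a true eigenvector from Table 6 against a perturbed sample version:
    true_v = np.array([0.096, -0.537, 0.759, -0.120, 0.335, -0.021])
    sample_v = true_v + 0.05 * np.random.default_rng(2).standard_normal(6)
    print(loading_angle_deg(true_v, sample_v))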

The results illustrate that, for each structure, RPCA is, perhaps surprisingly and certainly disappointingly, bad at recovering the underlying structure. SCoTLASS, on the other hand,

Table 8. Specified Eigenvectors of a Six-Dimensional Uniform Structure

Eigenvectors

Variable      (1)      (2)      (3)      (4)      (5)      (6)

x1         -0.455    0.336   -0.087    0.741   -0.328    0.125
x2         -0.439    0.370   -0.212   -0.630   -0.445   -0.175
x3         -0.415    0.422    0.378   -0.110    0.697    0.099
x4          0.434    0.458    0.040   -0.136   -0.167    0.744
x5          0.301    0.435   -0.697    0.114    0.356   -0.306
x6          0.385    0.416    0.563    0.104   -0.234   -0.545

Variance    1.841    1.709    0.801    0.649    0.520    0.480


Table 9. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS With Various Values of t for a Specified "Block" Structure of Correlation Eigenvectors

Vectors

Technique        t        (1)    (2)    (3)    (4)

PCA             √6       12.9   12.0   15.4   79.5
RPCA                     37.7   45.2   45.3   83.1
SCoTLASS    t = 2.00     12.9   12.0   15.4   78.6
            t = 1.82     11.9   11.4   13.8   73.2
            t = 1.75      9.4   10.0   12.5   85.2

is capable of improvement over PCA. For example, for t = 1.75 it not only improves over PCA in terms of angles in Table 9, but it also has 3, 3, and 2 zero loadings in its first three components, thus giving a notably simpler structure. None of the methods manages to reproduce the underlying structure for component 4 in Table 9.

The results for intermediate structure in Table 10 are qualitatively similar to those in Table 9, except that SCoTLASS does best for higher values of t than in Table 9. For uniform structure (Table 11), SCoTLASS does badly compared to PCA for all values of t. This is not unexpected, because although uniform structure is simple in its own way, it is not the type of simplicity which SCoTLASS aims for. It is also the case that the varimax criterion is designed so that it stands little chance of finding uniform structure. Other rotation criteria, such as quartimax, can in theory find uniform vectors of loadings, but they were tried and also found to be unsuccessful in our simulations. It is probable that a uniform structure is more likely to be found by the techniques proposed by Hausman (1982) or Vines (2000). Although SCoTLASS will usually fail to find such structures, their existence may be indicated by a large drop in the variance explained by SCoTLASS as decreasing values of t move it away from PCA.

6 DISCUSSION

A new technique, SCoTLASS, has been introduced for discovering and interpreting the major sources of variability in a dataset. We have illustrated its usefulness in an example, and have also shown through simulations that it is capable of recovering certain types of

Table 10. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS With Various Values of t for a Specified "Intermediate" Structure of Correlation Eigenvectors

Vectors

Technique        t        (1)    (2)    (3)    (4)

PCA             √6       13.7   15.1   23.6   78.3
RPCA                     53.3   42.9   66.4   77.7
SCoTLASS    t = 2.28      5.5   10.4   23.7   78.3
            t = 2.12      9.5   12.0   23.5   77.8
            t = 2.01     17.5   19.1   22.5   71.7


Table 11. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS With Various Values of t for a Specified "Uniform" Structure of Correlation Eigenvectors

Vectors

Technique        t        (1)    (2)    (3)    (4)

PCA             √6        6.0    6.4   23.2   31.8
RPCA                     52.0   54.1   39.4   43.5
SCoTLASS    t = 2.15     24.3   24.6   23.4   30.7
            t = 1.91     34.5   34.6   21.8   23.5
            t = 1.73     41.0   41.3   22.5   19.0

underlying structure. It is preferred in many respects to rotated principal components as a means of simplifying interpretation compared to principal component analysis. Although we are convinced of the value of SCoTLASS, there are a number of complications and open questions, which are now listed as a set of remarks.

Remark 1. In this article we have carried out the techniques studied on correlation matrices. Although it is less common in practice, PCA and RPCA can also be implemented on covariance matrices. In this case PCA successively finds uncorrelated linear functions of the original unstandardized variables. SCoTLASS can also be implemented in this case, the only difference being that the sample correlation matrix R is replaced by the sample covariance matrix S in equation (3.1). We have investigated covariance-based SCoTLASS, both for real examples and using simulation studies. Some details of its performance are different from the correlation-based case, but qualitatively they are similar. In particular, there are a number of reasons to prefer SCoTLASS to RPCA.

Remark 2. In PCA the constraint a_h' a_k = 0 (orthogonality of vectors of loadings) is equivalent to a_h' R a_k = 0 (different components are uncorrelated). This equivalence is special to the PCs, and is a consequence of the a_k being eigenvectors of R. When we rotate the PC loadings, we lose at least one of these two properties (Jolliffe 1995). Similarly, in SCoTLASS, if we impose a_h' a_k = 0 we no longer have uncorrelated components. For example, Table 12 gives the correlations between the six SCoTLASS components when t = 1.75 for the pitprop data. Although most of the correlations in Table 12 are small in absolute value, there are also nontrivial ones (r12, r14, r34, and r35).

Table 12. Correlation Matrix for the First Six SCoTLASS Components for t = 1.75, Using Jeffers' Pitprop Data

Correlation matrix

Components    (1)      (2)      (3)      (4)      (5)      (6)

(1)         1.000    0.375   -0.234    0.443   -0.010    0.061
(2)                  1.000   -0.114   -0.076   -0.145   -0.084
(3)                           1.000   -0.438    0.445   -0.187
(4)                                    1.000   -0.105    0.141
(5)                                             1.000   -0.013
(6)                                                      1.000


It is possible to replace the constraint a_h' a_k = 0 in SCoTLASS by a_h' R a_k = 0, thus choosing to have uncorrelated components rather than orthogonal loadings, but this option is not explored in the present article.
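The equivalence described in Remark 2, and its failure once loadings are rotated away from the eigenvectors, is easy to check numerically. Below is a minimal numpy sketch, with a randomly generated positive definite matrix standing in for R.

    import numpy as np

    rng = np.random.default_rng(1)
    B = rng.standard_normal((5, 5))
    R = B @ B.T + 5 * np.eye(5)          # symmetric positive definite stand-in for R

    # Eigenvectors of R: loadings are orthogonal AND components uncorrelated.
    _, A = np.linalg.eigh(R)
    a1, a2 = A[:, 0], A[:, 1]
    print(np.isclose(a1 @ a2, 0.0))      # True: orthogonal loadings
    print(np.isclose(a1 @ R @ a2, 0.0))  # True: uncorrelated components

    # Rotating within the span keeps orthogonality but loses uncorrelatedness.
    c, s = np.cos(0.3), np.sin(0.3)
    b1, b2 = c * a1 + s * a2, -s * a1 + c * a2
    print(np.isclose(b1 @ b2, 0.0))      # True: still orthogonal
    print(np.isclose(b1 @ R @ b2, 0.0))  # False: now correlated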

Remark 3. The choice of t is clearly important in SCoTLASS. As t decreases, simplicity increases but variation explained decreases, and we need to achieve a suitable tradeoff between these two properties. The correlation between components noted in the previous remark is another aspect of any tradeoff: correlations are small for t close to √p, but have the potential to increase as t decreases. It might be possible to construct a criterion which defines the "best tradeoff," but there is no unique construction, because of the difficulty of deciding how to measure simplicity and how to combine variance, simplicity, and correlation. At present it seems best to compute the SCoTLASS components for several values of t and judge subjectively at what point a balance between these various aspects is achieved. In our example we used the same value of t for all components in a data set, but varying t for different components is another possibility.

Remark 4. Our algorithms for SCoTLASS are slower than those for PCA. This is because SCoTLASS is implemented subject to an extra restriction on PCA, and we lose the advantage of calculation via the singular value decomposition which makes the PCA algorithm fast. Sequential-based PCA with an extra constraint requires a good optimizer to produce a global optimum. In the implementation of SCoTLASS a projected gradient method is used, which is globally convergent and preserves accurately both the equality and inequality constraints. It should be noted that, as t is reduced from √p downwards towards unity, the CPU time taken to optimize the objective function remains generally the same (11 sec on average for a 1GHz PC), but as t decreases the algorithm becomes progressively prone to hit local minima, and thus more (random) starts are required to find a global optimum. Osborne, Presnell, and Turlach (2000) gave an efficient procedure, based on convex programming and a dual problem, for implementing the LASSO in the regression context. Whether or not this approach can be usefully adapted to SCoTLASS will be investigated in further research. Although we are reasonably confident that our algorithm has found global optima in the example of Section 4 and in the simulations, there is no guarantee. The jumps that occur in some coefficients, such as the change in a36 from -0.186 to 0.586 as t decreases from 1.75 to 1.50 in the pitprop data, could be due to one of the solutions being a local optimum. However, it seems more likely to us that it is caused by the change in the nature of the earlier components which, together with the orthogonality constraint imposed on the third component, opens up a different range of possibilities for the latter component. There is clearly much scope for further work on the implementation of SCoTLASS.
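The use of repeated random starts mentioned in Remark 4 is a generic pattern, independent of the particular optimizer. A minimal sketch follows; the single-start routine `optimize_one` and its interface are hypothetical stand-ins, not the authors' projected gradient implementation.

    import numpy as np

    def multi_start(optimize_one, n_starts=20, seed=0):
        # Run a local optimizer from several random starts, keep the best.
        # `optimize_one(rng)` is assumed to return (solution, objective);
        # larger objective (explained variance) is better here.
        rng = np.random.default_rng(seed)
        best_sol, best_val = None, -np.inf
        for _ in range(n_starts):
            sol, val = optimize_one(rng)
            if val > best_val:
                best_sol, best_val = sol, val
        return best_sol, best_val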

Remark 5. In a number of examples not reported here, several of the nonzero loadings in the SCoTLASS components are exactly equal, especially for large values of p and small values of t. At present we have no explanation for this, but it deserves further investigation.


Remark 6. The way in which SCoTLASS sets some coefficients to zero is different in concept from simply truncating to zero the smallest coefficients in a PC. The latter attempts to approximate the PC by a simpler version, and can have problems with that approximation, as shown by Cadima and Jolliffe (1995). SCoTLASS looks for simple sources of variation and, like PCA, aims for high variance, but because of simplicity considerations the simplified components can in theory be moderately different from the PCs. We seek to replace PCs rather than approximate them, although because of the shared aim of high variance the results will often not be too different.

Remark 7. There are a number of other recent developments which are relevant to interpretation problems in multivariate statistics; Jolliffe (2002), Chapter 11, reviews these in the context of PCA. Vines's (2000) use of only a discrete number of values for loadings has already been mentioned. It works well in some examples, but for the pitprop data the components 4-6 are rather complex. A number of aspects of the strategy for selecting a subset of variables were explored by Cadima and Jolliffe (1995, 2001) and by Tanaka and Mori (1997). The LASSO has been generalized in the regression context to so-called bridge estimation, in which the constraint ∑_{j=1}^p |β_j| ≤ t is replaced by ∑_{j=1}^p |β_j|^γ ≤ t, where γ is not necessarily equal to unity; see, for example, Fu (1998). Tibshirani (1996) also mentioned the nonnegative garotte, due to Breiman (1995), as an alternative approach. Translation of the ideas of the bridge and nonnegative garotte to the context of PCA, and comparison with other techniques, would be of interest in future research.

ACKNOWLEDGMENTS

An earlier draft of this article was prepared while the first author was visiting the Bureau of Meteorology Research Centre (BMRC), Melbourne, Australia. He is grateful to BMRC for the support and facilities provided during his visit, and to the Leverhulme Trust for partially supporting the visit under their Study Abroad Fellowship scheme. Comments from two referees and the editor have helped to improve the clarity of the article.

[Received November 2000. Revised July 2002.]

REFERENCES

Breiman, L. (1995), "Better Subset Regression Using the Nonnegative Garotte," Technometrics, 37, 373-384.

Cadima, J., and Jolliffe, I. T. (1995), "Loadings and Correlations in the Interpretation of Principal Components," Journal of Applied Statistics, 22, 203-214.

——— (2001), "Variable Selection and the Interpretation of Principal Subspaces," Journal of Agricultural, Biological, and Environmental Statistics, 6, 62-79.

Chu, M. T., and Trendafilov, N. T. (2001), "The Orthogonally Constrained Regression Revisited," Journal of Computational and Graphical Statistics, 10, 1-26.

Fu, J. W. (1998), "Penalized Regression: The Bridge Versus the Lasso," Journal of Computational and Graphical Statistics, 7, 397-416.

Goffe, W. L., Ferrier, G. D., and Rogers, J. (1994), "Global Optimizations of Statistical Functions with Simulated Annealing," Journal of Econometrics, 60, 65-99.

Hausman, R. (1982), "Constrained Multivariate Analysis," in Optimization in Statistics, eds. S. H. Zanckis and J. S. Rustagi, Amsterdam: North Holland, pp. 137-151.

Helmke, U., and Moore, J. B. (1994), Optimization and Dynamical Systems, London: Springer.

Jeffers, J. N. R. (1967), "Two Case Studies in the Application of Principal Component Analysis," Applied Statistics, 16, 225-236.

Jolliffe, I. T. (1989), "Rotation of Ill-Defined Principal Components," Applied Statistics, 38, 139-147.

——— (1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied Statistics, 22, 29-35.

——— (2002), Principal Component Analysis (2nd ed.), New York: Springer-Verlag.

Jolliffe, I. T., and Uddin, M. (2000), "The Simplified Component Technique—An Alternative to Rotated Principal Components," Journal of Computational and Graphical Statistics, 9, 689-710.

Jolliffe, I. T., Uddin, M., and Vines, S. K. (2002), "Simplified EOFs—Three Alternatives to Rotation," Climate Research, 20, 271-279.

Krzanowski, W. J., and Marriott, F. H. C. (1995), Multivariate Analysis, Part II, London: Arnold.

LeBlanc, M., and Tibshirani, R. (1998), "Monotone Shrinkage of Trees," Journal of Computational and Graphical Statistics, 7, 417-433.

Morton, S. C. (1989), "Interpretable Projection Pursuit," Technical Report 106, Department of Statistics, Stanford University.

McCabe, G. P. (1984), "Principal Variables," Technometrics, 26, 137-144.

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000), "On the LASSO and its Dual," Journal of Computational and Graphical Statistics, 9, 319-337.

Tanaka, Y., and Mori, Y. (1997), "Principal Component Analysis Based on a Subset of Variables: Variable Selection and Sensitivity Analysis," American Journal of Mathematical and Management Sciences, 17, 61-89.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267-288.

Uddin, M. (1999), "Interpretation of Results from Simplified Principal Components," PhD thesis, University of Aberdeen, Aberdeen, Scotland.

Vines, S. K. (2000), "Simple Principal Components," Applied Statistics, 49, 441-451.

A further ldquoplusrdquo for SCoTLASS is the presence of zero loadings which aids inter-pretation but even where there are few zeros the components are simpli ed compared toPCA Consider speci cally the interpretation of the second component which we notedearlier was ldquomessyrdquo for PCA For t = 175 this component is now interpreted as measuringmainly moisture content and speci c gravity with small contributions from numbers ofannual rings and all other variables negligible This gain in interpretability is achieved byreducing the percentage of total variance accounted for from 182 to 160 For other com-ponents too interpretation is made easier because in the majority of cases the contributionof a variable is clearcut Either it is important or it is not with few equivocal contributions

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 541

Table 6 Speci ed Eigenvectors of a Six-Dimensional Block Structure

Eigenvectors

Variable (1) (2) (3) (4) (5) (6)

x1 0096 iexcl0537 0759 iexcl0120 0335 iexcl0021x2 0082 iexcl0565 iexcl0599 0231 0511 iexcl0013x3 0080 iexcl0608 iexcl0119 iexcl0119 iexcl0771 0016x4 0594 0085 iexcl0074 iexcl0308 0069 0731x5 0584 0096 iexcl0114 iexcl0418 0052 iexcl0678x6 0533 0074 0180 0805 iexcl0157 iexcl0069

Variance 18367 1640 0751 0659 0607 0506

For t = 225 200 175 150 the number of the zeros is as follows 18 22 31 and 23 Itseems surprising that we obtain fewer zeros with t = 150 than with t = 175 that is thesolution with t = 175 appears to be simpler than the one with t = 150 In fact this impres-sion is misleading (see also the next paragraph) The explanation of this anomaly is in theprojected gradient method used for numerical solution of the problem which approximatesthe LASSO constraint with a certain smooth function and thus the zero-loadings producedmay be also approximate One can see that the solution with t = 150 contains a total of 56loadings with less than 0005 magnitude compared to 42 in the case t = 175

Another interesting comparison is in terms of average varimax simplicity over the rst six components This is 0343 for RPCA compared to 0165 for PCA For t =

225 200 175 150 the average simplicityis 0326 0402 0469 0487 respectivelyThisdemonstrates that although the varimax criterion is not an explicit part of SCoTLASS bytaking t small enough we can do better than RPCA with respect to its own criterion This isachieved by moving outside the space spanned by the retained PCs and hence settling fora smaller amount of overall variation retained

5 SIMULATION STUDIES

One question of interest is whether SCoTLASS is better at detecting underlying simplestructure in a data set than is PCA or RPCA To investigate this question we simulated datafrom a variety of known structures Because of space constraints only a small part of theresults is summarized here further details can be found in Uddin (1999)

Given a vector l of positive real numbers and an orthogonal matrix A we can attemptto nd a covariance matrix or correlation matrix whose eigenvalues are the elements of land whose eigenvectorsare the column of A Some restrictions need to be imposed on l andA especially in the case of correlation matrices but it is possible to nd such matrices for awide range of eigenvectorstructuresHavingobtaineda covarianceor correlationmatrix it isstraightforward to generate samples of data from multivariate normal distributions with thegiven covariance or correlation matrix We have done this for a wide variety of eigenvectorstructures (principal component loadings) and computed the PCs RPCs and SCoTLASScomponents from the resulting sample correlation matrices Various structures have been

542 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Table 7 Speci ed Eigenvectors of a Six-Dimensional Intermediate Structure

Eigenvectors

Variable (1) (2) (3) (4) (5) (6)

x1 0224 iexcl0509 0604 0297 iexcl0327 0361x2 0253 iexcl0519 iexcl0361 iexcl0644 iexcl0341 iexcl0064x3 0227 iexcl0553 iexcl0246 0377 0608 iexcl0267x4 0553 0249 iexcl0249 iexcl0052 0262 0706x5 0521 0254 iexcl0258 0451 iexcl0509 iexcl0367x6 0507 0199 0561 iexcl0384 0281 iexcl0402

Variance 1795 1674 0796 0618 0608 0510

investigated which we call block structure intermediate structure and uniform structureTables 6ndash8 give one example of each type of structure The structure in Table 6 has blocks ofnontrivial loadings and blocks of near-zero loadings in each underlying component Table8 has a structure in which all loadings in the rst two components have similar absolutevalues and the structure in Table 7 is intermediate to those of Tables 6 and 8 An alternativeapproach to the simulation study would be to replace the near-zero loadings by exact zerosand the nearly equal loadingsby exact equalitiesHowever we feel that in reality underlyingstructures are never quite that simple so we perturbed them a little

It might be expected that if the underlying structure is simple then sampling variationis more likely to take sample PCs away from simplicity than to enhance this simplicityIt is of interest to investigate whether the techniques of RPCA and SCoTLASS whichincrease simplicity compared to the sample PCs will do so in the direction of the trueunderlying structure The closeness of a vector of loadings from any of these techniquesto the underlying true vector is measured by the angle between the two vectors of interestThese anglesare given in Tables9ndash11 for single simulateddatasets from three different typesof six-dimensional structure they typify what we found in other simulations Three valuesof t (apart from that for PCA) are shown in the tables Their exact values are unimportantand are slightly different in different tables They are chosen to illustrate typical behaviorin our simulations

The results illustratethat for eachstructureRPCA isperhapssurprisinglyand certainlydisappointinglybad at recovering the underlying structure SCoTLASS on the other hand

Table 8 Specied Eigenvectors of a Six-Dimensional Uniform Structure

Eigenvectors

Variable (1) (2) (3) (4) (5) (6)

x1 iexcl0455 0336 iexcl0087 0741 iexcl0328 0125x2 iexcl0439 0370 iexcl0212 iexcl0630 iexcl0445 iexcl0175x3 iexcl0415 0422 0378 iexcl0110 0697 0099x4 0434 0458 0040 iexcl0136 iexcl0167 0744x5 0301 0435 iexcl0697 0114 0356 iexcl0306x6 0385 0416 0563 0104 iexcl0234 iexcl0545

Variance 1841 1709 0801 0649 0520 0480

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 543

Table 9 Angles Between the Underlying Vectors and the Sample Vectors of PCA RPCA and SCoT-LASS With Various Values of t for a Speci ed ldquoBlockrdquo Structure of Correlation Eigenvectors

Vectors

Technique t (1) (2) (3) (4)

PCAp

6 129 120 154 795RPCA 377 452 453 831

SCoTLASS t = 200 129 120 154 786t = 182 119 114 138 732t = 175 94 100 125 852

is capable of improvement over PCA For example for t = 175 it not only improvesover PCA in terms of angles in Table 9 but it also has 3 3 and 2 zero loadings in its rstthree components thus giving a notably simpler structure None of the methods managesto reproduce the underlying structure for component 4 in Table 9

The results for intermediate structure in Table 10 are qualitatively similar to thosein Table 9 except that SCoTLASS does best for higher values of t than in Table 9 Foruniform structure (Table 11) SCoTLASS does badly compared to PCA for all values of tThis is not unexpected because although uniform structure is simple in its own way it isnot the type of simplicity which SCoTLASS aims for It is also the case that the varimaxcriterion is designedso that it stands littlechanceof ndinguniform structureOther rotationcriteria such as quartimax can in theory nd uniform vectors of loadings but they weretried and also found to be unsuccessful in our simulations It is probable that a uniformstructure is more likely to be found by the techniques proposed by Hausman (1982) orVines (2000) Although SCoTLASS will usually fail to nd such structures their existencemay be indicated by a large drop in the variance explained by SCoTLASS as decreasingvalues of t move it away from PCA

6 DISCUSSION

A new techniqueSCoTLASS has been introduced for discovering and interpreting themajor sources of variability in a dataset We have illustrated its usefulness in an exampleand have also shown through simulations that it is capable of recovering certain types of

Table 10 Angles Between the Underlying Vectors and the Sample Vectors of PCA RPCA and SCoT-LASS With Various Values of t for a Speci ed ldquoIntermediaterdquo Structure of Correlation Eigen-vectors

Vectors

Technique t (1) (2) (3) (4)

PCAp

6 137 151 236 783RPCA 533 429 664 777

SCoTLASS t = 228 55 104 237 783t = 212 95 120 235 778t = 201 175 191 225 717

544 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Table 11 Angles Between the Underlying Vectors and the Sample Vectors of PCA RPCA and SCoT-LASS with VariousValues of t for a Speci ed ldquoUniformrdquo Structure of Correlation Eigenvectors

Vectors

Technique t (1) (2) (3) (4)

PCAp

6 60 64 232 318RPCA 520 541 394 435

SCoTLASS t = 215 243 246 234 307t = 191 345 346 218 235t = 173 410 413 225 190

underlying structure It is preferred in many respects to rotated principal components as ameans of simplifying interpretation compared to principal component analysis Althoughwe are convincedof the value of SCoTLASS there are a number of complicationsand openquestions which are now listed as a set of remarks

Remark 1 In this article we have carried out the techniques studied on correlationmatrices Although it is less common in practice PCA and RPCA can also be implementedon covariance matrices In this case PCA successively nds uncorrelated linear functions ofthe original unstandardized variables SCoTLASS can also be implemented in this casethe only difference being that the sample correlation matrix R is replaced by the samplecovariance matrix S in equation (31) We have investigated covariance-based SCoTLASSboth for real examples and using simulation studies Some details of its performance aredifferent from the correlation-based case but qualitatively they are similar In particularthere are a number of reasons to prefer SCoTLASS to RPCA

Remark 2 In PCA the constraint a0

hak = 0 (orthogonality of vectors of loadings)is equivalent to a

0

hRak = 0 (different components are uncorrelated) This equivalence isspecial to the PCs and is a consequenceof the ak being eigenvectors of R When we rotatethe PC loadings we lose at least one of these two properties (Jolliffe 1995) Similarly inSCoTLASS if we impose a

0

hak = 0 we no longer have uncorrelated components Forexample Table 12 gives the correlations between the six SCoTLASS components whent = 175 for the pitprop data Although most of the correlations in Table 12 are small inabsolute value there are also nontrivial ones (r12 r14 r34 and r35)

Table 12 Correlation Matrix for the First Six SCoTLASS Components for t = 175 using Jeffersrsquo PitpropData

Correlation matrix

Components (1) (2) (3) (4) (5) (6)

(1) 1000 0375 iexcl0234 0443 iexcl0010 0061(2) 1000 iexcl0114 iexcl0076 iexcl0145 iexcl0084(3) 1000 iexcl0438 0445 iexcl0187(4) 1000 iexcl0105 0141(5) 1000 iexcl0013(6) 1000

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 545

It is possible to replace the constraint a0

hak = 0 in SCoTLASS by a0

hR ak = 0 thuschoosing to have uncorrelated components rather than orthogonal loadings but this optionis not explored in the present article

Remark 3 The choice of t is clearly important in SCoTLASS As t decreases simplic-ity increases but variation explained decreases and we need to achieve a suitable tradeoffbetween these two properties The correlation between components noted in the previousremark is another aspect of any tradeoff Correlations are small for t close to

pp but have

the potential to increase as t decreases It might be possible to construct a criterion whichde nes the ldquobest tradeoffrdquo but there is no unique construction because of the dif culty ofdeciding how to measure simplicity and how to combine variance simplicity and correla-tion At present it seems best to compute the SCoTLASS components for several values oft and judge subjectively at what point a balance between these various aspects is achievedIn our example we used the same value of t for all components in a data set but varying t

for different components is another possibility

Remark 4 Our algorithms for SCoTLASS are slower than those for PCA This isbecause SCoTLASS is implemented subject to an extra restriction on PCA and we losethe advantage of calculation via the singular value decomposition which makes the PCAalgorithm fast Sequential-based PCA with an extra constraint requires a good optimizerto produce a global optimum In the implementation of SCoTLASS a projected gradientmethod is used which is globally convergent and preserves accurately both the equalityand inequality constraints It should be noted that as t is reduced from

pp downwards

towards unity the CPU time taken to optimize the objective function remains generallythe same (11 sec on average for 1GHz PC) but as t decreases the algorithm becomesprogressively prone to hit local minima and thus more (random) starts are required to nd a global optimum Osborne Presnell and Turlach (2000) gave an ef cient procedurebased on convex programming and a dual problem for implementing the LASSO in theregression context Whether or not this approach can be usefully adapted to SCoTLASSwill be investigated in further research Although we are reasonably con dent that ouralgorithm has found global optima in the example of Section 4 and in the simulations thereis no guarantee The jumps that occur in some coef cients such as the change in a36 fromiexcl 0186 to 0586 as t decreases from 175 to 150 in the pitprop data could be due to one ofthe solutions being a local optimum However it seems more likely to us that it is caused bythe change in the nature of the earlier components which together with the orthogonalityconstraint imposed on the third component opens up a different range of possibilities forthe latter component There is clearly much scope for further work on the implementationof SCoTLASS

Remark 5 In a number of examples not reportedhere several of the nonzero loadingsin the SCoTLASS components are exactly equal especially for large values of p and smallvalues of t At present we have no explanation for this but it deserves further investigation

546 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Remark 6 The way in which SCoTLASS sets some coef cients to zero is different inconcept from simply truncating to zero the smallest coef cients in a PC The latter attemptsto approximate the PC by a simpler version and can have problems with that approximationas shown by Cadima and Jolliffe (1995) SCoTLASS looks for simple sources of variationand likePCA aims for highvariancebut becauseof simplicityconsiderationsthesimpli edcomponents can in theory be moderately different from the PCs We seek to replace PCsrather than approximate them although because of the shared aim of high variance theresults will often not be too different

Remark 7 There are a number of other recent developments which are relevant tointerpretation problems in multivariate statistics Jolliffe (2002) Chapter 11 reviews thesein the context of PCA Vinesrsquos (2000) use of only a discrete number of values for loadingshas already been mentioned It works well in some examples but for the pitprops datathe components 4ndash6 are rather complex A number of aspects of the strategy for selectinga subset of variables were explored by Cadima and Jolliffe (1995 2001) and by Tanakaand Mori (1997) The LASSO has been generalized in the regression context to so-calledbridge estimation in which the constraint p

j = 1 jshy j j micro t is replaced by pj = 1 jshy j jreg micro t

where reg is not necessarily equal to unitymdashsee for example Fu (1998) Tibshirani (1996)also mentioned the nonnegativegarotte due to Breiman (1995) as an alternative approachTranslation of the ideas of the bridge and nonnegative garotte to the context of PCA andcomparison with other techniques would be of interest in future research

ACKNOWLEDGMENTSAn earlier draft of this article was prepared while the rst author was visiting the Bureau of Meteorology

Research Centre (BMRC) Melbourne Australia He is grateful to BMRC for the support and facilities providedduring his visit and to the Leverhulme Trust for partially supporting the visit under their Study Abroad Fellowshipscheme Comments from two referees and the editor have helped to improve the clarity of the article

[Received November 2000 Revised July 2002]

REFERENCES

Breiman L (1995) ldquoBetter Subset Regression Using the Nonnegative Garotterdquo Technometrics 37 373ndash384

Cadima J and Jolliffe I T (1995) ldquoLoadings and Correlations in the Interpretation of Principal ComponentsrdquoJournal of Applied Statistics 22 203ndash214

(2001) ldquoVariable Selection and the Interpretation of Principal Subspacesrdquo Journal of AgriculturalBiological and Environmental Statistics 6 62ndash79

Chu M T and Trenda lov N T (2001) ldquoThe Orthogonally Constrained Regression Revisitedrdquo Journal ofComputationaland Graphical Statistics 10 1ndash26

Fu J W (1998) ldquoPenalized Regression The Bridge Versus the Lassordquo Journal of Computationaland GraphicalStatistics 7 397ndash416

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 547

Goffe W L Ferrier G D and Rogers J (1994) ldquoGlobal Optimizations of Statistical Functions with SimulatedAnnealingrdquo Journal of Econometrics 60 65ndash99

Hausman R (1982) ldquoConstrained Multivariate Analysisrdquo in Optimization in Statistics eds S H Zanckis and JS Rustagi Amsterdam North Holland pp 137ndash151

Helmke U and Moore J B (1994) Optimization and Dynamical Systems London Springer

Jeffers J N R (1967)ldquoTwo Case Studies in the Applicationof Principal ComponentAnalysisrdquo AppliedStatistics16 225ndash236

Jolliffe I T (1989) ldquoRotation of Ill-De ned Principal Componentsrdquo Applied Statistics 38 139ndash147

(1995) ldquoRotation of Principal Components Choice of Normalization Constraintsrdquo Journal of AppliedStatistics 22 29ndash35

(2002) Principal Component Analysis (2nd ed) New York Springer-Verlag

Jolliffe I T and Uddin M (2000) ldquoThe Simpli ed Component TechniquemdashAn Alternative to Rotated PrincipalComponentsrdquo Journal of Computational and Graphical Statistics 9 689ndash710

Jolliffe I T Uddin M and Vines S K (2002) ldquoSimpli ed EOFsmdashThree Alternatives to Rotationrdquo ClimateResearch 20 271ndash279

Krzanowski W J and Marriott F H C (1995) Multivariate Analysis Part II London Arnold

LeBlanc M and TibshiraniR (1998) ldquoMonotoneShrinkageof Treesrdquo Journalof ComputationalandGraphicalStatistics 7 417ndash433

Morton S C (1989) ldquoInterpretable Projection Pursuitrdquo Technical Report 106 Department of Statistics StanfordUniversity

McCabe G P (1984) ldquoPrincipal Variablesrdquo Technometrics 26 137ndash144

Osborne M R Presnell B and Turlach B A (2000) ldquoOn the LASSO and its Dualrdquo Journal of Computationaland Graphical Statistics 9 319ndash337

Tanaka Y and Mori Y (1997)ldquoPrincipal ComponentAnalysisBased on a Subset ofVariables Variable Selectionand Sensitivity Analysisrdquo American Journal of Mathematical and Management Sciences 17 61ndash89

Tibshirani R (1996)ldquoRegression Shrinkageand Selection via the Lassordquo Journalof the Royal StatisticalSocietySeries B 58 267ndash288

Uddin M (1999) ldquoInterpretation of Results from Simpli ed Principal Componentsrdquo PhD thesis University ofAberdeen Aberdeen Scotland

Vines S K (2000) ldquoSimple Principal Componentsrdquo Applied Statistics 49 441ndash451

Page 12: Open Research Onlineoro.open.ac.uk/3949/1/Article_JCGS.pdfDepartmentof Statistics,University ofKarachi, Karachi-75270, Pakistan (E-mail: mudassir2000@hotmail.com). ®c2003 American

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 541

Table 6 Speci ed Eigenvectors of a Six-Dimensional Block Structure

Eigenvectors

Variable (1) (2) (3) (4) (5) (6)

x1 0096 iexcl0537 0759 iexcl0120 0335 iexcl0021x2 0082 iexcl0565 iexcl0599 0231 0511 iexcl0013x3 0080 iexcl0608 iexcl0119 iexcl0119 iexcl0771 0016x4 0594 0085 iexcl0074 iexcl0308 0069 0731x5 0584 0096 iexcl0114 iexcl0418 0052 iexcl0678x6 0533 0074 0180 0805 iexcl0157 iexcl0069

Variance 18367 1640 0751 0659 0607 0506

For t = 225 200 175 150 the number of the zeros is as follows 18 22 31 and 23 Itseems surprising that we obtain fewer zeros with t = 150 than with t = 175 that is thesolution with t = 175 appears to be simpler than the one with t = 150 In fact this impres-sion is misleading (see also the next paragraph) The explanation of this anomaly is in theprojected gradient method used for numerical solution of the problem which approximatesthe LASSO constraint with a certain smooth function and thus the zero-loadings producedmay be also approximate One can see that the solution with t = 150 contains a total of 56loadings with less than 0005 magnitude compared to 42 in the case t = 175

Another interesting comparison is in terms of average varimax simplicity over the rst six components This is 0343 for RPCA compared to 0165 for PCA For t =

225 200 175 150 the average simplicityis 0326 0402 0469 0487 respectivelyThisdemonstrates that although the varimax criterion is not an explicit part of SCoTLASS bytaking t small enough we can do better than RPCA with respect to its own criterion This isachieved by moving outside the space spanned by the retained PCs and hence settling fora smaller amount of overall variation retained

5 SIMULATION STUDIES

One question of interest is whether SCoTLASS is better at detecting underlying simplestructure in a data set than is PCA or RPCA To investigate this question we simulated datafrom a variety of known structures Because of space constraints only a small part of theresults is summarized here further details can be found in Uddin (1999)

Given a vector l of positive real numbers and an orthogonal matrix A we can attemptto nd a covariance matrix or correlation matrix whose eigenvalues are the elements of land whose eigenvectorsare the column of A Some restrictions need to be imposed on l andA especially in the case of correlation matrices but it is possible to nd such matrices for awide range of eigenvectorstructuresHavingobtaineda covarianceor correlationmatrix it isstraightforward to generate samples of data from multivariate normal distributions with thegiven covariance or correlation matrix We have done this for a wide variety of eigenvectorstructures (principal component loadings) and computed the PCs RPCs and SCoTLASScomponents from the resulting sample correlation matrices Various structures have been

542 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Table 7 Speci ed Eigenvectors of a Six-Dimensional Intermediate Structure

Eigenvectors

Variable (1) (2) (3) (4) (5) (6)

x1 0224 iexcl0509 0604 0297 iexcl0327 0361x2 0253 iexcl0519 iexcl0361 iexcl0644 iexcl0341 iexcl0064x3 0227 iexcl0553 iexcl0246 0377 0608 iexcl0267x4 0553 0249 iexcl0249 iexcl0052 0262 0706x5 0521 0254 iexcl0258 0451 iexcl0509 iexcl0367x6 0507 0199 0561 iexcl0384 0281 iexcl0402

Variance 1795 1674 0796 0618 0608 0510

investigated which we call block structure intermediate structure and uniform structureTables 6ndash8 give one example of each type of structure The structure in Table 6 has blocks ofnontrivial loadings and blocks of near-zero loadings in each underlying component Table8 has a structure in which all loadings in the rst two components have similar absolutevalues and the structure in Table 7 is intermediate to those of Tables 6 and 8 An alternativeapproach to the simulation study would be to replace the near-zero loadings by exact zerosand the nearly equal loadingsby exact equalitiesHowever we feel that in reality underlyingstructures are never quite that simple so we perturbed them a little

It might be expected that if the underlying structure is simple then sampling variationis more likely to take sample PCs away from simplicity than to enhance this simplicityIt is of interest to investigate whether the techniques of RPCA and SCoTLASS whichincrease simplicity compared to the sample PCs will do so in the direction of the trueunderlying structure The closeness of a vector of loadings from any of these techniquesto the underlying true vector is measured by the angle between the two vectors of interestThese anglesare given in Tables9ndash11 for single simulateddatasets from three different typesof six-dimensional structure they typify what we found in other simulations Three valuesof t (apart from that for PCA) are shown in the tables Their exact values are unimportantand are slightly different in different tables They are chosen to illustrate typical behaviorin our simulations

The results illustratethat for eachstructureRPCA isperhapssurprisinglyand certainlydisappointinglybad at recovering the underlying structure SCoTLASS on the other hand

Table 8 Specied Eigenvectors of a Six-Dimensional Uniform Structure

Eigenvectors

Variable (1) (2) (3) (4) (5) (6)

x1 iexcl0455 0336 iexcl0087 0741 iexcl0328 0125x2 iexcl0439 0370 iexcl0212 iexcl0630 iexcl0445 iexcl0175x3 iexcl0415 0422 0378 iexcl0110 0697 0099x4 0434 0458 0040 iexcl0136 iexcl0167 0744x5 0301 0435 iexcl0697 0114 0356 iexcl0306x6 0385 0416 0563 0104 iexcl0234 iexcl0545

Variance 1841 1709 0801 0649 0520 0480

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 543

Table 9 Angles Between the Underlying Vectors and the Sample Vectors of PCA RPCA and SCoT-LASS With Various Values of t for a Speci ed ldquoBlockrdquo Structure of Correlation Eigenvectors

Vectors

Technique t (1) (2) (3) (4)

PCAp

6 129 120 154 795RPCA 377 452 453 831

SCoTLASS t = 200 129 120 154 786t = 182 119 114 138 732t = 175 94 100 125 852

is capable of improvement over PCA For example for t = 175 it not only improvesover PCA in terms of angles in Table 9 but it also has 3 3 and 2 zero loadings in its rstthree components thus giving a notably simpler structure None of the methods managesto reproduce the underlying structure for component 4 in Table 9

The results for intermediate structure in Table 10 are qualitatively similar to thosein Table 9 except that SCoTLASS does best for higher values of t than in Table 9 Foruniform structure (Table 11) SCoTLASS does badly compared to PCA for all values of tThis is not unexpected because although uniform structure is simple in its own way it isnot the type of simplicity which SCoTLASS aims for It is also the case that the varimaxcriterion is designedso that it stands littlechanceof ndinguniform structureOther rotationcriteria such as quartimax can in theory nd uniform vectors of loadings but they weretried and also found to be unsuccessful in our simulations It is probable that a uniformstructure is more likely to be found by the techniques proposed by Hausman (1982) orVines (2000) Although SCoTLASS will usually fail to nd such structures their existencemay be indicated by a large drop in the variance explained by SCoTLASS as decreasingvalues of t move it away from PCA

6 DISCUSSION

A new techniqueSCoTLASS has been introduced for discovering and interpreting themajor sources of variability in a dataset We have illustrated its usefulness in an exampleand have also shown through simulations that it is capable of recovering certain types of

Table 10 Angles Between the Underlying Vectors and the Sample Vectors of PCA RPCA and SCoT-LASS With Various Values of t for a Speci ed ldquoIntermediaterdquo Structure of Correlation Eigen-vectors

Vectors

Technique t (1) (2) (3) (4)

PCAp

6 137 151 236 783RPCA 533 429 664 777

SCoTLASS t = 228 55 104 237 783t = 212 95 120 235 778t = 201 175 191 225 717

544 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Table 11 Angles Between the Underlying Vectors and the Sample Vectors of PCA RPCA and SCoT-LASS with VariousValues of t for a Speci ed ldquoUniformrdquo Structure of Correlation Eigenvectors

Vectors

Technique t (1) (2) (3) (4)

PCAp

6 60 64 232 318RPCA 520 541 394 435

SCoTLASS t = 215 243 246 234 307t = 191 345 346 218 235t = 173 410 413 225 190

underlying structure It is preferred in many respects to rotated principal components as ameans of simplifying interpretation compared to principal component analysis Althoughwe are convincedof the value of SCoTLASS there are a number of complicationsand openquestions which are now listed as a set of remarks

Remark 1 In this article we have carried out the techniques studied on correlationmatrices Although it is less common in practice PCA and RPCA can also be implementedon covariance matrices In this case PCA successively nds uncorrelated linear functions ofthe original unstandardized variables SCoTLASS can also be implemented in this casethe only difference being that the sample correlation matrix R is replaced by the samplecovariance matrix S in equation (31) We have investigated covariance-based SCoTLASSboth for real examples and using simulation studies Some details of its performance aredifferent from the correlation-based case but qualitatively they are similar In particularthere are a number of reasons to prefer SCoTLASS to RPCA

Remark 2 In PCA the constraint a0

hak = 0 (orthogonality of vectors of loadings)is equivalent to a

0

hRak = 0 (different components are uncorrelated) This equivalence isspecial to the PCs and is a consequenceof the ak being eigenvectors of R When we rotatethe PC loadings we lose at least one of these two properties (Jolliffe 1995) Similarly inSCoTLASS if we impose a

0

hak = 0 we no longer have uncorrelated components Forexample Table 12 gives the correlations between the six SCoTLASS components whent = 175 for the pitprop data Although most of the correlations in Table 12 are small inabsolute value there are also nontrivial ones (r12 r14 r34 and r35)

Table 12 Correlation Matrix for the First Six SCoTLASS Components for t = 175 using Jeffersrsquo PitpropData

Correlation matrix

Components (1) (2) (3) (4) (5) (6)

(1) 1000 0375 iexcl0234 0443 iexcl0010 0061(2) 1000 iexcl0114 iexcl0076 iexcl0145 iexcl0084(3) 1000 iexcl0438 0445 iexcl0187(4) 1000 iexcl0105 0141(5) 1000 iexcl0013(6) 1000

MODIFIED PRINCIPAL COMPONENT TECHNIQUE BASED ON THE LASSO 545

It is possible to replace the constraint a0

hak = 0 in SCoTLASS by a0

hR ak = 0 thuschoosing to have uncorrelated components rather than orthogonal loadings but this optionis not explored in the present article

Remark 3 The choice of t is clearly important in SCoTLASS As t decreases simplic-ity increases but variation explained decreases and we need to achieve a suitable tradeoffbetween these two properties The correlation between components noted in the previousremark is another aspect of any tradeoff Correlations are small for t close to

pp but have

the potential to increase as t decreases It might be possible to construct a criterion whichde nes the ldquobest tradeoffrdquo but there is no unique construction because of the dif culty ofdeciding how to measure simplicity and how to combine variance simplicity and correla-tion At present it seems best to compute the SCoTLASS components for several values oft and judge subjectively at what point a balance between these various aspects is achievedIn our example we used the same value of t for all components in a data set but varying t

for different components is another possibility

Remark 4 Our algorithms for SCoTLASS are slower than those for PCA This isbecause SCoTLASS is implemented subject to an extra restriction on PCA and we losethe advantage of calculation via the singular value decomposition which makes the PCAalgorithm fast Sequential-based PCA with an extra constraint requires a good optimizerto produce a global optimum In the implementation of SCoTLASS a projected gradientmethod is used which is globally convergent and preserves accurately both the equalityand inequality constraints It should be noted that as t is reduced from

pp downwards

towards unity the CPU time taken to optimize the objective function remains generallythe same (11 sec on average for 1GHz PC) but as t decreases the algorithm becomesprogressively prone to hit local minima and thus more (random) starts are required to nd a global optimum Osborne Presnell and Turlach (2000) gave an ef cient procedurebased on convex programming and a dual problem for implementing the LASSO in theregression context Whether or not this approach can be usefully adapted to SCoTLASSwill be investigated in further research Although we are reasonably con dent that ouralgorithm has found global optima in the example of Section 4 and in the simulations thereis no guarantee The jumps that occur in some coef cients such as the change in a36 fromiexcl 0186 to 0586 as t decreases from 175 to 150 in the pitprop data could be due to one ofthe solutions being a local optimum However it seems more likely to us that it is caused bythe change in the nature of the earlier components which together with the orthogonalityconstraint imposed on the third component opens up a different range of possibilities forthe latter component There is clearly much scope for further work on the implementationof SCoTLASS

Remark 5 In a number of examples not reportedhere several of the nonzero loadingsin the SCoTLASS components are exactly equal especially for large values of p and smallvalues of t At present we have no explanation for this but it deserves further investigation

546 I T JOLLIFFE N T TRENDAFILOV AND M UDDIN

Remark 6 The way in which SCoTLASS sets some coef cients to zero is different inconcept from simply truncating to zero the smallest coef cients in a PC The latter attemptsto approximate the PC by a simpler version and can have problems with that approximationas shown by Cadima and Jolliffe (1995) SCoTLASS looks for simple sources of variationand likePCA aims for highvariancebut becauseof simplicityconsiderationsthesimpli edcomponents can in theory be moderately different from the PCs We seek to replace PCsrather than approximate them although because of the shared aim of high variance theresults will often not be too different

Remark 7 There are a number of other recent developments which are relevant tointerpretation problems in multivariate statistics Jolliffe (2002) Chapter 11 reviews thesein the context of PCA Vinesrsquos (2000) use of only a discrete number of values for loadingshas already been mentioned It works well in some examples but for the pitprops datathe components 4ndash6 are rather complex A number of aspects of the strategy for selectinga subset of variables were explored by Cadima and Jolliffe (1995 2001) and by Tanakaand Mori (1997) The LASSO has been generalized in the regression context to so-calledbridge estimation in which the constraint p

j = 1 jshy j j micro t is replaced by pj = 1 jshy j jreg micro t

where reg is not necessarily equal to unitymdashsee for example Fu (1998) Tibshirani (1996)also mentioned the nonnegativegarotte due to Breiman (1995) as an alternative approachTranslation of the ideas of the bridge and nonnegative garotte to the context of PCA andcomparison with other techniques would be of interest in future research


Table 7. Specified Eigenvectors of a Six-Dimensional Intermediate Structure

                          Eigenvectors
Variable      (1)      (2)      (3)      (4)      (5)      (6)
x1          0.224   -0.509    0.604    0.297   -0.327    0.361
x2          0.253   -0.519   -0.361   -0.644   -0.341   -0.064
x3          0.227   -0.553   -0.246    0.377    0.608   -0.267
x4          0.553    0.249   -0.249   -0.052    0.262    0.706
x5          0.521    0.254   -0.258    0.451   -0.509   -0.367
x6          0.507    0.199    0.561   -0.384    0.281   -0.402

Variance    1.795    1.674    0.796    0.618    0.608    0.510

Three types of underlying structure were investigated, which we call block structure, intermediate structure, and uniform structure. Tables 6-8 give one example of each type of structure. The structure in Table 6 has blocks of nontrivial loadings and blocks of near-zero loadings in each underlying component; Table 8 has a structure in which all loadings in the first two components have similar absolute values; and the structure in Table 7 is intermediate to those of Tables 6 and 8. An alternative approach to the simulation study would be to replace the near-zero loadings by exact zeros and the nearly equal loadings by exact equalities. However, we feel that in reality underlying structures are never quite that simple, so we perturbed them a little.
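As an aside on how such structures can be generated: given the specified eigenvectors and variances, the corresponding population correlation matrix follows from the spectral decomposition R = Σ_k λ_k a_k a_k'. The following sketch (in Python with numpy, our choice for illustration rather than the implementation used in the paper) reassembles R for the intermediate structure of Table 7; the diagonal is close to, but not exactly, one because the published loadings are rounded.

```python
import numpy as np

# Columns of V are the specified eigenvectors from Table 7 (intermediate
# structure); lam holds the corresponding variances.  The population
# correlation matrix is reassembled as R = V diag(lam) V'.
V = np.array([
    [ 0.224, -0.509,  0.604,  0.297, -0.327,  0.361],
    [ 0.253, -0.519, -0.361, -0.644, -0.341, -0.064],
    [ 0.227, -0.553, -0.246,  0.377,  0.608, -0.267],
    [ 0.553,  0.249, -0.249, -0.052,  0.262,  0.706],
    [ 0.521,  0.254, -0.258,  0.451, -0.509, -0.367],
    [ 0.507,  0.199,  0.561, -0.384,  0.281, -0.402],
])
lam = np.array([1.795, 1.674, 0.796, 0.618, 0.608, 0.510])

R = V @ np.diag(lam) @ V.T
# Diagonal entries should be close to 1; small departures reflect the
# three-decimal rounding of the published loadings.
print(np.round(np.diag(R), 3))
```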

It might be expected that, if the underlying structure is simple, sampling variation is more likely to take the sample PCs away from simplicity than to enhance it. It is therefore of interest to investigate whether RPCA and SCoTLASS, which increase simplicity relative to the sample PCs, do so in the direction of the true underlying structure. The closeness of a vector of loadings from any of these techniques to the underlying true vector is measured by the angle between the two vectors of interest. These angles are given in Tables 9-11 for single simulated datasets from the three types of six-dimensional structure; they typify what we found in other simulations. Three values of t (apart from that for PCA) are shown in each table. Their exact values are unimportant, and differ slightly between tables; they were chosen to illustrate typical behavior in our simulations.
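The angle criterion is simple to compute; because a loading vector is only defined up to sign, the absolute value of the inner product is taken before applying the arccosine. A minimal sketch (our notation, not code from the paper):

```python
import numpy as np

def angle_deg(a, b):
    """Angle in degrees between two loading vectors, ignoring sign."""
    cos = abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))

# Example: a true vector and a perturbed "sample" estimate of it.
true = np.array([0.224, 0.253, 0.227, 0.553, 0.521, 0.507])
est = true + np.random.default_rng(0).normal(scale=0.05, size=6)
print(round(angle_deg(true, est), 1))
```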

The results illustrate that, for each structure, RPCA is perhaps surprisingly, and certainly disappointingly, bad at recovering the underlying structure.

Table 8. Specified Eigenvectors of a Six-Dimensional Uniform Structure

                          Eigenvectors
Variable      (1)      (2)      (3)      (4)      (5)      (6)
x1         -0.455    0.336   -0.087    0.741   -0.328    0.125
x2         -0.439    0.370   -0.212   -0.630   -0.445   -0.175
x3         -0.415    0.422    0.378   -0.110    0.697    0.099
x4          0.434    0.458    0.040   -0.136   -0.167    0.744
x5          0.301    0.435   -0.697    0.114    0.356   -0.306
x6          0.385    0.416    0.563    0.104   -0.234   -0.545

Variance    1.841    1.709    0.801    0.649    0.520    0.480


Table 9. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS, With Various Values of t, for a Specified "Block" Structure of Correlation Eigenvectors

                               Vectors
Technique         t        (1)     (2)     (3)     (4)
PCA              √6       12.9    12.0    15.4    79.5
RPCA                      37.7    45.2    45.3    83.1
SCoTLASS     t = 2.00     12.9    12.0    15.4    78.6
             t = 1.82     11.9    11.4    13.8    73.2
             t = 1.75      9.4    10.0    12.5    85.2

SCoTLASS, on the other hand, is capable of improvement over PCA. For example, for t = 1.75 it not only improves over PCA in terms of the angles in Table 9, but also has 3, 3, and 2 zero loadings in its first three components, giving a notably simpler structure. None of the methods manages to reproduce the underlying structure for component 4 in Table 9.

The results for intermediate structure in Table 10 are qualitatively similar to those in Table 9, except that SCoTLASS does best for higher values of t than in Table 9. For uniform structure (Table 11), SCoTLASS does badly compared to PCA for all values of t. This is not unexpected: although uniform structure is simple in its own way, it is not the type of simplicity which SCoTLASS aims for. The varimax criterion, likewise, is designed in such a way that it stands little chance of finding uniform structure. Other rotation criteria, such as quartimax, can in theory find uniform vectors of loadings, but they too were tried and found to be unsuccessful in our simulations. A uniform structure is probably more likely to be found by the techniques proposed by Hausman (1982) or Vines (2000). Although SCoTLASS will usually fail to find such structures, their existence may be indicated by a large drop in the variance explained by SCoTLASS as decreasing values of t move it away from PCA.
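To see why uniform structure is invisible to varimax, recall that the raw varimax criterion rewards, for each component, the variance of the squared loadings; a vector whose loadings are equal in absolute value contributes exactly zero. A small illustration (the loading values below are invented for the example):

```python
import numpy as np

def varimax_criterion(A):
    """Raw varimax criterion: sum over columns of the variance of the
    squared loadings; rotation seeks to maximize this."""
    sq = A ** 2
    return np.sum(sq.var(axis=0))

# A "uniform" vector has zero variance of squared loadings, so varimax
# has no incentive to find it; a blocky vector scores much higher.
uniform = np.full((6, 1), 1 / np.sqrt(6))
blocky = np.array([[0.57], [0.57], [0.57], [0.05], [0.05], [0.05]])
print(varimax_criterion(uniform), round(varimax_criterion(blocky), 4))
```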

6. DISCUSSION

A new technique, SCoTLASS, has been introduced for discovering and interpreting the major sources of variability in a dataset. We have illustrated its usefulness in an example, and have also shown through simulations that it is capable of recovering certain types of underlying structure.

Table 10. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS, With Various Values of t, for a Specified "Intermediate" Structure of Correlation Eigenvectors

                               Vectors
Technique         t        (1)     (2)     (3)     (4)
PCA              √6       13.7    15.1    23.6    78.3
RPCA                      53.3    42.9    66.4    77.7
SCoTLASS     t = 2.28      5.5    10.4    23.7    78.3
             t = 2.12      9.5    12.0    23.5    77.8
             t = 2.01     17.5    19.1    22.5    71.7


Table 11. Angles Between the Underlying Vectors and the Sample Vectors of PCA, RPCA, and SCoTLASS, With Various Values of t, for a Specified "Uniform" Structure of Correlation Eigenvectors

                               Vectors
Technique         t        (1)     (2)     (3)     (4)
PCA              √6        6.0     6.4    23.2    31.8
RPCA                      52.0    54.1    39.4    43.5
SCoTLASS     t = 2.15     24.3    24.6    23.4    30.7
             t = 1.91     34.5    34.6    21.8    23.5
             t = 1.73     41.0    41.3    22.5    19.0

It is preferred in many respects to rotated principal components as a means of simplifying the interpretation of a principal component analysis. Although we are convinced of the value of SCoTLASS, there are a number of complications and open questions, which we now list as a set of remarks.

Remark 1. In this article we have applied the techniques studied to correlation matrices. Although it is less common in practice, PCA and RPCA can also be implemented on covariance matrices, in which case PCA successively finds uncorrelated linear functions of the original unstandardized variables. SCoTLASS can also be implemented in this case, the only difference being that the sample correlation matrix R is replaced by the sample covariance matrix S in equation (3.1). We have investigated covariance-based SCoTLASS, both for real examples and using simulation studies. Some details of its performance differ from the correlation-based case, but qualitatively they are similar. In particular, there are a number of reasons to prefer SCoTLASS to RPCA.
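The substitution is purely mechanical: for a data matrix X, the correlation-based analysis hands R to the optimizer and the covariance-based variant hands S, as in this trivial sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))        # 50 observations on p = 6 variables

R = np.corrcoef(X, rowvar=False)    # correlation-based SCoTLASS uses R
S = np.cov(X, rowvar=False)         # covariance-based SCoTLASS uses S instead
# The objective and constraints are unchanged; only the matrix in the
# quadratic form a'Ra of equation (3.1) is replaced by a'Sa.
```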

Remark 2. In PCA, the constraint a_h'a_k = 0 (orthogonality of vectors of loadings) is equivalent to a_h'Ra_k = 0 (different components are uncorrelated). This equivalence is special to the PCs, and is a consequence of the a_k being eigenvectors of R. When we rotate the PC loadings, we lose at least one of these two properties (Jolliffe 1995). Similarly, in SCoTLASS, if we impose a_h'a_k = 0, we no longer have uncorrelated components. For example, Table 12 gives the correlations between the six SCoTLASS components when t = 1.75 for the pitprop data. Although most of the correlations in Table 12 are small in absolute value, there are also nontrivial ones (r_12, r_14, r_34, and r_35).

Table 12. Correlation Matrix for the First Six SCoTLASS Components for t = 1.75, Using Jeffers' Pitprop Data

Components     (1)     (2)     (3)     (4)     (5)     (6)
(1)          1.000   0.375  -0.234   0.443  -0.010   0.061
(2)                  1.000  -0.114  -0.076  -0.145  -0.084
(3)                          1.000  -0.438   0.445  -0.187
(4)                                  1.000  -0.105   0.141
(5)                                          1.000  -0.013
(6)                                                  1.000


It is possible to replace the constraint a_h'a_k = 0 in SCoTLASS by a_h'Ra_k = 0, thus choosing to have uncorrelated components rather than orthogonal loadings, but this option is not explored in the present article.
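For completeness, the entries of Table 12 can be obtained from the loading matrix A and the correlation matrix R of the data, since var(a'x) = a'Ra and cov(a_h'x, a_k'x) = a_h'Ra_k. A sketch follows; the pitprop loadings themselves are not reproduced here, so the check instead uses PCA eigenvectors, for which the result must be the identity:

```python
import numpy as np

def component_correlations(A, R):
    """Correlation matrix of the components z_k = a_k'x, where the columns
    of A are loading vectors and R is the correlation matrix of x."""
    C = A.T @ R @ A                    # covariance matrix of the components
    d = np.sqrt(np.diag(C))
    return C / np.outer(d, d)

# With PCA eigenvectors the components are uncorrelated, so the result is
# the identity; SCoTLASS loadings instead give off-diagonal entries like
# those reported in Table 12.
rng = np.random.default_rng(0)
R = np.corrcoef(rng.normal(size=(100, 13)), rowvar=False)
_, vecs = np.linalg.eigh(R)
print(np.allclose(component_correlations(vecs, R), np.eye(13)))   # True
```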

Remark 3. The choice of t is clearly important in SCoTLASS. As t decreases, simplicity increases but variation explained decreases, and we need to achieve a suitable tradeoff between these two properties. The correlation between components noted in the previous remark is another aspect of any tradeoff: correlations are small for t close to √p, but have the potential to increase as t decreases. It might be possible to construct a criterion which defines the "best tradeoff," but there is no unique construction, because of the difficulty of deciding how to measure simplicity and how to combine variance, simplicity, and correlation. At present it seems best to compute the SCoTLASS components for several values of t and judge subjectively at what point a balance between these various aspects is achieved. In our example we used the same value of t for all components in a dataset, but varying t for different components is another possibility.
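One way to organize the subjective choice is to tabulate, over a grid of t between √p and 1, the variance explained and a count of near-zero loadings. The sketch below assumes a hypothetical solver function scotlass(R, t) returning a loading matrix with unit-norm columns; it illustrates only the bookkeeping, not the optimizer:

```python
import numpy as np

def tradeoff_table(R, scotlass, n_grid=8):
    """Tabulate the simplicity/variance tradeoff over a grid of t values.
    `scotlass(R, t)` is a hypothetical solver returning a p x p loading
    matrix with unit-norm columns."""
    p = R.shape[0]
    for t in np.linspace(np.sqrt(p), 1.0, n_grid):
        A = scotlass(R, t)
        variance = np.trace(A.T @ R @ A)          # total component variance
        n_zero = int((np.abs(A) < 1e-3).sum())    # count of near-zero loadings
        print(f"t = {t:5.2f}  variance = {variance:7.3f}  zeros = {n_zero}")
```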

Remark 4. Our algorithms for SCoTLASS are slower than those for PCA. This is because SCoTLASS is implemented subject to an extra restriction on PCA, and we lose the advantage of calculation via the singular value decomposition, which makes the PCA algorithm fast. Sequential PCA with an extra constraint requires a good optimizer to produce a global optimum. In the implementation of SCoTLASS a projected gradient method is used, which is globally convergent and preserves accurately both the equality and the inequality constraints. It should be noted that, as t is reduced from √p downwards towards unity, the CPU time taken to optimize the objective function remains generally the same (11 seconds on average on a 1-GHz PC), but as t decreases the algorithm becomes progressively more prone to hitting local minima, and thus more (random) starts are required to find a global optimum. Osborne, Presnell, and Turlach (2000) gave an efficient procedure, based on convex programming and a dual problem, for implementing the LASSO in the regression context. Whether or not this approach can be usefully adapted to SCoTLASS will be investigated in further research. Although we are reasonably confident that our algorithm has found global optima in the example of Section 4 and in the simulations, there is no guarantee. The jumps that occur in some coefficients, such as the change in a_36 from -0.186 to 0.586 as t decreases from 1.75 to 1.50 in the pitprop data, could be due to one of the solutions being a local optimum. However, it seems more likely to us that this is caused by the change in the nature of the earlier components, which, together with the orthogonality constraint imposed on the third component, opens up a different range of possibilities for the latter component. There is clearly much scope for further work on the implementation of SCoTLASS.
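The multiple random starts mentioned above follow the standard multistart pattern; `optimize_component` below is a hypothetical stand-in for the projected gradient routine, assumed to return a loading vector and its variance:

```python
import numpy as np

def multistart(optimize_component, R, t, n_starts=20, seed=0):
    """Keep the best of several randomly started local optimizations.
    `optimize_component(R, t, a0)` is a hypothetical stand-in for the
    projected gradient routine, assumed to return (loadings, variance)."""
    rng = np.random.default_rng(seed)
    best_a, best_var = None, -np.inf
    for _ in range(n_starts):
        a0 = rng.normal(size=R.shape[0])
        a0 /= np.linalg.norm(a0)                 # random unit-norm start
        a, var = optimize_component(R, t, a0)
        if var > best_var:
            best_a, best_var = a, var
    return best_a, best_var
```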

Remark 5. In a number of examples not reported here, several of the nonzero loadings in the SCoTLASS components are exactly equal, especially for large values of p and small values of t. At present we have no explanation for this, but it deserves further investigation.


Remark 6. The way in which SCoTLASS sets some coefficients to zero is different in concept from simply truncating to zero the smallest coefficients in a PC. The latter attempts to approximate the PC by a simpler version, and can have problems with that approximation, as shown by Cadima and Jolliffe (1995). SCoTLASS looks for simple sources of variation and, like PCA, aims for high variance, but because of simplicity considerations the simplified components can in theory be moderately different from the PCs. We seek to replace PCs rather than approximate them, although, because of the shared aim of high variance, the results will often not be too different.
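For contrast, naive truncation can be written in two lines; the point of the remark is that the truncated vector merely approximates a PC, whereas a SCoTLASS component is the solution of a different optimization problem. (The threshold and loadings below are invented for illustration.)

```python
import numpy as np

def truncate_loadings(a, thresh=0.2):
    """Zero all loadings smaller than `thresh` in absolute value and
    renormalize to unit length; Cadima and Jolliffe (1995) show this
    kind of approximation can be misleading."""
    b = np.where(np.abs(a) < thresh, 0.0, a)
    return b / np.linalg.norm(b)

a = np.array([0.55, 0.52, 0.51, 0.25, 0.22, 0.19])   # invented loadings
print(truncate_loadings(a))
```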

Remark 7. There are a number of other recent developments which are relevant to interpretation problems in multivariate statistics; Jolliffe (2002, chap. 11) reviews these in the context of PCA. Vines's (2000) use of only a discrete number of values for loadings has already been mentioned. It works well in some examples, but for the pitprop data components 4-6 are rather complex. A number of aspects of the strategy for selecting a subset of variables were explored by Cadima and Jolliffe (1995, 2001) and by Tanaka and Mori (1997). The LASSO has been generalized in the regression context to so-called bridge estimation, in which the constraint Σ_{j=1}^p |β_j| ≤ t is replaced by Σ_{j=1}^p |β_j|^α ≤ t, where α is not necessarily equal to unity; see, for example, Fu (1998). Tibshirani (1996) also mentioned the nonnegative garotte, due to Breiman (1995), as an alternative approach. Translation of the ideas of the bridge and nonnegative garotte to the context of PCA, and comparison with other techniques, would be of interest in future research.
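For reference, the bridge constraint function interpolates between the LASSO bound (α = 1) and a ridge-type bound (α = 2); a two-line illustration:

```python
import numpy as np

def bridge_norm(beta, alpha):
    """Value of the bridge constraint function sum_j |beta_j|**alpha."""
    return np.sum(np.abs(beta) ** alpha)

beta = np.array([0.6, -0.3, 0.0, 0.1])
for alpha in (0.5, 1.0, 2.0):            # alpha = 1 recovers the LASSO bound
    print(alpha, round(bridge_norm(beta, alpha), 3))
```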

ACKNOWLEDGMENTS

An earlier draft of this article was prepared while the first author was visiting the Bureau of Meteorology Research Centre (BMRC), Melbourne, Australia. He is grateful to BMRC for the support and facilities provided during his visit, and to the Leverhulme Trust for partially supporting the visit under their Study Abroad Fellowship scheme. Comments from two referees and the editor have helped to improve the clarity of the article.

[Received November 2000. Revised July 2002.]

REFERENCES

Breiman, L. (1995), "Better Subset Regression Using the Nonnegative Garotte," Technometrics, 37, 373-384.

Cadima, J., and Jolliffe, I. T. (1995), "Loadings and Correlations in the Interpretation of Principal Components," Journal of Applied Statistics, 22, 203-214.

Cadima, J., and Jolliffe, I. T. (2001), "Variable Selection and the Interpretation of Principal Subspaces," Journal of Agricultural, Biological, and Environmental Statistics, 6, 62-79.

Chu, M. T., and Trendafilov, N. T. (2001), "The Orthogonally Constrained Regression Revisited," Journal of Computational and Graphical Statistics, 10, 1-26.

Fu, J. W. (1998), "Penalized Regression: The Bridge Versus the Lasso," Journal of Computational and Graphical Statistics, 7, 397-416.

Goffe, W. L., Ferrier, G. D., and Rogers, J. (1994), "Global Optimizations of Statistical Functions with Simulated Annealing," Journal of Econometrics, 60, 65-99.

Hausman, R. (1982), "Constrained Multivariate Analysis," in Optimization in Statistics, eds. S. H. Zanckis and J. S. Rustagi, Amsterdam: North-Holland, pp. 137-151.

Helmke, U., and Moore, J. B. (1994), Optimization and Dynamical Systems, London: Springer.

Jeffers, J. N. R. (1967), "Two Case Studies in the Application of Principal Component Analysis," Applied Statistics, 16, 225-236.

Jolliffe, I. T. (1989), "Rotation of Ill-Defined Principal Components," Applied Statistics, 38, 139-147.

Jolliffe, I. T. (1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied Statistics, 22, 29-35.

Jolliffe, I. T. (2002), Principal Component Analysis (2nd ed.), New York: Springer-Verlag.

Jolliffe, I. T., and Uddin, M. (2000), "The Simplified Component Technique: An Alternative to Rotated Principal Components," Journal of Computational and Graphical Statistics, 9, 689-710.

Jolliffe, I. T., Uddin, M., and Vines, S. K. (2002), "Simplified EOFs: Three Alternatives to Rotation," Climate Research, 20, 271-279.

Krzanowski, W. J., and Marriott, F. H. C. (1995), Multivariate Analysis, Part II, London: Arnold.

LeBlanc, M., and Tibshirani, R. (1998), "Monotone Shrinkage of Trees," Journal of Computational and Graphical Statistics, 7, 417-433.

McCabe, G. P. (1984), "Principal Variables," Technometrics, 26, 137-144.

Morton, S. C. (1989), "Interpretable Projection Pursuit," Technical Report 106, Department of Statistics, Stanford University.

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000), "On the LASSO and its Dual," Journal of Computational and Graphical Statistics, 9, 319-337.

Tanaka, Y., and Mori, Y. (1997), "Principal Component Analysis Based on a Subset of Variables: Variable Selection and Sensitivity Analysis," American Journal of Mathematical and Management Sciences, 17, 61-89.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267-288.

Uddin, M. (1999), "Interpretation of Results from Simplified Principal Components," PhD thesis, University of Aberdeen, Aberdeen, Scotland.

Vines, S. K. (2000), "Simple Principal Components," Applied Statistics, 49, 441-451.
