
Chemometrics and Intelligent Laboratory Systems 58 (2001) 109–130
www.elsevier.com/locate/chemometrics

PLS-regression: a basic tool of chemometrics

Svante Wold a,*, Michael Sjöström a, Lennart Eriksson b

a Research Group for Chemometrics, Institute of Chemistry, Umeå University, SE-901 87 Umeå, Sweden

b Umetrics AB, Box 7960, SE-907 19 Umeå, Sweden

Abstract

PLS-regression (PLSR) is the PLS approach in its simplest, and in chemistry and technology, most used form (two-block predictive PLS). PLSR is a method for relating two data matrices, X and Y, by a linear multivariate model, but goes beyond traditional regression in that it also models the structure of X and Y. PLSR derives its usefulness from its ability to analyze data with many, noisy, collinear, and even incomplete variables in both X and Y. PLSR has the desirable property that the precision of the model parameters improves with the increasing number of relevant variables and observations.

This article reviews PLSR as it has developed to become a standard tool in chemometrics and used in chemistry and engineering. The underlying model and its assumptions are discussed, and commonly used diagnostics are reviewed together with the interpretation of resulting parameters.

Two examples are used as illustrations: First, a Quantitative Structure–Activity Relationship (QSAR)/Quantitative Structure–Property Relationship (QSPR) data set of peptides is used to outline how to develop, interpret and refine a PLSR model. Second, a data set from the manufacturing of recycled paper is analyzed to illustrate time series modelling of process data by means of PLSR and time-lagged X-variables. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: PLS; PLSR; Two-block predictive PLS; Latent variables; Multivariate analysis

1. Introduction

In this article we review a particular type of multivariate analysis, namely PLS-regression, which uses the two-block predictive PLS model to model the relationship between two matrices, X and Y. In addition, PLSR models the "structure" of X and of Y, which gives richer results than the traditional multiple regression approach. PLSR and similar approaches provide quantitative multivariate modelling methods, with inferential possibilities similar to multiple regression, t-tests and ANOVA.

* Corresponding author. Tel.: +46-90-786-5563; fax: +46-90-13-88-85.

E-mail address: [email protected] (S. Wold).

The present volume contains numerous examples of the use of PLSR in chemistry, and this article is merely an introductory review, showing the development of PLSR in chemistry until around the year 1990.

1.1. General considerations

PLS-regression (PLSR) is a recently developed generalization of multiple linear regression (MLR) [1–6]. PLSR is of particular interest because, unlike MLR, it can analyze data with strongly collinear (correlated), noisy, and numerous X-variables, and also simultaneously model several response variables, Y, i.e., profiles of performance. For the meaning of the PLS acronym, see Section 1.2.



The regression problem, i.e., how to model one or several dependent variables, responses, Y, by means of a set of predictor variables, X, is one of the most common data-analytical problems in science and technology. Examples in chemistry include relating Y = properties of chemical samples to X = their chemical composition, relating Y = the quality and quantity of manufactured products to X = the conditions of the manufacturing process, and Y = chemical properties, reactivity or biological activity of a set of molecules to X = their chemical structure (coded by means of many X-variables). The latter models are often called QSPR or QSAR. Abbreviations are explained in Section 1.3.

Traditionally, this modelling of Y by means of X is done using MLR, which works well as long as the X-variables are fairly few and fairly uncorrelated, i.e., X has full rank. With modern measuring instrumentation, including spectrometers, chromatographs and sensor batteries, the X-variables tend to be many and also strongly correlated. We shall therefore not call them "independent", but instead "predictors", or just X-variables, because they usually are correlated, noisy, and incomplete.

In handling numerous and collinear X-variables, and response profiles (Y), PLSR allows us to investigate more complex problems than before, and analyze available data in a more realistic way. However, some humility and caution is warranted; we are still far from a good understanding of the complications of chemical, biological, and economical systems. Also, quantitative multivariate analysis is still in its infancy, particularly in applications with many variables and few observations (objects, cases).

1.2. A historical note

The PLS approach was originated around 1975 by Herman Wold for the modelling of complicated data sets in terms of chains of matrices (blocks), so-called path models, reviewed in Ref. [1]. This included a simple but efficient way to estimate the parameters in these models, called NIPALS (Non-linear Iterative PArtial Least Squares). This led, in turn, to the acronym PLS for these models (Partial Least Squares). This relates to the central part of the estimation, namely that each model parameter is iteratively estimated as the slope of a simple bivariate regression (least squares) between a matrix column or row as the y-variable, and another parameter vector as the x-variable. So, for instance, the PLS weights, w, are iteratively re-estimated as X'u/(u'u) (see Section 3.10). The "partial" in PLS indicates that this is a partial regression, since the x-vector (u above) is considered as fixed in the estimation. This also shows that we can see any matrix–vector multiplication as equivalent to a set of simple bivariate regressions. This provides an intriguing connection between two central operations in matrix algebra and statistics, as well as giving a simple way to deal with missing data.

Gerlach et al. [7] applied multi-block PLS to the analysis of analytical data from a river system in Colorado with interesting results, but this was clearly ahead of its time.

Around 1980, the simplest PLS model with two blocks (X and Y) was slightly modified by Svante Wold and Harald Martens to better suit data from science and technology, and shown to be useful to deal with complicated data sets where ordinary regression was difficult or impossible to apply. To give PLS a more descriptive meaning, H. Wold et al. have also recently started to interpret PLS as Projection to Latent Structures.

1.3. Abbreviations

AA	Amino Acid
ANOVA	ANalysis Of VAriance
AR	AutoRegressive (model)
ARMA	AutoRegressive Moving Average (model)
CV	Cross-Validation
CVA	Canonical Variates Analysis
DModX	Distance to Model in X-space
EM	Expectation Maximization
H-PLS	Hierarchical PLS
LDA	Linear Discriminant Analysis
LV	Latent Variable
MA	Moving Average (model)
MLR	Multiple Linear Regression
MSPC	Multivariate SPC
NIPALS	Non-linear Iterative Partial Least Squares
NN	Neural Networks


PCA	Principal Components Analysis
PCR	Principal Components Regression
PLS	Partial Least Squares projection to latent structures
PLSR	PLS-Regression
PLS-DA	PLS Discriminant Analysis
PRESD	Predictive RSD
PRESS	Predictive Residual Sum of Squares
QSAR	Quantitative Structure–Activity Relationship
QSPR	Quantitative Structure–Property Relationship
RSD	Residual SD
SD	Standard Deviation
SDEP, SEP	Standard error of prediction
SECV	Standard error of cross-validation
SIMCA	Simple Classification Analysis
SPC	Statistical Process Control
SS	Sum of Squares
VIP	Variable Influence on Projection

1.4. Notation

We shall employ the common notation where column vectors are denoted by bold lower case characters, e.g., v, and row vectors shown as transposed, e.g., v'. Bold upper case characters denote matrices, e.g., X.

*	multiplication, e.g., A*B
'	transpose, e.g., v', X'
a	index of components (model dimensions); a = 1, 2, ..., A
A	number of components in a PC or PLS model
i	index of objects (observations, cases); i = 1, 2, ..., N
N	number of objects (cases, observations)
k	index of X-variables (k = 1, 2, ..., K)
m	index of Y-variables (m = 1, 2, ..., M)
X	matrix of predictor variables, size (N × K)
Y	matrix of response variables, size (N × M)
b_m	regression coefficient vector of the mth y; size (K × 1)
B	matrix of regression coefficients of all Y's; size (K × M)
c_a	PLSR Y-weights of component a
C	the (M × A) Y-weight matrix; c_a are columns in this matrix
E	the (N × K) matrix of X-residuals
f_m	residuals of the mth y-variable; (N × 1) vector
F	the (N × M) matrix of Y-residuals
G	the number of CV groups (g = 1, 2, ..., G)
p_a	PLSR X-loading vector of component a
P	loading matrix; p_a are columns of P
R2	multiple correlation coefficient; amount of Y "explained" in terms of SS
R2X	amount of X "explained" in terms of SS
Q2	cross-validated R2; amount of Y "predicted"
t_a	X-scores of component a
T	score matrix (N × A), where the columns are t_a
u_a	Y-scores of component a
U	score matrix (N × A), where the columns are u_a
w_a	PLSR X-weights of component a
W	the (K × A) X-weight matrix; w_a are columns in this matrix
w*_a	PLSR weights transformed to be independent between components
W*	(K × A) matrix of transformed PLSR weights; w*_a are columns in W*

2. Example 1, a quantitative structure–property relationship (QSPR)

We use a simple example from the literature with one Y-variable and seven X-variables. The problem is one of QSPR or QSAR, which differ only in that the responses (Y) are chemical properties in the former and biological activities in the latter. In both cases, X contains a quantitative description of the variation in chemical structure between the investigated compounds.

The objective is to understand the variation of y = DDGTS = the free energy of unfolding of a protein (tryptophane synthase α unit of bacteriophage T4 lysozyme) when position 49 is modified to contain each of the 19 coded amino acids (AA's) except arginine. The AA's are described by seven highly correlated X-variables as shown in Table 1. Computational and other details are given in Ref. [8]. For alternative ways, see Hellberg et al. [9] and Sandberg et al. [10].


Table 1
Raw data of example 1

          PIE     PIF     DGR     SAC     MR      Lam     Vol     DDGTS
 1 Ala    0.23    0.31   -0.55   254.2   2.126   -0.02    82.2    8.5
 2 Asn   -0.48   -0.60    0.51   303.6   2.994   -1.24   112.3    8.2
 3 Asp   -0.61   -0.77    1.20   287.9   2.994   -1.08   103.7    8.5
 4 Cys    0.45    1.54   -1.40   282.9   2.933   -0.11    99.1   11.0
 5 Gln   -0.11   -0.22    0.29   335.0   3.458   -1.19   127.5    6.3
 6 Glu   -0.51   -0.64    0.76   311.6   3.243   -1.43   120.5    8.8
 7 Gly    0.00    0.00    0.00   224.9   1.662    0.03    65.0    7.1
 8 His    0.15    0.13   -0.25   337.2   3.856   -1.06   140.6   10.1
 9 Ile    1.20    1.80   -2.10   322.6   3.350    0.04   131.7   16.8
10 Leu    1.28    1.70   -2.00   324.0   3.518    0.12   131.5   15.0
11 Lys   -0.77   -0.99    0.78   336.6   2.933   -2.26   144.3    7.9
12 Met    0.90    1.23   -1.60   336.3   3.860   -0.33   132.3   13.3
13 Phe    1.56    1.79   -2.60   366.1   4.638   -0.05   155.8   11.2
14 Pro    0.38    0.49   -1.50   288.5   2.876   -0.31   106.7    8.2
15 Ser    0.00   -0.04    0.09   266.7   2.279   -0.40    88.5    7.4
16 Thr    0.17    0.26   -0.58   283.9   2.743   -0.53   105.3    8.8
17 Trp    1.85    2.25   -2.70   401.8   5.755   -0.31   185.9    9.9
18 Tyr    0.89    0.96   -1.70   377.8   4.791   -0.84   162.7    8.8
19 Val    0.71    1.22   -1.60   295.1   3.054   -0.13   115.6   12.0

Correlation matrix
          PIE     PIF     DGR     SAC     MR      Lam     Vol     DDGTS
PIE       1.000   0.967  -0.970   0.518   0.650   0.704   0.533   0.645
PIF       0.967   1.000  -0.968   0.416   0.555   0.750   0.433   0.711
DGR      -0.970  -0.968   1.000  -0.463  -0.582  -0.704  -0.484  -0.648
SAC       0.518   0.416  -0.463   1.000   0.955  -0.230   0.991   0.268
MR        0.650   0.555  -0.582   0.955   1.000  -0.027   0.945   0.290
Lam       0.704   0.750  -0.704  -0.230  -0.027   1.000  -0.221   0.499
Vol       0.533   0.433  -0.484   0.991   0.945  -0.221   1.000   0.300
DDGTS     0.645   0.711  -0.648   0.268   0.290   0.499   0.300   1.000

The lower half of the table shows the pair-wise correlation coefficients of the data. PIE and PIF are the lipophilicity constants of the AA side chain according to El Tayar et al. [8], and Fauchere and Pliska, respectively; DGR is the free energy of transfer of an AA side chain from protein interior to water according to Radzicka and Wolfenden; SAC is the water-accessible surface area of AA's calculated by MOLSV; MR is the molecular refractivity (from the Daylight data base); Lam is a polarity parameter according to El Tayar et al. [8]; Vol is the molecular volume of AA's calculated by MOLSV. All the data except MR are from Ref. [8].


3. PLSR and the underlying scientific model

PLSR is a way to estimate parameters in a scientific model, which basically is linear (see Section 4.3 for non-linear PLS models). This model, like any scientific model, consists of several parts: the philosophical, the conceptual, the technical, the numerical, the statistical, and so on. We here illustrate these using the QSPR/QSAR model of example 1 (see above), but the arguments are similar in most other modelling in science and technology.

Our chemical thinking makes us formulate the influence of structural change on activity (and other properties) in terms of "effects": lipophilic, steric, polar, polarizability, hydrogen bonding, and possibly others. Similarly, the modelling of a chemical process is interpreted using "effects" of thermodynamics (equilibria, heat transfer, mass transfer, pressures, temperatures, and flows) and chemical kinetics (reaction rates).

Although this formulation of the scientific model is not of immediate concern for the technicalities of PLSR, it is of interest that PLSR modelling is consistent with "effects" causing the changes in the investigated system. The concept of latent variables (Section 4.1) may be seen as directly corresponding to these effects.

3.1. The data: X and Y

The PLSR model is developed from a training set of N observations (objects, cases, compounds, process time points) with K X-variables denoted by x_k (k = 1, ..., K), and M Y-variables y_m (m = 1, 2, ..., M). These training data form the two matrices X and Y of dimensions (N × K) and (N × M), respectively, as shown in Fig. 1. In example 1, N = 19, K = 7, and M = 1.

Later, predictions for new observations are made based on their X-data. This gives predicted X-scores (t-values), X-residuals, their residual SD's, and y-values with confidence intervals.

3.2. Transformation, scaling and centering

Before the analysis, the X- and Y-variables are often transformed to make their distributions fairly symmetrical. Thus, variables with a range of more than one magnitude of 10 are often logarithmically transformed.

Fig. 1. Data of a PLSR can be arranged as two tables, matrices, X and Y. Note that the raw data may have been transformed (e.g., logarithmically), and usually have been centered and scaled before the analysis. In QSAR, the X-variables are descriptors of chemical structure, and the Y-variables measurements of biological activity.

If the value zero occurs in a variable, the fourth root transformation is a good alternative to the logarithm. The response variable in example 1 is already logarithmically transformed, i.e., expressed in thermodynamic units. No further transformation was made of the example 1 data.

Results of projection methods such as PLSR depend on the scaling of the data. With an appropriate scaling, one can focus the model on more important Y-variables, and use experience to increase the weights of more informative X-variables. In the absence of knowledge about the relative importance of the variables, the standard multivariate approach is to (i) scale each variable to unit variance by dividing them by their SD's, and (ii) center them by subtracting their averages, so-called auto-scaling. This corresponds to giving each variable (column) the same weight, the same prior importance in the analysis.

In example 1, all variables were auto-scaled, except x6 = Lam, the auto-scaling weight of which was multiplied by 1.5 so that it is not masked by the others (so-called block-scaling).

In CoMFA and GRID-QSAR, as well as in multivariate calibration in analytical chemistry, auto-scaling is often not the best scaling of X; non-scaled X-data or some intermediate between auto-scaled and non-scaled may be appropriate [5]. In process data, the acceptable interval of variation of each variable can form the basis for the scaling.

For ease of interpretation and for numerical stability, it is recommended to center the data before the analysis. This is done, either before or after scaling, by subtracting the averages from all variables both in X and Y. In process data, other reference values such as set point values may be subtracted instead of averages. Hence, the analysis concerns the deviations from the reference points and how they are related. The centering leaves the interpretation of the results unchanged.
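To make the pre-treatment concrete, the following minimal Python/NumPy sketch (our illustration; the paper itself contains no code) auto-scales and centers a data matrix, with an optional per-variable weight such as the 1.5 used for Lam in example 1:

```python
import numpy as np

def autoscale(X, weights=None):
    """Center each column and scale it to unit variance.

    weights: optional per-variable multipliers applied after
    scaling (e.g., 1.5 for a variable that should not be masked
    by the others). Returns the scaled matrix plus the means and
    SDs needed to treat new observations the same way."""
    mean = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)          # sample SD, one per column
    Xs = (X - mean) / sd
    if weights is not None:
        Xs = Xs * weights               # up- or down-weight variables
    return Xs, mean, sd

# Example: seven X-variables, the sixth (Lam) up-weighted by 1.5
rng = np.random.default_rng(0)
X = rng.normal(size=(19, 7))            # stand-in for the Table 1 data
w = np.ones(7)
w[5] = 1.5
Xs, mean, sd = autoscale(X, weights=w)
```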

In some applications it is customary to normalize also the observations (objects). In chemistry, this is often done in the analysis of chromatographic or spectral profiles. The normalization is typically done by making the sum of all peaks of one profile be 100 or 1000. This removes the size of the observations (objects), which may be desirable if size is irrelevant. This is closely related to correspondence analysis [11].


3.3. The PLSR model

The linear PLSR model finds a few "new" variables, which are estimates of the LV's or their rotations. These new variables are called X-scores and denoted by t_a (a = 1, 2, ..., A). The X-scores are predictors of Y and also model X (Eqs. (4) and (2) below), i.e., both Y and X are assumed to be, at least partly, modelled by the same LV's.

The X-scores are "few" (A in number), and orthogonal. They are estimated as linear combinations of the original variables x_k with the coefficients, "weights", w*_ka (a = 1, 2, ..., A). These weights are sometimes denoted by r_ka [12,13]. Below, formulas are shown both in element and matrix form (the latter in parentheses):

t_ia = Σ_k w*_ka x_ik   (T = XW*)   (1)

The X-scores (t_a's) have the following properties:

(a) They are, multiplied by the loadings p_ak, good "summaries" of X, so that the X-residuals, e_ik, in Eq. (2) are "small":

x_ik = Σ_a t_ia p_ak + e_ik   (X = TP' + E)   (2)

With multivariate Y (when M > 1), the corresponding "Y-scores" (u_a) are, multiplied by the weights c_am, good "summaries" of Y, so that the residuals, g_im, in Eq. (3) are "small":

y_im = Σ_a u_ia c_am + g_im   (Y = UC' + G)   (3)

(b) The X-scores are good predictors of Y, i.e.:

y_im = Σ_a c_ma t_ia + f_im   (Y = TC' + F)   (4)

The Y-residuals, f_im, express the deviations between the observed and modelled responses, and comprise the elements of the Y-residual matrix, F.

Because of Eq. (1), Eq. (4) can be rewritten to look as a multiple regression model:

y_im = Σ_a c_ma Σ_k w*_ka x_ik + f_im = Σ_k b_mk x_ik + f_im   (Y = XW*C' + F = XB + F)   (5)

The "PLS-regression coefficients", b_mk (B), can be written as:

b_mk = Σ_a c_ma w*_ka   (B = W*C')   (6)

Note that these b's are not independent unless A (the number of PLSR components) equals K (the number of X-variables). Hence, their confidence intervals according to the traditional statistical interpretation are infinite.

An interesting special case is at hand when there is a single y-variable (M = 1) and X'X is diagonal, i.e., X originates from an orthogonal design (fractional factorial, Plackett–Burman, etc.). In this case there is no correlation structure in X, and PLSR arrives at the MLR solution in one component [14], and the MLR and PLS-regression coefficients are equal to w1c1'.

After each component, a, the X-matrix is "deflated" by subtracting t_ia p_ak from x_ik (t_a p_a' from X). This makes the PLSR model alternatively be expressed in weights w_a referring to the residuals after the previous dimension, E_{a-1}, instead of relating to the X-variables themselves. Thus, instead of Eq. (1), we can write:

t_ia = Σ_k w_ka e_{ik,a-1}   (t_a = E_{a-1} w_a)   (7a)

e_{ik,a-1} = e_{ik,a-2} - t_{i,a-1} p_{a-1,k}   (E_{a-1} = E_{a-2} - t_{a-1} p_{a-1}')   (7b)

e_{ik,0} = x_ik   (E_0 = X)   (7c)

However, the weights, w, can be transformed to w*, which directly relate to X, giving Eq. (1) above. The relation between the two is given as [14]:

W* = W(P'W)^-1   (8)

The Y-matrix can also be "deflated" by subtracting t_a c_a', but this is not necessary; the results are equivalent with or without Y-deflation.

From the PLSR algorithm (see below), one can see that the first weight vector (w1) is the first eigenvector of the combined variance–covariance matrix, X'YY'X, and the following weight vectors (component a) are eigenvectors of the deflated versions of the same matrix, i.e., Z_a'YY'Z_a, where Z_a = Z_{a-1} - T_{a-1}P_{a-1}'. Similarly, the first score vector (t1) is an eigenvector of XX'YY', and later X-score vectors (t_a) are eigenvectors of Z_aZ_a'YY'.


These eigenvector relationships also show that the vectors w_a form an ortho-normal set, and that the vectors t_a are orthogonal to each other. The loading vectors (p_a) are not orthogonal to each other, and neither are the Y-scores, u_a. The u's and the p's are orthogonal to the t's and the w's, respectively, one and more components earlier, i.e., u_b't_a = 0 and p_b'w_a = 0, if b > a. Also, w_a'p_a = 1.0.
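In matrix terms, once W, P, and C have been estimated, the transformed weights and regression coefficients of Eqs. (8) and (6) follow directly. A minimal sketch (assuming W and P are (K × A) and C is (M × A), as in Section 1.4):

```python
import numpy as np

def pls_coefficients(W, P, C):
    """Eqs. (8) and (6): W* = W (P'W)^-1, then B = W* C'."""
    W_star = W @ np.linalg.inv(P.T @ W)   # Eq. (8); P'W is (A x A)
    B = W_star @ C.T                      # Eq. (6); B is (K x M)
    return W_star, B

# Predictions for new, centered and scaled X-data then follow Eq. (5):
#   T_new = X_new @ W_star    (scores, Eq. (1))
#   Y_hat = X_new @ B         (responses, Eq. (5))
```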

3.4. Interpretation of the PLSR model

One way to see PLSR is that it forms "new x-variables" (LV estimates), t_a, as linear combinations of the old x's, and thereafter uses these new t's as predictors of Y. Hence, PLSR is based on a linear model (see, however, Section 4.3). Only as many new t's are formed as are needed, as are predictively significant (Section 3.8).

All parameters, t, u, w (and w*), p, and c (see Fig. 1), are determined by a PLSR algorithm as described below. For the interpretation of the PLSR model, the scores, t and u, contain the information about the objects and their similarities/dissimilarities with respect to the given problem and model.

The weights w_a (or the closely similar w*_a, see below), and c_a, give information about how the variables combine to form the quantitative relation between X and Y, thus providing an interpretation of the scores, t_a and u_a. Hence, these weights are essential for the understanding of which X-variables are important (numerically large w_a-values), and which X-variables provide the same information (similar profiles of w_a-values).

The PLS weights w_a express both the "positive" correlations between X and Y, and the "compensation correlations" needed to predict Y from X clear from the secondary variation in X. The latter is everything varying in X that is not primarily related to Y. This makes w_a difficult to interpret directly, especially in later components (a > 1). By using an orthogonal expansion of the X-parameters in O-PLS, one can get the part of w_a that primarily relates to Y, thus making the PLS interpretation more clear [15].

The part of the data that is not explained by the model, the residuals, is of diagnostic interest. Large Y-residuals indicate that the model is poor, and a normal probability plot of the residuals of a single Y-variable is useful for identifying outliers in the relationship between T and Y, analogously to MLR. In PLSR we also have residuals for X; the part not used in the modelling of Y. These X-residuals are useful for identifying outliers in the X-space, i.e., molecules with structures that do not fit the model, and process points deviating from "normal" process operations. This, together with control charts of the X-scores, t_a, is used in multivariate SPC (MSPC) [16].

3.5. Geometric interpretation

PLSR is a projection method and thus has a simple geometric interpretation as a projection of the X-matrix (a swarm of N points in a K-dimensional space) down on an A-dimensional hyper-plane in such a way that the coordinates of the projection (t_a, a = 1, 2, ..., A) are good predictors of Y. This is indicated in Fig. 2.

Fig. 2. The geometric representation of PLSR. The X-matrix can be represented as N points in the K-dimensional space where each column of X (x_k) defines one coordinate axis. The PLSR model defines an A-dimensional hyper-plane, which in turn is defined by one line, one direction, per component. The direction coefficients of these lines are p_ak. The coordinates of each object, i, when its data (row i in X) are projected down on this plane, are t_ia. These positions are related to the values of Y.


The direction of the plane is expressed as slopes, p_ak, of each PLS direction of the plane (each component) with respect to each coordinate axis, x_k. This slope is the cosine of the angle between the PLS direction and the coordinate axis.

Thus, PLSR develops an A-dimensional hyper-plane in X-space such that this plane well approximates X (the N points, row vectors of X), and at the same time, the positions of the projected data points on this plane, described by the scores t_ia, are related to the values of the responses, activities, Y_im (see Fig. 2).

3.6. Incomplete X and Y matrices (missing data)

Projection methods such as PLSR tolerate moderate amounts of missing data both in X and in Y. To have missing data in Y, it must be multivariate, i.e., have at least two columns. The larger the matrices X and Y are, the higher the proportion of missing data that is tolerated. For small data sets with around 20 observations and 20 variables, around 10% to 20% missing data can be handled, provided that they are not missing according to some systematic pattern.

The NIPALS PLSR algorithm automatically accounts for the missing values, in principle by iteratively substituting the missing values by predictions by the model. This corresponds to, for each component, giving the missing data values that have zero residuals and thus have no influence on the component parameters t_a and p_a. Other approaches based on the EM algorithm have been developed, and often work better than NIPALS for large percentages of missing data [17,18]. One should remember, however, that with much missing data, any resulting parameters and predictions are highly uncertain.
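To illustrate the principle, a single NIPALS regression step, w = X'u/(u'u), can be computed over the non-missing elements only, which is equivalent to giving the missing cells zero residuals. A minimal sketch of our own (NaN marks missing values; a full algorithm would apply the same idea to every step):

```python
import numpy as np

def masked_weights(X, u):
    """One NIPALS step, w_k = sum_i x_ik u_i / sum_i u_i^2,
    where both sums run over the non-missing x_ik only."""
    mask = ~np.isnan(X)                    # True where data exist
    X0 = np.where(mask, X, 0.0)            # missing cells contribute 0
    num = X0.T @ u                         # per-variable sum of x_ik u_i
    den = mask.T.astype(float) @ (u * u)   # per-variable sum of u_i^2
    return num / den
```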

3.7. One Y at a time, or all in a single model?

PLSR has the ability to model and analyze several Y's together, which has the advantage of giving a simpler overall picture than one separate model for each Y-variable. Hence, when the Y's are correlated, they should be analyzed together. If the Y's really measure different things, and thus are fairly independent, a single PLSR model tends to have many components and be difficult to interpret. Then a separate modelling of the Y's gives a set of simpler models with fewer dimensions, which are easier to interpret.

Hence, one should start with a PCA of just the Y-matrix. This shows the practical rank of Y, i.e., the number of resulting components, A_PCA. If this is small compared to the number of Y-variables (M), the Y's are correlated, and a single PLSR model of all Y's is warranted. If, however, the Y's cluster in strong groups, which is seen in the PCA loading plots, separate PLSR models should be developed for these groups.
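A rough sketch of such a rank check via the singular values of the centered Y-matrix (the 95% threshold is our arbitrary choice; a proper determination of A_PCA would use cross-validation as in Section 3.8):

```python
import numpy as np

def practical_rank(Y, threshold=0.95):
    """Number of principal components needed to explain
    `threshold` of the centered sum of squares of Y."""
    Yc = Y - Y.mean(axis=0)
    s = np.linalg.svd(Yc, compute_uv=False)      # singular values
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(explained, threshold) + 1)
```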

3.8. The number of PLS components, A

In any empirical modelling, it is essential to determine the correct complexity of the model. With numerous and correlated X-variables there is a substantial risk for "over-fitting", i.e., getting a well-fitting model with little or no predictive power. Hence, a strict test of the predictive significance of each PLS component is necessary, and then stopping when components start to be non-significant.

Cross-validation (CV) is a practical and reliable way to test this predictive significance [1–6]. This has become the standard in PLSR analysis, and is incorporated in one form or another in all available PLSR software. Good discussions of the subject were given by Wakeling and Morris [19] and Clark and Cramer [20].

Basically, CV is performed by dividing the data in a number of groups, G, say, five to nine, and then developing a number of parallel models from reduced data with one of the groups deleted. We note that having G = N, i.e., the leave-one-out approach, is not recommendable [21].

After developing a model, differences between actual and predicted Y-values are calculated for the deleted data. The sum of squares of these differences is computed and collected from all the parallel models to form the predictive residual sum of squares (PRESS), which estimates the predictive ability of the model.

When CV is used in the sequential mode, CV is performed on one component after the other, but the peeling off (Eq. (7b), Section 3.3) is made only once on the full data matrices, whereafter the resulting residual matrices E and F are divided into groups for the CV of the next component. The ratio PRESS_a/SS_{a-1} is calculated after each component, and a component is judged significant if this ratio is smaller than around 0.9 for at least one of the Y-variables.


Slightly sharper bounds can be obtained from the results of Wakeling and Morris [19]. Here SS_{a-1} denotes the fitted residual sum of squares before the current component (index a). The calculations continue until a component is non-significant.

Alternatively, with "total CV", one first divides the data into groups, and then calculates PRESS for each component up to, say, 10 or 15, with separate "peeling" (Eq. (7b), Section 3.3) of the data matrices of each CV group. The model with the number of components giving the lowest PRESS/(N-A-1) is then used. This "total" approach is computationally more taxing, but gives similar results.

Both with the "sequential" and the "total" mode, a PRESS is calculated for the final model with the estimated number of significant components. This is often re-expressed as Q2 (the cross-validated R2), which is 1 - PRESS/SS, where SS is the sum of squares of Y corrected for the mean. This can be compared with R2 = 1 - RSS/SS, where RSS is the fitted residual sum of squares. In models with several Y's, one obtains also R2_m and Q2_m for each Y-variable, y_m.

These measures can, of course, be equivalently expressed as residual SD's (RSD's) and predictive residual SD's (PRESD's). The latter is often called standard error of prediction (SDEP or SEP), or standard error of cross-validation (SECV). If one has some knowledge of the noise in the investigated system, for example ±0.3 units for log(1/C) in QSAR's, these predictive SD's should, of course, be similar in size to the noise.
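A minimal sketch of grouped cross-validation computing PRESS and Q2 for a model of fixed complexity; `fit` and `predict` are hypothetical placeholders for a PLSR implementation (e.g., the NIPALS sketch in Section 3.10), and the modulo-based grouping is one simple choice among many:

```python
import numpy as np

def press_q2(X, Y, fit, predict, G=7):
    """Cross-validated PRESS and Q2 = 1 - PRESS/SS.

    fit(X, Y) returns a fitted model; predict(model, X) returns
    Y-hat. G groups (five to nine recommended; G = N, i.e.,
    leave-one-out, is not recommendable)."""
    N = X.shape[0]
    groups = np.arange(N) % G              # assign each row to a group
    press = 0.0
    for g in range(G):
        test = groups == g
        model = fit(X[~test], Y[~test])    # model without group g
        resid = Y[test] - predict(model, X[test])
        press += np.sum(resid ** 2)        # collect deleted-data errors
    ss = np.sum((Y - Y.mean(axis=0)) ** 2) # SS of Y about its mean
    return press, 1.0 - press / ss
```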

3.9. Model validation

Any model needs to be validated before it is used for "understanding" or for predicting new events such as the biological activity of new compounds or the yield and impurities at other process conditions. The best validation of a model is that it consistently precisely predicts the Y-values of observations with new X-values, a validation set. But an independent and representative validation set is rare.

In the absence of a real validation set, two reasonable ways of model validation are given by cross-validation (CV, see Section 3.8), which simulates how well the model predicts new data, and model re-estimation after data randomization, which estimates the chance (probability) to get a good fit with random response data.

3.10. PLSR algorithms

The algorithms for calculating the PLSR model are mainly of technical interest; we here just point out that there are several variants developed for different shapes of the data [2,22,23]. Most of these algorithms tolerate moderate amounts of missing data. Either the algorithm, like the original NIPALS algorithm below, works with the original data matrices, X and Y (scaled and centered). Alternatively, so-called kernel algorithms work with the variance–covariance matrices, X'X, Y'Y, and X'Y, or association matrices, XX' and YY', which is advantageous when the number of observations (N) differs much from the number of variables (K and M).

For extensions of PLSR, the results of Höskuldsson [3] regarding the possibilities to modify the NIPALS PLSR algorithm are of great interest. Höskuldsson shows that as long as the steps (C) to (G) below are unchanged, modifications can be made of w in step (B). Central properties remain, such as orthogonality between model components, good summarizing properties of the X-scores, t_a, and interpretability of the model parameters. This can be used to introduce smoothness in the PLSR solution [24], to develop a PLSR model where a majority of the PLSR coefficients are zero [25], to align w with a priori specified vectors (similar to "target rotation" of Kvalheim et al. [26]), and more.

The simple NIPALS algorithm of Wold et al. [2] is shown below. It starts with optionally transformed, scaled, and centered data, X and Y, and proceeds as follows (note that with a single y-variable, the algorithm is non-iterative).

(A) Get a starting vector of u, usually one of the Y columns. With a single y, u = y.

(B) The X-weights, w:

w = X'u/(u'u)

(here w can now be modified; then norm w to ||w|| = 1.0).

(C) Calculate X-scores, t:

t = Xw

(D) The Y-weights, c:

c = Y't/(t't)


(E) Finally, an updated set of Y-scores, u:

u = Yc/(c'c)

(F) Convergence is tested on the change in t, i.e., ||t_old - t_new||/||t_new|| < ε, where ε is "small", e.g., 10^-6 or 10^-8. If convergence has not been reached, return to (B); otherwise continue with (G), and then (A). If there is only one y-variable, i.e., M = 1, the procedure converges in a single iteration, and one proceeds directly with (G).

(G) Remove (deflate, peel off) the present component from X and Y, and use these deflated matrices as X and Y in the next component. Here the deflation of Y is optional; the results are equivalent whether Y is deflated or not:

p = X't/(t't)
X = X - tp'
Y = Y - tc'

(H) Continue with the next component (back to step (A)) until cross-validation (see above) indicates that there is no more significant information in X about Y.
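The steps above translate almost line-by-line into code. A compact, non-production sketch in Python/NumPy (fixed number of components A, no missing-data handling; the convergence tolerance and iteration cap are our choices):

```python
import numpy as np

def nipals_pls(X, Y, A, eps=1e-8, max_iter=500):
    """NIPALS PLSR following steps (A)-(G) above. X (N x K) and
    Y (N x M) are assumed already transformed, scaled, and
    centered. Returns W, T, P, C plus the transformed weights
    W* (Eq. (8)) and the coefficients B (Eq. (6))."""
    X, Y = X.copy(), Y.copy()
    N, K = X.shape
    M = Y.shape[1]
    W, T = np.zeros((K, A)), np.zeros((N, A))
    P, C = np.zeros((K, A)), np.zeros((M, A))
    for a in range(A):
        u = Y[:, 0].copy()                     # (A) start u as a Y column
        t_old = np.zeros(N)
        for _ in range(max_iter):
            w = X.T @ u / (u @ u)              # (B) X-weights
            w /= np.linalg.norm(w)             #     norm ||w|| = 1.0
            t = X @ w                          # (C) X-scores
            c = Y.T @ t / (t @ t)              # (D) Y-weights
            u = Y @ c / (c @ c)                # (E) updated Y-scores
            if np.linalg.norm(t - t_old) < eps * np.linalg.norm(t):
                break                          # (F) convergence on t
            t_old = t
        p = X.T @ t / (t @ t)                  # (G) X-loadings
        X = X - np.outer(t, p)                 #     deflate X
        Y = Y - np.outer(t, c)                 #     deflate Y (optional)
        W[:, a], T[:, a], P[:, a], C[:, a] = w, t, p, c
    W_star = W @ np.linalg.inv(P.T @ W)        # Eq. (8)
    B = W_star @ C.T                           # Eq. (6)
    return W, T, P, C, W_star, B
```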

Chu et al. [27] recently reviewed the attractive properties of matrix decompositions of the Wedderburn type. The PLSR NIPALS algorithm is such a Wedderburn decomposition, and hence is numerically and statistically stable.

3.11. Standard errors and confidence intervals

Numerous efforts have been made to theoretically derive confidence intervals of the PLSR parameters, see, e.g., Ref. [28]. Most of these are, however, based on regression assumptions, seeing PLSR as a biased regression model. Only recently, in the work of Burnham et al. [12], have these matters been investigated with PLSR as a latent variable regression model.

A way to estimate standard errors and confidence intervals directly from the data is to use jack-knifing [29]. This was recommended by Wold [1] in his original PLS work, and has recently been revived by Martens and Martens [30] and others. The idea is simple; the variation in the parameters of the various sub-models obtained during cross-validation is used to derive their standard deviations (called standard errors), followed by using the t-distribution to give confidence intervals. Since all PLSR parameters (scores, loadings, etc.) are linear combinations of the original data (possibly deflated), these parameters are close to normally distributed, and hence jack-knifing works well.
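A sketch of this jack-knife computation, assuming the parameters of interest (e.g., the coefficient vectors b) from the G cross-validation sub-models have been stacked into an array; the critical value 2.0 is a rough stand-in for the t-distribution quantile with G-1 degrees of freedom:

```python
import numpy as np

def jackknife_ci(params, t_crit=2.0):
    """Standard errors and approximate confidence intervals from
    the parameters of the G cross-validation sub-models.

    params: (G, n_params) array, one row per sub-model."""
    G = params.shape[0]
    mean = params.mean(axis=0)
    # Jack-knife standard error: sqrt((G-1)/G * sum of squared
    # deviations of the sub-model parameters from their mean).
    se = np.sqrt((G - 1) / G * np.sum((params - mean) ** 2, axis=0))
    return mean - t_crit * se, mean + t_crit * se
```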

4. Assumptions underlying PLSR

4.1. Latent Variables

In PLS modelling, we assume that the investigated system or process actually is influenced by just a few underlying variables, latent variables (LV's). The number of these LV's is usually not known, and one aim with the PLSR analysis is to estimate this number. Also, the PLS X-scores, t_a, are usually not direct estimates of the LV's, but rather they span the same space as the LV's. Thus, the latter (denoted by V) are related to the former (T) by a, usually unknown, rotation matrix, R, with the property R'R = 1:

V = TR'  or  T = VR

Both the X- and the Y-variables are assumed to be realizations of these underlying LV's, and are hence not assumed to be independent. Interestingly, the LV assumptions closely correspond to the use of microscopic concepts such as molecules and reactions in chemistry and molecular biology, thus making PLSR philosophically suitable for the modelling of chemical and biological data. This has been discussed by, among others, Wold et al. [31,32], Kvalheim [33], and recently, from a more fundamental perspective, by Burnham et al. [12,13]. In spectroscopy, it is clear that the spectrum of a sample is the sum of the spectra of the constituents multiplied by their concentrations in the sample. Identifying the latter with t (Lambert–Beer's "law"), and the spectra with p, we get the latent variable model X = t1p1' + t2p2' + ... = TP' + noise. In many applications this interpretation, with the data explained by a number of "factors" (components), makes sense.

As discussed below, we can also see the scores, T, as comprised of derivatives of an unknown function underlying the investigated system. The choice of the interpretation depends on the amount of knowledge about the system.


The more knowledge we have, the more likely it is that we can assign a latent variable interpretation to the X-scores or their rotation.

If the number of LV's actually equals the number of X-variables, K, then the X-variables are independent, and PLSR and MLR give identical results. Hence, we can see PLSR as a generalization of MLR, containing the latter as a special case in situations when the MLR solution exists, i.e., when the number of X- and Y-variables is fairly small in comparison to the number of observations, N. In most practical cases, except when X is generated according to an experimental design, however, the X-variables are not independent. We then call X rank deficient. Then PLSR gives a "shrunk" solution which is statistically more robust than the MLR solution, and hence gives better predictions than MLR [34].

PLSR gives a model of X in terms of a bilinear projection, plus residuals. Hence, PLSR assumes that there may be parts of X that are unrelated to Y. These parts can include noise and/or regularities non-related to Y. Hence, unlike MLR, PLSR tolerates noise in X.

4.2. Alternative derivation

The second theoretical foundation of LV-models is one of Taylor expansions [35]. We assume the data X and Y to be generated by a multi-dimensional function F(u,v), where the vector variable u describes the change between observations (rows in X), and the vector variable v describes the change between variables (columns in X). Making a Taylor expansion of the function F in the u-direction, and discretizing for i = observation and k = variable, gives the LV-model. Again, the smaller the interval of u that is modelled, the fewer terms we need in the Taylor expansion, and the fewer components we need in the LV-model. Hence, we can interpret PCA and PLS as models of similarity. Data (variables) measured on a set of similar observations (samples, items, cases, ...) can always be modelled (approximated) by a PC- or PLS model. And the more similar are the observations, the fewer components we need in the model.

We hence have two different interpretations of the LV-model. Thus, real data well explained by these models can be interpreted as either being a linear combination of "factors", or, according to the latter interpretation, as being measurements made on a set of similar observations. Any mixture of these two interpretations is, of course, often applicable.

4.3. Homogeneity

Any data analysis is based on an assumption of homogeneity. This means that the investigated system or process must be in a similar state throughout all the investigation, and the mechanism of influence of X on Y must be the same. This, in turn, corresponds to having some limits on the variability and diversity of X and Y.

Hence, it is essential that the analysis provides diagnostics about how well these assumptions indeed are fulfilled. Much of the recent progress in applied statistics has concerned diagnostics [36], and many of these diagnostics can be used also in PLSR modelling as discussed below. PLSR also provides additional diagnostics beyond those of regression-like methods, particularly those based on the modelling of X (score and loading plots and X-residuals).

In the first example, the first PLSR analysis indicated that the data set was inhomogeneous: three aromatic amino acids (AA's) are indicated to have a different type of effect than the others on the modelled property. This type of information is difficult to obtain in ordinary regression modelling, where only large residuals in Y provide diagnostics about inhomogeneities.

4.4. Non-linear PLSR

For non-linear situations, simple solutions have been published by Höskuldsson [4], and Berglund and Wold [37]. Another approach, based on transforming selected X-variables or X-scores to qualitative variables coded as sets of dummy variables, the so-called GIFI approach [38,39], is described elsewhere in this volume [15].

5. Results of example 1

5.1. The initial PLSR analysis of the AA data

The first PLSR analysis (linear model) of the AA data gives one significant component explaining 43% of the Y-variance (R2 = 0.435, Q2 = 0.299). In contrast, the MLR gives an R2 of 0.788, which is equivalent to PLSR with A = 7 components. The full MLR solution, however, has a Q2 of -0.215, indicating that the model is poor, and does not predict better than chance.


Fig. 3. The PLS scores u1 and t1 of the AA example, 1st analysis.


With just one significant PLS component, the only meaningful score plot is that of y against t1 (Fig. 3). The aromatic AA's, Trp, Phe, and, maybe, Tyr, show a much worse fit than the others, indicating an inhomogeneity in the data. To investigate this, a second round of analysis is made with a reduced data set, N = 16, without the aromatic AA's.

5.2. Second analysis

The modelling of N = 16 AA's with the same linear model as before gives a substantially better result with A = 2 significant components and R2 = 0.783, Q2 = 0.706. The MLR model for these 16 objects gives an R2 of 0.872, and a Q2 of 0.608. This strong improvement indicates that the data set now is indeed more homogeneous and can be properly modelled.

The plot in Fig. 4 of u1 (y) vs. t1 shows, however, a fairly strong curvature, indicating that squared terms in lipophilicity and, maybe, polarity are warranted. In the final analysis, the squares of these four variables were included in the model, which indeed gave better results. Two significant PLS components and one additional with border-line significance were obtained. The resulting R2 and Q2 values are for A = 2: 0.90 and 0.80, and for A = 3: 0.925 and 0.82, respectively. The A = 3 values correspond to RSD = 0.92, and PRESD (SDEP) = 1.23, since the SD of Y is 2.989. The full MLR model gives R2 = 0.967, but with much worse Q2 = 0.09.

Fig. 4. The PLS scores u1 and t1 of the AA example, 2nd analysis.


Fig. 5. The PLS scores t1 and t2 of the AA example, 2nd analysis. The overlapping points up to the right are Ile and Leu.


5.2.1. X-scores (t_a) show object similarities and dissimilarities

The plot of the X-scores (t1 vs. t2, Fig. 5) shows the 16 amino acids grouped according to polarity from upper left to lower right, and inside each group, according to size and lipophilicity.

5.2.2. PLSR weights w and c

For the interpretation of PLSR models, the standard is to plot the w*'s and c's of one model dimension against another. Alternatively, one can plot the w's and c's; the results and interpretation are similar. This plot shows how the X-variables combine to form the scores t_a; X-variables important for the ath component fall far from the origin along the ath axis in the wc-plot. Analogously, Y-variables well modelled by the ath component have large coefficients, c_am, and hence fall far from the origin along the ath axis in the same plot.

The example 1 weight plot (Fig. 6, often also called loading plot) shows the first PLS component dominated by lipophilicity and polarity (PIF, PIE, and Lam on the positive side, and DGR on the negative), and the second component being a mixture of size and polarity, with MR, Vol, and SAC on the positive (high) side, and Lam on the negative (low) side.

Fig. 6. The PLS weights, w* and c, for the first two dimensions of the 2nd AA model.


Fig. 7. PLS regression coefficients after A = 2 components (second analysis). The bars indicate 95% confidence intervals based on jack-knifing.

The c-values of the response, y, are proportional to the linear variation of Y explained by the corresponding dimension, i.e., R2. They define one point per response; in the example with a single response, this point (DDGTS) sits far to the right in the first quadrant of the plot.

The importance of a given X-variable for Y is proportional to its distance from the origin in the loading space. These lengths correspond to the PLS-regression coefficients after A = 2 dimensions (Fig. 7).

5.3. Comparison with multiple linear regression (MLR)

Comparing the PLS model (A = 2) and the MLR model (equivalent to the PLS model with A = 7) shows that the coefficients of x3 (DGR) and x4 (SAC) change sign between the PLSR model (Fig. 7) and the MLR model (Fig. 8). Moreover, the coefficients of x4, x5, and x7, which are moderate in the PLSR (A = 2) model, become large and with opposite signs in the MLR model (A = 7), although they are strongly positively correlated to each other in the raw data.

The MLR coefficients clearly are misleading and un-interpretable, due to the strong correlations between the X-variables. PLSR, however, stops at A = 2, and gives reasonable coefficient values both for A = 2 and A = 3. With correlated variables one cannot assign "correct" values to the individual coefficients; the only thing we can estimate is their joint contribution to y.

Fig. 8. The MLR coefficients corresponding to analysis 2. The bars indicate 95% confidence intervals based on jack-knifing. Note the differences in the vertical scale compared to Fig. 7.


Fig. 9. VIP of the X-variables of the three-component PLSR model, 3rd analysis. The squares of x1 = PIE, x2 = PIF, x3 = DGR, and x6 = Lam are denoted by S1*1, S2*2, S3*3, and S6*6, respectively.

5.4. Measures of variable importance

In PLSR modelling, a variable (x_k) may be important for the modelling of Y. Such variables are identified by large PLS-regression coefficients, b_mk. However, a variable may also be important for the modelling of X, which is identified by large loadings, p_ak. A summary of the importance of an X-variable for both Y and X is given by VIP_k (variable importance for the projection, Fig. 9). This is a weighted sum of squares of the PLS-weights, w*_ak, with the weights calculated from the amount of Y-variance of each PLS component, a.

5.5. Residuals

The residuals of X and Y (E and F in Eqs. (2) and (4) above) are of diagnostic value for the quality of the model. A normal probability plot of the Y-residuals (Fig. 10) of the final AA model shows a fairly straight line with all values within ±3 SD's. To be a serious outlier, a point should clearly deviate from the line, and be outside ±4 SD's.

Since the X-residuals are many (N × K), one needs a summary for each observation (compound) not to drown in details. This is provided by the residual SD of the X-residuals of the corresponding row of the residual matrix E.

Fig. 10. Y-residuals of the three-component PLSR model, 3rd analysis, normal probability plot.


Because this SD is proportional to the distance between the data point and the model plane in X-space, it is also often called DModX (distance to the model in X-space). A DModX larger than around 2.5 times the overall SD of the X-residuals (corresponding to an F-value of 6.25) indicates that the observation is an outlier. Fig. 11 shows that none of the 16 compounds in example 1 has a large DModX. Here the overall SD = 0.34.
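A sketch of the DModX computation from the X-residual matrix E after A components; degrees-of-freedom corrections differ between implementations, so the constants used here are one simple choice, not the only one:

```python
import numpy as np

def dmodx(E, A):
    """Normalized DModX: the residual SD of each observation's
    row of E, relative to the pooled residual SD of the whole
    matrix. E is the (N x K) X-residual matrix."""
    N, K = E.shape
    s_obs = np.sqrt((E ** 2).sum(axis=1) / (K - A))            # per-row SD
    s_tot = np.sqrt((E ** 2).sum() / ((N - A - 1) * (K - A)))  # pooled SD
    return s_obs / s_tot   # values above ~2.5 suggest an X-outlier
```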

5.6. Conclusions of example 1

The initial PLSR analysis gave diagnostics (score plots) that indicated in-homogeneities in the data. A much better model was obtained for the N = 16 non-aromatic AA's. A remaining curvature in the score plot of u1 vs. t1 led to the inclusion of squared terms, which gave a good final model. Only the squares in the lipophilicity variables are significant in the final model.

If additional aromatic AA's had been present, a second separate model could have been developed for this type of AA's. This would provide insight into how this aromatic group differs from the non-aromatic AA's. This is, in a way, a non-linear model of the changes in the relationship between structure and activity when going from non-aromatic to aromatic AA's. These changes are too large and non-linear to be modelled by a linear or low degree polynomial model. The use of two separate models, which do not directly model the change from one group to another, provides a simple approach to deal with these non-linearities.

6. Example 2 (SIMCODM)

High and consistent product quality combined with "green" plant operation is important in today's competitive industrial climate. The goal of process data modelling is often to reduce the amount of down time and eliminate sources of undesired and deleterious process variability. The second example shows the investigation of possibilities to operate a process industry in an environment-friendly manner [40].

At the Aylesford Newsprint paper-mill in Kent, UK, premium quality newsprint is produced from 100% recycled fiber, i.e., reclaimed newspapers and magazines. The first step of the process is one of de-inking. This has stages of cleaning, screening, and flotation. First the wastepaper is fed into rotating drum pulpers, where it is mixed with warm water and chemicals to initiate the fiber separation and ink removal. In the next stage, the pulp undergoes pre-screening and cleaning to remove light and heavy contaminants (staples, paper clips, sand, plastics, etc.). Thereafter the pulp goes to a pre-flotation stage. After the pre-flotation the pulp is thickened and dispersed and the brightness is adjusted. After post-flotation and washing, the de-inked pulp (DIP) is sent via storage towers to the paper machines for production of premium quality newsprint.

Ž .Fig. 11. RSD’s of each compound’s X-residuals DModX , 3rd analysis.



The wastewater discharged from the de-inking process is the main source of chemical oxygen demand (COD) of the mill effluent. COD estimates the oxygen requirements of the organic matter present in the effluent. The COD in the effluent is reduced by 85% prior to discharge to the river Medway. Stricter environmental regulations and the costs of COD reduction made the investigation of the COD sources and possible control strategies one of the company's environmental objectives for 1999. A multivariate approach was adopted.

Thus, data from more than a year were used in the multivariate study. The example data set contains 384 daily observations, with one response variable (COD load, y1) and 54 process variables (x2–x55). Half of the observations (process time points) are used for model training and half for model validation.

6.1. Overview by PCA

PCA applied to the entire data set (X and Y) gave an eight-component model explaining R2 = 79% and predicting Q2 = 68% of the data variation. Scores and loadings of the first two components, accounting for 54% of the sum of squares, are plotted in Figs. 12 and 13.
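A minimal sketch of such a PCA overview, assuming Python with scikit-learn and an autoscaled matrix Z that holds X and Y together (all names here are illustrative). Plain PCA returns the explained variation R2 per component; the cross-validated Q2 quoted above requires a cross-validation scheme that is not shown.

```python
import numpy as np
from sklearn.decomposition import PCA

# Z: autoscaled (mean 0, unit variance per column) matrix containing X and Y.
pca = PCA(n_components=8).fit(Z)
T = pca.transform(Z)                     # scores t1..t8, one row per time point
r2 = pca.explained_variance_ratio_       # R2 contribution of each component
print("cumulative R2:", r2.cumsum()[-1])       # about 0.79 in the study
print("first two components:", r2[:2].sum())   # about 0.54 in the study
# score plot: T[:, 0] vs. T[:, 1]; loading plot: pca.components_[0] vs. [1]
```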

Fig. 12. Example 2. PCA score plot (t1/t2) of overview model. Each point represents one process time point.

Fig. 13. Example 2. PCA loading plot (p1/p2) corresponding to Fig. 12. The position of the COD load, variable number 1, is highlighted by an increased font size.

Fig. 12 reveals a main cluster in the right-hand part of the score plot, where the process has spent most of its time. Occasionally, the process has drifted away towards the left-hand area of the score plot, corresponding to a lower production rate, including planned shut-downs in one of the process lines. The loading plot shows most of the variables having low numerical values (i.e., low material flow, low addition of process chemicals, etc.) for the low production rates.

This analysis shows that the data are not strongly clustered. The low-production process time points deviate from the main cluster, but not seriously. Rather, the inclusion of the low-production time points in the subsequent PLSR modelling gives a good spanning of key features such as brightness, residual ink, and addition of external chemicals.

6.2. Selection of training set for PLSR and lagging of data

The parent data set, with preserved time order, was split into two halves, one training set and one prediction set. Only forward prediction was applied. Process insight, initial PLSR modelling, and inspection of cross-correlations indicated that lagging 10 of the variables was warranted to catch the process dynamics.


Fig. 14. Example 2. Observed versus predicted COD for the prediction set.

These were: lags 1–4 of COD, and lags 1 and 2 of variables x23 and x24 (addition of two chemicals) and x35 (a flow), 10 lagged X's in total. Lagging a variable L steps means that the same variable shifted L steps in time is included as an additional variable in the data set.
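A minimal sketch of this lagging scheme in Python/NumPy (an illustration, not the original software). The column indices in the usage line follow the text's variable numbering under the assumption that COD is column 0 and x2–x55 are columns 1–54.

```python
import numpy as np

def add_lags(D, lag_spec):
    """Append lagged copies of selected columns of the time-ordered matrix D.

    lag_spec maps a column index to a list of lags, e.g. {0: [1, 2, 3, 4]}.
    The first max-lag rows are dropped so all columns stay aligned in time.
    """
    L = max(l for lags in lag_spec.values() for l in lags)
    blocks = [D[L:]]                                  # unlagged block at time t
    for j, lags in lag_spec.items():
        for l in lags:
            blocks.append(D[L - l:len(D) - l, [j]])   # column j shifted l steps
    return np.hstack(blocks)

# lags 1-4 of COD (col 0) and lags 1-2 of x23, x24, x35 (cols 22, 23, 34):
# D_lagged = add_lags(D, {0: [1, 2, 3, 4], 22: [1, 2], 23: [1, 2], 34: [1, 2]})
```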

6.3. PLSR modelling

PLSR was utilized to relate the 54 + 10 process variables to the COD load. A two-component PLSR model was obtained, with R2X = 0.49, R2Y = 0.60, Q2Y(CV) = 0.57, and Q2Y(ext) = 0.59 (external validation set). Fig. 14 shows the relationship between observed and predicted levels of COD for the validation set. The picture for the training set is very similar.
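The corresponding modelling step might look as follows, again as an assumed Python/scikit-learn sketch with illustrative names: the time-ordered data are split in half, a two-component PLSR is fitted to the first half, and Q2(ext) is computed on the second half (forward prediction only).

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# X: 384 x 64 lagged process data (time-ordered), y: COD load.
n = len(y)
X_train, X_test = X[: n // 2], X[n // 2:]
y_train, y_test = y[: n // 2], y[n // 2:]

pls = PLSRegression(n_components=2).fit(X_train, y_train)  # autoscales internally

press = ((y_test - pls.predict(X_test).ravel()) ** 2).sum()
ss = ((y_test - y_test.mean()) ** 2).sum()
print("Q2(ext):", 1.0 - press / ss)    # about 0.59 in the study
```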

6.4. Results

The collected process variables together with their lags predict the COD load with Q2 > 0.6. This is a satisfying result considering the complexity of the process and the great variations that occur in the starting material (recycled paper).

The PLS-regression coefficients (Fig. 15) show the process variables with the strongest correlation to the COD load to be x23 and x24 (addition of chemicals), x34 and x35 (flows), and x49 and x50 (temperatures), with coefficients exceeding 0.04. Only some of these can be controlled, namely x23, x24, and x49.

Fig. 15. Example 2. Plot of regression coefficients of scaled and centered variables of the PLS model. Coefficients with numbers 2 to 55 represent the 54 original process variables. Coefficients 57–60 depict lags 1–4 of COD, and coefficients 61–66 the two first lags of variables x23, x24, and x35.


Moreover, the lagged variables are important. Variables 57–60 are the four first lags of the COD variable. Obviously, past measurements of COD are useful for modelling and predicting future levels of COD; the change in this variable is slow. Variables 61–66 are lags 1 and 2 of x23, x24, and x35. The dynamics of these process variables are thus related to COD.

Unfortunately, none of the three controllable variables (x23, x24, and x49) offers a means to decrease COD. The former two variables describe the addition of caustic and peroxide to the dispergers, and cannot be significantly reduced without endangering pulp quality. The last variable is the raw effluent temperature, which cannot be lowered without negative effects on pulp quality.

This result is typical for the analysis of historical process data. The value of the analysis lies mainly in detecting deviations from normal operation, not in process optimization. The latter demands data from a statistically designed set of experiments carried out with the purpose of optimizing the process.

7. Summary; how to develop and interpret a PLSR model

(1) Have a good understanding of the stated problem, particularly which responses (properties), Y, are of interest to measure and model, and which predictors, X, should be measured and varied. If possible, i.e., if the X-variables are subject to experimental control, use statistical experimental design [41] for the construction of X.

(2) Get good data, both Y (responses) and X (predictors). Multivariate Y's provide much more information because they can first be analyzed separately by PCA. This gives a good idea about the amount of systematic variation in Y, which Y-variables should be analyzed together, etc.

(3) The first piece of information about the model is the number of significant components, A, i.e., the complexity of the model and hence of the system. This number of components gives a lower bound on the number of effects we need to postulate in the system.

(4) The goodness of fit of the model is given by R2 and Q2 (cross-validated R2). With several Y's, one also obtains R2_m and Q2_m for each y_m. The R2's give an upper bound on how well the model explains the data and predicts new observations, and the Q2's give lower bounds for the same things. (A minimal code sketch of this cross-validation is given after this list.)

(5) The (u, t) score plots for the first two or three model dimensions will show the presence of outliers, curvature, or groups in the data.

The (t, t) score plots, windows in X-space, show in-homogeneities, groups, or other patterns. Together with these, the (w*, c) weight plots give an interpretation of the patterns.

(6a) If problems are seen, i.e., small R2 and Q2 values, or outliers, groups, or curvature in the score plots, one should try to remedy the problem. Plots of residuals (normal probability, DModX, and DModY) may give additional hints about the sources of the problems.

Single outliers should be inspected for correctness of the data and, if this does not help, be excluded from the analysis, but only if they are non-interesting (i.e., low activity).

Curvature in (u, t) plots may be improved by transforming parts of the data (e.g., log), or by including squared and/or cubic terms in the model. After possibly having transformed the data, modified the model, divided the data into groups, deleted outliers, or whatever else was warranted, one returns to step 1.

(6b) If no problems are seen, i.e., R2 and Q2 are OK and the model is interpretable, one may try to mildly prune the model by deleting unimportant terms, i.e., those with small regression coefficients and low VIP values. Thereafter a final model is developed, interpreted, and validated, predictions are made, etc. For the interpretation, the (w*, c) weight plots, coefficient plots, and contour or 3D plots with dominating X's as plot coordinates are invaluable.
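As promised in step (4), a minimal sketch of a cross-validated Q2 for PLSR, assuming Python with scikit-learn and NumPy arrays X, y (all illustrative; cross-validation schemes differ in how the data are split): each part of the data is predicted from a model fitted to the rest, and the pooled prediction error PRESS is compared with the total variation SS.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

def q2_cv(X, y, n_components, n_splits=7):
    """Cross-validated Q2 = 1 - PRESS/SS for a PLSR model with A components."""
    press = 0.0
    for train, test in KFold(n_splits=n_splits).split(X):
        m = PLSRegression(n_components=n_components).fit(X[train], y[train])
        press += ((y[test] - m.predict(X[test]).ravel()) ** 2).sum()
    return 1.0 - press / ((y - y.mean()) ** 2).sum()

# pick the number of components A where Q2 stops increasing:
# q2_per_A = [q2_cv(X, y, A) for A in range(1, 11)]
```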

8. Regression-like data-analytical problems

A number of seemingly different data-analytical problems can be expressed as regression problems with a special coding of Y or X. These include linear discriminant analysis (LDA), analysis of variance (ANOVA), and time series analysis (ARMA and similar models). With many and collinear variables (rank-deficient X), a PLSR solution can therefore be formulated for each of these.


In linear discriminant analysis (LDA), and the closely related canonical variates analysis (CVA), one has the X-matrix divided into a number of classes, with index 1, 2, ..., g, ..., G. The objective of the analysis is to find linear combinations of the X-variables (discriminant functions) that discriminate between the classes, i.e., have very different values for the classes [11]. Provided that each class is "tight" and occupies a small and separate volume in X-space, one can find a plane, a discriminant plane, in which the projected observations are well separated according to class. With many and collinear X-variables, a PLS version of LDA (PLS-DA) is useful [42,43].

However, when some of the classes are not tight, often due to a lack of homogeneity and similarity within these non-tight classes, the discriminant analysis does not work. Then other approaches, such as SIMCA, have to be used, where a PC or PLSR model is developed for each tight class, and new observations are classified according to their nearness in X-space to these class models.
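A hedged sketch of the Y-coding that turns two-block PLSR into PLS-DA, assuming Python with scikit-learn (the helper names are mine): class membership is expanded into a dummy Y matrix with one column per class, an ordinary PLSR is fitted, and a new observation is assigned to the class with the largest predicted dummy value.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def plsda_fit(X, classes, n_components=2):
    """PLS-DA: ordinary PLSR of X against a dummy-coded class matrix Y."""
    labels = np.unique(classes)
    Y = (classes[:, None] == labels[None, :]).astype(float)  # N x G dummy matrix
    return PLSRegression(n_components=n_components).fit(X, Y), labels

def plsda_predict(pls, labels, X_new):
    """Assign each row of X_new to the class with the largest predicted value."""
    return labels[np.argmax(pls.predict(X_new), axis=1)]
```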

Analogously, when X contains qualitative variables (the ANOVA situation), these can be coded using dummy variables, and the data can then be analyzed using MLR or PLSR. The latter is advantageous if the coded X is unbalanced and/or rank deficient (or close to it). With several Y-variables, this provides a simple approach to MANOVA (multiple-response ANOVA) [11].
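The dummy coding of a qualitative X-factor follows the same pattern as above; a minimal assumed sketch:

```python
import numpy as np

def dummy_code(factor):
    """Expand a qualitative variable into 0/1 dummy columns, one per level."""
    levels = np.unique(factor)
    return (factor[:, None] == levels[None, :]).astype(float)

# X = np.hstack([X_quantitative, dummy_code(factor)]), then PLSR as usual;
# the rank deficiency introduced by the coding is harmless to PLSR.
```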

Auto-regressive and transfer function models in time-series analysis are easily analyzed by first constructing an expanded X-matrix that contains the appropriate lags of the Y- and X-variables, and then calculating a PLSR solution. If the disturbances a_t are included in the model (ARMA, etc.), a two-step analysis is needed, first calculating estimates of a_t, and then using these in lagged form in the final PLSR model.

In the modelling of mixtures, for instance of chemical ingredients in paints, pharmaceutical, cosmetic, and polymer formulations, beverages, etc., the sum of the X-variables is 1.0, since the ingredients sum to 100%. This makes X rank deficient, and a number of special regression approaches have been developed for the analysis of mixture data. With PLSR, however, the rank deficiency presents no difficulties, and the data analysis becomes straightforward, as shown by Kettaneh-Wold [44].

The iterative calculations in non-linear regression often involve a regression-like updating step (e.g., the Gauss–Newton approach). The X-matrix of this step is often highly rank-deficient, and ridge regression is used to remedy the situation (Marquardt–Levenberg algorithms). PLSR provides a simple and interesting alternative, which is also computationally very fast.

9. Conclusions and discussion

PLSR provides an approach to the quantitative modelling of the often complicated relationships between predictors, X, and responses, Y, that with complex problems is often more realistic than MLR, including its stepwise selection variants. This is because the assumptions underlying PLSR (correlations among the X's, noise in X, model errors) are more realistic than the MLR assumptions of independent and error-free X's.

The diagnostics of PLSR, notably cross-validation and the score plots (u, t and t, t) with corresponding loading plots, provide information about the correlation structure of X that is not obtained in ordinary MLR. In particular, PLSR results showing that the data are inhomogeneous (like the AA example looked at here) are hard to obtain by MLR. In complicated systems, non-linearities so strong that a single polynomial model cannot be constructed seem to be rather common. Hence, a flexible approach to modelling is often warranted, with separate models for different mechanistic classes. And there is no loss of information with this approach in comparison with the single-model approach: a new observation (object) is first classified with respect to its X-values, and predicted response values are then obtained from the appropriate class model.

The ability of PLSR to analyze profiles of responses makes it easier to devise response measurements that are relevant to the stated objective of the investigation; it is easier to capture the behavior of a complicated system by a battery of measurements than by a single variable.

PLSR can be extended in various directions: to non-linear modelling, hierarchical modelling when the variables are very numerous, multi-way data, PLS time series, PLS-DA, etc. Many recently developed


PLS-based approaches are discussed in other articles in this volume. The statistical understanding of PLSR has recently improved substantially [12,13,34].

We feel that the flexibility of the PLS approach, its graphical orientation, and its inherent ability to handle incomplete and noisy data with many variables (and observations) make PLS a simple but powerful approach for the analysis of data from complicated problems.

Acknowledgements

Support from the Swedish Natural Science Research Council (NFR), and from the Center for Environmental Research in Umeå (CMF), is gratefully acknowledged. Dr. Erik Johansson is thanked for helping with the examples.

References

[1] H. Wold, Soft modelling: the basic design and some extensions, in: K.-G. Jöreskog, H. Wold (Eds.), Systems Under Indirect Observation, vols. I and II, North-Holland, Amsterdam, 1982.
[2] S. Wold, A. Ruhe, H. Wold, W.J. Dunn III, The collinearity problem in linear regression: the partial least squares approach to generalized inverses, SIAM J. Sci. Stat. Comput. 5 (1984) 735–743.
[3] A. Höskuldsson, PLS regression methods, J. Chemom. 2 (1988) 211–228.
[4] A. Höskuldsson, Prediction Methods in Science and Technology, vol. 1, Thor Publishing, Copenhagen, 1996, ISBN 87-985941-0-9.
[5] S. Wold, E. Johansson, M. Cocchi, PLS: partial least squares projections to latent structures, in: H. Kubinyi (Ed.), 3D QSAR in Drug Design: Theory, Methods, and Applications, ESCOM Science Publishers, Leiden, 1993, pp. 523–550.
[6] M. Tenenhaus, La Régression PLS: Théorie et Pratique, Technip, Paris, 1998.
[7] R.W. Gerlach, B.R. Kowalski, H. Wold, Partial least squares modelling with latent variables, Anal. Chim. Acta 112 (1979) 417–421.
[8] N.E. El Tayar, R.-S. Tsai, P.-A. Carrupt, B. Testa, Octan-1-ol–water partition coefficients of zwitterionic α-amino acids: determination by centrifugal partition chromatography and factorization into steric/hydrophobic and polar components, J. Chem. Soc., Perkin Trans. 2 (1992) 79–84.
[9] S. Hellberg, M. Sjöström, S. Wold, The prediction of bradykinin potentiating potency of pentapeptides: an example of a peptide quantitative structure–activity relationship, Acta Chem. Scand., Ser. B 40 (1986) 135–140.
[10] M. Sandberg, L. Eriksson, J. Jonsson, M. Sjöström, S. Wold, New chemical dimensions relevant for the design of biologically active peptides: a multivariate characterization of 87 amino acids, J. Med. Chem. 41 (1998) 2481–2491.
[11] J.E. Jackson, A User's Guide to Principal Components, Wiley, New York, 1991.
[12] A.J. Burnham, J.F. MacGregor, R. Viveros, Latent variable regression tools, Chemom. Intell. Lab. Syst. 48 (1999) 167–180.
[13] A. Burnham, R. Viveros, J. MacGregor, Frameworks for latent variable multivariate regression, J. Chemom. 10 (1996) 31–45.
[14] R. Manne, Analysis of two partial least squares algorithms for multivariate calibration, Chemom. Intell. Lab. Syst. 1 (1987) 187–197.
[15] J. Trygg et al., this issue.
[16] P. Nomikos, J.F. MacGregor, Multivariate SPC charts for monitoring batch processes, Technometrics 37 (1995) 41–59.
[17] P.R.C. Nelson, P.A. Taylor, J.F. MacGregor, Missing data methods in PCA and PLS: score calculations with incomplete observations, Chemom. Intell. Lab. Syst. 35 (1996) 45–65.
[18] B. Grung, R. Manne, Missing values in principal component analysis, Chemom. Intell. Lab. Syst. 42 (1998) 125–139.
[19] I.N. Wakeling, J.J. Morris, A test of significance for partial least squares regression, J. Chemom. 7 (1993) 291–304.
[20] M. Clark, R.D. Cramer III, The probability of chance correlation using partial least squares (PLS), Quant. Struct.-Act. Relat. 12 (1993) 137–145.
[21] J. Shao, Linear model selection by cross-validation, J. Am. Stat. Assoc. 88 (1993) 486–494.
[22] F. Lindgren, P. Geladi, S. Wold, The kernel algorithm for PLS I: many observations and few variables, J. Chemom. 7 (1993) 45–59.
[23] S. Rännar, P. Geladi, F. Lindgren, S. Wold, The kernel algorithm for PLS II: few observations and many variables, J. Chemom. 8 (1994) 111–125.
[24] K.H. Esbensen, S. Wold, SIMCA, MACUP, SELPLS, GDAM, SPACE and UNFOLD: the ways towards regionalized principal components analysis and subconstrained N-way decomposition, with geological illustrations, in: O.J. Christie (Ed.), Proc. Nord. Symp. Appl. Statist., Stavanger, 1983, pp. 11–36, ISBN 82-90496-02-8.
[25] N. Kettaneh-Wold, J.F. MacGregor, B. Dayal, S. Wold, Multivariate design of process experiments (M-DOPE), Chemom. Intell. Lab. Syst. 23 (1994) 39–50.
[26] O.M. Kvalheim, A.A. Christy, N. Telnaes, A. Bjoerseth, Maturity determination of organic matter in coals using the methylphenantrene distribution, Geochim. Cosmochim. Acta 51 (1987) 1883–1888.
[27] M.T. Chu, R.E. Funderlic, G.H. Golub, A rank-one reduction formula and its applications to matrix factorizations, SIAM Rev. 37 (1995) 512–530.
[28] M.C. Denham, Prediction intervals in partial least squares, J. Chemom. 11 (1997) 39–52.
[29] B. Efron, G. Gong, A leisurely look at the bootstrap, the jackknife, and cross-validation, Am. Stat. 37 (1983) 36–48.
[30] H. Martens, M. Martens, Modified jack-knife estimation of parameter uncertainty in bilinear modeling (PLSR), Food Qual. Preference 11 (2000) 5–16.
[31] S. Wold, C. Albano, W.J. Dunn III, U. Edlund, B. Eliasson, E. Johansson, B. Nordén, M. Sjöström, The indirect observation of molecular chemical systems, in: K.-G. Jöreskog, H. Wold (Eds.), Systems Under Indirect Observation, vols. I and II, North-Holland, Amsterdam, 1982, pp. 177–207, Chapter 8.
[32] S. Wold, M. Sjöström, L. Eriksson, PLS in chemistry, in: P.v.R. Schleyer, N.L. Allinger, T. Clark, J. Gasteiger, P.A. Kollman, H.F. Schaefer III, P.R. Schreiner (Eds.), The Encyclopedia of Computational Chemistry, Wiley, Chichester, 1999, pp. 2006–2020.
[33] O. Kvalheim, The latent variable, an editorial, Chemom. Intell. Lab. Syst. 14 (1992) 1–3.
[34] I.E. Frank, J.H. Friedman, A statistical view of some chemometrics regression tools (with discussion), Technometrics 35 (1993) 109–148.
[35] S. Wold, A theoretical foundation of extrathermodynamic relationships (linear free energy relationships), Chem. Scr. 5 (1974) 97–106.
[36] D.A. Belsley, E. Kuh, R.E. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, Wiley, New York, 1980.
[37] A. Berglund, S. Wold, INLR, implicit non-linear latent variable regression, J. Chemom. 11 (1997) 141–156.
[38] S. Wold, A. Berglund, N. Kettaneh, N. Bendwell, D.R. Cameron, The GIFI approach to non-linear PLS modelling, J. Chemom. 15 (2001) 321–336.
[39] L. Eriksson, E. Johansson, F. Lindgren, S. Wold, GIFI-PLS: modeling of non-linearities and discontinuities in QSAR, QSAR 19 (2000) 345–355.
[40] L. Eriksson, P. Hagberg, E. Johansson, S. Rännar, D. Sarney, O. Whelehan, A. Åström, T. Lindgren, Multivariate process monitoring of a newsprint mill: application to modelling and predicting COD load resulting from de-inking of recycled paper, J. Chemom. 15 (2001) 337–352.
[41] G.E.P. Box, W.G. Hunter, J.S. Hunter, Statistics for Experimenters, Wiley, New York, 1978.
[42] M. Sjöström, S. Wold, B. Söderström, PLS discriminant plots, in: Proceedings of PARC in Practice, Amsterdam, June 19–21, 1985, Elsevier, North-Holland, 1986, pp. 461–470.
[43] L. Ståhle, S. Wold, Partial least squares analysis with cross-validation for the two-class problem: a Monte Carlo study, J. Chemom. 1 (1987) 185–196.
[44] N. Kettaneh-Wold, Analysis of mixture data with partial least squares, Chemom. Intell. Lab. Syst. 14 (1992) 57–69.