Application of maximum likelihood multivariate curve resolution to noisy data sets

Mahsa Dadashi^{a,b}, Hamid Abdollahi^{b} and Roma Tauler^{a,*}

In this work, two different maximum likelihood approaches for multivariate curve resolution, based on maximum likelihood principal component analysis (MLPCA) and on weighted alternating least squares (WALS), are compared with the standard multivariate curve resolution alternating least squares (MCR-ALS) method. To illustrate this comparison, three different experimental data sets are used: the first is an environmental aerosol source apportionment data set, the second is a time-course DNA microarray data set, and the third is an ultrafast absorption spectroscopy data set. The error structures of the first two data sets were heteroscedastic and uncorrelated; they differed in that the second contained missing values. In the third data set, errors at different wavelengths are correlated. The results confirmed that the component profiles resolved by MLPCA-MCR-ALS are practically identical to those obtained by MCR-WALS and that they can differ from those resolved by ordinary MCR-ALS, especially in the case of high noise. It is shown that methods that incorporate uncertainty estimations (such as MLPCA-ALS and MCR-WALS) can provide more reliable results and better estimated parameters than unweighted approaches (such as MCR-ALS) when noise levels are high. A possible advantage of using MLPCA-MCR-ALS over MCR-WALS is that the former does not require changing the traditional MCR-ALS algorithm, because MLPCA is used only as a preliminary data pretreatment before MCR analysis. Copyright © 2013 John Wiley & Sons, Ltd.

Keywords: noisy data; error structure; multivariate curve resolution alternating least squares; weighted alternating least squares; maximum likelihood principal component analysis

1. INTRODUCTION

Multivariate curve resolution (MCR) methods have emerged as powerful chemometric tools to investigate multivariate data sets [1–3]. In most MCR methods, residual measurement errors are assumed to exhibit a uniform measurement variance. In recent years, chemometric methods that incorporate information about measurement errors have received more interest. The goal of these methods is to find solutions and parameters less affected by error propagation. Different algorithms have been proposed for this purpose, including weighted least squares [4], weighted principal component analysis (PCA) [5], positive matrix factorization (PMF) [6], maximum likelihood PCA (MLPCA) [7], maximum likelihood principal component regression [8], maximum likelihood parallel factor analysis [9], and MCR weighted alternating least squares (MCR-WALS) [10]. MCR-WALS [11–13] is the general extension of MCR-ALS [14–16] to non-homoscedastic error cases. The MCR-ALS algorithm assumes an independent and identically distributed (i.i.d.) error structure for the measured data. The abbreviation i.i.d. is very common in statistics; here it refers to measurement errors that are independent and normally distributed with uniform variance. This assumption is approximately valid for most measurements obtained from spectroscopic and chromatographic methods, where measurement errors are homoscedastic and uncorrelated; in these cases, the MCR-ALS solutions and parameters are sufficiently precise. However, in cases where uncertainties are large, such as for environmental [11] or DNA microarray data sets [10], the i.i.d. assumption is usually not fulfilled, and special attention should be paid to error propagation and perturbation effects.

A frequently used method for the analysis of experimental multivariate data matrices is PCA [17], where i.i.d. error structures are also assumed for the measured data. The extension of the PCA algorithm to non-homoscedastic error structures is MLPCA. In this method, the error structure of the experimental measurements has to be known in advance.

The aim of this work is to investigate and compare the results obtained either by the direct application of MCR-WALS or by a two-step application of MLPCA followed by ordinary MCR-ALS. Additionally, the results obtained by these two strategies will be compared with the results obtained when only MCR-ALS is applied in the traditional way. This work is the continuation of a recent work [18] in which this comparison was systematically presented for simulated data having different noise structures, from homoscedastic noise to heteroscedastic and correlated noise of different intensities and structures. In the present paper, this comparison is extended to experimental data that cover different typical situations encountered in the investigation of analytical data. Such a comparison is pertinent because it will facilitate extending the use of MCR methods to noisy data. Three experimental data sets will be

* Correspondence to: Roma Tauler, IDAEA-CSIC, Jordi Girona 18–24, Barcelona 08034, Spain. E-mail: [email protected]

a M. Dadashi, R. Tauler, IDAEA-CSIC, Jordi Girona 18–24, Barcelona 08034, Spain

b M. Dadashi, H. Abdollahi, Faculty of Chemistry, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran

Research Article

Received: 13 September 2012, Revised: 30 November 2012, Accepted: 5 January 2013, Published online in Wiley Online Library: 5 February 2013

(wileyonlinelibrary.com) DOI: 10.1002/cem.2489

J. Chemometrics 2013; 27: 34–41 Copyright © 2013 John Wiley & Sons, Ltd.


used for this purpose. These include an environmental aerosol source apportionment data set, a DNA time series microarray data set, and an ultrafast absorption spectroscopy data set. In these three data sets, different types of non-homoscedastic error structures are present in the measured data, and special attention has to be paid to them for optimal parameter estimation. The results obtained in the analysis of these examples are expected to be general and extendable to other data sets that do not have homoscedastic error structures.

2. EXPERIMENTAL

Three experimental data sets were used to investigate the application of maximum likelihood MCR methods: (i) an environmental aerosol source apportionment data set; (ii) a time-course DNA microarray data set; and (iii) an ultrafast absorption spectroscopy data set. These three examples have already been described elsewhere [11,12,19], and only a brief description of each is presented here.

2.1. Environmental aerosol source apportionment data set

The first data set is an environmental data set for aerosol pollution studies. In this data set, different samples of particulate matter with grain size PM10 were collected at different geographical sites in highly industrial areas. Information and other details on the study area, source apportionment analysis, and PM10 sampling can be found elsewhere [20]. This data set contained 34 variables (corresponding to different species concentrations in microgram per cubic meter and nanogram per cubic meter) and 87 valid PM10 samples (Figure 1). The analysis of chemical components was performed according to the methodology described in reference [21]. In a previous study [11,20], this data set was analyzed by different chemometric methods (MCR-ALS, MCR-WALS, and PMF). This study showed that applying MCR-WALS and PMF produces essentially the same results, whereas discrepancies were obtained with MCR-ALS. Data weighting by means of uncertainty estimates was found to be essential to obtain accurate maximum likelihood estimations of the component profiles. Both methods, MCR-WALS and PMF, require data uncertainty estimations as input. These uncertainties were estimated by the procedure discussed in Section 3.1.1.

2.2. Time-course DNA microarray data set

DNA microarray technologies allow for the simultaneous measurement of gene expression values for thousands of genes in a sample [22]. This gene expression value measures the hybridization of fluorescent-labeled samples and the genes attached to a solid surface. In this work, a data set obtained from Brauer et al. [23] and available at the Stanford University DNA microarray repository [24] has been used. This data set is known in the literature as the "diauxic shift data set", and it represents the gene expression level in yeast during the diauxic shift in a glucose-limited culture. Under glucose depletion conditions, the metabolism shifts rapidly to an oxidative stage. RNA samples were obtained about every 15 min and measured by DNA microarray technology. The finally analyzed data set contained 12 measurements taken between 7.25 and 10 h at intervals of 0.25 h. The gene expression values of 2284 genes at each time were analyzed, so the size of the finally analyzed data set was (2284, 12). This data set was already studied by Jaumot et al. using the MCR-ALS and MCR-WALS algorithms [12]. In that study, a better and easier interpretation of gene and time evolution profiles was achieved when MCR-WALS was applied compared with MCR-ALS.

2.3. Ultrafast absorption spectroscopy data set

Time-resolved pump-probe absorption spectroscopy [25] has been used extensively in photochemistry to study the electronic excited states of molecules. In this technique, one laser is used to perturb the molecules and another to characterize them. The example used here is the study of the photophysics of benzophenone by ultraviolet–visible pump-probe absorption spectroscopy [26]. More detail about the electronic transitions in this molecule and about the instrumentation can be found in references [19,26]. A 5 × 10^-4 M solution of benzophenone in acetonitrile was used, and 31 ultraviolet–visible spectra were recorded between 290 and 570 nm at intervals of 7.5 nm. The whole data set is a matrix or table of size (31, 50). One example is shown in Figure 2.

In these previous studies, hard–soft MCR was used for understanding the kinetics of this system [26], and the error structure of this system was investigated in the time and spectral modes [19]. The error covariance matrix of this system has been shown to have an independent error structure in the time direction and a correlated error structure in the spectral direction.

3. METHODS

3.1. Estimation of the measurement error covariance structures

3.1.1. Environmental data set error covariance matrix

A preliminary data pretreatment was performed for the concentration values below the detection limit, which were replaced by half

Figure 1. First data example. Environmental aerosol source apportionment data set. This data set contained 34 variables (corresponding to different species concentrations in microgram per cubic meter and nanogram per cubic meter) and 87 valid PM10 samples (see experimental details in reference [20]).


of their detection limit [27]. Direct experimental determinations of sample-specific or variable-specific uncertainties were not available; therefore, the error structure of this system was estimated from previous analytical knowledge of the system. Errors were considered to be proportional to the measured concentrations plus a constant term related to the limit of detection (LOD). If the value of a particular variable was equal to or above its analytical detection limit, its uncertainty was taken as 10% of the value plus one third of the detection limit (i.e., when x_ij ≥ LOD, σ_ij = 0.1 × x_ij + LOD/3), where x_ij is the element in row i and column j of data set X. When the value was below its analytical detection limit, the uncertainty was taken as 20% of the value plus one third of the detection limit (i.e., when x_ij < LOD, σ_ij = 0.2 × x_ij + LOD/3). This estimation of uncertainty gives a heteroscedastic, uncorrelated error covariance structure.
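The uncertainty rule above can be illustrated with a minimal NumPy sketch. The function name `estimate_uncertainty`, the assumption of one detection limit per variable (column), and the example numbers are illustrative choices, not taken from the original study.

```python
import numpy as np

def estimate_uncertainty(x, lod):
    """Elementwise uncertainty following the rule of Section 3.1.1.

    x   : (I, J) array of measured concentrations
    lod : (J,) array of detection limits, one per variable (assumed layout)
    """
    x = np.asarray(x, dtype=float)
    lod = np.broadcast_to(np.asarray(lod, dtype=float), x.shape)
    # 10% proportional error at or above the LOD, 20% below it,
    # plus one third of the LOD in both cases.
    frac = np.where(x >= lod, 0.10, 0.20)
    return frac * x + lod / 3.0

# Hypothetical concentrations and detection limits, for illustration only
x = np.array([[2.0, 0.05],
              [0.4, 0.30]])
lod = np.array([0.5, 0.1])
sigma = estimate_uncertainty(x, lod)
```

Because every element receives its own standard deviation and no covariances are introduced, the resulting error structure is heteroscedastic and uncorrelated, as stated in the text.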

3.1.2. Time-course microarray error covariance matrix

In this data set, all i = 1, ..., I genes of a single sample were arranged in a long vector, and all measurements for the j = 1, ..., J different samples were arranged in a data matrix or table of size (J, I). For this type of data set, a logarithmic pretreatment is usually performed in genomic studies. Because nonnegativity constraints are applied in MCR-ALS, no logarithmic pretreatment was performed in this study. For a more detailed discussion, see reference [10]. Our previous studies have shown that these data sets can have high proportional errors, around 20% or 30% of the measured signal. Uncertainties were finally considered to be 25% of the measured data values. Moreover, the presence of missing values should be taken into account. Because each sample has at least one missing measurement, this data set cannot be analyzed by eliminating rows or columns. MLPCA easily handles this situation [28]. Missing values were arbitrarily substituted by values of one, and their uncertainties were set to a very large value, equal to 100. In this way, missing values had a minimum effect on the final MCR results.
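The weighting scheme for proportional errors and missing values can be sketched as follows. This is a minimal illustration that assumes missing entries are coded as NaN; the function name `microarray_uncertainty` and the example values are hypothetical.

```python
import numpy as np

def microarray_uncertainty(x, rel_error=0.25, fill_value=1.0, missing_sigma=100.0):
    """Fill missing entries and build the matching uncertainty matrix.

    Assumes missing measurements are coded as NaN (an assumption of this
    sketch, not a statement about the original data files).
    """
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    # Missing values are replaced by an arbitrary value of one ...
    x_filled = np.where(missing, fill_value, x)
    # ... and given a very large uncertainty so that they carry little weight;
    # all other entries get a 25% proportional uncertainty.
    sigma = np.where(missing, missing_sigma, rel_error * x_filled)
    return x_filled, sigma, missing

# Hypothetical 2 x 2 block of expression values with two missing entries
x_raw = np.array([[4.0, np.nan],
                  [np.nan, 8.0]])
x_filled, sigma, missing = microarray_uncertainty(x_raw)
```

In a weighted fit, each residual is scaled by the inverse of its uncertainty, so an entry with an uncertainty of 100 contributes far less to the objective function than a measured entry.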

3.1.3. Ultrafast absorption spectroscopy error covariance matrix

To obtain information about the error covariance structure of this system, one experiment was replicated 30 times. For K replications of this experiment, the error covariance matrix for every row can be estimated by Equation (1):

$$\Sigma_i = \frac{1}{K-1}\sum_{k=1}^{K} (\mathbf{x}_k - \bar{\mathbf{x}})^{T} (\mathbf{x}_k - \bar{\mathbf{x}}) \qquad (1)$$

where $\mathbf{x}_k$ is the kth replication of one row and $\bar{\mathbf{x}}$ is the average vector of the K replications. $\Sigma_i$ is the error covariance matrix, of size (J, J), for row i; its diagonal elements give the measurement error variances, and its off-diagonal elements give the error covariances between measurements in the same row. The error covariance matrix for every column j of X can be calculated in a similar way using the replicates of each column. Details about how to estimate the error covariance matrices for this data set were discussed in a previous work [19].
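Equation (1) can be sketched in NumPy as follows. The function name `row_error_covariance` and the simulated replicates are illustrative assumptions; the replicates are stored as the rows of a (K, J) array.

```python
import numpy as np

def row_error_covariance(replicates):
    """Sample error covariance of one row from K replicate measurements.

    replicates : (K, J) array, each row is one replicate of the same row of X
    returns    : (J, J) error covariance matrix (Equation (1))
    """
    r = np.asarray(replicates, dtype=float)
    k = r.shape[0]
    d = r - r.mean(axis=0)        # deviations from the average vector
    return d.T @ d / (k - 1)      # sum of outer products divided by K - 1

# Simulated replicates (30 replications of a row with 5 channels)
rng = np.random.default_rng(0)
reps = rng.normal(size=(30, 5))
sigma_i = row_error_covariance(reps)
```

The result agrees with `np.cov(reps, rowvar=False)`, which uses the same K - 1 normalization by default.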

3.2. Data analysis methods

Three different variants of the MCR method based on ALS (MCR-ALS) [29–31] are compared in this work. The three methods decompose the original data matrix X into the product of two factor matrices, G and F^T, using a bilinear model, X = GF^T + E, where E is the matrix of residuals. In the ordinary MCR-ALS method, the objective function to be minimized is

$$Q^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \left(x_{i,j} - \hat{x}_{i,j}\right)^2 \qquad (2)$$

where $x_{i,j}$ and $\hat{x}_{i,j}$ are, respectively, the experimental and calculated values. In MCR-WALS, this general minimization function is written as

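$$Q^2 = \mathrm{vec}\left(\mathbf{X}^T - \hat{\mathbf{X}}^T\right)^{T} \, \Xi^{-1} \, \mathrm{vec}\left(\mathbf{X}^T - \hat{\mathbf{X}}^T\right) \qquad (3)$$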

where $\mathbf{X}$ and $\hat{\mathbf{X}}$ are, respectively, the experimental and calculated data matrices containing the $x_{i,j}$ and $\hat{x}_{i,j}$ values, and $\Xi$ is the full augmented error covariance matrix shown in Equation (4):

$$\Xi = \begin{bmatrix} \Sigma_1 & & & \\ & \Sigma_2 & & \\ & & \ddots & \\ & & & \Sigma_I \end{bmatrix} \qquad (4)$$

where Σ_i are the error covariance matrices for each row i. When there is no correlation between column and row errors, the off-diagonal blocks of Ξ are zero. In contrast, when they are correlated, the off-diagonal blocks of Ξ are not zero. Row error covariance matrices are described by the Σ_i matrices (within-row error correlation) and column error covariance matrices by the cj matrices (within-column error correlation, see the succeeding paragraphs). In some particular cases (as in the ultrafast absorption spectroscopy data set), there is correlation between column and row errors, and it is necessary to consider the full augmented error covariance matrix (Equation (4)). For a thorough discussion of the different possible error cases, see reference [32].

The first step in the MCR-ALS algorithm is the projection of the data onto the subspace defined by its principal components [17]. This first step can be written as

Figure 2. Third data example. Ultrafast absorption spectroscopy experiment; this figure shows one slice of the 30 replications in this experiment (see experimental details in reference [26]).


$$\mathbf{X}_{N,\mathrm{PCA}} = \mathbf{X}\mathbf{V}_N\mathbf{V}_N^{T} \qquad (5)$$

In this equation, $\mathbf{V}_N$ is the loadings matrix for N components and $\mathbf{X}_{N,\mathrm{PCA}}$ is the projection of the original data set onto the loadings subspace. In PCA, measurement errors are not considered explicitly during the analysis because they are assumed to be homoscedastic (randomly distributed with equal variances).

In contrast to conventional PCA, MLPCA [7] incorporates known error information into the bilinear decomposition process, and it can deal with different types of error structures. Data with strongly heteroscedastic and correlated errors should, therefore, preferably be analyzed by MLPCA instead of ordinary PCA. However, in MLPCA, the structure of the error covariance matrix needs to be known accurately in advance. Several procedures and guidelines for the use of MLPCA have been given [33]. MLPCA seeks to minimize the maximum likelihood objective function (Equation (3)). In this case, $\mathbf{X}$ and $\hat{\mathbf{X}}$ are the original and MLPCA projected data matrices, respectively. The algorithm used by MLPCA is based on the idea that maximum likelihood projections should be the same in both row and column subspaces.

After the initial data projection (before ALS optimization), either by PCA or by MLPCA, the G and F^T factor matrices are estimated iteratively using an ALS algorithm, giving, respectively, MCR-ALS (PCA projection first) or MLPCA-MCR-ALS (MLPCA projection first). The equations for the unconstrained least squares solution for G when F^T is assumed to be known are

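$$\min_{\mathbf{G}} \left\| \mathbf{X}_N - \mathbf{G}\mathbf{F}^T \right\| \qquad (6)$$

$$\mathbf{G} = \mathbf{X}_N \mathbf{F} \left(\mathbf{F}^T\mathbf{F}\right)^{-1} = \mathbf{X}_N \left(\mathbf{F}^T\right)^{+} \qquad (7)$$

where $\mathbf{X}_N$ is either the PCA or the MLPCA projected matrix for N components in the MCR-ALS and MLPCA-MCR-ALS algorithms, respectively.

The equations for F^T when G is assumed to be known are

$$\min_{\mathbf{F}} \left\| \mathbf{X}_N - \mathbf{G}\mathbf{F}^T \right\| \qquad (8)$$

$$\mathbf{F}^T = \left(\mathbf{G}^T\mathbf{G}\right)^{-1} \mathbf{G}^T \mathbf{X}_N = \mathbf{G}^{+} \mathbf{X}_N \qquad (9)$$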

Equations (7) and (9) give the unconstrained least squares solutions. However, to obtain meaningful MCR solutions and reduce rotation ambiguities, different constraints such as nonnegativity, unimodality, closure, selectivity, local rank, or other constraints [30,31,34] should be applied. In this work, only nonnegativity constraints were applied, using nonnegative least squares algorithms [35].

MCR-WALS finds the G and F^T factor matrices in a similar way to MCR-ALS [32], using Equations (7) and (9), but at each iteration of the optimization they are estimated using the row and column maximum likelihood projections of the X matrix ($\hat{\mathbf{X}}_i$ and $\hat{\mathbf{X}}_j$) onto the subspaces defined by the current ALS estimates of the F and G matrices, in a similar way as in the MLPCA algorithm (i.e., the $\mathbf{V}_N$ and $\mathbf{U}_N$ factor matrices are now substituted by the current ALS estimates of the F^T and G factor matrices for the same number of components N). For more details, see reference [32].
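The unweighted part of this procedure (PCA projection as in Equation (5), followed by alternating nonnegative least squares updates of G and F^T) can be sketched as below. This is a simplified illustration, not the MCR-ALS software used by the authors: the function names, the random initial estimate, the fixed number of iterations, and the synthetic data are all assumptions of the sketch, and neither the MLPCA projection nor the WALS weighting is implemented.

```python
import numpy as np
from scipy.optimize import nnls

def pca_project(x, n):
    """Project X onto its first n loadings (Equation (5))."""
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    v = vt[:n].T                      # (J, n) loadings matrix V_N
    return x @ v @ v.T

def nn_als(x_n, f0, n_iter=50):
    """Alternate nonnegative least squares updates of G and F^T.

    x_n : (I, J) projected data matrix
    f0  : (N, J) initial estimate of F^T
    """
    ft = np.array(f0, dtype=float)
    for _ in range(n_iter):
        # Row i of x_n is fitted as g_i @ ft with g_i >= 0
        g = np.vstack([nnls(ft.T, row)[0] for row in x_n])
        # Column j of x_n is fitted as g @ ft[:, j] with ft[:, j] >= 0
        ft = np.column_stack([nnls(g, col)[0] for col in x_n.T])
    return g, ft

# Synthetic noise-free two-component data (hypothetical, for illustration)
rng = np.random.default_rng(1)
g_true = rng.uniform(0.0, 1.0, size=(20, 2))
ft_true = rng.uniform(0.0, 1.0, size=(2, 8))
x = g_true @ ft_true
x_n = pca_project(x, 2)
f0 = rng.uniform(0.1, 1.0, size=(2, 8))   # arbitrary initial estimate
g, ft = nn_als(x_n, f0)
```

Replacing `pca_project` with an MLPCA projection, while leaving `nn_als` unchanged, would correspond to the two-step MLPCA-MCR-ALS strategy described in the text.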

To evaluate how well the methods actually fit the data, several parameters and equations have been proposed [36]. Lack of fit and explained variance (R2) are two parameters that can be used:

$$\text{Lack of fit} = \sqrt{\frac{\sum_{i=1}^{I}\sum_{j=1}^{J} \left(x_{i,j} - \hat{x}_{i,j}\right)^2}{\sum_{i=1}^{I}\sum_{j=1}^{J} x_{i,j}^2}} \times 100 \qquad (10)$$

$$R^2 = \left(1 - \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} \left(x_{i,j} - \hat{x}_{i,j}\right)^2}{\sum_{i=1}^{I}\sum_{j=1}^{J} x_{i,j}^2}\right) \times 100 \qquad (11)$$

Both parameters measure essentially the same quantity, but the first one (lack of fit) is more sensitive to small fitting differences that give very similar explained variances.
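Equations (10) and (11) can be sketched directly in NumPy; the function names and the small example arrays are illustrative only.

```python
import numpy as np

def lack_of_fit(x, x_hat):
    """Lack of fit in percent (Equation (10))."""
    return 100.0 * np.sqrt(np.sum((x - x_hat) ** 2) / np.sum(x ** 2))

def explained_variance(x, x_hat):
    """Explained variance R2 in percent (Equation (11))."""
    return 100.0 * (1.0 - np.sum((x - x_hat) ** 2) / np.sum(x ** 2))

# Hypothetical data: the residual sum of squares is 16 and the total is 25
x = np.array([[3.0, 4.0]])
x_hat = np.array([[3.0, 0.0]])
lof = lack_of_fit(x, x_hat)            # 100 * sqrt(16 / 25) = 80
r2 = explained_variance(x, x_hat)      # 100 * (1 - 16 / 25) = 36
```

The two definitions imply R2 = 100 - lof^2/100, so a lack of fit of 8.3% corresponds to an explained variance of about 99.3%, and a lack of fit of 35.1% to about 87.7%. This is why small changes in explained variance near 100% appear as larger changes in lack of fit.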

To measure the similarity between two individual profiles (x and y vectors), their correlation and the angle between them (the arccosine of their correlation coefficient) can be obtained by

$$r = \frac{\mathbf{x}^T\mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|} \qquad (12)$$

$$\theta = \cos^{-1}(r) \qquad (13)$$

In Equations (12) and (13), r is the correlation coefficient between two profiles and θ is the angle between them. To measure the similarity between subspaces defined by different sets of profiles (X and Y matrices), a matrix correlation coefficient can also be used [37], which takes values between zero and one. A value close to 1 indicates a strong linear relationship (correlation) between the subspaces defined by the two matrices. A value of zero means that there is no linear relationship between the two subspaces. In general, for two subspaces defined by matrices X (I, J) and Y (I, J), the matrix correlation is defined by

$$r(\mathbf{X}, \mathbf{Y}) = \frac{\mathrm{tr}\left(\mathbf{X}^T\mathbf{Y}\right)}{\sqrt{\mathrm{tr}\left(\mathbf{X}^T\mathbf{X}\right) \, \mathrm{tr}\left(\mathbf{Y}^T\mathbf{Y}\right)}} \qquad (14)$$

This formula is based on the inner product of the two matrices, where tr indicates the trace of a matrix (the sum of its diagonal elements).
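Equations (12) to (14) can be sketched as follows. The function names are illustrative, and the angle is returned in degrees on the assumption that this matches the units used in the tables.

```python
import numpy as np

def profile_angle(x, y):
    """Correlation (Equation (12)) and angle in degrees (Equation (13))."""
    r = float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    r = min(1.0, max(-1.0, r))        # guard against rounding outside [-1, 1]
    return r, float(np.degrees(np.arccos(r)))

def matrix_correlation(x, y):
    """Matrix correlation coefficient (Equation (14))."""
    return np.trace(x.T @ y) / np.sqrt(np.trace(x.T @ x) * np.trace(y.T @ y))

# Hypothetical profiles: the second is rotated 45 degrees from the first
r, theta = profile_angle(np.array([1.0, 0.0]), np.array([1.0, 1.0]))

a = np.eye(2)
b = np.array([[0.0, 1.0], [1.0, 0.0]])
r_same = matrix_correlation(a, a)     # identical matrices
r_diff = matrix_correlation(a, b)     # tr(A^T B) = 0 for this pair
```

Identical profiles give an angle of zero, so small recovery angles indicate that two methods resolved nearly the same profile.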

Another parameter that can be used to measure the similarity between the subspaces generated by two sets of vectors is the angle between the two subspaces defined by the columns of the X and Y matrices. This angle is similar to that used for the comparison of x and y column vectors (once they are normalized to unit length). For this calculation, the columns of X are first projected onto the subspace defined by the columns of the Y matrix, and then the differences between them are calculated. By using these differences, the angle between the X and Y column subspaces can be obtained. Therefore, if X has, for instance, N columns, then N angles are obtained. The lowest angle is then considered the angle between the two subspaces. A small angle between the two subspaces indicates that they are very similar. For more details about these calculations, see references [38,39].
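The projection-based calculation described above can be sketched as follows. This is one possible reading of the description, not the routine from references [38,39]: the function name is hypothetical, and Y is assumed to have full column rank.

```python
import numpy as np

def column_to_subspace_angles(x, y):
    """Angle in degrees between each column of X and the column space of Y.

    Assumes Y has full column rank, so that the reduced QR factor is an
    orthonormal basis of its column space.
    """
    q, _ = np.linalg.qr(y)                  # orthonormal basis of span(Y)
    xn = x / np.linalg.norm(x, axis=0)      # columns normalized to unit length
    resid = xn - q @ (q.T @ xn)             # difference from the projection
    # For a unit vector, the norm of this difference is the sine of the angle
    s = np.clip(np.linalg.norm(resid, axis=0), 0.0, 1.0)
    return np.degrees(np.arcsin(s))

# Hypothetical example: Y spans the x-y plane of a three-dimensional space
y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
x = np.array([[1.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
angles = column_to_subspace_angles(x, y)
```

For N columns in X this returns N angles, one per column, from which a single summary value can be chosen as described in the text.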

4. RESULTS AND DISCUSSION

The three methods previously described (MCR-ALS, MLPCA-MCR-ALS, and MCR-WALS) were applied to the analysis of the three experimental data sets. In all cases, nonnegativity constraints were applied to the factor matrices, G and F^T. Lack of fit and R2 values were calculated according to Equations (10) and (11). Angles between resolved profiles were calculated to show their similarity. In all cases, initial estimations were obtained from the purest samples or variables of the analyzed data matrices [40].

4.1. Analysis of the environmental source apportionment data set

Table I gives the results of applying the different algorithms to this data set. According to a previous work on the same data set, the selected number of components or sources was six. Explained variances after application of MLPCA-MCR-ALS and of MCR-WALS to this data set were similar, 86.3% and 87.6%, respectively. These two values differed considerably from the result obtained by MCR-ALS without considering the error structure, which for the same number of components explained as much as 99.3% of the experimental data variance. This comparison clearly shows the tendency of MCR-ALS to overfit and to incorporate noise in the solutions. The same tendency could be seen in the values of lack of fit; these values were 37.0% and 35.5% for MLPCA-MCR-ALS and MCR-WALS, whereas for MCR-ALS, it was only 8.3%. This characteristic of MCR-ALS has already been reported in previous works [12,36].

Subspace congruences show the differences between the subspaces spanned by the MCR-WALS solutions and those obtained by MLPCA-MCR-ALS or MCR-ALS; in other words, MCR-WALS was taken as the reference algorithm, and the results obtained with the other algorithms were compared with it.

Congruence values between the score and loading matrices obtained by MCR-WALS and MLPCA-MCR-ALS were 13.9 and 10, respectively. In contrast, the same values for the comparison between the MCR-WALS and MCR-ALS matrices were 82.5 and 79.5, which are much higher, meaning that the MCR-ALS and MCR-WALS spaces are significantly different. This provides more evidence that the results obtained by MLPCA-MCR-ALS and MCR-WALS are rather similar. The small difference observed between the results of these two methods can be attributed to differences between the algorithms, to possible inaccuracies in the estimation of the error covariance matrix, and to the possible existence of small rotational ambiguities under the applied constraints.

4.2. Analysis of time series microarray data set

The first step in analyzing this data set was the estimation of the number of components (N). Finding the number of components for a microarray data set is more challenging than for ordinary spectroscopic data sets because of the higher noise levels present in gene expression data sets. In a previous work [12], the application of the MCR-ALS and MCR-WALS algorithms was repeated for different numbers of components, ranging from three to seven. The comparison of the explained variances showed that adding more than three components did not improve the explained variances or the MCR-ALS resolved profiles. When fewer than three components were considered, the shape of the sample profiles between 8 and 9 h was unreliable and more difficult to interpret biologically. In a previous work [12], four components were preferred for better interpretability of the results. In this work, three components were preferred because, by using only three components, similarity values between the two

Table I. Results for the environmental aerosol source apportionment data set with heteroscedastic error

Method      lof (%)^a   R2 (%)^b   Recovery angle^c                              Subspace congruence^d
                                   g1     g2     g3     g4     g5     g6         G
                                   f1     f2     f3     f4     f5     f6         FT
MCR-WALS    35.1        87.6       —      —      —      —      —      —          —
MLPCA-ALS   37.0        86.3       5.2    4.7    5.1    7.3    5.8    8.6        13.9
                                   1.8    9.7    4.1    2.6    2.9    5.2        10.0
MCR-ALS     8.3         99.3       52.4   14.5   36.7   23.9   71.0   53.2       82.5
                                   58.1   14.5   33.7   9.1    64.8   37.9       79.5
MLPCA       37.0        86.3       —      —      —      —      —      —          —
PCA         8.0         99.4       —      —      —      —      —      —          —

MCR, multivariate curve resolution; WALS, weighted alternating least squares; ML, maximum likelihood; PCA, principal component analysis.
a Lack of fit (lof) values (see Equation (10)).
b Explained variance values (see Equation (11)).
c g1, g2, g3, g4, g5, and g6 are the distribution (column or scores) profiles of the six components in factor matrix G; f1, f2, f3, f4, f5, and f6 are the composition (row or loadings) profiles of the six components in factor matrix FT; numerical values give the angles between MCR-WALS resolved G and FT profiles and MCR-ALS or MLPCA-ALS resolved G and FT profiles (see Equations (12) and (13)).
d Subspace congruence gives the angle between MCR-WALS resolved G and FT profile subspaces and MCR-ALS or MLPCA-ALS resolved G and FT profile subspaces (see Section 3).




methods, MLPCA-MCR-ALS and MCR-WALS, were better compared and discussed. When considering the recovery angles in Table II, the maximum difference between the results of MCR-WALS and MLPCA-MCR-ALS is for the third component (g3, recovery angle of 2.8°). If the number of components was considered to be four, the recovery angle for the fourth component t4 was much worse, equal to 24.0°. Because the aim of this study was to compare the profiles obtained from MCR-ALS, MCR-WALS, and MLPCA-MCR-ALS rather than to interpret them, which has already been done in previous studies [12,23,32], an analysis with three components was finally preferred.

Table II again reveals the similarity between the results obtained by MLPCA-MCR-ALS and MCR-WALS for this experimental data set. This case has been assumed to contain proportional heteroscedastic errors [41], like the previous case, but with the difference that missing values were also present in this data set, and they were treated as such using an appropriate weighting scheme (see Section 3).

Lack of fit values obtained by MLPCA-MCR-ALS and MCR-WALS were 27.9% and 28.0%, respectively, whereas this value for MCR-ALS was lower and equal to 22.7%. Again, this comparison confirms the tendency of MCR-ALS (and also PCA in Table II) to overfit the data compared with MLPCA-MCR-ALS and MCR-WALS. The same is observed in Table II for the R2 values (MLPCA-MCR-ALS and MCR-WALS values were about 92% in comparison with 94% for MCR-ALS). The angles reporting the subspace congruences between MLPCA-MCR-ALS and MCR-WALS were only 2.5° and 1.6° for the G and F^T profiles, respectively, whereas they were 18.3° and 23.7°, respectively, in the comparison with the MCR-ALS profiles. Figure 3 shows the comparison between MCR-WALS and MLPCA-MCR-ALS, confirming their similarity.

4.3. Analysis of ultrafast absorption spectroscopy data set

In this third data set, the spectroscopic data [19] had a correlated error structure in the variables direction and a heteroscedastic and independent error structure in the time direction. This error structure was considered during the different analyses by MLPCA and MCR-WALS. However, in this case, the noise level was rather low (about 1% of the measured signal), as usually happens in spectroscopic measurements. This low level of noise could be reduced significantly by replication of measurements and averaging. To examine more precisely the effect of noise on the results obtained by the different methods, a single data set is analyzed considering the corresponding estimated error covariance matrix. If the average of 30 replications was taken

Table II. Results of the diauxic shift data set

Method      lof (%)^a   R2 (%)^b   Recovery angle^c        Subspace congruence^d
                                   g1     g2     g3        G
                                   t1     t2     t3        TT
MCR-WALS    27.9        92.2       —      —      —         —
MLPCA-ALS   28.0        92.1       0.7    0.8    2.8       2.5
                                   0.9    1.2    1.1       1.6
MCR-ALS     22.7        94.8       4.8    10.4   17.4      18.3
                                   11.3   6.3    14.9      23.7
MLPCA       28.0        92.1       —      —      —         —
PCA         22.4        95.0       —      —      —         —

MCR, multivariate curve resolution; WALS, weighted alternating least squares; ML, maximum likelihood; PCA, principal component analysis.
a Lack of fit (lof) values (see Equation (10)).
b Explained variance values (see Equation (11)).
c g1, g2, and g3 are the distribution (column or scores) profiles of the three components in factor matrix G; t1, t2, and t3 are the composition (row or loadings) profiles of the three components in factor matrix TT; numerical values give the angles between MCR-WALS resolved G and TT profiles and MCR-ALS or MLPCA-ALS resolved G and TT profiles (see Equations (12) and (13)).
d Subspace congruence gives the angle between MCR-WALS resolved G and TT profile subspaces and MCR-ALS or MLPCA-ALS resolved G and TT profile subspaces (see Section 3).

Figure 3. Second data example. Comparison of the resolved temporal evolution of the gene expression profiles (three components) for the diauxic shift data set obtained by multivariate curve resolution weighted alternating least squares (MCR-WALS) (solid line), maximum likelihood principal component analysis MCR-ALS (MLPCA-MCR-ALS) (dashed dotted line), and MCR-ALS (dashed line).




as data set, the results obtained using the maximum likelihood curve resolution approaches (MLPCA-MCR-ALS and MCR-WALS) and using the ordinary MCR-ALS approach would have been practically identical. This result shows the importance and effects of replication and averaging in experimental work. These results are not reported here for brevity.

In Table III, results for the analysis of one single spectroscopic experiment (without replication) are given. Similar lack of fit (around 3.9%) and R2 (99.8%) values were obtained for MLPCA-MCR-ALS and MCR-WALS. Calculated values for MCR-ALS were 2.2% and 99.9%, respectively. These results confirmed the results obtained in the previous comparisons, although now the differences among the results obtained by the three methods were the lowest. The same happened when subspace congruences were compared, with the same interpretation as before.

5. CONCLUSIONS

Results obtained in the analysis of three experimental data sets confirmed the conclusions reached in our previous work on simulated data sets with different noise structures and complexities [18]. MLPCA-MCR-ALS and MCR-WALS differ in that MCR-WALS uses “chemical” constraints (nonnegativity and others) and noise information simultaneously during the ALS optimization, whereas MLPCA-MCR-ALS uses them sequentially: first the noise information, as a data pretreatment, and then, separately, the chemical constraints during the ALS optimization. It has been shown that MLPCA can be used as a preliminary projection step for ordinary MCR-ALS standard algorithms, with results equivalent to those obtained by applying MCR-WALS. This preliminary projection of the data matrix onto the MLPCA subspace has the advantage over MCR-WALS of easier and more general implementation and application to currently developed MCR methods, without the need to change algorithms. Moreover, it provides the concurrent visualization of MLPCA results (instead of PCA results) as a preliminary data exploratory tool. However, whenever an accurate estimation of the error structure is not possible, the ordinary MCR-ALS algorithm is still useful, and it would give good estimations unless noise levels are high (e.g., more than 10%). The extent of distortion of the resolved profiles of the factor matrices depends not only on the noise intensity but also on its complexity.
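The sequential strategy summarized above (noise information first, constraints afterwards) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the algorithm used in this work: the projection step is a plain alternating weighted regression that minimizes the same weighted residual sum as MLPCA for uncorrelated errors, nonnegativity is imposed by crude clipping instead of a proper nonnegative least squares solver, and all function names, data, and noise levels are hypothetical.

```python
import numpy as np

def weighted_lowrank(D, sigma, n_comp, n_iter=50):
    """Rank-n_comp estimate of D minimizing sum(((D - D_hat) / sigma)**2).

    Simplified stand-in for MLPCA with uncorrelated errors: alternating
    weighted regressions of every row and every column.
    """
    W = 1.0 / sigma**2
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    V = Vt[:n_comp].T.copy()              # start from the ordinary SVD loadings
    T = np.zeros((D.shape[0], n_comp))
    for _ in range(n_iter):
        for i in range(D.shape[0]):       # weighted regression of row i on V
            Vw = V * W[i][:, None]
            T[i] = np.linalg.solve(V.T @ Vw, Vw.T @ D[i])
        for j in range(D.shape[1]):       # weighted regression of column j on T
            Tw = T * W[:, j][:, None]
            V[j] = np.linalg.solve(T.T @ Tw, Tw.T @ D[:, j])
    return T @ V.T

def truncated_svd(D, n_comp):
    """Unweighted rank-n_comp approximation (ordinary PCA without centering)."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return (U[:, :n_comp] * s[:n_comp]) @ Vt[:n_comp]

def weighted_sse(D, D_hat, sigma):
    """Sum of squared residuals, each scaled by its standard deviation."""
    return float(np.sum(((D - D_hat) / sigma) ** 2))

def als_nonneg(D, C0, n_iter=100):
    """Unweighted ALS with nonnegativity imposed by clipping (crude sketch)."""
    C = C0.copy()
    for _ in range(n_iter):
        St = np.clip(np.linalg.lstsq(C, D, rcond=None)[0], 0.0, None)
        C = np.clip(np.linalg.lstsq(St.T, D.T, rcond=None)[0].T, 0.0, None)
    return C, St

# Synthetic three-component data with heteroscedastic, uncorrelated noise.
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 40)[:, None]
C_true = np.exp(-((t - np.array([0.2, 0.5, 0.8])) ** 2) / 0.02)
St_true = rng.random((3, 30))
D_true = C_true @ St_true
sigma = 0.02 + 0.10 * D_true
D = D_true + sigma * rng.standard_normal(D_true.shape)

D_svd = truncated_svd(D, 3)               # unweighted projection, for comparison
D_ml = weighted_lowrank(D, sigma, 3)      # step 1: noise information only
C0 = rng.random((40, 3))                  # arbitrary initial estimate
C_ml, St_ml = als_nonneg(D_ml, C0)        # step 2: constraints only, unchanged ALS
C_ols, St_ols = als_nonneg(D, C0)         # ordinary ALS on the raw data

print(weighted_sse(D, D_svd, sigma), weighted_sse(D, D_ml, sigma))
```

The point of the design is that `als_nonneg` is identical in both calls; only its input changes. That is the practical advantage claimed for the projection approach, since any existing unweighted MCR routine can be reused once the data have been projected.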

Acknowledgements

Research project grant number CTQ 2009-11572 from the Ministerio de Ciencia y Innovación, Spain, is acknowledged. The authors also acknowledge the Institute for Advanced Studies in Basic Sciences for financial support (grant number G2011IASBS117). The authors thank Anna de Juan and Joaquim Jaumot, who provided the data sets.

REFERENCES

1. Lawton WH, Sylvestre EA, Maggio MS. Self modeling nonlinear regression. Technometrics 1972; 14(3): 513–532.

2. de Juan A, Tauler R. Chemometrics applied to unravel multicomponent processes and mixtures: revisiting latest trends in multivariate resolution. Anal. Chim. Acta 2003; 500(1–2): 195–210.

3. Hamilton JC, Gemperline PJ. Mixture analysis using factor analysis. II: self-modeling curve resolution. J. Chemom. 1990; 4(1): 1–13.

4. Kiers H. Weighted least squares fitting using ordinary least squares algorithms. Psychometrika 1997; 62(2): 251–266.

5. Simeon V, Pavković D. Weighted analysis of principal components: two approximations to statistical weights. J. Chemom. 1992; 6(5): 257–266.

6. Paatero P, Tapper U. Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 1994; 5(2): 111–126.

7. Wentzell PD, Andrews DT, Hamilton DC, Faber K, Kowalski BR. Maximum likelihood principal component analysis. J. Chemom. 1997; 11(4): 339–366.

8. Wentzell PD, Andrews DT, Kowalski BR. Maximum likelihood multivariate calibration. Anal. Chem. 1997; 69(13): 2299–2311.

9. Vega-Montoto L, Wentzell PD. Maximum likelihood parallel factor analysis (MLPARAFAC). J. Chemom. 2003; 17(4): 237–253.

Table III. Results of the ultrafast absorption spectroscopy for one data set

Method       lof (%)a   R2 (%)b   Angles between profilesc     Subspace congruenced
                                  c1       c2       c3         C
                                  s1       s2       s3         ST
MCR-WALS     3.9        99.85     -        -        -          -
MLPCA-ALS    3.9        99.85     3.9      3.5      7.4        0.85
                                  1.8      1.1      8.1        0.39
MCR-ALS      2.2        99.95     5.4      11.7     6.7        12.2
                                  4.6      2.4      8.1        5.6
MLPCA        3.9        99.85     -        -        -          -
PCA          2.2        99.95     -        -        -          -

MCR, multivariate curve resolution; WALS, weighted alternating least squares; ML, maximum likelihood; PCA, principal component analysis.
a Lack of fit (lof) values (see Equation (10)).
b Explained variance values (see Equation (11)).
c c1, c2, and c3 are the distribution (column or scores) profiles of the three components in factor matrix C; s1, s2, and s3 are the composition (row or loadings) profiles of the three components in factor matrix ST; numerical values give the angles between MCR-WALS resolved C and ST profiles and MCR-ALS or MLPCA-ALS resolved C and ST profiles (see Equations (12) and (13)).
d Subspace congruence gives the angle between MCR-WALS resolved C and ST profiles subspaces and MCR-ALS or MLPCA-ALS resolved C and ST profiles subspaces (see Section 3).
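The two kinds of angle reported in Table III can be illustrated with a short Python sketch. Equations (12) and (13) and Section 3 are not reproduced in this section, so the sketch rests on assumptions: the angle between two profiles is taken as the arccosine of their normalized inner product, and subspace congruence is taken as the largest principal angle between the column spaces, computed from orthonormal bases in the spirit of reference 38. Function names and data are hypothetical.

```python
import numpy as np

def profile_angle_deg(a, b):
    """Angle in degrees between two profiles (vectors)."""
    cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def subspace_angle_deg(A, B):
    """Largest principal angle in degrees between the column spaces of A and B.

    Orthonormal bases come from QR; the cosines of the principal angles are
    the singular values of Qa.T @ Qb. The arccosine form is adequate for a
    sketch but loses precision for extremely small angles.
    """
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return float(np.degrees(np.arccos(np.clip(s.min(), 0.0, 1.0))))

# Hypothetical example: B mixes the columns of A, so each profile changes
# while the spanned subspace stays the same.
rng = np.random.default_rng(0)
A = rng.random((30, 3))
M = np.array([[1.0, 0.2, 0.0],
              [0.0, 1.0, 0.2],
              [0.2, 0.0, 1.0]])
B = A @ M

angles = [profile_angle_deg(A[:, k], B[:, k]) for k in range(3)]
congruence = subspace_angle_deg(A, B)
print(angles, congruence)
```

In this example the individual profiles differ by several degrees while the subspace angle is essentially zero. The same pattern, on a smaller scale, appears in the MLPCA-ALS rows of Table III, where the subspace angles (0.85 and 0.39) are well below the angles between individual profiles.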



10. Wentzell P, Karakach T, Roy S, Martinez MJ, Allen C, Werner-Washburne M. Multivariate curve resolution of time course microarray data. BMC Bioinformatics 2006; 7(1): 343.

11. Tauler R, Viana M, Querol X, Alastuey A, Flight RM, Wentzell PD, Hopke PK. Comparison of the results obtained by four receptor modelling methods in aerosol source apportionment studies. Atmos. Environ. 2009; 43(26): 3989–3997.

12. Jaumot J, Piña B, Tauler R. Application of multivariate curve resolution to the analysis of yeast genome-wide screens. Chemom. Intell. Lab. Syst. 2010; 104(1): 53–64.

13. Stanimirova I, Tauler R, Walczak B. A comparison of positive matrix factorization and the weighted multivariate curve resolution method. Application to environmental data. Environ. Sci. Technol. 2011; 45(23): 10102–10110.

14. Goicoechea HC, Olivieri AC, Tauler R. Application of the correlation constrained multivariate curve resolution alternating least-squares method for analyte quantitation in the presence of unexpected interferences using first-order instrumental data. Analyst 2010; 135(3): 636–642.

15. Parastar H, Radović JR, Jalali-Heravi M, Diez S, Bayona JM, Tauler R. Resolution and quantification of complex mixtures of polycyclic aromatic hydrocarbons in heavy fuel oil sample by means of GC × GC-TOFMS combined to multivariate curve resolution. Anal. Chem. 2011; 83(24): 9289–9297.

16. Terrado M, Barcelo D, Tauler R. Quality assessment of the multivariate curve resolution alternating least squares method for the investigation of environmental pollution patterns in surface water. Environ. Sci. Technol. 2009; 43(14): 5321–5326.

17. Wold S, Esbensen K, Geladi P. Principal component analysis. Chemom. Intell. Lab. Syst. 1987; 2(1–3): 37–52.

18. Dadashi M, Abdollahi H, Tauler R. Maximum likelihood principal component analysis as initial projection step in multivariate curve resolution analysis of noisy data. Chemom. Intell. Lab. Syst. 2012; 118(0): 33–40.

19. Blanchet L, Réhault J, Ruckebusch C, Huvenne JP, Tauler R, de Juan A. Chemometrics description of measurement error structure: study of an ultrafast absorption spectroscopy experiment. Anal. Chim. Acta 2009; 642(1–2): 19–26.

20. Viana M, Querol X, Alastuey A, Gil JI, Menéndez M. Identification of PM sources by principal component analysis (PCA) coupled with wind direction data. Chemosphere 2006; 65(11): 2411–2418.

21. Querol X, Alastuey A, Rodriguez S, Plana F, Ruiz CR, Cots N, Massagué G, Puig O. PM10 and PM2.5 source apportionment in the Barcelona Metropolitan area, Catalonia, Spain. Atmos. Environ. 2001; 35(36): 6407–6419.

22. Causton HC, Quackenbush J, Brāzma A. Microarray Gene Expression Data Analysis: A beginner’s Guide. Blackwell Pub, Wiley-Blackwell, 2003.

23. Brauer MJ, Saldanha AJ, Dolinski K, Botstein D. Homeostatic adjustment and metabolic remodeling in glucose-limited yeast cultures. Mol. Biol. Cell 2005; 16(5): 2503–2517.

24. web, http://smd.stanford.edu/cgi-bin/tools/display/listMicroArrayData.pl?tableName=publication.

25. Rid GD, Wynne K. In Encyclopedia of Analytical Chemistry, Meyers RA (ed.). John Wiley & Sons Ltd.: Chichester, 2000.

26. Aloise S, Ruckebusch C, Blanchet L, Rehault J, Buntinx G, Huvenne J-P. The benzophenone S1(n,π*) → T1(n,π*) states intersystem crossing reinvestigated by ultrafast absorption spectroscopy and multivariate curve resolution. J. Phys. Chem. A 2007; 112(2): 224–231.

27. Farnham IM, Singh AK, Stetzenbach KJ, Johannesson KH. Treatment of nondetects in multivariate analysis of groundwater geochemistry data. Chemom. Intell. Lab. Syst. 2002; 60(1–2): 265–281.

28. Andrews DT, Wentzell PD. Applications of maximum likelihood principal component analysis: incomplete data sets and calibration transfer. Anal. Chim. Acta 1997; 350(3): 341–352.

29. Tauler R. Multivariate curve resolution applied to second order data. Chemom. Intell. Lab. Syst. 1995; 30(1): 133–146.

30. Tauler R, Smilde A, Kowalski B. Selectivity, local rank, three-way data analysis and ambiguity in multivariate curve resolution. J. Chemom. 1995; 9(1): 31–58.

31. Jaumot J, Gargallo R, de Juan A, Tauler R. A graphical user-friendly interface for MCR-ALS: a new tool for multivariate curve resolution in MATLAB. Chemom. Intell. Lab. Syst. 2005; 76(1): 101–110.

32. Wentzell PD. 2.25 – Other topics in soft-modeling: maximum likelihood-based soft-modeling methods. In Comprehensive Chemometrics, Brown SD, Tauler R, Walczak B (Editors-in-Chief). Elsevier: Oxford, 2009; 507–558.

33. Wentzell PD, Lohnes MT. Maximum likelihood principal component analysis with correlated measurement errors: theoretical and practical considerations. Chemom. Intell. Lab. Syst. 1999; 45(1–2): 65–85.

34. de Juan A, Vander Heyden Y, Tauler R, Massart DL. Assessment of new constraints applied to the alternating least squares method. Anal. Chim. Acta 1997; 346(3): 307–318.

35. Bro R, De Jong S. A fast non-negativity-constrained least squares algorithm. J. Chemom. 1997; 11(5): 393–401.

36. Tauler R. Application of non-linear optimization methods to the estimation of multivariate curve resolution solutions and of their feasible band boundaries in the investigation of two chemical and environmental simulated data sets. Anal. Chim. Acta 2007; 595(1–2): 289–298.

37. Smilde AK, Kiers HAL, Bijlsma S, Rubingh CM, van Erk MJ. Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics 2009; 25(3): 401–405.

38. Björck Å, Golub GH. Numerical methods for computing angles between linear subspaces. Math. Comput. 1973; 27(123): 579–594.

39. Wedin P. On angles between subspaces of a finite dimensional inner product space. In Matrix Pencils, Lecture Notes in Mathematics, Volume 973, Kågström B, Ruhe A (eds.). Springer: Berlin/Heidelberg, 1983; 263–285.

40. Windig W, Guilment J. Interactive self-modeling mixture analysis. Anal. Chem. 1991; 63(14): 1425–1432.

41. Karakach T, Flight R, Wentzell P. Bootstrap method for the estimation of measurement uncertainty in spotted dual-color DNA microarrays. Anal. Bioanal. Chem. 2007; 389(7–8): 2125–2141.


