2. The PARAFAC model


Page 1: 2. The PARAFAC model


2. The PARAFAC model

Quimiometria Teórica e Aplicada

Instituto de Química - UNICAMP

Page 2: 2. The PARAFAC model

Example: fluorescence data (1)

[Figure: surface plot of fluorescence intensity against excitation wavelength (nm) and emission wavelength (nm).]

Each fluorescence spectrum is a matrix of emission vs excitation wavelengths:

$X_i$ ($201 \times 61$)

Page 3: 2. The PARAFAC model

Example: fluorescence data (2)

• Each spectrum is a linear sum of three components: tryptophan, phenylalanine and tyrosine.

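$X_i = a_{i1} b_1 c_1^T + a_{i2} b_2 c_2^T + a_{i3} b_3 c_3^T + E_i$

– $a_{i1}$: concentration of tryptophan in sample $i$

– $b_1$: emission spectrum of pure tryptophan

– $c_1$: excitation spectrum of pure tryptophan

[Diagram: $X_i$ drawn as the sum of three outer products $b_r c_r^T$, each scaled by $a_{ir}$, plus $E_i$.]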
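The three-term sum for a single sample can be sketched numerically. This is a minimal illustration under stated assumptions: the Gaussian band shapes, peak positions and concentrations are invented for the example and are not taken from the slides; only the dimensions (201 emission and 61 excitation wavelengths, three components) follow the text.

```python
import numpy as np

em = np.linspace(250, 450, 201)   # emission wavelengths (nm), 201 points
ex = np.linspace(240, 300, 61)    # excitation wavelengths (nm), 61 points

def gaussian(x, centre, width):
    """Hypothetical band shape, used only for illustration."""
    return np.exp(-0.5 * ((x - centre) / width) ** 2)

# Columns b_r (emission) and c_r (excitation); peak positions are invented.
B = np.column_stack([gaussian(em, c, 20.0) for c in (350.0, 305.0, 285.0)])
C = np.column_stack([gaussian(ex, c, 8.0) for c in (280.0, 275.0, 258.0)])
a_i = np.array([0.9, 0.7, 0.7])   # assumed concentrations in sample i

# X_i = sum_r a_ir * b_r c_r^T  (noise term E_i left out)
X_i = sum(a_i[r] * np.outer(B[:, r], C[:, r]) for r in range(3))

# The same sum written as one matrix product: X_i = B diag(a_i) C^T
X_i_alt = B @ np.diag(a_i) @ C.T
print(X_i.shape, np.allclose(X_i, X_i_alt))
```

Writing the sum as `B @ diag(a_i) @ C.T` makes it clear that every sample shares the same spectra and differs only in its row of concentrations.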

Page 4: 2. The PARAFAC model

Example: fluorescence data (3)

• Five samples were measured and stacked to give a three-way array: $X$ ($5 \times 201 \times 61$).

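[Diagram: the slabs $X_1, \dots, X_5$ stacked into an array of 5 samples × 201 emission wavelengths × 61 excitation wavelengths, shown equal to the sum of three triads ($a_r$, $b_r$, $c_r$), $r = 1, 2, 3$, plus $E$. The vector $a_1$ holds the concentration of tryptophan in each sample.]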
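Stacking the per-sample matrices into a three-way array might look like the sketch below. The loadings are random stand-ins (an assumption for illustration, not the measured data); the shapes match the five-sample example.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 5, 201, 61, 3
A = rng.random((I, R))   # stand-in concentrations (random, not real data)
B = rng.random((J, R))   # stand-in emission spectra
C = rng.random((K, R))   # stand-in excitation spectra

# One 201 x 61 slab per sample: X_i = B diag(a_i) C^T
slabs = [B @ np.diag(A[i]) @ C.T for i in range(I)]

# Stack the five slabs along a new first axis -> 5 x 201 x 61 array
X = np.stack(slabs, axis=0)

# Equivalent one-line construction: x_ijk = sum_r a_ir b_jr c_kr
X_einsum = np.einsum('ir,jr,kr->ijk', A, B, C)
print(X.shape, np.allclose(X, X_einsum))
```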

Page 5: 2. The PARAFAC model

Example: fluorescence data (4)

• If we are given a set of fluorescence spectra, $X$, how can we determine:

– How many chemical species are present?

– Which chemical species are present? What are their pure excitation and emission spectra?

i.e. self-modelling curve resolution (SMCR)

– What is the concentration of each species in each sample?

i.e. (second-order) calibration

• Answer: use the PARAFAC model!

Page 6: 2. The PARAFAC model

The PARAFAC model (1)

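[Diagram: the array $X$ ($I \times J \times K$) equals the loadings $A$, $B$ and $C$ combined, plus $E$. Equivalently, $X$ is the sum of $R$ triads ($a_1$, $b_1$, $c_1$), ($a_2$, $b_2$, $c_2$), …, ($a_R$, $b_R$, $c_R$), plus $E$.]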
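Written element by element, the PARAFAC model on this slide takes the form below. The lower-case symbols are assumed notation for the elements of $X$, $A$, $B$, $C$ and $E$; they do not appear on the slide itself.

```latex
x_{ijk} = \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr} + e_{ijk},
\qquad i = 1,\dots,I,\quad j = 1,\dots,J,\quad k = 1,\dots,K
```

Each term $a_{ir} b_{jr} c_{kr}$ is the contribution of one triad to one element of the array.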

Page 7: 2. The PARAFAC model

The PARAFAC model (2)

• Loadings

– $A$ ($I \times R$) describes variation in the first mode.

– $B$ ($J \times R$) describes variation in the second mode.

– $C$ ($K \times R$) describes variation in the third mode.

• Residuals

– $E$ ($I \times J \times K$) are the model residuals.

[Diagram: $X$ ($I \times J \times K$) = loadings $A$, $B$, $C$ + $E$.]

Page 8: 2. The PARAFAC model

Example: fluorescence data (5)

• Loadings

– $A$ ($5 \times 3$) describes the component concentrations.

– $B$ ($201 \times 3$) describes the pure component emission spectra.

– $C$ ($61 \times 3$) describes the pure component excitation spectra.

• Residuals

– $E$ ($5 \times 201 \times 61$) describes instrument noise.

[Diagram: $X$ (5 samples × 201 emission wavelengths × 61 excitation wavelengths) = loadings $A$, $B$, $C$ + $E$.]

Page 9: 2. The PARAFAC model

Example: fluorescence data (6)

• A 3-component PARAFAC model describes 99.94% of X.

[Figure, two panels: $B$ ($201 \times 3$), loadings against emission wavelength (nm); $C$ ($61 \times 3$), loadings against excitation wavelength (nm). In each panel the three curves are labelled tryptophan, tyrosine and phenylalanine.]
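A figure such as "describes 99.94% of X" is commonly computed as the explained sum of squares, $100\,(1 - \|E\|^2 / \|X\|^2)$. The slides do not state the formula, so the sketch below assumes that definition; the data are synthetic and the function name `percent_explained` is hypothetical.

```python
import numpy as np

def percent_explained(X, A, B, C):
    """100 * (1 - ||E||^2 / ||X||^2) for a three-way PARAFAC model."""
    X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)   # model reconstruction
    E = X - X_hat                                  # residual array
    return 100.0 * (1.0 - np.sum(E ** 2) / np.sum(X ** 2))

# Synthetic check: known loadings plus a little noise (not the real data).
rng = np.random.default_rng(1)
A, B, C = rng.random((5, 3)), rng.random((201, 3)), rng.random((61, 3))
X = np.einsum('ir,jr,kr->ijk', A, B, C)
X_noisy = X + 0.01 * rng.standard_normal(X.shape)
fit = percent_explained(X_noisy, A, B, C)
print(f"{fit:.2f}% of X described")
```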

Page 10: 2. The PARAFAC model

Example: fluorescence data (7)

• The A-loadings describe the relative amounts of species 1 (tryptophan), 2 (tyrosine) and 3 (phenylalanine) in each sample:

• In order to know the absolute amounts, it is necessary to use a standard of known concentrations, i.e. sample 5.

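$A$ ($5 \times 3$), relative amounts:

Sample          Tryptophan    Tyrosine    Phenylalanine
1                 2.7867      -0.0135       -0.0042
2                 0.0147       2.0803        0.0006
3                 0.0492       0.0234        1.8358
4                 1.6140       0.8378        0.7990
5                 0.9179       0.6949        0.6945

Concentrations (ppm):

Sample          Tryptophan    Tyrosine    Phenylalanine
1                 2.6685      -0.0853       -1.8151
2                 0.0141      13.172         0.2714
3                 0.0471       0.1484      785.09
4                 1.5455       5.3045      341.68
5 (standard)      0.8790       4.4000      297.00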
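The conversion from relative to absolute amounts can be sketched as one scale factor per component, fixed by the standard (sample 5). The row and column arrangement of the loadings below is an assumption reconstructed from the flattened slide text (rows as samples 1 to 5, columns as tryptophan, tyrosine, phenylalanine).

```python
import numpy as np

# A-loadings from the slide (rows: samples 1-5; columns: Trp, Tyr, Phe).
# The layout is reconstructed from flattened text and is an assumption.
A = np.array([
    [2.7867, -0.0135, -0.0042],
    [0.0147,  2.0803,  0.0006],
    [0.0492,  0.0234,  1.8358],
    [1.6140,  0.8378,  0.7990],
    [0.9179,  0.6949,  0.6945],
])

# Known concentrations (ppm) of the standard, sample 5
known = np.array([0.8790, 4.4000, 297.00])

# One scale factor per component: ppm per unit of loading
scale = known / A[4]

# Apply the factors column by column to every sample
conc = A * scale
print(np.round(conc, 4))
```

With this arrangement the scaled values agree with the slide's concentrations to within rounding; the near-zero entries differ slightly more, presumably because the printed loadings are rounded to four decimals.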

Page 11: 2. The PARAFAC model

The PARAFAC formula

• Data array

– $X$ ($I \times J \times K$) is matricized into $X_{IJK}$ ($I \times JK$)

$X_{IJK} = A (C \odot B)^T + E_{IJK}$

where $\odot$ is the Khatri-Rao matrix product.

• Loadings

– $A$ ($I \times R$) describes variation in the first mode

– $B$ ($J \times R$) describes variation in the second mode

– $C$ ($K \times R$) describes variation in the third mode

• Residuals

– $E$ ($I \times J \times K$) is matricized into $E_{IJK}$ ($I \times JK$)
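The Khatri-Rao product is the column-wise Kronecker product. A small sketch, with a hand-written helper `khatri_rao` (a hypothetical name, not a library call) and tiny invented matrices, shows both the product and the matricized formula on a noise-free array. It assumes the columns of the unfolded array are ordered with the second-mode index varying fastest.

```python
import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker product: column r is kron(C[:, r], B[:, r])."""
    K, R = C.shape
    J, _ = B.shape
    return np.einsum('kr,jr->kjr', C, B).reshape(K * J, R)

C = np.array([[1, 2],
              [3, 4]])            # K = 2, R = 2
B = np.array([[5, 6],
              [7, 8],
              [9, 10]])           # J = 3, R = 2
Z = khatri_rao(C, B)              # shape (K*J, R) = (6, 2)
print(Z)

# Check the matricized model X_IJK = A (C kr B)^T on a noise-free array.
A = np.array([[1.0, 0.5],
              [2.0, -1.0]])       # I = 2, R = 2
X = np.einsum('ir,jr,kr->ijk', A, B, C)        # I x J x K array
X_IJK = X.transpose(0, 2, 1).reshape(2, -1)    # I x JK, j varies fastest
print(np.allclose(X_IJK, A @ Z.T))
```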

Page 12: 2. The PARAFAC model

PCA vs PARAFAC

PCA

– Bilinear model: $X = AB^T + E$

– Components are calculated sequentially in order of importance.

– Solution has rotational freedom.

– Loadings are orthogonal, i.e. $B^TB = I$.

PARAFAC

– Trilinear model: $X_{IJK} = A (C \odot B)^T + E_{IJK}$

– Components are calculated simultaneously in random order.

– Solution is unique (i.e. not possible to rotate factors without losing fit).

– Loadings are not (usually) orthogonal.
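The PCA properties in this comparison (orthogonal loadings, components ordered by importance) can be checked with a short sketch. It assumes PCA is computed through the singular value decomposition of column-centred data, which is one common route; the data are random.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 8))
X = X - X.mean(axis=0)             # column-centre before PCA

# PCA via the SVD: X = U S V^T, scores A = U S, loadings B = V
U, s, Vt = np.linalg.svd(X, full_matrices=False)
R = 3
A = U[:, :R] * s[:R]               # scores of the first R components
B = Vt[:R].T                       # loadings of the first R components

print(np.allclose(B.T @ B, np.eye(R)))   # loadings are orthonormal
print(np.all(np.diff(s) <= 0))           # components sorted by importance
```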

Page 13: 2. The PARAFAC model

Rotational freedom

• The bilinear model $X = AB^T + E$ contains rotational freedom. There are many sets of loadings (and scores) which give exactly the same residuals, $E$. For any invertible matrix $R$:

$X = AB^T + E$

$\;\; = A R R^{-1} B^T + E$

$\;\; = A^* B^{*T} + E$, where $A^* = AR$ and $B^{*T} = R^{-1} B^T$

• This model is not unique – there are many different sets of loadings which give the same % fit.
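A numerical sketch of this rotational freedom follows. The scores, loadings and rotation angle are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 2))    # scores
B = rng.standard_normal((4, 2))    # loadings
X_hat = A @ B.T                    # bilinear model part (E left out)

# Any invertible R gives an equivalent pair of scores and loadings.
theta = 0.7                        # arbitrary angle, chosen for illustration
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
A_star = A @ R                     # A* = A R
B_star_T = np.linalg.inv(R) @ B.T  # B*^T = R^-1 B^T

print(np.allclose(X_hat, A_star @ B_star_T))   # same fit, same residuals
print(np.allclose(A, A_star))                  # but different scores
```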

Page 14: 2. The PARAFAC model

PARAFAC solution is unique

• The trilinear model $X = A (C \odot B)^T + E$ is said to be unique, because it is not possible to rotate the loadings without changing the residuals, $E$:

$X = A (C \odot B)^T + E$

$\;\; = A R R^{-1} (C \odot B)^T + E$

$\;\; = A^* (C^* \odot B^*)^T + E^*$

• This is why PARAFAC is able to find the correct fluorescence profiles – because the unique solution is close to the true solution.
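One way to see why the rotation cannot be absorbed is to look at what it does to the Khatri-Rao factor. Every column of $C \odot B$ folds back into a rank-one $K \times J$ matrix; after multiplying by $R^{-T}$ the columns generally do not, so no pair $B^*$, $C^*$ reproduces them exactly. The sketch below is an illustration for random loadings and an arbitrary rotation, not a proof.

```python
import numpy as np

rng = np.random.default_rng(4)
J, K, R = 4, 3, 2
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))
Z = np.einsum('kr,jr->kjr', C, B).reshape(K * J, R)   # C kr B

theta = 0.7                        # arbitrary angle, chosen for illustration
Rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
# A R R^-1 Z^T = (A R)(Z R^-T)^T, so the compensating factor is Z R^-T.
Z_star = Z @ np.linalg.inv(Rot).T

def column_ranks(M):
    """Rank of each column after folding it back into a K x J matrix."""
    return [int(np.linalg.matrix_rank(M[:, r].reshape(K, J))) for r in range(R)]

print(column_ranks(Z))        # Khatri-Rao columns fold to rank-one matrices
print(column_ranks(Z_star))   # rotated columns generally do not
```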

Page 15: 2. The PARAFAC model

Spot the difference!

[Figure: two panels of loadings plotted against variable index (0 to 250), one showing PCA loadings and one showing PARAFAC loadings.]

Page 16: 2. The PARAFAC model

Alternating least squares (ALS)

• How to estimate the PCA model $X = AB^T + E$?

• Step 0 - Initialize $B$

• Step 1 - Estimate $A$ using least squares:

$\min_A \|X - AB^T\|^2 \;\Rightarrow\; A = XB(B^TB)^{-1}$

• Step 2 - Estimate $B$ using least squares:

$\min_B \|X^T - BA^T\|^2 \;\Rightarrow\; B = X^TA(A^TA)^{-1}$

• Step 3 - Check for convergence - if not, go to Step 1.

Each update must reduce the sum-of-squares, $\|E\|^2$.
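The bilinear ALS steps can be sketched as follows. This is a minimal version under stated assumptions: it uses a random start, imposes no orthogonality or ordering on the components, and the convergence tolerance and synthetic data are arbitrary choices; `bilinear_als` is a hypothetical name.

```python
import numpy as np

def bilinear_als(X, R, n_iter=200, tol=1e-12, seed=0):
    """Fit X ~ A B^T by alternating least squares (no orthogonality imposed)."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((X.shape[1], R))        # Step 0: initialize B
    sse_history = []
    for _ in range(n_iter):
        A = X @ B @ np.linalg.pinv(B.T @ B)         # Step 1: A = X B (B^T B)^-1
        B = X.T @ A @ np.linalg.pinv(A.T @ A)       # Step 2: B = X^T A (A^T A)^-1
        sse = np.sum((X - A @ B.T) ** 2)            # ||E||^2 after this sweep
        # Step 3: stop when the relative improvement is negligible
        done = bool(sse_history) and sse_history[-1] - sse <= tol * sse_history[-1]
        sse_history.append(sse)
        if done:
            break
    return A, B, sse_history

rng = np.random.default_rng(5)
X = rng.standard_normal((30, 3)) @ rng.standard_normal((3, 10))
X += 0.01 * rng.standard_normal(X.shape)            # small synthetic noise
A, B, sse_history = bilinear_als(X, R=3)
print(len(sse_history), sse_history[-1])
```

Because each step solves its least-squares problem exactly, the recorded sum of squares should not increase from one sweep to the next.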

Page 17: 2. The PARAFAC model

Three different unfoldings – the formula is symmetric

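$X_{IJK} = A (C \odot B)^T + E_{IJK}$

or

$X_{JKI} = B (A \odot C)^T + E_{JKI}$

or

$X_{KIJ} = C (B \odot A)^T + E_{KIJ}$

[Diagram: the three unfolded matrices $X_{IJK}$, $X_{JKI}$ and $X_{KIJ}$.]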
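The three unfoldings can be verified numerically on a noise-free array. The sketch assumes a particular column ordering for each unfolding, chosen so that it pairs with the Khatri-Rao product written in the corresponding formula; the helper `kr` is a hypothetical name.

```python
import numpy as np

def kr(P, Q):
    """Khatri-Rao product: column r is kron(P[:, r], Q[:, r])."""
    return np.einsum('pr,qr->pqr', P, Q).reshape(P.shape[0] * Q.shape[0], -1)

rng = np.random.default_rng(6)
I, J, K, R = 4, 5, 6, 3
A, B, C = (rng.standard_normal((n, R)) for n in (I, J, K))
X = np.einsum('ir,jr,kr->ijk', A, B, C)          # noise-free array, E = 0

# Row mode first; the column order follows the order of the Khatri-Rao factors.
X_IJK = X.transpose(0, 2, 1).reshape(I, K * J)   # pairs with C kr B
X_JKI = X.transpose(1, 0, 2).reshape(J, I * K)   # pairs with A kr C
X_KIJ = X.transpose(2, 1, 0).reshape(K, J * I)   # pairs with B kr A

print(np.allclose(X_IJK, A @ kr(C, B).T))
print(np.allclose(X_JKI, B @ kr(A, C).T))
print(np.allclose(X_KIJ, C @ kr(B, A).T))
```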

Page 18: 2. The PARAFAC model

How is the PARAFAC model calculated?

• How to estimate the model $X = A (C \odot B)^T + E$?

• Step 0 - Initialize $B$ & $C$

• Step 1 - Estimate $A$:

$Z = C \odot B, \qquad \min_A \|X_{IJK} - AZ^T\|^2$

• Step 2 - Estimate $B$ in same way:

$Z = A \odot C, \qquad \min_B \|X_{JKI} - BZ^T\|^2$

• Step 3 - Estimate $C$ in same way:

$Z = B \odot A, \qquad \min_C \|X_{KIJ} - CZ^T\|^2$

• Step 4 - Check for convergence. If not, go to Step 1.
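The PARAFAC steps on this slide can be sketched in the same style as bilinear ALS. This is a minimal illustration, assuming a random start, no constraints and an arbitrary stopping tolerance; the function names and the synthetic two-component data are hypothetical.

```python
import numpy as np

def kr(P, Q):
    """Khatri-Rao product: column r is kron(P[:, r], Q[:, r])."""
    return np.einsum('pr,qr->pqr', P, Q).reshape(P.shape[0] * Q.shape[0], -1)

def ls_update(X_unf, Z):
    """Solve min_F ||X_unf - F Z^T||^2 for F."""
    return np.linalg.lstsq(Z, X_unf.T, rcond=None)[0].T

def parafac_als(X, R, n_iter=500, tol=1e-10, seed=0):
    I, J, K = X.shape
    X_IJK = X.transpose(0, 2, 1).reshape(I, -1)   # pairs with C kr B
    X_JKI = X.transpose(1, 0, 2).reshape(J, -1)   # pairs with A kr C
    X_KIJ = X.transpose(2, 1, 0).reshape(K, -1)   # pairs with B kr A
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((J, R))               # Step 0: initialize B and C
    C = rng.standard_normal((K, R))
    history = []
    for _ in range(n_iter):
        A = ls_update(X_IJK, kr(C, B))            # Step 1: estimate A
        B = ls_update(X_JKI, kr(A, C))            # Step 2: estimate B
        C = ls_update(X_KIJ, kr(B, A))            # Step 3: estimate C
        sse = np.sum((X_KIJ - C @ kr(B, A).T) ** 2)
        # Step 4: converged when ||E||^2 stops decreasing
        done = bool(history) and history[-1] - sse <= tol * history[-1]
        history.append(sse)
        if done:
            break
    return A, B, C, history

rng = np.random.default_rng(9)
A0, B0, C0 = (rng.standard_normal((n, 2)) for n in (5, 20, 10))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
X += 0.01 * rng.standard_normal(X.shape)          # small synthetic noise
A, B, C, history = parafac_als(X, R=2)
print(len(history), history[-1])
```

The recovered columns may come out in a different order and scaling from `A0`, `B0`, `C0`, since PARAFAC components have no particular order.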

Page 19: 2. The PARAFAC model

Good initialization is sometimes important

Initialization methods

– random numbers (do this ten times and compare models)

– use another method to give rough estimate (e.g. DTLD, MCR)

– use sensible guesses (e.g. elution profiles are Gaussian)

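[Diagram: response surface of $\|E\|^2$. Two ALS runs, started from the initializations $B$ & $C$ and $B^*$ & $C^*$, end at different points: one at a good solution and one at a local minimum.]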
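The random-start strategy (fit ten times and compare models) might be sketched as below. The compact fitting routine is an assumption for illustration: it uses the normal-equation form of each least-squares update and a fixed number of sweeps, and the data are synthetic.

```python
import numpy as np

def solve(M, G1, G2):
    """Least-squares update M (G1 * G2)^+, with * the element-wise product."""
    return M @ np.linalg.pinv(G1 * G2)

def parafac_als(X, R, seed, n_iter=300):
    """Compact ALS sketch; returns the loadings and the final ||E||^2."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    B = rng.standard_normal((J, R))      # random initialization of B and C
    C = rng.standard_normal((K, R))
    for _ in range(n_iter):
        A = solve(np.einsum('ijk,jr,kr->ir', X, B, C), B.T @ B, C.T @ C)
        B = solve(np.einsum('ijk,ir,kr->jr', X, A, C), A.T @ A, C.T @ C)
        C = solve(np.einsum('ijk,ir,jr->kr', X, A, B), A.T @ A, B.T @ B)
    E = X - np.einsum('ir,jr,kr->ijk', A, B, C)
    return (A, B, C), float(np.sum(E ** 2))

rng = np.random.default_rng(7)
A0, B0, C0 = rng.random((5, 3)), rng.random((30, 3)), rng.random((20, 3))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
X += 0.01 * rng.standard_normal(X.shape)          # small synthetic noise

# Ten random starts; keep the model with the smallest sum of squares.
runs = [parafac_als(X, R=3, seed=s) for s in range(10)]
losses = np.array([sse for _, sse in runs])
best_model, best_sse = runs[int(np.argmin(losses))]
print(np.sort(losses))
```

If most starts reach nearly the same sum of squares, that agreement is some evidence the fit is not a local minimum; a run with a clearly larger value would be discarded.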

Page 20: 2. The PARAFAC model

Conclusions (1)

• The PARAFAC model decomposes a three-way array into three sets of loadings – one for each ‘mode’. Each set of loadings describes the variation in that mode, e.g. differences in concentration, changes in time, spectral profiles etc.

• PARAFAC components are calculated together and have no particular order. PARAFAC components are not orthogonal and cannot be rotated.

• PARAFAC can be used for curve resolution and for calibration.

Page 21: 2. The PARAFAC model

Conclusions (2)

• Some data sets have a chemical structure which is particularly suitable for the PARAFAC model, e.g. fluorescence spectroscopy.

• The PARAFAC model can also be used for four-way, five-way, N-way etc. data by simply using more sets of loadings.
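The extension to more modes can be sketched by adding one loading matrix per extra mode. The fourth loading matrix `D` and all dimensions below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
I, J, K, L, R = 3, 4, 5, 6, 2
A, B, C, D = (rng.standard_normal((n, R)) for n in (I, J, K, L))

# Four-way PARAFAC structure: x_ijkl = sum_r a_ir b_jr c_kr d_lr
X4 = np.einsum('ir,jr,kr,lr->ijkl', A, B, C, D)

# Same array built as an explicit sum of R rank-one terms
X4_sum = sum(
    np.multiply.outer(
        np.multiply.outer(np.outer(A[:, r], B[:, r]), C[:, r]), D[:, r]
    )
    for r in range(R)
)
print(X4.shape, np.allclose(X4, X4_sum))
```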