Presentation at Chimiometrie 2009 · id tilbth d tiindustrial batch production. • Differences in...

Presentation at Chimiometrie 2009

Bringing the Chemical Knowledge into EmpiricalBringing the Chemical Knowledge into Empirical Models

ououIntroduire de l’information théorique dans les

modèles empiriquesmodèles empiriques.

Marion Cuny & Frank Westad

CAMO Software [email protected]

www.camo.com

Outline2

1. Limits of empirical modeling and necessity of bringing chemical knowledge in the modelg g g

2. Methodologya) NIR and MIR dataa) NIR and MIR datab) Cross-model validationc) Background informationc) Background informationd) LPLS

3 Results3. Results4. Conclusions

www.camo.com

Outline3




www.camo.com

Limits of empirical modeling4

For many years, chemometrics has focused on new, better methods for classification and regression, and using variable selection and pretreatments for the optimal classification or the lowest RMSEP.

• examples of variable selection methods: iPLS, moving window iPLS, genetic algorithms, best combination search, uncertainty

ti t b li (b t t j k k ifi )estimates by resampling (bootstrap, jack-knifing)…

• examples of pretreatments: MSC, SNV, Detrending, OSC, EPO…

www.camo.com

32 marzipan samples5

• 32 marzipan samples made from 9 recipes were obtained from an i d t i l b t h d tiindustrial batch production.

• Differences in recipes:– processing timesprocessing times– amounts of almonds, – amounts of apricot kernels,

t f t– amounts of water, – amounts of sucrose, – amounts of invert sugar, – amounts of glucose syrup – presence and amount of additives in the marzipan masses. Details about the experimental setup can be found in [1]Details about the experimental setup can be found in [1].

Goal: Online measurement of the sugar content ranging from 35-70% and water content varying from 6-18% with NIR.

www.camo.com

[1] J Christensen, L Nørgaard, H Heimdal, JG Pedersen, SB Engelsen. Rapid Spectroscopic Analysis of Marzipan – Comparative Instrumentation. Journal of Near Infrared Spectroscopy, vol 12 (1), 2004.

Regression results for sugar content6

Type of data Pretreatment # LVs RMSECV (Full) R2 (Validation Full CV)NIR No pretreatment 5 1.57 0.986

SNV 4 1.43 0.988MIR No pretreatment 8 2.01 0.976p

SNV 8 1.77 0.982

Sugar molecule:

Significant B coefficients

Sugar molecule:

NIR SNV

MIR SNV

All those wavelengths can not be related to sugar!

www.camo.com

related to sugar!

How do you deal with very correlated variables?7

PLS2 algorithm.

More selective but still too many wavelengths!

www.camo.com

g

Goals8

• To find a good model of sugar content that can be explained by chemistryp y y– What is the chemistry in the system that is being observed?

• To do so: include more of the chemistry in the models:models:– Combined the available instrumental data, to get a better

understanding: NIR explained by MIR.understanding: NIR explained by MIR.– At the same time, insert the chemical background

knowledge to confirm the interpretation from the models.

www.camo.com

g p

Outline9




www.camo.com

MIR and NIR spectroscopy of marzipan samples 10

1• FT-IR (Perkin-Elmer System 2000 1700-600 cm-1) andVIS/NIR (FOSS NIR System 6500 instrument 400-2500 nm) spectroscopyspectroscopy.

MIR spectra (left) and VIS/NIR spectra (right) of marzipan samplesMIR spectra are considered as containing NIR spectra contain overtones of the

www.camo.com

the fundamental chemical information fundamental chemical information

Preprocessing11

• Standard normal variates (SNV), which is equivalent to ( ), qcentering and normalizing each spectrum to unit variance.

iiikSNVik smxx /)( −= iiikik smxx /)(

www.camo.com

Selection of variables12

• PLS regression between X=MIR and Y=NIR• Study of the significance of the coefficient tested byStudy of the significance of the coefficient tested by

jack-knifing.• Validation method=cross model validation (CMV)• Validation method=cross-model validation (CMV)

www.camo.com

Outline13




www.camo.com

Selected variablesX Y

Significant

14

Cross-

Significant coefficient

Crossvalidation

SelectedCross-model

validation

Selected variables

validation

CMV is more restrictive

www.camo.com

than CV

Cross-model validation (CMV) with variable selection based on jack knifing

15

based on jack-knifing

0. Cross-validation on all objects X Y

1. Take out e.g. 10% of the objects2. Cross-validate the remaining3 Find significant variables3. Find significant variables 4. Predict the objects that were kept out5. Estimate RMSE (or explained variance) CVCMV

6. Repeat 1 - 5 until all objects have been taken out

7. Show frequency and significance for all y gvariables

9. Collect and predict an independent test-set!

www.camo.com

Outline16




www.camo.com

Band assignment table17

• 8 group frequencies were assigned to their respective vibrational regions from an NIR-chart and an initial binary matrix was built using 1 to indicatefrom an NIR chart, and an initial binary matrix was built using 1 to indicate peaks and 0 elsewhere.

• In addition, peaks were weighted either as weak (0.5), normal (1), strong (2) or very strong (4). The band assignment matrix was then convolved ( ) y g ( ) gwith a Gaussian filter of size N=30 (corresponding to 60 nm when the spectral resolution is equal to 2 nm) for each group.

upSugar molecule:

tiona

l gro

uFu

nct

Expected wavelengths

www.camo.comWavelength, nm

p g

Outline18




www.camo.com

Structure of the L-PLS data19

• One interesting aspect of L PLSR is that the regression model is based on• One interesting aspect of L-PLSR is that the regression model is based on the inherent link between the actual spectra and theoretical band assignment, giving direct “chemical” interpretation.

• With broad bands like in NIR, the assignments are rather crude, and one should always interpret the results in the light of the chemical background knowledge. By applying this procedure in e.g. MIR spectroscopy, a more g y y g g ydetailed interpretation compared to NIR would then be possible.

MIR

www.camo.com

L-shape PLS regression20

• Regular two-block PLS:X-weights (used to calculate scores and loadings) can be obtained from the A first eigenvectors of:

• Three-block / L-PLS:Weights for X and Z (used to find X and Z scores and loadings) may be obtained from the A first eigenvectors of:obtained from the A first eigenvectors of:

H Martens E Anderssen A Flatberg L H Gidskehaug M Høy F Westad A Thybo M Martens Regressing a matrix on

www.camo.com

H Martens., E Anderssen, A Flatberg, L.H. Gidskehaug, M Høy, F Westad, A Thybo, M Martens. Regressing a matrix on descriptors of both its rows and of its columns, by low-rank L-PLS Regression. Computational Statistics and Data Analysis, 48, 103-125, 2005.

Outline21




www.camo.com

Selection of variable by CMV22

• The 32 marzipan samples were subject to repeated CMV (100 runs. Uncertainty estimates from jack-( y jknifing).

• 3 PLSR components was the basis for significance3 PLSR components was the basis for significance tests at 5% level.

• The main results are shown as a map of frequency of• The main results are shown as a map of frequency of significance for the regression coefficient matrix B. Sugar and water constitute the chemical compoundsSugar and water constitute the chemical compounds of interest.

www.camo.com

Frequency of significance from repeated CMV23

Frequency of significance from repeated CMV100

600

800

1000

80

90 C-H deformation at 820 cm−1

corresponds to the NIR region d 2200OH

th (n

m),

NIR

1000

1200

140050

60

70 around 2200 nm.

The region 1400–1500 nm is CH2 stretch

OH and CH combination

Wav

elen

gt

1600

180030

40

related to both O-H stretch vibrations and C-H combinations.

OH

OH and CH combination

2000

2200

2400

10

20 The main peaks in the bands that were significant were selected as input to the L-PLS regression.

OH

Wavenumber (cm-1), FT-IR700 800 900 1000 1100 1200 1300 1400 1500 1600

24000

input to the L PLS regression. MIR

C-O stretch1030 cm-1

O-H deformation700-800 cm-1

www.camo.com

O-H stretch820 cm-1

Background: selected peaks24

Selection of the wavelengths representing the top of the peak

H20 (O H)

0.000 0.200 0.400 0.600 0.800 1.000

Matrix Plot

Selection of the wavelengths representing the top of the peak.

H20 (O-H)

C-H

C-H2

C-H3

R-OH

ArOH

C=C

N-HN H

984

1204

1404

1434

1584

1714

1864

1894

2094

2154

2324

Euro nir + assign selected - Matrix Plot, Sam.Set: Selected Samples, Var.Set: NIR selected

www.camo.com

Data structure25

H20 (O-H)

0 .000 0 .200 0 .400 0 .600 0 .800 1 .000

Matr ix Plo t

C-H

C-H2

C-H3

R-OH

ArOH

2 5Line Plot

C=C

N-H

984

120

4

140

4

143

4

158

4

171

4

186

4

189

4

209

4

215

4

232

4

Eu ro ni r + as s ign s el ec te d - Ma tr ix Pl ot, Sa m .Se t: Se lected Sam p les , Var.Se t: NIR s e lected

1.0

1.5 Line Plot

2.0

2.5

MIR selected

-0.5

0

0.5

1.0

1.5

selected

NIR

-1.5

-1.0

0 5

9 12 14 14 15 17 18 18 20 21 23 Variables

0.5

1137 cm

1083 cm

1065 cm

1032 cm

993 cm-

978 cm-

927 cm-

906 cm-

864 cm-

822 cm-

750 cm-

Variables

NIR selected

www.camo.com

84

204

404

434

584

714

864

894

094

154

324

1234567891011121314151617181920212223242526272829303132

m-1

m-1

m-1

m-1

-1 -1 -1 -1 -1 -1 -1

1234567891011121314151617181920212223242526272829303132

Correlation loading plot26

• Direct interpretation of the 1.0 PC2 Correlation Loadings (X and Y)

chemistry and the actual spectral regions that were found to be significant.

0 5

C-H

984 nm

• Note that sugar and water content are inversely correlated in the samples

0.5

456 1415 21

23

242529

32 1083 cm-1SC H2

C-H3C=CN-H

1204 nm2094 nm2154 nm2324 nm

Dummy variables for samples

correlated in the samples themselves, so there is both a concentration dependence as % content

ll h i l

0123456789

12

14

1617

181920

212225

2627

28

3031

32 1083 cm 11065 cm-1

1032 cm-1

978 cm 1

906 cm-1

864 cm-1

SugarC-H2

A OH1404 nm

1714 nm1864 nm

CO-strechas well as a chemical dependence in terms of OH bands.

1894 nm-0.5

101113

1137 cm-1 978 cm-1864 cm 1

Moisture

H20 (O-H)R-OH

ArOH

1434 nm

OH t h• Legend:

– NIR wavelengths, – MIR wave numbers,– Chemical background: wavelength -1.0

993 cm-1927 cm-1

822 cm-1

750 cm-1PC1

OH-strech

www.camo.com

g gtable,

– Composition -1.0 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1.0X-expl: 57%,33% Y-expl: 72%,13%

PC1

Regression on the selected OH peaks27

The –OH peaks to be

70

21

2223

2930

3132

21

22

232930

3132Slope Offset RMSE R-Square0.977228 1.053259 1.913449 0.9772280.974754 1.177793 2.143689 0.973177

Predicted Y

The OH peaks to be used to measure the sugar contents are: 2094 2154 2324 nm 50

60

26 2728

26 2728

2094, 2154, 2324 nm.

40

50

1

10

11

12

13

14151617

1819

20

251 7

10

11

12

13

14151617

1819

20

25

30

30 35 40 45 50 55 60 65 70

12

34567

89 12

24

2512

34

567

89 12

24

25

Measured Y

RESULT10, (Y-var, PC): (Sugar,2) (Sugar,2)

Results may not be better (in this case they are not) but you

www.camo.com

are able to say why you are using those wavelengths!

Outline28



3 Results3. Results 4. Conclusions

www.camo.com

Conclusions29

• Cross model validation with jack-knife estimates is an efficient way of removing variables that are not of interest, and the

l i b i l h d b drelation between two instrumental methods can be presented as a color image.L PLS i i di t i t t ti f th d l i• L-PLS regression gives direct interpretation of the underlying chemistry which is useful for confirming existing knowledge but also for finding unknown phenomena which may lead tobut also for finding unknown phenomena which may lead to innovation and further research.

• The correlation loading plot is a very condensed way of g p y yvisualizing all of interest for the three data tables.

www.camo.com

Perspectives30

• L-PLS is adapted to other types of data:– NMR

Z: information on hift i tshift assignment

Y: NMR measurement

X: information on the samples or other measurements

www.camo.com

Perspectives31

• L-PLS is adapted to other types of data:– metabolomic

Z: information hon phenotype

Y: metabolomicmeasurement

X: information on the samplesp

www.camo.com

Perspectives32

• L-PLS is adapted to other types of data:– sensoryy

Z: information on consumer

Y: consumer preference

X: information on the samplesp p

www.camo.com

Get value out of your data33

Thank you for your attention

Marion Cuny

[email protected]@camo.no

Other webinars: http://www camo com/training/webinars seminars html

www.camo.com

Other webinars: http://www.camo.com/training/webinars‐seminars.html

Recorded webinars: http://www.camo.com/training/archives.html

Presentation at Chimiometrie 2009 · id tilbth d tiindustrial batch production. • Differences in...

Documents

Transcript of Presentation at Chimiometrie 2009 · id tilbth d tiindustrial batch production. • Differences in...