Presentation at Chimiometrie 2009 · id tilbth d tiindustrial batch production. • Differences in...
Transcript of Presentation at Chimiometrie 2009 · id tilbth d tiindustrial batch production. • Differences in...
Presentation at Chimiometrie 2009
Bringing the Chemical Knowledge into EmpiricalBringing the Chemical Knowledge into Empirical Models
ououIntroduire de l’information théorique dans les
modèles empiriquesmodèles empiriques.
Marion Cuny & Frank Westad
CAMO Software [email protected]
www.camo.com
Outline2
1. Limits of empirical modeling and necessity of bringing chemical knowledge in the modelg g g
2. Methodologya) NIR and MIR dataa) NIR and MIR datab) Cross-model validationc) Background informationc) Background informationd) LPLS
3 Results3. Results4. Conclusions
www.camo.com
Outline3
1. Limits of empirical modeling and necessity of bringing chemical knowledge in the modelg g g
2. Methodologya) NIR and MIR dataa) NIR and MIR datab) Cross-model validationc) Background informationc) Background informationd) LPLS
3 Results3. Results4. Conclusions
www.camo.com
Limits of empirical modeling4
For many years, chemometrics has focused on new, better methods for classification and regression, and using variable selection and pretreatments for the optimal classification or the lowest RMSEP.
• examples of variable selection methods: iPLS, moving window iPLS, genetic algorithms, best combination search, uncertainty
ti t b li (b t t j k k ifi )estimates by resampling (bootstrap, jack-knifing)…
• examples of pretreatments: MSC, SNV, Detrending, OSC, EPO…
www.camo.com
32 marzipan samples5
• 32 marzipan samples made from 9 recipes were obtained from an i d t i l b t h d tiindustrial batch production.
• Differences in recipes:– processing timesprocessing times– amounts of almonds, – amounts of apricot kernels,
t f t– amounts of water, – amounts of sucrose, – amounts of invert sugar, – amounts of glucose syrup – presence and amount of additives in the marzipan masses. Details about the experimental setup can be found in [1]Details about the experimental setup can be found in [1].
Goal: Online measurement of the sugar content ranging from 35-70% and water content varying from 6-18% with NIR.
www.camo.com
[1] J Christensen, L Nørgaard, H Heimdal, JG Pedersen, SB Engelsen. Rapid Spectroscopic Analysis of Marzipan – Comparative Instrumentation. Journal of Near Infrared Spectroscopy, vol 12 (1), 2004.
Regression results for sugar content6
Type of data Pretreatment # LVs RMSECV (Full) R2 (Validation Full CV)NIR No pretreatment 5 1.57 0.986
SNV 4 1.43 0.988MIR No pretreatment 8 2.01 0.976p
SNV 8 1.77 0.982
Sugar molecule:
Significant B coefficients
Sugar molecule:
NIR SNV
MIR SNV
All those wavelengths can not be related to sugar!
www.camo.com
related to sugar!
How do you deal with very correlated variables?7
PLS2 algorithm.
More selective but still too many wavelengths!
www.camo.com
g
Goals8
• To find a good model of sugar content that can be explained by chemistryp y y– What is the chemistry in the system that is being observed?
• To do so: include more of the chemistry in the models:models:– Combined the available instrumental data, to get a better
understanding: NIR explained by MIR.understanding: NIR explained by MIR.– At the same time, insert the chemical background
knowledge to confirm the interpretation from the models.
www.camo.com
g p
Outline9
1. Limits of empirical modeling and necessity of bringing chemical knowledge in the modelg g g
2. Methodologya) NIR and MIR dataa) NIR and MIR datab) Cross-model validationc) Background informationc) Background informationd) LPLS
3 Results3. Results4. Conclusions
www.camo.com
MIR and NIR spectroscopy of marzipan samples 10
1• FT-IR (Perkin-Elmer System 2000 1700-600 cm-1) andVIS/NIR (FOSS NIR System 6500 instrument 400-2500 nm) spectroscopyspectroscopy.
MIR spectra (left) and VIS/NIR spectra (right) of marzipan samplesMIR spectra are considered as containing NIR spectra contain overtones of the
www.camo.com
the fundamental chemical information fundamental chemical information
Preprocessing11
• Standard normal variates (SNV), which is equivalent to ( ), qcentering and normalizing each spectrum to unit variance.
iiikSNVik smxx /)( −= iiikik smxx /)(
www.camo.com
Selection of variables12
• PLS regression between X=MIR and Y=NIR• Study of the significance of the coefficient tested byStudy of the significance of the coefficient tested by
jack-knifing.• Validation method=cross model validation (CMV)• Validation method=cross-model validation (CMV)
www.camo.com
Outline13
1. Limits of empirical modeling and necessity of bringing chemical knowledge in the modelg g g
2. Methodologya) NIR and MIR dataa) NIR and MIR datab) Cross-model validationc) Background informationc) Background informationd) LPLS
3 Results3. Results4. Conclusions
www.camo.com
Selected variablesX Y
Significant
14
Cross-
Significant coefficient
Crossvalidation
SelectedCross-model
validation
Selected variables
validation
CMV is more restrictive
www.camo.com
than CV
Cross-model validation (CMV) with variable selection based on jack knifing
15
based on jack-knifing
0. Cross-validation on all objects X Y
1. Take out e.g. 10% of the objects2. Cross-validate the remaining3 Find significant variables3. Find significant variables 4. Predict the objects that were kept out5. Estimate RMSE (or explained variance) CVCMV
6. Repeat 1 - 5 until all objects have been taken out
7. Show frequency and significance for all y gvariables
9. Collect and predict an independent test-set!
www.camo.com
Outline16
1. Limits of empirical modeling and necessity of bringing chemical knowledge in the modelg g g
2. Methodologya) NIR and MIR dataa) NIR and MIR datab) Cross-model validationc) Background informationc) Background informationd) LPLS
3 Results3. Results4. Conclusions
www.camo.com
Band assignment table17
• 8 group frequencies were assigned to their respective vibrational regions from an NIR-chart and an initial binary matrix was built using 1 to indicatefrom an NIR chart, and an initial binary matrix was built using 1 to indicate peaks and 0 elsewhere.
• In addition, peaks were weighted either as weak (0.5), normal (1), strong (2) or very strong (4). The band assignment matrix was then convolved ( ) y g ( ) gwith a Gaussian filter of size N=30 (corresponding to 60 nm when the spectral resolution is equal to 2 nm) for each group.
upSugar molecule:
tiona
l gro
uFu
nct
Expected wavelengths
www.camo.comWavelength, nm
p g
Outline18
1. Limits of empirical modeling and necessity of bringing chemical knowledge in the modelg g g
2. Methodologya) NIR and MIR dataa) NIR and MIR datab) Cross-model validationc) Background informationc) Background informationd) LPLS
3 Results3. Results4. Conclusions
www.camo.com
Structure of the L-PLS data19
• One interesting aspect of L PLSR is that the regression model is based on• One interesting aspect of L-PLSR is that the regression model is based on the inherent link between the actual spectra and theoretical band assignment, giving direct “chemical” interpretation.
• With broad bands like in NIR, the assignments are rather crude, and one should always interpret the results in the light of the chemical background knowledge. By applying this procedure in e.g. MIR spectroscopy, a more g y y g g ydetailed interpretation compared to NIR would then be possible.
MIR
www.camo.com
L-shape PLS regression20
• Regular two-block PLS:X-weights (used to calculate scores and loadings) can be obtained from the A first eigenvectors of:
• Three-block / L-PLS:Weights for X and Z (used to find X and Z scores and loadings) may be obtained from the A first eigenvectors of:obtained from the A first eigenvectors of:
H Martens E Anderssen A Flatberg L H Gidskehaug M Høy F Westad A Thybo M Martens Regressing a matrix on
www.camo.com
H Martens., E Anderssen, A Flatberg, L.H. Gidskehaug, M Høy, F Westad, A Thybo, M Martens. Regressing a matrix on descriptors of both its rows and of its columns, by low-rank L-PLS Regression. Computational Statistics and Data Analysis, 48, 103-125, 2005.
Outline21
1. Limits of empirical modeling and necessity of bringing chemical knowledge in the modelg g g
2. Methodologya) NIR and MIR dataa) NIR and MIR datab) Cross-model validationc) Background informationc) Background informationd) LPLS
3 Results3. Results4. Conclusions
www.camo.com
Selection of variable by CMV22
• The 32 marzipan samples were subject to repeated CMV (100 runs. Uncertainty estimates from jack-( y jknifing).
• 3 PLSR components was the basis for significance3 PLSR components was the basis for significance tests at 5% level.
• The main results are shown as a map of frequency of• The main results are shown as a map of frequency of significance for the regression coefficient matrix B. Sugar and water constitute the chemical compoundsSugar and water constitute the chemical compounds of interest.
www.camo.com
Frequency of significance from repeated CMV23
Frequency of significance from repeated CMV100
600
800
1000
80
90 C-H deformation at 820 cm−1
corresponds to the NIR region d 2200OH
th (n
m),
NIR
1000
1200
140050
60
70 around 2200 nm.
The region 1400–1500 nm is CH2 stretch
OH and CH combination
Wav
elen
gt
1600
180030
40
related to both O-H stretch vibrations and C-H combinations.
OH
OH and CH combination
2000
2200
2400
10
20 The main peaks in the bands that were significant were selected as input to the L-PLS regression.
OH
Wavenumber (cm-1), FT-IR700 800 900 1000 1100 1200 1300 1400 1500 1600
24000
input to the L PLS regression. MIR
C-O stretch1030 cm-1
O-H deformation700-800 cm-1
www.camo.com
O-H stretch820 cm-1
Background: selected peaks24
Selection of the wavelengths representing the top of the peak
H20 (O H)
0.000 0.200 0.400 0.600 0.800 1.000
Matrix Plot
Selection of the wavelengths representing the top of the peak.
H20 (O-H)
C-H
C-H2
C-H3
R-OH
ArOH
C=C
N-HN H
984
1204
1404
1434
1584
1714
1864
1894
2094
2154
2324
Euro nir + assign selected - Matrix Plot, Sam.Set: Selected Samples, Var.Set: NIR selected
www.camo.com
Data structure25
H20 (O-H)
0 .000 0 .200 0 .400 0 .600 0 .800 1 .000
Matr ix Plo t
C-H
C-H2
C-H3
R-OH
ArOH
2 5Line Plot
C=C
N-H
984
120
4
140
4
143
4
158
4
171
4
186
4
189
4
209
4
215
4
232
4
Eu ro ni r + as s ign s el ec te d - Ma tr ix Pl ot, Sa m .Se t: Se lected Sam p les , Var.Se t: NIR s e lected
1.0
1.5 Line Plot
2.0
2.5
MIR selected
-0.5
0
0.5
1.0
1.5
selected
NIR
-1.5
-1.0
0 5
9 12 14 14 15 17 18 18 20 21 23 Variables
0.5
1137 cm
1083 cm
1065 cm
1032 cm
993 cm-
978 cm-
927 cm-
906 cm-
864 cm-
822 cm-
750 cm-
Variables
NIR selected
www.camo.com
84
204
404
434
584
714
864
894
094
154
324
1234567891011121314151617181920212223242526272829303132
m-1
m-1
m-1
m-1
-1 -1 -1 -1 -1 -1 -1
1234567891011121314151617181920212223242526272829303132
Correlation loading plot26
• Direct interpretation of the 1.0 PC2 Correlation Loadings (X and Y)
chemistry and the actual spectral regions that were found to be significant.
0 5
C-H
984 nm
• Note that sugar and water content are inversely correlated in the samples
0.5
456 1415 21
23
242529
32 1083 cm-1SC H2
C-H3C=CN-H
1204 nm2094 nm2154 nm2324 nm
Dummy variables for samples
correlated in the samples themselves, so there is both a concentration dependence as % content
ll h i l
0123456789
12
14
1617
181920
212225
2627
28
3031
32 1083 cm 11065 cm-1
1032 cm-1
978 cm 1
906 cm-1
864 cm-1
SugarC-H2
A OH1404 nm
1714 nm1864 nm
CO-strechas well as a chemical dependence in terms of OH bands.
1894 nm-0.5
101113
1137 cm-1 978 cm-1864 cm 1
Moisture
H20 (O-H)R-OH
ArOH
1434 nm
OH t h• Legend:
– NIR wavelengths, – MIR wave numbers,– Chemical background: wavelength -1.0
993 cm-1927 cm-1
822 cm-1
750 cm-1PC1
OH-strech
www.camo.com
g gtable,
– Composition -1.0 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1.0X-expl: 57%,33% Y-expl: 72%,13%
PC1
Regression on the selected OH peaks27
The –OH peaks to be
70
21
2223
2930
3132
21
22
232930
3132Slope Offset RMSE R-Square0.977228 1.053259 1.913449 0.9772280.974754 1.177793 2.143689 0.973177
Predicted Y
The OH peaks to be used to measure the sugar contents are: 2094 2154 2324 nm 50
60
26 2728
26 2728
2094, 2154, 2324 nm.
40
50
1
10
11
12
13
14151617
1819
20
251 7
10
11
12
13
14151617
1819
20
25
30
30 35 40 45 50 55 60 65 70
12
34567
89 12
24
2512
34
567
89 12
24
25
Measured Y
RESULT10, (Y-var, PC): (Sugar,2) (Sugar,2)
Results may not be better (in this case they are not) but you
www.camo.com
are able to say why you are using those wavelengths!
Outline28
1. Limits of empirical modeling and necessity of bringing chemical knowledge in the modelg g g
2. Methodologya) NIR and MIR dataa) NIR and MIR datab) Cross-model validationc) Background informationc) Background informationd) LPLS
3 Results3. Results 4. Conclusions
www.camo.com
Conclusions29
• Cross model validation with jack-knife estimates is an efficient way of removing variables that are not of interest, and the
l i b i l h d b drelation between two instrumental methods can be presented as a color image.L PLS i i di t i t t ti f th d l i• L-PLS regression gives direct interpretation of the underlying chemistry which is useful for confirming existing knowledge but also for finding unknown phenomena which may lead tobut also for finding unknown phenomena which may lead to innovation and further research.
• The correlation loading plot is a very condensed way of g p y yvisualizing all of interest for the three data tables.
www.camo.com
Perspectives30
• L-PLS is adapted to other types of data:– NMR
Z: information on hift i tshift assignment
Y: NMR measurement
X: information on the samples or other measurements
www.camo.com
Perspectives31
• L-PLS is adapted to other types of data:– metabolomic
Z: information hon phenotype
Y: metabolomicmeasurement
X: information on the samplesp
www.camo.com
Perspectives32
• L-PLS is adapted to other types of data:– sensoryy
Z: information on consumer
Y: consumer preference
X: information on the samplesp p
www.camo.com
Get value out of your data33
Thank you for your attention
Marion Cuny
[email protected]@camo.no
Other webinars: http://www camo com/training/webinars seminars html
www.camo.com
Other webinars: http://www.camo.com/training/webinars‐seminars.html
Recorded webinars: http://www.camo.com/training/archives.html