Chemometrics - Technical University of Denmark · Chemometrics Introduction . Course 27411...

Chemometrics Introduction Course 27411 Biological data analysis and chemometrics Jens C. Frisvad

Page 1: Chemometrics - Technical University of Denmark · Chemometrics Introduction . Course 27411 Biological data analysis and chemometrics . ... • Use statistical and multivariate design

Chemometrics

Introduction

Course 27411 Biological data analysis and chemometrics

Jens C. Frisvad

Page 2:

Fundamental disciplines in biological sciences

• Classification (theory: taxonomy)
  – Discrimination (diagnostics)
  – Identification
  – Nomenclature

• Cladification (theory: phylogeny)
• Modelling and predictions
• Tests and validations

Page 3:

Engineering

• The science by which the properties of matter and the sources of energy in nature are made useful to man in structures, machines and products

• Measurement techniques, computers, language, definitions, properties, chemistry, physics, mathematical hard modelling, statistics, chemometrics and many other disciplines are necessary to be a good engineer

Page 4:

X-metrics

• The use of multivariate statistics in the discipline X

• Psychometrics (used in psychology)
• Taxometrics (used in taxonomy)
• Biometrics (used in biology)
• Technometrics (used in engineering)
• Chemometrics (used in chemistry)

Page 5:

Chemometrics

• Use of statistics and mathematics in chemical sciences, to measure and interpret chemical data (also used for biological data)

• Biometrics has often been restricted to univariate statistics and taxometrics to biosystematics

Page 6:

Other definitions of chemometrics:

• Empirical interactive data-driven modelling in chemistry (induction and abduction)

• Exploratory and confirmatory data-analysis (hypothesis generating and hypothesis testing) in chemistry

• Predictive multivariate modelling in chemistry

Page 7:

Important disciplines in chemometrics

• Sampling, selection of objects and variables
• Clustering
• Ordination (projection from N dimensions to few dimensions, eigenvector-based analyses)
• Multivariate regressions, calibrations and predictions
• Neural networks
• Validation (test set validation, boot-strapping, jackknifing, cross-validation)
• Graphical display and outlier analysis
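
As an illustration of the validation item above, here is a minimal sketch of leave-one-out cross-validation for a straight-line calibration. It is written in plain Python rather than the course software; the function names and the data are hypothetical, chosen only to show the principle that each object is predicted by a model fitted without it.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def loo_rmse(xs, ys):
    """Leave-one-out cross-validation: each object is predicted
    from a line fitted to all the other objects."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        squared_errors.append((ys[i] - (a + b * xs[i])) ** 2)
    return mean(squared_errors) ** 0.5

# Hypothetical calibration data (not from the course)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
rmsecv = loo_rmse(x, y)
print(rmsecv)
```

The cross-validated error is an estimate of how well the model predicts new objects, which is usually more pessimistic, and more honest, than the error on the objects used for fitting.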

Page 8:

Why use chemometrics?

• Complex systems with many interactions are common in science

• Indirect (and often non-destructive) observation of the world as it is

• An expansion of the human perception (the full electromagnetic spectrum)

• Chromatography and spectroscopy will always yield multivariate data

Page 9:
Page 10:

Most new analytical chemical methods give multivariate data

• IR and NIR • UV-VIS & fluorescence • NMR and ESR • MS • TLC • HPLC • GC • CE • CCC

Page 11:

Interpreting chemical data

• Abduction
  – Sharp independent signals that can be interpreted:
    • NMR • IR • MS

• Induction
  – Soft overlapping signals (strong interaction, co-linearity)
    • UV • Fluorescence • NIR • FIA
  – Any chromatographic/spectrometric measurements on mixtures

Page 12:

From Jens Bo Holm Nielsen

NIR spectra, ”soft spectra”

Page 13:

H-NMR spectrum, sharp signals

Page 14:

No. variables > No. objects

• Classical statistical methods will not work in that case (for example multiple linear regression, linear learning machine etc.)

• Two solutions are variable selection, or classical statistics applied to the scores from an eigenvector analysis
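
Why multiple linear regression breaks down here can be sketched with a small, hypothetical data matrix: with fewer objects than variables, the matrix X^T X in the normal equations is singular, so the regression coefficients are not uniquely determined. The helper names below are illustrative only.

```python
def gram(X):
    """Return X^T X for a matrix given as a list of rows (objects)."""
    p = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(p)]
            for i in range(p)]

def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Hypothetical data: 2 objects (rows) measured on 3 variables (columns)
X_wide = [[1, 2, 3],
          [4, 5, 7]]
# The same variables measured on 4 objects
X_tall = X_wide + [[2, 0, 1], [0, 1, 5]]

print(det3(gram(X_wide)))  # 0: X^T X is singular, no unique MLR solution
print(det3(gram(X_tall)))  # non-zero: the normal equations can be solved
```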

Page 15:

Important methods in chemometrics (classification)

• Cluster analysis
  – Hierarchical clustering
  – Divisive clustering
  – Block-clustering and fuzzy clustering

• Ordination
  – Principal component analysis
  – Correspondence analysis
  – Multidimensional scaling
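
To make ordination concrete, the following is a minimal sketch of principal component analysis for the first component only, in plain Python. It assumes column-centred (not autoscaled) data and uses power iteration on the covariance matrix, which is a simplification of what statistical packages do; the data and the function name are hypothetical. Power iteration as written also assumes that the start vector is not orthogonal to the first component.

```python
import math

def pca_first_component(X, iterations=500):
    """First principal component of X (rows = objects, columns = variables).

    Returns (loadings, scores, explained), where explained is the
    fraction of the total variance carried by the component.
    """
    n, p = len(X), len(X[0])
    # Column-centre the data
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    # Sample covariance matrix (p x p)
    C = [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
         for i in range(p)]
    # Power iteration for the leading eigenvector (the loadings)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * C[i][j] * v[j] for i in range(p) for j in range(p))
    total_variance = sum(C[i][i] for i in range(p))
    # Scores: projection of each centred object onto the loadings
    scores = [sum(r[j] * v[j] for j in range(p)) for r in Xc]
    return v, scores, eigenvalue / total_variance

# Hypothetical data: 5 objects, 3 variables; the first two are strongly correlated
X = [[2.0, 1.9, 0.5],
     [4.0, 4.1, 0.4],
     [6.0, 6.2, 0.6],
     [8.0, 7.8, 0.5],
     [10.0, 10.1, 0.4]]
loadings, scores, explained = pca_first_component(X)
print(loadings, explained)
```

Here three measured variables are projected onto one latent dimension that carries almost all of the variance, which is the sense in which ordination goes from N dimensions to few.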

Page 16:

The Celestial Emporium of Benevolent Knowledge (encyclopedia from the 10th century)

An arbitrary and idiosyncratic classification

Classification of plants and animals:

1. Those that belong to the Emperor
2. Embalmed ones
3. Those that are trained
4. Suckling pigs
5. Mermaids
6. Fabulous ones
7. Stray dogs
8. Those that are not included in this classification
9. Those that tremble as if they were mad
10. Innumerable ones
11. Those drawn with a very fine camel's hair brush
12. Others
13. Those that have just broken a flower vase
14. Those that resemble flies from a distance

Page 17:

Classification is central!

[Diagram with the labels: Classification; Phenotype data (differentiation); Nomenclature; Identification; Cladification; Phylogeny; Chemometrics; Cladistics]

Page 18:

Examples of chemical data

Page 19:

Aspergillus niger secondary metabolite HPLC profile (sharp signals)

[Figure: two traces plotted against time (min), from about 2 to 14 min: ESI+ TIC (m/z 100-900) and UV/VIS (200-700 nm). Labelled peaks: Aurasperone B; Pyranonigrin A; Ochratoxin A (shoulder); Malformin C, A, B1, B, A1; Rubrofusarin B; Malformin B2; Pyranonigrin B/C; Fumonisin B4; Fumonisin B2; Nigragillin; Funalenone; Nigragillin analogue; Tensidol B; Fonsecin B; Fonsecin; Aurasperone G & E; Demethylkotanin; Tensyuic acid A; Aurasperone C]

Page 20:

HPLC-ESI+ chromatograms (Luna C18 (2) column) of A. niger NRRL 3122 from YES agar. Upper: extracted ion chromatogram, m/z 404. Lower: total ion chromatogram (m/z 100-900).

[Figure: two TOF MS ES+ traces labelled Kir24877, plotted against time (min), from about 2 to 12 min. Annotations: Ochratoxin A, calculated mass 404.0901, deviation -4.2 ppm; chlorine isotope pattern; Fumonisin B2; Fumonisin B4]
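
The mass deviation quoted for ochratoxin A is a relative error in parts per million. A minimal sketch of the arithmetic follows, assuming the common sign convention (measured minus calculated). The measured mass is not given on the slide, so the value used here is hypothetical, back-calculated to reproduce roughly -4.2 ppm.

```python
def ppm_deviation(measured_mass, calculated_mass):
    """Relative mass error in parts per million (ppm)."""
    return (measured_mass - calculated_mass) / calculated_mass * 1e6

calculated = 404.0901  # calculated mass quoted on the slide
measured = 404.0884    # hypothetical measured mass, not from the slide

print(round(ppm_deviation(measured, calculated), 1))  # -4.2
```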

Page 21:

Clustering of some common Penicillia based on 31 extrolite biosynthetic families

[Dendrogram: P. verrucosum sensu lato; UPGMA, Yule; similarity axis from -1.00 to 1.00. Leaf order: ver1 ver2 ver3 ver4 nor1 nor2 nor3 nor4 vir1 vir2 vir3 vir4 aur1 aur4 aur2 aur3 cyc1 cyc4 cyc2 cyc3 pol1 pol4 pol2 pol3 cru1 cru4 cru3 cru2 com1 com4 com2 com3 aet1 aet4 aet2 aet3. Group labels: Series Verrucosa; Series Viridicata; Series Camemberti]
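
A dendrogram of this kind can be sketched in a few lines. The example below assumes that "Yule" refers to the Yule coefficient for binary (presence/absence) data, (ad - bc) / (ad + bc), and that UPGMA is average-linkage clustering on those similarities. The isolates and their profiles are hypothetical, not the Penicillium data from the slide.

```python
from itertools import combinations

def yule(u, v):
    """Yule coefficient (ad - bc) / (ad + bc) for two binary profiles."""
    a = sum(1 for x, y in zip(u, v) if x and y)          # present in both
    b = sum(1 for x, y in zip(u, v) if x and not y)      # only in u
    c = sum(1 for x, y in zip(u, v) if not x and y)      # only in v
    d = sum(1 for x, y in zip(u, v) if not x and not y)  # absent in both
    denominator = a * d + b * c
    if denominator == 0:
        raise ValueError("Yule coefficient is undefined for this pair")
    return (a * d - b * c) / denominator

def upgma(profiles):
    """Average-linkage (UPGMA) clustering on Yule similarities.

    Returns the merges in order as (cluster_1, cluster_2, similarity).
    """
    names = list(profiles)
    sim = {frozenset(pair): yule(profiles[pair[0]], profiles[pair[1]])
           for pair in combinations(names, 2)}

    def average(c1, c2):
        # Unweighted mean over all between-cluster pairs of objects
        total = sum(sim[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((c1, c2, average(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Hypothetical presence (1) / absence (0) of six extrolite families
profiles = {
    "A1": [1, 1, 1, 0, 0, 0],
    "A2": [1, 1, 0, 0, 0, 1],
    "B1": [0, 0, 0, 1, 1, 1],
    "B2": [0, 0, 1, 1, 1, 0],
}
for c1, c2, similarity in upgma(profiles):
    print(c1, c2, round(similarity, 2))
```

The two A isolates and the two B isolates join first at a positive similarity, and the two groups join last at a negative one, which is how a similarity axis running from -1 to 1 is read.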

Page 22:

Look, daddy, look at the big AACACTGTATCTAATTATT!!!!

Aren’t they cool the new barcodes they’ve given us?

As if that’s something

Biosystematics: Genome or phenome?

Politiken, 16/3 2003

Page 23:

Methods used in cladistics

• Parsimony • Maximum likelihood • Nearest neighbour • Bayes analysis

• Validation: Often boot-strapping

Page 24:

Important methods in chemometrics (regression)

• Regression
  – MLR (multiple linear regression)
  – PCR (principal component regression)
  – PLSR (partial least squares regression)
  – RR (ridge regression)

• Neural Networks
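
To show how ridge regression differs from MLR, here is a minimal sketch that solves (X^T X + lambda * I) b = X^T y in plain Python. It assumes no intercept and no scaling of the variables, and the data and helper names are hypothetical. With lambda = 0 the system is the MLR normal equations, which cannot be solved when the variables are perfectly collinear.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def ridge(X, y, lam):
    """Ridge coefficients from (X^T X + lam * I) b = X^T y (no intercept)."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

# Hypothetical data: two perfectly collinear variables
X = [[1, 1], [2, 2], [3, 3]]
y = [2, 4, 6]

try:
    ridge(X, y, 0.0)           # lam = 0 is plain MLR
except ValueError as error:
    print("MLR fails:", error)

print(ridge(X, y, 0.1))        # two equal coefficients, each just below 1
```

The penalty makes the system solvable and shares the effect equally between the two collinear variables. PCR and PLSR address the same problem differently, by regressing on a few latent components instead of the original variables.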

Page 25:

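The different kinds of research

Each kind is placed by two questions: quest for fundamental understanding (yes/no) and considerations of use (yes/no).

• Pure basic research (Bohr): quest for fundamental understanding yes, considerations of use no
• Use-inspired basic research (Pasteur): quest for fundamental understanding yes, considerations of use yes
• Pure applied research (Edison): quest for fundamental understanding no, considerations of use yes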

Page 26:

The scientific method

• Hypothesis
• Prediction
• Test and validation
• Repeat

The scientific method is a recursive system of matching theory with observation.

A hypothesis is a tentatively held conjecture for the purposes of developing predictions of empirical observations.

Page 27:

The scientific method (deduction)

• Discovery, observations, ideas, intuition, former results

• Propose a hypothesis, connect it to known theory by logical derivation and propose a mathematical model. Support it with several experiments/observations (tests) and also try to disprove the hypothesis (deduction: from the general to the specific)

Page 28:

The scientific method (induction)

• Gather many objects and measure by a series of features.
• Classify and find latent features.
• Predict by regression.
• Validate.
• Connect with known theory and set up hypotheses or set up experimental designs to find important features and their dimensionality
• (induction: from the specific to the general)

Page 29:

Abduction, deduction and induction

• Science often exhibits a subtle interplay between abduction, induction and deduction. Abduction is a common process of creating new generalizations, theories and hypotheses. Deduction takes a hypothesis and makes a specific prediction from it. “Then” induction is used to fit the evidence to the hypothesis.

Page 30:

Levels of knowledge

• One (few) example(s): “Layman's science”
• Neural networks and validation
• X-metrics and validation
• Statistics and distributions
• Mathematical exact modelling
• Essentialism

Page 31:

Technology and science

”It’s alright in practice, will it ever work in theory?”

• Theory (Plato): Know what (clever people)
• Practice: Know how (skilled people)

Page 32:

Advantages of technology (applied research)

• Holistic, not reductionistic
• Context driven, not subject driven
• Mission-oriented research, not “blue skies”
• Team work, not individual scholar
• Divergent, not convergent thinking
• Decisive criterion: does it work?

Page 33:

Beware of pure reductionism

“We must reject this primitive and almost cannibalistic delusion about knowledge, that an understanding of something requires first that we dismantle it, like a child who pulls a watch to pieces and spreads out the wheels in order to understand the mechanism” (Thom, 1975)

Page 34:

Models are not reality

[Diagram: Model, with the labels Precision, Realism and Generality]

Page 35:
Page 36:

Systematic generalization (hierarchical reductionism)

“We very soon got to six yards to the mile. Then we tried a hundred yards to the mile. And then came the grandest idea of all! We actually made a map of the country, on the scale of a mile to the mile!”

“Have you used it much?” I enquired.

“It has never been spread out, yet,” said Mein Herr: “the farmers objected: they said it would cover the whole country, and shut out the sunlight! So we now use the country itself, as its own map, and I assure you it does nearly as well.”

(Lewis Carroll, Sylvie and Bruno Concluded, 1893)

Page 37:

Chemometrics and science?

• Find an important problem in your field of interest (FOI) to which there is yet no solution (think!). Propose a preliminary hypothesis.

• Observe and measure within the FOI
• Use statistical and multivariate design
• Propose a hypothesis (think!)
• Experiments and/or observations: tests and predictions based on proposed model
• Reject hypothesis or accept it for the time being

Page 38:

This course

• In this course you will get hands-on experience in treating data with many features (variables) measured on several or many objects

Page 39:

Learning objectives of the course

• Give an overview of important chemometric methods
• Identify situations where exploratory data-analysis is required
• Describe and use different forms of scaling, transformation and normalization
• Understand and describe the difference between classification and regression
• Understand and describe the difference between clustering and ordination
• Apply and interpret principal component analysis (PCA) on multivariate data
• Apply and interpret the principles of validation and outlier detection
• Use and interpret cluster analysis
• Describe, apply and interpret multiple linear regression (MLR) and ridge regression (RR) and where to apply them in two data matrix problems
• Describe, apply and interpret principal component regression (PCR) and partial least squares regression (PLSR) and where to apply them in two data matrix problems
• Apply and interpret correspondence analysis
• Describe the method metric multidimensional scaling

Page 40:

Programs used

• R, RStudio (free software)

• NTSYS (Exeter publishing) version 2.2
  – A whole package on the methods used most frequently: used earlier in this course
  – (http://www.exetersoftware.com) (USA)

• UNSCRAMBLER version 10.3 (CAMO, Norway)
  – (you can buy your own version, but it is expensive) (http://www.camo.com)

Page 41:

Book used

• Lattin, J., Carroll, J.D., Green, P.E.: Analyzing multivariate data, Thomson, Pacific Grove, CA, USA, 2003, 556 pp.
• Recommended: E-books on R
• + a little extra reading material, especially Romesburg (1984) on cluster analysis