Post on 26-Dec-2015
Exercises: Multivariate Data Analysis
Topic 1: Multivariate Data Analysis
Topic 1 Theory: Multivariate Data Analysis
• Introduction to Multivariate Data Analysis
• Principal Component Analysis (PCA)
• Multivariate Linear Regression (MLR, PCR and PLSR)
Laboratory exercises:
• Introduction to MATLAB
• Examples of PCA (cluster analysis of samples, identification and geographical distribution of contamination sources/patterns…)
• Examples of Multivariate Regression (prediction of concentration of chemicals from spectral analysis, investigation of correlation patterns and of the relative importance of variables, …)
Romà Tauler (IDAEA, CSIC, Barcelona), February 2009
Introduction to MATLAB.
What is MATLAB?
MATLAB is a contraction of "Matrix Laboratory" and, though originally designed as a tool for the manipulation of matrices, is now capable of performing a wide range of numerical computations.
MATLAB also possesses extensive graphics capabilities.
Introduction to MATLAB
• Command line programming environment: command window prompt (»)
• Matrix algebra: scalars, vectors, matrices
• Work / use:
  • Interactively at the command line
  • Create/use programs (functions or scripts)
  • Toolboxes add additional functionality
The MATLAB Workspace
The workspace is where variables are stored, created, manipulated and operated on
Save workspace variables
Information about variables in the workspace: who and whos
»whos
Name Size Bytes Class
fsparse 100x100 1604 sparse array
modstruct 1x1 130 struct array
my3D 10x20x104 166400 double array
mymat 5x4 160 double array
myvect 1x3 24 double array
somechars 1x8 16 char array
zcells 2x2 167082 cell array
Grand total is 41766 elements using 335416 bytes
MATLAB Data Types
• double -- double precision floating point number array (this is the traditional MATLAB matrix or array)
• sparse -- 2-D real (or complex) sparse matrix
• struct -- Structure array
• cell -- Cell array
• char -- Character array
• logical -- Logical arrays (1,0)
• <class_name> -- Custom object class
• dataset -- Standard Data Object
Command Line Help: help functionname; lookfor method; which functionname
helpwin
Importing Data into MATLAB
• MATLAB can read flat ASCII files
• Import Wizard
• A variety of image formats can be imported with the IMREAD function (JPEG, BMP, TIFF, etc.)
• Various spreadsheet import functions
• Custom developed routines for reading binary instrument files
Additional Functions for Importing Data
• 'xlsfinfo' - reads sheet names from .xls file
• 'xlsread' - reads in data from .xls file
Format types
» A = [1 2 0; 2 5 -1; 4 10 -1]
A =
     1     2     0
     2     5    -1
     4    10    -1
» B = A'
B =
     1     2     4
     2     5    10
     0    -1    -1
» C = A .* B
C =
     1     4     0
     4    25   -10
     0   -10     1
The same applies to the ./ and .\ operators.
NaN concept: NaN is the IEEE arithmetic representation for Not-a-Number. A NaN is obtained as a result of mathematically undefined operations like 0.0/0.0 and inf-inf.
Useful functions for beginners:
• HELP: On-line help, display text at command line.
• LOOKFOR: Search all M-files for keyword.
• WHOS: List current variables, long form.
• MAX: Largest component.
• MIN: Smallest component.
• ROUND, CEIL, FLOOR, FIX: Rounding.
• SQUEEZE: Remove singleton dimensions.
• FIND: Find indices of nonzero elements.
• MEAN: Average or mean value.
• ISNAN: True for Not-a-Number.
• FLIPUD: Flip matrix in up/down direction.
• FLIPDIM: Flip matrix along specified dimension.
• RESHAPE: Change size.
• PERMUTE: Permute array dimensions.
• REPMAT: Replicate and tile an array.
• EVAL: Execute string with MATLAB expression.
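A few of the functions listed above can be tried together in a short session. The following is a minimal sketch using only base MATLAB; the vector v and the matrices R and T are made-up illustration values, not data from the course.

```matlab
% Element-wise versus matrix operations (same A as in the example above)
A = [1 2 0; 2 5 -1; 4 10 -1];
B = A';
C = A .* B;                    % element-wise product
D = A * B;                     % matrix product (generally different from C)

% NaN handling: 0/0 produces NaN, ISNAN and FIND locate it
v = [3 NaN 7 0/0 1];           % made-up vector with two missing values
idx = find(isnan(v));          % positions of the NaN entries
m = mean(v(~isnan(v)));        % mean of the valid entries only

% RESHAPE and REPMAT
R = reshape(1:6, 2, 3);        % fills column by column
T = repmat([1 2], 2, 2);       % tiles [1 2] in a 2-by-2 pattern
```

Note that mean(v) without the ~isnan filter would itself return NaN, which is why missing values are usually masked before computing statistics.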
Indexing into Three-way (and higher) Arrays
MATLAB supports three-way and higher arrays. Indexing extends easily to multi-way arrays:
»x = round(rand(4,5,2)*10)
x(:,:,1) =
10 9 8 9 9
2 8 4 7 9
6 5 6 2 4
5 0 8 4 9
x(:,:,2) =
1 1 3 4 8
4 2 2 9 5
8 2 0 5 2
0 6 7 4 7
»x(:,:,2) = ones(4,5)*5
x(:,:,1) =
10 9 8 9 9
2 8 4 7 9
6 5 6 2 4
5 0 8 4 9
x(:,:,2) =
5 5 5 5 5
5 5 5 5 5
5 5 5 5 5
5 5 5 5 5
Cell Arrays
Cell arrays are a handy way to store matrices of different lengths, e.g. from batch process data, as in the example below:
»x = cell(4,1)
x =
[]
[]
[]
[]
»x{1} = rand(4,5);
»x{2} = rand(10,5);
»x{3} = rand(6,5);
»x{4} = rand(8,5);
»x
x =
    [ 4x5 double]
    [10x5 double]
    [ 6x5 double]
    [ 8x5 double]
»x{1}
ans =
    0.8381    0.8318    0.3046    0.3028    0.3784
    0.0196    0.5028    0.1897    0.5417    0.8600
    0.6813    0.7095    0.1934    0.1509    0.8537
    0.3795    0.4289    0.6822    0.6979    0.5936
» help diary
 DIARY Save text of MATLAB session.
    DIARY FILENAME causes a copy of all subsequent command window input and most of the resulting command window output to be appended to the named file. If no file is specified, the file 'diary' is used.
    DIARY OFF suspends it.
    DIARY ON turns it back on.
    DIARY, by itself, toggles the diary state.
    Use the functional form of DIARY, such as DIARY('file'), when the file name is stored in a string.
    See also save.
    Reference page in Help browser: doc diary
» doc diary
» diary
Introduction to Linear Algebra
• Definitions
• scalar, vector, matrix
• Linear Algebra Operations
• vector and matrix addition
• vector and matrix multiplication
• projection
• Gaussian elimination
• the concept of rank
• matrix inverses
• rank deficiency
• ......
Projection of a vector y onto a vector x
Projection of a vector y onto a subspace X (onto the columns of X)
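The two projections named above can be sketched in a few lines. This is an illustration under stated assumptions: the vectors x, y and the matrix X are made-up values, and the formulas used are the standard ones, (x'y / x'x) x for a single vector and X (X'X)^-1 X' y for the column space of X.

```matlab
% Projection of y onto a single vector x
x = [1; 2; 2];                        % made-up vectors
y = [3; 0; 3];
p_x = (x' * y) / (x' * x) * x;        % component of y along x

% Projection of y onto the subspace spanned by the columns of X
X = [1 0; 1 1; 1 2];                  % made-up matrix, full column rank
b = (X' * X) \ (X' * y);              % least-squares coefficients
p_X = X * b;                          % projection of y onto the columns of X
r = y - p_X;                          % residual, orthogonal to every column of X
```

The backslash operator solves the normal equations without forming the inverse explicitly; this only works as written when X has full column rank, which connects to the rank deficiency item in the outline.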
Diagonalization of a non-singular symmetric matrix. Eigenvalues and eigenvectors. Calculation of the principal components.
(X1, X2, ..., Xn) → Linear Transformation → (PC1, PC2, ..., PCn)
PC1 = l11 X1 + l12 X2 + ... + l1n Xn
PC2 = l21 X1 + l22 X2 + ... + l2n Xn
...
PCn = ln1 X1 + ln2 X2 + ... + lnn Xn
In matrix form, PC = X L:
$$(PC_1, PC_2, \ldots, PC_n) = (X_1, X_2, \ldots, X_n)
\begin{pmatrix}
l_{11} & l_{21} & \cdots & l_{n1} \\
l_{12} & l_{22} & \cdots & l_{n2} \\
\vdots & \vdots & \ddots & \vdots \\
l_{1n} & l_{2n} & \cdots & l_{nn}
\end{pmatrix}$$
with the constraints applied in ascending order:
1. Var(PC1) maximum
2. Var(PC2) maximum but with Cov(PC1,PC2) = 0
...
n. Var(PCn) maximum but with Cov(PC1,PCn) = 0, Cov(PC2,PCn) = 0, Cov(PC3,PCn) = 0, ..., Cov(PCn-1,PCn) = 0
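* Diagonalization of a non-singular square symmetric matrix
S = Cov(X1, X2, ..., Xn)
$$S = L\,\mathrm{Diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)\,L^{t} = L\,D(\lambda)\,L^{t}$$
L is an orthonormal matrix; the eigenvectors of S (loadings) are in the columns of matrix L:
$$S =
\begin{pmatrix}
l_{11} & l_{21} & \cdots & l_{n1} \\
l_{12} & l_{22} & \cdots & l_{n2} \\
\vdots & \vdots & \ddots & \vdots \\
l_{1n} & l_{2n} & \cdots & l_{nn}
\end{pmatrix}
\begin{pmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_n
\end{pmatrix}
\begin{pmatrix}
l_{11} & l_{12} & \cdots & l_{1n} \\
l_{21} & l_{22} & \cdots & l_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
l_{n1} & l_{n2} & \cdots & l_{nn}
\end{pmatrix}$$
The eigenvalues of matrix S are on the diagonal of matrix D:
$$\lambda_1 = Var(PC_1),\ \lambda_2 = Var(PC_2),\ \ldots,\ \lambda_n = Var(PC_n)$$
$$s_{11} + s_{22} + \cdots + s_{nn} = \mathrm{Trace}(S) = \mathrm{Trace}(D(\lambda)) = \lambda_1 + \lambda_2 + \cdots + \lambda_n$$
$$\mathrm{Det}(S) = \mathrm{Det}(D(\lambda)) = \lambda_1 \lambda_2 \cdots \lambda_n$$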
Z_nn = (z_ij) is the matrix of scores: the object coordinates in the new axes (new variables, or PCs)
$$Z_{nn} = X_{nn} L_{nn}\,; \qquad Z_{nn} L^{t}_{nn} = X_{nn}$$
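The diagonalization and the score relation can be checked numerically. The sketch below uses base MATLAB only and a small made-up data matrix; it assumes X is mean-centered before the covariance is formed, which the slides leave implicit.

```matlab
% Made-up data: 5 objects, 3 variables
X = [2 4 1; 3 7 2; 5 9 2; 6 12 4; 9 15 5];
n = size(X, 1);
Xc = X - repmat(mean(X), n, 1);               % mean-centered data (assumption)
S = (Xc' * Xc) / (n - 1);                     % covariance matrix
S = (S + S') / 2;                             % enforce exact symmetry

[L, D] = eig(S);                              % S = L*D*L'
[lambda, order] = sort(diag(D), 'descend');   % PC1 has the largest variance
L = L(:, order);                              % loadings, one eigenvector per column

Z = Xc * L;                                   % scores: Z = X L
varZ = var(Z);                                % variances of the PCs
Xback = Z * L';                               % Z L' recovers the centered data
```

Here varZ reproduces the eigenvalues and sum(lambda) equals trace(S), as stated in the relations above.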
Linear combinations of the original variables go by several names:
• Factors
• Principal Components (PC)
• Canonical Variables
• Latent Variables
• Discriminant Functions
• ...
Linear combination of random variables:
y = a1 x1 + a2 x2 + ... + an xn = a^t x
E(y) = a1 E(x1) + a2 E(x2) + ... + an E(xn)
Var(y) = (a1, a2, ..., an) S a = a^t S a, where S is the variance-covariance matrix of x
z = b1 x1 + b2 x2 + ... + bn xn = b^t x
Cov(y,z) = a^t S b
• Noise Filtering
Selection of the first principal components, e.g. if e PCs are selected:
$$X_{mn} = Z_{me} L^{t}_{en} + E_{mn}$$
E_mn is the residuals matrix, after subtracting the contributions of the first e PCs
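Noise filtering with the first e components can be sketched through the singular value decomposition, which gives the same scores and loadings as the eigenvector route. The data below are synthetic (two underlying sources plus a small perturbation) and every variable name is an assumption for illustration.

```matlab
% Synthetic data: 20 objects, 4 variables built from only two sources
t = (1:20)';
Xtrue = [t, 2*t, sin(t), t + sin(t)];       % rank 2 by construction
noise = 0.01 * cos((1:20)' * (1:4));        % small deterministic perturbation
X = Xtrue + noise;

[U, Sv, V] = svd(X, 'econ');                % X = U*Sv*V'
e = 2;                                      % number of PCs retained
Z = U(:, 1:e) * Sv(1:e, 1:e);               % scores, m x e
L = V(:, 1:e);                              % loadings, n x e
Xhat = Z * L';                              % filtered data, Z * L'
E = X - Xhat;                               % residuals matrix
```

Because the truncated SVD is the best rank-e approximation in the least-squares sense, the residual E is never larger (in Frobenius norm) than the perturbation that was added.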
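* Euclidean Distance
$$d^2(O_i,O_j) = d^2\big((x_{i1},x_{i2},\ldots,x_{in}),(x_{j1},x_{j2},\ldots,x_{jn})\big) = (x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \cdots + (x_{in}-x_{jn})^2$$
$$= (x_{i1}-x_{j1}, \ldots, x_{in}-x_{jn})\; I
\begin{pmatrix} x_{i1}-x_{j1} \\ x_{i2}-x_{j2} \\ \vdots \\ x_{in}-x_{jn} \end{pmatrix}$$
* Mahalanobis Distance
$$d^2_m(O_i,O_j) = (x_{i1}-x_{j1}, \ldots, x_{in}-x_{jn})\; S^{-1}
\begin{pmatrix} x_{i1}-x_{j1} \\ x_{i2}-x_{j2} \\ \vdots \\ x_{in}-x_{jn} \end{pmatrix}$$
where S is the covariance matrix.
It takes into account the covariance between variables!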
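Both distances can be computed directly from their definitions. The sketch below uses a made-up two-variable data set and base MATLAB functions only; it also shows that the Mahalanobis distance equals the Euclidean distance between PC scores scaled to unit variance, which is presumably what a dendrogram "using Mahalanobis distance on PCs" relies on (an assumption, since the clustering routine itself is not shown).

```matlab
X = [2 4; 3 7; 5 9; 6 12; 9 15];        % made-up data, 5 objects x 2 variables
S = cov(X);                             % covariance matrix of the variables
d = X(1,:) - X(5,:);                    % difference between objects 1 and 5

d2_euclid = d * eye(2) * d';            % squared Euclidean distance
d2_mahal  = d * (S \ d');               % squared Mahalanobis distance, d*inv(S)*d'

% Same Mahalanobis distance from scores scaled by 1/sqrt(eigenvalue)
[L, D] = eig(S);
Zs = X * L * diag(1 ./ sqrt(diag(D)));  % unit-variance scores
dz = Zs(1,:) - Zs(5,:);
d2_scores = dz * dz';
```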
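Univariate Statistics
mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
variance:
$$s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
standard deviation:
$$s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
covariance:
$$s_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$
correlation:
$$r_{x,y} = \frac{s_{x,y}}{s_x s_y}$$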
Multivariate Statistics
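Matrix X of experimental measures, X(n,m):
$$X(n,m) = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1m} \\
x_{21} & x_{22} & \cdots & x_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nm}
\end{pmatrix}$$
Vector of column means:
$$\bar{x} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_m), \quad \text{where} \quad \bar{x}_j = \frac{\sum_{i=1}^{n} x_{ij}}{n}, \quad j = 1,\ldots,m$$
Matrix of variances-covariances S(m,m) = (s²_jl). It is a square symmetric matrix:
$$s^2_{jl} = Cov(x_j, x_l) = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{il} - \bar{x}_l)}{n-1}$$
$$S = \begin{pmatrix}
s^2_{11} & s^2_{12} & \cdots & s^2_{1m} \\
s^2_{21} & s^2_{22} & \cdots & s^2_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
s^2_{m1} & s^2_{m2} & \cdots & s^2_{mm}
\end{pmatrix}$$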
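Multivariate Statistics
$$s^2_{jj} = Var(x_j) = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}{n-1}$$
Mean-centered data matrix:
$$\bar{X}(n,m) = X(n,m) - \mathbf{1}_n\,(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_m) =
\begin{pmatrix}
x_{11}-\bar{x}_1 & x_{12}-\bar{x}_2 & \cdots & x_{1m}-\bar{x}_m \\
x_{21}-\bar{x}_1 & x_{22}-\bar{x}_2 & \cdots & x_{2m}-\bar{x}_m \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1}-\bar{x}_1 & x_{n2}-\bar{x}_2 & \cdots & x_{nm}-\bar{x}_m
\end{pmatrix}$$
Covariance matrix (n objects, so the divisor is n-1):
$$S(m,m) = \frac{1}{n-1}\, \bar{X}^{T}(m,n)\, \bar{X}(n,m)$$
Standard deviations:
$$(s_1, s_2, \ldots, s_m) = \big((s^2_{11})^{1/2}, (s^2_{22})^{1/2}, \ldots, (s^2_{mm})^{1/2}\big)$$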
Multivariate Statistics
Correlation matrix C(m,m)
X(n,m) ⇒ mean centering ⇒ X̄(n,m) ⇒ standardizing ⇒ Xs(n,m)
$$(x_{ij}) \;\Rightarrow\; (x_{ij} - \bar{x}_j) \;\Rightarrow\; \left(\frac{x_{ij} - \bar{x}_j}{s_j}\right)$$
$$C(m,m) = Corr(X) = \frac{1}{n-1}\, X_s^{T} X_s =
\begin{pmatrix}
r_{11} & r_{12} & \cdots & r_{1m} \\
r_{21} & r_{22} & \cdots & r_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
r_{m1} & r_{m2} & \cdots & r_{mm}
\end{pmatrix}$$
$$r_{jl} = \frac{s^2_{jl}}{s_j s_l}, \qquad s_j = \sqrt{s^2_{jj}}$$
Covariance matrix about the origin:
$$M(m,m) = \frac{1}{n}\, X^{T} X$$
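The matrix formulas for the covariance and correlation matrices can be verified against MATLAB's built-in cov and corrcoef. This is a minimal sketch on a made-up 5 x 3 data matrix; the variable names are chosen for illustration.

```matlab
X = [2 4 1; 3 7 2; 5 9 2; 6 12 4; 9 15 5];   % made-up data, n = 5 objects, m = 3 variables
[n, m] = size(X);

xbar = mean(X);                              % vector of column means
Xc = X - repmat(xbar, n, 1);                 % mean-centered data matrix
S = (Xc' * Xc) / (n - 1);                    % covariance matrix, divisor n-1
s = sqrt(diag(S))';                          % standard deviations of the m variables

Xs = Xc ./ repmat(s, n, 1);                  % standardized (autoscaled) data
C = (Xs' * Xs) / (n - 1);                    % correlation matrix
M = (X' * X) / n;                            % covariance matrix about the origin
```

S and C agree with cov(X) and corrcoef(X) to rounding error, and the diagonal of C is all ones.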
Univariate Normal Distribution with mean μ and standard deviation σ
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
The sample mean, m, is an estimate of the population mean μ, and the standard deviation of the sample, s, is an estimate of the standard deviation of the population, σ.
Multivariate Normal Distribution
μ = (μ1, μ2, ..., μn): population mean
x̄ = (x̄1, x̄2, ..., x̄n): sample mean, as an estimate of μ
Σ: covariance matrix (matrix S is an estimate of Σ)
$$f(x_1, x_2, \ldots, x_n) = \frac{1}{(2\pi)^{n/2}\, |\Sigma|^{1/2}}\;
\exp\!\left[-\frac{1}{2}\,(x_1-\mu_1, \ldots, x_n-\mu_n)\; \Sigma^{-1}
\begin{pmatrix} x_1-\mu_1 \\ \vdots \\ x_n-\mu_n \end{pmatrix}\right]$$
Other subjects to consider (exercises):
• Statistical distributions (with MATLAB)
• Elementary statistical functions (in MATLAB)
• Statistical tests
• ANOVA
• Experimental design
• ...
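Comparison of a sample mean with a known value (population mean, μ0):
BEGIN
Is n ≥ 30?
• yes: compute
$$z_{calc} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
  Is |z_cal| ≤ 1.96? yes: x̄ = μ0; no: x̄ ≠ μ0. END
• no: compute
$$t_{calc} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
  Is |t_cal| ≤ t_tab (n-1 d.f.)? yes: x̄ = μ0; no: x̄ ≠ μ0. END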
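The large-sample branch of this decision scheme can be sketched as follows. The sample x is made up, the sample standard deviation s is used as the estimate of σ (an assumption that is usual for n ≥ 30), and 1.96 is the two-sided 95% critical value used in the scheme.

```matlab
x = 10 + 0.5 * sin(1:36);                 % made-up sample with n = 36
n = numel(x);
s = std(x);                               % sample estimate of sigma (assumption)

% Test against a reference value close to the sample mean
mu0 = 10;
zcalc = (mean(x) - mu0) / (s / sqrt(n));
sameMean = abs(zcalc) <= 1.96;            % true: no significant difference

% Test against a reference value far from the sample mean
mu0_far = 9;
zcalc_far = (mean(x) - mu0_far) / (s / sqrt(n));
sameMean_far = abs(zcalc_far) <= 1.96;    % false: significant difference
```

For n < 30 the same statistic is compared with the tabulated t value at n-1 degrees of freedom instead of 1.96; looking that value up requires a table or a statistics toolbox function, so it is left out of this sketch.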
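Comparison between the means of two samples
BEGIN
1. Normality? If not, apply a transformation. If there is still no normality after the transformation, use non-parametric tests. END
2. Are n1 and n2 ≥ 30?
• yes: compute
$$z_{cal} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
  Is |z_cal| ≤ 1.96? yes: μ1 = μ2; no: μ1 ≠ μ2. END
• no: go to step 3.
3. F test: is σ1² = σ2²?
• yes: pool the variances and compute
$$s^2 = \frac{(n_1-1)\,s_1^2 + (n_2-1)\,s_2^2}{n_1+n_2-2}, \qquad
t_{cal} = \frac{\bar{x}_1 - \bar{x}_2}{s\,\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$
  Is |t_cal| ≤ t_tab (n1+n2-2 d.f.)? yes: μ1 = μ2; no: μ1 ≠ μ2. END
• no: compute
$$t_{cal} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
  and compare it either with t_tab at the effective degrees of freedom computed from s1²/n1 and s2²/n2, or with the weighted critical value
$$t' = \frac{t_1\,\dfrac{s_1^2}{n_1} + t_2\,\dfrac{s_2^2}{n_2}}{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}, \qquad
t_1 = t_{tab}\,(n_1-1\ \text{d.f.}), \quad t_2 = t_{tab}\,(n_2-1\ \text{d.f.})$$
  Is |t_cal| ≤ t'? yes: μ1 = μ2; no: μ1 ≠ μ2. END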
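Comparison between the means of two samples (test on the mean difference d̄)
BEGIN
1. Normality? If not, apply a transformation. If there is still no normality after the transformation, use non-parametric tests. END
2. Is n ≥ 30?
• yes: compute
$$z_{cal} = \frac{\bar{d}}{s_d/\sqrt{n}}$$
  Is |z_cal| ≤ 1.96? yes: d̄ = 0; no: d̄ ≠ 0. END
• no: compute
$$t_{cal} = \frac{\bar{d}}{s_d/\sqrt{n}}$$
  Is |t_cal| ≤ t_tab (n-1 d.f.)? yes: d̄ = 0; no: d̄ ≠ 0. END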
Principal Component Analysis (PCA)
Unsupervised Pattern Recognition
Supervised Pattern Recognition
>> load arch
>> whos
  Name       Size     Bytes  Class    Attributes
  arch       75x10    6000   double   (data matrix)
  class      75x1      600   double   (classification index)
  samps      75x5      750   char     (sample labels)
  vars       10x2       40   char     (variable labels)
>> plot(arch)
>> plot(arch')
[Figure: plot(arch), values (0 to 1800) against sample number; plot(arch'), values against variable (Fe, Ti, Ba, Ca, K, Mn, Rb, Sr, Y, Zr)]
Data Statistics: min 45, max 1100, mean 334.7000, median 131, mode 45, std 386.7365, range 1055
[Figure: boxplot of the 10 columns, Values against Column Number (1 Fe, 2 Ti, 3 Ba, 4 Ca, 5 K, 6 Mn, 7 Rb, 8 Sr, 9 Y, 10 Zr)]
[Figure: histograms of each variable (Fe, Ti, Ba, Ca, K, Mn, Rb, Sr, Y, Zr)]
boxplot(arch')
hist(arch(:,v))
Data pretreatment
>> xcal=arch(1:63,:);
>> xtest=arch(64:75,:);
>> [axcal,mx,stdx]=auto(xcal);
>> subplot(1,2,1),plot(axcal);
>> subplot(1,2,2),plot(axcal');
>> boxplot(axcal)
[Figure: autoscaled calibration data axcal (values between about -3 and 3) plotted against sample number and against variable number; boxplot of the 10 autoscaled columns, Values against Column Number]
Nr. of components
larch=svd(arch)
larch=larch(1:10)
plot(larch)
laxcal=svd(axcal)
laxcal =
   18.1975
   11.4439
    8.2437
    7.1865
    3.9787
    2.9436
    2.4939
    1.8726
    1.4955
    1.3505
plot(laxcal)
plot(larch)
[Figures: singular values of arch (scale 0 to 14000) and of axcal (scale 0 to 20) against component number (1 to 10)]
PCA Principal components analysis
PCA on axcal
I/O: [scores,loads,ssq,res,reslm,tsqlm,tsq] = pca(data,plots,scl,lvs);
The input is the data matrix (data). Outputs are the scores (scores), loadings (loads), variance info (ssq), residuals (res), Q limit (reslm), T^2 limit (tsqlm), and T^2's (tsq).
Optional inputs are:
• (plots): plots = 0 suppresses all plots, plots = 1 [default] produces plots with no confidence limits, plots = 2 produces plots with limits, plots = -1 plots the eigenvalues only (without limits);
• a vector (scl) for plotting scores against (if scl = 0 sample numbers will be used);
• a scalar (lv) which specifies the number of principal components to use in the model and which suppresses the prompt for number of PCs.
[scores,loads,ssq,res,reslm,tsqlm,tsq]=pca(axcal);
Percent Variance Captured by PCA Model
Principal    Eigenvalue    % Variance    % Variance
Component    of            Captured      Captured
Number       Cov(X)        This PC       Total
---------    ----------    ----------    ----------
    1        5.34e+000       53.41         53.41
    2        2.11e+000       21.12         74.53
    3        1.10e+000       10.96         85.50
    4        8.33e-001        8.33         93.83
    5        2.55e-001        2.55         96.38
    6        1.40e-001        1.40         97.78
    7        1.00e-001        1.00         98.78
    8        5.66e-002        0.57         99.35
    9        3.61e-002        0.36         99.71
   10        2.94e-002        0.29        100.00
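The eigenvalues in this table follow from the singular values of axcal printed earlier: for mean-centered data with n objects, each eigenvalue of Cov(X) is the squared singular value divided by n-1. The sketch below copies the laxcal values from the slides and takes n = 63 from xcal = arch(1:63,:); it uses base MATLAB only and does not call the toolbox pca function.

```matlab
% Singular values of axcal, copied from the laxcal output above
laxcal = [18.1975 11.4439 8.2437 7.1865 3.9787 ...
          2.9436 2.4939 1.8726 1.4955 1.3505];
n = 63;                                  % calibration samples, xcal = arch(1:63,:)

eigval = laxcal.^2 / (n - 1);            % eigenvalues of Cov(axcal)
pctvar = 100 * eigval / sum(eigval);     % % variance captured by each PC
cumvar = cumsum(pctvar);                 % % variance captured in total

disp([(1:10)' eigval' pctvar' cumvar']); % columns: PC, eigenvalue, %, cumulative %
```

Since the data are autoscaled, the eigenvalues sum to the number of variables (10, up to rounding of the printed singular values), so each percentage is simply ten times the eigenvalue.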
[Figures: Variable Number vs. Loadings for PC# 1, PC# 2, PC# 3 and PC# 4 (variables: 1 Fe, 2 Ti, 3 Ba, 4 Ca, 5 K, 6 Mn, 7 Rb, 8 Sr, 9 Y, 10 Zr)]
[Figures: Sample Scores with 95% Limits, score on PC# 1, PC# 2, PC# 3 and PC# 4 against Sample Number]
PLTLOADS Plots loadings from PCA
This function may be used to make 2-D and 3-D plots of loadings vectors against each other. The inputs to the function are the matrix of loadings vectors (loads), where each column represents a loadings vector from the PCA function, and an optional variable of labels (labels) which describe the original data variables.
Note: labels must be a "column vector" where each label is in single quotes and has the same number of letters.
Example: labels = ['Height'; 'Weight'; 'Waist '; 'IQ    ']
The function will prompt to select 2 or 3-D plots, for the numbers of the PCs, and if you would like "drop lines" and axes on the 3-D plots.
I/O: pltloads(loads,labels)
pltloads(loads,vars);
[Figures: Loadings for PC# 1 versus PC# 2, PC# 1 versus PC# 3, and PC# 1 versus PC# 4, points labeled by variable (Fe, Ti, Ba, Ca, K, Mn, Rb, Sr, Y, Zr)]
pltscrs(scores,samps(1:63,:),class(1:63,:))
[Figures: Scores for PC# 1 versus PC# 2, PC# 1 versus PC# 3, and PC# 1 versus PC# 4, points labeled by sample name (sample groups K, BL, SH and ANA)]
PCAPRO Projects new data on old principal components model. Inputs are the new data (newdata), the old loadings (loads), the old variance info (ssq), the limit for q (q), the limit for t^2 (tsq) and an optional variable (plots) which suppresses the plots when set to 0. Outputs are the new scores (scores), residuals (res) and t^2 values (tsqvals). These are plotted as the function proceeds if plots ~= 0. The I/O format is: [scores,resids,tsqs] = pcapro(newdata,loads,ssq,q,tsq,plots); WARNING: Be sure that (newdata) is scaled the same as original data!
AUTO Autoscales matrix to mean zero unit variance Autoscales a matrix (x) and returns the resulting matrix (ax) with mean-zero unit variance columns, a vector of means (mx) and a vector of standard deviations (stdx) used in the scaling. I/O format is: [ax,mx,stdx] = auto(x);
SCALE Scales matrix as specified. Scales a matrix (x) using means (mx) and standard deviations (stds) specified. I/O format is: sx = scale(x,mx,stdx);
axtest=scale(xtest,mx,stdx);
[scores_xtest]=pcapro(axtest,loads,ssq,reslm,tsqlm);
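What these two calls do can be sketched with base MATLAB only. The code below is not the toolbox implementation of auto, scale or pcapro; it is an illustration, on made-up data, of the rule in the warning above: the test set must be scaled with the means and standard deviations of the calibration set before it is projected on the old loadings. Confidence limits, Q residual limits and T^2 values are left out.

```matlab
% Made-up calibration and test data (3 variables)
xcal  = [2 4 1; 3 7 2; 5 9 2; 6 12 4; 9 15 5; 4 8 3];
xtest = [4 8 2; 7 13 4];

% Autoscale the calibration set: zero mean and unit variance per column
ncal = size(xcal, 1);
mx   = mean(xcal);
stdx = std(xcal);
axcal = (xcal - repmat(mx, ncal, 1)) ./ repmat(stdx, ncal, 1);

% PCA model of the calibration set from its SVD
[U, Sv, V] = svd(axcal, 'econ');
k = 2;                                   % number of PCs kept (arbitrary here)
loads  = V(:, 1:k);                      % loadings
scores = axcal * loads;                  % calibration scores

% Scale the test set with the CALIBRATION means and standard deviations
ntest  = size(xtest, 1);
axtest = (xtest - repmat(mx, ntest, 1)) ./ repmat(stdx, ntest, 1);
scores_xtest = axtest * loads;           % projection on the old model
```

Stacking [scores; scores_xtest] then gives one matrix in which calibration and test samples share the same axes, which is what the combined scores plot relies on.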
[Figures: New Sample Scores with 95% Limits from Old Model, score on PC# 1, PC# 2, PC# 3 and PC# 4 against Sample Number (12 test samples)]
pltscrs([scores;scores_xtest],samps);
[Figures: Scores for PC# 1 versus PC# 2, PC# 1 versus PC# 3, and PC# 1 versus PC# 4; calibration samples labeled by name (groups K, BL, SH and ANA) and the 12 test samples labeled s1 to s12]
Exercise: multivariate data analysis of environmental samples
• NW Mediterranean contamination by organic compounds
load env
whos
  Name           Size     Bytes   Class    Attributes
  sampnames      22x1      1458   cell
  textdata       74x2     10874   cell
  varnames       74x1      6296   cell
  x              22x96    16896   double
plot(x)
plot(x')
[Figures: plot(x), values (up to 2.5 x 10^4) against sample number (samples), and plot(x'), values against variable number (variables); annotations mark variable 25, UCM, and the PCBs]
Samples (22):
1 'Ty27', 2 'BC12', 3 'BC15', 4 'Ty23', 5 'TyK', 6 'Ty8', 7 'Ty17', 8 'BC4', 9 'Ty3', 10 'Ty19', 11 'BC8', 12 'A2', 13 'BC10', 14 'BC6', 15 'BC4', 16 'D3', 17 'BC9', 18 'D2', 19 'C1', 20 'D1', 21 'BC11', 22 'BC7'
Variables (96):
• 1-24, alkanes: 'n-C16', 'n-C17', 'n-C18', 'n-C19', 'n-C20', 'n-C21', 'n-C22', 'n-C23', 'n-C24', 'n-C25', 'n-C26', 'n-C27', 'n-C28', 'n-C29', 'n-C30', 'n-C31', 'n-C32', 'n-C33', 'n-C34', 'n-C35', 'n-C36', 'n-C37', 'n-C38', 'n-C39'
• 25: 'UCM'
• 26-65, PAHs, alkenes:
  26-48: 'pristane', 'phytane', 'fluoranthene', 'phenanthrene', 'anthracene', 'methylphenanthrene', 'dimethylphenanthrenes', 'fluoranthene', 'acephenantrylene', 'pyrene', 'methylfluoranthenes', 'benzo[a]fluorene', 'benzo[b]fluorene', 'retene', 'benzo[b]phenanthrene', 'benz[a]anthracene', 'chrysene + triphenylene', 'benzo[/+b+/c]fluoranthenes', 'benzo[a]fluoranthene', 'benzo[e]pyrene', 'benzo[a]pyrene', 'perylene', 'indeno[7,1,2,3-cde/]chrysene'
  49-65: 'indeno[1,2,3-cd]pyrene', 'benzo[ghi]perylene', 'benzo[ghi]fluoranthene', 'cyclopenta[cd]pyrene', 'dibenzoanthracenes', 'benzo[b]chrysene', 'coronene', 302 ??, 'naphtho[1,2-b]thiophene', 'dibenzothiophene', 'naphtho[2,1-b]thiophene', '4-methyldibenzothiophene', '3,2-methyldibenzothiophene', '1-methyldibenzothiophene', 'benzo[b]naphtho[2,1-d]thiophene', 'benzo[b]naphtho[1,2-d]thiophene', 'benzo[b]naphtho[2,3-b]thiophene'
• 66-84, organochlorine compounds, PCBs: 'PCB-52', 'PCB-101', 'PCB-118', 'PCB-153', 'PCB-138', 'PCB-187', 'PCB-128', 'PCB-180', 'PCB-170', 'o,p'-DDD', 'o,p'-DDE', 'o,p'-DDT', 'p,p'-DDE', 'p,p'-DDD', 'p,p'-DDT', 'hexachlorobenzene', 'hexachlorohexane', 'lindane', 'octachlorostyrene'
• 85-96, sterols: '27-nor-24-methylcholesta-5α,22(E)-dien-3β-ol', 'cholesta-5α,22(E)-dien-3β-ol', 'cholesterol', 'cholestanol', 'brassicasterol', '24-methyl-5α(H)-cholest-22(E)-en-3β-ol', '24-methylcholest-5-en-3β-ol', 'stigmasterol', '24-ethyl-5α-cholest-22-en-3β-ol', 'β-sitosterol', '24-ethyl-5α-cholestan-3β-ol', 'dinosterol'
[Figures: boxplot of variables 1-24; boxplot of variables 26-50 (excluding variable 25, UCM); boxplot of variables 51-74; boxplot of all 96 variables (values up to 2 x 10^4); Values against Column Number in each panel]
>> stdx=std(x);
>> mx=mean(x);
>> sx=scale(x,zeros(1,96),stdx);
>> plot(sx)
>> plot(sx')
>> lsx=svd(sx);
>> lsx=lsx(1:10)
[Figures: scaled data sx (values 0 to 6) against variable number and against sample number; first 10 singular values of sx (scale 0 to 80) against component number]
lsx = 74.5755 28.6436 14.0461 9.3163 8.7101 8.3390 7.3844 6.4339 5.5273 5.1046
3 components?
[scores,loads,ssq,res,reslm,tsqlm,tsq]=pca(sx);
Warning: Data does not appear to be mean centered.
 Variance captured table should be read as sum of squares captured.
Percent Variance Captured by PCA Model
Principal    Eigenvalue    % Variance    % Variance
Component    of            Captured      Captured
Number       Cov(X)        This PC       Total
---------    ----------    ----------    ----------
    1        2.65e+002       78.50         78.50
    2        3.91e+001       11.58         90.08
    3        9.39e+000        2.78         92.86
    4        4.13e+000        1.23         94.09
    5        3.61e+000        1.07         95.16
    6        3.31e+000        0.98         96.14
    7        2.60e+000        0.77         96.91
    8        1.97e+000        0.58         97.49
    9        1.45e+000        0.43         97.92
   10        1.24e+000        0.37         98.29
[Figures: Variable Number vs. Loadings for PC# 1, PC# 2 and PC# 3 (96 variables); annotations mark the variable groups: PAHs, alkanes, PCBs, sterols, and higher molecular weight alkanes]
[Figures: Sample Scores with 95% Limits, score on PC# 1, PC# 2 and PC# 3 against Sample Number (22 samples)]
pltloads(loads);
[Figure: Loadings for PC# 1 versus PC# 2, points labeled by variable number (1 to 96); annotated groups: alkanes, sterols, PCBs, PAHs]
[Figure: Loadings for PC# 1 versus PC# 3, points labeled by variable number (1 to 96); annotated groups: alkanes, PAHs, PCBs]
pltscrs(scores,samp)
[Figure: Scores for PC# 1 versus PC# 2, points labeled by sample name; annotation: open sea]
[Figure: Scores for PC# 1 versus PC# 3, points labeled by sample name]
cluster(x,samp)
[Figure: Dendrogram Using Mahalanobis Distance on 3 PCs, samples against Distance to K-Nearest Neighbor; annotated groups: open sea, BCN, Gulf of Lion]
[axscores,axloads,axssq,axres,axreslm,axtsqlm,axtsq]=pca(ax);
Percent Variance Captured by PCA Model
Principal    Eigenvalue    % Variance    % Variance
Component    of            Captured      Captured
Number       Cov(X)        This PC       Total
---------    ----------    ----------    ----------
    1        3.98e+001       41.42         41.42
    2        2.57e+001       26.78         68.21
    3        8.07e+000        8.41         76.62
    4        3.72e+000        3.88         80.50
    5        3.33e+000        3.47         83.97
    6        3.22e+000        3.35         87.32
    7        2.60e+000        2.70         90.02
    8        1.93e+000        2.01         92.03
    9        1.31e+000        1.36         93.39
   10        1.17e+000        1.22         94.62
autoscaled data
[Figures: Variable Number vs. Loadings for PC# 1, PC# 2 and PC# 3 (96 variables, autoscaled data)]
[Figures: Sample Scores with 95% Limits, score on PC# 1, PC# 2 and PC# 3 against Sample Number]
[Figure: Loadings for PC# 1 versus PC# 2, points labeled by variable number; annotated groups: PAHs, alkanes, PCBs, sterols]
[Figure: Loadings for PC# 1 versus PC# 3, points labeled by variable number]
[Figure: Scores for PC# 1 versus PC# 2, points labeled by sample name; annotated groups: open sea (less contaminated), Gulf of Lion, Ebro Delta, BCN]
[Figure: Scores for PC# 1 versus PC# 3, points labeled by sample name; annotated groups: Ebro Delta, BCN]
cluster(x,samp)
[Figure: Dendrogram Using Mahalanobis Distance on 3 PCs, samples against Distance to K-Nearest Neighbor; annotated groups: open sea, BCN, Gulf of Lion]
CHEMOMETRICS STUDY OF CTD ARCTIC SEA WATER DATA
• Introduction
• CTD data description
• PCA results for XTOT, Xdcm, Xsurf, Xdeep
• PLS prediction yfluor = f(Xdcm)
• PARAFAC modelling of X(80,200,10)
• MCR of Xfluor, Xcond, Xtemp, ...
• PCA of continuous integrated data
Figure: Map of the CTD stations, numbered 1 to 49 (longitude ticks 30 W to 20 E, latitude ticks 70 N to 90 N).
Figure: Longitude versus sample number (x-axis: Sample, 10 to 80; y-axis: long, -20 to 20), decluttered, with points labelled by station and replicate.
Figure: Latitude versus sample number (x-axis: Sample, 10 to 80; y-axis: latd, 68 to 82), decluttered, with points labelled by station and replicate.
ATOS PROJECT (July 2007)
Figure: Station maps showing stations y6a to y17 and y18a to y49a (longitude ticks from 2 E to 18 E, latitude ticks from 78 N to 81 N). Annotations: SW, NE, colder waters, warmer waters, ICE.
10 CTD measured variables:
1 Depth, press
2 Temperature, temp
3 Conductivity, cond
4 Salt concentration, salt
5 Dissolved oxygen, oxyg
6 Beam light transmission, btrm
7 Fluorescence, fluor
8 Turbidity, turb
9 Latitude, latd
10 Longitude, long
CTD data evaluation and integration
Each CTD station (Station CTD1, …, Station CTD20, …, Station CTD49) gives one data table: depths (100, …, 1000 m) × 10 variables.
Raw data table: X(53367x10), from 81 CTD experiments (49 stations with replicates and depths to 100-1000)
Station 19 was removed because it had only 54 depths
80 experiments, 49 stations
Building X(10,80,200)
For each variable: Xvar(80,200), Yvar
…..
….. 80 stations
Fast data acquisition
Data should be averaged and filtered
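The averaging, filtering and array-building steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the processing actually used for the cruise data: the cast is synthetic, and the window length, the 1 m depth bins and all variable names are hypothetical choices.

```matlab
% Synthetic stand-in for one fast-sampled CTD cast (NOT real data).
rng(2);
nraw  = 1000;                          % raw readings in the cast (assumed)
depth = linspace(1, 200, nraw)';       % depth of each reading, m (assumed)
temp  = 5 * exp(-depth / 50) + 0.2 * randn(nraw, 1);

% 1) Filter: centred moving average over an odd window of readings.
win      = 11;
tempFilt = conv(temp, ones(win, 1) / win, 'same');  % edges are zero-padded

% 2) Average: one mean value per 1 m depth bin, 200 points per cast.
nbins   = 200;
bin     = min(nbins, max(1, ceil(depth)));          % bin index per reading
tempAvg = accumarray(bin, temp, [nbins 1], @mean);

% 3) Stack casts into a three-way array X(variable, experiment, depth).
nvar = 10;  nexp = 80;
X = zeros(nvar, nexp, nbins);
X(2, 1, :) = reshape(tempAvg, 1, 1, nbins);  % variable 2 (temp), experiment 1

% Slice for one variable: an (experiments x depths) matrix such as Xvar(80,200).
Xtemp = squeeze(X(2, :, :));
```

Filling the remaining experiments and variables in the same way would give the full X(10,80,200) array, from which each Xvar(80,200) is a single slice.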
Figure: Colour maps of eight quantities, one panel each: salt, cond, temp, temp/salt, fluor, btrm, oxyg, turb. In every panel the axis ticks run from 10 to 80 (horizontal) and from 10 to 100 (vertical), with a colour scale alongside. Annotations: SW, NW, NE; 23, 49, 37, 11.
Figure: Values of fluo, temp, salt and oxyg along the experiments (x-axis: 0 to 80), one panel per variable, each comparing the surf, dcm and deep data.
Figure: Correlation Map, Variables in Original Order (both axes: variables 1 to 10; the colour scale gives the value of R for each variable pair, from -0.8 to 0.8). Correlation at the DCM (chlorophyll maximum, fluorescence maximum).
Variable key: 1 'press', 2 'temp', 3 'cond', 4 'salt', 5 'oxyg', 6 'btra', 7 'fluo', 8 'turb', 9 'long', 10 'latd'
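A correlation map of this kind can be computed with base MATLAB's corrcoef. The sketch below uses synthetic data (not the CTD measurements), and the dependence built into column 3 is an assumption made only so that the example has one strongly correlated pair to find.

```matlab
% Synthetic stand-in data (NOT the CTD data): 80 samples, 10 variables.
rng(3);
names = {'press','temp','cond','salt','oxyg','btra','fluo','turb','long','latd'};
X = randn(80, numel(names));
X(:, 3) = X(:, 2) + 0.1 * randn(80, 1);   % illustrative: 'cond' tracks 'temp'

% Matrix of correlation coefficients R for every variable pair.
R = corrcoef(X);

% Locate the most strongly correlated pair, ignoring the diagonal.
Roff = R - eye(size(R));
[rmax, idx] = max(abs(Roff(:)));
[i, j] = ind2sub(size(R), idx);
fprintf('Strongest pair: %s / %s (|R| = %.2f)\n', names{i}, names{j}, rmax);
```

To draw the map itself, `imagesc(R, [-1 1]); colorbar` is one option; the variables stay in their original order along both axes.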
Percent Variance Captured by PCA Model

Principal     Eigenvalue     % Variance     % Variance
Component         of          Captured       Captured
 Number         Cov(X)        This PC         Total
---------     ----------     ----------     ----------
    1         3.95e+000        39.52          39.52
    2         3.10e+000        31.00          70.53
    3         1.54e+000        15.42          85.95
    4         8.46e-001         8.46          94.41
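The table can be read as follows: with 10 autoscaled variables the eigenvalues of Cov(X) sum to 10, so an eigenvalue of 3.95 corresponds to about 39.5% of the variance. The sketch below rebuilds such a table by SVD in base MATLAB. It assumes autoscaling as the preprocessing, which is consistent with the numbers above but not stated on the slide, and it runs on synthetic data, so its output will not match the values in the table.

```matlab
% Synthetic stand-in data (NOT the CTD data): 80 samples, 10 variables.
rng(4);
X = randn(80, 10) * randn(10, 10);

% Autoscale: zero mean and unit variance per column (R2016b or later).
Xa = (X - mean(X)) ./ std(X);

% PCA by singular value decomposition.
[U, S, V] = svd(Xa, 'econ');
T = U * S;                                  % scores
P = V;                                      % loadings

% Eigenvalues of Cov(Xa) and the variance captured by each PC.
eigval = diag(S).^2 / (size(Xa, 1) - 1);
pct    = 100 * eigval / sum(eigval);
cumpct = cumsum(pct);

% Print the first four rows in the same layout as the table.
fprintf('%4d  %10.2e  %8.2f  %8.2f\n', ...
        [(1:4); eigval(1:4)'; pct(1:4)'; cumpct(1:4)']);
```

The columns of `P` are what the loadings plots display, and the columns of `T` are what the scores plots display.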
PCA of X(80,10) at the DCM
Figure: Variables/Loadings Plots for ydcm, one panel each for the loadings on PC 1 (39.52%), PC 2 (31.00%) and PC 3 (15.42%). x-axis: variable (1 to 10: press, temp, cond, salt, oxyg, btra, fluo, turb, long, latd).
Figure: Samples/Scores Plot of ydcm (PCA of X(80,10) at the DCM): scores on PC 2 (31.00%) versus scores on PC 1 (39.52%), decluttered, with points labelled by station and replicate (1, 2a, 2b, ..., 49b). Annotated variable directions: temp, cond, salt; oxyg; fluor, turb; btrm.
Percent Variance Captured by PCA Model

Principal     Eigenvalue     % Variance     % Variance
Component         of          Captured       Captured
 Number         Cov(X)        This PC         Total
---------     ----------     ----------     ----------
    1         3.86e+000        38.55          38.55
    2         2.01e+000        20.11          58.67
    3         1.53e+000        15.27          73.94
    4         1.19e+000        11.87          85.81
PCA of X(80,10) at the surface
Excluded sample 67, 42b
Figure: Variables/Loadings Plots for ysurf, one panel each for the loadings on PC 1 (38.55%), PC 2 (20.11%) and PC 3 (13.35%). x-axis: variable (1 to 10: press, temp, cond, salt, oxyg, btra, fluo, turb, long, latd).
Figure: Samples/Scores Plot of ysurf (PCA of X(80,10) at the surface): scores on PC 2 (20.11%) versus scores on PC 1 (38.55%), decluttered, with points labelled by station and replicate. Excluded station 67. Annotated variable directions: latd, long; temp, cond, salt, fluor; oxyg, btrm.
Fluorescence PLS prediction from other parameters
(DCM data X(80,9), y(80,1))
Linear regression model using Partial Least Squares calculated with SIMPLS
Cross validation: random samples w/ 8 splits

Percent Variance Captured by Regression Model

        -----X-Block-----    -----Y-Block-----
Comp     This      Total      This      Total
----    -------   -------    -------   -------
  1      29.00     29.00      76.95     76.95
  2      38.72     67.72       2.51     79.46

Figure: Variables/Loadings Plot for yred11: regression vector for Y 1 versus variable (1 to 9: press, temp, cond, salt, oxyg, btra, turb, long, latd).
Figure: Variables/Loadings Plot for ydcm: VIP scores for Y 1 versus variable (press, temp, cond, salt, oxyg, btra, turb, long, latd).
Figure: Samples/Scores Plot of ydcm: Y predicted 1 versus Y measured 1, decluttered, with points labelled by station and replicate.
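The quantities shown for this model (variance captured per component, regression vector, VIP scores and predicted y) can be illustrated with a small single-y PLS sketch in base MATLAB. Note the assumptions: this is a NIPALS-type PLS1 loop, not the SIMPLS routine named on the slide (for a single y the two are generally described as giving the same predictions), the VIP expression is the commonly used one, the data are synthetic, and all variable names are hypothetical.

```matlab
% Synthetic stand-in data (NOT the CTD data): 80 samples, 9 predictors.
rng(5);
n = 80;  p = 9;  A = 2;                 % A = number of latent variables
Xraw = randn(n, p);
yraw = 2 * Xraw(:, 2) - Xraw(:, 5) + 0.3 * randn(n, 1);

% Autoscale both blocks (implicit expansion needs R2016b or later).
X = (Xraw - mean(Xraw)) ./ std(Xraw);
y = (yraw - mean(yraw)) / std(yraw);

W = zeros(p, A);  P = zeros(p, A);  T = zeros(n, A);  q = zeros(A, 1);
E = X;  f = y;                          % residuals, deflated at each step
for a = 1:A
    w  = E' * f;   w = w / norm(w);     % weight vector (unit length)
    t  = E * w;                         % scores
    pa = E' * t / (t' * t);             % X loadings
    qa = f' * t / (t' * t);             % y loading
    E  = E - t * pa';                   % deflate X
    f  = f - t * qa;                    % deflate y
    W(:, a) = w;  P(:, a) = pa;  T(:, a) = t;  q(a) = qa;
end

b    = W * ((P' * W) \ q);              % regression vector (autoscaled units)
yhat = X * b;                           % fitted y (autoscaled units)

% Percent variance captured by each latent variable in X and in y.
ssT  = sum(T.^2, 1);
varX = 100 * ssT .* sum(P.^2, 1) / sum(X(:).^2);
varY = 100 * ssT .* (q'.^2) / sum(y.^2);

% VIP scores: weights squared, averaged with the y-variance of each LV.
ssY = ssT .* (q'.^2);
vip = sqrt(p * (W.^2 * ssY') / sum(ssY));
```

Here `cumsum(varX)` and `cumsum(varY)` correspond to the "Total" columns of the table, `b` to the regression vector plot, and `vip` to the VIP plot; variables with VIP above 1 are usually read as the more influential ones.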
Fluorescence PLS prediction from other parameters, excluding beam transmission and turbidity
(DCM data X(80,7), y(80,1))
Linear regression model using Partial Least Squares calculated with the SIMPLS algorithm
Cross validation: random samples w/ 8 splits

Percent Variance Captured by Regression Model

        -----X-Block-----    -----Y-Block-----
Comp     This      Total      This      Total
----    -------   -------    -------   -------
  1      34.94     34.94      33.23     33.23
  2      39.55     74.49      10.53     43.76
  3      11.72     86.21      12.44     56.20
Figure: Variables/Loadings Plot for yred11: loadings on LV 3 (11.72%) versus variable (1 to 7: press, temp, cond, salt, oxyg, long, latd).
Figure: Variables/Loadings Plot for yred11: VIP scores for Y 1 versus variable (press, temp, cond, salt, oxyg, long, latd).
Figure: Samples/Scores Plot of ydcm: Y predicted 1 versus Y measured 1, decluttered, with points labelled by station and replicate.
Fluorescence PLS prediction from other parameters
(surf data X(80,9), y(80,1))

Percent Variance Captured by Regression Model

        -----X-Block-----    -----Y-Block-----
Comp     This      Total      This      Total
----    -------   -------    -------   -------
  1      33.89     33.89      29.08     29.08
  2      22.29     56.18       3.53     32.61
  3      14.83     71.01       2.76     35.37
  4       4.80     75.81       5.10     40.48
Figure: Variables/Loadings Plot for ysurf: regression vector for Y 1 versus variable (press, temp, cond, salt, oxyg, btra, turb, long, latd).
Figure: Variables/Loadings Plot for ysurf: VIP scores for Y 1 versus variable (press, temp, cond, salt, oxyg, btra, turb, long, latd).
Figure: Samples/Scores Plot of ysurf: Y predicted 1 versus Y measured 1, decluttered, with points labelled by station and replicate.