1.1. Principal component analysis

Pierre Legendre
Département de sciences biologiques, Université de Montréal
http://www.NumericalEcology.com/
© Pierre Legendre 2017

Transcript of "1.1. Principal component analysis"

Source PDF: biol09.biol.umontreal.ca/PLcourses/Ordination_section_1.1_PCA_Eng.pdf



Outline of the presentation

1. PCA algebra and computation steps

2. Data transformations before PCA

3. Scalings in PCA

4. Equilibrium contribution of variables

5. The meaningful components

6. Algorithms for PCA

7. Some applications of PCA in ecology

8. References


This section of the ordination course will describe the most fundamental of all ordination methods. It is called Principal component analysis (PCA).

In a sense, it is “the mother” of the other ordination methods that we will study in later sections of the course because these other methods will try to produce the same type of ordination plots as PCA using data that are not quite appropriate, or that are inappropriate for PCA.

I will first describe the algebra and computation steps of PCA.

PCA algebra and computation steps

Definition of PCA

Principal component analysis (PCA) is an ordination method preserving the Euclidean distance among the objects.

PCA is only applicable to multivariate quantitative data.
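This distance-preserving property can be checked numerically. A minimal base-R sketch, using the small 5 × 2 data matrix that reappears later in these slides:

```r
# PCA is a rigid rotation: Euclidean distances among objects are preserved
Y <- matrix(c(2,3,5,7,9,1,4,0,6,2), 5, 2)
Y.c <- scale(Y, center = TRUE, scale = FALSE)   # centre the variables
U <- eigen(cov(Y.c))$vectors                    # eigenvectors of the covariance matrix
F <- Y.c %*% U                                  # object coordinates on the PCA axes
all.equal(c(dist(Y.c)), c(dist(F)))             # TRUE: same interobject distances
```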


PCA: Computation steps

For the means, variances and covariances to make sense, the variables must be quantitative in PCA.

Multiply the centred data by the eigenvector matrix U to obtain the positions of the objects in the PCA ordination plot:

F = [y − ȳ] U

F = [y − ȳ] U

Matrix U contains direction cosines, which are cosines of the angles between the variables and the PCA axes.

R code for this example:

Y <- matrix(c(2,3,5,7,9,1,4,0,6,2), 5, 2)
Y.c <- scale(Y, center=TRUE, scale=FALSE)
Y.eig <- eigen(cov(Y.c))
U <- Y.eig$vectors
F <- Y.c %*% U
biplot(F, U)

Note: the axes may be inverted, because of an arbitrary decision that the software has to make during the calculation, namely which end of each eigenvector corresponds to the positive direction.


This decision is of no fundamental consequence for the ordination:

the distances among objects are the same if any or all of the axes are inverted.

After inversion of the signs along one or several axes, PCA still preserves the Euclidean distances among the objects.
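A short base-R check of this claim, flipping the sign of one axis of the small example (a sketch, not part of the original slides):

```r
# Inverting the sign of axis 1 leaves all interobject distances unchanged
Y <- matrix(c(2,3,5,7,9,1,4,0,6,2), 5, 2)
Y.c <- scale(Y, center = TRUE, scale = FALSE)
F <- Y.c %*% eigen(cov(Y.c))$vectors
F.inv <- F %*% diag(c(-1, 1))           # invert the first axis
all.equal(c(dist(F)), c(dist(F.inv)))   # TRUE
```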

Animation: the PCA example in 4 steps

[Figure: (a) scatter diagram of Var. 1 against Var. 2; (b) the same data after centring the variables.]

[Figure, continued: (c) compute the PCA axes by rotating the graph; (d) the PCA biplot, with objects 1 to 5 and variables Var 1 and Var 2.]


Data transformations before PCA

Field data often need to be transformed before they are analysed by PCA. We will see why.

PCA is a form of variance decomposition

Properties of variances

1. The sum of the eigenvalues is equal to the sum of the variances of the variables.

2. Variables have physical dimensions (e.g. altitude is measured in m) and the variance of a variable has the same dimension as the variable, squared. For example, the variance of altitude is expressed in m². Yes, variances have physical dimensions!

3. The variances of all variables subjected together to PCA must have the same physical dimensions; otherwise, we could not add them into a sum to be decomposed into eigenvalues during PCA.

Make physical variables dimensionless

• When the input variables do not all have the same physical dimension, they must be subjected to standardization:

z_i = (y_i − ȳ) / s_y

Input variable standardization is an option of PCA functions.

• Ranging is also sometimes used:

y'_i = (y_i − y_min) / (y_max − y_min)

Both methods make the transformed variables dimensionless.

• Standardized variables have a mean of 0 and a variance of 1.
• The sum of the variances of p standardized variables is p.
• Hence, the sum of the eigenvalues of a matrix of standardized variables (argument scale) is also p.
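These three statements can be verified directly in base R; a minimal sketch with made-up data:

```r
# Standardized variables have mean 0 and variance 1, so their variances sum to p
set.seed(1)
Y <- matrix(rnorm(100 * 4, mean = 50, sd = 7), 100, 4)
Z <- scale(Y)                 # centre, then divide by the standard deviation
max(abs(colMeans(Z)))         # numerically zero
apply(Z, 2, var)              # 1 1 1 1
sum(eigen(cov(Z))$values)     # 4 = p, the sum of the PCA eigenvalues
```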

# PCA, spider environmental data using vegan's rda()
spiders.env <- read.table(file.choose())
dim(spiders.env)
# [1] 28 15   <= 28 sites, 15 variables

rda.spiders.env <- rda(spiders.env, scale=TRUE)
sum(rda.spiders.env$CA$eig)
# [1] 15   <= Sum of the PCA eigenvalues

The data file used here is 'Spiders_env_(28x15).txt', containing 15 variables.


PCA produces the best dispersion of the points when the input variables have symmetrical distributions. The ideal situation, rarely achieved, is when their distribution is multivariate normal.

PCA graphs are easier to interpret when the distributions are at least symmetrical. Variables may be transformed to make their distributions more symmetrical. The normalizing transformations most often used are the square root (exponent ½), double square root (exponent ¼), and log.

Any log base can be used for transformation. For a data vector vec, log10(vec) is a linear transformation of loge(vec); the two transformed vectors are perfectly correlated.
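A quick base-R illustration of the change-of-base identity (the vector vec is made up):

```r
# log10 is a linear rescaling of the natural log: the two are perfectly correlated
vec <- c(1, 10, 100, 1000, 5, 50)
all.equal(log10(vec), log(vec) / log(10))   # TRUE: change of base
cor(log10(vec), log(vec))                   # 1
```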

"Normalizing" transformations

Example in two dimensions: random lognormal versus normal data.
=> Which graph shows the best dispersion of the points?

# Generate a matrix of random lognormal data
# with 50 rows and 5 columns
n <- 50 ; p <- 5
mat2.2 <- matrix(exp(rnorm((n*p), mean=0, sd=2.5)), n, p)
colnames(mat2.2) <- paste("Var", 1:5, sep=".")

# Apply a log transformation to the data
mat2.2.log <- log(mat2.2)

# Plot columns 1 and 2 of these data files

[Figure: left, 'Random lognormal data' (Var.1 against Var.2); right, the same columns after the log transformation, 'Random normal data'.]

=> Which graph shows the best dispersion of the points?


1. Transformation to reduce the asymmetry of species distributions: y' = log(y + 1).

2. Before PCA, community composition data can also be transformed using transformations that are appropriate for the study of beta diversity =>

Transformations for community composition data

For community data, we use y' = log(y + 1) because the lowest value in community data matrices is 0.

log(0) = –Inf, but log(0 + 1) = 0.

The R function to carry out this transformation is:

log1p(x)
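A one-line check in base R (the abundances are made up):

```r
# log(y + 1) maps the zeros of community data to 0; log1p() computes it accurately
y <- c(0, 1, 9, 99)
all.equal(log1p(y), log(y + 1))   # TRUE
log1p(0)                          # 0: absences stay at zero
```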

The Hellinger and chord transformations (Legendre & Gallagher, 2001) are appropriate before PCA because …

• Chord transformation: y'_ij = y_ij / sqrt( Σ_{j=1..p} y_ij² )
chord-transformed data + Euclidean distance => chord distance

• Hellinger transformation: y'_ij = sqrt( y_ij / y_i+ )
Hellinger-transformed data + Euclidean distance => Hellinger distance


The Hellinger and chord distances are appropriate for ordination of community data because they have 9 properties that are important for beta diversity studies.

These two distances will be described in the course on dissimilarities and their properties will be shown in the course on beta diversity.
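The two transformations can be written in a few lines of base R directly from the formulas above; a sketch on a made-up 3 × 3 abundance matrix (with vegan, decostand(y, "hellinger") and decostand(y, "normalize") compute the same quantities):

```r
# Hellinger and chord transformations written directly from their formulas
y <- matrix(c(10, 0, 4,
               1, 2, 0,
               0, 8, 6), 3, 3, byrow = TRUE)
hel   <- sqrt(y / rowSums(y))       # Hellinger: y'_ij = sqrt(y_ij / y_i+)
chord <- y / sqrt(rowSums(y^2))     # chord: y'_ij = y_ij / sqrt(sum_j y_ij^2)
rowSums(hel^2)                      # 1 1 1: transformed rows have unit norm
dist(hel)                           # Euclidean distance here = Hellinger distance
```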

Example: the spider data set (28 sites, 12 species). PCA of the original abundance and of the Hellinger-transformed data.

spiders <- read.table(file.choose())
spiders.hel <- decostand(spiders, "hellinger")

# PCA using function prcomp of {stats}
pca.spiders <- prcomp(spiders)
pca.spiders.hel <- prcomp(spiders.hel)

# PCA biplots from prcomp {stats}
par(mfrow=c(1,2))
biplot(pca.spiders, scale=0)
biplot(pca.spiders.hel, scale=0)

Which PCA biplot shows the best dispersion of the points?


[Figure: two PCA biplots of the spider data, PC1 against PC2, showing sites Site1 to Site28 and the 12 species arrows (Alop.acce to Zora.spin). Left panel: original abundance data; right panel: Hellinger-transformed data.]


The role of scalings is to transform the eigenvectors in matrix U into coordinates that are appropriate to represent the sites and the species in biplots.

Different types of scalings are available in PCA.

Scalings in PCA


PCA biplots are graphs in which objects and variables (descriptors) are represented together.

Biplots


In biplots, objects and variables can be presented together in two different ways, called scalings:

• Scaling type 1: distance biplot, used when the interest is in the positions of the objects with respect to one another.
→ Plot matrix F to represent the objects and U for the variables.

• Scaling type 2: correlation biplot, used when the angular relationships among the variables are of primary interest.
→ Plot matrix G to represent the objects and Usc2 for the variables, where G = F Λ^(−1/2) and Usc2 = U Λ^(1/2).

Scalings in biplots


The generally accepted rule for representing sites and variables (e.g. species) together in a PCA biplot is the following: in biplots, we can use matrices which, together, reconstruct the centred data Yc. Hence,

→ In a distance biplot, we can use F and U together because F U' = Yc;
→ In a correlation biplot, we can use G and Usc2 together because G Usc2' = (F Λ^(−1/2)) (U Λ^(1/2))' = Yc.

This biplot rule was proposed by K. Ruben Gabriel in 1971.
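Gabriel's rule can be verified numerically on the small example from earlier slides; a base-R sketch:

```r
# F U' and G Usc2' both reconstruct the centred data Yc
Y  <- matrix(c(2,3,5,7,9,1,4,0,6,2), 5, 2)
Yc <- scale(Y, center = TRUE, scale = FALSE)
eig <- eigen(cov(Yc))
U <- eig$vectors
F <- Yc %*% U
G     <- F %*% diag(eig$values^(-0.5))   # scaling 2 site scores
U.sc2 <- U %*% diag(eig$values^( 0.5))   # scaling 2 variable scores
all.equal(F %*% t(U),     Yc, check.attributes = FALSE)   # TRUE
all.equal(G %*% t(U.sc2), Yc, check.attributes = FALSE)   # TRUE
```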

The biplot rule


Extended code for the numerical example, scalings 1 and 2:

Y <- matrix(c(2,3,5,7,9,1,4,0,6,2), 5, 2)
Y.c <- scale(Y, center=TRUE, scale=FALSE)
Y.eig <- eigen(cov(Y))
k <- length(which(Y.eig$values > 1e-10))   # number of non-null eigenvalues

# Scaling 1 (distance biplot)
U <- Y.eig$vectors[, 1:k]   # keep the k axes with non-null eigenvalues
F <- Y.c %*% U
biplot(F, U, expand=1.5, xlim=c(-4,4), ylim=c(-4,4))
abline(h=0, v=0, lty=2, col="grey60")

# Scaling 2 (correlation biplot)
U.sc2 <- U %*% diag(Y.eig$values[1:k]^(0.5))
G <- F %*% diag(Y.eig$values[1:k]^(-0.5))
biplot(G, U.sc2, expand=1.3, xlim=c(-1.5,1.5), ylim=c(-1.5,1.5))
abline(h=0, v=0, lty=2, col="grey60")

[Figure: the two biplots of the numerical example, sites 1 to 5 and variables Var 1 and Var 2. Left: distance biplot (scaling 1); right: correlation biplot (scaling 2).]

• In scaling 1 (distance biplot),
→ the sites have variances, along each axis (or principal component), equal to the axis eigenvalue (columns of F);
→ the eigenvectors (columns of U) are normed to lengths = 1;
→ the length (norm) of each species vector in the p-dimensional ordination space (rows of U) is 1.

• In scaling 2 (correlation biplot),
→ the sites have unit variance along each axis (columns of G);
→ the eigenvectors (columns of Usc2) are normed to lengths = sqrt(eigenvalues);
→ the norm of each species vector in the p-dimensional ordination space (rows of Usc2) is its standard deviation.
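A base-R check of these properties on the small numerical example (a sketch):

```r
# Numerical check of the scaling properties
Y  <- matrix(c(2,3,5,7,9,1,4,0,6,2), 5, 2)
Yc <- scale(Y, center = TRUE, scale = FALSE)
eig <- eigen(cov(Yc))
F <- Yc %*% eig$vectors
G <- F %*% diag(eig$values^(-0.5))
all.equal(apply(F, 2, var), eig$values)   # scaling 1: site variances = eigenvalues
apply(G, 2, var)                          # scaling 2: unit variance on each axis
colSums(eig$vectors^2)                    # eigenvectors normed to length 1
```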

Mathematical relationships in scaled matrices

In scaling 1 (distance biplot),
1. Distances among objects approximate their Euclidean distances in full multidimensional space.
2. Projecting an object at right angle on a descriptor approximates the position of the object along that descriptor.
3. Since descriptors have equal lengths of 1 in the full-dimensional space, the length of the projection of a descriptor in reduced space indicates how much it contributes to the formation of that space.
4. A scaling 1 biplot thus shows which variables contribute the most to the ordination in a few dimensions (see also section: Equilibrium contribution of variables).
5. The descriptor-axes are orthogonal (90°) to one another in multidimensional space. These right angles, projected in reduced space, do not reflect the variables' correlations.

Revise these relationships when you compute a PCA.

In scaling 2 (correlation biplot),
1. Distances among objects approximate their Mahalanobis distances in full multidimensional space.
2. Projecting an object at right angle on a descriptor approximates the position of the object along that descriptor.
3. Since descriptors have lengths s_j in full-dimensional space, the length of the projection of a descriptor j in reduced space is an approximation of its standard deviation s_j. Note: s_j is 1 when the variables have been standardized.
4. The angles between descriptors in the biplot reflect their correlations.
5. When the distance relationships among objects are important for interpretation, this type of biplot is inadequate; a distance biplot should be used.

Revise these relationships when you compute a PCA.

Equilibrium contribution of variables

If p variables contributed equally to all dimensions of a reduced PCA space (e.g. in 2 dimensions), we would say that their contributions are in equilibrium with respect to the axes.

For any set of p variables, we can draw a circle on the PCA scaling 1 biplot whose radius is equal to the length of the projection of a variable that would contribute equally to all axes of the reduced space. The logic is explained in the following slides =>

[Figure: a variable vector of length 1 in 3-dimensional space projects onto a 2-dimensional plane with length √(2/3).]


In PCA scaling 1, the p-dimensional space preserves the Euclidean distance among objects. The descriptors all have lengths (or norms) of 1 in multidimensional space.

Lengths of descriptors in PCA space

Hence, the lengths of their projections in 2-dimensional plots, for example, can be compared:
→ long arrows represent variables that contribute highly to the axes of this projection in 2 dimensions;
→ short arrows represent variables that contribute less.

In PCA scaling 1, the variables are at right angles to one another in multivariate space. Projected in reduced 2-dimensional space, these angles look acute or obtuse, but they are still right angles.


A circle can be drawn on the scaling 1 biplot, corresponding to the hypothesis of equal contributions of all descriptors to the reduced space (e.g. in 2 dimensions):
→ it is called the equilibrium circle of descriptors, or circle of equilibrium contribution.
→ Its radius is sqrt(d/p), where
d = dimension of the reduced space (usually, d = 2);
p = dimensionality of the multivariate PCA space, which is the number of eigenvalues > 0, usually equal to the number of descriptors.
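The radius is a one-liner in R; equilibrium.radius below is a made-up helper name, not a function from these slides:

```r
# Radius of the equilibrium circle for d displayed axes and p descriptors
equilibrium.radius <- function(d, p) sqrt(d / p)
round(equilibrium.radius(2, 3), 3)    # 0.816
round(equilibrium.radius(2, 12), 3)   # 0.408, the value used for the 12-species spider data
```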

Equilibrium circle of descriptors


[Figure: a variable vector of length 1 in 3-dimensional space projects onto a 2-dimensional plane with length √(2/3).]

Radius (R) of the circle of equilibrium contribution: R = sqrt(d/p), where
d = dimension of the reduced space (d = 2 in this example);
p = number of descriptors in the PCA (p = 3).


With an equilibrium circle, the scaling 1 biplot shows the 6 species that contribute the most to the dispersion of the sites in reduced space. Angles among descriptors are not interpretable in that type of biplot.

PCA biplot, scaling 1, of the Hellinger-transformed spider data.

The circle of equilibrium contribution is shown in red. Its radius is sqrt(2/12) = 0.408.

Six species have arrows longer than the radius of the circle.

Drawn with function cleanplot.pca() of the Numerical ecology with R book.

[Figure: 'PCA biplot - Scaling 1' of the Hellinger-transformed spider data, axes PCA 1 and PCA 2, with sites Site1 to Site28, the 12 species arrows (Alop.acce to Zora.spin), and the equilibrium circle in red.]

In scaling 2: no equilibrium circle

In PCA scaling 2, the p-dimensional space preserves the Mahalanobis distance among objects, not the Euclidean distance. In that multivariate space,
• the variables are not at right angles with respect to one another, but at angles that reflect their correlations;
• the lengths of the variable vectors = their standard deviations.

→ The lengths of the variable projections in (for example) a 2-dimensional reduced space depend on:
• their standard deviations, which are all different if the variables have not been standardized;
• the angles of their projections, in Mahalanobis space, onto the 2-dimensional projection plane.

For these reasons, in scaling 2,
• the angles between descriptors reflect their correlations;
• no single value of equilibrium contribution applies to all descriptors;
• hence, no circle of equilibrium contribution can be drawn in PCA scaling 2 biplots.


In a scaling 2 biplot, the angles between descriptors reflect their correlations. Species with long arrows separated by small angles may indicate species associations.

PCA biplot, scaling 2, of the Hellinger-transformed spider data.

Drawn with function cleanplot.pca() of the Numerical ecology with R book.

[Figure: 'PCA biplot - Scaling 2' of the Hellinger-transformed spider data, axes PCA 1 and PCA 2, with sites Site1 to Site28 and the 12 species arrows (Alop.acce to Zora.spin).]


Environmental variables must be standardized at the beginning of a principal component analysis. A circle of equilibrium contribution can be drawn in scaling 1. The circle helps determine the variables that contribute the most to the formation of the reduced space plane.

Environmental variables

# PCA, spider environmental data using vegan's rda()
spiders.env <- read.table(file.choose())
rda.spiders.env <- rda(spiders.env, scale=TRUE)
par(mfrow=c(1,2))
cleanplot.pca(rda.spiders.env, scaling=1, opt=TRUE)
cleanplot.pca(rda.spiders.env, scaling=2, opt=TRUE)

The data file used here is 'Spiders_env_(28x15).txt', containing 15 variables.
cleanplot.pca(): R function of the book Numerical ecology with R.

• Distances among sites in the scaling 1 biplot reflect the distances in multivariate space. A circle of equilibrium contribution can be drawn.
• In a scaling 2 biplot, the angles between variables reflect their correlations.

[Figure: PCA biplots of the spider environmental data, sites Site1 to Site28 and the 15 variables (Water.content, Humus, Bare.sand, Leaves.twigs, Herb.cover, Herb.height, Calamagrostis, Corynephorus, Tree.cover, Tree.height, Populus, Crataegus, Ill.grey.sky, Ill.clear.sky, Soil.reflection). Left: 'PCA biplot - Scaling 1' with the equilibrium circle; right: 'PCA biplot - Scaling 2'.]


The meaningful components

A PCA produces an ordination in k dimensions, with

k ≤ min(p, n – 1)

Which of these dimensions should we look at and try to interpret?

PCA provides a description of multivariate data in a space of reduced dimensionality. It is not a test of statistical significance.

Yet, users of the method would like to know: how many axes should we look at? How many axes display more than random variation? Various criteria have been proposed¹.

Criteria to select the meaningful components

1. Arbitrary decision: for example, display and examine the axes that represent, together, at least 75% of the total variation.
2. The Kaiser-Guttman criterion: interpret the axes whose eigenvalues are larger than the mean of the eigenvalues. (For standardized data, the sum of the eigenvalues is the number of variables p, so that the mean eigenvalue is 1.)

¹ See also Legendre & Legendre (2012), Section 9.1.6, "The meaningful components".


3. Compare the eigenvalues to the broken-stick model, a null model for random distribution of the variance among the axes.

Null model: break a stick of unit length into p parts by placing (p–1) breakpoints at random along it. Measure the piece lengths and place them in decreasing order. Repeat a large number of times. Compute the mean of the longest parts, the second longest, etc.

For a unit stick broken at random into p = 2, 3, … pieces, the expected values (E) of the relative lengths of the successively smaller pieces (j) are given by the following model equation:

E(j) = (1/p) Σ_{x=j..p} (1/x)

Several R packages provide functions to compute scree plots that compare eigenvalues to the broken-stick model.
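The expectation is easy to compute directly; a base-R sketch (broken.stick is a made-up name; vegan's bstick() serves the same purpose):

```r
# Broken-stick expected proportions: E(j) = (1/p) * sum_{x = j..p} 1/x
broken.stick <- function(p) sapply(1:p, function(j) sum(1 / (j:p)) / p)
bs <- broken.stick(4)
round(bs, 4)   # 0.5208 0.2708 0.1458 0.0625
sum(bs)        # 1: the whole stick is accounted for
```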

# Screeplot example: the spider data
spiders <- read.table(file.choose())
spiders.hel <- decostand(spiders, "hellinger")

# PCA using function rda of {vegan}
rda.spiders.hel <- rda(spiders.hel)
screeplot(rda.spiders.hel, bstick=TRUE, npcs=12)

Screeplot of the eigenvalues. Grey rectangles: the eigenvalues. Red dots: values of the broken-stick model rescaled to the sum of the eigenvalues. Vegan's function screeplot.cca(). Screeplot: see http://psychologydictionary.org/scree-plot/

[Figure: screeplot of rda.spiders.hel, inertia of PC1 to PC12 with the broken-stick values overlaid.]

The decision on how many eigenvalues should be interpreted can be based
• either on the comparison of individual eigenvalues with the corresponding broken-stick values,
• or on the comparison of cumulative eigenvalues with cumulative broken-stick values.



For this example, using the comparison of individual eigenvalues with the corresponding broken stick values, one could decide to interpret the first 2 or 3 eigenvalues because the first two eigenvalues are larger than the corresponding broken stick values and the third one is about equal.

The meaningful components

Grey rectangles: the eigenvalues. Red dots: values of the broken stick model rescaled to the sum of the eigenvalues.


Page 50

PCA algebra and computation steps

In the analysis of the spider data, what proportion of the species variance is expressed on the first 2 PCA axes? On the first 3 axes?
Compute pseudo-R² statistics for m = 2 or 3 axes:

R.square = Sum of the first m eigenvalues / Sum of all eigenvalues

# PCA using function rda() of {vegan}
res1 <- rda(spiders.hel)
( sum(res1$CA$eig[1:2])/sum(res1$CA$eig) )   # 0.744
( sum(res1$CA$eig[1:3])/sum(res1$CA$eig) )   # 0.874

# Same analysis using function prcomp() of {stats}
res2 <- prcomp(spiders.hel)
( sum(res2$sdev[1:2]^2)/sum(res2$sdev^2) )   # 0.744
( sum(res2$sdev[1:3]^2)/sum(res2$sdev^2) )   # 0.874

Page 51

Algorithms for PCA

Principal component analysis can be computed using different computer algorithms.

Algorithms for PCA

Page 52

PCA is a statistical method of data analysis (not a statistical test).
Three different algorithms (or methods of calculation) can be used to implement it:
• Eigenvalue decomposition (EVD); eigen(cov(Y)) in R.

• Singular value decomposition (SVD); svd(Y.c) in R.
These two algorithms are interchangeable, although statisticians often prefer svd(), which offers greater numerical accuracy.

Details are found in Legendre & Legendre (2012, Section 9.1.9).

• An iterative algorithm developed by Clint & Jennings (1970) was adapted to correspondence analysis by Hill (1973). It was then used by ter Braak in the Canoco ordination package.
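The interchangeability of EVD and SVD can be verified numerically: the eigenvalues of cov(Y) equal the squared singular values of the centred data divided by (n − 1). A NumPy sketch of this check (illustrative only; eigen(cov(Y)) and svd(Y.c) are the R equivalents mentioned above):

```python
import numpy as np

# Random data matrix Y (n sites x p variables), then column-centred.
rng = np.random.default_rng(0)
Y = rng.normal(size=(30, 4))
Yc = Y - Y.mean(axis=0)

# EVD of the covariance matrix, eigenvalues sorted in decreasing order.
evals = np.linalg.eigvalsh(np.cov(Yc, rowvar=False))[::-1]

# SVD of the centred data: lambda_j = d_j^2 / (n - 1).
d = np.linalg.svd(Yc, compute_uv=False)
evals_svd = d**2 / (Yc.shape[0] - 1)
# evals and evals_svd agree to machine precision.
```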

Algorithms for PCA

Page 53

It had to be U – The (wonderful) SVD
The SVD song

Calculation of PCA by Singular Value Decomposition (SVD) is demonstrated in a video written by Prof. Michael Greenacre, Barcelona Graduate School of Economics. The video is available at the following address:

https://www.youtube.com/watch?v=JEYLfIVvR9I

The video presents a song explaining the mathematics of singular value decomposition (SVD), one of the most useful results in matrix algebra, which has a vast range of practical applications, including PCA. It is sung by Gurdeep Stephens, accompanied on the piano by Lisa Olive. Concept, lyrics and animations by Michael Greenacre.

This video was first played at the 9th Tartu Conference of Multivariate Statistics in Tartu, Estonia, on 28 June 2011.

The video links to mathematical lectures on SVD.

Play the video

Algorithms for PCA

Page 54

Some applications of PCA in ecology

Principal component analysis can help answer different questions in ecology. Here are some examples.

Page 55

Some applications of PCA

Display two-dimensional ordinations of the objects with their variables. Objects are often sampling sites in ecology.

[Figure: PCA biplot (PC1 × PC2) of the 28 sites, with the 12 spider species (Alop.acce, Alop.cune, Alop.fabr, Arct.lute, Arct.peri, Aulo.albi, Pard.lugu, Pard.mont, Pard.nigr, Pard.pull, Troc.terr, Zora.spin) shown as vectors.]

This is the most common application of PCA.

Example 1 –

Page 56

Identify or display groups of variables that are intercorrelated, for example species associations.

Oribatid mite species associations (symbols: groups 1 and 2) plotted in a PCA scaling 2 ordination. From: Legendre (2005).

Example 2 –

Page 57

Example 3 – Detect outliers or erroneous data in data files.

[Figure: two PCA biplots (PC1 × PC2) of the 28 sites with the variables Water_content, Calamagrostis, Reflectance and Corynephorus.]

• Left: erroneous value of 1000 for Calamagrostis at site 9; no other change.
• Right: biplot with corrected value (Calamagrostis = 0) at site 9.

Page 58

Example 4 –

Simplify collinear data that will be used as explanatory variables or covariables in canonical analysis (RDA or CCA).

# Example: the spider data, p = 12
spiders <- read.table(file.choose())
spiders.hel <- decostand(spiders, "hellinger")

# PCA using function rda of {vegan}
rda.spiders.hel <- rda(spiders.hel)
eigenval <- rda.spiders.hel$CA$eig
format(cumsum(eigenval)/sum(eigenval), digits=3)

PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 0.502 0.744 0.874 0.906 0.935 0.953 0.969 0.979 0.987 0.994 0.998 1.000

• Noise in data can be removed by dropping the axes with small eigenvalues.
• The first 4 components account for 90% of the variance of the data, and the first 6 account for 95%. Use these components instead of the 12 original variables to represent the species as explanatory variables or covariables in RDA or CCA, in order to save degrees of freedom in tests of significance.
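The axis-selection rule reduces to a small function; a Python sketch (illustrative; the per-axis proportions are recovered from the cumulative values printed above for the Hellinger-transformed spider data):

```python
# Number of leading PCA axes needed to reach a target proportion of variance.
def n_axes(proportions, target):
    cum = 0.0
    for m, p in enumerate(proportions, start=1):
        cum += p
        if cum >= target:
            return m
    return len(proportions)

# Per-axis proportions, derived by differencing the cumulative values above.
props = [0.502, 0.242, 0.130, 0.032, 0.029, 0.018,
         0.016, 0.010, 0.008, 0.007, 0.004, 0.002]
# 4 axes reach 90% of the variance, 6 axes reach 95%.
```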

Page 59

Remove a linear component of variation, e.g. the size factor in log-transformed morphological data.

Assumption: size is a multiplicative factor for all measured morphological traits.
• After a log transformation of all morphological measures (lengths), size becomes an additive factor.
• Compute PCA of the log-transformed measurements.
• Size should be highly correlated to one of the first PCA axes.
• From matrix F (see Computation steps), remove the PCA axis most related to size. Use the other columns of matrix F as size-detrended measurements.

Example 5 –
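The size-removal recipe can be sketched as follows (a NumPy illustration on simulated measurements, not the course's own code; the variable names are ours):

```python
import numpy as np

# Simulate a multiplicative size factor acting on 3 measurements,
# which becomes additive after the log transformation.
rng = np.random.default_rng(1)
size = rng.uniform(1.0, 2.0, size=50)               # latent size factor
Y = np.log(size[:, None] * np.exp(0.1 * rng.normal(size=(50, 3))))

# PCA of the log-transformed measurements.
Yc = Y - Y.mean(axis=0)
U = np.linalg.svd(Yc, full_matrices=False)[2].T     # eigenvectors (p x p)
F = Yc @ U                                          # principal components (matrix F)

# Drop the axis most correlated with size; the remaining columns of F
# serve as size-detrended measurements.
r = [abs(np.corrcoef(F[:, j], size)[0, 1]) for j in range(F.shape[1])]
k = int(np.argmax(r))
F_detrended = np.delete(F, k, axis=1)
```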

Page 60

References


Borcard, D., F. Gillet & P. Legendre. 2018. Numerical ecology with R, 2nd edition. Use R! series, Springer International Publishing, New York. xv + 435 pp.

Clint, M. & A. Jennings. 1970. The evaluation of eigenvalues and eigenvectors of real symmetric matrices by simultaneous iteration. Computer Journal 13: 76–80.

Gabriel, K. R. 1971. The biplot graphical display of matrices with applications to principal component analysis. Biometrika 58: 453–467.

Goodall, D. W. 1954. Objective methods for the classification of vegetation. III. An essay in the use of factor analysis. Australian Journal of Botany 2: 304–324.

Hill, M. O. 1973. Reciprocal averaging: an eigenvector method of ordination. Journal of Ecology 61: 237–249.

Legendre, P. 2005. Species associations: the Kendall coefficient of concordance revisited. Journal of Agricultural, Biological, and Environmental Statistics 10: 226–245.

Legendre, P. & E. D. Gallagher. 2001. Ecologically meaningful transformations for ordination of species data. Oecologia 129: 271–280.

Legendre, P. & L. Legendre. 2012. Numerical ecology, 3rd English edition. Elsevier Science BV, Amsterdam. xvi + 990 pp. ISBN-13: 978-0444538680.

ter Braak, C. J. F. & P. Smilauer. 2002. CANOCO reference manual and CanoDraw for Windows user’s guide – Software for canonical community ordination (version 4.5). Microcomputer Power, Ithaca, New York. 500 pp.

Page 61

End of section