Post on 20-Oct-2019
Multivariate Ordination Analyses: Principal Component Analysis
Dilys Vela
Tatiana Boza
Multivariate Analyses
A multivariate data set includes more than one variable recorded from a number of replicate sampling or experimental units, sometimes referred to as objects.
If these objects are organisms, the variables might be morphological or physiological measurements.
If the objects are ecological sampling units, the variables might be physicochemical measurements or species abundances.
What are ordination analyses?
Ordination is the arrangement of items along a scale (axis) or multiple axes. The purpose of ordination is to summarize complex relationships graphically, extracting one or a few dominant patterns from an infinite number of possible patterns.
The placement of variables along an axis is possible because the ordination is based on the correlations among the variables.
What do ordination analyses help us to see?
Select the most important variables from multiple variables imagined or hypothesized.
Reveal unforeseen patterns and suggest unforeseen processes.
What type of question can we answer with ordination analysis?
In ecology, to seek and describe pattern of process.
In community ecology, to describe the strongest patterns in species composition.
In systematics, to recognize and to define species boundaries.
[Diagram: classification of multivariate analyses]
Multivariate Analysis: Ordination Analysis; Classification (or Clustering Analysis)
Ordination Analysis: Direct Gradient Analysis; Indirect Gradient Analysis
Methods shown: Linear Regression (few species); Correspondence Analysis (CA) (many species); Detrended CA (DCA); Canonical CA (CCA); Redundancy Analysis (RDA); Principal Coordinate Analysis (PCoA); Non-metric Multidimensional Scaling (NMDS); Principal Components Analysis (PCA)
The diagram also separates methods by whether distance values or raw data are available.
Principal Components Analysis
Principal component analysis (PCA) is a statistical technique that has been specifically developed to address data reduction.
In general terms, the major aim of PCA is to reduce the complexity of the interrelationships among a potentially large number of observed variables to a relatively small number of linear combinations of them, which are referred to as principal components.
Principal components analysis finds a set of orthogonal standardized linear combinations which together explain all of the variation in the original data.
What are the assumptions of PCA?
PCA assumes linear relationships among variables: the cloud of points in p-dimensional space has linear dimensions that can be effectively summarized by the principal axes.
If the structure in the data is NONLINEAR (the cloud of points twists and curves its way through p-dimensional space), the principal axes will not be an efficient and informative summary of the data.
Considerations before running a PCA
Normal Distributions
Data Outliers
Transformations
Standardization
Data Matrix
Normal Distributions
• When using PCA, data normality is not essential. However, the method is based on the correlation or covariance matrix, which is strongly affected by non-normally distributed data and the presence of outliers.
Data outliers
• Extreme values as well as outliers can have a severe influence on PCA, since it is based on the correlation or covariance matrix (Pison et al., 2003).
• Outliers should thus be removed prior to the statistical analysis, or statistical methods able to handle outliers should be employed, and the influence of extreme values needs to be reduced (e.g., via a suitable transformation).
Transformations
Transformations change the scale of measurement of the data, usually to help meet the normality assumption of parametric analyses and the homogeneity of variance assumption of most of these analyses.
Transformations are particularly important for multivariate procedures based on eigenanalysis (e.g. principal components analysis) because covariances and correlations measure linear relationships between variables.
Transformations that improve linearity will increase the efficiency with which the eigenanalysis extracts the eigenvectors.
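The effect of a transformation on linearity can be illustrated with a small sketch. The data below are hypothetical (an exponential relationship with multiplicative noise, not the Calophyllum measurements), and the log transform is only one possible choice; the point is that the Pearson correlation, which PCA relies on, measures only the linear part of a relationship.

```python
import numpy as np

# Hypothetical data: y grows exponentially with x, with multiplicative noise.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 5.0, 100)
y = np.exp(x) * np.exp(rng.normal(0.0, 0.1, size=x.size))


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


r_raw = pearson_r(x, y)          # linear association on the original scale
r_log = pearson_r(x, np.log(y))  # linear association after a log transform

print(f"r (raw) = {r_raw:.3f}, r (log-transformed) = {r_log:.3f}")
```

In this constructed case the log-transformed variable is more strongly linearly related to x than the raw variable is, so an eigenanalysis of the correlation matrix would capture more of the shared structure.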
Standardization
The first stage in rotating the data cloud is to standardize the data by subtracting the mean and dividing by the standard deviation.
It may be argued that we should not divide by the standard deviation. By standardizing, we are giving all species the same variation, i.e. a standard deviation of 1.
Data Matrix
We actually can have it both ways: a PCA without dividing by the standard deviation is an analysis of the covariance matrix. A PCA in which you do divide by the standard deviation is an analysis of the correlation matrix.
When using species/variables measured in different units, you must use a correlation matrix.
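A minimal sketch of why the units matter, using hypothetical data (two correlated variables, one recorded in cm and one in mm; the names `length_cm` and `width_mm` are illustrative only). It compares the first principal component obtained from the covariance matrix (centering only) with the one obtained from the correlation matrix (centering and dividing by the standard deviation).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
size = rng.normal(0.0, 1.0, n)  # hypothetical shared "size" factor

# Two hypothetical variables recorded in different units.
length_cm = 10.0 + 1.0 * size + rng.normal(0.0, 0.5, n)
width_mm = 10.0 * (5.0 + 0.8 * size + rng.normal(0.0, 0.5, n))
X = np.column_stack([length_cm, width_mm])


def first_pc(matrix):
    """Eigenvector with the largest eigenvalue of a symmetric matrix."""
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)  # ascending order
    return eigenvectors[:, np.argmax(eigenvalues)]


# Covariance matrix (S): variables are centered only.
S = np.cov(X, rowvar=False)

# Correlation matrix (R): variables are centered and divided by their SD.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.cov(Z, rowvar=False)

pc1_S = first_pc(S)
pc1_R = first_pc(R)
print("PC1 loadings, covariance matrix: ", np.round(np.abs(pc1_S), 3))
print("PC1 loadings, correlation matrix:", np.round(np.abs(pc1_R), 3))
```

With the covariance matrix, the variable recorded in the smaller unit (and therefore with the larger variance) dominates PC1; with the correlation matrix both variables contribute equally, which for two variables is always the case.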
Look at the descriptors:
Homogeneous nature? All the same kind? Same units? Same order of magnitude? -> S matrix (covariance)
Heterogeneous nature? Different kinds? Different units? Different orders of magnitude? -> R matrix (correlation)
Correlation matrix
Advantages: The results of analyses for different sets of random variables are more directly comparable.
Disadvantages: There are considerable differences in the standard deviations, caused mainly by differences in scale. None of the correlations is particularly large in absolute value. PCs have moderate-sized coefficients for several of the variables. PCs give coefficients for standardized variables and are therefore less easy to interpret directly.
Covariance matrix
Advantages: PCs for the covariance matrix are each dominated by a single variable. The variances and total variance are more meaningful indices for measuring variability in data sets that are symmetric.
Disadvantages: The PCs are sensitive to the units of measurement used for each element of the variables. If there are large differences between the variances of the elements of the variables, then those variables whose variances are largest will tend to dominate the first few PCs.
Eigenvalues & Eigenvectors
The eigenvectors are the loadings of the principal components spanning the new PCA coordinate system.
The amount of variability contained in each principal component is expressed by the eigenvalues, which are simply the variances of the scores.
PCA searches for the direction in the multivariate space that contains the maximum variability. This is the direction of the first principal component (PC1). The second principal component (PC2) has to be orthogonal (perpendicular) to PC1 and will contain the maximum amount of the remaining data variability. Subsequent principal components are found by the same principle.
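The relationships described above can be checked numerically. The following is a sketch, not the procedure used for the data set in these slides: the function name `pca` and the simulated data are assumptions made for illustration. It computes PCA by eigendecomposition of the covariance (or correlation) matrix with NumPy.

```python
import numpy as np


def pca(X, standardize=False):
    """PCA by eigendecomposition of the covariance (or correlation) matrix.

    Rows of X are objects, columns are variables.
    Returns eigenvalues (descending), eigenvectors (as columns), and scores.
    """
    Xc = X - X.mean(axis=0)
    if standardize:
        Xc = Xc / X.std(axis=0, ddof=1)
    C = np.cov(Xc, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(C)  # ascending order
    order = np.argsort(eigenvalues)[::-1]          # largest variance first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    scores = Xc @ eigenvectors                     # objects in PC coordinates
    return eigenvalues, eigenvectors, scores


# Hypothetical data: 100 objects, 3 correlated variables.
rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 1))
X = latent @ np.array([[2.0, 1.0, 0.5]]) + rng.normal(scale=0.5, size=(100, 3))

eigenvalues, eigenvectors, scores = pca(X)
explained = eigenvalues / eigenvalues.sum()
print("eigenvalues:", np.round(eigenvalues, 3))
print("proportion of variance:", np.round(explained, 3))
```

In this sketch the variance of each column of `scores` equals the corresponding eigenvalue, the eigenvectors are mutually orthogonal, and the eigenvalues sum to the total variance of the original variables.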
Biplots
A biplot is a visualization tool to present the results of PCA. The PCA biplot is called the scaling process.
The loadings (arrows) represent the elements. The lengths of the arrows in the plot are directly proportional to the variability included in the two components (PC1 and PC2) displayed, and the angle between any two arrows is a measure of the correlation between those variables.
Misconceptions
PCA cannot cope with missing values (but neither can most other statistical methods).
It does not require normality.
It is not a hypothesis test.
There are no clear distinctions between response variables and explanatory variables.
When should PCA be used?
In community ecology, PCA is useful for summarizing variables whose relationships are approximately linear or at least monotonic.
e.g. A PCA of many soil properties might be used to extract a few components that summarize main dimensions of soil variation
PCA is generally NOT useful for ordinating community data. Why? Because relationships among species are highly nonlinear.
Community trends along environmental gradients appear as “horseshoes” in PCA ordinations.
None of the PC axes effectively summarizes the trend in species composition along the gradient.
[Figure: PCA ordination, Axis 1 vs. Axis 2; panel title: Beta Diversity 2R - Covariance]
The “Horseshoe” Effect
Curvature of the gradient and the degree of infolding of the extremes increase with beta diversity.
PCA ordinations are not useful summaries of community data except when beta diversity is very low.
Using correlation generally does better than covariance. This is because standardization by species improves the correlation between Euclidean distance and environmental distance.
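The arch can be reproduced with a simulated gradient. This is a sketch under stated assumptions: the sites, species optima, and tolerance below are hypothetical, and every species is given a Gaussian response curve of equal width along a single gradient. PCA is run on the covariance matrix (species centered, not standardized).

```python
import numpy as np

# Hypothetical coenocline: 41 sites evenly spaced along one gradient,
# 41 species with Gaussian responses and evenly spaced optima.
gradient = np.linspace(0.0, 100.0, 41)   # site positions on the gradient
optima = np.linspace(-50.0, 150.0, 41)   # species optima
tolerance = 20.0                         # response width (assumed value)

# Abundance of each species (column) at each site (row).
Y = np.exp(-((gradient[:, None] - optima[None, :]) ** 2) / (2.0 * tolerance**2))

# PCA on the covariance matrix: center each species, then take the SVD.
Yc = Y - Y.mean(axis=0)
U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
scores = U * s  # site scores on the principal axes


def abs_corr(a, b):
    """Absolute Pearson correlation between two 1-D arrays."""
    return abs(np.corrcoef(a, b)[0, 1])


# Squared distance from mid-gradient: a simple template for an arch.
arch = (gradient - gradient.mean()) ** 2

print("|r| PC1 vs gradient:", round(abs_corr(scores[:, 0], gradient), 3))
print("|r| PC2 vs gradient:", round(abs_corr(scores[:, 1], gradient), 3))
print("|r| PC2 vs arch template:", round(abs_corr(scores[:, 1], arch), 3))
```

Although only one gradient was simulated, the second axis is not noise: it is a curved function of the first, which is what produces the horseshoe when Axis 1 is plotted against Axis 2. Reducing `tolerance` relative to the gradient length raises beta diversity and should strengthen the curvature.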
What if there’s more than one underlying ecological gradient?
When there are two or more underlying gradients with high beta diversity, a “horseshoe” is usually not detectable.
Interpretation problems are more severe.
Data Set
Morphological and anatomical variation of Calophyllum L. (Calophyllaceae) in South America.
D. Vela
Kielmeyeroideae
Calophylleae: Calophyllum, Neotatea, Marila, Mahurea, Clusiella, Kielmeyera, Caraipa, Haploclathra, Poeciloneuron, Mesua, Kayea, Mammea
Endodesmieae: Endodesmia, Lebrunia
[Image labels: Caraipa; Kayea; Calophyllum]
Stevens, 2006
Wurdack & Davis (2009)
Distribution of Calophyllaceae
[Map labels: 144 species; 259 species; 10 species; 176 species]
Stevens, 2006
http://www.mobot.org/MOBOT/research/APweb/
www.wikimedia.org
[Image labels: vein; resin canal]
http://www.botany.hawaii.edu/faculty/carr/images/cal_ino.jpg
http://pakuwon.wordpress.com/2009/02/13/nyamplung‐calophyllum‐inophyllum/
http://www.flickr.com/photos/mauroguanandi
Calophyllum brasiliense
• There is infraspecific variation in tepal number between individuals of the same species, and between flowers from the same inflorescence.
Stevens (1974, 1980)
Calophyllum brasiliense
http://www.nationaalherbarium.nl/sungaiwain/
Calophyllum pisiferum
Calophyllum lanigerum
1. Main objective
1.A To distinguish species limits of Calophyllum in South America.
2. Specific objectives
2.A To analyze morphological and anatomical variation.
Data collection for morphological observations
Herbarium and personal collections.
Collection sort: qualitative characteristics (Systematic Association Committee for Descriptive Biological Terminology, cited by Stearn 2006).
Measurement: ruler and a digital caliper.
Excel data matrix: specimen collections in rows and variables in columns.
Leaf characters: Petiole length, mm (PTL); Leaf length, cm (LL); Leaf length at widest part, cm (LWWP); Leaf width, cm (LW); Apex length, mm (PL); Midrib width at abaxial side, mm (MW); Vein angle, degrees (VA); Venation density (VD)
Flower characters: Pedicel length, mm (PDL); Perianth width, mm (PW); Perianth length, mm (PRL); Anther length, mm (AL); Anther width, mm (AW); Stamen length, mm (STL); Filament length, mm (FL); Style length, mm (STYL); Gynoecium length, mm (GL); Ovary length, mm (OL); Stigma width, mm (SL)
Fruit characters: External fruit length, mm (FrLEx); External fruit width, mm (FrWEx); Internal fruit length, mm (FrLIn); Internal fruit width, mm (FrWIn); Stigma remained, mm (StygR); Basal discoloration, mm (BsDis); Stone, mm (Stn); Corky, mm (CRK)
REFERENCES
Claude, Julien. 2008. Morphometrics with R. Springer.
Gotelli, Nicholas J., and Aaron M. Ellison. 2004. A primer of ecological statistics. Sinauer Associates Publishers.
Jolliffe, I. T. 2002. Principal component analysis. Springer.
Legendre, Pierre, and Louis Legendre. 1998. Numerical ecology. Elsevier.
Quinn, Gerald Peter, and Michael J. Keough. 2002. Experimental design and data analysis for biologists. Cambridge University Press.