Step 5: Conduct Analysis The CCA Algorithm
1
Model Parameterization:
Step 5: Conduct Analysis
P Dropped species with fewer than 5 occurrences
P Log-transformed species abundances
P Row-normalized species’ log abundances (chord distance)
P Selected the smallest “best” subset of variables from each set of independent variables
P Chose RDA (in combination with chord distance)
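The first three preparation steps above can be sketched as follows. This is a minimal illustration, not the analysis actually run for these slides: the function name `preprocess`, the use of log(x + 1) as the log transform, and the simulated counts are all assumptions.

```python
import numpy as np

def preprocess(abund, min_occurrences=5):
    """Drop rare species, log-transform, then row-normalize (chord transformation)."""
    abund = np.asarray(abund, dtype=float)
    # Keep species (columns) that occur in at least `min_occurrences` samples.
    keep = (abund > 0).sum(axis=0) >= min_occurrences
    # log(x + 1) is an assumed choice; the slides only say "log-transformed".
    logged = np.log1p(abund[:, keep])
    # Divide each row by its Euclidean norm; Euclidean distance between the
    # normalized rows is then the chord distance.
    norms = np.linalg.norm(logged, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave empty samples as all zeros
    return logged / norms, keep

# Hypothetical data: 20 samples x 6 species; the last species occurs only 3 times.
rng = np.random.default_rng(0)
counts = rng.poisson(3.0, size=(20, 6))
counts[:, 5] = 0
counts[:3, 5] = 4
Y, kept = preprocess(counts)
```

Here `kept` marks which species survived the occurrence filter, so the rare sixth species is dropped before the transformation.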
2
The CCA Algorithm
P Gradients are assumed to be linear combinations of the environmental variables.
P Key assumption is that of unimodal species distributions along environmental gradients.
CCA estimates species optima, regression coefficients, and site scores using a Gaussian weighted-averaging model combined with regression.
1. Assign arbitrary initial scores (weights) to samples.
2. Compute species scores as weighted averages of the sample scores.
3. Compute sample scores as weighted averages of the species scores.
4. Regress sample scores on environmental variables and assign new sample scores as the predicted values from the regression.
5. Rescale (standardize) sample scores.
6. Repeat steps 2-5 until the scores stabilize. Subsequent axes are computed from residuals.
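The iterative procedure on this slide can be sketched for the first axis only. This is a minimal sketch under stated assumptions, not the code of any particular package: the function name `cca_first_axis`, the use of site totals as regression weights, and the simulated Gaussian response data are all illustrative choices. Extraction of later axes from residuals is omitted.

```python
import numpy as np

def cca_first_axis(Y, X, n_iter=1000, tol=1e-10, seed=0):
    """First CCA axis by weighted averaging combined with weighted regression."""
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    r = Y.sum(axis=1)                  # site totals
    c = Y.sum(axis=0)                  # species totals
    w = r / r.sum()                    # site weights used in the regression
    Z = np.column_stack([np.ones(Y.shape[0]), X])   # intercept + environment
    sw = np.sqrt(w)

    def standardize(v):
        v = v - np.sum(w * v)                  # weighted mean 0
        sd = np.sqrt(np.sum(w * v ** 2))       # weighted standard deviation
        return v / sd, sd

    # Arbitrary initial sample scores.
    x, _ = standardize(np.random.default_rng(seed).normal(size=Y.shape[0]))
    eig = 0.0
    for _ in range(n_iter):
        u = (Y.T @ x) / c              # species scores: WA of sample scores
        wa = (Y @ u) / r               # sample scores: WA of species scores
        # Weighted regression of WA scores on the environmental variables;
        # the fitted values become the new (LC) sample scores.
        beta, *_ = np.linalg.lstsq(Z * sw[:, None], wa * sw, rcond=None)
        lc = Z @ beta
        # Rescale; the shrinkage removed here estimates the eigenvalue.
        x_new, eig = standardize(lc)
        converged = np.max(np.abs(x_new - x)) < tol
        x = x_new
        if converged:
            break
    return eig, u, wa, x

# Hypothetical data: 8 species with Gaussian responses along one gradient.
gradient = np.linspace(0.0, 10.0, 25)
optima = np.linspace(1.0, 9.0, 8)
Y = 10.0 * np.exp(-0.5 * (gradient[:, None] - optima[None, :]) ** 2)
rng = np.random.default_rng(1)
X = np.column_stack([gradient + rng.normal(scale=0.5, size=25),
                     rng.normal(size=25)])
eig, species_scores, wa_scores, lc_scores = cca_first_axis(Y, X)
```

The returned `lc_scores` are the standardized fitted values (LC scores), and `wa_scores` are the weighted averages of the species scores from the final pass.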
3
Step 6: Interpret Results
P Effectiveness of the Constrained Ordination
< Total inertia, eigenvalues, percent variance explained, significance tests, species-environment correlation, cumulative species/sample fit.
P Interpreting the Constrained Ordination
< Canonical coefficients, inter-set and intra-set correlations, biplot scores.
< The Triplot: sample vs species scaling, centroid rule vs biplot rule, general interpretation.
< Interpreting species-environment relationships.
4
Effectiveness of the Constrained Ordination
Total inertia
P Sum of the eigenvalues; total “variance” in the species data, some of which is accounted for by the environmental constraints.
Eigenvalues
P Variance in the community matrix attributed to a particular axis; measure of importance of the ordination axis.
5
Percent “variance” of species data
P Proportion of species variance (total inertia) explained by each axis. Computed as eigenvalue/total inertia (reported as cumulative %).
P For ecological data the percentage of explained variance is usually low, often ~10%. This is not a cause for concern, as it is an inherent feature of data with a strong presence/absence aspect.
Effectiveness of the Constrained Ordination
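As a small worked example of the percent-variance calculation, the snippet below divides each canonical eigenvalue by the total inertia and accumulates the result. The eigenvalues and the total inertia are made-up numbers chosen only for illustration.

```python
import numpy as np

# Hypothetical values: three canonical eigenvalues and the total inertia
# (the sum of ALL eigenvalues, constrained and unconstrained).
eigenvalues = np.array([0.12, 0.06, 0.02])
total_inertia = 2.0

proportion = eigenvalues / total_inertia          # per-axis proportion explained
cumulative_pct = 100.0 * np.cumsum(proportion)    # reported as cumulative %
```

With these numbers the three axes together account for about 10% of the total inertia, which is in the range the slide describes as typical.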
6
Significance Tests
P Monte Carlo global permutation tests of significance of canonical axes.
P Test based on 1st canonical axis has maximum power against the alternative hypothesis that there is a single dominating gradient that determines the relation between the species and environment.
Effectiveness of the Constrained Ordination
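A global permutation test of this kind can be sketched for the linear (RDA) case. The sketch below is an illustration under assumptions, not the procedure of any specific program: it uses the fraction of variance explained by all constraints as the test statistic (rather than the first-axis eigenvalue), permutes whole rows of the environmental matrix, and uses simulated data. The function names are hypothetical.

```python
import numpy as np

def constrained_fraction(Y, X):
    """Fraction of total variance in Y explained by least-squares regression on X."""
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    coef, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    fitted = Xc @ coef
    return np.sum(fitted ** 2) / np.sum(Yc ** 2)

def permutation_test(Y, X, n_perm=199, seed=0):
    """Monte Carlo p-value: how often a random re-pairing of sites does as well."""
    rng = np.random.default_rng(seed)
    observed = constrained_fraction(Y, X)
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(Y.shape[0])      # shuffle environmental rows
        if constrained_fraction(Y, X[perm]) >= observed:
            hits += 1
    # The observed value counts as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical data with a strong species-environment relationship.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
Y = X @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(40, 5))
observed, p_value = permutation_test(Y, X)
```

Because the total variance is unchanged by permutation, ranking permutations by the explained fraction gives the same p-value as ranking them by a pseudo-F ratio.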
7
Scores for Sites and Species
P Ordination scores (coordinates on ordination axes) given for each site (sample) and each species. Two sets of scores are produced for sites (samples):
< Linear combination (LC) scores – linear combinations of the independent variables. LC scores are in environmental space, each axis formed by linear combinations of environmental variables.
< Weighted averages (WA) scores – weighted averages (sums) of the species scores that are as similar to LC scores as possible. WA scores are in species space.
LC scores show where the site should be; the WA scores show where the site is.
Effectiveness of the Constrained Ordination
8
LC scores (samples) WA scores (samples)
Effectiveness of the Constrained Ordination
9
LC versus WA Scores
P Longstanding debate (confusion) over whether to use LC or WA scores.
P LC scores are very sensitive to noise in the environmental data, whereas the WA scores are much more stable, being largely determined by the species data rather than being a direct projection in environmental space.
“Use of LC scores will misrepresent the observed community relationships unless the environmental data are both noiseless and meaningful.... CCA (RDA) with LC scores is inappropriate where the objective is to describe community structure.” (McCune)
Effectiveness of the Constrained Ordination
10
Species-environment correlations
P The multiple correlation coefficient of the final regression.
P Correlation between site scores computed as weighted averages of species scores (WA’s) and site scores computed as linear combinations of environmental variables (LC’s); measures how well the extracted variation in community composition can be explained by environmental variables.
“This is a bad measure of goodness of ordination, because it is sensitive to extreme scores (like correlations are), and very sensitive to overfitting or using too many constraints.” (Oksanen)
Effectiveness of the Constrained Ordination
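The two kinds of site scores and their per-axis correlation can be illustrated for the linear (RDA) case. This is a sketch under assumptions: RDA is computed here as a multivariate regression followed by an SVD of the fitted values, the function names are hypothetical, and the data are simulated.

```python
import numpy as np

def rda_site_scores(Y, X, n_axes=2):
    """Return LC and WA site scores for the first `n_axes` RDA axes."""
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    coef, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    fitted = Xc @ coef                          # part of Y explained by X
    _, _, vt = np.linalg.svd(fitted, full_matrices=False)
    species = vt[:n_axes].T                     # species loadings per axis
    lc = fitted @ species                       # linear combinations of X
    wa = Yc @ species                           # weighted sums of species data
    return lc, wa

def species_env_correlation(lc, wa):
    """Correlation between WA and LC scores, one value per axis."""
    return np.array([np.corrcoef(lc[:, k], wa[:, k])[0, 1]
                     for k in range(lc.shape[1])])

# Hypothetical data: 30 sites, 3 environmental variables, 6 species.
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3))
Y = X @ rng.normal(size=(3, 6)) + rng.normal(size=(30, 6))
lc, wa = rda_site_scores(Y, X)
r_se = species_env_correlation(lc, wa)
```

Adding more columns to `X` can only hold or raise these correlations, which is one way to see the overfitting concern quoted above.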
11
Goodness-of-Fit for Species/Samples
P Ordination diagnostic to find out which species/samples are ill-represented in the ordination.
< % variance (inertia) explained by selected axes.
Plot species with >10% explained
Effectiveness of the Constrained Ordination
12
Interpreting the Constrained Ordination
Canonical Coefficients
P Regression coefficients of the linear combinations of environmental variables that define the ordination axes; they are eigenvector coefficients and not very useful for interpretation.
Unstandardized coefficients
Standardized coefficients
13
IntRA-set Correlation (structure) Coefficients
P Correlations between sample scores derived as linear combinations of environmental variables (LC scores) and environmental variables (i.e., loadings); they are related to the canonical coefficients but differ in several important aspects.
Interpreting the Constrained Ordination
14
P Both canonical coefficients and IntRA-set correlations relate to the rate of change in community composition per unit change in the corresponding environmental variable – but for canonical coefficients it is assumed that other environmental variables are being held constant, while for IntRA-set correlations other environmental variables are assumed to covary as they do in the data set.
P When variables are intercorrelated, canonical coefficients become unstable (the multicollinearity problem); however, IntRA-set correlations are not affected.
IntRA-set Correlation (structure) Coefficients
Interpreting the Constrained Ordination
15
P Canonical coefficients and IntRA-set correlations indicate which environmental variables are more influential in structuring the ordination, but cannot be viewed as an independent measure of the strength of relationship between communities and environmental variables.
P Probably only meaningful if the site scores are derived as LC scores (which is generally not recommended).
“They have all the problems of correlations, like being sensitive to extreme values, and they focus on the relationship between single constraints and single axes instead of multivariate analysis.” (Oksanen)
IntRA-set Correlation (structure) Coefficients
Interpreting the Constrained Ordination
16
P Correlations between species-derived sample scores (WA scores) and environmental variables.
< Should be higher than correlations derived from unconstrained ordination.
< Generally lower than IntRA-set correlations.
< Equal to the IntRA-set correlation × the species-environment correlation.
IntER-set Correlation (structure) Coefficients
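The product relationship between the two kinds of structure coefficients can be checked numerically for the linear (RDA) case. This is an illustrative sketch on simulated data, with RDA computed as regression plus SVD of the fitted values (an assumed implementation); all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))                                  # hypothetical environment
Y = X @ rng.normal(size=(2, 4)) + rng.normal(size=(30, 4))    # hypothetical species data

Yc, Xc = Y - Y.mean(axis=0), X - X.mean(axis=0)
coef, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
fitted = Xc @ coef
axis1 = np.linalg.svd(fitted, full_matrices=False)[2][0]      # first-axis species loadings
lc = fitted @ axis1                                           # LC site scores
wa = Yc @ axis1                                               # WA site scores

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

species_env = corr(wa, lc)
intra = np.array([corr(lc, Xc[:, j]) for j in range(Xc.shape[1])])   # IntRA-set
inter = np.array([corr(wa, Xc[:, j]) for j in range(Xc.shape[1])])   # IntER-set

# IntER-set = IntRA-set x species-environment correlation (first axis).
assert np.allclose(inter, intra * species_env)
```

Since the species-environment correlation is at most 1, the IntER-set correlations are never larger in magnitude than the IntRA-set correlations in this setting.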
Interpreting the Constrained Ordination
17
P Like IntRA-set correlations, they do not become unstable when the environmental variables are strongly correlated with each other (multicollinear).
P Probably only meaningful if the site scores are derived as WA scores (which is generally recommended).
IntER-set Correlation (structure) Coefficients
“They have all the problems of correlations, like being sensitive to extreme values, and they focus on the relationship between single constraints and single axes instead of multivariate analysis.” (Oksanen)
Interpreting the Constrained Ordination
18
P Environmental variables are typically represented as vectors radiating from the centroid of the ordination.
P Biplot scores give the coordinates of the heads of the environmental vectors, and are based on the IntRA-set correlations.
Biplot Scores for Environmental Variables
Interpreting the Constrained Ordination
19
The Canonical Triplot
P The triplot displays the major patterns in the species data with respect to the environmental variables.
Tri = (1) Samples, (2) Species, (3) Environment
P But there are a number of ways to scale the triplot, which affect the exact quantitative interpretation of the relationships shown.
20
P Optimally displays inter-sample relationships.
P Variance of sample scores on each axis reflects the importance of the axis as measured by the eigenvalue, whereas the variance of the species scores is equal along all axes.
Sample Scaling
The Canonical Triplot
21
P Optimally displays inter-species relationships.
P Variance of species scores on each axis reflects the importance of the axis as measured by the eigenvalue, whereas the variance of the sample scores is equal along all axes.
Species Scaling
Generally the preferred scaling
The Canonical Triplot
22
P Way of interpreting ordination diagrams in unimodal methods (CCA).
P Each species' point is at the centroid (weighted average) of the sample points where it occurs (i.e., center of its niche); the samples that contain a particular species are scattered around that species' point in the diagram.
The Centroid Principle
Species-Hill’s Scaling
The Canonical Triplot
23
The Biplot Rule
Species-Biplot Scaling
P Way of interpreting ordination diagrams in linear methods (RDA).
P An arrow drawn from the origin through a species point points in the direction of maximum change in abundance of that species.
P The order of sites projected onto the arrow gives the inferred ranking of the relative abundance of the species across sites.
The Canonical Triplot
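The biplot rule amounts to projecting site points onto a species arrow and reading off their order. The sketch below uses made-up two-dimensional coordinates purely for illustration; the site labels and scores are hypothetical.

```python
import numpy as np

# Hypothetical ordination coordinates on two axes.
site_names = ["A", "B", "C", "D"]
site_scores = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [-1.0, -1.0],
                        [2.0, 1.0]])
species_arrow = np.array([2.0, 1.0])     # from the origin to the species point

# Signed length of each site's perpendicular projection onto the arrow.
unit = species_arrow / np.linalg.norm(species_arrow)
projection = site_scores @ unit

# Sites ordered from highest to lowest inferred abundance of this species.
order = np.argsort(-projection)
ranking = [site_names[i] for i in order]   # ['D', 'A', 'B', 'C']
```

Sites with negative projections, such as C here, fall on the opposite side of the origin from the arrowhead and are inferred to have below-average abundance of the species.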
24
The Canonical Triplot
P Sample locations indicate their compositional similarity to each other.
P Samples tend to be dominated by the species that are located near them (or projected toward them) in ordination space.
General Interpretation – regardless of scaling
25
The Canonical Triplot
P Species locations indicate their distributional similarity to each other.
P Species tend to be most present and abundant in the sites that are located near them (or in the direction of their projection) in ordination space.
General Interpretation – regardless of scaling
26
The Canonical Triplot
P Length of environmental vector indicates its importance to the ordination.
P Direction of the vector indicates its correlation with each of the axes.
P Angles between vectors indicate the correlation between the environmental variables themselves.
General Interpretation – regardless of scaling
27
The Canonical Triplot
P In the unimodal model (CCA), perpendiculars drawn from species to an environmental arrow give the approximate ranking of species responses to that variable, and show whether a species has a higher-than-average or lower-than-average optimum on that environmental variable.
Higher-than-average
Lower-than-average
General Interpretation – regardless of scaling
28
The Canonical Triplot
P In the linear model (RDA), the directions of the arrow drawn from the origin to a species and of the environmental arrow give the approximate relationship between the species and the environmental variable; i.e., whether the species has a positive or negative relationship with that environmental variable.
Positive correlation
Negative correlation
General Interpretation – regardless of scaling
29
The Canonical Triplot
P Perpendiculars drawn from samples to environmental arrows give the approximate ranking of sample values for that environmental variable.
P They also show whether a sample has a higher-than-average or lower-than-average value on that environmental variable.
Higher-than-average
Lower-than-average
General Interpretation – regardless of scaling
30
P It is more natural to represent nominal environmental variables by points instead of arrows.
P The points are located at the centroid of sites that belong to that class.
P Classes containing sites with high values for a species will tend to lie close to that species point.
Nominal Environmental Variables
The Canonical Triplot
31
Don’t worry; focus on relationships
The Canonical Triplot
Samples-scaling Species-scaling
32
Major Differences Between RDA and CCA
CCA/dCCA
P Chi-square distance among samples preserved (approximates unimodal species responses).
P Rare species are given disproportionate weight in the analysis.
P Unequal weighting of sites in the regression (equal to site totals).
RDA
P Euclidean distance among samples preserved (appropriate with linear species responses and can approximate unimodal responses with appropriate data transformations).
P Rare species are not given high weight in the analysis.
P Equal weighting of sites in the regression.
33
Major Differences Between RDA and CCA
CCA/dCCA
P Species-environment correlation equals the correlation between the site scores that are weighted averages of the species scores (WA’s) and the site scores that are a linear combination of the environmental variables (LC’s).
P Ordination diagram can be interpreted using the centroid principle.
RDA
P Species-environment correlation equals the correlation between the site scores that are weighted sums of the species scores (WA’s) and the site scores that are a linear combination of the environmental variables (LC’s).
P Ordination diagram can be interpreted using the biplot rule.
34
Usefulness of Constrained Ordination
Displaying Interpretable Relationships
P Constrained ordination (as with any ordination) is ultimately about portraying meaningful species-environment relationships.
37
Multivariate Regression Trees (MRT)
P Nonparametric procedure useful for exploration, description, and prediction of species-environment relationships (De’ath 2002).
P Natural extension of Univariate Regression Trees (URT), with the univariate response of URT being replaced by a multivariate response.
P Recursive partitioning of the data space such that the ‘populations’ within each partition become more and more homogeneous.
38
Important Characteristics of MRT
P Makes no assumption about the form of the relationships between species and their environment.
P Form of multivariate regression in that the response is explained, and can be predicted, by the explanatory variables.
P Method of constrained cluster analysis, because it determines clusters that are similar in a chosen measure of species dissimilarity (e.g., assemblage type), with each cluster being defined by a set of environmental values (e.g., habitat type).
39
Univariate Regression Trees (URT)
1. At each node, the tree algorithm searches through the variables one by one, beginning with x1 and continuing up to xM.
2. For each variable it finds the best split (the one that minimizes the total sums of squares or sums of absolute deviations about the median).
3. Then it compares the M best single-variable splits and selects the best of the best.
4. Recursively partition each node until an over-large tree is grown, then prune it back to the desired size (usually based on cross-validated predicted mean square error).
5. Describe each terminal node (mean, median, variance) and the overall fit of the tree (% variance explained).
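Steps 1-3 above (the search for the best split at a single node) can be sketched as follows. This is a minimal sketch assuming the sums-of-squares criterion and midpoint thresholds between adjacent observed values; the function names and the toy data are hypothetical, and recursion and pruning (step 4) are omitted.

```python
import numpy as np

def sse(y):
    """Sum of squared deviations about the mean (0 for an empty node)."""
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def best_split(X, y):
    """Return (variable index, threshold, total SSE) of the best binary split."""
    best_var, best_thr, best_cost = None, None, np.inf
    for j in range(X.shape[1]):                     # step 1: each variable in turn
        values = np.unique(X[:, j])
        for lo, hi in zip(values[:-1], values[1:]):
            thr = (lo + hi) / 2.0                   # candidate threshold
            left = X[:, j] <= thr
            cost = sse(y[left]) + sse(y[~left])     # step 2: within-node SSE
            if cost < best_cost:                    # step 3: best of the best
                best_var, best_thr, best_cost = j, thr, cost
    return best_var, best_thr, best_cost

# Toy data: the response steps from 1 to 5 when the second variable passes 0.5.
X = np.array([[0.5, 0.1], [0.9, 0.2], [0.1, 0.3], [0.7, 0.4],
              [0.3, 0.6], [0.8, 0.7], [0.2, 0.8], [0.6, 0.9]])
y = np.array([1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0])
var, thr, cost = best_split(X, y)
```

In this toy case the search picks the second variable with a threshold midway between 0.4 and 0.6, which separates the two response levels completely.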
40
Multivariate Regression Trees (MRT)
P Redefine node impurity by summing the univariate impurity measure over the multivariate response.
< Minimize the sums of squared Euclidean distances (SSD) of sites about the node centroid.
P Additional ways to interpret the results of MRT:
< Identification of species that most strongly determine the splits.
< Tree biplots to represent group means and species and site information.
< Identification of species that best characterize the groups (ISA).
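The multivariate node impurity described on this slide can be sketched as below, along with a single-node split search that uses it. This is an illustration under assumptions: the function names and the small site-by-species table are hypothetical, and tree growing, pruning, and cross-validation are left out.

```python
import numpy as np

def node_impurity(Y):
    """Sum of squared Euclidean distances (SSD) of sites about the node centroid."""
    if Y.shape[0] == 0:
        return 0.0
    return float(np.sum((Y - Y.mean(axis=0)) ** 2))

def best_multivariate_split(X, Y):
    """Return (variable index, threshold, total SSD) of the best binary split."""
    best_var, best_thr, best_cost = None, None, np.inf
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for lo, hi in zip(values[:-1], values[1:]):
            thr = (lo + hi) / 2.0
            left = X[:, j] <= thr
            cost = node_impurity(Y[left]) + node_impurity(Y[~left])
            if cost < best_cost:
                best_var, best_thr, best_cost = j, thr, cost
    return best_var, best_thr, best_cost

# Hypothetical data: 6 sites, 1 environmental variable, 2 species.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
Y = np.array([[10.0, 0.0], [9.0, 1.0], [11.0, 0.0],
              [0.0, 10.0], [1.0, 9.0], [0.0, 11.0]])

# The multivariate impurity equals the sum of the per-species (univariate) SSEs.
per_species = sum(float(np.sum((Y[:, k] - Y[:, k].mean()) ** 2))
                  for k in range(Y.shape[1]))
var, thr, cost = best_multivariate_split(X, Y)
```

Here the best split falls between the third and fourth sites, where the assemblage switches from one dominant species to the other.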
41
Multivariate Regression Trees (MRT)
P Compare the MRT solution to unconstrained clustering using a metric equivalent to the MRT impurity and the same number of clusters.
P MRT can be extended to work directly from a dissimilarity matrix; clusters can be formed by splitting the data on environmental values that minimize the intersite SSD.
42
MRT vs CCA vs RDA
P Unlike CCA/RDA, MRT is not a method of direct gradient analysis, in the sense of locating sites along ecological gradients.
< MRT does locate sites (albeit grouped) in an ecological space defined by the environmental variables.
P Unlike CCA/RDA, the MRT solution space is nonlinear and includes interactions between the environmental variables.
P MRT is a divisive partitioning technique; CCA/RDA model continuous structure.
43
MRT vs CCA vs RDA
P MRT emphasizes local structure and interactions between environmental effects; CCA/RDA determine global structure.
P MRT does not assume particular relationships between species abundances and environmental characteristics; CCA and RDA assume unimodal and linear response models, respectively.
44
MRT vs CCA vs RDA
P MRT has outperformed CCA/RDA in both simulated and real ecological data sets in both explaining and predicting species composition.
P The advantage of MRT increases with the strength of interactions and the nonlinearity of relationships between species composition and the environmental variables.