Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA...

Introduction Distance Measures Formation of Groups (Clusters) in CA How to obtain the number of clusters? Further Analysis of the obtained clusters Useful R-Commands Cluster Analysis Janette Walde [email protected] Department of Statistics University of Innsbruck Janette Walde Cluster Analysis

Transcript of Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA...

Page 1: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Cluster Analysis

Janette Walde

[email protected]

Department of StatisticsUniversity of Innsbruck

Janette Walde Cluster Analysis

Page 2: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Outline I

1 IntroductionProblemsIdea of Cluster Analysis

2 Distance MeasuresGeneral CommentsProperties of Distance MeasuresDistance Measures for Interval Scale DataDistance (Similarity) Measures for Binary Variables

3 Formation of Groups (Clusters) in CAHierarchical Agglomerative CA Methods

Janette Walde Cluster Analysis

Page 3: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Outline IIPartitional Clustering: k-means Clustering

4 How to obtain the number of clusters?

5 Further Analysis of the obtained clusters

6 Useful R-Commands

Janette Walde Cluster Analysis

Page 4: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

ProblemsIdea of Cluster Analysis

Problems/Questions

Cluster analysis was developed in taxonomy. Theaim was originally to get away from the high degreeof subjectivity when single taxonomists performed agrouping.

Clustering is used to build groups of genes with relatedexpression patterns (co-expressed genes): ”In analyzingDNA microarray gene-expression data, a major role hasbeen played by various cluster-analysis techniques, mostnotably by hierarchical clustering, K-means clusteringand self-organizing maps. These clustering techniquescontribute significantly to our understanding of theunderlying biological phenomena.” Genome Biology2002, 3(2):research0009.1–0009.8.

Janette Walde Cluster Analysis

Page 5: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

ProblemsIdea of Cluster Analysis

Problems/Questions, cont.

In plant and animal ecology, clustering is used todescribe and to make spatial and temporal comparisonsof communities of organisms in heterogeneousenvironments; in plant systematics to generate artificialphylogenies or clusters of organisms at the species, genusor higher level that share a number of attributes.

CA might be used to classify regions based on vegetationcommunities or abundances of species.

We may find out various stakeholders regarding anational park via conducting a survey questioningfarmers, visitors, inhabitants of the municipalities ...

Janette Walde Cluster Analysis

Page 6: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

ProblemsIdea of Cluster Analysis

Idea of Cluster analysis

Clusters are formed numerically – on the basisof distance measures. Interpretation of clustersis the final step.

The resulting clusters should exhibit highinternal (within-cluster) homogeneity and highexternal (between-cluster) heterogeneity.

The Interpretation of clusters has to take intoaccount (needs to be consistent with) themathematical procedures.

Janette Walde Cluster Analysis

Page 7: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

ProblemsIdea of Cluster Analysis

Idea of Cluster analysis, Cont.

Comparing clusters (of cases) with respect toadditional variables (i.e. variables, which havenot been considered in the formation of theclusters) can help in the interpretation ofclusters (subsequent to CA).

CA itself is usually not combined with thecalculation of statistical significance.

CA does – likewise to Factorial Analysis - servedata reduction purposes.

Janette Walde Cluster Analysis

Page 8: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

ProblemsIdea of Cluster Analysis

Cluster analysis of cases

Cluster analysis evaluates the similarity of cases(e.g. persons, products, areas, other entities)with respect to a defined set of variables.

Cases are grouped into clusters on the basis oftheir similarities. Similar cases shall be assignedto the same cluster. Dissimilar cases shall beassigned to different clusters.

The number of clusters to be formed can bedefined in advance or certain criteria aredefined and applied to the data.

Janette Walde Cluster Analysis

Page 9: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

ProblemsIdea of Cluster Analysis

Cluster analysis of variables

When variables are clustered, the similarity ofthe variables is evaluated with respect to thesimilarity of the values, which a predefined setof persons have on these variables.

Similar variables shall be assigned to the samecluster of variables. Dissimilar variables shall beassigned to different clusters.

The number of clusters to be formed can bedefined in advance or certain criteria aredefined and applied to the data.

Janette Walde Cluster Analysis

Page 10: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

ProblemsIdea of Cluster Analysis

Clustering cases or variables?

Whether cases or variables should be clustereddepends on the research question you have.

Mathematically there exists no fundamentaldifference between clustering cases or variables.

Clustering variables starts with the Transposedmatrix, as compared to the matrix you startwith when clustering cases.

Similarities of cases wrt. their values of certain variables

(clustering cases) versus similarities of variables wrt. the

values of certain cases on these variables (clustering variables).Janette Walde Cluster Analysis

Page 11: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

General CommentsProperties of Distance MeasuresDistance Measures for Interval Scale DataDistance (Similarity) Measures for Binary Variables

Distance measures in CA I

The definition of the distance measure, anddeciding whether to use standardized or rawdata are fundamental steps in CA.

Measures of distance (dissimilarity versussimilarity/proximity) → Recursively mostsimilar clusters are unified.

Janette Walde Cluster Analysis

Page 12: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

General CommentsProperties of Distance MeasuresDistance Measures for Interval Scale DataDistance (Similarity) Measures for Binary Variables

Distance measures in CA II

In most CA approaches, the first round of asequential clustering procedure assumes nopre-existing clusters of cases. Instead, in thefirst round each case is regarded as a cluster →agglomerative methods.

Accordingly in the first round of a sequentialCA procedure the distance is measuredbetween all cases, that is, if n cases areincluded, the number of distances to becalculated equals n·(n−1)

2 .

Janette Walde Cluster Analysis

Page 13: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

General CommentsProperties of Distance MeasuresDistance Measures for Interval Scale DataDistance (Similarity) Measures for Binary Variables

Distance measures in CA III

The scale level of your data can limit thenumber of distance measures which makesense.

If a variable is dichotomous two cases can onlybe identical or different from each other.

If variables are ordinal scaled (ranks), then thedistances between values are difficult tointerpret.→ It is problematic to use distance measures,which are appropriate for interval scale data,

Janette Walde Cluster Analysis

Page 14: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

General CommentsProperties of Distance MeasuresDistance Measures for Interval Scale DataDistance (Similarity) Measures for Binary Variables

Distance measures in CA IV

for ordinal scaled data.→ A possibility is splitting ranks using themedian, and thus recode the variables (e.g.0 = below median; 1 = above median) tosubsequently apply a distance measure fordichotomous variables.

Janette Walde Cluster Analysis

Page 15: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

General CommentsProperties of Distance MeasuresDistance Measures for Interval Scale DataDistance (Similarity) Measures for Binary Variables

Distance measures in CA V

If a variable is nominal scaled with more thantwo levels (e.g. nationality; region ofresidence), then a dummy coding of thevariables can transform them into dichotomousvariables.

If your variables are interval scaled(quantitative variables), then distances can beproperly interpreted.

Janette Walde Cluster Analysis

Page 16: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

General CommentsProperties of Distance MeasuresDistance Measures for Interval Scale DataDistance (Similarity) Measures for Binary Variables

General properties of a distance measure

d(x , y) ≥ 0, the distance is never negative

d(x , y) = 0, if x = y

d(x , y) = d(y , x), symmetric

d(x , y) ≤ d(x , z) + d(z , y)

Janette Walde Cluster Analysis

Page 17: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

General CommentsProperties of Distance MeasuresDistance Measures for Interval Scale DataDistance (Similarity) Measures for Binary Variables

Distance measures for interval scale data

Cases/Variables V1 V2 V3 V4 V5 V6 V71 5 2 3 0 1 0 12 4 4 3 3 1 1 13 10 7 8 5 6 5 6

Different aspects of similarity (distance) can befocused:

Cases 1 and 2 are similar regarding theabsolute values of the variables → mostdistance measures (e.g. Euclidean distance)focus this aspectCases 1 and 3 ...

Janette Walde Cluster Analysis

Page 18: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

General CommentsProperties of Distance MeasuresDistance Measures for Interval Scale DataDistance (Similarity) Measures for Binary Variables

Distance measures for interval scale data

Cases/Variables V1 V2 V3 V4 V5 V6 V71 5 2 3 0 1 0 12 4 4 3 3 1 1 13 10 7 8 5 6 5 6

Cases 1 and 2Cases 1 and 3 are similar with respect to theprofile (increase and decrease/relative values)over the variables.→ To focus on this aspect, product moment

correlation can be selected as similaritymeasure!

Janette Walde Cluster Analysis

Page 19: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

General CommentsProperties of Distance MeasuresDistance Measures for Interval Scale DataDistance (Similarity) Measures for Binary Variables

Distance measures for interval scale data

Cases/Variables V1 V2 V3 V4 V5 V6 V71 5 2 3 0 1 0 12 4 4 3 3 1 1 13 10 7 8 5 6 5 6

Cases 1 and 2Cases 1 and 3 are similar with respect to theprofile over the variables.→ After standardizing the variables (e.g.z-transforming) conventional distance measures(e.g. Euclidean distance) also result in a focuson this latter aspect.

Janette Walde Cluster Analysis

Page 20: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

General CommentsProperties of Distance MeasuresDistance Measures for Interval Scale DataDistance (Similarity) Measures for Binary Variables

Euclidian distance

The Euclidean Distance between two cases Aand B is the “straight line” between the twocases. Assume A(xA, yA),B(xB, yB), then

dEuklidean(A,B) =√

(xA − xB)2 + (yA − yB)2.The value of the Euclidean distance dependson the scale of the variables. → Standardizethe variables!Generally, two cases having the variable vectors

x and y : d(x , y) =√

∑ki=1(xi − yi)2, where k

is the number of variables.Janette Walde Cluster Analysis

Page 21: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

General CommentsProperties of Distance MeasuresDistance Measures for Interval Scale DataDistance (Similarity) Measures for Binary Variables

Minkowski Metrik

General basis of both, the City-Block-Distanceand the Euclidean Distance (ordinary distance,Pythagorean metric) are the so calledMinkowski-Metrics, corresponding to theformula: d(x , y) = (

∑ki=1 |xi − yi |

r )1r , where r

is called the Minkowski constant.

In the City-Block Metric r = 1, in EuclideanDistance r = 2. These distance measure canbe calculated for any number of variables(dimensions).

Janette Walde Cluster Analysis

Page 22: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

General CommentsProperties of Distance MeasuresDistance Measures for Interval Scale DataDistance (Similarity) Measures for Binary Variables

Minkowski Metrik, Cont.

High values of r increase the weight of largedistances relative to small ones.

Dominance (Supremum) metric: r → ∞, thusonly the biggest difference matters

For applying Minkowski-Metrics the scales ofthe variables should be identical. Else, astandardization (e.g. z-transformation of allvariables) must be accomplished.

Janette Walde Cluster Analysis

Page 23: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

General CommentsProperties of Distance MeasuresDistance Measures for Interval Scale DataDistance (Similarity) Measures for Binary Variables

Cosine

The distance between two cases x and y isdefined as:

d(x , y) =

∑ki=1 xiyi

i x2i

i y2i

The direction of the variable vectors is decisiveand not their length. The ”angle” between thetwo variable vectors is computed.

Janette Walde Cluster Analysis

Page 24: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

General CommentsProperties of Distance MeasuresDistance Measures for Interval Scale DataDistance (Similarity) Measures for Binary Variables

Pearson Correlation

Correlation coefficient between x and y

rx ,y =

∑k

i=1(xi − x)(yi − y)

[∑k

i=1(xi − x)2∑k

i=1(yi − y)2]12

Measure of Similarity

Note: Clustering variables using Pearson correlation asdistance measure is similar to Factorial Analysis as bothare closely related to the correlation matrix of theinvolved variables and both aim to identify similaritiesbetween these variables.

Janette Walde Cluster Analysis

Page 25: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

General CommentsProperties of Distance MeasuresDistance Measures for Interval Scale DataDistance (Similarity) Measures for Binary Variables

Distance measures for binary variables

Cases/variables V1 V2 V3 V4 V5 V6 V71 1 1 1 0 1 0 12 0 1 0 0 0 1 13 1 1 0 0 0 0 0

Even for binary data manifold distance (similarity)measures exist, e.g.:

1 Simple Matching Coefficient (SMC) =matches/number of paired variables.→ e.g. SMC (case 1, case 2) = 3/7; SMC(case 2, case 3) = 4/7.

Janette Walde Cluster Analysis

Page 26: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

General CommentsProperties of Distance MeasuresDistance Measures for Interval Scale DataDistance (Similarity) Measures for Binary Variables

Distance measures for binary variables

Cases/variables V1 V2 V3 V4 V5 V6 V71 1 1 1 0 1 0 12 0 1 0 0 0 1 13 1 1 0 0 0 0 0

1 Simple Matching Coefficient.2 Jaccard Matching coefficient = matches with

characteristic present (=1)/ number of pairedvariables where characteristic is given (=1) atleast once.→ e.g. JMC (case 1, case 2) = 2/6; JMC(case 2, case 3) = 1/4

Janette Walde Cluster Analysis

Page 27: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

General CommentsProperties of Distance MeasuresDistance Measures for Interval Scale DataDistance (Similarity) Measures for Binary Variables

Distance measures for binary variables

Cases/variables V1 V2 V3 V4 V5 V6 V71 1 1 1 0 1 0 12 0 1 0 0 0 1 13 1 1 0 0 0 0 0

1 Simple Matching Coefficient.2 Jaccard Matching coefficient.3 Phi-coefficient → Product-moment correlation

formula applied to binary data that is coded 0and 1.

In these three methods, the distance measure isdefined by: d = (1 - similarity measure).

Janette Walde Cluster Analysis

Page 28: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

General CommentsProperties of Distance MeasuresDistance Measures for Interval Scale DataDistance (Similarity) Measures for Binary Variables

Comments regarding distance measures

Specific distance measures for nominal scaledvariables that are not binary as well as distancemeasures for ordinal scaled variables areavailable.

Act with caution if variables with different levelof measurements are used combined!

Analogous to the distances between cases,distances between variables can also becalculated.

Janette Walde Cluster Analysis

Page 29: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Formation of groups (clusters) in CA

Different methods can be used to form clusters.

Hierarchical Agglomerative CA methods startwith the finest partitioning (i.e. each case[variable] forms one cluster) and sequentiallyreduce the number of clusters by 1 throughunifying two clusters.Divisive methods which start with one supercluster are very uncommon and not touchedhere.

Janette Walde Cluster Analysis

Page 30: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Formation of clusters in CA, Cont.

In hierarchical CA methods, clusters are sequentiallyenlarged. Once included an element is included in acluster it will not be excluded from the cluster.

In non-hierarchical CA methods, elements cansequentially be included and excluded into differentclusters in order to identify the best cluster structure.→ Thus, they are more flexible and complex.→ The k-means method is the most frequently appliednon-hierarchical CA method.

Janette Walde Cluster Analysis

Page 31: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Algorithms for establishing clusters

Janette Walde Cluster Analysis

Page 32: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Hierarchical agglomerative CA methods

Different hierarchical CA methods use differentcriteria for selecting which two clusters will beunified in the next step.

Hierarchical Methods are for example SingleLinkage, Complete Linkage, Average Linkage,[weighted/unweighted] Centroid-method, andthe Ward-method.

Janette Walde Cluster Analysis

Page 33: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Hierarchical agglomerative CA methods,Cont.

Even if the same distance measure is used,these methods result in different distancesbetween clusters → i.e. different distancematrices (But only if at least one of theclusters contains more than one case).Ward method is somewhat different approachas it uses a sums of square criterion and unifiesthe two clusters which results in the smallestincrease of error variance.

Janette Walde Cluster Analysis

Page 34: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

The procedure of hierarchical methods

1 Start with finest grouping, i.e. each case is a cluster.

2 Calculation of distance matrix.

3 The two cluster with the lowest distance are identifiesand unified to a common cluster.

4 A new distance matrix for the reduced number ofclusters is calculated (where the two formerly separatedcluster are now considered as one cluster).

5 The two cluster with the lowest distance in the reduceddistance matrix are unified.

Janette Walde Cluster Analysis

Page 35: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

The procedure of hierarchical methods,Cont.

... Recursion to step 4, ... etc. - until all cases areunified in a single super cluster or some predefinedcriterion for stopping the agglomeration schedule isdefined.

Janette Walde Cluster Analysis

Page 36: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Single Linkage

1. Single Linkage calculates the distanceD(R;P&Q) between a (divisible)cluster containingthe two cases P, Q, and the indivisible cluster R as:

D(R;P&Q) = min{D(R;P),D(R;Q)}

Here, the smallest distance between a cluster P&Qand the cluster R determines the distance betweenthe two clusters.

Janette Walde Cluster Analysis

Page 37: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Single Linkage, Cont.

The method is also known as Nearest Neighbor asthose two clusters, which have the nearest neighbor(i.e. where the pair with the smallest distancebetween two objects of the two clusters exists) areunified.

This results in a tendency towards the formation oflarge clusters.

Janette Walde Cluster Analysis

Page 38: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Complete Linkage

2. Complete Linkage calculates the distanceD(R;P&Q) between a (divisible) cluster containingthe two cases P, Q, and the indivisible cluster R as:

D(R;P&Q) = max{D(R;P),D(R;Q)}

Here, the largest distance between a cluster P&Qand the cluster R determines the distance betweenthe two clusters.

Janette Walde Cluster Analysis

Page 39: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Complete Linkage, Cont.

The method is also known as Furthest Neighbor asthe two clusters where the furthest neighbor (i.e.The pair with the largest distance between twoobjects of the two clusters) is least distant areunified.

This results in a tendency towards the formation ofsmall clusters.

Janette Walde Cluster Analysis

Page 40: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Average Linkage between groups

3. Average Linkage between groups calculates thedistance D(R;P&Q) between a (divisible) clustercontaining the two cases P, Q, and the indivisiblecluster R as:

D(R;P&Q) = mean(D(R;P),D(R;Q))

Here, the average of the pair-wise distances betweenall the pairs formed by objects of both clustersdetermines the effective distance between the twoclusters.

Janette Walde Cluster Analysis

Page 41: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Average Linkage between groups, Cont.

3. Average Linkage:

D(R;P&Q) = mean(D(R;P),D(R;Q))

Here, the average of the pair-wise distances betweenall the pairs formed by objects of both clustersdetermines the effective distance between the twoclusters.→ The two clusters where the average betweengroup distance is smallest are unified.→ This method is frequently used.

Janette Walde Cluster Analysis

Page 42: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Average Linkage within groups

4. Average Linkage within groups calculates themean distance D(R;P&Q) between all cases in thecluster to be formed out of the divisible cluster andthe indivisible cluster R:

D(R;P&Q) = mean(D(R;P),D(R;Q),D(P;Q))

Here, the average of the pair-wise distances betweenall the objects in the new cluster is calculated foreach possible cluster fusion. The two clustershaving the lowest average within cluster distance areunified.

Janette Walde Cluster Analysis

Page 43: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Cluster centroids

For calculating the distances between two large clusters,the 4 previous methods function analogous to ourexample of an indivisible cluster (R) and a composedcluster with two cases (P,Q).

In the four methods described so far, individual objects ofthe (non-elementary) composed clusters are consideredfor calculating the distance between two clusters.The subsequent methods focuses on the two clustercentroids.

Janette Walde Cluster Analysis

Page 44: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Unweighted Centroid Method

5. The unweighted Centroid Method (= MedianMethod) calculates the distance between clusters asthe distance between the centroids of the clusters.

In principle, the position of the centroid of aCluster is defined by the average value of theobjects forming the cluster of each of thevariables that are considered in the CA. In thecase of an elementary cluster with only oneobject, this object represents the centroid.

Janette Walde Cluster Analysis

Page 45: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Unweighted Centroid Method, Cont.

However, in the unweighted Centroid Method(= Median Method) when calculating thedistances between two clusters only the twocentroids are considered instead of the singleobjects within the two clusters.→ The centroid of the cluster resulting fromthe fusion of two clusters results from themeans of the two clusters.→ Large and small clusters are not weighteddifferentially when they are unified.

Janette Walde Cluster Analysis

Page 46: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Weighted Centroid Method

In the unweighted centroid method the size of thetwo sub-clusters that are unified is ignored.→ The resulting centroid is usually different fromthe centroid of all elementary objects (i.e. cases)contained in the unified cluster (it is rather thecentroid of the two cluster centroids)

The (weighted) Centroid Method considers thedifferences in the size (number of objects) of thetwo original sub-clusters.

Janette Walde Cluster Analysis

Page 47: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Weighted Centroid Method, Cont.

Thus in the case of the fusion of a small and a largecluster, the centroid of the resulting cluster is closerto the centroid of the large cluster than to thecentroid of the small cluster.

Janette Walde Cluster Analysis

Page 48: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Ward Method (Min. Variance Method)

7. The idea has much in common with an ANOVA.The two clusters that are connected to the smallestincrease in the error sums of squares (ESS) aresequentially unified.

Let Xijk denote the value for variable k inobservation j belonging to cluster i :

ESS(X ) =∑

i

j

k

(Xijk − Xi ·k)2

=∑

clusters

cases

(Xij − centroidi )2

Janette Walde Cluster Analysis

Page 49: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Ward Procedure I

1 At the beginning each case is a cluster.2 In the first step of the algorithm, n − 1 clusters

are formed, one of size two and the remainingof size 1. The error sum of squares iscomputed. The pair of sample units that yieldthe smallest ESS will form the first cluster.

Janette Walde Cluster Analysis

Page 50: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Ward Procedure II

3 Then, in the second step of the algorithm,n − 2 clusters are formed from that n − 1clusters defined in step 2. These may includetwo clusters of size 2, or a single cluster of size3 including the two items clustered in step 1.Again, the value of ESS is minimized.

4 Thus, at each step of the algorithm clusters orobservations are combined in such a way as tominimize the results of error from the squares.

Janette Walde Cluster Analysis

Page 51: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Ward Procedure III

5 The algorithm stops when all sample units arecombined into a single large cluster of size n.

Janette Walde Cluster Analysis

Page 52: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Ward method, Cont.

The Ward method is frequently applied.

Homogeneous clusters.

It has a tendency to generate clusters of similar size (i.e.with similar numbers of cases).

However, the Ward method’s tendency towards equallylarge clusters can be an advantage (e.g. if homogenouscluster sizes are aimed at) as well as a disadvantage (e.g.if unbalanced cluster sizes do better reflect the reality).In the latter case, Average Linkage or the (weighted)Centroid method could produce better results.

Janette Walde Cluster Analysis

Page 53: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

Partitional clustering methods

A partitioning in G clusters is given (A predefinedclustering solution can be used as starting point).

Number of clusters remains constant.

The partitioning is suboptimal and the algorithm tries toimprove it.

Iteratively rearrangements improve the partitioning.

The methods differ in the criteria to measure thisimprovement and the rules for the rearrangements.

Janette Walde Cluster Analysis

Page 54: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

k-means clustering

Follows the idea of an ANOVA. The clusters areestablished in order to maximize the F-statistic:The ratio of the variance between clusters andthe variance inside the clusters is maximized.

The main advantages of this algorithm are itssimplicity and speed which allows it to run onlarge data sets.

Janette Walde Cluster Analysis

Page 55: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Hierarchical Agglomerative CA MethodsPartitional Clustering: k-means Clustering

k-means clustering-algorithm

Randomly generate G clusters/centroids.

Compute the centroids for each cluster.

Calculate for each case the distance to thecentroid of each cluster.

Move the case into the cluster with thesmallest distance.

Repeat the above two steps till no change inthe cluster appears.

Janette Walde Cluster Analysis

Page 56: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Dendrogram

The results of a CA can be depicted in aDendrogram. It shows the sequence in which theclusters have been formed and the increases ofwithin cluster distances that are connected with theformation of ever larger clusters (i.e., the increase oferror variance in the case of the Ward Method)

Janette Walde Cluster Analysis

Page 57: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Dendrogram, Cont.

Together with the classification of the objects,the Dendrogram is the most interesting outputof a CA.

It can be used to judge the homogeneity of theclusters.

It can also be used to define the number ofclusters that should be used for the finalclassification of the objects

Janette Walde Cluster Analysis

Page 58: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Dendrogram, Cont.

Janette Walde Cluster Analysis

Page 59: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Classification

If the cases in the data set are assigned toclusters (clustering of cases), then the numberof the cluster in which each case is includedcan be saved as a new variable.

The number of clusters to be formed can bepredefined or a range of solutions can begenerated.

Subsequent analyzes comparing the differentclusters with respect to variables of interestbecomes possible.

Janette Walde Cluster Analysis

Page 60: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Classification, Cont.

If clusters differ significantly with respect tovariables of interest that are rather independentof the variables which have been used forclustering the cases, then the meaningfulnessof the clusters and the usefulness of theclustering procedure becomes evident.

Janette Walde Cluster Analysis

Page 61: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Useful R-commands I

Hierarchic clustering (function hclust) is in standardR and available without loading any specificlibraries. Hierarchic clustering needs dissimilarities asits input. Standard R has function dist to calculatemany dissimilarity functions, but for communitydata we may prefer vegan function vegdist havingecologically useful dissimilarity indices. The default

index is Bray-Curtis (djk =∑

i |xij−xik |∑

i (xij+xik)):

d < −dist(data) or d < −vegdist(data)

Janette Walde Cluster Analysis

Page 62: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Useful R-commands II

The ecologically useful indices in vegan have anupper limit of 1 for completely different sites (noshared species), and use simple differences ofabundances. In contrast, the standard Euclideandistance has no upper limit, but varies with the sumof total abundances of compared sites when thereare no shared species, and uses squares ofdifferences of abundances. There are many otherecologically useful indices in vegdist, but Bray-Curtisis usually not a bad choice.

Janette Walde Cluster Analysis

Page 63: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Useful R-commands III

Single linkage method:singlink < −hclust(d ,method = ”single”) withthe dendrogram plot(singlink)

Complete linkage :complink < −hclust(d ,method = ”complete”)and plot(complink, hang = -1)

Average linkage methods:averlink < −hclust(d ,method = ”aver”) withplot(averlink , hang = −1)

Janette Walde Cluster Analysis

Page 64: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Useful R-commands IV

The fixed classification can be visuallydemonstrated with rect.hclust

function:plot(singlink , hang = −1) withrect.hclust(singlink , 3)

We can extract classification in a certain levelusing function cutree:cl < −cutree(complink , 3). This gives anumeric classification vector of clusteridentities.

Janette Walde Cluster Analysis

Page 65: Cluster Analysis - uibk.ac.at · Introduction DistanceMeasures Formation ofGroups (Clusters)inCA Howtoobtain thenumber ofclusters? FurtherAnalysis oftheobtained clusters UsefulR-Commands

IntroductionDistance Measures

Formation of Groups (Clusters) in CAHow to obtain the number of clusters?

Further Analysis of the obtained clustersUseful R-Commands

Useful R-commands V

We can tabulate the numbers of observationsin each cluster: table(cl).

We can compare two clustering schemes bycross-tabulation: table(cl , cutree(singlink , 3)).

Also available: library(cluster).

Janette Walde Cluster Analysis