Cluster analysis
• Partition Methods
Divide data into disjoint clusters.
• Hierarchical Methods
Build a hierarchy of the observations and deduce the clusters from it.
K-means
Criteria
Same criteria with multivariate data:
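The formulas on these slides did not survive in the transcript. A standard way to write the k-means criterion, assuming G clusters C_1, ..., C_G with cluster means \bar{x}_g (notation chosen here, not taken from the slides), is:

```latex
% Univariate: minimize the within-cluster sum of squares
\min \; SSW = \sum_{g=1}^{G} \sum_{i \in C_g} (x_i - \bar{x}_g)^2

% Multivariate: the same criterion with squared Euclidean norms,
% i.e. the trace of the within-cluster sum-of-squares matrix W
\min \; \sum_{g=1}^{G} \sum_{i \in C_g} \lVert \mathbf{x}_i - \bar{\mathbf{x}}_g \rVert^2 = \operatorname{tr}(W)
```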
Justifying the criteria
• Anova: decomposition of the variance.
Univariate:
SST=SSW+SSB
Multivariate:
Minimizing the within-cluster variance is equivalent to maximizing the between-cluster variance (the difference between clusters).
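The univariate decomposition SST=SSW+SSB can be checked numerically. The sketch below is only an illustration: the function name `anova_decomposition` and the toy groups are made up for this example. Because SST is fixed by the data, any partition that lowers SSW must raise SSB by the same amount. The multivariate case is usually written the same way with sums-of-squares matrices, T = W + B.

```python
def mean(values):
    return sum(values) / len(values)

def anova_decomposition(groups):
    """Return (SST, SSW, SSB) for a list of univariate groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Total sum of squares: deviations from the grand mean
    sst = sum((x - grand_mean) ** 2 for x in all_values)
    # Within sum of squares: deviations from each group's own mean
    ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)
    # Between sum of squares: group means against the grand mean, weighted by size
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    return sst, ssw, ssb

# Hypothetical toy data: two well-separated groups
groups = [[1.0, 2.0, 3.0], [7.0, 8.0, 9.0]]
sst, ssw, ssb = anova_decomposition(groups)
print(sst, ssw, ssb)  # 58.0 4.0 54.0
```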
K-means algorithm
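The algorithm itself is not spelled out in the transcript. A minimal sketch of the usual iteration (assign each observation to its nearest center, then recompute the centers) could look like this. The function names, the random initialization from k observations, and the toy points are assumptions for illustration, not the slides' own version.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Assumption: start from k observations drawn at random
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center
        new_labels = [
            min(range(k), key=lambda g: squared_distance(p, centers[g]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster: converged
        labels = new_labels
        # Update step: each center becomes the mean of its members
        for g in range(k):
            members = [p for p, lab in zip(points, labels) if lab == g]
            if members:  # an empty cluster keeps its previous center
                centers[g] = centroid(members)
    return labels, centers

# Hypothetical toy data: two obvious groups
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = k_means(points, k=2)
```

The result depends on the starting centers in general, so in practice the procedure is often run from several starts and the solution with the smallest within-cluster sum of squares is kept.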
Number of clusters
Consequences of standardization
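One consequence can be shown on a small example: with raw data, the variable with the largest scale dominates the Euclidean distance, and standardizing can change which observations are closest. The data below (income in euros, age in years) are hypothetical, and z-scores with the sample standard deviation are just one possible standardization.

```python
import math
import statistics

def standardize(rows):
    """Z-score each column (sample standard deviation)."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return [tuple((x - m) / s for x, m, s in zip(row, means, sds))
            for row in rows]

def nearest_to_first(rows):
    """Index of the row closest (Euclidean) to rows[0]."""
    first = rows[0]
    return min(range(1, len(rows)), key=lambda i: math.dist(first, rows[i]))

# Hypothetical observations: (income, age)
rows = [(20000.0, 25.0), (20500.0, 60.0), (23000.0, 26.0), (40000.0, 45.0)]

raw_neighbour = nearest_to_first(rows)                  # income dominates
scaled_neighbour = nearest_to_first(standardize(rows))  # both variables count
print(raw_neighbour, scaled_neighbour)  # 1 2
```

Before standardizing, the first person is closest to someone of similar income but very different age. After standardizing, the closest person is the one of similar age and moderately different income.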
Ruspini example
Problems of k-means
• Very sensitive to outliers
• Euclidean distances not appropriate for elliptical clusters
• It does not give the number of clusters.
Hierarchical Algorithms
Agglomerative algorithms
Nearest neighbour distance
Farthest neighbour distance
Average distance
Centroid method distance
Ward’s method distance
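The five between-cluster distances listed above can be sketched as follows, using their usual textbook definitions since the slide formulas are missing from the transcript. Function names and the two toy clusters are illustrative. For Ward's method the quantity computed is the increase in the within-cluster sum of squares caused by merging the two clusters.

```python
import math
from itertools import product

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(c) / n for c in zip(*cluster))

def nearest_neighbour(a, b):
    # Single linkage: closest pair across the two clusters
    return min(math.dist(p, q) for p, q in product(a, b))

def farthest_neighbour(a, b):
    # Complete linkage: most distant pair across the two clusters
    return max(math.dist(p, q) for p, q in product(a, b))

def average_distance(a, b):
    # Average linkage: mean over all cross-cluster pairs
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid_distance(a, b):
    return math.dist(centroid(a), centroid(b))

def ward_increase(a, b):
    # Increase in within-cluster sum of squares if a and b are merged
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * centroid_distance(a, b) ** 2

# Hypothetical toy clusters
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]
print(nearest_neighbour(a, b), farthest_neighbour(a, b),
      average_distance(a, b), centroid_distance(a, b), ward_increase(a, b))
```

An agglomerative algorithm merges, at each step, the pair of clusters with the smallest value of whichever distance was chosen.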
Dendrograms
Example
Problems of hierarchical clustering
• If n is large, it is slow: each merge step needs up to n(n-1)/2 comparisons.
• Euclidean distances not always appropriate
• If n is large, the dendrogram is difficult to interpret
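To give a feel for the cost, the sketch below counts the pairs compared. It assumes a naive implementation that rescans every pair of current clusters before each merge, which is a worst case rather than what optimized software does. The function names are illustrative.

```python
from itertools import combinations

def pair_count(n):
    """Number of distinct pairs among n clusters: n(n-1)/2."""
    return n * (n - 1) // 2

def total_comparisons(n):
    """Pairs examined over a whole run that starts with n singleton
    clusters and rescans all pairs before each merge (naive assumption)."""
    return sum(pair_count(m) for m in range(2, n + 1))

# The closed form agrees with brute-force enumeration of pairs
assert pair_count(5) == len(list(combinations(range(5), 2)))

for n in (10, 100, 1000):
    print(n, pair_count(n), total_comparisons(n))
```

The first step alone grows quadratically with n, and the naive total grows roughly like n cubed over six.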
Clustering by variables
Distances between quantitative variables
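The slide's formula is not in the transcript. One common choice, assumed here, turns the correlation coefficient r into a distance 1 - r^2, so that strongly related variables (positively or negatively) end up close together. The variable names and data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_distance(x, y):
    # Assumed definition: 0 for perfectly correlated, 1 for uncorrelated
    return 1.0 - pearson(x, y) ** 2

# Hypothetical variables measured on the same four observations
height = [1.0, 2.0, 3.0, 4.0]
weight = [2.0, 4.0, 6.0, 8.0]    # exact linear function of height
noise = [1.0, -1.0, -1.0, 1.0]   # uncorrelated with height

d_close = correlation_distance(height, weight)
d_far = correlation_distance(height, noise)
```

With a matrix of such distances between variables, the same hierarchical algorithms used for observations can be applied to the variables.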
Distances between qualitative variables
Similarity between attributes
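For binary attributes, two widely used similarity coefficients are simple matching and Jaccard. Which coefficient the slides used is not recoverable from the transcript, so the sketch below is only an illustration with made-up data.

```python
def simple_matching(x, y):
    """Share of positions where the two binary attributes agree (1-1 or 0-0)."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)

def jaccard(x, y):
    """Share of joint presences among positions where at least one is present.
    Joint absences (0-0) are ignored."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    # Convention assumed here: similarity 0.0 when neither attribute is ever present
    return both / either if either else 0.0

# Hypothetical presence/absence of two attributes on five observations
x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 0, 1]

sm = simple_matching(x, y)  # 4 agreements out of 5
jc = jaccard(x, y)          # 2 joint presences out of 3
```

A similarity s in [0, 1] can be turned into a distance, for example 1 - s, before running a clustering algorithm.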