Data Mining with R: Clustering
Hugh Murrell
reference books
These slides are based on a book by Graham Williams:
Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery.
For further background on decision trees try Andrew Moore's slides from: http://www.autonlab.org/tutorials
and, as always, Wikipedia is a useful source of information.
clustering
Clustering is one of the core tools used by the data miner.
Clustering gives us the opportunity to group observations in a generally unguided fashion according to how similar they are.
This is done on the basis of a measure of the distance between observations.
The aim of clustering is to identify groups of observations that are close together but as a group are quite separate from other groups.
k-means clustering
Given a set of observations $(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n)$, where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets $(S_1, S_2, \ldots, S_k)$ so as to minimize the within-cluster sum of squares:

$$\sum_{i=1}^{k} \sum_{\vec{x}_j \in S_i} \|\vec{x}_j - \vec{\mu}_i\|^2$$

where $\vec{\mu}_i$ is the mean of the observations in $S_i$.
k-means algorithm
Given an initial set of k means, $\vec{m}_1, \ldots, \vec{m}_k$, the algorithm proceeds by alternating between two steps:
I Assignment step: Assign each observation to the cluster whose mean is closest to it.
I Update step: Calculate the new means to be the centroids of the observations in the new clusters.
The algorithm has converged when the assignments no longer change.
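These two alternating steps are what R's built-in kmeans function iterates. As a quick sketch on synthetic data (the data and seed here are our own invention, purely for illustration):

```r
# k-means on two well-separated synthetic 2-d clusters (illustrative only)
set.seed(1)                          # reproducible random start
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
km <- kmeans(x, centers = 2)         # alternate assignment / update steps
km$centers                           # the final cluster means
km$tot.withinss                      # the minimized within-cluster sum of squares
```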
variants of k-means
As it stands, the k-means algorithm gives different results depending on how the initial means are chosen. Thus there have been a number of attempts in the literature to address this problem.
The cluster package in R implements three variants of k-means.
I pam: partitioning around medoids
I clara: clustering large applications
I fanny: fuzzy analysis clustering
In the next slide, we outline the k-medoids algorithm which is implemented as the function pam.
partitioning around medoids
I Initialize by randomly selecting k of the n data points as the medoids.
I Associate each data point with the closest medoid.
I For each medoid m:
  I For each non-medoid data point o:
    I Swap m and o and compute the total cost of the configuration.
I Select the configuration with the lowest cost.
I Repeat until there is no change in the medoids.
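The total cost referred to above is simply the sum of distances from each observation to its nearest medoid. A small helper of our own (not part of the cluster package) makes this concrete:

```r
# cost of a medoid configuration: for every observation, take the
# distance to its nearest medoid, then sum (illustrative helper only)
config.cost <- function(dat, medoid.idx) {
  d <- as.matrix(dist(dat))                         # full pairwise distance matrix
  sum(apply(d[, medoid.idx, drop = FALSE], 1, min)) # nearest-medoid distances
}
```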
distance measures
There are a number of ways to measure "closest" when implementing the k-medoids algorithm.
I Euclidean distance: $d(\vec{u}, \vec{v}) = \left(\sum_i (u_i - v_i)^2\right)^{1/2}$
I Manhattan distance: $d(\vec{u}, \vec{v}) = \sum_i |u_i - v_i|$
I Minkowski distance: $d(\vec{u}, \vec{v}) = \left(\sum_i |u_i - v_i|^p\right)^{1/p}$
Note that Minkowski distance is a generalization of the other two distance measures, with p = 2 giving Euclidean distance and p = 1 giving Manhattan (or taxi-cab) distance.
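R's dist function implements all three measures, and the Minkowski form with p = 1 or p = 2 reproduces the other two, as a quick check shows:

```r
# distances between u = (0,0) and v = (3,4) under each measure
m <- rbind(c(0, 0), c(3, 4))
dist(m, method = "euclidean")            # 5
dist(m, method = "manhattan")            # 7
dist(m, method = "minkowski", p = 2)     # 5, same as Euclidean
dist(m, method = "minkowski", p = 1)     # 7, same as Manhattan
```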
example data set
For purposes of demonstration we will again make use of the classic iris data set from R's datasets collection.
> summary(iris$Species)
setosa versicolor virginica
50 50 50
Can we throw away the Species attribute and recover it through unsupervised learning?
partitioning the iris dataset
> library(cluster) # load package
> dat <- iris[, -5] # drop known Species
> pam.result <- pam(dat,3) # perform k-medoids
> pam.result$clustering # print the clustering
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[18] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[35] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
[52] 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[69] 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2
[86] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2
[103] 3 3 3 3 2 3 3 3 3 3 3 2 2 3 3 3 3
[120] 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3
[137] 3 3 2 3 3 3 2 3 3 3 2 3 3 2
success rate
> # how many does it get wrong
> #
> sum(pam.result$clustering != as.numeric(iris$Species))
[1] 16
> #
> # plot the clusters and produce a cluster silhouette
> par(mfrow=c(2,1))
> plot(pam.result)
In the silhouette, a large si (almost 1) suggests that the observations are very well clustered; a small si (around 0) means that the observation lies between two clusters. Observations with a negative si are probably in the wrong cluster.
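The same silhouette information can be read off the pam result numerically (the silinfo component is documented in the cluster package):

```r
# numeric view of the silhouette: per observation, per cluster and overall
library(cluster)
pam.result <- pam(iris[, -5], 3)
head(pam.result$silinfo$widths)          # si for individual observations
pam.result$silinfo$clus.avg.widths       # average si per cluster
pam.result$silinfo$avg.width             # overall average silhouette width
```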
cluster plot
[Figure: clusplot(pam(x = dat, k = 3)), Component 1 vs Component 2. These two components explain 95.81 % of the point variability.]

[Figure: Silhouette plot of pam(x = dat, k = 3). n = 150, 3 clusters; average silhouette width: 0.55. Per-cluster averages: cluster 1, 50 observations, 0.80; cluster 2, 62 observations, 0.42; cluster 3, 38 observations, 0.45.]
hierarchical clustering
In hierarchical clustering, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster.
At each stage distances between clusters are recomputed by a dissimilarity formula according to the particular clustering method being used.
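In R's hclust, this dissimilarity formula is selected through the method argument; the common linkage choices are:

```r
# three standard linkage rules for recomputing cluster-to-cluster distance
d <- dist(iris[, -5])
hc.single   <- hclust(d, method = "single")    # distance between closest members
hc.complete <- hclust(d, method = "complete")  # distance between farthest members
hc.average  <- hclust(d, method = "average")   # mean of all pairwise distances
```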
hierarchical clustering of iris dataset
The cluster package in R implements two variants of hierarchical clustering.
I agnes: AGglomerative NESting
I diana: DIvisive ANAlysis Clustering
However, R has a built-in hierarchical clustering routine called hclust (equivalent to agnes) which we will use to cluster the iris data set.
> dat <- iris[, -5]
> # perform hierarchical clustering
> hc <- hclust(dist(dat),"ave")
> # plot the dendrogram (plclust is defunct in modern R; use plot)
> plot(hc, hang=-2)
cluster plot
[Figure: dendrogram produced by hclust (*, "average") on dist(dat); leaf labels are the iris observation numbers, vertical axis is Height.]
Similar to the k-means clustering, hclust shows that cluster setosa can be easily separated from the other two clusters, and that clusters versicolor and virginica overlap with each other to a small degree.
success rate
> # how many does it get wrong
> #
> clusGroup <- cutree(hc, k=3)
> sum(clusGroup != as.numeric(iris$Species))
[1] 14
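Note that the comparison above only works because cutree happens to number its clusters in the same order as the Species codes. A label-free check via a confusion table is safer; a small sketch:

```r
# confusion table between recovered clusters and the true species;
# a good clustering shows one dominant species per row, whatever the labels
hc <- hclust(dist(iris[, -5]), "ave")
clusGroup <- cutree(hc, k = 3)
table(clusGroup, iris$Species)
```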
exercises
By invitation only:
Revisit the wine dataset from my website. This time discard the Cultivar variable.
Use the pam routine from the cluster package to derive 3 clusters for the wine dataset. Plot the clusters in a 2D plane and compute and report on the success rate of your chosen method.
Also perform a hierarchical clustering of the wine dataset and measure its performance at the 3-cluster level.
Email your wine clustering script to me by Monday the 9th May, 06h00.