Data Mining with R - Clustering
Hugh Murrell
reference books
These slides are based on a book by Graham Williams:
Data Mining with Rattle and R, The Art of Excavating Data for Knowledge Discovery.
For further background on clustering, try Andrew Moore's slides from: http://www.autonlab.org/tutorials

And, as always, Wikipedia is a useful source of information.
clustering
Clustering is one of the core tools used by the data miner.

Clustering gives us the opportunity to group observations in a generally unguided fashion according to how similar they are.

This is done on the basis of a measure of the distance between observations.

The aim of clustering is to identify groups of observations that are close together but, as a group, are quite separate from other groups.
k-means clustering
Given a set of observations $(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n)$, where each observation is a $d$-dimensional real vector, k-means clustering aims to partition the $n$ observations into $k$ sets $(S_1, S_2, \ldots, S_k)$ so as to minimize the within-cluster sum of squares:

$$\sum_{i=1}^{k} \sum_{\vec{x}_j \in S_i} \lVert \vec{x}_j - \vec{\mu}_i \rVert^2$$

where $\vec{\mu}_i$ is the mean of the observations in $S_i$.
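The within-cluster sum of squares can be computed directly in R for any partition. This is a sketch: the helper name wcss is our own, not a function from any package, but for a partition produced by the built-in kmeans() its value agrees with the tot.withinss component of the result.

```r
# Within-cluster sum of squares for a given partition (illustrative helper;
# 'wcss' is our own name, not part of any package).
wcss <- function(x, cluster) {
  x <- as.matrix(x)
  sum(sapply(split(seq_len(nrow(x)), cluster), function(idx) {
    mu <- colMeans(x[idx, , drop = FALSE])       # cluster mean
    sum(sweep(x[idx, , drop = FALSE], 2, mu)^2)  # sum of squared deviations
  }))
}

set.seed(1)
km <- kmeans(iris[, -5], centers = 3, nstart = 10)
wcss(iris[, -5], km$cluster)   # agrees with km$tot.withinss
```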
k-means algorithm
Given an initial set of k means, $\vec{m}_1, \ldots, \vec{m}_k$, the algorithm proceeds by alternating between two steps:

- Assignment step: assign each observation to the cluster whose mean is closest to it.
- Update step: calculate the new means to be the centroids of the observations in the new clusters.

The algorithm has converged when the assignments no longer change.
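The two alternating steps can be sketched directly in R. This is a minimal illustration, not the optimized built-in kmeans(): the name simple_kmeans is our own, and the sketch does not guard against clusters becoming empty.

```r
# Minimal k-means sketch: alternate assignment and update steps
# until the assignments no longer change.
simple_kmeans <- function(x, k, max_iter = 100) {
  set.seed(42)                                   # reproducible initial means
  means <- x[sample(nrow(x), k), , drop = FALSE]
  assign <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment step: each observation joins the cluster with the closest mean
    d <- as.matrix(dist(rbind(means, x)))[-(1:k), 1:k]
    new_assign <- apply(d, 1, which.min)
    if (all(new_assign == assign)) break         # converged
    assign <- new_assign
    # Update step: new means are the centroids of the new clusters
    for (j in seq_len(k))
      means[j, ] <- colMeans(x[assign == j, , drop = FALSE])
  }
  list(cluster = assign, centers = means)
}

res <- simple_kmeans(as.matrix(iris[, -5]), k = 3)
table(res$cluster)
```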
variants of k-means
As it stands, the k-means algorithm gives different results depending on how the initial means are chosen. There have been a number of attempts in the literature to address this problem.

The cluster package in R implements three variants of k-means:

- pam: partitioning around medoids
- clara: clustering large applications
- fanny: fuzzy analysis clustering

In the next slide, we outline the k-medoids algorithm, which is implemented as the function pam.
partitioning around medoids
- Initialize by randomly selecting k of the n data points as the medoids.
- Associate each data point with the closest medoid.
- For each medoid m:
  - For each non-medoid data point o:
    - Swap m and o and compute the total cost of the configuration.
- Select the configuration with the lowest cost.
- Repeat until there is no change in the medoids.
distance measures
There are a number of ways to measure "closest" when implementing the k-medoids algorithm.

- Euclidean distance: $d(\vec{u}, \vec{v}) = \left( \sum_i (u_i - v_i)^2 \right)^{1/2}$
- Manhattan distance: $d(\vec{u}, \vec{v}) = \sum_i |u_i - v_i|$
- Minkowski distance: $d(\vec{u}, \vec{v}) = \left( \sum_i |u_i - v_i|^p \right)^{1/p}$

Note that Minkowski distance is a generalization of the other two distance measures, with $p = 2$ giving Euclidean distance and $p = 1$ giving Manhattan (or taxi-cab) distance.
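All three measures are available through R's built-in dist() function, and the generalization is easy to check: Minkowski with p = 2 and p = 1 reproduces the Euclidean and Manhattan results. The vectors u and v below are made up purely for illustration.

```r
# Compare the three distance measures on two illustrative vectors.
u <- c(1, 2, 3)
v <- c(4, 6, 3)
m <- rbind(u, v)
euc   <- as.numeric(dist(m, method = "euclidean"))          # sqrt(3^2 + 4^2) = 5
man   <- as.numeric(dist(m, method = "manhattan"))          # 3 + 4 = 7
mink2 <- as.numeric(dist(m, method = "minkowski", p = 2))   # same as Euclidean
mink1 <- as.numeric(dist(m, method = "minkowski", p = 1))   # same as Manhattan
c(euclidean = euc, manhattan = man, p2 = mink2, p1 = mink1)
```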
example data set
For purposes of demonstration we will again make use of the classic iris data set from R's datasets collection.
> summary(iris$Species)
setosa versicolor virginica
50 50 50
Can we throw away the Species attribute and recover it through unsupervised learning?
partitioning the iris dataset
> library(cluster) # load package
> dat <- iris[, -5] # drop known Species
> pam.result <- pam(dat,3) # perform k-medoids
> pam.result$clustering # print the clustering
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[18] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[35] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
[52] 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[69] 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2
[86] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2
[103] 3 3 3 3 2 3 3 3 3 3 3 2 2 3 3 3 3
[120] 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3
[137] 3 3 2 3 3 3 2 3 3 3 2 3 3 2
success rate
> # how many does it get wrong
> #
> sum(pam.result$clustering != as.numeric(iris$Species))
[1] 16
> #
> # plot the clusters and produce a cluster silhouette
> par(mfrow=c(2,1))
> plot(pam.result)
In the silhouette, a large si (close to 1) suggests that the observation is very well clustered; a small si (around 0) means that the observation lies between two clusters. Observations with a negative si are probably in the wrong cluster.
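The numbers behind the silhouette plot can also be read off the pam result directly: its silinfo component holds the per-observation widths si, the per-cluster averages, and the overall average silhouette width (which should match the 0.55 shown on the plot).

```r
library(cluster)               # for pam()
dat <- iris[, -5]
pam.result <- pam(dat, 3)
round(pam.result$silinfo$clus.avg.widths, 2)   # average si per cluster
round(pam.result$silinfo$avg.width, 2)         # overall average silhouette width
```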
cluster plot

[Figure: clusplot(pam(x = dat, k = 3)) — Component 1 vs Component 2; these two components explain 95.81% of the point variability. Silhouette plot of pam(x = dat, k = 3): n = 150, 3 clusters Cj; average silhouette width 0.55; per-cluster widths (j : nj | ave si) — 1 : 50 | 0.80, 2 : 62 | 0.42, 3 : 38 | 0.45.]
hierarchical clustering
In hierarchical clustering, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster.

At each stage, distances between clusters are recomputed by a dissimilarity formula according to the particular clustering method being used.
hierarchical clustering of iris dataset
The cluster package in R implements two variants of hierarchical clustering:

- agnes: AGglomerative NESting
- diana: DIvisive ANAlysis Clustering

However, R has a built-in hierarchical clustering routine called hclust (equivalent to agnes) which we will use to cluster the iris data set.

> dat <- iris[, -5]
> # perform hierarchical clustering
> hc <- hclust(dist(dat), "ave")
> # plot the dendrogram (plclust is defunct in current R; plot.hclust is equivalent)
> plot(hc, hang = -2)
cluster plot

[Figure: dendrogram produced by hclust (*, "average") on dist(dat), Height 0–4; observations 1–50 (setosa) form one well-separated branch, while observations 51–150 (versicolor and virginica) intermingle across the other two branches.]
Similar to the k-means clustering, hclust shows that cluster setosa can be easily separated from the other two clusters, and that clusters versicolor and virginica overlap each other to a small degree.
success rate
> # how many does it get wrong
> #
> clusGroup <- cutree(hc, k=3)
> sum(clusGroup != as.numeric(iris$Species))
[1] 14
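A confusion table makes it easier to see where those 14 errors come from. Note that the raw error count above assumes the cluster labels happen to line up with the numeric species codes, which they do here.

```r
# Cross-tabulate the hierarchical clusters against the true species.
dat <- iris[, -5]
hc <- hclust(dist(dat), "ave")
clusGroup <- cutree(hc, k = 3)
table(clusGroup, iris$Species)
```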
exercises
By invitation only:
Revisit the wine dataset from my website. This time, discard the Cultivar variable.

Use the pam routine from the cluster package to derive 3 clusters for the wine dataset. Plot the clusters in a 2D plane, then compute and report on the success rate of your chosen method.

Also perform a hierarchical clustering of the wine dataset and measure its performance at the 3-cluster level.

Email your wine clustering script to me by Monday the 9th of May, 06h00.