
Data Mining with R - Clustering

Hugh Murrell


reference books

These slides are based on a book by Graham Williams:

Data Mining with Rattle and R, The Art of Excavating Data for Knowledge Discovery.

For further background on clustering, try Andrew Moore's slides from: http://www.autonlab.org/tutorials

and, as always, Wikipedia is a useful source of information.


clustering

Clustering is one of the core tools that is used by the data miner.

Clustering gives us the opportunity to group observations in a generally unguided fashion according to how similar they are.

This is done on the basis of a measure of the distance between observations.

The aim of clustering is to identify groups of observations that are close together but as a group are quite separate from other groups.


k-means clustering

Given a set of observations $(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n)$, where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets $(S_1, S_2, \ldots, S_k)$ so as to minimize the within-cluster sum of squares:

$$\sum_{i=1}^{k} \sum_{\vec{x}_j \in S_i} \lVert \vec{x}_j - \vec{\mu}_i \rVert^2$$

where $\vec{\mu}_i$ is the mean of the observations in $S_i$.
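
As a concrete illustration (not part of the original slides), R's built-in kmeans function minimizes exactly this quantity; a minimal sketch on synthetic data, where the data and the choices k = 3 and nstart = 25 are purely illustrative:

set.seed(1)                                 # for a reproducible illustration
X <- matrix(rnorm(200 * 2), ncol = 2)       # 200 two-dimensional observations
km <- kmeans(X, centers = 3, nstart = 25)   # several random starts to avoid a poor initialization
km$tot.withinss                             # the within-cluster sum of squares being minimized
km$centers                                  # the cluster means (the mu_i in the formula above)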


k-means algorithm

Given an initial set of k means, $\vec{m}_1, \ldots, \vec{m}_k$, the algorithm proceeds by alternating between two steps:

- Assignment step: assign each observation to the cluster whose mean is closest to it.

- Update step: calculate the new means to be the centroids of the observations in the new clusters.

The algorithm has converged when the assignments no longer change.
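
The two steps are easy to express directly in R. Below is a minimal, illustrative sketch of the loop; it is not the implementation behind R's kmeans, the function and variable names are invented here, and it ignores edge cases such as a cluster becoming empty:

simple_kmeans <- function(X, k, max.iter = 100) {
  X <- as.matrix(X)
  means <- X[sample(nrow(X), k), , drop = FALSE]        # initial means: k randomly chosen observations
  assignment <- rep(0, nrow(X))
  for (iter in 1:max.iter) {
    # assignment step: distances from every observation to every mean,
    # then the index of the closest mean
    d <- as.matrix(dist(rbind(means, X)))[-(1:k), 1:k, drop = FALSE]
    new.assignment <- apply(d, 1, which.min)
    if (all(new.assignment == assignment)) break        # converged: assignments unchanged
    assignment <- new.assignment
    # update step: each mean becomes the centroid of its cluster
    for (i in 1:k)
      means[i, ] <- colMeans(X[assignment == i, , drop = FALSE])
  }
  list(cluster = assignment, centers = means)
}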


variants of k-means

As it stands, the k-means algorithm gives different results depending on how the initial means are chosen. There have thus been a number of attempts in the literature to address this problem.

The cluster package in R implements three variants of k-means:

- pam: partitioning around medoids

- clara: clustering large applications

- fanny: fuzzy analysis clustering

On the next slide, we outline the k-medoids algorithm, which is implemented as the function pam.
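
For reference, all three are called in much the same way; in the sketch below the object names are illustrative and arguments other than the data and k are left at their defaults:

library(cluster)        # provides pam, clara and fanny
dat <- iris[, -5]       # example data: iris measurements without Species
pam(dat, k = 3)         # partitioning around medoids
clara(dat, k = 3)       # PAM applied to subsamples, for large data sets
fanny(dat, k = 3)       # fuzzy clustering: soft cluster memberships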


partitioning around medoids

- Initialize by randomly selecting k of the n data points as the medoids.

- Associate each data point with the closest medoid.

- For each medoid m and each non-medoid data point o:

  - swap m and o and compute the total cost of the configuration.

- Select the configuration with the lowest cost.

- Repeat until there is no change in the medoids.
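
As a rough sketch (the function and variable names here are invented for illustration, and this is not the cluster::pam implementation), the "total cost of the configuration" in the swap step can be taken as the sum of each point's dissimilarity to its nearest medoid:

config_cost <- function(d, medoids) {
  # d: a full distance matrix, e.g. as.matrix(dist(dat))
  # medoids: a vector of row indices giving the current medoids
  sum(apply(d[, medoids, drop = FALSE], 1, min))   # each point contributes its distance to the nearest medoid
}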


distance measures

There are a number of ways to measure "closest" when implementing the k-medoids algorithm.

- Euclidean distance: $d(\vec{u}, \vec{v}) = \left( \sum_i (u_i - v_i)^2 \right)^{1/2}$

- Manhattan distance: $d(\vec{u}, \vec{v}) = \sum_i |u_i - v_i|$

- Minkowski distance: $d(\vec{u}, \vec{v}) = \left( \sum_i |u_i - v_i|^p \right)^{1/p}$

Note that the Minkowski distance is a generalization of the other two distance measures, with p = 2 giving Euclidean distance and p = 1 giving Manhattan (or taxi-cab) distance.
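
All three measures are available in R through the dist function, and its result can be handed straight to pam; a quick sketch, where the choice of p = 3 is just for illustration:

library(cluster)
dat <- iris[, -5]
d.euc <- dist(dat, method = "euclidean")   # the default
d.man <- dist(dat, method = "manhattan")
d.min <- dist(dat, method = "minkowski", p = 3)
pam(d.man, k = 3)                          # pam accepts a dissimilarity object directly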


example data set

For purposes of demonstration we will again make use of the classic iris data set from R's datasets collection.

> summary(iris$Species)

setosa versicolor virginica

50 50 50

Can we throw away the Species attribute and recover it through unsupervised learning?


partitioning the iris dataset

> library(cluster) # load package

> dat <- iris[, -5] # drop known Species

> pam.result <- pam(dat,3) # perform k-medoids

> pam.result$clustering # print the clustering

[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[18] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[35] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2

[52] 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

[69] 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2

[86] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2

[103] 3 3 3 3 2 3 3 3 3 3 3 2 2 3 3 3 3

[120] 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3

[137] 3 3 2 3 3 3 2 3 3 3 2 3 3 2


success rate

> # how many does it get wrong

> #

> sum(pam.result$clustering != as.numeric(iris$Species))

[1] 16

> #

> # plot the clusters and produce a cluster silhouette

> par(mfrow=c(2,1))

> plot(pam.result)

In the silhouette plot, a large $s_i$ (close to 1) suggests that the observation is very well clustered, a small $s_i$ (around 0) means that the observation lies between two clusters, and observations with a negative $s_i$ are probably in the wrong cluster.
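
The numbers behind the silhouette plot can also be read directly off the pam object; a quick sketch using the silinfo component documented for pam objects:

pam.result$silinfo$avg.width         # overall average silhouette width
pam.result$silinfo$clus.avg.widths   # average silhouette width per cluster
head(pam.result$silinfo$widths)      # per-observation silhouette values s_i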


cluster plot

[Figure: clusplot(pam(x = dat, k = 3)); the two components shown explain 95.81% of the point variability.]

[Figure: silhouette plot of pam(x = dat, k = 3); n = 150, 3 clusters; average silhouette width 0.55; per-cluster sizes and average silhouette widths: cluster 1: 50 obs, 0.80; cluster 2: 62 obs, 0.42; cluster 3: 38 obs, 0.45.]


hierarchical clustering

In hierarchical clustering, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster.

At each stage, distances between clusters are recomputed by a dissimilarity formula according to the particular clustering method being used.
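
In R's built-in hclust (used on the next slide), this "particular clustering method" is selected via the method argument; a quick sketch of some common linkage choices, applied here to the iris measurements purely as an example:

d <- dist(iris[, -5])
hc.single   <- hclust(d, method = "single")     # cluster distance = nearest pair of points
hc.complete <- hclust(d, method = "complete")   # cluster distance = furthest pair of points
hc.average  <- hclust(d, method = "average")    # cluster distance = mean of all pairwise distances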


hierarchical clustering of iris dataset

The cluster package in R implements two variants of hierarchical clustering:

- agnes: AGglomerative NESting

- diana: DIvisive ANAlysis Clustering

However, R has a built-in hierarchical clustering routine called hclust (equivalent to agnes), which we will use to cluster the iris data set.

> dat <- iris[, -5]

> # perform hierarchical clustering with average linkage

> hc <- hclust(dist(dat), "ave")

> # plot the dendrogram (plot.hclust; the older plclust is defunct)

> plot(hc, hang = -2)


cluster plot

[Figure: dendrogram produced by hclust(dist(dat), "average"); leaf labels are iris observation numbers and the vertical axis is Height.]
Similar to the k-means clustering, hclust shows that the setosa cluster can easily be separated from the other two clusters, and that the versicolor and virginica clusters overlap with each other to a small degree.


success rate

> # how many does it get wrong

> #

> clusGroup <- cutree(hc, k=3)

> sum(clusGroup != as.numeric(iris$Species))

[1] 14
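
Note that this count relies on the cluster labels 1, 2, 3 happening to line up with the integer coding of Species; a safer check is a cross-tabulation (output omitted here):

table(clusGroup, iris$Species)   # rows: hierarchical clusters, columns: true species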


exercises

By invitation only:

Revisit the wine dataset from my website. This time discard the Cultivar variable.

Use the pam routine from the cluster package to derive 3 clusters for the wine dataset. Plot the clusters in a 2D plane, and compute and report on the success rate of your chosen method.

Also perform a hierarchical clustering of the wine dataset and measure its performance at the 3-cluster level.

Email your wine clustering script to me by Monday the 9th May, 06h00.