Data Mining with R: Clustering
Hugh Murrell
reference books
These slides are based on a book by Graham Williams:
Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery.
For further background on decision trees try Andrew Moore's slides from: http://www.autonlab.org/tutorials
and, as always, Wikipedia is a useful source of information.
clustering
Clustering is one of the core tools used by the data miner.
Clustering gives us the opportunity to group observations in a generally unguided fashion according to how similar they are.
This is done on the basis of a measure of the distance between observations.
The aim of clustering is to identify groups of observations that are close together but as a group are quite separate from other groups.
k-means clustering
Given a set of observations $(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n)$, where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets $(S_1, S_2, \ldots, S_k)$ so as to minimize the within-cluster sum of squares:

$$\sum_{i=1}^{k} \sum_{\vec{x}_j \in S_i} \|\vec{x}_j - \vec{\mu}_i\|^2$$

where $\vec{\mu}_i$ is the mean of the observations in $S_i$.
k-means algorithm
Given an initial set of k means, $\vec{m}_1, \ldots, \vec{m}_k$, the algorithm proceeds by alternating between two steps:
I Assignment step: Assign each observation to the cluster whose mean is closest to it.
I Update step: Calculate the new means to be the centroids of the observations in the new clusters.
The algorithm has converged when the assignments no longer change.
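These two alternating steps are what R's built-in kmeans function iterates. As a quick sketch on synthetic data (the data and seed here are our own invention, purely for illustration):

```r
# k-means on two well-separated synthetic 2-d clusters (illustrative only)
set.seed(1)                          # reproducible random start
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
km <- kmeans(x, centers = 2)         # alternate assignment / update steps
km$centers                           # the final cluster means
km$tot.withinss                      # the minimized within-cluster sum of squares
```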
variants of k-means
As it stands, the k-means algorithm gives different results depending on how the initial means are chosen. Thus there have been a number of attempts in the literature to address this problem.
The cluster package in R implements three variants of k-means.
I pam: partitioning around medoids
I clara: clustering large applications
I fanny: fuzzy analysis clustering
In the next slide, we outline the k-medoids algorithm which is implemented as the function pam.
partitioning around medoids
I Initialize by randomly selecting k of the n data points as the medoids.
I Associate each data point with the closest medoid.
I For each medoid m:
  I For each non-medoid data point o:
    I Swap m and o and compute the total cost of the configuration.
I Select the configuration with the lowest cost.
I Repeat until there is no change in the medoids.
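The total cost referred to above is simply the sum of distances from each observation to its nearest medoid. A small helper of our own (not part of the cluster package) makes this concrete:

```r
# cost of a medoid configuration: for every observation, take the
# distance to its nearest medoid, then sum (illustrative helper only)
config.cost <- function(dat, medoid.idx) {
  d <- as.matrix(dist(dat))                         # full pairwise distance matrix
  sum(apply(d[, medoid.idx, drop = FALSE], 1, min)) # nearest-medoid distances
}
```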
distance measures
There are a number of ways to measure "closest" when implementing the k-medoids algorithm.
I Euclidean distance: $d(\vec{u}, \vec{v}) = \left(\sum_i (u_i - v_i)^2\right)^{1/2}$
I Manhattan distance: $d(\vec{u}, \vec{v}) = \sum_i |u_i - v_i|$
I Minkowski distance: $d(\vec{u}, \vec{v}) = \left(\sum_i |u_i - v_i|^p\right)^{1/p}$
Note that Minkowski distance is a generalization of the other two distance measures, with p = 2 giving Euclidean distance and p = 1 giving Manhattan (or taxi-cab) distance.
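R's dist function implements all three measures, and the Minkowski form with p = 1 or p = 2 reproduces the other two, as a quick check shows:

```r
# distances between u = (0,0) and v = (3,4) under each measure
m <- rbind(c(0, 0), c(3, 4))
dist(m, method = "euclidean")            # 5
dist(m, method = "manhattan")            # 7
dist(m, method = "minkowski", p = 2)     # 5, same as Euclidean
dist(m, method = "minkowski", p = 1)     # 7, same as Manhattan
```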
example data set
For purposes of demonstration we will again make use of the classic iris data set from R's datasets collection.
> summary(iris$Species)
setosa versicolor virginica
50 50 50
Can we throw away the Species attribute and recover it through unsupervised learning?
partitioning the iris dataset
> library(cluster) # load package
> dat <- iris[, -5] # drop known Species
> pam.result <- pam(dat,3) # perform k-medoids
> pam.result$clustering # print the clustering
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[18] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[35] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
[52] 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[69] 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2
[86] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2
[103] 3 3 3 3 2 3 3 3 3 3 3 2 2 3 3 3 3
[120] 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3
[137] 3 3 2 3 3 3 2 3 3 3 2 3 3 2
success rate
> # how many does it get wrong
> #
> sum(pam.result$clustering != as.numeric(iris$Species))
[1] 16
> #
> # plot the clusters and produce a cluster silhouette
> par(mfrow=c(2,1))
> plot(pam.result)
In the silhouette, a large si (almost 1) suggests that the observations are very well clustered; a small si (around 0) means that the observation lies between two clusters. Observations with a negative si are probably in the wrong cluster.
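The same silhouette information can be read off the pam result numerically (the silinfo component is documented in the cluster package):

```r
# numeric view of the silhouette: per observation, per cluster and overall
library(cluster)
pam.result <- pam(iris[, -5], 3)
head(pam.result$silinfo$widths)          # si for individual observations
pam.result$silinfo$clus.avg.widths       # average si per cluster
pam.result$silinfo$avg.width             # overall average silhouette width
```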
cluster plot
[Figure: clusplot(pam(x = dat, k = 3)), Component 1 vs Component 2. These two components explain 95.81 % of the point variability.]

[Figure: Silhouette plot of pam(x = dat, k = 3). n = 150, 3 clusters; average silhouette width: 0.55. Per-cluster averages: cluster 1, 50 observations, 0.80; cluster 2, 62 observations, 0.42; cluster 3, 38 observations, 0.45.]
hierarchical clustering
In hierarchical clustering, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster.
At each stage distances between clusters are recomputed by a dissimilarity formula according to the particular clustering method being used.
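In R's hclust, this dissimilarity formula is selected through the method argument; the common linkage choices are:

```r
# three standard linkage rules for recomputing cluster-to-cluster distance
d <- dist(iris[, -5])
hc.single   <- hclust(d, method = "single")    # distance between closest members
hc.complete <- hclust(d, method = "complete")  # distance between farthest members
hc.average  <- hclust(d, method = "average")   # mean of all pairwise distances
```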
hierarchical clustering of iris dataset
The cluster package in R implements two variants of hierarchical clustering.
I agnes: AGglomerative NESting
I diana: DIvisive ANAlysis Clustering
However, R has a built-in hierarchical clustering routine called hclust (equivalent to agnes) which we will use to cluster the iris data set.
> dat <- iris[, -5]
> # perform hierarchical clustering
> hc <- hclust(dist(dat),"ave")
> # plot the dendrogram (plclust is defunct in modern R; use plot)
> plot(hc, hang=-2)
cluster plot
[Figure: dendrogram produced by hclust (*, "average") on dist(dat); leaf labels are the iris observation numbers, vertical axis is Height.]
Similar to the k-means clustering, hclust shows that cluster setosa can be easily separated from the other two clusters, and that clusters versicolor and virginica overlap with each other to a small degree.
success rate
> # how many does it get wrong
> #
> clusGroup <- cutree(hc, k=3)
> sum(clusGroup != as.numeric(iris$Species))
[1] 14
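Note that the comparison above only works because cutree happens to number its clusters in the same order as the Species codes. A label-free check via a confusion table is safer; a small sketch:

```r
# confusion table between recovered clusters and the true species;
# a good clustering shows one dominant species per row, whatever the labels
hc <- hclust(dist(iris[, -5]), "ave")
clusGroup <- cutree(hc, k = 3)
table(clusGroup, iris$Species)
```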
exercises
By invitation only:
Revisit the wine dataset from my website. This time discard the Cultivar variable.
Use the pam routine from the cluster package to derive 3 clusters for the wine dataset. Plot the clusters in a 2D plane and compute and report on the success rate of your chosen method.
Also perform a hierarchical clustering of the wine dataset and measure its performance at the 3-cluster level.
Email your wine clustering script to me by Monday the 9th May, 06h00.