machine learning - Clustering in R


Page 1: machine learning - Clustering in R

Clustering Agenda

• Definition of clustering
• Existing clustering methods
• Clustering examples
• Clustering demonstration
• Clustering validity

Page 2: machine learning - Clustering in R

Definition

• Clustering can be considered the most important unsupervised learning technique; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data.

• Unsupervised: no information is provided to the algorithm on which data points belong to which clusters.

• Clustering is “the process of organizing objects into groups whose members are similar in some way”.

• A cluster is therefore a collection of objects which are “similar” between them and are “dissimilar” to the objects belonging to other clusters.

Page 3: machine learning - Clustering in R

What Cluster Analysis is not

• Supervised classification: class label information is available.
• Simple segmentation: e.g. dividing students into different registration groups alphabetically, by last name.
• Results of a query: groupings are the result of an external specification.

Page 4: machine learning - Clustering in R

Why and Where to use Clustering?

Why?
• Simplification
• Pattern detection
• Useful in data concept construction
• Unsupervised learning process

Where?
• Data mining
• Information retrieval
• Text mining
• Web analysis
• Marketing
• Medical diagnostics

Page 5: machine learning - Clustering in R

Applications

• Retail – group similar customers
• Biology – group similar plants/animals to study their common behavior
• Financial services – group similar types of accounts or customers
• Airline – group similar types of customers to offer different discounts
• Insurance – group consumers of a similar nature, as well as claims, to inform policy decisions
• Government – group similar areas to announce various subsidies or other benefits

Page 6: machine learning - Clustering in R

Which method to use? It depends on the following:

• Type of attributes in the data (dictates the type of similarity measure)
• Scalability to larger datasets
• Ability to work with irregular data
• Time cost
• Complexity
• Data order dependency
• Result presentation

Page 7: machine learning - Clustering in R

Major existing clustering algorithms

• K-means and its variants

• Hierarchical clustering

• Density-based clustering

Page 8: machine learning - Clustering in R

K-means Clustering

• Partition clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• The number of clusters, K, must be specified
• The basic algorithm is very simple

Page 9: machine learning - Clustering in R

K-means Clustering – Details

• Initial centroids are often chosen randomly; clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, correlation, etc.
• K-means will converge for the common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations; often the stopping condition is changed to ‘until relatively few points change clusters’.
• Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes.

Page 10: machine learning - Clustering in R

Evaluating K-means Clusters

• The most common measure is the Sum of Squared Error (SSE).
• For each point, the error is the distance to the nearest cluster centroid; to get the SSE, we square these errors and sum them (see the formula below).
• x is a data point in cluster Ci, and mi is the representative point for cluster Ci.
• It can be shown that mi corresponds to the center (mean) of the cluster.
• Given two clusters, we can choose the one with the smallest error.
• One easy way to reduce SSE is to increase K, the number of clusters; however, a good clustering with a smaller K can have a lower SSE than a poor clustering with a higher K.

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)
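In R, this SSE is exactly what kmeans() reports as tot.withinss; a quick sketch, assuming the scaled ruspini data used in the case study later in this deck:

km <- kmeans(ruspini.scaled, centers = 4, nstart = 10)
km$tot.withinss  # the SSE defined above: sum over clusters of squared distances to centroids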

Page 11: machine learning - Clustering in R

Limitations of K-means

• K-means has problems when clusters are of differing:
  • sizes
  • densities
  • non-globular shapes

• K-means has problems when the data contains outliers.

Page 12: machine learning - Clustering in R

K-means Clustering Algorithm

1. The K-means algorithm calculates the arithmetic mean of each cluster formed in the dataset.
a) The arithmetic mean of a cluster is the mean of all the individual records in the cluster. In each of the first K initial clusters, there is only one record.
b) The arithmetic mean of a cluster with one record is the set of values that make up that record.
c) For example, suppose the dataset S holds, for each USER, the average transaction amount, the number of merchant categories transacted, and the age (in months) with Citrus, so that a record P is represented as P = {Avg_Txn_Amt, Mer_Cat_Cnt, Age_Citrus}.
d) Then a record containing the measurements of a user (9898084242) would be represented as 9898084242 = {2000, 3, 6}, where the user's average transaction amount = 2000 Rs, merchant categories = 3, and age with Citrus = 6 months.
e) Since there is only one record in each initial cluster, the arithmetic mean of a cluster with only the record for 9898084242 as a member = {2000, 3, 6}.

2. Next, K-means assigns each record in the dataset to exactly one of the initial clusters. Each record is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity such as the Euclidean distance or the Manhattan/city-block distance (see the sketch below).
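To make the distance step concrete, here is a small sketch using the two example records from this walk-through (the variable names are just the illustrative fields above):

# two example user records: {Avg_Txn_Amt, Mer_Cat_Cnt, Age_Citrus}
p1 <- c(2000, 3, 6)  # user 9898084242
p2 <- c(1000, 2, 2)  # user 8652084242
records <- rbind(p1, p2)
dist(records, method = "euclidean")  # straight-line distance
dist(records, method = "manhattan")  # city-block distance
# note: on raw values Avg_Txn_Amt dominates both measures,
# which is why the case study below scales the data first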

Page 13: machine learning - Clustering in R


K-means Clustering Algorithm

3. K-means re-assigns each record in the dataset to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset. The arithmetic mean of a cluster is the arithmetic mean of all the records in that cluster.

4. For example, if a cluster contains two records, where the measurements are 9898084242 = {2000, 3, 6} and 8652084242 = {1000, 2, 2}, then the arithmetic mean Pmean is represented as Pmean = {Avg_mean, Mer_Cat_mean, Age_mean}, with Avg_mean = (2000 + 1000)/2, Mer_Cat_mean = (3 + 2)/2, and Age_mean = (6 + 2)/2. The arithmetic mean of this cluster = {1500, 2.5, 4}. This new arithmetic mean becomes the center of this cluster. Following the same procedure, new cluster centers are formed for all the existing clusters.

5. K-means re-assigns each record in the dataset to exactly one of the new clusters formed. A record or data point is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity.

6. The preceding steps are repeated until stable clusters are formed and the K-means clustering procedure is complete. Stable clusters are formed when new iterations of the algorithm do not create new clusters, i.e., the cluster center (arithmetic mean) of each cluster is the same as the old cluster center. There are different techniques for determining when a stable cluster is formed or when the procedure is complete (a loop sketch follows below).
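To tie steps 1 through 6 together, here is a minimal, unoptimized sketch of the assign/re-average loop (illustration only; edge cases such as empty clusters are ignored, and the built-in kmeans() used later in this deck is the practical choice):

naive_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]  # step 1: k one-record initial means
  assignment <- rep(1L, nrow(X))
  for (iter in seq_len(max_iter)) {
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]  # record-to-center Euclidean distances
    assignment <- max.col(-d)  # steps 2/3/5: nearest center per record
    # re-compute each center as the arithmetic mean of its records
    new_centers <- t(sapply(1:k, function(j) colMeans(X[assignment == j, , drop = FALSE])))
    if (all(abs(new_centers - centers) < 1e-8)) break  # step 6: centers stable, clusters stable
    centers <- new_centers
  }
  list(cluster = assignment, centers = centers)
}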

Page 14: machine learning - Clustering in R

K - means clustering - demonstration

1) k initial "means" (in this case k = 3) are randomly selected from the data set.

2) k clusters are created by associating every observation with the nearest mean.

3) The centroid of each of the k clusters becomes the new mean.

4) Steps 2 and 3 are repeated until convergence is reached.

Page 15: machine learning - Clustering in R


Steps: K-means Clustering Analysis

• It is important to define the problem to be solved beforehand, so that the clustering method, variables, and data range can be selected.
• Variable identification
• Variable categorization (e.g. numeric, categorical, discrete, continuous)
• Conversion of non-numeric variables to numeric form (see the sketch below)
• Running descriptive analysis
• Importing data
• Selecting the variables
• Scaling the variables to a common metric
• Deciding on the number of clusters to be created
• Running the analysis
• Interpreting the results
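Since k-means works on numeric distances, non-numeric variables must be encoded first. A minimal sketch of one common approach (the data frame and column names here are hypothetical, for illustration only):

customers <- data.frame(
  txn_amt = c(2000, 1000, 1500),         # hypothetical numeric variable
  segment = c("gold", "silver", "gold")  # hypothetical categorical variable
)
X <- model.matrix(~ txn_amt + segment, data = customers)[, -1]  # one-hot encode, drop intercept
X.scaled <- scale(X)  # bring variables to a common metric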

Page 16: machine learning - Clustering in R

Case: K-Means

• Step 1: Data preparation and selecting variables
• Step 2: Scaling data – ruspini.scaled <- scale(ruspini)
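The ruspini data used throughout this case ships with the cluster package, so steps 1 and 2 can be reproduced with:

library(cluster)                  # provides the ruspini dataset (75 2-D points)
data(ruspini, package = "cluster")
ruspini.scaled <- scale(ruspini)  # zero mean, unit variance per column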

Page 17: machine learning - Clustering in R

Case: K-Means

• Step 3: Identify the number of clusters
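The slide presents this graphically; one common way to reproduce it is an 'elbow' plot of the total within-cluster SSE against k (a sketch of one heuristic, not the only option):

wss <- sapply(1:10, function(k) kmeans(ruspini.scaled, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SSE")
# choose k near the 'elbow', where adding clusters stops reducing SSE much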

Page 18: machine learning - Clustering in R

Case: K-Means

• Step 4: K-means cluster

km <- kmeans(ruspini.scaled, centers=6, nstart=10)
km

Page 19: machine learning - Clustering in R

Case: K-Means

• Step 5: Plot cluster

plot(ruspini.scaled, col=km$cluster)
points(km$centers, pch=3, cex=2)  # this adds the centroids
text(km$centers, labels=1:6, pos=2)  # this adds the cluster IDs (one per centroid; the fit above used centers=6)

Page 20: machine learning - Clustering in R

Hierarchical Clustering

[Figure: four points p1, p2, p3, p4 clustered by traditional hierarchical clustering (shown with a traditional dendrogram) and by non-traditional hierarchical clustering (shown with a non-traditional dendrogram)]

Page 22: machine learning - Clustering in R

Hierarchical clustering

Agglomerative (bottom up)

1. Start with each point as its own cluster (a singleton).

2. Recursively merge the most appropriate (e.g. closest) clusters.

3. Stop when k clusters are reached.

Divisive (top down)

1. Start with one big cluster containing all points.

2. Recursively divide it into smaller clusters.

3. Stop when k clusters are reached.
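In R, hclust() implements the agglomerative strategy, and the cluster package offers diana() for the divisive strategy. A sketch using the scaled data from the case study:

hc.agg <- hclust(dist(ruspini.scaled), method = "complete")  # agglomerative (bottom up)
library(cluster)
hc.div <- diana(ruspini.scaled)   # divisive (top down)
cutree(hc.agg, k = 4)             # cut either tree into k clusters
cutree(as.hclust(hc.div), k = 4)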

Page 23: machine learning - Clustering in R

Case: Hierarchical Clustering

• Step 1: Get distances between data points

dist.ruspini <- dist(ruspini.scaled)

Page 24: machine learning - Clustering in R

Case: Hierarchical Clustering

• Step 2: Create and plot the cluster tree

hc.ruspini <- hclust(dist.ruspini, method="complete")
plot(hc.ruspini)
rect.hclust(hc.ruspini, k=4)
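To use the groups downstream (e.g. to color a scatter plot, mirroring the k-means case), the tree can be cut into k labels:

cl <- cutree(hc.ruspini, k = 4)  # one cluster label per observation
plot(ruspini.scaled, col = cl)   # compare visually with the k-means result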

Page 25: machine learning - Clustering in R

Density-Based Clustering: DBSCAN

• DBSCAN is a density-based algorithm.
• Density = number of points within a specified radius (Eps).
• A point is a core point if it has more than a specified number of points (MinPts) within Eps; these are points at the interior of a cluster.
• A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
• A noise point is any point that is neither a core point nor a border point.

Page 26: machine learning - Clustering in R

DBSCAN Algorithm

• Eliminate noise points.
• Perform clustering on the remaining points.

Page 27: machine learning - Clustering in R

Case: DBSCAN density-based clustering

• Step 1: Get the kNN distance plot to choose an epsilon value (read eps off at the 'knee' of the curve).

library(dbscan)
kNNdistplot(ruspini.scaled, k = 3)
abline(h=.25, col="red")

Page 28: machine learning - Clustering in R

Case: DBSCAN density-based clustering

• Step 2: Run DBSCAN.

db <- dbscan(ruspini.scaled, eps=.25, minPts=3)
db

Page 29: machine learning - Clustering in R

Case: DBSCAN density-based clustering

• Step 3: Plot cluster.

plot(ruspini.scaled, col=db$cluster+1L)
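In the dbscan package, noise points receive cluster id 0, which is why the plot above shifts the colors by +1L so that noise still gets a visible color. A short follow-on sketch to inspect the noise points, assuming the db object above:

table(db$cluster)  # cluster sizes; the '0' entry counts noise points
noise <- ruspini.scaled[db$cluster == 0, , drop = FALSE]
points(noise, pch = 4)  # mark the noise points on the existing plot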

Page 30: machine learning - Clustering in R

Cluster Validity

• For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall.
• For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters.
• Why do we want to evaluate them?
  • To avoid finding patterns in noise
  • To compare clustering algorithms
  • To compare two sets of clusters
  • To compare two clusters

Page 31: machine learning - Clustering in R

Cluster Validation – Different Aspects

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.

2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.

3. Evaluating how well the results of a cluster analysis fit the data without reference to external information, i.e., using only the data.

4. Comparing the results of two different sets of cluster analyses to determine which is better.

5. Determining the ‘correct’ number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Page 32: machine learning - Clustering in R

Cluster Validity: Measures

• Below are the different types of numerical measures that are applied to assess cluster validity.
• External index: measures the extent to which cluster labels match externally supplied class labels.
  • Example: entropy
• Internal index: measures the goodness of a clustering structure without respect to external information.
  • Example: Sum of Squared Error (SSE)
• Relative index: compares two different clusterings or clusters.
  • Often an external or internal index is used for this function, e.g. SSE or entropy.
• Sometimes these are referred to as criteria instead of indices; however, sometimes ‘criterion’ is the general strategy and ‘index’ is the numerical measure that implements the criterion.

Page 33: machine learning - Clustering in R

Framework for Cluster Validity

• We need a framework to interpret any measure. For example, if our measure of evaluation has the value 10, is that good, fair, or poor?
• Statistics provide a framework for cluster validity:
  • The more "atypical" a clustering result is, the more likely it represents valid structure in the data.
  • We can compare the value of an index obtained on random data or random clusterings to the value obtained on the actual clustering result; if the latter is unlikely under randomness, then the cluster results are valid.
  • These approaches are more complicated and harder to understand.
• For comparing the results of two different sets of cluster analyses, a framework is less necessary; however, there is still the question of whether the difference between two index values is significant.

Page 34: machine learning - Clustering in R

Internal Measures: Cohesion and Separation

• Cluster cohesion: measures how closely related the objects in a cluster are.
  • Example: SSE
• Cluster separation: measures how distinct or well-separated a cluster is from other clusters.
  • Example: squared error
• Cohesion is measured by the within-cluster sum of squares (WSS, i.e. the SSE).
• Separation is measured by the between-cluster sum of squares (BSS), where |Ci| is the size of cluster i (see the formulas below).

WSS = \sum_{i} \sum_{x \in C_i} (x - m_i)^2

BSS = \sum_{i} |C_i| \, (m - m_i)^2

where m_i is the mean of cluster C_i and m is the overall mean of the data.
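A sketch of both quantities in R, computed from a k-means fit on the scaled case-study data; kmeans() also reports them directly as tot.withinss and betweenss, which the last line uses as a cross-check:

km <- kmeans(ruspini.scaled, centers = 4, nstart = 10)
wss <- sum(sapply(1:4, function(i) {
  pts <- ruspini.scaled[km$cluster == i, , drop = FALSE]
  sum(sweep(pts, 2, km$centers[i, ])^2)  # sum of (x - m_i)^2 within cluster i
}))
m <- colMeans(ruspini.scaled)  # overall mean of the data
bss <- sum(sapply(1:4, function(i) km$size[i] * sum((m - km$centers[i, ])^2)))  # |C_i| * (m - m_i)^2
c(wss = wss, bss = bss)
c(km$tot.withinss, km$betweenss)  # should match, up to rounding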