Pattern recognition binoy k means clustering

Post on 09-Feb-2017



Clustering

Binoy B Nair

What is Clustering?

• Cluster: a collection of data objects
  • Similar to one another within the same cluster
  • Dissimilar to the objects in other clusters
• Cluster analysis: grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined classes
• Typical applications
  • As a stand-alone tool to get insight into the data distribution
  • As a preprocessing step for other algorithms


Examples of Clustering Applications

• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

• Land use: identification of areas of similar land use in an earth-observation database

• Insurance: identifying groups of motor insurance policy holders with a high average claim cost

• Urban planning: identifying groups of houses according to their house type, value, and geographical location

• Seismology: observed earthquake epicenters should be clustered along continental faults


What Is a Good Clustering?

• A good clustering method will produce clusters with
  • High intra-class similarity
  • Low inter-class similarity
• A precise definition of clustering quality is difficult
  • Application-dependent
  • Ultimately subjective


Requirements for Clustering in Data Mining

• Scalability

• Ability to deal with different types of attributes

• Discovery of clusters with arbitrary shape

• Minimal domain knowledge required to determine input parameters

• Ability to deal with noise and outliers

• Insensitivity to order of input records

• Robustness with respect to high dimensionality

• Incorporation of user-specified constraints

• Interpretability and usability


Major Clustering Approaches

• Partitioning: Construct various partitions and then evaluate them by some criterion

• Hierarchical: Create a hierarchical decomposition of the set of objects using some criterion

• Model-based: Hypothesize a model for each cluster and find best fit of models to data

• Density-based: Guided by connectivity and density functions


Partitioning Algorithms

• Partitioning method: Construct a partition of a database D of n objects into a set of k clusters

• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  • Global optimum: exhaustively enumerate all partitions
  • Heuristic methods: the k-means and k-medoids algorithms
    • k-means (MacQueen, 1967): each cluster is represented by the center of the cluster
    • k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster

Partitional Clustering

• Each instance is placed in exactly one of K nonoverlapping clusters.

• Since only one set of clusters is output, the user normally has to input the desired number of clusters K.


Similarity and Dissimilarity Between Objects

• Euclidean distance (for objects i and j with p attributes):

  d(i, j) = sqrt( |x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2 )

• Properties of a metric d(i, j):
  • d(i, j) ≥ 0
  • d(i, i) = 0
  • d(i, j) = d(j, i)
  • d(i, j) ≤ d(i, k) + d(k, j)
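The distance computation above can be sketched in a few lines of Python (a minimal illustration; the function name is my own, not from the slides):

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Distance between two of the points used in the worked example below:
d = euclidean_distance((1, 1), (5, 7))
print(round(d, 2))  # 7.21
```

The metric properties are easy to sanity-check with this function: the distance from a point to itself is 0, and swapping the arguments gives the same value.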

Squared Error

[Figure: scatter plot (axes roughly 1-10 by 1-9) illustrating the squared-error criterion; the slide graphic is not reproduced in this transcript.]

Objective Function
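The transcript omits the slide's equation, but the squared-error objective that k-means minimizes is, in its standard form:

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2}
```

where C_j is the j-th cluster and \mu_j is its centroid (mean vector). Each assignment and re-estimation step of the algorithm below can only decrease (or leave unchanged) this quantity.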

Algorithm k-means

1. Decide on a value for k.

2. Initialize the k cluster centers (randomly, if necessary).

3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.

4. Re-estimate the k cluster centers, by assuming the memberships found above are correct.

5. If none of the N objects changed membership in the last iteration, exit. Otherwise, go to step 3.
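The five steps above can be sketched as a short Python implementation (a minimal illustration assuming Euclidean distance and numeric feature tuples; the function name and structure are my own, not from the slides):

```python
import math

def kmeans(points, centers, max_iter=100):
    """Plain batch k-means: assign each point to its nearest center,
    then re-estimate each center as the mean of its members."""
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Step 3: decide memberships by nearest cluster center.
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Step 5: exit when no object changed membership.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: re-estimate the cluster centers from the memberships.
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# The seven-point data set used in the worked example later in these
# slides, seeded with the two furthest-apart individuals:
pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
labels, centers = kmeans(pts, [(1, 1), (5, 7)])
print(labels)  # [0, 0, 1, 1, 1, 1, 1]
```

Run on the worked example's data, this batch version converges to the same final partition the slides derive: objects 1-2 in one cluster and objects 3-7 in the other.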

K-means Clustering: Steps 1-5 (Algorithm: k-means; Distance Metric: Euclidean Distance)

[Figures: five scatter plots of expression data (axes: expression in condition 1 vs. expression in condition 2, each running 0-5) showing the three centroids k1, k2, and k3 being repositioned and the points reassigned at each step; the slide graphics are not reproduced in this transcript.]

Worked-out Example

As a simple illustration of the k-means algorithm, consider the following data set, consisting of the scores of two variables on each of seven individuals:

Subject | Features
1 | (1, 1)
2 | (1.5, 2)
3 | (3, 4)
4 | (5, 7)
5 | (3.5, 5)
6 | (4.5, 5)
7 | (3.5, 4.5)

Scatter Plot

[Figure: scatter plot of the seven individuals; the slide graphic is not reproduced in this transcript.]

Working

This data set is to be grouped into two clusters, i.e., k = 2. As a first step in finding a sensible initial partition, let the feature values of the two individuals furthest apart (using the Euclidean distance measure) define the initial cluster means, giving:

        | Individual | Mean Vector (centroid)
Group 1 | 1          | (1, 1)
Group 2 | 4          | (5, 7)

Working

• The remaining individuals are now examined in sequence and allocated to the cluster to which they are closest, in terms of Euclidean distance to the cluster mean.

• The mean vector is recalculated each time a new member is added. 

• This leads to the following series of steps:
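Before walking through those steps, the running recalculation of the mean described above can be expressed with the standard incremental-mean update (an illustrative snippet, not from the slides; the function name is my own):

```python
def add_to_cluster(mean, n, point):
    """Update a cluster's mean incrementally when one new point joins.
    mean: current centroid, n: current cluster size, point: new member."""
    n += 1
    # Shift each coordinate of the mean toward the new point by 1/n.
    mean = tuple(m + (p - m) / n for m, p in zip(mean, point))
    return mean, n

# Cluster 1 starts as individual 1 alone; individual 2 then joins it:
mean, n = (1.0, 1.0), 1
mean, n = add_to_cluster(mean, n, (1.5, 2.0))
print(mean)  # (1.25, 1.5)
```

This gives the same result as averaging all members from scratch, without storing them.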



Iteration 1 - Assign objects to the closest clusters

Object (Oi) | Features   | Centroid 1 (C1) | D(Oi, C1) | Centroid 2 (C2) | D(Oi, C2) | Closest Centroid
1           | (1, 1)     | (1, 1)          | 0         | (5, 7)          | 7.21      | C1
2           | (1.5, 2)   | (1, 1)          | 1.11      | (5, 7)          | 6.1       | C1
3           | (3, 4)     | (1, 1)          | 3.05      | (5, 7)          | 3.60      | C1
4           | (5, 7)     | (1, 1)          | 7.21      | (5, 7)          | 0         | C2
5           | (3.5, 5)   | (1, 1)          | 4.71      | (5, 7)          | 2.5       | C2
6           | (4.5, 5)   | (1, 1)          | 5.31      | (5, 7)          | 2.06      | C2
7           | (3.5, 4.5) | (1, 1)          | 4.3       | (5, 7)          | 2.91      | C2

A cluster is defined by its centroid. Object 1 is assigned to cluster 1, and so on.

Recomputing Centroids at the End of Iteration 1

Now the initial partition has changed, and the two clusters at this stage have the following characteristics:

          | Individuals | New Centroids
Cluster 1 | 1, 2, 3     | C1 = ((1,1) + (1.5,2) + (3,4))/3 = (1.8, 2.3)
Cluster 2 | 4, 5, 6, 7  | C2 = ((5,7) + (3.5,5) + (4.5,5) + (3.5,4.5))/4 = (4.1, 5.4)
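The centroid arithmetic above can be checked with a coordinate-wise mean (illustrative only; note the exact values are (1.833..., 2.333...) and (4.125, 5.375), which the slides display rounded to one decimal):

```python
def centroid(points):
    """Mean vector of a list of 2-D points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

c1 = centroid([(1, 1), (1.5, 2), (3, 4)])
c2 = centroid([(5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)])
print(tuple(round(v, 1) for v in c1))  # (1.8, 2.3)
print(c2)                              # (4.125, 5.375)
```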


Iteration 2 - Check if any object has changed clusters

Object (Oi) | Features   | Centroid 1 (C1) | D(Oi, C1) | Centroid 2 (C2) | D(Oi, C2) | Closest Centroid
1           | (1, 1)     | (1.8, 2.3)      | 1.53      | (4.1, 5.4)      | 5.38      | C1
2           | (1.5, 2)   | (1.8, 2.3)      | 0.42      | (4.1, 5.4)      | 4.28      | C1
3           | (3, 4)     | (1.8, 2.3)      | 2.08      | (4.1, 5.4)      | 1.78      | C2
4           | (5, 7)     | (1.8, 2.3)      | 5.69      | (4.1, 5.4)      | 1.84      | C2
5           | (3.5, 5)   | (1.8, 2.3)      | 3.19      | (4.1, 5.4)      | 0.72      | C2
6           | (4.5, 5)   | (1.8, 2.3)      | 3.82      | (4.1, 5.4)      | 0.57      | C2
7           | (3.5, 4.5) | (1.8, 2.3)      | 2.78      | (4.1, 5.4)      | 1.08      | C2

Object 3 has changed cluster from 1 to 2.

Recomputing Centroids at the End of Iteration 2

The partition has changed again, with Object 3 relocated to cluster 2, and the two clusters at this stage have the following characteristics:

          | Individuals   | New Centroids
Cluster 1 | 1, 2          | C1 = ((1,1) + (1.5,2))/2 = (1.3, 1.5)
Cluster 2 | 3, 4, 5, 6, 7 | C2 = ((3,4) + (5,7) + (3.5,5) + (4.5,5) + (3.5,4.5))/5 = (3.9, 5.1)

Iteration 3 - Check if any object has changed clusters

Object (Oi) | Features   | Centroid 1 (C1) | D(Oi, C1) | Centroid 2 (C2) | D(Oi, C2) | Closest Centroid
1           | (1, 1)     | (1.3, 1.5)      | 0.58      | (3.9, 5.1)      | 5.02      | C1
2           | (1.5, 2)   | (1.3, 1.5)      | 0.54      | (3.9, 5.1)      | 3.92      | C1
3           | (3, 4)     | (1.3, 1.5)      | 3.02      | (3.9, 5.1)      | 1.42      | C2
4           | (5, 7)     | (1.3, 1.5)      | 6.63      | (3.9, 5.1)      | 2.19      | C2
5           | (3.5, 5)   | (1.3, 1.5)      | 4.13      | (3.9, 5.1)      | 0.41      | C2
6           | (4.5, 5)   | (1.3, 1.5)      | 4.74      | (3.9, 5.1)      | 0.61      | C2
7           | (3.5, 4.5) | (1.3, 1.5)      | 3.72      | (3.9, 5.1)      | 0.72      | C2

No change in clusters compared to the previous iteration.

Conclusion

• In this example each individual is now nearer its own cluster mean than that of the other cluster and the iteration stops, choosing the latest partitioning as the final cluster solution.

• Hence Objects {1,2} belong to first cluster and Objects {3,4,5,6,7} belong to second cluster.

Notes

• The iterative relocation continues until no more relocations occur.

• In this example the no-relocation condition was satisfied after 3 iterations, but this is not usually the case: hundreds of iterations may be required, depending on the dataset.

• It is also possible that the k-means algorithm takes an impractically long time to settle on a final solution.

• It is therefore a good idea to stop the algorithm after a pre-chosen maximum number of iterations.

Comments on the K-Means Method

• Strength
  • Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
  • Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.

• Weakness
  • Applicable only when the mean is defined, so what about categorical data?
  • Need to specify k, the number of clusters, in advance
  • Unable to handle noisy data and outliers
  • Not suitable for discovering clusters with non-convex shapes


Summary

• The k-means algorithm is a simple yet popular method for clustering analysis
• Its performance is determined by the initialisation and an appropriate distance measure
• There are several variants of k-means that overcome its weaknesses
  • K-Medoids: resistance to noise and/or outliers
  • K-Modes: extension to categorical data clustering analysis
  • CLARA: extension to deal with large data sets
  • Mixture models (EM algorithm): handling uncertainty of clusters

Online tutorial: the K-means function in Matlab

https://www.youtube.com/watch?v=aYzjenNNOcc
