Lecture 9 Clustering


Transcript of Lecture 9 Clustering

  • Clustering

  • Pattern Recognition

    Learning/Testing

    Supervised learning (data with class labels): Bayesian, KNN, neural networks, etc. These are the "Classifiers" in Weka.

    Unsupervised learning (data with no labels): K-Means, Cobweb, EM, etc. These are under "Cluster" in Weka, so this topic is generally called clustering.

  • Supervised learning vs. unsupervised learning

    Supervised learning: discover patterns in the data that relate data attributes to a target (class) attribute. These patterns are then used to predict the values of the target attribute in future data instances.

    Unsupervised learning: the data have no target attribute. We want to explore the data to find some intrinsic structure in them.

  • Clustering

    Clustering is a technique for finding similarity groups in data, called clusters. That is, it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different from (far away from) each other into different clusters.

    Clustering is often called an unsupervised learning task, as no class values denoting an a priori grouping of the data instances are given, as is the case in supervised learning.

  • An illustration

    The data set has three natural groups of data points, i.e., three natural clusters.

  • What is clustering for?

    Let us see some real-life examples.

    Example 1: group people of similar sizes together to make small, medium, and large T-shirts.
    Tailor-made for each person: too expensive.
    One-size-fits-all: does not fit all.

    Example 2: in marketing, segment customers according to their similarities, to do targeted marketing.

  • What is clustering for? (cont.)

    Example 3: given a collection of text documents, we want to organize them according to their content similarities, to produce a topic hierarchy.

    In fact, clustering is one of the most utilized data mining techniques. It has a long history and is used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, and libraries.

    In recent years, due to the rapid increase of online documents, text clustering has become important.

  • Aspects of clustering

    A clustering algorithm:
      Partitional clustering
      Hierarchical clustering

    A distance (similarity, or dissimilarity) function.

    Clustering quality:
      Inter-cluster distance maximized
      Intra-cluster distance minimized

    The quality of a clustering result depends on the algorithm, the distance function, and the application.
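
    The worked examples later in this lecture use Euclidean distance as the distance function. As a minimal sketch in Python (the helper name and the sample points are illustrative, not from the lecture):

    import math

    def euclidean_distance(p, q):
        """Euclidean distance between two equal-length numeric points."""
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    print(euclidean_distance((0, 0), (3, 4)))  # 5.0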

  • K-means clustering

    K-means is a partitional clustering algorithm. Let the set of data points (or instances) D be {x1, x2, ..., xn}, where xi = (xi1, xi2, ..., xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data.

    The k-means algorithm partitions the given data into k clusters. Each cluster has a cluster center, called the centroid.

    k is specified by the user.
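
    A standard way to make this precise (not stated on this slide, but consistent with the "objective function" values quoted later in the lecture) is that k-means seeks the partition minimizing the sum of squared errors:

    % SSE, the usual k-means objective: C_j is the j-th cluster
    % and m_j its centroid (the mean of the points in C_j).
    \mathrm{SSE} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - m_j \rVert^2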

  • Algorithm: k-means

    1. Decide on a value for k.
    2. Initialize the k cluster centers (randomly, if necessary).
    3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
    4. Re-estimate the k cluster centers, by assuming the memberships found above are correct.
    5. If none of the N objects changed membership in the last iteration, exit. Otherwise go to 3.
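
    As a minimal sketch of steps 1-5 in plain Python (the function name, the random initialization, and the toy data are illustrative, not from the lecture):

    import math
    import random

    def kmeans(points, k, seed=0):
        """Minimal k-means over a list of equal-length numeric tuples."""
        rng = random.Random(seed)
        centers = rng.sample(points, k)          # step 2: random initial centers
        while True:
            # Step 3: assign each point to its nearest center (Euclidean).
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda c: math.dist(p, centers[c]))
                clusters[j].append(p)
            # Step 4: re-estimate each center as the mean of its cluster.
            centers_new = [
                tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
                for j, cl in enumerate(clusters)
            ]
            # Step 5: stop when the centers (hence memberships) are unchanged.
            if centers_new == centers:
                return centers, clusters
            centers = centers_new

    For example, kmeans([(1, 1), (1, 2), (8, 8), (9, 8)], 2) should return one center near (1, 1.5) and one near (8.5, 8).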

  • K-means algorithm (cont.)

  • K-means Clustering: Step 1

    [Figure: scatter plot on a 0-5 by 0-5 grid showing centroids k1, k2, k3. Algorithm: k-means; distance metric: Euclidean distance.]

  • K-means Clustering: Step 2

    [Figure: scatter plot on a 0-5 by 0-5 grid showing centroids k1, k2, k3. Algorithm: k-means; distance metric: Euclidean distance.]

  • K-means Clustering: Step 3

    [Figure: scatter plot on a 0-5 by 0-5 grid showing centroids k1, k2, k3. Algorithm: k-means; distance metric: Euclidean distance.]

  • K-means Clustering: Step 4

    [Figure: scatter plot on a 0-5 by 0-5 grid showing centroids k1, k2, k3. Algorithm: k-means; distance metric: Euclidean distance.]

  • K-means Clustering: Step 5

    [Figure: scatter plot (axes: expression in condition 1 vs. expression in condition 2) showing centroids k1, k2, k3. Algorithm: k-means; distance metric: Euclidean distance.]

  • How can we tell the right number of clusters?

    In general, this is an unsolved problem. However, there are many approximate methods; in the next few slides we will see an example.

    For our example, we will use the data set on the left [Figure: scatter plot on a 1-10 by 1-10 grid]. However, in this case we are imagining that we do NOT know the class labels; we are only clustering on the X and Y axis values.

  • When k = 1, the objective function is 873.0

    [Figure: the data set clustered with k = 1.]

  • When k = 2, the objective function is 173.1

    [Figure: the data set clustered with k = 2.]

  • When k = 3, the objective function is 133.6

    [Figure: the data set clustered with k = 3.]

  • We can plot the objective function values for k = 1 to 6

    [Figure: objective function vs. k, with the y-axis running from 0.00E+00 to 1.00E+03.]

    The abrupt change at k = 2 is highly suggestive of two clusters in the data. This technique for determining the number of clusters is known as knee finding or elbow finding.

    Note that the results are not always as clear-cut as in this toy example.
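
    As a sketch of how such an elbow plot can be computed, reusing the kmeans helper sketched earlier (the SSE definition here is the standard sum of squared distances, an assumption consistent with the objective values above):

    import math

    def sse(centers, clusters):
        """Sum of squared distances from each point to its cluster center."""
        return sum(
            math.dist(p, centers[j]) ** 2
            for j, cl in enumerate(clusters)
            for p in cl
        )

    points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # toy data
    for k in range(1, 7):
        centers, clusters = kmeans(points, k)
        print(k, sse(centers, clusters))  # look for the k where the drop levels off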

  • Image Segmentation Results

    [Figure: an image (I) and the three-cluster image (J) obtained by clustering the gray values of I.]
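
    A sketch of how such a segmentation could be produced with the kmeans helper from earlier: treat each pixel's gray value as a one-dimensional data point, cluster with k = 3, and recolor each pixel with its nearest centroid (the nested-list image here is an illustrative stand-in for real image data):

    image = [[10, 12, 200], [11, 198, 202], [13, 14, 199]]  # toy gray values

    pixels = [(v,) for row in image for v in row]   # 1-D points: gray values
    centers, _ = kmeans(pixels, 3)

    # Replace each pixel by the gray value of its nearest centroid.
    segmented = [
        [min(centers, key=lambda c: abs(c[0] - v))[0] for v in row]
        for row in image
    ]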

  • Strengths of k-means

    Simple: easy to understand and to implement.

    Efficient: the time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Since both k and t are small, k-means is considered a linear algorithm.

    K-means is the most popular clustering algorithm.

  • Weaknesses of k-means

    The algorithm is only applicable if the mean is defined. For categorical data, k-modes is used: the centroid is represented by the most frequent values (see the sketch after this list).

    The user needs to specify k.

    The algorithm is sensitive to outliers. Outliers are data points that are very far away from other data points. They could be errors in the data recording or some special data points with very different values.
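
    A minimal sketch of the k-modes centroid update mentioned above: for categorical data, the "center" of a cluster takes, in each attribute, the most frequent value (the function name and the sample records are illustrative):

    from collections import Counter

    def mode_centroid(cluster):
        """Per-attribute most frequent value: the k-modes analogue of a mean."""
        return tuple(
            Counter(col).most_common(1)[0][0]
            for col in zip(*cluster)
        )

    print(mode_centroid([("red", "S"), ("red", "M"), ("blue", "M")]))
    # -> ('red', 'M')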

  • Weaknesses of k-means: Problems with outliers

  • Weaknesses of k-means: To deal with outliers

    One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points (see the sketch below).

    Another method is to perform random sampling. Since in sampling we choose only a small subset of the data points, the chance of selecting an outlier is very small.
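
    A sketch of the removal idea (the cutoff of three times a cluster's mean distance to its centroid is an illustrative choice, not from the lecture):

    import math

    def drop_far_points(cluster, center, factor=3.0):
        """Keep only points within factor * (mean distance to the center)."""
        dists = [math.dist(p, center) for p in cluster]
        cutoff = factor * (sum(dists) / len(dists))
        return [p for p, d in zip(cluster, dists) if d <= cutoff]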

  • Weaknesses of k-means (cont.)

    The algorithm is sensitive to initial seeds.

  • Weaknesses of k-means (cont.)

    If we use different seeds, we can get good results. There are some methods to help choose good seeds; one example is sketched below.
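
    One well-known seed-selection method (not named on the slide; k-means++ is offered here as an example) picks each new seed with probability proportional to its squared distance from the nearest seed chosen so far, which tends to spread the initial centers apart. A minimal sketch:

    import math
    import random

    def kmeans_pp_seeds(points, k, seed=0):
        """k-means++ style seeding."""
        rng = random.Random(seed)
        centers = [rng.choice(points)]       # first seed: uniform at random
        while len(centers) < k:
            # Squared distance from each point to its nearest chosen seed.
            d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
            # Next seed: drawn with probability proportional to d2.
            centers.append(rng.choices(points, weights=d2, k=1)[0])
        return centers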