Clustering DMM


Transcript of Clustering DMM

Page 1: Clustering DMM

Data Mining 1

Advanced Topics (Data Mining)

1

Dr. Mehrdad Jalali
[email protected]
Jalali.mshdiau.ac.ir

Page 2: Clustering DMM

Data Mining 2

Clustering

2

Page 3: Clustering DMM

3

What is Cluster Analysis?

• Cluster: a collection of data objects

– Similar to one another within the same cluster

– Dissimilar to the objects in other clusters

• Cluster analysis

– Finding similarities between data according to the characteristics found in

the data and grouping similar data objects into clusters

• Unsupervised learning: no predefined classes

Page 4: Clustering DMM

4

Clustering: Applications

• Pattern Recognition

• Image Processing

• Economic Science (especially market research)

• WWW

– Document classification

– Cluster Weblog data to discover groups of similar access patterns

Page 5: Clustering DMM

5

Examples of Clustering Applications

• Marketing: Help marketers discover distinct groups in their customer bases, and

then use this knowledge to develop targeted marketing programs

• Land use: Identification of areas of similar land use in an earth observation database

• Insurance: Identifying groups of motor insurance policy holders with a high average

claim cost

• City-planning: Identifying groups of houses according to their house type, value, and

geographical location

Page 6: Clustering DMM

6

Quality: What Is Good Clustering?

• A good clustering method will produce high quality clusters with

– high intra-class similarity

– low inter-class similarity

• The quality of a clustering result depends on both the similarity measure

used by the method and its implementation

• The quality of a clustering method is also measured by its ability to

discover some or all of the hidden patterns

Page 7: Clustering DMM

7

Measure the Quality of Clustering

• Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance

function, typically metric: d(i, j)

• There is a separate “quality” function that measures the “goodness” of a

cluster.

• The definitions of distance functions are usually very different for different kinds of variables.

• It is hard to define “similar enough” or “good enough”

– the answer is typically highly subjective.

Page 8: Clustering DMM

8

Requirements of Clustering in Machine Learning

• Scalability

• Ability to deal with different types of attributes

• Ability to handle dynamic data

• Minimal requirements for domain knowledge to determine

input parameters

• Able to deal with noise and outliers

• High dimensionality

• Interpretability and usability

Page 9: Clustering DMM

9

Similarity and Dissimilarity Between Objects

There are several methods to measure similarity (distance) between objects (see the KNN slides).

Page 10: Clustering DMM

10

Dissimilarity between Binary Variables

• A contingency table for binary data

  (rows: object i, columns: object j)

            j = 1    j = 0    sum
  i = 1       a        b      a + b
  i = 0       c        d      c + d
  sum       a + c    b + d      p

• Distance measure for symmetric binary variables:

  d(i, j) = (b + c) / (a + b + c + d)

• Distance measure for asymmetric binary variables (negative matches d are ignored):

  d(i, j) = (b + c) / (a + b + c)

Page 11: Clustering DMM

11

Dissimilarity between Binary Variables

• Example

– gender is a symmetric attribute
– the remaining attributes are asymmetric binary
– let the values Y and P be coded as 1, and the value N as 0

  Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack     M       Y      N      P       N       N       N
  Mary     F       Y      N      P       N       P       N
  Jim      M       Y      P      N       N       N       N

Using the asymmetric distance:

  d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
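To make the coefficients concrete, here is a minimal Python sketch (not part of the original slides; attributes coded Y/P = 1, N = 0 as in the example) that reproduces the three asymmetric distances above:

```python
# Minimal sketch of symmetric/asymmetric binary dissimilarity (illustrative, not from the slides).
def binary_dissimilarity(x, y, symmetric=True):
    """x, y: sequences of 0/1 values for the same binary attributes."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    if symmetric:
        return (b + c) / (a + b + c + d)
    return (b + c) / (a + b + c)          # asymmetric: negative matches d are ignored

# Asymmetric attributes from the example (Y/P -> 1, N -> 0), gender excluded
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(binary_dissimilarity(jack, mary, symmetric=False))  # 0.33
print(binary_dissimilarity(jack, jim,  symmetric=False))  # 0.67
print(binary_dissimilarity(jim,  mary, symmetric=False))  # 0.75
```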

Page 12: Clustering DMM

12

Nominal Variables

• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

• Method 1: Simple matching
  – m: # of matches, p: total # of variables
  – d(i, j) = (p − m) / p

• Method 2: use a large number of binary variables
  – create a new binary variable for each of the M nominal states
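A one-function sketch of the simple-matching distance, assuming each object is given as an equal-length tuple of nominal values (illustrative, not from the slides):

```python
def simple_matching_distance(x, y):
    """d(i, j) = (p - m) / p for nominal variables (simple matching)."""
    p = len(x)                                      # total number of variables
    m = sum(1 for xi, yi in zip(x, y) if xi == yi)  # number of matching variables
    return (p - m) / p

print(simple_matching_distance(("red", "small", "round"),
                               ("red", "large", "round")))  # 1/3
```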

Page 13: Clustering DMM

13

Major Clustering Approaches (I)

• Partitioning approach:

– Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of

square errors

– Typical methods: Graph Partitioning, k-means, k-medoids

• Hierarchical approach:

– Create a hierarchical decomposition of the set of data (or objects) using some criterion

– Typical methods: Diana, Agnes, BIRCH, ROCK

• Density-based approach:

– Based on connectivity and density functions

– Typical methods: DBSCAN, OPTICS

Page 14: Clustering DMM

14

Major Clustering Approaches (II)

• Grid-based approach:

– based on a multiple-level granularity structure

– Typical methods: STING, WaveCluster, CLIQUE

• Model-based:

– A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model

– Typical methods: EM, SOM, COBWEB

• Frequent pattern-based:

– Based on the analysis of frequent patterns

– Typical methods: pCluster

Page 15: Clustering DMM

15

An Example: Graph Partitioning(1)

S1: a, c, b, d, c

S2: b, d, e, a

S3: e, c

[ . . . ]

Sn: b, e, d

Sessions → Co-occurrence Matrix M

Create Undirected Graph

based on Matrix M

Page 16: Clustering DMM

16

An Example: Graph Partitioning(2)

C1: a, b, e

C2: c, f, g, i

C3: d

Co-occurrence Matrix → Undirected Graph → Clusters

Small Clusters are filtered out (MinClusterSize)

Page 17: Clustering DMM

17

An Example: Graph Partitioning(3)


Page 18: Clustering DMM

18

Distance (Similarity) Matrix for Graph-based Clustering

• Similarity (Distance) Matrix
– Based on the distance or similarity measure, we can construct a symmetric matrix of distance (or similarity) values
– The (i, j) entry in the matrix is d_ij, the distance (or similarity) of D_i to D_j
– Note that d_ij = d_ji (i.e., the matrix is symmetric), so we only need the lower triangle of the matrix
– The diagonal is all 1's (similarity) or all 0's (distance)

Page 19: Clustering DMM

19

Example: Term Similarities in Documents

• Suppose we want to cluster terms that appear in a collection of documents with different frequencies

• We need to compute a term-term similarity matrix
– For simplicity we use the dot product as the similarity measure (note that this is the non-normalized version of cosine similarity)
– Each term can be viewed as a vector of term frequencies (weights)
– N = total number of dimensions (in this case documents); w_ik = weight of term i in document k

– Example:

  sim(T_i, T_j) = Σ_{k=1..N} (w_ik × w_jk)

        T1  T2  T3  T4  T5  T6  T7  T8
  Doc1   0   4   0   0   0   2   1   3
  Doc2   3   1   4   3   1   2   0   1
  Doc3   3   0   0   0   3   0   3   0
  Doc4   0   1   0   3   0   0   2   0
  Doc5   2   2   2   3   1   4   0   2

  Sim(T1, T2) = <0,3,3,0,2> · <4,1,0,1,2> = 0x4 + 3x1 + 3x0 + 0x1 + 2x2 = 7
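The whole term-term matrix can be computed at once as D^T D. A short NumPy sketch (the array literal simply re-types the document-term table above):

```python
import numpy as np

# Rows = Doc1..Doc5, columns = T1..T8 (the document-term matrix above)
D = np.array([[0, 4, 0, 0, 0, 2, 1, 3],
              [3, 1, 4, 3, 1, 2, 0, 1],
              [3, 0, 0, 0, 3, 0, 3, 0],
              [0, 1, 0, 3, 0, 0, 2, 0],
              [2, 2, 2, 3, 1, 4, 0, 2]])

# sim(Ti, Tj) = sum_k w_ik * w_jk, so the full term-term matrix is D^T D
S = D.T @ D
print(S[0, 1])   # sim(T1, T2) = 7
```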

Page 20: Clustering DMM

20

Example: Term Similarities in Documents

        T1  T2  T3  T4  T5  T6  T7  T8
  Doc1   0   4   0   0   0   2   1   3
  Doc2   3   1   4   3   1   2   0   1
  Doc3   3   0   0   0   3   0   3   0
  Doc4   0   1   0   3   0   0   2   0
  Doc5   2   2   2   3   1   4   0   2

  sim(T_i, T_j) = Σ_{k=1..N} (w_ik × w_jk)

Term-Term Similarity Matrix (lower triangle):

        T1  T2  T3  T4  T5  T6  T7
  T2     7
  T3    16   8
  T4    15  12  18
  T5    14   3   6   6
  T6    14  18  16  18   6
  T7     9   6   0   6   9   2
  T8     7  17   8   9   3  16   3

Page 21: Clustering DMM

21

Similarity (Distance) Thresholds

– A similarity (distance) threshold may be used to mark pairs that are "sufficiently" similar

Term-Term Similarity Matrix:

        T1  T2  T3  T4  T5  T6  T7
  T2     7
  T3    16   8
  T4    15  12  18
  T5    14   3   6   6
  T6    14  18  16  18   6
  T7     9   6   0   6   9   2
  T8     7  17   8   9   3  16   3

Thresholded (binary) matrix:

        T1  T2  T3  T4  T5  T6  T7
  T2     0
  T3     1   0
  T4     1   1   1
  T5     1   0   0   0
  T6     1   1   1   1   0
  T7     0   0   0   0   0   0
  T8     0   1   0   0   0   1   0

Using a threshold value of 10 in the previous example

Page 22: Clustering DMM

22

Graph Representation

• The similarity matrix can be visualized as an undirected graph
– each item is represented by a node, and edges represent the fact that two items are similar (a one in the similarity threshold matrix)

        T1  T2  T3  T4  T5  T6  T7
  T2     0
  T3     1   0
  T4     1   1   1
  T5     1   0   0   0
  T6     1   1   1   1   0
  T7     0   0   0   0   0   0
  T8     0   1   0   0   0   1   0

[Figure: undirected graph on nodes T1 to T8; edges connect the pairs marked 1 above; T7 is isolated]

If no threshold is used, the matrix can be represented as a weighted graph

Page 23: Clustering DMM

23

Simple Clustering Algorithms

• If we are interested only in the threshold (and not the degree of similarity or distance), we can use the graph directly for clustering

• Clique Method (complete link)
– all items within a cluster must be within the similarity threshold of all other items in that cluster
– clusters may overlap
– generally produces small but very tight clusters

• Single Link Method
– any item in a cluster must be within the similarity threshold of at least one other item in that cluster
– produces larger but weaker clusters

Page 24: Clustering DMM

24

Simple Clustering Algorithms

• Clique Method
– a clique is a completely connected subgraph of a graph
– in the clique method, each maximal clique in the graph becomes a cluster

[Figure: the threshold graph on nodes T1 to T8 from the previous slide]

Maximal cliques (and therefore the clusters) in the previous example are:

{T1, T3, T4, T6}, {T2, T4, T6}, {T2, T6, T8}, {T1, T5}, {T7}

Note that, for example, {T1, T3, T4} is also a clique, but is not maximal.
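If the thresholded matrix is loaded into a graph library, the clique method reduces to enumerating maximal cliques. A sketch using networkx (the library choice is an assumption; the slides do not prescribe one):

```python
import networkx as nx

# Edges = term pairs with a 1 in the threshold matrix (similarity >= 10)
edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
         ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
         ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8")]
G = nx.Graph(edges)
G.add_node("T7")                      # T7 has no edge above the threshold

# Each maximal clique becomes a (possibly overlapping) cluster
print(list(nx.find_cliques(G)))
# e.g. [['T1','T3','T4','T6'], ['T1','T5'], ['T2','T4','T6'], ['T2','T6','T8'], ['T7']]
```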

Page 25: Clustering DMM

25

Simple Clustering Algorithms

• Single Link Method
  1. select an item not in a cluster and place it in a new cluster
  2. place all other items similar to it in that cluster
  3. repeat step 2 for each item in the cluster until nothing more can be added
  4. repeat steps 1-3 for each item that remains unclustered

[Figure: the threshold graph on nodes T1 to T8 from the previous slide]

In this case the single link method produces only two clusters:

{T1, T3, T4, T5, T6, T2, T8}, {T7}

Note that the single link method does not allow overlapping clusters, thus partitioning the set of items.
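Under the same threshold graph, the single link method amounts to taking connected components. A sketch continuing the networkx example:

```python
import networkx as nx

edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
         ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
         ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8")]
G = nx.Graph(edges)
G.add_node("T7")

# Single link clusters = connected components of the threshold graph
print([sorted(c) for c in nx.connected_components(G)])
# [['T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T8'], ['T7']]
```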

Page 26: Clustering DMM

26

K-Means (A Partitioning Clustering Method)

• In machine learning, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

• Centroid: the “middle” of a cluster

• Partitioning method: Construct a partition of a database D of n objects into a set of k clusters, s.t., min sum

of squared distance

• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion

– Heuristic methods: k-means and k-medoids algorithms

– k-means : Each cluster is represented by the center of the cluster

– k-medoids: Each cluster is represented by one of the objects in the cluster

Sum of squared distance to be minimized:

  E = Σ_{m=1..k} Σ_{t_mi ∈ K_m} (C_m − t_mi)²

Centroid (mean point) of cluster C_m containing N objects:

  C_m = ( Σ_{i=1..N} t_ip ) / N

Page 27: Clustering DMM

27

The K-Means Clustering Method

• Given k, the k-means algorithm is implemented in four steps:

– Partition objects into k nonempty subsets

– Compute seed points as the centroids of the clusters of the current

partition (the centroid is the center, i.e., mean point, of the cluster)

– Assign each object to the cluster with the nearest seed point

– Go back to Step 2; stop when no new assignments are made (a minimal code sketch follows)
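A minimal NumPy sketch of these four steps (Euclidean distance and random initialization are assumptions for illustration; the document-clustering example on the next slides instead uses dot-product similarity):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: X is an (n, d) array, k is the number of clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]     # step 1: pick k seeds
    for _ in range(n_iter):
        # step 2: assign each object to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute centroids as the cluster means
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                 # step 4: stop when stable
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5], [1.0, 0.5]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```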

Page 28: Clustering DMM

28

The K-Means Clustering Method

[Figure: four 2-D scatter plots (axes 0 to 10) illustrating k-means with K = 2: arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, and reassign until the assignments stop changing]

Page 29: Clustering DMM

29

The K-Means Clustering Method: An Example

Initial Set with k=3 C1 = {D1,D2}, C2 = {D3,D4}, C3 = {D5,D6}

Cluster Means

Document Clustering

        T1   T2   T3   T4   T5
  D1     0    3    3    0    2
  D2     4    1    0    1    2
  D3     0    4    0    0    2
  D4     0    3    0    3    3
  D5     0    1    3    0    1
  D6     2    2    0    0    4
  D7     1    0    3    2    0
  D8     3    1    0    0    2
  C1   4/2  4/2  3/2  1/2  4/2
  C2   0/2  7/2  0/2  3/2  5/2
  C3   2/2  3/2  3/2  0/2  5/2

Page 30: Clustering DMM

30

Example: K-Means

        D1    D2    D3    D4    D5    D6    D7    D8
  C1  29/2  29/2  24/2  27/2  17/2  32/2  15/2  24/2
  C2  31/2  20/2  38/2  45/2  12/2  34/2   6/2  17/2
  C3  28/2  21/2  22/2  24/2  17/2  30/2  11/2  19/2

Now compute the similarity (or distance) of each item with each cluster, resulting in a cluster-document similarity matrix (here we use the dot product as the similarity measure).

For each document, reallocate the document to the cluster to which it has the highest similarity (the largest entry in each column of the table above). After the reallocation we have the following new clusters. Note that the previously unassigned D7 and D8 have now been assigned, and that D1 and D6 have been reallocated from their original assignment.

C1 = {D2, D7, D8}, C2 = {D1, D3, D4, D6}, C3 = {D5}

This is the end of the first iteration (i.e., the first reallocation). Next, we repeat the process for another reallocation…

Page 31: Clustering DMM

31

Example: K-Means

        T1    T2    T3    T4    T5
  D1     0     3     3     0     2
  D2     4     1     0     1     2
  D3     0     4     0     0     2
  D4     0     3     0     3     3
  D5     0     1     3     0     1
  D6     2     2     0     0     4
  D7     1     0     3     2     0
  D8     3     1     0     0     2
  C1   8/3   2/3   3/3   3/3   4/3
  C2   2/4  12/4   3/4   3/4  11/4
  C3   0/1   1/1   3/1   0/1   1/1

Now compute new cluster centroids using the original document-term matrix

C1 = {D2, D7, D8}, C2 = {D1, D3, D4, D6}, C3 = {D5}

Page 32: Clustering DMM

32

Comments on the K-Means Method

Weakness
– Applicable only when the mean is defined; then what about categorical data?

– Need to specify k, the number of clusters, in advance

– Unable to handle noisy data and outliers

Page 33: Clustering DMM

33

Variations of the K-Means Method

• A few variants of the k-means which differ in

– Selection of the initial k means

– Dissimilarity calculations

– Strategies to calculate cluster means

Page 34: Clustering DMM

34

What Is the Problem of the K-Means Method?

• The k-means algorithm is sensitive to outliers !

– Since an object with an extremely large value may substantially distort the distribution of

the data.

• K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.

[Figure: two 2-D scatter plots (axes 0 to 10) illustrating the effect of an outlier on k-means vs. k-medoids]

Page 35: Clustering DMM

35

The K-Medoids Clustering Method

• Find representative objects, called medoids, in clusters

• PAM (Partitioning Around Medoids) Algorithm :

1. Initialize: randomly select k of the n data points as the medoids

2. Associate each data point to the closest medoid. ("closest" here is defined using any

valid distance metric, most commonly Euclidean distance, Manhattan distance or

Minkowski distance)

3. For each medoid m

1. For each non-medoid data point o

2. Swap m and o and compute the total cost of the configuration

4. Select the configuration with the lowest cost.

5. Repeat steps 2 to 4 until there is no change in the medoids (a compact code sketch follows).
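A compact sketch of the PAM swap loop (illustrative assumptions: Euclidean distance, cost = sum of distances of each point to its nearest medoid):

```python
import numpy as np
from itertools import product

def total_cost(X, medoid_idx):
    """Sum of distances of each point to its closest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))    # step 1: random medoids
    improved = True
    while improved:                                               # repeat until no change
        improved = False
        for m_pos, o in product(range(k), range(len(X))):         # steps 3-4: try every swap
            if o in medoids:
                continue
            candidate = medoids.copy()
            candidate[m_pos] = o
            if total_cost(X, candidate) < total_cost(X, medoids):
                medoids, improved = candidate, True
    # step 2: final assignment of each point to the closest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return medoids, d.argmin(axis=1)

X = np.array([[1.0, 1.0], [1.2, 1.1], [8.0, 8.0], [8.2, 8.1], [50.0, 50.0]])
print(pam(X, k=2))
```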

Page 36: Clustering DMM

36

Hierarchical Clustering

• Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition

[Figure: objects a, b, c, d, e. AGNES (Agglomerative Nesting) merges them bottom-up from Step 0 to Step 4 (a,b then d,e then c,d,e then a,b,c,d,e), while DIANA (Divisive Analysis) splits them top-down in the reverse direction, from Step 4 to Step 0]

Page 37: Clustering DMM

37

Agglomerative hierarchical clustering

• Group data objects in a bottom-up fashion.

• Initially each data object is in its own cluster.

• Then we merge these atomic clusters into larger and larger clusters, until

all of the objects are in a single cluster or until certain termination

conditions are satisfied.

• A user can specify the desired number of clusters as a termination

condition.
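A short SciPy sketch of bottom-up (agglomerative) clustering; single linkage and the cut at k = 2 clusters are illustrative choices, and the linkage matrix Z is what a dendrogram plot (see the later slide) would draw:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 5.2], [9.0, 9.0]])

# Bottom-up: every point starts as its own cluster, the closest pair is merged first
Z = linkage(X, method="single")          # merge history (drawable as a dendrogram)

# Terminate by asking for a desired number of clusters, e.g. k = 2
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```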

Page 38: Clustering DMM

38

Divisive hierarchical clustering

• Groups data objects in a top-down fashion.

• Initially all data objects are in one cluster.

• We then subdivide the cluster into smaller and smaller clusters, until each

object forms a cluster on its own or satisfies certain termination conditions,

such as a desired number of clusters is obtained.

Page 39: Clustering DMM

39

AGNES (Agglomerative Nesting)

[Figure: three 2-D scatter plots (axes 0 to 10) showing AGNES progressively merging nearby points into larger clusters]

Page 40: Clustering DMM

40

Dendrogram: Shows How the Clusters are Merged

Page 41: Clustering DMM

41

Density-Based Clustering Methods

• Clustering based on density (local cluster criterion), such as density-

connected points

• Major features:

– Discover clusters of arbitrary shape

– Handle noise

– One scan

– Need density parameters as termination condition

Page 42: Clustering DMM

42

Density-Based Clustering: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) & OPTICS (Ordering Points To Identify the Clustering Structure)
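A minimal scikit-learn sketch of DBSCAN; eps and min_samples are the density parameters mentioned above, and the values here are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.2], [1.2, 0.9],      # dense region 1
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1],      # dense region 2
              [50.0, 50.0]])                            # isolated point

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)          # label -1 marks noise/outliers
```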

Page 43: Clustering DMM

43

GRID-BASED CLUSTERING METHODS

• This is the approach in which we quantize the space into a finite number of cells that form a grid structure, on which all of the operations for clustering are performed.

• Assume that we have a set of records and we want to cluster them with respect to two attributes; we then divide the related space (a plane) into a grid structure and find the clusters.

Page 44: Clustering DMM

44

[Figure: the data space is a plane with Age (20 to 60) on one axis and Salary (x10,000; 0 to 8) on the other ("Our 'space' is this plane"), divided into a grid structure]

Page 45: Clustering DMM

45

STING: A Statistical Information Grid Approach

• The spatial area is divided into rectangular cells
• There are several levels of cells corresponding to different levels of resolution

Page 46: Clustering DMM

46

Different Grid Levels during Query Processing.

Page 47: Clustering DMM

47

The STING Clustering Method

– Each cell at a high level is partitioned into a number of smaller cells in the next lower

level

– Statistical info of each cell is calculated and stored beforehand and is used to answer

queries

– Parameters of higher-level cells can be easily calculated from parameters of lower-level cells

  • count, mean, s (standard deviation), min, max

  • type of distribution: normal, uniform, etc.

– Use a top-down approach to answer spatial data queries

– Start from a pre-selected layer—typically with a small number of cells

– For each cell in the current level compute the confidence interval

Page 48: Clustering DMM

48

Comments on STING

• Remove the irrelevant cells from further consideration
• When finished examining the current layer, proceed to the next lower level
• Repeat this process until the bottom layer is reached
• Advantages:
– Query-independent, easy to parallelize, incremental update
• Disadvantages:
– All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected

Page 49: Clustering DMM

49

Model-Based Clustering

• What is model-based clustering?
– Attempt to optimize the fit between the given data and some mathematical model
– Based on the assumption: data are generated by a mixture of underlying probability distributions

• Typical methods
– Statistical approach: EM (Expectation Maximization), AutoClass
– Machine learning approach: COBWEB, CLASSIT
– Neural network approach: SOM (Self-Organizing Feature Map)
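For the statistical (EM) approach, a minimal scikit-learn sketch that fits a mixture of Gaussians (the assumed model) and reads off hard and soft cluster assignments:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [6.0, 6.0], [6.2, 5.9], [5.8, 6.1]])

# EM fits k Gaussian components to the data (the assumed mixture model)
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.predict(X))         # hard cluster labels
print(gm.predict_proba(X))   # soft (probabilistic) memberships
```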

Page 50: Clustering DMM

50

Conceptual Clustering

• Conceptual clustering

– A form of clustering in machine learning

– Produces a classification scheme for a set of unlabeled objects

– Finds characteristic description for each concept (class)

• COBWEB

– A popular and simple method of incremental conceptual learning

– Creates a hierarchical clustering in the form of a classification tree

– Each node refers to a concept and contains a probabilistic description of that

concept

Page 51: Clustering DMM

51

COBWEB Clustering Method

A classification tree

Page 52: Clustering DMM

52

What Is Outlier Discovery?

• What are outliers?
– A set of objects that are considerably dissimilar from the remainder of the data

• Problem: Define and find outliers in large data sets

• Applications:

– Credit card fraud detection

– Customer segmentation

– Medical analysis

Page 53: Clustering DMM

53

Assignment

1. Select a suitable dataset from the UCI repository and test the following methods in Weka:

   1. KMeans
   2. DBSCAN
   3. OPTICS
   4. EM