The Clustering Problem

The Clustering Problem Yongsub Lim Applied Algorithm Laboratory KAIST

Page 1: The Clustering Problem

The Clustering Problem

Yongsub Lim

Applied Algorithm Laboratory

KAIST

Page 2: The Clustering Problem

04/22/2023 The Clustering Problem 2

Contents

• The Clustering Problem
• Basic Algorithms
  - K-Means
  - K-Clustering of Max. Spacing
• Two-Phase Algorithms
• Other Algorithms

Page 3: The Clustering Problem

The Clustering Problem

• Given data, the task is to discover “meaningful” groups
• Data in the same group are similar, and
• Data in different groups are not similar

Page 4: The Clustering Problem

Example of clustering

[Figure: scatter plot of example data; axes x1 and x2]

Page 5: The Clustering Problem

Example of clustering

[Figure: scatter plot of example data; axes x1 and x2]

Page 6: The Clustering Problem

Example of clustering

[Figure: scatter plot of example data; axes x1 and x2]

Page 7: The Clustering Problem

Applications of Clustering

• The image segmentation problem can be considered as a clustering of pixels of an image

• In unsupervised learning, before making a decision rule, we classify unlabeled training data through clustering

Page 8: The Clustering Problem

Applications of Clustering

• In a network or a graph, we can group vertices so that the vertices within one group are highly connected

• Clustering is also useful in biology to classify genes

Page 9: The Clustering Problem

Basic Algorithms

• Two algorithms will be introduced

• K-Means iteratively computes the centers of K clusters

• K-Clustering of Max. Spacing uses a minimum spanning tree

• The objective functions of these two are different

Page 10: The Clustering Problem

K-Means

• Choose the means of the K clusters randomly

• At each iteration,
  - Assign every data point to the cluster whose mean is the nearest among the K means
  - Re-compute the means of all clusters
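The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `k_means`, the `iterations` and `seed` parameters, and the choice to draw the initial means from the data points are all assumptions made for the example.

```python
import math
import random


def k_means(points, k, iterations=100, seed=0):
    """Lloyd-style K-Means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Choose the initial means randomly from the data (one common choice).
    means = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, means[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: re-compute the mean of every non-empty cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, means
```

Calling `k_means(points, 2)` returns one label per point plus the final means. Different seeds can give different results, since the outcome depends on the random start.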

Page 11: The Clustering Problem

K-Means

• The objective is to minimize the sum of squared distances between the cluster centers and their members

• It favors clusterings with high density within each cluster
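Written out, the K-Means objective as it is usually stated uses squared Euclidean distance. The symbols below are notation assumed for this sketch and do not appear on the slides: C_k is the k-th cluster and μ_k is its mean.

```latex
J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x
```

Each assignment step and each mean update can only lower J or leave it unchanged, which is why the iteration settles, though possibly at a local minimum.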

Page 12: The Clustering Problem

K-Means Algorithm

• Worst case

Two initial centers chosen randomly

This may not be what we want!

Page 13: The Clustering Problem

K-Clustering of Max. Spacing

• Given data, find K clusters that maximize the minimum distance between any pair of clusters

• Spacing: the minimum distance between any pair of data points in different clusters
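The definition of spacing translates almost directly into code. The sketch below assumes Euclidean distance and a clustering given as one label per point; the name `spacing` and that representation are choices made for the example.

```python
import math
from itertools import combinations


def spacing(points, labels):
    """Minimum distance between any two points that lie in different clusters."""
    return min(
        math.dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
        if labels[i] != labels[j]
    )
```

For four points at the corners of a 5-by-1 rectangle, splitting them left/right gives spacing 5, while splitting them top/bottom gives spacing 1, so the first clustering is the better one under this objective.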

Page 14: The Clustering Problem

K-Clustering of Max. Spacing

Page 15: The Clustering Problem

K-Clustering of Max. Spacing

• Treat the given data as a complete graph whose edges are weighted by Euclidean distance

• Compute an MST

• Delete the K-1 most expensive edges of the MST
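The three steps above can be sketched as follows. This is an illustrative version that assumes 1 ≤ K ≤ number of points and uses Kruskal's algorithm with a small union-find to build the MST; the slides do not say which MST algorithm is intended, and the function names are made up for the example.

```python
import math
from itertools import combinations


def find(parent, i):
    """Return the representative of i's component (with path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def max_spacing_clusters(points, k):
    n = len(points)
    # Complete graph: one edge per pair, weighted by Euclidean distance.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    # Kruskal's algorithm: accept an edge when it joins two components.
    parent = list(range(n))
    mst = []
    for w, i, j in edges:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))
    # Drop the k-1 most expensive MST edges (mst is already sorted by weight).
    kept = mst[: len(mst) - (k - 1)]
    parent = list(range(n))
    for _, i, j in kept:
        parent[find(parent, i)] = find(parent, j)
    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(parent, i), len(ids)) for i in range(n)]
```

The result is one cluster label per point. Because nothing here is random, the same input always yields the same clustering.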

Page 16: The Clustering Problem

K-Clustering of Max. Spacing

[Figure: the algorithm's clustering C_alg compared with an optimal clustering C_opt, annotated "≤ spacing of C_alg"]

Page 17: The Clustering Problem

K-Clustering of Max. Spacing

• It involves no randomness

• Its objective seems better, or more reasonable, than that of K-Means

Page 18: The Clustering Problem

K-Means vs. Max. Spacing

• Good clustering is
  - High density within a cluster (K-Means)
  - Long distance between clusters (Max. Spacing)

Page 19: The Clustering Problem

K-Means vs. Max. Spacing

Page 20: The Clustering Problem

Two-Phase Algorithms

• Two algorithms will be introduced

• In the first phase, both cluster the data without any restriction on K

• In the second phase, if the number of clusters is larger than K, clusters are merged using Max. Spacing

Page 21: The Clustering Problem

Hierarchical EMST

Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms

Page 22: The Clustering Problem

Hierarchical EMST

• HEMST removes all edges with weights greater than a threshold (mean + std. of the edge weights)

• If the number of clusters is less than the given K, it continues as in Max. Spacing

• If not, it runs Max. Spacing on the set of data points that are each nearest to the center of their cluster

Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms
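The thresholding step of the first phase can be sketched as below. Two points are assumptions rather than facts from the slides: the mean and standard deviation are taken over the MST edge weights, and the population standard deviation (`statistics.pstdev`) is used. The function name is also made up for the example, and the follow-up handling of too few or too many clusters is left out.

```python
import math
import statistics
from itertools import combinations


def hemst_first_phase(points):
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Euclidean MST via Kruskal's algorithm on the complete graph.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))
    weights = [w for w, _, _ in mst]
    # Threshold = mean + standard deviation of the MST edge weights
    # (MST edges and population std. are assumptions of this sketch).
    threshold = statistics.mean(weights) + statistics.pstdev(weights)
    # Rebuild the components using only edges at or below the threshold.
    parent[:] = range(n)
    for w, i, j in mst:
        if w <= threshold:
            parent[find(i)] = find(j)
    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]
```

The number of distinct labels returned is the cluster count that the second phase would then compare against K.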

Page 23: The Clustering Problem

Hierarchical EMST

Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms

Page 24: The Clustering Problem

Hierarchical EMST

Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms

Page 25: The Clustering Problem

Hierarchical EMST

Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms

Page 26: The Clustering Problem

Hierarchical EMST

Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms

Page 27: The Clustering Problem

Modified K-Means Process

• MKF, in the first phase, is similar to K-Means

• The difference is that if a data point is far enough from all clusters, it becomes the center of a new cluster

• While running, if the number of clusters is larger than a threshold, the two nearest clusters are merged

• In the second phase, apply Max. Spacing

M.F. Jiang, S.S. Tseng, C.M. Su, Two-phase clustering process for outliers detection
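A rough sketch of one pass of the first phase is given below. The slides do not define "far enough" or how two clusters are merged, so the distance cutoff `far`, the cap `max_clusters`, and the rule of merging two centers into their midpoint are all hypothetical choices made only for illustration. The mean re-computation of ordinary K-Means is omitted to keep the example short.

```python
import math
from itertools import combinations


def modified_k_means_pass(points, centers, far, max_clusters):
    centers = [tuple(c) for c in centers]
    for p in points:
        # A point farther than `far` from every center starts a new cluster.
        if min(math.dist(p, c) for c in centers) > far:
            centers.append(tuple(p))
        if len(centers) > max_clusters:
            # Too many clusters: merge the two nearest centers
            # (replacing them by their midpoint is an assumed rule).
            a, b = min(
                combinations(range(len(centers)), 2),
                key=lambda ab: math.dist(centers[ab[0]], centers[ab[1]]),
            )
            merged = tuple((x + y) / 2 for x, y in zip(centers[a], centers[b]))
            centers = [c for i, c in enumerate(centers) if i not in (a, b)]
            centers.append(merged)
    # Assign every point to its nearest center, as in ordinary K-Means.
    labels = [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]
    return centers, labels
```

Because isolated points open their own clusters instead of dragging an existing mean toward them, small clusters in the output are natural outlier candidates for the second phase.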

Page 28: The Clustering Problem

Modified K-Means Process

• This scheme can identify outliers by using Max. Spacing

M.F. Jiang, S.S. Tseng, C.M. Su, Two-phase clustering process for outliers detection

Page 29: The Clustering Problem

Modified K-Means Process

M.F. Jiang, S.S. Tseng, C.M. Su, Two-phase clustering process for outliers detection

Page 30: The Clustering Problem

Modified K-Means Process

M.F. Jiang, S.S. Tseng, C.M. Su, Two-phase clustering process for outliers detection

Page 31: The Clustering Problem

Two-Phase Algorithms

• Both give more weight to members of small sets in the first phase

• The members of a small set are the data most likely to be clustered together, so it is reasonable to decrease the distances between them

Page 32: The Clustering Problem

Other Algorithms

• HCS uses a min-cut of a graph

• It recursively separates the data into two disjoint subsets (by a min-cut) until all clusters are highly connected

• A graph is highly connected if the minimum number of edges whose removal disconnects the graph is greater than |V|/2

Erez Hartuv, Ron Shamir, A Clustering Algorithm Based on Graph Connectivity
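The "highly connected" test can be checked on small graphs with a global min-cut routine. The sketch below uses the Stoer-Wagner algorithm on an adjacency matrix; that choice, the function names, and the representation of the graph as a vertex count plus an edge list are assumptions for the example, not something the slides prescribe. It assumes at least two vertices.

```python
def edge_connectivity(n, edges):
    """Size of a global minimum cut of an undirected graph (Stoer-Wagner)."""
    w = [[0] * n for _ in range(n)]
    for u, v in edges:
        w[u][v] += 1
        w[v][u] += 1
    active = list(range(n))
    best = float("inf")
    while len(active) > 1:
        # One phase: repeatedly add the vertex most tightly connected
        # to the vertices added so far.
        weights = {v: 0 for v in active}
        remaining = set(active)
        order = []
        while remaining:
            v = max(remaining, key=lambda x: weights[x])
            remaining.remove(v)
            order.append(v)
            for u in remaining:
                weights[u] += w[v][u]
        s, t = order[-2], order[-1]
        # weights[t] is the cut separating the last vertex from the rest.
        best = min(best, weights[t])
        # Merge t into s before the next phase.
        for u in active:
            w[s][u] += w[t][u]
            w[u][s] += w[u][t]
        w[s][s] = 0
        active.remove(t)
    return best


def is_highly_connected(n, edges):
    """True when more than |V|/2 edges must be removed to disconnect the graph."""
    return edge_connectivity(n, edges) > n / 2
```

A complete graph on four vertices passes the test (its min-cut is 3, which exceeds 2), while a four-vertex path does not (its min-cut is 1), so HCS would keep splitting the path.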

Page 33: The Clustering Problem

Other Algorithms

• Voting

• Apply K-Means N times

• If a pair of data points was placed in the same cluster more than a threshold t times, the two are grouped

Ana L.N. Fred, Anil K. Jain, Data Clustering Using Evidence Accumulation
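The voting scheme can be sketched as follows. This is a simplified illustration: the helper `k_means_labels`, the parameter names `n_runs`, `t` and `seed`, and the use of union-find to turn grouped pairs into clusters are choices made for the example, and every run here uses the same K.

```python
import math
import random
from itertools import combinations


def k_means_labels(points, k, rng, iterations=50):
    """Plain K-Means with random initial means; returns one label per point."""
    means = rng.sample(points, k)
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda j: math.dist(p, means[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def voting_clusters(points, k, n_runs, t, seed=0):
    rng = random.Random(seed)
    n = len(points)
    # votes[(i, j)] counts the runs in which points i and j shared a cluster.
    votes = {pair: 0 for pair in combinations(range(n), 2)}
    for _ in range(n_runs):
        labels = k_means_labels(points, k, rng)
        for i, j in votes:
            if labels[i] == labels[j]:
                votes[(i, j)] += 1
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Group every pair that co-occurred more than t times.
    for (i, j), count in votes.items():
        if count > t:
            parent[find(i)] = find(j)
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]
```

The number of final clusters is not fixed in advance here: it falls out of the threshold t, so a stricter threshold tends to produce more and smaller groups.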

Page 34: The Clustering Problem

Thanks