CLUSTERING (Segmentation)

Saed Sayad

www.ismartsoft.com

Page 2

Data Mining Steps

Page 3

What is Clustering?

Given a set of records, organize the records into clusters.

A cluster is a subset of records that are similar to one another.

[Figure: records plotted by Age and Income]

Page 4

Clustering Requirements

• The ability to discover some or all of the hidden clusters.
• Within-cluster similarity and between-cluster dissimilarity.
• The ability to deal with various types of attributes.
• The ability to deal with noise and outliers.
• The ability to handle high dimensionality.
• Scalability, interpretability and usability.

Page 5

Similarity - Distance Measure

To measure similarity or dissimilarity between objects, we need a distance measure. The usual axioms for a distance measure D are:

• D(x, x) = 0
• D(x, y) = D(y, x)
• D(x, y) ≤ D(x, z) + D(z, y) (the triangle inequality)

Page 6

Similarity - Distance Measure

Euclidean: $D(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$

Manhattan: $D(x, y) = \sum_{i=1}^{k} |x_i - y_i|$

Minkowski: $D(x, y) = \left( \sum_{i=1}^{k} |x_i - y_i|^q \right)^{1/q}$
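The three distance measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original deck; the function names and the sample points are assumptions chosen for the example. Euclidean and Manhattan are written as the q = 2 and q = 1 cases of Minkowski.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def euclidean(x, y):
    """Euclidean distance: the Minkowski distance with q = 2."""
    return minkowski(x, y, 2)


def manhattan(x, y):
    """Manhattan distance: the Minkowski distance with q = 1."""
    return minkowski(x, y, 1)


# Hypothetical sample points, used only to exercise the functions.
a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
```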

Page 7

Similarity - Correlation

$r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$

[Figure: two panels with axes Age and Credit$, labeled "Similar" and "Dissimilar"]
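As a rough illustration of correlation as a similarity measure, the sketch below computes the Pearson coefficient for two variables. The function name and the Age/Credit values are hypothetical and are not taken from the slides.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)  # raises ZeroDivisionError if a variable is constant


# Hypothetical data: credit rising with age, then falling with age.
age = [25, 35, 45, 55]
credit_up = [1000, 2000, 3000, 4000]
credit_down = [4000, 3000, 2000, 1000]
print(pearson(age, credit_up))    # close to +1
print(pearson(age, credit_down))  # close to -1
```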

Page 8

Similarity – Hamming Distance

$D_H = \sum_{i=1}^{k} |x_i - y_i|$, where each term is 0 if $x_i = y_i$ and 1 otherwise.

Gene 1            A A T C C A G T
Gene 2            T C T C A A G C
Hamming Distance  1 1 0 0 1 0 0 1
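The gene example can be reproduced with a short Python sketch. The function name is an assumption; the two sequences are the ones shown on the slide.

```python
def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(x, y))


gene1 = "AATCCAGT"
gene2 = "TCTCAAGC"
# Per-position mismatches, matching the slide's row 1 1 0 0 1 0 0 1.
print([int(a != b) for a, b in zip(gene1, gene2)])
print(hamming(gene1, gene2))  # 4
```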

Page 9

Clustering Methods

• Exclusive vs. Overlapping
• Hierarchical vs. Partitive
• Deterministic vs. Probabilistic
• Incremental vs. Batch learning

Page 10

Exclusive vs. Overlapping

[Figure: two plots with axes Age and Income]

Page 11

Hierarchical vs. Partitive

[Figure: plot with axes Age and Income]

Page 12

Hierarchical Clustering

• Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom. For example, all files and folders on the hard disk are organized in a hierarchy.
• There are two types of hierarchical clustering:
  – Agglomerative
  – Divisive

Page 13

Hierarchical Clustering

[Figure: two panels labeled "Agglomerative" and "Divisive"]

Page 14

Hierarchical Clustering - Agglomerative

1. Assign each observation to its own cluster.
2. Compute the similarity (e.g., distance) between each pair of clusters.
3. Join the two most similar clusters.
4. Repeat steps 2 and 3 until there is only a single cluster left.
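The agglomerative steps can be sketched as a naive Python loop. This is an illustrative sketch only: the function names are hypothetical, cluster similarity is assumed to be the smallest pairwise Euclidean distance, and no attempt is made at efficiency.

```python
import math


def closest_pair_distance(a, b):
    """Assumed cluster distance: smallest Euclidean distance between members."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(points, linkage=closest_pair_distance):
    """Merge clusters bottom-up and return the list of merges in order."""
    clusters = [[p] for p in points]  # step 1: one cluster per observation
    merges = []
    while len(clusters) > 1:  # step 4: repeat until a single cluster is left
        # Step 2: evaluate every pair of clusters and pick the closest pair.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        # Step 3: join the two most similar clusters.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical observations.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
for left, right in agglomerate(points):
    print(left, "+", right)
```

Each printed line is one merge, so n observations always produce n - 1 merges; reading the merges in order gives the hierarchy from the bottom up.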

Page 15

Hierarchical Clustering - Divisive

1. Assign all of the observations to a single cluster.
2. Partition the cluster into its two least similar sub-clusters.
3. Proceed recursively on each cluster until there is one cluster for each observation.
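The slide does not say how a cluster should be partitioned, so the sketch below makes an assumption: it splits a cluster around its two most distant points and assigns every other point to the nearer of the two. This is only one simple heuristic, and the function names are hypothetical.

```python
import math


def split(cluster):
    """Split a cluster (two or more points) around its two most distant members."""
    a, b = max(
        ((p, q) for i, p in enumerate(cluster) for q in cluster[i + 1:]),
        key=lambda pq: math.dist(pq[0], pq[1]),
    )
    left = [p for p in cluster if math.dist(p, a) <= math.dist(p, b)]
    right = [p for p in cluster if math.dist(p, a) > math.dist(p, b)]
    if not right:  # all points coincide, so split arbitrarily
        left, right = cluster[:1], cluster[1:]
    return left, right


def divide(cluster):
    """Recursively split a non-empty cluster; leaves are single observations."""
    if len(cluster) == 1:  # step 3 stops at one observation per cluster
        return cluster[0]
    left, right = split(cluster)  # step 2: partition into two sub-clusters
    return [divide(left), divide(right)]


# Step 1: all observations start in a single cluster (hypothetical data).
tree = divide([(0, 0), (0, 1), (10, 10), (10, 11)])
print(tree)  # [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
```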

Page 16

Hierarchical Clustering – Single Linkage

$L(r, s) = \min(D(x_{ri}, x_{sj}))$

[Figure: two clusters labeled r and s]

Page 17

Hierarchical Clustering – Complete Linkage

$L(r, s) = \max(D(x_{ri}, x_{sj}))$

[Figure: two clusters labeled r and s]

Page 18

Hierarchical Clustering – Average Linkage

$L(r, s) = \frac{1}{n_r n_s} \sum_{i=1}^{n_r} \sum_{j=1}^{n_s} D(x_{ri}, x_{sj})$

[Figure: two clusters labeled r and s]
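The single, complete and average linkage criteria differ only in how they summarize the pairwise distances between two clusters r and s. A minimal Python sketch, assuming Euclidean distance for D and hypothetical function names:

```python
import math


def pairwise_distances(r, s):
    """All distances D(x_ri, x_sj) between members of clusters r and s."""
    return [math.dist(p, q) for p in r for q in s]


def single_linkage(r, s):
    """Smallest pairwise distance between the two clusters."""
    return min(pairwise_distances(r, s))


def complete_linkage(r, s):
    """Largest pairwise distance between the two clusters."""
    return max(pairwise_distances(r, s))


def average_linkage(r, s):
    """Mean of all n_r * n_s pairwise distances."""
    return sum(pairwise_distances(r, s)) / (len(r) * len(s))


# Hypothetical clusters on a vertical line, so distances are easy to check by hand.
r = [(0, 0), (0, 1)]
s = [(0, 3), (0, 5)]
print(single_linkage(r, s))    # 2.0
print(complete_linkage(r, s))  # 5.0
print(average_linkage(r, s))   # 3.5
```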

Page 19

K Means Clustering

1. Choose k, the number of groups to cluster the data into (k is predefined).
2. Select k points at random as cluster centers.
3. Assign observations to their closest cluster center according to the Euclidean distance function.
4. Calculate the centroid or mean of all instances in each cluster (this is the "means" part).
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.
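The K-means procedure can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation: the function name, the `seed` and `max_iter` parameters, and the sample data are assumptions, and the loop re-runs only the assignment and centroid updates after the random centers have been chosen once.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster a list of equal-length tuples into k groups."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # select k points at random as cluster centers
    assignment = None
    for _ in range(max_iter):
        # Assign every observation to its closest center (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:  # same assignments in consecutive rounds
            break
        assignment = new_assignment
        # Recompute each center as the mean of its assigned observations.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the previous center if a cluster ends up empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment


# Hypothetical data with two well-separated groups.
data = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)]
centers, labels = kmeans(data, k=2)
print(sorted(centers))  # [(1.0, 1.5), (9.0, 9.5)]
```

The result generally depends on the random starting centers; on this small, well-separated example every choice of starting points leads to the same two groups.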

Page 20

K Means Clustering

[Figure: plot with axes Age and Income]

Page 21

K Means Clustering

Sum of Squares function:

$J = \sum_{j=1}^{K} \sum_{n \in S_j} (x_n - \mu_j)^2$

where $S_j$ is the set of observations in cluster j and $\mu_j$ is its centroid.
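The sum of squares objective can be evaluated directly for a given clustering. The sketch below assumes the centroid of each cluster is the mean of its members and that the squared term is the squared Euclidean distance; the function name and data are hypothetical.

```python
def sum_of_squares(clusters):
    """Total squared distance of every observation to its cluster centroid."""
    total = 0.0
    for members in clusters:
        # Centroid: coordinate-wise mean of the cluster's members.
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum(
            (value - mean) ** 2
            for point in members
            for value, mean in zip(point, centroid)
        )
    return total


# Hypothetical data: the same four points grouped in two different ways.
two_groups = [[(1, 1), (1, 2)], [(9, 9), (9, 10)]]
one_group = [[(1, 1), (1, 2), (9, 9), (9, 10)]]
print(sum_of_squares(two_groups))  # 1.0
print(sum_of_squares(one_group))   # 129.0
```

A lower value indicates tighter clusters, which is why the two-group split scores far better here than putting all four points together.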

Page 22

Clustering Evaluation

• Sarle’s Cubic Clustering Criterion
• The Pseudo-F Statistic
• The Pseudo-T² Statistic
• Beale’s F-Type Statistic
• Target-based

Page 23

Clustering Evaluation

Page 24

Chi2 Test

              Actual Y   Actual N
Predicted Y   n11        n12
Predicted N   n21        n22

$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$
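The chi-square statistic for a table of counts can be computed as sketched below. The slide does not define the expected counts, so this sketch assumes the usual independence form (row total times column total, divided by the grand total); the function name and the counts are hypothetical.

```python
def chi_square(table):
    """Chi-square statistic for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Assumed expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical 2 x 2 table: rows are Predicted Y/N, columns are Actual Y/N.
print(chi_square([[10, 20], [30, 40]]))  # about 0.794
print(chi_square([[10, 20], [20, 40]]))  # 0.0 (rows are proportional)
```

Turning the statistic into a p-value needs the chi-square distribution with (r - 1)(c - 1) degrees of freedom, which is left out of this sketch.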

Page 25

Analysis of Variance (ANOVA)

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square       F             P
Between Groups        SSB              dfB                  MSB = SSB/dfB     F = MSB/MSW   P(F)
Within Groups         SSW              dfW                  MSW = SSW/dfW
Total                 SST              dfT
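The quantities in the ANOVA table can be computed for one numeric variable split by cluster, as in the sketch below. The slide does not spell out the degrees of freedom, so the standard one-way definitions are assumed (dfB = number of groups minus 1, dfW = number of observations minus number of groups); the function name and the data are hypothetical.

```python
def one_way_anova(groups):
    """Return SSB, SSW, degrees of freedom, mean squares and F for a list of groups."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-groups sum of squares: group means around the grand mean.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-groups sum of squares: observations around their own group mean.
    ssw = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_b = len(groups) - 1        # assumed: number of groups minus 1
    df_w = n_total - len(groups)  # assumed: observations minus number of groups
    msb, msw = ssb / df_b, ssw / df_w
    return {"SSB": ssb, "SSW": ssw, "dfB": df_b, "dfW": df_w,
            "MSB": msb, "MSW": msw, "F": msb / msw}


# Hypothetical values of one variable in three clusters.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result["F"])  # about 13.0
```

A large F suggests the variable differs between clusters more than within them. The P(F) column requires the F distribution, which the Python standard library does not provide, so it is omitted here.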

Page 26

Clustering - Applications

• Marketing: finding groups of customers with similar behavior.
• Insurance & Banking: identifying fraud.
• Biology: classification of plants and animals given their features.
• Libraries: book ordering.
• City-planning: identifying groups of houses according to their house type, value and geographical location.
• World Wide Web: document classification; clustering weblog data to discover groups with similar access patterns.

Page 27

Summary

• Clustering is the process of organizing objects (records or variables) into groups whose members are similar in some way.
• Hierarchical and K-Means are the two most used clustering techniques.
• The effectiveness of the clustering method depends on the similarity function.
• The result of the clustering algorithm can be interpreted and evaluated in different ways.

Page 28

Questions?