
CLUSTERING (Segmentation)

Saed Sayad

www.ismartsoft.com

Data Mining Steps

[Figure: overview of the data mining steps.]

What is Clustering?

Given a set of records, organize the records into clusters. A cluster is a subset of records which are similar.

[Figure: Income vs. Age scatter plot with the records grouped into clusters.]

Clustering Requirements

• The ability to discover some or all of the hidden clusters.
• Within-cluster similarity and between-cluster dissimilarity.
• The ability to deal with various types of attributes.
• The ability to deal with noise and outliers.
• The ability to handle high dimensionality.
• Scalability, interpretability and usability.

Similarity - Distance Measure

To measure similarity or dissimilarity between objects, we need a distance measure. The usual axioms for a distance measure D are:

• D(x, x) = 0
• D(x, y) = D(y, x)
• D(x, y) ≤ D(x, z) + D(z, y)   (the triangle inequality)

Similarity - Distance Measure

Euclidean:  D(x, y) = √( Σᵢ (xᵢ − yᵢ)² )

Minkowski:  D(x, y) = ( Σᵢ |xᵢ − yᵢ|^q )^(1/q)

Manhattan:  D(x, y) = Σᵢ |xᵢ − yᵢ|

In each case the sum runs over the k attributes, i = 1, …, k. Minkowski with q = 2 gives the Euclidean distance and q = 1 gives the Manhattan distance.
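As a concrete illustration, here is a minimal Python sketch of the three measures; the function names and the two example records are made up:

    import math

    def euclidean(x, y):
        # square root of the summed squared attribute differences
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def manhattan(x, y):
        # summed absolute attribute differences
        return sum(abs(xi - yi) for xi, yi in zip(x, y))

    def minkowski(x, y, q):
        # general form: q = 1 gives Manhattan, q = 2 gives Euclidean
        return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1 / q)

    a, b = (25, 40), (30, 42)   # two records: (Age, Income in $1000s), made up
    print(euclidean(a, b), manhattan(a, b), minkowski(a, b, 3))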

Similarity - Correlation

[Figure: two Credit$ vs. Age scatter plots, one labelled Similar and one labelled Dissimilar.]

r_xy = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² )
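A small Python sketch of the same computation; the sample Age and Credit$ values are made up:

    import math

    def pearson(x, y):
        # numerator: co-variation of x and y around their means;
        # denominator: product of the two scatter terms
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        den = (math.sqrt(sum((xi - mx) ** 2 for xi in x))
               * math.sqrt(sum((yi - my) ** 2 for yi in y)))
        return num / den

    ages = [25, 32, 41, 50]
    credit = [1.2, 2.1, 2.9, 4.0]   # made-up Credit$ values
    print(pearson(ages, credit))     # close to +1, i.e., similar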

Similarity – Hamming Distance

Gene 1      A A T C C A G T
Gene 2      T C T C A A G C
Mismatch    1 1 0 0 1 0 0 1

D_H(x, y) = Σᵢ 1(xᵢ ≠ yᵢ), the number of positions at which the two sequences differ (4 in the example above).
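In code the count is a one-liner; the gene pair from the table gives 4:

    def hamming(x, y):
        # number of positions at which the two sequences differ
        return sum(xi != yi for xi, yi in zip(x, y))

    print(hamming("AATCCAGT", "TCTCAAGC"))   # 4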

Clustering Methods

• Exclusive vs. Overlapping
• Hierarchical vs. Partitive
• Deterministic vs. Probabilistic
• Incremental vs. Batch learning

Exclusive vs. Overlapping

[Figure: two Income vs. Age scatter plots, contrasting exclusive clusters with overlapping clusters.]

Hierarchical vs. Partitive

[Figure: Income vs. Age scatter plot contrasting a hierarchical clustering with a partitive one.]

Hierarchical Clustering

• Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom. For example, all files and folders on the hard disk are organized in a hierarchy.

• There are two types of hierarchical clustering:
  – Agglomerative
  – Divisive

Hierarchical Clustering

[Figure: dendrogram illustrating Agglomerative (bottom-up) and Divisive (top-down) clustering.]

Hierarchical Clustering - Agglomerative

1. Assign each observation to its own cluster.
2. Compute the similarity (e.g., distance) between each of the clusters.
3. Join the two most similar clusters.
4. Repeat steps 2 and 3 until there is only a single cluster left.
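A minimal pure-Python sketch of these four steps, assuming single linkage and stopping at k clusters rather than one so the result is visible; the function name and toy points are made up:

    import math

    def agglomerative(points, k):
        # step 1: every observation starts in its own cluster
        clusters = [[p] for p in points]
        while len(clusters) > k:
            # step 2: single-linkage distance between every pair of clusters
            pairs = [(i, j) for i in range(len(clusters))
                            for j in range(i + 1, len(clusters))]
            i, j = min(pairs, key=lambda ij: min(
                math.dist(a, b)
                for a in clusters[ij[0]] for b in clusters[ij[1]]))
            # step 3: join the two most similar clusters
            clusters[i].extend(clusters.pop(j))
        # step 4: the loop repeats until only k clusters remain
        return clusters

    print(agglomerative([(1, 1), (1, 2), (8, 8), (9, 8)], k=2))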

Hierarchical Clustering - Divisive

1. Assign all of the observations to a single cluster.
2. Partition the cluster into the two least similar clusters.
3. Proceed recursively on each cluster until there is one cluster for each observation.
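The slide does not say how the two least similar clusters are found; a common stand-in is a bisecting 2-means split, sketched here under that assumption (not robust to degenerate data):

    import math

    def two_means(points, iters=20):
        # seed with the two points farthest apart, then alternate
        # assignment and re-averaging
        c1, c2 = max(((a, b) for a in points for b in points),
                     key=lambda ab: math.dist(*ab))
        for _ in range(iters):
            g1 = [p for p in points if math.dist(p, c1) <= math.dist(p, c2)]
            g2 = [p for p in points if math.dist(p, c1) > math.dist(p, c2)]
            if not g1 or not g2:   # degenerate split; stop early
                break
            c1 = tuple(sum(v) / len(g1) for v in zip(*g1))
            c2 = tuple(sum(v) / len(g2) for v in zip(*g2))
        return g1, g2

    def divisive(points, k):
        # start from one all-inclusive cluster; split the largest until k remain
        clusters = [points]
        while len(clusters) < k:
            biggest = max(clusters, key=len)
            clusters.remove(biggest)
            clusters.extend(two_means(biggest))
        return clusters

    print(divisive([(1, 1), (1, 2), (8, 8), (9, 8)], k=2))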

Hierarchical Clustering – Single Linkage

L(r, s) = min D(xᵢ, xⱼ), taken over all members xᵢ of cluster r and xⱼ of cluster s: the distance between two clusters is the distance between their closest pair of members.

Hierarchical Clustering – Complete Linkage

L(r, s) = max D(xᵢ, xⱼ), taken over all members xᵢ of cluster r and xⱼ of cluster s: the distance between two clusters is the distance between their farthest pair of members.

Hierarchical Clustering – Average Linkage

L(r, s) = (1 / (nᵣ nₛ)) Σᵢ Σⱼ D(xᵢ, xⱼ), summed over all members xᵢ of cluster r and xⱼ of cluster s, where nᵣ and nₛ are the cluster sizes: the distance between two clusters is the average distance between all pairs of members.
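Assuming SciPy is available, the three linkage criteria can be compared directly; only the method argument changes (the toy data is made up):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[1, 1], [1, 2], [8, 8], [9, 8], [5, 5]])

    for method in ("single", "complete", "average"):
        Z = linkage(X, method=method)                     # build the merge tree
        labels = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 clusters
        print(method, labels)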

K Means Clustering

1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as cluster centers.
3. Assign observations to their closest cluster center according to the Euclidean distance function.
4. Calculate the centroid or mean of all instances in each cluster (this is the "means" part).
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.
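A minimal Python sketch of these five steps; the function name and toy records are made up:

    import math, random

    def kmeans(points, k, iters=100):
        centers = random.sample(points, k)        # step 2: random centers
        assignment = None
        for _ in range(iters):
            new = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                   for p in points]               # step 3: nearest center
            if new == assignment:                 # step 5: stable, so stop
                break
            assignment = new
            for c in range(k):                    # step 4: recompute the means
                members = [p for p, a in zip(points, assignment) if a == c]
                if members:
                    centers[c] = tuple(sum(v) / len(members)
                                       for v in zip(*members))
        return assignment, centers

    labels, centers = kmeans([(1, 1), (1, 2), (8, 8), (9, 8)], k=2)
    print(labels, centers)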

K Means Clustering

[Figure: Income vs. Age scatter plot showing the resulting clusters and their centers.]

K Means Clustering

Sum of Squares function:

J = Σⱼ Σₙ ‖xₙ − μⱼ‖²

where the outer sum runs over the K clusters, the inner sum runs over the points n assigned to cluster Sⱼ, and μⱼ is the centroid of cluster j; k-means seeks the assignment that minimizes J.

Clustering Evaluation

• Sarle's Cubic Clustering Criterion
• The Pseudo-F Statistic
• The Pseudo-T² Statistic
• Beale's F-Type Statistic
• Target-based


Chi² Test

                 Actual
                 Y      N
Predicted   Y    n11    n12
            N    n21    n22

χ² = Σᵢ Σⱼ (nᵢⱼ − eᵢⱼ)² / eᵢⱼ

where nᵢⱼ is the observed count in cell (i, j), eᵢⱼ is the corresponding expected count, and the sums run over the r rows and c columns of the table.
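For a 2×2 table the expected count for each cell is (row total × column total) / grand total; a sketch with made-up counts:

    def chi2_2x2(n11, n12, n21, n22):
        # chi-square statistic for a 2x2 predicted-vs-actual table
        total = n11 + n12 + n21 + n22
        rows = (n11 + n12, n21 + n22)
        cols = (n11 + n21, n12 + n22)
        obs = ((n11, n12), (n21, n22))
        return sum((obs[i][j] - rows[i] * cols[j] / total) ** 2
                   / (rows[i] * cols[j] / total)
                   for i in range(2) for j in range(2))

    print(chi2_2x2(40, 10, 5, 45))   # made-up counts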

Analysis of Variance (ANOVA)

Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square     | F             | P
Between Groups      | SSB            | dfB                | MSB = SSB / dfB | F = MSB / MSW | P(F)
Within Groups       | SSW            | dfW                | MSW = SSW / dfW |               |
Total               | SST            | dfT                |                 |               |
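A sketch of the underlying computation for one attribute split across clusters; the numbers are made up, and a real test would also look up P(F):

    def anova_f(groups):
        # one-way ANOVA F statistic: between-group vs. within-group variance
        all_vals = [v for g in groups for v in g]
        grand = sum(all_vals) / len(all_vals)
        ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
        ssw = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
        msb = ssb / (len(groups) - 1)              # MSB = SSB / dfB
        msw = ssw / (len(all_vals) - len(groups))  # MSW = SSW / dfW
        return msb / msw                           # F = MSB / MSW

    print(anova_f([[1, 2, 1], [8, 9, 8]]))   # large F: clusters differ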

Clustering - Applications

• Marketing: finding groups of customers with similar behavior.
• Insurance & Banking: identifying frauds.
• Biology: classification of plants and animals given their features.
• Libraries: book ordering.
• City-planning: identifying groups of houses according to their house type, value and geographical location.
• World Wide Web: document classification; clustering weblog data to discover groups with similar access patterns.

Summary

• Clustering is the process of organizing objects (records or variables) into groups whose members are similar in some way.
• Hierarchical and K-Means are the two most used clustering techniques.
• The effectiveness of the clustering method depends on the similarity function.
• The result of the clustering algorithm can be interpreted and evaluated in different ways.


Questions?