CLUSTERING (Segmentation)

Saed Sayad

www.ismartsoft.com

Data Mining Steps


What is Clustering?


[Figure: scatter plot of records with axes Age and Income]

Given a set of records, organize the records into clusters.

A cluster is a subset of records that are similar.

Clustering Requirements

• The ability to discover some or all of the hidden clusters.
• Within-cluster similarity and between-cluster dissimilarity.
• Ability to deal with various types of attributes.
• Can deal with noise and outliers.
• Can handle high dimensionality.
• Scalability, interpretability and usability.


Similarity - Distance Measure

To measure similarity or dissimilarity between objects, we need a distance measure. The usual axioms for a distance measure D are:

• D(x, x) = 0
• D(x, y) = D(y, x)
• D(x, y) ≤ D(x, z) + D(z, y) (the triangle inequality)


Similarity - Distance Measure


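Euclidean:

$$D(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$$

Manhattan:

$$D(x, y) = \sum_{i=1}^{k} |x_i - y_i|$$

Minkowski:

$$D(x, y) = \left( \sum_{i=1}^{k} |x_i - y_i|^q \right)^{1/q}$$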
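The Euclidean, Manhattan and Minkowski distances can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function names and the sample records (age, income in thousands) are hypothetical. It relies on the fact that Manhattan and Euclidean are the Minkowski distance with q = 1 and q = 2.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def manhattan(x, y):
    """Manhattan distance: Minkowski with q = 1."""
    return minkowski(x, y, 1)


def euclidean(x, y):
    """Euclidean distance: Minkowski with q = 2."""
    return minkowski(x, y, 2)


# Hypothetical records: (age, income in thousands)
a = (30, 50)
b = (33, 54)
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
```

Because attributes such as Age and Income are on different scales, it is usually reasonable to standardize them before computing distances; that step is left out here to keep the sketch short.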

Similarity - Correlation


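[Figure: two plots with axes Age and Credit$, one labeled Similar and one labeled Dissimilar]

$$r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$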
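As a rough sketch, the correlation coefficient can be computed directly from its definition. The function name `pearson_r` and the sample values are assumptions made for illustration; they do not come from the slides.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator terms: sums of squared deviations.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical profiles: values that rise together, and values that move oppositely.
rising = [1, 2, 3, 4]
print(pearson_r(rising, [2, 4, 6, 8]))  # close to +1: similar shape
print(pearson_r(rising, [8, 6, 4, 2]))  # close to -1: opposite shape
```

Unlike a distance, correlation compares the shape of two profiles rather than their magnitude, so two records can be far apart in Euclidean terms and still be highly correlated.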

Similarity – Hamming Distance

Gene 1 A A T C C A G T

Gene 2 T C T C A A G C

Hamming Distance 1 1 0 0 1 0 0 1


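$$D_H = \sum_{i=1}^{k} |x_i - y_i|$$

where, for categorical values, |x_i - y_i| is 0 when x_i = y_i and 1 otherwise.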
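The Hamming distance for the two gene sequences in the table can be checked with a short sketch. The function name `hamming_distance` is an assumption for illustration; the sequences are the ones shown on the slide.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(x, y) if a != b)


gene_1 = "AATCCAGT"
gene_2 = "TCTCAAGC"

# Per-position mismatches, matching the "Hamming Distance" row of the table.
print([int(a != b) for a, b in zip(gene_1, gene_2)])  # [1, 1, 0, 0, 1, 0, 0, 1]
print(hamming_distance(gene_1, gene_2))               # 4
```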

Clustering Methods

• Exclusive vs. Overlapping
• Hierarchical vs. Partitive
• Deterministic vs. Probabilistic
• Incremental vs. Batch learning


Exclusive vs. Overlapping


[Figure: two scatter plots with axes Age and Income]

Hierarchical vs. Partitive


[Figure: scatter plot with axes Age and Income]

Hierarchical Clustering

• Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom. For example, all files and folders on the hard disk are organized in a hierarchy.
• There are two types of hierarchical clustering:
– Agglomerative
– Divisive


Hierarchical Clustering


[Figure: two panels labeled Agglomerative and Divisive]

Hierarchical Clustering - Agglomerative

1. Assign each observation to its own cluster.
2. Compute the similarity (e.g., distance) between each pair of clusters.
3. Join the two most similar clusters.
4. Repeat steps 2 and 3 until there is only a single cluster left.
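The four agglomerative steps can be sketched as a naive Python loop. This is an illustrative sketch under stated assumptions, not an efficient or canonical implementation: the function name `agglomerative`, the `linkage` parameter (the built-in `min` gives single linkage, `max` gives complete linkage) and the one-dimensional sample data are all hypothetical.

```python
import math


def agglomerative(points, linkage=min):
    """Naive agglomerative clustering; returns the merges in the order they occur."""
    # Step 1: each observation starts in its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Step 2: compute the distance between every pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Step 3: join the two most similar (closest) clusters.
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        # Step 4: the while loop repeats until a single cluster is left.
    return merges


# Hypothetical one-attribute records.
points = [(1.0,), (2.0,), (6.0,), (7.5,), (15.0,)]
for left, right, distance in agglomerative(points):
    print(left, right, distance)
```

The recorded merge order and merge distances are exactly the information a dendrogram displays. Recomputing every pairwise distance on each pass is slow for large data sets, so this sketch is only suitable for small examples.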


Hierarchical Clustering - Divisive

1. Assign all of the observations to a single cluster.
2. Partition the cluster into the two least similar clusters.
3. Proceed recursively on each cluster until there is one cluster for each observation.


Hierarchical Clustering – Single Linkage


$$L(r, s) = \min\big(D(x_{ri}, x_{sj})\big)$$

[Figure: two clusters labeled r and s]

Hierarchical Clustering – Complete Linkage


$$L(r, s) = \max\big(D(x_{ri}, x_{sj})\big)$$

[Figure: two clusters labeled r and s]

Hierarchical Clustering – Average Linkage


$$L(r, s) = \frac{1}{n_r n_s} \sum_{i=1}^{n_r} \sum_{j=1}^{n_s} D(x_{ri}, x_{sj})$$

[Figure: two clusters labeled r and s]
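The three linkage criteria (single, complete and average) differ only in how they summarize the pairwise distances between the members of clusters r and s. The following sketch assumes Euclidean distance for D; the function names and the sample clusters are hypothetical.

```python
import math


def pairwise_distances(r, s):
    """All Euclidean distances D(x_ri, x_sj) between members of r and members of s."""
    return [math.dist(x, y) for x in r for y in s]


def single_linkage(r, s):
    """Shortest distance between any member of r and any member of s."""
    return min(pairwise_distances(r, s))


def complete_linkage(r, s):
    """Longest distance between any member of r and any member of s."""
    return max(pairwise_distances(r, s))


def average_linkage(r, s):
    """Mean of all n_r * n_s pairwise distances."""
    return sum(pairwise_distances(r, s)) / (len(r) * len(s))


# Hypothetical clusters in two dimensions.
r = [(0.0, 0.0), (0.0, 1.0)]
s = [(3.0, 0.0), (4.0, 0.0)]
print(single_linkage(r, s), average_linkage(r, s), complete_linkage(r, s))
```

For any pair of clusters the average linkage lies between the single and complete values, which is one reason it is often seen as a compromise between the two.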

K Means Clustering

1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as cluster centers.
3. Assign observations to their closest cluster center according to the Euclidean distance function.
4. Calculate the centroid or mean of all instances in each cluster (this is the "mean" part) and use it as the new cluster center.
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.
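A minimal sketch of the K-Means procedure, assuming Euclidean distance and random initial centers drawn from the data. The function name `k_means`, the `seed` and `max_iter` parameters, the handling of empty clusters and the sample records (age, income in thousands) are illustrative assumptions, not part of the slides.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, assignment) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Select k points at random as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each observation to its closest center (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Stop when the same points are assigned to each cluster as in the last round.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Move each center to the mean (centroid) of the points assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment


# Hypothetical records: (age, income in thousands)
records = [(25, 30), (27, 32), (26, 31), (55, 80), (57, 82), (56, 81)]
centers, labels = k_means(records, k=2)
print(sorted(centers))
print(labels)
```

The result can depend on the random starting centers, so in practice the algorithm is often run several times with different seeds and the run with the lowest within-cluster sum of squares is kept.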


K Means Clustering


[Figure: scatter plot with axes Age and Income]

K Means Clustering


Sum of Squares function:

$$J = \sum_{j=1}^{K} \sum_{n \in S_j} \lVert x_n - \mu_j \rVert^2$$

where S_j is the set of points in cluster j and \mu_j is their mean.
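The sum of squares function J can be evaluated for a given clustering with a short sketch. The function name `sum_of_squares` and the sample clusters are hypothetical; each cluster is assumed to be a non-empty list of numeric tuples.

```python
def sum_of_squares(clusters):
    """Total squared Euclidean distance from each point to its cluster mean."""
    total = 0.0
    for members in clusters:
        # Centroid: the per-attribute mean of the cluster members.
        centroid = [sum(c) / len(members) for c in zip(*members)]
        for point in members:
            total += sum((v - m) ** 2 for v, m in zip(point, centroid))
    return total


# Hypothetical clustering: one cluster with two points, one with a single point.
clusters = [[(1.0, 1.0), (3.0, 1.0)], [(10.0, 10.0)]]
print(sum_of_squares(clusters))  # 2.0
```

A lower J means tighter clusters, but J always decreases as K grows, so it is most useful for comparing solutions with the same number of clusters.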

Clustering Evaluation

• Sarle’s Cubic Clustering Criterion
• The Pseudo-F Statistic
• The Pseudo-T² Statistic
• Beale’s F-Type Statistic
• Target-based


Clustering Evaluation


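Chi² Test

              Actual Y   Actual N
Predicted Y   n11        n12
Predicted N   n21        n22

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$$

where n_ij are the observed counts and e_ij are the expected counts.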
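The chi-square statistic for a contingency table such as the Predicted vs. Actual table can be sketched as follows. The slides do not define the expected counts, so this sketch assumes the usual independence model, e_ij = (row total × column total) / N; the function name `chi_square` and the sample counts are hypothetical.

```python
def chi_square(table):
    """Chi-square statistic for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Assumed expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are Predicted Y/N, columns are Actual Y/N.
table = [[30, 10], [10, 30]]
print(chi_square(table))  # 20.0
```

The sketch assumes every row and column total is positive; a zero total would make an expected count zero and the statistic undefined.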

Analysis of Variance (ANOVA)


Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square      F             P
Between Groups        SSB              dfB                  MSB = SSB/dfB    F = MSB/MSW   P(F)
Within Groups         SSW              dfW                  MSW = SSW/dfW
Total                 SST              dfT
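The quantities in the ANOVA table can be computed for one attribute across clusters with a short sketch. It assumes the standard one-way definitions, which the slide does not spell out: dfB = number of groups − 1 and dfW = number of observations − number of groups. The function name `one_way_anova` and the sample values are hypothetical, and the P value is omitted because it needs the F distribution, which is not in the Python standard library.

```python
def one_way_anova(groups):
    """Return the ANOVA table entries for a list of groups of numeric values."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    # Between-groups sum of squares: group means around the grand mean.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-groups sum of squares: values around their own group mean.
    ssw = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_b = len(groups) - 1            # assumed: number of groups - 1
    df_w = len(values) - len(groups)  # assumed: observations - number of groups
    msb = ssb / df_b
    msw = ssw / df_w
    return {
        "SSB": ssb, "SSW": ssw, "SST": ssb + ssw,
        "dfB": df_b, "dfW": df_w, "dfT": df_b + df_w,
        "MSB": msb, "MSW": msw, "F": msb / msw,
    }


# Hypothetical attribute values for three clusters.
result = one_way_anova([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(result["F"])  # 27.0
```

A large F indicates that the attribute differs much more between clusters than within them, which suggests the clusters are well separated on that attribute.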

Clustering - Applications

• Marketing: finding groups of customers with similar behavior.
• Insurance & Banking: identifying fraud.
• Biology: classification of plants and animals given their features.
• Libraries: book ordering.
• City-planning: identifying groups of houses according to their house type, value and geographical location.
• World Wide Web: document classification; clustering weblog data to discover groups with similar access patterns.


Summary

• Clustering is the process of organizing objects (records or variables) into groups whose members are similar in some way.
• Hierarchical and K-Means are the two most used clustering techniques.
• The effectiveness of the clustering method depends on the similarity function.
• The result of the clustering algorithm can be interpreted and evaluated in different ways.



Questions?
