
CLUSTERING (Segmentation)

Saed Sayad

www.ismartsoft.com

Data Mining Steps

[Figure: overview of the data mining steps.]

What is Clustering?

Given a set of records, organize the records into clusters. A cluster is a subset of records which are similar.

[Figure: Income vs. Age scatter plot with the records grouped into clusters.]

Clustering Requirements

• The ability to discover some or all of the hidden clusters.
• Within-cluster similarity and between-cluster dissimilarity.
• The ability to deal with various types of attributes.
• The ability to deal with noise and outliers.
• The ability to handle high dimensionality.
• Scalability, interpretability and usability.

Similarity - Distance Measure

To measure similarity or dissimilarity between objects, we need a distance measure. The usual axioms for a distance measure D are:

• D(x, x) = 0
• D(x, y) = D(y, x)
• D(x, y) ≤ D(x, z) + D(z, y)   (the triangle inequality)

Similarity - Distance Measure

Euclidean:  D(x, y) = √( Σᵢ (xᵢ − yᵢ)² )

Minkowski:  D(x, y) = ( Σᵢ |xᵢ − yᵢ|^q )^(1/q)

Manhattan:  D(x, y) = Σᵢ |xᵢ − yᵢ|

In each case the sum runs over the k attributes, i = 1, …, k. Minkowski with q = 2 gives the Euclidean distance and q = 1 gives the Manhattan distance.
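As a concrete illustration, here is a minimal Python sketch of the three measures; the function names and the two example records are made up:

    import math

    def euclidean(x, y):
        # square root of the summed squared attribute differences
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def manhattan(x, y):
        # summed absolute attribute differences
        return sum(abs(xi - yi) for xi, yi in zip(x, y))

    def minkowski(x, y, q):
        # general form: q = 1 gives Manhattan, q = 2 gives Euclidean
        return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1 / q)

    a, b = (25, 40), (30, 42)   # two records: (Age, Income in $1000s), made up
    print(euclidean(a, b), manhattan(a, b), minkowski(a, b, 3))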

Similarity - Correlation

[Figure: two Credit$ vs. Age scatter plots, one labelled Similar and one labelled Dissimilar.]

r_xy = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² )
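A small Python sketch of the same computation; the sample Age and Credit$ values are made up:

    import math

    def pearson(x, y):
        # numerator: co-variation of x and y around their means;
        # denominator: product of the two scatter terms
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        den = (math.sqrt(sum((xi - mx) ** 2 for xi in x))
               * math.sqrt(sum((yi - my) ** 2 for yi in y)))
        return num / den

    ages = [25, 32, 41, 50]
    credit = [1.2, 2.1, 2.9, 4.0]   # made-up Credit$ values
    print(pearson(ages, credit))     # close to +1, i.e., similar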

Similarity – Hamming Distance

Gene 1      A A T C C A G T
Gene 2      T C T C A A G C
Mismatch    1 1 0 0 1 0 0 1

D_H(x, y) = Σᵢ 1(xᵢ ≠ yᵢ), the number of positions at which the two sequences differ (4 in the example above).
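In code the count is a one-liner; the gene pair from the table gives 4:

    def hamming(x, y):
        # number of positions at which the two sequences differ
        return sum(xi != yi for xi, yi in zip(x, y))

    print(hamming("AATCCAGT", "TCTCAAGC"))   # 4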

Clustering Methods

• Exclusive vs. Overlapping
• Hierarchical vs. Partitive
• Deterministic vs. Probabilistic
• Incremental vs. Batch learning

Exclusive vs. Overlapping

[Figure: two Income vs. Age scatter plots, contrasting exclusive clusters with overlapping clusters.]

Hierarchical vs. Partitive

[Figure: Income vs. Age scatter plot contrasting a hierarchical clustering with a partitive one.]

Hierarchical Clustering

• Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom. For example, all files and folders on the hard disk are organized in a hierarchy.

• There are two types of hierarchical clustering:
  – Agglomerative
  – Divisive

Hierarchical Clustering

[Figure: dendrogram illustrating Agglomerative (bottom-up) and Divisive (top-down) clustering.]

Hierarchical Clustering - Agglomerative

1. Assign each observation to its own cluster.
2. Compute the similarity (e.g., distance) between each of the clusters.
3. Join the two most similar clusters.
4. Repeat steps 2 and 3 until there is only a single cluster left.
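A minimal pure-Python sketch of these four steps, assuming single linkage and stopping at k clusters rather than one so the result is visible; the function name and toy points are made up:

    import math

    def agglomerative(points, k):
        # step 1: every observation starts in its own cluster
        clusters = [[p] for p in points]
        while len(clusters) > k:
            # step 2: single-linkage distance between every pair of clusters
            pairs = [(i, j) for i in range(len(clusters))
                            for j in range(i + 1, len(clusters))]
            i, j = min(pairs, key=lambda ij: min(
                math.dist(a, b)
                for a in clusters[ij[0]] for b in clusters[ij[1]]))
            # step 3: join the two most similar clusters
            clusters[i].extend(clusters.pop(j))
        # step 4: the loop repeats until only k clusters remain
        return clusters

    print(agglomerative([(1, 1), (1, 2), (8, 8), (9, 8)], k=2))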

Hierarchical Clustering - Divisive

1. Assign all of the observations to a single cluster.
2. Partition the cluster into the two least similar clusters.
3. Proceed recursively on each cluster until there is one cluster for each observation.
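The slide does not say how the two least similar clusters are found; a common stand-in is a bisecting 2-means split, sketched here under that assumption (not robust to degenerate data):

    import math

    def two_means(points, iters=20):
        # seed with the two points farthest apart, then alternate
        # assignment and re-averaging
        c1, c2 = max(((a, b) for a in points for b in points),
                     key=lambda ab: math.dist(*ab))
        for _ in range(iters):
            g1 = [p for p in points if math.dist(p, c1) <= math.dist(p, c2)]
            g2 = [p for p in points if math.dist(p, c1) > math.dist(p, c2)]
            if not g1 or not g2:   # degenerate split; stop early
                break
            c1 = tuple(sum(v) / len(g1) for v in zip(*g1))
            c2 = tuple(sum(v) / len(g2) for v in zip(*g2))
        return g1, g2

    def divisive(points, k):
        # start from one all-inclusive cluster; split the largest until k remain
        clusters = [points]
        while len(clusters) < k:
            biggest = max(clusters, key=len)
            clusters.remove(biggest)
            clusters.extend(two_means(biggest))
        return clusters

    print(divisive([(1, 1), (1, 2), (8, 8), (9, 8)], k=2))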

Hierarchical Clustering – Single Linkage

L(r, s) = min D(xᵢ, xⱼ), taken over all members xᵢ of cluster r and xⱼ of cluster s: the distance between two clusters is the distance between their closest pair of members.

Hierarchical Clustering – Complete Linkage

L(r, s) = max D(xᵢ, xⱼ), taken over all members xᵢ of cluster r and xⱼ of cluster s: the distance between two clusters is the distance between their farthest pair of members.

Hierarchical Clustering – Average Linkage

L(r, s) = (1 / (nᵣ nₛ)) Σᵢ Σⱼ D(xᵢ, xⱼ), summed over all members xᵢ of cluster r and xⱼ of cluster s, where nᵣ and nₛ are the cluster sizes: the distance between two clusters is the average distance between all pairs of members.
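Assuming SciPy is available, the three linkage criteria can be compared directly; only the method argument changes (the toy data is made up):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[1, 1], [1, 2], [8, 8], [9, 8], [5, 5]])

    for method in ("single", "complete", "average"):
        Z = linkage(X, method=method)                     # build the merge tree
        labels = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 clusters
        print(method, labels)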

K Means Clustering

1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as cluster centers.
3. Assign observations to their closest cluster center according to the Euclidean distance function.
4. Calculate the centroid or mean of all instances in each cluster (this is the "means" part).
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.
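A minimal Python sketch of these five steps; the function name and toy records are made up:

    import math, random

    def kmeans(points, k, iters=100):
        centers = random.sample(points, k)        # step 2: random centers
        assignment = None
        for _ in range(iters):
            new = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                   for p in points]               # step 3: nearest center
            if new == assignment:                 # step 5: stable, so stop
                break
            assignment = new
            for c in range(k):                    # step 4: recompute the means
                members = [p for p, a in zip(points, assignment) if a == c]
                if members:
                    centers[c] = tuple(sum(v) / len(members)
                                       for v in zip(*members))
        return assignment, centers

    labels, centers = kmeans([(1, 1), (1, 2), (8, 8), (9, 8)], k=2)
    print(labels, centers)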

K Means Clustering

[Figure: Income vs. Age scatter plot showing the resulting clusters and their centers.]

K Means Clustering

Sum of Squares function:

J = Σⱼ Σₙ ‖xₙ − μⱼ‖²

where the outer sum runs over the K clusters, the inner sum runs over the points n assigned to cluster Sⱼ, and μⱼ is the centroid of cluster j; k-means seeks the assignment that minimizes J.

Clustering Evaluation

• Sarle's Cubic Clustering Criterion
• The Pseudo-F Statistic
• The Pseudo-T² Statistic
• Beale's F-Type Statistic
• Target-based


Chi² Test

                 Actual
                 Y      N
Predicted   Y    n11    n12
            N    n21    n22

χ² = Σᵢ Σⱼ (nᵢⱼ − eᵢⱼ)² / eᵢⱼ

where nᵢⱼ is the observed count in cell (i, j), eᵢⱼ is the corresponding expected count, and the sums run over the r rows and c columns of the table.
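For a 2×2 table the expected count for each cell is (row total × column total) / grand total; a sketch with made-up counts:

    def chi2_2x2(n11, n12, n21, n22):
        # chi-square statistic for a 2x2 predicted-vs-actual table
        total = n11 + n12 + n21 + n22
        rows = (n11 + n12, n21 + n22)
        cols = (n11 + n21, n12 + n22)
        obs = ((n11, n12), (n21, n22))
        return sum((obs[i][j] - rows[i] * cols[j] / total) ** 2
                   / (rows[i] * cols[j] / total)
                   for i in range(2) for j in range(2))

    print(chi2_2x2(40, 10, 5, 45))   # made-up counts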

Analysis of Variance (ANOVA)

Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square     | F             | P
Between Groups      | SSB            | dfB                | MSB = SSB / dfB | F = MSB / MSW | P(F)
Within Groups       | SSW            | dfW                | MSW = SSW / dfW |               |
Total               | SST            | dfT                |                 |               |
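A sketch of the underlying computation for one attribute split across clusters; the numbers are made up, and a real test would also look up P(F):

    def anova_f(groups):
        # one-way ANOVA F statistic: between-group vs. within-group variance
        all_vals = [v for g in groups for v in g]
        grand = sum(all_vals) / len(all_vals)
        ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
        ssw = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
        msb = ssb / (len(groups) - 1)              # MSB = SSB / dfB
        msw = ssw / (len(all_vals) - len(groups))  # MSW = SSW / dfW
        return msb / msw                           # F = MSB / MSW

    print(anova_f([[1, 2, 1], [8, 9, 8]]))   # large F: clusters differ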

Clustering - Applications

• Marketing: finding groups of customers with similar behavior.
• Insurance & Banking: identifying frauds.
• Biology: classification of plants and animals given their features.
• Libraries: book ordering.
• City-planning: identifying groups of houses according to their house type, value and geographical location.
• World Wide Web: document classification; clustering weblog data to discover groups with similar access patterns.

Summary

• Clustering is the process of organizing objects (records or variables) into groups whose members are similar in some way.
• Hierarchical and K-Means are the two most used clustering techniques.
• The effectiveness of the clustering method depends on the similarity function.
• The result of the clustering algorithm can be interpreted and evaluated in different ways.


Questions?