Hierarchical clustering techniques

CS306 Presentation
Presented By: Md Syed Ahamad, Yanshul Sharma


Outline and Reference

▪ Outline
– Introduction
– Its types and Example
– Selected Research papers
– Experiment in some datasets

▪ Reference
– Introduction to the Hierarchical Clustering, Online Edition ©2009 Cambridge UP.
– Elio Masciari, Giuseppe Mazzeo and Carlo Zaniolo: A New, Fast and Accurate Algorithm for Hierarchical Clustering on Euclidean Distances. PAKDD (2) 2013: 111-122.
– Steinbach, M., Karypis, G., Kumar, V., “A Comparison of Document Clustering Techniques,” University of Minnesota.


Introduction

▪ Hierarchical clustering – clusters the given data into a hierarchical structure.
– It is structured and more informative than flat clustering.
– It is deterministic, but less efficient.
– It is preferred when one of the potential problems of flat clustering is a concern.
  ▪ Flat clustering techniques are mostly chosen for their efficiency.

▪ Types
– Agglomerative clustering – bottom up
– Divisive clustering – top down
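The bottom-up (agglomerative) idea can be sketched in a few lines of Python. This is only an illustration, not code from the presentation: the function name `agglomerative`, the single-linkage distance, and the sample points are all assumptions made for the example.

```python
from itertools import combinations
from math import dist


def agglomerative(points, num_clusters):
    """Bottom-up clustering sketch (single linkage is an assumption here).

    Starts with one cluster per point and repeatedly merges the two
    closest clusters until only num_clusters remain.
    Returns (clusters, merges), where merges records each merge step.
    """
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the closest members.
            d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = agglomerative(points, 2)
```

The list `merges` is the hierarchy itself: reading it from first to last entry walks up the tree from single points to the final clusters. A divisive method would build the same kind of tree in the opposite direction, starting from one cluster and splitting.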


Hierarchical clustering types

[ Src: http://www.saedsayad.com/images/Clustering_h1.png ]


Example

[ Src: http://tangibleauditoryinterfaces.de/wp-content/uploads/2010/04/durcheinander-cluster-chart.png ]


Selected papers

▪ The paper proposes a new algorithm called CLUBS.
▪ CLUBS – Clustering Using Binary Splitting.
– Faster than existing algorithms.
– More accurate, robust and impervious to noise.
– Works in a completely unsupervised fashion.
– Also works for density-based clustering.
– Can be used to refine the results of other algorithms.

▪ The popular k-means algorithm has a repeatability problem: its results can differ from run to run.
– CLUBS overcomes this problem.

Elio Masciari, Giuseppe Mazzeo and Carlo Zaniolo: A New, Fast and Accurate Algorithm for Hierarchical Clustering on Euclidean Distances. PAKDD (2) 2013: 111-122.


Approach

▪ CLUBS has two phases
– Divisive – the original data set is split recursively into mini-clusters through binary splitting.
  ▪ The resulting splits may be non-optimal.
– Agglomerative – the final mini-clusters are recursively combined into the final result.
  ▪ This phase corrects earlier non-optimal splits.

▪ The algorithm exploits SSQ (sum of squares) to minimize the cost of the split operation.
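A rough sketch of one SSQ-driven binary split is given below. It is a brute-force illustration under stated assumptions, not the CLUBS implementation: the helper names `ssq` and `best_binary_split` are hypothetical, and every axis-aligned cut is evaluated from scratch, whereas the paper relies on analytical properties of SSQ to make this step fast.

```python
def ssq(points):
    """Sum of squared distances of the points from their centroid."""
    n = len(points)
    if n == 0:
        return 0.0
    d = len(points[0])
    centre = [sum(p[i] for p in points) / n for i in range(d)]
    return sum((p[i] - centre[i]) ** 2 for p in points for i in range(d))


def best_binary_split(points):
    """Return (gain, dim, position, low, high) for the axis-aligned cut
    that reduces total SSQ the most, or None if no cut is possible."""
    base = ssq(points)
    best = None
    for dim in range(len(points[0])):
        values = sorted({p[dim] for p in points})
        # Cutting after the largest value would leave the high side empty.
        for x in values[:-1]:
            low = [p for p in points if p[dim] <= x]
            high = [p for p in points if p[dim] > x]
            gain = base - ssq(low) - ssq(high)
            if best is None or gain > best[0]:
                best = (gain, dim, x, low, high)
    return best


points = [(1, 1), (2, 1), (1, 2), (9, 1), (10, 2), (9, 2)]
gain, dim, x, low, high = best_binary_split(points)
```

On this toy data the chosen cut separates the two visible groups along the first coordinate. Applying the same function recursively to `low` and `high` would give the divisive phase described above.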


Algorithm

▪ Phase 1:
▪ Definition 1 – binary partition (BP).
– $D$ is a $d$-dimensional data distribution (a multi-dimensional array of integers).
– $N$ is the number of non-zero entries of $D$.
– $\rho_i$ is a range $[l..u]$ on the $i$-th dimension of $D$, with $1 \le l \le u \le n$ and $1 \le i \le d$; $size(\rho_i) = ub(\rho_i) - lb(\rho_i) + 1 = u - l + 1$.
– A block $b$ (of $D$) is a $d$-tuple $\{\rho_1, \dots, \rho_d\}$, with $vol(b) = size(\rho_1) \times \dots \times size(\rho_d)$.
– A point $x = (x_1, \dots, x_d)$ is chosen, with $lb(\rho_i) \le x_i \le ub(\rho_i)$.
– On dimension $i$, $x_i$ divides the range $\rho_i$ of $b$ into $\rho_i^{low} = [lb(\rho_i)..x_i]$ and $\rho_i^{high} = [(x_i + 1)..ub(\rho_i)]$, thus partitioning $b$ into $b^{low} = \{\rho_1, \dots, \rho_i^{low}, \dots, \rho_d\}$ and $b^{high} = \{\rho_1, \dots, \rho_i^{high}, \dots, \rho_d\}$.
– $(b^{low}, b^{high})$ is the binary split, $i$ is the splitting dimension, and $x_i$ is the splitting position.
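The definitions above map directly onto a small data structure. The following sketch assumes a block is stored as a tuple of inclusive `(lb, ub)` integer ranges; the function names are hypothetical, and dimensions are indexed from 0 in the code although the slide counts them from 1. The code also requires the split position to be strictly below the upper bound so that the high half is never empty, which is an assumption added here.

```python
from math import prod


def size(rng):
    """Number of integer positions in an inclusive range (lb, ub)."""
    lb, ub = rng
    return ub - lb + 1


def vol(block):
    """Volume of a block: the product of the sizes of its ranges."""
    return prod(size(r) for r in block)


def binary_split(block, i, x):
    """Split block on dimension i at position x into (b_low, b_high).

    b_low keeps [lb..x] on dimension i and b_high keeps [(x+1)..ub];
    all other ranges are copied unchanged.
    """
    lb, ub = block[i]
    if not lb <= x < ub:
        raise ValueError("split position must satisfy lb <= x < ub")
    b_low = block[:i] + ((lb, x),) + block[i + 1:]
    b_high = block[:i] + ((x + 1, ub),) + block[i + 1:]
    return b_low, b_high


block = ((1, 8), (1, 4))          # an 8 x 4 block
b_low, b_high = binary_split(block, 0, 3)
```

A quick sanity check is that the two halves tile the original block: `vol(b_low) + vol(b_high)` equals `vol(block)`.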


Algorithm

▪ Definition 2 – stopping condition of BP
– $C_s$ is a cluster, $p$ is a point, and $S = (S_1, \dots, S_d) = \sum_{p \in C_s} p$ is a vector.
– Centre of $C_s$: $C_s^0 = S/N$; $Q_i = \sum_{p \in C_s} p_i^2$.
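Summaries of this kind let the SSQ of a cluster be computed without revisiting its points. The sketch below assumes that S is the per-dimension sum of coordinates, that Q_i is the per-dimension sum of squared coordinates, and that N is the number of points in the cluster; under those assumptions the standard identity SSQ = Σ_i (Q_i − S_i²/N) holds. Function names are hypothetical.

```python
def cluster_summary(points):
    """Return (N, S, Q): point count, per-dimension sums, and
    per-dimension sums of squares (assumed meaning of S and Q)."""
    n = len(points)
    d = len(points[0])
    S = [sum(p[i] for p in points) for i in range(d)]
    Q = [sum(p[i] ** 2 for p in points) for i in range(d)]
    return n, S, Q


def centre(n, S):
    """Cluster centre S / N."""
    return [s / n for s in S]


def ssq_from_summary(n, S, Q):
    """SSQ via the identity sum_i (Q_i - S_i^2 / N)."""
    return sum(q - s * s / n for s, q in zip(S, Q))


def ssq_direct(points):
    """SSQ computed the long way, as squared distances to the centre."""
    n, S, _ = cluster_summary(points)
    c = centre(n, S)
    return sum((p[i] - c[i]) ** 2 for p in points for i in range(len(c)))


points = [(1, 2), (3, 4), (5, 6)]
n, S, Q = cluster_summary(points)
```

For these three points the centre is (3, 4) and both routes give an SSQ of 16. Because N, S and Q are plain sums, the summary of a merged cluster is just the element-wise sum of the two summaries.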


Algorithm

– Binary splitting stops when avgSSQ > deltSSQ, which yields n’ mini-clusters, where avgSSQ = SSQ0/n and deltSSQ = overall reduction of SSQ.

▪ Phase 2:
– The n’ mini-clusters are merged by repeatedly choosing the best pair (greedy approach).
– Merging continues until the increase in SSQ is greater than avgdeltSSQ.
– This gives the final result.

▪ Complexity – O(n.d.l.s)
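The greedy merge phase can be sketched as follows. This is an illustration under assumptions, not the published algorithm: `merge_cost` uses the standard identity that merging clusters A and B raises the total SSQ by |A||B| / (|A| + |B|) times the squared distance between their centres, and `threshold` is a hypothetical stand-in for avgdeltSSQ, whose exact computation the slides do not give.

```python
def merge_cost(a, b):
    """Increase in total SSQ caused by merging clusters a and b."""
    na, nb = len(a), len(b)
    d = len(a[0])
    ca = [sum(p[i] for p in a) / na for i in range(d)]
    cb = [sum(p[i] for p in b) / nb for i in range(d)]
    return na * nb / (na + nb) * sum((x - y) ** 2 for x, y in zip(ca, cb))


def greedy_merge(clusters, threshold):
    """Repeatedly merge the cheapest pair of clusters until the cheapest
    merge would raise SSQ by more than threshold (hypothetical limit)."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        cost, i, j = min(
            (merge_cost(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if cost > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


mini_clusters = [
    [(0, 0), (0, 1)],
    [(1, 0), (1, 1)],
    [(10, 10), (10, 11)],
    [(11, 10)],
]
result = greedy_merge(mini_clusters, threshold=5.0)
```

With this toy input the two cheap merges are accepted and the expensive merge between the two distant groups is rejected, leaving two clusters.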


Example


Algorithm


Experiment

– Dataset 1 – 42 patients in 3 groups (RM, HN, PM); 98 differentially expressed genes were selected and analysed.

– Dataset 2 – samples extracted from human breast cancer cells, consisting of four cell groups, were analysed.

– Ek = error calculated at 10 clusters.
– ε = probability that two similar data points belong to the same cluster.
– Qk = average percentage of points in the k-neighborhood of a generic point that belong to the same class as that point.
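One possible reading of Qk is sketched below. It assumes that the "k-neighborhood" of a point means its k nearest neighbours under Euclidean distance and that class membership is given by a label per point; both are interpretations, and the function name `q_k` and the sample data are hypothetical.

```python
from math import dist


def q_k(points, labels, k):
    """Average percentage of each point's k nearest neighbours that
    share the point's own class label (assumed reading of Qk)."""
    total = 0.0
    for idx, p in enumerate(points):
        others = [j for j in range(len(points)) if j != idx]
        others.sort(key=lambda j: dist(p, points[j]))
        neighbours = others[:k]
        same = sum(1 for j in neighbours if labels[j] == labels[idx])
        total += same / len(neighbours)
    return 100.0 * total / len(points)


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
clean = q_k(points, ["a", "a", "a", "b", "b", "b"], k=2)
noisy = q_k(points, ["a", "a", "b", "b", "b", "b"], k=2)
```

When the labels match the two spatial groups the score is 100; mislabelling one point lowers it, since that point and its neighbours no longer agree.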


Thank You