MAFIA: Adaptive Grids for Clustering Massive Data Sets Harsha Nagesh, Sanjay Goil, Alok Choudhury...


Page 1:

MAFIA: Adaptive Grids for Clustering Massive Data Sets

Harsha Nagesh, Sanjay Goil, Alok Choudhury

-Udeepta Bordoloi

Page 2:

Clustering Algorithms

• BIRCH
• ROCK
• CLIQUE
– Inputs: grid size and density threshold
– Prunes subspaces
• MAFIA
– Adaptive grid size
– Inputs: density threshold
– No pruning of subspaces

Page 3:

Grids: CLIQUE way

Along each dimension:
• Divide the whole range into intervals (windows) of a size given by the user.
• Threshold the number of points in each interval by the user-supplied density to get clusters.
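The fixed-grid step can be sketched for a single dimension as follows. This is only an illustration, not CLIQUE's actual code: the function name, the half-open intervals, and the use of an absolute point count (`min_count`) in place of a density fraction are all assumptions.

```python
import math

def clique_dense_intervals(values, lo, hi, width, min_count):
    """Return indices of fixed-width intervals holding at least min_count points.

    Assumes lo <= v <= hi for every value; a value equal to hi is placed
    in the last interval.
    """
    n_intervals = math.ceil((hi - lo) / width)
    counts = [0] * n_intervals
    for v in values:
        idx = min(int((v - lo) // width), n_intervals - 1)
        counts[idx] += 1
    # Keep only the intervals that pass the user-supplied threshold.
    return [i for i, c in enumerate(counts) if c >= min_count]

points = [1, 1, 2, 2, 2, 3, 9, 9, 14]
print(clique_dense_intervals(points, lo=0, hi=15, width=5, min_count=3))  # [0]
```

Both the interval width and the threshold come from the user, so the result depends heavily on how well those two inputs match the data.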

Page 4:

Grids: MAFIA way

Along each dimension:
• Divide the whole range into many small windows.
• Compute a histogram (assuming discrete data here).
– E.g., we can divide the range of natural numbers (1-15) into 5 windows (1-3, 4-6, …, 13-15).
• Value of a window = max(histogram value within the window)
– E.g., if there are three 1s, zero 2s, and five 3s, then the value of the first window (1-3) = five.
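A small sketch of the window-value rule for discrete data, using the 1-15 range and windows of three values from the slide. The function name and signature are made up for illustration. With three 1s, zero 2s and five 3s, the first window gets max(3, 0, 5) = 5.

```python
from collections import Counter

def window_values(data, lo, hi, window_size):
    """Value of each window = largest histogram count among the values it covers."""
    hist = Counter(data)  # missing values count as 0
    values = []
    for start in range(lo, hi + 1, window_size):
        end = min(start + window_size - 1, hi)
        values.append(max(hist[v] for v in range(start, end + 1)))
    return values

data = [1, 1, 1, 3, 3, 3, 3, 3]          # three 1s, zero 2s, five 3s
print(window_values(data, 1, 15, 3))     # [5, 0, 0, 0, 0]
```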

Page 5:

Grids: MAFIA way

Along each dimension: (contd.)
• From left to right, merge adjacent windows whose values differ by less than a threshold β.
– Can be made a user input, but they hard-coded it (25-75%).
• What if no partition can be detected?
– Divide the range equally.
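The merge pass might look roughly like the sketch below. Several details are assumptions, not taken from the slides: the difference is measured relative to the larger of the two values, a merged window keeps the larger value, and the equal-split fallback uses a hypothetical `fallback_parts` parameter.

```python
def merge_windows(values, beta, fallback_parts=5):
    """Merge adjacent windows left to right; return (first_idx, last_idx, value) tuples.

    Assumptions: relative difference against the larger value, and the
    merged window takes the larger of the two values.
    """
    if not values:
        return []
    merged = [[0, 0, values[0]]]
    for i in range(1, len(values)):
        current = merged[-1]
        larger = max(current[2], values[i])
        rel_diff = abs(current[2] - values[i]) / larger if larger else 0.0
        if rel_diff < beta:
            current[1], current[2] = i, larger   # extend the current window
        else:
            merged.append([i, i, values[i]])     # start a new window
    if len(merged) == 1 and len(values) > 1:
        # No partition detected: fall back to an equal split of the range.
        parts = min(fallback_parts, len(values))
        bounds = [k * len(values) // parts for k in range(parts + 1)]
        merged = [[bounds[k], bounds[k + 1] - 1, max(values[bounds[k]:bounds[k + 1]])]
                  for k in range(parts)]
    return [tuple(w) for w in merged]

print(merge_windows([10, 11, 10, 40, 42, 5], beta=0.35))
# [(0, 2, 11), (3, 4, 42), (5, 5, 5)]
```

Under these assumptions, a flat histogram collapses into one window and triggers the fallback, e.g. `merge_windows([7, 7, 7, 7], 0.5, fallback_parts=2)` gives two equal halves.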

Page 6:

Compare…

CLIQUE MAFIA

Page 7:

Which windows are cluster candidates?

• CLIQUE: use user input threshold

• MAFIA: use user input threshold normalized to window size
– Cluster dominance factor: α
– Reports clusters as DNF expressions
– Cluster candidates henceforth referred to as Candidate Dense Units (CDU)
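One plausible reading of "normalized to window size" is sketched below: a window is a candidate when it holds more than α times the number of points it would receive if the data were spread uniformly over the dimension. The exact formula and all names here are assumptions for illustration.

```python
def candidate_windows(windows, n_points, dim_range, alpha=1.5):
    """windows: list of (lo, hi, count). Return the (lo, hi) of candidate windows.

    Assumed rule: count > alpha * n_points * (window width / dimension range).
    """
    result = []
    for lo, hi, count in windows:
        expected = n_points * (hi - lo) / dim_range   # uniform-data expectation
        if count > alpha * expected:
            result.append((lo, hi))
    return result

windows = [(0, 10, 50), (10, 12, 30), (12, 20, 20)]
print(candidate_windows(windows, n_points=100, dim_range=20))  # [(10, 12)]
```

The narrow window (10, 12) qualifies with only 30 points, while the wide window (0, 10) with 50 points does not. This is the effect of scaling the threshold by window size.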

Page 8:

Algorithm Initialization

B = number of records that fit into memory

1. Read data in chunks of B and build histogram for each dimension.

2. Determine the adaptive windows for each dimension, and the normalized thresholds for each window.

3. Get the candidate windows in each dimension.

4. Set the working dimension, k = 1.
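Step 1 can be sketched as a chunked pass over the records, keeping one histogram per dimension in memory. The record format (tuples of discrete values) and the function name are assumptions for illustration; `chunk_size` plays the role of B.

```python
from collections import Counter
from itertools import islice

def build_histograms(records, n_dims, chunk_size):
    """Read records chunk_size at a time and build one histogram per dimension."""
    hists = [Counter() for _ in range(n_dims)]
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))   # at most B records in memory
        if not chunk:
            break
        for rec in chunk:
            for d in range(n_dims):
                hists[d][rec[d]] += 1
    return hists

hists = build_histograms([(1, 5), (1, 6), (2, 5)], n_dims=2, chunk_size=2)
print(dict(hists[0]), dict(hists[1]))  # {1: 2, 2: 1} {5: 2, 6: 1}
```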

Page 9:

Main Loop

Repeat

1. k++;

2. Find candidate dense units (by combining dimensions);

3. Read through the data to find how many points lie in each of these CDUs;

4. Find the true dense units.

Until (no more dense units found)

Report the true dense units as clusters.
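Steps 3 and 4 of the loop could be sketched as one pass over the data followed by a threshold test. The CDU representation (a dict mapping dimension to a half-open window) and the rule that a CDU must beat the threshold of every window forming it are assumptions made for this sketch.

```python
def find_dense_units(records, cdus, window_threshold):
    """Count the points in each CDU in one pass, then keep the dense ones.

    cdus: list of dicts {dim: (lo, hi)} with half-open windows lo <= x < hi.
    window_threshold: dict {(dim, lo, hi): count} (hypothetical layout).
    """
    counts = [0] * len(cdus)
    for rec in records:                        # single pass over the data
        for i, cdu in enumerate(cdus):
            if all(lo <= rec[d] < hi for d, (lo, hi) in cdu.items()):
                counts[i] += 1
    dense = []
    for cdu, count in zip(cdus, counts):
        # Assumed rule: exceed the threshold of every constituent window.
        needed = max(window_threshold[(d, lo, hi)] for d, (lo, hi) in cdu.items())
        if count > needed:
            dense.append(cdu)
    return dense

records = [(1, 1), (1, 2), (2, 1), (8, 9), (1, 9)]
cdus = [{0: (0, 3), 1: (0, 3)}, {0: (0, 3), 1: (8, 10)}]
thresholds = {(0, 0, 3): 2, (1, 0, 3): 2, (1, 8, 10): 1}
print(find_dense_units(records, cdus, thresholds))  # [{0: (0, 3), 1: (0, 3)}]
```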

Page 10:

Building CDUs

• CDUs in k dimensions
– Merge two dense units of (k-1) dimensions
– such that they share any (k-2) dimensions.
– Each dense unit of (k-1) dimensions has to be compared with every other dense unit.
– This can lead to duplicate CDUs, so every CDU is compared with every other CDU to remove repeats.
• Dense units which cannot be combined are potential clusters (in a subspace).
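A minimal sketch of the merge step, assuming each dense unit is stored as a frozenset of (dimension, window) pairs. Note one deliberate swap: duplicates are removed with a Python set, not with the pairwise CDU-to-CDU comparison the slide describes. All names are hypothetical.

```python
from itertools import combinations

def build_cdus(dense_units):
    """Combine (k-1)-dimensional dense units that share k-2 dimensions.

    dense_units: list of frozensets of (dim, window) pairs, all of size k-1.
    Returns (set of k-dimensional CDUs, list of units that combined with nothing).
    """
    cdus = set()          # a set drops repeated CDUs automatically
    combined = set()
    for a, b in combinations(dense_units, 2):
        union = a | b
        dims = {d for d, _ in union}
        # Exactly one new pair, and no dimension used twice.
        if len(union) == len(a) + 1 and len(dims) == len(union):
            cdus.add(union)
            combined.update((a, b))
    uncombined = [u for u in dense_units if u not in combined]
    return cdus, uncombined

A = frozenset({(0, 1), (1, 2)})
B = frozenset({(1, 2), (2, 3)})
C = frozenset({(0, 1), (2, 3)})
D = frozenset({(4, 0), (5, 0)})
cdus, leftover = build_cdus([A, B, C, D])
print(len(cdus), leftover == [D])  # 1 True
```

Here A+B, A+C and B+C all produce the same 3D unit, which is how repeated CDUs arise. D combines with nothing and so remains a potential cluster in its own 2D subspace.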

Page 11:

Building CDU example (2D → 3D)

• We can get repeated CDUs
• Two passes required.
1. To combine two 2D units into one 3D unit.
2. To eliminate repeated CDUs.

Page 12:

Variables (Recap)

• Cluster dominance factor, α:
– High α, strong clusters and vice versa.
– Usual value: 1.5
• Window merging threshold, β:
– High β, more merging and hence coarser windows, and vice versa.

Page 13:

MAFIA vs. CLIQUE (speedup)

• CLIQUE used
– without pruning.
– with 10 bins for each dimension.
– with different thresholds?

Page 14:

MAFIA vs. CLIQUE (number of CDUs computed)

• Single 7D cluster in a 10D data space

• CLIQUE: 75 6D clusters, 546 7D clusters

Page 15:

MAFIA vs. CLIQUE (quality)

• Two 4D clusters in a 10D data space
• CLIQUE: cluster boundary very unreliable
– On using a variable number of (fixed size) bins in each dimension (how?), it misses one cluster.

Page 16:

MAFIA (scalability)