MAFIA: Adaptive Grids for Clustering Massive Data Sets Harsha Nagesh, Sanjay Goil, Alok Choudhary...

Post on 02-Jan-2016


MAFIA: Adaptive Grids for Clustering Massive Data Sets

Harsha Nagesh, Sanjay Goil, Alok Choudhary

-Udeepta Bordoloi

Clustering Algorithms

• BIRCH
• ROCK
• CLIQUE
  – Inputs: grid size and density threshold
  – Prunes subspaces
• MAFIA
  – Adaptive grid size
  – Inputs: density threshold
  – No pruning of subspaces

Grids: CLIQUE way

Along each dimension:
• Divide the whole range into intervals (windows) of a size given by the user.
• Threshold the number of points in each interval against the user-input density to get clusters.
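The two CLIQUE steps above can be sketched for a single dimension as follows. This is a minimal illustration, not the CLIQUE implementation: the function name and parameters are hypothetical, and it assumes the density threshold is expressed as a fraction of all points.

```python
from collections import Counter

def clique_dense_intervals(values, lo, hi, width, density_threshold):
    """Return (start, end, count) for each fixed-width interval of [lo, hi)
    whose share of all points exceeds density_threshold (assumed a fraction)."""
    n = len(values)
    counts = Counter()
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) // width)] += 1  # index of the interval holding v
    dense = []
    for idx in sorted(counts):
        if counts[idx] / n > density_threshold:
            start = lo + idx * width
            dense.append((start, min(start + width, hi), counts[idx]))
    return dense

values = [1, 1, 2, 2, 2, 3, 8, 9, 14]
dense = clique_dense_intervals(values, lo=0, hi=15, width=5, density_threshold=0.5)
```

Here the interval width is fixed by the caller, so a cluster that straddles an interval boundary is split regardless of where the data actually concentrates.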

Grids: MAFIA way

Along each dimension:
• Divide the whole range into many small windows.
• Compute a histogram (assuming discrete data here).
  – E.g., we can divide the range of natural numbers (1-15) into 5 windows (1-3, 4-6, …, 13-15).
• Value of a window = max(histogram value within the window)
  – E.g., if there are three 1s, zero 2s, and five 3s, then the value of the first window (1-3) = five.
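The histogram and window-value steps can be sketched as below for discrete integer data, using the 1-15 range from the slide. The function name and the sample data are made up for illustration.

```python
from collections import Counter

def window_values(values, lo, hi, window_size):
    """Split the integer range [lo, hi] into windows of window_size values and
    return (start, end, value), where value is the largest histogram count
    found inside the window."""
    hist = Counter(values)  # missing values count as 0
    windows = []
    for start in range(lo, hi + 1, window_size):
        end = min(start + window_size - 1, hi)
        peak = max(hist[v] for v in range(start, end + 1))
        windows.append((start, end, peak))
    return windows

# Three 1s, zero 2s, five 3s, plus two stray points.
data = [1, 1, 1, 3, 3, 3, 3, 3, 5, 14]
windows = window_values(data, lo=1, hi=15, window_size=3)
```

The first window (1-3) gets the value 5, the maximum of the counts 3, 0, and 5.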

Grids: MAFIA way

Along each dimension (contd.):
• From left to right, merge adjacent windows whose values differ by less than a threshold β.
  – β could be made a user input, but the authors hard-coded it (25-75%).
• What if no partition can be detected?
  – Divide the range equally.
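A possible sketch of the merge step and its fallback is shown below. Several details are assumptions, since the slide does not spell them out: the difference is taken relative to the larger of the two values, a merged window keeps the larger value, and the fallback regroups the small windows into a caller-chosen number of equal parts (`fallback_parts` is a hypothetical parameter).

```python
def merge_windows(windows, beta):
    """windows: (start, end, value) tuples sorted left to right.
    Merge a window into its left neighbour when their values differ by
    less than beta (relative difference, an assumption)."""
    if not windows:
        return []
    merged = [windows[0]]
    for start, end, value in windows[1:]:
        cur_start, _, cur_value = merged[-1]
        larger = max(cur_value, value)
        rel_diff = abs(cur_value - value) / larger if larger else 0.0
        if rel_diff < beta:
            merged[-1] = (cur_start, end, larger)  # extend the current window
        else:
            merged.append((start, end, value))     # start a new window
    return merged

def adaptive_grid(windows, beta, fallback_parts):
    """Merge windows; if everything collapses into one window (no partition
    detected), fall back to an equal split of the range."""
    merged = merge_windows(windows, beta)
    if len(merged) > 1:
        return merged
    size = -(-len(windows) // fallback_parts)  # ceiling division
    groups = [windows[i:i + size] for i in range(0, len(windows), size)]
    return [(g[0][0], g[-1][1], max(w[2] for w in g)) for g in groups]

skewed = [(1, 3, 5), (4, 6, 4), (7, 9, 1), (10, 12, 1), (13, 15, 0)]
merged = merge_windows(skewed, beta=0.5)

flat = [(1, 3, 2), (4, 6, 2), (7, 9, 2), (10, 12, 2)]
fallback = adaptive_grid(flat, beta=0.5, fallback_parts=2)
```

With the skewed data the grid boundaries land where the histogram changes sharply; with the flat data no boundary is found and the range is split equally.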

Compare…

[Figure: CLIQUE grid vs. MAFIA grid]

Which windows are cluster candidates?

• CLIQUE: use the user-input threshold.
• MAFIA: use the user-input threshold, normalized to window size.
  – Cluster dominance factor: α
  – Reports clusters as DNF expressions
  – Cluster candidates are henceforth referred to as Candidate Dense Units (CDUs)

Algorithm Initialization

B = number of records that fit into memory

1. Read data in chunks of B and build histogram for each dimension.

2. Determine the adaptive windows for each dimension, and the normalized thresholds for each window.

3. Get the candidate windows in each dimension.

4. Set the working dimension variable, k = 1.
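Step 1 of the initialization can be sketched as below: the data is consumed B records at a time and one histogram per dimension is accumulated. The sketch assumes discrete values and an in-memory iterable standing in for the data file; `build_histograms` and `chunk_size` are illustrative names.

```python
from collections import Counter
from itertools import islice

def build_histograms(records, n_dims, chunk_size):
    """Read records in chunks of chunk_size (B on the slide) and build one
    histogram (value -> count) per dimension."""
    histograms = [Counter() for _ in range(n_dims)]
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))  # at most B records in memory
        if not chunk:
            break
        for record in chunk:
            for d in range(n_dims):
                histograms[d][record[d]] += 1
    return histograms

records = [(1, 7), (1, 8), (3, 7), (3, 7), (2, 9)]
hists = build_histograms(records, n_dims=2, chunk_size=2)
```

Only the histograms and the current chunk need to be held in memory, which is what lets the pass work on data sets larger than B records.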

Main Loop

Repeat

1. k++;

2. Find candidate dense units (by combining dimensions);

3. Read through the data to find how many points lie in each of these CDUs;

4. Find the true dense units.

Until (no more dense units found)

Report the true dense units as clusters.
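Steps 3 and 4 of the loop (one pass over the data to count points per CDU, then thresholding) might look like the sketch below. The CDU representation, a tuple of (dimension, low, high) triples, and the per-CDU `thresholds` mapping are assumptions made for illustration; the slide does not say how the threshold of a multi-dimensional unit is derived.

```python
def count_points(records, cdus):
    """One pass over the data: count how many records fall inside each CDU.
    A CDU is a tuple of (dim, low, high) triples with inclusive bounds."""
    counts = {cdu: 0 for cdu in cdus}
    for record in records:
        for cdu in cdus:
            if all(low <= record[dim] <= high for dim, low, high in cdu):
                counts[cdu] += 1
    return counts

def true_dense_units(records, cdus, thresholds):
    """Keep the CDUs whose count exceeds their (assumed, precomputed) threshold."""
    counts = count_points(records, cdus)
    return [cdu for cdu in cdus if counts[cdu] > thresholds[cdu]]

records = [(1, 7), (2, 7), (2, 8), (9, 1)]
cdu_a = ((0, 1, 3), (1, 7, 9))   # dim 0 in 1..3 and dim 1 in 7..9
cdu_b = ((0, 7, 9), (1, 7, 9))   # dim 0 in 7..9 and dim 1 in 7..9
counts = count_points(records, [cdu_a, cdu_b])
dense = true_dense_units(records, [cdu_a, cdu_b], {cdu_a: 2, cdu_b: 2})
```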

Building CDUs

• CDUs in k dimensions
  – Merge two dense units of (k-1) dimensions that share any (k-2) dimensions.
  – Each dense unit of (k-1) dimensions has to be compared with every other dense unit.
  – This can produce duplicate CDUs, so every CDU is compared with every other CDU.
• A dense unit that cannot be combined with any other is a potential cluster (in a subspace).

Building CDU example (2D → 3D)

• We can get repeated CDUs.
• Two passes are required:
  1. To combine two 2D units into one 3D unit.
  2. To eliminate repeated CDUs.
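The two passes can be sketched as follows for the 2D → 3D case. Dense units are modelled as frozensets of (dimension, window index) pairs, which is an assumed representation, and `build_cdus` is a hypothetical name. Pass 2 deliberately compares every CDU with the ones already kept, mirroring the slide; a set would remove duplicates faster.

```python
from itertools import combinations

def build_cdus(dense_units):
    """dense_units: frozensets of (dim, window_index) pairs, all of size k-1.
    Returns (raw, unique): the k-dimensional CDUs before and after
    duplicate elimination."""
    # Pass 1: join every pair that shares k-2 windows and spans k dimensions.
    raw = []
    for a, b in combinations(dense_units, 2):
        union = a | b
        dims = {dim for dim, _ in union}
        if len(a & b) == len(a) - 1 and len(dims) == len(a) + 1:
            raw.append(union)
    # Pass 2: eliminate repeated CDUs, keeping first occurrences in order.
    unique = []
    for cdu in raw:
        if cdu not in unique:
            unique.append(cdu)
    return raw, unique

units_2d = [
    frozenset({(0, 1), (1, 2)}),
    frozenset({(1, 2), (2, 3)}),
    frozenset({(0, 1), (2, 3)}),
]
raw, unique = build_cdus(units_2d)
```

In this example all three pairs of 2D units join into the same 3D unit, so pass 1 yields three copies and pass 2 reduces them to one.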

Variables (Recap)

• Cluster dominance factor, α:
  – High α, strong clusters, and vice versa.
  – Usual value: 1.5

• Window merging threshold, β:
  – High β, coarse windows (more merging), and vice versa.

MAFIA vs. CLIQUE (speedup)

• CLIQUE was used:
  – without pruning.
  – with 10 bins for each dimension.
  – with different thresholds?

MAFIA vs. CLIQUE (number of CDUs computed)

• Single 7D cluster in a 10D data space

• CLIQUE: 75 6D clusters, 546 7D clusters

MAFIA vs. CLIQUE (quality)

• 2 4D clusters in 10D data space

• CLIQUE: cluster boundary very unreliable
  – When using a variable number of (fixed-size) bins in each dimension (how?), it misses one cluster.

MAFIA (scalability)