MAFIA: Adaptive Grids for Clustering Massive Data Sets

Harsha Nagesh, Sanjay Goil, Alok Choudhary

-Udeepta Bordoloi

Clustering Algorithms

• BIRCH
• ROCK
• CLIQUE
– Inputs: grid size and density threshold
– Prunes subspaces
• MAFIA
– Adaptive grid size
– Inputs: density threshold
– No pruning of subspaces

Grids: CLIQUE way

Along each dimension:
• Divide the whole range into intervals (windows) of a size given by the user.
• Threshold the number of points in each interval by the user-input density to get clusters.
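A minimal sketch of the fixed-grid idea in one dimension. The function name, the half-open intervals, and the use of an absolute point count as the density threshold are assumptions made for illustration; they are not taken from the CLIQUE paper.

```python
import math

def fixed_grid_dense_intervals(values, lo, hi, width, min_count):
    """Split [lo, hi) into equal intervals of `width` and keep the dense ones."""
    num_intervals = math.ceil((hi - lo) / width)
    counts = [0] * num_intervals
    for v in values:
        # Clamp so that a value equal to `hi` lands in the last interval.
        idx = min(int((v - lo) / width), num_intervals - 1)
        counts[idx] += 1
    # An interval is reported when its count reaches the user threshold.
    return [
        (lo + i * width, lo + (i + 1) * width)
        for i, c in enumerate(counts)
        if c >= min_count
    ]

values = [1, 2, 2, 3, 11, 12, 12, 13, 14]
dense = fixed_grid_dense_intervals(values, lo=0, hi=15, width=5, min_count=4)
```

Both the interval width and the threshold are fixed up front here, which is the dependence on user input that the adaptive grid tries to reduce.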

Grids: MAFIA way

Along each dimension:
• Divide the whole range into many small windows.
• Compute a histogram (assuming discrete data here).
– E.g., we can divide the range of natural numbers (1-15) into 5 windows (1-3, 4-6, …, 13-15).
• Value of a window = max(histogram value within the window)
– E.g., if there are three 1s, zero 2s, and five 3s, then the value of the first window (1-3) = five.
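The window-value rule can be sketched as follows for discrete integer data. The function name and the sample data are hypothetical; only the rule "window value = maximum histogram count inside the window" comes from the slide.

```python
from collections import Counter

def window_values(values, lo, hi, window_size):
    """Return [((start, end), value)] where value is the max histogram count."""
    hist = Counter(values)  # missing keys count as 0
    windows = []
    for start in range(lo, hi + 1, window_size):
        end = min(start + window_size - 1, hi)
        value = max(hist[v] for v in range(start, end + 1))
        windows.append(((start, end), value))
    return windows

# Three 1s, zero 2s, five 3s, plus a few points in the second window.
data = [1] * 3 + [3] * 5 + [4, 5, 5]
windows = window_values(data, lo=1, hi=15, window_size=3)
```

With these counts the first window (1-3) takes the value five, the largest of its three histogram entries.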

Grids: MAFIA way

Along each dimension: (contd.)
• From left to right, merge adjacent windows whose values differ by less than a threshold β.
– β can be made a user input, but the authors hard-coded it (25-75%).
• What if no partition can be detected?
– Divide the range equally.
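A sketch of the left-to-right merge with the equal-split fallback. Several details are assumptions, not statements about the authors' implementation: the difference is measured relative to the larger of the two values, a merged window keeps the larger value, and the fallback groups the small windows into a hypothetical `fallback_parts` equal groups.

```python
import math

def merge_windows(windows, beta):
    """windows: [((start, end), value)]; beta: relative threshold, e.g. 0.25."""
    merged = [windows[0]]
    for (start, end), value in windows[1:]:
        (m_start, _), m_value = merged[-1]
        larger = max(m_value, value)
        # Two empty windows are treated as identical and merged.
        if larger == 0 or abs(m_value - value) / larger < beta:
            merged[-1] = ((m_start, end), larger)
        else:
            merged.append(((start, end), value))
    return merged

def adaptive_grid(windows, beta, fallback_parts):
    merged = merge_windows(windows, beta)
    if len(merged) > 1:
        return [span for span, _ in merged]
    # Everything collapsed into one bin: divide the range equally instead.
    per_part = math.ceil(len(windows) / fallback_parts)
    return [
        (windows[i][0][0], windows[min(i + per_part, len(windows)) - 1][0][1])
        for i in range(0, len(windows), per_part)
    ]

peaked = [((1, 3), 5), ((4, 6), 5), ((7, 9), 1), ((10, 12), 1), ((13, 15), 1)]
uniform = [((1, 3), 2), ((4, 6), 2), ((7, 9), 2), ((10, 12), 2)]
```

In this sketch `adaptive_grid(peaked, 0.25, 2)` yields two bins that follow the change in density, while the uniform input falls back to an even split.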

Compare…

[Figure with two panels: CLIQUE, MAFIA]

Which windows are cluster candidates?

• CLIQUE: use the user-input threshold
• MAFIA: use the user-input threshold normalized to window size
– Cluster dominance factor: α
– Reports clusters as DNF expressions
– Cluster candidates henceforth referred to as Candidate Dense Units (CDU)
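One way to read "normalized to window size" is sketched below: a window is a candidate when its count exceeds α times the count it would hold if the points were spread uniformly over the dimension. That formula, the function name, and the inclusive integer bins are assumptions for illustration.

```python
def candidate_windows(bins, counts, total_points, dim_range, alpha=1.5):
    """bins: [(start, end)] inclusive; counts: points per bin."""
    result = []
    for (start, end), count in zip(bins, counts):
        width = end - start + 1
        # Expected count under a uniform spread, scaled by alpha.
        threshold = alpha * total_points * width / dim_range
        if count > threshold:
            result.append((start, end))
    return result

bins = [(1, 6), (7, 15)]
counts = [80, 20]
cands = candidate_windows(bins, counts, total_points=100, dim_range=15)
```

Here the first bin needs more than 60 points and the second more than 90, so a wide bin is not favoured merely because it covers more of the range.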

Algorithm Initialization

B = number of records that fit into memory

1. Read the data in chunks of B records and build a histogram for each dimension.

2. Determine the adaptive windows for each dimension, and the normalized threshold for each window.

3. Get the candidate windows in each dimension.

4. Set the working dimension, k = 1.
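Step 1 can be sketched as a single pass that never holds more than B records at once. The in-memory list standing in for the data file and both function names are hypothetical.

```python
from collections import Counter

def read_in_chunks(records, B):
    """Yield successive chunks of at most B records."""
    for start in range(0, len(records), B):
        yield records[start:start + B]

def build_histograms(chunks, num_dims):
    """Accumulate one value histogram per dimension across all chunks."""
    hists = [Counter() for _ in range(num_dims)]
    for chunk in chunks:
        for record in chunk:
            for dim, value in enumerate(record):
                hists[dim][value] += 1
    return hists

records = [(1, 5), (1, 6), (2, 5)]
hists = build_histograms(read_in_chunks(records, B=2), num_dims=2)
```

Only the histograms persist between chunks, so memory use depends on the number of distinct values per dimension rather than on the number of records.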

Main Loop

Repeat

1. k++;

2. Find candidate dense units (by combining dimensions);

3. Read through the data to find how many points lie in each of these CDUs;

4. Find the true dense units.

Until (no more dense units found)

Report the true dense units as clusters.
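The loop above can be sketched end to end on toy data. This is a simplified illustration under stated assumptions: a unit is a sorted tuple of (dimension, bin) pairs, a single hypothetical `min_points` replaces the per-window normalized thresholds, `bin_of` is a caller-supplied lookup, and dense units not contained in any higher-dimensional dense unit are the ones reported.

```python
from itertools import combinations

def in_unit(record, unit, bin_of):
    return all(bin_of(dim, record[dim]) == b for dim, b in unit)

def bottom_up(dense_1d, records, bin_of, min_points):
    dense = set(dense_1d)
    clusters = []
    k = 1
    while dense:
        k += 1
        # Step 2: combine dense units into k-dimensional candidates.
        cdus = set()
        for u, v in combinations(dense, 2):
            union = set(u) | set(v)
            if len(union) == k and len({dim for dim, _ in union}) == k:
                cdus.add(tuple(sorted(union)))
        # Step 3: one pass over the data to count points per CDU.
        counts = {cdu: 0 for cdu in cdus}
        for record in records:
            for cdu in cdus:
                if in_unit(record, cdu, bin_of):
                    counts[cdu] += 1
        # Step 4: keep the truly dense units.
        new_dense = {cdu for cdu, n in counts.items() if n >= min_points}
        # Units that did not grow into a denser, higher-dimensional unit.
        clusters.extend(sorted(
            u for u in dense if not any(set(u) <= set(c) for c in new_dense)))
        dense = new_dense
    return clusters

records = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.9), (1.5, 3.2)]
dense_1d = [((0, 0),), ((1, 0),)]
found = bottom_up(dense_1d, records, lambda dim, v: int(v), min_points=3)
```

Each iteration makes exactly one pass over the records, which matches the one-read-per-dimension structure of the loop.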

Building CDUs

• CDUs in k dimensions:
– Merge two dense units of (k-1) dimensions that share any (k-2) dimensions.
– Each dense unit of (k-1) dimensions has to be compared with every other dense unit.
– This can produce duplicate CDUs, so every CDU is then compared with every other CDU.
• Dense units that cannot be combined are potential clusters (in a subspace).
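A sketch of the merge test and of how uncombinable units fall out. The representation (a unit as a sorted tuple of (dimension, bin) pairs) and the function names are assumptions; the rule that two (k-1)-dimensional units must share (k-2) dimensions is from the slide, with matching bins on the shared dimensions assumed.

```python
from itertools import combinations

def try_merge(u, v):
    """Return the k-dim unit formed by u and v, or None if they do not fit."""
    merged = dict(u)
    for dim, b in v:
        if merged.setdefault(dim, b) != b:
            return None  # same dimension but a different bin
    if len(merged) != len(u) + 1:
        return None  # they must share exactly k-2 dimensions
    return tuple(sorted(merged.items()))

def build_cdus(dense_units):
    """Compare every unit with every other; duplicates are kept in the result."""
    cdus, used = [], set()
    for u, v in combinations(dense_units, 2):
        cdu = try_merge(u, v)
        if cdu is not None:
            cdus.append(cdu)
            used.update((u, v))
    leftover = [u for u in dense_units if u not in used]
    return cdus, leftover

units = [((0, 1), (1, 7)), ((1, 7), (2, 8)), ((0, 1), (2, 8)), ((3, 4), (4, 4))]
cdus, leftover = build_cdus(units)
```

The last unit shares no dimension with the others, so it is returned in `leftover` as a potential cluster in its own 2D subspace.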

Building CDU example (2D → 3D)

• We can get repeated CDUs.
• Two passes required:
1. To combine two 2D units into one 3D unit.
2. To eliminate repeated CDUs.
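The two passes can be traced on a small made-up case. The dimension names a, b, c and the bin numbers are hypothetical; they are not the values from the original slide figure.

```python
# Three dense 2D units that pairwise share one dimension.
units_2d = [
    frozenset({("a", 1), ("b", 7)}),
    frozenset({("b", 7), ("c", 8)}),
    frozenset({("a", 1), ("c", 8)}),
]

# Pass 1: combine every pair of 2D units that yields a valid 3D unit.
raw_cdus = []
for i, u in enumerate(units_2d):
    for v in units_2d[i + 1:]:
        union = u | v
        if len(union) == 3 and len({dim for dim, _ in union}) == 3:
            raw_cdus.append(union)

# Pass 2: compare every CDU with the ones already kept and drop repeats.
unique_cdus = []
for cdu in raw_cdus:
    if not any(cdu == kept for kept in unique_cdus):
        unique_cdus.append(cdu)
```

All three pairs produce the same 3D unit (a1, b7, c8), so the first pass emits it three times and the second pass reduces it to one.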

Variables (Recap)

• Cluster dominance factor, α:
– High α, strong clusters and vice versa.
– Usual value: 1.5

• Window merging threshold, β:
– High β, more windows are merged, giving coarser windows, and vice versa.

MAFIA vs. CLIQUE (speedup)

• CLIQUE used
– without pruning.
– with 10 bins for each dimension.
– with different thresholds?

MAFIA vs. CLIQUE (number of CDUs computed)

• Single 7D cluster in a 10D data space

• CLIQUE: 75 6D clusters, 546 7D clusters

MAFIA vs. CLIQUE (quality)

• Two 4D clusters in a 10D data space

• CLIQUE: cluster boundary very unreliable
– On using a variable number of (fixed size) bins in each dimension (how?), it misses one cluster.

MAFIA (scalability)
