Clustering of Uncertain data objects by Voronoi-diagram-based approach
Speaker: Chan Kai Fong, Paul
Dept of CS, HKU


### Transcript of Clustering of Uncertain data objects by Voronoi-diagram-based approach

Presentation Outline

• Introduction: concept of clustering, clustering of uncertain objects; example: application of clustering on uncertain data; UK-means algorithm
• Motivation: Voronoi-diagram-based (VD) clustering; MinMax-based (MM) clustering; VD is strictly better than MinMax
• Clustering algorithms: VDBi, VDBiP, VD-based methods with Cluster Shift; when are VD-based methods better than MM-based methods?
• Experiments
• Conclusion

Introduction

Clustering
• Group similar data objects together to form clusters

Partition-based clustering
• Input: number of clusters (k), number of objects (n)
• Iterative method: in each iteration, divide the n data objects into k groups to minimize an objective function, e.g., the sum of squared distances
• Stop when the result has converged

Introduction

To cluster data points in 2D space:
• Data objects: n data points
• Apply any partition-based clustering algorithm (e.g., K-means)
• Distance measure: Euclidean distance, Manhattan distance, etc.

Introduction
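A minimal sketch of the partition-based loop described for point data: plain K-means with Euclidean distance, before any uncertainty is introduced. The function name `kmeans` and the `init`, `max_iter` and `seed` parameters are illustrative assumptions, not something taken from the talk.

```python
import math
import random

def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Plain K-means on 2D points with Euclidean distance (illustrative only)."""
    rng = random.Random(seed)
    # Initial representatives: caller-supplied, or k input points picked at random.
    centers = list(init) if init is not None else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest representative.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:  # converged: no point changed cluster
            break
        assignment = new_assignment
        # Update step: move each representative to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old representative
                centers[j] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centers, assignment
```

Each iteration performs n * k distance evaluations, which is the cost that becomes problematic once every evaluation is an expected distance.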

To cluster uncertain objects in 2D space:
• Uncertain objects: objects with uncertainty (e.g., location uncertainty); they have no fixed coordinates in 2D space
• An object's location is described by a probability density function (pdf) over an uncertainty region
• Assume the pdf of each object can be obtained
• Uncertainty region (ur): the region in which the object may appear, with a certain probability distribution; the probability that the object appears outside its uncertainty region is zero
• Each object may have an irregular uncertainty region, and its pdf may be arbitrary

[Figure: uncertain object o1, its uncertainty region o1.ur, and the MBR of o1.ur]

The expected distance (ED) is used to measure the distance between an uncertain object and a cluster representative:

ED(oi, pj) = ∫ f(x) d(x, pj) dx, integrated over all x in oi.ur

ED is the expected distance function, d is the Euclidean distance function, x is any point inside oi's uncertainty region, f is the pdf of uncertain object oi, and pj is any cluster representative.

ED computations are very expensive: each iteration of K-means requires nk ED computations.
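As a rough illustration of why ED is costly, the sketch below approximates it by a weighted sum over sample points of the uncertainty region. It assumes the pdf has already been discretised into (point, weight) pairs whose weights sum to 1; `expected_distance` and `uniform_grid_samples` are hypothetical names, and the uniform grid is only one possible pdf.

```python
import math

def expected_distance(samples, rep):
    """Approximate ED(o, p) as a probability-weighted sum over sample points.

    samples: list of ((x, y), weight) pairs from the object's uncertainty
             region; weights are assumed to sum to 1 (a discretised pdf).
    rep:     (x, y) coordinates of the cluster representative p.
    """
    return sum(w * math.dist(pt, rep) for pt, w in samples)

def uniform_grid_samples(mbr, side=20):
    """Hypothetical helper: side x side grid over an MBR with a uniform pdf."""
    (x_lo, y_lo), (x_hi, y_hi) = mbr
    w = 1.0 / (side * side)
    return [
        ((x_lo + (i + 0.5) * (x_hi - x_lo) / side,
          y_lo + (j + 0.5) * (y_hi - y_lo) / side), w)
        for i in range(side) for j in range(side)
    ]
```

With a 20 * 20 grid, a single ED evaluation already costs 400 point-to-point distances, and a naive iteration needs nk such evaluations.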

[Figure: expected distance computation between uncertain object oi and cluster representative pj, labelled ED(oi, pj)]

Application: Clustering the vehicles

• Objective: get traffic patterns by clustering the vehicles in a city
• Data objects: vehicles on a 2D map
• Uncertainty: location uncertainty of the vehicles; each pdf, defined over the object's uncertainty region, represents the probability distribution of the possible locations of a vehicle over a certain period of time

[Figure: uncertain object oi]

The degree of uncertainty is affected by the following factors:

1. Time

4. Speed of the vehicles

[Figure: object oi and results]

UK-means

• UK-means: the first extension of the K-means algorithm to handle uncertain objects
• Distance measure: expected distance (ED)
• Disadvantage: slow and inefficient
• Shows the possibility of using K-means to handle the clustering of uncertain objects

Two Approaches to solve clustering problem by UK-means

1. MinMax-based approach (Jacky)

2. Voronoi-Diagram-based approach (Paul)

Motivation

Two Approaches to solve clustering problem by UK-means

1. MinMax-based approach (Jacky)
• Basic MinMax distance pruning (MinMax)
• MinMax with pre-computation of ED
• MinMax with Cluster Shift (MinMax-Shift)

2. Voronoi-Diagram-based approach (Paul)
• Voronoi diagram with Bisector Pruning (VDBi)
• Voronoi diagram with Bisector Pruning and Partial expected distance computations (VDBiP)
• Voronoi diagram with Bisector Pruning and Cluster Shift (VDBi-Shift)
• Voronoi diagram with Bisector Pruning and Partial expected distance computations and Cluster Shift (VDBiP-Shift)

MinMax-based Approach

UK-means with MinMax distance pruning

• Objective: avoid expected distance computations
• Use mindist and maxdist between an object's MBR and the cluster representatives as lower and upper bounds on ED(cj, oi) and ED(cm, oi)
• E.g., given an object oi and cluster representatives cj and cm, if mindist(cj, oi) > maxdist(cm, oi), then cj can be pruned
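The pruning rule above can be sketched as follows, under the assumption that an MBR is given as a pair of corner points ((x_lo, y_lo), (x_hi, y_hi)) and representatives are (x, y) tuples. The function names are illustrative, not taken from the talk.

```python
import math

def mindist(mbr, p):
    """Smallest Euclidean distance from point p to any point of the rectangle."""
    (x_lo, y_lo), (x_hi, y_hi) = mbr
    dx = max(x_lo - p[0], 0.0, p[0] - x_hi)
    dy = max(y_lo - p[1], 0.0, p[1] - y_hi)
    return math.hypot(dx, dy)

def maxdist(mbr, p):
    """Largest Euclidean distance from p to any point of the rectangle (a corner)."""
    (x_lo, y_lo), (x_hi, y_hi) = mbr
    dx = max(abs(p[0] - x_lo), abs(p[0] - x_hi))
    dy = max(abs(p[1] - y_lo), abs(p[1] - y_hi))
    return math.hypot(dx, dy)

def minmax_prune(mbr, reps):
    """Indices of representatives that survive basic MinMax pruning.

    A representative cj is pruned when mindist(cj, o) exceeds the smallest
    maxdist(cm, o) over all representatives, since then ED(cj, o) > ED(cm, o).
    """
    best_upper = min(maxdist(mbr, c) for c in reps)
    return [j for j, c in enumerate(reps) if mindist(mbr, c) <= best_upper]
```

If only one index survives, the object is assigned without any ED computation; otherwise ED is computed only for the survivors.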

[Figure: object oi with cluster representatives cj and cm. Since mindist(cj, oi) > maxdist(cm, oi), we have ED(cj, oi) > ED(cm, oi), so cj is pruned and ED(cj, oi) need not be calculated.]

MinMax-based Approach

• The upper and lower bounds can be made tighter by the Cluster Shift (CS) and ED Pre-computation (PC) methods
• These replace the loose mindist and maxdist estimates with tighter estimates of the distance bounds
• For details, refer to Jacky's work

Voronoi-diagram-based approach

• Each object's uncertainty region is bounded by its minimum bounding rectangle (MBR)
• The objects' MBRs are indexed by an R-tree
• A Voronoi diagram is constructed for the cluster representatives in each iteration

[Figure: Voronoi diagram for 5 cluster representatives; uncertain object o1 indexed by the R-tree]

[Figure: object o1 and the bisector of p1 and p2]

Voronoi-diagram-based approach

If the bisector of two cluster representatives p1 and p2 does not cut an object's MBR, and the MBR falls on the p2 side of the bisector, then ED(p1, o1) > ED(p2, o1).

[Figure: object o1 with cluster representatives p1, p2 and p3]

ED(o1, p2) < ED(o1, p1) and ED(o1, p2) < ED(o1, p3), so o1 is assigned to cluster p2.
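A sketch of the bisector test, assuming axis-aligned MBRs given as ((x_lo, y_lo), (x_hi, y_hi)). Because "closer to p than to q" is a linear condition on a point, testing the four corners of the MBR is enough. The function names are illustrative.

```python
def closer_everywhere(mbr, p, q):
    """True if every point of the rectangle is strictly closer to p than to q.

    A point x is closer to p than to q exactly when
        2 * (q - p) . x  <  |q|^2 - |p|^2,
    which is linear in x, so it holds on the whole MBR iff it holds at
    all four corners (the bisector of p and q does not cut the MBR).
    """
    (x_lo, y_lo), (x_hi, y_hi) = mbr
    ax, ay = 2 * (q[0] - p[0]), 2 * (q[1] - p[1])
    rhs = (q[0] ** 2 + q[1] ** 2) - (p[0] ** 2 + p[1] ** 2)
    corners = [(x_lo, y_lo), (x_lo, y_hi), (x_hi, y_lo), (x_hi, y_hi)]
    return all(ax * x + ay * y < rhs for x, y in corners)

def bisector_prune(mbr, reps):
    """Indices of representatives not pruned by any pairwise bisector test."""
    return [
        j for j, q in enumerate(reps)
        if not any(closer_everywhere(mbr, p, q)
                   for i, p in enumerate(reps) if i != j)
    ]
```

When a single index remains, the object can be assigned to that cluster without computing any expected distance.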

[Figure: Voronoi-diagram-based approach (cluster assignment)]

Voronoi-diagram-based approach

In each iteration, for each Voronoi cell (approximated by an MBR):
• Issue a range query to the objects' R-tree to retrieve the candidate objects for the cluster
• If a candidate's MBR is completely enclosed in the Voronoi cell, assign the object to the cluster
• If a candidate's MBR intersects more than one Voronoi cell, special handling is required to prune away the unqualified clusters
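The enclosure test can be sketched without building the diagram explicitly: Voronoi cells are convex, so an MBR lies inside one cell exactly when all four corners share the same nearest representative. This sketch substitutes a linear scan for the R-tree range query described above, and the names `nearest_rep` and `classify_objects` are hypothetical.

```python
import math

def nearest_rep(point, reps):
    """Index of the representative whose Voronoi cell contains the point."""
    return min(range(len(reps)), key=lambda j: math.dist(point, reps[j]))

def classify_objects(mbrs, reps):
    """Split objects into those enclosed in one cell and those needing more work.

    Returns (assigned, ambiguous): assigned maps object index -> cluster index
    for MBRs lying in a single Voronoi cell; ambiguous lists the objects whose
    MBR intersects more than one cell.
    """
    assigned, ambiguous = {}, []
    for obj_id, ((x_lo, y_lo), (x_hi, y_hi)) in enumerate(mbrs):
        corners = [(x_lo, y_lo), (x_lo, y_hi), (x_hi, y_lo), (x_hi, y_hi)]
        owners = {nearest_rep(c, reps) for c in corners}
        if len(owners) == 1:
            assigned[obj_id] = owners.pop()   # no ED computation needed
        else:
            ambiguous.append(obj_id)          # bisector pruning / ED required
    return assigned, ambiguous
```

Only the objects in `ambiguous` go on to bisector pruning and, if still undecided, to expected distance computation.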

[Figure: candidate objects retrieved for a cluster, showing an object enclosed entirely in the Voronoi cell and an object that intersects more than one Voronoi cell]

Avoid expected distance computation

1. If an object is completely enclosed in a Voronoi cell, then the object must belong to that cluster

2. In the best case, no expensive expected distance calculations are needed, and the objects' pdfs need not be retrieved during clustering

Voronoi diagram construction cost is independent of number of objects

• Computing the 2D Voronoi diagram takes only O(k log k) time in each iteration, where k is the number of clusters; k does not depend on the number of objects
• n is much larger than k

Difficulties of Voronoi based clustering

1. Handling of uncertain objects that intersect more than one Voronoi cell
• We cannot determine the nearest cluster by just looking at the Voronoi diagram

[Figure: object o1 overlapping the Voronoi cells of c1, c2 and c3]

Is VD better than basic MinMax?

Theorem: VD is strictly better than basic MinMax
• Given an object oi that is assigned to cluster c1, for any iteration of UK-means, if VD calculates ED(oi, cp) for some cp, then MM must calculate ED(oi, cp) as well
• If VD does not calculate ED(oi, cp), MM sometimes must still calculate ED(oi, cp)

In some situations, VD based is better

• VD-based methods are always better than basic MinMax, but they may not beat MinMax-Shift
• In some situations, VD-based methods outperform all MM-based methods: when the objects' uncertainty is very small, VD-based methods are preferred

Clustering algorithms

Clustering Methods

Voronoi-diagram-based approach

1. Voronoi diagram with bisector pruning (VDBi)

2. Voronoi diagram with bisector pruning and partial expected distance computation (VDBiP)

MinMax-based Methods

For each object:
• Find the upper and lower bounds of the ED values
  • If the Cluster-Shift (CS) method is not enabled, the upper and lower bounds are estimated by "maxdist" and "mindist" respectively (MinMax)
  • If the CS method is enabled, the upper and lower bounds become tighter (MinMax-Shift)
• Prune unwanted clusters using the upper and lower bounds
• For all un-pruned clusters, compute the ED values to determine the cluster assignment of the object

Voronoi-diagram-based Methods

• Before each iteration, a Voronoi diagram is constructed for all cluster representatives
• For each cluster representative:
  • Find the objects that are completely enclosed in the cluster's Voronoi cell
  • Apply bisector pruning to prune unrelated clusters

Voronoi diagram with Bisector Pruning (VDBi)

[Figure: object o1 with cluster representatives c1, c2 and c3]

• Comparing c1 and c3: o1 falls on the c1 side of bisector(c1, c3), so c3 can be pruned.
• Since the bisector of c1 and c2 cuts o1's MBR, o1 may be assigned to either c1 or c2.

Voronoi diagram with bisector pruning and partial expected distance computation (VDBiP)

• Cut the object's MBR into two equal halves, (a) and (b)

[Figure: object o1 split into halves (a) and (b)]

VDBiP

If the MBR of o1(b) is completely enclosed in the Voronoi cell of c2:

• Compute ED(o1(a), c1) and ED(o1(a), c2)
• Since o1(b) lies inside the Voronoi cell of c2, ED(o1(b), c2) < ED(o1(b), c1)
• If ED(o1(a), c2) < ED(o1(a), c1), then ED(o1(a), c2) + ED(o1(b), c2) < ED(o1(a), c1) + ED(o1(b), c1) => prune c1
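A minimal sketch of the partial computation, assuming the object's pdf is discretised into weighted sample points and that the MBR is split by a vertical line at `split_x`. Both the sample representation and the function names are assumptions made for illustration; the ED terms here are unnormalised contributions of each half, so they add up to the full ED.

```python
import math

def partial_ed(samples, rep):
    """Unnormalised ED contribution of a subset of weighted sample points."""
    return sum(w * math.dist(pt, rep) for pt, w in samples)

def vdbip_can_prune_c1(samples, split_x, c1, c2):
    """Decide whether c1 can be pruned in favour of c2 using only half (a).

    Assumes, as in the case described above, that half (b) (samples with
    x >= split_x) is already known to lie inside the Voronoi cell of c2,
    so its contribution is smaller for c2 than for c1 without computing it.
    """
    half_a = [(pt, w) for pt, w in samples if pt[0] < split_x]
    return partial_ed(half_a, c2) < partial_ed(half_a, c1)
```

When the test succeeds, only half of the sample points were visited for each of the two representatives, which is where the saving over a full ED computation comes from.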

[Figure: object o1 split into halves (a) and (b) between c1 and c2, showing ED(o1(a), c1) and ED(o1(a), c2)]

Experiments

Measures
• Efficiency (number of expected distance computations required)

Comparison of
• Basic Min-max distance pruning (MinMax)
• Voronoi diagram with Bisector Pruning (VDBi)
• Voronoi diagram with Bisector Pruning and Partial expected distance computation (VDBiP)
• MM-based with Cluster Shift (MinMax-Shift)
• VD-based with Cluster Shift (VDBi-Shift, VDBiP-Shift)

Experimental Settings

| Parameter | Value |
| --- | --- |
| Data set | Randomly generated synthetic data set |
| Probability density function | Random |
| Domain | 100 x 100 2D space |
| Number of objects | 10000 |
| Number of clusters | Varied |
| Maximum length of an MBR's side | 10%, 1%, 0.1% |
| Number of sample points | 20 * 20 |

Degree of uncertainty is large (MBR width = 10%)

[Figure: two charts of the number of expected distance calculations per object per iteration against the number of clusters (5 to 9, and 10 to 50), with series MINMAX, VDBi, VDBiP, MINMAX-SHIFT, VDBi-SHIFT and VDBiP-SHIFT]

1. VDBi performs only slightly better than basic MinMax

2. The cluster shift method greatly improves the performance of basic MinMax and VDBi

Degree of uncertainty is small (MBR width = 1%)

[Figure: two charts of #ED per iteration per object against the number of clusters (5 to 9, and 10 to 50), with series MINMAX, VDBi, VDBiP, MINMAX-SHIFT, VDBi-SHIFT and VDBiP-SHIFT]

1. The cluster shift method cannot greatly improve the performance of MinMax

2. The VD-based approach outperforms the MM-based approach

1. The VD-based approach is still better than the MM-based approach, but VD performs slightly better if there are fewer clusters

Degree of uncertainty is very small (MBR width = 0.1%)

[Figure: two charts of #ED per iteration per object against the number of clusters (5 to 9, and 10 to 50), with series MINMAX, VDBi, VDBiP, MINMAX-SHIFT, VDBi-SHIFT and VDBiP-SHIFT]

Performance analysis

| Algorithm | Description |
| --- | --- |
| MinMax | The worst one |
| MinMax-Shift | Good when objects are large |
| VDBi | Good when objects are small |
| VDBi-Shift | Good in all cases; outperforms the MinMax-based methods |
| VDBiP | Better than VDBi; performs well when the MBR width is small |
| VDBiP-Shift | Further improvement over VDBiP |

Performance Analysis

• Basic MinMax performs badly because maxdist and mindist give loose estimates of the upper and lower bounds.
• When the degree of uncertainty of an object is small, MinMax with cluster shift (improved distance bounds) cannot greatly tighten the distance bounds, since mindist and maxdist are already accurate enough. MinMax-Shift's performance is then similar to that of basic MinMax.
• Because the objects are smaller, fewer objects intersect multiple Voronoi cells; we also proved that VD is better than basic MinMax.
• VD is good for small objects, and a hybrid of cluster shift (PC) and VD performs well in all cases.

[Figure: maxdist(o1, cj) is a very loose upper bound, so the cluster shift method can improve it a lot; maxdist(o2, cj) is not a loose upper bound, so the cluster shift method cannot improve it much]

Conclusion

• Uncertain clustering: Voronoi-diagram-based approach and MinMax-based approach
• VDBi is strictly better than basic MinMax
• The Voronoi-diagram-based approach beats the MinMax-based approach when the objects' uncertainty is small
• The hybrid approach is good in all cases

Thank you

Questions?