
Clustering of Uncertain data objects by Voronoi-diagram-based approach
Speaker: Chan Kai Fong, Paul
Dept of CS, HKU

Presentation Outline
Introduction
  Concept of clustering, clustering of uncertain objects
  Example: application of clustering on uncertain data
  UK-means algorithm
Motivation
  Voronoi-diagram-based (VD) clustering
  MinMax-based (MM) clustering
  VD is strictly better than MinMax
Clustering algorithms
  VDBi, VDBiP, VD-based methods with Cluster Shift
  When are VD-based methods better than MM-based methods?
Experiments
Conclusion

Introduction

Introduction
Clustering: group similar data objects together to form clusters
Partition-based clustering
  Input: number of clusters (k), number of objects (n)
  Iterative method
  In each iteration, divide the n data objects into k groups to minimize an objective function, e.g., the sum of squares of distances
  Stop when the results have converged

Introduction
To cluster data points in 2D space
Data objects: n data points
Apply any partition-based clustering algorithm (e.g., K-means)
Distance measure: Euclidean distance, Manhattan distance, etc.

Introduction
To cluster uncertain objects in 2D space
Uncertain objects: objects with uncertainty (e.g., location uncertainty)
No fixed coordinates in 2D space
An object’s location is described by a probability density function (pdf) over an uncertainty region
Assume the pdf of each object can be obtained
Uncertainty region (ur): the region in which the object may appear, with a certain probability distribution; the probability that the object appears outside its uncertainty region is zero
Each object may have an irregular uncertainty region, and its pdf may be arbitrary
[Figure: uncertain object o1, its uncertainty region o1.ur, and the MBR of o1.ur]

The expected distance (ED) is used to measure the distance between an uncertain object and a cluster representative.
ED is the expected distance function, d is the Euclidean distance function, x is any point inside oi’s uncertainty region, f is the pdf of the uncertain object oi, and pj is any cluster representative.
ED computations are very expensive: each iteration of K-means requires nk ED computations.
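The formula itself did not survive in this transcript. Written out from the symbol definitions above, the expected distance presumably takes the standard form below; this is a reconstruction, and the subscript on f (the pdf of oi) is notation added here:

```latex
ED(o_i, p_j) = \int_{x \in o_i.ur} d(x, p_j)\, f_i(x)\, dx
```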
[Figure: Expected distance computation, showing ED(oi, pj) between object oi and cluster pj]
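As a rough illustration of why each ED computation is costly, the sketch below approximates the expected distance by a weighted sum over sample points of the pdf. The representation of an object as a list of `((x, y), weight)` pairs and the function name `expected_distance` are assumptions made for this example, not part of the talk.

```python
import math

def expected_distance(samples, rep):
    """Approximate ED(o, rep) from weighted sample points.

    samples: list of ((x, y), weight) pairs taken from the object's
             uncertainty region; the weights are pdf masses summing to 1
             (an assumed representation).
    rep:     (x, y) coordinates of a cluster representative.
    """
    # One Euclidean distance per sample point, weighted by its probability.
    return sum(w * math.dist(p, rep) for p, w in samples)

# Hypothetical object: four equally likely locations on a unit square.
o1 = [((0.0, 0.0), 0.25), ((1.0, 0.0), 0.25),
      ((0.0, 1.0), 0.25), ((1.0, 1.0), 0.25)]
print(expected_distance(o1, (0.5, 0.5)))
```

With s sample points per object, one ED costs s distance evaluations, so nk ED computations per iteration amount to nks distance evaluations.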

Application: Clustering the vehicles
Objective: get traffic patterns by clustering vehicles in a city
Data objects: vehicles on a 2D map
Uncertainty: location uncertainty of the vehicles; each pdf, defined over an object’s uncertainty region, represents the probability distribution of the possible locations of a vehicle in a certain period of time

Degree of uncertainty is affected by the following factors:
1. Time
2. Traffic of the roads
3. Shape of the roads
4. Speed of the vehicles

[Figure: Results, with object oi]

UK-means
UK-means: the first extension of the K-means algorithm to handle uncertain objects
Distance measure: expected distance (ED)
Disadvantage: slow and inefficient
Shows the possibility of using K-means to handle the clustering of uncertain objects
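A minimal sketch of a brute-force UK-means loop, to make the cost visible: every iteration computes an ED for each of the n objects against each of the k representatives. The sample-point representation, the helper names, and the update rule (representative = mean of the members' expected locations) are assumptions made for this illustration.

```python
import math

def ed(samples, rep):
    """Approximate expected distance from weighted sample points."""
    return sum(w * math.dist(p, rep) for p, w in samples)

def centroid(samples):
    """Expected location of one uncertain object."""
    return (sum(w * p[0] for p, w in samples),
            sum(w * p[1] for p, w in samples))

def uk_means(objects, reps, max_iter=100):
    """Brute-force UK-means: n*k ED computations per iteration."""
    reps = list(reps)          # do not modify the caller's list
    assign = None
    for _ in range(max_iter):
        # Assignment step: nearest representative by expected distance.
        new_assign = [min(range(len(reps)), key=lambda j: ed(o, reps[j]))
                      for o in objects]
        if new_assign == assign:   # converged: no object changed cluster
            break
        assign = new_assign
        # Update step (assumed rule): mean of members' expected locations.
        for j in range(len(reps)):
            members = [centroid(o) for o, a in zip(objects, assign) if a == j]
            if members:
                reps[j] = (sum(m[0] for m in members) / len(members),
                           sum(m[1] for m in members) / len(members))
    return assign, reps
```

The pruning methods in the rest of the talk aim to skip most of the `ed` calls in the assignment step.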

Two Approaches to solve clustering problem by UK-means
1. MinMax-based approach (Jacky)
2. Voronoi-Diagram-based approach (Paul)

Motivation

Two Approaches to solve clustering problem by UK-means
1. MinMax-based approach (Jacky)
• Basic MinMax distance pruning (MinMax)
• MinMax with pre-computation of ED
• MinMax with Cluster Shift (MinMax-Shift)
2. Voronoi-Diagram-based approach (Paul)
• Voronoi diagram with Bisector Pruning (VDBi)
• Voronoi diagram with Bisector Pruning and Partial expected distance computations (VDBiP)
• Voronoi diagram with Bisector Pruning and Cluster Shift (VDBi-Shift)
• Voronoi diagram with Bisector Pruning and Partial expected distance computations and Cluster Shift (VDBiP-Shift)

MinMax-based Approach
UK-means with MinMax distance pruning
Objective: avoid expected distance computations by using mindist and maxdist between an object’s MBR and the cluster representatives as distance bounds on ED(cj, oi) and ED(cm, oi)
E.g., given an object oi and cluster representatives cj and cm, if mindist(cj, oi) > maxdist(cm, oi), then cj can be pruned
[Figure: object oi with cluster representatives cj and cm, showing maxdist(cm, oi) and mindist(cj, oi)]
ED(cj, oi) > ED(cm, oi), so cj is pruned
ED(cj, oi) need not be calculated
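The pruning rule above can be sketched as follows. The MBR is assumed to be an axis-aligned rectangle given as `(xmin, ymin, xmax, ymax)`, and the function names are chosen for this example only.

```python
import math

def mindist(mbr, c):
    """Smallest Euclidean distance from point c to the rectangle mbr."""
    xmin, ymin, xmax, ymax = mbr
    dx = max(xmin - c[0], 0.0, c[0] - xmax)
    dy = max(ymin - c[1], 0.0, c[1] - ymax)
    return math.hypot(dx, dy)

def maxdist(mbr, c):
    """Largest Euclidean distance from point c to the rectangle mbr
    (always attained at one of the corners)."""
    xmin, ymin, xmax, ymax = mbr
    dx = max(abs(c[0] - xmin), abs(c[0] - xmax))
    dy = max(abs(c[1] - ymin), abs(c[1] - ymax))
    return math.hypot(dx, dy)

def minmax_candidates(mbr, reps):
    """Indices of the representatives that survive MinMax pruning.

    cj is pruned when mindist(cj, o) exceeds the smallest maxdist(cm, o)
    over all representatives cm, because then ED(cj, o) > ED(cm, o).
    """
    best_upper = min(maxdist(mbr, c) for c in reps)
    return [j for j, c in enumerate(reps) if mindist(mbr, c) <= best_upper]

# Hypothetical object MBR and three representatives.
print(minmax_candidates((0, 0, 1, 1), [(0.5, 0.5), (10, 10), (1.5, 0.5)]))
```

ED values then need to be computed only for the surviving candidates, and not at all when a single candidate remains.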

MinMax-based Approach
Upper and lower bounds can be made tighter by using the Cluster Shift (CS) and ED Pre-computation (PC) methods
Replace the loose mindist and maxdist estimates with tighter estimates of the distance bounds
For details, refer to Jacky’s work

Voronoi-diagram-based approach
Each object’s uncertainty region is bounded by its minimum bounding rectangle (MBR)
The objects’ MBRs are indexed by an R-tree
A Voronoi diagram is constructed for the cluster representatives in each iteration
[Figure: Voronoi diagram for 5 cluster representatives; uncertain object o1 indexed by the R-tree]

Voronoi-diagram-based approach
If the bisector of two cluster representatives p1 and p2 does not cut an object’s MBR, and the MBR falls on the p2 side of the bisector, then ED(p1, o1) > ED(p2, o1)
[Figure: object o1, representatives p1 and p2, and the bisector of p1 and p2]
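One way to test on which side of a bisector an MBR lies is to compare the distances from its four corners to the two representatives: each side of the bisector is a half-plane, which is convex, so if all four corners are closer to one representative, the whole rectangle is. The sketch below assumes an axis-aligned MBR `(xmin, ymin, xmax, ymax)`; the function name is hypothetical.

```python
import math

def side_of_bisector(mbr, p1, p2):
    """Return 1 if the whole MBR is closer to p1, 2 if it is closer to p2,
    and 0 if the bisector of p1 and p2 cuts (or touches) the MBR."""
    xmin, ymin, xmax, ymax = mbr
    corners = [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]
    diffs = [math.dist(c, p1) - math.dist(c, p2) for c in corners]
    if all(d < 0 for d in diffs):
        return 1      # ED(p1, o) < ED(p2, o): p2 can be pruned
    if all(d > 0 for d in diffs):
        return 2      # ED(p2, o) < ED(p1, o): p1 can be pruned
    return 0          # no pruning possible from this pair alone

print(side_of_bisector((0, 0, 1, 1), (-5, 0.5), (2, 0.5)))
```

Because every point of the MBR is closer to the winning representative, the inequality between the two expected distances holds for any pdf defined inside the MBR, so the pdf never has to be read.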

Voronoi-diagram-based approach (Cluster Assignment)
[Figure: object o1 with cluster representatives p1, p2 and p3]
ED(o1, p2) < ED(o1, p1) and ED(o1, p2) < ED(o1, p3)
o1 is assigned to cluster p2.

Voronoi-diagram-based approach
In each iteration,
For each Voronoi cell (approximated by an MBR), issue a range query to the objects’ R-tree to retrieve the candidate objects for the cluster
If a candidate’s MBR is completely enclosed in the Voronoi cell, assign the object to the cluster
If a candidate’s MBR intersects more than one Voronoi cell, special handling is required to prune away the unqualified clusters
[Figure: getting candidate objects for the cluster; an object enclosed entirely in a Voronoi cell; an object that intersects more than one Voronoi cell]

Advantages of using Voronoi-diagram-based clustering
Avoid expected distance computation
1. If the object is completely enclosed in a Voronoi cell, then the object must belong to this cluster
2. In the best case, we do not need any expensive expected distance calculations, and we do not need to retrieve the object’s pdf during the clustering

Advantages of using Voronoi-diagram-based clustering
Voronoi diagram construction cost is independent of the number of objects
We only need O(k log k) time to compute the 2D Voronoi diagram in each iteration, where k is the number of clusters; k does not depend on the number of objects
n is much larger than k

Difficulties of Voronoi based clustering
1. Handling of uncertain objects that intersect more than one Voronoi cell
• We cannot determine the nearest cluster by just looking at the Voronoi diagram
[Figure: object o1 with cluster representatives c1, c2 and c3]

Is VD better than basic MinMax?
Theorem: VD is strictly better than basic MinMax
Given an object oi that is assigned to cluster c1, for any iteration in UK-means, if VD calculates ED(oi, cp) for some cp, then MM must calculate ED(oi, cp) as well.
If VD does not calculate ED(oi, cp), MM must sometimes still calculate ED(oi, cp).

In some situations, VD based is better
VD-based methods are always better than basic MinMax, but VD-based methods may not beat MinMax-Shift
In some situations, VD-based methods outperform all MM-based methods: when the objects’ uncertainty is very small, VD-based methods are preferred

Clustering algorithms

Clustering Methods
Voronoi-diagram-based approach
1. Voronoi diagram with bisector pruning (VDBi)
2. Voronoi diagram with bisector pruning and partial expected distance computation (VDBiP)

MinMax-based Methods
For each object,
Find the upper and lower bounds of the ED values
If the Cluster-Shift (CS) method is not enabled, the upper and lower bounds are estimated by “maxdist” and “mindist” respectively (MinMax)
If the CS method is enabled, the upper and lower bounds become tighter (MinMax-Shift)
Prune unwanted clusters using the upper and lower bounds
For all un-pruned clusters, compute the ED values to determine the cluster assignment of the object

Voronoi-diagram-based Methods
Before each iteration, a Voronoi diagram is constructed for all cluster representatives
For each cluster representative,
Find the objects that are completely enclosed in the cluster’s Voronoi cell
Apply bisector pruning to prune unrelated clusters

Voronoi diagram with Bisector Pruning (VDBi)
[Figure: object o1 with cluster representatives c1, c2 and c3]
Comparing c1 and c3, o1 falls on the c1 side of bisector(c1, c3), so c3 can be pruned.
Since the bisector of c1 and c2 cuts o1’s MBR, o1 may be assigned to either c1 or c2.
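The situation in this example can be sketched in code: a cluster is pruned as soon as some other representative is closer to every corner of the object’s MBR. This is a simplified all-pairs version written for illustration; the function names and the `(xmin, ymin, xmax, ymax)` MBR layout are assumptions, and the talk’s actual implementation works through the Voronoi diagram and an R-tree.

```python
import math

def closer_everywhere(mbr, a, b):
    """True if every point of the MBR is strictly closer to a than to b,
    i.e. the MBR lies entirely on a's side of bisector(a, b)."""
    xmin, ymin, xmax, ymax = mbr
    corners = [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]
    return all(math.dist(c, a) < math.dist(c, b) for c in corners)

def vdbi_candidates(mbr, reps):
    """Indices of clusters that survive bisector pruning for one object."""
    return [j for j, cj in enumerate(reps)
            if not any(closer_everywhere(mbr, cm, cj)
                       for m, cm in enumerate(reps) if m != j)]

# Hypothetical layout mirroring the slide: c3 is pruned by c1,
# while bisector(c1, c2) cuts the MBR, so both c1 and c2 remain.
c1, c2, c3 = (4, 0.5), (6, 0.5), (4, 10)
print(vdbi_candidates((4.5, 0, 5.5, 1), [c1, c2, c3]))
```

When a single candidate remains, the object is assigned without any ED computation; otherwise ED values are needed only for the remaining candidates.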

Voronoi diagram with bisector pruning and partial expected distance computation (VDBiP)
• Cut the object’s MBR into two equal halves (a) and (b)
[Figure: object o1 split into halves (a) and (b)]

VDBiP
If the MBR of o1(b) is completely enclosed in the Voronoi cell of c2
• Compute ED(o1(a), c1) and ED(o1(a), c2)
• Since o1(b) lies inside the Voronoi cell of c2, ED(o1(b), c2) < ED(o1(b), c1)
• If ED(o1(a), c2) < ED(o1(a), c1), then
• ED(o1(a), c2) + ED(o1(b), c2) < ED(o1(a), c1) + ED(o1(b), c1)
=> prune c1
[Figure: object o1 split into halves (a) and (b) between c1 and c2, showing ED(o1(a), c1) and ED(o1(a), c2)]
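The last step relies on the expected distance being additive over the two halves. A sketch of that decomposition, assuming the partial expected distance over a half is the unnormalized integral of d times the pdf over that half (the slides do not define it explicitly; f_1 denotes the pdf of o1):

```latex
ED(o_1, c)
  = \int_{x \in o_1(a)} d(x, c)\, f_1(x)\, dx
  + \int_{x \in o_1(b)} d(x, c)\, f_1(x)\, dx
  = ED(o_1(a), c) + ED(o_1(b), c)
```

Under this assumption, summing the two inequalities for halves (a) and (b) gives ED(o1, c2) < ED(o1, c1), and only the integrals over half (a) have to be evaluated.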

Experiments

Experiments
Measures
  Efficiency (expected distance computations required)
Comparison with
  Basic Min-max distance pruning (MinMax)
  Voronoi diagram with Bisector Pruning (VDBi)
  Voronoi diagram with Bisector Pruning and Partial expected distance computation (VDBiP)
  MM-based with Cluster Shift (MinMax-Shift)
  VD-based with Cluster Shift (VDBi-Shift, VDBiP-Shift)

Experimental Settings
Data set: randomly generated synthetic data set
Probability density function: random
Domain: 100 x 100 2D space
Number of objects: 10000
Number of clusters: vary
Maximum length of an MBR’s side: 10%, 1%, 0.1%
Number of sample points: 20 * 20

Degree of uncertainty is large (MBR width = 10%)
[Figure: two plots of the number of expected distance calculations per object per iteration against the number of clusters (5 to 9 and 10 to 50), comparing MINMAX, VDBi, VDBiP, MINMAX-SHIFT, VDBi-SHIFT and VDBiP-SHIFT]
1. VDBi performs only slightly better than basic MinMax
2. The cluster shift method greatly improves the performance of basic MinMax and VDBi

Degree of uncertainty is small (MBR width = 1%)
[Figure: two plots of #ED per iteration per object against the number of clusters (5 to 9 and 10 to 50), comparing MINMAX, VDBi, VDBiP, MINMAX-SHIFT, VDBi-SHIFT and VDBiP-SHIFT]
1. The cluster shift method cannot greatly improve the performance of MinMax
2. The VD-based approach outperforms the MM-based approach
1. The VD-based approach is still better than the MM-based approach, but VD performs slightly better when there are fewer clusters

Degree of uncertainty is very small (MBR width = 0.1%)
[Figure: two plots of #ED per iteration per object against the number of clusters (5 to 9 and 10 to 50), comparing MINMAX, VDBi, VDBiP, MINMAX-SHIFT, VDBi-SHIFT and VDBiP-SHIFT]

Performance analysis
Algorithm       Description
MinMax          The worst one
MinMax-Shift    Good when objects are large
VDBi            Good when objects are small
VDBi-Shift      Good in all cases; outperforms the MinMax-based methods
VDBiP           Better than VDBi; performs well when the MBR width is small
VDBiP-Shift     Further improvement to VDBiP

Performance Analysis
Basic MinMax performs badly because of the loose upper and lower bounds estimated by maxdist and mindist.
When an object’s degree of uncertainty is small, MinMax with cluster shift (improved distance bounds) cannot greatly improve the tightness of the distance bounds, since mindist and maxdist are already accurate enough
MinMax-Shift’s performance is then similar to that of basic MinMax
Because the objects are smaller, fewer objects intersect multiple Voronoi cells; also, we proved that VD is better than basic MinMax
VD is good for small objects, and a hybrid of cluster shift (PC) and VD performs well in all cases
[Figure: object o1 and cluster representative cj] Maxdist(o1, cj) is a very loose upper bound; the cluster shift method can improve it a lot
[Figure: object o2 and cluster representative cj] Maxdist(o2, cj) is not a loose upper bound; the cluster shift method cannot improve it much

Conclusion
Uncertain clustering
Voronoi-diagram-based approach and MinMax-based approach
VDBi is strictly better than basic MinMax
The Voronoi-diagram-based approach beats the MinMax-based approach when the objects’ uncertainty is small
The hybrid approach is good in all cases

Thank you
Questions?