Clustering of Uncertain Data Objects by Voronoi-diagram-based Approach
Speaker: Chan Kai Fong, Paul
Dept of CS, HKU
Presentation Outline
• Introduction
  • Concept of clustering; clustering of uncertain objects
  • Example: application of clustering on uncertain data
  • UK-means algorithm
• Motivation
  • Voronoi-diagram-based (VD) clustering
  • MinMax-based (MM) clustering
  • VD is strictly better than MinMax
• Clustering algorithms
  • VDBi, VDBiP, and VD-based methods with Cluster Shift
  • When are VD-based methods better than MM-based methods?
• Experiments
• Conclusion
Introduction
Clustering
• Group similar data objects together to form clusters
Partition-based clustering
• Input: number of clusters (k) and number of objects (n)
• Iterative method
  • In each iteration, divide the n data objects into k groups to minimize an objective function, e.g., the sum of squared distances
  • Stop when the results have converged
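The iterative scheme described above can be sketched as plain K-means on certain (point) data. This is a minimal illustration, not the speaker's code; the function name `kmeans`, the `max_iter` cap and the demo data are assumptions made for the example.

```python
def kmeans(points, centers, max_iter=100):
    """Plain K-means on 2D points; `centers` is the initial list of k centers."""
    groups = []
    for _ in range(max_iter):
        # Assignment step: put each point in the group of its nearest center.
        groups = [[] for _ in centers]
        for x, y in points:
            j = min(range(len(centers)),
                    key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            groups[j].append((x, y))
        # Update step: move each center to the mean of its group
        # (an empty group keeps its old center).
        new_centers = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, groups


# Hypothetical demo data: two well-separated pairs of points.
demo_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
demo_centers, demo_groups = kmeans(demo_points, [(0, 0), (10, 0)])
```

Each pass costs n·k distance evaluations, which is the quantity that becomes expensive once every distance is an expected distance.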
Introduction
To cluster data points in 2D space
• Data objects: n data points
• Apply any partition-based clustering algorithm (e.g., K-means)
• Distance measure: Euclidean distance, Manhattan distance, etc.
Introduction
To cluster uncertain objects in 2D space
• Uncertain objects: objects with uncertainty (e.g., location uncertainty)
  • No fixed coordinates in 2D space
  • An object's location is described by a probability density function (pdf) over an uncertainty region
  • Assume the pdf of each object can be obtained
• Uncertainty region (ur): the region in which the object may appear, with a certain probability distribution; the probability that the object appears outside its uncertainty region is zero
• Each object may have an irregular uncertainty region, and its pdf may be arbitrary
[Figure: uncertain object o1, its uncertainty region o1.ur, and the MBR of o1.ur]
The expected distance (ED) measures the distance between an uncertain object and a cluster representative:
$ED(o_i, p_j) = \int_{x \in o_i.ur} f_i(x)\, d(x, p_j)\, dx$
where ED is the expected distance function, d is the Euclidean distance function, x is any point inside oi's uncertainty region, f_i is the pdf of the uncertain object oi, and pj is any cluster representative.
ED computations are very expensive: each iteration of K-means requires nk ED computations.
Expected distance computation
[Figure: expected distance ED(oi, pj) between uncertain object oi and cluster representative pj]
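When the pdf is only available as weighted sample points, the expected distance can be approximated by a weighted sum. The sketch below assumes this discretisation; the function name `expected_distance` and the 2 x 2 sample grid are illustrative and not taken from the talk.

```python
import math


def expected_distance(samples, rep):
    """Approximate ED(o, rep) from weighted samples of o's pdf.

    `samples` is a list of ((x, y), weight) pairs covering the object's
    uncertainty region; weights are normalised here, so they need not sum to 1.
    """
    total_weight = sum(w for _, w in samples)
    weighted_sum = sum(w * math.hypot(x - rep[0], y - rep[1])
                       for (x, y), w in samples)
    return weighted_sum / total_weight


# Hypothetical object: uniform pdf sampled at the corners of a 2 x 2 square.
demo_samples = [((0, 0), 0.25), ((0, 2), 0.25), ((2, 0), 0.25), ((2, 2), 0.25)]
demo_ed = expected_distance(demo_samples, (1.0, 1.0))  # every sample is sqrt(2) away
```

With s sample points per object, one ED evaluation costs s distance computations, so nk evaluations per iteration cost n·k·s in total.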
Application: Clustering the vehicles
Objective: obtain traffic patterns by clustering vehicles in a city
• Data objects: vehicles on a 2D map
• Uncertainty: location uncertainty of the vehicles; each pdf, defined over the object's uncertainty region, represents the probability distribution of the possible locations of a vehicle in a certain period of time
Degree of uncertainty is affected by the following factors:
1. Time
2. Traffic of the roads
3. Shape of the roads
4. Speed of the vehicles
UK-means
• UK-means: the first extension of the K-means algorithm to handle uncertain objects
• Distance measure: expected distance (ED)
• Disadvantage: slow and inefficient
• Shows that K-means can be used to cluster uncertain objects
Two approaches to solving the clustering problem with UK-means
1. MinMax-based approach (Jacky)
2. Voronoi-diagram-based approach (Paul)
Motivation
Two approaches to solving the clustering problem with UK-means
1. MinMax-based approach (Jacky)
• Basic MinMax distance pruning (MinMax)
• MinMax with precomputation of ED
• MinMax with Cluster Shift (MinMaxShift)
2. Voronoi-diagram-based approach (Paul)
• Voronoi diagram with Bisector Pruning (VDBi)
• Voronoi diagram with Bisector Pruning and Partial expected distance computations (VDBiP)
• Voronoi diagram with Bisector Pruning and Cluster Shift (VDBiShift)
• Voronoi diagram with Bisector Pruning, Partial expected distance computations and Cluster Shift (VDBiPShift)
MinMax-based Approach
UK-means with MinMax distance pruning
• Objective: avoid expected distance computations
• Use mindist and maxdist between an object's MBR and the cluster representatives as lower and upper bounds on ED(cj, oi) and ED(cm, oi)
• E.g., given an object oi and cluster representatives cj and cm, if mindist(cj, oi) > maxdist(cm, oi), then cj can be pruned
• Reason: ED(cj, oi) ≥ mindist(cj, oi) > maxdist(cm, oi) ≥ ED(cm, oi), so ED(cj, oi) > ED(cm, oi) and ED(cj, oi) need not be calculated
[Figure: object oi with cluster representatives cj and cm, showing mindist(cj, oi) and maxdist(cm, oi)]
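The MinMax rule can be sketched for axis-aligned MBRs as follows. The MBR encoding `(x1, y1, x2, y2)` and the function names are assumptions made for this illustration; only basic MinMax is shown, without Cluster Shift or precomputation.

```python
import math


def mindist(mbr, c):
    """Smallest distance from point c to any point of the MBR (0 if c is inside)."""
    x1, y1, x2, y2 = mbr
    dx = max(x1 - c[0], 0.0, c[0] - x2)
    dy = max(y1 - c[1], 0.0, c[1] - y2)
    return math.hypot(dx, dy)


def maxdist(mbr, c):
    """Largest distance from point c to any point of the MBR (attained at a corner)."""
    x1, y1, x2, y2 = mbr
    dx = max(abs(c[0] - x1), abs(c[0] - x2))
    dy = max(abs(c[1] - y1), abs(c[1] - y2))
    return math.hypot(dx, dy)


def minmax_candidates(mbr, reps):
    """Indices of representatives that basic MinMax pruning cannot rule out."""
    smallest_upper_bound = min(maxdist(mbr, c) for c in reps)
    # cj is pruned when mindist(cj) exceeds the maxdist of some other representative.
    return [j for j, c in enumerate(reps) if mindist(mbr, c) <= smallest_upper_bound]


# Hypothetical unit-square MBR with three representatives.
demo_mbr = (0.0, 0.0, 1.0, 1.0)
demo_kept = minmax_candidates(demo_mbr, [(0.5, 0.5), (10.0, 10.0), (2.0, 0.5)])
```

ED values then only need to be computed for the surviving indices, and not at all when a single index survives.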
MinMax-based Approach
• Upper and lower bounds can be made tighter by the Cluster Shift (CS) and ED Precomputation (PC) methods
• These replace the loose mindist and maxdist estimates with tighter estimates of the distance bounds
• For details, refer to Jacky's work
Voronoi-diagram-based approach
• Each object's uncertainty region is bounded by its minimum bounding rectangle (MBR)
• The objects' MBRs are indexed by an R-tree
• A Voronoi diagram is constructed for the cluster representatives in each iteration
[Figure: Voronoi diagram for 5 cluster representatives, with uncertain object o1 indexed by the R-tree]
[Figure: object o1 and the bisector of cluster representatives p1 and p2]
Voronoi-diagram-based approach
• If the bisector of two cluster representatives p1 and p2 does not cut an object's MBR, and the MBR falls on the p2 side of the bisector, then ED(o1, p1) > ED(o1, p2)
[Figure: object o1 with cluster representatives p1, p2 and p3]
• ED(o1, p2) < ED(o1, p1) and ED(o1, p2) < ED(o1, p3)
• o1 is assigned to cluster p2.
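A bisector test of this kind can be sketched by checking the four MBR corners: a half-plane is convex, so if every corner is closer to one representative, the whole rectangle is. The MBR encoding `(x1, y1, x2, y2)` and the function names below are assumptions for illustration.

```python
def mbr_on_side_of(mbr, near, far):
    """True if every point of the MBR is strictly closer to `near` than to `far`,
    i.e. the bisector of the two representatives does not cut the MBR and the
    MBR lies on the `near` side."""
    x1, y1, x2, y2 = mbr
    corners = [(x1, y1), (x1, y2), (x2, y1), (x2, y2)]

    def sq_dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    return all(sq_dist(c, near) < sq_dist(c, far) for c in corners)


def bisector_candidates(mbr, reps):
    """Indices of representatives that survive pairwise bisector pruning."""
    alive = set(range(len(reps)))
    for a in range(len(reps)):
        for b in range(len(reps)):
            if a != b and mbr_on_side_of(mbr, reps[a], reps[b]):
                alive.discard(b)  # reps[b] is farther from every point of the MBR
    return sorted(alive)


# Hypothetical unit-square MBR; the bisector y = -0.6 does not cut it.
demo_mbr = (0.0, 0.0, 1.0, 1.0)
demo_pruned = bisector_candidates(demo_mbr, [(0.5, -0.2), (0.5, -1.0)])
# A bisector at x = 0.5 cuts the MBR, so neither representative can be pruned.
demo_cut = bisector_candidates(demo_mbr, [(0.0, 0.5), (1.0, 0.5)])
```

If exactly one index survives, the object can be assigned without computing any expected distance.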
Voronoi-diagram-based approach (Cluster Assignment)
Voronoi-diagram-based approach
In each iteration, for each Voronoi cell (approximated by an MBR):
• Issue a range query on the objects' R-tree to retrieve the candidate objects for the cluster
• If a candidate's MBR is completely enclosed in the Voronoi cell, assign the object to the cluster
• If a candidate's MBR intersects more than one Voronoi cell, special handling is required to prune away the unqualified clusters
[Figure: candidate objects retrieved for a cluster; one object enclosed entirely in a Voronoi cell and one object that intersects more than one Voronoi cell]
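The enclosure test in the steps above can be sketched without building the Voronoi diagram explicitly: a Voronoi cell is convex, so an MBR lies inside a cell exactly when all four corners have the same nearest representative. This brute-force stand-in skips the R-tree range query described in the talk; the function names and the `None` convention are assumptions.

```python
def nearest_rep(point, reps):
    """Index of the representative closest to `point` (ties go to the lowest index)."""
    return min(range(len(reps)),
               key=lambda j: (point[0] - reps[j][0]) ** 2 + (point[1] - reps[j][1]) ** 2)


def enclosing_cell(mbr, reps):
    """Index of the Voronoi cell that contains the whole MBR, or None if the
    MBR intersects more than one cell and needs further handling."""
    x1, y1, x2, y2 = mbr
    owners = {nearest_rep(corner, reps)
              for corner in [(x1, y1), (x1, y2), (x2, y1), (x2, y2)]}
    return owners.pop() if len(owners) == 1 else None


# Hypothetical representatives whose cells are separated by the line x = 5.
demo_reps = [(0.0, 0.0), (10.0, 0.0)]
demo_inside = enclosing_cell((1.0, 1.0, 2.0, 2.0), demo_reps)    # fully in cell 0
demo_straddle = enclosing_cell((4.0, 0.0, 6.0, 1.0), demo_reps)  # crosses x = 5
```

Objects for which this returns an index need neither their pdf nor any ED computation in that iteration.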
Advantages of using Voronoi-diagram-based clustering
Avoid expected distance computation
1. If an object is completely enclosed in a Voronoi cell, then the object must belong to that cluster
2. In the best case, no expensive expected distance calculations are needed, and the object's pdf need not be retrieved during clustering
Voronoi diagram construction cost is independent of the number of objects
• The 2D Voronoi diagram can be computed in O(k log k) time in each iteration, where k is the number of clusters; k does not depend on the number of objects n
• n is much larger than k
Difficulties of Voronoi-based clustering
1. Handling uncertain objects that intersect more than one Voronoi cell
• We cannot determine the nearest cluster just by looking at the Voronoi diagram
[Figure: object o1 with cluster representatives c1, c2 and c3]
Is VD better than basic MinMax?
Theorem: VD is strictly better than basic MinMax
• Given an object oi that is assigned to cluster c1, for any iteration in UK-means, if VD calculates ED(oi, cp) for some cp, then MM must calculate ED(oi, cp) as well.
• If VD does not calculate ED(oi, cp), MM must sometimes still calculate ED(oi, cp).
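One way to see why the theorem is plausible is the following sketch. It assumes that VD applies bisector pruning to every pair of cluster representatives and that MinMax uses the MBR-based mindist and maxdist; it is an illustration, not the proof given in the talk.

```latex
\begin{align*}
&\text{MinMax prunes } c_j \text{ when some } c_m \text{ satisfies }
  \mathrm{mindist}(c_j, o_i) > \mathrm{maxdist}(c_m, o_i). \\
&\text{Then for every point } x \text{ in the MBR of } o_i: \\
&\qquad d(x, c_j) \;\ge\; \mathrm{mindist}(c_j, o_i) \;>\; \mathrm{maxdist}(c_m, o_i) \;\ge\; d(x, c_m), \\
&\text{so the whole MBR lies strictly on the } c_m \text{ side of the bisector of } c_j \text{ and } c_m, \\
&\text{and bisector pruning removes } c_j \text{ as well.}
\end{align*}
```

The converse fails. For the hypothetical MBR [0, 1] x [0, 1] with c_m = (0.5, -0.2) and c_j = (0.5, -1), the bisector y = -0.6 does not cut the MBR, so bisector pruning removes c_j, while mindist(c_j) = 1.0 is not greater than maxdist(c_m) = 1.3, so MinMax keeps it.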
In some situations, VD-based is better
• VD-based methods are always better than basic MinMax, but VD-based methods may not beat MinMaxShift
• In some situations, VD-based methods outperform all MM-based methods: when the objects' uncertainty is very small, VD-based methods are preferred
Clustering algorithms
Clustering Methods
Voronoi-diagram-based approach
1. Voronoi diagram with bisector pruning (VDBi)
2. Voronoi diagram with bisector pruning and partial expected distance computation (VDBiP)
MinMax-based Methods
For each object:
• Find the upper and lower bounds of the ED values
  • If the Cluster Shift (CS) method is not enabled, the upper and lower bounds are estimated by "maxdist" and "mindist" respectively (MinMax)
  • If the CS method is enabled, the upper and lower bounds become tighter (MinMaxShift)
• Prune unwanted clusters using the upper and lower bounds
• For all unpruned clusters, compute the ED values to determine the cluster assignment of the object
Voronoi-diagram-based Methods
• Before each iteration, a Voronoi diagram is constructed for all cluster representatives
• For each cluster representative:
  • Find the objects that are completely enclosed in the cluster's Voronoi cell
  • Apply bisector pruning to prune unrelated clusters
Voronoi diagram with Bisector Pruning (VDBi)
[Figure: object o1 with cluster representatives c1, c2 and c3]
• Comparing c1 and c3: o1 falls on the c1 side of bisector(c1, c3), so c3 can be pruned.
• Since the bisector of c1 and c2 cuts o1's MBR, o1 may be assigned to either c1 or c2.
Voronoi diagram with bisector pruning and partial expected distance computation (VDBiP)
• Cut the object's MBR into two equal halves, (a) and (b)
[Figure: object o1 split into halves (a) and (b)]
VDBiP
If the MBR of o1(b) is completely enclosed in the Voronoi cell of c2:
• Compute ED(o1(a), c1) and ED(o1(a), c2)
• Since o1(b) lies inside the Voronoi cell of c2, ED(o1(b), c2) < ED(o1(b), c1) holds without any computation
• If ED(o1(a), c2) < ED(o1(a), c1), then ED(o1(a), c2) + ED(o1(b), c2) < ED(o1(a), c1) + ED(o1(b), c1)
=> prune c1
[Figure: halves (a) and (b) of o1 between c1 and c2, with ED(o1(a), c1) and ED(o1(a), c2)]
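The partial computation can be sketched with weighted sample points. The sketch assumes that the ED of a half is the unnormalised weighted sum over that half's samples, so that the two halves add up to the ED of the whole object; the function names and sample values are illustrative only.

```python
import math


def partial_ed(samples, rep):
    """Unnormalised contribution of a subset of samples to ED(o, rep).

    `samples` is a list of ((x, y), probability) pairs; the contributions of
    halves (a) and (b) add up to the expected distance of the whole object.
    """
    return sum(p * math.hypot(x - rep[0], y - rep[1]) for (x, y), p in samples)


def vdbip_can_prune_c1(half_a, c1, c2):
    """Decide whether c1 can be pruned using half (a) only.

    Assumes the caller has already checked that half (b) lies entirely inside
    c2's Voronoi cell, so ED(o(b), c2) < ED(o(b), c1) is known for free.
    """
    return partial_ed(half_a, c2) < partial_ed(half_a, c1)


# Hypothetical object on the x-axis between c1 = (0, 0) and c2 = (4, 0);
# the bisector x = 2 cuts half (a), while half (b) is on the c2 side.
c1, c2 = (0.0, 0.0), (4.0, 0.0)
half_a = [((1.75, 0.0), 0.1), ((2.25, 0.0), 0.4)]
half_b = [((2.75, 0.0), 0.25), ((3.25, 0.0), 0.25)]
demo_prune = vdbip_can_prune_c1(half_a, c1, c2)
```

Only the samples of half (a) are touched, roughly halving the integration work compared with computing both full expected distances.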
Experiments
Measure
• Efficiency (number of expected distance computations required)
Comparison of
• Basic MinMax distance pruning (MinMax)
• Voronoi diagram with Bisector Pruning (VDBi)
• Voronoi diagram with Bisector Pruning and Partial expected distance computation (VDBiP)
• MM-based with Cluster Shift (MinMaxShift)
• VD-based with Cluster Shift (VDBiShift, VDBiPShift)
Experimental Settings
Data set: randomly generated synthetic data set
Probability density function: random
Domain: 100 x 100 2D space
Number of objects: 10000
Number of clusters: varies
Maximum length of an MBR's side: 10%, 1%, 0.1%
Number of sample points: 20 * 20
Degree of uncertainty is large (MBR width = 10%)
[Figure: two plots of the number of expected distance calculations per object per iteration against the number of clusters (5 to 9, and 10 to 50), for MINMAX, VDBi, VDBiP, MINMAXSHIFT, VDBiSHIFT and VDBiPSHIFT]
1. VDBi performs only slightly better than basic MinMax
2. The cluster shift method greatly improves the performance of basic MinMax and VDBi
Degree of uncertainty is small (MBR width = 1%)
[Figure: two plots of the number of ED computations per iteration per object against the number of clusters (5 to 9, and 10 to 50), for MINMAX, VDBi, VDBiP, MINMAXSHIFT, VDBiSHIFT and VDBiPSHIFT]
1. The cluster shift method cannot greatly improve the performance of MinMax
2. The VD-based approach outperforms the MM-based approach
1. The VD-based approach is still better than the MM-based approach, but VD performs slightly better if there are fewer clusters
Degree of uncertainty is very small (MBR width = 0.1%)
[Figure: two plots of the number of ED computations per iteration per object against the number of clusters (5 to 9, and 10 to 50), for MINMAX, VDBi, VDBiP, MINMAXSHIFT, VDBiSHIFT and VDBiPSHIFT]
Performance analysis
Algorithm      Description
MinMax         The worst one
MinMaxShift    Good when objects are large
VDBi           Good when objects are small
VDBiShift      Good in all cases; outperforms the MinMax-based methods
VDBiP          Better than VDBi; performs well when the MBR width is small
VDBiPShift     Further improvement on VDBiP
Performance Analysis
• Basic MinMax performs badly because maxdist and mindist give loose upper and lower bound estimates
• When the degree of uncertainty of an object is small, MinMax with cluster shift (improved distance bounds) cannot greatly tighten the distance bounds, since mindist and maxdist are already accurate enough
  • MinMaxShift's performance is then similar to that of basic MinMax
• Because the objects are smaller, fewer objects intersect multiple Voronoi cells; we have also proved that VD is better than basic MinMax
• VD is good for small objects, and a hybrid of cluster shift (CS) and VD performs well in all cases
[Figure: object o1 and cluster representative cj] maxdist(o1, cj) is a very loose upper bound; the cluster shift method can improve it a lot
[Figure: object o2 and cluster representative cj] maxdist(o2, cj) is not a loose upper bound; the cluster shift method cannot improve it much
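The effect of object size on the bounds can be made concrete with a short derivation. It assumes an axis-aligned MBR with side lengths w and h, and takes a as the MBR point nearest to c_j and b as the point farthest from it; these symbols are introduced here for illustration only.

```latex
\begin{align*}
\mathrm{maxdist}(o, c_j) = d(c_j, b)
  &\le d(c_j, a) + d(a, b)
  && \text{(triangle inequality)} \\
  &\le \mathrm{mindist}(o, c_j) + \sqrt{w^2 + h^2}
  && \text{(}a, b \text{ both lie in the MBR)} \\[4pt]
\text{hence}\quad
0 \le \mathrm{maxdist}(o, c_j) - \mathrm{mindist}(o, c_j) &\le \sqrt{w^2 + h^2}.
\end{align*}
```

Since mindist ≤ ED ≤ maxdist, neither bound can be off by more than the MBR diagonal. If the MBR widths in the experiments are percentages of the 100-unit domain side (an assumption), a 0.1% MBR has a diagonal of at most about 0.14 units, leaving little room for any tighter bound to help, whereas a 10% MBR allows a gap of up to about 14 units.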
Conclusion
• Clustering of uncertain objects: Voronoi-diagram-based approach and MinMax-based approach
• VDBi is strictly better than basic MinMax
• The Voronoi-diagram-based approach beats the MinMax-based approach when the objects' uncertainty is small
• The hybrid approach is good in all cases
Thank you
Questions?