Exploiting Page-Level Upper Bound (PLUB) for Multi-Type Nearest Neighbor (MTNN) Queries

University of Minnesota

Xiaobin Ma

Advisor: Shashi Shekhar

Dec, 2005



Outline

Motivation
Problem statement
Related work and our contributions
Proposed algorithm and cost model
Experiment design and results
Conclusion and future work

Motivation

GIS applications
Find the shortest path that passes through one point from each of several different feature types

A Running Example

Three feature types: red (r), green (g), black (b)
q is the query point
The route drawn with a solid red line is the shortest route
Routes drawn with dashed lines are other possible routes

Basic Concepts

<P1, P2, …, Pk>: ordered point sequence, where P1, P2, …, Pk are from k different (feature) types of data sets
R(q, P1, P2, …, Pk): a route from q through points P1, P2, …, and Pk
d(R(q, P1, P2, …, Pk)): distance of route R(q, P1, P2, …, Pk)
Multi-Type Nearest Neighbor (MTNN): ordered point sequence <P1', P2', …, Pk'> such that d(R(q, P1', P2', …, Pk')) is minimum among all possible routes
d(R(q, P1', P2', …, Pk')) is the MTNN distance
MTNN query: a query that finds the MTNN
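The definitions above can be illustrated with a brute-force search. This is only a minimal sketch for checking small cases, not the algorithm proposed in these slides; the function names `route_length` and `brute_force_mtnn` and the toy coordinates are assumptions made for illustration.

```python
from itertools import permutations, product
from math import dist

def route_length(q, points):
    """d(R(q, P1, ..., Pk)): length of the route q -> P1 -> ... -> Pk."""
    total, cur = 0.0, q
    for p in points:
        total += dist(cur, p)
        cur = p
    return total

def brute_force_mtnn(q, datasets):
    """Try every order of feature types and every choice of one point per type."""
    best_len, best_seq = float("inf"), None
    for order in permutations(datasets):            # order in which types are visited
        for seq in product(*(datasets[t] for t in order)):
            length = route_length(q, seq)
            if length < best_len:
                best_len, best_seq = length, seq
    return best_len, best_seq

# Hypothetical toy data: two points per feature type.
datasets = {
    "r": [(1.0, 0.0), (4.0, 4.0)],
    "g": [(3.0, 0.0), (0.0, 5.0)],
    "b": [(2.0, 0.0), (6.0, 1.0)],
}
length, seq = brute_force_mtnn((0.0, 0.0), datasets)
print(length, seq)
```

The search space grows with the number of type permutations times the product of the data set sizes, which is why pruning matters for anything beyond toy inputs.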

Problem Statement for MTNN Query

Given:
  A query point
  A distance metric
  k different (feature) types of spatial objects, with N1, N2, N3, …, Nk data points respectively
  An R-tree for each data set
Find: the multi-type nearest neighbor (MTNN)
Objective: minimize the length of the route from the query point covering an instance of each feature type
Constraints:
  Correctness: the tour should be the shortest path for the query point and the given collection of spatial query feature types
  Completeness: only the shortest path is returned as the query result

Related Work

Optimal sequenced route (OSR) query [Kolahdouzan et al., Tech. Report 05-840, USC]
  Optimal algorithms (RLORD)
  Focus on optimal algorithms for a specified permutation of feature types
  Point-based algorithms
Trip planning query (TPQ) [Li et al., SSTD 05]
  Heuristic algorithms
  Give approximate results

RLORD Example

q is the query point
Search order is <r, b, g>
R(q,r2,b2,g2) is the greedy route
Radius of the circle is d(R(q,r2,b2,g2))

[Figure: query point q and the data points of feature types r, g, and b, with the circle around q]

RLORD Running Iterations

Use backward search strategy, O = <g, b, r>
First iteration: examine feature type g
  <g2>, <g3>, <g4>, <g5>, <g7>, <g9>, <g10>, <g12>, <g13>, <g14>, <g15>, <g16> in a set R
Second iteration: examine the next feature type in O
  For every point bi in the black set, iterate over every partial route <gj> in R:
    IF d(R(q, bi)) + d(R(bi, gj)) < d(R(q,r2,b2,g2)) THEN put <bi,gj> into a set R1
  Keep the ordered sequence <bi,gj> in R1 such that d(R(bi,gj)) + d(R(gj)) is minimum
  <b1,g13>, <b2,g2>, <b3,g3>, <b4,g3>, <b6,g14>, <b7,g14>, <b11,g3>, <b12,g13>, <b13,g14>, <b14,g3>, <b15,g13> in a set R2
  R <- R2
Examine the next feature type and repeat the above procedure until all types of data are examined
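The iterations above can be sketched in a few lines of Python. This is a simplified reading of the slide, not the RLORD implementation from [1]: the names `greedy_upper_bound` and `backward_search` are hypothetical, the pruning test includes the length of the already built tail of the route, and a small tolerance is added (an assumption) so that floating-point rounding cannot discard the greedy route itself.

```python
from math import dist

def greedy_upper_bound(q, layers):
    """Length of the greedy route: always hop to the nearest point of the next type."""
    cur, total = q, 0.0
    for pts in layers:
        nxt = min(pts, key=lambda p: dist(cur, p))
        total += dist(cur, nxt)
        cur = nxt
    return total

def backward_search(q, layers):
    """layers[0] is visited first and layers[-1] last (one fixed permutation)."""
    bound = greedy_upper_bound(q, layers) * (1 + 1e-9)
    # A partial route is (head point, length from head to the end, point sequence).
    partial = [(p, 0.0, (p,)) for p in layers[-1] if dist(q, p) <= bound]
    for pts in reversed(layers[:-1]):
        extended = []
        for p in pts:
            best = None
            for head, length, seq in partial:
                cand = dist(p, head) + length
                # Keep only the shortest extension of p that can still beat the bound.
                if dist(q, p) + cand <= bound and (best is None or cand < best[1]):
                    best = (p, cand, (p,) + seq)
            if best is not None:
                extended.append(best)
        partial = extended
    head, length, seq = min(partial, key=lambda t: dist(q, t[0]) + t[1])
    return dist(q, head) + length, seq

# Hypothetical toy data in visiting order <r, b, g>.
layers = [[(1, 0), (5, 5)], [(2, 0), (0, 9)], [(3, 0), (9, 9)]]
best_length, best_route = backward_search((0, 0), layers)
print(best_length, best_route)
```

Keeping a single best partial route per point is what bounds each iteration to roughly (points of this type) x (surviving partial routes) distance calculations.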


Our Contributions

Formalized a new nearest neighbor search problem – Multi-Type Nearest Neighbor (MTNN) query problem

Proposed a new algorithm, i.e., Page Level Upper Bound (PLUB) based algorithm

Evaluated the proposed algorithm via cost model and experiment

Key Ideas of PLUB

Prune the search space at page level
Create candidate leaf page sequences
Search for the candidate MTNN in these candidate leaf page sequences

Page Level Upper Bound (PLUB) Algorithm

Step 1: First upper bound search
  Use the basic R-tree based nearest neighbor search algorithm with a greedy strategy to find an initial upper bound, which becomes the current upper bound
Step 2: R-tree search
  Prune the search space with the current upper bound and form a set of candidate leaf node sequences, using the page-level pruning approach
Step 3: Subset search
  Search for the candidate MTNN in the candidate leaf node sequences
  Go back to Step 2, using the candidate MTNN distance as the current upper bound, until all permutations of feature types have been processed
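A rough sketch of the page-level pruning in Step 2 follows. The slides do not spell out how the page-level bound of a leaf node sequence is computed, so this example assumes an optimistic bound built from minimum rectangle-to-rectangle distances; the names `mindist` and `candidate_sequences`, the rectangle encoding `(xmin, ymin, xmax, ymax)`, and the sample pages are all hypothetical.

```python
from itertools import product
from math import hypot

def gap(lo1, hi1, lo2, hi2):
    """Separation of two 1-D intervals (0 if they overlap)."""
    return max(0.0, lo1 - hi2, lo2 - hi1)

def mindist(a, b):
    """Minimum Euclidean distance between rectangles (xmin, ymin, xmax, ymax)."""
    return hypot(gap(a[0], a[2], b[0], b[2]), gap(a[1], a[3], b[1], b[3]))

def candidate_sequences(q, pages_per_type, upper_bound):
    """Keep leaf page sequences whose optimistic route length fits the bound."""
    q_rect = (q[0], q[1], q[0], q[1])       # treat q as a degenerate rectangle
    kept = []
    for seq in product(*pages_per_type):
        bound, prev = 0.0, q_rect
        for page in seq:
            bound += mindist(prev, page)
            prev = page
            if bound > upper_bound:         # no route through these pages can win
                break
        else:
            kept.append(seq)
    return kept

# Hypothetical leaf pages for three feature types, visited in this order.
red = [(1, 0, 2, 1), (8, 8, 9, 9)]
black = [(2, 0, 3, 1)]
green = [(3, 0, 4, 1), (0, 7, 1, 8)]
kept = candidate_sequences((0, 0), [red, black, green], upper_bound=3.37)
print(kept)
```

Only rectangle-to-rectangle distances are computed here; point-to-point work is deferred to the surviving sequences, which is the trade-off the page-level approach relies on.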

PLUB – An Example

Inputs
  q: query point
  Euclidean distance
  R-tree for each feature

R(q,r2,b2,g2) is the greedy route
Radius of the circle is d(R(q,r2,b2,g2)) = 3.37
Rectangles are leaf pages in the R-trees

[Figure: query point q, the data points of feature types r, g, and b, and the R-tree leaf pages R1-R4, G1-G4, B1-B4]

PLUB – An Example

Leaf page upper bound calculation (current search bound 3.37)

Leaf page sequence   UB     E? (eliminated)
R1 B1 G1             2.04   N
R1 B1 G3             6.2    Y
R1 B1 G4             4.27   Y
R1 B3 G1             7.53   Y
R1 B3 G3             6.54   Y
R1 B3 G4             4.29   Y
R1 B4 G1             4.02   Y
R2 B1                3.7    Y
R2 B3 G4             3.43   Y
R2 B4                5.17   Y
R4 B1                4.08   Y
R4 B3                7.94   Y
R4 B4                7.56   Y

Only leaf node sequence <R1,B1,G1> is left

PLUB – An Example

Search for the candidate MTNN in <R1,B1,G1> (time unit: P-P, one point-to-point distance calculation)

1st iteration
  <g2>, <g10>, <g12>, <g13>
  Time: 4
2nd iteration
  <b12,g13>, <b1,g13>, <b2,g2>, <b15,g13>
  Time: 4x4+4 = 20
3rd iteration
  <r10,b15,g13>, <r9,b15,g13>, <r2,b2,g2>, <r11,b1,g13>
  Time: 4x4+4 = 20
Output
  Shortest route R(q,r11,b1,g13) with distance value 3.16

Running Results of RLORD

First iteration (time unit: P-P)
  <g2>, <g3>, <g4>, <g5>, <g7>, <g9>, <g10>, <g12>, <g13>, <g14>, <g15>, <g16>
  Time: 11
Second iteration
  <b1,g13>, <b2,g2>, <b3,g3>, <b4,g3>, <b6,g14>, <b7,g14>, <b11,g3>, <b12,g13>, <b13,g14>, <b14,g3>, <b15,g13>
  Time: 11x12+12 = 144
Third iteration
  <r1,b11,g3>, <r2,b2,g2>, <r3,b11,g3>, <r8,b1,g13>, <r9,b15,g13>, <r10,b15,g13>, <r11,b1,g13>, <r12,b11,g3>, <r13,b1,g13>, <r14,b1,g13>, <r15,b1,g13>
  Time: 12x11+11 = 143
R(q,r11,b1,g13) is the shortest among all routes
Shortest distance value: 3.16

Running Time Comparison Table

R-R: rectangle-to-rectangle distance calculation
P-P: point-to-point distance calculation

         R-R   P-P
PLUB     17    44
RLORD    0     298

RLORD performs no R-R distance calculations but many more P-P calculations
Cost of R-R < 2 x cost of P-P
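The comparison can be checked with a few lines of arithmetic. The sketch below takes the counts from the table and the per-iteration times from the two running examples; expressing everything in P-P units and the helper name `total_cost` are assumptions made for illustration.

```python
# Operation counts from the comparison table.
counts = {"PLUB": {"rr": 17, "pp": 44}, "RLORD": {"rr": 0, "pp": 298}}

# The P-P totals are the sums of the per-iteration times in the examples.
assert 4 + 20 + 20 == counts["PLUB"]["pp"]
assert 11 + 144 + 143 == counts["RLORD"]["pp"]

def total_cost(name, rr_cost, pp_cost=1.0):
    """Total cost in P-P units for a given cost of one R-R calculation."""
    c = counts[name]
    return c["rr"] * rr_cost + c["pp"] * pp_cost

# Even at the pessimistic limit R-R = 2 x P-P, PLUB stays well below RLORD.
plub_worst = total_cost("PLUB", rr_cost=2.0)    # 17 * 2 + 44
rlord = total_cost("RLORD", rr_cost=2.0)        # 0 * 2 + 298
print(plub_worst, rlord)
```

Under that limit PLUB costs 78 P-P units against 298 for RLORD in this example.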

Cost Model for PLUB (For One Permutation)

Total cost: $C_{R\text{-}T} + C_{LF} + C_{PN}$
$C_{R\text{-}T}$: cost of the R-tree traversal to find all R-tree leaf nodes intersected by the circle centered at query point q with radius equal to the current upper bound
$C_{LF}$: cost of the page-level leaf node search for R-tree candidate leaf node sequences
$C_{PN}$: cost of the point-level search for the candidate MTNN in the candidate leaf node sequences

$C_{R\text{-}T}$ Model of PLUB

$C_{R\text{-}T}$: R-tree traversal cost
$C_{PR}$: cost of a point-to-rectangle distance calculation
$N_{t,i}$: number of tree nodes visited in the traversal of the tree for feature type i

$C_{R\text{-}T} = C_{PR} \times \sum_{i=1}^{k} N_{t,i}$

$C_{LF}$ Model of PLUB

$C_{LF}$: cost of the search for R-tree candidate leaf node sequences
$N_{R\text{-}R}$: number of leaf nodes visited in the candidate leaf node sequence search
$C_{R\text{-}R}$: cost of a rectangle-to-rectangle distance calculation

$C_{LF} = N_{R\text{-}R} \times C_{R\text{-}R}$

$C_{PN}$ Model of PLUB

$C_{PN}$: cost of the search for the MTNN in the candidate leaf node sequences
$F_{LS}$: filtering ability ratio for candidate leaf node sequences
$n_l$: average number of points in a leaf node over all feature types
$p_i$: number of pages of feature type i
$C_{P\text{-}P}$: cost of a point-to-point distance calculation
$C_{ls}$: cost of the search for the MTNN in a single leaf node sequence

$C_{ls} = C_{P\text{-}P} \times \underbrace{\big((n_l + n_l \times n_l) + (n_l + n_l \times n_l) + \dots + (n_l + n_l \times n_l)\big)}_{k-1 \text{ terms}} = (k-1)\, n_l (n_l + 1) \times C_{P\text{-}P}$

$C_{PN} = C_{ls} \times \prod_{i=1}^{k} p_i \times (1 - F_{LS})$
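As a quick numeric check of the two formulas above, here is a small sketch. The function names and the default unit cost `c_pp=1.0` are assumptions; the sample values (k = 3, four points per leaf, four pages per type, one surviving sequence out of 64) are illustrative inputs chosen to resemble the running example, not figures reported in the slides.

```python
from math import prod

def cost_single_sequence(k, n_l, c_pp=1.0):
    """C_ls = C_PP * (k - 1) * n_l * (n_l + 1)."""
    return c_pp * (k - 1) * n_l * (n_l + 1)

def cost_point_search(pages, n_l, f_ls, c_pp=1.0):
    """C_PN = C_ls * prod(p_i) * (1 - F_LS), with k = len(pages)."""
    k = len(pages)
    return cost_single_sequence(k, n_l, c_pp) * prod(pages) * (1.0 - f_ls)

c_ls = cost_single_sequence(k=3, n_l=4)                  # 2 * 4 * 5
c_pn = cost_point_search([4, 4, 4], n_l=4, f_ls=1 - 1 / 64)
print(c_ls, c_pn)
```

With k = 3 and four points per leaf, the formula gives 40 P-P calculations per sequence, which equals the 20 + 20 spent in the second and third iterations of the running example.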

Cost Model for RLORD (For One Permutation)

Total cost: $C_{R\text{-}T}' + C_{PS}$
$C_{R\text{-}T}'$: cost of R-tree based coarse pruning, i.e., finding all data points inside the initial upper bound
$C_{PS}$: cost of the candidate MTNN search in the remaining subsets
$C_{P\text{-}P}$: cost of a point-to-point distance calculation

$C_{R\text{-}T}' = C_{R\text{-}T} + C_{P\text{-}P} \times n_l \times (p_1 + p_2 + p_3 + \dots + p_{k-1} + p_k)$

$C_{PS} = C_{P\text{-}P} \times n_l \times \big((p_1 + n_l\, p_1 p_2) + (p_2 + n_l\, p_2 p_3) + \dots + (p_{k-1} + n_l\, p_{k-1} p_k)\big)$

Cost Model Summary of PLUB and RLORD (One Permutation)

PLUB
  General form: $C_{R\text{-}T} + C_{LF} + C_{PN}$
  Approximate form: $C_{P\text{-}P} \times (k-1)\, n_l (n_l + 1) \times \prod_{i=1}^{k} p_i \times (1 - F_{LS})$
RLORD
  General form: $C_{R\text{-}T}' + C_{PS}$
  Approximate form: $C_{P\text{-}P} \times n_l \times \big((p_1 + n_l\, p_1 p_2) + (p_2 + n_l\, p_2 p_3) + \dots + (p_{k-1} + n_l\, p_{k-1} p_k)\big)$

In random or approximately random datasets, $F_{LS}$ is not large enough, so PLUB takes more time.
In clustered datasets, $F_{LS}$ tends to be very large. PLUB runs faster than RLORD when

$1 - F_{LS} < \dfrac{n_l \times \big((p_1 + n_l\, p_1 p_2) + (p_2 + n_l\, p_2 p_3) + \dots + (p_{k-1} + n_l\, p_{k-1} p_k)\big)}{(k-1)\, n_l (n_l + 1) \times \prod_{i=1}^{k} p_i}$

For clustered datasets, the condition holds once the clusters become more compact.
Left side: remaining ratio (r-ratio)
Right side: comparison ratio (c-ratio)
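The comparison between the two approximate forms can be evaluated directly. The sketch below assumes the approximate forms are the whole cost (traversal and page-level costs are ignored) and uses hypothetical names `c_ratio` and `plub_faster`; the page counts and filtering ratios in the usage lines are made-up inputs.

```python
from math import prod

def c_ratio(pages, n_l):
    """Comparison ratio: approximate RLORD cost over unfiltered PLUB cost."""
    k = len(pages)
    rlord = n_l * sum(p + n_l * p * nxt for p, nxt in zip(pages, pages[1:]))
    plub = (k - 1) * n_l * (n_l + 1) * prod(pages)
    return rlord / plub

def plub_faster(pages, n_l, f_ls):
    """PLUB is predicted faster when the r-ratio (1 - F_LS) is below the c-ratio."""
    return (1.0 - f_ls) < c_ratio(pages, n_l)

ratio = c_ratio([4, 4, 4], n_l=4)           # 544 / 2560
print(ratio)
print(plub_faster([4, 4, 4], n_l=4, f_ls=0.9))
print(plub_faster([4, 4, 4], n_l=4, f_ls=0.5))
```

Because the denominator contains the product of all page counts while the numerator only contains products of neighbouring pairs, the c-ratio shrinks quickly as k grows, so PLUB needs a correspondingly stronger filter to win.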


Experiment Design

Synthetic Data Sets Generation

Randomly generate cluster centers in the rectangle with bottom-left point (0,0) and top-right point (10000,10000)
  Constraint: the minimum distance between two cluster centers is minCCDist
Around every cluster center, generate cluster member points
  The maximum distance from a member point to its cluster center is ClusterSize
The simplified maximum cluster center distance is determined by:
  maxCCDist = 10000.0/(int)(sqrt(CN)+1)
Thus the minimum cluster center distance used when generating cluster centers is:
  minCCDist = BCF x maxCCDist
Then the cluster size is:
  ClusterSize = ICF x minCCDist
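The generation procedure can be sketched as follows. The three formulas come from the slide; everything else is an assumption: the function name `generate_clusters`, the `points_per_cluster`, `seed`, and `max_attempts` parameters, rejection sampling for the centers, and uniform sampling of members inside a disc (members may fall slightly outside the rectangle, which the slides do not address).

```python
import math
import random

def generate_clusters(cn, bcf, icf, points_per_cluster, seed=0,
                      extent=10000.0, max_attempts=100000):
    rng = random.Random(seed)
    max_cc_dist = extent / int(math.sqrt(cn) + 1)   # maxCCDist
    min_cc_dist = bcf * max_cc_dist                 # minCCDist
    cluster_size = icf * min_cc_dist                # ClusterSize

    centers, attempts = [], 0
    while len(centers) < cn:                        # rejection sampling of centers
        attempts += 1
        if attempts > max_attempts:
            raise RuntimeError("could not place all cluster centers")
        c = (rng.uniform(0, extent), rng.uniform(0, extent))
        if all(math.dist(c, other) >= min_cc_dist for other in centers):
            centers.append(c)

    points = []
    for cx, cy in centers:
        for _ in range(points_per_cluster):
            r = cluster_size * math.sqrt(rng.random())   # uniform over the disc
            theta = rng.uniform(0, 2 * math.pi)
            points.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return centers, points

centers, points = generate_clusters(cn=20, bcf=0.5, icf=0.3, points_per_cluster=10)
print(len(centers), len(points))
```

For CN = 20 the formula gives maxCCDist = 2000, so BCF = 0.5 and ICF = 0.3 yield minCCDist = 1000 and ClusterSize = 300. Large BCF values can make the center placement hard to satisfy, which is why the sketch gives up after `max_attempts` tries.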

Experiment Parameters

Feature types: 2-7
Between-cluster Compactness Factor (BCF): 0.1-1.0
In-cluster Compactness Factor (ICF): 0.1-0.5
Cluster Number (CN): 20, 50, 100, 200


Synthetic Datasets Example

BCF=0.5,ICF=0.5,CN=20,Feature Type=2

BCF=0.5,ICF=0.3,CN=20,Feature Type=2

Experiment Setup & Data Sets

Setup: C / Pentium-IV 3.2GHz / Linux / 1GB memory / synthetic data
Experiments on synthetic data:
  Scalability test in terms of the number of feature types
  Effect of data set density
  Effect of between-cluster compactness factor
  Effect of in-cluster compactness factor

Scalability Test

Parameters
  Fixed: BCF=0.1, ICF=0.1, CN=20
  Variable: feature types (2-7)
Trend
  PLUB is much faster when the number of feature types is high

Effect of Data Sets Density

Parameters
  Fixed: FT=7, BCF=0.1, ICF=0.5
  Variable: cluster number (20, 50, 100, 200)
Trend
  PLUB is always faster than RLORD for all densities of data sets

Effect of Between-cluster Compactness Factor

Parameters
  Fixed: FT=7, ICF=0.3, CN=50
  Variable: BCF (0.1-1.0)

Effect of Between-cluster Compactness Factor

Top: execution time vs. BCF
Trend
  PLUB is faster than RLORD when BCF is less than 0.7
  PLUB is slower than RLORD when BCF is greater than 0.7

Effect of Between-cluster Compactness Factor

Bottom: remaining ratio (r-ratio) and comparison ratio (c-ratio) vs. BCF
Trend
  Both ratios increase as BCF increases
  The remaining ratio is less than the comparison ratio when BCF is less than 0.8

Effect of Between-cluster Compactness Factor

Contradiction?
  The remaining ratio increases, which means the pruning ratio decreases, yet the execution time decreases
  When BCF increases, fewer leaf nodes intersect the current search bound, so the total number of possible candidate leaf node sequences decreases dramatically

Effect of Between-cluster Compactness Factor

Key information
  When the remaining ratio is less than the comparison ratio, PLUB runs faster than RLORD
  When the remaining ratio is greater than the comparison ratio, PLUB takes more time than RLORD

Effect of In-cluster Compactness Factor

Parameters
  Fixed: FT=7, BCF=0.1, CN=50
  Variable: ICF (0.1-0.5)
Trend
  PLUB is always faster than RLORD for ICF from 0.1 to 0.5

Conclusion and Future Work

Conclusion
  Formalized the MTNN query problem
  Proposed a PLUB based algorithm for the MTNN query
  Compared PLUB and RLORD
Future work
  Design heuristic algorithms to tackle the MTNN query problem for a large number of feature types

References

[1] M. Kolahdouzan, M. Sharifzadeh and C. Shahabi. The Optimal Sequenced Route Query. USC, CS Dept., Tech. Report 05-840, 2005.

[2] F. Li, D. Cheng, M. Hadjieleftheriou, G. Kollios and S.-H. Teng. On Trip Planning Queries in Spatial Databases. SSTD 2005.