Retrieving k-Nearest Neighboring Trajectories by a Set of Point Locations

Lu-An Tang, Yu Zheng, Xing Xie, Jing Yuan, Xiao Yu, Jiawei Han

University of Illinois at Urbana-Champaign; Microsoft Research Asia

2

Motivation: trajectory query by locations

• Huge volume of spatial trajectories

• Users need to search trajectories by a set of point locations

• Example data sources: geo-tagged photos, taxi trajectories, user check-ins

3

k-Nearest Neighboring trajectory query

• The trajectories may not pass exactly through those locations

• Query the top k trajectories with the minimum aggregated distance to the given locations

[Figure: query points q1, q2, q3]

4

k-NNT query

Task Definition: Given the trajectory dataset D and a set of query points Q, the k-NNT query retrieves a set K of k trajectories from D, K = {R1, R2, …, Rk}, such that for ∀ Ri ∈ K, ∀ Rj ∈ D − K, dist(Ri, Q) ≤ dist(Rj, Q).

Challenges

• Huge trajectory dataset: high I/O cost to scan all the trajectories

• Aggregated distance computation

• Non-uniform distribution: the trajectories are sparse or dense in different regions, and the user-given query locations may be far from all the trajectories

5

[Figure: trajectory R1 (points p1,1 to p1,5) and trajectory R2 (points p2,1 to p2,6) with query points q1, q2, q3]

The aggregate distance in k-NNT query

1. Find out the closest point from a trajectory to each query point (i.e., shortest matching pairs)

3. Sum up the lengths of all matching pairs

• dist(R1, q1) = dist(p1,2, q1) = 20 m
• dist(R1, q2) = dist(p1,3, q2) = 50 m
• dist(R1, q3) = dist(p1,5, q3) = 15 m
• dist(R1, Q) = ∑ dist(R1, qi) = 85 m

• dist(R2, q1) = dist(p2,3, q1) = 30 m
• dist(R2, q2) = dist(p2,4, q2) = 5 m
• dist(R2, q3) = dist(p2,6, q3) = 40 m
• dist(R2, Q) = ∑ dist(R2, qi) = 75 m
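The aggregate distance described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes planar (x, y) coordinates with Euclidean distance, and the function names and sample coordinates are hypothetical.

```python
import math

def point_dist(p, q):
    """Euclidean distance between two (x, y) points (an assumption; the
    slides only give distances in metres)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def traj_to_point_dist(trajectory, q):
    """dist(R, q): length of the shortest matching pair between R and q."""
    return min(point_dist(p, q) for p in trajectory)

def aggregate_dist(trajectory, query):
    """dist(R, Q): sum of the shortest matching pairs over all query points."""
    return sum(traj_to_point_dist(trajectory, q) for q in query)

# Hypothetical coordinates, chosen only to keep the arithmetic easy to follow.
Q = [(0, 0), (10, 0)]
R = [(0, 3), (5, 5), (10, 4)]
print(aggregate_dist(R, Q))  # closest pairs have lengths 3 and 4
```

For real GPS data a geodesic distance would probably replace `point_dist`; the aggregation itself would stay the same.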

6

Related Work: k-BCT query

k-Best Connected Trajectory (k-BCT) query [SIGMOD2010]

The similarity function between a trajectory R and query locations Q is

Sim(R, Q) = ∑ e^(−dist(R, qi))

Problem: the value of this function depends on the distance unit, so the ranking it produces is inconsistent across units

An example:

If query Q has two points q1 and q2;

dist(R1, q1) = dist(R1, q2) = 2.4km = 1.48 miles,

dist(R2, q1) = 1.5 km =0.93 miles, dist(R2, q2) = 5km = 3.1 miles

Use unit “mile”, Sim(R1, Q) = 0.45 > Sim(R2, Q) = 0.43

Use unit “km”, Sim(R1, Q) = 0.18 < Sim(R2, Q) = 0.22
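The unit dependence can be checked numerically. The sketch below assumes the k-BCT similarity is the sum of e^(−dist(R, qi)) over the query points, which approximately reproduces the values on this slide. The function name is hypothetical, and the distances are the ones quoted above.

```python
import math

def kbct_sim(dists):
    """Assumed k-BCT similarity: sum of exp(-distance) over the query points."""
    return sum(math.exp(-d) for d in dists)

# Distances from the slide, in kilometres and in miles.
r1_km, r2_km = [2.4, 2.4], [1.5, 5.0]
r1_mi, r2_mi = [1.48, 1.48], [0.93, 3.1]

# The similarity ranking flips when the unit changes ...
print(kbct_sim(r1_mi) > kbct_sim(r2_mi))  # miles: R1 ranked above R2
print(kbct_sim(r1_km) < kbct_sim(r2_km))  # km: R2 ranked above R1

# ... while the aggregated distance of k-NNT ranks R1 first in both units.
print(sum(r1_km) < sum(r2_km), sum(r1_mi) < sum(r2_mi))
```

A plain sum scales linearly with the unit, so its ordering cannot change; the exponential does not have that property.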

7

Advantages of k-NNT over k-BCT

• The distance function of k-BCT changes over units (inconsistent)

• The distance function of k-BCT is sensitive to a query

[Figure: query points q1, q2, q3 and result trajectories; legend: k-BCT & k-NNT, k-NNT, k-BCT]

8

Query framework: candidate-generation-and-verification

Candidate generation
• Best-first search based individual heaps
• Coordination by a global heap

Candidate verification
• Lower-bound estimation
• Efficient pruning with the global heap

Qualifier expectation-based method

[Figure: trajectories R1 to R6 and query points q1, q2, q3]

Direct Computing (all trajectories):
• dist(R1, Q) = 5+2+2 = 9 m
• dist(R2, Q) = 25+20+30 = 75 m
• dist(R3, Q) = 80+25+30 = 135 m
• dist(R4, Q) = 90+5+3 = 98 m
• dist(R5, Q) = 55+8+70 = 133 m
• dist(R6, Q) = 120+80+40 = 240 m

Candidate Generation, then Candidate Verification (candidates R1, R4, R5 only):
• dist(R1, Q) = 5+2+2 = 9 m
• dist(R4, Q) = 90+5+3 = 98 m
• dist(R5, Q) = 55+8+70 = 133 m
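As a point of comparison, the direct-computing baseline can be sketched as follows. The per-query distances are the ones listed on this slide; the function name is hypothetical, and the sketch assumes every trajectory has already been read in full, which is exactly the I/O cost the framework tries to avoid.

```python
import heapq

# Per-query-point distances (metres) for q1, q2, q3, as listed on the slide.
per_query_dists = {
    "R1": [5, 2, 2],
    "R2": [25, 20, 30],
    "R3": [80, 25, 30],
    "R4": [90, 5, 3],
    "R5": [55, 8, 70],
    "R6": [120, 80, 40],
}

def knnt_direct(dists_by_traj, k):
    """Return the k (trajectory, aggregate distance) pairs with smallest totals."""
    totals = {tid: sum(ds) for tid, ds in dists_by_traj.items()}
    return heapq.nsmallest(k, totals.items(), key=lambda item: item[1])

print(knnt_direct(per_query_dists, 1))  # [('R1', 9)]

# Verifying only the candidates R1, R4, R5 gives the same top-1 answer.
candidates = {tid: per_query_dists[tid] for tid in ("R1", "R4", "R5")}
print(knnt_direct(candidates, 1))  # [('R1', 9)]
```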

9

Candidate Generation

Given a query Q = {q1, q2, …, qm}, generate a trajectory candidate set including all the k-NNTs (i.e., complete set)

Step 1: searching k-NN points using best-first-based individual heap

Step 2: generating the candidate trajectories by the global heap

[Figure: trajectories R1 to R6 and query points q1, q2, q3, each query point qi with its individual heap hi]

h1: <p2,3, q1>, <p5,2, q1>, <p1,6, q1>, <p2,9, q1>, …

h2: <p6,2, q2>, <p5,3, q2>, <p7,4, q2>, <p4,8, q2>, …

h3: <p2,2, q3>, <p3,5, q3>, <p7,3, q3>, <p8,6, q3>, …
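An individual heap can be sketched as a best-first search that yields matching pairs in increasing distance from one query point. The version below is a simplified stand-in: instead of a full R-tree it uses a single level of leaf boxes, and all names and coordinates are hypothetical.

```python
import heapq
import math
from itertools import count

def mindist(q, box):
    """Smallest possible distance from q to any point inside box
    (xmin, ymin, xmax, ymax); zero if q lies inside the box."""
    xmin, ymin, xmax, ymax = box
    dx = max(xmin - q[0], 0, q[0] - xmax)
    dy = max(ymin - q[1], 0, q[1] - ymax)
    return math.hypot(dx, dy)

def best_first_pairs(leaves, q):
    """Yield (distance, traj_id, point_idx) in non-decreasing distance from q.

    leaves: list of (box, [(traj_id, point_idx, (x, y)), ...]).
    The heap mixes unexpanded leaves (keyed by mindist) and points (keyed by
    their exact distance); a counter breaks ties so payloads are never compared.
    """
    tie = count()
    heap = [(mindist(q, box), next(tie), None, pts) for box, pts in leaves]
    heapq.heapify(heap)
    while heap:
        d, _, point, pts = heapq.heappop(heap)
        if point is not None:
            yield d, point[0], point[1]  # a matching pair <p, q>
        else:
            for tid, idx, xy in pts:     # expand the leaf into its points
                exact = math.hypot(xy[0] - q[0], xy[1] - q[1])
                heapq.heappush(heap, (exact, next(tie), (tid, idx), None))

leaves = [
    ((0, 0, 2, 2), [("R1", 1, (1, 1)), ("R2", 3, (2, 2))]),
    ((8, 8, 10, 10), [("R3", 5, (9, 9))]),
]
for d, tid, idx in best_first_pairs(leaves, (0, 0)):
    print(tid, idx, round(d, 2))
```

Because the search is lazy, a caller can stop after a few pairs and the far leaf is never expanded.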

10

Global heap
• A minimum heap sorting matching pairs by the distance
• Retrieves new matching pairs from the individual heaps
• Pops the matching pairs to the candidate set

Step 2: generating candidate trajectories

Individual Heaps

h1: <p2,3, q1>, <p5,2, q1>, <p1,6, q1>, <p2,9, q1>, …

h2: <p6,2, q2>, <p5,3, q2>, <p7,4, q2>, <p4,8, q2>, …

h3: <p2,2, q3>, <p3,5, q3>, <p7,3, q3>, <p8,6, q3>, …

…

hm: <p5,1, qm>, <p2,3, qm>, <p5,7, qm>, <p9,2, qm>, …

Global Heap (Size=m): <p1,4, q1>, <p5,1, q3>, <p6,4, q4>, <p3,4, q2>, …

Candidate Set

R1: <p1,2, q1>, <p1,5, q2>, <p1,3, q3>, …, <p1,9, qm>

R2: (empty), <p2,2, q2>, <p2,4, q3>, …, (empty)

R4: <p4,5, q1>, (empty), <p4,3, q3>, …, <p4,7, qm>

…
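The coordination between the individual heaps and the global heap can be sketched as below. This is an illustrative reading of the slides rather than the authors' code: each individual heap is replaced by a pre-sorted stream of pairs, the loop stops once k full-matching candidates exist, and all names and coordinates are hypothetical.

```python
import heapq
import math

def generate_candidates(trajectories, query, k):
    """Return (candidates, global_heap) after k full-matching candidates.

    trajectories: {traj_id: [(x, y), ...]}; query: [(x, y), ...].
    candidates:   {traj_id: {query_index: shortest pair distance}}.
    """
    m = len(query)
    # Stand-in for the individual heaps: one distance-sorted stream per qi.
    streams = []
    for q in query:
        pairs = sorted(
            (math.hypot(p[0] - q[0], p[1] - q[1]), tid)
            for tid, pts in trajectories.items() for p in pts
        )
        streams.append(iter(pairs))

    # The global heap holds at most one pending pair per query point.
    global_heap = []
    for i, stream in enumerate(streams):
        first = next(stream, None)
        if first is not None:
            heapq.heappush(global_heap, (first[0], i, first[1]))

    candidates, full = {}, 0
    while global_heap and full < k:
        d, i, tid = heapq.heappop(global_heap)   # shortest pending pair
        matched = candidates.setdefault(tid, {})
        if i not in matched:                     # first pair seen is the shortest
            matched[i] = d
            if len(matched) == m:
                full += 1                        # tid is now a full match
        nxt = next(streams[i], None)             # refill from the same heap hi
        if nxt is not None:
            heapq.heappush(global_heap, (nxt[0], i, nxt[1]))
    return candidates, global_heap

trajectories = {
    "R1": [(0, 1), (5, 1), (10, 1)],
    "R2": [(0, 4), (5, 4), (10, 4)],
    "R3": [(50, 50)],
}
cands, heap = generate_candidates(trajectories, [(0, 0), (5, 0), (10, 0)], k=1)
print(cands)  # only R1 is touched; the far-away R3 never becomes a candidate
```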

11

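Example: Search based on the global heap

[Figure, first animation frame: trajectories R1 to R5 with points p1,2, p1,4, p1,6, p4,4, p4,5, p5,5 and query points q1, q2, q3; panels for the Candidate Set, the Global Heap, and the Individual Heaps h1, h2, h3]

Matching pairs shown: <p1,2, q1>, <p1,4, q2>, <p1,6, q3>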

12

[Figure, next animation frame of the global-heap search example: same trajectories, query points, and panels]

Matching pairs shown: <p1,2, q1>, <p1,4, q2>, <p1,6, q3>, <p5,5, q2>

Candidate Set: R1 (Partial Match)

13

[Figure, next animation frame of the global-heap search example: same trajectories, query points, and panels]

Matching pairs shown: <p1,2, q1>, <p1,4, q2>, <p1,6, q3>, <p5,5, q2>, <p4,5, q3>

14

[Figure, next animation frame of the global-heap search example: same trajectories, query points, and panels]

Matching pairs shown: <p1,2, q1>, <p1,4, q2>, <p1,6, q3>, <p5,5, q2>, <p4,5, q3>, <p4,4, q2>

15


Candidate Set

R1: <p1,2, q1>, <p1,4, q2>, <p1,6, q3>. (Full Match)

R4: <p4,5, q3>. (Partial Match)

Global Heap: <p1,2, q1>, <p4,4, q2>, <p1,5, q3>

[Figure, final animation frame of the global-heap search example: trajectories R1 to R5 with points p1,2, p1,4, p1,6, p4,4, p4,5, p5,5, query points q1, q2, q3, and the Individual Heaps h1, h2, h3]

Stop criterion: stop when there are k full-matching candidates

• Property 1: The candidate set is complete if G has popped out k full-matching candidates (in this example, k=1)

Advantages
• Guarantees that all k-NNTs are included in the candidate set
• Generates compact candidate sets

16

Candidate verification

• The full-matching candidate may not be the final k-NNT

• The system has to retrieve the partial-matching trajectories (R4 and R5) to compute their aggregate distances (I/O cost)

Question: can we compute a lower bound for R4 and R5 without retrieving their details? If LB(R4) or LB(R5) is larger than dist(R1, Q), that trajectory can be pruned directly

Candidate Set

R1: <p1,2, q1>, <p1,4, q2>, <p1,6, q3>. (Full Match)

17

Candidate verification

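The lower bound of a partial-matching trajectory R is

LB(R) = ∑ dist(R, qi) over the query points qi already matched by R + ∑ dist(pj, qj) over the unmatched query points qj, where <pj, qj> is the pair for qj currently in the global heap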

If LB(R) is larger than the distance of the full-matching candidate, R can be pruned directly

Candidate Set

R1: <p1,2, q1>, <p1,4, q2>, <p1,6, q3>. dist(R1) = 95

R4: <p4,5, q3>. LB(R4) = 114 (pruned)

R5: <p5,5, q2>. LB(R5) = 90 (passed)

Global Heap: <p1,2, q1>, <p4,4, q2>, <p1,5, q3>
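The pruning step can be sketched as follows, assuming the lower bound adds the distances of the pairs a trajectory has already matched to the distances of the pairs currently in the global heap for its unmatched query points. The per-pair distances are hypothetical, chosen only so that the totals agree with the 114, 90 and 95 on this slide; the function names are also assumptions.

```python
def lower_bound(matched, heap_dists):
    """Lower bound on dist(R, Q) for a partial-matching trajectory R.

    matched:    {query index: distance of R's pair already in the candidate set}
    heap_dists: {query index: distance of the pair currently in the global
                 heap}, with one entry per query point.
    """
    return sum(matched.get(i, d) for i, d in heap_dists.items())

def verify(best_full_dist, partials, heap_dists):
    """Keep only the partial matches whose lower bound does not exceed the
    distance of the full-matching candidate (case k = 1)."""
    kept = {}
    for tid, matched in partials.items():
        lb = lower_bound(matched, heap_dists)
        if lb <= best_full_dist:
            kept[tid] = lb  # must be fetched and computed exactly
    return kept

# Hypothetical pair distances (metres); only the totals come from the slide.
heap_dists = {1: 30, 2: 40, 3: 50}            # pending pairs for q1, q2, q3
partials = {"R4": {3: 44}, "R5": {2: 10}}     # pairs already popped
print(verify(95, partials, heap_dists))       # {'R5': 90}
```

The bound is valid under the stated assumption because each individual heap releases pairs in increasing distance, so an unmatched pair of R cannot be shorter than the pair still waiting in the global heap.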

18

Problem of Outlier Query Location

A query location is an outlier if it is far from all the trajectories

Too many partial-matching candidates will be generated before a full-matching candidate is found

Candidate Set

R1: <p1,1, q1>, <p1,4, q2>, (empty). (Partial Matching)

R2: <p2,1, q1>, <p2,5, q2>, (empty). (Partial Matching)

R4: (empty), <p4,4, q2>, (empty). (Partial Matching)

…

Global Heap (Iteration 4): <p1,1, q1>, <p4,4, q2>, <p1,7, q3>

<p1,7, q3> cannot be popped out

[Figure: trajectories R1 to R4 with points p1,2, p1,4, p1,7, p2,1, p2,2, p2,5, p2,6, p4,4 and query points q1, q2, q3]

19

Qualifier expectation based method

The system can make up the missing pairs of a partial-matching trajectory by retrieving all its points

Two key issues:

• Guarantee the completeness of the candidate set. Property 2: If there are k made-up candidates (qualifiers) with distance smaller than the sum of the pairs in the global heap, the candidate set is complete

• Which candidate should be selected to make up? The qualifier expectation measure

20

Example of Qualifier Expectation

[Figure: trajectories R1 to R4 with points p1,2, p1,4, p1,7, p2,1, p2,2, p2,5, p2,6, p4,4 and query points q1, q2, q3]

Candidate Set

R1: <p1,1, q1>, <p1,4, q2>, (empty)

R2: <p2,1, q1>, <p2,5, q2>, (empty)

R4: (empty), <p4,4, q2>, (empty)

Global Heap, total dist sum(G) = 200 m: <p2,1, q1>, <p4,4, q2>, <p1,7, q3>

Qualifier Expectation: R1: 40 m. R2: 30 m. R4: 15 m.

Made-up candidate R1: <p1,1, q1>, <p1,4, q2>, <p1,7, q3>

dist(R1) = 160 m < sum(G), so R1 is a qualifier
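The completeness test of Property 2 can be sketched as a simple check. The function name is hypothetical, and the sketch does not model how the qualifier expectation itself is computed, since the slides do not give that formula. It only takes the distances of the made-up candidates and sum(G) as inputs.

```python
def candidate_set_complete(made_up_dists, heap_sum, k):
    """Property 2 as stated on the slide: the candidate set is complete once
    k made-up candidates (qualifiers) have a distance below sum(G)."""
    qualifiers = [d for d in made_up_dists.values() if d < heap_sum]
    return len(qualifiers) >= k

# Values from the example: dist(R1) = 160 m after making up its missing
# pair, and sum(G) = 200 m.
print(candidate_set_complete({"R1": 160}, heap_sum=200, k=1))  # True
print(candidate_set_complete({"R1": 160}, heap_sum=200, k=2))  # False
```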

21

Experiment Setup

• Real dataset: collected from the Microsoft GeoLife and T-Drive projects, with over 20,000 real trajectories

• Synthetic datasets with both uniform and biased distributions

• Randomly generated queries Q

• The proposed methods are compared with Fagin's Algorithm (FA) and the Threshold Algorithm (TA) (used in k-BCT)

[Figure: GeoLife]

22

Evaluations on synthetic dataset (biased distribution)

• GH (global heap) is faster than the baselines, with lower I/O cost

• QE (global heap + qualifier expectation) is an order of magnitude faster than the others

[Figure: (a) Query Time vs. |Q|, time in ms; (b) I/O Cost vs. |Q|, accessed R-tree nodes; series: GH, QE, TA, FA]

23

Evaluations on real dataset

• When |Q| is small, the probability of an outlier location is low, and GH achieves the best performance

• When |Q| is larger, the probability of an outlier location is high, and QE is more efficient

[Figure: (a) Query Time vs. |Q|, time in ms; (b) I/O Cost vs. |Q|, accessed R-tree nodes; series: GH, QE, TA, FA]

24

Conclusion

k-Nearest Neighboring Trajectory (k-NNT) query
• Retrieve trajectories by a set of locations

Candidate-generation-and-verification framework
• Generate candidate trajectories with the global heap
• Efficient lower-bound computation

Outlier query location: qualifier expectation-based method

25

Thanks!

Yu Zheng

Released Datasets:
• T-Drive taxi trajectories
• GeoLife GPS trajectories
