Efficient Parallel kNN Joins for Large Data in...
Efficient Parallel kNN Joins for Large Data in MapReduce
Chi Zhang (1), Feifei Li (2), Jeffrey Jestes (2)
(1) Dept of Computer Science, Florida State University
(2) School of Computing, University of Utah
April 4, 2012
Outline
1 Introduction
2 Background: kNN Join
3 Parallel kNN Join for Multi-dimensional Data Using MapReduce
   Exact kNN Join
   Approximate kNN Join
4 Experiments
5 Conclusions
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
k Nearest Neighbor Join
k nearest neighbor join (kNN join)
Given two data sets R and S, for every point q in R, the kNN join returns the k nearest points of q from S.
[Figure: points in R and points in S around a query point q. The 3-NN join for q returns the pairs (q, p1), (q, p3) and (q, p4). The join finds the kNN in S for all points in R.]
Numerous applications: knowledge discovery, data mining, spatial databases, multimedia databases, etc.
Data Growth
Source: IDC
Rise of Distributed and Parallel Computing
Data sets are growing at an exponential rate.
A single machine cannot handle large data efficiently.
Parallel and distributed computing is the trend.
Challenges:
Minimize communication and computation.
Achieve good load balance.
kNN Join
Exact kNN Join
$\mathrm{knn}(r, S)$ = set of kNN of $r$ from $S$.
$\mathrm{knnJ}(R, S) = \{(r, \mathrm{knn}(r, S)) \mid \text{for all } r \in R\}$.
Approximate kNN Join
$\mathrm{aknn}(r, S)$ = approximate kNN of $r$ from $S$.
$p$ = kth NN of $r$ in $\mathrm{knn}(r, S)$.
$p'$ = kth NN of $r$ in $\mathrm{aknn}(r, S)$.
$\mathrm{aknn}(r, S)$ is a c-approximation of $\mathrm{knn}(r, S)$: $d(r, p) \le d(r, p') \le c \cdot d(r, p)$.
$\mathrm{aknnJ}(R, S) = \{(r, \mathrm{aknn}(r, S)) \mid \forall r \in R\}$.
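The definitions above can be made concrete with a brute-force sketch. It is only an illustration of knn, knnJ and the c-approximation condition, not the method of the talk; the function names, the Euclidean distance and the sample points are assumptions introduced here.

```python
import heapq
import math

def knn(r, S, k):
    """Exact kNN of r from S, found by scanning all of S (Euclidean distance assumed)."""
    return heapq.nsmallest(k, S, key=lambda s: math.dist(r, s))

def knn_join(R, S, k):
    """knnJ(R, S): pair every r in R with knn(r, S)."""
    return [(r, knn(r, S, k)) for r in R]

def is_c_approximation(r, exact, approx, c):
    """Check d(r, p) <= d(r, p') <= c * d(r, p) for the kth neighbors p and p'."""
    d_p = max(math.dist(r, s) for s in exact)         # distance to exact kth NN
    d_p_prime = max(math.dist(r, s) for s in approx)  # distance to approximate kth NN
    return d_p <= d_p_prime <= c * d_p

# Hypothetical sample data.
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (6.0, 5.0), (9.0, 9.0)]
result = knn_join(R, S, k=2)
```

The scan costs one distance computation per pair in R × S, which is the cost the partitioned algorithms in the rest of the talk try to spread across machines.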
[Figure: a point r in R with its exact kth nearest neighbor p and its approximate kth nearest neighbor p′ among the points in S]
Exact kNN join: Block Nested Loop Join
Block nested loop join (BNLJ) based method
1 Partition R and S, each into n equal-sized disjoint blocks.
2 Perform BNLJ for each possible pair of blocks (Ri, Sj).
3 Get the global kNN results from the n local kNN results for every record in R.
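The three steps can be sketched on a single machine as follows. This is a minimal illustration, assuming Euclidean distance, distinct points in R, and hypothetical helper names (`split_blocks`, `bnlj`, `block_knn_join`); in the MapReduce setting each (Ri, Sj) pair would be processed by a separate reducer.

```python
import heapq
import math

def split_blocks(points, n):
    """Step 1: split a list into disjoint blocks of (nearly) equal size."""
    size = math.ceil(len(points) / n)
    return [points[i:i + size] for i in range(0, len(points), size)]

def bnlj(R_block, S_block, k):
    """Step 2: local kNN of every r in R_block against one block of S."""
    return {r: heapq.nsmallest(k, ((math.dist(r, s), s) for s in S_block))
            for r in R_block}

def block_knn_join(R, S, k, n):
    merged = {}
    for R_i in split_blocks(R, n):
        for S_j in split_blocks(S, n):
            for r, local in bnlj(R_i, S_j, k).items():
                # Step 3: keep the global top k over the local results seen so far.
                merged[r] = heapq.nsmallest(k, merged.get(r, []) + local)
    return {r: [s for _, s in cands] for r, cands in merged.items()}

# Hypothetical sample data.
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (6.0, 5.0), (9.0, 9.0)]
joined = block_knn_join(R, S, k=2, n=2)
```

Merging is safe because the global k nearest neighbors of r must each be among the k nearest within their own block of S.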
[Figure: R and S are each split into two blocks (R1, R2 and S1, S2). BNLJ(R1, S1), BNLJ(R1, S2), BNLJ(R2, S1) and BNLJ(R2, S2) are computed, and their results are merged into BNLJ(R1, S) and BNLJ(R2, S).]
Exact kNN join: Block Nested Loop Join
Two-round MapReduce algorithm: Round 1
[Figure: Round 1 data flow. Mappers read R and S from the DFS, (1) divide R and S into blocks (R1, R2, S1, S2) and (2) duplicate each block into 2 partitions, tagging each copy with a bucket id from 1 to 4. The shuffle groups the copies into bucket 1 (R1, S1), bucket 2 (R2, S1), bucket 3 (R1, S2) and bucket 4 (R2, S2). Four reducers run BNLJ, one per bucket, and write their results to the DFS.]
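Round 1 can be simulated in plain Python to show how each block ends up in n buckets. This is a sketch under assumptions, not Hadoop code: records are assigned to blocks by `rid % n`, buckets are keyed by the pair (R block, S block) rather than by the ids 1 to 4 of the figure, and a dictionary stands in for the shuffle. All function names are hypothetical.

```python
import math
from collections import defaultdict

def round1_map(tag, rid, point, n):
    """Emit one copy of a record for every bucket its block takes part in."""
    block = rid % n  # assumed block assignment
    for other in range(n):
        bucket = (block, other) if tag == "R" else (other, block)
        yield bucket, (tag, rid, point)

def round1_reduce(records, k):
    """BNLJ inside one bucket: local kNN of each r among the bucket's S records."""
    R_part = [(rid, p) for tag, rid, p in records if tag == "R"]
    S_part = [(sid, p) for tag, sid, p in records if tag == "S"]
    for rid, r in R_part:
        dists = sorted((math.dist(r, s), sid) for sid, s in S_part)
        for d, sid in dists[:k]:
            yield rid, sid, d

def run_round1(R, S, k, n):
    buckets = defaultdict(list)  # stands in for the shuffle phase
    for tag, data in (("R", R), ("S", S)):
        for rid, point in enumerate(data):
            for bucket, record in round1_map(tag, rid, point, n):
                buckets[bucket].append(record)
    output = [t for recs in buckets.values() for t in round1_reduce(recs, k)]
    return buckets, output

# Hypothetical sample data.
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (6.0, 5.0), (9.0, 9.0)]
buckets, round1_output = run_round1(R, S, k=2, n=2)
```

With n = 2 the sketch produces 4 buckets and ships every record twice, which mirrors the "duplicate each block into 2 partitions" step; in general each record is copied n times and n² buckets are needed.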
Two-round MapReduce algorithm: Round 2
[Figure: Round 2 data flow. Mappers read the Round 1 output files (File 1, File 2, ...), which hold records (r, s, d(r, s)) such as (r1, s1, d1,1), (r3, s1, d3,1), (r1, s7, d1,7) and (r3, s5, d3,5), and partition them by record id. After the shuffle, each reducer receives all candidates for one record r, sorts list(s, d(r, s)) and keeps the top k (= 2) results for r, for example (r1, s1, d1,1), (r1, s7, d1,7) and (r3, s5, d3,5), (r3, s6, d3,6), which are written to the DFS.]
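Round 2 amounts to a group-by on the record id followed by a top-k selection. The sketch below simulates it with a dictionary in place of the shuffle; the function names and the distance values in the sample records are made up for illustration.

```python
import heapq
from collections import defaultdict

def round2_map(record):
    """Key each Round 1 result (r, s, d(r, s)) by the id of r."""
    rid, sid, d = record
    return rid, (sid, d)

def round2_reduce(rid, candidates, k):
    """Keep the k candidates of r with the smallest distance."""
    top = heapq.nsmallest(k, candidates, key=lambda c: c[1])
    return [(rid, sid, d) for sid, d in top]

def run_round2(round1_output, k):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for record in round1_output:
        rid, value = round2_map(record)
        groups[rid].append(value)
    return {rid: round2_reduce(rid, cands, k) for rid, cands in groups.items()}

# Hypothetical Round 1 output; the distances are invented.
round1_output = [
    ("r1", "s1", 0.5), ("r3", "s1", 2.0), ("r1", "s7", 0.9),
    ("r3", "s5", 1.0), ("r1", "s4", 3.0), ("r3", "s6", 1.5),
]
final = run_round2(round1_output, k=2)
```

Each record r contributes at most k candidates per bucket it appeared in, so a reducer handles at most n · k candidates per record.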
Exact kNN join: Block R-tree Join
Use spatial index (R-tree) to improve performance
Build an R-tree index for a block of S in a bucket to speed up kNN computations.
Similar to the BNLJ algorithm: only BNLJ needs to be replaced with the block R-tree join (BRJ) in the first round.
[Figure: the Round 1 data flow, with BRJ replacing BNLJ in the reducers]
Approximate kNN join
Problems with exact kNN join solution
Too much communication and computation (n² buckets required).
Find a solution requiring O(n) buckets.
We search for approximate solutions.
Space-filling curve based methods ([YLK10], dubbed zkNN)
[Figure: the Round 1 data flow of the exact join, with reducers running BNLJ or BRJ, annotated "n² buckets required, too much cost"]
[YLK10] B. Yao, F. Li, P. Kumar. K nearest neighbor queries and knn-joins in large relational databases (almost) for free. ICDE, 2010.
Approximate kNN join: Z-order kNN join
The idea of zkNN
Transform d-dimensional points to 1-D values using Z-values.
Map the d-dimensional kNN join query to 1-D range queries.
Multiple randomly shifted copies are used to improve spatial locality.
In practice, 2 copies are already good enough.
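A Z-value is obtained by interleaving the bits of a point's coordinates. The sketch below assumes non-negative integer coordinates that fit in `bits` bits and takes the first coordinate as the lowest bit of each group, which is one common convention; the function name is hypothetical.

```python
def z_value(point, bits=8):
    """Interleave the bits of non-negative integer coordinates (Morton code)."""
    z = 0
    d = len(point)
    for b in range(bits):                  # bit position within each coordinate
        for dim, coord in enumerate(point):
            z |= ((coord >> b) & 1) << (b * d + dim)
    return z

# Sorting a 2 x 2 grid by Z-value visits it in the characteristic "Z" pattern.
grid = [(x, y) for x in range(2) for y in range(2)]
z_order = sorted(grid, key=z_value)
```

Points that are close in space often, though not always, receive close Z-values, which is why several randomly shifted copies are used.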
[Figure: the points p1, ..., p6 of P are shifted by a random vector vi to give the points pi,1, ..., pi,6 of Pi, which are mapped to the z-values zi,5, zi,1, zi,3, zi,4, zi,2, zi,6 (in z-order) forming the list ZPi, indexed by a B+-tree. The query q is shifted to qi = q + vi with z-value zqi; the z-values z−(zqi, k, Pi) before it and z+(zqi, k, Pi) after it in ZPi give the candidate set Ci(q).]
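The zkNN query for a single point can be sketched as follows. This is an illustration under assumptions rather than the implementation of [YLK10]: a sorted Python list searched with `bisect` stands in for the B+-tree, coordinates are non-negative integers, shifts are drawn with a seeded generator, and all function names are hypothetical.

```python
import bisect
import heapq
import math
import random

def z_value(point, bits=16):
    """Interleave the bits of non-negative integer coordinates."""
    d = len(point)
    return sum(((c >> b) & 1) << (b * d + i)
               for b in range(bits) for i, c in enumerate(point))

def shifted(point, v):
    return tuple(c + s for c, s in zip(point, v))

def build_copies(P, alpha, max_shift, seed=0):
    """P_1 = P plus (alpha - 1) randomly shifted copies, each sorted by z-value."""
    rng = random.Random(seed)
    d = len(P[0])
    shifts = [(0,) * d] + [tuple(rng.randrange(max_shift) for _ in range(d))
                           for _ in range(alpha - 1)]
    # Each entry keeps the z-value of the shifted point and the original point.
    return [(v, sorted((z_value(shifted(p, v)), p) for p in P)) for v in shifts]

def zknn(q, copies, k):
    candidates = set()
    for v, entries in copies:
        zq = z_value(shifted(q, v))
        pos = bisect.bisect_left(entries, (zq,))
        # Up to k entries just before and k just after zq in z-order: C_i(q).
        candidates.update(p for _, p in entries[max(0, pos - k):pos + k])
    # The answer is the k candidates nearest to q by true distance.
    return heapq.nsmallest(k, candidates, key=lambda p: math.dist(q, p))

# Hypothetical sample data.
P = [(10, 12), (40, 3), (7, 77), (90, 90), (55, 20), (33, 64)]
copies = build_copies(P, alpha=2, max_shift=50)
approx = zknn((12, 10), copies, k=2)
```

Each copy contributes at most 2k candidates, so the final distance computation touches at most 2kα points instead of all of P.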
Approximate kNN join: Z-order kNN join
In our group’s previous work, we derived the following guarantee for the zkNN join:
Theorem
Given a query point $q \in \mathbb{R}^d$, a data set $P \subset \mathbb{R}^d$, and a small constant $\alpha \in \mathbb{Z}^+$, we generate $(\alpha - 1)$ random vectors $\{v_2, \ldots, v_\alpha\}$, such that for any $i$, $v_i \in \mathbb{R}^d$, and shift $P$ by these vectors to obtain $\{P_1, \ldots, P_\alpha\}$ ($P_1 = P$). Then, the zkNN join returns a constant approximation in any fixed dimension for $\mathrm{knn}(q, P)$ in expectation.
Approximate kNN join: H-zkNNJ
Apply zkNN for join in MapReduce (H-zkNNJ)
Partition based algorithm
Partitioning policy:
To achieve communication and computation costs that are linear in the number of blocks n in each input data set.
Partitioning by z-values:
Partition input data sets $R_i$ and $S_i$ into $\{R_{i,1}, \ldots, R_{i,n}\}$ and $\{S_{i,1}, \ldots, S_{i,n}\}$ using $(n - 1)$ z-values $\{z_{i,1}, \ldots, z_{i,n-1}\}$.
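Partitioning by z-values can be sketched with a binary search over the n − 1 boundaries. How the boundaries are chosen is not stated on this slide; the sketch assumes quantiles of the z-values of R, and the function names and sample z-values are hypothetical. It also does not handle points whose nearest z-values fall in a neighboring partition.

```python
import bisect

def pick_boundaries(z_values, n):
    """Choose n - 1 boundary z-values as quantiles (an assumed policy)."""
    ordered = sorted(z_values)
    return [ordered[len(ordered) * j // n] for j in range(1, n)]

def partition(z_values, boundaries):
    """Split z-values into len(boundaries) + 1 partitions."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for z in z_values:
        parts[bisect.bisect_right(boundaries, z)].append(z)
    return parts

# Hypothetical z-values of one shifted copy of R and S.
ZR = [3, 8, 15, 22, 27, 41, 56, 60, 63]
ZS = [1, 20, 25, 30, 58]
boundaries = pick_boundaries(ZR, n=3)
R_parts = partition(ZR, boundaries)
S_parts = partition(ZS, boundaries)
```

Using the same boundaries for both data sets means block j of R only needs to be paired with block j of S (plus whatever is needed near the boundaries), which is what brings the number of buckets down from n² to O(n).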
[Figure: the z-value ranges of $R_i$ ($Z_{R_i}$) and $S_i$ ($Z_{S_i}$) are cut at the same partitioning values $z_{i,1}$ and $z_{i,2}$ into blocks $R_{i,1}, R_{i,2}, R_{i,3}$ and $S_{i,1}, S_{i,2}, S_{i,3}$. For a record r with z-value $z_r$, the candidate set $C_i(r)$ lies around $z_r$ in $Z_{S_i}$, so only a small neighborhood needs to be searched.]
[Figure: the same layout with four blocks, cut at $z_{i,1}$, $z_{i,2}$, and $z_{i,3}$. For a record r with z-value $z_r$, the candidate set $C_i(r)$ is marked on $Z_{S_i}$, and arrows labeled "copy" indicate points of $S_i$ being copied across block boundaries.]
Approximate kNN join: H-zkNNJ
Choice of partitioning values.
Each block of $R_i$ and the corresponding block of $S_i$ share the same boundaries, so we only search a small neighborhood and minimize communication.
Goal: load balance.
Evenly partition $R_i$ or $S_i$:
Evenly partition $R_i$ → $O\left(\frac{|R_i|}{n} \log |S_i|\right)$
Evenly partition $S_i$ → $O(|R_i| \log |S_i|)$
Approximate kNN join: H-zkNNJ
Computation of partitioning values.
Quantiles can be used for evenly partitioning a data set D.
Sorting D and retrieving its (n − 1) quantiles exactly is expensive.
We propose a sampling-based method to estimate the quantiles.
We proved that both estimations are close enough (within $\varepsilon N$) to the original ranks with high probability ($1 - e^{-2/\varepsilon}$).
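The sampling idea can be sketched as follows. This is an illustration under stated assumptions, not the estimators from the slides: each element is sampled independently with probability 1/(ε²N), which gives roughly 1/ε² samples, and the partitioning values are read off the sorted sample. The function name `estimate_quantiles` and the toy data are hypothetical.

```python
import random

def estimate_quantiles(data, n, epsilon, rng):
    """Estimate the n-1 values that split data into n roughly equal blocks.

    Assumption: independent sampling with p = min(1, 1 / (epsilon^2 * N)).
    """
    N = len(data)
    p = min(1.0, 1.0 / (epsilon ** 2 * N))
    sample = sorted(x for x in data if rng.random() < p)
    if not sample:
        raise ValueError("empty sample; use a smaller epsilon")
    m = len(sample)
    # The j-th partitioning value is the sample element at rank j*m/n.
    return [sample[(j * m) // n] for j in range(1, n)]

rng = random.Random(0)
data = list(range(100_000))   # stand-in for the z-values of one shifted copy
splits = estimate_quantiles(data, n=4, epsilon=0.02, rng=rng)
```

Only the sample is sorted, so the cost of sorting the full data set is avoided; a smaller ε gives a larger sample and tighter rank estimates.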
Approximate kNN join: H-zkNNJ
The H-zkNNJ algorithm can be implemented in 3 rounds of MapReduce.
Round 1: construct the random shift copies $R_i$ and $S_i$ of R and S, $i \in [1, \alpha]$, and generate partitioning values for $R_i$ and $S_i$.
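The shift and z-value steps of round 1 can be sketched on a single machine. This sketch assumes non-negative integer coordinates that fit in `bits` bits after shifting (real-valued coordinates would first need to be scaled and quantized), and it computes the z-value by bit interleaving (a Morton code). The names `z_value` and `shifted_copy` are illustrative.

```python
import random

def z_value(point, bits):
    """Interleave the bits of non-negative integer coordinates (Morton code)."""
    z = 0
    d = len(point)
    for b in range(bits):                      # least to most significant bit
        for j, coord in enumerate(point):
            z |= ((coord >> b) & 1) << (b * d + j)
    return z

def shifted_copy(points, shift, bits):
    """Return (z-value, original point) pairs for one shifted copy."""
    return [(z_value(tuple(c + s for c, s in zip(p, shift)), bits), p)
            for p in points]

rng = random.Random(1)
R = [(0, 0), (1, 0), (0, 1), (3, 3)]
alpha, bits = 2, 8
# v_1 is the zero vector, so the first copy is the unshifted data.
shifts = [(0, 0)] + [tuple(rng.randrange(16) for _ in range(2))
                     for _ in range(alpha - 1)]
copies = [shifted_copy(R, v, bits) for v in shifts]
```

Keeping the original point next to its z-value means later rounds can compute true distances without undoing the shift.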
[Figure: Round 1 data flow. Map: R and S are shifted by $v_i$ and their z-values computed; the i-th shifted copies $R_i$ and $S_i$ are written to DFS, and samples of $R_i$ and $S_i$ are emitted. Shuffle and sort. Reduce: estimator 1 and estimator 2 are applied to the samples of $R_i$ and $S_i$, and the results are written to DFS.]
Round 2: partition $R_i$ and $S_i$ into blocks and compute the candidate points for knn(r, S) for every r ∈ R.
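Candidate retrieval inside one block can be sketched as below. The slides' figure shows a B+-tree per block; this sketch swaps in a sorted list with binary search, which supports the same lookup for illustration. It assumes the candidates of r are the (up to) k entries immediately before and the (up to) k entries at or after $z_r$ in z-order. The function name `candidates` and the toy block are hypothetical.

```python
from bisect import bisect_left

def candidates(z_r, k, s_keys, s_ids):
    """Return ids of up to k predecessors and k successors of z_r in z-order.

    s_keys is the sorted list of z-values of one S block and s_ids the
    matching point ids, in the same order.
    """
    pos = bisect_left(s_keys, z_r)             # first entry with z >= z_r
    lo = max(0, pos - k)
    hi = min(len(s_keys), pos + k)
    return s_ids[lo:hi]

s_block = sorted([(5, "s1"), (9, "s2"), (14, "s3"), (20, "s4"), (31, "s5")])
s_keys = [z for z, _ in s_block]
s_ids = [pid for _, pid in s_block]
c = candidates(15, 2, s_keys, s_ids)           # neighbours of z_r = 15
```

Each lookup touches at most 2k entries, which is the small neighborhood search the partitioning is designed to allow.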
[Figure: Round 2 data flow. Map: $R_i$ and $S_i$ are partitioned by range into blocks $R_{i,1}, \ldots, R_{i,n}$ and $S_{i,1}, \ldots, S_{i,n}$ (block 1 to block n). Shuffle and sort: blocks $R_{i,j}$ and $S_{i,j}$ are grouped together. Reduce: each group uses a B+-tree to retrieve $C_i(r)$ for all $r \in R_{i,j}$, $j \in [1, n]$, and the results are written to DFS.]
Round 3: determine knn(r, C(r)) for every r ∈ R from the (r, $C_i(r)$) pairs emitted by round 2, where $C(r) = \bigcup_{i=1}^{\alpha} C_i(r)$.
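The final selection can be sketched as a small merge step. This assumes each candidate is available in its original (unshifted) coordinates and that distance is Euclidean; the function name `knn_from_candidates` and the sample points are illustrative.

```python
import heapq
import math

def knn_from_candidates(r, candidate_sets, k):
    """Pick the k nearest points to r from the union of its candidate sets."""
    union = set().union(*candidate_sets)       # C(r) = C_1(r) ∪ ... ∪ C_α(r)
    return heapq.nsmallest(k, union, key=lambda s: math.dist(r, s))

r = (0.0, 0.0)
c1 = {(1.0, 0.0), (5.0, 5.0), (0.0, 2.0)}      # candidates from shift 1
c2 = {(1.0, 0.0), (0.5, 0.5), (9.0, 9.0)}      # candidates from shift 2
result = knn_from_candidates(r, [c1, c2], k=2)
```

Taking the union first removes points that several shifted copies returned, so each true distance is computed once.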
Experiments: algorithms
We implement the following methods in Hadoop 0.20.2.
Exact methods:
The baseline solution is denoted H-BNLJ.
The improvement to the baseline solution is denoted H-BRJ.
Approximate methods:
Our three-round solution is denoted H-zkNNJ (meaning "Hadoop z-value kNN Join").
Experiments: setup
Experiments are performed in a heterogeneous Hadoop cluster with 17 machines:
1. 9 machines with 2GB of RAM and an Intel Xeon 1.86GHz CPU
2. 6 machines with 4GB of RAM and an Intel Xeon 2GHz CPU; one is reserved for the master (running JobTracker and NameNode)
3. 2 machines with 6GB of RAM and an Intel Xeon 2.13GHz CPU
All machines are directly connected to a 1000Mbps switch.
Each slave node has 300GB of hard drive space and 1GB of RAM for the Hadoop daemon.
The chunk size of DFS is set to 128MB.
Experiments: datasets
OpenStreetMap dataset:
The road networks for 50 states in the U.S.
160 million records.
Preprocessed to remove duplicates.
Each record consists of a 4-byte integer id, two 4-byte real coordinates representing latitude and longitude, and a description.
The coordinates have a positive real domain (0, 100000).
Stored in text format, 6.6GB.
Large synthetic Random-Cluster datasets:
The data sets have varying dimensionality (up to 30).
Each record has a 4-byte id and d-dimensional float coordinates.
Experiments: configurations and defaults
Data set configurations
(M×N) denotes a data set configuration with M records of R and N records of S (in 10s of millions).
Default values for the OpenStreet dataset:

| Symbol | Definition | Default |
| --- | --- | --- |
| (M×N) | data set configuration | (4x4) |
| k | # of nearest neighbors | 10 |
| α | # of shift copies | 2 |
| ε | the error rate of sampling | 0.003 |
| γ | the physical number of machines | 16 |

Values for the R-Cluster dataset:
(2x2) is the default data set configuration.
Experiments: Approximation quality
H-zkNNJ: Hadoop z-value kNN Join
[Figure: approximation ratio (y-axis, 1 to 1.6) versus k (10 to 80), OpenStreet.]
[Figure: recall (precision) (y-axis, 0.6 to 1) versus k (10 to 80), OpenStreet.]
[Figure: approximation ratio (y-axis, 1 to 1.6) versus dimensionality (5 to 30), R-Cluster.]
[Figure: recall (precision) (y-axis, 0 to 1) versus dimensionality (5 to 30), R-Cluster.]
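The slides plot approximation ratio and recall (precision) without restating how they are computed. The sketch below assumes common definitions, which may differ in detail from the authors': recall is the fraction of the exact k nearest neighbors that were returned (it equals precision because both result sets have size k), and approximation ratio is the average, over ranks j, of the j-th returned distance divided by the j-th exact distance. Function names and sample points are illustrative.

```python
import math

def recall(approx, exact):
    """Fraction of the exact k nearest neighbours that were returned."""
    return len(set(approx) & set(exact)) / len(exact)

def approximation_ratio(q, approx, exact):
    """Average ratio of the j-th returned distance to the j-th exact distance."""
    da = sorted(math.dist(q, p) for p in approx)
    de = sorted(math.dist(q, p) for p in exact)
    # Skip ranks whose exact distance is zero to avoid dividing by zero.
    ratios = [a / e for a, e in zip(da, de) if e > 0]
    return sum(ratios) / len(ratios) if ratios else 1.0

q = (0.0, 0.0)
exact = [(1.0, 0.0), (0.0, 2.0)]               # true 2 nearest neighbours
approx = [(1.0, 0.0), (0.0, 3.0)]              # returned by an approximate join
rec = recall(approx, exact)
ratio = approximation_ratio(q, approx, exact)
```

Under these definitions a ratio of 1 and a recall of 1 both mean the approximate answer matches the exact one.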
Experiments: Running time and communication cost
H-zkNNJ: Hadoop z-value kNN Join
H-BRJ: Hadoop Block R-tree Join
[Figure: running time (seconds ×10^3, 0 to 50) versus |R|×|S| (4x4 to 16x16, in units of 10^7×10^7) for H-zkNNJ and H-BRJ.]
[Figure: data shuffled (GB, 0 to 100) versus |R|×|S| (4x4 to 16x16, in units of 10^7×10^7) for H-zkNNJ and H-BRJ.]
Experiments: Effect of d
[Figure: running time (seconds, log scale, 10^3 to 10^5) versus dimensionality (5 to 30) for H-zkNNJ and H-BRJ.]
[Figure: data shuffled (GB, 0 to 25) versus dimensionality (5 to 30) for H-zkNNJ and H-BRJ.]
Conclusions
We study efficient methods to perform kNN joins in MapReduce.
Exact (H-BRJ) and approximate (H-zkNNJ) algorithms are proposed.
H-zkNNJ performs orders of magnitude better than the other methods, with excellent approximation quality.
We plan to investigate kNN joins in very high dimensions in the future.
The End
Thank You
Q and A
Approximate kNN join: Z-order kNN join
zkNN algorithm
Algorithm 1: zkNN($q$, $P$, $k$, $\alpha$)
generate $\{v_2, \ldots, v_\alpha\}$, with $v_1 = \vec{0}$ and each $v_i$ a random vector in $\mathbb{R}^d$;
$P_i = P + v_i$ ($i \in [1, \alpha]$; $\forall p \in P$, insert $p + v_i$ in $P_i$);
for $i = 1, \ldots, \alpha$ do
    let $q_i = q + v_i$, $C_i(q) = \emptyset$, and $z_{q_i}$ be $q_i$'s z-value;
    insert $z^-(z_{q_i}, k, P_i)$ into $C_i(q)$;
    insert $z^+(z_{q_i}, k, P_i)$ into $C_i(q)$;
    for any $p \in C_i(q)$, update $p = p - v_i$;
$C(q) = \bigcup_{i=1}^{\alpha} C_i(q) = C_1(q) \cup \cdots \cup C_\alpha(q)$;
return knn($q$, $C(q)$).
Experiments: Approximation quality
H-zkNNJ: Hadoop z-value kNN Join
[Figure: approximation ratio (y-axis, 1 to 1.8) versus |R|×|S| (4x4 to 16x16, in units of 10^7×10^7), OpenStreet.]
[Figure: recall (precision) (y-axis, 0.6 to 1) versus |R|×|S| (40x40 to 160x160, in units of 10^6×10^6), OpenStreet.]
Experiments: Effect of ε
[Figure: running time (seconds, log scale, 10^2 to 10^5) versus ε (0.6 to 100, ×10^-3) for H-zkNNJ and H-BRJ.]
[Figure: standard deviation (log scale, 10^2 to 10^3) versus ε (0.6 to 100, ×10^-3) for R blocks and S blocks.]
Experiments: Evaluation of H-BNLJ
[Figure: running time (seconds, log scale, 10^1 to 10^6) versus number of reducers (5 to 25) for H-zkNNJ, H-BRJ, and H-BNLJ.]
[Figure: running time (seconds, log scale, 10^0 to 10^4) per phase (Phase1, Phase2, Phase3) for H-BNLJ, H-BRJ, and H-zkNNJ.]
Experiments: Speedup
[Figure: running time (seconds ×10^3, 0 to 25) vs. # reducers (5, 10, 15, 20, 25). Series: H-zkNNJ, H-BRJ.]
![Page 92: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi](https://reader034.fdocuments.in/reader034/viewer/2022042220/5ec64a46cb2df43c8d695654/html5/thumbnails/92.jpg)
Experiments: Speedup
[Figure: speedup (0 to 8) vs. # reducers (5, 10, 15, 20, 25). Series: H-zkNNJ, H-BRJ.]
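A minimal sketch of how a speedup curve can be derived from running times per reducer count. This is an assumption, not the authors' stated definition: speedup is taken as the running time at the smallest reducer count divided by the running time at each reducer count. The function name `speedup` and the timing values are hypothetical and are not read from the plots.

```python
def speedup(times_by_reducers, baseline=None):
    """Return {reducers: T_baseline / T_reducers}.

    times_by_reducers maps a reducer count to a running time in seconds.
    baseline is the reducer count used as the reference; by default
    (an assumption) it is the smallest reducer count present.
    """
    if baseline is None:
        baseline = min(times_by_reducers)
    t_base = times_by_reducers[baseline]
    return {n: t_base / t for n, t in sorted(times_by_reducers.items())}


# Hypothetical running times in seconds, NOT taken from the slides.
example_times = {5: 20000.0, 10: 10000.0, 20: 5000.0}
example_speedup = speedup(example_times)
```

With these made-up timings, halving the running time each time the reducer count doubles gives a speedup that doubles as well, which is the ideal linear case; measured curves usually flatten as fixed costs start to dominate.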
![Page 93: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi](https://reader034.fdocuments.in/reader034/viewer/2022042220/5ec64a46cb2df43c8d695654/html5/thumbnails/93.jpg)
Experiments: Running time and communication cost
H-zkNNJ: Hadoop z-value kNN Join
H-BRJ: Hadoop Block R-tree Join
[Figure: H-zkNNJ running time (seconds ×10^3, 0 to 5) vs. |R|×|S| (10^7×10^7; 4x4, 6x6, 8x8, 12x12, 16x16), broken down by zPhase1, zPhase2, zPhase3.]
![Page 94: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi](https://reader034.fdocuments.in/reader034/viewer/2022042220/5ec64a46cb2df43c8d695654/html5/thumbnails/94.jpg)
Experiments: Running time and communication cost
H-zkNNJ: Hadoop z-value kNN Join
H-BRJ: Hadoop Block R-tree Join
[Figure: H-BRJ running time (seconds ×10^3, 0 to 50) vs. |R|×|S| (10^7×10^7; 4x4, 6x6, 8x8, 12x12, 16x16), broken down by RPhase1, RPhase2.]
![Page 95: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi](https://reader034.fdocuments.in/reader034/viewer/2022042220/5ec64a46cb2df43c8d695654/html5/thumbnails/95.jpg)
Experiments: Running time and communication cost
H-zkNNJ: Hadoop z-value kNN Join
H-BRJ: Hadoop Block R-tree Join
[Figure: running time (seconds ×10^3, 0 to 50) vs. |R|×|S| (10^7×10^7; 4x4, 6x6, 8x8, 12x12, 16x16). Series: H-zkNNJ, H-BRJ.]
![Page 96: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi](https://reader034.fdocuments.in/reader034/viewer/2022042220/5ec64a46cb2df43c8d695654/html5/thumbnails/96.jpg)
Experiments: Running time and communication cost
H-zkNNJ: Hadoop z-value kNN Join
H-BRJ: Hadoop Block R-tree Join
[Figure: data shuffled (GB, 0 to 100) vs. |R|×|S| (10^7×10^7; 4x4, 6x6, 8x8, 12x12, 16x16). Series: H-zkNNJ, H-BRJ.]
![Page 97: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi](https://reader034.fdocuments.in/reader034/viewer/2022042220/5ec64a46cb2df43c8d695654/html5/thumbnails/97.jpg)
Experiments: Effect of d
H-zkNNJ: Hadoop z-value kNN Join
H-BRJ: Hadoop Block R-tree Join
[Figure: running time (seconds, log scale, 10^3 to 10^5) vs. dimensionality (5, 10, 15, 20, 25, 30). Series: H-zkNNJ, H-BRJ.]
![Page 98: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi](https://reader034.fdocuments.in/reader034/viewer/2022042220/5ec64a46cb2df43c8d695654/html5/thumbnails/98.jpg)
Experiments: Effect of d
H-zkNNJ: Hadoop z-value kNN Join
H-BRJ: Hadoop Block R-tree Join
[Figure: data shuffled (GB, 0 to 25) vs. dimensionality (5, 10, 15, 20, 25, 30). Series: H-zkNNJ, H-BRJ.]
![Page 99: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi](https://reader034.fdocuments.in/reader034/viewer/2022042220/5ec64a46cb2df43c8d695654/html5/thumbnails/99.jpg)
Experiments: Effect of d
H-zkNNJ: Hadoop z-value kNN Join
H-BRJ: Hadoop Block R-tree Join
[Figure: approximation ratio (1 to 1.6) vs. dimensionality (5, 10, 15, 20, 25, 30). Series: R-Cluster.]
![Page 100: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi](https://reader034.fdocuments.in/reader034/viewer/2022042220/5ec64a46cb2df43c8d695654/html5/thumbnails/100.jpg)
Experiments: Effect of d
H-zkNNJ: Hadoop z-value kNN Join
H-BRJ: Hadoop Block R-tree Join
[Figure: recall (precision) (0 to 1) vs. dimensionality (5, 10, 15, 20, 25, 30). Series: R-Cluster.]
![Page 101: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi](https://reader034.fdocuments.in/reader034/viewer/2022042220/5ec64a46cb2df43c8d695654/html5/thumbnails/101.jpg)
Experiments: Effect of k
[Figure: H-zkNNJ running time (seconds ×10^3, 0 to 4) vs. k (10, 20, 40, 60, 80), broken down by zPhase1, zPhase2, zPhase3.]
![Page 102: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi](https://reader034.fdocuments.in/reader034/viewer/2022042220/5ec64a46cb2df43c8d695654/html5/thumbnails/102.jpg)
Experiments: Effect of k
[Figure: H-BRJ running time (seconds ×10^3, 0 to 16) vs. k (10, 20, 40, 60, 80), broken down by RPhase1, RPhase2.]
![Page 103: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi](https://reader034.fdocuments.in/reader034/viewer/2022042220/5ec64a46cb2df43c8d695654/html5/thumbnails/103.jpg)
Experiments: Effect of k
[Figure: running time (seconds ×10^3, 0 to 40) vs. k (3, 10, 20, 40, 60, 80). Series: H-zkNNJ, H-BRJ.]
![Page 104: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi](https://reader034.fdocuments.in/reader034/viewer/2022042220/5ec64a46cb2df43c8d695654/html5/thumbnails/104.jpg)
Experiments: Effect of k
[Figure: data shuffled (GB, 0 to 200) vs. k (3, 10, 20, 40, 60, 80). Series: H-zkNNJ, H-BRJ.]
![Page 105: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi](https://reader034.fdocuments.in/reader034/viewer/2022042220/5ec64a46cb2df43c8d695654/html5/thumbnails/105.jpg)
Experiments: Effect of number of shifts α
[Figure: running time (seconds, log scale, 10^3 to 10^5) vs. number of shifts α (2, 3, 4, 5, 6). Series: H-zkNNJ, H-BRJ.]
![Page 106: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi](https://reader034.fdocuments.in/reader034/viewer/2022042220/5ec64a46cb2df43c8d695654/html5/thumbnails/106.jpg)
Experiments: Effect of number of shifts α
[Figure: data shuffled (GB, 0 to 15) vs. number of shifts α (2, 3, 4, 5, 6). Series: H-zkNNJ, H-BRJ.]
![Page 107: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi](https://reader034.fdocuments.in/reader034/viewer/2022042220/5ec64a46cb2df43c8d695654/html5/thumbnails/107.jpg)
Experiments: Effect of number of shifts α
[Figure: approximation ratio (1 to 1.6) vs. number of shifts α (2, 3, 4, 5, 6). Series: R-Cluster.]
![Page 108: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi](https://reader034.fdocuments.in/reader034/viewer/2022042220/5ec64a46cb2df43c8d695654/html5/thumbnails/108.jpg)
Experiments: Effect of number of shifts α
[Figure: recall (precision) (0 to 1) vs. number of shifts α (2, 3, 4, 5, 6). Series: R-Cluster.]