Three-way join in one round on Hadoop


Transcript of Three-way join in one round on Hadoop

Page 1: Three-way join in one round on Hadoop

Three-way join in one round on Hadoop
COMP 6231

GROUP 7

IRAJ HEDAYATISOMARIN, ZAKARIA NASERELDINE, JINYANG DU

Page 2: Three-way join in one round on Hadoop

Problem statement

In this part of the second project, we aimed to compute a three-way join in a single round of MapReduce.

R join S join T

Page 3: Three-way join in one round on Hadoop

Algorithm Overview

First relation: R(a, b)

Second relation: S(b, c)

Third relation: T(c, d)

The mapper hashes each tuple with h(b) = x and h(c) = y and emits

<KEY, VALUE> = <(X, Y), (relation_name, tuple)>

where the value is (R, (a, b)), (S, (b, c)) or (T, (c, d)), and the key (X, Y) is the coordinate of a reducer in an imagined matrix of reducers (in the slide, a 4 x 4 grid numbered 0 to 15). Each reducer then performs an in-memory join on the tuples it receives.
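As a worked illustration of this routing (the hash values here are made up, not from the slides): suppose the grid is 4 x 4, so there are k = 16 reducers and √k = 4, and consider tuples with h(b) = 3 and h(c) = 2.

• An S tuple (b, c) is sent to the single reducer at coordinate (3, 2).

• An R tuple (a, b) is replicated across the whole row: (3, 1), (3, 2), (3, 3), (3, 4).

• A T tuple (c, d) is replicated down the whole column: (1, 2), (2, 2), (3, 2), (4, 2).

Every combination of R, S and T tuples that can join therefore meets at exactly one reducer, here (3, 2).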

Page 4: Three-way join in one round on Hadoop

Mapping and Hashing

<KEY, VALUE> = <(X, Y), (relation_name, tuple)>

The value is exactly the same as the input tuple; the relation name is fetched from the input file name.

With h(b) = x and h(c) = y, the keys are built as follows:

• Second relation, S(b, c): the tuple goes to the single key (h(b), h(c)).

• First relation, R(a, b): the tuple is replicated to the keys (h(b), 1), (h(b), 2), ..., (h(b), 11).

• Third relation, T(c, d): the tuple is replicated to the keys (1, h(c)), (2, h(c)), ..., (11, h(c)).

The reducer that owns a key is

Reducer # = (x - 1) × √(# of reducers) + y
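A minimal sketch of such a mapper in Hadoop MapReduce (Java). The class name, the modulo hash, the 11 x 11 grid and the comma-separated files named R, S and T are assumptions for illustration, not the group's actual code:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Illustrative mapper: emits <(x, y), (relation_name, tuple)> pairs for the three-way join.
public class ThreeWayJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final int SIDE = 11;            // sqrt(# of reducers), i.e. an 11 x 11 grid

    // Hash a join attribute into 1..SIDE (simple illustrative hash).
    private int h(String attribute) {
        return Math.floorMod(attribute.hashCode(), SIDE) + 1;
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The relation name is fetched from the input file name, as on the slide.
        String relation = ((FileSplit) context.getInputSplit()).getPath().getName();
        String[] fields = line.toString().split(",");

        if (relation.startsWith("S")) {            // S(b, c): exactly one reducer
            context.write(new Text(h(fields[0]) + "," + h(fields[1])),
                          new Text("S," + line));
        } else if (relation.startsWith("R")) {     // R(a, b): replicate along the second coordinate
            int x = h(fields[1]);
            for (int y = 1; y <= SIDE; y++) {
                context.write(new Text(x + "," + y), new Text("R," + line));
            }
        } else {                                   // T(c, d): replicate along the first coordinate
            int y = h(fields[0]);
            for (int x = 1; x <= SIDE; x++) {
                context.write(new Text(x + "," + y), new Text("T," + line));
            }
        }
    }
}

A custom Partitioner would then send the key (x, y) to reducer number (x - 1) × 11 + y, matching the formula above.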

Page 5: Three-way join in one round on Hadoop

In-memory join algorithm

NESTED LOOP JOIN

For each tuple in R
  For each tuple in S
    If R.b == S.b then
      For each tuple in T
        If S.c == T.c then
          Print (R.a, S.b, S.c, T.d)
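A reducer-side sketch of this nested loop join in Java, assuming the hypothetical mapper output format above, where each value looks like "R,a,b", "S,b,c" or "T,c,d"; again an illustration rather than the group's implementation:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reducer: buffers the R, S and T tuples received for one key (x, y)
// and joins them with the nested loop algorithm from the slide.
public class ThreeWayJoinReducer extends Reducer<Text, Text, Text, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String[]> r = new ArrayList<>();
        List<String[]> s = new ArrayList<>();
        List<String[]> t = new ArrayList<>();
        for (Text v : values) {
            String[] parts = v.toString().split(",");        // [relation, attr1, attr2]
            String[] tuple = { parts[1], parts[2] };
            if (parts[0].equals("R")) r.add(tuple);
            else if (parts[0].equals("S")) s.add(tuple);
            else t.add(tuple);
        }
        for (String[] rt : r)                                 // R.b == S.b
            for (String[] st : s)
                if (rt[1].equals(st[0]))
                    for (String[] tt : t)                     // S.c == T.c
                        if (st[1].equals(tt[0]))
                            context.write(
                                new Text(rt[0] + "," + st[0] + "," + st[1] + "," + tt[1]),
                                NullWritable.get());          // prints (R.a, S.b, S.c, T.d)
    }
}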

SORT-BASED JOIN ALGORITHM

1. Divide the input list into three sorted lists using the binary search algorithm

2. Execute the in-memory join algorithm:

• WHILE R and S are not empty DO
  • IF the first items of both lists are equal THEN
    • make sure all the tuples with that value have been joined together and remove them from both lists
  • ELSE
    • take the list with the smaller front item and remove items until an item equal to or greater than the front item of the other list is reached

Complexity: nested loop join O(n³); sort-based algorithm (1. divide list, 2. in-memory join) O(n log n)
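A compact Java sketch of the merge step described above, assuming the tuples are kept as String[] arrays and that R and S are already sorted on b; the same routine would be applied a second time to join the (a, b, c) results with T on c. Illustrative only:

import java.util.ArrayList;
import java.util.List;

// Illustrative merge of two lists sorted on the join attribute b:
// r holds (a, b) tuples sorted on b, s holds (b, c) tuples sorted on b.
class SortMergeJoin {
    static List<String[]> mergeJoin(List<String[]> r, List<String[]> s) {
        List<String[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < r.size() && j < s.size()) {                 // WHILE R and S are not empty
            int cmp = r.get(i)[1].compareTo(s.get(j)[0]);      // compare R.b with S.b
            if (cmp == 0) {
                String b = r.get(i)[1];
                int i2 = i, j2 = j;                            // find all tuples sharing this b
                while (i2 < r.size() && r.get(i2)[1].equals(b)) i2++;
                while (j2 < s.size() && s.get(j2)[0].equals(b)) j2++;
                for (int x = i; x < i2; x++)                   // join them together ...
                    for (int y = j; y < j2; y++)
                        out.add(new String[] { r.get(x)[0], b, s.get(y)[1] });   // (a, b, c)
                i = i2;                                        // ... and remove them from both lists
                j = j2;
            } else if (cmp < 0) {
                i++;                                           // advance the list with the smaller front item
            } else {
                j++;
            }
        }
        return out;
    }
}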

Page 6: Three-way join in one round on Hadoop

Number of reducers

We decided to use a square matrix of reducers. This choice puts a constraint on the number of reducers: in our case 128 reducers were available, but we actually use only 121 of them (an 11 x 11 grid).

On the other hand, if we selected a different number of reducers in each dimension, we would get extra data replication and inefficiency, as the next two examples show.
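The usable count is simply the largest perfect square not exceeding the number of available reducers; a tiny illustrative check in Java:

int available = 128;
int side = (int) Math.floor(Math.sqrt(available));   // 11
int used = side * side;                              // 121 reducers actually used, 7 left idle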

Page 7: Three-way join in one round on Hadoop

Number of reducers (example 1, replication problem)

[Figure: a non-square reducer matrix, 16 reducers wide]

# of reducers = 128
Assumption: R >> T, and both are uniformly distributed
T(R) = 1,000,000
T(T) = 1,000

For the square (11 x 11) matrix: replicated data = 1,000,000 × 11 + 1,000 × 11 = 11,011,000

For the matrix above: replicated data = 1,000,000 × 16 + 1,000 × 16 = 16,016,000
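Both numbers come from the same formula, replicated data = T(R) × r_R + T(T) × r_T, where r_R and r_T are the number of reducers each R and T tuple is copied to; a quick check of the slide's figures (illustrative):

long tR = 1_000_000, tT = 1_000;
long squareMatrix = tR * 11 + tT * 11;   // 11,011,000 for the 11 x 11 grid
long wideMatrix   = tR * 16 + tT * 16;   // 16,016,000 for the 16-wide grid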

Page 8: Three-way join in one round on Hadoop

Number of reducers (example 2, inefficiency problem)

[Figure: the same non-square reducer matrix, with some reducers marked FULL and others IDLE]

# of reducers = 128
Assumption: T >> R, and T is not uniformly distributed
T(R) = 1,000
T(T) = 1,000,000

When the hash range in one dimension is reduced, it is more likely that two values hash to the same location, so some reducers end up full while others stay idle.

Page 9: Three-way join in one round on Hadoop

Experimental results: 37 seconds

Page 10: Three-way join in one round on Hadoop

Any questions?