Three-way join in one round on Hadoop


Transcript of Three-way join in one round on Hadoop

Page 1: Three-way join in one round on Hadoop

Three-way join in one round on Hadoop
COMP 6231

GROUP 7

IRAJ HEDAYATISOMARIN, ZAKARIA NASERELDINE, JINYANG DU

Page 2: Three-way join in one round on Hadoop

Problem statement

In this part of the second project, we aimed to compute a three-way join in a single round of MapReduce.

R join S join T

Page 3: Three-way join in one round on Hadoop

Algorithm Overview

First relation: R(a, b)

Second relation: S(b, c)

Third relation: T(c, d)

The mapper hashes each tuple with h(b) = x and h(c) = y and emits

<KEY, VALUE> = <(X, Y), (relation_name, tuple)>

where the value is (R, (a, b)), (S, (b, c)) or (T, (c, d)), and the key (X, Y) is the coordinate of a reducer in an imagined matrix of reducers (in the slide, a 4 x 4 grid numbered 0 to 15). Each reducer then performs an in-memory join on the tuples it receives.
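As a worked illustration of this routing (the hash values here are made up, not from the slides): suppose the grid is 4 x 4, so there are k = 16 reducers and √k = 4, and consider tuples with h(b) = 3 and h(c) = 2.

• An S tuple (b, c) is sent to the single reducer at coordinate (3, 2).

• An R tuple (a, b) is replicated across the whole row: (3, 1), (3, 2), (3, 3), (3, 4).

• A T tuple (c, d) is replicated down the whole column: (1, 2), (2, 2), (3, 2), (4, 2).

Every combination of R, S and T tuples that can join therefore meets at exactly one reducer, here (3, 2).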

Page 4: Three-way join in one round on Hadoop

Mapping and Hashing

<KEY, VALUE> = <(X, Y), (relation_name, tuple)>

The value is exactly the same as the input tuple; the relation name is fetched from the input file name.

With h(b) = x and h(c) = y, the keys are built as follows:

• Second relation, S(b, c): the tuple goes to the single key (h(b), h(c)).

• First relation, R(a, b): the tuple is replicated to the keys (h(b), 1), (h(b), 2), ..., (h(b), 11).

• Third relation, T(c, d): the tuple is replicated to the keys (1, h(c)), (2, h(c)), ..., (11, h(c)).

The reducer that owns a key is

Reducer # = (x - 1) × √(# of reducers) + y
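A minimal sketch of such a mapper in Hadoop MapReduce (Java). The class name, the modulo hash, the 11 x 11 grid and the comma-separated files named R, S and T are assumptions for illustration, not the group's actual code:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Illustrative mapper: emits <(x, y), (relation_name, tuple)> pairs for the three-way join.
public class ThreeWayJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final int SIDE = 11;            // sqrt(# of reducers), i.e. an 11 x 11 grid

    // Hash a join attribute into 1..SIDE (simple illustrative hash).
    private int h(String attribute) {
        return Math.floorMod(attribute.hashCode(), SIDE) + 1;
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The relation name is fetched from the input file name, as on the slide.
        String relation = ((FileSplit) context.getInputSplit()).getPath().getName();
        String[] fields = line.toString().split(",");

        if (relation.startsWith("S")) {            // S(b, c): exactly one reducer
            context.write(new Text(h(fields[0]) + "," + h(fields[1])),
                          new Text("S," + line));
        } else if (relation.startsWith("R")) {     // R(a, b): replicate along the second coordinate
            int x = h(fields[1]);
            for (int y = 1; y <= SIDE; y++) {
                context.write(new Text(x + "," + y), new Text("R," + line));
            }
        } else {                                   // T(c, d): replicate along the first coordinate
            int y = h(fields[0]);
            for (int x = 1; x <= SIDE; x++) {
                context.write(new Text(x + "," + y), new Text("T," + line));
            }
        }
    }
}

A custom Partitioner would then send the key (x, y) to reducer number (x - 1) × 11 + y, matching the formula above.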

Page 5: Three-way join in one round on Hadoop

In-memory join algorithm

NESTED LOOP JOIN

For each tuple in R
  For each tuple in S
    If R.b == S.b then
      For each tuple in T
        If S.c == T.c then
          Print (R.a, S.b, S.c, T.d)
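A reducer-side sketch of this nested loop join in Java, assuming the hypothetical mapper output format above, where each value looks like "R,a,b", "S,b,c" or "T,c,d"; again an illustration rather than the group's implementation:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reducer: buffers the R, S and T tuples received for one key (x, y)
// and joins them with the nested loop algorithm from the slide.
public class ThreeWayJoinReducer extends Reducer<Text, Text, Text, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String[]> r = new ArrayList<>();
        List<String[]> s = new ArrayList<>();
        List<String[]> t = new ArrayList<>();
        for (Text v : values) {
            String[] parts = v.toString().split(",");        // [relation, attr1, attr2]
            String[] tuple = { parts[1], parts[2] };
            if (parts[0].equals("R")) r.add(tuple);
            else if (parts[0].equals("S")) s.add(tuple);
            else t.add(tuple);
        }
        for (String[] rt : r)                                 // R.b == S.b
            for (String[] st : s)
                if (rt[1].equals(st[0]))
                    for (String[] tt : t)                     // S.c == T.c
                        if (st[1].equals(tt[0]))
                            context.write(
                                new Text(rt[0] + "," + st[0] + "," + st[1] + "," + tt[1]),
                                NullWritable.get());          // prints (R.a, S.b, S.c, T.d)
    }
}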

SORT-BASED JOIN ALGORITHM

1. Divide the input list into three sorted lists using the binary search algorithm

2. Execute the in-memory join algorithm:

• WHILE R and S are not empty DO
  • IF the first items of both lists are equal THEN
    • make sure all the tuples with that value have been joined together and remove them from both lists
  • ELSE
    • take the list with the smaller front item and remove items until an item equal to or greater than the front item of the other list is reached

Complexity: nested loop join O(n³); sort-based algorithm (1. divide list, 2. in-memory join) O(n log n)
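A compact Java sketch of the merge step described above, assuming the tuples are kept as String[] arrays and that R and S are already sorted on b; the same routine would be applied a second time to join the (a, b, c) results with T on c. Illustrative only:

import java.util.ArrayList;
import java.util.List;

// Illustrative merge of two lists sorted on the join attribute b:
// r holds (a, b) tuples sorted on b, s holds (b, c) tuples sorted on b.
class SortMergeJoin {
    static List<String[]> mergeJoin(List<String[]> r, List<String[]> s) {
        List<String[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < r.size() && j < s.size()) {                 // WHILE R and S are not empty
            int cmp = r.get(i)[1].compareTo(s.get(j)[0]);      // compare R.b with S.b
            if (cmp == 0) {
                String b = r.get(i)[1];
                int i2 = i, j2 = j;                            // find all tuples sharing this b
                while (i2 < r.size() && r.get(i2)[1].equals(b)) i2++;
                while (j2 < s.size() && s.get(j2)[0].equals(b)) j2++;
                for (int x = i; x < i2; x++)                   // join them together ...
                    for (int y = j; y < j2; y++)
                        out.add(new String[] { r.get(x)[0], b, s.get(y)[1] });   // (a, b, c)
                i = i2;                                        // ... and remove them from both lists
                j = j2;
            } else if (cmp < 0) {
                i++;                                           // advance the list with the smaller front item
            } else {
                j++;
            }
        }
        return out;
    }
}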

Page 6: Three-way join in one round on Hadoop

Number of reducers

We decided to use a square matrix of reducers. This choice puts a constraint on the number of reducers: in our case 128 reducers were available, but we actually use only 121 of them (an 11 x 11 grid).

On the other hand, if we selected a different number of reducers in each dimension, we would get extra data replication and inefficiency, as the next two examples show.
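The usable count is simply the largest perfect square not exceeding the number of available reducers; a tiny illustrative check in Java:

int available = 128;
int side = (int) Math.floor(Math.sqrt(available));   // 11
int used = side * side;                              // 121 reducers actually used, 7 left idle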

Page 7: Three-way join in one round on Hadoop

Number of reducers (example 1, replication problem)

[Figure: a non-square reducer matrix, 16 reducers wide]

# of reducers = 128
Assumption: R >> T, and both are uniformly distributed
T(R) = 1,000,000
T(T) = 1,000

For the square (11 x 11) matrix: replicated data = 1,000,000 × 11 + 1,000 × 11 = 11,011,000

For the matrix above: replicated data = 1,000,000 × 16 + 1,000 × 16 = 16,016,000
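Both numbers come from the same formula, replicated data = T(R) × r_R + T(T) × r_T, where r_R and r_T are the number of reducers each R and T tuple is copied to; a quick check of the slide's figures (illustrative):

long tR = 1_000_000, tT = 1_000;
long squareMatrix = tR * 11 + tT * 11;   // 11,011,000 for the 11 x 11 grid
long wideMatrix   = tR * 16 + tT * 16;   // 16,016,000 for the 16-wide grid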

Page 8: Three-way join in one round on Hadoop

Number of reducers (example 2, inefficiency problem)

[Figure: the same non-square reducer matrix, with some reducers marked FULL and others IDLE]

# of reducers = 128
Assumption: T >> R, and T is not uniformly distributed
T(R) = 1,000
T(T) = 1,000,000

When the hash range in one dimension is reduced, it is more likely that two values hash to the same location, so some reducers end up full while others stay idle.

Page 9: Three-way join in one round on Hadoop

Experimental results: 37 seconds

Page 10: Three-way join in one round on Hadoop

Any questions?