A BigBench Implementation in the Hadoop Ecosystem

17
MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG

Transcript of A BigBench Implementation in the Hadoop Ecosystem

MIDDLEWARE SYSTEMSRESEARCH GROUPMSRG.ORG

MIDDLEWARE SYSTEMSRESEARCH GROUPMSRG.ORG

10/10/2013(c) Tilmann Rabl - msrg.org 2

MIDDLEWARE SYSTEMSRESEARCH GROUPMSRG.ORG

10/10/2013(c) Tilmann Rabl - msrg.org 3

MIDDLEWARE SYSTEMSRESEARCH GROUPMSRG.ORG

(c) Tilmann Rabl - msrg.org 47/19/2013

Unstructured Data

Semi-Structured Data

Structured Data

Sales

Customer

ItemMarketprice

Web Page

Web Log

Reviews

AdaptedTPC-DS

BigBenchSpecific

MIDDLEWARE SYSTEMSRESEARCH GROUPMSRG.ORG

7/19/2013(c) Tilmann Rabl - msrg.org 5

Online Data Generation

Offline Preprocessing

Cat

egor

izat

ion

Real Reviews

Tok

eniz

atio

n

Gen

eral

izat

ion

Markov Chain Input

Generated Reviews

Pro

duct

Cus

tom

izat

ion

Text

Gen

erat

ion

Par

amet

er

Gen

erat

ion

(PD

GF

)

MIDDLEWARE SYSTEMSRESEARCH GROUPMSRG.ORG

(c) Tilmann Rabl - msrg.org 67/19/2013

MIDDLEWARE SYSTEMSRESEARCH GROUPMSRG.ORG

7/19/2013(c) Tilmann Rabl - msrg.org 7

MIDDLEWARE SYSTEMSRESEARCH GROUPMSRG.ORG

7/19/2013(c) Tilmann Rabl - msrg.org 8

Data Sources Number of Queries Percentage

Structured 18 60%

Semi-structured 7 23%

Un-structured 5 17%

Analytic techniques Number of Queries Percentage

Statistics analysis 6 20%

Data mining 17 57%

Reporting 8 27%

MIDDLEWARE SYSTEMSRESEARCH GROUPMSRG.ORG

7/19/2013(c) Tilmann Rabl - msrg.org 9

MIDDLEWARE SYSTEMSRESEARCH GROUPMSRG.ORG

7/19/2013(c) Tilmann Rabl - msrg.org 10

MIDDLEWARE SYSTEMSRESEARCH GROUPMSRG.ORG

10/10/2013(c) Tilmann Rabl - msrg.org 11

MIDDLEWARE SYSTEMSRESEARCH GROUPMSRG.ORG

10/10/2013(c) Tilmann Rabl - msrg.org 12

Query Types Number of Queries Percentage

Pure Hive 14 47%

Mahout 5 17%

OpenNLP 4 13%

Custom MR 7 23%

MIDDLEWARE SYSTEMSRESEARCH GROUPMSRG.ORG

10/10/2013(c) Tilmann Rabl - msrg.org 13

ADD FILE q1_mapper.py;ADD FILE q1_reducer.py;

--Find the most frequent onesSELECT pid1, pid2, COUNT (*) AS cntFROM (

--Make items basketFROM (

-- Joining two tablesFROM (

SELECT s.ss_ticket_number AS oid , s.ss_item_sk AS pidFROM store_sales sINNER JOIN item i ON s.ss_item_sk = i.i_item_skWHERE i.i_category_id in (1 ,4 ,6) and s.ss_store_sk in (10 , 20, 33, 40, 50)

) temp_joinMAP temp_join.oid, temp_join.pidUSING 'python q1_mapper.py'AS oid, pidCLUSTER BY oid

) map_outputREDUCE map_output.oid, map_output.pidUSING 'python q1_reducer.py'AS (pid1 BIGINT, pid2 BIGINT)

) temp_basketGROUP BY pid1, pid2HAVING COUNT (pid1) > 49ORDER BY pid1 ,cnt ,pid2;

Mapperimport sys

if __name__ == "__main__":

for line in sys.stdin:key, val = line.strip().split("\t")print "%s\t%s" % (key, val)

Reducerimport sys

def print_permutations(vals):l = len(vals)if l <= 1 or l*(l-1)/2 > 500:

returnvals.sort()for i in range(l-1):

for j in range(i+1,l):print "%s\t%s" % (vals[i], vals[j])

if __name__ == "__main__":current_key = ''vals = []for line in sys.stdin:

key, val = line.strip().split("\t")if current_key == '' :

current_key = keyvals.append(val)

elif current_key == key:vals.append(val)elif current_key != key:

print_permutations(vals)vals = []current_key = keyvals.append(val)

print_permutations(vals)

MIDDLEWARE SYSTEMSRESEARCH GROUPMSRG.ORG

10/10/2013(c) Tilmann Rabl - msrg.org 14

MIDDLEWARE SYSTEMSRESEARCH GROUPMSRG.ORG

10/10/2013(c) Tilmann Rabl - msrg.org 15

MIDDLEWARE SYSTEMSRESEARCH GROUPMSRG.ORG

10/10/2013(c) Tilmann Rabl - msrg.org 16

MIDDLEWARE SYSTEMSRESEARCH GROUPMSRG.ORG

(c) Tilmann Rabl - msrg.org 177/19/2013