MapReduce and Hadoop

MapReduce and Hadoop Cadenelli Nicola Datenbanken Implementierungstechniken


An introduction to MapReduce (MR) and Hadoop, and a look at the opportunities to use MR with databases, i.e., SQL-MapReduce by Teradata and In-Database MapReduce by Oracle. The presentation was used in a Datenbanken Implementierungstechniken class in 2013.

Transcript of MapReduce and Hadoop

Page 1: MapReduce and Hadoop

MapReduce and Hadoop

Cadenelli Nicola

Datenbanken Implementierungstechniken

Page 2: MapReduce and Hadoop

Outline

Introduction
● History
● Motivations

MapReduce
● What MapReduce is
● Why it is useful
● Execution Details
● Some Examples
● Conclusions

Hadoop
● Introduction
● Hadoop Architecture
● Hadoop Ecosystem
● In the real world

MapReduce & Databases
● SQL-MapReduce
● In-Database Map-Reduce
● Conclusions

Introduction MapReduce Hadoop MR&Databases ● ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○

Page 3: MapReduce and Hadoop

[Diagram labels: GFS, MapReduce, BigTable (Google); HDFS, MapReduce (Hadoop)]


Page 4: MapReduce and Hadoop

A Brief History

2004: Google publishes the papers.

2006: Apache releases Hadoop, the first open-source implementation of GFS and MapReduce.

Now: Amazon, AOL, eBay, Facebook, HP, IBM, Last.fm, LinkedIn, Microsoft, Spotify, Twitter and more are using Hadoop.


Page 5: MapReduce and Hadoop

Motivations

● Data sets are becoming really big: more than 10 TB. E.g., the Large Synoptic Survey Telescope (30 TB per night).

● The best idea is to scale out (not scale up) the system, but . . .
- How do we scale to 1000+ machines?
- How do we handle machine failures?
- How can we facilitate communication between nodes?
- If we change system, do we lose all our optimisation work?

● Google needed to recreate the index of the web.


Page 6: MapReduce and Hadoop

What is MapReduce?

“MapReduce is a programming model and an associated implementation for processing and generating large data sets.” – Google, Inc. MapReduce paper, 2004.

It is a really simple API with just two serial functions, map() and reduce(), and it is language independent (Java, Python, Perl …).


Page 7: MapReduce and Hadoop

Why is MapReduce useful?

MapReduce hides messy details in the runtime library:
● Parallelization and distribution
● Load balancing
● Network and disk transfer optimization
● Handling of machine failures
● Fault tolerance
● Monitoring & status updates

All users benefit from improvements to the core library.


Page 8: MapReduce and Hadoop

Typical problem solved by MapReduce

1. Read a lot of data
2. Map: extract something we care about from each record
3. Shuffle and Sort
4. Reduce: aggregate, summarize, filter, or transform
5. Write the results

Seen from outside, the outline is always the same (read, process, write); only map and reduce change to fit the problem.
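The five steps can be sketched on a single machine in a few lines of Python. This is only an illustration of the fixed outline, not how Hadoop or Google's runtime is implemented; the names run_job, year_and_temp and max_temp, and the sample readings, are made up for this example.

```python
from itertools import groupby
from operator import itemgetter


def run_job(records, mapper, reducer):
    """Toy, in-memory version of the five-step outline (hypothetical helper)."""
    # 1. Read: `records` is any iterable of input records.
    # 2. Map: each record yields zero or more (key, value) pairs.
    intermediate = [pair for record in records for pair in mapper(record)]
    # 3. Shuffle and sort: bring equal keys next to each other.
    intermediate.sort(key=itemgetter(0))
    grouped = groupby(intermediate, key=itemgetter(0))
    # 4. Reduce: one call per key, with all values for that key.
    results = [reducer(key, [v for _, v in pairs]) for key, pairs in grouped]
    # 5. Write: here the results are simply returned.
    return results


# Problem-specific part: highest temperature per year (invented data).
def year_and_temp(line):
    year, temp = line.split(",")
    yield year, int(temp)


def max_temp(year, temps):
    return year, max(temps)


readings = ["2012,31", "2013,28", "2012,35", "2013,30"]
print(run_job(readings, year_and_temp, max_temp))  # [('2012', 35), ('2013', 30)]
```

Only year_and_temp and max_temp are specific to the problem; run_job stays the same for any other job.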


Page 9: MapReduce and Hadoop

Some Execution Details

● A single master controls job execution on multiple slaves.
● Mappers are preferentially placed on the same node or the same rack as their input block → minimizes network usage!
● Mappers save their outputs to local disk before serving them to reducers.
● If a map or reduce task crashes: re-execute it!
● The framework allows more mappers and reducers than nodes.
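The placement preference (same node, then same rack, then anywhere) can be sketched as follows. This is a simplified assumption about the idea, not the actual scheduler of Hadoop or of Google's MapReduce; pick_node, rack_of and the node names are hypothetical.

```python
def pick_node(block_replicas, free_nodes, rack_of):
    """Choose a free node for a map task, preferring data locality.

    block_replicas: set of nodes holding a copy of the input block
    free_nodes:     list of nodes that can accept a task right now
    rack_of:        dict mapping node name -> rack name
    """
    # First choice: a free node that already stores the block.
    for node in free_nodes:
        if node in block_replicas:
            return node
    # Second choice: a free node in the same rack as some replica.
    replica_racks = {rack_of[n] for n in block_replicas}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node
    # Last resort: any free node; the block travels over the network.
    return free_nodes[0] if free_nodes else None


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
print(pick_node({"n1"}, ["n3", "n1"], rack_of))  # n1 (same node)
print(pick_node({"n1"}, ["n3", "n2"], rack_of))  # n2 (same rack)
print(pick_node({"n1"}, ["n3", "n4"], rack_of))  # n3 (remote)
```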


Page 10: MapReduce and Hadoop

Execution overview

Google, Inc. MapReduce paper, 2004.


Page 11: MapReduce and Hadoop

MapReduce Programming Model

The programmer has to write two primary methods:

map (k1, v1) → list(k2, v2)
reduce (k2, list(v2)) → list(k2, v2)

● All values v2 with the same key k2 are reduced together, in order.
● The input keys and values are drawn from a different domain than the output keys and values.


Page 12: MapReduce and Hadoop

Example: Word Frequency

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Input: “documentx”, “To be or not to be”

Output: “be”, 2; “not”, 1; “or”, 1; “to”, 2
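A runnable Python rendering of the pseudocode may help. It is a single-process sketch: the run helper stands in for the framework and is invented for this example, and lowercasing the words is an assumption made so that “To” and “to” are counted together, as in the output shown on the slide.

```python
from collections import defaultdict


def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.lower().split():  # lowercasing is an assumption
        yield word, "1"


def reduce_fn(key, values):
    # key: a word, values: the counts emitted for that word (as strings)
    yield key, sum(int(v) for v in values)


def run(documents):
    """Stand-in for the framework: map, group by key, reduce (hypothetical)."""
    groups = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            groups[k].append(v)
    result = {}
    for k in sorted(groups):
        for out_key, out_value in reduce_fn(k, groups[k]):
            result[out_key] = out_value
    return result


counts = run({"documentx": "To be or not to be"})
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```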


Page 13: MapReduce and Hadoop

Map
“document1”, “To be or not to be” → “to”, 1; “be”, 1; “or”, 1; “not”, 1; “to”, 1; “be”, 1
“document2”, “text” → ...

Shuffle and Sort: aggregate values by key
“be”, 1; “be”, 1
“not”, 1
“or”, 1
“to”, 1; “to”, 1
...

Reduce
key = “be”, values = “1”, “1” → “be”, 2
key = “not”, values = “1” → “not”, 1
key = “or”, values = “1” → “or”, 1
key = “to”, values = “1”, “1” → “to”, 2
...


Page 14: MapReduce and Hadoop

Other examples

● Inverted index
- Find which documents contain a specific word.
- Map: parse document, emit <word, document-ID> pairs.
- Reduce: for each word, sort the corresponding document IDs. Emit <word, list(document-ID)>

● Reverse web-link graph
- Find where page links come from.
- Map: output <target, source> for each link to target in a page source.
- Reduce: concatenate the list of all source URLs associated with a target. Emit <target, list(source)>
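Both examples fit the same skeleton, as the following Python sketch tries to show. The map_reduce helper is a single-process stand-in for the framework, and the document IDs, texts and page names are invented for illustration.

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Toy driver: map every input, group by key, reduce each group."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)
    return {k2: reducer(k2, values) for k2, values in sorted(groups.items())}


# Inverted index: <word, document-ID> pairs -> <word, list(document-ID)>
def index_map(doc_id, text):
    for word in set(text.lower().split()):  # emit each word once per document
        yield word, doc_id


def index_reduce(word, doc_ids):
    return sorted(doc_ids)


# Reverse web-link graph: <target, source> pairs -> <target, list(source)>
def links_map(source, targets):
    for target in targets:
        yield target, source


def links_reduce(target, sources):
    return list(sources)


docs = [("d1", "to be or not to be"), ("d2", "to do")]
pages = [("a.html", ["b.html", "c.html"]), ("b.html", ["c.html"])]

index = map_reduce(docs, index_map, index_reduce)
backlinks = map_reduce(pages, links_map, links_reduce)
print(index["to"])          # ['d1', 'd2']
print(backlinks["c.html"])  # ['a.html', 'b.html']
```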


Page 15: MapReduce and Hadoop

Conclusions on MapReduce

● Proven to be a useful abstraction

● Really simplifies large-scale computations

● Fun to use:
- Focus on the problem
- Let the library deal with messy details


Page 16: MapReduce and Hadoop

[Diagram labels: GFS, MapReduce (Google); HDFS, MapReduce (Hadoop)]


Page 17: MapReduce and Hadoop

What is Hadoop?

● It is a framework for distributed processing

● It is Open Source (Apache License 2.0)

● It is a top-level Apache Project

● Written in Java

● Batch-processing centric

● Runs on commodity hardware


Page 18: MapReduce and Hadoop


Hadoop Distributed File System

● For very large files: TBs, PBs.

● Each file is partitioned into chunks of 64MB.

● Each chunk is replicated several times (>=3), on different racks, for fault tolerance.

● It is an abstract file system layered on top of a native one: the disks themselves are formatted with ext3, ext4 or XFS.
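As a rough worked example of what these numbers mean, the sketch below counts chunks and raw storage for one file, assuming the 64 MB chunk size and a replication factor of 3 from the slide. The function name hdfs_footprint is hypothetical, and the calculation assumes a partly filled last chunk only occupies its actual size on disk.

```python
CHUNK_SIZE = 64 * 1024**2  # 64 MB, as on the slide
REPLICATION = 3            # minimum replication mentioned on the slide


def hdfs_footprint(file_size_bytes, chunk_size=CHUNK_SIZE, replication=REPLICATION):
    """Return (chunks, stored chunk replicas, raw bytes) for one file."""
    # Ceiling division: a partly filled last chunk still counts as a chunk.
    chunks = (file_size_bytes + chunk_size - 1) // chunk_size
    # Assumption: raw usage is simply file size times the replication factor.
    raw_bytes = file_size_bytes * replication
    return chunks, chunks * replication, raw_bytes


one_tb = 1024**4
chunks, replicas, raw = hdfs_footprint(one_tb)
print(chunks, replicas, raw // 1024**4)  # 16384 49152 3
```

So a 1 TB file becomes 16384 chunks, stored as 49152 chunk replicas that take about 3 TB of raw disk across the cluster.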

Page 19: MapReduce and Hadoop


Hadoop Architecture

● TaskTracker is the MapReduce server (processing part)

● DataNode is the HDFS server (data part)

[Diagram: a machine running a TaskTracker and a DataNode]

Page 20: MapReduce and Hadoop

Hadoop Architecture - Master/Slave

JobTracker:
● Accepts users' jobs
● Assigns tasks to workers
● Keeps track of job status

[Diagram: one JobTracker and three machines, each running a TaskTracker and a DataNode]


Page 21: MapReduce and Hadoop

Hadoop Architecture - Master/Slave

NameNode:
● Keeps information on data location
● Decides where a file has to be written

[Diagram: one NameNode and three machines, each running a TaskTracker and a DataNode]

Data never flows through the NameNode!
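The division of labour can be illustrated with a toy model: the NameNode only answers "where are the blocks?", and the client then reads the bytes directly from DataNodes. The classes below are a deliberately simplified sketch and are not Hadoop's API; every name in them (NameNode, DataNode, read_file, blk_0, quote.txt) is invented for this example.

```python
class DataNode:
    """Stores the actual block contents."""

    def __init__(self):
        self.blocks = {}  # block id -> bytes


class NameNode:
    """Stores only metadata: which DataNodes hold each block of a file."""

    def __init__(self):
        self.block_map = {}  # file name -> [(block id, [DataNode names])]

    def locate(self, file_name):
        return self.block_map[file_name]


def read_file(file_name, namenode, datanodes):
    """Client-side read: ask the NameNode for locations, then fetch the data."""
    data = b""
    for block_id, holders in namenode.locate(file_name):
        # The bytes come straight from a DataNode, never via the NameNode.
        data += datanodes[holders[0]].blocks[block_id]
    return data


datanodes = {"dn1": DataNode(), "dn2": DataNode()}
datanodes["dn1"].blocks["blk_0"] = b"To be or "
datanodes["dn2"].blocks["blk_1"] = b"not to be"

namenode = NameNode()
namenode.block_map["quote.txt"] = [("blk_0", ["dn1"]), ("blk_1", ["dn2"])]

print(read_file("quote.txt", namenode, datanodes))  # b'To be or not to be'
```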


Page 22: MapReduce and Hadoop

Hadoop Architecture – Scalable

● Multiple machines running Hadoop form a cluster.

● What if we need more storage or compute power?

[Diagram: three machines, each running a TaskTracker and a DataNode]


Page 23: MapReduce and Hadoop

Hadoop Architecture - Overview

[Diagram labels: Client, JobTracker, NameNode, SecondaryNameNode, File, A, B, C]


Page 24: MapReduce and Hadoop

Hadoop Ecosystem – Pig & Hive

[Diagram labels: Pig, Hive, MapReduce, HDFS]


Page 25: MapReduce and Hadoop

Hadoop Ecosystem – HBase

[Diagram labels: Pig, Hive, MapReduce, HBase, HDFS]


Page 26: MapReduce and Hadoop

What is MapReduce/Hadoop used for?

@Google
● Index construction for Google Search
● Article clustering for Google News
● Statistical machine translation

@Yahoo! (4100 nodes)
● “Web map” powering Yahoo! Search
● Spam detection for Yahoo! Mail

@Facebook (>100 PB of storage)
● Data mining
● Ad optimization
● Spam detection


Page 27: MapReduce and Hadoop

but . . .

MapReduce's use of input files and its lack of schema support prevent the performance improvements enabled by features like B-trees and hash partitioning . . .

. . . and most company data is stored in databases!


Page 28: MapReduce and Hadoop

Solutions

● SQL-MapReduce by Teradata Aster

● In-Database Map-Reduce by Oracle

● Connectors that allow external Hadoop programs to access data from databases and to store Hadoop output in databases


Page 29: MapReduce and Hadoop

SQL-MapReduce

It is a framework that allows developers to write SQL-MapReduce functions in languages such as Java, C#, Python and C++ and push them into the database for advanced in-database analytics.


Page 30: MapReduce and Hadoop

SQL-MapReduce - Syntax

MR functions can be used like custom SQL operators and can implement any algorithm or transformation.

http://www.asterdata.com/resources/mapreduce.php


Page 31: MapReduce and Hadoop

Example: Word Frequency

Demo #1: Map (Tokenization) and Reduce (WordCount) in SQL/MR

SELECT key AS word, value AS wordcount
FROM WordCountReduce (
    ON Tokenize ( ON blogs )
    PARTITION BY key
)
ORDER BY wordcount DESC
LIMIT 20;


Page 32: MapReduce and Hadoop

Example: Word Frequency

Demo #2: Why do Reduce when we have SQL?

SELECT word, count(*) AS wordcount
FROM Tokenize ( ON blogs )
GROUP BY word
ORDER BY wordcount DESC
LIMIT 20;
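The idea behind Demo #2, that a custom map step can feed an ordinary SQL GROUP BY acting as the reduce step, can be tried without Aster using Python's built-in sqlite3 module. This is only an analogy: SQLite has no SQL/MR table functions, so the tokenize function below runs outside the database, and the tokens table and the sample blog texts are invented for illustration.

```python
import sqlite3

blogs = ["To be or not to be", "to do is to be"]  # made-up sample rows


def tokenize(texts):
    """The 'map' step: one output row per word (lowercasing is an assumption)."""
    for text in texts:
        for word in text.lower().split():
            yield (word,)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (word TEXT)")
conn.executemany("INSERT INTO tokens VALUES (?)", tokenize(blogs))

# The 'reduce' step is plain SQL aggregation.
rows = conn.execute(
    "SELECT word, count(*) AS wordcount FROM tokens "
    "GROUP BY word ORDER BY wordcount DESC, word LIMIT 20"
).fetchall()
print(rows)
# [('to', 4), ('be', 3), ('do', 1), ('is', 1), ('not', 1), ('or', 1)]
```

The extra ORDER BY column (word) is added here only to make the order of ties deterministic.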


Page 33: MapReduce and Hadoop

In-Database Map-Reduce by Oracle

● Uses Table Functions to implement Map-Reduce within the database.

● Parallelization is provided by the Oracle Parallel Execution framework.

Using this in combination with SQL, Oracle provides a simple mechanism for database developers to develop Map-Reduce functionality using languages they know.


Page 34: MapReduce and Hadoop

Example: Word Frequency

SELECT *
FROM table(oracle_map_reduce.reducer(
    cursor(
        SELECT value(map_result).word word
        FROM table(oracle_map_reduce.mapper(
            cursor(SELECT a FROM documents), ' '
        )) map_result
    )
));


Page 35: MapReduce and Hadoop

Still not perfect!

However, these solutions are not source-compatible with Hadoop.

Native Hadoop programs need to be rewritten before they can be used in databases.


Page 36: MapReduce and Hadoop

Questions?
