MapReduce and Hadoop


Cadenelli Nicola

Datenbanken Implementierungstechniken

Outline

Introduction
● History
● Motivations

MapReduce
● What MapReduce is
● Why it is useful
● Execution Details
● Some Examples
● Conclusions

Hadoop
● Introduction
● Hadoop Architecture
● Hadoop Ecosystem
● In the real world

MapReduce&Databases
● SQL-MapReduce
● In-Database Map-Reduce
● Conclusions

Introduction MapReduce Hadoop MR&Databases ● ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○

[Figure: Google's GFS, MapReduce and BigTable next to Hadoop's HDFS and MapReduce]


2003-2004: Google publishes the GFS and MapReduce papers.

2006: Apache releases Hadoop, the first open-source implementation of GFS and MapReduce.

Now: Amazon, AOL, eBay, Facebook, HP, IBM, Last.fm, LinkedIn, Microsoft, Spotify, Twitter and more are using Hadoop.

A Brief History


● Data sets are becoming really big: more than 10 TB. E.g.: Large Synoptic Survey Telescope (30 TB / night)

● The best idea is to scale out (not scale up) the system, but . . .
- How do we scale to 1000+ machines?
- How do we handle machine failures?
- How can we facilitate communication between nodes?
- If we change system, do we lose all our optimisation work?

● Google needed to recreate the index of the web.

Motivations


“MapReduce is a programming model and an associated implementation for processing and generating large data sets.” – Google, Inc. MapReduce paper, 2004.

It is a simple API with just two serial functions, map() and reduce(), and it is language independent (Java, Python, Perl …).

What is MapReduce?


MapReduce hides messy details in the runtime library:
● Parallelization and distribution
● Load balancing
● Network and disk transfer optimization
● Handling of machine failures
● Fault tolerance
● Monitoring & status updates

All users benefit from improvements to the core library.

Why is MapReduce useful?


1. Read a lot of data
2. Map: extract something we care about from each record
3. Shuffle and Sort
4. Reduce: aggregate, summarize, filter, or transform
5. Write the results

From the outside the outline is always the same (read, process, write); only map and reduce change to fit the problem.

Typical problem solved by MapReduce
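The five steps can be sketched in a single process. This is only an illustration of the data flow, not how a real runtime distributes the work; the function names (run_mapreduce, max_temp_map, max_temp_reduce) and the weather readings are hypothetical.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the five steps (no distribution, no fault handling)."""
    # Steps 1-2: read every record and apply map to it.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Step 3: shuffle and sort, i.e. group the intermediate values by key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # Steps 4-5: reduce each group in key order and collect the results.
    output = []
    for k in sorted(groups):
        output.extend(reduce_fn(k, groups[k]))
    return output


# Hypothetical problem: highest temperature per weather station.
def max_temp_map(line_id, line):
    station, temp = line.split(",")
    yield station, int(temp)


def max_temp_reduce(station, temps):
    yield station, max(temps)


readings = [("0", "berlin,21"), ("1", "rome,30"), ("2", "berlin,25"), ("3", "rome,28")]
result = run_mapreduce(readings, max_temp_map, max_temp_reduce)
print(result)  # [('berlin', 25), ('rome', 30)]
```

Only max_temp_map and max_temp_reduce are specific to the problem; run_mapreduce stays the same for any other job.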


● Single master controls job execution on multiple slaves.

● Mappers are preferentially placed on the same node or the same rack as their input block, which minimizes network usage.

● Mappers save their outputs to local disk before serving them to reducers.

● If a map or reduce task crashes: re-execute it!

● Allows having more mappers and reducers than nodes.

Some Execution Details
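Two of these details, locality-aware placement and re-execution, can be sketched as follows. This is a toy model under stated assumptions: pick_worker, run_with_reexecution, the rack_of mapping and the retry limit are all hypothetical and do not reflect the scheduler of any real implementation.

```python
def pick_worker(replica_nodes, free_workers, rack_of):
    """Return a free worker for a map task: same node first, then same rack, then any."""
    for worker in free_workers:
        if worker in replica_nodes:
            return worker
    replica_racks = {rack_of[node] for node in replica_nodes}
    for worker in free_workers:
        if rack_of[worker] in replica_racks:
            return worker
    return free_workers[0] if free_workers else None


def run_with_reexecution(task, max_attempts=3):
    """Run task(); if it crashes, run it again, up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise


# Hypothetical four-node cluster spread over two racks.
rack_of = {"n1": "rack1", "n2": "rack1", "n3": "rack2", "n4": "rack2"}
print(pick_worker(["n1"], ["n3", "n1"], rack_of))  # n1 (holds the block)
print(pick_worker(["n1"], ["n4", "n2"], rack_of))  # n2 (same rack as n1)
print(pick_worker(["n1"], ["n3", "n4"], rack_of))  # n3 (no local option)
```

Re-execution is safe in this sketch because a task is a plain function of its input; running it twice gives the same output.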


[Figure: Execution overview. Source: Google, Inc. MapReduce paper, 2004.]


The programmer has to write two primary methods:

map (k1,v1) → list(k2,v2)
reduce (k2,list(v2)) → list(k2,v2)

● All v2 with the same k2 are reduced together, in order.
● The input keys and values are drawn from a different domain than the output keys and values.

MapReduce Programming Model


map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Example: Word Frequency

Input: “documentx”, “To be or not to be”

Output: “be”, 2; “not”, 1; “or”, 1; “to”, 2


Map: “document1”, “To be or not to be” → “to”, 1; “be”, 1; “or”, 1; “not”, 1; “to”, 1; “be”, 1

Shuffle and Sort (aggregate values by key):
key = “be”, values = “1”, “1”
key = “not”, values = “1”
key = “or”, values = “1”
key = “to”, values = “1”, “1”

Reduce: “be”, 2; “not”, 1; “or”, 1; “to”, 2

Other inputs (“document2”, “text”, ...) follow the same path.
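The word-frequency flow can be reproduced in a few lines of Python. This is a minimal single-machine sketch of the pseudocode, not Hadoop code; map_fn, reduce_fn and word_frequency are illustrative names, and lowercasing the text is an assumption made so that “To” and “to” are counted together, as in the slide.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    # Emit (word, "1") for every word, like EmitIntermediate(w, "1").
    for word in contents.lower().split():
        yield word, "1"


def reduce_fn(word, counts):
    # Sum the string counts for one word.
    return word, sum(int(c) for c in counts)


def word_frequency(documents):
    # Shuffle and sort: aggregate the emitted values by key.
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for word, one in map_fn(name, contents):
            grouped[word].append(one)
    return [reduce_fn(word, grouped[word]) for word in sorted(grouped)]


counts = word_frequency({"document1": "To be or not to be"})
print(counts)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```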


● Inverted index
- Find which documents contain a specific word.
- Map: parse each document, emit <word, document-ID> pairs.
- Reduce: for each word, sort the corresponding document IDs. Emit <word, list(document-ID)>

● Reverse web-link graph
- Find where page links come from.
- Map: output <target, source> for each link to target in a page source.
- Reduce: concatenate the list of all source URLs associated with a target. Emit <target, list(source)>

Other examples
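Both examples follow the same pattern and differ only in what map emits. A small in-memory sketch, assuming documents and pages fit in a dictionary; the helper names and sample data are hypothetical, and emitting each word once per document is a simplification chosen here.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Stand-in for shuffle and sort: collect values per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def inverted_index(documents):
    # Map: emit <word, document-ID>; each word once per document (assumption).
    pairs = ((word, doc_id)
             for doc_id, text in documents.items()
             for word in set(text.lower().split()))
    # Reduce: sort the document IDs of each word.
    return {word: sorted(ids) for word, ids in group_by_key(pairs).items()}


def reverse_links(pages):
    # Map: emit <target, source> for each link found in page `source`.
    pairs = ((target, source)
             for source, targets in pages.items()
             for target in targets)
    # Reduce: concatenate all sources that point to the same target.
    return dict(group_by_key(pairs))


index = inverted_index({"d1": "to be or not", "d2": "to do"})
links = reverse_links({"a.html": ["b.html", "c.html"], "b.html": ["c.html"]})
print(index["to"])      # ['d1', 'd2']
print(links["c.html"])  # ['a.html', 'b.html']
```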


● Proven to be a useful abstraction

● Really simplifies large-scale computations

● Fun to use:
- Focus on the problem
- Let the library deal with messy details

Conclusions on MapReduce


[Figure: Google's GFS and MapReduce next to Hadoop's HDFS and MapReduce]


● A framework for distributed processing

● It is Open Source (Apache License 2.0)

● It is a top-level Apache Project

● Written in Java

● Batch processing centric

● Runs on commodity hardware

What is Hadoop?


Hadoop Distributed File System

● For very large files: TBs, PBs.

● Each file is partitioned into chunks of 64MB.

● Each chunk is replicated several times (>=3), on different racks, for fault tolerance.

● It is an abstract file system layered on a native one: the disks are formatted with ext3, ext4 or XFS.
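A quick calculation shows what these numbers imply for one file. This is a back-of-the-envelope sketch using the 64 MB chunk size and a replication factor of 3 from the slide; hdfs_footprint is a hypothetical helper, and it assumes the last chunk only occupies the bytes it actually holds.

```python
import math

BLOCK_SIZE_MB = 64   # chunk size from the slide
REPLICATION = 3      # minimum replication from the slide


def hdfs_footprint(file_size_mb):
    """Return (chunks, stored chunk replicas, raw MB used) for one file."""
    chunks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = chunks * REPLICATION
    raw_mb = file_size_mb * REPLICATION
    return chunks, replicas, raw_mb


one_tb_in_mb = 1024 * 1024
print(hdfs_footprint(one_tb_in_mb))  # (16384, 49152, 3145728)
print(hdfs_footprint(100))           # (2, 6, 300)
```

So a 1 TB file becomes 16384 chunks and needs about 3 TB of raw disk across the cluster.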


Hadoop Architecture

● TaskTracker is the MapReduce server (processing part)

● DataNode is the HDFS server (data part)

[Figure: one machine running a TaskTracker and a DataNode]

Hadoop Architecture - Master/Slave

JobTracker:
● Accepts users' jobs
● Assigns tasks to workers
● Keeps track of the jobs' status

[Figure: a JobTracker coordinating three machines, each running a TaskTracker and a DataNode]


Hadoop Architecture - Master/Slave

NameNode:
● Keeps information on data location
● Decides where a file has to be written

[Figure: a NameNode coordinating three machines, each running a TaskTracker and a DataNode]

Data never flows through the NameNode!
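The split between metadata and data can be illustrated with a toy model. The classes and method names below are hypothetical and are not the Hadoop API; the sketch only shows that a client asks the NameNode where blocks live and then fetches the bytes from the DataNodes directly.

```python
class DataNode:
    """Toy data server: stores block contents by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Toy metadata server: knows where blocks live, never holds their bytes."""

    def __init__(self):
        self.locations = {}  # file name -> list of (block_id, [DataNode names])

    def add_file(self, filename, block_locations):
        self.locations[filename] = block_locations

    def locate(self, filename):
        return self.locations[filename]


def read_file(filename, namenode, datanodes):
    """Ask the NameNode where the blocks are, then fetch bytes from DataNodes."""
    chunks = []
    for block_id, holders in namenode.locate(filename):
        chunks.append(datanodes[holders[0]].read_block(block_id))
    return b"".join(chunks)


datanodes = {name: DataNode(name) for name in ("dn1", "dn2")}
datanodes["dn1"].write_block("blk_0", b"To be or ")
datanodes["dn2"].write_block("blk_1", b"not to be")

namenode = NameNode()
namenode.add_file("hamlet.txt", [("blk_0", ["dn1"]), ("blk_1", ["dn2"])])
content = read_file("hamlet.txt", namenode, datanodes)
print(content)  # b'To be or not to be'
```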


Hadoop Architecture – Scalable

● Multiple machines running Hadoop form a cluster.

● What if we need more storage or compute power?

[Figure: three machines, each running a TaskTracker and a DataNode]


Hadoop Architecture - Overview

[Figure: cluster overview showing a Client, the JobTracker, the NameNode, the SecondaryNameNode and a File, with parts labelled A, B and C]


Hadoop Ecosystem – Pig & Hive

[Figure: software stack showing HDFS, MapReduce, Pig and Hive]


Hadoop Ecosystem – HBase

[Figure: the same stack (HDFS, MapReduce, Pig, Hive) with HBase added]


@Google
● Index construction for Google Search
● Article clustering for Google News
● Statistical machine translation

@Yahoo! (4100 nodes)
● “Web map” powering Yahoo! Search
● Spam detection for Yahoo! Mail

@Facebook (>100 PB of storage)
● Data mining
● Ad optimization
● Spam detection

What is MapReduce/Hadoop used for?


MapReduce's use of input files and lack of schema support prevents the performance improvements enabled by features like B-trees and hash partitioning . . .

. . . and most company data is stored in databases!

but . . .


● SQL-MapReduce by Teradata Aster

● In-Database Map-Reduce by Oracle

● Connectors to allow external Hadoop programs to access data from databases and to store Hadoop output in databases

Solutions


A framework that lets developers write SQL-MapReduce functions in languages such as Java, C#, Python and C++ and push them into the database for advanced in-database analytics.

SQL-MapReduce


MR functions can be used like custom SQL operators and can implement any algorithm or transformation.

SQL-MapReduce - Syntax

http://www.asterdata.com/resources/mapreduce.php


Demo #1: Map (Tokenization) and Reduce (WordCount) in SQL/MR

SELECT key AS word, value AS wordcount
FROM WordCountReduce (
    ON Tokenize ( ON blogs )
    PARTITION BY key
)
ORDER BY wordcount DESC
LIMIT 20;



Demo #2: Why do Reduce when we have SQL?

SELECT word, count(*) AS wordcount
FROM Tokenize ( ON blogs )
GROUP BY word
ORDER BY wordcount DESC
LIMIT 20;



● Uses Table Functions to implement Map-Reduce within the database.

● Parallelization is provided by the Oracle Parallel Execution framework.

Using this in combination with SQL, Oracle provides a simple mechanism for database developers to develop Map-Reduce functionality using languages they know.

In-Database Map-Reduce by Oracle


SELECT *
FROM table(oracle_map_reduce.reducer(
    cursor(SELECT value(map_result).word word
           FROM table(oracle_map_reduce.mapper(
               cursor(SELECT a FROM documents), ' '
           )) map_result)
));



However, these solutions are not source compatible with Hadoop.

Native Hadoop programs need to be rewritten before they can be used in databases.

Still not perfect!


Questions?
