MapReduce and Hadoop
Cadenelli Nicola
Datenbanken Implementierungstechniken (Database Implementation Techniques)

Outline
Introduction
● History
● Motivations
MapReduce
● What MapReduce is
● Why it is useful
● Execution Details
● Some Examples
● Conclusions
Hadoop
● Introduction
● Hadoop Architecture
● Hadoop Ecosystem
● In the real world
MapReduce & Databases
● SQL-MapReduce
● In-Database Map-Reduce
● Conclusions
Introduction MapReduce Hadoop MR&Databases ● ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○

[Figure: GFS, MapReduce and BigTable shown alongside HDFS and MapReduce]

A Brief History
2004: Google publishes the MapReduce paper (the GFS paper had appeared in 2003).
2006: Apache releases Hadoop, the first open-source implementation of GFS and MapReduce.
Now: Amazon, AOL, eBay, Facebook, HP, IBM, Last.fm, LinkedIn, Microsoft, Spotify, Twitter and more are using Hadoop.

Motivations
● Data sets are becoming really big: more than 10 TB. E.g. the Large Synoptic Survey Telescope (30 TB / night).
● The best idea is to scale out (not scale up) the system, but . . .
- How do we scale to 1000+ machines?
- How do we handle machine failures?
- How can we facilitate communication between nodes?
- If we change systems, do we lose all our optimisation work?
● Google needed to recreate the index of the web.

What is MapReduce?
“MapReduce is a programming model and an associated implementation for processing and generating large data sets.” – Google, Inc. MapReduce paper, 2004.
It is a really simple API with just two serial functions, map() and reduce(), and it is language independent (Java, Python, Perl …).

Why is MapReduce useful?
MapReduce hides messy details in the runtime library:
● Parallelization and distribution
● Load balancing
● Network and disk transfer optimization
● Handling of machine failures
● Fault tolerance
● Monitoring & status updates
All users benefit from improvements to the core library.

Typical problem solved by MapReduce
1. Read a lot of data
2. Map: extract something we care about from each record
3. Shuffle and Sort
4. Reduce: aggregate, summarize, filter, or transform
5. Write the results
Seen from the outside, the outline is always the same (read, process, write); only map and reduce change to fit the problem.
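
The five steps can be sketched as a tiny single-process driver. This is only an illustration of the outline, not how Google's library or Hadoop is implemented: `run_mapreduce`, `max_map`, `max_reduce` and the temperature readings are all hypothetical names and made-up data, and everything runs in memory on one machine.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable, Iterator


def run_mapreduce(
    records: Iterable[tuple],
    map_fn: Callable[[object, object], Iterable[tuple]],
    reduce_fn: Callable[[Hashable, list], Iterable[tuple]],
) -> Iterator[tuple]:
    """Run the five steps in a single process (toy model, no distribution)."""
    groups = defaultdict(list)
    # 1. Read: `records` is any iterable of (key, value) input pairs.
    for key, value in records:
        # 2. Map: extract intermediate (key, value) pairs from each record.
        for out_key, out_value in map_fn(key, value):
            # 3. Shuffle: collect all values that share an intermediate key.
            groups[out_key].append(out_value)
    # 3. Sort: visit the intermediate keys in sorted order.
    for out_key in sorted(groups):
        # 4. Reduce: aggregate the values collected for one key.
        # 5. Write: here the results are simply yielded to the caller.
        yield from reduce_fn(out_key, groups[out_key])


# Problem-specific part: maximum temperature per city (made-up data).
def max_map(_station, reading):
    city, temperature = reading
    yield city, temperature


def max_reduce(city, temperatures):
    yield city, max(temperatures)


readings = [("s1", ("Rome", 31)), ("s2", ("Oslo", 17)), ("s3", ("Rome", 35))]
result = list(run_mapreduce(readings, max_map, max_reduce))
print(result)  # [('Oslo', 17), ('Rome', 35)]
```

Only `max_map` and `max_reduce` are specific to the problem; the driver stays the same for any other job.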

Some Execution Details
● A single master controls job execution on multiple slaves.
● Mappers are preferentially placed on the same node or the same rack as their input block → minimizes network usage!
● Mappers save their outputs to local disk before serving them to reducers.
● If a map or reduce task crashes: re-execute it!
● There can be more mappers and reducers than nodes.
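
The placement preference for mappers can be sketched as a small function. This is a hedged illustration only: `pick_node`, the node and rack names, and the three-level fallback order are assumptions made for the example, not the actual scheduler of any MapReduce implementation.

```python
def pick_node(replica_nodes, free_nodes, rack_of):
    """Choose where to run a map task for one input block.

    replica_nodes: set of nodes that hold a copy of the block
    free_nodes:    list of nodes that currently have a free map slot
    rack_of:       dict mapping node name -> rack name
    """
    # Best case: a free node that already stores the block (no network transfer).
    for node in free_nodes:
        if node in replica_nodes:
            return node, "node-local"
    # Next best: a free node in the same rack as one of the replicas.
    replica_racks = {rack_of[n] for n in replica_nodes}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # Fallback: any free node; the block has to cross racks.
    if free_nodes:
        return free_nodes[0], "off-rack"
    # No free slot at all: the task has to wait.
    return None, "wait"


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
print(pick_node({"n1"}, ["n3", "n1"], rack_of))  # ('n1', 'node-local')
print(pick_node({"n1"}, ["n3", "n2"], rack_of))  # ('n2', 'rack-local')
print(pick_node({"n1"}, ["n4", "n3"], rack_of))  # ('n4', 'off-rack')
```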

[Figure: Execution overview. Source: Google, Inc. MapReduce paper, 2004.]

MapReduce Programming Model
The programmer has to write two primary methods:
map (k1,v1) → list(k2,v2)
reduce (k2,list(v2)) → list(k2,v2)
● All values v2 with the same key k2 are reduced together, in order.
● The input keys and values are drawn from a different domain than the output keys and values.
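
One way to read the two signatures is as Python type aliases. The aliases `MapFn` and `ReduceFn` and the word-length example are hypothetical, chosen only to show that the input domain (line number, line of text) can differ from the output domain (word length, count).

```python
from typing import Callable, Iterable, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")

# map (k1, v1) -> list(k2, v2)
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# reduce (k2, list(v2)) -> list(k2, v2)
ReduceFn = Callable[[K2, Iterable[V2]], Iterable[Tuple[K2, V2]]]


# Input domain: (line number: int, line: str).
# Intermediate and output domain: (word length: int, count: int).
def length_map(line_no: int, line: str) -> Iterable[Tuple[int, int]]:
    for word in line.split():
        yield len(word), 1


def length_reduce(length: int, ones: Iterable[int]) -> Iterable[Tuple[int, int]]:
    yield length, sum(ones)


mapper: MapFn = length_map
reducer: ReduceFn = length_reduce

print(list(mapper(1, "to be or not")))  # [(2, 1), (2, 1), (2, 1), (3, 1)]
print(list(reducer(2, [1, 1, 1])))      # [(2, 3)]
```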

Example: Word Frequency
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

Input: “documentx”, “To be or not to be”
Output: “be”, 2; “not”, 1; “or”, 1; “to”, 2
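
A runnable counterpart of the pseudocode, as a minimal single-process sketch. The names `map_words` and `reduce_counts` are hypothetical, lowercasing the text is an assumption made so that “To” and “to” count as the same word, and the sort plus `itertools.groupby` merely stands in for the shuffle-and-sort step of a real runtime.

```python
from itertools import groupby
from operator import itemgetter


def map_words(key, value):
    # key: document name (unused), value: document contents
    for word in value.lower().split():
        yield word, "1"


def reduce_counts(key, values):
    # key: a word, values: an iterable of counts encoded as strings
    result = 0
    for v in values:
        result += int(v)
    return key, result


documents = [("documentx", "To be or not to be")]

# Map phase: one (word, "1") pair per word occurrence.
intermediate = [pair for name, text in documents for pair in map_words(name, text)]
# Shuffle and sort: order by key so that equal keys are adjacent, then group.
intermediate.sort(key=itemgetter(0))
counts = [
    reduce_counts(word, (v for _, v in group))
    for word, group in groupby(intermediate, key=itemgetter(0))
]
print(counts)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```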

Map
Input: “document1”, “To be or not to be”
Input: “document2”, “text” ...
Output: “to”, 1; “be”, 1; “or”, 1; “not”, 1; “to”, 1; “be”, 1; ...
Shuffle and Sort: aggregate values by key
“be”, 1; “be”, 1 → key = “be”, values = “1”, “1”
“not”, 1 → key = “not”, values = “1”
“or”, 1 → key = “or”, values = “1”
“to”, 1; “to”, 1 → key = “to”, values = “1”, “1”
...
Reduce
Output: “be”, 2; “not”, 1; “or”, 1; “to”, 2
...

Other examples
● Inverted index
- Find which documents contain a specific word.
- Map: parse each document, emit <word, document-ID> pairs.
- Reduce: for each word, sort the corresponding document IDs. Emit <word, list(document-ID)>.
● Reverse web-link graph
- Find where page links come from.
- Map: output <target, source> for each link to target in a page source.
- Reduce: concatenate the list of all source URLs associated with a target. Emit <target, list(source)>.
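
Both examples fit the same map/reduce shape, as this small in-memory sketch suggests. All function names, document IDs and page names are hypothetical; emitting each word once per document and sorting the source URLs (rather than plainly concatenating them) are assumptions made to keep the output deterministic.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group values by key, standing in for the shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


# Inverted index: which documents contain a given word?
def index_map(doc_id, text):
    # Assumption: emit each distinct word once per document.
    for word in set(text.lower().split()):
        yield word, doc_id


def index_reduce(word, doc_ids):
    return word, sorted(doc_ids)


docs = {"d1": "to be or not to be", "d2": "not now"}
pairs = [p for doc_id, text in docs.items() for p in index_map(doc_id, text)]
inverted_index = dict(index_reduce(w, ids) for w, ids in shuffle(pairs).items())


# Reverse web-link graph: which pages link to a given target?
def links_map(source, targets):
    for target in targets:
        yield target, source


def links_reduce(target, sources):
    return target, sorted(sources)


pages = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}
pairs = [p for src, targets in pages.items() for p in links_map(src, targets)]
reverse_links = dict(links_reduce(t, s) for t, s in shuffle(pairs).items())

print(inverted_index["not"])    # ['d1', 'd2']
print(reverse_links["c.html"])  # ['a.html', 'b.html']
```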

Conclusions on MapReduce
● Proven to be a useful abstraction
● Really simplifies large-scale computations
● Fun to use:
- Focus on the problem
- Let the library deal with the messy details

[Figure: GFS and MapReduce shown alongside HDFS and MapReduce]

What is Hadoop?
● It is a framework for distributed processing
● It is Open Source (Apache License 2.0)
● It is a top-level Apache Project
● Written in Java
● Batch-processing centric
● Runs on commodity hardware

Hadoop Distributed File System
● For very large files: TBs, PBs.
● Each file is partitioned into chunks of 64 MB (the default; configurable).
● Each chunk is replicated several times (3 by default), across different racks, for fault tolerance.
● It is an abstract FS layered on the local file system: the disks themselves are formatted with ext3, ext4 or XFS.
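
A quick back-of-the-envelope calculation shows what chunking and replication mean for one file. The helper `hdfs_footprint` is hypothetical; it assumes 64 MB chunks and 3 replicas per chunk as defaults, and that a short final chunk only occupies the bytes it actually holds.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def hdfs_footprint(file_size, chunk_size=64 * MIB, replication=3):
    """Return (number of chunks, number of stored replicas, raw bytes used)."""
    chunks = -(-file_size // chunk_size)  # ceiling division on integers
    replicas = chunks * replication
    # The last chunk may be shorter than chunk_size, so raw usage is
    # replication * file_size rather than replicas * chunk_size.
    raw_bytes = file_size * replication
    return chunks, replicas, raw_bytes


chunks, replicas, raw_bytes = hdfs_footprint(1 * TIB)
print(chunks, replicas, raw_bytes // TIB)  # 16384 49152 3
```

So a 1 TiB file becomes 16384 chunks, stored as 49152 chunk replicas spread over the cluster, for 3 TiB of raw disk.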

Hadoop Architecture
● TaskTracker is the MapReduce server (processing part)
● DataNode is the HDFS server (data part)
[Figure: a Machine running a TaskTracker and a DataNode]

Hadoop Architecture - Master/Slave
JobTracker:
● Accepts users' jobs
● Assigns tasks to workers
● Keeps track of the jobs' status
[Figure: a JobTracker and three machines, each running a TaskTracker and a DataNode]

Hadoop Architecture - Master/Slave
NameNode:
● Keeps information on data location
● Decides where a file has to be written
[Figure: a NameNode and three machines, each running a TaskTracker and a DataNode]
Data never flows through the NameNode!
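
The division of labour between NameNode and DataNodes can be mimicked with two toy classes. This is a hypothetical model written for illustration, not the HDFS client API: the class layout, `locate`, `read_file`, the block IDs and the file name are all made up.

```python
class DataNode:
    """Stores the actual bytes of the blocks it holds."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self):
        self.block_map = {}  # file name -> list of (block id, [DataNode, ...])

    def locate(self, filename):
        return self.block_map[filename]


def read_file(namenode, filename):
    data = b""
    # Metadata request: the NameNode only answers "which blocks, on which nodes".
    for block_id, replicas in namenode.locate(filename):
        # Data transfer: the bytes come straight from a DataNode.
        data += replicas[0].read(block_id)
    return data


dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.blocks["A"] = b"hello "
dn2.blocks["B"] = b"world"
nn = NameNode()
nn.block_map["greeting.txt"] = [("A", [dn1]), ("B", [dn2])]
print(read_file(nn, "greeting.txt"))  # b'hello world'
```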

Hadoop Architecture – Scalable
● Multiple machines running Hadoop form a cluster.
● What if we need more storage or compute power?
[Figure: three Machines, each running a TaskTracker and a DataNode]

Hadoop Architecture - Overview
[Figure: Client, JobTracker, NameNode, SecondaryNameNode, and a File, with the labels A, B and C]

Hadoop Ecosystem – Pig & Hive
[Figure: HDFS, MapReduce, Pig and Hive]

Hadoop Ecosystem – HBase
[Figure: HDFS, MapReduce, Pig, Hive and HBase]

What is MapReduce/Hadoop used for?
@Google
● Index construction for Google Search
● Article clustering for Google News
● Statistical machine translation
@Yahoo! (4100 nodes)
● “Web map” powering Yahoo! Search
● Spam detection for Yahoo! Mail
@Facebook (>100 PB of storage)
● Data mining
● Ad optimization
● Spam detection

but . . .
MapReduce's use of input files and lack of schema support prevent the performance improvements enabled by features like B-trees and hash partitioning . . .
. . . and most company data is stored in databases!

Solutions
● SQL-MapReduce by Teradata Aster
● In-Database Map-Reduce by Oracle
● Connectors to allow external Hadoop programs to access data from databases and to store Hadoop output in databases

SQL-MapReduce
A framework that lets developers write SQL-MapReduce functions in languages such as Java, C#, Python and C++ and push them into the database for advanced in-database analytics.

SQL-MapReduce - Syntax
MR functions can be used like custom SQL operators and can implement any algorithm or transformation.
http://www.asterdata.com/resources/mapreduce.php

Example: Word Frequency
Demo #1: Map (Tokenization) and Reduce (WordCount) in SQL/MR
SELECT key AS word, value AS wordcount
FROM WordCountReduce (
  ON Tokenize ( ON blogs )
  PARTITION BY key
)
ORDER BY wordcount DESC
LIMIT 20;

Demo #2: Why do Reduce when we have SQL?
SELECT word, count(*) AS wordcount
FROM Tokenize ( ON blogs )
GROUP BY word
ORDER BY wordcount DESC
LIMIT 20;
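
The idea behind Demo #2 can be tried without an Aster database. The sketch below is an analogy only: it uses Python's built-in `sqlite3` module, does the tokenization (the map step) in plain Python instead of a `Tokenize` SQL/MR function, and lets an ordinary GROUP BY play the role of the reducer. The table name `tokens`, the sample blog texts and the extra tie-breaking `word` in ORDER BY are assumptions.

```python
import sqlite3

blogs = ["To be or not to be", "To do or not to do"]

# "Map" step in plain Python: tokenize each blog post into words.
tokens = [(word,) for text in blogs for word in text.lower().split()]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (word TEXT)")
conn.executemany("INSERT INTO tokens VALUES (?)", tokens)

# The reduce step is an ordinary GROUP BY, as in Demo #2.
rows = conn.execute(
    """
    SELECT word, count(*) AS wordcount
    FROM tokens
    GROUP BY word
    ORDER BY wordcount DESC, word
    LIMIT 20
    """
).fetchall()
conn.close()
print(rows)  # [('to', 4), ('be', 2), ('do', 2), ('not', 2), ('or', 2)]
```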

In-Database Map-Reduce by Oracle
● Uses Table Functions to implement Map-Reduce within the database.
● Parallelization is provided by the Oracle Parallel Execution framework.
Using this in combination with SQL, Oracle provides a simple mechanism for database developers to develop Map-Reduce functionality using languages they know.

Example: Word Frequency
SELECT *
FROM table(oracle_map_reduce.reducer(
  cursor(
    SELECT value(map_result).word word
    FROM table(oracle_map_reduce.mapper(
      cursor(SELECT a FROM documents), ' '
    )) map_result
  )
));

Still not perfect!
However, these solutions are not source compatible with Hadoop.
Native Hadoop programs need to be rewritten before they can be used in databases.

Questions?