MapReduce and Hadoop

Cadenelli Nicola

Datenbanken Implementierungstechniken

Outline

Introduction
● History
● Motivations

MapReduce
● What MapReduce is
● Why it is useful
● Execution Details
● Some Examples
● Conclusions

Hadoop
● Introduction
● Hadoop Architecture
● Hadoop Ecosystem
● In real world

MapReduce&Databases
● SQL-MapReduce
● In-Database Map-Reduce
● Conclusions

Introduction MapReduce Hadoop MR&Databases ● ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○

MapReduce

BigTable

MapReduce

2004: Google publishes the papers

2006: Apache releases Hadoop, the first open-source implementation of GFS and MapReduce.

Now: Amazon, AOL, eBay, Facebook, HP, IBM, Last.fm, LinkedIn, Microsoft, Spotify, Twitter and more are using Hadoop.

A Brief History

● Data is getting really big: more than 10 TB. E.g., the Large Synoptic Survey Telescope (30 TB/night).

● The best idea is to scale out (not scale up) the system, but . . .
- How do we scale to 1000+ machines?
- How do we handle machine failures?
- How can we facilitate communication between nodes?
- If we change systems, do we lose all our optimisation work?

● Google needed to recreate the index of the web.

Motivations

“MapReduce is a programming model and an associated implementation for processing and generating large data sets.” – Google, Inc. MapReduce paper, 2004.

It is a really simple API with just two serial functions, map() and reduce(), and it is language independent (Java, Python, Perl, …).

What is MapReduce?

MapReduce hides messy details in the runtime library:
● Parallelization and distribution
● Load balancing
● Network and disk transfer optimization
● Handling of machine failures
● Fault tolerance
● Monitoring & status updates

All users obtain benefits from improvements on the core library.

Why is MapReduce useful?

1. Read a lot of data
2. Map: extract something we care about from each record
3. Shuffle and Sort
4. Reduce: aggregate, summarize, filter, or transform
5. Write the results

Seen from the outside, the outline is always the same (read, process, write); only map and reduce change to fit the problem.

Typical problem solved by MapReduce
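
The five steps can be sketched as a tiny single-process driver. This is only an illustration under stated assumptions: the names run_mapreduce, map_year_temp and reduce_max and the sample readings are hypothetical, and nothing here is distributed or fault tolerant.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run the five steps in one process (no distribution, no fault tolerance)."""
    # 1. Read the input: here an in-memory iterable of (key, value) records.
    # 2. Map: each record yields zero or more intermediate (key, value) pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # 3. Shuffle and sort: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # 4. Reduce: one call per distinct key, keys visited in sorted order.
    results = []
    for key in sorted(groups):
        results.extend(reduce_fn(key, groups[key]))

    # 5. Write the results: here they are simply returned.
    return results


def map_year_temp(_station, reading):
    # Hypothetical record format: "year,temperature".
    year, temp = reading.split(",")
    return [(year, int(temp))]


def reduce_max(year, temps):
    return [(year, max(temps))]


readings = [("s1", "1950,22"), ("s2", "1950,31"), ("s1", "1951,18")]
result = run_mapreduce(readings, map_year_temp, reduce_max)
print(result)
```

Only map_year_temp and reduce_max are specific to the problem; the driver stays the same for any other job.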

● Single master controls job execution on multiple slaves.

● Mappers preferentially placed on same node or same rack as their input block → minimizes network usage!!!

● Mappers save outputs to local disk before serving them to reducers.

● If a map or reduce task crashes: re-execute it!

● There can be more mappers and reducers than nodes.

Some Execution Details
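
A minimal sketch of the "re-execute on crash" idea, assuming tasks are deterministic so that running them again gives the same output. The helper run_with_reexecution and the FlakyMapTask class are hypothetical; in a real cluster the master would reschedule the task, possibly on another machine, rather than loop in place.

```python
def run_with_reexecution(task, max_attempts=4):
    """Call task() until it succeeds; give up after max_attempts failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except RuntimeError:
            if attempt == max_attempts:
                raise  # out of attempts: report the failure to the caller


class FlakyMapTask:
    """Hypothetical map task that crashes on its first two attempts."""

    def __init__(self):
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= 2:
            raise RuntimeError("simulated worker crash")
        return [("be", 1), ("to", 1)]


output, attempts = run_with_reexecution(FlakyMapTask())
print(output, "after", attempts, "attempts")
```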

Execution overview

Google, Inc. MapReduce paper, 2004.

Programmer has to write two primary methods:

map (k1, v1) → list(k2, v2)
reduce (k2, list(v2)) → list(k2, v2)

● All values v2 with the same key k2 are reduced together, in order.
● The input keys and values are drawn from a different domain than the output keys and values.

MapReduce Programming Model

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Example: Word Frequency

“documentx”, “To be or not to be”

“be”, 2
“not”, 1
“or”, 1
“to”, 2
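
The word-frequency pseudocode can be tried out with a small Python translation. It is a single-process sketch: word_frequency is a hypothetical stand-in for the shuffle step, and lower-casing the words is an assumption made so that “To” and “to” are counted together, as in the example output.

```python
from collections import defaultdict


def map_words(key, value):
    # key: document name (unused), value: document contents
    for word in value.lower().split():
        yield word, "1"


def reduce_counts(key, values):
    # key: a word, values: an iterable of counts encoded as strings
    result = 0
    for v in values:
        result += int(v)
    return str(result)


def word_frequency(documents):
    # Stand-in for the shuffle: group the intermediate counts by word.
    groups = defaultdict(list)
    for name, contents in documents:
        for word, count in map_words(name, contents):
            groups[word].append(count)
    return {word: int(reduce_counts(word, groups[word])) for word in sorted(groups)}


counts = word_frequency([("documentx", "To be or not to be")])
print(counts)
```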

Map input: “document1”, “To be or not to be” (further documents, e.g. “document2”, “text”, are handled by other map tasks)

Map output:
“to”, 1
“be”, 1
“or”, 1
“not”, 1
“to”, 1
“be”, 1

Shuffle and sort (pairs from all map tasks are grouped by key):
key = “be”, values = “1”, “1”
key = “not”, values = “1”
key = “or”, values = “1”
key = “to”, values = “1”, “1”

Reduce output:
“be”, 2
“not”, 1
“or”, 1
“to”, 2

● Inverted index
- Find which documents contain a specific word.
- Map: parse document, emit <word, document-ID> pairs.
- Reduce: for each word, sort the corresponding document IDs.
  Emit <word, list(document-ID)>

● Reverse web-link graph
- Find where page links come from.
- Map: output <target, source> for each link to target in a page source.
- Reduce: concatenate the list of all source URLs associated with a target.
  Emit <target, list(source)>

Other examples
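
Both examples follow the same pattern: emit pairs, group them by key, post-process each group. A single-process sketch, with hypothetical helper names and sample data; the sources are sorted here only to make the output deterministic, whereas the slide merely concatenates them.

```python
from collections import defaultdict


def group_by_key(pairs):
    # Stand-in for the shuffle phase: collect all values per key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def inverted_index(documents):
    # Map: emit (word, document-ID); set() avoids duplicate pairs per document.
    pairs = [(word, doc_id)
             for doc_id, text in documents.items()
             for word in set(text.lower().split())]
    # Reduce: sort the document IDs collected for each word.
    return {word: sorted(ids) for word, ids in group_by_key(pairs).items()}


def reverse_links(pages):
    # pages maps a source URL to the list of target URLs it links to.
    # Map: emit (target, source) for every link found in a source page.
    pairs = [(target, source)
             for source, targets in pages.items()
             for target in targets]
    # Reduce: gather all sources per target (sorted for a stable result).
    return {target: sorted(sources) for target, sources in group_by_key(pairs).items()}


index = inverted_index({"d1": "to be or not to be", "d2": "to do"})
links = reverse_links({"a.html": ["b.html", "c.html"], "b.html": ["c.html"]})
print(index["to"], links["c.html"])
```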

● Proven to be a useful abstraction

● Really simplifies large-scale computations

● Fun to use:
- Focus on the problem
- Let the library deal with messy details

Conclusions on MapReduce

MapReduce

● It is a framework for distributed processing

● It is open source (Apache License 2.0)

● It is a top-level Apache Project

● Written in Java

● Batch processing centric

● Runs on commodity hardware

What is Hadoop?

Hadoop Distributed File System

● For very large files: TBs, PBs.

● Each file is partitioned into chunks of 64 MB.

● Each chunk is replicated several times (typically 3), on different racks, for fault tolerance.

● It is an abstract file system: the underlying disks are formatted with ext3, ext4 or XFS.
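
A quick back-of-the-envelope sketch of what chunking and replication mean for a single file, assuming the 64 MB chunk size from the slide and a replication factor of 3. The function names are hypothetical and the numbers are plain arithmetic, not measurements.

```python
CHUNK_SIZE = 64 * 1024**2  # 64 MB per chunk, as stated on the slide
REPLICATION = 3            # assumed number of copies of each chunk


def chunk_count(file_size):
    """Number of chunks needed for a file of file_size bytes (ceiling division)."""
    return -(-file_size // CHUNK_SIZE)


def raw_storage(file_size):
    """Bytes occupied across the cluster once every chunk is replicated."""
    return file_size * REPLICATION


one_tb = 1024**4
print(chunk_count(one_tb), "chunks,", chunk_count(one_tb) * REPLICATION, "replicas")
print(raw_storage(one_tb) // 1024**4, "TB of raw disk for a 1 TB file")
```

The last chunk of a file may be smaller than 64 MB, which is why the raw storage is computed from the file size rather than from the chunk count.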

Hadoop Architecture

● TaskTracker is the MapReduce server (processing part)

● DataNode is the HDFS server (data part)

[Figure: one machine running both a TaskTracker and a DataNode]

Hadoop Architecture - Master/Slave

JobTracker:
● Accepts users' jobs
● Assigns tasks to workers
● Keeps track of the jobs' status

[Figure: a JobTracker coordinating several machines, each running a TaskTracker and a DataNode]

Hadoop Architecture - Master/Slave

NameNode:
● Keeps information on data location
● Decides where a file has to be written

[Figure: a NameNode coordinating several machines, each running a TaskTracker and a DataNode]

Data never flows through the NameNode!
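
The division of labour can be illustrated with a toy, metadata-only service. Everything here is hypothetical: ToyNameNode, its methods and the round-robin placement are assumptions made for illustration and do not reproduce the HDFS API or its placement policy. The point is only that the service stores chunk locations, never chunk contents.

```python
class ToyNameNode:
    """Hypothetical metadata service: knows where chunks live, never sees their bytes."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = replication
        self.locations = {}  # (filename, chunk_index) -> list of DataNode names
        self._next = 0

    def allocate(self, filename, chunk_index):
        # Decide where a chunk has to be written.
        # Round-robin is an assumption; real placement also considers racks.
        chosen = [self.datanodes[(self._next + i) % len(self.datanodes)]
                  for i in range(self.replication)]
        self._next = (self._next + 1) % len(self.datanodes)
        self.locations[(filename, chunk_index)] = chosen
        return chosen

    def locate(self, filename, chunk_index):
        # Answer "where is this chunk?"; the client then contacts DataNodes directly.
        return self.locations[(filename, chunk_index)]


namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
first = namenode.allocate("logs.txt", 0)
second = namenode.allocate("logs.txt", 1)
print(first, second)
```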

Hadoop Architecture – Scalable

● Multiple machines running Hadoop form a cluster.

● What if we need more storage or compute power? Add more machines.

[Figure: a cluster growing from one to three machines, each running a TaskTracker and a DataNode]

Hadoop Architecture - Overview

[Figure: overview showing the Client, the JobTracker, the NameNode and the SecondaryNameNode]

Hadoop Ecosystem – Pig & Hive

[Figure: Pig and Hive shown together with MapReduce]

Hadoop Ecosystem – HBase

[Figure: HBase shown together with MapReduce, Pig and Hive]

@Google
● Index construction for Google Search
● Article clustering for Google News
● Statistical machine translation

@Yahoo! (4100 nodes)
● “Web map” powering Yahoo! Search
● Spam detection for Yahoo! Mail

@Facebook (>100 PB of storage)
● Data mining
● Ad optimization
● Spam detection

What is MapReduce/Hadoop used for?

MapReduce's use of input files and lack of schema support prevents the performance improvements enabled by features like B-trees and hash partitioning . . .

. . . most of the data in companies is stored in databases!

but . . .

● SQL-MapReduce by Teradata Aster

● In-Database Map-Reduce by Oracle

● Connectors to allow external Hadoop programs to access data from databases and to store Hadoop output in databases

Solutions

It is a framework that lets developers write SQL-MapReduce functions in languages such as Java, C#, Python and C++ and push them into the database for advanced in-database analytics.

SQL-MapReduce

MR functions can be used like custom SQL operators and can implement any algorithm or transformation.

SQL-MapReduce - Syntax

http://www.asterdata.com/resources/mapreduce.php

Demo #1: Map (Tokenization) and Reduce (WordCount) in SQL/MR

SELECT key AS word, value AS wordcount
FROM WordCountReduce (
    ON Tokenize ( ON blogs )
    PARTITION BY key
)
ORDER BY wordcount DESC
LIMIT 20;

Demo #2: Why do Reduce when we have SQL?

SELECT word, count(*) AS wordcount
FROM Tokenize ( ON blogs )
GROUP BY word
ORDER BY wordcount DESC
LIMIT 20;
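
The point of Demo #2, that a plain GROUP BY can play the role of the reduce step, can be tried with any SQL engine. The sketch below uses Python's built-in sqlite3 module rather than Aster's SQL-MapReduce: the tokens table is a hypothetical stand-in for the output of Tokenize, filled by splitting a sample string in Python, and the extra sort on word only makes ties deterministic.

```python
import sqlite3

# Stand-in for the map step: tokenize in Python, since SQLite has no Tokenize function.
tokens = "to be or not to be".split()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (word TEXT)")
conn.executemany("INSERT INTO tokens (word) VALUES (?)", [(t,) for t in tokens])

# The reduce step expressed as ordinary SQL aggregation.
rows = conn.execute(
    "SELECT word, count(*) AS wordcount "
    "FROM tokens "
    "GROUP BY word "
    "ORDER BY wordcount DESC, word "
    "LIMIT 20"
).fetchall()
conn.close()

print(rows)
```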

● Uses Table Functions to implement Map-Reduce within the database.

● Parallelization is provided by the Oracle Parallel Execution framework.

Using this in combination with SQL, Oracle provides a simple mechanism for database developers to develop Map-Reduce functionality using languages they know.

In-Database Map-Reduce by Oracle

SELECT *
FROM table(oracle_map_reduce.reducer(
    cursor(SELECT value(map_result).word word
           FROM table(oracle_map_reduce.mapper(
               cursor(SELECT a FROM documents), ' '))
           map_result)));

However, these solutions are not source compatible with Hadoop.

Native Hadoop programs need to be rewritten before they can be used in databases.

Still not perfect!

Questions?
