Hadoop and MapReduce

Transcript of Hadoop and MapReduce

Page 1: Hadoop and MapReduce

HADOOP Framework and Applications

Prepared by: TEAM HADOOP slide 1/22

Page 2: Hadoop and MapReduce

CONTENTS

WHY HADOOP?

INTRODUCTION TO MapReduce

Page 3: Hadoop and MapReduce

WHAT?

“... to create building blocks for programmers who just happen to have lots of data to store, lots of data to analyze, or lots of machines to coordinate, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.”

-Tom White

Source: Hadoop: The Definitive Guide

Page 4: Hadoop and MapReduce

WHAT?

Hadoop contains many subprojects: Hadoop Common, Chukwa, HBase, ZooKeeper, Pig, Zombie, Hive, MapReduce

We will focus on MapReduce

Page 5: Hadoop and MapReduce

WHO & WHEN?

Pre-2004: Doug Cutting and Mike Cafarella develop open source projects for web-scale indexing, crawling, and search.

Page 6: Hadoop and MapReduce

WHO & WHEN?

2004: Jeffrey Dean and Sanjay Ghemawat introduce the MapReduce model used internally at Google.

Page 7: Hadoop and MapReduce

WHO & WHEN?

2006: Hadoop becomes an official Apache project. Cutting joins Yahoo!, and Yahoo! adopts Hadoop.

Page 8: Hadoop and MapReduce

TRENDS

Page 9: Hadoop and MapReduce

WHO USES IT?

Page 10: Hadoop and MapReduce

Roughly how long to read 1TB from a commodity hard disk?

Page 11: Hadoop and MapReduce

Roughly how long to read 1TB from a commodity hard disk?

From a single disk: around 4 hours.

With Hadoop, reading from many disks in parallel: 62 seconds.
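The arithmetic behind these two figures can be sketched roughly as follows. The 70 MB/s sustained read rate is an assumption chosen because it lands close to the 4-hour figure; it is not a number from the slides, and the function names are purely illustrative.

```python
import math

TB = 10**12  # bytes in one terabyte (decimal)

def read_seconds(size_bytes, mb_per_sec, disks=1):
    """Time to scan size_bytes when it is spread evenly over `disks` drives."""
    return size_bytes / (mb_per_sec * 10**6 * disks)

def disks_needed(size_bytes, mb_per_sec, target_seconds):
    """Smallest number of drives that finishes the scan within target_seconds."""
    return math.ceil(read_seconds(size_bytes, mb_per_sec) / target_seconds)

# Assumed throughput: 70 MB/s per commodity disk (not taken from the slides).
single = read_seconds(TB, 70)
print(f"one disk: {single / 3600:.1f} hours")           # one disk: 4.0 hours
print(f"disks for 62 s: {disks_needed(TB, 70, 62)}")    # disks for 62 s: 231
```

Under that assumption, the speed-up comes entirely from spreading the same scan over a couple of hundred disks; each individual disk is no faster.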

Page 12: Hadoop and MapReduce

INTRODUCTION TO MapReduce

"Break large problem into smaller parts, solve in parallel, combine results."

Page 13: Hadoop and MapReduce

Typical scenario

How many times does the word ‘IT’ appear? In a short text you would probably just count. But in a 30,000-page document, can you?

Page 14: Hadoop and MapReduce

MapReduce: typical illustration

Page 15: Hadoop and MapReduce

MapReduce paradigm

Input

Map

Shuffle/Sort

Reduce

Output
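The five stages can be imitated on one machine in a few lines of plain Python. This is only a sketch of the data flow, not Hadoop's API: every function name here is made up for illustration, and a real job would spread each stage across many machines.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, collecting (key, value) pairs."""
    return [pair for record in records for pair in mapper(record)]

def shuffle_sort(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups, reducer):
    """Apply the reducer once per key."""
    return [reducer(key, values) for key, values in groups]

def word_mapper(line):
    # Emit (word, 1) for each word in the line.
    return [(word.lower(), 1) for word in line.split()]

def sum_reducer(word, counts):
    # Collapse all the 1s for a word into a single total.
    return (word, sum(counts))

lines = ["IT is everywhere", "is IT here"]          # Input
pairs = map_phase(lines, word_mapper)               # Map
groups = shuffle_sort(pairs)                        # Shuffle/Sort
output = reduce_phase(groups, sum_reducer)          # Reduce
print(output)                                       # Output
# [('everywhere', 1), ('here', 1), ('is', 2), ('it', 2)]
```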

Page 16: Hadoop and MapReduce

MapReduce paradigm

Map: transforms each input record into intermediate (key, value) pairs.
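One concrete shape a map function can take is the Hadoop Streaming style, where the mapper reads text records and writes tab-separated key/value lines. The sketch below follows that convention for the word-count scenario. The function name and the punctuation stripping are illustrative choices, and the sample list stands in for the lines a streaming job would read from standard input.

```python
import string

def map_words(lines):
    """Yield one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for token in line.split():
            # Normalise so that 'IT,' and 'it' end up under the same key.
            word = token.strip(string.punctuation).lower()
            if word:
                yield f"{word}\t1"

# Stand-in for sys.stdin in a real streaming job.
sample = ["IT matters.", "Does IT scale?"]
for out in map_words(sample):
    print(out)
```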

Page 17: Hadoop and MapReduce

MapReduce paradigm

Reduce: transforms all records for a given key into the final output.
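A reduce step for the word-count scenario might look like the sketch below. It assumes its input is already sorted by key, which is what the shuffle/sort stage guarantees, so all records for one key arrive next to each other. The tab-separated line format and the function name are assumptions made for illustration.

```python
from itertools import groupby

def reduce_counts(sorted_lines):
    """Sum the values of consecutive 'key<TAB>value' lines that share a key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"

# sorted() plays the role of the shuffle/sort stage here.
sample = sorted(["it\t1", "is\t1", "it\t1", "here\t1"])
for out in reduce_counts(sample):
    print(out)
# here	1
# is	1
# it	2
```

Because the reducer only ever looks at one run of equal keys at a time, it does not need to hold the whole data set in memory.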

Page 18: Hadoop and MapReduce

MapReduce principles

Move code to data (local computation)

Allow programs to scale transparently w.r.t. size of input

Abstract away fault tolerance, synchronization, etc.
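The first two principles can be illustrated with a small single-machine sketch: the same counting function is handed to every split of the input, and the answer does not depend on how many splits there are. All names here are hypothetical, and the loop over splits stands in for work that a cluster would run in parallel next to the data.

```python
from collections import Counter

def count_split(lines):
    """The 'code' shipped to each split: count words locally."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def run_job(lines, num_splits):
    """Cut the input into num_splits chunks, count each, merge the partials."""
    splits = [lines[i::num_splits] for i in range(num_splits)]
    total = Counter()
    for split in splits:  # a real cluster would process these in parallel
        total += count_split(split)
    return total

data = ["IT is here", "it is there", "where is IT"]
# The program is unchanged whether the input is one split or many.
assert run_job(data, 1) == run_job(data, 3)
print(run_job(data, 3)["it"])  # 3
```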

Page 19: Hadoop and MapReduce

Implementation: Hardware

Page 20: Hadoop and MapReduce

MapReduce: strengths

Batch, offline jobs

Write-once, read-many across full data set

Usually, though not always, simple computations

I/O bound by disk/network bandwidth

Page 21: Hadoop and MapReduce

What it’s not!

High-performance parallel computing, e.g. MPI

Low-latency random access relational database

Always the right solution

Page 22: Hadoop and MapReduce

THANK YOU!

QUESTIONS?
