Introduction to Hadoop and MapReduce

Overview of Hadoop and MapReduce, by Ganesh Neelakanta Iyer, Research Scholar, National University of Singapore


Slides of the workshop conducted at Model Engineering College, Ernakulam, and Sree Narayana Gurukulam College, Kadayiruppu, Kerala, India, in December 2010

Transcript of Introduction to Hadoop and MapReduce

Page 1: Introduction to Hadoop and MapReduce

Overview of Hadoop and MapReduce

Ganesh Neelakanta Iyer

Research Scholar, National University of Singapore

Page 2: Introduction to Hadoop and MapReduce

About Me

I have 3 years of industry work experience:
- Sasken Communication Technologies Ltd, Bangalore
- NXP Semiconductors Pvt Ltd (formerly Philips Semiconductors), Bangalore

I finished my Master's in Electrical and Computer Engineering at NUS (National University of Singapore) in 2008.

Currently a Research Scholar at NUS under the guidance of A/P Bharadwaj Veeravalli.

Research Interests: Cloud computing, game theory, resource allocation and pricing

Personal Interests: Kathakali, teaching, travelling, photography

Page 3: Introduction to Hadoop and MapReduce

Agenda

• Introduction to Hadoop

• Introduction to HDFS

• MapReduce Paradigm

• Some practical MapReduce examples

• MapReduce in Hadoop

• Concluding remarks

Page 4: Introduction to Hadoop and MapReduce

Introduction to Hadoop

Page 5: Introduction to Hadoop and MapReduce

Data!

• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage

• The New York Stock Exchange generates about one terabyte of new trade data per day

• In the last week alone, I took 15 GB of photos while travelling. Imagine the storage required for all the photos taken worldwide in a single day!

Page 6: Introduction to Hadoop and MapReduce

Hadoop

• Open-source cloud computing framework supported by Apache

• Reliable shared storage and analysis system

• Uses a distributed file system (called HDFS), similar to the Google File System (GFS)

• Can be used for a variety of applications

Page 7: Introduction to Hadoop and MapReduce

Typical Hadoop Cluster

Image from Pro Hadoop by Jason Venner

Page 8: Introduction to Hadoop and MapReduce

Typical Hadoop Cluster

Aggregation switch

Rack switch

40 nodes/rack, 1,000-4,000 nodes per cluster

1 Gbps bandwidth within a rack, 8 Gbps out of the rack

Node specs (Yahoo! terasort): 8 x 2 GHz cores, 8 GB RAM, 4 disks (= 4 TB?)
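As a sanity check, the figures above can be combined with a little arithmetic. This is only a sketch using the slide's illustrative numbers; the variable names and the 1 TB/disk assumption are my own:

```python
# Illustrative back-of-the-envelope arithmetic from the slide's figures.
nodes_per_rack = 40
nodes = 4000                  # upper end of the cluster size above
disks_per_node = 4
tb_per_disk = 1               # assumed, to match the "= 4 TB?" per node

racks = nodes // nodes_per_rack
raw_storage_tb = nodes * disks_per_node * tb_per_disk

print(racks)           # 100 racks
print(raw_storage_tb)  # 16000 TB = 16 PB of raw disk (before replication)
```

With 3x HDFS replication, that raw 16 PB would hold roughly 5 PB of user data.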

Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/YahooHadoopIntro-apachecon-us-2008.pdf

Page 9: Introduction to Hadoop and MapReduce

Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf

Page 10: Introduction to Hadoop and MapReduce

Introduction to HDFS

Page 11: Introduction to Hadoop and MapReduce

HDFS – Hadoop Distributed File System

http://www.gartner.com/it/page.jsp?id=1447613

Very large distributed file system
– 10K nodes, 100 million files, 10 PB

Assumes commodity hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them

Optimized for batch processing
– Data locations are exposed so that computations can move to where the data resides
– Provides very high aggregate bandwidth

Runs in user space on heterogeneous operating systems

Page 12: Introduction to Hadoop and MapReduce

Distributed File System

Data coherency
– Write-once-read-many access model
– Clients can only append to existing files

Files are broken up into blocks
– Typically 128 MB block size
– Each block replicated on multiple DataNodes

Intelligent client
– Client can find the location of blocks
– Client accesses data directly from the DataNode
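The block-splitting behaviour described above can be sketched in plain Python. This is a hypothetical illustration of the idea, not the HDFS API:

```python
# Hypothetical sketch (not the HDFS API): how a file is split into
# fixed-size blocks; only the last block may be smaller than block_size.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical HDFS block size
REPLICATION = 3                 # each block is stored on 3 DataNodes

def block_layout(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) pairs for a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

# A 300 MB file occupies three blocks: 128 MB + 128 MB + 44 MB.
layout = block_layout(300 * 1024 * 1024)
print(len(layout))                       # 3
print(layout[-1][1] // (1024 * 1024))    # 44
```

Because each of those blocks is replicated `REPLICATION` times, the 300 MB file consumes about 900 MB of raw cluster storage.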

Page 13: Introduction to Hadoop and MapReduce

MapReduce Paradigm

Page 14: Introduction to Hadoop and MapReduce

MapReduce

Simple data-parallel programming model designed for scalability and fault-tolerance

Framework for distributed processing of large data sets

Originally designed by Google

Pluggable user code runs in generic framework

Pioneered by Google - Processes 20 petabytes of data per day

Page 15: Introduction to Hadoop and MapReduce

What is MapReduce used for?

At Google:
– Index construction for Google Search
– Article clustering for Google News
– Statistical machine translation

At Yahoo!:
– "Web map" powering Yahoo! Search
– Spam detection for Yahoo! Mail

At Facebook:
– Data mining
– Ad optimization
– Spam detection

Page 16: Introduction to Hadoop and MapReduce

What is MapReduce used for?

In research:
– Astronomical image analysis (Washington)
– Bioinformatics (Maryland)
– Analyzing Wikipedia conflicts (PARC)
– Natural language processing (CMU)
– Particle physics (Nebraska)
– Ocean climate simulation (Washington)
– <Your application here>

Page 17: Introduction to Hadoop and MapReduce

MapReduce Programming Model

Data type: key-value records

Map function: (Kin, Vin) → list(Kinter, Vinter)

Reduce function: (Kinter, list(Vinter)) → list(Kout, Vout)

Page 18: Introduction to Hadoop and MapReduce

Example: Word Count

def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))

Page 19: Introduction to Hadoop and MapReduce

Word count data flow (Input → Map → Shuffle & Sort → Reduce → Output):

Input (three splits):
  "the quick brown fox"
  "the fox ate the mouse"
  "how now brown cow"

Map output:
  (the, 1) (quick, 1) (brown, 1) (fox, 1)
  (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)
  (how, 1) (now, 1) (brown, 1) (cow, 1)

After shuffle & sort, Reduce output:
  Reducer 1: (brown, 2) (fox, 2) (how, 1) (now, 1) (the, 3)
  Reducer 2: (ate, 1) (cow, 1) (mouse, 1) (quick, 1)
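The pipeline above can be simulated in a few lines of plain Python; `map_reduce` below is a toy single-process driver (not Hadoop's API) that runs the mapper, groups values by key as shuffle-and-sort would, then runs the reducer:

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Toy single-process MapReduce driver (not the Hadoop API)."""
    # Map phase: emit (key, value) pairs from every input record,
    # grouping values by key as the shuffle & sort step would.
    intermediate = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):
            intermediate[key].append(value)
    # Reduce phase: one reducer call per key, in sorted key order.
    return {k: reducer(k, vs) for k, vs in sorted(intermediate.items())}

def mapper(line):
    for word in line.split():
        yield word, 1

def reducer(key, values):
    return sum(values)

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
print(map_reduce(lines, mapper, reducer))
# {'ate': 1, 'brown': 2, 'cow': 1, 'fox': 2, 'how': 1,
#  'mouse': 1, 'now': 1, 'quick': 1, 'the': 3}
```

The same `map_reduce` driver works unchanged for the later examples (search, sort, inverted index), which is exactly the point of the model.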

Page 20: Introduction to Hadoop and MapReduce

MapReduce Execution Details

Single master controls job execution on multiple slaves

Mappers are preferentially placed on the same node or same rack as their input block
– Minimizes network usage

Mappers save outputs to local disk before serving them to reducers
– Allows recovery if a reducer crashes
– Allows having more reducers than nodes

Page 21: Introduction to Hadoop and MapReduce

Fault Tolerance in MapReduce

1. If a task crashes:
– Retry on another node
– OK for a map because it has no dependencies
– OK for a reduce because map outputs are on disk

If the same task fails repeatedly, fail the job or ignore that input block (user-controlled)

Page 22: Introduction to Hadoop and MapReduce

Fault Tolerance in MapReduce

2. If a node crashes:
– Re-launch its current tasks on other nodes
– Re-run any maps the node previously ran, because their output files were lost along with the crashed node

Page 23: Introduction to Hadoop and MapReduce

Fault Tolerance in MapReduce

3. If a task is going slowly (straggler):
– Launch a second copy of the task on another node ("speculative execution")
– Take the output of whichever copy finishes first, and kill the other

Surprisingly important in large clusters
– Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc.
– A single straggler may noticeably slow down a job

Page 24: Introduction to Hadoop and MapReduce

Takeaways

By providing a data-parallel programming model, MapReduce can control job execution in useful ways:
– Automatic division of the job into tasks
– Automatic placement of computation near data
– Automatic load balancing
– Recovery from failures & stragglers

User focuses on application, not on complexities of distributed computing

Page 25: Introduction to Hadoop and MapReduce

Some practical MapReduce examples

Page 26: Introduction to Hadoop and MapReduce

1. Search

Input: (lineNumber, line) records
Output: lines matching a given pattern

Map:
    if line matches pattern:
        output(line)

Reduce: identity function
Alternative: no reducer (map-only job)
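The search job can be sketched as a map-only pass in Python; the function name and sample records below are illustrative, not a Hadoop API:

```python
import re

def search_mapper(records, pattern):
    """Map: emit every line matching the pattern; no reducer is needed."""
    regex = re.compile(pattern)
    for line_number, line in records:
        if regex.search(line):
            yield line

records = [(1, "error: disk failed"), (2, "all good"), (3, "error: timeout")]
print(list(search_mapper(records, r"error")))  # the two matching lines
```

Skipping the reducer entirely also skips the shuffle, so a map-only search touches each input block exactly once.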

Page 27: Introduction to Hadoop and MapReduce

2. Sort

Input: (key, value) records
Output: same records, sorted by key

Map: identity function
Reduce: identity function

Trick: pick a partitioning function h such that k1 < k2 => h(k1) < h(k2)

Sort data flow:

Map input (identity):
  "pig", "sheep", "yak", "zebra"
  "aardvark", "ant", "bee", "cow", "elephant"

Partitioned by h, with [A-M] going to Reducer 1 and [N-Z] to Reducer 2:
  Reducer 1 ([A-M]): aardvark, ant, bee, cow, elephant
  Reducer 2 ([N-Z]): pig, sheep, yak, zebra

Concatenating the reducer outputs gives a globally sorted result.
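The partitioning trick can be demonstrated in a few lines of Python. `h` below is a hypothetical order-preserving partitioner for lowercase words (not Hadoop's default hash partitioner, which does not preserve order):

```python
# Sketch of the sort trick: an order-preserving partitioner routes each key
# range to one reducer, so concatenating reducer outputs yields a global sort.
def h(key, num_reducers=2):
    """Hypothetical partitioner over lowercase words: [a-m] -> 0, [n-z] -> 1."""
    return 0 if key[0] <= 'm' else 1

words = ["pig", "sheep", "yak", "zebra",
         "aardvark", "ant", "bee", "cow", "elephant"]

partitions = [[], []]
for w in words:
    partitions[h(w)].append(w)        # map side: route each key by h

# Each reducer sorts its own partition; concatenation is globally sorted
# because every key in partition 0 is smaller than every key in partition 1.
globally_sorted = sorted(partitions[0]) + sorted(partitions[1])
print(globally_sorted)
```

Since h(k1) < h(k2) whenever k1 < k2, no key in a later partition can be smaller than a key in an earlier one, which is what makes the concatenation valid.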

Page 28: Introduction to Hadoop and MapReduce

3. Inverted Index

Input: (filename, text) records
Output: list of files containing each word

Map:
    for word in text.split():
        output(word, filename)

Combine: uniquify filenames for each word

Reduce:
    def reduce(word, filenames):
        output(word, sort(filenames))

Page 29: Introduction to Hadoop and MapReduce

Inverted Index Example

hamlet.txt: "to be or not to be"
12th.txt:   "be not afraid of greatness"

Map output:
  (to, hamlet.txt) (be, hamlet.txt) (or, hamlet.txt) (not, hamlet.txt)
  (be, 12th.txt) (not, 12th.txt) (afraid, 12th.txt) (of, 12th.txt) (greatness, 12th.txt)

Reduce output:
  afraid, (12th.txt)
  be, (12th.txt, hamlet.txt)
  greatness, (12th.txt)
  not, (12th.txt, hamlet.txt)
  of, (12th.txt)
  or, (hamlet.txt)
  to, (hamlet.txt)
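A single-process Python sketch of the job, with the combine step (uniquifying filenames) folded in as a set; the function name is my own:

```python
from collections import defaultdict

def inverted_index(documents):
    """Runnable sketch of the inverted-index job (single process, not Hadoop)."""
    index = defaultdict(set)          # set membership = the combine step
    for filename, text in documents:  # map: emit (word, filename) pairs
        for word in text.split():
            index[word].add(filename)
    # reduce: a sorted filename list for each word
    return {word: sorted(files) for word, files in index.items()}

docs = [("hamlet.txt", "to be or not to be"),
        ("12th.txt", "be not afraid of greatness")]
index = inverted_index(docs)
print(index["be"])   # ['12th.txt', 'hamlet.txt']
print(index["to"])   # ['hamlet.txt']
```

Using a set for the combine step means a word repeated within one file (like "to" and "be" in hamlet.txt) still contributes that filename only once.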

Page 30: Introduction to Hadoop and MapReduce

4. Most Popular Words

Input: (filename, text) records
Output: the top 100 words occurring in the most files

Two-stage solution:
Job 1: Create an inverted index, giving (word, list(file)) records
Job 2: Map each (word, list(file)) to (count, word), then sort these records by count as in the sort job
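The two jobs chain naturally, with Job 1's output feeding Job 2. A single-process Python sketch (function names are my own, not a Hadoop API):

```python
from collections import defaultdict

def job1_inverted_index(documents):
    """Job 1: (filename, text) -> {word: set_of_files}."""
    index = defaultdict(set)
    for filename, text in documents:
        for word in text.split():
            index[word].add(filename)
    return index

def job2_top_words(index, n=100):
    """Job 2: rank words by the number of files they occur in."""
    counts = [(len(files), word) for word, files in index.items()]
    return sorted(counts, reverse=True)[:n]   # sort by count, as in the sort job

docs = [("a.txt", "to be or not to be"), ("b.txt", "be not afraid")]
print(job2_top_words(job1_inverted_index(docs), n=3))
# [(2, 'not'), (2, 'be'), (1, 'to')]
```

In real Hadoop the chaining works the same way: Job 2's input path is simply Job 1's output directory.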

Page 31: Introduction to Hadoop and MapReduce

MapReduce in Hadoop

Page 32: Introduction to Hadoop and MapReduce

MapReduce in Hadoop

Three ways to write jobs in Hadoop:
– Java API
– Hadoop Streaming (for Python, Perl, etc.)
– Pipes API (C++)

Page 33: Introduction to Hadoop and MapReduce

Word Count in Python with Hadoop Streaming

Mapper.py:

import sys
for line in sys.stdin:
    for word in line.split():
        print(word.lower() + "\t1")

Reducer.py:

import sys
counts = {}
for line in sys.stdin:
    word, count = line.strip().split("\t")
    counts[word] = counts.get(word, 0) + int(count)
for word, count in counts.items():
    print(word + "\t" + str(count))
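Before submitting to a cluster, the two streaming scripts can be exercised locally: the classic shell pipeline `mapper | sort | reducer` mimics Hadoop Streaming's map, shuffle-and-sort, and reduce phases. The /tmp paths here are illustrative:

```shell
# Write the two streaming scripts (contents match the slide; paths illustrative).
cat > /tmp/wc_mapper.py <<'EOF'
import sys
for line in sys.stdin:
    for word in line.split():
        print(word.lower() + "\t1")
EOF
cat > /tmp/wc_reducer.py <<'EOF'
import sys
counts = {}
for line in sys.stdin:
    word, count = line.strip().split("\t")
    counts[word] = counts.get(word, 0) + int(count)
for word, count in sorted(counts.items()):
    print(word + "\t" + str(count))
EOF

# mapper | sort | reducer approximates map -> shuffle & sort -> reduce;
# prints each distinct word with its total count.
printf 'the quick brown fox\nthe fox ate the mouse\n' \
  | python3 /tmp/wc_mapper.py | sort | python3 /tmp/wc_reducer.py
```

This works because Hadoop Streaming feeds the reducer its input grouped and sorted by key, which is exactly what `sort` provides in the pipeline.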

Page 34: Introduction to Hadoop and MapReduce

Concluding remarks

Page 35: Introduction to Hadoop and MapReduce

Conclusions

The MapReduce programming model hides the complexity of work distribution and fault tolerance

Principal design philosophies:
– Make it scalable, so you can throw hardware at problems
– Make it cheap, lowering hardware, programming and admin costs

MapReduce is not suitable for all problems, but when it works, it may save you quite a bit of time

Cloud computing makes it straightforward to start using Hadoop (or other parallel software) at scale

Page 36: Introduction to Hadoop and MapReduce

What next?

MapReduce has limitations – it suits only certain classes of applications

Some developments:
• Pig, started at Yahoo! Research
• Hive, developed at Facebook
• Amazon Elastic MapReduce

Page 37: Introduction to Hadoop and MapReduce

Resources

Hadoop: http://hadoop.apache.org/core/
Pig: http://hadoop.apache.org/pig
Hive: http://hadoop.apache.org/hive
Video tutorials: http://www.cloudera.com/hadoop-training
Amazon Web Services: http://aws.amazon.com/
Amazon Elastic MapReduce guide: http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/

Based on slides of the talk delivered by Matei Zaharia, EECS, University of California, Berkeley

Page 38: Introduction to Hadoop and MapReduce

Thank you!

[email protected]
http://ganeshniyer.com