Introduction to Hadoop and MapReduce

Overview of Hadoop and MapReduce, by Ganesh Neelakanta Iyer, Research Scholar, National University of Singapore


Slides of the workshop conducted at Model Engineering College, Ernakulam, and Sree Narayana Gurukulam College, Kadayiruppu, Kerala, India, in December 2010

Transcript of Introduction to Hadoop and MapReduce

Page 1: Introduction to Hadoop and MapReduce

Overview of Hadoop and MapReduce

Ganesh Neelakanta Iyer

Research Scholar, National University of Singapore

Page 2: Introduction to Hadoop and MapReduce

About Me

I have 3 years of industry work experience:
- Sasken Communication Technologies Ltd, Bangalore
- NXP Semiconductors Pvt Ltd (formerly Philips Semiconductors), Bangalore

I finished my Master's in Electrical and Computer Engineering at NUS (National University of Singapore) in 2008.

Currently a Research Scholar at NUS under the guidance of A/P Bharadwaj Veeravalli.

Research Interests: Cloud computing, game theory, resource allocation and pricing

Personal Interests: Kathakali, teaching, travelling, photography

Page 3: Introduction to Hadoop and MapReduce

Agenda

• Introduction to Hadoop

• Introduction to HDFS

• MapReduce Paradigm

• Some practical MapReduce examples

• MapReduce in Hadoop

• Concluding remarks

Page 4: Introduction to Hadoop and MapReduce

Introduction to Hadoop

Page 5: Introduction to Hadoop and MapReduce

Data!

• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage

• The New York Stock Exchange generates about one terabyte of new trade data per day

• In the last week alone, I took 15 GB of photos while travelling. Imagine the storage required for all the photos taken worldwide in a single day!

Page 6: Introduction to Hadoop and MapReduce

Hadoop

• Open-source cloud computing framework supported by Apache

• Reliable shared storage and analysis system

• Uses a distributed file system (called HDFS), similar to the Google File System (GFS)

• Can be used for a variety of applications

Page 7: Introduction to Hadoop and MapReduce

Typical Hadoop Cluster

Image from Pro Hadoop by Jason Venner

Page 8: Introduction to Hadoop and MapReduce

Typical Hadoop Cluster

Aggregation switch

Rack switch

40 nodes/rack, 1,000-4,000 nodes per cluster

1 Gbps bandwidth within a rack, 8 Gbps out of the rack

Node specs (Yahoo! terasort): 8 x 2 GHz cores, 8 GB RAM, 4 disks (= 4 TB?)
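As a sanity check, the figures above can be combined with a little arithmetic. This is only a sketch using the slide's illustrative numbers; the variable names and the 1 TB/disk assumption are my own:

```python
# Illustrative back-of-the-envelope arithmetic from the slide's figures.
nodes_per_rack = 40
nodes = 4000                  # upper end of the cluster size above
disks_per_node = 4
tb_per_disk = 1               # assumed, to match the "= 4 TB?" per node

racks = nodes // nodes_per_rack
raw_storage_tb = nodes * disks_per_node * tb_per_disk

print(racks)           # 100 racks
print(raw_storage_tb)  # 16000 TB = 16 PB of raw disk (before replication)
```

With 3x HDFS replication, that raw 16 PB would hold roughly 5 PB of user data.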

Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/YahooHadoopIntro-apachecon-us-2008.pdf

Page 9: Introduction to Hadoop and MapReduce

Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf

Page 10: Introduction to Hadoop and MapReduce

Introduction to HDFS

Page 11: Introduction to Hadoop and MapReduce

HDFS – Hadoop Distributed File System

http://www.gartner.com/it/page.jsp?id=1447613

Very large distributed file system
– 10K nodes, 100 million files, 10 PB

Assumes commodity hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them

Optimized for batch processing
– Data locations are exposed so that computations can move to where the data resides
– Provides very high aggregate bandwidth

Runs in user space on heterogeneous operating systems

Page 12: Introduction to Hadoop and MapReduce

Distributed File System

Data coherency
– Write-once-read-many access model
– Clients can only append to existing files

Files are broken up into blocks
– Typically 128 MB block size
– Each block replicated on multiple DataNodes

Intelligent client
– Client can find the location of blocks
– Client accesses data directly from the DataNode
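The block-splitting behaviour described above can be sketched in plain Python. This is a hypothetical illustration of the idea, not the HDFS API:

```python
# Hypothetical sketch (not the HDFS API): how a file is split into
# fixed-size blocks; only the last block may be smaller than block_size.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical HDFS block size
REPLICATION = 3                 # each block is stored on 3 DataNodes

def block_layout(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) pairs for a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

# A 300 MB file occupies three blocks: 128 MB + 128 MB + 44 MB.
layout = block_layout(300 * 1024 * 1024)
print(len(layout))                       # 3
print(layout[-1][1] // (1024 * 1024))    # 44
```

Because each of those blocks is replicated `REPLICATION` times, the 300 MB file consumes about 900 MB of raw cluster storage.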

Page 13: Introduction to Hadoop and MapReduce

MapReduce Paradigm

Page 14: Introduction to Hadoop and MapReduce

MapReduce

Simple data-parallel programming model designed for scalability and fault-tolerance

Framework for distributed processing of large data sets

Originally designed by Google

Pluggable user code runs in generic framework

Pioneered by Google - Processes 20 petabytes of data per day

Page 15: Introduction to Hadoop and MapReduce

What is MapReduce used for?

At Google:
– Index construction for Google Search
– Article clustering for Google News
– Statistical machine translation

At Yahoo!:
– "Web map" powering Yahoo! Search
– Spam detection for Yahoo! Mail

At Facebook:
– Data mining
– Ad optimization
– Spam detection

Page 16: Introduction to Hadoop and MapReduce

What is MapReduce used for?

In research:
– Astronomical image analysis (Washington)
– Bioinformatics (Maryland)
– Analyzing Wikipedia conflicts (PARC)
– Natural language processing (CMU)
– Particle physics (Nebraska)
– Ocean climate simulation (Washington)
– <Your application here>

Page 17: Introduction to Hadoop and MapReduce

MapReduce Programming Model

Data type: key-value records

Map function: (Kin, Vin) → list(Kinter, Vinter)

Reduce function: (Kinter, list(Vinter)) → list(Kout, Vout)

Page 18: Introduction to Hadoop and MapReduce

Example: Word Count

def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))

Page 19: Introduction to Hadoop and MapReduce

Word count data flow (Input → Map → Shuffle & Sort → Reduce → Output):

Input (three splits):
  "the quick brown fox"
  "the fox ate the mouse"
  "how now brown cow"

Map output:
  (the, 1) (quick, 1) (brown, 1) (fox, 1)
  (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)
  (how, 1) (now, 1) (brown, 1) (cow, 1)

After shuffle & sort, Reduce output:
  Reducer 1: (brown, 2) (fox, 2) (how, 1) (now, 1) (the, 3)
  Reducer 2: (ate, 1) (cow, 1) (mouse, 1) (quick, 1)
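The pipeline above can be simulated in a few lines of plain Python; `map_reduce` below is a toy single-process driver (not Hadoop's API) that runs the mapper, groups values by key as shuffle-and-sort would, then runs the reducer:

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Toy single-process MapReduce driver (not the Hadoop API)."""
    # Map phase: emit (key, value) pairs from every input record,
    # grouping values by key as the shuffle & sort step would.
    intermediate = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):
            intermediate[key].append(value)
    # Reduce phase: one reducer call per key, in sorted key order.
    return {k: reducer(k, vs) for k, vs in sorted(intermediate.items())}

def mapper(line):
    for word in line.split():
        yield word, 1

def reducer(key, values):
    return sum(values)

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
print(map_reduce(lines, mapper, reducer))
# {'ate': 1, 'brown': 2, 'cow': 1, 'fox': 2, 'how': 1,
#  'mouse': 1, 'now': 1, 'quick': 1, 'the': 3}
```

The same `map_reduce` driver works unchanged for the later examples (search, sort, inverted index), which is exactly the point of the model.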

Page 20: Introduction to Hadoop and MapReduce

MapReduce Execution Details

Single master controls job execution on multiple slaves

Mappers are preferentially placed on the same node or same rack as their input block
– Minimizes network usage

Mappers save outputs to local disk before serving them to reducers
– Allows recovery if a reducer crashes
– Allows having more reducers than nodes

Page 21: Introduction to Hadoop and MapReduce

Fault Tolerance in MapReduce

1. If a task crashes:
– Retry on another node
– OK for a map because it has no dependencies
– OK for a reduce because map outputs are on disk

If the same task fails repeatedly, fail the job or ignore that input block (user-controlled)

Page 22: Introduction to Hadoop and MapReduce

Fault Tolerance in MapReduce

2. If a node crashes:
– Re-launch its current tasks on other nodes
– Re-run any maps the node previously ran, because their output files were lost along with the crashed node

Page 23: Introduction to Hadoop and MapReduce

Fault Tolerance in MapReduce

3. If a task is going slowly (straggler):
– Launch a second copy of the task on another node ("speculative execution")
– Take the output of whichever copy finishes first, and kill the other

Surprisingly important in large clusters
– Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc.
– A single straggler may noticeably slow down a job

Page 24: Introduction to Hadoop and MapReduce

Takeaways

By providing a data-parallel programming model, MapReduce can control job execution in useful ways:
– Automatic division of the job into tasks
– Automatic placement of computation near data
– Automatic load balancing
– Recovery from failures & stragglers

User focuses on application, not on complexities of distributed computing

Page 25: Introduction to Hadoop and MapReduce

Some practical MapReduce examples

Page 26: Introduction to Hadoop and MapReduce

1. Search

Input: (lineNumber, line) records
Output: lines matching a given pattern

Map:
    if line matches pattern:
        output(line)

Reduce: identity function
Alternative: no reducer (map-only job)
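The search job can be sketched as a map-only pass in Python; the function name and sample records below are illustrative, not a Hadoop API:

```python
import re

def search_mapper(records, pattern):
    """Map: emit every line matching the pattern; no reducer is needed."""
    regex = re.compile(pattern)
    for line_number, line in records:
        if regex.search(line):
            yield line

records = [(1, "error: disk failed"), (2, "all good"), (3, "error: timeout")]
print(list(search_mapper(records, r"error")))  # the two matching lines
```

Skipping the reducer entirely also skips the shuffle, so a map-only search touches each input block exactly once.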

Page 27: Introduction to Hadoop and MapReduce

2. Sort

Input: (key, value) records
Output: same records, sorted by key

Map: identity function
Reduce: identity function

Trick: pick a partitioning function h such that k1 < k2 => h(k1) < h(k2)

Sort data flow:

Map input (identity):
  "pig", "sheep", "yak", "zebra"
  "aardvark", "ant", "bee", "cow", "elephant"

Partitioned by h, with [A-M] going to Reducer 1 and [N-Z] to Reducer 2:
  Reducer 1 ([A-M]): aardvark, ant, bee, cow, elephant
  Reducer 2 ([N-Z]): pig, sheep, yak, zebra

Concatenating the reducer outputs gives a globally sorted result.
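The partitioning trick can be demonstrated in a few lines of Python. `h` below is a hypothetical order-preserving partitioner for lowercase words (not Hadoop's default hash partitioner, which does not preserve order):

```python
# Sketch of the sort trick: an order-preserving partitioner routes each key
# range to one reducer, so concatenating reducer outputs yields a global sort.
def h(key, num_reducers=2):
    """Hypothetical partitioner over lowercase words: [a-m] -> 0, [n-z] -> 1."""
    return 0 if key[0] <= 'm' else 1

words = ["pig", "sheep", "yak", "zebra",
         "aardvark", "ant", "bee", "cow", "elephant"]

partitions = [[], []]
for w in words:
    partitions[h(w)].append(w)        # map side: route each key by h

# Each reducer sorts its own partition; concatenation is globally sorted
# because every key in partition 0 is smaller than every key in partition 1.
globally_sorted = sorted(partitions[0]) + sorted(partitions[1])
print(globally_sorted)
```

Since h(k1) < h(k2) whenever k1 < k2, no key in a later partition can be smaller than a key in an earlier one, which is what makes the concatenation valid.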

Page 28: Introduction to Hadoop and MapReduce

3. Inverted Index

Input: (filename, text) records
Output: list of files containing each word

Map:
    for word in text.split():
        output(word, filename)

Combine: uniquify filenames for each word

Reduce:
    def reduce(word, filenames):
        output(word, sort(filenames))

Page 29: Introduction to Hadoop and MapReduce

Inverted Index Example

hamlet.txt: "to be or not to be"
12th.txt:   "be not afraid of greatness"

Map output:
  (to, hamlet.txt) (be, hamlet.txt) (or, hamlet.txt) (not, hamlet.txt)
  (be, 12th.txt) (not, 12th.txt) (afraid, 12th.txt) (of, 12th.txt) (greatness, 12th.txt)

Reduce output:
  afraid, (12th.txt)
  be, (12th.txt, hamlet.txt)
  greatness, (12th.txt)
  not, (12th.txt, hamlet.txt)
  of, (12th.txt)
  or, (hamlet.txt)
  to, (hamlet.txt)
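A single-process Python sketch of the job, with the combine step (uniquifying filenames) folded in as a set; the function name is my own:

```python
from collections import defaultdict

def inverted_index(documents):
    """Runnable sketch of the inverted-index job (single process, not Hadoop)."""
    index = defaultdict(set)          # set membership = the combine step
    for filename, text in documents:  # map: emit (word, filename) pairs
        for word in text.split():
            index[word].add(filename)
    # reduce: a sorted filename list for each word
    return {word: sorted(files) for word, files in index.items()}

docs = [("hamlet.txt", "to be or not to be"),
        ("12th.txt", "be not afraid of greatness")]
index = inverted_index(docs)
print(index["be"])   # ['12th.txt', 'hamlet.txt']
print(index["to"])   # ['hamlet.txt']
```

Using a set for the combine step means a word repeated within one file (like "to" and "be" in hamlet.txt) still contributes that filename only once.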

Page 30: Introduction to Hadoop and MapReduce

4. Most Popular Words

Input: (filename, text) records
Output: the top 100 words occurring in the most files

Two-stage solution:
Job 1: Create an inverted index, giving (word, list(file)) records
Job 2: Map each (word, list(file)) to (count, word), then sort these records by count as in the sort job
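The two jobs chain naturally, with Job 1's output feeding Job 2. A single-process Python sketch (function names are my own, not a Hadoop API):

```python
from collections import defaultdict

def job1_inverted_index(documents):
    """Job 1: (filename, text) -> {word: set_of_files}."""
    index = defaultdict(set)
    for filename, text in documents:
        for word in text.split():
            index[word].add(filename)
    return index

def job2_top_words(index, n=100):
    """Job 2: rank words by the number of files they occur in."""
    counts = [(len(files), word) for word, files in index.items()]
    return sorted(counts, reverse=True)[:n]   # sort by count, as in the sort job

docs = [("a.txt", "to be or not to be"), ("b.txt", "be not afraid")]
print(job2_top_words(job1_inverted_index(docs), n=3))
# [(2, 'not'), (2, 'be'), (1, 'to')]
```

In real Hadoop the chaining works the same way: Job 2's input path is simply Job 1's output directory.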

Page 31: Introduction to Hadoop and MapReduce

MapReduce in Hadoop

Page 32: Introduction to Hadoop and MapReduce

MapReduce in Hadoop

Three ways to write jobs in Hadoop:
– Java API
– Hadoop Streaming (for Python, Perl, etc.)
– Pipes API (C++)

Page 33: Introduction to Hadoop and MapReduce

Word Count in Python with Hadoop Streaming

Mapper.py:

import sys
for line in sys.stdin:
    for word in line.split():
        print(word.lower() + "\t1")

Reducer.py:

import sys
counts = {}
for line in sys.stdin:
    word, count = line.strip().split("\t")
    counts[word] = counts.get(word, 0) + int(count)
for word, count in counts.items():
    print(word + "\t" + str(count))
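Before submitting to a cluster, the two streaming scripts can be exercised locally: the classic shell pipeline `mapper | sort | reducer` mimics Hadoop Streaming's map, shuffle-and-sort, and reduce phases. The /tmp paths here are illustrative:

```shell
# Write the two streaming scripts (contents match the slide; paths illustrative).
cat > /tmp/wc_mapper.py <<'EOF'
import sys
for line in sys.stdin:
    for word in line.split():
        print(word.lower() + "\t1")
EOF
cat > /tmp/wc_reducer.py <<'EOF'
import sys
counts = {}
for line in sys.stdin:
    word, count = line.strip().split("\t")
    counts[word] = counts.get(word, 0) + int(count)
for word, count in sorted(counts.items()):
    print(word + "\t" + str(count))
EOF

# mapper | sort | reducer approximates map -> shuffle & sort -> reduce;
# prints each distinct word with its total count.
printf 'the quick brown fox\nthe fox ate the mouse\n' \
  | python3 /tmp/wc_mapper.py | sort | python3 /tmp/wc_reducer.py
```

This works because Hadoop Streaming feeds the reducer its input grouped and sorted by key, which is exactly what `sort` provides in the pipeline.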

Page 34: Introduction to Hadoop and MapReduce

Concluding remarks

Page 35: Introduction to Hadoop and MapReduce

Conclusions

The MapReduce programming model hides the complexity of work distribution and fault tolerance

Principal design philosophies:
– Make it scalable, so you can throw hardware at problems
– Make it cheap, lowering hardware, programming and admin costs

MapReduce is not suitable for all problems, but when it works, it may save you quite a bit of time

Cloud computing makes it straightforward to start using Hadoop (or other parallel software) at scale

Page 36: Introduction to Hadoop and MapReduce

What next?

MapReduce has limitations – it suits only certain classes of applications

Some developments:
• Pig, started at Yahoo! Research
• Hive, developed at Facebook
• Amazon Elastic MapReduce

Page 37: Introduction to Hadoop and MapReduce

Resources

Hadoop: http://hadoop.apache.org/core/
Pig: http://hadoop.apache.org/pig
Hive: http://hadoop.apache.org/hive
Video tutorials: http://www.cloudera.com/hadoop-training
Amazon Web Services: http://aws.amazon.com/
Amazon Elastic MapReduce guide: http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/

Based on slides of the talk delivered by Matei Zaharia, EECS, University of California, Berkeley

Page 38: Introduction to Hadoop and MapReduce

Thank you!

[email protected]
http://ganeshniyer.com