Aws dc elastic-mapreduce

Elastic MapReduce: Outsourcing BigData - Nathan McCourtney (@beaknit)

Transcript of Aws dc elastic-mapreduce

Page 1: Aws dc elastic-mapreduce

Elastic MapReduce: Outsourcing BigData

Nathan McCourtney (@beaknit)

Page 2: Aws dc elastic-mapreduce

What is MapReduce?

From Wikipedia: MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).

"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.

Page 3: Aws dc elastic-mapreduce

The Map

Mapping involves taking raw data and converting it into a series of symbols. For example, DNA sequencing:

ddATP -> A
ddGTP -> G
ddCTP -> C
ddTTP -> T

Results in representations like "GATTACA"

Page 4: Aws dc elastic-mapreduce

Practical Mapping

Inputs are generally flat files containing lines of text. For example, clever_critters.txt:

foxes are clever

cats are clever

Files are read in and fed to a mapper one line at a time via STDIN.

cat clever_critters.txt | mapper.rb

Page 5: Aws dc elastic-mapreduce

Practical Mapping Cont'd

The mapper processes each line and outputs a key/value pair to STDOUT for each symbol it maps:

foxes 1

are 1

clever 1

cats 1

are 1

clever 1
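A minimal word-count mapper that would produce output like the above might look like the sketch below (the file name mapper.rb is an illustrative assumption; the tab between key and value follows the Hadoop streaming convention):

#!/usr/bin/env ruby
# mapper.rb (illustrative name) - minimal word-count mapper sketch.
# Reads lines from STDIN and emits one tab-separated "word 1" pair per word.
ARGF.each do |line|
  line.chomp.split(/\s+/).each do |word|
    puts "#{word}\t1"
  end
end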

Page 6: Aws dc elastic-mapreduce

Work Partitioning

These key/value pairs are passed to a "partition function" which organizes the output and assigns it to reducer nodes

foxes -> node 1
are -> node 2
clever -> node 3
cats -> node 4
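Hadoop's default partitioner hashes the key modulo the number of reducers, so every occurrence of a key lands on the same node. A rough Ruby sketch of the idea (the node count and hash function here are illustrative assumptions, not Hadoop's actual code):

# Illustrative hash partitioner sketch (not Hadoop's real implementation).
NUM_REDUCERS = 4   # assumed number of reducer nodes for this example
def partition(key)
  key.hash % NUM_REDUCERS   # same key always maps to the same reducer within a run
end
puts partition("clever")    # prints a reducer index between 0 and 3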

Page 7: Aws dc elastic-mapreduce

Practical Reduction

The reducers each receive the sharded workload assigned to them by the partitioning. Typically the work is received as a stream of key/value pairs via STDIN:

"foxes 1" -> node 1
"are 1|are 1" -> node 2
"clever 1|clever 1" -> node 3
"cats 1|cats 1" -> node 4

Page 8: Aws dc elastic-mapreduce

Practical Reduction Cont'd

The reduction is essentially whatever you want it to be. There are common patterns that are often pre-solved by the MapReduce framework; see Hadoop's built-in reducers. For example, "aggregate" gives a total of all the values per key:

foxes - 1
are - 2
clever - 2
cats - 1
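A hand-rolled version of that aggregation, as a sketch (the built-in aggregate reducer already does this for you; reducer.rb is an assumed name, and tab-separated mapper output is assumed):

#!/usr/bin/env ruby
# reducer.rb (illustrative name) - minimal sum-reducer sketch.
# Totals the values seen for each key and prints one line per key.
counts = Hash.new(0)
ARGF.each do |line|
  key, value = line.chomp.split("\t")
  counts[key] += value.to_i
end
counts.each { |key, total| puts "#{key}\t#{total}" }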

Page 9: Aws dc elastic-mapreduce

What is Hadoop?

From Wikipedia: Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.

Essentially, Hadoop is a practical implementation of all the pieces you'd need to accomplish everything we've discussed thus far. It takes in the data, organizes the tasks, passes the data through its entire path and finally outputs the reduction.

Page 10: Aws dc elastic-mapreduce

Hadoop's Guts

[Diagram of Hadoop's MapReduce implementation]
source: http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html

Page 11: Aws dc elastic-mapreduce

Fun to build?

No

Page 12: Aws dc elastic-mapreduce

Solution?

Amazon's Elastic MapReduce

Page 13: Aws dc elastic-mapreduce

Look complex? It's not.

1. Sign up for the service
2. Download the tools (requires Ruby 1.8)
3. mkdir ~/elastic-mapreduce-cli; cd ~/elastic-mapreduce-cli
4. Create your credentials.json file

{

"access_id": "<key>",

"private_key": "<secret key>",

"keypair": "<name of keypair>",

"key-pair-file": "~/.ssh/<key>.pem",

"log_uri": "s3://<unique s3 bucket/",

"region": "us-east-1"

}

5. unzip ~/Downloads/elastic-mapreduce-ruby.zip

Page 14: Aws dc elastic-mapreduce

Run it

ruby elastic-mapreduce --list
ruby elastic-mapreduce --create --alive
ruby elastic-mapreduce --list
ruby elastic-mapreduce --terminate <JobFlowID>

Note you can also view the job flow in the Amazon EMR web interface. Logs can be viewed by looking into the s3 bucket you specified in your credentials.json file; just drill down via the s3 web interface and double-click the file.

Page 15: Aws dc elastic-mapreduce

Creating a minimal job

1. Set up a dedicated s3 bucket
2. Create a folder called "input" in that bucket
3. Upload your inputs into s3://bucket/input

s3cmd put *.log s3://bucket/input/

Page 16: Aws dc elastic-mapreduce

Minimal Job Cont'd

4. Write a mapper, e.g.:

ARGF.each do |line|
  # remove any newline
  line = line.chomp
  if /ERROR/.match(line)
    puts "ERROR\t1"
  end
  if /INFO/.match(line)
    puts "INFO\t1"
  end
  if /DEBUG/.match(line)
    puts "DEBUG\t1"
  end
end

See http://www.cloudera.com/blog/2011/01/map-reduce-with-ruby-using-apache-hadoop/ for examples
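Before uploading, the mapper can be smoke-tested locally by piping a sample log through it (sample.log is an assumed local file; the sort step mimics Hadoop's shuffle ordering):

cat sample.log | ruby mapper.rb | sort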

Page 17: Aws dc elastic-mapreduce

Minimal Job Cont'd

5. Upload your mapper to your s3 bucket

s3cmd put mapper.rb s3://bucket

6. Run it:

elastic-mapreduce --create --stream \

--mapper s3://bucket/mapper.rb \

--input s3://bucket/input \

--output s3://bucket/output \

--reducer aggregate

NOTE: This job uses the built-in aggregator.
NOTE: The output directory must NOT exist at the time of the run.
Amazon will scale EC2 instances to consume the load dynamically.

7. Pick up your results in the output folder
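For example, with s3cmd, assuming Hadoop's default part-file naming for the output:

s3cmd get s3://bucket/output/part-00000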

Page 18: Aws dc elastic-mapreduce

AWS Demo App

AWS has a very cool publicly-available app to run:

elastic-mapreduce --create --stream \

--mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \

--input s3://elasticmapreduce/samples/wordcount/input \

--output s3://bucket/output \

--reducer aggregate

See Amazon Example Doc

Page 19: Aws dc elastic-mapreduce

Possibilities

EMR is a fully-functional Hadoop implementation. Mappers and reducers can be written in Python, Ruby, PHP, and Java. Go crazy.