Aws dc elastic-mapreduce

Elastic MapReduce: Outsourcing BigData - Nathan McCourtney (@beaknit)

Transcript of Aws dc elastic-mapreduce

Page 1: Aws dc elastic-mapreduce

Elastic MapReduce: Outsourcing BigData

Nathan McCourtney (@beaknit)

Page 2: Aws dc elastic-mapreduce

What is MapReduce?

From Wikipedia: MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).

"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.

Page 3: Aws dc elastic-mapreduce

The Map

Mapping involves taking raw data and converting it into a series of symbols. For example, DNA sequencing:

ddATP -> A
ddGTP -> G
ddCTP -> C
ddTTP -> T

Results in representations like "GATTACA"

Page 4: Aws dc elastic-mapreduce

Practical Mapping

Inputs are generally flat files containing lines of text. For example, clever_critters.txt:

foxes are clever

cats are clever

Files are read in and fed to a mapper one line at a time via STDIN.

cat clever_critters.txt | mapper.rb

Page 5: Aws dc elastic-mapreduce

Practical Mapping Cont'd

The mapper processes each line and outputs a key/value pair to STDOUT for each symbol it maps:

foxes 1

are 1

clever 1

cats 1

are 1

clever 1
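A minimal word-count mapper that would produce output like the above might look like the sketch below (the file name mapper.rb is an illustrative assumption; the tab between key and value follows the Hadoop streaming convention):

#!/usr/bin/env ruby
# mapper.rb (illustrative name) - minimal word-count mapper sketch.
# Reads lines from STDIN and emits one tab-separated "word 1" pair per word.
ARGF.each do |line|
  line.chomp.split(/\s+/).each do |word|
    puts "#{word}\t1"
  end
end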

Page 6: Aws dc elastic-mapreduce

Work Partitioning

These key/value pairs are passed to a "partition function" which organizes the output and assigns it to reducer nodes

foxes -> node 1
are -> node 2
clever -> node 3
cats -> node 4
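Hadoop's default partitioner hashes the key modulo the number of reducers, so every occurrence of a key lands on the same node. A rough Ruby sketch of the idea (the node count and hash function here are illustrative assumptions, not Hadoop's actual code):

# Illustrative hash partitioner sketch (not Hadoop's real implementation).
NUM_REDUCERS = 4   # assumed number of reducer nodes for this example
def partition(key)
  key.hash % NUM_REDUCERS   # same key always maps to the same reducer within a run
end
puts partition("clever")    # prints a reducer index between 0 and 3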

Page 7: Aws dc elastic-mapreduce

Practical Reduction

The reducers each receive the sharded workload assigned to them by the partitioning. Typically the work is received as a stream of key/value pairs via STDIN:

"foxes 1" -> node 1
"are 1|are 1" -> node 2
"clever 1|clever 1" -> node 3
"cats 1|cats 1" -> node 4

Page 8: Aws dc elastic-mapreduce

Practical Reduction Cont'd

The reduction is essentially whatever you want it to be. There are common patterns that are often pre-solved by the MapReduce framework; see Hadoop's built-in reducers. For example, "aggregate" gives a total of all the values per key:

foxes - 1
are - 2
clever - 2
cats - 1
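A hand-rolled version of that aggregation, as a sketch (the built-in aggregate reducer already does this for you; reducer.rb is an assumed name, and tab-separated mapper output is assumed):

#!/usr/bin/env ruby
# reducer.rb (illustrative name) - minimal sum-reducer sketch.
# Totals the values seen for each key and prints one line per key.
counts = Hash.new(0)
ARGF.each do |line|
  key, value = line.chomp.split("\t")
  counts[key] += value.to_i
end
counts.each { |key, total| puts "#{key}\t#{total}" }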

Page 9: Aws dc elastic-mapreduce

What is Hadoop?

From Wikipedia: Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.

Essentially, Hadoop is a practical implementation of all the pieces you'd need to accomplish everything we've discussed thus far. It takes in the data, organizes the tasks, passes the data through its entire path and finally outputs the reduction.

Page 10: Aws dc elastic-mapreduce

Hadoop's Guts

[Diagram of Hadoop's MapReduce implementation]
source: http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html

Page 11: Aws dc elastic-mapreduce

Fun to build?

No

Page 12: Aws dc elastic-mapreduce

Solution?

Amazon's Elastic MapReduce

Page 13: Aws dc elastic-mapreduce

Look complex? It's not.

1. Sign up for the service
2. Download the tools (requires Ruby 1.8)
3. mkdir ~/elastic-mapreduce-cli; cd ~/elastic-mapreduce-cli
4. Create your credentials.json file

{

"access_id": "<key>",

"private_key": "<secret key>",

"keypair": "<name of keypair>",

"key-pair-file": "~/.ssh/<key>.pem",

"log_uri": "s3://<unique s3 bucket/",

"region": "us-east-1"

}

5. unzip ~/Downloads/elastic-mapreduce-ruby.zip

Page 14: Aws dc elastic-mapreduce

Run it

ruby elastic-mapreduce --list
ruby elastic-mapreduce --create --alive
ruby elastic-mapreduce --list
ruby elastic-mapreduce --terminate <JobFlowID>

Note you can also view the job flow in the Amazon EMR web interface. Logs can be viewed by looking into the s3 bucket you specified in your credentials.json file; just drill down via the s3 web interface and double-click the file.

Page 15: Aws dc elastic-mapreduce

Creating a minimal job

1. Set up a dedicated s3 bucket
2. Create a folder called "input" in that bucket
3. Upload your inputs into s3://bucket/input

s3cmd put *.log s3://bucket/input/

Page 16: Aws dc elastic-mapreduce

Minimal Job Cont'd

4. Write a mapper, e.g.:

ARGF.each do |line|
  # remove any newline
  line = line.chomp
  if /ERROR/.match(line)
    puts "ERROR\t1"
  end
  if /INFO/.match(line)
    puts "INFO\t1"
  end
  if /DEBUG/.match(line)
    puts "DEBUG\t1"
  end
end

See http://www.cloudera.com/blog/2011/01/map-reduce-with-ruby-using-apache-hadoop/ for examples
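Before uploading, the mapper can be smoke-tested locally by piping a sample log through it (sample.log is an assumed local file; the sort step mimics Hadoop's shuffle ordering):

cat sample.log | ruby mapper.rb | sort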

Page 17: Aws dc elastic-mapreduce

Minimal Job Cont'd

5. Upload your mapper to your s3 bucket

s3cmd put mapper.rb s3://bucket

6. Run it:

elastic-mapreduce --create --stream \

--mapper s3://bucket/mapper.rb \

--input s3://bucket/input \

--output s3://bucket/output \

--reducer aggregate

NOTE: This job uses the built-in aggregator.
NOTE: The output directory must NOT exist at the time of the run.
Amazon will scale EC2 instances to consume the load dynamically.

7. Pick up your results in the output folder
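For example, with s3cmd, assuming Hadoop's default part-file naming for the output:

s3cmd get s3://bucket/output/part-00000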

Page 18: Aws dc elastic-mapreduce

AWS Demo App

AWS has a very cool publicly-available app to run:

elastic-mapreduce --create --stream \

--mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \

--input s3://elasticmapreduce/samples/wordcount/input \

--output s3://bucket/output \

--reducer aggregate

See Amazon Example Doc

Page 19: Aws dc elastic-mapreduce

Possibilities

EMR is a fully-functional Hadoop implementation. Mappers and reducers can be written in Python, Ruby, PHP, and Java. Go crazy.