AWS DC: Elastic MapReduce


2. What is MapReduce?
From Wikipedia: MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).

"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output: the answer to the problem it was originally trying to solve.

3. The Map
Mapping involves taking raw data and converting it into a series of symbols. For example, DNA sequencing:
ddATP -> A
ddGTP -> G
ddCTP -> C
ddTTP -> T
Results in representations like "GATTACA"

4. Practical Mapping
Inputs are generally flat files containing lines of text.
clever_critters.txt:
foxes are clever
cats are clever
Files are read in and fed to a mapper one line at a time via STDIN:
cat clever_critters.txt | mapper.rb

5. Practical Mapping (cont'd)
The mapper processes the line and outputs a key/value pair to STDOUT for each symbol it maps:
foxes 1
are 1
clever 1
cats 1
are 1
clever 1

6. Work Partitioning
These key/value pairs are passed to a "partition function", which organizes the output and assigns it to reducer nodes:
foxes -> node 1
are -> node 2
clever -> node 3
cats -> node 4

7. Practical Reduction
The reducers each receive the sharded workload assigned to them by the partitioning. Typically the work is received as a stream of key/value pairs via STDIN:
"foxes 1" -> node 1
"are 1|are 1" -> node 2
"clever 1|clever 1" -> node 3
"cats 1|cats 1" -> node 4
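The `mapper.rb` invoked above can be sketched as a small Ruby word-count mapper. This is an illustrative sketch, not the deck's actual script; the `map_line` helper name is an assumption:

```ruby
#!/usr/bin/env ruby
# mapper.rb -- word-count mapper sketch (illustrative, not from the deck).
# For each whitespace-separated token on STDIN, emits a tab-separated
# "word<TAB>1" key/value pair on STDOUT, matching the slide's output.

def map_line(line)
  line.split.map { |word| [word, 1] }
end

if __FILE__ == $PROGRAM_NAME
  STDIN.each_line do |line|
    map_line(line).each { |key, value| puts "#{key}\t#{value}" }
  end
end
```

Run as shown on the slide: `cat clever_critters.txt | mapper.rb`.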
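A partition function is typically just a hash of the key modulo the number of reducers, so that equal keys always land on the same reducer. A minimal sketch in Ruby (illustrative; the deck does not show this code):

```ruby
# partition -- assign a key to one of num_reducers reducer nodes.
# Hashing the key guarantees that every occurrence of the same key
# is routed to the same reducer, which is what makes per-key
# aggregation in the reduce step possible.
def partition(key, num_reducers)
  key.hash.abs % num_reducers
end
```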
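The reduce side of the stream can be sketched the same way: a script that reads "key&lt;TAB&gt;count" pairs from STDIN and sums the counts per key. Again a hedged sketch; the `reduce` helper and `reducer.rb` name are assumptions:

```ruby
#!/usr/bin/env ruby
# reducer.rb -- aggregate reducer sketch (illustrative, not from the deck).
# Reads tab-separated "key<TAB>count" lines and emits one total per key.

def reduce(lines)
  totals = Hash.new(0)
  lines.each do |line|
    key, count = line.chomp.split("\t")
    totals[key] += count.to_i
  end
  totals
end

if __FILE__ == $PROGRAM_NAME
  reduce(STDIN).each { |key, total| puts "#{key}\t#{total}" }
end
```

Chained together, the whole pipeline can be simulated locally as `cat clever_critters.txt | mapper.rb | sort | reducer.rb`.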
8. Practical Reduction (cont'd)
The reduction is essentially whatever you want it to be. There are common patterns that are often pre-solved by the map-reduce framework; see Hadoop's built-in reducers. For example, "aggregate" gives a total of all the key/values:
foxes - 1
are - 2
clever - 2
cats - 1

9. What is Hadoop?
From Wikipedia: Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license.[1] It enables applications to work with thousands of computationally independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.

Essentially, Hadoop is a practical implementation of all the pieces you'd need to accomplish everything we've discussed thus far. It takes in the data, organizes the tasks, passes the data through its entire path, and finally outputs the reduction.

10. Hadoop's Guts
source:

11. Fun to build?
No.

12. Solution?
Amazon's Elastic MapReduce

13. Look complex? It's not.
1. Sign up for the service
2. Download the tools (requires Ruby 1.8)
3. mkdir ~/elastic-mapreduce-cli; cd ~/elastic-mapreduce-cli
4. Create your credentials.json file:
{
  "access_id": "",
  "private_key": "",
  "keypair": "",
  "key-pair-file": "~/.ssh/.pem",
  "log_uri": "s3://
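With credentials in place, launching a streaming job flow is a single command. The invocation below is a sketch from memory of Amazon's classic word-count sample for the elastic-mapreduce CLI; the exact flags may differ by CLI version, and `my-bucket` is a placeholder:

```shell
# Launch a streaming word-count job flow (sketch; flags and sample
# S3 paths are illustrative and may vary by CLI version).
./elastic-mapreduce --create --stream \
  --input   s3n://elasticmapreduce/samples/wordcount/input \
  --mapper  s3n://elasticmapreduce/samples/wordcount/wordSplitter.py \
  --reducer aggregate \
  --output  s3n://my-bucket/wordcount/output
```

Note that `aggregate` here is Hadoop's built-in aggregate reducer from slide 8, so no custom reducer script is needed.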