Map Reduce Examples
Here are a few simple examples of interesting programs that can easily be expressed as MapReduce computations.

Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.

Count of URL Access Frequency: The map function processes logs of web page requests and outputs (URL, 1). The reduce function adds together all values for the same URL and emits a (URL, total count) pair.

Reverse Web-Link Graph: The map function outputs (target, source) pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair (target, list(source)).

Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of (word, frequency) pairs. The map function emits a (hostname, term vector) pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final (hostname, term vector) pair.

Inverted Index: The map function parses each document and emits a sequence of (word, document ID) pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a (word, list(document ID)) pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.

Distributed Sort: The map function extracts the key from each record and emits a (key, record) pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.
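As a concrete illustration, the Inverted Index example above can be sketched in plain Python, with an in-memory dictionary standing in for the shuffle phase. The function names and the deduplication of document IDs are choices of this sketch, not part of the original description:

```python
from collections import defaultdict

def map_inverted_index(doc_id, text):
    # Emit a (word, document ID) pair for every word in the document.
    for word in text.split():
        yield (word, doc_id)

def reduce_inverted_index(word, doc_ids):
    # Sort (and here also deduplicate) the document IDs for the word.
    return (word, sorted(set(doc_ids)))

def build_index(docs):
    # An in-memory dict simulates the shuffle phase: group values by key.
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, d in map_inverted_index(doc_id, text):
            grouped[word].append(d)
    return dict(reduce_inverted_index(w, ids) for w, ids in grouped.items())

docs = {"d1": "map reduce map", "d2": "reduce sort"}
index = build_index(docs)
# e.g. index["reduce"] == ["d1", "d2"]
```

To track word positions as the text suggests, the map function would emit (word, (doc_id, offset)) pairs instead.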
Strategy to Solve a MapReduce Problem

After grouping all the intermediate data, the values of all occurrences of the same key are sorted and grouped together. As a result, after grouping, each key becomes unique in all the intermediate data. Therefore, finding unique keys is the starting point for solving a typical MapReduce problem. The intermediate (key, value) pairs that the Map function must output then follow automatically.

The following examples explain how to define keys and values in such problems.

Problem 1: Count the number of occurrences of each word in a collection of documents. Solution: unique key: each word; intermediate value: number of occurrences.

Problem 2: Count the number of occurrences of words having the same size (the same number of letters) in a collection of documents. Solution: unique key: each word; intermediate value: size of the word.

Problem 3: Count the number of occurrences of anagrams in a collection of documents. (Anagrams are words with the same set of letters but in a different order, e.g., the words "listen" and "silent".) Solution: unique key: the alphabetically sorted sequence of letters of each word (e.g., "eilnst"); intermediate value: number of occurrences.
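A minimal Python sketch of the key choice for Problem 3 (the function name is an assumption of this sketch): sorting a word's letters gives all anagrams the same key, so the framework's grouping merges their counts automatically.

```python
def anagram_key(word):
    # Unique key: the alphabetically sorted sequence of letters.
    return "".join(sorted(word.lower()))

# "listen", "silent", and "enlist" produce the same intermediate key,
# so their (key, 1) pairs are grouped together before the Reduce phase.
pairs = [(anagram_key(w), 1) for w in ["listen", "silent", "enlist"]]
# all three pairs share the key "eilnst"
```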
Transparent Programming Model
• Programs written for cloud implementation need to be automatically parallelized and executed on a large cluster of commodity machines.
• The run-time system should take care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication.
• The programming model should allow programmers without much experience in parallel and distributed systems to easily utilize the resources of a large distributed system.
Scalable Data Processing on Large Clusters
• A web programming model implemented for fast processing and generation of large datasets.
• Applied mainly in web-scale search and cloud computing applications.
• Users specify a map function to generate a set of intermediate key/value pairs.
• Users use a reduce function to merge all intermediate values associated with the same intermediate key.
Google MapReduce
• Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.
• The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges these values together to form a possibly smaller set of values. Typically, just zero or one output value is produced per Reduce invocation.
Hadoop: A software platform originally developed by Yahoo! to enable users to write and run applications over vast amounts of distributed data.
Attractive Features of Hadoop
• Scalable: can easily scale to store and process petabytes of data in the Web space.
• Economical: an open-source MapReduce implementation minimizes the overheads of task spawning and massive data communication.
• Efficient: processes data with a high degree of parallelism across a large number of commodity nodes.
• Reliable: automatically maintains multiple copies of data to facilitate the redeployment of computing tasks on failures.
Explain MapReduce with an example
The computation takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges these values together to form a possibly smaller set of values. Typically, just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows lists of values that are too large to fit in memory to be handled.
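The iterator point can be illustrated with a small Python sketch (the names here are assumptions, not from the source): because reduce consumes its values one at a time, the value stream never needs to be materialized as a list.

```python
def reduce_fn(key, values):
    # `values` is an iterator of string counts; consuming it one value
    # at a time keeps memory use constant regardless of stream length.
    total = 0
    for v in values:
        total += int(v)
    return total

# A generator stands in for an intermediate-value stream that would be
# far too large to hold in memory as a list.
stream = ("1" for _ in range(1_000_000))
total = reduce_fn("the", stream)
# total == 1_000_000
```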
Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
The map function emits each word plus an associated count of occurrences (just '1' in this simple example). The reduce function sums together all counts emitted for a particular word.
In addition, the user writes code to fill in a mapreduce specification object with the names of the input and output files and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user's code is linked together with the MapReduce library (implemented in C++). Appendix A contains the full program text for this example.
Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types:

map    (k1, v1)       → list(k2, v2)
reduce (k2, list(v2)) → list(v2)

That is, the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values.
We have a large collection of text documents in a folder. Count the frequency of distinct words in the documents.
Map function
The map function operates on every key/value pair of the input data and transforms the data based on the transformation logic provided in the map function. The map function always emits intermediate key/value pairs as output.
Map(Key1, Value1) → List(Key2, Value2)

For each file:
  Read each line from the input file.
  Locate each word.
  Emit (word, 1) for every word found.

The emitted (word, 1) pairs form the list that is output from the Map function.
The reduce function takes the list of values for every key and transforms the data based on the (aggregation) logic provided in the reduce function. It is similar to the aggregate functions in standard SQL.
For the List(key, value) output from the mapper:
  Shuffle and sort the data by key.
  Group by key and create the list of values for each key.
Reduce function
Reduce(Key2, List(Value2)) → List(Key3, Value3)

Read each key (a word) and the list of values (1, 1, 1, ...) associated with it.
For each key, add up the list of values to calculate the sum.
Emit (word, sum) for every word found.
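The whole word-count walkthrough above (map, then shuffle/sort, then reduce) can be simulated end to end in a short Python sketch; the function and variable names are assumptions of this sketch, not part of the original text.

```python
from collections import defaultdict

def map_fn(filename, contents):
    # Map(Key1, Value1) -> List(Key2, Value2): emit (word, 1) per word.
    for word in contents.split():
        yield (word, 1)

def shuffle_sort(pairs):
    # Shuffle and sort by key; group the values into a list per key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_fn(word, counts):
    # Reduce(Key2, List(Value2)) -> (Key3, Value3): sum the 1s.
    return (word, sum(counts))

docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog the"}
intermediate = [p for name, text in docs.items() for p in map_fn(name, text)]
result = dict(reduce_fn(w, counts) for w, counts in shuffle_sort(intermediate))
# result["the"] == 3
```

In a real framework the shuffle runs across machines and the reduce receives its values through an iterator rather than an in-memory list, but the data flow is the same.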
Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computations
Distributed Grep The map function emits a line if it matches a supplied pattern The reduce function is an identity function that just copies the supplied intermediate data to the output
Count of URL Access Frequency The map function processes logs of web page requests and outputs (URL 1) The reduce function adds together all values for the same URL and emits a (URL total count) pair
Reverse Web-Link Graph The map function outputs (target source) pairs for each link to a target URL found in a page named source The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair (target list(source))
Term-Vector per Host A term vector summarizes the most important words that occur in a document or a set of documents as a list of (word frequency) pairs The map function emits a (hostname term vector) pair for each input document (where the hostname is extracted from the URL of the document) The reduce function is passed all per-document term vectors for a given host It adds these term vectors together throwing away infrequent terms and then emits a final (hostname term vector) pair
Inverted Index The map function parses each document and emits a sequence of (word document ID) pairs The reduce function accepts all pairs for a given word sorts the corresponding document IDs and emits a (word list(document ID)) pair The set of all output pairs forms a simple inverted index It is easy to augment this computation to keep track of word positions
Distributed Sort The map function extracts the key from each record and emits a (key record) pair The reduce function emits all pairs unchanged This computation depends on the partitioning facilities described in Section 41 and the ordering properties described in Section 42
Strategy to solve MapReduce Problem
After grouping all the intermediate data the values of all occurrences of the same key are sorted and grouped together As a result after grouping each key each key becomes unique in all intermediate data Therefore finding unique keys is the starting point to solving a typical MapReduce problem Then the intermediate (key value) pairs as the output of the Map function will be automatically found
The following examples explain how to define keys and values in such problems
Problem 1 Counting the number of occurrences of each word in a collection of documentsSolution unique key each word intermediate value number of occurrences
Problem 2 Counting the number of occurrences of words having the same size or the same number of letters in a collection of documentsSolution unique key each word intermediate value size of the word
Problem 3 Counting the number of occurrences of anagrams in a collection of documents (Anagrams are words with the same set of letters but in a different order (eg the words ldquolistenrdquo and ldquosilentrdquo)Solution unique key alphabetically sorted sequence of letters for each word (eg ldquoeilnstrdquo) intermediate value number of occurrences
6224 Strategy to Solve MapReduce Problems As mentioned earlier a1048862er grouping all the intermediate data the values of all occurrences of the same key are sorted and grouped together As a result a1048862er grouping each key becomes unique in all intermediate data Therefore finding unique keys is the starting point to solving a typical MapReduce problem Then the intermediate (key value) pairs as the output of the Map function will be automatically found The following three examples explain how to define keys and values in such problems Problem 1 Counting the number of occurrences of each word in a collection of documents Solution unique ldquokeyrdquo each word intermediate ldquovaluerdquo number of occurrencesProblem 2 Counting the number of occurrences of words having the same size or the same number of letters in a collection of documents Solution unique ldquokeyrdquo each word intermediate ldquovaluerdquo size of the word Problem 3 Counting the number of occurrences of anagrams in a collection of documents Anagrams are words with the same set of letters but in a different order (eg the words ldquolistenrdquo and ldquosilentrdquo) Solution unique ldquokeyrdquo alphabetically sorted sequence of letters for each word (eg ldquoeilnstrdquo) intermediate ldquovaluerdquo number of occurrences
Transparent Programming Modelbull Programs written for cloud implementation need to be automatically parallelized and executed on a large cluster of commodity machinesbull The run-time system should take care of the details of partitioning the input data scheduling the programs execution across a set of machines handling machine failures and managing the required inter-machine communicationbull The programming model should allow programmers without many experiences with parallel and distributed systems to easily utilize the resources of a large distributed system
Scalable Data Processing on Large Clustersbull A web programming model implemented for fast processing and generating large datasetsbull Applied mainly in web-scale search and cloud computing applicationsbull Users specify a map function to generate a set of intermediate keyvalue pairsbull Users use a reduce function to merge all intermediate values with the same intermediate key
Google MapReducebull Map written by the user takes an input pair and produces a set of intermediate keyvalue pairs The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function
bull The Reduce function also written by the user accepts an intermediate key I and a set of values for that key It merges together these values to form a possibly smaller set of values Typically just zero or one output value is produced per Reduce invocation
Hadoop A software platform originally developed by Yahoo to enable users write and run applications over vast distributed data
Attractive Features in Hadoop 1048708 Scalable can easily scale to store and process petabytes of data in the Web space1048708 Economical An open-source MapReduce minimizes the overheads in task spawning and massive data communication1048708 Efficient Processing data with high-degree of parallelism across a large number of commodity nodes1048708 Reliable Automatically maintains multiple copies of data to facilitate redeployment of computing tasks on failures
Explain MapReduce with an example
The computation takes a set of input keyvalue pairs and produces a set of output keyvalue pairs The user of the MapReduce library expresses the computation as two functions Map and ReduceMap written by the user takes an input pair and produces a set of intermediate keyvalue pairs The MapReduce library groups together all intermediate values associatedwith the same intermediate key I and passes them to the Reduce function
The Reduce function also written by the user accepts an intermediate key I and a set of values for that key It merges together these values to form a possibly smaller set of values Typically just zero or one output value is produced per Reduce invocation The intermediate values are supplied to the users reduce function via an iteratorThis allows us to handle lists of values that are too large to fit in memory
Consider the problem of counting the n umber of occurrences of each word in a large collection of documents The user would write code similar to the following pseudo-code
map(String key String value) key document name value document contents for each word w in valueEmitIntermediate(w 1)
reduce(String key Iterator values) key a word values a list of counts int result = 0for each v in valuesresult += ParseInt(v)
Emit(AsString(result))
The map function emits each word plus an associated count of occurrences (just lsquo1rsquo in this simple example) The reduce function sums together all counts emitted for a particular word
In addition the user writes code to fill in a mapreduce specification object with the names of the input and out- put files and optional tuning parameters The user then invokes the MapReduce function passing it the specification object The userrsquos code is linked together with the MapReduce library (implemented in C++) Appendix A contains the full program text for this example
Even though the previous pseudo-code is written in terms of string inputs and outputs conceptually the map and reduce functions supplied by the user have associated typesmap (k1v1) list(k2v2)reduce (k2list(v2)) list(v2)
Ie the input keys and values are drawn from a different domain than the output keys and values Furthermore the intermediate keys and values are from the same domainas the output keys and values
We have a large collection of text documents in a folderCount the frequency of distinct words in the documents
Map functionMap function operates on every keyvalue pair of input data and transforms the data based on the transformation logic provided in the map functionMap function always emits an intermediate keyvalue pair as output
Map( Key1 Value1) -gt List ( Key2 Value2 )For each file
Read each line from the input fileLocate each word
Emit the (word1) for every word foundThe emitted (word 1) will form the list that is output from the Map function
Reduce function takes the list of every key and transforms the data based on the (aggregation) logic provided in the reduce function It is similar to the Aggregate functions in Standard SQL
For the List(key value) output from the mapper Shuffle and Sort the data by keyGroup by Key and create the list of values for a key
Reduce functionReduce ( Key2 List(Value2) ) -gt List (Key3 Value3 )Read each key (word) and list of values (1 1 1) associated with it
For each key add the list of values to calculate sumEmit the word sum for every word found
Strategy to solve MapReduce Problem
After grouping all the intermediate data the values of all occurrences of the same key are sorted and grouped together As a result after grouping each key each key becomes unique in all intermediate data Therefore finding unique keys is the starting point to solving a typical MapReduce problem Then the intermediate (key value) pairs as the output of the Map function will be automatically found
The following examples explain how to define keys and values in such problems
Problem 1 Counting the number of occurrences of each word in a collection of documentsSolution unique key each word intermediate value number of occurrences
Problem 2 Counting the number of occurrences of words having the same size or the same number of letters in a collection of documentsSolution unique key each word intermediate value size of the word
Problem 3 Counting the number of occurrences of anagrams in a collection of documents (Anagrams are words with the same set of letters but in a different order (eg the words ldquolistenrdquo and ldquosilentrdquo)Solution unique key alphabetically sorted sequence of letters for each word (eg ldquoeilnstrdquo) intermediate value number of occurrences
6224 Strategy to Solve MapReduce Problems As mentioned earlier a1048862er grouping all the intermediate data the values of all occurrences of the same key are sorted and grouped together As a result a1048862er grouping each key becomes unique in all intermediate data Therefore finding unique keys is the starting point to solving a typical MapReduce problem Then the intermediate (key value) pairs as the output of the Map function will be automatically found The following three examples explain how to define keys and values in such problems Problem 1 Counting the number of occurrences of each word in a collection of documents Solution unique ldquokeyrdquo each word intermediate ldquovaluerdquo number of occurrencesProblem 2 Counting the number of occurrences of words having the same size or the same number of letters in a collection of documents Solution unique ldquokeyrdquo each word intermediate ldquovaluerdquo size of the word Problem 3 Counting the number of occurrences of anagrams in a collection of documents Anagrams are words with the same set of letters but in a different order (eg the words ldquolistenrdquo and ldquosilentrdquo) Solution unique ldquokeyrdquo alphabetically sorted sequence of letters for each word (eg ldquoeilnstrdquo) intermediate ldquovaluerdquo number of occurrences
Transparent Programming Modelbull Programs written for cloud implementation need to be automatically parallelized and executed on a large cluster of commodity machinesbull The run-time system should take care of the details of partitioning the input data scheduling the programs execution across a set of machines handling machine failures and managing the required inter-machine communicationbull The programming model should allow programmers without many experiences with parallel and distributed systems to easily utilize the resources of a large distributed system
Scalable Data Processing on Large Clustersbull A web programming model implemented for fast processing and generating large datasetsbull Applied mainly in web-scale search and cloud computing applicationsbull Users specify a map function to generate a set of intermediate keyvalue pairsbull Users use a reduce function to merge all intermediate values with the same intermediate key
Google MapReducebull Map written by the user takes an input pair and produces a set of intermediate keyvalue pairs The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function
bull The Reduce function also written by the user accepts an intermediate key I and a set of values for that key It merges together these values to form a possibly smaller set of values Typically just zero or one output value is produced per Reduce invocation
Hadoop A software platform originally developed by Yahoo to enable users write and run applications over vast distributed data
Attractive Features in Hadoop 1048708 Scalable can easily scale to store and process petabytes of data in the Web space1048708 Economical An open-source MapReduce minimizes the overheads in task spawning and massive data communication1048708 Efficient Processing data with high-degree of parallelism across a large number of commodity nodes1048708 Reliable Automatically maintains multiple copies of data to facilitate redeployment of computing tasks on failures
Explain MapReduce with an example
The computation takes a set of input keyvalue pairs and produces a set of output keyvalue pairs The user of the MapReduce library expresses the computation as two functions Map and ReduceMap written by the user takes an input pair and produces a set of intermediate keyvalue pairs The MapReduce library groups together all intermediate values associatedwith the same intermediate key I and passes them to the Reduce function
The Reduce function also written by the user accepts an intermediate key I and a set of values for that key It merges together these values to form a possibly smaller set of values Typically just zero or one output value is produced per Reduce invocation The intermediate values are supplied to the users reduce function via an iteratorThis allows us to handle lists of values that are too large to fit in memory
Consider the problem of counting the n umber of occurrences of each word in a large collection of documents The user would write code similar to the following pseudo-code
map(String key String value) key document name value document contents for each word w in valueEmitIntermediate(w 1)
reduce(String key Iterator values) key a word values a list of counts int result = 0for each v in valuesresult += ParseInt(v)
Emit(AsString(result))
The map function emits each word plus an associated count of occurrences (just lsquo1rsquo in this simple example) The reduce function sums together all counts emitted for a particular word
In addition the user writes code to fill in a mapreduce specification object with the names of the input and out- put files and optional tuning parameters The user then invokes the MapReduce function passing it the specification object The userrsquos code is linked together with the MapReduce library (implemented in C++) Appendix A contains the full program text for this example
Even though the previous pseudo-code is written in terms of string inputs and outputs conceptually the map and reduce functions supplied by the user have associated typesmap (k1v1) list(k2v2)reduce (k2list(v2)) list(v2)
Ie the input keys and values are drawn from a different domain than the output keys and values Furthermore the intermediate keys and values are from the same domainas the output keys and values
We have a large collection of text documents in a folderCount the frequency of distinct words in the documents
Map functionMap function operates on every keyvalue pair of input data and transforms the data based on the transformation logic provided in the map functionMap function always emits an intermediate keyvalue pair as output
Map( Key1 Value1) -gt List ( Key2 Value2 )For each file
Read each line from the input fileLocate each word
Emit the (word1) for every word foundThe emitted (word 1) will form the list that is output from the Map function
Reduce function takes the list of every key and transforms the data based on the (aggregation) logic provided in the reduce function It is similar to the Aggregate functions in Standard SQL
For the List(key value) output from the mapper Shuffle and Sort the data by keyGroup by Key and create the list of values for a key
Reduce functionReduce ( Key2 List(Value2) ) -gt List (Key3 Value3 )Read each key (word) and list of values (1 1 1) associated with it
For each key add the list of values to calculate sumEmit the word sum for every word found
Transparent Programming Modelbull Programs written for cloud implementation need to be automatically parallelized and executed on a large cluster of commodity machinesbull The run-time system should take care of the details of partitioning the input data scheduling the programs execution across a set of machines handling machine failures and managing the required inter-machine communicationbull The programming model should allow programmers without many experiences with parallel and distributed systems to easily utilize the resources of a large distributed system
Scalable Data Processing on Large Clustersbull A web programming model implemented for fast processing and generating large datasetsbull Applied mainly in web-scale search and cloud computing applicationsbull Users specify a map function to generate a set of intermediate keyvalue pairsbull Users use a reduce function to merge all intermediate values with the same intermediate key
Google MapReducebull Map written by the user takes an input pair and produces a set of intermediate keyvalue pairs The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function
bull The Reduce function also written by the user accepts an intermediate key I and a set of values for that key It merges together these values to form a possibly smaller set of values Typically just zero or one output value is produced per Reduce invocation
Hadoop A software platform originally developed by Yahoo to enable users write and run applications over vast distributed data
Attractive Features in Hadoop 1048708 Scalable can easily scale to store and process petabytes of data in the Web space1048708 Economical An open-source MapReduce minimizes the overheads in task spawning and massive data communication1048708 Efficient Processing data with high-degree of parallelism across a large number of commodity nodes1048708 Reliable Automatically maintains multiple copies of data to facilitate redeployment of computing tasks on failures
Explain MapReduce with an example
The computation takes a set of input keyvalue pairs and produces a set of output keyvalue pairs The user of the MapReduce library expresses the computation as two functions Map and ReduceMap written by the user takes an input pair and produces a set of intermediate keyvalue pairs The MapReduce library groups together all intermediate values associatedwith the same intermediate key I and passes them to the Reduce function
The Reduce function also written by the user accepts an intermediate key I and a set of values for that key It merges together these values to form a possibly smaller set of values Typically just zero or one output value is produced per Reduce invocation The intermediate values are supplied to the users reduce function via an iteratorThis allows us to handle lists of values that are too large to fit in memory
Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:
  map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

  reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));
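The pseudo-code above can be mirrored in plain Python. This is an illustrative sketch, not the paper's C++ library: EmitIntermediate and Emit are modeled as list appends and a return value, and the grouping that the MapReduce library would normally perform is done inline by hand.

```python
from collections import defaultdict

intermediate = []  # (word, "1") pairs collected by EmitIntermediate

def map_fn(key: str, value: str) -> None:
    # key: document name; value: document contents
    for w in value.split():
        intermediate.append((w, "1"))  # EmitIntermediate(w, "1")

def reduce_fn(key: str, values) -> str:
    # key: a word; values: an iterator of string counts
    result = 0
    for v in values:
        result += int(v)   # ParseInt(v)
    return str(result)     # Emit(AsString(result))

map_fn("doc1", "to be or not to be")

# The library would group intermediate values by key; done here by hand.
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

totals = {word: reduce_fn(word, iter(counts)) for word, counts in groups.items()}
print(totals)  # -> {'to': '2', 'be': '2', 'or': '1', 'not': '1'}
```

Note that reduce_fn consumes its values through an iterator, matching the remark above that value lists too large to fit in memory can still be handled.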
The map function emits each word plus an associated count of occurrences (just '1' in this simple example). The reduce function sums together all counts emitted for a particular word.
In addition, the user writes code to fill in a mapreduce specification object with the names of the input and output files and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user's code is linked together with the MapReduce library (implemented in C++). Appendix A contains the full program text for this example.
Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types:

  map    (k1, v1)       -> list(k2, v2)
  reduce (k2, list(v2)) -> list(v2)
That is, the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values.
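The type relationship can be made concrete with Python type hints; a hypothetical sketch for word count, where k1 is a document name, v1 its contents, k2 a word, and v2 a count string (so the intermediate and output value domains coincide, as noted above):

```python
from typing import Iterable, List, Tuple

# map: (k1, v1) -> list(k2, v2)
def word_count_map(doc_name: str, contents: str) -> List[Tuple[str, str]]:
    return [(w, "1") for w in contents.split()]

# reduce: (k2, list(v2)) -> list(v2)
def word_count_reduce(word: str, counts: Iterable[str]) -> List[str]:
    return [str(sum(int(c) for c in counts))]

pairs = word_count_map("doc1", "a b a")
print(pairs)                               # -> [('a', '1'), ('b', '1'), ('a', '1')]
print(word_count_reduce("a", ["1", "1"]))  # -> ['2']
```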
Problem: we have a large collection of text documents in a folder. Count the frequency of distinct words in the documents.
Map function: the map function operates on every key/value pair of input data and transforms the data based on the transformation logic provided in the map function. The map function always emits an intermediate key/value pair as output.
  Map(Key1, Value1) -> List(Key2, Value2)

For each file:
- Read each line from the input file
- Locate each word
- Emit (word, 1) for every word found

The emitted (word, 1) pairs form the list that is output from the Map function.
The reduce function takes the list of values for every key and transforms the data based on the (aggregation) logic provided in the reduce function. It is similar to the aggregate functions in standard SQL.
For the List(key, value) output from the mapper:
- Shuffle and sort the data by key
- Group by key and create the list of values for each key
Reduce function:

  Reduce(Key2, List(Value2)) -> List(Key3, Value3)

- Read each key (word) and the list of values (1, 1, 1, ...) associated with it
- For each key, add up the list of values to calculate the sum
- Emit (word, sum) for every word found
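The shuffle/sort, group-by-key, and reduce steps above can be sketched in Python; the mapper output here is hypothetical sample data standing in for what several mappers would emit.

```python
from itertools import groupby

# Hypothetical (word, "1") pairs as they might arrive from the mappers
mapper_output = [("be", "1"), ("to", "1"), ("be", "1"), ("or", "1"), ("to", "1")]

# Shuffle and sort the data by key
mapper_output.sort(key=lambda kv: kv[0])

# Group by key and create the list of values for each key
# (groupby requires the sorted input above to group correctly)
grouped = {word: [v for _, v in pairs]
           for word, pairs in groupby(mapper_output, key=lambda kv: kv[0])}
# grouped == {'be': ['1', '1'], 'or': ['1'], 'to': ['1', '1']}

# Reduce: for each key, add up the list of values and emit (word, sum)
result = {word: sum(int(v) for v in values) for word, values in grouped.items()}
print(result)  # -> {'be': 2, 'or': 1, 'to': 2}
```

In a real cluster the sort and grouping are performed by the framework between the map and reduce phases; only the reduce logic in the last line is user code.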