Map Reduce Examples
Here are a few simple examples of interesting programs that can easily be expressed as MapReduce computations.

Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.

Count of URL Access Frequency: The map function processes logs of web page requests and outputs (URL, 1). The reduce function adds together all values for the same URL and emits a (URL, total count) pair.

Reverse Web-Link Graph: The map function outputs (target, source) pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair (target, list(source)).

Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of (word, frequency) pairs. The map function emits a (hostname, term vector) pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final (hostname, term vector) pair.

Inverted Index: The map function parses each document and emits a sequence of (word, document ID) pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a (word, list(document ID)) pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.

Distributed Sort: The map function extracts the key from each record and emits a (key, record) pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.
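As a concrete illustration, the Inverted Index example above can be sketched in plain Python, with an in-memory dictionary standing in for the shuffle phase. The function names and the deduplication of document IDs are choices of this sketch, not part of the original description:

```python
from collections import defaultdict

def map_inverted_index(doc_id, text):
    # Emit a (word, document ID) pair for every word in the document.
    for word in text.split():
        yield (word, doc_id)

def reduce_inverted_index(word, doc_ids):
    # Sort (and here also deduplicate) the document IDs for the word.
    return (word, sorted(set(doc_ids)))

def build_index(docs):
    # An in-memory dict simulates the shuffle phase: group values by key.
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, d in map_inverted_index(doc_id, text):
            grouped[word].append(d)
    return dict(reduce_inverted_index(w, ids) for w, ids in grouped.items())

docs = {"d1": "map reduce map", "d2": "reduce sort"}
index = build_index(docs)
# e.g. index["reduce"] == ["d1", "d2"]
```

To track word positions as the text suggests, the map function would emit (word, (doc_id, offset)) pairs instead.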
Strategy to Solve a MapReduce Problem

After grouping all the intermediate data, the values of all occurrences of the same key are sorted and grouped together. As a result, after grouping, each key becomes unique in all the intermediate data. Therefore, finding unique keys is the starting point for solving a typical MapReduce problem. The intermediate (key, value) pairs that the Map function must output then follow automatically.

The following examples explain how to define keys and values in such problems.

Problem 1: Count the number of occurrences of each word in a collection of documents. Solution: unique key: each word; intermediate value: number of occurrences.

Problem 2: Count the number of occurrences of words having the same size (the same number of letters) in a collection of documents. Solution: unique key: each word; intermediate value: size of the word.

Problem 3: Count the number of occurrences of anagrams in a collection of documents. (Anagrams are words with the same set of letters but in a different order, e.g., the words "listen" and "silent".) Solution: unique key: the alphabetically sorted sequence of letters of each word (e.g., "eilnst"); intermediate value: number of occurrences.
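A minimal Python sketch of the key choice for Problem 3 (the function name is an assumption of this sketch): sorting a word's letters gives all anagrams the same key, so the framework's grouping merges their counts automatically.

```python
def anagram_key(word):
    # Unique key: the alphabetically sorted sequence of letters.
    return "".join(sorted(word.lower()))

# "listen", "silent", and "enlist" produce the same intermediate key,
# so their (key, 1) pairs are grouped together before the Reduce phase.
pairs = [(anagram_key(w), 1) for w in ["listen", "silent", "enlist"]]
# all three pairs share the key "eilnst"
```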
Transparent Programming Model
• Programs written for cloud implementation need to be automatically parallelized and executed on a large cluster of commodity machines.
• The run-time system should take care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication.
• The programming model should allow programmers without much experience in parallel and distributed systems to easily utilize the resources of a large distributed system.
Scalable Data Processing on Large Clusters
• A web programming model implemented for fast processing and generation of large datasets.
• Applied mainly in web-scale search and cloud computing applications.
• Users specify a map function to generate a set of intermediate key/value pairs.
• Users use a reduce function to merge all intermediate values associated with the same intermediate key.
Google MapReduce
• Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.
• The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges these values together to form a possibly smaller set of values. Typically, just zero or one output value is produced per Reduce invocation.
Hadoop: A software platform originally developed by Yahoo! to enable users to write and run applications over vast amounts of distributed data.
Attractive Features of Hadoop
• Scalable: can easily scale to store and process petabytes of data in the Web space.
• Economical: an open-source MapReduce implementation minimizes the overheads of task spawning and massive data communication.
• Efficient: processes data with a high degree of parallelism across a large number of commodity nodes.
• Reliable: automatically maintains multiple copies of data to facilitate the redeployment of computing tasks on failures.
Explain MapReduce with an example
The computation takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges these values together to form a possibly smaller set of values. Typically, just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows lists of values that are too large to fit in memory to be handled.
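The iterator point can be illustrated with a small Python sketch (the names here are assumptions, not from the source): because reduce consumes its values one at a time, the value stream never needs to be materialized as a list.

```python
def reduce_fn(key, values):
    # `values` is an iterator of string counts; consuming it one value
    # at a time keeps memory use constant regardless of stream length.
    total = 0
    for v in values:
        total += int(v)
    return total

# A generator stands in for an intermediate-value stream that would be
# far too large to hold in memory as a list.
stream = ("1" for _ in range(1_000_000))
total = reduce_fn("the", stream)
# total == 1_000_000
```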
Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
The map function emits each word plus an associated count of occurrences (just '1' in this simple example). The reduce function sums together all counts emitted for a particular word.
In addition, the user writes code to fill in a mapreduce specification object with the names of the input and output files and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user's code is linked together with the MapReduce library (implemented in C++). Appendix A contains the full program text for this example.
Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types:

map    (k1, v1)       → list(k2, v2)
reduce (k2, list(v2)) → list(v2)

That is, the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values.
We have a large collection of text documents in a folder. Count the frequency of distinct words in the documents.
Map function
The map function operates on every key/value pair of the input data and transforms the data based on the transformation logic provided in the map function. The map function always emits intermediate key/value pairs as output.
Map(Key1, Value1) → List(Key2, Value2)

For each file:
  Read each line from the input file.
  Locate each word.
  Emit (word, 1) for every word found.

The emitted (word, 1) pairs form the list that is output from the Map function.
The reduce function takes the list of values for every key and transforms the data based on the (aggregation) logic provided in the reduce function. It is similar to the aggregate functions in standard SQL.
For the List(key, value) output from the mapper:
  Shuffle and sort the data by key.
  Group by key and create the list of values for each key.
Reduce function
Reduce(Key2, List(Value2)) → List(Key3, Value3)

Read each key (a word) and the list of values (1, 1, 1, ...) associated with it.
For each key, add up the list of values to calculate the sum.
Emit (word, sum) for every word found.
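The whole word-count walkthrough above (map, then shuffle/sort, then reduce) can be simulated end to end in a short Python sketch; the function and variable names are assumptions of this sketch, not part of the original text.

```python
from collections import defaultdict

def map_fn(filename, contents):
    # Map(Key1, Value1) -> List(Key2, Value2): emit (word, 1) per word.
    for word in contents.split():
        yield (word, 1)

def shuffle_sort(pairs):
    # Shuffle and sort by key; group the values into a list per key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_fn(word, counts):
    # Reduce(Key2, List(Value2)) -> (Key3, Value3): sum the 1s.
    return (word, sum(counts))

docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog the"}
intermediate = [p for name, text in docs.items() for p in map_fn(name, text)]
result = dict(reduce_fn(w, counts) for w, counts in shuffle_sort(intermediate))
# result["the"] == 3
```

In a real framework the shuffle runs across machines and the reduce receives its values through an iterator rather than an in-memory list, but the data flow is the same.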
Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computations
Distributed Grep The map function emits a line if it matches a supplied pattern The reduce function is an identity function that just copies the supplied intermediate data to the output
Count of URL Access Frequency The map function processes logs of web page requests and outputs (URL 1) The reduce function adds together all values for the same URL and emits a (URL total count) pair
Reverse Web-Link Graph The map function outputs (target source) pairs for each link to a target URL found in a page named source The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair (target list(source))
Term-Vector per Host A term vector summarizes the most important words that occur in a document or a set of documents as a list of (word frequency) pairs The map function emits a (hostname term vector) pair for each input document (where the hostname is extracted from the URL of the document) The reduce function is passed all per-document term vectors for a given host It adds these term vectors together throwing away infrequent terms and then emits a final (hostname term vector) pair
Inverted Index The map function parses each document and emits a sequence of (word document ID) pairs The reduce function accepts all pairs for a given word sorts the corresponding document IDs and emits a (word list(document ID)) pair The set of all output pairs forms a simple inverted index It is easy to augment this computation to keep track of word positions
Distributed Sort The map function extracts the key from each record and emits a (key record) pair The reduce function emits all pairs unchanged This computation depends on the partitioning facilities described in Section 41 and the ordering properties described in Section 42
Strategy to solve MapReduce Problem
After grouping all the intermediate data the values of all occurrences of the same key are sorted and grouped together As a result after grouping each key each key becomes unique in all intermediate data Therefore finding unique keys is the starting point to solving a typical MapReduce problem Then the intermediate (key value) pairs as the output of the Map function will be automatically found
The following examples explain how to define keys and values in such problems
Problem 1 Counting the number of occurrences of each word in a collection of documentsSolution unique key each word intermediate value number of occurrences
Problem 2 Counting the number of occurrences of words having the same size or the same number of letters in a collection of documentsSolution unique key each word intermediate value size of the word
Problem 3 Counting the number of occurrences of anagrams in a collection of documents (Anagrams are words with the same set of letters but in a different order (eg the words ldquolistenrdquo and ldquosilentrdquo)Solution unique key alphabetically sorted sequence of letters for each word (eg ldquoeilnstrdquo) intermediate value number of occurrences
6224 Strategy to Solve MapReduce Problems As mentioned earlier a1048862er grouping all the intermediate data the values of all occurrences of the same key are sorted and grouped together As a result a1048862er grouping each key becomes unique in all intermediate data Therefore finding unique keys is the starting point to solving a typical MapReduce problem Then the intermediate (key value) pairs as the output of the Map function will be automatically found The following three examples explain how to define keys and values in such problems Problem 1 Counting the number of occurrences of each word in a collection of documents Solution unique ldquokeyrdquo each word intermediate ldquovaluerdquo number of occurrencesProblem 2 Counting the number of occurrences of words having the same size or the same number of letters in a collection of documents Solution unique ldquokeyrdquo each word intermediate ldquovaluerdquo size of the word Problem 3 Counting the number of occurrences of anagrams in a collection of documents Anagrams are words with the same set of letters but in a different order (eg the words ldquolistenrdquo and ldquosilentrdquo) Solution unique ldquokeyrdquo alphabetically sorted sequence of letters for each word (eg ldquoeilnstrdquo) intermediate ldquovaluerdquo number of occurrences
Transparent Programming Modelbull Programs written for cloud implementation need to be automatically parallelized and executed on a large cluster of commodity machinesbull The run-time system should take care of the details of partitioning the input data scheduling the programs execution across a set of machines handling machine failures and managing the required inter-machine communicationbull The programming model should allow programmers without many experiences with parallel and distributed systems to easily utilize the resources of a large distributed system
Scalable Data Processing on Large Clustersbull A web programming model implemented for fast processing and generating large datasetsbull Applied mainly in web-scale search and cloud computing applicationsbull Users specify a map function to generate a set of intermediate keyvalue pairsbull Users use a reduce function to merge all intermediate values with the same intermediate key
Google MapReducebull Map written by the user takes an input pair and produces a set of intermediate keyvalue pairs The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function
bull The Reduce function also written by the user accepts an intermediate key I and a set of values for that key It merges together these values to form a possibly smaller set of values Typically just zero or one output value is produced per Reduce invocation
Hadoop A software platform originally developed by Yahoo to enable users write and run applications over vast distributed data
Attractive Features in Hadoop 1048708 Scalable can easily scale to store and process petabytes of data in the Web space1048708 Economical An open-source MapReduce minimizes the overheads in task spawning and massive data communication1048708 Efficient Processing data with high-degree of parallelism across a large number of commodity nodes1048708 Reliable Automatically maintains multiple copies of data to facilitate redeployment of computing tasks on failures
Explain MapReduce with an example
The computation takes a set of input keyvalue pairs and produces a set of output keyvalue pairs The user of the MapReduce library expresses the computation as two functions Map and ReduceMap written by the user takes an input pair and produces a set of intermediate keyvalue pairs The MapReduce library groups together all intermediate values associatedwith the same intermediate key I and passes them to the Reduce function
The Reduce function also written by the user accepts an intermediate key I and a set of values for that key It merges together these values to form a possibly smaller set of values Typically just zero or one output value is produced per Reduce invocation The intermediate values are supplied to the users reduce function via an iteratorThis allows us to handle lists of values that are too large to fit in memory
Consider the problem of counting the n umber of occurrences of each word in a large collection of documents The user would write code similar to the following pseudo-code
map(String key String value) key document name value document contents for each word w in valueEmitIntermediate(w 1)
reduce(String key Iterator values) key a word values a list of counts int result = 0for each v in valuesresult += ParseInt(v)
Emit(AsString(result))
The map function emits each word plus an associated count of occurrences (just lsquo1rsquo in this simple example) The reduce function sums together all counts emitted for a particular word
In addition the user writes code to fill in a mapreduce specification object with the names of the input and out- put files and optional tuning parameters The user then invokes the MapReduce function passing it the specification object The userrsquos code is linked together with the MapReduce library (implemented in C++) Appendix A contains the full program text for this example
Even though the previous pseudo-code is written in terms of string inputs and outputs conceptually the map and reduce functions supplied by the user have associated typesmap (k1v1) list(k2v2)reduce (k2list(v2)) list(v2)
Ie the input keys and values are drawn from a different domain than the output keys and values Furthermore the intermediate keys and values are from the same domainas the output keys and values
We have a large collection of text documents in a folderCount the frequency of distinct words in the documents
Map functionMap function operates on every keyvalue pair of input data and transforms the data based on the transformation logic provided in the map functionMap function always emits an intermediate keyvalue pair as output
Map( Key1 Value1) -gt List ( Key2 Value2 )For each file
Read each line from the input fileLocate each word
Emit the (word1) for every word foundThe emitted (word 1) will form the list that is output from the Map function
Reduce function takes the list of every key and transforms the data based on the (aggregation) logic provided in the reduce function It is similar to the Aggregate functions in Standard SQL
For the List(key value) output from the mapper Shuffle and Sort the data by keyGroup by Key and create the list of values for a key
Reduce functionReduce ( Key2 List(Value2) ) -gt List (Key3 Value3 )Read each key (word) and list of values (1 1 1) associated with it
For each key add the list of values to calculate sumEmit the word sum for every word found
Strategy to solve MapReduce Problem
After grouping all the intermediate data the values of all occurrences of the same key are sorted and grouped together As a result after grouping each key each key becomes unique in all intermediate data Therefore finding unique keys is the starting point to solving a typical MapReduce problem Then the intermediate (key value) pairs as the output of the Map function will be automatically found
The following examples explain how to define keys and values in such problems
Problem 1 Counting the number of occurrences of each word in a collection of documentsSolution unique key each word intermediate value number of occurrences
Problem 2 Counting the number of occurrences of words having the same size or the same number of letters in a collection of documentsSolution unique key each word intermediate value size of the word
Problem 3 Counting the number of occurrences of anagrams in a collection of documents (Anagrams are words with the same set of letters but in a different order (eg the words ldquolistenrdquo and ldquosilentrdquo)Solution unique key alphabetically sorted sequence of letters for each word (eg ldquoeilnstrdquo) intermediate value number of occurrences
6224 Strategy to Solve MapReduce Problems As mentioned earlier a1048862er grouping all the intermediate data the values of all occurrences of the same key are sorted and grouped together As a result a1048862er grouping each key becomes unique in all intermediate data Therefore finding unique keys is the starting point to solving a typical MapReduce problem Then the intermediate (key value) pairs as the output of the Map function will be automatically found The following three examples explain how to define keys and values in such problems Problem 1 Counting the number of occurrences of each word in a collection of documents Solution unique ldquokeyrdquo each word intermediate ldquovaluerdquo number of occurrencesProblem 2 Counting the number of occurrences of words having the same size or the same number of letters in a collection of documents Solution unique ldquokeyrdquo each word intermediate ldquovaluerdquo size of the word Problem 3 Counting the number of occurrences of anagrams in a collection of documents Anagrams are words with the same set of letters but in a different order (eg the words ldquolistenrdquo and ldquosilentrdquo) Solution unique ldquokeyrdquo alphabetically sorted sequence of letters for each word (eg ldquoeilnstrdquo) intermediate ldquovaluerdquo number of occurrences
Transparent Programming Modelbull Programs written for cloud implementation need to be automatically parallelized and executed on a large cluster of commodity machinesbull The run-time system should take care of the details of partitioning the input data scheduling the programs execution across a set of machines handling machine failures and managing the required inter-machine communicationbull The programming model should allow programmers without many experiences with parallel and distributed systems to easily utilize the resources of a large distributed system
Scalable Data Processing on Large Clustersbull A web programming model implemented for fast processing and generating large datasetsbull Applied mainly in web-scale search and cloud computing applicationsbull Users specify a map function to generate a set of intermediate keyvalue pairsbull Users use a reduce function to merge all intermediate values with the same intermediate key
Google MapReducebull Map written by the user takes an input pair and produces a set of intermediate keyvalue pairs The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function
bull The Reduce function also written by the user accepts an intermediate key I and a set of values for that key It merges together these values to form a possibly smaller set of values Typically just zero or one output value is produced per Reduce invocation
Hadoop A software platform originally developed by Yahoo to enable users write and run applications over vast distributed data
Attractive Features in Hadoop 1048708 Scalable can easily scale to store and process petabytes of data in the Web space1048708 Economical An open-source MapReduce minimizes the overheads in task spawning and massive data communication1048708 Efficient Processing data with high-degree of parallelism across a large number of commodity nodes1048708 Reliable Automatically maintains multiple copies of data to facilitate redeployment of computing tasks on failures
Explain MapReduce with an example
The computation takes a set of input keyvalue pairs and produces a set of output keyvalue pairs The user of the MapReduce library expresses the computation as two functions Map and ReduceMap written by the user takes an input pair and produces a set of intermediate keyvalue pairs The MapReduce library groups together all intermediate values associatedwith the same intermediate key I and passes them to the Reduce function
The Reduce function also written by the user accepts an intermediate key I and a set of values for that key It merges together these values to form a possibly smaller set of values Typically just zero or one output value is produced per Reduce invocation The intermediate values are supplied to the users reduce function via an iteratorThis allows us to handle lists of values that are too large to fit in memory
Consider the problem of counting the n umber of occurrences of each word in a large collection of documents The user would write code similar to the following pseudo-code
map(String key String value) key document name value document contents for each word w in valueEmitIntermediate(w 1)
reduce(String key Iterator values) key a word values a list of counts int result = 0for each v in valuesresult += ParseInt(v)
Emit(AsString(result))
The map function emits each word plus an associated count of occurrences (just lsquo1rsquo in this simple example) The reduce function sums together all counts emitted for a particular word
In addition the user writes code to fill in a mapreduce specification object with the names of the input and out- put files and optional tuning parameters The user then invokes the MapReduce function passing it the specification object The userrsquos code is linked together with the MapReduce library (implemented in C++) Appendix A contains the full program text for this example
Even though the previous pseudo-code is written in terms of string inputs and outputs conceptually the map and reduce functions supplied by the user have associated typesmap (k1v1) list(k2v2)reduce (k2list(v2)) list(v2)
Ie the input keys and values are drawn from a different domain than the output keys and values Furthermore the intermediate keys and values are from the same domainas the output keys and values
We have a large collection of text documents in a folderCount the frequency of distinct words in the documents
Map functionMap function operates on every keyvalue pair of input data and transforms the data based on the transformation logic provided in the map functionMap function always emits an intermediate keyvalue pair as output
Map( Key1 Value1) -gt List ( Key2 Value2 )For each file
Read each line from the input fileLocate each word
Emit the (word1) for every word foundThe emitted (word 1) will form the list that is output from the Map function
Reduce function takes the list of every key and transforms the data based on the (aggregation) logic provided in the reduce function It is similar to the Aggregate functions in Standard SQL
For the List(key value) output from the mapper Shuffle and Sort the data by keyGroup by Key and create the list of values for a key
Reduce functionReduce ( Key2 List(Value2) ) -gt List (Key3 Value3 )Read each key (word) and list of values (1 1 1) associated with it
For each key add the list of values to calculate sumEmit the word sum for every word found
Transparent Programming Modelbull Programs written for cloud implementation need to be automatically parallelized and executed on a large cluster of commodity machinesbull The run-time system should take care of the details of partitioning the input data scheduling the programs execution across a set of machines handling machine failures and managing the required inter-machine communicationbull The programming model should allow programmers without many experiences with parallel and distributed systems to easily utilize the resources of a large distributed system
Scalable Data Processing on Large Clustersbull A web programming model implemented for fast processing and generating large datasetsbull Applied mainly in web-scale search and cloud computing applicationsbull Users specify a map function to generate a set of intermediate keyvalue pairsbull Users use a reduce function to merge all intermediate values with the same intermediate key
Google MapReducebull Map written by the user takes an input pair and produces a set of intermediate keyvalue pairs The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function
bull The Reduce function also written by the user accepts an intermediate key I and a set of values for that key It merges together these values to form a possibly smaller set of values Typically just zero or one output value is produced per Reduce invocation
Hadoop A software platform originally developed by Yahoo to enable users write and run applications over vast distributed data
Attractive Features in Hadoop 1048708 Scalable can easily scale to store and process petabytes of data in the Web space1048708 Economical An open-source MapReduce minimizes the overheads in task spawning and massive data communication1048708 Efficient Processing data with high-degree of parallelism across a large number of commodity nodes1048708 Reliable Automatically maintains multiple copies of data to facilitate redeployment of computing tasks on failures
Explain MapReduce with an example
The computation takes a set of input keyvalue pairs and produces a set of output keyvalue pairs The user of the MapReduce library expresses the computation as two functions Map and ReduceMap written by the user takes an input pair and produces a set of intermediate keyvalue pairs The MapReduce library groups together all intermediate values associatedwith the same intermediate key I and passes them to the Reduce function
The Reduce function also written by the user accepts an intermediate key I and a set of values for that key It merges together these values to form a possibly smaller set of values Typically just zero or one output value is produced per Reduce invocation The intermediate values are supplied to the users reduce function via an iteratorThis allows us to handle lists of values that are too large to fit in memory
Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:
  map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

  reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));
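The pseudo-code above can be mirrored in plain Python. This is an illustrative sketch, not the paper's C++ library: EmitIntermediate and Emit are modeled as list appends and a return value, and the grouping that the MapReduce library would normally perform is done inline by hand.

```python
from collections import defaultdict

intermediate = []  # (word, "1") pairs collected by EmitIntermediate

def map_fn(key: str, value: str) -> None:
    # key: document name; value: document contents
    for w in value.split():
        intermediate.append((w, "1"))  # EmitIntermediate(w, "1")

def reduce_fn(key: str, values) -> str:
    # key: a word; values: an iterator of string counts
    result = 0
    for v in values:
        result += int(v)   # ParseInt(v)
    return str(result)     # Emit(AsString(result))

map_fn("doc1", "to be or not to be")

# The library would group intermediate values by key; done here by hand.
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

totals = {word: reduce_fn(word, iter(counts)) for word, counts in groups.items()}
print(totals)  # -> {'to': '2', 'be': '2', 'or': '1', 'not': '1'}
```

Note that reduce_fn consumes its values through an iterator, matching the remark above that value lists too large to fit in memory can still be handled.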
The map function emits each word plus an associated count of occurrences (just '1' in this simple example). The reduce function sums together all counts emitted for a particular word.
In addition, the user writes code to fill in a mapreduce specification object with the names of the input and output files and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user's code is linked together with the MapReduce library (implemented in C++). Appendix A contains the full program text for this example.
Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types:

  map    (k1, v1)       -> list(k2, v2)
  reduce (k2, list(v2)) -> list(v2)
That is, the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values.
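The type relationship can be made concrete with Python type hints; a hypothetical sketch for word count, where k1 is a document name, v1 its contents, k2 a word, and v2 a count string (so the intermediate and output value domains coincide, as noted above):

```python
from typing import Iterable, List, Tuple

# map: (k1, v1) -> list(k2, v2)
def word_count_map(doc_name: str, contents: str) -> List[Tuple[str, str]]:
    return [(w, "1") for w in contents.split()]

# reduce: (k2, list(v2)) -> list(v2)
def word_count_reduce(word: str, counts: Iterable[str]) -> List[str]:
    return [str(sum(int(c) for c in counts))]

pairs = word_count_map("doc1", "a b a")
print(pairs)                               # -> [('a', '1'), ('b', '1'), ('a', '1')]
print(word_count_reduce("a", ["1", "1"]))  # -> ['2']
```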
Problem: we have a large collection of text documents in a folder. Count the frequency of distinct words in the documents.
Map function: the map function operates on every key/value pair of input data and transforms the data based on the transformation logic provided in the map function. The map function always emits an intermediate key/value pair as output.
  Map(Key1, Value1) -> List(Key2, Value2)

For each file:
- Read each line from the input file
- Locate each word
- Emit (word, 1) for every word found

The emitted (word, 1) pairs form the list that is output from the Map function.
The reduce function takes the list of values for every key and transforms the data based on the (aggregation) logic provided in the reduce function. It is similar to the aggregate functions in standard SQL.
For the List(key, value) output from the mapper:
- Shuffle and sort the data by key
- Group by key and create the list of values for each key
Reduce function:

  Reduce(Key2, List(Value2)) -> List(Key3, Value3)

- Read each key (word) and the list of values (1, 1, 1, ...) associated with it
- For each key, add up the list of values to calculate the sum
- Emit (word, sum) for every word found
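The shuffle/sort, group-by-key, and reduce steps above can be sketched in Python; the mapper output here is hypothetical sample data standing in for what several mappers would emit.

```python
from itertools import groupby

# Hypothetical (word, "1") pairs as they might arrive from the mappers
mapper_output = [("be", "1"), ("to", "1"), ("be", "1"), ("or", "1"), ("to", "1")]

# Shuffle and sort the data by key
mapper_output.sort(key=lambda kv: kv[0])

# Group by key and create the list of values for each key
# (groupby requires the sorted input above to group correctly)
grouped = {word: [v for _, v in pairs]
           for word, pairs in groupby(mapper_output, key=lambda kv: kv[0])}
# grouped == {'be': ['1', '1'], 'or': ['1'], 'to': ['1', '1']}

# Reduce: for each key, add up the list of values and emit (word, sum)
result = {word: sum(int(v) for v in values) for word, values in grouped.items()}
print(result)  # -> {'be': 2, 'or': 1, 'to': 2}
```

In a real cluster the sort and grouping are performed by the framework between the map and reduce phases; only the reduce logic in the last line is user code.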