MapReduce: distributed computing on large commodity clusters
MapReduce
Distributed computing on large commodity clusters
Dr. Spiros Denaxas, Epidemiology & Public Health, UCL, 18 Feb 2010
Hello
- Introduction
- Who am I
- Structure of presentation
  - Distributed computing
  - MapReduce examples
  - Amazon Web Services
  - Live demo
Data and some more data
- Google processes > 20PB daily
- Facebook processes 15TB daily, with more than 4TB of new data per day
- Archive.org has 2PB of content
- Baidu: 3TB/week
- The CERN LHC will generate 20GB/sec
Data driven applications
- Fraud detection
- Web indexing
- Risk management
- Service personalization
- Spam detection
- Document clustering
Fruits of some sort
- Consider a very simple example
- fruit_diary.txt: a text file with fruit names
- "cat fruit_diary.txt | sort | uniq -c"
  - cat prints every line of your eating diary
  - sort sorts all lines in memory
  - uniq -c counts the unique occurrences
- What if the fruit diary was 500GB? What if it was 500TB? 500PB?
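As a sketch, here is how that pipeline behaves on a hypothetical three-line diary (the file contents are invented for illustration):

    $ cat fruit_diary.txt
    apple
    banana
    apple
    $ cat fruit_diary.txt | sort | uniq -c
          2 apple
          1 banana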
Big Data problem
1. Iterate over a large number of records
2. Extract something of interest
3. Shuffle and sort the intermediate results
4. Aggregate the intermediate results
5. Generate final output
Big Data problem
- The majority of "big data" players rely heavily on data analysis, be it commercial or scientific.
- Ad hoc investigation, trends, patterns, reporting.
- Results are needed in a timely manner: questions must be answered in hours, not weeks.
Scalability of single entities
- The single disk/memory model does not scale on a single processing entity.
- Now what? Let's add many disk/memory/processing entities!
- Parallel vs. distributed computing:
  - Parallel: multiple CPUs in a single computer
  - Distributed: multiple CPUs across multiple computers over the network
What is wrong with this?
- Worker 1:

    void work() {
        var2++;
        var1 = var2 + 5;
    }

- Worker 2:

    void work() {
        var1++;
        var2 = var1;
    }

- (var1 and var2 are shared between the two workers, so the interleaving of these statements is nondeterministic)
Parallel computing pitfalls
- "Parallel computing is a black art"
- Very hard to program, expensive, complicated; how does it scale?
- How do we know:
  - when a worker has finished?
  - when a worker has failed?
  - how to synchronize?
Now what?
- Data needs to be processed on a massive scale in a distributed fashion, as it does not fit on a single node.
- The solution must be scalable
- The solution must be cheap
  - Low-cost hardware with redundancy
- Don't worry about concurrency; focus on more serious problems.
Distributed systems
- Fault tolerant
- Highly available
- Recoverable
- Consistent
- Scalable
Data storage
- Google FS (GFS) and the Hadoop Distributed File System (HDFS)
- Data must be available to all processing nodes
- Don't move data to the workers; move workers to the data:
  - Store data on the local disks of nodes
  - Start workers on nodes that hold their data locally
- Minimize metadata by using large blocks
MapReduce
- A framework for processing data using a cluster of computer nodes
- Created by Google, in C++
- Two steps: map and reduce
- Automatic parallelization, distribution, failover, synchronization, and more
- Clean abstraction layer for programmers
- Processes are isolated
[Diagram: map tasks consume input pairs (k1,v1)…(k6,v6) and emit intermediate (key, value) pairs; Shuffle and Sort aggregates the values by key (a → 1, 5; b → 2, 7; c → 2, 3, 6, 8); reduce tasks emit the final results (r1,s1), (r2,s2), (r3,s3)]
MapReduce map()
- map(in_key, in_value) => list of (out_key, intermediate_value)
- Data (lines from files, database rows, etc.) are read and emitted as key/value pairs
- For example, "coconut,1"
- map() produces one or more intermediate values along with an output key from the input data. map() runs in parallel and independently.
MapReduce reduce()
- A reducer is given a key and all the values for that specific key.
- Once the map phase is over, the intermediate values of a given output key are collapsed into a list.
- reduce() combines the intermediate values into one or more final values for the same output key
- Bottleneck: the reduce() stage cannot start until the map() phase is done.
Big Data problem
1. Iterate over a large number of records
2. map()
3. Shuffle and sort the intermediate results
4. reduce()
5. Generate final output
Term Frequency (TF) calculation
- The TF of a given term is the number of times it appears within a document collection.
- Example passage (the opening of Joseph Conrad's Heart of Darkness): "The sea-reach of the Thames stretched before us like the beginning of an interminable waterway. In the offing the sea and the sky were welded together without a joint, and in the luminous space the tanned sails of the barges drifting up with the tide seemed to stand still in red clusters of canvas sharply peaked, with gleams of varnished sprits."
Term Frequency (TF) calculation
- Stopword elimination
- sea reach Thames stretched before like beginning interminable waterway offing sea sky welded together joint luminous space tanned sails barges drifting tide seemed stand still red clusters canvas sharply peaked gleams varnished sprits
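As a minimal sketch of the stopword-elimination step (the stopword list and the tokenization here are assumptions, not the exact ones used above):

    # stopword_filter.pl -- drop common stopwords from each input line (sketch)
    use strict; use warnings;
    my %stopwords = map { $_ => 1 } qw(the of an in and were a with to up us);
    while (my $line = <STDIN>) {
        $line =~ s/[^A-Za-z ]/ /g;                        # strip punctuation
        my @kept = grep { !$stopwords{lc $_} } split ' ', $line;
        print "@kept\n";
    }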
Term Frequency (TF) calculation
- A generic map() function
- Input: a single line
- Output: <word,frequency> pairs

    map(line) {
        @words = split / /, line
        foreach word (@words) {
            print word, 1
        }
    }
Term Frequency (TF) calculation
- sea reach Thames stretched before like beginning
- The output would look like:

    sea,1
    reach,1
    Thames,1
    stretched,1
    before,1
    like,1
    beginning,1
Term Frequency (TF) calculation
- A generic reduce() function
- Sums up the values, which are the occurrences of each word.
- Input: a series of <word,frequency> pairs
- Output: a series of <word,sum> pairs

    reduce(word, frequency) {
        counts[word] += frequency
    }
    # after all pairs have been seen:
    foreach word (counts) {
        print word, counts[word]
    }
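For reference, a runnable Hadoop Streaming version of this word count in Perl might look like the sketch below; the script names and the paths in the usage comment are assumptions:

    #!/usr/bin/perl
    # wc_mapper.pl -- emit "word<TAB>1" for every word read from stdin
    use strict; use warnings;
    while (my $line = <STDIN>) {
        print "$_\t1\n" for split ' ', $line;
    }

    #!/usr/bin/perl
    # wc_reducer.pl -- sum the counts received for each word
    use strict; use warnings;
    my %counts;
    while (my $line = <STDIN>) {
        chomp $line;
        my ($word, $freq) = split /\t/, $line;
        $counts{$word} += $freq;
    }
    print "$_\t$counts{$_}\n" for sort keys %counts;

    # Usage with Hadoop Streaming (the jar path is an assumption):
    #   hadoop jar hadoop-streaming.jar \
    #     -input books/ -output counts/ \
    #     -mapper wc_mapper.pl -reducer wc_reducer.pl

Since Streaming sorts the mapper output by key before the reduce phase, the reducer could also aggregate one key at a time instead of holding the whole hash in memory.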
Term Frequency (TF) calculation
- The output from the reduce stage would look like:

    sea,2
    reach,1
    Thames,1
    stretched,1
    before,1
    like,1
    beginning,1 [...]
MapReduce indexing
- Map over all documents (see the mapper sketch below)
  - Emit term as key, (docno, tf) as value
  - Emit other information as necessary (e.g., term position)
- Sort/shuffle: group by term
- Reduce
  - Gather and sort (e.g., by docno or tf)
  - Write to disk
- MapReduce does all the heavy lifting!
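A minimal sketch of such an indexing mapper in Perl, assuming each input record is a line of the form "docno<TAB>document text" (that record format is an assumption):

    # index_mapper.pl -- emit one "term<TAB>docno,tf" posting per term per document
    use strict; use warnings;
    while (my $line = <STDIN>) {
        chomp $line;
        my ($docno, $text) = split /\t/, $line, 2;
        my %tf;                                  # term frequencies within this document
        $tf{lc $_}++ for split ' ', $text;
        print "$_\t$docno,$tf{$_}\n" for keys %tf;
    }

The shuffle then groups the postings by term, so the reducer only has to sort each term's postings and write them out.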
Inverted index (Boolean)
[Diagram: Boolean inverted index built from Doc 1 "one fish, two fish", Doc 2 "red fish, blue fish", Doc 3 "cat in the hat", and Doc 4 "green eggs and ham"; each term (blue, cat, egg, fish, green, ham, hat, one, red, two) points to the list of documents that contain it]
Inverted index (ranked)
[Diagram: ranked inverted index over the same four documents; the map phase emits (term, (docno, tf)) pairs per document, Shuffle and Sort aggregates the postings by term, and the reduce phase writes each term's postings list, e.g. fish → (1,2), (2,2), together with its document frequency (df)]
Inverted index (positional)
[Diagram: positional inverted index over the same four documents; the map phase emits (term, (docno, tf, [positions])) tuples, e.g. fish → (1, 2, [2,4]), Shuffle and Sort groups the postings by term, and the reduce phase writes each term's positional postings list]
PageRank
- Named after Larry Page at Google
- Essentially a link-analysis algorithm
- Measures the relative importance of a web page
- Algorithmically assesses and quantifies that "importance"
PageRank
- How can we define how important page X is?
- One solution: quantify the incoming links from other pages to that page
- Surely, more incoming links would mean a more authoritative status?
PageRank
- Imagine your typical web surfer browsing page X
- Only two things can happen:
  - A) A random link from X is clicked (probability 1 - α)
  - B) The user teleports away to a random page (probability α)
PageRank defined
Given page x with in-bound links t1…tn, where:
- C(t) is the out-degree of t
- α is the probability of a random jump
- N is the total number of nodes in the graph
    PR(x) = \alpha \frac{1}{N} + (1 - \alpha) \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}
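As a hypothetical worked example (all numbers invented for illustration): take N = 4, α = 0.1, and two in-bound links, t1 with PR(t1) = 0.4 and out-degree C(t1) = 2, and t2 with PR(t2) = 0.2 and out-degree C(t2) = 1. Then:

    PR(x) = 0.1 · (1/4) + 0.9 · (0.4/2 + 0.2/1)
          = 0.025 + 0.9 · 0.4
          = 0.385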
PageRank defined
[Diagram: page X with in-bound links from pages t1, t2, …, tn]
Computing PageRank
- Properties of PageRank:
  - Can be computed iteratively
  - Effects of each iteration are local
- Sketch of algorithm:
  - Start with seed PRi values
  - Each page distributes its PRi "credit" to all pages it links to
  - Each target page adds up the "credit" from its multiple in-bound links to compute PRi+1
  - Iterate until values converge (100 times?)
Map: distribute PageRank "credit" to link targets
Reduce: gather up PageRank "credit" from multiple sources to compute the new PageRank value
Iterate until convergence
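A minimal sketch of one such iteration as Perl streaming map and reduce steps; the "node<TAB>rank<TAB>targets…" record format, the fixed α and N, and the omission of re-emitting the graph structure for the next iteration are all simplifying assumptions:

    # pr_mapper.pl -- distribute each node's rank credit over its out-links
    use strict; use warnings;
    while (my $line = <STDIN>) {
        chomp $line;
        my ($node, $rank, @targets) = split /\t/, $line;
        next unless @targets;                    # dangling node: no credit to give
        my $credit = $rank / @targets;           # split rank by out-degree C(node)
        print "$_\t$credit\n" for @targets;
    }

    # pr_reducer.pl -- sum incoming credit and apply the PageRank formula
    use strict; use warnings;
    my ($alpha, $N) = (0.15, 1_000_000);         # assumed constants
    my %credit;
    while (my $line = <STDIN>) {
        chomp $line;
        my ($node, $c) = split /\t/, $line;
        $credit{$node} += $c;
    }
    print "$_\t", $alpha / $N + (1 - $alpha) * $credit{$_}, "\n"
        for keys %credit;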
Graph algorithms in MapReduce
- General approach:
  - Store graphs as adjacency lists
  - Each map task receives a node and its adjacency list
  - Each map task computes some function of the link structure and emits a value with the target node as the key
  - Each reduce task collects the keys (target nodes) and aggregates
- Perform multiple MapReduce iterations until some termination condition is met
Amazon Web Services (AWS)
- "A collection of remote computing services offered over the Internet by Amazon"
- Accessed over HTTP using Representational State Transfer (REST) or the Simple Object Access Protocol (SOAP)
- Cheap
- Scalable
- API implementations
Amazon Simple Storage Service (S3)
- "An online persistent data storage service offered by Amazon Web Services"
- Charged on data stored and transferred
- Object-based storage (not a traditional block filesystem)
- Data is organized in buckets
- Buckets are accessed using HTTP REST
Amazon Simple Storage Service (S3)
- http://<bucket>.s3.amazonaws.com/<key>
- Like HDFS, data is replicated across nodes, enabling the storage of very large files
- Several big players, such as Twitter and HP, use S3
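As a sketch of that URL scheme, here is how a (hypothetical) public object could be fetched with Perl's LWP::Simple; the bucket and key names are invented:

    # s3_get.pl -- fetch a public S3 object over plain HTTP REST
    use strict; use warnings;
    use LWP::Simple qw(get);
    my $url = 'http://my-bucket.s3.amazonaws.com/fruit_diary.txt';  # hypothetical
    my $body = get($url);                        # returns undef on failure
    die "GET $url failed\n" unless defined $body;
    print $body;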
Amazon Elastic Compute Cloud (EC2)
- Scalable deployment of virtual servers for large-scale data processing.
- Billed by the hour of processing and the magnitude of resources needed.
- No persistent storage (that's what S3 is for!)
- Automatic scalability
Amazon Elastic Compute Cloud (EC2)
- Amazon Machine Images (AMIs)
  - Sun, Oracle, IBM
  - Windows, Linux
- Several sizes for all tastes
  - From 1.5 to 65GB of RAM
  - From 1 core to 2 quad-core CPUs
Amazon Elastic MapReduce
- Hadoop-ready virtual servers on EC2
- HDFS-esque input from S3
- Amazon Web Services Management Console
  - Hadoop in < 5 minutes!
Live Demo
- Word counting: Project Gutenberg
  - 1,500 books in various languages
  - Approximately half a million lines of text
  - 1,500 files stored on S3
- 8 EC2 instances deployed
- 14 minutes from start to finish
- Less than 2 USD
Thank you