Data-Intensive Computing for Text Analysis CS395T / INF385T / LIN386M
University of Texas at Austin, Fall 2011
Lecture 2 September 1, 2011
Matt Lease
School of Information
University of Texas at Austin
ml at ischool dot utexas dot edu
Jason Baldridge
Department of Linguistics
University of Texas at Austin
Jasonbaldridge at gmail dot com
Acknowledgments
Course design and slides derived from Jimmy Lin’s cloud computing courses at the University of Maryland, College Park
Some figures courtesy of
• Chuck Lam’s Hadoop In Action (2011)
• Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010)
Roots in Functional Programming
[Figure: Map applies a function f independently to each input element; Fold aggregates the per-element results with a function g]
Divide and Conquer
[Figure: the "Work" is partitioned into pieces w1, w2, w3 handled by "workers"; their partial results r1, r2, r3 are combined into the final "Result"]
MapReduce "Big Ideas"
Scale “out”, not “up”
Limits of SMP and large shared-memory machines
Move processing to the data
Clusters have limited bandwidth
Process data sequentially, avoid random access
Seeks are expensive, disk throughput is reasonable
Seamless scalability
From the mythical man-month to the tradable machine-hour
Typical Large-Data Problem
Iterate over a large number of records
Compute something of interest from each
Shuffle and sort intermediate results
Aggregate intermediate results
Generate final output
Key idea: provide a functional abstraction for two of these operations: computing something of interest from each record (Map) and aggregating intermediate results (Reduce)
(Dean and Ghemawat, OSDI 2004)
MapReduce Data Flow
Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 45, 52
MapReduce “Runtime”
Handles scheduling
Assigns workers to map and reduce tasks
Handles “data distribution”
Moves processes to data
Handles synchronization
Gathers, sorts, and shuffles intermediate data
Handles errors and faults
Detects worker failures and restarts
Built on a distributed file system
MapReduce
Programmers specify two functions
map ( K1, V1 ) → list ( K2, V2 )
reduce ( K2, list(V2) ) → list ( K3, V3)
Note the correspondence of types: map output → reduce input
Data Flow
Input → “input splits”: each a sequence of logical (K1,V1) “records”
Map
• Each split is processed by a single map task
• map invoked iteratively: once per record in the split
• For each record processed, map may emit 0-N (K2,V2) pairs
Reduce
• reduce invoked iteratively for each ( K2, list(V2) ) intermediate value
• For each key processed, reduce may emit 0-N (K3,V3) pairs
Each reducer’s output written to a persistent file in HDFS
[Figure: an InputFormat divides Input Files into InputSplits; each InputSplit is read by a RecordReader, which feeds records to a Mapper that produces Intermediates]
Source: redrawn from a slide by Cloudera, cc-licensed
Data Flow
Input → “input splits”: each a sequence of logical (K1,V1) “records”
For each split, for each record, do map(K1,V1) (multiple calls)
Each map call may emit any number of (K2,V2) pairs (0-N)
Run-time
Groups all values with the same key into ( K2, list(V2) )
Determines which reducer will process each key
Copies data across network as needed for reducer
Ensures intra-node sort of keys processed by each reducer
• No guarantee by default of inter-node total sort across reducers
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 30
“Hello World”: Word Count
Map(String docid, String text):
for each word w in text:
Emit(w, 1);
Reduce(String term, Iterator<Int> values):
int sum = 0;
for each v in values:
sum += v;
Emit(term, sum);
map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )
reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer )
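As a concreteness check, the pseudocode above maps almost directly onto Hadoop. Below is a minimal sketch using the new API (org.apache.hadoop.mapreduce); the class names (WordCountMapper, WordCountReducer) are ours, tokenization is naive whitespace splitting, and with the default TextInputFormat the input key K1 is a byte offset (LongWritable) rather than a docid.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map ( K1=LongWritable offset, V1=Text line ) → list ( K2=Text word, V2=IntWritable 1 )
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {  // naive tokenization
      if (token.isEmpty()) continue;
      word.set(token);
      context.write(word, ONE);                            // Emit(w, 1)
    }
  }
}

// reduce ( K2=Text word, list(V2=IntWritable) ) → list ( K3=Text word, V3=IntWritable count )
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text term, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {  // new API: values are an Iterable, so for-each works
      sum += v.get();
    }
    context.write(term, new IntWritable(sum));             // Emit(term, sum)
  }
}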
[Figure: word count data flow: mappers emit (key, value) pairs such as (b 1), (a 2), (c 3), (c 6); Shuffle and Sort aggregates values by key, e.g., a → (1, 5), b → (2, 7), c → (2, 3, 6, 8); reducers then produce the final results r1/s1, r2/s2, r3/s3]
Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 45, 52
Partition
Given: map ( K1, V1 ) → list ( K2, V2 )
reduce ( K2, list(V2) ) → list ( K3, V3)
partition (K2, N) → Rj maps K2 to some reducer Rj in [1..N]
Each distinct key (with associated values) sent to a single reducer
• Same reduce node may process multiple keys in separate reduce() calls
Balances workload across reducers: roughly equal number of keys to each
• Default: simple hash of the key, e.g., hash(k’) mod N (# reducers); see the sketch after this list
Customizable
• Some keys require more computation than others
• e.g. value skew, or key-specific computation performed
• For skew, sampling can dynamically estimate distribution & set partition
• Secondary/Tertiary sorting (e.g. bigrams or arbitrary n-grams)?
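For reference, the default behavior is just a hash of the key modulo the number of reducers. Below is a minimal sketch in the style of Hadoop's HashPartitioner (new API); the class name is ours.

import org.apache.hadoop.mapreduce.Partitioner;

// Hash partitioner sketch: every occurrence of the same key K2 is sent to the
// same reducer index in [0, numReduceTasks).
public class SimpleHashPartitioner<K2, V2> extends Partitioner<K2, V2> {
  @Override
  public int getPartition(K2 key, V2 value, int numReduceTasks) {
    // Mask the sign bit so the modulo result is never negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}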
Secondary Sorting (Lin 57, White 241)
How do we output sorted bigrams (sorted by 1st word, then by the list of 2nd words)?
What if we use word1 as the key, word2 as the value?
What if we use <first>--<second> as the key?
Pattern
Create a composite key of (first, second)
Define a Key Comparator based on both words
• This will produce the sort order we want (aa ab ac ba bb bc ca cb…)
Define a partition function based only on first word
• All bigrams with the same first word go to same reducer
• How do you know when the first word changes across invocations?
Preserve state in the reducer across invocations
• Will be called separately for each bigram, but we want to remember
the current first word across bigrams seen
Hadoop also provides a Group Comparator (controls which keys are grouped together into a single reduce() call)
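A minimal sketch of the partitioning piece of this pattern. BigramWritable is a hypothetical composite key (first word, second word) with a getFirst() accessor; the Key Comparator and Group Comparator would be written analogously, comparing both words and only the first word, respectively.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition on the first word only, so all bigrams sharing a first word reach
// the same reducer; the key comparator still sorts on (first, second).
// BigramWritable is a hypothetical composite key, not a Hadoop built-in.
public class FirstWordPartitioner extends Partitioner<BigramWritable, IntWritable> {
  @Override
  public int getPartition(BigramWritable key, IntWritable value, int numReduceTasks) {
    return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}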
Combine
Given: map ( K1, V1 ) → list ( K2, V2 )
reduce ( K2, list(V2) ) → list ( K3, V3)
combine ( K2, list(V2) ) → list ( K2, V2 )
Optional optimization
Local aggregation to reduce network traffic
No guarantee it will be used, nor how many times it will be called
Semantics of program cannot depend on its use
Signature: same input as reduce, same output as map
Combine may be run repeatedly on its own output
Lin: if reduce is associative & commutative, the combiner can be the reducer itself
• See next slide
Functional Properties
Associative: f( a, f(b,c) ) = f( f(a,b), c )
Grouping of operations doesn’t matter
YES: Addition, multiplication, concatenation
NO: division, subtraction, NAND
NAND(1, NAND(1,0)) = 0 != 1 = NAND( NAND(1,0), 0 )
Commutative: f(a,b) = f(b,a)
Ordering of arguments doesn’t matter
YES: addition, multiplication, NAND
NO: division, subtraction, concatenation
Concatenate("a", "b") != Concatenate("b", "a")
Distributive
White (p. 32) and Lam (p. 84) mention distributivity with regard to combiners
But really, go with associative + commutative in Lin (pp. 20, 27)
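Because integer addition is both associative and commutative, word count can reuse its reducer as the combiner. A minimal sketch of the driver-side wiring, using the hypothetical classes from the word count sketch above:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CombinerWiring {
  // Sketch only: the reducer doubles as the combiner. This is safe because
  // word count’s reduce (integer addition) is associative and commutative,
  // so running it 0..N times over map output cannot change the final sums.
  static Job configure() throws IOException {
    Job job = new Job(new Configuration(), "word count with combiner");
    job.setMapperClass(WordCountMapper.class);     // hypothetical, from the sketch above
    job.setCombinerClass(WordCountReducer.class);  // combiner = reducer
    job.setReducerClass(WordCountReducer.class);
    return job;
  }
}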
[Figure: word count data flow with combiners and partitioners: each mapper's output is locally combined (e.g., (c 3) and (c 6) become (c 9)) and partitioned before Shuffle and Sort aggregates values by key; reducers then produce the final results r1/s1, r2/s2, r3/s3]
[Figure: MapReduce execution overview: (1) the User Program submits the job to the Master; (2) the Master schedules map and reduce tasks on workers; (3) map workers read their assigned input splits (split 0..split 4); (4) map output is written to intermediate files on local disk; (5) reduce workers remotely read the intermediate data; (6) reducers write the output files (output file 0, output file 1)]
Adapted from (Dean and Ghemawat, OSDI 2004)
Shuffle and 2 Sorts
As map emits values, local sorting runs in tandem (1st sort)
Combine is optionally called 0..N times for local aggregation on sorted (K2, list(V2)) tuples (more sorting of output)
Partition determines which (logical) reducer Rj each key will go to
Node’s TaskTracker tells JobTracker it has keys for Rj
JobTracker determines node to run Rj based on data locality
When the local map/combine/sort finishes, the node sends its data to Rj’s node
Rj’s node iteratively merges inputs from map nodes as they arrive (2nd sort)
For each (K, list(V)) tuple in merged output, call reduce(…)
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 178
Distributed File System
Don’t move data… move computation to the data!
Store data on the local disks of nodes in the cluster
Start up the workers on the node that has the data local
Why?
Not enough RAM to hold all the data in memory
Disk access is slow, but disk throughput is reasonable
A distributed file system is the answer
GFS (Google File System) for Google’s MapReduce
HDFS (Hadoop Distributed File System) for Hadoop
GFS: Assumptions
Commodity hardware over “exotic” hardware
Scale “out”, not “up”
High component failure rates
Inexpensive commodity components fail all the time
“Modest” number of huge files
Multi-gigabyte files are common, if not encouraged
Files are write-once, mostly appended to
Perhaps concurrently
Large streaming reads over random access
High sustained throughput over low latency
GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
GFS: Design Decisions
Files stored as chunks
Fixed size (64MB)
Reliability through replication
Each chunk replicated across 3+ chunkservers
Single master to coordinate access, keep metadata
Simple centralized management
No data caching
Little benefit due to large datasets, streaming reads
Simplify the API
Push some of the issues onto the client (e.g., data layout)
HDFS = GFS clone (same basic ideas)
Basic Cluster Components
1 “Manager” node (can be split onto 2 nodes)
Namenode (NN)
Jobtracker (JT)
1-N “Worker” nodes
Tasktracker (TT)
Datanode (DN)
Optional Secondary Namenode
Periodic backups of Namenode in case of failure
Hadoop Architecture
Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 24-25
Namenode Responsibilities
Managing the file system namespace:
Holds file/directory structure, metadata, file-to-block mapping,
access permissions, etc.
Coordinating file operations:
Directs clients to datanodes for reads and writes
No data is moved through the namenode
Maintaining overall health:
Periodic communication with the datanodes
Block re-replication and rebalancing
Garbage collection
Putting everything together…
[Figure: cluster architecture: a namenode runs the namenode daemon, a job submission node runs the jobtracker, and each slave node runs a tasktracker and a datanode daemon on top of the Linux file system]
Anatomy of a Job
MapReduce program in Hadoop = Hadoop job
Jobs are divided into map and reduce tasks (+ more!)
An instance of running a task is called a task attempt
Multiple jobs can be composed into a workflow
Job submission process
Client (i.e., driver program) creates a job, configures it, and
submits it to the JobTracker
JobClient computes input splits (on client end)
Job data (jar, configuration XML) are sent to JobTracker
JobTracker puts job data in shared location, enqueues tasks
TaskTrackers poll for tasks
Off to the races…
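As a concrete illustration of the client-side half of this process, here is a minimal driver sketch (new API). WordCountMapper and WordCountReducer are the hypothetical classes from the earlier sketch; input and output paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");            // client-side job configuration
    job.setJarByClass(WordCountDriver.class);         // jar to ship with the job
    job.setMapperClass(WordCountMapper.class);        // hypothetical classes from the
    job.setReducerClass(WordCountReducer.class);      //   word count sketch above
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input; splits computed client-side
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // each reducer writes one file here
    System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit, then poll until done
  }
}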
Why have 1 API when you can have 2?
White pp. 25-27, Lam pp. 77-80
Hadoop 0.19 and earlier had “old API”
Hadoop 0.21 and forward has “new API”
Hadoop 0.20 has both!
Old API most stable, but deprecated
Current books use old API predominantly, but discuss changes
• Example code using new API available online from publisher
Some old API classes/methods not yet ported to new API
Cloud9 uses both, and you can too
Old API
Mapper (interface)
void map(K1 key, V1 value, OutputCollector<K2, V2> output,
Reporter reporter)
void configure(JobConf job)
void close() throws IOException
Reducer/Combiner
void reduce(K2 key, Iterator<V2> values,
OutputCollector<K3,V3> output, Reporter reporter)
void configure(JobConf job)
void close() throws IOException
Partitioner
int getPartition(K2 key, V2 value, int numPartitions)
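For comparison, a minimal old-API word count mapper might look like the following sketch (the class name is ours):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Old API: Mapper is an interface; MapReduceBase supplies no-op configure()/close().
public class OldApiWordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    for (String token : value.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      output.collect(word, ONE);  // old API: OutputCollector.collect(), not Context.write()
    }
  }
}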
New API
org.apache.hadoop.mapred now deprecated; instead use
org.apache.hadoop.mapreduce &
org.apache.hadoop.mapreduce.lib
Mapper, Reducer now abstract classes, not interfaces
Use Context instead of OutputCollector and Reporter
Context.write(), not OutputCollector.collect()
Reduce takes value list as Iterable, not Iterator
Can use Java’s for-each syntax for iterating
Can throw InterruptedException as well as IOException
JobConf & JobClient replaced by Configuration & Job