Map Reduce
Typical single node architecture
[Diagram: a single node with CPU, Memory, and Storage, with the Application running on top; MapReduce sits between the Application and the storage layer]
• Counting
• Sorting
– Merge sort, Quick sort
• BIG Data
– Data Mining
– Trend Analysis, e.g. Twitter
– Recommendation Systems
• If bought = (A, B) => likely to buy C
– Google Search
The Underlying Technologies
Distributed systems, storage, computing …
• Web data sets can be very large
– Tens to hundreds of terabytes … soon petabyte(s)
• Cannot mine on a single server (why?)
• Standard architecture emerging:
– Cluster of commodity Linux nodes
– (Very) high-speed Ethernet interconnect
• How to organize computations on this architecture?
– Storage is cheap, but data management is not
– Nodes are bound to fail
• Mask issues such as hardware failure
Goal: Stable Storage
• For: (stable) computation
• In other words: if any of the nodes fails, how do we ensure data availability, persistence … ?
• Answer:
– Distribute it
– Have redundancy
– 'Manage' this
• Data operations and services
– Store and retrieve on a single logical resource that is distributed over a number of 'locations'
Filesystem!
DFS
• Distributed File System
– Provides global file namespace
– Google GFS; Hadoop HDFS; etc.
– Typical usage pattern:
• Huge files (100s of GB to TB)
• Reads and appends are common
DFS
• Chunk servers
– File is split into contiguous chunks
– Typically each chunk is 16–64 MB
– Each chunk replicated (usually 2x or 3x)
– Try to keep replicas in different racks
• Master node (GFS)
– a.k.a. Name Node in HDFS
– Stores metadata
– Might be replicated
• Client library for file access
– Talks to master to find chunk servers
– Connects directly to chunk servers to access data
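As a rough illustration of chunking and rack-aware replication, the sketch below splits a file into contiguous fixed-size chunks and assigns each chunk's replicas to servers on distinct racks. The chunk size, rack layout, and round-robin placement policy are simplified assumptions for illustration only, not the actual GFS/HDFS algorithms.

```java
import java.util.*;

// Hypothetical sketch of DFS-style chunking and replica placement.
class ChunkPlacement {
    static final int CHUNK_SIZE = 4; // tiny for illustration; real chunks are 16-64 MB

    // Split a file (byte array) into contiguous fixed-size chunks.
    static List<byte[]> split(byte[] file) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < file.length; off += CHUNK_SIZE) {
            chunks.add(Arrays.copyOfRange(file, off, Math.min(off + CHUNK_SIZE, file.length)));
        }
        return chunks;
    }

    // Pick `replicas` servers for one chunk, preferring distinct racks.
    static List<String> place(int chunkIndex, Map<String, List<String>> racks, int replicas) {
        List<String> rackNames = new ArrayList<>(racks.keySet());
        List<String> chosen = new ArrayList<>();
        for (int r = 0; r < replicas; r++) {
            String rack = rackNames.get((chunkIndex + r) % rackNames.size());
            List<String> servers = racks.get(rack);
            chosen.add(servers.get(chunkIndex % servers.size()));
        }
        return chosen;
    }
}
```

With three racks and a replication factor of 3, every chunk lands on three different racks, so losing one rack never loses all copies.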
Chubby
• A coarse-grained lock service
– Distributed systems can use this to synchronize access to shared resources
– Intended for use by loosely-coupled distributed systems
– In GFS: elect a master
– In BigTable: master election, client discovery, table service locking
Interface
• Presents a simple distributed file system
– Clients can open/close/read/write files
– Reads and writes are whole-file
– Also supports advisory reader/writer locks
– Clients can register for notification of file updates
Topology
[Diagram: one Chubby cell, consisting of a master and several replicas; all client traffic goes to the master]
Master Election
• All replicas try to acquire a write lock on a designated file
• The one who gets the lock is the master
– The master can then write its address to the file
– Other replicas can read this file to discover the chosen master's name
• Chubby doubles as a name service
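The election scheme above can be sketched in a few lines. This is an in-memory stand-in only: the `LockFile` class and the address format are hypothetical, and real Chubby acquires locks over RPC against a replicated cell.

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of lock-based master election on a designated "file".
class LockFile {
    private final AtomicReference<String> holder = new AtomicReference<>();
    private volatile String contents = "";

    // Try to take the exclusive write lock; only the first caller succeeds.
    boolean tryAcquire(String replica) {
        return holder.compareAndSet(null, replica);
    }

    void write(String data) { contents = data; }   // the master writes its address
    String read() { return contents; }             // others read it to discover the master
}

class Election {
    // Each replica attempts the lock; the winner records its address.
    static String elect(LockFile file, String[] replicas) {
        for (String r : replicas) {
            if (file.tryAcquire(r)) {
                file.write(r + ":9090"); // hypothetical address format
            }
        }
        return file.read(); // any replica can now look up the master
    }
}
```

Because the lock file also holds the master's address, the same mechanism serves as the name service mentioned above.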
Consensus
• A Chubby cell is usually 5 replicas
– 3 must be alive for the cell to be viable
• How do replicas in Chubby agree on their own master and on official lock values?
– The PAXOS algorithm
PAXOS
Paxos is a family of algorithms (by Leslie Lamport) designed to provide distributed consensus in a network of several processors.
Processor Assumptions
• Operate at arbitrary speed
• Independent, random failures
• Processors with stable storage may rejoin the protocol after failure
• Do not lie, collude, or attempt to maliciously subvert the protocol
Network Assumptions
• All processors can communicate with (see) one another
• Messages are sent asynchronously and may take arbitrarily long to deliver
• Order of messages is not guaranteed: they may be lost, reordered, or duplicated
• Messages, if delivered, are not corrupted in the process
A Fault-Tolerant Memory of Facts
• Paxos provides a memory for individual "facts" in the network
– A fact is a binding from a variable to a value
– Paxos between 2F+1 processors is reliable and can make progress if up to F of them fail
Roles
• Proposer – an agent that proposes a fact
• Leader – the authoritative proposer
• Acceptor – holds agreed-upon facts in its memory
• Learner – may retrieve a fact from the system
Safety Guarantees
• Nontriviality: Only proposed values can be learned
• Consistency: At most one value can be learned
• Liveness: If at least one value V has been proposed, eventually any learner L will get some value
Key Idea
• Acceptors do not act unilaterally: for a fact to be learned, a quorum of acceptors must agree upon the fact
• A quorum is any majority of acceptors
• Given acceptors {A, B, C, D}, Q = {{A, B, C}, {A, B, D}, {B, C, D}, {A, C, D}}
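The quorum rule is simple enough to state directly in code. The sketch below (names are illustrative) checks that a candidate set is a strict majority of all acceptors:

```java
import java.util.*;

// A set of acceptors forms a quorum iff it is a strict majority of all acceptors.
class Quorum {
    static boolean isQuorum(Set<String> subset, Set<String> all) {
        return all.containsAll(subset) && subset.size() > all.size() / 2;
    }
}
```

The key consequence: any two majorities of the same acceptor set must intersect, so two quorums can never have agreed on conflicting facts.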
Basic Paxos
• Determines the authoritative value for a single variable
• Several proposers offer a value Vn to set the variable to
• The system converges on a single agreed-upon V to be the fact
Step 1: Prepare
Credit: spinnaker labs inc.
Step 2: Promise
• PROMISE x – the acceptor will now accept only proposals numbered x or higher
• Proposer 1 (with proposal number j) is ineligible because an acceptor quorum has already promised a number higher than j
Step 3: Accept
Step 4: Accepted ack
Learning
If a learner interrogates the system, a quorum will respond with fact V_k
Basic Paxos continued …
• Proposer 1 is free to try again with a proposal number > k; it can take over leadership and write a new authoritative value
– The official fact changes atomically on all acceptors from the perspective of learners
– If a leader dies mid-negotiation, the value just drops; another leader tries again with a higher proposal number
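The prepare/promise/accept flow of the preceding steps can be sketched as a single-decree Paxos round over in-memory acceptors. This is a teaching sketch under strong assumptions: messages are plain method calls, so there are no real delays, losses, or crashes; it only illustrates the quorum logic.

```java
import java.util.*;

// A promise carries any previously accepted (number, value) pair.
class Promise {
    final long acceptedN; final String acceptedV;
    Promise(long n, String v) { acceptedN = n; acceptedV = v; }
}

class Acceptor {
    long promised = -1, acceptedN = -1;
    String acceptedV = null;

    Promise prepare(long n) {               // Phase 1: PREPARE -> PROMISE
        if (n <= promised) return null;     // already promised a higher number
        promised = n;
        return new Promise(acceptedN, acceptedV);
    }

    boolean accept(long n, String v) {      // Phase 2: ACCEPT -> ACCEPTED ack
        if (n < promised) return false;
        promised = n; acceptedN = n; acceptedV = v;
        return true;
    }
}

class Proposer {
    // Run one round with proposal number n and candidate value v.
    // Returns the value the quorum converged on, or null if the round failed.
    static String propose(List<Acceptor> cell, long n, String v) {
        List<Promise> promises = new ArrayList<>();
        for (Acceptor a : cell) {
            Promise p = a.prepare(n);
            if (p != null) promises.add(p);
        }
        if (promises.size() <= cell.size() / 2) return null; // no quorum of promises

        // If any acceptor already accepted a value, adopt the one with the
        // highest proposal number instead of our own candidate.
        String chosen = v; long best = -1;
        for (Promise p : promises) {
            if (p.acceptedV != null && p.acceptedN > best) {
                best = p.acceptedN; chosen = p.acceptedV;
            }
        }
        int acks = 0;
        for (Acceptor a : cell) if (a.accept(n, chosen)) acks++;
        return acks > cell.size() / 2 ? chosen : null;
    }
}
```

Note how a later proposer with a higher number "takes over leadership" but is forced to re-propose the already-chosen value, which is exactly why the fact appears to change atomically for learners.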
Paxos in Chubby
• Replicas in a cell initially use Paxos to establish the leader
• A majority of replicas must agree
• Replicas promise not to try to elect a new master for at least a few seconds (the master lease)
• The master lease is periodically renewed

Read more:
– http://labs.google.com/papers/chubby.html
– http://labs.google.com/papers/bigtable-osdi06.pdf
Big Table
• Google's needs:
– Data reliability
– High-speed retrieval
– Storage of huge numbers of records (several TB of data)
– (Multiple) past versions of records should be available
HBase – Big Table
• Features:
– Simplified data retrieval mechanism
– (row, col, timestamp) → value lookup, only
– No relational operators
– Arbitrary number of columns per row
– Arbitrary data type for each column
• New constraint: data validation must be performed by the application layer!
Logical Data Representation
• Rows & columns identified by arbitrary strings
• Multiple versions of a (row, col) cell can be accessed through timestamps
– Application controls the version-tracking policy
• Columns grouped into column families
Data Model
• Related columns stored in a fixed number of families
– Family name is a prefix of the column name
– e.g., fileattr:owning_group, fileattr:owning_user
• A column name has the form "<family>:<label>" where <family> and <label> can be arbitrary byte arrays
• Lookup is hash-based
• Column families are stored physically close on disk
– Items in a given column family should have roughly the same read/write characteristics and contain similar data
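A toy version of this (row, "family:label", timestamp) → value model can be written as nested sorted maps. This is an in-memory sketch only: real BigTable/HBase cells hold uninterpreted byte arrays and rows are distributed across servers; the row and column names below are illustrative.

```java
import java.util.*;

// Toy sketch of the BigTable/HBase logical data model.
class ToyTable {
    // row -> column ("family:label") -> timestamp (descending) -> value
    private final Map<String, Map<String, NavigableMap<Long, String>>> rows = new TreeMap<>();

    void put(String row, String column, long ts, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(ts, value);
    }

    // Latest version of a cell, or null if absent.
    String get(String row, String column) {
        Map<String, NavigableMap<Long, String>> cols = rows.get(row);
        if (cols == null || !cols.containsKey(column)) return null;
        return cols.get(column).firstEntry().getValue();
    }

    // A specific version, addressed by (row, column, timestamp).
    String get(String row, String column, long ts) {
        Map<String, NavigableMap<Long, String>> cols = rows.get(row);
        if (cols == null || !cols.containsKey(column)) return null;
        return cols.get(column).get(ts);
    }
}
```

Note there are no relational operators here, only lookups, and nothing validates the stored values — matching the constraint that validation is pushed to the application layer.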
Conceptual View
[Diagram: conceptual table view showing rows and a column family]
Physical Storage View
• Each column family is stored in contiguous chunks over multiple nodes as the data grows
Example GET

DecimalFormat decimalFormat = new DecimalFormat("0000000");
HTable hTable = new HTable("rest_data");
String str = decimalFormat.format(4);
Get g = new Get(Bytes.toBytes(str));
Result r = hTable.get(g);
NavigableMap<byte[], byte[]> map = r.getFamilyMap(Bytes.toBytes("feature"));
Example PUT

DecimalFormat restIdFormat = new DecimalFormat("0000000");
HTable hTable = new HTable("restaurants");
String restId = restIdFormat.format(4);
Put put = new Put(Bytes.toBytes("rest_ids"));
put.add(Bytes.toBytes("restaurant_id"), Bytes.toBytes(restId), Bytes.toBytes(restId));
hTable.put(put);
HBase – BigTable
• Further reading, with many more details:
– http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture
– http://labs.google.com/papers/bigtable-osdi06.pdf
MapReduce
• Implementations run on the backbone of a DFS such as HDFS or GFS
• Using, if needed, storage solutions like HBase or BigTable
Word Count
• We have a large file of words, one word per line
• Count the number of times each distinct word appears in the file
• Sample application: analyze web server logs to find popular URLs
Word Count
• Input: a set of key/value pairs
• User supplies two functions:
– map(k, v) → list(k1, v1)
– reduce(k1, list(v1)) → v2
• (k1, v1) is an intermediate key/value pair
• Output is the set of (k1, v2) pairs
Word Count using MapReduce

map(key, value):
// key: document name; value: text of the document
for each word w in value:
emit(w, 1)

reduce(key, values):
// key: a word; values: an iterator over counts
result = 0
for each count v in values:
result += v
emit(key, result)
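The pseudocode above can be made concrete in plain Java, with the shuffle phase simulated by grouping intermediate (word, 1) pairs by key in memory. A real MapReduce framework would run map and reduce tasks on many nodes; this sketch only shows the data flow.

```java
import java.util.*;

// In-memory word count following the map/shuffle/reduce structure.
class WordCount {
    // map: emit (w, 1) for each word in the document text.
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : text.split("\\s+")) {
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        }
        return out;
    }

    // reduce: sum the counts collected for one word.
    static int reduce(List<Integer> counts) {
        int result = 0;
        for (int c : counts) result += c;
        return result;
    }

    // Driver: map over all documents, shuffle by key, reduce each group.
    static Map<String, Integer> run(List<String> docs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String doc : docs)
            for (Map.Entry<String, Integer> kv : map(doc))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet())
            out.put(g.getKey(), reduce(g.getValue()));
        return out;
    }
}
```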
Overview
Data Flow
• Input and final output are stored on a distributed file system (GFS, HDFS)
– Scheduler tries to schedule map tasks "close" to the physical storage location of the input data
– Intermediate results are stored on the local FS of map and reduce workers
• Output is often the input to another MapReduce task
– E.g. data mining – the Apriori algorithm
Coordination
• Master data structures
– Task status: (idle, in-progress, completed)
– Idle tasks get scheduled as workers become available
– When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer
– Master pushes this info to reducers
• Master pings workers periodically to detect failures
Failures
• Map worker failure
– Map tasks completed or in-progress at the worker are reset to idle
– Reduce workers are notified when a task is rescheduled on another worker
• Reduce worker failure
– Only in-progress tasks are reset to idle
• Master failure
– The MapReduce task is aborted and the client is notified
Combiners
• Often a map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
– E.g., popular words in Word Count
• Can save network time by pre-aggregating at the mapper
– combine(k1, list(v1)) → v2
– Usually the same as the reduce function
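For word count, the combiner is just the reduce-side summation applied on the map side, as sketched below; the pair representation matches the map output of the earlier word-count example and is otherwise an illustrative assumption.

```java
import java.util.*;

// Illustrative combiner: pre-aggregate (word, 1) pairs on the map side so
// fewer intermediate pairs cross the network to the reducers.
class Combiner {
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> local = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput)
            local.merge(kv.getKey(), kv.getValue(), Integer::sum);
        return local; // one (k, partial sum) pair per distinct key
    }
}
```

A combiner is safe here because addition is associative and commutative; combine functions that lack those properties would change the final result.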
Partition Function
• Inputs to map tasks are created by contiguous splits of the input file
• For reduce, we need to ensure that records with the same intermediate key end up at the same worker
• System uses a default partition function, e.g., hash(key) mod R
• Sometimes useful to override
– E.g., hash(hostname(URL)) mod R ensures URLs from the same host end up in the same output file
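Both the default partitioner and the host-based override can be sketched as below. The hostname extraction is a deliberately naive assumption for illustration; `Math.floorMod` guards against negative `hashCode` values.

```java
// Sketch of hash(key) mod R and a host-based override for URL keys.
class Partitioner {
    static int defaultPartition(String key, int R) {
        return Math.floorMod(key.hashCode(), R);
    }

    // Route by hostname so all URLs from one host reach the same reducer.
    static int byHost(String url, int R) {
        String host = url.replaceFirst("^[a-z]+://", "").split("/")[0];
        return Math.floorMod(host.hashCode(), R);
    }
}
```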
More Reading
– http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
– http://labs.google.com/papers/mapreduce-osdi04.pdf
– http://wiki.apache.org/hadoop/
– http://code.google.com/edu/parallel/mapreduce-tutorial.html