Map Reduce
Typical single node architecture
[Diagram: a single node with CPU, Memory, and Storage, with the Application running on top; MapReduce sits between the Application and the storage layer]
• Counting
• Sorting
– Merge sort, Quick sort
• BIG Data
– Data Mining
– Trend Analysis, e.g. Twitter
– Recommendation Systems
• If bought = (A, B) => likely to buy C
– Google Search
The Underlying Technologies
Distributed systems, storage, computing …
• Web data sets can be very large
– Tens to hundreds of terabytes … soon petabyte(s)
• Cannot mine on a single server (why?)
• Standard architecture emerging:
– Cluster of commodity Linux nodes
– (Very) high-speed Ethernet interconnect
• How to organize computations on this architecture?
– Storage is cheap, but data management is not
– Nodes are bound to fail
• Mask issues such as hardware failure
Goal: Stable Storage
• For: (stable) computation
• In other words: if any of the nodes fails, how do we ensure data availability, persistence … ?
• Answer:
– Distribute it
– Have redundancy
– 'Manage' this
• Data operations and services
– Store and retrieve on a single logical resource that is distributed over a number of 'locations'
Filesystem!
DFS
• Distributed File System
– Provides global file namespace
– Google GFS; Hadoop HDFS; etc.
– Typical usage pattern:
• Huge files (100s of GB to TB)
• Reads and appends are common
DFS
• Chunk servers
– File is split into contiguous chunks
– Typically each chunk is 16–64 MB
– Each chunk replicated (usually 2x or 3x)
– Try to keep replicas in different racks
• Master node (GFS)
– a.k.a. Name Node in HDFS
– Stores metadata
– Might be replicated
• Client library for file access
– Talks to master to find chunk servers
– Connects directly to chunk servers to access data
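As a rough illustration of chunking and rack-aware replication, the sketch below splits a file into contiguous fixed-size chunks and assigns each chunk's replicas to servers on distinct racks. The chunk size, rack layout, and round-robin placement policy are simplified assumptions for illustration only, not the actual GFS/HDFS algorithms.

```java
import java.util.*;

// Hypothetical sketch of DFS-style chunking and replica placement.
class ChunkPlacement {
    static final int CHUNK_SIZE = 4; // tiny for illustration; real chunks are 16-64 MB

    // Split a file (byte array) into contiguous fixed-size chunks.
    static List<byte[]> split(byte[] file) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < file.length; off += CHUNK_SIZE) {
            chunks.add(Arrays.copyOfRange(file, off, Math.min(off + CHUNK_SIZE, file.length)));
        }
        return chunks;
    }

    // Pick `replicas` servers for one chunk, preferring distinct racks.
    static List<String> place(int chunkIndex, Map<String, List<String>> racks, int replicas) {
        List<String> rackNames = new ArrayList<>(racks.keySet());
        List<String> chosen = new ArrayList<>();
        for (int r = 0; r < replicas; r++) {
            String rack = rackNames.get((chunkIndex + r) % rackNames.size());
            List<String> servers = racks.get(rack);
            chosen.add(servers.get(chunkIndex % servers.size()));
        }
        return chosen;
    }
}
```

With three racks and a replication factor of 3, every chunk lands on three different racks, so losing one rack never loses all copies.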
Chubby
• A coarse-grained lock service
– Distributed systems can use this to synchronize access to shared resources
– Intended for use by loosely-coupled distributed systems
– In GFS: elect a master
– In BigTable: master election, client discovery, table service locking
Interface
• Presents a simple distributed file system
– Clients can open/close/read/write files
– Reads and writes are whole-file
– Also supports advisory reader/writer locks
– Clients can register for notification of file updates
Topology
[Diagram: one Chubby cell, consisting of a master and several replicas; all client traffic goes to the master]
Master Election
• All replicas try to acquire a write lock on a designated file
• The one who gets the lock is the master
– The master can then write its address to the file
– Other replicas can read this file to discover the chosen master's name
• Chubby doubles as a name service
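The election scheme above can be sketched in a few lines. This is an in-memory stand-in only: the `LockFile` class and the address format are hypothetical, and real Chubby acquires locks over RPC against a replicated cell.

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of lock-based master election on a designated "file".
class LockFile {
    private final AtomicReference<String> holder = new AtomicReference<>();
    private volatile String contents = "";

    // Try to take the exclusive write lock; only the first caller succeeds.
    boolean tryAcquire(String replica) {
        return holder.compareAndSet(null, replica);
    }

    void write(String data) { contents = data; }   // the master writes its address
    String read() { return contents; }             // others read it to discover the master
}

class Election {
    // Each replica attempts the lock; the winner records its address.
    static String elect(LockFile file, String[] replicas) {
        for (String r : replicas) {
            if (file.tryAcquire(r)) {
                file.write(r + ":9090"); // hypothetical address format
            }
        }
        return file.read(); // any replica can now look up the master
    }
}
```

Because the lock file also holds the master's address, the same mechanism serves as the name service mentioned above.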
Consensus
• A Chubby cell is usually 5 replicas
– 3 must be alive for the cell to be viable
• How do replicas in Chubby agree on their own master and on official lock values?
– The PAXOS algorithm
PAXOS
Paxos is a family of algorithms (by Leslie Lamport) designed to provide distributed consensus in a network of several processors.
Processor Assumptions
• Operate at arbitrary speed
• Independent, random failures
• Processors with stable storage may rejoin the protocol after failure
• Do not lie, collude, or attempt to maliciously subvert the protocol
Network Assumptions
• All processors can communicate with (see) one another
• Messages are sent asynchronously and may take arbitrarily long to deliver
• Order of messages is not guaranteed: they may be lost, reordered, or duplicated
• Messages, if delivered, are not corrupted in the process
A Fault-Tolerant Memory of Facts
• Paxos provides a memory for individual "facts" in the network
– A fact is a binding from a variable to a value
– Paxos between 2F+1 processors is reliable and can make progress if up to F of them fail
Roles
• Proposer – an agent that proposes a fact
• Leader – the authoritative proposer
• Acceptor – holds agreed-upon facts in its memory
• Learner – may retrieve a fact from the system
Safety Guarantees
• Nontriviality: Only proposed values can be learned
• Consistency: At most one value can be learned
• Liveness: If at least one value V has been proposed, eventually any learner L will get some value
Key Idea
• Acceptors do not act unilaterally: for a fact to be learned, a quorum of acceptors must agree upon the fact
• A quorum is any majority of acceptors
• Given acceptors {A, B, C, D}, Q = {{A, B, C}, {A, B, D}, {B, C, D}, {A, C, D}}
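The quorum rule is simple enough to state directly in code. The sketch below (names are illustrative) checks that a candidate set is a strict majority of all acceptors:

```java
import java.util.*;

// A set of acceptors forms a quorum iff it is a strict majority of all acceptors.
class Quorum {
    static boolean isQuorum(Set<String> subset, Set<String> all) {
        return all.containsAll(subset) && subset.size() > all.size() / 2;
    }
}
```

The key consequence: any two majorities of the same acceptor set must intersect, so two quorums can never have agreed on conflicting facts.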
Basic Paxos
• Determines the authoritative value for a single variable
• Several proposers offer a value Vn to set the variable to
• The system converges on a single agreed-upon V to be the fact
Step 1: Prepare
Credit: spinnaker labs inc.
Step 2: Promise
• PROMISE x – the acceptor will now accept only proposals numbered x or higher
• Proposer 1 (with proposal number j) is ineligible because an acceptor quorum has already promised a number higher than j
Step 3: Accept
Step 4: Accepted ack
Learning
If a learner interrogates the system, a quorum will respond with fact V_k
Basic Paxos continued …
• Proposer 1 is free to try again with a proposal number > k; it can take over leadership and write a new authoritative value
– The official fact changes atomically on all acceptors from the perspective of learners
– If a leader dies mid-negotiation, the value just drops; another leader tries again with a higher proposal number
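The prepare/promise/accept flow of the preceding steps can be sketched as a single-decree Paxos round over in-memory acceptors. This is a teaching sketch under strong assumptions: messages are plain method calls, so there are no real delays, losses, or crashes; it only illustrates the quorum logic.

```java
import java.util.*;

// A promise carries any previously accepted (number, value) pair.
class Promise {
    final long acceptedN; final String acceptedV;
    Promise(long n, String v) { acceptedN = n; acceptedV = v; }
}

class Acceptor {
    long promised = -1, acceptedN = -1;
    String acceptedV = null;

    Promise prepare(long n) {               // Phase 1: PREPARE -> PROMISE
        if (n <= promised) return null;     // already promised a higher number
        promised = n;
        return new Promise(acceptedN, acceptedV);
    }

    boolean accept(long n, String v) {      // Phase 2: ACCEPT -> ACCEPTED ack
        if (n < promised) return false;
        promised = n; acceptedN = n; acceptedV = v;
        return true;
    }
}

class Proposer {
    // Run one round with proposal number n and candidate value v.
    // Returns the value the quorum converged on, or null if the round failed.
    static String propose(List<Acceptor> cell, long n, String v) {
        List<Promise> promises = new ArrayList<>();
        for (Acceptor a : cell) {
            Promise p = a.prepare(n);
            if (p != null) promises.add(p);
        }
        if (promises.size() <= cell.size() / 2) return null; // no quorum of promises

        // If any acceptor already accepted a value, adopt the one with the
        // highest proposal number instead of our own candidate.
        String chosen = v; long best = -1;
        for (Promise p : promises) {
            if (p.acceptedV != null && p.acceptedN > best) {
                best = p.acceptedN; chosen = p.acceptedV;
            }
        }
        int acks = 0;
        for (Acceptor a : cell) if (a.accept(n, chosen)) acks++;
        return acks > cell.size() / 2 ? chosen : null;
    }
}
```

Note how a later proposer with a higher number "takes over leadership" but is forced to re-propose the already-chosen value, which is exactly why the fact appears to change atomically for learners.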
Paxos in Chubby
• Replicas in a cell initially use Paxos to establish the leader
• A majority of replicas must agree
• Replicas promise not to try to elect a new master for at least a few seconds (the master lease)
• The master lease is periodically renewed

Read more:
– http://labs.google.com/papers/chubby.html
– http://labs.google.com/papers/bigtable-osdi06.pdf
Big Table
• Google's needs:
– Data reliability
– High-speed retrieval
– Storage of huge numbers of records (several TB of data)
– (Multiple) past versions of records should be available
HBase – Big Table
• Features:
– Simplified data retrieval mechanism
– (row, col, timestamp) → value lookup, only
– No relational operators
– Arbitrary number of columns per row
– Arbitrary data type for each column
• New constraint: data validation must be performed by the application layer!
Logical Data Representation
• Rows & columns identified by arbitrary strings
• Multiple versions of a (row, col) cell can be accessed through timestamps
– Application controls the version-tracking policy
• Columns grouped into column families
Data Model
• Related columns stored in a fixed number of families
– Family name is a prefix of the column name
– e.g., fileattr:owning_group, fileattr:owning_user
• A column name has the form "<family>:<label>" where <family> and <label> can be arbitrary byte arrays
• Lookup is hash-based
• Column families are stored physically close on disk
– Items in a given column family should have roughly the same read/write characteristics and contain similar data
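A toy version of this (row, "family:label", timestamp) → value model can be written as nested sorted maps. This is an in-memory sketch only: real BigTable/HBase cells hold uninterpreted byte arrays and rows are distributed across servers; the row and column names below are illustrative.

```java
import java.util.*;

// Toy sketch of the BigTable/HBase logical data model.
class ToyTable {
    // row -> column ("family:label") -> timestamp (descending) -> value
    private final Map<String, Map<String, NavigableMap<Long, String>>> rows = new TreeMap<>();

    void put(String row, String column, long ts, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(ts, value);
    }

    // Latest version of a cell, or null if absent.
    String get(String row, String column) {
        Map<String, NavigableMap<Long, String>> cols = rows.get(row);
        if (cols == null || !cols.containsKey(column)) return null;
        return cols.get(column).firstEntry().getValue();
    }

    // A specific version, addressed by (row, column, timestamp).
    String get(String row, String column, long ts) {
        Map<String, NavigableMap<Long, String>> cols = rows.get(row);
        if (cols == null || !cols.containsKey(column)) return null;
        return cols.get(column).get(ts);
    }
}
```

Note there are no relational operators here, only lookups, and nothing validates the stored values — matching the constraint that validation is pushed to the application layer.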
Conceptual View
[Diagram: conceptual table view showing rows and a column family]
Physical Storage View
• Each column family is stored in contiguous chunks over multiple nodes as the data grows
Example GET

DecimalFormat decimalFormat = new DecimalFormat("0000000");
HTable hTable = new HTable("rest_data");
String str = decimalFormat.format(4);
Get g = new Get(Bytes.toBytes(str));
Result r = hTable.get(g);
NavigableMap<byte[], byte[]> map = r.getFamilyMap(Bytes.toBytes("feature"));
Example PUT

DecimalFormat restIdFormat = new DecimalFormat("0000000");
HTable hTable = new HTable("restaurants");
String restId = restIdFormat.format(4);
Put put = new Put(Bytes.toBytes("rest_ids"));
put.add(Bytes.toBytes("restaurant_id"), Bytes.toBytes(restId), Bytes.toBytes(restId));
hTable.put(put);
HBase – BigTable
• Further reading, with many more details:
– http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture
– http://labs.google.com/papers/bigtable-osdi06.pdf
MapReduce
• Implementations run on the backbone of a DFS such as HDFS or GFS
• Using, if needed, storage solutions like HBase or BigTable
Word Count
• We have a large file of words, one word per line
• Count the number of times each distinct word appears in the file
• Sample application: analyze web server logs to find popular URLs
Word Count
• Input: a set of key/value pairs
• User supplies two functions:
– map(k, v) → list(k1, v1)
– reduce(k1, list(v1)) → v2
• (k1, v1) is an intermediate key/value pair
• Output is the set of (k1, v2) pairs
Word Count using MapReduce

map(key, value):
// key: document name; value: text of the document
for each word w in value:
emit(w, 1)

reduce(key, values):
// key: a word; values: an iterator over counts
result = 0
for each count v in values:
result += v
emit(key, result)
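The pseudocode above can be made concrete in plain Java, with the shuffle phase simulated by grouping intermediate (word, 1) pairs by key in memory. A real MapReduce framework would run map and reduce tasks on many nodes; this sketch only shows the data flow.

```java
import java.util.*;

// In-memory word count following the map/shuffle/reduce structure.
class WordCount {
    // map: emit (w, 1) for each word in the document text.
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : text.split("\\s+")) {
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        }
        return out;
    }

    // reduce: sum the counts collected for one word.
    static int reduce(List<Integer> counts) {
        int result = 0;
        for (int c : counts) result += c;
        return result;
    }

    // Driver: map over all documents, shuffle by key, reduce each group.
    static Map<String, Integer> run(List<String> docs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String doc : docs)
            for (Map.Entry<String, Integer> kv : map(doc))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet())
            out.put(g.getKey(), reduce(g.getValue()));
        return out;
    }
}
```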
Overview
Data Flow
• Input and final output are stored on a distributed file system (GFS, HDFS)
– Scheduler tries to schedule map tasks "close" to the physical storage location of the input data
– Intermediate results are stored on the local FS of map and reduce workers
• Output is often the input to another MapReduce task
– E.g. data mining – the Apriori algorithm
Coordination
• Master data structures
– Task status: (idle, in-progress, completed)
– Idle tasks get scheduled as workers become available
– When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer
– Master pushes this info to reducers
• Master pings workers periodically to detect failures
Failures
• Map worker failure
– Map tasks completed or in-progress at the worker are reset to idle
– Reduce workers are notified when a task is rescheduled on another worker
• Reduce worker failure
– Only in-progress tasks are reset to idle
• Master failure
– The MapReduce task is aborted and the client is notified
Combiners
• Often a map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
– E.g., popular words in Word Count
• Can save network time by pre-aggregating at the mapper
– combine(k1, list(v1)) → v2
– Usually the same as the reduce function
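For word count, the combiner is just the reduce-side summation applied on the map side, as sketched below; the pair representation matches the map output of the earlier word-count example and is otherwise an illustrative assumption.

```java
import java.util.*;

// Illustrative combiner: pre-aggregate (word, 1) pairs on the map side so
// fewer intermediate pairs cross the network to the reducers.
class Combiner {
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> local = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput)
            local.merge(kv.getKey(), kv.getValue(), Integer::sum);
        return local; // one (k, partial sum) pair per distinct key
    }
}
```

A combiner is safe here because addition is associative and commutative; combine functions that lack those properties would change the final result.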
Partition Function
• Inputs to map tasks are created by contiguous splits of the input file
• For reduce, we need to ensure that records with the same intermediate key end up at the same worker
• System uses a default partition function, e.g., hash(key) mod R
• Sometimes useful to override
– E.g., hash(hostname(URL)) mod R ensures URLs from the same host end up in the same output file
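Both the default partitioner and the host-based override can be sketched as below. The hostname extraction is a deliberately naive assumption for illustration; `Math.floorMod` guards against negative `hashCode` values.

```java
// Sketch of hash(key) mod R and a host-based override for URL keys.
class Partitioner {
    static int defaultPartition(String key, int R) {
        return Math.floorMod(key.hashCode(), R);
    }

    // Route by hostname so all URLs from one host reach the same reducer.
    static int byHost(String url, int R) {
        String host = url.replaceFirst("^[a-z]+://", "").split("/")[0];
        return Math.floorMod(host.hashCode(), R);
    }
}
```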
More Reading
– http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
– http://labs.google.com/papers/mapreduce-osdi04.pdf
– http://wiki.apache.org/hadoop/
– http://code.google.com/edu/parallel/mapreduce-tutorial.html