Reliability & Chubby
CSE 490H
This presentation incorporates content licensed under the Creative Commons Attribution 2.5 License.

Overview
- Writable / WritableComparable
- Reliability review
- Chubby + PAXOS

Datatypes in Hadoop
- Hadoop provides support for primitive datatypes
  - String → Text
  - Integer → IntWritable
  - Long → LongWritable
  - FloatWritable, DoubleWritable, ByteWritable, ArrayWritable…

The Writable Interface
interface Writable {
  public void readFields(DataInput in) throws IOException;
  public void write(DataOutput out) throws IOException;
}

Example: LongWritable
public class LongWritable implements WritableComparable {
  private long value;

  public void readFields(DataInput in) throws IOException {
    value = in.readLong();
  }

  public void write(DataOutput out) throws IOException {
    out.writeLong(value);
  }
}

WritableComparable
- Extends Writable so the data can be used as a key, not just a value
  int compareTo(Object what)
  int hashCode()
- this.compareTo(x) == 0 => x.hashCode() == this.hashCode()
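
The contract above can be illustrated with a small key type. This is a minimal sketch, not Hadoop code: `IntPairKey` is a hypothetical name, and it implements plain `java.lang.Comparable` so that it compiles without Hadoop on the classpath. A real key would implement WritableComparable and add readFields() and write().

```java
import java.util.Objects;

// Hypothetical key type for illustration; serialization methods are omitted
// so the example can focus on the ordering/hashing contract.
final class IntPairKey implements Comparable<IntPairKey> {
    private final int fst;
    private final int snd;

    IntPairKey(int fst, int snd) {
        this.fst = fst;
        this.snd = snd;
    }

    // Order by the first component, then by the second.
    @Override
    public int compareTo(IntPairKey other) {
        int c = Integer.compare(fst, other.fst);
        return (c != 0) ? c : Integer.compare(snd, other.snd);
    }

    // Equality is defined through compareTo, so the two cannot disagree.
    @Override
    public boolean equals(Object o) {
        return (o instanceof IntPairKey) && compareTo((IntPairKey) o) == 0;
    }

    // Hash exactly the fields that compareTo inspects, so that
    // compareTo(x) == 0 implies equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(fst, snd);
    }
}

class IntPairKeyDemo {
    public static void main(String[] args) {
        IntPairKey a = new IntPairKey(3, 7);
        IntPairKey b = new IntPairKey(3, 7);
        System.out.println(a.compareTo(b) == 0 && a.hashCode() == b.hashCode());
    }
}
```

The hash code matters for keys because hash-based partitioning decides which reducer sees a key; two keys that compare as equal but hash differently could end up on different reducers.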

A Composite Writable
class IntPairWritable implements Writable {
  private int fst;
  private int snd;

  public void readFields(DataInput in) throws IOException {
    fst = in.readInt();
    snd = in.readInt();
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(fst);
    out.writeInt(snd);
  }
}

A Composite Writable (2)
class IntPairWritable implements Writable {
  // The fields must be non-null before readFields() delegates to them.
  private IntWritable fst = new IntWritable();
  private IntWritable snd = new IntWritable();

  public void readFields(DataInput in) throws IOException {
    fst.readFields(in);
    snd.readFields(in);
  }

  public void write(DataOutput out) throws IOException {
    fst.write(out);
    snd.write(out);
  }
}

Marshalling Order Constraint
- readFields() and write() must operate in the same order
[Diagram: write() on class Foo { T1 A; T2 B; } emits serialized A followed by serialized B; readFields() on the receiving Foo consumes them in that same order]
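
A short round trip makes the constraint concrete. The sketch below uses only `java.io` streams rather than Hadoop classes, and the class and method names (`MarshalOrderDemo`, `readSameOrder`, `readWrongOrder`) are made up for illustration. Reading in a different order from the write does not fail; it silently puts the bytes in the wrong fields.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class MarshalOrderDemo {
    // Serializes fst first, then snd.
    static byte[] write(int fst, int snd) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(fst);
            out.writeInt(snd);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the fields in the order they were written. Returns {fst, snd}.
    static int[] readSameOrder(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int fst = in.readInt();
            int snd = in.readInt();
            return new int[] {fst, snd};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads snd before fst, which does not match write(). Returns {fst, snd}.
    static int[] readWrongOrder(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int snd = in.readInt();
            int fst = in.readInt();
            return new int[] {fst, snd};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = write(1, 2);
        int[] good = readSameOrder(bytes);
        int[] bad = readWrongOrder(bytes);
        System.out.println("same order:  fst=" + good[0] + " snd=" + good[1]);
        System.out.println("wrong order: fst=" + bad[0] + " snd=" + bad[1]);
    }
}
```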

Subclassing is problematic
class AaronsData implements Writable { }

class TypeA extends AaronsData {
  int fieldA;
}

class TypeB extends AaronsData {
  float fieldB;
}

Cannot do this with Hadoop!

Attempt 2…
class AaronsData implements Writable {
  int fieldA;
  float fieldB;
}
But we only want to populate one field at a time; how do we determine which is the “real” field?

Looking at the Bytes
tag (0) fieldA data
tag (1) fieldB data

Tag-Discriminated Union
class AaronsData implements Writable {
  static final int TYPE_A = 0, TYPE_B = 1;
  int TAG;
  int fieldA;
  float fieldB;

  // Read the tag first; it says which field follows in the byte stream.
  public void readFields(DataInput in) throws IOException {
    TAG = in.readInt();
    if (TAG == TYPE_A) {
      fieldA = in.readInt();
    } else {
      fieldB = in.readFloat();
    }
  }
}
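
The slide shows only the reading half. A possible matching write() is sketched below, together with a round trip through a byte array. `TaggedValue` and `roundTrip` are hypothetical names, the class deliberately does not depend on Hadoop, and rejecting unknown tags is an added assumption rather than something the slide specifies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class TaggedValue {
    static final int TYPE_A = 0, TYPE_B = 1;
    int tag;
    int fieldA;
    float fieldB;

    // Write the tag, then only the field that the tag selects.
    void write(DataOutput out) throws IOException {
        out.writeInt(tag);
        if (tag == TYPE_A) {
            out.writeInt(fieldA);
        } else if (tag == TYPE_B) {
            out.writeFloat(fieldB);
        } else {
            throw new IOException("unknown tag " + tag);
        }
    }

    // Mirror image of write(): same order, same branch per tag.
    void readFields(DataInput in) throws IOException {
        tag = in.readInt();
        if (tag == TYPE_A) {
            fieldA = in.readInt();
        } else if (tag == TYPE_B) {
            fieldB = in.readFloat();
        } else {
            throw new IOException("unknown tag " + tag);
        }
    }

    // Serializes v into memory and reads it back into a fresh object.
    static TaggedValue roundTrip(TaggedValue v) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            v.write(new DataOutputStream(buf));
            TaggedValue copy = new TaggedValue();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class TaggedValueDemo {
    public static void main(String[] args) {
        TaggedValue v = new TaggedValue();
        v.tag = TaggedValue.TYPE_B;
        v.fieldB = 2.5f;
        TaggedValue copy = TaggedValue.roundTrip(v);
        System.out.println("tag=" + copy.tag + " fieldB=" + copy.fieldB);
    }
}
```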

Reliability

Reliability Demands
- Support partial failure
  - Total system must support graceful decline in application performance rather than a full halt

Reliability Demands
- Data Recoverability
  - If components fail, their workload must be picked up by still-functioning units

Reliability Demands
- Individual Recoverability
  - Nodes that fail and restart must be able to rejoin the group activity without a full group restart

Reliability Demands
- Consistency
  - Concurrent operations or partial internal failures should not cause externally visible nondeterminism

Reliability Demands
- Scalability
  - Adding increased load to a system should not cause outright failure, but a graceful decline
  - Increasing resources should support a proportional increase in load capacity

Reliability Demands
- Security
  - The entire system should be impervious to unauthorized access
  - Requires considering many more attack vectors than single-machine systems

Ken Arnold, CORBA designer:
“Failure is the defining difference between distributed and local programming”

Component Failure
- Individual nodes simply stop

Data Failure
- Packets omitted by overtaxed router
- Or dropped by full receive-buffer in kernel
- Corrupt data retrieved from disk or net

Network Failure
- External & internal links can die
  - Some can be routed around in ring or mesh topology
  - Star topology may cause individual nodes to appear to halt
  - Tree topology may cause “split”
  - Messages may be sent multiple times or not at all or in corrupted form…

Timing Failure
- Temporal properties may be violated
  - Lack of “heartbeat” message may be interpreted as component halt
  - Clock skew between nodes may confuse version-aware data readers

Byzantine Failure
- Difficult-to-reason-about circumstances arise
  - Commands sent to foreign node are not confirmed: What can we reason about the state of the system?

Malicious Failure
- Malicious (or maybe naïve) operator injects invalid or harmful commands into system

Preparing for Failure
- Distributed systems must be robust to these failure conditions
- But there are lots of pitfalls…

The Eight Design Fallacies
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
-- Peter Deutsch and James Gosling, Sun Microsystems

Dealing With Component Failure
- Use heartbeats to monitor component availability
- “Buddy” or “Parent” node is aware of desired computation and can restart it elsewhere if needed
- Individual storage nodes should not be the sole owner of data
  - Pitfall: How do you keep replicas consistent?
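
Heartbeat monitoring can be sketched in a few lines. The example below is illustrative only: `HeartbeatMonitor` and its methods are hypothetical names, and the current time is passed in as a parameter (instead of being read from the system clock) so the behaviour is easy to test.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HeartbeatMonitor {
    private final long timeoutMillis;
    // Node name -> time (in ms) at which its last heartbeat arrived.
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat message from the node arrives.
    void recordHeartbeat(String node, long nowMillis) {
        lastSeen.put(node, nowMillis);
    }

    // Nodes whose last heartbeat is older than the timeout, sorted by name.
    // These are only *suspected*: the node may be alive behind a slow network.
    List<String> suspected(long nowMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}

class HeartbeatDemo {
    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(1000);
        monitor.recordHeartbeat("node-a", 0);
        monitor.recordHeartbeat("node-b", 0);
        monitor.recordHeartbeat("node-a", 900);
        System.out.println(monitor.suspected(1500));
    }
}
```

A "buddy" or "parent" node would poll `suspected(...)` and restart the missing node's work elsewhere. The timeout is a trade-off: a short one reacts quickly but produces more false suspicions.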

Dealing With Data Failure
- Data should be check-summed and verified at several points
  - Never trust another machine to do your data validation!
- Sequence identifiers can be used to ensure commands, packets are not lost
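
Both ideas fit in one small sketch: each frame carries a sequence number plus a CRC-32 over its contents, and the receiver re-computes the checksum itself instead of trusting the sender. `Frame` and `FrameReceiver` are hypothetical names, and CRC-32 is just one convenient choice of checksum from the Java standard library.

```java
import java.util.zip.CRC32;

class Frame {
    final long seq;
    final byte[] payload;
    final long crc;

    Frame(long seq, byte[] payload, long crc) {
        this.seq = seq;
        this.payload = payload;
        this.crc = crc;
    }

    // Builds a frame whose checksum covers the sequence number and payload.
    static Frame seal(long seq, byte[] payload) {
        return new Frame(seq, payload.clone(), checksum(seq, payload));
    }

    static long checksum(long seq, byte[] payload) {
        CRC32 crc = new CRC32();
        // Feed the 8 bytes of the sequence number, most significant first.
        for (int shift = 56; shift >= 0; shift -= 8) {
            crc.update((int) (seq >>> shift) & 0xFF);
        }
        crc.update(payload, 0, payload.length);
        return crc.getValue();
    }
}

class FrameReceiver {
    enum Result { OK, CORRUPT, GAP, DUPLICATE }

    private long expectedSeq = 0;

    Result receive(Frame f) {
        // Validate locally before looking at anything else in the frame.
        if (Frame.checksum(f.seq, f.payload) != f.crc) return Result.CORRUPT;
        if (f.seq < expectedSeq) return Result.DUPLICATE; // already processed
        if (f.seq > expectedSeq) return Result.GAP;       // something was lost
        expectedSeq++;
        return Result.OK;
    }
}

class FrameDemo {
    public static void main(String[] args) {
        FrameReceiver receiver = new FrameReceiver();
        Frame first = Frame.seal(0, new byte[] {1, 2, 3});
        System.out.println(receiver.receive(first));
        System.out.println(receiver.receive(first));
        System.out.println(receiver.receive(Frame.seal(2, new byte[] {4})));
    }
}
```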

Dealing With Network Failure
- Have well-defined split policy
  - Networks should routinely self-discover topology
  - Well-defined arbitration/leader election protocols determine authoritative components
  - Inactive components should gracefully clean up and wait for network rejoin

Dealing With Other Failures
- Individual application-specific problems can be difficult to envision
- Make as few assumptions about foreign machines as possible
- Design for security at each step

Chubby

What is it?
- A coarse-grained lock service
  - Other distributed systems can use this to synchronize access to shared resources
- Intended for use by “loosely-coupled distributed systems”

Design Goals
- High availability
- Reliability
- Anti-goals:
  - High performance
  - Throughput
  - Storage capacity

Intended Use Cases
- GFS: Elect a master
- BigTable: master election, client discovery, table service locking
- Well-known location to bootstrap larger systems
- Partition workloads
- Locks should be coarse: held for hours or days – build your own fast locks on top

External Interface
- Presents a simple distributed file system
- Clients can open/close/read/write files
  - Reads and writes are whole-file
- Also supports advisory reader/writer locks
- Clients can register for notification of file update

Files == Locks?
- “Files” are just handles to information
- These handles can have several attributes
  - The contents of the file are one (primary) attribute
  - So are the owner of the file, permissions, date modified, etc.
  - A handle can also have an attribute indicating whether the file is locked or not

Topology
[Diagram: one Chubby “cell” made up of a master replica and four other replicas; all client traffic goes to the master]

Master election
- Master election is simple: all replicas try to acquire a write lock on a designated file. The one who gets the lock is the master.
  - Master can then write its address to the file; other replicas can read this file to discover the chosen master name.
  - Chubby doubles as a name service
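
The election recipe can be sketched against an in-memory stand-in for the lock file. Nothing here is the Chubby API: `LockFile`, `Election`, and `campaign` are hypothetical names, and a single Java object plays the role of the designated file so the example runs on one machine.

```java
class LockFile {
    private String holder;        // null while nobody holds the write lock
    private String contents = ""; // whole-file contents

    // Returns true only for the first candidate that asks for the lock.
    synchronized boolean tryAcquire(String candidate) {
        if (holder != null) {
            return false;
        }
        holder = candidate;
        return true;
    }

    // The lock is advisory: writers are trusted to check it themselves.
    synchronized void write(String data) {
        contents = data;
    }

    synchronized String read() {
        return contents;
    }
}

class Election {
    // Every candidate runs the same code. The winner publishes its address;
    // everyone (winner or not) then reads the file to learn who the master is.
    // In a real deployment a loser may need to re-read until the winner has
    // finished writing; this sequential sketch ignores that window.
    static String campaign(LockFile file, String myAddress) {
        if (file.tryAcquire(myAddress)) {
            file.write(myAddress);
        }
        return file.read();
    }
}

class ElectionDemo {
    public static void main(String[] args) {
        LockFile file = new LockFile();
        System.out.println(Election.campaign(file, "host-a:7000"));
        System.out.println(Election.campaign(file, "host-b:7000"));
    }
}
```

Because the winner's address lives in the file, any later reader can look up the master by name, which is the "name service" role mentioned above.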

Distributed Consensus
- Chubby cell is usually 5 replicas
  - 3 must be alive for cell to be viable
- How do replicas in Chubby agree on their own master, official lock values?
  - PAXOS algorithm

PAXOS
- Paxos is a family of algorithms (by Leslie Lamport) designed to provide distributed consensus in a network of several processors.

Processor Assumptions
- Operate at arbitrary speed
- Independent, random failures
- Procs with stable storage may rejoin protocol after failure
- Do not lie, collude, or attempt to maliciously subvert the protocol

Network Assumptions
- All processors can communicate with (“see”) one another
- Messages are sent asynchronously and may take arbitrarily long to deliver
- Order of messages is not guaranteed: they may be lost, reordered, or duplicated
- Messages, if delivered, are not corrupted in the process

A Fault Tolerant Memory of Facts
- Paxos provides a memory for individual “facts” in the network.
- A fact is a binding from a variable to a value.
- Paxos between 2F+1 processors is reliable and can make progress if up to F of them fail.

Roles
- Proposer – An agent that proposes a fact
- Leader – the authoritative proposer
- Acceptor – holds agreed-upon facts in its memory
- Learner – May retrieve a fact from the system

Safety Guarantees
- Nontriviality: Only proposed values can be learned
- Consistency: At most one value can be learned
- Liveness: If at least one value V has been proposed, eventually any learner L will get some value

Key Idea
- Acceptors do not act unilaterally. For a fact to be learned, a quorum of acceptors must agree upon the fact
- A quorum is any majority of acceptors
- Given acceptors {A, B, C, D}, the smallest quorums are Q = {{A, B, C}, {A, B, D}, {B, C, D}, {A, C, D}}
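
The reason majorities work is that any two of them share at least one acceptor, so two conflicting facts can never both gather a quorum. The sketch below enumerates the smallest quorums for a group and checks that property by brute force; `Quorums` and its method names are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Quorums {
    // Smallest number of acceptors that forms a majority of n.
    static int quorumSize(int n) {
        return n / 2 + 1;
    }

    // How many acceptors may fail while a majority can still be assembled.
    // For n = 2F+1 this evaluates to F.
    static int toleratedFailures(int n) {
        return n - quorumSize(n);
    }

    // All subsets containing exactly quorumSize(n) acceptors.
    static List<Set<String>> smallestQuorums(List<String> acceptors) {
        int n = acceptors.size();
        int k = quorumSize(n);
        List<Set<String>> result = new ArrayList<>();
        for (int mask = 0; mask < (1 << n); mask++) {
            if (Integer.bitCount(mask) != k) continue;
            Set<String> quorum = new HashSet<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) quorum.add(acceptors.get(i));
            }
            result.add(quorum);
        }
        return result;
    }

    // True if every pair of quorums has at least one member in common.
    static boolean allPairsIntersect(List<Set<String>> quorums) {
        for (Set<String> a : quorums) {
            for (Set<String> b : quorums) {
                Set<String> common = new HashSet<>(a);
                common.retainAll(b);
                if (common.isEmpty()) return false;
            }
        }
        return true;
    }
}

class QuorumDemo {
    public static void main(String[] args) {
        List<Set<String>> q = Quorums.smallestQuorums(Arrays.asList("A", "B", "C", "D"));
        System.out.println(q.size() + " smallest quorums, all intersect: "
                + Quorums.allPairsIntersect(q));
        System.out.println("5 replicas tolerate " + Quorums.toleratedFailures(5) + " failures");
    }
}
```

With five replicas the quorum size is three, which matches the earlier statement that three of a cell's five replicas must be alive.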

Basic Paxos
- Determines the authoritative value for a single variable
- Several proposers offer a value Vn to set the variable to.
- The system converges on a single agreed-upon V to be the fact.

Step 1: Prepare
[Diagram: Proposer 1 sends PREPARE j and Proposer 2 sends PREPARE k to the three acceptors; k > j]

Step 2: Promise
- PROMISE x – Acceptor will accept proposals only numbered x or higher
- Proposer 1 is ineligible because a quorum has voted for a higher number than j
[Diagram: the three acceptors reply to Proposers 1 and 2 with PROMISE j, PROMISE k, and PROMISE k; k > j]

Step 3: Accept!
- Proposer 1 is disqualified; Proposer 2 offers a value
[Diagram: Proposer 2 sends ACCEPT! (v_k, k) to the acceptors]

Step 4: Accepted
- A quorum has accepted value v_k; it is now a fact
[Diagram: the acceptors reply Accepted k]
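
The four steps can be simulated in a single process. The sketch below is a teaching aid under strong simplifications: messages are plain method calls, nothing fails, and the class names (`Promise`, `Acceptor`, `Proposer`) are invented here. It also includes one detail from Lamport's description that the diagrams do not show: a promise reports any value the acceptor has already accepted, and the proposer carries the highest-numbered such value forward instead of its own.

```java
import java.util.Arrays;
import java.util.List;

class Promise {
    final boolean ok;
    final long acceptedN;   // number of the proposal accepted earlier, or -1
    final String acceptedV; // value accepted earlier, or null

    Promise(boolean ok, long acceptedN, String acceptedV) {
        this.ok = ok;
        this.acceptedN = acceptedN;
        this.acceptedV = acceptedV;
    }
}

class Acceptor {
    private long promised = -1;
    private long acceptedN = -1;
    private String acceptedV = null;

    // Steps 1-2: PREPARE n is answered with a promise only if n is new.
    Promise prepare(long n) {
        if (n <= promised) return new Promise(false, -1, null);
        promised = n;
        return new Promise(true, acceptedN, acceptedV);
    }

    // Steps 3-4: ACCEPT! (v, n) succeeds unless a higher number was promised.
    boolean accept(long n, String v) {
        if (n < promised) return false;
        promised = n;
        acceptedN = n;
        acceptedV = v;
        return true;
    }
}

class Proposer {
    // Runs one full round; returns the accepted value, or null without quorum.
    static String propose(List<Acceptor> acceptors, long n, String myValue) {
        int promises = 0;
        long bestN = -1;
        String value = myValue;
        for (Acceptor a : acceptors) {
            Promise p = a.prepare(n);
            if (!p.ok) continue;
            promises++;
            if (p.acceptedV != null && p.acceptedN > bestN) {
                bestN = p.acceptedN;
                value = p.acceptedV; // adopt the value already accepted
            }
        }
        if (promises <= acceptors.size() / 2) return null;
        int accepts = 0;
        for (Acceptor a : acceptors) {
            if (a.accept(n, value)) accepts++;
        }
        return (accepts > acceptors.size() / 2) ? value : null;
    }
}

class BasicPaxosDemo {
    public static void main(String[] args) {
        List<Acceptor> acceptors = Arrays.asList(new Acceptor(), new Acceptor(), new Acceptor());
        for (Acceptor a : acceptors) a.prepare(1);                   // Proposer 1: PREPARE j = 1
        System.out.println(Proposer.propose(acceptors, 2, "v_k"));   // Proposer 2: k = 2
        System.out.println(acceptors.get(0).accept(1, "v_j"));       // Proposer 1 is too late
    }
}
```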

Learning values
- If a learner interrogates the system, a quorum will respond with fact V_k
[Diagram: a learner asks the acceptors “v?” and receives V_k in reply]

Basic Paxos…
- Proposer 1 is free to try again with a proposal number > k; can take over leadership and write in a new authoritative value
  - Official fact will change “atomically” on all acceptors from perspective of learners
- If a leader dies mid-negotiation, value just drops, another leader tries with higher proposal

More Paxos Algorithms
- Not whole story
- MultiPaxos: steps 1-2 done once, 3-4 repeated multiple times by same leader
- Also: cheap Paxos, fast Paxos, generalized Paxos, Byzantine Paxos…

Paxos in Chubby
- Replicas in a cell initially use Paxos to establish the leader.
- Majority of replicas must agree
- Replicas promise not to try to elect new master for at least a few seconds (“master lease”)
- Master lease is periodically renewed

Client Updates
- All client updates go through master
- Master updates official database; sends copy of update to replicas
  - Majority of replicas must acknowledge receipt of update before master writes its own value
- Clients find master through DNS
  - Contacting replica causes redirect to master

Chubby File System
- Looks like simple UNIX FS: /ls/foo/wombat
  - All filenames start with ‘/ls’ (“lockservice”)
  - Second component is cell (“foo”)
  - Rest of the path is anything you want
- No inter-directory move operation
- Permissions use ACLs, non-inherited
- No symlinks/hardlinks
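
The naming scheme is easy to take apart in code. The helper below is a hypothetical illustration (`ChubbyPath` is not part of any Chubby client library): it checks the `/ls` prefix, extracts the cell name, and keeps the remainder of the path untouched.

```java
class ChubbyPath {
    final String cell; // second component, e.g. "foo"
    final String rest; // everything after the cell, starting with "/"

    ChubbyPath(String cell, String rest) {
        this.cell = cell;
        this.rest = rest;
    }

    static ChubbyPath parse(String path) {
        // "/ls/foo/wombat" splits into ["", "ls", "foo", "wombat"]; the limit
        // of 4 keeps any deeper components together in the last element.
        String[] parts = path.split("/", 4);
        boolean valid = parts.length >= 3
                && parts[0].isEmpty()
                && parts[1].equals("ls")
                && !parts[2].isEmpty();
        if (!valid) {
            throw new IllegalArgumentException("not a /ls/<cell>/... path: " + path);
        }
        String rest = (parts.length == 4) ? "/" + parts[3] : "/";
        return new ChubbyPath(parts[2], rest);
    }
}

class ChubbyPathDemo {
    public static void main(String[] args) {
        ChubbyPath p = ChubbyPath.parse("/ls/foo/wombat");
        System.out.println("cell=" + p.cell + " rest=" + p.rest);
    }
}
```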

Files
- Files have version numbers attached
- Opening a file returns a handle to the file
  - Clients cache all file data including file-not-found
  - Locks are advisory – not required to open file

Why Not Mandatory Locks?
- Locks represent client-controlled resources; how can Chubby enforce this?
- Mandatory locks imply shutting down client apps entirely to do debugging
  - Shutting down distributed applications is much trickier than in the single-machine case

Callbacks
- Master notifies clients if files modified, created, deleted, lock status changes
- Push-style notifications avoid the bandwidth cost of constant polling

Cache Consistency
- Clients cache all file content
- Must respond to Keep-Alive (KA) messages from the server at frequent intervals
- KA messages include invalidation requests
  - Responding to KA implies acknowledgement of cache invalidation
- Modification only continues after all caches are invalidated or their KAs time out

Client Sessions
- Sessions maintained between client and server
  - Keep-alive messages required to maintain session every few seconds
- If session is lost, server releases any client-held handles.
- What if master is late with next keep-alive?
  - Client has its own (longer) timeout to detect server failure

Master Failure
- If client does not hear back about keep-alive in local lease timeout, session is in jeopardy
  - Clear local cache
  - Wait for “grace period” (about 45 seconds)
  - Continue attempt to contact master
- Successful attempt => ok; jeopardy over
- Failed attempt => session assumed lost
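
The client-side behaviour reads naturally as a three-state machine. The sketch below is an interpretation of the bullets above, not Chubby's client code: `ClientSession` and its members are hypothetical, and time is passed in explicitly so the transitions can be exercised without waiting. The demo uses a 12-second lease and a 45-second grace period, the figures quoted in this deck.

```java
import java.util.HashMap;
import java.util.Map;

class ClientSession {
    enum State { ACTIVE, JEOPARDY, LOST }

    private final long leaseMillis;
    private final long graceMillis;
    private long leaseExpiry;
    private State state = State.ACTIVE;
    final Map<String, String> cache = new HashMap<>();

    ClientSession(long leaseMillis, long graceMillis, long nowMillis) {
        this.leaseMillis = leaseMillis;
        this.graceMillis = graceMillis;
        this.leaseExpiry = nowMillis + leaseMillis;
    }

    // Called periodically; applies any transition that is due by nowMillis.
    State tick(long nowMillis) {
        if (state == State.ACTIVE && nowMillis >= leaseExpiry) {
            state = State.JEOPARDY;
            cache.clear(); // cached data can no longer be trusted
        }
        // The grace period is measured from the moment the lease ran out.
        if (state == State.JEOPARDY && nowMillis - leaseExpiry >= graceMillis) {
            state = State.LOST;
        }
        return state;
    }

    // Called when the master answers a keep-alive.
    State onKeepAliveReply(long nowMillis) {
        tick(nowMillis); // a reply after the grace period is too late
        if (state != State.LOST) {
            state = State.ACTIVE;
            leaseExpiry = nowMillis + leaseMillis;
        }
        return state;
    }
}

class ClientSessionDemo {
    public static void main(String[] args) {
        ClientSession s = new ClientSession(12_000, 45_000, 0);
        s.cache.put("/ls/foo/wombat", "cached contents");
        System.out.println(s.tick(12_000));             // lease ran out
        System.out.println(s.onKeepAliveReply(20_000)); // master answered in time
    }
}
```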

Master Failure (2)
- If replicas lose contact with master, they wait for grace period (shorter: 4-6 secs)
- On timeout, hold new election

Reliability
- Started out using replicated Berkeley DB
- Now uses custom write-thru logging DB
- Entire database periodically sent to GFS
  - In a different data center
- Chubby replicas span multiple racks

Scalability
- 90K+ clients communicate with a single Chubby master (2 CPUs)
- System increases lease times from 12 sec up to 60 secs under heavy load
- Clients cache virtually everything
- Data is small – all held in RAM (as well as disk)

Conclusion
- Simple protocols win again
- Piggybacking data on Keep-alive is a simple, reliable coherency protocol