Consistency
Where are we?
• System Complexity
• Modularity & Naming
• Enforced Modularity
• Network
• Fault Tolerance
• Transaction
– All-or-nothing
– Before-or-after
• Consistency <-
• Security
Two-phase commit
• Phase 1: preparation / voting
– Lower-layer transactions either abort or become tentatively committed
– Higher-layer transaction evaluates the outcome of the lower-layer ones
• Phase 2: commitment
– If top-layer, then COMMIT or ABORT
– If nested itself, then become tentatively committed
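As a rough illustration, here is a minimal Python sketch of the two phases, with the lower-layer transactions modeled as local Worker objects (names are illustrative; a real system exchanges messages and logs each step durably):

    class Worker:
        """Stand-in for a lower-layer transaction site (hypothetical)."""
        def prepare(self, tx):
            # Phase 1: either abort or become tentatively committed
            return "tentative"      # a real worker may also return "abort"
        def finish(self, tx, decision):
            # Phase 2: make the tentative state durable, or undo it
            pass

    def two_phase_commit(workers, tx):
        votes = [w.prepare(tx) for w in workers]        # phase 1: voting
        decision = "COMMIT" if all(v == "tentative" for v in votes) else "ABORT"
        for w in workers:
            w.finish(tx, decision)                      # phase 2: commitment
        return decision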
Fault-tolerance
• Fault tolerance
– Goal: building reliable systems from unreliable components
– So far: transactions for crash recovery on a single server
• Important to recover from failures
– How to continue despite failures?
– General plan: multiple servers, replication
– Already seen some cases: DNS, RAID, ..
• How to handle harder cases?
– E.g., replicated file storage, replicated master for 2PC, ..
Fault-tolerance
• Example: file storage
– Simple design: single file server
• e.g., home directory in AFS
• What if the AFS server crashes? Can't access your files.
– Alternative design: keep copies of your files
• on your desktop, laptop, etc.
• Storage is now replicated: can access your files despite failures.
Constraints & invariants
• One common use for transactions
– To maintain constraints
• A constraint is an application-defined requirement that every update to data preserve some invariant
– Table management
– Doubly-linked list
– Disk storage management
– Display management
– Replica management
– Banking
– Process control
Interface consistency
• Internal operation: inconsistency
– Update action requires several steps
– Inconsistency may exist during the steps
• a.k.a. constraint violation
• Interface: consistency
– Another thread/client asks to read the data
• Two consistency models
– Strict consistency
– Eventual consistency
Strict consistency
• Hides the constraint violation behind modular boundaries
– Actions outside the transaction performing the update will never see data that is inconsistent with the invariant
• Depends on actions honoring abstractions
– E.g., by using only the intended reading and writing operations
• Cache specification
– "The result of a READ of a named object is always the value that was provided by the most recent WRITE to that object"
– Does not demand that the replica in the cache always be identical to the replica in the backing store
– Requires only that the cache deliver data at its interface that meets the specification
Strict consistency
• Examples
– Sequential consistency
– External time consistency
• Using transactions
– All-or-nothing
• Maintains interface consistency despite failures
– Before-or-after
• Maintains interface consistency despite concurrent reading or updating of the same data
Eventual consistency
• Scenario
– Performance or availability is a high priority
– Temporary inconsistency is tolerable
– E.g., web browser display rendering
– E.g., a new book arrives today but the catalog is updated the next day
– E.g., loosely coupled replicas
Eventual consistency
• Inconsistency window
– After a data update the constraint may not hold until some unspecified time in the future
– An observer may, using the standard interfaces, discover that the invariant is violated
– Different observers may even see different results
– Once updates stop occurring, the system will make a best-effort drive toward the invariant
Cache system
• Cache = performance-oriented replica system
– Rather than reliability-oriented
• Invariant
– Data in primary memory = replica in secondary memory
– How long is the inconsistency window?
– Strict consistency vs. eventual consistency
• Interface
– The result of a read of a named object is always the value of the most recent write to that object
Cache consistency
• Consistency: either strict or eventual
– Strict: write-through cache
• Performance is affected
– Eventual: non-write-through cache
• Still holds the invariant for the same thread
• What if there is more than one cache?
• What if another thread has its own cache?
– Even a write-through cache fails to keep multiple caches consistent
• Three methods
– Timeout, marking, snoopy
Eventual consistency with timer expiration
• Example: DNS cache
– Client asks for the IP address of "ginger.pedantic.edu"
– Then the network manager changes the IP of "ginger.pedantic.edu" on "ns.pedantic.edu"
• TTL (Time To Live)
– One hour as default
– Keeps the old IP available during the TTL
– Low cost
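A minimal sketch of this TTL-style eventual consistency in Python (TTLCache, lookup, and resolve are illustrative names, not a real DNS API):

    import time

    class TTLCache:
        def __init__(self, ttl_seconds=3600):        # one hour, as in the DNS default
            self.ttl = ttl_seconds
            self.entries = {}                        # name -> (value, expiry)

        def lookup(self, name, resolve):
            entry = self.entries.get(name)
            if entry is not None and time.time() < entry[1]:
                return entry[0]                      # may be stale until the TTL expires
            value = resolve(name)                    # refresh from the authoritative server
            self.entries[name] = (value, time.time() + self.ttl)
            return value

The inconsistency window is bounded by the TTL: a changed mapping becomes visible everywhere within one TTL, at low cost and with no invalidation traffic.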
Strict consistency with a fluorescent marking pen
• A few variables are shared and writable
– Server marks a page as "don't cache me"
– Browser will not cache the page
• The "volatile" variable
– Asks the compiler to ensure read/write consistency
• Write registers back to memory
• Flush the cache
• Block instruction reordering
Strict consistency with the snoopy cache
• Invalidate cache entries when inconsistent
– When several processors share the same secondary memory
– Primary caches are usually private
– A write-through doesn't change the caches in other processors
– Naive solution: invalidate everything on a write
– Better idea: specify the cache line
• Each private cache monitors the memory bus
• Can even grab the written value and update its replica
1. Processor A writes to memory
2. Write-through the cache, to memory by bus
3. Caches of B & C snoop on the bus and update their replicas
Durable storage and the durability mantra
• Mirroring
– On a physical unit basis
– E.g., RAID-1
– Protects against internal failures of individual disks
• Issues
– What if the OS damages the data before writing?
– Placement matters: geographically separated
The durability mantra
• Multiple copies, widely separated and independently administered…
• Multiple copies, widely separated and independently administered…
• Multiple copies, widely separated and independently administered…
• Multiple copies, widely separated and independently administered…
Durable storage and the durability mantra
• Separate replicas geographically
– High latency
– Unreliable communication
– Hard to synchronize
• When updates are made asynchronously
– Primary copy vs. backup copies
– Master vs. slaves
• Constraint: replicas should be identical
Durable storage and the durability mantra
• Logical copies: file-by-file
– Understandable to the application
– Similar to logical locking
– Lower performance
• Physical copies: sector-by-sector
– More complex
• Crash during updates
• Performance enhancement
Challenge in replication: Consistency
• Optimistic replication
– Tolerate inconsistency, and fix things up later
– Works well when out-of-sync replicas are acceptable
• Pessimistic replication
– Ensure strong consistency between replicas
– Needed when out-of-sync replicas can cause serious problems
Consistency
• Resolving inconsistencies
– Suppose we have two computers: laptop and desktop
– File could have been modified on either system
• How to figure out which one was updated?
– One approach: use timestamps to figure out which was updated most recently
– Many file synchronization tools use this approach
Use of time in computer systems
• Time is used by many distributed systems
– E.g., cache expiration (DNS, HTTP), file synchronizers, Kerberos, ..
– Time intervals: how long did some operation take?
– Calendar time: what time/date did some event happen at?
– Ordering of events: in what order did some events happen?
Time Measuring
• Measuring time intervals
– Computer has a reasonably-fixed-frequency oscillator (e.g., quartz crystal)
– Represent a time interval as a count of the oscillator's cycles
• time period = count / frequency
• e.g., with a 1 MHz oscillator, 1000 cycles means 1 msec
Time Measuring
• Keeping track of calendar time
– Typically, calendar time is represented using a counter from some fixed epoch
– For example, Unix time is the number of seconds since midnight UTC at the start of Jan 1, 1970
– Can convert this counter value into a human-readable date/time, and vice versa
• Conversion requires two more inputs: time zone, data on leap seconds
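For instance, converting a Unix counter value with Python's standard library (which handles the time-zone input, but not leap-second tables):

    from datetime import datetime, timezone

    # 1,000,000,000 seconds after the Unix epoch:
    print(datetime.fromtimestamp(1_000_000_000, tz=timezone.utc))
    # -> 2001-09-09 01:46:40+00:00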
Time Measuring
• What happens when the computer is turned off?
– "Real-Time Clock" (RTC) chip remains powered, with battery / capacitor
– Stores the current calendar time, has an oscillator that increments it periodically
• Maintaining accurate time
– Accuracy: for calendar time, need to set the clock correctly at some point
– Precision: need to know the oscillator frequency (drifts due to age, temperature, etc.)
Clock Synchronizing
• Synchronizing a clock over the internet: NTP
– Query the server's time, adjust local time accordingly
Clock Synchronizing
• Need to take into account network latency
– Simple estimate: RTT/2
– When does this fail to work well?
• Asymmetric routes, with different latency in each direction
• Queuing delay, unlikely to be symmetric even for symmetric routes
• Busy server might take a long time to process the client's request
– Can use repeated queries to average out (or estimate the variance of) the latter two
Estimating Network Latency
sync(server):
    t_begin = local_time
    tsrv = getTime(server)
    t_end = local_time
    delay = (t_end - t_begin) / 2
    offset = (t_end - delay) - tsrv
    local_time = local_time - offset
Clock Synchronizing
• What if a computer's clock is too fast?
– e.g., 5 seconds ahead
– Naive plan: reset it to the correct time
• Can break time intervals being measured (e.g., negative interval)
• Can break ordering (e.g., older files were created in the future)
– "make" is particularly prone to these errors
• Principle: time never goes backwards
– Idea: temporarily slow down or speed up the clock
– Typically cannot adjust the oscillator (fixed hardware)
– Adjust the oscillator frequency estimate, so the counter advances faster / slower
Slew time

sync(server):
    t_begin = local_time
    tsrv = getTime(server)
    t_end = local_time
    delay = (t_end - t_begin) / 2
    offset = (t_end - delay) - tsrv
    freq = base + ε * sign(offset)   # temporarily speed up / slow down the local clock
    sleep(freq * abs(offset) / ε)
    freq = base

timer_intr():                        # on every oscillator tick
    local_time = local_time + 1/freq
Improving Time Precision
• If we only adjust our time once
– An inaccurate clock will lose accuracy again
– Need to also improve precision, so we don't need to slew as often
• Assumption: poor precision is caused by a poor estimate of the oscillator frequency
– Can measure the difference between local and remote clock "speeds" over time
– Adjust the local frequency estimate based on that information
– In practice, may want a more stable feedback loop (PLL): look at control theory
File Reconciliation with Timestamps
• Key problem
– Determine which machine has the newer version of a file
• Strawman
– Use the file with the highest mtime timestamp
– Works when only one side updates the file per reconciliation
File Reconciliation with Timestamps
• Better plan (see the sketch below)
– Track the last reconcile time on each machine
– Send a file if changed since then, and update the last reconcile time
– When receiving, check whether the local file also changed since the last reconcile
• New outcome
– Timestamps on two versions of a file could be concurrent
– Key issue with optimistic concurrency control: the optimism was unwarranted
– Generally, try various heuristics to merge changes (text diff/merge, etc.)
– Worst case, ask the user (e.g., if the same line of code in a C file was edited)
• Problem: reconciliation across multiple machines
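A minimal sketch of the per-file check between two machines (FileReplica and reconcile are illustrative names, and a real tool would also propagate the merged result):

    from dataclasses import dataclass

    @dataclass
    class FileReplica:
        mtime: float          # last modification time on this machine
        data: bytes = b""

    def reconcile(a, b, last_reconcile):
        a_changed = a.mtime > last_reconcile
        b_changed = b.mtime > last_reconcile
        if a_changed and b_changed:
            return "conflict"             # concurrent updates: merge or ask the user
        if b_changed:
            a.data, a.mtime = b.data, b.mtime   # b has the only new version
        elif a_changed:
            b.data, b.mtime = a.data, a.mtime
        return "ok"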
File Reconciliation with Timestamps
• Goal: no lost updates
– V2 should overwrite V1 if V2 contains all updates that V1 contained
– Simple timestamps can't help us determine this
Vector Timestamps
• Idea: vector timestamps
– Store a vector of timestamps, one from each machine
– Each entry in the vector keeps track of that machine's last mtime
– V1 is newer than V2 if all of V1's timestamps are >= V2's
– V1 is older than V2 if all of V1's timestamps are <= V2's
– Otherwise, V1 and V2 were modified concurrently, so conflict
– If two vectors are concurrent, one computer modified the file without seeing the latest version from another computer
– If the vectors are ordered, everything is OK as before
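The comparison rule is small enough to sketch directly (vectors here are dicts mapping machine name to that machine's timestamp; missing entries default to 0):

    def compare_vectors(v1, v2):
        machines = set(v1) | set(v2)
        ge = all(v1.get(m, 0) >= v2.get(m, 0) for m in machines)
        le = all(v1.get(m, 0) <= v2.get(m, 0) for m in machines)
        if ge and le: return "equal"
        if ge:        return "v1 newer"      # v1 saw every update v2 saw
        if le:        return "v2 newer"
        return "concurrent"                  # conflict: modified independently

    # e.g., both machines modified the file since they last synced:
    assert compare_vectors({"laptop": 3, "desktop": 1},
                           {"laptop": 2, "desktop": 2}) == "concurrent"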
Vector Timestamps
• Cool property of version vectors
– A node's timestamps are only compared to other timestamps from the same node
– Time synchronization is not necessary for reconciliation with vector timestamps
– Can use a monotonic counter on each machine
• Does calendar time still matter?
– More compact than vector timestamps
– Can help synchronize two systems that don't share vector timestamps
Synchronizing multiple files
• Strawman
– As soon as a file is modified, send updates to every other computer
• What consistency guarantees does this file system provide to an application?
– Relatively few guarantees, aside from no lost updates for each file
– In particular, can see changes to b without seeing preceding changes to a
– Counter-intuitive: updates to different files might arrive in different orders
Pessimistic Replication
• Some applications may prefer not to tolerate inconsistency
– E.g., a replicated lock server, or a replicated coordinator for 2PC
– E.g., better not give out the same lock twice
– E.g., better have a consistent decision about whether a transaction commits
• Trade-off: stronger consistency with pessimistic replication means
– Lower availability than what you might get with optimistic replication
Single-copy consistency
• Problem with the optimistic way: replicas get out of sync
– One replica writes data, another doesn't see the changes
– This behavior was impossible with a single server
• Ideal goal: single-copy consistency
– A property of the externally visible behavior of a replicated system
– Operations appear to execute as if there is only a single copy of the data
• Internally, there may be failures or disagreement, which we have to mask
– Similar to how we defined the serializability goal ("as if executed serially")
Replicating a Server
• Strawman
– Clients send requests to both servers
– Tolerating faults: if one server is down, clients send to the other
• Tricky case: what if there's a network partition?
– Each client thinks the other server is dead, keeps using its own server
– Bad situation: not single-copy consistency!
Handling network partitions
• Issue
– Clients may disagree about which servers are up
– Hard to solve with 2 servers, but possible with 3 servers
• Idea: require a majority of servers to perform an operation
– In the case of 3 servers, 2 form a majority
– If a client can contact 2 servers, it can perform the operation (otherwise, wait)
– Thus, can handle any 1 server failure
Handling network partitions
• Why does the majority rule work? (see the sketch below)
– Any two majority sets of servers overlap
– Suppose two clients issue operations to a majority of servers
– They must have overlapped in at least one server, which helps ensure single-copy behavior
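The overlap argument can be checked mechanically for the 3-server case (a sketch, not a proof; it enumerates every pair of majorities and asserts a shared member):

    from itertools import combinations

    servers = {0, 1, 2}                      # N = 3, majority size 2
    majorities = [set(q) for q in combinations(servers, 2)]
    assert all(q1 & q2 for q1 in majorities for q2 in majorities)

In general, two sets each larger than N/2 must share a member, since together they would otherwise hold more than N distinct servers.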
Handling network partitions
• Problem: replicas can become inconsistent
– Issue: clients' requests to different servers can arrive in different orders
– How do we ensure the servers remain consistent?
RSM: Replicated state machines
• A general approach to making consistent replicas of a server
– Start with the same initial state on each server
– Provide each replica with the same input operations, in the same order
– Ensure all operations are deterministic
• E.g., no randomness, no reading of the current time, etc.
• These rules ensure each server will end up in the same final state
Simple Implementation: replicated logs
• Replicated logs
– Log client operations, including both reads and writes, each numbered
• Key issue: agreeing on the order of operations (see the sketch below)
– Coordinator handles one client operation at a time
– Coordinator chooses an order for all operations (assigns log sequence numbers)
– Coordinator issues the operation to each replica
– When is it OK to reply to the client?
• Must wait for a majority of replicas to reply
• Otherwise, if a minority crashes, the remaining servers may continue without the op
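A toy sketch of the coordinator's ordering role (LogCoordinator and LogReplica are illustrative names; real replicas are remote, can fail, and must make their logs durable):

    class LogReplica:
        def __init__(self):
            self.log = []
        def apply(self, seq, op):
            self.log.append((seq, op))   # in practice: execute op deterministically
            return True                  # ack back to the coordinator

    class LogCoordinator:
        def __init__(self, replicas):
            self.replicas = replicas
            self.next_seq = 0
        def submit(self, op):
            seq = self.next_seq          # one op at a time: a single global order
            self.next_seq += 1
            acks = sum(r.apply(seq, op) for r in self.replicas)
            return acks > len(self.replicas) // 2   # reply only after a majority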
Replicating the coordinator
• Replicating the coordinator
– Tricky: can we get multiple coordinators due to a network partition?
– Tricky: what happens if the coordinator crashes midway through an operation?
What is the Paxos protocol?
• Paxos is a simple protocol that a group of machines in a distributed system can use to agree on a value proposed by a member of the group.
• Assumptions
– Asynchronous
• Processes operate at arbitrary speed
– Non-Byzantine model
• Processes fail by stopping
– Processes may fail and then restart; this requires that information can be remembered
Roles
• Proposer: offers proposals of the form [value, number].
• Acceptor: accepts or rejects offered proposals so as to reach consensus on the chosen proposal/value.
• Learner: becomes aware of the chosen proposal/value.
• A process can take on all roles.
Approach 1
• Designate a single process X as acceptor (e.g., the one with the smallest identifier)
– Each proposer sends its value to X
– X decides on one of the values
– X announces its decision to all learners
• Problem?
– Failure of the single acceptor halts the decision
• Need multiple acceptors!
Approach 2
• Each proposer proposes to all acceptors
• Each acceptor accepts the first proposal it receives and rejects the rest
• If a proposer receives positive replies from a majority of acceptors, it chooses its own value
– There is at most one majority, hence only a single value is chosen
• Proposer sends the chosen value to all learners
Approach 2
• Problems
– What if multiple proposers propose simultaneously, so there is no majority accepting?
– What if a process fails?
Paxos solution
• Each acceptor must be able to accept multiple proposals
• Order proposals by proposal number
– If a proposal with value v is chosen, all higher proposals have value v
Paxos Operation: Process State
• Each node maintains:
– na, va: highest proposal number accepted and its corresponding accepted value
– nh: highest proposal number seen
– myn: the node's proposal number in the current Paxos round
Paxos Operations
• Choosing a proposal number:
– Use the last known proposal number + 1, appending the process's identifier
Paxos Operation
• Phase 1 (Prepare)
– A node decides to propose
– Proposer chooses myn > nh
– Proposer sends <prepare, myn> to all nodes
– A node receiving <prepare, n> runs this logic:

    if n < nh:
        reply <prepare-reject>
    else:
        nh = n                  # this node will not accept any proposal numbered below n
        reply <prepare-ok, na, va>
Paxos Operation
• Phase 2 (Accept)
– If the proposer gets prepare-ok from a majority:
• V = the value corresponding to the highest na received among the non-empty replies
• If V = null, then the proposer can pick any V
• Send <accept, myn, V> to all nodes
– If the proposer fails to get a majority of prepare-ok:
• Delay and restart Paxos
– Upon receiving <accept, n, V>:

    if n < nh:
        reply <accept-reject>
    else:
        na = n; va = V; nh = n
        reply <accept-ok>
Paxos Operation
• Phase 3 (Decide)
– If the proposer gets accept-ok from a majority:
• Send <decide, va> to all nodes
– If the proposer fails to get accept-ok from a majority:
• Delay and restart Paxos
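Putting the phases together, a minimal Python sketch of the acceptor side, using the slides' na/va/nh state (the messaging layer and durable storage are omitted; names are illustrative):

    class Acceptor:
        def __init__(self):
            self.na, self.va, self.nh = -1, None, -1

        def on_prepare(self, n):
            if n < self.nh:
                return ("prepare-reject",)
            self.nh = n                          # promise: ignore proposals below n
            return ("prepare-ok", self.na, self.va)

        def on_accept(self, n, v):
            if n < self.nh:
                return ("accept-reject",)
            self.na, self.va, self.nh = n, v, n  # accept the proposal
            return ("accept-ok",)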
Paxos: Timeouts
• All processes wait a maximum period (timeout) for messages they expect
• Upon timeout, a process starts again
Paxos with One Leader, No Failures: Phase 1

The proposer sets myn = 1 and sends "prepare(1,1)" to all five nodes (0-4).
State at every node: na = -1, va = nil, nh = -1, done = F
Paxos with One Leader, No Failures: Phase 1

Each node replies "prepare-accept(-1, nil)".
State at every node: na = -1, va = nil, nh = 1, done = F
Paxos with One Leader, No Failures: Phase 2

The proposer has prepare-accept replies from a majority, and all returned values are nil.
State at every node: na = -1, va = nil, nh = 1, done = F
Paxos with One Leader, No Failures: Phase 2

The proposer picks a value and sends "accept(1,1,1)" to all nodes.
State at every node: na = -1, va = nil, nh = 1, done = F
Paxos with One Leader, No Failures: Phase 2

The proposer receives accept-ok from a majority.
State at every node: na = 1, va = 1, nh = 1, done = F
Paxos with One Leader, No Failures: Phase 3

The proposer sends (decide, 1) to all nodes.
State at every node: na = 1, va = 1, nh = 1, done = F
Understanding Paxos
• What if we get two nodes that send a prepare message?
• What if a proposer fails while sending accept?
• What if a node fails after sending prepare-ok?
More Than One Proposer
• Can occur after a timeout during the Paxos algorithm, a partition, or lost packets
• Two proposers must use different n in their prepare messages
• Suppose two proposers have proposals 1, 2
More Than One Proposer
• Proposal 1 gets to all nodes, followed by proposal 2
• In both cases a prepare-ok message is sent
• Both proposers will send an accept message
• However, for proposal 1 an accept-reject message is sent
Proposer Fails Before Sending Accept
• Some process will time out and become a proposer
• The old proposer didn't send any decide, so there is no risk of non-agreement
Risks: Leader Failures
• Suppose a proposer fails after sending accept to a minority
– Same as two proposers!
• Suppose a proposer fails after sending accept to a majority
– Same as two leaders!
Process Fails
• Process fails after receiving accept and after sending accept-ok
• Process should remember va and na on disk
• If the process doesn't restart, a timeout is possible in Phase 3, leading to a new leader
Shortcuts to meet more modest requirements
• Single state machine
– Carry out all updates at one replica site
– Generate a new version at that site
– Bring the other replicas into line
• Brute-force copying: copy the new version of the data to each of the other replica sites, replacing previous copies
– Conditions
• Occasional updates
• Small database
• No urgency to make updates available, so batch is OK
• Temporary inconsistency can be tolerated
Single state machine
• SSM is subject to data decay, but:
– Decay of the master may go undetected
– Frequent updates lead to unnecessary reconciliation
• A dummy update is enough
• Main defect of SSM
– Data updates are not fault tolerant
• Only data access is fault tolerant
– What if the master fails in the middle of updating?
– Doesn't work well for some applications
• E.g., a large database
Variant of single state machine
• Master only distributes deltas
– Pros: may produce a performance gain
– Cons: has the disadvantages of both SSM & RSM
• Reduce the inconsistency window
– E.g., shadow copy (just as in two-phase commit)
• Partition a large database
– Each partition can be updated independently
• Assign a different master to each partition
– Distributes the updating work, increases the availability of updates
Variant of single state machine
• Add fault tolerance for when the master fails
– Use a consensus algorithm to choose a new master site
• If the data is insensitive to update order, then a consensus algorithm is not needed
– E.g., email: users may see messages in different orders
• Master can distribute just its update log
– Replica sites can run REDO on the log
– Replica sites can simply maintain a complete log
Maintaining data integrity
• Threats to data integrity when updating a replica
– Data can be damaged or lost
– Transmission can introduce errors
– Operators can make blunders
• Solutions
– Periodically compare replicas bit-by-bit
• To check for spontaneous data decay
– Calculate a witness of the contents to compare (see the sketch below)
• E.g., by choosing a good hash algorithm
• Just as with checksums in the end-to-end layer, link layer
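For example, a witness can be as simple as a cryptographic hash of the replica's contents (SHA-256 here is an arbitrary choice):

    import hashlib

    def witness(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()   # compact stand-in for the contents

    # Compare witnesses instead of shipping whole replicas:
    assert witness(b"account table v42") == witness(b"account table v42")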
Replica reading and majorities
• Simplest plan
– Read and write from the master; slaves are only backups
– Master is responsible for ordering, for consistency
• Enhancement for read availability
– Allow reading from replicas
– Also enhances performance
– But consistency may be violated
• Should ensure before-or-after between reads and updates
• More reliable but expensive way
– Obtain data from other replicas to verify integrity
– Use a majority for reading
Quorum
• Define separate read & write quorums: Qr & Qw
– Qr + Qw > Nreplicas (Why? So that every read quorum overlaps every write quorum)
• Confirm a write after writing to at least Qw replicas
• Read until at least Qr replicas agree on the data or witness value
• Example (see the sketch below)
– In favor of reading: Nreplicas = 5, Qw = 4, Qr = 2
– In favor of updating: Nreplicas = 5, Qw = 2, Qr = 4
– Enhance read availability with Qw = Nreplicas & Qr = 1
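A toy sketch of quorum read/write under the rule Qr + Qw > N (the replica choice is fixed here for simplicity; a real system uses whichever Qw or Qr replicas respond, and handles failures):

    N, QW, QR = 5, 3, 3                       # Qr + Qw > N, so quorums overlap
    replicas = [(0, None)] * N                # each replica holds (version, value)

    def quorum_write(version, value):
        for i in range(QW):                   # reach at least Qw replicas
            replicas[i] = (version, value)
        return True

    def quorum_read():
        answers = replicas[:QR]               # any Qr replicas
        return max(answers, key=lambda vv: vv[0])[1]   # newest version wins

    quorum_write(1, "x=42")
    assert quorum_read() == "x=42"

Because every read quorum intersects every write quorum, at least one of the Qr replies carries the most recent confirmed write, and the version number identifies it.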
Quorum
• Provides no before-or-after or all-or-nothing by itself
– If reading & writing requests come from a single site
• Easy…
– If reading from multiple sites, writing from one site
• Maintain a version number at that site
– If writing from multiple sites
• Need a protocol providing a distributed sequencer
• Another complicating consideration
– Performance maximization
Backup
• Time consuming
– Incremental backup
– Partial backup
• Don't copy files that can be reconstructed from other files
• When to back up?
– In the middle of updating may violate consistency
• Replicated failures caused by the same programming error
– Independence of failures should be enforced
• Folk wisdom
– The more elaborate the backup system, the less likely that it actually works