Consistency


1

Consistency

Computer System Engineering (2013 Spring)

Ordering & Replication

2

Where are we?

• System Complexity
• Modularity & Naming
• Enforced Modularity
• Network
• Fault Tolerance
• Transaction
  – All-or-nothing
  – Before-or-after
• Consistency <-
• Security

3

Simple locking

[Figure: number of locks held over time under simple locking, with the lock point marked]

4

Two-phase locking

[Figure: number of locks held over time under two-phase locking, growing until the lock point and then shrinking]
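A minimal sketch of the two-phase locking discipline (strict variant, which holds every lock until the end). The lock_mgr object with acquire/release and the (lock_id, action) pairs are assumptions for illustration, not part of the lecture:

# Strict two-phase locking sketch (hypothetical lock_mgr API; illustration only)
def run_transaction(lock_mgr, operations):
    # operations: list of (lock_id, action) pairs, where action is a zero-argument callable
    held = set()
    for lock_id, action in operations:
        if lock_id not in held:
            lock_mgr.acquire(lock_id)     # growing phase: only acquire, never release
            held.add(lock_id)
        action()
    # Lock point: reached when the last lock above was acquired
    for lock_id in held:
        lock_mgr.release(lock_id)         # shrinking phase: only release, never acquire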

5

Using Read-write Lock

[Figure: locks held over time with read-write locks; read locks are released as soon as possible after the lock point, write locks are released at the end]

Two-phase commit

• Phase 1: preparation / voting
  – Each lower-layer transaction either aborts or becomes tentatively committed
  – The higher-layer transaction evaluates the lower-layer outcomes
• Phase 2: commitment
  – If it is the top layer: COMMIT or ABORT
  – If it is nested itself: it becomes tentatively committed
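A minimal sketch of the coordinator side of this vote-then-decide structure. The send helper (which delivers a message to a worker and returns its reply) is a caller-supplied assumption, not the lecture's API:

# Two-phase commit coordinator sketch (illustration only)
def two_phase_commit(workers, transaction_id, send):
    # Phase 1: preparation / voting
    votes = [send(w, ("PREPARE", transaction_id)) for w in workers]
    # Phase 2: commitment -- commit only if every worker voted PREPARED
    decision = "COMMIT" if all(v == "PREPARED" for v in votes) else "ABORT"
    for w in workers:
        send(w, (decision, transaction_id))
    return decision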

6

7

[Diagram: two-phase commit message exchange, 3N messages in total]

8

Fault-tolerance

• Fault-tolerance
  – Goal: building reliable systems from unreliable components
  – So far: transactions for crash recovery on a single server
• Important to recover from failures
  – How to continue despite failures?
  – General plan: multiple servers, replication
  – Already seen some cases: DNS, RAID, ...
• How to handle harder cases?
  – E.g., replicated file storage, replicated master for 2PC, ...

9

Fault-tolerance

• Example: file storage
  – Simple design: single file server
    • E.g., home directory in AFS
    • What if the AFS server crashes? Can't access your files.
  – Alternative design: keep copies of your files
    • On your desktop, laptop, etc.
    • Storage is now replicated: can access your files despite failures.

Constraints & invariants

• One common use for transactions
  – To maintain constraints
• A constraint is an application-defined requirement that every update to the data preserves some invariant
  – Table management
  – Doubly-linked lists
  – Disk storage management
  – Display management
  – Replica management
  – Banking
  – Process control

10

Interface consistency

• Internal operation: inconsistency
  – An update action requires several steps
  – Inconsistency may exist during those steps
    • a.k.a. constraint violation
• Interface: consistency
  – Another thread/client asks to read the data
• Two consistency models
  – Strict consistency
  – Eventual consistency

11

12

CONSISTENCY: STRICT VS. EVENTUAL

Strict consistency

• Hides the constraint violation behind modular boundaries
  – Actions outside the transaction performing the update will never see data that is inconsistent with the invariant
• Depends on actions honoring abstractions
  – E.g., by using only the intended reading and writing operations
• Cache specification
  – "The result of a READ of a named object is always the value that was provided by the most recent WRITE to that object"
  – Does not demand that the replica in the cache always be identical to the replica in the backing store
  – Requires only that the cache deliver data at its interface that meets the specification

13

Strict consistency

• Examples
  – Sequential consistency
  – External time consistency
• Using transactions
  – All-or-nothing
    • Maintains interface consistency despite failures
  – Before-or-after
    • Maintains interface consistency despite concurrent reading or updating of the same data

14

Eventual consistency

• Scenario
  – Performance or availability is a high priority
  – Temporary inconsistency is tolerable
  – E.g., web browser display rendering
  – E.g., a new book arrives today but the catalog is updated the next day
  – E.g., loosely coupled replicas

15

Eventual consistency

• Inconsistency window
  – After a data update the constraint may not hold until some unspecified time in the future
  – An observer may, using the standard interfaces, discover that the invariant is violated
  – Different observers may even see different results
  – Once updates stop occurring, the system makes a best-effort drive toward the invariant

16

Cache system

• Cache = performance-oriented replica system
  – Rather than reliability-oriented
• Invariant
  – Data in primary memory = replica in secondary memory
  – How long is the inconsistency window?
  – Strict consistency vs. eventual consistency
• Interface
  – The result of a read of a named object is always the value of the most recent write to that object

17

Cache consistency

• Consistency: either strict or eventual
  – Strict: write-through cache
    • Performance is affected
  – Eventual: non-write-through cache
    • Still holds the invariant for the same thread
    • What if there is more than one cache?
    • What if another thread has its own cache?
  – Even a write-through cache fails to keep consistency in those cases
• Three methods
  – Timeout, marking, snoopy

18

Eventual consistency with timer expiration

• Example: DNS cache
  – A client asks for the IP address of "ginger.pedantic.edu"
  – Later, the network manager changes the IP address of "ginger.pedantic.edu" on "ns.pedantic.edu"
• TTL (Time To Live)
  – One hour by default
  – The old IP address remains visible in caches until the TTL expires
  – Low cost

19

Strict consistency with a fluorescent marking pen

• Only a few variables are shared and writable
  – The server marks a page as "don't cache me"
  – The browser will not cache that page
• The "volatile" variable
  – Asks the compiler to ensure read/write consistency
    • Write registers back to memory
    • Flush the cache
    • Block instruction reordering

20

Strict consistency with the snoopy cache

• Invalidate cache entries when they become inconsistent
  – When several processors share the same secondary memory
  – Primary caches are usually private
  – A write-through does not change the caches in other processors
  – Naive solution: invalidate everything on a write
  – Better idea: identify the affected cache line
    • Each private cache monitors the memory bus
    • It can even grab the written value and update its replica

21

22

1. Processor A writes to memory
2. The write goes through the cache to memory over the bus
3. The caches of B & C snoop on the bus and update their replicas

Durable storage and the durability mantra

• Mirroring
  – On a physical-unit basis
  – E.g., RAID-1
  – Protects against internal failures of individual disks
• Issues
  – What if the OS damages the data before writing?
  – Placement matters: geographically separated copies

23

The durability mantra

• Multiple copies, widely separated and independently administered…


24

Durable storage and the durability mantra

• Separating replicas geographically has costs
  – High latency
  – Unreliable communication
  – Hard to synchronize
• When updates are made asynchronously
  – Primary copy vs. backup copies
  – Master vs. slaves
• Constraint: replicas should be identical

25

Durable storage and the durability mantra

• Logical copies: file-by-file
  – Understandable to the application
  – Similar to logical locking
  – Lower performance
• Physical copies: sector-by-sector
  – More complex
  – Crashes during updates
  – Performance enhancement

26

27

Challenge in replication: Consistency

• Optimistic replication
  – Tolerate inconsistency, and fix things up later
  – Works well when out-of-sync replicas are acceptable
• Pessimistic replication
  – Ensure strong consistency between replicas
  – Needed when out-of-sync replicas can cause serious problems

28

OPTIMISTIC REPLICATION

29

Consistency

• Resolving inconsistencies
  – Suppose we have two computers: a laptop and a desktop
  – A file could have been modified on either system
• How to figure out which one was updated?
  – One approach: use timestamps to figure out which was updated most recently
  – Many file synchronization tools use this approach

30

Use of time in computer systems

• Time is used by many distributed systems
  – E.g., cache expiration (DNS, HTTP), file synchronizers, Kerberos, ...
  – Time intervals: how long did some operation take?
  – Calendar time: what time/date did some event happen at?
  – Ordering of events: in what order did some events happen?

31

Time Measuring

• Measuring time intervals
  – The computer has a reasonably-fixed-frequency oscillator (e.g., a quartz crystal)
  – Represent a time interval as a count of the oscillator's cycles
    • time period = count / frequency
    • E.g., with a 1 MHz oscillator, 1000 cycles means 1 msec

32

Time Measuring

• Keeping track of calendar time
  – Typically, calendar time is represented using a counter from some fixed epoch
  – For example, Unix time is the number of seconds since midnight UTC at the start of Jan 1, 1970
  – Can convert this counter value into a human-readable date/time, and vice versa
    • Conversion requires two more inputs: the time zone and data on leap seconds
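A small illustration of the counter representation using Python's standard library; the epoch counter value shown is arbitrary, and note that Unix time itself ignores leap seconds:

# Converting a Unix-time counter to human-readable form (standard library only)
import datetime, time

unix_seconds = 1_000_000_000                       # arbitrary example counter value
utc = datetime.datetime.fromtimestamp(unix_seconds, tz=datetime.timezone.utc)
print(utc.isoformat())                             # human-readable form; requires a time zone (UTC here)
print(int(time.time()))                            # current counter value: seconds since the 1970 epoch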

33

Time Measuring

• What happens when the computer is turned off?
  – A "Real-Time Clock" (RTC) chip remains powered, with a battery / capacitor
  – It stores the current calendar time and has an oscillator that increments it periodically
• Maintaining accurate time
  – Accuracy: for calendar time, need to set the clock correctly at some point
  – Precision: need to know the oscillator frequency (it drifts due to age, temperature, etc.)

34

Clock Synchronizing

• Synchronizing a clock over the internet: NTP
  – Query the server's time, adjust the local time accordingly

35

Clock Synchronizing

• Need to take network latency into account
  – Simple estimate: RTT/2
  – When does this fail to work well?
    • Asymmetric routes, with different latency in each direction
    • Queuing delay, unlikely to be symmetric even for symmetric routes
    • A busy server might take a long time to process the client's request
  – Can use repeated queries to average out (or estimate the variance of) the latter two

36

Estimating Network Latency

sync(server):
    t_begin = local_time
    tsrv = getTime(server)            # server's clock reading, received after some network delay
    t_end = local_time
    delay = (t_end - t_begin) / 2     # estimate one-way latency as RTT/2
    offset = (t_end - delay) - tsrv   # how far the local clock is ahead of the server
    local_time = local_time - offset

37

Clock Synchronizing

• What if a computer's clock is too fast?
  – E.g., 5 seconds ahead
  – Naive plan: reset it to the correct time
    • Can break time intervals being measured (e.g., a negative interval)
    • Can break ordering (e.g., older files appear to have been created in the future)
  – "make" is particularly prone to these errors
• Principle: time never goes backwards
  – Idea: temporarily slow down or speed up the clock
  – Typically cannot adjust the oscillator (fixed hardware)
  – Instead, adjust the oscillator frequency estimate, so the counter advances faster / slower

38

Slew time

sync(server):
    t_begin = local_time
    tsrv = getTime(server)
    t_end = local_time
    delay = (t_end - t_begin) / 2
    offset = (t_end - delay) - tsrv
    freq = base + ε * sign(offset)      # temporarily speed up / slow down the local clock
    sleep(freq * abs(offset) / ε)       # long enough to absorb the whole offset
    freq = base                         # then return to the nominal rate

timer_intr():                           # on every oscillator tick
    local_time = local_time + 1/freq

39

Improving Time Precision

• If we only adjust our time once
  – An inaccurate clock will lose accuracy again
  – Need to also improve precision, so we don't need to slew as often
• Assumption: poor precision is caused by a poor estimate of the oscillator frequency
  – Can measure the difference between local and remote clock "speeds" over time
  – Adjust the local frequency estimate based on that information
  – In practice, may want a more stable feedback loop (PLL): look at control theory

40

File Reconciliation with Timestamps

• Key problem
  – Determine which machine has the newer version of the file
• Strawman
  – Use the file with the highest mtime timestamp
  – Works when only one side updates the file per reconciliation

41

File Reconciliation with Timestamps

• Better plan
  – Track the last reconcile time on each machine
  – Send a file if it changed since then, and update the last reconcile time
  – When receiving, check whether the local file also changed since the last reconcile
• New outcome
  – The timestamps on two versions of a file could be concurrent
  – Key issue with optimistic concurrency control: the optimism was unwarranted
  – Generally, try various heuristics to merge changes (text diff/merge, etc.)
  – Worst case, ask the user (e.g., if the same line of a C file was edited on both sides)
• Problem: reconciliation across multiple machines

42

File Reconciliation with Timestamps

• Goal: no lost updates
  – V2 should overwrite V1 if V2 contains all updates that V1 contained
  – Simple timestamps can't help us determine this

43

Vector Timestamps

• Idea: vector timestamps
  – Store a vector of timestamps, one entry per machine
  – Each entry in the vector keeps track of that machine's last mtime
  – V1 is newer than V2 if all of V1's timestamps are >= V2's
  – V1 is older than V2 if all of V1's timestamps are <= V2's
  – Otherwise, V1 and V2 were modified concurrently, so there is a conflict
  – If two vectors are concurrent, one computer modified the file without seeing the latest version from the other computer
  – If the vectors are ordered, everything is OK as before (see the comparison sketch below)
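A minimal sketch of the comparison rule, assuming each version vector is a dict mapping machine name to its timestamp (missing entries treated as 0):

# Version-vector comparison sketch (illustration only)
def compare(v1, v2):
    """Return 'newer', 'older', 'equal', or 'concurrent' for v1 relative to v2."""
    machines = set(v1) | set(v2)
    some_greater = any(v1.get(m, 0) > v2.get(m, 0) for m in machines)
    some_less    = any(v1.get(m, 0) < v2.get(m, 0) for m in machines)
    if some_greater and some_less:
        return "concurrent"      # modified on both sides: reconciliation conflict
    if some_greater:
        return "newer"           # v1 contains every update v2 contains, and more
    if some_less:
        return "older"
    return "equal"

# Example: laptop and desktop both changed the file since the last reconcile
print(compare({"laptop": 3, "desktop": 1}, {"laptop": 2, "desktop": 2}))   # -> concurrent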

44

Vector Timestamps

• Cool property of version vectors
  – A node's timestamps are only compared to other timestamps from the same node
  – Time synchronization is not necessary for reconciliation with vector timestamps
  – Can use a monotonic counter on each machine instead
• Does calendar time still matter?
  – It is more compact than vector timestamps
  – It can help synchronize two systems that don't share vector timestamps

49

Synchronizing multiple files

• Strawman
  – As soon as a file is modified, send updates to every other computer
• What consistency guarantees does this file system provide to an application?
  – Relatively few guarantees, aside from no lost updates for each file
  – In particular, one can see changes to b without seeing preceding changes to a
  – Counter-intuitive: updates to different files might arrive in a different order

51

PESSIMISTIC REPLICATION
RSM & Paxos

52

Pessimistic Replication

• Some applications may prefer not to tolerate inconsistency
  – E.g., a replicated lock server, or a replicated coordinator for 2PC
  – E.g., better not give out the same lock twice
  – E.g., better have a consistent decision about whether a transaction commits
• Trade-off: stronger consistency with pessimistic replication means
  – Lower availability than what you might get with optimistic replication

53

Single-copy consistency

• Problem with the optimistic approach: replicas get out of sync
  – One replica writes data, another doesn't see the changes
  – This behavior was impossible with a single server
• Ideal goal: single-copy consistency
  – A property of the externally-visible behavior of a replicated system
  – Operations appear to execute as if there is only a single copy of the data
    • Internally, there may be failures or disagreement, which we have to mask
  – Similar to how we defined the serializability goal ("as if executed serially")

54

Replicating a Server

• Strawman
  – Clients send requests to both servers
  – Tolerating faults: if one server is down, clients send to the other
• Tricky case: what if there's a network partition?
  – Each client thinks the other server is dead and keeps using its own server
  – Bad situation: not single-copy consistency!

55

Handling network partitions

• Issue
  – Clients may disagree about which servers are up
  – Hard to solve with 2 servers, but possible with 3 servers
• Idea: require a majority of servers to perform an operation
  – In the case of 3 servers, 2 form a majority
  – If a client can contact 2 servers, it can perform the operation (otherwise, wait)
  – Thus, we can handle any 1 server failure

56

Handling network partitions

• Why does the majority rule work?
  – Any two majority sets of servers overlap
  – Suppose two clients issue operations to a majority of servers
  – The two majorities must have overlapped in at least one server, which helps ensure single-copy consistency

57

Handling network partitions

• Problem: replicas can become inconsistent
  – Issue: clients' requests to different servers can arrive in different orders
  – How do we ensure the servers remain consistent?

58

RSM: Replicated state machines

• A general approach to making consistent replicas of a server
  – Start with the same initial state on each server
  – Provide each replica with the same input operations, in the same order
  – Ensure all operations are deterministic
    • E.g., no randomness, no reading of the current time, etc.
• These rules ensure each server will end up in the same final state

59

60

Simple Implementation: replicated logs

• Replicated logs
  – Log client operations, including both reads and writes, and number them
• Key issue: agreeing on the order of operations
  – The coordinator handles one client operation at a time
  – The coordinator chooses an order for all operations (assigns a log sequence number)
  – The coordinator issues the operation to each replica
  – When is it OK to reply to the client?
    • Must wait for a majority of replicas to reply
    • Otherwise, if a minority crashes, the remaining servers may continue without the operation
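A minimal sketch of the coordinator's ordering step; the caller-supplied send function (which forwards a numbered operation to one replica and returns True on acknowledgment) is an assumption for illustration:

# Replicated-log coordinator sketch (illustration only)
def coordinate(replicas, op, next_seq, send):
    seq = next_seq                     # coordinator assigns the log sequence number
    acks = sum(1 for r in replicas if send(r, (seq, op)))
    # Reply to the client only once a majority of replicas hold the numbered operation
    if acks > len(replicas) // 2:
        return seq                     # safe to reply: the op survives any minority crash
    raise RuntimeError("could not reach a majority; cannot acknowledge the operation")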

61

Replicating the coordinator

• Replicating the coordinator
  – Tricky: can we get multiple coordinators due to a network partition?
  – Tricky: what happens if the coordinator crashes midway through an operation?

62

63

Leslie Lamport

What is Paxos protocol?

• Paxos is a simple protocol that a group of machines in a distributed system can use to agree on a value proposed by a member of the group.

• Assumptions
  – Asynchronous
    • Processes operate at arbitrary speed
  – Non-Byzantine model
    • Processes fail only by stopping
  – Processes may fail and then restart; this requires that information can be remembered

Roles

• Proposer: offer proposals of the form [value, number].

• Acceptor: accept or reject offered proposals so as to reach consensus on the chosen proposal/value.

• Learner: become aware of the chosen proposal/value.

• A process can take on all roles

Approach 1

• Designate a single process X as acceptor (e.g., the one with the smallest identifier)
  – Each proposer sends its value to X
  – X decides on one of the values
  – X announces its decision to all learners
• Problem?
  – Failure of the single acceptor halts the decision
  – Need multiple acceptors!

Approach 2

• Each proposer proposes to all acceptors
• Each acceptor accepts the first proposal it receives and rejects the rest
• If a proposer receives positive replies from a majority of acceptors, it chooses its own value
  – There is at most one majority, hence only a single value is chosen
• The proposer sends the chosen value to all learners

Approach 2

• Problem
  – What if multiple proposers (leaders) propose simultaneously, so that no single proposal is accepted by a majority?
  – What if the process fails?

Paxos solution

• Each acceptor must be able to accept multiple proposals
• Order proposals by proposal number
  – If a proposal with value v is chosen, all higher-numbered proposals have value v

Paxos Operation: Process State

• Each node maintains:
  – na, va: the highest proposal number accepted and its corresponding accepted value
  – nh: the highest proposal number seen
  – myn: the node's proposal number in the current Paxos round

Paxos Operation

• Choosing a proposal number
  – Use the last known proposal number + 1, with the process's identifier appended

Paxos Operation

• Phase 1 (Prepare)
  – A node decides to propose
  – The proposer chooses myn > nh
  – The proposer sends <prepare, myn> to all nodes
  – A node receiving <prepare, n> applies this logic:

    if n < nh:
        reply <prepare-reject>
    else:
        nh = n                      # this node will not accept any proposal numbered lower than n
        reply <prepare-ok, na, va>

Paxos Operation

• Phase 2 (Accept)
  – If the proposer gets prepare-ok from a majority:
    • V = the non-empty value corresponding to the highest na received
    • If V = null, the proposer can pick any V
    • Send <accept, myn, V> to all nodes
  – If the proposer fails to get a majority of prepare-ok:
    • Delay and restart Paxos
  – Upon receiving <accept, n, V>:

    if n < nh:
        reply <accept-reject>
    else:
        na = n; va = V; nh = n
        reply <accept-ok>

Paxos Operation

• Phase 3 (Decide)
  – If the proposer gets accept-ok from a majority
    • Send <decide, va> to all nodes
  – If the proposer fails to get accept-ok from a majority
    • Delay and restart Paxos
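A minimal sketch of the acceptor-side state and handlers described above, written as plain Python for illustration (the tuple message encoding and None-as-nil convention are assumptions, not the lecture's exact format):

# Paxos acceptor sketch (illustration only)
class Acceptor:
    def __init__(self):
        self.na = -1        # highest proposal number accepted
        self.va = None      # value of that accepted proposal
        self.nh = -1        # highest proposal number seen

    def on_prepare(self, n):
        if n < self.nh:
            return ("prepare-reject",)
        self.nh = n                       # promise: ignore proposals numbered below n
        return ("prepare-ok", self.na, self.va)

    def on_accept(self, n, v):
        if n < self.nh:
            return ("accept-reject",)
        self.na, self.va, self.nh = n, v, n
        return ("accept-ok",)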

Paxos: Timeouts

• All processes wait a maximum period (timeout) for messages they expect

• Upon timeout, a process starts again

Paxos with One Leader, No Failures:Phase 1

Node:    0     1     2     3     4
na:     -1    -1    -1    -1    -1
va:     nil   nil   nil   nil   nil
nh:     -1    -1    -1    -1    -1
done:    F     F     F     F     F

The proposer sets myn = 1 and sends "prepare(1,1)" to all nodes.

Paxos with One Leader, No Failures:Phase 1

Node:    0     1     2     3     4
na:     -1    -1    -1    -1    -1
va:     nil   nil   nil   nil   nil
nh:      1     1     1     1     1
done:    F     F     F     F     F

After recording nh = 1, each node replies "prepare-accept(-1, nil)".

78

Paxos with One Leader, No Failures:Phase 2

Node:    0     1     2     3     4
na:     -1    -1    -1    -1    -1
va:     nil   nil   nil   nil   nil
nh:      1     1     1     1     1
done:    F     F     F     F     F

prepare-accept from majority! All v's are nil.

Paxos with One Leader, No Failures:Phase 2

Node:    0     1     2     3     4
na:     -1    -1    -1    -1    -1
va:     nil   nil   nil   nil   nil
nh:      1     1     1     1     1
done:    F     F     F     F     F

The proposer sends "accept(1,1,1)" to all nodes.

Paxos with One Leader, No Failures:Phase 2

accept from majority

Node:    0     1     2     3     4
na:      1     1     1     1     1
va:      1     1     1     1     1
nh:      1     1     1     1     1
done:    F     F     F     F     F

Paxos with One Leader, No Failures:Phase 3

Send (decide, 1)

Node:    0     1     2     3     4
na:      1     1     1     1     1
va:      1     1     1     1     1
nh:      1     1     1     1     1
done:    F     F     F     F     F

Understanding Paxos

• What if we get two nodes that send a prepare message?

• What if a proposer fails while sending accept?

• What if a proposer fails after sending prepare-ok?

More Than One Proposer

• Can occur after a timeout during the Paxos algorithm, a partition, or lost packets
• The two proposers must use different n in their prepare messages
• Suppose the two proposers have proposals numbered 1 and 2

83

More Than One Proposer

• Proposal 1 reaches all nodes, and is then followed by proposal 2
• In both cases a prepare-ok message is sent
• Both proposers will then send an accept message
• However, proposal 1's accept is answered with accept-reject, because the nodes have already seen the higher-numbered proposal 2

Proposer Fails Before Sending Accept

• Some process will time out and become a proposer
• The old proposer did not send any decide, so there is no risk of non-agreement

Risks: Leader Failures

• Suppose the proposer fails after sending accept to only a minority
  – Same as two proposers!
• Suppose the proposer fails after sending accept to a majority
  – Same as two leaders!

Process Fails

• A process fails after receiving accept and after sending accept-ok
  – The process should remember va and na on disk
  – If the process doesn't restart, a timeout may occur in Phase 3 and a new leader takes over

Shortcuts to meet more modest requirements

• Single state machine
  – Carry out all updates at one replica site
  – Generate a new version at that site
  – Bring the other replicas into line
    • Brute-force copying: copy the new version of the data to each of the other replica sites, replacing the previous copies
  – Conditions
    • Updates are occasional
    • Small database
    • No urgency to make updates available, so batching is OK
    • Temporary inconsistency can be tolerated

88

Single state machine

• SSM is subject to data decay, but:
  – Undetected decay of the master
  – Frequent updates lead to unnecessary reconciliation
    • A dummy update is enough
• Main defect of SSM
  – Data update is not fault tolerant
    • Only data access is fault tolerant
  – What if the master fails in the middle of updating?
  – Doesn't work well for some applications
    • E.g., a large database

89

Variant of single state machine

• The master only distributes deltas
  – Pros: may produce a performance gain
  – Cons: has the disadvantages of both SSM & RSM
• Reduce the inconsistency window
  – E.g., shadow copy (just as in two-phase commit)
• Partition a large database
  – Each partition can be updated independently
• Assign a different master to each partition
  – Distributes the updating work, increases the availability of updates

90

Variant of single state machine

• Add fault tolerance for when the master fails
  – Use a consensus algorithm to choose a new master site
• If the data is insensitive to update order, then a consensus algorithm is not needed
  – E.g., email: users may see messages in different orders
• The master can distribute just its update log
  – Replica sites can run REDO on the log
  – Replica sites can instead merely maintain a complete log

91

Maintaining data integrity

• Threats to data integrity when updating a replica
  – Data can be damaged or lost
  – Transmission can introduce errors
  – Operators can make blunders
• Solutions
  – Periodically compare replicas bit-by-bit
    • To check for spontaneous data decay
  – Calculate a witness of the contents and compare that
    • E.g., by choosing a good hash algorithm
    • Just like a checksum in the end-to-end layer or the link layer

92

Replica reading and majorities

• Simplest plan
  – Read and write from the master; slaves are only backups
  – The master is responsible for ordering, for consistency
• Enhancement for read availability
  – Allow reading from any replica
  – Also enhances performance
  – But consistency may be violated
    • Should ensure before-or-after between reads and updates
• A more reliable but expensive way
  – Obtain data from other replicas to verify integrity
  – Use a majority for reading

93

Quorum

• Define separate read & write quorums: Qr & Qw
  – Qr + Qw > Nreplicas (Why? Any read quorum then overlaps any write quorum.)
• Confirm a write after writing to at least Qw replicas
• Read until at least Qr replicas agree on the data or witness value
• Examples
  – In favor of reading: Nreplicas = 5, Qw = 4, Qr = 2
  – In favor of updating: Nreplicas = 5, Qw = 2, Qr = 4
  – Enhance read availability with Qw = Nreplicas & Qr = 1
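A minimal sketch of quorum reads and writes over an in-memory list of replicas, assuming each replica holds a (version, value) pair and that Qr + Qw > len(replicas); picking the first Qw or Qr replicas stands in for "any responding quorum":

# Read/write quorum sketch (illustration only)
def quorum_write(replicas, qw, version, value):
    acks = 0
    for r in replicas[:qw]:                          # in practice: any Qw replicas that respond
        r["version"], r["value"] = version, value
        acks += 1
    return acks >= qw                                # confirm only after Qw replicas hold the write

def quorum_read(replicas, qr):
    answers = [(r["version"], r["value"]) for r in replicas[:qr]]   # any Qr replicas
    return max(answers, key=lambda a: a[0])[1]       # Qr + Qw > N guarantees at least one fresh copy

replicas = [{"version": 0, "value": None} for _ in range(5)]
quorum_write(replicas, qw=4, version=1, value="x")   # favor reading: Qw = 4, Qr = 2
print(quorum_read(replicas, qr=2))                   # -> "x"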

94

Quorum

• Quorums by themselves provide no before-or-after or all-or-nothing guarantees
  – If reading & writing requests come from a single site
    • Easy…
  – If reading from multiple sites, writing from one site
    • Maintain a version number at that site
  – If writing from multiple sites
    • Need a protocol providing a distributed sequencer
• Another complicating consideration
  – Performance maximization

95

Backup

• Backup is time consuming
  – Incremental backup
  – Partial backup
    • Don't copy files that can be reconstructed from other files
• When to back up?
  – Backing up in the middle of an update may violate consistency
• Replicas can fail identically because of the same programming error
  – Independence of failures should be enforced
• Folk wisdom
  – The more elaborate the backup system, the less likely that it actually works

96

Partitioning data

• Partitioning data to tolerate failures
  – Place each part on a different physical device
  – Easy to explain to users, inexpensive
  – Improves performance
• Combine replication with partitioning
  – Each storage device can contain replicas of several of the different partitions

97