Hector Garcia-Molina

CS 347 Notes06 1

CS 347: Parallel and Distributed

Data ManagementNotes06: Reliable

DistributedDatabase Management

Hector Garcia-Molina

CS 347 Notes06 2

Reliable distributed database management

• Reliability• Failure models• Scenarios

CS 347 Notes06 3

Reliability

• Correctness– Serializability– Atomicity– Persistence

• Availability

CS 347 Notes06 4

Types of failures

• Processor failures– Halt, delay, restart, bezerk, ...

• Storage failures– Volatile, non-volatile, atomic write,

transient errors, spontaneous failures

• Network failures– Lost message, out-of-order messages,

partitions, bounded delay

CS 347 Notes06 5

• Malevolent failures• Multiple failures• Detectable failures

More Types of failures

CS 347 Notes06 6

Failure models

• Cannot protect against everything• Unlikely failures (e.g., flooding in the

Sahara)

• Expensive to protect failures (e.g., earthquake)

• Failures we know how to protect against (e.g., message sequence numbers; stable storage)

CS 347 Notes06 7

Failure model:

DesiredEvents

Expected

Undesired

Unexpected

CS 347 Notes06 8

Node models

(1) Fail-stop nodestime

perfect halted recoveryperfect

Volatilememory lost

Stablestorage ok

CS 347 Notes06 9

(2) Byzantine nodesA Perfect

Perfect Arbitrary failure Recovery

B

C

At any given time, at most some fraction f of

nodesfailed (typically f < 1/2 or f < 1/3)

Node models

CS 347 Notes06 10

Network models

(1) Reliable network- in order messages- no spontaneous messages- timeout TD

I.e., no lost messages, except for node failures

If no ack in TD sec.Destination down

(not paused)

CS 347 Notes06 11

Variation of reliable net:

• Persistent messages– If destination down, net will

eventually deliver message– Simplifies node recovery, but leads to

inefficiencies (hides too much)– Not considered here

CS 347 Notes06 12

Network models

(2) Partitionable network- In order messages- No spontaneous messages

- no timeout; nodes can have different view of failures

CS 347 Notes06 13

Scenarios

• Reliable network– Fail-stop nodes

– No data replication (1)– Data replication (2)

• Partitionable network– Fail-stop nodes (3)

CS 347 Notes06 14

No Data Replication

• Reliable network, fail-stop nodes

• Basic idea: node P controls X

P Item Xnet

CS 347 Notes06 15

No Data Replication

• Reliable network, fail-stop nodes

• Basic idea: node P controls X

P Item Xnet

- Single control point simplifies concurrency control, recovery

- Not an availability hit: if P down, X unavailable too!

CS 347 Notes06 16

“P controls X” means- P does concurrency control for

X- P does recovery for X

CS 347 Notes06 17

Say transaction T wants to access X:

PT is process that represents T at this node

PT

req

Local DMBS

Lockmgr

LOG X

CS 347 Notes06 18

Process models

(A) Cohorts Spawn process

CommunicationData Access

USER

T1

LocalDMBS

T2

LocalDMBS

T3

LocalDMBS

CS 347 Notes06 19

Process models

(B) Transaction servers (manager)USER

LocalDMBS

TransMGR

LocalDMBS

TransMGR

LocalDMBS

TransMGR

CS 347 Notes06 20

• Cohorts: application code responsible for remote access

• Transaction manager: “system” handles distribution, remote access

CS 347 Notes06 21

.

Distributed commit problem

Action:a1,a2

Action:a3

Action:a4,a5

Transaction T

CS 347 Notes06 22

Centralized two-phase commit

Coordinator Participant

I

W

C

A

I

W

C

A

go exec*

nok abort*

ok* commit *

commit -

exec ok

exec nok

abort -

CS 347 Notes06 23

• Notation: Incoming message Outgoing message

( * = everyone)• When participant enters “W” state:

– it must have acquired all resources– it can only abort or commit if so

instructedby a coordinator

• Coordinator only enters “C” state if all participants are in “W”, i.e., it is certain that all will eventually commit

CS 347 Notes06 24

Handling node failures

• Coordinator and participant logs are used to reconstruct state before failure

CS 347 Notes06 25

-> Example: after participant fails: Log:

T1X

undo/redoinfo

T1Y

info... ...

T1“W”state

CS 347 Notes06 26

At recovery:

• T1 is in “W” state• Obtain X,Y write locks (no read locks!)

• Wait for message from coordinator(or ask coordinator for outcome)

CS 347 Notes06 27

Other examples:

• No “W” record on log abort T1

• See “C” record on log finish T1

CS 347 Notes06 28

• Add timeouts to cope with messages lost during crashes

• Add finish (“F”) state for coordinator – all done, can forget outcome

Next

CS 347 Notes06 29

Coordinator

I

W

C

A

F

_go_exec*

c-ok*-

nok*-

nokabort*

ok*commit*

t=timeout

cping=coord. ping

CS 347 Notes06 30

Coordinator

I

W

C

A

F

_go_exec*

c-ok*-

nok*-

ping- _t_

cping

pingabort

pingcommit

_t_cping

_t_abort*

nokabort*

ok*commit*

t=timeout

cping=coord. ping

CS 347 Notes06 31

Participant

I

W

C

A

execok

execnok

commitc-ok

abortnok

CS 347 Notes06 32

Participant

I

W

C

A

execok

execnok

commitc-ok

abortnok

cping , _t - ping

cpingdone

cpingdone

“done” message counts as eitherc-ok or n-ok for coordinator

CS 347 Notes06 33

Participant

I

W

C

A

execok

equivalent to finish state

execnok

commitc-ok

abortnok

cping , _t - ping

cpingdone

cpingdone

“done” message counts as eitherc-ok or n-ok for coordinator

CS 347 Notes06 34

Presumed abort protocol

• “F” and “A” states combined in coordinator

• Saves persistent space (forget quicker)

• Presumed commit is analogous

CS 347 Notes06 35

Presumed abort-coordinator (participant unchanged)

I

W

C

A/F

_go_exec*

ping-

c-ok*-

pingabort

pingcommit _t_

cping

nok, tabort*

ok*commit*

CS 347 Notes06 36

Remember: all state transitions must be loggedExample: tracking who has sent “OK” msgsLog at coord:

• After failure, we know still waiting for OK from node b

• Alternative: do not log receipts of “OK”s abort T1

T1start

part={a,b}

T1OK

from a RCV

... ......

CS 347 Notes06 37

Example: logging receipt of C-OK messages

• If logged, can recover state• If not logged:

– resend commit *– participants reply “done” if duplicate

CS 347 Notes06 38

2PC is blocking

Sample scenario:Coord P2

W P1 P3

WP4

W

CS 347 Notes06 39

Case I:P1 “W”; coordinator sent commits

P1 “C”Case II:

P1 NOK; P1 A

P2, P3, P4 (surviving participants)

cannot safely abort or commit transaction

coord

P1

P2

P3

P4w

w

w

CS 347 Notes06 40

Variants of 2PC

• Linear

Coord

• Hierarchical

ok ok ok

commit commit commit

CS 347 Notes06 41

• Distributed

– Nodes broadcast all messages– Every node knows when to commit

Variants of 2PC

CS 347 Notes06 42

3PC = non-blocking commit

• Assume: failed node is down forever

• Key idea: before committing, coordinator tells participants everyone is ok

CS 347 Notes06 43

Coordinator Participant

I

W

P

A

_go_exec*

ack**commit*

nokabort*

ok*pre *

3PC

C

I

W

P

A

_exec_ok

commit-

execnok

preack

C

abort-

** means allnon-failed nodes

CS 347 Notes06 44

3PC recovery rules: termination protocol

• Survivors try to complete transaction, based on their current states

• Goal:– If dead nodes committed or aborted,

then survivors should not contradict!– Else, survivors can do as they please...

surv

ivors

CS 347 Notes06 45

• Let {S1,S2,…Sn} be survivor sites• If one or more Si = COMMIT COMMIT T• If one or more Si = ABORT ABORT T• If one or more Si = PREPARE

T could not have aborted COMMIT T• If no Si = PREPARE (or COMMIT)

T could not have committed ABORT T

surv

ivors

CS 347 Notes06 46

Example:

P

W

W

?

?

CS 347 Notes06 47

Example:

I

W

W

?

?

CS 347 Notes06 48

Example:

P

P

C

?

?

CS 347 Notes06 49

Example:

P

W

A

?

?

CS 347 Notes06 50

Once survivors make decision, they must select new coordinator to continue 3PC

P P C C

W P C C W P P C

Decide tocommit

Time1

Time2

Time3

Time4

CS 347 Notes06 51

Note: when survivors continue 3PC, failed nodes do not

countE.g., “ack**” when ack’s received

from all non-failed nodesP

C

ack**commit *

CS 347 Notes06 52

Note: 3PC unsafe with partitions!

W

W

W

P

P

abort commit

CS 347 Notes06 53

Node recovery:

• After node N recovers from failure:– do not participate in termination protocol

(why?)

W

W

W

?

P

A

CS 347 Notes06 54

Node recovery:


(why?)

W

W

W

?

P

A

later on...

CS 347 Notes06 55

Node recovery:


(why?)– wait until it hears commit or abort

decision from operational node

?ping

ping

ping“C”(or “A”)

CS 347 Notes06 56

• Waiting for commit/abort decision from other node is ok, unless all fail:

? ? ? ? ?

CS 347 Notes06 57

Two options for all-failed problem:(A) Wait for all to recover(B) Majority commit

CS 347 Notes06 58

Option A

• Recovering node waits for either:(1) commit/abort outcome for T from other node(2) all nodes that participated in T are up and recovering:

then 3PC can continue(no danger that a failed node could haveaborted or committed)

Option B

• Want a “gang” of failed but recovered nodes to be able to terminate a transaction even when rest are failed...

CS 347 Notes06 59

P1

P3

P2

P5

P4

CS 347 Notes06 60

Option B

• Nodes are assigned votes, total is VMajority is V+1 e.g., V=5

2 Maj=3 V=6 Maj=4

• To make state transitions, coordinator requires messages from nodes with a majority of votes

CS 347 Notes06 61

Example(1): Coord P2 W P1 P3 W

p4 W

• Nodes P2, P3, P4 enter “W” state and fail• When they recover, coord. and P1 are

down• Each node has 1 vote, V=5, Maj=3

?

?

?

CS 347 Notes06 62

Example(1): Coord P2 W P1 P3 W

p4 W

• Nodes P2, P3, P4 enter “W” state and fail• When they recover, coord. and P1 are

down• Each node has 1 vote, V=5, Maj=3

?

?

?

• Since P2, P3, P4 have majority, they know coord. could not have gone to “P” without at least one of their votes

• Therefore, T can be aborted!

CS 347 Notes06 63

Example(2): Coord P3 ”P” P1 P4 ”W” P2

• Each node has 1 vote; V=5, Maj=3• Nodes fail after entering states shown;

P3, P4 recover

?

?

CS 347 Notes06 64

Example(2): Coord P3 ”P” P1 P4 ”W” P2

• Each node has 1 vote; V=5, Maj=3• Nodes fail after entering states shown;

P3, P4 recover

?

?

• Termination rule says we can try to commit, but P3, P4 do not have enough votes, so they do nothing!

• P3, P4 doing nothing is good because later on, coord. P1, P2 could abort T

CS 347 Notes06 65

1

Summary: Majority rule ensures that any

decision (e.g., Preparing, committing) will be

knownto any future group making a decision

1 1

22

decision # 2

decision # 1

CS 347 Notes06 66

Important Detail for Majority 3PC

• Example:

W

W

W

?

P

A

CS 347 Notes06 67

Important Detail for Majority 3PC

• Example:

W

W

W

?

P

A

P C

CS 347 Notes06 68

Need “Prepare To Abort” State

I

W

PC PA

_go_exec*

ackC**commit*

nokpreA*ok*

preC *

C

I

W

PC

A

_exec_ok

commit-

execnok

preCackC

C

preAackA

coordinator participant

A

ackA**abort*

PA

abort-

** means participantswith majority votes preA

ackA

CS 347 Notes06 69

Example Revisited

W

W

W

?

PC

PA

CS 347 Notes06 70

Example Revisited

W

W

W

?

PC

PA

PC C

OK to commit sincetransaction could not have aborted

CS 347 Notes06 71

Example Revisited -II

W

W

W

?

PC

PA

PA

CS 347 Notes06 72

Example Revisited -II

W

W

W

?

PC

PA

PA

No decision:Transaction could have aborted orcould have committed... Block!

PA

CS 347 Notes06 73

3PC with Majority Voting

• If survivors have majority andall states W try to abort

• If survivors have majority andstates in {W, PC, C} try to commit

• If survivors have majority andstates in {W, PA, A} try to abort

• Otherwise block

CS 347 Notes06 74

Comparison

Option A: only nodes that have not failed participate in 3PC

• Any size group can terminate(even one node)

• If all nodes fail, must wait for all to recover

CS 347 Notes06 75

Option B: Majority voting• A group of failed+recovering

nodes can terminate transaction (with majority of votes)

• Need majority for every commit blocking protocol!

Comparison

CS 347 Notes06 76

Reminder

• When node recovers, it uses its log in a normal fashion to determine status of transactions:– if commit found in log redo if

necessary– if abort found (or no “W” record)

rollback if necessary

CS 347 Notes06 77

– if in “W” state (or “P” state): • reclaim locks held by T before crash• try to terminate T (with other nodes)

– after locks claimed for “in doubt” transactions, start normal processing

Reminder - Continued

CS 347 Notes06 78

Final note

• If nodes use 2P locking, global deadlocks possible

Local WFG: Local WFG: no cycles no cycles!

T1

T2

T1

T2

CS 347 Notes06 79

• Need to “combine” WFGs to discover global deadlock

T1

T2

T1

T2

T1

T2e.g., central

detection node

CS 347 Notes06 80

Problem: False deadlocks

T1

T2

T1

T2

T1

T2

T1

T2

Time1

Time2

Time3

infosent

infosent

T1

T2

at centralsite:

CS 347 Notes06 81

Problem: False deadlocks

T1

T2

T1

T2

T1

T2

T1

T2

Time1

Time2

Time3

infosent

infosent

T1

T2

at centralsite:

Note that T2 is not 2PL;it releases lock,then asks for another lock

Exercise

• Assume all waits are due to transaction lock requests

• Assume transactions well formed and 2PL; scheduler legal

• Show that false deadlocks are not possible

CS 347 Notes06 82

CS 347 Notes06 83

• Many deadlock solutions– Distributed vs. centralized– Detection vs. prevention

• timeouts• wait-die• wound-wait

• Covered in CS245

Hector Garcia-Molina

Documents

Transcript of Hector Garcia-Molina