Hector Garcia-Molina

83
CS 347 Notes06 1 CS 347: Parallel and Distributed Data Management Notes06: Reliable Distributed Database Management Hector Garcia-Molina

description

CS 347: Parallel and Distributed Data Management Notes06: Reliable Distributed Database Management. Hector Garcia-Molina. Reliable distributed database management. Reliability Failure models Scenarios. Reliability. Correctness Serializability Atomicity Persistence Availability. - PowerPoint PPT Presentation

Transcript of Hector Garcia-Molina

Page 1: Hector Garcia-Molina

CS 347 Notes06 1

CS 347: Parallel and Distributed

Data ManagementNotes06: Reliable

DistributedDatabase Management

Hector Garcia-Molina

Page 2: Hector Garcia-Molina

CS 347 Notes06 2

Reliable distributed database management

• Reliability• Failure models• Scenarios

Page 3: Hector Garcia-Molina

CS 347 Notes06 3

Reliability

• Correctness– Serializability– Atomicity– Persistence

• Availability

Page 4: Hector Garcia-Molina

CS 347 Notes06 4

Types of failures

• Processor failures– Halt, delay, restart, bezerk, ...

• Storage failures– Volatile, non-volatile, atomic write,

transient errors, spontaneous failures

• Network failures– Lost message, out-of-order messages,

partitions, bounded delay

Page 5: Hector Garcia-Molina

CS 347 Notes06 5

• Malevolent failures• Multiple failures• Detectable failures

More Types of failures

Page 6: Hector Garcia-Molina

CS 347 Notes06 6

Failure models

• Cannot protect against everything• Unlikely failures (e.g., flooding in the

Sahara)

• Expensive to protect failures (e.g., earthquake)

• Failures we know how to protect against (e.g., message sequence numbers; stable storage)

Page 7: Hector Garcia-Molina

CS 347 Notes06 7

Failure model:

DesiredEvents

Expected

Undesired

Unexpected

Page 8: Hector Garcia-Molina

CS 347 Notes06 8

Node models

(1) Fail-stop nodestime

perfect halted recoveryperfect

Volatilememory lost

Stablestorage ok

Page 9: Hector Garcia-Molina

CS 347 Notes06 9

(2) Byzantine nodesA Perfect

Perfect Arbitrary failure Recovery

B

C

At any given time, at most some fraction f of

nodesfailed (typically f < 1/2 or f < 1/3)

Node models

Page 10: Hector Garcia-Molina

CS 347 Notes06 10

Network models

(1) Reliable network- in order messages- no spontaneous messages- timeout TD

I.e., no lost messages, except for node failures

If no ack in TD sec.Destination down

(not paused)

Page 11: Hector Garcia-Molina

CS 347 Notes06 11

Variation of reliable net:

• Persistent messages– If destination down, net will

eventually deliver message– Simplifies node recovery, but leads to

inefficiencies (hides too much)– Not considered here

Page 12: Hector Garcia-Molina

CS 347 Notes06 12

Network models

(2) Partitionable network- In order messages- No spontaneous messages

- no timeout; nodes can have different view of failures

Page 13: Hector Garcia-Molina

CS 347 Notes06 13

Scenarios

• Reliable network– Fail-stop nodes

– No data replication (1)– Data replication (2)

• Partitionable network– Fail-stop nodes (3)

Page 14: Hector Garcia-Molina

CS 347 Notes06 14

No Data Replication

• Reliable network, fail-stop nodes

• Basic idea: node P controls X

P Item Xnet

Page 15: Hector Garcia-Molina

CS 347 Notes06 15

No Data Replication

• Reliable network, fail-stop nodes

• Basic idea: node P controls X

P Item Xnet

- Single control point simplifies concurrency control, recovery

- Not an availability hit: if P down, X unavailable too!

Page 16: Hector Garcia-Molina

CS 347 Notes06 16

“P controls X” means- P does concurrency control for

X- P does recovery for X

Page 17: Hector Garcia-Molina

CS 347 Notes06 17

Say transaction T wants to access X:

PT is process that represents T at this node

PT

req

Local DMBS

Lockmgr

LOG X

Page 18: Hector Garcia-Molina

CS 347 Notes06 18

Process models

(A) Cohorts Spawn process

CommunicationData Access

USER

T1

LocalDMBS

T2

LocalDMBS

T3

LocalDMBS

Page 19: Hector Garcia-Molina

CS 347 Notes06 19

Process models

(B) Transaction servers (manager)USER

LocalDMBS

TransMGR

LocalDMBS

TransMGR

LocalDMBS

TransMGR

Page 20: Hector Garcia-Molina

CS 347 Notes06 20

• Cohorts: application code responsible for remote access

• Transaction manager: “system” handles distribution, remote access

Page 21: Hector Garcia-Molina

CS 347 Notes06 21

.

Distributed commit problem

Action:a1,a2

Action:a3

Action:a4,a5

Transaction T

Page 22: Hector Garcia-Molina

CS 347 Notes06 22

Centralized two-phase commit

Coordinator Participant

I

W

C

A

I

W

C

A

go exec*

nok abort*

ok* commit *

commit -

exec ok

exec nok

abort -

Page 23: Hector Garcia-Molina

CS 347 Notes06 23

• Notation: Incoming message Outgoing message

( * = everyone)• When participant enters “W” state:

– it must have acquired all resources– it can only abort or commit if so

instructedby a coordinator

• Coordinator only enters “C” state if all participants are in “W”, i.e., it is certain that all will eventually commit

Page 24: Hector Garcia-Molina

CS 347 Notes06 24

Handling node failures

• Coordinator and participant logs are used to reconstruct state before failure

Page 25: Hector Garcia-Molina

CS 347 Notes06 25

-> Example: after participant fails: Log:

T1X

undo/redoinfo

T1Y

info... ...

T1“W”state

Page 26: Hector Garcia-Molina

CS 347 Notes06 26

At recovery:

• T1 is in “W” state• Obtain X,Y write locks (no read locks!)

• Wait for message from coordinator(or ask coordinator for outcome)

Page 27: Hector Garcia-Molina

CS 347 Notes06 27

Other examples:

• No “W” record on log abort T1

• See “C” record on log finish T1

Page 28: Hector Garcia-Molina

CS 347 Notes06 28

• Add timeouts to cope with messages lost during crashes

• Add finish (“F”) state for coordinator – all done, can forget outcome

Next

Page 29: Hector Garcia-Molina

CS 347 Notes06 29

Coordinator

I

W

C

A

F

_go_exec*

c-ok*-

nok*-

nokabort*

ok*commit*

t=timeout

cping=coord. ping

Page 30: Hector Garcia-Molina

CS 347 Notes06 30

Coordinator

I

W

C

A

F

_go_exec*

c-ok*-

nok*-

ping- _t_

cping

pingabort

pingcommit

_t_cping

_t_abort*

nokabort*

ok*commit*

t=timeout

cping=coord. ping

Page 31: Hector Garcia-Molina

CS 347 Notes06 31

Participant

I

W

C

A

execok

execnok

commitc-ok

abortnok

Page 32: Hector Garcia-Molina

CS 347 Notes06 32

Participant

I

W

C

A

execok

execnok

commitc-ok

abortnok

cping , _t - ping

cpingdone

cpingdone

“done” message counts as eitherc-ok or n-ok for coordinator

Page 33: Hector Garcia-Molina

CS 347 Notes06 33

Participant

I

W

C

A

execok

equivalent to finish state

execnok

commitc-ok

abortnok

cping , _t - ping

cpingdone

cpingdone

“done” message counts as eitherc-ok or n-ok for coordinator

Page 34: Hector Garcia-Molina

CS 347 Notes06 34

Presumed abort protocol

• “F” and “A” states combined in coordinator

• Saves persistent space (forget quicker)

• Presumed commit is analogous

Page 35: Hector Garcia-Molina

CS 347 Notes06 35

Presumed abort-coordinator (participant unchanged)

I

W

C

A/F

_go_exec*

ping-

c-ok*-

pingabort

pingcommit _t_

cping

nok, tabort*

ok*commit*

Page 36: Hector Garcia-Molina

CS 347 Notes06 36

Remember: all state transitions must be loggedExample: tracking who has sent “OK” msgsLog at coord:

• After failure, we know still waiting for OK from node b

• Alternative: do not log receipts of “OK”s abort T1

T1start

part={a,b}

T1OK

from a RCV

... ......

Page 37: Hector Garcia-Molina

CS 347 Notes06 37

Example: logging receipt of C-OK messages

• If logged, can recover state• If not logged:

– resend commit *– participants reply “done” if duplicate

Page 38: Hector Garcia-Molina

CS 347 Notes06 38

2PC is blocking

Sample scenario:Coord P2

W P1 P3

WP4

W

Page 39: Hector Garcia-Molina

CS 347 Notes06 39

Case I:P1 “W”; coordinator sent commits

P1 “C”Case II:

P1 NOK; P1 A

P2, P3, P4 (surviving participants)

cannot safely abort or commit transaction

coord

P1

P2

P3

P4w

w

w

Page 40: Hector Garcia-Molina

CS 347 Notes06 40

Variants of 2PC

• Linear

Coord

• Hierarchical

ok ok ok

commit commit commit

Page 41: Hector Garcia-Molina

CS 347 Notes06 41

• Distributed

– Nodes broadcast all messages– Every node knows when to commit

Variants of 2PC

Page 42: Hector Garcia-Molina

CS 347 Notes06 42

3PC = non-blocking commit

• Assume: failed node is down forever

• Key idea: before committing, coordinator tells participants everyone is ok

Page 43: Hector Garcia-Molina

CS 347 Notes06 43

Coordinator Participant

I

W

P

A

_go_exec*

ack**commit*

nokabort*

ok*pre *

3PC

C

I

W

P

A

_exec_ok

commit-

execnok

preack

C

abort-

** means allnon-failed nodes

Page 44: Hector Garcia-Molina

CS 347 Notes06 44

3PC recovery rules: termination protocol

• Survivors try to complete transaction, based on their current states

• Goal:– If dead nodes committed or aborted,

then survivors should not contradict!– Else, survivors can do as they please...

surv

ivors

Page 45: Hector Garcia-Molina

CS 347 Notes06 45

• Let {S1,S2,…Sn} be survivor sites• If one or more Si = COMMIT COMMIT T• If one or more Si = ABORT ABORT T• If one or more Si = PREPARE

T could not have aborted COMMIT T• If no Si = PREPARE (or COMMIT)

T could not have committed ABORT T

surv

ivors

Page 46: Hector Garcia-Molina

CS 347 Notes06 46

Example:

P

W

W

?

?

Page 47: Hector Garcia-Molina

CS 347 Notes06 47

Example:

I

W

W

?

?

Page 48: Hector Garcia-Molina

CS 347 Notes06 48

Example:

P

P

C

?

?

Page 49: Hector Garcia-Molina

CS 347 Notes06 49

Example:

P

W

A

?

?

Page 50: Hector Garcia-Molina

CS 347 Notes06 50

Once survivors make decision, they must select new coordinator to continue 3PC

P P C C

W P C C W P P C

Decide tocommit

Time1

Time2

Time3

Time4

Page 51: Hector Garcia-Molina

CS 347 Notes06 51

Note: when survivors continue 3PC, failed nodes do not

countE.g., “ack**” when ack’s received

from all non-failed nodesP

C

ack**commit *

Page 52: Hector Garcia-Molina

CS 347 Notes06 52

Note: 3PC unsafe with partitions!

W

W

W

P

P

abort commit

Page 53: Hector Garcia-Molina

CS 347 Notes06 53

Node recovery:

• After node N recovers from failure:– do not participate in termination protocol

(why?)

W

W

W

?

P

A

Page 54: Hector Garcia-Molina

CS 347 Notes06 54

Node recovery:

• After node N recovers from failure:– do not participate in termination protocol

(why?)

W

W

W

?

P

A

later on...

Page 55: Hector Garcia-Molina

CS 347 Notes06 55

Node recovery:

• After node N recovers from failure:– do not participate in termination protocol

(why?)– wait until it hears commit or abort

decision from operational node

?ping

ping

ping“C”(or “A”)

Page 56: Hector Garcia-Molina

CS 347 Notes06 56

• Waiting for commit/abort decision from other node is ok, unless all fail:

? ? ? ? ?

Page 57: Hector Garcia-Molina

CS 347 Notes06 57

Two options for all-failed problem:(A) Wait for all to recover(B) Majority commit

Page 58: Hector Garcia-Molina

CS 347 Notes06 58

Option A

• Recovering node waits for either:(1) commit/abort outcome for T from other node(2) all nodes that participated in T are up and recovering:

then 3PC can continue(no danger that a failed node could haveaborted or committed)

Page 59: Hector Garcia-Molina

Option B

• Want a “gang” of failed but recovered nodes to be able to terminate a transaction even when rest are failed...

CS 347 Notes06 59

P1

P3

P2

P5

P4

Page 60: Hector Garcia-Molina

CS 347 Notes06 60

Option B

• Nodes are assigned votes, total is VMajority is V+1 e.g., V=5

2 Maj=3 V=6 Maj=4

• To make state transitions, coordinator requires messages from nodes with a majority of votes

Page 61: Hector Garcia-Molina

CS 347 Notes06 61

Example(1): Coord P2 W P1 P3 W

p4 W

• Nodes P2, P3, P4 enter “W” state and fail• When they recover, coord. and P1 are

down• Each node has 1 vote, V=5, Maj=3

?

?

?

Page 62: Hector Garcia-Molina

CS 347 Notes06 62

Example(1): Coord P2 W P1 P3 W

p4 W

• Nodes P2, P3, P4 enter “W” state and fail• When they recover, coord. and P1 are

down• Each node has 1 vote, V=5, Maj=3

?

?

?

• Since P2, P3, P4 have majority, they know coord. could not have gone to “P” without at least one of their votes

• Therefore, T can be aborted!

Page 63: Hector Garcia-Molina

CS 347 Notes06 63

Example(2): Coord P3 ”P” P1 P4 ”W” P2

• Each node has 1 vote; V=5, Maj=3• Nodes fail after entering states shown;

P3, P4 recover

?

?

Page 64: Hector Garcia-Molina

CS 347 Notes06 64

Example(2): Coord P3 ”P” P1 P4 ”W” P2

• Each node has 1 vote; V=5, Maj=3• Nodes fail after entering states shown;

P3, P4 recover

?

?

• Termination rule says we can try to commit, but P3, P4 do not have enough votes, so they do nothing!

• P3, P4 doing nothing is good because later on, coord. P1, P2 could abort T

Page 65: Hector Garcia-Molina

CS 347 Notes06 65

1

Summary: Majority rule ensures that any

decision (e.g., Preparing, committing) will be

knownto any future group making a decision

1 1

22

decision # 2

decision # 1

Page 66: Hector Garcia-Molina

CS 347 Notes06 66

Important Detail for Majority 3PC

• Example:

W

W

W

?

P

A

Page 67: Hector Garcia-Molina

CS 347 Notes06 67

Important Detail for Majority 3PC

• Example:

W

W

W

?

P

A

P C

Page 68: Hector Garcia-Molina

CS 347 Notes06 68

Need “Prepare To Abort” State

I

W

PC PA

_go_exec*

ackC**commit*

nokpreA*ok*

preC *

C

I

W

PC

A

_exec_ok

commit-

execnok

preCackC

C

preAackA

coordinator participant

A

ackA**abort*

PA

abort-

** means participantswith majority votes preA

ackA

Page 69: Hector Garcia-Molina

CS 347 Notes06 69

Example Revisited

W

W

W

?

PC

PA

Page 70: Hector Garcia-Molina

CS 347 Notes06 70

Example Revisited

W

W

W

?

PC

PA

PC C

OK to commit sincetransaction could not have aborted

Page 71: Hector Garcia-Molina

CS 347 Notes06 71

Example Revisited -II

W

W

W

?

PC

PA

PA

Page 72: Hector Garcia-Molina

CS 347 Notes06 72

Example Revisited -II

W

W

W

?

PC

PA

PA

No decision:Transaction could have aborted orcould have committed... Block!

PA

Page 73: Hector Garcia-Molina

CS 347 Notes06 73

3PC with Majority Voting

• If survivors have majority andall states W try to abort

• If survivors have majority andstates in {W, PC, C} try to commit

• If survivors have majority andstates in {W, PA, A} try to abort

• Otherwise block

Page 74: Hector Garcia-Molina

CS 347 Notes06 74

Comparison

Option A: only nodes that have not failed participate in 3PC

• Any size group can terminate(even one node)

• If all nodes fail, must wait for all to recover

Page 75: Hector Garcia-Molina

CS 347 Notes06 75

Option B: Majority voting• A group of failed+recovering

nodes can terminate transaction (with majority of votes)

• Need majority for every commit blocking protocol!

Comparison

Page 76: Hector Garcia-Molina

CS 347 Notes06 76

Reminder

• When node recovers, it uses its log in a normal fashion to determine status of transactions:– if commit found in log redo if

necessary– if abort found (or no “W” record)

rollback if necessary

Page 77: Hector Garcia-Molina

CS 347 Notes06 77

– if in “W” state (or “P” state): • reclaim locks held by T before crash• try to terminate T (with other nodes)

– after locks claimed for “in doubt” transactions, start normal processing

Reminder - Continued

Page 78: Hector Garcia-Molina

CS 347 Notes06 78

Final note

• If nodes use 2P locking, global deadlocks possible

Local WFG: Local WFG: no cycles no cycles!

T1

T2

T1

T2

Page 79: Hector Garcia-Molina

CS 347 Notes06 79

• Need to “combine” WFGs to discover global deadlock

T1

T2

T1

T2

T1

T2e.g., central

detection node

Page 80: Hector Garcia-Molina

CS 347 Notes06 80

Problem: False deadlocks

T1

T2

T1

T2

T1

T2

T1

T2

Time1

Time2

Time3

infosent

infosent

T1

T2

at centralsite:

Page 81: Hector Garcia-Molina

CS 347 Notes06 81

Problem: False deadlocks

T1

T2

T1

T2

T1

T2

T1

T2

Time1

Time2

Time3

infosent

infosent

T1

T2

at centralsite:

Note that T2 is not 2PL;it releases lock,then asks for another lock

Page 82: Hector Garcia-Molina

Exercise

• Assume all waits are due to transaction lock requests

• Assume transactions well formed and 2PL; scheduler legal

• Show that false deadlocks are not possible

CS 347 Notes06 82

Page 83: Hector Garcia-Molina

CS 347 Notes06 83

• Many deadlock solutions– Distributed vs. centralized– Detection vs. prevention

• timeouts• wait-die• wound-wait

• Covered in CS245