PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin,...

PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Page 1: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

PBS

Peter Bailis @pbailis, Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica

@ Twitter, 6.22.12

UC Berkeley

Page 2: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Probabilistically Bounded Staleness

PBS

Page 3: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

1. Fast
2. Scalable
3. Available

Page 4: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

solution: replicate for
1. request capacity
2. reliability

Page 22: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

keep replicas in sync

slow

alternative: sync later

inconsistent

Page 23: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

⇧ consistency, ⇧ latency: contact more replicas, read more recent data

⇩ consistency, ⇩ latency: contact fewer replicas, read less recent data

Page 25: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

eventual consistency

“if no new updates are made to the object, eventually all accesses will return the last updated value”

W. Vogels, CACM 2008

Page 26: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

How eventual?

How long do I have to wait?

Page 27: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

How consistent?

What happens if I don’t wait?

Page 28: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

problem: no guarantees with eventual consistency

solution: consistency prediction

technique: measure latencies, use WARS model

PBS

Page 29: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Dynamo: Amazon’s Highly Available Key-value Store

SOSP 2007

Apache, DataStax

Project Voldemort

Page 30: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Users of Cassandra, Riak, and Voldemort:

Adobe, Cisco, Digg, Gowalla, IBM, Morningstar, Netflix, Palantir, Rackspace, Reddit, Rhapsody, Shazam, Spotify, Soundcloud, Twitter, Mozilla, Ask.com, Yammer, Aol, GitHub, JoyentCloud, Best Buy, LinkedIn, Boeing, Comcast, Gilt Groupe

Page 42: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

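[Diagram: N = 3 replicas, read with R=3. The client sends read(“key”) to the Coordinator, which forwards it to R1, R2, and R3, each holding (“key”, 1). Once all three replies have arrived, the Coordinator returns (“key”, 1) to the client.]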

Page 49: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

[Diagram: N = 3 replicas, read with R=1. The Coordinator sends the read to all replicas and returns (“key”, 1) to the client after the first reply; the other two replies arrive afterwards.]

Page 50: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

N replicas/key
read: wait for R replies
write: wait for W acks
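The N/R/W rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the stores named in the talk: the `Replica` class, the explicit version numbers, and the `delivered_to` / `reply_order` arguments (which stand in for network timing) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    # key -> (version, value); versions are an assumption used to pick the newest copy
    store: dict = field(default_factory=dict)

    def write(self, key, version, value):
        # Keep only the newest version seen so far, then ack.
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)
        return True

    def read(self, key):
        return self.store.get(key)


def quorum_write(replicas, w, key, version, value, delivered_to):
    """Apply the write at the replicas whose indices are in `delivered_to`
    and report whether at least `w` acks came back."""
    acks = sum(replicas[i].write(key, version, value) for i in delivered_to)
    return acks >= w


def quorum_read(replicas, r, key, reply_order):
    """Return the newest (version, value) among the first `r` replies,
    where `reply_order` lists replica indices in arrival order."""
    replies = [replicas[i].read(key) for i in reply_order[:r]]
    replies = [reply for reply in replies if reply is not None]
    return max(replies, default=None)


replicas = [Replica() for _ in range(3)]               # N = 3
quorum_write(replicas, 1, "key", 1, 1, [0, 1, 2])      # every replica holds ("key", 1)
quorum_write(replicas, 1, "key", 2, 2, [0])            # W=1: only R1 has ("key", 2) so far
stale = quorum_read(replicas, 1, "key", [2, 0, 1])     # R=1, R3 replies first -> (1, 1)
fresh = quorum_read(replicas, 3, "key", [2, 0, 1])     # R=3 includes R1's copy -> (2, 2)
```

With W=1 the write is acknowledged while two replicas still hold the old value, so a read with R=1 can return either version depending on which replica answers first.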

Page 68: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

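[Diagram: N = 3 replicas, write with W=1 followed by a read with R=1. One Coordinator sends write(“key”, 2) to all replicas and acknowledges the client after the first ack, from R1, while R2 and R3 still hold (“key”, 1). A second Coordinator then serves read(“key”) with R=1 and returns the stale (“key”, 1). The write later reaches R2 and R3, so all replicas end up holding (“key”, 2).]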

Page 69: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

if: R+W > N

then: “strong” consistency

else: eventual consistency
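The R+W > N rule is a pigeonhole argument: when R+W > N, every set of R replicas shares at least one member with every set of W replicas. A small brute-force check, with hypothetical function names, might look like this:

```python
from itertools import combinations


def quorums_always_overlap(n, r, w):
    """Brute force: does every set of w written replicas intersect
    every set of r read replicas?"""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


def consistency_mode(n, r, w):
    # The rule from the slide: R+W > N gives "strong", otherwise eventual.
    return "strong" if r + w > n else "eventual"


# The brute-force overlap test agrees with the R+W > N rule for small N.
for n in range(1, 6):
    for r in range(1, n + 1):
        for w in range(1, n + 1):
            assert quorums_always_overlap(n, r, w) == (r + w > n)
```

For example, `consistency_mode(3, 1, 1)` gives `"eventual"`, while `consistency_mode(3, 2, 2)` gives `"strong"`.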

Page 74: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

[Diagram: R+W scale, from strong consistency at one end to lower latency at the other]

Page 75: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Cassandra: R=W=1, N=3 by default

(1+1 ≯ 3)

Page 76: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/

"In the general case, we typically use [Cassandra’s] consistency level of [R=W=1], which provides maximum performance. Nice!" --D. Williams, “HBase vs Cassandra: why we moved”, February 2010

Page 79: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Consistency or Bust: Breaking a Riak Cluster

NoSQL Primer

Sunday, July 31, 11

Page 80: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Consistency or Bust: Breaking a Riak Cluster

Low Value Data

n = 2, r = 1, w = 1

Sunday, July 31, 11

http://www.slideshare.net/Jkirkell/breaking-a-riak-cluster

Page 82: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Consistency or Bust: Breaking a Riak Cluster

Mission Critical Data

n = 5, r = 1, w = 5, dw = 5

Sunday, July 31, 11

http://www.slideshare.net/Jkirkell/breaking-a-riak-cluster

Page 84: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Voldemort @ LinkedIn

N=3 not required, “some consistency”: R=W=1, N=2

“very low latency and high availability”: R=W=1, N=3

Alex Feinberg, personal communication

Page 87: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Anecdotally, EC “worthwhile” for many kinds of data

How eventual? How consistent?

“eventual and consistent enough”

Page 89: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Can we do better?

Probabilistically Bounded Staleness

can’t make promises
can give expectations

Page 91: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

PBS is: a way to quantify latency-consistency trade-offs

what’s the latency cost of consistency?
what’s the consistency cost of latency?

an “SLA” for consistency

Page 92: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

How eventual?

t-visibility: probability p of consistent reads after t seconds

(e.g., 99.9% of reads will be consistent after 10ms)

Page 94: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

t-visibility depends on:
1) message delays
2) background version exchange (anti-entropy)

anti-entropy:
only decreases staleness
comes in many flavors
hard to guarantee rate

Focus on message delays

Page 95: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

focus on steady state

with failures: unavailable or sloppy

Page 107: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

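[Diagram: message timeline between Coordinator and Replica, repeated once per replica. The Coordinator sends a write and the Replica returns an ack; the Coordinator waits for W responses. After t seconds elapse, the Coordinator sends a read and the Replica returns a response; the Coordinator waits for R responses. A response is stale if the read arrives before the write.]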

Page 116: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

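[Diagram: message timeline for N=2, W=1, R=1. The write is sent to both replicas and each returns an ack; the read is then sent to both replicas and each returns a response. In this ordering the read result is good.]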

Page 127: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

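[Diagram: message timeline for N=2, W=1, R=1. The write is sent to both replicas and each returns an ack; the read is then sent to both replicas and each returns a response. In this ordering the read result is bad: the read arrives at a replica before the write does, so the response is stale.]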

Page 128: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

write

ack

read

response

wait for W responses

t seconds elapse

wait for R responses

response is stale

if read arrives before write

once per replica

Coordinator Replica Time

Page 131: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Coordinator Coordinator

write(“key”, 2)

ack(“key”, 2)

W=1

R1 (“key”, 2)   R2 (“key”, 1)   R3 (“key”, 1)

(“key”, 1)

(“key”, 1)

R=1

R3 replied before last write arrived!

Page 136: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Coordinator Replica Time

write (W)

ack (A)

read (R)

response (S)

wait for W responses

t seconds elapse

wait for R responses

response is stale if read arrives before write

once per replica

Page 137: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Solving WARS: hard

Monte Carlo methods: easier

Page 138: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

To use WARS:

gather latency data

W: 53.2, 44.5, 101.1, ...
A: 10.3, 8.2, 11.3, ...
R: 15.3, 22.4, 19.8, ...
S: 9.6, 14.2, 6.7, ...

run simulation (Monte Carlo, sampling)
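
The two steps above can be sketched as a small Monte Carlo loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `estimate_t_visibility`, the sampler arguments, and the exponential latencies in the usage lines are all hypothetical. In practice the samplers would draw from the gathered W, A, R, S latency data.

```python
import random

def estimate_t_visibility(n, r, w, t, sample_w, sample_a, sample_r, sample_s,
                          trials=20_000, seed=0):
    """Estimate Pr(consistent read) when the read starts t after the write commits.

    Each sample_* argument is a hypothetical callable taking a random.Random
    and returning one one-way delay (W, A, R, or S) for one replica.
    """
    rng = random.Random(seed)
    consistent = 0
    for _ in range(trials):
        # One delay of each kind per replica.
        ws = [sample_w(rng) for _ in range(n)]   # write reaches replica
        acks = [sample_a(rng) for _ in range(n)] # ack returns to coordinator
        rs = [sample_r(rng) for _ in range(n)]   # read reaches replica
        ss = [sample_s(rng) for _ in range(n)]   # response returns to coordinator

        # The write commits once the coordinator has W acks.
        commit = sorted(wi + ai for wi, ai in zip(ws, acks))[w - 1]
        read_start = commit + t

        # The coordinator returns after the first R read responses.
        first_r = sorted(range(n), key=lambda i: rs[i] + ss[i])[:r]

        # A replica answers with the new value only if the write
        # arrived there before the read did.
        if any(ws[i] <= read_start + rs[i] for i in first_r):
            consistent += 1
    return consistent / trials

if __name__ == "__main__":
    # Assumed latency model: exponential delays with mean 1 (arbitrary units).
    exp = lambda rng: rng.expovariate(1.0)
    for t in (0.0, 1.0, 5.0):
        print(t, estimate_t_visibility(3, 1, 1, t, exp, exp, exp, exp))
```

Reusing the same seed for every value of t keeps the sampled delays identical across runs, so the estimates never decrease as t grows.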

Page 139: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

How eventual?

t-visibility: consistent reads with probability p after t seconds

key: WARS model

need: latencies

Page 140: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

How consistent?

What happens if I don’t wait?

Page 141: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Probability of reading data older than k versions is exponentially reduced by k

Pr(reading latest write) = 99%

Pr(reading one of last two writes) = 99.9%

Pr(reading one of last three writes) = 99.99%

Page 143: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

WARS Simulation accuracy

Cassandra cluster, injected latencies:

t-staleness RMSE: 0.28%

latency N-RMSE: 0.48%

Page 144: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Yammer: 100K+ companies, uses Riak

LinkedIn: 150M+ users, built and uses Voldemort

production latencies fit gaussian mixtures

Page 145: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

N=3

Page 146: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

10 ms

N=3

Page 149: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

LNKD-DISK, N=3

99.9% consistent reads: R=2, W=1
t = 13.6 ms
Latency: 12.53 ms

100% consistent reads: R=3, W=1
Latency: 15.01 ms

16.5% faster: worthwhile?

Latency is combined read and write latency at 99.9th percentile

Page 150: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

N=3

Page 154: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

LNKD-SSD, N=3

99.9% consistent reads: R=1, W=1
t = 1.85 ms
Latency: 1.32 ms

100% consistent reads: R=3, W=1
Latency: 4.20 ms

59.5% faster

Latency is combined read and write latency at 99.9th percentile

Page 155: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Coordinator Replica

write (W)

ack (A)

read (R)

response (S)

wait for W responses

t seconds elapse

wait for R responses

response is stale if read arrives before write

once per replica

critical factor in staleness

Page 156: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

[Figure: CDFs of write latency (ms, log scale from 10^-2 to 10^3) for W=1, W=2, and W=3, with N=3. Series: LNKD-SSD, LNKD-DISK, YMMR, WAN.]

Page 159: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Coordinator Replica

write (W)

ack (A)

read (R)

response (S)

wait for W responses

t seconds elapse

wait for R responses

response is stale if read arrives before write

once per replica

SSDs reduce variance compared to disks!

Page 160: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

N=3

Page 164: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

YMMR, N=3

99.9% consistent reads: R=1, W=1
t = 202.0 ms
Latency: 43.3 ms

100% consistent reads: R=3, W=1
Latency: 230.06 ms

81.1% faster

Latency is combined read and write latency at 99.9th percentile

Page 165: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley
Page 166: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Workflow

1. Tracing
2. Simulation
3. Tune N, R, W
4. Profit

Page 168: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley
Page 169: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

PBS

problem: no guarantees with eventual consistency

solution: consistency prediction

technique: measure latencies, use WARS model

Page 170: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

consistency is a metric

to measure

to predict

Page 171: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

R+W

strong consistency

lower latency

Page 172: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

R+W

Page 175: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

PBS

latency vs. consistency trade-offs

simple modeling with WARS

model staleness in time, versions

eventual consistency

often fast

often consistent

PBS helps explain when and why

pbs.cs.berkeley.edu/#demo

@pbailis

Page 177: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Extra Slides

Page 178: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Related Work

Page 179: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Quorum System Theory
e.g., Probabilistic Quorums, k-quorums

Deterministic Staleness
e.g., TACT/conits, FRACS

Page 180: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Consistency Verification
e.g., Golab et al. (PODC ’11), Bermbach and Tai (MW4SOC ’11)

Page 181: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

PBS and apps

Page 182: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

staleness requires either:

staleness-tolerant data structures
timelines, logs
cf. commutative data structures, logical monotonicity

asynchronous compensation code
detect violations after data is returned; see paper
write code to fix any errors
cf. “Building on Quicksand”: memories, guesses, apologies

Page 183: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

asynchronous compensation

minimize: (compensation cost) × (# of expected anomalies)

Page 184: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Read only newer data?

(monotonic reads session guarantee)

# versions tolerable staleness = ratio of client’s read rate and global write rate

(for a given key)

Page 185: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Failure?

Page 186: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Treat failures as latency spikes

Page 187: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

How long do partitions last?

Page 189: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

what time interval?

99.9% uptime/yr ⇒ 8.76 hours downtime/yr

8.76 consecutive hours down ⇒ bad 8-hour rolling average

hide in tail of distribution OR continuously evaluate SLA, adjust
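
The downtime arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the helper names `downtime_hours_per_year` and `worst_window_uptime` are hypothetical, and the second helper assumes the whole outage is one contiguous block.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def downtime_hours_per_year(uptime_fraction):
    """Hours of downtime per year allowed by a yearly uptime target."""
    return (1.0 - uptime_fraction) * HOURS_PER_YEAR

def worst_window_uptime(outage_hours, window_hours):
    """Uptime fraction of the worst rolling window, assuming one contiguous outage."""
    return max(0.0, 1.0 - outage_hours / window_hours)

if __name__ == "__main__":
    budget = downtime_hours_per_year(0.999)
    print(round(budget, 2))                      # yearly downtime budget in hours
    print(worst_window_uptime(budget, 8))        # 8-hour rolling window
    print(worst_window_uptime(budget, HOURS_PER_YEAR))  # full-year window
```

The same outage meets the yearly target while leaving an 8-hour window with no uptime at all, which is why the evaluation interval matters.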

Page 190: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

[Figure: CDFs of write latency (ms, log scale from 10^-2 to 10^3) for W=1, W=2, and W=3, with N=3. Series: LNKD-SSD, LNKD-DISK, YMMR, WAN.]

Page 191: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

[Figure: latency CDFs (log scale from 10^-2 to 10^3), N=3. Panel labels as extracted: R=3, W=1, W=2; axis label as extracted: Write Latency (ms). Series: LNKD-SSD, LNKD-DISK, YMMR, WAN.]

(LNKD-SSD and LNKD-DISK identical for reads)

Page 192: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Probabilistic quorum systems

$$p_{\text{inconsistent}} = \frac{\binom{N-W}{R}}{\binom{N}{R}}$$

Page 193: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

How consistent?

k-staleness: probability p of reading one of last k versions

Page 195: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

How consistent?

$$1 - \left(\frac{\binom{N-W}{R}}{\binom{N}{R}}\right)^{K}$$

closed-form solution

static quorum choice
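
The closed form above is easy to evaluate directly. The following is a minimal sketch, assuming each read and write picks its quorum uniformly at random and independently (the "static quorum choice" setting); the function names are hypothetical.

```python
from math import comb

def p_inconsistent(n, r, w):
    """Probability that a random read quorum of size r misses a random write quorum of size w."""
    # comb(n - w, r) is 0 when r > n - w, i.e. when the quorums must overlap.
    return comb(n - w, r) / comb(n, r)

def p_within_k_versions(n, r, w, k):
    """Probability of reading one of the last k versions (k-staleness)."""
    return 1.0 - p_inconsistent(n, r, w) ** k

if __name__ == "__main__":
    # N=3, R=1, W=1: each extra tolerated version shrinks the miss probability.
    for k in (1, 2, 3):
        print(k, p_within_k_versions(3, 1, 1, k))
```

With R + W > N the miss probability is zero, so the function returns 1.0 for every k.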

Page 197: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

<k,t>-staleness: versions and time

approximation: exponentiate t-staleness by k

Page 198: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

“strong” consistency

reads return the last written value or newer (defined w.r.t. real time, when the read started)

Page 201: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

N = 3 replicas: R1, R2, R3

Write to W, read from R replicas

quorum system: guaranteed intersection

R=W=3 replicas: {R1, R2, R3}

R=W=2 replicas: {R1, R2}, {R2, R3}, {R1, R3}

partial quorum system: may not intersect

R=W=1 replicas: {R1}, {R2}, {R3}
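
The difference between the two kinds of quorum system can be checked by brute force. This is a minimal sketch with a hypothetical function name; it enumerates every read and write quorum and tests whether each pair shares a replica.

```python
from itertools import combinations

def always_intersect(n, r, w):
    """True if every read quorum of size r overlaps every write quorum of size w."""
    replicas = range(1, n + 1)  # replica ids 1..n, standing in for R1..Rn
    return all(set(read_q) & set(write_q)
               for read_q in combinations(replicas, r)
               for write_q in combinations(replicas, w))

if __name__ == "__main__":
    for r_w in (3, 2, 1):
        print(f"N=3, R=W={r_w}:", always_intersect(3, r_w, r_w))
```

For N=3 the check succeeds for R=W=3 and R=W=2 and fails for R=W=1, matching the R + W > N condition for guaranteed intersection.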

Page 204: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

Synthetic, Exponential Distributions

W 1/4x ARS

W 10x ARS

N=3, W=1, R=1

Page 205: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley

concurrent writes: deterministically choose

Coordinator R=2

(“key”, 1) (“key”, 2)

Page 206: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley
Page 207: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley
Page 208: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley
Page 209: PBS @ Twitter · 2012. 6. 23. · PBS Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica @ Twitter 6.22.12 UC Berkeley