PCAP Project: Characterizing and Adapting the Consistency-Latency Tradeoff in Distributed Key-value Stores
Muntasir Raihan Rahman
Joint work with Lewis Tseng, Son Nguyen, Indranil Gupta, Nitin Vaidya
Distributed Protocols Research Group (DPRG): http://dprg.cs.uiuc.edu
Key-value/NoSQL Storage Systems
• Key-value/NoSQL stores: $3.4B sector by 2018
• Distributed storage in the cloud
  – Netflix: video position (Cassandra)
  – Amazon: shopping cart (DynamoDB)
  – And many others
• NoSQL = "Not Only SQL"
Key-value/NoSQL Storage Systems (2)
• Necessary API operations: get(key) and put(key, value)
  – And some extended operations, e.g., "CQL" in the Cassandra key-value store
• Lots of open-source systems (startups)
  – Cassandra (Facebook)
  – Riak (Basho)
  – Voldemort (LinkedIn)
• Closed-source systems with papers
  – Dynamo
Key-value/NoSQL Storage: Fast and Fresh
• Cloud clients expect both
  – Availability: low latency for all operations (reads/writes)
    • A 500 ms latency increase at Google.com costs a 20% drop in revenue
    • Each extra ms of latency means ~$4M revenue loss
  – Consistency: a read returns the value of one of the latest writes
    • Freshness of data means accurate tracking and higher user satisfaction
• Most KV stores only offer weak consistency (eventual consistency)
  – Eventual consistency = if writes stop, all replicas converge, eventually
• Why eventual? Why so weak?
CAP Theorem and the NoSQL Revolution
• Conjectured [Brewer 00], proved [Gilbert Lynch 02]: when the network is partitioned, a system must choose either strong consistency or availability
• Kicked off the NoSQL revolution
• Abadi's PACELC
  – If P, choose A or C
  – Else, choose L (latency) or C
[Figure: CAP triangle with vertices Consistency, Availability/Latency, and Partition-tolerance; RDBMSs pick C+A; Cassandra, Riak, Dynamo, and Voldemort pick A+P; HBase, HyperTable, BigTable, and Spanner pick C+P]
Hard vs. Soft Partitions
• The CAP Theorem looks at hard partitions
• However, soft partitions may happen inside a data-center
  – Periods of elevated message delays
  – Periods of elevated loss rates
[Figure: a hard partition separates Data-center 1 (America) from Data-center 2 (Europe); inside a data-center, congestion at core and ToR switches => soft partition]
Our work: From Impossibility to Possibility
• C -> Probabilistic C (Consistency)
• A -> Probabilistic A (Latency)
• P -> Probabilistic P (Partition Model)
• Probabilistic variant of the CAP Theorem
• PCAP system to support SLAs (service level agreements) for a single data-center
• GeoPCAP: extension to multiple geo-distributed data-centers
[Figure: timeline with writes W(1), W(2) followed by read R(1); tc marks the freshness interval before the read]

• Probabilistic Consistency (pic, tc): a read is tc-fresh if it returns the value of a write that starts at most tc time before the read; pic is the likelihood that a read is NOT tc-fresh
• Probabilistic Latency (pua, ta): pua is the likelihood that a read does NOT return an answer within ta time units
• Probabilistic Partition (α, tp): α is the likelihood that a random host pair has message delay exceeding tp time units
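The metrics pic and pua can be estimated directly from an operation log, following the definitions above. The sketch below is illustrative only: the `Write`/`Read` record shapes and function names are assumptions, not the PCAP system's actual data structures.

```python
from collections import namedtuple

# Hypothetical log-entry shapes (assumptions): each operation records
# its start time; reads also record the value returned and the latency.
Write = namedtuple("Write", ["key", "value", "start"])
Read = namedtuple("Read", ["key", "value", "start", "latency"])

def estimate_pic(reads, writes, tc):
    """Fraction of reads that are NOT tc-fresh: a read is tc-fresh if
    the write whose value it returns started at most tc before the
    read started."""
    stale = 0
    for r in reads:
        fresh = any(w.key == r.key and w.value == r.value
                    and w.start >= r.start - tc
                    for w in writes)
        if not fresh:
            stale += 1
    return stale / len(reads)

def estimate_pua(reads, ta):
    """Fraction of reads that do not return an answer within ta."""
    return sum(1 for r in reads if r.latency > ta) / len(reads)
```

For example, a read that returns a value written 5 time units earlier is tc-fresh for tc = 10 but stale for tc = 3.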
Probabilistic CAP

PCAP Theorem: it is impossible to achieve both Probabilistic Consistency and Probabilistic Latency under Probabilistic Partitions if:

  tc + ta < tp  and  pua + pic < α

• Bad network -> high (α, tp)
• To get better consistency -> lower (pic, tc)
• To get better latency -> lower (pua, ta)
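The theorem's condition is a pair of inequalities, so it is easy to state as code. This minimal sketch (the function name is ours) just checks whether a given SLA pair is unachievable under a given partition model:

```python
def pcap_impossible(pic, tc, pua, ta, alpha, tp):
    """PCAP theorem check: both the consistency SLA (pic, tc) and the
    latency SLA (pua, ta) are unachievable under the partition model
    (alpha, tp) if BOTH inequalities hold."""
    return (tc + ta < tp) and (pua + pic < alpha)
```

Relaxing either side of the tradeoff, e.g., a larger freshness interval tc or a larger timeout ta, can move the system back into the achievable region.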
Towards Probabilistic SLAs
• Consistency SLA: goal is to
  – Meet a desired freshness probability (given a freshness interval)
  – Maximize the probability that the client receives the operation's result within the timeout
  – Example: Google search application / Twitter search
    • Wants users to receive "recent" data as search results
    • Only 10% of results can be more than 5 min stale
    • SLA: (pic, tc) = (0.1, 5 min)
    • Minimize response time (fast response to query): minimize pua (given ta)
Towards Probabilistic SLAs (2)
• Latency SLA: goal is to
  – Meet a desired probability that the client receives the operation's result within the timeout
  – Maximize the freshness probability within a given freshness interval
  – Example: Amazon shopping cart
    • Doesn't want to lose customers due to high latency
    • Only 10% of operations can take longer than 300 ms
    • SLA: (pua, ta) = (0.1, 300 ms)
    • Minimize staleness (don't want customers to lose items): minimize pic (given tc)
Increased Knob    | Latency    | Consistency
------------------|------------|------------
Read Delay        | Degrades   | Improves
Read Repair Rate  | Unaffected | Improves
Consistency Level | Degrades   | Improves
Meeting these SLAs: PCAP Systems

• Continuously adapt the control knobs to always satisfy the PCAP SLA
[Figure: the PCAP system applies adaptive control to the knobs of a KV-store (Cassandra, Riak) so that the PCAP SLA is satisfied]
• System assumptions:
  – A client sends its query to a coordinator server, which forwards it to the replicas (answers follow the reverse path)
  – Background mechanisms exist to bring stale replicas up to date
PCAP System Control Loop Architecture (Single DC)

The PCAP Coordinator runs a control loop around the original k-v store and its PCAP system knobs:
(1) Inject test operations (reads and writes) [active approach]
(2) Estimate the current consistency or latency by analyzing the operation log
(3) Update the control knobs to reach the target SLA

Passive approach:
• Sample ongoing client operations instead of injecting test ones
• Non-intrusive to client workloads
Multiplicative Step Size Control

• Initial state: pic > pic(SLA)
• Exponentially increase the step size
• On crossing the SLA: reset step size = 1, change direction
• Exponentially decrease the step size until the next cross
• Converges around the SLA (within ±1 ms)
• Faster convergence compared to unit step size changes
[Plot: read delay (blue) tracking the SLA's read delay (red) under lognormal delay variation]

Meeting the Consistency SLA for PCAP Cassandra (pic = 0.135)

• Consistency stays always below the target SLA
• The PCAP system satisfies the SLA and stays close to the optimal envelopes under different network conditions
• Setup:
  – 9-server Emulab cluster; each server has 4 Xeon cores + 12 GB RAM
  – 100 Mbps Ethernet
  – YCSB workload (144 client threads)
  – Network delay: log-normal distribution, mean latency = 3 ms | 4 ms | 5 ms
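The controller above can be sketched in a few lines. This is one plausible reading, not the authors' exact algorithm (the slide leaves the post-cross behavior partly ambiguous): while the measured metric stays on the same side of the SLA, the step doubles; each cross reverses direction and resets to a unit step, so the knob settles into small oscillations around the SLA. The function name and the synthetic `measure` callback are assumptions.

```python
def multiplicative_control(measure, sla, knob=0.0, steps=30):
    """Drive `knob` (e.g., read delay in ms) so that measure(knob),
    e.g., the observed pic, converges to the SLA target."""
    step, direction = 1.0, 1.0      # start by increasing the knob
    above = measure(knob) > sla     # currently violating the SLA?
    for _ in range(steps):
        knob = max(0.0, knob + direction * step)
        now_above = measure(knob) > sla
        if now_above == above:
            step *= 2.0             # same side of SLA: grow the step
        else:
            direction, step = -direction, 1.0   # crossed: reverse, reset
        above = now_above
    return knob
```

With a toy model where pic falls linearly in read delay (pic = 0.3 - 0.01 * delay) and an SLA of 0.135, the knob converges to around 16-17 ms, oscillating by one unit step, which matches the "converge around SLA" behavior described above.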
GeoPCAP
• Geo-distributed extension: from a single data-center to multiple geo-distributed data-centers
• Key techniques:
  – Probabilistic composition rules
  – A new control algorithm (the PCAP multiplicative controller oscillates in the geo setting)
Composition Rules
• Assume each data-center runs the single-data-center PCAP system
• Each PCAP system has probabilistic consistency and latency models
• Composition rules allow us to characterize the global behavior across all data-centers
  – Enables us to check whether the composed system meets the SLA
• Our rules generalize Apache Cassandra's multi-data-center deployment rules
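Under the QUICKEST and ALL rules introduced on the next slide, the composition of per-DC miss probabilities has a simple closed form if we assume independence across data-centers (an assumption of this sketch, not necessarily of the authors' rules). Here each p_i is a per-DC probability of missing the SLA (a pic or a pua):

```python
from math import prod

def compose_quickest(ps):
    """QUICKEST: the client is satisfied if at least one DC meets the
    SLA, so the composed system misses only if every DC misses."""
    return prod(ps)

def compose_all(ps):
    """ALL: every DC must meet the SLA, so the composed system misses
    if any single DC misses."""
    return 1.0 - prod(1.0 - p for p in ps)
```

QUICKEST composition always yields a smaller (better) miss probability than ALL for the same inputs, which is why it is the more forgiving rule.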
System Model

[Figure: a client read with an SLA goes to the local DC; per-DC probabilistic models Prob C1, Prob C2, Prob C3 are composed into a model Prob CC, which is compared against the SLA for compliance]

• Given a client C or L SLA:
  – QUICKEST: at least one DC satisfies the SLA
  – ALL: each DC satisfies the SLA
Geo-delay Knob

[Figure: the same system model as before, with a delay Δ injected at the local DC]

• Δ: delay injected at the local DC before forwarding a request to a remote DC
• Δ increase: wait longer at the local DC, giving remote DCs time to sync; better consistency, worse latency
• Δ decrease: vice versa
Limitation of the Multiplicative Controller

• Monte-Carlo simulation
  – For each DC, use the LinkedIn latency distribution to model (Prob C, Prob A)
  – For the WAN, use N(20, sqrt(2))
• At steady state, the controller oscillates with unit step-size changes
New Controller for GeoPCAP
• Based on PID control theory
• Error: E = SLA - current metric
• Knob change = kp·E + kd·(dE/dt) + ki·∫E dt
• (-) Overhead of tuning kp, kd, ki to meet stability
• (+) But can reduce oscillations arbitrarily
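The knob-change formula above is the standard PID law; a textbook discrete-time sketch follows (this is illustrative, not the authors' GeoPCAP implementation, and the class/method names are ours). As the slide notes, the gains kp, ki, kd must be tuned for stability.

```python
class PIDController:
    """Discrete PID: knob change = kp*E + ki*(integral of E) + kd*(dE/dt)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def knob_change(self, sla, metric, dt=1.0):
        error = sla - metric                      # E = SLA - current metric
        self.integral += error * dt               # approximates integral of E dt
        deriv = (0.0 if self.prev_error is None
                 else (error - self.prev_error) / dt)  # approximates dE/dt
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * deriv)
```

The integral term is what removes the steady-state oscillation of the multiplicative controller: a persistent error keeps accumulating and pushes the knob until the error vanishes.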
GeoPCAP Latency SLA (ALL)

[Plots: WAN delay jumps from N(20, sqrt(2)) to N(22, sqrt(2.2))]

• The latency SLA is met both before and after the jump
• Consistency degrades after the delay jump
• Fast convergence initially, and again after the delay jump
• Reduced oscillation compared to the multiplicative controller
Summary
• The CAP Theorem motivated the NoSQL revolution
• But apps need freshness + fast responses
  – Under soft partitions
• We proposed:
  – Probabilistic models for C, A, P
  – A probabilistic variant of the CAP theorem
  – A PCAP system satisfying latency/consistency SLAs for a single data-center
  – Integrated into the Apache Cassandra and Riak KV stores
  – Extended to multiple geo-distributed data-centers
Meeting the Consistency SLA for PCAP Cassandra (pic = 0.2) [Passive]

[Plot: pic and pua over time, estimated from the latest 100 operations from 5 servers; the network delay is changed (a jump) partway through]

• Consistency stays always below or near the target SLA
• Further from optimal, compared to the active approach
• Increased convergence time, compared to the active approach
Related Work

                    | PCAP (our system)                      | Pileus (SOSP 2013)
--------------------|----------------------------------------|------------------------------------
SLA                 | Probabilistic consistency/latency SLA  | Family of consistency/latency SLAs
                    | for a single data-center               | with utility values for
                    |                                        | geo-distributed settings
System architecture | Writes can go to any server (P2P)      | Writes only go to the master
                    |                                        | replica (master-slave replication)

                    | PCAP (our system)                      | PBS (VLDB 2012)
--------------------|----------------------------------------|------------------------------------
Consistency model   | (W start time, R start time);          | (W finish time, R start time);
(freshness interval)| models R/W overlap                     | cannot model R/W overlap

• PBS (Probabilistically Bounded Staleness) metrics can be used in our PCAP system