C3: Cutting Tail Latency in Cloud Data Stores via
Adaptive Replica Selection
Marco Canini (UCL)
with Lalith Suresh, Stefan Schmid, Anja Feldmann (TU Berlin)
Latency matters
Latency $$$$
• +500 ms leads to a revenue decrease of 1.2% [Eric Schurman, Bing]
• Every 100 ms of latency cost Amazon 1% in sales [Greg Linden, Amazon]
One user request
Tens to hundreds of servers involved
Variability of response time at system components inflates end-to-end latencies and leads to high tail latencies
[Figure: ECDF of latency; annotation: size and system complexity]
At Bing, 30% of services have a 95th percentile latency > 3x the median latency [Jalaparti et al., SIGCOMM’13]
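The way fan-out inflates tail latency can be made concrete with a small calculation. This is a minimal sketch, assuming each server is independently slow (above its own 99th percentile) 1% of the time; the independence assumption and the name `prob_any_slow` are illustrative and not from the talk.

```python
def prob_any_slow(num_servers: int, p_slow: float = 0.01) -> float:
    """Probability that at least one of num_servers responses is slow,
    assuming each response is independently slow with probability p_slow."""
    return 1.0 - (1.0 - p_slow) ** num_servers


# One user request that fans out to 1, 10, or 100 servers.
for n in (1, 10, 100):
    print(f"{n:>3} servers: {prob_any_slow(n):.3f}")
```

Under these assumptions, a request that touches 100 servers waits on at least one slow response about 63% of the time, even though each server is slow only 1% of the time.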
Performance fluctuations are the norm
• Queueing delays
• Skewed access patterns
• Resource contention
• Background activities
How can we enable datacenter applications* to achieve more predictable performance?
* in this work: low-latency data stores
Replica selection
{request}
? ? ?
Replica Selection Challenges
• Service-time variations
• Herd behavior
Service-time variations
[Diagram: one request, candidate servers annotated 4 ms, 5 ms, and 30 ms]
Herd Behavior
[Diagram: several clients each hold a request and face the same choice among replicas (? ? ?)]
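Herd behavior can be illustrated with a toy example in which several clients make the same choice from the same stale information. This is a minimal sketch; the replica names, queue sizes, and the `pick_least_loaded` helper are hypothetical and not part of C3.

```python
from collections import Counter


def pick_least_loaded(view: dict) -> str:
    """Return the replica with the smallest queue size in a client's view."""
    return min(view, key=view.get)


# Hypothetical snapshot of queue sizes that every client last observed.
stale_view = {"R1": 4, "R2": 1, "R3": 5}

# Ten clients decide at the same time, each using the same stale snapshot.
choices = Counter(pick_least_loaded(stale_view) for _ in range(10))
print(choices)  # every client picks R2
```

Because no client sees the others' decisions, all ten requests land on the replica that merely looked best a moment ago.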
Load Pathology in Practice
• Cassandra on EC2
  – 15-node cluster, skewed access pattern, workload by 120 YCSB generators
• Observe load on the most heavily utilized node
Load Pathology in Practice
99.9th percentile ~ 10x median latency
Load Conditioning in Our Approach
C3
Adaptive replica selection mechanism that is robust to service-time heterogeneity
C3
• Replica Ranking
• Distributed Rate Control
Balance the product of queue size and service time { q_s · 1/µ_s }
[Diagram: three clients and two servers, one with µ⁻¹ = 2 ms and one with µ⁻¹ = 6 ms]
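Balancing the product of queue size and service time can be sketched with the two servers from the diagram (2 ms and 6 ms). This is a minimal sketch, assuming a greedy assignment and no draining of queues during the experiment; the function name `assign` is illustrative.

```python
def assign(num_requests: int, service_ms: list) -> list:
    """Greedily send each request to the server with the smallest
    (queue size + 1) * service time, and return the final queue sizes."""
    queues = [0] * len(service_ms)
    for _ in range(num_requests):
        # Cost of joining each queue: time to drain it including this request.
        costs = [(q + 1) * s for q, s in zip(queues, service_ms)]
        target = costs.index(min(costs))  # ties go to the first server
        queues[target] += 1
    return queues


queues = assign(8, [2, 6])
print(queues)                                   # [6, 2]
print([q * s for q, s in zip(queues, [2, 6])])  # [12, 12]
```

The faster server ends up with three times the queue of the slower one, so both queues take the same time to drain. Balancing queue sizes alone would have left the slow server with the longer wait.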
Server-side Feedback
Servers piggyback { q_s } and { 1/µ_s } in every response
[Diagram: the server returns { q_s , 1/µ_s } to the client]
Clients maintain EWMAs of these metrics
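The client-side bookkeeping for this feedback can be sketched as follows. This is a minimal sketch, assuming the piggybacked values are a queue size and a service time in milliseconds; the class names and the smoothing weight `alpha` are hypothetical choices, not values from the talk.

```python
class Ewma:
    """Exponentially weighted moving average of a stream of samples."""

    def __init__(self, alpha: float = 0.9):
        # alpha is the weight kept on the old average (assumed value).
        self.alpha = alpha
        self.value = None

    def update(self, sample: float) -> float:
        if self.value is None:
            self.value = sample  # first sample seeds the average
        else:
            self.value = self.alpha * self.value + (1 - self.alpha) * sample
        return self.value


class ServerStats:
    """What one client remembers about one server."""

    def __init__(self):
        self.queue_size = Ewma()
        self.service_time_ms = Ewma()

    def on_response(self, queue_size: float, service_time_ms: float) -> None:
        # Called with the feedback piggybacked on each response.
        self.queue_size.update(queue_size)
        self.service_time_ms.update(service_time_ms)


stats = ServerStats()
stats.on_response(queue_size=4, service_time_ms=2.0)
stats.on_response(queue_size=8, service_time_ms=4.0)
print(stats.queue_size.value, stats.service_time_ms.value)
```

Smoothing keeps a single unusually slow or fast response from swinging the client's view of a server.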
Server-side Feedback
• Concurrency compensation
Servers piggyback { q_s } and { 1/µ_s } in every response
outstanding requests
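The formula behind the "outstanding requests" label did not survive in this transcript. The sketch below shows one way concurrency compensation could work, assuming the queue-size estimate takes the form q̂_s = 1 + os_s · w + q̄_s, where os_s is the client's own outstanding requests to server s, q̄_s is the EWMA of the piggybacked queue size, and w is a weight such as the number of clients. This form and all names in the code are assumptions and should be checked against the C3 paper.

```python
def estimate_queue_size(outstanding: int, ewma_queue: float, weight: int) -> float:
    """Estimated queue size at a server, as seen by one client.

    ASSUMPTION: q_hat = 1 + outstanding * weight + ewma_queue. The idea is
    that if this client has requests in flight, other clients probably do
    too, so in-flight requests are scaled up by `weight`.
    """
    return 1 + outstanding * weight + ewma_queue


# A client with 2 requests in flight, a smoothed queue size of 3,
# and a weight of 10 (for example, 10 clients in the system).
print(estimate_queue_size(outstanding=2, ewma_queue=3.0, weight=10))  # 24.0
```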
Select server with min ?
• Potentially long queue sizes
• Hampers reactivity
Penalizing Long Queues
Scoring Function
Clients rank servers according to
$\Psi_s = R_s - \mu^{-1} + (\hat{q}_s)^3 \cdot \mu^{-1}$
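The scoring function (response time minus service time, plus the cubed queue-size estimate times service time) can be sketched as below. This is a minimal sketch with made-up inputs; the `exponent` parameter is added here only to compare the cubic penalty with a linear one, and the server names are hypothetical.

```python
def score(response_ms: float, service_ms: float, q_hat: float,
          exponent: int = 3) -> float:
    """Lower is better. exponent=3 gives the cubic queue penalty;
    exponent=1 is shown only for comparison (not part of the slide)."""
    return response_ms - service_ms + (q_hat ** exponent) * service_ms


# Hypothetical measurements for two replicas.
servers = {
    "fast-but-queued": dict(response_ms=3.0, service_ms=2.0, q_hat=2.0),
    "slow-but-idle": dict(response_ms=7.0, service_ms=6.0, q_hat=1.0),
}

for exponent in (1, 3):
    ranked = sorted(servers, key=lambda n: score(exponent=exponent, **servers[n]))
    print(exponent, ranked)
```

With a linear penalty the fast server scores 5 against 7 and is preferred. With the cubic penalty it scores 17 against 7, so the idle server wins. Cubing the queue estimate steers clients away from servers whose queues are building up.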
C3
• Replica Ranking
• Distributed Rate Control
Need for distributed rate control
Replica ranking insufficient
• Avoid saturating individual servers?
• Non-internal sources of performance fluctuations?
Cubic Rate Control
• Clients adjust sending rates according to cubic function
• If receive rate isn’t increasing further, multiplicatively decrease
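The slide does not give the cubic function itself. The sketch below assumes a curve in the style of TCP CUBIC: after a multiplicative decrease from rate r0, the sending rate climbs quickly, flattens near r0, then probes beyond it. The constants `beta` and `gamma`, the time unit, and the function names are illustrative assumptions.

```python
def cubic_rate(dt_ms: float, r0: float, beta: float = 0.2,
               gamma: float = 1e-6) -> float:
    """Sending rate dt_ms after the last decrease (assumed CUBIC-style curve).

    r0    : rate at which the last decrease happened
    beta  : multiplicative decrease factor (assumed value)
    gamma : scaling constant that sets how fast the curve grows (assumed)
    """
    k = (beta * r0 / gamma) ** (1.0 / 3.0)  # time at which the rate is back at r0
    return gamma * (dt_ms - k) ** 3 + r0


def decrease(rate: float, beta: float = 0.2) -> float:
    """Multiplicative decrease, applied when the receive rate stops increasing."""
    return rate * (1.0 - beta)


r0 = 100.0
for dt in (0, 100, 271, 400):
    print(dt, round(cubic_rate(dt, r0), 2))
```

At `dt_ms = 0` the curve starts at `(1 - beta) * r0`, which matches the rate left by `decrease`, so the two pieces fit together.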
Putting everything together
On receiving a request:
• Client sorts replica servers by ranking function
• Select first replica server which is within rate limit
• If all replicas have exceeded rate limit, retain in backlog queue until a replica server is available
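The three steps above can be sketched as a single dispatch function. This is a minimal sketch: the `Replica` class, its fixed `score`, and the simple token count standing in for the rate limiter are all hypothetical simplifications, not the C3 implementation.

```python
from collections import deque


class Replica:
    def __init__(self, name: str, score: float, tokens: int):
        self.name = name
        self.score = score    # ranking score, lower is better (fixed here)
        self.tokens = tokens  # requests still allowed in the current window

    def within_rate_limit(self) -> bool:
        return self.tokens > 0


def dispatch(request, replicas, backlog):
    """Send to the best-ranked replica that is within its rate limit.
    If none is, park the request in the backlog and return None."""
    for replica in sorted(replicas, key=lambda r: r.score):
        if replica.within_rate_limit():
            replica.tokens -= 1
            return replica.name
    backlog.append(request)
    return None


replicas = [Replica("R1", score=7.0, tokens=1), Replica("R2", score=17.0, tokens=1)]
backlog = deque()
results = [dispatch(f"req{i}", replicas, backlog) for i in range(3)]
print(results)        # ['R1', 'R2', None]
print(list(backlog))  # ['req2']
```

A fuller version would recompute scores from fresh feedback and retry backlogged requests once a replica's rate limit allows it.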
C3 implementation in Cassandra
[Diagram: A handles Read ( “key” ) and must choose among replicas R1, R2, and R3]
@Snitch, who is the best replica?
Dynamic snitching
C3 Implementation
• Replacement for Dynamic Snitching
• Scheduler and backlog queue per replica group
• ~400 lines of code
Evaluation
Amazon EC2
Controlled Testbed
Simulations
Evaluation
Amazon EC2
• 15-node Cassandra cluster
• Workloads generated using YCSB
• Read-heavy, update-heavy, read-only
• 500M 1KB records (larger than memory)
• Compare against Cassandra’s Dynamic Snitching
2x – 3x improved 99.9th percentile latencies
Up to 43% improved throughput
Improved load conditioning
How does C3 react to dynamic workload changes?
• Begin with 80 read-heavy workload generators
• 40 update-heavy generators join the system after 640s
• Observe latency profile with and without C3
Latency profile degrades gracefully with C3
Summary of other results
Higher system load
Skewed record sizes
SSDs instead of HDDs
> 3x better 99th percentile latency
> 50% higher throughput
Ongoing work
• Stability analysis of C3
• Deployment in production settings at Spotify and SoundCloud
• Study of networking effects
Summary
C3 combines careful replica ranking and distributed rate control to:
• Reduce tail, mean, and median latencies
• Improve load conditioning and throughput
Thank you!
[Diagram: the server returns { q_s , 1/µ_s } to the client]