RobinHood: Tail Latency-Aware Caching. Dynamically Reallocating from Cache-Rich to Cache-Poor. Daniel S. Berger, Benjamin Berg, Timothy Zhu, Siddhartha Sen, Mor Harchol-Balter (Carnegie Mellon University, Pennsylvania State University, Microsoft Research). USENIX OSDI, 10/8/18.

Page 1:

RobinHood: Tail Latency-Aware Caching
Dynamically Reallocating from Cache-Rich to Cache-Poor

Daniel S. Berger (Carnegie Mellon University), Benjamin Berg (Carnegie Mellon University), Timothy Zhu (Pennsylvania State University), Siddhartha Sen (Microsoft Research), Mor Harchol-Balter (Carnegie Mellon University)

USENIX OSDI, 10/8/18.

Page 2:

Typical Web Architecture

[Figure: user request → aggregation server → backend queries to Recom., Products, Ads, ...; axis: query latency]

Request latency = max of query latencies
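
The relation "request latency = max of query latencies" can be sketched in a few lines: the aggregation server waits for all of its backend queries, so the slowest one sets the request latency. The backend names follow the slide; the latency values and the function name `request_latency` are made up for illustration.

```python
def request_latency(query_latencies_ms):
    """Latency of one user request, given the latency of each backend query it issued."""
    # The aggregation server must wait for every backend query to return,
    # so the slowest query determines when the request completes.
    return max(query_latencies_ms.values())


# Hypothetical latencies (ms) for a single request.
queries = {"Recom.": 12.0, "Products": 85.0, "Ads": 7.0}
print(request_latency(queries))
```

In this made-up request, the Products query alone decides the request latency, however fast the other two backends respond.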

Page 3:

Typical Web Architecture

Goal: minimize 99th-percentile (P99) request latency

[Figure: user request → aggregation server → backend queries to Recom., Products, Ads, ...; axis: query latency]

Request latency = max of query latencies
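
For readers who want the P99 goal spelled out, here is a minimal sketch of a nearest-rank percentile over measured request latencies. The nearest-rank definition is one common convention and is an assumption here (the talk does not say which estimator is used); the sample data is invented.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 1000 requests, 985 fast (20 ms) and 15 slow (400 ms).
latencies = [20.0] * 985 + [400.0] * 15
print(percentile(latencies, 50))
print(percentile(latencies, 99))
```

With 1.5% of requests slow, the median stays at the fast value while the P99 lands on the slow one, which is why the P99 is sensitive to a small fraction of slow requests.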

Page 4:

What Causes High P99 Request Latency?

[Figure: user request → aggregation server → backend queries to Recom., Products, Ads, ...]

Observations at xbox.com (3/2018): [Figure: query latency and request latency]

Better load balancing? Elastically scale backends? Partially implemented.

Page 5:

What Else Can We Do?

[Figure: user request → aggregation server with a cache → backend queries to Recom., Products, Ads, ...]

Observations at xbox.com (3/2018): [Figure: query latency and request latency]

Aggregation Cache: Currently shared among queries to all backends

Can we use the aggregation cache to reduce the P99 request latency?

Page 6:

Can We Use Caching to Reduce the P99?

State-of-the-art caching systems focus on hit ratio and fairness, not the P99

“Caching layers do not directly address tail latency, aside from configurations where the entire working set can reside in a cache.”

Belief: No

[Figure: cache in front of backend B; a cache hit takes 1ms, a query to B takes 100ms]

- 90% hits, 10% misses: P99 = 100ms
- 95% hits, 5% misses: P99 = 100ms

Page 7:

Can We Use Caching to Reduce the P99?

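Belief: No

But: latency is not a constant

[Figure: cache in front of backend B; a cache hit takes 1ms; a query to B takes 50ms (85%) or 500ms (15%)]

- 90% hits, 10% misses: P99 = 500ms
- 95% hits, 5% misses: P99 = 50ms

Caching can reduce P99 request latency!

Effectiveness in web architecture?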
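
The two cache examples (a backend with a constant 100ms latency, then a backend that answers in 50ms for 85% of queries and in 500ms for 15%) can be checked with a small calculation. This is a sketch under stated assumptions: a cache hit costs 1ms, and whether a query hits is independent of how slow the backend would have been. The function name `p99_with_cache` is hypothetical.

```python
def p99_with_cache(hit_ratio, backend_dist, hit_latency_ms=1):
    """P99 latency of a query stream served by a cache in front of one backend.

    backend_dist: list of (latency_ms, probability) pairs for cache misses.
    """
    miss_ratio = 1 - hit_ratio
    # Overall latency distribution: hits, plus misses split by backend latency.
    outcomes = [(hit_latency_ms, hit_ratio)]
    outcomes += [(lat, miss_ratio * prob) for lat, prob in backend_dist]
    # Walk latencies in increasing order until 99% of the probability is covered.
    covered = 0.0
    for latency, prob in sorted(outcomes):
        covered += prob
        if covered >= 0.99 - 1e-12:
            return latency
    return max(lat for lat, _ in outcomes)


constant = [(100, 1.0)]               # backend always takes 100ms
variable = [(50, 0.85), (500, 0.15)]  # backend latency is not a constant

for hit_ratio in (0.90, 0.95):
    print(hit_ratio, p99_with_cache(hit_ratio, constant), p99_with_cache(hit_ratio, variable))
```

With the constant backend, more than 1% of queries still miss at both hit ratios, so the P99 stays at 100ms. With the variable backend, raising the hit ratio from 90% to 95% shrinks the share of 500ms queries from 1.5% to 0.75%, which pushes them above the 99th percentile and drops the P99 to 50ms.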

Page 8:

RobinHood: Key Idea

Observations for xbox.com (3/2018), during load spike:

[Figure: user request → aggregation server with a cache → backend queries to Recom., Products, Ads, ...]

RobinHood: more cache ⇒ less load ⇒ much lower P99
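
The first implication, more cache ⇒ less load, follows from the fact that only cache misses reach a backend. A rough sketch, with a hypothetical query rate and hypothetical hit ratios (none of these numbers come from the talk):

```python
def backend_load(query_rate_qps, hit_ratio):
    """Queries per second that miss the aggregation cache and reach the backend."""
    return query_rate_qps * (1 - hit_ratio)


# Hypothetical backend that is sent 10,000 queries/second.
for hit_ratio in (0.50, 0.75, 0.90):
    print(hit_ratio, backend_load(10_000, hit_ratio))
```

How far the P99 falls as the load drops depends on the backend and is not modeled here.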

Page 9:

RobinHood: Key Idea

Observations for xbox.com (3/2018), during load spike:

[Figure: user request → aggregation server → backend queries to Recom., Products, Ads, ...; the aggregation cache is split into partitions labeled Products, Recom. and Ads]

Dynamic Cache Partitions

Page 10:

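RobinHood: Key Idea

Observations for xbox.com (3/2018), during load spike:

[Figure, animated over slides 9 to 13: user request → aggregation server → backend queries to Recom., Products, Ads, ...; the aggregation cache is split into partitions labeled Recom., Ads and Products]

Dynamic Cache Partitions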

Page 15:

The RobinHood Caching System

- Scalable in # backends, # aggregation servers
- Dynamically partition the aggregation cache
- First caching system to minimize request P99
- Deployable on off-the-shelf software stack

[Figure label: RobinHood Cache]

Page 16:

How to Repartition the Cache?

Every 5 seconds: RobinHood taxes everyone 1%

How to redistribute the tax?

First idea: give cache to high-latency backends

Recall: not all requests are the same

[Figure: cache, a high-latency backend, and user requests split 99.5% / 0.5%]

Small effect on request P99

Page 17:

How to Repartition the Cache?

Every 5 seconds: RobinHood taxes everyone 1%

How to redistribute the tax?

RobinHood: find the cause of high request P99

[Figure: request latency distribution from P0 to P100, with the P99 marked]

Who “blocked” this request?

Page 18:

How to Repartition the Cache?

Every 5 seconds: RobinHood taxes everyone 1%

How to redistribute the tax?

RobinHood: find the cause of high request P99

[Figure: request latency distribution from P0 to P100, with the P99 marked]

Who “blocked” these requests?

⇒ Track “request blocking count” (RBC) for each backend
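
One way to read the "request blocking count" is sketched below. It assumes that the backend "blocking" a request is the one with the slowest query in that request, and that the requests examined are those at or above the measured request P99; the exact window around the P99 is not given on the slide, so both choices are assumptions, as are the function name and the sample data.

```python
import math
from collections import Counter


def request_blocking_counts(requests, tail_percentile=99):
    """Count, per backend, how often it was the slowest query of a tail request.

    requests: list of dicts mapping backend name -> query latency (ms).
    """
    # Request latency is the max of its query latencies.
    request_latencies = sorted(max(r.values()) for r in requests)
    # Nearest-rank P99 of the request latencies in this measurement window.
    rank = math.ceil(tail_percentile * len(request_latencies) / 100)
    threshold = request_latencies[max(rank, 1) - 1]
    rbc = Counter()
    for r in requests:
        if max(r.values()) >= threshold:
            # The backend with the slowest query "blocked" this request.
            rbc[max(r, key=r.get)] += 1
    return rbc


# Hypothetical window: 98 fast requests and 2 slow ones with different culprits.
window = [{"Products": 10, "Ads": 5}] * 98
window += [{"Products": 300, "Ads": 5}, {"Products": 8, "Ads": 250}]
print(dict(request_blocking_counts(window)))
```

Note that the count looks only at tail requests: a backend that is slow on requests far below the P99 does not accumulate RBC in this sketch.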

Page 19:

RobinHood Architecture

[Figure: aggregation server with a cache and RH-control, connected to the backends]

RobinHood Controller (RH-control)

- ingests RBC
- calculates / enforces cache allocation
- not on request path
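
The controller's "calculate cache allocation" step could look roughly like the sketch below. It combines the 1% tax from the earlier slides with the RBC, and assumes the collected tax is handed out in proportion to each backend's RBC; the proportional rule, the function name `reallocate`, and the example sizes are assumptions for illustration, and enforcing the new sizes in the cache is not shown.

```python
def reallocate(allocation_mb, rbc, tax_rate=0.01):
    """One controller step: tax every backend's cache share, redistribute by RBC."""
    # Collect the tax: every backend gives up tax_rate of its current allocation.
    taxed = {b: size * (1 - tax_rate) for b, size in allocation_mb.items()}
    pool = sum(allocation_mb.values()) - sum(taxed.values())
    total_rbc = sum(rbc.get(b, 0) for b in allocation_mb)
    if total_rbc == 0:
        # No backend blocked any tail request: leave the allocation unchanged.
        return dict(allocation_mb)
    # Assumption: the pool is split in proportion to each backend's RBC.
    return {b: taxed[b] + pool * rbc.get(b, 0) / total_rbc for b in allocation_mb}


# Hypothetical state: equal partitions, Products blocks most tail requests.
allocation = {"Products": 1000.0, "Recom.": 1000.0, "Ads": 1000.0}
rbc = {"Products": 8, "Recom.": 2, "Ads": 0}
print(reallocate(allocation, rbc))
```

Calling such a step every 5 seconds moves cache gradually from backends that rarely block tail requests to those that often do, while the total cache size stays the same.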

Page 20:

RobinHood Architecture

[Figure: several Ag. servers, each with its own cache and RH-control, connected to the backends]

In practice many Ag. servers

Distributed RobinHood: ⇒ one RH-control per Ag. server

- Local decisions
- Local measurements
- Pooled measurements

Challenge: insufficient # tail data points

Page 21:

Experimental Setup

[Figure: request generator → Ag. servers (each with a cache and RH-control) → backends; multiplicity labels 20x and 16x; labels A, B, C, D]

Request generator: replay production trace

- For 4 hours, 200k queries/second
- 32 GB cache size

Backends:

- MySQL (I/O bound)
- Matrix Multiply (CPU bound)
- K-V Store (CPU bound)

⇒ Emulate query latency spikes

Page 22:

Evaluation Results: P99 Request Latency

[Figure: Request P99 Latency [ms] for the following systems]

- RobinHood [our proposal]
- Balance Query Latencies [Hyperbolic, ATC’17]
- Original MS System [OneRF]
- Maximize Overall Hit Ratio [Cliffhanger, NSDI’16]

Page 23:

What Makes RobinHood so Effective?

[Figure: Request P99 Latency [ms] for RobinHood [our proposal] and the Original MS System [OneRF]]

The RobinHood tradeoff:

- Sacrifice performance of some backends → up to 2.5x higher latency
- Reduce latency of bottleneck backends → typically 4x lower latency

⇒ Reduced request latency
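
A small worked example shows why this tradeoff lowers request latency even though some backends get slower. The latencies are invented, and the 2.5x and 4x factors are read here as applying to per-backend query latency, which is an assumption about the slide.

```python
# Hypothetical per-backend query latencies (ms) for one request, original system.
before = {"Products": 400.0, "Recom.": 40.0, "Ads": 30.0}

# Suppose cache moves from Recom. and Ads to the bottleneck backend Products:
# the donors get 2.5x slower (the slide's worst case), the bottleneck 4x faster.
after = {
    "Products": before["Products"] / 4,
    "Recom.": before["Recom."] * 2.5,
    "Ads": before["Ads"] * 2.5,
}

# Request latency is the max of the query latencies.
print(max(before.values()))
print(max(after.values()))
```

Because the request waits for its slowest query, slowing down backends that were far from being the bottleneck costs little, while speeding up the bottleneck lowers the maximum.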

Page 24:

Conclusions

Is it possible to use caches to improve the request P99?

Yes! Huge reduction in P99 spikes and SLO violations. ⇒ Use caches as load balancers: “RBC load metric”.

Feasibility in production systems?

Yes! Built using off-the-shelf software stack. Works orthogonally to existing load balancing and data/quality tradeoffs.

Is this the optimal solution? End of this project?

No! There’s a lot to do. Need to consider the effect of other request structures.

Poster #31