Fast and Accurate Load Balancing for Geo-Distributed Storage Systems Kirill L. Bogdanov 1 Waleed Reda 1,2 Gerald Q. Maguire Jr. 1 Dejan Kostic 1 Marco Canini 3 1 KTH Royal Institute of Technology 2 Université Catholique de Louvain 3 KAUST

Page 1:

Fast and Accurate Load Balancing for Geo-Distributed Storage Systems

Kirill L. Bogdanov (1), Waleed Reda (1,2), Gerald Q. Maguire Jr. (1), Dejan Kostic (1), Marco Canini (3)

(1) KTH Royal Institute of Technology, (2) Université Catholique de Louvain, (3) KAUST

Page 2:

Geo-Distributed Services

[Figure labels: Datacenter, Clients]

Service Level Objective (SLO): request completion time at the target percentile (e.g., 30 ms at the 95th percentile)
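As a minimal sketch of what a percentile SLO means in practice, the snippet below checks a batch of measured request completion times against a target such as 30 ms at the 95th percentile. The function names and the nearest-rank percentile definition are assumptions made for illustration; the slides do not say how the percentile is computed.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def meets_slo(latencies_ms, target_ms=30.0, p=95):
    """True if the p-th percentile completion time is within the target."""
    return percentile(latencies_ms, p) <= target_ms

def violation_fraction(latencies_ms, target_ms=30.0):
    """Fraction of requests whose completion time exceeds the target."""
    return sum(1 for x in latencies_ms if x > target_ms) / len(latencies_ms)
```

Under this definition, a 95th-percentile SLO is met when at most 5% of the requests in the batch exceed the target.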

Page 3:

Geo-Distributed Services

[Figure labels: Datacenter, Clients]

Web-based services demonstrate temporal and spatial variability in load

Problem: it is difficult to meet strict SLOs while maintaining high resource utilization and low cost

Pages 4-6:

Approach 1 - Datacenter Elasticity

[Figure repeated on each page; y-axis: Arrival rate [1000x req/s]]

Page 7:

Approach 1 - Datacenter Elasticity

Leads to overprovisioning!

Provisioning delay (minutes) due to the time needed to spawn and warm up a VM

Hard to predict workload far into the future

Load spikes can be short-lived

[Figure annotations: Provisioning delays, SLO violations, Unused capacity; y-axis: Arrival rate [1000x req/s]]
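The effect of provisioning delay can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the controller used in the talk: time is discrete, each VM serves a fixed number of requests per step, a VM requested at step t only starts serving at step t + provisioning_delay, and the controller never scales in. All names are hypothetical.

```python
def simulate_reactive_scaling(load, per_vm_capacity, provisioning_delay, initial_vms=1):
    """Return (overloaded_steps, unused_capacity) for a reactive scale-out controller."""
    active = initial_vms
    pending = []  # steps at which already-requested VMs become usable
    overloaded_steps = 0
    unused_capacity = 0
    for t, demand in enumerate(load):
        # VMs whose spawn and warm-up has finished join the serving pool.
        active += sum(1 for ready in pending if ready <= t)
        pending = [ready for ready in pending if ready > t]
        capacity = active * per_vm_capacity
        if demand > capacity:
            overloaded_steps += 1
            # Request only the VMs still missing, counting those already on the way.
            needed = -(-demand // per_vm_capacity) - active - len(pending)
            pending.extend([t + provisioning_delay] * max(0, needed))
        else:
            unused_capacity += capacity - demand
    return overloaded_steps, unused_capacity

# A short spike: the extra VM arrives only after the spike has passed.
print(simulate_reactive_scaling([5, 5, 15, 15, 5, 5], per_vm_capacity=10, provisioning_delay=2))
```

With a two-step delay the spike is served entirely by the undersized pool, and the late VM then sits idle, which mirrors the "SLO violations" and "Unused capacity" regions annotated in the figure.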

Page 8:

Approach 2 - Geo-Distributed Load Balancing

Excessive or insufficient redirection

Redirection delays

Inaccurate response time estimation

How much to redirect?

[Figure annotations: Redirection delay, SLO violations, Excessive redirection; y-axis on both panels: Arrival rate [1000x req/s]]

Page 9:

Our Approach: Kurma

Tames SLO violations at the target level

Reacts to changes in load within seconds

Accurately estimates the remote rate of SLO violations

Avoids unnecessary scaling out

[Figure; y-axis on both panels: Arrival rate [1000x req/s]]

Page 10:

Request Completion Time

[Figure labels: Wide Area Network; Server 1, Server 2; Datacenter Ireland, Datacenter Frankfurt]

Base Propagation: stable component associated with packet propagation along a network path

Delay Variance: variable component associated with competing traffic and queuing

Service Time: variable component associated with load on the server

Kurma solves a global optimization model while considering Base Propagation + Delay Variance + Service Time at all datacenters
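One way to write this decomposition down, using symbols that do not appear in the slides (T for the end-to-end completion time of a request, τ for the SLO latency target, p for the target percentile expressed as a probability):

```latex
\[
  T = T_{\text{base}} + T_{\text{var}} + T_{\text{service}},
  \qquad
  \text{SLO met} \iff \Pr[\, T \le \tau \,] \ge p
\]
```

For the example SLO quoted earlier in the deck, τ = 30 ms and p = 0.95. Only the first term is stable; the other two are random variables, so the condition has to be evaluated over distributions rather than single values.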

Pages 11-13:

Understanding Service Time

[Figure labels: 5-server Cassandra cluster, Datacenter Frankfurt]

Challenge: how to accurately estimate the remote fraction of SLO violations at runtime under variable network conditions?

Page 14:

Understanding Service Time

[Figure labels: Wide Area Network; 5-server Cassandra cluster, Datacenter Frankfurt; 5-server Cassandra cluster, Datacenter Ireland; 7000]

Insight: the farther away a remote datacenter is, the less loaded it should be to serve remote requests within a given SLO target
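The insight can be made concrete with a small sketch. It assumes a constant WAN round-trip time (so delay variance is ignored) and a lookup table mapping load to the 95th-percentile service time; both the function name and the table values are hypothetical and are not measurements from the talk.

```python
def max_load_within_slo(p95_service_ms_by_load, wan_rtt_ms, slo_ms=30.0):
    """Highest profiled load whose p95 service time still fits the budget left after the WAN delay."""
    budget_ms = slo_ms - wan_rtt_ms
    feasible = [load for load, p95 in p95_service_ms_by_load.items() if p95 <= budget_ms]
    return max(feasible, default=0)

# Hypothetical profile: load in req/s -> p95 service time in ms.
profile = {1000: 5.0, 3000: 8.0, 5000: 14.0, 7000: 25.0}

for rtt in (0.0, 20.0, 28.0):
    print(rtt, max_load_within_slo(profile, rtt))
```

As the round-trip time grows, the latency budget left for service time shrinks, so the remote datacenter can only absorb redirected requests while running at a lower load.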

Pages 15-17:

Understanding WAN Latency

Base propagation delay

Service time distribution recorded locally at a specific load

Monte Carlo Simulations

Gives the SLO violation rate for a specific load and WAN conditions
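A minimal Monte Carlo sketch of this estimation step follows. It assumes the WAN conditions are available as a fixed base propagation delay plus a list of observed delay-variance samples, and that service times recorded at one load level are given as a list; the function name, parameters, and the independent sampling of the two components are assumptions, not details taken from the talk.

```python
import random

def estimate_violation_rate(service_samples_ms, base_propagation_ms, wan_jitter_samples_ms,
                            slo_ms=30.0, trials=10_000, seed=0):
    """Estimate the fraction of remote requests that would exceed the SLO target."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        # Draw one WAN delay-variance sample and one service-time sample independently.
        total_ms = (base_propagation_ms
                    + rng.choice(wan_jitter_samples_ms)
                    + rng.choice(service_samples_ms))
        if total_ms > slo_ms:
            violations += 1
    return violations / trials
```

Repeating the estimate for service time distributions recorded at different load levels would yield a violation rate per load and WAN condition, which is the quantity the slide says the simulations provide.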

Page 18:

Understanding WAN Latency

Base propagation delay

Estimation Error

Pages 19-22:

Incorporating WAN and Load

[Figure not recoverable]

Page 23:

Optimisation Model

Runtime load in each datacenter {λ1, λ2, λ3}

+

Optimisation Problem

✓ Minimize global SLO violations (KurmaPerf)

✓ Minimize the cost of running a service (KurmaCost)
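To give a feel for the first objective (minimizing global SLO violations), here is a deliberately simplified sketch: two datacenters instead of three, a grid search instead of a solver, and toy violation-rate functions standing in for the model built in the preceding slides. None of the names or numbers come from the talk.

```python
def plan_redirection(load_a, load_b, local_violation, remote_violation, steps=100):
    """Grid-search the fraction of A's load to redirect to B that minimizes expected violations."""
    best_cost, best_fraction = float("inf"), 0.0
    for i in range(steps + 1):
        fraction = i / steps
        moved = fraction * load_a
        served_a = load_a - moved
        served_b = load_b + moved
        # Expected number of violating requests per second across both datacenters.
        cost = (served_a * local_violation(served_a)
                + load_b * local_violation(served_b)
                + moved * remote_violation(served_b))
        if cost < best_cost:
            best_cost, best_fraction = cost, fraction
    return best_fraction, best_cost

def toy_local_violation(load, knee=6000.0):
    """Hypothetical model: no violations up to the knee, then growing linearly with load."""
    if load <= knee:
        return 0.0
    return min(1.0, (load - knee) / knee)

def toy_remote_violation(load):
    """Hypothetical model: remote requests pay an extra 2% violation rate for the WAN."""
    return min(1.0, 0.02 + toy_local_violation(load))

fraction, expected_violations = plan_redirection(
    10_000, 1_000, toy_local_violation, toy_remote_violation, steps=10)
print(fraction, expected_violations)
```

A cost-oriented objective would use the same search with a different cost function, for example one that charges for transferred traffic and for additional VMs.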

Pages 24-25:

Implementation

[Figure labels: Datacenter London, Datacenter Frankfurt, Datacenter Stockholm]

Each epoch (2.5 s, i.e. 0.4 Hz):

Perform run-time WAN latency measurements

Aggregate load information (rates of requests)

Exchange metrics to obtain a global view (latencies + loads)

Solve the decentralized performance model

Enforce the computed rates of request redirection
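The last step, enforcing the computed redirection rates, could be realized by weighted random routing of individual requests. The sketch below is one such scheme, assumed for illustration; the slides do not describe the actual enforcement mechanism, and the function name and weight format are hypothetical.

```python
import random

def pick_datacenter(weights, rng):
    """Choose the datacenter for one request according to the redirection weights of this epoch."""
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=1)[0]

# Example: during this epoch, keep 70% of requests in London and send 30% to Frankfurt.
rng = random.Random(1)
epoch_weights = {"London": 0.7, "Frankfurt": 0.3}
routed = [pick_datacenter(epoch_weights, rng) for _ in range(10)]
print(routed)
```

The weights would be recomputed every epoch from the refreshed global view, so the routing follows load changes at the granularity of a few seconds.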

Page 26:

Evaluation Setup

Geo-distributed Cassandra cluster

• 3 Amazon EC2 datacenters (Ireland, Frankfurt, London)

• 5 x r5.large VMs per datacenter

• SLO: 30 ms at the 95th percentile

• Modified YCSB to replay workload traces (World Cup http://ita.ee.lbl.gov/html/contrib/WorldCup.html)

Experiments:

• Minimizing SLO violations for reads

• Maintaining target SLO (accuracy)

• Cost savings for 1 min billing intervals (simulations)

• Reads and writes, scalability, etc.

Page 27:

Workload Trace

[Figure annotations: Load threshold for 5% SLO violations; No elastic scaling]

Pages 28-29:

Cumulative Normalized SLO Violations

The numbers shown above the bars indicate the amount of inter-datacenter traffic transferred; whiskers indicate the 75th percentile

Kurma's SLO violations are at 2.4%

Pages 30-31:

Average Provisioning Cost Over 30 Consecutive Days

- Reactive threshold-based elastic controller

- Minimum billing period of 1 minute

- Results obtained using simulations

All Shared: WAN latency = 0 ms, bandwidth cost = $0

KurmaCost: keeps SLO violations under 5% (minimizes redirections while avoiding scaling out)

KurmaPerf: minimizes SLO violations (no consideration for traffic usage)

All local

[Figure; y-axis: Total Cost [US$] Per Day]

Page 32:

Taming SLO Violations Under Elastic Threshold

[Figure annotation: No elastic scaling]

Page 33:

Conclusion

Kurma: a fast and accurate load balancer for geo-distributed systems that takes advantage of spatial variability in load

Decouples end-to-end response time into base propagation latency, network congestion, and service time distribution

By operating at the granularity of a few seconds, Kurma reduces SLO violations or lowers the cost of running services by avoiding excessive global service overprovisioning

Contact: [email protected]