Increasing the Performance of Geo-Distributed Storage Systems Prof. Dejan Kostić, Network Systems Lab, KTH WWW.KTH.SE

Page 1: Increasing the Performance of Geo-Distributed … wasp-sweden.org/.../10/2016-WASP-summerschool-Kostic.pdf

Increasing the Performance of Geo-Distributed Storage Systems

Prof. Dejan Kostić, Network Systems Lab, KTH

WWW.KTH.SE

Page 2

Geo-Distributed Storage Systems

The quality of replica selection is key for good performance

Amazon reports a 1% loss in revenue for a 100 ms latency increase

Page 3

What is long tail latency and why do we care?

99th percentile

1% out of millions?

Composite Requests! (bound by slowest sub-request)

What causes it?

Local/Global Resource Sharing

Queueing

Background Tasks

HW/SW Failures
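
The point about composite requests can be made concrete with a small calculation. The sketch below assumes sub-request latencies are independent (an assumption, not something the slide states) and computes how likely a composite request is to contain at least one sub-request slower than a given percentile. The function name `tail_hit_probability` is hypothetical.

```c
/* Probability that a composite request with `fanout` independent
 * sub-requests contains at least one sub-request slower than the
 * per-request percentile `p` (e.g. p = 0.99 for the 99th percentile).
 * Independence between sub-requests is an assumption of this sketch. */
double tail_hit_probability(double p, int fanout)
{
    double all_fast = 1.0;              /* P(every sub-request is fast) */
    for (int i = 0; i < fanout; i++)
        all_fast *= p;
    return 1.0 - all_fast;
}
```

Under this assumption, `tail_hit_probability(0.99, 100)` is roughly 0.63: a request that fans out to 100 sub-requests is more likely than not to be bound by a 99th-percentile straggler.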

Page 4

Our Approach To Reducing Tail Latencies

Measure and analyze network characteristics

Systematic testing of replica selection algorithms

Improve latency and load estimation

Page 5

Amazon EC2 view from Ireland

UDP application-level ping, after a 1-minute low-pass filter

Choose 2/3

Oregon < Virginia!

Figure legend: Frankfurt, Virginia, Oregon

Page 6

Why is replica selection hard?

Testing: Many possible topologies, bandwidths, latencies, and loss rates result in a huge testing space

Errors (suboptimal choices): Do not lead to critical failures; hard to determine the correct behavior

Debugging: Many potential sources of problems (sampling, smoothing, selection logic)

Page 7

GeoPerf [SoCC ‘15]

GeoPerf: a tool for systematic testing of replica selection algorithms

Uses symbolic execution and lightweight modeling/simulation

Generates a specific set of inputs (latencies) to expose performance anomalies in a replica selection algorithm

Page 8

GeoPerf High-Level

Diagram: Symbolic Execution Engine feeding two discrete event simulations

Discrete Event Simulation-A runs Algorithm-A: the replica selection algorithm of the production system (Cassandra, MongoDB)

Discrete Event Simulation-B runs Algorithm-B: the reference replica selection algorithm (ground truth)

Assert: Request Time @A > Request Time @B
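
The comparison at the heart of this diagram can be sketched as a replay over concrete latencies. This is an illustration only, not GeoPerf's code: both selection policies are hypothetical stand-ins (algorithm A picks the replica that was fastest in the previous step, reference B picks the replica that is fastest now), and `REPLICAS`, `argmin`, and `count_slower_steps` are names invented for the example.

```c
#define REPLICAS 3

/* Index of the smallest value in lat[0..REPLICAS-1]. */
static int argmin(const double lat[REPLICAS])
{
    int best = 0;
    for (int i = 1; i < REPLICAS; i++)
        if (lat[i] < lat[best])
            best = i;
    return best;
}

/* Replays `steps` rows of per-replica latencies.
 * Algorithm A (hypothetical stand-in for the system under test) picks the
 * replica that was fastest in the previous step; reference B picks the
 * replica that is fastest in the current step.
 * Returns the number of steps where Request Time @A > Request Time @B. */
int count_slower_steps(const double trace[][REPLICAS], int steps)
{
    int slower = 0;
    for (int t = 1; t < steps; t++) {
        double time_a = trace[t][argmin(trace[t - 1])];
        double time_b = trace[t][argmin(trace[t])];
        if (time_a > time_b)
            slower++;
    }
    return slower;
}
```

A symbolic execution engine would treat the trace entries as symbolic and search for values that make the comparison true, rather than replaying a fixed trace as this sketch does.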

Page 9

Symbolic Execution Background

Declare Symbolic Variable

Evaluate Satisfiability of Obtained Constraint: X > 5

Append Constraints: (X > 5) && (X + Y < 20)

(X > 5) && (X + Y < 20) && (X <= 10)

Test Case: X = 7

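#include <assert.h>

int symbolic(void);  /* placeholder: provided by the symbolic execution engine */

int main(void) {
    int X = symbolic();      /* declare symbolic variable */
    int Y = 4;
    if (X > 5)               /* path constraint: X > 5 */
        if (X + Y < 20)      /* path constraint: (X > 5) && (X + Y < 20) */
            assert(X > 10);  /* fails when X <= 10, e.g. X = 7 */
    return 0;
}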
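
To see why X = 7 is a valid test case, the path condition (X > 5) && (X + Y < 20) && (X <= 10) can be checked against concrete values. The sketch below uses plain enumeration over a small range, which is not symbolic execution: a constraint solver derives such an input without trying candidates one by one. The function names are hypothetical.

```c
/* Returns 1 when x reaches the assertion (both branch conditions hold)
 * and violates it, i.e. (X > 5) && (X + Y < 20) && (X <= 10) with Y = 4. */
int reaches_and_fails(int x)
{
    const int y = 4;
    return (x > 5) && (x + y < 20) && (x <= 10);
}

/* Scans the small range [lo, hi] and returns the first input that
 * triggers the assertion failure, or lo - 1 if there is none.
 * Intended for small ranges only; it does not guard against overflow. */
int first_failing_input(int lo, int hi)
{
    for (int x = lo; x <= hi; x++)
        if (reaches_and_fails(x))
            return x;
    return lo - 1;
}
```

Every X from 6 to 10 satisfies the path condition, so X = 7 is one of several inputs an engine could report.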

Page 10

Cassandra bug found by GeoPerf

2 close replicas

1 far replica

warmed up with 20 samples

if ((first-next)/first>BADNESS_THRESHOLD)

We reported this bug; it has been fixed and a new test case has been added
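
The expression on this slide is a relative-difference test between two scores. The sketch below only evaluates that expression for concrete numbers; it is not Cassandra's code and does not reproduce the bug. The function name is hypothetical, and the threshold is passed in by the caller because the slide does not say what BADNESS_THRESHOLD is set to.

```c
/* Evaluates (first - next) / first > threshold for two latency scores.
 * Assumes first > 0. `threshold` is caller-supplied (an assumption; the
 * slide does not give the value of BADNESS_THRESHOLD). */
int exceeds_badness(double first, double next, double threshold)
{
    return (first - next) / first > threshold;
}
```

With a threshold of 0.1, scores of 100 and 80 give a relative difference of 0.2 and the check fires, while 100 and 95 give 0.05 and it does not. When `first` is smaller than `next` the left-hand side is negative, so the check cannot fire for any non-negative threshold.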

Page 11

Impact of the Bug found (Trace Replay)

● 14 days of latency traces used as an input

● 504 deployments of datacenter triplets

● Cassandra buggy version vs. fixed version

Median wasted time for 10% of all deployments is above 50 ms

Page 12

Latency Estimation Issues

Oblivious to the underlying networking architecture (treat latency as a common stream of samples)

Use a single metric to represent latency (median, 99th percentile, EWMA, etc.)

No insight into the cause of a change (cannot differentiate between routing and congestion, bad predictors)
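
As an illustration of the single-metric approach criticized here, the following is a minimal EWMA sketch. It is a generic textbook formulation, not taken from any of the systems discussed; the function names and the choice to seed the estimate with the first sample are assumptions.

```c
/* One EWMA update: alpha in (0, 1] is the weight of the newest sample. */
double ewma_update(double estimate, double sample, double alpha)
{
    return alpha * sample + (1.0 - alpha) * estimate;
}

/* Folds n >= 1 latency samples into one estimate, seeded with the
 * first sample (a choice made for this sketch). */
double ewma(const double *samples, int n, double alpha)
{
    double estimate = samples[0];
    for (int i = 1; i < n; i++)
        estimate = ewma_update(estimate, samples[i], alpha);
    return estimate;
}
```

The estimator sees only a stream of numbers: after a jump from 100 ms to 200 ms it moves toward the new value at a rate set by `alpha`, with no way to tell whether the jump came from a routing change or from congestion.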

Page 13

Motivation

Increasing sensitivity of the median will not solve the problem!

Singapore < Virginia

Oscillating behavior

Longer tail

Page 14

EdgeVar: understand network latency!

The number of network paths between a pair of geo-distributed datacenters of Amazon EC2 is finite

Packets traveling a previously observed network path incur the same network delay as previously observed for this network path

We decompose network latency into routing delay and residual latency due to competing traffic

We improve the quality of latency estimation, as well as the stability of network path selection

Page 15

Popular Latency Classes observed between Ireland and Oregon

Each network path occurs only in one class

Page 16

EdgeVar Latency Decomposition

Can easily avoid congested paths

Use EdgeVar-reported Baseline latency and Residual latency to score network paths
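
One way to picture scoring paths from a baseline and a residual is sketched below. The slides do not give EdgeVar's scoring function, so everything here is hypothetical: the struct, the clamping of negative residuals to zero, and the weighted-sum score are assumptions made for illustration.

```c
/* One candidate path: `baseline_ms` is the routing delay of the path's
 * latency class, `sample_ms` the latest end-to-end measurement. */
struct path_sample {
    double baseline_ms;
    double sample_ms;
};

/* Residual latency: what remains after removing the baseline,
 * clamped at zero (the clamp is an assumption of this sketch). */
double residual_ms(struct path_sample p)
{
    double r = p.sample_ms - p.baseline_ms;
    return r > 0.0 ? r : 0.0;
}

/* Hypothetical score: baseline plus weighted residual; lower is better. */
double path_score(struct path_sample p, double residual_weight)
{
    return p.baseline_ms + residual_weight * residual_ms(p);
}

/* Index of the lowest-scoring path among n >= 1 candidates. */
int best_path(const struct path_sample *paths, int n, double residual_weight)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (path_score(paths[i], residual_weight) <
            path_score(paths[best], residual_weight))
            best = i;
    return best;
}
```

With a residual weight of 1, a path with a 70 ms baseline and a 120 ms sample scores worse than a path with a 90 ms baseline and a 92 ms sample, so the congested path is avoided even though its baseline is lower.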

Page 17

Conclusion

1. GeoPerf: replica selection debugging tool

• Generates latency inputs that can demonstrate weaknesses in replica selection algorithms

• Found a bug in each of the two popular systems

2. EdgeVar:

• Provides clear understanding of e2e latency:

– level shifts due to routing vs. congestion

Thanks: Kirill Bogdanov (KTH), Miguel Peon-Quiros (UCM), Gerald Q. Maguire Jr. (KTH)

Plan to release source code & latency traces:

https://www.kth.se/blogs/prophet