Seer: Leveraging Big Data to Navigate The Increasing Complexity of Cloud Debugging

Yu Gan, Meghna Pancholi, Dailun Cheng, Siyuan Hu, Yuan He and Christina Delimitrou

Cornell University

HotCloud, July 9th 2018

Executive Summary

- Microservices put more pressure on performance predictability
  - Microservice dependencies → propagate & amplify QoS violations
  - Finding the culprit of a QoS violation is difficult
  - Post-QoS violation, returning to nominal operation is hard
- Anticipating QoS violations & identifying culprits
- Seer: Data-driven Performance Debugging for Microservices
  - Combines lightweight RPC-level distributed tracing with hardware monitoring
  - Leverages scalable deep learning to signal QoS violations with enough slack to apply corrective action

From Monoliths to Microservices

Motivation

- Advantages of microservices:
  - Ease & speed of code development & deployment
  - Security, error isolation
  - PL/framework heterogeneity
- Challenges of microservices:
  - Change server design assumptions
  - Complicate resource management → dependencies
  - Amplify tail-at-scale effects
  - More sensitive to performance unpredictability
  - No representative end-to-end apps with microservices

An End-to-End Suite for Cloud & IoT Microservices

- 4 end-to-end applications using popular open-source microservices → ~30-40 microservices per app
  - Social Network
  - Movie Reviewing/Renting/Streaming
  - E-commerce
  - Drone control service
- Programming languages and frameworks:
  - node.js, Python, C/C++, Java/Javascript, Scala, PHP, and Go
  - Nginx, memcached, MongoDB, CockroachDB, Mahout, Xapian
  - Apache Thrift RPC, RESTful APIs
  - Docker containers
  - Lightweight RPC-level distributed tracing

Resource Management Implications

- Challenges of microservices:
  - Dependencies complicate resource management
  - Dependencies change over time → difficult for users to express
  - Amplify tail@scale effects

[Figure panels: Netflix, Twitter, Amazon, Movie Streaming]

The Need for Proactive Performance Debugging

- Detecting QoS violations after they occur:
  - Unpredictable performance propagates through system
  - Long time until return to nominal operation
  - Does not scale

Performance Implications

[Figure legend: CPU, Mem, Net, Disk, Queue]

Seer: Data-Driven Performance Debugging

- Leverage the massive amount of traces collected over time
  1. Apply online, practical data mining techniques that identify the culprit of an upcoming QoS violation
  2. Use per-server hardware monitoring to determine the cause of the QoS violation
  3. Take corrective action to prevent the QoS violation from occurring
- Need to predict 100s of msec to a few sec into the future

Tracing Framework

- RPC level tracing
- Based on Apache Thrift
- Timestamp start-end for each microservice
- Store in centralized DB (Cassandra)
- Record all requests → No sampling
- Overhead: <0.1% in throughput and <0.2% in tail latency

[Figure: tracing framework architecture. Components: Client, WebUI (http), QueryEngine, Cassandra, TracingCollector, and zTracer modules in each microservice (uService K, uService K+1, [...]). Outputs: latency, Gantt charts. Timestamped segments: RPC time TX, RPC time RX, TCP proc TX, TCP proc RX, App proc.]
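
As a rough illustration of the start-end timestamping described on this slide, the Python sketch below records one span per microservice per request and stores every span without sampling. The `Span`, `TraceStore`, and `traced` names are hypothetical, the service names are placeholders, and an in-memory list stands in for the centralized Cassandra database. It is not the actual tracing implementation.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    """One start-end record for a single microservice handling one request."""
    request_id: str
    service: str
    start_ns: int
    end_ns: int

    @property
    def latency_ns(self) -> int:
        return self.end_ns - self.start_ns


@dataclass
class TraceStore:
    """Hypothetical in-memory stand-in for the centralized trace database."""
    spans: list = field(default_factory=list)

    def record(self, span: Span) -> None:
        # Every request is recorded; there is no sampling step.
        self.spans.append(span)

    def by_request(self, request_id: str) -> list:
        # Spans of one request ordered by start time, as a Gantt chart would lay them out.
        return sorted(
            (s for s in self.spans if s.request_id == request_id),
            key=lambda s: s.start_ns,
        )


@contextmanager
def traced(store: TraceStore, request_id: str, service: str):
    """Timestamp the start and end of the wrapped block and store the span."""
    start = time.monotonic_ns()
    try:
        yield
    finally:
        store.record(Span(request_id, service, start, time.monotonic_ns()))


store = TraceStore()
with traced(store, "req-1", "frontend"):
    with traced(store, "req-1", "user-service"):
        time.sleep(0.001)  # placeholder for real work

for span in store.by_request("req-1"):
    print(span.service, span.latency_ns)
```

Nesting the `traced` blocks mirrors a caller waiting on a downstream RPC, so the outer span's latency always covers the inner one.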

Deep Learning to the Rescue

- Why?
  - Architecture-agnostic
  - Adjusts to changes in dependencies over time
  - High accuracy, good scalability
  - Inference within the required window

DNN Configuration

Input signal
- Container utilization
- Latency
- Queue depth

Output signal
- Which microservice will cause a QoS violation in the near future?
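
To make the input and output signals concrete, here is a minimal Python sketch of the shape of the problem: per-microservice utilization, latency, and queue depth go in, and one probability per microservice comes out. The slides do not specify the network architecture, so the single hidden layer, its size, the random untrained weights, and the measurement values are all assumptions for illustration only.

```python
import math
import random

SERVICES = ["nginx", "memcached", "mongodb"]  # example services from the suite
FEATURES = ("utilization", "latency_ms", "queue_depth")


def encode(snapshot):
    """Flatten per-service measurements into one input vector."""
    return [float(snapshot[s][f]) for s in SERVICES for f in FEATURES]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(vec, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, biases)]


def predict(snapshot, params):
    """Return one probability per microservice of being the upcoming culprit."""
    hidden = [max(0.0, h) for h in dense(encode(snapshot), params["w1"], params["b1"])]  # ReLU
    return softmax(dense(hidden, params["w2"], params["b2"]))


def random_params(n_in, n_hidden, n_out, seed=0):
    # Untrained placeholder weights; a real deployment would load trained ones.
    rng = random.Random(seed)

    def layer(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {"w1": layer(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "w2": layer(n_out, n_hidden), "b2": [0.0] * n_out}


# Made-up measurements, not data from the talk.
snapshot = {
    "nginx": {"utilization": 0.45, "latency_ms": 1.2, "queue_depth": 3},
    "memcached": {"utilization": 0.80, "latency_ms": 0.4, "queue_depth": 12},
    "mongodb": {"utilization": 0.60, "latency_ms": 5.1, "queue_depth": 7},
}
params = random_params(len(SERVICES) * len(FEATURES), 8, len(SERVICES))
probs = predict(snapshot, params)
culprit = SERVICES[probs.index(max(probs))]
print(dict(zip(SERVICES, probs)), culprit)
```

Because the weights are random, the printed culprit carries no meaning; the point is only the mapping from a trace snapshot to a per-microservice score.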

DNN Configuration

- Training once: slow (hours - days)
  - Across load levels, load distributions, request types
  - Distributed queue traces, annotated with QoS violations
  - Weight/bias inference with SGD
  - Retraining in the background
- Inference continuously: streaming trace data

93% accuracy in signaling upcoming QoS violations

91% accuracy in attributing QoS violation to correct microservice
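
The "weight/bias inference with SGD" step can be sketched on a toy problem. The Python example below swaps the deep network for a single-layer softmax classifier and trains it on synthetic queue-depth traces in which the service with the deepest queue is labeled the culprit. The data generator, the learning rate, and the model size are assumptions made for illustration; the accuracy it prints is unrelated to the 93% and 91% figures above.

```python
import math
import random


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, biases, x):
    return softmax([sum(w * xi for w, xi in zip(row, x)) + b
                    for row, b in zip(weights, biases)])


def sgd_step(weights, biases, x, label, lr):
    """One SGD update on a single annotated trace; returns the pre-update loss."""
    probs = forward(weights, biases, x)
    for k, row in enumerate(weights):
        grad = probs[k] - (1.0 if k == label else 0.0)  # d(cross-entropy)/d(score_k)
        for i, xi in enumerate(x):
            row[i] -= lr * grad * xi
        biases[k] -= lr * grad
    return -math.log(max(probs[label], 1e-12))


def synthetic_traces(n, n_services, rng):
    """Made-up data: the service with the deepest queue is labeled the culprit."""
    data = []
    for _ in range(n):
        queues = [rng.random() for _ in range(n_services)]
        data.append((queues, queues.index(max(queues))))
    return data


rng = random.Random(0)
n_services = 4
train = synthetic_traces(2000, n_services, rng)
weights = [[0.0] * n_services for _ in range(n_services)]
biases = [0.0] * n_services

epoch_losses = []
for epoch in range(5):
    rng.shuffle(train)
    losses = [sgd_step(weights, biases, x, y, lr=0.1) for x, y in train]
    epoch_losses.append(sum(losses) / len(losses))

held_out = synthetic_traces(500, n_services, rng)
correct = 0
for x, y in held_out:
    probs = forward(weights, biases, x)
    correct += probs.index(max(probs)) == y
accuracy = correct / len(held_out)
print(epoch_losses, accuracy)
```

Background retraining, as listed on the slide, would amount to rerunning this loop on newly collected traces while the previous weights keep serving inference.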

DNN Configuration

- Challenges:
  - In large clusters, inference is too slow to prevent QoS violations
  - Offload on TPUs, 10-100x improvement; 10ms for 90th %ile inference
  - Fast enough for most corrective actions to take effect (net bw partitioning, RAPL, cache partitioning, scale-up/out, etc.)

Accuracy stable or increasing with cluster size
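
The timing argument here is simple arithmetic: inference time plus the time a corrective action needs to take effect must fit inside the prediction look-ahead. The Python sketch below checks that budget. The 10 ms inference figure and the 100 ms look-ahead come from the slides; the per-action latencies and the function names are hypothetical placeholders, not measurements.

```python
def remaining_slack_ms(horizon_ms: float, inference_ms: float, action_ms: float) -> float:
    """Time left before the predicted violation once inference and the action complete."""
    return horizon_ms - inference_ms - action_ms


def actions_in_time(horizon_ms, inference_ms, action_latency_ms):
    """Names of corrective actions that can take effect before the violation."""
    return sorted(
        name for name, cost in action_latency_ms.items()
        if remaining_slack_ms(horizon_ms, inference_ms, cost) >= 0
    )


# Hypothetical per-action latencies in milliseconds, chosen only for illustration.
ACTION_LATENCY_MS = {
    "cache_partitioning": 5.0,
    "net_bw_partitioning": 20.0,
    "scale_out": 2000.0,
}

# 10 ms inference (90th percentile on the slide) against a 100 ms look-ahead.
feasible = actions_in_time(100.0, 10.0, ACTION_LATENCY_MS)
print(feasible)
```

With a longer look-ahead of a few seconds, slower actions such as scaling out would also fit the budget under these assumed costs.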

Experimental Setup

- 40 dedicated servers
- ~1000 single-concerned containers
- Machine utilization 80-85%
- Inject interference to cause QoS violation
  - Using microbenchmarks (CPU, cache, memory, network, disk I/O)

Restoring QoS

- Identify cause of QoS violation
  - Private cluster: performance counters & utilization monitors
  - Public cluster: contentious microbenchmarks
- Adjust resource allocation
  - RAPL (fine-grain DVFS) & scale-up for CPU contention
  - Cache partitioning (CAT) for cache contention
  - Memory capacity partitioning for memory contention
  - Network bandwidth partitioning (HTB) for net contention
  - Storage bandwidth partitioning for I/O contention
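
The mapping from diagnosed cause to resource adjustment on this slide is essentially a lookup table. The Python sketch below encodes it as one; the `Contention` enum, the `plan_recovery` helper, and the example service name are hypothetical, and the actions are plain descriptive strings rather than calls into any real control interface.

```python
from enum import Enum


class Contention(Enum):
    CPU = "cpu"
    CACHE = "cache"
    MEMORY = "memory"
    NETWORK = "network"
    IO = "io"


# Contended resource -> corrective actions, following the slide's list.
CORRECTIVE_ACTIONS = {
    Contention.CPU: ["RAPL (fine-grain DVFS)", "scale-up"],
    Contention.CACHE: ["cache partitioning (CAT)"],
    Contention.MEMORY: ["memory capacity partitioning"],
    Contention.NETWORK: ["network bandwidth partitioning (HTB)"],
    Contention.IO: ["storage bandwidth partitioning"],
}


def plan_recovery(culprit: str, cause: Contention) -> list:
    """Return (microservice, action) pairs to apply for the diagnosed cause."""
    return [(culprit, action) for action in CORRECTIVE_ACTIONS[cause]]


print(plan_recovery("memcached", Contention.CACHE))
```

Keeping the table separate from the planner means new contention types or actions can be added without touching the decision logic.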

Restoring QoS

- Post-detection, baseline system → dropped requests
- Post-detection, Seer → maintain nominal performance

Demo

[Figure legend: CPU, Mem, Net, Disk, Queue]

Challenges Ahead

- Security implications of data-driven approaches
- Fall-back mechanisms when ML goes wrong
- Not a single-layer solution → Predictability needs vertical approaches

Thank you!

Serverless microservices IoT swarms