Bolt: I Know What You Did Last - Cornell Universitydelimitrou/slides/2017...Christina Delimitrou1...

Christina Delimitrou1 and Christos Kozyrakis2

1Cornell University, 2Stanford University

ASPLOS – April 12th 2017

Bolt: I Know What You Did Last

Summer… In the Cloud

Problem: cloud resource sharing hides security vulnerabilities

Interference from co-scheduled apps leaks app characteristics

Enables severe performance attacks

Bolt: adversarial runtime in public clouds

Transparent app detection (5-10sec)

Leverages practical machine learning techniques

DoS 140x increase in latency

User study: 88% correctly identified applications

Resource partitioning is helpful but insufficient

Executive Summary

Motivation

App1 App2

Motivation

App1 App2

containers

Motivation

App1 App2

containers

memory

capacity

Motivation

App1 App2

containers

memory

capacity

storage

capacity/bw

Motivation

App1 App2

containers

memory

capacity

storage

capacity/bwnetwork bw

Motivation

App1 App2

containers

memory

capacity

storage

LL cache

Motivation

App1 App2

containers

memory

capacity

storage

LL cache

Motivation

App1 App2

containers

memory

capacity

storage

LL cache

Not all isolation techniques available

Not all used/configured correctly

Not all scale well

Mem bw/core resources not isolated

Key idea: Leverage lack of isolation in public clouds to

infer application characteristics

Programming framework, algorithm, load characteristics

Exploit: enable practical, effective, and hard-to-detect

performance attacks

DoS, RFA, VM pinpointing

Use app characteristics (sensitive resource) against it

Avoid CPU saturation hard to detect

Threat Model

Impartial, neutral cloud provider

Active adversary but no control over VM placement

Adversary VictimCloud

provider

Adversary Victim

Contention

injection

inference3

Interference

Impact

measurement

Adversary Victim

Contention

injection

Interference

Impact

measurement

inference3

Custom

contention

kernel

Performance

attack5

1. Contention Measurement

Adversary Victim

Contention injection1

Interference

impact

measurement

Set of contentious kernels (iBench)

Compute

L1/L2/L3

Memory bw

Storage bw

Network bw

(Memory/Storage capacity)

Sample 2-3 kernels, run in

adversarial VM

Measure impact on performance of

kernels vs. isolation

2. Practical App Inference

Adversary Victim

Infer resource pressure in non-

profiled resources

Sparse dense information

SGD (Collaborative filtering)

Classify unknown victim based

on previously-seen

applications

Label & determine resource

sensitivity

Content-based recommendation

Practical app inference

Hybrid recommender

Big Data to the Rescue

1. Infer pressure in non-profiled resources

Reconstruct sparse information

Stochastic Gradient Descent (SGD), O(mpk)

Contention

injectionuBench

uBench

AppApp

SVD+SGD

AppAppInterference

profile

r1 r2 r3 … rN

a11 0 0 … a1N

0 a22 0 … 0

… … … … …

aM1 0 aM3 … 0

r1 r2 r3 … rN

a11 a12 a13 … a1N

a21 a22 a23 … a2N

… … … … …

aM1 aM2 aM3 … aMN

Big Data to the Rescue

2. Classify and label victims

Weighted Pearson Correlation Coefficients

Output: distribution of similarity scores to app classes

AppApp

Pearson Corr Coeff

AppApp

App label &

characteristics

r1 r2 r3 … rN

a11 a12 a13 … a1N

a21 a22 a23 … a2N

… … … … …

aM1 aM2 aM3 … aMN

Hadoop SVM: 65%

Spark ALS: 21%

memcached: 11%

Inference Accuracy

40 machine cluster (420 cores)

Training apps: 120 jobs (analytics, databases, webservers, in-

memory caching, scientific, js) high coverage of resource space

Testing apps: 108 latency-critical webapps, analytics

No overlap in algorithms/datasets between training and testing sets

Application class Detection accuracy (%)

In-memory caching (memcached) 80%

Persistent databases (Cassandra, MongoDB) 89%

Hadoop jobs 92%

Spark jobs 86%

Webservers 91%

Aggregate 89%

3. Practical Performance Attacks

1. Determine the resource

bottleneck of the victim

2. Create custom contentious

kernel that targets critical

resource(s)

3. Inject kernel in Bolt

Several performance attacks

(DoS, RFAs, VM pinpointing)

Target specific, critical resource

low CPU pressure

Adversary Victim

Custom kernel

injection4

3. Practical DoS Attacks

Launched against same 108 applications as before

On average 2.2x higher execution time and up to 9.8x

For interactive services, on average 42x increase in tail latency

and up to 140x

Bolt does not saturate CPU

Naïve attacker gets migrated

User Study

20 independent users from Stanford and Cornell

Cluster

200 EC2 servers, c3.8xlarge (32vCPUs, 60GB memory)

Rules:

4vCPUs per machine for Bolt

All users have equal priority

Users use thread pinning

Users can select specific instances

Training set: 120 apps incl. analytics, webapps, scientific, etc.

Accuracy of App Labeling

53 app classes

(analytics, webapps,

FS/OS, HLS/sim,

other…)

Accuracy of App Characterization

Performance

attack results

in the paper

The Value of Isolation

Need more scalable, fine-grain, and complete isolation

techniques

Bolt: highlight the security vulnerabilities from lack of isolation

Fast detection using online data mining techniques

Practical, hard-to-detect performance attacks

Current isolation helpful but insufficient

In the paper:

Sensitivity to Bolt parameters

Sensitivity to applications and platform parameters

User study details

More performance attacks (resource freeing, VM pinpointing)

Conclusions

Bolt: highlight the security vulnerabilities from lack of isolation

Fast detection using online data mining techniques

Practical, hard-to-detect performance attacks

Current isolation helpful but insufficient

In the paper:

Sensitivity to Bolt parameters

Sensitivity to applications and platform parameters

User study details

More performance attacks (resource freeing, VM pinpointing)

Questions?

Evolving Applications

Cloud applications change behavior

Users use the same cloud resources for several apps over time

Bolt periodically wakes up, checks if app profile has changed; if

so, reprofile & reclassify

Inference Within a Framework

Within a framework, dataset and choice of algorithm affect resource requirements

Bolt matches a new unknown application to apps in a framework by distinguishing their resource needs

Bolt: I Know What You Did Last - Cornell Universitydelimitrou/slides/2017...Christina Delimitrou1...

Documents

Transcript of Bolt: I Know What You Did Last - Cornell Universitydelimitrou/slides/2017...Christina Delimitrou1...

Dilip N Simha, Maohua Lu, Tzi-cher chiueh Park Chanhyun ASPLOS’12 March 3-7, 2012 London, England, UK.

ConSeq: Detecting Concurrency Bugs through Sequential Errorspages.cs.wisc.edu/~aliang/asplos-2011-conseq.pdf · fore software is released are sorely desired. Reliability research

Synopsis of the ASPLOS ’16 Wild and Crazy Ideas (WACI ...

ASPLOS 2015 Debate Onur Mutlu onur@cmu.edu omutlu/ March 17, 2014 Istanbul.

Structure and messaging techniques for online peer ... · 1University of California, San Diego, 2Stanford University {ykotturi, srk}@ucsd.edu, {chinmay, msb}@cs.stanford.edu ABSTRACT

Machine Learning for Creativity and Design - Towards a Principled … a... · 2019-12-16 · Machine-Generated Art Lia Coleman1, Panos Achlioptas2, Mohamed Elhoseiny12 1 KAUST 2Stanford

Deposited low temperature silicon GHz modulator low temperature silicon GHz modulator Yoon Ho Daniel Lee,1 Michael O. Thompson,2 and Michal Lipson1,3,* 1Cornell Nanophotonics Group,

Generative Adversarial PerturbationsGenerative Adversarial Perturbations Omid Poursaeed 1;2 Isay Katsman Bicheng Gao3 Serge Belongie1;2 1Cornell University 2Cornell Tech 3Shanghai

Positional Normalization - NeurIPS...Positional Normalization Boyi Li 1;2, Felix Wu , Kilian Q. Weinberger , Serge Belongie1;2 1Cornell University 2Cornell Tech {bl728, fw245, kilian,

AlphaServer GS320 Architecture Design Gharachorloo, Sharma, Steely, and Van Doren Compaq Research High-Performance Servers Published in 2000 (ASPLOS-IX)

NC STATE UNIVERSITY ASPLOS-XII Understanding Prediction-Based Partial Redundant Threading for Low-Overhead, High-Coverage Fault Tolerance Vimal Reddy Sailashri.

Emerging NVM Memory Technologies Yuan Xieweb.engr.oregonstate.edu/~sllu/xie.pdfEmerging NVM Memory Technologies Yuan Xie ... (Magnetic RAM) Memristor (Resistive RAM) ... ASPLOS-panel-Xie.ppt

Pin ASPLOS Tutorial Kim Hazelwood Vijay Janapa Reddi.

Architecture Support for Disciplined Approximate Programmingasampson/media/truffle-asplos-slides.pdf · Architecture Support for Disciplined Approximate Programming University of

PARTIES: QoS-Aware Resource Partitioning for Multiple Interactive Servicesdelimitrou/papers/2019.asplos... · 2019-01-28 · PARTIES ASPLOS ’19, April 13–17, 2019, Providence,

SoftSig: Software-Exposed Hardware Signatures for Code Analysis and Optimizations UIUC – ASPLOS 2008 by Evangelos Vlachos.

Unikernels: Library Operating Systems for the Cloudanil.recoil.org/papers/2013-asplos-mirage.pdf · Unikernels: Library Operating Systems for the Cloud ... whole-system optimisation

Accurate & Efficient Regression Modeling for ...people.duke.edu/~bcl15/documents/lee2006-asplos-slides.pdf · Benjamin C. Lee, David M. Brooks 18 :: ASPLOS-XII. Motivation & Background

Mark A. Constas · 2 Mark A. Constas1, Marco d’Errico2 3and Rebecca Pietrelli 1Cornell University 2,3 Food and Agriculture Organization of the United Nations Authors and Correspondence:

Timestamp Snooping - Facultypeople.ee.duke.edu/~sorin/.../asplos2000_timestamp... · Timestamp Snooping - MIlo M. K. Martin ASPLOS-IX, November 2000 Slide 40 Timestamp Snooping Protocol