More Sensitive Applications in Public Clouds Security in Public...

Isolation in Public Clouds: Challenges, Threats and Defenses

Venkatanathan Varadarajan [email protected]

1Ph.D. Defense — November 30th 2015

Committee: Prof. Michael Swift (advisor), Prof. Thomas Ristenpart (advisor), Prof. Aditya Akella, Prof. David Wood, and Prof. Nigel Boston

1

More Sensitive Applications in Public Clouds

2

More are migrated to the public cloud!

2

Cloud API

Internet

Security in Public Clouds — An Analogy

VMVM

VM

VM

VM

VM

VM

VM

Web Interface

3

Internet

Multi-tenancy + Public

*MIT CSAIL - RF Capture, “Wifi X-ray Vision

3

A Threat in Public Clouds

4

VM

VM

VM

VM

VM

VM

VM

VM

Cloud API

Internet

VM

Target VMs

Co-location

4

A Threat in Public Clouds

5

VM

VM

VM

VM

VM

VMVM

VM

Cloud API

Internet

VM

Co-location

5

An Example Cross-VM Attack: Side-channels

6

VM

VM Time

Core: A V

cache sets

cache ways

Prime

A

Probe

Attacker Timing Profile

Extract secret information

Ref: Zhang, Juels, Reiter, Ristenpart, “Cross-‐VM Side-‐channels …”, CCS’12

I-‐cache

Steal secret key and read secure email communications within ~6 hours!

6

Problem: Cross-VM AttacksTwo steps:

1. Place VM on the same host as victim, and

2. Exploit sharing and lack of isolation to,

• Steal secrets, — e.g., side-channel attacks — L1 D/I cache, TLB, Branch Predictors,

LLC, memory … • Affect performance,

— e.g., performance degradation, DoS attacks.

7

Isolation: “two user tasks are isolated if one cannot know about the

execution of the other due to resource sharing.”

7

Cluster Scheduler

Placement & Isolation in Public Clouds

8

Cloud API

Placement Policy

VMVM

VM

VM

VM

VM

VMVM

Hypervisor

Internet

(1) Controls co-location

(2) Provides isolation

• Cross-VM side- & covert-channels [Ristenpart’09, Xu’11, Wu’12, Zhang’12, Yarom’14, Zhang’14, Liu’15, Inci’15]

• Resource-stealing, DoS [Zhou’11, Varadarajan’12]

• Co-location Attack [Ristenpart’09, Zhang’14, Varadarajan’15, Xu’15]

8

VMVM

VM

VM

VM

VM

VMVM

In my dissertation

1. Is co-location practical in modern clouds?

2. Are there unique opportunities for malicious users to exploit the lack of isolation for performance gains?

3. Can isolation be improved without compromising the efficiency?

9

Cloud API

Cluster Scheduler

Placement Policy

Internet

Practice of multi-tenancy demands stronger isolation between VMs in public clouds.

9

Cluster Scheduler

Cloud API

Placement Policy

VMVM

VM

VM

VM

VM

VMVM

Internet

1. Is Co-location Practical in Modern Clouds?

10

Let’s assume 50,000 servers and a randomized placement policy,With 30 target VMs.

How many VMs do you need to launch to achieve a successful co-location 50% of the time?

Think about its cost ($$ and time)

My work: can achieve co-location with > 90% chance for a minimum of 14 cents with 10-30 VMs!

~ 1000 to 3000 VMs

10

Perfo

rman

ce D

egra

datio

n (%

)

0

150

300

450

600

CPU Net Disk Cache

2. Malicious use of Performance Interference

11

Local Xen Testbed

Machine Intel Xeon E5430, 2.66 Ghz

CPU 2 packages each with 2 cores

Cache Size 6MB per package

VM

VM3x-‐6x Performance loss

My work: A greedy customer could improve his performance by interfering with neighbors

11

3. Improving Isolation w/o Compromising Efficiency

12

VM

VM Time

Core: A V

cache sets

cache ways

Prime

A

Probe



*Zhang, Juels, Reiter, Ristenpart, “Cross-‐VM Side-‐channels …”, CCS’12

I-‐cache

My work: Design principle soft-isolation to defend against cross-VM side-channels

Shared resource

Hard-isolation No sharing: ⬆ isolation ⬇ efficiency

12

Dissertation Summary

13

Cluster Scheduler

Cloud API

Placement Policy

VMVM

VM

VM

VM

VM

VMVM

Is co-location attacks possible, cheap?

[Security’15]

Stealing performance from neighbors [CCS’12]

Better isolation in CPU schedulers to prevent

side-channels [Security’14]

Practice of multi-tenancy demands stronger isolation between VMs in public clouds.

1

2

3

13

Prior Work on Co-location: 2009 Study

• 6 years old

• New countermeasures (e.g, virtual private clouds)

• Increased scale of clouds

• Only on Amazon EC2

14

VM

VM

VM

VM

VM

VM

VM

VM

Cloud API

VM

Target VMs

Co-location*Hey you get off my cloud, Ristenpart et al., CCS 2009

Cluster Scheduler

Placement Policy

Internet

Is co-location still viable in security-hardened modern clouds?

14

Our Work: Exploring Co-location Attacks in Modern Clouds

Steps in achieving co-location:

1. Finding Launch Strategy

• launch parameters to increase chances of co-location

2. Detecting Co-location

• with any target victim

15

VM

VM

VM

VM

VM

VM

VM

VM

Cloud API

VM

Target VMs

Detect co-location

Cluster Scheduler

Placement Policy

Internet

15

Co-residency Detection

16

1. Read shared state on two VMs

e.g., private IP addresses, shared TSC counters.

2. Correlate performance of shared resources

e.g., network round-trip times, cache-based covert-channels.

A memory-based covert-channel* can cause 3x-4x degradation

* Wu et al. “Whispers in the Hyper-space: High-speed Covert Channel Attacks in the Cloud.”, Usenix Security’12

VMVM

n/w pings or covert-channels

VMVM102.2.1.3102.2.1.1

16

Background: Memory Covert-Channel

17

LLC

Core (Private Caches)

Core (Private Caches)

LLC

Memory

VMVM

Shared Memory Bus

Sender:Signal() { repeat atomic_op(ua_addr) done }

Receiver:Observe() { s = start_time repeat N mem_access() done e = end_time bw = N/(e-s) }

* Wu et al. “Whispers in the Hyper-space: High-speed Covert Channel Attacks in the Cloud.”, Usenix Security’12

“Victim” VM must cooperate with attack VMO.K. for measurement studies

But not useful for real attacks in the wild

17

A

Co-residency Detection on Uncooperative Victim

18

Data Store

VM

VMpart of a cloud app,

e.g., web server, key-value storesTarget Victims

…

Load Balancer

Send large # requests & measure avg. latency

• Realistic victim setting: modern multi-tier cloud app. • A way to detect co-location when VMs are behind a

load-balancer in presence of background traffic

background load

Sender()Receiver()

VM

18

1

10

100

1000

idle 100 250 500 750 1000

Aver

age

Req

uest

Lat

ency

(ms)

(log

scal

e)

Background Load (# concurrent users)

Non-coresident Co-resident

Co-residency Detection on Uncooperative Victim

19

45x

Social networking application (Olio):

HAProxy LB + 3 Web servers + 1 mysql server

19

Our Work: Exploring Co-location Attacks in Modern Clouds

Steps in achieving co-location:

1. Finding Launch Strategy

• launch parameters to increase chances of co-location

2. Detecting Co-location

• with any target victim

20

VM

VM

VM

VM

VM

VM

VM

VM

Cloud API

VM

Target VMs

Detect co-location

Cluster Scheduler

Placement Policy

Internet

VM

20

Placement Policy

Big Picture: Placement Vulnerability Study

Study spanning 3 months, exploring 6 placement variables, spending more than $200 per cloud

21

Placement Variables

Co-location?

e.g., # VMs, when you launch, datacenter, VM type, etc.

Fix Placement Variables

Observe Placement Behavior

21

Study Setup

• Two distinct accounts: proxy for victim and attacker • 6 placement variables

• # victim & attacker VMs, delay b/w launches, time of day, day of week, datacenter, cloud providers

• Small instance type (EC2: t2.small, GCE: g1.small, Azure: Standard-A1)

• Values for these variables form a launch strategy

• Execute a launch strategy from a workstation • detect and log co-location

• 9 samples per strategy with 3 runs per time of day & 2 days of week (weekday/weekend)

22

22

Placement Policy

How hard should it be to achieve co-location?

• Random placement policy • N = 50,000 machines [re:Invent’14] • v - victims and a - attacker VMs • Probability of Collision:

Pc = 1 - (1 - v/N)a

23

Cluster Scheduler

Cloud API

Placement Policy

VMVM

VM

VM

VM

VM

VMVM v a = ln(1− Pc)/ln(1−v/N); Pc= 0.5

10 3466

20 1733

30 1155

For a modest 50% success rate with 10-30 victims we need to launch ~3000 VMs

23

Results: Varying Number of VMs

24

EC2

Succ

ess

Rate

0

0.25

0.5

0.75

1

10v 10a10v 20a10v 30a10v 30a20v 30a30v 30a

GCE

0

0.25

0.5

0.75

1

10v 10a10v 20a10v 30a10v 30a20v 30a30v 30a

Azure

0

0.25

0.5

0.75

1

10v 10a10v 20a10v 30a10v 30a20v 30a30v 30a

Co-location is possible with as low as 10 VMs and always achieve co-location with 30 VMs

24

Results: Varying Delay between Launches

25

Azure

0

0.25

0.5

0.75

1

10v 10a

20v 20a

30v 30a

GCE

0

0.25

0.5

0.75

1

10v 10a

20v 20a

30v 30a

Zero One hour

EC2

Succ

ess

Rate

0

0.25

0.5

0.75

1

10v 10a

20v 20a

30v 30a

Placement policy for each cloud significantly varies

25

Cost of a Launch Strategy

26

Cos

t

$0.00

$1.50

$3.00

$4.50

$6.00

10v 10a

20v 20a

30v 30a

Tim

e to

ach

ieve

& d

etec

t co-

loca

tion

(in m

inut

es)

0

45

90

135

180

10v 10a

20v 20a

30v 30a

EC2 GCE AZURE

The cheapest launch strategy costs as low as 14 cents

26

• We always achieved co-location in smaller datacenter regions,

• GCE: europe-west1-b and EC2: us-west-1 (CA)

• In EC2, launching attacker VMs early morning (2 to 10am PST) has a higher success rate.

• In Azure we could co-locate 16 VMs on a single host

• Brief experiments with platform-as-a-service,

• … and many more in the paper

Other Interesting Results

27

27

Placement Vulnerability [Security’15]

Outline


2. Are there unique opportunities for malicious users to exploit the lack of isolation for performance gains?


28

Practice of multi-tenancy in public clouds demands stronger isolation between VMs in the presence of malicious users.

Resource-Freeing Attacks [CCS’12]

CPU Scheduler-based defenses [Security’14]

28

Perfo

rman

ce D

egra

datio

n (%

)

0

150

300

450

600

CPU Net Disk Cache

Resource Contention in Public Clouds

29

Local Xen Testbed

Machine Intel Xeon E5430, 2.66 Ghz

CPU 2 packages each with 2 cores

Cache Size 6MB per package

VM

VM

Non-work-conserving CPU scheduling

Work-conserving scheduling

3x-‐6x Performance loss à Higher cost

29

This work: Greedy customer can recover performance by interfering with other tenants

What can a tenant do?

30

Pack up VM and move(see our SOCC 2012 paper)… but, not all workloads cheap to move

VM

VM

Ask provider for better isolation… requires overhaul of the cloud

Resource-‐Freeing Attack

30

The Setting

Victim: – One or more VMs – Public interface (eg, http)

Beneficiary: – VM whose performance we want to improve

Helper: – Mounts the attack

Beneficiary and victim fighting over a target resource

Helper

31

VM

VM

Victim

Beneficiary

31

Beneficiary & Victim • Apache webservers hosting static

and dynamic (CGI) web pages • Target Resources: Network

Bandwidth • Work-conserving scheduler

— half the bandwidth

Example: Network Contention

32

Net

Clients

VictimBeneficiary

Local Xen

Test bed

What can you do?

32

Recipe for a Successful RFA

Shi. resource away from the target resource towards the bo0leneck resource

33

Shift resource usage via public interface

Proportion of Network usage

CPU intensive dynamic pages

Static pagesProp

ortio

n of CPU

usage

Push to

wards CPU

bottle

neck

Reduce target resource usage

Limits

33

An RFA in Our Example

In our testbed:

Increases beneficiary’s share of bandwidth

No RFA: 1800 page requests/sec W/ RFA: 3026 page requests/sec

34

Net

Helper

CGI Req

uest

CPU Utilization

Clients

50% à85% share of bandwidth

34

Summary: Resource Freeing Attacks

1) Send targeted requests to victim 2) Shift resources use from target to a bottleneck

Similar RFA on other resources exists, e.g. CPU Cache Bandwidth

35

On EC2, we reduced contention by 66.5% = performance improvement of 3-13%

“Resource-Freeing Attacks: Improve Your Cloud Performance (at Your Neighbor’s Expense)”, ACM CCS 2012

35

1. Work-conserving vs. Non-work-conserving

2. Effects of simple pay-per-hour pricing model

• $ per unit time != $ per useful work-done

What did we learn from RFAs?

36

Isolation

Efficiency

Work-conserving

Non work-conserving

Hybrid

36


Outline


2. Are there unique opportunities for malicious users to exploit the lack of isolation for monetary gains?


37




37

Problem: Cache-based Side-channels*

38

VM

VMTime

Core: A V

cache sets

cache ways

Prime

A

Probe



CRYPTO_FUNCTION(s): s ß secret bit

if (s = 0) { OPERATION_A } if (s = 1) { OPERATION_B }


I-‐cache

Secret

38

Requirements for Successful Side-channel

39

VM

VMTime

Core: A V

cache sets

cache ways

Prime

A

Probe






I-‐cache

Shared resource

High-precision timers

Quick preemptions

Secret

39

Defenses against Side-channels1. Sharing

• Resource Partitioning [NoHype’10] • Specialized Hardware [RPcache’07] • Software-based partitioning [StealthMem’12]

2. Access to high-‐resoluFon Fmers • Reduce resolution [TimeWarp’12] • Removing timing channel [StopWatch’13]

3. Quick cross-‐VM preempFons • No prior work!

40

No countermeasures deployed by providers!

40

Our Solution: Soft Isolation

Allow sharing but limit frequency of dangerous cross-VM interactions

Goals: 1. Secure: Controlled information leakage 2. Commodity: Easy to adopt 3. Efficient: Allow sharing, low overhead

41

… with simple changes to Hypervisor’s CPU scheduler

Core

(per core state)

Private Caches

Hypervisor

VM

VM

41

Requirement for Quick Preemptions

42

VM

VM

V

Time

Core: A

cache sets

Prime

A

Probe

cache ways

Preemption Interval

V V

Rate of preemption > Rate of event to measure



Next subsequent code/taskexecution … (or noise)

42

State-‐of-‐art CPU schedulers

Batch VMs Interactive VMs

…

Why do schedulers allow quick preemptions?

43

Prime-‐probe attacker: Abuses BOOST priority, using interrupts.

Throughput-‐oriented: Benefits from longer scheduler timeslices

Latency-‐oriented: Benefits from quick wakeups , BOOST priority

V

Time

Core: A V A

< 10µs

Malicious VM

43

V

Time

AA

Min. runFme (scheduler parameter)

Interrupt(boosted)

V V

Soft-Isolation: Ratelimit Preemptions

44

VM

VM

Core:

Available in Xen (and KVM) • ratelimit_us (and sched_min_granularity_ns)

• Reduces VM-switches à Boosts batch-workload’s performance

Minimum RunTime (MRT) guarantee à soft-isolation

44

MRT Guarantee and Open Questions

45

TimeCore: AV V

1. Can MRT defend against Cross-VM Side-channels?(security evaluation)

delay

2. Trade-off between securityand performance?(performance overhead)

MRT value

45

Security Evaluation : Prime-Probe Timing Profile

46

Idle Victim VM Simple Victim VM Under Zero-‐MRT

Alternating usage patterni-‐cache access timing

46

Security Evaluation : Prime-Probe Timing Profile

47

Simple Victim VM Under Zero-‐MRT

Simple Victim VM Under 1ms MRT

Side-‐channel not discernible Alternating usage pattern

47

More details in the paper …

• Detailed Performance and Security Analysis • 20+ graphs in the paper

• Per-core State-Cleansing • Interactive VMs may still leak information • MRT + State-cleansing incur low overhead

48

It is cheap and easy to deploy!

“Scheduler-‐based Defenses against Cross-‐VM Side-‐channels”, Usenix Security’14 https://bitbucket.org/vvaradarajan/robsched

48

1. Is co-location practical in modern clouds? Yes, and surprisingly cheap!

2. Exploiting the lack of isolation for performance gains? Yes, reduce contention by gaming schedulers.

3. Improving isolation w/o compromising efficiency? Soft-Isolation


Conclusion

49




49

Thanks to …

50

50

References

(1) Varadarajan, V., Kooburat, T., Farley, B., Ristenpart, T., & Swift, M. M. Resource-freeing attacks: improve your cloud performance (at your neighbor's expense). ACM CCS 2012

(2) Varadarajan, V., Ristenpart, T., & Swift, M. Scheduler-based defenses against cross-vm side-channels. In Usenix Security 2014.

(3) Varadarajan, V., Zhang, Y., Ristenpart, T., & Swift, M. A placement vulnerability study in multi-tenant public clouds. In Usenix Security 2015.

(4) Farley, B., Juels, A., Varadarajan, V., Ristenpart, T., Bowers, K. D., & Swift, M. M. More for your money: exploiting performance heterogeneity in public clouds. ACM SoCC 2012.

(5) Volos, H., Nalli, S., Panneerselvam, S., Varadarajan, V., Saxena, P., & Swift, M. M.. Aerie: Flexible file-system interfaces to storage-class memory. ACM Eurosys 2014.

51

51

Backup Slides

52

52

Results: Co-residency Detection in Public Clouds

53

Amazon EC2 (EC2)

0

5

10

15

20

Performance Degradation

0.8 1.4 2 2.6 3.2 3.8 4.4 5 5.6

167

Non-coresident Co-residentGoogle Compute Engine (GCE)

Num

ber o

f VM

pai

rs

0

5

10

15

20

Performance Degradation0.8 1.4 2 2.6 3.2 3.8 4.4 5 5.6

53

Co-residency Detection in Public Clouds (contd.)

• Heterogeneity at play on Azure AMD machines

• Requiring two receivers to detect co-residency

• Memory-probing or memory-locking: > 1.5x degradation

54

Num

ber o

f VM

pai

rs

0

4.5

9

13.5

18

Performance Degradation

0 0.8 1 1.2 1.4 1.5 1.6 1.8 2

mprobe-NC mlock-NC mprobe-C mlock-C

54

Limitations

1. Although an exhaustive study

— results limited to small instance type, three clouds, 9 runs per configuration, two user accounts etc.

2. Positive co-residency signal != exploitable

— may share only a small set of resources (e.g., memory, network)

3. A result of an unlucky placement policy settings?

55

“A Placement Vulnerability Study in Multi-tenant Public Clouds”, Usenix Security 2015

55

Challenge 1: Placement Policy

• An unknown placement policy places VM on a host

• Variables may influence VM placement

— Parameters: type of VM, time of launch, # VMs launched, datacenter region, etc.

— Environment variables: time of day, day of week, # machines, etc.

56

VMVM

VM

VM

VMVM

VM

Cloud API

Internet

Placement Policyfirst fit?

best fit?…

56

Challenge 2: Co-residency Detection

• No explicit co-residency information from cloud provider

• Multiple layers of strong isolation

— Virtual machine — e.g. hardware virtualization

— Network — e.g. Amazon VPC

— Modern hardware, etc.

57

VMVM

VM

VM

VMVM

VM

Cloud API

Internet

Placement PolicyDetecting

co-residency

Placement Policy

57

Some Strategies Work Better than Others

58

Launch Strategy v x a Cost in Cloud

Cost underRandom Placement*

Success rate norm. w/ random*

Launch 10 VMs in less popular datacenter 10x10 $0.26 $113.87 1/0.1 (=10)

Launch 30 VMs 1 hour after victim VM launches 30x30 $1.56 $32.75 1/0.6 (=1.67)

Launch more than 20 VMs 4 hours after victim VM

launches20x20 $0.52 $53.76 1/0.33 (=3.03)

Example strategies on EC2

*Random Placement of VMs on N hosts, v x a launch strategy has a probability of collision: 1 - (1 - v/N)a

58

Resource Freeing Attacks

1) Send targeted requests to victim 2) Shift resources use from target to a bottleneck

59

Can we mount RFAs when target resource is CPU cache?

Shared CPU Cache: • Ubiquitous: Almost all workloads need cache • Hardware controlled: Not easily isolated via software • Performance SensiKve: High performance cost!

59

Case Study: Cache vs. Network

• VicFm : Apache webserver hosting static and dynamic (CGI) web pages

• Beneficiary: Synthetic cache bound workload (LLCProbe)

• Target Resource: Cache

• No cache isolation: ~3x slower when sharing cache with webserver

60

Net

Cache

$$$ Clients

Local Xen

Test bed

VictimBeneficiary

CoreCore

60

Victim webserver frequently interrupts, pollutes the cache – Reason: Xen gives higher

priority to VM consuming less CPU time

Cache vs. Network

CacheNet

61

Clients$$$

Cache state time line

Beneficiary starts to run

Core Core

decreased cache efficiency

Webserver receives a request

Heavily loaded web server

cache state

61

Cache vs. Network w/ RFA

Cache

RFA helps in two ways: 1. Webserver loses its priority. 2. Reducing the capacity of

webserver. Net

62

Clients$$$

Cache state time line

Core Core

HelperHeavily loaded webserver requests under RFA

CGI Req

uestBeneficiary starts to run

Webserver receives a request

Heavily loaded web server

cache state

62

RFA: Performance Improvement

63

RFA intensities – time in ms per second

196% slowdown

86% slowdown

60% Performance Improvement

63

Instance type m1.small

# of co-‐resident pairs 9 (23 total instances)

Machine type Intel Xeon E5507 with 4MB LLC

Experiments on Amazon EC2

VM

VM

VM

VM

64

VM

Multiple Accounts

Co-‐resident VMs from our accounts: Stand-‐ins for victim and beneficiary

Separate instances for helper and web clients

No direct interact with any other customers Indirect interaction akin to normal usage cases

VM

64

mcf from SPEC-CPU

65

10% slowdown

6% slowdown

3% performance improvement = 35% reduction in performance loss

On average RFA improved performance across all SPEC workloads!

65

Xen Version 4.2.1

Scheduler Credit Scheduler 1Configuration

(Non-‐work conserving)40% cap on DomU VCPUs

with equal weight# VMs 6

# VCPUs per VM 2

Experimental Methodology

Two VMs: 1. Attacker 2. Victim

66

Machine Intel Xeon E5645, 2.4GHz, 6 cores, single package

Memory Hierarchy

Private 32KB L1 (I-‐ and D-‐Cache), 256KB unified L2, 12MB shared L3 & 16GB DDR3 RAM.

Machine ConfigurationXen Configuration

Core

VM

VM

HypervisorSetting similar to public clouds (e.g. EC2)

66

MRT Guarantee and Open Questions

67

TimeCore: AV V

1. Can MRT defend against Cross-VM Side-channels?(security evaluation)

delay

2. Trade-off between securityand performance?(performance overhead)

MRT value

67

Measuring Performance Overhead

68

CoreCore

Core Core

VCPU

VCPU

VCPUVCPU

VCPU VCPU

Hypervisor

workload-‐mixMeasured workload: 1. Interactive à memcached,

cassandra, etc. and 2. Batch à graph500, specJBB, etc.

Competing workloads: microbenchmarks à highly cache-‐thrashing + (interactive or batch)

68

Measuring Performance Overhead

69

Normalize

d to Zero-‐MRT

0.8

1.1

1.4

1.7

2

ping memcachedcassandra mcf specjbb graph500

Avg. 95th Percentile Latency (interactive workloads)

Avg. Runtime (batch workloads)

At 5ms MRT< 7% overhead

69

More Sensitive Applications in Public Clouds Security in Public...

Documents

Transcript of More Sensitive Applications in Public Clouds Security in Public...