OpenStack managed Cloud Foundry service Market · Openstack community contributor, experienced in...

1

SOFT CONTAINER TOWARDS 100% RESOURCE UTILIZATION

ACCELA ZHAO, LAYNE PENG

2

Accela Zhao, Technologist at EMC OCTO, active OpenStack community contributor, experienced in cloud scheduling and container technologies.

WHO ARE THOSE GUYS …

Layne Peng, Principal Technologist at EMC OCTO, experienced cloud architect, one of the earliest contributors to Cloud Foundry in China, holder of 9 patents, and a book author.

Mail: [email protected]

Mail: [email protected] Twitter: @layne_peng


3

WHAT IS RESOURCE UTILIZATION?

This is what we buy

This is what we use

A gap of $$$ wasted

4

ENERGY AND RESOURCE UTILIZATION

Energy-related costs are 42% of the total (including buying new machines)

An idle server can consume as much as 70% of the energy it uses at full speed

Low resource utilization is energy inefficient: wasted energy, wasted money

Real world resource utilization is usually low: around 20% or less

http://san.ee.imperial.ac.uk/publications/EfficientCloud.pdf

http://128.250.22.134/papers/GreenCloud2010.pdf

http://web.stanford.edu/~cdel/2014.asplos.quasar.pdf

5

A CLOSER LOOK AT CLOUD

The key advantage - cloud consolidation

Fewer machines, more apps. Energy-efficient and saves money.

Improved resource utilization

6

• Scheduling - choose the best resource placement when an app starts
  – Examples: Green Cloud, Paragon, and the schedulers in OpenStack, Kubernetes, Mesos, …

• Migration - continuously optimize the resource placement while an app is running
  – Examples: OpenStack Watcher, VMware DRS

• Soft Container - elastic; dynamically adjusts resource constraints in response to co-located apps
  – Related: Google Heracles

RESOURCE UTILIZATION ON CLOUD

Soft Container

http://www.cloudbus.org/papers/Energy-Aware-CloudResourceAllocation-FGCS2012.pdf

http://web.stanford.edu/~cdel/2013.asplos.paragon.pdf

https://github.com/openstack/nova/tree/master/nova/scheduler

https://github.com/kubernetes/kubernetes/tree/master/plugin/pkg/scheduler

http://mesos.apache.org/documentation/latest/app-framework-development-guide/

https://wiki.openstack.org/wiki/Watcher

https://www.vmware.com/products/vsphere/features/drs-dpm

http://csl.stanford.edu/~christos/publications/2015.heracles.isca.pdf

7


Scheduler

Migration

Apps

Soft Container

Manages resource utilization at app kick-off

Manages resource utilization across hosts while the app is running

Manages resource utilization at fine granularity inside host

8


A battle of putting more apps in each host

vs. guaranteed app SLA

The key problem: resource interference

9

• What is resource interference?
  – Apps co-located on one host share resources like CPU, cache, memory, …
  – They interfere with each other, resulting in poorer performance than when running standalone
  – Resource interference makes SLAs unenforceable

• Related readings
  – Google Heracles: an analysis of resource interference
  – Paragon: resource interference-aware scheduling
  – Bubble-up: to measure resource interference

THE KEY PROBLEM: RESOURCE INTERFERENCE

http://csl.stanford.edu/~christos/publications/2015.heracles.isca.pdf

http://web.stanford.edu/~cdel/2013.asplos.paragon.pdf

http://www.cs.virginia.edu/~skadron/Papers/mars_micro2011.pdf



10

RESOURCE INTERFERENCE: WHAT DOES IT LOOK LIKE?

MySQL standalone running vs co-located with a CPU & disk hungry task

11

• Bubble-up
  – The setup
    • Run the app co-located with resource benchmarks; each benchmark stresses one type of resource.
  – Resource interference the app tolerates
    • Slowly increase the benchmark stress until the app fails its SLA.
    • The critical point shows how much resource interference the app can tolerate.
  – Resource interference the app causes
    • Run the app at the load its SLA requires.
    • The stress it puts on each type of resource is the resource interference the app causes.

• Where to use it?
  – Better resource utilization management
  – Scheduling, Migration, Soft Container, …
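The "tolerated interference" measurement above can be sketched as a simple ramp. This is a minimal illustration, not the Bubble-up tooling itself: `measure_latency_ms` stands for one benchmark run at a given stress level, and `fake_mysql_latency` is a made-up latency curve used only so the sketch runs without a real workload.

```python
from typing import Callable


def tolerated_interference(
    measure_latency_ms: Callable[[float], float],
    sla_latency_ms: float,
    step: float = 0.1,
) -> float:
    """Return the highest stress level (0.0 to 1.0) at which the app still meets its SLA.

    measure_latency_ms(stress) is assumed to run the app next to one resource
    benchmark at the given stress level and report the app's latency.
    """
    tolerated = 0.0
    steps = int(round(1.0 / step))
    for i in range(1, steps + 1):
        stress = i * step
        if measure_latency_ms(stress) > sla_latency_ms:
            break  # critical point reached: the app just failed its SLA
        tolerated = stress
    return tolerated


def fake_mysql_latency(cpu_stress: float) -> float:
    """Hypothetical latency curve standing in for a real co-located benchmark run."""
    return 10.0 + 40.0 * cpu_stress ** 2


if __name__ == "__main__":
    print(tolerated_interference(fake_mysql_latency, sla_latency_ms=25.0))
```

Repeating the ramp once per resource type (CPU, disk, cache, …) gives one tolerance value per resource, which a scheduler or Soft Container could then compare against the interference other apps cause.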

RESOURCE INTERFERENCE: HOW TO MEASURE?




12

RESOURCE INTERFERENCE: HOW TO MEASURE?

MySQL standalone running, vs co-located with CPU stress, vs disk stress. In my case, MySQL is much more sensitive to CPU interference.

13

• Motivations
  – Increase resource utilization by co-locating more apps
    • E.g. a business service is critical but may not use all resources on the host. Add low-priority Hadoop batch tasks to fill what is left.
  – Respond to the dynamic nature of time-varying workloads
    • E.g. the business service may become more idle at lunch time; Hadoop tasks can then expand their resource bubble and use the leftover.
  – Guarantee the SLA of critical apps
    • E.g. when the business service suddenly requires more resources for processing, Hadoop tasks shrink instantly to give resources back.

• Challenges
  – Resource control and isolation of interference
  – Respond to dynamic workload change

INTRODUCING SOFT CONTAINER

14

• What does “Soft” mean?
  – A container's resources vary based upon its neighbors and SLAs. (The container becomes elastic)
  – “Expanding” (bubble up) resources when idle resources exist
  – Shrinking the resources of a specific container when another critical app demands more resources

INTRODUCING SOFT CONTAINER

Container resource bubble

Time

Resource

15

THE FEEDBACK CONTROL LOOP

Controller

Watcher Limiter

Containers

Soft Container
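One way to picture the Watcher → Controller → Limiter loop is the sketch below. It is an illustration under stated assumptions, not the Soft Container implementation: the `Sample` fields, the halving and +0.05 steps, and the 0.8 utilization threshold are all invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    """What a (hypothetical) Watcher reports on each tick."""
    critical_latency_ms: float  # latency of the SLA-bound app
    host_cpu_util: float        # overall host CPU utilization, 0.0 to 1.0


def next_batch_cpu_share(current: float, sample: Sample, sla_latency_ms: float,
                         min_share: float = 0.05, max_share: float = 0.9) -> float:
    """Controller step: decide the CPU share of the low-priority container."""
    if sample.critical_latency_ms > sla_latency_ms:
        proposed = current * 0.5    # SLA at risk: shrink the bubble quickly
    elif sample.host_cpu_util < 0.8:
        proposed = current + 0.05   # idle resources exist: expand slowly
    else:
        proposed = current          # host is busy but the SLA holds: keep as is
    return max(min_share, min(max_share, proposed))


def control_loop(watch: Callable[[], Sample], limit: Callable[[float], None],
                 sla_latency_ms: float, share: float = 0.1,
                 iterations: int = 10) -> float:
    """Run a fixed number of ticks; a real loop would sleep between ticks."""
    for _ in range(iterations):
        share = next_batch_cpu_share(share, watch(), sla_latency_ms)
        limit(share)  # the Limiter applies the decision to the container
    return share


if __name__ == "__main__":
    applied = []
    final = control_loop(lambda: Sample(10.0, 0.5), applied.append, sla_latency_ms=20.0)
    print(final, applied)
```

Shrinking multiplicatively and expanding additively is a design choice made here so that the critical app recovers fast while the batch work grows back cautiously.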

16

RESOURCES TO LIMIT

• CPU
  – Core
  – Time Quota
  – …

• Memory
  – Size
  – Bandwidth
  – …

• Disk I/O
  – IOPS
  – Throughput
  – …

17

RESOURCES TO LIMIT - MISSING

• CPU
  – Core
  – Time Quota
  – …

• Cache
  – LLC
  – …

• Memory
  – Size
  – Bandwidth*
  – …

• GPU
  – …

• Device*
  – …

• Network
  – Ulimit
  – Bandwidth
  – …

…

• Disk I/O
  – IOPS
  – Throughput
  – …

With kernel 3.6, most of this support can be found in the community…

18

ISOLATING THE RESOURCES - NAMESPACE

/proc/<pid>/ns:
• lrwxrwxrwx 1 root root 0 Jun 21 18:38 ipc -> ipc:[4026532509]
• lrwxrwxrwx 1 root root 0 Jun 21 18:38 mnt -> mnt:[4026532507]
• lrwxrwxrwx 1 root root 0 Jun 16 18:24 net -> net:[4026532512]
• lrwxrwxrwx 1 root root 0 Jun 21 18:38 pid -> pid:[4026532510]
• lrwxrwxrwx 1 root root 0 Jun 21 18:38 user -> user:[4026531837]
• lrwxrwxrwx 1 root root 0 Jun 21 18:38 uts -> uts:[4026532508]

• clone(): creates a new process and places it in new namespaces
• unshare(): creates new namespaces and moves the calling process into them
• setns(): joins the calling process to an existing namespace
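The `/proc/<pid>/ns` links shown above can also be read programmatically. The sketch below only inspects namespaces (it does not call clone(), unshare() or setns()); the helper names are made up for illustration, and on a system without `/proc` the lookup simply returns an empty result.

```python
import os
import re
from typing import Dict, Tuple

# A namespace link target looks like "ipc:[4026532509]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> Tuple[str, int]:
    """Split a link target such as 'ipc:[4026532509]' into ('ipc', 4026532509)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def namespaces_of(pid: str = "self") -> Dict[str, int]:
    """Map each entry in /proc/<pid>/ns to its namespace inode number."""
    ns_dir = f"/proc/{pid}/ns"
    result: Dict[str, int] = {}
    if not os.path.isdir(ns_dir):
        return result  # not Linux, or the process is gone
    for name in sorted(os.listdir(ns_dir)):
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # unreadable entry (for example, missing permissions)
        result[name] = inode
    return result


if __name__ == "__main__":
    for name, inode in namespaces_of().items():
        print(f"{name}: {inode}")
```

Two processes are in the same namespace of a given type exactly when these inode numbers match, which is a quick way to check whether a container really got its own net or pid namespace.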

• security namespace
• security keys namespace
• device namespace
• time namespace

We are still waiting…

19

LIMIT THE RESOURCE - CGROUP

Task, Control Group & Hierarchy

Subsystem – Control options

• blkio • cpu • cpuacct • cpuset • devices • freezer • memory • net_cls • net_prio • ns

Usage

Create a cgroup under a subsystem, then change its limits…

# echo 524288000 > /sys/fs/cgroup/memory/foo/memory.limit_in_bytes
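The same write can be done from a program. The sketch below assumes the cgroup v1 layout used in the command above (`memory.limit_in_bytes`; on cgroup v2 the corresponding file is `memory.max`). The function names and the `root` parameter are illustrative; `root` exists only so the code can be tried against a scratch directory instead of the real cgroup filesystem, which needs root privileges.

```python
import os


def set_memory_limit(cgroup: str, limit_bytes: int,
                     root: str = "/sys/fs/cgroup/memory") -> str:
    """Create the cgroup directory if needed and write its memory limit (cgroup v1)."""
    group_dir = os.path.join(root, cgroup)
    # On a real cgroup filesystem, creating the directory creates the control group.
    os.makedirs(group_dir, exist_ok=True)
    path = os.path.join(group_dir, "memory.limit_in_bytes")
    with open(path, "w") as f:
        f.write(str(limit_bytes))
    return path


def attach_pid(cgroup: str, pid: int, root: str = "/sys/fs/cgroup/memory") -> None:
    """Move a process into the cgroup by writing its PID to cgroup.procs."""
    with open(os.path.join(root, cgroup, "cgroup.procs"), "w") as f:
        f.write(str(pid))


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as scratch:
        print(set_memory_limit("foo", 500 * 1024 * 1024, root=scratch))
```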

21

MISSING - NETWORK

Isolation does not mean resource control

10

Suppose two containers on one machine share a total of 100 Gbps of bandwidth

80

100Gbps

95

100Gbps

If the GREEN container consumes the majority of the bandwidth, it may have a negative impact on the BLUE one… How can we prevent this from happening?

22

MISSING - NETWORK

Community attempts: Based on Traffic Control (tc)

Nightmare of the PaaS providers…
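To make the tc-based approach concrete, the sketch below builds (but does not run) the commands for a basic HTB rate cap on one network device. It is a minimal illustration: the device name `veth-green` is hypothetical, and a real setup would also need per-container classification, cleanup, and root privileges, which is part of why this is painful to operate at scale.

```python
from typing import List


def tc_rate_limit_commands(dev: str, rate_mbit: int) -> List[List[str]]:
    """Return tc commands that cap traffic leaving through `dev` at rate_mbit Mbit/s."""
    rate = f"{rate_mbit}mbit"
    return [
        # Root HTB qdisc; unclassified traffic falls into class 1:10.
        ["tc", "qdisc", "add", "dev", dev, "root", "handle", "1:",
         "htb", "default", "10"],
        # The single class that carries the traffic, with rate == ceil (a hard cap).
        ["tc", "class", "add", "dev", dev, "parent", "1:", "classid", "1:10",
         "htb", "rate", rate, "ceil", rate],
    ]


if __name__ == "__main__":
    for cmd in tc_rate_limit_commands("veth-green", 100):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a host where the device exists.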

25

MISSING - GPU

NVIDIA’s efforts:

a. GPUs are exposed as separate ordinary devices in /dev

Ref: https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation

b. devices cgroup:
• Allow/Deny/List
• Access: i. R (read) ii. W (write) iii. M (mknod)

Usable, but insufficient…
1. Launch multiple jobs in parallel, each one using a subset of the available GPUs;
2. How about sharing a GPU between jobs with proper isolation? Can we share a GPU like we can a CPU?
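For the devices cgroup route, access is granted per device node with entries of the form `type major:minor access`. The sketch below only formats such entries; it assumes cgroup v1 (`devices.allow`) and assumes the GPUs appear as character devices with major number 195, which is common for NVIDIA nodes but should be checked with `ls -l /dev/nvidia*` on the actual host. The function names are illustrative.

```python
from typing import Iterable, List


def devices_rule(dev_type: str, major: int, minor: int, access: str) -> str:
    """Format one devices cgroup entry, e.g. 'c 195:0 rwm'."""
    if dev_type not in ("a", "b", "c"):
        raise ValueError("dev_type must be 'a' (all), 'b' (block) or 'c' (char)")
    if not access or set(access) - set("rwm"):
        raise ValueError("access must be a non-empty combination of r, w and m")
    return f"{dev_type} {major}:{minor} {access}"


def gpu_rules(gpu_minors: Iterable[int], major: int = 195) -> List[str]:
    """Entries that give a job read/write/mknod access to a subset of GPUs.

    The default major number is an assumption, not something queried from the host.
    """
    return [devices_rule("c", major, minor, "rwm") for minor in gpu_minors]


if __name__ == "__main__":
    # Each line would be written to <cgroup>/devices.allow for the job's cgroup.
    for rule in gpu_rules([0, 1]):
        print(rule)
```

This matches point 1 above (each job gets a subset of whole GPUs); it offers nothing for point 2, since a device entry is all-or-nothing for a given GPU.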








26

MISSING - CACHE

Intel’s efforts:

Cache Monitoring Technology (CMT)
• Lets an OS or VMM assign a software-defined ID to each application or VM scheduled to run on a core. This ID is called the Resource Monitoring ID (RMID).
• Monitors cache occupancy on a per-RMID basis.
• Lets an OS or VMM read LLC occupancy for a given RMID at any time.

Cache Allocation Technology (CAT)
• The ability to enumerate the CAT capability and the associated LLC allocation support via CPUID.
• Interfaces for the OS/hypervisor to group applications into classes of service (CLOS) and indicate the amount of last-level cache available to each CLOS. These interfaces are based on MSRs (Model-Specific Registers).

Code and Data Prioritization (CDP)
• Extension to CAT
• A new CPUID feature flag is added within the CAT sub-leaves at CPUID.0x10.[ResID=1]:ECX[bit 2] to indicate support

27

MISSING – MEMORY BANDWIDTH

Monitor

Memory Bandwidth Monitoring (MBM)
• Mechanisms in hardware to monitor cache occupancy and bandwidth statistics as applicable to a given product generation on a per software-id basis.
• Mechanisms for the OS or hypervisor to read back the collected metrics such as L3 occupancy or Memory Bandwidth for a given software ID at any point during runtime.

Control

Ref: Memory Bandwidth Management for Efficient Performance Isolation in Multi-core Platform: http://pertsserver.cs.uiuc.edu/~mcaccamo/papers/private/IEEE_TC_journal_submitted_C.pdf
Code: https://github.com/heechul/memguard


29

• Latencies
  – App request latency
  – Disk IO await
  – Network response time

• Queue length
  – CPU load average
  – Disk request queue size
  – Network queue length

• Utilization
  – CPU util rate
  – Disk util rate
  – Network util rate

WATCH THE WORKLOAD CHANGE

• Bandwidth
  – DRAM bandwidth
  – CPU bandwidth
  – Disk bandwidth

• Request count
  – App request count
  – Disk IOPS / req/s
  – Network IOPS / req/s

• Granularity
  – Global level
  – Per container level
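As an example of one such metric, global CPU utilization can be derived from two samples of the aggregate `cpu` line in `/proc/stat`. This is a minimal sketch with illustrative function names; whether iowait counts as busy time is left as a switch, since it changes the picture considerably on disk-bound hosts.

```python
from typing import List


def parse_cpu_line(line: str) -> List[int]:
    """Parse a 'cpu ...' line from /proc/stat into its counters.

    Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    """
    fields = line.split()
    if not fields or not fields[0].startswith("cpu"):
        raise ValueError("not a cpu line from /proc/stat")
    return [int(x) for x in fields[1:]]


def cpu_utilization(prev: List[int], curr: List[int],
                    count_iowait_as_busy: bool = False) -> float:
    """Fraction of time the CPU was busy between two samples (0.0 to 1.0)."""
    deltas = [c - p for p, c in zip(prev, curr)]
    total = sum(deltas[:8])  # guest time is already included in user/nice
    if total <= 0:
        return 0.0
    idle = deltas[3]
    if not count_iowait_as_busy:
        idle += deltas[4]
    return 1.0 - idle / total


if __name__ == "__main__":
    before = parse_cpu_line("cpu 100 0 50 800 50 0 0 0 0 0")
    after = parse_cpu_line("cpu 200 0 100 1400 250 0 0 0 0 0")
    print(cpu_utilization(before, after))
    print(cpu_utilization(before, after, count_iowait_as_busy=True))
```

The same two-sample pattern applies to per-container counters (for example, the cpuacct cgroup files), which gives the per-container granularity listed above.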

32


Controller

Watcher Limiter

Containers

Soft Container

Immediate response

How to immediately resize the containers?

33

HOW DO WE LOOK AT RESIZING?

a. Create a new container;
b. Live migrate the contents to the new container:
   1. Transfer the existing data to the new container;
   2. Transfer the instant data to the new container.
c. Stop the old container
d. Start the new container
e. Route the traffic to the new container

35

9527 /usr/sbin/httpd

Control Groups (cgroup):
• CPU time: 20
• System memory: 1G
• Disk bandwidth: 2000
• Network bandwidth: 100 Mbps

Control Groups (cgroup):
• CPU time: 70
• System memory: 5G
• Disk bandwidth: 8000
• Network bandwidth: 1 Gbps

a. Move the process to a new cgroup, or change the values of its current cgroup

b. Done!

IN THE CONTAINER WORLD…

We need to take a fresh look at resource management from the container’s perspective.
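Resizing in place then comes down to rewriting a few cgroup files while the process keeps running. The sketch below does this for CPU time only, assuming cgroup v1 CFS bandwidth files (`cpu.cfs_period_us`, `cpu.cfs_quota_us`) and assuming the "CPU time" figures are percentages of one CPU; both the interpretation and the function name are assumptions, not taken from the slides.

```python
import os


def resize_cpu(cgroup_dir: str, cpu_percent: float, period_us: int = 100_000) -> int:
    """Set the CPU quota of an existing cgroup to cpu_percent of one CPU.

    Returns the quota (in microseconds per period) that was written.
    """
    quota_us = int(period_us * cpu_percent / 100)
    with open(os.path.join(cgroup_dir, "cpu.cfs_period_us"), "w") as f:
        f.write(str(period_us))
    with open(os.path.join(cgroup_dir, "cpu.cfs_quota_us"), "w") as f:
        f.write(str(quota_us))
    return quota_us


if __name__ == "__main__":
    import tempfile
    # A scratch directory stands in for /sys/fs/cgroup/cpu/<group> here.
    with tempfile.TemporaryDirectory() as scratch:
        print(resize_cpu(scratch, 20))  # initial size
        print(resize_cpu(scratch, 70))  # resized in place: no migration, no restart
```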

36

SOFT CONTAINER: IMPLEMENTATION

Controller
• Algorithm "expand"
• Algorithm "pin_idle"
• Algorithm plugin N

Watcher
• CPU plugin
• Disk plugin
• Watcher plugin N

Limiter
• RunC plugin
• Docker plugin
• Limiter plugin N

Metrics Store
• CPU statistics
• Disk …
• More …

Container Repo
• RunC plugin
• Docker plugin
• Container type N

Containers
• Auto discovery

37

• Early version

• Support RunC and Docker containers

• A few effective controller algorithms

• Extensible with more plugins

SOFT CONTAINER: CURRENT STATUS

Completely runnable!

38

Demo Time :-)

https://youtu.be/rD50_RGUHXY

39

BENCHMARK RESULTS: BEFORE

If uncontrolled, the MySQL workload suffers severe interference from a co-located low-priority task

40

BENCHMARK RESULTS: BEFORE

CPU utilization is far from saturation, and the workload varies over time (although in my case, disk IO is highly utilized)

41

BENCHMARK RESULTS: SOFT CONTAINER

With Soft Container (green line), latency impact is controlled. (We can improve the algorithm to cope better with peak workload)

42


Soft Container helps improve CPU utilization by co-locating new tasks with MySQL

43


CPU utilization looks close to saturation, after adding in iowait time

44

• Soft Container monitors app resource needs and overall resource utilization in real time

• Soft Container issues resource controls in real time, to guard app SLAs and balance resource utilization

HOW DOES SOFT CONTAINER DO THIS?

45


How the resource bubble floats under the control of Soft Container. (The vibration thresholds are made very sensitive to workload changes)
