Opportunities and Challenges for Self-Aware Virtualized...

SAN FRANCISCO, CA, USA

Opportunities and Challenges

for Self-Aware Virtualized

Infrastructure Management

Xiaoyun Zhu

VMware

June 3, 2012

Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments

2

Agenda

• Background

• Virtualized infrastructure management (VIM)

Key problems

• Resource management

• Application performance management

• Management software optimization

Existing solutions

Challenges

Opportunities

• Conclusion


Virtualization transforms IT

Virtualization is No. 1 in Gartner’s top 10 technology-related trends

that will impact enterprise infrastructure in 2012

Key virtualization benefits

• Higher hardware utilization

• Easier deployment

• Elastic capacity

• Better agility via live migration

• Higher availability

• Fault tolerance

• Lower energy cost

3


Top trends in virtualization

4

• The majority of server deployments in 2010 was virtual.

• Virtual machine installed base would grow from 11M in 2009 to 58M in

2012

• By 2012, 50% of world’s applications running on x86 architecture

servers will run on virtual machines

• Spending on x86 server and desktop virtualization software would grow

at 23.7% CAGR through 2014.

• Hosted virtual desktops will reach 74M

users by 2014

• Someone turns on 1 VM every 6 seconds

• Average 5.5 vMotions every second

• data source:


A foundation for cloud computing

Private cloud

• Virtualized enterprise data

centers

• Internal IT run as a business

• Initial setup & ongoing

maintenance cost

• High consolidation ratio via

statistical multiplexing

Public cloud

• IaaS cloud providers such as

Amazon EC2, Windows Azure

• IT resources offered as utilities

• No capital cost, lower barrier to

entry – appealing to SMBs

• Lower consolidation ratio (1-1

virtual-to-physical mapping)

5

New challenges - virtualized infrastructure management (VIM)

Overload protection


VIM – Key problems

6

Resource management

Resource controls and hierarchy

Resource scheduling

VM placement & dynamic load balancing

Cluster-level power management

Cloud scale management

Application performance management (APM)

Anomaly detection

Automated performance modeling

Automated service level assurance

Management software optimization

Concurrency control

Automated log analysis

Self-awareness is key to all these areas!


Resource controls (CPU, memory, disk I/O*, network I/O*) [Gulati’12]

Reservation – minimum guaranteed amount of resources

Limit – upper bound on resource consumption (non-work-conserving)

Shares – relative importance during resource contention

Resource pool hierarchy

Represents resource delegation

RP-level resource controls (cpu/memory)

Challenges

Self-learning: How do these settings

impact application performance?

Auto-configure: How to find proper

<R,L,S> settings for each VM/group?


7


Efficient and fair scheduling

NUMA-aware scheduling

SMP vCPU co-scheduling

Latency sensitive applications

Challenge: Accurate VM demand estimation

• CPU ready time

• Memory access pattern

• Memory shared, compressed, swapped, etc.

• How to determine VM happiness?

Host power management (HPM) [Ali’11]

Dynamic tuning of processor P-states and C-states

Challenge: Minimize impact on application

performance

Resource scheduling

8


VMware Distributed Resource Scheduler (DRS) [Gulati’12]

Initial VM placement & admission control

Compute VM resource entitlement Ei

Monitor cluster imbalance:

where

Dynamic load balancing via live migration

Find moves via greedy hill-climbing algorithm

Respect technical & logical constraints (affinity & anti-affinity)

Challenges

Avoid VM ping-ponging

Cost-benefit analysis

Avoid NIC saturation?

Network topology aware?

VM placement and load balancing

9

hi ih CEN /)(

)( hC NI


Initial Configuration

3diskLUN 9diskLUN 6diskLUN

VMware Storage DRS (SDRS) [Gulati’10]

Improve storage performance via optimized placement of virtual disks

Auto-balance normalized load using workload models and device models

Device models learned via load injection during idle periods

Final Configuration

3diskLUN 9diskLUN 6diskLUN

via Storage vMotion

Dynamic I/O load balancing

Challenges

How to handle correlated workloads - Need time series based performance

stats (expensive) + efficient algorithm to incorporate correlation information

How to identify correlated LUNs to avoid wasted moves?



11

VMware Distributed Power Management (DPM) [Holler’09]

Consolidate VMs onto fewer hosts during low utilization periods

Power off idle hosts or power on standby hosts on demand

Work together seamlessly with host power management (HPM)

Challenges How to predict load spikes before they occur – from reactive to proactive?

How to incorporate temperature information?

How to support power capping?

0

500

1000

1500

2000

2500

3000

0:10 0:20 0:30 0:40 0:50 1:00 1:10 1:20 1:30

Clu

ster

Po

wer

Co

nsu

mp

tio

n (

W)

Elapsed Time (hh:mm)

HPM only DPM only DPM+HPM combined



12

How to scale to larger clusters? [Gulati’11]

Hierarchical – across cluster LB, different time scales?

Distributed – how to get a consistent view of the cluster?

Statistical – power of two choices, how to handle resource pools?

Additional challenges

Cloud-level vs. VM-level resource controls?

Aggregated cluster-level metrics for cloud-level placement?

VM migration between resource islands?

Cross-cluster or cross-datacenter load balancing?

Take environmental factors into consideration?


VIM – Key problems

13

Resource management


Resource scheduling

VM placement & dynamic load balancing



Application performance management (APM)

Anomaly detection



Management software optimization

Concurrency control

Automated log analysis

Self-awareness is key to all these areas!


APM in the cloud is hard!

Can I detect this earlier? – anomaly

detection

Whose fault is it? - problem localization

Why? – root cause analysis

What should I do now? - remediation

Bob: Cloud tenant

running a web application

SLO violation!

Alice: Cloud

infrastructure

provider


APM at the cloud-scale is harder!

Alice: Cloud

infrastructure

provider

Need performance analysis

and remediation solutions for

the cloud scale!

100s ~ 1000s

Bob’s


Traditional approaches

• Ask an expert

Pros:

• Deep technical knowledge and rich experience (mental models)

Cons:

• Highly-variable resolution times (minutes to weeks)

• Hard to scale up (can only look at a finite number of problems)

• Hard to scale out (hard to codify or diffuse mental models to others)

• Use a “cookbook” (best practices for non-experts)

Pros:

• Standard diagnosis steps and flow charts

(CPU -> memory -> storage -> network)

Cons:

• Only gives guidelines for problems seen before

• Likely to see new interactions or emergent behavior in a cloud


Main challenges in APM

17

Challenges IT organizations face regarding APM

Time spent on troubleshooting performance issues (63%)

Identifying performance issues before they affect end users (61%)

Management costs (48%)

Usability of application performance data (42%)

Visibility into the quality of the end-user experience (39%)

Inability to monitor each transaction across IT (38%)

Challenges regarding usability of performance data

Time spent correlating performance data (63%)

Amount of performance data that is not relevant (61%)

Issues they are not able to see (false negatives) (42%)

Getting alerts which are not valid (false positives) (32%)

* data source: TRAC Research


Alternative: Data-driven approach!

• Leverage rich telemetry for applications and systems

• Leverage a rich class of data mining and statistical learning

techniques

• Apply to anomaly detection, problem localization, root

cause analysis

Pros:

• Generic: Makes no a priori assumptions

• Automation: Easier to do partially or fully

• Scalable: Codify analysis and resolution in algorithms

Cons:

• Need to analyze large volumes of data offline or in real-time

• Can lead to false positives (signal vs. noise) or false-negatives

• Correlation vs. causation: Only hints at possible root causes

• Application-level performance stats not always available


Self-learned normal behavior

19

What can we do with just system-level stats data?

No advanced labeling of good vs. bad performance

Performance problem may manifest in abnormal metric values

Self-learned dynamic thresholds that define normal behavior

* VMware vCenter Operations (VCops) [Colbert’11]


Early anomaly detection

20

Assume typical problems cause multiple metrics to change values

Trend average “noise” level for an object and only raises alert when

number of abnormalities exceeds noise level

Early warning: Alerts may be raised before users are impacted

Challenges

Faster detection while reducing false alarms

Fingerprint: Set of metric anomalies matches previously seen problem

How to identify the root causes of a true anomaly?

Scale: Beyond the current 1000 hosts/10000 VMs limit

# o

f a

bn

orm

ali

ties



21

Challenges

Application performance depends on many factors

• Multiple resources (distributed, multi-tiered, resource types)

• Potential interference among co-hosted applications

What metrics to use for predicting application performance?

• >1800 performance stats for a single ESX host

Offline modeling may be insufficient

• Time-varying workloads or system configurations

Online learning and modeling can help

• Accuracy (precision & recall) when used for performance diagnosis

• Need automatic online change point detection

Ongoing work (vPerfGuard)

Auto-identify performance-critical (VM & hypervisor) metrics

Auto-learn models (online) that predict application performance

Auto-adapt models (online) when system or workload changes



22

Requested

allocation

Two-layered control architecture [Padala’09]

Challenges: Controller tolerance of modeling inaccuracy

Utility-driven service differentiation

Auto-discover control knobs – resource types, resource settings, etc.

App1 Controller

App2 Controller

App3 Controller

App4 Controller

DSK CPU DSK CPU DSK CPU

Real

allocation

Sensors Performance targets

Node I Controller

Node II Controller

Node III Controller


Conclusion

23

• Many challenges and open problems remain in virtualized

infrastructure management

• Self-awareness and adaptivity are key to solving many of

these problems

• Leveraging existing techniques in statistical learning,

control, and optimization can be very powerful

• Encourage more effective collaborations between

theoreticians and system builders


References

24

X. Liu et al., “Online response time optimization of Apache Web server.” IWQoS 2003.

A. Holler, “Distributed Power Management concepts and use.” 2009.

P. Padala et al. “Automated control of multiple virtualized resources.” Eurosys 2009.

A. Gulati et al. “PARDA: Proportional allocation of resources for distributed storage

access.” FAST 2009.

A. Gulati, et al. “BASIL: Automated IO load balancing across storage devices.” FAST

2010.

V. Soundararajan et al. , “The impact of management operations on the virtualized

datacenter.” ISCA 2010.

X. Zhu et al. “Adaptive polling in VMware vCloud Director with feedback control.”

FeBID 2011.

A. Gulati et al. “Cloud scale resource management: Challenges and techniques.”

HotCloud 2011.

K. Colbert, “vCenter Operations technical deep dive.” VMworld 2011.

Q. Ali, “Host power management in VMware vSphere 5.” August 2011.

A. Gulati et al. “VMware Distributed Resource Management: Design, Implementation,

and Lessons Learned.” VMware Technical Journal, Vol. 1(1), April 2012.


Questions?

25

Feedback Computing Workshop 2012 San Jose, CA, September 17, 2012

http://feedbackcomputing.org/

Co-located with International Conference on Autonomic Computing (ICAC 2012)

Submissions are due on June 15, 2012



Opportunities and Challenges for Self-Aware Virtualized...

Documents

Transcript of Opportunities and Challenges for Self-Aware Virtualized...