SAN FRANCISCO, CA, USA
Opportunities and Challenges
for Self-Aware Virtualized
Infrastructure Management
Xiaoyun Zhu
VMware
June 3, 2012
Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments
Agenda
• Background
• Virtualized infrastructure management (VIM)
  • Key problems
    • Resource management
    • Application performance management
    • Management software optimization
  • Existing solutions
  • Challenges
  • Opportunities
• Conclusion
Virtualization transforms IT
Virtualization is No. 1 in Gartner’s top 10 technology-related trends
that will impact enterprise infrastructure in 2012
Key virtualization benefits
• Higher hardware utilization
• Easier deployment
• Elastic capacity
• Better agility via live migration
• Higher availability
• Fault tolerance
• Lower energy cost
Top trends in virtualization
• The majority of server deployments in 2010 were virtual.
• Virtual machine installed base would grow from 11M in 2009 to 58M in
2012
• By 2012, 50% of world’s applications running on x86 architecture
servers will run on virtual machines
• Spending on x86 server and desktop virtualization software would grow
at 23.7% CAGR through 2014.
• Hosted virtual desktops will reach 74M
users by 2014
• Someone turns on 1 VM every 6 seconds
• Average 5.5 vMotions every second
• data source:
A foundation for cloud computing
Private cloud
• Virtualized enterprise data
centers
• Internal IT run as a business
• Initial setup & ongoing
maintenance cost
• High consolidation ratio via
statistical multiplexing
Public cloud
• IaaS cloud providers such as
Amazon EC2, Windows Azure
• IT resources offered as utilities
• No capital cost, lower barrier to
entry – appealing to SMBs
• Lower consolidation ratio (1-1
virtual-to-physical mapping)
New challenges - virtualized infrastructure management (VIM)
Overload protection
VIM – Key problems
Resource management
Resource controls and hierarchy
Resource scheduling
VM placement & dynamic load balancing
Cluster-level power management
Cloud scale management
Application performance management (APM)
Anomaly detection
Automated performance modeling
Automated service level assurance
Management software optimization
Concurrency control
Automated log analysis
Self-awareness is key to all these areas!
Resource controls (CPU, memory, disk I/O*, network I/O*) [Gulati’12]
Reservation – minimum guaranteed amount of resources
Limit – upper bound on resource consumption (non-work-conserving)
Shares – relative importance during resource contention
Resource pool hierarchy
Represents resource delegation
RP-level resource controls (cpu/memory)
Challenges
Self-learning: How do these settings
impact application performance?
Auto-configure: How to find proper
<R,L,S> settings for each VM/group?
Resource controls and hierarchy
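As a rough illustration of how reservation, limit, and shares interact under contention, the sketch below divides one host's capacity among VMs: each VM first receives its reservation, and the remainder is handed out in proportion to shares, never exceeding a VM's limit or its demand. The `VM` class, the `allocate` function, and the progressive-filling loop are assumptions made for illustration (with shares assumed positive); this is not the ESX scheduler's actual algorithm.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    reservation: float  # minimum guaranteed amount
    limit: float        # upper bound on consumption
    shares: float       # relative importance under contention (assumed > 0)
    demand: float       # how much the VM would use if unconstrained


def allocate(capacity, vms):
    """Hypothetical <R, L, S> divvy: reservations first, then share-proportional."""
    alloc = {vm.name: vm.reservation for vm in vms}
    # A VM never gets more than its limit, nor more than it can actually use.
    cap = {vm.name: min(vm.limit, max(vm.demand, vm.reservation)) for vm in vms}
    remaining = capacity - sum(alloc.values())
    if remaining < 0:
        raise ValueError("reservations exceed capacity (admission control should reject)")
    active = [vm for vm in vms if cap[vm.name] - alloc[vm.name] > 1e-9]
    while remaining > 1e-9 and active:
        total_shares = sum(vm.shares for vm in active)
        # Offer each active VM its share-proportional slice, clipped at its cap.
        grant = {
            vm.name: min(remaining * vm.shares / total_shares,
                         cap[vm.name] - alloc[vm.name])
            for vm in active
        }
        for name, amount in grant.items():
            alloc[name] += amount
        remaining -= sum(grant.values())
        # VMs that reached their cap drop out; leftovers are re-divided next round.
        active = [vm for vm in active if cap[vm.name] - alloc[vm.name] > 1e-9]
    return alloc
```

For example, with a capacity of 10 and VMs A (R=2, L=4, S=2, demand 10), B (R=0, L=10, S=1, demand 10), and C (R=0, L=10, S=1, demand 1), A stops at its limit of 4, C at its demand of 1, and B receives the remaining 5.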
Efficient and fair scheduling
NUMA-aware scheduling
SMP vCPU co-scheduling
Latency sensitive applications
Challenge: Accurate VM demand estimation
• CPU ready time
• Memory access pattern
• Memory shared, compressed, swapped, etc.
• How to determine VM happiness?
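One way to make CPU demand estimation concrete is to credit a VM with part of the time it spent ready to run but not scheduled. The sketch below scales ready time by the fraction of time the VM was runnable. The function name, its parameters, and the formula are illustrative assumptions, not a statement of how ESX computes demand.

```python
def estimate_cpu_demand(used, run, sleep, ready):
    """Estimate CPU demand over one sampling interval.

    All arguments are durations in the same unit (e.g. milliseconds):
      used  - CPU time actually consumed
      run   - time the vCPU was running
      sleep - time the vCPU was idle by choice
      ready - time the vCPU wanted to run but was waiting for a physical CPU
    """
    awake_or_idle = run + sleep
    if awake_or_idle == 0:
        # No information about the VM's duty cycle; fall back to observed usage.
        return used
    # Assume the VM would have been busy during ready time at the same
    # duty cycle it showed while it was scheduled.
    runnable_fraction = run / awake_or_idle
    return used + runnable_fraction * ready
```

Under this assumption, a VM with high ready time and little sleep time is treated as starved, while a mostly idle VM gains little from the same ready time.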
Host power management (HPM) [Ali’11]
Dynamic tuning of processor P-states and C-states
Challenge: Minimize impact on application
performance
Resource scheduling
VMware Distributed Resource Scheduler (DRS) [Gulati’12]
Initial VM placement & admission control
Compute VM resource entitlement $E_i$
Monitor cluster imbalance: $I_C = \sigma(N_h)$, where $N_h = \left(\sum_{i \in h} E_i\right) / C_h$
Dynamic load balancing via live migration
Find moves via greedy hill-climbing algorithm
Respect technical & logical constraints (affinity & anti-affinity)
Challenges
Avoid VM ping-ponging
Cost-benefit analysis
Avoid NIC saturation?
Network topology aware?
VM placement and load balancing
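The load-balancing loop on this slide can be sketched as follows: cluster imbalance is taken as the standard deviation of each host's normalized entitlement (the sum of its VMs' entitlements divided by the host's capacity), and a greedy hill-climbing pass repeatedly applies the single migration that lowers imbalance the most. The function names, the `max_moves` and `min_gain` parameters, and the data layout are hypothetical; constraints and cost-benefit analysis are deliberately left out, so this is a simplified sketch and not DRS itself.

```python
from statistics import pstdev


def imbalance(hosts, capacity, entitlement):
    """Std. deviation of normalized entitlement across hosts."""
    normalized = [
        sum(entitlement[vm] for vm in vms) / capacity[h]
        for h, vms in hosts.items()
    ]
    return pstdev(normalized)


def greedy_balance(hosts, capacity, entitlement, max_moves=10, min_gain=1e-3):
    """Greedy hill-climbing: apply the best single migration until gains are small."""
    hosts = {h: set(vms) for h, vms in hosts.items()}  # work on a copy
    moves = []
    for _ in range(max_moves):
        current = imbalance(hosts, capacity, entitlement)
        best = None  # (resulting imbalance, vm, source, destination)
        for src in hosts:
            for vm in sorted(hosts[src]):
                for dst in hosts:
                    if dst == src:
                        continue
                    # Tentatively move the VM, score the result, then undo.
                    hosts[src].remove(vm)
                    hosts[dst].add(vm)
                    score = imbalance(hosts, capacity, entitlement)
                    hosts[dst].remove(vm)
                    hosts[src].add(vm)
                    if best is None or score < best[0]:
                        best = (score, vm, src, dst)
        if best is None or current - best[0] < min_gain:
            break  # no move improves the balance enough
        _, vm, src, dst = best
        hosts[src].remove(vm)
        hosts[dst].add(vm)
        moves.append((vm, src, dst))
    return moves, hosts
```

The `min_gain` cutoff is one crude guard against ping-ponging: a move that barely changes the imbalance is never recommended.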
[Figure: initial configuration of a 3-disk LUN, a 9-disk LUN, and a 6-disk LUN]
VMware Storage DRS (SDRS) [Gulati’10]
Improve storage performance via optimized placement of virtual disks
Auto-balance normalized load using workload models and device models
Device models learned via load injection during idle periods
[Figure: final configuration of the 3-disk, 9-disk, and 6-disk LUNs, reached via Storage vMotion]
Dynamic I/O load balancing
Challenges
How to handle correlated workloads - Need time series based performance
stats (expensive) + efficient algorithm to incorporate correlation information
How to identify correlated LUNs to avoid wasted moves?
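One simple way to look for correlated LUNs, assuming per-LUN latency time series are available, is to compute the Pearson correlation between every pair of series and flag the pairs above a threshold. The function names, the 0.8 threshold, and the choice of latency as the signal are assumptions for illustration; the slide does not prescribe a method.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlated_luns(latency_series, threshold=0.8):
    """Return (lun_a, lun_b, r) for every pair whose latencies move together."""
    pairs = []
    for a, b in combinations(sorted(latency_series), 2):
        r = pearson(latency_series[a], latency_series[b])
        if r >= threshold:
            pairs.append((a, b, r))
    return pairs
```

A balancer could then skip moving a virtual disk between two flagged LUNs, since relieving one is unlikely to help if both degrade together. Note that the pairwise scan is quadratic in the number of LUNs, which is part of why the slide calls the time-series approach expensive.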
Cluster-level power management
VMware Distributed Power Management (DPM) [Holler’09]
Consolidate VMs onto fewer hosts during low utilization periods
Power off idle hosts or power on standby hosts on demand
Work together seamlessly with host power management (HPM)
Challenges
How to predict load spikes before they occur – from reactive to proactive?
How to incorporate temperature information?
How to support power capping?
[Figure: cluster power consumption (W) versus elapsed time (hh:mm) for three configurations: HPM only, DPM only, and DPM+HPM combined]
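The reactive consolidate-or-expand decision behind cluster-level power management can be sketched with two utilization thresholds. The function name, the `low` and `high` values, and the assumption of identical hosts are hypothetical; this is a minimal sketch, not DPM's actual policy.

```python
def dpm_recommendation(host_utilization, standby_hosts, low=0.45, high=0.80):
    """Recommend 'power_on', 'power_off', or 'none' for a cluster.

    host_utilization: utilization (0..1) of each powered-on host,
                      all hosts assumed to have identical capacity.
    standby_hosts:    number of hosts currently powered off and available.
    """
    if not host_utilization:
        return "power_on" if standby_hosts > 0 else "none"
    # Expand first: an overloaded host matters more than saving power.
    if max(host_utilization) > high and standby_hosts > 0:
        return "power_on"
    active = len(host_utilization)
    if active > 1 and min(host_utilization) < low:
        # Average utilization if one host is evacuated and its load spread evenly.
        after = sum(host_utilization) / (active - 1)
        if after <= high:
            return "power_off"
    return "none"
```

Because this rule only reacts to current utilization, it illustrates the first challenge above: a load spike is noticed only after it has already pushed a host past `high`.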
Cloud scale management
How to scale to larger clusters? [Gulati’11]
Hierarchical – across cluster LB, different time scales?
Distributed – how to get a consistent view of the cluster?
Statistical – power of two choices, how to handle resource pools?
Additional challenges
Cloud-level vs. VM-level resource controls?
Aggregated cluster-level metrics for cloud-level placement?
VM migration between resource islands?
Cross-cluster or cross-datacenter load balancing?
Take environmental factors into consideration?
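The "power of two choices" idea mentioned under the statistical approach can be sketched in a few lines: instead of scanning every host, sample a small number of hosts at random and place the VM on the least loaded of them. The function name, the unit-sized VMs, and the seeded generator are assumptions made to keep the sketch small and reproducible.

```python
import random


def place_vms(num_vms, num_hosts, choices=2, seed=0):
    """Place unit-sized VMs; each goes to the least loaded of `choices` random hosts."""
    rng = random.Random(seed)
    load = [0] * num_hosts
    for _ in range(num_vms):
        # Sample candidate hosts independently (with replacement).
        candidates = [rng.randrange(num_hosts) for _ in range(choices)]
        target = min(candidates, key=lambda h: load[h])
        load[target] += 1
    return load
```

Comparing `max(place_vms(10000, 100, choices=1))` with `max(place_vms(10000, 100, choices=2))` shows the effect: sampling two hosts typically keeps the most loaded host much closer to the average than purely random placement, while only two load values are consulted per decision.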
VIM – Key problems
Resource management
Resource controls and hierarchy
Resource scheduling
VM placement & dynamic load balancing
Cluster-level power management
Cloud scale management
Application performance management (APM)
Anomaly detection
Automated performance modeling
Automated service level assurance
Management software optimization
Concurrency control
Automated log analysis
Self-awareness is key to all these areas!
APM in the cloud is hard!
Can I detect this earlier? – anomaly
detection
Whose fault is it? - problem localization
Why? – root cause analysis
What should I do now? - remediation
Bob: Cloud tenant
running a web application
SLO violation!
Alice: Cloud
infrastructure
provider
APM at cloud scale is harder!
Alice: Cloud
infrastructure
provider
Need performance analysis
and remediation solutions for
the cloud scale!
100s ~ 1000s of Bobs
Traditional approaches
• Ask an expert
Pros:
• Deep technical knowledge and rich experience (mental models)
Cons:
• Highly variable resolution times (minutes to weeks)
• Hard to scale up (can only look at a finite number of problems)
• Hard to scale out (hard to codify or diffuse mental models to others)
• Use a “cookbook” (best practices for non-experts)
Pros:
• Standard diagnosis steps and flow charts
(CPU -> memory -> storage -> network)
Cons:
• Only gives guidelines for problems seen before
• Likely to see new interactions or emergent behavior in a cloud
Main challenges in APM
Challenges IT organizations face regarding APM
Time spent on troubleshooting performance issues (63%)
Identifying performance issues before they affect end users (61%)
Management costs (48%)
Usability of application performance data (42%)
Visibility into the quality of the end-user experience (39%)
Inability to monitor each transaction across IT (38%)
Challenges regarding usability of performance data
Time spent correlating performance data (63%)
Amount of performance data that is not relevant (61%)
Issues they are not able to see (false negatives) (42%)
Getting alerts which are not valid (false positives) (32%)
* data source: TRAC Research
Alternative: Data-driven approach!
• Leverage rich telemetry for applications and systems
• Leverage a rich class of data mining and statistical learning
techniques
• Apply to anomaly detection, problem localization, root
cause analysis
Pros:
• Generic: Makes no a priori assumptions
• Automation: Easier to do partially or fully
• Scalable: Codify analysis and resolution in algorithms
Cons:
• Need to analyze large volumes of data offline or in real-time
• Can lead to false positives (signal vs. noise) or false negatives
• Correlation vs. causation: Only hints at possible root causes
• Application-level performance stats not always available
Self-learned normal behavior
What can we do with just system-level stats data?
No advance labeling of good vs. bad performance
Performance problem may manifest in abnormal metric values
Self-learned dynamic thresholds that define normal behavior
* VMware vCenter Operations (VCops) [Colbert’11]
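A self-learned threshold can be as simple as a band around a metric's own recent history. The sketch below uses the mean plus or minus `k` standard deviations; the function names, the value of `k`, and the choice of a plain mean and standard deviation are assumptions for illustration and not the method used by vCenter Operations.

```python
from statistics import mean, pstdev


def dynamic_thresholds(history, k=3.0):
    """Learn a (lower, upper) normal band from a metric's own history."""
    mu = mean(history)
    sigma = pstdev(history)
    return mu - k * sigma, mu + k * sigma


def is_abnormal(value, history, k=3.0):
    """True when a new sample falls outside the learned band."""
    lower, upper = dynamic_thresholds(history, k)
    return value < lower or value > upper
```

No labels of good or bad performance are needed: the band is derived only from what the metric has done before. A practical version would likely learn separate bands per hour of day or day of week to follow workload cycles.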
Early anomaly detection
Assume typical problems cause multiple metrics to change values
Trends the average “noise” level for an object and raises an alert only when the
number of abnormalities exceeds that noise level
Early warning: Alerts may be raised before users are impacted
Challenges
Faster detection while reducing false alarms
Fingerprint: Set of metric anomalies matches previously seen problem
How to identify the root causes of a true anomaly?
Scale: Beyond the current 1000 hosts/10000 VMs limit
[Figure: plot of # of abnormalities]
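The noise-level idea can be sketched by trending the per-interval count of abnormal metrics for an object and alerting only when the current count clearly exceeds that trend. The exponentially weighted average, the `alpha` and `margin` parameters, and the function name are assumptions for illustration.

```python
def noise_alerts(abnormality_counts, alpha=0.2, margin=2.0):
    """Return the indices of intervals whose abnormality count exceeds the noise level.

    abnormality_counts: number of metrics outside their normal band, per interval.
    alpha:              smoothing factor of the trended noise level.
    margin:             how far above the noise level a count must be to alert.
    """
    alerts = []
    noise = float(abnormality_counts[0]) if abnormality_counts else 0.0
    for i, count in enumerate(abnormality_counts):
        if count > noise + margin:
            alerts.append(i)
        else:
            # Learn only from quiet intervals so an incident does not
            # inflate the noise level and mask the intervals that follow.
            noise = (1 - alpha) * noise + alpha * count
    return alerts
```

For the counts `[2, 3, 2, 3, 2, 9, 10, 3]`, the first five intervals establish a noise level a little above 2, and only the intervals with 9 and 10 abnormalities raise alerts.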
Automated performance modeling
Challenges
Application performance depends on many factors
• Multiple resources (distributed, multi-tiered, resource types)
• Potential interference among co-hosted applications
What metrics to use for predicting application performance?
• >1800 performance stats for a single ESX host
Offline modeling may be insufficient
• Time-varying workloads or system configurations
Online learning and modeling can help
• Accuracy (precision & recall) when used for performance diagnosis
• Need automatic online change point detection
Ongoing work (vPerfGuard)
Auto-identify performance-critical (VM & hypervisor) metrics
Auto-learn models (online) that predict application performance
Auto-adapt models (online) when system or workload changes
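Online change point detection can be illustrated with a two-sided CUSUM test on a model's prediction residuals (measured minus predicted performance): while the model fits, residuals hover near zero, and a sustained shift signals that the model should be relearned. CUSUM is a swapped-in standard technique chosen for this sketch, and the function name and the `drift` and `threshold` values are assumptions; the slide does not say which detector vPerfGuard uses.

```python
def cusum_change_point(residuals, drift=0.5, threshold=4.0):
    """Return the index at which a sustained residual shift is detected, else None.

    residuals: prediction errors of the current performance model, in order.
    drift:     slack subtracted each step so small fluctuations do not accumulate.
    threshold: cumulative sum level at which a change is declared.
    """
    s_pos = 0.0  # accumulates evidence of an upward shift
    s_neg = 0.0  # accumulates evidence of a downward shift
    for i, r in enumerate(residuals):
        s_pos = max(0.0, s_pos + r - drift)
        s_neg = max(0.0, s_neg - r - drift)
        if s_pos > threshold or s_neg > threshold:
            return i
    return None
```

When a change is reported, an online learner would discard or down-weight samples before that index and refit, which is one way to adapt a model after a workload or configuration change.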
Automated service level assurance
Two-layered control architecture [Padala’09]
Challenges: Controller tolerance of modeling inaccuracy
Utility-driven service differentiation
Auto-discover control knobs – resource types, resource settings, etc.
[Figure: App1 to App4 controllers take performance targets and sensor readings and send requested allocations to Node I to Node III controllers, which set the real CPU and disk (DSK) allocations]
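The two layers can be sketched as a pair of functions: a per-application controller that turns a performance error into a requested allocation, and a per-node controller that turns requests into real allocations when the node is oversubscribed. The proportional update rule, the gain, the bounds, and the proportional scale-down are simplifying assumptions; the controllers in [Padala’09] are model-based and more elaborate.

```python
def app_controller(current_request, measured, target, gain=0.5,
                   min_alloc=0.05, max_alloc=1.0):
    """Adjust an application's requested share of a resource.

    measured / target are response times: a positive error means the
    application is too slow and should request more.
    """
    error = (measured - target) / target
    request = current_request * (1 + gain * error)
    return min(max_alloc, max(min_alloc, request))


def node_controller(requests, capacity=1.0):
    """Turn requested allocations into real ones without exceeding capacity."""
    total = sum(requests.values())
    if total <= capacity:
        return dict(requests)
    # Oversubscribed: scale every request down by the same factor.
    scale = capacity / total
    return {app: r * scale for app, r in requests.items()}
```

Replacing the uniform scale-down with priority weights would be one way to express utility-driven service differentiation at the node layer.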
Conclusion
• Many challenges and open problems remain in virtualized
infrastructure management
• Self-awareness and adaptivity are key to solving many of
these problems
• Leveraging existing techniques in statistical learning,
control, and optimization can be very powerful
• Encourage more effective collaborations between
theoreticians and system builders
References
X. Liu et al., “Online response time optimization of Apache Web server.” IWQoS 2003.
A. Holler, “Distributed Power Management concepts and use.” 2009.
P. Padala et al., “Automated control of multiple virtualized resources.” EuroSys 2009.
A. Gulati et al., “PARDA: Proportional allocation of resources for distributed storage access.” FAST 2009.
A. Gulati et al., “BASIL: Automated IO load balancing across storage devices.” FAST 2010.
V. Soundararajan et al., “The impact of management operations on the virtualized datacenter.” ISCA 2010.
X. Zhu et al., “Adaptive polling in VMware vCloud Director with feedback control.” FeBID 2011.
A. Gulati et al., “Cloud scale resource management: Challenges and techniques.” HotCloud 2011.
K. Colbert, “vCenter Operations technical deep dive.” VMworld 2011.
Q. Ali, “Host power management in VMware vSphere 5.” August 2011.
A. Gulati et al., “VMware Distributed Resource Management: Design, Implementation, and Lessons Learned.” VMware Technical Journal, Vol. 1(1), April 2012.
Questions?
Feedback Computing Workshop 2012 San Jose, CA, September 17, 2012
http://feedbackcomputing.org/
Co-located with International Conference on Autonomic Computing (ICAC 2012)
Submissions are due on June 15, 2012