XS Oracle 2009 CVF
Transcript of XS Oracle 2009 CVF
The On-going Evolutions of Power Management in Xen
Kevin Tian
Open Source Technology Center (OTC)
Intel Corporation
Intel Confidential
Agenda
• Brief history
• Evolutions of Idle power management
• Evolutions of run-time power management
• Tools
• Experimental data about Xen power efficiency
• Power impact from VM
Brief History
May 2007: Xen 3.1 released
Jul. 2007: Host S3
Sep. 2007: Dom0-controlled frequency/voltage scaling
Jan. 2008: Xen 3.2 released, with preliminary cpufreq and cpuidle support in Xen
Jun. 2008: Mature deep C-states support
Aug. 2008: Xen 3.3 released: enhanced green computing, improved stability, better usability, …
Evolutions of Idle power management
(Xen cpuidle)
Xen Summit Boston 2008

[Diagram: the cpuidle picture as of Xen Summit Boston 2008. Dom0's ACPI parser registers C-state data into Xen via hypercall (external control); on idle entry the schedulers invoke the cpuidle driver, which uses Halt for C1 and Mwait/IO for C1/C2 under a ladder governor; timekeeping runs with a dynamic tick.]
Enhanced C-states support

[Diagram: the same cpuidle stack, now extended with C1E and deep C-states in the cpuidle driver, plus timer support and xentrace instrumentation in Xen.]
Case             Idle (Watt)   SPECpower (Score)
noPM xpsp1 hvm   100%          100%
noC3 xpsp1 hvm   91.40%        102.60%
C3 xpsp1 hvm     82.30%        107.70%
'noPM' has both cpuidle and cpufreq disabled; the other two cases have both enabled. Compared to 'C3', 'noC3' has its maximum C-state limited to C2.
For idle watts, a lower value means greener. For the SPECpower score, a higher value indicates better power efficiency.
TSC save/restore

Deep C-states freeze the TSC: each CPU's actual TSC falls behind the ideal TSC, and by different amounts on CPU0 and CPU1. The symptoms are "Time went backwards" warnings, lots of lost ticks, a faster time-of-day clock, unsynchronized TSCs, a fluctuating TSC scale factor, and Xen system time skew.

Solution: software compensation according to the elapsed count of a platform timer. Three variants, with different scale error:
– Restore the TSC from the platform counter elapsed since the current C-state entry, using a per-CPU platform-to-TSC scale
– Restore the TSC from the platform counter elapsed since the last calibration, using a per-CPU platform-to-TSC scale
– Restore the TSC from the platform counter elapsed since power-on, using a global platform-to-TSC scale

Hardware enhancement: a TSC that never stops in deep C-states (e.g. on Intel Core i7).
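The compensation above can be sketched in a few lines. This toy Python model is illustrative, not Xen's actual code; the function name and the HPET/TSC frequencies in the example are assumptions:

```python
# Toy model of TSC compensation: reconstruct the value a frozen TSC
# should show from the elapsed count of a platform timer and a
# platform-to-TSC scale factor.

def restore_tsc(platform_now, platform_ref, tsc_ref, scale):
    """platform_ref/tsc_ref are readings captured at the reference
    point: current C-state entry, last calibration, or power-on,
    depending on which variant above is used."""
    elapsed = platform_now - platform_ref
    return tsc_ref + int(elapsed * scale)

# Assumed example frequencies: HPET at 14.318 MHz, TSC at 2.8 GHz.
scale = 2.8e9 / 14.318e6
# One second of platform time (14,318,000 HPET ticks) restores roughly
# 2.8e9 TSC ticks on top of the reference value.
tsc_now = restore_tsc(14_318_000, 0, 0, scale)
```

The per-CPU versus global distinction above amounts to which `scale` and which reference pair each CPU uses.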
APIC timer freeze

Xen's timer path: the timer heap yields the nearest deadline, which reprograms the local APIC count-down timer; an interrupt fires when the count reaches zero, the local APIC ISR raises the timer softirq, and its handler scans and executes the expired timers. In deep C-states the local APIC timer is halted, so that interrupt is delayed.

Solution: use a platform timer (PIT/HPET) to carry the nearest deadline while the APIC timer is halted in deep C-states:
– Broadcast is required, since there are fewer platform timer sources than CPUs
– MSI-based HPET interrupts will come soon, providing more platform timer sources with reduced broadcast traffic
– An always-running APIC timer will be supported in new CPUs soon
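The broadcast scheme can be illustrated with a small model. This Python sketch is a hypothetical structure, not Xen's implementation: deep-idle CPUs park their next deadline with a shared platform timer, which is armed to the earliest one and kicks only the CPUs whose deadlines have passed:

```python
# Hypothetical model of PIT/HPET broadcast for CPUs whose local APIC
# timer freezes in deep C-states.

class BroadcastTimer:
    def __init__(self):
        self.deadlines = {}                    # cpu -> absolute deadline

    def enter_deep_cstate(self, cpu, deadline):
        """Park this CPU's deadline; return the value the shared
        platform timer should be (re)armed to: the earliest deadline."""
        self.deadlines[cpu] = deadline
        return min(self.deadlines.values())

    def exit_deep_cstate(self, cpu):
        self.deadlines.pop(cpu, None)          # local APIC timer works again

    def platform_timer_fired(self, now):
        """Broadcast ISR: return the CPUs that must be kicked awake."""
        expired = sorted(c for c, d in self.deadlines.items() if d <= now)
        for c in expired:
            del self.deadlines[c]
        return expired

bt = BroadcastTimer()
bt.enter_deep_cstate(0, deadline=150)
arm = bt.enter_deep_cstate(1, deadline=100)    # platform timer armed at 100
woken = bt.platform_timer_fired(now=100)       # only CPU1 is woken
```

The "fewer platform timer sources than CPUs" point is exactly why one shared deadline map serves all idle CPUs here.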
Menu governor

Ladder governor: works on adjacent states only. It promotes from Cn to Cn+1 when N consecutive residencies at Cn exceed the expected minimal residency at Cn, and demotes when the current Cn+1 residency falls below the expectation for Cn+1. Stepping through states one at a time is inefficient.

Menu governor: picks the target state directly from the whole table (e.g. C1: 1ns, 20w; C2: 10ns, 15w; C3: 100ns, 5w; …), based on the nearest timer deadline, the last unexpected break event, and the latency/power requirement.

Result (HVM WinXPsp1): 5.2% less idle watts consumed and a 1.6% higher SPECpower score. To be further tuned!
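The menu-style selection can be sketched as follows. This toy Python version reuses the slide's example latency/power table; the minimal-residency column and the prediction rule are invented for illustration and are not Xen's actual heuristics:

```python
# Toy menu-style C-state selection.
CSTATES = {            # name: (exit_latency_ns, power_w, min_residency_ns)
    "C1": (1,   20, 10),
    "C2": (10,  15, 100),
    "C3": (100,  5, 1000),
}

def menu_pick(next_deadline_ns, last_break_gap_ns, latency_req_ns):
    # Predict the idle interval: bounded by the nearest timer deadline
    # and by the observed gap to the last unexpected break event.
    predicted = min(next_deadline_ns, last_break_gap_ns)
    best = "C1"
    for name, (lat, _power, min_res) in CSTATES.items():
        # Deeper states appear later in the table, so the last state
        # satisfying both constraints wins.
        if lat <= latency_req_ns and min_res <= predicted:
            best = name
    return best
```

With a 5000 ns predicted idle window and a 200 ns latency budget this picks "C3"; tightening the budget to 50 ns forces "C2". Unlike the ladder, it jumps there in one decision.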
Range timer

Frequent C-state entry/exit may instead consume more power. Each timer now accepts a range for its expiration: [expiration, expiration + timer_slop], with timer_slop defaulting to 50us. Overlapping ranges can be merged to reduce the timer interrupt count, so the CPU stays in Cn longer instead of bouncing between C0 and Cn.
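The merging idea follows directly from the definition. This illustrative Python function (not Xen's implementation) groups timers whose [expiration, expiration + timer_slop] ranges overlap and fires one interrupt per group:

```python
# Illustrative range-timer merging. Each timer may fire anywhere in
# [expiration, expiration + timer_slop]; timers whose ranges overlap
# share one interrupt, fired at the latest expiration in the group so
# no timer fires early and none fires past its slop.

TIMER_SLOP_US = 50                      # default slop from the slide

def merge_ranges(expirations, slop=TIMER_SLOP_US):
    groups = []                         # (first_expiration, fire_time)
    for exp in sorted(expirations):
        if groups and exp <= groups[-1][0] + slop:
            groups[-1] = (groups[-1][0], exp)   # still inside group's range
        else:
            groups.append((exp, exp))           # start a new group
    return [fire for _first, fire in groups]

# Timers at 100, 120, and 140us share one interrupt at t=140; the
# timer at 300us gets its own.
fires = merge_ranges([100, 120, 140, 300])
```

Four timers, two interrupts: that reduction in wakeups is what lets the CPU sit in Cn for longer stretches.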
Range timer effect

timer_slop   Idle (watt)      SPECpower (score)
50us         100%             100%
1ms          92.50% (-7.5%)   101.20% (+1.2%)

(One UP HVM RHEL5u1; collected on a dual-core mobile platform.)

[Chart: timer interrupts/second (0-10000) for multiple idle UP HVM RHEL5u1 guests (1HVM, 2HVM, 4HVM), comparing range=50us against range=1ms.]
Current picture

[Diagram: the full Xen cpuidle stack today. Power-aware schedulers enter idle via the cpuidle driver (Halt/C1, Mwait/IO C1, C1E, deep C-states) under the ladder or menu governor; the timer subsystem adds range timers and PIT/HPET broadcast; timekeeping uses dynamic tick with TSC save/restore and always-running TSC; Dom0's ACPI parser registers data via hypercall (external control); xenpm retrieves statistics and xentrace provides tracing.]
Evolutions of run-time power management
(Xen cpufreq)
Xen Summit Boston 2008

[Diagram: run-time power management as of Xen Summit Boston 2008. In Dom0, the Linux cpufreq stack (cpufreq core; drivers: ACPI cpufreq, PowerNow-K8, others; governors: ondemand, userspace, others) controlled frequency, with Xen enabling MSR access permission, answering idle-state queries from the schedulers, and Linux PM tools driving policy. Alongside it sat a tiny cpufreq core in Xen, with an ondemand governor and an ACPI cpufreq (IA32) driver registered from Dom0's ACPI parser.]
Current picture

[Diagram: the cpufreq core has moved into Xen. The tiny cpufreq core now carries ondemand, userspace, performance, and powersave governors; ACPI cpufreq drivers for IA32 and IA64 plus PowerNow-K8; statistics and control exposed to xenpm in Dom0; CPU offline/online integration. Dom0 retains the ACPI parser and registration path. Planned next: turbo mode, more governors, more drivers, and an enhanced user interface.]
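An ondemand-style governor can be sketched as load-proportional frequency selection. In this Python toy, the frequency table, the up-threshold, and the scaling rule are all illustrative assumptions, not the actual defaults of Xen's ondemand governor:

```python
# Toy ondemand-style governor: jump to the maximum frequency under
# high load, otherwise pick the lowest available frequency that can
# still cover the measured load.

FREQS_MHZ = [800, 1600, 2400]     # available P-state frequencies (assumed)
UP_THRESHOLD = 0.80               # load above this jumps to max frequency

def ondemand_pick(load):
    """Map a load sample in [0, 1] to a target frequency in MHz."""
    if load > UP_THRESHOLD:
        return FREQS_MHZ[-1]
    target = load * FREQS_MHZ[-1]     # scale proportionally to load...
    for f in FREQS_MHZ:               # ...rounded up to an available step
        if f >= target:
            return f
    return FREQS_MHZ[-1]
```

So a 30% load sample settles at 800 MHz, 60% steps up to 1600 MHz, and anything above the threshold snaps straight to 2400 MHz.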
Tools
Xenpm (runs in Dom0):
– Retrieves run-time statistics about Xen cpuidle and cpufreq
– Applies user policy on the exposed control knobs of Xen cpufreq (governor, set freq, etc.)
– More capabilities to be added later, e.g. profile…

Xentrace (runs in Xen): logs every state change for Xen cpuidle and cpufreq; the raw data can be further processed by other scripts:

CPU0 391365842416 (+   21204) cpu_idle_entry [ idle to state 2 ]
CPU0 391375951050 (+10108634) cpu_idle_exit [ return from state 2 ]
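The "processed by other scripts" step might look like this. The small Python sketch below assumes the field layout of the example lines above (the regex and function names are hypothetical, not part of any shipped tool), pairing each cpu_idle_entry with the following cpu_idle_exit to sum per-CPU, per-state residency:

```python
# Post-processing sketch for xentrace cpuidle records, accumulating
# residency in TSC ticks per (cpu, C-state).
import re

LINE = re.compile(
    r"(CPU\d+)\s+(\d+)\s+\(\+\s*(\d+)\)\s+(cpu_idle_\w+)\s+\[.*?state\s+(\d+)")

def residencies(lines):
    entry, totals = {}, {}
    for ln in lines:
        m = LINE.search(ln)
        if not m:
            continue
        cpu, tsc, _delta, event, state = m.groups()
        if event == "cpu_idle_entry":
            entry[cpu] = (int(tsc), int(state))
        elif event == "cpu_idle_exit" and cpu in entry:
            t0, st = entry.pop(cpu)
            totals[(cpu, st)] = totals.get((cpu, st), 0) + int(tsc) - t0
    return totals

trace = [
    "CPU0 391365842416 (+   21204) cpu_idle_entry [ idle to state 2 ]",
    "CPU0 391375951050 (+10108634) cpu_idle_exit  [ return from state 2 ]",
]
# residencies(trace) -> {('CPU0', 2): 10108634}
```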
Experimental data about Xen power efficiency
• All data shown in this section:
– For reference only and not guaranteed
– Collected on a dual-core mobile platform, with one HVM guest created
• Server consolidation effects with multiple VMs/workloads are still in progress
• Improvement when Xen cpuidle and cpufreq are enabled:
– The SPECpower score is normalized (the 100% noPM score is 1000 ssj_ops/watt)
– Consumed watts are normalized similarly (the idle noPM consumption is 10w)
[Chart: normalized SPECpower score (ssj_ops/watt, 0-1500) and normalized power (watt, 0-25) versus SPECpower workload (0-100%), with curves noPM-score, PM-score, noPM-watt, and PM-watt. The PM curves show reduced power and improved efficiency.]
• Below is a more compelling comparison
– Native WinXPsp1's idle watt and SPECpower score are both normalized to 100% as the base

[Chart: normalized idle watt and SPECpower (ssj_ops/watt) percentages, 50-140%, for native xpsp1, native rhel5u1, xen xpsp1 hvm, kvm xpsp1 hvm, xen rhel5u1 hvm, and kvm rhel5u1 hvm.]

Observations:
– Xen and KVM show similar idle power consumption
– HZ=1000 in RHEL5u1 incurs a high timer interrupt rate
– Xen is slightly more power efficient than KVM
Power impact from VM
• The VMM shouldn't be blamed as the only reason for high power consumption!
• A 'bad' VM can eat power
– Just like a 'bad' application can in a native OS
– By causing frequent break events (e.g. timer interrupts) with short C-state residencies
• Which parts of a VM can draw high power?
– 'Bad' applications hurt just as they would on native
– How the guest OS is implemented also matters:
• Periodic tick frequency (HZ)
• Timer usage in drivers
• Time sub-system implementation
• …
• A green guest OS wins!
– Smaller HZ, idle tickless or fully dynamic tick, range timers, etc.
Idle power consumption

[Chart: normalized idle watt (80-120%) and average C-state residency (0-4ms) for bare, Dom0, PV, HVM WinXP, and guests at HZ=1000, HZ=250, HZ=100, and tickless. The 'green' low-HZ/tickless guests sit at the low-power, long-residency end; the high-HZ guest is the 'bad' case.]
Power efficiency

[Chart: normalized SPECpower score (ssj_ops/watt, roughly 800-1020) for bare, Dom0, PV, HVM WinXP, and guests at HZ=1000, HZ=100, and tickless.]
Legal Information
• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT.
• Intel may make changes to specifications, product descriptions, and plans at any time, without notice.
• All dates provided are subject to change without notice.
• Intel is a trademark of Intel Corporation in the U.S. and other countries.
• *Other names and brands may be claimed as the property of others.
• Copyright © 2007, Intel Corporation. All rights reserved.