XS Oracle 2009 CVF

26
The On-going Evolutions of Power Management in Xen Kevin Tian Open Source Technology Center (OTC) Intel Cooperation

description

Ze'ev Maor: Client Virtualization Framework

Transcript of XS Oracle 2009 CVF

Page 1: XS Oracle 2009 CVF

The On-going Evolutions of Power Management in Xen

Kevin Tian

Open Source Technology Center (OTC)

Intel Cooperation

Page 2: XS Oracle 2009 CVF

2

Intel Confidential

Agenda

• Brief history

• Evolutions of Idle power management

• Evolutions of run-time power management

• Tools

• Experimental data about Xen power efficiency

• Power impact from VM

Page 3: XS Oracle 2009 CVF

3

Intel Confidential

Brief History

Xen 3.1

Jul. 2007

Host S3

Sep. 2007

Dom0 controlled freq/volscaling

Jan. 2008

Xen 3.2 released

May. 2007

Preliminary cpufreq and

cpuidle support in Xen

Jun. 2008

Mature deep C-states support

Aug. 2008

Xen 3.3 released

Enhanced green computing

Improved stability

Better usability

Page 4: XS Oracle 2009 CVF

4

Intel Confidential

Evolutions of Idle power management

(Xen cpuidle)

Page 5: XS Oracle 2009 CVF

5

Intel Confidential

Enter Idle

Schedulers

Halt (C1)

CpuIdle driver

Mwait/IO C1

Hypercall Ladder

Xen

Dom0

ACPI Parser

ExternalControl

Registration

TimekeepingDynamic

TickC2

Xen Summit Boston 2008

Page 6: XS Oracle 2009 CVF

6

Intel Confidential

Enter Idle

Schedulers

Halt (C1)

CpuIdle driver

Mwait/IO C1

C1E

Deep C-states

Hypercall Ladder

Timer

Xentrace

Xen

Dom0

ACPI Parser

ExternalControl

Registration

TimekeepingDynamic

TickC2

Enhanced C-states support

100%

91.40%

82.30%

100%

102.60%

107.70%

0%

20%

40%

60%

80%

100%

120%

Perc

enta

ge

id le (Watt) SPECpow e r(Scor e )

noPM x psp1 hv m

noC3 xps p1 hvm

C3 xpsp1 hvm

‘noPM’ has both cpuidle and cpufreq disabled, and vice versa for other two cases. Compared to ‘C3’, noC3 has maximum C-state limited to C2

For idle watt, lower value means greener. For SPECpower score, higher value indicates more power efficient

Page 7: XS Oracle 2009 CVF

7

Intel Confidential

Enter Idle

Schedulers

Halt (C1)

CpuIdle driver

Mwait/IO C1

C1E

Deep C-states

Hypercall Ladder

Timer

Xentrace

Xen

Dom0

ACPI Parser

ExternalControl

Registration

TimekeepingDynamic

TickC2

TSC freeze

TSC save/restore

Always running TSC

Ideal TSC

Actual TSC

TSC freeze

CPU0

CPU1

Time went backwards warning

Lots of lost ticks

Faster ToD

Unsynchronized TSC

Fluctuating TSC scale factor

Xen system time skew

Software compensation according to elapsed

counter of platform timer

……

…Percpu platform to TSC scale

Global platform to TSC scale

Platform counterTSC

0 1 sec 1 sec

TSC

TSC

TSC

Percpu platform to TSC scale

Restore TSC upon elapsed platform counter since current entry

Restore TSC upon elapsed platform counter since last calibration

Restore TSC upon elapsed platform counter since power on

Platform counterTSC

Hardware enhancement to have TSC never stopped (e.g. by Intel Core-i7)

Scale error

Page 8: XS Oracle 2009 CVF

8

Intel Confidential

Enter Idle

Schedulers

Halt (C1)

CpuIdle driver

Mwait/IO C1

C1E

Deep C-states

Hypercall Ladder

Timer

Xentrace

Xen

Dom0

ACPI Parser

ExternalControl

Registration

TimekeepingDynamic

TickC2

APIC timer freeze

PIT/HPET broadcast

TSC save/restore

Always running TSC

T

T T

T T TT

Timer heapNearest deadline Reprogram local APIC

count-down timer

interrupt when count down to

zero

(Delayed)

Local APIC ISRTimer softirq handler

Scan/execute expired timers

APIC timer freeze

Solutionuse platform timer (PIT/HPET) to carry nearest

deadline, when APIC timer is halted in deep C-states:

is required since number of platform timer source is less than number of CPUs

will come soon to have more platform timer sources with reduced broadcast traffic

will be supported in new CPU soon

Broadcast

MSI based HPET interrupt

Always running APIC timer

Page 9: XS Oracle 2009 CVF

9

Intel Confidential

Enter Idle

Schedulers

Halt (C1)CpuIdle driver

Mwait/IO C1

C1E

Deep C-states

Hypercall Ladder

Timer

Xentrace

Xen

Dom0

ACPI Parser

ExternalControl

Registration

TimekeepingDynamic

Tick

C2

Menu governor

PIT/HPET broadcast

TSC save/restore

Always running TSC

Menu

Ladder

C0

Cn

Cn+1

Expected minimal residency at CnExpected minimal residency at Cn+1

Promotion if continuous N Cn residencies > expectation Demotion if current Cn+1

residency < expectation

Inefficient

C1, 1ns, 20w

Menu

C3, 100ns, 5w

C2, 10ns, 15w

Nearest timer deadline

Last unexpected break event

Latency/power requirement

PICK

less idle watt consumed (-5.2%)Higher SPECpower score (+1.6%)

(HVM WinWPsp1)

To be further tuned!

Page 10: XS Oracle 2009 CVF

10

Intel Confidential

Enter Idle

Schedulers

Halt (C1)

CpuIdle driver

Mwait/IO C1

C1E

Deep C-states

Hypercall Ladder

Timer

Xentrace

Xen

Dom0

ACPI Parser

ExternalControl

Registration

TimekeepingDynamic

TickC2

Range timer

PIT/HPET broadcast

TSC save/restore

Always running TSC

Menu

Range timer

0(ms) 1 2 3 4 5 6

0(ms) 1 2 3 4 5 6

C0

Cn

C0

Cn

For each timer, it accepts a range for expiration now:

[expiration, expiration + timer_slop]

(default 50us for timer_slop)

Overlapping ranges can be merged to reduce timer

interrupt count

Frequent C-stateEntry/exit may

Instead consumeMore power

Page 11: XS Oracle 2009 CVF

11

Intel Confidential

Range timer effect

50us 100% 100%

1ms 92.50% 101.20%

Idle(watt) SPECpower(score)

-7.5% +1.2%

One UP HVM RHEL5u1

Collected on a two-cores mobile platform

Multiple idle UP HVM RHEL5u1

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

1HVM 2HVM 4HVM

timer interrupt/second

range=50us

range=1ms

Page 12: XS Oracle 2009 CVF

12

Intel Confidential

Enter Idle

Schedulers

Halt (C1)

Power Aware

CpuIdle driver

Mwait/IO C1

C1E

Deep C-states

Hypercall Statistics Ladder

TimerRange timer

XentracePIT/HPET broadcast

TSC save/restore

Always running TSC

Xen

Dom0

ACPI Parser

ExternalControl

Registration

Xenpm

Menu

TimekeepingDynamic

TickC2

Current picture

Page 13: XS Oracle 2009 CVF

13

Intel Confidential

Evolutions of run-time power management

(Xen cpufreq)

Page 14: XS Oracle 2009 CVF

14

Intel Confidential

Xen Summit Boston 2008

Xen

Dom0

ACPI Parser

ExternalControl

Linux Cpufreq

Cpufreq core

ACPI cpufreq

PowerNow-K8

Others

OnDemand

UserSpace

Others

EnableMSR

AccessPerm

Schedulers

Query

Idle

State

LinuxPM

Tools

Tiny cpufreq core

OnDemand

ACPICpufreq(IA32)

Registration

Tiny cpufreq core

Page 15: XS Oracle 2009 CVF

15

Intel Confidential

Current picture

Xen

Dom0

ACPI Parser

ExternalControl

Linux Cpufreq

Cpufreq core

ACPI cpufreq

PowerNow-K8

Others

OnDemand

UserSpace

Others

EnableMSR

AccessPerm

Schedulers

Query

Idle

State

LinuxPM

Tools

Tiny cpufreq core

OnDemand

ACPICpufreq(IA32)

Xenpm

UserSpace

Performance

PowerSave

Statistics

Control

ACPICpufreq(IA64)

PowerNowK8

Registration

Cpu offline / online

?

Tiny cpufreq core

Turbo mode

More governors

More drivers

EnhancedUser

Interface

Page 16: XS Oracle 2009 CVF

16

Intel Confidential

Tools

Page 17: XS Oracle 2009 CVF

17

Intel Confidential

Xentrace

Xen

Dom0Xenpm Retrieve run-time

statistics about Xen cpuidle and cpufreq

Apply user policy on exposed control knobs of Xen cpufreq (governor, set freq, etc)

More capabilities to be added later, e.g. profile…

Log every state change for Xen cpuidle and cpufreq:

Raw data could be further processed by other scripts

CPU0 391365842416 (+ 21204) cpu_idle_entry [ idle to state 2 ]CPU0 391375951050 (+10108634) cpu_idle_exit [ return from state 2 ]

Page 18: XS Oracle 2009 CVF

18

Intel Confidential

Experimental data about Xen power efficiency

Page 19: XS Oracle 2009 CVF

19

Intel Confidential

• All data shown in this section:– For reference only and not guaranteed– On a two-core mobile platform, with one HVM guest created

• Server consolidation effect with multiple VMs/workloads are in progress

• Improvement when Xen cpuidle and cpufreq are enabled– SPECpower score is normalized (100% noPM score is 1000 ssj_ops / watt)– Similarly, consumed watt is also normalized (idle noPM watt is 10w)

0500

1000

1500

0 10 20 30 40 50 60 70 80 90 100

SPECpower workload (%)

normalized score (ssj_ops/watt)

0

5

10

15

20

25

Normalized power (watt)

noPM-score

PM-score

noPM-wattPM-watt

ReducedPower!

ImprovedEfficiency

Page 20: XS Oracle 2009 CVF

20

Intel Confidential

• Below is a more attracting comparison– Native WinXPsp1’s idle watt and SPECpower score are both normalized to 100% as the base

50%

60%

70%

80%

90%

100%

110%

120%

130%

140%

idle (Watt) SPECpower (ssj_ops/watt)

Nor

mal

ized

per

cent

age

(%)

native xpsp1native rhel5u1xen xpsp1 hvmkvm xpsp1 hvmxen rhel5u1 hvmkvm rhel5u1 hvm

Similar Idle power consumption for Xen and KVM

Native Native

HZ=1000 in RHEL5u1 incurs high

timer interrupt

Xen is slightly more power

efficient than KVM

Page 21: XS Oracle 2009 CVF

21

Intel Confidential

Power impact from VM

Page 22: XS Oracle 2009 CVF

22

Intel Confidential

• VMM shouldn’t be blamed as only reason for high power consumption!

• A ‘bad’ VM could eat power– Just like what ‘bad’ application could do in a native OS

– Cause high break events (e.g. timer interrupts) with short C-state residency

• Which parts in VM could draw high power?– ‘Bad’ applications hurt just as what they could on native

– How guest OS is implemented also matters• Periodic tick frequency – HZ• Timer usage in drivers• Time sub-system implementation• …

• Green guest OS wins!– Smaller HZ, idle tickles or fully dynamic tick, range timer, etc.

Page 23: XS Oracle 2009 CVF

23

Intel Confidential

80.00%

85.00%

90.00%

95.00%

100.00%

105.00%

110.00%

115.00%

120.00%

bare Dom0 PV HVM Winxp HZ=1000 HZ=250 HZ=100 tickless

Normalized watt

0

0.5

1

1.5

2

2.5

3

3.5

4

Average residency (ms)

Idle power consumption

Average C-state residency

Green Green

‘Bad’guest

Idle power consumption

Page 24: XS Oracle 2009 CVF

24

Intel Confidential

Power efficiency

800

820

840

860

880

900

920

940

960

980

1000

1020

bare Dom0 PV HVM Winxp HZ=1000 HZ=100 tickless

Normalized score (ssj_ops / watt)

SPECpower score

Page 25: XS Oracle 2009 CVF

25

Intel Confidential

Legal Information

• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT.

• Intel may make changes to specifications, product descriptions, and plans at any time, without notice.

• All dates provided are subject to change without notice.

• Intel is a trademark of Intel Corporation in the U.S. and other countries.

• *Other names and brands may be claimed as the property of others.

• Copyright © 2007, Intel Corporation. All rights are protected.

Page 26: XS Oracle 2009 CVF

26

Intel Confidential