1
High-Performance, Power-Aware Computing
Vincent W. Freeh, Computer Science, NCSU, [email protected]
2
Acknowledgements
Students Mark E. Femal – NCSU Nandini Kappiah – NCSU Feng Pan – NCSU Robert Springer – Georgia
Faculty Vincent W. Freeh – NCSU David K. Lowenthal – Georgia
Sponsor IBM UPP Award
3
The case for power management in HPC
Power/energy consumption a critical issue Energy = Heat; Heat dissipation is costly Limited power supply Non-trivial amount of money
Consequence Performance limited by available power Fewer nodes can operate concurrently
Opportunity: bottlenecks Bottleneck component limits performance of other components Reduce power of some components, not overall performance
Today, the CPU is: a major power consumer (~100 W), rarely the bottleneck, and scalable in power/performance (frequency & voltage)
Power/performance “gears”
4
Is CPU scaling a win?
Two reasons:
1. Frequency and voltage scaling – performance reduction is less than power reduction
2. Application throughput – throughput reduction is less than performance reduction
Assumptions: CPU is a large power consumer; CPU drives the other components; diminishing throughput gains
[Figure: power (curve 1) and application throughput (curve 2) vs. performance (frequency)]
CPU power: P = ½CV²f
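The scaling argument can be sketched numerically. This is an illustrative calculation, not data from the talk: the capacitance is assumed, and the two operating points (2000 MHz @ 1.5 V, 1200 MHz @ 1.1 V) are taken from the gear table later in the deck.

```python
# Illustrative sketch: with P = 1/2 * C * V^2 * f, scaling frequency
# and voltage together cuts power faster than it cuts performance.
# C is an assumed effective switched capacitance, not a measured value.

def dynamic_power(c, v, f):
    """CMOS dynamic power: P = 1/2 * C * V^2 * f."""
    return 0.5 * c * v * v * f

C = 1.0e-9  # farads, illustrative

# Top gear (2000 MHz @ 1.5 V) vs. a lower gear (1200 MHz @ 1.1 V)
p_high = dynamic_power(C, 1.5, 2000e6)
p_low = dynamic_power(C, 1.1, 1200e6)

perf_ratio = 1200 / 2000      # performance drops to 60%
power_ratio = p_low / p_high  # power drops to about 32%

print(f"performance ratio: {perf_ratio:.2f}")  # 0.60
print(f"power ratio:       {power_ratio:.2f}")  # 0.32
```

Power falls roughly twice as fast as performance here, which is the first of the two reasons above.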
5
AMD Athlon-64
x86 ISA 64-bit technology HyperTransport technology – fast memory bus Performance
Slower clock frequency Shorter pipeline (12 vs. 20) SPEC2K results
2 GHz AMD-64 is comparable to a 2.8 GHz P4; the P4 is better on average by 10% (INT) and 30% (FP)
Frequency and voltage scaling 2000 – 800 MHz 1.5 – 1.1 Volts
6
LMBench results
LMBench Benchmarking suite Low-level, micro data
Test each “gear”
Gear  Frequency (MHz)  Voltage (V)
0     2000             1.5
1     1800             1.4
2     1600             1.3
3     1400             1.2
4     1200             1.1
6     800              0.9
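From the gear table, relative dynamic power per gear can be estimated with P ∝ V²f. A small sketch, normalized to gear 0 (gear 5 is left out because the slide omits it):

```python
# Relative dynamic power per gear from the table above, using P ∝ V^2 * f,
# normalized to gear 0. Gear 5 is omitted, matching the slide.

GEARS = {  # gear: (frequency in MHz, voltage in V)
    0: (2000, 1.5),
    1: (1800, 1.4),
    2: (1600, 1.3),
    3: (1400, 1.2),
    4: (1200, 1.1),
    6: (800, 0.9),
}

def relative_power(gear):
    """Power of a gear relative to gear 0, assuming P ∝ V^2 * f."""
    f0, v0 = GEARS[0]
    f, v = GEARS[gear]
    return (v * v * f) / (v0 * v0 * f0)

for g in sorted(GEARS):
    print(f"gear {g}: {relative_power(g):.2f}x gear-0 power")
```

The lowest gear draws roughly 14% of gear-0 dynamic power at 40% of the frequency.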
7
Operations
8
Operating system functions
9
Communication
10
Energy-time tradeoff in HPC
Measure application performance Different than micro benchmarks Different between applications
Look at NAS – standard suite, several HPC applications
Scientific, regular
11
Single node – EP
CPU bound: big time penalty, no (little) energy savings
[Chart annotations (Δtime, Δenergy): +11%, −2%; +45%, +8%; +150%, +52%; +25%, +2%; +66%, +15%]
12
Single node – CG
[Chart annotations (Δtime, Δenergy): +1%, −9%; +10%, −20%]
Not CPU bound: little time penalty, large energy savings
13
Operations per miss
Metric for memory pressure Must be independent of time Uses hardware performance counters
Micro-operations x86 instructions become one or more micro-operations Better measure of CPU activity
Operations per miss (subset of NAS)
Suggestion: Decrease gear as ops/miss decreases
Benchmark:  EP   BT    LU    MG    SP    CG
Ops/miss:   844  79.6  73.5  70.6  49.5  8.60
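The suggestion above (lower the gear as ops/miss falls) can be sketched as a simple policy. The thresholds here are hypothetical, chosen only to separate the benchmarks in the table; a real controller would derive them from measured energy-time profiles.

```python
# Hypothetical gear-selection policy keyed on operations per cache miss.
# Thresholds are illustrative assumptions, not values from the talk.

OPS_PER_MISS = {"EP": 844, "BT": 79.6, "LU": 73.5,
                "MG": 70.6, "SP": 49.5, "CG": 8.60}

def suggest_gear(ops_per_miss):
    """Map memory pressure to a gear (0 = fastest)."""
    if ops_per_miss > 500:  # CPU bound: slowing down mostly costs time
        return 0
    if ops_per_miss > 60:   # moderate memory pressure
        return 1
    if ops_per_miss > 20:
        return 2
    return 4                # memory bound: deep slowdown is nearly free

for bench, opm in OPS_PER_MISS.items():
    print(f"{bench}: gear {suggest_gear(opm)}")
```

Under these assumed thresholds, EP stays in the top gear while CG drops to a deep gear, matching the EP and CG results above.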
14
Single node – LU
[Chart annotations (Δtime, Δenergy): +4%, −8%; +10%, −10%]
Modest memory pressure: gears offer an E-T tradeoff
15
Ops per miss, LU
16
Results – LU
Shift 0/1:  +1%, −6%
Auto shift: +3%, −8%
Gear 1:     +5%, −8%
Gear 2:     +10%, −10%
Shift 1/2:  +1%, −6%
Shift 0/2:  +5%, −8%
17
Bottlenecks
Intra-node Memory Disk
Inter-node Communication Load (im)balance
18
Multiple nodes – EP
S2 = 2.0, S4 = 4.0, S8 = 7.9
Perfect speedup: E constant as N increases
E = 1.02
19
Multiple nodes – LU
S2 = 1.9, E2 = 1.03
S4 = 3.3, E4 = 1.15
S8 = 5.8, E8 = 1.28
Good speedup: E-T tradeoff as N increases
Gear 2: S8 = 5.3, E8 = 1.16
20
Multiple nodes – MG
S2 = 1.2, E2 = 1.41
S4 = 1.6, E4 = 1.99
S8 = 2.7, E8 = 2.29
Poor speedup: increased E as N increases
21
Normalized – MG
With a communication bottleneck, the E-T tradeoff improves as N increases
22
Jacobi iteration
Can increase N, decrease T, and decrease E
23
Future work
We are working on inter-node bottlenecks.
24
Safe overprovisioning
25
The problem: a peak power limit, P
Rack power; room/utility; heat dissipation
Static solution: the number of servers is N = P/Pmax, where Pmax is the maximum power of an individual node
Problem: peak power exceeds average power (Pmax > Paverage); not all power is used – N × (Pmax − Paverage) goes unused; it underperforms – performance is proportional to N; power consumption is not predictable
26
Safe overprovisioning in a cluster: allocate and manage power among M > N nodes
Pick M > N, e.g., M = P/Paverage
Since M × Pmax > P, enforce a per-node limit Plimit = P/M
Goal: use more power, safely under the limit; reduce the power (and peak CPU performance) of individual nodes; increase overall application performance
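As a concrete sketch of the sizing above (all wattages are assumed for illustration, not measurements from the talk): a 10 kW budget, nodes peaking at 250 W but averaging 180 W.

```python
# Sizing sketch for safe overprovisioning. P, PMAX, and PAVG are
# assumed illustrative values.

P = 10_000   # total power budget in watts (assumed)
PMAX = 250   # per-node peak power in watts (assumed)
PAVG = 180   # per-node average power in watts (assumed)

N = P // PMAX      # static allocation: N = P / Pmax
M = P // PAVG      # overprovisioned: M = P / Paverage
P_LIMIT = P / M    # per-node budget, Plimit = P / M

print(f"static N = {N}, overprovisioned M = {M}, Plimit = {P_LIMIT:.1f} W")
```

Here overprovisioning fields 55 nodes instead of 40, each budgeted just above its average draw.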
[Figure: P(t) over time relative to Pmax and Paverage (static allocation), and relative to Pmax, Plimit, and Paverage (overprovisioned)]
27
Safe overprovisioning in a cluster: benefits
Less “unused” power/energy; more efficient power use
More performance under the same power limitation. Let P be per-node performance at full power and P* at the reduced limit; more performance means MP* > NP, i.e., P*/P > N/M, equivalently P*/P > Plimit/Pmax
[Figure: same power-vs-time plots; the region between Pmax and P(t) marks unused energy]
28
When is this a win?
When P*/P > N/M, equivalently P*/P > Plimit/Pmax
In words: when the power reduction exceeds the performance reduction
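A tiny sketch of this win condition (the function name and example numbers are mine, not from the talk):

```python
# Overprovisioning wins when relative per-node performance P*/P
# exceeds N/M, which equals Plimit/Pmax. Names and numbers are
# illustrative.

def overprovisioning_wins(perf_ratio, n, m):
    """True if M throttled nodes outperform N full-power nodes:
    M * P* > N * P  <=>  P*/P > N/M."""
    return perf_ratio > n / m

# 55 throttled nodes vs. 40 full-power nodes; throttling keeps 80%
# of per-node performance: 0.80 > 40/55 ~= 0.73, so it is a win.
print(overprovisioning_wins(0.80, 40, 55))  # True
```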
Two reasons: 1. frequency and voltage scaling; 2. application throughput
[Figure: power (curve 1) and application throughput (curve 2) vs. performance (frequency), divided into regions where P*/P < Paverage/Pmax and where P*/P > Paverage/Pmax]
29
Feedback-directed, adaptive power control
Uses feedback to control power/energy consumption Given power goal Monitor energy consumption Adjust power/performance of CPU Paper: [COLP ’02]
Several policies Average power
Maximum power
Energy efficiency: select slowest gear (g) such that
30
Implementation Components
Two components Integrated into one daemon process
Daemons on each node Broadcasts information at intervals Receives information and calculates Pi for next interval Controls power locally
Research issues
Controlling local power: add a guarantee, a bound on instantaneous power
Interval length – shorter: tighter bound on power, more responsive; longer: less overhead
The function f(L0, …, LM): depends on the relationship between power and performance
Pik: the individual power limit for node i at interval k
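One possible f (the talk deliberately leaves it open) is a load-proportional split of the global budget. This sketch uses assumed names and numbers; it is one candidate policy, not the talk's chosen f.

```python
# A hypothetical allocation function f(L0, ..., LM): split the global
# power budget among nodes in proportion to their reported loads.

def next_power_limits(loads, total_budget):
    """Per-node power limits for the next interval.
    Falls back to a uniform split when every node is idle."""
    total_load = sum(loads)
    if total_load == 0:
        return [total_budget / len(loads)] * len(loads)
    return [total_budget * load / total_load for load in loads]

# Three nodes reporting loads 4, 1, and 3 share a 600 W budget:
limits = next_power_limits([4.0, 1.0, 3.0], 600.0)
print(limits)  # [300.0, 75.0, 225.0]
```

The allocation always sums to the budget, which keeps the instantaneous-power bound intact.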
31
Results – fixed gear
[Chart: energy-time results for fixed gears 0–6]
32
Results – dynamic power control
[Chart: energy-time results under dynamic power control, gears 0–6]
33
Results – dynamic power control (2)
[Chart: additional energy-time results under dynamic power control, gears 0–6]
34
Summary
35
End
36
Summary
Safe over provisioning Deploy M > N nodes More performance
Less “unused” power; more efficient power use
Two autonomic managers – Local: built on prior research; Global: a new, distributed algorithm
Implementation Linux AMD
Contact: Vince Freeh, 513-7196, [email protected]
37
Autoshift
38
Phases
39
Allocate power based on energy efficiency
Allocate power to maximize throughput Maximize number of tasks completed per unit energy Using energy-time profiles
Statically generate a table for each task: tuples of (gear, energy/task)
Modifications Nodes exchange pending tasks Pi determined using table and population of tasks
Benefit Maximizes task throughput
Problems Must avoid starvation
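A minimal sketch of choosing a gear from such a table. The profile numbers are invented for illustration; a real table would come from the statically generated energy-time profiles.

```python
# Pick the gear that minimizes energy per task, i.e. maximizes tasks
# completed per unit energy. The (gear, energy/task) profile below is
# an invented example of a statically generated table.

PROFILE = {  # gear: energy per task in joules (illustrative)
    0: 120.0,
    1: 105.0,
    2: 98.0,
    4: 110.0,  # too slow: fixed overheads dominate again
}

def most_efficient_gear(profile):
    """Gear with the lowest energy per task."""
    return min(profile, key=profile.get)

print(most_efficient_gear(PROFILE))  # 2
```

Note the minimum need not be the slowest gear: past some point, stretching a task raises its energy again.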
40
Memory bandwidth
41
Power management – ICK: need better 1st slide
What Controlling power Achieving desired goal
Why Conserve energy consumption Contain instantaneous power consumption Reduce heat generation Good engineering
42
Related work: Energy conservation
Goal: conserve energy; performance degradation is acceptable
Usually in mobile environments (finite energy source: a battery)
Primary goal: extend battery life
Secondary goal: re-allocate energy, increase the “value” of energy use
Tertiary goal: increase energy efficiency – more tasks per unit energy
Example: feedback-driven energy conservation; control average power usage, Pave = (E0 − Ef)/T
[Figure: battery energy falling from E0 to Ef over time T; power vs. frequency curve]
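The average-power policy above reduces to a one-line computation; the joule and second values here are illustrative:

```python
# Average power over a window: Pave = (E0 - Ef) / T, where E0 and Ef
# are the energy store at the start and end of a window of length T.

def average_power(e_start, e_end, t_seconds):
    """Average power in watts drawn over an interval of t_seconds."""
    return (e_start - e_end) / t_seconds

# 50 kJ drained to 41 kJ over 10 minutes:
pave = average_power(50_000, 41_000, 600)
print(f"Pave = {pave:.1f} W")  # Pave = 15.0 W
```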
43
Related work: Realtime DVS
Goal: reduce energy consumption with no performance degradation
Mechanism: eliminate slack time in the system
Savings: Eidle with frequency scaling; an additional Etask − Etask′ with voltage scaling
[Figure: power vs. time before and after scaling – the task stretches to its deadline, replacing the idle energy Eidle at Pmax with a smaller Etask′]
44
Related work: Fixed installations
Goal: Reduce cost (in heat generation or $) Goal is not to conserve a battery
Mechanisms Scaling
Fine-grain: DVS; coarse-grain: power down
Load balancing
45
Single node – MG
46
Single node – EP
47
Single node – LU
48
Power, energy, heat – oh, my
Relationship: E = P × T, and H ∝ E. Thus: control power
Goal Conserve (reduce) energy consumption Reduce heat generation Regulate instantaneous power consumption
Situations (benefits) Mobile/embedded computing (finite energy store) Desktops (save $) Servers, etc (increase performance)
49
Power usage
CPU power Dominated by dynamic power
System power dominated by CPU Disk Memory
CPU notes: scalable; drives the other system components; a measure of performance
[Figure: power vs. performance (frequency)]
CMOS dynamic power equation: P = ½CV²f
50
Power management in HPC
Goals Reduce heat generation (and $) Increase performance
Mechanisms Scaling Feedback Load balancing
51
Single node – MG
[Chart annotations (Δtime, Δenergy): +6%, −7%; +12%, −8%]
Modest memory pressure:Gears offer E-T tradeoff
52
Power management vs. energy conservation – power management is the mechanism; energy conservation is a policy
Two elements
Energy efficiency – i.e., decrease the energy consumed per task
(Instantaneous) power consumption – i.e., limit the maximum watts used
Power-performance tradeoff Less power & less performance Ultimately energy-time
Power management
2 GHz – 800 MHz
AMD system
6 gears
53
Autonomic managers
Implementation uses two autonomic managers Local – power control Global – power allocation
Local Uses prior research project (new implementation) Requires new policy Daemon process
Reads the power meter; adjusts the processor performance gear (frequency)
Global At regular intervals
Collects appropriate information from all nodes; allocates the power budget for the next quantum
Optimize for one of several objectives
54
Example: Load imbalance
Uniform allocation of power: Pi = Plimit = P/M for each node i; not ideal if nodes are unevenly loaded
Tasks execute more slowly on busy nodes; lightly loaded nodes may not use all their power
Allocate power based on load*: at regular intervals, nodes exchange load information; each computes an individual power limit for the next interval (k)
*Note: load is one of several possible objective functions.
Pik: the individual power limit for node i at interval k
Ensure: Σi Pik ≤ P