Power-Aware Parallel Job Scheduling

36
Power-Aware Parallel Job Scheduling Maja Etinski Julita Corbalan Jesus Labart {maja.etinski,julita.corbalan,jesus.labarta,mateo.val ero}@bsc.es

description

Power-Aware Parallel Job Scheduling. Maja Etinski Julita Corbalan Jesus Labarta Mateo Valero. {maja.etinski,julita.corbalan,jesus.labarta,mateo.valero}@bsc.es. Power Consumption of Supercomputing Systems. - PowerPoint PPT Presentation

Transcript of Power-Aware Parallel Job Scheduling

Page 1: Power-Aware Parallel Job Scheduling

Power-Aware Parallel Job

Scheduling

Maja Etinski Julita Corbalan Jesus Labarta Mateo Valero

{maja.etinski,julita.corbalan,jesus.labarta,mateo.valero}@bsc.es

Page 2: Power-Aware Parallel Job Scheduling

EEHiPC'10 2

Power Consumption of Supercomputing Systems

Striving for performance has led to enormous power dissipation of HPC centers (Top500 list)

KWatts

Page 3: Power-Aware Parallel Job Scheduling

EEHiPC'10 3

Power reduction approaches in HPC

Power reduction approaches

Application level:

- Runtime systems: - exploit certain application characteristics (load imbalance, communication intensive regions)

- based on very fine grain DVFS application

System level:

- Turning off idle nodes: - resource allocation such that there are more completely idle nodes - determining number of online nodes

- Operating system power management via DVFS: - linux governors – per core, unawareness of the rest of the system

- DVFS taking into the account entire system workload?

Page 4: Power-Aware Parallel Job Scheduling

EEHiPC'10 4

Parallel Job Scheduling

Job scheduler has a global view of the whole system

HPC Job Scheduler

Job submission

Job with its

requirements

Queued jobs

Job Scheduling

Resource Manager

Wait Queue

Page 5: Power-Aware Parallel Job Scheduling

EEHiPC'10 5

DVFS and Job Scheduling

HPC Job Scheduler

Job submission

Job with its

requirements

Queued jobs

Job Scheduling

Resource

Manager

Wait Queue

Power-Aware

Component

Job CPU frequency assignment

based on goals/constraints

Page 6: Power-Aware Parallel Job Scheduling

EEHiPC'10 6

Outline

Parallel job scheduling:

short introduction to parallel job scheduling

the EASY backfilling policy

Power and run time modelling:

first we need to understand how frequency scaling affects CPU power dissipation and runtime

Energy-saving parallel job scheduling policies:

Utilization-driven power-aware scheduling

[Maja Etinski, Julita Corbalan, Jesus Labarta, and Mateo Valero. Utilization driven power-aware parallel job scheduling. Energy Aware High Performace Computing Conference, Hamburg, September 2010]

BSLD-driven power-aware scheduling

[Maja Etinski, Julita Corbalan, Jesus Labarta, and Mateo Valero. Bsld threshold driven power management policy for hpc centers. IEEE Parallel and Distributed Processing Symposium, HPPAC Workshops 2010 Atlanta, GA, April 2010]

Power-budgeting:

how to maximize job performance under a given power budget?

[Maja Etinski, Julita Corbalan, Jesus Labarta, and Mateo Valero. Optimizing job performance under a given power

constraint in hpc centers. IEEE International Green Computing Conference, Chicago, IL, August 2010]

Page 7: Power-Aware Parallel Job Scheduling

EEHiPC'10 7

About Parallel Job Scheduling

Parallel job scheduling can be seen as finding a free rectangle for the job being scheduled:

FCFS policy used in the beginning

Backfilling policies introduced to improve system utilization

Job performance metrics:

Response time: WaitTime(J)+RunTime(J)

Slowdown: (WaitTime(J) + RunTime(J))/RunTime(J)

Bounded Slowdown:

max((WaitTime(J)+Runtime(J))/max(Th,RunTime(J)) ,1)

Time

CPUs

Job 1

Job 2

Job 3

Job 5

Job 4 Job 6

Job Performance

Wait Time Run Time

Page 8: Power-Aware Parallel Job Scheduling

EEHiPC'10 8

The EASY backfilling policy

Job 2

Job 3

Job 4

Job 1

Arrival of Job 5

Job 5

Job 5Job 5

Job 6

MakeJobReservation(Job5)

BackfillJob(Job6)

Arrival of Job 6

Time

CPUs

Jobs are executed in FCFS order except when the first job in the wait queue can not start Users have to submit an estimation of job's runtime – requested time When the first job in the WQ can not start, a reservation is made for it based on requested

times of running jobs A job is executed before previously arrived ones only if it does not delay the first job in the

wait queue

Page 9: Power-Aware Parallel Job Scheduling

High-Level DVFS modelling

Page 10: Power-Aware Parallel Job Scheduling

EEHiPC'10 10

Power ModelCPU power presents one of main system power components

It consists of dynamic and static power:

Pcpu

= Pdynamic

+ Pstatic

Pdynamic

= AcfV2 Pstatic

= α V

Fraction of static in total CPU power is a model parameter:

Pstatic

(Vtop

) = X(Pstatic

(Vtop

) + Pdynamic

(ftop

,Vtop

))

( X = 25% in our experiments )

Average activity factor assumed to be same for all jobs (2.5 times higher than idle activity)

Idle processors: do not consume power/ consume power at the lowest frequency

DVFS gear set :

f 0.80 1.10 1.40 1.70 2.00 2.30

V 1.00 1.10 1.20 1.30 1.40 1.50

Norm(P) 0.28 0.38 0.49 0.63 0.80 1.00

Page 11: Power-Aware Parallel Job Scheduling

EEHiPC'10 11

Time ModelExecution time dependence on frequency is captured by the following

model:

F(f,ß)=T(f) / T(ftop) = ß(ftop / f -1) + 1

[Hsu,Feng SC05: A Power-Aware Run Time System for High-

Performance Computing]

ß is assumed to have the following normal distributions:

Global application ß depends on communication/computation ratio

Two ß scenarios:

ß is known in advance (at the moment of scheduling)

ß is not known in advance (at the moment of scheduling the worst

case, ß = 1, is assumed )

N(0.3, 0.064)More than 32

N(0.4, 0.01)Between 4 and 32

N(0.5, 0.01)Less or equal to 4

DistributionNumber of CPUs

0.8 1.1 1.4 1.7 2 2.3

0

0.5

1

1.5

2

2.5

Execution time penalty

due to frequency scaling

Frequency

Exe

cutio

n tim

e

ß=0.7ß=0.5ß=0.3

Page 12: Power-Aware Parallel Job Scheduling

Energy Saving Parallel Job

Scheduling Policies

Page 13: Power-Aware Parallel Job Scheduling

EEHiPC'10 13

Utilization-Driven Policy Frequency assigned once (at jobs start time) for entire job execution based on system utilization

Utilization is computed for each interval T:

An additional control over system load WQthreshold:

If there are more than WQthreshold jobs in the wait queue no frequency scaling will be applied

Otherwise, job started during interval Jk runs at frequency F

Uk-1

Fk

Ulower

Uupper

flower

fupper

ftop

Page 14: Power-Aware Parallel Job Scheduling

EEHiPC'10 14

Evaluation

C++ event driven parallel job scheduling simulator has been upgraded

Policy parameters:

utilization thresholds: Ulower = 50% Uupper = 80%

reduced frequencies: flower = 1.4 GHz fupper = 2.0 GHz

utilization computation interval: T = 10 min

wait queue length threshold: WQthreshold = 0, 4, 16, NO - limit

Metric of job performance – Bounded Slowdown:

BSLD at frequency f :

Alvio simulator

Policy parameters

Metric of performance

Page 15: Power-Aware Parallel Job Scheduling

EEHiPC'10 15

Workloads

Five workloads from production use have been simulated:

Workload - #CPUs Avg Util Avg LR

CTC - 430 70% 1.61 50% 28%

SDSC – 128 85% 8.17 26% 5%

SDSCBlue – 1152 69% 2.31 55% 26%

LLNLThunder – 4008 80% 80.00% 29% 11%

LLNLAtlas – 9216 75% 0.94 26% 19%

%T below Uupper % T below U lower

San Diego Supercomputing Center

-less sequential jobs than CTC-runtime distribution similar

San Diego Supercomputing Center

- no sequential job

Lawrence Livermore National Lab

- small to medium size jobs

LLNLAtlas – 9216 10 – 15 1.08 69LLNLAtlas – 9216 10 – 15 1.08 69LLNLAtlas – 9216 10 – 15 1.08 69LLNLAtlas – 9216 10 – 15 1.08 69

Cornell Theory Center-large jobs with relatively

low level of parallelism

Parallel workload archivehttp://www.cs.huji.ac.il/labs/parallel/workload

Lawrence Livermore National Lab

- large parallel jobs

Page 16: Power-Aware Parallel Job Scheduling

EEHiPC'10 16

Results: Normalized CPU Energy

CTC SDSC SDSCBlue LLNLThunder LLNLAtlas

80%

82%

84%

86%

88%

90%

92%

94%

96%

98%

100%

WQ

0

4

16

NO

Workload

%E

(idle

=0)

CTC SDSC SDSCBlue LLNLThunder LLNLAtlas

80%

82%

84%

86%

88%

90%

92%

94%

96%

98%

100%

WQ

0

4

16

NO

Workload

%E

(idle

=lo

w)

short wait queues

very similar results for both energy scenarios

savings of not highly loaded workloads up to 12%

Page 17: Power-Aware Parallel Job Scheduling

EEHiPC'10 17

Results: Normalized Performance

CTC SDSC SDSCBlue LLNLThunder LLNLAtlas

0.8

0.9

1

1.1

1.2

1.3

1.4

1.5

WQ0

4

16

NO

Workload

Nor

mal

ized

Ave

rage

BS

LD

CTC SDSC SDSCBlue LLNLThunder LLNLAtlas

0

500

1000

1500

2000

2500

WQ

0

4

16

NO

WorkloadN

umbe

r of

red

uced

jobs

an increase in number of backfilled jobs

high penalty in the least conservative case for highly loaded workload

WQ threshold has almost no impact

Page 18: Power-Aware Parallel Job Scheduling

EEHiPC'10 18

Average frequency - SDSCBlue

Page 19: Power-Aware Parallel Job Scheduling

EEHiPC'10 19

BSLD-Driven Policy

Frequency is assigned based on job's

predicted performance

Lower frequency -> longer execution time

-> worse job performance metric

BSLDth controls allowable performance

penalty (“target BSLD”)

In order to be run at lower frequency f a job

has to satisfy BSLD condition at frequency

f: if the job's predicted BSLD at frequency f

is lower than BSLDth than it satisfies the

BSLD condition at frequency f

Predicted BSLD:

WQsize

≤ WQthreshold

NO

YES

f = Flowest

find an allocation Alloc

satisfiesBSLD(Alloc,Ji,f) or

f=Ftop

Job Ji

Run job Ji at frequency f

YES

NOf = next higher frequency

Run Ji at F

top

Page 20: Power-Aware Parallel Job Scheduling

EEHiPC'10 20

Results: Normalized CPU Energy

Normalized energies in two energy

scenarios behave in the same way

Average savings in the most

aggressive case: 5% - 23%

Difference in savings per workload for

the most conservative and the most

aggressive threshold combinations

goes from 5% (SDSC) to 15%

(LLNLThunder)

WQthreshold

controls DVFS

aggressiveness much better than

BSLDthreshold

BSLDthreshold

has stronger impact when

WQthreshold

is higher

Page 21: Power-Aware Parallel Job Scheduling

EEHiPC'10 21

Average BSLD

4.66

24.91

5.15 1 1.08

Decrease in performance is proportional to energy savings

Strong impact on

performance in the most

aggressive case

Impact of WQthreshold

higher than of BSLDthreshold

BSLDthreshold

has

stronger impact when

WQthreshold

is higher

Page 22: Power-Aware Parallel Job Scheduling

EEHiPC'10 22

Reduced jobs (out of 5000)

When load is very high (SDSC) no DVFS is applied

(in order to apply it thresholds have to be set to higher values)

Performance depends on

the number of reduced jobs

It depends on used

frequencies as well

It was remarked that

performance of jobs that

have been run at the

nominal frequency was

affected as well

Page 23: Power-Aware Parallel Job Scheduling

EEHiPC'10 23

Wait time

Zoom of SDSCBlue wait time behavior

Main problem observed:

-> high impact on wait time

Page 24: Power-Aware Parallel Job Scheduling

Power-Budgeting Policy

Page 25: Power-Aware Parallel Job Scheduling

EEHiPC'10 25

PB-Guided Policy: How DVFS

can improve overall job performance

J1

J3

J2

Power

Budget

NO

DVFS

CASE

DVFS

CASE

J4

J5

ftop

Wait Queue:J

2

J1

J3

J4

J5

J1

J3

J2

J4

J5

ftop

flower

penalty in run time due to frequency scaling

but more jobs can run simultaneously

TimeT

1T

2

Page 26: Power-Aware Parallel Job Scheduling

EEHiPC'10 26

Power Budgeting: PB-Guided PolicyFrequency assignment is guided by predicted job performance and current power draw

Prediction of BSLD when selecting frequency:

BSLD condition:

A job satisfies BSLD condition at reduced frequency f if its predicted BSLD

at the frequency f is lower than current value of the BSLD threshold

The policy is power conservative:

A job will be scheduled at the lowest frequency at which both BSLD condition and power limit

are satisfied

BSLD

threshold

0 Plower P

upper Power Budget

The closer to the PB limit, the

higher the BSLD threshold

The higher the BSLD threshold,

the lower frequency will be selectedPcurrent

Page 27: Power-Aware Parallel Job Scheduling

EEHiPC'10 27

Power Budgeting: PB-Guided Policy

A job can be scheduled with one of the two functions:

MakeJobReservation(J)

1: scheduled <-- false;

2: shiftInTime <-- 0;

3: nextFinishJob <-- next(OrderedRunningQueue);

4: while( !scheduled)

{

5: f <-- FlowestReduced

6: while(f < Fnominal

)

{

7: Alloc = findAllocation(J,currentTime + shiftInTime,f);

8: if (satisfiesBSLD(Alloc, J, f) and satisfiesPowerLimit(Alloc, J, f) )

9: { schedule(J, Alloc);

10: scheduled <-- true;

11: break; }

12: if (f == Fnominal

)

13: Alloc = findAllocation(J,currentTime + shiftInTime, Fnominal

)

14: if (satisfiesPowerLimit(Alloc, J,Fnominal

))

15: schedule(J, Alloc);

16: break;

17: shiftInTime <-- FinishTime(nextFinishJob) - currentTime;

18: nextFinishJob <-- next(OrderedRunningQueue);

}

BackfillJob(J)

1: f <-- Flowest

2: while(f < Fnominal

)

{

3: Alloc = TryToFindBackfilledAllocation(J,f);

4: if (correct(Alloc) and satisfiesBSLD(Alloc, J,f) and

satisfiesPowerLimit(Alloc,J,f))

5: { schedule(J, Alloc);

6: break; }

7: f <-- nextHigherFrequency

}

8: if (f==Fnominal

)

9: { Alloc = TryToFindBackfilledAllocation(J,Fnominal

);

10: if ((correct(Alloc) and satisfiesPowerLimit(Alloc,J,f))

11: schedule(J, Alloc); }

Power Limit

BSLD Conditionthe lowest frequency that satisfies it will be selected

power budget must not be violated during entire job execution

Page 28: Power-Aware Parallel Job Scheduling

28

EvaluationPolicy parameters:

Power budget thresholds: Plower = 0.6 , Pupper = 0.9BSLD threshold values which have been used:

BSLDlower = avg(BSLD) without power budgeting BSLDupper = 2* BSLDlower

Power budget set to 80% of the total CPU power consumed by whole system when running at Fnominal

Four workloads from production use have been simulated:

89%80%120 - 25LLNLThunder - 4008

74%69%5.1520 – 25SDSCBlue – 1152

95%85%24.9140 - 45SDSC – 128

72%70%4.6620 - 25CTC – 430

Over PBUtilizationAvg BSLDJobs(K)Workload - # CPUs

Page 29: Power-Aware Parallel Job Scheduling

EEHiPC'10 29

Baseline Power Budgeting Policy

Power limited without DVFS:

No job will start if it would violate the budget although there are available processors

This case is equal to the EASY scheduling with a smaller machine

Job 2

Job 3

Job 1

Arrival of Job 4

Job 4

Job 4Job 4

Job 5

MakeJobReservation(Job5)

BackfillJob(Job6)

Arrival of Job 5CPUs

Job 6

Arrival of Job 6

Job 6

can not start because of power budget

Page 30: Power-Aware Parallel Job Scheduling

EEHiPC'10 30

Results: Performance

Oracle case: it is assumed that ß values are known at the scheduling time

PB-guided policy shows better performance for all workloads!

AVG wait time decreases with DVFS under power constraint

Page 31: Power-Aware Parallel Job Scheduling

EEHiPC'10 31

Results: Normalized CPU Energy (idle=0)

Oracle case: it is assumed that ß values are known at the scheduling time

Page 32: Power-Aware Parallel Job Scheduling

EEHiPC'10 32

Utilization Over Time

Page 33: Power-Aware Parallel Job Scheduling

EEHiPC'10 33

Power Budget Consumed

Page 34: Power-Aware Parallel Job Scheduling

EEHiPC'10 34

Comparison of Unknown and Known ß

no ß ß no ß ß no ß ß no ß ß no ß ß

CTC 0.80 0.79 0.63 0.75 1.50 1.40 0.740 0.738 3924 3923

SDS 0.62 0.76 0.53 0.70 1.60 1.50 0.755 0.744 4103 3944

SDSCBlue 0.86 0.75 0.73 0.76 1.50 1.40 0.727 0.724 3611 3550

LLNLThunder 0.33 0.51 0.28 0.47 2.07 1.90 0.884 0.865 3830 3651

Energy Backfilled

Workload

Avg.BSLD Avg.WT Avg. Freq

Avg.BSLD, Avg.WT and Avg.Energy values are normalized with respect to corresponding baseline values

(EASY-backfilling with power limit and without DVFS)

Page 35: Power-Aware Parallel Job Scheduling

EEHiPC'10 35

Conclusions

Energy – performance trade-off must be done carefully as DVFS does not affect only job

runtime but it can affect significantly job wait time and additionally decrease job performance Performance-energy trade-off needs to be done at job scheduling level as it affect jobs in the

wait queue and only the scheduler can estimate potential negative impact on queued jobs DVFS application to highly loaded workloads (SDSC) leads to very high performance penalty

Parallel job scheduling policies can be designed such that maximizes job performance

under a given power constraint It has been shown that DVFS can improve performance in power constrained HPC centers

(using lower CPU frequencies allows more job to run simultaneously) It is not necessary to know ß values in advance, moreover assuming the worst case at

scheduling time can give better performance than when they are known in advance

Page 36: Power-Aware Parallel Job Scheduling

HPPAC 2010 Atlanta

EEHiPC'10 36

Thank you for your attention!