Hardware-aware thread scheduling: the case of asymmetric multicore processors


Talk given at ICPADS 2012 in Singapore.


Achille Peternier*, Danilo Ansaloni, Daniele Bonetta, Cesare Pautasso and Walter Binder

* achille.peternier@usi.ch, http://sosoa.inf.unisi.ch

CONTEXT AND OVERALL IDEA
Introduction


Context

• Modern CPUs increase the computational power through additional cores

• HW architectures are becoming increasingly complex
  – Shared caches
  – Non-Uniform Memory Access (NUMA)
  – Single Instruction Multiple Data (SIMD) registers
  – Simultaneous MultiThreading (SMT) units


Context

• Operating System (OS) kernel and scheduler try to automatically optimize applications’ performance according to the available resources
  – Based on the underlying HW
  – Using a limited set of performance indicators (CPU time, memory usage, etc.)

“Today it is impossible to estimate performance: you have to measure it. Programming has become an empirical science.”

Performance Anxiety: Performance analysis in the new millennium
Joshua Bloch, Google Inc.


Contributions

1) Automated workload analysis technique relying on a specific set of performance metrics that are currently not used by common OS schedulers

2) Hardware-aware optimized scheduler making decisions based on hardware resource usage and the output of the workload analysis
- to improve processing unit occupancy on SMT/asymmetric processors

The big picture

[Figure, shown in two steps. Labels: Monitoring daemon; OS threads and processes; Workload characterization (FPU / INT); Hardware-aware scheduler]

AMD BULLDOZER PROCESSOR
Target architecture


AMD Bulldozer

• AMD Bulldozer architecture
  – Each CPU is implemented as a series of modules (a.k.a. “cores”), each with two cores (a.k.a. “processing units” or “SMT units”)
  – Arithmetic-Logic Units (ALUs) are actually available per SMT unit, while the floating-point unit is shared within a module
  – A module is more similar to:
    • A dual core when doing integer ops
    • A single core with SMT=2 when doing floating point ops
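The module layout described above can be modeled in a few lines. The sketch below is only illustrative: it assumes that logical cores 2m and 2m+1 belong to module m (on a real machine the layout would be discovered, for example with hwloc), and the function names are hypothetical.

```python
CORES_PER_MODULE = 2  # per the slide: each module exposes two cores


def module_of(core_id):
    """Return the module index of a logical core.

    Assumption: cores 2m and 2m+1 belong to module m. Real systems
    should query the topology instead of relying on numbering.
    """
    return core_id // CORES_PER_MODULE


def shares_fpu(core_a, core_b):
    """Two distinct cores share floating-point hardware if they sit in the same module."""
    return core_a != core_b and module_of(core_a) == module_of(core_b)


def fpu_contention(core_a, core_b, a_is_fpu_bound, b_is_fpu_bound):
    """True when two FPU-intensive threads would compete for one module's FPU."""
    return shares_fpu(core_a, core_b) and a_is_fpu_bound and b_is_fpu_bound
```

Under this model, two integer-bound threads on the same module behave like a dual core, whereas two FPU-bound threads on the same module compete as on a single core with SMT=2.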

AMD Bulldozer

[Three figure-only slides; the second is marked “X” and the third “ok”.]


WORKLOAD CHARACTERIZATION


Workload characterization

• Used to identify and rank the processes and threads that are floating-point intensive
  – Among the X most active threads (where X = the number of cores available)

• Based on a real-time monitoring system using Hardware Performance Counters (HPCs)
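A minimal sketch of this selection step, assuming that each thread's CPU cycle count and FPU usage fraction for the last sampling interval are already known. The function name and the input layout are hypothetical, not taken from the talk.

```python
def rank_fpu_intensive(threads, num_cores):
    """Rank the most active threads by FPU intensity.

    threads: dict mapping thread id -> (cpu_cycles, fpu_usage), where
    fpu_usage is a fraction in [0, 1] (assumed input format).
    Returns the ids of the num_cores most active threads, ordered from
    most to least FPU-intensive.
    """
    # Keep the X threads that consumed the most cycles (X = number of cores).
    most_active = sorted(threads, key=lambda tid: threads[tid][0], reverse=True)[:num_cores]
    # Order the survivors by how heavily they use the FPU.
    return sorted(most_active, key=lambda tid: threads[tid][1], reverse=True)
```

For example, with three cores, a nearly idle thread is dropped even if its FPU usage is very high, because it is not among the three most active threads.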


…about HPCs…

• Registers embedded into processors to keep track of hardware-related events such as cache misses, number of CPU cycles, branch mispredictions, etc.

• Very low overhead (about 1%)
• Extremely accurate
• Limited resources: only a few of them can be used at the same time
  – This has so far limited their adoption on a large scale

• HW-specific


Workload characterization

• HPCs used:
  – PERF_COUNT_HW_CPU_CYCLES: measures the total number of CPU cycles consumed by a thread during its execution time
  – CYCLES_FPU_EMPTY: counts the CPU cycles during which the floating point units are not being used by a thread during its execution time
  – L2_CACHE_MISSES: counts the number of L2 cache misses generated by a thread during its execution time
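One plausible way to turn these three counters into per-thread metrics is sketched below. The talk does not give the exact formulas, so this is an assumption: FPU usage is taken as the share of cycles in which the FPU was not empty, and L2 misses are normalized per 1000 cycles. `CounterSample` is a hypothetical container.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Per-thread counter deltas over one sampling interval (hypothetical type)."""
    cpu_cycles: int        # PERF_COUNT_HW_CPU_CYCLES
    fpu_empty_cycles: int  # CYCLES_FPU_EMPTY
    l2_misses: int         # L2_CACHE_MISSES


def fpu_usage(sample):
    """Fraction of the thread's cycles in which the FPU was in use (assumed metric)."""
    if sample.cpu_cycles <= 0:
        return 0.0
    busy = sample.cpu_cycles - sample.fpu_empty_cycles
    # Clamp, since independently sampled counters may be slightly inconsistent.
    return max(0.0, min(1.0, busy / sample.cpu_cycles))


def l2_misses_per_kilocycle(sample):
    """L2 misses per 1000 CPU cycles (normalization chosen for illustration)."""
    if sample.cpu_cycles <= 0:
        return 0.0
    return 1000.0 * sample.l2_misses / sample.cpu_cycles
```

A thread with 1000 cycles, of which 250 had an empty FPU, would get an FPU usage of 0.75 under this definition.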

MONITORING AND SCHEDULING INFRASTRUCTURE DESIGN


BulldOver design

• Bulldozer Overseer -> BulldOver
• Client-server architecture


BulldOver design

• Server
  – Daemon
  – Scans the underlying architecture
  – Time-based HPC monitoring (once per second)
    • We target scientific workloads; short-lived threads are not well suited to this approach
  – Applies scheduling policies
  – libHpcOverseer, hwloc, libpfm
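The time-based monitoring step can be pictured as a simple sampling loop. This is a sketch under stated assumptions, not the daemon's actual code: `read_counters` and `on_sample` are hypothetical callbacks, the counters are assumed to be cumulative, and the loop reports per-interval deltas.

```python
import time


def monitor(read_counters, on_sample, interval_s=1.0, rounds=3, sleep=time.sleep):
    """Sample per-thread counters every interval_s seconds and report deltas.

    read_counters() is assumed to return {thread_id: {counter_name: value}}
    with cumulative values; on_sample receives the per-interval deltas.
    """
    previous = read_counters()
    for _ in range(rounds):
        sleep(interval_s)
        current = read_counters()
        deltas = {}
        for tid, counters in current.items():
            before = previous.get(tid)
            if before is None:
                # Thread appeared during this interval: no baseline yet.
                continue
            deltas[tid] = {name: value - before[name] for name, value in counters.items()}
        on_sample(deltas)
        previous = current
```

The loop also shows why short-lived threads fit poorly: a thread that starts and ends within one interval never produces a delta.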


BulldOver design

• Client
  – Command-line tool
    • prompt> bulldover java myprogram
  – Traces the creation/termination of threads/processes
  – Shares information with the server through shared memory
  – libmonitor, boost

BulldOver design

[Figure: architecture diagram; only the label “User space” survives]


EVALUATION


Testing environment

• Dell PowerEdge M915
  – 4x AMD 6282SE 2.6 GHz CPUs (16 cores/8 modules each)
    • Limited to 1 CPU with 8 cores/4 modules
  – Test limited to a single NUMA node
    • Avoiding latencies and other well-known NUMA-related effects
  – Turbo mode and freq. scaling disabled


Benchmark suites

• SPEC CPU 2006
  – Perfect match for evaluating Integer vs. Floating point behaviors

• SciMark 2.0
  – Java based
  – Noisy environment (additional threads for garbage collection, JIT, etc.)
  – Mainly FPU-oriented, with different levels of stress
  – Modified multi-threaded version running several random benchmarks over a thread-pool

Workload characterization: SPEC CPU 2006

[Chart; series: Empty FPU Cycles, Total CPU Cycles]

Workload characterization: SciMark 2.0

[Chart; series: Empty FPU Cycles, Total CPU Cycles]

FPU usage and caches

[Charts: FPU usage; L2 cache miss ratio]

29

Results for SPEC CPU 2006

Inefficient baseline

Improved scheduling

Default OS scheduling

Running 4x Int and 4x FPU benchmarks on a single NUMA node (4 modules/8 cores)


Discussion

• BulldOver avoids the worst-case scenario
  – The default OS scheduler is not aware of the workload characterization
• Benefits come both from improved cache usage AND better FPU/Integer unit occupancy

Results for SciMark 2.0

[Chart; series: Default OS scheduling, Improved scheduling]

Running 8 benchmarks that change randomly over time on a single NUMA node (4 modules/8 cores)


Discussion

• All the threads are FPU-intensive
  – But at different levels
• Still a reasonable speedup “for free”
• Dynamic adaptation, since the FPU usage intensity varies over time
  – BulldOver reacts accordingly

Conclusions

- We show how thread scheduling that is not aware of the shared HW resources available on the AMD Bulldozer processor can incur a significant performance penalty
- We presented a monitoring system that is able to characterize the most active threads according to their FPU/Integer usage
- Thanks to the real-time analysis, improved scheduling can be applied and performance improved
- Our system is minimally intrusive:
  - Low overhead (below 2%)
  - No kernel patching required
  - No code instrumentation
  - Works on any application


Conclusions

• Currently tuned for a specific HW architecture
• Good for scientific workloads
  – A sampling interval is required (1 sec in our case; could be less but can’t be 0…)
• Based on a very simple scheduling policy
  – More sophisticated policies could be used
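The slides do not spell out the scheduling policy itself. The sketch below is therefore only a guess at what a "very simple" policy could look like: pair the most FPU-intensive thread with the least FPU-intensive one on each module, so that two heavy FPU users do not share a module. The function name, the input format, and the assumption that cores 2m and 2m+1 form module m are all hypothetical.

```python
def pair_threads_on_modules(fpu_usage_by_tid, num_modules):
    """Assign threads to cores so each module mixes high and low FPU usage.

    fpu_usage_by_tid: dict thread id -> FPU usage fraction (assumed input).
    Returns dict thread id -> core id, assuming cores 2m and 2m+1 form
    module m. Threads beyond 2 * num_modules are left unplaced.
    """
    ranked = sorted(fpu_usage_by_tid, key=fpu_usage_by_tid.get, reverse=True)
    placement = {}
    lo, hi = 0, len(ranked) - 1
    module = 0
    while lo <= hi and module < num_modules:
        placement[ranked[lo]] = 2 * module          # most FPU-intensive remaining
        if lo != hi:
            placement[ranked[hi]] = 2 * module + 1  # least FPU-intensive remaining
        lo += 1
        hi -= 1
        module += 1
    return placement
```

On Linux, a placement like this could then be applied per thread with a CPU-affinity call such as Python's `os.sched_setaffinity`; that step is left out here so the sketch stays side-effect free.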


Thanks!

Achille Peternier
achille.peternier@usi.ch
http://sosoa.inf.unisi.ch


“Pow7Over”

• Work in progress on IBM Power7 processors
  – 1 CPU, 8 cores, up to 4 SMT units per core
  – Completely different…
    • …operating system: RHEL 6.3
    • …architecture: PowerPC
    • …HPCs: IBM-specific ones (more than 500 available…)
    • …compiler: autotools 6.0

• Similar approach
• Slightly less significant speedup
  – But this is a full SMT
  – Similar overall behavior both for the PUs and L2 caches