Hardware-aware thread scheduling: the case of asymmetric multicore processors

Achille Peternier*, Danilo Ansaloni, Daniele Bonetta, Cesare Pautasso and Walter Binder

* achille.peternier@usi.ch, http://sosoa.inf.unisi.ch

Introduction

CONTEXT AND OVERALL IDEA

Context

• Modern CPUs increase the computational power through additional cores

• HW architectures are becoming increasingly complex
  – Shared caches
  – Non-Uniform Memory Access (NUMA)
  – Single Instruction Multiple Data (SIMD) registers
  – Simultaneous MultiThreading (SMT) units

Context

• Operating System (OS) kernel and scheduler try to automatically optimize applications’ performance according to the available resources
  – Based on the underlying HW
  – Using a limited set of performance indicators (CPU time, memory usage, etc.)

“Today it is impossible to estimate performance: you have to measure it. Programming has become an empirical science.”

Performance Anxiety: Performance analysis in the new millennium, Joshua Bloch, Google Inc.

Contributions

1) Automated workload analysis technique relying on a specific set of performance metrics that are currently not used by common OS schedulers

2) Hardware-aware optimized scheduler making decisions based on hardware resource usage and the output of the workload analysis

- to improve processing unit occupancy on SMT/asymmetric processors

The big picture

[Figure: monitoring daemon, OS threads and processes, workload characterization (FPU / INT), hardware-aware scheduler]

Target architecture

AMD BULLDOZER PROCESSOR

AMD Bulldozer

• AMD Bulldozer architecture
  – Each CPU is implemented as a series of modules (a.k.a. “cores”), each with two cores (a.k.a. “processing or SMT units”)
  – Arithmetic-Logic Units (ALUs) are physically replicated per SMT unit, while the floating-point unit is shared within a module
  – A module is more similar to:
    • A dual core when doing integer ops
    • A single core with SMT=2 when doing floating point ops
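The module layout described above can be modelled in a few lines. This is only a sketch: it assumes that the two cores of module m are numbered 2m and 2m+1, which is not stated in the slides (a real tool would query the topology, for example through hwloc). The function names are hypothetical.

```python
from collections import defaultdict

CORES_PER_MODULE = 2  # Bulldozer: the two cores of a module share one FPU


def module_of(core_id):
    """Module index of a core, ASSUMING siblings are numbered 2m and 2m+1."""
    return core_id // CORES_PER_MODULE


def shares_fpu(core_a, core_b):
    """True if two distinct cores sit in the same module and so share its FPU."""
    return core_a != core_b and module_of(core_a) == module_of(core_b)


def cores_by_module(n_cores):
    """Group core ids by module, e.g. for an 8-core/4-module node."""
    modules = defaultdict(list)
    for core in range(n_cores):
        modules[module_of(core)].append(core)
    return dict(modules)
```

Two FPU-intensive threads placed on cores for which `shares_fpu` is true compete for the same floating-point unit, which is the situation a hardware-aware scheduler tries to avoid.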

AMD Bulldozer

WORKLOAD CHARACTERIZATION

• Used to identify and rank the processes and threads that are floating-point intensive
  – Among the X most active threads (where X = the number of cores available)
• Based on a real-time monitoring system using Hardware Performance Counters (HPCs)

…about HPCs…

• Registers embedded into processors to keep track of hardware-related events such as cache misses, number of CPU cycles, branch mispredictions, etc.

• Very low overhead (about 1%)
• Extremely accurate
• Limited resources: only a few of them can be used at the same time
  – This has so far limited their wide adoption on a large scale

• HW-specific

• HPCs used:
  – PERF_COUNT_HW_CPU_CYCLES: measures the total number of CPU cycles consumed by a thread during its execution time
  – CYCLES_FPU_EMPTY: keeps track of the number of CPU cycles during which the floating point units are not being used by a thread during its execution time
  – L2_CACHE_MISSES: counts the number of L2 cache misses generated by a thread during its execution time
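One way the three counters could be turned into a ranking is sketched below. The slides do not give the exact formula, so two things here are assumptions: FPU usage is taken as 1 - CYCLES_FPU_EMPTY / PERF_COUNT_HW_CPU_CYCLES, and "most active" is taken to mean "most CPU cycles in the last interval". The `Sample` record and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Counter deltas for one thread over one sampling interval (hypothetical)."""
    tid: int
    cpu_cycles: int        # PERF_COUNT_HW_CPU_CYCLES
    fpu_empty_cycles: int  # CYCLES_FPU_EMPTY
    l2_misses: int         # L2_CACHE_MISSES


def fpu_usage(s):
    """Fraction of the thread's cycles in which the FPU was not empty (assumed metric)."""
    if s.cpu_cycles <= 0:
        return 0.0
    busy = max(s.cpu_cycles - s.fpu_empty_cycles, 0)
    return busy / s.cpu_cycles


def rank_by_fpu(samples, n_cores):
    """Keep the n_cores most active threads, most FPU-intensive first."""
    most_active = sorted(samples, key=lambda s: s.cpu_cycles, reverse=True)[:n_cores]
    return sorted(most_active, key=fpu_usage, reverse=True)
```

The L2 miss count is carried in the record but not used in this ranking; how it is weighted against FPU usage is not described in the slides.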

MONITORING AND SCHEDULING INFRASTRUCTURE DESIGN

BulldOver design

• Bulldozer Overseer -> BulldOver
• Client-server architecture

BulldOver design

• Server
  – Daemon
  – Scans the underlying architecture
  – Time-based HPC monitoring (once per sec)
    • We target scientific workloads; short-lived threads are not well suited
  – Applies scheduling policies
  – Uses libHpcOverseer, hwloc, libpfm

BulldOver design

• Client
  – Command-line tool
    • prompt> bulldover java myprogram
  – Traces the creation/termination of threads/processes
  – Shares information with the server through shared memory
  – Uses libmonitor, boost

BulldOver design

[Figure: BulldOver architecture diagram; surviving label: User space]

EVALUATION

Testing environment

• Dell PowerEdge M915
  – 4x AMD 6282SE 2.6 GHz CPUs (16 cores/8 modules each)
    • Limited to 1 CPU with 8 cores/4 modules
  – Test limited to a single NUMA node
    • Avoiding latencies and other well-known NUMA-related effects
  – Turbo mode and freq. scaling disabled

Benchmark suites

• SPEC CPU 2006
  – Perfect match for evaluating Integer vs. Floating point behaviors
• SciMark 2.0
  – Java based
  – Noisy environment (additional threads for garbage collection, JIT, etc.)
  – Mainly FPU-oriented, with different levels of stress
  – Modified multi-threaded version running several random benchmarks over a thread-pool

Workload characterization: SPEC CPU 2006

[Figure; series: Empty FPU Cycles, Total CPU Cycles]

Workload characterization: SciMark 2.0

[Figure; series: Empty FPU Cycles, Total CPU Cycles]

FPU usage and caches

[Figure; panels: FPU usage, L2 cache miss ratio]

Results for SPEC CPU 2006

[Figure: inefficient baseline, default OS scheduling and improved scheduling, running 4x Int and 4x FPU benchmarks on a single NUMA node (4 modules/8 cores)]

Discussion

• BulldOver avoids the worst-case scenario
  – The default OS scheduler is not aware of the workload characterization
• Benefits come both from improved cache usage AND better FPU/Integer unit occupancy

Results for SciMark 2.0

[Figure: default OS scheduling and improved scheduling, running 8 benchmarks that change randomly over time on a single NUMA node (4 modules/8 cores)]

Discussion

• All the threads are FPU-intensive
  – But at different levels
• Still a reasonable speedup “for free”
• Dynamic adaptation, since the FPU usage intensity varies over time
  – BulldOver reacts accordingly

Conclusions

- We show how thread scheduling that is not aware of the shared HW resources available on the AMD Bulldozer processor can incur a significant performance penalty
- We presented a monitoring system that is able to characterize the most active threads according to their FPU/Integer usage
- Thanks to the real-time analysis, improved scheduling can be applied and performance improved
- Our system is minimally intrusive:
  - Low overhead (below 2%)
  - No kernel patching required
  - No code instrumentation
  - Works on any application

Conclusions

• Currently tuned for a specific HW architecture
• Good for scientific workloads
  – A sampling interval is required (1 sec in our case, could be less but can’t be 0…)
• Based on a very simple scheduling policy
  – More sophisticated policies could be used

Thanks!

Achille Peternier, achille.peternier@usi.ch, http://sosoa.inf.unisi.ch

“Pow7Over”

• Work in progress on IBM Power7 processors
  – 1 CPU, 8 cores, up to 4 SMT units per core
  – Completely different…
    • …operating system: RHEL 6.3
    • …architecture: PowerPC
    • …HPCs: IBM-specific ones (more than 500 available…)
    • …compiler: autotools 6.0
• Similar approach
• Slightly less significant speedup
  – But this is a full SMT
  – Similar overall behavior both for the PUs and L2 caches
