Predictive Runtime Code Scheduling for Heterogeneous Architectures


Page 1: Predictive Runtime Code Scheduling for Heterogeneous Architectures

Authors: Víctor J. Jiménez, Lluís Vilanova, Isaac Gelado, Marisa Gil, Grigori Fursin, and Nacho Navarro

Source: High Performance and Embedded Architectures and Compilers (HiPEAC) 2009

Presenter: 陳彥廷, 2011.10.21

Page 2: Outline

Introduction, Scheduling Algorithm, Experiment method, Experiment results, Conclusions, Future work

Page 3: Outline

Introduction, Scheduling Algorithm, Experiment method, Experiment results, Conclusions, Future work

Page 4: Heterogeneous system

Currently almost every desktop system is a heterogeneous system.

Such a system contains both a CPU and a GPU: two processing elements (PEs) with different characteristics but with undeniable amounts of processing power.

The GPU is often used in a restricted way, for domain-specific applications such as scientific computing and games.

Page 5: Current trend to program heterogeneous systems

1. Profile the application to be ported to the GPU and detect the parts that are most expensive in execution time and most amenable to the GPU's style of computing.

2. Port those code fragments to CUDA kernels (or to any other framework for general-purpose programming on GPUs).

3. Iteratively optimize the kernels until the desired performance is achieved.

Page 6: Objective

Exploring and understanding the effect of different scheduling algorithms for heterogeneous architectures.

Fully exploiting the computing power available in current CPU/GPU-like heterogeneous systems.

Increasing overall system performance.

Page 7: Outline

Introduction, Scheduling Algorithm, Experiment method, Experiment results, Conclusions, Future work

Page 8: Heterogeneous scheduling process

PE selection - the process of deciding on which PE a new task should be executed.

Task selection - the mechanism that chooses which task should be executed next on a given PE.
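
As a rough illustration of this two-phase process, here is a minimal Python sketch (the names Scheduler, select_pe, select_task, and submit are illustrative, not from the paper): PE selection runs when a task is submitted, task selection when a PE becomes idle.

    from collections import deque

    class Scheduler:
        """Sketch of the two-phase heterogeneous scheduler: PE selection + task selection."""

        def __init__(self, pe_list):
            # One FIFO queue of pending tasks per processing element (PE).
            self.queues = {pe: deque() for pe in pe_list}

        def select_pe(self, task):
            # PE selection: decide on which PE a new task should be executed
            # (policy-specific; see the algorithms on the following pages).
            raise NotImplementedError

        def select_task(self, pe):
            # Task selection: choose which task runs next on the given PE
            # (first-come, first-served here, as on page 16).
            queue = self.queues[pe]
            return queue.popleft() if queue else None

        def submit(self, task):
            # A new task is queued on the PE chosen by the PE-selection policy.
            self.queues[self.select_pe(task)].append(task)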

Page 9: PE selection scheduling algorithms

1. First-Free algorithm family
2. First-Free Round Robin (k)
3. Performance history-based scheduler

Page 10: Term explanation

pe - a processing element.
PElist - the set of all processing elements.
k[pe] - the number of tasks given to PE pe.
history[pe, f] - keeps the recorded performance for every pair of PE and task (function f).
allowedPE - keeps the PEs whose performance does not show a big unbalance with respect to the others.

Page 11: allowedPE (example of a big unbalance)

[Figure: an example with three PEs (PE1, PE2, PE3) and recorded performance values per task, illustrating when the ratio between two PEs becomes a "big unbalance"; suppose θ is 0.6.]

allowedPE = { pe | ∄ pe' : history[pe, f] / history[pe', f] > θ }
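
A minimal Python sketch of the allowedPE set, transcribing the formula above; it assumes history[(pe, f)] stores a recorded performance figure (e.g., the measured execution time) for task f on pe, and uses θ = 0.6 as in the slide's example.

    def allowed_pe(pe_list, history, f, theta=0.6):
        allowed = []
        for pe in pe_list:
            # pe stays allowed if no other pe' makes the ratio
            # history[pe, f] / history[pe', f] exceed theta.
            if not any(history[(pe, f)] / history[(other, f)] > theta
                       for other in pe_list if other != pe):
                allowed.append(pe)
        return allowed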

Page 12: Term explanation (cont.)

g(PElist) looks for the first idle PE in the set; if there is no such PE, it selects the GPU as the target.

h(allowedPE) estimates the waiting time in each queue and schedules the task to the queue with the smallest waiting time (if both queues are empty, it chooses the GPU). For this purpose the scheduler uses the performance history of every (task, PE) pair to predict how long it will take until all the tasks in a queue complete their execution.
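
A hedged sketch of g and h under an assumed data layout (not the paper's code): busy[pe] says whether pe is currently busy, queue[pe] lists the tasks waiting on pe, and history[(pe, task)] holds the measured execution time of that task on that PE; "gpu" stands for the GPU.

    def g(pe_list, busy, gpu="gpu"):
        # First idle PE in the set; if none is idle, fall back to the GPU.
        for pe in pe_list:
            if not busy[pe]:
                return pe
        return gpu

    def h(allowed, queue, history, gpu="gpu"):
        # Estimate each queue's waiting time from the performance history and
        # pick the PE whose queue should drain soonest (GPU if all are empty).
        if all(not queue[pe] for pe in allowed):
            return gpu
        return min(allowed,
                   key=lambda pe: sum(history[(pe, t)] for t in queue[pe]))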

Page 13: First-Free algorithm family

for all pe ∈ PElist do
    if pe is not busy then
        return pe
return g(PElist)

Page 14: First-Free Round Robin (k)

for all pe ∈ PElist do
    if pe is not busy then
        return pe
if k[pe] = 0 ∀ pe ∈ PElist then
    set k with initial values
for all pe ∈ PElist do
    if k[pe] ≠ 0 then
        k[pe] ← k[pe] - 1
        return pe

Parameter k = (k1, …, kn). Example: k = (1, 4) - per round, one task goes to the first PE and four tasks go to the second.
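
A hedged Python sketch of First-Free Round Robin (k), under the same assumed layout as before: busy[pe] marks busy PEs, credits is a mutable dict of remaining per-PE budgets (initially all zero), and k = (k1, …, kn) gives each PE's share of tasks per round.

    def ff_round_robin(pe_list, busy, credits, k):
        # First-Free part: an idle PE takes the task immediately.
        for pe in pe_list:
            if not busy[pe]:
                return pe
        # Round-robin part: once every budget is exhausted, refill from k.
        if all(credits[pe] == 0 for pe in pe_list):
            for pe, share in zip(pe_list, k):
                credits[pe] = share
        # Hand the task to the first PE that still has budget in this round.
        for pe in pe_list:
            if credits[pe] > 0:
                credits[pe] -= 1
                return pe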

Page 15: Performance History Scheduling

if ∃ pe ∈ PElist : history[pe, f] = null then
    return pe
allowedPE ← { pe | ∄ pe' : history[pe, f] / history[pe', f] > θ }
if ∃ pe ∈ allowedPE : pe is not busy then
    return pe
if allowedPE = ∅ then
    return g(PElist)
else
    return h(allowedPE)
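
The sketch below combines the previous pieces into the predictive scheduler, reusing the assumed allowed_pe, g, and h helpers and the busy/queue/history layout from the earlier sketches.

    def performance_history_schedule(task, pe_list, busy, queue, history, theta=0.6):
        # 1. If some PE has no recorded history for this task yet, run it there
        #    so the scheduler can learn (history[(pe, task)] is None until profiled).
        for pe in pe_list:
            if history.get((pe, task)) is None:
                return pe
        # 2. Keep only the PEs without a "big unbalance" (see allowedPE above).
        allowed = allowed_pe(pe_list, history, task, theta)
        # 3. Prefer an idle allowed PE; otherwise fall back to g (no allowed PE)
        #    or to h (the queue with the smallest predicted waiting time).
        for pe in allowed:
            if not busy[pe]:
                return pe
        if not allowed:
            return g(pe_list, busy)
        return h(allowed, queue, history)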

Page 16: Task selection

Tasks are selected in first-come, first-served (FCFS) order.

It would also be possible to implement more advanced techniques, such as work stealing, in order to improve the load balance across the different PEs.
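
FCFS selection is simply the popleft() shown in the first sketch. As a hedged illustration of the work-stealing idea mentioned above (not something the paper implements), an idle PE could take the most recently queued task from the PE with the longest backlog:

    def steal_task(idle_pe, queues):
        # `queues` maps each PE to its deque of pending tasks (oldest at the left).
        candidates = [pe for pe in queues if pe != idle_pe and queues[pe]]
        if not candidates:
            return None
        victim = max(candidates, key=lambda pe: len(queues[pe]))
        return queues[victim].pop()  # steal from the tail (the newest task)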

Page 17: Outline

Introduction, Scheduling Algorithm, Experiment method, Experiment results, Conclusions, Future work

Page 18: Benchmarks

matmul - performs multiple square-matrix multiplications.

ftdock - computes the interaction between two molecules (docking).

cp - computes the Coulombic potential at each grid point over one plane of a 3D grid in which point charges have been randomly distributed.

sad - used in MPEG video encoders to perform a sum of absolute differences between frames.

Page 19: Performance

Page 20: Experiment setup

A machine with an Intel Core 2 E6600 processor running at 2.40 GHz and 2 GB of RAM has been used.

The GPU is an NVIDIA 8600 GTS with 512 MB of memory.

The operating system is Red Hat Enterprise Linux 5.

Page 21: Outline

Introduction, Scheduling Algorithm, Experiment method, Experiment results, Conclusions, Future work

Page 22: CPU vs. GPU performance

Page 23: Algorithms' performance

Page 24: Algorithms' performance (cont.)

Page 25: Benchmarks run on the heterogeneous system

Page 26: Effect of the number of tasks on the scheduler

Page 27: Outline

Introduction, Scheduling Algorithm, Experiment method, Experiment results, Conclusions, Future work

Page 28: Conclusions

CPU/GPU-like systems consistently achieve speedups of 30% to 40% compared to using only the GPU in single-application mode.

Performance-predicting algorithms, which balance the system load better, perform consistently better.

Page 29: Outline

Introduction, Scheduling Algorithm, Experiment method, Experiment results, Conclusions, Future work

Page 30: Future work

We intend to study new algorithms in order to further improve overall system performance.

Other benchmarks with different characteristics will also be tried. We expect that with a more realistic set of benchmarks (not only GPU-biased) the benefits of our system would increase.

Page 31: Future work (cont.)

Use and extend techniques such as clustering, code versioning, and program-phase runtime adaptation to improve the utilization and adaptation of all available resources in future heterogeneous computing systems.

Page 32: Thanks for listening!