Programming modern (accelerator-based) multicore machines: a runtime's perspective
Raymond Namyst
University of Bordeaux, RUNTIME INRIA group
TORRENTS 2012, Toulouse, Dec 14th
INRIA Bordeaux – Sud-Ouest
The RUNTIME Team
High Performance Runtime Systems for Parallel Architectures
• Mid-size research group
  - 10 permanent researchers
  - 5 engineers
  - 8 PhD students
• Part of
  - INRIA Bordeaux – Sud-Ouest Research Center
  - LaBRI, the Computer Science Lab at University of Bordeaux 1
Runtime Systems for Parallel Architectures
Toward "portability of performance"
• Context
  - High Performance Computing
• Runtime systems
  - Perform dynamically what can't be done statically
  - Bridge the gap between the underlying architecture and application requirements
• Our goal
  - Achieve high performance while preserving portability of performance
[Figure: software stack: HPC applications, parallel compilers, and parallel libraries on top of the runtime system, which sits on the operating system and drivers, which drive the CPU, GPU, and NIC hardware]
Runtime Systems for Parallel Architectures
Toward "portability of performance"
• Our research
  - New runtime architectures
  - New algorithms
    • Scheduling, mapping, communication protocols
  - Open-source software
• Our vision
  - Abstract the architecture
    • Do not hide nor expose everything!
  - Extract knowledge from upper layers
    • Structure of parallelism
  - Let the runtime system take control!
[Figure: the same software stack over a node with two memory banks, four CPUs, two NICs, and two GPUs with dedicated memory]
Understanding the evolution of parallel machines
The end of Moore's law?
• The end of single-thread performance increase
  - Clock rate is no longer increasing
  - Thermal dissipation
• Processor architecture is already very sophisticated
  - Prediction and prefetching techniques achieve a very high success rate
  - Actually, processor complexity is decreasing!
• Question: what kind of circuits are worth adding on a chip?
Understanding the evolution of parallel machines
Welcome to the multicore era
• Answer: multicore chips
  - Several cores instead of a single processor
• Back to complex memory hierarchies
  - Shared caches
    • Organization is vendor-dependent
  - NUMA penalties
• Clusters can no longer be considered "flat sets of processors"
Understanding the evolution of parallel machines
Multicore is a solid trend
• More performance = more cores
  - Toward embarrassingly parallel machines?
• Designing scalable multicore architectures
  - 3D-stacked memory
  - Non-coherent cache architectures
    • Intel SCC
    • IBM Cell/BE
Then came heterogeneous computing
And it seems to be a solid trend…
• Accelerators
  - Nvidia & AMD GPUs
    • De facto adoption
    • Concrete success stories: speedups > 50
  - Intel MIC
  - Kalray MPPA
• Accelerators get more integrated
  - Intel Ivy Bridge
  - AMD APUs
• They all raise the same issues
  - Cost of moving data
    • Even for so-called APUs
  - Execution model
Then came heterogeneous computing
And it seems to be a solid trend…
• "Future processors will be a mix of general-purpose and specialized cores" (anonymous source)
  - One interpretation of Amdahl's law
    • Powerful, general-purpose cores are still needed to speed up sequential code
• Are we happy with that?
  - No, but it's probably unavoidable!
[Figure: a chip mixing large and small cores]
What Programming Models for such machines?
Widely used, standard programming models
• MPI
  - Communication interface
  - Scalable implementations already exist
    • Was actually designed with scalability in mind
    • Makes programmers "think" in terms of scalable algorithms
  - NUMA awareness?
  - Memory consumption
• OpenMP
  - Directive-based, incremental parallelization
  - Shared-memory model
    • Well suited to symmetric machines
  - Portability with respect to the number of cores
  - NUMA awareness?
OpenMP (1997)
A portable approach to shared-memory programming
• Extensions to existing languages
  - C, C++, Fortran
  - Set of programming directives
• Fork/join approach
  - Parallel sections
• Well suited to data-parallel programs
  - Parallel loops
• OpenMP 3.0 introduced tasks
  - Support for irregular parallelism (see the task sketch below)

int matrix[MAX][MAX];
...
#pragma omp parallel for
for (int i = 0; i < 400; i++)
{
    matrix[i][0] += ...
}

[Figure: fork/join: the master thread 0 forks worker threads 1-3 for the parallel loop, then joins]
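Since tasks are the main OpenMP 3.0 addition mentioned above, here is a minimal, hedged sketch of irregular parallelism that a parallel loop cannot express: a recursive sum over a binary tree. The node type and the tree_sum name are illustrative, not from the talk.

#include <stddef.h>

struct node { long value; struct node *left, *right; };

long tree_sum(struct node *n)
{
    if (n == NULL) return 0;
    long l = 0, r = 0;
    #pragma omp task shared(l)   /* left subtree may run on any idle thread */
    l = tree_sum(n->left);
    #pragma omp task shared(r)   /* right subtree likewise */
    r = tree_sum(n->right);
    #pragma omp taskwait         /* join both children before combining */
    return l + r + n->value;
}

/* Typical call site: one thread spawns the root task,
 * the whole team executes the generated tasks:
 *   #pragma omp parallel
 *   #pragma omp single
 *   total = tree_sum(root);
 */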
OpenMP (1997)
Multithreading over shared-memory machines
The Quest for Programming Models
Dealing with multicore machines
• Several efforts aim at making MPI and OpenMP multicore-aware
  - MPI
    • NUMA-aware buffer management
    • Efficient collective operations
  - OpenMP
    • Scheduling in a NUMA context (memory affinity, work stealing)
    • Memory management (page migration)
[Figure: a multicore node (four CPUs sharing memory) targeted by OpenMP, TBB, MPI, and Cilk]
The Quest for Programming Models
Dealing with accelerators
• Software development kits and hardware-specific languages
  - "Stay close to the hardware and get good performance"
    • Low-level abstractions
  - Compilers generate code for the accelerator device
• Examples
  - Nvidia's CUDA (Compute Unified Device Architecture)
  - IBM Cell SDK
  - OpenCL
[Figure: accelerators (processing units with dedicated memory) targeted by Cell SDK, CUDA, OpenCL, and Ct]
The Quest for Programming Models
Are we forced to use such low-level tools?
• Fortunately, well-known kernels are available
  - BLAS routines
    • e.g. CUBLAS
  - FFT kernels
• Implementations are continuously enhanced
  - High efficiency
• Limitations
  - Data must generally fit in the accelerator's memory
  - Multi-GPU configurations are not well supported
• Ongoing efforts
  - Using multi-GPU + multicore
    • PLASMA/MAGMA (ICL/UTK)
Directive-based approaches
Offloading tasks to accelerators
• Idea: use simple directives… and better compilers
  - HMPP (CAPS Enterprise), OpenMPC (Purdue University)
  - StarSs (Barcelona Supercomputing Center)
  - OpenACC (a minimal example follows the listing below)

#pragma omp task inout(C[BS][BS])
void matmul(float *A, float *B, float *C) {
    // regular implementation
}

#pragma omp target device(cuda) implements(matmul) \
    copy_in(A[BS][BS], B[BS][BS], C[BS][BS]) \
    copy_out(C[BS][BS])
void matmul_cuda(float *A, float *B, float *C) {
    // optimized kernel for CUDA
}
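For comparison, a minimal OpenACC sketch of the same offloading idea. The saxpy kernel is illustrative (not from the talk); any OpenACC-capable compiler should accept it.

/* The compiler offloads the loop to the accelerator; the data
 * clauses make the transfers explicit, as in the StarSs example above. */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}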
The Quest for Programming Models
How shall we program clusters of hybrid machines?
• Until the long-awaited new programming model shows up, we must go the hard hybrid way
  - Combine different paradigms
    • MPI/PGAS + OpenMP/TBB + CUDA/OpenCL
  - Performance portability is hard to achieve
    • Work distribution depends on the number of GPUs and CPUs per node…
    • Needs aggressive tuning
  - Runtime systems can help
[Figure: multicore CPUs programmed with OpenMP, TBB, or Cilk; accelerators with CUDA, OpenCL, OpenACC, OpenMPC, or HMPP; MPI between nodes, with question marks at the seams]
The Quest for Programming Models
How shall we program heterogeneous clusters?
• The uniform way
  - Use a single high-level programming language (or a combination of them) to deal with network + multicore + accelerators
  - Increasing number of directive-based languages
    • Use simple directives… and good compilers!
    • XcalableMP: PGAS approach
    • HMPP, OpenMPC, OpenMP 4.0: generate CUDA from OpenMP code
    • StarSs
  - Much better potential for composability…
    • Runtime systems can help!
[Figure: the same hybrid node, now targeted by a single layer combining OpenMP, HMPP, StarSs, OpenMPC, and XMP]
The role of runtime systems
Toward "portability of performance"
• Do dynamically what can't be done statically
  - Load balancing
  - React to hardware feedback
  - Autotuning, self-organization
• We need to put more intelligence into the runtime!
[Figure: the software stack again: HPC applications, parallel compilers, and parallel libraries above the runtime system, the operating system, and the CPU/GPU hardware]
Overview of StarPU
A runtime system for heterogeneous architectures
• Rationale
  - Dynamically schedule tasks on all processing units
    • See a pool of heterogeneous processing units
  - Avoid unnecessary data transfers between accelerators
    • Software VSM (virtual shared memory) for heterogeneous machines
[Figure: computing A = A+B on a node with four CPUs and two GPUs; StarPU replicates A and B across the memory banks as needed]
[Figure: the StarPU stack: HPC applications, parallel compilers, and parallel libraries on top of StarPU, which drives CPUs and GPUs through the CUDA and OpenCL drivers]
Overview of StarPU
Maximizing PU occupancy, minimizing data transfers
• Accept tasks that may have multiple implementations (a codelet sketch follows)
  - Together with potential inter-dependencies
    • Leads to a dynamic acyclic graph of tasks
    • Data-flow approach
  - Scheduling hints
• Open, general-purpose scheduling platform
  - Scheduling policies = plugins
[Figure: a task f with cpu/gpu/spu implementations accessing data (A in read-write mode, B and C in read mode), dispatched to CPU, GPU, and MIC workers]
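To make the codelet idea concrete, here is a hedged sketch in the style of the StarPU C API. Names follow the later 1.x releases (e.g. starpu_task_insert, STARPU_MAIN_RAM; the 2012-era spelling was starpu_insert_task), and the vector-scaling kernel is illustrative, not from the talk.

#include <stdint.h>
#include <starpu.h>

/* CPU implementation; a .cuda_funcs entry would provide the GPU
 * variant, and StarPU would pick one per task at schedule time. */
static void scal_cpu(void *buffers[], void *cl_arg)
{
    struct starpu_vector_interface *v = buffers[0];
    float *x = (float *) STARPU_VECTOR_GET_PTR(v);
    unsigned n = STARPU_VECTOR_GET_NX(v);
    float factor = *(float *) cl_arg;
    for (unsigned i = 0; i < n; i++)
        x[i] *= factor;
}

static struct starpu_codelet scal_cl = {
    .cpu_funcs = { scal_cpu },   /* multiple implementations go here */
    .nbuffers  = 1,
    .modes     = { STARPU_RW },  /* access modes drive the task graph */
};

int main(void)
{
    float vec[1024] = { 0 };
    float factor = 3.0f;
    starpu_data_handle_t h;

    starpu_init(NULL);
    starpu_vector_data_register(&h, STARPU_MAIN_RAM,  /* node 0 on old releases */
                                (uintptr_t) vec, 1024, sizeof(float));
    /* Dependencies are inferred from data accesses (data-flow approach). */
    starpu_task_insert(&scal_cl, STARPU_RW, h,
                       STARPU_VALUE, &factor, sizeof(factor), 0);
    starpu_task_wait_for_all();
    starpu_data_unregister(h);
    starpu_shutdown();
    return 0;
}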
Tasks scheduling
How does it work?
• When a task is submitted, it first goes into a pool of "frozen tasks" until all its dependencies are met
• Then, the task is "pushed" to the scheduler
• Idle processing units actively poll for work ("pop")
• What happens inside the scheduler is… up to you!
[Figure: tasks are pushed into the scheduler; CPU and GPU workers pop tasks from it]
Tasks scheduling
Developing your own scheduler
• Queue-based scheduling
  - Each worker "pops" tasks from a specific queue
• Implementing a strategy
  - Easy!
  - Select a queue topology
  - Implement "pop" and "push"
    • Priority tasks
    • Work stealing
    • Performance models, …
• A testbed for scheduling algorithms (a policy sketch follows)
[Figure: pushed tasks flow through a configurable topology of queues from which CPU and GPU workers pop]
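A hedged sketch of what such a plugin can look like, using the starpu_sched_policy structure and the task-list helpers from the public StarPU headers (field names follow the 1.x API; a real policy would add locking, per-worker queues, priorities, or stealing).

#include <starpu.h>
#include <starpu_scheduler.h>

/* Deliberately naive single shared FIFO: every worker pops from the
 * same queue. Illustrative only: no locking, no priorities. */
static struct starpu_task_list fifo;

static void naive_init(unsigned sched_ctx_id)
{
    (void) sched_ctx_id;
    starpu_task_list_init(&fifo);
}

static int naive_push(struct starpu_task *task)
{
    starpu_task_list_push_back(&fifo, task);   /* "push" side */
    return 0;
}

static struct starpu_task *naive_pop(unsigned sched_ctx_id)
{
    (void) sched_ctx_id;
    return starpu_task_list_pop_front(&fifo);  /* workers "pop" here */
}

struct starpu_sched_policy naive_policy = {
    .init_sched  = naive_init,
    .push_task   = naive_push,
    .pop_task    = naive_pop,
    .policy_name = "naive-fifo",
};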
Dealing with heterogeneous architectures
Performance prediction
• Task completion time estimation
  - History-based
  - User-defined cost function
  - Parametric cost model
• Can be used to improve scheduling (a model sketch follows)
  - e.g. Heterogeneous Earliest Finish Time (HEFT)
[Figure: a Gantt chart of predicted task durations across cpu #1-3 and gpu #1-2 over time]
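A hedged sketch of attaching a history-based model to a codelet, using the StarPU 1.x starpu_perfmodel API; the "scal" codelet comes from the earlier sketch.

#include <starpu.h>

/* StarPU records measured execution times per (symbol, data size,
 * worker architecture) and feeds the averages to HEFT-like policies. */
static struct starpu_perfmodel scal_model = {
    .type   = STARPU_HISTORY_BASED,   /* or STARPU_REGRESSION_BASED */
    .symbol = "scal",                 /* key for the on-disk history */
};

/* Hooked into the codelet from the earlier sketch:
 *   scal_cl.model = &scal_model;
 * then select a model-driven policy at run time, e.g.:
 *   $ STARPU_SCHED=dmda ./app
 */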
Dealing with heterogeneous architectures
Performance prediction
• Data transfer time estimation
  - Sampling based on off-line calibration
• Can be used to
  - Better estimate the overall execution time
  - Minimize data movements
[Figure: the same Gantt chart, now accounting for data transfer delays]
Scheduling in a hybrid environment
On the influence of scheduling
• LU decomposition without pivoting (16 GB input matrix)
  - 8 CPUs (Nehalem) + 3 GPUs (FX5800)
[Figure: bar charts of speed (0-800 GFlop/s) and total data transfers (0-60 GB) for four scheduling policies: greedy, task model, prefetch, and data model]
• Overall load distribution
  - 80% of the work goes to GPUs, 20% to CPUs
• StarPU exhibits good scalability with respect to:
  - Problem size
  - Number of GPUs
• But absolute performance was not state-of-the-art…
Mixing PLASMA and MAGMA with StarPU
With University of Tennessee & INRIA HiePACS
• Cholesky decomposition
  - 5 CPUs (Nehalem) + 3 GPUs (FX5800)
  - Efficiency > 100%
[Figure: performance (GFlop/s, up to ~1000) vs. matrix order (5120 to 46080) for 1, 2, and 3 GPUs and for 3 GPUs + 5 CPUs; a 4 GB mark indicates the device-memory limit]
Mixing PLASMA and MAGMA with StarPU
With University of Tennessee & INRIA HiePACS
• QR decomposition
  - 16 CPUs (AMD) + 4 GPUs (C1060)
[Figure: performance (GFlop/s, up to ~1000) vs. matrix order (up to 40000) for 1-4 GPU+CPU pairs and for 4 GPUs + 16 CPUs, compared against MAGMA; the extra 12 CPUs contribute ~200 GFlop/s, although 12 CPUs alone reach only ~150 GFlop/s]
Mixing PLASMA and MAGMA with StarPU
"Super-linear" efficiency in QR?
• Kernel efficiency
  - sgeqrt: CPU 9 GFlop/s, GPU 30 GFlop/s (ratio x3)
  - stsqrt: CPU 12 GFlop/s, GPU 37 GFlop/s (ratio x3)
  - somqr: CPU 8.5 GFlop/s, GPU 227 GFlop/s (ratio x27)
  - sssmqr: CPU 10 GFlop/s, GPU 285 GFlop/s (ratio x28)
• Task distribution observed with StarPU
  - sgeqrt: 20% of tasks on GPUs
  - sssmqr: 92.5% of tasks on GPUs
• Heterogeneous architectures are cool! :)
Theoretical bound
Can we perform a lot better?
• Express the set of tasks (and their dependencies) as a linear problem
  - With heterogeneous task durations and heterogeneous resources (one common relaxation is sketched below)
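The talk does not spell out the linear program; a common relaxation (an assumption here, ignoring dependencies) bounds the makespan \(T\) from below. Let \(n_{tw}\) be the number of tasks of type \(t\) mapped on worker \(w\), \(d_{tw}\) the measured duration of such a task on \(w\), and \(N_t\) the total number of type-\(t\) tasks:

\[
\min T \quad \text{s.t.} \quad
\sum_{t} n_{tw}\, d_{tw} \le T \;\; \forall w, \qquad
\sum_{w} n_{tw} = N_t \;\; \forall t, \qquad
n_{tw} \ge 0 .
\]

No schedule can finish faster than the optimal \(T\), so measured runs can be compared against this bound.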
Scheduling for Power Consumption
FP7 project
• Power consumption information can be retrieved from OpenCL devices (see the sketch below)
  - clGetEventProfilingInfo
    • CL_PROFILING_POWER_CONSUMED
  - Implemented by the Movidius multicore accelerator
• Autotuning of power models, similar to performance models
• Scheduling heuristic inspired by MCT (minimum completion time)
  - Actually, a subtle mix with MCT is possible
[Figure: a Gantt chart over cpu #1-3 and gpu #1-2 annotating each task's consumption as well as idle consumption]
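A hedged sketch of the query. The clGetEventProfilingInfo call is standard OpenCL, but CL_PROFILING_POWER_CONSUMED is a vendor extension per the slide, so the enum (assumed to come from the vendor's extension header) and its unit are assumptions.

#include <CL/cl.h>

/* 'ev' must come from a queue created with CL_QUEUE_PROFILING_ENABLE.
 * CL_PROFILING_POWER_CONSUMED is non-standard (Movidius extension);
 * we assume it reports the energy consumed by the kernel execution. */
cl_ulong energy_consumed(cl_event ev)
{
    cl_ulong energy = 0;
    clGetEventProfilingInfo(ev, CL_PROFILING_POWER_CONSUMED,
                            sizeof(energy), &energy, NULL);
    return energy;
}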
Ongoing work
• Target for the HMPP compiler and the XMP framework (ANR-JST FP3C)
  - Introduce dynamic scheduling into compilers for directive-based languages
  - Early experiments with the StarSs framework
• A quest for the "ideal" granularity
  - Enabling adaptive granularity
    • Divisible tasks
    • Autotuning
• Tuning on Intel MIC
Usability in real applications
SOCL: a StarPU-enabled OpenCL implementation
• Run legacy OpenCL codes on top of StarPU
  - The StarPU scheduler is involved only if a slight modification is made: attach the OpenCL queue to a context, not to a particular device (see the sketch below)
  - Incremental introduction of "dynamic scheduling" within OpenCL applications
  - Data prefetching is also possible
[Figure: a legacy OpenCL application running on the SOCL/StarPU stack, which drives the CPUs, GPUs, …]
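A hedged sketch of that "slight modification". The assumption here is that SOCL accepts a NULL device so StarPU can pick the device per kernel; standard OpenCL requires a concrete device, so this fragment is only meaningful when linked against SOCL.

#include <CL/cl.h>

/* Instead of binding the queue to one device, attach it to the
 * context and let the StarPU scheduler dispatch each kernel to
 * the most appropriate device (assumed SOCL behavior). */
cl_command_queue make_scheduled_queue(cl_context ctx)
{
    cl_int err;
    cl_command_queue q = clCreateCommandQueue(ctx, /* device */ NULL, 0, &err);
    return (err == CL_SUCCESS) ? q : NULL;
}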
From nested parallelism to parallel code reuse
Composing parallel libraries
• Integration with other shared-memory programming paradigms
  - We cannot taskify the whole world
  - At least provide support for parallel tasks
    • Implemented with threads/OpenMP/…
    • Nested parallelism
  - This raises the composability problem
• It's all about composability!
  - Probably the biggest upcoming challenge for runtime systems
    • Hybridization will mostly be indirect (linking against libraries)
  - And with composability come many related issues
    • Need for autotuning / scheduling hints
How will we program exascale machines?
Let's prepare for serious changes
• Billions of threads will be necessary to keep exascale machines busy
  - Exploit every source of (fine-grain) parallelism
    • Not every algorithm or problem can scale that far :(
  - Multi-scale, multi-physics applications are welcome!
    • Great opportunity to exploit multiple levels of parallelism
  - Is SIMD the only reasonable approach?
    • Are CUDA & OpenCL our future?
• No global, consistent view of a node's state
  - Local algorithms
  - Hierarchical coordination / load balancing
• Maybe, this time, we should seriously consider enabling (parallel) code reuse…
Scheduling contexts
Toward code composability
• Each context features its own scheduler (a hedged sketch follows)
• Multiple parallel libraries can run simultaneously
  - Virtualization of resources
  - Scalability workaround
• Contexts may share processing units
  - Avoid underutilized resources
• Contexts may expand and shrink
  - Hypervised approach
  - Maximize overall throughput
  - Use dynamic feedback from both the application and the runtime
[Figure: tasks pushed into contexts A and B, each with its own scheduler over its own set of CPU and GPU workers]
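A hedged sketch using the scheduling-context API that later StarPU releases expose (starpu_sched_ctx_create); names and availability in the 2012-era prototype may differ, and the worker ids and policy names are illustrative.

#include <starpu.h>

void setup_two_contexts(void)
{
    int workers_a[] = { 0, 1, 2, 3 };   /* illustrative worker ids */
    int workers_b[] = { 4, 5 };

    /* Each context gets its own workers and its own scheduler. */
    unsigned ctx_a = starpu_sched_ctx_create(workers_a, 4, "lib_A",
                         STARPU_SCHED_CTX_POLICY_NAME, "dmda", 0);
    unsigned ctx_b = starpu_sched_ctx_create(workers_b, 2, "lib_B",
                         STARPU_SCHED_CTX_POLICY_NAME, "eager", 0);

    /* Library code then selects its own context before submitting
     * tasks, e.g. with starpu_sched_ctx_set_context(&ctx_a). */
    (void) ctx_a; (void) ctx_b;
}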
Scheduling contexts
• Dynamic adjustment of context size
  - Based on computation velocity feedback
Toward a common runtime system?
i.e. a unified software stack
• There is currently no consensus on a common runtime system
  - Could we agree on a common runtime stack?
  - Technically feasible, but very challenging
[Figure: a unified multicore runtime system providing topology-aware scheduling, memory management, synchronization, task management (threads/tasklets/codelets), data distribution facilities, and I/O services to OpenMP, Intel TBB, CUDA, OpenCL, MPI, PGAS languages, MKL, PLASMA/MAGMA, and FFTW]
Major challenges are ahead…
We are living in exciting times!
More information: http://runtime.bordeaux.inria.fr