PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application...

29
PaRSEC : Distributed task - based runtime for scalable applications George Bosilca 1

Transcript of PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application...

Page 1: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

PaRSEC: Distributed task-based runtime for

scalable applications

George Bosilca

1

Page 2: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

A history of computing paradigms

Heterogeneity Concurrency*

Resiliency

SYNC

BSP & early message passing

MPI + X

SYNCSYNC

MPI + X + Y

• Difficult to express the potential algorithmic parallelism• Control flow• Software became an amalgam of algorithm, data distribution and architecture

characteristics• Increasing gaps between the capabilities of today’s programming

environments, the requirements of emerging applications, and the challenges of future parallel architectures

• What is productivity ?

SYNC

MPI + X + Y + Z + …

Page 3: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

Conc

epts

• Clear separation of concerns: compiler optimizeeach task class, developer describe dependencies between tasks, the runtime orchestrate the dynamic execution

• Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task, fork/join, …)

• Separate algorithms from data distribution• Make control flow executions a relic

Runt

ime

• Permeable portability layer for heterogeneous architectures

• Scheduling policies adapt every execution to the hardware & ongoing system status

• Data movements between producers and consumers are inferred from dependencies. Communications/computations overlap naturally unfold

• Coherency protocols minimize data movements

• Memory hierarchies (including NVRAM and disk) integral part of the scheduling decisions

PaRSEC: a generic runtime system for asynchronous, architecture aware scheduling of fine-grained tasks on distributed many-core heterogeneous architectures

Page 4: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

The PaRSEC framework

Cores Memory Hierarchies

CoherenceData

Movement Accelerators

Data Movement

Para

llel R

untim

eHa

rdw

are

Dom

ain

Spec

ific

Exte

nsio

ns

SchedulingSchedulingDistributed

Scheduling

Data Collections

Compact Representation -

PTG

Dynamic Discovered Representation -

DTG

SpecializedKernelsSpecialized

KernelsSpecialized Kernels

TasksTasksTaskclasses

Hardcore

Dense LA … Sparse LA Chemistry

*

DataDataData

*…

Page 5: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

The PaRSEC machine model• Execution flow execute tasks

sequentially• Can be bound to physical cores or can

oversubscribe a resource

• A domain is a collection of execution flows with particular hardware properties: memory locality, similar computing capabilities, …

• A Virtual Process is a localization domain defining the scope of automatic migration/delocalization• Multiple VP can coexist on the same physical node

• Replicate over the total number of nodes

Dom

ain

Dom

ain

Dom

ain

Virtual Process

Dom

ain

Dom

ain

Virtual Process

Phys

ical

Nod

e

Dom

ain

Dom

ain

Dom

ain

Virtual Process

Dom

ain

Dom

ain

Virtual Process

Phys

ical

Nod

e

Dom

ain

Dom

ain

Dom

ain

Virtual Process

Dom

ain

Dom

ain

Virtual Process

Phys

ical

Nod

e

Dom

ain

Dom

ain

Dom

ain

Virtual Process

Dom

ain

Dom

ain

Virtual Process

Phys

ical

Nod

e

Dom

ain

Dom

ain

Dom

ain

Virtual Process

Dom

ain

Dom

ain

Virtual Process

Phys

ical

Nod

e

Dom

ain

Dom

ain

Dom

ain

Virtual Process

Dom

ain

Dom

ain

Virtual Process

Phys

ical

Nod

e

User defined

Runtime defined

Page 6: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

What is a task?• An execution unit taking a set of input data

and generating, upon completion, a different set of output data

6

Bernstein conditions

Data collections

Graph of tasks

Page 7: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

The PaRSEC data• A data is a manipulation token, the basic logical

element used in the description of the dataflow• Location: have multiple coherent copies (remote

node, device, checkpoint)• Shape: can have different memory layout• Visibility: only accessible via the most current

version of the data• State: can be migrated / logged

• Data collections are ensemble of data distributed among the nodes• Can be regular (multi-dimensional matrices)• Or irregular (sparse data, graphs)• Can be regularly distributed (cyclic-k) or randomly

• Data View a subset of the data collection used in a particular algorithm (aka. submatrix, row, column,…)

Runtime defined

User defined

Dat

a Vi

ew

Data Collection

A(k)

v2

v1

v2

• A data-copy is the practical unit of data• Has a memory layout (think MPI datatype)• Has a property of locality (device, NUMA domain,

node)• Has a version associated with• Multiple instances can coexist

Page 8: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

A PaRSEC task• A task is a state machine• The state machine is dynamic:• Can be altered by the runtime based on

available resources• X and Y computing capability

detected (CUDA, Xeon Phi, …)• Resilient runtime

• Or can be altered programmatically

• Changing states is based on the transition return code• Task delocalization to another (possibly

external) execution domain• Task resubmission or reinitialization• Atomic tasks (and many more)

constructor

destructor

scatter

acquire

prologue

hook

epilogue

ready

Page 9: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

A PaRSEC task• A task is a state machine• The state machine is dynamic:• Can be altered by the runtime based on

available resources• X and Y computing capability

detected (CUDA, Xeon Phi, …)• Resilient runtime

• Or can be altered programmatically

• Changing states is based on the transition return code• Task delocalization to another (possibly

external) execution domain• Task resubmission or reinitialization• Atomic tasks (and many more)

constructor

validator

acquire

prologue

hook

epilogue

ready

prologue

hook

epilogue

ready

prologue

hook

epilogue

ready

destructor

scatter

Page 10: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

• Efficiently in terms of memory and search

• DAG are often large• One can hardly afford to generate them

ahead of time• Generate it dynamically only when it is

time• All input are available remotely• Enough inputs are available

(prefetch)• Merge parameterized DAGs with

dynamically generated DAGs

How to describe a graph of tasks ?

Page 11: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

How to describe a graph of tasks ?• Uncountable ways• Generic: Dagguer (Charm++), Legion, ParalleX,

Parameterized Task Graph (PaRSEC), Dynamic Task Discovery (StarPU, StarSS), Yvette (XML), Fork/Join (spawn). CnC

• Application specific: MADNESS

• PaRSEC runtime• The runtime is agnostic to the domain specific

language (DSL)• Different DSL interoperate through the data

collections• The DSL share• Distributed schedulers• Communication engine• Hardware resources• Data management (coherence,

versioning, …)• They don’t share• The task structure• The dataflow

11

Page 12: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

12

The insert_task interface dague_vector_t dDATA;dague_vector_init( &dDATA, matrix_Integer, matrix_Tile,

nodes, rank,1, /* tile_size*/N, /* Global vector size*/0, /* starting point */1 ); /* block size */

dague_context_t* dague;dague = dague_context_init(NULL, NULL); /* start the PaRSEC engine */

dague_dtd_handle_t* DAGUE_dtd_handle = dague_dtd_handle_new (dague);dague_enqueue(dague, (dague_handle_t*) DAGUE_dtd_handle);

for( n = 0; n < N; n++ ) {dague_insert_task(

DAGUE_dtd_handle,call_to_kernel_type_write, "Task Name",PASSED_BY_REF, DATA_AT(&dDATA, n), INOUT | REGION_FULL,0 /* DONE */);

for( k = 0; k < K; k++ ) {dague_insert_task(

DAGUE_dtd_handle,call_to_kernel_type_read, "Read_Task",PASSED_BY_REF, DATA_AT(&dDATA, n), INPUT | REGION_FULL,0 /* DONE */ );

}}

dague_handle_wait( DAGUE_dtd_handle );

Define a distributed collection of data (vector)

Start PaRSEC

Create a tasks placeholder and associate it with the PaRSEC context

Keep adding tasks. A configurable window will limit the number of pending tasks

Wait ’till completion

Page 13: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

The insert_task interface

• Preliminary results• No collective pattern detection• No data cache

13

8 nodes * 20 threads16 nodes * 8 threads

Page 14: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

The Parameterized Task Graph (JDF)

14

• A dataflow description based on data tracking• A simple affine description of the algorithm can be understood and

translated by a compiler into a more complex, control-flow free, form• Abide to all constraints imposed by current compiler technology

FOR k = 0 .. SIZE - 1

A[k][k], T[k][k] <- GEQRT( A[k][k] )

FOR m = k+1 .. SIZE - 1

A[k][k]|Up, A[m][k], T[m][k] <- TSQRT( A[k][k]|Up, A[m][k], T[m][k] )

FOR n = k+1 .. SIZE - 1

A[k][n] <- UNMQR( A[k][k]|Low, T[k][k], A[k][n] )

FOR m = k+1 .. SIZE - 1

A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )

GEQRT

TSQRT

UNMQR

TSMQR

Page 15: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

The Parameterized Task Graph (JDF)

15

GEQRT

TSQRT

UNMQR

TSMQR

FOR k = 0 .. SIZE - 1

A[k][k], T[k][k] <- GEQRT( A[k][k] )

FOR m = k+1 .. SIZE - 1

A[k][k]|Up, A[m][k], T[m][k] <- TSQRT( A[k][k]|Up, A[m][k], T[m][k] )

FOR n = k+1 .. SIZE - 1

A[k][n] <- UNMQR( A[k][k]|Low, T[k][k], A[k][n] )

FOR m = k+1 .. SIZE - 1

A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )

MEM

n = k+1m = k+1

k = 0

k = SIZE-1

LOWER

UPPER

Incoming DataOutgoing Data

• A dataflow description based on data tracking• A simple affine description of the algorithm can be understood and

translated by a compiler into a more complex, control-flow free, form• Abide to all constraints imposed by current compiler technology

Page 16: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

The Parameterized Task Graph (JDF)

16

GEQRT(k)/* Execution space */k = 0..( MT < NT ) ? MT-1 : NT-1 )

/* Locality */: A(k, k)

RW A <- (k == 0) ? A(k, k) : A1 TSMQR(k-1, k, k)

-> (k < NT-1) ? A UNMQR(k, k+1 .. NT-1) [type = LOWER]-> (k < MT-1) ? A1 TSQRT(k, k+1) [type = UPPER]-> (k == MT-1) ? A(k, k) [type = UPPER]

WRITE T <- T(k, k)-> T(k, k)-> (k < NT-1) ? T UNMQR(k, k+1 .. NT-1)

/* Priority */;(NT-k)*(NT-k)*(NT-k)

BODYzgeqrt( A, T )

END

• The resulting intermediary language is however more flexible

• Accept non-dense iterators• Allow inlined C/C++ code to

augment the language

• JDF Drawbacks:• Need to know the number of

tasks• The dependencies had to be

globally (and statically) defined prior to the execution• No dynamic DAGs• No data dependent

DAGs

Control flow is eliminated, therefore maximum parallelism is possible

Page 17: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

DPLASMA = ScaLAPACK interface & PaRSEC capabilities

Original pseudo- or PLASMA code is converted by a preprocessor into PaRSEC internal representation (shown below)

Dataflow representation is assembled with the runtime to create a set of executable parameterized tasks (PT), which can execute the kernels, and unfold successors in the graph

SerialCode

PaRSECcompiler

Dataflowrepresentation

Dataflowcompiler

Paralleltasks stubs

Runtime

Programmer

Systemcompiler

AdditionallibrariesMPI

CUDApthreads

PLASMA MAGMA

Application code &Codelets

PaRSEC Toolchain

DomainSpecificExtensions

Datadistribution

Supercomputer

1

1

2

2

GEQRT(k)/* Execution space */k = 0..( MT < NT ) ? MT-1 : NT-1 )/* Locality */: A(k, k)RW A <- (k == 0) ? A(k, k)

: A1 TSMQR(k-1, k, k)-> (k < NT-1) ? A UNMQR(k, k+1 .. NT-1) [type = LOWER]-> (k < MT-1) ? A1 TSQRT(k, k+1) [type = UPPER]-> (k == MT-1) ? A(k, k) [type = UPPER]

WRITE T <- T(k, k)-> T(k, k)-> (k < NT-1) ? T UNMQR(k, k+1 .. NT-1)

/* Priority */;(NT-k)*(NT-k)*(NT-k)

Intermediate dataflow representation

FOR k = 0 .. SIZE - 1

A[k][k], T[k][k] <- GEQRT( A[k][k] )

FOR m = k+1 .. SIZE - 1

A[k][k]|Up, A[m][k], T[m][k] <- TSQRT( A[k][k]|Up, A[m][k], T[m][k] )

FOR n = k+1 .. SIZE - 1

A[k][n] <- UNMQR( A[k][k]|Low, T[k][k], A[k][n] )

FOR m = k+1 .. SIZE - 1

A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )

TiledQRalgorithm:howkernelsareapplied onthematrixduringaniterationk

GEQRT

TSQRT

UNMQR

TSMQR

Page 18: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

DPLASMA = ScaLAPACK+ PaRSEC

Keeneland

0

5

10

15

20

25

768 2304 4032 5760 7776 10080 14784 19584 23868

PE

RF

OR

MA

NC

E (

TF

LO

P/S

)

NUMBER OF CORES

DGEQRF performance strong scaling

LibSCI Scalapack

Systolic QR over PaRSEC (2D)Systolic QR over PULSAR

DPLASMA HQR (best single tree)

Cray XT5 (Kraken) - N = M = 41,472

What about LU ?

Page 19: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

Sparse supportAdvanced examples Sparse direct solver over GPUs: PaStiX

Tasks structure

POTRFTRSMSYRKGEMM

(a) Dense tile task decomposition (b) Decomposition of the taskapplied while processing one panel

M. Faverge - ANR SOLHAR July 3, 2014- 57

Advanced examples Sparse direct solver over GPUs: PaStiX

DAG representation

POTRF

TRSM

T=>T

TRSM

T=>T

TRSM

T=>T TRSM

T=>T

GEMM

C=>B

GEMM

C=>B

GEMM

C=>B

SYRK

C=>A C=>A

SYRK

C=>A

GEMM

C=>B

GEMM

C=>B

GEMM

C=>C

SYRK

T=>T

C=>A C=>A

SYRK

C=>A

GEMM

C=>B

GEMM

C=>C

GEMM

C=>C

SYRK

T=>T

C=>A C=>AC=>A SYRK

C=>A

TRSM

C=>C

TRSM

C=>C

TRSM

C=>C

POTRF

T=>T

T=>T T=>TT=>T

C=>B

SYRK

C=>A C=>B C=>A C=>AC=>B

T=>T

GEMM

C=>C

SYRK

T=>T

SYRK

T=>T

C=>A C=>A C=>A

TRSM

C=>C POTRF

T=>T

T=>T

TRSM

T=>T

C=>C

C=>A C=>B

SYRK

T=>T

C=>AC=>A

POTRF

T=>T

TRSM

C=>C

T=>T

C=>A

POTRF

T=>T

(c) Dense DAG

panel(7)

gemm(19)

A=>A

gemm(20)

A=>A

gemm(21)

A=>Agemm(22)

A=>A

gemm(23)

A=>A

gemm(24)

A=>A

gemm(25)

A=>A

C=>C

C=>C

panel(8)

C=>A

gemm(27)

C=>C

C=>C

gemm(29)

C=>C

gemm(31)

C=>C A=>A

gemm(28)

A=>A

A=>A

gemm(30)

A=>A

A=>A

C=>C

panel(9)

C=>A

C=>C

gemm(33)

C=>C

gemm(36)

C=>C

panel(0)

gemm(1)

A=>A

gemm(2)

A=>Agemm(3)

A=>A

gemm(4)

A=>A

C=>C

panel(1)

C=>A

C=>C

gemm(6)

C=>C

A=>A

gemm(8)

C=>C

gemm(10)

C=>C

panel(5)

gemm(14)

A=>A

gemm(15)

A=>A

panel(6)

C=>A

gemm(17)

C=>C A=>A

C=>C

panel(2)

A=>A

gemm(12)

C=>C

panel(3)

A=>A

C=>C

panel(4)

A=>A

gemm(37)

C=>C

A=>A A=>A

gemm(34)

A=>A

gemm(35)

A=>A

A=>A

gemm(38)

A=>A C=>C

C=>C

panel(10)

C=>A

C=>C

gemm(40)

C=>C

panel(11)

C=>A

A=>A

(d) Sparse DAGrepresentation of asparse LDLT factorization

M. Faverge - ANR SOLHAR July 3, 2014- 58

Total, Inria Bordeaux, Inria Pau, LaBRI, ICL

Page 20: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

NWCHEM 6.5

A Open Source High-Performance Computational ChemistryConversion of NWChem CC code into dataflow form not trivial(CCSD code generated by TCE)• Control flow is not affine nor statically decidable:

• Loop execution space has holes,• dataflow goes through external routines,• conditional branches depend on program data,• memory access completely hidden in Global Arrays layer,

etc.à None of the traditional Compiler Analysis tools can be used

Uracil-dimer

inB4(23)

GEMM(23)

ADD_RESULT_IN_MEM(0)

inB4(15)

GEMM(15)

GEMM(16)

DFILL(0)

GEMM(0)

inB4(7)

GEMM(7)

GEMM(8)

GEMM(6)

GEMM(14)

GEMM(22)

inB2(4)

GEMM(4)

inB2(2)

GEMM(2)

inB2(0)

GEMM(1)

inB4(16)

GEMM(17)

inB4(11)

GEMM(11)

inB4(10)

GEMM(10)

inB4(9)

GEMM(9)

inB4(8)

inB4(6)

inB4(5)

GEMM(5)

inB4(3)

GEMM(3)

inB4(1)

inB4(17)

GEMM(18)

inB4(18)

GEMM(19)

inB4(19)

GEMM(20)

GEMM(12)

inB4(20)

GEMM(21)

inB4(12)

GEMM(13)

inB4(21)

inB4(13)

inB4(22)

inB4(14)

Legend

inB_X(i)

GEMM(i)

DFILL()

ADD_RESULT_IN_MEM()

Best

loca

lity

but b

ad p

aral

lelis

m

Other interactions with PaRSECWith Teresa Windus, Heike Jagodeand Anthony Danalis

Page 21: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

Integration of PaRSEC in CCSD

21

PARSEC-enabled version in 2 steps:1. Traverse execution space and

evaluate IF branches, without executing the actual computation (Since the data that affects the control flow is immutable at run-time, this step only needs to be performed once)

2. Create PTG – which includes lookups into our meta-data vectors populated by step 1.

Elimination of synchronization points by describing data dependencies between matrix blocksFiner grained (pure) tasks to allow for exploitation of more parallelism

Page 22: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

NWCHEM 6.5

A Open Source High-Performance Computational ChemistryConversion of NWChem CC code into dataflow form not trivial• Control flow is not affine nor statically decidable:

• Loop execution space has holes,• dataflow goes through external routines,• conditional branches depend on program data,• memory access completely hidden in Global Arrays layer,

etc.à None of the traditional Compiler Analysis tools can be used

Most significant outcomes of porting CC over PARSEC:1. Ability of expressing tasks and their data dependencies

at a finer granularity2. Decoupling of computation and communication enable

more advanced communication patterns than serial chains

Cascade @ EMSL/PNNL

C40H56

Page 23: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

Unbounded parallelism

• The only requirement is that upon a task completion the potential descendants are known• Uncountable DAGs• ” %option nb_local_tasks_fn = …”• Need user defined global termination

• Add support for dynamic DAGs• Already in the language• Properties of the algorithm / tasks• ”hash_fn = …” • ”find_deps_fn = …”

23

Page 24: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

DIP: Elastodynamic Wave Propagation

24

Geophysics - wave equation

Geophysics simulation

Figure : Elastic wave propagation in 3D (2D slice view)

Lionel BOILLOT (Inria) Task-based programming 12-apr-16 6 / 30

Task based programming Task dataflow

Fine granularity

Figure : Subdivision example

More than one domain per CPU

exhibit deeper parallelism

allow dynamic flexibility

reduce the boundary size

Lionel BOILLOT (Inria) Task-based programming 12-apr-16 18 / 30

Dynamically redistribute the data- use PAPI counters to estimate the

imbalance- reshuffle the frontiers to converge

to a load balanced scenario

Geophysics - wave equation

DIVA sequential algorithm

Quasi-explicit reformulation(v

n+1h

= v

n

h

+M

�1v

[�tR��n+1/2h

] UpdateVelocity

�n+3/2h

= �n+1/2h

+M

�1� [�tR

v

v

n+1h

] UpdateStress

Algorithme 3 : DIVA sequentialData : N

h

,�t

,Nt

Result : vh

, �h

[v1h

,�3/2h

] Initialization(Nh

,�t

);

for n = 1..Nt

dofor K = 1..N

h

do

v

n+1h

K

UpdateVelocity(vnh

K

,�n+1/2h

K

,�t

);

endfor K = 1..N

h

do

�n+3/2h

K

UpdateStress(�n+1/2h

K

, vn+1h

K

,�t

);

end

end

Lionel BOILLOT (Inria) Task-based programming 12-apr-16 8 / 30

Runtimes DAG of DIP

DIP algorithm

For n = 1 : n timesteps T

Communication(�n+1/2h

)

vn+1h

computeVelocity(vn

h

,�n+1/2h

,�t

)Communication(vn+1

h

)

�n+3/2h

computeStress(�n+1/2h

, vn+1h

,�t

)End For t

Let’s rename the algorithm steps:

EXCHANGE VS

COMPUTE V

EXCHANGE VV

COMPUTE S

Let’s divide the EXCHANGE task into SEND, RECV and COPY tasks

Lionel BOILLOT (INRIA – TOTAL) HPC: runtime & coprocessors ICL Lunch 13 / 1

Finer grain partitioning comparedwith MPIIncreased communications but alsoincreased potential for parallelismNeed for load-balancing

Total, Inria Bordeaux, Inria Pau, ICL

Page 25: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

DIP: Elastodynamic Wave Propagation

25 Total, Inria Bordeaux, Inria Pau, ICL

Numerical illustration

Intel Xeon Phi results - e�ciency

0.2

0.4

0.6

0.8

1

1 2 4 8 16 32 60 120 240

effic

iency

number of threads

PerfectPaRSEC

MPI

Lionel BOILLOT (Inria) Task-based programming 04-mar-15 22 / 23

Not

yet H

T re

ady

Numerical illustration

Trace comparison

Figure: MPI-based t = 2.517s

Figure: PaRSEC version (NUMA-aware, granularity x6) t = 2.060s

Lionel BOILLOT (Inria) Task-based programming 04-mar-15 20 / 23

Numerical illustration

Trace comparison

Figure: MPI-based t = 2.517s

Figure: PaRSEC version (NUMA-aware, granularity x6) t = 2.060s

Lionel BOILLOT (Inria) Task-based programming 04-mar-15 20 / 23

2517s

2060s

Page 26: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

Resilience: Data Logging Strategy

• Minimize the amount of tasks reexecutions by logging data• Checkpoint interval β, a process will save

a copy of each data every β updates.• Input of failed task:• The same tile checkpointed at mostβ updates ago

• Final output of another task (validated)• Max number of re-executions is β

for factorizations

Checkpoint Beginning Middle End No Failure

β (NB/N)3 β6(NB/N)3 β6(NB/N)3 0

never (NB/N)3 12.5% 100% 0

β = 2

Page 27: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

Checkpoint Beginning Middle End No Failure

β (NB/N)3 β6(NB/N)3 β6(NB/N)3 0

never (NB/N)3 12.5% 100% 0

Resilience: Data Logging Strategy

Page 28: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

Conclusions• Don’t make hardware a serious impediment to

scientific simulation• Programming must be made easy(ier)• Portability: inherently take advantage of all hardware capabilities• Efficiency: deliver the best performance on several families of algorithms

• Build a scientific enabler allowingdifferent communities to focus ondifferent problems• Application developers on their algorithms• Language specialists on Domain Specific Languages• System developers on system issues• Compilers on optimizing the task code

Page 29: PaRSEC: Distributed task- based runtime for scalable ...€¦ · • Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task,

The PaRSECecosystem• Support for many different types of applications• Dense Linear Algebra: DPLASMA, MORSE/Chameleon• Sparse Linear Algebra: PaSTIX• Geophysics: Total - Elastodynamic Wave Propagation• Chemistry: NWChem Coupled Cluster, MADNESS,

TiledArray• *: ScaLAPACK, MORSE/Chameleon

• A set of tools to understandthe performance

29

Physics & Maths background Geophysics context

RTM context

Geophysics:Hydrocarbons detection: petroleum or natural gasEarth medium: seismic waves, heterogeneous complex domain

Simulation:Seismic imaging: find the subsurface layersEquations: elastic/acoustic wave in 2D/3D

Reverse Time Migration (RTM)

Iterative method based on multiple wave equation resolutions

Lionel BOILLOT (INRIA – TOTAL) HPC: runtime & coprocessors ICL Lunch 7 / 1

0

5000

10000

15000

20000

25000

0 5 10 15 20 25 30 35

Pow

er

(Watts)

Time (seconds)

System

CPU

Memory

Network

(a) ScaLAPACK.

0

5000

10000

15000

20000

25000

0 5 10 15 20 25 30 35

Pow

er

(Watts)

Time (seconds)

System

CPU

Memory

Network

(b) DPLASMA.

Figure 11. Power Profiles of the Cholesky Factorization.

0

5000

10000

15000

20000

25000

0 20 40 60 80 100

Pow

er

(Watts)

Time (seconds)

System

CPU

Memory

Network

(a) ScaLAPACK.

0

5000

10000

15000

20000

25000

0 20 40 60 80 100

Pow

er

(Watts)

Time (seconds)

System

CPU

Memory

Network

(b) DPLASMA.

Figure 12. Power Profiles of the QR Factorization.

smaller number of cores. The engine could then decide toturn off or lower the frequencies of the cores using DynamicVoltage Frequency Scaling [16], a commonly used techniquewith which it is possible to achieve reduction of energyconsumption.

ACKNOWLEDGMENT

The authors would like to thank Kirk Cameron and Hung-Ching Chang from the Department of Computer Science atVirginia Tech, for granting access to their platform.

# Cores Library Cholesky QR

128 ScaLAPACK 192000 672000DPLASMA 128000 540000

256 ScaLAPACK 240000 816000DPLASMA 96000 540000

512 ScaLAPACK 325000 1000000DPLASMA 125000 576000

Figure 13. Total amount of energy (joule) used for each test based on thenumber of cores

REFERENCES

[1] MPI-2: Extensions to the message passing interface standard.http://www.mpi-forum.org/ (1997)

[2] Agullo, E., Hadri, B., Ltaief, H., Dongarra, J.: Comparativestudy of one-sided factorizations with multiple software pack-ages on multi-core hardware. SC ’09: Proceedings of theConference on High Performance Computing Networking,Storage and Analysis pp. 1–12 (2009)

[3] Anderson, E., Bai, Z., Bischof, C., Blackford, S.L., Demmel,J.W., Dongarra, J.J., Croz, J.D., Greenbaum, A., Hammarling,S., McKenney, A., Sorensen, D.C.: LAPACK User’s Guide,3rd edn. Society for Industrial and Applied Mathematics,Philadelphia (1999)

[4] Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Haidar,A., Herault, T., Kurzak, J., Langou, J., Lemarinier, P., Ltaief,H., Luszczek, P., YarKhan, A., Dongarra, J.: Flexible Devel-opment of Dense Linear Algebra Algorithms on MassivelyParallel Architectures with DPLASMA. In: the 12th IEEEInternational Workshop on Parallel and Distributed Scientificand Engineering Computing (PDSEC-11). ACM, Anchorage,AK, USA (2011)

[5] Bosilca, G., Bouteiller, A., Danalis, A., Herault, T.,Lem arinier, P., Dongarra, J.: DAGuE: A generic dis-tributed DAG engine for high performance computing.Tech. Rep. 231, LAPACK Working Note (2010). URLhttp://www.netlib.org/lapack/lawnspdf/lawn231.pdf

[6] Bosilca, G., Bouteiller, A., Herault, T., Lemarinier, P., Don-garra, J.: DAGuE: A generic distributed DAG engine for highperformance computing (2011)

[7] Buttari, A., Langou, J., Kurzak, J., Dongarra, J.: A class ofparallel tiled linear algebra algorithms for multicore architec-tures. Parallel Computing 35(1), 38–53 (2009)

[8] Choi, J., Demmel, J., Dhillon, I., Dongarra, J., Ostrou-chov, S., Petitet, A., Stanley, K., Walker, D., Whaley, R.C.:ScaLAPACK, a portable linear algebra library for distributedmemory computers-design issues and performance. ComputerPhysics Communications 97(1-2), 1–15 (1996)

[9] Cosnard, M., Jeannot, E.: Compact DAG representation andits dynamic scheduling. Journal of Parallel and DistributedComputing 58, 487–514 (1999)

[10] Dongarra, J., Beckman, P.: The International Exascale Soft-ware Roadmap. International Journal of High PerformanceComputer Applications 25(1) (2011)