PaRSEC: Distributed task-based runtime for
scalable applications
George Bosilca
A history of computing paradigms
[Figure: programming models over time, separated by SYNC points: BSP & early message passing, then MPI + X, then MPI + X + Y, then MPI + X + Y + Z + …; labels: Heterogeneity, Concurrency*, Resiliency]
• Difficult to express the potential algorithmic parallelism
• Control flow
• Software became an amalgam of algorithm, data distribution and architecture characteristics
• Increasing gaps between the capabilities of today’s programming environments, the requirements of emerging applications, and the challenges of future parallel architectures
• What is productivity?
Concepts
• Clear separation of concerns: the compiler optimizes each task class, the developer describes dependencies between tasks, the runtime orchestrates the dynamic execution
• Interface with the application developers through specialized domain specific languages (PTG/JDF, Python, insert_task, fork/join, …)
• Separate algorithms from data distribution
• Make control flow executions a relic
Runtime
• Permeable portability layer for heterogeneous architectures
• Scheduling policies adapt every execution to the hardware & ongoing system status
• Data movements between producers and consumers are inferred from dependencies. Communication/computation overlap unfolds naturally
• Coherency protocols minimize data movements
• Memory hierarchies (including NVRAM and disk) are an integral part of the scheduling decisions
PaRSEC: a generic runtime system for asynchronous, architecture aware scheduling of fine-grained tasks on distributed many-core heterogeneous architectures
The PaRSEC framework
[Figure: the PaRSEC software stack. Hardware: cores, memory hierarchies, coherence, data movement, accelerators. Parallel runtime: scheduling, distributed scheduling, data movement, data collections, data, task classes, specialized kernels, and two task-graph representations: compact (PTG) and dynamically discovered (DTG). Domain specific extensions: Hardcore, Dense LA, Sparse LA, Chemistry, …]
The PaRSEC machine model
• An execution flow executes tasks sequentially
  • Can be bound to physical cores or can oversubscribe a resource
• A domain is a collection of execution flows with particular hardware properties: memory locality, similar computing capabilities, …
• A Virtual Process is a localization domain defining the scope of automatic migration/delocalization
  • Multiple VPs can coexist on the same physical node
• Replicated over the total number of nodes
[Figure: six physical nodes, each hosting two Virtual Processes that contain several Domains; legend: User defined, Runtime defined]
What is a task?
• An execution unit taking a set of input data and generating, upon completion, a different set of output data
[Figure labels: Bernstein conditions, Data collections, Graph of tasks]
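The slide names the Bernstein conditions without stating them. As a minimal sketch, assuming each task's inputs and outputs are encoded as bit masks (the `task_access_t` type and function name are hypothetical, not part of PaRSEC), the check for whether two tasks may run concurrently could look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task descriptor: each bit of a mask names one piece of data. */
typedef struct {
    uint32_t in;   /* data read by the task */
    uint32_t out;  /* data written by the task */
} task_access_t;

/* Bernstein conditions: two tasks may run concurrently when neither writes
 * what the other reads, and they do not write the same data. */
bool can_run_in_parallel(task_access_t a, task_access_t b)
{
    bool no_flow   = (a.out & b.in)  == 0;  /* a does not feed b */
    bool no_anti   = (a.in  & b.out) == 0;  /* b does not overwrite a's input */
    bool no_output = (a.out & b.out) == 0;  /* no write/write conflict */
    return no_flow && no_anti && no_output;
}
```

Two tasks that only share an input satisfy all three conditions; a task that reads what another writes does not, and that pair becomes an edge in the graph of tasks.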
The PaRSEC data
• A data is a manipulation token, the basic logical element used in the description of the dataflow
  • Location: can have multiple coherent copies (remote node, device, checkpoint)
  • Shape: can have different memory layouts
  • Visibility: only accessible via the most current version of the data
  • State: can be migrated / logged
• Data collections are ensembles of data distributed among the nodes
  • Can be regular (multi-dimensional matrices)
  • Or irregular (sparse data, graphs)
  • Can be regularly distributed (cyclic-k) or randomly
• Data View: a subset of the data collection used in a particular algorithm (e.g. submatrix, row, column, …)
• A data-copy is the practical unit of data
  • Has a memory layout (think MPI datatype)
  • Has a property of locality (device, NUMA domain, node)
  • Has a version associated with it
  • Multiple instances can coexist
[Figure: a Data View over a Data Collection; the data A(k) has copies at versions v1 and v2; legend: Runtime defined, User defined]
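To make the "regularly distributed (cyclic-k)" case concrete, here is a minimal sketch of the usual 1D block-cyclic mapping from a global index to its owning node and local position. The function names are illustrative and are not the PaRSEC data-collection API.

```c
#include <assert.h>

/* Owner of element `index` in a 1D block-cyclic ("cyclic-k") distribution:
 * blocks of k consecutive elements are dealt to the nodes round-robin. */
int cyclic_k_owner(int index, int k, int nodes)
{
    assert(index >= 0 && k > 0 && nodes > 0);
    return (index / k) % nodes;
}

/* Position of the same element inside the owner's local storage. */
int cyclic_k_local_index(int index, int k, int nodes)
{
    assert(index >= 0 && k > 0 && nodes > 0);
    int block = index / k;            /* global block number */
    int local_block = block / nodes;  /* blocks the owner stored before it */
    return local_block * k + index % k;
}
```

With k = 2 and 3 nodes, elements 0 and 1 live on node 0, elements 2 and 3 on node 1, elements 4 and 5 on node 2, and elements 6 and 7 wrap around to node 0.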
A PaRSEC task
• A task is a state machine
• The state machine is dynamic:
  • Can be altered by the runtime based on available resources
    • X and Y computing capability detected (CUDA, Xeon Phi, …)
    • Resilient runtime
  • Or can be altered programmatically
• Changing states is based on the transition return code
  • Task delocalization to another (possibly external) execution domain
  • Task resubmission or reinitialization
  • Atomic tasks (and many more)
[Figure: task state machine with states constructor, destructor, scatter, acquire, prologue, hook, epilogue, ready]
[Figure: extended task state machine with states constructor, validator, acquire, three parallel prologue, hook, epilogue, ready chains, destructor, scatter]
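The idea that state changes are driven by the transition return code can be sketched as follows. This is an illustration only: the return codes, the `run_task` driver and the two sample hooks are hypothetical names chosen for this example, not the PaRSEC API, and the retry policy is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes; the names are illustrative, not the PaRSEC API. */
typedef enum { HOOK_DONE, HOOK_AGAIN, HOOK_NEXT } hook_rc_t;

typedef hook_rc_t (*hook_fn_t)(void *task_data);

/* Run a task through its candidate hooks (one per execution domain).
 * Returns the index of the hook that completed the task, or -1 if none did.
 * `max_retries` bounds resubmissions of a single hook; once exhausted, the
 * task falls through to the next hook. */
int run_task(hook_fn_t *hooks, size_t nb_hooks, void *task_data, int max_retries)
{
    assert(hooks != NULL);
    for (size_t i = 0; i < nb_hooks; i++) {
        for (int attempt = 0; attempt <= max_retries; attempt++) {
            hook_rc_t rc = hooks[i](task_data);
            if (rc == HOOK_DONE)
                return (int)i;   /* move on to the epilogue */
            if (rc == HOOK_NEXT)
                break;           /* delocalize: try the next domain */
            /* HOOK_AGAIN: resubmit the same hook */
        }
    }
    return -1;
}

/* Two illustrative hooks: an accelerator that is absent, and a CPU fallback
 * that succeeds on its second submission. */
hook_rc_t gpu_hook(void *task_data) { (void)task_data; return HOOK_NEXT; }

hook_rc_t cpu_hook(void *task_data)
{
    int *calls = task_data;
    return (++*calls < 2) ? HOOK_AGAIN : HOOK_DONE;
}
```

Calling `run_task` with the hooks `{ gpu_hook, cpu_hook }` skips the first domain, resubmits the second hook once, and reports that hook index 1 completed the task.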
How to describe a graph of tasks?
• Efficiently in terms of memory and search
• DAGs are often large
  • One can hardly afford to generate them ahead of time
  • Generate it dynamically only when it is time
    • All inputs are available remotely
    • Enough inputs are available (prefetch)
• Merge parameterized DAGs with dynamically generated DAGs
• Uncountable ways
  • Generic: Dagger (Charm++), Legion, ParalleX, Parameterized Task Graph (PaRSEC), Dynamic Task Discovery (StarPU, StarSS), Yvette (XML), Fork/Join (spawn), CnC
  • Application specific: MADNESS
• PaRSEC runtime
  • The runtime is agnostic to the domain specific language (DSL)
  • Different DSLs interoperate through the data collections
  • The DSLs share
    • Distributed schedulers
    • Communication engine
    • Hardware resources
    • Data management (coherence, versioning, …)
  • They don’t share
    • The task structure
    • The dataflow
The insert_task interface

    /* Define a distributed collection of data (vector) */
    dague_vector_t dDATA;
    dague_vector_init( &dDATA, matrix_Integer, matrix_Tile,
                       nodes, rank,
                       1,     /* tile_size */
                       N,     /* Global vector size */
                       0,     /* starting point */
                       1 );   /* block size */

    /* Start PaRSEC */
    dague_context_t* dague;
    dague = dague_context_init(NULL, NULL); /* start the PaRSEC engine */

    /* Create a tasks placeholder and associate it with the PaRSEC context */
    dague_dtd_handle_t* DAGUE_dtd_handle = dague_dtd_handle_new(dague);
    dague_enqueue(dague, (dague_handle_t*) DAGUE_dtd_handle);

    /* Keep adding tasks. A configurable window will limit the number of
     * pending tasks */
    for( n = 0; n < N; n++ ) {
        dague_insert_task(
            DAGUE_dtd_handle,
            call_to_kernel_type_write, "Task Name",
            PASSED_BY_REF, DATA_AT(&dDATA, n), INOUT | REGION_FULL,
            0 /* DONE */ );
        for( k = 0; k < K; k++ ) {
            dague_insert_task(
                DAGUE_dtd_handle,
                call_to_kernel_type_read, "Read_Task",
                PASSED_BY_REF, DATA_AT(&dDATA, n), INPUT | REGION_FULL,
                0 /* DONE */ );
        }
    }

    /* Wait till completion */
    dague_handle_wait( DAGUE_dtd_handle );
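The listing above declares only access modes (INOUT, INPUT); the runtime is left to discover the dependencies. The sketch below shows one way such discovery can work in a sequential task flow, restricted to one datum per task as in the example. The tracker type, the mode constants and the function names are hypothetical and do not reflect how PaRSEC implements it internally.

```c
#include <assert.h>

#define MAX_DATA 64
enum { MODE_INPUT, MODE_INOUT };  /* hypothetical access modes */

typedef struct {
    int last_writer[MAX_DATA];    /* id of the last task that wrote each datum, -1 if none */
    int readers_since[MAX_DATA];  /* readers inserted since that write */
    int next_id;                  /* id given to the next inserted task */
} dtd_tracker_t;

void tracker_init(dtd_tracker_t *t)
{
    for (int d = 0; d < MAX_DATA; d++) {
        t->last_writer[d] = -1;
        t->readers_since[d] = 0;
    }
    t->next_id = 0;
}

/* Insert a task touching datum d; returns the number of tasks it must wait
 * for. A reader waits for the last writer (read-after-write). A writer waits
 * for the readers inserted since the last write (write-after-read) or, if
 * there are none, for the last writer itself (write-after-write). */
int tracker_insert(dtd_tracker_t *t, int d, int mode)
{
    assert(d >= 0 && d < MAX_DATA);
    int has_writer = (t->last_writer[d] >= 0);
    int deps;
    if (mode == MODE_INPUT) {
        deps = has_writer;
        t->readers_since[d]++;
    } else {
        deps = t->readers_since[d] > 0 ? t->readers_since[d] : has_writer;
        t->last_writer[d] = t->next_id;
        t->readers_since[d] = 0;
    }
    t->next_id++;
    return deps;
}
```

For the pattern in the listing (one INOUT task followed by K INPUT tasks on the same element), every reader gets exactly one dependency, on the writer, so the K readers are free to run concurrently once the writer completes.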
The insert_task interface
• Preliminary results
• No collective pattern detection
• No data cache
[Figure panels: 8 nodes * 20 threads; 16 nodes * 8 threads]
The Parameterized Task Graph (JDF)
• A dataflow description based on data tracking
• A simple affine description of the algorithm can be understood and translated by a compiler into a more complex, control-flow free, form
• Abides by all constraints imposed by current compiler technology

    FOR k = 0 .. SIZE - 1
        A[k][k], T[k][k] <- GEQRT( A[k][k] )
        FOR m = k+1 .. SIZE - 1
            A[k][k]|Up, A[m][k], T[m][k] <- TSQRT( A[k][k]|Up, A[m][k], T[m][k] )
        FOR n = k+1 .. SIZE - 1
            A[k][n] <- UNMQR( A[k][k]|Low, T[k][k], A[k][n] )
            FOR m = k+1 .. SIZE - 1
                A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )

[Figure labels: GEQRT, TSQRT, UNMQR, TSMQR]
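To give a feel for how quickly the task graph of this loop nest grows, the sketch below walks the same four loops and counts the tasks of each class instead of executing them. The struct and function names are made up for this illustration.

```c
#include <assert.h>

typedef struct { long geqrt, tsqrt, unmqr, tsmqr; } qr_task_count_t;

/* Walk the same loop nest as the pseudocode, counting instead of computing. */
qr_task_count_t count_qr_tasks(int size)
{
    assert(size >= 0);
    qr_task_count_t c = { 0, 0, 0, 0 };
    for (int k = 0; k < size; k++) {
        c.geqrt++;
        for (int m = k + 1; m < size; m++)
            c.tsqrt++;
        for (int n = k + 1; n < size; n++) {
            c.unmqr++;
            for (int m = k + 1; m < size; m++)
                c.tsmqr++;
        }
    }
    return c;
}
```

For SIZE = 3 this gives 3 GEQRT, 3 TSQRT, 3 UNMQR and 5 TSMQR tasks. The TSMQR count is a sum of squares, so the total grows cubically with SIZE, which is why a compact, parameterized description of the graph is attractive.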
[Figure: the same loop nest unrolled into a dataflow graph of GEQRT, TSQRT, UNMQR and TSMQR tasks, from k = 0 to k = SIZE-1 (with n = k+1, m = k+1); edges labelled MEM, LOWER and UPPER; legend: incoming data, outgoing data]
The Parameterized Task Graph (JDF)

    GEQRT(k)

    /* Execution space */
    k = 0 .. (( MT < NT ) ? MT-1 : NT-1 )

    /* Locality */
    : A(k, k)

    RW    A <- (k == 0)    ? A(k, k) : A1 TSMQR(k-1, k, k)
            -> (k < NT-1)  ? A UNMQR(k, k+1 .. NT-1)   [type = LOWER]
            -> (k < MT-1)  ? A1 TSQRT(k, k+1)          [type = UPPER]
            -> (k == MT-1) ? A(k, k)                   [type = UPPER]
    WRITE T <- T(k, k)
            -> T(k, k)
            -> (k < NT-1)  ? T UNMQR(k, k+1 .. NT-1)

    /* Priority */
    ; (NT-k)*(NT-k)*(NT-k)

    BODY
        zgeqrt( A, T )
    END

• The resulting intermediary language is however more flexible
  • Accepts non-dense iterators
  • Allows inlined C/C++ code to augment the language
• JDF drawbacks:
  • Need to know the number of tasks
  • The dependencies have to be globally (and statically) defined prior to the execution
    • No dynamic DAGs
    • No data dependent DAGs
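One consequence of the GEQRT(k) description above is that the neighbours of any task can be computed from its parameters alone, without materializing the DAG. As a rough sketch (the function names are invented here; MT and NT are the tile counts used in the JDF), the output side of flow A can be evaluated like this:

```c
#include <assert.h>

/* Number of tasks (or memory locations) that receive flow A of GEQRT(k),
 * read directly off the three "->" lines of the GEQRT description.
 * MT and NT are the tile counts of the matrix, as in the JDF. */
int geqrt_A_nb_outputs(int k, int MT, int NT)
{
    assert(k >= 0 && k < (MT < NT ? MT : NT));
    int n = 0;
    if (k < NT - 1)  n += (NT - 1) - k;  /* A UNMQR(k, k+1 .. NT-1) */
    if (k < MT - 1)  n += 1;             /* A1 TSQRT(k, k+1) */
    if (k == MT - 1) n += 1;             /* written back to A(k, k) */
    return n;
}

/* Whether flow A of GEQRT(k) comes from memory (1) or from TSMQR(k-1, k, k) (0). */
int geqrt_A_reads_memory(int k)
{
    return k == 0;
}
```

On a 4 x 4 tile matrix, GEQRT(0) feeds three UNMQR tasks and one TSQRT task, while GEQRT(3) only writes its tile back to memory. Each query costs a handful of comparisons regardless of the size of the graph.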
Control flow is eliminated, therefore maximum parallelism is possible
DPLASMA = ScaLAPACK interface & PaRSEC capabilities
Original pseudo- or PLASMA code is converted by a preprocessor into PaRSEC internal representation (shown below)
Dataflow representation is assembled with the runtime to create a set of executable parameterized tasks (PT), which can execute the kernels, and unfold successors in the graph
[Figure: PaRSEC toolchain. Components: programmer, serial code, PaRSEC compiler, dataflow representation, dataflow compiler, parallel tasks stubs, application code & codelets, domain specific extensions, data distribution, system compiler, runtime, additional libraries (MPI, CUDA, pthreads, PLASMA, MAGMA), supercomputer]
Intermediate dataflow representation:

    GEQRT(k)

    /* Execution space */
    k = 0 .. (( MT < NT ) ? MT-1 : NT-1 )

    /* Locality */
    : A(k, k)

    RW    A <- (k == 0)    ? A(k, k) : A1 TSMQR(k-1, k, k)
            -> (k < NT-1)  ? A UNMQR(k, k+1 .. NT-1)   [type = LOWER]
            -> (k < MT-1)  ? A1 TSQRT(k, k+1)          [type = UPPER]
            -> (k == MT-1) ? A(k, k)                   [type = UPPER]
    WRITE T <- T(k, k)
            -> T(k, k)
            -> (k < NT-1)  ? T UNMQR(k, k+1 .. NT-1)

    /* Priority */
    ; (NT-k)*(NT-k)*(NT-k)

Serial code: the tiled QR loop nest (GEQRT, TSQRT, UNMQR, TSMQR) shown on the earlier slides.
[Figure: tiled QR algorithm, how kernels are applied on the matrix during an iteration k]
DPLASMA = ScaLAPACK + PaRSEC
Keeneland
[Figure: DGEQRF performance, strong scaling, on Cray XT5 (Kraken), N = M = 41,472. x-axis: number of cores (768 to 23868); y-axis: performance (TFlop/s, 0 to 25). Series: LibSCI ScaLAPACK, Systolic QR over PaRSEC (2D), Systolic QR over PULSAR, DPLASMA HQR (best single tree)]
What about LU ?
Sparse support
Advanced examples: Sparse direct solver over GPUs: PaStiX
Tasks structure
[Figure: (a) dense tile task decomposition into POTRF, TRSM, SYRK and GEMM; (b) decomposition of the task applied while processing one panel]
M. Faverge - ANR SOLHAR, July 3, 2014
DAG representation
[Figure: (c) dense DAG of POTRF, TRSM, SYRK and GEMM tasks; (d) sparse DAG representation of a sparse LDLT factorization, made of panel and gemm tasks]
Total, Inria Bordeaux, Inria Pau, LaBRI, ICL
NWCHEM 6.5
An Open Source High-Performance Computational Chemistry
Conversion of NWChem CC code into dataflow form is not trivial (CCSD code generated by TCE)
• Control flow is not affine nor statically decidable:
  • Loop execution space has holes,
  • dataflow goes through external routines,
  • conditional branches depend on program data,
  • memory access completely hidden in Global Arrays layer, etc.
→ None of the traditional compiler analysis tools can be used
[Figure: Uracil-dimer. DAG of inB2(i), inB4(i), GEMM(i), DFILL(0) and ADD_RESULT_IN_MEM(0) tasks; legend: inB_X(i), GEMM(i), DFILL(), ADD_RESULT_IN_MEM(); annotation: best locality but bad parallelism]
Other interactions with PaRSEC
With Teresa Windus, Heike Jagode and Anthony Danalis
Integration of PaRSEC in CCSD
PaRSEC-enabled version in 2 steps:
1. Traverse the execution space and evaluate IF branches, without executing the actual computation. (Since the data that affects the control flow is immutable at run-time, this step only needs to be performed once.)
2. Create the PTG, which includes lookups into our meta-data vectors populated by step 1.
Elimination of synchronization points by describing data dependencies between matrix blocks
Finer grained (pure) tasks to allow for exploitation of more parallelism
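The two steps above follow an inspector/executor pattern. A minimal sketch of step 1 is given below, assuming the IF branch depends only on an immutable block mask; `block_is_nonzero`, `inspect` and the meta-data arrays are hypothetical stand-ins, not code from NWChem or PaRSEC.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an IF branch of the original code: it depends
 * only on data that is immutable at run-time (here, a block-sparsity mask). */
int block_is_nonzero(const int *mask, int i, int j, int n)
{
    return mask[i * n + j] != 0;
}

/* Step 1: traverse the n x n execution space, evaluate the branch, and
 * record the (i, j) pairs that would do work. No computation is executed.
 * Returns the number of entries written to meta_i / meta_j. */
size_t inspect(const int *mask, int n, int *meta_i, int *meta_j)
{
    assert(mask != NULL && meta_i != NULL && meta_j != NULL && n >= 0);
    size_t count = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (block_is_nonzero(mask, i, j, n)) {
                meta_i[count] = i;
                meta_j[count] = j;
                count++;
            }
    return count;
}
```

In step 2, a task class can then be parameterized over the dense range 0 .. count-1 and look up its actual (i, j) in the meta-data vectors, which removes the holes from the execution space.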
Most significant outcomes of porting CC over PaRSEC:
1. Ability to express tasks and their data dependencies at a finer granularity
2. Decoupling of computation and communication enables more advanced communication patterns than serial chains
Cascade @ EMSL/PNNL
C40H56
Unbounded parallelism
• The only requirement is that upon a task completion the potential descendants are known
  • Uncountable DAGs
  • "%option nb_local_tasks_fn = …"
  • Need user defined global termination
• Add support for dynamic DAGs
  • Already in the language
  • Properties of the algorithm / tasks
  • "hash_fn = …"
  • "find_deps_fn = …"
DIP: Elastodynamic Wave Propagation
Geophysics - wave equation
Geophysics simulation
Figure : Elastic wave propagation in 3D (2D slice view)
Lionel BOILLOT (Inria) Task-based programming 12-apr-16 6 / 30
Task based programming Task dataflow
Fine granularity
Figure : Subdivision example
More than one domain per CPU
exhibit deeper parallelism
allow dynamic flexibility
reduce the boundary size
Dynamically redistribute the data
- use PAPI counters to estimate the imbalance
- reshuffle the frontiers to converge to a load balanced scenario
Geophysics - wave equation
DIVA sequential algorithm
Quasi-explicit reformulation:
$$v_h^{n+1} = v_h^{n} + M_v^{-1}\left[\Delta t\, R_\sigma\, \sigma_h^{n+1/2}\right] \quad \text{(UpdateVelocity)}$$
$$\sigma_h^{n+3/2} = \sigma_h^{n+1/2} + M_\sigma^{-1}\left[\Delta t\, R_v\, v_h^{n+1}\right] \quad \text{(UpdateStress)}$$
Algorithm 3: DIVA sequential
Data: $N_h$, $\Delta t$, $N_t$
Result: $v_h$, $\sigma_h$
$[v_h^{1}, \sigma_h^{3/2}] \leftarrow \text{Initialization}(N_h, \Delta t)$;
for $n = 1..N_t$ do
    for $K = 1..N_h$ do
        $v_{h|K}^{n+1} \leftarrow \text{UpdateVelocity}(v_{h|K}^{n}, \sigma_{h|K}^{n+1/2}, \Delta t)$;
    end
    for $K = 1..N_h$ do
        $\sigma_{h|K}^{n+3/2} \leftarrow \text{UpdateStress}(\sigma_{h|K}^{n+1/2}, v_{h|K}^{n+1}, \Delta t)$;
    end
end
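The structure of the sequential algorithm (all velocities, then all stresses, at every time step) can be sketched in a few lines. This is a deliberately simplified illustration: the operators $M_v^{-1} R_\sigma$ and $M_\sigma^{-1} R_v$ are replaced by assumed scalar coefficients `cv` and `cs`, so each element is updated locally, whereas the real operators couple neighbouring elements.

```c
#include <assert.h>

/* Scalar stand-ins (assumptions) for the operators M_v^{-1} R_sigma and
 * M_sigma^{-1} R_v, which are matrices in the real scheme. */
typedef struct { double cv, cs; } diva_coeffs_t;

/* Advance nh elements by nt time steps, following the sequential algorithm:
 * all velocities are updated first, then all stresses. */
void diva_sequential(double *v, double *sigma, int nh, int nt,
                     double dt, diva_coeffs_t c)
{
    assert(nh >= 0 && nt >= 0);
    for (int n = 0; n < nt; n++) {
        for (int K = 0; K < nh; K++)   /* UpdateVelocity */
            v[K] += c.cv * dt * sigma[K];
        for (int K = 0; K < nh; K++)   /* UpdateStress */
            sigma[K] += c.cs * dt * v[K];
    }
}
```

The two inner loops are the natural unit of work: in a task-based version each iteration over K (or each group of elements) becomes a task, and the implicit barrier between the velocity and stress loops is replaced by per-element dependencies.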
Runtimes: DAG of DIP
DIP algorithm
For n = 1 : n timesteps
    Communication($\sigma_h^{n+1/2}$)
    $v_h^{n+1} \leftarrow \text{computeVelocity}(v_h^{n}, \sigma_h^{n+1/2}, \Delta t)$
    Communication($v_h^{n+1}$)
    $\sigma_h^{n+3/2} \leftarrow \text{computeStress}(\sigma_h^{n+1/2}, v_h^{n+1}, \Delta t)$
End For
Let’s rename the algorithm steps: EXCHANGE VS, COMPUTE V, EXCHANGE VV, COMPUTE S
Let’s divide the EXCHANGE task into SEND, RECV and COPY tasks
Lionel BOILLOT (INRIA – TOTAL) HPC: runtime & coprocessors ICL Lunch
• Finer grain partitioning compared with MPI
• Increased communications but also increased potential for parallelism
• Need for load-balancing
Total, Inria Bordeaux, Inria Pau, ICL
DIP: Elastodynamic Wave Propagation
Numerical illustration
Intel Xeon Phi results - efficiency
[Figure: efficiency (0.2 to 1) versus number of threads (1, 2, 4, 8, 16, 32, 60, 120, 240); series: Perfect, PaRSEC, MPI; annotation: not yet HT ready]
Lionel BOILLOT (Inria) Task-based programming 04-mar-15
Numerical illustration
Trace comparison
Figure: MPI-based t = 2.517s
Figure: PaRSEC version (NUMA-aware, granularity x6) t = 2.060s
[Figure overlay labels: 2517s (MPI-based), 2060s (PaRSEC version)]
Resilience: Data Logging Strategy
• Minimize the number of task re-executions by logging data
• Checkpoint interval β: a process saves a copy of each data every β updates
• Input of a failed task:
  • The same tile, checkpointed at most β updates ago
  • Final output of another task (validated)
• Max number of re-executions is β for factorizations

Checkpoint   Beginning   Middle        End           No Failure
β            (NB/N)^3    β6(NB/N)^3    β6(NB/N)^3    0
never        (NB/N)^3    12.5%         100%          0

β = 2
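The bound of β re-executions follows from simple arithmetic on version numbers. The sketch below assumes a model that the slide does not spell out: version 0 is the initial data, and a copy is saved for every version that is a multiple of β. The function names are illustrative.

```c
#include <assert.h>

/* Version of the most recent checkpoint usable to rebuild `version`,
 * assuming version 0 (the initial data) and every beta-th version are saved.
 * The checkpoint must be strictly older than the version being rebuilt. */
int last_checkpoint(int version, int beta)
{
    assert(version >= 1 && beta >= 1);
    return ((version - 1) / beta) * beta;
}

/* Number of updates to re-execute when the task producing `version` fails. */
int reexecutions(int version, int beta)
{
    return version - last_checkpoint(version, beta);
}
```

With β = 2, a failure while producing version 3 restarts from the checkpoint of version 2 and replays one update; a failure while producing version 4 replays two. Under this model the count never exceeds β.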
Conclusions
• Don’t make hardware a serious impediment to scientific simulation
• Programming must be made easy(ier)
  • Portability: inherently take advantage of all hardware capabilities
  • Efficiency: deliver the best performance on several families of algorithms
• Build a scientific enabler allowing different communities to focus on different problems
  • Application developers on their algorithms
  • Language specialists on Domain Specific Languages
  • System developers on system issues
  • Compilers on optimizing the task code
The PaRSEC ecosystem
• Support for many different types of applications
  • Dense Linear Algebra: DPLASMA, MORSE/Chameleon
  • Sparse Linear Algebra: PaStiX
  • Geophysics: Total - Elastodynamic Wave Propagation
  • Chemistry: NWChem Coupled Cluster, MADNESS, TiledArray
  • *: ScaLAPACK, MORSE/Chameleon
• A set of tools to understand the performance
Physics & Maths background: Geophysics context
RTM context
Geophysics:
• Hydrocarbons detection: petroleum or natural gas
• Earth medium: seismic waves, heterogeneous complex domain
Simulation:
• Seismic imaging: find the subsurface layers
• Equations: elastic/acoustic wave in 2D/3D
Reverse Time Migration (RTM)
• Iterative method based on multiple wave equation resolutions
[Figure 11. Power Profiles of the Cholesky Factorization. Panels: (a) ScaLAPACK, (b) DPLASMA. x-axis: time (seconds, 0 to 35); y-axis: power (watts, 0 to 25000); series: System, CPU, Memory, Network]
[Figure 12. Power Profiles of the QR Factorization. Panels: (a) ScaLAPACK, (b) DPLASMA. x-axis: time (seconds, 0 to 100); y-axis: power (watts, 0 to 25000); series: System, CPU, Memory, Network]
smaller number of cores. The engine could then decide to turn off or lower the frequencies of the cores using Dynamic Voltage Frequency Scaling [16], a commonly used technique with which it is possible to reduce energy consumption.
ACKNOWLEDGMENT
The authors would like to thank Kirk Cameron and Hung-Ching Chang from the Department of Computer Science at Virginia Tech, for granting access to their platform.
# Cores   Library     Cholesky   QR
128       ScaLAPACK   192000     672000
          DPLASMA     128000     540000
256       ScaLAPACK   240000     816000
          DPLASMA     96000      540000
512       ScaLAPACK   325000     1000000
          DPLASMA     125000     576000
Figure 13. Total amount of energy (joule) used for each test based on the number of cores
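The relative savings implied by Figure 13 are easy to work out. The helper below is an illustrative sketch (its name is not from the paper) that turns a pair of energy readings into a percentage saved.

```c
#include <assert.h>

/* Percentage of energy saved by `dplasma_joules` relative to `scalapack_joules`. */
double energy_saving_percent(double scalapack_joules, double dplasma_joules)
{
    assert(scalapack_joules > 0.0);
    return 100.0 * (scalapack_joules - dplasma_joules) / scalapack_joules;
}
```

For the Cholesky factorization on 256 cores, `energy_saving_percent(240000, 96000)` gives 60, that is, DPLASMA used 60% less energy than ScaLAPACK in that test. For QR on 512 cores the same calculation gives about 42.4%.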
REFERENCES
[1] MPI-2: Extensions to the message passing interface standard. http://www.mpi-forum.org/ (1997)
[2] Agullo, E., Hadri, B., Ltaief, H., Dongarra, J.: Comparative study of one-sided factorizations with multiple software packages on multi-core hardware. SC ’09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis pp. 1–12 (2009)
[3] Anderson, E., Bai, Z., Bischof, C., Blackford, S.L., Demmel, J.W., Dongarra, J.J., Croz, J.D., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.C.: LAPACK User’s Guide, 3rd edn. Society for Industrial and Applied Mathematics, Philadelphia (1999)
[4] Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Haidar, A., Herault, T., Kurzak, J., Langou, J., Lemarinier, P., Ltaief, H., Luszczek, P., YarKhan, A., Dongarra, J.: Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA. In: the 12th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC-11). ACM, Anchorage, AK, USA (2011)
[5] Bosilca, G., Bouteiller, A., Danalis, A., Herault, T., Lemarinier, P., Dongarra, J.: DAGuE: A generic distributed DAG engine for high performance computing. Tech. Rep. 231, LAPACK Working Note (2010). URL http://www.netlib.org/lapack/lawnspdf/lawn231.pdf
[6] Bosilca, G., Bouteiller, A., Herault, T., Lemarinier, P., Dongarra, J.: DAGuE: A generic distributed DAG engine for high performance computing (2011)
[7] Buttari, A., Langou, J., Kurzak, J., Dongarra, J.: A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Computing 35(1), 38–53 (2009)
[8] Choi, J., Demmel, J., Dhillon, I., Dongarra, J., Ostrouchov, S., Petitet, A., Stanley, K., Walker, D., Whaley, R.C.: ScaLAPACK, a portable linear algebra library for distributed memory computers-design issues and performance. Computer Physics Communications 97(1-2), 1–15 (1996)
[9] Cosnard, M., Jeannot, E.: Compact DAG representation and its dynamic scheduling. Journal of Parallel and Distributed Computing 58, 487–514 (1999)
[10] Dongarra, J., Beckman, P.: The International Exascale Software Roadmap. International Journal of High Performance Computer Applications 25(1) (2011)