Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through...
KAAPI: Adaptive Runtime System for Parallel Computing
Thierry Gautier, Bruno Raffin
MOAIS project, INRIA Grenoble Rhône-Alpes
Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project
Moais Project
http://moais.imag.fr
• Leader
• Jean-Louis Roch
• 10 Members
• Vincent Danjean, Pierre-François Dutot, Thierry Gautier, Guillaume Huard, Grégory Mounié, Clément Pernet, Bruno Raffin, Denis Trystram, Frédéric Wagner
• About 20 PhD students
Moais Positioning
To mutually adapt application and scheduling
• Grid
• Cluster
• Multicore
• GPU
• MPSoC
KAAPI Overview
Application
KAAPI middleware
system
Model: abstract representation
Algorithms: scheduling, fault tolerance protocol, ...
“causal connections”
Performance
API
Global address space
• Creation of objects in a global address space with ‘shared’ type
Task
• Creation with ‘Fork’ keyword (~ Cilk spawn)
• Tasks only communicate through shared objects
Automatic scheduling
• work stealing or graph partitioning
‘Sequential’ semantics
• similar to TBB/Cilk but with data flow dependencies
C++ Elision
// Sum is defined first so that Fibonacci can fork it.
struct Sum {
  void operator()( a1::Shared_w<int> result,
                   a1::Shared_r<int> sr1,
                   a1::Shared_r<int> sr2 )
  { result.write( sr1.read() + sr2.read() ); }
};

struct Fibonacci {
  void operator()( int n, a1::Shared_w<int> result )
  {
    if (n < 2) result.write( n );
    else {
      // Sub-results live in the global address space.
      a1::Shared<int> subresult1;
      a1::Shared<int> subresult2;
      a1::Fork<Fibonacci>()(n-1, subresult1);
      a1::Fork<Fibonacci>()(n-2, subresult2);
      // Sum reads both sub-results, so it runs after the two forks above.
      a1::Fork<Sum>()(result, subresult1, subresult2);
    }
  }
};
C++ Elision
// Sum is defined first so that Fibonacci can fork it.
struct Sum {
  void operator()( a1::Shared_w<int&> result,
                   a1::Shared_r<int> sr1,
                   a1::Shared_r<int> sr2 )
  { result.write( sr1.read() + sr2.read() ); }
};

struct Fibonacci {
  void operator()( int n, a1::Shared_w<int&> result )
  {
    if (n < 2) result = n;
    else {
      a1::Shared<int> subresult1;
      a1::Shared<int> subresult2;
      a1::Fork<Fibonacci>()(n-1, subresult1);
      a1::Fork<Fibonacci>()(n-2, subresult2);
      a1::Fork<Sum>()(result, subresult1, subresult2);
    }
  }
};
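As a minimal sketch of what the elision is meant to convey, assume that dropping the a1::Fork calls and replacing the a1::Shared wrappers with plain variables and references leaves an ordinary sequential C++ program. The struct names mirror the slide; `fib` is a hypothetical convenience wrapper that does not appear in the talk.

```cpp
#include <cassert>

// Sequential elision of the Sum task: plain references replace
// a1::Shared_w / a1::Shared_r, and read()/write() become ordinary accesses.
struct Sum {
  void operator()(int& result, const int& sr1, const int& sr2) const {
    result = sr1 + sr2;
  }
};

// Sequential elision of the Fibonacci task: each a1::Fork<T>()(args)
// becomes a direct call T()(args), executed in program order.
struct Fibonacci {
  void operator()(int n, int& result) const {
    assert(n >= 0);  // the recursion is only defined for non-negative n
    if (n < 2) {
      result = n;
    } else {
      int subresult1 = 0;
      int subresult2 = 0;
      Fibonacci()(n - 1, subresult1);
      Fibonacci()(n - 2, subresult2);
      Sum()(result, subresult1, subresult2);
    }
  }
};

// Hypothetical convenience wrapper, not part of the slides.
inline int fib(int n) {
  int result = 0;
  Fibonacci()(n, result);
  return result;
}
```

Because the sequential order already satisfies every read-after-write dependency, the elided program computes the same value as the task-based one.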
Abstract Representation
[Figure: data flow graph of the Fibonacci example unfolding over time. Nodes: Fibonacci and Sum tasks; shared objects: result, subres1, subres2, subres1.1, subres1.2]
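The data flow graph of the Fibonacci example can be derived from the access modes that each task declares. The following is a minimal sketch, assuming a task is described only by the names of the shared objects it reads and writes, and that tasks are listed in sequential program order. `Task`, `build_dependencies`, `fibonacci_step` and the string-based object names are hypothetical; this is not KAAPI's internal representation. Only true (read-after-write) dependencies are tracked.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical task description: a name plus the shared objects it accesses.
struct Task {
  std::string name;
  std::vector<std::string> reads;
  std::vector<std::string> writes;
};

// Returns edges (producer index, consumer index): task j depends on task i
// when j reads an object whose most recent writer, in program order, is i.
inline std::vector<std::pair<std::size_t, std::size_t>>
build_dependencies(const std::vector<Task>& tasks) {
  std::map<std::string, std::size_t> last_writer;
  std::vector<std::pair<std::size_t, std::size_t>> edges;
  for (std::size_t j = 0; j < tasks.size(); ++j) {
    for (const std::string& obj : tasks[j].reads) {
      auto it = last_writer.find(obj);
      if (it != last_writer.end()) edges.emplace_back(it->second, j);
    }
    for (const std::string& obj : tasks[j].writes) last_writer[obj] = j;
  }
  return edges;
}

// Tasks forked by one Fibonacci(n) call with n >= 2, in program order.
inline std::vector<Task> fibonacci_step() {
  return {
      {"Fibonacci(n-1)", {}, {"subres1"}},
      {"Fibonacci(n-2)", {}, {"subres2"}},
      {"Sum", {"subres1", "subres2"}, {"result"}},
  };
}
```

For one step of the recursion this yields two edges, both ending at Sum, while the two Fibonacci tasks share no edge and may therefore run in parallel.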
KAAPI Scheduler
2 Level Scheduling
[Figure: two-level scheduling diagram. Labels: K-Thread, K-Processor, process, other process, OS scheduler, CPU]
Active Message over TCP/IP, Myrinet and SSH
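Work stealing, one of the scheduling strategies listed on the API slide, is commonly organised around one double-ended queue per worker. The sketch below is a deliberately simple, mutex-protected version, assuming tasks are plain integer identifiers. `WorkQueue` and its method names are hypothetical and do not describe KAAPI's actual data structures, which this transcript does not show.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// Hypothetical per-worker queue; tasks are reduced to integer identifiers.
class WorkQueue {
 public:
  // Owner side: push a newly created task at the back.
  void push(int task) {
    std::lock_guard<std::mutex> lock(mutex_);
    tasks_.push_back(task);
  }

  // Owner side: take the most recently pushed task (LIFO order).
  // Returns false when the queue is empty.
  bool pop(int& task) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (tasks_.empty()) return false;
    task = tasks_.back();
    tasks_.pop_back();
    return true;
  }

  // Thief side: take the oldest task from the opposite end.
  // Returns false when there is nothing to steal.
  bool steal(int& task) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (tasks_.empty()) return false;
    task = tasks_.front();
    tasks_.pop_front();
    return true;
  }

 private:
  std::mutex mutex_;
  std::deque<int> tasks_;
};
```

Letting the owner and the thieves work on opposite ends keeps the owner on recently created tasks while a thief takes the oldest one, which in a recursive computation is usually the largest piece of remaining work.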
Performance Guarantee
• Notations
• Ts: Sequential work, time of sequential execution
• T1: Time of the parallel algorithm on 1 core
• D: Critical path
• P: Number of cores
• Properties
• with high probability, the number of steals is O(P × D)
• with high probability, the execution time is Tp ≤ T1 / P + O(D)
~ Similar to the bound of the Cilk extension by Rabin et al.
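To make the execution-time bound concrete, the sketch below evaluates T1 / P + c × D for given values. The constant c hidden by the O(D) term is not given on the slide, so it appears here as an explicit, assumed parameter; `time_bound` is a hypothetical helper name.

```cpp
#include <cassert>

// Evaluates T1 / P + c * D. The constant c stands in for the factor hidden
// by O(D); its value is an assumption, not a figure from the talk.
inline double time_bound(double t1, int p, double d, double c) {
  assert(p > 0);
  return t1 / p + c * d;
}
```

With the illustrative values T1 = 100 s, D = 0.5 s and c = 1, the estimate is 100 / 8 + 0.5 = 13 s on P = 8 cores and 100 / 16 + 0.5 = 6.75 s on P = 16 cores: the speedup stays close to linear as long as T1 / P dominates D.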
Comparison with Cilk/TBB
• 8 processors NUMA machine
• STL Transform, ratio Tstl / Tlibrary on 8 cores
[Figure: STL transform. x-axis: Size (0 to 300000); y-axis: TSTL / TLibrary (0 to 7); curves: X-Kaapi, TBB, Cilk]
Comparison with Cilk/TBB
• 8 processors NUMA machine
• STL Merge, ratio Tstl / Tlibrary on 8 cores
[Figure: STL Merge. x-axis: Size (0 to 300000); y-axis: TSTL / TLibrary (0 to 7); curves: X-Kaapi, TBB, Cilk]
Grid Experiments
• QAP, Q3AP, NQueens
• well suited for work stealing scheduling
• Plugtest Contest Grid@Works
• 2007: Grid5000 (France)
• 1st rank
• NQueens N=23, 35 minutes 7 s, 3654 cores
• 2008: Grid5000 (2709 cores) + Intrigger (Japan, 900 cores)
• 1st rank, 8760 points (2nd: 1459 pts, 3rd: 792 pts)
• Super Quant Monte Carlo option pricing application
Iterative Application
• Scheduling by graph partitioning
• Metis / Scotch
Application
Experiments
• Finite Difference Kernel
• Kaapi / C++ code versus Fortran MPI code
• Constant size sub domain D per processor
• Cluster: N processors on a cluster
• Grid: N/4 processors per cluster, 4 clusters

D = 256^3

         # processors   Cluster (s)   Grid (s)   Overhead
KAAPI    1              0.49          0.49       -
         64             0.55          0.84       0.53
         128            0.65          0.91       0.4
MPI      1              0.44          0.44       -
         64             0.66          2.02       2.06
         128            0.68          1.57       1.31
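The Overhead column is consistent with the relative slowdown of the grid run with respect to the cluster run, (Grid − Cluster) / Cluster. The slide does not state this definition, so treat it as an inference from the numbers; `relative_overhead` and `round2` are hypothetical helpers.

```cpp
#include <cassert>
#include <cmath>

// Relative slowdown of the grid run compared with the cluster run.
// Assumed definition, inferred from the table rather than stated in the talk.
inline double relative_overhead(double cluster_s, double grid_s) {
  assert(cluster_s > 0.0);
  return (grid_s - cluster_s) / cluster_s;
}

// Rounds to two decimals, the precision used in the table.
inline double round2(double x) { return std::round(x * 100.0) / 100.0; }
```

For example, the KAAPI row at 64 processors gives (0.84 − 0.55) / 0.55 ≈ 0.53, and the MPI row at 64 processors gives (2.02 − 0.66) / 0.66 ≈ 2.06, matching the reported values.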
Optimizing MPI code
• Overlapping communication by computation
• At the cost of important code restructuring
[Figure: 256^3/proc between Rennes and Bordeaux. x-axis: Nb proc (16+16, 32+32, 64+64); y-axis: mean time for an iteration (s), 0 to 1.8; series: kaapi-opt, sendrecv-ompi, irecvisend-ompi, async-ompi]
KAAPI automatically reschedules computation and communication tasks
Fault Tolerance
• State of application = state of the data flow graph
• Two specialized protocols
• TIC: Theft Induced Checkpointing
• Periodic checkpoint + forced checkpoint on steal
• CCK: for iterative applications
• Coherent checkpoints
• only recovery of failed process + !application
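The TIC rule, "periodic checkpoint + forced checkpoint on steal", can be sketched as a small policy object. This is a sketch under stated assumptions, not the protocol's implementation: time is passed in explicitly as seconds, a checkpoint is reduced to a counter, and a forced checkpoint is assumed to restart the periodic timer. `TicPolicy` and its methods are hypothetical names.

```cpp
#include <cassert>

// Hypothetical policy object for one worker.
class TicPolicy {
 public:
  explicit TicPolicy(double period_s) : period_s_(period_s) {
    assert(period_s > 0.0);
  }

  // Called regularly by the worker: takes a checkpoint once a full period
  // has elapsed since the last one. Returns true if a checkpoint was taken.
  bool on_tick(double now_s) {
    if (now_s - last_checkpoint_s_ < period_s_) return false;
    checkpoint(now_s);
    return true;
  }

  // Called when the worker is involved in a steal: always takes a checkpoint.
  void on_steal(double now_s) { checkpoint(now_s); }

  int checkpoints() const { return count_; }

 private:
  // Stand-in for saving the local part of the data flow graph.
  void checkpoint(double now_s) {
    last_checkpoint_s_ = now_s;
    ++count_;
  }

  double period_s_;
  double last_checkpoint_s_ = 0.0;
  int count_ = 0;
};
```

With a 10 s period, a tick at 10 s checkpoints, a steal at 12 s forces another checkpoint, and the next periodic one is then due at 22 s rather than 20 s.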
Protocol Scalability
• Implemented using distributed checkpoint services
• two checkpointing periods
• max overhead observed: 0.9%
• TIC: overhead increases as the number of processors increases
[Figure: x-axis: #Processors (20, 40, 60, 120); y-axis: Overhead (%), 0 to 0.9; series: CIC (period=1s), CIC (period=20s)]
Comparison with Satin
• 32 processors, synthetic recursive app.
Physics Simulation
• SOFA: real-time physics engine
• Strongly supported INRIA initiative
• Open Source:
http://www-sofa-framework.org
• Target application:
Surgery simulation
Multi CPU/GPU SOFA
• SOFA: 2 levels of parallelization
• KAAPI: graph partitioning and work stealing
• Nvidia Cuda
• On-going: work stealing between CPUs and GPUs
SOFA
Oblivious Algorithms
• Cache oblivious algorithms
• Irregular meshes: 2-20x on CPU, 1.2-2.7x on GPU
• On-going work: cache oblivious + adapted work stealing strategy
Conclusions
• KAAPI: flexible framework for parallel programming and fine scheduling control:
• work stealing : recursive computation or local scheduling
• graph partitioning : iterative application
• Data dependency graph:
• used for scheduling or fault tolerance protocols
• On-going work on hybrid architectures and large scale machines (BlueGene)
Questions?
• http://kaapi.gforge.inria.fr
• http://www-sofa-framework.org
• http://moais.imag.fr