Post on 28-Mar-2015
Laboratoire d'Informatique de Grenoble
Parallel algorithms and scheduling:
adaptive parallel programming and
applications
Bruno Raffin, Jean-Louis Roch, Denis Trystram. MOAIS project [INRIA / CNRS / INPG / UJF]
moais.imag.fr
Contents
I. Motivations for adaptation and examples
II. Off-line scheduling and adaptation: moldable/malleable tasks [Denis?]
III. On-line work-stealing scheduling and parallel adaptation
IV. A first example: iterated product; application to gzip
V. Processor-oblivious parallel prefix computation
VI. Adaptation to time constraints: oct-tree computation [Bruno/Luciano]
VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]
VIII. Adaptation to support fault tolerance by work-stealing
Conclusion
Why adaptive algorithms and how?
$$\begin{bmatrix} 7 & 3 & 6 \\ 0 & 1 & 8 \\ 0 & 0 & 5 \end{bmatrix}$$
Input data vary. Resource availability is versatile.
Adaptation to improve performance
Scheduling: • partitioning • load-balancing • work-stealing
Measures on resources
Measures on data
Calibration: • tuning parameters (block size / cache, choice of instructions, …) • priority management
Choices in the algorithm: • sequential / parallel • approximate / exact • in-memory / out-of-core • …
An algorithm is «hybrid» iff there is a choice, at a high level, between at least two algorithms, each of which can solve the same problem.
Modeling a hybrid algorithm
• Several algorithms solve the same problem f:
– e.g. algo_f1, algo_f2(block size), …, algo_fk
– each algo_fi being recursive
• Adaptation: choose algo_fj for each call to f
algo_fi ( n, … ) { … f ( n - 1, … ); … f ( n / 2, … ); … }
E.g. “practical” hybrids: • Atlas, Goto, FFPACK • FFTW • cache-oblivious B-trees • any parallel program with scheduling support: Cilk, Athapascan/Kaapi, NESL, TLib, …
• How to manage the overhead due to choices?
• Classification 1/2:
– simple hybrid iff O(1) choices [e.g. block size in Atlas, …]
– baroque hybrid iff an unbounded number of choices [e.g. recursive splitting factors in FFTW]
• Choices are either dynamic, or pre-computed based on input properties.
• Choices may or may not be based on architecture parameters.
• Classification 2/2: a hybrid is
– Oblivious: the control flow depends neither on static properties of the resources nor on the input [e.g. cache-oblivious algorithms [Bender]]
– Tuned: strategic choices are based on static parameters [e.g. block size w.r.t. cache, granularity, …]
• either engineered-tuned or self-tuned [e.g. ATLAS and GOTO libraries, FFTW, LinBox/FFLAS [Saunders et al.]]
– Adaptive: the algorithm self-configures dynamically, based on input properties or resource availability discovered at run time [e.g. idle processors, data properties; TLib [Rauber & Rünger]]
Examples
• BLAS libraries
– Atlas: simple, self-tuned
– Goto: simple, engineered-tuned
– LinBox / FFLAS: simple, self-tuned, adaptive [Saunders et al.]
• FFTW
– halving factor: baroque, tuned
– stopping criterion: simple, tuned
Dynamic architecture: non-fixed number of resources, variable speeds. E.g. grids; but not only: an SMP server in multi-user mode.
Adaptation in parallel algorithms
Problem: compute f(a)
Sequential algorithm / parallel, P = 2 / parallel, P = 100 / parallel, P = max / …
Which algorithm to choose for a multi-user SMP server, a grid, a heterogeneous network?
Dynamic architecture: non-fixed number of resources, variable speeds (e.g. grids; but also an SMP server in multi-user mode)
=> motivates «processor-oblivious» parallel algorithms that:
+ are independent from the underlying architecture: no reference to p, nor to Πi(t) = speed of processor i at time t, nor …
+ on a given architecture, have performance guarantees: behave as well as an optimal (off-line, non-oblivious) algorithm
Processor-oblivious algorithms
How to adapt the application?
• By minimizing communications, e.g. amortizing synchronizations in the simulation [Beaumont, Daoudi, Maillard, Manneback, Roch - PMAA 2004] -> adaptive granularity
• By controlling latency (interactivity constraints): FlowVR [Allard, Menier, Raffin] -> overhead
• By managing node failures and resilience [checkpoint/restart] [checkers]: FlowCert [Jafar, Krings, Leprevost; Roch, Varrette]
• By adapting granularity:
– malleable tasks [Trystram, Mounié]
– dataflow cactus-stack: Athapascan/Kaapi [Gautier]
– recursive parallelism by «work-stealing» [Blumofe-Leiserson 98, Cilk, Athapascan, ...]
• Self-adaptive grain algorithms: dynamic extraction of parallelism [Daoudi, Gautier, Revire, Roch - J. TSI 2004]
Parallelism and efficiency
«Work» = sequential time: T1 = #operations
«Depth» = parallel time on unbounded resources: T∞ = #ops on a critical path
Tp ≤ T1/p + T∞ [greedy scheduling, Graham 69]
An efficient scheduling policy (close to optimal) is difficult in general at coarse grain, but easy if T∞ is small (fine grain); the control of the policy (its realisation) is expensive in general at fine grain, but has small overhead at coarse grain.
Problem: how to adapt the potential parallelism to the resources?
=> have T∞ small, with coarse-grain control
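A one-line justification of Graham's bound, as a side note (the standard counting argument, not on the original slide): every step of a greedy schedule either keeps all p processors busy, or leaves one idle, in which case all ready critical-path tasks are running and the remaining critical path shrinks by one.

```latex
T_p \;\le\; \underbrace{\frac{T_1}{p}}_{\text{steps with all $p$ processors busy}}
\;+\; \underbrace{T_\infty}_{\text{steps with an idle processor}}
```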
II. Off-line scheduling and adaptation: moldable/malleable tasks
[Section to be completed: Denis]
III. On-line work-stealing scheduling and parallel adaptation
Work-stealing (1/2)
«Work» W1 = #total operations performed
«Depth» W∞ = #ops on a critical path (parallel time on unbounded resources)
• Work-stealing = "greedy" schedule, but distributed and randomized
• Each processor locally manages the tasks it creates
• When idle, a processor steals the oldest ready task from a remote, non-idle victim processor (randomly chosen)
Work-stealing (2/2)
«Work» W1 = #total operations performed; «Depth» W∞ = #ops on a critical path
• Interests:
-> suited to heterogeneous architectures, with a slight modification [Bender-Rabin 02]
-> if W∞ is small enough: near-optimal processor-oblivious schedule, with good probability, on p processors with average speed πave
NB: #successful steals = #task migrations < p·W∞ [Blumofe 98, Narlikar 01, Bender 02]
• Implementation: work-first principle [Cilk: series-parallel; Kaapi: dataflow]
-> move the scheduling overhead onto the steal operations (the infrequent case)
-> general case: "local parallelism" implemented by a sequential function call
Work-stealing scheduling of a parallel recursive fine-grain algorithm
• Work-stealing scheduling: an idle processor steals the oldest ready task
• Interests:
=> #successful steals < p·T∞ [Blumofe 98, Narlikar 01, ...]
=> suited to heterogeneous architectures [Bender-Rabin 03, ...]
• Hypotheses for efficient parallel executions:
– the parallel algorithm is «work-optimal»
– T∞ is very small (recursive parallelism)
– a «sequential» execution of the parallel algorithm is valid
– e.g.: search trees, Branch & Bound, ...
• Implementation: work-first principle [Multilisp, Cilk, …]
– overhead of task creation only upon a steal request: sequential degeneration of the parallel algorithm
– cactus-stack management
• Advantage: «statically» fine grain, but dynamic control
• Drawback: possible overhead of the parallel algorithm [e.g. prefix computation]
Implementation of work-stealing
Hypothesis: a sequential schedule is valid.
f1() { … fork f2; … }
Processor P executes f1 as a plain, non-preemptive sequential call, keeping its frames on the stack; an idle processor P' sends a steal request and takes the oldest ready task (here, the forked f2), which it executes while P continues f1.
Cost model: with high probability, on p identical processors,
- execution time = T1/p + O(T∞)
- number of steal requests = O(p · T∞)
Experimentation: knary benchmark
SMP architecture: Origin 3800 (32 procs), Cilk / Athapascan
Distributed architecture: iCluster, Athapascan

#procs | Speed-up
8      | 7.83
16     | 15.6
32     | 30.9
64     | 59.2
100    | 90.1

Ts = 2397 s, T1 = 2435 s
How to obtain an efficient fine-grain algorithm?
• Hypotheses for efficiency of work-stealing:
– the parallel algorithm is «work-optimal»
– T∞ is very small (recursive parallelism)
• Problem: fine-grain (T∞ small) parallel algorithms may involve a large overhead with respect to an efficient sequential algorithm:
– overhead due to parallelism creation and synchronization
– but also arithmetic overhead
Work-stealing and adaptability
• Work-stealing ensures the allocation of processors to tasks transparently to the application, with provable performance
• Supports the addition of new resources
• Supports resilience of resources and fault tolerance (crash faults, network, …)
– checkpoint/restart mechanisms with provable performance [Porch, Kaapi, …]
• "Baroque hybrid" adaptation: there is an implicit, dynamic choice between two algorithms:
– a sequential (local) algorithm: depth-first (default choice)
– a parallel algorithm: breadth-first
– the choice is performed at runtime, depending on resource idleness
• Well suited to applications where a fine-grain parallel algorithm is also a good sequential algorithm [Cilk]:
– parallel divide & conquer computations, tree searching, Branch & X, …
-> suited when both the sequential and the parallel algorithms perform (almost) the same number of operations
Self-adaptive grain algorithm
• Principle: save the parallelism overhead by privileging a sequential algorithm:
=> use the parallel algorithm only if a processor becomes idle, by extracting parallelism from the sequential computation
• Hypothesis: two algorithms:
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation => at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
Scheme: SeqCompute runs; Extract_par splits off a LastPartComputation; each extracted part is in turn executed as a SeqCompute exposing its own LastPartComputation.
Generic self-adaptive grain algorithm
• General approach: mix both
– a sequential algorithm with optimal work W1
– and a fine-grain parallel algorithm with minimal critical time W∞
• Folk technique: parallel, then sequential
– use the parallel algorithm down to a certain «grain», then the sequential one
– drawback: W∞ increases ;o) … and so does the number of steals
• Work-preserving speed-up technique [Bini-Pan 94], sequential then parallel; cascading [Jaja 92]: careful interplay of both algorithms to build one with both W∞ small and W1 = O(Wseq)
– use the work-optimal sequential algorithm to reduce the size
– then use the time-optimal parallel algorithm to decrease the time
– drawback: sequential at coarse grain and parallel at fine grain ;o(
How to get both optimal work W1 and small W∞?
Illustration: f(i), i = 1..100
• SeqComp(w) runs on CPU A: after f(1), the remaining work is w = 2..100, exposed as LastPart(w); after f(1); f(2), w = 3..100.
• An idle CPU B steals LastPart(w): the remaining work is split; CPU A keeps SeqComp(w) with w = 3..51, while CPU B starts SeqComp(w') on w' = 52..100, which itself exposes LastPart(w').
• CPU B computes f(52); w' = 53..100 remains extractable, and so on, recursively.
Iterated product: sequential, parallel, adaptive [Davide Vernizzi]
• Sequential:
– input: an array of n values x1, …, xn
– output: sum(i = 1..n) f(xi)
– C/C++ code:
for (i = 0; i < n; i++)
    res += atoi(x[i]);
• Parallel algorithm:
– recursive computation by blocks (binary tree with merge)
– block size = pagesize
– Kaapi code: Athapascan API
Experimentation: parallel <=> adaptive
Variant: sum of pages
• Input: a set of n pages; each page is an array of values
• Output: one page where each element is the sum of the elements with the same index in all input pages: res[j] = sum(i = 0..n-1) f(pages[i][j])
• C/C++ code:
for (i = 0; i < n; i++)
    for (j = 0; j < pageSize; j++)
        res[j] += f(pages[i][j]);
Experimental result: the parallel algorithm costs about twice as much as the sequential one, while the adaptive algorithm reaches an efficiency close to 1.
Demonstration on ensibull
Script:
[vernizzd@ensibull demo]$ more go-tout.sh
#!/bin/sh
./spg /tmp/data &
./ppg /tmp/data 1 --a1 -thread.poolsize 3 &
./apg /tmp/data 1 --a1 -thread.poolsize 3 &
Result:
[vernizzd@ensibull demo]$ ./go-tout.sh
Page size: 4096
ADAPTIVE (3 procs): res = -2.048e+07, time = 0.408178 s, 54 threads created
PARALLEL (3 procs): res = -2.048e+07, time = 0.964014 s, #fork = 7497
SEQUENTIAL (1 proc): res = -2.048e+07, time = 1.15204 s
Adaptive algorithm (1/3)
• Hypothesis: non-preemptive, work-stealing scheduling
• Coupling on the adaptive sequential side:

void Adaptative (a1::Shared_w<Page> *resLocal, DescWork dw) {
  a1::Shared<Page> resLPC;
  a1::Fork<LPC>() (resLPC, dw);
  Page resSeq (pageSize);
  AdaptSeq (dw, &resSeq);
  a1::Fork<Merge>() (resLPC, *resLocal, resSeq);
}
Adaptive algorithm (2/3)
• Sequential side:

void AdaptSeq (DescWork dw, Page *resSeq) {
  DescLocalWork w;
  Page resLoc (pageSize);
  double k;
  while (!dw.desc->extractSeq(&w)) {
    for (int i = 0; i < pageSize; i++) {
      k = resLoc.get(i) + (double) buff[w*pageSize + i];
      resLoc.put(i, k);
    }
  }
  *resSeq = resLoc;
}
Adaptive algorithm (3/3)
• Extraction side = the parallel algorithm:

struct LPC {
  void operator () (a1::Shared_w<Page> resLPC, DescWork dw) {
    DescWork dw2;
    dw2.Allocate();
    dw2.desc->l.initialize();
    if (dw.desc->extractPar(&dw2)) {
      a1::Shared<Page> res2;
      a1::Fork<AdaptativeMain>() (res2, dw2.desc->i, dw2.desc->j);
      a1::Shared<Page> resLPCold;
      a1::Fork<LPC>() (resLPCold, dw);
      a1::Fork<MergeLPC>() (resLPCold, res2, resLPC);
    }
  }
};
Application: gzip parallelization
• Gzip:
– widely used (web) and costly, although of linear complexity
– source code: 10000 lines of C, complex data structures
– principle: LZ77 + Huffman tree
• Why gzip?
– a P-complete problem, but a practical parallelization is possible, similar to the iterated product
– drawback: every (known) parallelization induces an overhead -> loss of compression ratio
How to parallelize gzip?
• Sequential scheme: on-the-fly compression, input file -> compressed file
• «Easy» parallelization: static partition in blocks, parallel compression of the blocks -> compressed blocks, 100% compatible with gzip/gunzip
– problems: loss of compression ratio, grain depends on the machine, overhead
• Adaptive scheme: dynamic partition in blocks
Adaptive-grain gzip parallelization
• On-the-fly compression (SeqComp) coupled with a LastPartComputation on the rest of the input
• Dynamic partition in blocks: the input file is compressed in parallel into output blocks, concatenated (cat) into the compressed output file
Performance on SMP (Pentium, 4x200 MHz)
[Chart: search time (seconds, 0-12) in large local DNA files, for file sizes 31924544 and 63849088 bytes; sequential vs parallel (12 threads).]
Performance in a distributed setting
[Chart: search time (seconds, 0-1200) across two non-local disks, for directory sizes 674021 and 1228427 bytes; sequential vs 1 node/4 threads vs 2 nodes/4 threads.]
Platforms: sequential: Pentium 4x200 MHz; SMP: Pentium 4x200 MHz; distributed architecture: Myrinet, Pentium 4x200 MHz + 2x333 MHz.
Distributed search in 2 directories of equal size, each on a remote disk (NFS).
Overhead in compressed file size:

File size | gzip     | adaptive, 2 procs | adaptive, 8 procs | adaptive, 16 procs
0.86 MB   | 272573   | 275692            | 280660            | 280660
5.2 MB    | 1.023 MB | 1.027 MB          | 1.05 MB           | 1.08 MB
9.4 MB    | 6.60 MB  | 6.62 MB           | 6.73 MB           | 6.79 MB
10 MB     | 1.12 MB  | 1.13 MB           | 1.14 MB           | 1.17 MB

Gain in time (adaptive, 2 / 8 / 16 procs):

File size | 2 procs | 8 procs | 16 procs
5.2 MB    | 3.35 s  | 0.96 s  | 0.55 s
9.4 MB    | 7.67 s  | 6.73 s  | 6.79 s
10 MB     | 6.79 s  | 1.71 s  | 0.88 s
Performance
[Chart: 4-processor computer (Pentium 4x200 MHz); time in seconds vs file size (Ko: 1,106 / 2,089 / 2,263 / 4,260 / 6,769 / 7,905 / 8,960 / 10,957 / 15,962 / 19,298 / 21,914); sequential gzip vs Athapascan gzip.]
V. Processor-oblivious parallel prefix computation
• Solution: mix both a sequential and a parallel algorithm
• Basic technique: parallel algorithm down to a certain «grain», then the sequential one
– problem: W∞ increases, and so do the number of migrations … and the inefficiency ;o(
• Work-preserving speed-up [Bini-Pan 94] = cascading [Jaja 92]: careful interplay of both algorithms to build one with both W∞ small and W1 = O(Wseq)
– divide the sequential algorithm into blocks
– each block is computed with the (non-optimal) parallel algorithm
– drawback: sequential at coarse grain and parallel at fine grain ;o(
• Adaptive granularity, the dual approach: parallelism is extracted at run time from any sequential task
But parallelism often has a cost!
Prefix computation
• Prefix problem:
– input: a0, a1, …, an
– output: π0, π1, …, πn with πi = a0 * a1 * … * ai
• Sequential algorithm: for (i = 1; i <= n; i++) π[i] = π[i-1] * a[i]; performs W1 = W∞ = n operations
• Fine-grain optimal parallel algorithm [Ladner-Fischer]: compute the pairwise products a1*a2, a3*a4, …; a recursive prefix of this size-n/2 sequence yields half of the πi directly, and one extra multiplication per remaining index fills in the other half
– critical time W∞ = 2·log n, but performs W1 = 2·n operations: twice as expensive as the sequential algorithm …
Prefix computation: an example where parallelism always costs
• Any parallel prefix algorithm runs on p processors in time ≥ 2n / (p+1)
– strict lower bound, reached by a block algorithm + pipeline [Nicolau et al. 1996]
– Question: how to design a generic parallel algorithm, independent from the architecture, that achieves optimal performance on any given architecture? -> design a malleable algorithm where the schedule fits the number of operations performed to the architecture
Architecture model
– heterogeneous processors with changing speeds [Bender-Rabin 02]
=> Πi(t) = instantaneous speed of processor i at time t, in #operations per second
– average speed per processor for a computation of duration T: Πave = (1 / (p·T)) · Σ(i = 1..p) ∫(0..T) Πi(t) dt
– lower bound for the time of prefix computation: T ≥ 2n / ((p+1) · Πave)
Alternative: concurrently sequential and parallel
Based on the work-first principle: always execute the sequential algorithm, to reduce the parallelism overhead;
use the parallel algorithm only if a processor becomes idle (i.e. steals), by extracting parallelism from the sequential computation.
Hypothesis: two algorithms:
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation: at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
– Self-adaptive granularity based on work-stealing: SeqCompute runs; Extract_par splits off a LastPartComputation, itself executed as a SeqCompute.
Adaptive Prefix on 3 processors (execution sketch)
• The main sequential process computes π1, π2, … from π0 and a1 … a12; idle work-stealers extract the last part of its remaining input.
• Steal request 1: work-stealer 1 takes a5 … a8 and computes the partial products a5*…*ai; the main process keeps a1 … a4.
• Steal request 2: work-stealer 2 takes a9 … a12 and computes the partial products a9*…*ai.
• When the main process has computed π4, it preempts work-stealer 1: π8 = π4 * (a5*…*a8), and the partial products for i = 5..7 are finalized into π5 … π7 with one multiplication each.
• Likewise, π8 preempts work-stealer 2 to produce π9 … π12.
• The critical path remains implicit on the sequential process.
Analysis of the algorithm
• Execution time: close to the lower bound 2n / ((p+1) · Πave), up to an additive O(log n) term
• Sketch of the proof:
– dynamic coupling of two algorithms that complete simultaneously:
– sequential: (optimal) number of operations S on one processor
– parallel: minimal time, but performs X operations on the other processors
• dynamic splitting is always possible down to the finest grain, BUT remains locally sequential
– critical path small (e.g. log X)
– each non-constant-time task can potentially be split (variable speeds)
– the algorithmic scheme ensures Ts = Tp + O(log X) => this bounds the total number X of operations performed, and the overhead of parallelism = (S + X) - #ops_optimal
Adaptive prefix: experiments 1
Single-user context: the processor-oblivious prefix achieves near-optimal performance:
- close to the lower bound, both on 1 processor and on p processors
- less sensitive to system overhead: even better than the theoretically "optimal" off-line parallel algorithm on p processors
[Chart: time (s) vs #processors, for the prefix sum of 8·10^6 doubles on an 8-proc SMP (IA64 1.5 GHz / Linux): pure sequential, optimal off-line on p procs, oblivious.]
Adaptive prefix: experiments 2
Multi-user context, with additional external load: (9-p) external dummy processes are executed concurrently.
The processor-oblivious prefix computation is always the fastest: 15% benefit over a parallel algorithm for p processors with an off-line schedule.
[Chart: time (s) vs #processors, prefix sum of 8·10^6 doubles on an 8-proc SMP (IA64 1.5 GHz / Linux): external load (9-p processes), off-line parallel algorithm for p processors, oblivious.]
The prefix race: sequential / fixed parallel / adaptive
Race between 9 algorithms (44 processes) on an octo-SMP: adaptive 8 proc.; parallel 8, 7, 6, 5, 4, 3, 2 proc.; sequential.
[Chart: execution time (0-25 seconds) for the 9 competitors.]
On each of the 10 executions, the adaptive version completes first.
Conclusion
The interplay of an on-line parallel algorithm directed by a work-stealing schedule is useful for the design of processor-oblivious algorithms.
Application to prefix computation:
- theoretically, it reaches the lower bound on heterogeneous processors with changing speeds;
- practically, it achieves near-optimal performance on multi-user SMPs.
This gives a generic adaptive scheme to implement parallel algorithms with provable performance.
- Work in progress: parallel 3D reconstruction [oct-tree scheme with deadline constraint].
VI. Adaptation to time constraints: oct-tree computation
[Section to be completed: Bruno / Luciano]
VII. Bi-criteria latency/bandwidth
[Section to be completed: Bruno / Jean-Denis]
VIII. Adaptation to support fault tolerance by work-stealing
The dataflow graph is a portable distributed representation of the execution: it can be used to adapt to resource resilience by computing a coherent global state.
SEL: a protocol based on logging the dataflow graph
Conclusion on fault tolerance
• Kaapi tolerates the addition and the failure of machines
– TIC protocol: checkpoint period to be tuned
– failure detector: error signal + heartbeat
– http://kaapi.gforge.inria.fr
Athapascan / Kaapi
[Table: the TIC protocol logs the stack, with local or global state; the SEL protocol logs the full dataflow graph (FullDag), with no checkpoint needed.]
Thank you !
Interactive distributed simulation [B. Raffin & E. Boyer]
- 5 cameras, 6 PCs
- 3D reconstruction + simulation + rendering
-> adaptive scheme to maximize the 3D-reconstruction precision within a fixed time step
http://moais.imag.fr
Laboratoire d'Informatique de Grenoble
Multi-Programming and Scheduling Design for Applications of Interactive Simulation
LIG evaluation committee - 23/01/2006
Louvre, Musée de l'Homme. Sculpture (head). Artist: anonymous. Origin: Rapa Nui [Easter Island]. Date: between the 11th and the 15th century. Dimensions: 1.70 m high.
Who are the Moais today?
• Permanent staff (7): INRIA (2) + INPG (3) + UJF (2)
• Visiting professor (1) + PostDoc (1)
• ITA: administration (1.25) + engineer (1)
• PhD students (14): 6 contracts + 2 joint supervisions (co-tutelle)
Vincent Danjean [MdC UJF]Thierry Gautier [CR INRIA] Guillaume Huard [MdC UJF] Grégory Mounié [MdC INPG]
Bruno Raffin [CR INRIA]Jean-Louis Roch [MdC INPG]Denis Trystram [Prof INPG]
Axel Krings [CNRS/RAGTIME, Univ. Idaho, 1/09/2004 -> 31/08/05], Luciano Suares [PostDoc INRIA, 2006]
Admin.: Barbara Amouroux [INRIA, 50%], Annie-Claude Vial-Dallais [INPG, 50%], Evelyne Feres [UJF, 50%]; Engineer: Joelle Prévost [INPG, 50%], IE [CNRS, 50%]
Julien Bernard (BDI ST), Florent Blanchot (Cifre ST), Guillaume Dunoyer (DCN, MOAIS/POPART/E-MOTION), Lionel Eyraud (MESR), Feryal-Kamila Moulai (Bull), Samir Jafar (ATER), Clément Menier (MESR, MOAIS/MOVI)
Jonathan Pecero-Sanchez (Egide Mexique), Laurent Pigeon (Cifre IFP), Krzysztof Rzadca (U. Warsaw, Poland), Daouda Traore (Egide/Mali), Sébastien Varrette (U. Luxembourg), Thomas Arcila (Cifre Bull), Eric Saule (MESR)
Jérémie Allard (MESR, 12/2005), Luiz-Angelo Estefanel (Egide/Brasil, 11/2005)
Hamid-Reza Hamidi (Egide/Iran, 10/2005), Jaroslaw Zola (U. Czestochowa, Poland, 12/2005), +4
Objective
• Programming, on virtual networks (clusters and lightweight grids), applications whose multi-criteria performance depends on the number of resources, with provable performance
– Adaptability:
• static, to the platform: the devices evolve gradually
• dynamic, to the execution context: data and resource usage
• Target applications: interactive and distributed simulations
– virtual reality, compute-intensive applications (process engineering, optimization, bio-computing)
E.g. the Grimage platform: computation + visualization
MOAIS approach
• Algorithm + scheduling + programming
– large degree of parallelism w.r.t. the architecture
– both local preemptive (system) and non-preemptive (application) scheduling, to obtain global provable performance
• dataflow, distributed, recursive/multithreaded
• Research themes
– scheduling: off-line, on-line, multi-criteria
– adaptive (poly-)algorithms: various levels of malleability (parallelism, precision, refresh rate, …)
– coupling: efficient description of synchronizations; KAAPI software: fine-grain dynamic scheduling
– interactivity; FlowVR software: coarse-grain scheduling
Scientific challenges
• To schedule, with provable performance, for multi-criteria objectives on complex architectures, e.g. time/memory, makespan/min-sum, time/work; approximation, game theory, …
• To design distributed adaptive algorithms; efficiency/scalability: local decisions, but global performance
– dynamic local choices: performance with good probability
• To implement on emerging platforms; challenging applications (with partners):
– Grimage (MOVI)
– H.264 / MPEG-4 encoding on mpScNet [ST]
– QAP/Nugent on Grid5000 [PRISM, GILCO, LIFL]
Objective
Programming of applications where performance is a matter of resources: take benefit of more, and adjust to less
– e.g. a global computing platform (P2P)
Application code: "independent" from the resources, and adaptive. Target applications: interactive simulation
– "virtual observatory"
MOAIS: interaction + simulation + rendering
Performance is related to #resources:
- simulation: precision = size/order -> #procs, memory space
- rendering: image walls, sounds, ... -> #video-projectors, #speakers, ...
- interaction: acquisition peripherals -> #cameras, #haptic sensors, …
GRIMAGE platform
• 2003: 11 PCs, 8 projectors and 4 cameras; first demo: 12/03
• 2005: 30 PCs, 16 projectors and 20 cameras
– a display wall:
• surface: 2 x 2.70 m
• resolution: 4106 x 3172 pixels
• very bright: works in daylight
Commodity components [B. Raffin]
Video [J. Allard, C. Menier]