
Laboratoire Informatique de Grenoble

Parallel algorithms and scheduling: adaptive parallel programming and applications

Bruno Raffin, Jean-Louis Roch, Denis Trystram. MOAIS project [INRIA / CNRS / INPG / UJF]

moais.imag.fr

Parallel algorithms and scheduling: adaptive parallel programming and applications

Contents
I. Motivations for adaptation and examples
II. Off-line scheduling and adaptation: moldable/malleable tasks [Denis?]
III. On-line work-stealing scheduling and parallel adaptation
IV. A first example: iterated product; application to gzip
V. Processor-oblivious parallel prefix computation
VI. Adaptation to time constraints: oct-tree computation [Bruno/Luciano]
VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]
VIII. Adaptation to support fault tolerance by work-stealing
Conclusion

Why adaptive algorithms and how?

Example input whose properties may vary, e.g. an upper-triangular matrix:

$\begin{pmatrix} 7 & 3 & 6 \\ 0 & 1 & 8 \\ 0 & 0 & 5 \end{pmatrix}$

Input data vary. Resource availability is versatile.

Adaptation to improve performance:

Scheduling: partitioning, load-balancing, work-stealing

Measures on resources

Measures on data

Calibration: tuning parameters (block size w.r.t. cache, choice of instructions, …), priority management

Choices in the algorithm: sequential / parallel(s), approximate / exact, in-memory / out-of-core, …

An algorithm is « hybrid » iff there is a high-level choice between at least two algorithms, each of which can solve the same problem.
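To make this concrete, here is a minimal sketch of such a high-level choice (an illustration in C++, not code from the talk; hybrid_sort and the threshold value are hypothetical):

#include <algorithm>
#include <thread>
#include <vector>

// A « hybrid » in the above sense: one high-level choice, per call,
// between two algorithms that both solve the same problem (sorting).
void hybrid_sort(std::vector<int>& v) {
    const std::size_t threshold = 1 << 15;   // tuning parameter (calibration)
    if (v.size() < threshold) {
        std::sort(v.begin(), v.end());       // sequential choice
    } else {                                 // parallel choice:
        auto mid = v.begin() + v.size() / 2; // sort halves concurrently,
        std::thread t([&] { std::sort(v.begin(), mid); });
        std::sort(mid, v.end());
        t.join();
        std::inplace_merge(v.begin(), mid, v.end());  // then merge
    }
}

With its single size-based test, this would be a « simple hybrid » in the classification below (O(1) choices).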

Parallel algorithms and scheduling: adaptive parallel programming and applications

Contents
I. Motivations for adaptation and examples
II. Off-line scheduling and adaptation: moldable/malleable tasks [Denis?]
III. On-line work-stealing scheduling and parallel adaptation
IV. A first example: iterated product; application to gzip
V. Processor-oblivious parallel prefix computation
VI. Adaptation to time constraints: oct-tree computation [Bruno/Luciano]
VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]
VIII. Adaptation to support fault tolerance by work-stealing
Conclusion

Modeling a hybrid algorithm

• Several algorithms solve the same problem f: e.g. algo_f1, algo_f2(block size), …, algo_fk, each algo_fi being recursive:

algo_fi(n, …) { …; f(n-1, …); …; f(n/2, …); … }

Adaptation: choose algo_fj for each call to f.

E.g. “practical” hybrids: Atlas, Goto, FFPack, FFTW, cache-oblivious B-trees, any parallel program with scheduling support: Cilk, Athapascan/Kaapi, Nesl, TLib, …

• How to manage the overhead due to choices?
• Classification 1/2:

– Simple hybrid iff O(1) choices [e.g. block size in Atlas, …]

– Baroque hybrid iff an unbounded number of choices [e.g. recursive splitting factors in FFTW]

• Choices are either dynamic or pre-computed, based on input properties.

• Choices may or may not be based on architecture parameters.

• Classification 2/2: a hybrid is

– Oblivious: the control flow depends neither on static properties of the resources nor on the input [e.g. cache-oblivious algorithms [Bender]]

– Tuned: strategic choices are based on static parameters [e.g. block size w.r.t. cache, granularity, …]

• engineered-tuned or self-tuned [e.g. the ATLAS and GOTO libraries, FFTW, …] [e.g. LinBox/FFLAS [Saunders&al]]

– Adaptive: self-configuration of the algorithm, dynamic
• based on input properties or resource circumstances discovered at run-time [e.g. idle processors, data properties, …] [e.g. TLib, Rauber&Rünger]

Examples

• BLAS libraries:
– Atlas: simple tuned (self-tuned)
– Goto: simple engineered (engineered-tuned)
– LinBox / FFLAS: simple self-tuned, adaptive [Saunders&al]
• FFTW:
– halving factor: baroque tuned
– stopping criterion: simple tuned

Dynamic architecture: non-fixed number of resources, variable speeds. E.g. grids, but not only: an SMP server in multi-user mode.

Adaptation in parallel algorithms. Problem: compute f(a).

Candidate algorithms: sequential, parallel P=2, parallel P=100, parallel P=max, …
Target platforms: multi-user SMP server, grid, heterogeneous network.

Which algorithm to choose?

Dynamic architecture (non-fixed number of resources, variable speeds; e.g. grids, but also SMP servers in multi-user mode)

=> motivates « processor-oblivious » parallel algorithms that:

+ are independent from the underlying architecture: no reference to p, nor to π_i(t) = speed of processor i at time t, nor …

+ on a given architecture, have performance guarantees: behave as well as an optimal (off-line, non-oblivious) algorithm.

Processor-oblivious algorithms

How to adapt the application?

• By minimizing communications
• e.g. amortizing synchronizations in the simulation [Beaumont, Daoudi, Maillard, Manneback, Roch - PMAA 2004]: adaptive granularity
• By controlling latency (interactivity constraints):
• FlowVR [Allard, Menier, Raffin]: overhead
• By managing node failures and resilience [checkpoint/restart] [checkers]
• FlowCert [Jafar, Krings, Leprevost; Roch, Varrette]
• By adapting granularity
• malleable tasks [Trystram, Mounié]
• dataflow cactus-stack: Athapascan/Kaapi [Gautier]
• recursive parallelism by « work-stealing » [Blumofe-Leiserson 98, Cilk, Athapascan, …]
• Self-adaptive grain algorithms
• dynamic extraction of parallelism [Daoudi, Gautier, Revire, Roch - J. TSI 2004]

Parallelism and efficiency

• An efficient scheduling policy (close to optimal) is difficult in general (coarse grain), but easy if T_∞ is small (fine grain): T_p ≤ T_1/p + T_∞ [greedy scheduling, Graham 69].
• Controlling the policy (its realization) is expensive in general (fine grain), but has small overhead at coarse grain.

Problem: how to adapt the potential parallelism to the resources?

« Depth »: T_∞ = #ops on a critical path = parallel time on ∞ resources.
« Work »: T_1 = #operations = sequential time.

=> goal: make T_∞ small, with coarse-grain control.

Parallel algorithms and scheduling: adaptive parallel programming and applications

Contents
I. Motivations for adaptation and examples
II. Off-line scheduling and adaptation: moldable/malleable tasks [Denis?]
III. On-line work-stealing scheduling and parallel adaptation
IV. A first example: iterated product; application to gzip
V. Processor-oblivious parallel prefix computation
VI. Adaptation to time constraints: oct-tree computation [Bruno/Luciano]
VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]
VIII. Adaptation to support fault tolerance by work-stealing
Conclusion

Parallel algorithms and scheduling: adaptive parallel programming and applications

Bruno Raffin, Jean-Louis Roch, Denis Trystram
INRIA-CNRS MOAIS team - LIG Grenoble, France

Contents
I. Motivations for adaptation and examples
II. Off-line scheduling and adaptation: moldable/malleable tasks [Denis?]
III. On-line work-stealing scheduling and parallel adaptation
IV. A first example: iterated product; application to gzip
V. Processor-oblivious parallel prefix computation
VI. Adaptation to time constraints: oct-tree computation
VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]
Conclusion

[To be completed: Denis]

Parallel algorithms and scheduling: adaptive parallel programming and applications

Contents
I. Motivations for adaptation and examples
II. Off-line scheduling and adaptation: moldable/malleable tasks [Denis?]
III. On-line work-stealing scheduling and parallel adaptation
IV. A first example: iterated product; application to gzip
V. Processor-oblivious parallel prefix computation
VI. Adaptation to time constraints: oct-tree computation [Bruno/Luciano]
VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]
VIII. Adaptation to support fault tolerance by work-stealing
Conclusion

Parallel algorithms and scheduling: adaptive parallel programming and applications

Bruno Raffin, Jean-Louis Roch, Denis Trystram
INRIA-CNRS MOAIS team - LIG Grenoble, France

Contents
I. Motivations for adaptation and examples
II. Off-line scheduling and adaptation: moldable/malleable tasks [Denis?]
III. On-line work-stealing scheduling and parallel adaptation
IV. A first example: iterated product; application to gzip
V. Processor-oblivious parallel prefix computation
VI. Adaptation to time constraints: oct-tree computation
VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]
Conclusion

Work-stealing (1/2)

« Work »: W_1 = total #operations performed.
« Depth »: W_∞ = #ops on a critical path (= parallel time on ∞ resources).

• Work-stealing = « greedy » schedule, but distributed and randomized
• Each processor manages locally the tasks it creates
• When idle, a processor steals the oldest ready task from a remote, non-idle victim processor (randomly chosen)
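A minimal sketch of this discipline (illustrative only; the coarse locking and the Task type are simplifications, not the Cilk/Kaapi implementation):

#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <random>
#include <vector>

using Task = std::function<void()>;

struct Worker {
    std::deque<Task> tasks;   // local deque of the tasks this worker created
    std::mutex m;

    void push(Task t) { std::lock_guard<std::mutex> g(m); tasks.push_back(std::move(t)); }

    std::optional<Task> pop_local() {   // owner side: take the newest task
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.back()); tasks.pop_back(); return t;
    }
    std::optional<Task> steal() {       // thief side: take the oldest ready task
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.front()); tasks.pop_front(); return t;
    }
};

// One scheduling step of worker `self`: run local work if any;
// when idle, steal from a randomly chosen victim.
void schedule_step(std::vector<Worker>& ws, std::size_t self, std::mt19937& rng) {
    if (auto t = ws[self].pop_local()) { (*t)(); return; }
    std::uniform_int_distribution<std::size_t> pick(0, ws.size() - 1);
    std::size_t victim = pick(rng);
    if (victim != self)
        if (auto t = ws[victim].steal()) (*t)();
}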

Work-stealing (2/2)

« Work »: W_1 = total #operations performed.
« Depth »: W_∞ = #ops on a critical path (= parallel time on ∞ resources).

• Interests:
-> suited to heterogeneous architectures, with a slight modification [Bender-Rabin 02]
-> if W_∞ is small enough: a near-optimal, processor-oblivious schedule with good probability on p processors of average speed π_ave
NB: #successful steals = #task migrations < p·W_∞ [Blumofe 98, Narlikar 01, Bender 02]

• Implementation: work-first principle [Cilk: series-parallel; Kaapi: dataflow]
-> move the scheduling overhead onto the steal operations (the infrequent case)
-> common case: « local parallelism » implemented as a sequential function call
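A sketch of the work-first idea (deliberately simplified; Frame, the shadow stack, and fork_f2 are hypothetical names, and the steal-detection/synchronization of a real runtime is omitted):

#include <vector>

// The fork pushes a descriptor that makes the continuation of f1
// stealable, then runs f2 as a plain sequential call: in the common
// (unstolen) case the only overhead is the push/pop.
struct Frame { /* state a thief needs to resume f1 after the fork */ };

thread_local std::vector<Frame> shadow_stack;  // cactus-stack, simplified

void f2() { /* child task body */ }

void fork_f2() {
    shadow_stack.push_back(Frame{});  // expose the continuation to thieves
    f2();                             // common case: ordinary function call
    shadow_stack.pop_back();          // a real runtime would instead check
                                      // whether the frame was stolen and,
                                      // if so, synchronize with the thief
}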

Work-stealing scheduling of a parallel recursive fine-grain algorithm

• Work-stealing scheduling: an idle processor steals the oldest ready task
• Interests:
=> #successful steals < p·T_∞ [Blumofe 98, Narlikar 01, …]
=> suited to heterogeneous architectures [Bender-Rabin 03, …]
• Hypotheses for efficient parallel executions:
• the parallel algorithm is « work-optimal »
• T_∞ is very small (recursive parallelism)
• a « sequential » execution of the parallel algorithm is valid
• e.g.: search trees, Branch&Bound, …
• Implementation: work-first principle [Multilisp, Cilk, …]
• overhead of task creation only upon a steal request: sequential degeneration of the parallel algorithm
• cactus-stack management
• Advantage: « static » fine grain, but dynamic control
• Drawback: possible overhead of the parallel algorithm [e.g. prefix computation]

Implementation of work-stealing

[Diagram] Processor P executes f1() { …; fork f2; … }: the forked task is pushed on P's stack; an idle processor P' steals the oldest ready task from that stack and executes it. Execution of ready tasks is non-preemptive. Hypothesis: a sequential schedule is valid.

Cost model: with high probability, on p identical processors:
- execution time = T_1/p + O(T_∞)
- number of steal requests = O(p·T_∞)

Experimentation: knary benchmark

SMP architecture: Origin 3800 (32 procs), Cilk / Athapascan.
Distributed architecture: iCluster, Athapascan:

#procs   Speed-up
8        7.83
16       15.6
32       30.9
64       59.2
100      90.1

Ts = 2397 s, T1 = 2435 s

How to obtain an efficient fine-grain algorithm?

• Hypotheses for the efficiency of work-stealing:
• the parallel algorithm is « work-optimal »
• T_∞ is very small (recursive parallelism)
• Problem:
• fine-grain (T_∞ small) parallel algorithms may involve a large overhead with respect to an efficient sequential algorithm:
• overhead due to parallelism creation and synchronization
• but also arithmetic overhead

Work-stealing and adaptability

• Work-stealing allocates processors to tasks transparently to the application, with provable performance
• Supports the addition of new resources
• Supports resilience of resources and fault tolerance (crash faults, network, …)
• checkpoint/restart mechanisms with provable performance [Porch, Kaapi, …]
• « Baroque hybrid » adaptation: there is an implicit dynamic choice between two algorithms:
• a sequential (local) algorithm: depth-first (the default choice)
• a parallel algorithm: breadth-first
• the choice is performed at run-time, depending on resource idleness
• Well suited to applications where a fine-grain parallel algorithm is also a good sequential algorithm [Cilk]:
• parallel divide-and-conquer computations
• tree searching, Branch&X, …
-> suited when the sequential and parallel algorithms perform (almost) the same number of operations

Self-adaptive grain algorithm

• Principle: save the parallelism overhead by privileging a sequential algorithm:
=> use the parallel algorithm only if a processor becomes idle, by extracting parallelism from the sequential computation
• Hypothesis: two algorithms:
– 1 sequential: SeqCompute
– 1 parallel: LastPartComputation => at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm

[Diagram] Generic self-adaptive grain algorithm: SeqCompute runs the sequential algorithm; an idle processor's Extract_par takes the last part of the remaining work (LastPartComputation) and runs SeqCompute on it in turn.
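A sketch of this coupling over an index range (illustrative; Range, extract_seq and extract_par are hypothetical names, the Kaapi version appears later in this transcript):

#include <mutex>

// Shared work descriptor: the sequential owner nibbles from the front;
// an idle thief extracts the last part of what remains.
struct Range {
    long begin = 0, end = 0;
    std::mutex m;

    bool extract_seq(long& i) {          // owner side (SeqCompute)
        std::lock_guard<std::mutex> g(m);
        if (begin >= end) return false;
        i = begin++;
        return true;
    }
    bool extract_par(Range& stolen) {    // thief side (LastPartComputation)
        std::lock_guard<std::mutex> g(m);
        long len = end - begin;
        if (len < 2) return false;       // too small to split
        long mid = begin + len / 2;
        stolen.begin = mid; stolen.end = end;
        end = mid;                       // owner keeps the first half
        return true;
    }
};

template <class F>
void SeqCompute(Range& r, F f) {         // purely sequential owner loop
    long i;
    while (r.extract_seq(i)) f(i);
}

A thief that succeeds in extract_par simply runs SeqCompute on its stolen Range, and its own remainder becomes stealable in turn.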

Parallel algorithms and scheduling: adaptive parallel programming and applications

Contents
I. Motivations for adaptation and examples
II. Off-line scheduling and adaptation: moldable/malleable tasks [Denis?]
III. On-line work-stealing scheduling and parallel adaptation
IV. A first example: iterated product; application to gzip
V. Processor-oblivious parallel prefix computation
VI. Adaptation to time constraints: oct-tree computation
VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]
VIII. Adaptation for fault tolerance
Conclusion

How to get both optimal work W_1 and small depth W_∞?

• General approach: mix
• a sequential algorithm with optimal work W_1
• and a fine-grain parallel algorithm with minimal critical time W_∞
• Folk technique: parallel, then sequential
• use the parallel algorithm down to a certain « grain », then the sequential one
• drawback: W_∞ increases ;o) … and so does the number of steals
• Work-preserving speed-up technique [Bini-Pan 94]: sequential, then parallel. Cascading [Jaja 92]: careful interplay of both algorithms to build one with both W_∞ small and W_1 = O(W_seq)
• use the work-optimal sequential algorithm to reduce the size
• then use the time-optimal parallel algorithm to decrease the time
• drawback: sequential at coarse grain and parallel at fine grain ;o(

Illustration: f(i), i = 1..100

[Animation frames] SeqComp(w) runs on CPU A: it computes f(1), then f(2), while the remaining work shrinks (w = 2..100, then w = 3..100), and LastPart(w) stays extractable. When CPU B becomes idle, LastPart(w) executes on CPU B: the remaining work w = 3..100 is split into w = 3..51 (kept by the sequential computation on CPU A) and w' = 52..100 (extracted). CPU B then runs SeqComp(w') on CPU B, computing f(52), …, and exposes its own LastPart(w'), from which parallelism can again be extracted recursively.

Iterated product: sequential, parallel, adaptive [Davide Vernizzi]

• Sequential:
• input: an array of n values
• output: $\sum_{i=1}^{n} f(x_i)$
• C/C++ code:

for (i = 0; i < n; i++)
    res += atoi(x[i]);

• Parallel algorithm:
• recursive computation by blocks (binary tree with merge)
• block size = pagesize
• Kaapi code: Athapascan API
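A sketch of the recursive block scheme (a plain C++ rendering of the binary tree with merge, not the Athapascan code; BLOCK stands in for the page size, and std::atol replaces atoi to accumulate in a long):

#include <cstdlib>

const long BLOCK = 4096;  // block size = pagesize in the talk

// Leaves process one block sequentially; inner nodes merge (add) the
// results of their two halves. In the Kaapi version the two recursive
// calls are forked as tasks.
long block_sum(char** x, long lo, long hi) {
    if (hi - lo <= BLOCK) {
        long res = 0;
        for (long i = lo; i < hi; i++) res += std::atol(x[i]);
        return res;
    }
    long mid = lo + (hi - lo) / 2;
    long left  = block_sum(x, lo, mid);   // forked in the parallel version
    long right = block_sum(x, mid, hi);
    return left + right;                  // merge
}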

Experiments: parallel vs. adaptive

Variant: sum of pages

• input: a set of n pages; each page is an array of values
• output: one page where each element is the sum of the elements with the same index in all input pages, i.e. $res[j] = \sum_{i=0}^{n-1} f(page_i[j])$
• C/C++ code:

for (i = 0; i < n; i++)
    for (j = 0; j < pageSize; j++)
        res[j] += f(pages[i][j]);

Experiments: the parallel algorithm costs about twice as much as the sequential one; the adaptive algorithm achieves an efficiency close to 1.

Demonstration on ensibull

Script:
[vernizzd@ensibull demo]$ more go-tout.sh
#!/bin/sh
./spg /tmp/data &
./ppg /tmp/data 1 --a1 -thread.poolsize 3 &
./apg /tmp/data 1 --a1 -thread.poolsize 3 &

Result:
[vernizzd@ensibull demo]$ ./go-tout.sh
Page size: 4096
ADAPTIVE (3 procs):   res = -2.048e+07, time = 0.408178 s, threads created: 54
PARALLEL (3 procs):   res = -2.048e+07, time = 0.964014 s, #fork = 7497
SEQUENTIAL (1 proc):  res = -2.048e+07, time = 1.15204 s

Adaptive algorithm (1/3)

• Hypothesis: non-preemptive, work-stealing scheduling
• Adaptive sequential coupling:

void Adaptative (a1::Shared_w<Page> *resLocal, DescWork dw) {
  // fork the parallel extraction (LastPartComputation)
  a1::Shared<Page> resLPC;
  a1::Fork<LPC>() (resLPC, dw);
  // run the sequential part locally, then merge both results
  Page resSeq (pageSize);
  AdaptSeq (dw, &resSeq);
  a1::Fork<Merge>() (resLPC, *resLocal, resSeq);
}

Adaptive algorithm (2/3)

• Sequential side:

void AdaptSeq (DescWork dw, Page *resSeq) {
  DescLocalWork w;
  Page resLoc (pageSize);
  double k;
  // extract and accumulate sequential work units while any remain
  while (!dw.desc->extractSeq(&w)) {
    for (int i = 0; i < pageSize; i++) {
      k = resLoc.get(i) + (double) buff[w*pageSize + i];
      resLoc.put(i, k);
    }
  }
  *resSeq = resLoc;
}

Adaptive algorithm (3/3)

• Extraction side = parallel algorithm:

struct LPC {
  void operator() (a1::Shared_w<Page> resLPC, DescWork dw) {
    DescWork dw2;
    dw2.Allocate();
    dw2.desc->l.initialize();
    // try to steal the last part of the remaining work
    if (dw.desc->extractPar(&dw2)) {
      a1::Shared<Page> res2;
      a1::Fork<AdaptativeMain>() (res2, dw2.desc->i, dw2.desc->j);
      // recursively keep extracting from what remains
      a1::Shared<Page> resLPCold;
      a1::Fork<LPC>() (resLPCold, dw);
      a1::Fork<MergeLPC>() (resLPCold, res2, resLPC);
    }
  }
};

Application: gzip parallelization

• Gzip:
• widely used (web) and costly, although of linear complexity
• source code: 10000 lines of C, complex data structures
• principle: LZ77 + Huffman tree
• Why gzip?
• a P-complete problem, but a practical parallelization is possible, similar to the iterated product
• Drawback: any (known) parallelization induces an overhead -> loss of compression ratio

[Diagram] Baseline algorithm: on-the-fly compression of the input file into the compressed file; the adaptive version uses a dynamic partition into blocks.

[Diagram] « Easy » parallelization, 100% compatible with gzip/gunzip: static partition of the input into blocks, parallel compression into compressed blocks. Problems: loss of compression ratio, grain depends on the machine, overhead.

How to parallelize gzip?

[Diagram] Adaptive-grain gzip parallelization: SeqComp compresses the input file on the fly, while LastPartComputation extracts the remaining input for parallel compression with a dynamic partition into blocks; the compressed blocks are concatenated (cat) with the on-the-fly output to form the output compressed file.

Performance on SMP

[Figure] Search time in large local DNA files: file size (bytes: 31924544, 63849088) vs. time (s, 0-12), sequential vs. parallel (12 threads); Pentium, 4×200 MHz.

Performance in a distributed setting

[Figure] Search time across two non-local disks: directory size (bytes: 674021, 1228427) vs. time (s, 0-1200), comparing sequential, 1 node with 4 threads, and 2 nodes with 4 threads. Sequential: Pentium 4×200 MHz; SMP: Pentium 4×200 MHz; distributed architecture: Myrinet, Pentium 4×200 MHz + 2×333 MHz. Distributed search in 2 directories of equal size, each on a remote disk (NFS).

Overhead in compressed file size:

Input size   Gzip       Adaptive (2 procs)   Adaptive (8 procs)   Adaptive (16 procs)
0.86 MB      272573     275692               280660               280660
5.2 MB       1.023 MB   1.027 MB             1.05 MB              1.08 MB
9.4 MB       6.60 MB    6.62 MB              6.73 MB              6.79 MB
10 MB        1.12 MB    1.13 MB              1.14 MB              1.17 MB

Time gain (adaptive; columns as above, 2 / 8 / 16 procs):

Input size   Time
5.2 MB       3.35 s   0.96 s   0.55 s
9.4 MB       7.67 s   6.73 s   6.79 s
10 MB        6.79 s   1.71 s   0.88 s

Performance

[Figure] 4-processor computer: file size (Ko: 1,106 to 21,914) vs. time (s, 0-90), sequential gzip vs. Athapascan gzip; Pentium, 4×200 MHz.

Parallel algorithms and scheduling: adaptive parallel programming and applications

Contents
I. Motivations for adaptation and examples
II. Off-line scheduling and adaptation: moldable/malleable tasks [Denis?]
III. On-line work-stealing scheduling and parallel adaptation
IV. A first example: iterated product; application to gzip
V. Processor-oblivious parallel prefix computation
VI. Adaptation to time constraints: oct-tree computation
VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]
VIII. Adaptation for fault tolerance
Conclusion

• Solution: mix a sequential and a parallel algorithm

• Basic technique:
• parallel algorithm down to a certain « grain », then the sequential one
• problem: W_∞ increases, and so do the number of migrations … and the inefficiency ;o(
• Work-preserving speed-up [Bini-Pan 94] = cascading [Jaja 92]: careful interplay of both algorithms to build one with both W_∞ small and W_1 = O(W_seq)
• divide the sequential algorithm into blocks
• each block is computed with the (non-optimal) parallel algorithm
• drawback: sequential at coarse grain and parallel at fine grain ;o(
• Adaptive granularity: the dual approach:
• parallelism is extracted at run-time from any sequential task

But parallelism often has a cost!

Prefix computation

• Prefix problem:
• input: a_0, a_1, …, a_n
• output: π_0, π_1, …, π_n with π_i = a_0 * a_1 * … * a_i
• Sequential algorithm:

for (i = 1; i <= n; i++) π[i] = π[i-1] * a[i];

performs W_1 = W_∞ = n operations.
• Fine-grain optimal parallel algorithm [Ladner-Fischer]:
[Diagram] combine adjacent pairs (a_0*a_1, a_2*a_3, …); the recursive prefix of size n/2 on these products yields π_1, π_3, …, π_n; one extra multiplication per remaining position (π_{2k} = π_{2k-1} * a_{2k}) yields π_2, π_4, …, π_{n-1}.
Critical time W_∞ = 2·log n, but performs W_1 = 2·n operations: twice as expensive as the sequential algorithm.
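A compact rendering of the Ladner-Fischer recursion (a sequential C++ sketch of the parallel scheme; both loops and the recursive call are the parts executed in parallel in the real algorithm):

#include <vector>

// Combine adjacent pairs, recursively prefix the n/2 pair products,
// then finish the remaining positions with one extra multiplication
// each: depth 2·log n, work about 2n.
void prefix(std::vector<double>& a) {
    std::size_t n = a.size();
    if (n <= 1) return;
    std::vector<double> b(n / 2);
    for (std::size_t i = 0; i + 1 < n; i += 2)  // pair products
        b[i / 2] = a[i] * a[i + 1];
    prefix(b);                                  // b[k] = prefix over a[0..2k+1]
    for (std::size_t i = 1; i < n; i += 2)      // odd positions come for free
        a[i] = b[i / 2];
    for (std::size_t i = 2; i < n; i += 2)      // one multiply per even slot
        a[i] = b[i / 2 - 1] * a[i];
}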

Prefix computation: an example where parallelism always costs

• Any parallel prefix algorithm on p processors needs time ≥ 2n/(p+1): a strict lower bound, matched by a block algorithm + pipeline [Nicolau&al. 1996].
• Question: how to design a generic parallel algorithm, independent from the architecture, that achieves optimal performance on any given architecture?
-> design a malleable algorithm whose schedule adapts the number of operations performed to the architecture.

Architecture model

- Heterogeneous processors with changing speeds [Bender-Rabin 02]
=> π_i(t) = instantaneous speed of processor i at time t, in #operations per second
- Average speed per processor for a computation of duration T: $\pi_{ave} = \frac{1}{pT} \sum_{i=1}^{p} \int_0^T \pi_i(t)\,dt$
- Lower bound for the time of prefix computation: $T \ge \frac{2n}{(p+1)\,\pi_{ave}}$

Alternative: concurrently sequential and parallel

Based on the work-first principle: always execute a sequential algorithm, to reduce the parallelism overhead; use the parallel algorithm only when a processor becomes idle (i.e. steals), by extracting parallelism from the sequential computation.

Hypothesis: two algorithms:
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation: at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm

– Self-adaptive granularity based on work-stealing. [Diagram] SeqCompute runs sequentially; Extract_par takes the LastPartComputation, on which the thief runs SeqCompute in turn.
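A sketch of how this coupling specializes to prefix (a sequentialized illustration with one owner and one fixed split point m; in the real scheme both sides run concurrently and m is determined by the steal):

#include <cstddef>
#include <vector>

// Assumes 0 <= m < n = a.size(). Owner computes pi[0..m] left to right;
// the thief has stolen a[m+1..n-1] and computes partial products
// beta[i] = a[m+1] * ... * a[i]. At preemption, each stolen position is
// finished by ONE multiplication: pi[i] = pi[m] * beta[i].
std::vector<double> adaptive_prefix(const std::vector<double>& a, std::size_t m) {
    std::size_t n = a.size();
    std::vector<double> pi(n), beta(n);
    pi[0] = a[0];                                 // owner: sequential prefix
    for (std::size_t i = 1; i <= m; i++) pi[i] = pi[i-1] * a[i];
    if (m + 1 < n) {                              // thief: partial products
        beta[m+1] = a[m+1];
        for (std::size_t i = m + 2; i < n; i++) beta[i] = beta[i-1] * a[i];
    }
    for (std::size_t i = m + 1; i < n; i++)       // preempt + merge
        pi[i] = pi[m] * beta[i];
    return pi;
}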

Adaptive Prefix on 3 processors

[Animation frames: a main sequential process and 2 work-stealers over a_1 … a_12]
The main process runs the sequential algorithm and computes π_1, π_2, … from left to right. On a steal request, an idle work-stealer extracts the last part of the remaining input: work-stealer 1 steals a_5…a_8 and computes the partial products β_i = a_5 * … * a_i; work-stealer 2 steals a_9…a_12 and computes β_i = a_9 * … * a_i. When the main process reaches a stolen block, it preempts the thief and combines its prefix with the thief's partial products (e.g. π_8 = π_4 * β_8, later π_11 = π_8 * β_11), while the freed work-stealer steals again from the remaining work, until π_12 is obtained.

Implicit critical path on the sequential process.

Analysis of the algorithm

• Execution time
• Sketch of the proof:
– dynamic coupling of two algorithms that complete simultaneously:
– sequential: (optimal) number of operations S on one processor
– parallel: minimal time, but performs X operations on the other processors
• dynamic splitting is always possible down to the finest grain, BUT remains locally sequential:
– small critical path (e.g. log X)
– each non-constant-time task can potentially be split (variable speeds)
– the algorithmic scheme ensures T_s = T_p + O(log X)
=> this bounds the total number X of operations performed and the overhead of parallelism = (S + X) - #ops_optimal, so the execution time matches the lower bound up to lower-order terms.
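Combining the coupling argument with the lower bound of the architecture-model slide, the resulting guarantee plausibly takes the following form (a reconstruction; the additive term is an assumption, not quoted from the slides):

$$ T_p \;\le\; \frac{2n}{(p+1)\,\pi_{ave}} + O\!\left(\frac{\log n}{\pi_{ave}}\right) $$

i.e. optimal up to a lower-order additive term.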

Adaptive prefix: experiments 1

Single-user context: the processor-oblivious prefix achieves near-optimal performance:
- close to the lower bound both on 1 processor and on p processors
- less sensitive to system overhead: even better than the theoretically « optimal » off-line parallel algorithm on p processors

[Figure] Prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux): time (s) vs. #processors; curves: pure sequential, optimal off-line on p procs, oblivious.

Adaptive prefix: experiments 2

Multi-user context, with additional external load: (9-p) external dummy processes are executed concurrently. The processor-oblivious prefix computation is always the fastest, with a 15% benefit over a parallel algorithm for p processors with an off-line schedule.

[Figure] Prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), with external load (9-p external processes): time (s) vs. #processors; curves: off-line parallel algorithm for p processors, oblivious.

The Prefix race: sequential / parallel fixed / adaptive

Race between 9 algorithms (44 processes) on an octo-SMP.

[Figure] Execution time (seconds, 0-25) of: Adaptive 8 proc., Parallel 8 proc., Parallel 7 proc., Parallel 6 proc., Parallel 5 proc., Parallel 4 proc., Parallel 3 proc., Parallel 2 proc., Sequential.

On each of the 10 executions, the adaptive algorithm completes first.

Conclusion

Coupling an on-line parallel algorithm with a work-stealing schedule is useful for designing processor-oblivious algorithms.

Application to prefix computation:
- theoretically reaches the lower bound on heterogeneous processors with changing speeds
- practically achieves near-optimal performance on multi-user SMPs

A generic adaptive scheme to implement parallel algorithms with provable performance.
- Work in progress: parallel 3D reconstruction [oct-tree scheme with a deadline constraint].

Parallel algorithms and scheduling: adaptive parallel programming and applications

Contents
I. Motivations for adaptation and examples
II. Off-line scheduling and adaptation: moldable/malleable tasks [Denis?]
III. On-line work-stealing scheduling and parallel adaptation
IV. A first example: iterated product; application to gzip
V. Processor-oblivious parallel prefix computation
VI. Adaptation to time constraints: oct-tree computation [Bruno/Luciano]
VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]
VIII. Adaptation to support fault tolerance by work-stealing
Conclusion

[To be completed: Bruno / Luciano]

Parallel algorithms and scheduling: adaptive parallel programming and applications

Contents
I. Motivations for adaptation and examples
II. Off-line scheduling and adaptation: moldable/malleable tasks [Denis?]
III. On-line work-stealing scheduling and parallel adaptation
IV. A first example: iterated product; application to gzip
V. Processor-oblivious parallel prefix computation
VI. Adaptation to time constraints: oct-tree computation [Bruno/Luciano]
VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]
VIII. Adaptation to support fault tolerance by work-stealing
Conclusion

[To be completed: Bruno / Jean-Denis]

Parallel algorithms and scheduling: adaptive parallel programming and applications

Contents
I. Motivations for adaptation and examples
II. Off-line scheduling and adaptation: moldable/malleable tasks [Denis?]
III. On-line work-stealing scheduling and parallel adaptation
IV. A first example: iterated product; application to gzip
V. Processor-oblivious parallel prefix computation
VI. Adaptation to time constraints: oct-tree computation [Bruno/Luciano]
VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]
VIII. Adaptation to support fault tolerance by work-stealing
Conclusion

The dataflow graph is a portable representation of the distributed execution: it can be used to adapt to resource resilience by computing a coherent global state.

SEL: a protocol based on logging the dataflow graph.

Fault-tolerance conclusion

• Kaapi tolerates both the addition and the failure of machines
– TIC protocol: period to tune
– failure detector: error signals + heartbeat
– http://kaapi.gforge.inria.fr

Athapascan / Kaapi

[Table residue comparing the two protocols: TIC logs the stack (« Pile »), SEL logs the full dataflow graph (« FullDag »); remaining entries: yes (« oui »), no need (« pas besoin »), local or global (« locale ou globale »).]

Parallel algorithms and scheduling: adaptive parallel programming and applications

Contents
I. Motivations for adaptation and examples
II. Off-line scheduling and adaptation: moldable/malleable tasks [Denis?]
III. On-line work-stealing scheduling and parallel adaptation
IV. A first example: iterated product; application to gzip
V. Processor-oblivious parallel prefix computation
VI. Adaptation to time constraints: oct-tree computation
VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]
VIII. Adaptation for fault tolerance
Conclusion

Thank you!


Interactive Distributed Simulation [B. Raffin & E. Boyer]

- 5 cameras, 6 PCs
- 3D reconstruction + simulation + rendering
-> adaptive scheme to maximize the 3D-reconstruction precision within a fixed time step

http://moais.imag.fr

Laboratoire Informatique de Grenoble

Multi-Programming and Scheduling Design for Applications of Interactive Simulation

LIG evaluation committee - 23/01/2006

Louvre, Musée de l'Homme. Sculpture (head). Artist: anonymous. Origin: Rapa Nui [Easter Island]. Date: between the 11th and the 15th century. Dimensions: 1.70 m high.

Who are the Moais today?

• Permanent staff (7): INRIA (2) + INPG (3) + UJF (2)
• Visiting professor (1) + postdoc (1)
• ITA: administration (1.25) + engineer (1)
• PhD students (14): 6 contracts + 2 jointly supervised (co-tutelle)

Vincent Danjean [MdC UJF], Thierry Gautier [CR INRIA], Guillaume Huard [MdC UJF], Grégory Mounié [MdC INPG], Bruno Raffin [CR INRIA], Jean-Louis Roch [MdC INPG], Denis Trystram [Prof INPG]

Axel Krings [CNRS/RAGTIME, Univ Idaho, 1/09/2004 -> 31/08/05], Luciano Suares [PostDoc INRIA, 2006]

Admin.: Barbara Amouroux [INRIA, 50%], Annie-Claude Vial-Dallais [INPG, 50%], Evelyne Feres [UJF, 50%]. Engineer: Joelle Prévost [INPG, 50%], IE [CNRS, 50%]

Julien Bernard (BDI ST), Florent Blanchot (Cifre ST), Guillaume Dunoyer (DCN, MOAIS/POPART/E-MOTION), Lionel Eyraud (MESR), Feryal-Kamila Moulai (Bull), Samir Jafar (ATER), Clément Menier (MESR, MOAIS/MOVI), Jonathan Pecero-Sanchez (Egide Mexique), Laurent Pigeon (Cifre IFP), Krzysztof Rzadca (U Warsaw, Poland), Daouda Traore (Egide/Mali), Sébastien Varrette (U Luxembourg), Thomas Arcila (Cifre Bull), Eric Saule (MESR)

Jérémie Allard (MESR, 12/2005), Luiz-Angelo Estefanel (Egide/Brasil 11/2005), Hamid-Reza Hamidi (Egide/Iran 10/2005), Jaroslaw Zola (U Czestochowa, Poland 12/2005), +4

Objective

• Programming, on virtual networks (clusters and lightweight grids), applications whose multi-criteria performance depends on the number of resources, with provable performance
– Adaptability:
• static, to the platform: the devices evolve gradually
• dynamic, to the execution context: data and resource usage
• Target applications: interactive and distributed simulations
– virtual reality, compute-intensive applications (process engineering, optimization, bio-computing)

E.g. the Grimage platform (computation + visualization)

MOAIS approach

• Algorithm + scheduling + programming
– large degree of parallelism w.r.t. the architecture
– both local preemptive (system) and non-preemptive (application) scheduling, to obtain global provable performance
• dataflow, distributed, recursive/multithreaded
• Research themes:
– scheduling: off-line, on-line, multi-criteria
– adaptive (poly-)algorithms: various levels of malleability: parallelism, precision, refresh rate, …
– coupling: efficient description of synchronizations; KAAPI software: fine-grain dynamic scheduling
– interactivity; FlowVR software: coarse-grain scheduling

Scientific challenges

• To schedule with provable performance for a multi-criteria objective on complex architectures, e.g. time/memory, makespan/minsum, time/work: approximation, game theory, …
• To design distributed adaptive algorithms; efficiency/scalability: local decisions but global performance
• dynamic local choices: performance with good probability
• To implement on emerging platforms; challenging applications (partners): Grimage (MOVI), H.264 / MPEG-4 encoding on mpScNet [ST], QAP/Nugent on Grid5000 [PRISM, GILCO, LIFL]

Objective

Programming of applications where performance is a matter of resources: take advantage of more, and adapt to fewer
– e.g. a global computing platform (P2P)

Application code: « independent » from the resources, and adaptive. Target applications: interactive simulation
– « virtual observatory »

MOAIS: interaction, simulation, rendering.

Performance is related to #resources:
- simulation: precision = size/order (#procs, memory space)
- rendering: image walls, sounds, … (#video-projectors, #speakers, …)
- interaction: acquisition peripherals (#cameras, #haptic sensors, …)

GRIMAGE platform

• 2003: 11 PCs, 8 projectors and 4 cameras; first demo: 12/03
• 2005: 30 PCs, 16 projectors and 20 cameras
– A display wall:
• surface: 2 × 2.70 m
• resolution: 4106 × 3172 pixels
• very bright: works in daylight
• Commodity components

[B. Raffin]

Video [J Allard, C Menier]
