Dariusz Kowalski
University of Connecticut & Warsaw University
joint work with
Alex Shvartsman
University of Connecticut & MIT
Performing Tasks in
Asynchronous Environments
Performing Work with Asynchronous Processors 2
Do-All problem [DHW et al.]
The DA(p,t) problem abstracts the basic problem of cooperation in a distributed setting: p processors must perform t tasks, and at least one processor must learn about it.
[Dwork Halpern Waarts 92/98]
Tasks are:
- known to every processor
- similar: each takes a similar number of local steps
- independent: may be performed in any order
- idempotent: may be performed concurrently
Do-All: synchronous model with crashes
Model: processors are synchronous, may fail by crashes
Solutions: problem well understood, results close to optimal
Shared-memory model -- communication by read/write:
- Kanellakis, P.C., Shvartsman, A.A.: Fault-tolerant parallel computation. Kluwer Academic Publishers (1997)
Message-passing model -- communication by exchanging messages:
- Dwork, C., Halpern, J., Waarts, O.: Performing work efficiently in the presence of faults. SIAM Journal on Computing, 27 (1998)
- De Prisco, R., Mayer, A., Yung, M.: Time-optimal message-efficient work performance in the presence of faults. Proc. of 13th PODC (1994)
- Chlebus, B., De Prisco, R., Shvartsman, A.A.: Performing tasks on synchronous restartable message-passing processors. Distributed Computing, 14 (2001)
Do-All: asynchronous models
Models:
Shared-memory model -- communication by read/write -- widely studied, but solutions far from optimal:
- Kanellakis, P.C., Shvartsman, A.A.: Fault-tolerant parallel computation. Kluwer Academic Publishers (1997)
- Anderson, R.J., Woll, H.: Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
- Kedem, Z., Palem, K., Raghunathan, A., Spirakis, P.: Combining tentative and definite executions for very fast dependable parallel computing. Proc. of 23rd STOC (1991)
Message-passing model -- communication by exchanging messages -- no interesting solutions until recently
Shared-Memory vs. Message-Passing

Shared-Memory (atomic registers):
- processors communicate by read/write in shared memory
- atomicity guarantees that a read outputs the last written value
- one read/write operation per local clock cycle
- information propagates, and information is persistent
Hence cooperation is always possible, although delayed. Here processor scheduling is the major challenge.

Message-Passing:
- processors communicate by exchanging messages
- the duration of a local step may be unbounded
- message delays may be unbounded
- information may not propagate -- send/receive depend on delays
Message-delay-sensitive approach
Even if message delays are bounded by d (a d-adversary), cooperation may be difficult.

Observation:
If d = Ω(t), then work must be Ω(t·p).

This means that cooperation is difficult, and addressing scheduling alone is not enough: algorithm design and analysis must be d-sensitive.
Dwork, C., Lynch, N., Stockmeyer, L.: Consensus in the presence of partial synchrony. J. of the ACM, 35 (1988)
Measures of efficiency
Termination time: the first time when all tasks are done and at least one processor knows about it. Used only to define work and message complexity.
Not interesting on its own: if all processors but one are delayed, then trivially time is Ω(t).
Work: the sum, over all processors, of the number of local steps taken until termination time.
Message complexity (message-passing model): the number of all point-to-point messages sent until termination time.
Structure of the presentation
Part 1: Shared-memory model
- Model and bibliography
- Improving the AW algorithm in shared memory by better processor scheduling (task load-balancing)

Part 2: Message-passing model
- Model: asynchrony, message delay, and modeling issues
- Delay-sensitive lower bounds for Do-All
- Progress-tree Do-All algorithms: simulating shared memory and Anderson-Woll (AW); asynchronous message-passing progress-tree algorithm
- Permutation Do-All algorithms
Shared-Memory - model and goal
We consider the following model:
- p asynchronous processors with PIDs in {0,…,p-1}
- processors communicate by read/write in shared memory
- atomicity: a read outputs the last written value
- one read/write operation per local clock cycle

Write-All: write 1's into the t locations of a given array.

Goal: improve the scheduling of cooperating asynchronous processors, leading to better load-balancing with respect to tasks.
Write-All: Selected Bibliography
Introducing the Write-All problem:
- Kanellakis, P.C., Shvartsman, A.A.: Efficient parallel algorithms can be made robust. PODC (1989); Distributed Computing (1992)
AW algorithm with work O(t·p^ε):
- Anderson, R.J., Woll, H.: Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
Randomized algorithm with expected work O(t + p log p):
- Martel, C., Subramonian, R.: On the complexity of Certified Write-All algorithms. J. Algorithms 16 (1994)
First work-optimal deterministic algorithm for t = Ω(p⁴ log p):
- Malewicz, G.: A work-optimal deterministic algorithm for the asynchronous Certified Write-All problem. PODC (2003)
Progress tree algorithms [BKRS, AW]

• Shared memory
• p processors, t tasks (p = t)
• q permutations of [q]
• q-ary progress tree of depth log_q p
• nodes are binary completion bits
• permutations establish the order in which the children are visited
• p processors traverse the tree and use the q-ary expansion of their PID to choose permutations [Anderson Woll]

(figure: a q-ary tree; each internal node has children 1, 2, 3, …, q)
Algorithm AWT [Anderson Woll]
The progress tree data structure is stored in shared memory.

Example: p = t = 9, q = 3
- Ψ: list of 3 schedules from S₃
- T: ternary tree of 9 leaves (progress tree), with 0-1 values
- PID(j): j-th digit of the ternary representation of PID

(figure: the ternary progress tree with leaves 4-12 holding the 9 tasks; processors with PID = 0,3,6 use digit 0, PID = 1,4,7 digit 1, PID = 2,5,8 digit 2 to pick permutations; e.g. 7 = 21₃)
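The traversal can be sketched in Python as follows. This is an illustrative simplification, not the talk's own pseudocode: real AWT also stores completion bits at internal nodes and skips subtrees already marked done, which this single-processor sketch omits.

```python
# Sketch of an AWT-style q-ary progress-tree traversal (illustrative names).
# `perms` is a list of q permutations of [0..q-1]; `done` marks finished leaves.

def digits(pid, q, depth):
    """q-ary digits of pid, most significant first, padded to `depth`."""
    ds = []
    for _ in range(depth):
        ds.append(pid % q)
        pid //= q
    return ds[::-1]

def traverse(node, level, pid, q, depth, perms, done, do_task):
    if level == depth:                  # leaf: one task
        if not done[node]:
            do_task(node)
            done[node] = True
        return
    order = perms[digits(pid, q, depth)[level]]
    for child in order:                 # visit children in permuted order
        traverse(node * q + child, level + 1, pid, q, depth,
                 perms, done, do_task)

# usage matching the slide's parameters: q = 3, depth = 2 -> 9 leaf tasks,
# processor PID = 7 (= 21 in ternary)
q, depth = 3, 2
done = [False] * (q ** depth)
performed = []
traverse(0, 0, 7, q, depth, [[0, 1, 2], [1, 2, 0], [2, 0, 1]],
         done, performed.append)
```

A lone processor visits every leaf, so all 9 tasks end up performed; with several processors the permuted visiting orders are what limits duplicated work.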
Contention of permutations
- Sₙ: the group of all permutations of the set [n], with composition and identity idₙ
- σ, π: permutations in Sₙ
- Ψ: a set of q permutations from Sₙ
- i is an lrm (left-to-right maximum) in σ if σ(i) > max_{j<i} σ(j)
- LRM(σ): the number of lrm's in σ [Knuth]
- Cont(σ, π) = LRM(σ⁻¹ ∘ π)
- Contention of Ψ: Cont(Ψ) = max_π Σ_{σ∈Ψ} Cont(σ, π) [AW]

Theorem [AW]: For any n > 0 there exists a set Ψ of n permutations from Sₙ with Cont(Ψ) ≤ 3nHₙ = O(n log n).

[Knuth] Knuth, D.E.: The Art of Computer Programming, Vol. 3 (third edition). Addison-Wesley (1998)

(figure: example permutation 10 3 5 2 4 6 1 9 7 8 11 with its left-to-right maxima marked)
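A small Python sketch of these definitions may help; the composition order inside Cont follows the reconstruction above and could differ by convention in the original slides.

```python
# Sketch of the [AW] contention definitions. Permutations are sequences
# over 0..n-1; Cont(sigma, pi) is taken as LRM(sigma^{-1} o pi).

def lrm_count(sigma):
    """Number of left-to-right maxima of permutation sigma."""
    best, count = -1, 0
    for v in sigma:
        if v > best:
            best, count = v, count + 1
    return count

def inverse(sigma):
    """Inverse permutation: inverse(sigma)[sigma[i]] == i."""
    inv = [0] * len(sigma)
    for i, v in enumerate(sigma):
        inv[v] = i
    return inv

def cont(sigma, pi):
    """Cont(sigma, pi) = LRM(sigma^{-1} composed with pi)."""
    inv = inverse(sigma)
    return lrm_count([inv[pi[i]] for i in range(len(pi))])
```

For example, the identity permutation has n left-to-right maxima, and `cont(sigma, sigma)` equals n since the composition collapses to the identity.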
Procedure "Oblivious Do"
- n: number of jobs and of units (processors)
- Ψ: a list of n schedules from Sₙ

Procedure Oblivious(Ψ):
  forall processors PID = 0 to n-1
    for i = 1 to n do
      perform Job(σ_PID(i))

An execution of Job(σ_PID(i)) by processor PID is primary if job σ_PID(i) has not been previously performed.

Lemma [AW]: In algorithm Oblivious with n units, n jobs, and using the list Ψ of n permutations from Sₙ, the number of primary job executions is at most Cont(Ψ).
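The lemma can be explored with a small simulation. This is an illustrative sketch (names and the step-interleaving model are assumptions, not from the talk): the adversary picks an arbitrary order of processor steps, and we count how many job executions are primary.

```python
# Sketch: the Oblivious procedure with n units and n jobs, executed under
# an arbitrary interleaving of processor steps, counting primary executions.

def oblivious(schedules, interleaving):
    """
    schedules[pid] is a permutation of range(n) (processor pid's job order);
    interleaving is a sequence of pids giving the adversarial step order.
    Returns the number of primary job executions.
    """
    n = len(schedules)
    pos = [0] * n            # next index in each processor's schedule
    done = set()
    primary = 0
    for pid in interleaving:
        if pos[pid] < n:
            job = schedules[pid][pos[pid]]
            if job not in done:    # performed for the first time: primary
                done.add(job)
                primary += 1
            pos[pid] += 1          # oblivious: the job is executed regardless
    return primary
```

With two processors using reversed schedules and a round-robin interleaving, only the first two executions are primary; trying different schedule lists shows how the count varies with (and is bounded by) the contention of the list.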
AWT(q) - new progress tree traversal algorithm
Instead of using q permutations on the set [q], we use q permutations on the set [n], where n = q² log q.

Example: p = 6, t = 16, q = 2, n = 4
- Ψ: list of 2 schedules from S₄
- T: 4-ary tree of 16 leaves (progress tree), with 0-1 values
- PID(j): j-th digit of the q-ary (here binary) representation of PID; even PIDs use schedule 0, odd PIDs use schedule 1

(figure: the 4-ary progress tree with leaves 5-20 holding the 16 tasks; the two schedules from S₄ give the orders in which children are visited)
Main result
Set n = q² log q and let Ψ be a list of q schedules from Sₙ.
Define Cont(Ψ, Π) = max_{π∈Π} Cont(Ψ, π).

Lemma: For sufficiently large q and any set Π of at most exp(q² log² q) permutations of the set [q² log q], there is a list Ψ of q schedules from Sₙ such that
Cont(Ψ, Π) ≤ q² log q + 6q log q.

Take q = log p and Ψ from the above Lemma:

Theorem: For every ε > 0, sufficiently large p, and t = Ω(p^{2+ε}), algorithm AWT(q) performs work O(t).
Message-Passing - model and goals
We consider the following model:
- p asynchronous processors with PIDs in {0,…,p-1}
- processors communicate by message passing
- in one local step each processor can send a message to any subset of processors
- messages incur delays between send and receive
- processing of all received messages can be done during one local step

Goal: understand the impact of message delay on the efficiency of algorithmic solutions for Do-All.
Lower bound - randomized algorithms

Theorem: Any randomized algorithm solving DA with t tasks using p asynchronous message-passing processors performs expected work
Ω(t + pd·log_{d+1} t)
against any d-adversary.

Proof (sketch):
The adversary partitions the computation into stages, each containing d time units, and constructs the delay pattern stage after stage:
- it delays all messages sent in a stage to be received at the end of the stage
- it delays a linear number of processors (those that want to perform more than a (1-1/(3d)) fraction of the undone tasks) during the stage
- the selection is on-line and has the desired properties with high probability
Simulating shared-memory algorithms

Write-All algorithm AWT:
- Anderson, R.J., Woll, H.: Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
Quorum systems & atomic memory services:
- Attiya, H., Bar-Noy, A., Dolev, D.: Sharing memory robustly in message passing systems. J. of the ACM, 42 (1996)
- Lynch, N., Shvartsman, A.: RAMBO: A Reconfigurable Atomic Memory Service. Proc. of 16th DISC (2002)
Emulating asynchronous shared-memory algorithms:
- Momenzadeh, M.: Emulating shared-memory Do-All in asynchronous message passing systems. Masters Thesis, CSE, University of Connecticut (2003)
Atomic memory is not required
We use q-ary progress trees as the main data structure that is "written" and "read" -- note that atomicity is not required.
If the following two writes occur (each writing the entire tree), then a subsequent read may obtain a third value that was never written:

write (0; 1, 0)    write (0; 0, 1)    read (0; 1, 1)

Property of monotone progress:
- a 1 at a tree node i indicates that all tasks attached to the leaves of the subtree rooted at i have been performed
- if 1 is written at a node i in the progress tree of a processor, it remains 1 forever
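Monotone progress is why the mixed read above is harmless, and it can be sketched in a few lines of Python (the array representation of the tree is an assumption for illustration):

```python
# Sketch: monotone merging of progress trees, represented as flat arrays of
# 0/1 completion bits. A bit only ever goes 0 -> 1, so merging by bitwise OR
# is safe even when a "read" returns a mixture of different writes.

def merge_trees(local, received):
    """Pointwise OR of two progress trees; returns the merged tree."""
    return [a | b for a, b in zip(local, received)]

# The "third value" from the slide: two writers publish different trees,
# and a reader that mixes them still sees only valid (monotone) progress.
w1 = [0, 1, 0]
w2 = [0, 0, 1]
mixed = merge_trees(w1, w2)   # a tree never written by either processor
```

The merged tree may never have been written by any single processor, but every 1 it contains was legitimately set by someone, so it never overstates progress.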
Algorithm DAq - traverse the progress tree
Instead of using shared memory, processors broadcast their progress trees as soon as local progress is recorded.

Example: p = t = 9, q = 3
- Ψ: list of 3 schedules from S₃
- T: ternary tree of 9 leaves (progress tree), with 0-1 values
- PID(j): j-th digit of the ternary representation of PID

(figure: the same ternary progress tree as for AWT, leaves 4-12 holding the 9 tasks; processors with PID = 0,3,6 / 1,4,7 / 2,5,8 choose permutations by their ternary digits, e.g. 7 = 21₃)
Algorithm DAq - case p ≥ t
Procedure DOWORK
Algorithm DAq - analysis

Modification of algorithm DAq for p < t: we partition the t tasks into p jobs of size t/p and let algorithm DAq work with these jobs. It takes a processor O(t/p) work (instead of constant) to process such a job (job unit).

Since in each step a processor broadcasts at most one message to the p-1 other processors, we obtain:

Theorem 4: For any constant ε > 0 there is a constant q such that algorithm DAq has work
W(p,t,d) = O(t·p^ε + p·d·⌈t/d⌉^ε)
and message complexity
O(p · W(p,t,d))
against any d-adversary (d = o(t)).
Permutation algorithms - case p ≤ t
Algorithms proceed in a loop:
- select the next task using an ORDER+SELECT rule
- perform the selected task
- send messages, receive messages, and update state

ORDER+SELECT rules:
- PARAN1: initially processor PID permutes the tasks randomly; PID selects the first task remaining on its schedule
- PARAN2: no initial order; PID selects a task from the remaining set randomly
- PADET: initially processor PID chooses schedule π_PID in Ψ; PID selects the first task remaining on schedule π_PID

(Ψ: a list of p schedules from S_t)
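The loop above can be sketched as follows. This is an illustrative PADET-style simulation under simplifying assumptions not in the talk: knowledge of completed tasks is global (i.e. delay d = 0), and the adversary only chooses the interleaving of steps.

```python
# Sketch of the PADET ORDER+SELECT rule: each processor follows its own
# schedule (a permutation of the t tasks) and repeatedly performs the first
# task it does not yet know to be done. Names are illustrative.

def padet_run(schedules, interleaving, t):
    """
    schedules[pid]: permutation of range(t); interleaving: step order (pids).
    Knowledge here is global (instant broadcast, delay d = 0).
    Returns the total number of task executions (work in local steps).
    """
    done = set()
    steps = 0
    for pid in interleaving:
        if len(done) == t:
            break                        # all tasks performed
        for task in schedules[pid]:      # first remaining task on schedule
            if task not in done:
                done.add(task)
                steps += 1
                break
    return steps
```

With d = 0 every execution is primary, so the work equals t; with positive delay d, stale knowledge causes duplicated executions, which is exactly what the d-contention of the schedule list bounds.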
d-Contention of permutations

We introduce the notion of d-Contention:
- i is a d-lrm in σ if |{ j < i : σ(i) < σ(j) }| < d
- LRM_d(σ): the number of d-lrm's in σ
- Cont_d(σ, π) = LRM_d(σ⁻¹ ∘ π)
- d-Contention of Ψ: Cont_d(Ψ) = max_π Σ_{σ∈Ψ} Cont_d(σ, π)

Theorem: For sufficiently large p and n, there is a list Ψ of p permutations from Sₙ such that, for every integer d > 1,
Cont_d(Ψ) ≤ n log n + 5pd ln(e + n/d).
Moreover, a random Ψ is good with high probability.

(figure: the example permutation 10 3 5 2 4 6 1 9 7 8 11 with its d-lrm's for d = 2 marked)
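The d-lrm count generalizes the plain lrm count (which is the case d = 1), and is easy to sketch in Python:

```python
# Sketch of the d-lrm count: position i is a d-lrm of sigma if fewer than d
# earlier positions hold a larger value. For d = 1 this is the usual number
# of left-to-right maxima.

def lrm_d_count(sigma, d):
    count = 0
    for i, v in enumerate(sigma):
        bigger = sum(1 for u in sigma[:i] if u > v)  # earlier larger values
        if bigger < d:
            count += 1
    return count
```

For instance, in [0, 2, 1, 3] the position holding 1 is not an ordinary lrm (2 precedes it) but is a 2-lrm, since only one earlier value exceeds it.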
d-Contention and work
Lemma: For algorithms PADET and PARAN1, the respective worst-case work and expected work are at most
Cont_d(Ψ)
against any d-adversary.

Example: p = 2, t = 11, d = 2
- schedule of processor 1: 1 3 2 5 7 4 9 8 6 11 10
- schedule of processor 2: 2 4 6 8 10 11 9 7 5 3 1
- order in which the tasks get performed: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11

(figure: the two schedules interleaved step by step, marking which task executions are primary)
Permutation algorithms - results

Theorem: Randomized algorithms PARAN1 and PARAN2 perform expected work
O(t log p + pd log(t/d))
and have expected communication
O(tp log p + p²d log(t/d))
against any d-adversary (d = o(t)).

Corollary: There exists a deterministic list Ψ of schedules such that algorithm PADET performs work
O(t log p + p·min{t,d}·log(2 + t/d))
and has communication
O(tp log p + p²·min{t,d}·log(2 + t/d))
when p ≤ t.
Conclusions and open problems
- Work-optimal Write-All algorithm for t = Ω(p^{2+ε})
- First message-delay-sensitive analysis of the Do-All problem for asynchronous processors in the message-passing model:
  - lower bounds for deterministic and randomized algorithms
  - deterministic and randomized algorithms with subquadratic (in p and t) work for any message delay d as long as d = o(t)
- Among the interesting open questions are:
  - is there work-optimal scheduling for t = Ω(p log p)?
  - for algorithm PADET: how to construct the list Ψ of permutations efficiently?
  - closing the gap between the upper and the lower bounds
  - investigating algorithms that simultaneously control work and message complexity