Dariusz Kowalski
University of Connecticut & Warsaw University
joint work with
Alex Shvartsman
University of Connecticut & MIT
Performing Tasks in
Asynchronous Environments
Performing Work with Asynchronous Processors 2
Do-All problem [DHW et al.]
The DA(p,t) problem abstracts the basic problem of cooperation in a distributed setting: p processors must perform t tasks, and at least one processor must learn about it.
[Dwork Halpern Waarts 92/98]
Tasks are:
- known to every processor
- similar: each takes a similar number of local steps
- independent: may be performed in any order
- idempotent: may be performed concurrently
Do-All: synchronous model with crashes
Model: processors are synchronous, may fail by crashes
Solutions: problem well understood, results close to optimal
Shared-memory model -- communication by read/write:
- Kanellakis, P.C., Shvartsman, A.A.: Fault-tolerant parallel computation. Kluwer Academic Publishers (1997)
Message-passing model -- communication by exchanging messages:
- Dwork, C., Halpern, J., Waarts, O.: Performing work efficiently in the presence of faults. SIAM Journal on Computing, 27 (1998)
- De Prisco, R., Mayer, A., Yung, M.: Time-optimal message-efficient work performance in the presence of faults. Proc. of 13th PODC (1994)
- Chlebus, B., De Prisco, R., Shvartsman, A.A.: Performing tasks on synchronous restartable message-passing processors. Distributed Computing, 14 (2001)
Do-All: asynchronous models
Models:
Shared-memory model -- communication by read/write -- widely studied, but solutions far from optimal:
- Kanellakis, P.C., Shvartsman, A.A.: Fault-tolerant parallel computation. Kluwer Academic Publishers (1997)
- Anderson, R.J., Woll, H.: Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
- Kedem, Z., Palem, K., Raghunathan, A., Spirakis, P.: Combining tentative and definite executions for very fast dependable parallel computing. Proc. of 23rd STOC (1991)
Message-passing model -- communication by exchanging messages -- no interesting solutions until recently
Shared-Memory vs. Message-Passing

Shared-Memory (atomic registers):
- processors communicate by read/write in shared memory
- atomicity guarantees that a read outputs the last written value
- one read/write operation per local clock cycle
- information propagates, and information is persistent
Hence cooperation is always possible, although delayed. Here processor scheduling is the major challenge.

Message-Passing:
- processors communicate by exchanging messages
- the duration of a local step may be unbounded
- message delays may be unbounded
- information may not propagate -- send/receive depend on delays
Message-delay-sensitive approach
Even if message delays are bounded by d (a d-adversary), cooperation may be difficult.

Observation:
If d = Ω(t), then work must be Ω(t·p).

This means that cooperation is difficult, and addressing scheduling alone is not enough: algorithm design and analysis must be d-sensitive.
Dwork, C., Lynch, N., Stockmeyer, L.: Consensus in the presence of partial synchrony. J. of the ACM, 35 (1988)
Measures of efficiency
Termination time: the first time when all tasks are done and at least one processor knows about it. Used only to define work and message complexity.
Not interesting on its own: if all processors but one are delayed, then trivially time is Ω(t).
Work: the sum, over all processors, of the number of local steps taken until termination time.
Message complexity (message-passing model): the number of all point-to-point messages sent until termination time.
Structure of the presentation
Part 1: Shared-memory model
- Model and bibliography
- Improving the AW algorithm in shared memory by better processor scheduling (task load-balancing)

Part 2: Message-passing model
- Model: asynchrony, message delay, and modeling issues
- Delay-sensitive lower bounds for Do-All
- Progress-tree Do-All algorithms: simulating shared memory and Anderson-Woll (AW); asynchronous message-passing progress-tree algorithm
- Permutation Do-All algorithms
Shared-Memory - model and goal
We consider the following model:
- p asynchronous processors with PIDs in {0,…,p-1}
- processors communicate by read/write in shared memory
- atomicity: a read outputs the last written value
- one read/write operation per local clock cycle

Write-All: write 1's into the t locations of a given array.

Goal: improve the scheduling of cooperating asynchronous processors, leading to better load-balancing with respect to tasks.
Write-All: Selected Bibliography
Introducing the Write-All problem:
- Kanellakis, P.C., Shvartsman, A.A.: Efficient parallel algorithms can be made robust. PODC (1989); Distributed Computing (1992)
AW algorithm with work O(t·p^ε):
- Anderson, R.J., Woll, H.: Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
Randomized algorithm with expected work O(t + p log p):
- Martel, C., Subramonian, R.: On the complexity of Certified Write-All algorithms. J. Algorithms 16 (1994)
First work-optimal deterministic algorithm for t = Ω(p⁴ log p):
- Malewicz, G.: A work-optimal deterministic algorithm for the asynchronous Certified Write-All problem. PODC (2003)
Progress tree algorithms [BKRS, AW]

• Shared memory
• p processors, t tasks (p = t)
• q permutations of [q]
• q-ary progress tree of depth log_q p
• nodes are binary completion bits
• permutations establish the order in which the children are visited
• p processors traverse the tree and use the q-ary expansion of their PID to choose permutations [Anderson Woll]

(figure: a q-ary tree; each internal node has children 1, 2, 3, …, q)
Algorithm AWT [Anderson Woll]
The progress tree data structure is stored in shared memory.

Example: p = t = 9, q = 3
- Ψ: list of 3 schedules from S₃
- T: ternary tree of 9 leaves (progress tree), with 0-1 values
- PID(j): j-th digit of the ternary representation of PID

(figure: the ternary progress tree with leaves 4-12 holding the 9 tasks; processors with PID = 0,3,6 use digit 0, PID = 1,4,7 digit 1, PID = 2,5,8 digit 2 to pick permutations; e.g. 7 = 21₃)
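The traversal can be sketched in Python as follows. This is an illustrative simplification, not the talk's own pseudocode: real AWT also stores completion bits at internal nodes and skips subtrees already marked done, which this single-processor sketch omits.

```python
# Sketch of an AWT-style q-ary progress-tree traversal (illustrative names).
# `perms` is a list of q permutations of [0..q-1]; `done` marks finished leaves.

def digits(pid, q, depth):
    """q-ary digits of pid, most significant first, padded to `depth`."""
    ds = []
    for _ in range(depth):
        ds.append(pid % q)
        pid //= q
    return ds[::-1]

def traverse(node, level, pid, q, depth, perms, done, do_task):
    if level == depth:                  # leaf: one task
        if not done[node]:
            do_task(node)
            done[node] = True
        return
    order = perms[digits(pid, q, depth)[level]]
    for child in order:                 # visit children in permuted order
        traverse(node * q + child, level + 1, pid, q, depth,
                 perms, done, do_task)

# usage matching the slide's parameters: q = 3, depth = 2 -> 9 leaf tasks,
# processor PID = 7 (= 21 in ternary)
q, depth = 3, 2
done = [False] * (q ** depth)
performed = []
traverse(0, 0, 7, q, depth, [[0, 1, 2], [1, 2, 0], [2, 0, 1]],
         done, performed.append)
```

A lone processor visits every leaf, so all 9 tasks end up performed; with several processors the permuted visiting orders are what limits duplicated work.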
Contention of permutations
- Sₙ: the group of all permutations of the set [n], with composition and identity idₙ
- σ, π: permutations in Sₙ
- Ψ: a set of q permutations from Sₙ
- i is an lrm (left-to-right maximum) in σ if σ(i) > max_{j<i} σ(j)
- LRM(σ): the number of lrm's in σ [Knuth]
- Cont(σ, π) = LRM(σ⁻¹ ∘ π)
- Contention of Ψ: Cont(Ψ) = max_π Σ_{σ∈Ψ} Cont(σ, π) [AW]

Theorem [AW]: For any n > 0 there exists a set Ψ of n permutations from Sₙ with Cont(Ψ) ≤ 3nHₙ = O(n log n).

[Knuth] Knuth, D.E.: The Art of Computer Programming, Vol. 3 (third edition). Addison-Wesley (1998)

(figure: example permutation 10 3 5 2 4 6 1 9 7 8 11 with its left-to-right maxima marked)
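A small Python sketch of these definitions may help; the composition order inside Cont follows the reconstruction above and could differ by convention in the original slides.

```python
# Sketch of the [AW] contention definitions. Permutations are sequences
# over 0..n-1; Cont(sigma, pi) is taken as LRM(sigma^{-1} o pi).

def lrm_count(sigma):
    """Number of left-to-right maxima of permutation sigma."""
    best, count = -1, 0
    for v in sigma:
        if v > best:
            best, count = v, count + 1
    return count

def inverse(sigma):
    """Inverse permutation: inverse(sigma)[sigma[i]] == i."""
    inv = [0] * len(sigma)
    for i, v in enumerate(sigma):
        inv[v] = i
    return inv

def cont(sigma, pi):
    """Cont(sigma, pi) = LRM(sigma^{-1} composed with pi)."""
    inv = inverse(sigma)
    return lrm_count([inv[pi[i]] for i in range(len(pi))])
```

For example, the identity permutation has n left-to-right maxima, and `cont(sigma, sigma)` equals n since the composition collapses to the identity.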
Procedure "Oblivious Do"
- n: number of jobs and of units (processors)
- Ψ: a list of n schedules from Sₙ

Procedure Oblivious(Ψ):
  forall processors PID = 0 to n-1
    for i = 1 to n do
      perform Job(σ_PID(i))

An execution of Job(σ_PID(i)) by processor PID is primary if job σ_PID(i) has not been previously performed.

Lemma [AW]: In algorithm Oblivious with n units, n jobs, and using the list Ψ of n permutations from Sₙ, the number of primary job executions is at most Cont(Ψ).
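The lemma can be explored with a small simulation. This is an illustrative sketch (names and the step-interleaving model are assumptions, not from the talk): the adversary picks an arbitrary order of processor steps, and we count how many job executions are primary.

```python
# Sketch: the Oblivious procedure with n units and n jobs, executed under
# an arbitrary interleaving of processor steps, counting primary executions.

def oblivious(schedules, interleaving):
    """
    schedules[pid] is a permutation of range(n) (processor pid's job order);
    interleaving is a sequence of pids giving the adversarial step order.
    Returns the number of primary job executions.
    """
    n = len(schedules)
    pos = [0] * n            # next index in each processor's schedule
    done = set()
    primary = 0
    for pid in interleaving:
        if pos[pid] < n:
            job = schedules[pid][pos[pid]]
            if job not in done:    # performed for the first time: primary
                done.add(job)
                primary += 1
            pos[pid] += 1          # oblivious: the job is executed regardless
    return primary
```

With two processors using reversed schedules and a round-robin interleaving, only the first two executions are primary; trying different schedule lists shows how the count varies with (and is bounded by) the contention of the list.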
AWT(q) - new progress tree traversal algorithm
Instead of using q permutations on the set [q], we use q permutations on the set [n], where n = q² log q.

Example: p = 6, t = 16, q = 2, n = 4
- Ψ: list of 2 schedules from S₄
- T: 4-ary tree of 16 leaves (progress tree), with 0-1 values
- PID(j): j-th digit of the q-ary (here binary) representation of PID; even PIDs use schedule 0, odd PIDs use schedule 1

(figure: the 4-ary progress tree with leaves 5-20 holding the 16 tasks; the two schedules from S₄ give the orders in which children are visited)
Main result
Set n = q² log q and let Ψ be a list of q schedules from Sₙ.
Define Cont(Ψ, Π) = max_{π∈Π} Cont(Ψ, π).

Lemma: For sufficiently large q and any set Π of at most exp(q² log² q) permutations of the set [q² log q], there is a list Ψ of q schedules from Sₙ such that
Cont(Ψ, Π) ≤ q² log q + 6q log q.

Take q = log p and Ψ from the above Lemma:

Theorem: For every ε > 0, sufficiently large p, and t = Ω(p^{2+ε}), algorithm AWT(q) performs work O(t).
Message-Passing - model and goals
We consider the following model:
- p asynchronous processors with PIDs in {0,…,p-1}
- processors communicate by message passing
- in one local step each processor can send a message to any subset of processors
- messages incur delays between send and receive
- processing of all received messages can be done during one local step

Goal: understand the impact of message delay on the efficiency of algorithmic solutions for Do-All.
Lower bound - randomized algorithms

Theorem: Any randomized algorithm solving DA with t tasks using p asynchronous message-passing processors performs expected work
Ω(t + pd·log_{d+1} t)
against any d-adversary.

Proof (sketch):
The adversary partitions the computation into stages, each containing d time units, and constructs the delay pattern stage after stage:
- it delays all messages sent in a stage to be received at the end of the stage
- it delays a linear number of processors (those that want to perform more than a (1-1/(3d)) fraction of the undone tasks) during the stage
- the selection is on-line and has the desired properties with high probability
Simulating shared-memory algorithms

Write-All algorithm AWT:
- Anderson, R.J., Woll, H.: Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
Quorum systems & atomic memory services:
- Attiya, H., Bar-Noy, A., Dolev, D.: Sharing memory robustly in message passing systems. J. of the ACM, 42 (1996)
- Lynch, N., Shvartsman, A.: RAMBO: A Reconfigurable Atomic Memory Service. Proc. of 16th DISC (2002)
Emulating asynchronous shared-memory algorithms:
- Momenzadeh, M.: Emulating shared-memory Do-All in asynchronous message passing systems. Masters Thesis, CSE, University of Connecticut (2003)
Atomic memory is not required
We use q-ary progress trees as the main data structure that is "written" and "read" -- note that atomicity is not required.
If the following two writes occur (each writing the entire tree), then a subsequent read may obtain a third value that was never written:

write (0; 1, 0)    write (0; 0, 1)    read (0; 1, 1)

Property of monotone progress:
- a 1 at a tree node i indicates that all tasks attached to the leaves of the subtree rooted at i have been performed
- if 1 is written at a node i in the progress tree of a processor, it remains 1 forever
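Monotone progress is why the mixed read above is harmless, and it can be sketched in a few lines of Python (the array representation of the tree is an assumption for illustration):

```python
# Sketch: monotone merging of progress trees, represented as flat arrays of
# 0/1 completion bits. A bit only ever goes 0 -> 1, so merging by bitwise OR
# is safe even when a "read" returns a mixture of different writes.

def merge_trees(local, received):
    """Pointwise OR of two progress trees; returns the merged tree."""
    return [a | b for a, b in zip(local, received)]

# The "third value" from the slide: two writers publish different trees,
# and a reader that mixes them still sees only valid (monotone) progress.
w1 = [0, 1, 0]
w2 = [0, 0, 1]
mixed = merge_trees(w1, w2)   # a tree never written by either processor
```

The merged tree may never have been written by any single processor, but every 1 it contains was legitimately set by someone, so it never overstates progress.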
Algorithm DAq - traverse the progress tree
Instead of using shared memory, processors broadcast their progress trees as soon as local progress is recorded.

Example: p = t = 9, q = 3
- Ψ: list of 3 schedules from S₃
- T: ternary tree of 9 leaves (progress tree), with 0-1 values
- PID(j): j-th digit of the ternary representation of PID

(figure: the same ternary progress tree as for AWT, leaves 4-12 holding the 9 tasks; processors with PID = 0,3,6 / 1,4,7 / 2,5,8 choose permutations by their ternary digits, e.g. 7 = 21₃)
Algorithm DAq - case p ≥ t
Procedure DOWORK
Algorithm DAq - analysis

Modification of algorithm DAq for p < t: we partition the t tasks into p jobs of size t/p and let algorithm DAq work with these jobs. It takes a processor O(t/p) work (instead of constant) to process such a job (job unit).

Since in each step a processor broadcasts at most one message to the p-1 other processors, we obtain:

Theorem 4: For any constant ε > 0 there is a constant q such that algorithm DAq has work
W(p,t,d) = O(t·p^ε + p·d·⌈t/d⌉^ε)
and message complexity
O(p · W(p,t,d))
against any d-adversary (d = o(t)).
Permutation algorithms - case p ≤ t
Algorithms proceed in a loop:
- select the next task using an ORDER+SELECT rule
- perform the selected task
- send messages, receive messages, and update state

ORDER+SELECT rules:
- PARAN1: initially processor PID permutes the tasks randomly; PID selects the first task remaining on its schedule
- PARAN2: no initial order; PID selects a task from the remaining set randomly
- PADET: initially processor PID chooses schedule π_PID in Ψ; PID selects the first task remaining on schedule π_PID

(Ψ: a list of p schedules from S_t)
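The loop above can be sketched as follows. This is an illustrative PADET-style simulation under simplifying assumptions not in the talk: knowledge of completed tasks is global (i.e. delay d = 0), and the adversary only chooses the interleaving of steps.

```python
# Sketch of the PADET ORDER+SELECT rule: each processor follows its own
# schedule (a permutation of the t tasks) and repeatedly performs the first
# task it does not yet know to be done. Names are illustrative.

def padet_run(schedules, interleaving, t):
    """
    schedules[pid]: permutation of range(t); interleaving: step order (pids).
    Knowledge here is global (instant broadcast, delay d = 0).
    Returns the total number of task executions (work in local steps).
    """
    done = set()
    steps = 0
    for pid in interleaving:
        if len(done) == t:
            break                        # all tasks performed
        for task in schedules[pid]:      # first remaining task on schedule
            if task not in done:
                done.add(task)
                steps += 1
                break
    return steps
```

With d = 0 every execution is primary, so the work equals t; with positive delay d, stale knowledge causes duplicated executions, which is exactly what the d-contention of the schedule list bounds.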
d-Contention of permutations

We introduce the notion of d-Contention:
- i is a d-lrm in σ if |{ j < i : σ(i) < σ(j) }| < d
- LRM_d(σ): the number of d-lrm's in σ
- Cont_d(σ, π) = LRM_d(σ⁻¹ ∘ π)
- d-Contention of Ψ: Cont_d(Ψ) = max_π Σ_{σ∈Ψ} Cont_d(σ, π)

Theorem: For sufficiently large p and n, there is a list Ψ of p permutations from Sₙ such that, for every integer d > 1,
Cont_d(Ψ) ≤ n log n + 5pd ln(e + n/d).
Moreover, a random Ψ is good with high probability.

(figure: the example permutation 10 3 5 2 4 6 1 9 7 8 11 with its d-lrm's for d = 2 marked)
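The d-lrm count generalizes the plain lrm count (which is the case d = 1), and is easy to sketch in Python:

```python
# Sketch of the d-lrm count: position i is a d-lrm of sigma if fewer than d
# earlier positions hold a larger value. For d = 1 this is the usual number
# of left-to-right maxima.

def lrm_d_count(sigma, d):
    count = 0
    for i, v in enumerate(sigma):
        bigger = sum(1 for u in sigma[:i] if u > v)  # earlier larger values
        if bigger < d:
            count += 1
    return count
```

For instance, in [0, 2, 1, 3] the position holding 1 is not an ordinary lrm (2 precedes it) but is a 2-lrm, since only one earlier value exceeds it.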
d-Contention and work
Lemma: For algorithms PADET and PARAN1, the respective worst-case work and expected work are at most
Cont_d(Ψ)
against any d-adversary.

Example: p = 2, t = 11, d = 2
- schedule of processor 1: 1 3 2 5 7 4 9 8 6 11 10
- schedule of processor 2: 2 4 6 8 10 11 9 7 5 3 1
- order in which the tasks get performed: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11

(figure: the two schedules interleaved step by step, marking which task executions are primary)
Permutation algorithms - results

Theorem: Randomized algorithms PARAN1 and PARAN2 perform expected work
O(t log p + pd log(t/d))
and have expected communication
O(tp log p + p²d log(t/d))
against any d-adversary (d = o(t)).

Corollary: There exists a deterministic list Ψ of schedules such that algorithm PADET performs work
O(t log p + p·min{t,d}·log(2 + t/d))
and has communication
O(tp log p + p²·min{t,d}·log(2 + t/d))
when p ≤ t.
Conclusions and open problems
- Work-optimal Write-All algorithm for t = Ω(p^{2+ε})
- First message-delay-sensitive analysis of the Do-All problem for asynchronous processors in the message-passing model:
  - lower bounds for deterministic and randomized algorithms
  - deterministic and randomized algorithms with subquadratic (in p and t) work for any message delay d as long as d = o(t)
- Among the interesting open questions are:
  - is there work-optimal scheduling for t = Ω(p log p)?
  - for algorithm PADET: how to construct the list Ψ of permutations efficiently?
  - closing the gap between the upper and the lower bounds
  - investigating algorithms that simultaneously control work and message complexity