ParallelAlgorithms-Ranade
Transcript of ParallelAlgorithms-Ranade
-
8/8/2019 ParallelAlgorithms-Ranade
1/29
Abhiram G. Ranade
Dept of CSE, IIT Bombay
-
Availability of very powerful parallel computers, e.g. CDAC PARAM Yuva
Need to solve large problems
Multicore desktop machines
Inexpensive GPUs
FPGA coprocessors
-
Network of processors:
Local computation: one operation/step/processor
Communication with d neighbours: b words / L steps / processor
Shared Memory:
Local computation: one operation/step/processor
Access to shared memory: b words / L steps / processor
Fine grain: small L, large b. Else coarse.
-
Maximize Speedup = T1 / Tp
T1 = time using the best sequential algorithm
Tp = time using the parallel algorithm on p processors
Ideally speedup = p. Usually not possible.
-
General strategy for designing parallel algorithms
Brief Case Studies
Summary of main themes
-
Not necessarily in order:
Design a sequential algorithm: make sure you know how to solve the problem!
Identify parallelism: sometimes obvious
Assign available work to processors: balance load
Minimize communication
-
Matrix multiplication
Prefix
Sorting
Sparse matrix multiplication
N-body problems
Parallel Search
-
For i = 1..N
  For j = 1..N
    C[i,j] = 0
    For k = 1..N
      C[i,j] += A[i,k] * B[k,j]
Each C[i,j] can be computed in parallel.
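The independence of the C[i,j] computations can be sketched in Python (an illustrative helper named matmul_parallel, not from the slides), farming each dot product out to a worker pool:

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_parallel(A, B):
    """Each C[i][j] is an independent dot product, so all N*N of them
    can be submitted to a pool of workers in any order."""
    n = len(A)

    def cell(i, j):
        # dot product of row i of A with column j of B
        return sum(A[i][k] * B[k][j] for k in range(n))

    with ThreadPoolExecutor() as pool:
        futures = {(i, j): pool.submit(cell, i, j)
                   for i in range(n) for j in range(n)}
    # the with-block waits for all workers, so every result is ready here
    C = [[0] * n for _ in range(n)]
    for (i, j), f in futures.items():
        C[i][j] = f.result()
    return C
```

In CPython, threads illustrate the task structure rather than true speedup; a real implementation would use processes or a distributed runtime.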
-
[Figure: entries of A and B streamed in staggered order into a 3x3 grid of processors (O--O--O); the indicated grid processor computes C[3,3].]
-
Implementation needs fine granularity.
If your computer has coarse granularity: treat each element (i,j) as a q x q submatrix.
Amount of data input / step: 2 x q x q
Amount of computation / step: q^3
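A sequential Python sketch of the blocking idea (the name blocked_matmul and the pure-Python loops are illustrative): each block-step consumes 2 x q x q words of input but performs on the order of q^3 multiply-adds, which is exactly the coarse-grain ratio quoted above.

```python
def blocked_matmul(A, B, q):
    """Multiply N x N matrices by treating each 'element' as a q x q block.
    Per block-step a processor reads 2*q*q words (one block of A, one of B)
    and performs about q**3 multiply-adds, so larger q raises the
    compute-to-communication ratio (coarser granularity)."""
    n = len(A)
    assert n % q == 0
    C = [[0] * n for _ in range(n)]
    for bi in range(0, n, q):
        for bj in range(0, n, q):
            for bk in range(0, n, q):
                # one "step": block (bi,bk) of A times block (bk,bj) of B
                for i in range(bi, bi + q):
                    for j in range(bj, bj + q):
                        s = 0
                        for k in range(bk, bk + q):
                            s += A[i][k] * B[k][j]
                        C[i][j] += s
    return C
```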
-
If your network has another topology, embed the grid network into your topology: map grid vertices to your processors, map grid edges to paths in your network. Your network simulates the grid.
The algorithm is self-synchronizing: processors wait for data.
Data can be on disks in the network; how to distribute it is an important question.
-
Input: A[1..N], Output: B[1..N]
B[1] = A[1]
For j = 2 to N
  B[j] = B[j-1] + A[j]  // + : any associative op
Model of recurrences, carry lookahead, matching work to processors; an algorithmic primitive in sorting and N-body computation.
Not parallel? The jth iteration needs the result of the (j-1)th.
-
Construct C[1..N/2], C[j] = A[2j-1] + A[2j]
Recursively solve. Answer D[1..N/2]
D[j] = C[1] + C[2] + ... + C[j]
     = A[1] + A[2] + ... + A[2j] = B[2j]
B[2j-1] = D[j] - A[2j]
Tree implementation: A fed at leaves, C computed at parents of leaves, B arrives back at leaves.
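The recursion above can be written out directly (a 0-indexed Python sketch; prefix is an illustrative name, and it assumes the length is a power of two and that the operation has an inverse, matching the slide's B[2j-1] = D[j] - A[2j] step):

```python
def prefix(A):
    """Recursive prefix sum as on the slide: pair up elements into C,
    recurse to get D (prefix sums of C), then read off even positions
    as B[2j] = D[j] and odd positions as B[2j-1] = D[j] - A[2j]
    (1-based indexing in the comments, 0-based in the code).
    Assumes len(A) is a power of two and '+' has an inverse."""
    n = len(A)
    if n == 1:
        return A[:]
    C = [A[2 * j] + A[2 * j + 1] for j in range(n // 2)]  # pairwise sums
    D = prefix(C)                                         # half-size recursion
    B = [0] * n
    for j in range(n // 2):
        B[2 * j + 1] = D[j]               # B[2j]   in 1-based terms
        B[2 * j] = D[j] - A[2 * j + 1]    # B[2j-1] in 1-based terms
    return B
```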
-
Fine-grained algorithm.
For use on coarse-grained networks, embed big subtrees on one processor.
Not necessary to have complete binary trees.
No need for a '-' (inverse) operation: B[2j-1] can instead be computed as B[2j-2] + A[2j-1].
Algorithm can wait for data; self-synchronizing.
-
As much work, and as many clever algorithms, as in sequential sorting.
Merge paradigm: each processor sorts its data locally; then all sublists are merged. Merging can happen in parallel.
-
Bucket sort paradigm:
Assign one bucket to each processor.
Use sampling to determine ranges.
Each processor sends its keys to correct bucket.
Each processor locally sorts the keys it receives.
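The four steps above can be sketched in a single process (the name sample_sort, the oversampling factor of 4, and the dict-free routing are illustrative choices, not the slides' specification):

```python
import random
import bisect

def sample_sort(keys, p):
    """Bucket-sort paradigm for p 'processors': sample keys to pick p-1
    splitters, route every key to the bucket owning its range, then sort
    each bucket locally; concatenating the buckets yields a sorted list."""
    # Step 1-2: sample to determine bucket ranges (oversample for balance)
    sample = sorted(random.sample(keys, min(len(keys), 4 * p)))
    splitters = [sample[i * len(sample) // p] for i in range(1, p)]
    # Step 3: each key goes to the bucket whose range contains it
    buckets = [[] for _ in range(p)]
    for k in keys:
        buckets[bisect.bisect_right(splitters, k)].append(k)
    # Step 4: each 'processor' sorts its bucket locally
    out = []
    for b in buckets:
        out.extend(sorted(b))
    return out
```

Because the splitters partition the key range, the sorted buckets concatenate into a globally sorted list with no final merge.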
-
Manage communication properly: sorting is communication intensive.
Pack many keys into a single message; do not send one message per key.
Sorting is an important primitive; pay great attention to the communication network.
Similar issues arise in database join operations.
-
Key operation in many numerical codes, e.g. the finite element method.
Invoked repeatedly, e.g. in solving linear systems and differential equations.
Dense matrix-vector multiplication is easy -- similar to matrix-matrix multiplication.
-
Graph: derived from the problem, e.g. a finite element mesh graph
Vector: V[j] present at vertex j of the graph
Matrix: A[j,k] present on edge (j,k) of the graph
Multiplication:
Each vertex sends its data on all its edges
As the data moves along an edge, it is multiplied by the coefficient
The products are added as they arrive at the vertices
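This dataflow view translates almost line-for-line into code (an illustrative sketch where edges is assumed to be a dict mapping (j, k) to the coefficient A[j,k]):

```python
def spmv_on_graph(edges, V):
    """Sparse matrix-vector product y = A*V expressed as the dataflow
    above: A[j,k] lives on edge (j,k); vertex k sends V[k] along the
    edge, the coefficient multiplies it in flight, and the product is
    accumulated at the receiving vertex j."""
    y = [0] * len(V)
    for (j, k), a in edges.items():
        y[j] += a * V[k]   # product arrives at vertex j and is summed
    return y
```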
-
Partition the graph among the available processors such that:
Each processor gets an equal number of edges (load balance)
Most edges have both endpoints on the same processor
(minimize communication)
Classic, hard problem
-
Many heuristics known.
Spectral Methods
Multi-level methods based on graph coarsening, e.g. METIS
Good partitioning is possible for well-shaped meshes arising out of finite element methods.
-
Input: positions of n stars
Output: force on each star due to all the others.
Naive algorithm: O(n^2); for each star, calculate the force due to every other star.
Fast Multipole Method: O(n). Based on clustering stars and considering cluster-cluster + star-cluster + star-star interactions.
-
Star data resides at the leaves of an octree.
Cluster data resides at internal nodes, computed by passing data along tree edges.
Cluster-cluster interaction: data flows between neighbours at the same level of the tree.
Dataflow is pyramidal. It is possible to nicely embed pyramids in most networks.
More complexity because the pyramid might have a variable number of levels if stars are distributed unevenly.
-
Find a combinatorial object satisfying given properties, e.g. packing, TSP, ...
Naturally parallel, but unstructured growth of the search tree.
Difficulty in load balancing.
Key question: can we maintain a distributed queue?
Also: the best solution found so far? Bounds?
-
Prefix-like computations are useful in maintaining a distributed queue.
Randomness is also useful: each processor asks for work from a random processor when its work queue runs out.
Broadcasting is required to maintain the best solution, bounds, ...
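A toy synchronous sketch of the random work-request idea (the name steal_round and the take-half policy are illustrative assumptions; a real scheduler works asynchronously with messages):

```python
import random

def steal_round(queues, rng):
    """One synchronous round of randomized load balancing: every
    processor whose work queue is empty asks a uniformly random
    processor for work and, if that victim has more than one task,
    takes half of the victim's queue."""
    p = len(queues)
    for i in range(p):
        if not queues[i]:
            victim = rng.randrange(p)
            if victim != i and len(queues[victim]) > 1:
                half = len(queues[victim]) // 2
                queues[i] = queues[victim][:half]      # steal half the tasks
                queues[victim] = queues[victim][half:]
    return queues
```

Randomized victim selection needs no central coordinator, and no round can lose or duplicate tasks.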
-
Graph embedding: develop an algorithm for one network, then embed that network on what you actually have.
Form a graph of dataflow/data dependence, then embed that in the network you have.
Solve the problem locally, merge results globally: sorting.
Matrix multiplication: locally implement sub-block multiplication. Useful in graph algorithms.
-
Randomization: useful in load balancing
Also used in communication: select paths randomly
Symmetry breaking
Sampling
Co-ordination: prefix is useful
Distributed data structures, e.g. queues, hash tables
Communication patterns: all-to-all (as in sorting)
Permutation
Broadcasting
-
Vast subject
Many clever algorithms
Many models of computation
Many issues besides algorithms: how to express them in a programming language, ...
Quick tour of some of the important ideas.