ParallelAlgorithms-Ranade
Transcript of ParallelAlgorithms-Ranade
-
8/8/2019 ParallelAlgorithms-Ranade
1/29
Abhiram G. Ranade
Dept of CSE, IIT Bombay
-
Availability of very powerful parallel computers, e.g. CDAC PARAM Yuva
Need to solve large problems
Multicore desktop machines
Inexpensive GPUs
FPGA coprocessors
-
Network of processors:
Local computation: one operation/step/processor
Communication with d neighbours: b words / L steps / processor
Shared Memory:
Local computation: one operation/step/processor
Access to shared memory: b words / L steps / processor
Fine grain: small L, large b. Else coarse.
-
Maximize Speedup = T1 / Tp
T1 = time using the best sequential algorithm
Tp = time using the parallel algorithm on p processors
Ideally speedup = p. Usually not possible.
-
General strategy for designing parallel algorithms
Brief Case Studies
Summary of main themes
-
Not necessarily in order:
Design a sequential algorithm: make sure you know how to solve the problem!
Identify parallelism: sometimes obvious
Assign available work to processors: balance load
Minimize communication
-
Matrix multiplication
Prefix
Sorting
Sparse matrix multiplication
N-body problems
Parallel Search
-
For i = 1..N
  For j = 1..N
    C[i,j] = 0
    For k = 1..N
      C[i,j] += A[i,k] * B[k,j]
Each C[i,j] can be computed in parallel.
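The independence of the C[i,j] computations can be sketched in Python (an illustrative helper named matmul_parallel, not from the slides), farming each dot product out to a worker pool:

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_parallel(A, B):
    """Each C[i][j] is an independent dot product, so all N*N of them
    can be submitted to a pool of workers in any order."""
    n = len(A)

    def cell(i, j):
        # dot product of row i of A with column j of B
        return sum(A[i][k] * B[k][j] for k in range(n))

    with ThreadPoolExecutor() as pool:
        futures = {(i, j): pool.submit(cell, i, j)
                   for i in range(n) for j in range(n)}
    # the with-block waits for all workers, so every result is ready here
    C = [[0] * n for _ in range(n)]
    for (i, j), f in futures.items():
        C[i][j] = f.result()
    return C
```

In CPython, threads illustrate the task structure rather than true speedup; a real implementation would use processes or a distributed runtime.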
-
[Figure: entries of A and B streamed in staggered order into a 3x3 grid of processors (O--O--O); the indicated grid processor computes C[3,3].]
-
Implementation needs fine granularity.
If your computer has coarse granularity: treat each element (i,j) as a q x q submatrix.
Amount of data input / step: 2 x q x q
Amount of computation / step: q^3
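A sequential Python sketch of the blocking idea (the name blocked_matmul and the pure-Python loops are illustrative): each block-step consumes 2 x q x q words of input but performs on the order of q^3 multiply-adds, which is exactly the coarse-grain ratio quoted above.

```python
def blocked_matmul(A, B, q):
    """Multiply N x N matrices by treating each 'element' as a q x q block.
    Per block-step a processor reads 2*q*q words (one block of A, one of B)
    and performs about q**3 multiply-adds, so larger q raises the
    compute-to-communication ratio (coarser granularity)."""
    n = len(A)
    assert n % q == 0
    C = [[0] * n for _ in range(n)]
    for bi in range(0, n, q):
        for bj in range(0, n, q):
            for bk in range(0, n, q):
                # one "step": block (bi,bk) of A times block (bk,bj) of B
                for i in range(bi, bi + q):
                    for j in range(bj, bj + q):
                        s = 0
                        for k in range(bk, bk + q):
                            s += A[i][k] * B[k][j]
                        C[i][j] += s
    return C
```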
-
If your network has another topology, embed the grid network into your topology: map grid vertices to your processors, map grid edges to paths in your network. Your network simulates the grid.
The algorithm is self-synchronizing: processors wait for data.
Data can be on disks in the network; how to distribute it is an important question.
-
Input: A[1..N], Output: B[1..N]
B[1] = A[1]
For j = 2 to N
  B[j] = B[j-1] + A[j]  // + : any associative op
Model of recurrences, carry lookahead, matching work to processors; an algorithmic primitive in sorting and N-body computation.
Not parallel? The jth iteration needs the result of the (j-1)th.
-
Construct C[1..N/2], C[j] = A[2j-1] + A[2j]
Recursively solve. Answer D[1..N/2]
D[j] = C[1] + C[2] + ... + C[j]
     = A[1] + A[2] + ... + A[2j] = B[2j]
B[2j-1] = D[j] - A[2j]
Tree implementation: A fed at leaves, C computed at parents of leaves, B arrives back at leaves.
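The recursion above can be written out directly (a 0-indexed Python sketch; prefix is an illustrative name, and it assumes the length is a power of two and that the operation has an inverse, matching the slide's B[2j-1] = D[j] - A[2j] step):

```python
def prefix(A):
    """Recursive prefix sum as on the slide: pair up elements into C,
    recurse to get D (prefix sums of C), then read off even positions
    as B[2j] = D[j] and odd positions as B[2j-1] = D[j] - A[2j]
    (1-based indexing in the comments, 0-based in the code).
    Assumes len(A) is a power of two and '+' has an inverse."""
    n = len(A)
    if n == 1:
        return A[:]
    C = [A[2 * j] + A[2 * j + 1] for j in range(n // 2)]  # pairwise sums
    D = prefix(C)                                         # half-size recursion
    B = [0] * n
    for j in range(n // 2):
        B[2 * j + 1] = D[j]               # B[2j]   in 1-based terms
        B[2 * j] = D[j] - A[2 * j + 1]    # B[2j-1] in 1-based terms
    return B
```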
-
Fine-grained algorithm.
For use on coarse-grained networks, embed big subtrees on one processor.
Not necessary to have complete binary trees.
No need for a '-' (inverse) operation: B[2j-1] can instead be computed as B[2j-2] + A[2j-1].
Algorithm can wait for data; self-synchronizing.
-
As much work, and as many clever algorithms, as in sequential sorting.
Merge paradigm: each processor sorts its data locally; then all sublists are merged. Merging can happen in parallel.
-
Bucket sort paradigm:
Assign one bucket to each processor.
Use sampling to determine ranges.
Each processor sends its keys to correct bucket.
Each processor locally sorts the keys it receives.
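The four steps above can be sketched in a single process (the name sample_sort, the oversampling factor of 4, and the dict-free routing are illustrative choices, not the slides' specification):

```python
import random
import bisect

def sample_sort(keys, p):
    """Bucket-sort paradigm for p 'processors': sample keys to pick p-1
    splitters, route every key to the bucket owning its range, then sort
    each bucket locally; concatenating the buckets yields a sorted list."""
    # Step 1-2: sample to determine bucket ranges (oversample for balance)
    sample = sorted(random.sample(keys, min(len(keys), 4 * p)))
    splitters = [sample[i * len(sample) // p] for i in range(1, p)]
    # Step 3: each key goes to the bucket whose range contains it
    buckets = [[] for _ in range(p)]
    for k in keys:
        buckets[bisect.bisect_right(splitters, k)].append(k)
    # Step 4: each 'processor' sorts its bucket locally
    out = []
    for b in buckets:
        out.extend(sorted(b))
    return out
```

Because the splitters partition the key range, the sorted buckets concatenate into a globally sorted list with no final merge.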
-
Manage communication properly: sorting is communication intensive.
Pack many keys into a single message; do not send one message per key.
Sorting is an important primitive; pay great attention to the communication network.
Similar issues arise in database join operations.
-
Key operation in many numerical codes, e.g. the finite element method.
Invoked repeatedly, e.g. in solving linear systems and differential equations.
Dense matrix-vector multiplication is easy -- similar to matrix-matrix multiplication.
-
Graph: derived from the problem, e.g. a finite element mesh graph
Vector: V[j] present at vertex j of the graph
Matrix: A[j,k] present on edge (j,k) of the graph
Multiplication:
Each vertex sends its data on all its edges
As the data moves along an edge, it is multiplied by the coefficient
The products are added as they arrive at the vertices
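This dataflow view translates almost line-for-line into code (an illustrative sketch where edges is assumed to be a dict mapping (j, k) to the coefficient A[j,k]):

```python
def spmv_on_graph(edges, V):
    """Sparse matrix-vector product y = A*V expressed as the dataflow
    above: A[j,k] lives on edge (j,k); vertex k sends V[k] along the
    edge, the coefficient multiplies it in flight, and the product is
    accumulated at the receiving vertex j."""
    y = [0] * len(V)
    for (j, k), a in edges.items():
        y[j] += a * V[k]   # product arrives at vertex j and is summed
    return y
```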
-
Partition the graph among the available processors such that:
Each processor gets an equal number of edges (load balance)
Most edges have both endpoints on the same processor
(minimize communication)
Classic, hard problem
-
Many heuristics known.
Spectral Methods
Multi-level methods based on graph coarsening, e.g. METIS
Good partitioning is possible for well-shaped meshes arising out of finite element methods.
-
Input: positions of n stars
Output: force on each star due to all the others.
Naive algorithm: O(n^2); for each star, calculate the force due to every other star.
Fast Multipole Method: O(n). Based on clustering stars and considering cluster-cluster + star-cluster + star-star interactions.
-
Star data resides at the leaves of an octree.
Cluster data resides at internal nodes, computed by passing data along tree edges.
Cluster-cluster interaction: data flows between neighbours at the same level of the tree.
Dataflow is pyramidal. It is possible to nicely embed pyramids in most networks.
More complexity because the pyramid might have a variable number of levels if stars are distributed unevenly.
-
Find a combinatorial object satisfying given properties, e.g. packing, TSP, ...
Naturally parallel, but unstructured growth of the search tree.
Difficulty in load balancing.
Key question: can we maintain a distributed queue?
Also: the best solution found so far? Bounds?
-
Prefix-like computations are useful in maintaining a distributed queue.
Randomness is also useful: each processor asks for work from a random processor when its work queue runs out.
Broadcasting is required to maintain the best solution, bounds, ...
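A toy synchronous sketch of the random work-request idea (the name steal_round and the take-half policy are illustrative assumptions; a real scheduler works asynchronously with messages):

```python
import random

def steal_round(queues, rng):
    """One synchronous round of randomized load balancing: every
    processor whose work queue is empty asks a uniformly random
    processor for work and, if that victim has more than one task,
    takes half of the victim's queue."""
    p = len(queues)
    for i in range(p):
        if not queues[i]:
            victim = rng.randrange(p)
            if victim != i and len(queues[victim]) > 1:
                half = len(queues[victim]) // 2
                queues[i] = queues[victim][:half]      # steal half the tasks
                queues[victim] = queues[victim][half:]
    return queues
```

Randomized victim selection needs no central coordinator, and no round can lose or duplicate tasks.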
-
Graph embedding: develop an algorithm for one network, then embed that network on what you actually have.
Form a graph of dataflow/data dependence, then embed that in the network you have.
Solve the problem locally, merge results globally: sorting.
Matrix multiplication: locally implement sub-block multiplication. Useful in graph algorithms.
-
Randomization: useful in load balancing
Also used in communication: select paths randomly
Symmetry breaking
Sampling
Co-ordination: prefix is useful
Distributed data structures, e.g. queues, hash tables
Communication patterns: all-to-all (as in sorting)
Permutation
Broadcasting
-
Vast subject
Many clever algorithms
Many models of computation
Many issues besides algorithms: how to express them in a programming language, ...
Quick tour of some of the important ideas.