ParallelAlgorithms-Ranade


Transcript of ParallelAlgorithms-Ranade

  • Slide 1/29

    Abhiram G. Ranade

    Dept of CSE, IIT Bombay

  • Slide 2/29

    Availability of very powerful parallel computers, e.g. CDAC PARAM Yuva

    Need to solve large problems

    Multicore desktop machines

    Inexpensive GPUs

    FPGA coprocessors

  • Slide 3/29

    Network of processors.

    Local computation: one operation/step/processor

    Communication with d neighbours: b words / L steps / processor

    Shared Memory

    Local computation: one operation/step/processor

    Access to shared memory: b words / L steps / processor

    Fine grain: small L, large b. Otherwise coarse grain.


  • Slide 5/29

    Maximize Speedup = T1 / Tp

    T1 = Time using the best sequential algorithm

    Tp = Time using the parallel algorithm on p processors.

    Ideally speedup = p. Usually not possible.
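
    As an illustrative calculation (the numbers are assumed here, not from the slides): if the best sequential algorithm takes T1 = 100 s and the parallel algorithm on p = 8 processors takes Tp = 20 s, then

    \[ \text{speedup} = \frac{T_1}{T_p} = \frac{100}{20} = 5 < p = 8 . \]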

  • Slide 6/29

    General strategy for designing parallel algorithms

    Brief Case Studies

    Summary of main themes

  • Slide 7/29

    Not necessarily in order:

    Design a sequential algorithm. Make sure you know how to solve the problem!

    Identify parallelism. Sometimes obvious.

    Assign available work to processors. Balance the load.

    Minimize communication

  • Slide 8/29

    Matrix multiplication

    Prefix

    Sorting

    Sparse matrix multiplication

    N-body problems

    Parallel Search

  • Slide 9/29

    For i = 1..N
      For j = 1..N
        C[i,j] = 0
        For k = 1..N
          C[i,j] += A[i,k] * B[k,j]

    Each C[i,j] can be computed in parallel
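
    A minimal sketch, in Python, of exploiting this parallelism on a shared-memory multicore machine (the process pool and the 4 x 4 test matrices are illustrative choices, not the slides' setting): each row of C depends only on A and B, never on another row of C, so rows can be computed by independent workers.

    from concurrent.futures import ProcessPoolExecutor

    N = 4
    # Deterministic test matrices (B is the identity, so C should equal A).
    A = [[i * N + k for k in range(N)] for i in range(N)]
    B = [[1 if k == j else 0 for j in range(N)] for k in range(N)]

    def row_of_C(i):
        # Row i of C = A * B; independent of every other row, so rows run in parallel.
        return [sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]

    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:       # one worker per core by default
            C = list(pool.map(row_of_C, range(N)))
        print(C)                                  # prints the rows of A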

  • Slide 10/29

    [Diagram: entries of A and B stream through a 3 x 3 mesh of processors; each processor accumulates one entry of C, e.g. the corner processor marked in the figure computes C[3,3].]

  • Slide 11/29

    Implementation needs fine granularity

    If your computer has coarse granularity: treat each entry (i,j) as a q x q submatrix.

    Amount of data input per step: 2 x q x q

    Amount of computation per step: q^3
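
    To spell out the trade-off (the arithmetic is filled in here from the two counts above, using the fine/coarse terms of the earlier model slide):

    \[ \frac{\text{computation per step}}{\text{data input per step}} = \frac{q^{3}}{2q^{2}} = \frac{q}{2}, \]

    so a larger block size q gives each processor more computation per word of data moved, which is what a coarse-grained machine needs.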

  • Slide 12/29

    If your network has another topology, embed the grid network into your topology.

    Map grid vertices to your processors; map grid edges to paths in your network. Your network then simulates the grid.

    The algorithm is self-synchronizing: processors wait for data.

    Data can be on disks in the network; how to distribute it is an important question.

  • Slide 13/29

    Input: A[1..N], Output: B[1..N]

    B[1] = A[1]

    For j = 2 to N

    B[j] = B[j-1] + A[j] // + : associative op.

    Model of recurrences, carry lookahead, matching work to processors; an algorithmic primitive in sorting and N-body computation.

    Not parallel? The jth iteration needs the result of the (j-1)th.

  • Slide 14/29

    Construct C[1..N/2], C[j] = A[2j-1]+A[2j]

    Recursively solve. Answer D[1..N/2]

    D[j] = C[1]+C[2]+...+C[j] = A[1]+A[2]+...+A[2j] = B[2j]

    B[2j-1] = D[j] - A[2j]

    Tree implementation: A is fed in at the leaves, C is computed at the parents of the leaves, B arrives back at the leaves.
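
    A minimal sketch (plain recursive Python, 0-based indexing rather than the slides' 1-based; the tree implementation itself is not shown) of the doubling scheme above. Each marked loop is the step that runs in parallel over j.

    def prefix(A):
        """Return B with B[j] = A[0] + ... + A[j]; assumes len(A) is a power of two."""
        N = len(A)
        if N == 1:
            return [A[0]]
        # C[j] = A[2j] + A[2j+1]  -- parallel over j
        C = [A[2 * j] + A[2 * j + 1] for j in range(N // 2)]
        D = prefix(C)                         # recursively solve the half-size problem
        B = [0] * N
        for j in range(N // 2):               # parallel over j
            B[2 * j + 1] = D[j]               # the slides' B[2j] = D[j] (1-based)
            B[2 * j] = D[j] - A[2 * j + 1]    # the slides' B[2j-1] = D[j] - A[2j]
        return B

    # Example: prefix([1, 2, 3, 4]) returns [1, 3, 6, 10].

    (As the next slide notes, the '-' can be avoided; in the slides' 1-based indexing one can instead take B[2j-1] = D[j-1] + A[2j-1], treating D[0] as empty, so only the associative '+' is needed.)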

  • Slide 15/29

    Fine grained algorithm

    For use on coarse grained networks, embed big subtrees on one processor.

    Not necessary to have complete binary trees. No need for the '-' operation.

    The algorithm can wait for data; it is self-synchronizing.

  • Slide 16/29

    As much work and as many clever algorithms as in sequential sorting.

    Merge paradigm: each processor sorts its data locally, then all the sublists are merged. Merging can happen in parallel.

  • Slide 17/29

    Bucket sort paradigm:

    Assign one bucket to each processor.

    Use sampling to determine ranges.

    Each processor sends its keys to the correct bucket.

    Each processor locally sorts the keys it receives.
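
    A compact sketch of the bucket sort paradigm (not the slides' code: a sequential Python simulation in which the p "processors" are plain lists and regular sampling of an oversampled set picks the p-1 bucket boundaries; the function and parameter names are illustrative):

    import random
    from bisect import bisect_right

    def bucket_sort(keys, p=4, oversample=8):
        # 1. Use sampling to determine the bucket ranges (p-1 splitters).
        sample = sorted(random.sample(keys, min(len(keys), p * oversample)))
        splitters = [sample[i * len(sample) // p] for i in range(1, p)]
        # 2. Each key is sent to the processor that owns its bucket.
        buckets = [[] for _ in range(p)]
        for k in keys:
            buckets[bisect_right(splitters, k)].append(k)
        # 3. Each processor locally sorts the keys it receives (independent work).
        return [x for bucket in buckets for x in sorted(bucket)]

    # Example: bucket_sort(list(range(100, 0, -1))) returns 1, 2, ..., 100 in order.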

  • Slide 18/29

    Manage communication properly: sorting is communication intensive.

    Pack many keys into a single message; do not send one message per key.

    Sorting is an important primitive, so pay great attention to the communication network.

    Similar issues in database join operations.

  • Slide 19/29

    Key operation in many numerical codes, e.g. the finite element method.

    Invoked repeatedly, e.g. in solving linear systems and differential equations.

    Dense matrix-vector multiplication is easy -- similar to matrix-matrix multiplication.

  • Slide 20/29

    Graph: derived from the problem, e.g. the finite element mesh graph

    Vector: V[j] present at vertex j of the graph

    Matrix: A[j,k] present on edge (j,k) of the graph

    Multiplication:

    Each vertex sends its data on all edges

    As the data moves along an edge, it is multiplied by the coefficient on that edge.

    The products are added as they arrive at the vertices.
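
    A small sketch of this graph view of y = A*x (the edge-list representation and function name are assumptions made here for illustration; in a real parallel code each processor would own a subset of the vertices and edges, which is the partitioning question on the next slides):

    def spmv(n, edges, x):
        """edges: list of (j, k, a_jk) triples; returns y with y[j] = sum_k a_jk * x[k]."""
        y = [0.0] * n
        for j, k, a_jk in edges:   # edges owned by different processors proceed in parallel
            y[j] += a_jk * x[k]    # x[k] is "sent" along edge (j,k); the product is added at vertex j
        return y

    # Example: the matrix [[2, 1], [0, 3]] stored as edges
    # [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)] applied to x = [1.0, 1.0] gives [3.0, 3.0].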

  • Slide 21/29

    Partition the graph among available processors s.t.

    Each processor gets an equal number of edges (load balance)

    Most edges have both endpoints on the same processor

    (minimize communication)

    Classic, hard problem

  • Slide 22/29

    Many heuristics known.

    Spectral Methods

    Multi-level methods based on graph coarsening, e.g. METIS

    Good partitioning is possible for well-shaped meshes arising out of finite element methods.

  • Slide 23/29

    Input: positions of n stars

    Output: force on each star due to all the others.

    Naive algorithm: O(n^2); for each star, calculate the force due to every other star.

    Fast Multipole Method: O(n). Based on clustering the stars and considering cluster-cluster + star-cluster + star-star interactions.
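
    A sketch of the naive O(n^2) computation (1-D positions and a unit gravitational constant are simplifying assumptions made here; the outer loop over stars is the natural place to parallelize):

    def forces(pos, mass):
        """F[i] = sum over j != i of mass[i]*mass[j]/(pos[j]-pos[i])^2, signed towards star j."""
        n = len(pos)
        F = [0.0] * n
        for i in range(n):                  # each star's total force is independent of the others,
            for j in range(n):              # so this outer loop can be split across processors
                if i != j:
                    d = pos[j] - pos[i]
                    F[i] += mass[i] * mass[j] * (1.0 if d > 0 else -1.0) / (d * d)
        return F

    # Example: forces([0.0, 1.0], [1.0, 1.0]) returns [1.0, -1.0] (the two stars attract each other).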

  • Slide 24/29

    Star data resides at the leaves of an oct tree.

    Cluster data resides at internal nodes, computed by passing data along tree edges.

    Cluster-cluster interaction: data flows between neighbours at the same level of the tree.

    Dataflow is pyramidal. It is possible to nicely embed pyramids in most networks.

    There is more complexity because the pyramid may have a variable number of levels if the stars are distributed unevenly.

  • Slide 25/29

    Find a combinatorial object satisfying given properties, e.g. packing, TSP, ...

    Naturally parallel, but the search tree grows in an unstructured way.

    Difficulty in load balancing.

    Key question: can we maintain a distributed queue?

    How do we maintain the best solution found so far, and the bounds?

  • Slide 26/29

    Prefix-like computations are useful in maintaining a distributed queue.

    Randomness is also useful: each processor asks for work from a random processor when its work queue runs out.

    Broadcasting is required to maintain the best solution, bounds, ...
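
    A toy round-based simulation of the random work-request idea (the deques standing in for per-processor queues, the "take half" rule, and the step count are all illustrative assumptions made here, not the slides' scheme):

    import random
    from collections import deque

    def simulate(queues, steps, seed=0):
        """queues[i] is processor i's work queue; returns the queue lengths after `steps` rounds."""
        rng = random.Random(seed)
        p = len(queues)
        for _ in range(steps):
            for i in range(p):
                if queues[i]:
                    queues[i].popleft()               # "execute" one task this step
                else:
                    victim = rng.randrange(p)         # ask a random processor for work
                    if victim != i and len(queues[victim]) > 1:
                        for _ in range(len(queues[victim]) // 2):
                            queues[i].append(queues[victim].pop())
        return [len(q) for q in queues]

    # Example: all 64 tasks start on processor 0 and quickly spread out:
    # simulate([deque(range(64))] + [deque() for _ in range(7)], steps=5)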

  • Slide 27/29

    Graph embedding: develop the algorithm for one network, then embed that network on what you actually have.

    Alternatively, form the graph of dataflow / data dependence and embed that in the network you have.

    Solve the problem locally, merge global results: sorting; matrix multiplication (locally implement sub-block multiplication). Also useful in graph algorithms.

  • Slide 28/29

    Randomization: useful in load balancing.

    Also used in communication: select paths randomly.

    Symmetry breaking

    Sampling

    Co-ordination: prefix is useful.

    Distributed data structures, e.g. queues, hash tables

    Communication patterns: all-to-all (as in sorting)

    Permutation

    Broadcasting

  • Slide 29/29

    Vast subject

    Many clever algorithms

    Many models of computation

    Many issues besides algorithms: how to express them in a programming language, ...

    This has been a quick tour of some of the important ideas.