Psc1 2 Beamer

Transcript of Psc1 2 Beamer

  • 8/12/2019 Psc1 2 Beamer

    Lecture 1.2 Bulk Synchronous Parallel Model

    The Bulk Synchronous Parallel Model (PSC1.2)

    What is a parallel computer?

    [Figure: a cluster of eight PCs connected through four switches.]

    A parallel computer consists of a set of processors (such as a cluster of PCs) that work together on solving a computational problem.

    Moore's law has broken down

    Single-processor speed cannot continue to improve. Moore's Law (speed doubles every 18 months) broke down in 2007 because of increasing power consumption, which grows as $O(f^3)$, where $f$ is the clock frequency.

    Therefore, improvements can only come from using multiple processors or processor cores.

    Current supercomputers are all parallel computers.

    IBM's Blue Gene/Q named Sequoia at Lawrence Livermore National Laboratory has 1.6 million processor cores and is the fastest supercomputer on earth (Top 500, June 2012). In principle, it can reach a speed of 16.32 Petaflop/s; 1 petaflop = $10^{15}$ floating point operations.

    Why parallel computing?

    It is almost as cheap to put two processor cores onto a chip (giving a dual-core chip) as putting just one. The doubled speed of a dual-core PC makes for fine advertising.

    Two cores running at clock frequency $f/2$ use only 1/4 of the power of one core running at frequency $f$.

    But it is hard to achieve the same total computation speed in practice.
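
The factor 1/4 can be checked with a short calculation, assuming that the power consumption of a core grows as the cube of its clock frequency (the $O(f^3)$ behaviour cited for the breakdown of Moore's law). The constant $c$ is a hypothetical proportionality constant introduced only for this sketch.

```latex
\begin{align*}
P_{\text{one core}}(f) &= c f^3, \\
P_{\text{two cores}}(f/2) &= 2 \cdot c \left(\frac{f}{2}\right)^3
                           = \frac{c f^3}{4}
                           = \frac{1}{4}\, P_{\text{one core}}(f).
\end{align*}
```

Both configurations offer the same aggregate number of clock cycles per second, $2 \cdot f/2 = f$, which is why the comparison is made at equal nominal speed.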

    Why not?

    It is more difficult to write parallel programs than to write sequential ones (i.e. for one processor). The work has to be distributed evenly over the processors and the amount of communication between the processors has to be kept within limits.

    But not much more difficult. That's why we have this course.

    Parallel programs may run fast on certain architectures, but surprisingly slow on others.

    But not if you program in a portable fashion. That's what we try to teach.

    Parallel computer: abstract model

    [Figure: processors P, each with its own memory M, connected by a communication network.]

    Bulk synchronous parallel (BSP) computer. Proposed by Leslie Valiant, 1989.

    BSP computer

    A BSP computer consists of a collection of processors, each with its own memory. It is a distributed-memory computer.

    Access to own memory is fast, to remote memory slower.

    Uniform-time access to all remote memories.

    No need to open the black box of the communication network. Algorithm designers should not worry about network details, only about global performance.

    Algorithms designed for a BSP computer are portable: they can be run efficiently on many different parallel computers.

    Parallel algorithm: supersteps

    [Figure: processors P(0) to P(4) executing a sequence of computation (comp) and communication (comm) supersteps, each ended by a global synchronisation (sync).]

    BSP algorithm

    A BSP algorithm consists of a sequence of supersteps.

    A computation superstep consists of many small steps, such as the floating-point operations (flops) addition, subtraction, multiplication, division. In scientific computing, flops are the common unit for expressing computation cost.

    A communication superstep consists of many basic communication operations, each transferring a data word such as a real or integer from one processor to another.

    In our theoretical algorithms, we distinguish between the two types of supersteps. This helps in the design and analysis of parallel algorithms.

    In our practical programs, we drop the distinction and mix computation and communication freely in each superstep.
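
The superstep structure can be imitated in a few lines of sequential Python. This is only a toy sketch under stated assumptions: the function `run_bsp_sum`, the `inbox` and `pending` lists, and the choice of a global sum as the task are hypothetical illustrations, not part of the lecture or of any BSP library. The point it shows is that data sent during a superstep becomes visible to the receiver only after the synchronisation that ends the superstep.

```python
def run_bsp_sum(local_data):
    """Simulate a two-superstep BSP sum over p processors, sequentially.

    local_data[s] is the list of numbers owned by processor P(s).
    Returns the global sum as known by P(0) at the end.
    """
    p = len(local_data)
    inbox = [[] for _ in range(p)]    # messages readable by each processor
    pending = [[] for _ in range(p)]  # messages in transit until the sync

    # Superstep 0: each processor computes a local sum (computation),
    # then sends it to P(0) (communication).
    for s in range(p):
        partial = sum(local_data[s])
        pending[0].append(partial)

    # Synchronisation: all messages in transit are delivered.
    for s in range(p):
        inbox[s].extend(pending[s])
        pending[s] = []

    # Superstep 1: P(0) adds up the p partial sums it received.
    total = sum(inbox[0])
    return total


if __name__ == "__main__":
    print(run_bsp_sum([[1, 2], [3], [4, 5, 6], []]))
```

Superstep 0 mixes computation and communication in the practical style; in the theoretical style it would be counted as a computation superstep followed by a communication superstep.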

    Communication superstep: h-relation

    [Figure: two examples of 2-relations, labelled a and b, among processors P(0), P(1) and P(2).]

    An h-relation is a communication superstep in which every processor sends and receives at most $h$ data words: $h = \max\{h_s, h_r\}$.

    $h_s$ is the maximum number of data words sent by a processor.

    $h_r$ is the maximum number of data words received by a processor.
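
The definition translates directly into a small computation. The sketch below assumes a hypothetical representation of a communication superstep as a list of `(source, destination, words)` tuples; the function name `h_relation` and the example pattern are made up for illustration and are not the patterns drawn in the figure.

```python
from collections import Counter


def h_relation(messages):
    """Return (h, h_s, h_r) for a list of (source, destination, words) messages."""
    sent = Counter()
    received = Counter()
    for src, dst, words in messages:
        sent[src] += words
        received[dst] += words
    h_s = max(sent.values(), default=0)      # most words sent by any processor
    h_r = max(received.values(), default=0)  # most words received by any processor
    return max(h_s, h_r), h_s, h_r


if __name__ == "__main__":
    # P(0) sends one word each to P(1) and P(2): h_s = 2, h_r = 1.
    print(h_relation([(0, 1, 1), (0, 2, 1)]))
```

In this example the sender is the bottleneck, so the superstep is a 2-relation even though no processor receives more than one word.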

    Cost of communication superstep

    $T(h) = hg + l$, where $g$ is the time per data word and $l$ the global synchronisation time.

    Motivation for $hg$: $h$ determines the communication time, since the entry/exit of a processor is the bottleneck.

    Motivation for $l$: it contains fixed overhead such as start-up costs of sending data, costs of checking whether all data have arrived at their destination, and costs of the synchronisation mechanism itself.
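
A minimal sketch of this cost formula follows. The function names and the values $g = 2.5$, $l = 20$ are assumptions chosen for illustration. The second function computes the value of $h$ at which the transfer term $hg$ equals the fixed cost $l$.

```python
def comm_cost(h, g, l):
    """Cost of an h-relation in flop time units: T(h) = h*g + l."""
    return h * g + l


def break_even_h(g, l):
    """Value of h at which the transfer term h*g equals the fixed cost l."""
    return l / g


if __name__ == "__main__":
    g, l = 2.5, 20.0  # illustrative values, assumed for this sketch
    for h in (0, 1, 8, 100):
        print(h, comm_cost(h, g, l))
    print("break-even h:", break_even_h(g, l))
```

For $h$ well below $l/g$ the synchronisation cost dominates, so sending a few words costs nearly as much as sending none.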

    Time of an h-relation on an 8-processor IBM SP2

    [Figure: time of an h-relation (in flop units, axis from 0 to 350000) plotted against h (axis from 0 to 250), showing the measured data and a least-squares fit.]

    $r = 212$ Mflop/s, $p = 8$, $g = 187$ flop (0.88 µs), $l = 148212$ flop (698 µs)
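
Values of $g$ and $l$ such as these come from fitting a straight line $T(h) = hg + l$ to measured times. A minimal sketch of such a fit, using ordinary least squares in closed form, is given below. The function name `fit_g_and_l` is hypothetical, and the data are synthetic: they are generated from the fitted values quoted above, not taken from the IBM SP2 measurements.

```python
def fit_g_and_l(hs, times):
    """Least-squares fit of times ~ g*h + l; returns (g, l)."""
    n = len(hs)
    mean_h = sum(hs) / n
    mean_t = sum(times) / n
    sxx = sum((h - mean_h) ** 2 for h in hs)
    sxy = sum((h - mean_h) * (t - mean_t) for h, t in zip(hs, times))
    g = sxy / sxx             # slope: cost per data word
    l = mean_t - g * mean_h   # intercept: fixed synchronisation cost
    return g, l


if __name__ == "__main__":
    # Synthetic, noise-free "measurements"; real timings would scatter
    # around the fitted line.
    hs = list(range(0, 251, 50))
    times = [187 * h + 148212 for h in hs]
    g, l = fit_g_and_l(hs, times)
    print(round(g), round(l))
```

With noise-free input the fit simply returns the slope and intercept that generated the data; on real timings it would give the best straight-line approximation in the least-squares sense.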

    Cost of computation superstep

    $T = w + l$, where $w$ is the maximum number of flops of a processor in the superstep.

    Processors with fewer than $w$ flops have to wait. This waiting time is called idle time.

    To measure $T$, a wall clock is needed, giving the elapsed time. Straightforwardly using a CPU timer will not work, since it does not measure idle time.

    Synchronising the processors before every time measurement helps, but it takes time to synchronise!

    The same $l$ is used as in a communication superstep, for simplicity.
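
The cost and the idle time of a computation superstep can be sketched as follows. The function name `comp_superstep` and the workload of five processors are hypothetical, chosen only to illustrate the formula $T = w + l$.

```python
def comp_superstep(flops_per_processor, l):
    """Return (cost, idle) for a computation superstep.

    cost = w + l, with w the maximum flop count over all processors;
    idle[s] = w - flops_per_processor[s] is the waiting time of P(s).
    """
    w = max(flops_per_processor)
    idle = [w - f for f in flops_per_processor]
    return w + l, idle


if __name__ == "__main__":
    # Hypothetical workload for five processors, with l = 20.
    cost, idle = comp_superstep([40, 60, 55, 60, 10], l=20)
    print(cost, idle)
```

The slowest processors determine the cost; a well-balanced work distribution keeps the idle times small.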

    Cost of algorithm

    The cost of a BSP algorithm is an expression of the form

    $a + bg + cl$.

    This cost is obtained by adding the costs of all the supersteps.

    Note that $g = g(p)$ and $l = l(p)$ are in general functions of the number of processors $p$.

    The parameters $a$, $b$, $c$ depend in general on $p$ and on a problem size $n$.

    Parallel algorithm: supersteps

    [Figure: the superstep diagram for processors P(0) to P(4), with computation (Comp), communication (Comm) and synchronisation (Sync) phases drawn against a cost axis running from 0 to 300.]

    Cost of BSP algorithm for $p = 5$, $g = 2.5$, $l = 20$ is 320 flops. First computation superstep costs $60 + 20 = 80$ flops. First communication superstep costs $4 \cdot 5 \cdot 2.5 + 20 = 70$ flops.
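
Adding up superstep costs into the form $a + bg + cl$ can be sketched as below. The function `bsp_cost` and its tuple representation of supersteps are hypothetical. The example covers only the first two supersteps, whose costs are stated (80 and 70 flops); the remaining supersteps cannot be recovered from this transcript, so the total of 320 flops is not reproduced. The value $h = 20$ for the communication superstep is inferred from its stated cost: $(70 - 20)/2.5 = 20$.

```python
def bsp_cost(supersteps, g, l):
    """Sum superstep costs and return (a, b, c, total), total = a + b*g + c*l.

    Each superstep is ("comp", w) or ("comm", h).
    """
    a = sum(x for kind, x in supersteps if kind == "comp")  # total flops
    b = sum(x for kind, x in supersteps if kind == "comm")  # total h
    c = len(supersteps)                                     # number of syncs
    return a, b, c, a + b * g + c * l


if __name__ == "__main__":
    # First two supersteps of the example: w = 60, then a 20-relation.
    print(bsp_cost([("comp", 60), ("comm", 20)], g=2.5, l=20))
```

Keeping $a$, $b$ and $c$ separate, rather than only the total, makes it possible to re-evaluate the same algorithm for a machine with different $g$ and $l$.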

    Summary

    An abstract BSP machine is just a BSP($p$, $r$, $g$, $l$) computer. This is all we need to know about the machine for developing algorithms. The parameters are:

    $p$: number of processors
    $r$: computing rate (in flop/s)
    $g$: communication cost per data word (in flop time units)
    $l$: global synchronisation cost (in flop time units)

    The BSP model consists of a distributed-memory architecture with a black box communication network providing uniform-time access to remote memories; an algorithmic framework formed by a sequence of supersteps; a cost model giving cost expressions of the form $a + bg + cl$.
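
The four parameters fit naturally into a small record. The sketch below is an assumption-laden illustration: the class name `BSPMachine` and its `seconds` method are hypothetical. It uses the rate $r$ to convert costs from flop time units into seconds, with the parameter values quoted earlier for the 8-processor IBM SP2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BSPMachine:
    p: int      # number of processors
    r: float    # computing rate in flop/s
    g: float    # communication cost per data word, in flop time units
    l: float    # global synchronisation cost, in flop time units

    def seconds(self, flops):
        """Convert a cost in flop time units into seconds."""
        return flops / self.r


if __name__ == "__main__":
    # Parameters quoted earlier for the 8-processor IBM SP2.
    sp2 = BSPMachine(p=8, r=212e6, g=187, l=148212)
    print(f"g = {sp2.seconds(sp2.g) * 1e6:.2f} microseconds")
    print(f"l = {sp2.seconds(sp2.l) * 1e6:.0f} microseconds")
```

With the rounded rate of 212 Mflop/s this gives about 0.88 microseconds for $g$ and about 699 microseconds for $l$, in line with the times quoted for that machine.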
