PRAM Machine Models


Transcript of Pram Machine Models

  • 8/4/2019 Pram Machine Models

    1/72

    RAM, PRAM, and LogP models


    Why models?

    What is a machine model? An abstraction that describes the operation of a machine,

    allowing a value (cost) to be associated with each machine operation.

    Why do we need models? They make it easy to reason about algorithms.

    They hide machine implementation details, so that general results that apply to a broad class of machines can be obtained.

    They let us analyze the achievable complexity bounds (time, space, etc.).

    They let us analyze the maximum parallelism. Models are directly related to algorithms.


    RAM (random access machine) model

    Memory consists of an infinite array of memory cells.

    Instructions are executed sequentially, one at a time.

    All instructions take unit time: load/store, arithmetic, logic.

    The running time of an algorithm is the number of instructions executed.

    The memory requirement is the number of memory cells used by the algorithm.


    RAM (random access machine) model

    The RAM model is the basis of algorithm analysis for sequential algorithms, although it is not

    perfect, for the following reasons:

    Memory is not infinite.

    Not all memory accesses take the same time.

    Not all arithmetic operations take the same time.

    Instruction pipelining is not taken into consideration.

    Nevertheless, the RAM model (with asymptotic analysis) often

    gives relatively realistic results.


    PRAM (Parallel RAM)

    An unbounded collection of processors.

    Each processor has an infinite number of registers.

    An unbounded collection of shared memory cells.

    All processors can access all memory cells in unit time (when there is no memory conflict).

    All processors execute PRAM instructions synchronously (some processors may idle).

    Each PRAM instruction executes in a 3-phase cycle:

    Read from a shared memory cell (if needed)

    Computation

    Write to a shared memory cell (if needed)
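    The read/compute/write discipline can be sketched as a one-step simulator (a Python sketch; the name `pram_step` and its interface are illustrative, not from the slides). The key point it shows: all reads in a step see the memory state from before any write of that same step.

```python
def pram_step(memory, processors):
    """Execute one synchronous PRAM step.

    Each processor is (read_addr, compute, write_addr); all reads happen
    before any write, mirroring the read/compute/write phases.
    """
    # Phase 1: every processor reads (sees the pre-step memory).
    values = [memory[r] for (r, _, _) in processors]
    # Phase 2: local computation.
    results = [f(v) for v, (_, f, _) in zip(values, processors)]
    # Phase 3: every processor writes.
    for (_, _, w), res in zip(processors, results):
        memory[w] = res
    return memory

# Two processors swap cells 0 and 1 in a single synchronous step --
# impossible if reads and writes were interleaved arbitrarily.
mem = [10, 20]
pram_step(mem, [(0, lambda v: v, 1), (1, lambda v: v, 0)])
# mem is now [20, 10]
```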


    PRAM (Parallel RAM)

    The only way processors exchange data is

    through the shared memory.

    Parallel time complexity: the number of

    synchronous steps in the algorithm.

    Space complexity: the number of shared memory cells used.

    Parallelism: the number of processors used.


    Illustration of PRAM

    [Figure: processors P1, P2, P3, ..., Pp, driven by a common clock (CLK), connected to a single shared memory.]

    P processors connected to a single shared memory.

    Each processor has a unique index.

    A single program is executed in MIMD mode.

    All processors operate in a synchronous manner

    (with infinite shared memory and infinite local memory).


    Amit Chhabra, Deptt. of Computer Sci. & Engg.

    Communication in PRAM


    PRAM further refinement

    PRAMs are further classified based on how

    memory conflicts are resolved.

    Read:

    Exclusive Read (ER): all processors can only

    simultaneously read from distinct memory locations (not

    the same location).

    What if two processors want to read from the same location?

    Concurrent Read (CR): all processors can

    simultaneously read from any memory location.


    PRAM further refinement

    PRAMs are further classified based on how memory conflicts are resolved.

    Write:

    Exclusive Write (EW): all processors can only simultaneously write to distinct memory locations (not the same location).

    Concurrent Write (CW): all processors can simultaneously write to any memory location. Conflicts are resolved as follows:

    Common CW: a simultaneous write to the same location is allowed only if all processors write the same value.

    Random CW: one of the written values is picked at random.

    Priority CW: processors have priorities; the value from the highest-priority processor wins.
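    The three concurrent-write policies can be sketched as a resolution function (a Python sketch; the name `resolve_cw` and its interface are ours, not from the slides):

```python
import random

def resolve_cw(writes, mode, seed=None):
    """Resolve simultaneous writes to one memory cell.

    writes: list of (processor_id, value); lower id = higher priority.
    """
    if mode == "common":
        vals = {v for _, v in writes}
        if len(vals) > 1:
            raise ValueError("Common CW: conflicting values written")
        return vals.pop()
    if mode == "random":
        rng = random.Random(seed)          # pick an arbitrary writer
        return rng.choice(writes)[1]
    if mode == "priority":
        return min(writes, key=lambda pv: pv[0])[1]
    raise ValueError("unknown mode: " + mode)

assert resolve_cw([(0, 7), (3, 7)], "common") == 7
assert resolve_cw([(2, 5), (0, 9)], "priority") == 9
```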


    PRAM model variations

    EREW, CREW, CRCW (common), CRCW (random), CRCW (priority).

    Which model is closer to practical SMP machines?

    Model A is computationally stronger than

    model B if and only if any algorithm written for

    B runs unchanged on A. EREW is the weakest of these models.


    PRAM algorithm example

    SUM: add N numbers stored in memory M[0, 1, ..., N-1].

    Sequential SUM algorithm (O(N) complexity):

    for (i = 0; i < N; i++) sum = sum + M[i];


    PRAM SUM algorithm


    Which PRAM model?
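    A sketch of the parallel SUM (Python, simulating the synchronous steps; the function name is ours). Note that an EREW PRAM suffices, since in each step every active processor reads and writes distinct cells:

```python
def pram_sum(M):
    """Sum N numbers in O(log N) synchronous steps (N a power of two).

    In step j, processor i computes M[i] = M[i] + M[i + stride]; each
    cell is read/written by exactly one processor, so EREW suffices.
    """
    M = list(M)
    stride = 1
    while stride < len(M):
        # All active processors act in the same synchronous step.
        for i in range(0, len(M), 2 * stride):
            M[i] += M[i + stride]
        stride *= 2
    return M[0]

assert pram_sum([1, 2, 3, 4, 5, 6, 7, 8]) == 36   # 3 steps, N/2 = 4 processors
```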


    PRAM SUM algorithm complexity

    Time complexity?

    Number of processors needed?

    Speedup (vs. sequential program):



    Basic Parallel Algorithms

    3 elementary problems to be considered:

    Reduction

    Broadcast

    Prefix sums

    Target architectures:

    Hypercube SIMD model

    2D-mesh SIMD model

    UMA multiprocessor model

    Hypercube multicomputer



    REDUCTION

    Cost Optimal PRAM Algorithm for the Reduction Problem

    Cost-optimal PRAM algorithm complexity:

    O(log n), using n/2 processors.

    Example for n = 8 and p = 4 processors (elements a0 ... a7):

    j = 0: P0, P1, P2, P3 each add one adjacent pair.

    j = 1: P0 and P2 combine the pair sums.

    j = 2: P0 adds the two remaining partial sums to produce the result.



    Reduction using Hyper-cube SIMD

    [Figure: reduction on an 8-node hypercube. In each step partial sums are combined along one hypercube dimension, halving the active set: P0-P7, then P0-P3, then P0-P1, with the total ending at P0.]



    Reduction using 2-D mesh



    Reduction using 2-D mesh

    Example: compute the total sum on a 4×4 mesh

    Stage 1



    Reduction using 2-D mesh

    Stage 2


    Parallel search algorithm

    A P-processor PRAM holds N unsorted numbers (P ≤ N); decide whether a given number x occurs among them.


    Parallel search algorithm

    PRAM algorithm:

    Step 1: inform every processor what x is.

    Step 2: every processor checks N/P numbers and

    sets a flag.

    Step 3: check whether any flag is set to 1.

    EREW: O(log P) for step 1, O(N/P) for step 2, O(log P) for step 3.

    CREW: O(1) for step 1, O(N/P) for step 2, O(log P) for step 3.

    CRCW (common): O(1) for step 1, O(N/P) for step 2, O(1) for step 3.
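    The three steps can be traced with a sequential simulation (a Python sketch; the name `parallel_search` and the partitioning into ceil(N/P)-sized chunks are our illustration of step 2):

```python
def parallel_search(data, x, p):
    """Simulate P processors searching for x among N unsorted numbers.

    Step 2: each processor scans its own block of ceil(N/P) numbers and
    sets a flag.  Step 3 is the OR of the flags: one step on a common
    CRCW PRAM, an O(log P) reduction tree on EREW/CREW.
    """
    n = len(data)
    chunk = (n + p - 1) // p              # ceil(N / P) numbers each
    flags = [0] * p
    for pid in range(p):                  # conceptually simultaneous
        part = data[pid * chunk:(pid + 1) * chunk]
        flags[pid] = int(x in part)
    return max(flags) == 1                # the OR of all flags

assert parallel_search([4, 8, 15, 16, 23, 42], 23, 3) is True
assert parallel_search([4, 8, 15, 16, 23, 42], 7, 3) is False
```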


    PRAM strengths

    A natural extension of RAM.

    It is simple and easy to understand.

    Communication and synchronization issues are

    hidden. It can be used as a benchmark:

    if an algorithm performs badly in the PRAM model, it will perform badly in reality.

    A good PRAM program may not be practical, though.


    PRAM weaknesses

    Model inaccuracies:

    Unbounded local memory (registers)

    All operations take unit time

    Unaccounted costs:

    Non-local memory access

    Latency

    Bandwidth

    Memory access contention


    Pros & Cons

    Pros:

    Simple and clean semantics.

    The majority of theoretical parallel algorithms are

    specified with the PRAM model.

    Gives the complexity of a parallel algorithm,

    independent of the communication network topology.


    Pros & Cons (Contd)

    Cons:

    Not realistic: the communication model is too powerful.

    The algorithm designer is misled into using IPC without

    hesitation.

    Synchronized processors.

    No local memory.


    PRAM variations

    Bounded-memory PRAM, PRAM(m): in a given step, only m memory accesses can be serviced.

    Bounded number of processors:

    any problem that can be solved by a p-processor PRAM in t steps can be solved by a p'-processor PRAM in t' = O(tp/p') steps.

    LPRAM (Local-memory PRAM): L units to access global memory;

    different access costs for local memory and remote memory.

    Any algorithm that runs on a p-processor PRAM can run on an LPRAM with a loss of a factor of L.

    BPRAM (Block PRAM): L units for the first message,

    B units for subsequent messages.


    PRAM summary

    The RAM model is widely used.

    PRAM is simple and easy to understand.

    The PRAM model has never reached far beyond the algorithms community.

    It is getting more important as threaded programming becomes more popular.


    BSP Model

    BSP (Bulk Synchronous Parallel) model:

    proposed by Leslie Valiant of Harvard University,

    developed further by W. F. McColl of Oxford University.


    BSP Computer Model

    Distributed memory architecture,

    3 components:

    Nodes:

    Processor

    Local memory

    Router (communication network):

    Point-to-point message passing (or shared variables)

    Barrier synchronization facility:

    All processors, or a subset


    Illustration of BSP

    [Figure: nodes (processor P + local memory M, parameter w) connected by a communication network (parameter g), with a barrier synchronization facility (parameter l).]


    Three Parameters

    w parameter:

    Maximum computation time within each superstep;

    a computation operation takes at most w cycles.

    g parameter:

    Number of cycles needed to communicate a unit message when all processors are involved in communication - network bandwidth:

    (total number of local operations performed by all processors in one second) / (total number of words delivered by the communication network in one second).

    With h the relation coefficient (each processor sends/receives at most h messages),

    a communication operation takes g·h cycles.

    l parameter:

    Barrier synchronization takes l cycles.


    BSP Program

    A BSP computation consists of S supersteps.

    A superstep is a sequence of steps followed by

    a barrier synchronization.

    Within a superstep, any remote memory accesses take effect at the

    barrier - loosely synchronous execution.


    BSP Program

    [Figure: processors P1-P4; each superstep (Superstep 1, Superstep 2, ...) consists of a computation phase and a communication phase, separated from the next superstep by a barrier.]


    BSP Algorithm Design

    2 variables for time complexity:

    N: number of time steps (local operations) per superstep.

    h-relation: each node sends and receives at most h messages;

    gh is the communication time within a superstep.

    Time complexity per superstep:

    Valiant's original: max{w, gh, l} -

    overlapped communication and computation.

    McColl's revised: N + gh + l, i.e. total time Ntot + Ncom·g + S·l.

    Algorithm design: make Ntot = T/p (T = sequential time), and minimize Ncom and S.
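    The per-superstep cost (computation + g·h + l) can be turned into a small calculator (a Python sketch; the function name and the numeric values of g, l, and N below are illustrative, not from the slides):

```python
def bsp_cost(supersteps, g, l):
    """Total BSP time: sum over supersteps of (w + g*h + l), where
    w = max local work and h = max messages sent/received per node."""
    return sum(w + g * h + l for (w, h) in supersteps)

# Inner product of two N-element vectors on 8 processors, with N = 32:
# superstep 1: w = 2N/8 = 8, a 1-relation; supersteps 2-4: w = 1.
cost = bsp_cost([(8, 1), (1, 1), (1, 1), (1, 0)], g=4, l=10)
# (8+4+10) + (1+4+10) + (1+4+10) + (1+0+10) = 63 cycles
```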


    Example

    Inner product with 8 processors.

    Superstep 1:

    Computation: local sums, w = 2N/8.

    Communication: a 1-relation (procs. 0, 2, 4, 6 -> procs. 1, 3, 5, 7).

    Superstep 2:

    Computation: one addition, w = 1.

    Communication: a 1-relation (procs. 1, 5 -> procs. 3, 7).

    Superstep 3:

    Computation: one addition, w = 1.

    Communication: a 1-relation (proc. 3 -> proc. 7).


    Example (Contd)

    Superstep 4:

    Computation: one addition, w = 1.

    Total execution time = 2N/n + (g + l + 1) log n, for n processors.

    g log n is the communication overhead; l log n is the synchronization overhead.


    BSP Programming Library

    A more abstract level of programming than message passing.

    BSPlib library: Oxford BSP library + Green BSP library.

    One-sided DRMA (direct remote memory access) shared-memory operations: dynamic shared variables through registration.

    BSMP (bulk synchronous message passing):

    combines small messages into a large one. Additional features: message reordering.


    BSP Programming Library (Contd)

    Extensions: Paderborn University BSP

    supports collective communication and group

    management facilities.

    Cost model for performance analysis and

    prediction.

    Good for debugging: the global state is visible at

    superstep boundaries.


    Programming Example

    All sums using the logarithmic technique:

    calculate the partial sums of p integers stored on

    p processors,

    in log p supersteps.

    Example: 4 processors.


    BSP Program Example

    int bsp_allsum( int x )
    {
        int i, left, right;

        bsp_push_register( &left, sizeof(int) );
        bsp_sync();
        right = x;
        for( i = 1; i < bsp_nprocs(); i *= 2 ) {
            if( bsp_pid() + i < bsp_nprocs() )
                bsp_put( bsp_pid() + i, &right, &left, 0, sizeof(int) );
            bsp_sync();
            if( bsp_pid() >= i )
                right = left + right;
        }
        bsp_pop_register( &left );
        return right;
    }


    All Sums Using the Logarithmic Technique

    Initial values: 1 2 3 4

    After step 1:   1 3 5 7

    After step 2:   1 3 6 10


    LogP model

    Common MPP organization:

    complete machines (processor + memory) connected by a network.

    LogP attempts to capture the characteristics of such an organization.

    (The PRAM model, in contrast, assumes shared memory.)

    [Figure: processor-memory (P, M) nodes connected by a network.]


    Deriving LogP model

    Processing: powerful microprocessor, large DRAM, cache => P

    Communication:

    + significant latency => L

    + limited bandwidth => g

    + significant overhead => o (on both ends)

    + limited capacity

    - no consensus on topology

    => should not exploit structure

    - no consensus on programming model

    => should not enforce one


    LogP

    [Figure: P processor-memory (P, M) modules attached to an interconnection network, showing o, g at each endpoint, L inside the network, and a limited volume of L/g messages in flight to or from any processor.]

    P: number of processor/memory modules.

    L (latency): delay in sending a (small) message between modules.

    o (overhead): time a processor is busy sending or receiving a message.

    g (gap): minimum interval between successive sends or receives (1/BW).


    Using the model

    Send n messages from one processor to another: time 2o + L + g(n-1).

    Each processor spends o·n cycles on overhead,

    and has (g-o)(n-1) + L compute cycles available.

    Send n messages from one to many:

    same time.

    Send n messages from many to one:

    same time, but all except L/g of the senders block,

    so fewer compute cycles are available.

    [Figure: timeline of a single message: send overhead o, latency L in the network, receive overhead o, with gap g between successive sends.]


    Using the model

    Two processors send n words to each other:

    2o + L + g(n-1).

    This assumes no network contention,

    so it can under-estimate the communication time.
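    As a quick check of the formula, a Python sketch (the function name and the numeric parameter values are illustrative; the model assumes g ≥ o):

```python
def logp_send_time(n, L, o, g):
    """Time for a processor to send n messages to another under LogP:
    the first send costs o, successive sends start g cycles apart, and
    the last message needs L in flight plus o receive overhead."""
    return 2 * o + L + g * (n - 1)

assert logp_send_time(1, L=6, o=2, g=4) == 10    # single message: 2o + L
assert logp_send_time(5, L=6, o=2, g=4) == 26    # 2o + L + 4g
```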


    LogP philosophy

    Think about:

    the mapping of a task onto P processors;

    computation within a processor, its cost, and its balance;

    communication between processors, its cost, and its balance;

    given a characterization of processor and network

    performance.

    Do not think about what happens within the network.


    Develop an optimal broadcast algorithm

    based on the LogP model:

    broadcast a single datum to the other P-1 processors.
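    The slide's construction did not survive the transcript; the schedule that LogP analysis yields is greedy: every informed processor keeps forwarding the datum to a new processor as early as it can. A Python sketch (the function name and the parameter values in the test are ours):

```python
import heapq

def logp_broadcast_time(P, L, o, g):
    """Earliest time at which all P processors hold the datum under LogP.

    A send started at time t lands on the receiver at t + o + L + o; the
    sender may start its next send g cycles after the previous one.
    """
    ready = [0]               # times at which some processor can next send
    finish = 0
    informed = 1              # the root already has the datum
    while informed < P:
        t = heapq.heappop(ready)
        arrival = t + L + 2 * o       # receiver informed and free here
        finish = max(finish, arrival)
        informed += 1
        heapq.heappush(ready, t + g)        # sender sends again after g
        heapq.heappush(ready, arrival)      # new node starts forwarding
    return finish

assert logp_broadcast_time(2, L=6, o=2, g=4) == 10   # one message: 2o + L
```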


    Strengths of the LogP model

    Simple: 4 parameters.

    Can easily be used to guide algorithm

    development, especially algorithms for

    communication routines.

    This model has been used to analyze many

    collective communication algorithms.


    Weaknesses of the LogP model

    Accurate only at a very low level (machine

    instruction level).

    Inaccurate for more practical communication

    systems with layers of protocols (e.g. TCP/IP).

    Many variations:

    the LogP family of models (LogGP, LogGPC, pLogP, etc.)

    makes the model more accurate, and more complex.


    BSP vs. LogP

    BSP differs from LogP in three ways:

    LogP uses a form of message passing based on pairwise

    synchronization.

    LogP adds an extra parameter, o, representing the overhead involved in sending a message. It applies to every communication!

    LogP defines g in local terms. It regards the network as having a finite

    capacity and treats g as the minimal permissible gap between

    message sends from a single process. In both models g is the

    reciprocal of the available per-processor network bandwidth: BSP

    takes a global view of g, LogP takes a local view of g.


    BSP vs. LogP

    When analyzing performance with the LogP model, it is often necessary (or

    convenient) to use barriers.

    The message overhead is present but decreasing:

    the only overhead is transferring the message from user space to a system buffer.

    LogP + barriers - overhead = BSP.

    Each model can efficiently simulate the other.


    BSP vs. PRAM

    BSP can be regarded as a generalization of the PRAM model.

    If the BSP architecture has a small value of g (g = 1), then it

    can be regarded as a PRAM;

    hashing can be used to automatically achieve efficient memory

    management.

    The value of l determines the degree of parallel slackness

    required to achieve optimal efficiency.

    l = g = 1 corresponds to an idealized PRAM, where no

    slackness is required.


    BSPlib

    Supports an SPMD style of programming.

    The library is available in C and FORTRAN.

    Implementations were available (several years ago) for:

    Cray T3E

    IBM SP2

    SGI PowerChallenge

    Convex Exemplar

    Hitachi SR2001

    various workstation clusters

    Allows direct remote memory access or message passing.

    Includes support for unbuffered messages for high-performance computing.


    BSPlib

    Initialization functions:

    bsp_init() - simulate dynamic processes

    bsp_begin() - start of SPMD code

    bsp_end() - end of SPMD code

    Enquiry functions:

    bsp_pid() - find my process id

    bsp_nprocs() - number of processes

    bsp_time() - local time

    Synchronization functions:

    bsp_sync() - barrier synchronization

    DRMA functions:

    bsp_pushregister() - make a region globally visible

    bsp_popregister() - remove global visibility

    bsp_put() - push to remote memory

    bsp_get() - pull from remote memory


    BSPlib

    BSMP functions:

    bsp_set_tag_size() - choose tag size

    bsp_send() - send to remote queue

    bsp_get_tag() - match tag with message

    bsp_move() - fetch from queue

    Halt functions:

    bsp_abort() - one process halts all

    High-performance functions:

    bsp_hpput(), bsp_hpget(), bsp_hpmove() -

    unbuffered versions of the communication primitives


    BSPlib Examples

    Static Hello World

    void main( void )
    {
        bsp_begin( bsp_nprocs());
        printf( "Hello BSP from %d of %d\n",
                bsp_pid(), bsp_nprocs());
        bsp_end();
    }

    Dynamic Hello World

    int nprocs; /* global variable */

    void spmd_part( void )
    {
        bsp_begin( nprocs );
        printf( "Hello BSP from %d of %d\n",
                bsp_pid(), bsp_nprocs());
        bsp_end();
    }

    void main( int argc, char *argv[] )
    {
        bsp_init( spmd_part, argc, argv );
        nprocs = ReadInteger();
        spmd_part();
    }


    BSPlib Examples

    Serialize Printing of Hello World (shows synchronization)

    void main( void )
    {
        int ii;

        bsp_begin( bsp_nprocs());
        for( ii = 0; ii < bsp_nprocs(); ii++ ) {
            if( bsp_pid() == ii )
                printf( "Hello BSP from %d of %d\n",
                        bsp_pid(), bsp_nprocs());
            fflush( stdout );
            bsp_sync();
        }
        bsp_end();
    }


    BSPlib Examples

    All sums version 1 ( lg( p ) supersteps )

    int bsp_allsums1( int x ){
        int ii, left, right;

        bsp_pushregister( &left, sizeof( int ));
        bsp_sync();
        right = x;
        for( ii = 1; ii < bsp_nprocs(); ii *= 2 ) {
            if( bsp_pid() + ii < bsp_nprocs() )
                bsp_put( bsp_pid() + ii, &right, &left, 0, sizeof( int ));
            bsp_sync();
            if( bsp_pid() >= ii )
                right = left + right;
        }
        bsp_popregister( &left );
        return( right );
    }


    BSPlib Examples

    All sums version 2 (one superstep)

    int bsp_allsums2( int x ) {
        int ii, result, *array = calloc( bsp_nprocs(), sizeof(int));

        if( array == NULL )
            bsp_abort( "Unable to allocate %d element array",
                       bsp_nprocs());
        bsp_pushregister( array, bsp_nprocs()*sizeof( int));
        bsp_sync();
        for( ii = bsp_pid(); ii < bsp_nprocs(); ii++ )
            bsp_put( ii, &x, array, bsp_pid()*sizeof( int), sizeof( int));
        bsp_sync();
        result = 0;
        for( ii = 0; ii <= bsp_pid(); ii++ )
            result += array[ii];
        bsp_popregister( array );
        free( array );
        return( result );
    }


    BSPlib vs. PVM and/or MPI

    MPI/PVM are widely implemented and widely

    used.

    Both have HUGE APIs!!!

    Both may be inefficient on (distributed) shared-memory

    systems, where communication and

    synchronization are decoupled. This is true for DSM machines with one-sided communication.


    Both are based on pairwise synchronization,

    rather than barrier synchronization.

    Neither has a simple cost model for performance prediction, nor a simple means of examining the global state.

    BSP could be implemented using a small, carefully chosen subset of MPI subroutines.


    Conclusion

    BSP is a computational model of parallel computing based on the concept of supersteps.

    BSP does not use locality of reference for the assignment of processes to processors.

    Predictability is defined in terms of three parameters.

    BSP is a generalization of PRAM.

    BSP = LogP + barriers - overhead.

    BSPlib has a much smaller API as compared to MPI/PVM.


    Data Parallelism

    Independent tasks apply the same operation to

    different elements of a data set,

    so it is okay to perform the operations concurrently.

    Speedup: potentially p-fold, p = #processors.

    for i ← 0 to 99 do
        a[i] ← b[i] + c[i]
    endfor
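    The loop's semantics can be sketched in Python (the function name is ours; the point is that no iteration depends on another, so all 100 additions could run simultaneously on 100 processors):

```python
def vector_add(b, c):
    """Data-parallel loop: each index i is independent of every other,
    so conceptually all iterations execute at once."""
    return [b[i] + c[i] for i in range(len(b))]

a = vector_add(list(range(100)), [1] * 100)   # a[i] = i + 1
```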


    Functional Parallelism

    Independent tasks apply different operations to different data elements.

    In the code below, the first and second statements can run concurrently,

    as can the third and fourth statements.

    Speedup: limited by the number of concurrent sub-tasks.

    a ← 2
    b ← 3
    m ← (a + b) / 2
    s ← (a² + b²) / 2
    v ← s - m²
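    An executable version of the statements above (a Python sketch of the dependence structure; these statements in fact compute the population variance of {a, b}):

```python
a, b = 2, 3          # statements 1 and 2: independent, could run in parallel
# m and s each use only a and b, so two processors could compute them
# concurrently (different operations on the same data -- functional
# parallelism); v depends on both, so it must wait for them.
m = (a + b) / 2      # mean
s = (a ** 2 + b ** 2) / 2
v = s - m ** 2       # variance: 6.5 - 6.25 = 0.25
```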



    Another example: the Sieve of Eratosthenes, a classical prime-finding

    algorithm.

    Here we want to find the prime numbers less than or equal

    to some positive integer n.

    Let us do this example for n = 30.

    Sieve of Eratosthenes



    Sieve of Eratosthenes



    Sieve of Eratosthenes: serial pseudocode
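    The serial pseudocode did not survive the transcript; a minimal Python sketch of the standard algorithm (the function name `sieve` is ours):

```python
def sieve(n):
    """Serial Sieve of Eratosthenes: cross off the multiples of each
    prime k up to sqrt(n); the indices still marked are the primes <= n."""
    marked = [True] * (n + 1)
    marked[0] = marked[1] = False
    k = 2
    while k * k <= n:
        if marked[k]:                        # k is prime
            for m in range(k * k, n + 1, k): # cross off its multiples
                marked[m] = False
        k += 1
    return [i for i in range(2, n + 1) if marked[i]]

# The slides' example, n = 30:
assert sieve(30) == [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```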



    Sieve of Eratosthenes

    Control parallel solution



    Sieve of Eratosthenes

    Data parallelism approach


    Data parallelism (contd.)