PRAM Machine Models
-
8/4/2019 Pram Machine Models
1/72
RAM, PRAM, and LogP models
-
Why models?
What is a machine model? An abstraction that describes the operation of a machine, allowing us to associate a value (cost) with each machine operation.
Why do we need models?
Make it easier to reason about algorithms
Hide machine implementation details so that general results that apply to a broad class of machines can be obtained
Analyze the achievable complexity bounds (time, space, etc.)
Analyze the maximum parallelism
Models are directly related to algorithms.
-
RAM (random access machine) model
Memory consists of an infinite array of memory cells.
Instructions are executed sequentially, one at a time.
All instructions take unit time:
Load/store
Arithmetic
Logic
The running time of an algorithm is the number of instructions executed.
The memory requirement is the number of memory cells used by the algorithm.
-
RAM (random access machine) model
The RAM model is the basis of algorithm analysis for sequential algorithms, although it is not perfect, for the following reasons:
Memory is not infinite
Not all memory accesses take the same time
Not all arithmetic operations take the same time
Instruction pipelining is not taken into consideration
Even so, the RAM model (with asymptotic analysis) often gives relatively realistic results.
-
PRAM (Parallel RAM)
An unbounded collection of processors.
Each processor has an infinite number of registers.
An unbounded collection of shared memory cells.
All processors can access any memory cell in unit time (when there is no memory conflict).
All processors execute PRAM instructions synchronously (some processors may idle).
Each PRAM instruction executes in a 3-phase cycle:
Read from a shared memory cell (if needed)
Computation
Write to a shared memory cell (if needed)
-
PRAM (Parallel RAM)
The only way processors exchange data is through the shared memory.
Parallel time complexity: the number of synchronous steps in the algorithm.
Space complexity: the number of shared memory cells used.
Parallelism: the number of processors used.
-
Illustration of PRAM
P processors P1, P2, P3, ..., Pp connected to a single shared memory, driven by a common clock (CLK).
Each processor has a unique index.
A single program is executed in MIMD mode.
All processors operate in a synchronous manner (with infinite shared memory and infinite local memory).
-
Amit Chhabra, Deptt. of Computer Sci. & Engg.
Communication in PRAM
-
PRAM further refinement
PRAMs are further classified based on how memory conflicts are resolved.
Read:
Exclusive Read (ER): processors can simultaneously read only from distinct memory locations (not the same location).
What if two processors want to read from the same location?
Concurrent Read (CR): processors can simultaneously read from any memory location, including the same one.
-
PRAM further refinement
PRAMs are further classified based on how memory conflicts are resolved.
Write:
Exclusive Write (EW): processors can simultaneously write only to distinct memory locations (not the same location).
Concurrent Write (CW): processors can simultaneously write to any memory location.
Common CW: simultaneous writes to the same location are allowed only if all processors write the same value.
Random CW: randomly pick one of the written values.
Priority CW: processors have priorities; the value from the highest-priority processor wins.
-
PRAM model variations
EREW, CREW, CRCW (common), CRCW (random), CRCW (priority)
Which model is closer to practical SMP machines?
Model A is computationally stronger than model B if and only if any algorithm written for B runs unchanged on A.
EREW is the weakest of these models; priority CRCW is the strongest.
-
PRAM algorithm example
SUM: add N numbers stored in memory M[0, 1, ..., N-1]
Sequential SUM algorithm (O(N) complexity):

    sum = 0;
    for (i = 0; i < N; i++)
        sum += M[i];
-
PRAM SUM algorithm
Which PRAM model?
-
PRAM SUM algorithm complexity
Time complexity? O(log N) parallel steps.
Number of processors needed? N/2.
Speedup (vs. the sequential program): O(N / log N).
-
Basic Parallel Algorithms
Three elementary problems to be considered:
Reduction
Broadcast
Prefix sums
Target architectures:
Hypercube SIMD model
2-D mesh SIMD model
UMA multiprocessor model
Hypercube multicomputer
-
REDUCTION
Cost Optimal PRAM Algorithm for the Reduction Problem
Cost-optimal PRAM algorithm complexity: O(log n), using n div 2 processors.
Example for n=8 and p=4 processors:
[Figure: binary-tree reduction of a0..a7; step j=0 uses P0-P3 on adjacent pairs, step j=1 uses P0 and P2, step j=2 uses P0 for the final sum.]
-
Reduction using Hyper-cube SIMD
[Figure: reduction on an 8-node hypercube; in each of three steps the active sub-cube halves (P0-P7, then P0-P3, then P0-P1), with partial sums sent along one dimension per step.]
-
Reduction using 2-D mesh
-
Reduction using 2-D mesh
Example: compute the total sum on a 4×4 mesh
Stage 1: each row is summed into a single column of partial sums.
-
Reduction using 2-D mesh
Stage 2: the column of partial sums is summed into one processor.
-
Parallel search algorithm
A PRAM with P processors and N unsorted numbers in memory (P < N). Problem: determine whether a given value x is present.
-
Parallel search algorithm
PRAM algorithm:
Step 1: inform every processor what x is.
Step 2: each processor checks N/P numbers and sets a flag on a match.
Step 3: check whether any flag is set to 1.
EREW: O(log P) step 1, O(N/P) step 2, O(log P) step 3.
CREW: O(1) step 1, O(N/P) step 2, O(log P) step 3.
CRCW (common): O(1) step 1, O(N/P) step 2, O(1) step 3.
-
PRAM strengths
Natural extension of RAM.
Simple and easy to understand.
Communication and synchronization issues are hidden.
Can be used as a benchmark: if an algorithm performs badly in the PRAM model, it will perform badly in reality.
A good PRAM program may not be practical, though.
-
PRAM weaknesses
Model inaccuracies:
Unbounded local memory (registers)
All operations take unit time
Unaccounted costs of non-local memory access:
Latency
Bandwidth
Memory access contention
-
Pros & Cons
Pros
Simple and clean semantics.
The majority of theoretical parallel algorithms are specified with the PRAM model.
The complexity of a parallel algorithm is independent of the communication network topology.
-
Pros & Cons (Contd)
Cons
Not realistic:
Too powerful a communication model
The algorithm designer is misled into using interprocessor communication without hesitation
Fully synchronized processors
No local memory
-
PRAM variations
Bounded-memory PRAM, PRAM(m): in a given step, only m memory accesses can be serviced.
Bounded number of processors: any problem that can be solved by a p-processor PRAM in t steps can be solved by a p′-processor PRAM (p′ < p) in t′ = O(tp/p′) steps.
LPRAM (Local-memory PRAM): L units to access global memory; different access cost between local and remote memory. Any algorithm that runs on a p-processor PRAM runs on an LPRAM with a loss of at most a factor of L.
BPRAM (Block PRAM): L units for the first message, B units for each subsequent message.
-
PRAM summary
The RAM model is widely used.
PRAM is simple and easy to understand.
The PRAM model never reached far beyond the algorithms community, but it is getting more important as threaded programming becomes more popular.
-
BSP Model
BSP (Bulk Synchronous Parallel) model
Proposed by Leslie Valiant of Harvard University
Developed by W. F. McColl of Oxford University
-
BSP Computer Model
Distributed memory architecture with 3 components:
Nodes, each with a processor and local memory
Router (communication network): point-to-point message passing (or shared variables)
Barrier synchronization facility: synchronizes all processors or a subset
-
Illustration of BSP
[Figure: nodes, each a processor P with local memory M, connected by a communication network (parameter g) and a barrier facility (parameter l); per-node computation is parameter w.]
-
Three Parameters
w parameter: maximum computation time within each superstep; a computation operation takes at most w cycles.
g parameter: number of cycles to communicate a unit message when all processors are involved in communication (network bandwidth); equals (total number of local operations performed by all processors in one second) / (total number of words delivered by the communication network in one second). With h-relation coefficient h, a communication operation takes g·h cycles.
l parameter: barrier synchronization takes l cycles.
-
BSP Program
A BSP computation consists of S supersteps.
A superstep is a sequence of computation and communication steps followed by a barrier synchronization.
Any remote memory accesses take effect at the barrier: loosely synchronous.
-
BSP Program
[Figure: processors P1-P4 each compute, then communicate, then meet at a barrier; this forms superstep 1, and superstep 2 begins after the barrier.]
-
BSP Algorithm Design
Two variables for time complexity:
N: number of time steps (local operations) per superstep.
h-relation: each node sends and receives at most h messages; g·h time steps for communication within a superstep.
Time complexity of a superstep: Valiant's original, max{w, g·h, l} (overlapped communication and computation); McColl's revised, N + g·h + l, giving a total of T = N_tot + N_com·g + S·l over S supersteps.
Algorithm design: make N_tot = T_seq / p, and minimize N_com and S.
-
Example
Inner product with 8 processors
Superstep 1
Computation: local sums, w = 2N/8
Communication: 1-relation (procs. 0,2,4,6 -> procs. 1,3,5,7)
Superstep 2
Computation: one addition, w = 1
Communication: 1-relation (procs. 1,5 -> procs. 3,7)
Superstep 3
Computation: one addition, w = 1
Communication: 1-relation (proc. 3 -> proc. 7)
-
Example (Contd)
Superstep 4
Computation: one addition, w = 1
Total execution time = 2N/n + log n · (g + l + 1)
g·log n: communication overhead; l·log n: synchronization overhead
-
BSP Programming Library
A more abstract level of programming than message passing.
BSPlib library: Oxford BSP library + Green BSP.
One-sided DRMA (direct remote memory access) shared-memory operations: dynamic shared variables through registration.
BSMP (bulk synchronous message passing): combines small messages into a large one.
Additional features: message reordering.
-
BSP Programming Library (Contd)
Extensions: Paderborn University BSP supports collective communication and group management facilities.
Cost model for performance analysis and prediction.
Good for debugging: the global state is visible at superstep boundaries.
-
Programming Example
All sums (prefix sums) using the logarithmic technique:
Calculate the partial sums of p integers stored on p processors.
log p supersteps; example with 4 processors.
-
BSP Program Example
int bsp_allsum( int x ) {
    int i, left, right;
    bsp_push_register( &left, sizeof(int) );
    bsp_sync();
    right = x;
    for( i = 1; i < bsp_nprocs(); i *= 2 ) {
        if( bsp_pid() + i < bsp_nprocs() )
            bsp_put( bsp_pid() + i, &right, &left, 0, sizeof(int) );
        bsp_sync();
        if( bsp_pid() >= i )
            right = left + right;
    }
    bsp_pop_register( &left );
    return right;
}
-
All Sums Using the Logarithmic Technique
Initial values:  1  2  3  4
After step 1:    1  3  5  7
After step 2:    1  3  6  10
-
LogP model
Common MPP organization: complete machines (processor plus memory) connected by a network.
LogP attempts to capture the characteristics of such an organization.
PRAM model: shared memory.
[Figure: several processor-memory (P, M) nodes attached to a network.]
-
Deriving LogP model
Processing: powerful microprocessor, large DRAM, cache => P
Communication:
significant latency => L
limited bandwidth => g
significant overhead (on both ends) => o
limited network capacity
No consensus on topology => the model should not exploit structure.
No consensus on programming model => the model should not enforce one.
-
LogP
[Figure: P processor-memory modules attached to an interconnection network.]
L (latency): latency in sending a (small) message between modules.
o (overhead): overhead felt by the processor on sending or receiving a message.
g (gap): minimum gap between successive sends or receives (the reciprocal of per-processor bandwidth).
P: number of processors.
Limited capacity: at most L/g messages in transit to or from any processor.
-
Using the model
Sending n messages from one processor to another takes 2o + L + g(n-1) time:
each processor spends o·n cycles on overhead and has (g-o)(n-1) + L compute cycles available.
Sending n messages from one processor to many takes the same time.
Sending n messages from many processors to one also takes the same time, but all except L/g processors block, so fewer compute cycles are available.
[Figure: timeline showing overhead o at sender and receiver, latency L, and gap g between successive sends.]
-
Using the model
Two processors send n words to each other: 2o + L + g(n-1).
This assumes no network contention, so it can under-estimate the communication time.
-
LogP philosophy
Think about:
the mapping of a task onto P processors
computation within a processor, its cost, and its balance
communication between processors, its cost, and its balance
given a characterization of processor and network performance.
Do not think about what happens within the network.
-
Develop an optimal broadcast algorithm based on the LogP model
Broadcast a single datum to the other P-1 processors.
-
Strengths of the LogP model
Simple: only 4 parameters.
Can easily be used to guide algorithm development, especially for communication routines.
The model has been used to analyze many collective communication algorithms.
-
Weaknesses of the LogP model
Accurate only at a very low level (machine instruction level).
Inaccurate for more practical communication systems with layers of protocols (e.g. TCP/IP).
Many variations exist: the LogP family of models (LogGP, LogGPC, pLogP, etc.) makes the model more accurate, and also more complex.
-
BSP vs. LogP
BSP differs from LogP in three ways:
LogP uses a form of message passing based on pairwise synchronization.
LogP adds an extra parameter representing the overhead involved in sending a message, which applies to every communication.
LogP defines g in local terms: it regards the network as having a finite capacity and treats g as the minimal permissible gap between message sends from a single process. In both models g is the reciprocal of the available per-processor network bandwidth, but BSP takes a global view of g while LogP takes a local view.
-
BSP vs. LogP
When analyzing the performance of the LogP model, it is often necessary (or convenient) to use barriers.
Message overhead is present but decreasing: the only overhead is transferring the message from user space to a system buffer.
LogP + barriers - overhead = BSP
Each model can efficiently simulate the other.
-
BSP vs. PRAM
BSP can be regarded as a generalization of the PRAM model.
If the BSP architecture has a small value of g (g = 1), it can be regarded as a PRAM; hashing can be used to automatically achieve efficient memory management.
The value of l determines the degree of parallel slackness required to achieve optimal efficiency.
l = g = 1 corresponds to an idealized PRAM where no slackness is required.
-
BSPlib
Supports an SPMD style of programming.
The library is available in C and FORTRAN.
Implementations were available (several years ago) for:
Cray T3E
IBM SP2
SGI PowerChallenge
Convex Exemplar
Hitachi SR2001
Various workstation clusters
Allows direct remote memory access or message passing.
Includes support for unbuffered messages for high-performance computing.
-
BSPlib
Initialization functions:
bsp_init(): simulate dynamic processes
bsp_begin(): start of SPMD code
bsp_end(): end of SPMD code
Enquiry functions:
bsp_pid(): find my process id
bsp_nprocs(): number of processes
bsp_time(): local time
Synchronization functions:
bsp_sync(): barrier synchronization
DRMA functions:
bsp_pushregister(): make a region globally visible
bsp_popregister(): remove global visibility
bsp_put(): push to remote memory
bsp_get(): pull from remote memory
-
BSPlib
BSMP functions:
bsp_set_tag_size(): choose tag size
bsp_send(): send to a remote queue
bsp_get_tag(): match tag with message
bsp_move(): fetch from queue
Halt functions:
bsp_abort(): one process halts all
High-performance functions:
bsp_hpput(), bsp_hpget(), bsp_hpmove(): unbuffered versions of the communication primitives
-
BSPlib Examples
Static Hello World

void main( void )
{
    bsp_begin( bsp_nprocs());
    printf( "Hello BSP from %d of %d\n", bsp_pid(), bsp_nprocs());
    bsp_end();
}

Dynamic Hello World

int nprocs; /* global variable */
void spmd_part( void )
{
    bsp_begin( nprocs );
    printf( "Hello BSP from %d of %d\n", bsp_pid(), bsp_nprocs());
}
void main( int argc, char **argv )
{
    bsp_init( spmd_part, argc, argv );
    nprocs = ReadInteger();
    spmd_part();
}
-
BSPlib Examples
Serialized printing of Hello World (shows synchronization)

void main( void )
{
    int ii;
    bsp_begin( bsp_nprocs());
    for( ii = 0; ii < bsp_nprocs(); ii++ ) {
        if( bsp_pid() == ii )
            printf( "Hello BSP from %d of %d\n", bsp_pid(), bsp_nprocs());
        fflush( stdout );
        bsp_sync();
    }
    bsp_end();
}
-
BSPlib Examples
All sums version 1 ( lg( p ) supersteps )

int bsp_allsums1( int x ){
    int ii, left, right;
    bsp_pushregister( &left, sizeof( int ));
    bsp_sync();
    right = x;
    for( ii = 1; ii < bsp_nprocs(); ii *= 2 ) {
        if( bsp_pid() + ii < bsp_nprocs() )
            bsp_put( bsp_pid() + ii, &right, &left, 0, sizeof( int ));
        bsp_sync();
        if( bsp_pid() >= ii )
            right = left + right;
    }
    bsp_popregister( &left );
    return( right );
}
-
BSPlib Examples
All sums version 2 (one superstep)

int bsp_allsums2( int x ) {
    int ii, result = 0, *array = calloc( bsp_nprocs(), sizeof(int));
    if( array == NULL )
        bsp_abort( "Unable to allocate %d element array", bsp_nprocs());
    bsp_pushregister( array, bsp_nprocs()*sizeof( int ));
    bsp_sync();
    for( ii = bsp_pid(); ii < bsp_nprocs(); ii++ )   /* put my x on every later proc */
        bsp_put( ii, &x, array, bsp_pid()*sizeof(int), sizeof(int));
    bsp_sync();
    for( ii = 0; ii <= bsp_pid(); ii++ )
        result += array[ii];
    bsp_popregister( array );
    free( array );
    return( result );
}
-
BSPlib vs. PVM and/or MPI
MPI/PVM are widely implemented and widely used.
Both have huge APIs.
Both may be inefficient on (distributed-) shared memory systems, where communication and synchronization are decoupled. This is true for DSM machines with one-sided communication.
-
Both are based on pairwise synchronization rather than barrier synchronization.
Neither has a simple cost model for performance prediction, nor a simple means of examining the global state.
BSP could be implemented using a small, carefully chosen subset of MPI subroutines.
-
Conclusion
BSP is a computational model of parallel computing based on the concept of supersteps.
BSP does not use locality of reference for the assignment of processes to processors.
Predictability is defined in terms of three parameters.
BSP is a generalization of PRAM.
BSP = LogP + barriers - overhead
BSPlib has a much smaller API than MPI/PVM.
-
Data Parallelism
Independent tasks apply the same operation to different elements of a data set.
It is okay to perform the operations concurrently.
Speedup: potentially p-fold, where p = number of processors.

for i ← 0 to 99 do
    a[i] ← b[i] + c[i]
endfor
-
Functional Parallelism
Independent tasks apply different operations to different data elements.
The first and second statements are independent of each other, as are the third and fourth.
Speedup: limited by the number of concurrent sub-tasks.

a ← 2
b ← 3
m ← (a + b) / 2
s ← (a² + b²) / 2
v ← s - m²
-
Another example: the Sieve of Eratosthenes, a classical prime-finding algorithm.
We want to find the prime numbers less than or equal to some positive integer n.
Let us do this example for n = 30.
Sieve of Eratosthenes
-
Sieve of Eratosthenes
-
Sieve of Eratosthenes: serial pseudocode
-
Sieve of Eratosthenes
Control parallel solution
-
Sieve of Eratosthenes
Data parallelism approach
-
Data parallelism (contd.)