PRAM Machine Models
-
8/4/2019 Pram Machine Models
1/72
RAM, PRAM, and LogP models
-
Why models?
What is a machine model? An abstraction that describes the operation of a machine, allowing us to associate a value (cost) with each machine operation.
Why do we need models?
Make it easier to reason about algorithms
Hide machine implementation details so that general results that apply to a broad class of machines can be obtained
Analyze the achievable complexity bounds (time, space, etc.)
Analyze the maximum parallelism
Models are directly related to algorithms.
-
RAM (random access machine) model
Memory consists of an infinite array of memory cells.
Instructions are executed sequentially, one at a time.
All instructions take unit time:
Load/store
Arithmetic
Logic
The running time of an algorithm is the number of instructions executed.
The memory requirement is the number of memory cells used by the algorithm.
-
RAM (random access machine) model
The RAM model is the basis of algorithm analysis for sequential algorithms, although it is not perfect, for the following reasons:
Memory is not infinite
Not all memory accesses take the same time
Not all arithmetic operations take the same time
Instruction pipelining is not taken into consideration
Even so, the RAM model (with asymptotic analysis) often gives relatively realistic results.
-
PRAM (Parallel RAM)
An unbounded collection of processors.
Each processor has an infinite number of registers.
An unbounded collection of shared memory cells.
All processors can access any memory cell in unit time (when there is no memory conflict).
All processors execute PRAM instructions synchronously (some processors may idle).
Each PRAM instruction executes in a 3-phase cycle:
Read from a shared memory cell (if needed)
Computation
Write to a shared memory cell (if needed)
-
PRAM (Parallel RAM)
The only way processors exchange data is through the shared memory.
Parallel time complexity: the number of synchronous steps in the algorithm.
Space complexity: the number of shared memory cells used.
Parallelism: the number of processors used.
-
Illustration of PRAM
P processors P1, P2, P3, ..., Pp connected to a single shared memory, driven by a common clock (CLK).
Each processor has a unique index.
A single program is executed in MIMD mode.
All processors operate in a synchronous manner (with infinite shared memory and infinite local memory).
-
Amit Chhabra, Deptt. of Computer Sci. & Engg.
Communication in PRAM
-
PRAM further refinement
PRAMs are further classified based on how memory conflicts are resolved.
Read:
Exclusive Read (ER): processors can simultaneously read only from distinct memory locations (not the same location).
What if two processors want to read from the same location?
Concurrent Read (CR): processors can simultaneously read from any memory location, including the same one.
-
PRAM further refinement
PRAMs are further classified based on how memory conflicts are resolved.
Write:
Exclusive Write (EW): processors can simultaneously write only to distinct memory locations (not the same location).
Concurrent Write (CW): processors can simultaneously write to any memory location.
Common CW: simultaneous writes to the same location are allowed only if all processors write the same value.
Random CW: randomly pick one of the written values.
Priority CW: processors have priorities; the value from the highest-priority processor wins.
-
PRAM model variations
EREW, CREW, CRCW (common), CRCW (random), CRCW (priority)
Which model is closer to practical SMP machines?
Model A is computationally stronger than model B if and only if any algorithm written for B runs unchanged on A.
EREW is the weakest of these models; priority CRCW is the strongest.
-
PRAM algorithm example
SUM: add N numbers stored in memory M[0, 1, ..., N-1]
Sequential SUM algorithm (O(N) complexity):

    sum = 0;
    for (i = 0; i < N; i++)
        sum += M[i];
-
PRAM SUM algorithm
Which PRAM model?
-
PRAM SUM algorithm complexity
Time complexity? O(log N) parallel steps.
Number of processors needed? N/2.
Speedup (vs. the sequential program): O(N / log N).
-
Basic Parallel Algorithms
Three elementary problems to be considered:
Reduction
Broadcast
Prefix sums
Target architectures:
Hypercube SIMD model
2-D mesh SIMD model
UMA multiprocessor model
Hypercube multicomputer
-
REDUCTION
Cost Optimal PRAM Algorithm for the Reduction Problem
Cost-optimal PRAM algorithm complexity: O(log n), using n div 2 processors.
Example for n=8 and p=4 processors:
[Figure: binary-tree reduction of a0..a7; step j=0 uses P0-P3 on adjacent pairs, step j=1 uses P0 and P2, step j=2 uses P0 for the final sum.]
-
Reduction using Hyper-cube SIMD
[Figure: reduction on an 8-node hypercube; in each of three steps the active sub-cube halves (P0-P7, then P0-P3, then P0-P1), with partial sums sent along one dimension per step.]
-
Reduction using 2-D mesh
-
Reduction using 2-D mesh
Example: compute the total sum on a 4×4 mesh
Stage 1: each row is summed into a single column of partial sums.
-
Reduction using 2-D mesh
Stage 2: the column of partial sums is summed into one processor.
-
Parallel search algorithm
A PRAM with P processors and N unsorted numbers in memory (P < N). Problem: determine whether a given value x is present.
-
Parallel search algorithm
PRAM algorithm:
Step 1: inform every processor what x is.
Step 2: each processor checks N/P numbers and sets a flag on a match.
Step 3: check whether any flag is set to 1.
EREW: O(log P) step 1, O(N/P) step 2, O(log P) step 3.
CREW: O(1) step 1, O(N/P) step 2, O(log P) step 3.
CRCW (common): O(1) step 1, O(N/P) step 2, O(1) step 3.
-
PRAM strengths
Natural extension of RAM.
Simple and easy to understand.
Communication and synchronization issues are hidden.
Can be used as a benchmark: if an algorithm performs badly in the PRAM model, it will perform badly in reality.
A good PRAM program may not be practical, though.
-
PRAM weaknesses
Model inaccuracies:
Unbounded local memory (registers)
All operations take unit time
Unaccounted costs of non-local memory access:
Latency
Bandwidth
Memory access contention
-
Pros & Cons
Pros
Simple and clean semantics.
The majority of theoretical parallel algorithms are specified with the PRAM model.
The complexity of a parallel algorithm is independent of the communication network topology.
-
Pros & Cons (Contd)
Cons
Not realistic:
Too powerful a communication model
The algorithm designer is misled into using interprocessor communication without hesitation
Fully synchronized processors
No local memory
-
PRAM variations
Bounded-memory PRAM, PRAM(m): in a given step, only m memory accesses can be serviced.
Bounded number of processors: any problem that can be solved by a p-processor PRAM in t steps can be solved by a p′-processor PRAM (p′ < p) in t′ = O(tp/p′) steps.
LPRAM (Local-memory PRAM): L units to access global memory; different access cost between local and remote memory. Any algorithm that runs on a p-processor PRAM runs on an LPRAM with a loss of at most a factor of L.
BPRAM (Block PRAM): L units for the first message, B units for each subsequent message.
-
PRAM summary
The RAM model is widely used.
PRAM is simple and easy to understand.
The PRAM model never reached far beyond the algorithms community, but it is getting more important as threaded programming becomes more popular.
-
BSP Model
BSP (Bulk Synchronous Parallel) model
Proposed by Leslie Valiant of Harvard University
Developed by W. F. McColl of Oxford University
-
BSP Computer Model
Distributed memory architecture with 3 components:
Nodes, each with a processor and local memory
Router (communication network): point-to-point message passing (or shared variables)
Barrier synchronization facility: synchronizes all processors or a subset
-
Illustration of BSP
[Figure: nodes, each a processor P with local memory M, connected by a communication network (parameter g) and a barrier facility (parameter l); per-node computation is parameter w.]
-
Three Parameters
w parameter: maximum computation time within each superstep; a computation operation takes at most w cycles.
g parameter: number of cycles to communicate a unit message when all processors are involved in communication (network bandwidth); equals (total number of local operations performed by all processors in one second) / (total number of words delivered by the communication network in one second). With h-relation coefficient h, a communication operation takes g·h cycles.
l parameter: barrier synchronization takes l cycles.
-
BSP Program
A BSP computation consists of S supersteps.
A superstep is a sequence of computation and communication steps followed by a barrier synchronization.
Any remote memory accesses take effect at the barrier: loosely synchronous.
-
BSP Program
[Figure: processors P1-P4 each compute, then communicate, then meet at a barrier; this forms superstep 1, and superstep 2 begins after the barrier.]
-
BSP Algorithm Design
Two variables for time complexity:
N: number of time steps (local operations) per superstep.
h-relation: each node sends and receives at most h messages; g·h time steps for communication within a superstep.
Time complexity of a superstep: Valiant's original, max{w, g·h, l} (overlapped communication and computation); McColl's revised, N + g·h + l, giving a total of T = N_tot + N_com·g + S·l over S supersteps.
Algorithm design: make N_tot = T_seq / p, and minimize N_com and S.
-
Example
Inner product with 8 processors
Superstep 1
Computation: local sums, w = 2N/8
Communication: 1-relation (procs. 0,2,4,6 -> procs. 1,3,5,7)
Superstep 2
Computation: one addition, w = 1
Communication: 1-relation (procs. 1,5 -> procs. 3,7)
Superstep 3
Computation: one addition, w = 1
Communication: 1-relation (proc. 3 -> proc. 7)
-
Example (Contd)
Superstep 4
Computation: one addition, w = 1
Total execution time = 2N/n + log n · (g + l + 1)
g·log n: communication overhead; l·log n: synchronization overhead
-
BSP Programming Library
A more abstract level of programming than message passing.
BSPlib library: Oxford BSP library + Green BSP.
One-sided DRMA (direct remote memory access) shared-memory operations: dynamic shared variables through registration.
BSMP (bulk synchronous message passing): combines small messages into a large one.
Additional features: message reordering.
-
BSP Programming Library (Contd)
Extensions: Paderborn University BSP supports collective communication and group management facilities.
Cost model for performance analysis and prediction.
Good for debugging: the global state is visible at superstep boundaries.
-
Programming Example
All sums (prefix sums) using the logarithmic technique:
Calculate the partial sums of p integers stored on p processors.
log p supersteps; example with 4 processors.
-
BSP Program Example
int bsp_allsum( int x ) {
    int i, left, right;
    bsp_push_register( &left, sizeof(int) );
    bsp_sync();
    right = x;
    for( i = 1; i < bsp_nprocs(); i *= 2 ) {
        if( bsp_pid() + i < bsp_nprocs() )
            bsp_put( bsp_pid() + i, &right, &left, 0, sizeof(int) );
        bsp_sync();
        if( bsp_pid() >= i )
            right = left + right;
    }
    bsp_pop_register( &left );
    return right;
}
-
All Sums Using the Logarithmic Technique
Initial values:  1  2  3  4
After step 1:    1  3  5  7
After step 2:    1  3  6  10
-
LogP model
Common MPP organization: complete machines (processor plus memory) connected by a network.
LogP attempts to capture the characteristics of such an organization.
PRAM model: shared memory.
[Figure: several processor-memory (P, M) nodes attached to a network.]
-
Deriving LogP model
Processing: powerful microprocessor, large DRAM, cache => P
Communication:
significant latency => L
limited bandwidth => g
significant overhead (on both ends) => o
limited network capacity
No consensus on topology => the model should not exploit structure.
No consensus on programming model => the model should not enforce one.
-
LogP
[Figure: P processor-memory modules attached to an interconnection network.]
L (latency): latency in sending a (small) message between modules.
o (overhead): overhead felt by the processor on sending or receiving a message.
g (gap): minimum gap between successive sends or receives (the reciprocal of per-processor bandwidth).
P: number of processors.
Limited capacity: at most L/g messages in transit to or from any processor.
-
Using the model
Sending n messages from one processor to another takes 2o + L + g(n-1) time:
each processor spends o·n cycles on overhead and has (g-o)(n-1) + L compute cycles available.
Sending n messages from one processor to many takes the same time.
Sending n messages from many processors to one also takes the same time, but all except L/g processors block, so fewer compute cycles are available.
[Figure: timeline showing overhead o at sender and receiver, latency L, and gap g between successive sends.]
-
Using the model
Two processors send n words to each other: 2o + L + g(n-1).
This assumes no network contention, so it can under-estimate the communication time.
-
LogP philosophy
Think about:
the mapping of a task onto P processors
computation within a processor, its cost, and its balance
communication between processors, its cost, and its balance
given a characterization of processor and network performance.
Do not think about what happens within the network.
-
Develop an optimal broadcast algorithm based on the LogP model
Broadcast a single datum to the other P-1 processors.
-
Strengths of the LogP model
Simple: only 4 parameters.
Can easily be used to guide algorithm development, especially for communication routines.
The model has been used to analyze many collective communication algorithms.
-
Weaknesses of the LogP model
Accurate only at a very low level (machine instruction level).
Inaccurate for more practical communication systems with layers of protocols (e.g. TCP/IP).
Many variations exist: the LogP family of models (LogGP, LogGPC, pLogP, etc.) makes the model more accurate, and also more complex.
-
BSP vs. LogP
BSP differs from LogP in three ways:
LogP uses a form of message passing based on pairwise synchronization.
LogP adds an extra parameter representing the overhead involved in sending a message, which applies to every communication.
LogP defines g in local terms: it regards the network as having a finite capacity and treats g as the minimal permissible gap between message sends from a single process. In both models g is the reciprocal of the available per-processor network bandwidth, but BSP takes a global view of g while LogP takes a local view.
-
BSP vs. LogP
When analyzing the performance of the LogP model, it is often necessary (or convenient) to use barriers.
Message overhead is present but decreasing: the only overhead is transferring the message from user space to a system buffer.
LogP + barriers - overhead = BSP
Each model can efficiently simulate the other.
-
BSP vs. PRAM
BSP can be regarded as a generalization of the PRAM model.
If the BSP architecture has a small value of g (g = 1), it can be regarded as a PRAM; hashing can be used to automatically achieve efficient memory management.
The value of l determines the degree of parallel slackness required to achieve optimal efficiency.
l = g = 1 corresponds to an idealized PRAM where no slackness is required.
-
BSPlib
Supports an SPMD style of programming.
The library is available in C and FORTRAN.
Implementations were available (several years ago) for:
Cray T3E
IBM SP2
SGI PowerChallenge
Convex Exemplar
Hitachi SR2001
Various workstation clusters
Allows direct remote memory access or message passing.
Includes support for unbuffered messages for high-performance computing.
-
BSPlib
Initialization functions:
bsp_init(): simulate dynamic processes
bsp_begin(): start of SPMD code
bsp_end(): end of SPMD code
Enquiry functions:
bsp_pid(): find my process id
bsp_nprocs(): number of processes
bsp_time(): local time
Synchronization functions:
bsp_sync(): barrier synchronization
DRMA functions:
bsp_pushregister(): make a region globally visible
bsp_popregister(): remove global visibility
bsp_put(): push to remote memory
bsp_get(): pull from remote memory
-
BSPlib
BSMP functions:
bsp_set_tag_size(): choose tag size
bsp_send(): send to a remote queue
bsp_get_tag(): match tag with message
bsp_move(): fetch from queue
Halt functions:
bsp_abort(): one process halts all
High-performance functions:
bsp_hpput(), bsp_hpget(), bsp_hpmove(): unbuffered versions of the communication primitives
-
BSPlib Examples
Static Hello World

void main( void )
{
    bsp_begin( bsp_nprocs());
    printf( "Hello BSP from %d of %d\n", bsp_pid(), bsp_nprocs());
    bsp_end();
}

Dynamic Hello World

int nprocs; /* global variable */
void spmd_part( void )
{
    bsp_begin( nprocs );
    printf( "Hello BSP from %d of %d\n", bsp_pid(), bsp_nprocs());
}
void main( int argc, char **argv )
{
    bsp_init( spmd_part, argc, argv );
    nprocs = ReadInteger();
    spmd_part();
}
-
BSPlib Examples
Serialized printing of Hello World (shows synchronization)

void main( void )
{
    int ii;
    bsp_begin( bsp_nprocs());
    for( ii = 0; ii < bsp_nprocs(); ii++ ) {
        if( bsp_pid() == ii )
            printf( "Hello BSP from %d of %d\n", bsp_pid(), bsp_nprocs());
        fflush( stdout );
        bsp_sync();
    }
    bsp_end();
}
-
BSPlib Examples
All sums version 1 ( lg( p ) supersteps )

int bsp_allsums1( int x ){
    int ii, left, right;
    bsp_pushregister( &left, sizeof( int ));
    bsp_sync();
    right = x;
    for( ii = 1; ii < bsp_nprocs(); ii *= 2 ) {
        if( bsp_pid() + ii < bsp_nprocs() )
            bsp_put( bsp_pid() + ii, &right, &left, 0, sizeof( int ));
        bsp_sync();
        if( bsp_pid() >= ii )
            right = left + right;
    }
    bsp_popregister( &left );
    return( right );
}
-
BSPlib Examples
All sums version 2 (one superstep)

int bsp_allsums2( int x ) {
    int ii, result = 0, *array = calloc( bsp_nprocs(), sizeof(int));
    if( array == NULL )
        bsp_abort( "Unable to allocate %d element array", bsp_nprocs());
    bsp_pushregister( array, bsp_nprocs()*sizeof( int ));
    bsp_sync();
    for( ii = bsp_pid(); ii < bsp_nprocs(); ii++ )   /* put my x on every later proc */
        bsp_put( ii, &x, array, bsp_pid()*sizeof(int), sizeof(int));
    bsp_sync();
    for( ii = 0; ii <= bsp_pid(); ii++ )
        result += array[ii];
    bsp_popregister( array );
    free( array );
    return( result );
}
-
BSPlib vs. PVM and/or MPI
MPI/PVM are widely implemented and widely used.
Both have huge APIs.
Both may be inefficient on (distributed-) shared memory systems, where communication and synchronization are decoupled. This is true for DSM machines with one-sided communication.
-
Both are based on pairwise synchronization rather than barrier synchronization.
Neither has a simple cost model for performance prediction, nor a simple means of examining the global state.
BSP could be implemented using a small, carefully chosen subset of MPI subroutines.
-
Conclusion
BSP is a computational model of parallel computing based on the concept of supersteps.
BSP does not use locality of reference for the assignment of processes to processors.
Predictability is defined in terms of three parameters.
BSP is a generalization of PRAM.
BSP = LogP + barriers - overhead
BSPlib has a much smaller API than MPI/PVM.
-
Data Parallelism
Independent tasks apply the same operation to different elements of a data set.
It is okay to perform the operations concurrently.
Speedup: potentially p-fold, where p = number of processors.

for i ← 0 to 99 do
    a[i] ← b[i] + c[i]
endfor
-
Functional Parallelism
Independent tasks apply different operations to different data elements.
The first and second statements are independent of each other, as are the third and fourth.
Speedup: limited by the number of concurrent sub-tasks.

a ← 2
b ← 3
m ← (a + b) / 2
s ← (a² + b²) / 2
v ← s - m²
-
Another example: the Sieve of Eratosthenes, a classical prime-finding algorithm.
We want to find the prime numbers less than or equal to some positive integer n.
Let us do this example for n = 30.
Sieve of Eratosthenes
-
Sieve of Eratosthenes
-
Sieve of Eratosthenes: serial pseudocode
-
Sieve of Eratosthenes
Control parallel solution
-
Sieve of Eratosthenes
Data parallelism approach
-
Data parallelism (contd.)