Transcript of FIT5174 Parallel & Distributed Systems, Dr. Ronald Pose, Lecture 7 - 2013

Page 1:

FIT5174 Distributed & Parallel Systems

Lecture 7

Parallel Computer System Architectures

Page 2:

Acknowledgement

These slides are based on slides and material by:

Carlo Kopp

Page 3:

Parallel Computing

• Parallel computing is a form of computation in which many instructions are carried out simultaneously.
• It operates on the principle that large problems can often be divided into smaller ones, which are then solved concurrently (i.e. at the same time).
• There are several different forms of parallel computing: bit-level parallelism, instruction-level parallelism, data parallelism, and task parallelism.

[Figure: serial computing vs. parallel computing]

Page 4:

Parallel Computing

• Contemporary computer applications require the processing of large amounts of data in sophisticated ways. Examples include:
  – parallel databases, data mining
  – oil exploration
  – web search engines, web-based business services
  – computer-aided diagnosis in medicine
  – management of national and multi-national corporations
  – advanced graphics and virtual reality, particularly in the entertainment industry
  – networked video and multimedia technologies
  – collaborative work environments

• Ultimately, parallel computing is an attempt to minimise the time required to compute a problem, despite the performance limitations of individual CPUs/cores.

Page 5:

Parallel Computing Terminology

• There are different ways to classify parallel computers. One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy.

• Flynn's taxonomy distinguishes multi-processor computer architectures according to two independent dimensions, Instruction and Data. Each of these dimensions can have only one of two possible states: Single or Multiple.

• The four possible classifications according to Flynn:
  – SISD: Single Instruction, Single Data
  – SIMD: Single Instruction, Multiple Data
  – MISD: Multiple Instruction, Single Data
  – MIMD: Multiple Instruction, Multiple Data

Page 6:

Concepts and Terminology

• At the executable machine code level, programs are seen by the processor or core as a series of machine instructions, in some machine specific binary code;

• The common format of any instruction is that of an “operation code” or “opcode” and some “operands”, which are arguments the processor/core can understand;

• Typically, operands are held in registers in the processor/core which store several bytes of data, or memory addresses pointing to locations in the machine’s main memory;

• In a “conventional” or “general purpose” processor/core a single instruction combines one opcode with two or three operands, e.g.

ADD R1, R2, R3 – add contents of R1 and R2, put result into R3

Page 7:

Flynn’s Classification

Page 8:

Flynn’s Classification - SISD

Single Instruction, Single Data (SISD):
• A serial (non-parallel or “conventional”) computer
• Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
• Single data: only one data stream is being used as input during any one clock cycle
• Deterministic execution
• This is the oldest and, until recently, the most prevalent form of computer
• Examples: most PCs, single-CPU workstations and mainframes

Page 9:

Flynn’s Classification - SIMD

Single Instruction, Multiple Data (SIMD):
• A type of parallel computer
• Single instruction: all processing units execute the same instruction at any given clock cycle
• Multiple data: each processing unit can operate on a different data element
• This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity instruction units.
• Best suited for specialized problems characterized by a high degree of regularity, such as image processing, matrix algebra, etc.
• Synchronous (lockstep) and deterministic execution
• Two varieties: Processor Arrays and Vector Pipelines
• Examples:
  – Processor Arrays: Connection Machine CM-2, MasPar MP-1, MP-2
  – Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820
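As an illustration of what "single instruction, multiple data" means at the machine level, the sketch below (not from the original slides) uses the x86 SSE intrinsics from <xmmintrin.h>, assuming a CPU and compiler that support SSE; a single _mm_add_ps instruction performs four floating-point additions at once:

  /* SIMD sketch: one instruction adds four floats in lockstep */
  #include <stdio.h>
  #include <xmmintrin.h>

  int main(void) {
      float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
      float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
      float c[4];

      __m128 va = _mm_loadu_ps(a);        /* load 4 floats into a vector register */
      __m128 vb = _mm_loadu_ps(b);
      __m128 vc = _mm_add_ps(va, vb);     /* single instruction, four additions */
      _mm_storeu_ps(c, vc);

      for (int i = 0; i < 4; i++)
          printf("%.1f ", c[i]);          /* prints 11.0 22.0 33.0 44.0 */
      printf("\n");
      return 0;
  }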

Page 10:

Flynn’s Classification - SIMD

Page 11:

Flynn’s Classification - MISD

Multiple Instruction, Single Data (MISD):
• A single data stream is fed into multiple processing units.
• Each processing unit operates on the data independently via independent instruction streams.
• Few actual examples of this class of parallel computer have ever existed. One was the experimental Carnegie-Mellon computer.
• Some conceivable uses might be:
  – multiple frequency filters operating on a single signal stream
  – multiple cryptography algorithms attempting to crack a single coded message.

Page 12:

Flynn’s Classification - MIMD

Multiple Instruction, Multiple Data (MIMD):
• Currently the most common type of parallel computer. Most modern computers fall into this category.
• Multiple instruction: every processor may be executing a different instruction stream
• Multiple data: every processor may be working with a different data stream
• Execution can be synchronous or asynchronous, deterministic or non-deterministic
• Examples: most current supercomputers, networked parallel computer "grids", and multi-processor SMP computers - including some types of PCs.

Page 13:

Parallel Computer Memory Architectures

• Broadly divided into three categories:
  – Shared memory
  – Distributed memory
  – Hybrid

Shared Memory

• Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as a global address space.
• Multiple processors can operate independently but share the same memory resources.
• Changes in a memory location effected by one processor are visible to all other processors.
• Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA;
• Uniform Memory Access vs Non-Uniform Memory Access models.
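A minimal sketch of the shared-memory idea, assuming POSIX threads are available (this example is not from the slides): two threads update the same counter in a single address space, and a mutex makes the updates consistent and visible to both:

  #include <pthread.h>
  #include <stdio.h>

  static long counter = 0;                      /* shared: one global address space */
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  static void *worker(void *arg) {
      (void)arg;
      for (int i = 0; i < 100000; i++) {
          pthread_mutex_lock(&lock);            /* synchronise access to shared memory */
          counter++;
          pthread_mutex_unlock(&lock);
      }
      return NULL;
  }

  int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, worker, NULL);
      pthread_create(&t2, NULL, worker, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      printf("counter = %ld\n", counter);       /* 200000: both threads saw the same memory */
      return 0;
  }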

Page 14:

Parallel Computer - Shared Memory

Page 15:

Parallel Computer - Distributed Memory

Distributed Memory

• Distributed memory systems require a communication network to connect inter-processor memory.

• Processors have their own local memory. There is no concept of global address space across all processors.

• Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of “cache coherency” does not apply.

• When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility.

• The network “fabric” used for data transfers varies widely; it can be as simple as Ethernet, or as complex as a specialised bus or switching device.
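To make the "programmer explicitly defines how and when data is communicated" point concrete, here is a minimal message-passing sketch (not from the slides), assuming an MPI library is installed and at least two tasks are started: task 0 sends a value that task 1 cannot otherwise see, because each task has only its own local memory:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank, value;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 42;                       /* exists only in task 0's local memory */
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("task 1 received %d over the network\n", value);
      }

      MPI_Finalize();
      return 0;
  }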

Page 16:

Parallel Computer - Distributed Memory

Page 17:

Parallel Computer - Hybrid Memory

Hybrid: The largest and fastest computers in the world today employ both shared and distributed memory architectures.

• The shared memory component is usually a cache coherent SMP machine. Processors on a given SMP can address that machine's memory as global.

• The distributed memory component is the networking of multiple SMPs. SMPs know only about their own memory - not the memory on another SMP. Therefore, network communications are required to move data from one SMP to another.

• Current trends seem to indicate that this type of memory architecture will continue to prevail and increase at the high end of computing for the foreseeable future.

• Advantages and Disadvantages: whatever is common to both shared and distributed memory architectures.
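A rough sketch of the hybrid model described above (not part of the original slides), assuming both an MPI library and an OpenMP-capable compiler: OpenMP threads share memory within one SMP node, while MPI moves data between nodes:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      double local_sum = 0.0, global_sum = 0.0;

      /* shared-memory parallelism inside one node (threads see local_sum directly) */
      #pragma omp parallel for reduction(+:local_sum)
      for (int i = 0; i < 1000000; i++)
          local_sum += 1.0 / (i + 1.0 + rank);

      /* distributed-memory communication between nodes */
      MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0)
          printf("global sum = %f\n", global_sum);

      MPI_Finalize();
      return 0;
  }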

Page 18:

Parallel Computer - Hybrid Memory

Page 19:

Parallel Programming Models

Overview

• There are several parallel programming models in common use:
  – Shared Memory
  – Threads
  – Message Passing
  – Data Parallel
  – Hybrid

• Parallel programming models exist as an abstraction above hardware and memory architectures.

• Although it might not seem apparent, these models are NOT specific to a particular type of machine or memory architecture. In fact, any of these models can (theoretically) be implemented on any underlying hardware.

Page 20:

Parallel Computing Performance

• General speed-up formula:

  Speedup = Sequential execution time / Parallel execution time

• Execution time components:
  – Inherently sequential computations: σ(n)
  – Potentially parallel computations: φ(n)
  – Communication operations: κ(n,p)

• Combining these, the speed-up ψ(n,p) on p processors is bounded by:

  ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))

Page 21:

Speed-up Formula

[Figure: the parallel execution time consists of computations (φ(n)/p) plus communications (κ(n,p)); the speed-up falls as the communications term grows relative to the computations term.]

Page 22:

Amdahl’s Law of Speed-up

• It states that a small portion of the program which cannot be parallelized will limit the overall speed-up available from parallelization.
• Any large mathematical or engineering problem will typically consist of several parallelizable parts and several non-parallelizable (sequential) parts. This relationship is given by the equation:

  S = 1 / (1 − P)

  where S is the speed-up of the program (as a factor of its original sequential runtime), and P is the fraction that is parallelizable.
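As a small worked illustration of the law (the function name amdahl_speedup is a hypothetical helper, not from the slides), the snippet below evaluates the more general form S = 1 / ((1 − P) + P/N) for N processors; as N grows, it approaches the 1/(1 − P) limit above:

  #include <stdio.h>

  /* Amdahl speed-up for parallel fraction p_frac on n processors */
  static double amdahl_speedup(double p_frac, int n) {
      return 1.0 / ((1.0 - p_frac) + p_frac / n);
  }

  int main(void) {
      int procs[] = {1, 2, 4, 8, 16, 1024};
      for (int i = 0; i < 6; i++)
          printf("P = 0.95, N = %4d  ->  speed-up %.2f\n",
                 procs[i], amdahl_speedup(0.95, procs[i]));
      /* the values approach 1 / (1 - 0.95) = 20 but never exceed it */
      return 0;
  }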

Page 23:

Interesting Amdahl Observation

• If the sequential portion of a program is 10% of the runtime, we can get no more than a 10× speed-up (1/0.1), regardless of how many processors are added.

• This puts an upper limit on the usefulness of adding more parallel execution units.

Page 24:

Amdahl’s Law

Page 25:

Parallel Efficiency

• Efficiency = Speedup / Processors, i.e. ε(n,p) = ψ(n,p)/p, which gives the bound

  ε(n,p) ≤ (σ(n) + φ(n)) / (p σ(n) + φ(n) + p κ(n,p))

• 0 ≤ ε(n,p) ≤ 1

• Amdahl’s law: starting from the speed-up bound and ignoring the communication term κ(n,p),

  ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p)

• Let f = σ(n)/(σ(n) + φ(n)); i.e., f is the fraction of the code which is inherently sequential. Then

  ψ ≤ 1 / (f + (1 − f)/p)

Page 26:

Examples

• 95% of a program’s execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?

  ψ ≤ 1 / (0.05 + (1 − 0.05)/8) ≈ 5.9

• 20% of a program’s execution time is spent within inherently sequential code. What is the limit to the speedup achievable by a parallel version of the program?

  lim(p→∞) 1 / (0.2 + (1 − 0.2)/p) = 1 / 0.2 = 5

Page 27:

Amdahl’s Law Limitations

Limitations of Amdahl’s Law
• Ignores κ(n,p), so it overestimates the achievable speedup
• Assumes f is constant, so it underestimates the speedup achievable as problems grow

Amdahl Effect
• Typically κ(n,p) has lower complexity than φ(n)/p
• As n increases, φ(n)/p dominates κ(n,p)
• As n increases, speedup increases
• As n increases, the sequential fraction f decreases

[Figure: speedup vs. number of processors for n = 100, n = 1,000 and n = 10,000; larger problem sizes give curves closer to linear speedup.]

Page 28:

Gustafson’s Law

• Gustafson's Law (also known as Gustafson-Barsis' law, 1988) states that any sufficiently large problem can be efficiently parallelized.

• Gustafson's Law is closely related to Amdahl's law, which gives a limit to the degree to which a program can be sped up due to parallelization.

  S(P) = P − α(P − 1)

  where P is the number of processors, S is the speedup, and α is the non-parallelizable fraction of the process.

• Gustafson's law addresses the shortcomings of Amdahl's law, which cannot scale to match the availability of computing power as the machine size increases.
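A short sketch contrasting the two laws (again an illustrative helper, not from the slides): for the same non-parallelizable fraction α, Amdahl's fixed-size bound saturates while Gustafson's scaled speed-up S(P) = P − α(P − 1) keeps growing with P:

  #include <stdio.h>

  static double amdahl(double alpha, int p)    { return 1.0 / (alpha + (1.0 - alpha) / p); }
  static double gustafson(double alpha, int p) { return p - alpha * (p - 1); }

  int main(void) {
      double alpha = 0.1;                      /* 10% non-parallelizable */
      int procs[] = {2, 8, 64, 1024};
      for (int i = 0; i < 4; i++)
          printf("P = %4d  Amdahl %.2f  Gustafson %.2f\n",
                 procs[i], amdahl(alpha, procs[i]), gustafson(alpha, procs[i]));
      /* Amdahl approaches 1/alpha = 10; Gustafson grows roughly linearly in P */
      return 0;
  }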

Page 29:

Gustafson’s Law

Also:

• It removes the assumption of a fixed problem size or fixed computational load on the parallel processors: instead, Gustafson proposes a fixed-time concept, which leads to scaled speed-up.
• Amdahl's law is based on a fixed workload or fixed problem size. It implies that the sequential part of a program does not change with respect to machine size (i.e. the number of processors), while the parallel part is evenly distributed over the processors.

Page 30:

Performance Summary

• Performance terms
  – Speedup
  – Efficiency

• What prevents linear speedup?
  – Serial operations
  – Communication operations
  – Process start-up
  – Imbalanced workloads
  – Architectural limitations

• Analyzing parallel performance
  – Amdahl’s Law
  – Gustafson-Barsis’ Law

Page 31:

Parallel Programming Examples

• This example demonstrates calculations on 2-dimensional array elements, with the computation on each array element being independent from other array elements.

• The serial program calculates one element at a time in sequential order.

• Serial code could be of the form:

  do j = 1, n
    do i = 1, m
      a(i,j) = fcn(i,j)
    end do
  end do

• The calculation of elements is independent of one another - this leads to an embarrassingly parallel situation.
• The problem should be computationally intensive.

Page 32:

Parallel Programming - 2D Example

• Array elements are distributed so that each processor owns a portion of the array (a subarray).
• Independent calculation of array elements ensures there is no need for communication between tasks.
• The distribution scheme is chosen by other criteria, e.g. unit stride (stride of 1) through the subarrays. Unit stride maximizes cache/memory usage.
• After the array is distributed, each task executes the portion of the loop corresponding to the data it owns. For example:

  do j = mystart, myend
    do i = 1, m
      a(i,j) = fcn(i,j)
    end do
  end do

• Notice that only the outer loop variables are different from the serial solution.

Page 33:

Pseudo-code

  find out if I am MASTER or WORKER

  if I am MASTER
    initialize the array
    send each WORKER info on part of array it owns
    send each WORKER its portion of initial array
    receive from each WORKER results
  else if I am WORKER
    receive from MASTER info on part of array I own
    receive from MASTER my portion of initial array
    # calculate my portion of array
    do j = my_first_column, my_last_column
      do i = 1, n
        a(i,j) = fcn(i,j)
      end do
    end do
    send MASTER results
  endif
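One possible C realisation of this master/worker pseudo-code, assuming MPI and that the number of columns divides evenly among the tasks; fcn(), M and N are placeholders chosen for illustration. MPI_Scatter plays the role of "send each WORKER its portion" and MPI_Gather of "receive from each WORKER results":

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define M 4          /* rows */
  #define N 8          /* columns (must divide by the number of tasks here) */

  static double fcn(int i, int j) { return (double)(i * N + j); }  /* placeholder work */

  int main(int argc, char **argv) {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int mycols = N / size;                 /* columns owned by this task */
      double *a = NULL;
      if (rank == 0) a = malloc((size_t)M * N * sizeof(double));       /* full array, MASTER only */
      double *mypart = malloc((size_t)M * mycols * sizeof(double));    /* my subarray */

      /* MASTER distributes the column-blocked array (initial contents are
         irrelevant here because every element is overwritten by fcn()) */
      MPI_Scatter(a, M * mycols, MPI_DOUBLE, mypart, M * mycols, MPI_DOUBLE,
                  0, MPI_COMM_WORLD);

      int mystart = rank * mycols;           /* global index of my first column */
      for (int j = 0; j < mycols; j++)       /* each task computes only its columns */
          for (int i = 0; i < M; i++)
              mypart[j * M + i] = fcn(i, mystart + j);   /* column-major, like a(i,j) */

      /* MASTER collects the computed sub-arrays back into the full array */
      MPI_Gather(mypart, M * mycols, MPI_DOUBLE, a, M * mycols, MPI_DOUBLE,
                 0, MPI_COMM_WORLD);

      if (rank == 0) {
          printf("a(1,1)=%g  a(%d,%d)=%g\n", a[0], M, N, a[(N - 1) * M + (M - 1)]);
          free(a);
      }
      free(mypart);
      MPI_Finalize();
      return 0;
  }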

Page 34:

Pi Calculation: Serial Solution

• The value of PI can be calculated in a number of ways. Consider the following method of approximating PI:
  – Inscribe a circle in a square
  – Randomly generate points in the square
  – Determine the number of points in the square that are also in the circle
  – Let r be the number of points in the circle divided by the number of points in the square
  – PI ≈ 4 r
  – Note that the more points generated, the better the approximation

Page 35:

Pi Calculation: Serial Solution

• Serial pseudo-code for this procedure:

  npoints = 10000
  circle_count = 0
  do j = 1, npoints
    generate 2 random numbers between 0 and 1
    xcoordinate = random1 ; ycoordinate = random2
    if (xcoordinate, ycoordinate) inside circle
      then circle_count = circle_count + 1
  end do
  PI = 4.0*circle_count/npoints

• Note that most of the time in running this program would be spent executing the loop.
• Leads to an embarrassingly parallel solution:
  – Computationally intensive
  – Minimal communication
  – Minimal I/O
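A direct C translation of this serial pseudo-code (an illustrative sketch, not from the slides), using the standard rand() generator for the two random coordinates:

  #include <stdio.h>
  #include <stdlib.h>

  int main(void) {
      long npoints = 1000000, circle_count = 0;
      srand(12345);                              /* fixed seed for repeatability */

      for (long j = 0; j < npoints; j++) {
          double x = (double)rand() / RAND_MAX;  /* random coordinate in [0,1] */
          double y = (double)rand() / RAND_MAX;
          if (x * x + y * y <= 1.0)              /* inside the quarter circle; the
                                                    area ratio is still PI/4 */
              circle_count++;
      }

      printf("PI is approximately %f\n", 4.0 * circle_count / npoints);
      return 0;
  }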

Page 36:

Pi Calculation: Parallel Solution

• Parallel strategy: break the loop into portions that can be executed by the tasks.

• For the task of approximating Pi:

– Each task executes its portion of the loop a number of times.

– Each task can do its work without requiring any information from the other tasks (there are no data dependencies).

– Uses the SPMD** model. One task acts as master and collects the results.

• Pseudo-code solution follows on the next slide; on the original slide, red highlighting marked the changes made for parallelism.

[**SPMD: (Single Process, Multiple Data) or (Single Program, Multiple Data). Tasks are split up and run simultaneously on multiple processors with different input in order to obtain results faster. SPMD is the most common style of parallel programming. It is a subcategory of MIMD in Flynn’s Taxonomy.]

Page 37:

Pi Calculation: Parallel Solution Pseudo-code

  npoints = 10000
  circle_count = 0

  p = number of tasks
  num = npoints/p

  find out if I am MASTER or WORKER

  do j = 1, num
    generate 2 random numbers between 0 and 1
    xcoordinate = random1 ; ycoordinate = random2
    if (xcoordinate, ycoordinate) inside circle
      then circle_count = circle_count + 1
  end do

  if I am MASTER
    receive from WORKERS their circle_counts
    compute PI (use MASTER and WORKER calculations)
  else if I am WORKER
    send to MASTER circle_count
  endif
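A possible MPI version of this pseudo-code in C (assuming an MPI installation; MPI_Reduce stands in for the explicit "receive from WORKERS their circle_counts" step by summing every task's count on the MASTER):

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv) {
      int rank, size;
      long npoints = 1000000, in_circle = 0, total_in_circle = 0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      long my_points = npoints / size;      /* each task runs its portion of the loop */
      srand(rank + 1);                      /* a simple per-task seed, for illustration */

      for (long j = 0; j < my_points; j++) {
          double x = (double)rand() / RAND_MAX;
          double y = (double)rand() / RAND_MAX;
          if (x * x + y * y <= 1.0)
              in_circle++;
      }

      /* MASTER (rank 0) collects and sums all circle counts */
      MPI_Reduce(&in_circle, &total_in_circle, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0)
          printf("PI is approximately %f\n",
                 4.0 * total_in_circle / (my_points * size));

      MPI_Finalize();
      return 0;
  }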

Page 38:

1-D Wave Equation Parallel Solution

• Implement as an SPMD model
• The entire amplitude array is partitioned and distributed as sub-arrays to all tasks. Each task owns a portion of the total array.

• Load balancing: all points require equal work, so the points should be divided equally

• A block decomposition would have the work partitioned into the number of tasks as chunks, allowing each task to own mostly contiguous data points.

Page 39:

1-D Wave Equation Parallel Solution

• Communication need only occur on data borders: the larger the block size, the less the communication.
• The equation to be solved is the one-dimensional wave equation:

  A(i, t+1) = (2.0 * A(i, t)) - A(i, t-1) + (c * (A(i-1, t) - (2.0 * A(i, t)) + A(i+1, t)))

  where c is a constant.

• Note that amplitude will depend on previous timesteps (t, t-1) and neighboring points (i-1, i+1). Data dependence will mean that a parallel solution will involve communications.
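For reference, one serial time step of this update in C might look like the sketch below (the array names old, cur and next are illustrative, and the endpoints are simply held fixed here):

  /* One time step of the 1-D wave update: next depends on the current and
     previous time steps and on the two neighbouring points. */
  void wave_step(const double *old, const double *cur, double *next,
                 int npoints, double c) {
      for (int i = 1; i < npoints - 1; i++)
          next[i] = (2.0 * cur[i]) - old[i]
                  + (c * (cur[i - 1] - (2.0 * cur[i]) + cur[i + 1]));
      next[0] = cur[0];                      /* fixed boundary points */
      next[npoints - 1] = cur[npoints - 1];
  }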

Page 40:

1-D Wave Equation Parallel Solution

  find out number of tasks and task identities

  # Identify left and right neighbors
  left_neighbor = mytaskid - 1 ; right_neighbor = mytaskid + 1
  if mytaskid = first then left_neighbor = last
  if mytaskid = last then right_neighbor = first

  find out if I am MASTER or WORKER
  if I am MASTER
    initialize array ; send each WORKER starting info and subarray
  else if I am WORKER
    receive starting info and subarray from MASTER
  endif

  # Update values for each point along string
  # In this example the master participates in calculations
  do t = 1, nsteps
    send left endpoint to left neighbor ; receive left endpoint from right neighbor
    send right endpoint to right neighbor ; receive right endpoint from left neighbor
    # Update points along line
    do i = 1, npoints
      newval(i) = (2.0 * values(i)) - oldval(i)
                + (sqtau * (values(i-1) - (2.0 * values(i)) + values(i+1)))
    end do
  end do

  # Collect results and write to file
  if I am MASTER
    receive results from each WORKER
    write results to file
  else if I am WORKER
    send results to MASTER
  endif
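The "send endpoint / receive endpoint" lines are the heart of the parallel version. A sketch of that boundary (halo) exchange in C with MPI, assuming each task stores its block in values[0..local_n+1] with ghost cells at indices 0 and local_n+1 and the tasks are arranged in a ring as in the pseudo-code above:

  #include <mpi.h>

  void exchange_halo(double *values, int local_n, int rank, int size) {
      int left  = (rank == 0) ? size - 1 : rank - 1;   /* wrap around, as in the slide */
      int right = (rank == size - 1) ? 0 : rank + 1;

      /* send my left endpoint to the left neighbour; receive my right ghost
         cell (the right neighbour's left endpoint) */
      MPI_Sendrecv(&values[1], 1, MPI_DOUBLE, left, 0,
                   &values[local_n + 1], 1, MPI_DOUBLE, right, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      /* send my right endpoint to the right neighbour; receive my left ghost
         cell (the left neighbour's right endpoint) */
      MPI_Sendrecv(&values[local_n], 1, MPI_DOUBLE, right, 1,
                   &values[0], 1, MPI_DOUBLE, left, 1,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }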