October 2010 1
COMP60611 Fundamentals of Parallel and Distributed
Systems
Lecture 4
An Approach to Performance Modelling
Len Freeman, Graham Riley
Centre for Novel Computing
School of Computer Science
University of Manchester
Overview
• Aims of performance modelling
– Allows the comparison of algorithms. Gives an indication of scalability of an algorithm on a machine (a parallel system) as both the problem size and the number of processors change – “complexity analysis of parallel algorithms”.
– Enables reasoned choices at the design stage.
• Overview of an approach to performance modelling.
– Based on the approach of Foster and Grama et al.
– Targets a generic multicomputer – (model of message-passing).
• Limitations.
• A worked example:
– Vector sum reduction (i.e. compute the sum of the elements of a vector).
• Summary.
Aims of performance modelling
• In this lecture we will look at modelling the performance of algorithms that compute a result;
– Issues of correctness are relatively straightforward.
• We are interested in questions such as:
– How long will an algorithm take to execute?
– How much memory is required (though we will not consider this in detail here)?
– Does the algorithm scale as we vary the number of processors and/or the problem size? What does scaling mean?
– How do the performances of different algorithms compare?
• Typically, focus on one phase of a computation at a time;
– e.g. assume start-up and initialisation has been done, or that these phases have been modelled separately.
An approach to performance modelling
• Based on a generic multicomputer (see next slide).
• Defined in terms of Tasks that undertake computation and communicate with other tasks as necessary;
– A Task may be an agglomeration of smaller tasks.
• Assumes a simple, but realistic, approach to communication between tasks:
– Based on channels that connect pairs of tasks.
• Seeks an analytical expression for the execution time (T) as a function of (at least) the problem size (N), the number of processors (P) and, often, the number of tasks (U):
T = f(N, P, U, …)
A generic multicomputer
[Figure: P nodes, each a CPU with its own local memory, connected by an interconnect.]
Task-channel model
• Tasks execute concurrently;
– The number of tasks can vary during execution.
• A task encapsulates a sequential program and local memory.
• Tasks are connected by channels to other tasks;
– Channels are input or output channels.
• In addition to reading from, and writing to, local memory a task can:
– Send messages on output channels.
– Receive messages on input channels.
– Create new tasks.
– Terminate.
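As an aside, the channel semantics described above (asynchronous send, blocking receive) can be sketched with ordinary threads and queues. The `Channel` class and the producer/consumer tasks below are invented for illustration; they are not part of the lecture's formal model:

```python
from queue import Queue
from threading import Thread

class Channel:
    """A channel is a message queue connecting a pair of tasks."""
    def __init__(self):
        self._q = Queue()          # unbounded, so a send never blocks
    def send(self, msg):
        self._q.put(msg)           # asynchronous: completes immediately
    def recv(self):
        return self._q.get()       # synchronous: blocks until a message arrives

# Two tasks connected by one channel.
def producer(out_ch):
    for x in [1, 2, 3]:
        out_ch.send(x)
    out_ch.send(None)              # sentinel: no more messages

def consumer(in_ch, results):
    while (msg := in_ch.recv()) is not None:
        results.append(msg)

ch = Channel()
results = []
t1 = Thread(target=producer, args=(ch,))
t2 = Thread(target=consumer, args=(ch, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)   # [1, 2, 3]
```

Note that the consumer idles whenever it reaches `recv()` before a message is available — exactly the source of Tidle in the performance model that follows.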
Task-channel model
• A channel connecting two tasks acts as a message queue.
• A send operation is asynchronous: it completes immediately;
– Sends are considered to be ‘free’ (take zero time)(?!).
• A receive operation is synchronous: execution of a task is blocked until a message is available;
– Receives may cause waiting (idling) time and take a finite time to complete (as data is transmitted from one task to another).
• Channels can be created dynamically.
• Tasks can be mapped to physical processors in various ways;
– the mapping does not affect the semantics of the program, but
it may well affect performance.
Specifics of performance modelling
• Assume a processor is either computing, communicating or idling.
• Thus, the total execution time can be found as the sum of the time spent in each activity for any particular processor (j):
T = Tcomp(j) + Tcomm(j) + Tidle(j).
• Or as the sum of each activity over all processors divided by the number of processors (P):
T = (1/P) · Σi=0..P-1 (Tcomp(i) + Tcomm(i) + Tidle(i)).
– These aggregate totals are often easier to calculate.
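The equivalence of the two formulations (per-processor totals, versus activity totals divided by P) can be checked numerically; the timing figures below are invented for illustration:

```python
# Hypothetical activity times (seconds) for P = 4 processors.
comp = [10.0, 9.0, 9.5, 8.0]   # computation
comm = [1.0, 1.5, 1.0, 2.0]    # communication
idle = [1.0, 1.5, 1.5, 2.0]    # idling

P = len(comp)

# Every processor is occupied (or idle) for the whole run, so each
# per-processor total equals the same execution time T.
per_proc = [comp[j] + comm[j] + idle[j] for j in range(P)]
assert all(abs(t - per_proc[0]) < 1e-9 for t in per_proc)

# Equivalently: sum each activity over all processors, divide by P.
T = (sum(comp) + sum(comm) + sum(idle)) / P
print(T)   # 12.0
```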
Definitions
Tcomp = computation time – a function of problem size, N (or of a set of Ns),
Tcomm = communication time – represents the cost of messages (accessing remote data),
Tidle = idle time – due to either a lack of computation or a lack of data.
Cost of messages
• A simple model of the cost of a message is:
Tmsg = ts + tw·L,  where:
– Tmsg is the time to receive a message,
– ts is the start up cost of receiving a message,
– tw is the cost per word (s/word),
• 1/ tw is the bandwidth (words/s),
– L is the number of words in the message.
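As a sketch, the message-cost model is trivial to encode; the default ts and tw values below are invented, not measurements of any real machine:

```python
def t_msg(L, t_s=5e-6, t_w=8e-9):
    """Time to receive an L-word message: Tmsg = ts + tw*L.
    The ts and tw defaults are illustrative constants only."""
    return t_s + t_w * L

# Start-up cost dominates short messages; per-word cost dominates long ones.
print(t_msg(10))          # 5.08e-06 s: almost entirely start-up cost
print(t_msg(10_000_000))  # ~0.08 s: almost entirely per-word cost
```

Doubling the bandwidth (halving tw) roughly halves the cost of long messages but barely affects short ones, which is why ts matters for fine-grained communication.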
Cost of messages
Thus, Tcomm is the sum of all message times:
Tcomm = Σ Tmsg.
Limitations of the Model
• The (basic) model presented in this lecture ignores the hierarchical nature of the memory of real computer systems:
– Cache behaviour,
– The impact of network architecture,
– Issues of competition for bandwidth.
• The basic model can be extended to cope with any/all of these complicating factors.
• Experience with real performance analysis on real systems helps the designer to choose when and what extra modelling might be helpful.
Performance metrics: Speed-up and Efficiency.
• Define relative speed-up as the ratio of the execution time of the parallelised algorithm on one processor to the corresponding time on P processors:
Srel = T1 / TP.
• Define relative efficiency as:
Erel = Srel / P = T1 / (P·TP).
• This is a measure of the time that processors spend doing useful work (i.e., the time spent doing useful work divided by the total time on all P processors).
• It characterises the effectiveness of an algorithm on a system, for any problem size and any number of processors.
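These two definitions translate directly into code; the timings in the usage example are made up:

```python
def relative_speedup(t1, tp):
    """Srel = T1 / TP."""
    return t1 / tp

def relative_efficiency(t1, tp, p):
    """Erel = Srel / P = T1 / (P * TP)."""
    return t1 / (p * tp)

# Hypothetical run: 100 s on one processor, 8 s on 16 processors.
s = relative_speedup(100.0, 8.0)          # 12.5
e = relative_efficiency(100.0, 8.0, 16)   # 0.78125, i.e. ~78% useful work
print(s, e)
```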
Absolute performance metrics
• Relative speed-up can be misleading! (Why?)
• Define absolute speed-up (efficiency) with reference to the sequential time, Tref, of an implementation of the best known algorithm for the problem-at-hand:
Sabs = Tref / TP,   Eabs = Sabs / P.
• Note: the best known algorithm may take a different approach to solving the problem from that of the parallel algorithm.
Scalability and Isoefficiency
• What is meant by scalability?
– Scalability applies to an algorithm executing on a parallel machine, not simply to an algorithm!
• How does an algorithm behave for a fixed problem size as the number of processors used increases?
– Known as strong scaling.
• How does an algorithm behave as the problem size changes in addition to changing the number of processors?
• A key insight is to look at how efficiency changes.
Efficiency and Strong scaling
• Typically, for a fixed problem size N, the efficiency of an algorithm decreases as P increases (compare with the 'brush' diagrams). Why?
– Overheads typically do not get smaller as P increases: they remain 'fixed' (e.g. the Amdahl fraction) or, worse, they may grow with P (e.g. the number of communications may grow, as in an all-to-all communications pattern).
• Recall that:
Eabs = Tref / (P·TP) = 1 / (1 + POP / Tref).
Efficiency and Strong scaling
• POP is the total overhead in the system.
• Tref represents the useful work in the algorithm.
• At some point, with fixed N, efficiency Eabs (i.e. how well each processor is being utilised) will drop below an acceptable threshold – say, 50%(?)
Scalability
• No 'real' algorithm scales 'forever' on a fixed problem size on a 'real' computer.
• Even ‘embarrassingly’ parallel algorithms will have a limit on the number of processors they can use;
– for example, at the point where, with a fixed N, eventually there is only one ‘element’ to be operated on by each processor.
• So we seek another approach to scalability which applies as both problem size N and the number of processors P change.
Definition of Scalability – Isoefficiency
• An algorithm can be said to (iso)scale if, for a given parallel system, a specific level of efficiency can be maintained by changing the problem size, N, appropriately as P increases.
• Not all algorithms isoscale!
– e.g. a vector reduction where N = P (see later).
• This approach is called scaled problem analysis.
• The function (of P) describing how the problem size N must change as P increases to maintain a specified efficiency is known as the isoefficiency function.
• Isoscaling does not apply to all problems;
– e.g. weather modelling, where increasing problem size (resolution) is not always an option,
– or image processing with a fixed number of pixels.
Weak scaling
• An alternative approach is to keep the problem size per processor fixed as P increases (total problem size N increases linearly with P) and see how the efficiency is affected;
– This is known as weak scaling (as opposed to strong scaling).
• Summary: strong scaling, weak scaling and isoefficiency are three approaches to understanding the scalability of parallel systems (algorithm + machine).
• We will look at an example shortly but first we need a way of comparing functions, e.g. performance functions and efficiency functions.
• These concepts will also be explored further in lab exercise 2.
Comparison of functions – asymptotic analysis
• Performance models are generally functions of problem size (N) and the number of processors (P).
• We need a relatively easy way to compare models (functions) as N and P vary:
– Model A is ‘at most’ as fast or as big as model B;
– Model A is ‘at least’ as fast or as big as model B;
– Model A is ‘equal’ in performance/size to model B.
• We will see a similar need when comparing efficiencies and in considering scalability.
• These are all examples of comparing functions.
• We are often interested in asymptotic behaviour, i.e. the behaviour as some key parameter (e.g. N or P) increases towards infinity.
Comparing functions – example
• From 'Introduction to Parallel Computing', Grama.
• Consider three functions:
– think of the functions as modelling the distance travelled by three cars from time t=0. One car has fixed speed and the others are accelerating (car C makes a standing start (zero initial speed)):
A(t) = 1000·t
B(t) = 100·t + 20·t²
C(t) = 25·t²
Graphically
[Figure: distance travelled by the three cars, A(t), B(t) and C(t), plotted against time t.]
• We can see that:
– For t > 45, B(t) is always greater than A(t).
– For t > 20, C(t) is always greater than B(t).
– For t > 0, C(t) is always less than 1.25*B(t).
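Taking the functions as A(t) = 1000t, B(t) = 100t + 20t² and C(t) = 25t², the three observations can be verified directly over a range of integer times:

```python
A = lambda t: 1000 * t             # constant speed
B = lambda t: 100 * t + 20 * t**2  # accelerating, moving start
C = lambda t: 25 * t**2            # accelerating, standing start

# B overtakes A at t = 45 (20t^2 + 100t > 1000t  <=>  t > 45).
assert all(B(t) > A(t) for t in range(46, 1000))
# C overtakes B at t = 20 (5t^2 > 100t  <=>  t > 20).
assert all(C(t) > B(t) for t in range(21, 1000))
# C never reaches 1.25*B (25t^2 < 25t^2 + 125t for all t > 0).
assert all(C(t) < 1.25 * B(t) for t in range(1, 1000))
print("all three relations hold")
```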
Introducing 'big-Oh' notation
• It is often useful to express a bound on the growth of a particular function in terms of a simpler function.
• For example, for t > 45, B(t) is always greater than A(t). We can express the relation between A(t) and B(t) using the Ο (Omicron, or 'big-Oh') notation:
A(t) = Ο(B(t)),
meaning A(t) is "at most" B(t) beyond some value of t.
• Formally, given functions f(x) and g(x), f(x) = Ο(g(x)) if there exist positive constants c and x0 such that f(x) ≤ c·g(x) for all x ≥ x0 [definition from JaJa, not Grama – more transparent].
• From this definition, we can see:
– A(t)=O(t2) (“at most”),
– B(t)=O(t2) (“at most” or “of the order t2”),
– Also, A(t)=O(t) (“at most” or “of the order t”),
– Finally, C(t)= O(t2) too.
• Informally, big-Oh can be used to identify the simplest function that bounds (above) a more complex function, as the parameter gets (asymptotically) bigger.
Theta and Omega
• There are two other useful symbols:
– Omega (Ω), meaning "at least": f(x) = Ω(g(x)),
– Theta (Θ), meaning "equals" or "goes as": f(x) = Θ(g(x)).
• For formal definitions, see, for example, 'An Introduction to Parallel Algorithms' by JaJa or 'Highly Parallel Computing' by Almasi and Gottlieb.
• Note that the definitions in Grama are a little misleading!
Performance modelling example
• The following slides develop performance models for the example of a vector (sum) reduction.
• The models are then used to support basic scalability analysis.
• Consider two parallel systems:
– First, a binary tree-based vector sum when the number of elements (N) is equal to the number of processors (P), i.e. N = P,
– Second, the case when N >> P.
• Develop performance models;
– Compare the models,
– Consider scalability.
Vector Sum Reduction
• Assume that
– N = P, and
– N is a power of 2.
• Propagate intermediate values through a binary tree
– Takes log2N steps (one processor is busy with work and communication on each step, the other processors have some idle time).
• Each step involves the communication of a single word (cost ts+tw) and a single addition (cost tc). Thus:
Tp = (tc + ts + tw)·log2 N = Θ(log2 N).
Vector Sum Reduction
• Speedup:
Sabs = Tref / Tp = tc·N / ((tc + ts + tw)·log2 N) = Θ(N / log2 N).
• Speedup is 'poor' (but monotonically increasing):
– If N = 128, Sabs is ~18 (E = S/P ≈ 0.14, i.e. 14%),
– If N = 1024, Sabs is ~100 (E ≈ 0.1),
– If N = 1M, Sabs is ~52,000 (E ≈ 0.05),
– If N = 1G, Sabs is ~35M (E ≈ 0.035).
Vector sum scalability
• Efficiency:
E = S / P = N / (P·log2 N).
• But, N = P in this case, so:
E = 1 / log2 P.
• Strong scaling not 'good', as we have seen (E << 0.5).
• Efficiency is monotonically decreasing:
– Reaches the 50% point, E = 0.5, when log2 P = 2, i.e. when P = 4.
• This does not isoscale either!
– E gets smaller as P (hence N) increases, and P and N must change together.
Vector Sum Reduction
• When N >> P, each processor can be allocated N/P elements.
• Each processor sums its local elements in a first phase.
• A binary tree sum of size P is then performed to sum the partial results.
• The performance model is:
Tp = tc·(N/P) + (tc + ts + tw)·log2 P = Θ(N/P + log2 P).
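The two-phase model can be explored numerically; the tc, ts and tw values below are invented machine constants, chosen only to make the shape of the model visible:

```python
from math import log2

def t_p(n, p, t_c=1e-9, t_s=5e-6, t_w=8e-9):
    """Tp = tc*(N/P) + (tc + ts + tw)*log2(P):
    local sums first, then a binary tree of size P.
    The tc, ts, tw defaults are illustrative, not measured."""
    return t_c * (n / p) + (t_c + t_s + t_w) * log2(p)

n = 10**8
for p in [1, 16, 256, 1024]:
    tp = t_p(n, p)
    print(f"P={p}: Tp={tp:.6f} s, speedup={t_p(n, 1) / tp:.1f}")
```

With these (invented) constants the N/P phase dominates at small P, so speedup is near-linear; the log2 P tree phase increasingly drags efficiency down as P grows.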
Scalability – strong scaling?
• Speedup:
S = N / (N/P + log2 P) = P / (1 + (P·log2 P) / N).
• Strong scaling??
• For a given problem size N (>> P), the (P·log2 P)/N term is always 'small', so speedup will fall off 'slowly'.
• P is, of course, limited by the value of N… but we are considering the case where N >> P.
Scalability – Isoscaling
• Efficiency:
E = 1 / (1 + (P·log2 P) / N).
• Now, we can always achieve a required efficiency on P processors by a suitable choice of N.
Scalability – Isoscaling
• For example, for 50% efficiency, choose
N = P·log2 P.
• Or, for efficiencies > 50%, choose
N > P·log2 P.
– As N gets larger on a given P, E gets closer to 1!
– The 'good' parallel phase (the N/P work) dominates the log2 P phase as N gets larger – leading to relatively good (iso)scalability.
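Substituting N = P·log2 P into E = 1/(1 + (P·log2 P)/N) pins efficiency at exactly 50% for every P > 1, which is easy to check numerically:

```python
from math import log2

def efficiency(n, p):
    """E = 1 / (1 + P*log2(P)/N), the constant-free form of the model."""
    return 1.0 / (1.0 + p * log2(p) / n)

# The isoefficiency function N = P*log2(P) holds E at 50% as P grows...
for p in [2, 16, 1024, 2**20]:
    print(p, efficiency(p * log2(p), p))   # 0.5 every time
# ...and growing N faster than P*log2(P) pushes E towards 1.
print(efficiency(1024 * 1000, 1024))       # ~0.99 on 1024 processors
```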
Summary of performance modelling
• Performance modelling provides insight into the behaviour of parallel systems (parallel algorithms on parallel machines).
• Modelling allows the comparison of algorithms and gives insight into their potential scalability.
• Two forms of scalability:
– Strong scaling (fixed problem size N as P varies)
– There is always a limit to strong scaling for real algorithms (e.g. a value of P at which efficiency falls below an acceptable limit).
– Isoscaling (the ability to maintain a specified level of efficiency by changing N as P varies).
– Not all parallel systems isoscale.
• Asymptotic analysis makes comparison easier but BEWARE the constants!
• Weak scaling is related to isoscaling – aim to maintain a fixed problem size per processor as P changes and look at the effect on efficiency.