
October 2010 1

COMP60611 Fundamentals of Parallel and Distributed

Systems

Lecture 4

An Approach to Performance Modelling

Len Freeman, Graham Riley

Centre for Novel Computing

School of Computer Science

University of Manchester


October 2010 2

Overview

• Aims of performance modelling

– Allows the comparison of algorithms. Gives an indication of scalability of an algorithm on a machine (a parallel system) as both the problem size and the number of processors change – “complexity analysis of parallel algorithms”.

– Enables reasoned choices at the design stage.

• Overview of an approach to performance modelling.

– Based on the approach of Foster and Grama et al.

– Targets a generic multicomputer (a model of message-passing).

• Limitations.

• A worked example:

– Vector sum reduction (i.e. compute the sum of the elements of a vector).

• Summary.


October 2010 3

Aims of performance modelling

• In this lecture we will look at modelling the performance of algorithms that compute a result;

– Issues of correctness are relatively straightforward.

• We are interested in questions such as:

– How long will an algorithm take to execute?

– How much memory is required (though we will not consider this in detail here)?

– Does the algorithm scale as we vary the number of processors and/or the problem size? What does scaling mean?

– How do the performances of different algorithms compare?

• Typically, focus on one phase of a computation at a time;

– e.g. assume start-up and initialisation has been done, or that these phases have been modelled separately.


October 2010 4

An approach to performance modelling

• Based on a generic multicomputer (see next slide).

• Defined in terms of Tasks that undertake computation and communicate with other tasks as necessary;

– A Task may be an agglomeration of smaller tasks.

• Assumes a simple, but realistic, approach to communication between tasks:

– Based on channels that connect pairs of tasks.

• Seeks an analytical expression for execution time (T) as a function of (at least) the problem size (N), number of processors (P) (and, often, the number of tasks (U)),

T = f(N, P, U, …).


October 2010 5

A generic multicomputer

[Diagram: four nodes, each a CPU with its own local memory, connected by an interconnect.]


October 2010 6

Task-channel model

• Tasks execute concurrently;

– The number of tasks can vary during execution.

• A task encapsulates a sequential program and local memory.

• Tasks are connected by channels to other tasks;

– Channels are input or output channels.

• In addition to reading from, and writing to, local memory a task can:

– Send messages on output channels.

– Receive messages on input channels.

– Create new tasks.

– Terminate.


October 2010 7

Task-channel model

• A channel connecting two tasks acts as a message queue.

• A send operation is asynchronous: it completes immediately;

– Sends are considered to be ‘free’ (take zero time)(?!).

• A receive operation is synchronous: execution of a task is blocked until a message is available;

– Receives may cause waiting (idling) time and take a finite time to complete (as data is transmitted from one task to another).

• Channels can be created dynamically.

• Tasks can be mapped to physical processors in various ways;

– the mapping does not affect the semantics of the program, but it may well affect performance.
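To make the task-channel mechanics concrete, here is a minimal sketch (not from the lecture) using Python threads, with queue.Queue standing in for a channel: put() returns immediately, like the asynchronous send, while get() blocks until a message arrives, like the synchronous receive. The producer/consumer structure and all values are purely illustrative.

```python
import threading
import queue

channel = queue.Queue()  # a channel: a message queue connecting two tasks

def producer():
    # Task A: compute some values and send them on its output channel.
    for i in range(5):
        channel.put(i * i)      # send is asynchronous: put() returns immediately
    channel.put(None)           # sentinel: tell the consumer to terminate

def consumer():
    # Task B: receive values on its input channel and accumulate them.
    total = 0
    while True:
        msg = channel.get()     # receive is synchronous: blocks (idles) until data arrives
        if msg is None:
            break
        total += msg
    print("sum of received messages:", total)

tasks = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in tasks:
    t.start()
for t in tasks:
    t.join()
```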


October 2010 8

Specifics of performance modelling

• Assume a processor is either computing, communicating or idling.

• Thus, the total execution time can be found as the sum of the time spent in each activity for any particular processor (j):

T = T_comp^j + T_comm^j + T_idle^j.

• Or as the sum of each activity over all processors divided by the number of processors (P):

– These aggregate totals are often easier to calculate.

T = (1/P) · (Σ_i T_comp^i + Σ_i T_comm^i + Σ_i T_idle^i), where each sum runs over i = 0, …, P−1.
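As an illustration of the two equivalent views of T, the short sketch below uses made-up activity times for P = 4 processors; every processor finishes at the same wall-clock time, with idle time making up the difference.

```python
# Hypothetical activity breakdown for P = 4 processors (times in seconds).
# Each processor is either computing, communicating or idling, so the three
# components of each row sum to the same wall-clock time T.
comp = [10.0, 8.0, 9.0, 7.0]
comm = [1.0, 2.0, 1.5, 2.5]
idle = [0.0, 1.0, 0.5, 1.5]

P = len(comp)

# Per-processor view: T = T_comp^j + T_comm^j + T_idle^j for any j.
T_per_proc = [comp[j] + comm[j] + idle[j] for j in range(P)]

# Aggregate view: T = (sum comp + sum comm + sum idle) / P.
T_aggregate = (sum(comp) + sum(comm) + sum(idle)) / P

print(T_per_proc)   # [11.0, 11.0, 11.0, 11.0] -- identical for every processor
print(T_aggregate)  # 11.0
```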


October 2010 9

Definitions

T_comp = computation time - a function of the problem size, N (or of a set of N_i),

T_comm = communication time - represents the cost of messages (accessing remote data),

T_idle = idle time - due to either a lack of computation or a lack of data.


October 2010 10

Cost of messages

• A simple model of the cost of a message is:

T_msg = t_s + t_w · L, where:

– T_msg is the time to receive a message,

– t_s is the start-up cost of receiving a message,

– t_w is the cost per word (s/word),

• 1/t_w is the bandwidth (words/s),

– L is the number of words in the message.
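A one-line version of this cost model in Python; the values of t_s and t_w below are illustrative assumptions (10 µs start-up, 10 ns per word), not measurements of any real machine.

```python
def t_msg(L, t_s=1.0e-5, t_w=1.0e-8):
    """Time to receive a message of L words: T_msg = t_s + t_w * L."""
    return t_s + t_w * L

# Short messages are dominated by the start-up cost, long ones by the per-word cost.
print(t_msg(1))          # ~1.0e-05 s: essentially all start-up
print(t_msg(1_000_000))  # ~1.0e-02 s: essentially all bandwidth-limited transfer
```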


October 2010 11

Cost of messages

Thus, T_comm is the sum of all message times:

T_comm = Σ T_msg.


October 2010 12

Limitations of the Model

• The (basic) model presented in this lecture ignores the hierarchical nature of the memory of real computer systems:

– Cache behaviour,

– The impact of network architecture,

– Issues of competition for bandwidth.

• The basic model can be extended to cope with any/all of these complicating factors.

• Experience with real performance analysis on real systems helps the designer to choose when and what extra modelling might be helpful.


October 2010 13

Performance metrics: Speed-up and Efficiency.

• Define relative speed-up as the ratio of the execution time of the parallelised algorithm on one processor to the corresponding time on P processors:

S_rel = T_1 / T_P.

• Define relative efficiency as:

E_rel = T_1 / (P · T_P) = S_rel / P.

• This is a measure of the time that processors spend doing useful work (i.e., the time spent doing useful work divided by the total time on all P processors).

• It characterises the effectiveness of an algorithm on a system, for any problem size and any number of processors.


October 2010 14

Absolute performance metrics

• Relative speed-up can be misleading! (Why?)

• Define absolute speed-up (efficiency) with reference to the sequential time, T_ref, of an implementation of the best known algorithm for the problem-at-hand:

S_abs = T_ref / T_P,    E_abs = S_abs / P.

• Note: the best known algorithm may take an approach to solving the problem different to that of the parallel algorithm.
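The sketch below computes both the relative and the absolute metrics for some made-up timings; because the parallelised code run on one processor (T_1) is slower than the best sequential implementation (T_ref), the relative figures look better than the absolute ones, which is one reason relative speed-up can mislead.

```python
def relative_speedup(T1, TP):
    """S_rel = T_1 / T_P: parallel algorithm on 1 processor vs on P processors."""
    return T1 / TP

def absolute_speedup(T_ref, TP):
    """S_abs = T_ref / T_P: best known sequential implementation vs P processors."""
    return T_ref / TP

def efficiency(S, P):
    """E = S / P, for either the relative or the absolute speedup."""
    return S / P

# Illustrative (made-up) timings, in seconds.
T_ref, T1, TP, P = 80.0, 100.0, 10.0, 16
print(relative_speedup(T1, TP), efficiency(relative_speedup(T1, TP), P))        # 10.0, 0.625
print(absolute_speedup(T_ref, TP), efficiency(absolute_speedup(T_ref, TP), P))  # 8.0, 0.5
```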


October 2010 15

Scalability and Isoefficiency

• What is meant by scalability?

– Scalability applies to an algorithm executing on a parallel machine, not simply to an algorithm!

• How does an algorithm behave for a fixed problem size as the number of processors used increases?

– Known as strong scaling.

• How does an algorithm behave as the problem size changes in addition to changing the number of processors?

• A key insight is to look at how efficiency changes.


October 2010 16

Efficiency and Strong scaling

• Typically, for a fixed problem size N, the efficiency of an algorithm decreases as P increases (compare with the ‘brush’ diagrams). Why?

– Overheads typically do not get smaller as P increases. They remain ‘fixed’ (e.g. the Amdahl fraction), or, worse, they may grow with P (e.g. the number of communications may grow in an all-to-all communications pattern).

• Recall that:

E_abs = T_ref / (P · T_P) = 1 / (1 + P · O_P / T_ref).
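A small numerical sketch of this formula, with hypothetical values for T_ref and a per-processor overhead O_P (assumed constant here), showing efficiency decaying as P grows with N fixed.

```python
def e_abs(T_ref, P, O_P):
    """E_abs = T_ref / (P * T_P) = 1 / (1 + P*O_P / T_ref)."""
    return 1.0 / (1.0 + P * O_P / T_ref)

T_ref = 100.0   # hypothetical useful (sequential) work, in seconds
O_P = 0.5       # hypothetical overhead per processor, in seconds

for P in [1, 4, 16, 64, 256, 1024]:
    print(P, round(e_abs(T_ref, P, O_P), 3))
# Efficiency decays towards zero as P grows with N (hence T_ref) held fixed;
# with these numbers it falls below the 50% threshold at P = T_ref/O_P = 200.
```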


October 2010 17

Efficiency and Strong scaling

• P · O_P is the total overhead in the system.

• T_ref represents the useful work in the algorithm.

• At some point, with fixed N, the efficiency E_abs (i.e. how well each processor is being utilised) will drop below an acceptable threshold – say, 50%(?)


October 2010 18

Scalability

• No ‘real’ algorithm scales ‘forever’ on a fixed problem size on a ‘real’ computer.

• Even ‘embarrassingly’ parallel algorithms will have a limit on the number of processors they can use;

– for example, at the point where, with a fixed N, eventually there is only one ‘element’ to be operated on by each processor.

• So we seek another approach to scalability which applies as both problem size N and the number of processors P change.


October 2010 19

Definition of Scalability – Isoefficiency

• An algorithm can be said to (iso)scale if, for a given parallel system, a specific level of efficiency can be maintained by changing the problem size, N, appropriately as P increases.

• Not all algorithms isoscale!

– e.g. a vector reduction where N = P (see later).

• This approach is called scaled problem analysis.

• The function (of P) describing how the problem size N must change as P increases to maintain a specified efficiency is known as the isoefficiency function.

• Isoscaling does not apply to all problems;

– e.g. weather modelling, where increasing problem size (resolution) is not always an option,

– or image processing with a fixed number of pixels.


October 2010 20

Weak scaling

• An alternative approach is to keep the problem size per processor fixed as P increases (the total problem size N increases linearly with P) and see how the efficiency is affected;

– This is known as weak scaling (as opposed to strong scaling).

• Summary: strong scaling, weak scaling and isoefficiency are three approaches to understanding the scalability of parallel systems (algorithm + machine).

• We will look at an example shortly but first we need a way of comparing functions, e.g. performance functions and efficiency functions.

• These concepts will also be explored further in lab exercise 2.


October 2010 21

Comparison of functions – asymptotic analysis

• Performance models are generally functions of problem size (N) and the number of processors (P).

• We need a relatively easy way to compare models (functions) as N and P vary:

– Model A is ‘at most’ as fast or as big as model B;

– Model A is ‘at least’ as fast or as big as model B;

– Model A is ‘equal’ in performance/size to model B.

• We will see a similar need when comparing efficiencies and in considering scalability.

• These are all examples of comparing functions.

• We are often interested in asymptotic behaviour, i.e. the behaviour as some key parameter (e.g. N or P) increases towards infinity.


October 2010 22

Comparing functions – example

• From ‘Introduction to Parallel Computing’, Grama.

• Consider three functions:

– think of the functions as modelling the distance travelled by three cars from time t=0. One car has fixed speed and the others are accelerating (car C makes a standing start (zero initial speed)):

A(t) = 1000t

B(t) = 100t + 20t^2

C(t) = 25t^2
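A quick numerical check of these three functions (evaluated in Python) confirms the crossover points quoted on the next slide.

```python
# The three 'car' functions from the example: A has constant speed,
# B and C accelerate (C from a standing start).
def A(t): return 1000 * t
def B(t): return 100 * t + 20 * t ** 2
def C(t): return 25 * t ** 2

for t in [44, 45, 46]:
    print(t, B(t) > A(t))    # False, False, True: B overtakes A just after t = 45
for t in [19, 20, 21]:
    print(t, C(t) > B(t))    # False, False, True: C overtakes B just after t = 20
print(all(C(t) < 1.25 * B(t) for t in range(1, 1000)))  # True: C stays below 1.25*B
```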


October 2010 23

Graphically

[Graph of A(t), B(t) and C(t) against t – not reproduced in this transcript.]


October 2010 24

• We can see that:

– For t > 45, B(t) is always greater than A(t).

– For t > 20, C(t) is always greater than B(t).

– For t > 0, C(t) is always less than 1.25*B(t).


October 2010 25

Introducing ‘big-Oh’ notation

• It is often useful to express a bound on the growth of a particular function in terms of a simpler function.

• For example, since for t > 45, B(t) is always greater than A(t), we can express the relation between A(t) and B(t) using the Ο (Omicron, or ‘big-oh’) notation:

A(t) = O(B(t)).

• Meaning A(t) is “at most” B(t) beyond some value of t.

• Formally, given functions f(x), g(x),

f(x) = O(g(x))

if there exist positive constants c and x0 such that f(x) ≤ c·g(x) for all x ≥ x0 [definition from JaJa, not Grama – more transparent].


October 2010 26

• From this definition, we can see:

– A(t) = O(t^2) (“at most”),

– B(t) = O(t^2) (“at most” or “of the order t^2”),

– Also, A(t) = O(t) (“at most” or “of the order t”),

– Finally, C(t) = O(t^2) too.

• Informally, big-Oh can be used to identify the simplest function that bounds (above) a more complex function, as the parameter gets (asymptotically) bigger.
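For instance, one valid choice of constants witnessing B(t) = O(t^2) is c = 120 and x0 = 1, since 100t ≤ 100t^2 whenever t ≥ 1; a brute-force check over a finite range is sketched below.

```python
def B(t): return 100 * t + 20 * t ** 2

c, t0 = 120, 1   # one valid (c, x0) pair for B(t) = O(t^2)
print(all(B(t) <= c * t ** 2 for t in range(t0, 100_000)))  # True
```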


October 2010 27

Theta and Omega

• There are two other useful symbols:

– Omega (Ω), meaning “at least”: f(x) = Ω(g(x)),

– Theta (Θ), meaning “equals” or “goes as”: f(x) = Θ(g(x)).

• For formal definitions, see, for example, ‘An Introduction to Parallel Algorithms’ by JaJa or ‘Highly Parallel Computing’ by Almasi and Gottlieb.

• Note that the definitions in Grama are a little misleading!


October 2010 28

Performance modelling example

• The following slides develop performance models for the example of a vector (sum) reduction.

• The models are then used to support basic scalability analysis.

• Consider two parallel systems:

– First, a binary tree-based vector sum when the number of elements (N) is equal to the number of processors (P), N = P.

– Second, the case when N >> P.

• Develop performance models;

– Compare the models,

– Consider scalability.


October 2010 29

Vector Sum Reduction

• Assume that

– N = P, and

– N is a power of 2.

• Propagate intermediate values through a binary tree;

– Takes log2N steps (one processor is busy with work and communication on each step, the other processors have some idle time).

• Each step involves the communication of a single word (cost t_s + t_w) and a single addition (cost t_c). Thus:

T_P = (t_c + t_s + t_w) · log2 N = Θ(log2 N).


October 2010 30

Vector Sum Reduction

• Speedup:

S_abs = T_ref / T_P = t_c · N / ((t_c + t_s + t_w) · log2 N) ≈ N / log2 N.

• Speedup is ‘poor’ (but monotonically increasing):

– If N = 128, S_abs is ~18 (E = S/P = ~0.14, i.e. 14%),

– If N = 1024, S_abs is ~100 (E = ~0.1),

– If N = 1M, S_abs is ~52,000 (E = ~0.05),

– If N = 1G, S_abs is ~35M (E = ~0.035).
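The quoted figures follow from the simplified form S_abs ≈ N / log2 N (the cost constants roughly cancel); a short check:

```python
import math

for N in [128, 1024, 2 ** 20, 2 ** 30]:
    S = N / math.log2(N)   # simplified speedup for the N = P tree sum
    E = S / N              # P = N here, so E = S/P = 1/log2(N)
    print(N, round(S), round(E, 3))
# N = 128:  S ~ 18,        E ~ 0.143
# N = 1024: S ~ 102,       E ~ 0.1
# N = 1M:   S ~ 52429,     E ~ 0.05
# N = 1G:   S ~ 35791394,  E ~ 0.033
```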


October 2010 31

Vector sum scalability

• Efficiency:

E = S / P = N / (P · log2 N).

• But, N = P in this case, so:

E = 1 / log2 P.

• Strong scaling not ‘good’, as we have seen (E << 0.5).

• Efficiency is monotonically decreasing;

– Reaches the 50% point, E = 0.5, when (log2 P) = 2, i.e. when P = 4.

• This does not isoscale either!

– E gets smaller as P (hence N) increases, and P and N must change together.


October 2010 32

Vector Sum Reduction

• When N >> P, each processor can be allocated N/P elements.

• Each processor sums its local elements in a first phase.

• A binary tree sum of size P is then performed to sum the partial results.

• The performance model is:

T_P = t_c · (N/P) + (t_c + t_s + t_w) · log2 P = Θ(N/P + log2 P).
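A sketch of this two-phase model in Python, using illustrative (not measured) cost constants; while the local-sum term t_c·N/P dominates the log2 P tree phase, efficiency stays close to 1.

```python
import math

t_c, t_s, t_w = 1e-9, 1e-5, 1e-8   # illustrative cost constants (seconds)

def T_P(N, P):
    """Local sums of N/P elements, then a binary-tree sum over P partial results."""
    return t_c * (N / P) + (t_c + t_s + t_w) * math.log2(P)

def E_abs(N, P):
    return (t_c * N) / (P * T_P(N, P))   # T_ref ~ t_c * N

N = 10_000_000
for P in [1, 8, 64, 512]:
    print(P, round(E_abs(N, P), 3))   # ~1.0, 0.977, 0.722, 0.178
```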


October 2010 33

Scalability – strong scaling?

• Speedup:

S = N / (N/P + log2 P) = 1 / (1/P + (log2 P)/N).

• Strong scaling??

• For a given problem size N (>> P), the (log2 P)/N term is always ‘small’, so speedup will fall off ‘slowly’.

• P is, of course, limited by the value of N… but we are considering the case where N >> P.


October 2010 34

Scalability – Isoscaling

• Efficiency:

E = 1 / (1 + (P · log2 P) / N).

• Now, we can always achieve a required efficiency on P processors by a suitable choice of N.


October 2010 35

Scalability – Isoscaling

• For example, for 50% efficiency, choose N = P · log2 P.

• Or, for efficiencies > 50%, choose N > P · log2 P;

– As N gets larger on a given P, E gets closer to 1!

– The ‘good’ parallel phase (N/P work) dominates the log2 P phase as N gets larger – leading to relatively good (iso)scalability.
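A quick check of the isoefficiency claim under the simplified model E = 1/(1 + P·log2 P / N): choosing N = P·log2 P pins the efficiency at 50% for every P, and choosing N larger pushes it towards 1.

```python
import math

def E(N, P):
    return 1.0 / (1.0 + P * math.log2(P) / N)

for P in [2, 16, 128, 1024]:
    N_iso = P * math.log2(P)             # the isoefficiency choice for E = 0.5
    print(P, round(E(N_iso, P), 2), round(E(10 * N_iso, P), 2))
# Each row: P, efficiency with N = P*log2(P) (always 0.50),
# and with N ten times larger (~0.91).
```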


October 2010 36

Summary of performance modelling

• Performance modelling provides insight into the behaviour of parallel systems (parallel algorithms on parallel machines).

• Modelling allows the comparison of algorithms and gives insight into their potential scalability.

• Two forms of scalability:

– Strong scaling (fixed problem size N as P varies)

– There is always a limit to strong scaling for real algorithms (e.g. a value of P at which efficiency falls below an acceptable limit).

– Isoscaling (the ability to maintain a specified level of efficiency by changing N as P varies).

– Not all parallel systems isoscale.

• Asymptotic analysis makes comparison easier but BEWARE the constants!

• Weak scaling is related to isoscaling – aim to maintain a fixed problem size per processor as P changes and look at the effect on efficiency.