October 2010 1
COMP60611 Fundamentals of Parallel and Distributed
Systems
Lecture 4
An Approach to Performance Modelling
Len Freeman, Graham Riley
Centre for Novel Computing
School of Computer Science
University of Manchester
Overview
• Aims of performance modelling
– Allows the comparison of algorithms. Gives an indication of scalability of an algorithm on a machine (a parallel system) as both the problem size and the number of processors change – “complexity analysis of parallel algorithms”.
– Enables reasoned choices at the design stage.
• Overview of an approach to performance modelling.
– Based on the approach of Foster and Grama et al.
– Targets a generic multicomputer – (model of message-passing).
• Limitations.
• A worked example:
– Vector sum reduction (i.e. compute the sum of the elements of a vector).
• Summary.
Aims of performance modelling
• In this lecture we will look at modelling the performance of algorithms that compute a result;
– Issues of correctness are relatively straightforward.
• We are interested in questions such as:
– How long will an algorithm take to execute?
– How much memory is required (though we will not consider this in detail here)?
– Does the algorithm scale as we vary the number of processors and/or the problem size? What does scaling mean?
– How do the performances of different algorithms compare?
• Typically, focus on one phase of a computation at a time;
– e.g. assume start-up and initialisation has been done, or that these phases have been modelled separately.
An approach to performance modelling
• Based on a generic multicomputer (see next slide).
• Defined in terms of Tasks that undertake computation and communicate with other tasks as necessary;
– A Task may be an agglomeration of smaller tasks.
• Assumes a simple, but realistic, approach to communication between tasks:
– Based on channels that connect pairs of tasks.
• Seeks an analytical expression for the execution time (T) as a function of (at least) the problem size (N), the number of processors (P) and, often, the number of tasks (U):
T = f(N, P, U, …)
A generic multicomputer
[Figure: P nodes, each a CPU with its own local memory, connected by an interconnect.]
Task-channel model
• Tasks execute concurrently;
– The number of tasks can vary during execution.
• A task encapsulates a sequential program and local memory.
• Tasks are connected by channels to other tasks;
– Channels are input or output channels.
• In addition to reading from, and writing to, local memory a task can:
– Send messages on output channels.
– Receive messages on input channels.
– Create new tasks.
– Terminate.
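As an aside, the channel semantics described above (asynchronous send, blocking receive) can be sketched with ordinary threads and queues. The `Channel` class and the producer/consumer tasks below are invented for illustration; they are not part of the lecture's formal model:

```python
from queue import Queue
from threading import Thread

class Channel:
    """A channel is a message queue connecting a pair of tasks."""
    def __init__(self):
        self._q = Queue()          # unbounded, so a send never blocks
    def send(self, msg):
        self._q.put(msg)           # asynchronous: completes immediately
    def recv(self):
        return self._q.get()       # synchronous: blocks until a message arrives

# Two tasks connected by one channel.
def producer(out_ch):
    for x in [1, 2, 3]:
        out_ch.send(x)
    out_ch.send(None)              # sentinel: no more messages

def consumer(in_ch, results):
    while (msg := in_ch.recv()) is not None:
        results.append(msg)

ch = Channel()
results = []
t1 = Thread(target=producer, args=(ch,))
t2 = Thread(target=consumer, args=(ch, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)   # [1, 2, 3]
```

Note that the consumer idles whenever it reaches `recv()` before a message is available — exactly the source of Tidle in the performance model that follows.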
Task-channel model
• A channel connecting two tasks acts as a message queue.
• A send operation is asynchronous: it completes immediately;
– Sends are considered to be ‘free’ (take zero time)(?!).
• A receive operation is synchronous: execution of a task is blocked until a message is available;
– Receives may cause waiting (idling) time and take a finite time to complete (as data is transmitted from one task to another).
• Channels can be created dynamically.
• Tasks can be mapped to physical processors in various ways;
– the mapping does not affect the semantics of the program, but
it may well affect performance.
Specifics of performance modelling
• Assume a processor is either computing, communicating or idling.
• Thus, the total execution time can be found as the sum of the time spent in each activity for any particular processor (j):
T = Tcomp(j) + Tcomm(j) + Tidle(j).
• Or as the sum of each activity over all processors divided by the number of processors (P):
T = (1/P) · Σi=0..P-1 (Tcomp(i) + Tcomm(i) + Tidle(i)).
– These aggregate totals are often easier to calculate.
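The equivalence of the two formulations (per-processor totals, versus activity totals divided by P) can be checked numerically; the timing figures below are invented for illustration:

```python
# Hypothetical activity times (seconds) for P = 4 processors.
comp = [10.0, 9.0, 9.5, 8.0]   # computation
comm = [1.0, 1.5, 1.0, 2.0]    # communication
idle = [1.0, 1.5, 1.5, 2.0]    # idling

P = len(comp)

# Every processor is occupied (or idle) for the whole run, so each
# per-processor total equals the same execution time T.
per_proc = [comp[j] + comm[j] + idle[j] for j in range(P)]
assert all(abs(t - per_proc[0]) < 1e-9 for t in per_proc)

# Equivalently: sum each activity over all processors, divide by P.
T = (sum(comp) + sum(comm) + sum(idle)) / P
print(T)   # 12.0
```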
Definitions
Tcomp = computation time – a function of problem size, N (or of a set of Ns),
Tcomm = communication time – represents the cost of messages (accessing remote data),
Tidle = idle time – due to either a lack of computation or a lack of data.
Cost of messages
• A simple model of the cost of a message is:
Tmsg = ts + tw·L,  where:
– Tmsg is the time to receive a message,
– ts is the start up cost of receiving a message,
– tw is the cost per word (s/word),
• 1/ tw is the bandwidth (words/s),
– L is the number of words in the message.
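As a sketch, the message-cost model is trivial to encode; the default ts and tw values below are invented, not measurements of any real machine:

```python
def t_msg(L, t_s=5e-6, t_w=8e-9):
    """Time to receive an L-word message: Tmsg = ts + tw*L.
    The ts and tw defaults are illustrative constants only."""
    return t_s + t_w * L

# Start-up cost dominates short messages; per-word cost dominates long ones.
print(t_msg(10))          # 5.08e-06 s: almost entirely start-up cost
print(t_msg(10_000_000))  # ~0.08 s: almost entirely per-word cost
```

Doubling the bandwidth (halving tw) roughly halves the cost of long messages but barely affects short ones, which is why ts matters for fine-grained communication.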
Cost of messages
Thus, Tcomm is the sum of all message times:
Tcomm = Σ Tmsg.
Limitations of the Model
• The (basic) model presented in this lecture ignores the hierarchical nature of the memory of real computer systems:
– Cache behaviour,
– The impact of network architecture,
– Issues of competition for bandwidth.
• The basic model can be extended to cope with any/all of these complicating factors.
• Experience with real performance analysis on real systems helps the designer to choose when and what extra modelling might be helpful.
Performance metrics: Speed-up and Efficiency.
• Define relative speed-up as the ratio of the execution time of the parallelised algorithm on one processor to the corresponding time on P processors:
Srel = T1 / TP.
• Define relative efficiency as:
Erel = Srel / P = T1 / (P·TP).
• This is a measure of the time that processors spend doing useful work (i.e., the time spent doing useful work divided by the total time on all P processors).
• It characterises the effectiveness of an algorithm on a system, for any problem size and any number of processors.
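These two definitions translate directly into code; the timings in the usage example are made up:

```python
def relative_speedup(t1, tp):
    """Srel = T1 / TP."""
    return t1 / tp

def relative_efficiency(t1, tp, p):
    """Erel = Srel / P = T1 / (P * TP)."""
    return t1 / (p * tp)

# Hypothetical run: 100 s on one processor, 8 s on 16 processors.
s = relative_speedup(100.0, 8.0)          # 12.5
e = relative_efficiency(100.0, 8.0, 16)   # 0.78125, i.e. ~78% useful work
print(s, e)
```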
Absolute performance metrics
• Relative speed-up can be misleading! (Why?)
• Define absolute speed-up (efficiency) with reference to the sequential time, Tref, of an implementation of the best known algorithm for the problem-at-hand:
Sabs = Tref / TP,   Eabs = Sabs / P.
• Note: the best known algorithm may take a different approach to solving the problem from that of the parallel algorithm.
Scalability and Isoefficiency
• What is meant by scalability?
– Scalability applies to an algorithm executing on a parallel machine, not simply to an algorithm!
• How does an algorithm behave for a fixed problem size as the number of processors used increases?
– Known as strong scaling.
• How does an algorithm behave as the problem size changes in addition to changing the number of processors?
• A key insight is to look at how efficiency changes.
Efficiency and Strong scaling
• Typically, for a fixed problem size N, the efficiency of an algorithm decreases as P increases (compare with the 'brush' diagrams). Why?
– Overheads typically do not get smaller as P increases: they remain 'fixed' (e.g. the Amdahl fraction) or, worse, they may grow with P (e.g. the number of communications may grow, as in an all-to-all communications pattern).
• Recall that:
Eabs = Tref / (P·TP) = 1 / (1 + POP / Tref).
Efficiency and Strong scaling
• POP is the total overhead in the system.
• Tref represents the useful work in the algorithm.
• At some point, with fixed N, efficiency Eabs (i.e. how well each processor is being utilised) will drop below an acceptable threshold – say, 50%(?)
Scalability
• No 'real' algorithm scales 'forever' on a fixed problem size on a 'real' computer.
• Even ‘embarrassingly’ parallel algorithms will have a limit on the number of processors they can use;
– for example, at the point where, with a fixed N, eventually there is only one ‘element’ to be operated on by each processor.
• So we seek another approach to scalability which applies as both problem size N and the number of processors P change.
Definition of Scalability – Isoefficiency
• An algorithm can be said to (iso)scale if, for a given parallel system, a specific level of efficiency can be maintained by changing the problem size, N, appropriately as P increases.
• Not all algorithms isoscale!
– e.g. a vector reduction where N = P (see later).
• This approach is called scaled problem analysis.
• The function (of P) describing how the problem size N must change as P increases to maintain a specified efficiency is known as the isoefficiency function.
• Isoscaling does not apply to all problems;
– e.g. weather modelling, where increasing problem size (resolution) is not always an option,
– or image processing with a fixed number of pixels.
Weak scaling
• An alternative approach is to keep the problem size per processor fixed as P increases (total problem size N increases linearly with P) and see how the efficiency is affected;
– This is known as weak scaling (as opposed to strong scaling).
• Summary: strong scaling, weak scaling and isoefficiency are three approaches to understanding the scalability of parallel systems (algorithm + machine).
• We will look at an example shortly but first we need a way of comparing functions, e.g. performance functions and efficiency functions.
• These concepts will also be explored further in lab exercise 2.
Comparison of functions – asymptotic analysis
• Performance models are generally functions of problem size (N) and the number of processors (P).
• We need a relatively easy way to compare models (functions) as N and P vary:
– Model A is ‘at most’ as fast or as big as model B;
– Model A is ‘at least’ as fast or as big as model B;
– Model A is ‘equal’ in performance/size to model B.
• We will see a similar need when comparing efficiencies and in considering scalability.
• These are all examples of comparing functions.
• We are often interested in asymptotic behaviour, i.e. the behaviour as some key parameter (e.g. N or P) increases towards infinity.
Comparing functions – example
• From 'Introduction to Parallel Computing', Grama.
• Consider three functions:
– think of the functions as modelling the distance travelled by three cars from time t=0. One car has fixed speed and the others are accelerating (car C makes a standing start (zero initial speed)):
A(t) = 1000·t
B(t) = 100·t + 20·t²
C(t) = 25·t²
Graphically
[Figure: distance travelled by the three cars, A(t), B(t) and C(t), plotted against time t.]
• We can see that:
– For t > 45, B(t) is always greater than A(t).
– For t > 20, C(t) is always greater than B(t).
– For t > 0, C(t) is always less than 1.25*B(t).
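Taking the functions as A(t) = 1000t, B(t) = 100t + 20t² and C(t) = 25t², the three observations can be verified directly over a range of integer times:

```python
A = lambda t: 1000 * t             # constant speed
B = lambda t: 100 * t + 20 * t**2  # accelerating, moving start
C = lambda t: 25 * t**2            # accelerating, standing start

# B overtakes A at t = 45 (20t^2 + 100t > 1000t  <=>  t > 45).
assert all(B(t) > A(t) for t in range(46, 1000))
# C overtakes B at t = 20 (5t^2 > 100t  <=>  t > 20).
assert all(C(t) > B(t) for t in range(21, 1000))
# C never reaches 1.25*B (25t^2 < 25t^2 + 125t for all t > 0).
assert all(C(t) < 1.25 * B(t) for t in range(1, 1000))
print("all three relations hold")
```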
Introducing 'big-Oh' notation
• It is often useful to express a bound on the growth of a particular function in terms of a simpler function.
• For example, for t > 45, B(t) is always greater than A(t). We can express the relation between A(t) and B(t) using the Ο (Omicron, or 'big-Oh') notation:
A(t) = Ο(B(t)),
meaning A(t) is "at most" B(t) beyond some value of t.
• Formally, given functions f(x) and g(x), f(x) = Ο(g(x)) if there exist positive constants c and x0 such that f(x) ≤ c·g(x) for all x ≥ x0 [definition from JaJa, not Grama – more transparent].
• From this definition, we can see:
– A(t)=O(t2) (“at most”),
– B(t)=O(t2) (“at most” or “of the order t2”),
– Also, A(t)=O(t) (“at most” or “of the order t”),
– Finally, C(t)= O(t2) too.
• Informally, big-Oh can be used to identify the simplest function that bounds (above) a more complex function, as the parameter gets (asymptotically) bigger.
Theta and Omega
• There are two other useful symbols:
– Omega (Ω), meaning "at least": f(x) = Ω(g(x)),
– Theta (Θ), meaning "equals" or "goes as": f(x) = Θ(g(x)).
• For formal definitions, see, for example, 'An Introduction to Parallel Algorithms' by JaJa or 'Highly Parallel Computing' by Almasi and Gottlieb.
• Note that the definitions in Grama are a little misleading!
Performance modelling example
• The following slides develop performance models for the example of a vector (sum) reduction.
• The models are then used to support basic scalability analysis.
• Consider two parallel systems:
– First, a binary tree-based vector sum when the number of elements (N) is equal to the number of processors (P), i.e. N = P,
– Second, the case when N >> P.
• Develop performance models;
– Compare the models,
– Consider scalability.
Vector Sum Reduction
• Assume that
– N = P, and
– N is a power of 2.
• Propagate intermediate values through a binary tree
– Takes log2N steps (one processor is busy with work and communication on each step, the other processors have some idle time).
• Each step involves the communication of a single word (cost ts+tw) and a single addition (cost tc). Thus:
Tp = (tc + ts + tw)·log2 N = Θ(log2 N).
Vector Sum Reduction
• Speedup:
Sabs = Tref / Tp = tc·N / ((tc + ts + tw)·log2 N) = Θ(N / log2 N).
• Speedup is 'poor' (but monotonically increasing):
– If N = 128, Sabs is ~18 (E = S/P ≈ 0.14, i.e. 14%),
– If N = 1024, Sabs is ~100 (E ≈ 0.1),
– If N = 1M, Sabs is ~52,000 (E ≈ 0.05),
– If N = 1G, Sabs is ~35M (E ≈ 0.035).
Vector sum scalability
• Efficiency:
E = S / P = N / (P·log2 N).
• But, N = P in this case, so:
E = 1 / log2 P.
• Strong scaling not 'good', as we have seen (E << 0.5).
• Efficiency is monotonically decreasing:
– Reaches the 50% point, E = 0.5, when log2 P = 2, i.e. when P = 4.
• This does not isoscale either!
– E gets smaller as P (hence N) increases, and P and N must change together.
Vector Sum Reduction
• When N >> P, each processor can be allocated N/P elements.
• Each processor sums its local elements in a first phase.
• A binary tree sum of size P is then performed to sum the partial results.
• The performance model is:
Tp = tc·(N/P) + (tc + ts + tw)·log2 P = Θ(N/P + log2 P).
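The two-phase model can be explored numerically; the tc, ts and tw values below are invented machine constants, chosen only to make the shape of the model visible:

```python
from math import log2

def t_p(n, p, t_c=1e-9, t_s=5e-6, t_w=8e-9):
    """Tp = tc*(N/P) + (tc + ts + tw)*log2(P):
    local sums first, then a binary tree of size P.
    The tc, ts, tw defaults are illustrative, not measured."""
    return t_c * (n / p) + (t_c + t_s + t_w) * log2(p)

n = 10**8
for p in [1, 16, 256, 1024]:
    tp = t_p(n, p)
    print(f"P={p}: Tp={tp:.6f} s, speedup={t_p(n, 1) / tp:.1f}")
```

With these (invented) constants the N/P phase dominates at small P, so speedup is near-linear; the log2 P tree phase increasingly drags efficiency down as P grows.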
Scalability – strong scaling?
• Speedup:
S = N / (N/P + log2 P) = P / (1 + (P·log2 P) / N).
• Strong scaling??
• For a given problem size N (>> P), the (P·log2 P)/N term is always 'small', so speedup will fall off 'slowly'.
• P is, of course, limited by the value of N… but we are considering the case where N >> P.
Scalability – Isoscaling
• Efficiency:
E = 1 / (1 + (P·log2 P) / N).
• Now, we can always achieve a required efficiency on P processors by a suitable choice of N.
Scalability – Isoscaling
• For example, for 50% efficiency, choose
N = P·log2 P.
• Or, for efficiencies > 50%, choose
N > P·log2 P.
– As N gets larger on a given P, E gets closer to 1!
– The 'good' parallel phase (the N/P work) dominates the log2 P phase as N gets larger – leading to relatively good (iso)scalability.
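Substituting N = P·log2 P into E = 1/(1 + (P·log2 P)/N) pins efficiency at exactly 50% for every P > 1, which is easy to check numerically:

```python
from math import log2

def efficiency(n, p):
    """E = 1 / (1 + P*log2(P)/N), the constant-free form of the model."""
    return 1.0 / (1.0 + p * log2(p) / n)

# The isoefficiency function N = P*log2(P) holds E at 50% as P grows...
for p in [2, 16, 1024, 2**20]:
    print(p, efficiency(p * log2(p), p))   # 0.5 every time
# ...and growing N faster than P*log2(P) pushes E towards 1.
print(efficiency(1024 * 1000, 1024))       # ~0.99 on 1024 processors
```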
Summary of performance modelling
• Performance modelling provides insight into the behaviour of parallel systems (parallel algorithms on parallel machines).
• Modelling allows the comparison of algorithms and gives insight into their potential scalability.
• Two forms of scalability:
– Strong scaling (fixed problem size N as P varies)
– There is always a limit to strong scaling for real algorithms (e.g. a value of P at which efficiency falls below an acceptable limit).
– Isoscaling (the ability to maintain a specified level of efficiency by changing N as P varies).
– Not all parallel systems isoscale.
• Asymptotic analysis makes comparison easier but BEWARE the constants!
• Weak scaling is related to isoscaling – aim to maintain a fixed problem size per processor as P changes and look at the effect on efficiency.