Programming for Performance CS433 Spring 2001

Laxmikant Kale

Transcript of Programming for Performance CS433 Spring 2001

Page 1: Programming for Performance  CS433 Spring 2001

Programming for Performance CS433

Spring 2001

Laxmikant Kale

Page 2: Programming for Performance  CS433 Spring 2001

2

Causes of performance loss

• If each processor is rated at k MFLOPS, and there are p processors, why don’t we see k·p MFLOPS of performance?

– Several causes

– Each must be understood separately

– but they interact with each other in complex ways

• Solution to one problem may create another

• One problem may mask another, which manifests itself under other conditions (e.g. increased p).

Page 3: Programming for Performance  CS433 Spring 2001

3

Causes

• Sequential: cache performance

• Communication overhead

• Algorithmic overhead (“extra work”)

• Speculative work

• Load imbalance

• (Long) Critical paths

• Bottlenecks

Page 4: Programming for Performance  CS433 Spring 2001

4

Algorithmic overhead

• Parallel algorithms may have a higher operation count

• Example: parallel prefix (also called “scan”)

– How to parallelize this?

    B[0] = A[0];
    for (i = 1; i < N; i++)
        B[i] = B[i-1] + A[i];

Page 5: Programming for Performance  CS433 Spring 2001

5

Parallel Prefix: continued

• How to do this operation in parallel?

– Seems inherently sequential

– Recursive doubling algorithm

– Operation count: N·log(P)

• A better algorithm:

– Take the blocking of data into account

– Each processor calculates its local sum, then participates in a parallel algorithm to get the sum of everything to its left, and then adds that to all of its elements

– N + log(P) + N: a doubling of the operation count
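A minimal sketch of this blocked approach, assuming MPI (not from the slides): each PE sums its own block, gets the total of all blocks to its left via an exclusive scan, and then recomputes its local prefixes with that offset.

    #include <mpi.h>

    /* in[0..n-1] is this PE's block of A; out[0..n-1] receives its block of B */
    void block_prefix(const double *in, double *out, int n, MPI_Comm comm)
    {
        int rank;
        double local = 0.0, left = 0.0, run;
        MPI_Comm_rank(comm, &rank);

        for (int i = 0; i < n; i++)          /* N/P local additions */
            local += in[i];

        /* exclusive prefix sum over the per-PE totals: log(P) combining steps */
        MPI_Exscan(&local, &left, 1, MPI_DOUBLE, MPI_SUM, comm);
        if (rank == 0)                       /* MPI_Exscan leaves rank 0 undefined */
            left = 0.0;

        run = left;
        for (int i = 0; i < n; i++) {        /* N/P more additions */
            run += in[i];
            out[i] = run;
        }
    }

This is roughly 2N/P + log(P) additions per PE, i.e. about 2N work in total, which is the doubling of the operation count the slide mentions.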

Page 6: Programming for Performance  CS433 Spring 2001

6

Bottleneck

• Consider the “primes” program (or the “pi” program)

– What happens when we run it on 1000 PEs?

• How to eliminate bottlenecks:

– Two structures are useful in most such cases:

• Spanning trees: organize the processors in a tree (sketched below)

• Hypercube-based dimensional exchange
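A minimal illustration of the spanning-tree idea, assuming MPI and that each PE holds one local count (say, the primes it found in its range); this is a rough sketch, not the course's code, and a library call such as MPI_Reduce does the same job internally:

    #include <mpi.h>

    /* Combine partial counts up a binomial spanning tree.
       Rank 0 ends up with the total after about log2(p) steps,
       instead of absorbing p-1 separate messages. */
    long tree_reduce(long local, int rank, int p, MPI_Comm comm)
    {
        for (int step = 1; step < p; step <<= 1) {
            if (rank & step) {               /* pass my partial sum to my parent, then stop */
                MPI_Send(&local, 1, MPI_LONG, rank - step, 0, comm);
                break;
            } else if (rank + step < p) {    /* absorb one child's partial sum */
                long child;
                MPI_Recv(&child, 1, MPI_LONG, rank + step, 0, comm,
                         MPI_STATUS_IGNORE);
                local += child;
            }
        }
        return local;                        /* meaningful only on rank 0 */
    }

The same tree, traversed in the opposite direction, also broadcasts a result (or a work assignment) without making PE 0 a bottleneck.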

Page 7: Programming for Performance  CS433 Spring 2001

7

Communication overhead

• Components:

– per message and per byte

– sending, receiving and network

– capacity constraints

• Grainsize analysis:

– How much computation per message

– Computation-to-communication ratio

Page 8: Programming for Performance  CS433 Spring 2001

8

Communication overhead examples

• Usually, must reorganize data or work to reduce communication

• Combining communication also helps

• Examples:

Page 9: Programming for Performance  CS433 Spring 2001

9

Communication overhead

Communication delay: the time interval between the send on one processor and the receipt on another:

    time = alpha + beta*N

where alpha is the per-message cost, beta the per-byte cost, and N the message length in bytes.

Communication overhead: the time a processor is held up (both the sender and the receiver are held up); again of the form alpha + beta*N.

Typical values: alpha = 10–100 microseconds, beta = 2–10 ns per byte.
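As a quick sanity check (illustrative numbers, not from the slides): with alpha = 10 µs and beta = 2 ns per byte, a 1000-byte message costs roughly 10 + 1000*0.002 = 12 µs, so the per-message term dominates until messages reach several kilobytes.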

Page 10: Programming for Performance  CS433 Spring 2001

10

Grainsize control

• A simple definition of grainsize:

– Amount of computation per message

– Problem: this ignores whether the message is short or long

• More realistic:

– Computation-to-communication ratio

– computation time / (alpha + beta*N) for one message

Page 11: Programming for Performance  CS433 Spring 2001

11

Example: matrix multiplication

• How to parallelize this?

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)      /* C[i][j] == 0 initially */
            for (k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];

Page 12: Programming for Performance  CS433 Spring 2001

12

A simple algorithm:

• Distribute A by rows, B by columns

– So, any processor can request a row of A and get it (in two messages); same for a column of B

– Distribute the work of computing each element of C using some load balancing scheme

• So it works even on machines with varying processor capabilities (e.g. timeshared clusters)

– What is the computation-to-communication ratio?

• For each object: 2N ops, 2 messages with N bytes each

• ratio = 2N*0.01 / (2*10 + 2*0.002*N), taking roughly 0.01 µs per operation, alpha = 10 µs, and beta = 0.002 µs per byte
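To get a feel for the numbers (an illustrative calculation, not from the slides): for N = 1000, the computation is about 2*1000*0.01 = 20 µs while the communication is 2*10 + 2*1000*0.002 = 24 µs, a ratio below 1; this is what the coarser-grained algorithm on the next slide fixes.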

Page 13: Programming for Performance  CS433 Spring 2001

13

A better algorithm:

• Store A as a collection of row-bunches

– each bunch stores g rows

– Same for B’s columns

• Each object now computes a g x g section of C

• Computation-to-communication ratio:

– 2*g*g*N ops

– 2 messages, g*N bytes each

– alpha ratio: 2*g*g*N / 2 = g*g*N ops per message; beta ratio: g ops per byte
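A minimal sketch of the per-object work under these assumptions (plain C; the array layouts are invented for illustration): one object holds g rows of A and g columns of B and fills in its g x g block of C.

    /* Arows: g x N block of A (row-major); Bcols: g columns of B stored as
       an N x g array (row-major); Cblock: g x g result, zeroed by the caller. */
    void compute_block(int g, int N, const double *Arows,
                       const double *Bcols, double *Cblock)
    {
        for (int i = 0; i < g; i++)
            for (int j = 0; j < g; j++)
                for (int k = 0; k < N; k++)
                    Cblock[i*g + j] += Arows[i*N + k] * Bcols[k*g + j];
    }

This is 2*g*g*N flops fed by two messages of g*N values each, matching the alpha and beta ratios above; raising g buys more computation per message at the cost of larger messages.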

Page 14: Programming for Performance  CS433 Spring 2001

14

Alpha vs beta

• The per-message cost is significantly larger than the per-byte cost

– a factor of several thousand

– So, several optimizations are possible that trade off:

• a larger beta (per-byte) cost in return for a smaller alpha (per-message) cost

– i.e. send fewer messages

– Applications of this idea:

• Examined in the last two lectures

Page 15: Programming for Performance  CS433 Spring 2001

15

Programming for performance: steps

• Select/design Parallel algorithm

• Decide on Decomposition

• Select Load balancing strategy

• Plan Communication structure

• Examine synchronization needs

– global synchronizations, critical paths

Page 16: Programming for Performance  CS433 Spring 2001

16

Design Philosophy:

• Parallel Algorithm design:

– Ensure good performance (total op count)

– Generate sufficient parallelism

– Avoid/minimize “extra work”

• Decomposition:

– Break into many small pieces:

• Smallest grain that sufficiently amortizes overhead

Page 17: Programming for Performance  CS433 Spring 2001

17

Design principles: contd.

• Load balancing

– Select a static, dynamic, or quasi-dynamic strategy

• Measurement-based vs. prediction-based load estimation

– Principle: let a processor idle, but avoid overloading one (think about this)

• Reduce communication overhead

– Algorithmic reorganization (change the mapping)

– Message combining

– Use efficient communication libraries

Page 18: Programming for Performance  CS433 Spring 2001

18

Design principles: Synchronization

• Eliminate unnecessary global synchronization

– If T(i,j) is the time taken by the i’th phase on the j’th PE:

• With synchronization after each phase: sum_i ( max_j T(i,j) )

• Without: max_j ( sum_i T(i,j) )
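A tiny illustrative example (numbers not from the slides): two PEs, two phases, with T(1,1)=3, T(1,2)=1, T(2,1)=1, T(2,2)=3. With a barrier after each phase the run takes max(3,1) + max(1,3) = 6 time units; without the barrier, each PE simply does its own 3 + 1 = 4 units of work, so the run takes max(4,4) = 4.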

• Critical Paths:

– Look for long chains of dependences

• Draw timeline pictures with dependences

Page 19: Programming for Performance  CS433 Spring 2001

19

Diagnosing performance problems

• Tools:

– Back-of-the-envelope (i.e. simple) analysis

– Post-mortem analysis, with performance logs

• Visualization of performance data

• Automatic analysis

• Phase-by-phase analysis (the program may have many phases)

– What to measure:

• load distribution, (communication) overhead, idle time

• Their averages, max/min, and variances

• Profiling: time spent in individual modules/subroutines

Page 20: Programming for Performance  CS433 Spring 2001

20

Diagnostic techniques

• Tell-tale signs:

– max load >> average, and the number of PEs above the average is >> 1

• Load imbalance

– max load >> average, and the number of PEs above the average is ~ 1

• Possible bottleneck (if there is a dependence)

– Profile shows the total time in routine f increasing with the number of PEs

• Algorithmic overhead

– Communication overhead: obvious

Page 21: Programming for Performance  CS433 Spring 2001

21

Communication Optimization

• Example problem from an earlier lecture: Molecular Dynamics

– Each processor, assumed to house just one cell, needs to send 26 short messages to “neighboring” processors

– Assume send and receive overheads of alpha = 10 µs each, beta = 2 ns

– Time spent (notice: 26 sends and 26 receives):

• 26*2*(10 µs) = 520 µs

– If there is more than one cell on each PE, multiply this number accordingly!

– Can this be improved? How?

Page 22: Programming for Performance  CS433 Spring 2001

22

Message combining

• If there are multiple cells per processor:

– Neighbors of a cell may be on the same neighboring processor

– Neighbors of two different cells may be on the same processor

– Combine messages going to the same processor (see the sketch below)
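A minimal sketch of that combining step, assuming MPI; the buffer bounds and the dest/payload layout are invented for illustration and are not from the course code:

    #include <mpi.h>
    #include <string.h>

    #define MAXP     1024      /* illustrative bound on number of processors   */
    #define MAXBYTES 65536     /* illustrative bound on per-destination buffer */

    static char buf[MAXP][MAXBYTES];
    static int  len[MAXP];

    /* dest[c]: destination PE of cell c's message; data holds ncells payloads
       of sz bytes each, back to back. */
    void send_combined(int P, int ncells, const int *dest,
                       const char *data, int sz)
    {
        memset(len, 0, P * sizeof(int));
        for (int c = 0; c < ncells; c++) {       /* pack by destination */
            int d = dest[c];
            memcpy(buf[d] + len[d], data + (size_t)c * sz, sz);
            len[d] += sz;
        }
        for (int d = 0; d < P; d++)              /* one message per neighboring PE */
            if (len[d] > 0)
                MPI_Send(buf[d], len[d], MPI_BYTE, d, 0, MPI_COMM_WORLD);
    }

The point is that the alpha cost is now paid once per neighboring processor rather than once per cell, at the price of slightly larger (beta) messages.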

Page 23: Programming for Performance  CS433 Spring 2001

23

Communication Optimization I

• Take advantage of the structure of the communication, and do the communication in stages:

– If my coordinates are (x, y, z):

• Send to (x+1, y, z) anything that goes to (x+1, *, *)

• Send to (x-1, y, z) anything that goes to (x-1, *, *)

• Wait for messages from x neighbors, then

• Send to y neighbors a combined message

– A total of 6 messages (2 per dimension, 3 dimensions) instead of 26

– The apparent cost: a longer critical path, since data may be staged through intermediate processors

Page 24: Programming for Performance  CS433 Spring 2001

24

Communication Optimization II

• Send all migrating atoms to processor 0

– Let processor 0 sort them out and send 1 message to each processor

– Works OK if the number of processors is small

• Otherwise, processor 0 becomes a bottleneck

Page 25: Programming for Performance  CS433 Spring 2001

25

Communication Optimization III

• Generalized problem:

– Each to all, individualized messages

– Apply all previously learned techniques

Page 26: Programming for Performance  CS433 Spring 2001

26

Intro to Load Balancing

• Example: 500 processors, 50000 units of work

• What should the objective of load balancing be?

Page 27: Programming for Performance  CS433 Spring 2001

27

Causes of performance loss

• If each processor is rated at k MFLOPS, and there are p processors, why don’t we see k·p MFLOPS of performance?

– Several causes

– Each must be understood separately

– but they interact with each other in complex ways

• Solution to one problem may create another

• One problem may mask another, which manifests itself under other conditions (e.g. increased p).

Page 28: Programming for Performance  CS433 Spring 2001

28

Causes

• Sequential: cache performance

• Communication overhead

• Algorithmic overhead (“extra work”)

• Speculative work

• Load imbalance

• (Long) Critical paths

• Bottlenecks

Page 29: Programming for Performance  CS433 Spring 2001

29

Algorithmic overhead

• Parallel algorithms may have a higher operation count

• Example: parallel prefix (also called “scan”)

– How to parallelize this?

    B[0] = A[0];
    for (i = 1; i < N; i++)
        B[i] = B[i-1] + A[i];

Page 30: Programming for Performance  CS433 Spring 2001

30

Parallel Prefix: continued

• How to do this operation in parallel?

– Seems inherently sequential

– Recursive doubling algorithm

– Operation count: N·log(P)

• A better algorithm:

– Take the blocking of data into account

– Each processor calculates its local sum, then participates in a parallel algorithm to get the sum of everything to its left, and then adds that to all of its elements

– N + log(P) + N: a doubling of the operation count

Page 31: Programming for Performance  CS433 Spring 2001

31

Bottleneck

• Consider the “primes” program (or the “pi” program)

– What happens when we run it on 1000 PEs?

• How to eliminate bottlenecks:

– Two structures are useful in most such cases:

• Spanning trees: organize processors in a tree

• Hypercube-based dimensional exchange

Page 32: Programming for Performance  CS433 Spring 2001

32

Communication overhead

• Components:

– per message and per byte

– sending, receiving and network

– capacity constraints

• Grainsize analysis:

– How much computation per message

– Computation-to-communication ratio

Page 33: Programming for Performance  CS433 Spring 2001

33

Communication overhead examples

• Usually, must reorganize data or work to reduce communication

• Combining communication also helps

• Examples:

Page 34: Programming for Performance  CS433 Spring 2001

34

Communication overhead

Communication delay: the time interval between the send on one processor and the receipt on another:

    time = alpha + beta*N

where alpha is the per-message cost, beta the per-byte cost, and N the message length in bytes.

Communication overhead: the time a processor is held up (both the sender and the receiver are held up); again of the form alpha + beta*N.

Typical values: alpha = 10–100 microseconds, beta = 2–10 ns per byte.

Page 35: Programming for Performance  CS433 Spring 2001

35

Grainsize control

• A simple definition of grainsize:

– Amount of computation per message

– Problem: this ignores whether the message is short or long

• More realistic:

– Computation-to-communication ratio

Page 36: Programming for Performance  CS433 Spring 2001

36

Example: matrix multiplication

• How to parallelize this?

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)      /* C[i][j] == 0 initially */
            for (k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];

Page 37: Programming for Performance  CS433 Spring 2001

37

A simple algorithm:

• Distribute A by rows, B by columns

– So, any processor can request a row of A and get it (in two messages); same for a column of B

– Distribute the work of computing each element of C using some load balancing scheme

• So it works even on machines with varying processor capabilities (e.g. timeshared clusters)

– What is the computation-to-communication ratio?

• For each object: 2N ops, 2 messages with N bytes each

Page 38: Programming for Performance  CS433 Spring 2001

38

A better algorithm:

• Store A as a collection of row-bunches

– each bunch stores g rows

– Same for B’s columns

• Each object now computes a g x g section of C

• Computation-to-communication ratio:

– 2*g*g*N ops

– 2 messages, gN bytes each

– alpha ratio: 2*g*g*N / 2 = g*g*N ops per message; beta ratio: g ops per byte

Page 39: Programming for Performance  CS433 Spring 2001

39

Alpha vs beta

• The per-message cost is significantly larger than the per-byte cost

– a factor of several thousand

– So, several optimizations are possible that trade off: a larger beta cost for a smaller alpha cost

– i.e. send fewer messages

– Applications of this idea:

• Message combining

• Complex communication patterns: each-to-all, ..

Page 40: Programming for Performance  CS433 Spring 2001

40

Example:

• Each-to-all communication:

– each processor wants to send N bytes, as a distinct message, to each other processor

– Simple implementation: cost per processor = alpha*P + N*beta*P

• typical values?
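Plugging in illustrative numbers (not from the slides): with P = 1000, alpha = 10 µs, beta = 2 ns per byte, and N = 100 bytes, the simple scheme costs about 1000*10 + 1000*100*0.002 = 10,000 + 200 = 10,200 µs per processor; the alpha term dominates, which is why message combining and staged schemes (spanning trees, dimensional exchange) pay off here.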

Page 41: Programming for Performance  CS433 Spring 2001

41

Programming for performance:steps

• Select/design Parallel algorithm

• Decide on Decomposition

• Select Load balancing strategy

• Plan Communication structure

• Examine synchronization needs

– global synchronizations, critical paths

Page 42: Programming for Performance  CS433 Spring 2001

42

Design Philosophy:

• Parallel Algorithm design:

– Ensure good performance (total op count)

– Generate sufficient parallelism

– Avoid/minimize “extra work”

• Decomposition:

– Break into many small pieces:

• Smallest grain that sufficiently amortizes overhead

Page 43: Programming for Performance  CS433 Spring 2001

43

Design principles: contd.

• Load balancing

– Select a static, dynamic, or quasi-dynamic strategy

• Measurement-based vs. prediction-based load estimation

– Principle: let a processor idle, but avoid overloading one (think about this)

• Reduce communication overhead

– Algorithmic reorganization (change the mapping)

– Message combining

– Use efficient communication libraries

Page 44: Programming for Performance  CS433 Spring 2001

44

Design principles: Synchronization

• Eliminate unnecessary global synchronization

– If T(i,j) is the time taken by the i’th phase on the j’th PE:

• With synchronization after each phase: sum_i ( max_j T(i,j) )

• Without: max_j ( sum_i T(i,j) )

• Critical Paths:

– Look for long chains of dependences

• Draw timeline pictures with dependences

Page 45: Programming for Performance  CS433 Spring 2001

45

Diagnosing performance problems

• Tools:

– Back-of-the-envelope (i.e. simple) analysis

– Post-mortem analysis, with performance logs

• Visualization of performance data

• Automatic analysis

• Phase-by-phase analysis (the program may have many phases)

– What to measure:

• load distribution, (communication) overhead, idle time

• Their averages, max/min, and variances

• Profiling: time spent in individual modules/subroutines

Page 46: Programming for Performance  CS433 Spring 2001

46

Diagnostic techniques

• Tell-tale signs:

– max load >> average, and the number of PEs above the average is >> 1

• Load imbalance

– max load >> average, and the number of PEs above the average is ~ 1

• Possible bottleneck (if there is a dependence)

– Profile shows the total time in routine f increasing with the number of PEs: algorithmic overhead

– Communication overhead: obvious