Parallel Performance Measurement Dr. Stephen Tse stephen_tse@qc


Parallel Performance Measurement

Dr. Stephen Tse

Lesson 8


1. Speed Up

Let T(1, N) be the time required for the best serial algorithm to solve a problem of size N on 1 processor, and let T(P, N) be the time for a given parallel algorithm to solve the same problem of the same size N on P processors. Speedup is defined as

S(P, N) = T(1, N) / T(P, N)

Remarks:
• Normally, S(P, N) < P; ideally, S(P, N) = P; rarely, S(P, N) > P (superlinear speedup).
• Linear speedup: S(P, N) = cP, where c is a constant independent of N and P.
• Algorithms with S(P, N) = cP are called scalable algorithms.


2. Parallel Efficiency

Let T(1, N) be the time required for the best serial algorithm to solve a problem of size N on 1 processor, and let T(P, N) be the time for a given parallel algorithm to solve the same problem of the same size N on P processors. Parallel efficiency is defined as

E(P, N) = T(1, N) / [T(P, N) P] = S(P, N) / P

Remarks:
1. Normally, E(P, N) < 1; ideally, E(P, N) = 1; rarely, E(P, N) > 1. E(P, N) of about 0.6 is acceptable, although this is problem-dependent.
2. Linear speedup corresponds to E(P, N) = c, where c is a constant independent of N and P.
3. Algorithms with E(P, N) = c are called scalable algorithms.
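The two definitions above can be turned directly into code. The following is a minimal sketch; the function names speedup and efficiency are illustrative and not part of the lesson, and the timings are assumed to be measured in the same unit.

```c
/* Speedup S(P,N) = T(1,N) / T(P,N). */
double speedup(double t_serial, double t_parallel)
{
    return t_serial / t_parallel;
}

/* Parallel efficiency E(P,N) = S(P,N) / P. */
double efficiency(double t_serial, double t_parallel, int p)
{
    return speedup(t_serial, t_parallel) / p;
}
```

For a hypothetical run with T(1, N) = 100 s and T(8, N) = 25 s, speedup(100.0, 25.0) is 4 and efficiency(100.0, 25.0, 8) is 0.5, slightly below the 0.6 rule of thumb.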


3. Load Imbalance Ratio I(P,N)

• Processor i spends time t_i doing useful work; t_max = max{t_i} is the maximum time spent by one or more processors, and

t_avg = (\sum_{i=0}^{P-1} t_i) / P

is the average time.

The total time spent on useful tasks (computation and communication) is \sum_{i=0}^{P-1} t_i, while the time that the system is occupied (computing, communicating, or idle) is P t_max. Thus, we define a parameter called the load imbalance ratio:

I(P,N) = [P t_max - \sum_{i=0}^{P-1} t_i] / \sum_{i=0}^{P-1} t_i = t_max / t_avg - 1

Remarks:
1. The time wasted per processor is t_avg * I(P,N) = t_max - t_avg.
2. I(P,N) is the average time wasted by each processor due to load imbalance, relative to t_avg.
3. If t_max = t_avg, then t_i = t_avg for every i and I(P,N) = 0, which implies complete load balance.
4. One slow processor (one not doing what it is supposed to do, i.e. the one with t_max) can hold up the entire team. This observation shows that the Slave-Master scheme is usually very inefficient, because a slow master processor causes load imbalance. Therefore, the Slave-Master scheme is usually avoided.
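As a sketch of how the ratio could be computed from measured per-processor times, the function below evaluates I = t_max / t_avg - 1. The name load_imbalance and the array layout are assumptions made for illustration; the function expects at least one processor and a positive total time.

```c
#include <stddef.h>

/* Load imbalance ratio I = t_max / t_avg - 1 for busy times t[0..p-1]. */
double load_imbalance(const double *t, size_t p)
{
    double t_max = t[0];
    double sum = 0.0;

    for (size_t i = 0; i < p; i++) {
        if (t[i] > t_max)
            t_max = t[i];      /* track the slowest processor */
        sum += t[i];
    }

    double t_avg = sum / (double)p;
    return t_max / t_avg - 1.0;
}
```

With hypothetical times {2, 2, 2, 10}, t_avg is 4 and t_max is 10, so the ratio is 1.5: on average each processor wastes 1.5 times t_avg, or 6 time units, waiting for the slowest one.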


[Figure: Load Balance: t_i on P Nodes Within Synchronization]


4. Overhead

• A parameter h(P,N) is defined by

E(P,N) = 1 / [1 + h(P,N)]

where h(P,N) is called the overhead. Solving for it gives

h(P,N) = 1/E(P,N) - 1 = P/S(P,N) - 1

Remarks:
1. h(P,N) measures the time spent as a result of communication and load imbalance.
2. h(P,N) → ∞ as E(P,N) → 0.
3. h(P,N) → 0 as E(P,N) → 1.
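A small sketch of the overhead formula, using the form h = P/S - 1. The function name overhead is hypothetical, and the speedup is assumed to be positive.

```c
/* Overhead h(P,N) = 1/E(P,N) - 1 = P/S(P,N) - 1. */
double overhead(double speedup, int p)
{
    return (double)p / speedup - 1.0;
}
```

For example, a speedup of 4 on 8 processors gives h = 1, meaning as much time goes to communication and load imbalance as to useful work; a perfect speedup of 8 on 8 processors gives h = 0.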


5. Amdahl's Law

Suppose a fraction f of an algorithm for a problem of size N on P processors is inherently serial and the remainder is perfectly parallel, and assume T(1,N) = τ. Then

T(P,N) = fτ + (1-f)τ/P

Therefore,

S(P,N) = 1 / [f + (1-f)/P]

This equation indicates that as P → ∞, the speedup S(P,N) is bounded by 1/f. The maximum possible speedup is therefore finite even as P → ∞.
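To make the 1/f bound concrete, here is a minimal sketch that evaluates the speedup formula. The function name amdahl_speedup and the sample value f = 0.1 are illustrative assumptions.

```c
/* Amdahl's law: S(P,N) = 1 / (f + (1 - f)/P), with serial fraction f. */
double amdahl_speedup(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);
}
```

With f = 0.1, ten processors give a speedup of about 5.26, and even a million processors stay just below the limit 1/f = 10.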


6. Granularity

The size of the problem allocated to individual processors is called the granularity of the decomposition.

Remarks:
1. Granularity is usually determined by the problem size N and the computer size P.
2. Decreasing granularity usually increases communication and decreases load imbalance.
3. Increasing granularity usually decreases communication and increases load imbalance.


[Figure: Total Overhead; vertical axis: Overhead]


7. Scalability

A scalable algorithm is one whose E(P, N) remains bounded from below, i.e., E(P, N) ≥ E0 > 0, as the number of processors P → ∞ at fixed problem size.

A quasi-scalable algorithm is one whose E(P, N) remains bounded from below, i.e., E(P, N) ≥ E0 > 0, when the number of processors satisfies Pmin < P < Pmax at fixed problem size. The interval [Pmin, Pmax] is called the scaling zone.

Remarks:
1. Truly scalable algorithms are rare; quasi-scalable algorithms are common.
2. Quasi-scalable is usually regarded as scalable.
3. At fixed N = N(P), E(P, N(P)) decreases monotonically as P increases, but this relationship is problem-dependent.
4. At fixed P = P(N), E(P(N), N) increases monotonically as N increases, but this relationship is problem-dependent.
5. Efforts: maximize the scaling zone [Pmin, Pmax] and E0.


8. Principles: minimizing overhead

1. Minimize the communication-to-computation ratio.

2. Minimize load imbalance.

3. Maximize the scaling zone.


C Programming

Dr. Stephen Tse

Lesson 8-a


C Language

• C is a general-purpose programming language.
• C provides a variety of data types:
  – characters, integers, and floating-point numbers;
  – derived data types created with pointers, arrays, structures, and unions.

• Expressions are formed from operators and operands.

• Pointers provide for machine-independent address arithmetic.


C Control-flow

• statement grouping, decision making (if-else)

• selecting one of a set of possible cases (switch)

• looping with the termination test at the top (while, for)

• looping with the termination test at the bottom (do)

• early loop exit (break)


C Functions

• Functions may return values of basic types, structures, unions, or pointers.

• Any function may be called recursively.

• Function definitions may not be nested, but variables may be declared in a block-structured fashion.

• Variables may be internal to a function, external but known only within a single source file, or visible to the entire program.


Getting Started

• The first C program prints "hello, world":

    #include <stdio.h>              /* include information about the standard library */

    int main(void)                  /* define a function named main */
    {                               /* statements are enclosed in braces */
        printf("hello, world\n");   /* call the library function printf; \n is the newline character */
        return 0;
    }

• You must create the program in a file whose name ends in ".c", then compile it with the command:

    cc hello.c

• If the program has no errors, the compilation is silent and creates an executable file called a.out.
• If you run a.out by typing its name as a command, it prints hello, world


Anatomy of a C program

• C program consists of functions and variables.

• A function contains statements that specify the computing operations to be done.

• Variables store values used during the computation.

• Every C program has a “main” and the program begins execution at the beginning of main.


main ()

• “main” calls other functions to help perform its job; some that you wrote, and others from libraries that are provided for you.

• Therefore, the first line of the program usually includes the standard input/output library:

#include <stdio.h>


arguments

• f(arg1, arg2): one way to communicate data between functions is for the calling function to provide a list of values, called arguments, to the function it calls. The parentheses after the function name surround the argument list. A function that takes no arguments is written with an empty list ().


Variables and Arithmetic Expressions

• All variables must be declared before they are used. A declaration announces the properties of variables; it consists of a type name and a list of variables, such as

    int fahr, celsius;

  The basic types include:
  – int (integer)
  – char (character, a single byte)
  – short (short integer)
  – float (single-precision floating point, numbers with a fractional part)
  – double (double-precision floating point)

• If an arithmetic operator has one floating-point operand and one integer operand, the integer is converted to floating point before the operation.


Pointers and Arrays

• A pointer is a variable that contains the address of another variable:

    int *P;     /* P is a pointer */
    P = &C;     /* P now points to C */

[Figure: memory diagram. P holds the address of C (874); the cell at address 874 holds the content of C (3.1416 in the diagram).]

    int x = 1, y = 2, z[10];
    int *ip;
    ip = &x;        /* ip now points to x */
    y = *ip;        /* the unary operator *, applied to a pointer, accesses the object the pointer points to, so y is now 1 */
    *ip = 0;        /* x is now 0 */
    ip = &z[0];     /* ip now points to z[0], the beginning of array z */

What do the following do?

    y = *ip + 1;    /* y = whatever ip points to, plus 1 */
    *ip += 1;       /* increments what ip points to */

The last statement can also be written ++*ip; or (*ip)++;
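The statements above can be collected into one function to check the values step by step. This is only a sketch: the name pointer_demo and the two output parameters are added for illustration, and z is zero-initialized here (the slide leaves it uninitialized) so that every result is well defined.

```c
/* Runs the slide's pointer statements in order; returns the final y and
   reports the final x and z[0] through the output pointers. */
int pointer_demo(int *x_out, int *z0_out)
{
    int x = 1, y = 2, z[10] = {0};
    int *ip;

    ip = &x;        /* ip now points to x */
    y = *ip;        /* y is now 1 */
    *ip = 0;        /* x is now 0 */
    ip = &z[0];     /* ip now points to z[0] */
    y = *ip + 1;    /* y = z[0] + 1 = 1 */
    *ip += 1;       /* z[0] is now 1 */
    ++*ip;          /* z[0] is now 2 */
    (*ip)++;        /* z[0] is now 3 */

    *x_out = x;
    *z0_out = z[0];
    return y;
}
```

The parentheses in (*ip)++ matter: without them, *ip++ would increment the pointer ip rather than the object it points to.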


swap(a, b)

• C passes arguments to functions by value, so there is no direct way for the called function to alter a variable in the calling function.
• It is not enough to write:

    swap(a, b);

  where swap is:

    void swap(int x, int y)    /* WRONG */
    {
        int temp;

        temp = x;
        x = y;
        y = temp;
    }

  Instead, pass pointers to the values to be changed, swap(&a, &b), where swap is:

    void swap(int *px, int *py)    /* interchange *px and *py */
    {
        int temp;

        temp = *px;
        *px = *py;
        *py = temp;
    }
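The difference between the two versions shows up only at the call site. The sketch below keeps both side by side; swap_by_value is a name chosen here so that the two functions can coexist in one file.

```c
/* WRONG: x and y are copies, so the caller's variables never change. */
void swap_by_value(int x, int y)
{
    int temp = x;
    x = y;
    y = temp;
}

/* Works: the caller passes addresses and the function swaps through them. */
void swap(int *px, int *py)
{
    int temp = *px;
    *px = *py;
    *py = temp;
}
```

Starting from a = 1 and b = 2, the call swap_by_value(a, b) leaves both unchanged, while swap(&a, &b) leaves a = 2 and b = 1.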


printf and format

• printf("%3.0f %6.1f\n", fahr, celsius);
  – "..." is the character string to be printed.
  – Variables can have different data types:
    • char: a single byte, capable of holding one character (8 bits)
    • int: an integer (either 16 or 32 bits)
    • short: short integer (16 bits)
    • long: long integer (32 bits)
    • float: single-precision floating point (constants contain a decimal point or an exponent)
    • double: double-precision floating point
  – Conversion specifications:
    • %d: print as decimal integer
    • %6d: print as decimal integer, at least 6 characters wide
    • %f: print as floating point
    • %.2f: print as floating point, 2 characters after the decimal point
    • %6.2f: print as floating point, at least 6 characters wide and 2 after the decimal point
    • %s: print as character string
    • %o: print as octal
    • %x: print as hexadecimal
    • %c: print as character
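To see the width and precision fields in action, the sketch below formats into a buffer with snprintf, which accepts the same conversion specifications as printf. The function names format_row and format_int and the '|' separators are illustrative choices.

```c
#include <stdio.h>

/* Formats one temperature row: fahr in 3 columns with no decimals,
   celsius in 6 columns with 1 decimal. Returns the number of characters. */
int format_row(char *buf, size_t size, double fahr, double celsius)
{
    return snprintf(buf, size, "%3.0f %6.1f", fahr, celsius);
}

/* Formats n as a 6-wide decimal, then octal, then hexadecimal. */
int format_int(char *buf, size_t size, unsigned int n)
{
    return snprintf(buf, size, "%6u|%o|%x", n, n, n);
}
```

For instance, format_row with 100.0 and 37.8 produces "100   37.8", and format_int with 255 produces "   255|377|ff".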


Data Object Type Requirements (IEEE Format)

Fundamental type | Derived types | Size | Value range
char | char, signed char | 1 byte, 8 bits | -128 to 127
char | unsigned char | 1 byte, 8 bits | 0 to 255
int | int (default), signed, signed int | 2 bytes, 16 bits | -32,768 to 32,767
int | unsigned, unsigned int | 2 bytes, 16 bits | 0 to 65,535
short | short, signed short, short int, signed short int | 2 bytes, 16 bits | -32,768 to 32,767
short | unsigned short, unsigned short int | 2 bytes, 16 bits | 0 to 65,535
long | long, signed long, long int, signed long int | 4 bytes, 32 bits | -2,147,483,648 to 2,147,483,647
long | unsigned long, unsigned long int | 4 bytes, 32 bits | 0 to 4,294,967,295
enum | enum | 2 bytes, 16 bits | -32,768 to 32,767
float | float | 4 bytes, 32 bits | -10^38 to -10^-38, 0, 10^-38 to 10^38
double | double | 8 bytes, 64 bits | -10^308 to -10^-308, 0, 10^-308 to 10^308
long double | long double | 10 bytes, 80 bits | -10^4932 to -10^-4932, 0, 10^-4932 to 10^4932
void | void | 0 bytes, 0 bits | no value
_segment | _segment | 2 bytes, 16 bits | 0x0000 to 0xF000
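The sizes in the table describe one particular compiler (note the 16-bit int). The C language fixes only minimum ranges, so the values on another system may be larger. As a sketch, the function below prints what the compiler in use actually provides, using the standard header limits.h; the function name print_type_info is made up for this example.

```c
#include <limits.h>
#include <stdio.h>

/* Prints the size and range of the basic integer types on this system. */
void print_type_info(void)
{
    printf("char : %zu byte,  %d to %d\n", sizeof(char), CHAR_MIN, CHAR_MAX);
    printf("short: %zu bytes, %d to %d\n", sizeof(short), SHRT_MIN, SHRT_MAX);
    printf("int  : %zu bytes, %d to %d\n", sizeof(int), INT_MIN, INT_MAX);
    printf("long : %zu bytes, %ld to %ld\n", sizeof(long), LONG_MIN, LONG_MAX);
}
```

On many current desktop systems this reports a 4-byte int rather than the 2 bytes listed above; the corresponding limits for float and double are in float.h.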


Data-object type categories

[Figure: tree of type categories]

• Scalar
  – Arithmetic
    • Integral: char, int, enum
    • Floating-point: float, double
  – _segment
  – Pointers
• Aggregate: arrays, structures, unions


Constants

• Integer constant: 1234 is an int.
• Long constant: written with a terminal l (ell) or L, as in 123456789L.
• Unsigned constant: written with a terminal u or U; the suffix ul or UL indicates unsigned long.
• Floating-point constant: contains a decimal point (123.4), an exponent (1e-2), or both; its type is double. The suffix f or F indicates a float constant; l or L indicates a long double.
• Octal: a leading 0 (zero) on an integer constant means octal.
• Hexadecimal: a leading 0x or 0X means hexadecimal.
• Character constant: an integer, written as one character within single quotes, such as '0'. In the ASCII character set '0' has the value 48, which is unrelated to the numeric value 0.
• Escape sequences: certain characters can be represented in character and string constants by escape sequences like \n (newline), which look like two characters but represent only one.
  – \ooo  one to three octal digits (0…7), e.g. \013 for vertical tab, \007 for the bell character
  – \xhh  one or more hexadecimal digits (0…9, a…f, A…F)
  – \a  alert (bell) character        \\  backslash
  – \b  backspace                     \?  question mark
  – \f  formfeed                      \'  single quote
  – \n  newline                       \"  double quote
  – \r  carriage return               \0  null character, value 0, marks the end of a string
  – \t  horizontal tab
  – \v  vertical tab
  – EOF  end of file (a constant defined in <stdio.h>, not an escape sequence)
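A few of these rules can be checked in code. The sketch below uses made-up function names; digit_value relies on the fact that the digit characters '0' to '9' are consecutive in C.

```c
#include <string.h>

/* 31 written as decimal, octal (leading 0) and hexadecimal (leading 0x). */
int same_value(void)
{
    return 31 == 037 && 31 == 0x1F;
}

/* Converts a digit character to its numeric value: '7' - '0' is 7. */
int digit_value(char c)
{
    return c - '0';
}

/* Each escape sequence is one character: A, tab, B, newline. */
size_t escaped_length(void)
{
    return strlen("\x41\tB\n");
}
```

same_value() returns 1, digit_value('7') returns 7, and escaped_length() returns 4 even though the string literal is written with ten characters between the quotes.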


Function and if statement

• A function definition has this form:

    return-type function-name(parameter declarations, if any)
    {
        declarations
        statements
    }

• The if logical pattern:

    if (condition1)
        statements1
    else if (condition2)
        statements2
    …
    else
        statementn

Remark: if every condition fails, the final statement is executed.

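[Figure: flowchart of the if / else-if chain. Each condition that is true runs its statements and exits; if every condition is false, the final statement runs.]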
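As a concrete instance of both patterns, here is a small function written in the definition form above whose body is an if / else-if / else chain. The function sign is an example chosen for illustration.

```c
/* Returns 1, -1 or 0 according to the sign of x. */
int sign(int x)
{
    if (x > 0)
        return 1;
    else if (x < 0)
        return -1;
    else
        return 0;    /* reached only when every condition above fails */
}
```

The conditions are tested in order, and only the statement belonging to the first true condition runs.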


String Termination

• A string constant:

    "hello\n"

• It is stored as an array of characters containing the characters of the string and terminated with a '\0' to mark the end:

    h  e  l  l  o  \n  \0

Remark: The '\0' is not part of the normal text, but the %s format of printf expects its string argument to be terminated by '\0'. A function that copies a string relies on the same terminator and copies it into the output argument.
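The role of the terminator is easy to see in a hand-written length function. This is a sketch in the spirit of the standard strlen; the name my_strlen is chosen to avoid clashing with the library function.

```c
#include <stddef.h>

/* Counts characters up to, but not including, the terminating '\0'. */
size_t my_strlen(const char *s)
{
    size_t n = 0;

    while (s[n] != '\0')
        n++;
    return n;
}
```

For "hello\n" the function returns 6, while sizeof("hello\n") is 7 because the array also stores the '\0'.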


Type Conversion

• When an operator has operands of different types, the "narrower" operand is automatically converted to the "wider" one:
  – If either operand is long double, convert the other to long double.
  – Otherwise, if either operand is double, convert the other to double.
  – Otherwise, if either operand is float, convert the other to float.
  – Otherwise, convert char and short to int.
  – Then, if either operand is long, convert the other to long.
• Finally, explicit type conversions can be forced ("coerced") in any expression with a unary operator called a cast:

    (type-name) expression

  The expression is converted to the named type.
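A common place where a cast matters is division. In the sketch below (function names are illustrative), both operands of / are int in the first function, so the result is truncated; in the second, the cast makes one operand double and the other is then converted automatically.

```c
/* Integer division: the fractional part is discarded. */
int truncated_average(int sum, int count)
{
    return sum / count;
}

/* The cast converts sum to double; count is then widened to double too. */
double average(int sum, int count)
{
    return (double)sum / count;
}
```

With sum = 7 and count = 2, truncated_average returns 3 and average returns 3.5.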


Random Numbers

• Many simulations do not simulate events given by input data, but rather generate events according to some probability distribution. A random number generating function, rand(), is used.

• The starting point of the pseudorandom sequence, called the seed, is set by calling srand(x). The default seed for rand is 1. The same seed always generates the same sequence from rand.

• Each call to rand() returns a pseudorandom integer between 0 and RAND_MAX (32767 here). Dividing by RAND_MAX + 1 gives a uniform random real number x with 0 ≤ x < 1.

• With the following statements (where a and b are integers):

    x = rand() / (RAND_MAX + 1.0);
    y = (b - a) * x + a;

  the variable y, truncated to an integer, is a uniformly distributed random variable between a and b-1. In general: Result = Low + (High - Low) * number.

[Figure: the range 0 to 32,767 of x mapped onto the range a to b of y.]

• Pseudo-random numbers with a different sequence on each run:

    srand((unsigned) time(NULL));

  This uses the time as the seed; rand() then generates integers between 0 and 32,767 (RAND_MAX).
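The scaling formula Result = Low + (High - Low) * number can be sketched as a helper function. The name rand_between is hypothetical; the function assumes a < b and uses only rand() and RAND_MAX from stdlib.h.

```c
#include <stdlib.h>

/* Returns a pseudorandom integer in the range a to b-1, inclusive. */
int rand_between(int a, int b)
{
    double x = rand() / ((double)RAND_MAX + 1.0);   /* 0 <= x < 1 */

    return a + (int)((b - a) * x);
}
```

Calling srand(1) and then rand_between(3, 9) repeatedly yields values from 3 to 8 only; calling srand with the same seed again replays the same sequence.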


Algorithms to Generate Pseudorandom Numbers – Linear Congruential Algorithms

• Developed by D.H. Lehmer around 1950, the method uses four integer parameters to generate a pseudorandom sequence:
  – The starting (or current/seed) value, X0 (or Xn)
  – The multiplier, a (greater than or equal to 0)
  – The increment, c (greater than or equal to 0)
  – The modulus, m (greater than 0 and larger than X0, a, and c)
  – The formula:

    X(n+1) = (a * X(n) + c) % m        (% is the modulus operator)

The formula generates values between 0 and m-1, inclusive. For example, if m = 10, the formula generates values between 0 and 9, inclusive. If the modulus is chosen close to the maximum possible signed int (32767), the formula can produce good random numbers, provided a and c are also chosen well.
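One step of the formula is a single line of C. The sketch below uses unsigned long so that small parameters cannot overflow; the function name lcg_next and the sample parameters a = 7, c = 3, m = 10 are chosen only to make the arithmetic easy to follow and are not good generator constants.

```c
/* One step of a linear congruential generator: X(n+1) = (a*X(n) + c) % m.
   Assumes a*x + c fits in an unsigned long. */
unsigned long lcg_next(unsigned long x, unsigned long a,
                       unsigned long c, unsigned long m)
{
    return (a * x + c) % m;
}
```

Starting from X0 = 1 with these parameters, the sequence begins 0, 3, 4: (7*1 + 3) % 10 = 0, (7*0 + 3) % 10 = 3, (7*3 + 3) % 10 = 4. Every value stays between 0 and m-1.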


Assignment Operations

• The operator += is called an assignment operator.
• Most binary operators (operators that have a left and a right operand) have a corresponding assignment operator op=, where op is one of:

    +  -  *  /  %  <<  >>  &  ^  |

• If expr1 and expr2 are expressions, then

    expr1 op= expr2

  is equivalent to

    expr1 = (expr1) op (expr2)

• For example:

    expr1 /= expr2

  is equivalent to:

    expr1 = expr1 / expr2
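The parentheses around expr2 in the equivalence matter when the right-hand side is itself an expression. A minimal sketch, with an illustrative function name:

```c
/* x *= y + 1 means x = x * (y + 1), not x = x * y + 1. */
int scale(int x, int y)
{
    x *= y + 1;
    return x;
}
```

scale(3, 4) returns 15, because the whole right-hand side y + 1 = 5 is evaluated before the multiplication.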


Conditional Expressions

• The conditional expression, written with the ternary operator "?:", provides an alternate way to write:

    if (a > b)
        z = a;
    else
        z = b;

• The corresponding construction as a conditional expression:

    expr1 ? expr2 : expr3

The expression expr1 is evaluated first. If it is non-zero (true), then expr2 is evaluated, and that is the value of the conditional expression. Otherwise expr3 is evaluated, and that is the value. The example above becomes z = (a > b) ? a : b;
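Because a conditional expression has a value, it can be returned directly. A short sketch; the function name max_int is chosen for illustration:

```c
/* Returns the larger of a and b using the ternary operator. */
int max_int(int a, int b)
{
    return (a > b) ? a : b;
}
```

Only one of the two branches is evaluated: for max_int(3, 8) the condition is false, so the result is b, which is 8.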
