Parallelism and Performance - The College of Engineering …cs4960-01/lecture3.pdf ·  ·...


Page 1: Parallelism and Performance - The College of Engineering …cs4960-01/lecture3.pdf ·  · 2008-12-03Parallelism and Performance Ideal parallel world: • Sequential runs in Ts •

Parallelism and Performance

Ideal parallel world:

• Sequential runs in $T_s$

• P processors run in $T_p = T_s / P$

Today: Why that usually doesn't happen


Measuring Performance

Obstacle: Non-Parallelism

Obstacle: Overhead


Measuring Performance

Latency: time to complete a task

This is normally what we want to reduce through parallelism


Measuring Performance

Speedup: ratio of latencies = $T_s / T_p$

• Linear speedup: speedup approximates P

• Sublinear speedup: speedup less than P

• Superlinear speedup: speedup more than P!

Superlinear speedup happens when the algorithm or machine changes


Superlinear Speedup

Machine change:

[Figure. Sequential: one processor with its L2 cache. Parallel: two processors, each with its own L2 cache.]


Superlinear Speedup

Algorithm change:

[Figure: sequential versus parallel algorithm]


Measuring Performance

Throughput: work / $T_p$

Higher throughput doesn’t imply lower latency
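A pipeline is one way to see why throughput and latency can move in opposite directions. The sketch below is an assumed model, not from the slides: every stage takes the same time, work is counted in items, and the function names are hypothetical.

```c
#include <assert.h>

/* Time for n_items to pass through a pipeline of `stages` stages, each
   taking `stage_time` seconds: the first item needs stages * stage_time,
   and every later item finishes one stage_time after the previous one. */
double pipeline_total_time(int stages, double stage_time, int n_items) {
    assert(stages > 0 && n_items > 0);
    return (stages + n_items - 1) * stage_time;
}

/* Latency of a single item: it must pass through every stage. */
double pipeline_latency(int stages, double stage_time) {
    assert(stages > 0);
    return stages * stage_time;
}

/* Throughput = work / time, with work counted in items. */
double throughput(int n_items, double total_time) {
    assert(total_time > 0.0);
    return n_items / total_time;
}
```

Under this model, a single unit that needs 3 s per item is a one-stage pipeline with a 3 s stage. A four-stage pipeline with 1 s stages has the higher latency (4 s instead of 3 s) and still the higher throughput over 1000 items.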


Measuring Performance

Efficiency: effective use of processors = Speedup / P
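As a minimal sketch, speedup (Ts / Tp) and efficiency (Speedup / P) can be computed directly from two measured latencies. The function and parameter names are hypothetical.

```c
#include <assert.h>

/* Speedup = Ts / Tp: sequential latency over parallel latency. */
double speedup(double t_seq, double t_par) {
    assert(t_seq > 0.0 && t_par > 0.0);
    return t_seq / t_par;
}

/* Efficiency = Speedup / P: 1.0 means every processor is fully used. */
double efficiency(double t_seq, double t_par, int p) {
    assert(p > 0);
    return speedup(t_seq, t_par) / p;
}
```

For example, a program that takes 12 s sequentially and 4 s on 4 processors has a speedup of 3 (sublinear) and an efficiency of 0.75.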


Measuring Performance

FLOPS: floating-point operations per second

IOPS: integer operations per second


Measuring Performance

Performance measurement don'ts:

• use different machines

• disable compiler optimizations

• equate “sequential” with a single parallel process

• ignore cold start

• ignore devices

Do measure multiple P and multiple problem sizes
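One possible way to avoid the cold-start mistake is to run the workload once untimed and only then take timed runs. The harness below is a sketch under assumptions: `sample_work` is a placeholder workload, `time_after_warmup` is a hypothetical helper, and it relies on C11 `timespec_get` for wall-clock time.

```c
#include <assert.h>
#include <time.h>

static long work_calls = 0;   /* counts runs, only so the sketch can be checked */
static volatile long sink;    /* keeps the result of the workload observable */

/* Placeholder workload (hypothetical): sums the integers below n. */
void sample_work(long n) {
    long s = 0;
    for (long i = 0; i < n; i++) s += i;
    sink = s;
    work_calls++;
}

/* Wall-clock seconds via C11 timespec_get. CPU time from clock() would add
   up the time of all threads and could hide a parallel speedup. */
static double wall_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run work(n) once untimed to warm caches, then return the smallest
   wall-clock time observed over `reps` timed runs. */
double time_after_warmup(void (*work)(long), long n, int reps) {
    assert(work != 0 && reps > 0);
    work(n);                          /* cold-start run, deliberately not timed */
    double best = -1.0;
    for (int i = 0; i < reps; i++) {
        double start = wall_seconds();
        work(n);
        double elapsed = wall_seconds() - start;
        if (best < 0.0 || elapsed < best) best = elapsed;
    }
    return best;
}
```

Calling this helper in a loop over several values of n, and over several processor counts once the workload is parallel, would give the "multiple P and multiple problem sizes" table the slide asks for.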


Measuring Performance

Obstacle: Non-Parallelism

Obstacle: Overhead


Inherent Non-Parallelism

Amdahl's Law

$1/S$ of program is inherently sequential ⇒ Speedup < $S$

• 50% sequential ⇒ maximum speedup of 2

• 90% sequential ⇒ maximum speedup of 1.1

• 10% sequential ⇒ maximum speedup of 10

And yet lots of processors help for some computations, because it's easy and useful to scale the problem size
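The bullet values above can be checked with the usual closed form of Amdahl's law, speedup = 1 / (f + (1 - f) / P), where f is the sequential fraction (the slide's 1/S). The slide states only the limit, so the closed form and the function names below are additions for illustration.

```c
#include <assert.h>

/* Amdahl's law: speedup on p processors when a fraction seq_fraction of
   the run time is inherently sequential and the rest scales perfectly. */
double amdahl_speedup(double seq_fraction, int p) {
    assert(seq_fraction >= 0.0 && seq_fraction <= 1.0 && p > 0);
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / p);
}

/* Upper bound as p grows without limit: 1 / seq_fraction, the slide's S. */
double amdahl_limit(double seq_fraction) {
    assert(seq_fraction > 0.0 && seq_fraction <= 1.0);
    return 1.0 / seq_fraction;
}
```

With 10% sequential code, 10 processors give a speedup of only about 5.3, and no number of processors reaches 10.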


Dependencies

Flow Dependence: write followed by read

sum = a+1;                 /* << */
first_term = sum*scale1;   /* << */
sum = sum+b;
second_term = sum*scale2;

This is a true dependence


Dependencies

Anti Dependence: read followed by write

sum = a+1;
first_term = sum*scale1;   /* << */
sum = b+1;                 /* << */
second_term = sum*scale2;

This is a false dependence

Rewrite:

sum = a+1;
first_term = sum*scale1;
sum2 = b+1;
second_term = sum2*scale2;


Dependencies

Output Dependence: write followed by write

sum = a+1;                 /* << */
first_term = sum*scale1;
sum = b+1;                 /* << */
second_term = sum*scale2;

This is a false dependence

Rewrite:

sum = a+1;
first_term = sum*scale1;
sum2 = b+1;
second_term = sum2*scale2;


Avoiding Dependencies

Sometimes, you can change the algorithm
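A common illustration of changing the algorithm, offered here as an assumed example rather than the one the lecture used, is summing an array. The running-total loop has a flow dependence from each iteration to the next, while a pairwise (tree) sum splits the work into two halves that share no variables.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential accumulation: each iteration reads the `sum` written by the
   previous one, so the n additions form one chain of flow dependences. */
long chain_sum(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) sum += a[i];
    return sum;
}

/* Pairwise sum: the two recursive calls touch disjoint halves of the array
   and do not depend on each other, so they could run on different
   processors. Only the final addition needs both results. */
long tree_sum(const long *a, size_t n) {
    if (n == 0) return 0;
    if (n == 1) return a[0];
    size_t half = n / 2;
    long left = tree_sum(a, half);
    long right = tree_sum(a + half, n - half);
    return left + right;
}
```

The rewrite relies on addition being associative. That holds for the integers used here; with floating-point values the two versions may round differently.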


Lack of Dependencies

A task that spends all its time on many mutually independent computations is

embarrassingly parallel
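An element-wise computation is a typical case. In the sketch below (names hypothetical, workers simulated by plain function calls), each worker handles its own block of the array and no element is touched by two workers.

```c
#include <assert.h>
#include <stddef.h>

/* Worker w of p squares its own contiguous block of the input.
   No element is read or written by two workers, so the p calls have no
   dependences on one another and could run in any order or all at once. */
void square_block(const int *in, int *out, size_t n, int w, int p) {
    assert(p > 0 && w >= 0 && w < p);
    size_t lo = n * (size_t)w / (size_t)p;        /* first index of the block */
    size_t hi = n * (size_t)(w + 1) / (size_t)p;  /* one past the last index */
    for (size_t i = lo; i < hi; i++) out[i] = in[i] * in[i];
}
```

Calling `square_block` for w = 0 through p - 1 in any order fills the same output, which is what makes handing each call to a separate processor safe.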


Other Non-Parallelism

Other kinds of non-parallelism:

• Memory-bound computation

• I/O-bound computation

• Load imbalance
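Load imbalance can be made concrete with a small model: the parallel run finishes only when the most heavily loaded worker does. The sketch below assumes known per-item costs and compares two hypothetical static assignments, contiguous blocks and round-robin (cyclic).

```c
#include <assert.h>
#include <stddef.h>

enum { BLOCK, CYCLIC };

/* Returns the load of the busiest worker when n items with the given costs
   are split over p workers. BLOCK gives each worker a contiguous range;
   CYCLIC deals items out round-robin. The parallel time is this maximum,
   not the average load. */
long slowest_worker(const long *cost, size_t n, int p, int scheme) {
    assert(p > 0);
    long worst = 0;
    for (int w = 0; w < p; w++) {
        long load = 0;
        for (size_t i = 0; i < n; i++) {
            int owner = (scheme == BLOCK) ? (int)(i * (size_t)p / n)
                                          : (int)(i % (size_t)p);
            if (owner == w) load += cost[i];
        }
        if (load > worst) worst = load;
    }
    return worst;
}
```

For costs 1 through 8 on two workers, the perfectly balanced load would be 18. Block assignment leaves one worker with 26, while cyclic assignment brings the maximum down to 20.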


Measuring Performance

Obstacle: Non-Parallelism

Obstacle: Overhead


Overhead

Sources of overhead:

• Communication and synchronization

• Contention

• Extra computation

• Extra memory


Overhead

Reducing communication and contention overhead:

• Larger granularity, so that per-message overhead is less costly

Example: pass whole array section instead of individual elements

• Improve locality, so that less communication is needed

Example: compute sums where data already resides

• Recompute instead of communicating

Example: recompute pseudo-random sequences instead of centralizing
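The granularity point can be sketched with a simple cost model, assumed here for illustration: each message costs a fixed overhead plus a per-element cost. The function names and the numbers in the note are hypothetical, not measurements.

```c
#include <assert.h>

/* Assumed model: one message costs a fixed per-message overhead plus a
   cost proportional to the number of elements it carries. */
double message_cost(double per_message, double per_element, long n_elements) {
    assert(n_elements >= 0);
    return per_message + per_element * (double)n_elements;
}

/* Total cost of sending n elements in messages of at most `chunk` elements. */
double transfer_cost(double per_message, double per_element, long n, long chunk) {
    assert(n >= 0 && chunk > 0);
    long full = n / chunk;   /* number of completely filled messages */
    long rest = n % chunk;   /* elements left for a final partial message */
    double total = (double)full * message_cost(per_message, per_element, chunk);
    if (rest > 0) total += message_cost(per_message, per_element, rest);
    return total;
}
```

With an overhead of 100 units per message and 1 unit per element, sending 1000 elements one at a time costs 101000 units, while sending them as a single array section costs 1100.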


Overhead

Trade-offs:

• Communication versus computation

• Memory versus parallelism

• Overhead versus parallelism
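As one instance of trading communication for computation, every worker can regenerate a shared pseudo-random sequence from a common seed instead of receiving the values from a central generator. The generator below is a toy linear congruential generator chosen only for brevity, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Toy linear congruential generator with commonly used 32-bit constants.
   Unsigned arithmetic wraps modulo 2^32. Not suitable for serious
   statistical work; it only illustrates the idea. */
uint32_t lcg_next(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Fill out[0..count) with the sequence for `seed`. Every worker that calls
   this with the same seed recomputes identical values locally, paying
   extra computation and memory instead of communication. */
void local_sequence(uint32_t seed, uint32_t *out, int count) {
    assert(count >= 0);
    uint32_t state = seed;
    for (int i = 0; i < count; i++) out[i] = lcg_next(&state);
}
```

Each worker spends a little extra computation and memory on its own copy of the sequence, and in exchange no values have to be sent and no central generator becomes a point of contention.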
