
Transcript of Tuesday, September 19, 2006

Page 1: Tuesday, September 19, 2006


"The practical scientist is trying to solve tomorrow's problem on yesterday's computer. Computer scientists often have it the other way around."

- Numerical Recipes, C Edition

Page 2: Tuesday, September 19, 2006

Reference Material

Lectures 1 & 2:
"Parallel Computer Architecture" by David Culler et al., Chapter 1.
"Sourcebook of Parallel Computing" by Jack Dongarra et al., Chapters 1 and 2.
"Introduction to Parallel Computing" by Grama et al., Chapter 1 and Chapter 2 §2.4.
www.top500.org

Lecture 3:
"Introduction to Parallel Computing" by Grama et al., Chapter 2 §2.3.
"Introduction to Parallel Computing", Lawrence Livermore National Laboratory, http://www.llnl.gov/computing/tutorials/parallel_comp/

Lectures 4 & 5:
"Techniques for Optimizing Applications" by Garg et al., Chapter 9.
"Software Optimizations for High Performance Computing" by Wadleigh et al., Chapter 5.
"Introduction to Parallel Computing" by Grama et al., Chapter 2 §2.1-2.2.

Page 3: Tuesday, September 19, 2006


Software Optimizations

Optimize serial code before parallelizing it.

Page 4: Tuesday, September 19, 2006


Loop Unrolling

do i=1,n

A(i)=B(i)

enddo

do i=1,n,4

A(i)=B(i)

A(i+1)=B(i+1)

A(i+2)=B(i+2)

A(i+3)=B(i+3)

enddo

• Unrolled by 4.
• Some compilers allow users to specify the unrolling depth.
• Avoid excessive unrolling: register pressure / spills can hurt performance.
• Pipelining to hide instruction latencies.
• Reduces overhead of index increment and conditional check.

Assumption: n is divisible by 4.
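When n is not divisible by 4, the usual fix is a remainder (cleanup) loop after the unrolled body. A minimal sketch in C (the arrays a and b are hypothetical stand-ins for the slide's A and B):

/* Unrolling by 4 with a remainder loop, so n need not be divisible by 4. */
void copy_unrolled(double *a, const double *b, int n)
{
    int i;
    int nend = n - (n % 4);      /* largest multiple of 4 <= n */

    for (i = 0; i < nend; i += 4) {
        a[i]   = b[i];
        a[i+1] = b[i+1];
        a[i+2] = b[i+2];
        a[i+3] = b[i+3];
    }
    for (; i < n; i++)           /* remainder (cleanup) loop */
        a[i] = b[i];
}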

Page 5: Tuesday, September 19, 2006


Loop Unrolling

do j = 1, N
  do i = 1, N
    Z(i,j) = Z(i,j) + X(i)*Y(j)
  enddo
enddo

Unroll outer loop by 2

Page 6: Tuesday, September 19, 2006


Loop Unrolling

do j = 1, N
  do i = 1, N
    Z(i,j) = Z(i,j) + X(i)*Y(j)
  enddo
enddo

do j = 1, N, 2
  do i = 1, N
    Z(i,j) = Z(i,j) + X(i)*Y(j)
    Z(i,j+1) = Z(i,j+1) + X(i)*Y(j+1)
  enddo
enddo

Page 7: Tuesday, September 19, 2006


Loop Unrolling

do j = 1, N
  do i = 1, N
    Z(i,j) = Z(i,j) + X(i)*Y(j)
  enddo
enddo

do j = 1, N, 2
  do i = 1, N
    Z(i,j) = Z(i,j) + X(i)*Y(j)
    Z(i,j+1) = Z(i,j+1) + X(i)*Y(j+1)
  enddo
enddo

The number of load operations can be reduced, e.g. half as many loads of X.
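To make the halved load count explicit, here is a sketch of the unrolled loop in C (an assumption on my part: N is even, and x, y, z are hypothetical C counterparts of X, Y, Z, laid out so the inner loop is unit stride). Holding x[i] in a scalar lets the compiler keep it in a register across both column updates:

/* Outer loop unrolled by 2: one load of x[i] serves two columns. */
void rank1_unrolled(int N, double z[N][N], const double x[N], const double y[N])
{
    for (int j = 0; j < N; j += 2) {
        double yj0 = y[j], yj1 = y[j+1];   /* y loaded once per column pair */
        for (int i = 0; i < N; i++) {
            double xi = x[i];              /* single load, used twice */
            z[j][i]   += xi * yj0;
            z[j+1][i] += xi * yj1;
        }
    }
}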

Page 8: Tuesday, September 19, 2006


Loop Fusion

• Beneficial in loop-intensive programs.
• Decreases index calculation overhead.
• Can also help instruction-level parallelism.
• Beneficial if the same data structures are used in different loops.

Page 9: Tuesday, September 19, 2006


Loop Fusion

for (i=0; i<n; i++)
  temp[i] = x[i]*y[i];

for (i=0; i<n; i++)
  z[i] = w[i] + temp[i];

Page 10: Tuesday, September 19, 2006


Loop Fusion

for (i=0; i<n; i++)
  temp[i] = x[i]*y[i];

for (i=0; i<n; i++)
  z[i] = w[i] + temp[i];

for (i=0; i<n; i++)
  z[i] = x[i]*y[i] + w[i];

Check for register pressure before fusing.

Page 11: Tuesday, September 19, 2006


Loop Fission

• Conditional statements can hurt pipelining.
• Split the loop into two: one with the conditional statements and the other without.
• The compiler can then apply optimizations such as unrolling to the condition-free loop.
• Also beneficial for fat loops that may lead to register spills.

Page 12: Tuesday, September 19, 2006


Loop Fission

for (i=0; i<nodes; i++) {
  a[i] = a[i]*small;
  dtime = a[i] + b[i];
  dtime = fabs(dtime*ratinpmt);
  temp1[i] = dtime*relaxn;
  if (temp1[i] > hgreat) {
    temp1[i] = 1;
  }
}

Page 13: Tuesday, September 19, 2006


Loop Fission

for (i=0; i<nodes; i++) {
  a[i] = a[i]*small;
  dtime = a[i] + b[i];
  dtime = fabs(dtime*ratinpmt);
  temp1[i] = dtime*relaxn;
  if (temp1[i] > hgreat) {
    temp1[i] = 1;
  }
}

for (i=0; i<nodes; i++) {
  a[i] = a[i]*small;
  dtime = a[i] + b[i];
  dtime = fabs(dtime*ratinpmt);
  temp1[i] = dtime*relaxn;
}
for (i=0; i<nodes; i++) {
  if (temp1[i] > hgreat) {
    temp1[i] = 1;
  }
}

Page 14: Tuesday, September 19, 2006


Reductions

for (i=0; i<n; i++)

{

sum +=x[i];

}

Normally a single register would be used for the reduction variable.

How can we hide the floating-point instruction latency?

Page 15: Tuesday, September 19, 2006


Reductions

for (i=0; i<n; i++)

{

sum +=x[i];

}

sum1 = sum2 = sum3 = sum4 = 0.0;

nend = (n>>2)<<2;

for (i=0; i<nend; i+=4){

sum1 +=x[i];

sum2 +=x[i+1];

sum3 +=x[i+2];

sum4 +=x[i+3];

}

sumx = sum1 + sum2 + sum3 + sum4;

for (i=nend; i<n; i++)

sumx += x[i];
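A self-contained sketch (not from the slides) comparing the two versions. Note that splitting the sum reassociates floating-point addition, so the results may differ in the last bits; this is why compilers only make the transformation themselves under flags such as GCC's -ffast-math:

#include <stdio.h>

int main(void)
{
    double x[1003], sum = 0.0;
    double sum1 = 0.0, sum2 = 0.0, sum3 = 0.0, sum4 = 0.0, sumx;
    int i, n = 1003, nend = (n >> 2) << 2;   /* n rounded down to a multiple of 4 */

    for (i = 0; i < n; i++)
        x[i] = 1.0 / (i + 1);

    for (i = 0; i < n; i++)        /* naive: one serial dependence chain */
        sum += x[i];

    for (i = 0; i < nend; i += 4) { /* four independent chains hide FP latency */
        sum1 += x[i];
        sum2 += x[i+1];
        sum3 += x[i+2];
        sum4 += x[i+3];
    }
    sumx = sum1 + sum2 + sum3 + sum4;
    for (i = nend; i < n; i++)      /* remainder loop */
        sumx += x[i];

    printf("naive=%.17g unrolled=%.17g\n", sum, sumx);
    return 0;
}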

Page 16: Tuesday, September 19, 2006


a**0.5 vs sqrt(a)

Page 17: Tuesday, September 19, 2006


a**0.5 vs sqrt(a)

Including the appropriate header files can help the compiler generate faster code, e.g. math.h.
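A minimal C sketch of the same point (the function names are hypothetical). Writing a**0.5 in Fortran, or pow(a, 0.5) in C, may go through the general power routine, while sqrt(a) can usually be lowered to a hardware square-root instruction once the prototype from math.h is visible:

#include <math.h>

/* With math.h included, compilers commonly inline sqrt() to a single
   hardware instruction; pow(a, 0.5) may call the slower generic pow
   routine (though some compilers recognize this pattern as well). */
double slow_root(double a) { return pow(a, 0.5); }
double fast_root(double a) { return sqrt(a); }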

Page 18: Tuesday, September 19, 2006


The time to access memory has not kept pace with CPU clock speeds.

A program's performance can be suboptimal because the data needed for an operation has not been delivered from memory to the registers by the time the processor is ready to use it.

Wasted CPU cycles: CPU starvation.


Page 20: Tuesday, September 19, 2006


Ability of the memory system to feed data to the processor:
• Memory latency
• Memory bandwidth

Page 21: Tuesday, September 19, 2006


Effect of Memory Latency

• 1 GHz processor (1 ns clock), capable of executing 4 instructions in each 1 ns cycle
• DRAM with 100 ns latency
• Cache block size: 1 word
• Peak processor rating?

Page 22: Tuesday, September 19, 2006


Effect of Memory Latency

• 1 GHz processor (1 ns clock), capable of executing 4 instructions in each 1 ns cycle
• DRAM with 100 ns latency (no caches)
• Memory block: 1 word
• Peak processor rating: 4 GFLOPS

Page 23: Tuesday, September 19, 2006


Effect of Memory Latency

• 1 GHz processor (1 ns clock), capable of executing 4 instructions in each 1 ns cycle
• DRAM with 100 ns latency (no caches)
• Memory block: 1 word
• Peak processor rating: 4 GFLOPS
• Dot product of two vectors
• Peak speed of computation?

Page 24: Tuesday, September 19, 2006


Effect of Memory Latency

• 1 GHz processor (1 ns clock), capable of executing 4 instructions in each 1 ns cycle
• DRAM with 100 ns latency (no caches)
• Memory block: 1 word
• Peak processor rating: 4 GFLOPS
• Dot product of two vectors
• Peak speed of computation? One floating-point operation every 100 ns, i.e. a speed of 10 MFLOPS.

Page 25: Tuesday, September 19, 2006


Effect of Memory Latency: Introduce Cache

• 1 GHz processor (1 ns clock), capable of executing 4 instructions in each 1 ns cycle
• DRAM with 100 ns latency
• Memory block: 1 word
• Cache: 32 KB with 1 ns latency
• Multiply two matrices A and B of 32x32 words, with the result in C. (Note: the previous example had no data reuse.)
• Assume ideal cache placement and enough capacity to hold A, B, and C.

Page 26: Tuesday, September 19, 2006


Effect of Memory Latency: Introduce Cache

Multiply two matrices A and B of 32x32 words with the result in C.
• 32x32 = 1K words
• Total operations and total time taken?

Page 27: Tuesday, September 19, 2006


Effect of Memory Latency: Introduce Cache

Multiply two matrices A and B of 32x32 words with the result in C.
• 32x32 = 1K words
• Total operations and total time taken?
• Two matrices = 2K words required
• Multiplying two matrices requires 2n³ operations

Page 28: Tuesday, September 19, 2006


Effect of Memory Latency: Introduce Cache

Multiply two matrices A and B of 32x32 words with the result in C.
• 32x32 = 1K words
• Fetching the two input matrices = 2K words, requiring 2K * 100 ns = 200 µs
• Multiplying two matrices requires 2n³ operations = 2*32³ = 64K operations
• At 4 operations per cycle we need 64K/4 cycles = 16 µs
• Total time = 200 + 16 µs
• Computation rate: 64K operations / (216 µs) ≈ 303 MFLOPS
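The reuse behind this arithmetic is easy to see in code. A sketch in C (hypothetical arrays; the slides give no implementation): once A and B are resident in the cache, every element participates in 32 multiply-add pairs, so the 200 µs fetch cost is paid only once.

/* 32x32 matrix multiply: 2*32^3 = 64K operations over 3K words of data.
   Each element of A and B is touched 32 times, so after the initial
   (compulsory) misses all further accesses hit the cache. */
#define M 32
void matmul(double C[M][M], const double A[M][M], const double B[M][M])
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++) {
            double cij = 0.0;
            for (int k = 0; k < M; k++)
                cij += A[i][k] * B[k][j];   /* one multiply + one add */
            C[i][j] = cij;
        }
}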

Page 29: Tuesday, September 19, 2006


Effect of Memory Bandwidth

• 1 GHz processor (1 ns clock), capable of executing 4 instructions in each 1 ns cycle
• DRAM with 100 ns latency
• Memory block: 4 words
• Cache: 32 KB with 1 ns latency
• Dot product example again
• Bandwidth increased 4-fold: each 100 ns access now delivers 4 consecutive words, so the dot product can perform 8 floating-point operations per 200 ns, i.e. about 40 MFLOPS.

Page 30: Tuesday, September 19, 2006


Reduce cache misses:
• Spatial locality
• Temporal locality

Page 31: Tuesday, September 19, 2006


Impact of strided access

for (i=0; i<1000; i++) {
  column_sum[i] = 0.0;
  for (j=0; j<1000; j++)
    column_sum[i] += b[j][i];
}

Page 32: Tuesday, September 19, 2006


Eliminating strided access

for (i=0; i<1000; i++)
  column_sum[i] = 0.0;

for (j=0; j<1000; j++)
  for (i=0; i<1000; i++)
    column_sum[i] += b[j][i];

Assumption: the vector column_sum is retained in the cache.
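A self-contained timing harness (a sketch, not from the slides) makes the contrast measurable; b and column_sum are sized as in the slide:

#include <stdio.h>
#include <time.h>

#define N 1000
static double b[N][N], column_sum[N];

int main(void)
{
    int i, j;
    clock_t t;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            b[i][j] = 1.0;

    t = clock();
    for (i = 0; i < N; i++) {          /* strided: b traversed column by column */
        column_sum[i] = 0.0;
        for (j = 0; j < N; j++)
            column_sum[i] += b[j][i];
    }
    printf("strided:     %f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    t = clock();
    for (i = 0; i < N; i++)
        column_sum[i] = 0.0;
    for (j = 0; j < N; j++)            /* interchanged: unit-stride inner loop */
        for (i = 0; i < N; i++)
            column_sum[i] += b[j][i];
    printf("unit stride: %f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);
    return 0;
}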

Page 33: Tuesday, September 19, 2006


do i = 1, N
  do j = 1, N
    A(i) = A(i) + B(j)
  enddo
enddo

N is large, so B(j) cannot remain in the cache until it is used again in the next iteration of the outer loop.

Little reuse between touches.

How many cache misses for A and B?
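The slide leaves the miss count as an exercise. A standard follow-up technique (my addition, not shown on the slide) is loop blocking (tiling): process B in cache-sized chunks so each chunk is reused by every iteration of i. A sketch in C rather than the slide's Fortran, with BS a hypothetical tile size chosen to fit in cache:

/* Blocked version of: for i { for j { a[i] += b[j] } }.
   Each BS-word tile of b stays cached while all of a streams past it. */
#define BS 256
void sum_blocked(int n, double a[n], const double b[n])
{
    for (int jj = 0; jj < n; jj += BS) {
        int jmax = (jj + BS < n) ? jj + BS : n;
        for (int i = 0; i < n; i++)
            for (int j = jj; j < jmax; j++)
                a[i] += b[j];   /* b[jj..jmax) is reused for every i */
    }
}

The trade-off: a is now swept n/BS times instead of once, but each tile of b is served from the cache after its first touch, restoring the temporal reuse the original loop lacked.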