


Compiler Support for Better Memory Utilization in Scientific Code

Rob Fowler, John Mellor-Crummey, Guohua Jin, Apan Qasem

{rjf, johnmc, jin, qasem}@cs.rice.edu

Introduction

The gap between memory speed and processor speed is increasing with each new generation of processors. As a result, lack of sufficient bandwidth between the processor and the various levels of memory has become a principal bottleneck in achieving peak performance for large-scale applications. Because application programs are typically written in a clear style that facilitates code development and maintainability, they often fail to exploit data reuse to the greatest extent possible. Although significant performance gains can be achieved by careful hand rewriting of code, more automated approaches are needed to improve performance, productivity, and maintainability.

There are two major obstacles to automatic optimization of large-scale applications. First, the choice of algorithms and data structures in the original source may severely limit opportunities for optimization. Second, compiler-based optimizations must be expressed in terms of algorithms that are both general and correct. However, the techniques for achieving performance at the leading edge of capability computing are currently neither simple nor general enough to be applied universally. For this reason, we are pursuing a hybrid approach of building tools that use compiler technology to assist expert human programmers.

Transformation Framework

Using the Rice compiler infrastructure, we have constructed a source-to-source loop transformation tool that combines new compilation techniques with well-known loop transformations to improve data reuse at various levels of the memory hierarchy. Our optimization strategy builds upon multi-level fusion of loop nests. We employ both data and computation restructuring strategies such as array contraction, selective unroll-and-jam, and guard-free core generation.

Controlled Loop Fusion

• We enable multi-level loop fusion by adjusting the alignment of statements in different loops relative to a common iteration space.

• Loop fusion improves cache reuse by reducing the reuse distance between accesses to the same data or data that share the same cache line.

• Loop fusion also opens up possibilities for applying array contraction.

• However, fusing loops too aggressively may cause the memory footprint for the innermost body to be too large.

• To resolve this issue, we have extended our fusion algorithm to enable more fine-grained control over which loop nests should be fused.

• Using a compiler directive, we can instruct our fusion algorithm to selectively fuse any pair of loops in any dimension, as long as fusing those loops is legal.
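The reuse-distance effect described above can be sketched in C (the paper's code is Fortran; this toy kernel and its array names are ours, not from NCOMMAS). Fusing the two passes brings both uses of each b[i] into a single iteration:

```c
#include <stddef.h>

/* Two separate passes: every b[i] is touched in both loops, so the
 * reuse distance between its two accesses is the entire first loop. */
void unfused(const double *b, double *a, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * b[i];           /* first pass over b */
    for (size_t i = 0; i < n; i++)
        c[i] = b[i] + a[i];          /* second pass over b and a */
}

/* Fused: b[i] and a[i] are reused immediately, while still in a
 * register or at worst in L1 cache. Fusion is legal here because
 * a[i] is defined before its use within the same iteration. */
void fused(const double *b, double *a, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = 2.0 * b[i];
        c[i] = b[i] + a[i];
    }
}
```

Both versions compute the same values; only the access order, and hence the cache behavior, differs.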

Selective Unroll-and-Jam

• We implemented unroll-and-jam to exploit temporal reuse carried by outer loops. Applying this transformation to a loop nest brings reuses carried by the outer loops closer together, which improves register reuse.

• In a multilevel fused loop nest, the dependences may be carried by different loops within the fused loop nest. Unroll-and-jamming along a particular dimension does not provide us with sufficient benefits in this case.

• To get the benefits of both cache reuse from fusion and register reuse from unroll-and-jam, we have augmented the conventional unroll-and-jam strategy. Within a fused loop nest, we apply unroll-and-jam selectively to loop nests along the dimension(s) that give the best reuse.
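A minimal C sketch of the transformation, on a matrix-vector toy of our own rather than the Runge-Kutta kernel: the outer i loop carries reuse of x[j], so unrolling i by 2 and jamming the copies into the j loop lets one register load of x[j] feed two rows. We assume n is even, as an unroll-factor-aligned trip count would be arranged by core generation.

```c
/* Baseline: x[j] is reloaded from memory for every row i. */
void matvec(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];
}

/* i unrolled by 2 and the copies jammed into the j loop:
 * each x[j] now feeds rows i and i+1. Assumes n is even. */
void matvec_uj2(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i += 2)
        for (int j = 0; j < n; j++) {
            double xj = x[j];                    /* one load, two uses */
            y[i]     += A[i * n + j] * xj;
            y[i + 1] += A[(i + 1) * n + j] * xj;
        }
}
```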

c dir$ fuse 1
      do j = 4-nbdy,ny+nbdy-2
c dir$ fuse 1
        do i = 4-nbdy,nx+nbdy-3
c dir$ fuse 1
          do k = 4-nbdy,nz+nbdy-3
            ab(k,i,j) = (f60 * (a(k,i,j-1) + a(k,i,j)) + f61
     &        * (a(k,i,j-2) + a(k,i,j+1)) + f62 * (a(k,i,j-3) + a(k,i,j+2)))
     &        * thirddtbydy * uyb(k,i,j)
          enddo
        enddo
      enddo
c dir$ fuse 1
      do j = 4-nbdy,ny+nbdy-3
c dir$ fuse 2
        do i = 4-nbdy,nx+nbdy-2
c dir$ fuse 2
          do k = 4-nbdy,nz+nbdy-3
            al(k,i,j) = (f60 * (a(k,i-1,j) + a(k,i,j)) + f61
     &        * (a(k,i-2,j) + a(k,i+1,j)) + f62 * (a(k,i-3,j) + a(k,i+2,j)))
     &        * thirddtbydx * uxl(k,i,j)
          enddo
        enddo
      enddo
c dir$ fuse 1
      do j = 4-nbdy,ny+nbdy-3
c dir$ fuse 4
        do i = 4-nbdy,nx+nbdy-3
c dir$ fuse 4
          do k = 4-nbdy,nz+nbdy-3
            athird(k,i,j) = a(k,i,j) + (al(k,i+1,j) - al(k,i,j))
     &        + (ab(k,i,j+1) - ab(k,i,j)) + (af(k+1,i,j) - af(k,i,j))
          enddo
        enddo
      enddo
c dir$ fuse 1
      do j = 7-nbdy,ny+nbdy-5
c dir$ fuse 5
        do i = 7-nbdy,nx+nbdy-6
c dir$ fuse 5
          do k = 7-nbdy,nz+nbdy-6
            abthird(k,i,j) = (f60 * (athird(k,i,j-1) + athird(k,i,j))
     &        + f61 * (athird(k,i,j-2) + athird(k,i,j+1))
     &        + f62 * (athird(k,i,j-3) + athird(k,i,j+2)))
     &        * halfdtbydx * uybthird(k,i,j)
          enddo
        enddo
      enddo

      do jj = -5, ny + 7 - 2 + 1, 2
        j = jj
        do i = -5, nx + 6
          do k = -5, nz + 6
            ab_$(k, i, iand(j, 1)) = (f60 * (a(k, i, j - 1) + a(k, i, j))
     &        + f61 * (a(k, i, j - 2) + a(k, i, j + 1))
     &        + f62 * (a(k, i, j - 3) + a(k, i, j + 2)))
     &        * thirddtbydy * uyb(k, i, j)
            ab_$(k, i, iand(j + 1, 1)) = (f60 * (a(k, i, j) + a(k, i, j + 1))
     &        + f61 * (a(k, i, j - 1) + a(k, i, j + 2))
     &        + f62 * (a(k, i, j - 2) + a(k, i, j + 3)))
     &        * thirddtbydy * uyb(k, i, j + 1)
          enddo
        enddo
        do j = jj, jj + 1
          if (j .le. ny + 6) then
            do i = -5, nx + 7 - 2 + 1, 2
              do k = -5, nz + 6
                al_$(k, i, iand(j, 1)) = (f60 * (a(k, i - 1, j) + a(k, i, j))
     &            + f61 * (a(k, i - 2, j) + a(k, i + 1, j))
     &            + f62 * (a(k, i - 3, j) + a(k, i + 2, j)))
     &            * thirddtbydx__dtmp * uxl(k, i, j)
                al_$(k, i + 1, iand(j, 1)) = (f60 * (a(k, i, j) + a(k, i + 1, j))
     &            + f61 * (a(k, i - 1, j) + a(k, i + 2, j))
     &            + f62 * (a(k, i - 2, j) + a(k, i + 3, j)))
     &            * thirddtbydx__dtmp * uxl(k, i + 1, j)
              enddo
            enddo
          endif
        enddo
        do j = jj, jj + 1
          if (j .ge. (-4)) then
            do i = -5, nx + 6
              do k = -5, nz + 6
                athird_$(k, i, iand(j - 1, 7)) = a(k, i, j - 1)
     &            + (al_$(k, i + 1, iand(j - 1, 1)) - al_$(k, i, iand(j - 1, 1)))
     &            + (ab_$(k, i, iand(j, 1)) - ab_$(k, i, iand(j - 1, 1)))
     &            + (af_$(k + 1, i, iand(j - 1, 1)) - af_$(k, i, iand(j - 1, 1)))
              enddo
            enddo
          endif
        enddo
        j = jj
        if (j .ge. 1) then
          do i = -2, nx + 3
            do k = -2, nz + 3
              abthird_$(k, i, iand(j - 3, 1)) =
     &          (f60 * (athird_$(k, i, iand(j - 4, 7)) + athird_$(k, i, iand(j - 3, 7)))
     &          + f61 * (athird_$(k, i, iand(j - 5, 7)) + athird_$(k, i, iand(j - 2, 7)))
     &          + f62 * (athird_$(k, i, iand(j - 6, 7)) + athird_$(k, i, iand(j - 1, 7))))
     &          * halfdtbydx__dtmp * uybthird(k, i, j - 3)
              abthird_$(k, i, iand(j - 2, 1)) =
     &          (f60 * (athird_$(k, i, iand(j - 3, 7)) + athird_$(k, i, iand(j - 2, 7)))
     &          + f61 * (athird_$(k, i, iand(j - 4, 7)) + athird_$(k, i, iand(j - 1, 7)))
     &          + f62 * (athird_$(k, i, iand(j - 5, 7)) + athird_$(k, i, iand(j, 7))))
     &          * halfdtbydx__dtmp * uybthird(k, i, j - 2)
            enddo
          enddo
        endif
      enddo

Array Contraction

• If it is possible to fuse a set of loop nests that include the definition and all of the uses of the values of a temporary array, we typically can reduce the storage footprint by applying array contraction.

• Reducing the storage footprint leads to reuse of data locations. This can reduce the memory bandwidth required if the reduced-size array can now fit in some level of cache.
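The iand(j, 1) subscripts in the generated code above implement exactly this rotation. A stripped-down C analogue (our own toy stencil, not the kernel itself) contracts a full temporary to two live slots:

```c
/* Before contraction: t is a full n-element temporary array. */
void diff_full(const double *a, double *out, int n)
{
    double t[1024];                  /* assumes n <= 1024 */
    for (int i = 0; i < n; i++)
        t[i] = a[i] * a[i];          /* producer loop */
    for (int i = 1; i < n; i++)
        out[i] = t[i] - t[i - 1];    /* consumer loop */
}

/* After fusing producer and consumer, only t[i] and t[i-1] are
 * live at once, so t contracts to two slots cycled with i & 1
 * (the analogue of iand(j, 1) in the generated Fortran). */
void diff_contracted(const double *a, double *out, int n)
{
    double t[2];
    t[0] = a[0] * a[0];
    for (int i = 1; i < n; i++) {    /* fused define + use */
        t[i & 1] = a[i] * a[i];
        out[i] = t[i & 1] - t[(i - 1) & 1];
    }
}
```

The contracted temporary stays resident in registers or L1 regardless of n, which is where the bandwidth saving comes from.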

Guard-Free Core Generation

• Using a symbolic set manipulation package, we compute the iteration space for a guard-free core loop nest from a fused computation.

• Pieces of the iteration space are clipped off along each dimension of the fused loop nest to uncover a guard-free rectangular core.

• To facilitate unroll-and-jam, the loop bounds of the core are adjusted so that the number of iterations is divisible by the unroll factor in the dimension to be unrolled.
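The idea can be sketched in C on a 1D stencil (our own example; the tool itself works on the fused multi-dimensional space with a symbolic set package). The boundary iterations are clipped off, the core's trip count is rounded down to a multiple of an assumed unroll factor of 2, and the core then runs with no guards:

```c
/* Guarded form: every iteration tests whether it is in the interior. */
void stencil_guarded(const double *a, double *b, int n)
{
    for (int i = 0; i < n; i++) {
        if (i >= 1 && i <= n - 2)
            b[i] = a[i - 1] + a[i] + a[i + 1];   /* interior stencil */
        else
            b[i] = a[i];                          /* boundary copy    */
    }
}

/* Guard-free core: boundary iterations are peeled off, and the core's
 * trip count is rounded down to a multiple of the unroll factor (2)
 * so the unrolled core needs no cleanup guard. Assumes n >= 2. */
void stencil_core(const double *a, double *b, int n)
{
    int lo = 1, hi = n - 1;                  /* interior is [lo, hi)   */
    int core_hi = lo + ((hi - lo) / 2) * 2;  /* trip count % 2 == 0    */
    b[0] = a[0];                             /* clipped-off left end   */
    for (int i = lo; i < core_hi; i += 2) {  /* unrolled, guard-free   */
        b[i]     = a[i - 1] + a[i]     + a[i + 1];
        b[i + 1] = a[i]     + a[i + 1] + a[i + 2];
    }
    for (int i = core_hi; i < hi; i++)       /* interior remainder     */
        b[i] = a[i - 1] + a[i] + a[i + 1];
    b[n - 1] = a[n - 1];                     /* clipped-off right end  */
}
```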

Results

We applied our transformation tool to the Runge-Kutta kernel of the NCOMMAS weather code developed at NCSA. We ran experiments with different combinations of transformations. This enables us to measure the effectiveness of each individual transformation as well as its interplay with the other transformations in the framework. Our results show that applying multi-level loop fusion significantly reduces cache misses. Applying storage reduction together with fusion further improves cache performance.

Table: Performance Comparison of the Runge-Kutta Kernel

Future Direction

We plan to enhance our transformation tool by integrating other loop transformations such as tiling. Also, our preliminary experiments suggest that there is significant interplay among the various transformations that we use in our framework. We plan to explore this relationship further and to develop algorithms based on cost models for orchestrating the transformations.

Figure 4: Guard-Free Core Generation for a 2D Iteration Space

Figure 3: Array Contraction

Figure 2: Runge-Kutta kernel code transformation with 1-level fusion, unroll-and-jam, and array contraction

Figure 1: Controlled Loop Fusion

A(L, M, N) contracted to A_$(L, M, 0:1):

DO j
  DO i
    DO k
      A(k, i, j) = …
      … = A(k, i, j - 1)
      … = A(k, i, j)
    ENDDO
  ENDDO
ENDDO

becomes

DO j
  DO i
    DO k
      A_$(k, i, mod(j, 2)) = …
      … = A_$(k, i, mod(j - 1, 2))
      … = A_$(k, i, mod(j, 2))
    ENDDO
  ENDDO
ENDDO

[Figure: a 2D (I x J) iteration space containing statements S1-S4; strips are clipped off the top, bottom, left, and right edges to expose the guard-free rectangular core.]

[Bar chart: Cycles, L1 misses, L2 misses, and TLB misses (scale 0 to 1.8) for the Original code and the 1L Fusion, 1L Fusion+ST, 2L Fusion, and 2L Fusion+ST variants.]