Compiling High Performance Fortran
Transcript of Compiling High Performance Fortran
Optimizing Compilers for Modern Architectures
Allen and Kennedy, Chapter 14
Overview
• Motivation for HPF
• Overview of compiling HPF programs
• Basic Loop Compilation for HPF
• Optimizations for compiling HPF
• Results and Summary
Motivation for HPF
• Message passing is required to communicate data between processors of a scalable distributed-memory multiprocessor
• Approach 1: Use MPI calls in Fortran/C code
Motivation for HPF
• Consider the following sum reduction, first as a sequential program and then as its MPI implementation
PROGRAM SUM
REAL A(10000)
READ (9) A
SUM = 0.0
DO I = 1, 10000
SUM = SUM + A(I)
ENDDO
PRINT SUM
END
PROGRAM SUM
REAL A(100), BUFF(100)
IF (PID == 0) THEN
DO IP = 0, 99
READ (9) BUFF(1:100)
IF (IP == 0) THEN
A(1:100) = BUFF(1:100)
ELSE
SEND(IP,BUFF,100)
ENDIF
ENDDO
ELSE
RECV(0,A,100)
ENDIF
/* Actual sum reduction code here */
IF (PID == 0) SEND(1,SUM,1)
IF (PID > 0) THEN
RECV(PID-1,T,1)
SUM = SUM + T
IF (PID < 99) THEN
SEND(PID+1,SUM,1)
ELSE
SEND(0,SUM,1)
ENDIF
ENDIF
IF (PID == 0) PRINT SUM;
END
Motivation for HPF
• Disadvantages of the MPI approach
—User has to rewrite the program in SPMD [Single Program Multiple Data] form
—User has to manage data movement [send & receive], data placement, and synchronization
—Too messy and not easy to master
Motivation for HPF
• Approach 2: Use HPF
—HPF is an extended version of Fortran 90
—HPF has Fortran 90 features plus a few directives
• Directives
—Tell how data is laid out in processor memories in the parallel machine configuration. For example,
– !HPF$ DISTRIBUTE A(BLOCK)
—Assist in identifying parallelism. For example,
– !HPF$ INDEPENDENT
Motivation for HPF
• The same sum reduction code
PROGRAM SUM
REAL A(10000)
READ (9) A
SUM = 0.0
DO I = 1, 10000
SUM = SUM + A(I)
ENDDO
PRINT SUM
END
• When written in HPF...
PROGRAM SUM
REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
READ (9) A
SUM = 0.0
DO I = 1, 10000
SUM = SUM + A(I)
ENDDO
PRINT SUM
END
• Minimum modification
• Easy to write
• Now compiler has to do more work
Motivation for HPF
• Advantages of HPF
—User needs only to write a few simple directives; need not write the whole program in SPMD form
—User does not need to manage data movement [send & receive] and synchronization
—Simple and easy to master
Overview
• Motivation for HPF
• Overview of compiling HPF programs
• Basic Loop Compilation for HPF
• Optimizations for compiling HPF
• Results and Summary
HPF Compilation Overview
• Running example:
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
DO I = 2, 10000
S1: A(I) = B(I-1) + C
ENDDO
DO I = 1, 10000
S2: B(I) = A(I)
ENDDO
ENDDO
• Dependence Analysis
—Used for communication analysis
—Fact used: No dependence carried by the I loops
HPF Compilation Overview
• Running example:
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
DO I = 2, 10000
S1: A(I) = B(I-1) + C
ENDDO
DO I = 1, 10000
S2: B(I) = A(I)
ENDDO
ENDDO
• Dependence Analysis
• Distribution Analysis
HPF Compilation Overview
• Running example:
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
DO I = 2, 10000
S1: A(I) = B(I-1) + C
ENDDO
DO I = 1, 10000
S2: B(I) = A(I)
ENDDO
ENDDO
• Dependence Analysis
• Distribution Analysis
• Computation Partitioning
—Partition so as to distribute the work of the I loops
HPF Compilation Overview
REAL A(1:100), B(0:100)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
I1: IF (PID /= 100) SEND(PID+1,B(100),1)
I2: IF (PID /= 0) THEN
RECV(PID-1,B(0),1)
A(1) = B(0) + C
ENDIF
DO I = 2, 100
S1: A(I) = B(I-1)+C
ENDDO
DO I = 1, 100
S2: B(I) = A(I)
ENDDO
ENDDO
• Dependence Analysis
• Distribution Analysis
• Computation Partitioning
• Communication Analysis and placement
—Communication required for B(0) in each iteration of the J loop
—Shadow region B(0)
HPF Compilation Overview
REAL A(1:100), B(0:100)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
I1: IF (PID /= 100) SEND(PID+1,B(100),1)
DO I = 2, 100
S1: A(I) = B(I-1)+C
ENDDO
I2: IF (PID /= 0) THEN
RECV(PID-1,B(0),1)
A(1) = B(0) + C
ENDIF
DO I = 1, 100
S2: B(I) = A(I)
ENDDO
ENDDO
• Dependence Analysis
• Distribution Analysis
• Computation Partitioning
• Communication Analysis and placement
• Optimization
—Aggregation
—Overlap communication and computation
—Recognition of reductions
Overview
• Motivation for HPF
• Overview of compiling HPF programs
• Basic Loop Compilation for HPF
• Optimizations for compiling HPF
• Results and Summary
Basic Loop Compilation
• Distribution Propagation and Analysis
—Analyze what distribution holds for a given array at a given point in the program
—Difficult due to
– REALIGN and REDISTRIBUTE directives
– Distribution of formal parameters inherited from the calling procedure
—Use “Reaching Decompositions” data-flow analysis and its interprocedural version
Basic Loop Compilation
• For simplicity assume single distribution for an array at all points in a subprogram
• Define the distribution mapping
μ_A(i) = (ρ_A(i), δ_A(i)) = (p, j)
i.e., element A(i) lives on processor p at local index j
• For example, suppose an array A of size N is block distributed over p processors
—Block size: B_A = ceiling(N/p)
—Owning processor: ρ_A(i) = ceiling(i/B_A) − 1
—Local index: δ_A(i) = (i−1) mod B_A + 1
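The mapping above can be sketched in Python (an illustration only; the function names `block_size`, `rho`, and `delta` are mine, not from the text):

```python
import math

# Sketch of the BLOCK distribution mapping mu_A(i) = (rho_A(i), delta_A(i)).
def block_size(n, p):
    """Block size B_A for an array of size n over p processors."""
    return math.ceil(n / p)

def rho(i, n, p):
    """Owning processor (0-based) of element i (1-based)."""
    return math.ceil(i / block_size(n, p)) - 1

def delta(i, n, p):
    """Local index (1-based) of element i on its owning processor."""
    return (i - 1) % block_size(n, p) + 1
```

For the running example (N = 10000, 100 processors), element 101 lives at local index 1 on processor 1.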
Basic Loop Compilation
• Iteration Partitioning
—Dividing work among processors
– Computation partitioning
—Determine which iterations of a loop will be executed on which processor
—Owner-computes rule
REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
DO I = 1, 10000
A(I) = A(I) + C
ENDDO
• Iteration I is executed on the owner of A(I)
• With 100 processors: the first 100 iterations run on processor 0, the next 100 on processor 1, and so on
Iteration Partitioning
• Multiple statements in a loop or a recurrence: choose a partitioning reference A(α(I))
• The processor responsible for performing the computation of iteration I is
θ_L(I) = ρ_A(α(I))
• The set of iterations executed on processor p is
{I | 1 ≤ I ≤ N; θ_L(I) = p} = α⁻¹(ρ_A⁻¹({p})) ∩ [1..N]
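The owner-computes partition can be sketched as follows (a hedged illustration; `local_iterations` and its parameters are my names, not the text's):

```python
import math

# Owner-computes partitioning sketch: iteration I of DO I = 1, loop_n
# executes on theta_L(I) = rho_A(alpha(I)), the owner of A(alpha(I)).
def local_iterations(p, loop_n, array_n, num_procs, alpha):
    """Global iterations executed on processor p under owner-computes."""
    bs = math.ceil(array_n / num_procs)
    owner = lambda k: math.ceil(k / bs) - 1   # rho_A
    return [I for I in range(1, loop_n + 1) if owner(alpha(I)) == p]
```

For A(I+1) = B(I) + C with N = 9999 over 100 processors, processor 0 executes iterations 1..99 and processor 1 executes 100..199, matching α⁻¹(ρ_A⁻¹({p})) ∩ [1..N].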
Iteration Partitioning
• Have to map the global loop index to a local loop index:
Δ_L: Global_Loop_Index → Local_Loop_Index
• The smallest value in α⁻¹(ρ_A⁻¹({p})) maps to 1
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO I = 1, N
A(I+1) = B(I) + C
ENDDO
Iteration Partitioning
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO I = 1, N
A(I+1) = B(I) + C
ENDDO
• Map the global iteration space I to the local iteration space i as follows:
ρ_A⁻¹({p}) = [100p+1 : 100p+100]
α⁻¹(ρ_A⁻¹({p})) = [100p : 100p+99]
Δ_L(I,p) = I − min(α⁻¹(ρ_A⁻¹({p}))) + 1 = I − 100p + 1
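The global-to-local mapping for this example can be sketched and checked numerically (function names are mine):

```python
# Global<->local loop index mapping for the running example (sketch):
# processor p owns global iterations [100p : 100p+99]; the smallest maps to 1.
def delta_L(I, p):
    return I - 100 * p + 1      # Delta_L(I, p)

def delta_L_inv(i, p):
    return i + 100 * p - 1      # inverse mapping
```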
Iteration Partitioning
• Adjust array subscripts for local iterations: B(β(I)) → B(γ(i)), where
γ(i) = δ_B(β(Δ_L⁻¹(i,p)))
• For the running example:
Δ_L⁻¹(i,PID) = i + 100*PID − 1
δ_B(k) = k − 100*PID
γ(i) = i + 100*PID − 1 − 100*PID = i − 1
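The cancellation of the processor offset can be checked numerically (a sketch; `gamma` is my name for the composed mapping):

```python
# Subscript adjustment sketch: gamma(i) = delta_B(beta(Delta_L_inv(i, p))).
# With beta(I) = I, the 100*PID terms cancel and gamma(i) = i - 1.
def gamma(i, pid):
    k = i + 100 * pid - 1       # Delta_L_inv: local iteration -> global iteration
    return k - 100 * pid        # delta_B: global subscript -> local subscript
```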
Iteration Partitioning
• For interior processors the code becomes..
DO i = 1, 100
A(i) = B(i-1) + C
ENDDO
• Adjust the lower bound for the first processor and the upper bound of the last processor to take care of boundary conditions:
lo = 1
IF (PID==0) lo = 2
hi = 100
IF (PID==CEIL((N+1)/100)-1) hi = MOD(N,100) + 1
DO i = lo, hi
A(i) = B(i-1) + C
ENDDO
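The lo/hi adjustment above can be mirrored in Python to see the boundary cases (a sketch; `local_bounds` is my name):

```python
import math

# Sketch of the lo/hi boundary adjustment for DO I = 1, N: A(I+1) = B(I) + C
# with 100-element blocks; mirrors the Fortran above.
def local_bounds(pid, n):
    lo = 2 if pid == 0 else 1
    last_p = math.ceil((n + 1) / 100) - 1
    hi = n % 100 + 1 if pid == last_p else 100
    return lo, hi
```

For example, with N = 9950 the last processor (PID 99) only runs local iterations up to 51.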
Communication Generation
• For our example, no communication is required for iterations in
α⁻¹(ρ_A⁻¹({p})) ∩ β⁻¹(ρ_B⁻¹({p})) ∩ [1..N]
• Iterations which require receiving data are
(α⁻¹(ρ_A⁻¹({p})) − β⁻¹(ρ_B⁻¹({p}))) ∩ [1..N]
• Iterations which require sending data are
(β⁻¹(ρ_B⁻¹({p})) − α⁻¹(ρ_A⁻¹({p}))) ∩ [1..N]
Communication Generation
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
...
DO I = 1, N
A(I+1) = B(I) + C
ENDDO
• Receive required for iterations in [100p:100p]
• Send required for iterations in [100p+100:100p+100]
• No communication required for iterations in [100p+1:100p+99]
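The set algebra behind these intervals can be sketched directly (a hedged illustration; `comm_sets` is my name, not the text's):

```python
import math

# Set-algebra sketch of communication analysis for A(I+1) = B(I) + C:
# receive set = lhs-owned minus rhs-owned; send set = rhs-owned minus lhs-owned.
def comm_sets(p, loop_n, array_n, num_procs):
    bs = math.ceil(array_n / num_procs)
    owner = lambda k: math.ceil(k / bs) - 1
    lhs = {I for I in range(1, loop_n + 1) if owner(I + 1) == p}  # alpha(I) = I+1
    rhs = {I for I in range(1, loop_n + 1) if owner(I) == p}      # beta(I) = I
    return sorted(lhs - rhs), sorted(rhs - lhs), sorted(lhs & rhs)
```

For p = 5 this reproduces the slide's intervals: receive at iteration 500, send at 600, and purely local work for 501..599.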
Communication Generation
• After inserting receive
lo = 1
IF (PID==0) lo = 2
hi = 100
IF (PID==CEIL((N+1)/100)-1)
hi = MOD(N,100) + 1
DO i = lo, hi
IF (i==1 && PID /= 0)
RECV (PID-1, B(0), 1)
A(i) = B(i-1) + C
ENDDO
• Send must happen in the 101st iteration
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP)
hi = MOD(N,100) + 1
DO i = lo, hi+1
IF (i==1 && PID /= 0)
RECV (PID-1, B(0), 1)
IF (i <= hi) THEN
A(i) = B(i-1) + C
ENDIF
IF (i == hi+1 && PID /= lastP)
SEND(PID+1, B(100), 1)
ENDDO
Communication Generation
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
DO i = lo, hi+1
IF (i==1 && PID /= 0)
RECV (PID-1, B(0), 1)
IF (i <= hi) THEN
A(i) = B(i-1) + C
ENDIF
IF (i == hi+1 && PID /= lastP)
SEND(PID+1, B(100), 1)
ENDDO
• Move SEND outside the loop
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
DO i = lo, hi
IF (i==1 && PID /= 0)
RECV (PID-1, B(0), 1)
A(i) = B(i-1) + C
ENDDO
IF (PID /= lastP)
SEND(PID+1, B(100), 1)
ENDIF
Communication Generation
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
DO i = lo, hi
IF (i==1 && PID /= 0)
RECV (PID-1, B(0), 1)
A(i) = B(i-1) + C
ENDDO
IF (PID /= lastP)
SEND(PID+1, B(100), 1)
ENDIF
• Move receive outside loop and loop peel
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
IF (lo == 1 && PID /= 0) THEN
RECV (PID-1, B(0), 1)
A(1) = B(0) + C
ENDIF
! lo = MAX(lo,1+1) == 2
DO i = 2, hi
A(i) = B(i-1) + C
ENDDO
IF (PID /= lastP)
SEND(PID+1, B(100), 1)
ENDIF
Communication Generation
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
IF (lo == 1 && PID /= 0) THEN
RECV (PID-1, B(0), 1)
A(1) = B(0) + C
ENDIF
! lo = MAX(lo,1+1) == 2
DO i = 2, hi
A(i) = B(i-1) + C
ENDDO
IF (PID /= lastP)
SEND(PID+1, B(100), 1)
ENDIF
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
IF (PID /= lastP)
SEND(PID+1, B(100), 1)
IF (lo == 1 && PID /= 0) THEN
RECV (PID-1, B(0), 1)
A(1) = B(0) + C
ENDIF
DO i = 2, hi
A(i) = B(i-1) + C
ENDDO
ENDIF
Communication Generation
• When is such rearrangement legal?
• Receive: copy from global to local location
• Send: copy local to global location
IF (PID <= lastP) THEN
S1: IF (lo == 1 && PID /= 0) THEN
B(0) = Bg(0) ! RECV
A(1) = B(0) + C
ENDIF
DO i = 2, hi
A(i) = B(i-1) + C
ENDDO
S2: IF (PID /= lastP) Bg(100) = B(100) ! SEND
ENDIF
• Legal here because there is no chain of dependences from S1 to S2
Communication Generation
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK)
...
DO I = 1, N
A(I+1) = A(I) + C
ENDDO
Would be rewritten as ..
IF (PID <= lastP) THEN
S1: IF (lo == 1 && PID /= 0) THEN
A(0) = Ag(0) ! RECV
A(1) = A(0) + C
ENDIF
DO i = 2, hi
A(i) = A(i-1) + C
ENDDO
S2: IF (PID /= lastP)
Ag(100) = A(100) ! SEND
ENDIF
• Here the rearrangement would not be correct: a chain of dependences runs from S1’s receive through the loop to the value A(100) that S2 sends
Overview
• Motivation for HPF
• Overview of compiling HPF programs
• Basic Loop Compilation for HPF
• Optimizations for compiling HPF
• Results and Summary
Communication Vectorization
REAL A(10000,100), B(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
DO J = 1, M
DO I = 1, N
A(I+1,J) = B(I,J) + C
ENDDO
ENDDO
• Using Basic Loop compilation gives..
DO J = 1, M
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP)
hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
IF (PID /= lastP)
SEND(PID+1, B(100,J), 1)
IF (lo == 1) THEN
RECV (PID-1, B(0,J), 1)
A(1,J) = B(0,J) + C
ENDIF
DO i = 2, hi
A(i,J) = B(i-1,J) + C
ENDDO
ENDIF
ENDDO
Communication Vectorization
DO J = 1, M
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP)
hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
IF (PID /= lastP)
SEND(PID+1, B(100,J), 1)
IF (lo == 1) THEN
RECV (PID-1, B(0,J), 1)
A(1,J) = B(0,J) + C
ENDIF
DO i = 2, hi
A(i,J) = B(i-1,J) + C
ENDDO
ENDIF
ENDDO
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
DO J = 1, M
IF (PID /= lastP)
SEND(PID+1, B(100,J), 1)
ENDDO
DO J = 1, M
IF (lo == 1) THEN
RECV (PID-1, B(0,J), 1)
A(1,J) = B(0,J) + C
ENDIF
ENDDO
DO J = 1, M
DO i = 2, hi
A(i,J) = B(i-1,J) + C
ENDDO
ENDDO
ENDIF
• Distribute the J loop
Communication Vectorization
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
DO J = 1, M
IF (PID /= lastP)
SEND(PID+1, B(100,J), 1)
ENDDO
DO J = 1, M
IF (lo == 1) THEN
RECV (PID-1, B(0,J), 1)
A(1,J) = B(0,J) + C
ENDIF
ENDDO
DO J = 1, M
DO i = 2, hi
A(i,J) = B(i-1,J) + C
ENDDO
ENDDO
ENDIF
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
IF (lo == 1) THEN
RECV (PID-1, B(0,1:M), M)
DO J = 1, M
A(1,J) = B(0,J) + C
ENDDO
ENDIF
DO J = 1, M
DO i = 2, hi
A(i,J) = B(i-1,J) + C
ENDDO
ENDDO
IF (PID /= lastP)
SEND(PID+1, B(100,1:M), M)
ENDIF
Communication Vectorization
DO J = 1, M
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP)
hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
S1: IF (PID /= lastP)
Bg(100,J)=B(100,J)
IF (lo == 1) THEN
S2: B(0,J)=Bg(0,J)
S3: A(1,J) = B(0,J) + C
ENDIF
DO i = 2, hi
S4: A(i,J) = B(i-1,J) + C
ENDDO
ENDIF
ENDDO
• Communication statements resulting from an inner loop can be vectorized with respect to an outer loop if they are not involved in a recurrence carried by the outer loop
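A rough way to see the payoff: under a standard latency/bandwidth cost model (my assumption, not stated in the slides), sending n elements costs α + βn, so aggregating M one-element messages into a single M-element message replaces M startup costs with one:

```python
# Assumed latency/bandwidth model: cost(n) = alpha + beta * n per message.
def message_cost(n_elems, alpha, beta):
    return alpha + beta * n_elems

def unvectorized(M, alpha, beta):
    return M * message_cost(1, alpha, beta)   # M one-element messages

def vectorized(M, alpha, beta):
    return message_cost(M, alpha, beta)       # one M-element message
```

When startup latency α dominates per-element cost β (the usual case), the vectorized version wins by nearly a factor of M.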
Communication Vectorization
REAL A(10000,100), B(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
DO J = 1, M
DO I = 1, N
A(I+1,J) = A(I,J) + B(I,J)
ENDDO
ENDDO
• Can sends be done before the receives?
• Can communication be vectorized?
REAL A(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*)
DO J = 1, M
DO I = 1, N
A(I+1,J+1) = A(I,J) + C
ENDDO
ENDDO
• Can sends be done before the receives?
• Can communication be fully vectorized?
Overlapping Communication and Computation
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
S0:IF (PID /= lastP)
SEND(PID+1, B(100), 1)
S1:IF (lo == 1 && PID /= 0) THEN
RECV (PID-1, B(0), 1)
A(1) = B(0) + C
ENDIF
L1:DO i = 2, hi
A(i) = B(i-1) + C
ENDDO
ENDIF
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
S0:IF (PID /= lastP)
SEND(PID+1, B(100), 1)
L1:DO i = 2, hi
A(i) = B(i-1) + C
ENDDO
S1:IF (lo == 1 && PID /= 0) THEN
RECV (PID-1, B(0), 1)
A(1) = B(0) + C
ENDIF
ENDIF
Pipelining
REAL A(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*)
DO J = 1, M
DO I = 1, N
A(I+1,J) = A(I,J) + C
ENDDO
ENDDO
• Initial code generation for the I loop gives..
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
DO J = 1, M
IF (lo == 1) THEN
RECV (PID-1, A(0,J), 1)
A(1,J) = A(0,J) + C
ENDIF
DO i = 2, hi
A(i,J) = A(i-1,J) + C
ENDDO
IF (PID /= lastP)
SEND(PID+1, A(100,J), 1)
ENDDO
ENDIF
• Can be vectorized, but that gives up parallelism
Pipelining
• Pipelined parallelism with communication
Pipelining
• Pipelined parallelism with communication overhead
Pipelining: Blocking
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
DO J = 1, M
IF (lo == 1) THEN
RECV (PID-1, A(0,J), 1)
A(1,J) = A(0,J) + C
ENDIF
DO i = 2, hi
A(i,J) = A(i-1,J) + C
ENDDO
IF (PID /= lastP)
SEND(PID+1, A(100,J), 1)
ENDDO
ENDIF
...
IF (PID <= lastP) THEN
DO J = 1, M, K
IF (lo == 1) THEN
RECV (PID-1, A(0,J:J+K-1), K)
DO jj = J, J+K-1
A(1,jj) = A(0,jj) + C
ENDDO
ENDIF
DO jj = J, J+K-1
DO i = 2, hi
A(i,jj) = A(i-1,jj) + C
ENDDO
ENDDO
IF (PID /= lastP)
SEND(PID+1, A(100,J:J+K-1),K)
ENDDO
ENDIF
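The choice of block size K is a tradeoff between pipeline fill and message startup cost. A simple cost model (my assumption, not from the slides: each processor spends K*w computing a block plus latency α to forward it, and blocks stream through P pipeline stages) shows that an intermediate K beats both extremes:

```python
import math

# Assumed pipeline cost model: P processors, M columns, block size K.
def pipeline_time(P, M, K, w, alpha):
    blocks = math.ceil(M / K)     # number of blocks flowing through the pipe
    stage = K * w + alpha         # time for one processor to handle one block
    return (P + blocks - 1) * stage
```

With P = 10, M = 100, w = 1, α = 10: K = 1 maximizes parallelism but pays the latency per column, K = M fully vectorizes but serializes the pipeline, and K = 10 is far cheaper than either.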
Other Optimizations
• Alignment and Replication
• Identification of common recurrences
• Storage Management
—Minimize temporary storage used for communication
—Space taken for temporary storage should be at most equal to the space taken by the arrays
• Interprocedural Optimizations
• Interprocedural Optimizations
Results
Summary
• HPF is easy to code
—But hard to compile
• Steps required to compile HPF programs
—Basic loop compilation
– Communication generation
—Optimizations
– Communication vectorization
– Overlapping communication with computation
– Pipelining