Compiling High Performance Fortran
Transcript of Compiling High Performance Fortran
Optimizing Compilers for Modern Architectures
Allen and Kennedy, Chapter 14
Overview
• Motivation for HPF
• Overview of compiling HPF programs
• Basic Loop Compilation for HPF
• Optimizations for compiling HPF
• Results and Summary
Motivation for HPF
• Message passing is required to communicate data between processors of a scalable distributed-memory multiprocessor
• Approach 1: Use MPI calls in Fortran/C code
Motivation for HPF
• Consider the following sum reduction, first as a sequential program and then as its MPI implementation
PROGRAM SUM
REAL A(10000)
READ (9) A
SUM = 0.0
DO I = 1, 10000
SUM = SUM + A(I)
ENDDO
PRINT SUM
END
PROGRAM SUM
REAL A(100), BUFF(100)
IF (PID == 0) THEN
DO IP = 0, 99
READ (9) BUFF(1:100)
IF (IP == 0) THEN
A(1:100) = BUFF(1:100)
ELSE
SEND(IP,BUFF,100)
ENDIF
ENDDO
ELSE
RECV(0,A,100)
ENDIF
/* Actual sum reduction code here */
IF (PID == 0) SEND(1,SUM,1)
IF (PID > 0) THEN
RECV(PID-1,T,1)
SUM = SUM + T
IF (PID < 99) THEN
SEND(PID+1,SUM,1)
ELSE
SEND(0,SUM,1)
ENDIF
ENDIF
IF (PID == 0) PRINT SUM;
END
Motivation for HPF
• Disadvantages of the MPI approach
—User has to rewrite the program in SPMD [Single Program Multiple Data] form
—User has to manage data movement [send & receive], data placement, and synchronization
—Too messy and not easy to master
Motivation for HPF
• Approach 2: Use HPF
—HPF is an extended version of Fortran 90
—HPF has Fortran 90 features plus a few directives
• Directives
—Tell how data is laid out in processor memories in the parallel machine configuration. For example,
– !HPF$ DISTRIBUTE A(BLOCK)
—Assist in identifying parallelism. For example,
– !HPF$ INDEPENDENT
Motivation for HPF
• The same sum reduction code
PROGRAM SUM
REAL A(10000)
READ (9) A
SUM = 0.0
DO I = 1, 10000
SUM = SUM + A(I)
ENDDO
PRINT SUM
END
• When written in HPF...
PROGRAM SUM
REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
READ (9) A
SUM = 0.0
DO I = 1, 10000
SUM = SUM + A(I)
ENDDO
PRINT SUM
END
• Minimum modification
• Easy to write
• Now compiler has to do more work
Motivation for HPF
• Advantages of HPF
—User needs only to write a few simple directives; need not write the whole program in SPMD form
—User does not need to manage data movement [send & receive] and synchronization
—Simple and easy to master
Overview
• Motivation for HPF
• Overview of compiling HPF programs
• Basic Loop Compilation for HPF
• Optimizations for compiling HPF
• Results and Summary
HPF Compilation Overview
• Running example:
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
DO I = 2, 10000
S1: A(I) = B(I-1) + C
ENDDO
DO I = 1, 10000
S2: B(I) = A(I)
ENDDO
ENDDO
• Dependence Analysis
—Used for communication analysis
—Fact used: No dependence carried by the I loops
HPF Compilation Overview
• Running example:
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
DO I = 2, 10000
S1: A(I) = B(I-1) + C
ENDDO
DO I = 1, 10000
S2: B(I) = A(I)
ENDDO
ENDDO
• Dependence Analysis
• Distribution Analysis
HPF Compilation Overview
• Running example:
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
DO I = 2, 10000
S1: A(I) = B(I-1) + C
ENDDO
DO I = 1, 10000
S2: B(I) = A(I)
ENDDO
ENDDO
• Dependence Analysis
• Distribution Analysis
• Computation Partitioning
—Partition so as to distribute the work of the I loops
HPF Compilation Overview
REAL A(1:100), B(0:100)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
I1: IF (PID /= 100) SEND(PID+1,B(100),1)
I2: IF (PID /= 0) THEN
RECV(PID-1,B(0),1)
A(1) = B(0) + C
ENDIF
DO I = 2, 100
S1: A(I) = B(I-1)+C
ENDDO
DO I = 1, 100
S2: B(I) = A(I)
ENDDO
ENDDO
• Dependence Analysis
• Distribution Analysis
• Computation Partitioning
• Communication Analysis and placement
—Communication required for B(0) in each iteration of the J loop
—Shadow region B(0)
HPF Compilation Overview
REAL A(1:100), B(0:100)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
I1: IF (PID /= 100) SEND(PID+1,B(100),1)
DO I = 2, 100
S1: A(I) = B(I-1)+C
ENDDO
I2: IF (PID /= 0) THEN
RECV(PID-1,B(0),1)
A(1) = B(0) + C
ENDIF
DO I = 1, 100
S2: B(I) = A(I)
ENDDO
ENDDO
• Dependence Analysis
• Distribution Analysis
• Computation Partitioning
• Communication Analysis and placement
• Optimization
—Aggregation
—Overlap communication and computation
—Recognition of reductions
Overview
• Motivation for HPF
• Overview of compiling HPF programs
• Basic Loop Compilation for HPF
• Optimizations for compiling HPF
• Results and Summary
Basic Loop Compilation
• Distribution Propagation and Analysis
—Analyze what distribution holds for a given array at a given point in the program
—Difficult due to
– REALIGN and REDISTRIBUTE directives
– Distribution of formal parameters inherited from the calling procedure
—Use “Reaching Decompositions” data-flow analysis and its interprocedural version
Basic Loop Compilation
• For simplicity assume single distribution for an array at all points in a subprogram
• Define the distribution mapping
μ_A(i) = (ρ_A(i), δ_A(i)) = (p, j)
i.e., element A(i) lives on processor p at local index j
• For example, suppose an array A of size N is block distributed over p processors
—Block size: B_A = ceiling(N/p)
—Owning processor: ρ_A(i) = ceiling(i/B_A) − 1
—Local index: δ_A(i) = (i−1) mod B_A + 1
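The mapping above can be sketched in Python (an illustration only; the function names `block_size`, `rho`, and `delta` are mine, not from the text):

```python
import math

# Sketch of the BLOCK distribution mapping mu_A(i) = (rho_A(i), delta_A(i)).
def block_size(n, p):
    """Block size B_A for an array of size n over p processors."""
    return math.ceil(n / p)

def rho(i, n, p):
    """Owning processor (0-based) of element i (1-based)."""
    return math.ceil(i / block_size(n, p)) - 1

def delta(i, n, p):
    """Local index (1-based) of element i on its owning processor."""
    return (i - 1) % block_size(n, p) + 1
```

For the running example (N = 10000, 100 processors), element 101 lives at local index 1 on processor 1.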
Basic Loop Compilation
• Iteration Partitioning
—Dividing work among processors
– Computation partitioning
—Determine which iterations of a loop will be executed on which processor
—Owner-computes rule
REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
DO I = 1, 10000
A(I) = A(I) + C
ENDDO
• Iteration I is executed on the owner of A(I)
• With 100 processors: the first 100 iterations run on processor 0, the next 100 on processor 1, and so on
Iteration Partitioning
• Multiple statements in a loop or a recurrence: choose a partitioning reference A(α(I))
• The processor responsible for performing the computation of iteration I is
θ_L(I) = ρ_A(α(I))
• The set of iterations executed on processor p is
{I | 1 ≤ I ≤ N; θ_L(I) = p} = α⁻¹(ρ_A⁻¹({p})) ∩ [1..N]
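The owner-computes partition can be sketched as follows (a hedged illustration; `local_iterations` and its parameters are my names, not the text's):

```python
import math

# Owner-computes partitioning sketch: iteration I of DO I = 1, loop_n
# executes on theta_L(I) = rho_A(alpha(I)), the owner of A(alpha(I)).
def local_iterations(p, loop_n, array_n, num_procs, alpha):
    """Global iterations executed on processor p under owner-computes."""
    bs = math.ceil(array_n / num_procs)
    owner = lambda k: math.ceil(k / bs) - 1   # rho_A
    return [I for I in range(1, loop_n + 1) if owner(alpha(I)) == p]
```

For A(I+1) = B(I) + C with N = 9999 over 100 processors, processor 0 executes iterations 1..99 and processor 1 executes 100..199, matching α⁻¹(ρ_A⁻¹({p})) ∩ [1..N].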
Iteration Partitioning
• Have to map the global loop index to a local loop index:
Δ_L: Global_Loop_Index → Local_Loop_Index
• The smallest value in α⁻¹(ρ_A⁻¹({p})) maps to 1
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO I = 1, N
A(I+1) = B(I) + C
ENDDO
Iteration Partitioning
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO I = 1, N
A(I+1) = B(I) + C
ENDDO
• Map the global iteration space I to the local iteration space i as follows:
ρ_A⁻¹({p}) = [100p+1 : 100p+100]
α⁻¹(ρ_A⁻¹({p})) = [100p : 100p+99]
Δ_L(I,p) = I − min(α⁻¹(ρ_A⁻¹({p}))) + 1 = I − 100p + 1
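The global-to-local mapping for this example can be sketched and checked numerically (function names are mine):

```python
# Global<->local loop index mapping for the running example (sketch):
# processor p owns global iterations [100p : 100p+99]; the smallest maps to 1.
def delta_L(I, p):
    return I - 100 * p + 1      # Delta_L(I, p)

def delta_L_inv(i, p):
    return i + 100 * p - 1      # inverse mapping
```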
Iteration Partitioning
• Adjust array subscripts for local iterations: B(β(I)) → B(γ(i)), where
γ(i) = δ_B(β(Δ_L⁻¹(i,p)))
• For the running example:
Δ_L⁻¹(i,PID) = i + 100*PID − 1
δ_B(k) = k − 100*PID
γ(i) = i + 100*PID − 1 − 100*PID = i − 1
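The cancellation of the processor offset can be checked numerically (a sketch; `gamma` is my name for the composed mapping):

```python
# Subscript adjustment sketch: gamma(i) = delta_B(beta(Delta_L_inv(i, p))).
# With beta(I) = I, the 100*PID terms cancel and gamma(i) = i - 1.
def gamma(i, pid):
    k = i + 100 * pid - 1       # Delta_L_inv: local iteration -> global iteration
    return k - 100 * pid        # delta_B: global subscript -> local subscript
```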
Iteration Partitioning
• For interior processors the code becomes..
DO i = 1, 100
A(i) = B(i-1) + C
ENDDO
• Adjust the lower bound for the first processor and the upper bound of the last processor to take care of boundary conditions:
lo = 1
IF (PID==0) lo = 2
hi = 100
IF (PID==CEIL((N+1)/100)-1) hi = MOD(N,100) + 1
DO i = lo, hi
A(i) = B(i-1) + C
ENDDO
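The lo/hi adjustment above can be mirrored in Python to see the boundary cases (a sketch; `local_bounds` is my name):

```python
import math

# Sketch of the lo/hi boundary adjustment for DO I = 1, N: A(I+1) = B(I) + C
# with 100-element blocks; mirrors the Fortran above.
def local_bounds(pid, n):
    lo = 2 if pid == 0 else 1
    last_p = math.ceil((n + 1) / 100) - 1
    hi = n % 100 + 1 if pid == last_p else 100
    return lo, hi
```

For example, with N = 9950 the last processor (PID 99) only runs local iterations up to 51.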
Communication Generation
• For our example, no communication is required for iterations in
α⁻¹(ρ_A⁻¹({p})) ∩ β⁻¹(ρ_B⁻¹({p})) ∩ [1..N]
• Iterations which require receiving data are
(α⁻¹(ρ_A⁻¹({p})) − β⁻¹(ρ_B⁻¹({p}))) ∩ [1..N]
• Iterations which require sending data are
(β⁻¹(ρ_B⁻¹({p})) − α⁻¹(ρ_A⁻¹({p}))) ∩ [1..N]
Communication Generation
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
...
DO I = 1, N
A(I+1) = B(I) + C
ENDDO
• Receive required for iterations in [100p:100p]
• Send required for iterations in [100p+100:100p+100]
• No communication required for iterations in [100p+1:100p+99]
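The set algebra behind these intervals can be sketched directly (a hedged illustration; `comm_sets` is my name, not the text's):

```python
import math

# Set-algebra sketch of communication analysis for A(I+1) = B(I) + C:
# receive set = lhs-owned minus rhs-owned; send set = rhs-owned minus lhs-owned.
def comm_sets(p, loop_n, array_n, num_procs):
    bs = math.ceil(array_n / num_procs)
    owner = lambda k: math.ceil(k / bs) - 1
    lhs = {I for I in range(1, loop_n + 1) if owner(I + 1) == p}  # alpha(I) = I+1
    rhs = {I for I in range(1, loop_n + 1) if owner(I) == p}      # beta(I) = I
    return sorted(lhs - rhs), sorted(rhs - lhs), sorted(lhs & rhs)
```

For p = 5 this reproduces the slide's intervals: receive at iteration 500, send at 600, and purely local work for 501..599.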
Communication Generation
• After inserting receive
lo = 1
IF (PID==0) lo = 2
hi = 100
IF (PID==CEIL((N+1)/100)-1)
hi = MOD(N,100) + 1
DO i = lo, hi
IF (i==1 && PID /= 0)
RECV (PID-1, B(0), 1)
A(i) = B(i-1) + C
ENDDO
• Send must happen in the 101st iteration
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP)
hi = MOD(N,100) + 1
DO i = lo, hi+1
IF (i==1 && PID /= 0)
RECV (PID-1, B(0), 1)
IF (i <= hi) THEN
A(i) = B(i-1) + C
ENDIF
IF (i == hi+1 && PID /= lastP)
SEND(PID+1, B(100), 1)
ENDDO
Communication Generation
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
DO i = lo, hi+1
IF (i==1 && PID /= 0)
RECV (PID-1, B(0), 1)
IF (i <= hi) THEN
A(i) = B(i-1) + C
ENDIF
IF (i == hi+1 && PID /= lastP)
SEND(PID+1, B(100), 1)
ENDDO
• Move SEND outside the loop
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
DO i = lo, hi
IF (i==1 && PID /= 0)
RECV (PID-1, B(0), 1)
A(i) = B(i-1) + C
ENDDO
IF (PID /= lastP)
SEND(PID+1, B(100), 1)
ENDIF
Communication Generation
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
DO i = lo, hi
IF (i==1 && PID /= 0)
RECV (PID-1, B(0), 1)
A(i) = B(i-1) + C
ENDDO
IF (PID /= lastP)
SEND(PID+1, B(100), 1)
ENDIF
• Move receive outside loop and loop peel
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
IF (lo == 1 && PID /= 0) THEN
RECV (PID-1, B(0), 1)
A(1) = B(0) + C
ENDIF
! lo = MAX(lo,1+1) == 2
DO i = 2, hi
A(i) = B(i-1) + C
ENDDO
IF (PID /= lastP)
SEND(PID+1, B(100), 1)
ENDIF
Communication Generation
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
IF (lo == 1 && PID /= 0) THEN
RECV (PID-1, B(0), 1)
A(1) = B(0) + C
ENDIF
! lo = MAX(lo,1+1) == 2
DO i = 2, hi
A(i) = B(i-1) + C
ENDDO
IF (PID /= lastP)
SEND(PID+1, B(100), 1)
ENDIF
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
IF (PID /= lastP)
SEND(PID+1, B(100), 1)
IF (lo == 1 && PID /= 0) THEN
RECV (PID-1, B(0), 1)
A(1) = B(0) + C
ENDIF
DO i = 2, hi
A(i) = B(i-1) + C
ENDDO
ENDIF
Communication Generation
• When is such rearrangement legal?
• Receive: copy from global to local location
• Send: copy local to global location
IF (PID <= lastP) THEN
S1: IF (lo == 1 && PID /= 0) THEN
B(0) = Bg(0) ! RECV
A(1) = B(0) + C
ENDIF
DO i = 2, hi
A(i) = B(i-1) + C
ENDDO
S2: IF (PID /= lastP) Bg(100) = B(100) ! SEND
ENDIF
• Legal here because there is no chain of dependences from S1 to S2
Communication Generation
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK)
...
DO I = 1, N
A(I+1) = A(I) + C
ENDDO
Would be rewritten as ..
IF (PID <= lastP) THEN
S1: IF (lo == 1 && PID /= 0) THEN
A(0) = Ag(0) ! RECV
A(1) = A(0) + C
ENDIF
DO i = 2, hi
A(i) = A(i-1) + C
ENDDO
S2: IF (PID /= lastP)
Ag(100) = A(100) ! SEND
ENDIF
• Here the rearrangement would not be correct: a chain of dependences runs from S1’s receive through the loop to the value A(100) that S2 sends
Overview
• Motivation for HPF
• Overview of compiling HPF programs
• Basic Loop Compilation for HPF
• Optimizations for compiling HPF
• Results and Summary
Communication Vectorization
REAL A(10000,100), B(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
DO J = 1, M
DO I = 1, N
A(I+1,J) = B(I,J) + C
ENDDO
ENDDO
• Using Basic Loop compilation gives..
DO J = 1, M
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP)
hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
IF (PID /= lastP)
SEND(PID+1, B(100,J), 1)
IF (lo == 1) THEN
RECV (PID-1, B(0,J), 1)
A(1,J) = B(0,J) + C
ENDIF
DO i = 2, hi
A(i,J) = B(i-1,J) + C
ENDDO
ENDIF
ENDDO
Communication Vectorization
DO J = 1, M
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP)
hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
IF (PID /= lastP)
SEND(PID+1, B(100,J), 1)
IF (lo == 1) THEN
RECV (PID-1, B(0,J), 1)
A(1,J) = B(0,J) + C
ENDIF
DO i = 2, hi
A(i,J) = B(i-1,J) + C
ENDDO
ENDIF
ENDDO
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
DO J = 1, M
IF (PID /= lastP)
SEND(PID+1, B(100,J), 1)
ENDDO
DO J = 1, M
IF (lo == 1) THEN
RECV (PID-1, B(0,J), 1)
A(1,J) = B(0,J) + C
ENDIF
ENDDO
DO J = 1, M
DO i = 2, hi
A(i,J) = B(i-1,J) + C
ENDDO
ENDDO
ENDIF
• Distribute the J loop
Communication Vectorization
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
DO J = 1, M
IF (PID /= lastP)
SEND(PID+1, B(100,J), 1)
ENDDO
DO J = 1, M
IF (lo == 1) THEN
RECV (PID-1, B(0,J), 1)
A(1,J) = B(0,J) + C
ENDIF
ENDDO
DO J = 1, M
DO i = 2, hi
A(i,J) = B(i-1,J) + C
ENDDO
ENDDO
ENDIF
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
IF (lo == 1) THEN
RECV (PID-1, B(0,1:M), M)
DO J = 1, M
A(1,J) = B(0,J) + C
ENDDO
ENDIF
DO J = 1, M
DO i = 2, hi
A(i,J) = B(i-1,J) + C
ENDDO
ENDDO
IF (PID /= lastP)
SEND(PID+1, B(100,1:M), M)
ENDIF
Communication Vectorization
DO J = 1, M
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP)
hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
S1: IF (PID /= lastP)
Bg(100,J)=B(100,J)
IF (lo == 1) THEN
S2: B(0,J)=Bg(0,J)
S3: A(1,J) = B(0,J) + C
ENDIF
DO i = 2, hi
S4: A(i,J) = B(i-1,J) + C
ENDDO
ENDIF
ENDDO
• Communication statements resulting from an inner loop can be vectorized with respect to an outer loop if they are not involved in a recurrence carried by the outer loop
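A rough way to see the payoff: under a standard latency/bandwidth cost model (my assumption, not stated in the slides), sending n elements costs α + βn, so aggregating M one-element messages into a single M-element message replaces M startup costs with one:

```python
# Assumed latency/bandwidth model: cost(n) = alpha + beta * n per message.
def message_cost(n_elems, alpha, beta):
    return alpha + beta * n_elems

def unvectorized(M, alpha, beta):
    return M * message_cost(1, alpha, beta)   # M one-element messages

def vectorized(M, alpha, beta):
    return message_cost(M, alpha, beta)       # one M-element message
```

When startup latency α dominates per-element cost β (the usual case), the vectorized version wins by nearly a factor of M.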
Communication Vectorization
REAL A(10000,100), B(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
DO J = 1, M
DO I = 1, N
A(I+1,J) = A(I,J) + B(I,J)
ENDDO
ENDDO
• Can sends be done before the receives?
• Can communication be vectorized?
REAL A(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*)
DO J = 1, M
DO I = 1, N
A(I+1,J+1) = A(I,J) + C
ENDDO
ENDDO
• Can sends be done before the receives?
• Can communication be fully vectorized?
Overlapping Communication and Computation
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
S0:IF (PID /= lastP)
SEND(PID+1, B(100), 1)
S1:IF (lo == 1 && PID /= 0) THEN
RECV (PID-1, B(0), 1)
A(1) = B(0) + C
ENDIF
L1:DO i = 2, hi
A(i) = B(i-1) + C
ENDDO
ENDIF
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
S0:IF (PID /= lastP)
SEND(PID+1, B(100), 1)
L1:DO i = 2, hi
A(i) = B(i-1) + C
ENDDO
S1:IF (lo == 1 && PID /= 0) THEN
RECV (PID-1, B(0), 1)
A(1) = B(0) + C
ENDIF
ENDIF
Pipelining
REAL A(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*)
DO J = 1, M
DO I = 1, N
A(I+1,J) = A(I,J) + C
ENDDO
ENDDO
• Initial code generation for the I loop gives..
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
DO J = 1, M
IF (lo == 1) THEN
RECV (PID-1, A(0,J), 1)
A(1,J) = A(0,J) + C
ENDIF
DO i = 2, hi
A(i,J) = A(i-1,J) + C
ENDDO
IF (PID /= lastP)
SEND(PID+1, A(100,J), 1)
ENDDO
ENDIF
• Can be vectorized, but that gives up parallelism
Pipelining
• Pipelined parallelism with communication
Pipelining
• Pipelined parallelism with communication overhead
Pipelining: Blocking
lo = 1
IF (PID==0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID==lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
DO J = 1, M
IF (lo == 1) THEN
RECV (PID-1, A(0,J), 1)
A(1,J) = A(0,J) + C
ENDIF
DO i = 2, hi
A(i,J) = A(i-1,J) + C
ENDDO
IF (PID /= lastP)
SEND(PID+1, A(100,J), 1)
ENDDO
ENDIF
...
IF (PID <= lastP) THEN
DO J = 1, M, K
IF (lo == 1) THEN
RECV (PID-1, A(0,J:J+K-1), K)
DO jj = J, J+K-1
A(1,jj) = A(0,jj) + C
ENDDO
ENDIF
DO jj = J, J+K-1
DO i = 2, hi
A(i,jj) = A(i-1,jj) + C
ENDDO
ENDDO
IF (PID /= lastP)
SEND(PID+1, A(100,J:J+K-1),K)
ENDDO
ENDIF
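The choice of block size K is a tradeoff between pipeline fill and message startup cost. A simple cost model (my assumption, not from the slides: each processor spends K*w computing a block plus latency α to forward it, and blocks stream through P pipeline stages) shows that an intermediate K beats both extremes:

```python
import math

# Assumed pipeline cost model: P processors, M columns, block size K.
def pipeline_time(P, M, K, w, alpha):
    blocks = math.ceil(M / K)     # number of blocks flowing through the pipe
    stage = K * w + alpha         # time for one processor to handle one block
    return (P + blocks - 1) * stage
```

With P = 10, M = 100, w = 1, α = 10: K = 1 maximizes parallelism but pays the latency per column, K = M fully vectorizes but serializes the pipeline, and K = 10 is far cheaper than either.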
Other Optimizations
• Alignment and Replication
• Identification of common recurrences
• Storage Management
—Minimize temporary storage used for communication
—Space taken for temporary storage should be at most equal to the space taken by the arrays
• Interprocedural Optimizations
• Interprocedural Optimizations
Results
Summary
• HPF is easy to code
—But hard to compile
• Steps required to compile HPF programs
—Basic loop compilation
– Communication generation
—Optimizations
– Communication vectorization
– Overlapping communication with computation
– Pipelining