Experiences Building a Multi-platform Compiler for Co-array Fortran
John Mellor-Crummey
Cristian Coarfa, Yuri Dotsenko
Department of Computer Science, Rice University
AHPCRC PGAS Workshop September, 2005
Goals for HPC Languages
• Expressiveness
• Ease of programming
• Portable performance
• Ubiquitous availability
PGAS Languages
• Global address space programming model
– one-sided communication (GET/PUT)
• Programmer has control over performance-critical factors
– data distribution and locality control
– computation partitioning
– communication placement
• Data movement and synchronization as language primitives
– amenable to compiler-based communication optimization
Callouts:
– HPF & OpenMP compilers must get this right
– simpler than message passing
– lacking in OpenMP
Co-array Fortran Programming Model
• SPMD process images
– fixed number of images during execution
– images operate asynchronously
• Both private and shared data
– real x(20, 20): a private 20x20 array in each image
– real y(20, 20)[*]: a shared 20x20 array in each image
• Simple one-sided shared-memory communication
– x(:,j:j+2) = y(:,p:p+2)[r]: copy columns from image r into local columns
• Synchronization intrinsic functions
– sync_all: a barrier and a memory fence
– sync_mem: a memory fence
– sync_team([team members to notify], [team members to wait for])
• Pointers and (perhaps asymmetric) dynamic allocation
• Parallel I/O
One-sided Communication with Co-Arrays
integer a(10,20)[*]
if (this_image() > 1) &
  a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
[Figure: images 1, 2, ..., N each hold a local a(10,20); every image except image 1 copies columns 19:20 of its left neighbor into its own columns 1:2]
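The fragment on this slide can be written out as a complete program. What follows is a minimal sketch in standard Fortran 2008 coarray syntax (`sync all` rather than the deck's `sync_all`), not cafc's own dialect. The fill values, the final check, and the program name are assumptions added for illustration. It assumes a coarray-capable compiler, for example gfortran with `-fcoarray=single`, where only image 1 runs and the GET is skipped.

```fortran
program left_neighbor_copy
  implicit none
  integer :: a(10,20)[*]   ! one 10x20 co-array per image
  integer :: me

  me = this_image()
  a = me                   ! fill local data with this image's index
  sync all                 ! every image has initialized a before any GET

  ! One-sided GET: copy the last two columns of the left neighbor
  ! into the first two local columns. Image 1 has no left neighbor.
  if (me > 1) a(1:10,1:2) = a(1:10,19:20)[me-1]

  sync all                 ! all GETs are complete before a is reused

  ! Illustrative check (an addition, not part of the slide).
  if (me > 1) then
    if (any(a(:,1:2) /= me-1)) error stop "unexpected halo values"
  end if
  print *, "image", me, "of", num_images(), ": a(1,1) =", a(1,1)
end program left_neighbor_copy
```

The left neighbor never writes its columns 19:20 in this sketch, so the GET does not race with the neighbor's own update of columns 1:2.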
CAF Compilers
• Cray compilers for X1 & T3E architectures
• Rice Co-Array Fortran Compiler (cafc)
Rice cafc Compiler
• Source-to-source compiler
– source-to-source yields multi-platform portability
• Implements core language features
– core sufficient for non-trivial codes
– preliminary support for derived types
• support for allocatable components coming soon
• Open source
Performance comparable to that of hand-tuned MPI codes
Implementation Strategy
• Goals
– portability
– high performance on a wide range of platforms
• Approach
– source-to-source compilation of CAF codes
• use Open64/SL Fortran 90 infrastructure
• CAF → Fortran 90 + communication operations
– communication
• ARMCI and GASNet one-sided comm libraries for portability
• load/store communication on shared-memory platforms
Key Implementation Concerns
• Fast access to local co-array data
• Fast communication
• Overlap of communication and computation
Accessing Co-Array Data
Two Representations
• SAVE and COMMON co-arrays as Fortran 90 pointers
– F90 pointers to memory allocated outside Fortran run-time system
– original references accessing local co-array data
• rhs(1,i,j,k,c) = … + u(1,i-1,j,k,c) - …
– transformed references
• rhs%ptr(1,i,j,k,c) = … + u%ptr(1,i-1,j,k,c) - …
• Procedure co-array arguments as F90 explicit-shape arrays
– CAF language requires explicit shape for co-array arguments
Example: the declaration
real :: a(10,10,10)[*]
is represented as
type CAFDesc_real_3
  real, pointer :: ptr(:,:,:)   ! F90 pointer to local co-array data
end type CAFDesc_real_3
type(CAFDesc_real_3) :: a
Performance Challenges
• Problem
– Fortran 90 pointer-based representation does not convey
• the lack of co-array aliasing
• contiguity of co-array data
• co-array bounds information
– lack of knowledge inhibits important code optimizations
• Approach: procedure splitting
Procedure Splitting
CAF to CAF optimization
Original code:
subroutine f(…)
  real, save :: c(100)[*]
  ... = c(50) ...
end subroutine f
After procedure splitting:
subroutine f(…)
  real, save :: c(100)[*]
  interface
    subroutine f_inner(…, c_arg)
      real :: c_arg[*]
    end subroutine f_inner
  end interface
  call f_inner(…,c(1))
end subroutine f
subroutine f_inner(…, c_arg)
  real :: c_arg(100)[*]
  ... = c_arg(50) ...
end subroutine f_inner
Benefits
• better alias analysis
• contiguity of co-array data
• co-array bounds information
• better dependence analysis
Result: back-end compiler can generate better code
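The effect of procedure splitting can be sketched in plain Fortran 90, without co-arrays. The names `desc_real_1`, `f`, `f_inner`, and the summation loop are hypothetical stand-ins chosen for illustration, not cafc output. The point is only that the inner procedure receives an explicit-shape dummy argument instead of a pointer component.

```fortran
module splitting_sketch
  implicit none
  ! Hypothetical descriptor in the style of CAFDesc_real_3: local data
  ! is reached through an F90 pointer component.
  type :: desc_real_1
    real, pointer :: ptr(:) => null()
  end type desc_real_1
contains
  ! Outer procedure: sees only the pointer and hands the data to the
  ! inner procedure, where it becomes an explicit-shape dummy argument.
  subroutine f(c, n, total)
    type(desc_real_1), intent(in) :: c
    integer, intent(in) :: n
    real, intent(out) :: total
    call f_inner(c%ptr, n, total)
  end subroutine f

  ! Inner procedure: c_arg is explicit-shape, so the compiler knows its
  ! bounds, may treat it as contiguous, and may rely on Fortran's
  ! dummy-argument aliasing rules.
  subroutine f_inner(c_arg, n, total)
    integer, intent(in) :: n
    real, intent(in) :: c_arg(n)
    real, intent(out) :: total
    integer :: i
    total = 0.0
    do i = 1, n
      total = total + c_arg(i)
    end do
  end subroutine f_inner
end module splitting_sketch

program splitting_demo
  use splitting_sketch
  implicit none
  type(desc_real_1) :: c
  real :: total

  allocate(c%ptr(100))
  c%ptr = 1.0
  call f(c, 100, total)
  if (abs(total - 100.0) > 1.0e-4) error stop "unexpected sum"
  print *, "sum =", total
  deallocate(c%ptr)
end program splitting_demo
```

All computation on the data sits in `f_inner`, which is where the back-end compiler gets the bounds, contiguity, and aliasing information listed above.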
Implementing Communication
• x(1:n) = a(1:n)[p] + …
• General approach: use a buffer to hold off-processor data
– allocate buffer
– perform GET to fill buffer
– perform computation: x(1:n) = buffer(1:n) + …
– deallocate buffer
• Optimizations
– no buffer for co-array to co-array copies
– unbuffered load/store on shared-memory systems
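The general approach above can be sketched by hand in standard Fortran 2008 coarray syntax. This is an illustration of the four steps, not the code cafc generates. The choice of neighbor `p`, the `+ 1.0` computation, and the final check are assumptions.

```fortran
program buffered_get
  implicit none
  integer, parameter :: n = 8
  real :: a(n)[*]                 ! co-array: remotely accessible
  real :: x(n)                    ! private local array
  real, allocatable :: buffer(:)
  integer :: p

  a = real(this_image())
  sync all                        ! a is initialized on every image

  ! Source image: right neighbor, wrapping around to image 1.
  p = merge(1, this_image() + 1, this_image() == num_images())

  allocate(buffer(n))             ! 1. allocate buffer
  buffer(1:n) = a(1:n)[p]         ! 2. GET fills the buffer
  x(1:n) = buffer(1:n) + 1.0      ! 3. compute on the local copy
  deallocate(buffer)              ! 4. deallocate buffer

  ! Illustrative check: every element of a on image p equals p.
  if (any(abs(x - (real(p) + 1.0)) > 1.0e-6)) error stop "unexpected x"
  print *, "image", this_image(), "read from image", p
end program buffered_get
```

On a shared-memory platform, steps 1, 2, and 4 disappear under the unbuffered load/store optimization, leaving only the computation with a direct remote reference.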
Strided vs. Contiguous Transfers
• Problem
– CAF remote reference might induce many small data transfers
• a(i,1:n)[p] = b(j,1:n)
• Solution
– pack strided data on source and unpack it on destination
• Constraints
– can’t express both source-level packing and unpacking for a one-sided transfer
– two-sided packing/unpacking is awkward for users
• Preferred approach
– have communication layer perform packing/unpacking
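A sketch of source-level packing for `a(i,1:n)[p] = b(j,1:n)`, in standard Fortran 2008 coarray syntax. The `staging` co-array, the `packed` buffer, the ring choice of `p`, and the values of `i` and `j` are hypothetical. It shows why the constraint above bites: the destination image has to take part in the unpacking, so the transfer is no longer purely one-sided.

```fortran
program pack_row_sketch
  implicit none
  integer, parameter :: m = 4, n = 6
  real :: a(m,n)[*]          ! destination co-array
  real :: staging(n)[*]      ! hypothetical contiguous landing area
  real :: b(m,n)             ! private source array
  real :: packed(n)          ! contiguous pack buffer on the source
  integer :: i, j, k, p

  i = 2
  j = 3
  ! Destination image: right neighbor, wrapping around to image 1.
  p = merge(1, this_image() + 1, this_image() == num_images())
  a = 0.0
  b = reshape([(real(k), k = 1, m*n)], [m, n])

  packed = b(j,1:n)          ! pack: row elements are m apart in memory
  staging(1:n)[p] = packed   ! one contiguous PUT instead of n small ones
  sync all                   ! destination learns that the data has arrived
  a(i,1:n) = staging(1:n)    ! unpack, executed by the destination image

  ! Illustrative check: b holds the same values on every image.
  if (any(a(i,1:n) /= b(j,1:n))) error stop "row not delivered"
  print *, "image", this_image(), "unpacked row", i
end program pack_row_sketch
```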
Pragmatics of Packing
Who should implement packing?
• CAF programmer
– difficult to program
• CAF compiler
– must convert PUTs into two-sided communication to unpack
• difficult whole-program transformation
• Communication library
– most natural place
– ARMCI currently performs packing on Myrinet (at least)
Synchronization
• Original CAF specification: team synchronization only
– sync_all, sync_team
• Limits performance on loosely-coupled architectures
• Point-to-point extensions
– sync_notify(q)
– sync_wait(p)
Point-to-point synchronization semantics
Delivery of a notify to q from p ⇒ all communication from p to q issued before the notify has been delivered to q
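The notify/wait pattern can be approximated with standard Fortran 2018 events. This is a sketch using a different primitive, not the Rice `sync_notify`/`sync_wait` extension: `event wait` does not name the sender, whereas `sync_wait(p)` does. The variable names and the neighbor exchange are assumptions, and the sketch needs a compiler with Fortran 2018 event support.

```fortran
program notify_wait_sketch
  use, intrinsic :: iso_fortran_env, only: event_type
  implicit none
  type(event_type) :: arrived[*]   ! posted by the left neighbor
  integer :: halo[*]               ! written by the left neighbor
  integer :: me, left, right

  me = this_image()
  left = me - 1
  right = me + 1

  if (right <= num_images()) then
    halo[right] = me               ! one-sided PUT to the right neighbor
    event post (arrived[right])    ! notify: the PUT above is ordered before it
  end if

  if (left >= 1) then
    event wait (arrived)           ! wait for the left neighbor's notify
    if (halo /= left) error stop "halo not delivered"
  end if
  print *, "image", me, "done"
end program notify_wait_sketch
```

Each image synchronizes only with its neighbors, so no image waits on a global barrier.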
Hiding Communication Latency
Goal: enable communication/computation overlap
• Impediments to generating non-blocking communication
– use of indexed subscripts in co-dimensions
– lack of whole program analysis
• Approach: support hints for non-blocking communication
– overcome conservative compiler analysis
– enable sophisticated programmers to achieve good performance today
Questions about PGAS Languages
• Performance
– can performance match hand-tuned msg passing programs?
– what are the obstacles to top performance?
– what should be done to overcome them?
• language modifications or extensions?
• program implementation strategies?
• compiler technology?
• run-time system enhancements?
• Programmability
– how easy is it to develop high performance programs?
Investigating these Issues
Evaluate CAF, UPC, and MPI versions of NAS benchmarks
• Performance
– compare CAF and UPC performance to that of MPI versions
• use hardware performance counters to pinpoint differences
– determine optimization techniques common to both languages as well as language-specific optimizations
• language features
• program implementation strategies
• compiler optimizations
• runtime optimizations
• Programmability
– assess programmability of the CAF and UPC variants
Platforms and Benchmarks
• Platforms
– Itanium2+Myrinet 2000 (900 MHz Itanium2)
– Alpha+Quadrics QSNetI (1 GHz Alpha EV6.8CB)
– SGI Altix 3000 (1.5 GHz Itanium2)
– SGI Origin 2000 (R10000)
• Codes
– NAS Parallel Benchmarks (NPB 2.3) from NASA Ames
– MG, CG, SP, BT
– CAF and UPC versions were derived from Fortran77+MPI versions
MG class A (256^3) on Itanium2+Myrinet2000
[Chart: higher is better]
• Intel compiler: restrict yields a factor of 2.3 performance improvement
• UPC strided comm: 28% faster than multiple transfers
• UPC point to point: 49% faster than barriers
• CAF point to point: 35% faster than barriers
MG class C (512^3) on SGI Altix 3000
[Chart: higher is better]
• Intel C compiler: scalar performance
• Fortran compiler: linearized array subscripts cause a 30% slowdown compared to multidimensional subscripts
MG class B (256^3) on SGI Origin 2000
[Chart: higher is better]
CG class C (150000) on SGI Altix 3000
[Chart: higher is better]
• Intel compiler: sum reductions in C 2.6 times slower than Fortran!
• point to point: 19% faster than barriers
CG class B (75000) on SGI Origin 2000
[Chart: higher is better]
• Intrepid compiler (gcc): sum reductions in C are up to 54% slower than SGI C/Fortran!
SP class C (162^3) on Itanium2+Myrinet2000
[Chart: higher is better]
• restrict yields 18% performance improvement
SP class C (162^3) on Alpha+Quadrics
[Chart: higher is better]
BT class C (162^3) on Itanium2+Myrinet2000
[Chart: higher is better]
• UPC: use of restrict boosts the performance 43%
• CAF: procedure splitting improves performance 42-60%
• UPC: comm. packing 32% faster
• CAF: comm. packing 7% faster
BT class B (102^3) on SGI Altix 3000
[Chart: higher is better]
• use of restrict improves performance 30%
Performance Observations
• Achieving highest performance can be difficult
– need effective optimizing compilers for PGAS languages
• Communication layer is not the problem
– CAF with ARMCI or GASNet yields equivalent performance
• Scalar code optimization of scientific code is the key!
– SP+BT: SGI Fortran: unroll-and-jam, software pipelining (SWP)
– MG: SGI Fortran: loop alignment, fusion
– CG: Intel Fortran: optimized sum reduction
• Linearized subscripts for multidimensional arrays hurt!
– measured 30% performance gap with Intel Fortran
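To make the last point concrete, here is a small sketch of the two subscript styles in plain Fortran. The array sizes and fill values are arbitrary assumptions. It shows only that both styles address the same elements; it does not measure the performance gap reported above.

```fortran
program subscript_styles
  implicit none
  integer, parameter :: n1 = 4, n2 = 3, n3 = 2
  real :: u(n1,n2,n3)        ! multidimensional subscripts
  real :: v(n1*n2*n3)        ! same data, linearized by hand
  integer :: i, j, k

  do k = 1, n3
    do j = 1, n2
      do i = 1, n1
        u(i,j,k) = real(i + 10*j + 100*k)
        ! Column-major linearization of (i,j,k): the loop structure is
        ! the same, but the array shape is hidden inside the index.
        v(i + n1*((j-1) + n2*(k-1))) = real(i + 10*j + 100*k)
      end do
    end do
  end do

  if (any(reshape(u, [n1*n2*n3]) /= v)) error stop "layouts differ"
  print *, "both subscript styles address the same elements"
end program subscript_styles
```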
Performance Prescriptions
For portable high performance, we need …
• Better language support for CAF synchronization
– point-to-point synchronization is an important common case!
– currently only a Rice extension outside the CAF standard
• Better CAF & UPC compiler support
– communication vectorization
– synchronization strength reduction: important for programmability
• Compiler optimization of loops with complex dependences
• Better run-time library support
– efficient communication support for strided array sections
Programmability Observations
• Matching MPI performance required using bulk communication
– communicating multi-dimensional array sections is natural in CAF
– library-based primitives are cumbersome in UPC
• Strided communication is problematic for performance
– tedious programming of packing/unpacking at src level
• Wavefront computations
– MPI buffered communication easily decouples sender/receiver
– PGAS models: buffering explicitly managed by programmer