F2GPU: A General Fortran to GPU-Code Translator - GTC...

Applied Simulations, Inc.

F2GPU: A General Fortran to GPU-Code Translator

Rainald Löhner


McLean, VA, USA

www.appliedsimulations.com


Outline

• General Observations
• GPUs
• F2GPU
• Translating Codes
• Examples and Timings
• Conclusions and Outlook


General Observations


Observations (1)

• Capability More Important Than Speed
  • Obvious, But Very Often Overlooked
  • True to a Point (e.g. Hierarchical Design Setting)
• Engineering Codes Have Many Options
  • Important Differentiator
  • Man-Years of Coding/Testing/Comparison to Experiments
  • Mostly Scalar
• Geometrical Flexibility Crucial
  • All Commercial CFD Codes Use Unstructured Grids/Data Structures
• Over-Simplification Can Be Very Costly
  • Better Wait a Few Days for a Good Result Than an Hour for …
  • Industry and Government Have Learned This Time and Again …


Observations (2)

• Problem Size Correlates With Physics
  • Airplane Aerodynamics: Nr. of Gridpoints
    Potential: 10^6, Euler: 10^8, RANS: 10^9, LES: 10^10-10^14, DNS: 10^16
  • Car Crash: Beam: 10^4, Shells: 10^6, Solids: 10^8
• Physical Complexity Increases Potential Load Imbalances
  • CFD: Chemical Reactions, Particles and Fluids, H-Ref, Remesh
  • CSD: Contact, Rupture
  • Multiphysics: Different Parallelization/Spatial Subdivisions
• Zalesak’s Uncertainty Principle Holds
  • Compute Speed * Results < Z
• Who Gets More Than 512 Procs 24/7/365 ?
  • Very Few, If Any, Even in 2012
  • Many More in the Future (Perhaps: Energy Costs…)


Characteristics of Successful Codes

• Do Not Solve `The World’; Just One Class of Problems
  • Commit to a Problem; Not to an Algorithm/Technique/…
• Physics: Complete
  • CFD: Turbulence, Cavitation, Moldfilling, Casting, Extrusion, Food Processing, Biofluids, Non-Newtonian, …
  • CSD: Springback, Failure, Cracking, Rupture, Tearing, …
• Numerics: Basic
  • Just Enough (Hourglass, Upwinding, Stabilization, …)
• Benchmarks, QA
  • Link to Experiments
• Long-Term Commitment
  • Documentation, Manuals, Training, …
  • Institutional Memory (Long-Term Commitment)


Codes: Going Forward

• Many Codes Written in F77/F90
  • I Can Hear the Sighs of Disbelief…
• If: Cost is in `Physics’/ Debugging/ Benchmarking
• If: Differentiation is in Options
• If: Codes Are Large (O(1-10 Mlines) Usual)
⇒ Codes Will Not Be Re-Written
  • Unless Factors of 1:100 Are Possible
⇒ Need An Automatic Way of Porting F77/F90 to GPUs


GPUs


GPUs: Caveats

• Transfer is Expensive
  • Do as Much as Possible on GPU
• GPU Works Well on:
  • Large Amounts of Data [But: Memory Limited]
  • Simple Operations / Reuse of Registers [But: Registers Limited]
  ⇒ Very Similar to (Autotasking) Vector Machines
• GPU-Based Field Solvers Appearing in Literature


GPUs: Porting Options (1)

• Option 1: Re-Write By Hand
  • Expensive (Millions of Lines of [Fortran] Code)
  • Error Prone
  • Multiple Versions of Same (?) Code
  • Will Certainly Work
• Option 2: Compilers
  • Will Require Re-Vectorization of Large Parts of Codes
  • MPI-Decade Spoiled Code-Writers
  • Vector-Compilers: Could Only Go So Far
  • Insertion of Directives Does Not Account for Placement of Arrays
    • Main Bottleneck of GPUs Not Addressed !


GPUs: Porting Options (2)

• Option 3: Translators/Scripts
  • Needs Inner Fine-Grained Parallel (Vectorized) Loops
  • Needs Uniform Code/Loop Structure
  • Lowest Number of Possible Errors (All/Nothing)
  • Placement/Transfer Issue of Data (CPU/GPU) Addressed
  • Single Code Version
  • Can Keep Developing/Debugging on CPU
  • Attempted by Several Groups Worldwide


F2GPU


F2GPU: Design Criteria

• Exploit Existing Fine-Grain Parallelism
• Take Every OMP Loop:
  • Identify Arrays and Define as `On GPU’
  • Perform Inner Vector Loops on GPU
• Allow for User-Defined $gpu-Directives
  • Some (Small, Diagnostics) Arrays Should Stay on CPU
  • Transfer To/From CPU/GPU Only What is Required
• Allow for User-Defined Ignore-Option
  • Difficult Subs Omitted by Translator
  • Allows for Gradual Porting


Avoid CPU / GPU Data Transfer (1)

• GPU/CPU: Separate Memory Spaces
• Transfer GPU/CPU: Extremely Slow
  • Bus Bandwidth < 10 GB/s
  • Internal CPU Bandwidth ~20-60 GB/s
  • Internal GPU Bandwidth ~100-200 GB/s
⇒ Just Porting “Bottleneck Subroutines” ⇒ Replace One Bottleneck With Another (!)
⇒ All (!) Parallel Loops Must Run on GPU
⇒ Limit Data Transfer to Code Init/Shutdown
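The bandwidth figures above can be turned into a rough cost comparison. The sketch below is a back-of-the-envelope estimate only: the working-set size (1 million points, 5 unknowns per point, real*8) is a hypothetical example, the bus is taken at its 10 GB/s upper bound, and 150 GB/s is one value picked from the 100-200 GB/s range for GPU memory.

```python
def transfer_seconds(n_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Time to move n_bytes at a sustained bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return n_bytes / (bandwidth_gb_per_s * 1.0e9)


# Hypothetical working set: 1 million points, 5 unknowns per point, real*8.
n_bytes = 1_000_000 * 5 * 8

bus = transfer_seconds(n_bytes, 10.0)    # CPU<->GPU bus, upper bound from the slide
gpu = transfer_seconds(n_bytes, 150.0)   # assumed value inside the 100-200 GB/s range

ratio = bus / gpu
print(f"bus: {bus * 1e3:.2f} ms, GPU memory: {gpu * 1e3:.3f} ms, ratio: {ratio:.0f}x")
```

Under these assumptions a single pass of the array over the bus costs about as much as fifteen passes through GPU memory, which is why moving one array back to the CPU inside a time step can cancel the gain from porting a single subroutine.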


Avoid CPU / GPU Data Transfer (2)

• F2GPU: Analyze Array Placement
• Warn User of Placement


F2GPU (1)

• Generate GPU Code From Existing Fortran Code
• Original Code Must Expose Fine-Grain Parallelism
  • Assume Every OpenMP Loop as on GPU
  • Allow for !$gpu parallel Directive for SubLoops/Conflicts
• Explicit Separation of Memory Spaces
• O(10^3) Lines of Python Script
• Uses FParser Package From F2PY Project (Open Source)
• Not Specialized for Any Particular Code
  • But Full-Performance, Automatic `GPU Compiler’ Unlikely
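To make the "every OpenMP loop is a GPU loop" rule concrete, here is a minimal sketch of the first step such a translator needs: finding the directives that mark parallel loops. It is not F2GPU's implementation. It uses plain regular expressions instead of FParser, and the function name and the sample subroutine are hypothetical. Only the two directive spellings come from the slides and the OpenMP standard.

```python
import re

# "!$omp parallel do" is standard OpenMP; "!$gpu parallel do" is the F2GPU
# directive named on the slides. Everything else here is illustrative.
OMP_LOOP = re.compile(r"^\s*!\$omp\s+parallel\s+do\b", re.IGNORECASE)
GPU_LOOP = re.compile(r"^\s*!\$gpu\s+parallel\s+do\b", re.IGNORECASE)


def find_gpu_loop_candidates(source: str):
    """Return (line_number, kind) for each directive that marks a parallel loop."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if OMP_LOOP.match(line):
            hits.append((lineno, "omp"))
        elif GPU_LOOP.match(line):
            hits.append((lineno, "gpu"))
    return hits


# Hypothetical Fortran input, used only to exercise the scanner.
example = """\
      subroutine axpy(n, a, x, y)
!$omp parallel do private(i)
      do i = 1, n
         y(i) = y(i) + a*x(i)
      enddo
!$gpu parallel do
      do i = 1, n
         x(i) = 0.0d0
      enddo
      end
"""

print(find_gpu_loop_candidates(example))
```

A real translator would go on to parse the loop body, collect the arrays it touches, and emit a kernel. The scan above only shows why a uniform directive convention in the original code makes the loops easy to find.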


F2GPU (2)

• Single F77/F90 + OpenMP Codebase
• No New Bugs
• Catches Old Bugs
  • Uninitialized Variables
  • Nonsensical OpenMP Directives
  • …
• No Performance Hit Due to Translation
  • Same Code As Produced By Experienced GPU Coder
• Not Specialized for Any Particular Code
  • But Not a Generic, Automatic Compiler


F2GPU: Fine-Grain Parallelism

• Explicitly Defined via OpenMP in F77/F90
  • Simple Translation
• User-Defined Parallel Inner Loop via Directives
  • !$gpu parallel do
• Difficult Subs:
  • Array Compression: rcmpresp, icmpresp
  • Prefix Sum: iadlinklistv
  • Random Number Generation: rgaussrndv, rgrndv
  • Custom Code Via Thrust Library
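Array compression is "difficult" because a serial loop with a running counter has a loop-carried dependency. The usual data-parallel formulation replaces the counter with a prefix sum, after which every element can be written independently. The sketch below illustrates that idea in plain Python. It is not the code of rcmpresp/icmpresp or of Thrust, and the function names are hypothetical.

```python
from itertools import accumulate


def exclusive_scan(flags):
    """Exclusive prefix sum: out[i] is the sum of flags[0..i-1]."""
    if not flags:
        return []
    inclusive = list(accumulate(flags))
    return [0] + inclusive[:-1]


def compress(values, keep):
    """Keep values[i] where keep[i] is true, preserving order."""
    flags = [1 if k else 0 for k in keep]
    offsets = exclusive_scan(flags)      # destination index of each kept item
    out = [None] * sum(flags)
    # Each iteration writes to a distinct slot, so on a GPU every i could be
    # handled by its own thread once the scan has been computed.
    for i, value in enumerate(values):
        if flags[i]:
            out[offsets[i]] = value
    return out


print(exclusive_scan([1, 0, 1, 1, 0]))
print(compress([10, 20, 30, 40, 50], [True, False, True, True, False]))
```

On the GPU the scan itself is the non-trivial part, which is a reasonable motivation for delegating these few routines to a library rather than to the general loop translator.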


F2GPU: Output Options

• Multiple Output Targets
• Fortran Code Analyzed/Represented Abstractly Using Objects
• Different Object Methods Can Output to Different Targets
• CUDA Target: Fully Supported
• OpenCL and CUDA Fortran: Incomplete
• Other Targets: Possible in Future


F2GPU: MPI

• GPU Translator Fully Integrates with Existing MPI Parallelism
• CUDA/MPI: Orthogonal
• Each MPI Rank Processes a Sub-Domain
  • Coarse-Grain
• CUDA Threads in Sub-Domain (edges/points/elements)
  • Fine-Grain


Translating Codes With F2GPU


Translating Code With F2GPU (1)

• Start With Original Code: code.f, Makefile, …
• mkdir code_gpu
• From Makefile:
  • Get List of *.f/*.f90 Files
  • Prepare `F2GPU Makefile’: Makef2gpu
• Until Translated (in code_gpu directory):
  • make convert
  ⇒ *.f90.cu Files
• Once Translated (in code_gpu directory):
  • make
  ⇒ Executable


Translating Code With F2GPU (2)

• Errors Detected
  • Incorrect Number of Arguments in Calls
  • Incorrect Arguments Type in Calls
  • Incorrect OpenMP Arguments (local)
• Problems Detected
  • GPU Arrays Accessed in Scalar/CPU Loop
  • CPU Arrays Accessed in Vector/GPU Loop
  ⇒ Force User to Place !$gpu gpu2cpu / cpu2gpu Directive
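The placement check can be pictured as a comparison between where each array lives and where each loop runs. The sketch below is a toy model under stated assumptions: the data structures, the function name and the array/loop names are hypothetical, and the mapping "GPU array read in a CPU loop needs gpu2cpu, CPU array read in a GPU loop needs cpu2gpu" is an interpretation of the directive names on the slide, not documented F2GPU behaviour.

```python
def placement_conflicts(array_home, loops):
    """Report arrays that are accessed on a device other than their home.

    array_home: dict mapping array name -> "cpu" or "gpu"
    loops:      list of (loop_name, device, [array names accessed])
    Returns a list of (loop_name, array_name, suggested_directive).
    """
    warnings = []
    for loop_name, device, arrays in loops:
        for name in arrays:
            if array_home[name] != device:
                # Assumed meaning of the directive names from the slide.
                directive = "gpu2cpu" if device == "cpu" else "cpu2gpu"
                warnings.append((loop_name, name, directive))
    return warnings


# Hypothetical program: field arrays live on the GPU, a small residual
# array stays on the CPU for diagnostics.
home = {"unkno": "gpu", "rhs": "gpu", "resid": "cpu"}
loops = [
    ("edge_loop", "gpu", ["unkno", "rhs"]),
    ("diagnostics", "cpu", ["rhs", "resid"]),
]

for loop_name, name, directive in placement_conflicts(home, loops):
    print(f"{loop_name}: array '{name}' needs !$gpu {directive}")
```

Making the user place the transfer directive by hand, instead of inserting it silently, keeps every CPU/GPU copy visible in the source.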


Running Codes With F2GPU


Running Code With F2GPU (1)

• Debug/Production Options
• Debug Option
  • Alert User to CPU to GPU and GPU to CPU Transfer
  • Essential to Obtain GPU Performance


FEFLO On GPUs


FEFLO: Examples…


FEFLO (1)

• Physics
  • Compressible and Incompressible Flow
  • Many EOS
  • Turbulence Models
  • Chemical Reactions
  • Dilute Particle Phases
  • Adjoints
• Numerics
  • 1-Element Type Code
  • Edge-Based Solvers (Upwind, Riemann, Limiters, …)
  • Explicit and Implicit Timestepping
  • Iterative Solvers (DPCG/GMRES, LU-SGS, Deflated, Linelets, ..)
  • Optimized for Vector, SMP and DMP (Domain Decomposer)


FEFLO (2)

• Engineering
  • Periodic BC
  • Embedded/Immersed Bodies
  • Overlapping Grids
  • Body/Surface/Mesh Motion Modules
  • Link to CSD/CTD/Control/… Codes
  • Link to Optimization Packages
• Statistics (05/2012) for Physics Modules
  • O(1.20) Mlines of F77 Code
  • O(6.70) KSubs
  • O(650) Edge-Loops With Sub-Subroutine Calls
• Typical `Legacy Code’


FEFLO (3)

• Coding:
  • Well-Organized, Consistent (Same Names, Same Loops, …)
  • Uses Simple, Explicit Subset of F77/F90
  • Parallelized (OMP, MPI) + Vectorized


FEFLO: On GPUs

• 12/2012: O(3.8) KSubs Ported [Running Entirely on GPU]
  • Comp: locfct.f, rukucomp.f, lusgscomp.f (Ideal Gas)
  • Inco: incosubs.f, vectruku.f, vectlusgs.f, projecsubs.f
  • Scalars: scalfct.f, scalruku.f, scallusgs.f, vofsubs.f
  • Preconditioners: linelets.f
  • Lagrangian Particles: lagparts.f
  • Moving Body Options: alemesh.f, alebody.f
  • Turbulence Models: smago, wale, kepsilon, …
  • Radiation Transport
  • Adjoints, …


Effect of Vector Length: Blast in a Room

• Compressible Euler
• Ideal Gas Equation of State
• Flux-Corrected Transport
• 1 Mels
• 60 Time Steps
• Double Precision


Effect of Vector Length: Blast in a Room

nelem   CPU/GPU     mvecl   Time [sec]
1.0 M   Xeon(1)        32       47
1.0 M   Xeon(1)     25600       62
1.0 M   Xeon(1a)       32       45
1.0 M   Xeon(1a)    25600       60
1.0 M   Xeon(8)        32       12
1.0 M   GTX295         32      360
1.0 M   GTX295      25600       10
1.0 M   Tesla          32      360
1.0 M   Tesla       25600       11
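One way to read the vector-length timings is to count how many inner vector loops are needed to cover the mesh. The sketch below assumes that elements are processed in groups of at most mvecl items and that each group becomes one unit of GPU work. That mapping is an assumption made for illustration, since the slides do not state how groups are launched. The function name is hypothetical.

```python
import math


def loop_groups(nelem: int, mvecl: int) -> int:
    """Number of inner vector loops needed to cover nelem items in chunks of mvecl."""
    return math.ceil(nelem / mvecl)


nelem = 1_000_000
short_groups = loop_groups(nelem, 32)
long_groups = loop_groups(nelem, 25600)
print(short_groups, long_groups)

# Timing ratio taken from the GTX295 rows of the table (360 s vs 10 s).
gtx295_ratio = 360 / 10
print(gtx295_ratio)
```

With mvecl=32 the GPU receives tens of thousands of tiny work units per loop, each far too short to occupy the device, whereas the CPU timings barely change between the two vector lengths.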


Effect of Renumbering: Blast In Cube

• Compressible Euler
• Ideal Gas EOS
• Explicit FEM-FCT
• Initialization From 1-D File
• 1.0 Mels
• Run for 500 Steps
• Cartesian Point Distribution ⇒ Test Renumbering Options


Blast In Cube


Blast in a Cube

CPU/GPU    nrenu   mvecl   Time [sec]   Speedup   fact_1   fact_2
Xeon (1)   BIN        32      280         1.00      -        -
GTX295     BIN     42572       45         6.22     2.71     8.36
GTX295     ADV     42572       46         6.09     2.71     8.36
GTX295     GPU1    42572       42         6.67     2.40     6.25
GTX295     GPUD    29800       43         6.51     2.23     5.24
GTX295     BIND    42572       39         7.18     2.23     2.30
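The Speedup column is the single-core Xeon time divided by the GPU time. The sketch below pairs each Time entry with the renumbering option whose Speedup it reproduces. That pairing is inferred from the numbers themselves, since the extracted columns do not carry row labels.

```python
# Reference run: Xeon (1), nrenu=BIN, mvecl=32.
xeon_seconds = 280

# GTX295 wall-clock times keyed by renumbering option (pairing inferred
# by matching 280/time against the Speedup column).
gtx295_seconds = {"BIN": 45, "ADV": 46, "GPU1": 42, "GPUD": 43, "BIND": 39}

speedups = {
    nrenu: round(xeon_seconds / seconds, 2)
    for nrenu, seconds in gtx295_seconds.items()
}

for nrenu, value in speedups.items():
    print(f"{nrenu:5s} {value:.2f}")
```

All five quotients agree with the tabulated speedups to two decimals, so the renumbering options differ by at most about 18% in run time (39 s to 46 s).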


Blast in a Room

• 4 Mels

                       Number of Domains
CPU/GPU                1      2      4      8
Tesla M2050          151     84     48     30
Xeon X5670 (6)       315    136     68     36
Xeon X5670 (12)      191    102     59     33
Tesla M2050 2Mels                   15     11


Maximal Throughput

• Data Transfer Per Edge-Loop:
  ndata = nedge·nrealed = 7·nrealed·npoin [real*8, i.e. 64 bits]
        = 56·nrealed bytes/pt
• Assume 150 Gbytes/sec, i.e. 0.67·10^{-11} sec/byte
⇒ Time Per Point Per Edge-Loop:
  tppdata = 56·nrealed bytes/pt · 0.67·10^{-11} sec/byte
  tppdata = 38·nrealed·10^{-11} sec/pt


Maximal Throughput: FEM-FCT

• For: tppdata = 38·nrealed·10^{-11} sec/pt:
⇒ FEMFCT: 0.950·10^{-7} sec/pt/step
• Measured: 6.920·10^{-7} sec/pt/step

Sub          Transfer       nreal_da   nreal_ia   # Calls
GetDt        unkno              5         12         1
Low-Ord      unkno, rhs         3                    1
LapLoe AV    unkno, rhs         5                    1
TayGal       unkno, rhs         5
ConsMass     unkno, rhs         6
limiting     unkno, flux        2
Final du     rhs                2
Total                          28        222

Remaining nreal_ia entries (row assignment not recoverable): 30, 30, 30, 60, 30, 30
Remaining # Calls entries (row assignment not recoverable): 1, 1, 2, 1
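The bandwidth-bound estimate can be reproduced in a few lines. The sketch below uses the slide's own inputs (7 edges per point, real*8, 150 Gbytes/sec) and assumes that the FEM-FCT figure comes from the sum of the two column totals, 28 + 222 = 250 reals per edge. That assumption is an inference: it is the value for which 38·10^{-11}·nrealed gives exactly 0.950·10^{-7}. The function and constant names are hypothetical.

```python
EDGES_PER_POINT = 7        # slide: nedge = 7*npoin
BYTES_PER_REAL = 8         # real*8
BANDWIDTH = 150.0e9        # bytes/sec, GPU memory bandwidth assumed on the slide


def time_per_point(nrealed: float) -> float:
    """Bandwidth-bound time per point for nrealed reals moved per edge."""
    return EDGES_PER_POINT * BYTES_PER_REAL * nrealed / BANDWIDTH


# Totals of the nreal_da and nreal_ia columns (inferred to be the input
# behind the slide's FEM-FCT figure).
nrealed_total = 28 + 222

estimate = time_per_point(nrealed_total)   # unrounded form of the 0.950e-7 figure
measured = 6.92e-7                         # sec/pt/step, from the slide
gap = measured / estimate

print(f"estimate: {estimate:.3e} sec/pt/step")
print(f"measured / estimate: {gap:.1f}")
```

The unrounded estimate is slightly below the slide's 0.950·10^{-7} because the slide rounds 56·0.67 up to 38. Either way the measured time is roughly seven times the bandwidth bound, which matches the closing remarks about coalescing and shared-memory reuse.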


Blast in a Room

• 4 Mels

                       Number of Domains
CPU/GPU                1      2      4      8
Tesla M2050          151     84     48     30
Xeon X5670 (6)       315    136     68     36
Xeon X5670 (12)      191    102     59     33
Tesla M2050 2Mels                   15     11

= 6.92·10^{-7} sec/pt/step


NACA 0012

• Compressible Euler, Ideal Gas
• Explicit RK3, HLLC, nlimi=0
• Steady State, Local Timestepping
• Residual Damping
• Ma=2, AOA=15°
• 1.0 Mels
• Run for 100 Steps
• Double Precision

CPU/GPU            CPU (Sec)
Xeon E5530 (1)       106
Xeon E5530 (2)        61
Xeon E5530 (4)        34
Xeon E5530 (8)        19
GTX295                23




OneraM6

• Compressible Euler, Steady State
• Ideal Gas EOS
• Implicit LU-SGS, HLLC, nlimi=2
• Local Timestepping
• Ma=0.84, AOA=3.06°
• 0.955 Mels
• 50 Time Steps
• Double Precision

CPU/GPU            CPU (Sec)
Xeon E5530 (1)       142
Xeon E5530 (2)        76
Xeon E5530 (4)        43
Xeon E5530 (6)        32
Tesla C2070           36


Turbine Blade

• Compressible Euler, Steady State
• Ideal Gas EOS
• Implicit LU-SGS, HLLC, nlimi=2
• Local Timestepping, Periodic BC
• Ma=0.84, AOA=3.06°

CPU/GPU            CPU (Sec)
Xeon E5530 (1)       394
Xeon E5530 (2)       204
Xeon E5530 (4)       107
Xeon E5530 (6)        78
Tesla C2070           80


Supersonic Inlet

• Compressible Euler, Steady State
• Ideal Gas EOS
• Implicit LU-SGS, HLLC, nlimi=2
• Local Timestepping
• Ma=3.00
• 2.08 Mels
• 50 Time Steps
• Double Precision

CPU/GPU            CPU (Sec)
Xeon E5530 (1)       252
Xeon E5530 (2)       129
Xeon E5530 (4)        70
Xeon E5530 (6)        51
Tesla C2070           42


NACA 0012

• Incompressible Euler
• Advection: Explicit RK3, Roe, nlimi=2
• Pressure: Poisson (Projection), DPCG [Scalar Products]
• Steady State, Local Timestepping
• AOA=15°

CPU/GPU            CPU (Sec)
Xeon E5530 (1)        80
Xeon E5530 (2)        40
Xeon E5530 (4)        24
Xeon E5530 (8)        15
GTX295                31


Pipe

• Steady-State Incompressible Navier-Stokes + Heat Transfer
• Advection: Roe solver
• Pressure: Poisson (Projection), DPCG (Scalar Products)
• 4.0 Mels
• 100 Time Steps
• Double Precision

                       Number of Domains
CPU/GPU                1      2      4      8
Xeon X5670 (6)      1105    684    212    112
Xeon X5670 (12)      662    336    152     75
Tesla M2050          183371456


Dispersion in City

• Transient Incompressible Navier-Stokes + Scalar Transport

CPU/GPU            CPU (Sec)
Xeon E5530 (1)         -
Xeon E5530 (2)         -
Xeon E5530 (4)       442
Xeon E5530 (6)       336
Tesla C2070          356


Dispersion in Metro Station

• Transient Incompressible Navier-Stokes + Scalar Transport
• DPCG (Scalar Products) [800 Iter]
• 3.95 Mels
• 10 Time Steps
• Double Precision

CPU/GPU            CPU (Sec)
Xeon E5530 (1)       375
Xeon E5530 (2)       191
Xeon E5530 (4)       108
Xeon E5530 (6)        96
Tesla C2070          115


Artery

• Steady-State Incompressible NS
• Advection: Roe, LU-SGS, Implicit
• Pressure: Poisson (Projection),

CPU/GPU            CPU (Sec)
Xeon E5530 (1)       465
Xeon E5530 (2)       255
Xeon E5530 (4)       161
Xeon E5530 (6)       105
Tesla C2070          172


Cylinder, Re=190

• Transient Incompressible Navier-Stokes
• Adv: Roe, RK2
• Linelet Preconditioning (Visc, P)
• 7198 (13), 954 (14), 27 (15)

CPU/GPU            CPU (Sec)
Xeon E5530 (1)       205
Xeon E5530 (2)       107
Xeon E5530 (4)        69
Xeon E5530 (6)        53
Tesla C2070           75


Cylinder, Re=190

• Adv: Roe, RK3
• Linelet Preconditioning (Visc, P)
• 1 (3), 16,588 (12), 30 (15)

CPU/GPU            CPU (Sec)
Xeon E5530 (1)       528
Xeon E5530 (2)       284
Xeon E5530 (4)       190
Xeon E5530 (6)       150
Tesla C2070          150


Dam Break

• VOF for Free Surface
• 1.0 Mels
• 100 Time Steps
• Double Precision

CPU/GPU            CPU (Sec)
Xeon E5530 (1)       145
Xeon E5530 (2)        85
Xeon E5530 (4)        56
Xeon E5530 (6)        46
Tesla M2050           41


Dam Break

• VOF for Free Surface
• 4.0 Mels
• 100 Time Steps

                       Number of Domains
CPU/GPU                1      2      4      8
Tesla M2050            -    125     85     68
Xeon X5670 (6)       195    110     64     46
Xeon X5670 (12)        -     77     48     36


Blast in Room With Dilute Material

• Compressible Euler, Ideal Gas EOS
• FEM-FCT (Explicit)
• 4.0 Mels
• 93 Kparts
• Run for 60 Steps

CPU/GPU            CPU (Sec)
Xeon E5530 (1)       305
Xeon E5530 (2)       178
Xeon E5530 (4)       113
Xeon E5530 (8)        68
GTX295                49


Blast in Room With Dilute Material


FEHEAT On GPUs


FEHEAT (1)

• Physics
  • Heat Conduction
  • Tensor Conductivity
  • Nonlinear Conductivity
  • Nonlinear Source Terms
• Numerics
  • 3-Element Type Code
  • Bar (Edge), Shell (Triangle), Solid (Tetrahedron)
  • Finite Element Formulation (Element Loops)
  • Implicit Timestepping
  • Iterative Solvers (DPCG/GMRES, ..)
  • Optimized for Vector and SMP


FEHEAT (2)

• Engineering
  • Nonlinear BC (Convection, Radiation, …)
  • Link to CSD/CTD/Control/… Codes
  • Link to Optimization Packages
• Statistics (05/2012) for Physics Modules
  • O(16.5) Klines of F77 Code
  • O(0.20) KSubs
• Coding:
  • Well-Organized, Consistent (Same Names, Same Loops, …)
  • Uses Simple, Explicit Subset of F77/F90
  • Parallelized (OMP) + Vectorized
• Typical `Legacy Code’


Cube

• Transient Heat Conduction
• 3.87 Mels
• 10 Time Steps

CPU/GPU            CPU (Sec)
Xeon E5530 (1)        54
Xeon E5530 (2)        30
Xeon E5530 (4)        17
Xeon E5530 (8)        13
Tesla C2070           13


Conclusions and Outlook (1)

• GPUs Here to Stay
  • Need Large Vector Lengths to Achieve Performance
• Porting Will Pose Challenging Task for Many Legacy Codes
• Translators Such as F2GPU Offer Ability to:
  • Continue Development In Original Language Unhindered
  • Incorporate Expert GPU Programming Skills Without Re-Write
  • Evolve With Changing Standards
• GPUs: Transfer CPU/GPU Slow ⇒ All Code on GPU ⇒ Memory Limits Applicability
  • Use Usual MPI Domain Decomposition for Large Problems


Conclusions and Outlook (2)

• Results Obtained To Date
  • Speedup: 1:4-1:16 in Double Precision [Tesla/Xeon 1-Core]
  • Run on Several Graphics Cards [Tesla/Geforce]
• GPUs: Transfer Rate on GPU Slow [200 Gbytes/sec]
  • Limits Performance
  • Attempt to Use/Re-Use Shared Memory [2 Tbytes/sec] ?
  • Rethink Vectorization/Colouring Mapping to GPU [Coalescing]
• F2GPU: Looking for Codes…
