High Performance Computational Fluid-Thermal Sciences & Engineering Lab
GenIDLEST Co-Design
Virginia Tech
AFOSR-BRI Workshop, Feb 7, 2014
Danesh Tafti & Amit Amritkar
Collaborators: Wu-chun Feng, Paul Sathre, Kaixi Hou, Sriram Chivukula, Hao Wang, Eric de Sturler, Kasia Swirydowicz
Outline
• GenIDLEST code – features and capabilities
• Initial code porting – strategy
• Co-design optimization – validation and application (bat wing flapping)
• Solver co-design
• Future work
Generalized Incompressible Direct and Large-Eddy Simulations of Turbulence (GenIDLEST)
• Background
  • Finite-volume, multi-block, body-fitted mesh topology
  • Grid has structured (i,j,k) connectivity, but blocks can have arbitrary (unstructured) connectivity
  • Pressure-based fractional-step algorithm (see the sketch after this list)
  • MPI, OpenMP, or hybrid MPI-OpenMP parallelism; OpenMP scalability is as good as MPI (tested up to 256 cores)
• Capability
  • Wall-resolved and wall-modeled LES in complex geometries
  • Dynamic Smagorinsky with zonal two-layer wall model
  • Hybrid URANS-LES (k-ω based)
  • Dynamic grids (ALE)
  • Immersed Boundary Method (IBM) on general grids
  • Discrete Element Method (DEM)
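For reference, a generic pressure-based fractional-step update takes the form sketched below. This is the textbook form of the scheme, not necessarily the exact discrete operators or time integration used in GenIDLEST:

\[
\mathbf{u}^{*} = \mathbf{u}^{n} + \Delta t \,\bigl[ -\nabla\!\cdot(\mathbf{u}\mathbf{u}) + \nu \nabla^{2}\mathbf{u} \bigr]^{n},
\qquad
\nabla^{2} p^{n+1} = \frac{\nabla\!\cdot\mathbf{u}^{*}}{\Delta t},
\qquad
\mathbf{u}^{n+1} = \mathbf{u}^{*} - \Delta t\,\nabla p^{n+1}
\]

The intermediate velocity from the momentum solve is made divergence-free by the pressure correction, which is why the pressure Poisson equation dominates the cost (as quantified later in the deck).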
Generalized Incompressible Direct and Large-Eddy Simulations of Turbulence (GenIDLEST)
• Dynamic stall on a pitching airfoil (LES + ALE) – 45 million cells, 304 cores
Generalized Incompressible Direct and Large-Eddy Simulations of Turbulence (GenIDLEST)
• Flapping bat wing (IBM) – 25 million cells, 60 cores
Generalized Incompressible Direct and Large-Eddy Simulations of Turbulence (GenIDLEST)
• DEM in a fluidized bed – 5.3 million particles, 1 million fluid grid cells, 256 cores
GenIDLEST Flow Chart
GenIDLEST (Generalized Incompressible Direct and Large Eddy Simulation of Turbulence) – main time-integration loop:
1. START: read user-defined input and grid; initialize velocity and temperature fields; set up system matrices.
2. Advance time step.
3. Solve momentum equations for intermediate velocity (solve linear system).
4. Calculate fluxes.
5. Solve pressure Poisson equation (solve linear system).
6. Correct intermediate velocities.
7. Solve turbulence model equations for turbulent viscosity.
8. If T > Tmax, END (post-processing); otherwise return to step 2.

Optional modules invoked within the loop:
• ALE (rezoning phase): forced boundary node movement → corner displacement using spring analogy → TFI for corner → edge → face → volume displacement → calculate new system matrices.
• IBM: read surface grid → move the immersed boundary to the new location → identify solid, fluid, and IB nodes → apply velocity boundary condition at IB nodes → correct mass fluxes → apply boundary conditions for velocity and pressure at IB nodes → post-process data at the immersed boundary.
• DEM (sub-stepped until Ts > Tsmax): calculate forces and heat flux on particles → update particle position, velocity, and temperature → calculate volume fraction → calculate momentum and temperature source terms.
GenIDLEST – Linear Systems
• Sparse; sparsity pattern depends on the boundary conditions
  • Non-orthogonal grid: non-symmetric, 18 off-diagonal elements
  • Orthogonal grid: symmetric, 6 off-diagonal elements
• CG is used for symmetric systems; BiCGSTAB or GMRES(m) for non-symmetric (non-orthogonal) systems
• Preconditioners: domain decomposition – additive Schwarz with Richardson or SSOR iterations applied to sub-structured, overlapping "cache" blocks – SIMD parallel (a sketch follows)
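A minimal serial sketch of this preconditioner is given below, assuming a hypothetical CSR-like storage with the diagonal split out. In GenIDLEST the sweeps over cache blocks run in parallel (SIMD/threads) and the data structures differ; all names here are illustrative only.

! Additive Schwarz preconditioner sketch: a few damped, diagonally scaled
! Richardson (Jacobi-type) sweeps applied independently to each overlapping
! "cache" block. blk_start/blk_end and the CSR arrays are hypothetical.
subroutine schwarz_richardson(nblk, blk_start, blk_end, diag, a, col, row_ptr, b, z, nsweep, omega)
  implicit none
  integer, intent(in)  :: nblk, nsweep
  integer, intent(in)  :: blk_start(nblk), blk_end(nblk)   ! overlapping index ranges
  integer, intent(in)  :: col(:), row_ptr(:)               ! off-diagonal entries, CSR form
  real(8), intent(in)  :: diag(:), a(:), b(:), omega
  real(8), intent(out) :: z(:)                             ! z approximates M^{-1} b
  real(8), allocatable :: r(:)
  integer :: ib, i, k, sweep

  allocate(r(size(b)))
  z = 0.0d0
  do ib = 1, nblk                          ! blocks are independent -> SIMD/thread parallel
     do sweep = 1, nsweep
        do i = blk_start(ib), blk_end(ib)  ! local residual r = b - A z
           r(i) = b(i) - diag(i) * z(i)
           do k = row_ptr(i), row_ptr(i+1) - 1
              r(i) = r(i) - a(k) * z(col(k))
           end do
        end do
        do i = blk_start(ib), blk_end(ib)  ! damped update scaled by the diagonal
           z(i) = z(i) + omega * r(i) / diag(i)
        end do
     end do
  end do
  deallocate(r)
end subroutine schwarz_richardson

Because each block is smoothed independently using only its own (overlapped) data, the blocks map naturally onto GPU thread blocks or SIMD lanes.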
GenIDLEST – Time-Consuming Routines on CPU
Data Structures and Mapping to Co-Processor Architectures
[Diagram] Grid decomposition hierarchy: global grid → nodal grid → mesh blocks → cache blocks. Within a compute node, CPUs communicate via MPI and offload work to co-processors (GPU, Intel MIC) through MPI- and OpenMP-based offload paths.
Challenges in GPU Co-Processing Implementation
• The large memory footprint of problems of practical interest limits GPU use to a subset of the code
• Added overhead of getting data on and off the GPU inside the time-integration loop
• All computations on the GPU have to be data parallel – anything that deviates from this paradigm is not easy to implement
• Message passing for global block-boundary synchronization between processors requires additional data transfers to and from the GPU (see the sketch below)
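The extra GPU traffic in the last item can be illustrated with a minimal sketch: boundary data must be staged through host buffers around the MPI exchange. The routine name, the 1-D packing, and the single-neighbor exchange are illustrative, not GenIDLEST's actual interfaces.

! Boundary exchange with explicit host staging: each exchange adds one
! device->host and one host->device copy around the MPI call.
subroutine exchange_block_boundary(phi_d, sendbuf_h, recvbuf_h, n, nbr, comm, ierr)
  use cudafor
  use mpi
  implicit none
  integer, intent(in)  :: n, nbr, comm
  integer, intent(out) :: ierr
  real(8), device      :: phi_d(n)           ! boundary face already packed on the GPU
  real(8)              :: sendbuf_h(n), recvbuf_h(n)
  integer :: status(MPI_STATUS_SIZE)

  sendbuf_h = phi_d                          ! device -> host copy (extra overhead)
  call MPI_Sendrecv(sendbuf_h, n, MPI_DOUBLE_PRECISION, nbr, 0, &
                    recvbuf_h, n, MPI_DOUBLE_PRECISION, nbr, 0, &
                    comm, status, ierr)
  phi_d = recvbuf_h                          ! host -> device copy (extra overhead)
end subroutine exchange_block_boundary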
GPU IMPLEMENTATION
• Implementation strategy: port large functional blocks of subroutines to the GPU without overwhelming GPU memory, while keeping CPU-to-GPU and GPU-to-CPU data transfers to a minimum
• Over 75 CUDA Fortran kernels written, covering:
  • Calculation of diffusion coefficients for all transport equations
  • Solution of all linear systems in the transport and pressure equations, which consume between 50% and 90% of the computation time on the CPU
    • 2 global sparse matrix-vector multiplies, 4 global inner products (global reductions), 2 preconditioner calls, and 4 global block-boundary updates per BiCGSTAB iteration
  • Message-passing packing and unpacking (a representative kernel sketch follows)
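As an illustration of the kind of kernel involved, the sketch below packs one i-face of a mesh block into a contiguous send buffer for message passing. It is a hypothetical example with invented names and layout, not one of the 75+ production kernels.

! Hypothetical CUDA Fortran packing kernel: each thread copies one (j,k)
! point of the i = i0 face of a block into a contiguous buffer.
module pack_kernels
  use cudafor
  implicit none
contains
  attributes(global) subroutine pack_iface(phi, buf, i0, ni, nj, nk)
    integer, value  :: i0, ni, nj, nk      ! i0 = index of the face to pack
    real(8), device :: phi(ni, nj, nk)     ! block field with (i,j,k) layout
    real(8), device :: buf(nj*nk)          ! contiguous send buffer
    integer :: j, k
    j = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    k = (blockIdx%y - 1) * blockDim%y + threadIdx%y
    if (j <= nj .and. k <= nk) then
       buf((k - 1) * nj + j) = phi(i0, j, k)
    end if
  end subroutine pack_iface
end module pack_kernels

From host code it would be launched with, e.g., call pack_iface<<<dim3((nj+15)/16, (nk+15)/16, 1), dim3(16, 16, 1)>>>(phi_d, buf_d, i0, ni, nj, nk), where phi_d and buf_d are device arrays.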
Co-design with computer science team
• Approach
  1. The program was profiled to identify performance bottlenecks.
  2. Accelerator-architecture-aware optimizations were applied, including coalesced memory access, shared memory usage, double buffering and pointer swapping, and removal of redundant synchronizations.
  3. Kernels were modified to suit the GPUs, e.g., eliminating trivial memory accesses by performing computations in situ, and making dot products and reductions ghost-cell aware (see the sketch below).
• GPU code optimization improved performance from 3x to 6x
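A sketch of the last optimization is shown below: a ghost-cell-aware dot product that skips the ghost layers during the reduction, so they never have to be zeroed or excluded by a separate pass. The one-ghost-layer layout, 256-thread blocks, and all names are assumptions for illustration, not GenIDLEST's kernel.

! Ghost-cell-aware dot product: only interior cells (indices 2..n-1 in each
! direction, assuming one ghost layer per side) contribute. Each 256-thread
! block writes a partial sum; the host (or a tiny second kernel) finishes it.
module dot_kernels
  use cudafor
  implicit none
contains
  attributes(global) subroutine dot_interior(x, y, partial, ni, nj, nk, npart)
    integer, value  :: ni, nj, nk, npart
    real(8), device :: x(ni, nj, nk), y(ni, nj, nk)
    real(8), device :: partial(npart)        ! one entry per thread block
    real(8), shared :: s(256)                ! launch with blockDim%x = 256
    integer :: i, j, k, tid, idx, ntot, stride
    real(8) :: acc

    tid    = threadIdx%x
    idx    = (blockIdx%x - 1) * blockDim%x + tid
    stride = gridDim%x * blockDim%x
    ntot   = ni * nj * nk
    acc    = 0.0d0
    do while (idx <= ntot)                   ! grid-stride loop over all cells
       i = mod(idx - 1, ni) + 1
       j = mod((idx - 1) / ni, nj) + 1
       k = (idx - 1) / (ni * nj) + 1
       if (i > 1 .and. i < ni .and. j > 1 .and. j < nj .and. k > 1 .and. k < nk) &
          acc = acc + x(i, j, k) * y(i, j, k) ! ghost cells are skipped
       idx = idx + stride
    end do
    s(tid) = acc
    call syncthreads()
    idx = blockDim%x / 2
    do while (idx >= 1)                      ! tree reduction in shared memory
       if (tid <= idx) s(tid) = s(tid) + s(tid + idx)
       call syncthreads()
       idx = idx / 2
    end do
    if (tid == 1) partial(blockIdx%x) = s(1)
  end subroutine dot_interior
end module dot_kernels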
Turbulent Channel flow
• Turbulent flow through a channel at Reτ = 180
• Orthogonal mesh: 7-point stencil, 64³ grid nodes, single mesh block
Channel flow – validation
• BiCGSTAB solver with additive Schwarz block preconditioner and Richardson iterative smoothing
• Time-averaged turbulence statistics: root-mean-square flow velocities
• Single vs. double precision (GPUs have more SP cores, so higher SP throughput)
DNS study: Moser, R. D., Kim, J., and Mansour, N. N., 1999, "Direct numerical simulation of turbulent channel flow up to Reτ = 590," Physics of Fluids, 11, p. 943.
Channel flow – performance results
• Cache block configurations: 3D data blocking; 25 iterations per time step for the pressure Poisson equation
• Different GPUs; 200 time steps; time measured in the time-integration loop

Time in time-integration loop (s):
            CPU      Nvidia Tesla K20c   Nvidia Tesla C2050
  DP        504.5    63.1                96.3
  SP        -        39.7                54.2

  Cache block layout   Total threads   Time (s)
  8x4x4                128             98.7
  8x8x2                128             97.9
  8x8x4                256             101.1
  16x8x2               256             103.6
  32x4x2               256             104.0
  64x2x2               256             103.5
  64x4x2               512             96.3
  32x8x2               512             97.5
  16x16x2              512             98.4
  8x8x8                512             107.5
Turbulent Pipe flow
• Turbulent flow through a pipe at Reτ = 338
• Non-orthogonal body-fitted mesh: 19-point stencil, 1.78 million grid nodes, 36 mesh blocks
• Krylov solvers are sensitive to round-off, so single-precision calculations use a relaxed convergence criterion and a restricted maximum number of iterations (a sketch of this stopping logic follows)
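A minimal sketch of such a single-precision stopping test is given below; the tolerance and iteration cap are example values and names, not the ones used in GenIDLEST.

! Relaxed stopping test for single-precision Krylov iterations: the tolerance
! is loosened and the iteration count is capped, since round-off prevents SP
! residuals from reaching DP levels.
logical function converged_sp(resnorm, res0, iter)
  implicit none
  real(4), intent(in) :: resnorm, res0     ! current and initial residual norms
  integer, intent(in) :: iter
  real(4), parameter  :: tol_sp   = 1.0e-5 ! relaxed relative tolerance (example value)
  integer, parameter  :: max_iter = 50     ! restricted iteration cap (example value)
  converged_sp = (resnorm <= tol_sp * res0) .or. (iter >= max_iter)
end function converged_sp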
Pipe flow – validation
[Plots] Mean velocity (U+ vs. r/D and vs. y+), rms velocities (u, v vs. y+), and shear stress profiles (total, laminar, turbulent vs. r/D), comparing double- and single-precision numerical results against experiments.
Experiments: den Toonder, J. M. J., and Nieuwstadt, F. T. M., 1997, "Reynolds number effects in a turbulent pipe flow for low to moderate Re," Physics of Fluids, 9(11), pp. 3398-3409.
Pipe flow – performance results
• Cache block sizes: the 'i' direction (contiguous in memory) is given the most threads
• Strong scaling: CPU and GPU cache block sizes are optimized
[Table] Time (s) vs. total threads in each cache block, for cache block sizes 32x38x40, 32x40x40, and 32x39x40.
  Number of CPU cores or GPUs   CPU, DP (s)   C2050 GPU, DP (s)   C2050 GPU, SP (s)
  2                             8492          1975                1353
  4                             4220          1290                806
  12                            1426          837                 492
  36                            511           386                 255
Close-up of bat wing motion
Wing flapping – simulation details
• Flapping of a bat wing using IBM
  • 25 million nodes for the background grid
  • 9,400 surface mesh elements for the bat wing
  • Re = 433
• Runs on HokieSpeed at Virginia Tech
  • 2 Nvidia M2050/C2050 GPUs per node
  • Dual-socket Intel Xeon E5645 CPUs per node
Wing flapping – geometry
Comparison of results (GPU vs CPU)
• Lift force over 1 cycle of bat wing flapping
GPU (60 GPUs) vs. CPU (60 CPU cores)
Solver co-design with math team
• Solution of the pressure Poisson equation is the most time-consuming function (50 to 90% of total time)
• Solving the linear system Ax = b: 'A' remains constant from one time step to the next in many CFD calculations
• GCROT algorithm: recycling of basis vectors from one time step to the next (see the sketch below)
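The core recycling step can be sketched as follows, under the usual GCROT/GCRO-DR setup: U holds k recycled basis vectors and C = A*U has orthonormal columns, both retained from the previous solve since A is unchanged. The names and the dense matmul-based projection are illustrative; the production implementation differs.

! Apply the recycled subspace at the start of a new time step's solve.
! On entry: r = b - A*x (residual of the initial guess), U = recycled basis,
! C = A*U with orthonormal columns. On exit: x and r are updated so the
! subsequent Krylov iteration starts from the deflated residual.
subroutine apply_recycled_subspace(n, k, U, C, r, x)
  implicit none
  integer, intent(in)    :: n, k
  real(8), intent(in)    :: U(n, k), C(n, k)
  real(8), intent(inout) :: x(n), r(n)
  real(8) :: alpha(k)
  alpha = matmul(transpose(C), r)   ! alpha = C^T r
  x = x + matmul(U, alpha)          ! x <- x + U*alpha  (correction from the recycled space)
  r = r - matmul(C, alpha)          ! r <- r - C*alpha  (residual now orthogonal to range(C))
end subroutine apply_recycled_subspace

Because part of the solution is captured by the recycled vectors before any new matrix-vector products are performed, fewer Krylov iterations are needed at each time step.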
Current/Future work
• Use GPU-aware MPI (see the sketch after this list)
  • GPUDirect v2 gives about a 25% performance improvement
• GCROT algorithm
  • Optimization
  • Implementation of GCROT on GPUs
• Non-orthogonal case (pipe flow): optimization
• Use of co-processing (Intel MIC) for offloading
  • Direct offloading with pragmas
  • Combination of CPU and co-processor
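The boundary-exchange sketch from the earlier slide, rewritten for a CUDA-aware MPI library (e.g., one with GPUDirect support), would pass device buffers straight to MPI and drop the explicit host staging copies. This assumes the MPI installation accepts GPU-resident buffers; the names are illustrative.

! GPU-aware MPI exchange sketch: send/receive buffers live on the GPU and are
! handed directly to MPI, so no device<->host staging copies are needed.
subroutine exchange_block_boundary_gpudirect(send_d, recv_d, n, nbr, comm, ierr)
  use cudafor
  use mpi
  implicit none
  integer, intent(in)  :: n, nbr, comm
  integer, intent(out) :: ierr
  real(8), device      :: send_d(n), recv_d(n)   ! buffers resident on the GPU
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Sendrecv(send_d, n, MPI_DOUBLE_PRECISION, nbr, 0, &
                    recv_d, n, MPI_DOUBLE_PRECISION, nbr, 0, &
                    comm, status, ierr)
end subroutine exchange_block_boundary_gpudirect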
Publications – in progress
• Accelerating Bio-Inspired MAV Computations using GPUs
  • Amit Amritkar, Danesh Tafti, Paul Sathre, Kaixi Hou, Sriram Chivukula, Wu-chun Feng
  • AIAA Aviation and Aeronautics Forum and Exposition 2014, 16-20 June 2014, Atlanta, Georgia
• CFD computations using preconditioned Krylov solver on GPUs
  • Amit Amritkar, Danesh Tafti
  • Proceedings of FEDSM 2014, August 3-7, 2014, Chicago, Illinois, USA
• Recycling Krylov subspaces for CFD application
  • Katarzyna Swirydowicz, Amit Amritkar, Eric de Sturler, Danesh Tafti
  • FEDSM 2014, August 3-7, 2014, Chicago, Illinois, USA
Fruit bat wing motion
Based on work by Viswanath, K., et al., "Straight-Line Climbing Flight Aerodynamics of a Fruit Bat," Physics of Fluids, accepted.
TAU profile data – 10 time steps
• Mpi_allreduce • Gpu_mpi_sendrecv • Ibm_movement • Ibm_write_tecplot • Gpu_pc2
• Mpi_waitall • Mpi_sendrecv • Read_grid • Gpu_dot_prd • Ibm_buffer_alloc
• Gpu_exchange_var • Ibm_projection • Alloc_field_var • Ibm_locate_ijk
Total = 157s, communication = 70s, initial/end (one time) = 25s