
Transcript of: An Optimized Solver for Unsteady Transonic Aerodynamics and Aeroacoustics Around Wing Profiles (GTC 2016)

  • 1

    An Optimized Solver for Unsteady Transonic Aerodynamics and Aeroacoustics Around Wing Profiles

    Jean-Marie Le Gouez, Onera CFD Department

    Jean-Matthieu Etancelin, ROMEO HPC Center, Université de Reims Champagne-Ardennes

    Thanks to Nikolay Markovskiy, dev-tech at the NVIDIA Research Center, GB, and to Carlos Carrascal, research master intern

    GTC 2016, April 7th, San José California

  • 2

    Unsteady CFD for aerodynamic profiles

    •Context

    •State of the art for unsteady Fluid Dynamics simulations of aerodynamic profiles

    •Prototypes for new generation flow solvers

    •NextFlow GPU prototype : Development stages, data models, programming languages, co-processing tools

    •Capacity of TESLA networks for LES simulations

    •Performance measurements, tracks for further optimizations

    •Outlook

    GTC 2016, April 7th, San José California

  • 3

    The expectations of the external users:

    •Extended simulation domains: effects of the wake on downstream components, blade-vortex interaction on helicopters, thermal loading by the reactor jets on composite structures,

    •Modelling of full systems and not only the individual components: multi-stage turbomachinery internal flows, couplings between the combustion chamber and the turbine aerodynamics, …

    •More multi-scale effects: representation of technological effects to improve the overall flow-system efficiency: grooves in the walls, local injectors for flow / acoustics control,

    •Advanced usage of CFD: adjoint modes for automatic shape optimization and grid refinement, uncertainty management, input parameters defined as probability density functions,

    The Cassiopee system for application productivity, modularity and coupling, associated with the elsA solver, partly open source

    General context: CFD at Onera

    GTC 2016, April 7th, San José California

  • 4

    Expectations from the internal users:

    - to develop and validate state-of-the-art physical models: transition to turbulence, wall models, sub-grid closure models, flame stability,

    - to propose disruptive novel designs for aeronautics in terms of aerodynamics, propulsion integration, noise mitigation, …

    - to tackle the CFD grand challenges

    → New classes of numerical methods, less dependent on the grids, more robust and versatile,

    → Computational efficiency near the hardware design performance, high parallel scalability,

    Decision to launch research projects:

    → On deployment of the DG method for complex cases: the AGHORA code
    → On a modular multi-solver architecture within the Cassiopee set of tools


    CFD at Onera


    GTC 2016, April 7th, San José California

  • 5

    Improvement of predictive capabilities in the last 5 years: RANS / zonal LES of the flow around a high-lift wing

    2D steady RANS and 3D LES, 7.5 Mpts

    2009

    Mach 0.18, Re 1,400,000 based on the chord

    LEISA project, Onera FUNK software

    2014

    GTC 2016, April 7th, San José California

    •Optimized on a CPU architecture: MPI / OpenMP / vectorization

    •CPU resources for 70 ms of simulation: JADE computer (CINES), CPU time allotted by GENCI
    •Nxyz ~ 2,600 Mpts, 4096 cores / 10688 domains, T_CPU ~ 6,200,000 h, residence time: 63 days

  • NextFlow: Spatially High-Order Finite Volume method for RANS / LES. Demonstration of the feasibility of porting these algorithms to heterogeneous architectures

    GTC 2016, April 7th, San José California


  • 8

    Multi-GPU implementation of a High-Order Finite Volume solver

    Main choices: CUDA, Thrust, mvapich. Reasons: resource-aware programming, productivity libraries

    The hierarchy of memories corresponds to the algorithm phases:
    1/ main memory for the field and metrics variables (40 million cells on a K40, 12 GB) and for the communication buffers (halos of cells for the other partitions)
    2/ shared memory at the streaming-multiprocessor level for stencil operations
    3/ careful use of registers for the node, cell and face algorithms
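    As an illustration of point 2/, here is a minimal CUDA sketch, with hypothetical names and a 1D layout rather than the actual NextFlow data model, of staging a tile of cell values plus its stencil halo in shared memory before applying a wide-stencil operator:

```cuda
// Minimal sketch (hypothetical 1D layout): each block stages TILE cell values
// plus a stencil halo in shared memory, then applies a wide stencil.
// Launched with TILE threads per block.
#define TILE 128
#define HALO 4   // assumed stencil half-width

__global__ void stencil_shared(const double* __restrict__ u,
                               double* __restrict__ du, int n)
{
    __shared__ double s_u[TILE + 2 * HALO];

    int gid = blockIdx.x * TILE + threadIdx.x;   // global cell index
    int lid = threadIdx.x + HALO;                // local index inside the shared tile

    s_u[lid] = (gid < n) ? u[gid] : 0.0;         // interior of the tile
    if (threadIdx.x < HALO) {                    // the first HALO threads also load the halos
        int l = gid - HALO;
        int r = gid + TILE;
        s_u[threadIdx.x] = (l >= 0) ? u[l] : 0.0;   // left halo
        s_u[lid + TILE]  = (r < n)  ? u[r] : 0.0;   // right halo
    }
    __syncthreads();

    if (gid < n) {
        double acc = 0.0;
        for (int k = -HALO; k <= HALO; ++k)      // generic wide stencil
            acc += s_u[lid + k];
        du[gid] = acc / (2 * HALO + 1);
    }
}
```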

    Stages of the project

    •Initial porting with the same data model organization as on the CPU

    •Generic refinement of coarse triangular elements with curved faces : hierarchy of grids

    •Multi-GPU implementation of a highly space-parallel model : extruded in the span direction and periodic

    •On-going work on a 3D generalization of the preceding phases : embedded grids inside a regular distribution (Octree-type)

    GTC 2016, April 7th, San José California

  • 9

    1st Approach: Block Structuring of a Regular Linear Grid

    Partition the mesh into small blocks and map them onto the GPU's scalable structure

    [Diagram: mesh blocks assigned to streaming multiprocessors; SM: Stream Multiprocessor]
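    A minimal sketch of this mapping, assuming a hypothetical regular 2D grid cut into fixed-size blocks: each mesh block is handled by one CUDA thread block, and the hardware scheduler distributes those thread blocks over the streaming multiprocessors.

```cuda
// Sketch of the block partition (hypothetical regular 2D grid): the grid is cut
// into BX x BY cell blocks; each CUDA thread block handles one mesh block and the
// scheduler spreads those blocks over the SMs.
// Launch: dim3 block(BX, BY), dim3 grid((nx + BX - 1) / BX, (ny + BY - 1) / BY).
#define BX 16
#define BY 16

__global__ void sweep_blocks(const double* __restrict__ u,
                             double* __restrict__ v, int nx, int ny)
{
    int i = blockIdx.x * BX + threadIdx.x;   // global cell indices inside the mesh block
    int j = blockIdx.y * BY + threadIdx.y;
    if (i < nx && j < ny)
        v[(size_t)j * nx + i] = u[(size_t)j * nx + i];   // placeholder for the per-cell work
}
```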

    GTC 2016, April 7th, San José California

  • 10

    Relative advantage of the small block partition

    ●Bigger blocks provide

    • Better occupancy

    • Less latency due to kernel launch

    • Fewer transfers between blocks

    ●Smaller blocks provide

    • Much more data caching

    [Bar charts: L1 hit rate (%), normalized fluxes time and normalized overall time, for block sizes of 256, 1024, 4096 and 24097 cells]

    ● Final speedup w.r.t. 2 hyperthreaded Westmere CPUs: ~2

    GTC 2016, April 7th, San José California

  • 11

    Unique grid connectivity for the inner algorithm. Optimal to organize the data for coalesced memory access during the algorithm and communication phases. Each coarse element in a block is allocated to an inner thread (threadIdx.x).

    2nd approach: Embedded grids, hierarchical data model NXO-GPU

    Hierarchical model for the grid: high-order (quartic polynomial) triangles generated by gmsh, refined on the GPU

    The whole fine grid as such can remain unknown to the host CPU

    GTC 2016, April 7th, San José California

    Imposing a sub-structuring on the grid and data model (inspired by the 'tessellation' mechanism in surface rendering)
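    The sketch below shows one plausible layout for this mapping (hypothetical names and sizes, not the actual NXO-GPU data model): threadIdx.x selects a coarse element, the generic inner connectivity is shared by all elements, and storing a given fine cell of consecutive elements contiguously makes the accesses coalesced.

```cuda
// Sketch of the embedded-grid mapping (hypothetical sizes): every coarse element
// carries the same generic refinement, so a single local connectivity table is
// reused by all of them; data is stored fine-cell-major so that consecutive
// threads (consecutive coarse elements) access consecutive addresses.
#define FINE_PER_COARSE 256   // assumed size of the generic refinement

__global__ void sweep_embedded(const double* __restrict__ u,            // layout: [fine cell][coarse element]
                               double* __restrict__ v,
                               const int* __restrict__ inner_stencil,   // generic local connectivity
                               int n_coarse)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;   // coarse element handled by this thread
    if (e >= n_coarse) return;

    for (int c = 0; c < FINE_PER_COARSE; ++c) {
        int nb = inner_stencil[c];                   // same local neighbour id for every element
        // consecutive e -> consecutive addresses: coalesced accesses
        v[(size_t)c * n_coarse + e] = 0.5 * (u[(size_t)c * n_coarse + e]
                                           + u[(size_t)nb * n_coarse + e]);
    }
}
```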

  • 12

    Code structure

    Preprocessing: mesh generation, block and generic refinement generation
    Postprocessing: visualization and data analysis
    Solver: allocation and initialization of the data structures from the modified mesh file, computational routine, time stepping, data fetching binder, computational binders, GPU allocation and initialization binders, CUDA kernels

    Languages: Fortran (driver, allocation / initialization, time stepping), C (binders), CUDA (kernels)
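    A minimal sketch of what such a binder could look like (hypothetical names): a C-linkage entry point, callable from Fortran through an ISO_C_BINDING interface, that launches a CUDA kernel on device data.

```cuda
// Sketch of a computational binder (hypothetical names): a C-linkage entry point
// that Fortran can call through an ISO_C_BINDING interface, and that launches a
// CUDA kernel on device data.
#include <cuda_runtime.h>

__global__ void scale_field(double* u, double a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u[i] *= a;                         // placeholder for the real computational kernel
}

extern "C" void nxo_scale_field(double* d_u, double* a, int* n)
{
    // Fortran passes its arguments by reference, hence pointers for the scalars
    int threads = 128;
    int blocks  = (*n + threads - 1) / threads;
    scale_field<<<blocks, threads>>>(d_u, *a, *n);
    cudaDeviceSynchronize();                      // keep the sketch synchronous for simplicity
}
```

    On the Fortran side, the matching interface block would declare nxo_scale_field with bind(C).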

    GTC 2016, April 7th, San José California

  • 13

    Version 2: Measured efficiency on a Tesla K20c (with respect to 2 Xeon 5650 CPUs, OpenMP loop-based)

    Initial results on a K20c: max. acceleration = 38 w.r.t. 2 Westmere sockets

    Improvement of the Westmere CPU efficiency: OpenMP task-based rather than inner-loop

    With the same block data model on the CPU as well, the K20c GPU / CPU acceleration drops to 13 (1 K20c = 150 Westmere cores)

    In fact this method is memory bound, and GPU bandwidth is critical.

    More CPU optimization is needed (cache blocking, vectorization?)

    Flop count: around 80 Gflops DP per K20c

    These are valuable flops: not Ax=b, but highly non-linear Riemann-solver flops with high-order (4th, 5th) extrapolated values, characteristic splitting to avoid interference between waves, …: it takes a very high memory traffic to feed these flops (wide-stencil method).

    Thanks to the NVIDIA GB dev-tech group for their support, "my flop is rich"

    GTC 2016, April 7th, San José California

  • 14

    Version 3: 2.5D periodic spanwise (circular-shift vectors), MULTI-GPU / MPI

    Objective: one billion cells on a cluster with only 64 Tesla K20 or 16 K80
    (40,000 cells × 512 spanwise stations per partition: 20 million cells addressed to each Tesla K20)

    The CPU (MPI / Fortran, OpenMP inner loop-based) and GPU (GPUDirect / C / CUDA) versions are in the same executable, for efficiency and accuracy comparisons

    High CPU vectorization (all variables are vectors of length 256 to 512) in the 3rd, homogeneous direction

    Fully data-parallel CUDA kernels with coalesced memory access

    GTC 2016, April 7th, San José California

    •Coarse partitioning: number of partitions equal to the number of sockets / accelerators
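    A sketch of this 2.5D layout (hypothetical names): with the spanwise station as the fastest-varying index, consecutive threads read consecutive memory locations, and the spanwise periodicity reduces to a circular shift of the index.

```cuda
// Sketch of the 2.5D layout (hypothetical names): the spanwise station k is the
// fastest-varying index, so threads of a warp read consecutive addresses, and the
// spanwise periodicity is a circular shift of k.
// Launch: <<<dim3((NSPAN + 127) / 128, n_cells), 128>>>.
#define NSPAN 512   // assumed number of spanwise stations

__global__ void spanwise_smooth(const double* __restrict__ u,   // layout: [2D cell][spanwise station]
                                double* __restrict__ v, int n_cells)
{
    int k    = blockIdx.x * blockDim.x + threadIdx.x;   // spanwise station (coalesced)
    int cell = blockIdx.y;                              // one 2D cell per grid row
    if (k >= NSPAN || cell >= n_cells) return;

    int kp = (k + 1) % NSPAN;              // periodic neighbours = circular shift
    int km = (k + NSPAN - 1) % NSPAN;

    const double* uc = u + (size_t)cell * NSPAN;
    v[(size_t)cell * NSPAN + k] = (uc[km] + uc[k] + uc[kp]) / 3.0;
}
```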

  • 15

    Version 3: 2.5D periodic spanwise (cshift vectors), MULTI-GPU / MPI. Initial performance measurements

    GTC 2016, April 7th, San José California

  • 16

    Initial kernel optimization and analysis performed by NVIDIA DevTech

    After this first optimization: performance ratio of 14, K40 vs. one 8-core Ivy Bridge socket

    Strategy for further performance optimization:

    Increase occupancy, reduce register use, reduce the number of global-memory operations, use the texture cache for wide read-only arrays in a kernel

    Put the stencil coefficients in shared memory, use constant memory, __launch_bounds__(128, 4)
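    A sketch combining these ingredients in one hypothetical kernel: stencil coefficients in constant memory, a __launch_bounds__(128, 4) hint, and const __restrict__ pointers with __ldg so that wide read-only arrays go through the read-only (texture) cache.

```cuda
// Sketch of the listed optimizations on a hypothetical reconstruction kernel:
// stencil coefficients in constant memory, a launch-bounds hint, and read-only
// arrays routed through the texture / read-only cache via __ldg.
#define STENCIL 9    // assumed stencil width

__constant__ double c_coef[STENCIL];   // filled once from the host with cudaMemcpyToSymbol

__global__ void __launch_bounds__(128, 4)   // 128 threads per block, aim for 4 resident blocks per SM
flux_reconstruct(const double* __restrict__ u,        // read-only field values
                 const int*    __restrict__ stencil,  // [n_faces * STENCIL] cell indices per face
                 double*       __restrict__ flux,
                 int n_faces)
{
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= n_faces) return;

    double acc = 0.0;
    for (int s = 0; s < STENCIL; ++s)
        acc += c_coef[s] * __ldg(&u[stencil[(size_t)f * STENCIL + s]]);
    flux[f] = acc;
}
```

    The host would fill c_coef once with cudaMemcpyToSymbol before entering the time loop.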

    GTC 2016, April 7th, San José California

  • 17

    Next stage of optimizations

    Work done by Jean-Matthieu

    - Use thread collaboration to transfer stencil data from main memory to shared memory

    - Refactor the kernel where the face-stencil operations are done: split it in two phases to reduce the pressure on registers

    - Use the Thrust library to sort the face and cell indices into lists, to template the kernels according to the list number and avoid internal conditional switches

    Enable an overlap of:
    - computations in the center of the partition,
    - transfer of the halo cells at the periphery (use of mvapich2),
    by using multiple streams and a further classification of the cell and face indices: center → periphery (Thrust)
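    A sketch of this classification and overlap (hypothetical names, with a host-side staging copy standing in for the MPI / mvapich2 exchange): Thrust partitions the face indices into interior and periphery lists, the interior faces are computed on one stream while the halo transfer proceeds on another, and the periphery faces are finished afterwards.

```cuda
// Sketch of the index classification and compute/transfer overlap (hypothetical
// names; a host-side staging copy stands in for the MPI / mvapich2 exchange).
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/partition.h>
#include <thrust/execution_policy.h>

struct IsInterior {                                 // true if the face needs no halo cell
    const int* touches_halo;
    __host__ __device__ bool operator()(int f) const { return touches_halo[f] == 0; }
};

__global__ void face_fluxes(const int* faces, int n, const double* u, double* flux)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flux[faces[i]] = u[faces[i]];        // placeholder for the real flux kernel
}

void rhs_overlapped(thrust::device_vector<int>& face_ids, const int* d_touches_halo,
                    const double* d_u, double* d_flux,
                    const double* d_halo_buf, double* h_halo_buf, size_t halo_bytes,
                    cudaStream_t s_comp, cudaStream_t s_comm)
{
    // classify the faces: interior first, periphery (halo-dependent) last
    auto mid = thrust::partition(thrust::device, face_ids.begin(), face_ids.end(),
                                 IsInterior{d_touches_halo});
    int n_int = static_cast<int>(mid - face_ids.begin());
    int n_per = static_cast<int>(face_ids.size()) - n_int;
    const int* d_faces = thrust::raw_pointer_cast(face_ids.data());

    // start the halo transfer on the communication stream (would feed the MPI exchange)
    cudaMemcpyAsync(h_halo_buf, d_halo_buf, halo_bytes, cudaMemcpyDeviceToHost, s_comm);

    // meanwhile, compute the interior faces on the compute stream
    if (n_int > 0)
        face_fluxes<<<(n_int + 127) / 128, 128, 0, s_comp>>>(d_faces, n_int, d_u, d_flux);

    // once the halo data is available (MPI exchange would happen here), finish the periphery
    cudaStreamSynchronize(s_comm);
    if (n_per > 0)
        face_fluxes<<<(n_per + 127) / 128, 128, 0, s_comp>>>(d_faces + n_int, n_per, d_u, d_flux);
}
```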

    GTC 2016, April 7th, San José California

  • 18

    Version 3: Kernel granularity revised to optimize register use, overlap of communications with computations at the centers of the partitions, local memory usage and inner-block thread collaboration

    GTC 2016, April 7th, San José California

    TAYLOR-GREEN Vortex

    Scalability analysis with up to one billion cells and 4th-degree polynomial reconstruction (5 dof per cell, stencil size 68 cells), on 1 to 128 GPUs (K20Xm)

    High performance: 12 ns to compute one set of 5 fluxes on an interface from a wide stencil: 180 GB/s, 170 Gflops DP

    Scalability drops only for extremely degraded usage: a small grid of 128³ cells on more than 32 GPUs, with over 30% of the cells to exchange

    [Plots: strong scalability and weak scalability]

  • High Order CFD Workshop, Case 3.5: Taylor-Green Vortex


  • 20

    GPU implementation of the NextFlow solver

    Performance on each K20Xm GPU:

    in k3: 1.8e-8 s per cell per RHS, 0.36 s for 20,000,000 cells

    in k4: 2.5e-8 s per cell per RHS, 0.50 s for 20,000,000 cells

    → Taylor-Green vortex 256³: wall-clock = 12 hours on 16 Ivy Bridge processors (128 cores in total), i.e. about 1,600 Intel-core CPU hours

    25 minutes on 16 Tesla K20M GPUs

    By comparison, at the 1st HO CFD workshop, this case required between 1,100 and 33,000 Intel-core CPU hours, depending on the numerical method

    → Taylor-Green vortex 512³: wall-clock = 4 hours on 16 Tesla K20M GPUs

    Taylor-Green vortex, Re = 1600, computations on wedges

    GTC 2016, April 7th, San José California

  • Grids (structured?): Octree → Tet-tree

    All tets are identical, only oriented differently in space

    From a grid of very coarse "structured tets" (bottom right): perform a refinement based on a simple criterion (distance to an object): 8, 8², 8³ … coarse tets in each (figure on top right)

    → Tet-tree 'coarse' grid, managed and partitioned on the cluster by CPU thread 0 of each node

    Each coarse tet of any size is filled dynamically with small tets: the finite volumes for the solver

    The size of the inner grid is adapted dynamically to the solution by refinement fronts crossing the coarse edges

    The coarse tets are clustered by identical refinement level: these sets are allotted to the multiprocessors of the accelerators available on the nodes
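    A sketch of this clustering step with Thrust (hypothetical names): sorting the coarse-tet indices by refinement level groups equally refined tets into contiguous batches that can then be dispatched to the accelerators.

```cuda
// Sketch of the clustering step (hypothetical names): Thrust sorts the coarse-tet
// indices by refinement level, so each contiguous group of equally refined tets
// can be dispatched to the accelerators as one batch.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>

// returns the permutation of tet indices ordered by refinement level
thrust::device_vector<int> cluster_by_level(const thrust::device_vector<int>& level)
{
    thrust::device_vector<int> order(level.size());
    thrust::sequence(order.begin(), order.end());   // tet ids 0, 1, 2, ...

    thrust::device_vector<int> key = level;         // copy: sort_by_key reorders its keys
    thrust::sort_by_key(key.begin(), key.end(), order.begin());
    return order;                                   // tets grouped by identical level
}
```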

    On-going work: hierarchical grids based on the generic refinement of a coarse grid of Octree / Tet-tree type

    GTC 2016, April 7th, San José California

  • On-going work: hierarchical grids based on the generic refinement of a coarse grid of Octree / Tet-tree type

    → A generic set of filling grids is generated on this type of simple gmsh model (the "tet-farm"): inner connectivity list, coefficients of the spatial scheme, halos of ghost cells and their correspondence with the inner numbering of the neighbours, HO projection coefficients when the filling-grid density varies in time

    → This common inner data model is stored only on the GPUs and accessed in a coalesced way by the threads

    → Wall boundary conditions are Immersed Boundary conditions or CAD-cut cells with curved geometry

    GTC 2016, April 7th, San José California

  • 23

    Conclusion

    A number of preparatory projects made it possible to acquire good expertise in the porting of CFD solvers, their compute-intensive kernels and interfaces, and in the best organization of the data models for multi-GPU performance.

    A high compute intensity was reached by approaching the peak main-memory bandwidth and almost fully overlapping computations and communications for big models: up to 80 million cells on a K80.

    The initial choice of CUDA, Thrust and mvapich proved correct: good stability of the language, the SDK and the associated programming productivity tools.

    A project of full software deployment of a variety of CFD options for complex 3D geometries and adaptive grid refinement, without the need for a preliminary meshing tool, has been started.

    GTC 2016, April 7th, San José California