Towards Exascale Direct Numerical Simulations of Turbulent Combustion on Heterogeneous Machines
Jacqueline Chen, Director of ExaCT
Sandia National Laboratories, Livermore, CA
jhchen@sandia.gov | www.exactcodesign.org
NVIDIA Booth@SC'14, November 20, 2014, New Orleans
Why Exascale Combustion?
• Predict behavior of new fuels in different combustion scenarios at realistic pressure and turbulence conditions
– Develop new combustor concepts
– Design new fuels
• Co-design center is focusing on high-fidelity direct numerical simulation methodologies
– Need to perform simulations with sufficient chemical fidelity to differentiate effects of fuels where there is strong coupling with turbulence
– Need to address uncertainties in thermo-chemical properties
– Not addressing complexity of geometry in engineering design codes
Fundamental Turbulence-Chemistry Interactions Motivated by Advanced Engines and Gas Turbines
• Higher fuel efficiency and lower emissions driving combustion towards more dilute, fuel lean, partially-premixed conditions
• New mixed-mode combustion regimes
• Strong sensitivities to fuel chemistry
• Preferential diffusion effects – synthesis gases enriched with hydrogen for carbon capture and storage in gas turbines for power generation
Motivation: Understanding Stabilization of Lifted Flames in Heated Coflow
Chemiluminescence from diesel lift-off stabilization for #2 diesel; ambient 21% O2, 850 K, 35 bar (courtesy of Lyle Pickett, SNL)
What is the role of ignition in lifted flame stabilization?
DNS of Lifted Ethylene-air Jet Flame in a Heated Coflow
• 3D slot burner configuration:
– Lx × Ly × Lz = 30 × 40 × 6 mm³ with 1.28 billion grid points
– High fuel jet velocity (204 m/s); coflow velocity (20 m/s)
– Nozzle size for fuel jet, H = 2.0 mm
– Re_jet = 10,000
– Cold fuel jet (18% C2H4 + 82% N2) at 550 K; stoichiometric mixture fraction ξst ≈ 0.27
– Detailed C2H4/air chemistry: 22 species, 18 global reactions, 201 steps
– Hot coflow air at 1,550K
Ethylene-air lifted jet flame at Re=10000
Dynamics of lifted flame stabilization – Log(scalar dissipation) and Temperature
Why does this need exascale?
• Turbulent combustion consists of phenomena occurring over a wide range of scales that are closely coupled
– More grid points needed to resolve larger dynamic range of scales
– More time steps needed for better statistics and less dependence on initial condition
• Complex fuels require a larger number of equations per grid point
• In situ uncertainty quantification with adjoint sensitivity – reverse causality – uncertainties in chemical inputs
• In situ analytics/visualization
• Coupled execution (hybrid Eulerian-Lagrangian particle solver, or lockstep DNS/LES)
• Future algorithms, programming environments, runtimes, hardware need to:
– Express data locality (sometimes at the expense of FLOPS) and independence
– Allow expression of massive parallelism
– Minimize data movement and reduce synchronization
– Detect and address faults
Why do we need to do co-design?
Old Constraints:
• Peak clock frequency as the primary limiter for performance improvement
• Cost: FLOPs are the biggest cost for the system: optimize for compute
• Concurrency: modest growth of parallelism by adding nodes
• Locality: MPI+X model (uniform costs within a node and between nodes)
• Memory scaling: maintain byte-per-flop capacity and bandwidth
• Uniformity: assume uniform system performance

New Constraints:
• Power: the primary design constraint for future HPC system design
• Cost: data movement dominates: optimize to minimize data movement
• Concurrency: exponential growth of parallelism within chips
• Locality: must reason about data locality and possibly topology
• Memory scaling: compute growing 2x faster than capacity or bandwidth; no global hardware cache coherence
• Heterogeneity: architectural and performance non-uniformity increase
ExaCT Vision and Goal
• The goal of combustion exascale co-design is to consider all aspects of the combustion simulation process, from formulation and basic algorithms to programming environments to the hardware characteristics needed to enable combustion simulations on exascale architectures
– Interact with vendors to help define hardware requirements, with computer scientists on requirements for the programming environment and software stack, and with the applied mathematics community on locality-aware algorithms for PDEs, UQ, and analytics
• Combustion is a surrogate for a much broader range of multiphysics computational science areas
Petascale codes provide starting point for co-design process
• S3D
– Compressible formulation
– Eighth-order finite difference discretization
– Fourth-order Runge-Kutta temporal integrator
– Detailed kinetics and transport
– Hybrid parallel model with MPI + OpenMP
– MPI + OpenACC (directives for GPUs)
– Legion (deferred execution hides latencies)
• LMC
– Low Mach number model that exploits the separation of scales between the acoustic wave speed and fluid motion
– Second-order projection formulation
– Detailed kinetics and transport
– Block-structured adaptive mesh refinement
– Hybrid parallel model with MPI + OpenMP
Expectation is that exascale will require a new code base
LMC simulation of NOx emissions from a low swirl injector
S3D simulation of HO2 ignition marker in lifted flame
Laboratory scale flames
S3D MPI Parallelism
• 3D domain decomposition.
– Each MPI process is in charge of a piece of the 3D domain.
• All MPI processes have the same number of grid points and the same computational load
• Inter-processor communication is only between nearest neighbors in 3D topology
– Large message sizes. Non-blocking sends and receives
• All-to-all communications are only required for monitoring and synchronization ahead of I/O
• Good parallel scaling on Titan
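For illustration, a minimal sketch of the nearest-neighbor pattern described above, using a 3D MPI Cartesian communicator; this is not S3D's actual (Fortran) source, and the buffer sizes are placeholders:

    #include <mpi.h>

    // Minimal 3D domain decomposition: each rank owns one brick of the grid
    // and exchanges halos only with its nearest neighbors in the 3D topology.
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int nprocs, dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0};
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Dims_create(nprocs, 3, dims);            // factor ranks into a 3D grid
        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);
        int left, right;
        MPI_Cart_shift(cart, 0, 1, &left, &right);   // x-direction neighbors
        double send[64], recv[64];                   // placeholder halo buffers
        MPI_Request reqs[2];                         // non-blocking halo exchange
        MPI_Irecv(recv, 64, MPI_DOUBLE, left, 0, cart, &reqs[0]);
        MPI_Isend(send, 64, MPI_DOUBLE, right, 0, cart, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   // repeated for y and z
        MPI_Finalize();
        return 0;
    }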
What happens in the main solver?
• Computes rate of change of N conserved quantities at every grid point
– d/dt (Qk) = (Advection) + (Diffusion) + (Source)
– Sum of all the terms that contribute to the time derivative is called the RHS
• d/dt (Qk) is integrated explicitly in time through Runge-Kutta
• RHS contains multiple terms that are functions of Qk and of variables derived from Qk
• Advection and diffusion require finite differencing and MPI
• Source terms are point-wise functions
• Thermodynamic, chemical and molecular transport properties are point-wise functions of Qk
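As a concrete sketch of this explicit integration (illustrative only: S3D's actual integrator is a low-storage Runge-Kutta, and rhs() here stands in for the full advection + diffusion + source evaluation):

    // Classical RK4 step for dQ/dt = rhs(Q); one value per grid point.
    void rk4_step(double *Q, double *k1, double *k2, double *k3, double *k4,
                  double *tmp, int n, double dt,
                  void (*rhs)(const double *Q, double *out, int n)) {
        rhs(Q, k1, n);
        for (int i = 0; i < n; i++) tmp[i] = Q[i] + 0.5 * dt * k1[i];
        rhs(tmp, k2, n);
        for (int i = 0; i < n; i++) tmp[i] = Q[i] + 0.5 * dt * k2[i];
        rhs(tmp, k3, n);
        for (int i = 0; i < n; i++) tmp[i] = Q[i] + dt * k3[i];
        rhs(tmp, k4, n);
        for (int i = 0; i < n; i++)   // combine the four stage derivatives
            Q[i] += dt / 6.0 * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]);
    }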
Source term is the most compute intensive kernel
• Called as ckwyp or getrates
• Chemical reaction rate computed using Arrhenius model
– A + B ⇌ C + D
– Forward reaction rate = C · [A] · [B] · T^a · exp(−Ta/T)
– Equilibrium constant gives reverse reaction rates
– More terms for third body efficiency, collision efficiency, pressure corrections …
• The source term for a species is the sum of the rates of all reactions in which it participates
• The kernel uses exp/log heavily
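A toy CUDA sketch of such a point-wise source-term kernel (the two-reaction "mechanism" and all rate constants below are made up purely for illustration):

    // Illustrative getrates-style kernel: forward Arrhenius rates
    // k = C * T^a * exp(-Ta/T), then production rates from concentrations.
    __global__ void getrates_toy(const double *T, const double *cA,
                                 const double *cB, double *wdotA, int npts) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= npts) return;
        // Reaction 1: A + B -> products (made-up constants)
        double k1 = 1.0e9 * pow(T[i], 0.5) * exp(-8000.0 / T[i]);
        // Reaction 2: A -> products (made-up constants)
        double k2 = 3.0e6 * exp(-4500.0 / T[i]);
        // Production rate of A: sum over all reactions it participates in
        wdotA[i] = -k1 * cA[i] * cB[i] - k2 * cA[i];
    }

Note the heavy use of exp (and log, since T^a is evaluated as exp(a·log T)) that makes this kernel compute-intensive.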
Hybridizing S3D: OLCF CAAR Early Science Project
• Collaboration between Cray (John Levesque), NVIDIA (Greg Ruetsch, Cliff Woolley), NREL (Ray Grout) and ORNL (Ramanan Sankaran)
• Significant restructuring of S3D to expose node-level parallelism
– Movement of outer loops to the highest level in RHS
– Combine several pointwise physics computations together within same OpenMP structure
– Reordered computation and communication to achieve maximal overlap
– Restructured computation to minimize memory operations and vectorized all loops that reside on accelerator
• Minimized data communication between the host and the accelerator, controlled with asynchronous updates
• Resulting code is hybrid MPI+OpenMP and MPI+OpenACC (-DGPU only changes directives)
• 6-fold performance improvement on Titan over Jaguar
Levesque et al. SC’12
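A minimal sketch of this kind of asynchronous host-accelerator update using CUDA streams (the kernels and buffer names are illustrative placeholders, not S3D code):

    #include <cuda_runtime.h>

    // Placeholder kernels standing in for real interior/boundary work.
    __global__ void interior_kernel(double *q, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) q[i] += 1.0;                      // dummy interior update
    }
    __global__ void boundary_kernel(double *q, const double *halo, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) q[i] += halo[0];                  // dummy halo-dependent update
    }

    // Overlap a host-to-device halo transfer with interior computation.
    void advance(double *d_q, double *d_halo, const double *h_halo,
                 size_t halo_bytes, int npts) {
        cudaStream_t copy_s, comp_s;
        cudaStreamCreate(&copy_s);
        cudaStreamCreate(&comp_s);
        cudaMemcpyAsync(d_halo, h_halo, halo_bytes,
                        cudaMemcpyHostToDevice, copy_s);   // asynchronous upload
        interior_kernel<<<(npts + 255) / 256, 256, 0, comp_s>>>(d_q, npts);
        cudaStreamSynchronize(copy_s);   // halo must arrive before boundary work
        boundary_kernel<<<(npts + 255) / 256, 256, 0, comp_s>>>(d_q, d_halo, npts);
        cudaStreamDestroy(copy_s);
        cudaStreamDestroy(comp_s);
    }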
Day 1 Science with S3D OpenACC on Titan
• Extinction/re-ignition in a turbulent dimethyl ether (DME) jet flame
– Largest-Reynolds-number reacting DNS with complex chemistry (32 species for DME, an oxygenated biofuel)
– Enabled comparison with a companion experiment
– Validated an experimental diagnostic for a surrogate of peak heat release based on product imaging of CH2O and OH
– Exploring the causality between turbulence and chemistry in the re-ignition process
The logarithm of the scalar dissipation rate (that is, the local mixing rate), where white denotes high mixing rates and red denotes lower mixing rates
Bhagatwala et al., Proc. Combust. Inst. (2014)
In Situ Uncertainty Quantification Guided by Analytics
• Topological segmentation and tracking
– Distance field (level set)
• Statistics
– Filtering and averaging (spatial and temporal)
– Statistical moments (conditional)
– Statistical dimensionality reduction (joint PDFs)
– Spectra (scalar, velocity, coherency)
– Chemical explosive mode
• Uncertainty in reaction rates characterizing ignition/extinction events that control fuel efficiency and emissions with respect to uncertainties in input chemical and transport parameters
• Solve adjoint equations backward in time: need the primal state at all times
• Exploit space-time locality guided by analytics to bound regions of interest
• I/O bandwidth constraints make it infeasible to save all raw simulation data to persistent storage: the workflow must integrate simulation, UQ, and analysis!
• Challenge: co-design a workflow that supports smart placement of solver + UQ, and analytics, reducing checkpointing size with in-situ and in-transit analytics
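As a flavor of the streaming computations such a workflow runs in situ, here is a minimal single-pass (Welford-style) running mean/variance sketch; it is illustrative and unrelated to ExaCT's actual analytics code:

    // One-pass running mean and variance: statistics can be updated as the
    // simulation produces data, without persisting the full time history.
    struct RunningMoments {
        long long n = 0;
        double mean = 0.0, m2 = 0.0;
        void push(double x) {
            n += 1;
            double delta = x - mean;
            mean += delta / n;            // update running mean
            m2 += delta * (x - mean);     // accumulate squared deviations
        }
        double variance() const { return n > 1 ? m2 / (n - 1) : 0.0; }
    };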
Petascale Workflow Model Won’t Scale
• O(1B) cores
• O(1 PB)/dump every 30 min (1 min)
• O(1-10 EB)/run
• Synchronous I/O between the combustion simulation and analysis/visualization
Performing the simulation is not enough – we need to analyze the results
Proxy Architectures and Apps are a Key Capability for HPC Co-design
• Mini-applications and proxy apps
– Express the parts of the application that are essential to the science
– Express the pain points for performance
– Hide complexities of the REAL apps that are inconsequential to the science
• Proxy hardware models are a reflection of the underlying machine architecture
– Express what is important for performance
– Hide complexity that is not consequential to performance
• Use proxy applications on proxy hardware architectures to understand which choices offer the highest value (using modeling and simulation)
– Not absolute performance predictions
– Scientific experiments that gauge relative improvements over a control
• Still need full applications
[Diagram: proxy apps and proxy architectures are distilled from full applications and architectures, and evaluated through models/measurements and architecture simulators]
Overall Workflow Captured Through Proxy and Full Apps
[Workflow diagram: compressible and low-Mach reacting flow solvers on the exascale machine feed in situ descriptive statistics, visualization, topological analysis, and UQ, writing reduced checkpoints and reduced data to disk]
• Meta-skeletal workflow proxy app
• New UQ + topology proxy for the adjoint-solution data flow
• Proxy and full apps: CNS, SMC, LMC (multigrid, chemistry), Castro, S3D; SSA (streaming statistics), PVR (parallel volume rendering), RTC (reduced topology), UQ
www.exactcodesign.org
[ExaSAT instruction mix and CPU-time breakdown:]
Chemistry – instruction mix: adds 3,620; muls 4,303; divs 540; transcendentals 710. CPU-time breakdown: adds 3%, muls 4%, divs 18%, transcendentals 75%.
Dynamics – instruction mix: adds 14,509; muls 16,447; divs 387; sqrt 1. CPU-time breakdown: adds 34%, muls 34%, divs 32%.
[Chart: chemistry FP state variables by rank – number of accesses (log scale, 1 to 4096) vs. variables sorted by number of accesses, for 9-, 21-, 53-, 71-, and 107-species mechanisms]
ExaSAT: Analysis of Proxy Applications
• Extracts floating point instruction mix and estimates CPU time
• Analyzes state variable accesses and estimates register spill impact on L1 cache traffic
Impact of Software Optimization on SMC Dynamics Computation
• Data reuse analysis shows impact of software optimizations on memory traffic for various cache sizes
• Large potential benefit from increased memory bandwidth and fast exponentials
• Trends can be investigated in further detail using hardware simulators
[Chart: bandwidth tapering – bytes per flop vs. cache size (2 kB to 8192 kB) for the baseline, +best blocking, and +best fusion]
Estimated Performance Benefits
[Chart: estimated teraflops vs. number of species (9, 21, 53, 71, 107) for the baseline and cumulative optimizations: +cache blocking, +loop fusion, +fast memory (4 TB/s), +fast-div, +fast-exp, +fast NIC (400 GB/s)]
Programming Environment Critical to Performance
Effective use of exascale hardware will require programming environment that effectively maps algorithms to hardware
• Driven by the programmability of combustion applications and the characterization of algorithms on different architecture designs
– Simplify programming to express locality and independence
– Simplify programming of block-structured PDEs, analytics, and UQ for performance, scalability, and portability on heterogeneous architectures with high variability, while maintaining readability
Legion Approach
• Capture the structure of program data
• Decouple specification from mapping
• Automate data movement, parallelism discovery, synchronization, hiding long latency
Goal: a performance-oriented programming model with a raised abstraction for productivity and portability.
(Bauer, Treichler, Slaughter, Aiken – Structure Slicing: Extending Logical Regions
with Fields, Thursday Nov 20 1:30-2:00, Rm 388-390)
http://legion.stanford.edu
Describing Data
• Logical regions (array of objects):
– Have no implied layout
– Have no implied location
• Described by:
– Index space (set of keys)
– Field space (set of fields)
• Operations include:
– Partitioning into subregions
– Slicing by fields
[Diagram: a logical region as the cross product of an index space (points p0–p11) and a field space (fields f0–f3)]
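A minimal sketch of declaring such a region through Legion's C++ runtime API (the field names and sizes are illustrative; see http://legion.stanford.edu for the actual interface):

    #include "legion.h"
    using namespace Legion;

    enum FieldIDs { FID_TEMP, FID_PRES };  // illustrative field names

    void make_region(Context ctx, Runtime *runtime) {
        // Index space: the set of keys (here, one million points)
        IndexSpace is = runtime->create_index_space(ctx, Rect<1>(0, 999999));
        // Field space: the set of fields, each with a declared size
        FieldSpace fs = runtime->create_field_space(ctx);
        FieldAllocator fa = runtime->create_field_allocator(ctx, fs);
        fa.allocate_field(sizeof(double), FID_TEMP);
        fa.allocate_field(sizeof(double), FID_PRES);
        // Logical region: index space x field space, no implied layout/location
        LogicalRegion lr = runtime->create_logical_region(ctx, is, fs);
        // Partition into 16 equal subregions for a parallel launch
        IndexSpace colors = runtime->create_index_space(ctx, Rect<1>(0, 15));
        IndexPartition ip = runtime->create_equal_partition(ctx, is, colors);
        LogicalPartition lp = runtime->get_logical_partition(ctx, lr, ip);
        (void)lp;  // subregions would be passed to index task launches
    }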
Legion Tasks
• Hierarchical Tree of Tasks
• Legion tasks specify:
– Region usage
– Field usage
– Access modes
• Legion runtime:
– Infers data dependences
– Inserts copies
[Diagram: Task 1 accesses region R, field A, read-write; Task 2 accesses region R, field B, read-only; the runtime schedules them on Node 0 and Node 1 and inserts the required copy]
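A sketch of how a task launch declares this usage (TID_TASK1 and FID_A are made-up identifiers; the runtime then infers dependences and inserts copies, as in the diagram above):

    enum { TID_TASK1, FID_A };  // illustrative task and field IDs

    void launch_task1(Context ctx, Runtime *runtime, LogicalRegion lr) {
        TaskLauncher launcher(TID_TASK1, TaskArgument(NULL, 0));
        // Region usage, field usage, and access mode, declared up front so
        // the runtime can infer data dependences between tasks.
        launcher.add_region_requirement(
            RegionRequirement(lr, READ_WRITE, EXCLUSIVE, lr));
        launcher.add_field(0, FID_A);  // requirement 0 touches only field A
        runtime->execute_task(ctx, launcher);
    }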
Legion Mapping
• Mapping happens at runtime:
– Assign task to processors
– Assign regions to memories
• With mapping API, programmer selects:
– Where tasks run
– Where data is placed
[Diagram: a task tree (t1–t5) and a region tree (rc; rw with rw1, rw2; rn with rn1, rn2) mapped onto a node: x86 cores with caches over NUMA domains and DRAM, plus a CUDA processor with framebuffer memory]
Application: S3D
• Current state-of-the-art combustion DNS simulation
– Written in 200K lines of Fortran
– Direct numerical solver using explicit methods
– Many phases, lots of parallelism
[Visualizations of the mixture fraction ξ and of OH, HO2, and CH2O fields]
S3D Task Parallelism
• One call to Right-Hand-Side-Function (RHSF) as seen by the Legion runtime
– Called 6 times per time step by Runge-Kutta solver
– Width == task parallelism
– H2 mechanism (only 9 species)
– Heptane (52 species) is significantly wider
• Manual task scheduling would be difficult!
S3D Execution Profile
• Two custom mappers
– Schedule tasks onto CPUs and GPUs
– Prioritize tasks on critical path of RHSf
Legion execution of RHSF on one node
[Timeline: tasks scheduled across GPUs and CPU cores, with one core dedicated to the Legion runtime's dependence analysis tasks]
Leaf Tasks
• Legion treats tasks as black boxes
– Doesn’t care how tasks are written
• Still need fast leaf tasks for computationally expensive chemistry, diffusion, viscosity
– For CPUs & GPUs
– For multiple mechanisms
• Singe* is a DSL compiler for chemistry kernels
[Diagram: the Singe compiler consumes CHEMKIN, transport, and thermo specifications and emits SSE, AVX, and CUDA kernels]
*Bauer et al. PPoPP’14
Combustion Challenges
• GPU programming models emphasize data parallelism
– Not always the best choice for performance
• Large working sets (per point)
– n-heptane chemistry needs 566 double-precision reaction rates (per point)
– The GPU register file can hold only 128 per thread
• Multi-phase computations
– Fissioned kernels are limited by memory bandwidth and run slowly
[Pipeline: compute reaction rates → QSSA → stiffness → accumulate outputs]
Warp Specialization
• Leverage knowledge of underlying hardware
– GPUs execute warps: streams of 32-wide vector instructions
– All threads in warp execute the same program (data parallel unit)
• Each warp can run different computation
– Generate code that specializes each warp (leveraging task parallelism)
– Different warps do different computations on the same data
– Allows much better use of memory while keeping processors busy
– Fit large working sets on chip
[Diagram: in data-parallel execution, every warp (Warp 0–2) runs the same program P over the input data; under warp specialization, the program is partitioned into P0, P1, P2 and each warp runs a different piece over the same input data]
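A minimal CUDA sketch of the idea (two specialized warps rather than Singe's generated many-warp pipelines; the rate expressions are made up):

    // Each 64-thread block holds two warps with different roles that share
    // one batch of 32 grid points and exchange results via shared memory.
    __global__ void rhs_warp_specialized(const double *T, double *wdot, int npts) {
        const int warp = threadIdx.x / 32;        // this warp's role
        const int lane = threadIdx.x % 32;
        const int pt   = blockIdx.x * 32 + lane;  // 32 points per block
        __shared__ double rates[2][32];           // on-chip exchange buffer

        if (pt < npts) {
            if (warp == 0)   // warp 0: one (made-up) group of reaction rates
                rates[0][lane] = 1.0e8 * exp(-8000.0 / T[pt]);
            else             // warp 1: a different group, computed concurrently
                rates[1][lane] = 5.0e6 * exp(-4500.0 / T[pt]);
        }
        __syncthreads();     // exchange partial results through shared memory
        if (warp == 0 && pt < npts)
            wdot[pt] = rates[0][lane] + rates[1][lane];  // accumulate outputs
    }

    // Launch example: rhs_warp_specialized<<<(npts + 31) / 32, 64>>>(T, wdot, npts);

Keeping both warps' working sets in registers and shared memory, rather than fissioning into separate kernels, is what lets large mechanisms stay on chip.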
Singe Warp-Specialized Chemistry Kernel
[Diagram: four warps (Warp 0–3) cooperate on one batch of points; warps evaluate different groups of chemical reactions (Chem Reacs 1–6), exchange reaction rates through shared memory, process QSSA species (QSS 1–15) and stiffness in a dependency-driven order, and accumulate outputs through output-math warps]
Performance Results
• Chemistry Kernel
– All Singe kernels are significantly faster than the current production versions
– Warp-specialized Singe code is up to 3.75x faster than previously optimized data-parallel CUDA kernels
• Multi-node heterogeneous testbeds for Legion S3D:
– Keeneland: 128 nodes, each with 16 Sandy Bridge cores and 3 Fermi GPUs
– Titan: 18K nodes, each with 16 Interlagos cores and 1 K20 GPU
S3D Performance Comparison
• Compare against MPI+OpenACC code on Titan
– Tuned by Cray and NVIDIA engineers with ORNL/NREL domain experts
• OpenACC runs at 48³ and 64³ for DME and heptane
– Fixed mapping (most compute on the GPU)
• Legion runs at 48³, 64³, and 96³ for any mechanism (DME, heptane, PRF)
– Try both all-GPU and mixed CPU+GPU mappings
• Use a re-ranking script for runs ≥ 1024 nodes
Legion S3D Performance on Titan (weak scaling)
N-heptane, 52 species
DME: 1.71–2.33x faster between 1,024 and 8,192 nodes
n-heptane (64³): 1.88–2.85x faster between 1,024 and 8,192 nodes
Exascale Use Cases: Science at Relevant Conditions
• Homogeneous charge compression ignition (HCCI engine combustion) – a 'chemical' engine with high diesel-like efficiency without NOx and soot; tailor the charge stratification to control ignition and burn rate
• Turbulent jet flames (swirl, transverse, cavity) – low-swirl injector gas turbines with staged lean premixed combustion; flame stabilization, emissions
• Lifted diesel jet flames – lifted autoignitive diesel jet flame stabilization with multi-stage ignition fuels
• Need to include UQ with respect to chemistry and transport properties
• Extrapolation of current capability shows that these cases will require exascale-level resources