Development of a Flow Solver with Complex Kinetics on the Graphic Processing Units (GPU): Briefing Charts (AFRL-RZ-ED-VG-2011-581)

Page 1:

REPORT DOCUMENTATION PAGE Form Approved

OMB No. 0704-0188 Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing this collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden to Department of Defense, Washington Headquarters Services, Directorate for Information Operations and Reports (0704-0188), 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to any penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number. PLEASE DO NOT RETURN YOUR FORM TO THE ABOVE ADDRESS.

1. REPORT DATE (DD-MM-YYYY) 15-12-2011

2. REPORT TYPE Briefing Charts

3. DATES COVERED (From - To)

4. TITLE AND SUBTITLE
Development of a Flow Solver with Complex Kinetics on the Graphic Processing Units (GPU)

5a. CONTRACT NUMBER

5b. GRANT NUMBER

5c. PROGRAM ELEMENT NUMBER

6. AUTHOR(S)
Hai P. Le and Jean-Luc Cambier

5d. PROJECT NUMBER

5f. WORK UNIT NUMBER 23041057

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)

8. PERFORMING ORGANIZATION REPORT NUMBER

Air Force Research Laboratory (AFMC) AFRL/RZSS 1 Ara Drive Edwards AFB CA 93524-7013

9. SPONSORING / MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR’S ACRONYM(S)

Air Force Research Laboratory (AFMC), AFRL/RZS, 5 Pollux Drive, Edwards AFB CA 93524-7048

11. SPONSOR/MONITOR'S REPORT NUMBER(S) AFRL-RZ-ED-VG-2011-581

12. DISTRIBUTION / AVAILABILITY STATEMENT Distribution A: Approved for public release; distribution unlimited. PA# 111067.

13. SUPPLEMENTARY NOTES For presentation at the 50th AIAA Aerospace Sciences Meeting, Nashville, TN, 9-12 Jan 2012.

14. ABSTRACT In the current work, we have implemented a numerical solver on the Graphic Processing Unit (GPU) to solve the reactive Euler equations with detailed chemical kinetics. The solver incorporates high-order finite volume methods for the fluid dynamical equations and an implicit point solver for the chemical kinetics. Generally, the computing time is dominated by the time spent solving the kinetics, which can benefit from the computing power of the GPU. Preliminary investigation shows that the performance of the kinetics solver strongly depends on the mechanism used in the simulations. The speed-up factor obtained in the simulation of an ideal gas ranges from 30 to 55 compared to the CPU. For a 9-species gas mixture, we obtained a speed-up factor of 7.5 to 9.5 compared to the CPU. For such a small mechanism, the achieved speed-up factor is quite promising, and it is expected to increase substantially with the size of the mechanism. The numerical formulation for solving the reactive Euler equations is briefly discussed along with the GPU implementation strategy, and preliminary performance results obtained with the current solver are presented.

15. SUBJECT TERMS

16. SECURITY CLASSIFICATION OF:
a. REPORT Unclassified
b. ABSTRACT Unclassified
c. THIS PAGE Unclassified

17. LIMITATION OF ABSTRACT SAR

18. NUMBER OF PAGES 30

19a. NAME OF RESPONSIBLE PERSON Dr. Jean-Luc Cambier

19b. TELEPHONE NUMBER (include area code) N/A

Standard Form 298 (Rev. 8-98) Prescribed by ANSI Std. 239.18

Page 2:

Development of a Flow Solver with Complex Kinetics on the Graphic Processing Units (GPU)

Hai P. Le (1), Jean-Luc Cambier (2)

(1) Mechanical and Aerospace Engineering Department, University of California, Los Angeles

(2) Air Force Research Laboratory, Edwards AFB, CA

January 11, 2012

50th AIAA Aerospace Sciences Meeting, Nashville, Tennessee

Distribution A: Approved for Public Release; Distribution Unlimited

Page 3:

Table of Contents

1 Objectives & Motivation

2 Approach

3 GPU Implementation

4 Results

5 Conclusion and Future Works

Page 4:

Objectives

Develop a fluid code on the GPU for modeling flows with complex chemical kinetics. The entire code is written using CUDA C/C++ for maximum flexibility.

Explore different strategies for optimizing the performance of the code for a general chemistry mechanism.

Emphasis on the kinetics solver since it is more computationally expensive.

Benchmark with standard test cases.


Page 5:

Motivation

Detailed study of non-equilibrium processes associated with high-speed flow:

Detonation instability

Partially ionized gas

MHD

Development of a multi-physics code utilizing Object-Oriented and CUDA technology. Both of these features are available in CUDA C/C++.


Page 6:

Governing Equations

Euler equations with source term for chemical kinetics

\[
\frac{\partial Q}{\partial t} + \frac{1}{V}\oint_S F_n \, dS = \Omega \qquad (1)
\]

\[
Q = \begin{pmatrix} \rho_s \\ \rho u \\ \rho v \\ \rho w \\ E \end{pmatrix}; \quad
F_n = \begin{pmatrix} \rho_s U_n \\ P n_x + \rho u U_n \\ P n_y + \rho v U_n \\ P n_z + \rho w U_n \\ U_n H \end{pmatrix}; \quad
\Omega = \begin{pmatrix} \omega_s \\ 0 \\ 0 \\ 0 \\ -\sum_s \omega_s e^0_s \end{pmatrix}
\]

Solution method:

Finite Volume method for hyperbolic conservation laws

Source terms are solved using an operator-splitting technique
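For illustration, the splitting can be written abstractly as below. A first-order (Godunov) splitting is shown as an assumption; the briefing does not specify the splitting order:

\[
Q^{n+1} = \mathcal{K}^{\Delta t}\!\left(\mathcal{F}^{\Delta t}(Q^n)\right),
\]

where \(\mathcal{F}^{\Delta t}\) is the homogeneous finite-volume update (Eq. 1 with \(\Omega = 0\)) and \(\mathcal{K}^{\Delta t}\) integrates the stiff kinetics ODE \(dQ/dt = \Omega\) cell by cell.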


Page 7:

Numerical Schemes

Monotonicity Preserving1 (MP) Schemes

3rd and 5th order spatial discretization was used in conjunction with 3rd order TVD Runge-Kutta time integration

Arbitrary Derivative Riemann solver with Weighted Essentially Non-Oscillatory2 (ADERWENO) scheme

5th order spatial and 3rd order temporal accuracy without Runge-Kutta time integration. Utilizes the Cauchy-Kowalewski procedure and a Taylor series expansion of WENO fluxes for high order in time.

1 Suresh & Huynh (1997) J. Comp. Phys. 136, 83-99
2 Titarev & Toro (2001) J. Comp. Phys. 204, 715-736


Page 8:

Chemical Kinetics

Implicit formulation

\[
\frac{dQ}{dt} = \Omega \;\;\rightarrow\;\; \left(I - \Delta t\, \frac{\partial \Omega}{\partial Q}\right) \frac{dQ}{dt} = \Omega \qquad (2)
\]

Elementary Reaction:

\[
\sum_s \nu'_{rs}\, [X_s] \;\rightleftharpoons\; \sum_s \nu''_{rs}\, [X_s] \qquad (3)
\]

Species production/destruction rate:

\[
\omega_s = \sum_r \nu_{rs} K^f_r \prod_s [X_s]^{\nu'_{rs}} \;-\; \sum_r \nu_{rs} K^b_r \prod_s [X_s]^{\nu''_{rs}} \qquad (4)
\]

where \(\nu_{rs} = \nu''_{rs} - \nu'_{rs}\)
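As a concrete reading of Eq. (4), a scalar reference implementation might look as follows. This is a sketch only; the array layout and names (nu_p, nu_pp, kf, kb, X) are assumptions for illustration, not the solver's actual data structures.

#include <math.h>

/* Sketch: net production rate of species s from Eq. (4).
 * Assumed storage: nu_p[r*ns + k] = nu'_{rk}, nu_pp[r*ns + k] = nu''_{rk},
 * kf[r], kb[r] = forward/backward rate coefficients, X[k] = molar concentration. */
double production_rate(int s, int nr, int ns,
                       const double* nu_p, const double* nu_pp,
                       const double* kf, const double* kb, const double* X)
{
    double omega = 0.0;
    for (int r = 0; r < nr; ++r) {
        double nu  = nu_pp[r*ns + s] - nu_p[r*ns + s];   /* nu_{rs} = nu''_{rs} - nu'_{rs} */
        double fwd = kf[r];
        double bwd = kb[r];
        for (int k = 0; k < ns; ++k) {
            fwd *= pow(X[k], nu_p[r*ns + k]);            /* prod_k [X_k]^{nu'_{rk}} */
            bwd *= pow(X[k], nu_pp[r*ns + k]);           /* prod_k [X_k]^{nu''_{rk}} */
        }
        omega += nu * (fwd - bwd);
    }
    return omega;
}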


Page 9:

Graphic Processing Unit

What is a GPU?

Graphic processing units contain a massive number of processing cores

Designed specifically for graphics rendering, which is a highly parallel process

Why GPU?

GPU is faster than CPU on SIMD execution model

GPU is now very easy to program

GPU is much cheaper than CPU


Page 10:

GPU versus CPU

Figure 1: Single and double precision floating point operation capability (theoretical peak GFLOP/s) of NVIDIA GPUs and Intel CPUs from 2003-2010 (adapted from NVIDIA[8])

Attempts had been made in writing scientific codes on the GPU, and promising results were obtained both in terms of performance and flexibility.[1, 6]

In this work, we attempt to follow the same path in the design of numerical solvers on the GPU. However, our focus differs from previous attempts in that we place more attention on the kinetics solver than on the fluid dynamics. This is because, for the simulation of high-speed fluid flow, the computation is dominated by solving the kinetics. While the current implementation is only for chemical kinetics, we aim to extend it to a more complicated and computationally intensive kinetics model for plasma.

2 Governing Equations

The set of the Euler equations for a reactive gas mixture can be written as

\[
\frac{\partial Q}{\partial t} + \nabla \cdot F = \Omega \qquad (1)
\]

where Q and F are the vectors of conservative variables and fluxes. We assume that there is no species diffusion and that the gas is in thermal equilibrium (i.e., all species have the same velocity and all the internal energy modes are at equilibrium). The right hand side (RHS) of Eq. 1 denotes the vector of source terms Ω. In this case, the source terms are composed of exchange terms due to chemical reactions. By applying Gauss's law to the divergence of the fluxes, one could obtain

\[
\frac{\partial Q}{\partial t} + \frac{1}{V}\oint_S F_n \, dS = \Omega \qquad (2)
\]



Page 11:

GPU Programming

Programming languages for the GPU: CUDA, OpenCL, DirectCompute, BrookGPU, ...

CUDA is the most mature programming environment for the GPU:

similar to C/C++

support OO features

easy to debug


Page 12:

GPU Programming Model

Each device contains a set of streaming multiprocessors (SM).
Each SM contains a set of streaming processors (SP).
Parallelism is based on grids and thread blocks.
The function executed in parallel on the device is called a kernel.

A Quick Refresher on CUDA

CUDA is the hardware and software architecture that enables NVIDIA GPUs to execute programs written with C, C++, Fortran, OpenCL, DirectCompute, and other languages. A CUDA program calls parallel kernels. A kernel executes in parallel across a set of parallel threads. The programmer or compiler organizes these threads in thread blocks and grids of thread blocks. The GPU instantiates a kernel program on a grid of parallel thread blocks. Each thread within a thread block executes an instance of the kernel, and has a thread ID within its thread block, program counter, registers, per-thread private memory, inputs, and output results.

A thread block is a set of concurrently executing threads that can cooperate among themselves through barrier synchronization and shared memory. A thread block has a block ID within its grid.

A grid is an array of thread blocks that execute the same kernel, read inputs from global memory, write results to global memory, and synchronize between dependent kernel calls.

In the CUDA parallel programming model, each thread has a per-thread private memory space used for register spills, function calls, and C automatic array variables. Each thread block has a per-block shared memory space used for inter-thread communication, data sharing, and result sharing in parallel algorithms. Grids of thread blocks share results in global memory space after kernel-wide global synchronization.

Figure: CUDA hierarchy of threads, blocks, and grids, with corresponding per-thread private, per-block shared, and per-application global memory spaces.
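To make the hierarchy above concrete, here is a minimal, generic CUDA example (unrelated to the flow solver itself) that launches a grid of thread blocks; the kernel and variable names are illustrative only.

#include <cstdio>
#include <cuda_runtime.h>

// Generic illustration of the kernel/block/grid hierarchy described above:
// each thread scales one array entry.
__global__ void scale(double* x, double a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) x[i] *= a;
}

int main()
{
    const int n = 1 << 20;
    double* d_x;
    cudaMalloc(&d_x, n * sizeof(double));
    cudaMemset(d_x, 0, n * sizeof(double));
    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0, n);    // grid of blocks, 256 threads per block
    cudaDeviceSynchronize();
    cudaFree(d_x);
    printf("done\n");
    return 0;
}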


Page 13:

GPU Programming Model


Page 14:

CFD

CFD:

Cell-based parallelization: EOS, time marching, etc.

Face-based parallelization: Reconstruction, flux, etc.

Strategies:

Global memory

large but high latency; requires coalesced access

Shared memory

small but very fast; not useful in this case since NQ ∼ Ns

Reduce block occupancy to utilize more registers3.

3 Volkov (2010) Better Performance at Lower Occupancy, GPU Tech. Conf.
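As an illustration of the coalesced-access requirement for cell-based kernels, the following minimal sketch assumes a structure-of-arrays layout Q[v*ncell + cell] and a simple single-fluid, ideal-gas pressure evaluation. The actual solver is multi-species and multi-dimensional; all names here are illustrative, not the solver's API.

// Sketch: cell-based kernel with a structure-of-arrays layout so that consecutive
// threads (cells) read consecutive addresses (coalesced global memory access).
__global__ void pressureKernel(const double* Q, double* p, double gamma, int ncell)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= ncell) return;
    double rho = Q[0 * ncell + cell];    // density
    double ru  = Q[1 * ncell + cell];    // x-momentum (1-D example)
    double E   = Q[2 * ncell + cell];    // total energy
    p[cell] = (gamma - 1.0) * (E - 0.5 * ru * ru / rho);   // ideal-gas EOS
}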

Page 15:

Chemical Kinetics

Main strategies

Coalesce memory access pattern for high global memory bandwidth

Utilize shared memory to reduce DRAM latency

Texture binding for read-only data

Issues:

How to overcome shared memory limitation?

How effective is global memory?


Page 16:

Summary of Steps in Gaussian Elimination Algorithm

Forward elimination:

for np = 1:N-1
    for ns = np+1:N
        P := A(ns,np)/A(np,np)              % elimination factor for row ns
        RHS(ns) := RHS(ns) - RHS(np)*P
        for ms = np+1:N
            A(ns,ms) := A(ns,ms) - A(np,ms)*P

Backward substitution:

RHS(N) := RHS(N)/A(N,N)                     % normalize the last unknown first
for np = N-1:1
    P := 0
    for ns = np+1:N
        P := P + A(np,ns)*RHS(ns)
    RHS(np) := (RHS(np) - P)/A(np,np)
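A direct host-side C translation of these steps, applied to a small test system, illustrates the usage; this is only an illustration, not the GPU implementation.

#include <stdio.h>

#define N 3

/* Direct C translation of the steps above, applied to a 3x3 test system
 * with known solution x = (2, 3, -1). */
int main(void)
{
    double A[N][N] = {{2, 1, -1}, {-3, -1, 2}, {-2, 1, 2}};
    double RHS[N]  = {8, -11, -3};

    /* forward elimination */
    for (int np = 0; np < N - 1; np++)
        for (int ns = np + 1; ns < N; ns++) {
            double P = A[ns][np] / A[np][np];
            RHS[ns] -= RHS[np] * P;
            for (int ms = np + 1; ms < N; ms++)
                A[ns][ms] -= A[np][ms] * P;
        }

    /* back substitution */
    RHS[N - 1] /= A[N - 1][N - 1];
    for (int np = N - 2; np >= 0; np--) {
        double P = 0.0;
        for (int ns = np + 1; ns < N; ns++)
            P += A[np][ns] * RHS[ns];
        RHS[np] = (RHS[np] - P) / A[np][np];
    }

    printf("x = (%g, %g, %g)\n", RHS[0], RHS[1], RHS[2]);  /* expect (2, 3, -1) */
    return 0;
}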


Page 17:

Shared Memory Limit

How many kinetics systems can we fit in shared memory (48 KB per CUDA block)?

A linear system must be solved at every computational cell. Storing the whole Jacobian and the RHS vector in shared memory is not an ideal situation here. Figure 4 shows the memory requirement for storing the Jacobian and RHS against the typical shared memory size (48 KB for a Tesla C2050/2070). If we associate an entire thread block with the chemical system in a cell, the number of species is limited to 75. Storing more than 1 system per block makes this limit go even lower, as can be seen for 1 thread per cell (32-thread block size).

Figure 4: Chemistry size limit. Shared-memory footprint (matrix size in KB) versus number of species for 1 cell/CUDA block and 32 cells/CUDA block, against the 48 KB shared memory limit.
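One way to reproduce the quoted 75-species limit is the following back-of-the-envelope estimate. It assumes the linearized system has dimension Ns + 2 and is stored in double precision; this is a guess at the layout, not a statement of the actual implementation:

\[
(N_s+2)^2 \cdot 8\,\text{B} + (N_s+2) \cdot 8\,\text{B} = 48{,}048\,\text{B} \le 49{,}152\,\text{B}\ (=48\ \text{KB}) \quad \text{for } N_s = 75,
\]
\[
(N_s+2)^2 \cdot 8\,\text{B} + (N_s+2) \cdot 8\,\text{B} = 49{,}296\,\text{B} > 49{,}152\,\text{B} \quad \text{for } N_s = 76.
\]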


Page 18:

Reduced Storage Pattern

Store one row of matrix in shared memory for each row elimination

Figure 5: Chemistry size limit. Maximum number of species versus number of elements for reduced shared-memory storage, full shared-memory storage, and global-memory storage, with the shared memory limits indicated.


Page 19:

Algorithms

Algorithm 1: store matrix data in global memory and coalesce the memory access pattern

Invert several matrices per CUDA block

Algorithm 2: store part of the matrix data (one row at a time) in shared memory

Load and reload after row pivoting
Invert one matrix per CUDA block
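A minimal sketch of an Algorithm-1-style kernel is shown below. It is not the authors' implementation; the one-thread-per-cell mapping and the interleaved layout A[(row*N + col)*ncell + cell] are assumptions chosen so that global memory accesses are coalesced across a warp, and row pivoting is omitted for brevity.

// Sketch: batched Gaussian elimination, one thread per cell, matrices interleaved
// by cell index so that threads in a warp access consecutive global memory addresses.
__global__ void batchedGauss(double* A, double* rhs, int N, int ncell)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= ncell) return;

    #define IDX(r, c) (((r) * N + (c)) * ncell + cell)   // interleaved (coalesced) layout

    // forward elimination
    for (int np = 0; np < N - 1; ++np) {
        for (int ns = np + 1; ns < N; ++ns) {
            double p = A[IDX(ns, np)] / A[IDX(np, np)];
            rhs[ns * ncell + cell] -= rhs[np * ncell + cell] * p;
            for (int ms = np + 1; ms < N; ++ms)
                A[IDX(ns, ms)] -= A[IDX(np, ms)] * p;
        }
    }
    // back substitution
    rhs[(N - 1) * ncell + cell] /= A[IDX(N - 1, N - 1)];
    for (int np = N - 2; np >= 0; --np) {
        double p = 0.0;
        for (int ns = np + 1; ns < N; ++ns)
            p += A[IDX(np, ns)] * rhs[ns * ncell + cell];
        rhs[np * ncell + cell] = (rhs[np * ncell + cell] - p) / A[IDX(np, np)];
    }
    #undef IDX
}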


Page 20:

CFD Results: Forward Step

Mach 3 flow over a step with reflective boundary on top

No special treatment at the corner of the step

MP5 scheme with RK3 using 630,000 cells

Figure 4: Forward step problem

The second test is the shock diffraction problem, which is modeled as the diffraction of a shock wave (M = 2.4) down a step[12]. The diffraction results in a strong rarefaction generated at the corner of the step, which can cause negative densities when performing the reconstruction. The problem is modeled using 27,000 cells. The numerical simulation is shown alongside the experimental images in figure 5. The solver was able to reproduce the correct flow features in the region of the rarefaction fan.

Figure 5: Diffraction of a Mach 2.4 shock wave down a step. Comparison between numerical schlieren and experimental images

We also modeled the Rayleigh-Taylor instability problem. The problem is described as the acceleration of a heavy fluid into a light fluid driven by gravity. For a rectangular domain of



Page 21:

CFD Results: Backward Step

Mach 2.4 shock diffracted from a step

MP5 scheme with RK3 using 300,000 cells

Comparison with experiment shows excellent agreement


Page 22:

CFD Results: Rayleigh-Taylor Instability

Acceleration of a heavy fluid to a lighter fluid

MP5 scheme with RK3 using 1.6M cells

Contact discontinuity well resolved; evidence of fine scale instability structure

[0, 0.25] × [0, 1], the initial conditions are given as follows:

\[
\rho = 2,\ u = 0,\ v = -0.025\,\cos(8\pi x),\ P = 2y + 1 \quad \text{for } 0 \le y \le \tfrac{1}{2}
\]
\[
\rho = 1,\ u = 0,\ v = -0.025\,c\,\cos(8\pi x),\ P = y + \tfrac{3}{2} \quad \text{for } \tfrac{1}{2} \le y \le 1
\]

where c is the speed of sound. The top and bottom boundaries are set as reflecting and the left and right boundaries are periodic. As the flow progresses, the shear layer starts to develop and the Kelvin-Helmholtz instabilities become more evident. The gravity effect is taken into account by adding a source term vector which modifies the momentum vector and the energy of the flow based on the gravitational force. The source term in this case is relatively simple and contributes very little to the overall computational time. The performance of the fluid dynamics calculation is discussed in the next section of this report.



Page 23:

Cellular Detonation

Test setup:

Wall spark ignition (P = 40 atm; T = 1500 K) with a premixed stoichiometric H2-air mixture

Contact discontinuity initially disturbed in 2-D simulation

Maas and Warnatz4 H2-O2 reaction mechanism

Detonation Test Setup (schematic): a 30 cm closed tube with a spark region of length L at the left wall (T = 1500 K, P = 40 atm); the remainder of the tube is at T = 300 K, P = 1 atm.

• Premixed stoichiometric mixture of H2-air, 11 species and 38 reactions

• Closed ends

• Spark ignited (L = 0.25 cm)

4 Maas, U. and Warnatz, J. (1988). Combust. Flame 74, 53-69.

Page 24:

Cellular Detonation

Pressure and temperature evolution of flow field

Cellular structure developed due to flame front instability

Figure 8: Rayleigh-Taylor instability computed with the MP5 scheme on a 400 × 1600 grid

Figure 9: Evolution of the pressure and temperature in a 2D detonation simulation



Page 25:

Performance Results: Algorithm 1

How effective is global memory access?

Figure 11: Memory bandwidth of the kinetics solver measured on a Tesla C2050 (theoretical peak, non-coalesced, and coalesced access versus number of species)

Page 26:

Performance Results: Algorithm 1 vs. 2

Measurement of the performance of the kinetics solver for different species sizes. The grid size is varied due to the limitation of global memory.

Figure 12: Comparison of the speed-up factor obtained from the kinetics solver using both global and shared memory, for grid sizes from 32 × 32 to 512 × 512

Page 27:

Performance Results: CFD

ADERWENO shows substantial advantages over the MP5 due to single-step integration

Figure 8: Two dimensional simulation of a detonation wave

5.2 Performance Results

Figure 9 shows the performance of the solver for the simulation considering only the fluid dynamics aspect, using the MP5 and the ADERWENO schemes. The speed-ups obtained in both cases are very promising. Since the ADERWENO scheme only requires single-stage time integration, it is expected to be faster than the MP5 scheme. It can be seen in Figure 9 that the speed-up scales almost linearly with the number of computational cells. For the ADERWENO scheme, we can obtain almost 60 times speed-up for a large grid, which is about twice as fast as the MP5 scheme. All the comparisons are made between a Tesla C2070 and an Intel Xeon X5650 using double precision calculations.

Figure 9: Performance of the fluid dynamics simulation only (speed-up versus number of elements for the ADERWENO and MP5 schemes)

Figure 10 shows the performance of only the kinetics solver (i.e., no convection). The test is performed considering only solving the kinetics system. Although constructing the kinetics



Page 28:

Performance

Speed-up obtained for a larger mechanism (CH4-O2) is nearly 40 times

Figure 12: Performance of the reactive flow solver for two different chemistry mechanisms (9 species, 38 reactions and 36 species, 308 reactions); speed-up versus number of elements

6 Conclusion and Future Works

In the current paper, we show the implementation of a numerical solver for simulating chemically reacting flow on the GPU. The fluid dynamics is modeled using high-order finite volume schemes, and the chemical kinetics is solved using an implicit solver. Results of both the fluid dynamics and the chemical kinetics are shown. Considering only the fluid dynamics, we obtained speed-ups of 30 and 55 times compared to the CPU version for the MP5 and ADERWENO schemes, respectively. For the chemical kinetics, we present two different approaches to implementing the Gaussian elimination algorithm on the GPU. The best performance obtained by solving only the kinetics problem ranges from 30 to 40 depending on the size of the chemical mechanism. When the fluid dynamics is coupled with the kinetics, we obtained a speed-up factor of 40 times for a 9-species gas mixture with 38 reactions. The solver is also tested with a larger mechanism (36 species, 308 reactions), and the performance obtained is better than with the small mechanism.

The current work can be extended in different ways. First, since the framework performs well in a shared memory architecture, it is possible to extend it to a distributed memory architecture utilizing the Message Passing Interface (MPI). This extension permits using multiple GPUs, which is attractive for performing large-scale simulations. On the other hand, although the current simulation is done for chemically reacting flow, it is desirable to extend it to simulate ionized gas (i.e., plasma), which requires an additional modeling process (Collisional-Radiative) to model the changes in the different excited levels of the charged species. In addition, the governing equations also need to be extended to characterize the thermal non-equilibrium environment of the plasma. Given that the physics has been well established[2, 7, 4], the extension does not present a priori challenges.



Page 29:

Conclusion and Future Works

Accomplishments:

Basic CFD framework for fluid simulation with detailed chemical kinetics.

Performance obtained in both cases is very promising: up to 60 times speed-up for non-reacting flow and up to 40 times for reacting flow

Future Works:

Extension to Multi-GPU using MPI

Collisional-Radiative kinetics for partially ionized gas

MHD simulation for electromagnetic field effects


Page 30:

Acknowledgements and Questions

AFRL

Mr. David Bilyeu

Dr. Justin Koo

UCLA

Mr. Lord Cole

Prof. Ann Karagozian

Questions?

Hai [email protected]
