
LLNL-PRES-687782 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Burning on the GPU: Fast and Accurate Chemical Kinetics

GPU Technology Conference

Russell Whitesides April 7, 2016

Session 6195

Funded by: U.S. Department of Energy

Vehicle Technologies Program Program Manager: Gurpreet Singh & Leo Breton

Lawrence Livermore National Laboratory LLNL-PRES-687782 2

Why?

To make it go faster?


Why?

We burn a lot of gasoline.

•  Transportation efficiency
•  Chemistry is vital to predictive simulations
•  Chemistry can be > 90% of simulation time


National lab compute power and industry need.

Supercomputing @ DOE labs: Strong investment in GPUs with eye towards exascale

OEM engine designers:

Require fast turnaround with desktop-class hardware

Why?


“Colorful Fluid Dynamics”

[Figure: in-cylinder fields of O2 mass fraction (Y_O2) and temperature]

“Typical” engine simulation w/ detailed chemistry


Detailed Chemistry in Reacting Flow CFD:

Each cell is treated as an isolated system for chemistry.

Operator Splitting Technique: solve an independent set of ordinary differential equations (ODEs) in each cell to calculate the chemical source terms for the species and energy advection/diffusion equations.

[Diagram: chemistry in each cell advances independently from t to t+∆t]


CPU (un-coupled) chemistry integration

Each cell is treated as an isolated system for chemistry.



GPU (batched) chemistry integration

On the GPU we solve chemistry in batches of cells simultaneously.

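Batching many cells for a simultaneous GPU solve typically starts with packing the per-cell states into one contiguous struct-of-arrays buffer, so that one-thread-per-cell kernels read coalesced memory. This layout is a common convention for batched solvers, not necessarily the one used in this work; a minimal host-side sketch:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical struct-of-arrays packing for a batched GPU chemistry
// solve: variable k of cell i lands at batch[k * n_cells + i], so
// consecutive GPU threads (one per cell) would touch consecutive
// addresses when reading variable k.
std::vector<double> pack_batch(const std::vector<std::vector<double>>& cells) {
    const std::size_t n_cells = cells.size();
    const std::size_t n_vars = cells.empty() ? 0 : cells[0].size();
    std::vector<double> batch(n_cells * n_vars);
    for (std::size_t i = 0; i < n_cells; ++i)
        for (std::size_t k = 0; k < n_vars; ++k)
            batch[k * n_cells + i] = cells[i][k];
    return batch;
}
```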


Previously at GTC:

See also Whitesides & McNenly, GTC 2015; McNenly & Whitesides, GTC 2014


n_gpu = 0;

Note: most CFD simulations are done on distributed memory systems

[Diagram: 8 MPI ranks (rank0–rank7), each with one CPU core working on chemistry]


++n_gpu; //now what?

Note: most CFD simulations are done on distributed memory systems

[Diagram: the same 8 MPI ranks and CPU cores, now with a GPU available to share the chemistry work]


Here CPU is a single core.

Ideal CPU-GPU Work-sharing

S_GPU = walltime(CPU) / walltime(GPU)


Let’s make use of the whole machine.

Ideal CPU-GPU Work-sharing

§  # CPU cores = N_CPU

§  # GPU devices = N_GPU

S_total = (N_CPU + N_GPU (S_GPU − 1)) / N_CPU

[Plot: ideal S_total vs. N_GPU (1–4) for S_GPU = 8 at N_CPU = 4, 8, 16, and 32, with reference points for Titan (1.4375) and Surface (1.8750) node configurations]

S_GPU = walltime(CPU) / walltime(GPU)


Distribute based on number of cells and give more to GPU.

Good performance in a simple case with both CPU and GPU doing work

[Plot: chemistry time in seconds (log scale, 100–10000) vs. number of processors (1–16), comparing CPU chemistry with GPU chemistry under standard work sharing]
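A minimal version of "give more to the GPU" is a proportional split: if the GPU solves chemistry s_gpu times faster than a CPU core, handing it s_gpu / (s_gpu + 1) of the cells makes both finish at about the same time. This is the obvious rule, not necessarily the talk's exact policy:

```cpp
#include <cstddef>
#include <utility>

// Proportional work split for one rank that owns both a CPU core and a
// GPU: the GPU's share of cells is s_gpu / (s_gpu + 1). Illustrative only.
std::pair<std::size_t, std::size_t>  // {gpu_cells, cpu_cells}
split_cells(std::size_t n_cells, double s_gpu) {
    const std::size_t gpu_cells =
        static_cast<std::size_t>(n_cells * s_gpu / (s_gpu + 1.0));
    return {gpu_cells, n_cells - gpu_cells};
}
```

With 800 cells and s_gpu = 7, the GPU gets 700 cells and the core gets 100.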



[Plot: the same comparison with a third curve for GPU chemistry under custom work sharing]

S_GPU = 7, S_total = 1.7 (S_GPU = 6.6)


Let’s go!

First attempt @ engine calculation on GPU+CPU


What happened?

First attempt @ engine calculation on GPU+CPU

§  2x Xeon E5-2670 (16 cores) => 21.2 hours

§  2x Xeon E5-2670 + 2 Tesla K40m => 17.6 hours

§  S_total = 21.2/17.6 = 1.20 (S_GPU = 2.6)


Integrator performance when doing batch solution

If the systems are not similar, how much extra work needs to be done?


Batches of dissimilar reactors will suffer from excessive extra steps

What penalty do we pay when batching?




Batches of dissimilar reactors will suffer from excessive extra steps

Possibly a lot of extra steps.
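The penalty from batching dissimilar reactors can be modeled simply: a lockstep batch costs every member the step count of its stiffest reactor. The illustrative helper below counts those extra steps:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Rough model of the batching penalty: if the integrator advances the
// whole batch in lockstep, every reactor pays for the step count of the
// stiffest member. The "extra" steps are the gap between that worst case
// and what each reactor needed on its own. Illustrative only.
long extra_steps(const std::vector<long>& steps_needed) {
    if (steps_needed.empty()) return 0;
    const long worst =
        *std::max_element(steps_needed.begin(), steps_needed.end());
    const long own =
        std::accumulate(steps_needed.begin(), steps_needed.end(), 0L);
    return worst * static_cast<long>(steps_needed.size()) - own;
}
```

A batch needing {10, 100, 20} steps individually wastes 170 steps in lockstep; a batch of similar reactors wastes almost none, which motivates the sorting on the next slide.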


Sort reactors by how many steps they took to solve on the last CFD step

Easy as pie?

[Diagram: reactors sorted by n_steps, from >100 down to 1, split into batch0–batch3]
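The sorting idea can be sketched as: order reactors by the step count recorded on the previous CFD step, then slice the sorted list into fixed-size batches so each batch holds similarly stiff reactors. Names here are illustrative:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Sort reactors by the ODE step count they needed on the previous CFD
// step, stiffest first, then cut the sorted list into fixed-size batches
// so each batch groups reactors of similar cost.
std::vector<std::vector<int>> make_batches(
    std::vector<std::pair<int, int>> reactors,  // {reactor_id, last_n_steps}
    std::size_t batch_size) {
    std::sort(reactors.begin(), reactors.end(),
              [](const std::pair<int, int>& a, const std::pair<int, int>& b) {
                  return a.second > b.second;
              });
    std::vector<std::vector<int>> batches;
    for (std::size_t i = 0; i < reactors.size(); i += batch_size) {
        std::vector<int> batch;
        for (std::size_t j = i; j < std::min(i + batch_size, reactors.size()); ++j)
            batch.push_back(reactors[j].first);
        batches.push_back(batch);
    }
    return batches;
}
```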


Have to manage the sorting and load-balancing in distributed memory system

Not so fast.

[Diagram: 8 MPI ranks holding unevenly distributed chemistry work]


Load balance based on expected cost and expected performance.

MPI communication to re-balance for chemistry.

[Diagram: chemistry work exchanged among the 8 MPI ranks via MPI to balance the load]
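One way to realize "load balance based on expected cost and expected performance" is a greedy longest-processing-time heuristic: estimate each reactor's cost from its last solve, then repeatedly give the most expensive remaining reactor to the worker with the earliest predicted finish, weighting GPU workers by S_GPU. This standard heuristic stands in for whatever scheme the talk actually used:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Greedy rebalance: costs[i] is reactor i's expected cost (e.g. its last
// step count); worker_speed[w] is 1.0 for a CPU core, S_GPU for a GPU.
// Returns the worker index assigned to each reactor.
std::vector<std::size_t> assign_reactors(std::vector<double> costs,
                                         const std::vector<double>& worker_speed) {
    std::vector<std::size_t> order(costs.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return costs[a] > costs[b]; });
    std::vector<double> load(worker_speed.size(), 0.0);  // predicted finish time
    std::vector<std::size_t> owner(costs.size());
    for (std::size_t r : order) {
        std::size_t best = 0;
        for (std::size_t w = 1; w < load.size(); ++w)
            if (load[w] + costs[r] / worker_speed[w] <
                load[best] + costs[r] / worker_speed[best])
                best = w;
        load[best] += costs[r] / worker_speed[best];
        owner[r] = best;
    }
    return owner;
}
```

In a distributed-memory code this assignment would drive the MPI exchange of reactor states before the chemistry solve.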


Let’s go again!

Second attempt @ engine calculation on GPU+CPU


How much difference does it make?

Total steps significantly reduced by batching appropriately



Engine results with improved work-sharing and reactor sorting

[Bar chart: engine-case run times of 13.0 hrs, 9.1 hrs, and 7.6 hrs]

~40% reduction in chemistry time; ~36% reduction in overall time

S_total = 1.7, S_GPU = 6.6


§  Improve S_GPU
•  Derivative kernels
•  Matrix operations

§  Extrapolative integration methods
•  Less “startup” cost when re-initializing
•  Potentially well suited for GPU

§  Non-chemistry calculations on GPU
•  Multi-species transport
•  Particle spray

Future directions

Possibilities for significant further improvements.


§ Much improved CFD chemistry work-sharing with GPU

§ ~40% reduction in chemistry time for real engine case (~36% total time)

§ Working on further improvements

Summary

Thank you!
