
Accelerating Gyrokinetic Tokamak Simulation (GTS) Code using OpenACC

M.G. Yoo, C.H. Ma, S. Ethier, Jin Chen, W.X. Wang, E. Startsev

Princeton Plasma Physics Laboratory, Princeton, U.S.A.

OpenACC Summit 2020

Sep. 01. 2020

Overview of porting GTS to GPU

2

GTS (Gyrokinetic Tokamak Simulation)

• A global gyrokinetic particle simulation code for micro-turbulence studies in tokamaks

• Particle-In-Cell algorithm (particles + grid-based field solve)

• Recently upgraded for physics studies associated with thermal-quench transport

Porting to GPU machine

• The code was ported to GPU with OpenACC directives while keeping compatibility with CPU machines

- Significant speed-up (>20x) for the particle parts

- The field-solve part (Poisson equation) will be ported to GPU via libraries (e.g., PETSc, Hypre, AMGx)

• GTS is now running production simulations on Traverse with significant acceleration, making efficient use of the Traverse computational resources

• Attended the GPU Hackathon in 2019 at Princeton University (Mentor: Rueben Budiardja (ORNL))

3

GTS (Gyrokinetic Tokamak Simulation)

❑ A global gyrokinetic particle simulation code to study the micro-turbulence physics of fusion plasmas in tokamaks

- Mainly written in Fortran, partly in C

- Parallelized using MPI + OpenMP (previously); now using MPI + OpenACC

- A δf particle-in-cell code in 3-dimensional curvilinear coordinates

Gyrokinetic Equation

4

▪ The gyrokinetic equation describes the particle distribution in 5-dimensional phase space (a commonly written form is sketched below)

- $a$ (or $\rho$): radial coordinate (a flux-surface label)

- $\theta$ and $\phi$: poloidal and toroidal angles

- $v_\parallel$: parallel velocity

- $\mu = m_s v_\perp^2 / 2B$: magnetic moment

- $B^* = B + \frac{m_s v_\parallel}{e_s}\,\mathbf{b}\cdot\nabla\times\mathbf{b}$

- $f_s$: gyro-center distribution function

[G. Hunt, Ph.D. Thesis, University of Leicester, 2016]
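For orientation, a commonly written form of the collisionless gyrokinetic equation in these variables is sketched below. This is only a schematic; the exact form evolved in GTS (including collision terms) is not reproduced in this transcript.

$$\frac{\partial f_s}{\partial t} + \dot{\mathbf{X}}\cdot\nabla f_s + \dot{v}_\parallel\,\frac{\partial f_s}{\partial v_\parallel} = 0, \qquad \dot{\mu} = 0,$$

where $\dot{\mathbf{X}}$ and $\dot{v}_\parallel$ are the gyro-center equations of motion given on the Particle-In-Cell slide below.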

Gyrokinetic Poisson Equation

5


▪ Electron and Ion densities from the distribution function

▪ Quasi-neutrality and gyrokinetic Poisson equation

[Dubin et al., Phys. Fluids 26, 3524 (1983)]

Particle-In-Cell Method

6

▪ Equation of motion

$$\dot{\mathbf{X}} = \mathbf{v} = \frac{1}{\mathbf{B}_0\cdot(\mathbf{B}_0^* + \delta\mathbf{B})}\left[\frac{1}{q_s}\frac{\partial H_0}{\partial \rho_\parallel}\,(\mathbf{B}_0^* + \delta\mathbf{B}) + \frac{1}{q_s}\,\mathbf{B}_0\times\nabla H_0\right]$$

$$\frac{d\rho_\parallel}{dt} = \frac{\mathbf{B}_0^* + \delta\mathbf{B}}{\mathbf{B}_0\cdot(\mathbf{B}_0^* + \delta\mathbf{B})}\cdot\left(-\frac{1}{q_s}\nabla H_0\right)$$

with

$$\frac{1}{q_s}\frac{\partial H_0}{\partial \rho_\parallel} = \mathbf{B}_0\cdot\mathbf{v} = v_\parallel^0, \qquad \nabla H_0 = \left(\frac{m_s (v_\parallel^0)^2}{B_0} + \frac{\mu}{q}\right)\nabla B_0 + \nabla\bar{\Phi}, \qquad \rho_\parallel = \frac{m_s v_\parallel^0}{q_s B_0}$$

▪ δf weight evolution equation: with $f_{\mathrm{tot}} = f_0 + \delta f$ and

$$\frac{d f_{\mathrm{tot}}}{dt} = \frac{\partial f_{\mathrm{tot}}}{\partial t} + \dot{\mathbf{Z}}\cdot\nabla_{\mathbf{Z}} f_{\mathrm{tot}} = 0,$$

the perturbed part evolves as

$$\frac{d\,\delta f}{dt} = -\frac{d f_0}{dt} = -\dot{\mathbf{Z}}\cdot\nabla_{\mathbf{Z}} f_0 = -\left(\dot{\mathbf{Z}}_0 + \dot{\mathbf{Z}}_1\right)\cdot\nabla_{\mathbf{Z}} f_0$$

▪ The distribution function $f_s$ is represented by marker particles:

$$f_s \approx \sum_{i=1}^{N_M} w_i\, \frac{\delta(x - x_i)\,\delta(v - v_i)}{J(x_i)}$$

Particle-In-Cell Method

7

▪ One Particle-In-Cell cycle per time step (a minimal sketch in Fortran follows the figure note below):

1. Charge deposition to grid nodes (particles to grid)

2. Update potential & electric fields (grid only)

3. Assign fields to particles (grid to particles)

4. Move particles to new positions (particles only)

[Figure: five-point stencil of potential values $\phi^{n+1}_{i,j}$, $\phi^{n+1}_{i\pm 1,j}$, $\phi^{n+1}_{i,j\pm 1}$ on neighboring grid nodes, illustrating the grid-based field update.]
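The cycle above can be summarized in code. Below is a minimal, self-contained Fortran sketch of one PIC time loop. It is illustrative only: the 1-D geometry, nearest-grid-point deposition, and the trivial placeholder "field solve" are assumptions made for brevity and are not the GTS implementation.

```fortran
! Minimal 1-D electrostatic PIC cycle sketch (NOT the GTS implementation).
program pic_cycle_sketch
  implicit none
  integer, parameter :: np = 1000, ng = 64, nsteps = 10
  real :: x(np), v(np), e_part(np)    ! particle positions, velocities, field at particles
  real :: rho(ng), efield(ng)         ! grid charge density and electric field
  real :: dt = 0.1, lx = 1.0, dx
  integer :: istep, i, j

  dx = lx / ng
  call random_number(x); x = x * lx   ! random initial positions
  v = 0.0

  do istep = 1, nsteps
     ! (1) charge deposition: particles -> grid (nearest grid point)
     rho = 0.0
     do i = 1, np
        j = modulo(int(x(i)/dx), ng) + 1
        rho(j) = rho(j) + 1.0
     end do
     ! (2) field update on the grid (placeholder for the real Poisson solve)
     efield = rho - sum(rho)/ng
     ! (3) gather: grid -> particles
     do i = 1, np
        j = modulo(int(x(i)/dx), ng) + 1
        e_part(i) = efield(j)
     end do
     ! (4) push: advance particle velocities and positions
     do i = 1, np
        v(i) = v(i) + e_part(i)*dt
        x(i) = modulo(x(i) + v(i)*dt, lx)
     end do
  end do
  print *, 'done; mean |v| =', sum(abs(v))/np
end program pic_cycle_sketch
```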

Acceleration of Particle Parts by OpenACC

8

▪ Assigning fields to particles (Grid to Particles)

➢ Particle arrays are global variables and are created on the device side to reduce communication time

➢ Particles are independent of each other, so the kernel is a single-level loop over particles

➢ A simple `acc parallel` directive is very efficient for a huge number of marker particles ($N_M > 10^6$); a sketch of this pattern follows below

▪ Particle parts are easily and efficiently parallelized by OpenACC
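The sketch below shows the gather (grid to particles) pattern with a single `!$acc parallel loop`. Array names and the linear interpolation are assumptions, not the actual GTS kernel, and the device-resident global arrays used in GTS are simplified here to a structured `!$acc data` region.

```fortran
! Illustrative "grid to particles" gather with OpenACC (not the GTS code).
program gather_sketch
  implicit none
  integer, parameter :: nm = 2000000, ng = 45137
  real, allocatable :: xp(:), ep(:)   ! particle positions and gathered field
  real, allocatable :: efield(:)      ! field on grid nodes
  real :: dx
  integer :: i, j

  allocate(xp(nm), ep(nm), efield(ng))
  call random_number(xp)
  call random_number(efield)
  dx = 1.0 / real(ng - 1)

  ! Keep arrays on the device for the whole region to avoid host<->device traffic;
  ! one simple parallel loop over independent particles.
  !$acc data copyin(xp, efield) copyout(ep)
  !$acc parallel loop private(j)
  do i = 1, nm
     j = min(int(xp(i)/dx) + 1, ng - 1)                                  ! cell index
     ep(i) = efield(j) + (efield(j+1) - efield(j)) * (xp(i)/dx - real(j-1))  ! linear interpolation
  end do
  !$acc end parallel loop
  !$acc end data

  print *, 'gathered field on first particle:', ep(1)
end program gather_sketch
```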

Acceleration of Particle Parts by OpenACC

9

▪ Charge deposition to Grid Nodes (Particles to Grid); a deposition sketch using atomic updates follows below

▪ Move particles to new positions (Particles only)
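Unlike the gather, the scatter (particles to grid) has write conflicts when several particles deposit onto the same grid node. A common OpenACC pattern protects these updates with `!$acc atomic update`; the sketch below is illustrative (array names, linear weighting, and the atomic strategy are assumptions, not necessarily how GTS implements it).

```fortran
! Illustrative "particles to grid" deposition with OpenACC atomics (not the GTS code).
program deposit_sketch
  implicit none
  integer, parameter :: nm = 2000000, ng = 45137
  real, allocatable :: xp(:), w(:)    ! particle positions and delta-f weights
  real, allocatable :: rho(:)         ! charge density on grid nodes
  real :: dx, s
  integer :: i, j

  allocate(xp(nm), w(nm), rho(ng))
  call random_number(xp)
  call random_number(w)
  dx = 1.0 / real(ng - 1)
  rho = 0.0

  !$acc data copyin(xp, w) copy(rho)
  !$acc parallel loop private(j, s)
  do i = 1, nm
     j = min(int(xp(i)/dx) + 1, ng - 1)
     s = xp(i)/dx - real(j-1)          ! linear weight within the cell
     ! concurrent particles may hit the same node: protect updates with atomics
     !$acc atomic update
     rho(j)   = rho(j)   + w(i)*(1.0 - s)
     !$acc atomic update
     rho(j+1) = rho(j+1) + w(i)*s
  end do
  !$acc end parallel loop
  !$acc end data

  print *, 'total deposited charge:', sum(rho)
end program deposit_sketch
```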

Gyrokinetic Poisson Equation

10

$$-\nabla\cdot\left\{\left[\epsilon_0\,\bar{\bar{\mathbf{g}}} + \sum_s \frac{n_s m_s}{B^2}\left(\bar{\bar{\mathbf{g}}} - \frac{\mathbf{B}\mathbf{B}}{B^2}\right)\right]\cdot\nabla\Phi\right\} = e\left(\delta\bar{n}_i - \delta n_e\right)$$

▪ Original approach: coupled (2D+1D) field equations on an unstructured mesh (FEM), solved iteratively

▪ New approach: a direct 3D field equation on a structured mesh (FDM)

▪ An Algebraic Multigrid solver is used to invert the linearized system $\mathbf{A}\mathbf{x} = \mathbf{b}$

- BoomerAMG method from the Hypre library, accessed through the PETSc interface (illustrative options below)

- CPU version vs cuda version
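For reference, BoomerAMG is typically selected through the PETSc interface with run-time options of the following kind. These are illustrative; the exact options used for GTS are not given in the talk.

    -ksp_type cg  -pc_type hypre  -pc_hypre_type boomeramg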

Traverse Cluster at Princeton University

11

Traverse consists of:

• 46 IBM AC922 Power 9 nodes, with each node having

– 2 IBM Power 9 processors (sockets)

• 16 cores per processor

• 4 hardware threads per core

– 32 cores per node

– 256 GB of RAM per node

– 4 NVIDIA V100 GPUs (2 per socket) with 32GB of memory each

– 3.2TB NVMe (solid state) local storage (not shared between nodes)

– EDR InfiniBand, 1:1 per rack, 2:1 rack to rack interconnect

– GPFS high performance parallel scratch storage: 2.9PB raw

– Globus transfer node (10 GbE external, EDR to storage)

• InfiniBand Network

– EDR InfiniBand (100 Gb/s)

– Fully non-blocking (1:1) within a chassis, 2:1 oversubscription between chassis.


Elapsed Time

12

[Bar chart: elapsed time (log scale, seconds) for four configurations: 4 CPU (mz4np1), 32 CPU (mz4np8), 4 CPU + 4 GPU (mz4np1), 32 CPU + 4 GPU (mz4np8). Case: micell=50, mpsi=100, MGRID=45,137, MISUM=9,007,200. Lower is better.]

Speed-Up Factor

13

[Bar chart: speed-up factors for 4 CPU (mz4np1), 32 CPU (mz4np8), 4 CPU + 4 GPU (mz4np1), 32 CPU + 4 GPU (mz4np8); annotated values x18.8, x6.2, and x5.5. Case: micell=50, mpsi=100, MGRID=45,137, MISUM=9,007,200. Higher is better.]

Full Power of 1 node on Traverse

14

Fraction of elapsed time per routine (pie-chart data):

Routine          32 CPU    32 CPU + 4 GPU
push_ion           11%          2%
shifti              1%          1%
charge_ion         12%          4%
poisson            24%         66%
smooth              0%          1%
field               1%          2%
collision_ion       8%          2%
collision_eon       7%          2%
push_eon           22%          5%
shifte              2%          4%
charge_eon          9%          3%
check_tracers       2%          6%
snapshot            0%          0%
diagnostics         1%          2%

Performance Comparison (original Poisson vs new 3D Poisson)

15

[Bar chart: elapsed time per step for the original Poisson solver with Tol=1e-3, the original Poisson solver with Tol=1e-5, and the new 3D Poisson solver.]

Breakdown of Elapsed Time

16

Per-routine breakdown of elapsed time (pie-chart data) for three Poisson configurations: Ori. Poisson (Tol=1e-3), Ori. Poisson (Tol=1e-5), and New Poisson 3D. One of the charts is fully labeled:

push_ion 3%, shifti 2%, charge_ion 4%, poisson 44%, smooth 1%, field 3%, collision_ion 3%, collision_eon 4%, push_eon 8%, shifte 7%, charge_eon 4%, check_tracers 2%, snapshot 10%, diagnostics 5%

[Two further pie charts show the corresponding per-routine breakdowns for the other two configurations; their dominant slices are 71% and 29%.]

Algebraic Multigrid Solver: (HYPRE) vs (GAMG)

17

[Bar chart: Poisson time per istep for HYPRE, HYPRE cuda, GAMG, and GAMG cuda. Case: micell=50, mpsi=100, MGRID=45,137, MISUM=9,007,200; 1 node, 4 MPI ranks, mzetamax=4, npartdom=1, 1 CPU / 1 GPU.]
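For reference, the GAMG preconditioner and the GPU backends compared here are typically toggled through PETSc run-time options of this kind (illustrative; the exact options used for GTS are not given in the talk):

    -pc_type gamg                         (PETSc GAMG instead of Hypre BoomerAMG)
    -vec_type cuda -mat_type aijcusparse  (move vectors/matrices to the CUDA backend)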

HYPRE vs GAMG in Production Run case

18

[Bar chart: Poisson time (s) for Pure CPU + Hypre, GPU + Hypre, GPU + cuda Hypre, GPU + GAMG, and GPU + cuda GAMG. Production case: micell=100, mpsi=150, MGRID=113,185*16, MISUM=180,854,400; 4 nodes, 128 MPI ranks, mzetamax=16, npartdom=8, 4 CPU / 1 GPU.]

Summary

19

GTS (Gyrokinetic Tokamak Simulation)

• A global gyrokinetic particle simulation code for micro-turbulence studies in tokamaks

• Particle-In-Cell algorithm (particles + grid-based field solve)

• Recently upgraded for physics studies associated with thermal-quench transport

Porting to GPU machine

• The code was ported to GPU with OpenACC directives while keeping compatibility with CPU machines

- Significant speed-up (>20x) for the particle parts

- The field-solve part (Poisson equation) will be ported to GPU via libraries (e.g., PETSc, Hypre, AMGx)

• GTS is now running production simulations on Traverse with significant acceleration, making efficient use of the Traverse computational resources

• Attended the GPU Hackathon in 2019 at Princeton University (Mentor: Rueben Budiardja (ORNL))