
Evolving the GTC plasma PIC simulation for Mira

Bei Wang1, Stephane Ethier2, William Tang1,2

1Princeton Institute for Computational Science and Engineering, Princeton University

2Princeton Plasma Physics Laboratory, Princeton University

Mira Community Conference, Argonne Leadership Computing Facility, Chicago

Mar 6, 2013

Outline

• Gyrokinetic Particle in Cell Method for Fusion Plasma Simulation
  ∗ Gyrokinetic Vlasov-Poisson Equations
  ∗ Particle in Cell Method

• Ultrafast Gyrokinetic Particle Simulations for ITER Size Plasma on BG/Q
  ∗ Gyrokinetic Toroidal Code (GTC)
  ∗ Porting GTC from BG/P to BG/Q
  ∗ Performance Benchmark on Mira
  ∗ Early Science Results

Collaborators

• Tim Williams, Catalyst at the Argonne Leadership Computing Facility

• Khaled Ibrahim, Kamesh Madduri (now at Penn State University), Sam Williams, and Leonid Oliker of the Future Technologies Group at Lawrence Berkeley National Laboratory

The Challenge: Global Simulation of ITER

• ITER is extremely large compared to current experiments

• We need advanced computational methods, mathematical models, and high-performance computing to predict its performance

• This demands very large simulations

Gyrokinetic Simulations: for studying turbulent transport

• Macroscopic Stability

What limits the pressure in plasmas?

• Wave-particle interactions

How do particles and plasma waves interact?

• Microturbulence and Transport

What causes plasma transport?

• Plasma-material Interactions

How can high-temperature plasma and material surfaces co-exist?

Gyrokinetic Vlasov-Poisson Equation

We consider kinetic ions and adiabatic electrons. The dynamics of the ions is described by the 5D gyrophase-averaged Vlasov equation in toroidal geometry.

• gyro-motion: guiding-center drifts + charged ring

• suppress plasma oscillation and motion

• larger time step and grid size

$$\frac{df}{dt} = \frac{\partial f}{\partial t} + \frac{d\mathbf{R}}{dt}\cdot\frac{\partial f}{\partial\mathbf{R}} + \frac{dv_\parallel}{dt}\,\frac{\partial f}{\partial v_\parallel} = 0$$

$$\frac{d\mathbf{R}}{dt} = v_\parallel\hat{b} + \mathbf{v}_d - \frac{\partial\bar\phi}{\partial\mathbf{R}}\times\hat{b}$$

$$\frac{dv_\parallel}{dt} = -\mathbf{b}^*\cdot\left(\frac{v_\perp^2}{2}\,\frac{\partial\ln B}{\partial\mathbf{R}} + \frac{\partial\bar\phi}{\partial\mathbf{R}}\right)$$

$$\mathbf{b}^* = \hat{b} + v_\parallel\,\hat{b}\times\left(\hat{b}\cdot\frac{\partial}{\partial\mathbf{R}}\right)\hat{b}$$

$$\nabla^2\phi + \frac{\tau}{\lambda_D^2}\,(\phi - \tilde\phi) = 4\pi e\,(\bar{\delta n} - \delta n_e)$$

$$\bar{\delta n}(\mathbf{x}) = \frac{1}{2\pi}\int\big(f(\mathbf{R}) - f_{eq}(\mathbf{R})\big)\,\delta(\mathbf{R} - \mathbf{x} - \boldsymbol\rho)\,d\mathbf{R}\,d\mu\,dv_\parallel\,d\varphi
$$

Particle in cell methods

• In particle methods, we approximate the distribution function by a collection of finite-size particles:

$$f(\mathbf{x},\mathbf{v},t) \approx \sum_k q_k\,\delta_{\varepsilon_x}\!\big(\mathbf{x}-\mathbf{X}_k(t)\big)\,\delta_{\varepsilon_v}\!\big(\mathbf{v}-\mathbf{V}_k(t)\big),\qquad \int_{-\infty}^{\infty}\delta_\varepsilon(y)\,dy = 1,\qquad \delta_\varepsilon(y) = \frac{1}{\varepsilon}\,u\!\left(\frac{y}{\varepsilon}\right)$$

[Plot: the shape functions u0(y) and u1(y) versus y]

• At each time step, particles are transported along trajectories described by the equations of motion:

$$\frac{dq_k}{dt} = 0,\qquad \frac{d\mathbf{X}_k}{dt} = \mathbf{V}_k,\qquad \frac{d\mathbf{V}_k}{dt} = \mathbf{F}_k$$
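As a concrete illustration (a minimal, hypothetical C sketch in one Cartesian dimension; the real GTC push advances gyro-centers in toroidal geometry using the drift equations above):

```c
/* Minimal transport-step sketch (hypothetical, 1D Cartesian).
 * Integrates dX_k/dt = V_k and dV_k/dt = F_k with the weight
 * q_k held fixed (dq_k/dt = 0), as in the equations above. */
typedef struct { double x, v, q; } particle_t;

void push(particle_t *p, int n, double dt, double (*force)(double)) {
    for (int k = 0; k < n; k++) {
        p[k].v += dt * force(p[k].x);  /* dV_k/dt = F_k */
        p[k].x += dt * p[k].v;         /* dX_k/dt = V_k */
        /* q is untouched: the particle weight is conserved */
    }
}
```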

• The long-range forces are usually solved on a grid:

$$\rho_j = \sum_k \frac{q_k}{\varepsilon_x}\,u_1\!\left(\frac{x_j - X_k}{\varepsilon_x}\right),\qquad \Delta\phi_j = -\rho_j,\qquad E_j = -\nabla\phi_j,\qquad E(X_k) = \sum_j E_j\,u_1\!\left(\frac{x_j - X_k}{\varepsilon_x}\right)$$
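A minimal 1D sketch of the resulting scatter (deposition) and gather (interpolation) loops, assuming a uniform grid of spacing ε_x, particles inside the domain, and the linear shape function u1; the names are illustrative, not GTC's actual routines:

```c
/* 1D linear-weighting scatter/gather sketch (illustrative only).
 * Assumes 0 <= X < (ng - 1) * eps_x so j and j+1 stay in bounds. */
void scatter(const double *X, const double *q, int np,
             double *rho, double eps_x) {
    for (int k = 0; k < np; k++) {
        double s = X[k] / eps_x;
        int j = (int)s;          /* nearest grid point to the left */
        double w = s - j;        /* fractional distance: u1 weight */
        rho[j]     += q[k] / eps_x * (1.0 - w);
        rho[j + 1] += q[k] / eps_x * w;
    }
}

double gather(double Xk, const double *E, double eps_x) {
    double s = Xk / eps_x;
    int j = (int)s;
    double w = s - j;
    return E[j] * (1.0 - w) + E[j + 1] * w;   /* E(X_k) */
}
```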

Charge assignment and field interpolation in the GK PIC method

$$\rho(x_j) = \sum_k\sum_{p=1}^{4}\frac{q_k}{4\,\varepsilon_x}\,u_1\!\left(\frac{x_j - X_p}{\varepsilon_x}\right)\delta(\mathbf{R}_k - \mathbf{X}_p + \boldsymbol\rho_k)$$

$$E(\mathbf{R}_k) = \sum_{p=1}^{4}\sum_j E_j\,u_1\!\left(\frac{x_j - X_p}{\varepsilon_x}\right)\delta(\mathbf{R}_k - \mathbf{X}_p + \boldsymbol\rho_k)$$

[Figure: Classic PIC vs. 4-Point Average GK (W. W. Lee)]

Charge Deposition Step (SCATTER operation)
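The 4-point average can be sketched as follows (again a hypothetical, simplified 1D projection; GTC deposits onto a 3D toroidal mesh): each gyro-center R_k spawns four points on its Larmor ring of radius ρ_k, and each point carries q_k/4:

```c
#include <math.h>

/* 4-point gyroaveraged deposition sketch (illustrative, 1D).
 * The four ring points project onto R_k - rho_k, R_k (twice),
 * and R_k + rho_k; each deposits q_k/4 with linear weighting. */
void scatter_gk(const double *R, const double *rho_k, const double *q,
                int np, double *grid, double eps_x) {
    for (int k = 0; k < np; k++) {
        for (int p = 0; p < 4; p++) {
            double Xp = R[k] + rho_k[k] * cos(0.5 * M_PI * p);
            double s = Xp / eps_x;
            int j = (int)s;
            double w = s - j;
            grid[j]     += q[k] / (4.0 * eps_x) * (1.0 - w);
            grid[j + 1] += q[k] / (4.0 * eps_x) * w;
        }
    }
}
```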

GTC

Gyrokinetic Toroidal Code (GTC, Z. Lin, Science 1998) for magnetically confined plasma simulations

• Solve the 5D gyrokinetic Vlasov-Poisson equation using PIC methods

• 3D toroidal mesh

• Two data structures: grid-based arrays and particle-based arrays

• Six subroutines

[Flowchart: each time step Δt cycles through the six subroutines Charge → Poisson → Smooth → Field → Push → Shift]

GTC Parallelization

• Written in Fortran 90

• 3 levels of parallelism in original GTC:

∗ 1D domain decomposition in the toroidal dimension
∗ particle decomposition within each toroidal domain
∗ loop-level parallelization with OpenMP

• Massive grid memory footprint: difficult to simulate large-scale plasmas such as ITER

Problem size        Grid size           Particles in one toroidal domain
A (a/ρ = 125)       32,449 (1M)         3,235,896 (0.3G)
B (a/ρ = 250)       128,893 (4M)        12,943,584 (1.2G)
C (a/ρ = 500)       513,785 (16M)       51,774,336 (4.8G)
D (a/ρ = 1000)      2,051,567 (64M)     207,097,344 (19.2G)

(using 100 particles per cell)

• Introduce a key additional level of domain decomposition in the radial dimension, which is essential for efficiently carrying out ITER-size simulations (Adams, Ethier 08); the resulting code is called GTC-P (see the sketch below)

• Use PETSc 2.3.3 for Poisson solver
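To illustrate the combined toroidal-plus-radial decomposition (an MPI sketch under assumed parameters, not GTC-P's actual code), ranks can be arranged as a toroidal × radial process grid, with sub-communicators for the radial solve and for the toroidal particle shift:

```c
#include <mpi.h>

/* Sketch: split ranks over ntoroidal x nradial domains so each
 * process owns only its radial slice of the poloidal grid. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int ntoroidal = 64;               /* assumed partition  */
    int my_toroidal = rank % ntoroidal;     /* toroidal domain id */
    int my_radial   = rank / ntoroidal;     /* radial domain id   */

    /* One communicator per toroidal section (radial neighbors)
     * and one per radial shell (toroidal neighbors for shift). */
    MPI_Comm radial_comm, toroidal_comm;
    MPI_Comm_split(MPI_COMM_WORLD, my_toroidal, my_radial, &radial_comm);
    MPI_Comm_split(MPI_COMM_WORLD, my_radial, my_toroidal, &toroidal_comm);

    MPI_Comm_free(&radial_comm);
    MPI_Comm_free(&toroidal_comm);
    MPI_Finalize();
    return 0;
}
```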

Porting GTC-P from BG/P to BG/Q

• Changed the size of some OpenMP-related arrays from 4 to 64

• Ported the PETSc 2.3.3 version to the BG/Q system −→ thanks to our Catalyst, Tim Williams

• Compiler flags: -O3 -qnohot -qsmp=omp:noauto

[Plot: Weak scaling study of A, B, C, D size plasmas on Mira and Intrepid; wall-clock time vs. number of cores (512-32,768). Intrepid: 4 MPI ranks/node, 1 thread/core; Mira: 16 MPI ranks/node, 4 threads/core.]

∗ Q/P performance ratio per core is 2.02-2.79

∗ Q/P performance ratio per node is 8.08-11.16

Strong Scaling Study of GTC-P on BG/Q

[Plot: Strong scaling of GTC-P on BG/Q; wall-clock time (100 time steps) vs. number of nodes (2-2048) with a 4-way partition in the toroidal dimension; measured vs. ideal curves for A, B, C, D; marked efficiencies range from 16% to 35%.]

Achieved only 16%-35% of maximal parallel efficiency (strong scaling):

• The Poisson solver doesn't scale with the number of threads

• Small number of iterations in the outer-level loops of the smooth and field subroutines

• Atomic operations in the charge subroutine

• Memory copy without multithreading in the shift subroutine


Improve single-node performance with multicore and manycore optimizations (in collaboration with the Future Technologies Group at LBL)

• The whole code is converted to C

• A structure-of-arrays (SoA) data structure for the particle array; aligned memory allocation to facilitate use of SIMD intrinsics

• Loop fusion to increase computational intensity

• Particle binning to improve locality (Madduri et al. SC11)

• A multilevel binning algorithm to improve locality and reduce data conflicts: bin the gyro-centers of the particles periodically; bin the four points of the gyro-particles at every time step (Wang et al. SC12 poster)

• Atomic operations are avoided by providing a copy of the local grid for each thread (see the sketch after this list)

• Use a hand-coded iterative solver ⇒ Enable multithreading

• Flatten some 2D iterations to 1D
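A sketch combining two of these ideas, the SoA particle layout and per-thread grid replicas in place of atomic updates (illustrative names and 1D weighting; the actual GTC-P kernels also use the binning described above):

```c
#include <omp.h>
#include <stdlib.h>

/* Structure-of-arrays particle layout: one contiguous, alignable
 * array per attribute instead of an array of particle structs. */
typedef struct { double *x, *q; int n; } particles_soa;

/* Scatter without atomics: each thread deposits into a private
 * grid replica; replicas are then reduced into the shared grid. */
void scatter_private(const particles_soa *p, double *grid,
                     int ng, double eps_x) {
    int nt = omp_get_max_threads();
    double *replicas = calloc((size_t)nt * ng, sizeof(double));

    #pragma omp parallel
    {
        double *mine = replicas + (size_t)omp_get_thread_num() * ng;
        #pragma omp for
        for (int k = 0; k < p->n; k++) {
            double s = p->x[k] / eps_x;
            int j = (int)s;
            double w = s - j;
            mine[j]     += p->q[k] / eps_x * (1.0 - w);
            mine[j + 1] += p->q[k] / eps_x * w;
        }
        #pragma omp for          /* race-free reduction over replicas */
        for (int j = 0; j < ng; j++)
            for (int t = 0; t < nt; t++)
                grid[j] += replicas[(size_t)t * ng + j];
    }
    free(replicas);
}
```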

Scaling Study of GTC-P C on BG/Q

[Plot: Strong scaling of GTC-P C on BG/Q; wall-clock time (100 time steps) vs. number of nodes (32-8192) with a 64-way partition in the toroidal dimension; measured vs. ideal curves for A, B, C; marked efficiencies of 64%, 73%, and 74%.]

[Plot: Weak scaling study for A to D size devices on BG/Q; wall-clock time for 100 time steps with 100 ppc vs. number of nodes (512-32,768); GTC-P FORTRAN vs. GTC-P C.]

• For the weak scaling plot, we use one MPI rank per node with 64 OpenMP threads.

• Achieved 64%-73% of maximal parallel efficiency (strong scaling)

• Obtained a 4-5x speedup compared with the original FORTRAN code

Compute power of GTC-P on BG/P and BG/Q

[Plot: Particle scaling study on the BG/P and BG/Q systems; compute power (millions of particles per second per step, 10^1-10^5) vs. number of nodes (512-32,768) for problem sizes A, B, C, and D (ITER). Curves: GTC-P FORTRAN on Intrepid, GTC-P FORTRAN on Mira, GTC-P C on Mira, GTC-P C on Mira (fine tuning).]

• red → blue: BG/P to BG/Q
• blue → green: multicore optimizations
• green → purple: compiler flags and process mapping (BG_SMP_FAST_WAKEUP=YES, OMP_WAIT_POLICY=active, RUNJOB_MAPPING)

• The “time to solution” improves 50x compared with the GTC-P FORTRAN code on BG/P with the same number of nodes

Study ITG-driven turbulence spreading with the ultrafast GTC-P C code

[Plot: ion heat diffusivity χ_i (c_s ρ_s²/a) vs. time (a/c_s); w/o heat bath, with paranl, rw=0.35, with 100-300 ppc; curves for A (a125), B (a250), a375, C (a500), a750 with 300 ppc, and D (a1000) with 300 ppc.]

• 2x longer, with 7.5x more particles than the previous run

• Now we can simulate an ITER-size plasma with 300 ppc (40 billion particles in total) for 30k time steps in under 4 hours using 2/3 of Mira
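As a rough consistency check (assuming Mira's full complement of 49,152 nodes, so 2/3 corresponds to the 32,768 nodes / 524,288 cores quoted in the conclusion), the implied throughput is

$$4\times10^{10}\ \text{particles}\times 3\times10^{4}\ \text{steps} \,/\, 1.44\times10^{4}\ \text{s} \approx 8.3\times10^{10}\ \text{particle-steps per second},$$

in line with the roughly $10^{11}$ particles per second per step at the top of the compute-power plot.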

Conclusion

• Full application multicore and manycore optimizations

• Introduce radial domain decomposition

• Excellent scaling on BG/Q up to 524,288 cores

• Simulate an ITER-size plasma with 300 ppc (40 billion particles in total) for 30k time steps in under 4 hours using 2/3 of Mira

• 50x “time to solution” speedup compared with GTC-P FORTRAN on the BG/P system

Current and future work

• Study turbulence spreading and size scaling up to ITER size with high-resolution simulations

• Study phase-space remapping for long-time simulations with the gyrokinetic δf particle-in-cell method

• Develop full-f capability in the code

Thank you! Questions?