Evolving the GTC plasma PIC simulation for Mira
Bei Wang1 Stephane Ethier2 William Tang1,2
1Princeton Institute for Computational Science and Engineering, Princeton University
2Princeton Plasma Physics Laboratory, Princeton University
Mira Community Conference, Argonne Leadership Computing Facility, Chicago
Mar 6, 2013
Outline
• Gyrokinetic Particle-in-Cell Method for Fusion Plasma Simulation
∗ Gyrokinetic Vlasov-Poisson Equations
∗ Particle-in-Cell Method
• Ultrafast Gyrokinetic Particle Simulations for ITER-Size Plasma on BG/Q
∗ Gyrokinetic Toroidal Code (GTC)
∗ Porting GTC from BG/P to BG/Q
∗ Performance Benchmark on Mira
∗ Early Science Results
Collaborators
• Tim Williams, Catalyst at the Argonne Leadership Computing Facility
• Khaled Ibrahim, Kamesh Madduri (now at Penn State University), Sam Williams, Leonid Oliker of the Future Technologies Group at Lawrence Berkeley National Laboratory
The Challenge: Global Simulation of ITER
• ITER is extremely large compared to current experiments
• We need advanced computational methods, mathematical models, and high performance computing to predict its performance
• Very large simulation
Gyrokinetic Simulations: for studying turbulent transport
• Macroscopic Stability
What limits the pressure in plasmas?
• Wave-particle Interactions
How do particles and plasma waves interact?
• Microturbulence and Transport
What causes plasma transport?
• Plasma-material Interactions
How can high-temperature plasma and material surfaces co-exist?
Gyrokinetic Vlasov-Poisson Equation
We consider kinetic ions and adiabatic electrons. The dynamics of the ions is described by the 5D gyrophase-averaged Vlasov equation in toroidal geometry.
• gyro-motion: guiding-center drifts + charged ring
• suppresses plasma oscillations and gyro-motion
• larger time step and grid size
$$\frac{df}{dt} = \frac{\partial f}{\partial t} + \frac{d\mathbf{R}}{dt}\cdot\frac{\partial f}{\partial \mathbf{R}} + \frac{dv_\parallel}{dt}\,\frac{\partial f}{\partial v_\parallel} = 0$$

$$\frac{d\mathbf{R}}{dt} = v_\parallel \hat{\mathbf{b}} + \mathbf{v}_d - \frac{\partial \bar{\phi}}{\partial \mathbf{R}} \times \mathbf{b}$$

$$\frac{dv_\parallel}{dt} = -\mathbf{b}^* \cdot \left( \frac{v_\perp^2}{2}\,\frac{\partial}{\partial \mathbf{R}} \ln B + \frac{\partial \bar{\phi}}{\partial \mathbf{R}} \right)$$

$$\mathbf{b}^* = \hat{\mathbf{b}} + v_\parallel\, \hat{\mathbf{b}} \times \left( \hat{\mathbf{b}} \cdot \frac{\partial}{\partial \mathbf{R}} \right) \hat{\mathbf{b}}$$

$$\nabla^2 \phi + \frac{\tau}{\lambda_D^2}\,(\phi - \tilde{\phi}) = 4\pi e\,(\delta \bar{n} - \delta n_e)$$

$$\delta \bar{n}(\mathbf{x}) = \frac{1}{2\pi} \int \left( f(\mathbf{R}) - f_{eq}(\mathbf{R}) \right)\, \delta(\mathbf{R} - \mathbf{x} - \boldsymbol{\rho})\, d\mathbf{R}\, d\mu\, dv_\parallel\, d\varphi$$
Particle in cell methods
• In particle methods, we approximate the distribution function by a collection of finite-size particles
$$f(\mathbf{x}, \mathbf{v}, t) \approx \sum_k q_k\, \delta_{\varepsilon_x}(\mathbf{x} - \mathbf{X}_k(t))\, \delta_{\varepsilon_v}(\mathbf{v} - \mathbf{V}_k(t))$$

$$\int_{-\infty}^{\infty} \delta_\varepsilon(y)\, dy = 1, \qquad \delta_\varepsilon(y) = \frac{1}{\varepsilon}\, u\!\left(\frac{y}{\varepsilon}\right)$$
[Figure: shape functions u0 and u1 plotted versus y over [-2, 2].]
• At each time step, particles are transported along trajectories described by the equations of motion
$$\frac{dq_k}{dt} = 0, \qquad \frac{d\mathbf{X}_k}{dt} = \mathbf{V}_k, \qquad \frac{d\mathbf{V}_k}{dt} = \mathbf{F}_k$$
• The long-range forces are usually solved on a grid
$$\rho_j = \sum_k \frac{q_k}{\varepsilon_x}\, u_1\!\left(\frac{x_j - X_k}{\varepsilon_x}\right), \qquad \Delta \phi_j = -\rho_j, \qquad E_j = -\nabla \phi_j, \qquad E(X_k) = \sum_j E_j\, u_1\!\left(\frac{x_j - X_k}{\varepsilon_x}\right)$$
Charge assignment and field interpolation in the GK PIC method
$$\rho(x_j) = \sum_k \sum_{p=1}^{4} \frac{q_k}{4\varepsilon_x}\, u_1\!\left(\frac{x_j - X_p}{\varepsilon_x}\right) \delta(\mathbf{R}_k - \mathbf{X}_p + \boldsymbol{\rho}_k)$$

$$E(\mathbf{R}_k) = \sum_{p=1}^{4} \sum_j E_j\, u_1\!\left(\frac{x_j - X_p}{\varepsilon_x}\right) \delta(\mathbf{R}_k - \mathbf{X}_p + \boldsymbol{\rho}_k)$$
Classic PIC 4-Point Average GK (W.W. Lee)
Charge Deposition Step (SCATTER operation)
GTC
Gyrokinetic Toroidal Code (GTC, Z. Lin, Science 1998) for magnetically confined plasma simulations
• Solve the 5D gyrokinetic Vlasov-Poisson equations using PIC methods
• 3D toroidal mesh
• Two data structures: grid-based arrays and particle-based arrays
• Six subroutines per time step Δt: Charge, Poisson, Smooth, Field, Push, Shift
GTC Parallelization
• Written in Fortran 90
• 3 levels of parallelism in the original GTC:
∗ 1D domain decomposition in the toroidal dimension
∗ particle decomposition within each toroidal domain
∗ loop-level parallelization with OpenMP
• Massive grid memory footprint: a difficulty for simulating large-scale plasmas such as ITER
Problem size      grid size          particles per toroidal domain
A (a/ρ = 125)     32,449 (1M)        3,235,896 (0.3G)
B (a/ρ = 250)     128,893 (4M)       12,943,584 (1.2G)
C (a/ρ = 500)     513,785 (16M)      51,774,336 (4.8G)
D (a/ρ = 1000)    2,051,567 (64M)    207,097,344 (19.2G)
(using 100 particles per cell)
• Introduce a key additional level of domain decomposition in the radial dimension, which is essential for efficiently carrying out ITER-size simulations (Adams-Ethier 08); the resulting code is called GTCP
• Use PETSc 2.3.3 for the Poisson solver
Porting GTCP from BG/P to BG/Q
• Changed the size of some OpenMP-related arrays from 4 to 64
• Ported the PETSc 2.3.3 version to the BG/Q system (thanks to our catalyst Tim Williams)
• Compiler flags: -O3 -qnohot -qsmp=omp:noauto
[Figure: weak scaling study of A, B, C, D size plasmas on Mira and Intrepid; wall-clock time vs number of cores (512-32768). Intrepid: 4 MPI ranks/node, 1 thread/core; Mira: 16 MPI ranks/node, 4 threads/core.]
∗ Q/P performance ratio per core is 2.02-2.79
∗ Q/P performance ratio per node is 8.08-11.16
Strong Scaling Study of GTCP on BG/Q
[Figure: strong scaling of A-D size plasmas on BG/Q; wall-clock time for 100 time steps vs number of nodes (2-2048), 4-way partition in the toroidal dimension, with ideal-scaling lines.]
Achieved only 16%-35% of maximal parallel efficiency (strong scaling):
• Poisson solver doesn't scale with the number of threads
• Small number of iterations in the outer-level loops of the smooth and field subroutines
• Atomic operations in the charge subroutine
• Memory copy without multithreading in the shift subroutine
Improve single-node performance with multicore and manycore optimizations (in collaboration with the Future Technologies Group at LBL)
• The whole code was converted to C
• A structure-of-arrays (SOA) data structure for the particle array; aligned memory allocation to facilitate use of SIMD intrinsics
• Loop fusion to increase computational intensity
• Particle binning to improve locality (Madduri et al., SC11)
• A multilevel binning algorithm to improve locality and reduce data conflicts: bin the gyro-centers of the particles periodically; bin the four points of the gyro-particles at every time step (Wang et al., SC12 poster)
• Atomic operations are avoided by providing a copy of the local grid for each thread
• Use a hand-coded iterative solver ⇒ enables multithreading
• Flatten some 2D iterations to 1D
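The per-thread grid-copy idea above can be sketched with OpenMP in C. The flat 1D grid, the function name, and nearest-grid-point deposition are simplifying assumptions for illustration; the real code replicates the local poloidal grid per thread and uses u1 interpolation.

```c
#include <assert.h>
#include <stdlib.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also compiles without OpenMP. */
static int omp_get_max_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

/* Atomic-free charge deposition: every thread scatters into its own
   private copy of the grid; the copies are summed afterwards. */
static void deposit_no_atomics(double *rho, int ng,
                               const double *x, const double *q, int np) {
    int nt = omp_get_max_threads();
    double *copies = calloc((size_t)nt * ng, sizeof(double));

    #pragma omp parallel
    {
        double *mine = copies + (size_t)omp_get_thread_num() * ng;
        #pragma omp for
        for (int k = 0; k < np; ++k) {
            int j = (int)x[k] % ng;   /* nearest grid point; x assumed >= 0 */
            mine[j] += q[k];          /* private copy: no atomic needed */
        }
    }

    /* Reduce the per-thread copies into the shared grid. */
    for (int t = 0; t < nt; ++t)
        for (int j = 0; j < ng; ++j)
            rho[j] += copies[(size_t)t * ng + j];

    free(copies);
}
```

The trade-off is memory for synchronization: replication costs one grid copy per thread plus a reduction pass, which is why it pairs naturally with the radial decomposition that keeps each local grid small.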
Scaling Study of GTCP-C on BG/Q
[Figure (left): strong scaling of A, B, C size plasmas with GTCP-C on BG/Q; wall-clock time for 100 time steps vs number of nodes (32-8192), 64-way partition in the toroidal dimension, with ideal-scaling lines and annotated parallel efficiencies of 64%, 73%, and 74%.]
[Figure (right): weak scaling study for A to D size devices on BG/Q; wall-clock time for 100 time steps with 100 ppc vs number of nodes (512-32768), comparing GTCP-FORTRAN and GTCP-C.]
• For the weak scaling plot on the right, we use one MPI rank per node with 64 OpenMP threads
• Achieved 64%-73% of maximal parallel efficiency (strong scaling)
• Obtained a 4-5x speedup compared with the original FORTRAN code
Compute power of GTCP on BG/P and BG/Q
[Figure: particle scaling study on the BG/P and BG/Q systems; compute power (millions of particles per second per step, log scale 10^1-10^5) vs number of nodes (512-32768) for problem sizes A, B, C, D (ITER). Curves: GTC-P FORTRAN on Intrepid, GTC-P FORTRAN on Mira, GTC-P C on Mira, GTC-P C on Mira (fine tuning).]
• red → blue: BG/P to BG/Q
• blue → green: multicore optimizations
• green → purple: compiler flags and process mapping (BG_SMP_FAST_WAKEUP=YES, OMP_WAIT_POLICY=active; runjob mapping)
• The "time to solution" is 50x better compared with using the GTCP FORTRAN code on BG/P with the same number of nodes
Study ITG-driven turbulence spreading with the ultrafast GTCP-C code
[Figure: ion heat diffusivity χi (cs ρs²/a) versus time (a/cs) for device sizes A (a125), B (a250), a375, C (a500), a750 (300 ppc), and D (a1000) (300 ppc); w/o heat bath, with paranl, rw=0.35, 100-300 ppc.]
• 2x longer and 7.5x more particles than the previous run
• We can now simulate an ITER-size plasma with 300 ppc (40 billion particles in total) for 30,000 time steps in under 4 hours using 2/3 of Mira
Conclusion
• Full-application multicore and manycore optimizations
• Introduced radial domain decomposition
• Excellent scaling on BG/Q up to 524,288 cores
• Simulated an ITER-size plasma with 300 ppc (40 billion particles in total) for 30,000 time steps in under 4 hours using 2/3 of Mira
• 50x "time to solution" speedup compared with GTCP FORTRAN on the BG/P system
Current and future work
• Study turbulence spreading and size scaling up to ITER size with high-resolution simulations
• Study phase-space remapping for long-time simulations with the gyrokinetic δf particle-in-cell method
• Develop full-f capability in the code