SIRIUS: domain specific library prototype for electronic structure ...€¦ · refactoring? Low...
Transcript of SIRIUS: domain specific library prototype for electronic structure ...€¦ · refactoring? Low...
SIRIUS: domain specific library prototype for electronic structure calculations
Trieste, January 14th 2015
Anton Kozhevnikov, CSCS
Low level libraries
Introduction
LibXC
FFTWScaLAPACK and PBLASLAPACK and BLAS
Exciting Elk
PRACE-2IP WP8: how to approach the problem of community code refactoring?
Low level libraries
Introduction
SIRIUS is a prototype low level library developed under PRACE-2IP work package 8 (community codes refactoring and optimization) as a general-purpose solution for optimization and scaling of both Exciting and Elk full-potential LAPW codes.
SIRIUS C++ libraryMPI + OpenMP parallel model with GPU acceleration
LibXCGNU scientific library
FFTW
HDF5
ELPA MAGMAScaLAPACK and PBLASLAPACK and BLAS
Exciting Elk
spglib
LAPW functionality
Low level libraries
Introduction
SIRIUS C++ library
MPI + OpenMP parallel model with GPU acceleration
LibXCGNU scientific library
FFTW
HDF5
ELPA MAGMAScaLAPACK and PBLASLAPACK and BLAS
Exciting Elk
spglib
LAPW specific functionality
QE
Pseudopotential specific functionality
Common objects: unit cell description, reciprocal lattice description, FFT mesh, G-vector indexing, radial functions (local orbitals or beta projectors) indexing, XC potential generation, etc.
Research topics
•Wave-function distribution
•Iterative solvers:
•Davidson solver (serial & parallel, CPU & GPU version)
•Chebyshev-filtered subspace
•RMM-DIIS
•Real-space beta projectors
•Second-variational approach to magnetism
Wave-function distribution
[0, 0] [0, 1]
[1, 0] [1, 1]
2D BLACS grid
Index of bands
Inde
x of
G-v
ecto
rs
+
Wave-functions
Wave-function distribution
Split bands Split G-vectors Split bands and G-vectors
2D block-cyclic distribution
2D block cyclic distribution
•“ScaLAPACK friendly”
• task groups are introduced naturally•allows fast change of wave-function distribution
= x
h�i|H|�ji h�i| H|�ji
pzgemm
Changing wave-function distribution
[0, 0] [0, 1]
[1, 0] [1, 1]
[0, 0] [0, 1]
[1, 0] [1, 1]
[0, 0] [0, 1]
[1, 0] [1, 1]
split bands
MPI
all-
to-a
ll al
ong
colu
mns
MPI all-to-all along rows
split G-vectors
FFTs
First, block-cyclic distributed wave-functions are gathered into ‘slices’ on individual nodes and then each node executes the whole FFTs on a subset of full wave-function vectors.
CPU threads
sing
le 3
D F
FT
sing
le 3
D F
FT
sing
le 3
D F
FT
sing
le 3
D F
FT
sing
le 3
D F
FT
sing
le 3
D F
FT
sing
le 3
D F
FT
cont
rol t
hrea
d
wor
k th
read
s
batc
h 3D
cuF
FT
[0, 0] [0, 1]
[1, 0] [1, 1]
SiO ANA zeolite benchmark
Number of atoms 144
Number of k-points 8 (nosym)
Number of bands 461
Number of beta-projectors 2112
ecutwfc 30 Ry
ecutrho 400 Ry
|G+k> basis size ~46700
Coarse FFT mesh 90x90x90
Fine FFT mesh 180x180x180
Platform and code
Latest svn version of QE (+ELPA_2013.11 +SIRIUS)
Latest git version of QE+GPU
Piz Daint: Intel 8-core SB @2.6 GHz + Tesla K20X, Aries interconnect
Performance benchmark
1
5
9
12
16
20
8 (64) 32 (256) 72 (576) 128 (1024)
Scaling
Number of CPU sockets (cores)
QE (sirius)QE (many threads)QE (many ranks)Ideal
ranks per k-point
% of ranks for diagonalziation
8 50%
32 78%
72 89%
128 94%
“Many ranks” run
Performance benchmark
0
620
1240
1860
2480
3100
8 (64) 32 (256) 72 (576) 128 (1024)
365
1243
154222
377
1081
133217
546
2639
207359
777
3011
Time to solution
Tim
e to
sol
utio
n (s
ec)
Number of CPU sockets (cores)
QE (many threads, CPU)QE (many ranks, CPU)QE (sirius, CPU)QE+GPUQE (sirius, GPU)
Valence charge density has to be augmented with:
where
⇢(G) =X
↵
X
⇠⇠0
q↵⇠⇠0Q↵⇠0⇠(G)
q↵⇠⇠0 =X
jk
h�↵⇠ | jkinjkh jk|�↵
⇠0i
Q↵(A)⇠0⇠ (G) = e�iG⌧↵(A)QA
⇠0⇠(G)
Charge augmentation
zgemm
Relation between plane-wave coefficients of the Q-operator for a given atomα and the corresponding coefficients of the Q-operator for a given atom type A:
Split the sum over atoms into sum over atom types and inner sum over atoms of the same type:
zgemm
⇢(G) =X
A
X
⇠⇠0
QA⇠0⇠(G)
X
↵(A)
q↵(A)⇠⇠0 e�iG⌧↵(A) =
X
A
X
⇠⇠0
QA⇠0⇠(G)dA⇠⇠0(G)
•treat as matrix-vector multiplication
•(even better) treat as matrix-matrix multiplication (to be tested)
D↵⇠⇠0 =
X
G
Q↵⇠⇠0(G)V (G)
D-operator evaluation
zgemv
D↵⇠⇠0 =
X
G
QA⇠⇠0(G)e�iG⌧↵(A)V (G) zgemm
Thank you for your attention.