BigDFT and HPC: The code · Properties
BigDFT and GPUs: Code details · Practical cases
Discussion
Daubechies Wavelets in Electronic Structure Calculation: BigDFT Code Tutorial
ORNL/NICS – OAK RIDGE, TENNESSEE
BigDFT approach to High Performance Computing
Luigi Genovese
L_Sim – CEA Grenoble
February 8, 2012
Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese
Outline
1. The code: formalism and properties; the need for hybrid DFT codes; main operations and parallelisation
2. Performance evaluation: evaluating GPU gain; practical cases
3. Discussion
A DFT code based on Daubechies wavelets
BigDFT: a pseudopotential Kohn-Sham code. A Daubechies wavelet basis has unique properties for DFT usage:
Systematic, orthogonal
Localised, adaptive
Kohn-Sham operators are analytic
Short, separable convolutions: c̃_ℓ = ∑_j a_j c_{ℓ−j}
Peculiar numerical properties: real-space based, highly flexible; suited to big & inhomogeneous systems
[Figure: Daubechies scaling function φ(x) and wavelet ψ(x), plotted for x ∈ [−6, 8], values in [−1.5, 1.5]]
Basis set features
BigDFT features in a nutshell
✓ Arbitrary absolute precision can be achieved: good convergence ratio for a real-space approach (O(h^14))
✓ Optimal usage of the degrees of freedom (adaptivity): optimal speed for a systematic approach (less memory)
✓ Hartree potential accurate for various boundary conditions: Free and Surface BC Poisson Solver (also present in CP2K, ABINIT, OCTOPUS)
→ Data repartition is suitable for optimal scalability: a simple communication paradigm; multi-level parallelisation is possible (and implemented)
Improve and develop know-how: optimal for advanced DFT functionalities in an HPC framework
Moore’s law
40 years of improvements: transistor counts double every two years…
…but how?
Power is the limiting factor
Power ∝ Frequency³ → the clock rate is limited; multiple slower devices are preferable to one superfast device → more performance with less power → a software problem?
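The cubic power law can be made concrete with a toy calculation (the numbers below are purely illustrative, not measurements): two cores at 0.75× the frequency do 1.5× the work of one fast core while drawing less power.

```python
# Toy illustration of Power ∝ Frequency^3 (illustrative numbers only):
# many slower cores beat one fast core on performance per watt.
def power(freq, ncores=1):
    return ncores * freq ** 3  # arbitrary units

fast = {"perf": 1.0, "power": power(1.0)}           # one core at frequency f
slow = {"perf": 2 * 0.75, "power": power(0.75, 2)}  # two cores at 0.75 f

print(slow["perf"], slow["power"])  # 1.5 0.84375: more work, less power
```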
Hybrid Supercomputing nowadays
GPGPU on supercomputers: traditional architectures are saturating; more cores per node, memories (slightly) larger but not faster.
Supercomputer architectures are becoming hybrid: 3 out of the top 5 supercomputers are hybrid machines.
Extrapolation: in 2015, the No. 500 machine will reach the petaflop, and it will likely be a hybrid machine.
Codes should be conceived differently: the number of MPI processes is limited for a fixed problem size;
performance increases only by enhancing parallelism;
further parallelisation levels should be added (OpenMP, GPU).
Are electronic structure codes suitable for this?
How far is petaflop (for DFT)?
At present, with traditional architectures, routinely used DFT calculations involve:
a few dozen (up to hundreds of) processors;
parallel-intensive operations (blocking communications, 60–70 percent efficiency);
codes that are not freshly optimised (legacy codes, monster codes).
→ Optimistic estimate: 5 GFlop/s per core × 2000 cores × 0.9 = 9 TFlop/s, i.e. 200 times less than Top 500's No. 3!
It is as if: distance Earth–Moon = 384 Mm; distance Earth–Mars = 78.4 Gm = 200 times more.
The Moon has been reached… can we go to Mars? (…by 2015?)
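The two ratios in the analogy check out numerically; a quick sketch:

```python
# Check the Moon/Mars analogy: both the flop gap and the distance gap are ~200x.
dft_rate = 5e9 * 2000 * 0.9   # flop/s: 5 GFlop/s per core x 2000 cores x 0.9
earth_moon = 384e6            # metres (384 Mm)
earth_mars = 78.4e9           # metres (78.4 Gm)

print(dft_rate / 1e12)            # 9.0 (TFlop/s)
print(earth_mars / earth_moon)    # ~204, i.e. roughly 200 times more
```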
Operations performed
The SCF cycle
Orbital scheme:
  Hamiltonian
  Preconditioner
Coefficient scheme:
  Overlap matrices
  Orthogonalisation
Computational operations:
  Convolutions (real-space Daubechies wavelets)
  BLAS routines
  FFT (Poisson Solver)
Why not GPUs?
Separable convolutions
We must calculate

F(I1, I2, I3) = ∑_{j1,j2,j3=0}^{L} h_{j1} h_{j2} h_{j3} G(I1−j1, I2−j2, I3−j3)
             = ∑_{j1=0}^{L} h_{j1} ∑_{j2=0}^{L} h_{j2} ∑_{j3=0}^{L} h_{j3} G(I1−j1, I2−j2, I3−j3)

Application of three successive operations:
1. A3(I3, i1, i2) = ∑_j h_j G(i1, i2, I3−j)   ∀ i1, i2
2. A2(I2, I3, i1) = ∑_j h_j A3(I3, i1, I2−j)  ∀ I3, i1
3. F(I1, I2, I3)  = ∑_j h_j A2(I2, I3, I1−j)  ∀ I2, I3

Main routine: convolution + transposition

F(I, a) = ∑_j h_j G(a, I−j)   ∀ a
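The three-pass decomposition above can be sketched in a few lines. This is a minimal model (plain Python, periodic boundary conditions, an arbitrary short filter h), not BigDFT's optimised kernels, but it shows why one "convolve the last axis and rotate it to the front" routine, applied three times, reproduces the full 3D filter.

```python
# Minimal model of the separable 3D convolution: one routine doing
# "1D convolution along the last axis + transposition to the front",
# applied three times. Periodic boundaries; h is an arbitrary filter.

def conv_transpose(h, g, ndat, n):
    # f[I, a] = sum_j h[j] * g[a, I-j]; the convolved axis moves first.
    f = [0.0] * (n * ndat)
    for a in range(ndat):
        for I in range(n):
            f[I * ndat + a] = sum(h[j] * g[a * n + (I - j) % n]
                                  for j in range(len(h)))
    return f

def separable_conv3d(h, g, n):
    # Three passes rotate the axes back to the original order.
    for _ in range(3):
        g = conv_transpose(h, g, n * n, n)
    return g

def direct_conv3d(h, g, n):
    # Reference triple sum F(I1,I2,I3), for checking the passes.
    L, f = len(h), [0.0] * n ** 3
    for i1 in range(n):
        for i2 in range(n):
            for i3 in range(n):
                f[i1 * n * n + i2 * n + i3] = sum(
                    h[j1] * h[j2] * h[j3]
                    * g[((i1 - j1) % n) * n * n + ((i2 - j2) % n) * n
                        + (i3 - j3) % n]
                    for j1 in range(L) for j2 in range(L) for j3 in range(L))
    return f

n, h = 4, [0.5, 0.25, 0.125]
grid = [((7 * k) % 11) / 11 for k in range(n ** 3)]
err = max(abs(a - b) for a, b in
          zip(separable_conv3d(h, grid, n), direct_conv3d(h, grid, n)))
print(err < 1e-12)  # True: the three 1D passes match the direct 3D sum
```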
CPU performance of the convolutions. Initially, naive routines (FORTRAN):

y(j, I) = ∑_{ℓ=L}^{U} h_ℓ x(I+ℓ, j)

Easy to write and debug; defines the reference results.
do j=1,ndat
  do i=0,n1
    tt=0.d0
    do l=lowfil,lupfil
      tt=tt+x(i+l,j)*h(l)
    enddo
    y(j,i)=tt
  enddo
enddo
Optimisation can then start (e.g. Xeon X5550, 2.67 GHz):

Method              GFlop/s   % of peak   SpeedUp
Naive (FORTRAN)     0.54      5.1         1/6.25
Current (FORTRAN)   3.3       31          1
Best (C, SSE)       7.8       73          2.3
OpenCL (Fermi)      97        20          29 (12.4)
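The "% of peak" column is consistent with a per-core peak of 2.67 GHz × 4 double-precision flops/cycle = 10.68 GFlop/s (the factor 4, i.e. 2 SSE adds + 2 SSE muls per cycle, is an assumption that the quoted numbers appear to use):

```python
# Sanity-check the "% of peak" column, assuming a per-core peak of
# 2.67 GHz x 4 DP flops/cycle (128-bit SSE: 2 adds + 2 muls) = 10.68 GFlop/s.
peak = 2.67 * 4  # GFlop/s per core

for name, gflops in [("Naive", 0.54), ("Current", 3.3), ("Best (SSE)", 7.8)]:
    print(f"{name}: {100 * gflops / peak:.1f}% of peak")
```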
How to optimize?
A trade-off between benefit and effort
FORTRAN-based:
✓ Relatively accessible (loop unrolling)
✓ Moderate optimisation can be achieved relatively fast
✗ Compilers fail to use the vector engine efficiently
Pushing optimisation to the limit:
only one out of three convolution types has been implemented;
about 20 different patterns have been studied for a single 1D convolution;
tedious work, huge code → maintainability?
→ Automatic code generation is under study
MPI parallelization I: Orbital distribution scheme
Used for the application of the Hamiltonian. Operator approach: the Hamiltonian (convolutions) is applied separately to each wavefunction.
[Diagram: orbitals ψ1–ψ5 distributed across MPI 0, MPI 1 and MPI 2]
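The picture of ψ1–ψ5 over three MPI processes amounts to splitting the orbitals into contiguous chunks. A minimal sketch (a hypothetical helper, not BigDFT's actual routine):

```python
# Orbital-distribution sketch (hypothetical helper, not BigDFT's routine):
# norb orbitals split into contiguous chunks over nproc MPI ranks, so each
# rank applies the Hamiltonian only to the orbitals it owns.
def distribute_orbitals(norb, nproc):
    base, rem = divmod(norb, nproc)
    chunks, start = [], 0
    for rank in range(nproc):
        size = base + (1 if rank < rem else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks

# e.g. 5 orbitals over 3 ranks, as in the psi1..psi5 picture:
print(distribute_orbitals(5, 3))  # -> [[0, 1], [2, 3], [4]]
```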
MPI parallelization II: Coefficient distribution scheme
Used for scalar products & orthonormalisation: BLAS (level 3) routines are called, then the result is reduced.
[Diagram: the coefficients of ψ1–ψ5 split across MPI 0, MPI 1 and MPI 2]
At present, MPI_ALLTOALL(V) is used to switch between the two distribution schemes.
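In this scheme each process owns a slice of every orbital's coefficients, so an overlap matrix S_ij = ⟨ψ_i|ψ_j⟩ is computed as a local product (a level-3 BLAS call in the real code) followed by a reduction (an MPI_ALLREDUCE-style sum). A toy sketch in plain Python, with the BLAS and MPI calls replaced by loops and a plain sum:

```python
# Coefficient-distribution sketch: each "rank" holds a slice of every
# orbital; partial overlap matrices are computed locally, then summed
# (stand-ins for the level-3 BLAS call and the MPI reduction).
def partial_overlap(slices):
    # slices[i] = this rank's chunk of orbital i's coefficients
    n = len(slices)
    return [[sum(a * b for a, b in zip(slices[i], slices[j]))
             for j in range(n)] for i in range(n)]

def reduce_sum(partials):
    n = len(partials[0])
    return [[sum(p[i][j] for p in partials) for j in range(n)]
            for i in range(n)]

# Two orbitals, coefficients split across two ranks:
rank0 = [[1.0, 2.0], [0.0, 1.0]]
rank1 = [[3.0, 4.0], [0.0, 1.0]]
S = reduce_sum([partial_overlap(rank0), partial_overlap(rank1)])
print(S)  # [[30.0, 6.0], [6.0, 2.0]]
```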
OpenMP parallelisation
Innermost parallelisation level: (almost) any BigDFT operation is parallelised via OpenMP.
✓ Useful for memory-demanding calculations
✓ Allows a further increase of speedups
✓ Saves MPI processes and intra-node message passing
✗ Less efficient than MPI
✗ Compiler- and system-dependent
✗ OMP sections must be regularly maintained
[Figure: OMP speedup vs. number of MPI procs (0–120) for 2, 3 and 6 OMP threads; speedup between roughly 1.5 and 4.5]
Task repartition for a small system (ZnO, 128 atoms)
[Figure: percentage task repartition (Comms, LinAlg, Conv, CPU, Other), time in seconds (log scale) and efficiency vs. number of cores (16–576), 1 OMP thread per core]
What are the ideal conditions for GPUs? GPU-ported routines should take the majority of the time.
What happens to parallel efficiency?
Parallelisation and architectures
Same code, same runs. Which is the best?
[Figures: task repartition, timings and efficiency vs. number of cores (16–576), 1 OMP thread per core, for the two machines below]
CCRT Titane (Nehalem, Infiniband) vs. CSCS Rosa (Opteron, Cray XT5)
Titane is 2.3 to 1.6 times faster than Rosa!
Degradation of parallel performance: why?
1. Calculation power has increased more than networking
2. Better libraries (MKL)
→ Walltime is reduced, but parallel efficiency is lower.
This will always happen when using GPUs!
Architectures, libraries, networking
Same runs, same sources; different user conditions
[Figure: CPU hours vs. number of cores (0–600) for: Titane MpiBull2 no MKL; Jade MPT; Titane MpiBull2, MKL; Titane OpenMPI, MKL; Rosa Cray, Istanbul]
Differences of up to a factor of 3!
A case-by-case study: considerations are often system-dependent; a rule of thumb does not always exist.
→ Know your code!
Intranode bandwidth problem
[Figure: task repartition (Comms, LinAlg, Conv, Potential, Other), speedup and efficiency for different MPI proc – OMP thread combinations; B80 cluster, one Keeneland node (CPU only)]
Scalability does not depend only on communication: Amdahl's law is an upper limit!
Using GPUs in a Big systematic DFT code
Nature of the operations:
operator approach via convolutions;
linear algebra due to the orthogonality of the basis;
communications and calculations do not interfere.
→ A number of operations can be accelerated.
Evaluating GPU convenience: three levels of evaluation
1. Bare speedups (GPU kernels vs. CPU routines): are the operations suitable for GPUs?
2. Full-code speedup on one process (Amdahl's law): are there hot-spot operations?
3. Speedup in a (massively?) parallel environment: the MPI layer adds an extra level of complexity.
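The second level is just Amdahl's law; a one-line sketch makes the "hot-spot" requirement concrete (the fractions below are illustrative):

```python
# Amdahl's law: if a fraction p of the runtime is accelerated by a
# factor s, the full-code speedup is 1/((1-p) + p/s), bounded by 1/(1-p).
def amdahl(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# Illustrative: with 80% of the time in GPU-ported routines,
# even an infinitely fast kernel gives at most 5x overall.
print(round(amdahl(0.8, 1e12), 3))  # -> 5.0
print(round(amdahl(0.8, 10.0), 3))  # -> 3.571
```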
BigDFT in hybrid codes
Acceleration of the full BigDFT code: considerable gains may be achieved for suitable systems; Amdahl's law should always be considered.
Resources can be used concurrently (OpenCL queues): more MPI processes may share the same card!
[Figure: time repartition (Comms, LinAlg, Conv, CPU, Other) and speedup for CPU vs. hybrid runs (CPU-mkl, CPU-mkl-mpi, CUDA, CUDA-mkl, OCL-cublas, OCL-mkl, CUDA-mpi, CUDA-mkl-mpi, OCL-cublas-mpi, OCL-mkl-mpi); Badiane, X5550 + Fermi S2070, ZnO 64 atoms]
(Lower time-to-solution for a given architecture)
The time-to-solution problem I: Efficiency
Good example: 4 C atoms, surface BC, 113 k-points.
Parallel efficiency of 98%; convolutions largely dominate.
Node: 2× Fermi + 8× Westmere; 8 MPI processes.

# GPUs added     2     4     8
SpeedUp (SU)     5.3   9.8   11.6
# MPI equiv.     44    80    96
Acceler. Eff.    1     0.94  0.56
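The "Acceler. Eff." row is not defined on the slide; one plausible reading (an assumption on my part) is the speedup per GPU normalised to the 2-GPU case, which roughly tracks, but does not exactly match, the quoted 1 / 0.94 / 0.56:

```python
# Hypothetical reading of "Acceler. Eff.": speedup per added GPU,
# normalised to the 2-GPU value. SU figures taken from the table above.
su = {2: 5.3, 4: 9.8, 8: 11.6}
base = su[2] / 2
eff = {ngpu: (s / ngpu) / base for ngpu, s in su.items()}
print({n: round(e, 2) for n, e in eff.items()})
```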
[Figure: speedup vs. number of MPI procs (0–100): ideal, CPU+MPI, GPU]
The time-to-solution problem II: Robustness
Not-so-good example: a system that is too small.
[Figure: Titane, ZnO 64 atoms, CPU vs. hybrid: time repartition, efficiency and speedup with GPU vs. number of MPI procs (1–144)]
✗ CPU efficiency is poor (the calculation is too fast)
✗ Amdahl's law is not favourable (5× SU at most)
✓ GPU SU is almost independent of the system size
✓ The hybrid code always runs faster
Hybrid and Heterogeneous runs with OpenCL
NVidia S2070 and ATI HD 6970, each connected to a Nehalem workstation; BigDFT may run on both.
Sample BigDFT run: graphene, 4 C atoms, 52 k-points.
No. of Flop: 8.053 · 10^12

MPI        1      1      4      1      4      8
GPU        no     NV     NV     ATI    ATI    NV + ATI
Time (s)   6020   300    160    347    197    109
Speedup    1      20.07  37.62  17.35  30.55  55.23
GFlop/s    1.34   26.84  50.33  23.2   40.87  73.87
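The Speedup and GFlop/s rows are internally consistent to within rounding: speedup = 6020 s / time and GFlop/s = 8.053·10^12 flop / time. A quick check:

```python
# Consistency check of the run table: speedup = t_CPU / t,
# GFlop/s = total flop / t (values from the table above).
total_flop, t_cpu = 8.053e12, 6020.0
times = [6020.0, 300.0, 160.0, 347.0, 197.0, 109.0]

for t in times:
    print(round(t_cpu / t, 2), round(total_flop / t / 1e9, 2))
```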
Next step: handling of load (un)balancing.
A look in near future: science with HPC DFT codes
A concerted set of actions:
improve code functionalities for present-day and next-generation supercomputers;
test and develop new formalisms;
insert ab-initio codes into new scientific workflows (multiscale modelling).
The Mars mission: is petaflop performance possible?
GPU acceleration → one order of magnitude;
bigger systems, heavier methods → (more than) one order of magnitude more.
BigDFT experience makes this feasible: an opportunity to achieve important outcomes and know-how.
Exercises of this afternoon
Profiling and interpreting BigDFT results:
dimension the job size;
interpret and distinguish intranode and internode behaviour.
[Figure: task repartition (Comms, LinAlg, Conv, Potential, Other) and speedup vs. MPI proc – OMP thread combinations; B80 cluster, Keeneland, multi-node (CPU only)]
BigDFT usage of GPU resources
Combining MPI, OpenMP and GPU acceleration.
[Figure: task repartition and speedup vs. MPI proc – OMP threads – acceleration combinations (CPU, OCL, CUDA); ZnO 8 atoms, gamma point, Keeneland, intranode (GPU-accelerated)]