BigDFT and HPC: The code · Properties
BigDFT and GPUs: Code details · Practical cases
Discussion
Daubechies Wavelets in Electronic Structure Calculation: BigDFT Code Tutorial
ORNL/NICS – OAK RIDGE, TENNESSEE
BigDFT approach to High Performance Computing
Luigi Genovese
L_Sim – CEA Grenoble
February 8, 2012
Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese
Outline
1. The code: formalism and properties; the need for hybrid DFT codes; main operations and parallelisation
2. Performance evaluation: evaluating GPU gain; practical cases
3. Discussion
A DFT code based on Daubechies wavelets
BigDFT: a pseudopotential Kohn-Sham code. A Daubechies wavelet basis has unique properties for DFT usage:
Systematic, orthogonal
Localised, adaptive
Kohn-Sham operators are analytic
Short, separable convolutions: c̃_ℓ = ∑_j a_j c_{ℓ−j}
Peculiar numerical properties: real-space based, highly flexible; suited to big & inhomogeneous systems
[Figure: Daubechies scaling function φ(x) and wavelet ψ(x), plotted for x ∈ [−6, 8], values in [−1.5, 1.5]]
Basis set features
BigDFT features in a nutshell
✓ Arbitrary absolute precision can be achieved: good convergence ratio for a real-space approach (O(h^14))
✓ Optimal usage of the degrees of freedom (adaptivity): optimal speed for a systematic approach (less memory)
✓ Hartree potential accurate for various boundary conditions: Free and Surface BC Poisson Solver (also present in CP2K, ABINIT, OCTOPUS)
→ Data repartition is suitable for optimal scalability: a simple communication paradigm; multi-level parallelisation is possible (and implemented)
Improve and develop know-how: optimal for advanced DFT functionalities in an HPC framework
Moore’s law
40 years of improvements: transistor counts double every two years…
…but how?
Power is the limiting factor
Power ∝ Frequency³ → the clock rate is limited; multiple slower devices are preferable to one superfast device → more performance with less power → a software problem?
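The cubic power law can be made concrete with a toy calculation (the numbers below are purely illustrative, not measurements): two cores at 0.75× the frequency do 1.5× the work of one fast core while drawing less power.

```python
# Toy illustration of Power ∝ Frequency^3 (illustrative numbers only):
# many slower cores beat one fast core on performance per watt.
def power(freq, ncores=1):
    return ncores * freq ** 3  # arbitrary units

fast = {"perf": 1.0, "power": power(1.0)}           # one core at frequency f
slow = {"perf": 2 * 0.75, "power": power(0.75, 2)}  # two cores at 0.75 f

print(slow["perf"], slow["power"])  # 1.5 0.84375: more work, less power
```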
Hybrid Supercomputing nowadays
GPGPU on supercomputers: traditional architectures are saturating; more cores per node, memories (slightly) larger but not faster.
Supercomputer architectures are becoming hybrid: 3 out of the top 5 supercomputers are hybrid machines.
Extrapolation: in 2015, the No. 500 machine will reach the petaflop, and it will likely be a hybrid machine.
Codes should be conceived differently: the number of MPI processes is limited for a fixed problem size;
performance increases only by enhancing parallelism;
further parallelisation levels should be added (OpenMP, GPU).
Are electronic structure codes suitable for this?
How far is petaflop (for DFT)?
At present, with traditional architectures, routinely used DFT calculations involve:
a few dozen (up to hundreds of) processors;
parallel-intensive operations (blocking communications, 60–70 percent efficiency);
codes that are not freshly optimised (legacy codes, monster codes).
→ Optimistic estimate: 5 GFlop/s per core × 2000 cores × 0.9 = 9 TFlop/s, i.e. 200 times less than Top 500's No. 3!
It is as if: distance Earth–Moon = 384 Mm; distance Earth–Mars = 78.4 Gm = 200 times more.
The Moon has been reached… can we go to Mars? (…by 2015?)
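The two ratios in the analogy check out numerically; a quick sketch:

```python
# Check the Moon/Mars analogy: both the flop gap and the distance gap are ~200x.
dft_rate = 5e9 * 2000 * 0.9   # flop/s: 5 GFlop/s per core x 2000 cores x 0.9
earth_moon = 384e6            # metres (384 Mm)
earth_mars = 78.4e9           # metres (78.4 Gm)

print(dft_rate / 1e12)            # 9.0 (TFlop/s)
print(earth_mars / earth_moon)    # ~204, i.e. roughly 200 times more
```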
Operations performed
The SCF cycle
Orbital scheme:
  Hamiltonian
  Preconditioner
Coefficient scheme:
  Overlap matrices
  Orthogonalisation
Computational operations:
  Convolutions (real-space Daubechies wavelets)
  BLAS routines
  FFT (Poisson Solver)
Why not GPUs?
Separable convolutions
We must calculate

F(I1, I2, I3) = ∑_{j1,j2,j3=0}^{L} h_{j1} h_{j2} h_{j3} G(I1−j1, I2−j2, I3−j3)
             = ∑_{j1=0}^{L} h_{j1} ∑_{j2=0}^{L} h_{j2} ∑_{j3=0}^{L} h_{j3} G(I1−j1, I2−j2, I3−j3)

Application of three successive operations:
1. A3(I3, i1, i2) = ∑_j h_j G(i1, i2, I3−j)   ∀ i1, i2
2. A2(I2, I3, i1) = ∑_j h_j A3(I3, i1, I2−j)  ∀ I3, i1
3. F(I1, I2, I3)  = ∑_j h_j A2(I2, I3, I1−j)  ∀ I2, I3

Main routine: convolution + transposition

F(I, a) = ∑_j h_j G(a, I−j)   ∀ a
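The three-pass decomposition above can be sketched in a few lines. This is a minimal model (plain Python, periodic boundary conditions, an arbitrary short filter h), not BigDFT's optimised kernels, but it shows why one "convolve the last axis and rotate it to the front" routine, applied three times, reproduces the full 3D filter.

```python
# Minimal model of the separable 3D convolution: one routine doing
# "1D convolution along the last axis + transposition to the front",
# applied three times. Periodic boundaries; h is an arbitrary filter.

def conv_transpose(h, g, ndat, n):
    # f[I, a] = sum_j h[j] * g[a, I-j]; the convolved axis moves first.
    f = [0.0] * (n * ndat)
    for a in range(ndat):
        for I in range(n):
            f[I * ndat + a] = sum(h[j] * g[a * n + (I - j) % n]
                                  for j in range(len(h)))
    return f

def separable_conv3d(h, g, n):
    # Three passes rotate the axes back to the original order.
    for _ in range(3):
        g = conv_transpose(h, g, n * n, n)
    return g

def direct_conv3d(h, g, n):
    # Reference triple sum F(I1,I2,I3), for checking the passes.
    L, f = len(h), [0.0] * n ** 3
    for i1 in range(n):
        for i2 in range(n):
            for i3 in range(n):
                f[i1 * n * n + i2 * n + i3] = sum(
                    h[j1] * h[j2] * h[j3]
                    * g[((i1 - j1) % n) * n * n + ((i2 - j2) % n) * n
                        + (i3 - j3) % n]
                    for j1 in range(L) for j2 in range(L) for j3 in range(L))
    return f

n, h = 4, [0.5, 0.25, 0.125]
grid = [((7 * k) % 11) / 11 for k in range(n ** 3)]
err = max(abs(a - b) for a, b in
          zip(separable_conv3d(h, grid, n), direct_conv3d(h, grid, n)))
print(err < 1e-12)  # True: the three 1D passes match the direct 3D sum
```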
CPU performance of the convolutions. Initially, naive routines (FORTRAN):

y(j, I) = ∑_{ℓ=L}^{U} h_ℓ x(I+ℓ, j)

Easy to write and debug; defines the reference results.
do j=1,ndat
  do i=0,n1
    tt=0.d0
    do l=lowfil,lupfil
      tt=tt+x(i+l,j)*h(l)
    enddo
    y(j,i)=tt
  enddo
enddo
Optimisation can then start (e.g. Xeon X5550, 2.67 GHz):

Method              GFlop/s   % of peak   SpeedUp
Naive (FORTRAN)     0.54      5.1         1/6.25
Current (FORTRAN)   3.3       31          1
Best (C, SSE)       7.8       73          2.3
OpenCL (Fermi)      97        20          29 (12.4)
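The "% of peak" column is consistent with a per-core peak of 2.67 GHz × 4 double-precision flops/cycle = 10.68 GFlop/s (the factor 4, i.e. 2 SSE adds + 2 SSE muls per cycle, is an assumption that the quoted numbers appear to use):

```python
# Sanity-check the "% of peak" column, assuming a per-core peak of
# 2.67 GHz x 4 DP flops/cycle (128-bit SSE: 2 adds + 2 muls) = 10.68 GFlop/s.
peak = 2.67 * 4  # GFlop/s per core

for name, gflops in [("Naive", 0.54), ("Current", 3.3), ("Best (SSE)", 7.8)]:
    print(f"{name}: {100 * gflops / peak:.1f}% of peak")
```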
How to optimize?
A trade-off between benefit and effort
FORTRAN-based:
✓ Relatively accessible (loop unrolling)
✓ Moderate optimisation can be achieved relatively fast
✗ Compilers fail to use the vector engine efficiently
Pushing optimisation to the limit:
only one out of three convolution types has been implemented;
about 20 different patterns have been studied for a single 1D convolution;
tedious work, huge code → maintainability?
→ Automatic code generation is under study
MPI parallelization I: Orbital distribution scheme
Used for the application of the Hamiltonian. Operator approach: the Hamiltonian (convolutions) is applied separately to each wavefunction.
[Diagram: orbitals ψ1–ψ5 distributed across MPI 0, MPI 1 and MPI 2]
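The picture of ψ1–ψ5 over three MPI processes amounts to splitting the orbitals into contiguous chunks. A minimal sketch (a hypothetical helper, not BigDFT's actual routine):

```python
# Orbital-distribution sketch (hypothetical helper, not BigDFT's routine):
# norb orbitals split into contiguous chunks over nproc MPI ranks, so each
# rank applies the Hamiltonian only to the orbitals it owns.
def distribute_orbitals(norb, nproc):
    base, rem = divmod(norb, nproc)
    chunks, start = [], 0
    for rank in range(nproc):
        size = base + (1 if rank < rem else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks

# e.g. 5 orbitals over 3 ranks, as in the psi1..psi5 picture:
print(distribute_orbitals(5, 3))  # -> [[0, 1], [2, 3], [4]]
```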
MPI parallelization II: Coefficient distribution scheme
Used for scalar products & orthonormalisation: BLAS (level 3) routines are called, then the result is reduced.
[Diagram: the coefficients of ψ1–ψ5 split across MPI 0, MPI 1 and MPI 2]
At present, MPI_ALLTOALL(V) is used to switch between the two distribution schemes.
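In this scheme each process owns a slice of every orbital's coefficients, so an overlap matrix S_ij = ⟨ψ_i|ψ_j⟩ is computed as a local product (a level-3 BLAS call in the real code) followed by a reduction (an MPI_ALLREDUCE-style sum). A toy sketch in plain Python, with the BLAS and MPI calls replaced by loops and a plain sum:

```python
# Coefficient-distribution sketch: each "rank" holds a slice of every
# orbital; partial overlap matrices are computed locally, then summed
# (stand-ins for the level-3 BLAS call and the MPI reduction).
def partial_overlap(slices):
    # slices[i] = this rank's chunk of orbital i's coefficients
    n = len(slices)
    return [[sum(a * b for a, b in zip(slices[i], slices[j]))
             for j in range(n)] for i in range(n)]

def reduce_sum(partials):
    n = len(partials[0])
    return [[sum(p[i][j] for p in partials) for j in range(n)]
            for i in range(n)]

# Two orbitals, coefficients split across two ranks:
rank0 = [[1.0, 2.0], [0.0, 1.0]]
rank1 = [[3.0, 4.0], [0.0, 1.0]]
S = reduce_sum([partial_overlap(rank0), partial_overlap(rank1)])
print(S)  # [[30.0, 6.0], [6.0, 2.0]]
```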
OpenMP parallelisation
Innermost parallelisation level: (almost) any BigDFT operation is parallelised via OpenMP.
✓ Useful for memory-demanding calculations
✓ Allows a further increase of speedups
✓ Saves MPI processes and intra-node message passing
✗ Less efficient than MPI
✗ Compiler- and system-dependent
✗ OMP sections must be regularly maintained
[Figure: OMP speedup vs. number of MPI procs (0–120) for 2, 3 and 6 OMP threads; speedup between roughly 1.5 and 4.5]
Task repartition for a small system (ZnO, 128 atoms)
[Figure: percentage task repartition (Comms, LinAlg, Conv, CPU, Other), time in seconds (log scale) and efficiency vs. number of cores (16–576), 1 OMP thread per core]
What are the ideal conditions for GPUs? GPU-ported routines should take the majority of the time.
What happens to parallel efficiency?
Parallelisation and architectures
Same code, same runs. Which is the best?
[Figures: task repartition, timings and efficiency vs. number of cores (16–576), 1 OMP thread per core, for the two machines below]
CCRT Titane (Nehalem, Infiniband) vs. CSCS Rosa (Opteron, Cray XT5)
Titane is 2.3 to 1.6 times faster than Rosa!
Degradation of parallel performance: why?
1. Calculation power has increased more than networking
2. Better libraries (MKL)
→ Walltime is reduced, but parallel efficiency is lower.
This will always happen when using GPUs!
Architectures, libraries, networking
Same runs, same sources; different user conditions
[Figure: CPU hours vs. number of cores (0–600) for: Titane MpiBull2 no MKL; Jade MPT; Titane MpiBull2, MKL; Titane OpenMPI, MKL; Rosa Cray, Istanbul]
Differences of up to a factor of 3!
A case-by-case study: considerations are often system-dependent; a rule of thumb does not always exist.
→ Know your code!
Intranode bandwidth problem
[Figure: task repartition (Comms, LinAlg, Conv, Potential, Other), speedup and efficiency for different MPI proc – OMP thread combinations; B80 cluster, one Keeneland node (CPU only)]
Scalability does not depend only on communication: Amdahl's law is an upper limit!
Using GPUs in a Big systematic DFT code
Nature of the operations:
operator approach via convolutions;
linear algebra due to the orthogonality of the basis;
communications and calculations do not interfere.
→ A number of operations can be accelerated.
Evaluating GPU convenience: three levels of evaluation
1. Bare speedups (GPU kernels vs. CPU routines): are the operations suitable for GPUs?
2. Full-code speedup on one process (Amdahl's law): are there hot-spot operations?
3. Speedup in a (massively?) parallel environment: the MPI layer adds an extra level of complexity.
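The second level is just Amdahl's law; a one-line sketch makes the "hot-spot" requirement concrete (the fractions below are illustrative):

```python
# Amdahl's law: if a fraction p of the runtime is accelerated by a
# factor s, the full-code speedup is 1/((1-p) + p/s), bounded by 1/(1-p).
def amdahl(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# Illustrative: with 80% of the time in GPU-ported routines,
# even an infinitely fast kernel gives at most 5x overall.
print(round(amdahl(0.8, 1e12), 3))  # -> 5.0
print(round(amdahl(0.8, 10.0), 3))  # -> 3.571
```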
BigDFT in hybrid codes
Acceleration of the full BigDFT code: considerable gains may be achieved for suitable systems; Amdahl's law should always be considered.
Resources can be used concurrently (OpenCL queues): more MPI processes may share the same card!
[Figure: time repartition (Comms, LinAlg, Conv, CPU, Other) and speedup for CPU vs. hybrid runs (CPU-mkl, CPU-mkl-mpi, CUDA, CUDA-mkl, OCL-cublas, OCL-mkl, CUDA-mpi, CUDA-mkl-mpi, OCL-cublas-mpi, OCL-mkl-mpi); Badiane, X5550 + Fermi S2070, ZnO 64 atoms]
(Lower time-to-solution for a given architecture)
The time-to-solution problem I: Efficiency
Good example: 4 C atoms, surface BC, 113 k-points.
Parallel efficiency of 98%; convolutions largely dominate.
Node: 2× Fermi + 8× Westmere; 8 MPI processes.

# GPUs added     2     4     8
SpeedUp (SU)     5.3   9.8   11.6
# MPI equiv.     44    80    96
Acceler. Eff.    1     0.94  0.56
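The "Acceler. Eff." row is not defined on the slide; one plausible reading (an assumption on my part) is the speedup per GPU normalised to the 2-GPU case, which roughly tracks, but does not exactly match, the quoted 1 / 0.94 / 0.56:

```python
# Hypothetical reading of "Acceler. Eff.": speedup per added GPU,
# normalised to the 2-GPU value. SU figures taken from the table above.
su = {2: 5.3, 4: 9.8, 8: 11.6}
base = su[2] / 2
eff = {ngpu: (s / ngpu) / base for ngpu, s in su.items()}
print({n: round(e, 2) for n, e in eff.items()})
```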
[Figure: speedup vs. number of MPI procs (0–100): ideal, CPU+MPI, GPU]
The time-to-solution problem II: Robustness
Not-so-good example: a system that is too small.
[Figure: Titane, ZnO 64 atoms, CPU vs. hybrid: time repartition, efficiency and speedup with GPU vs. number of MPI procs (1–144)]
✗ CPU efficiency is poor (the calculation is too fast)
✗ Amdahl's law is not favourable (5× SU at most)
✓ GPU SU is almost independent of the system size
✓ The hybrid code always runs faster
Hybrid and Heterogeneous runs with OpenCL
NVidia S2070 and ATI HD 6970, each connected to a Nehalem workstation; BigDFT may run on both.
Sample BigDFT run: graphene, 4 C atoms, 52 k-points.
No. of Flop: 8.053 · 10^12

MPI        1      1      4      1      4      8
GPU        no     NV     NV     ATI    ATI    NV + ATI
Time (s)   6020   300    160    347    197    109
Speedup    1      20.07  37.62  17.35  30.55  55.23
GFlop/s    1.34   26.84  50.33  23.2   40.87  73.87
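The Speedup and GFlop/s rows are internally consistent to within rounding: speedup = 6020 s / time and GFlop/s = 8.053·10^12 flop / time. A quick check:

```python
# Consistency check of the run table: speedup = t_CPU / t,
# GFlop/s = total flop / t (values from the table above).
total_flop, t_cpu = 8.053e12, 6020.0
times = [6020.0, 300.0, 160.0, 347.0, 197.0, 109.0]

for t in times:
    print(round(t_cpu / t, 2), round(total_flop / t / 1e9, 2))
```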
Next step: handling of load (un)balancing.
A look in near future: science with HPC DFT codes
A concerted set of actions:
improve code functionalities for present-day and next-generation supercomputers;
test and develop new formalisms;
insert ab-initio codes into new scientific workflows (multiscale modelling).
The Mars mission: is petaflop performance possible?
GPU acceleration → one order of magnitude;
bigger systems, heavier methods → (more than) one order of magnitude more.
BigDFT experience makes this feasible: an opportunity to achieve important outcomes and know-how.
Exercises of this afternoon
Profiling and interpreting BigDFT results:
dimension the job size;
interpret and distinguish intranode and internode behaviour.
[Figure: task repartition (Comms, LinAlg, Conv, Potential, Other) and speedup vs. MPI proc – OMP thread combinations; B80 cluster, Keeneland, multi-node (CPU only)]
BigDFT usage of GPU resources
Combining MPI, OpenMP and GPU acceleration.
[Figure: task repartition and speedup vs. MPI proc – OMP threads – acceleration combinations (CPU, OCL, CUDA); ZnO 8 atoms, gamma point, Keeneland, intranode (GPU-accelerated)]