Running in Parallel: Theory and Practice

Julian Gale, Department of Chemistry, Imperial College

Transcript of Running in Parallel: Theory and Practice

Page 1: Running in Parallel : Theory and Practice

Running in Parallel: Theory and Practice

Julian Gale
Department of Chemistry
Imperial College

Page 2: Running in Parallel : Theory and Practice

Why Run in Parallel?

• Increase real-time performance
• Allow larger calculations:
  - usually memory is the critical factor
  - distributed memory essential for all significant arrays
• Several possible mechanisms for parallelism:
  - MPI / PVM / OpenMP (a minimal MPI sketch follows below)
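As a very small illustration of the message-passing model that MPI provides (generic C, not SIESTA source; the program is purely illustrative), each process learns its own rank and the total number of processes, and can then work on its own share of the data:

  /* Minimal MPI sketch: each process reports its rank. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* index of this node    */
      MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of nodes */
      printf("Node %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
  }

Each node runs the same executable; it is the rank that decides which orbitals, mesh points or k-points a node is responsible for.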

Page 3: Running in Parallel : Theory and Practice

Parallel Strategies

• Massive parallelism:
  - distribute according to spatial location
  - large systems (non-overlapping regions)
  - large numbers of processors
• Modest parallelism:
  - distribute by orbital index / k-point
  - spatially compact systems
  - spatially inhomogeneous systems
  - small numbers of processors
• Replica parallelism (transition states / phonons)

S. Itoh, P. Ordejón and R.M. Martin, CPC, 88, 173 (1995)
A. Canning, G. Galli, F. Mauri, A. de Vita and R. Car, CPC, 94, 89 (1996)
D.R. Bowler, T. Miyazaki and M.J. Gillan, CPC, 137, 255 (2001)

Page 4: Running in Parallel : Theory and Practice

Key Steps in Calculation

• Calculating H (and S) matrices

- Hartree potential

- Exchange-correlation potential

- Kinetic / overlap / pseudopotentials

• Solving for self-consistent solution

- Diagonalisation

- Order N

Page 5: Running in Parallel : Theory and Practice

One/Two-Centre Integrals

• Integrals evaluated directly in real space
• Orbitals distributed according to a 1-D block cyclic scheme (sketched below)
• Each node calculates the integrals relevant to its local orbitals
• Presently the set-up for the numerical tabulations is duplicated on each node

Example distribution (Blocksize = 4):

  Orbital : 1 2 3 4 5 6 7 8 9 10
  Node    : 0 0 0 0 1 1 1 1 0 0

Timings for 16384 atoms of Si on 4 nodes:
  Kinetic energy integrals  =   45 s
  Overlap integrals         =   43 s
  Non-local pseudopotential =  136 s
  Mesh                      = 2213 s
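As a sketch of how the 1-D block cyclic scheme assigns orbitals to nodes (generic C, not SIESTA source; the function name owner_node is purely illustrative), consecutive blocks of Blocksize orbitals are dealt out to the nodes in turn:

  /* Owner node of an orbital under a 1-D block-cyclic distribution.
     Orbitals are numbered from 1 and nodes from 0, as in the table above. */
  int owner_node(int orbital, int blocksize, int nnodes)
  {
      int block = (orbital - 1) / blocksize;   /* block containing this orbital   */
      return block % nnodes;                   /* blocks are dealt out cyclically */
  }

With Blocksize = 4 and two nodes this reproduces the table above: orbitals 1-4 go to node 0, orbitals 5-8 to node 1, and orbitals 9-10 wrap around to node 0.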

Page 6: Running in Parallel : Theory and Practice

Sparse Matrices

[Figure: sparse matrix storage giving order N memory - compressed 2-D and compressed 1-D layouts]
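The idea behind compressed storage can be illustrated with a generic compressed sparse row layout (a sketch only; SIESTA's actual internal format may differ). Only the non-zero elements within the localisation radius are kept for each row, so the memory scales as O(N) rather than O(N^2):

  /* Generic CSR-style sparse matrix (illustrative, not SIESTA's own layout). */
  typedef struct {
      int     nrows;    /* number of orbitals (rows)                          */
      int    *rowptr;   /* non-zeros of row i occupy rowptr[i]..rowptr[i+1]-1 */
      int    *colind;   /* column (orbital) index of each non-zero            */
      double *val;      /* value of each non-zero matrix element              */
  } sparse_matrix;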

Page 7: Running in Parallel : Theory and Practice

Parallel Mesh Operations

[Figure: 2-D blocked decomposition of the mesh in the y/z plane over 12 nodes (0-11)]

• Spatial decomposition of mesh

• 2-D Blocked in y/z

• Map orbital to mesh distribution

• Perform parallel FFT to obtain the Hartree potential
• XC calculation only involves local communication

• Map mesh back to orbitals

Page 8: Running in Parallel : Theory and Practice

Distribution of Processors

[Figure: division of the mesh among processors along the y and z directions]

Better to divide the work in the y direction than in z.

Command: ProcessorY

Example:
  • 8 nodes
  • ProcessorY 4
  • 4 (y) x 2 (z) grid of nodes

A sketch of this mapping is given below.
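One way to picture the resulting mapping from mesh points to nodes (a sketch that assumes equal-sized blocks; the exact decomposition used inside SIESTA may differ) is:

  /* Node owning mesh point (iy, iz) for an ny x nz plane divided over a
     nprocy x (nnodes/nprocy) grid of nodes, e.g. 8 nodes with ProcessorY 4
     giving a 4 (y) x 2 (z) grid.  Illustrative only. */
  int mesh_owner(int iy, int iz, int ny, int nz, int nprocy, int nnodes)
  {
      int nprocz = nnodes / nprocy;     /* e.g. 8 / 4 = 2       */
      int py = (iy * nprocy) / ny;      /* block index along y  */
      int pz = (iz * nprocz) / nz;      /* block index along z  */
      return py * nprocz + pz;          /* node in the 2-D grid */
  }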

Page 9: Running in Parallel : Theory and Practice

Diagonalisation

• H and S stored as sparse matrices

• Solve the generalised eigenvalue problem (written out below)
• Currently convert back to dense form
• Direct sparse solution is possible:
  - sparse solvers exist for the standard eigenvalue problem
  - the main issue is sparse factorisation
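For reference, the generalised eigenvalue problem being solved is, in an orbital basis with Hamiltonian H and overlap S (the eigenvector c_i and eigenvalue \epsilon_i labels are introduced here for clarity):

  \sum_{\nu} H_{\mu\nu} c_{\nu i} = \epsilon_i \sum_{\nu} S_{\mu\nu} c_{\nu i}

Reducing this to a standard eigenvalue problem requires a factorisation of S (e.g. Cholesky), which is why sparse factorisation is the main obstacle to a direct sparse solution.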

Page 10: Running in Parallel : Theory and Practice

Dense Parallel Diagonalisation

[Figure: 1-D block cyclic distribution of the dense matrix over the nodes (block size ≈ 12 - 20)]

Two options:
  - ScaLAPACK
  - Block Jacobi (Ian Bush, Daresbury)
  - trade-off between scaling and absolute performance

Command: BlockSize

Page 11: Running in Parallel : Theory and Practice

Order N

Kim, Mauri, Galli functional:

  E_{bs} = 2 \sum_{i,j}^{occ} \left( H_{ij} - \eta S_{ij} \right) \left( 2\delta_{ij} - S_{ij} \right) + \eta N

where

  H_{ij} = \sum_{\mu,\nu} C_{\mu i} h_{\mu\nu} C_{\nu j} = \sum_{\mu} C_{\mu i} F_{\mu j}

  S_{ij} = \sum_{\mu,\nu} C_{\mu i} s_{\mu\nu} C_{\nu j} = \sum_{\mu} C_{\mu i} F^{s}_{\mu j}

and the gradient is

  G_{\mu i} = \frac{\partial E_{bs}}{\partial C_{\mu i}} = 4 F_{\mu i} - 2 \sum_{j} F^{s}_{\mu j} H_{ji} - 2 \sum_{j} F_{\mu j} S_{ji}

Page 12: Running in Parallel : Theory and Practice

Order N

• Direct minimisation of the band structure energy
  => coefficients of the orbitals in the Wannier functions
• Three basic operations:
  - calculation of the gradient
  - 3-point extrapolation of the energy
  - density matrix build
• Sparse matrices: C, G, H, S, h, s, F, F^s => localisation radius
• Arrays distributed by the right-hand-side index:
  - nbasis or nbands

Page 13: Running in Parallel : Theory and Practice

Putting it into practice….

• Model test system = bulk Si (a = 5.43 Å)
• Conditions as for previous scalar runs
• Single-zeta basis set
• Mesh cut-off = 40 Ry
• Localisation radius = 5.0 Bohr
• Kim / Mauri / Galli functional
• Energy shift = 0.02 Ry
• Order N calculations -> 1 SCF cycle / 2 iterations
• Calculations performed on SGI R12000 / 300 MHz
• "Green" at CSAR / Manchester Computing Centre
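A hypothetical fdf fragment corresponding to these settings might look as follows (keyword names are quoted from memory of the SIESTA manual of this era and should be checked against your version; in particular ON.Functional and ON.RcLWF are assumptions, and the system label is invented):

  SystemLabel      si_bulk          # hypothetical label
  PAO.BasisSize    SZ               # single-zeta basis set
  PAO.EnergyShift  0.02 Ry          # energy shift
  MeshCutoff       40. Ry           # mesh cut-off
  SolutionMethod   OrderN           # order N rather than diagonalisation
  ON.Functional    Kim              # Kim / Mauri / Galli functional
  ON.RcLWF         5.0 Bohr         # localisation radius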

Page 14: Running in Parallel : Theory and Practice

Scaling of Time with System Size

[Plot: time (s) against number of atoms (up to ~150,000), run on 32 processors]

Page 15: Running in Parallel : Theory and Practice

Scaling of Memory with System Size

[Plot: peak memory (MB) against number of atoms (up to ~150,000)]

NB: memory is per processor.

Page 16: Running in Parallel : Theory and Practice

Parallel Performance on Mesh

[Plot: time (s) against number of processors (up to 80)]

• 16384 atoms of Si / Mesh = 180 x 360 x 360

• Mean time per call

• Loss of performance is due to the orbital-to-mesh mapping (the XC calculation shows perfect scaling for LDA)

Page 17: Running in Parallel : Theory and Practice

Parallel Performance in Order N

[Plot: time (s) against number of processors (up to 80)]

• 16384 atoms of Si / Mesh = 180 x 360 x 360

• Mean total time per call in the 3-point energy calculation

• Minimum memory algorithm

• Needs spatial decomposition to limit internode communication

Page 18: Running in Parallel : Theory and Practice

Installing Parallel SIESTA

• What you need:
  - f90 compiler
  - MPI
  - ScaLAPACK
  - BLACS
  - BLAS
  - LAPACK
  (the f90 compiler, BLAS and LAPACK are also needed for serial runs)
• These are usually already installed on parallel machines
• Source / prebuilt binaries are available from www.netlib.org
• If compiling yourself, look out for Fortran 90 / C cross-compatibility
• arch.make files are available for several parallel machines

Page 19: Running in Parallel : Theory and Practice

Running Parallel SIESTA

• To run a parallel job:

  mpirun -np 4 siesta < job.fdf > job.out

  (the -np argument sets the number of processors)

• On some sites "prun" must be used instead
• Notes:
  - jobs generally must be run through the batch queues
  - copy files onto the local disk of the run machine
  - times reported in the output are summed over nodes
  - times can be erratic (Green/Fermat)

Page 20: Running in Parallel : Theory and Practice

Useful Parallel Options

• ParallelOverK : Distribute K points over nodes - good for metals

• ProcessorY : Sets dimension of processor grid in Y direction

• BlockSize : Sets size of blocks into which orbitals are divided

• DiagMemory : Controls the memory available for diagonalisation. The memory required depends on the clustering of eigenvalues. See also DiagScale / TryMemoryIncrease

• DirectPhi : Phi values are calculated on the fly - saves memory
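Collected into an fdf fragment, these options might be set as follows (values are purely illustrative; DiagMemory and its companions take values whose form is described in the manual, so they are only indicated in a comment):

  ParallelOverK   true    # distribute k-points over nodes (good for metals)
  ProcessorY      4       # y-dimension of the processor grid
  BlockSize       16      # orbital block size for the block-cyclic distribution
  DirectPhi       true    # calculate phi values on the fly to save memory
  # DiagMemory / DiagScale / TryMemoryIncrease : see the manual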

Page 21: Running in Parallel : Theory and Practice

Why does my job run like a dead donkey?

• Poor load balance between nodes: alter BlockSize / ProcessorY
• I/O is too slow: consider setting "WriteDM false"
• Job is swapping like crazy: set "DirectPhi true"
• Scaling with increasing numbers of nodes is poor: run a bigger job!!
• General problems with parallelism: latency / bandwidth
  - Linux clusters with a 100 Mbit ethernet switch - forget it!