Scientific Computations on Modern Parallel Vector Systems
Leonid Oliker, Jonathan Carter, Andrew Canning, John Shalf (Lawrence Berkeley National Laboratory)
Stephane Ethier (Princeton Plasma Physics Laboratory)
http://crd.lbl.gov/~oliker
Overview
Superscalar cache-based architectures dominate the HPC market
Leading architectures are commodity-based SMPs due to generality and perception of cost effectiveness
Growing gap between peak & sustained performance is well known in scientific computing
Modern parallel vector systems may bridge this gap for many important applications
In April 2002, the Earth Simulator (ES) became operational:
• Peak ES performance > all DOE and DOD systems combined
• Demonstrated high sustained performance on demanding scientific apps
Conducting evaluation study of scientific applications on modern vector systems
• 09/2003 MOU between ES and NERSC was completed
• First visit to ES center: December 8th-17th, 2003 (ES remote access not available)
• First international team to conduct a performance evaluation study at the ES
Examining best mapping between demanding applications and leading HPC systems - one size does not fit all
Vector Paradigm
High memory bandwidth
• Allows systems to effectively feed the ALUs (high byte-to-flop ratio)
Flexible memory addressing modes
• Supports fine-grained strided and irregular data access
Vector registers
• Hide memory latency via deep pipelining of memory loads/stores
Vector ISA
• Single instruction specifies large number of identical operations
Vector architectures allow for:
• Reduced control complexity
• Efficient utilization of large numbers of computational resources
• Potential for automatic discovery of parallelism
However: most effective only if sufficient regularity is discoverable in program structure
• Performance suffers even if a small % of code is non-vectorizable (Amdahl's Law)
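As a rough illustration of that last point (the vectorized fraction f and vector:scalar speed ratio R below are assumed values, not measurements from these systems), Amdahl's Law bounds the overall speedup of a partially vectorized code:

  S = \frac{1}{(1 - f) + f/R}

With R = 10, a code that is f = 95% vectorized achieves only S ≈ 6.9, and even f = 99% yields S ≈ 9.2, so a few percent of scalar code forfeits much of the potential gain.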
Architectural Comparison
Node Type  Where  CPU/Node  Clock (MHz)  Peak (GFlop/s)  Mem BW (GB/s)  Peak (byte/flop)  Netwk BW (GB/s/P)  Bisect BW (byte/flop)  MPI Latency (usec)  Network Topology
Power3     NERSC  16        375          1.5             1.0            0.47              0.13               0.087                  16.3                Fat-tree
Power4     ORNL   32        1300         5.2             2.3            0.44              0.13               0.025                  7.0                 Fat-tree
Altix      ORNL   2         1500         6.0             6.4            1.1               0.40               0.067                  2.8                 Fat-tree
ES         ESC    8         500          8.0             32.0           4.0               1.5                0.19                   5.6                 Crossbar
X1         ORNL   4         800          12.8            34.1           2.7               6.3                0.088                  7.3                 2D-torus
Custom vector architectures have:
• High memory bandwidth relative to peak
• Superior interconnect: latency, point-to-point, and bisection bandwidth
Overall, the ES appears to be the most balanced architecture, while the Altix shows the best balance among the superscalar architectures
A key ‘balance point’ for vector systems is the scalar:vector ratio
Applications studied
Code     Discipline        Lines    Structure           Description
LBMHD    Plasma Physics    1,500    grid based          Lattice Boltzmann approach for magneto-hydrodynamics
CACTUS   Astrophysics      100,000  grid based          Solves Einstein's equations of general relativity
PARATEC  Material Science  50,000   Fourier space/grid  Density Functional Theory electronic structure code
GTC      Magnetic Fusion   5,000    particle based      Particle-in-cell method for the gyrokinetic Vlasov-Poisson equation
Applications chosen with potential to run at ultrascale
Computations contain abundant data parallelism
• ES runs require minimal parallelization and vectorization hurdles
Codes originally designed for superscalar systems
Ported onto a single node of the SX-6; first multi-node experiments performed at the ESC
Plasma Physics: LBMHD
LBMHD uses a Lattice Boltzmann method to model magneto-hydrodynamics (MHD)
Performs 2D simulation of high temperature plasma
Evolves from initial conditions, decaying to form current sheets
2D spatial grid is coupled to octagonal streaming lattice
Block distributed over 2D processor grid
Main computational components:
• Collision: requires coefficients for the local gridpoint only, no communication
• Stream: values at gridpoints are streamed to neighbors; at cell boundaries information is exchanged via MPI
• Interpolation: step required between spatial and stream lattices
Developed by George Vahala's group at the College of William and Mary; ported by Jonathan Carter
Current density decays of two cross-shaped structures
LBMHD: Porting Details
Collision routine rewritten:
• For ES, loop ordering switched so the gridpoint loop (~1000 iterations) is innermost rather than the velocity or magnetic field loops (~10 iterations), as sketched below
• X1 compiler made this transformation automatically: multistreaming the outer loop and vectorizing (via strip mining) the inner loop
• Temporary arrays padded to reduce bank conflicts
Stream routine performs well: array shift operations, block copies, 3rd-degree polynomial evaluation
Boundary value exchange uses MPI_Isend/MPI_Irecv pairs
Further work: plan to use ES "global memory" to remove message copies
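A minimal sketch of the loop interchange referenced above (illustrative only: the array names, shapes, and the simple relaxation update are assumed here, not the actual LBMHD collision kernel):

  subroutine collision_sketch(f, feq, ngrid, nvel, tau)
    implicit none
    integer, intent(in)    :: ngrid, nvel
    real(8), intent(in)    :: feq(ngrid, nvel), tau
    real(8), intent(inout) :: f(ngrid, nvel)
    integer :: i, v
    do v = 1, nvel            ! short loop (~10 iterations) stays outermost
       do i = 1, ngrid        ! long, unit-stride loop (~1000 iterations) innermost: vectorizes
          f(i, v) = f(i, v) - (f(i, v) - feq(i, v)) / tau
       end do
    end do
  end subroutine collision_sketch

The original code kept the short loops innermost; the interchange trades that ordering for long vector lengths.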
(left) octagonal streaming lattice coupled with square spatial grid
(right) example of diagonal streaming vector updating three spatial cells
LBMHD: Performance
ES achieves highest performance to date: over 3.3 Tflop/s for P=1024
X1 comparable in absolute speed up to P=64 (at a lower % of peak), but performs 1.5X slower at P=256 (decreased scalability)
CAF improved X1 to slightly exceed ES at P=64 (up to 4.70 Gflop/s per processor)
ES is 44X, 16X, and 7X faster than Power3, Power4, and Altix
• Low computational intensity (1.5) and high memory requirement (30GB) hurt scalar performance
Altix best among the superscalar systems due to high memory bandwidth and fast interconnect
Data Size    P     Power3          Power4          Altix           ES              X1
                   Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak
4096 x 4096  16    0.11     7%     0.28     5%     0.60     10%    4.6      58%    4.3      34%
4096 x 4096  64    0.14     9%     0.30     6%     0.62     10%    4.3      54%    4.4      34%
4096 x 4096  256   0.14     9%     0.28     5%     ---      ---    3.2      40%    ---      ---
8192 x 8192  64    0.11     7%     0.27     5%     0.65     11%    4.6      58%    4.5      35%
8192 x 8192  256   0.12     8%     0.28     5%     ---      ---    4.3      53%    2.7      21%
8192 x 8192  1024  0.11     7%     ---      ---    ---      ---    3.3      41%    ---      ---
LBMHD on X1 MPI vs CAF
X1 well-suited for one-sided parallel languages (globally addressable memory)
• MPI hinders this feature and requires scalar tag matching
CAF allows much simpler coding of the boundary exchange (array subscripting):
• feq(ista-1,jsta:jend,1) = feq(iend,jsta:jend,1)[iprev,myrankj]
MPI requires non-contiguous data copies into a buffer, unpacked at the destination (see the sketch after this list)
Since communication is only about 10% of LBMHD runtime, improvements are slight
However, for P=64 on the 4096^2 case performance degrades. Tradeoffs:
• CAF reduced total message volume 3X (eliminates user and system buffer copies)
• But CAF used more numerous and smaller-sized messages
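For illustration, a minimal sketch of the packed two-sided exchange that CAF replaces (the subroutine, buffer handling, and neighbor ranks are assumed for this example; only the final CAF line is taken from the code above):

  ! Two-sided MPI version: the strided boundary column must be packed,
  ! sent/received, and unpacked into the ghost cells.
  subroutine exchange_mpi(feq, ista, iend, jsta, jend, prev, next, comm)
    use mpi
    implicit none
    integer, intent(in) :: ista, iend, jsta, jend, prev, next, comm
    real(8), intent(inout) :: feq(ista-1:iend, jsta:jend, 1)
    real(8) :: sendbuf(jend-jsta+1), recvbuf(jend-jsta+1)
    integer :: req(2), ierr
    sendbuf = feq(iend, jsta:jend, 1)              ! pack strided boundary column
    call MPI_Irecv(recvbuf, size(recvbuf), MPI_DOUBLE_PRECISION, prev, 0, comm, req(1), ierr)
    call MPI_Isend(sendbuf, size(sendbuf), MPI_DOUBLE_PRECISION, next, 0, comm, req(2), ierr)
    call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)
    feq(ista-1, jsta:jend, 1) = recvbuf            ! unpack into ghost cells
  end subroutine exchange_mpi

  ! One-sided CAF version of the same exchange (feq declared as a coarray):
  !   feq(ista-1, jsta:jend, 1) = feq(iend, jsta:jend, 1)[iprev, myrankj]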
Data Size  P    X1-MPI           X1-CAF
                Gflops/P  %peak  Gflops/P  %peak
4096^2     16   4.32      34%    4.55      36%
4096^2     64   4.35      34%    4.26      33%
8192^2     64   4.48      35%    4.70      37%
8192^2     256  2.70      21%    2.91      23%
Astrophysics: CACTUS
Numerical solution of Einstein's equations from the theory of general relativity
Among most complex in physics: set of coupled nonlinear hyperbolic & elliptic systems with thousands of terms
CACTUS evolves these equations to simulate high gravitational fluxes, such as collision of two black holes
Evolves PDEs on a regular grid using finite differences (see the stencil sketch below)
Uses ADM formulation: domain decomposed into 3D hypersurfaces for different slices of space along time dimension
Exciting new field about to be born: Gravitational Wave Astronomy - fundamentally new information about the Universe
Gravitational waves: Ripples in spacetime curvature, caused by matter motion, causing distances to change.
Developed at Max Planck Institute, vectorized by John Shalf
Visualization of grazing collision of two black holes
Communication at boundaries
Expect high parallel efficiency
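For illustration, a generic finite-difference stencil sketch (not CACTUS code; the 7-point Laplacian, array names, and shapes are assumed) showing why vector performance tracks the x-dimension: the unit-stride x loop is innermost, so its extent sets the vector length:

  subroutine laplacian_sketch(u, lap, nx, ny, nz, h)
    implicit none
    integer, intent(in)  :: nx, ny, nz
    real(8), intent(in)  :: u(nx, ny, nz), h
    real(8), intent(out) :: lap(nx, ny, nz)
    integer :: i, j, k
    lap = 0.0d0
    do k = 2, nz-1
       do j = 2, ny-1
          do i = 2, nx-1        ! innermost, unit-stride x loop vectorizes; length ~ nx
             lap(i, j, k) = ( u(i+1, j, k) + u(i-1, j, k) &
                            + u(i, j+1, k) + u(i, j-1, k) &
                            + u(i, j, k+1) + u(i, j, k-1) &
                            - 6.0d0 * u(i, j, k) ) / (h * h)
          end do
       end do
    end do
  end subroutine laplacian_sketch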
CACTUS: Performance
ES achieves fastest performance to date: 45X faster than Power3!
Vector performance related to x-dimension (vector length)
Excellent scaling on ES using fixed data size per processor (weak scaling)
Scalar performance better on smaller problem size (cache effects)
X1 surprisingly poor (4X slower than ES): low scalar:vector performance ratio
Unvectorized boundary condition required 15% of runtime on ES and 30+% on X1, versus < 5% for the scalar version: unvectorized code can quickly dominate cost
Poor superscalar performance despite high computational intensity:
• Register spilling due to large number of loop variables
• Prefetch engines inhibited by multi-layer ghost-zone calculations
Problem Size     P    Power3          Power4          Altix           ES              X1
(per processor)       Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak
80x80x80         16   0.31     21%    0.58     11%    0.89     15%    1.5      18%    0.54     4%
80x80x80         64   0.22     14%    0.50     10%    0.70     12%    1.4      17%    0.43     3%
80x80x80         256  0.22     14%    0.48     9%     ---      ---    1.4      17%    0.41     3%
250x80x80        16   0.10     6%     0.56     11%    0.51     9%     2.8      35%    0.81     6%
250x80x80        64   0.08     6%     ---      ---    0.42     7%     2.7      34%    0.72     6%
250x80x80        256  0.07     5%     ---      ---    ---      ---    2.7      34%    0.68     5%
Material Science: PARATEC
PARATEC performs first-principles quantum mechanical total energy calculation using pseudopotentials & plane wave basis set
Density Functional Theory used to calculate structure & electronic properties of new materials
DFT calculations are one of the largest consumers of supercomputer cycles in the world
Induced current and charge density in crystallized glycine
Uses all-band CG approach to obtain wavefunction of electrons
33% 3D FFT, 33% BLAS3, 33% hand-coded F90
Part of calculation in real space, the rest in Fourier space
• Uses specialized 3D FFT to transform wavefunctions
Computationally intensive - generally obtains high percentage of peak
Developed by Andrew Canning with Louie and Cohen's groups (UCB, LBNL)
PARATEC: Wavefunction Transpose
Transpose from Fourier to real space
3D FFT done via 3 sets of 1D FFTs and 2 transposes (see the sketch below)
Most communication in the global transpose, (b) to (c); little communication from (d) to (e)
Many FFTs done at the same time to avoid latency issues
Only non-zero elements communicated/calculated
Much faster than vendor 3D FFT
(figure: wavefunction transpose stages (a) through (f))
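A minimal serial sketch of the 1D-FFTs-plus-transposes scheme (illustrative only: the naive O(n^2) DFT below stands in for a real 1D FFT, the explicit transposes correspond to the global transposes that become MPI communication in the parallel code, and all names are assumed rather than taken from PARATEC):

  module fft3d_sketch_mod
    implicit none
  contains
    subroutine dft_1d(x, n)                     ! stand-in for a library 1D FFT
      integer, intent(in) :: n
      complex(8), intent(inout) :: x(n)
      complex(8) :: y(n)
      real(8), parameter :: twopi = 6.283185307179586d0
      integer :: j, k
      y = (0.0d0, 0.0d0)
      do k = 1, n
         do j = 1, n
            y(k) = y(k) + x(j) * exp(cmplx(0.0d0, -twopi * (j-1) * (k-1) / n, kind=8))
         end do
      end do
      x = y
    end subroutine dft_1d

    subroutine fft_3d(a, n1, n2, n3)            ! 3D transform = sets of 1D FFTs + transposes
      integer, intent(in) :: n1, n2, n3
      complex(8), intent(inout) :: a(n1, n2, n3)
      complex(8) :: b(n2, n3, n1), c(n3, n1, n2)
      integer :: i, j, k
      do k = 1, n3                              ! FFTs along contiguous dimension 1
         do j = 1, n2
            call dft_1d(a(:, j, k), n1)
         end do
      end do
      forall (i = 1:n1, j = 1:n2, k = 1:n3) b(j, k, i) = a(i, j, k)   ! transpose (global in parallel code)
      do i = 1, n1                              ! FFTs along original dimension 2
         do k = 1, n3
            call dft_1d(b(:, k, i), n2)
         end do
      end do
      forall (i = 1:n1, j = 1:n2, k = 1:n3) c(k, i, j) = b(j, k, i)   ! second transpose
      do j = 1, n2                              ! FFTs along original dimension 3
         do i = 1, n1
            call dft_1d(c(:, i, j), n3)
         end do
      end do
      forall (i = 1:n1, j = 1:n2, k = 1:n3) a(i, j, k) = c(k, i, j)   ! back to original layout
    end subroutine fft_3d
  end module fft3d_sketch_mod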
PARATEC: Performance
Data Size  P    Power3          Power4          Altix           ES              X1
                Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak
432 Atom   32   0.95     63%    2.0      39%    3.7      62%    4.7      60%    3.0      24%
432 Atom   64   0.85     57%    1.7      33%    3.2      54%    4.7      59%    2.6      20%
432 Atom   128  0.74     49%    1.5      29%    ---      ---    4.7      59%    1.9      15%
432 Atom   256  0.57     38%    1.1      21%    ---      ---    4.2      52%    ---      ---
432 Atom   512  0.41     28%    ---      ---    ---      ---    3.4      42%    ---      ---
686 Atom   128  ---      ---    ---      ---    ---      ---    4.9      62%    3.0      24%
686 Atom   256  ---      ---    ---      ---    ---      ---    4.6      57%    1.3      10%
ES achieves fastest performance to date! Over 2 Tflop/s on 1024 processors
Main advantage for this type of code is the fast interconnect system
X1 3.5X slower than ES (although its peak is 50% higher)
• Non-vectorizable code can be much more expensive on X1 (32:1 vs 8:1 scalar:vector ratio)
• Lower bisection bandwidth to computation ratio
Limited scalability due to increasing cost of the global transpose and reduced vector length
Plan to run larger problem sizes at next ES visit
Scalar architectures generally perform well due to high computational intensity
• Power3, Power4, Altix are 8X, 4X, 1.5X slower than ES
Vector architectures allow the opportunity to simulate systems not possible on scalar platforms
Magnetic Fusion: GTC
Gyrokinetic Toroidal Code: transport of thermal energy (plasma microturbulence)
Goal of magnetic fusion is a burning plasma power plant producing cleaner energy
GTC solves the 3D gyroaveraged gyrokinetic system with a particle-in-cell (PIC) approach
PIC scales as N instead of N^2: particles interact with the electromagnetic field on a grid
Allows solving the equation of particle motion with ODEs (instead of nonlinear PDEs)
Main computational tasks:
• Scatter: deposit particle charge to nearest grid points
• Solve: Poisson equation to get the potential at each grid point
• Gather: calculate force on each particle based on neighbors' potential
• Move: advance particles by solving the equation of motion
• Shift: particles moved outside the local domain
3D visualization of electrostatic potential in magnetic fusion device
Developed at Princeton Plasma Physics Laboratory, vectorized by Stephane Ethier
GTC: Scatter operation
Particle charge is deposited among the nearest grid points; force is calculated based on the neighbors' potential, then the particle is moved accordingly
Several particles can contribute to the same grid points, resulting in memory conflicts (dependencies) that prevent vectorization
Solution: VLEN copies of the charge deposition array, with a reduction after the main loop (see the sketch below)
• However, this greatly increases the memory footprint (8X)
Since particles are randomly localized, the scatter also hinders cache reuse
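A minimal sketch of the replicated-array scatter (illustrative only: a 1D grid with linear weighting is assumed here, whereas GTC deposits gyro-averaged charge on a toroidal grid; names and the VLEN value are assumptions):

  subroutine scatter_sketch(xp, q, rho, np, ng)
    implicit none
    integer, parameter :: vlen = 256            ! assumed hardware vector length
    integer, intent(in) :: np, ng
    real(8), intent(in)  :: xp(np), q(np)       ! particle positions (grid units, 1 <= xp < ng) and charges
    real(8), intent(out) :: rho(ng)             ! charge density on the grid
    real(8) :: rho_v(vlen, ng)                  ! VLEN private copies (the 8X memory increase)
    integer :: p, v, ig
    real(8) :: w
    rho_v = 0.0d0
    do p = 1, np                                ! vectorizes: consecutive iterations write distinct copies
       v  = mod(p - 1, vlen) + 1
       ig = int(xp(p))                          ! lower of the two nearest grid points
       w  = xp(p) - dble(ig)                    ! linear interpolation weight
       rho_v(v, ig)     = rho_v(v, ig)     + q(p) * (1.0d0 - w)
       rho_v(v, ig + 1) = rho_v(v, ig + 1) + q(p) * w
    end do
    rho = 0.0d0
    do ig = 1, ng                               ! reduction over the VLEN copies after the main loop
       do v = 1, vlen
          rho(ig) = rho(ig) + rho_v(v, ig)
       end do
    end do
  end subroutine scatter_sketch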
GTC: Performance
Number of Particles  P     Power3          Power4          Altix           ES              X1
                           Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak
10/cell (20M)        32    0.13     9%     0.29     5%     0.29     5%     1.15     14%    1.00     8%
10/cell (20M)        64    0.13     9%     0.32     5%     0.26     4%     1.00     13%    0.80     6%
100/cell (200M)      32    0.13     9%     0.29     5%     0.33     6%     1.62     20%    1.50     12%
100/cell (200M)      64    0.13     9%     0.29     5%     0.31     5%     1.56     20%    1.36     11%
100/cell (200M)      1024  0.06     4%     ---      ---    ---      ---    ---      ---    ---      ---

ES achieves fastest performance of any tested architecture!
• First time the code achieved 20% of peak, compared with less than 10% on superscalar systems
• Vector hybrid (OpenMP) parallelism not possible due to increased memory requirements
• P=64 on ES is 1.6X faster than P=1024 on Power3!
• Reduced scalability due to decreasing vector length, not MPI performance
Non-vectorizable code portions expensive on X1
• Before vectorization, the shift routine accounted for 11% of ES and 54% of X1 overhead
Larger tests could not be performed at ES due to parallelization/vectorization hurdles
• Currently developing a new version with increased particle decomposition
Advantage of ES for PIC codes may reside in higher statistical resolution simulations
• Greater speed allows more particles per cell
Overview
Tremendous potential of vector architectures: 4 codes running faster than ever before
Vector systems allow resolution not possible with scalar architectures (regardless of # of processors)
Opportunity to perform scientific runs at unprecedented scale
ES shows high raw and much higher sustained performance compared with X1
• Limited X1-specific optimization - optimal programming approach still unclear (CAF, etc.)
• Non-vectorizable code segments become very expensive (8:1 or even 32:1 ratio)
• Evaluation codes contain sufficient regularity in computation for high vector performance
• GTC an example of code at odds with data parallelism
• Much more difficult to evaluate codes poorly suited for vectorization
Vectors potentially at odds w/ emerging techniques (irregular, multi-physics, multi-scale)
Plan to expand scope of application domains/methods, and examine latest HPC architectures
Code      % of peak (P=64)                   Speedup of ES (P=max avail) vs.
          Pwr3   Pwr4   Altix  ES    X1      Pwr3   Pwr4   Altix  X1
LBMHD     7%     5%     11%    58%   37%     30.6   15.3   7.2    1.5
CACTUS    6%     11%    7%     34%   6%      45.0   5.1    6.4    4.0
GTC       9%     6%     5%     20%   11%     9.4    4.3    4.1    1.1
PARATEC   57%    33%    54%    58%   20%     8.2    3.9    1.4    3.9
Average                                      23.3   7.2    4.8    2.6
Second ES visit
Evaluate high-concurrency PARATEC performance using large-scale Quantum Dot simulation
Evaluate CACTUS performance using updated vectorization of radiation boundary condition
Evaluate MADCAP performance using a newly optimized version, without global file system requirements and with improved I/O behavior
Examine 3D version of LBMHD, and explore optimization strategies
Evaluate GTC performance using updated vectorization of the shift routine as well as a new particle decomposition approach designed to increase concurrency
Evaluate performance of FVCAM3 (Finite Volume atmospheric model) at high concurrencies and resolutions (1 x 1.25, 0.5 x 0.625, 0.25 x 0.375 degrees)
Papers available at http://crd.lbl.gov/~oliker