Benchmarks on BG/L: Parallel and Serial


Transcript of Benchmarks on BG/L: Parallel and Serial

Page 1: Benchmarks on BG/L:  Parallel and Serial

Benchmarks on BG/L: Parallel and Serial

John A. Gunnels, Mathematical Sciences Dept.

IBM T. J. Watson Research Center

Page 2: Benchmarks on BG/L:  Parallel and Serial

Overview

Single node benchmarks
  Architecture
  Algorithms
Linpack
  Dealing with a bottleneck
  Communication operations
Benchmarks of the Future

Page 3: Benchmarks on BG/L:  Parallel and Serial

Compute Node: BG/L

Dual core
Dual FPU/SIMD
  Alignment issues
Three-level cache
  Pre-fetching
  Non-coherent L1 caches: 32 KB, 64-way, round-robin
  L2 & L3 caches coherent
  Outstanding L1 misses (limited)

Page 4: Benchmarks on BG/L:  Parallel and Serial

Programming Options: High to Low Level

Compiler optimization to find SIMD parallelism
  User input for specifying memory alignment and lack of aliasing (see the sketch below)
    alignx assertion
    disjoint pragma
Dual FPU intrinsics ("built-ins")
  Complex data type used to model a pair of double-precision numbers that occupies a (P, S) register pair
  Compiler responsible for register allocation and scheduling
In-line assembly
  User responsible for instruction selection, register allocation, and scheduling
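
To make the high-level end of this range concrete, here is a minimal vector-add sketch using the hints named above, assuming IBM XL C for BG/L. The __alignx assertion and the disjoint pragma are the documented XL hints; the dual-FPU built-ins (__lfpd, __stfpd, __fpadd) and their exact argument forms should be checked against the XL C built-in reference, so treat this as an illustration rather than the kernel used in the benchmarks.

    /* Sketch only: a vector add using the XL C hints and dual-FPU built-ins
     * listed on this slide.  Assumes n is even and all pointers are
     * 16-byte aligned, as asserted below. */
    void vadd(double *a, double *b, double *c, int n)
    {
    #pragma disjoint(*a, *b, *c)   /* assert: no aliasing among a, b, c      */
        int i;
        __alignx(16, a);           /* assert: 16-byte alignment, so the      */
        __alignx(16, b);           /* compiler may issue quad-word (P,S)     */
        __alignx(16, c);           /* parallel loads and stores              */
        for (i = 0; i < n; i += 2) {
            /* One dual-FPU step: load a register pair from each operand,
             * add element-wise, store the result pair. */
            double _Complex pa = __lfpd(&a[i]);
            double _Complex pb = __lfpd(&b[i]);
            __stfpd(&c[i], __fpadd(pa, pb));
        }
    }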

Page 5: Benchmarks on BG/L:  Parallel and Serial

BG/L Single-Node STREAM Performance (444 MHz), 28 July 2003

[Chart: bandwidth (MB/s, 0-3000) vs. vector size (8-byte elements, up to 2,000,000) for the tuned and out-of-box (OOB) versions of the copy, scale, add, and triad kernels.]

STREAM Performance

Out-of-box performance is 50-65% of tuned performance; lessons learned in tuning will be transferred to the compiler where possible.

Comparison with commodity microprocessors is competitive:

Machine        Frequency (MHz)   STREAM (MB/s)   FP peak (Mflop/s)   Balance (B/F)
Intel Xeon     3060              2900            6120                0.474
BG/L           444               2355            3552                0.663
BG/L           670               3579            5360                0.668
AMD Opteron    1800              3600            3600                1.000
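
As a point of reference, the STREAM kernels being measured are tiny loops; below is a plain-C sketch of the triad kernel (essentially the out-of-box baseline, not the tuned BG/L version) together with the balance calculation behind the last column of the table.

    /* Plain-C STREAM triad sketch (the out-of-box baseline); the tuned BG/L
     * version adds alignment assertions, unrolling, and dual-FPU loads. */
    void stream_triad(int n, double *a, const double *b, const double *c,
                      double scalar)
    {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + scalar * c[i];
    }

    /* Balance (bytes per flop) as used in the table above:
     * e.g. BG/L at 444 MHz: 2355 MB/s / 3552 Mflop/s = 0.663 B/F. */
    double balance(double stream_mb_s, double peak_mflop_s)
    {
        return stream_mb_s / peak_mflop_s;
    }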

Page 6: Benchmarks on BG/L:  Parallel and Serial

DAXPY Bandwidth Utilization

[Chart: memory bandwidth utilization (bytes/cycle) vs. vector size (bytes, up to ~90,000) for three DAXPY implementations: intrinsics, assembly, and vanilla. Reference lines mark 16 bytes/cycle (L1 bandwidth) and 5.3 bytes/cycle (L3 bandwidth).]
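
The kernel behind this chart is DAXPY. Below is a plain-C sketch of the "vanilla" form and a two-way unrolled form that mirrors the (P, S) pairing the intrinsic and assembly versions exploit; the tuned curves on the chart came from intrinsics and hand-written assembly, not from this code.

    /* y := a*x + y.  "Vanilla" form, as the compiler sees it out of the box. */
    void daxpy_vanilla(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    /* Two-way unrolled form: each iteration touches one (P,S) pair, which is
     * the shape the dual-FPU intrinsic and assembly versions build on. */
    void daxpy_unrolled2(int n, double a, const double *x, double *y)
    {
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            y[i]     += a * x[i];
            y[i + 1] += a * x[i + 1];
        }
        for (; i < n; i++)          /* remainder element when n is odd */
            y[i] += a * x[i];
    }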

Page 7: Benchmarks on BG/L:  Parallel and Serial

Matrix Multiplication: Tiling for Registers (Analysis)

Latency tolerance (not bandwidth)
Take advantage of the register count
Unroll by a factor of two
  24 register pairs
  32 cycles per unrolled iteration
  15-cycle load-to-use latency (L2 hit)
Could go to a 3-way unroll if needed
  32 register pairs
  32 cycles per unrolled iteration
  31-cycle load-to-use latency

[Diagram: register-tiling block structure; tiles labeled F1, F2, M1, and M2 with block dimensions 8, 8, 16, and 16.]
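
To illustrate register tiling (with a toy 2x2 block, not the block shapes analyzed above), the sketch below keeps four accumulators in registers across the k loop, so each loaded element of A and B is reused and loads can be scheduled well ahead of their use to tolerate latency.

    /* 2x2 register-blocked update of C(i:i+1, j:j+1) for column-major
     * matrices: A is m x kk (leading dimension lda), B is kk x n (ldb),
     * C is m x n (ldc).  Illustration only, not the BG/L kernel. */
    void dgemm_2x2_block(int i, int j, int kk, int lda, int ldb, int ldc,
                         const double *A, const double *B, double *C)
    {
        double c00 = 0.0, c01 = 0.0, c10 = 0.0, c11 = 0.0;   /* accumulators */
        for (int p = 0; p < kk; p++) {
            double a0 = A[i     + p * lda];    /* two elements of a column of A */
            double a1 = A[i + 1 + p * lda];
            double b0 = B[p + j       * ldb];  /* two elements of a row of B    */
            double b1 = B[p + (j + 1) * ldb];
            c00 += a0 * b0;  c01 += a0 * b1;   /* each loaded value reused twice */
            c10 += a1 * b0;  c11 += a1 * b1;
        }
        C[i     + j       * ldc] += c00;
        C[i + 1 + j       * ldc] += c10;
        C[i     + (j + 1) * ldc] += c01;
        C[i + 1 + (j + 1) * ldc] += c11;
    }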

Page 8: Benchmarks on BG/L:  Parallel and Serial

Recursive Data Format

Mapping 2-D (matrix) to 1-D (RAM)
  C/Fortran do not map well
Space-filling curve approximation
  Recursive tiling
Enables streaming/pre-fetching
Dual-core "scaling"

[Diagram: recursive tiling hierarchy with L3 cache blocks, L1 cache blocks, dual register blocks, and register-set blocks.]
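
A hedged sketch of the idea, not the actual BG/L data format: store the matrix as contiguous register-sized blocks inside contiguous L1-sized blocks, so the address of element (i, j) is computed by peeling off block coordinates level by level. The block edges RB and L1B below are made-up values chosen only to show the address arithmetic; a space-filling-curve ordering would reorder the blocks but not change the principle.

    #define RB   4     /* register-block edge (hypothetical)            */
    #define L1B  32    /* L1-cache-block edge, a multiple of RB (hyp.)  */

    /* Offset of element (i, j) of an N x N matrix stored as contiguous
     * register blocks inside contiguous L1 blocks (N a multiple of L1B).
     * Each level is laid out row-major over its blocks. */
    long blocked_offset(long i, long j, long N)
    {
        long bi1 = i / L1B, bj1 = j / L1B;        /* which L1 block        */
        long i1  = i % L1B, j1  = j % L1B;        /* position inside it    */
        long bi0 = i1 / RB, bj0 = j1 / RB;        /* which register block  */
        long i0  = i1 % RB, j0  = j1 % RB;        /* position inside it    */

        long l1_blocks_per_row = N / L1B;
        long rb_per_l1_row     = L1B / RB;

        return ((bi1 * l1_blocks_per_row + bj1) * (long)L1B * L1B)  /* skip whole L1 blocks  */
             + ((bi0 * rb_per_l1_row     + bj0) * (long)RB  * RB)   /* skip register blocks  */
             + (i0 * RB + j0);                                      /* inside the block      */
    }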

Page 9: Benchmarks on BG/L:  Parallel and Serial

Dual Core

Why? It’s an effortless way to double your performance.

Page 10: Benchmarks on BG/L:  Parallel and Serial

Dual Core

Why? It exploits the architecture and may allow one to double the performance of one’s code in some cases/regions.

Page 11: Benchmarks on BG/L:  Parallel and Serial

Single-Node DGEMM Performance at 92% of Peak

Single-node DGEMM (444 MHz), 18 July 2003

[Chart: performance (GFlop/s, 0-4) vs. matrix size N (0-250) for single-core and dual-core code, each plotted against its peak. The dual-core curve reaches 92.27% of peak.]

Near-perfect scalability (1.99) going from single core to dual core
Dual-core code delivers 92.27% of peak flops (8 flop/pclk)
Performance (as a fraction of peak) is competitive with that of Power3 and Power4

Page 12: Benchmarks on BG/L:  Parallel and Serial

Performance Scales Linearly with Clock Frequency

[Chart: STREAM COPY bandwidth (MB/s, 1000-4000) vs. clock frequency (400-640 MHz) and vector size N; speed test of 25 July 2003.]

Measured performance of DGEMM and STREAM scales linearly with frequency
DGEMM at 650 MHz delivers 4.79 Gflop/s
STREAM COPY at 670 MHz delivers 3579 MB/s

[Chart: DGEMM performance (GFlop/s, 1-5) vs. clock frequency (400-640 MHz) and matrix size N; speed test of 25 July 2003.]

Page 13: Benchmarks on BG/L:  Parallel and Serial

The Linpack Benchmark

Page 14: Benchmarks on BG/L:  Parallel and Serial

LU Factorization: Brief Review

[Diagram: blocked LU factorization. The already-factored part sits to the left of the current block column; the panel is pivoted and its columns scaled, DTRSM updates the block row to its right, and DGEMM updates the trailing submatrix.]
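
The step the slide reviews can be written as a short right-looking loop body. The sketch below uses standard CBLAS calls for the DTRSM and DGEMM pieces; factor_panel() and apply_pivots() are hypothetical stand-ins for the real panel factorization and row-swap code, not HPL's actual API.

    #include <cblas.h>

    /* Hypothetical helpers standing in for the real panel code: */
    void factor_panel(double *panel, int *ipiv, int m, int nb, int lda);
    void apply_pivots(double *A, const int *ipiv, int n, int lda, int k, int nb);

    /* One right-looking step of blocked LU with partial pivoting on an
     * n x n column-major matrix A (leading dimension lda), block size nb,
     * current block column starting at k.  Sketch only. */
    void lu_block_step(double *A, int *ipiv, int n, int lda, int k, int nb)
    {
        int m  = n - k;             /* rows in the panel                 */
        int tn = n - k - nb;        /* columns in the trailing submatrix */

        /* 1. Factor the current panel (pivot and scale columns). */
        factor_panel(&A[k + k * lda], &ipiv[k], m, nb, lda);

        /* 2. Apply the panel's row swaps to the rest of the matrix. */
        apply_pivots(A, ipiv, n, lda, k, nb);

        if (tn <= 0) return;

        /* 3. DTRSM: solve L11 * U12 = A12 for the block row U12. */
        cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                    CblasUnit, nb, tn, 1.0,
                    &A[k + k * lda], lda, &A[k + (k + nb) * lda], lda);

        /* 4. DGEMM: trailing update A22 -= L21 * U12 (the bulk of the work). */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m - nb, tn, nb, -1.0,
                    &A[(k + nb) + k * lda], lda,
                    &A[k + (k + nb) * lda], lda,
                    1.0, &A[(k + nb) + (k + nb) * lda], lda);
    }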

Page 15: Benchmarks on BG/L:  Parallel and Serial

LINPACK: Problem Mapping

[Diagram: mapping of the N x N matrix onto the machine; the block pattern repeats 16n times in one dimension and n times in the other.]

Page 16: Benchmarks on BG/L:  Parallel and Serial

Panel Factorization: Option #1

Stagger the computations
  Panel factorization (PF) is distributed over relatively few processors
  May take as long as several DGEMM updates
  DGEMM load imbalance
    Block size trades balance for speed
Use collective communication primitives
  May require no "holes" in the communication fabric

Page 17: Benchmarks on BG/L:  Parallel and Serial

Speed-up Option #2

Change the data distribution (sketched below)
  Decrease the critical path length
  Consider the communication abilities of the machine
Complements Option #1
  Memory size (small favors #2; large favors #1)
  Memory hierarchy (higher latency favors #1)
The two options can be used in concert
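
For reference, the sketch below shows the standard 2-D block-cyclic mapping used by HPL-style Linpack codes: global block (I, J) lands on process (I mod P, J mod Q) of a P x Q grid. The BG/L runs may remap or reshape this distribution, so treat it as the baseline being changed, not the final mapping.

    /* Owner of global block (I, J) and the local block coordinates on that
     * process, under a 2-D block-cyclic distribution over a P x Q grid.
     * Sketch of the standard scheme; BG/L-specific remapping not shown. */
    typedef struct { int prow, pcol, lrow, lcol; } BlockOwner;

    BlockOwner block_owner(int I, int J, int P, int Q)
    {
        BlockOwner o;
        o.prow = I % P;     /* process-grid row holding block row I    */
        o.pcol = J % Q;     /* process-grid column holding block col J */
        o.lrow = I / P;     /* local block-row index on that process   */
        o.lcol = J / Q;     /* local block-column index                */
        return o;
    }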

Page 18: Benchmarks on BG/L:  Parallel and Serial

Communication Routines

Broadcasts precede the DGEMM update
Needs to be architecturally aware
  Multiple "pipes" connect processors
  Physical-to-logical mapping
Careful orchestration is required to take advantage of the machine's considerable abilities

Pages 19-30: Benchmarks on BG/L:  Parallel and Serial

Row Broadcast: Mesh

[Animated sequence of diagrams showing a row broadcast on the mesh network (diagrams not captured in the transcript). One schedule produces a hot spot at a node that receives 2 and sends 4; the revised schedule avoids it (receive 2, send 3).]

Pages 31-35: Benchmarks on BG/L:  Parallel and Serial

Row Broadcast: Torus (sorry for the "fruit salad")

[Animated sequence of diagrams showing a row broadcast on the torus network; diagrams not captured in the transcript.]

Page 36: Benchmarks on BG/L:  Parallel and Serial

Broadcast Bandwidth/Latency

Bandwidth: 2 bytes/cycle per wire
Latency: sqrt(p), pipelined (large messages); deposit bit: 3 hops
Mesh: Recv 2 / Send 3
Torus: Recv 4 / Send 4 (no "hot spot"); Recv 2 / Send 2 (red-blue only … again, no bottleneck)
Pipe: Recv/Send 1/1 on mesh; 2/2 on torus
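
To illustrate the "pipelined, large message" model behind these numbers, here is a hedged MPI sketch of a chunked pipeline along a row of processes. The real BG/L row broadcast is written against the torus hardware and drives multiple pipes at once, which plain MPI point-to-point code does not express.

    #include <mpi.h>

    /* Pipelined broadcast of `count` doubles from rank 0 along a 1-D row
     * communicator: each rank receives a chunk from its left neighbor and
     * immediately forwards it to the right, so successive chunks flow down
     * the row and, for large messages, the pipe stays full. */
    void row_bcast_pipeline(double *buf, int count, int chunk, MPI_Comm row)
    {
        int rank, size;
        MPI_Comm_rank(row, &rank);
        MPI_Comm_size(row, &size);

        for (int off = 0; off < count; off += chunk) {
            int len = (count - off < chunk) ? (count - off) : chunk;
            if (rank > 0)            /* receive this chunk from the left */
                MPI_Recv(buf + off, len, MPI_DOUBLE, rank - 1, 0, row,
                         MPI_STATUS_IGNORE);
            if (rank < size - 1)     /* forward it to the right          */
                MPI_Send(buf + off, len, MPI_DOUBLE, rank + 1, 0, row);
        }
    }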

Page 37: Benchmarks on BG/L:  Parallel and Serial

What Else? It’s a(n) …

FPU Test
Memory Test
Power Test
Torus Test
Mode Test

Page 38: Benchmarks on BG/L:  Parallel and Serial

Conclusion

1.435 TF Linpack: #73 on the TOP500 list (11/2003)
Limited machine access time
  Made analysis (prediction) more important
500 MHz chip
  1.507 TF run at 525 MHz demonstrates scaling
  Would achieve >2 TF at 700 MHz
  1 TF even if the machine were used in "true" heater mode

Page 39: Benchmarks on BG/L:  Parallel and Serial

Conclusion

1.4 TF Linpack on BG/L Prototype: Components

[Pie chart: time breakdown of the run by component. Components: Scale, Rank1, Gemm, Trsm, BcastA, BcastD, Pack, Unpack, Idamax, pdgemm, FWDPiv, BackPiv, Waiting. The largest slice accounts for 81.35% of the time; the remaining shares range from 0.01% to 8.97%.]

Page 40: Benchmarks on BG/L:  Parallel and Serial

Additional Conclusions

Models, extrapolated data
  Use models to the extent that the architecture and algorithm are understood
  Extrapolate from small processor sets
Vary as many (yes) parameters as possible at the same time
  Consider how they interact and how they don't
Also remember that instruments affect timing
  Often one can compensate (incorrect answers can result)
Utilize observed "eccentricities" with caution (MPI_Reduce)

Page 41: Benchmarks on BG/L:  Parallel and Serial

Current Fronts

HPC Challenge Benchmark Suite: STREAMS, HPL, etc.
HPCS Productivity Benchmarks
Math Libraries
Focused feedback to Toronto: PERCS compiler / persistent optimization
Linpack algorithm on other machines

Page 42: Benchmarks on BG/L:  Parallel and Serial

Thanks to …

Leonardo Bachega: BLAS-1, performance results

Sid Chatterjee, Xavier Martorell: Coprocessor, BLAS-1

Fred Gustavson, James Sexton: Data structure investigations, design, sanity tests

Page 43: Benchmarks on BG/L:  Parallel and Serial

Thanks to …

Gheorghe Almasi, Phil Heidelberger & Nils Smeds: MPI/communications
Vernon Austel: data copy routines
Gerry Kopcsay & Jose Moreira: system & machine configuration
Derek Lieber & Martin Ohmacht: refined memory settings
Everyone else: system software, hardware, & machine time!

Page 44: Benchmarks on BG/L:  Parallel and Serial

Benchmarks on BG/L: Parallel and Serial

John A. Gunnels, Mathematical Sciences Dept.

IBM T. J. Watson Research Center