Benchmarks on BG/L: Parallel and Serial


Transcript of Benchmarks on BG/L: Parallel and Serial

Page 1: Benchmarks on BG/L:  Parallel and Serial

Benchmarks on BG/L: Parallel and Serial

John A. Gunnels, Mathematical Sciences Dept.

IBM T. J. Watson Research Center

Page 2: Benchmarks on BG/L:  Parallel and Serial

Overview

Single node benchmarks
  Architecture
  Algorithms
Linpack
  Dealing with a bottleneck
  Communication operations
Benchmarks of the Future

Page 3: Benchmarks on BG/L:  Parallel and Serial

Compute Node: BG/L

Dual core
Dual FPU/SIMD
  Alignment issues
Three-level cache
  Pre-fetching
  Non-coherent L1 caches: 32 KB, 64-way, round-robin
  L2 & L3 caches coherent
  Outstanding L1 misses (limited)

Page 4: Benchmarks on BG/L:  Parallel and Serial

Programming Options: High to Low Level

Compiler optimization to find SIMD parallelism
  User input for specifying memory alignment and lack of aliasing (see the sketch below)
    alignx assertion
    disjoint pragma
Dual FPU intrinsics ("built-ins")
  Complex data type used to model a pair of double-precision numbers that occupies a (P, S) register pair
  Compiler responsible for register allocation and scheduling
In-line assembly
  User responsible for instruction selection, register allocation, and scheduling
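
To make the high-level end of this range concrete, here is a minimal vector-add sketch using the hints named above, assuming IBM XL C for BG/L. The __alignx assertion and the disjoint pragma are the documented XL hints; the dual-FPU built-ins (__lfpd, __stfpd, __fpadd) and their exact argument forms should be checked against the XL C built-in reference, so treat this as an illustration rather than the kernel used in the benchmarks.

    /* Sketch only: a vector add using the XL C hints and dual-FPU built-ins
     * listed on this slide.  Assumes n is even and all pointers are
     * 16-byte aligned, as asserted below. */
    void vadd(double *a, double *b, double *c, int n)
    {
    #pragma disjoint(*a, *b, *c)   /* assert: no aliasing among a, b, c      */
        int i;
        __alignx(16, a);           /* assert: 16-byte alignment, so the      */
        __alignx(16, b);           /* compiler may issue quad-word (P,S)     */
        __alignx(16, c);           /* parallel loads and stores              */
        for (i = 0; i < n; i += 2) {
            /* One dual-FPU step: load a register pair from each operand,
             * add element-wise, store the result pair. */
            double _Complex pa = __lfpd(&a[i]);
            double _Complex pb = __lfpd(&b[i]);
            __stfpd(&c[i], __fpadd(pa, pb));
        }
    }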

Page 5: Benchmarks on BG/L:  Parallel and Serial

BG/L Single-Node STREAM Performance (444 MHz), 28 July 2003

[Chart: bandwidth (MB/s, 0-3000) vs. vector size (8-byte elements, up to 2,000,000) for the tuned and out-of-box (OOB) versions of the copy, scale, add, and triad kernels.]

STREAM Performance

Out-of-box performance is 50-65% of tuned performance; lessons learned in tuning will be transferred to the compiler where possible.

Comparison with commodity microprocessors is competitive:

Machine        Frequency (MHz)   STREAM (MB/s)   FP peak (Mflop/s)   Balance (B/F)
Intel Xeon     3060              2900            6120                0.474
BG/L           444               2355            3552                0.663
BG/L           670               3579            5360                0.668
AMD Opteron    1800              3600            3600                1.000
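
As a point of reference, the STREAM kernels being measured are tiny loops; below is a plain-C sketch of the triad kernel (essentially the out-of-box baseline, not the tuned BG/L version) together with the balance calculation behind the last column of the table.

    /* Plain-C STREAM triad sketch (the out-of-box baseline); the tuned BG/L
     * version adds alignment assertions, unrolling, and dual-FPU loads. */
    void stream_triad(int n, double *a, const double *b, const double *c,
                      double scalar)
    {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + scalar * c[i];
    }

    /* Balance (bytes per flop) as used in the table above:
     * e.g. BG/L at 444 MHz: 2355 MB/s / 3552 Mflop/s = 0.663 B/F. */
    double balance(double stream_mb_s, double peak_mflop_s)
    {
        return stream_mb_s / peak_mflop_s;
    }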

Page 6: Benchmarks on BG/L:  Parallel and Serial

DAXPY Bandwidth Utilization

[Chart: memory bandwidth utilization (bytes/cycle) vs. vector size (bytes, up to ~90,000) for three DAXPY implementations: intrinsics, assembly, and vanilla. Reference lines mark 16 bytes/cycle (L1 bandwidth) and 5.3 bytes/cycle (L3 bandwidth).]
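
The kernel behind this chart is DAXPY. Below is a plain-C sketch of the "vanilla" form and a two-way unrolled form that mirrors the (P, S) pairing the intrinsic and assembly versions exploit; the tuned curves on the chart came from intrinsics and hand-written assembly, not from this code.

    /* y := a*x + y.  "Vanilla" form, as the compiler sees it out of the box. */
    void daxpy_vanilla(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    /* Two-way unrolled form: each iteration touches one (P,S) pair, which is
     * the shape the dual-FPU intrinsic and assembly versions build on. */
    void daxpy_unrolled2(int n, double a, const double *x, double *y)
    {
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            y[i]     += a * x[i];
            y[i + 1] += a * x[i + 1];
        }
        for (; i < n; i++)          /* remainder element when n is odd */
            y[i] += a * x[i];
    }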

Page 7: Benchmarks on BG/L:  Parallel and Serial

Matrix Multiplication: Tiling for Registers (Analysis)

Latency tolerance (not bandwidth)
Take advantage of the register count
Unroll by a factor of two
  24 register pairs
  32 cycles per unrolled iteration
  15-cycle load-to-use latency (L2 hit)
Could go to a 3-way unroll if needed
  32 register pairs
  32 cycles per unrolled iteration
  31-cycle load-to-use latency

[Diagram: register-tiling block structure; tiles labeled F1, F2, M1, and M2 with block dimensions 8, 8, 16, and 16.]
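
To illustrate register tiling (with a toy 2x2 block, not the block shapes analyzed above), the sketch below keeps four accumulators in registers across the k loop, so each loaded element of A and B is reused and loads can be scheduled well ahead of their use to tolerate latency.

    /* 2x2 register-blocked update of C(i:i+1, j:j+1) for column-major
     * matrices: A is m x kk (leading dimension lda), B is kk x n (ldb),
     * C is m x n (ldc).  Illustration only, not the BG/L kernel. */
    void dgemm_2x2_block(int i, int j, int kk, int lda, int ldb, int ldc,
                         const double *A, const double *B, double *C)
    {
        double c00 = 0.0, c01 = 0.0, c10 = 0.0, c11 = 0.0;   /* accumulators */
        for (int p = 0; p < kk; p++) {
            double a0 = A[i     + p * lda];    /* two elements of a column of A */
            double a1 = A[i + 1 + p * lda];
            double b0 = B[p + j       * ldb];  /* two elements of a row of B    */
            double b1 = B[p + (j + 1) * ldb];
            c00 += a0 * b0;  c01 += a0 * b1;   /* each loaded value reused twice */
            c10 += a1 * b0;  c11 += a1 * b1;
        }
        C[i     + j       * ldc] += c00;
        C[i + 1 + j       * ldc] += c10;
        C[i     + (j + 1) * ldc] += c01;
        C[i + 1 + (j + 1) * ldc] += c11;
    }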

Page 8: Benchmarks on BG/L:  Parallel and Serial

Recursive Data Format

Mapping 2-D (matrix) to 1-D (RAM)
  C/Fortran do not map well
Space-filling curve approximation
  Recursive tiling
Enables streaming/pre-fetching
Dual-core "scaling"

[Diagram: recursive tiling hierarchy with L3 cache blocks, L1 cache blocks, dual register blocks, and register-set blocks.]
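
A hedged sketch of the idea, not the actual BG/L data format: store the matrix as contiguous register-sized blocks inside contiguous L1-sized blocks, so the address of element (i, j) is computed by peeling off block coordinates level by level. The block edges RB and L1B below are made-up values chosen only to show the address arithmetic; a space-filling-curve ordering would reorder the blocks but not change the principle.

    #define RB   4     /* register-block edge (hypothetical)            */
    #define L1B  32    /* L1-cache-block edge, a multiple of RB (hyp.)  */

    /* Offset of element (i, j) of an N x N matrix stored as contiguous
     * register blocks inside contiguous L1 blocks (N a multiple of L1B).
     * Each level is laid out row-major over its blocks. */
    long blocked_offset(long i, long j, long N)
    {
        long bi1 = i / L1B, bj1 = j / L1B;        /* which L1 block        */
        long i1  = i % L1B, j1  = j % L1B;        /* position inside it    */
        long bi0 = i1 / RB, bj0 = j1 / RB;        /* which register block  */
        long i0  = i1 % RB, j0  = j1 % RB;        /* position inside it    */

        long l1_blocks_per_row = N / L1B;
        long rb_per_l1_row     = L1B / RB;

        return ((bi1 * l1_blocks_per_row + bj1) * (long)L1B * L1B)  /* skip whole L1 blocks  */
             + ((bi0 * rb_per_l1_row     + bj0) * (long)RB  * RB)   /* skip register blocks  */
             + (i0 * RB + j0);                                      /* inside the block      */
    }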

Page 9: Benchmarks on BG/L:  Parallel and Serial

Dual Core

Why? It’s an effortless way to double your performance.

Page 10: Benchmarks on BG/L:  Parallel and Serial

Dual Core

Why? It exploits the architecture and may allow one to double the performance of one’s code in some cases/regions.

Page 11: Benchmarks on BG/L:  Parallel and Serial

Single-Node DGEMM Performance at 92% of Peak

Single-node DGEMM (444 MHz), 18 July 2003

[Chart: performance (GFlop/s, 0-4) vs. matrix size N (0-250) for single-core and dual-core code, each plotted against its peak. The dual-core curve reaches 92.27% of peak.]

Near-perfect scalability (1.99) going from single core to dual core
Dual-core code delivers 92.27% of peak flops (8 flop/pclk)
Performance (as a fraction of peak) is competitive with that of Power3 and Power4

Page 12: Benchmarks on BG/L:  Parallel and Serial

Performance Scales Linearly with Clock Frequency

[Chart: STREAM COPY bandwidth (MB/s, 1000-4000) vs. clock frequency (400-640 MHz) and vector size N; speed test of 25 July 2003.]

Measured performance of DGEMM and STREAM scales linearly with frequency
DGEMM at 650 MHz delivers 4.79 Gflop/s
STREAM COPY at 670 MHz delivers 3579 MB/s

[Chart: DGEMM performance (GFlop/s, 1-5) vs. clock frequency (400-640 MHz) and matrix size N; speed test of 25 July 2003.]

Page 13: Benchmarks on BG/L:  Parallel and Serial

The Linpack Benchmark

Page 14: Benchmarks on BG/L:  Parallel and Serial

LU Factorization: Brief Review

[Diagram: blocked LU factorization. The already-factored part sits to the left of the current block column; the panel is pivoted and its columns scaled, DTRSM updates the block row to its right, and DGEMM updates the trailing submatrix.]
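
The step the slide reviews can be written as a short right-looking loop body. The sketch below uses standard CBLAS calls for the DTRSM and DGEMM pieces; factor_panel() and apply_pivots() are hypothetical stand-ins for the real panel factorization and row-swap code, not HPL's actual API.

    #include <cblas.h>

    /* Hypothetical helpers standing in for the real panel code: */
    void factor_panel(double *panel, int *ipiv, int m, int nb, int lda);
    void apply_pivots(double *A, const int *ipiv, int n, int lda, int k, int nb);

    /* One right-looking step of blocked LU with partial pivoting on an
     * n x n column-major matrix A (leading dimension lda), block size nb,
     * current block column starting at k.  Sketch only. */
    void lu_block_step(double *A, int *ipiv, int n, int lda, int k, int nb)
    {
        int m  = n - k;             /* rows in the panel                 */
        int tn = n - k - nb;        /* columns in the trailing submatrix */

        /* 1. Factor the current panel (pivot and scale columns). */
        factor_panel(&A[k + k * lda], &ipiv[k], m, nb, lda);

        /* 2. Apply the panel's row swaps to the rest of the matrix. */
        apply_pivots(A, ipiv, n, lda, k, nb);

        if (tn <= 0) return;

        /* 3. DTRSM: solve L11 * U12 = A12 for the block row U12. */
        cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                    CblasUnit, nb, tn, 1.0,
                    &A[k + k * lda], lda, &A[k + (k + nb) * lda], lda);

        /* 4. DGEMM: trailing update A22 -= L21 * U12 (the bulk of the work). */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m - nb, tn, nb, -1.0,
                    &A[(k + nb) + k * lda], lda,
                    &A[k + (k + nb) * lda], lda,
                    1.0, &A[(k + nb) + (k + nb) * lda], lda);
    }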

Page 15: Benchmarks on BG/L:  Parallel and Serial

LINPACK: Problem Mapping

[Diagram: mapping of the N x N matrix onto the machine; the block pattern repeats 16n times in one dimension and n times in the other.]

Page 16: Benchmarks on BG/L:  Parallel and Serial

Panel Factorization: Option #1

Stagger the computations
  Panel factorization (PF) is distributed over relatively few processors
  May take as long as several DGEMM updates
  DGEMM load imbalance
    Block size trades balance for speed
Use collective communication primitives
  May require no "holes" in the communication fabric

Page 17: Benchmarks on BG/L:  Parallel and Serial

Speed-up Option #2

Change the data distribution (sketched below)
  Decrease the critical path length
  Consider the communication abilities of the machine
Complements Option #1
  Memory size (small favors #2; large favors #1)
  Memory hierarchy (higher latency favors #1)
The two options can be used in concert
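
For reference, the sketch below shows the standard 2-D block-cyclic mapping used by HPL-style Linpack codes: global block (I, J) lands on process (I mod P, J mod Q) of a P x Q grid. The BG/L runs may remap or reshape this distribution, so treat it as the baseline being changed, not the final mapping.

    /* Owner of global block (I, J) and the local block coordinates on that
     * process, under a 2-D block-cyclic distribution over a P x Q grid.
     * Sketch of the standard scheme; BG/L-specific remapping not shown. */
    typedef struct { int prow, pcol, lrow, lcol; } BlockOwner;

    BlockOwner block_owner(int I, int J, int P, int Q)
    {
        BlockOwner o;
        o.prow = I % P;     /* process-grid row holding block row I    */
        o.pcol = J % Q;     /* process-grid column holding block col J */
        o.lrow = I / P;     /* local block-row index on that process   */
        o.lcol = J / Q;     /* local block-column index                */
        return o;
    }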

Page 18: Benchmarks on BG/L:  Parallel and Serial

Communication Routines

Broadcasts precede the DGEMM update
Needs to be architecturally aware
  Multiple "pipes" connect processors
  Physical-to-logical mapping
Careful orchestration is required to take advantage of the machine's considerable abilities

Pages 19-30: Benchmarks on BG/L:  Parallel and Serial

Row Broadcast: Mesh

[Animated sequence of diagrams showing a row broadcast on the mesh network (diagrams not captured in the transcript). One schedule produces a hot spot at a node that receives 2 and sends 4; the revised schedule avoids it (receive 2, send 3).]

Pages 31-35: Benchmarks on BG/L:  Parallel and Serial

Row Broadcast: Torus (sorry for the "fruit salad")

[Animated sequence of diagrams showing a row broadcast on the torus network; diagrams not captured in the transcript.]

Page 36: Benchmarks on BG/L:  Parallel and Serial

Broadcast Bandwidth/Latency

Bandwidth: 2 bytes/cycle per wire
Latency: sqrt(p), pipelined (large messages); deposit bit: 3 hops
Mesh: Recv 2 / Send 3
Torus: Recv 4 / Send 4 (no "hot spot"); Recv 2 / Send 2 (red-blue only … again, no bottleneck)
Pipe: Recv/Send 1/1 on mesh; 2/2 on torus
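
To illustrate the "pipelined, large message" model behind these numbers, here is a hedged MPI sketch of a chunked pipeline along a row of processes. The real BG/L row broadcast is written against the torus hardware and drives multiple pipes at once, which plain MPI point-to-point code does not express.

    #include <mpi.h>

    /* Pipelined broadcast of `count` doubles from rank 0 along a 1-D row
     * communicator: each rank receives a chunk from its left neighbor and
     * immediately forwards it to the right, so successive chunks flow down
     * the row and, for large messages, the pipe stays full. */
    void row_bcast_pipeline(double *buf, int count, int chunk, MPI_Comm row)
    {
        int rank, size;
        MPI_Comm_rank(row, &rank);
        MPI_Comm_size(row, &size);

        for (int off = 0; off < count; off += chunk) {
            int len = (count - off < chunk) ? (count - off) : chunk;
            if (rank > 0)            /* receive this chunk from the left */
                MPI_Recv(buf + off, len, MPI_DOUBLE, rank - 1, 0, row,
                         MPI_STATUS_IGNORE);
            if (rank < size - 1)     /* forward it to the right          */
                MPI_Send(buf + off, len, MPI_DOUBLE, rank + 1, 0, row);
        }
    }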

Page 37: Benchmarks on BG/L:  Parallel and Serial

What Else? It’s a(n) …

FPU Test
Memory Test
Power Test
Torus Test
Mode Test

Page 38: Benchmarks on BG/L:  Parallel and Serial

Conclusion

1.435 TF Linpack: #73 on the TOP500 list (11/2003)
Limited machine access time
  Made analysis (prediction) more important
500 MHz chip
  1.507 TF run at 525 MHz demonstrates scaling
  Would achieve >2 TF at 700 MHz
  1 TF even if the machine were used in "true" heater mode

Page 39: Benchmarks on BG/L:  Parallel and Serial

Conclusion

1.4 TF Linpack on BG/L Prototype: Components

[Pie chart: time breakdown of the run by component. Components: Scale, Rank1, Gemm, Trsm, BcastA, BcastD, Pack, Unpack, Idamax, pdgemm, FWDPiv, BackPiv, Waiting. The largest slice accounts for 81.35% of the time; the remaining shares range from 0.01% to 8.97%.]

Page 40: Benchmarks on BG/L:  Parallel and Serial

Additional Conclusions

Models, extrapolated data
  Use models to the extent that the architecture and algorithm are understood
  Extrapolate from small processor sets
Vary as many (yes) parameters as possible at the same time
  Consider how they interact and how they don't
Also remember that instruments affect timing
  Often one can compensate (incorrect answers can result)
Utilize observed "eccentricities" with caution (MPI_Reduce)

Page 41: Benchmarks on BG/L:  Parallel and Serial

Current Fronts

HPC Challenge Benchmark Suite: STREAMS, HPL, etc.
HPCS Productivity Benchmarks
Math Libraries
Focused feedback to Toronto: PERCS compiler / persistent optimization
Linpack algorithm on other machines

Page 42: Benchmarks on BG/L:  Parallel and Serial

Thanks to …

Leonardo Bachega: BLAS-1, performance results

Sid Chatterjee, Xavier Martorell: Coprocessor, BLAS-1

Fred Gustavson, James Sexton: Data structure investigations, design, sanity tests

Page 43: Benchmarks on BG/L:  Parallel and Serial

Thanks to …

Gheorghe Almasi, Phil Heidelberger & Nils Smeds: MPI/communications
Vernon Austel: data copy routines
Gerry Kopcsay & Jose Moreira: system & machine configuration
Derek Lieber & Martin Ohmacht: refined memory settings
Everyone else: system software, hardware, & machine time!

Page 44: Benchmarks on BG/L:  Parallel and Serial

Benchmarks on BG/L: Parallel and Serial

John A. Gunnels, Mathematical Sciences Dept.

IBM T. J. Watson Research Center