Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)


Transcript of Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Page 1: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Computational Research Division

Tuning Sparse Matrix Vector Multiplication for multi-core SMPs

(details in paper at SC07)

Samuel Williams (1,2), Richard Vuduc (3), Leonid Oliker (1,2), John Shalf (2), Katherine Yelick (1,2), James Demmel (1,2)

(1) University of California, Berkeley  (2) Lawrence Berkeley National Laboratory  (3) Georgia Institute of Technology

[email protected]

Page 2: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Overview

Multicore is the de facto performance solution for the next decade

Examined the Sparse Matrix Vector Multiplication (SpMV) kernel: an important HPC kernel, memory intensive, and challenging for multicore.

Present two auto-tuned threaded implementations: a Pthreads, cache-based implementation and a Cell local-store-based implementation.

Benchmarked performance across 4 diverse multicore architectures: Intel Xeon (Clovertown), AMD Opteron, Sun Niagara2, and IBM Cell Broadband Engine.

Compare with the leading MPI implementation (PETSc) using an auto-tuned serial kernel (OSKI).

Show Cell delivers good performance and efficiency, while Niagara2 delivers good performance and productivity

Page 3: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Sparse Matrix Vector Multiplication

Sparse matrix: most entries are 0.0. There is a performance advantage in only storing and operating on the nonzeros, but this requires significant meta data.

Evaluate y = Ax, where A is a sparse matrix and x and y are dense vectors.

Challenges: difficult to exploit ILP (bad for superscalar) and DLP (bad for SIMD); irregular memory access to the source vector; difficult to load balance; very low computational intensity (often >6 bytes/flop).

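For reference, a minimal serial CSR-based SpMV kernel (the kind of baseline the later slides call the "vanilla C" implementation) looks roughly like the sketch below. The names and data layout are illustrative, not taken from the paper's code.

```c
#include <stddef.h>

/* Minimal y = A*x for a matrix stored in Compressed Sparse Row (CSR).
 * rowptr has nrows+1 entries; colidx/values hold the nonzeros.
 * Illustrative sketch only -- not the authors' tuned code. */
void spmv_csr(size_t nrows,
              const size_t *rowptr,
              const int    *colidx,
              const double *values,
              const double *x,
              double       *y)
{
    for (size_t r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (size_t k = rowptr[r]; k < rowptr[r + 1]; k++)
            sum += values[k] * x[colidx[k]];   /* irregular access to x */
        y[r] = sum;
    }
}
```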

Page 4: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Test Suite

Dataset (Matrices), Multicore SMPs

Page 5: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Matrices Used

Pruned the original SPARSITY suite down to 14 matrices (none should fit in cache) and subdivided them into 4 categories. Rank ranges from 2K to 1M.

[Matrix spy plots: Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, Webbase, LP.]

Categories: a 2K x 2K dense matrix stored in sparse format; well structured (sorted by nonzeros/row); poorly structured hodgepodge; extreme aspect ratio (linear programming).

Page 6: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Multicore SMP Systems

[Block diagrams of the four test systems.]

Intel Clovertown: two quad-core sockets; each pair of Core2 cores shares a 4MB L2; each socket connects over a 10.6 GB/s FSB to a chipset with four 64b Fully Buffered DRAM channels (21.3 GB/s read, 10.6 GB/s write).

AMD Opteron: two dual-core sockets; 1MB victim cache per core; each socket has its own memory controller / HyperTransport link to DDR2 DRAM (10.6 GB/s per socket, 4 GB/s in each direction between sockets).

Sun Niagara2: eight multithreaded UltraSparc cores (each with an FPU and 8K D$) connected by a crossbar switch to a 4MB shared L2 (16-way; 179 GB/s fill, 90 GB/s write-thru) and 4x128b FBDIMM memory controllers (42.7 GB/s read, 21.3 GB/s write).

IBM Cell Blade: two Cell chips, each with a PPE (512K L2) and eight SPEs (256K local store plus MFC each) on the EIB ring network; 25.6 GB/s of XDR DRAM bandwidth per chip, BIF link of under 20 GB/s in each direction between chips.

Page 7: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Multicore SMP Systems (memory hierarchy)

[Same four system diagrams as Page 6.] The Clovertown, Opteron, and Niagara2 use a conventional cache-based memory hierarchy; the Cell blade uses a disjoint local-store memory hierarchy.

Page 8: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Multicore SMP Systems (2 implementations)

[Same diagrams as Page 6.] The cache-based systems (Clovertown, Opteron, Niagara2) run the Cache+Pthreads SpMV implementation; the Cell blade runs the Local Store SpMV implementation.

Page 9: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Multicore SMP Systems (cache)

[Same diagrams as Page 6.] Aggregate on-chip capacity: Clovertown 16MB (the vectors fit), Opteron 4MB, Niagara2 4MB, Cell 4MB (local store).

Page 10: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Multicore SMP Systems (peak flops)

[Same diagrams as Page 6.] Peak double-precision flops: Clovertown 75 Gflop/s (TLP + ILP + SIMD), Opteron 17 Gflop/s (TLP + ILP), Niagara2 11 Gflop/s (TLP), Cell blade 29 Gflop/s (TLP + ILP + SIMD).

Page 11: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Multicore SMP Systems (peak read bandwidth)

[Same diagrams as Page 6.] Peak read bandwidth: Clovertown 21 GB/s, Opteron 21 GB/s, Niagara2 43 GB/s, Cell blade 51 GB/s.

Page 12: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Multicore SMP Systems (NUMA)

[Same diagrams as Page 6.] The Clovertown and Niagara2 are uniform memory access; the Opteron and Cell blade are non-uniform memory access.

Page 13: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Naïve Implementation for Cache-based Machines

Performance across the suite; also included a median performance number.

Page 14: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Vanilla C Performance

[Charts: GFlop/s (0.0-1.0) across the matrix suite plus median, for Intel Clovertown, AMD Opteron, and Sun Niagara2.]

Vanilla C implementation; matrix stored in CSR (compressed sparse row). Explored compiler options; only the best is presented here. A single x86 core delivers more than 10x the performance of a Niagara2 thread.

Page 15: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Pthread Implementation for Cache-Based Machines

SPMD, strong scaling. Optimized for multicore/threading. A variety of shared memory programming models are acceptable (not just Pthreads). More colors = more optimizations = more work.

Page 16: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Naïve Parallel Performance

[Charts: GFlop/s (0.0-3.0) across the matrix suite plus median for Intel Clovertown, AMD Opteron, and Sun Niagara2; series: Naïve Pthreads, Naïve Single Thread.]

Simple row parallelization; Pthreads (didn't have to be). 1P Niagara2 delivers roughly 2x to 5x the performance of the 2P x86 machines.
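As a rough illustration of the "simple row parallelization", the sketch below assumes a static row partition and a Pthreads barrier bracketing the kernel; it is not the authors' code, and thread creation and the CSR arrays are assumed to exist elsewhere.

```c
#include <pthread.h>
#include <stddef.h>

/* Each thread owns a contiguous block of rows of the CSR matrix and
 * writes its own slice of y; a barrier brackets the kernel (SPMD style). */
typedef struct {
    size_t row_begin, row_end;        /* this thread's row range   */
    const size_t *rowptr;             /* shared CSR structure      */
    const int    *colidx;
    const double *values;
    const double *x;                  /* shared source vector      */
    double       *y;                  /* shared destination vector */
    pthread_barrier_t *barrier;
} spmv_task_t;

static void *spmv_worker(void *arg)
{
    spmv_task_t *t = (spmv_task_t *)arg;

    pthread_barrier_wait(t->barrier);            /* start together  */
    for (size_t r = t->row_begin; r < t->row_end; r++) {
        double sum = 0.0;
        for (size_t k = t->rowptr[r]; k < t->rowptr[r + 1]; k++)
            sum += t->values[k] * t->x[t->colidx[k]];
        t->y[r] = sum;
    }
    pthread_barrier_wait(t->barrier);            /* finish together */
    return NULL;
}
```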

Page 17: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Naïve Parallel Performance

[Same charts as Page 16; series: Naïve Pthreads, Naïve Single Thread.]

Intel Clovertown: 8x cores = 1.9x performance. AMD Opteron: 4x cores = 1.5x performance. Sun Niagara2: 64x threads = 41x performance.

Page 18: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Naïve Parallel Performance

[Same charts as Page 16; series: Naïve Pthreads, Naïve Single Thread.]

Intel Clovertown: 1.4% of peak flops, 29% of bandwidth. AMD Opteron: 4% of peak flops, 20% of bandwidth. Sun Niagara2: 25% of peak flops, 39% of bandwidth.

Page 19: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Case for Auto-tuning

How do we deliver good performance across all these architectures and across all matrices without exhaustively optimizing every combination?

Auto-tuning (in general): write a Perl script that generates all possible optimizations, then heuristically or exhaustively search the generated variants. Existing SpMV solution: OSKI (developed at UCB).

This work: tuning geared toward multicore/multithreading; generates SSE/SIMD intrinsics, prefetching, loop transformations, alternate data structures, etc.; a "prototype for parallel OSKI".
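The search side of auto-tuning can be pictured with a bare-bones harness that times each generated kernel variant and keeps the fastest. This is a hypothetical sketch, not the paper's tuner; the variants table is assumed to come from an offline code generator such as the Perl script mentioned above.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical auto-tuning harness: benchmark every generated SpMV
 * variant on the target matrix and remember the fastest one. */
typedef void (*spmv_fn)(void);            /* variant entry point (stub) */

typedef struct {
    const char *name;                     /* e.g. "2x1 blocked, 16b idx" */
    spmv_fn     kernel;
} spmv_variant_t;

static double time_kernel(spmv_fn kernel, int trials)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < trials; i++)
        kernel();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

const spmv_variant_t *pick_best(const spmv_variant_t *variants, int n)
{
    const spmv_variant_t *best = &variants[0];
    double best_time = time_kernel(best->kernel, 10);
    for (int i = 1; i < n; i++) {
        double t = time_kernel(variants[i].kernel, 10);
        if (t < best_time) { best_time = t; best = &variants[i]; }
    }
    printf("best variant: %s (%.3f s)\n", best->name, best_time);
    return best;
}
```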

Page 20: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Performance (+NUMA & SW Prefetching)

[Charts: GFlop/s (0.0-3.0) across the matrix suite plus median for Intel Clovertown, AMD Opteron, and Sun Niagara2; stacked series: +Software Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve Single Thread.]

NUMA-aware allocation. Tag prefetches with the appropriate temporal locality. Tune for the optimal prefetch distance.
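One way software prefetching can be placed in the CSR inner loop is sketched below. The GCC/Clang __builtin_prefetch intrinsic is real (its third argument is the temporal-locality hint), but PF_DIST and the placement here are illustrative assumptions rather than the tuned values from the paper.

```c
#include <stddef.h>

#define PF_DIST 64   /* tunable prefetch distance, in nonzeros (illustrative) */

/* CSR SpMV with software prefetching of the value/index streams. */
void spmv_csr_prefetch(size_t nrows,
                       const size_t *rowptr,
                       const int    *colidx,
                       const double *values,
                       const double *x,
                       double       *y)
{
    for (size_t r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (size_t k = rowptr[r]; k < rowptr[r + 1]; k++) {
            /* stream the matrix ahead of use; hint low temporal locality */
            __builtin_prefetch(&values[k + PF_DIST], 0, 0);
            __builtin_prefetch(&colidx[k + PF_DIST], 0, 0);
            sum += values[k] * x[colidx[k]];
        }
        y[r] = sum;
    }
}
```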

Page 21: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Matrix Compression

For memory-bound kernels, minimizing memory traffic should maximize performance.

Compress the meta data; exploit structure to eliminate meta data.

Heuristic: select the compression that minimizes the matrix size: power-of-2 register blocking, CSR/COO format, 16b/32b indices, etc.

Side effect: matrix may be minimized to the point where it fits entirely in cache
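For concreteness, one of the compression options listed above, 2x1 register blocking combined with 16-bit column indices (BCSR-style), might be laid out roughly as below. The structure and names are illustrative and are not taken from the paper's code; they assume the number of block columns fits in 16 bits.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative 2x1 register-blocked CSR with 16-bit column indices.
 * Each block holds two vertically adjacent value slots sharing one
 * column index, roughly halving the index stream versus 32b CSR. */
typedef struct {
    size_t    nblockrows;     /* number of 2-row block rows            */
    size_t   *blockrowptr;    /* nblockrows+1 offsets into blocks      */
    uint16_t *blockcol;       /* one 16-bit column index per 2x1 block */
    double   *blockval;       /* 2 values per block: rows r and r+1    */
} bcsr_2x1_t;

void spmv_bcsr_2x1(const bcsr_2x1_t *A, const double *x, double *y)
{
    for (size_t br = 0; br < A->nblockrows; br++) {
        double sum0 = 0.0, sum1 = 0.0;
        for (size_t k = A->blockrowptr[br]; k < A->blockrowptr[br + 1]; k++) {
            double xv = x[A->blockcol[k]];
            sum0 += A->blockval[2 * k]     * xv;   /* row 2*br     */
            sum1 += A->blockval[2 * k + 1] * xv;   /* row 2*br + 1 */
        }
        y[2 * br]     = sum0;
        y[2 * br + 1] = sum1;
    }
}
```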

Page 22: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Performance (+matrix compression)

[Charts: GFlop/s (0.0-6.0) across the matrix suite plus median for Intel Clovertown, AMD Opteron, and Sun Niagara2; stacked series: +Matrix Compression, +Software Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve Single Thread.]

Very significant on some matrices; missed the boat on others.

Page 23: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Performance (+cache & TLB blocking)

[Charts: GFlop/s (0.0-6.0) across the matrix suite plus median for Intel Clovertown, AMD Opteron, and Sun Niagara2; stacked series: +Cache/TLB Blocking, +Matrix Compression, +Software Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve Single Thread.]

Cache blocking geared towards sparse accesses.

Page 24: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Performance (more DIMMs, firmware fix, …)

[Charts: GFlop/s (0.0-6.0) across the matrix suite plus median for Intel Clovertown, AMD Opteron, and Sun Niagara2; stacked series: +More DIMMs / rank configuration / etc., +Cache/TLB Blocking, +Matrix Compression, +Software Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve Single Thread.]

Page 25: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Performance (more DIMMs, firmware fix, …)

[Same charts as Page 24.]

Intel Clovertown: 4% of peak flops, 52% of bandwidth. AMD Opteron: 20% of peak flops, 66% of bandwidth. Sun Niagara2: 52% of peak flops, 54% of bandwidth.

Page 26: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Performance (more DIMMs, firmware fix, …)

[Same charts as Page 24.]

Essential optimizations: Intel Clovertown: 3 (Parallelize, Prefetch, Compress); AMD Opteron: 4 (Parallelize, NUMA, Prefetch, Compress); Sun Niagara2: 2 (Parallelize, Compress).

Page 27: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Cell Implementation

Comments, Performance

Page 28: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Cell Implementation

No vanilla C implementation (aside from the PPE). In some cases, what were optional optimizations on cache-based machines are requirements for correctness on Cell.

Even SIMDized double precision is extremely weak, and scalar double precision is unbearable. The minimum register blocking is 2x1 (SIMDizable), which can increase memory traffic by 66%.

Optional cache blocking becomes required local-store blocking. Spatial and temporal locality is captured by software when the matrix is optimized; in essence, the high bits of the column indices are grouped into DMA lists. Branchless implementation.

Despite the following performance numbers, Cell was still handicapped by double precision

Page 29: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Performance (all)

[Charts: GFlop/s (0.0-12.0) across the matrix suite plus median for Intel Clovertown, AMD Opteron, Sun Niagara2, and IBM Cell Broadband Engine; the Cell panel shows 1 SPE, 8 SPEs, and 16 SPEs.]

Page 30: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Hardware Efficiency

[Same charts as Page 29.]

Intel Clovertown: 4% of peak flops, 52% of bandwidth. AMD Opteron: 20% of peak flops, 66% of bandwidth. Sun Niagara2: 52% of peak flops, 54% of bandwidth. IBM Cell Broadband Engine: 39% of peak flops, 89% of bandwidth.

Page 31: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Productivity

[Same charts as Page 29.]

Essential optimizations: Intel Clovertown: 3 (Parallelize, Prefetch, Compress); AMD Opteron: 4 (Parallelize, NUMA, Prefetch, Compress); Sun Niagara2: 2 (Parallelize, Compress); IBM Cell Broadband Engine: 5 (Parallelize, NUMA, DMA, Compress, Cache (local store) Block).

Page 32: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Parallel Efficiency

[Charts: parallel efficiency vs. concurrency for each system: Intel Clovertown (1-8 cores), AMD Opteron (1-4 cores), Sun Niagara2 (up to 64 threads), IBM Cell Broadband Engine (1-16 SPEs).]

Page 33: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Parallel Efficiency (breakdown)

[Same charts as Page 32, with efficiency broken down into multicore, multisocket, and multithreading (MT) components.]

Page 34: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Multicore MPI Implementation

This is the default approach to programming multicore.

Page 35: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Multicore MPI Implementation

Used PETSc with shared-memory MPICH. Used OSKI (developed at UCB) to optimize each thread, i.e., a good auto-tuned shared-memory MPI implementation.

[Charts: GFlop/s (0.0-4.0) across the matrix suite plus median for Intel Clovertown and AMD Opteron; series: MPI (autotuned), Pthreads (autotuned), Naïve Single Thread.]

Page 36: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)


Summary

Page 37: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Median Performance & Efficiency

[Chart: Median Performance, GFlop/s (0.0-7.0) for Intel Clovertown, AMD Opteron, Sun Niagara2, and IBM Cell; series: Naïve, PETSc+OSKI (autotuned), Pthreads (autotuned).]

[Chart: Median Power Efficiency, MFlop/s/Watt (0.0-24.0) for the same four systems and series.]

1P Niagara2 was consistently better than 2P x86 machines (more potential bandwidth)

Cell delivered by far the best performance (better utilization)

Used digital power meter to measure sustained system power

FBDIMM is power hungry (12W/DIMM): Clovertown (330W) and Niagara2 (350W) sustained system power.

Page 38: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Summary

Paradoxically, the most complex/advanced architectures required the most tuning, and delivered the lowest performance.

Niagara2 delivered both very good performance and productivity. Cell delivered very good performance and efficiency: ~90% of memory bandwidth (most machines get ~50%), high power efficiency, and easily understood performance. Extra memory traffic = lower performance (future work can address this).

Our multicore specific autotuned implementation significantly outperformed an autotuned MPI implementation

It exploited: matrix compression geared towards multicore rather than a single core, NUMA-aware allocation, and prefetching.

Page 39: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Acknowledgments

UC Berkeley: RADLab cluster (Opterons), PSI cluster (Clovertowns)

Sun Microsystems: Niagara2 access

Forschungszentrum Jülich: Cell blade cluster access

Page 40: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)


Questions?

Page 41: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)


Backup Slides

Page 42: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Parallelization

Matrix partitioned by rows and balanced by the number of nonzeros

SPMD-like approach: a barrier() is called before and after the SpMV kernel; each sub-matrix is stored separately in CSR; load balancing can be challenging.

# of threads explored in powers of 2 (in paper)

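A small sketch of the row partitioning balanced by nonzeros is shown below, assuming the CSR row pointer is available. It is illustrative only, not the paper's code.

```c
#include <stddef.h>

/* Split nrows rows among nthreads so each thread gets roughly the same
 * number of nonzeros. row_begin[t]..row_begin[t+1] is thread t's range.
 * Simple greedy sweep over the CSR row pointer; illustrative only. */
void partition_rows_by_nnz(size_t nrows, const size_t *rowptr,
                           int nthreads, size_t *row_begin /* nthreads+1 */)
{
    size_t total_nnz = rowptr[nrows];
    size_t r = 0;
    row_begin[0] = 0;
    for (int t = 1; t < nthreads; t++) {
        size_t target = (total_nnz * (size_t)t) / (size_t)nthreads;
        while (r < nrows && rowptr[r] < target)
            r++;                 /* advance until t/nthreads of the nnz */
        row_begin[t] = r;
    }
    row_begin[nthreads] = nrows;
}
```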

Page 43: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Exploiting NUMA, Affinity

Bandwidth on the Opteron (and Cell) can vary substantially based on the placement of data.

Bind each sub-matrix and the thread that processes it together.

Explored libnuma, Linux, and Solaris routines. Adjacent blocks are bound to adjacent cores.

[Figure: data placement across the two Opteron sockets and their DDR2 DRAM: a single thread, multiple threads using one memory controller, and multiple threads using both memory controllers.]
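On Linux, this kind of binding can be expressed roughly as below, using libnuma's numa_alloc_onnode plus sched_setaffinity. The routines are standard Linux/libnuma calls, but the specific usage here is an illustrative sketch rather than the paper's code.

```c
#define _GNU_SOURCE
#include <numa.h>      /* link with -lnuma */
#include <sched.h>
#include <stddef.h>

/* Place one thread's sub-matrix on the memory node nearest to the core
 * that will process it, then pin the calling thread to that core.
 * Illustrative sketch; error handling omitted. */
void *alloc_and_bind(size_t bytes, int node, int cpu)
{
    /* allocate this sub-matrix's storage on the given NUMA node */
    void *buf = numa_alloc_onnode(bytes, node);

    /* pin the calling thread to the core adjacent to that memory */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);

    return buf;
}
```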

Page 44: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Cache and TLB Blocking

Accesses to the matrix and destination vector are streaming, but accesses to the source vector can be random. Reorganize the matrix (and thus the access pattern) to maximize reuse. Applies equally to TLB blocking (caching PTEs).

Heuristic: block the destination, then keep adding more columns as long as the number of source-vector cache lines (or pages) touched is less than the cache (or TLB). Apply all previous optimizations individually to each cache block.

Search: neither, cache, cache & TLB.

Better locality at the expense of confusing the hardware prefetchers.

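The column-growing heuristic can be pictured roughly as below: nonzeros of a row block are appended until the number of distinct source-vector cache lines touched would exceed a budget. This is a simplified sketch of the idea (counting with a caller-provided bitmap over cache lines), not the authors' implementation.

```c
#include <stdbool.h>
#include <stddef.h>

#define LINE_DOUBLES 8   /* doubles per 64-byte cache line (illustrative) */

/* Given the nonzeros of one row block (column indices colidx[0..nnz)),
 * return how many leading nonzeros can be included before the number of
 * distinct source-vector cache lines touched exceeds max_lines.
 * 'seen' is a bitmap over cache lines, cleared by the caller. */
size_t grow_cache_block(const int *colidx, size_t nnz,
                        bool *seen, size_t max_lines)
{
    size_t lines_touched = 0;
    for (size_t k = 0; k < nnz; k++) {
        size_t line = (size_t)colidx[k] / LINE_DOUBLES;
        if (!seen[line]) {
            if (lines_touched + 1 > max_lines)
                return k;          /* stop before exceeding the budget */
            seen[line] = true;
            lines_touched++;
        }
    }
    return nnz;                    /* whole row block fits in the budget */
}
```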

Page 45: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)

Banks, Ranks, and DIMMs

In this SPMD approach, as the number of threads increases, so too does the number of concurrent streams to memory.

Most memory controllers have finite capability to reorder the requests. (DMA can avoid or minimize this)

Addressing/Bank conflicts become increasingly likely

Adding more DIMMs and changing the configuration of ranks can help; the Clovertown system was already fully populated.