Sandia Fast Matmul


Description

Slides from my talk at Sandia.

Transcript of Sandia Fast Matmul

Page 1: Sandia Fast Matmul


A FRAMEWORK FOR PRACTICAL FAST MATRIX MULTIPLICATION

Austin Benson ([email protected]), ICME, Stanford

Grey Ballard, Sandia National Laboratories

Sandia National Laboratories, October 21, 2014

arXiv: 1409.2908

Page 2: Sandia Fast Matmul

Fast matrix multiplication: bridging theory and practice

• There are a number of Strassen-like algorithms for matrix multiplication that have only been "discovered" recently [Smirnov13], [Benson&Ballard14]

• We show that they can achieve higher performance than Intel's Math Kernel Library (MKL)

• We use code generation for extensive prototyping.

[Figure: timeline of the exponent of matrix multiplication, from 3 (classical) down to 2.81 [Strassen69] and the current best bound of 2.37 [Williams12]]

Page 3: Sandia Fast Matmul


Strassen’s algorithm

Page 4: Sandia Fast Matmul

Key ingredients of Strassen's algorithm

1. Block partitioning of matrices (<2, 2, 2>)
2. Seven linear combinations of sub-blocks of A
3. Seven linear combinations of sub-blocks of B
4. Seven matrix multiplies to form Mr (recursive)
5. Linear combinations of Mr to form Cij

Page 5: Sandia Fast Matmul

Key ingredients of fast matmul algorithms

1. Block partitioning of matrices (<M, K, N>)
2. R linear combinations of sub-blocks of A
3. R linear combinations of sub-blocks of B
4. R matrix multiplies to form Mr (recursive); R < MKN means asymptotically faster than classical
5. Linear combinations of Mr to form Cij

Main idea: trade O(N^3) computation (matmul) for more O(N^2) computation (matrix additions), as the recurrence below makes precise.
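(Worked out: uniform recursion with R multiplies on an <M, K, N> partition gives the running-time recurrence and exponent below [Benson&Ballard14].)

    T(n) = R * T(n / (M*K*N)^(1/3)) + O(n^2)  =>  T(n) = O(n^w0),
    where w0 = 3 * log(R) / log(M*K*N)

    Strassen: <2, 2, 2>, R = 7  =>  w0 = 3 log 7 / log 8 = log2 7 ≈ 2.81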

Page 6: Sandia Fast Matmul

"Outer product" fast algorithm

• <4, 2, 4> partitioning
• R = 26 multiplies (< 4 * 2 * 4 = 32)
• 23% speedup per recursive step (if everything else free)
• Linear combinations of Aij to form Sr: 68 terms
• Linear combinations of Bij to form Tr: 52 terms
• Linear combinations of Mr to form Cij: 69 terms
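Where the 23% comes from (simple arithmetic on the counts above):

    speedup per recursive step = 32 / 26 ≈ 1.23x
    asymptotic exponent: w0 = 3 * log(26) / log(32) ≈ 2.82  (vs. 2.81 for Strassen)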

Page 7: Sandia Fast Matmul

Discovering fast algorithms is a numerical challenge

• Low-rank tensor decompositions lead to fast algorithms
• The tensors are small, but we need exact decompositions (NP-hard)
• Use alternating least squares with regularization and rounding tricks [Smirnov13], [Benson&Ballard14]

• We have around 10 fast algorithms for <M, K, N> partitions. Also have permutations, e.g., <K, M, N>.
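In the notation of these slides, the correspondence works as follows: an exact rank-R decomposition of the <M, K, N> matrix multiplication tensor, with factor vectors ur, vr, wr, translates directly into a fast algorithm:

    Sr  = sum over (i,j) of ur[i,j] * Aij
    Tr  = sum over (k,l) of vr[k,l] * Bkl
    Mr  = Sr * Tr                      (recursive multiply)
    Cij = sum over r of wr[i,j] * Mr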

Page 8: Sandia Fast Matmul

[Figure: example fast algorithms from [Strassen69] and [Smirnov13]]

Page 9: Sandia Fast Matmul

Code generation lets us prototype algorithms quickly

• We have a compact representation of many fast algorithms:
  1. dimensions of the block partitioning (<M, K, N>)
  2. linear combinations of sub-blocks (Sr, Tr)
  3. linear combinations of Mr to form Cij

• We use code generation to rapidly prototype fast algorithms

• Our approach: test all algorithms on a bunch of different problem sizes and look for patterns
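As an illustration of what generated routines look like, here is a hypothetical straight-line addition routine for step 2 of Strassen's algorithm (the seven Sr). The names and contiguous-block layout are assumptions, not the authors' actual generated source, which operates on strided submatrices:

    #include <cstddef>

    // Generated-style code: one fused pass forms all seven Sr for
    // Strassen's <2, 2, 2> algorithm (each block has len entries).
    void form_S(const double* A11, const double* A12,
                const double* A21, const double* A22,
                double* S[7], std::size_t len) {
      for (std::size_t i = 0; i < len; ++i) {
        S[0][i] = A11[i] + A22[i];  // S1 = A11 + A22
        S[1][i] = A21[i] + A22[i];  // S2 = A21 + A22
        S[2][i] = A11[i];           // S3 = A11
        S[3][i] = A22[i];           // S4 = A22
        S[4][i] = A11[i] + A12[i];  // S5 = A11 + A12
        S[5][i] = A21[i] - A11[i];  // S6 = A21 - A11
        S[6][i] = A12[i] - A22[i];  // S7 = A12 - A22
      }
    }

Analogous generated routines form the Tr from the sub-blocks of B and combine the Mr into the Cij.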

Page 10: Sandia Fast Matmul

Sequential performance

Effective GFLOPS for an M x P x Q multiply = 1e-9 * 2 * M * P * Q / (time in seconds)

(Effective GFLOPS charges every algorithm the classical flop count 2MPQ, so a fast algorithm can legitimately exceed the machine's true peak.)

[Plot: effective GFLOPS vs. problem size, with the true peak marked]
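A minimal helper computing the metric as defined above (straightforward arithmetic, not taken from the authors' code):

    // Effective GFLOPS: classical flop count 2*M*P*Q over the measured
    // time, regardless of how many flops the algorithm actually did.
    double effective_gflops(double M, double P, double Q, double seconds) {
      return 1e-9 * 2.0 * M * P * Q / seconds;
    }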

Page 11: Sandia Fast Matmul

Sequential performance

• All algorithms beat MKL on large problems
• Strassen's algorithm is hard to beat

Page 12: Sandia Fast Matmul

Sequential performance

• Almost all algorithms beat MKL
• <4, 2, 4> and <3, 2, 3> tend to perform the best

Page 13: Sandia Fast Matmul

Sequential performance

• Almost all algorithms beat MKL
• <4, 3, 3> and <4, 2, 3> tend to perform the best

Page 14: Sandia Fast Matmul


Practical issues, or what is not obvious when

…you just think about the theory

…your performance models are too simple

Page 15: Sandia Fast Matmul

Practical issue #1: when to stop recursion? Look at the dgemm curve.

Basic idea: take another recursive step if the sub-problems will still operate at high performance

[Plot: dgemm performance curve vs. problem size, for <M, K, N> = <4, 2, 3>]
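A minimal sketch of the cutoff test (the helper name and threshold parameter are hypothetical; in practice the threshold is read off the dgemm curve for the machine):

    // Take another recursive step only if the sub-multiplies will still
    // run on the flat, high-performance part of the dgemm curve.
    bool should_recurse(long m, long k, long n,
                        long M, long K, long N, long flat_start) {
      // m/M, k/K, n/N are the sub-block dimensions after one
      // <M, K, N> recursive step
      return m / M >= flat_start && k / K >= flat_start && n / N >= flat_start;
    }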

Page 16: Sandia Fast Matmul

Practical issue #2: matrix additions

[Diagram: sub-blocks A11, A12, A21, A22 feeding the seven additions S1, ..., S7]

"Pairwise": form each Sr term by term with DAXPY (y := a*x + y); a three-term Sr costs two DAXPY calls, each reading and writing the full output block.

Page 17: Sandia Fast Matmul

Practical issue #2: matrix additions

[Diagram: sub-blocks A11, A12, A21, A22 feeding the seven additions S1, ..., S7]

"Write once": a custom "DAXPY" fuses all terms of an Sr into a single pass, so each entry of Sr is written exactly once.

Page 18: Sandia Fast Matmul

Practical issue #2: matrix additions

[Diagram: sub-blocks A11, A12, A21, A22 feeding the seven additions S1, ..., S7]

"Streaming": entry-wise updates to the Sr as the input blocks are read.
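A sketch of the first two strategies on a three-term combination (contiguous blocks of len doubles are an assumption; real code uses strided submatrices):

    #include <cstddef>
    #include <cstring>

    // "Pairwise": one copy plus two daxpy-like passes; S is read and
    // written on every pass, multiplying the write traffic.
    void pairwise(const double* A11, const double* A21, const double* A22,
                  double* S, std::size_t len) {
      std::memcpy(S, A11, len * sizeof(double));
      for (std::size_t i = 0; i < len; ++i) S[i] += A21[i];
      for (std::size_t i = 0; i < len; ++i) S[i] += A22[i];
    }

    // "Write once": a fused custom loop writes each entry of S exactly once.
    void write_once(const double* A11, const double* A21, const double* A22,
                    double* S, std::size_t len) {
      for (std::size_t i = 0; i < len; ++i) S[i] = A11[i] + A21[i] + A22[i];
    }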

Page 19: Sandia Fast Matmul

Practical issue #3: redundant linear combinations

• Example in <4, 2, 4> algorithm (R = 26 multiplies):

[Diagram: T11 and T25 are each a three-term combination of the blocks B12, B22, B23, B24, and they share the two-term subexpression B22 + B23 (signs omitted here)]

Computed independently: four additions, six reads, two writes.

Page 20: Sandia Fast Matmul

Common subexpression elimination

• Example in <4, 2, 4> algorithm (R = 26 multiplies):

[Diagram: compute Y = B22 + B23 once, then form T11 and T25 from Y and the remaining blocks B12 and B24]

Three additions, six reads, three writes: one addition is saved, but Y must be written (and read back), so there is a net increase in communication!

Page 21: Sandia Fast Matmul

Common subexpression elimination (CSE) does not really help

Page 22: Sandia Fast Matmul


Practical issue #4: parallelization on shared memory

Page 23: Sandia Fast Matmul

Fast matmul recursion tree

[Diagram: recursion tree rooted at C; each node spawns subproblems M1, M2, ..., M7, which are multiplied recursively and then combined (+) into their parent]

Page 24: Sandia Fast Matmul

DFS parallelization

[Diagram: recursion tree; all threads work on one node at a time, using parallel MKL at the base cases]

+ Easy to implement
+ Load balanced
+ Same memory footprint as sequential
- Need large base cases for high performance

Page 25: Sandia Fast Matmul

BFS parallelization

[Diagram: recursion tree; the subproblems M1, ..., M7 at a level run concurrently, one thread per subproblem, with omp taskwait before the results are combined]

+ High performance for smaller base cases
- Sometimes harder to load balance: 24 threads, 49 subproblems
- More memory
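A minimal sketch of the BFS strategy with OpenMP tasks (subproblem() is a hypothetical stand-in for one recursive multiply Mr = Sr * Tr):

    #include <cstdio>

    void subproblem(int r) { std::printf("computing M%d\n", r + 1); }

    void bfs_level() {
      #pragma omp parallel
      #pragma omp single
      {
        for (int r = 0; r < 7; ++r) {
          #pragma omp task firstprivate(r)  // each Mr is its own task
          subproblem(r);
        }
        #pragma omp taskwait  // all Mr must finish before forming C
      }
    }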

Page 26: Sandia Fast Matmul

Hybrid parallelization

[Diagram: recursion tree; most subproblems run as single-threaded tasks, and the remainder use all threads, with omp taskwait at each level]

+ Better load balancing
- Explicit synchronization needed, or else we can over-subscribe threads

Page 27: Sandia Fast Matmul


Page 28: Sandia Fast Matmul

Bandwidth problems

• We rely on matrix multiplications being much more expensive than matrix additions
• Parallel dgemm on 24 cores: easily get 50-75% of peak
• STREAM benchmark: < 6x speedup in read/write performance on 24 cores
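Rough arithmetic from these numbers (the assumption that sequential dgemm runs near single-core peak is mine, not the slide's):

    multiply speedup on 24 cores: ~0.5 to 0.75 * 24 ≈ 12-18x
    addition speedup (bandwidth-bound, per STREAM): < 6x
    => the additions' share of the runtime grows roughly 2-3x in parallel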


Page 29: Sandia Fast Matmul

Parallel performance

• 6 cores: similar performance to sequential
• 24 cores: can sometimes beat MKL, but barely

Page 30: Sandia Fast Matmul

Parallel performance

[Plot annotation: a region of bad MKL performance]

• 6 cores: similar performance to sequential
• 24 cores: MKL best for large problems

Page 31: Sandia Fast Matmul

Parallel performance

• 6 cores: similar performance to sequential
• 24 cores: MKL usually the best

Page 32: Sandia Fast Matmul

High-level conclusions

• For square matrix multiplication, Strassen's algorithm is hard to beat
• For rectangular matrix multiplication, use a fast algorithm that "matches the shape"
• Bandwidth limits the performance of shared-memory parallel fast matrix multiplication; this should be less of an issue in distributed memory

Future work:
• Numerical stability
• Using fast matmul as a kernel for other algorithms in numerical linear algebra

Page 33: Sandia Fast Matmul


A FRAMEWORK FOR PRACTICAL FAST MATRIX MULTIPLICATION

Austin Benson ([email protected]), ICME, Stanford

Grey Ballard, Sandia National Laboratories

Sandia National Laboratories, October 21, 2014

arXiv: 1409.2908