Avoiding Communication in Linear Algebra
Jim Demmel, UC Berkeley
bebop.cs.berkeley.edu

Transcript of "Introduction to Communication-Avoiding Algorithms"
www.cs.berkeley.edu/~demmel/SC12_tutorial
Jim Demmel, EECS & Math Departments, UC Berkeley
Why avoid communication? (1/2)

Algorithms have two costs (measured in time or energy):
1. Arithmetic (FLOPS)
2. Communication: moving data between
– levels of a memory hierarchy (sequential case)
– processors over a network (parallel case)

[Figure: sequential case – CPU and cache above DRAM; parallel case – several CPU/DRAM nodes connected over a network]
Why avoid communication? (2/3)

• Running time of an algorithm is sum of 3 terms:
– # flops * time_per_flop
– # words moved / bandwidth      } communication
– # messages * latency           }

• Time_per_flop << 1/bandwidth << latency
• Gaps growing exponentially with time [FOSC]
• Avoid communication to save time

Annual improvements:
            Time_per_flop   Bandwidth   Latency
                 59%
Network                       26%        15%
DRAM                          23%         5%
Why Minimize Communication? (2/2)

[Figure: energy cost per operation vs. distance data moves – Source: John Shalf, LBL]

Minimize communication to save energy
Goals

• Redesign algorithms to avoid communication
• Between all memory hierarchy levels
• L1, L2, DRAM, network, etc.
• Attain lower bounds if possible
• Current algorithms often far from lower bounds
• Large speedups and energy savings possible
"New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems."
FY 2010 Congressional Budget, Volume 4, FY 2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67
President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress
CA-GMRES (Hoemmen, Mohiyuddin, Yelick, JD)
"Tall-Skinny" QR (Grigori, Hoemmen, Langou, JD)
Collaborators and Supporters
• Michael Christ, Jack Dongarra, Ioana Dumitriu, Armando Fox, David Gleich, Laura Grigori, Ming Gu, Mike Heroux, Mark Hoemmen, Olga Holtz, Kurt Keutzer, Julien Langou, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang, Kathy Yelick
• Michael Anderson, Grey Ballard, Austin Benson, Abhinav Bhatele, Aydin Buluc, Erin Carson, Maryam Dehnavi, Michael Driscoll, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Marghoob Mohiyuddin, Oded Schwartz, Edgar Solomonik
• Other members of the ParLab, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA, TOPS projects
• Thanks to NSF, DOE, UC Discovery, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu
Summary of CA Algorithms
• "Direct" Linear Algebra
• Lower bounds on communication for linear algebra problems like Ax=b, least squares, Ax = λx, SVD, etc.
• New algorithms that attain these lower bounds
• Being added to libraries: ScaLAPACK, PLASMA, MAGMA
• Large speed-ups possible
• Autotuning to find optimal implementation
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)
Outline
• "Direct" Linear Algebra
• Lower bounds on communication
• New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)
Lower bound for all "direct" linear algebra

• Holds for:
– Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
– Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. computing A^k)
– Dense and sparse matrices (where #flops << n^3)
– Sequential and parallel algorithms
– Some graph-theoretic algorithms (e.g. Floyd-Warshall)

• Let M = "fast" memory size (per processor)

#words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )

#messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) )

• Parallel case: assume either load or memory balanced
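These bounds are easy to evaluate for a concrete machine. A minimal sketch (hypothetical helper name, constants dropped; for dense matmul, #flops = 2n^3):

```python
from math import sqrt

def comm_lower_bounds(flops, M):
    """Evaluate the direct-linear-algebra lower bounds (constants dropped):
    words_moved >= flops / M^(1/2) and messages >= flops / M^(3/2),
    for a given flop count and fast-memory size M (in words)."""
    return flops / sqrt(M), flops / M ** 1.5
```

For example, `comm_lower_bounds(2 * n**3, M)` gives the two bounds for an n-by-n matmul with fast memory of M words.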
Lower bound for all "direct" linear algebra (continued)

• Same assumptions as above; in addition, latency bound follows from the bandwidth bound:

#messages_sent ≥ #words_moved / largest_message_size
These lower bounds were recognized with the SIAM SIAG/Linear Algebra Prize, 2012 (Ballard, D., Holtz, Schwartz).
Can we attain these lower bounds?

• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
– Often not
• If not, are there other algorithms that do?
– Yes, for much of dense linear algebra
– New algorithms, with new numerical properties, new ways to encode answers, new data structures
– Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply
Naïve Matrix Multiply

implements C = C + A*B
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

[Figure: C(i,j) computed from row i of A and column j of B]
Naïve Matrix Multiply, counting data movement

implements C = C + A*B
for i = 1 to n
  {read row i of A into fast memory}          … n^2 reads altogether
  for j = 1 to n
    {read C(i,j) into fast memory}            … n^2 reads altogether
    {read column j of B into fast memory}     … n^3 reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}        … n^2 writes altogether

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic
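The pseudocode above translates directly to a runnable sketch (plain Python lists; the point is the loop order, not speed):

```python
def naive_matmul(A, B, C):
    """C = C + A*B, the naive three-nested-loop algorithm.

    Every inner iteration touches A(i,k) and B(k,j); if fast memory
    holds only a few rows, column j of B is re-read for every (i, j)
    pair, giving ~n^3 slow-memory reads."""
    n = len(A)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```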
Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}      … b^2 × (n/b)^2 = n^2 reads
    for k = 1 to n/b
      {read block A(i,k) into fast memory}    … b^2 × (n/b)^3 = n^3/b reads
      {read block B(k,j) into fast memory}    … b^2 × (n/b)^3 = n^3/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)       {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}  … b^2 × (n/b)^2 = n^2 writes

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
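The blocked loop nest above can be sketched as follows (plain Python; n must be a multiple of the block size b):

```python
def blocked_matmul(A, B, C, b):
    """C = C + A*B with b-by-b blocks.

    Each block of A and B is reused for b iterations at a time, so
    slow-memory traffic drops to O(n^3/b + n^2) words instead of the
    naive O(n^3)."""
    n = len(A)
    assert n % b == 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # local multiply of the b-by-b blocks (i,k) and (k,j)
                for ii in range(i, i + b):
                    for kk in range(k, k + b):
                        a = A[ii][kk]
                        for jj in range(j, j + b):
                            C[ii][jj] += a * B[kk][jj]
    return C
```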
Does blocked matmul attain the lower bound?

• Recall: if 3 b-by-b blocks fit in fast memory of size M, then reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so reads/writes ≥ 3^(1/2) n^3 / M^(1/2) + 2n^2
• Attains lower bound = Ω( #flops / M^(1/2) )
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?
Recursive Matrix Multiplication (RMM) (1/2)

• For simplicity: square matrices with n = 2^m

• C = [ C11 C12 ; C21 C22 ] = A · B = [ A11 A12 ; A21 A22 ] · [ B11 B12 ; B21 B22 ]
    = [ A11·B11 + A12·B21   A11·B12 + A12·B22 ;
        A21·B11 + A22·B21   A21·B12 + A22·B22 ]

• True when each Aij etc. is 1x1 or n/2 x n/2

func C = RMM(A, B, n)
  if n = 1
    C = A * B
  else
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  return C
Recursive Matrix Multiplication (RMM) (2/2)

A(n) = # arithmetic operations in RMM(·,·,n)
     = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in a different order

W(n) = # words moved between fast and slow memory by RMM(·,·,n)
     = 8 · W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^(1/2) + n^2 ) … same as blocked matmul

"Cache oblivious": works for all memory hierarchies, but not a panacea
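The RMM pseudocode above can be sketched directly in Python (numpy assumed; n must be a power of 2):

```python
import numpy as np

def rmm(A, B):
    """Cache-oblivious recursive matmul: same 2n^3 operations as the
    usual algorithm, in a different order.  Once a subproblem's three
    operands fit in some level of cache, all its work stays there,
    with no explicit block size anywhere."""
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    C = np.empty_like(A)
    C[:h, :h] = rmm(A[:h, :h], B[:h, :h]) + rmm(A[:h, h:], B[h:, :h])
    C[:h, h:] = rmm(A[:h, :h], B[:h, h:]) + rmm(A[:h, h:], B[h:, h:])
    C[h:, :h] = rmm(A[h:, :h], B[:h, :h]) + rmm(A[h:, h:], B[h:, :h])
    C[h:, h:] = rmm(A[h:, :h], B[:h, h:]) + rmm(A[h:, h:], B[h:, h:])
    return C
```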
For big speedups, see the SC12 poster on "Beating MKL and ScaLAPACK at Rectangular Matmul", 11/13 at 5:15-7pm
How hard is hand-tuning matmul, anyway?

• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results

[Figure: performance achieved by the student submissions vs. the tuned library]
Parallel MatMul with 2D Processor Layout

• P processors in P^(1/2) x P^(1/2) grid
– Processors communicate along rows, columns
• Each processor owns an n/P^(1/2) x n/P^(1/2) submatrix of A, B, C
• Example: P=16 processors, numbered from P00 to P33
– Processor Pij owns submatrices Aij, Bij, and Cij

[Figure: C = A * B, with each of C, A, B distributed over the 4x4 grid P00 … P33]
SUMMA Algorithm

• SUMMA = Scalable Universal Matrix Multiply
– Comes within a factor of log P of the lower bounds:
• Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
• #words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
• #messages = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
– Can accommodate any processor grid, matrix dimensions & layout
– Used in practice in PBLAS = Parallel BLAS
• www.netlib.org/lapack/lawns/lawn96.ps and lawn100.ps
• Comparison to Cannon's Algorithm:
– Cannon attains the lower bound
– But Cannon is harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

• C(i,j) is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is an n/P^(1/2) x b submatrix of A
• B(k,j) is a b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k) * B(k,j)
• summation over submatrices
• Need not be a square processor grid

[Figure: block row i of A (panels A(i,k)), block column j of B (panels B(k,j)), and block C(i,j)]
SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

For k = 0 to n/b - 1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to its whole processor row (using a binary tree)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to its whole processor column (using a binary tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow
SUMMA Communication Costs

For k = 0 to n/b - 1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to its whole processor row (using a binary tree)
      … #words = log P^(1/2) * b * n/P^(1/2),  #messages = log P^(1/2)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to its whole processor column (using a binary tree)
      … same #words and #messages
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

° Total #words = log P * n^2 / P^(1/2)
° Within a factor of log P of the lower bound
° (a more complicated implementation removes the log P factor)
° Total #messages = log P * n/b
° Choose b close to its maximum, n/P^(1/2), to approach the lower bound P^(1/2)
Can we do better?

• Lower bound assumed 1 copy of data: M = O(n^2/P) per proc
• What if the matrix is small enough to fit c > 1 copies, so M = c*n^2/P?
– #words_moved = Ω( #flops / M^(1/2) ) = Ω( n^2 / (c^(1/2) * P^(1/2)) )
– #messages = Ω( #flops / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )
• Can we attain the new lower bound?
– Special case, "3D Matmul": c = P^(1/3)
• Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
• Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
• Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
– Not always that much memory available…
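The c^(1/2) savings in the bandwidth bound can be checked numerically. A minimal sketch (hypothetical helper name; constants dropped):

```python
from math import sqrt

def words_moved_per_proc(n, P, c=1):
    """Lower bound Omega( n^2 / (c^(1/2) P^(1/2)) ) on words moved per
    processor when each processor holds c copies' worth of data;
    c = 1 recovers the 2D bound n^2 / P^(1/2)."""
    return n * n / sqrt(c * P)
```

So quadrupling the replication factor c halves the bandwidth lower bound.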
2.5D Matrix Multiplication

• Assume we can fit c*n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure: processor grid of dimensions (P/c)^(1/2) x (P/c)^(1/2) x c; example: P = 32, c = 2]

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along the k-axis, so P(i,j,0) owns C(i,j)
2.5D Matmul on BG/P, 16K nodes / 64K cores

[Figure: performance of 2.5D vs 2D matmul with c = 16 copies – up to 12x faster on small matrices, 2.7x faster on large ones]

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.
Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: PM = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
– γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) * [ γT + βT/M^(1/2) + αT/(m*M^(1/2)) ] = T(P)/c
• Notation for energy model:
– γE, βE, αE = joules for the same operations
– δE = joules per word of memory used per sec
– εE = joules per sec for leakage, etc.
• E(cP) = cP * { n^3/(cP) * [ γE + βE/M^(1/2) + αE/(m*M^(1/2)) ] } + δE*M*T(cP) + εE*T(cP) = E(P)
[Figure: Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K]
Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
• What is the minimum energy required for a computation?
• Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
• Given a maximum energy budget E, what is the minimum runtime T that we can attain?
• The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
• Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
Extending to the rest of Direct Linear Algebra
Naïve Gaussian Elimination: A = LU

for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) )                        … scale a vector
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)     … rank-1 update

i.e., for i = 1 to n-1: update column i, then update the trailing matrix
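The two-line loop body above transcribes directly to a runnable sketch (no pivoting, so it assumes nonzero pivots A(i,i)):

```python
def lu_no_pivot(A):
    """In-place LU factorization without pivoting, following the slide:
    scale column i, then do a rank-1 update of the trailing matrix.
    On return, A holds U in its upper triangle and the strictly lower
    triangle of L (unit diagonal implied)."""
    n = len(A)
    for i in range(n - 1):
        for r in range(i + 1, n):          # scale a vector
            A[r][i] /= A[i][i]
        for r in range(i + 1, n):          # rank-1 update
            for c in range(i + 1, n):
                A[r][c] -= A[r][i] * A[i][c]
    return A
```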
Communication in sequential One-sided Factorizations (LU, QR, …)

• Naive approach:
  for i = 1 to n-1
    update column i
    update trailing matrix
• #words_moved = O(n^3)

• Blocked approach (LAPACK):
  for i = 1 to n/b - 1
    update block i of b columns
    update trailing matrix
• #words_moved = O(n^3 / M^(1/3))

• Recursive approach:
  func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)
• #words_moved = O(n^3 / M^(1/2))

• None of these approaches
• minimizes #messages
• handles eig() or svd()
• works in parallel
• Need more ideas!
TSQR: QR of a Tall, Skinny matrix

W = [ W0 ; W1 ; W2 ; W3 ]
  = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00 ; R10 ; R20 ; R30 ]

[ R00 ; R10 ] = Q01·R01 and [ R20 ; R30 ] = Q11·R11, so
[ R00 ; R10 ; R20 ; R30 ] = diag(Q01, Q11) · [ R01 ; R11 ]

[ R01 ; R11 ] = Q02·R02

Output: { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
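The reduction above can be sketched as a binary tree of small local QRs (numpy assumed; only the final R is formed here, while the local Q factors, not kept, implicitly represent the overall Q):

```python
import numpy as np

def tsqr_r(W, p=4):
    """R factor of a tall-skinny W via a TSQR reduction tree:
    QR each of p row blocks, then repeatedly stack pairs of R factors
    and re-QR until a single R remains."""
    Rs = [np.linalg.qr(Wi)[1] for Wi in np.array_split(W, p)]
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]))[1]
              for i in range(0, len(Rs), 2)]
    return Rs[0]
```

Up to signs, the result matches the R of a direct QR of W, which one can check via R^T R = W^T W.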
TSQR: An Architecture-Dependent Algorithm

Parallel: binary reduction tree – W = [W0;W1;W2;W3] → (R00, R10, R20, R30) → (R01, R11) → R02
Sequential: flat tree – fold in one block at a time: R00 → R01 → R02 → R03
Dual Core: a hybrid of the two trees

Can choose the reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core
TSQR Performance Results
• Parallel
– Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
– Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
– BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
– Tesla C2050 (Fermi): up to 13x (110,592 x 100)
– Grid – 4x on 4 cities (Dongarra et al)
– Cloud – 1.6x slower than accessing the data twice (Gleich and Benson)
• Sequential
– "Infinite speedup" for out-of-core on a PowerPC laptop
• As little as 2x slowdown vs (predicted) infinite DRAM
• LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Back to LU: Using a similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"

W (n x b) = [ W1 ; W2 ; W3 ; W4 ] = [ P1·L1·U1 ; P2·L2·U2 ; P3·L3·U3 ; P4·L4·U4 ]
  Choose b pivot rows of W1, call them W1′
  Choose b pivot rows of W2, call them W2′
  Choose b pivot rows of W3, call them W3′
  Choose b pivot rows of W4, call them W4′

[ W1′ ; W2′ ; W3′ ; W4′ ] = [ P12·L12·U12 ; P34·L34·U34 ]
  Choose b pivot rows, call them W12′
  Choose b pivot rows, call them W34′

[ W12′ ; W34′ ] = P1234·L1234·U1234
  Choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to the top, do LU without pivoting
• Extra work, but a lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
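One reduction round of the tournament can be sketched as follows (hypothetical helper names; numpy assumed; Gaussian elimination with partial pivoting on each block selects the b candidate rows):

```python
import numpy as np

def select_pivot_rows(W, b):
    """Indices of b pivot rows of W chosen by Gaussian elimination
    with partial pivoting (the per-block step of tournament pivoting)."""
    A = W.astype(float).copy()
    rows = list(range(A.shape[0]))
    for k in range(b):
        p = k + int(np.argmax(np.abs(A[k:, k])))   # partial pivot
        A[[k, p]] = A[[p, k]]
        rows[k], rows[p] = rows[p], rows[k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k+1:])
    return sorted(rows[:b])

def tournament_pivot(W, b, nblocks=4):
    """One reduction round: pick b candidates per block, stack the
    winners, pick b rows from the stacked candidates.  A full tree
    repeats this log2(nblocks) times; one round reduces 4 blocks."""
    blocks = np.array_split(np.arange(W.shape[0]), nblocks)
    cand = []
    for idx in blocks:
        cand.extend(idx[select_pivot_rows(W[idx], b)])
    cand = np.array(cand)
    return cand[select_pivot_rows(W[cand], b)]
```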
Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: predicted speedup as a heatmap; x-axis log2(p), y-axis log2(n^2/p) = log2(memory_per_proc)]
2.5D vs 2D LU, With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #messages use (recursive) block layout
• Not possible with columnwise or rowwise layouts
• Many references (see reports); only some shown, plus ours
• Cache-oblivious algorithms are underlined on the slide; green are ours; ? is unknown/future work

Two levels of memory (#words moved / #messages):
• BLAS-3: usual blocked or recursive algorithms (for both)
• Cholesky: LAPACK (with b = M^(1/2)), [Gustavson,97], [Ahmed,Pingali,00], [BDHS09] / [BDHS09]
• LU with pivoting: LAPACK (sometimes), [Toledo,97], [GDX 08] / [GDX 08] with partial pivoting?
• QR, rank-revealing: LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08] / [Frens,Wise,03] (but 3x flops?), [DGHL08]
• Eig, SVD: Not LAPACK; [BDD11] (randomized, but more flops), [BDK11] / [BDD11, BDK11]

Multiple levels of memory (#words moved / #messages):
• BLAS-3: usual blocked algorithms (nested), or recursive [Gustavson,97]
• Cholesky: (← same) / (← same)
• LU with pivoting: [Toledo,97], [GDX 08] / [GDX 08]?
• QR, rank-revealing: [Elmroth,Gustavson,98], [DGHL08] / [Frens,Wise,03], [DGHL08]?
• Eig, SVD: [BDD11, BDK11] / [BDD11, BDK11]
Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (conventional approach)
• Minimum memory per processor: M = O(n^2/P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm        Reference      Factor over bound (#words_moved)   Factor over bound (#messages)
Matrix Multiply  [Cannon, 69]   1                                  1
Cholesky         ScaLAPACK      log P                              log P
LU               ScaLAPACK      log P                              n log P / P^(1/2)
QR               ScaLAPACK      log P                              n log P / P^(1/2)
Sym Eig, SVD     ScaLAPACK      log P                              n / P^(1/2)
Nonsym Eig       ScaLAPACK      P^(1/2) log P                      n log P
Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (better algorithms)
• Minimum memory per processor: M = O(n^2/P)

Algorithm        Reference      Factor over bound (#words_moved)   Factor over bound (#messages)
Matrix Multiply  [Cannon, 69]   1                                  1
Cholesky         ScaLAPACK      log P                              log P
LU               [GDX10]        log P                              log P
QR               [DGHL08]       log P                              log^3 P
Sym Eig, SVD     [BDD11]        log P                              log^3 P
Nonsym Eig       [BDD11]        log P                              log^3 P
Can we do even better?

• Assume nxn matrices on P processors
• Use c copies of data: M = O(c*n^2/P) per processor
• Increasing M reduces the lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2) P^(1/2)) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm        Reference         Factor over bound (#words_moved)   Factor over bound (#messages)
Matrix Multiply  [DS11, SBD11]     polylog P                          polylog P
Cholesky         [SD11, in prog.]  polylog P                          c^2 polylog P – optimal!
LU               [DS11, SBD11]     polylog P                          c^2 polylog P – optimal!
QR               via Cholesky QR   polylog P                          c^2 polylog P – optimal!
Sym Eig, SVD     ?
Nonsym Eig       ?
Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
– A = A^T is banded, with bandwidth b (b+1 diagonals in each triangle)
– T is tridiagonal
– Similar idea for the SVD of a band matrix
• Use alone, or as a second phase when A is dense:
– Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy the communication lower bound theorem for applying orthogonal transformations
– It can communicate even less!
Successive Band Reduction

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

[Figure sequence: each step applies an orthogonal transformation Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T to eliminate a c-column, d-diagonal parallelogram (steps numbered 1 through 6), then chases the resulting bulge down the band. Only the leading parallelogram of each trapezoid needs to be zeroed out.]
Conventional vs CA-SBR

Conventional: touches all data 4 times. Communication-Avoiding: touches all data once.
Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge.
The right choices reduce #words_moved by a factor of M/bw, not just M^(1/2).
Why avoid communication (12)
Algorithms have two costs (measured in time or energy)1 Arithmetic (FLOPS)2 Communication moving data between
ndash levels of a memory hierarchy (sequential case) ndash processors over a network (parallel case)
CPUCache
DRAM
CPUDRAM
CPUDRAM
CPUDRAM
CPUDRAM
Why avoid communication (23)
bull Running time of an algorithm is sum of 3 termsndash flops time_per_flopndash words moved bandwidthndash messages latency
3
communication
bull Time_per_flop ltlt 1 bandwidth ltlt latencybull Gaps growing exponentially with time [FOSC]
bull Avoid communication to save time
Annual improvements
Time_per_flop Bandwidth Latency
Network 26 15
DRAM 23 559
Why Minimize Communication (22)
Source John Shalf LBL
Why Minimize Communication (22)
Source John Shalf LBL
Minimize communication to save energy
Goals
6
bull Redesign algorithms to avoid communicationbull Between all memory hierarchy levels
bull L1 L2 DRAM network etc bull Attain lower bounds if possible
bull Current algorithms often far from lower boundsbull Large speedups and energy savings possible
ldquoNew Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems On modern computer architectures communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor ASCR researchers have developed a new method derived from commonly used linear algebra methods to minimize communications between processors and the memory hierarchy by reformulating the communication patterns specified within the algorithm This method has been implemented in the TRILINOS framework a highly-regarded suite of software which provides functionality for researchers around the world to solve large scale complex multi-physics problemsrdquo
FY 2010 Congressional Budget Volume 4 FY2010 Accomplishments Advanced Scientific Computing Research (ASCR) pages 65-67
President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress
CA-GMRES (Hoemmen Mohiyuddin Yelick JD)ldquoTall-Skinnyrdquo QR (Grigori Hoemmen Langou JD)
Collaborators and Supportersbull Michael Christ Jack Dongarra Ioana Dumitriu Armando Fox David Gleich Laura
Grigori Ming Gu Mike Heroux Mark Hoemmen Olga Holtz Kurt Keutzer Julien Langou Tom Scanlon Michelle Strout Sam Williams Hua Xiang Kathy Yelick
bull Michael Anderson Grey Ballard Austin Benson Abhinav Bhatele Aydin Buluc Erin Carson Maryam Dehnavi Michael Driscoll Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Marghoob Mohiyuddin Oded Schwartz Edgar Solomonik
bull Other members of ParLab BEBOP CACHE EASI FASTMath MAGMA PLASMA TOPS projects
bull Thanks to NSF DOE UC Discovery Intel Microsoft Mathworks National Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary of CA Algorithmsbull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication for linear algebra problems like Ax=b least squares Ax = λx SVD etc
bull New algorithms that attain these lower boundsbull Being added to libraries ScaLAPACK PLASMA MAGMAbull Large speed-ups possible
bull Autotuning to find optimal implementationbull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing arrays
(eg n-body)
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
12
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
13
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent ge words_moved largest_message_size
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
14
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz
Can we attain these lower bounds
bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not
bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties
new ways to encode answers new data structures
ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15
16
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n for j = 1 to n
for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)
= + C(ij) A(i)
B(j)C(ij)
17
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory
= + C(ij) A(i)
B(j)C(ij)
18
Naïve Matrix Multiply

implements C = C + A*B
for i = 1 to n
  read row i of A into fast memory        … n^2 reads altogether
  for j = 1 to n
    read C(i,j) into fast memory          … n^2 reads altogether
    read column j of B into fast memory   … n^3 reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    write C(i,j) back to slow memory      … n^2 writes altogether

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic
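The slide's traffic count can be checked mechanically. A small sketch (not from the slides; pure Python, purely illustrative) tallies the slow-memory reads and writes exactly as the annotated pseudocode prescribes:

```python
def naive_matmul_traffic(n):
    """Count slow-memory reads/writes for naive C = C + A*B,
    following the slide: row of A per i; C(i,j) and column of B per (i,j)."""
    reads = writes = 0
    for i in range(n):
        reads += n            # read row i of A
        for j in range(n):
            reads += 1        # read C(i,j)
            reads += n        # read column j of B
            writes += 1       # write C(i,j) back
    return reads + writes

# Matches the slide's total: n^3 + 3n^2 reads/writes
assert naive_matmul_traffic(8) == 8**3 + 3 * 8**2
```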
19
Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    read block C(i,j) into fast memory
    for k = 1 to n/b
      read block A(i,k) into fast memory
      read block B(k,j) into fast memory
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   … do a matrix multiply on blocks
    write block C(i,j) back to slow memory

[Diagram: b-by-b block update C(i,j) += A(i,k)·B(k,j)]
20
Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    read block C(i,j) into fast memory      … b^2 × (n/b)^2 = n^2 reads
    for k = 1 to n/b
      read block A(i,k) into fast memory    … b^2 × (n/b)^3 = n^3/b reads
      read block B(k,j) into fast memory    … b^2 × (n/b)^3 = n^3/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)     … do a matrix multiply on blocks
    write block C(i,j) back to slow memory  … b^2 × (n/b)^2 = n^2 writes

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
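The same tally for the blocked version (again an illustrative sketch, not from the slides) confirms the 2n^3/b + 2n^2 count:

```python
def blocked_matmul_traffic(n, b):
    """Slow-memory words moved by blocked matmul with block size b (b divides n)."""
    assert n % b == 0
    nb = n // b
    reads = writes = 0
    for i in range(nb):
        for j in range(nb):
            reads += b * b            # read block C(i,j)
            for k in range(nb):
                reads += b * b        # read block A(i,k)
                reads += b * b        # read block B(k,j)
            writes += b * b           # write block C(i,j) back
    return reads + writes

n, b = 16, 4
assert blocked_matmul_traffic(n, b) == 2 * n**3 // b + 2 * n**2
```

Larger b means less traffic, which is exactly why the next slide pushes b up to the memory limit 3b^2 ≤ M.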
Does blocked matmul attain lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 3^(1/2) n^3 / M^(1/2) + 2n^2
• Attains lower bound = Ω( #flops / M^(1/2) )
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?
21
Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m

• C = [C11 C12; C21 C22] = A · B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
    = [A11·B11 + A12·B21   A11·B12 + A12·B22;
       A21·B11 + A22·B21   A21·B12 + A22·B22]

• True when each Aij etc. is 1x1 or n/2 x n/2

func C = RMM(A, B, n)
  if n = 1
    C = A * B
  else
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  return
Recursive Matrix Multiplication (RMM) (2/2)

func C = RMM(A, B, n)
  if n = 1
    C = A * B
  else
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  return

A(n) = # arithmetic operations in RMM(·, ·, n)
     = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in different order
W(n) = # words moved between fast, slow memory by RMM(·, ·, n)
     = 8 · W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^(1/2) + n^2 ) … same as blocked matmul

"Cache oblivious": works for memory hierarchies, but not panacea
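A runnable sketch of RMM (pure Python on lists of rows; the helper names and the 4x4 test data are mine, not the slides') that checks the recursion against the classical triple loop:

```python
def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def split(X):
    """Split an n-by-n matrix (n even) into four n/2-by-n/2 quadrants."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])

def rmm(A, B):
    """Recursive matmul, mirroring the slide's RMM pseudocode."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    C11 = add(rmm(A11, B11), rmm(A12, B21))
    C12 = add(rmm(A11, B12), rmm(A12, B22))
    C21 = add(rmm(A21, B11), rmm(A22, B21))
    C22 = add(rmm(A21, B12), rmm(A22, B22))
    # Reassemble quadrants into one matrix
    return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
            [r1 + r2 for r1, r2 in zip(C21, C22)])

def naive(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[2, 0, 1, 3], [1, 1, 0, 2], [0, 3, 2, 1], [4, 1, 0, 0]]
assert rmm(A, B) == naive(A, B)   # same operations, different order
```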
For big speedups, see SC12 poster on "Beating MKL and ScaLAPACK at Rectangular Matmul", 11/13 at 5:15-7pm
How hard is hand-tuning matmul, anyway?

• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor-tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results

[Performance plots of the 22 student submissions]
Parallel MatMul with 2D Processor Layout

• P processors in P^(1/2) x P^(1/2) grid
– Processors communicate along rows, columns
• Each processor owns n/P^(1/2) x n/P^(1/2) submatrices of A, B, C
• Example: P=16 processors, numbered from P00 to P33
– Processor Pij owns submatrices Aij, Bij, and Cij

C = A * B, with each of C, A, B laid out on the same grid:

P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
27
SUMMA Algorithm

• SUMMA = Scalable Universal Matrix Multiply
– Comes within factor of log P of lower bounds:
• Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
• #words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
• #messages    = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
– Can accommodate any processor grid, matrix dimensions & layout
– Used in practice in PBLAS = Parallel BLAS
• www.netlib.org/lapack/lawns/lawn{96,100}.ps
• Comparison to Cannon's Algorithm:
– Cannon attains lower bound
– But Cannon harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
28
SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

• C(i,j) is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is an n/P^(1/2) x b submatrix of A
• B(k,j) is a b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σ_k A(i,k) * B(k,j)
• summation over submatrices
• Need not be square processor grid

[Diagram: block row i of A and block column j of B meet at C(i,j)]
29
SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

For k = 0 to n/b - 1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow
30
SUMMA Communication Costs

For k = 0 to n/b - 1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
      … #words = log P^(1/2) * b*n/P^(1/2), #messages = log P^(1/2)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
      … same #words and #messages
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

° Total #words = log P * n^2 / P^(1/2)
° Within factor of log P of lower bound
° (more complicated implementation removes log P factor)
° Total #messages = log P * n/b
° Choose b close to maximum, n/P^(1/2), to approach lower bound P^(1/2)
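The totals above can be evaluated for concrete sizes. A small calculator (the formulas are the slide's; the sample n, P, b are my own illustrative choices):

```python
import math

def summa_costs(n, P, b):
    """Per-processor totals for SUMMA on a sqrt(P) x sqrt(P) grid, per the slide:
    words = log2(P) * n^2 / sqrt(P), messages = log2(P) * n/b."""
    words = math.log2(P) * n * n / math.sqrt(P)
    messages = math.log2(P) * n / b
    return words, messages

n, P = 4096, 64
b = n // int(math.sqrt(P))            # largest block size: b = n / sqrt(P)
words, messages = summa_costs(n, P, b)
lower_words = n * n / math.sqrt(P)    # lower bound Ω(n^2 / P^(1/2))
lower_msgs = math.sqrt(P)             # lower bound Ω(P^(1/2))
assert words / lower_words == math.log2(P)    # within a log P factor
assert messages / lower_msgs == math.log2(P)  # likewise, once b = n/sqrt(P)
```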
31
Can we do better?

• Lower bound assumed 1 copy of data: M = O(n^2/P) per proc
• What if matrix small enough to fit c > 1 copies, so M = c·n^2/P?
– #words_moved = Ω( #flops / M^(1/2) ) = Ω( n^2 / (c^(1/2) · P^(1/2)) )
– #messages    = Ω( #flops / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )
• Can we attain new lower bound?
– Special case: "3D Matmul", c = P^(1/3)
• Dekel, Nassimi, Sahni [81]; Bernsten [89]; Agarwal, Chandra, Snir [90]; Johnson [93]; Agarwal, Balle, Gustavson, Joshi, Palkar [95]
• Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
• Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
– Not always that much memory available…
2.5D Matrix Multiplication

• Assume can fit cn^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Diagram: a (P/c)^(1/2) x (P/c)^(1/2) base grid replicated to depth c. Example: P = 32, c = 2]
2.5D Matrix Multiplication

• Assume can fit cn^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σ_m A(i,m)*B(m,j) along k-axis, so P(i,j,0) owns C(i,j)
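Steps (2)-(3) can be simulated serially. In this sketch (pure Python; a loop over layers stands in for the c processor layers, an assumption made purely for illustration), each layer multiplies a 1/c slice of the summation index m, and the partial products are then sum-reduced:

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matmul_25d_layers(A, B, c):
    """Each of the c 'layers' computes 1/c-th of the sum over m (step 2),
    then the partial results are sum-reduced along the k-axis (step 3)."""
    n = len(A)
    assert n % c == 0
    partials = []
    for layer in range(c):
        lo, hi = layer * n // c, (layer + 1) * n // c
        Ak = [row[lo:hi] for row in A]          # A(:, lo:hi)
        Bk = B[lo:hi]                           # B(lo:hi, :)
        partials.append([[sum(Ak[i][m] * Bk[m][j] for m in range(hi - lo))
                          for j in range(n)] for i in range(n)])
    # step (3): sum-reduce the partial sums
    return [[sum(p[i][j] for p in partials) for j in range(n)]
            for i in range(n)]

A = [[1, 2, 3, 4], [0, 1, 0, 2], [3, 1, 4, 1], [2, 2, 1, 0]]
B = [[1, 0, 2, 1], [0, 3, 1, 0], [2, 1, 0, 2], [1, 1, 1, 1]]
assert matmul_25d_layers(A, B, c=2) == matmul(A, B)
```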
2.5D Matmul on BG/P, 16K nodes / 64K cores (c = 16 copies)

Distinguished Paper Award, EuroPar'11 (SC'11 paper by Solomonik, Bhatele, D.)

[Performance plots, annotated "12x faster" and "2.7x faster"]
Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: PM = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
– γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
– γ_E, β_E, α_E = joules for same operations
– δ_E = joules per word of memory used per sec
– ε_E = joules per sec, for leakage etc.
• E(cP) = cP · { n^3/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
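A quick numerical check of the timing model (the machine constants below are invented placeholders, not measured values): with M held fixed per processor, T(cP) = T(P)/c exactly:

```python
def T(P, n, M, m, gamma, beta, alpha):
    """Timing model from the slide: flop + bandwidth + latency terms."""
    return (n**3 / P) * (gamma + beta / M**0.5 + alpha / (m * M**0.5))

n, M, m = 4096, 2**20, 1024              # assumed problem/memory sizes (illustrative)
gamma, beta, alpha = 1e-9, 1e-8, 1e-6    # assumed machine constants (illustrative)
P = 64
for c in (2, 4, 8):
    ratio = T(P, n, M, m, gamma, beta, alpha) / T(c * P, n, M, m, gamma, beta, alpha)
    assert abs(ratio - c) < 1e-9         # perfect strong scaling in time
```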
Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n=64K
Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
• What is the minimum energy required for a computation?
• Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
• Given a maximum energy budget E, what is the minimum runtime T that we can attain?
• The ratio P = E/T gives us the average power required to run the algorithm; can we minimize the average power consumed?
• Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
39
Extending to rest of Direct Linear Algebra
Naïve Gaussian Elimination: A = LU

for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) )    … scale a vector
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)   … rank-1 update

Equivalently:
for i = 1 to n-1
  update column i
  update trailing matrix
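The two-line algorithm above can be run directly. A pure-Python sketch (no pivoting, so it assumes the leading entries A(i,i) are nonzero), checked by rebuilding L·U:

```python
def lu_no_pivot(A):
    """In-place LU of a copy of A: strict lower part holds L (unit diagonal),
    upper triangle holds U. Mirrors the slide's scale + rank-1 update."""
    n = len(A)
    A = [row[:] for row in A]
    for i in range(n - 1):
        for r in range(i + 1, n):
            A[r][i] /= A[i][i]                    # scale a vector
        for r in range(i + 1, n):                 # rank-1 update of trailing matrix
            for c in range(i + 1, n):
                A[r][c] -= A[r][i] * A[i][c]
    return A

M = [[4.0, 3.0, 2.0], [8.0, 7.0, 9.0], [4.0, 6.0, 5.0]]
F = lu_no_pivot(M)
n = len(M)
L = [[F[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
U = [[F[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
LU = [[sum(L[i][k] * U[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
assert all(abs(LU[i][j] - M[i][j]) < 1e-12 for i in range(n) for j in range(n))
```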
Communication in sequential One-sided Factorizations (LU, QR, …)

• Naive Approach:
  for i = 1 to n-1
    update column i
    update trailing matrix
  #words_moved = O(n^3)

• Blocked Approach (LAPACK):
  for i = 1 to n/b - 1
    update block i of b columns
    update trailing matrix
  #words_moved = O(n^3 / M^(1/3))

• Recursive Approach:
  func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)
  #words_moved = O(n^3 / M^(1/2))

• None of these approaches
  • minimizes #messages
  • handles eig() or svd()
  • works in parallel
• Need more ideas!
TSQR: QR of a Tall, Skinny matrix

W = [W0; W1; W2; W3] = [Q00·R00; Q10·R10; Q20·R20; Q30·R30]
  = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]

[R00; R10] = Q01·R01 and [R20; R30] = Q11·R11, so
[R00; R10; R20; R30] = diag(Q01, Q11) · [R01; R11]

[R01; R11] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
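A runnable TSQR sketch in pure Python (helper names are mine; it uses a flat reduction tree — QR each block, stack the R factors, QR the stack — rather than the binary tree drawn on the slide). The final R matches the R of a direct QR of W, since both are the unique R factor with positive diagonal:

```python
def qr_mgs(A):
    """Modified Gram-Schmidt QR of an m x n matrix (m >= n, full rank),
    given as a list of rows; returns (Q, R) with positive diagonal in R."""
    m, n = len(A), len(A[0])
    Q = [[A[i][j] for j in range(n)] for i in range(m)]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = sum(Q[i][j] ** 2 for i in range(m)) ** 0.5
        for i in range(m):
            Q[i][j] /= R[j][j]
        for k in range(j + 1, n):
            R[j][k] = sum(Q[i][j] * Q[i][k] for i in range(m))
            for i in range(m):
                Q[i][k] -= R[j][k] * Q[i][j]
    return Q, R

def tsqr_R(W, nblocks):
    """R factor of tall-skinny W via TSQR with a flat reduction tree:
    QR each row block, stack the local R factors, QR the stack."""
    rows_per = len(W) // nblocks
    Rs = []
    for b in range(nblocks):
        _, R = qr_mgs(W[b * rows_per:(b + 1) * rows_per])
        Rs.extend(R)
    _, R = qr_mgs(Rs)
    return R

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0],
     [2.0, 1.0], [4.0, 3.0], [6.0, 5.0], [8.0, 7.0]]
R_tsqr = tsqr_R(W, nblocks=4)
_, R_direct = qr_mgs(W)
assert all(abs(R_tsqr[i][j] - R_direct[i][j]) < 1e-9
           for i in range(2) for j in range(2))
```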
TSQR: An Architecture-Dependent Algorithm

Parallel (binary tree):
W = [W0; W1; W2; W3] → (R00, R10, R20, R30) → (R01, R11) → R02

Sequential (flat tree):
W0 → R00; [R00; W1] → R01; [R01; W2] → R02; [R02; W3] → R03

Dual Core (hybrid tree):
each core runs a flat tree over its blocks; partial R factors are merged pairwise

Can choose reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core
TSQR Performance Results
• Parallel
– Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
– Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
– BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
– Tesla C2050 / Fermi: up to 13x (110,592 x 100)
– Grid – 4x on 4 cities (Dongarra et al)
– Cloud – 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
– "Infinite speedup" for out-of-core on PowerPC laptop
• As little as 2x slowdown vs (predicted) infinite DRAM
• LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"

W_{n x b} = [W1; W2; W3; W4], with each Wi = Pi·Li·Ui
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'

[W1'; W2'] = P12·L12·U12 → choose b pivot rows, call them W12'
[W3'; W4'] = P34·L34·U34 → choose b pivot rows, call them W34'

[W12'; W34'] = P1234·L1234·U1234 → choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
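A toy version of tournament pivoting for b = 1 (my own illustration, not the slides' algorithm: with a single pivot row, GEPP's choice per block is just the row with the largest |leading entry|, and the "tournament" reduces the local winners):

```python
def local_pivot(rows):
    """Pick the (index, row) pair with largest |leading entry| -- GEPP's choice for b=1."""
    return max(rows, key=lambda r: abs(r[1][0]))

def tournament_pivot(rows, block):
    """Tournament pivoting, b=1 toy: pick a local winner per block of rows,
    then reduce the winners (a single flat reduction here, for brevity)."""
    winners = [local_pivot(rows[i:i + block]) for i in range(0, len(rows), block)]
    return local_pivot(winners)

W = list(enumerate([[0.1, 2.0], [4.0, 1.0], [3.0, 5.0], [-9.0, 0.5],
                    [2.0, 2.0], [0.0, 1.0], [7.0, 7.0], [1.0, 3.0]]))
idx, row = tournament_pivot(W, block=2)
assert idx == 3   # row [-9.0, 0.5] has the largest |leading entry| overall
```

For b = 1 the tournament winner always equals the global partial-pivoting choice; for b > 1 the real algorithm picks rows via LU on each block, and the theorem above bounds its stability.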
Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Heat map: x-axis log2(p); y-axis log2(n^2/p) = log2(memory_per_proc)]
2.5D vs 2D LU: With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #Messages use (recursive) block layout
• Not possible with columnwise or rowwise layouts
• Many references (see reports); only some shown, plus ours
• Cache-oblivious are underlined; green are ours; ? is unknown/future work

BLAS-3
  2 Levels of Memory: Usual blocked or recursive algorithms
  Multiple Levels of Memory: Usual blocked algorithms (nested) or recursive [Gustavson,97]
Cholesky
  2 Levels – #Words Moved: LAPACK (with b = M^(1/2)), [Gustavson,97], [BDHS09]; #Messages: [Gustavson,97], [Ahmed,Pingali,00], [BDHS09]
  Multiple Levels: (←same) (←same)
LU with pivoting
  2 Levels – #Words Moved: LAPACK (sometimes), [Toledo,97], [GDX,08]; #Messages: [GDX,08] (tournament, not partial, pivoting)
  Multiple Levels – #Words Moved: [Toledo,97], [GDX,08]; #Messages: [GDX,08]
QR, Rank-revealing
  2 Levels – #Words Moved: LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08], [Frens,Wise,03] (but 3x flops); #Messages: [DGHL08]
  Multiple Levels – #Words Moved: [Elmroth,Gustavson,98], [DGHL08]; #Messages: [Frens,Wise,03], [DGHL08]
Eig, SVD
  2 Levels: Not LAPACK; [BDD11] (randomized, but more flops), [BDK11]
  Multiple Levels: [BDD11, BDK11]
Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | ScaLAPACK | log P | (n/P^(1/2)) · log P
QR | ScaLAPACK | log P | (n/P^(1/2)) · log P
Sym Eig, SVD | ScaLAPACK | log P | n/P^(1/2)
Nonsym Eig | ScaLAPACK | P^(1/2) · log P | n · log P
Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (better)
• Minimum Memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX10] | log P | log P
QR | [DGHL08] | log P | log^3 P
Sym Eig, SVD | [BDD11] | log P | log^3 P
Nonsym Eig | [BDD11] | log P | log^3 P
Can we do even better?

• Assume nxn matrices on P processors
• Use c copies of data: M = O(c·n^2 / P) per processor
• Increasing M reduces lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2) · P^(1/2)) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [DS11, SBD11] | polylog P | polylog P
Cholesky | [SD11, in prog.] | polylog P | c^2 · polylog P – optimal!
LU | [DS11, SBD11] | polylog P | c^2 · polylog P – optimal!
QR | Via Cholesky QR | polylog P | c^2 · polylog P – optimal!
Sym Eig, SVD | ? | |
Nonsym Eig | ? | |
Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
– A = A^T is banded
– T is tridiagonal
– Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
– Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
– It can communicate even less!
Successive Band Reduction

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

[Sequence of band-matrix diagrams: step 1 applies Q1 to zero out a parallelogram of c columns and d diagonals at the top of the band; applying Q1^T on the other side creates a bulge of width d+c further down the band; step 2 zeroes that bulge with Q2, creating the next one, and so on through Q3, Q4, Q5, … — classic bulge chasing, shown as steps 1–6 in the slides]

Only need to zero out the leading parallelogram of each trapezoid.
Conventional vs CA-SBR

Conventional: touch all data 4 times
Communication-Avoiding: touch all data once

Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge. Right choices reduce #words_moved by a factor of M/bw, not just M^(1/2).
FY 2010 Congressional Budget, Volume 4, FY2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67

President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress:
CA-GMRES (Hoemmen, Mohiyuddin, Yelick, JD); "Tall-Skinny" QR (Grigori, Hoemmen, Langou, JD)
Collaborators and Supporters
• Michael Christ, Jack Dongarra, Ioana Dumitriu, Armando Fox, David Gleich, Laura Grigori, Ming Gu, Mike Heroux, Mark Hoemmen, Olga Holtz, Kurt Keutzer, Julien Langou, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang, Kathy Yelick
• Michael Anderson, Grey Ballard, Austin Benson, Abhinav Bhatele, Aydin Buluc, Erin Carson, Maryam Dehnavi, Michael Driscoll, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Marghoob Mohiyuddin, Oded Schwartz, Edgar Solomonik
• Other members of the ParLab, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA, TOPS projects
• Thanks to NSF, DOE, UC Discovery, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu
Summary of CA Algorithms
• "Direct" Linear Algebra
• Lower bounds on communication for linear algebra problems like Ax=b, least squares, Ax = λx, SVD, etc.
• New algorithms that attain these lower bounds
• Being added to libraries: ScaLAPACK, PLASMA, MAGMA
• Large speed-ups possible
• Autotuning to find optimal implementation
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)

Outline
• "Direct" Linear Algebra
• Lower bounds on communication
• New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)
16
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n for j = 1 to n
for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)
= + C(ij) A(i)
B(j)C(ij)
17
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory
= + C(ij) A(i)
B(j)C(ij)
18
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether
= + C(ij) A(i)
B(j)C(ij)
n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic
19
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory
= + C(ij) C(ij) A(ik)
B(kj)b-by-bblock
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= + C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 312n3M12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
22
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
23
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm
How hard is hand-tuning matmul anyway
24
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
25
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
27
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS
bull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions
layouts and Cannon may use more memory
28
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
kk
B(kj)
C(ij)
29
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
kk
B(kj)
C(ij)
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
Brow
Acol
30
SUMMA Communication Costs
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
deg Total words = log P n2 P12
deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)
deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12
31
Can we do better
bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P
ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)
bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies
[Figure: execution time of 2.5D vs 2D matmul and LU — 2.5D up to 12x faster (matmul) and 2.7x faster (LU), depending on problem size]
Distinguished Paper Award, EuroPar'11 (SC'11 paper by Solomonik, Bhatele, D.)
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P*M = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) * [γ_T + β_T/M^(1/2) + α_T/(m*M^(1/2))] = T(P)/c
• Notation for energy model:
  – γ_E, β_E, α_E = joules for same operations
  – δ_E = joules per word of memory used per sec
  – ε_E = joules per sec for leakage, etc.
  – E(cP) = cP * { n^3/(cP) * [γ_E + β_E/M^(1/2) + α_E/(m*M^(1/2))] + δ_E*M*T(cP) + ε_E*T(cP) } = E(P)
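A sketch of the timing model in Python; the machine constants are made-up placeholders, and only the algebra matters — with M held fixed per processor, multiplying P by c divides T by exactly c:

```python
def T(P, n, M, gammaT, betaT, alphaT, m):
    # T(P) = (n^3/P) * [gammaT + betaT/M^(1/2) + alphaT/(m*M^(1/2))]
    return (n**3 / P) * (gammaT + betaT / M**0.5 + alphaT / (m * M**0.5))

n = 2**12
gammaT, betaT, alphaT, m = 1e-11, 1e-9, 1e-6, 1024.0   # placeholder constants
P0 = 4
M = 3 * n**2 / P0        # minimal processor count: P*M = 3n^2
c = 8
# M stays fixed per processor, so c times more processors (and c times
# more total memory) cut the runtime by exactly c: perfect strong scaling
assert abs(T(c * P0, n, M, gammaT, betaT, alphaT, m) * c
           - T(P0, n, M, gammaT, betaT, alphaT, m)) < 1e-9
```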
[Figure: perfect strong scaling in time of 2.5D matmul on BG/P, n = 64K]
Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm; can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors, and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimination, A = LU

for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) * (1 / A(i,i))                        … scale a vector
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)   … rank-1 update

i.e. for i = 1 to n-1: update column i, then update trailing matrix
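The two-line elimination above, written out as a runnable pure-Python sketch (no pivoting, so it assumes nonzero pivots; L and U are packed into A as on the slide):

```python
def lu_in_place(A):
    """Naive LU without pivoting: after the loop, A holds the strictly lower
    part of unit-lower-triangular L and the upper triangle U."""
    n = len(A)
    for i in range(n - 1):
        for r in range(i + 1, n):
            A[r][i] *= 1.0 / A[i][i]            # scale a vector
        for r in range(i + 1, n):               # rank-1 update of trailing matrix
            for c in range(i + 1, n):
                A[r][c] -= A[r][i] * A[i][c]
    return A

A = [[4.0, 3.0], [6.0, 3.0]]
lu_in_place(A)
# L = [[1, 0], [1.5, 1]], U = [[4, 3], [0, -1.5]]
assert A == [[4.0, 3.0], [1.5, -1.5]]
```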
Communication in sequential one-sided factorizations (LU, QR, …)

• Naive approach:
  for i = 1 to n-1: update column i, update trailing matrix
  #words_moved = O(n^3)
• Blocked approach (LAPACK):
  for i = 1 to n/b: update block i of b columns, update trailing matrix
  #words_moved = O(n^3 / M^(1/3))
• Recursive approach:
  func factor(A): if A has 1 column, update it; else factor(left half of A), update right half of A, factor(right half of A)
  #words_moved = O(n^3 / M^(1/2))
• None of these approaches minimizes #messages, handles eig() or svd(), or works in parallel
• Need more ideas!
TSQR: QR of a Tall, Skinny matrix

W = [W0; W1; W2; W3] = [Q00*R00; Q10*R10; Q20*R20; Q30*R30]
  = diag(Q00, Q10, Q20, Q30) * [R00; R10; R20; R30]

[R00; R10; R20; R30] = [Q01*R01; Q11*R11] = diag(Q01, Q11) * [R01; R11]

[R01; R11] = Q02*R02

Output: { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
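A one-level TSQR sketch, assuming numpy is available and using its QR for the local factorizations (only R is formed; the implicit Q factors are discarded):

```python
import numpy as np

def tsqr_R(W, p=4):
    """TSQR with a pairwise (binary-tree) reduction: local QRs of p row
    blocks, then repeated QRs of stacked pairs of R factors."""
    blocks = np.array_split(W, p)
    Rs = [np.linalg.qr(Wi)[1] for Wi in blocks]        # R00, R10, R20, R30
    while len(Rs) > 1:                                 # reduction tree
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]))[1]  # QR of stacked [R; R]
              for i in range(0, len(Rs), 2)]
    return Rs[0]                                       # R02

rng = np.random.default_rng(1)
W = rng.standard_normal((1000, 5))
R_tsqr = tsqr_R(W)
R_ref = np.linalg.qr(W)[1]
# the R factor is unique up to the signs of its rows
assert np.allclose(np.abs(R_tsqr), np.abs(R_ref))
```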
TSQR: An Architecture-Dependent Algorithm

Can choose reduction tree dynamically to match the machine:
• Parallel: binary tree — W = [W0; W1; W2; W3] → (R00, R10, R20, R30) → (R01, R11) → R02
• Sequential: flat tree — blocks folded in one at a time: R00 → R01 → R02 → R03
• Dual core: hybrid tree — each core reduces its blocks sequentially, then the partial R factors are merged
Works for Multicore, Multisocket, Multirack, Multisite, Out-of-core settings.
TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid: 4x on 4 cities (Dongarra et al)
  – Cloud: 1.6x slower than just accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Back to LU: Using similar idea for TSLU as TSQR — use reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4] = [P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4]
  Choose b pivot rows of W1, call them W1'; likewise W2', W3', W4'

[W1'; W2'] = P12·L12·U12 → choose b pivot rows, call them W12'
[W3'; W4'] = P34·L34·U34 → choose b pivot rows, call them W34'

[W12'; W34'] = P1234·L1234·U1234 → choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
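A toy illustration of the tournament with block size b = 1 (all names are hypothetical); at b = 1 the tournament reduces to finding the row of largest leading magnitude, the same pivot partial pivoting would pick:

```python
def tournament_pivot_row(W, blocks=4):
    """b = 1 tournament pivoting: each block nominates its row of largest
    leading magnitude, and winners play off up a binary reduction tree."""
    n = len(W)
    step = n // blocks
    # leaves: best candidate row from each block W1..W4
    candidates = [max(W[s:s + step], key=lambda row: abs(row[0]))
                  for s in range(0, n, step)]
    while len(candidates) > 1:          # reduce pairs of candidates
        candidates = [max(candidates[i:i + 2], key=lambda row: abs(row[0]))
                      for i in range(0, len(candidates), 2)]
    return candidates[0]

W = [[1.0, 2.0], [-9.0, 1.0], [3.0, 0.0], [4.0, 5.0],
     [2.0, 2.0], [-1.0, 1.0], [6.0, 0.0], [0.5, 5.0]]
# with b = 1 the tournament finds the global partial-pivoting pivot
assert tournament_pivot_row(W) == [-9.0, 1.0]
```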
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Figure: heat map of predicted speedup over axes log2(p) and log2(n^2/p) = log2(memory_per_proc)]
2.5D vs 2D LU, With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #Messages use (recursive) block layout
  – Not possible with columnwise or rowwise layouts
• Many references (see reports), only some shown, plus ours
  – Cache-oblivious are underlined, green are ours, ? is unknown/future work

Algorithm | 2 Levels of Memory (#Words Moved and #Messages) | Multiple Levels of Memory (#Words Moved and #Messages)
BLAS-3 | Usual blocked or recursive algorithms | Usual blocked algorithms (nested), or recursive [Gustavson,97]
Cholesky | LAPACK (with b = M^(1/2)), [Gustavson 97], [BDHS09]; [Gustavson,97], [Ahmed,Pingali,00], [BDHS09] | (←same); (←same)
LU with pivoting | LAPACK (sometimes), [Toledo,97], [GDX 08]; [GDX 08] not partial pivoting | [Toledo 97], [GDX 08]; [GDX 08]
QR, rank-revealing | LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08]; [Frens,Wise,03] but 3x flops, [DGHL08] | [Elmroth,Gustavson,98], [DGHL08]; [Frens,Wise,03], [DGHL08]
Eig, SVD | Not LAPACK; [BDD11] randomized, but more flops; [BDK11] | [BDD11,BDK11]; [BDD11,BDK11]
Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors
• Minimum Memory per processor: M = O(n^2/P)
• Recall lower bounds:
  #words_moved = Ω((n^3/P) / M^(1/2)) = Ω(n^2 / P^(1/2))
  #messages   = Ω((n^3/P) / M^(3/2)) = Ω(P^(1/2))

Conventional approach:
Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | ScaLAPACK | log P | (n log P) / P^(1/2)
QR | ScaLAPACK | log P | (n log P) / P^(1/2)
Sym Eig, SVD | ScaLAPACK | log P | n / P^(1/2)
Nonsym Eig | ScaLAPACK | P^(1/2) log P | n log P

Better, communication-avoiding algorithms:
Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX10] | log P | log P
QR | [DGHL08] | log P | log^3 P
Sym Eig, SVD | [BDD11] | log P | log^3 P
Nonsym Eig | [BDD11] | log P | log^3 P
Can we do even better?
• Assume n x n matrices on P processors
• Use c copies of data: M = O(c*n^2/P) per processor
• Increasing M reduces lower bounds:
  #words_moved = Ω((n^3/P) / M^(1/2)) = Ω(n^2 / (c^(1/2) P^(1/2)))
  #messages   = Ω((n^3/P) / M^(3/2)) = Ω(P^(1/2) / c^(3/2))

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [DS11, SBD11] | polylog P | polylog P
Cholesky | [SD11, in prog.] | polylog P | c^2 polylog P – optimal!
LU | [DS11, SBD11] | polylog P | c^2 polylog P – optimal!
QR | Via Cholesky QR | polylog P | c^2 polylog P – optimal!
Sym Eig, SVD | ? | ? | ?
Nonsym Eig | ? | ? | ?
Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less!
Successive Band Reduction
[Figure: sequence of band-reduction diagrams — sweeps of orthogonal transformations Q1, Q1^T, Q2, Q2^T, … eliminate d diagonals at a time and chase the resulting bulges down the band]
b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b
Only need to zero out leading parallelogram of each trapezoid.
Conventional vs CA-SBR
Conventional: touch all data 4 times. Communication-avoiding: touch all data once.
Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge. Right choices reduce #words_moved by factor M/bw, not just M^(1/2).
ldquoNew Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems On modern computer architectures communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor ASCR researchers have developed a new method derived from commonly used linear algebra methods to minimize communications between processors and the memory hierarchy by reformulating the communication patterns specified within the algorithm This method has been implemented in the TRILINOS framework a highly-regarded suite of software which provides functionality for researchers around the world to solve large scale complex multi-physics problemsrdquo
FY 2010 Congressional Budget, Volume 4, FY2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67
President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress
CA-GMRES (Hoemmen, Mohiyuddin, Yelick, JD)
"Tall-Skinny" QR (Grigori, Hoemmen, Langou, JD)
Collaborators and Supporters
• Michael Christ, Jack Dongarra, Ioana Dumitriu, Armando Fox, David Gleich, Laura Grigori, Ming Gu, Mike Heroux, Mark Hoemmen, Olga Holtz, Kurt Keutzer, Julien Langou, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang, Kathy Yelick
• Michael Anderson, Grey Ballard, Austin Benson, Abhinav Bhatele, Aydin Buluc, Erin Carson, Maryam Dehnavi, Michael Driscoll, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Marghoob Mohiyuddin, Oded Schwartz, Edgar Solomonik
• Other members of the ParLab, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA, TOPS projects
• Thanks to NSF, DOE, UC Discovery, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu
Summary of CA Algorithms
• "Direct" Linear Algebra
  • Lower bounds on communication for linear algebra problems like Ax=b, least squares, Ax = λx, SVD, etc.
  • New algorithms that attain these lower bounds
    • Being added to libraries: ScaLAPACK, PLASMA, MAGMA
    • Large speed-ups possible
    • Autotuning to find optimal implementation
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)
Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)
Lower bound for all "direct" linear algebra

• Let M = "fast" memory size (per processor)

  #words_moved (per processor) = Ω(#flops (per processor) / M^(1/2))

  #messages_sent ≥ #words_moved / largest_message_size, so

  #messages_sent (per processor) = Ω(#flops (per processor) / M^(3/2))

• Parallel case: assume either load or memory balanced
• Holds for:
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)

SIAM SIAG/Linear Algebra Prize, 2012: Ballard, D., Holtz, Schwartz
Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply
Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  {read row i of A into fast memory}            … n^2 reads altogether
  for j = 1 to n
    {read C(i,j) into fast memory}              … n^2 reads altogether
    {read column j of B into fast memory}       … n^3 reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}          … n^2 writes altogether

[Figure: C(i,j) computed from row A(i,:) and column B(:,j)]

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic
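The read/write annotations above can be tallied directly (a hypothetical counting sketch, not a real cache simulator):

```python
def naive_matmul_traffic(n):
    """Tally slow-memory traffic of the naive loop nest, one word per
    matrix element, following the slide's annotations."""
    reads = writes = 0
    for i in range(n):
        reads += n                  # read row i of A into fast memory
        for j in range(n):
            reads += 1              # read C(i,j)
            reads += n              # read column j of B
            # inner k-loop runs entirely in fast memory
            writes += 1             # write C(i,j) back
    return reads + writes

assert naive_matmul_traffic(10) == 10**3 + 3 * 10**2
```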
Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}        … b^2 x (n/b)^2 = n^2 reads
    for k = 1 to n/b
      {read block A(i,k) into fast memory}      … b^2 x (n/b)^3 = n^3/b reads
      {read block B(k,j) into fast memory}      … b^2 x (n/b)^3 = n^3/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)         {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}    … b^2 x (n/b)^2 = n^2 writes

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
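A pure-Python sketch of the tiled loop nest that does the arithmetic and tallies the block traffic at the same time (names are illustrative):

```python
def blocked_matmul(A, B, C, n, b):
    """Blocked C = C + A*B over b-by-b tiles; returns the slow-memory
    traffic in words, counted as in the slide's annotations."""
    traffic = 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            traffic += b * b                       # read block C(i,j)
            for k in range(0, n, b):
                traffic += 2 * b * b               # read blocks A(i,k), B(k,j)
                for ii in range(i, i + b):         # b-by-b matmul on blocks
                    for jj in range(j, j + b):
                        C[ii][jj] += sum(A[ii][kk] * B[kk][jj]
                                         for kk in range(k, k + b))
            traffic += b * b                       # write block C(i,j) back
    return traffic

n, b = 8, 4
A = [[float(i * n + j) for j in range(n)] for i in range(n)]
B = [[float(j * n + i) for j in range(n)] for i in range(n)]
C = [[0.0] * n for _ in range(n)]
traffic = blocked_matmul(A, B, C, n, b)
assert traffic == 2 * n**3 // b + 2 * n**2         # 2n^3/b + 2n^2 reads/writes
naive = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
assert C == naive
```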
Does blocked matmul attain lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 3^(1/2) * 2n^3/M^(1/2) + 2n^2
• Attains lower bound = Ω(#flops / M^(1/2))
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?
Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = [C11, C12; C21, C22] = A·B = [A11, A12; A21, A22] · [B11, B12; B21, B22]
    = [A11·B11 + A12·B21, A11·B12 + A12·B22; A21·B11 + A22·B21, A21·B12 + A22·B22]
• True when each Aij, etc. is 1x1 or n/2 x n/2

func C = RMM(A, B, n)
  if n = 1, C = A * B
  else
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  return
Recursive Matrix Multiplication (RMM) (2/2)

A(n) = # arithmetic operations in RMM(·, ·, n)
     = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in different order

W(n) = # words moved between fast, slow memory by RMM(·, ·, n)
     = 8 · W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O(n^3 / M^(1/2) + n^2) … same as blocked matmul

"Cache oblivious": works for memory hierarchies, but not a panacea
For big speedups, see SC12 poster on "Beating MKL and ScaLAPACK at Rectangular Matmul" (11/13 at 5:15-7pm)
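The recursion above can be written on plain Python lists, splitting down to 1x1 blocks (assumes n is a power of 2):

```python
def rmm(A, B):
    """Recursive matrix multiply: quarter both operands and recurse,
    exactly as in the RMM pseudocode."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(M, r, c):  # extract an h-by-h quadrant starting at (r, c)
        return [row[c:c + h] for row in M[r:r + h]]
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    C11 = add(rmm(A11, B11), rmm(A12, B21))
    C12 = add(rmm(A11, B12), rmm(A12, B22))
    C21 = add(rmm(A21, B11), rmm(A22, B21))
    C22 = add(rmm(A21, B12), rmm(A22, B22))
    return [r1 + r2 for r1, r2 in zip(C11, C12)] + \
           [r1 + r2 for r1, r2 in zip(C21, C22)]

assert rmm([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]
```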
How hard is hand-tuning matmul, anyway?
[Figure: performance results of 22 student teams tuning matrix multiply in CS267, Spring 2009]
• Students were given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor-tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results
Parallel MatMul with 2D Processor Layout
• P processors in P^(1/2) x P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns an n/P^(1/2) x n/P^(1/2) submatrix of each of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij, and Cij
[Figure: 4x4 processor grids for C, A, and B in C = A * B]
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
  – Comes within factor of log P of lower bounds:
    • Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
    • #words_moved = Ω(#flops / M^(1/2)) = Ω((n^3/P) / (n^2/P)^(1/2)) = Ω(n^2 / P^(1/2))
    • #messages = Ω(#flops / M^(3/2)) = Ω((n^3/P) / (n^2/P)^(3/2)) = Ω(P^(1/2))
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn96,100.ps
• Comparison to Cannon's Algorithm:
  – Cannon attains lower bound
  – But Cannon harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
• C(i,j) is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is an n/P^(1/2) x b submatrix of A
• B(k,j) is a b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)·B(k,j)
  • summation over submatrices
  • Need not be square processor grid
[Figure: C(i,j) accumulates products of column panel A(i,k) and row panel B(k,j)]

For k = 0 to n/b-1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow
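Collapsed onto a single process, the loop above is just accumulation of rank-b updates; a numpy sketch with the communication elided:

```python
import numpy as np

def summa_arithmetic(A, B, b):
    """C accumulates one rank-b update per panel broadcast: the column
    panel of A times the row panel of B."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for k in range(0, n, b):
        Acol = A[:, k:k + b]        # what each processor row receives
        Brow = B[k:k + b, :]        # what each processor column receives
        C += Acol @ Brow            # C_myproc = C_myproc + Acol * Brow
    return C

rng = np.random.default_rng(2)
A = rng.standard_normal((12, 12))
B = rng.standard_normal((12, 12))
assert np.allclose(summa_arithmetic(A, B, b=3), A @ B)
```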
SUMMA Communication Costs
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
deg Total words = log P n2 P12
deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)
deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12
31
Can we do better
bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P
ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)
bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E
needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T
that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
39
Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU
for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update
for i=1 to n-1 update column i update trailing matrix
Communication in sequentialOne-sided Factorizations (LU QR hellip)
bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)
40
bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (better)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm        Reference     Factor exceeding lower    Factor exceeding lower
                               bound for #words_moved    bound for #messages
Matrix Multiply  [Cannon, 69]  1                         1
Cholesky         ScaLAPACK     log P                     log P
LU               [GDX10]       log P                     log P
QR               [DGHL08]      log P                     log^3 P
Sym Eig, SVD     [BDD11]       log P                     log^3 P
Nonsym Eig       [BDD11]       log P                     log^3 P
Can we do even better?

• Assume n x n matrices on P processors
• Use c copies of data: M = O(c·n^2 / P) per processor
• Increasing M reduces lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2) · P^(1/2)) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm        Reference          Factor exceeding lower    Factor exceeding lower
                                    bound for #words_moved    bound for #messages
Matrix Multiply  [DS11, SBD11]      polylog P                 polylog P
Cholesky         [SD11, in prog.]   polylog P                 c^2 polylog P – optimal?
LU               [DS11, SBD11]      polylog P                 c^2 polylog P – optimal?
QR               via Cholesky QR    polylog P                 c^2 polylog P – optimal?
Sym Eig, SVD
Nonsym Eig
Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → Q·A·Q^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for the SVD of a band matrix
• Use alone, or as the second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy the communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less
Successive Band Reduction

(figure: a sequence of diagrams of the band, width b+1, showing sweeps Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T; each sweep zeroes a parallelogram of d diagonals from a block of c columns and chases the resulting d+c bulge down the band)

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

Only need to zero out the leading parallelogram of each trapezoid.
Conventional vs. CA-SBR

Conventional: touch all data 4 times.   Communication-avoiding: touch all data once.

Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge.
Right choices reduce #words_moved by a factor of M/bw, not just M^(1/2).
President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress:

"New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems."

FY 2010 Congressional Budget, Volume 4, FY2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67

CA-GMRES (Hoemmen, Mohiyuddin, Yelick, JD); "Tall-Skinny" QR (Grigori, Hoemmen, Langou, JD)
Collaborators and Supporters
• Michael Christ, Jack Dongarra, Ioana Dumitriu, Armando Fox, David Gleich, Laura Grigori, Ming Gu, Mike Heroux, Mark Hoemmen, Olga Holtz, Kurt Keutzer, Julien Langou, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang, Kathy Yelick
• Michael Anderson, Grey Ballard, Austin Benson, Abhinav Bhatele, Aydin Buluc, Erin Carson, Maryam Dehnavi, Michael Driscoll, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Marghoob Mohiyuddin, Oded Schwartz, Edgar Solomonik
• Other members of the ParLab, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA, TOPS projects
• Thanks to NSF, DOE, UC Discovery, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu
Summary of CA Algorithms
• "Direct" Linear Algebra
  • Lower bounds on communication for linear algebra problems like Ax=b, least squares, Ax = λx, SVD, etc.
  • New algorithms that attain these lower bounds
    • Being added to libraries: ScaLAPACK, PLASMA, MAGMA
  • Large speed-ups possible
  • Autotuning to find optimal implementation
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)
Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)
Lower bound for all "direct" linear algebra

• Let M = "fast" memory size (per processor). Then
  #words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )
  #messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) )
  (equivalently, #messages_sent ≥ #words_moved / largest_message_size)
• Parallel case: assume either load or memory balanced
• Holds for:
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)
SIAM SIAG/Linear Algebra Prize, 2012: Ballard, D., Holtz, Schwartz
Can we attain these lower bounds?

• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply
Naïve Matrix Multiply

{implements C = C + A·B}
for i = 1 to n
  {read row i of A into fast memory}          … n^2 reads altogether
  for j = 1 to n
    {read C(i,j) into fast memory}            … n^2 reads altogether
    {read column j of B into fast memory}     … n^3 reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}        … n^2 writes altogether

(figure: C(i,j) = C(i,j) + A(i,:) · B(:,j))

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic
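The traffic count above can be checked directly. Here is a small sketch (not from the slides; the function name and counting harness are mine): it runs the triple loop while tallying words moved between slow and fast memory under exactly the access pattern annotated above.

```python
import numpy as np

def naive_matmul_traffic(n):
    """C = C + A*B as in the slide, tallying slow/fast memory traffic."""
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
    C = np.zeros((n, n))
    reads = writes = 0
    for i in range(n):
        reads += n            # read row i of A
        for j in range(n):
            reads += 1        # read C(i,j)
            reads += n        # read column j of B
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
            writes += 1       # write C(i,j) back
    assert np.allclose(C, A @ B)
    return reads + writes

assert naive_matmul_traffic(8) == 8**3 + 3 * 8**2   # n^3 + 3n^2, as claimed
```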
Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory.

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}      … b^2 × (n/b)^2 = n^2 reads
    for k = 1 to n/b
      {read block A(i,k) into fast memory}    … b^2 × (n/b)^3 = n^3/b reads
      {read block B(k,j) into fast memory}    … b^2 × (n/b)^3 = n^3/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)       {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}  … b^2 × (n/b)^2 = n^2 writes

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
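The same tally for the blocked version confirms the 2n^3/b + 2n^2 figure. A sketch (again mine, not from the slides):

```python
import numpy as np

def blocked_matmul_traffic(n, b):
    """Blocked C = C + A*B, counting words moved per block transfer."""
    assert n % b == 0
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
    C = np.zeros((n, n))
    words = 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            words += b * b                    # read block C(i,j)
            for k in range(0, n, b):
                words += 2 * b * b            # read blocks A(i,k) and B(k,j)
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
            words += b * b                    # write block C(i,j) back
    assert np.allclose(C, A @ B)
    return words

assert blocked_matmul_traffic(16, 4) == 2 * 16**3 // 4 + 2 * 16**2
```

Growing b from 1 toward its maximum (3b^2 ≤ M) is exactly what drives the traffic down toward the lower bound discussed next.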
Does blocked matmul attain the lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 3^(1/2)·n^3/M^(1/2) + 2n^2
• Attains lower bound = Ω( #flops / M^(1/2) )
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?
Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m

• C = [ C11 C12 ] = A · B = [ A11 A12 ] · [ B11 B12 ]
      [ C21 C22 ]           [ A21 A22 ]   [ B21 B22 ]

    = [ A11·B11 + A12·B21   A11·B12 + A12·B22 ]
      [ A21·B11 + A22·B21   A21·B12 + A22·B22 ]

• True when each Aij etc. is 1x1 or n/2 x n/2

func C = RMM(A, B, n)
  if n = 1
    C = A * B
  else
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  return

Recursive Matrix Multiplication (RMM) (2/2)

A(n) = #arithmetic operations in RMM(·, ·, n)
     = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in a different order

W(n) = #words moved between fast and slow memory by RMM(·, ·, n)
     = 8 · W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^(1/2) + n^2 ) … same as blocked matmul
"Cache oblivious": works for memory hierarchies, but is not a panacea.
For big speedups, see the SC12 poster on "Beating MKL and ScaLAPACK at Rectangular Matmul", 11/13 at 5:15-7pm.
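The RMM pseudocode above translates almost line for line into Python; this is an illustrative sketch of the cache-oblivious recursion (the function name is mine), not a tuned implementation:

```python
import numpy as np

def rmm(A, B):
    """Recursive matrix multiply; n must be a power of 2."""
    n = A.shape[0]
    if n == 1:
        return A * B                      # 1x1 base case
    h = n // 2
    C = np.empty_like(A)
    C[:h, :h] = rmm(A[:h, :h], B[:h, :h]) + rmm(A[:h, h:], B[h:, :h])
    C[:h, h:] = rmm(A[:h, :h], B[:h, h:]) + rmm(A[:h, h:], B[h:, h:])
    C[h:, :h] = rmm(A[h:, :h], B[:h, :h]) + rmm(A[h:, h:], B[h:, :h])
    C[h:, h:] = rmm(A[h:, :h], B[:h, h:]) + rmm(A[h:, h:], B[h:, h:])
    return C

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(rmm(A, B), A @ B)
```

Note that no block size appears anywhere: the recursion automatically reaches subproblems that fit in every level of the memory hierarchy, which is the point of the W(n) recurrence above.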
How hard is hand-tuning matmul, anyway?

• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor-tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/

(figure: performance plots of the student submissions)
Parallel MatMul with 2D Processor Layout

• P processors in a P^(1/2) x P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns n/P^(1/2) x n/P^(1/2) submatrices of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij, and Cij

(figure: C = A * B, with each of the three matrices laid out on the same 4x4 grid P00 … P33)
SUMMA Algorithm

• SUMMA = Scalable Universal Matrix Multiply
  – Comes within a factor of log P of the lower bounds:
    • Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
    • #words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
    • #messages    = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn{96,100}.ps
• Comparison to Cannon's algorithm:
  – Cannon attains the lower bound
  – But Cannon is harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
SUMMA – n x n matmul on a P^(1/2) x P^(1/2) grid

• C(i,j) is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is an n/P^(1/2) x b submatrix of A
• B(k,j) is a b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)·B(k,j)
  • summation over submatrices
  • need not be a square processor grid

(figure: the column panel A(i,k) and row panel B(k,j) meeting at block C(i,j))

For k = 0 to n/b-1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow
SUMMA Communication Costs

For k = 0 to n/b-1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
      … #words = log P^(1/2) · b·n/P^(1/2), #messages = log P^(1/2)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
      … same #words and #messages
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

• Total #words = log P · n^2 / P^(1/2)
  • Within a factor of log P of the lower bound
  • (a more complicated implementation removes the log P factor)
• Total #messages = log P · n/b
  • Choose b close to its maximum, n/P^(1/2), to approach the lower bound P^(1/2)
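Stripped of the broadcasts, SUMMA's outer loop is just a sequence of rank-b updates. A serial sketch (assumption: a single process stands in for the whole grid, so the broadcasts become array slices; names are mine):

```python
import numpy as np

def summa_serial(A, B, b):
    """Serial simulation of SUMMA's panel loop: C += Acol * Brow per step."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for k in range(0, n, b):
        Acol = A[:, k:k+b]    # stands for the row-wise broadcast of A(i,k)
        Brow = B[k:k+b, :]    # stands for the column-wise broadcast of B(k,j)
        C += Acol @ Brow      # local rank-b update on every "processor"
    return C

A = np.random.rand(12, 12)
B = np.random.rand(12, 12)
assert np.allclose(summa_serial(A, B, 3), A @ B)
```

In the parallel algorithm each processor holds only its own blocks of Acol, Brow, and C; the loop structure and the n/b panel count (which drives the message count above) are the same.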
Can we do better?

• Lower bound assumed 1 copy of data: M = O(n^2/P) per processor
• What if the matrix is small enough to fit c > 1 copies, so M = c·n^2/P?
  – #words_moved = Ω( #flops / M^(1/2) ) = Ω( n^2 / (c^(1/2) · P^(1/2)) )
  – #messages    = Ω( #flops / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )
• Can we attain the new lower bound?
  – Special case: "3D Matmul", c = P^(1/3)
    • Dekel, Nassimi, Sahni [81]; Bernsten [89]; Agarwal, Chandra, Snir [90]; Johnson [93]; Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in a P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)·B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – Not always that much memory available…
2.5D Matrix Multiplication

• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid

(figure: the (P/c)^(1/2) x (P/c)^(1/2) x c processor grid; example: P = 32, c = 2)
2.5D Matrix Multiplication

• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^(1/2) x n·(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)·B(m,j)
(3) Sum-reduce the partial sums Σm A(i,m)·B(m,j) along the k-axis, so that P(i,j,0) owns C(i,j)
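The steps above can be simulated serially; in this sketch (mine, not from the slides) the c layers become a loop, each layer computes its 1/c-th of the k-summation, and the final `sum` plays the role of the sum-reduce along the k-axis:

```python
import numpy as np

def matmul_25d_layers(A, B, c):
    """Serial simulation of the 2.5D layer decomposition; n divisible by c."""
    n = A.shape[0]
    chunk = n // c
    partials = []
    for layer in range(c):                       # one k-range per layer
        ks = slice(layer * chunk, (layer + 1) * chunk)
        partials.append(A[:, ks] @ B[ks, :])     # 1/c-th of the SUMMA sum
    return sum(partials)                         # sum-reduce along the k-axis

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(matmul_25d_layers(A, B, 2), A @ B)
```

The communication win comes from each layer touching only n/c of the k-dimension while holding a full replicated copy of its A and B panels, which is what the c^(1/2) and c^(3/2) factors in the new lower bounds reflect.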
2.5D Matmul on BG/P, 16K nodes / 64K cores (c = 16 copies)

(figure: strong-scaling performance plot; annotations "12x faster" and "2.7x faster")

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.
Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with the minimal number of procs: P·M = 3n^2
• Increase P by a factor of c; total memory increases by a factor of c
• Notation for the timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = (n^3/(cP)) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for the energy model:
  – γE, βE, αE = joules for the same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec, for leakage etc.
• E(cP) = cP · { (n^3/(cP)) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K
Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors, and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
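The two identities in the model, T(cP) = T(P)/c and E(cP) = E(P), can be verified numerically. In this sketch every constant (γ, β, α, δ, ε, and the machine sizes) is a made-up placeholder, not a measurement; only the algebraic structure matters:

```python
import math

n, M, m = 4096, 1 << 20, 1024        # problem size, fast memory, message size
gT, bT, aT = 1e-11, 1e-9, 1e-6       # secs per flop, per word, per message
gE, bE, aE = 1e-10, 1e-9, 1e-6       # joules per flop, per word, per message
dE, eE = 1e-12, 0.1                  # joules per word-sec of memory; leakage

def T(procs):
    """Modeled runtime on `procs` processors, each with fast memory M."""
    return (n**3 / procs) * (gT + bT / math.sqrt(M) + aT / (m * math.sqrt(M)))

def E(c, P):
    """Modeled energy on c*P processors holding c copies of the data."""
    t = T(c * P)
    per_proc = (n**3 / (c * P)) * (gE + bE / math.sqrt(M) + aE / (m * math.sqrt(M)))
    return c * P * (per_proc + dE * M * t + eE * t)

P = 64
assert math.isclose(T(2 * P), T(P) / 2)   # perfect strong scaling in time
assert math.isclose(E(2, P), E(1, P))     # ... and in energy
```

The energy identity holds because the flop term sums to a c-independent n^3 total, while the memory and leakage terms each gain a factor c in processor count but lose it in runtime.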
Extending to the rest of Direct Linear Algebra:
Naïve Gaussian Elimination: A = L·U

for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) · (1 / A(i,i))                        … scale a vector
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) · A(i, i+1:n)   … rank-1 update

i.e. for i = 1 to n-1: update column i, update trailing matrix
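The two update lines above run as written in Python; a minimal sketch (no pivoting, so a diagonally dominant test matrix is used to keep A(i,i) away from zero; the function name is mine):

```python
import numpy as np

def lu_nopivot(A):
    """In-place Gaussian elimination: scale a vector, then rank-1 update."""
    A = A.copy()
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] /= A[i, i]                                 # scale a vector
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])     # rank-1 update
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return L, U

A = np.random.rand(6, 6) + 6 * np.eye(6)   # diagonally dominant
L, U = lu_nopivot(A)
assert np.allclose(L @ U, A)
```

Note that the loop touches the whole trailing matrix at every step, which is exactly why the naive approach moves O(n^3) words, as the next slide tallies.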
Communication in sequential one-sided factorizations (LU, QR, …)

• Naive approach:
  for i = 1 to n-1: update column i, update trailing matrix
  • #words_moved = O(n^3)
• Blocked approach (LAPACK):
  for i = 1 to n/b-1: update block i of b columns, update trailing matrix
  • #words_moved = O(n^3 / M^(1/3))
• Recursive approach:
  func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A); update right half of A; factor(right half of A)
  • #words_moved = O(n^3 / M^(1/2))
• None of these approaches
  • minimizes #messages
  • handles eig() or svd()
  • works in parallel
• Need more ideas!
TSQR: QR of a Tall, Skinny matrix

W = [W0; W1; W2; W3] = [Q00·R00; Q10·R10; Q20·R20; Q30·R30]
  = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]

[R00; R10] = Q01·R01 and [R20; R30] = Q11·R11, so
[R00; R10; R20; R30] = diag(Q01, Q11) · [R01; R11]

[R01; R11] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
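The R factor of the tree can be computed with any local QR; a sketch of the parallel (binary-tree) variant using numpy's QR (function name mine, Q factors discarded for brevity):

```python
import numpy as np

def tsqr_R(W):
    """Final R02 of the 4-leaf TSQR tree for a tall, skinny W."""
    blocks = np.array_split(W, 4, axis=0)
    Rs = [np.linalg.qr(Wi)[1] for Wi in blocks]          # R00, R10, R20, R30
    R01 = np.linalg.qr(np.vstack([Rs[0], Rs[1]]))[1]     # combine pairs
    R11 = np.linalg.qr(np.vstack([Rs[2], Rs[3]]))[1]
    return np.linalg.qr(np.vstack([R01, R11]))[1]        # root of the tree

W = np.random.rand(1000, 10)
R_tree = tsqr_R(W)
R_ref = np.linalg.qr(W)[1]
# R is unique up to the sign of each row, so compare absolute values
assert np.allclose(np.abs(R_tree), np.abs(R_ref))
```

Each reduction step factors only a small 2b x b stack of R's, so the communication per tree level is one b x b message, which is where the latency savings come from.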
TSQR: An Architecture-Dependent Algorithm

(figure: three reduction trees over W = [W0; W1; W2; W3])
  Parallel:   binary tree – (R00, R10, R20, R30) → (R01, R11) → R02
  Sequential: flat tree – R00 → R01 → R02 → R03
  Dual core:  a hybrid of the two

Can choose the reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core: ?
TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on a PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Back to LU: using a similar idea for TSLU as for TSQR – use a reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4] = [P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4]
  Choose b pivot rows of W1, call them W1'; likewise W2', W3', W4'

[W1'; W2'] = P12·L12·U12 – choose b pivot rows, call them W12'
[W3'; W4'] = P34·L34·U34 – choose b pivot rows, call them W34'

[W12'; W34'] = P1234·L1234·U1234 – choose b pivot rows

• Go back to W and use these b pivot rows
  • Move them to the top, do LU without pivoting
  • Extra work, but a lower-order term
• Thm: as numerically stable as Partial Pivoting on a larger matrix
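An illustrative sketch of the row-selection tournament (both function names are mine, and the local selection simply reuses ordinary partial pivoting on each candidate set, as in the tree above):

```python
import numpy as np

def gepp_pivot_rows(A, b):
    """Positions of the first b pivot rows partial pivoting picks on panel A."""
    A = A.astype(float).copy()
    idx = list(range(A.shape[0]))
    for k in range(b):
        p = k + int(np.argmax(np.abs(A[k:, k])))          # partial-pivot choice
        A[[k, p]] = A[[p, k]]
        idx[k], idx[p] = idx[p], idx[k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k+1:])
    return idx[:b]

def tournament_pivots(W, b, nblocks=4):
    """Global indices of b pivot rows of W, via a binary reduction tree."""
    cand = [list(g) for g in np.array_split(np.arange(W.shape[0]), nblocks)]
    # leaves: pick b candidate rows from each block (W1' ... W4')
    cand = [[rows[i] for i in gepp_pivot_rows(W[rows, :b], b)] for rows in cand]
    while len(cand) > 1:                                  # internal tree nodes
        merged = []
        for a, c in zip(cand[0::2], cand[1::2]):
            rows = a + c
            merged.append([rows[i] for i in gepp_pivot_rows(W[rows, :b], b)])
        cand = merged
    return cand[0]

W = np.random.rand(32, 8)
piv = tournament_pivots(W, b=4)
assert len(piv) == 4 and len(set(piv)) == 4
```

Each tree node factors only a 2b x b stack of candidate rows, so, as with TSQR, the pivot decision costs one small message per level instead of one per column.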
Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

(figure: heat map of predicted speedup; axes are log2(p) and log2(n^2/p) = log2(memory_per_proc))
2.5D vs 2D LU, With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Goals
6
bull Redesign algorithms to avoid communicationbull Between all memory hierarchy levels
bull L1 L2 DRAM network etc bull Attain lower bounds if possible
bull Current algorithms often far from lower boundsbull Large speedups and energy savings possible
ldquoNew Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems On modern computer architectures communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor ASCR researchers have developed a new method derived from commonly used linear algebra methods to minimize communications between processors and the memory hierarchy by reformulating the communication patterns specified within the algorithm This method has been implemented in the TRILINOS framework a highly-regarded suite of software which provides functionality for researchers around the world to solve large scale complex multi-physics problemsrdquo
FY 2010 Congressional Budget Volume 4 FY2010 Accomplishments Advanced Scientific Computing Research (ASCR) pages 65-67
President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress
CA-GMRES (Hoemmen Mohiyuddin Yelick JD)ldquoTall-Skinnyrdquo QR (Grigori Hoemmen Langou JD)
Collaborators and Supportersbull Michael Christ Jack Dongarra Ioana Dumitriu Armando Fox David Gleich Laura
Grigori Ming Gu Mike Heroux Mark Hoemmen Olga Holtz Kurt Keutzer Julien Langou Tom Scanlon Michelle Strout Sam Williams Hua Xiang Kathy Yelick
bull Michael Anderson Grey Ballard Austin Benson Abhinav Bhatele Aydin Buluc Erin Carson Maryam Dehnavi Michael Driscoll Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Marghoob Mohiyuddin Oded Schwartz Edgar Solomonik
bull Other members of ParLab BEBOP CACHE EASI FASTMath MAGMA PLASMA TOPS projects
bull Thanks to NSF DOE UC Discovery Intel Microsoft Mathworks National Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary of CA Algorithmsbull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication for linear algebra problems like Ax=b least squares Ax = λx SVD etc
bull New algorithms that attain these lower boundsbull Being added to libraries ScaLAPACK PLASMA MAGMAbull Large speed-ups possible
bull Autotuning to find optimal implementationbull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing arrays
(eg n-body)
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
12
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
13
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent ge words_moved largest_message_size
bull Parallel case assume either load or memory balanced
Lower bound for all "direct" linear algebra
• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g., computing A^k)
  – Dense and sparse matrices (where #flops << n³)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g., Floyd-Warshall)
14
• Let M = "fast" memory size (per processor)
  #words_moved (per processor) = Ω(#flops (per processor) / M^(1/2))
  #messages_sent (per processor) = Ω(#flops (per processor) / M^(3/2))
• Parallel case: assume either load or memory balanced
SIAM SIAG/Linear Algebra Prize, 2012: Ballard, D., Holtz, Schwartz
Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply
15
16
Naïve Matrix Multiply
{implements C = C + A*B}
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
[Figure: C(i,j) += row A(i,:) times column B(:,j)]
17
Naïve Matrix Multiply
{implements C = C + A*B}
for i = 1 to n
  read row i of A into fast memory
  for j = 1 to n
    read C(i,j) into fast memory
    read column j of B into fast memory
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    write C(i,j) back to slow memory
[Figure: C(i,j) += row A(i,:) times column B(:,j)]
18
Naïve Matrix Multiply
{implements C = C + A*B}
for i = 1 to n
  read row i of A into fast memory           … n² reads altogether
  for j = 1 to n
    read C(i,j) into fast memory             … n² reads altogether
    read column j of B into fast memory      … n³ reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    write C(i,j) back to slow memory         … n² writes altogether
[Figure: C(i,j) += row A(i,:) times column B(:,j)]
n³ + 3n² reads/writes altogether – dominates 2n³ arithmetic
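As a minimal serial sketch (not the tutorial's code), the triple loop in Python; the read/write counts above refer to slow-memory traffic, which this in-memory version does not model:

```python
# Naive matrix multiply: C = C + A*B, the triple loop from the slide.
# In the traffic analysis, row i of A is read once per i (n^2 reads total),
# but column j of B is re-read for every (i, j) pair (n^3 reads total).
def naive_matmul(A, B, C):
    n = len(A)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, 0], [0, 0]]
print(naive_matmul(A, B, C))  # [[19, 22], [43, 50]]
```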
19
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory
for i = 1 to n/b
  for j = 1 to n/b
    read block C(i,j) into fast memory
    for k = 1 to n/b
      read block A(i,k) into fast memory
      read block B(k,j) into fast memory
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   … do a matrix multiply on blocks
    write block C(i,j) back to slow memory
[Figure: b-by-b block C(i,j) += A(i,k) * B(k,j)]
20
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory
for i = 1 to n/b
  for j = 1 to n/b
    read block C(i,j) into fast memory       … b² × (n/b)² = n² reads
    for k = 1 to n/b
      read block A(i,k) into fast memory     … b² × (n/b)³ = n³/b reads
      read block B(k,j) into fast memory     … b² × (n/b)³ = n³/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)      … do a matrix multiply on blocks
    write block C(i,j) back to slow memory   … b² × (n/b)² = n² writes
[Figure: b-by-b block C(i,j) += A(i,k) * B(k,j)]
2n³/b + 2n² reads/writes << 2n³ arithmetic – Faster!
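A serial sketch of the tiled loop nest (assumptions: b divides n, and the b-by-b block product stands in for the "fast memory" working set):

```python
import numpy as np

# Blocked (tiled) matmul: same 2n^3 flops, but each b-by-b block of A and B
# is read once per (i, j, k) block step -> 2*(n/b)^3 * b^2 = 2n^3/b words.
def blocked_matmul(A, B, b):
    n = A.shape[0]
    assert n % b == 0, "for simplicity, assume b divides n"
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # the only arithmetic: a b-by-b block multiply in "fast memory"
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

rng = np.random.default_rng(0)
A, B = rng.random((8, 8)), rng.random((8, 8))
assert np.allclose(blocked_matmul(A, B, 4), A @ B)
```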
Does blocked matmul attain lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then reads/writes = 2n³/b + 2n²
• Make b as large as possible: 3b² ≤ M, so reads/writes ≥ 3^(1/2)·n³/M^(1/2) + 2n²
• Attains lower bound = Ω(#flops / M^(1/2))
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?
21
Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = [C11 C12; C21 C22] = A·B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
    = [A11·B11 + A12·B21   A11·B12 + A12·B22;
       A21·B11 + A22·B21   A21·B12 + A22·B22]
• True when each Aij etc. is 1x1 or n/2 x n/2
22
func C = RMM(A, B, n)
  if n = 1, C = A * B
  else
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  return
Recursive Matrix Multiplication (RMM) (2/2)
23
func C = RMM(A, B, n)
  if n = 1, C = A * B
  else
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  return
A(n) = # arithmetic operations in RMM(·, ·, n)
     = 8·A(n/2) + 4·(n/2)² if n > 1, else 1
     = 2n³ … same operations as usual, in different order
W(n) = # words moved between fast, slow memory by RMM(·, ·, n)
     = 8·W(n/2) + 12·(n/2)² if 3n² > M, else 3n²
     = O(n³/M^(1/2) + n²) … same as blocked matmul
"Cache oblivious": works for memory hierarchies, but not a panacea
For big speedups, see SC12 poster on "Beating MKL and ScaLAPACK at Rectangular Matmul", 11/13 at 5:15-7pm
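The recursive scheme above, as a small numpy sketch (a cache-oblivious toy, not the tutorial's implementation; assumes n is a power of 2):

```python
import numpy as np

# Recursive matrix multiply (RMM): split each n-by-n operand into four
# n/2-by-n/2 quadrants and recurse -- no block size b to tune, so the same
# code adapts to every level of the memory hierarchy ("cache oblivious").
def rmm(A, B):
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    return np.block([
        [rmm(A11, B11) + rmm(A12, B21), rmm(A11, B12) + rmm(A12, B22)],
        [rmm(A21, B11) + rmm(A22, B21), rmm(A21, B12) + rmm(A22, B22)],
    ])

rng = np.random.default_rng(1)
A, B = rng.random((8, 8)), rng.random((8, 8))   # n = 2^m, as on the slide
assert np.allclose(rmm(A, B), A @ B)
```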
How hard is hand-tuning matmul, anyway?
24
• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results
25
Parallel MatMul with 2D Processor Layout
• P processors in P^(1/2) x P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns n/P^(1/2) x n/P^(1/2) submatrices of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij, and Cij
[Figure: C = A * B, with the 4x4 grid P00 … P33 shown once each for C, A, and B]
27
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
  – Comes within factor of log P of lower bounds:
    • Assume fast memory size M = O(n²/P) per processor – 1 copy of data
    • #words_moved = Ω(#flops / M^(1/2)) = Ω((n³/P) / (n²/P)^(1/2)) = Ω(n² / P^(1/2))
    • #messages = Ω(#flops / M^(3/2)) = Ω((n³/P) / (n²/P)^(3/2)) = Ω(P^(1/2))
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn96.ps, lawn100.ps
• Comparison to Cannon's Algorithm:
  – Cannon attains lower bound
  – But Cannon harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
28
SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
• C(i,j) is n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is n/P^(1/2) x b submatrix of A
• B(k,j) is b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)·B(k,j)
  • summation over submatrices
  • Need not be square processor grid
[Figure: C(i,j) at grid position (i,j) accumulates panels A(i,k) times B(k,j) over k]
29
SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
For k = 0 to n/b-1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow
[Figure: each processor holds the broadcast panels Acol and Brow]
30
SUMMA Communication Costs
For k = 0 to n/b-1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
      … #words = log P^(1/2) · b·n/P^(1/2), #messages = log P^(1/2)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
      … same #words and #messages
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow
° Total #words = log P · n² / P^(1/2)
° Within factor of log P of lower bound
° (more complicated implementation removes log P factor)
° Total #messages = log P · n/b
° Choose b close to maximum, n/P^(1/2), to approach lower bound P^(1/2)
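A serial emulation of SUMMA's arithmetic (a sketch, not an MPI implementation: the processor grid is flattened away, and each step applies the global rank-b update that the broadcast panels produce):

```python
import numpy as np

# Serial emulation of SUMMA: at step k every processor receives a panel
# Acol (n/sqrt(P) x b of A) and Brow (b x n/sqrt(P) of B) and performs
# C_myproc += Acol * Brow. Summed over the grid, that is one rank-b update.
def summa_emulated(A, B, b):
    n = A.shape[0]
    C = np.zeros((n, n))
    for k in range(0, n, b):
        Acol = A[:, k:k+b]     # broadcast along processor rows
        Brow = B[k:k+b, :]     # broadcast along processor columns
        C += Acol @ Brow       # local rank-b updates, viewed globally
    return C

rng = np.random.default_rng(2)
A, B = rng.random((12, 12)), rng.random((12, 12))
assert np.allclose(summa_emulated(A, B, 3), A @ B)
```

Choosing the panel width b trades messages for memory: larger b means fewer, larger broadcasts, which is why the slide suggests b near its maximum n/P^(1/2).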
31
Can we do better?
• Lower bound assumed 1 copy of data: M = O(n²/P) per proc.
• What if matrix small enough to fit c > 1 copies, so M = c·n²/P?
  – #words_moved = Ω(#flops / M^(1/2)) = Ω(n² / (c^(1/2)·P^(1/2)))
  – #messages = Ω(#flops / M^(3/2)) = Ω(P^(1/2) / c^(3/2))
• Can we attain new lower bound?
  – Special case: "3D Matmul": c = P^(1/3)
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)·B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – Not always that much memory available…
2.5D Matrix Multiplication
• Assume can fit c·n²/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid
[Figure: (P/c)^(1/2) x (P/c)^(1/2) x c processor grid; example: P = 32, c = 2]
2.5D Matrix Multiplication
• Assume can fit c·n²/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid
Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^(1/2) x n·(c/P)^(1/2)
(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)·B(m,j) along k-axis, so P(i,j,0) owns C(i,j)
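Steps (2)-(3) can be emulated serially (a sketch: the c processor layers each own a 1/c-th slice of the inner dimension, and the final sum-reduce along the k-axis combines their partial products):

```python
import numpy as np

# Serial emulation of 2.5D matmul arithmetic: split the inner (summation)
# dimension among c "layers"; each layer computes a partial product over its
# slice, then the partials are sum-reduced, mirroring steps (2) and (3).
def matmul_25d_emulated(A, B, c):
    n = A.shape[0]
    assert n % c == 0, "for simplicity, assume c divides n"
    chunk = n // c
    partials = []
    for layer in range(c):                      # step (2): 1/c-th of SUMMA each
        ks = slice(layer * chunk, (layer + 1) * chunk)
        partials.append(A[:, ks] @ B[ks, :])
    return sum(partials)                        # step (3): sum-reduce along k-axis

rng = np.random.default_rng(3)
A, B = rng.random((8, 8)), rng.random((8, 8))
assert np.allclose(matmul_25d_emulated(A, B, 2), A @ B)
```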
2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies
Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.
[Plot annotations: 12x faster; 2.7x faster]
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n²
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
  – T(cP) = n³/(cP) · [γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2))] = T(P)/c
• Notation for energy model:
  – γ_E, β_E, α_E = joules for same operations
  – δ_E = joules per word of memory used per sec
  – ε_E = joules per sec, for leakage, etc.
  – E(cP) = cP · { n³/(cP) · [γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2))] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
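Typeset, the two model equations read (subscripts restored from context; the grouping of the energy terms follows the per-processor reading of the slide):

```latex
T(cP) \;=\; \frac{n^3}{cP}\left[\gamma_T + \frac{\beta_T}{M^{1/2}}
        + \frac{\alpha_T}{m\,M^{1/2}}\right] \;=\; \frac{T(P)}{c}
```

```latex
E(cP) \;=\; cP\left\{\frac{n^3}{cP}\left[\gamma_E + \frac{\beta_E}{M^{1/2}}
        + \frac{\alpha_E}{m\,M^{1/2}}\right]
        + \delta_E\, M\, T(cP) + \varepsilon_E\, T(cP)\right\} \;=\; E(P)
```

Both are independent of c: runtime shrinks by c while energy stays constant, which is the "perfect strong scaling" claim.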
Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K
Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm; can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors, and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
39
Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimination: A = LU
for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) * (1 / A(i,i))      … scale a vector
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)   … rank-1 update
i.e., for i = 1 to n-1: update column i, update trailing matrix
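The two-line loop body above, as a runnable numpy sketch (no pivoting, so the test matrix is chosen diagonally dominant to keep it safe):

```python
import numpy as np

# Naive Gaussian elimination A = L*U (no pivoting), straight from the slide:
# scale a vector, then a rank-1 update of the trailing matrix.
def lu_nopivot(A):
    A = A.astype(float).copy()
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] *= 1.0 / A[i, i]                         # scale a vector
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # rank-1 update
    L = np.tril(A, -1) + np.eye(n)   # unit lower triangle holds the multipliers
    U = np.triu(A)
    return L, U

# Diagonally dominant, so elimination without pivoting is stable here
A = np.array([[4.0, 1, 2], [1, 5, 1], [2, 1, 6]])
L, U = lu_nopivot(A)
assert np.allclose(L @ U, A)
```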
Communication in sequential One-sided Factorizations (LU, QR, …)
• Naive approach:
  for i = 1 to n-1: update column i, update trailing matrix
  • #words_moved = O(n³)
40
• Blocked approach (LAPACK):
  for i = 1 to n/b - 1: update block i of b columns, update trailing matrix
  • #words moved = O(n³/M^(1/3))
• Recursive approach:
  func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)
  • #words moved = O(n³/M^(1/2))
• None of these approaches
  • minimizes #messages
  • handles eig() or svd()
  • works in parallel
• Need more ideas!
TSQR: QR of a Tall, Skinny matrix
41
W = [W0; W1; W2; W3] = [Q00·R00; Q10·R10; Q20·R20; Q30·R30]
  = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]
[R00; R10; R20; R30] = [Q01·R01; Q11·R11] = diag(Q01, Q11) · [R01; R11]
[R01; R11] = Q02·R02
42
Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
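A serial numpy sketch of the R-factor reduction tree (local QRs whose R factors are stacked pairwise and re-factored; the implicit Q is the product of the small block-diagonal Q factors and is not formed here):

```python
import numpy as np

# TSQR on a tall-skinny W via a binary reduction tree of small QRs:
# QR each row-block, then repeatedly stack pairs of R factors and QR again
# until a single R remains. Final R equals the R of a full QR of W, up to
# signs of rows.
def tsqr_R(W, nblocks=4):
    Rs = [np.linalg.qr(Wi)[1] for Wi in np.array_split(W, nblocks)]
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]))[1]
              for i in range(0, len(Rs), 2)]
    return Rs[0]

rng = np.random.default_rng(4)
W = rng.random((1000, 4))                  # tall and skinny
R = tsqr_R(W)
R_ref = np.linalg.qr(W)[1]                 # reference one-shot QR
assert np.allclose(np.abs(R), np.abs(R_ref))   # agree up to row signs
```

Each level of the tree needs one message per pair of participants, which is how TSQR reaches the latency lower bound for tall-skinny QR.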
TSQR: An Architecture-Dependent Algorithm
Parallel:   W = [W0; W1; W2; W3] → (R00, R10, R20, R30) → (R01, R11) → R02
Sequential: W = [W0; W1; W2; W3] → R00 → R01 → R02 → R03
Dual Core:  W = [W0; W1; W2; W3] → hybrid of the two trees (R00, R01, R11, R02, R03)
Can choose reduction tree dynamically: Multicore / Multisocket / Multirack / Multisite / Out-of-core: same idea
TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
44
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"
45
W (n x b) = [W1; W2; W3; W4] = [P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4]
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'
[W1'; W2'; W3'; W4'] = [P12·L12·U12; P34·L34·U34]
  Choose b pivot rows, call them W12'
  Choose b pivot rows, call them W34'
[W12'; W34'] = P1234·L1234·U1234
  Choose b pivot rows
• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
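A toy sketch of the tournament (assumptions: `gepp_rows` is an illustrative helper, not a library routine; each block nominates b candidate pivot rows via ordinary partial pivoting, and winners are re-refined up the reduction tree):

```python
import numpy as np

def gepp_rows(W, b):
    """Indices of the first b pivot rows that partial pivoting chooses on W."""
    A = W.astype(float).copy()
    idx = np.arange(A.shape[0])
    for k in range(b):
        p = k + np.argmax(np.abs(A[k:, k]))          # largest entry in column k
        A[[k, p]], idx[[k, p]] = A[[p, k]], idx[[p, k]]
        A[k+1:, k:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k:])  # eliminate
    return idx[:b]

# Tournament pivoting: local candidates, then pairwise "matches" up the tree.
def tournament_pivot_rows(W, b, nblocks=4):
    blocks = np.array_split(np.arange(W.shape[0]), nblocks)
    cands = [blk[gepp_rows(W[blk], b)] for blk in blocks]       # local choices
    while len(cands) > 1:                                       # reduction tree
        merged = [np.concatenate(cands[i:i+2]) for i in range(0, len(cands), 2)]
        cands = [m[gepp_rows(W[m], b)] for m in merged]
    return cands[0]   # b pivot rows for all of W

rng = np.random.default_rng(5)
W = rng.random((64, 4))
rows = tournament_pivot_rows(W, 4)
assert len(set(rows.tolist())) == 4   # b distinct pivot rows of W
```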
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Plot: x-axis log2(p); y-axis log2(n²/p) = log2(memory_per_proc)]
Summary of dense sequential algorithms attaining communication lower bounds
• Algorithms shown minimizing #Messages use (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports), only some shown, plus ours
• Cache-oblivious are underlined; Green are ours; ? is unknown/future work

Algorithm | 2 Levels of Memory (#Words Moved / #Messages) | Multiple Levels of Memory (#Words Moved / #Messages)
BLAS-3 | Usual blocked or recursive algorithms | Usual blocked algorithms (nested), or recursive [Gustavson,97]
Cholesky | LAPACK (with b = M^(1/2)), [Gustavson,97], [BDHS09] / [Gustavson,97], [Ahmed,Pingali,00], [BDHS09] | (←same) / (←same)
LU with pivoting | LAPACK (sometimes), [Toledo,97], [GDX 08] / [GDX 08] partial pivoting? | [Toledo,97], [GDX 08]? / [GDX 08]?
QR, rank-revealing | LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08] / [Frens,Wise,03] (but 3x flops?), [DGHL08] | [Elmroth,Gustavson,98], [DGHL08]? / [Frens,Wise,03], [DGHL08]?
Eig, SVD | Not LAPACK; [BDD11] randomized, but more flops; [BDK11] / [BDD11,BDK11] | [BDD11,BDK11] / [BDD11,BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors
• Minimum Memory per processor: M = O(n²/P)
• Recall lower bounds:
  #words_moved = Ω((n³/P) / M^(1/2)) = Ω(n² / P^(1/2))
  #messages    = Ω((n³/P) / M^(3/2)) = Ω(P^(1/2))

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | Cholesky | LU | QR | Sym Eig, SVD | Nonsym Eig (table filled in on the next slides)
Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor: M = O(n²/P)
• Recall lower bounds:
  #words_moved = Ω((n³/P) / M^(1/2)) = Ω(n² / P^(1/2))
  #messages    = Ω((n³/P) / M^(3/2)) = Ω(P^(1/2))

Algorithm | Reference | Factor exceeding lower bound for #words_moved
Matrix Multiply | [Cannon, 69] | 1
Cholesky | ScaLAPACK | log P
LU | ScaLAPACK | log P
QR | ScaLAPACK | log P
Sym Eig, SVD | ScaLAPACK | log P
Nonsym Eig | ScaLAPACK | P^(1/2) · log P
Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor: M = O(n²/P)
• Recall lower bounds:
  #words_moved = Ω((n³/P) / M^(1/2)) = Ω(n² / P^(1/2))
  #messages    = Ω((n³/P) / M^(3/2)) = Ω(P^(1/2))

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | ScaLAPACK | log P | (n/P^(1/2)) · log P
QR | ScaLAPACK | log P | (n/P^(1/2)) · log P
Sym Eig, SVD | ScaLAPACK | log P | n/P^(1/2)
Nonsym Eig | ScaLAPACK | P^(1/2) · log P | n · log P
Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors (better)
• Minimum Memory per processor: M = O(n²/P)
• Recall lower bounds:
  #words_moved = Ω((n³/P) / M^(1/2)) = Ω(n² / P^(1/2))
  #messages    = Ω((n³/P) / M^(3/2)) = Ω(P^(1/2))

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX10] | log P | log P
QR | [DGHL08] | log P | log³ P
Sym Eig, SVD | [BDD11] | log P | log³ P
Nonsym Eig | [BDD11] | log P | log³ P
Can we do even better?
• Assume nxn matrices on P processors
• Use c copies of data: M = O(c·n²/P) per processor
• Increasing M reduces lower bounds:
  #words_moved = Ω((n³/P) / M^(1/2)) = Ω(n² / (c^(1/2) · P^(1/2)))
  #messages    = Ω((n³/P) / M^(3/2)) = Ω(P^(1/2) / c^(3/2))

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [DS11, SBD11] | polylog P | polylog P
Cholesky | [SD11, in prog.] | polylog P | c² · polylog P – optimal!
LU | [DS11, SBD11] | polylog P | c² · polylog P – optimal!
QR | Via Cholesky QR | polylog P | c² · polylog P – optimal!
Sym Eig, SVD; Nonsym Eig | ? | ? | ?
Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less!
Successive Band Reduction
[Figure sequence: a symmetric band matrix of bandwidth b+1 is reduced by successive orthogonal transformations Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T; each sweep eliminates a parallelogram of c columns and d diagonals and chases the resulting bulge of size d+c down the band (regions labeled 1-6). Only need to zero out the leading parallelogram of each trapezoid. Notation: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b]
Conventional vs CA-SBR
Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge
Right choices reduce #words_moved by factor M/bw, not just M^(1/2)
Conventional: touch all data 4 times | Communication-Avoiding: touch all data once
ldquoNew Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems On modern computer architectures communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor ASCR researchers have developed a new method derived from commonly used linear algebra methods to minimize communications between processors and the memory hierarchy by reformulating the communication patterns specified within the algorithm This method has been implemented in the TRILINOS framework a highly-regarded suite of software which provides functionality for researchers around the world to solve large scale complex multi-physics problemsrdquo
FY 2010 Congressional Budget, Volume 4, FY2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67
President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress
CA-GMRES (Hoemmen, Mohiyuddin, Yelick, JD); "Tall-Skinny" QR (Grigori, Hoemmen, Langou, JD)
Collaborators and Supporters
• Michael Christ, Jack Dongarra, Ioana Dumitriu, Armando Fox, David Gleich, Laura Grigori, Ming Gu, Mike Heroux, Mark Hoemmen, Olga Holtz, Kurt Keutzer, Julien Langou, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang, Kathy Yelick
• Michael Anderson, Grey Ballard, Austin Benson, Abhinav Bhatele, Aydin Buluc, Erin Carson, Maryam Dehnavi, Michael Driscoll, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Marghoob Mohiyuddin, Oded Schwartz, Edgar Solomonik
• Other members of the ParLab, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA, TOPS projects
• Thanks to NSF, DOE, UC Discovery, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu
Summary of CA Algorithms
• "Direct" Linear Algebra
  • Lower bounds on communication for linear algebra problems like Ax=b, least squares, Ax = λx, SVD, etc.
  • New algorithms that attain these lower bounds
    • Being added to libraries: ScaLAPACK, PLASMA, MAGMA
  • Large speed-ups possible
  • Autotuning to find optimal implementation
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
12
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
13
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent ge words_moved largest_message_size
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
14
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz
Can we attain these lower bounds
bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not
bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties
new ways to encode answers new data structures
ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15
16
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n for j = 1 to n
for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)
= + C(ij) A(i)
B(j)C(ij)
17
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory
= + C(ij) A(i)
B(j)C(ij)
18
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether
= + C(ij) A(i)
B(j)C(ij)
n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic
19
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory
= + C(ij) C(ij) A(ik)
B(kj)b-by-bblock
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= + C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 312n3M12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
22
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
23
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm
How hard is hand-tuning matmul anyway
24
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
25
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
27
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS
bull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions
layouts and Cannon may use more memory
28
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
kk
B(kj)
C(ij)
29
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
kk
B(kj)
C(ij)
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
Brow
Acol
30
SUMMA Communication Costs
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
deg Total words = log P n2 P12
deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)
deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12
31
Can we do better
bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P
ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)
bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
2.5D Matrix Multiplication

• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^(1/2) x n·(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)·B(m,j) along the k-axis so that P(i,j,0) owns C(i,j)
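Steps (2) and (3) can be sketched serially (a toy model, not a parallel code; `matmul_2_5d` is an illustrative name): the summation over panels is split among c "levels", each computing its 1/c-th share, and the partial products are then sum-reduced.

```python
import numpy as np

def matmul_2_5d(A, B, c):
    """Serial model of 2.5D matmul: split the k-summation c ways, then reduce."""
    n = A.shape[0]
    chunk = n // c  # assumes c divides n, for simplicity
    # step (2): level k computes its 1/c-th share of the panel sum
    partials = [A[:, k*chunk:(k+1)*chunk] @ B[k*chunk:(k+1)*chunk, :]
                for k in range(c)]
    # step (3): sum-reduce the partials along the k-axis
    return sum(partials)

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))
assert np.allclose(matmul_2_5d(A, B, 2), A @ B)
```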
2.5D Matmul on BG/P, 16K nodes / 64K cores (c = 16 copies)

Distinguished Paper Award, EuroPar’11 (SC’11 paper by Solomonik, Bhatele, D.)

[Plot: 2.5D vs 2D matmul execution time; annotations “12x faster” and “2.7x faster”]
Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
   – γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
   – γE, βE, αE = joules for the same operations
   – δE = joules per word of memory used per sec
   – εE = joules per sec for leakage, etc.
• E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] } + δE·M·T(cP) + εE·T(cP) = E(P)
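The timing model above can be checked numerically (a sketch: the constants γT, βT, αT and the message size m below are illustrative, not measured values). With M held fixed per processor, every term scales as 1/P, so c times the processors gives exactly c times less modeled time.

```python
def T(P, n, M, m, gammaT=1e-9, betaT=1e-8, alphaT=1e-6):
    """Modeled runtime: n^3/P flops, n^3/(P*M^(1/2)) words, n^3/(P*m*M^(1/2)) messages."""
    flops = n**3 / P
    return flops * (gammaT + betaT / M**0.5 + alphaT / (m * M**0.5))

# Perfect strong scaling: doubling P (each processor keeping memory M) halves T
n, M, m = 4096, 2**20, 2**10
t1 = T(64, n, M, m)
t2 = T(128, n, M, m)
assert abs(t2 - t1 / 2) <= 1e-12 * t1
```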
[Plot: Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K]

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
   • What is the minimum energy required for a computation?
   • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
   • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
   • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
   • Given an algorithm, problem size, number of processors, and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
Extending to the rest of Direct Linear Algebra
Naïve Gaussian Elimination: A = LU

for i = 1 to n-1
   A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) )                        … scale a vector
   A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)     … rank-1 update

i.e.:
for i = 1 to n-1
   update column i
   update trailing matrix
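The naive in-place elimination above is runnable as written (a sketch: no pivoting, so it assumes nonzero pivots; the test matrix is made diagonally dominant to keep that safe):

```python
import numpy as np

def lu_in_place(A):
    """Naive LU without pivoting; returns L (strict lower) and U packed in one array."""
    A = A.copy()
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] /= A[i, i]                               # scale a vector
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # rank-1 update
    return A

n = 6
rng = np.random.default_rng(2)
A = rng.standard_normal((n, n)) + n * np.eye(n)  # diagonally dominant: safe without pivoting
LU = lu_in_place(A)
L = np.tril(LU, -1) + np.eye(n)
U = np.triu(LU)
assert np.allclose(L @ U, A)
```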
Communication in sequential one-sided factorizations (LU, QR, …)

• Naive approach:
   for i = 1 to n-1
      update column i
      update trailing matrix
   • #words_moved = O(n^3)
• Blocked approach (LAPACK):
   for i = 1 to n/b - 1
      update block i of b columns
      update trailing matrix
   • #words_moved = O(n^3 / M^(1/3))
• Recursive approach:
   func factor(A)
      if A has 1 column, update it
      else
         factor(left half of A)
         update right half of A
         factor(right half of A)
   • #words_moved = O(n^3 / M^(1/2))
• None of these approaches
   • minimizes #messages
   • handles eig() or svd()
   • works in parallel
• Need more ideas!
TSQR: QR of a Tall, Skinny matrix

W = [ W0 ; W1 ; W2 ; W3 ]
  = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00 ; R10 ; R20 ; R30 ]

[ R00 ; R10 ] = Q01·R01 and [ R20 ; R30 ] = Q11·R11, so

[ R00 ; R10 ; R20 ; R30 ] = diag(Q01, Q11) · [ R01 ; R11 ]

[ R01 ; R11 ] = Q02·R02

Output: { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
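The reduction tree above can be sketched in a few lines of numpy (a serial model keeping only the R factors; `tsqr_R` is an illustrative name, and the final R agrees with a direct QR of W up to the sign of each row):

```python
import numpy as np

def tsqr_R(blocks):
    """Final R factor of the stacked blocks via a binary reduction tree."""
    Rs = [np.linalg.qr(Wi)[1] for Wi in blocks]  # leaf QRs: W_i = Q_i0 * R_i0
    while len(Rs) > 1:                           # combine pairs up the tree
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]))[1]
              for i in range(0, len(Rs), 2)]
    return Rs[0]

rng = np.random.default_rng(3)
W = rng.standard_normal((4000, 50))
R_tree = tsqr_R(np.array_split(W, 4))
R_ref = np.linalg.qr(W)[1]
# R is unique up to the sign of each row, so compare absolute values
assert np.allclose(np.abs(R_tree), np.abs(R_ref))
```

Note that only 50x50 triangles flow up the tree; the tall blocks are touched once, which is the point of the algorithm.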
TSQR: An Architecture-Dependent Algorithm

Parallel (binary tree):   W = [W0;W1;W2;W3] → {R00, R10, R20, R30} → {R01, R11} → R02
Sequential (flat tree):   W = [W0;W1;W2;W3] → R00 → R01 → R02 → R03 (blocks folded in one at a time)
Dual Core:                a hybrid of the two trees

Can choose the reduction tree dynamically:
Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?
TSQR Performance Results

• Parallel
   – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
   – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
   – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
   – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
   – Grid – 4x on 4 cities (Dongarra et al)
   – Cloud – 1.6x slower than just accessing the data twice (Gleich and Benson)
• Sequential
   – “Infinite speedup” for out-of-core on PowerPC laptop
      • As little as 2x slowdown vs (predicted) infinite DRAM
      • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Back to LU: Using a similar idea for TSLU as for TSQR:
Use a reduction tree to do “Tournament Pivoting”

W (n x b) = [ W1 ; W2 ; W3 ; W4 ] = [ P1·L1·U1 ; P2·L2·U2 ; P3·L3·U3 ; P4·L4·U4 ]
   Choose b pivot rows of W1, call them W1′
   Choose b pivot rows of W2, call them W2′
   Choose b pivot rows of W3, call them W3′
   Choose b pivot rows of W4, call them W4′

[ W1′ ; W2′ ] = P12·L12·U12 … choose b pivot rows, call them W12′
[ W3′ ; W4′ ] = P34·L34·U34 … choose b pivot rows, call them W34′

[ W12′ ; W34′ ] = P1234·L1234·U1234 … choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to the top, do LU without pivoting
• Extra work, but a lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
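The tournament above can be sketched in Python (a toy model, with illustrative helper names: each leaf picks b candidate pivot rows by ordinary partial pivoting, then winners play off pairwise up the tree; the bookkeeping that maps local picks back to rows of W is the main subtlety):

```python
import numpy as np

def gepp_pivot_rows(W, b):
    """Indices (into W) of the first b pivot rows chosen by partial pivoting."""
    A = W.copy()
    rows = np.arange(A.shape[0])
    for i in range(b):
        p = i + np.argmax(np.abs(A[i:, i]))           # pick largest pivot in column i
        A[[i, p]] = A[[p, i]]
        rows[[i, p]] = rows[[p, i]]                   # track original row indices
        A[i+1:, i:] -= np.outer(A[i+1:, i] / A[i, i], A[i, i:])
    return rows[:b]

def tournament_pivots(W, b, nblocks=4):
    blocks = np.array_split(np.arange(W.shape[0]), nblocks)
    cands = [idx[gepp_pivot_rows(W[idx], b)] for idx in blocks]   # leaf round
    while len(cands) > 1:                                         # play-off rounds
        merged = [np.concatenate(cands[i:i+2]) for i in range(0, len(cands), 2)]
        cands = [idx[gepp_pivot_rows(W[idx], b)] for idx in merged]
    return cands[0]

rng = np.random.default_rng(4)
W = rng.standard_normal((64, 4))
piv = tournament_pivots(W, 4)
assert len(set(piv.tolist())) == 4   # b distinct pivot rows of W
```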
Exascale Machine Parameters
(Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Plot: speedup vs log2(p) and log2(n^2/p) = log2(memory_per_proc)]
[Plot: 2.5D vs 2D LU, With and Without Pivoting]
Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #Messages use (recursive) block layout
   • Not possible with columnwise or rowwise layouts
• Many references (see reports); only some shown, plus ours
• Cache-oblivious are underlined; green are ours; ? is unknown/future work

Algorithm: {2 Levels of Memory: #Words Moved and #Messages} / {Multiple Levels of Memory: #Words Moved and #Messages}

BLAS-3: Usual blocked or recursive algorithms / Usual blocked algorithms (nested) or recursive [Gustavson,97]
Cholesky: LAPACK (with b = M^(1/2)), [Gustavson,97], [BDHS09] / [Gustavson,97], [Ahmed,Pingali,00], [BDHS09] (←same)
LU with pivoting: LAPACK (sometimes), [Toledo,97], [GDX 08] ([GDX 08] uses a different pivoting scheme than partial pivoting) / [Toledo,97], [GDX 08]?
QR, rank-revealing: LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08], [Frens,Wise,03] (but 3x flops) / [Elmroth,Gustavson,98], [DGHL08]?, [Frens,Wise,03], [DGHL08]?
Eig, SVD: Not LAPACK; [BDD11] (randomized, but more flops), [BDK11] / [BDD11], [BDK11]
Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (conventional approach)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
   #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
   #messages = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm / Reference / Factor exceeding lower bound for #words_moved / Factor exceeding lower bound for #messages
Matrix Multiply / [Cannon, 69] / 1 / 1
Cholesky / ScaLAPACK / log P / log P
LU / ScaLAPACK / log P / (n·log P) / P^(1/2)
QR / ScaLAPACK / log P / (n·log P) / P^(1/2)
Sym Eig, SVD / ScaLAPACK / log P / n / P^(1/2)
Nonsym Eig / ScaLAPACK / P^(1/2)·log P / n·log P
Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (better)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
   #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
   #messages = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm / Reference / Factor exceeding lower bound for #words_moved / Factor exceeding lower bound for #messages
Matrix Multiply / [Cannon, 69] / 1 / 1
Cholesky / ScaLAPACK / log P / log P
LU / [GDX10] / log P / log P
QR / [DGHL08] / log P / log^3 P
Sym Eig, SVD / [BDD11] / log P / log^3 P
Nonsym Eig / [BDD11] / log P / log^3 P
Can we do even better?

• Assume n x n matrices on P processors
• Use c copies of data: M = O(c·n^2 / P) per processor
• Increasing M reduces lower bounds:
   #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2)·P^(1/2)) )
   #messages = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm / Reference / Factor exceeding lower bound for #words_moved / Factor exceeding lower bound for #messages
Matrix Multiply / [DS11, SBD11] / polylog P / polylog P
Cholesky / [SD11, in prog.] / polylog P / c^2·polylog P – optimal?
LU / [DS11, SBD11] / polylog P / c^2·polylog P – optimal?
QR / via Cholesky QR / polylog P / c^2·polylog P – optimal?
Sym Eig, SVD / ? / ? / ?
Nonsym Eig / ? / ? / ?
Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
   – A = A^T is banded
   – T is tridiagonal
   – Similar idea for SVD of a band matrix
• Use alone, or as the second phase when A is dense:
   – Dense → Banded → Tridiagonal
• Implemented in LAPACK’s sytrd
• Algorithm does not satisfy the communication lower bound theorem for applying orthogonal transformations
   – It can communicate even less!

[Figure: symmetric band matrix of bandwidth b+1]
Successive Band Reduction
(animation collapsed from several build slides)

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

• Step 1: Q1 zeroes out a (d+1) x c parallelogram at the top of the band, clearing d diagonals of the leading c columns; applying Q1^T on the other side creates a (d+c) x (d+c) bulge further down the band.
• Steps 2, 3, 4, …: Q2, Q3, Q4, … chase the bulge down the band, each eliminating one bulge and creating the next.
• Only need to zero out the leading parallelogram of each trapezoid.
Conventional vs. CA-SBR

Conventional: touch all the data 4 times.
Communication-avoiding: touch all the data once. Many tuning parameters: number of “sweeps”, #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge. The right choices reduce #words_moved by a factor of M/bw, not just M^(1/2).
Collaborators and Supporters

• Michael Christ, Jack Dongarra, Ioana Dumitriu, Armando Fox, David Gleich, Laura Grigori, Ming Gu, Mike Heroux, Mark Hoemmen, Olga Holtz, Kurt Keutzer, Julien Langou, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang, Kathy Yelick
• Michael Anderson, Grey Ballard, Austin Benson, Abhinav Bhatele, Aydin Buluc, Erin Carson, Maryam Dehnavi, Michael Driscoll, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Marghoob Mohiyuddin, Oded Schwartz, Edgar Solomonik
• Other members of the ParLab, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA, TOPS projects
• Thanks to NSF, DOE, UC Discovery, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu
Summary of CA Algorithms

• “Direct” Linear Algebra
   • Lower bounds on communication for linear algebra problems like Ax=b, least squares, Ax = λx, SVD, etc.
   • New algorithms that attain these lower bounds
   • Being added to libraries: ScaLAPACK, PLASMA, MAGMA
   • Large speed-ups possible
   • Autotuning to find optimal implementation
• Ditto for “Iterative” Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)

Outline

• “Direct” Linear Algebra
   • Lower bounds on communication
   • New algorithms that attain these lower bounds
• Ditto for “Iterative” Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)
Lower bound for all “direct” linear algebra

• Let M = “fast” memory size (per processor)

   #words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )
   #messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) )
   (equivalently: #messages_sent ≥ #words_moved / largest_message_size)

• Parallel case: assume either load or memory balanced
• Holds for
   – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
   – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. A^k)
   – Dense and sparse matrices (where #flops << n^3)
   – Sequential and parallel algorithms
   – Some graph-theoretic algorithms (e.g. Floyd-Warshall)

SIAM SIAG/Linear Algebra Prize, 2012: Ballard, D., Holtz, Schwartz
Can we attain these lower bounds?

• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
   – Often not
• If not, are there other algorithms that do?
   – Yes, for much of dense linear algebra
   – New algorithms, with new numerical properties, new ways to encode answers, new data structures
   – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply
Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
   {read row i of A into fast memory}          … n^2 reads altogether
   for j = 1 to n
      {read C(i,j) into fast memory}           … n^2 reads altogether
      {read column j of B into fast memory}    … n^3 reads altogether
      for k = 1 to n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}       … n^2 writes altogether

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic
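The operation count above is easy to sanity-check by simulating the loop and tallying the annotated slow-memory traffic (a toy counter, with an illustrative function name):

```python
def naive_matmul_traffic(n):
    """Count slow-memory reads/writes of the naive matmul loop, as annotated."""
    reads_writes = 0
    for i in range(n):
        reads_writes += n          # read row i of A
        for j in range(n):
            reads_writes += 1      # read C(i,j)
            reads_writes += n      # read column j of B
            reads_writes += 1      # write C(i,j)
    return reads_writes

n = 10
assert naive_matmul_traffic(n) == n**3 + 3 * n**2
```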
Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
   for j = 1 to n/b
      {read block C(i,j) into fast memory}          … b^2 x (n/b)^2 = n^2 reads
      for k = 1 to n/b
         {read block A(i,k) into fast memory}       … b^2 x (n/b)^3 = n^3/b reads
         {read block B(k,j) into fast memory}       … b^2 x (n/b)^3 = n^3/b reads
         C(i,j) = C(i,j) + A(i,k) * B(k,j)  {do a matrix multiply on blocks}
      {write block C(i,j) back to slow memory}      … b^2 x (n/b)^2 = n^2 writes

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!

Does blocked matmul attain the lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 3^(1/2)·n^3/M^(1/2) + 2n^2
• Attains lower bound = Ω( #flops / M^(1/2) )
• But what if we don’t know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?
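The blocked loop nest above is directly runnable (a sketch that assumes b divides n, for simplicity):

```python
import numpy as np

def blocked_matmul(A, B, b):
    """Blocked (tiled) matmul with block size b; assumes b divides n."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # C(i,j) = C(i,j) + A(i,k) * B(k,j): a matmul on b-by-b blocks
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

rng = np.random.default_rng(5)
A = rng.standard_normal((12, 12))
B = rng.standard_normal((12, 12))
assert np.allclose(blocked_matmul(A, B, 4), A @ B)
```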
Recursive Matrix Multiplication (RMM) (1/2)

• For simplicity: square matrices with n = 2^m
• C = [C11 C12; C21 C22] = A·B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
    = [A11·B11 + A12·B21, A11·B12 + A12·B22; A21·B11 + A22·B21, A21·B12 + A22·B22]
• True when each Aij etc. is 1x1, or n/2 x n/2

func C = RMM (A, B, n)
   if n = 1, C = A * B
   else
      C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
      C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
      C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
      C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
   return

Recursive Matrix Multiplication (RMM) (2/2)

A(n) = # arithmetic operations in RMM(·, ·, n)
     = 8·A(n/2) + 4·(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in a different order

W(n) = # words moved between fast and slow memory by RMM(·, ·, n)
     = 8·W(n/2) + 12·(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^(1/2) + n^2 ) … same as blocked matmul

“Cache oblivious”: works for memory hierarchies, but not a panacea

For big speedups, see the SC12 poster on “Beating MKL and ScaLAPACK at Rectangular Matmul”, 11/13 at 5:15-7pm
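The RMM pseudocode above translates directly into runnable numpy (a sketch that assumes n is a power of 2):

```python
import numpy as np

def rmm(A, B):
    """Recursive matrix multiply, quartering the matrices until 1x1."""
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    C = np.empty((n, n))
    C[:h, :h] = rmm(A11, B11) + rmm(A12, B21)
    C[:h, h:] = rmm(A11, B12) + rmm(A12, B22)
    C[h:, :h] = rmm(A21, B11) + rmm(A22, B21)
    C[h:, h:] = rmm(A21, B12) + rmm(A22, B22)
    return C

rng = np.random.default_rng(6)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))
assert np.allclose(rmm(A, B), A @ B)
```

The recursion never names the cache size M; blocking at every scale is what makes it cache-oblivious.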
How hard is hand-tuning matmul, anyway?

• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given “blocked” code to start with (7x faster than naïve)
• Still hard to get close to vendor-tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results

[Plots: performance of the student submissions]
Parallel MatMul with 2D Processor Layout

• P processors in a P^(1/2) x P^(1/2) grid
   – Processors communicate along rows and columns
• Each processor owns an n/P^(1/2) x n/P^(1/2) submatrix of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
   – Processor Pij owns submatrices Aij, Bij, and Cij

C = A * B   [Figure: each of C, A, B laid out on the 4 x 4 grid P00 … P33]
SUMMA Algorithm

• SUMMA = Scalable Universal Matrix Multiply
   – Comes within a factor of log P of lower bounds:
      • Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
      • #words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
      • #messages = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
   – Can accommodate any processor grid, matrix dimensions & layout
   – Used in practice in PBLAS = Parallel BLAS
      • www.netlib.org/lapack/lawns/lawn{96,100}.ps
• Comparison to Cannon’s Algorithm:
   – Cannon attains the lower bound
   – But Cannon is harder to generalize to other grids, dimensions, and layouts, and Cannon may use more memory
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
kk
B(kj)
C(ij)
29
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
kk
B(kj)
C(ij)
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
Brow
Acol
30
SUMMA Communication Costs
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
deg Total words = log P n2 P12
deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)
deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12
31
Can we do better
bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P
ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)
bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E
needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T
that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
39
Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU
for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update
for i=1 to n-1 update column i update trailing matrix
Communication in sequentialOne-sided Factorizations (LU QR hellip)
bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)
40
bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better?
• Assume n×n matrices on P processors
• Use c copies of data: M = O(c·n^2 / P) per processor
• Increasing M reduces lower bounds:
  words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2) P^(1/2)) )
  messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm       | Reference         | Factor for words_moved | Factor for messages
Matrix Multiply | [DS11, SBD11]     | polylog P              | polylog P
Cholesky        | [SD11, in prog.]  | polylog P              | c^2 polylog P – optimal?
LU              | [DS11, SBD11]     | polylog P              | c^2 polylog P – optimal?
QR              | via Cholesky QR   | polylog P              | c^2 polylog P – optimal?
Sym Eig, SVD    | ?                 |                        |
Nonsym Eig      | ?                 |                        |
Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy the communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less
[Figure sequence: Successive Band Reduction, shown step by step. A symmetric band matrix of bandwidth b+1 is reduced by annihilating d diagonals at a time, c columns per parallelogram; each orthogonal transformation Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T chases the resulting bulge of size d+c down the band through steps 1–6. Legend: b = bandwidth, c = #columns, d = #diagonals, constraint: c+d ≤ b. Only need to zero out the leading parallelogram of each trapezoid.]
Conventional vs CA-SBR
• Conventional: touches all data 4 times
• Communication-avoiding: touches all data once
• Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge
• Right choices reduce words_moved by a factor of M/bw, not just M^(1/2)
Summary of CA Algorithms
• "Direct" Linear Algebra
  • Lower bounds on communication for linear algebra problems like Ax=b, least squares, Ax = λx, SVD, etc.
  • New algorithms that attain these lower bounds
    • Being added to libraries: ScaLAPACK, PLASMA, MAGMA
    • Large speed-ups possible
  • Autotuning to find optimal implementation
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)
Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)
Lower bound for all "direct" linear algebra
• Let M = "fast" memory size (per processor)
  words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )
  messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) )
  – since messages_sent ≥ words_moved / largest_message_size
• Parallel case: assume either load or memory balanced
• Holds for:
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g., A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g., Floyd-Warshall)
• SIAM SIAG/Linear Algebra Prize, 2012: Ballard, D., Holtz, Schwartz
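As a concrete check on these bounds, the sketch below (an illustration, not part of the tutorial) evaluates the two Ω-formulas, constants dropped, for a dense n×n problem on P processors with the minimal memory M = n^2/P, and confirms they simplify to n^2/P^(1/2) words and P^(1/2) messages:

```python
import math

def comm_lower_bounds(n, P):
    """Evaluate the lower-bound formulas (constants dropped) for n^3 total
    flops on P processors, each with fast memory M = n^2/P (one data copy)."""
    M = n * n / P
    flops_per_proc = n**3 / P
    words = flops_per_proc / math.sqrt(M)   # Omega(flops / M^(1/2))
    messages = flops_per_proc / M**1.5      # Omega(flops / M^(3/2))
    return words, messages

# With M = n^2/P these reduce to n^2/sqrt(P) words and sqrt(P) messages.
w, m = comm_lower_bounds(n=4096, P=64)
assert abs(w - 4096**2 / math.sqrt(64)) < 1e-6
assert abs(m - math.sqrt(64)) < 1e-6
```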
Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply
Naïve Matrix Multiply

  implements C = C + A*B
  for i = 1 to n
    for j = 1 to n
      for k = 1 to n
        C(i,j) = C(i,j) + A(i,k) * B(k,j)

[Figure: C(i,j) is updated from row A(i,:) and column B(:,j).]
Naïve Matrix Multiply

  implements C = C + A*B
  for i = 1 to n
    read row i of A into fast memory
    for j = 1 to n
      read C(i,j) into fast memory
      read column j of B into fast memory
      for k = 1 to n
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
      write C(i,j) back to slow memory
Naïve Matrix Multiply

  implements C = C + A*B
  for i = 1 to n
    read row i of A into fast memory          … n^2 reads altogether
    for j = 1 to n
      read C(i,j) into fast memory            … n^2 reads altogether
      read column j of B into fast memory     … n^3 reads altogether
      for k = 1 to n
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
      write C(i,j) back to slow memory        … n^2 writes altogether

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic
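A quick way to sanity-check the n^3 + 3n^2 count is to instrument the loop nest. This toy model (my own illustration, not from the slides) counts one slow-memory transfer per row of A, per element of C, and per column of B, exactly as annotated above:

```python
def naive_matmul_traffic(n):
    """Count slow-memory words moved by the naive algorithm above."""
    words = 0
    for i in range(n):
        words += n          # read row i of A: n words
        for j in range(n):
            words += 1      # read C(i,j)
            words += n      # read column j of B
            # inner k-loop runs entirely in fast memory
            words += 1      # write C(i,j) back
    return words

# Matches the n^3 + 3n^2 total from the slide.
assert naive_matmul_traffic(32) == 32**3 + 3 * 32**2
```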
Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

  for i = 1 to n/b
    for j = 1 to n/b
      read block C(i,j) into fast memory
      for k = 1 to n/b
        read block A(i,k) into fast memory
        read block B(k,j) into fast memory
        C(i,j) = C(i,j) + A(i,k) * B(k,j)   … do a matrix multiply on blocks
      write block C(i,j) back to slow memory

[Figure: b-by-b block update C(i,j) = C(i,j) + A(i,k)*B(k,j).]
Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

  for i = 1 to n/b
    for j = 1 to n/b
      read block C(i,j) into fast memory      … b^2 × (n/b)^2 = n^2 reads
      for k = 1 to n/b
        read block A(i,k) into fast memory    … b^2 × (n/b)^3 = n^3/b reads
        read block B(k,j) into fast memory    … b^2 × (n/b)^3 = n^3/b reads
        C(i,j) = C(i,j) + A(i,k) * B(k,j)     … do a matrix multiply on blocks
      write block C(i,j) back to slow memory  … b^2 × (n/b)^2 = n^2 writes

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
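The traffic formula can be verified directly. This sketch (an illustration assuming b divides n) runs the blocked algorithm while tallying the block transfers annotated above, and checks both the product and the 2n^3/b + 2n^2 count:

```python
import numpy as np

def blocked_matmul(A, B, C, b):
    """Blocked (tiled) C = C + A*B; returns slow-memory words moved,
    assuming three b-by-b blocks fit in fast memory and b divides n."""
    n = A.shape[0]
    words = 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            words += b * b                  # read block C(i,j)
            for k in range(0, n, b):
                words += 2 * b * b          # read blocks A(i,k) and B(k,j)
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
            words += b * b                  # write block C(i,j) back
    return words

n, b = 64, 8
A, B = np.random.rand(n, n), np.random.rand(n, n)
C = np.zeros((n, n))
words = blocked_matmul(A, B, C, b)
assert np.allclose(C, A @ B)
assert words == 2 * n**3 // b + 2 * n**2
```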
Does blocked matmul attain lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so reads/writes ≥ 3^(1/2) n^3 / M^(1/2) + 2n^2
• Attains lower bound = Ω( #flops / M^(1/2) )
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?
Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = [C11 C12; C21 C22] = A · B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
    = [A11·B11 + A12·B21   A11·B12 + A12·B22;
       A21·B11 + A22·B21   A21·B12 + A22·B22]
• True when each Aij etc. is 1×1, or n/2 × n/2

  func C = RMM(A, B, n)
    if n = 1, C = A * B
    else
      C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
      C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
      C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
      C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
    return
Recursive Matrix Multiplication (RMM) (2/2)

A(n) = # arithmetic operations in RMM(·,·,n)
     = 8 · A(n/2) + 4·(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in different order

W(n) = # words moved between fast and slow memory by RMM(·,·,n)
     = 8 · W(n/2) + 12·(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^(1/2) + n^2 ) … same as blocked matmul

"Cache oblivious": works for memory hierarchies, but not a panacea

For big speedups, see SC12 poster on "Beating MKL and ScaLAPACK at Rectangular Matmul", 11/13 at 5:15–7pm
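The recursive scheme above can be written directly. This sketch (an illustration assuming n is a power of 2) computes the product rather than the in-place update C = C + A·B of the slide's pseudocode:

```python
import numpy as np

def rmm(A, B):
    """Cache-oblivious recursive matmul, following the slide's RMM
    (returns A*B; the slide's version accumulates into C)."""
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    C = np.empty((n, n))
    C[:h, :h] = rmm(A[:h, :h], B[:h, :h]) + rmm(A[:h, h:], B[h:, :h])
    C[:h, h:] = rmm(A[:h, :h], B[:h, h:]) + rmm(A[:h, h:], B[h:, h:])
    C[h:, :h] = rmm(A[h:, :h], B[:h, :h]) + rmm(A[h:, h:], B[h:, :h])
    C[h:, h:] = rmm(A[h:, :h], B[:h, h:]) + rmm(A[h:, h:], B[h:, h:])
    return C

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(rmm(A, B), A @ B)
```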
How hard is hand-tuning matmul, anyway?
• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor-tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results
Parallel MatMul with 2D Processor Layout
• P processors in P^(1/2) × P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns n/P^(1/2) × n/P^(1/2) submatrices of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij, and Cij

C = A * B

[Figure: three 4×4 grids of processors P00 … P33, one each for C, A, and B.]
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
  – Comes within factor of log P of lower bounds:
    • Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
    • words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
    • messages = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn96,100.ps
• Comparison to Cannon's Algorithm
  – Cannon attains lower bound
  – But Cannon harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
SUMMA – n × n matmul on P^(1/2) × P^(1/2) grid
• C(i,j) is n/P^(1/2) × n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is n/P^(1/2) × b submatrix of A
• B(k,j) is b × n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)*B(k,j)
  • summation over submatrices
• Need not be square processor grid

[Figure: block C(i,j) in processor row i, column j accumulates panels A(i,k) and B(k,j).]
SUMMA – n × n matmul on P^(1/2) × P^(1/2) grid

  For k = 0 to n/b-1
    for all i = 1 to P^(1/2)
      owner of A(i,k) broadcasts it to whole processor row (using binary tree)
    for all j = 1 to P^(1/2)
      owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
    Receive A(i,k) into Acol
    Receive B(k,j) into Brow
    C_myproc = C_myproc + Acol * Brow
SUMMA Communication Costs

  For k = 0 to n/b-1
    for all i = 1 to P^(1/2)
      owner of A(i,k) broadcasts it to whole processor row (using binary tree)
        … #words = log P^(1/2) · b·n/P^(1/2), #messages = log P^(1/2)
    for all j = 1 to P^(1/2)
      owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
        … same #words and #messages
    Receive A(i,k) into Acol
    Receive B(k,j) into Brow
    C_myproc = C_myproc + Acol * Brow

• Total #words = log P · n^2 / P^(1/2)
  • Within factor of log P of lower bound
  • (more complicated implementation removes log P factor)
• Total #messages = log P · n/b
  • Choose b close to maximum, n/P^(1/2), to approach lower bound P^(1/2)
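Stripped of the broadcasts, SUMMA is just a sum of rank-b outer-product updates. This serial mock-up (an illustration, not the parallel code; assumes b divides n) mimics the k-loop to show that accumulating the Acol·Brow panels reproduces C = A·B:

```python
import numpy as np

def summa_serial(A, B, b):
    """Serial mock-up of SUMMA's panel loop: C accumulates rank-b updates
    A(:, k:k+b) * B(k:k+b, :), the quantities broadcast in the parallel code."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for k in range(0, n, b):
        Acol = A[:, k:k+b]   # panel each processor row would receive
        Brow = B[k:k+b, :]   # panel each processor column would receive
        C += Acol @ Brow
    return C

n, b = 32, 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(summa_serial(A, B, b), A @ B)
```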
Can we do better?
• Lower bound assumed 1 copy of data: M = O(n^2/P) per proc.
• What if matrix small enough to fit c > 1 copies, so M = c·n^2/P?
  – words_moved = Ω( #flops / M^(1/2) ) = Ω( n^2 / (c^(1/2) P^(1/2)) )
  – messages = Ω( #flops / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )
• Can we attain new lower bound?
  – Special case: "3D Matmul": c = P^(1/3)
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^(1/3) × P^(1/3) × P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^(1/3) × n/P^(1/3)
  – Not always that much memory available…
2.5D Matrix Multiplication
• Assume can fit c·n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) × (P/c)^(1/2) × c grid

Example: P = 32, c = 2
[Figure: (P/c)^(1/2) × (P/c)^(1/2) × c processor grid.]
2.5D Matrix Multiplication
• Assume can fit c·n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) × (P/c)^(1/2) × c grid

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^(1/2) × n·(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along k-axis so P(i,j,0) owns C(i,j)
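The layer structure of steps (2)–(3) can be mimicked serially. In this sketch (an illustration, assuming c divides n) each "layer" computes its 1/c-th of the k-summation and the partial products are then sum-reduced:

```python
import numpy as np

def matmul_25d_serial(A, B, c):
    """Serial mock-up of 2.5D matmul's layers: split the k-summation across
    c layers (step 2), then sum-reduce the partials along k (step 3)."""
    n = A.shape[0]
    chunk = n // c
    partials = []
    for k in range(c):                      # each layer does 1/c-th of SUMMA
        ks = slice(k * chunk, (k + 1) * chunk)
        partials.append(A[:, ks] @ B[ks, :])
    return sum(partials)                    # sum-reduce along the k-axis

n, c = 32, 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(matmul_25d_serial(A, B, c), A @ B)
```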
2.5D Matmul on BG/P, 16K nodes / 64K cores (c = 16 copies)
• Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.
[Figure: performance plots showing 12x faster and 2.7x faster results.]
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c; total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
  – E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
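Since M is held fixed per processor, the bracketed per-flop cost does not change when P grows, so the model predicts T(cP) = T(P)/c exactly. A small numerical check of the timing formula (the constants below are made up for illustration):

```python
def T(P, c, n, M, gamma, beta, alpha, m):
    """Timing model from the slide:
    T(cP) = n^3/(cP) * [gamma + beta/M^(1/2) + alpha/(m*M^(1/2))]."""
    return n**3 / (c * P) * (gamma + beta / M**0.5 + alpha / (m * M**0.5))

# Illustrative (made-up) machine constants; only the scaling matters.
args = dict(n=4096, M=2**20, gamma=1e-9, beta=1e-8, alpha=1e-6, m=64)
t1 = T(P=64, c=1, **args)
t4 = T(P=64, c=4, **args)
assert abs(t4 - t1 / 4) < 1e-9 * t1   # perfect strong scaling: T(cP) = T(P)/c
```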
Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K
[Figure: strong-scaling plot.]

Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimination, A = LU

  for i = 1 to n-1
    A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) )                      … scale a vector
    A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)   … rank-1 update

i.e., for i = 1 to n-1: update column i, then update trailing matrix
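The two-line loop body translates directly into code. This sketch (an illustration without pivoting, so it needs a matrix where that is safe, e.g. diagonally dominant) stores the multipliers of L below the diagonal of A, as LAPACK does:

```python
import numpy as np

def lu_nopivot(A):
    """Naive Gaussian elimination A = L*U (no pivoting), per the slide:
    scale column i, then rank-1 update the trailing matrix."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] /= A[i, i]                                # scale a vector
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])    # rank-1 update
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return L, U

A = np.random.rand(6, 6) + 6 * np.eye(6)   # diagonally dominant: safe without pivoting
L, U = lu_nopivot(A)
assert np.allclose(L @ U, A)
```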
Communication in sequential One-sided Factorizations (LU, QR, …)
• Naive approach:
    for i = 1 to n-1: update column i, update trailing matrix
  • #words_moved = O(n^3)
• Blocked approach (LAPACK):
    for i = 1 to n/b - 1: update block i of b columns, update trailing matrix
  • #words_moved = O(n^3 / M^(1/3))
• Recursive approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  • #words_moved = O(n^3 / M^(1/2))
• None of these approaches
  • minimizes #messages
  • handles eig() or svd()
  • works in parallel
• Need more ideas!
TSQR: QR of a Tall, Skinny matrix

W = [W0; W1; W2; W3] = [Q00·R00; Q10·R10; Q20·R20; Q30·R30]
  = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]

[R00; R10; R20; R30] = [Q01·R01; Q11·R11] = diag(Q01, Q11) · [R01; R11]

[R01; R11] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
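The factorization tree above is easy to prototype with local QRs. This sketch (an illustration with one reduction level and p leaf blocks, assuming the block row counts exceed the column count) also reconstructs the implicit Q = diag(Q00, …) · Q01-level factor explicitly, just to verify correctness:

```python
import numpy as np

def tsqr(W, p=4):
    """Parallel-tree TSQR sketch for tall-skinny W (m >> n): p local QRs,
    then one QR of the stacked R factors."""
    m, n = W.shape
    blocks = np.array_split(W, p)
    Qs, Rs = zip(*(np.linalg.qr(Wi) for Wi in blocks))
    Q1, R = np.linalg.qr(np.vstack(Rs))        # combine the p R factors
    # Implicit Q = diag(Q_0..Q_{p-1}) * Q1; form it explicitly for checking.
    Q1_parts = np.array_split(Q1, p)
    Q = np.vstack([Qi @ Q1i for Qi, Q1i in zip(Qs, Q1_parts)])
    return Q, R

W = np.random.rand(1000, 50)
Q, R = tsqr(W)
assert np.allclose(Q @ R, W)
assert np.allclose(Q.T @ Q, np.eye(50), atol=1e-8)
```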
TSQR: An Architecture-Dependent Algorithm

W = [W0; W1; W2; W3] in all cases; what varies is the reduction tree over the R factors:
• Parallel: binary tree – R00, R10, R20, R30 → R01, R11 → R02
• Sequential: flat tree – R00 → R01 → R02 → R03, folding in one block at a time
• Dual Core: hybrid of the two – each core reduces its own blocks serially, then the two results are combined

Can choose reduction tree dynamically: Multicore? Multisocket? Multirack? Multisite? Out-of-core?
TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Back to LU: Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4] = [P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4]
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'

[W1'; W2'; W3'; W4'] = [P12·L12·U12; P34·L34·U34]
  Choose b pivot rows, call them W12'
  Choose b pivot rows, call them W34'

[W12'; W34'] = P1234·L1234·U1234
  Choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
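The tournament can be prototyped with a small partial-pivoting helper playing each "match". This sketch (my own illustration with a flat, single-level tree rather than the binary tree of the slide) picks b candidate rows per block, then plays off the stacked candidates:

```python
import numpy as np

def pivot_rows(W, b):
    """Indices of b pivot rows of W chosen by Gaussian elimination with
    partial pivoting (a stand-in for one tournament 'match')."""
    A = W.astype(float).copy()
    rows = list(range(A.shape[0]))
    chosen = []
    for k in range(b):
        p = k + int(np.argmax(np.abs(A[k:, k])))   # partial-pivot choice
        A[[k, p]] = A[[p, k]]
        rows[k], rows[p] = rows[p], rows[k]
        chosen.append(rows[k])
        A[k+1:, k:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k:])
    return chosen

def tournament_pivoting(W, b, nblocks=4):
    """Pick b candidates per block, then play off the stacked candidates to
    select the final b pivot rows of W (single-level reduction tree)."""
    blocks = np.array_split(np.arange(W.shape[0]), nblocks)
    candidates = np.concatenate([idx[pivot_rows(W[idx], b)] for idx in blocks])
    final = pivot_rows(W[candidates], b)
    return candidates[final]

W = np.random.rand(64, 8)
piv = tournament_pivoting(W, b=8)
assert len(set(piv.tolist())) == 8   # b distinct rows of W chosen
```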
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
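These parameters make latency avoidance as important as bandwidth avoidance: a back-of-envelope script (illustrative arithmetic only, using the numbers above) shows how many 8-byte words could stream past during a single message latency:

```python
# Quoted exascale parameters (from the slide above)
node_bw = 100e9      # interconnect bandwidth, bytes/sec
latency = 1e-6       # interconnect latency, sec

# 8-byte words that could stream across the link during one message latency:
# a message must carry at least this many words to be bandwidth-bound.
words_per_latency = node_bw * latency / 8
assert words_per_latency == 12500.0
```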
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Figure: predicted-speedup heatmap; axes log2(p) and log2(n^2/p) = log2(memory_per_proc).]

2.5D vs 2D LU, With and Without Pivoting
[Figure: performance comparison.]
Summary of dense sequential algorithms attaining communication lower bounds
• Algorithms shown minimizing #messages use (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports for details); only some shown, plus ours
• Cache-oblivious algorithms are underlined; green are ours; ? is unknown/future work

Per algorithm: two levels of memory (#words moved; #messages), then multiple levels of memory (#words moved; #messages):
• BLAS-3 — usual blocked or recursive algorithms; usual blocked algorithms (nested), or recursive [Gustavson,97]
• Cholesky — LAPACK (with b = M^(1/2)), [Gustavson,97], [BDHS09]; [Gustavson,97], [Ahmed,Pingali,00], [BDHS09]; (← same); (← same)
• LU with pivoting — LAPACK (sometimes), [Toledo,97], [GDX 08]; [GDX 08] with partial pivoting; [Toledo,97], [GDX 08]?; [GDX 08]?
• QR, rank-revealing — LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08], [Frens,Wise,03] (but 3x flops); [DGHL08]; [Elmroth,Gustavson,98], [DGHL08]?; [Frens,Wise,03], [DGHL08]?
• Eig, SVD — Not LAPACK; [BDD11] randomized, but more flops; [BDK11]; [BDD11, BDK11]; [BDD11, BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
12
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
13
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent ge words_moved largest_message_size
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
14
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz
Can we attain these lower bounds
bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not
bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties
new ways to encode answers new data structures
ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15
16
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n for j = 1 to n
for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)
= + C(ij) A(i)
B(j)C(ij)
17
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory
= + C(ij) A(i)
B(j)C(ij)
18
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether
= + C(ij) A(i)
B(j)C(ij)
n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic
19
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory
= + C(ij) C(ij) A(ik)
B(kj)b-by-bblock
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= + C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 312n3M12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
22
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
23
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm
How hard is hand-tuning matmul anyway
24
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
25
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
27
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
  – Comes within factor of log P of lower bounds:
• Assume fast memory size M = O(n²/P) per processor (1 copy of data)
  • #words_moved = Ω(#flops / M^{1/2}) = Ω((n³/P) / (n²/P)^{1/2}) = Ω(n² / P^{1/2})
  • #messages = Ω(#flops / M^{3/2}) = Ω((n³/P) / (n²/P)^{3/2}) = Ω(P^{1/2})
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn96.ps, lawn100.ps
• Comparison to Cannon's Algorithm:
  – Cannon attains lower bound
  – But Cannon harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
28
SUMMA – n x n matmul on P^{1/2} x P^{1/2} grid

• C(i,j) is the n/P^{1/2} x n/P^{1/2} submatrix of C on processor Pij
• A(i,k) is an n/P^{1/2} x b submatrix of A
• B(k,j) is a b x n/P^{1/2} submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k) · B(k,j)
  • summation over submatrices
  • Need not be a square processor grid

[Figure: C(i,j) accumulates panels A(i,k) (block row i) times B(k,j) (block column j)]
29
SUMMA – n x n matmul on P^{1/2} x P^{1/2} grid

[Figure: block row i of A (panel A(i,k), received as Acol), block column j of B (panel B(k,j), received as Brow), and block C(i,j)]

For k = 0 to n/b - 1
  for all i = 1 to P^{1/2}
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^{1/2}
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol · Brow
30
SUMMA Communication Costs
For k = 0 to n/b - 1
  for all i = 1 to P^{1/2}
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
      … #words = log P^{1/2} · b · n/P^{1/2}, #messages = log P^{1/2}
  for all j = 1 to P^{1/2}
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
      … same #words and #messages
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol · Brow

° Total #words = log P · n² / P^{1/2}
  ° Within factor of log P of lower bound
  ° (more complicated implementation removes log P factor)
° Total #messages = log P · n/b
  ° Choose b close to maximum, n/P^{1/2}, to approach lower bound P^{1/2}
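The arithmetic that SUMMA performs can be sketched serially (a hypothetical helper, not the PBLAS code): C is accumulated as n/b rank-b outer-product updates, which is exactly what each processor does with the panels it receives.

```python
import numpy as np

def summa_serial(A, B, b):
    """Serial sketch of SUMMA's arithmetic: C is built as a sum of
    rank-b outer products A[:, k:k+b] @ B[k:k+b, :].  In the parallel
    algorithm each panel is broadcast along processor rows/columns."""
    n = A.shape[0]
    C = np.zeros_like(A)
    for k in range(0, n, b):           # n/b steps; b trades messages for words
        Acol = A[:, k:k+b]             # panel each processor row receives
        Brow = B[k:k+b, :]             # panel each processor column receives
        C += Acol @ Brow               # local rank-b update on every processor
    return C
```

Larger b means fewer (log P · n/b) broadcasts but larger panels in fast memory, matching the message/bandwidth trade-off above.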
31
Can we do better?

• Lower bound assumed 1 copy of data: M = O(n²/P) per proc.
• What if matrix small enough to fit c > 1 copies, so M = c·n²/P?
  – #words_moved = Ω(#flops / M^{1/2}) = Ω(n² / (c^{1/2} · P^{1/2}))
  – #messages = Ω(#flops / M^{3/2}) = Ω(P^{1/2} / c^{3/2})
• Can we attain new lower bound?
  – Special case: "3D Matmul": c = P^{1/3}
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^{1/3} x P^{1/3} x P^{1/3} grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k) · B(k,j), where each submatrix is n/P^{1/3} x n/P^{1/3}
  – Not always that much memory available…
2.5D Matrix Multiplication

• Assume can fit c·n²/P data per processor, c > 1
• Processors form (P/c)^{1/2} x (P/c)^{1/2} x c grid

[Figure: (P/c)^{1/2} x (P/c)^{1/2} x c processor grid; example: P = 32, c = 2]
2.5D Matrix Multiplication

• Assume can fit c·n²/P data per processor, c > 1
• Processors form (P/c)^{1/2} x (P/c)^{1/2} x c grid, with axes i, j, k

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^{1/2} x n·(c/P)^{1/2}

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)·B(m,j) along k-axis, so P(i,j,0) owns C(i,j)
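Steps (1)-(3) can be simulated serially (a sketch under the assumption that c divides n; the replication and reduction are just Python loops here, not MPI):

```python
import numpy as np

def matmul_25d_serial(A, B, c):
    """Serial sketch of the 2.5D schedule: conceptually replicate A and B
    on c layers, let layer k do 1/c-th of the m-summation (step (2)),
    then sum-reduce the partial products along the k-axis (step (3))."""
    n = A.shape[0]
    chunk = n // c                          # assume c divides n
    partials = []
    for k in range(c):                      # layer k's share of Σm A(:,m)B(m,:)
        lo, hi = k * chunk, (k + 1) * chunk
        partials.append(A[:, lo:hi] @ B[lo:hi, :])
    return sum(partials)                    # the k-axis sum-reduction
```

The c-fold replication is what buys the c^{1/2} and c^{3/2} factors in the lower bounds above.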
2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.

[Figure: strong-scaling plots, annotated "12x faster" and "2.7x faster" vs 2D]
Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n²
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n³/(cP) · [γT + βT/M^{1/2} + αT/(m·M^{1/2})] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec, for leakage, etc.
• E(cP) = cP · { n³/(cP) · [γE + βE/M^{1/2} + αE/(m·M^{1/2})] + δE·M·T(cP) + εE·T(cP) } = E(P)
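Plugging made-up constants into the timing model confirms the perfect-strong-scaling claim T(cP) = T(P)/c when M is held fixed per processor (the γT, βT, αT, m values below are arbitrary placeholders, not measured machine parameters):

```python
def time_model(P, n, M, gamma, beta, alpha, m):
    """T(P) from the slide: n^3/P flops, each paying gamma, plus bandwidth
    and latency terms governed by the fixed per-processor memory M."""
    return (n**3 / P) * (gamma + beta / M**0.5 + alpha / (m * M**0.5))

# Doubling P (c = 2) with M unchanged halves the modeled runtime exactly.
T_P  = time_model(64,  4096, 2**20, 1e-9, 1e-8, 1e-6, 1024)
T_cP = time_model(128, 4096, 2**20, 1e-9, 1e-8, 1e-6, 1024)
```

The same substitution in the energy formula makes the cP prefactor cancel the 1/c in T(cP), which is why E(cP) = E(P).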
Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K
Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors, and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
39
Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimination: A = LU

for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) * (1 / A(i,i))                          … scale a vector
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)     … rank-1 update

… i.e., for i = 1 to n-1: update column i, update trailing matrix
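The two-line loop body translates almost verbatim into NumPy (a sketch of the naïve algorithm only: no pivoting, so it assumes nonzero pivots and is not numerically robust):

```python
import numpy as np

def naive_lu(A):
    """The slide's naive Gaussian elimination, A = LU: scale the column
    below the diagonal, then apply the rank-1 update to the trailing
    (n-i) x (n-i) submatrix.  No pivoting."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] /= A[i, i]                              # scale a vector
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])  # rank-1 update
    L = np.tril(A, -1) + np.eye(n)   # unit lower triangle
    U = np.triu(A)
    return L, U
```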
Communication in sequential One-sided Factorizations (LU, QR, …)

40

• Naive Approach:
    for i = 1 to n-1
      update column i
      update trailing matrix
  • #words_moved = O(n³)
• Blocked Approach (LAPACK):
    for i = 1 to n/b - 1
      update block i of b columns
      update trailing matrix
  • #words_moved = O(n³ / M^{1/3})
• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  • #words_moved = O(n³ / M^{1/2})
• None of these approaches
  • minimizes #messages
  • handles eig() or svd()
  • works in parallel
• Need more ideas!
TSQR: QR of a Tall, Skinny matrix

41

W = [W0; W1; W2; W3] = [Q00·R00; Q10·R10; Q20·R20; Q30·R30]
  = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]

[R00; R10; R20; R30] = [Q01·R01; Q11·R11] = diag(Q01, Q11) · [R01; R11]

[R01; R11] = Q02·R02
TSQR: QR of a Tall, Skinny matrix

42

W = [W0; W1; W2; W3] = [Q00·R00; Q10·R10; Q20·R20; Q30·R30]
  = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]

[R00; R10; R20; R30] = [Q01·R01; Q11·R11] = diag(Q01, Q11) · [R01; R11]

[R01; R11] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
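The reduction above can be sketched with NumPy's local QR as the building block (a simplified illustration that returns only the final R02; the Q factors are left implicit, and the block count is assumed to be a power of 2):

```python
import numpy as np

def tsqr(W, nblocks=4):
    """Sketch of TSQR with a binary reduction tree: QR each block row,
    then repeatedly stack pairs of R factors and re-factor them,
    until one b x b R (the slide's R02) remains."""
    Rs = [np.linalg.qr(Wi)[1] for Wi in np.array_split(W, nblocks)]
    while len(Rs) > 1:                       # one tree level per iteration
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]))[1]
              for i in range(0, len(Rs), 2)]
    return Rs[0]
```

Each tree level only moves small b x b R factors, which is the source of the communication savings.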
TSQR: An Architecture-Dependent Algorithm

[Figure: reduction trees on W = [W0; W1; W2; W3].
 Parallel (binary tree): (W0, W1, W2, W3) → (R00, R10, R20, R30) → (R01, R11) → R02.
 Sequential (flat tree): fold blocks in one at a time: W0 → R00, +W1 → R01, +W2 → R02, +W3 → R03.
 Dual Core: a hybrid of the two, one short flat tree per core, combined pairwise.]

Can choose reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core
TSQR Performance Results
• Parallel
  – Intel Clovertown
    • Up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi
    • Up to 13x (110,592 x 100)
  – Grid: 4x on 4 cities (Dongarra et al)
  – Cloud: 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
44
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"
45
W (n x b) = [W1; W2; W3; W4] = [P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4]
Choose b pivot rows of W1, call them W1'; likewise W2', W3', W4'

[W1'; W2'; W3'; W4'] = [P12·L12·U12; P34·L34·U34]
Choose b pivot rows, call them W12'; likewise W34'

[W12'; W34'] = P1234·L1234·U1234
Choose b pivot rows
• Go back to W and use these b pivot rows
  • Move them to top, do LU without pivoting
  • Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
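The tournament can be sketched serially with NumPy (an illustrative toy, not the CALU code: `pivot_rows` plays the role of "do LU with partial pivoting on a block and keep the b winning rows", and the tree is simulated with a loop):

```python
import numpy as np

def pivot_rows(W, b):
    """Indices of the first b pivot rows that Gaussian elimination with
    partial pivoting would choose on W (shape: rows x b)."""
    W = W.astype(float).copy()
    rows = np.arange(W.shape[0])
    for j in range(b):
        p = j + int(np.argmax(np.abs(W[j:, j])))     # best remaining pivot
        W[[j, p]] = W[[p, j]]
        rows[[j, p]] = rows[[p, j]]
        W[j+1:, j:] -= np.outer(W[j+1:, j] / W[j, j], W[j, j:])
    return rows[:b]

def tournament_pivoting(W, b, nblocks=4):
    """Sketch of TSLU's tournament: pick b candidate rows per block,
    then play pairs of candidate sets against each other up the tree."""
    cand = [idx[pivot_rows(Wi, b)]
            for idx, Wi in zip(np.array_split(np.arange(len(W)), nblocks),
                               np.array_split(W, nblocks))]
    while len(cand) > 1:                 # one reduction-tree level per pass
        nxt = []
        for i in range(0, len(cand), 2):
            idx = np.concatenate(cand[i:i+2])
            nxt.append(idx[pivot_rows(W[idx], b)])
        cand = nxt
    return cand[0]                       # b winning pivot rows of all of W
```

The winners are then moved to the top of W, and LU proceeds without further pivoting, as the bullets above describe.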
Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: heat map of predicted speedup; x-axis: log2(p), y-axis: log2(n²/p) = log2(memory_per_proc)]
2.5D vs 2D LU, With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
• Algorithms shown minimizing #Messages use (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports), only some shown, plus ours
  • Cache-oblivious are underlined; Green are ours; ? is unknown/future work

Algorithm | 2 Levels of Memory (#Words Moved and #Messages) | Multiple Levels of Memory (#Words Moved and #Messages)
BLAS-3 | Usual blocked or recursive algorithms | Usual blocked algorithms (nested), or recursive [Gustavson,97]
Cholesky | LAPACK (with b = M^{1/2}); [Gustavson,97]; [Ahmed,Pingali,00]; [BDHS09] | (←same)
LU with pivoting | LAPACK (sometimes); [Toledo,97]; [GDX 08] (partial pivoting) | [Toledo,97]; [GDX 08]
QR, rank-revealing | LAPACK (sometimes); [Elmroth,Gustavson,98]; [DGHL08]; [Frens,Wise,03] (but 3x flops) | [Elmroth,Gustavson,98]; [DGHL08]; [Frens,Wise,03]
Eig, SVD | Not LAPACK; [BDD11] (randomized, but more flops); [BDK11] | [BDD11, BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors
• Minimum Memory per processor = M = O(n² / P)
• Recall lower bounds:
  #words_moved = Ω((n³/P) / M^{1/2}) = Ω(n² / P^{1/2})
  #messages    = Ω((n³/P) / M^{3/2}) = Ω(P^{1/2})

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | | |
Cholesky | | |
LU | | |
QR | | |
Sym Eig, SVD | | |
Nonsym Eig | | |
Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n² / P)
• Recall lower bounds:
  #words_moved = Ω((n³/P) / M^{1/2}) = Ω(n² / P^{1/2})
  #messages    = Ω((n³/P) / M^{3/2}) = Ω(P^{1/2})

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 |
Cholesky | ScaLAPACK | log P |
LU | ScaLAPACK | log P |
QR | ScaLAPACK | log P |
Sym Eig, SVD | ScaLAPACK | log P |
Nonsym Eig | ScaLAPACK | P^{1/2} · log P |
Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n² / P)
• Recall lower bounds:
  #words_moved = Ω((n³/P) / M^{1/2}) = Ω(n² / P^{1/2})
  #messages    = Ω((n³/P) / M^{3/2}) = Ω(P^{1/2})

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | ScaLAPACK | log P | (n · log P) / P^{1/2}
QR | ScaLAPACK | log P | (n · log P) / P^{1/2}
Sym Eig, SVD | ScaLAPACK | log P | n / P^{1/2}
Nonsym Eig | ScaLAPACK | P^{1/2} · log P | n · log P
Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors (better)
• Minimum Memory per processor = M = O(n² / P)
• Recall lower bounds:
  #words_moved = Ω((n³/P) / M^{1/2}) = Ω(n² / P^{1/2})
  #messages    = Ω((n³/P) / M^{3/2}) = Ω(P^{1/2})

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX10] | log P | log P
QR | [DGHL08] | log P | log³ P
Sym Eig, SVD | [BDD11] | log P | log³ P
Nonsym Eig | [BDD11] | log P | log³ P
Can we do even better?

• Assume nxn matrices on P processors
• Use c copies of data: M = O(c · n² / P) per processor
• Increasing M reduces lower bounds:
  #words_moved = Ω((n³/P) / M^{1/2}) = Ω(n² / (c^{1/2} · P^{1/2}))
  #messages    = Ω((n³/P) / M^{3/2}) = Ω(P^{1/2} / c^{3/2})

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [DS11, SBD11] | polylog P | polylog P
Cholesky | [SD11, in prog.] | polylog P | c² · polylog P – optimal!
LU | [DS11, SBD11] | polylog P | c² · polylog P – optimal!
QR | Via Cholesky QR | polylog P | c² · polylog P – optimal!
Sym Eig, SVD | ? | |
Nonsym Eig | ? | |
Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less
Successive Band Reduction

[Figure sequence: bulge chasing on a band of width b+1, built up over several slides. Notation: b = bandwidth, c = #columns, d = #diagonals, with the constraint c + d ≤ b. Step 1: Q1 zeroes out d diagonals in a block of c columns (a parallelogram of height d+1); applying Q1^T on the other side creates a (d+c) x (d+c) bulge further down the band. Steps 2, 3, 4, 5, 6: Q2, Q3, Q4, Q5, … chase successive bulges down the band in a pipelined sweep. Only need to zero out the leading parallelogram of each trapezoid.]
Conventional vs CA-SBR

Conventional: many tuning parameters (number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge); touches all data 4 times.
Communication-Avoiding: the right choices reduce #words_moved by a factor of M/bw, not just M^{1/2}; touches all data once.
Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)
Lower bound for all "direct" linear algebra

• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g., A^k)
  – Dense and sparse matrices (where #flops << n³)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g., Floyd-Warshall)

12

• Let M = "fast" memory size (per processor)

  #words_moved (per processor) = Ω(#flops (per processor) / M^{1/2})

  #messages_sent (per processor) = Ω(#flops (per processor) / M^{3/2})

• Parallel case: assume either load or memory balanced
Lower bound for all "direct" linear algebra

• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g., A^k)
  – Dense and sparse matrices (where #flops << n³)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g., Floyd-Warshall)

13

• Let M = "fast" memory size (per processor)

  #words_moved (per processor) = Ω(#flops (per processor) / M^{1/2})

  #messages_sent ≥ #words_moved / largest_message_size

• Parallel case: assume either load or memory balanced
Lower bound for all "direct" linear algebra

• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g., A^k)
  – Dense and sparse matrices (where #flops << n³)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g., Floyd-Warshall)

14

• Let M = "fast" memory size (per processor)

  #words_moved (per processor) = Ω(#flops (per processor) / M^{1/2})

  #messages_sent (per processor) = Ω(#flops (per processor) / M^{3/2})

• Parallel case: assume either load or memory balanced
SIAM SIAG/Linear Algebra Prize, 2012: Ballard, D., Holtz, Schwartz
Can we attain these lower bounds?

• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply

15
16
Naïve Matrix Multiply

/* implements C = C + A·B */
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

[Figure: C(i,j) = C(i,j) + row A(i,:) times column B(:,j)]
17
Naïve Matrix Multiply

/* implements C = C + A·B */
for i = 1 to n
  read row i of A into fast memory
  for j = 1 to n
    read C(i,j) into fast memory
    read column j of B into fast memory
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    write C(i,j) back to slow memory

[Figure: C(i,j) = C(i,j) + row A(i,:) times column B(:,j)]
18
Naïve Matrix Multiply

/* implements C = C + A·B */
for i = 1 to n
  read row i of A into fast memory              … n² reads altogether
  for j = 1 to n
    read C(i,j) into fast memory                … n² reads altogether
    read column j of B into fast memory         … n³ reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    write C(i,j) back to slow memory            … n² writes altogether

[Figure: C(i,j) = C(i,j) + row A(i,:) times column B(:,j)]

n³ + 3n² reads/writes altogether – dominates 2n³ arithmetic
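The per-line counts above can be checked by instrumenting the loop nest under the same model (a toy counter, not a cache simulator):

```python
def naive_matmul_traffic(n):
    """Count slow-memory traffic of the naive loop nest under the
    slide's model: row i of A once per i, C(i,j) read and written once
    per (i,j), column j of B once per (i,j)."""
    words = 0
    for i in range(n):
        words += n                # read row i of A
        for j in range(n):
            words += 1            # read C(i,j)
            words += n            # read column j of B
            words += 1            # write C(i,j) back
    return words
```

Summing gives n·(n + n·(n + 2)) = n³ + 3n², matching the slide's total.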
19
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    read block C(i,j) into fast memory
    for k = 1 to n/b
      read block A(i,k) into fast memory
      read block B(k,j) into fast memory
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   … do a matrix multiply on blocks
    write block C(i,j) back to slow memory

[Figure: C(i,j) = C(i,j) + A(i,k) · B(k,j), on b-by-b blocks]
20
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    read block C(i,j) into fast memory          … b² times (n/b)² = n² reads
    for k = 1 to n/b
      read block A(i,k) into fast memory        … b² times (n/b)³ = n³/b reads
      read block B(k,j) into fast memory        … b² times (n/b)³ = n³/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   … do a matrix multiply on blocks
    write block C(i,j) back to slow memory      … b² times (n/b)² = n² writes

[Figure: C(i,j) = C(i,j) + A(i,k) · B(k,j), on b-by-b blocks]

2n³/b + 2n² reads/writes << 2n³ arithmetic – Faster!
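The tiled loop nest and its traffic count translate directly into Python (a sketch under the assumption that b divides n; block multiplies are done with NumPy for brevity):

```python
import numpy as np

def blocked_matmul(A, B, b):
    """Blocked (tiled) matmul from the slide, also counting the words
    moved between slow and fast memory under the model: each b x b
    block read or written costs b^2 words."""
    n = A.shape[0]
    C = np.zeros_like(A)
    words = 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            words += b * b                      # read block C(i,j)
            for k in range(0, n, b):
                words += 2 * b * b              # read blocks A(i,k), B(k,j)
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
            words += b * b                      # write block C(i,j) back
    return C, words
```

The counter reproduces the 2n³/b + 2n² total, so doubling b halves the dominant traffic term.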
Does blocked matmul attain lower bound?

• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n³/b + 2n²
• Make b as large as possible: 3b² ≤ M, so #reads/writes ≥ 3^{1/2}·n³/M^{1/2} + 2n²
• Attains lower bound = Ω(#flops / M^{1/2})

• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?

21
Recursive Matrix Multiplication (RMM) (1/2)

• For simplicity: square matrices, with n = 2^m

22

• C = [C11 C12; C21 C22] = A · B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
    = [A11·B11 + A12·B21   A11·B12 + A12·B22;
       A21·B11 + A22·B21   A21·B12 + A22·B22]
• True when each Aij etc. is 1x1, or n/2 x n/2
func C = RMM(A, B, n)
  if n = 1
    C = A * B
  else
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  return C
Recursive Matrix Multiplication (RMM) (22)
23
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm
How hard is hand-tuning matmul anyway
24
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
25
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
27
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS
bull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions
layouts and Cannon may use more memory
28
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
kk
B(kj)
C(ij)
29
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
kk
B(kj)
C(ij)
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
Brow
Acol
30
SUMMA Communication Costs
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
deg Total words = log P n2 P12
deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)
deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12
31
Can we do better
bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P
ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)
bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E
needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T
that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
39
Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU
for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update
for i=1 to n-1 update column i update trailing matrix
Communication in sequentialOne-sided Factorizations (LU QR hellip)
bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)
40
bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (conventional approach)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω((n^3/P) / M^(1/2)) = Ω(n^2 / P^(1/2))
  #messages   = Ω((n^3/P) / M^(3/2)) = Ω(P^(1/2))

Algorithm (reference): factor exceeding lower bound for #words_moved; factor exceeding lower bound for #messages
- Matrix Multiply [Cannon, 69]: 1; 1
- Cholesky (ScaLAPACK): log P; log P
- LU (ScaLAPACK): log P; (n log P) / P^(1/2)
- QR (ScaLAPACK): log P; (n log P) / P^(1/2)
- Sym Eig, SVD (ScaLAPACK): log P; n / P^(1/2)
- Nonsym Eig (ScaLAPACK): P^(1/2) log P; n log P
Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (better)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω((n^3/P) / M^(1/2)) = Ω(n^2 / P^(1/2))
  #messages   = Ω((n^3/P) / M^(3/2)) = Ω(P^(1/2))

Algorithm (reference): factor exceeding lower bound for #words_moved; factor exceeding lower bound for #messages
- Matrix Multiply [Cannon, 69]: 1; 1
- Cholesky (ScaLAPACK): log P; log P
- LU [GDX10]: log P; log P
- QR [DGHL08]: log P; log^3 P
- Sym Eig, SVD [BDD11]: log P; log^3 P
- Nonsym Eig [BDD11]: log P; log^3 P
Can we do even better?

• Assume n x n matrices on P processors
• Use c copies of data: M = O(c n^2 / P) per processor
• Increasing M reduces lower bounds:
  #words_moved = Ω((n^3/P) / M^(1/2)) = Ω(n^2 / (c^(1/2) P^(1/2)))
  #messages   = Ω((n^3/P) / M^(3/2)) = Ω(P^(1/2) / c^(3/2))

Algorithm (reference): factor exceeding lower bound for #words_moved; factor exceeding lower bound for #messages
- Matrix Multiply [DS11, SBD11]: polylog P; polylog P
- Cholesky [SD11, in prog.]: polylog P; c^2 polylog P (optimal!)
- LU [DS11, SBD11]: polylog P; c^2 polylog P (optimal!)
- QR (via Cholesky QR): polylog P; c^2 polylog P (optimal!)
- Sym Eig, SVD; Nonsym Eig: –
Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less!
[Figure sequence: "Successive Band Reduction" animation. A symmetric band matrix of bandwidth b is reduced by sweeps of orthogonal transformations Q1, Q1^T, Q2, Q2^T, ..., Q5, Q5^T: each step annihilates d diagonals of a c-column panel (block sizes b+1, d+1, c, d+c are marked on the slides), creating a bulge that is chased down the band. Parameters: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b. Only need to zero out the leading parallelogram of each trapezoid.]
Conventional vs CA-SBR

Conventional: touch all data 4 times. Communication-avoiding: touch all data once.
Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge. The right choices reduce #words_moved by a factor of M/bw, not just M^(1/2).
Lower bound for all "direct" linear algebra

• Let M = "fast" memory size (per processor):
  #words_moved (per processor) = Ω(#flops (per processor) / M^(1/2))
  #messages_sent (per processor) = Ω(#flops (per processor) / M^(3/2))
• Parallel case: assume either load or memory balanced
• Holds for:
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)
• Also: #messages_sent ≥ #words_moved / largest_message_size
SIAM SIAG/Linear Algebra Prize, 2012: Ballard, D., Holtz, Schwartz
Can we attain these lower bounds?

• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply
16
Naïve Matrix Multiply

// implements C = C + A*B
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

[Figure: C(i,j) += A(i,:) · B(:,j)]
17
Naïve Matrix Multiply

// implements C = C + A*B
for i = 1 to n
  read row i of A into fast memory
  for j = 1 to n
    read C(i,j) into fast memory
    read column j of B into fast memory
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    write C(i,j) back to slow memory

[Figure: C(i,j) += A(i,:) · B(:,j)]
18
Naïve Matrix Multiply

// implements C = C + A*B
for i = 1 to n
  read row i of A into fast memory              … n^2 reads altogether
  for j = 1 to n
    read C(i,j) into fast memory                … n^2 reads altogether
    read column j of B into fast memory         … n^3 reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    write C(i,j) back to slow memory            … n^2 writes altogether

n^3 + 3n^2 reads/writes altogether, which dominates the 2n^3 arithmetic.
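The loop nest above can be rendered directly; a minimal NumPy sketch (illustrative only, with the arrays standing in for slow memory):

```python
import numpy as np

def naive_matmul(C, A, B):
    # implements C = C + A*B with the slide's three nested loops
    n = A.shape[0]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C
```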
19
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory.

for i = 1 to n/b
  for j = 1 to n/b
    read block C(i,j) into fast memory
    for k = 1 to n/b
      read block A(i,k) into fast memory
      read block B(k,j) into fast memory
      C(i,j) = C(i,j) + A(i,k) * B(k,j)    … do a matrix multiply on blocks
    write block C(i,j) back to slow memory

[Figure: block C(i,j) += A(i,k) · B(k,j), b-by-b blocks]
20
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory.

for i = 1 to n/b
  for j = 1 to n/b
    read block C(i,j) into fast memory      … b^2 x (n/b)^2 = n^2 reads
    for k = 1 to n/b
      read block A(i,k) into fast memory    … b^2 x (n/b)^3 = n^3/b reads
      read block B(k,j) into fast memory    … b^2 x (n/b)^3 = n^3/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)     … do a matrix multiply on blocks
    write block C(i,j) back to slow memory  … b^2 x (n/b)^2 = n^2 writes

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic. Faster!
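The blocked loop nest, sketched in NumPy (the `.copy()` and write-back mark where the slide counts block reads and writes; block shapes assume b divides n):

```python
import numpy as np

def blocked_matmul(C, A, B, b):
    # C = C + A*B on b-by-b blocks; each block read/write moves b^2 words,
    # giving ~2n^3/b + 2n^2 words of traffic as counted on the slide
    n = A.shape[0]
    assert n % b == 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            Cij = C[i:i+b, j:j+b].copy()        # read block C(i,j)
            for k in range(0, n, b):
                # read blocks A(i,k), B(k,j); multiply on blocks
                Cij += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
            C[i:i+b, j:j+b] = Cij               # write block C(i,j) back
    return C
```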
Does blocked matmul attain the lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 3^(1/2) n^3 / M^(1/2) + 2n^2
• Attains lower bound = Ω(#flops / M^(1/2))
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?
21
Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = [C11 C12; C21 C22] = A · B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
    = [A11·B11 + A12·B21   A11·B12 + A12·B22;
       A21·B11 + A22·B21   A21·B12 + A22·B22]
• True when each A_ij etc. is 1x1 or n/2 x n/2

func C = RMM(A, B, n)
  if n = 1
    C = A * B
  else
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  return
Recursive Matrix Multiplication (RMM) (2/2)

(same RMM pseudocode as the previous slide)

A(n) = # arithmetic operations in RMM(·, ·, n)
     = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in different order

W(n) = # words moved between fast, slow memory by RMM(·, ·, n)
     = 8 · W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O(n^3 / M^(1/2) + n^2) … same as blocked matmul

"Cache oblivious": works for memory hierarchies, but not a panacea.

For big speedups, see the SC12 poster "Beating MKL and ScaLAPACK at Rectangular Matmul", 11/13 at 5:15-7pm.
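The recursive scheme above, sketched in NumPy for n a power of 2 (quadrants expressed as array slices rather than separate matrices):

```python
import numpy as np

def rmm(A, B):
    # recursive matrix multiply from the slide; n must be a power of 2
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    C = np.empty((n, n))
    C[:h, :h] = rmm(A[:h, :h], B[:h, :h]) + rmm(A[:h, h:], B[h:, :h])
    C[:h, h:] = rmm(A[:h, :h], B[:h, h:]) + rmm(A[:h, h:], B[h:, h:])
    C[h:, :h] = rmm(A[h:, :h], B[:h, :h]) + rmm(A[h:, h:], B[h:, :h])
    C[h:, h:] = rmm(A[h:, :h], B[:h, h:]) + rmm(A[h:, h:], B[h:, h:])
    return C
```

No block size appears anywhere, which is the "cache oblivious" point: the recursion adapts to every level of the hierarchy automatically.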
How hard is hand-tuning matmul, anyway?
24
• Results of 22 student teams trying to tune matrix multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor-tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
Parallel MatMul with 2D Processor Layout
• P processors in a P^(1/2) x P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns an n/P^(1/2) x n/P^(1/2) submatrix of A, B, C
• Example: P = 16, processors numbered from P00 to P33
  – Processor P_ij owns submatrices A_ij, B_ij, and C_ij

[Figure: C = A * B, with each of C, A, B laid out on the same 4x4 grid of processors P00 … P33]
27
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
  – Comes within a factor of log P of the lower bounds:
    • Assume fast memory size M = O(n^2/P) per processor, 1 copy of data
    • #words_moved = Ω(#flops / M^(1/2)) = Ω((n^3/P) / (n^2/P)^(1/2)) = Ω(n^2 / P^(1/2))
    • #messages = Ω(#flops / M^(3/2)) = Ω((n^3/P) / (n^2/P)^(3/2)) = Ω(P^(1/2))
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn96.100.ps
• Comparison to Cannon's algorithm:
  – Cannon attains the lower bound
  – But Cannon is harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
28
SUMMA: n x n matmul on a P^(1/2) x P^(1/2) grid
• C(i,j) is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor P_ij
• A(i,k) is an n/P^(1/2) x b submatrix of A
• B(k,j) is a b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σ_k A(i,k) * B(k,j)
  – summation over submatrices
  – need not be a square processor grid

[Figure: C(i,j) accumulates products of the column panel A(i,k) and the row panel B(k,j)]
29
SUMMA: n x n matmul on a P^(1/2) x P^(1/2) grid

For k = 0 to n/b - 1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow
30
SUMMA Communication Costs
For k = 0 to n/b - 1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
      … #words = log P^(1/2) * b * n/P^(1/2), #messages = log P^(1/2)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
      … same #words and #messages
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

• Total #words = log P * n^2 / P^(1/2)
  – Within a factor of log P of the lower bound
  – (a more complicated implementation removes the log P factor)
• Total #messages = log P * n/b
  – Choose b close to its maximum, n/P^(1/2), to approach the lower bound P^(1/2)
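The per-processor tallies above can be checked with a small cost model; a sketch assuming a square grid and binary-tree broadcasts (constants as counted on the slide, not a full implementation):

```python
from math import log2, sqrt

def summa_traffic(n, P, b):
    # per-processor words/messages for SUMMA's broadcasts on a
    # sqrt(P) x sqrt(P) grid, with n/b steps of the k-loop
    s = int(sqrt(P))
    steps = n // b
    words_per_step = 2 * log2(s) * b * (n / s)   # A-panel + B-panel broadcasts
    msgs_per_step = 2 * log2(s)
    return steps * words_per_step, steps * msgs_per_step
```

With b at its maximum n/P^(1/2), this reproduces the slide's totals: log P * n^2 / P^(1/2) words and P^(1/2) * log P messages.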
31
Can we do better?

• Lower bound assumed 1 copy of data: M = O(n^2/P) per proc
• What if the matrix is small enough to fit c > 1 copies, so M = c n^2/P?
  – #words_moved = Ω(#flops / M^(1/2)) = Ω(n^2 / (c^(1/2) P^(1/2)))
  – #messages = Ω(#flops / M^(3/2)) = Ω(P^(1/2) / c^(3/2))
• Can we attain the new lower bound?
  – Special case: "3D Matmul", c = P^(1/3)
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in a P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k) * B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – Not always that much memory available…
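The effect of replication on both bounds is just arithmetic; a sketch (constants dropped, M = c*n^2/P as above):

```python
from math import sqrt

def matmul_lower_bounds(n, P, c):
    # lower bounds with c copies of the data, M = c*n^2/P (constants dropped)
    M = c * n * n / P
    words = (n**3 / P) / sqrt(M)      # simplifies to n^2 / sqrt(c*P)
    messages = (n**3 / P) / M**1.5    # simplifies to sqrt(P) / c**1.5
    return words, messages
```

Note the asymmetry: words shrink by c^(1/2) but messages by c^(3/2), which is why latency-bound problems benefit most from replication.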
2.5D Matrix Multiplication

• Assume we can fit c n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure: (P/c)^(1/2) x (P/c)^(1/2) x c processor grid; example: P = 32, c = 2]
2.5D Matrix Multiplication

• Assume we can fit c n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid
• Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σ_m A(i,m)*B(m,j) along the k-axis, so P(i,j,0) owns C(i,j)
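Steps (2) and (3) in serial form, as a NumPy sketch: each of c hypothetical "layers" computes its 1/c-th of the sum over m, then the partial products are reduced:

```python
import numpy as np

def matmul_25d_layers(A, B, c):
    # each of the c layers computes 1/c-th of the sum over m (step 2),
    # then the partial sums are reduced along the k-axis (step 3)
    n = A.shape[0]
    cuts = [n * k // c for k in range(c + 1)]
    partials = [A[:, cuts[k]:cuts[k+1]] @ B[cuts[k]:cuts[k+1], :]
                for k in range(c)]          # one partial product per layer
    return sum(partials)                    # the sum-reduce of step (3)
```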
2.5D Matmul on BG/P, 16K nodes / 64K cores (c = 16 copies)

[Figure: performance plots; annotations: 12x faster, 2.7x faster]

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.
Perfect Strong Scaling in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with the minimal number of procs: P·M = 3n^2
• Increase P by a factor of c; total memory increases by a factor of c
• Notation for timing model:
  – γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) [γ_T + β_T/M^(1/2) + α_T/(m M^(1/2))] = T(P)/c
• Notation for energy model:
  – γ_E, β_E, α_E = joules for the same operations
  – δ_E = joules per word of memory used per sec
  – ε_E = joules per sec for leakage, etc.
  – E(cP) = cP { n^3/(cP) [γ_E + β_E/M^(1/2) + α_E/(m M^(1/2))] + δ_E M T(cP) + ε_E T(cP) } = E(P)
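The timing model's scaling claim can be checked numerically; a sketch with placeholder machine constants (all values below are illustrative assumptions, not measured parameters):

```python
def T(P, n, M, gamma_t, beta_t, alpha_t, m):
    # T(P) = n^3/P * (gamma + beta/M^(1/2) + alpha/(m*M^(1/2)));
    # M is held fixed per processor as P grows, so T(cP) = T(P)/c
    return (n**3 / P) * (gamma_t + beta_t / M**0.5 + alpha_t / (m * M**0.5))
```

Because M stays fixed per processor while total memory grows with P, every term divides by c, which is exactly the "perfect strong scaling" statement.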
Perfect Strong Scaling in Time of 2.5D Matmul on BG/P (n = 64K)
Perfect Strong Scaling in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  – What is the minimum energy required for a computation?
  – Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  – Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  – The ratio P = E/T gives us the average power required to run the algorithm; can we minimize the average power consumed?
  – Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
39
Extending to the rest of Direct Linear Algebra: Naïve Gaussian Elimination, A = LU

for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) * (1 / A(i,i))                        … scale a vector
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)   … rank-1 update

i.e., for i = 1 to n-1: update column i, then update the trailing matrix.
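The two-line elimination loop above, as a NumPy sketch returning the L and U factors (no pivoting, so the test uses a diagonally dominant matrix to keep pivots safe):

```python
import numpy as np

def lu_nopivot(A):
    # the slide's naive Gaussian elimination, A = L*U (no pivoting)
    A = A.astype(float).copy()
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] /= A[i, i]                                  # scale a vector
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])      # rank-1 update
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return L, U
```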
Communication in sequential one-sided factorizations (LU, QR, …)
• Naive approach:
  for i = 1 to n-1: update column i, update trailing matrix
  #words_moved = O(n^3)
• Blocked approach (LAPACK):
  for i = 1 to n/b - 1: update block i of b columns, update trailing matrix
  #words_moved = O(n^3 / M^(1/3))
• Recursive approach:
  func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)
  #words_moved = O(n^3 / M^(1/2))
• None of these approaches
  – minimizes #messages
  – handles eig() or svd()
  – works in parallel
• Need more ideas!
TSQR: QR of a Tall, Skinny matrix

W = [W0; W1; W2; W3] = [Q00·R00; Q10·R10; Q20·R20; Q30·R30]
  = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]

[R00; R10] = Q01·R01 and [R20; R30] = Q11·R11, so
[R00; R10; R20; R30] = diag(Q01, Q11) · [R01; R11]

[R01; R11] = Q02·R02
Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
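The R-factor reduction tree above can be sketched with NumPy's QR on each block (the implicit Q factors are discarded here; only the R reduction is shown):

```python
import numpy as np

def tsqr_R(W):
    # two-level TSQR reduction from the slide: leaf QRs on W0..W3,
    # then pairwise QRs of the stacked R factors
    W0, W1, W2, W3 = np.array_split(W, 4)
    R00 = np.linalg.qr(W0)[1]; R10 = np.linalg.qr(W1)[1]
    R20 = np.linalg.qr(W2)[1]; R30 = np.linalg.qr(W3)[1]
    R01 = np.linalg.qr(np.vstack([R00, R10]))[1]
    R11 = np.linalg.qr(np.vstack([R20, R30]))[1]
    return np.linalg.qr(np.vstack([R01, R11]))[1]   # R02
```

The result agrees with the R of a direct QR of W, up to the sign of each row (the usual QR sign ambiguity).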
TSQR: An Architecture-Dependent Algorithm

[Figure: TSQR reduction trees on W = [W0; W1; W2; W3]]
Parallel:   (R00, R10, R20, R30) → (R01, R11) → R02
Sequential: R00 → R01 → R02 → R03 (fold in one block at a time)
Dual core:  a hybrid of the two
Can choose the reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core, …
TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid: 4x on 4 cities (Dongarra et al)
  – Cloud: 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on a PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
44
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Back to LU: using a similar idea for TSLU as for TSQR; use a reduction tree to do "Tournament Pivoting"
45
W (n x b) = [W1; W2; W3; W4] = [P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4]
  → choose b pivot rows of W1, call them W1'; likewise W2', W3', W4'

[W1'; W2'] = P12·L12·U12 → choose b pivot rows, call them W12'
[W3'; W4'] = P34·L34·U34 → choose b pivot rows, call them W34'

[W12'; W34'] = P1234·L1234·U1234 → choose b pivot rows
• Go back to W and use these b pivot rows
  – Move them to the top, do LU without pivoting
  – Extra work, but a lower-order term
• Thm: as numerically stable as partial pivoting on a larger matrix
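A small sketch of the tournament: each block elects its b pivot rows via ordinary partial pivoting, winners are stacked and play again up the tree (the helper `gepp_pivot_rows` is a hypothetical name; this illustrates the selection only, not the final no-pivoting LU on W):

```python
import numpy as np

def gepp_pivot_rows(W):
    # indices (into W) of the b pivot rows LU with partial pivoting chooses
    W = W.astype(float).copy()
    idx = np.arange(W.shape[0])
    b = W.shape[1]
    for j in range(b):
        p = j + np.argmax(np.abs(W[j:, j]))        # partial-pivot choice
        W[[j, p]] = W[[p, j]]
        idx[[j, p]] = idx[[p, j]]
        W[j+1:, j] /= W[j, j]
        W[j+1:, j+1:] -= np.outer(W[j+1:, j], W[j, j+1:])
    return idx[:b]

def tournament_pivots(W, leaves=4):
    # reduction tree: pick b winners per block, stack winners, repeat
    winners = [blk[gepp_pivot_rows(W[blk])]
               for blk in np.array_split(np.arange(W.shape[0]), leaves)]
    while len(winners) > 1:
        winners = [cand[gepp_pivot_rows(W[cand])]
                   for cand in (np.concatenate(winners[a:a+2])
                                for a in range(0, len(winners), 2))]
    return winners[0]
```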
Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: speedup contour plot; x-axis: log2(p), y-axis: log2(n^2/p) = log2(memory_per_proc)]
2.5D vs 2D LU, With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
13
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent ge words_moved largest_message_size
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
14
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz
Can we attain these lower bounds
bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not
bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties
new ways to encode answers new data structures
ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15
16
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n for j = 1 to n
for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)
= + C(ij) A(i)
B(j)C(ij)
17
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory
= + C(ij) A(i)
B(j)C(ij)
18
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether
= + C(ij) A(i)
B(j)C(ij)
n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic
19
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory
= + C(ij) C(ij) A(ik)
B(kj)b-by-bblock
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= + C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 312n3M12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
22
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
23
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm
How hard is hand-tuning matmul anyway
24
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
25
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
27
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS
bull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions
layouts and Cannon may use more memory
28
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
kk
B(kj)
C(ij)
29
SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

For k = 0 to n/b - 1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

[Figure: A(i,k) broadcast along processor row i into Acol; B(k,j) broadcast along processor column j into Brow]
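The communication pattern above can be sketched without MPI by simulating the broadcasts with panel copies; a minimal single-process Python sketch (function and variable names here are illustrative, not from any library):

```python
# Sequential sketch of SUMMA's structure: C is built as a sum over panels k
# of (block column of A) x (block row of B). The panel copies Acol and Brow
# stand in for the row/column broadcasts; a real SUMMA spreads the i, j
# loops over a P^(1/2) x P^(1/2) processor grid.
def summa_sketch(A, B, b):
    """Multiply n x n matrices A, B (lists of lists), panel width b."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for k0 in range(0, n, b):
        k1 = min(k0 + b, n)
        Acol = [row[k0:k1] for row in A]          # "broadcast" along rows
        Brow = [B[k][:] for k in range(k0, k1)]   # "broadcast" along columns
        for i in range(n):                        # every processor updates
            for j in range(n):                    # its local block of C
                C[i][j] += sum(Acol[i][t] * Brow[t][j] for t in range(k1 - k0))
    return C
```

A larger panel width b means fewer "broadcasts" (messages) but more words per broadcast, which is exactly the b trade-off analyzed on the communication-cost slide.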
30
SUMMA Communication Costs
For k = 0 to n/b - 1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
      … #words = log P^(1/2) · b·n/P^(1/2), #messages = log P^(1/2)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
      … same #words and #messages
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

° Total #words = log P · n^2 / P^(1/2)
  ° Within factor of log P of lower bound
  ° (more complicated implementation removes log P factor)
° Total #messages = log P · n/b
  ° Choose b close to maximum, n/P^(1/2), to approach lower bound P^(1/2)
31
Can we do better?

• Lower bound assumed 1 copy of data: M = O(n^2/P) per proc.
• What if matrix small enough to fit c > 1 copies, so M = c·n^2/P?
  – #words_moved = Ω( #flops / M^(1/2) ) = Ω( n^2 / ( c^(1/2) · P^(1/2) ) )
  – #messages    = Ω( #flops / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )
• Can we attain new lower bound?
  – Special case: "3D Matmul": c = P^(1/3)
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k) * B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – Not always that much memory available…
2.5D Matrix Multiplication

• Assume can fit c·n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure: processor grid of side (P/c)^(1/2) x (P/c)^(1/2) and depth c; example: P = 32, c = 2]
2.5D Matrix Multiplication

• Assume can fit c·n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^(1/2) x n·(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m) * B(m,j)
(3) Sum-reduce partial sums Σ_m A(i,m) * B(m,j) along k-axis so that P(i,j,0) owns C(i,j)
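Steps (1)–(3) can be mimicked sequentially by modeling only how the arithmetic is split across the c layers; a toy Python sketch (names are illustrative, and no claim is made about the real BG/P implementation):

```python
# Toy model of 2.5D matmul's arithmetic: each of the c "layers" computes
# 1/c-th of the sum over m (step 2), then the partial sums are reduced
# along the k-axis (step 3). Data replication (step 1) is implicit in the
# shared access to A and B here.
def matmul_25d_sketch(A, B, c):
    n = len(A)
    assert n % c == 0
    chunk = n // c
    partials = []
    for layer in range(c):
        m0, m1 = layer * chunk, (layer + 1) * chunk
        partials.append([[sum(A[i][m] * B[m][j] for m in range(m0, m1))
                          for j in range(n)] for i in range(n)])
    # sum-reduce the c partial products back to "layer 0"
    return [[sum(P[i][j] for P in partials) for j in range(n)] for i in range(n)]
```

The replication factor c only changes who computes which slice of the m-sum; the result is identical to ordinary matmul.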
2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

Distinguished Paper Award, EuroPar'11 (SC'11 paper by Solomonik, Bhatele, D.)

[Plot annotations: 12x faster; 2.7x faster]
Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γ_E, β_E, α_E = joules for same operations
  – δ_E = joules per word of memory used per sec
  – ε_E = joules per sec for leakage, etc.
• E(cP) = cP · { n^3/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
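The timing model can be checked numerically: with M held fixed per processor, T(cP) = T(P)/c exactly. A small sketch with placeholder parameter values (not measurements of any machine):

```python
# T(P) = n^3/P * [ gamma_T + beta_T/M^(1/2) + alpha_T/(m*M^(1/2)) ]
# gamma_T, beta_T, alpha_T: secs per flop, per word moved, per message of
# size m; the values passed in below are arbitrary placeholders.
def model_time(n, P, M, gamma_t, beta_t, alpha_t, m):
    return (n ** 3 / P) * (gamma_t + beta_t / M ** 0.5 + alpha_t / (m * M ** 0.5))
```

Scaling P by c while keeping M and the message size m fixed divides every term by c, which is the "perfect strong scaling" claim above; the energy model behaves the same way because E(cP) multiplies the per-processor cost by cP.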
Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K
Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
39
Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimination: A = LU

for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) )                       … scale a vector
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)    … rank-1 update

i.e., for i = 1 to n-1: update column i, then update trailing matrix
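The two-line elimination loop transcribes directly into Python (no pivoting; assumes every leading A(i,i) is nonzero). A is overwritten with L below the diagonal (unit diagonal implicit) and U on and above it:

```python
# Naive A = LU, exactly the loop on the slide: scale column i below the
# diagonal, then do the rank-1 update of the trailing matrix.
def naive_lu(A):
    n = len(A)
    for i in range(n - 1):
        for r in range(i + 1, n):          # scale a vector
            A[r][i] /= A[i][i]
        for r in range(i + 1, n):          # rank-1 update
            for c in range(i + 1, n):
                A[r][c] -= A[r][i] * A[i][c]
    return A
```

This is the column-by-column "naive approach" whose communication cost is analyzed on the following slide.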
Communication in sequential One-sided Factorizations (LU, QR, …)

40

• Naive Approach:
    for i = 1 to n-1
      update column i
      update trailing matrix
  • #words_moved = O(n^3)
• Blocked Approach (LAPACK):
    for i = 1 to n/b - 1
      update block i of b columns
      update trailing matrix
  • #words_moved = O(n^3 / M^(1/3))
• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  • #words_moved = O(n^3 / M^(1/2))
• None of these approaches
  • minimizes #messages
  • handles eig() or svd()
  • works in parallel
• Need more ideas!
TSQR: QR of a Tall, Skinny matrix

41

W = [ W0 ]   [ Q00·R00 ]
    [ W1 ] = [ Q10·R10 ] = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]
    [ W2 ]   [ Q20·R20 ]
    [ W3 ]   [ Q30·R30 ]

[ R00; R10 ] = Q01·R01,  [ R20; R30 ] = Q11·R11,
so [ R00; R10; R20; R30 ] = diag(Q01, Q11) · [ R01; R11 ]

[ R01; R11 ] = Q02·R02
Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
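The reduction above can be sketched in pure Python, using a small modified Gram-Schmidt QR as the local factorization (a production TSQR would use Householder QR; `mgs_qr_r` and `tsqr_r` are illustrative names). Only the final R is returned; the implicit Q is the tree of local Q factors, as in the output list above:

```python
# TSQR sketch: QR each block, then repeatedly stack pairs of R factors and
# re-factor, one level of the binary reduction tree per iteration.
def mgs_qr_r(W):
    """R factor of a tall m x n block W (lists of lists), m >= n, full rank."""
    m, n = len(W), len(W[0])
    Q = [row[:] for row in W]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = sum(Q[i][j] ** 2 for i in range(m)) ** 0.5
        for i in range(m):
            Q[i][j] /= R[j][j]
        for k in range(j + 1, n):
            R[j][k] = sum(Q[i][j] * Q[i][k] for i in range(m))
            for i in range(m):
                Q[i][k] -= Q[i][j] * R[j][k]
    return R

def tsqr_r(blocks):
    """Final n x n factor R of W = vertical stack of the given blocks."""
    Rs = [mgs_qr_r(Wb) for Wb in blocks]
    while len(Rs) > 1:
        Rs = [mgs_qr_r(Rs[i] + Rs[i + 1]) for i in range(0, len(Rs), 2)]
    return Rs[0]
```

Since every tree node factors only small stacked R's, each level needs one message per pair of processors, which is where the latency savings come from.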
TSQR: An Architecture-Dependent Algorithm

[Figure: three reduction trees for W = [ W0; W1; W2; W3 ]]

Parallel (binary tree):   (W0, W1, W2, W3) → (R00, R10, R20, R30) → (R01, R11) → R02
Sequential (flat tree):   R00 = QR(W0), R01 = QR([R00; W1]), R02 = QR([R01; W2]), R03 = QR([R02; W3])
Dual Core: a hybrid of the two trees

Can choose reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core
TSQR Performance Results

• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"

45

W_{n x b} = [ W1 ]   [ P1·L1·U1 ]   Choose b pivot rows of W1, call them W1'
            [ W2 ] = [ P2·L2·U2 ]   Choose b pivot rows of W2, call them W2'
            [ W3 ]   [ P3·L3·U3 ]   Choose b pivot rows of W3, call them W3'
            [ W4 ]   [ P4·L4·U4 ]   Choose b pivot rows of W4, call them W4'

[ W1'; W2' ] = P12·L12·U12   Choose b pivot rows, call them W12'
[ W3'; W4' ] = P34·L34·U34   Choose b pivot rows, call them W34'

[ W12'; W34' ] = P1234·L1234·U1234   Choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term

• Thm: As numerically stable as Partial Pivoting on a larger matrix
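For the special case b = 1, "LU with partial pivoting on a block" reduces to picking the row with the largest leading entry, so the whole tournament can be sketched in a few lines of Python (illustrative only; real TSLU runs partial-pivoting LU on b columns at each tree node):

```python
# Tournament pivoting, specialized to b = 1: local "LU" contests at the
# leaves pick each block's best row, then a binary tree of contests picks
# the overall pivot row.
def tournament_pivot_b1(W):
    """Index of the winning pivot row of W (largest |leading entry|)."""
    block = 4                                  # rows per leaf-level contest
    groups = [list(range(len(W))[i:i + block]) for i in range(0, len(W), block)]
    while len(groups) > 1 or len(groups[0]) > 1:
        winners = [max(g, key=lambda r: abs(W[r][0])) for g in groups]
        groups = [winners[i:i + 2] for i in range(0, len(winners), 2)]
    return groups[0][0]
```

With b = 1 the tournament returns the same pivot as ordinary partial pivoting; for b > 1 the chosen rows can differ, which is why the stability theorem above is stated relative to partial pivoting on a larger matrix.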
Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Plot: predicted speedup as a function of log2(p) (x-axis) and log2(n^2/p) = log2(memory_per_proc) (y-axis)]
2.5D vs 2D LU, With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
• Algorithms shown minimizing #Messages use (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports), only some shown, plus ours
• Cache-oblivious are underlined, Green are ours, ? is unknown/future work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M^(1/2)) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(←same) (←same)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QR / Rank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig, SVD: Not LAPACK; [BDD11] randomized, but more flops; [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | | |
Cholesky | | |
LU | | |
QR | | |
Sym Eig, SVD | | |
Nonsym Eig | | |
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 |
Cholesky | ScaLAPACK | log P |
LU | ScaLAPACK | log P |
QR | ScaLAPACK | log P |
Sym Eig, SVD | ScaLAPACK | log P |
Nonsym Eig | ScaLAPACK | P^(1/2) · log P |
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | ScaLAPACK | log P | ( n · log P ) / P^(1/2)
QR | ScaLAPACK | log P | ( n · log P ) / P^(1/2)
Sym Eig, SVD | ScaLAPACK | log P | n / P^(1/2)
Nonsym Eig | ScaLAPACK | P^(1/2) · log P | n · log P
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors (better)
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX10] | log P | log P
QR | [DGHL08] | log P | log^3 P
Sym Eig, SVD | [BDD11] | log P | log^3 P
Nonsym Eig | [BDD11] | log P | log^3 P
Can we do even better?
• Assume n x n matrices on P processors
• Use c copies of data: M = O(c · n^2 / P) per processor
• Increasing M reduces lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2) · P^(1/2)) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [DS11, SBD11] | polylog P | polylog P
Cholesky | [SD11, in prog.] | polylog P | c^2 · polylog P – optimal!
LU | [DS11, SBD11] | polylog P | c^2 · polylog P – optimal!
QR | Via Cholesky QR | polylog P | c^2 · polylog P – optimal!
Sym Eig, SVD | | |
Nonsym Eig | | |
Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

[Figure: band matrix with half-bandwidth b, i.e. b+1 diagonals in each triangle]
Successive Band Reduction

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

[Figure (animation over several slides): an orthogonal Q1 zeroes a parallelogram of c columns and d diagonals at the top of the band (region 1); applying Q1^T on the other side creates a bulge of width d+c (region 2), which Q2, Q3, Q4, Q5 and their transposes chase down the band through regions 3–6 until it falls off the end.]

Only need to zero out leading parallelogram of each trapezoid
Conventional vs CA-SBR

Conventional: touch all data 4 times.
Communication-Avoiding: touch all data once. Many tuning parameters (number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge); right choices reduce #words_moved by factor M/bw, not just M^(1/2).
Lower bound for all "direct" linear algebra

• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g., A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g., Floyd-Warshall)
14
• Let M = "fast" memory size (per processor)

  #words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )

  #messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) )

• Parallel case: assume either load or memory balanced

SIAM SIAG/Linear Algebra Prize, 2012: Ballard, D., Holtz, Schwartz
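Both bounds are easy to evaluate for a concrete configuration; a small sketch (the numbers in the usage below are just an example):

```python
# Per-processor communication lower bounds from the slide above.
# M ** 0.5 * M is used instead of M ** 1.5 so perfect powers stay exact.
def comm_lower_bounds(flops_per_proc, M):
    words = flops_per_proc / M ** 0.5
    messages = flops_per_proc / (M * M ** 0.5)
    return words, messages
```

For dense n x n matmul on P processors with M = n^2/P and n^3/P flops per processor, these reduce to Ω(n^2 / P^(1/2)) words and Ω(P^(1/2)) messages, matching the SUMMA analysis.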
Can we attain these lower bounds?

• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply

15
16
Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

[Figure: C(i,j) += (row i of A) · (column j of B)]
17
Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  {read row i of A into fast memory}
  for j = 1 to n
    {read C(i,j) into fast memory}
    {read column j of B into fast memory}
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}

[Figure: C(i,j) += (row i of A) · (column j of B)]
18
Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  {read row i of A into fast memory}            … n^2 reads altogether
  for j = 1 to n
    {read C(i,j) into fast memory}              … n^2 reads altogether
    {read column j of B into fast memory}       … n^3 reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}          … n^2 writes altogether

[Figure: C(i,j) += (row i of A) · (column j of B)]

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic
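The pseudocode above, directly in Python (pure lists, no blocking), to make the 2n^3 flops concrete:

```python
# Naive triple-loop matmul: C = C + A*B. Without blocking, the column-j-of-B
# reads in the pseudocode cost O(n^3) words moved.
def naive_matmul(C, A, B):
    n = len(C)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```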
19
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}
    for k = 1 to n/b
      {read block A(i,k) into fast memory}
      {read block B(k,j) into fast memory}
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}

[Figure: block C(i,j) += A(i,k) · B(k,j), b-by-b blocks]
20
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}        … b^2 · (n/b)^2 = n^2 reads
    for k = 1 to n/b
      {read block A(i,k) into fast memory}      … b^2 · (n/b)^3 = n^3/b reads
      {read block B(k,j) into fast memory}      … b^2 · (n/b)^3 = n^3/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)         {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}    … b^2 · (n/b)^2 = n^2 writes

[Figure: block C(i,j) += A(i,k) · B(k,j), b-by-b blocks]

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
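The tiled loop nest transcribes to Python as follows (pure lists; `b` is the block size, and the `min()` calls just handle n not divisible by b):

```python
# Blocked (tiled) matmul: same arithmetic as the naive version, but the
# loops visit b-by-b blocks so that three blocks fit in fast memory at once.
def blocked_matmul(C, A, B, b):
    n = len(C)
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):
                # C(i,j) block += A(i,k) block * B(k,j) block
                for i in range(i0, min(i0 + b, n)):
                    for j in range(j0, min(j0 + b, n)):
                        s = 0.0
                        for k in range(k0, min(k0 + b, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] += s
    return C
```

Only the memory-traffic accounting changes versus the naive version; the arithmetic, and hence the result, is identical.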
Does blocked matmul attain lower bound?

• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 3^(1/2) · n^3 / M^(1/2) + 2n^2
• Attains lower bound = Ω( #flops / M^(1/2) )

• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?
21
Recursive Matrix Multiplication (RMM) (1/2)

• For simplicity: square matrices with n = 2^m

• C = [ C11 C12 ] = A · B = [ A11 A12 ] · [ B11 B12 ]
      [ C21 C22 ]           [ A21 A22 ]   [ B21 B22 ]

    = [ A11·B11 + A12·B21   A11·B12 + A12·B22 ]
      [ A21·B11 + A22·B21   A21·B12 + A22·B22 ]

• True when each Aij etc. is 1x1 or n/2 x n/2

func C = RMM (A, B, n)
  if n = 1
    C = A * B
  else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return
Recursive Matrix Multiplication (RMM) (2/2)
23
func C = RMM (A, B, n)
  if n = 1
    C = A * B
  else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return

A(n) = # arithmetic operations in RMM(·, ·, n)
     = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in different order

W(n) = # words moved between fast, slow memory by RMM(·, ·, n)
     = 8 · W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^(1/2) + n^2 ) … same as blocked matmul

"Cache oblivious": works for memory hierarchies, but not panacea
For big speedups, see SC12 poster on "Beating MKL and ScaLAPACK at Rectangular Matmul", 11/13 at 5:15-7pm
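RMM transcribes to Python almost verbatim; a sketch for n a power of 2, copying quadrants for clarity (an in-place index version would avoid the copies):

```python
# Recursive Matrix Multiplication: split into quadrants, recurse, and
# reassemble C from the four half-size products. Assumes n is a power of 2.
def rmm(A, B):
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(M, r, c):
        return [row[c:c + h] for row in M[r:r + h]]
    def add(X, Y):
        return [[X[i][j] + Y[i][j] for j in range(h)] for i in range(h)]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    C11 = add(rmm(A11, B11), rmm(A12, B21))
    C12 = add(rmm(A11, B12), rmm(A12, B22))
    C21 = add(rmm(A21, B11), rmm(A22, B21))
    C22 = add(rmm(A21, B12), rmm(A22, B22))
    return ([c1 + c2 for c1, c2 in zip(C11, C12)] +
            [c1 + c2 for c1, c2 in zip(C21, C22)])
```

Note there is no block-size parameter anywhere: the recursion shrinks the working set until it fits in whatever fast memory exists, which is the "cache oblivious" property.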
How hard is hand-tuning matmul anyway
24
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
25
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
27
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS
bull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions
layouts and Cannon may use more memory
28
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
kk
B(kj)
C(ij)
29
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
kk
B(kj)
C(ij)
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
Brow
Acol
30
SUMMA Communication Costs
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
deg Total words = log P n2 P12
deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)
deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12
31
Can we do better
bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P
ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)
bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E
needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T
that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
39
Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU
for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update
for i=1 to n-1 update column i update trailing matrix
Communication in sequentialOne-sided Factorizations (LU QR hellip)
bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)
40
bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Successive Band Reduction
[Figure sequence: a band of width b+1 is reduced by successive orthogonal transformations Q1, Q1^T, Q2, Q2^T, ..., Q5, Q5^T. Each Qi zeroes a (d+1)-by-c parallelogram of the band; applying Qi^T on the other side creates a (d+c)-by-(d+c) bulge, which the next transformation chases down the band.]
b = bandwidth, c = #columns, d = #diagonals. Constraint: c + d ≤ b
Only need to zero out the leading parallelogram of each trapezoid.
Conventional vs CA-SBR
Conventional: touch all data 4 times. Communication-avoiding: touch all data once.
Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge. Right choices reduce words_moved by a factor of M/bw, not just M^{1/2}.
Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply
15
16
Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
17
Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  {read row i of A into fast memory}
  for j = 1 to n
    {read C(i,j) into fast memory}
    {read column j of B into fast memory}
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
18
Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  {read row i of A into fast memory}            … n^2 reads altogether
  for j = 1 to n
    {read C(i,j) into fast memory}              … n^2 reads altogether
    {read column j of B into fast memory}       … n^3 reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}          … n^2 writes altogether

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic
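The triple loop above can be written down directly; this is a sketch in Python on plain lists of lists (the function name is ours, not from the slides):

```python
def naive_matmul(A, B, C):
    """C = C + A*B for square n-by-n matrices stored as lists of lists."""
    n = len(A)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```

In the memory-hierarchy model, each iteration of the j loop touches a full column of B, which is what drives the n^3 reads counted above.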
19
Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory
for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}
    for k = 1 to n/b
      {read block A(i,k) into fast memory}
      {read block B(k,j) into fast memory}
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}

[Figure: C(i,j) = C(i,j) + A(i,k) * B(k,j), b-by-b blocks]
20
Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory
for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}       … b^2 × (n/b)^2 = n^2 reads
    for k = 1 to n/b
      {read block A(i,k) into fast memory}     … b^2 × (n/b)^3 = n^3/b reads
      {read block B(k,j) into fast memory}     … b^2 × (n/b)^3 = n^3/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)        {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}   … b^2 × (n/b)^2 = n^2 writes

[Figure: C(i,j) = C(i,j) + A(i,k) * B(k,j), b-by-b blocks]

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
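The blocked loop nest can be sketched as follows (a Python sketch with names of our choosing; in the model, the three innermost loops only touch the three blocks currently held in fast memory):

```python
def blocked_matmul(A, B, C, b):
    """C = C + A*B, looping over b-by-b blocks; n must be a multiple of b."""
    n = len(A)
    assert n % b == 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # C(i,j) = C(i,j) + A(i,k) * B(k,j) on b-by-b blocks
                for ii in range(i, i + b):
                    for kk in range(k, k + b):
                        a = A[ii][kk]
                        for jj in range(j, j + b):
                            C[ii][jj] += a * B[kk][jj]
    return C
```

The arithmetic is identical to the naïve version; only the order of the loop iterations changes, which is what reduces the data movement.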
Does blocked matmul attain lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so reads/writes ≥ 3^{1/2} n^3 / M^{1/2} + 2n^2
• Attains lower bound = Ω( #flops / M^{1/2} )
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?
21
Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = [C11 C12; C21 C22] = A · B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
    = [A11·B11 + A12·B21   A11·B12 + A12·B22;
       A21·B11 + A22·B21   A21·B12 + A22·B22]
• True when each Aij etc. is 1x1 or n/2 x n/2

22

func C = RMM(A, B, n)
  if n = 1, C = A * B
  else
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  return
Recursive Matrix Multiplication (RMM) (2/2)

23

func C = RMM(A, B, n)
  if n = 1, C = A * B
  else
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  return

A(n) = # arithmetic operations in RMM(·, ·, n)
     = 8·A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in different order
W(n) = # words moved between fast and slow memory by RMM(·, ·, n)
     = 8·W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^{1/2} + n^2 ) … same as blocked matmul

"Cache oblivious": works for memory hierarchies, but not panacea
For big speedups, see SC12 poster on "Beating MKL and ScaLAPACK at Rectangular Matmul", 11/13 at 5:15-7pm
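The recursive scheme above can be sketched directly in Python (helper names `quad` and `add` are ours):

```python
def rmm(A, B):
    """Recursive matrix multiply for n-by-n matrices, n a power of 2."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(M, r, c):   # extract the h-by-h quadrant at row r, column c
        return [row[c:c + h] for row in M[r:r + h]]
    def add(X, Y):       # elementwise sum of two h-by-h matrices
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    C11 = add(rmm(A11, B11), rmm(A12, B21))
    C12 = add(rmm(A11, B12), rmm(A12, B22))
    C21 = add(rmm(A21, B11), rmm(A22, B21))
    C22 = add(rmm(A21, B12), rmm(A22, B22))
    return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
            [r1 + r2 for r1, r2 in zip(C21, C22)])
```

The recursion never mentions M: once subproblems fit in fast memory they stop generating traffic, which is the "cache oblivious" property.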
How hard is hand-tuning matmul, anyway?
24
• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results
How hard is hand-tuning matmul, anyway?
25
Parallel MatMul with 2D Processor Layout
• P processors in P^{1/2} x P^{1/2} grid
  – Processors communicate along rows, columns
• Each processor owns n/P^{1/2} x n/P^{1/2} submatrices of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij, and Cij

C = A * B

[Figure: three 4x4 processor grids (one each for C, A, and B), each laid out as
  P00 P01 P02 P03
  P10 P11 P12 P13
  P20 P21 P22 P23
  P30 P31 P32 P33 ]
27
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
  – Comes within factor of log P of lower bounds:
• Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
  • #words_moved = Ω( #flops / M^{1/2} ) = Ω( (n^3/P) / (n^2/P)^{1/2} ) = Ω( n^2 / P^{1/2} )
  • #messages = Ω( #flops / M^{3/2} ) = Ω( (n^3/P) / (n^2/P)^{3/2} ) = Ω( P^{1/2} )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn96.ps (also lawn100)
• Comparison to Cannon's Algorithm:
  – Cannon attains lower bound
  – But Cannon harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
28
SUMMA – n x n matmul on P^{1/2} x P^{1/2} grid
• C(i,j) is the n/P^{1/2} x n/P^{1/2} submatrix of C on processor Pij
• A(i,k) is an n/P^{1/2} x b submatrix of A
• B(k,j) is a b x n/P^{1/2} submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)*B(k,j)
  • summation over submatrices
• Need not be square processor grid

[Figure: C(i,j) accumulates the products of block column A(i,k) and block row B(k,j)]
29
SUMMA – n x n matmul on P^{1/2} x P^{1/2} grid

For k = 0 to n/b-1
  for all i = 1 to P^{1/2}
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^{1/2}
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

[Figure: each processor receives a block column Acol of A and a block row Brow of B]
30
SUMMA Communication Costs

For k = 0 to n/b-1
  for all i = 1 to P^{1/2}
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
      … #words = log P^{1/2} * b*n/P^{1/2}, #messages = log P^{1/2}
  for all j = 1 to P^{1/2}
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
      … same #words and #messages
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

• Total #words = log P * n^2 / P^{1/2}
  • Within factor of log P of lower bound
  • (more complicated implementation removes log P factor)
• Total #messages = log P * n/b
  • Choose b close to maximum, n/P^{1/2}, to approach lower bound P^{1/2}
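A serial simulation of the SUMMA loop structure can make the panel broadcasts concrete. This is a sketch of ours (no real message passing; the g-by-g "grid" and panel width b are simulated in one process, with function and parameter names of our choosing):

```python
def summa_sim(A, B, b, g):
    """Serial simulation of SUMMA on a g-by-g processor grid with panel
    width b.  At step k, an n/g-by-b panel of A is 'broadcast' along each
    processor row and a b-by-n/g panel of B down each processor column;
    every processor then does a rank-b update of its local block of C."""
    n = len(A)
    s = n // g                       # each processor owns an s-by-s block
    C = [[0.0] * n for _ in range(n)]
    for k in range(0, n, b):
        for pi in range(g):          # loop over the simulated grid
            for pj in range(g):
                Acol = [A[i][k:k + b] for i in range(pi * s, (pi + 1) * s)]
                Brow = [B[kk][pj * s:(pj + 1) * s] for kk in range(k, k + b)]
                for i in range(s):   # local rank-b update: C += Acol * Brow
                    for j in range(s):
                        C[pi * s + i][pj * s + j] += sum(
                            Acol[i][t] * Brow[t][j] for t in range(b))
    return C
```

Summing over all n/b panels reproduces the full product; the communication cost formulas above count the Acol and Brow transfers that the real algorithm performs.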
31
Can we do better?
• Lower bound assumed 1 copy of data: M = O(n^2/P) per proc.
• What if matrix small enough to fit c > 1 copies, so M = c*n^2/P?
  – #words_moved = Ω( #flops / M^{1/2} ) = Ω( n^2 / (c^{1/2} P^{1/2}) )
  – #messages = Ω( #flops / M^{3/2} ) = Ω( P^{1/2} / c^{3/2} )
• Can we attain new lower bound?
  – Special case: "3D Matmul": c = P^{1/3}
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^{1/3} x P^{1/3} x P^{1/3} grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^{1/3} x n/P^{1/3}
  – Not always that much memory available…
2.5D Matrix Multiplication
• Assume can fit c*n^2/P data per processor, c > 1
• Processors form (P/c)^{1/2} x (P/c)^{1/2} x c grid

[Figure: (P/c)^{1/2} x (P/c)^{1/2} x c processor grid. Example: P = 32, c = 2]
2.5D Matrix Multiplication
• Assume can fit c*n^2/P data per processor, c > 1
• Processors form (P/c)^{1/2} x (P/c)^{1/2} x c grid

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^{1/2} x n(c/P)^{1/2}

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along k-axis so that P(i,j,0) owns C(i,j)
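The bandwidth saving from keeping c copies can be read straight off the lower bounds; a small sketch (function names are ours):

```python
import math

def words_moved_2d(n, P):
    """Per-processor bandwidth lower bound with 1 copy: Omega(n^2 / P^(1/2))."""
    return n * n / math.sqrt(P)

def words_moved_25d(n, P, c):
    """With c copies, M = c*n^2/P: Omega(n^2 / (c^(1/2) * P^(1/2)))."""
    return n * n / math.sqrt(c * P)
```

So replicating the data c times cuts the per-processor bandwidth lower bound by a factor of c^{1/2}, at the price of c times the memory.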
2.5D Matmul on BG/P, 16K nodes / 64K cores
2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies
Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.
12x faster
2.7x faster
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: PM = 3n^2
• Increase P by a factor of c → total memory increases by a factor of c
• Notation for timing model:
  – γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) [ γ_T + β_T / M^{1/2} + α_T / (m·M^{1/2}) ] = T(P)/c
• Notation for energy model:
  – γ_E, β_E, α_E = joules for same operations
  – δ_E = joules per word of memory used per sec
  – ε_E = joules per sec for leakage, etc.
  – E(cP) = cP { n^3/(cP) [ γ_E + β_E / M^{1/2} + α_E / (m·M^{1/2}) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
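The timing model can be checked numerically; a sketch with parameter values of our choosing (M, the memory per processor, is held fixed as P grows, so T(cP) = T(P)/c by construction):

```python
def time_model(P, n, M, gamma_t, beta_t, alpha_t, m):
    """T(P) = n^3/P * (gamma_T + beta_T/M^(1/2) + alpha_T/(m*M^(1/2))),
    the per-processor runtime model from the slide."""
    return n**3 / P * (gamma_t + beta_t / M**0.5 + alpha_t / (m * M**0.5))
```

Doubling or quadrupling P while holding M, n, and the message size m fixed divides the modeled runtime by the same factor: perfect strong scaling.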
Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K
Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
39
Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimination: A = LU

for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) )   … scale a vector
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)   … rank-1 update

i.e., for i = 1 to n-1: update column i, then update trailing matrix
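The two-line loop translates directly to Python; this is a sketch without pivoting (so it assumes all A(i,i) are safely nonzero, e.g. a diagonally dominant matrix):

```python
def lu_nopivot(A):
    """In-place LU without pivoting: scale column i, then rank-1 update of
    the trailing matrix.  Afterwards the unit lower-triangular L (below the
    diagonal) and upper-triangular U (on and above it) overwrite A."""
    n = len(A)
    for i in range(n - 1):
        for r in range(i + 1, n):
            A[r][i] /= A[i][i]               # scale a vector
        for r in range(i + 1, n):            # rank-1 update
            for c in range(i + 1, n):
                A[r][c] -= A[r][i] * A[i][c]
    return A
```

For A = [[4, 1], [2, 3]] this stores L = [[1, 0], [0.5, 1]] and U = [[4, 1], [0, 2.5]] in place, and L·U recovers A.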
Communication in sequential One-sided Factorizations (LU, QR, …)
• Naive Approach:
  for i = 1 to n-1
    update column i
    update trailing matrix
  • #words_moved = O(n^3)

40

• Blocked Approach (LAPACK):
  for i = 1 to n/b - 1
    update block i of b columns
    update trailing matrix
  • #words_moved = O(n^3 / M^{1/3})
• Recursive Approach:
  func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)
  • #words_moved = O(n^3 / M^{1/2})
• None of these approaches
  • minimizes #messages
  • handles eig() or svd()
  • works in parallel
• Need more ideas!
TSQR: QR of a Tall, Skinny matrix

41

W = [ W0; W1; W2; W3 ] = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]

[ R00; R10 ] = Q01·R01 and [ R20; R30 ] = Q11·R11, so
[ R00; R10; R20; R30 ] = diag(Q01, Q11) · [ R01; R11 ]

[ R01; R11 ] = Q02·R02
TSQR: QR of a Tall, Skinny matrix

42

W = [ W0; W1; W2; W3 ] = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]
[ R00; R10; R20; R30 ] = diag(Q01, Q11) · [ R01; R11 ]
[ R01; R11 ] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
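The R-factor part of the reduction can be sketched in pure Python. This sketch uses classical Gram-Schmidt for the local QRs only to stay self-contained; real TSQR uses Householder QR for stability, and all names here are ours:

```python
import math

def qr_cgs(W):
    """QR via classical Gram-Schmidt (sketch only).  Returns (Q, R) with
    R upper triangular and positive diagonal; assumes full column rank."""
    m, n = len(W), len(W[0])
    Q = [[0.0] * n for _ in range(m)]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [W[i][j] for i in range(m)]
        for k in range(j):
            R[k][j] = sum(Q[i][k] * W[i][j] for i in range(m))
            for i in range(m):
                v[i] -= R[k][j] * Q[i][k]
        R[j][j] = math.sqrt(sum(x * x for x in v))
        for i in range(m):
            Q[i][j] = v[i] / R[j][j]
    return Q, R

def tsqr_R(W, p):
    """R factor via TSQR: QR each of p row blocks, then pairwise QR of
    stacked R factors up a binary tree.  Each block needs >= n rows."""
    m = len(W)
    Rs = [qr_cgs(W[i * m // p:(i + 1) * m // p])[1] for i in range(p)]
    while len(Rs) > 1:
        Rs = [qr_cgs(Ra + Rb)[1] for Ra, Rb in zip(Rs[::2], Rs[1::2])]
    return Rs[0]
```

Since the R factor with positive diagonal is unique for a full-rank matrix, the tree reduction produces (up to rounding) the same R as a direct QR of W, which makes the sketch easy to check.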
TSQR: An Architecture-Dependent Algorithm

[Figure: three reduction trees on W = [W0; W1; W2; W3].
 Parallel: binary tree, (R00, R10, R20, R30) → (R01, R11) → R02.
 Sequential: flat tree, R00 → R01 → R02 → R03, folding in one block at a time.
 Dual Core: hybrid, each core runs a flat tree on half the blocks, then the results are combined.]

Can choose reduction tree dynamically: Multicore? Multisocket? Multirack? Multisite? Out-of-core?
TSQR Performance Results
• Parallel
  – Intel Clovertown
    • Up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi
    • Up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
44
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Back to LU: Using similar idea for TSLU as TSQR: Use reduction tree, to do "Tournament Pivoting"

45

W_{n x b} = [ W1; W2; W3; W4 ] = [ P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4 ]
Choose b pivot rows of W1, call them W1'; likewise W2', W3', W4'

[ W1'; W2' ] = P12·L12·U12 → choose b pivot rows, call them W12'
[ W3'; W4' ] = P34·L34·U34 → choose b pivot rows, call them W34'

[ W12'; W34' ] = P1234·L1234·U1234 → choose b pivot rows

• Go back to W and use these b pivot rows
  • Move them to top, do LU without pivoting
  • Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
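The row-selection tournament can be sketched in Python. This is a sketch of ours, not the GDX/ScaLAPACK implementation: each block nominates b candidate rows via Gaussian elimination with partial pivoting, and candidates are merged pairwise up a binary tree:

```python
def pivot_rows(block, b):
    """Pick b pivot rows of a block by Gaussian elimination with partial
    pivoting, returning the chosen rows in pivot order."""
    M = [row[:] for row in block]        # work on a copy
    idx = list(range(len(M)))
    ncols = len(M[0])
    for j in range(min(b, ncols)):
        p = max(range(j, len(M)), key=lambda r: abs(M[r][j]))
        M[j], M[p] = M[p], M[j]          # bring pivot row to position j
        idx[j], idx[p] = idx[p], idx[j]
        if M[j][j] != 0.0:
            for r in range(j + 1, len(M)):
                f = M[r][j] / M[j][j]
                for k in range(j, ncols):
                    M[r][k] -= f * M[j][k]
    return [block[i] for i in idx[:b]]

def tournament_pivot(W, nblocks, b):
    """Tournament pivoting: each block nominates b candidate rows, then
    candidates are merged pairwise up a binary tree."""
    m = len(W)
    cand = [pivot_rows(W[i * m // nblocks:(i + 1) * m // nblocks], b)
            for i in range(nblocks)]
    while len(cand) > 1:
        cand = [pivot_rows(x + y, b) for x, y in zip(cand[::2], cand[1::2])]
    return cand[0]
```

A strongly dominant row survives every round of the tournament, which is the intuition behind the stability theorem above.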
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: heat map of predicted speedup; x-axis log2(p), y-axis log2(n^2/p) = log2(memory_per_proc)]

2.5D vs 2D LU, With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
• Algorithms shown minimizing #Messages use (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports), only some shown, plus ours
• Cache-oblivious are underlined, Green are ours, ? is unknown/future work

Algorithm: entries give #Words Moved and #Messages, for 2 Levels of Memory / Multiple Levels of Memory

BLAS-3 — 2 levels: usual blocked or recursive algorithms; multiple levels: usual blocked algorithms (nested) or recursive [Gustavson,97]
Cholesky — 2 levels: LAPACK (with b = M^{1/2}), [Gustavson,97], [BDHS09]; multiple levels: [Gustavson,97], [Ahmed,Pingali,00], [BDHS09] (←same) (←same)
LU with pivoting — 2 levels: LAPACK (sometimes), [Toledo,97], [GDX 08]; [GDX 08] partial pivoting; multiple levels: [Toledo,97], [GDX 08]; [GDX 08]
QR, Rank-revealing — 2 levels: LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08]; [Frens,Wise,03] but 3x flops, [DGHL08]; multiple levels: [Elmroth,Gustavson,98], [DGHL08]; [Frens,Wise,03], [DGHL08]
Eig, SVD — 2 levels: Not LAPACK; [BDD11] randomized, but more flops; [BDK11]; multiple levels: [BDD11, BDK11]; [BDD11, BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors
• Minimum Memory per processor: M = O(n^2/P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^{1/2} ) = Ω( n^2 / P^{1/2} )
  #messages    = Ω( (n^3/P) / M^{3/2} ) = Ω( P^{1/2} )
Algorithm / Reference / Factor exceeding lower bound for #words_moved / Factor exceeding lower bound for #messages
Matrix Multiply, Cholesky, LU, QR, Sym Eig & SVD, Nonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors (conventional approach)
• Minimum Memory per processor: M = O(n^2/P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^{1/2} ) = Ω( n^2 / P^{1/2} )
  #messages    = Ω( (n^3/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm         Reference      Factor exceeding lower bound for #words_moved
Matrix Multiply   [Cannon, 69]   1
Cholesky          ScaLAPACK      log P
LU                ScaLAPACK      log P
QR                ScaLAPACK      log P
Sym Eig, SVD      ScaLAPACK      log P
Nonsym Eig        ScaLAPACK      P^{1/2} log P
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors (conventional approach)
• Minimum Memory per processor: M = O(n^2/P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^{1/2} ) = Ω( n^2 / P^{1/2} )
  #messages    = Ω( (n^3/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm         Reference      #words_moved factor   #messages factor
Matrix Multiply   [Cannon, 69]   1                     1
Cholesky          ScaLAPACK      log P                 log P
LU                ScaLAPACK      log P                 n log P / P^{1/2}
QR                ScaLAPACK      log P                 n log P / P^{1/2}
Sym Eig, SVD      ScaLAPACK      log P                 n / P^{1/2}
Nonsym Eig        ScaLAPACK      P^{1/2} log P         n log P
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors (better)
• Minimum Memory per processor: M = O(n^2/P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^{1/2} ) = Ω( n^2 / P^{1/2} )
  #messages    = Ω( (n^3/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm         Reference      #words_moved factor   #messages factor
Matrix Multiply   [Cannon, 69]   1                     1
Cholesky          ScaLAPACK      log P                 log P
LU                [GDX10]        log P                 log P
QR                [DGHL08]       log P                 log^3 P
Sym Eig, SVD      [BDD11]        log P                 log^3 P
Nonsym Eig        [BDD11]        log P                 log^3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
16
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n for j = 1 to n
for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)
= + C(ij) A(i)
B(j)C(ij)
17
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory
= + C(ij) A(i)
B(j)C(ij)
18
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether
= + C(ij) A(i)
B(j)C(ij)
n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic
19
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory
= + C(ij) C(ij) A(ik)
B(kj)b-by-bblock
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= + C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 312n3M12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
22
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
23
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm
How hard is hand-tuning matmul anyway
24
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
25
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
27
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS
bull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions
layouts and Cannon may use more memory
28
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
kk
B(kj)
C(ij)
29
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
kk
B(kj)
C(ij)
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
Brow
Acol
30
SUMMA Communication Costs
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
deg Total words = log P n2 P12
deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)
deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12
31
Can we do better
bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P
ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)
bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E
needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T
that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
39
Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU
for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update
for i=1 to n-1 update column i update trailing matrix
Communication in sequentialOne-sided Factorizations (LU QR hellip)
bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)
40
bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU: Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"

W (n x b) = [ W1; W2; W3; W4 ]

Step 1: factor each block with partial pivoting:
  W1 = P1·L1·U1, choose b pivot rows of W1, call them W1'
  W2 = P2·L2·U2, choose b pivot rows of W2, call them W2'
  W3 = P3·L3·U3, choose b pivot rows of W3, call them W3'
  W4 = P4·L4·U4, choose b pivot rows of W4, call them W4'
Step 2: combine pairwise:
  [ W1'; W2' ] = P12·L12·U12, choose b pivot rows, call them W12'
  [ W3'; W4' ] = P34·L34·U34, choose b pivot rows, call them W34'
Step 3: [ W12'; W34' ] = P1234·L1234·U1234, choose b pivot rows

- Go back to W and use these b pivot rows
- Move them to top, do LU without pivoting
- Extra work, but lower order term
- Thm: As numerically stable as Partial Pivoting on a larger matrix
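A toy numpy/scipy sketch of the selection step: each node of the tree runs LU with partial pivoting on its (stacked) candidate rows and passes b winners upward. The helper names and the fixed 4-block split are illustrative, not the library implementation:

```python
import numpy as np
from scipy.linalg import lu

def select_pivot_rows(W, b):
    # LU with partial pivoting on one block; with W = P·L·U, the first b
    # rows of P^T·W are the b pivot rows, in pivot order.
    P, L, U = lu(W)
    return (P.T @ W)[:b]

def tournament_pivot_rows(W, b):
    """Binary-tree tournament pivoting on a tall-skinny W (4 row blocks)."""
    W1, W2, W3, W4 = np.array_split(W, 4, axis=0)
    W1p, W2p, W3p, W4p = (select_pivot_rows(Wi, b) for Wi in (W1, W2, W3, W4))
    W12p = select_pivot_rows(np.vstack([W1p, W2p]), b)    # combine pairwise
    W34p = select_pivot_rows(np.vstack([W3p, W4p]), b)
    return select_pivot_rows(np.vstack([W12p, W34p]), b)  # root of the tree

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 4))
pivots = tournament_pivot_rows(W, 4)
```

The b selected rows would then be moved to the top of W, after which LU proceeds without further pivoting.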
Exascale Machine Parameters (Source: DOE Exascale Workshop)
- 2^20 ≈ 1,000,000 nodes
- 1024 cores/node (a billion cores!)
- 100 GB/sec interconnect bandwidth
- 400 GB/sec DRAM bandwidth
- 1 microsec interconnect latency
- 50 nanosec memory latency
- 32 Petabytes of memory
- 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Heat map: x-axis log2(p); y-axis log2(n^2/p) = log2(memory_per_proc)]
2.5D vs 2D LU, With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
- Algorithms shown minimizing #Messages use (recursive) block layout
  - Not possible with columnwise or rowwise layouts
- Many references (see reports), only some shown, plus ours
- Cache-oblivious are underlined, Green are ours, ? is unknown/future work

Algorithm | 2 Levels of Memory (#words moved / #messages) | Multiple Levels of Memory (#words moved / #messages)
BLAS-3 | Usual blocked or recursive algorithms | Usual blocked algorithms (nested) or recursive [Gustavson,97]
Cholesky | LAPACK (with b = M^1/2), [Gustavson,97], [BDHS09] / [Gustavson,97], [Ahmed,Pingali,00], [BDHS09] | (←same) / (←same)
LU with pivoting | LAPACK (sometimes), [Toledo,97], [GDX 08] / [GDX 08] partial pivoting | [Toledo,97], [GDX 08] / [GDX 08]
QR, rank-revealing | LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08] / [Frens,Wise,03] but 3x flops, [DGHL08] | [Elmroth,Gustavson,98], [DGHL08] / [Frens,Wise,03], [DGHL08]
Eig, SVD | Not LAPACK; [BDD11] randomized, but more flops; [BDK11] | [BDD11, BDK11] / [BDD11, BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
- Assume n x n matrices on P processors (conventional approach)
- Minimum Memory per processor: M = O(n^2 / P)
- Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^1/2 ) = Ω( n^2 / P^1/2 )
  #messages    = Ω( (n^3/P) / M^3/2 ) = Ω( P^1/2 )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | ScaLAPACK | log P | (n log P) / P^1/2
QR | ScaLAPACK | log P | (n log P) / P^1/2
Sym Eig, SVD | ScaLAPACK | log P | n / P^1/2
Nonsym Eig | ScaLAPACK | P^1/2 log P | n log P
Summary of dense parallel algorithms attaining communication lower bounds
- Assume n x n matrices on P processors (better)
- Minimum Memory per processor: M = O(n^2 / P)
- Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^1/2 ) = Ω( n^2 / P^1/2 )
  #messages    = Ω( (n^3/P) / M^3/2 ) = Ω( P^1/2 )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX10] | log P | log P
QR | [DGHL08] | log P | log^3 P
Sym Eig, SVD | [BDD11] | log P | log^3 P
Nonsym Eig | [BDD11] | log P | log^3 P
Can we do even better?
- Assume n x n matrices on P processors
- Use c copies of data: M = O(c·n^2 / P) per processor
- Increasing M reduces lower bounds:
  #words_moved = Ω( (n^3/P) / M^1/2 ) = Ω( n^2 / (c^1/2 · P^1/2) )
  #messages    = Ω( (n^3/P) / M^3/2 ) = Ω( P^1/2 / c^3/2 )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [DS11, SBD11] | polylog P | polylog P
Cholesky | [SD11, in prog.] | polylog P | c^2 polylog P – optimal!
LU | [DS11, SBD11] | polylog P | c^2 polylog P – optimal!
QR | Via Cholesky QR | polylog P | c^2 polylog P – optimal!
Sym Eig, SVD | ? | |
Nonsym Eig | ? | |
Symmetric Band Reduction
- Grey Ballard and Nick Knight
- Compute Q·A·Q^T = T, where
  - A = A^T is banded (bandwidth b+1)
  - T is tridiagonal
  - Similar idea for SVD of a band matrix
- Use alone, or as second phase when A is dense:
  - Dense → Banded → Tridiagonal
- Implemented in LAPACK's sytrd
- Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  - It can communicate even less!
Successive Band Reduction

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

[Figure sequence: orthogonal transformations Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T successively eliminate c-wide parallelograms 1–6 from the band (the labels b+1, d+1, d+c mark the band width and bulge sizes); each bulge created by an update is chased down the band.]

Only need to zero out leading parallelogram of each trapezoid!
Conventional vs CA-SBR
- Conventional: touch all data 4 times
- Communication-Avoiding: touch all data once
- Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge
- Right choices reduce #words_moved by factor M/bw, not just M^1/2
Naïve Matrix Multiply

{implements C = C + A·B}
for i = 1 to n
  {read row i of A into fast memory}          … n^2 reads altogether
  for j = 1 to n
    {read C(i,j) into fast memory}            … n^2 reads altogether
    {read column j of B into fast memory}     … n^3 reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}        … n^2 writes altogether

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic
Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}      … b^2 × (n/b)^2 = n^2 reads
    for k = 1 to n/b
      {read block A(i,k) into fast memory}    … b^2 × (n/b)^3 = n^3/b reads
      {read block B(k,j) into fast memory}    … b^2 × (n/b)^3 = n^3/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)       {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}  … b^2 × (n/b)^2 = n^2 writes

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
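The blocked loop above is directly runnable; this sketch also counts slow-memory traffic so the 2n^3/b + 2n^2 tally can be checked (the word counter is illustrative bookkeeping, not a performance model):

```python
import numpy as np

def blocked_matmul(A, B, C, b):
    """C += A·B with b-by-b blocks; returns #words moved to/from slow
    memory under the model that each block read/write costs b^2 words."""
    n = A.shape[0]
    assert n % b == 0
    words = 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            Cij = C[i:i+b, j:j+b].copy()   # read block C(i,j): b^2 words
            words += b * b
            for k in range(0, n, b):
                Aik = A[i:i+b, k:k+b]      # read block A(i,k): b^2 words
                Bkj = B[k:k+b, j:j+b]      # read block B(k,j): b^2 words
                words += 2 * b * b
                Cij += Aik @ Bkj           # b-by-b matmul "in fast memory"
            C[i:i+b, j:j+b] = Cij          # write block C(i,j): b^2 words
            words += b * b
    return words

rng = np.random.default_rng(0)
n, b = 64, 8
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
C = np.zeros((n, n))
words = blocked_matmul(A, B, C, b)
```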
Does blocked matmul attain lower bound?
- Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
- Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 3^1/2 · n^3/M^1/2 + 2n^2
- Attains lower bound = Ω( #flops / M^1/2 )
- But what if we don't know M?
- Or if there are multiple levels of fast memory?
- How do we write the algorithm?
Recursive Matrix Multiplication (RMM) (1/2)
- For simplicity: square matrices with n = 2^m

C = [ C11 C12; C21 C22 ] = A·B = [ A11 A12; A21 A22 ] · [ B11 B12; B21 B22 ]
  = [ A11·B11 + A12·B21   A11·B12 + A12·B22;
      A21·B11 + A22·B21   A21·B12 + A22·B22 ]

- True when each Aij etc. is 1x1, or n/2 x n/2

func C = RMM(A, B, n)
  if n = 1, C = A * B
  else
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  return
Recursive Matrix Multiplication (RMM) (2/2)

A(n) = # arithmetic operations in RMM(·, ·, n)
     = 8·A(n/2) + 4·(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in different order

W(n) = # words moved between fast, slow memory by RMM(·, ·, n)
     = 8·W(n/2) + 12·(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^1/2 + n^2 ) … same as blocked matmul

"Cache oblivious": works for memory hierarchies, but not panacea

For big speedups, see SC12 poster on "Beating MKL and ScaLAPACK at Rectangular Matmul", 11/13 at 5:15-7pm
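The same pseudocode as runnable numpy; the only liberty taken is a base case larger than 1x1, which a real cache-oblivious code would also use:

```python
import numpy as np

def rmm(A, B):
    """Recursive matrix multiply; n assumed a power of 2."""
    n = A.shape[0]
    if n <= 16:                      # base case (a real code tunes this)
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    C = np.empty((n, n))
    C[:h, :h] = rmm(A11, B11) + rmm(A12, B21)
    C[:h, h:] = rmm(A11, B12) + rmm(A12, B22)
    C[h:, :h] = rmm(A21, B11) + rmm(A22, B21)
    C[h:, h:] = rmm(A21, B12) + rmm(A22, B22)
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
C = rmm(A, B)
```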
How hard is hand-tuning matmul, anyway?
- Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
- Students given "blocked" code to start with (7x faster than naïve)
  - Still hard to get close to vendor tuned performance (ACML) (another 6x)
- For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
[Two performance plots of the student submissions]
Parallel MatMul with 2D Processor Layout
- P processors in P^1/2 x P^1/2 grid
  - Processors communicate along rows, columns
- Each processor owns n/P^1/2 x n/P^1/2 submatrices of A, B, C
- Example: P = 16 processors, numbered from P00 to P33
  - Processor Pij owns submatrices Aij, Bij, and Cij

C = A * B
[Three 4x4 grids of processors P00 … P33, one each for C, A, and B]
SUMMA Algorithm
- SUMMA = Scalable Universal Matrix Multiply
  - Comes within factor of log P of lower bounds:
    - Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
    - #words_moved = Ω( #flops / M^1/2 ) = Ω( (n^3/P) / (n^2/P)^1/2 ) = Ω( n^2 / P^1/2 )
    - #messages   = Ω( #flops / M^3/2 ) = Ω( (n^3/P) / (n^2/P)^3/2 ) = Ω( P^1/2 )
  - Can accommodate any processor grid, matrix dimensions & layout
  - Used in practice in PBLAS = Parallel BLAS
    - www.netlib.org/lapack/lawns/lawn96.ps, lawn100.ps
- Comparison to Cannon's Algorithm:
  - Cannon attains lower bound
  - But Cannon harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
SUMMA – n x n matmul on P^1/2 x P^1/2 grid
- C(i,j) is n/P^1/2 x n/P^1/2 submatrix of C on processor Pij
- A(i,k) is n/P^1/2 x b submatrix of A
- B(k,j) is b x n/P^1/2 submatrix of B
- C(i,j) = C(i,j) + Σk A(i,k)·B(k,j)
  - summation over submatrices
  - Need not be square processor grid
[Figure: panel A(i,k) from block row i times panel B(k,j) from block column j updates C(i,j)]

For k = 0 to n/b - 1
  for all i = 1 to P^1/2
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^1/2
    owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow
SUMMA Communication Costs

For k = 0 to n/b - 1
  for all i = 1 to P^1/2
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
      … #words = log P^1/2 · b·n/P^1/2, #messages = log P^1/2
  for all j = 1 to P^1/2
    owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
      … same #words and #messages
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

- Total #words = log P · n^2 / P^1/2
  - Within factor of log P of lower bound
  - (more complicated implementation removes log P factor)
- Total #messages = log P · n/b
  - Choose b close to maximum, n/P^1/2, to approach lower bound P^1/2
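A serial, global view of SUMMA's outer-product structure; the array slices stand in for the row and column broadcasts (no actual processor grid is simulated here):

```python
import numpy as np

def summa_serial(A, B, b):
    """Global view of SUMMA: n/b rank-b updates of C.

    In the parallel algorithm, each step broadcasts the b-wide panel
    A(:,k) along processor rows and B(k,:) along processor columns
    before the local update; here the slices stand in for those
    broadcasts."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for k in range(0, n, b):
        Acol = A[:, k:k+b]      # panel broadcast along processor rows
        Brow = B[k:k+b, :]      # panel broadcast along processor columns
        C += Acol @ Brow        # every processor's local C update
    return C

rng = np.random.default_rng(0)
n, b = 32, 4
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
C = summa_serial(A, B, b)
```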
Can we do better?
- Lower bound assumed 1 copy of data: M = O(n^2/P) per proc
- What if matrix small enough to fit c > 1 copies, so M = c·n^2/P?
  - #words_moved = Ω( #flops / M^1/2 ) = Ω( n^2 / (c^1/2 · P^1/2) )
  - #messages   = Ω( #flops / M^3/2 ) = Ω( P^1/2 / c^3/2 )
- Can we attain new lower bound?
  - Special case: "3D Matmul", c = P^1/3
    - Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    - Processors arranged in P^1/3 x P^1/3 x P^1/3 grid
    - Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)·B(k,j), where each submatrix is n/P^1/3 x n/P^1/3
  - Not always that much memory available…
2.5D Matrix Multiplication
- Assume can fit c·n^2/P data per processor, c > 1
- Processors form (P/c)^1/2 x (P/c)^1/2 x c grid
[Figure: (P/c)^1/2 x (P/c)^1/2 x c processor grid; example: P = 32, c = 2]
2.5D Matrix Multiplication
- Assume can fit c·n^2/P data per processor, c > 1
- Processors form (P/c)^1/2 x (P/c)^1/2 x c grid

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^1/2 x n·(c/P)^1/2

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)·B(m,j) along k-axis, so P(i,j,0) owns C(i,j)
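The three steps, simulated serially: the summation index is split into c pieces (one per grid level), each level forms its partial product, and the final sum plays the role of the sum-reduce along the k-axis:

```python
import numpy as np

def matmul_25d_serial(A, B, c):
    """Serial model of 2.5D matmul: level k of the processor grid computes
    1/c-th of the summation Σm A(:,m)·B(m,:); summing the partial
    products plays the role of the sum-reduce along the k-axis."""
    n = A.shape[0]
    pieces = np.array_split(np.arange(n), c)
    partials = [A[:, idx] @ B[idx, :] for idx in pieces]   # one per level
    return sum(partials)                                   # "sum-reduce"

rng = np.random.default_rng(1)
n, c = 32, 2
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
C = matmul_25d_serial(A, B, c)
```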
2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies
[Performance plot; annotations: 12x faster, 2.7x faster]
Distinguished Paper Award, EuroPar'11 (SC'11 paper by Solomonik, Bhatele, D.)
Perfect Strong Scaling – in Time and Energy (1/2)
- Every time you add a processor, you should use its memory M too
- Start with minimal number of procs: P·M = 3n^2
- Increase P by a factor of c → total memory increases by a factor of c
- Notation for timing model:
  - γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
  - T(cP) = n^3/(cP) · [ γ_T + β_T/M^1/2 + α_T/(m·M^1/2) ] = T(P)/c
- Notation for energy model:
  - γ_E, β_E, α_E = joules for same operations
  - δ_E = joules per word of memory used per sec
  - ε_E = joules per sec for leakage, etc.
  - E(cP) = cP · { n^3/(cP) · [ γ_E + β_E/M^1/2 + α_E/(m·M^1/2) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
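A quick numeric check of the claim T(cP) = T(P)/c when the per-processor memory M and message size m are held fixed (the parameter values below are arbitrary placeholders, not measured machine constants):

```python
def time_model(P, n, M, m, gamma_T=1e-9, beta_T=1e-8, alpha_T=1e-6):
    """T(P) = n^3/P * [ gamma_T + beta_T/M^(1/2) + alpha_T/(m*M^(1/2)) ]."""
    return (n**3 / P) * (gamma_T + beta_T / M**0.5 + alpha_T / (m * M**0.5))

n, M, m = 2**12, 2**20, 2**10        # placeholder problem and machine sizes
P = 3 * n**2 / M                     # minimal processor count: P*M = 3n^2
c = 4                                # add processors (and their memory)
t_base = time_model(P, n, M, m)
t_scaled = time_model(c * P, n, M, m)
```

Holding M fixed keeps every bracketed per-word and per-message cost constant, so the n^3/(cP) prefactor delivers the full factor-of-c speedup.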
Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K
Perfect Strong Scaling – in Time and Energy (2/2)
- Perfect scaling extends to N-body, Strassen, …
- We can use these models to answer many questions, including:
  - What is the minimum energy required for a computation?
  - Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  - Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  - The ratio P = E/T gives us the average power required to run the algorithm; can we minimize the average power consumed?
  - Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimination, A = LU

for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) )                      … scale a vector
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)   … rank-1 update

(i.e., for i = 1 to n-1: update column i, then update trailing matrix)
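The two-line update as runnable numpy; since there is no pivoting, the demo uses a diagonally dominant matrix to stay stable:

```python
import numpy as np

def lu_no_pivot(A):
    """Naive Gaussian elimination; returns the combined factors in one
    array: unit-diagonal L strictly below the diagonal, U on and above."""
    A = A.copy()
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] *= 1.0 / A[i, i]                          # scale a vector
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])    # rank-1 update
    return A

rng = np.random.default_rng(2)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant: safe without pivoting
F = lu_no_pivot(A)
L = np.tril(F, -1) + np.eye(n)
U = np.triu(F)
```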
Communication in sequential One-sided Factorizations (LU, QR, …)

- Naive Approach:
    for i = 1 to n-1
      update column i
      update trailing matrix
  #words_moved = O(n^3)
- Blocked Approach (LAPACK):
    for i = 1 to n/b - 1
      update block i of b columns
      update trailing matrix
  #words_moved = O(n^3 / M^1/3)
- Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #words_moved = O(n^3 / M^1/2)
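The recursive approach, sketched for LU without pivoting: factor the left half of the columns, update the right half (a triangular solve plus a Schur complement), then recurse on the right half. Pivoting is omitted, so this is a structural sketch, not the stable algorithm:

```python
import numpy as np
from scipy.linalg import solve_triangular

def rlu(A):
    """Recursive LU without pivoting on a (tall) panel; returns (L, U)
    with A = L·U, L unit lower trapezoidal and U upper triangular."""
    m, n = A.shape
    if n == 1:
        return A / A[0, 0], np.array([[A[0, 0]]])
    h = n // 2
    L1, U11 = rlu(A[:, :h])                 # factor(left half of A)
    # update right half of A: triangular solve, then Schur complement
    U12 = solve_triangular(L1[:h], A[:h, h:], lower=True, unit_diagonal=True)
    S = A[h:, h:] - L1[h:] @ U12
    L2, U22 = rlu(S)                        # factor(right half of A)
    L = np.block([[L1[:h], np.zeros((h, n - h))], [L1[h:], L2]])
    U = np.block([[U11, U12], [np.zeros((n - h, h)), U22]])
    return L, U

rng = np.random.default_rng(5)
n = 32
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant, no pivoting needed
L, U = rlu(A)
```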
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
18
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether
= + C(ij) A(i)
B(j)C(ij)
n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic
19
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory
= + C(ij) C(ij) A(ik)
B(kj)b-by-bblock
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= + C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 312n3M12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
22
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
23
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm
How hard is hand-tuning matmul anyway
24
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
25
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
27
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS
bull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions
layouts and Cannon may use more memory
28
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
kk
B(kj)
C(ij)
29
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
kk
B(kj)
C(ij)
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
Brow
Acol
30
SUMMA Communication Costs
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
deg Total words = log P n2 P12
deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)
deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12
31
Can we do better?
• Lower bound assumed 1 copy of data: M = O(n²/P) per proc
• What if the matrix is small enough to fit c > 1 copies, so M = c·n²/P?
  – #words_moved = Ω( #flops / M^{1/2} ) = Ω( n² / (c^{1/2} · P^{1/2}) )
  – #messages = Ω( #flops / M^{3/2} ) = Ω( P^{1/2} / c^{3/2} )
• Can we attain the new lower bound?
  – Special case: "3D Matmul": c = P^{1/3}
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^{1/3} x P^{1/3} x P^{1/3} grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)·B(k,j), where each submatrix is n/P^{1/3} x n/P^{1/3}
  – Not always that much memory available…
2.5D Matrix Multiplication
• Assume can fit c·n²/P data per processor, c > 1
• Processors form a (P/c)^{1/2} x (P/c)^{1/2} x c grid
[Figure: (P/c)^{1/2} x (P/c)^{1/2} x c processor grid; example: P = 32, c = 2]
2.5D Matrix Multiplication
• Assume can fit c·n²/P data per processor, c > 1
• Processors form a (P/c)^{1/2} x (P/c)^{1/2} x c grid (axes i, j, k)

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^{1/2} x n·(c/P)^{1/2}

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σ_m A(i,m)·B(m,j) along the k-axis, so P(i,j,0) owns C(i,j)
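Steps (1)–(3) can be simulated one layer at a time; the sketch below (illustrative, with our own function name) splits the k-range into c chunks so that "layer" k computes its 1/c-th of the products, and then sum-reduces the layers as in step (3).

```python
def matmul_25d_serial(A, B, c):
    """Serial sketch of the 2.5D schedule.  The k-range is split into c
    chunks; grid layer k computes the partial product over its chunk
    (its 1/c-th of SUMMA), and the layers' partial sums are then
    reduced along the k-axis."""
    n = len(A)
    chunk = (n + c - 1) // c
    layers = []
    for layer in range(c):                   # each layer does 1/c of the work
        lo, hi = layer * chunk, min((layer + 1) * chunk, n)
        layers.append([[sum(A[i][k] * B[k][j] for k in range(lo, hi))
                        for j in range(n)] for i in range(n)])
    # sum-reduce along the k-axis of the processor grid
    return [[sum(L[i][j] for L in layers) for j in range(n)]
            for i in range(n)]
```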
2.5D Matmul on BG/P, 16K nodes / 64K cores (c = 16 copies)
Distinguished Paper Award, EuroPar'11 (SC'11 paper by Solomonik, Bhatele, D.)
[Plot annotations: 12x faster; 2.7x faster]
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n²
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
• T(cP) = n³/(cP) · [ γ_T + β_T/M^{1/2} + α_T/(m·M^{1/2}) ] = T(P)/c
• Notation for energy model:
  – γ_E, β_E, α_E = joules for same operations
  – δ_E = joules per word of memory used per sec
  – ε_E = joules per sec for leakage, etc.
• E(cP) = cP · { n³/(cP) · [ γ_E + β_E/M^{1/2} + α_E/(m·M^{1/2}) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
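The timing formula is easy to sanity-check numerically. The helper below (our own naming; the constants passed in are arbitrary illustrative values) evaluates the slide's model and exhibits the perfect-strong-scaling identity T(cP) = T(P)/c when M per processor is held fixed.

```python
def time_model(n, P, M, gamma_t, beta_t, alpha_t, m):
    """T(P) = (n^3/P) * (gamma_T + beta_T/M^{1/2} + alpha_T/(m*M^{1/2})),
    the per-processor time model from the slide, where M is the fast
    memory per processor and m is the message size."""
    return (n ** 3 / P) * (gamma_t + beta_t / M ** 0.5
                           + alpha_t / (m * M ** 0.5))
```

Doubling P while keeping M fixed halves the modeled time, which is exactly the slide's T(cP) = T(P)/c.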
Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K
Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm; can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors, and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
39
Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimination: A = LU

for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) · ( 1 / A(i,i) )   … scale a vector
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) · A(i, i+1:n)   … rank-1 update

Equivalently:
for i = 1 to n-1
  update column i
  update trailing matrix
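The two-line loop above runs as written; a minimal in-place Python rendering (our own naming, no pivoting, so nonzero diagonal entries are assumed):

```python
def lu_no_pivot(A):
    """Naive Gaussian elimination, in place on a list-of-lists: scale the
    subdiagonal of column i, then rank-1 update the trailing matrix.
    On return the strict lower triangle holds L, the upper triangle U."""
    n = len(A)
    for i in range(n - 1):
        for r in range(i + 1, n):            # scale a vector
            A[r][i] /= A[i][i]
        for r in range(i + 1, n):            # rank-1 update of trailing matrix
            for c in range(i + 1, n):
                A[r][c] -= A[r][i] * A[i][c]
    return A
```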
Communication in sequential One-sided Factorizations (LU, QR, …)
40
• Naive Approach:
    for i = 1 to n-1
      update column i
      update trailing matrix
  #words_moved = O(n³)
• Blocked Approach (LAPACK):
    for i = 1 to n/b - 1
      update block i of b columns
      update trailing matrix
  #words_moved = O(n³ / M^{1/3})
• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #words_moved = O(n³ / M^{1/2})
• None of these approaches
  • minimizes #messages
  • handles eig() or svd()
  • works in parallel
• Need more ideas!
TSQR QR of a Tall Skinny matrix
41
W = [ W0 ; W1 ; W2 ; W3 ] = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00 ; R10 ; R20 ; R30 ]

[ R00 ; R10 ] = Q01·R01 and [ R20 ; R30 ] = Q11·R11, so
[ R00 ; R10 ; R20 ; R30 ] = diag(Q01, Q11) · [ R01 ; R11 ]

[ R01 ; R11 ] = Q02·R02
TSQR: QR of a Tall, Skinny matrix
42
(The same reduction, carried to completion: W → [R00; R10; R20; R30] → [R01; R11] → R02)

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
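A runnable sketch of the reduction (function names `qr_mgs` and `tsqr_r` are ours, and modified Gram-Schmidt stands in for the Householder QR a real implementation would use; for brevity it does one combine step, a flat tree, rather than the binary tree shown): each block gets a local QR, the R factors are stacked, and one more QR of the stack yields the final R.

```python
def qr_mgs(W):
    """Modified Gram-Schmidt QR of a tall-skinny matrix (list of rows).
    Returns (Q, R) with R upper triangular and positive diagonal."""
    m, n = len(W), len(W[0])
    Q = [row[:] for row in W]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = sum(Q[i][j] ** 2 for i in range(m)) ** 0.5
        for i in range(m):
            Q[i][j] /= R[j][j]
        for k in range(j + 1, n):
            R[j][k] = sum(Q[i][j] * Q[i][k] for i in range(m))
            for i in range(m):
                Q[i][k] -= R[j][k] * Q[i][j]
    return Q, R

def tsqr_r(blocks):
    """One TSQR reduction (flat tree): local QR per block Wi -> Ri,
    stack the Ri, then QR the stack; returns the final R factor of W."""
    Rs = [qr_mgs(Wb)[1] for Wb in blocks]
    stacked = [row for R in Rs for row in R]
    return qr_mgs(stacked)[1]
```

Because both QRs normalize with positive diagonals, the R produced by the tree matches the R of a direct QR of the stacked W up to rounding, which is what makes the reduction exact.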
TSQR: An Architecture-Dependent Algorithm

Parallel (binary tree): W = [W0; W1; W2; W3] → (R00, R10, R20, R30) → (R01, R11) → R02
Sequential (flat tree): W0 → R00; [R00; W1] → R01; [R01; W2] → R02; [R02; W3] → R03
Dual core: a hybrid of the parallel and sequential trees

Can choose reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core
TSQR Performance Results
• Parallel
  – Intel Clovertown
    • Up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi
    • Up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
44
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Back to LU: Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"
45

W (n x b) = [ W1 ; W2 ; W3 ; W4 ]
Factor each block: Wi = Pi·Li·Ui; choose b pivot rows of each Wi, call them Wi'
[ W1' ; W2' ] = P12·L12·U12 → choose b pivot rows, call them W12'
[ W3' ; W4' ] = P34·L34·U34 → choose b pivot rows, call them W34'
[ W12' ; W34' ] = P1234·L1234·U1234 → choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
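A small sketch of the tournament (function names are ours; for clarity it uses a flat tree with a single combine step rather than the binary tree shown, and GEPP-style selection on each block):

```python
def pivot_rows(W, b):
    """Indices (into W) of the b pivot rows that LU with partial
    pivoting selects on the block W (m x b, m >= b)."""
    A = [row[:] for row in W]
    idx = list(range(len(A)))
    for j in range(b):
        p = max(range(j, len(A)), key=lambda r: abs(A[r][j]))
        A[j], A[p] = A[p], A[j]              # swap pivot row to position j
        idx[j], idx[p] = idx[p], idx[j]
        for r in range(j + 1, len(A)):       # eliminate below the pivot
            if A[j][j] != 0.0:
                f = A[r][j] / A[j][j]
                for c in range(j, b):
                    A[r][c] -= f * A[j][c]
    return idx[:b]

def tournament_pivot(W, b, nblocks):
    """Tournament pivoting, flat tree: select b candidate pivot rows in
    each of nblocks row-blocks of W, gather the winners, then run one
    more selection on the stack; returns b global row indices."""
    blk = len(W) // nblocks
    candidates = []
    for t in range(nblocks):
        rows = list(range(t * blk, (t + 1) * blk))
        for i in pivot_rows([W[r] for r in rows], b):
            candidates.append(rows[i])
    final = pivot_rows([W[r] for r in candidates], b)
    return [candidates[i] for i in final]
```

The winners of each local contest are the only rows that travel up the tree, which is what turns pivot selection into a reduction.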
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Plot: x-axis log2(p); y-axis log2(n²/p) = log2(memory_per_proc)]
2.5D vs 2D LU, With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
• Algorithms shown minimizing #Messages use (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports), only some shown, plus ours
  • Cache-oblivious are underlined, Green are ours, ? is unknown/future work

Columns: 2 Levels of Memory (#Words Moved / #Messages) and Multiple Levels of Memory (#Words Moved / #Messages)

• BLAS-3: Usual blocked or recursive algorithms / Usual blocked algorithms (nested) or recursive [Gustavson,97]
• Cholesky: LAPACK (with b = M^{1/2}), [Gustavson,97], [BDHS09]; #messages: [Gustavson,97], [Ahmed,Pingali,00], [BDHS09] / (←same), (←same)
• LU with pivoting: LAPACK (sometimes), [Toledo,97], [GDX,08]; #messages: [GDX,08], partial pivoting / [Toledo,97], [GDX,08]; #messages: [GDX,08]
• QR, Rank-revealing: LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08], [Frens,Wise,03] (but 3x flops); #messages: [DGHL08] / [Elmroth,Gustavson,98], [DGHL08]; #messages: [Frens,Wise,03], [DGHL08]
• Eig, SVD: Not LAPACK; [BDD11] (randomized, but more flops), [BDK11]; #messages: [BDD11, BDK11] / [BDD11, BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors
• Minimum Memory per processor: M = O(n²/P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^{1/2} ) = Ω( n² / P^{1/2} )
  #messages    = Ω( (n³/P) / M^{3/2} ) = Ω( P^{1/2} )

Table columns: Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Rows: Matrix Multiply, Cholesky, LU, QR, Sym Eig (SVD), Nonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors (conventional approach)
• Minimum Memory per processor: M = O(n²/P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^{1/2} ) = Ω( n² / P^{1/2} )
  #messages    = Ω( (n³/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm | Reference | Factor exceeding lower bound for #words_moved
Matrix Multiply | [Cannon, 69] | 1
Cholesky | ScaLAPACK | log P
LU | ScaLAPACK | log P
QR | ScaLAPACK | log P
Sym Eig, SVD | ScaLAPACK | log P
Nonsym Eig | ScaLAPACK | P^{1/2} · log P
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors (conventional approach)
• Minimum Memory per processor: M = O(n²/P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^{1/2} ) = Ω( n² / P^{1/2} )
  #messages    = Ω( (n³/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | ScaLAPACK | log P | (n · log P) / P^{1/2}
QR | ScaLAPACK | log P | (n · log P) / P^{1/2}
Sym Eig, SVD | ScaLAPACK | log P | n / P^{1/2}
Nonsym Eig | ScaLAPACK | P^{1/2} · log P | n · log P
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors (better)
• Minimum Memory per processor: M = O(n²/P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^{1/2} ) = Ω( n² / P^{1/2} )
  #messages    = Ω( (n³/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX10] | log P | log P
QR | [DGHL08] | log P | log³ P
Sym Eig, SVD | [BDD11] | log P | log³ P
Nonsym Eig | [BDD11] | log P | log³ P
Can we do even better?
• Assume n x n matrices on P processors
• Use c copies of data: M = O(c·n²/P) per processor
• Increasing M reduces lower bounds:
  #words_moved = Ω( (n³/P) / M^{1/2} ) = Ω( n² / (c^{1/2} · P^{1/2}) )
  #messages    = Ω( (n³/P) / M^{3/2} ) = Ω( P^{1/2} / c^{3/2} )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [DS11, SBD11] | polylog P | polylog P
Cholesky | [SD11, in prog.] | polylog P | c² polylog P – optimal?
LU | [DS11, SBD11] | polylog P | c² polylog P – optimal?
QR | Via Cholesky QR | polylog P | c² polylog P – optimal?
Sym Eig, SVD | ? | |
Nonsym Eig | ? | |
Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less!
[Figure: symmetric band matrix of bandwidth b+1]
Successive Band Reduction

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

[Figure sequence: step 1 applies Q1 to annihilate a parallelogram of c columns and d diagonals of the band, creating a bulge of size d+c; applying Q1^T from the other side (step 2) restores symmetry but pushes the bulge along the band; Q2, Q2^T chase it further (steps 3, 4); Q3, Q3^T, Q4, Q4^T, Q5, Q5^T continue chasing bulges down the band (steps 5, 6)]

Only need to zero out the leading parallelogram of each trapezoid
Conventional vs CA-SBR
• Conventional: Touch all data 4 times
• Communication-Avoiding: Touch all data once
Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge
Right choices reduce #words_moved by a factor of M/bw, not just M^{1/2}
19
Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    read block C(i,j) into fast memory
    for k = 1 to n/b
      read block A(i,k) into fast memory
      read block B(k,j) into fast memory
      C(i,j) = C(i,j) + A(i,k) · B(k,j)   … do a matrix multiply on b-by-b blocks
    write block C(i,j) back to slow memory
20
Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    read block C(i,j) into fast memory        … b² × (n/b)² = n² reads
    for k = 1 to n/b
      read block A(i,k) into fast memory      … b² × (n/b)³ = n³/b reads
      read block B(k,j) into fast memory      … b² × (n/b)³ = n³/b reads
      C(i,j) = C(i,j) + A(i,k) · B(k,j)       … do a matrix multiply on b-by-b blocks
    write block C(i,j) back to slow memory    … b² × (n/b)² = n² writes

2n³/b + 2n² reads/writes << 2n³ arithmetic – Faster!
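The tiled loop nest above is directly runnable; a minimal Python rendering (lists-of-lists, our own naming) with the three block loops:

```python
def blocked_matmul(A, B, b):
    """Tiled n x n matrix multiply: the i0/j0/k0 loops walk b-by-b
    blocks, so at any moment only one block each of A, B, C is being
    touched -- the access pattern behind the 2n^3/b + 2n^2 traffic count."""
    n = len(A)
    assert n % b == 0, "for simplicity, let the block size divide n"
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):
                for i in range(i0, i0 + b):      # b-by-b block multiply
                    for j in range(j0, j0 + b):
                        s = C[i][j]
                        for k in range(k0, k0 + b):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```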
Does blocked matmul attain lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n³/b + 2n²
• Make b as large as possible: 3b² ≤ M, so #reads/writes ≥ 3^{1/2}·n³/M^{1/2} + 2n²
• Attains lower bound = Ω( #flops / M^{1/2} )
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?
21
Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = [ C11 C12 ; C21 C22 ] = A · B = [ A11 A12 ; A21 A22 ] · [ B11 B12 ; B21 B22 ]
    = [ A11·B11 + A12·B21   A11·B12 + A12·B22 ; A21·B11 + A22·B21   A21·B12 + A22·B22 ]
• True when each Aij etc. is 1x1 or n/2 x n/2
22
func C = RMM(A, B, n)
  if n = 1, C = A · B
  else
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  return
Recursive Matrix Multiplication (RMM) (2/2)
23
A(n) = # arithmetic operations in RMM( . , . , n)
     = 8 · A(n/2) + 4·(n/2)² if n > 1, else 1
     = 2n³ … same operations as usual, in different order
W(n) = # words moved between fast and slow memory by RMM( . , . , n)
     = 8 · W(n/2) + 12·(n/2)² if 3n² > M, else 3n²
     = O( n³/M^{1/2} + n² ) … same as blocked matmul
"Cache oblivious": works for memory hierarchies, but not a panacea
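The recurrences are easy to check against a runnable version of RMM (pure Python, our own helper names); the recursion splits each operand into quadrants exactly as on the slide:

```python
def rmm(A, B):
    """Recursive matrix multiply of n x n lists-of-lists, n a power of 2.
    Mirrors the slide's RMM: 8 recursive half-size products plus 4 adds."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(M, r, c):               # extract an h x h quadrant
        return [row[c:c + h] for row in M[r:r + h]]
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    C11 = add(rmm(A11, B11), rmm(A12, B21))
    C12 = add(rmm(A11, B12), rmm(A12, B22))
    C21 = add(rmm(A21, B11), rmm(A22, B21))
    C22 = add(rmm(A21, B12), rmm(A22, B22))
    return [r1 + r2 for r1, r2 in zip(C11, C12)] + \
           [r1 + r2 for r1, r2 in zip(C21, C22)]
```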
For big speedups, see the SC12 poster on "Beating MKL and ScaLAPACK at Rectangular Matmul", 11/13 at 5:15–7pm
How hard is hand-tuning matmul, anyway?
24
• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor-tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
How hard is hand-tuning matmul, anyway?
25
Parallel MatMul with 2D Processor Layout
• P processors in a P^{1/2} x P^{1/2} grid
  – Processors communicate along rows, columns
• Each processor owns an n/P^{1/2} x n/P^{1/2} submatrix of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij, and Cij

C = A · B
[Figure: three 4x4 processor grids (P00 … P33) holding the blocks of C, A, and B]
27
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
  – Comes within factor of log P of lower bounds:
    • Assume fast memory size M = O(n²/P) per processor – 1 copy of data
    • #words_moved = Ω( #flops / M^{1/2} ) = Ω( (n³/P) / (n²/P)^{1/2} ) = Ω( n² / P^{1/2} )
    • #messages = Ω( #flops / M^{3/2} ) = Ω( (n³/P) / (n²/P)^{3/2} ) = Ω( P^{1/2} )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn{96,100}.ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions
layouts and Cannon may use more memory
28
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
kk
B(kj)
C(ij)
29
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
kk
B(kj)
C(ij)
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
Brow
Acol
30
SUMMA Communication Costs
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
deg Total words = log P n2 P12
deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)
deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12
31
Can we do better
bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P
ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)
bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E
needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T
that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
39
Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU
for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update
for i=1 to n-1 update column i update trailing matrix
Communication in sequentialOne-sided Factorizations (LU QR hellip)
bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)
40
bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= + C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 312n3M12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
22
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
23
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm
How hard is hand-tuning matmul anyway
24
• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
How hard is hand-tuning matmul anyway
25
Parallel MatMul with 2D Processor Layout
• P processors in P^{1/2} x P^{1/2} grid
  - Processors communicate along rows, columns
• Each processor owns n/P^{1/2} x n/P^{1/2} submatrices of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
  - Processor Pij owns submatrices Aij, Bij and Cij
[Figure: three 4x4 processor grids (for C, A and B), each labeled P00 through P33]
C = A B
27
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
  - Comes within factor of log P of lower bounds:
• Assume fast memory size M = O(n²/P) per processor - 1 copy of data
• #words_moved = Ω( flops / M^{1/2} ) = Ω( (n³/P) / (n²/P)^{1/2} ) = Ω( n² / P^{1/2} )
• #messages = Ω( flops / M^{3/2} ) = Ω( (n³/P) / (n²/P)^{3/2} ) = Ω( P^{1/2} )
  - Can accommodate any processor grid, matrix dimensions & layout
  - Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn96.ps, lawn100.ps
• Comparison to Cannon's Algorithm:
  - Cannon attains lower bound
  - But Cannon harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
28
SUMMA – n x n matmul on P^{1/2} x P^{1/2} grid
• C(i, j) is the n/P^{1/2} x n/P^{1/2} submatrix of C on processor Pij
• A(i,k) is an n/P^{1/2} x b submatrix of A
• B(k,j) is a b x n/P^{1/2} submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)*B(k,j)
  • summation over submatrices
  • Need not be square processor grid
[Figure: block C(i,j) of C, with column panel A(i,k) and row panel B(k,j) highlighted]
29
SUMMA – n x n matmul on P^{1/2} x P^{1/2} grid
For k = 0 to n/b - 1
  for all i = 1 to P^{1/2}
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^{1/2}
    owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow
[Figure: same grid, with the broadcast panels Acol and Brow highlighted]
30
SUMMA Communication Costs
For k = 0 to n/b - 1
  for all i = 1 to P^{1/2}
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
      … #words = log P^{1/2} * b*n/P^{1/2}, #messages = log P^{1/2}
  for all j = 1 to P^{1/2}
    owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
      … same #words and #messages
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow
° Total #words = log P * n² / P^{1/2}
° Within factor of log P of lower bound
° (more complicated implementation removes log P factor)
° Total #messages = log P * n/b
° Choose b close to maximum, n/P^{1/2}, to approach lower bound P^{1/2}
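These counts can be packaged as a small cost model. A sketch (the function name is mine; it just evaluates the slide's formulas) showing that the word count is independent of b while the message count shrinks as b grows:

```python
# SUMMA cost model from the slide: each of the n/b iterations broadcasts
# a b x n/sqrt(P) panel along a binary tree of depth log2(sqrt(P)).
from math import log2

def summa_costs(n, P, b):
    """Per-processor (#words, #messages) for SUMMA, per the slide's count."""
    s = P ** 0.5                      # sqrt(P): grid dimension
    words = log2(P) * n * n / s       # = log P * n^2 / P^(1/2)
    messages = log2(P) * n / b        # = log P * n / b
    return words, messages

# Choosing b at its maximum, n/sqrt(P), minimizes messages:
w1, m1 = summa_costs(n=4096, P=64, b=1)
w2, m2 = summa_costs(n=4096, P=64, b=4096 // 8)
assert w1 == w2                       # words moved do not depend on b
assert m2 < m1                        # larger panels -> fewer messages
assert m2 == log2(64) * 8             # = log P * P^(1/2)
```

With b = n/P^{1/2} the message count becomes log P · P^{1/2}, within the log P factor of the Ω(P^{1/2}) lower bound, as stated above.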
31
Can we do better
• Lower bound assumed 1 copy of data: M = O(n²/P) per proc.
• What if matrix small enough to fit c > 1 copies, so M = cn²/P?
  - words_moved = Ω( flops / M^{1/2} ) = Ω( n² / ( c^{1/2} P^{1/2} ))
  - messages = Ω( flops / M^{3/2} ) = Ω( P^{1/2} / c^{3/2} )
• Can we attain new lower bound?
  - Special case: "3D Matmul": c = P^{1/3}
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^{1/3} x P^{1/3} x P^{1/3} grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^{1/3} x n/P^{1/3}
  - Not always that much memory available…
2.5D Matrix Multiplication
• Assume can fit cn²/P data per processor, c > 1
• Processors form (P/c)^{1/2} x (P/c)^{1/2} x c grid
[Figure: (P/c)^{1/2} x (P/c)^{1/2} x c processor grid. Example: P = 32, c = 2]
2.5D Matrix Multiplication
• Assume can fit cn²/P data per processor, c > 1
• Processors form (P/c)^{1/2} x (P/c)^{1/2} x c grid, with axes i, j, k
Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^{1/2} x n(c/P)^{1/2}
(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along k-axis so P(i,j,0) owns C(i,j)
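Steps (2) and (3) can be emulated sequentially to see why the scheme is correct: split the summation index range into c layers, form partial products, then sum-reduce. This is a sketch of the arithmetic only (the helper name is mine; no communication is modeled):

```python
# 2.5D scheme, sequential emulation: each of c "layers" computes a partial
# sum over its slice of the k-range (step 2), then the partials are
# sum-reduced (step 3). The result equals the full product.

def matmul_slice(A, B, ks):
    """C_partial(i,j) = sum over k in ks of A(i,k)*B(k,j)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in ks) for j in range(n)]
            for i in range(n)]

n, c = 4, 2
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i * j + 1) for j in range(n)] for i in range(n)]
layers = [range(l * n // c, (l + 1) * n // c) for l in range(c)]   # step (2)
partials = [matmul_slice(A, B, ks) for ks in layers]
C = [[sum(p[i][j] for p in partials) for j in range(n)]            # step (3)
     for i in range(n)]
assert C == matmul_slice(A, B, range(n))   # matches the full product
```

Each layer does 1/c-th of the flops; the price is the extra broadcast in step (1) and the reduction in step (3), which is exactly the trade the lower bound with c copies permits.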
2.5D Matmul on BG/P, 16K nodes / 64K cores
2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies
Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.
12x faster
2.7x faster
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n²
• Increase P by a factor of c → total memory increases by a factor of c
• Notation for timing model:
  - γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n³/(cP) [ γT + βT/M^{1/2} + αT/(m·M^{1/2}) ] = T(P)/c
• Notation for energy model:
  - γE, βE, αE = joules for same operations
  - δE = joules per word of memory used per sec
  - εE = joules per sec, for leakage, etc.
• E(cP) = cP { n³/(cP) [ γE + βE/M^{1/2} + αE/(m·M^{1/2}) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
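The timing model above is easy to check numerically. A sketch with illustrative constants (γT, βT, αT values are assumptions, not measurements) verifying T(cP) = T(P)/c when M per processor is held fixed:

```python
# Executable form of the slide's timing model. The constants gamma_t,
# beta_t, alpha_t are illustrative assumptions, not measured values.

def T(P, n, M, gamma_t, beta_t, alpha_t, m):
    """T(P) = n^3/P * [ gamma_t + beta_t/M^(1/2) + alpha_t/(m*M^(1/2)) ]"""
    return (n**3 / P) * (gamma_t + beta_t / M**0.5
                         + alpha_t / (m * M**0.5))

n = 1024
M = 3 * n * n          # per-processor memory M stays fixed as P grows
m = M**0.5             # message size ~ M^(1/2) words
t1 = T(1000, n, M, 1e-9, 1e-8, 1e-6, m)
t2 = T(2000, n, M, 1e-9, 1e-8, 1e-6, m)   # c = 2: double the processors
assert abs(t2 - t1 / 2) < 1e-12 * t1      # T(cP) = T(P)/c: perfect scaling
```

The key point the formula encodes: since M is fixed per processor, every bracketed term is constant, so time is exactly inversely proportional to the processor count.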
Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K
Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
39
Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimination, A = LU
for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) ) … scale a vector
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n) … rank-1 update
i.e., for i = 1 to n-1: update column i, then update trailing matrix
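The two-line loop body above is runnable almost verbatim. A minimal in-place sketch (no pivoting, so it assumes nonzero diagonal entries; the function name is mine):

```python
# Naive Gaussian elimination A = L*U from the slide: scale a vector, then
# a rank-1 update of the trailing matrix. No pivoting (assumes nonzero
# diagonal). L (unit diagonal) overwrites A's strict lower triangle,
# U overwrites the upper triangle.

def lu_in_place(A):
    n = len(A)
    for i in range(n - 1):
        for r in range(i + 1, n):
            A[r][i] *= 1.0 / A[i][i]          # scale column i
        for r in range(i + 1, n):             # rank-1 update of trailing matrix
            for c in range(i + 1, n):
                A[r][c] -= A[r][i] * A[i][c]
    return A

A = [[4.0, 3.0], [6.0, 3.0]]
lu_in_place(A)
# L = [[1, 0], [1.5, 1]], U = [[4, 3], [0, -1.5]]
assert A == [[4.0, 3.0], [1.5, -1.5]]
```

Note the communication pattern the slides go on to criticize: each step i touches the entire trailing matrix, which is why the naive approach moves O(n³) words.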
Communication in sequentialOne-sided Factorizations (LU QR hellip)
• Naive Approach:
  for i = 1 to n-1
    update column i
    update trailing matrix
• #words_moved = O(n³)
40
• Blocked Approach (LAPACK):
  for i = 1 to n/b - 1
    update block i of b columns
    update trailing matrix
• #words moved = O(n³/M^{1/3})
• Recursive Approach:
  func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)
• #words moved = O(n³/M^{1/2})
• None of these approaches
  • minimizes #messages
  • handles eig() or svd()
  • works in parallel
• Need more ideas!
TSQR QR of a Tall Skinny matrix
41
W = [ W0 ; W1 ; W2 ; W3 ] = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]
  = diag( Q00, Q10, Q20, Q30 ) · [ R00 ; R10 ; R20 ; R30 ]
[ R00 ; R10 ] = Q01·R01 and [ R20 ; R30 ] = Q11·R11, so
[ R00 ; R10 ; R20 ; R30 ] = diag( Q01, Q11 ) · [ R01 ; R11 ]
[ R01 ; R11 ] = Q02·R02
TSQR QR of a Tall Skinny matrix
42
(same reduction as on the previous slide:)
W = [ W0 ; W1 ; W2 ; W3 ] = diag( Q00, Q10, Q20, Q30 ) · [ R00 ; R10 ; R20 ; R30 ]
[ R00 ; R10 ; R20 ; R30 ] = diag( Q01, Q11 ) · [ R01 ; R11 ]
[ R01 ; R11 ] = Q02·R02
Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
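If we only need the final R factor, the reduction tree is a few lines of code. This is a sketch (function names are mine), using modified Gram-Schmidt as the local QR; since MGS produces an R with positive diagonal, R is unique and we can compare the tree result against a direct QR of all of W:

```python
# TSQR sketch: QR of a tall-skinny W via a binary reduction tree of small
# local QRs, as in the slides. Local QR = modified Gram-Schmidt.

def qr(W):                    # modified Gram-Schmidt; returns (Q, R)
    m, k = len(W), len(W[0])
    Q = [row[:] for row in W]
    R = [[0.0] * k for _ in range(k)]
    for j in range(k):
        R[j][j] = sum(Q[i][j] ** 2 for i in range(m)) ** 0.5
        for i in range(m):
            Q[i][j] /= R[j][j]
        for c in range(j + 1, k):
            R[j][c] = sum(Q[i][j] * Q[i][c] for i in range(m))
            for i in range(m):
                Q[i][c] -= R[j][c] * Q[i][j]
    return Q, R

def tsqr_R(blocks):
    """Reduce local R factors pairwise up the tree; return the final R."""
    Rs = [qr(W)[1] for W in blocks]
    while len(Rs) > 1:
        # stack each pair of k x k R factors and re-factor the 2k x k result
        Rs = [qr(Rs[i] + Rs[i + 1])[1] for i in range(0, len(Rs), 2)]
    return Rs[0]

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0],
     [9.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
blocks = [W[i:i + 2] for i in range(0, 8, 2)]
R_tree = tsqr_R(blocks)
R_flat = qr(W)[1]
assert all(abs(R_tree[i][j] - R_flat[i][j]) < 1e-9
           for i in range(2) for j in range(2))
```

The communication win: each tree level exchanges only k x k R factors (k = number of columns), never the tall blocks themselves.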
TSQR An Architecture-Dependent Algorithm
Parallel: W = [ W0 ; W1 ; W2 ; W3 ] → ( R00, R10, R20, R30 ) → ( R01, R11 ) → R02
Sequential: W = [ W0 ; W1 ; W2 ; W3 ] → R00 → R01 → R02 → R03 (fold in one block at a time)
Dual Core: W = [ W0 ; W1 ; W2 ; W3 ] → a hybrid tree mixing the two patterns (R00, R01, R11, R02, R03)
Can choose reduction tree dynamically: Multicore / Multisocket / Multirack / Multisite / Out-of-core
TSQR Performance Results
• Parallel
  - Intel Clovertown
    - Up to 8x speedup (8 core, dual socket, 10M x 10)
  - Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  - BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  - Tesla C 2050 / Fermi
    • Up to 13x (110,592 x 100)
  - Grid - 4x on 4 cities (Dongarra et al)
  - Cloud - 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  - "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU: Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"
45
W (n x b) = [ W1 ; W2 ; W3 ; W4 ] = [ P1·L1·U1 ; P2·L2·U2 ; P3·L3·U3 ; P4·L4·U4 ]
Choose b pivot rows of W1, call them W1'
Choose b pivot rows of W2, call them W2'
Choose b pivot rows of W3, call them W3'
Choose b pivot rows of W4, call them W4'
[ W1' ; W2' ] = P12·L12·U12 → choose b pivot rows, call them W12'
[ W3' ; W4' ] = P34·L34·U34 → choose b pivot rows, call them W34'
[ W12' ; W34' ] = P1234·L1234·U1234 → choose b pivot rows
• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
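For the special case b = 1 the tournament collapses to a max-reduction up the tree, which makes the idea easy to see in code. A sketch (function names are mine; real tournament pivoting selects b rows per node via LU with partial pivoting):

```python
# Tournament pivoting sketch for b = 1: each block proposes its row with
# the largest |leading entry|; winners play off pairwise up a binary tree.
# With b = 1 this max-reduction picks the same pivot as partial pivoting,
# since taking maxima is associative.

def best_row(rows):
    return max(rows, key=lambda r: abs(r[0]))

def tournament(blocks):
    winners = [best_row(block) for block in blocks]   # local rounds
    while len(winners) > 1:                           # playoff rounds
        winners = [best_row(winners[i:i + 2])
                   for i in range(0, len(winners), 2)]
    return winners[0]

W = [[0.1, 1.0], [-7.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
blocks = [W[0:2], W[2:4]]
assert tournament(blocks) == [-7.0, 2.0]   # global max |leading entry|
```

For b > 1 the stability guarantee is weaker than for a flat max (hence the theorem's careful statement: as stable as partial pivoting on a *larger* matrix), but the tree needs only O(log P) messages instead of one per column.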
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Plot: x-axis = log2(p), y-axis = log2(n²/p) = log2(memory_per_proc)]
2.5D vs 2D LU, With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
• Algorithms shown minimizing #Messages use (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports), only some shown, plus ours
  • Cache-oblivious are underlined, Green are ours, ? is unknown/future work

Algorithm | 2 Levels of Memory (#Words Moved and #Messages) | Multiple Levels of Memory (#Words Moved and #Messages)
BLAS-3 | Usual blocked or recursive algorithms | Usual blocked algorithms (nested), or recursive [Gustavson,97]
Cholesky | LAPACK (with b = M^{1/2}), [Gustavson 97], [BDHS09] | [Gustavson,97], [Ahmed,Pingali,00], [BDHS09] (←same) (←same)
LU with pivoting | LAPACK (sometimes), [Toledo,97], [GDX 08]; [GDX 08] with partial pivoting | [Toledo 97], [GDX 08]; [GDX 08]
QR / Rank-revealing | LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08]; [Frens,Wise,03] but 3x flops, [DGHL08] | [Elmroth,Gustavson,98], [DGHL08]; [Frens,Wise,03], [DGHL08]
Eig, SVD | Not LAPACK; [BDD11] randomized, but more flops; [BDK11] | [BDD11],[BDK11]; [BDD11],[BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors
• Minimum Memory per processor = M = O(n² / P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^{1/2} ) = Ω( n² / P^{1/2} )
  #messages = Ω( (n³/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | | |
Cholesky | | |
LU | | |
QR | | |
Sym Eig, SVD | | |
Nonsym Eig | | |
Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n² / P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^{1/2} ) = Ω( n² / P^{1/2} )
  #messages = Ω( (n³/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 |
Cholesky | ScaLAPACK | log P |
LU | ScaLAPACK | log P |
QR | ScaLAPACK | log P |
Sym Eig, SVD | ScaLAPACK | log P |
Nonsym Eig | ScaLAPACK | P^{1/2} log P |
Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n² / P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^{1/2} ) = Ω( n² / P^{1/2} )
  #messages = Ω( (n³/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | ScaLAPACK | log P | n log P / P^{1/2}
QR | ScaLAPACK | log P | n log P / P^{1/2}
Sym Eig, SVD | ScaLAPACK | log P | n / P^{1/2}
Nonsym Eig | ScaLAPACK | P^{1/2} log P | n log P
Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors (better)
• Minimum Memory per processor = M = O(n² / P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^{1/2} ) = Ω( n² / P^{1/2} )
  #messages = Ω( (n³/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX10] | log P | log P
QR | [DGHL08] | log P | log³ P
Sym Eig, SVD | [BDD11] | log P | log³ P
Nonsym Eig | [BDD11] | log P | log³ P
Can we do even better?
• Assume nxn matrices on P processors
• Use c copies of data: M = O(cn² / P) per processor
• Increasing M reduces lower bounds:
  #words_moved = Ω( (n³/P) / M^{1/2} ) = Ω( n² / (c^{1/2} P^{1/2}) )
  #messages = Ω( (n³/P) / M^{3/2} ) = Ω( P^{1/2} / c^{3/2} )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [DS11, SBD11] | polylog P | polylog P
Cholesky | [SD11, in prog.] | polylog P | c² polylog P – optimal!
LU | [DS11, SBD11] | polylog P | c² polylog P – optimal!
QR | Via Cholesky QR | polylog P | c² polylog P – optimal!
Sym Eig, SVD | | |
Nonsym Eig | | |
Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  - A = A^T is banded
  - T tridiagonal
  - Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  - Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  - It can communicate even less!
[Figure sequence: Successive Band Reduction, steps 1-6 on a band of width b+1. Legend: b = bandwidth, c = #columns, d = #diagonals, constraint c+d ≤ b. Q1 annihilates a c-by-d parallelogram of the band, creating a bulge of size d+c; Q1^T, Q2, Q2^T, Q3, Q3^T, Q4, Q4^T, Q5, Q5^T chase the bulge down the band. Only need to zero out the leading parallelogram of each trapezoid.]
Conventional vs CA-SBR
Conventional: Touch all data 4 times | Communication-Avoiding: Touch all data once
Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge
Right choices reduce #words_moved by factor M/bw, not just M^{1/2}
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 312n3M12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
22
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
23
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm
How hard is hand-tuning matmul anyway
24
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
25
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
27
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS
bull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions
layouts and Cannon may use more memory
28
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
kk
B(kj)
C(ij)
29
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
kk
B(kj)
C(ij)
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
Brow
Acol
30
SUMMA Communication Costs
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
deg Total words = log P n2 P12
deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)
deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12
31
Can we do better
bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P
ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)
bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E
needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T
that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
39
Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU
for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update
for i=1 to n-1 update column i update trailing matrix
Communication in sequentialOne-sided Factorizations (LU QR hellip)
bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)
40
bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m

• C = [ C11 C12 ] = A · B = [ A11 A12 ] · [ B11 B12 ]
      [ C21 C22 ]           [ A21 A22 ]   [ B21 B22 ]

    = [ A11·B11 + A12·B21   A11·B12 + A12·B22 ]
      [ A21·B11 + A22·B21   A21·B12 + A22·B22 ]

• True when each Aij etc. is 1x1 or n/2 x n/2

22

func C = RMM (A, B, n)
  if n = 1, C = A * B
  else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return
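The pseudocode above can be written as runnable Python; this is a simplified illustration (NumPy assumed, base case at n = 1):

```python
import numpy as np

def rmm(A, B):
    """Recursive matrix multiply; assumes square matrices with n a power of 2."""
    n = A.shape[0]
    if n == 1:
        return A * B  # 1x1 base case
    h = n // 2
    C = np.empty_like(A)
    C[:h, :h] = rmm(A[:h, :h], B[:h, :h]) + rmm(A[:h, h:], B[h:, :h])
    C[:h, h:] = rmm(A[:h, :h], B[:h, h:]) + rmm(A[:h, h:], B[h:, h:])
    C[h:, :h] = rmm(A[h:, :h], B[:h, :h]) + rmm(A[h:, h:], B[h:, :h])
    C[h:, h:] = rmm(A[h:, :h], B[:h, h:]) + rmm(A[h:, h:], B[h:, h:])
    return C
```

As the next slide counts, this performs the same 2n³ flops as the usual algorithm, just in a different order.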
Recursive Matrix Multiplication (RMM) (2/2)
23

func C = RMM (A, B, n)
  if n = 1, C = A * B
  else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return

A(n) = # arithmetic operations in RMM( ., ., n)
     = 8 · A(n/2) + 4·(n/2)² if n > 1, else 1
     = 2n³ … same operations as usual, in different order

W(n) = # words moved between fast and slow memory by RMM( ., ., n)
     = 8 · W(n/2) + 12·(n/2)² if 3n² > M, else 3n²
     = O( n³ / M^{1/2} + n² ) … same as blocked matmul

"Cache oblivious": works for memory hierarchies, but not a panacea

For big speedups, see SC12 poster on "Beating MKL and ScaLAPACK at Rectangular Matmul", 11/13 at 5:15-7pm
How hard is hand-tuning matmul, anyway?
24
• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
How hard is hand-tuning matmul, anyway?
25
Parallel MatMul with 2D Processor Layout
• P processors in P^{1/2} x P^{1/2} grid
  – Processors communicate along rows, columns
• Each processor owns n/P^{1/2} x n/P^{1/2} submatrices of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij, and Cij

C = A · B, with each of C, A, and B laid out on the same 4x4 grid:

  P00 P01 P02 P03
  P10 P11 P12 P13
  P20 P21 P22 P23
  P30 P31 P32 P33
27
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
  – Comes within factor of log P of lower bounds:
    • Assume fast memory size M = O(n²/P) per processor – 1 copy of data
    • #words_moved = Ω( #flops / M^{1/2} ) = Ω( (n³/P) / (n²/P)^{1/2} ) = Ω( n² / P^{1/2} )
    • #messages = Ω( #flops / M^{3/2} ) = Ω( (n³/P) / (n²/P)^{3/2} ) = Ω( P^{1/2} )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn96.ps and lawn100.ps
• Comparison to Cannon's Algorithm:
  – Cannon attains lower bound
  – But Cannon harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
28
SUMMA – n x n matmul on P^{1/2} x P^{1/2} grid
• C(i,j) is the n/P^{1/2} x n/P^{1/2} submatrix of C on processor Pij
• A(i,k) is an n/P^{1/2} x b submatrix of A
• B(k,j) is a b x n/P^{1/2} submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)·B(k,j)
  • summation over submatrices
  • Need not be square processor grid
(Figure: block C(i,j) in processor row i, column j accumulates products of the row panel A(i,k) and the column panel B(k,j))
29
SUMMA – n x n matmul on P^{1/2} x P^{1/2} grid

For k = 0 to n/b-1
  for all i = 1 to P^{1/2}
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^{1/2}
    owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow
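On a single process the k-loop above reduces to accumulating n/b rank-b updates; a minimal NumPy sketch with the broadcasts elided:

```python
import numpy as np

def summa_serial(A, B, b):
    """Accumulate C as n/b rank-b updates, as in the SUMMA k-loop."""
    n = A.shape[0]
    C = np.zeros((n, B.shape[1]))
    for k in range(0, n, b):
        # In parallel SUMMA, A(:, k:k+b) is broadcast along processor rows
        # and B(k:k+b, :) along processor columns before this update.
        C += A[:, k:k+b] @ B[k:k+b, :]
    return C
```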
30
SUMMA Communication Costs

For k = 0 to n/b-1
  for all i = 1 to P^{1/2}
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
      … #words = log P^{1/2} · b·n/P^{1/2}, #messages = log P^{1/2}
  for all j = 1 to P^{1/2}
    owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
      … same #words and #messages
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

° Total #words = log P · n² / P^{1/2}
° Within factor of log P of lower bound
° (more complicated implementation removes log P factor)
° Total #messages = log P · n/b
° Choose b close to its maximum, n/P^{1/2}, to approach lower bound P^{1/2}
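The trade-off in the block size b can be tabulated with a toy cost model (a sketch of the counts above; unit word/message costs, names illustrative):

```python
from math import log2

def summa_costs(n, P, b):
    """Per-run word and message counts for SUMMA's broadcasts (model only)."""
    words = log2(P) * n * n / P ** 0.5  # log P * n^2 / P^(1/2), independent of b
    messages = log2(P) * n / b          # log P * n / b
    return words, messages

# Larger b means fewer, larger messages for the same word count:
w1, m1 = summa_costs(n=4096, P=64, b=1)
w2, m2 = summa_costs(n=4096, P=64, b=512)  # b = n / P^(1/2)
```

At b = n/P^{1/2} the message count drops to log P · P^{1/2}, within a log P factor of the lower bound.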
31
Can we do better?
• Lower bound assumed 1 copy of data: M = O(n²/P) per proc.
• What if matrix small enough to fit c > 1 copies, so M = c·n²/P?
  – #words_moved = Ω( #flops / M^{1/2} ) = Ω( n² / (c^{1/2}·P^{1/2}) )
  – #messages = Ω( #flops / M^{3/2} ) = Ω( P^{1/2} / c^{3/2} )
• Can we attain new lower bound?
  – Special case: "3D Matmul": c = P^{1/3}
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^{1/3} x P^{1/3} x P^{1/3} grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)·B(k,j), where each submatrix is n/P^{1/3} x n/P^{1/3}
  – Not always that much memory available…
2.5D Matrix Multiplication
• Assume can fit c·n²/P data per processor, c > 1
• Processors form (P/c)^{1/2} x (P/c)^{1/2} x c grid
(Figure: a processor grid of dimensions (P/c)^{1/2} x (P/c)^{1/2} x c. Example: P = 32, c = 2)
2.5D Matrix Multiplication
• Assume can fit c·n²/P data per processor, c > 1
• Processors form (P/c)^{1/2} x (P/c)^{1/2} x c grid

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^{1/2} x n·(c/P)^{1/2}

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)·B(m,j) along k-axis, so P(i,j,0) owns C(i,j)
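Steps (1)-(3) can be mimicked serially by splitting the inner dimension across c "layers" and sum-reducing the partial products (an illustrative NumPy sketch, not the parallel implementation):

```python
import numpy as np

def matmul_25d_serial(A, B, c):
    """Each of c layers computes 1/c-th of the k-summation; then sum-reduce."""
    n = A.shape[0]
    layers = np.array_split(np.arange(n), c)           # split k-range over layers
    partials = [A[:, ks] @ B[ks, :] for ks in layers]  # step (2), one per layer
    return sum(partials)                               # step (3): sum-reduce
```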
2.5D Matmul on BG/P, 16K nodes / 64K cores
2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies
Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.
(Plot annotations: 12x faster; 2.7x faster)
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n²
• Increase P by a factor of c; total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n³/(cP) · [ γT + βT/M^{1/2} + αT/(m·M^{1/2}) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
• E(cP) = cP · { n³/(cP) · [ γE + βE/M^{1/2} + αE/(m·M^{1/2}) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
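A quick numeric check of the timing model: with M held fixed (each added processor brings its own memory), the bracketed term is constant, so T(cP) = T(P)/c exactly. The parameter values below are purely illustrative:

```python
def T(P, n, M, gamma, beta, alpha, m):
    # Model runtime: n^3/P work, costing gamma + beta/M^(1/2) + alpha/(m*M^(1/2))
    # per flop on average (flops, bandwidth, and latency terms).
    return (n**3 / P) * (gamma + beta / M**0.5 + alpha / (m * M**0.5))

P, n = 64, 2**12
M = 3 * n**2 / P                     # start with minimal procs: P*M = 3n^2
args = dict(n=n, M=M, gamma=1e-9, beta=1e-7, alpha=1e-5, m=1e4)
t1 = T(P, **args)                    # baseline
t8 = T(8 * P, **args)                # 8x the processors, same M per proc
```

Since M does not change, t8 comes out to exactly t1/8: perfect strong scaling in the model.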
Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K
Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors, and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
39
Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimination: A = LU

for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) )                       … scale a vector
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)    … rank-1 update

Equivalently:
for i = 1 to n-1
  update column i
  update trailing matrix
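The two-line loop body above, as runnable NumPy. This is a sketch without pivoting, so it is only safe for matrices where no pivot vanishes (e.g. diagonally dominant A):

```python
import numpy as np

def lu_nopivot(A):
    """Naive Gaussian elimination A = L*U, overwriting a copy of A."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] *= 1.0 / A[i, i]                        # scale a vector
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])  # rank-1 update
    L = np.tril(A, -1) + np.eye(n)   # unit lower triangle holds the multipliers
    U = np.triu(A)
    return L, U
```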
Communication in sequential One-sided Factorizations (LU, QR, …)
40

• Naive Approach:
    for i = 1 to n-1
      update column i
      update trailing matrix
  • #words_moved = O(n³)

• Blocked Approach (LAPACK):
    for i = 1 to n/b - 1
      update block i of b columns
      update trailing matrix
  • #words_moved = O(n³ / M^{1/3})

• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  • #words_moved = O(n³ / M^{1/2})

• None of these approaches:
  • minimizes #messages
  • handles eig() or svd()
  • works in parallel
• Need more ideas!
TSQR: QR of a Tall, Skinny matrix
41-42

W = [ W0; W1; W2; W3 ] = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]

[ R00; R10 ] = Q01·R01 and [ R20; R30 ] = Q11·R11, so
[ R00; R10; R20; R30 ] = diag(Q01, Q11) · [ R01; R11 ]

[ R01; R11 ] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
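The reduction tree above, sketched in NumPy. For brevity this illustration keeps only the R factors (the implicit Q is discarded), which is enough to check correctness, since Q orthogonal implies Wᵀ·W = R02ᵀ·R02:

```python
import numpy as np

def tsqr_r(W, blocks=4):
    """R factor of a tall-skinny W via a binary reduction tree of local QRs."""
    # Leaf level: QR of each block row W0, W1, ...
    Rs = [np.linalg.qr(Wi)[1] for Wi in np.array_split(W, blocks)]
    while len(Rs) > 1:
        # Inner levels: stack pairs of R factors and QR again.
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]))[1]
              for i in range(0, len(Rs), 2)]
    return Rs[0]
```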
TSQR: An Architecture-Dependent Algorithm

(Figures show three reduction trees over W = [ W0; W1; W2; W3 ]:)
• Parallel: binary tree – local QRs give R00, R10, R20, R30; combined pairwise into R01, R11; combined into R02
• Sequential: flat tree – R00, then fold in W1 to get R01, then W2 to get R02, then W3 to get R03
• Dual Core: a hybrid of the two, one flat tree per core, merged at the end

Can choose reduction tree dynamically: Multicore / Multisocket / Multirack / Multisite / Out-of-core
TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid: 4x on 4 cities (Dongarra et al)
  – Cloud: 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
44
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Back to LU: Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"
45

W(n x b) = [ W1; W2; W3; W4 ] = [ P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4 ]
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'

[ W1'; W2' ] = P12·L12·U12, choose b pivot rows, call them W12'
[ W3'; W4' ] = P34·L34·U34, choose b pivot rows, call them W34'

[ W12'; W34' ] = P1234·L1234·U1234, choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
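A serial sketch of the tournament (helper names are hypothetical; partial-pivoting LU is used only to select pivot rows at each round):

```python
import numpy as np

def gepp_pivot_rows(W, b):
    """Indices of the first b pivot rows chosen by partial pivoting on W."""
    W = W.astype(float)
    idx = np.arange(W.shape[0])
    for j in range(b):
        p = j + np.argmax(np.abs(W[j:, j]))     # largest entry in column j
        W[[j, p]] = W[[p, j]]                   # swap rows, record the order
        idx[[j, p]] = idx[[p, j]]
        W[j+1:, j:] -= np.outer(W[j+1:, j] / W[j, j], W[j, j:])  # eliminate
    return idx[:b]

def tournament_pivots(W, b, blocks=4):
    """Tournament pivoting: local GEPP per block, then pairwise playoffs."""
    groups = np.array_split(np.arange(W.shape[0]), blocks)
    cand = [g[gepp_pivot_rows(W[g], b)] for g in groups]   # round 1 winners
    while len(cand) > 1:
        cand = [rows[gepp_pivot_rows(W[rows], b)]
                for rows in (np.concatenate(cand[i:i+2])
                             for i in range(0, len(cand), 2))]
    return cand[0]   # global indices of the b tournament-winning pivot rows
```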
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
(Plot omitted: horizontal axis log2(p); vertical axis log2(n²/p) = log2(memory_per_proc))
2.5D vs 2D LU, With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
• Algorithms shown minimizing #Messages use (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports), only some shown, plus ours
  • Cache-oblivious are underlined, Green are ours, ? is unknown/future work

Algorithm | 2 Levels of Memory (#Words Moved and #Messages) | Multiple Levels of Memory (#Words Moved and #Messages)
BLAS-3 | Usual blocked or recursive algorithms | Usual blocked algorithms (nested) or recursive [Gustavson,97]
Cholesky | LAPACK (with b = M^{1/2}); [Gustavson,97]; [BDHS09] | [Gustavson,97]; [Ahmed,Pingali,00]; [BDHS09]; (←same); (←same)
LU with pivoting | LAPACK (sometimes); [Toledo,97]; [GDX 08] partial pivoting | [Toledo,97]; [GDX 08]; [GDX 08]
QR, Rank-revealing | LAPACK (sometimes); [Elmroth,Gustavson,98]; [DGHL08]; [Frens,Wise,03] but 3x flops | [DGHL08]; [Elmroth,Gustavson,98]; [DGHL08]; [Frens,Wise,03]; [DGHL08]
Eig, SVD | Not LAPACK; [BDD11] randomized, but more flops; [BDK11] | [BDD11, BDK11]; [BDD11, BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors
• Minimum Memory per processor = M = O(n² / P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^{1/2} ) = Ω( n² / P^{1/2} )
  #messages    = Ω( (n³/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply, Cholesky, LU, QR, Sym Eig SVD, Nonsym Eig: (filled in on the following slides)
Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n² / P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^{1/2} ) = Ω( n² / P^{1/2} )
  #messages    = Ω( (n³/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 |
Cholesky | ScaLAPACK | log P |
LU | ScaLAPACK | log P |
QR | ScaLAPACK | log P |
Sym Eig, SVD | ScaLAPACK | log P |
Nonsym Eig | ScaLAPACK | P^{1/2} · log P |
Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n² / P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^{1/2} ) = Ω( n² / P^{1/2} )
  #messages    = Ω( (n³/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | ScaLAPACK | log P | ( n · log P ) / P^{1/2}
QR | ScaLAPACK | log P | ( n · log P ) / P^{1/2}
Sym Eig, SVD | ScaLAPACK | log P | n / P^{1/2}
Nonsym Eig | ScaLAPACK | P^{1/2} · log P | n · log P
Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors (better)
• Minimum Memory per processor = M = O(n² / P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^{1/2} ) = Ω( n² / P^{1/2} )
  #messages    = Ω( (n³/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX10] | log P | log P
QR | [DGHL08] | log P | log³ P
Sym Eig, SVD | [BDD11] | log P | log³ P
Nonsym Eig | [BDD11] | log P | log³ P
Can we do even better?
• Assume nxn matrices on P processors
• Use c copies of data: M = O(c · n² / P) per processor
• Increasing M reduces lower bounds:
  #words_moved = Ω( (n³/P) / M^{1/2} ) = Ω( n² / (c^{1/2} · P^{1/2}) )
  #messages    = Ω( (n³/P) / M^{3/2} ) = Ω( P^{1/2} / c^{3/2} )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [DS11, SBD11] | polylog P | polylog P
Cholesky | [SD11, in prog.] | polylog P | c² · polylog P – optimal
LU | [DS11, SBD11] | polylog P | c² · polylog P – optimal
QR | Via Cholesky QR | polylog P | c² · polylog P – optimal
Sym Eig, SVD | ? | |
Nonsym Eig | ? | |
Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Recursive Matrix Multiplication (RMM) (22)
23
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm
How hard is hand-tuning matmul anyway
24
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
25
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
27
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS
bull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions
layouts and Cannon may use more memory
28
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
kk
B(kj)
C(ij)
29
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
kk
B(kj)
C(ij)
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
Brow
Acol
30
SUMMA Communication Costs
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
deg Total words = log P n2 P12
deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)
deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12
31
Can we do better
bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P
ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)
bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E
needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T
that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
39
Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU
for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update
for i=1 to n-1 update column i update trailing matrix
Communication in sequentialOne-sided Factorizations (LU QR hellip)
bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)
40
bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA-SBR

Conventional: touch all data 4 times.
Communication-avoiding: touch all data once. Many tuning parameters: number of "sweeps", diagonals cleared per sweep, sizes of parallelograms, bulges chased at one time, how far to chase each bulge. The right choices reduce words_moved by a factor of M/bw, not just M^(1/2).
How hard is hand-tuning matmul, anyway?

24

• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naive)
• Still hard to get close to vendor-tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results

How hard is hand-tuning matmul, anyway?

25
Parallel MatMul with 2D Processor Layout
• P processors in a P^(1/2) x P^(1/2) grid
– Processors communicate along rows, columns
• Each processor owns an n/P^(1/2) x n/P^(1/2) submatrix of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
– Processor Pij owns submatrices Aij, Bij, and Cij
[Diagram: three 4x4 grids of processors P00 through P33, one each for C, A, and B]

C = A * B
27
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
– Comes within a factor of log P of lower bounds:
• Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
• #words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
• #messages = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
– Can accommodate any processor grid, matrix dimensions & layout
– Used in practice in PBLAS = Parallel BLAS
• www.netlib.org/lapack/lawns/lawn96.ps (also lawn100.ps)
• Comparison to Cannon's Algorithm:
– Cannon attains lower bound
– But Cannon is harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
28
SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

• C(i,j) is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is an n/P^(1/2) x b submatrix of A
• B(k,j) is a b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)*B(k,j)
• summation over submatrices
• Need not be a square processor grid

[Diagram: C(i,j) accumulates the product of column panel A(i,k) and row panel B(k,j)]
29
SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

For k = 0 to n/b - 1
    for all i = 1 to P^(1/2)
        owner of A(i,k) broadcasts it to whole processor row (using binary tree)
    for all j = 1 to P^(1/2)
        owner of B(k,j) broadcasts it to whole processor column (using binary tree)
    Receive A(i,k) into Acol
    Receive B(k,j) into Brow
    C_myproc = C_myproc + Acol * Brow
30
SUMMA Communication Costs

For k = 0 to n/b - 1
    for all i = 1 to P^(1/2)
        owner of A(i,k) broadcasts it to whole processor row (using binary tree)
        … #words = log P^(1/2) * b*n/P^(1/2), #messages = log P^(1/2)
    for all j = 1 to P^(1/2)
        owner of B(k,j) broadcasts it to whole processor column (using binary tree)
        … same #words and #messages
    Receive A(i,k) into Acol
    Receive B(k,j) into Brow
    C_myproc = C_myproc + Acol * Brow

° Total #words = log P * n^2 / P^(1/2)
° Within a factor of log P of the lower bound
° (a more complicated implementation removes the log P factor)
° Total #messages = log P * n/b
° Choose b close to its maximum, n/P^(1/2), to approach the lower bound P^(1/2)
31
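The loop above can be sketched serially in a few lines. This is an illustration of the arithmetic that SUMMA organizes, with the broadcasts replaced by array slicing; it is not an MPI implementation, and `summa_serial` is a made-up name:

```python
import numpy as np

def summa_serial(A, B, b):
    """Serial sketch of SUMMA's outer loop: C accumulates one rank-b
    update per step, from the column panel Acol = A(:, k:k+b) and the
    row panel Brow = B(k:k+b, :) that SUMMA would broadcast."""
    C = np.zeros((A.shape[0], B.shape[1]))
    for k in range(0, A.shape[1], b):   # n/b steps, as on the slide
        Acol = A[:, k:k+b]              # received along processor rows
        Brow = B[k:k+b, :]              # received along processor columns
        C += Acol @ Brow                # local update: C_myproc += Acol * Brow
    return C

rng = np.random.default_rng(0)
A, B = rng.random((8, 8)), rng.random((8, 8))
assert np.allclose(summa_serial(A, B, b=2), A @ B)
```

Larger b means fewer, larger "broadcasts", which is exactly the latency/bandwidth trade-off quantified above.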
Can we do better?

• Lower bound assumed 1 copy of data: M = O(n^2/P) per proc.
• What if the matrix is small enough to fit c > 1 copies, so M = c*n^2/P?
– #words_moved = Ω( #flops / M^(1/2) ) = Ω( n^2 / (c^(1/2) * P^(1/2)) )
– #messages = Ω( #flops / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )
• Can we attain the new lower bound?
– Special case: "3D Matmul", c = P^(1/3)
• Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
• Processors arranged in a P^(1/3) x P^(1/3) x P^(1/3) grid
• Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
– Not always that much memory available…
2.5D Matrix Multiplication

• Assume we can fit c*n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Diagram: processor grid of shape (P/c)^(1/2) x (P/c)^(1/2) x c]

Example: P = 32, c = 2
2.5D Matrix Multiplication

• Assume we can fit c*n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid, with axes i, j, k
Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n*(c/P)^(1/2) x n*(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along the k-axis so P(i,j,0) owns C(i,j)
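Steps (2) and (3) can be sketched serially: each of the c replication layers computes its 1/c-th of the sum over m, and the partial products are then reduced. This is an illustrative sketch only (`matmul_25d_serial` is a hypothetical helper, not the BG/P code):

```python
import numpy as np

def matmul_25d_serial(A, B, c):
    """Serial sketch of 2.5D matmul's arithmetic: split the m-range into
    c chunks (one per replication layer), form each layer's partial sum,
    then sum-reduce along the k-axis (step 3)."""
    bounds = np.linspace(0, A.shape[1], c + 1).astype(int)
    partials = [A[:, lo:hi] @ B[lo:hi, :]        # layer k's share of Σm A(i,m)B(m,j)
                for lo, hi in zip(bounds[:-1], bounds[1:])]
    return np.sum(partials, axis=0)              # the sum-reduce of step (3)

rng = np.random.default_rng(0)
A, B = rng.random((8, 8)), rng.random((8, 8))
assert np.allclose(matmul_25d_serial(A, B, c=2), A @ B)
```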
2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.

[Performance plots, annotated "12x faster" and "2.7x faster"]
Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P*M = 3n^2
• Increase P by a factor of c; total memory increases by a factor of c
• Notation for timing model:
– γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) * [ γ_T + β_T/M^(1/2) + α_T/(m*M^(1/2)) ] = T(P)/c
• Notation for energy model:
– γ_E, β_E, α_E = joules for the same operations
– δ_E = joules per word of memory used per sec
– ε_E = joules per sec for leakage, etc.
• E(cP) = cP * { n^3/(cP) * [ γ_E + β_E/M^(1/2) + α_E/(m*M^(1/2)) ] + δ_E*M*T(cP) + ε_E*T(cP) } = E(P)
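The timing model is easy to evaluate directly, which makes the perfect-strong-scaling claim T(cP) = T(P)/c concrete. The sketch below uses made-up machine constants purely for illustration; `t_model` is a hypothetical helper, not part of any library:

```python
def t_model(P, n, M, m, gamma_t, beta_t, alpha_t):
    """The slide's timing model:
    T(P) = (n^3/P) * ( gamma_T + beta_T/M^(1/2) + alpha_T/(m*M^(1/2)) ).
    With M (memory per processor) held fixed as P grows, T scales as 1/P."""
    return (n**3 / P) * (gamma_t + beta_t / M**0.5 + alpha_t / (m * M**0.5))

n = 4096
P = 64
M = 3 * n**2 / P            # start at minimal memory: P*M = 3n^2
m = M**0.5                  # message size
args = (n, M, m, 1e-10, 1e-9, 1e-6)   # made-up gamma_T, beta_T, alpha_T

# Adding processors while keeping M per processor fixed halves the time:
assert abs(t_model(2 * P, *args) - t_model(P, *args) / 2) < 1e-9
```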
Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
• What is the minimum energy required for a computation?
• Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
• Given a maximum energy budget E, what is the minimum runtime T that we can attain?
• The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
• Given an algorithm, problem size, number of processors, and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
39
Extending to the rest of Direct Linear Algebra: Naive Gaussian Elimination, A = LU

for i = 1 to n-1
    A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) )                      … scale a vector
    A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)   … rank-1 update

Equivalently:
for i = 1 to n-1
    update column i
    update trailing matrix
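The pseudocode above runs as written once translated to array slices. A minimal runnable version (no pivoting, so it assumes the leading principal minors are nonzero; diagonally dominant input below keeps that safe):

```python
import numpy as np

def lu_naive(A):
    """Naive Gaussian elimination, exactly the slide's two-line loop:
    scale a vector, then a rank-1 update of the trailing matrix."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] *= 1.0 / A[i, i]                           # scale a vector
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])     # rank-1 update
    return np.tril(A, -1) + np.eye(n), np.triu(A)             # L (unit diag), U

A = np.array([[4., 3., 2.], [2., 4., 1.], [1., 1., 3.]])      # diagonally dominant
L, U = lu_naive(A)
assert np.allclose(L @ U, A)
```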
Communication in sequential one-sided factorizations (LU, QR, …)

• Naive approach:
    for i = 1 to n-1
        update column i
        update trailing matrix
• #words_moved = O(n^3)

40

• Blocked approach (LAPACK):
    for i = 1 to n/b - 1
        update block i of b columns
        update trailing matrix
• #words_moved = O(n^3 / M^(1/3))

• Recursive approach:
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            update right half of A
            factor(right half of A)
• #words_moved = O(n^3 / M^(1/2))

• None of these approaches:
• minimizes #messages
• handles eig() or svd()
• works in parallel
• Need more ideas!
TSQR: QR of a Tall, Skinny matrix

41

W = [ W0; W1; W2; W3 ] = [ Q00*R00; Q10*R10; Q20*R20; Q30*R30 ]
  = diag(Q00, Q10, Q20, Q30) * [ R00; R10; R20; R30 ]

[ R00; R10 ] = Q01*R01 and [ R20; R30 ] = Q11*R11, so
[ R00; R10; R20; R30 ] = diag(Q01, Q11) * [ R01; R11 ]

[ R01; R11 ] = Q02*R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
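The binary reduction tree above is short to sketch if we only track the R factors (the implicit Q is the product of the per-node Q's, as the Output line shows). This is an illustrative sketch, not the communication-avoiding library code:

```python
import numpy as np

def tsqr_r(W, p=4):
    """Sketch of TSQR's binary reduction tree, returning the final R
    (R02 above): factor p row blocks locally, then repeatedly stack
    pairs of R's and re-factor until one R remains."""
    Rs = [np.linalg.qr(Wi, mode='r') for Wi in np.array_split(W, p)]  # R00..R30
    while len(Rs) > 1:                        # one tree level per pass: R01, R11; then R02
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='r')
              for i in range(0, len(Rs), 2)]
    return Rs[0]

rng = np.random.default_rng(0)
W = rng.random((1000, 4))                     # tall and skinny
R = tsqr_r(W)
assert np.allclose(R.T @ R, W.T @ W)          # R'R = W'W, so R is a valid QR triangle
assert np.allclose(R, np.triu(R))
```

The R factors returned by different trees can differ by row signs, so the check compares R^T R rather than R itself.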
TSQR: An Architecture-Dependent Algorithm

Parallel (binary tree): W0..W3 -> R00, R10, R20, R30 -> R01, R11 -> R02
Sequential (flat tree): fold W0..W3 in one at a time: R00 -> R01 -> R02 -> R03
Dual core (hybrid tree): a mix of the two, one subtree per core

Can choose reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core
TSQR Performance Results

• Parallel
– Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
– Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
– BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
– Tesla C2050 / Fermi: up to 13x (110,592 x 100)
– Grid – 4x on 4 cities (Dongarra et al)
– Cloud – 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
– "Infinite speedup" for out-of-core on PowerPC laptop
• As little as 2x slowdown vs (predicted) infinite DRAM
• LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

44

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Back to LU: Using a similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"

45

W (n x b) = [ W1; W2; W3; W4 ] = [ P1*L1*U1; P2*L2*U2; P3*L3*U3; P4*L4*U4 ]
Choose b pivot rows of W1, call them W1'
Choose b pivot rows of W2, call them W2'
Choose b pivot rows of W3, call them W3'
Choose b pivot rows of W4, call them W4'

[ W1'; W2' ] = P12*L12*U12; choose b pivot rows, call them W12'
[ W3'; W4' ] = P34*L34*U34; choose b pivot rows, call them W34'

[ W12'; W34' ] = P1234*L1234*U1234; choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to the top, do LU without pivoting
• Extra work, but a lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
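The tournament above can be sketched directly: each block nominates b candidate rows via LU with partial pivoting, and winners of pairwise rounds meet until b rows remain. Both function names below are made up for this sketch; it illustrates the pivot-selection tree only, not the full TSLU factorization:

```python
import numpy as np

def pivot_rows(W, b):
    """Select b pivot rows of W by LU with partial pivoting;
    returns their row indices (a helper for the sketch below)."""
    W = np.array(W, dtype=float)
    perm = np.arange(W.shape[0])
    for j in range(b):
        p = j + np.argmax(np.abs(W[j:, j]))   # largest entry in column j
        W[[j, p]] = W[[p, j]]                 # swap rows j and p
        perm[[j, p]] = perm[[p, j]]
        W[j+1:, j:] -= np.outer(W[j+1:, j] / W[j, j], W[j, j:])  # eliminate
    return perm[:b]

def tournament_pivoting(W, b, blocks=4):
    """Sketch of tournament pivoting: local candidates per block,
    then pairwise 'rounds' up the reduction tree."""
    cands = [idx[pivot_rows(W[idx], b)]
             for idx in np.array_split(np.arange(W.shape[0]), blocks)]
    while len(cands) > 1:                     # play one round of the tournament
        nxt = []
        for i in range(0, len(cands), 2):
            idx = np.concatenate(cands[i:i+2])        # stack Wi', Wj'
            nxt.append(idx[pivot_rows(W[idx], b)])    # keep b winners
        cands = nxt
    return cands[0]

rng = np.random.default_rng(0)
W = rng.random((32, 4))
piv = tournament_pivoting(W, b=4)
assert len(np.unique(piv)) == 4
```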
Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Contour plot: x-axis log2(p), y-axis log2(n^2/p) = log2(memory_per_proc)]

2.5D vs 2D LU, With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #messages use (recursive) block layout
• Not possible with columnwise or rowwise layouts
• Many references (see reports), only some shown, plus ours
• Cache-oblivious are underlined, green are ours, ? is unknown/future work

Algorithm | 2 Levels of Memory: #Words Moved / #Messages | Multiple Levels of Memory: #Words Moved / #Messages
BLAS-3 | Usual blocked or recursive algorithms | Usual blocked algorithms (nested) or recursive [Gustavson,97]
Cholesky | LAPACK (with b = M^(1/2)), [Gustavson,97], [BDHS09] / [Gustavson,97], [Ahmed,Pingali,00], [BDHS09] | (same) / (same)
LU with pivoting | LAPACK (sometimes), [Toledo,97], [GDX 08] / [GDX 08] with partial pivoting? | [Toledo,97], [GDX 08]? / [GDX 08]?
QR, rank-revealing | LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08], [Frens,Wise,03] but 3x flops? / [DGHL08] | [Elmroth,Gustavson,98], [DGHL08]? / [Frens,Wise,03], [DGHL08]?
Eig, SVD | Not LAPACK; [BDD11] randomized but more flops, [BDK11] | [BDD11,BDK11] / [BDD11,BDK11]
Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors
• Minimum memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Rows: Matrix Multiply, Cholesky, LU, QR, Sym Eig & SVD, Nonsym Eig (entries filled in on the following slides)
Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (conventional approach)
• Minimum memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 |
Cholesky | ScaLAPACK | log P |
LU | ScaLAPACK | log P |
QR | ScaLAPACK | log P |
Sym Eig, SVD | ScaLAPACK | log P |
Nonsym Eig | ScaLAPACK | P^(1/2) * log P |
Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (conventional approach)
• Minimum memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | ScaLAPACK | log P | ( n / P^(1/2) ) * log P
QR | ScaLAPACK | log P | ( n / P^(1/2) ) * log P
Sym Eig, SVD | ScaLAPACK | log P | n / P^(1/2)
Nonsym Eig | ScaLAPACK | P^(1/2) * log P | n * log P
Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (better)
• Minimum memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX10] | log P | log P
QR | [DGHL08] | log P | log^3 P
Sym Eig, SVD | [BDD11] | log P | log^3 P
Nonsym Eig | [BDD11] | log P | log^3 P
Can we do even better?

• Assume n x n matrices on P processors
• Use c copies of data: M = O(c*n^2 / P) per processor
• Increasing M reduces the lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2) * P^(1/2)) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [DS11, SBD11] | polylog P | polylog P
Cholesky | [SD11, in prog.] | polylog P | c^2 * polylog P – optimal!
LU | [DS11, SBD11] | polylog P | c^2 * polylog P – optimal!
QR | Via Cholesky QR | polylog P | c^2 * polylog P – optimal!
Sym Eig, SVD, Nonsym Eig | ? | |
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
How hard is hand-tuning matmul anyway
25
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
27
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS
bull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions
layouts and Cannon may use more memory
28
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
kk
B(kj)
C(ij)
29
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
kk
B(kj)
C(ij)
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
Brow
Acol
30
SUMMA Communication Costs
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
deg Total words = log P n2 P12
deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)
deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12
31
Can we do better
bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P
ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)
bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E
needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T
that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
39
Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU
for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update
for i=1 to n-1 update column i update trailing matrix
Communication in sequentialOne-sided Factorizations (LU QR hellip)
bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)
40
bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
27
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS
bull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions
layouts and Cannon may use more memory
28
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
kk
B(kj)
C(ij)
29
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
kk
B(kj)
C(ij)
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
Brow
Acol
30
SUMMA Communication Costs
For k = 0 to n/b - 1
   for all i = 1 to P^(1/2)
      owner of A(i,k) broadcasts it to whole processor row (using binary tree)
      … #words = log P^(1/2) · b·n/P^(1/2),  #messages = log P^(1/2)
   for all j = 1 to P^(1/2)
      owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
      … same #words and #messages
   Receive A(i,k) into Acol
   Receive B(k,j) into Brow
   C_myproc = C_myproc + Acol * Brow

° Total #words = log P · n^2 / P^(1/2)
  ° Within a factor of log P of the lower bound
  ° (a more complicated implementation removes the log P factor)
° Total #messages = log P · n/b
  ° Choose b close to its maximum, n/P^(1/2), to approach the lower bound P^(1/2)
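Plugging numbers into the counts above makes the tradeoff concrete. A small helper (a sketch; `summa_costs` is a hypothetical name, not part of any library) evaluates the two totals, confirming that at the maximum block size b = n/P^(1/2) the message count is exactly log P · P^(1/2):

```python
import math

def summa_costs(n, P, b):
    """Total communication of the SUMMA loop:
    #words    = log P * n^2 / P^(1/2)   (independent of b)
    #messages = log P * n / b           (shrinks as b grows)."""
    words = math.log2(P) * n * n / math.sqrt(P)
    messages = math.log2(P) * n / b
    return words, messages

n, P = 2**20, 2**12
b_max = n // int(math.sqrt(P))            # largest block size: b = n / P^(1/2)
words, msgs = summa_costs(n, P, b_max)
# with b = n/P^(1/2), #messages = log P * P^(1/2), i.e. log P above the bound:
assert msgs == math.log2(P) * math.sqrt(P)
```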
31
Can we do better?

• Lower bound assumed 1 copy of data: M = O(n^2/P) per proc
• What if the matrix is small enough to fit c > 1 copies, so M = c·n^2/P?
  – #words_moved = Ω( #flops / M^(1/2) ) = Ω( n^2 / (c^(1/2) · P^(1/2)) )
  – #messages = Ω( #flops / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )
• Can we attain the new lower bound?
  – Special case: "3D Matmul", c = P^(1/3)
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in a P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)·B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – Not always that much memory available…
2.5D Matrix Multiplication

• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid
  (Example: P = 32, c = 2)

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^(1/2) x n·(c/P)^(1/2)
(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)·B(m,j) along the k-axis, so P(i,j,0) owns C(i,j)
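Steps (1)-(3) can be emulated serially: a NumPy sketch (the c grid layers become c chunks of the summation index, and the final sum-reduce along the k-axis becomes a plain sum of partial products):

```python
import numpy as np

def matmul_25d_serial(A, B, c):
    """Serial emulation of 2.5D matmul: the inner dimension is split
    into c chunks; 'layer' k of the processor grid computes its
    1/c-th of sum_m A(:,m) B(m,:), and the partial products are then
    sum-reduced (step 3 on the slide)."""
    n = A.shape[0]
    chunk = n // c
    partials = []
    for k in range(c):                       # one partial sum per layer
        lo, hi = k * chunk, (k + 1) * chunk
        partials.append(A[:, lo:hi] @ B[lo:hi, :])
    return sum(partials)                     # reduction along the k-axis

n, c = 60, 3
rng = np.random.default_rng(1)
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
assert np.allclose(matmul_25d_serial(A, B, c), A @ B)
```

In the real algorithm each layer runs 1/c-th of SUMMA on replicated inputs; the extra c-fold memory is what buys the reduced communication.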
2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

Distinguished Paper Award, EuroPar'11 (SC'11 paper by Solomonik, Bhatele, D.)

[Plot annotations: "12x faster", "2.7x faster"]
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γ_E, β_E, α_E = joules for the same operations
  – δ_E = joules per word of memory used per sec
  – ε_E = joules per sec for leakage, etc.
• E(cP) = cP · { n^3/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] } + δ_E·M·T(cP) + ε_E·T(cP) = E(P)
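The timing model can be exercised numerically. The sketch below plugs in hypothetical machine constants (illustrative only, not measurements) and checks the slide's claim that, with per-processor memory M held fixed, T(cP) = T(P)/c:

```python
def T(P, n, gamma, beta, alpha, m, M):
    """Runtime model from the slide: flops term + words term + messages
    term, each scaled by n^3/P work per processor."""
    return (n**3 / P) * (gamma + beta / M**0.5 + alpha / (m * M**0.5))

# hypothetical machine constants, chosen only to exercise the formula
gamma, beta, alpha, m = 1e-11, 1e-9, 1e-6, 1024
n, P = 2**14, 2**10
M = 3 * n**2 / P                  # minimal-memory start: P*M = 3n^2
for c in (2, 4, 8):
    # perfect strong scaling: c times the processors, 1/c the time,
    # because M (and so the per-word and per-message terms) is unchanged
    assert abs(T(c * P, n, gamma, beta, alpha, m, M)
               - T(P, n, gamma, beta, alpha, m, M) / c) < 1e-15
```

The same substitution into E(cP) shows why energy stays constant: the cP factor in front cancels the 1/(cP) inside, while the δ_E and ε_E terms shrink with T(cP).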
Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K
Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors, and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
39
Extending to the rest of Direct Linear Algebra: Naïve Gaussian Elimination, A = LU

for i = 1 to n-1
   A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) )                       … scale a vector
   A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)    … rank-1 update

i.e., for i = 1 to n-1: update column i, then update the trailing matrix
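The two-line loop above transcribes directly into code. A minimal NumPy sketch (no pivoting, so it assumes nonzero pivots; for illustration only):

```python
import numpy as np

def lu_naive(A):
    """Naive Gaussian elimination A = L*U without pivoting, following
    the loop above: scale column i, then apply a rank-1 update to the
    trailing matrix."""
    A = A.copy()
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] *= 1.0 / A[i, i]                          # scale a vector
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])    # rank-1 update
    L = np.tril(A, -1) + np.eye(n)   # multipliers below the diagonal
    U = np.triu(A)                   # eliminated matrix above it
    return L, U

A = np.array([[4., 3., 2.], [6., 3., 1.], [8., 5., 9.]])
L, U = lu_naive(A)
assert np.allclose(L @ U, A)
```

Each iteration reads and writes the whole trailing matrix, which is exactly why the naive approach moves O(n^3) words.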
40

Communication in sequential One-sided Factorizations (LU, QR, …)

• Naive Approach:
     for i = 1 to n-1
        update column i
        update trailing matrix
  • #words_moved = O(n^3)
• Blocked Approach (LAPACK):
     for i = 1 to n/b - 1
        update block i of b columns
        update trailing matrix
  • #words_moved = O(n^3 / M^(1/3))
• Recursive Approach:
     func factor(A)
        if A has 1 column, update it
        else
           factor(left half of A)
           update right half of A
           factor(right half of A)
  • #words_moved = O(n^3 / M^(1/2))
• None of these approaches
  • minimizes #messages
  • handles eig() or svd()
  • works in parallel
• Need more ideas!
41

TSQR: QR of a Tall, Skinny matrix

W = [ W0; W1; W2; W3 ] = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]

[ R00; R10 ] = Q01·R01,   [ R20; R30 ] = Q11·R11
so  [ R00; R10; R20; R30 ] = diag(Q01, Q11) · [ R01; R11 ]

[ R01; R11 ] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
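A compact NumPy sketch of the reduction (it keeps only the R factors; for simplicity the second level stacks all four local R's at once, a flat tree, rather than the pairwise binary tree shown above):

```python
import numpy as np

def tsqr_r(W, p=4):
    """Two-level TSQR sketch on a tall-skinny W: QR-factor each of p
    row blocks, stack the small triangular factors, and factor the
    stack again. Returns the final b x b triangular factor."""
    blocks = np.array_split(W, p, axis=0)
    Rs = [np.linalg.qr(Wi)[1] for Wi in blocks]   # local QRs: Wi = Qi0 Ri0
    stacked = np.vstack(Rs)                       # p*b x b, cheap to reduce
    return np.linalg.qr(stacked)[1]               # flat second level

rng = np.random.default_rng(2)
W = rng.standard_normal((4000, 50))               # tall and skinny
R_tsqr = tsqr_r(W)
R_ref = np.linalg.qr(W)[1]
# R of a full-column-rank matrix is unique up to the sign of each row:
assert np.allclose(np.abs(R_tsqr), np.abs(R_ref))
```

The point of the tree shape is communication: each parallel step only ships b x b triangles, so the tall dimension of W is touched exactly once.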
TSQR: An Architecture-Dependent Algorithm

Parallel (binary reduction tree):
   W = [ W0; W1; W2; W3 ] → ( R00, R10, R20, R30 ) → ( R01, R11 ) → R02

Sequential (flat reduction tree):
   W0 → R00;  [ R00; W1 ] → R01;  [ R01; W2 ] → R02;  [ R02; W3 ] → R03

Dual Core (hybrid tree):
   each core reduces its blocks sequentially (R00 → R01 → …, R11 → …), and the two cores' results are combined (→ R02 → R03)

Can choose reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core
TSQR Performance Results

• Parallel
  – Intel Clovertown
    • Up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Tesla C 2050 / Fermi
    • Up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
44
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Back to LU: Use a similar idea for TSLU as for TSQR: use a reduction tree to do "Tournament Pivoting"
45
W (n x b) = [ W1; W2; W3; W4 ]
   W1 = P1·L1·U1 → choose b pivot rows of W1, call them W1'
   W2 = P2·L2·U2 → choose b pivot rows of W2, call them W2'
   W3 = P3·L3·U3 → choose b pivot rows of W3, call them W3'
   W4 = P4·L4·U4 → choose b pivot rows of W4, call them W4'

[ W1'; W2' ] = P12·L12·U12 → choose b pivot rows, call them W12'
[ W3'; W4' ] = P34·L34·U34 → choose b pivot rows, call them W34'

[ W12'; W34' ] = P1234·L1234·U1234 → choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to the top, then do LU without pivoting
• Extra work, but a lower-order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
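The tournament can be sketched in NumPy. The helpers below are hypothetical names, not library calls: `gepp_pivot_rows` runs b steps of LU with partial pivoting to pick candidate rows, and (for brevity) the final round stacks all candidates at once instead of playing the pairwise bracket shown above:

```python
import numpy as np

def gepp_pivot_rows(W):
    """Indices of the rows of a tall m x b block W chosen as pivots by
    b steps of LU with partial pivoting."""
    W = W.astype(float).copy()
    m, b = W.shape
    rows = np.arange(m)
    for i in range(b):
        p = i + np.argmax(np.abs(W[i:, i]))          # partial pivot
        W[[i, p]] = W[[p, i]]
        rows[[i, p]] = rows[[p, i]]
        W[i+1:, i] /= W[i, i]
        W[i+1:, i+1:] -= np.outer(W[i+1:, i], W[i, i+1:])
    return rows[:b]

def tournament_pivot_rows(W, nblocks=4):
    """Tournament pivoting sketch: pick b candidate rows per block,
    then run the same selection on the stacked candidates (a flat
    final round instead of the binary tree on the slide)."""
    blocks = np.array_split(np.arange(W.shape[0]), nblocks)
    candidates = np.concatenate(
        [blocks[k][gepp_pivot_rows(W[blocks[k]])] for k in range(nblocks)])
    winners = gepp_pivot_rows(W[candidates])
    return candidates[winners]

rng = np.random.default_rng(3)
W = rng.standard_normal((400, 8))
piv = tournament_pivot_rows(W)
assert len(piv) == 8 and len(set(piv.tolist())) == 8   # b distinct rows
```

Each round only exchanges b candidate rows, so the tall dimension of W is read once, exactly the property TSQR exploits.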
Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Plot: x-axis log2(p); y-axis log2(n^2/p) = log2(memory_per_proc)]
2.5D vs 2D LU, With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #Messages use (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports); only some shown, plus ours
  • Cache-oblivious are underlined; green are ours; ? is unknown/future work

Algorithm           | 2 Levels of Memory: #Words Moved                        | and #Messages                                  | Multiple Levels of Memory: #Words Moved        | and #Messages
BLAS-3              | Usual blocked or recursive algorithms                   |                                                | Usual blocked algorithms (nested) or recursive [Gustavson,97] |
Cholesky            | LAPACK (with b = M^(1/2)), [Gustavson 97], [BDHS09]     | [Gustavson,97], [Ahmed,Pingali,00], [BDHS09]   | (←same)                                        | (←same)
LU with pivoting    | LAPACK (sometimes), [Toledo,97], [GDX 08]               | [GDX 08], not partial pivoting                 | [Toledo,97], [GDX 08]                          | [GDX 08]
QR / Rank-revealing | LAPACK (sometimes), [ElmrothGustavson,98], [DGHL08]     | [FrensWise,03] but 3x flops, [DGHL08]          | [ElmrothGustavson,98], [DGHL08]                | [FrensWise,03], [DGHL08]
Eig, SVD            | Not LAPACK; [BDD11] randomized, but more flops; [BDK11] | [BDD11, BDK11]                                 | [BDD11, BDK11]                                 |
Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (conventional approach)
• Minimum Memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm       | Reference    | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1                | 1
Cholesky        | ScaLAPACK    | log P            | log P
LU              | ScaLAPACK    | log P            | (n · log P) / P^(1/2)
QR              | ScaLAPACK    | log P            | (n · log P) / P^(1/2)
Sym Eig, SVD    | ScaLAPACK    | log P            | n / P^(1/2)
Nonsym Eig      | ScaLAPACK    | P^(1/2) · log P  | n · log P
Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (better)
• Minimum Memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm       | Reference    | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1     | 1
Cholesky        | ScaLAPACK    | log P | log P
LU              | [GDX10]      | log P | log P
QR              | [DGHL08]     | log P | log^3 P
Sym Eig, SVD    | [BDD11]      | log P | log^3 P
Nonsym Eig      | [BDD11]      | log P | log^3 P
Can we do even better?

• Assume n x n matrices on P processors
• Use c copies of data: M = O(c·n^2 / P) per processor
• Increasing M reduces lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2) · P^(1/2)) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm           | Reference        | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply     | [DS11, SBD11]    | polylog P | polylog P
Cholesky            | [SD11, in prog.] | polylog P | c^2 · polylog P – optimal!
LU                  | [DS11, SBD11]    | polylog P | c^2 · polylog P – optimal!
QR                  | Via Cholesky QR  | polylog P | c^2 · polylog P – optimal!
Sym Eig, SVD        |                  |           |
Nonsym Eig          |                  |           |
Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less!
[Figure: Successive Band Reduction, shown as an animation. Parameters: b = bandwidth, c = #columns, d = #diagonals, constraint: c + d ≤ b. Starting from a band of half-width b+1, each step annihilates a parallelogram of c columns and d diagonals (step 1, by Q1), creating a bulge of width d+c that is chased down the band by further orthogonal transformations Q1^T, Q2, Q2^T, …, Q5, Q5^T (steps 2-6). Only need to zero out the leading parallelogram of each trapezoid.]
Conventional vs CA-SBR

Conventional: touch all data 4 times.  Communication-Avoiding: touch all data once.
Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge. The right choices reduce #words_moved by a factor M/bw, not just M^(1/2).
28
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
kk
B(kj)
C(ij)
29
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
kk
B(kj)
C(ij)
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
Brow
Acol
30
SUMMA Communication Costs
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
deg Total words = log P n2 P12
deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)
deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12
31
Can we do better
bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P
ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)
bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E
needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T
that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
39
Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU
for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update
for i=1 to n-1 update column i update trailing matrix
Communication in sequentialOne-sided Factorizations (LU QR hellip)
bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)
40
bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR: QR of a Tall, Skinny matrix
41

W = [ W0 ; W1 ; W2 ; W3 ]

Step 1 (local QRs):      W0 = Q00·R00,  W1 = Q10·R10,  W2 = Q20·R20,  W3 = Q30·R30
  so W = diag(Q00, Q10, Q20, Q30) · [ R00 ; R10 ; R20 ; R30 ]
Step 2 (combine pairs):  [ R00 ; R10 ] = Q01·R01,  [ R20 ; R30 ] = Q11·R11
  so [ R00 ; R10 ; R20 ; R30 ] = diag(Q01, Q11) · [ R01 ; R11 ]
Step 3 (root):           [ R01 ; R11 ] = Q02·R02
TSQR: QR of a Tall, Skinny matrix (continued)
42

The same binary-tree reduction, carried to completion:
Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
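The reduction tree above can be sketched with NumPy's local QR as the building block; `tsqr` is an illustrative name, and this sketch assumes a power-of-two number of row blocks and keeps only R (a real implementation also stores the tree of Q factors):

```python
import numpy as np

def tsqr(W, block_rows):
    """Binary-tree TSQR sketch: factor each row block locally, then
    combine R factors pairwise until one R remains. Only R is returned;
    the Q00..Q02 factors of the tree are discarded here."""
    blocks = [W[i:i + block_rows] for i in range(0, W.shape[0], block_rows)]
    Rs = [np.linalg.qr(b)[1] for b in blocks]          # leaf QRs: Wi = Qi0*Ri0
    while len(Rs) > 1:                                 # combine pairs up the tree
        Rs = [np.linalg.qr(np.vstack(pair))[1]
              for pair in zip(Rs[0::2], Rs[1::2])]
    return Rs[0]

rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 10))                    # tall and skinny
R = tsqr(W, 1024)
# R has the same Gram matrix as W, so it matches QR(W)'s R up to column signs
assert np.allclose(R.T @ R, W.T @ W)
```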
TSQR: An Architecture-Dependent Algorithm

• Parallel (binary tree): W = [W0; W1; W2; W3] → { R00, R10, R20, R30 } → { R01, R11 } → R02
• Sequential (flat tree): W0 → R00; [R00; W1] → R01; [R01; W2] → R02; [R02; W3] → R03
• Dual core (hybrid tree): each core runs a flat-tree TSQR on its half of the blocks, and the two resulting R factors are combined at the end

Can choose reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core
TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid: 4x on 4 cities (Dongarra et al.)
  – Cloud: ~1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
44
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Back to LU: Using a similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"
45

W (n x b) = [ W1 ; W2 ; W3 ; W4 ]

Step 1: W1 = P1·L1·U1, choose b pivot rows of W1, call them W1'
        W2 = P2·L2·U2, choose b pivot rows of W2, call them W2'
        W3 = P3·L3·U3, choose b pivot rows of W3, call them W3'
        W4 = P4·L4·U4, choose b pivot rows of W4, call them W4'
Step 2: [ W1' ; W2' ] = P12·L12·U12, choose b pivot rows, call them W12'
        [ W3' ; W4' ] = P34·L34·U34, choose b pivot rows, call them W34'
Step 3: [ W12' ; W34' ] = P1234·L1234·U1234, choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to the top, do LU without pivoting
• Extra work, but a lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
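The tournament itself can be sketched as follows; `local_pivots` and `tournament_pivot_rows` are hypothetical names, and the sketch assumes each block has at least b rows and that the local eliminations meet no zero pivots:

```python
import numpy as np

def local_pivots(W, idx, b):
    """Run Gaussian elimination with partial pivoting on the rows W[idx]
    and return the indices (into W) of the first b pivot rows chosen."""
    A = W[idx].astype(float)            # local copy of this block
    rows = np.array(idx).copy()
    for j in range(b):
        p = j + np.argmax(np.abs(A[j:, j]))                 # partial pivoting
        A[[j, p]] = A[[p, j]]
        rows[[j, p]] = rows[[p, j]]
        A[j+1:, j] /= A[j, j]
        A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])
    return rows[:b]

def tournament_pivot_rows(W, b, nblocks=4):
    """Tournament pivoting (TSLU) sketch: each block proposes b candidate
    pivot rows via local GEPP; candidates play off pairwise in a binary
    reduction tree until b rows remain."""
    cands = [local_pivots(W, blk, b)
             for blk in np.array_split(np.arange(W.shape[0]), nblocks)]
    while len(cands) > 1:
        cands = [local_pivots(W, np.concatenate(cands[k:k + 2]), b)
                 for k in range(0, len(cands), 2)]
    return cands[0]

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 4))
piv = tournament_pivot_rows(W, 4)
assert len(piv) == 4 and len(set(piv.tolist())) == 4   # b distinct pivot rows
```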
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Figure: predicted-speedup contours; x-axis: log2(p), y-axis: log2(n^2/p) = log2(memory_per_proc)]
2.5D vs 2D LU, With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
• Algorithms shown minimizing Messages use (recursive) block layout
  – Not possible with columnwise or rowwise layouts
• Many references (see reports); only some shown, plus ours
• Cache-oblivious are underlined, Green are ours, ? is unknown/future work

Algorithm: [2 Levels of Memory – Words Moved; Messages] / [Multiple Levels of Memory – Words Moved; Messages]

• BLAS-3: Usual blocked or recursive algorithms / Usual blocked algorithms (nested) or recursive [Gustavson,97]
• Cholesky: LAPACK (with b = M^(1/2)), [Gustavson,97], [BDHS09]; [Gustavson,97], [AhmedPingali,00], [BDHS09] / (←same); (←same)
• LU with pivoting: LAPACK (sometimes), [Toledo,97], [GDX,08]; [GDX,08] (partial pivoting?) / [Toledo,97], [GDX,08]; [GDX,08]
• QR, rank-revealing: LAPACK (sometimes), [ElmrothGustavson,98], [DGHL08], [FrensWise,03] (but 3x flops); [DGHL08] / [ElmrothGustavson,98], [DGHL08], [FrensWise,03]; [DGHL08]
• Eig, SVD: Not LAPACK; [BDD11] (randomized, but more flops), [BDK11]; [BDD11, BDK11] / [BDD11, BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors (conventional approach)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm       | Reference    | Factor exceeding lower bound for words_moved | Factor exceeding lower bound for messages
Matrix Multiply | [Cannon, 69] | 1               | 1
Cholesky        | ScaLAPACK    | log P           | log P
LU              | ScaLAPACK    | log P           | (n/P^(1/2)) · log P
QR              | ScaLAPACK    | log P           | (n/P^(1/2)) · log P
Sym Eig, SVD    | ScaLAPACK    | log P           | n/P^(1/2)
Nonsym Eig      | ScaLAPACK    | P^(1/2) · log P | n · log P
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors (better)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm       | Reference    | Factor exceeding lower bound for words_moved | Factor exceeding lower bound for messages
Matrix Multiply | [Cannon, 69] | 1     | 1
Cholesky        | ScaLAPACK    | log P | log P
LU              | [GDX10]      | log P | log P
QR              | [DGHL08]     | log P | log^3 P
Sym Eig, SVD    | [BDD11]      | log P | log^3 P
Nonsym Eig      | [BDD11]      | log P | log^3 P
Can we do even better?
• Assume n x n matrices on P processors
• Use c copies of data: M = O(c·n^2 / P) per processor
• Increasing M reduces the lower bounds:
  words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2)·P^(1/2)) )
  messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm       | Reference        | Factor exceeding lower bound for words_moved | Factor exceeding lower bound for messages
Matrix Multiply | [DS11, SBD11]    | polylog P | polylog P
Cholesky        | [SD11, in prog.] | polylog P | c^2 · polylog P – optimal!
LU              | [DS11, SBD11]    | polylog P | c^2 · polylog P – optimal!
QR              | via Cholesky QR  | polylog P | c^2 · polylog P – optimal!
Sym Eig, SVD    | ?                |           |
Nonsym Eig      | ?                |           |
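For concreteness, the replicated-data lower bounds above can be evaluated directly; this is a small sketch with illustrative function names:

```python
# The lower bounds in the table above, as functions of n, P, and the
# replication factor c (so M = c*n^2/P words of memory per processor).

def words_moved_lb(n, P, c=1):
    M = c * n**2 / P
    return (n**3 / P) / M**0.5           # = n^2 / (c^(1/2) * P^(1/2))

def messages_lb(n, P, c=1):
    M = c * n**2 / P
    return (n**3 / P) / M**1.5           # = P^(1/2) / c^(3/2)

# Replicating the data c = 4 times halves the bandwidth lower bound
# and cuts the latency lower bound by c^(3/2) = 8:
n, P = 2**14, 2**10
assert abs(words_moved_lb(n, P, 4) - words_moved_lb(n, P) / 2) < 1e-3
assert abs(messages_lb(n, P, 4) - messages_lb(n, P) / 8) < 1e-9
```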
Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → Q·A·Q^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as the second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy the communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less!
Successive Band Reduction

[Figure: animation of successive band reduction on a symmetric matrix of bandwidth b. Orthogonal transformations Q1, Q1^T, Q2, Q2^T, Q3, Q3^T, … are applied to numbered regions 1, 2, 3, …: each Qi zeroes a (d+1) x c parallelogram of the band, creating a (d+c) x (d+c) bulge that later transformations chase down the band.]

b = bandwidth, c = #columns, d = #diagonals. Constraint: c + d ≤ b.
Only need to zero out the leading parallelogram of each trapezoid.
Conventional vs CA-SBR

Conventional: touch all data 4 times.
Communication-Avoiding: touch all data once. Many tuning parameters: number of "sweeps", diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge. The right choices reduce words_moved by a factor of M/bw, not just M^(1/2).
29
SUMMA – n x n matmul on a P^(1/2) x P^(1/2) processor grid

[Figure: C(i,j) = Σ_k A(i,k)·B(k,j); the column panel Acol is broadcast along each processor row, and the row panel Brow along each processor column.]

For k = 0 to n/b - 1
    for all i = 1 to P^(1/2)
        owner of A(i,k) broadcasts it to whole processor row (using binary tree)
    for all j = 1 to P^(1/2)
        owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
    Receive A(i,k) into Acol
    Receive B(k,j) into Brow
    C_myproc = C_myproc + Acol · Brow
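A serial simulation of the SUMMA loop (one Python process playing all the processors) makes the structure concrete; `summa` is an illustrative name, and the sketch assumes both b and P^(1/2) divide n:

```python
import numpy as np

def summa(A, B, p_sqrt, b):
    """Serial simulation of SUMMA on a p_sqrt x p_sqrt grid: at step k,
    a b-column panel of A is 'broadcast' along processor rows and a
    b-row panel of B along processor columns; every processor then does
    a local rank-b update of its C tile."""
    n = A.shape[0]
    C = np.zeros((n, n))
    blk = n // p_sqrt                     # edge of each processor's C tile
    for k in range(0, n, b):
        Acol = A[:, k:k + b]              # broadcast along processor rows
        Brow = B[k:k + b, :]              # broadcast along processor columns
        for i in range(p_sqrt):
            for j in range(p_sqrt):
                I = slice(i * blk, (i + 1) * blk)
                J = slice(j * blk, (j + 1) * blk)
                C[I, J] += Acol[I] @ Brow[:, J]   # local C += Acol * Brow
    return C

rng = np.random.default_rng(2)
A, B = rng.standard_normal((2, 64, 64))
assert np.allclose(summa(A, B, p_sqrt=4, b=16), A @ B)
```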
30
SUMMA Communication Costs

For k = 0 to n/b - 1
    for all i = 1 to P^(1/2)
        owner of A(i,k) broadcasts it to whole processor row (using binary tree)
        … words = log(P^(1/2)) · b·n/P^(1/2),  messages = log(P^(1/2))
    for all j = 1 to P^(1/2)
        owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
        … same #words and #messages
    Receive A(i,k) into Acol
    Receive B(k,j) into Brow
    C_myproc = C_myproc + Acol · Brow

° Total #words = log P · n^2 / P^(1/2)
  ° Within a factor of log P of the lower bound
  ° (a more complicated implementation removes the log P factor)
° Total #messages = log P · n/b
  ° Choose b close to its maximum, n/P^(1/2), to approach the lower bound P^(1/2)
31
Can we do better?
• Lower bound assumed 1 copy of data: M = O(n^2/P) per proc.
• What if the matrix is small enough to fit c > 1 copies, so M = c·n^2/P?
  – words_moved = Ω( #flops / M^(1/2) ) = Ω( n^2 / (c^(1/2)·P^(1/2)) )
  – messages    = Ω( #flops / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )
• Can we attain the new lower bound?
  – Special case: "3D Matmul", c = P^(1/3)
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in a P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)·B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – Not always that much memory available…
2.5D Matrix Multiplication
• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure: the 3D processor grid, with edge lengths (P/c)^(1/2), (P/c)^(1/2), and c. Example: P = 32, c = 2.]
2.5D Matrix Multiplication
• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid, indexed by (i, j, k)

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^(1/2) x n·(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σ_m A(i,m)·B(m,j) along the k-axis so that P(i,j,0) owns C(i,j)
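The three steps can be simulated serially; `matmul_25d` is an illustrative name, and the sketch assumes c divides the grid edge and the edge divides n:

```python
import numpy as np

def matmul_25d(A, B, P, c):
    """Serial simulation of 2.5D matmul on a (P/c)^(1/2) x (P/c)^(1/2) x c
    grid: each of the c layers computes 1/c of the sum over m (step 2),
    then the layers are sum-reduced along the k-axis (step 3)."""
    g = int(round((P // c) ** 0.5))          # grid edge, (P/c)^(1/2)
    n = A.shape[0]
    blk = n // g
    partial = np.zeros((c, n, n))
    for layer in range(c):                   # each layer: 1/c-th of SUMMA
        for m in range(layer * g // c, (layer + 1) * g // c):
            M = slice(m * blk, (m + 1) * blk)
            partial[layer] += A[:, M] @ B[M, :]
    return partial.sum(axis=0)               # sum-reduce along the k-axis

rng = np.random.default_rng(3)
A, B = rng.standard_normal((2, 64, 64))
assert np.allclose(matmul_25d(A, B, P=32, c=2), A @ B)
```

The simulation shows why replication helps: each layer touches only 1/c of the m-panels, so per-"processor" traffic drops, at the price of one broadcast and one reduction along the third grid axis.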
2.5D Matmul on BG/P, 16K nodes / 64K cores

[Figure: execution-time breakdown with c = 16 copies; annotated speedups vs 2D: 12x faster, 2.7x faster.]

Distinguished Paper Award, EuroPar'11 (SC'11 paper by Solomonik, Bhatele, D.)
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E
needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T
that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
39
Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU
for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update
for i=1 to n-1 update column i update trailing matrix
Communication in sequentialOne-sided Factorizations (LU QR hellip)
bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)
40
bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
30
SUMMA Communication Costs
For k=0 to nb-1 for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow
deg Total words = log P n2 P12
deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)
deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12
31
Can we do better
bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P
ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)
bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E
needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T
that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
39
Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU
for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update
for i=1 to n-1 update column i update trailing matrix
Communication in sequentialOne-sided Factorizations (LU QR hellip)
bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)
40
bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors
• Minimum memory per processor: M = O(n^2 / P)
• Recall the lower bounds:
    words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages   = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Factors by which each algorithm exceeds the lower bounds, conventional approach:

Algorithm         Reference      words_moved      #messages
Matrix Multiply   [Cannon, 69]   1                1
Cholesky          ScaLAPACK      log P            log P
LU                ScaLAPACK      log P            (n/P^(1/2)) log P
QR                ScaLAPACK      log P            (n/P^(1/2)) log P
Sym Eig, SVD      ScaLAPACK      log P            n/P^(1/2)
Nonsym Eig        ScaLAPACK      P^(1/2) log P    n log P

Better algorithms:

Algorithm         Reference      words_moved      #messages
Matrix Multiply   [Cannon, 69]   1                1
Cholesky          ScaLAPACK      log P            log P
LU                [GDX10]        log P            log P
QR                [DGHL08]       log P            log^3 P
Sym Eig, SVD      [BDD11]        log P            log^3 P
Nonsym Eig        [BDD11]        log P            log^3 P
Can we do even better?

• Assume n x n matrices on P processors
• Use c copies of the data: M = O(c·n^2 / P) per processor
• Increasing M reduces the lower bounds:
    words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2) P^(1/2)) )
    #messages   = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm         Reference          words_moved    #messages
Matrix Multiply   [DS11, SBD11]      polylog P      polylog P
Cholesky          [SD11, in prog.]   polylog P      c^2 polylog P – optimal?
LU                [DS11, SBD11]      polylog P      c^2 polylog P – optimal?
QR                Via Cholesky QR?   polylog P      c^2 polylog P – optimal?
Sym Eig, SVD      ?
Nonsym Eig        ?
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
[Figure: Successive Band Reduction (bulge chasing). Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b. Steps 1–6 apply orthogonal transformations Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T to zero a c-column, d-diagonal parallelogram of the band and chase the resulting (d+c)-sized bulges down the band. Only need to zero out the leading parallelogram of each trapezoid.]
Conventional vs CA-SBR

Conventional: touch all data ~4 times.
Communication-avoiding: touch all data once. Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge. The right choices reduce words_moved by a factor of M/bw, not just M^(1/2).
Can we do better?

• The lower bound assumed 1 copy of the data: M = O(n^2/P) per processor.
• What if the matrix is small enough to fit c > 1 copies, so M = c·n^2/P?
  – words_moved = Ω( #flops / M^(1/2) ) = Ω( n^2 / (c^(1/2) P^(1/2)) )
  – #messages   = Ω( #flops / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )
• Can we attain the new lower bound?
  – Special case: "3D Matmul", c = P^(1/3)
    • Dekel, Nassimi, Sahni [81]; Bernsten [89]; Agarwal, Chandra, Snir [90]; Johnson [93]; Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in a P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)·B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – Not always that much memory available…
2.5D Matrix Multiplication

• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid (example: P = 32, c = 2)

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^(1/2) x n·(c/P)^(1/2).

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m)·B(m,j)
(3) Sum-reduce the partial sums Σ_m A(i,m)·B(m,j) along the k-axis, so P(i,j,0) owns C(i,j)
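The data flow of steps (2) and (3) can be sketched serially in plain Python: a hypothetical `layered_matmul` splits the inner dimension across c "layers" (step 2) and then sum-reduces the layer partials (step 3). The broadcast of step (1) has no serial analogue; this is an illustrative model of the arithmetic split, not the parallel algorithm itself.

```python
# Serial toy model of the 2.5D partial-sum idea: each of the c processor
# layers multiplies a 1/c slice of the inner dimension, and the layer
# results are sum-reduced along the k-axis.

def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def layered_matmul(A, B, c):
    n = len(A)
    assert n % c == 0
    chunk = n // c
    # Step (2): layer k computes its 1/c-th of sum_m A[i][m]*B[m][j].
    partials = []
    for k in range(c):
        lo, hi = k * chunk, (k + 1) * chunk
        A_slice = [row[lo:hi] for row in A]   # columns lo:hi of A
        B_slice = B[lo:hi]                    # rows lo:hi of B
        partials.append(matmul(A_slice, B_slice))
    # Step (3): sum-reduce the partial products along the k-axis.
    return [[sum(P[i][j] for P in partials) for j in range(n)]
            for i in range(n)]
```

Each layer does n^3/c of the flops; the reduction adds one collective of size n^2/(P/c) per processor in the real algorithm.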
2.5D Matmul on BG/P, 16K nodes / 64K cores (c = 16 copies)

Distinguished Paper Award, EuroPar'11 (SC'11 paper by Solomonik, Bhatele, D.)
[Plot annotations: 12x faster, 2.7x faster]
Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with the minimal number of processors: P·M = 3n^2
• Increase P by a factor of c; total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for the same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
• E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
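As a sanity check, both identities can be verified numerically. The constants below (the γ, β, α, δ, ε values and problem sizes) are arbitrary placeholders chosen for illustration, not measured machine parameters.

```python
# Numeric sanity check of T(cP) = T(P)/c and E(cP) = E(P).
# All constants below are arbitrary placeholders, not measured values.

def T(P, n, M, m, gamma, beta, alpha):
    """Timing model: T(P) = n^3/P * [gamma + beta/M^(1/2) + alpha/(m*M^(1/2))]."""
    return n**3 / P * (gamma + beta / M**0.5 + alpha / (m * M**0.5))

def E(P, n, M, m, g, b, a, delta, eps, t):
    """Energy model: E(P) = P * { n^3/P * [...] + delta*M*t + eps*t }."""
    return P * (n**3 / P * (g + b / M**0.5 + a / (m * M**0.5))
                + delta * M * t + eps * t)

n, M, m, P, c = 1024, 3 * 1024**2, 64, 8, 4
args = (1e-9, 1e-8, 1e-6)                      # gamma, beta, alpha
t1, tc = T(P, n, M, m, *args), T(c * P, n, M, m, *args)
assert abs(tc - t1 / c) < 1e-9 * t1            # perfect scaling in time
e1 = E(P, n, M, m, *args, 1e-12, 1e-3, t1)
ec = E(c * P, n, M, m, *args, 1e-12, 1e-3, tc)
assert abs(ec - e1) < 1e-9 * e1                # perfect scaling in energy
```

The key point the check exposes: because per-processor memory M is held fixed while P grows, the bracketed per-flop cost is constant, so both totals scale exactly.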
Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors, and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
Extending to the rest of Direct Linear Algebra: Naïve Gaussian Elimination, A = LU

for i = 1 to n-1
    A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) )                        … scale a vector
    A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)     … rank-1 update

i.e., for i = 1 to n-1: update column i, then update the trailing matrix.
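The two-line loop above transcribes directly into pure Python (lists of lists). This is a sketch of the naive algorithm only: no pivoting, so A(i,i) is assumed nonzero throughout.

```python
# Naive in-place LU (Gaussian elimination without pivoting), following
# the two-line algorithm above. A is a square list-of-lists of floats.

def lu_in_place(A):
    n = len(A)
    for i in range(n - 1):
        for r in range(i + 1, n):
            A[r][i] /= A[i][i]                 # scale a vector
        for r in range(i + 1, n):
            for j in range(i + 1, n):          # rank-1 update of the
                A[r][j] -= A[r][i] * A[i][j]   # trailing matrix
    return A
```

After the call, A holds both factors: L (unit diagonal, strictly below the diagonal) and U (on and above the diagonal). Reading the loop this way makes the next slide's point concrete: every trailing-matrix update touches O(n^2) data, hence O(n^3) words moved when the matrix does not fit in fast memory.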
Communication in sequential one-sided factorizations (LU, QR, …)

• Naive approach:
    for i = 1 to n-1: update column i; update trailing matrix
  words_moved = O(n^3)
• Blocked approach (LAPACK):
    for i = 1 to n/b: update block i of b columns; update trailing matrix
  words_moved = O(n^3 / M^(1/3))
• Recursive approach:
    func factor(A):
        if A has 1 column, update it
        else: factor(left half of A); update right half of A; factor(right half of A)
  words_moved = O(n^3 / M^(1/2))
• None of these approaches:
  • minimizes #messages
  • handles eig() or svd()
  • works in parallel
• Need more ideas!
TSQR: QR of a Tall, Skinny matrix (";" stacks blocks vertically)

W = [W0; W1; W2; W3]
  = [Q00·R00; Q10·R10; Q20·R20; Q30·R30]
  = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]

[R00; R10] = Q01·R01 and [R20; R30] = Q11·R11, so
[R00; R10; R20; R30] = diag(Q01, Q11) · [R01; R11]

[R01; R11] = Q02·R02

Output: { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
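A toy version of this reduction in pure Python. Here `mgs_qr` is a stand-in helper (modified Gram–Schmidt QR, adequate for small well-conditioned blocks; a production TSQR uses Householder QR): factor each block, stack the R factors, and factor the stack again. The final R matches the R of a direct QR of W.

```python
# Toy TSQR: local QR per block, then QR of the stacked R's.

def mgs_qr(W):
    """Modified Gram-Schmidt QR of a tall-skinny, full-column-rank W."""
    m, n = len(W), len(W[0])
    Q = [row[:] for row in W]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = sum(Q[i][j] ** 2 for i in range(m)) ** 0.5
        for i in range(m):
            Q[i][j] /= R[j][j]
        for k in range(j + 1, n):
            R[j][k] = sum(Q[i][j] * Q[i][k] for i in range(m))
            for i in range(m):
                Q[i][k] -= R[j][k] * Q[i][j]
    return Q, R

def tsqr_R(W, blocks):
    """One level of the reduction tree: returns the R factor of W."""
    rows = len(W) // blocks
    Rs = []
    for b in range(blocks):
        _, R = mgs_qr(W[b * rows:(b + 1) * rows])
        Rs.extend(R)           # stack the small R factors
    _, R = mgs_qr(Rs)          # QR of the stack gives R of all of W
    return R
```

Because MGS produces R with a nonnegative diagonal, the R from the tree equals the R from a direct factorization (QR with positive diagonal is unique for full-rank W), which is what makes the reduction-tree formulation valid.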
TSQR: An Architecture-Dependent Algorithm

Parallel (binary tree):   W = [W0; W1; W2; W3] → (R00, R10, R20, R30) → (R01, R11) → R02
Sequential (flat tree):   W = [W0; W1; W2; W3] → R00 → R01 → R02 → R03
Dual core:                a hybrid of the two trees (a flat tree on each core, merged across cores)

Can choose the reduction tree dynamically — multicore, multisocket, multirack, multisite, out-of-core.
TSQR Performance Results

• Parallel
  – Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid: 4x on 4 cities (Dongarra et al.)
  – Cloud: ~1.6x slower than accessing the data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on a PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Back to LU: using a similar idea for TSLU as for TSQR — use a reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]

Factor each block with partial pivoting, Wi = Pi·Li·Ui, and choose b pivot rows of each Wi; call them Wi'.

[W1'; W2'] = P12·L12·U12 → choose b pivot rows, call them W12'
[W3'; W4'] = P34·L34·U34 → choose b pivot rows, call them W34'

[W12'; W34'] = P1234·L1234·U1234 → choose the final b pivot rows

• Go back to W and use these b pivot rows
• Move them to the top; do LU without pivoting
• Extra work, but a lower-order term
• Thm: as numerically stable as partial pivoting on a larger matrix
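A toy serial version of the tournament. Both function names are hypothetical helpers for illustration: `gepp_pivot_rows` reports which b rows Gaussian elimination with partial pivoting would pick from a small candidate set, and `tournament_pivot_rows` runs the pairwise reduction over those winners.

```python
# Toy tournament pivoting: b winners per block, then b winners per
# pair of winner sets, until one set of b global row indices remains.

def gepp_pivot_rows(block, b):
    """Local indices of the first b pivot rows GEPP selects in `block`."""
    A = [row[:] for row in block]
    order = list(range(len(A)))
    for k in range(b):
        p = max(range(k, len(A)), key=lambda r: abs(A[r][k]))
        A[k], A[p] = A[p], A[k]
        order[k], order[p] = order[p], order[k]
        for r in range(k + 1, len(A)):
            f = A[r][k] / A[k][k]
            for j in range(k, len(A[0])):
                A[r][j] -= f * A[k][j]
    return order[:b]

def tournament_pivot_rows(W, b, blocks):
    """Global indices of the b tournament pivot rows of W."""
    rows = len(W) // blocks
    cands = [[blk * rows + i
              for i in gepp_pivot_rows(W[blk * rows:(blk + 1) * rows], b)]
             for blk in range(blocks)]
    while len(cands) > 1:            # pairwise reduction tree
        nxt = []
        for i in range(0, len(cands), 2):
            merged = cands[i] + cands[i + 1]
            win = gepp_pivot_rows([W[g] for g in merged], b)
            nxt.append([merged[w] for w in win])
        cands = nxt
    return cands[0]
```

This sketch assumes `blocks` is a power of two and every candidate set is nonsingular; the real TSLU then moves the chosen rows to the top of W and factors without further pivoting, as described above.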
Exascale Machine Parameters (source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
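These numbers imply concrete break-even transfer sizes: a message or memory transfer must carry at least latency x bandwidth bytes before bandwidth, rather than latency, dominates its cost — one more reason communication-avoiding algorithms send few, large messages.

```python
# Latency-bandwidth products for the machine parameters above: the
# minimum transfer size at which bandwidth cost overtakes latency cost.

interconnect_bw = 100e9     # bytes/sec
interconnect_lat = 1e-6     # sec
dram_bw = 400e9             # bytes/sec
dram_lat = 50e-9            # sec

net_break_even = interconnect_bw * interconnect_lat   # ~1e5 bytes (~12,500 doubles)
dram_break_even = dram_bw * dram_lat                  # ~2e4 bytes (~2,500 doubles)

print(net_break_even, dram_break_even)
```

So on this hypothetical machine, messages well under ~100 KB are latency-bound over the interconnect, and DRAM transfers under ~20 KB are latency-bound — exactly the regime the #messages lower bounds address.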
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Plot: predicted speedup as a function of log2(p) and log2(n^2/p) = log2(memory_per_proc)]

2.5D vs 2D LU, With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #messages use a (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports); only some shown, plus ours
• Cache-oblivious algorithms are underlined; green are ours; ? is unknown/future work

Entries give "Words Moved / #Messages" for two levels of memory, then for multiple levels:

Algorithm      Two Levels of Memory                                    Multiple Levels of Memory
BLAS-3         Usual blocked or recursive algorithms                   Usual blocked (nested) or recursive [Gustavson,97]
Cholesky       LAPACK (with b = M^(1/2)), [Gustavson,97], [BDHS09] /   (←same) / (←same)
               [Gustavson,97], [Ahmed,Pingali,00], [BDHS09]
LU with        LAPACK (sometimes), [Toledo,97], [GDX 08] /             [Toledo,97], [GDX 08] / [GDX 08]
pivoting       [GDX 08] (partial pivoting)
QR / Rank-     LAPACK (sometimes), [ElmrothGustavson,98], [DGHL08] /   [ElmrothGustavson,98], [DGHL08] /
revealing      [FrensWise,03] (but 3x flops), [DGHL08]                 [FrensWise,03], [DGHL08]
Eig, SVD       Not LAPACK; [BDD11] randomized, but more flops;         [BDD11, BDK11] / [BDD11, BDK11]
               [BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E
needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T
that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
39
Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU
for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update
for i=1 to n-1 update column i update trailing matrix
Communication in sequentialOne-sided Factorizations (LU QR hellip)
bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)
40
bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E
needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T
that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
39
Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU
for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update
for i=1 to n-1 update column i update trailing matrix
Communication in sequentialOne-sided Factorizations (LU QR hellip)
bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)
40
bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU: Using a similar idea for TSLU as for TSQR: use a reduction tree to do "Tournament Pivoting"

W(n x b) = [ W1 ]   [ P1·L1·U1 ]   Choose b pivot rows of W1, call them W1'
           [ W2 ] = [ P2·L2·U2 ]   Choose b pivot rows of W2, call them W2'
           [ W3 ]   [ P3·L3·U3 ]   Choose b pivot rows of W3, call them W3'
           [ W4 ]   [ P4·L4·U4 ]   Choose b pivot rows of W4, call them W4'

[ W1' ]   [ P12·L12·U12 ]   Choose b pivot rows, call them W12'
[ W2' ] = [             ]
[ W3' ]   [ P34·L34·U34 ]   Choose b pivot rows, call them W34'
[ W4' ]

[ W12' ] = P1234·L1234·U1234   Choose b pivot rows
[ W34' ]

• Go back to W and use these b pivot rows
• Move them to the top, then do LU without pivoting
• Extra work, but a lower-order term
• Thm: As numerically stable as partial pivoting on a larger matrix
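The tournament above can be sketched directly: each block nominates b candidate rows via ordinary partial pivoting, and the candidates play off pairwise up a binary tree. This is an illustrative NumPy sketch (block count, random test matrix, and the explicit GEPP loop are my choices, not the authors' implementation):

```python
import numpy as np

def local_pivots(W, rows, b):
    """Indices (into W) of the b rows that Gaussian elimination with
    partial pivoting would bring to the top of the block W[rows]."""
    A = W[rows].astype(float).copy()
    idx = np.array(rows)
    for k in range(b):
        p = k + np.argmax(np.abs(A[k:, k]))   # largest entry in column k
        A[[k, p]] = A[[p, k]]                 # swap rows k and p
        idx[k], idx[p] = idx[p], idx[k]
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return idx[:b]

def tournament_pivot(W, blocks=4):
    """TSLU tournament pivoting sketch over a binary reduction tree.
    Returns indices of the b = #columns rows of W chosen as pivots."""
    m, b = W.shape
    groups = [local_pivots(W, r, b)
              for r in np.array_split(np.arange(m), blocks)]
    while len(groups) > 1:       # winners of each pair play off again
        groups = [local_pivots(W, np.concatenate(pair), b)
                  for pair in zip(groups[0::2], groups[1::2])]
    return groups[0]

rng = np.random.default_rng(2)
W = rng.standard_normal((1000, 5))
piv = tournament_pivot(W)
assert len(set(piv.tolist())) == 5             # b distinct rows chosen
assert abs(np.linalg.det(W[piv])) > 0          # chosen b x b block nonsingular
```

In the real algorithm only the b candidate rows travel up the tree, so the communication per level is O(b²) words instead of a full panel.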
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
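A few ratios fall straight out of these parameters and explain why latency and bandwidth, not flops, dominate at this scale. This quick derived-quantity check is mine, not part of the slide:

```python
# Derived ratios from the exascale parameters above.
nodes = 2**20
total_mem = 32e15            # 32 PB, in bytes
mem_per_node = total_mem / nodes
interconnect_bw = 100e9      # bytes/sec
dram_bw = 400e9              # bytes/sec
net_latency = 1e-6           # seconds
mem_latency = 50e-9          # seconds

print(f"memory per node ≈ {mem_per_node / 2**30:.1f} GiB")
print(f"DRAM / network bandwidth ratio: {dram_bw / interconnect_bw:.0f}x")
print(f"network / memory latency ratio: {net_latency / mem_latency:.0f}x")
```

So a message across the network costs 20 memory latencies before any data moves, and off-node bandwidth is 4x scarcer than local DRAM bandwidth: exactly the gaps communication-avoiding algorithms target.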
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Figure: contour plot of predicted speedup, with log2(p) on one axis and log2(n^2/p) = log2(memory_per_proc) on the other]

2.5D vs 2D LU, With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #messages use a (recursive) block layout
  – Not possible with columnwise or rowwise layouts
• Many references (see reports); only some shown, plus ours
  – Cache-oblivious are underlined; green are ours; ? is unknown/future work

Algorithm (2 Levels of Memory: #Words Moved | #Messages — Multiple Levels of Memory: #Words Moved | #Messages):
• BLAS-3: Usual blocked or recursive algorithms — Usual blocked algorithms (nested), or recursive [Gustavson,97]
• Cholesky: LAPACK (with b = M^(1/2)), [Gustavson,97], [BDHS09] | [Gustavson,97], [Ahmed,Pingali,00], [BDHS09] — (←same) | (←same)
• LU with pivoting: LAPACK (sometimes), [Toledo,97], [GDX,08] | [GDX,08] with partial pivoting — [Toledo,97], [GDX,08] | [GDX,08]
• QR, rank-revealing: LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08] | [Frens,Wise,03] (but 3x flops), [DGHL08] — [Elmroth,Gustavson,98], [DGHL08] | [Frens,Wise,03], [DGHL08]
• Eig, SVD: Not LAPACK; [BDD11] (randomized, but more flops), [BDK11] — [BDD11, BDK11] | [BDD11, BDK11]
Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Conventional approach — factor exceeding lower bound (words_moved | messages):
• Matrix Multiply: [Cannon, 69] — 1 | 1
• Cholesky: ScaLAPACK — log P | log P
• LU: ScaLAPACK — log P | (n/P^(1/2)) · log P
• QR: ScaLAPACK — log P | (n/P^(1/2)) · log P
• Sym Eig, SVD: ScaLAPACK — log P | n/P^(1/2)
• Nonsym Eig: ScaLAPACK — P^(1/2) · log P | n · log P

Better algorithms — factor exceeding lower bound (words_moved | messages):
• Matrix Multiply: [Cannon, 69] — 1 | 1
• Cholesky: ScaLAPACK — log P | log P
• LU: [GDX10] — log P | log P
• QR: [DGHL08] — log P | log^3 P
• Sym Eig, SVD: [BDD11] — log P | log^3 P
• Nonsym Eig: [BDD11] — log P | log^3 P
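The simplifications in the lower bounds above follow from substituting M = n²/P; a quick numeric check of the algebra (my addition, not on the slide):

```python
# With M = n^2/P:
#   (n^3/P) / M^(1/2) = n^2 / P^(1/2)   (words moved)
#   (n^3/P) / M^(3/2) = P^(1/2)         (messages, dividing by max message size M)
from math import isclose, sqrt

for n, P in [(4096, 64), (10**4, 1024)]:
    M = n**2 / P
    assert isclose((n**3 / P) / sqrt(M), n**2 / sqrt(P))
    assert isclose((n**3 / P) / M**1.5, sqrt(P))
```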
Can we do even better?

• Assume n x n matrices on P processors
• Use c copies of the data: M = O(c·n^2 / P) per processor
• Increasing M reduces the lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2) · P^(1/2)) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Factor exceeding lower bound (words_moved | messages):
• Matrix Multiply: [DS11, SBD11] — polylog P | polylog P
• Cholesky: [SD11, in prog.] — polylog P | c^2 · polylog P – optimal?
• LU: [DS11, SBD11] — polylog P | c^2 · polylog P – optimal?
• QR: via Cholesky QR — polylog P | c^2 · polylog P – optimal?
• Sym Eig, SVD: ?
• Nonsym Eig: ?
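The payoff from replication is visible directly in the bound algebra: with c copies, words moved drop by √c and messages by c^(3/2). A numeric check of that substitution (my addition):

```python
# With M = c*n^2/P, the bounds above become
#   words:    n^2 / (sqrt(c) * sqrt(P))
#   messages: sqrt(P) / c^(3/2)
from math import isclose, sqrt

n, P = 8192, 256
for c in (1, 4, 16):
    M = c * n**2 / P
    assert isclose((n**3 / P) / sqrt(M), n**2 / (sqrt(c) * sqrt(P)))
    assert isclose((n**3 / P) / M**1.5, sqrt(P) / c**1.5)
```

This is exactly the 2.5D regime: c = 1 recovers 2D algorithms; c = P^(1/3) recovers the 3D limit.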
Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → Q·A·Q^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy the communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less
[Figure: Successive Band Reduction (bulge chasing), built up frame by frame.
b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.
Q1 annihilates a c-by-d parallelogram of the band; applying Q1^T on the
other side creates a (d+c)-sized bulge, which transformations Q2, Q3, Q4,
Q5, … chase down the band. Only the leading parallelogram of each
trapezoid needs to be zeroed out.]
Conventional vs CA-SBR

• Conventional: touch all data 4 times. Communication-avoiding: touch all data once.
• Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge.
• Right choices reduce #words_moved by a factor of M/bw, not just M^(1/2).
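As a correctness reference for what SBR computes, here is the dense, communication-oblivious baseline: for symmetric A, Hessenberg reduction yields exactly the tridiagonal similarity A = Q·T·Qᵀ. This sketch (my addition; it does not preserve the band structure the way SBR does) uses SciPy's `hessenberg`:

```python
import numpy as np
from scipy.linalg import hessenberg

# Build a random symmetric banded test matrix with bandwidth b.
rng = np.random.default_rng(1)
n, b = 12, 3
A = np.zeros((n, n))
for d in range(b + 1):
    v = rng.standard_normal(n - d)
    A += np.diag(v, d) + (np.diag(v, -d) if d else 0)

# For symmetric A the Hessenberg form is tridiagonal: A = Q T Q^T.
T, Q = hessenberg(A, calc_q=True)

assert np.allclose(Q @ T @ Q.T, A)                  # similarity transform
assert np.allclose(T, np.triu(np.tril(T, 1), -1))   # T is tridiagonal
assert np.allclose(np.sort(np.linalg.eigvalsh(T)),
                   np.sort(np.linalg.eigvalsh(A)))  # same eigenvalues
```

SBR reaches the same T while keeping all intermediate matrices banded, which is what lets it touch the data once instead of four times.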
2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies
[Figure: performance plots; annotations show 2.7x faster at one scale and 12x faster in the communication-dominated case]

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.
Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c → total memory increases by a factor of c
• Notation for timing model:
  – γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γ_E, β_E, α_E = joules for the same operations
  – δ_E = joules per word of memory used per sec
  – ε_E = joules per sec for leakage, etc.
  – E(cP) = cP · { n^3/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
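The two model equations above can be coded directly, and the perfect-scaling claims T(cP) = T(P)/c and E(cP) = E(P) checked numerically. The parameter values below are arbitrary placeholders, not measured constants:

```python
from math import isclose

def T(P, n, M, gamma, beta, alpha, m):
    """Timing model: flops + words + messages, each scaling as n^3/P
    with per-processor memory M held fixed as P grows."""
    return (n**3 / P) * (gamma + beta / M**0.5 + alpha / (m * M**0.5))

def E(P, n, M, gamma, beta, alpha, m, delta, eps):
    """Energy model: per-processor dynamic terms, plus memory (delta*M)
    and leakage (eps) terms that accrue over the runtime T."""
    t = T(P, n, M, gamma, beta, alpha, m)
    per_proc = (n**3 / P) * (gamma + beta / M**0.5 + alpha / (m * M**0.5))
    return P * (per_proc + delta * M * t + eps * t)

args = dict(n=4096, M=2**20, gamma=1e-9, beta=1e-8, alpha=1e-6, m=2**10)
P = 64
for c in (2, 4, 8):
    assert isclose(T(c * P, **args), T(P, **args) / c)       # time: /c
    assert isclose(E(c * P, **args, delta=1e-12, eps=0.1),
                   E(P, **args, delta=1e-12, eps=0.1))       # energy: flat
```

The energy identity works because the cP·δ_E·M·T(cP) and cP·ε_E·T(cP) terms are constant in c once T(cP) = T(P)/c: more processors, each running for 1/c as long.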
Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  – What is the minimum energy required for a computation?
  – Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  – Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  – The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  – Given an algorithm, problem size, number of processors, and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
Extending to the rest of Direct Linear Algebra
Naïve Gaussian Elimination: A = L·U

for i = 1 to n-1
    A(i+1:n, i) = A(i+1:n, i) · ( 1 / A(i,i) )                        … scale a vector
    A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) · A(i, i+1:n)     … rank-1 update

i.e., for i = 1 to n-1: update column i, then update the trailing matrix.
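The pseudocode above is runnable almost verbatim in NumPy. A minimal sketch (no pivoting, so the diagonally dominant test matrix is my assumption to keep it stable):

```python
import numpy as np

def lu_nopivot(A):
    """Naive Gaussian elimination as written above: returns unit lower
    triangular L and upper triangular U with A = L @ U. No pivoting,
    so it assumes nonzero (well-scaled) leading pivots."""
    A = A.copy()
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] /= A[i, i]                               # scale a vector
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # rank-1 update
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return L, U

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50)) + 50 * np.eye(50)   # diagonally dominant
L, U = lu_nopivot(A)
assert np.allclose(L @ U, A)
```

Note the communication pattern: each iteration reads and writes the whole trailing matrix, which is exactly why the naive loop moves O(n³) words.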
Communication in sequential one-sided factorizations (LU, QR, …)

• Naive approach:
    for i = 1 to n-1
        update column i
        update trailing matrix
  #words_moved = O(n^3)

• Blocked approach (LAPACK):
    for i = 1 to n/b - 1
        update block i of b columns
        update trailing matrix
  #words_moved = O(n^3 / M^(1/3))

• Recursive approach:
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            update right half of A
            factor(right half of A)
  #words_moved = O(n^3 / M^(1/2))

• None of these approaches
  – minimizes #messages
  – handles eig() or svd()
  – works in parallel
• Need more ideas!
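The recursive pattern above, instantiated for unpivoted LU, looks like this in NumPy. This is an illustrative sketch of the divide-on-columns recursion (real communication-avoiding codes add pivoting and panel layouts):

```python
import numpy as np

def factor_recursive(A):
    """Recursive one-sided factorization sketch (unpivoted LU).
    Operates in place, leaving unit-lower L and U packed in A."""
    n = A.shape[1]
    if n == 1:
        A[1:, 0] /= A[0, 0]                 # single column: just scale
        return
    k = n // 2
    factor_recursive(A[:, :k])              # factor left half
    # Update right half: U12 = L11^{-1} A12, then Schur complement.
    L11 = np.tril(A[:k, :k], -1) + np.eye(k)
    A[:k, k:] = np.linalg.solve(L11, A[:k, k:])
    A[k:, k:] -= A[k:, :k] @ A[:k, k:]
    factor_recursive(A[k:, k:])             # factor right half

rng = np.random.default_rng(3)
A0 = rng.standard_normal((8, 8)) + 8 * np.eye(8)   # diagonally dominant
A = A0.copy()
factor_recursive(A)
L = np.tril(A, -1) + np.eye(8)
U = np.triu(A)
assert np.allclose(L @ U, A0)
```

The recursion is cache-oblivious: once a subproblem fits in fast memory of size M, its updates cause no further traffic, which is where the O(n³/M^(1/2)) word count comes from.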
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E
needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T
that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
39
Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU
for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update
for i=1 to n-1 update column i update trailing matrix
Communication in sequentialOne-sided Factorizations (LU QR hellip)
bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)
40
bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E
needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T
that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
39
Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU
for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update
for i=1 to n-1 update column i update trailing matrix
Communication in sequentialOne-sided Factorizations (LU QR hellip)
bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)
40
bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Results

• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid: 4x on 4 cities (Dongarra et al.)
  – Cloud: ~1.6x slower than accessing the data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on a PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

44

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Back to LU: Using a similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"

45

W (n x b) = [ W1 ; W2 ; W3 ; W4 ], with Wi = Pi·Li·Ui
→ choose b pivot rows of each Wi, call them Wi'

[ W1' ; W2' ] = P12·L12·U12 → choose b pivot rows, call them W12'
[ W3' ; W4' ] = P34·L34·U34 → choose b pivot rows, call them W34'

[ W12' ; W34' ] = P1234·L1234·U1234 → choose b pivot rows

• Go back to W and use these b pivot rows: move them to the top, do LU without pivoting
• Extra work, but a lower-order term
• Thm: as numerically stable as Partial Pivoting on a larger matrix
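The tournament can be sketched as follows (an illustrative sketch: `gepp_pivot_rows` and `tournament_pivot_rows` are my names, and the tree here is a single two-level reduction rather than the general architecture-dependent tree):

```python
import numpy as np

def gepp_pivot_rows(W, b):
    """Indices of the b pivot rows that Gaussian elimination with partial
    pivoting selects on W: the local contest run at each tree node."""
    W = np.array(W, dtype=float)
    rows = np.arange(W.shape[0])
    for k in range(b):
        p = k + int(np.argmax(np.abs(W[k:, k])))   # partial pivot in column k
        W[[k, p]] = W[[p, k]]
        rows[[k, p]] = rows[[p, k]]
        W[k+1:, k+1:] -= np.outer(W[k+1:, k] / W[k, k], W[k, k+1:])
    return rows[:b]

def tournament_pivot_rows(W, b, nblocks=4):
    """Two-level tournament: pick b candidate rows per block, then run the
    same contest on the stacked winners. Returns global row indices."""
    blocks = np.array_split(np.arange(W.shape[0]), nblocks)
    winners = np.concatenate(
        [blk[gepp_pivot_rows(W[blk], b)] for blk in blocks])
    return winners[gepp_pivot_rows(W[winners], b)]
```

Each node of the tree only communicates b candidate rows, so the whole panel is pivoted with one reduction instead of one message per column.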
Exascale Machine Parameters (source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
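These parameters plug directly into the three-term cost model from the start of the talk (a sketch: the time-per-flop value is my assumed placeholder, not a figure from the slide; bandwidth and latency are the interconnect numbers above):

```python
def runtime(flops, words, messages,
            time_per_flop=1e-10,   # assumed value, not from the slide
            bandwidth=100e9,       # 100 GB/sec interconnect (slide)
            latency=1e-6):         # 1 microsec interconnect latency (slide)
    """Three-term model: T = flops*time_per_flop + words/bandwidth
    + messages*latency."""
    return flops * time_per_flop + words / bandwidth + messages * latency

# At these parameters, a million messages cost a full second in latency
# alone, regardless of how little data they carry: this is why
# communication-avoiding algorithms target the message count as well as
# the word count.
```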
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

(Figure: speedup heatmap; x-axis log2(p), y-axis log2(n^2/p) = log2(memory_per_proc).)
2.5D vs 2D LU, With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing messages use a (recursive) block layout; not possible with columnwise or rowwise layouts
• Many references (see reports); only some are shown, plus ours. Cache-oblivious algorithms are underlined, green entries are ours, ? means unknown/future work

• BLAS-3
  – 2 levels of memory: usual blocked or recursive algorithms (words moved and messages)
  – Multiple levels: usual blocked algorithms (nested), or recursive [Gustavson,97]
• Cholesky
  – 2 levels, words moved: LAPACK (with b = M^(1/2)), [Gustavson,97], [BDHS09]
  – 2 levels, messages: [Gustavson,97], [Ahmed,Pingali,00], [BDHS09]
  – Multiple levels: (same as 2 levels)
• LU with pivoting
  – 2 levels, words moved: LAPACK (sometimes), [Toledo,97], [GDX 08]
  – 2 levels, messages: [GDX 08], partial pivoting
  – Multiple levels: [Toledo,97], [GDX 08]?
• QR, rank-revealing
  – 2 levels, words moved: LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08]
  – 2 levels, messages: [Frens,Wise,03] (but 3x flops), [DGHL08]
  – Multiple levels: [Elmroth,Gustavson,98], [Frens,Wise,03], [DGHL08]?
• Eig, SVD
  – 2 levels: not LAPACK; [BDD11] randomized, but more flops; [BDK11]
  – Multiple levels: [BDD11,BDK11]
Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (conventional approach)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  words_moved = Ω((n^3/P) / M^(1/2)) = Ω(n^2 / P^(1/2))
  messages    = Ω((n^3/P) / M^(3/2)) = Ω(P^(1/2))

Algorithm        Reference     Factor exceeding bound   Factor exceeding bound
                               for words_moved          for messages
Matrix Multiply  [Cannon, 69]  1                        1
Cholesky         ScaLAPACK     log P                    log P
LU               ScaLAPACK     log P                    (n log P) / P^(1/2)
QR               ScaLAPACK     log P                    (n log P) / P^(1/2)
Sym Eig, SVD     ScaLAPACK     log P                    n / P^(1/2)
Nonsym Eig       ScaLAPACK     P^(1/2) log P            n log P
Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (better)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  words_moved = Ω((n^3/P) / M^(1/2)) = Ω(n^2 / P^(1/2))
  messages    = Ω((n^3/P) / M^(3/2)) = Ω(P^(1/2))

Algorithm        Reference     Factor exceeding bound   Factor exceeding bound
                               for words_moved          for messages
Matrix Multiply  [Cannon, 69]  1                        1
Cholesky         ScaLAPACK     log P                    log P
LU               [GDX10]       log P                    log P
QR               [DGHL08]      log P                    log^3 P
Sym Eig, SVD     [BDD11]       log P                    log^3 P
Nonsym Eig       [BDD11]       log P                    log^3 P
Can we do even better?

• Assume n x n matrices on P processors
• Use c copies of data: M = O(c·n^2 / P) per processor
• Increasing M reduces the lower bounds:
  words_moved = Ω((n^3/P) / M^(1/2)) = Ω(n^2 / (c^(1/2) P^(1/2)))
  messages    = Ω((n^3/P) / M^(3/2)) = Ω(P^(1/2) / c^(3/2))

Algorithm        Reference         Factor exceeding bound   Factor exceeding bound
                                   for words_moved          for messages
Matrix Multiply  [DS11,SBD11]      polylog P                polylog P
Cholesky         [SD11, in prog.]  polylog P                c^2 polylog P (optimal)
LU               [DS11,SBD11]      polylog P                c^2 polylog P (optimal)
QR               Via Cholesky QR   polylog P                c^2 polylog P (optimal)
Sym Eig, SVD     ?                 ?                        ?
Nonsym Eig       ?                 ?                        ?
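The effect of c replicas on both bounds is easy to check numerically (a sketch: `lower_bounds` is my name for it, and the constants hidden inside the Ω are dropped):

```python
def lower_bounds(n, P, c=1):
    """Communication lower bounds with c copies of the data, as on the
    slide: M = c*n^2/P per processor, so
    words    = (n^3/P) / M^(1/2) = n^2 / (c*P)^(1/2)
    messages = (n^3/P) / M^(3/2) = P^(1/2) / c^(3/2).
    Constant factors are dropped."""
    M = c * n**2 / P
    words = (n**3 / P) / M**0.5
    messages = (n**3 / P) / M**1.5
    return words, messages
```

For example, c = 4 copies halve the words bound (a factor c^(1/2)) and cut the messages bound by 8x (a factor c^(3/2)), which is exactly the trade-off 2.5D algorithms exploit.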
Symmetric Band Reduction

• Grey Ballard and Nick Knight
• Find Q so that Q·A·Q^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as the second phase when A is dense: Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy the communication lower bound theorem for applying orthogonal transformations; it can communicate even less
Successive Band Reduction

(Figure: an animation over several slides of bulge chasing on a symmetric band of half-width b+1.)

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

Q1 zeroes out a parallelogram of c columns and d diagonals (step 1), creating a bulge of size d+c; applying Q1^T on the other side pushes the bulge down the band (step 2); Q2, Q3, Q4, Q5, … repeat the process (steps 3 through 6), chasing each bulge off the end of the band. Only need to zero out the leading parallelogram of each trapezoid.
Conventional vs CA-SBR

• Conventional: touches all data 4 times.
• Communication-avoiding: many tuning parameters (number of "sweeps", diagonals cleared per sweep, sizes of parallelograms, number of bulges chased at one time, how far to chase each bulge); the right choices reduce words_moved by a factor of M/bw, not just M^(1/2). Touches all data once.
Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E
needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T
that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
39
Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU
for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update
for i=1 to n-1 update column i update trailing matrix
Communication in sequentialOne-sided Factorizations (LU QR hellip)
bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)
40
bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E
needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T
that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
39
Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU
for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update
for i=1 to n-1 update column i update trailing matrix
Communication in sequentialOne-sided Factorizations (LU QR hellip)
bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)
40
bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
39
Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU
for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update
for i=1 to n-1 update column i update trailing matrix
Communication in sequentialOne-sided Factorizations (LU QR hellip)
bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)
40
bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel
bull Need more ideas
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #Messages use (recursive) block layout; not possible with columnwise or rowwise layouts
• Many references (see reports), only some shown, plus ours. Cache-oblivious are underlined, green are ours, ? is unknown/future work

Algorithm | 2 Levels of Memory (#Words Moved; #Messages) | Multiple Levels of Memory (#Words Moved; #Messages)
BLAS-3 | Usual blocked or recursive algorithms | Usual blocked algorithms (nested) or recursive [Gustavson,97]
Cholesky | LAPACK (with b = M^(1/2)), [Gustavson,97], [BDHS09]; [Gustavson,97], [Ahmed,Pingali,00], [BDHS09] | (←same); (←same)
LU with pivoting | LAPACK (sometimes), [Toledo,97], [GDX,08]; [GDX,08] partial pivoting | [Toledo,97], [GDX,08]; [GDX,08]
QR, rank-revealing | LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08], [Frens,Wise,03] (but 3x flops); [DGHL08] | [Elmroth,Gustavson,98], [DGHL08]; [Frens,Wise,03], [DGHL08]
Eig, SVD | Not LAPACK; [BDD11] randomized, but more flops; [BDK11]; [BDD11,BDK11] | [BDD11,BDK11]
Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (conventional approach)
• Minimum memory per processor: M = O(n^2/P)
• Recall lower bounds:
  #words_moved = Ω((n^3/P) / M^(1/2)) = Ω(n^2 / P^(1/2))
  #messages    = Ω((n^3/P) / M^(3/2)) = Ω(P^(1/2))

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | ScaLAPACK | log P | (n/P^(1/2)) · log P
QR | ScaLAPACK | log P | (n/P^(1/2)) · log P
Sym Eig, SVD | ScaLAPACK | log P | n/P^(1/2)
Nonsym Eig | ScaLAPACK | P^(1/2) · log P | n · log P
Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (better)
• Minimum memory per processor: M = O(n^2/P)
• Recall lower bounds:
  #words_moved = Ω((n^3/P) / M^(1/2)) = Ω(n^2 / P^(1/2))
  #messages    = Ω((n^3/P) / M^(3/2)) = Ω(P^(1/2))

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX10] | log P | log P
QR | [DGHL08] | log P | log^3 P
Sym Eig, SVD | [BDD11] | log P | log^3 P
Nonsym Eig | [BDD11] | log P | log^3 P
Can we do even better?

• Assume n x n matrices on P processors
• Use c copies of the data: M = O(c·n^2/P) per processor
• Increasing M reduces the lower bounds:
  #words_moved = Ω((n^3/P) / M^(1/2)) = Ω(n^2 / (c^(1/2) · P^(1/2)))
  #messages    = Ω((n^3/P) / M^(3/2)) = Ω(P^(1/2) / c^(3/2))

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [DS11, SBD11] | polylog P | polylog P
Cholesky | [SD11, in prog.] | polylog P | c^2 polylog P – optimal!
LU | [DS11, SBD11] | polylog P | c^2 polylog P – optimal!
QR | Via Cholesky QR | polylog P | c^2 polylog P – optimal!
Sym Eig, SVD | ? | ? | ?
Nonsym Eig | ? | ? | ?
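To make the scaling concrete, here is a small sketch (our own illustration, not from the slides) that evaluates these per-processor lower bounds with the Big-Omega constants dropped:

```python
import math

def lower_bounds(n, P, c=1):
    """Per-processor communication lower bounds for O(n^3) dense
    linear algebra with c replicated copies of the data
    (constants hidden by the Big-Omega bounds are dropped)."""
    M = c * n**2 / P                 # words of memory per processor
    flops = n**3 / P                 # flops per processor
    words = flops / math.sqrt(M)     # = n^2 / (c * P)**0.5
    msgs = flops / M**1.5            # = P**0.5 / c**1.5
    return words, msgs
```

For example, going from c = 1 to c = 4 copies halves the words moved and cuts the message count by 8x, matching the c^(1/2) and c^(3/2) factors in the bounds.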
Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense: Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy the communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less
[Figure sequence: Successive Band Reduction. Parameters: b = bandwidth, c = #columns, d = #diagonals, with the constraint c + d ≤ b. A sequence of orthogonal transformations Q1, ..., Q5 (each applied two-sided, as Qi · A · Qi^T) eliminates d diagonals, c columns at a time, chasing the resulting bulges (labeled 1 through 6) down the band in blocks of size roughly (d+c) x c. Only the leading parallelogram of each trapezoid needs to be zeroed out.]
Conventional vs CA-SBR

Conventional: touch all data 4 times. Communication-Avoiding: touch all data once.

Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge. The right choices reduce #words_moved by a factor of M/bw, not just M^(1/2).
Communication in sequential One-sided Factorizations (LU, QR, ...)

• Naive approach:
  for i = 1 to n-1: update column i, then update trailing matrix
  #words_moved = O(n^3)
• Blocked approach (LAPACK):
  for i = 1 to n/b - 1: update block i of b columns, then update trailing matrix
  #words_moved = O(n^3 / M^(1/3))
• Recursive approach:
  func factor(A): if A has 1 column, update it;
  else factor(left half of A), update right half of A, factor(right half of A)
  #words_moved = O(n^3 / M^(1/2))
• None of these approaches minimizes #messages, handles eig() or svd(), or works in parallel
• Need more ideas!
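The recursive scheme above can be sketched in NumPy for LU without pivoting (our own illustrative code; pivoting and the blocking details of the production algorithms are omitted):

```python
import numpy as np

def recursive_lu(A):
    """Recursive (cache-oblivious style) LU without pivoting:
    factor the left half of the columns, update the right half,
    then factor the right half. Returns (L, U) with A = L @ U."""
    A = A.astype(float).copy()
    n = A.shape[1]

    def factor(lo, hi):                       # factor columns lo:hi in place
        if hi - lo == 1:
            A[lo+1:, lo] /= A[lo, lo]         # one column: scale below pivot
            return
        mid = (lo + hi) // 2
        factor(lo, mid)                       # factor left half
        # update right half: U12 = L11^{-1} A12, then Schur complement
        L11 = np.tril(A[lo:mid, lo:mid], -1) + np.eye(mid - lo)
        A[lo:mid, mid:hi] = np.linalg.solve(L11, A[lo:mid, mid:hi])
        A[mid:, mid:hi] -= A[mid:, lo:mid] @ A[lo:mid, mid:hi]
        factor(mid, hi)                       # factor right half

    factor(0, n)
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return L, U
```

Without pivoting this is only safe for matrices with nonzero (e.g. diagonally dominant) pivots; the point is the recursion structure, which moves O(n^3 / M^(1/2)) words.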
TSQR: QR of a Tall, Skinny matrix

W = [W0; W1; W2; W3] = [Q00·R00; Q10·R10; Q20·R20; Q30·R30] = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]

[R00; R10] = Q01·R01 and [R20; R30] = Q11·R11, so [R00; R10; R20; R30] = diag(Q01, Q11) · [R01; R11]

[R01; R11] = Q02·R02

Output: {Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02}
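A minimal NumPy sketch of this binary-tree reduction (ours, not from the slides; it returns only the final R factor and keeps the Q factors implicit):

```python
import numpy as np

def tsqr(W, n_blocks=4):
    """TSQR with a binary reduction tree: QR each block locally,
    then repeatedly stack pairs of R factors and re-factor them
    until a single n x n R remains."""
    Rs = [np.linalg.qr(blk)[1] for blk in np.array_split(W, n_blocks)]
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]))[1]
              for i in range(0, len(Rs), 2)]
    return Rs[0]
```

The result satisfies R^T R = W^T W and agrees with the R from a direct QR of W up to the signs of its rows.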
TSQR: An Architecture-Dependent Algorithm

Parallel (binary tree): W = [W0; W1; W2; W3]; local QRs give R00, R10, R20, R30; pairwise combines give R01, R11; a final combine gives R02.
Sequential (flat tree): factor W0 to get R00, then fold in W1, W2, W3 one at a time, producing R01, R02, R03.
Dual core (hybrid tree): each core runs a flat tree over its own blocks, with pairwise combines between the cores along the way.

Can choose reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core?
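For contrast with the binary tree, the sequential (flat-tree) variant can be sketched as follows (again our own illustration): it streams one block at a time while carrying a running R, which is what makes the out-of-core case work.

```python
import numpy as np

def tsqr_flat(W, n_blocks=4):
    """Sequential (flat-tree) TSQR: stream the blocks one at a time,
    stacking each new block under the running R and re-factoring.
    Only one block plus one small R is in fast memory at a time."""
    blocks = np.array_split(W, n_blocks)
    R = np.linalg.qr(blocks[0])[1]              # R00
    for blk in blocks[1:]:                      # R01, R02, R03, ...
        R = np.linalg.qr(np.vstack([R, blk]))[1]
    return R
```

Both tree shapes produce an R with the same Gram matrix as W; the choice of tree only changes the communication pattern.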
TSQR Performance Results

• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid: 4x on 4 cities (Dongarra et al)
  – Cloud: 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
TSQR QR of a Tall Skinny matrix
41
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
TSQR QR of a Tall Skinny matrix
42
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01 R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others
44
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
45
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA-SBR

Conventional: touch all data 4 times. Communication-avoiding: touch all data once.
Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge. The right choices reduce #words_moved by a factor of M/bw, not just M^1/2.
TSQR: An Architecture-Dependent Algorithm

Split W = [W0; W1; W2; W3] into row blocks, do a local QR of each block, then combine the small R factors up a reduction tree:
• Parallel (binary tree): W0,…,W3 → R00, R10, R20, R30 → R01, R11 → R02
• Sequential (flat tree): W0 → R00; [R00; W1] → R01; [R01; W2] → R02; [R02; W3] → R03
• Dual core (hybrid): each core reduces its own blocks with a flat tree, then the per-core R factors are merged
Can choose reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core
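The binary-tree variant can be sketched in a few lines of NumPy: local QRs at the leaves, then pairwise stack-and-QR of the R factors until one remains (Q is implicitly defined by the tree and omitted here):

```python
import numpy as np

def tsqr(W, leaves=4):
    """Binary-tree TSQR sketch for a tall-skinny W.

    Each leaf computes a local QR; the small R factors are then
    stacked pairwise and re-factored up the tree. Returns the
    final n x n R factor of W."""
    blocks = np.array_split(W, leaves, axis=0)
    Rs = [np.linalg.qr(B, mode='r') for B in blocks]   # leaf QRs
    while len(Rs) > 1:                                 # reduction tree
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
              for i in range(0, len(Rs), 2)]
    return Rs[0]

W = np.random.randn(10000, 50)
R = tsqr(W)
R_ref = np.linalg.qr(W, mode='r')
# R is unique up to signs of its rows, so compare magnitudes
assert np.allclose(np.abs(R), np.abs(R_ref), atol=1e-8)
```

In a real parallel implementation each leaf lives on a different processor and only the tiny n x n R factors travel over the network, which is what makes the communication cost so low.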
TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid: 4x on 4 cities (Dongarra et al.)
  – Cloud: 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou; Anderson, Ballard, Keutzer, others
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]. Factor each block Wi = Pi·Li·Ui and choose b pivot rows of Wi; call them Wi'.
Stack the winners pairwise: [W1'; W2'] = P12·L12·U12, choose b pivot rows W12'; likewise [W3'; W4'] → W34'.
Final round: [W12'; W34'] = P1234·L1234·U1234; choose the final b pivot rows.
• Go back to W and use these b pivot rows
• Move them to the top, then do LU without pivoting
• Extra work, but a lower-order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
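The tournament itself can be sketched with SciPy's partial-pivoted LU as the per-round "judge". This is an illustrative sketch, not the blocked CALU kernel; block count and helper names are choices made here:

```python
import numpy as np
from scipy.linalg import lu

def tournament_pivot_rows(W, leaves=4):
    """Tournament pivoting sketch for a tall-skinny n x b matrix W.

    Each round runs LU with partial pivoting on a stack of candidate
    rows and keeps its first b pivot rows; winners are merged pairwise
    up a binary tree. Returns global indices of the b chosen rows."""
    m, b = W.shape

    def best_rows(idx):
        P, L, U = lu(W[idx])                    # W[idx] = P @ L @ U
        order = (P.T @ np.arange(len(idx))).astype(int)
        return idx[order[:b]]                   # first b pivot rows

    groups = np.array_split(np.arange(m), leaves)
    winners = [best_rows(g) for g in groups]
    while len(winners) > 1:                     # reduction tree
        winners = [best_rows(np.concatenate(winners[i:i + 2]))
                   for i in range(0, len(winners), 2)]
    return winners[0]

W = np.random.default_rng(1).standard_normal((400, 6))
rows = tournament_pivot_rows(W)
assert len(rows) == 6 and len(set(rows.tolist())) == 6
```

As on the slide, the chosen rows are then moved to the top of W and the factorization proceeds without further pivoting.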
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
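A quick consistency check on those numbers shows why latency dominates for small transfers: a message must carry on the order of 100 KB before its bandwidth cost even matches its latency cost.

```python
# Back-of-envelope with the slide's parameters: how much data must one
# message carry before bandwidth cost equals latency cost?
alpha = 1e-6          # interconnect latency: 1 microsec per message
bw = 100e9            # interconnect bandwidth: 100 GB/sec
breakeven = alpha * bw
assert abs(breakeven - 100e3) < 1e-3      # ~100 KB per message

# Same calculation for DRAM:
mem_alpha, mem_bw = 50e-9, 400e9
assert abs(mem_alpha * mem_bw - 20e3) < 1e-3   # ~20 KB per access
```

Any algorithm that sends many small messages on such a machine pays almost entirely for latency, which is why the lower bounds count #messages separately from #words_moved.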
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Figure: heat map of predicted speedup; x-axis: log2(p), y-axis: log2(n^2/p) = log2(memory_per_proc)]
2.5D vs 2D LU, With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
• Algorithms shown minimizing #Messages use (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports); only some shown, plus ours
• Cache-oblivious are underlined; Green are ours; ? is unknown/future work

Algorithm | 2 Levels of Memory: #Words Moved and #Messages | Multiple Levels of Memory: #Words Moved and #Messages
BLAS-3 | Usual blocked or recursive algorithms | Usual blocked algorithms (nested) or recursive [Gustavson,97]
Cholesky | LAPACK (with b = M^1/2), [Gustavson,97], [BDHS09]; messages: [Gustavson,97], [AhmedPingali,00], [BDHS09] | (←same); (←same)
LU with pivoting | LAPACK (sometimes), [Toledo,97], [GDX 08]; messages: [GDX 08] (partial pivoting?) | [Toledo,97], [GDX 08]?; [GDX 08]?
QR, Rank-revealing | LAPACK (sometimes), [ElmrothGustavson,98], [DGHL08]; messages: [FrensWise,03] (but 3x flops?), [DGHL08] | [ElmrothGustavson,98], [DGHL08]?; [FrensWise,03], [DGHL08]?
Eig, SVD | Not LAPACK; [BDD11] (randomized, but more flops), [BDK11]; messages: [BDD11, BDK11] | [BDD11, BDK11]
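The "usual blocked algorithm" in the BLAS-3 row is easy to make concrete. With block size b ≈ (M/3)^1/2 each tile fits in fast memory and gets reused, so roughly n^3/b = O(n^3/M^1/2) words move, matching the lower bound. A minimal sketch:

```python
import numpy as np

def blocked_matmul(A, B, b):
    """Classic blocked (tiled) matrix multiply.

    With block size b chosen so three b x b tiles fit in fast
    memory, each tile is reused b times once loaded, giving
    O(n^3 / b) = O(n^3 / M^{1/2}) words moved."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # one tile-sized multiply-accumulate per (i, j, k)
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

A = np.random.randn(64, 64)
B = np.random.randn(64, 64)
assert np.allclose(blocked_matmul(A, B, 16), A @ B)
```

The recursive (cache-oblivious) variants in the table achieve the same bound at every level of the hierarchy without knowing M.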
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors (conventional approach)
• Minimum memory per processor: M = O(n^2/P)
• Recall lower bounds:
  #words_moved = Ω((n^3/P)/M^1/2) = Ω(n^2/P^1/2)
  #messages = Ω((n^3/P)/M^3/2) = Ω(P^1/2)

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | ScaLAPACK | log P | (n log P)/P^1/2
QR | ScaLAPACK | log P | (n log P)/P^1/2
Sym Eig, SVD | ScaLAPACK | log P | n/P^1/2
Nonsym Eig | ScaLAPACK | P^1/2 · log P | n · log P
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors (better)
• Minimum memory per processor: M = O(n^2/P)
• Recall lower bounds:
  #words_moved = Ω((n^3/P)/M^1/2) = Ω(n^2/P^1/2)
  #messages = Ω((n^3/P)/M^3/2) = Ω(P^1/2)

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX10] | log P | log P
QR | [DGHL08] | log P | log^3 P
Sym Eig, SVD | [BDD11] | log P | log^3 P
Nonsym Eig | [BDD11] | log P | log^3 P
Can we do even better?
• Assume n x n matrices on P processors
• Use c copies of data: M = O(c·n^2/P) per processor
• Increasing M reduces lower bounds:
  #words_moved = Ω((n^3/P)/M^1/2) = Ω(n^2/(c^1/2·P^1/2))
  #messages = Ω((n^3/P)/M^3/2) = Ω(P^1/2/c^3/2)

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [DS11, SBD11] | polylog P | polylog P
Cholesky | [SD11, in prog.] | polylog P | c^2 · polylog P – optimal?
LU | [DS11, SBD11] | polylog P | c^2 · polylog P – optimal?
QR | Via Cholesky QR | polylog P | c^2 · polylog P – optimal?
Sym Eig, SVD | ? | ? | ?
Nonsym Eig | ? | ? | ?
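The payoff from replication is direct from the formulas: since #words_moved scales as 1/c^1/2 and #messages as 1/c^3/2, going from c = 1 to c = 4 copies halves words moved and cuts messages by 8x. The example sizes below are arbitrary:

```python
import math

def lower_bounds(n, P, c=1):
    """Per-processor communication lower bounds for O(n^3) dense
    linear algebra with c replicas of the data (M = c*n^2/P words)."""
    M = c * n**2 / P
    words = (n**3 / P) / math.sqrt(M)   # = n^2 / (c*P)^{1/2}
    messages = (n**3 / P) / M**1.5      # = P^{1/2} / c^{3/2}
    return words, messages

w1, m1 = lower_bounds(n=2**20, P=2**16, c=1)
w4, m4 = lower_bounds(n=2**20, P=2**16, c=4)
assert abs(w1 / w4 - 2.0) < 1e-9   # 2x fewer words moved
assert abs(m1 / m4 - 8.0) < 1e-9   # 8x fewer messages
```

This is exactly the trade the 2.5D algorithms exploit: spend a factor of c more memory to shrink both communication terms.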
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
25D vs 2D LUWith and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n×n matrices on P processors (conventional approach)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  words_moved = Ω((n^3 / P) / M^(1/2)) = Ω(n^2 / P^(1/2))
  messages = Ω((n^3 / P) / M^(3/2)) = Ω(P^(1/2))

| Algorithm | Reference | Factor exceeding lower bound for words_moved | Factor exceeding lower bound for messages |
|---|---|---|---|
| Matrix Multiply | [Cannon, 69] | 1 | 1 |
| Cholesky | ScaLAPACK | log P | log P |
| LU | ScaLAPACK | log P | (n log P) / P^(1/2) |
| QR | ScaLAPACK | log P | (n log P) / P^(1/2) |
| Sym Eig, SVD | ScaLAPACK | log P | n / P^(1/2) |
| Nonsym Eig | ScaLAPACK | P^(1/2) log P | n log P |
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n×n matrices on P processors (better)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  words_moved = Ω((n^3 / P) / M^(1/2)) = Ω(n^2 / P^(1/2))
  messages = Ω((n^3 / P) / M^(3/2)) = Ω(P^(1/2))

| Algorithm | Reference | Factor exceeding lower bound for words_moved | Factor exceeding lower bound for messages |
|---|---|---|---|
| Matrix Multiply | [Cannon, 69] | 1 | 1 |
| Cholesky | ScaLAPACK | log P | log P |
| LU | [GDX10] | log P | log P |
| QR | [DGHL08] | log P | log^3 P |
| Sym Eig, SVD | [BDD11] | log P | log^3 P |
| Nonsym Eig | [BDD11] | log P | log^3 P |
Can we do even better?
• Assume n×n matrices on P processors
• Use c copies of data: M = O(c·n^2 / P) per processor
• Increasing M reduces the lower bounds:
  words_moved = Ω((n^3 / P) / M^(1/2)) = Ω(n^2 / (c^(1/2) P^(1/2)))
  messages = Ω((n^3 / P) / M^(3/2)) = Ω(P^(1/2) / c^(3/2))

| Algorithm | Reference | Factor exceeding lower bound for words_moved | Factor exceeding lower bound for messages |
|---|---|---|---|
| Matrix Multiply | [DS11, SBD11] | polylog P | polylog P |
| Cholesky | [SD11, in prog.] | polylog P | c^2 polylog P (optimal?) |
| LU | [DS11, SBD11] | polylog P | c^2 polylog P (optimal?) |
| QR | via Cholesky QR | polylog P | c^2 polylog P (optimal?) |
| Sym Eig, SVD | | | |
| Nonsym Eig | | | |
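A quick numeric check of how c-fold replication loosens the bounds (illustrative sketch only; the function name and sample numbers are mine): with M = c·n^2/P per processor, the bandwidth lower bound shrinks by a factor of c^(1/2) and the latency lower bound by c^(3/2).

```python
from math import sqrt

def lower_bounds(n, P, c=1):
    """Evaluate the (orders-of-growth) lower bounds from the slide:
    words_moved = n^2 / (c^(1/2) * P^(1/2)), messages = P^(1/2) / c^(3/2)."""
    words = n**2 / (sqrt(c) * sqrt(P))
    messages = sqrt(P) / c**1.5
    return words, messages

# Example: quadrupling memory (c = 4) halves the bandwidth bound
# and cuts the latency bound by a factor of 8.
w1, m1 = lower_bounds(4096, 64, c=1)
w4, m4 = lower_bounds(4096, 64, c=4)
```

Constant factors are of course absorbed by the Ω; the point is only the c^(1/2) and c^(3/2) scaling.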
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
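The core SBR step, zeroing a band entry with an orthogonal similarity transform, can be sketched with a single Givens rotation (a toy illustration, not the LAPACK or SBR implementation): killing one entry below the band's edge creates a "bulge" outside the band, which later transformations must chase down the matrix.

```python
import math

def apply_givens_sym(A, i, j, c, s):
    """Apply the symmetric similarity G A G^T, where G rotates rows/cols i and j."""
    n = len(A)
    for k in range(n):  # rotate rows i and j
        a, b = A[i][k], A[j][k]
        A[i][k], A[j][k] = c * a + s * b, -s * a + c * b
    for k in range(n):  # rotate columns i and j (keeps A symmetric)
        a, b = A[k][i], A[k][j]
        A[k][i], A[k][j] = c * a + s * b, -s * a + c * b

# Symmetric matrix with bandwidth b = 2.
A = [[4.0, 1.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 4.0, 1.0, 1.0],
     [0.0, 1.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 1.0, 4.0]]

# Rotate rows/cols 1 and 2 so that A[2][0] (and A[0][2]) become zero.
r = math.hypot(A[1][0], A[2][0])
c, s = A[1][0] / r, A[2][0] / r
apply_givens_sym(A, 1, 2, c, s)
# A[2][0] is now zero, but a bulge has appeared at A[1][4] / A[4][1],
# outside the original band; it must be chased down the band in turn.
```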
[Figure sequence: Successive Band Reduction via bulge chasing. Annotations: b = bandwidth, c = #columns, d = #diagonals, constraint c + d ≤ b; band width b+1, parallelogram width d+1, bulge width d+c; orthogonal transformations Q1, Q1^T, ..., Q5, Q5^T applied in numbered steps 1–6, each creating a bulge that is chased down the band. Only the leading parallelogram of each trapezoid needs to be zeroed out.]
Conventional vs CA-SBR
• Conventional: touch all data 4 times. Communication-avoiding: touch all data once.
• Many tuning parameters: number of "sweeps", diagonals cleared per sweep, sizes of parallelograms, bulges chased at one time, how far to chase each bulge
• Right choices reduce words_moved by a factor of M/bw, not just M^(1/2)
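The "diagonals cleared per sweep" trade-off above can be sketched with a toy sweep counter (a hypothetical model, not the actual SBR code): each sweep that clears d diagonals shrinks the bandwidth from b to b - d, so clearing half the remaining band per sweep needs only about log2(b) sweeps, versus b - 1 sweeps when clearing one diagonal at a time.

```python
def sweeps_to_tridiagonal(b, choose_d):
    """Count sweeps until bandwidth b reaches 1 (tridiagonal).
    choose_d(b) picks how many diagonals the next sweep clears;
    it is clamped so the slide's constraint c + d <= b can hold."""
    sweeps = 0
    while b > 1:
        d = max(1, min(choose_d(b), b - 1))
        b -= d
        sweeps += 1
    return sweeps

one_at_a_time = sweeps_to_tridiagonal(64, lambda b: 1)       # 63 sweeps
halving       = sweeps_to_tridiagonal(64, lambda b: b // 2)  # 6 sweeps
```

Fewer sweeps means the band is read and written fewer times, which is exactly the "touch all data once" goal of the CA variant.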
2.5D vs 2D LU, With and Without Pivoting
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Summary of dense sequential algorithms attaining communication lower bounds
bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts
bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work
Algorithm 2 Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]
Cholesky LAPACK (with b = M12) [Gustavson 97]
[BDHS09]
[Gustavson97] [AhmedPingali00]
[BDHS09]
(larrsame) (larrsame)
LU withpivoting
LAPACK (sometimes)[Toledo97] [GDX 08]
[GDX 08] partial pivoting
[Toledo 97][GDX 08]
[GDX 08]
QRRank-revealing
LAPACK (sometimes) [ElmrothGustavson98]
[DGHL08]
[FrensWise03]but 3x flops
[DGHL08]
[ElmrothGustavson98]
[DGHL08]
[FrensWise03] [DGHL08]
Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]
[BDD11BDK11]
[BDD11BDK11]
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12
QR ScaLAPACK log P n log P P12
Sym Eig SVD ScaLAPACK log P n P12
Nonsym Eig ScaLAPACK P12 log P n log P
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA - SBR
Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm Reference Factor exceeding lower bound for words_moved
Factor exceeding lower bound for messages
Matrix Multiply
[DS11SBD11] polylog P polylog P
Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal
LU [DS11SBD11] polylog P c2 polylog P ndash optimal
QR Via Cholesky QR polylog P c2 polylog P ndash optimal
Sym Eig SVD Nonsym Eig
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
b+1
b+1
Successive Band Reduction
1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
12
Q1
b+1
b+1
d+1
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
Successive Band Reduction
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
Successive Band Reduction
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
Successive Band Reduction
Only need to zero out leadingparallelogram of each trapezoid
2
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Conventional vs CA-SBR
• Conventional: touch all data 4 times
• Communication-avoiding: touch all data once
• Many tuning parameters: number of "sweeps", diagonals cleared per sweep, sizes of parallelograms, number of bulges chased at one time, how far to chase each bulge
• Right choices reduce words_moved by a factor of M/bw, not just M^(1/2)
Can we do even better
bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )
Algorithm        Reference         Factor exceeding lower     Factor exceeding lower
                                   bound for words_moved      bound for messages
Matrix Multiply  [DS11, SBD11]     polylog P                  polylog P
Cholesky         [SD11, in prog.]  polylog P                  c² polylog P – optimal?
LU               [DS11, SBD11]     polylog P                  c² polylog P – optimal?
QR               via Cholesky QR   polylog P                  c² polylog P – optimal?
Sym Eig, SVD; Nonsym Eig
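A quick numeric check of the algebra above (a sketch; the values of n, P, c are arbitrary): substituting M = c·n²/P into the memory-dependent bounds recovers the replication-dependent forms, so words_moved shrinks by a factor of √c and messages by c^(3/2) relative to c = 1.

```python
import math

# Verify, for sample sizes, that with M = c*n^2/P:
#   (n^3/P) / M^(1/2) == n^2 / (c^(1/2) * P^(1/2))
#   (n^3/P) / M^(3/2) == P^(1/2) / c^(3/2)
n, P, c = 4096, 64, 4
M = c * n**2 / P                       # per-processor memory with c copies

words = (n**3 / P) / math.sqrt(M)      # bandwidth lower bound
messages = (n**3 / P) / M**1.5         # latency lower bound

print(words, n**2 / math.sqrt(c * P))  # the two forms agree
print(messages, math.sqrt(P) / c**1.5)
```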
Speedups of Sym. Band Reduction vs DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
– n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
– n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
– n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
– n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
– Best sequential speedup vs MKL: 1.9x
– Best sequential speedup vs ACML: 8.5x
Communication Lower Bounds for Strassen-like Matmul Algorithms
• Proof: graph expansion (different from classical matmul)
– Strassen-like: DAG must be "regular" and connected
• Extends up to M = n²/P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

Classical O(n³) matmul:      words_moved = Ω( M·(n/M^(1/2))³ / P )
Strassen's O(n^lg7) matmul:  words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul: words_moved = Ω( M·(n/M^(1/2))^ω / P )
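For reference, the O(n^lg7) algorithm these bounds describe can be sketched in a few lines (a minimal sequential Strassen in pure Python, power-of-two sizes only; a tuned implementation would switch to classical matmul below a cutoff size):

```python
def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def strassen(A, B):
    """Multiply two n x n matrices, n a power of 2, with 7 recursive products."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    q = lambda M, i, j: [row[j*h:(j+1)*h] for row in M[i*h:(i+1)*h]]  # quadrant
    A11, A12, A21, A22 = q(A,0,0), q(A,0,1), q(A,1,0), q(A,1,1)
    B11, B12, B21, B22 = q(B,0,0), q(B,0,1), q(B,1,0), q(B,1,1)
    # Strassen's 7 products (instead of the classical 8)
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```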
Communication-Avoiding Parallel Strassen (CAPS)
• Two ways to run Strassen's 7 subproblems:
– BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
– DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
• CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step
• In practice, how best to interleave BFS and DFS is a "tuning parameter"
Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080
Speedups: 24%–184% over previous Strassen-based algorithms
For details, see the SC12 talk "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm
Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
– Algorithms:
• LDLᵀ, QR with pivoting, other pivoting schemes, eigenproblems, …
• All-pairs shortest paths, …
• Both 2D (c=1) and 2.5D (c>1)
• But only bandwidth may decrease with c>1, not latency
• Sparse matrices
– Platforms:
• Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
– Software:
• Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q):
– Qbox (with LLNL, IBM): molecular dynamics
– CTF (with ANL): symmetric tensor contractions
Outline
• "Direct" Linear Algebra
– Lower bounds on communication
– New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)
Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
– Does k SpMVs with A and starting vector
– Many such "Krylov Subspace Methods"
• Goal: minimize communication
– Assume matrix "well-partitioned"
– Serial implementation:
• Conventional: O(k) moves of data from slow to fast memory
• New: O(1) moves of data – optimal
– Parallel implementation on p processors:
• Conventional: O(k log p) messages (k SpMV calls, dot products)
• New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
– Price: some redundant computation
Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A

[Figure: animated sequence showing rows x, A·x, A²·x, A³·x over columns 1, 2, 3, 4, …, 32, with the dependency cone of each computed entry highlighted.]

• Sequential algorithm: Steps 1–4 sweep left to right; each step computes all k levels for the next block of columns, so the matrix and vectors are read from slow memory only once
• Parallel algorithm: Proc 1 – Proc 4 each work on an (overlapping) trapezoid; each processor communicates once with its neighbors
• Same idea works for general sparse matrices
– Simple block-row partitioning → (hyper)graph partitioning
– Top-to-bottom processing → Traveling Salesman Problem
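A toy sequential model of the kernel above (a sketch: the function names and partitioning scheme are illustrative, and A is the simple three-point stencil with zero boundaries, matching the tridiagonal example). Each block fetches k ghost entries per side once, then computes all k levels of its overlapping trapezoid locally, reproducing the k-rounds-of-communication reference result exactly.

```python
# Matrix powers kernel [A*x, A^2*x, ..., A^k*x] for a tridiagonal A,
# modeled as the stencil y[i] = x[i-1] + x[i] + x[i+1] with zero boundaries.

def spmv_stencil(x):
    n = len(x)
    return [(x[i-1] if i > 0 else 0) + x[i] + (x[i+1] if i < n-1 else 0)
            for i in range(n)]

def matrix_powers(x, k):
    """Reference: k dependent SpMVs (k communication rounds in parallel)."""
    out, v = [], x
    for _ in range(k):
        v = spmv_stencil(v)
        out.append(v)
    return out

def matrix_powers_blocked(x, k, p):
    """CA version: each of p blocks fetches k ghost entries per side once,
    then computes all k levels of its (overlapping) trapezoid locally."""
    n = len(x)
    b = n // p                                     # assume p divides n
    levels = [[None] * n for _ in range(k)]
    for proc in range(p):
        lo, hi = proc * b, (proc + 1) * b          # entries this block owns
        glo, ghi = max(0, lo - k), min(n, hi + k)  # one-time ghost exchange
        local = x[glo:ghi]
        for lev in range(k):
            # Zero-boundary errors creep in only within lev+1 entries of an
            # artificial cut; they never reach the owned range [lo, hi).
            local = spmv_stencil(local)
            for i in range(lo, hi):
                levels[lev][i] = local[i - glo]
    return levels
```

On the slide's example (n=32, k=3, p=4) the two versions agree entry for entry, while the blocked version reads each neighbor only once.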
Minimizing Communication of GMRES to Solve Ax=b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax−b||₂

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)            … SpMV
    MGS(w, v(0), …, v(i-1))   … Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [v, Av, A²v, …, Aᵏv]
  [Q, R] = TSQR(W)            … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!
• Sequential case: words moved decreases by a factor of k
• Parallel case: messages decrease by a factor of k
• "Monomial" basis [Ax, …, Aᵏx] fails to converge
• A different polynomial basis [p1(A)x, …, pk(A)x] does converge
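A small numerical illustration of why the monomial basis fails and a better basis helps (a sketch: the diagonal test matrix, Chebyshev-point shifts, and Leja ordering are common illustrative choices, not the tutorial's exact method). The monomial columns all align with the dominant eigenvector, so the basis becomes numerically rank-deficient; a Newton basis with well-chosen shifts stays far better conditioned.

```python
import numpy as np

# Compare conditioning of the monomial Krylov basis [x, Ax, ..., A^k x]
# against a Newton basis [x, (A-t1)x, (A-t2)(A-t1)x, ...] with
# Leja-ordered Chebyshev-point shifts.

n, k = 100, 15
lam = np.linspace(0.01, 4.0, n)        # spectrum of a model SPD matrix
A = np.diag(lam)
x = np.ones(n)

def basis_cond(shifts):
    """Condition number of the shifted Krylov basis, columns normalized."""
    W = np.empty((n, k + 1))
    v = x.copy()
    W[:, 0] = v
    for j in range(k):
        v = A @ v - shifts[j] * v      # v <- (A - theta_j I) v
        W[:, j + 1] = v
    W = W / np.linalg.norm(W, axis=0)  # column scaling for a fair comparison
    return np.linalg.cond(W)

def leja_order(pts):
    """Greedy ordering that keeps the partial products well-scaled."""
    pts = list(pts)
    out = [max(pts, key=abs)]
    pts.remove(out[0])
    while pts:
        t = max(pts, key=lambda p: np.prod([abs(p - s) for s in out]))
        out.append(t)
        pts.remove(t)
    return out

cheb = 2.005 + 1.995 * np.cos((2 * np.arange(k) + 1) * np.pi / (2 * k))
cond_mono = basis_cond(np.zeros(k))    # monomial basis: all shifts zero
cond_newton = basis_cond(leja_order(cheb))

print(f"monomial: {cond_mono:.1e}  Newton: {cond_newton:.1e}")
```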
Speedups of GMRES on 8-core Intel Clovertown [MHDY09]
• Requires co-tuning kernels
CA-BiCGStab
With residual replacement (RR) à la Van der Vorst and Ye:

  Basis      Replacement iterations (count)
  Naive      74 (1)
  Monomial   7, 15, 24, 31, …, 92, 97, 103 (17)
  Newton     67, 98 (2)
  Chebyshev  68 (1)
Tuning Space for Krylov Methods
• Classification of sparse operators for avoiding communication:

                        Nonzero entries:
  Indices:              Explicit (O(nnz))    Implicit (o(nnz))
  Explicit (O(nnz))     CSR and variations   Vision, climate, AMR, …
  Implicit (o(nnz))     Graph Laplacian      Stencils

• Explicit indices or nonzero entries cause most communication, along with vectors
– Ex: with stencils (all implicit), all communication is for vectors
• Operations:
– [x, Ax, A²x, …, Aᵏx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
– Number of columns in x
– [x, Ax, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
– Return all vectors or just the last one
• Co-tuning and/or interleaving:
– W = [x, Ax, …, Aᵏx] and TSQR(W) or WᵀW or …
– Ditto, but throw away W
• Preconditioned versions
Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
– Many different algorithms reorganized
• More underway, more to be done
– Need to recognize stable variants more easily
– Preconditioning
• Hierarchically semiseparable matrices
– Autotuning and synthesis
• pOSKI for SpMV – available at bebop.cs.berkeley.edu
• Different kinds of "sparse matrices"
Outline
• "Direct" Linear Algebra
– Lower bounds on communication
– New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)
Recall: Optimal Sequential Matmul
• Naïve code:
  for i=1:n, for j=1:n, for k=1:n: C(i,j) += A(i,k)*B(k,j)
• "Blocked" code:
  for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
    for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
      i = i1+i2, j = j1+j2, k = k1+k2
      C(i,j) += A(i,k)*B(k,j)         … b x b matmul on blocks
• Thm: picking b = M^(1/2) attains the lower bound: words_moved = Ω(n³/M^(1/2))
• Where does the 1/2 come from?
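The blocked loop nest above, transcribed directly (a sketch; the block size b stands in for M^(1/2), and non-dividing sizes are handled with min()):

```python
def blocked_matmul(A, B, b):
    """C = A*B with b x b blocking; with b ~ sqrt(M), each triple of blocks
    fits in fast memory, attaining words_moved = O(n^3 / M^(1/2))."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                # multiply one b x b block of A by one of B, accumulate into C
                for i in range(i1, min(i1 + b, n)):
                    for j in range(j1, min(j1 + b, n)):
                        s = C[i][j]
                        for k in range(k1, min(k1 + b, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```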
New Thm Applied to Matmul
• for i=1:n, for j=1:n, for k=1:n: C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

        i  j  k
  Δ = [ 1  0  1 ]  A
      [ 0  1  1 ]  B
      [ 1  1  0 ]  C

• Solve LP for x = [xi, xj, xk]ᵀ: max 1ᵀx s.t. Δx ≤ 1
– Result: x = [1/2, 1/2, 1/2]ᵀ, 1ᵀx = 3/2 = s
• Thm: words_moved = Ω(n³/M^(s-1)) = Ω(n³/M^(1/2)), attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
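This little LP can be checked by hand (a sketch using numpy only): since all three constraints are tight at the optimum, solving Δx = 1 gives the primal point, and a nonnegative y with Δᵀy ≥ 1 of equal objective certifies optimality by LP duality.

```python
import numpy as np

# Primal: max 1'x s.t. Delta x <= 1, x >= 0  (matmul index matrix)
Delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)
one = np.ones(3)

x = np.linalg.solve(Delta, one)      # all constraints tight at the optimum
y = np.linalg.solve(Delta.T, one)    # dual: min 1'y s.t. Delta' y >= 1

assert np.all(x >= 0) and np.all(Delta @ x <= one + 1e-12)    # primal feasible
assert np.all(y >= 0) and np.all(Delta.T @ y >= one - 1e-12)  # dual feasible
assert np.isclose(one @ x, one @ y)  # equal objectives => both optimal

print(x, one @ x)                    # x = [0.5 0.5 0.5], s = 1.5
```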
New Thm Applied to Direct N-Body
• for i=1:n, for j=1:n: F(i) += force(P(i), P(j))
• Record array indices in matrix Δ:

        i  j
  Δ = [ 1  0 ]  F
      [ 1  0 ]  P(i)
      [ 0  1 ]  P(j)

• Solve LP for x = [xi, xj]ᵀ: max 1ᵀx s.t. Δx ≤ 1
– Result: x = [1, 1], 1ᵀx = 2 = s
• Thm: words_moved = Ω(n²/M^(s-1)) = Ω(n²/M¹), attained by block sizes M^xi, M^xj = M¹, M¹
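The M-by-M blocking the bound prescribes, as a toy (a sketch; the integer-valued "force" is a made-up interaction chosen so that blocked and naive sums match exactly):

```python
def force(pi, pj):
    # toy pairwise interaction (integer-valued so sums are exact)
    return pj - pi

def nbody_naive(P):
    n = len(P)
    return [sum(force(P[i], P[j]) for j in range(n)) for i in range(n)]

def nbody_blocked(P, b):
    """Tile both loops by b ~ M: each (i-block, j-block) pair of particles is
    loaded once, giving words_moved = O(n^2 / M) instead of O(n^2)."""
    n = len(P)
    F = [0] * n
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            # interact one block of targets with one block of sources
            for i in range(i1, min(i1 + b, n)):
                for j in range(j1, min(j1 + b, n)):
                    F[i] += force(P[i], P[j])
    return F
```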
N-Body Speedups on IBM BG/P (Intrepid)
8K cores, 32K particles
• 11.8x speedup
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik
New Thm Applied to Some Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

        i1 i2 i3 i4 i5 i6
  Δ = [ 1  0  1  0  0  1 ]  A1
      [ 1  1  0  1  0  0 ]  A2
      [ 0  1  1  0  1  0 ]  A3
      [ 0  0  1  1  0  1 ]  A3, A4
      [ 0  1  0  0  0  1 ]  A5
      [ 1  0  0  1  1  0 ]  A6

• Solve LP for x = [x1, …, x6]ᵀ: max 1ᵀx s.t. Δx ≤ 1
– Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀx = 15/7 = s
• Thm: words_moved = Ω(n⁶/M^(s-1)) = Ω(n⁶/M^(8/7)), attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)
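Checking the claimed LP solution numerically (a sketch): since all six constraints are tight at the optimum, solving Δx = 1 recovers x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7] with 1ᵀx = 15/7.

```python
import numpy as np

# Index matrix for the "random code" example: one row per distinct
# array subscript, one column per loop index i1..i6.
Delta = np.array([
    [1, 0, 1, 0, 0, 1],   # A1(i1,i3,i6)
    [1, 1, 0, 1, 0, 0],   # A2(i1,i2,i4)
    [0, 1, 1, 0, 1, 0],   # A3(i2,i3,i5)
    [0, 0, 1, 1, 0, 1],   # A4(i3,i4,i6)  (and A3(i3,i4,i6))
    [0, 1, 0, 0, 0, 1],   # A5(i2,i6)
    [1, 0, 0, 1, 1, 0],   # A6(i1,i4,i5)
])
x = np.linalg.solve(Delta, np.ones(6))  # all constraints tight at optimum
print(x * 7, x.sum())                   # approx [2 3 1 2 3 4]/7, sum 15/7
```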
Approach to Generalizing Lower Bounds
• Matmul:
  for i=1:n, for j=1:n, for k=1:n: C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z³, access locations indexed by (i,j), (i,k), (k,j)
• General case:
  for i1=1:n, for i2=i1:m, …, for ik=i3:i4
    C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
    D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Zᵏ, access locations indexed by "projections", e.g.
    φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
    φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?
General Communication Bound
• Given S ⊆ Zᵏ and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): given s1, …, sm,
    |S| ≤ Πj |φj(S)|^sj
• Def: Hölder–Brascamp–Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Zᵏ: rank(H) ≤ Σj sj·rank(φj(H))
• Thm: given a program with array refs given by the φj, choose the sj to minimize s_HBL = Σj sj subject to the HBL-LP. Then
    words_moved = Ω( #iterations / M^(s_HBL − 1) )
– Proof depends on the recent result in pure mathematics by Christ/Tao/Carbery/Bennett
Is This Bound Attainable? (1/2)
• But first: can we even write it down?
– Thm (bad news): reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
– Thm (good news): can write it down explicitly in many cases of interest (e.g., when all φj = subsets of indices)
– Thm (good news): easy to approximate
• If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
• Tarski-decidable to get a superset of constraints (may get s_HBL too large)
Is This Bound Attainable? (2/2)
• Depends on the loop dependencies
• Best case: none, or reductions (matmul)
• Thm: when all φj = subsets of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:      minimize 1ᵀs s.t. sᵀΔ ≥ 1ᵀ
    Dual-HBL-LP: maximize 1ᵀx s.t. Δx ≤ 1
  Then for a sequential algorithm, tile index ij by M^xj
• Ex: matmul: s = [1/2, 1/2, 1/2]ᵀ = x
• Extends to unimodular transforms of indices
Ongoing Work
• Identify more decidable cases
– Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel
ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework
Summary
Donrsquot Communichellip
108
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms www.cs.berk
- Why avoid communication? (1/2)
- Why avoid communication? (2/3)
- Why Minimize Communication? (2/2)
- Why Minimize Communication? (2/2) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all "direct" linear algebra
- Lower bound for all "direct" linear algebra (2)
- Lower bound for all "direct" linear algebra (3)
- Can we attain these lower bounds?
- Naïve Matrix Multiply
- Naïve Matrix Multiply (2)
- Naïve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound?
- Recursive Matrix Multiplication (RMM) (1/2)
- Recursive Matrix Multiplication (RMM) (2/2)
- How hard is hand-tuning matmul, anyway?
- How hard is hand-tuning matmul, anyway? (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
- SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid (2)
- SUMMA Communication Costs
- Can we do better?
- 2.5D Matrix Multiplication
- 2.5D Matrix Multiplication (2)
- 2.5D Matmul on BG/P, 16K nodes / 64K cores
- 2.5D Matmul on BG/P, 16K nodes / 64K cores (2)
- Perfect Strong Scaling – in Time and Energy (1/2)
- Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n=64K
- Perfect Strong Scaling – in Time and Energy (2/2)
- Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU, QR, …
- TSQR: QR of a Tall, Skinny matrix
- TSQR: QR of a Tall, Skinny matrix (2)
- TSQR: An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU: Using similar idea for TSLU as TSQR: Use reduction
- Exascale Machine Parameters, Source: DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination: 2D CA
- 2.5D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better?
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym. Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable? (1/2)
- Is this bound attainable? (2/2)
- Ongoing Work
- For more details
- Summary
Communication Lower Bounds for Strassen-like matmul algorithms
• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n²/p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?
Classical O(n³) matmul: words_moved = Ω(M(n/M^(1/2))³/P)
Strassen's O(n^(lg 7)) matmul: words_moved = Ω(M(n/M^(1/2))^(lg 7)/P)
Strassen-like O(n^ω) matmul: words_moved = Ω(M(n/M^(1/2))^ω/P)
Communication Avoiding Parallel Strassen (CAPS)

BFS step vs DFS step:
– BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
– DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7, then BFS step, else DFS step

In practice, how best to interleave BFS and DFS is a "tuning parameter".
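The interleaving rule above can be sketched as a tiny scheduler. This is an illustrative sketch, not the actual CAPS implementation: the function names and the fixed memory test passed in as a callable are assumptions.

```python
def choose_step(enough_memory: bool, p: int) -> str:
    """The rule from the slide: if EnoughMemory and P >= 7, take a
    BFS step, otherwise a DFS step."""
    return "BFS" if enough_memory and p >= 7 else "DFS"

def caps_schedule(p: int, levels: int, enough_memory) -> list:
    """Record one BFS/DFS choice per level of Strassen's recursion.
    `enough_memory` stands in for the machine-specific memory test;
    a BFS step splits the processors 7 ways."""
    steps = []
    for lvl in range(levels):
        step = choose_step(enough_memory(lvl), p)
        steps.append(step)
        if step == "BFS":
            p //= 7    # 7 subproblems, each on p//7 processors
    return steps

# With ample memory and P = 343 = 7^3 processors, the first three
# levels take BFS steps (343 -> 49 -> 7 -> 1), then fall back to DFS.
print(caps_schedule(343, 5, lambda lvl: True))
```

Real tuning replaces the simple threshold with a search over BFS/DFS interleavings, which is exactly the "tuning parameter" the slide mentions.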
Performance benchmarking, strong scaling plot: Franklin (Cray XT4), n = 94080
Speedups: 24%–184% (over previous Strassen-based algorithms)
For details, see the SC12 talk "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm
Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms: LDLᵀ, QR with pivoting, other pivoting schemes, eigenproblems, …; all-pairs shortest paths, …; both 2D (c=1) and 2.5D (c>1), but only bandwidth may decrease with c>1, not latency; sparse matrices
  – Platforms: multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software: integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q):
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions
Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)
Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation: conventional, O(k) moves of data from slow to fast memory; new, O(1) moves of data – optimal
  – Parallel implementation on p processors: conventional, O(k log p) messages (k SpMV calls, dot products); new, O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
74
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, A^k x]
• Replace k iterations of y = A·x with [Ax, A²x, …, A^k x]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A
[Figure: the vectors x, A·x, A²·x, A³·x laid out over indices 1…32]
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, A^k x]
• Replace k iterations of y = A·x with [Ax, A²x, …, A^k x]
• Sequential Algorithm: Steps 1–4
• Example: A tridiagonal, n=32, k=3
[Figure: the vectors x, A·x, A²·x, A³·x over indices 1…32, computed one block at a time in Steps 1, 2, 3, 4]
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, A^k x]
• Replace k iterations of y = A·x with [Ax, A²x, …, A^k x]
• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4
• Example: A tridiagonal, n=32, k=3
• Each processor communicates once with neighbors
• Each processor works on an (overlapping) trapezoid
[Figure: the vectors x, A·x, A²·x, A³·x over indices 1…32, partitioned into four overlapping trapezoids, one per processor]
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, A^k x]
Same idea works for general sparse matrices:
– Simple block-row partitioning → (hyper)graph partitioning
– Top-to-bottom processing → Traveling Salesman Problem
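A minimal sketch of the parallel matrix powers kernel for the slide's tridiagonal example: each "processor" fetches its block of x plus k ghost entries per side in one communication step, then computes its rows of Ax, A²x, A³x entirely locally, paying some redundant work at the overlaps. The matrix values and block sizes here are illustrative, not from the talk.

```python
import numpy as np

def tridiag(n):
    """An arbitrary tridiagonal test matrix (values not from the talk)."""
    A = np.zeros((n, n))
    idx = np.arange(n)
    A[idx, idx] = 2.0
    A[idx[:-1], idx[:-1] + 1] = -1.0
    A[idx[1:], idx[1:] - 1] = -1.0
    return A

def matrix_powers_block(A, x, k, lo, hi):
    """Rows lo:hi of A@x, ..., A^k@x from one fetch of x[lo-k:hi+k].
    After s local SpMVs, entries within s of the ghost boundary are
    stale, but the owned block lo:hi stays correct for all s <= k."""
    n = len(x)
    g_lo, g_hi = max(lo - k, 0), min(hi + k, n)
    xs = x[g_lo:g_hi].copy()             # owned entries + ghost zones
    A_loc = A[g_lo:g_hi, g_lo:g_hi]      # local piece of the band
    owned = slice(lo - g_lo, hi - g_lo)
    out = []
    for _ in range(k):
        xs = A_loc @ xs                  # one local SpMV, no communication
        out.append(xs[owned].copy())
    return out

n, k = 32, 3
A, x = tridiag(n), np.arange(1.0, 33.0)
# four "processors", each owning 8 contiguous rows, as in the figure
blocks = [matrix_powers_block(A, x, k, lo, lo + 8) for lo in range(0, n, 8)]
for s in range(k):
    stitched = np.concatenate([b[s] for b in blocks])
    assert np.allclose(stitched, np.linalg.matrix_power(A, s + 1) @ x)
```

The trapezoids in the figure are exactly the shrinking valid regions of each ghosted block: after s products, only entries at least s away from a ghost boundary are correct, which is why k extra entries per side suffice for k powers.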
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax−b||₂

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)              … SpMV
    MGS(w, v(0), …, v(i-1))     … orthogonalize
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [v, Av, A²v, …, A^k v]
  [Q,R] = TSQR(W)               … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!
Sequential case: #words_moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k
• "Monomial" basis [Ax, …, A^k x] fails to converge
• Different polynomial basis [p₁(A)x, …, p_k(A)x] does converge
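The precision loss is easy to demonstrate: the columns of the monomial basis all align with A's dominant eigenvector, so the basis matrix becomes numerically rank-deficient. A small illustration (the matrix, sizes, and threshold below are arbitrary choices, not from the talk):

```python
import numpy as np

n, k = 30, 20
A = np.diag(np.linspace(1.0, 2.0, n))    # known, well-separated eigenvalues
W = np.empty((n, k + 1))
W[:, 0] = np.ones(n)
for j in range(k):                        # monomial Krylov basis [x, Ax, ..., A^k x]
    W[:, j + 1] = A @ W[:, j]
W /= np.linalg.norm(W, axis=0)            # normalizing columns doesn't save it
print(f"cond(W) = {np.linalg.cond(W):.1e}")   # huge: columns are numerically dependent
```

A Newton basis [(A−θ₁I)⋯(A−θⱼI)]x with shifts θᵢ near A's eigenvalues, or a Chebyshev basis, keeps the columns well separated, which is the fix the slide refers to.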
Speedups of GMRES on 8-core Intel Clovertown [MHDY09]
Requires co-tuning kernels
CA-BiCGStab with Residual Replacement (RR) à la Van der Vorst and Ye:

  Basis:             Naive     Monomial                               Newton         Chebyshev
  Replacement its.:  74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)   68 (1)
Tuning space for Krylov Methods
• Classification of sparse operators for avoiding communication, by how the indices and nonzero entries are represented:

                         Nonzero entries:
  Indices:               Explicit (O(nnz))      Implicit (o(nnz))
  Explicit (O(nnz))      CSR and variations     Vision, climate, AMR, …
  Implicit (o(nnz))      Graph Laplacian        Stencils

  – Explicit indices or nonzero entries cause most communication, along with vectors
  – Ex: with stencils (all implicit), all communication is for vectors
• Operations:
  – [x, Ax, A²x, …, A^k x] or [x, p₁(A)x, p₂(A)x, …, p_k(A)x]
  – Number of columns in x
  – [x, Ax, A²x, …, A^k x] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)^k y], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)^k y]
  – Return all vectors or just the last one
• Co-tuning and/or interleaving:
  – W = [x, Ax, A²x, …, A^k x] and {TSQR(W) or WᵀW or …}
  – Ditto, but throw away W
• Preconditioned versions
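The TSQR(W) kernel in the tuning space above can be sketched in a few lines: factor each row-block of the tall-skinny W locally, then do one QR of the stacked small R factors. This is a single-level, single-machine sketch of the idea; the talk's TSQR runs the same reduction over an architecture-dependent tree.

```python
import numpy as np

def tsqr_r(W, nblocks=4):
    """R factor of a tall-skinny W by a two-level TSQR reduction:
    QR each row-block locally, then one QR of the stacked R factors."""
    Rs = [np.linalg.qr(B, mode="r") for B in np.array_split(W, nblocks)]
    return np.linalg.qr(np.vstack(Rs), mode="r")

rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 6))       # tall and skinny
R1 = tsqr_r(W)
R2 = np.linalg.qr(W, mode="r")
# R is unique only up to row signs, so compare Gram matrices instead
assert np.allclose(R1.T @ R1, R2.T @ R2)
```

Each block's QR touches only local data, so the parallel version needs one reduction (O(log p) messages) instead of the O(k log p) messages of column-by-column orthogonalization.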
Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized; more underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning: Hierarchically Semiseparable Matrices
  – Autotuning and synthesis: pOSKI for SpMV – available at bebop.cs.berkeley.edu
  – Different kinds of "sparse matrices"
Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)
Recall optimal sequential Matmul
• Naïve code:
    for i = 1:n, for j = 1:n, for k = 1:n
      C(i,j) += A(i,k) * B(k,j)
• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2; j = j1+j2; k = k1+k2
        C(i,j) += A(i,k) * B(k,j)        … b x b matmul
• Thm: picking b = M^(1/2) attains the lower bound: words_moved = Ω(n³/M^(1/2))
• Where does 1/2 come from?
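The blocked loop nest above, transcribed into runnable form (numpy is used for the inner b x b multiply, and the check against the naïve product is just for illustration):

```python
import numpy as np

def blocked_matmul(A, B, b):
    """The slide's blocked loop nest over b x b tiles of C, A, B.
    Each (i1, j1, k1) iteration touches about 3*b^2 words, so choosing
    b ~ M^(1/2) keeps one tile of each matrix in fast memory."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C

rng = np.random.default_rng(1)
A = rng.standard_normal((16, 16))
B = rng.standard_normal((16, 16))
assert np.allclose(blocked_matmul(A, B, b=4), A @ B)
```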
New Thm applied to Matmul
• for i = 1:n, for j = 1:n, for k = 1:n: C(i,j) += A(i,k) * B(k,j)
• Record array indices in matrix Δ:

        i  j  k
  Δ = [ 1  0  1 ]   A
      [ 0  1  1 ]   B
      [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]ᵀ, 1ᵀx = 3/2 = s
• Thm: words_moved = Ω(n³/M^(s−1)) = Ω(n³/M^(1/2)); attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
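You can check the LP result numerically. For matmul the maximum of 1ᵀx subject to Δx ≤ 1 happens to be attained with all three constraints tight, so here x can be read off by solving the linear system Δx = 1 (a general program would need an actual LP solver):

```python
import numpy as np

# One row per array reference; columns are the loop indices i, j, k.
Delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)

# All constraints are tight at the optimum for this Delta.
x = np.linalg.solve(Delta, np.ones(3))
print(x, x.sum())   # x = [0.5 0.5 0.5], 1^T x = 3/2 = s
```

So s − 1 = 1/2, recovering the familiar Ω(n³/M^(1/2)) bound and the M^(1/2) tile sizes.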
New Thm applied to Direct N-Body
• for i = 1:n, for j = 1:n: F(i) += force(P(i), P(j))
• Record array indices in matrix Δ:

        i  j
  Δ = [ 1  0 ]   F
      [ 1  0 ]   P(i)
      [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1, 1], 1ᵀx = 2 = s
• Thm: words_moved = Ω(n²/M^(s−1)) = Ω(n²/M¹); attained by block sizes M^xi, M^xj = M¹, M¹
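A runnable illustration of the M¹ x M¹ blocking for the n-body loop: tiles of b targets and b sources are processed at a time, and the result matches the naïve double loop. The pairwise `force` used here is a toy stand-in, not a function from the talk.

```python
import numpy as np

def forces_blocked(P, b):
    """Blocked version of: for i, for j: F(i) += force(P(i), P(j)).
    Each b x b tile reads b entries of P(i) and b of P(j), matching
    the M x M blocking from the slide (with b playing the role of M)."""
    n = len(P)
    F = np.zeros(n)
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            Pi = P[i0:i0+b]          # one tile of targets
            Pj = P[j0:j0+b]          # one tile of sources
            # toy pairwise "force": difference of positions, summed over sources
            F[i0:i0+b] += (Pj[None, :] - Pi[:, None]).sum(axis=1)
    return F

P = np.arange(8.0)
naive = np.array([sum(P[j] - P[i] for j in range(8)) for i in range(8)])
assert np.allclose(forces_blocked(P, b=4), naive)
```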
N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles
11.8x speedup
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik
New Thm applied to Random Code
• for i1 = 1:n, for i2 = 1:n, …, for i6 = 1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

         i1 i2 i3 i4 i5 i6
  Δ = [   1  0  1  0  0  1 ]   A1
      [   1  1  0  1  0  0 ]   A2
      [   0  1  1  0  1  0 ]   A3
      [   0  0  1  1  0  1 ]   A3, A4
      [   0  1  0  0  0  1 ]   A5
      [   1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀx = 15/7 = s
• Thm: words_moved = Ω(n⁶/M^(s−1)) = Ω(n⁶/M^(8/7)); attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)
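A quick numerical check of the stated solution: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7] is feasible for Δx ≤ 1 (in fact every constraint is tight) and has objective 1ᵀx = 15/7, matching s and the M^(8/7) in the bound.

```python
import numpy as np

Delta = np.array([[1, 0, 1, 0, 0, 1],   # A1(i1,i3,i6)
                  [1, 1, 0, 1, 0, 0],   # A2(i1,i2,i4)
                  [0, 1, 1, 0, 1, 0],   # A3(i2,i3,i5)
                  [0, 0, 1, 1, 0, 1],   # A3, A4: (i3,i4,i6)
                  [0, 1, 0, 0, 0, 1],   # A5(i2,i6)
                  [1, 0, 0, 1, 1, 0]])  # A6(i1,i4,i5)

x = np.array([2, 3, 1, 2, 3, 4]) / 7
assert np.allclose(Delta @ x, 1.0)      # feasible, every constraint tight
print(x.sum())                          # 15/7 = s, so words_moved = Omega(n^6 / M^(8/7))
```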
Approach to generalizing lower bounds
• Matmul: for i = 1:n, for j = 1:n, for k = 1:n: C(i,j) += A(i,k) * B(k,j)
  ⇒ for (i,j,k) in S = subset of Z³; access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1 = 1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else)
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k; access locations indexed by "projections", e.g.
    φ_C(i1,i2,…,ik) = (i1+2*i3-i7), φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?
General Communication Bound
• Def: Hölder–Brascamp–Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σj sj · rank(φj(H))
• Thm: Given a program with array refs given by the φj, choose sj to minimize s_HBL = Σj sj subject to the HBL-LP. Then
    words_moved = Ω(#iterations / M^(s_HBL − 1))
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S = subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm:
    |S| ≤ Πj |φj(S)|^sj
Is this bound attainable? (1/2)
• But first: can we write it down?
  – Thm (bad news): reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest (e.g., when all φj = subsets of indices)
  – Thm (good news): easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of the constraints (may get s_HBL too large)
Is this bound attainable? (2/2)
• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subsets of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:      minimize 1ᵀs  s.t.  sᵀΔ ≥ 1ᵀ
    Dual-HBL-LP: maximize 1ᵀx  s.t.  Δx ≤ 1
  Then, for a sequential algorithm, tile index ij by M^xj.
• Ex: Matmul: s = [1/2, 1/2, 1/2]ᵀ = x
• Extends to unimodular transforms of indices
Ongoing Work
• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers
For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013: www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013: www.xsede.org
• Free supercomputer accounts to do homework
Summary
Don't Communic…
Time to redesign all linear algebra, n-body, … algorithms and software (and compilers…)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
71
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm
Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and
practicebull Lots of ongoing work on
ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices
ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip
ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip
bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Avoiding Communication in Iterative Linear Algebra
bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo
bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation
bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal
ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal
bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation
74
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3 Step 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors
Proc 1 Proc 2 Proc 3 Proc 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid
Proc 1 Proc 2 Proc 3 Proc 4
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost88
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
89
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
90
Requires Co-tuning Kernels
91
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis
bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)
bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)
bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from
b x b matmul
New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ
bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s
bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12
i j k1 0 1 A
Δ = 0 1 1 B1 1 0 C
New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ
bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s
bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1
i j1 0 F
Δ = 1 0 P(i)0 1 P(j)
N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles
118x speedup
K Yelick E Georganas M Driscoll P Koanantakool E Solomonik
New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ
bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s
bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47
i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1
1 1 0 1 0 0 A2
Δ = 0 1 1 0 1 0 A3
0 0 1 1 0 1 A3A4
0 0 1 1 0 1 A5
1 0 0 1 1 0 A6
- Introduction to Communication-Avoiding Algorithms www.cs.berk
- Why avoid communication? (1/2)
- Why avoid communication? (2/3)
- Why Minimize Communication? (2/2)
- Why Minimize Communication? (2/2) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all "direct" linear algebra
- Lower bound for all "direct" linear algebra (2)
- Lower bound for all "direct" linear algebra (3)
- Can we attain these lower bounds?
- Naïve Matrix Multiply
- Naïve Matrix Multiply (2)
- Naïve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound?
- Recursive Matrix Multiplication (RMM) (1/2)
- Recursive Matrix Multiplication (RMM) (2/2)
- How hard is hand-tuning matmul, anyway?
- How hard is hand-tuning matmul, anyway? (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
- SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid (2)
- SUMMA Communication Costs
- Can we do better?
- 2.5D Matrix Multiplication
- 2.5D Matrix Multiplication (2)
- 2.5D Matmul on BG/P, 16K nodes / 64K cores
- 2.5D Matmul on BG/P, 16K nodes / 64K cores (2)
- Perfect Strong Scaling – in Time and Energy (1/2)
- Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n=64K
- Perfect Strong Scaling – in Time and Energy (2/2)
- Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU, QR, …
- TSQR: QR of a Tall, Skinny matrix
- TSQR: QR of a Tall, Skinny matrix (2)
- TSQR: An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU: Using similar idea for TSLU as TSQR: Use reduction
- Exascale Machine Parameters, Source: DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination: 2D CA
- 2.5D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better?
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym. Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs.
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speedups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable? (1/2)
- Is this bound attainable? (2/2)
- Ongoing Work
- For more details
- Summary
Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080
Speedups: 2.4x–18.4x (over previous Strassen-based algorithms)
For details, see SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm
Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q):
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions
Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)
Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, A^k x]
• Replace k iterations of y = A·x with [Ax, A²x, …, A^k x]
• Example: A tridiagonal, n=32, k=3; works for any "well-partitioned" A
[Figure: rows x, A·x, A²·x, A³·x over entries 1, 2, 3, 4, …, 32, animated over several slides to show the dependencies among vector entries]
• Sequential Algorithm: process the vectors in blocks (Steps 1–4), reading each block of A and x from slow memory only once
• Parallel Algorithm: Procs 1–4 each own a contiguous block of the vectors
  – Each processor communicates once with its neighbors, then works on an (overlapping) trapezoid of the computation
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, A^k x]
• Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
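The parallel variant above can be sketched in a few lines of Python. This is an illustrative toy, not the tuned kernel from the talk: the tridiagonal A, the sizes n=32, k=3, p=4, and the "processors" simulated by a loop are all assumptions for the demo. Each block gathers a ghost zone of width k from its neighbors once, then computes its rows of Ax, A²x, A³x with no further communication:

```python
# Sketch of the matrix powers kernel for a tridiagonal A = tridiag(a, b, c),
# so (A*x)[i] = a*x[i-1] + b*x[i] + c*x[i+1], with zeros off the ends.
a, b, c = 1.0, 2.0, 1.0
n, k, p = 32, 3, 4            # vector length, number of powers, "processors"
blk = n // p                  # assumes p divides n

def spmv(x):
    """y = A*x for tridiagonal A, on a full vector or a local slice."""
    m = len(x)
    return [(a * x[i-1] if i > 0 else 0.0) + b * x[i]
            + (c * x[i+1] if i < m - 1 else 0.0) for i in range(m)]

def powers_serial(x):
    """[A*x, A^2*x, ..., A^k*x], touching x once per power (k data moves)."""
    out, v = [], x
    for _ in range(k):
        v = spmv(v)
        out.append(v)
    return out

def powers_parallel(x):
    """Each block gathers a ghost zone of width k once, then works locally.

    Entries near the ends of a slice go stale after each local spmv, but
    the ghost zone is wide enough that the owned range [lo, hi) stays exact.
    """
    out = [[0.0] * n for _ in range(k)]
    for proc in range(p):
        lo, hi = proc * blk, (proc + 1) * blk
        glo, ghi = max(lo - k, 0), min(hi + k, n)   # one-time neighbor fetch
        v = x[glo:ghi]
        for j in range(k):
            v = spmv(v)                             # purely local work
            out[j][lo:hi] = v[lo - glo : hi - glo]
    return out

x0 = [float(i % 5 + 1) for i in range(n)]
assert powers_parallel(x0) == powers_serial(x0)
```

The redundant computation mentioned on the slide is visible here: neighboring blocks both compute the ghost-zone entries, which is the price paid for communicating only once.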
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||₂

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)             … SpMV
    MGS(w, v(0), …, v(i-1))    … Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A²v, …, A^k v ]
  [Q,R] = TSQR(W)              … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!
• Sequential case: # words moved decreases by a factor of k
• Parallel case: # messages decreases by a factor of k
• "Monomial" basis [Ax, …, A^k x] fails to converge
• A different polynomial basis [p₁(A)x, …, p_k(A)x] does converge
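A tiny experiment illustrates the precision problem with the monomial basis: computing x, Ax, A²x, … is exactly the power method, so the vectors align with A's dominant eigenvector and become nearly linearly dependent. The diagonal test matrix below is my own choice for the demo, not from the talk:

```python
import math

# Monomial-basis ill-conditioning demo with A = diag(1, 2, ..., 8).
diag = [float(d) for d in range(1, 9)]

def matvec(x):
    return [d * xi for d, xi in zip(diag, x)]

def cosine(u, v):
    dot = sum(ui * vi for ui, vi in zip(u, v))
    nu = math.sqrt(sum(ui * ui for ui in u))
    nv = math.sqrt(sum(vi * vi for vi in v))
    return dot / (nu * nv)

basis = [[1.0] * 8]            # x, then Ax, A^2 x, ..., A^10 x
for _ in range(10):
    basis.append(matvec(basis[-1]))

# Consecutive basis vectors become ever more parallel as the power grows,
# so in finite precision the basis loses the information GMRES needs.
assert cosine(basis[9], basis[10]) > cosine(basis[1], basis[2])
assert cosine(basis[9], basis[10]) > 0.99
```

The Newton or Chebyshev bases mentioned above spread the basis vectors out instead of letting them collapse onto one direction.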
Speedups of GMRES on 8-core Intel Clovertown
[MHDY09]
Requires co-tuning kernels
CA-BiCGStab
With Residual Replacement (RR), à la Van der Vorst and Ye — replacement iterations (count in parentheses) per basis:
  – Naive: 74 (1)
  – Monomial: [7, 15, 24, 31, …, 92, 97, 103] (17)
  – Newton: [67, 98] (2)
  – Chebyshev: 68 (1)
Tuning space for Krylov Methods
• Classification of sparse operators for avoiding communication (nonzero entries vs. indices):

                             Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz)): CSR and variations          Vision, climate, AMR, …
  Entries implicit (o(nnz)): Graph Laplacian             Stencils

  – Explicit indices or nonzero entries cause most communication, along with vectors
  – Ex: with stencils (all implicit), all communication is for vectors
• Operations:
  – [x, Ax, A²x, …, A^k x] or [x, p₁(A)x, p₂(A)x, …, p_k(A)x]
  – Number of columns in x
  – [x, Ax, A²x, …, A^k x] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)^k y], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)^k y]
  – Return all vectors or just the last one
• Co-tuning and/or interleaving:
  – W = [x, Ax, A²x, …, A^k x] and {TSQR(W) or WᵀW or …}
  – Ditto, but throw away W
• Preconditioned versions
Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized; more underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning: Hierarchically Semiseparable Matrices
  – Autotuning and synthesis: pOSKI for SpMV – available at bebop.cs.berkeley.edu
  – Different kinds of "sparse matrices"
Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)
Recall optimal sequential Matmul
• Naive code:
  for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)*B(k,j)
• "Blocked" code:
  for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
    for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
      i = i1+i2, j = j1+j2, k = k1+k2
      C(i,j) += A(i,k)*B(k,j)        … b x b matmul
• Thm: Picking b = M^(1/2) attains the lower bound: words_moved = Ω(n³/M^(1/2))
• Where does the 1/2 come from?
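A runnable sketch of the two loop nests above (a Python stand-in for the pseudocode; n, b, and the integer test matrices are arbitrary choices, with b playing the role of M^(1/2)):

```python
n, b = 8, 4   # b stands in for M**0.5; assumes b divides n

def naive_matmul(A, B):
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def blocked_matmul(A, B):
    # Same arithmetic, reordered so each (i1, j1, k1) step touches only
    # three b-by-b tiles -- the working set that must fit in fast memory.
    C = [[0] * n for _ in range(n)]
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                for i in range(i1, i1 + b):
                    for j in range(j1, j1 + b):
                        for k in range(k1, k1 + b):
                            C[i][j] += A[i][k] * B[k][j]
    return C

A = [[(i * n + j) % 7 for j in range(n)] for i in range(n)]
B = [[(i - j) % 5 for j in range(n)] for i in range(n)]
assert blocked_matmul(A, B) == naive_matmul(A, B)
```

Each tile step does b³ multiply-adds on 3b² = 3M words, which is where the n³/M^(1/2) word count comes from.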
New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

          i  j  k
          1  0  1   A
    Δ =   0  1  1   B
          1  1  0   C

• Solve LP for x = [xi, xj, xk]^T: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1ᵀx = 3/2 = s_HBL
• Thm: words_moved = Ω(n³/M^(s_HBL−1)) = Ω(n³/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
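This little LP can be checked without a solver by exhibiting a dual certificate — an illustrative verification of the slide's numbers, not anything from the talk. Here x = s = [1/2, 1/2, 1/2] are feasible for the primal and dual and have equal objectives, so both are optimal:

```python
from fractions import Fraction as F

# Rows of Delta index the arrays A, B, C; columns index the loops i, j, k.
Delta = [[1, 0, 1],   # A(i,k)
         [0, 1, 1],   # B(k,j)
         [1, 1, 0]]   # C(i,j)

x = [F(1, 2)] * 3     # primal: max 1'x  s.t.  Delta x <= 1
s = [F(1, 2)] * 3     # dual:   min 1's  s.t.  s' Delta >= 1

# Primal feasible: every row of Delta x is <= 1.
assert all(sum(r[j] * x[j] for j in range(3)) <= 1 for r in Delta)
# Dual feasible: every column of s' Delta is >= 1.
assert all(sum(s[i] * Delta[i][j] for i in range(3)) >= 1 for j in range(3))
# Equal objectives => both optimal, so s_HBL = 3/2.
assert sum(x) == sum(s) == F(3, 2)
```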
New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

          i  j
          1  0   F
    Δ =   1  0   P(i)
          0  1   P(j)

• Solve LP for x = [xi, xj]^T: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1, 1], 1ᵀx = 2 = s_HBL
• Thm: words_moved = Ω(n²/M^(s_HBL−1)) = Ω(n²/M¹)
  Attained by block sizes M^xi, M^xj = M¹, M¹
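The tiling this LP prescribes can be sketched directly: blocking both loops by a tile size t (standing in for M) reads O(t) words per tile pair while performing t² interactions, matching x = [1, 1]. The force function and sizes below are invented for the demo:

```python
n, t = 12, 4   # t plays the role of the tile size M; assumes t divides n

def force(pi, pj):
    # Stand-in pairwise interaction, not a real force law.
    return (pi - pj) % 11

P = [(3 * i + 1) % 17 for i in range(n)]

def naive_forces():
    Fv = [0] * n
    for i in range(n):
        for j in range(n):
            Fv[i] += force(P[i], P[j])
    return Fv

def blocked_forces():
    # Tile both loops: each (i1, j1) step reads only t targets and t
    # sources -- O(t) words for t*t interactions.
    Fv = [0] * n
    for i1 in range(0, n, t):
        for j1 in range(0, n, t):
            for i in range(i1, i1 + t):
                for j in range(j1, j1 + t):
                    Fv[i] += force(P[i], P[j])
    return Fv

assert blocked_forces() == naive_forces()
```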
N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles
11.8x speedup
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik
New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6) += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

         i1 i2 i3 i4 i5 i6
          1  0  1  0  0  1   A1
          1  1  0  1  0  0   A2
    Δ =   0  1  1  0  1  0   A3
          0  0  1  1  0  1   A3,A4
          0  1  0  0  0  1   A5
          1  0  0  1  1  0   A6

• Solve LP for x = [x1, …, x6]^T: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀx = 15/7 = s_HBL
• Thm: words_moved = Ω(n⁶/M^(s_HBL−1)) = Ω(n⁶/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)
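As a sanity check of the claimed optimum, here is a primal/dual certificate for this LP. The Δ rows are read straight off the array references in the code above (in particular, the A5 row comes from A5(i2,i6)); the dual vector is my own computation, not from the slides:

```python
from fractions import Fraction as F

# Delta for the random code: rows = array references, columns = i1..i6.
Delta = [[1, 0, 1, 0, 0, 1],   # A1(i1,i3,i6)
         [1, 1, 0, 1, 0, 0],   # A2(i1,i2,i4)
         [0, 1, 1, 0, 1, 0],   # A3(i2,i3,i5)
         [0, 0, 1, 1, 0, 1],   # A4(i3,i4,i6), also the 2nd A3 reference
         [0, 1, 0, 0, 0, 1],   # A5(i2,i6)
         [1, 0, 0, 1, 1, 0]]   # A6(i1,i4,i5)

x = [F(v, 7) for v in (2, 3, 1, 2, 3, 4)]   # primal: max 1'x, Delta x <= 1
s = [F(v, 7) for v in (2, 1, 3, 2, 3, 4)]   # dual:   min 1's, s'Delta >= 1

assert all(sum(r[j] * x[j] for j in range(6)) <= 1 for r in Delta)
assert all(sum(s[i] * Delta[i][j] for i in range(6)) >= 1 for j in range(6))
assert sum(x) == sum(s) == F(15, 7)          # matching objectives: s_HBL = 15/7
```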
Approach to generalizing lower bounds
• Matmul:
  for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z³, access locations indexed by (i,j), (i,k), (k,j)
• General case:
  for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
    C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
    D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k,
  access locations indexed by "projections", e.g.
    φ_C(i1,i2,…,ik) = (i1+2*i3-i7),
    φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?
General Communication Bound
• Given S = subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
    |S| ≤ Π_j |φ_j(S)|^(s_j)
• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k: rank(H) ≤ Σ_j s_j · rank(φ_j(H))
• Thm: Given a program with array refs given by the φ_j, choose the s_j to minimize s_HBL = Σ_j s_j subject to the HBL-LP. Then
    words_moved = Ω(#iterations / M^(s_HBL − 1))
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett
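For matmul's three projections the theorem specializes (with all s_j = 1/2) to the Loomis-Whitney inequality |S|² ≤ |φ_A(S)|·|φ_B(S)|·|φ_C(S)|, which a brute-force example can illustrate; the random point set below is an arbitrary demo, not from the talk:

```python
import itertools, random

# Loomis-Whitney in Z^3: |S|^2 <= |S_ij| * |S_ik| * |S_kj|, the s_j = 1/2
# case of the Christ/Tao/Carbery/Bennett bound for matmul's projections.
random.seed(0)
universe = list(itertools.product(range(6), repeat=3))
S = set(random.sample(universe, 60))

S_ij = {(i, j) for (i, j, k) in S}   # phi_C
S_ik = {(i, k) for (i, j, k) in S}   # phi_A
S_kj = {(k, j) for (i, j, k) in S}   # phi_B

assert len(S) ** 2 <= len(S_ij) * len(S_ik) * len(S_kj)
```

Reading the inequality backwards gives the lower-bound argument: a segment of execution that touches at most M entries of each array can perform at most M^(3/2) iterations.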
Is this bound attainable? (1/2)
• But first: can we write it down?
  – Thm (bad news): Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): Can write it down explicitly in many cases of interest (e.g., when each φ_j selects a subset of the indices)
  – Thm (good news): Easy to approximate:
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
    • It is Tarski-decidable to get a superset of the constraints (which may make s_HBL too large)
Is this bound attainable? (2/2)
• Depends on loop dependencies
• Best case: none, or reductions (as in matmul)
• Thm: When each φ_j selects a subset of the indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:      minimize 1ᵀs  s.t.  sᵀΔ ≥ 1ᵀ
    Dual-HBL-LP: maximize 1ᵀx  s.t.  Δx ≤ 1
  Then, for a sequential algorithm, tile loop i_j by M^(x_j)
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices
Ongoing Work
• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers
For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013: www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013: www.xsede.org
  – Free supercomputer accounts to do homework
Summary
Don't Communic…
Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)
Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and
practicebull Lots of ongoing work on
ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices
ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip
ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip
bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Avoiding Communication in Iterative Linear Algebra
bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo
bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation
bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal
ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal
bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation
74
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3 Step 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors
Proc 1 Proc 2 Proc 3 Proc 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid
Proc 1 Proc 2 Proc 3 Proc 4
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost88
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
89
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
90
Requires Co-tuning Kernels
91
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis
bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)
bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)
bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from
b x b matmul
New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ
bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s
bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12
i j k1 0 1 A
Δ = 0 1 1 B1 1 0 C
New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ
bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s
bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1
i j1 0 F
Δ = 1 0 P(i)0 1 P(j)
N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles
118x speedup
K Yelick E Georganas M Driscoll P Koanantakool E Solomonik
New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ
bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s
bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47
i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1
1 1 0 1 0 0 A2
Δ = 0 1 1 0 1 0 A3
0 0 1 1 0 1 A3A4
0 0 1 1 0 1 A5
1 0 0 1 1 0 A6
Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3
Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk
Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip
bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip
General Communication Bound
bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))
bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then
words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by
ChristTaoCarberyBennett
bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|
bull Thm (ChristTaoCarberyBennett) Given s1hellipsm
|S| le Πj |φj(S)|sj
Is this bound attainable (12)
bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q
(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases
of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate
bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP
gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T
Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj
bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel
ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework
Summary
Donrsquot Communichellip
108
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms www.cs.berk
- Why avoid communication? (1/2)
- Why avoid communication? (2/3)
- Why Minimize Communication? (2/2)
- Why Minimize Communication? (2/2) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all "direct" linear algebra
- Lower bound for all "direct" linear algebra (2)
- Lower bound for all "direct" linear algebra (3)
- Can we attain these lower bounds?
- Naïve Matrix Multiply
- Naïve Matrix Multiply (2)
- Naïve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound?
- Recursive Matrix Multiplication (RMM) (1/2)
- Recursive Matrix Multiplication (RMM) (2/2)
- How hard is hand-tuning matmul, anyway?
- How hard is hand-tuning matmul, anyway? (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA – n x n matmul on P^1/2 x P^1/2 grid
- SUMMA – n x n matmul on P^1/2 x P^1/2 grid (2)
- SUMMA Communication Costs
- Can we do better?
- 2.5D Matrix Multiplication
- 2.5D Matrix Multiplication (2)
- 2.5D Matmul on BG/P, 16K nodes / 64K cores
- 2.5D Matmul on BG/P, 16K nodes / 64K cores (2)
- Perfect Strong Scaling – in Time and Energy (1/2)
- Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n=64K
- Perfect Strong Scaling – in Time and Energy (2/2)
- Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU, QR, …
- TSQR: QR of a Tall, Skinny matrix
- TSQR: QR of a Tall, Skinny matrix (2)
- TSQR: An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU: Using similar idea for TSLU as TSQR: Use reduction
- Exascale Machine Parameters, Source: DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination: 2D CA
- 2.5D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better?
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym. Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs.
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BG/P (Intrepid), 8K cores, 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable? (1/2)
- Is this bound attainable? (2/2)
- Ongoing Work
- For more details
- Summary
Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)
Avoiding Communication in Iterative Linear Algebra
bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo
bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation
bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal
ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal
bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation
74
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A
[Figure: rows x, A·x, A²·x, A³·x over indices 1, 2, 3, 4, …, 32, with the dependency region growing by one index per power]
• Sequential algorithm: sweep left to right in Steps 1, 2, 3, 4, computing all k powers for one block of indices at a time, so each piece of A and x moves from slow to fast memory once
• Parallel algorithm: Proc 1, Proc 2, Proc 3, Proc 4 each own a contiguous block of indices; each processor communicates once with its neighbors, then works on an (overlapping) trapezoid of the dependency region
• Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
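The one-exchange structure above can be sketched in code. This is a minimal NumPy sketch, not the tuned kernel: it assumes A is tridiagonal (so a ghost zone of width k suffices), and `matrix_powers` and the block layout are illustrative names.

```python
import numpy as np

def matrix_powers(A, x, k, nblocks):
    """Sketch of the matrix powers kernel [Ax, A^2 x, ..., A^k x].

    Each of nblocks "processors" reads its rows of A and x plus k
    ghost entries on each side ONCE, then computes its pieces of all
    k products with purely local (partly redundant) arithmetic --
    the overlapping trapezoids of the slide, one exchange instead of k.
    """
    n = len(x)
    out = np.zeros((k, n))
    bounds = np.linspace(0, n, nblocks + 1).astype(int)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        glo, ghi = max(lo - k, 0), min(hi + k, n)  # owned block + ghosts
        xl = x[glo:ghi].copy()
        Al = A[glo:ghi, glo:ghi]                   # truncated local piece of A
        for j in range(k):
            xl = Al @ xl                           # local SpMV; edges of xl go stale,
            out[j, lo:hi] = xl[lo - glo:hi - glo]  # but owned entries stay exact for j <= k
    return out
```

For a tridiagonal A, errors from truncating A creep inward by one index per product, so with ghost width k the owned entries are exact for all k powers.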
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax−b||₂

Standard GMRES:
  for i = 1 to k
    w = A · v(i−1)              … SpMV
    MGS(w, v(0), …, v(i−1))     … orthogonalize: update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [v, Av, A²v, …, Aᵏv]
  [Q,R] = TSQR(W)               … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!
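The TSQR step can be sketched as a two-level reduction. This is a simplified sequential stand-in for the parallel/tree version (`tsqr` and `nblocks` are illustrative names), using only NumPy:

```python
import numpy as np

def tsqr(W, nblocks=4):
    """Tall Skinny QR by reduction: factor each row-block locally,
    then factor the stacked R's -- one reduction over small R factors
    instead of a dot product per column per step."""
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(B, mode="r") for B in blocks]  # local QRs ("leaves")
    R = np.linalg.qr(np.vstack(Rs), mode="r")         # combine ("root")
    return R
```

Since the stacked local R's have the same Gram matrix as W, the final R satisfies RᵀR = WᵀW, i.e. it is a valid R factor of W (up to row signs).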
• Sequential case: #words moved decreases by a factor of k
• Parallel case: #messages decreases by a factor of k
• "Monomial" basis [Ax, …, Aᵏx] fails to converge
• A different polynomial basis [p₁(A)x, …, pₖ(A)x] does converge
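The conditioning problem behind the monomial basis is easy to see numerically. A small experiment, assuming a toy diagonal SPD matrix and shifts chosen as equispaced points in A's spectrum purely for illustration (practical CA-GMRES picks e.g. Leja-ordered Ritz values or Chebyshev points):

```python
import numpy as np

# Monomial columns A^j x all drift toward the dominant eigenvector, so
# the basis goes numerically rank-deficient; a Newton-type basis
# [x, (A - t1 I)x, (A - t2 I)(A - t1 I)x, ...] with spread-out shifts
# stays much better conditioned.
rng = np.random.default_rng(0)
n, k = 200, 12
A = np.diag(np.linspace(1.0, 100.0, n))   # toy SPD spectrum on [1, 100]
x = rng.standard_normal(n)

def basis_cond(shifts):
    """Condition number of the (column-normalized) shifted power basis."""
    v = x / np.linalg.norm(x)
    V = [v]
    for t in shifts:
        v = A @ v - t * v
        v = v / np.linalg.norm(v)
        V.append(v)
    return np.linalg.cond(np.column_stack(V))

cond_monomial = basis_cond([0.0] * k)                  # [x, Ax, ..., A^k x]
cond_newton = basis_cond(np.linspace(1.0, 100.0, k))   # illustrative shifts
```

On this example `cond_newton` comes out far smaller than `cond_monomial`, which is why the slide's polynomial basis [p₁(A)x, …, pₖ(A)x] recovers convergence.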
Speedups of GMRES on 8-core Intel Clovertown [MHDY09]

Requires co-tuning kernels
CA-BiCGStab
With Residual Replacement (RR) à la Van der Vorst and Ye –
replacement iterations by basis:
  Naive:     74 (1)
  Monomial:  [7, 15, 24, 31, …, 92, 97, 103] (17)
  Newton:    [67, 98] (2)
  Chebyshev: 68 (1)
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"
Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)
Recall optimal sequential Matmul
• Naïve code:
  for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• "Blocked" code:
  for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
    for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
      i = i1+i2, j = j1+j2, k = k1+k2
      C(i,j) += A(i,k)*B(k,j)      … b-by-b matmul on tiles
• Thm: picking b = M^(1/2) attains the lower bound: words_moved = Ω(n³/M^(1/2))
• Where does the 1/2 come from?
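The blocked loops above can be written directly. A NumPy sketch, with the tile width `b` standing in for M^(1/2) (here each `@` on b-by-b tiles plays the role of the inner three loops):

```python
import numpy as np

def blocked_matmul(A, B, b):
    """Tiled C = A @ B: each b-by-b tile of A, B, C fits in fast
    memory together when b ~ (M/3)^(1/2), attaining the
    Omega(n^3 / M^(1/2)) words-moved lower bound."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                # one b-by-b matmul per tile triple (ragged edges are
                # handled automatically by the slicing)
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C
```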
New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:
               i  j  k
          Δ = [1  0  1]   A
              [0  1  1]   B
              [1  1  0]   C
• Solve LP for x = [xi, xj, xk]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]ᵀ, 1ᵀx = 3/2 = s
• Thm: words_moved = Ω(n³/M^(s−1)) = Ω(n³/M^(1/2)),
  attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
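For this 3×3 instance the LP needs no solver: weak duality certifies the answer. A small NumPy check (Δ written as the array `D`; if x is feasible for the max problem, s for the min problem, and the objectives match, both are optimal):

```python
import numpy as np

# Rows of Delta: the index sets touched by A(i,k), B(k,j), C(i,j).
D = np.array([[1, 0, 1],    # A(i,k)
              [0, 1, 1],    # B(k,j)
              [1, 1, 0]])   # C(i,j)
x = np.full(3, 0.5)         # candidate for: max 1^T x  s.t. D x <= 1
s = np.full(3, 0.5)         # candidate for: min 1^T s  s.t. D^T s >= 1

assert np.all(D @ x <= 1 + 1e-12)     # primal feasible
assert np.all(D.T @ s >= 1 - 1e-12)   # dual feasible
assert np.isclose(x.sum(), s.sum())   # equal objectives => both optimal
s_HBL = s.sum()                       # 3/2, so words_moved = Omega(n^3 / M^(1/2))
```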
New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force(P(i), P(j))
• Record array indices in matrix Δ:
               i  j
          Δ = [1  0]   F
              [1  0]   P(i)
              [0  1]   P(j)
• Solve LP for x = [xi, xj]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1, 1], 1ᵀx = 2 = s
• Thm: words_moved = Ω(n²/M^(s−1)) = Ω(n²/M¹),
  attained by block sizes M^xi, M^xj = M¹, M¹
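The x = [1, 1] result says: tile both i and j by blocks of size ~M. A sketch of the corresponding blocked n-body loop (1-D particle data and a caller-supplied toy `force` purely for illustration):

```python
import numpy as np

def blocked_forces(P, b, force):
    """Direct n-body with i and j tiled by b ~ M: each b-by-b
    interaction block reads O(b) particle data and does b^2 work,
    so words_moved ~ n^2/M instead of n^2."""
    n = len(P)
    F = np.zeros(n)
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            Pi, Pj = P[i1:i1+b], P[j1:j1+b]
            # all b^2 interactions of the tile, computed "in fast memory"
            F[i1:i1+b] += force(Pi[:, None], Pj[None, :]).sum(axis=1)
    return F
```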
N-Body Speedups on IBM-BG/P (Intrepid), 8K cores, 32K particles
11.8x speedup
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik
New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6) += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:
               i1 i2 i3 i4 i5 i6
          Δ = [1  0  1  0  0  1]   A1
              [1  1  0  1  0  0]   A2
              [0  1  1  0  1  0]   A3
              [0  0  1  1  0  1]   A3, A4
              [0  1  0  0  0  1]   A5
              [1  0  0  1  1  0]   A6
• Solve LP for x = [x1, …, x6]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀx = 15/7 = s
• Thm: words_moved = Ω(n⁶/M^(s−1)) = Ω(n⁶/M^(8/7)),
  attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)
Approach to generalizing lower bounds
• Matmul:
  for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z³,
    access locations indexed by (i,j), (i,k), (k,j)
• General case:
  for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
    C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
    D(something else) = func(something else)
    …
  ⇒ for (i1,i2,…,ik) in S = subset of Zᵏ,
    access locations indexed by "projections", e.g.
    φ_C(i1,i2,…,ik) = (i1+2*i3-i7), φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on the #points in its images φ_C(S), φ_A(S), …?
General Communication Bound
• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
  for all subgroups H < Zᵏ: rank(H) ≤ Σj sj·rank(φj(H))
• Thm: given a program with array references given by the φj, choose the sj to
  minimize s_HBL = Σj sj subject to the HBL-LP. Then
  words_moved = Ω(#iterations / M^(s_HBL−1))
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
    given S ⊆ Zᵏ and group homomorphisms φ1, φ2, …, bound |S| in terms of
    |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): given s1, …, sm satisfying the HBL-LP,
  |S| ≤ Πj |φj(S)|^sj
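The inequality is easy to sanity-check numerically for the matmul projections, where it specializes to the discrete Loomis-Whitney inequality with all sj = 1/2:

```python
import numpy as np

# For any finite S in Z^3 and the matmul projections
# phi_A(i,j,k)=(i,k), phi_B=(k,j), phi_C=(i,j):
#   |S| <= |phi_A(S)|^(1/2) * |phi_B(S)|^(1/2) * |phi_C(S)|^(1/2)
rng = np.random.default_rng(1)
pts = rng.integers(0, 6, size=(300, 3))
S = set(map(tuple, pts))                 # random finite subset of Z^3
pA = {(i, k) for i, j, k in S}
pB = {(k, j) for i, j, k in S}
pC = {(i, j) for i, j, k in S}
bound = (len(pA) * len(pB) * len(pC)) ** 0.5
assert len(S) <= bound + 1e-9
```

Taking S = {0,…,n−1}³ makes the bound tight: n³ ≤ (n²·n²·n²)^(1/2), which is exactly why matmul's s_HBL is 3/2.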
Is this bound attainable? (1/2)
• But first: can we write it down?
  – Thm (bad news): reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest (e.g. all φj = subset of indices)
  – Thm (good news): easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e. s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of the constraints (may get s_HBL too large)
Is this bound attainable? (2/2)
• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: when all φj = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
  HBL-LP:      minimize 1ᵀs s.t. sᵀΔ ≥ 1ᵀ
  Dual-HBL-LP: maximize 1ᵀx s.t. Δx ≤ 1
  Then for a sequential algorithm, tile index ij by M^xj
• Ex: Matmul: s = [1/2, 1/2, 1/2]ᵀ = x
• Extends to unimodular transforms of indices
Ongoing Work
• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers
For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
    • www.xsede.org
    • Free supercomputer accounts to do homework
Summary
Don't Communic…
Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- SUMMA Communication Costs
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
- Perfect Strong Scaling ndash in Time and Energy (22)
- Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU QR hellip
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU Using similar idea for TSLU as TSQR Use reduction
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- 25D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- For more details
- Summary
-
Avoiding Communication in Iterative Linear Algebra
bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo
bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation
bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal
ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal
bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation
74
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3 Step 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors
Proc 1 Proc 2 Proc 3 Proc 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid
Proc 1 Proc 2 Proc 3 Proc 4
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost88
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
89
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
90
Requires Co-tuning Kernels
91
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis
bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)
bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)
bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from
b x b matmul
New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ
bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s
bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12
i j k1 0 1 A
Δ = 0 1 1 B1 1 0 C
New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ
bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s
bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1
i j1 0 F
Δ = 1 0 P(i)0 1 P(j)
N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles
118x speedup
K Yelick E Georganas M Driscoll P Koanantakool E Solomonik
New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ
bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s
bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47
i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1
1 1 0 1 0 0 A2
Δ = 0 1 1 0 1 0 A3
0 0 1 1 0 1 A3A4
0 0 1 1 0 1 A5
1 0 0 1 1 0 A6
Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3
Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk
Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip
bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip
General Communication Bound
bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))
bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then
words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by
ChristTaoCarberyBennett
bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|
bull Thm (ChristTaoCarberyBennett) Given s1hellipsm
|S| le Πj |φj(S)|sj
Is this bound attainable (12)
bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q
(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases
of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate
bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP
gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T
Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj
bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel
ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework
Summary
Donrsquot Communichellip
108
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA – n x n matmul on P^1/2 x P^1/2 grid
- SUMMA – n x n matmul on P^1/2 x P^1/2 grid (2)
- SUMMA Communication Costs
- Can we do better?
- 2.5D Matrix Multiplication
- 2.5D Matrix Multiplication (2)
- 2.5D Matmul on BG/P, 16K nodes / 64K cores
- 2.5D Matmul on BG/P, 16K nodes / 64K cores (2)
- Perfect Strong Scaling – in Time and Energy (1/2)
- Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n=64K
- Perfect Strong Scaling – in Time and Energy (2/2)
- Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU, QR, …
- TSQR: QR of a Tall, Skinny matrix
- TSQR: QR of a Tall, Skinny matrix (2)
- TSQR: An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU: Using similar idea for TSLU as TSQR: Use reduction
- Exascale Machine Parameters, Source: DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination: 2D CA
- 2.5D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better?
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA-SBR
- Speedups of Sym. Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speedups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable? (1/2)
- Is this bound attainable? (2/2)
- Ongoing Work
- For more details
- Summary
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n=32, k=3; works for any "well-partitioned" A
(figure: the entries of x, A·x, A²·x, A³·x laid out against the indices 1, 2, 3, 4, …, 32)
• Sequential Algorithm: computes all k vectors one block of indices at a time (Step 1, Step 2, Step 3, Step 4)
• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4; each processor works on an (overlapping) trapezoid and communicates once with its neighbors
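The blocked structure above can be sketched in code. This is a hypothetical NumPy illustration (the function names and the ghost-zone partitioning are mine, not from the slides): each block reads its entries of x plus k "ghost" entries per side, then computes all k products locally without touching the rest of A again.

```python
import numpy as np

def tridiag_matvec(d, lo, up, x):
    # y = A x for tridiagonal A: main diagonal d, sub-diagonal lo, super-diagonal up
    y = d * x
    y[1:] += lo * x[:-1]
    y[:-1] += up * x[1:]
    return y

def matrix_powers_kernel(d, lo, up, x, k, nblocks):
    """Compute [A x, A^2 x, ..., A^k x] one block of indices at a time.
    Each block pulls k ghost entries from each side, then computes all k
    products locally -- one pass over A instead of k passes."""
    n = len(x)
    out = np.zeros((k, n))
    bounds = np.linspace(0, n, nblocks + 1, dtype=int)
    for s, e in zip(bounds[:-1], bounds[1:]):
        gs, ge = max(0, s - k), min(n, e + k)   # ghost region around block [s, e)
        v = x[gs:ge].copy()
        for p in range(k):
            v = tridiag_matvec(d[gs:ge], lo[gs:ge-1], up[gs:ge-1], v)
            # entries at distance >= p+1 from a truncated edge are still exact,
            # so the owned range [s, e) is exact for all p <= k
            out[p, s:e] = v[s - gs : (s - gs) + (e - s)]
    return out
```

The same idea, with (hyper)graph partitioning in place of the ghost zones, carries over to general sparse matrices.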
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
Same idea works for general sparse matrices:
• Simple block-row partitioning → (hyper)graph partitioning
• Top-to-bottom processing → Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax−b||₂

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)               … SpMV
    MGS(w, v(0), …, v(i-1))      … Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [v, Av, A²v, …, Aᵏv]
  [Q, R] = TSQR(W)               … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Sequential case: #words moved decreases by a factor of k
• Parallel case: #messages decreases by a factor of k
• Oops – W is from the power method, so precision is lost!
• The "monomial" basis [Ax, …, Aᵏx] fails to converge; a different polynomial basis [p₁(A)x, …, pₖ(A)x] does converge
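The precision loss can be seen numerically. This is a hedged sketch with a diagonal test matrix of my choosing (not from the talk): it compares the conditioning of the monomial Krylov basis with a Chebyshev basis built by the standard three-term recurrence, with the spectrum mapped onto [−1, 1].

```python
import numpy as np

n, k = 200, 15
lam = np.linspace(1.0, 100.0, n)   # eigenvalues of a diagonal test matrix A
v = np.ones(n)

# Monomial basis [v, A v, ..., A^k v]: for diagonal A, columns are lam**p * v
W_mono = np.column_stack([lam**p * v for p in range(k + 1)])

# Chebyshev basis p_j(A) v: map the spectrum [a, b] onto [-1, 1], then use
# the recurrence T_{j+1} = 2 t T_j - T_{j-1}
a, b = 1.0, 100.0
t = (2 * lam - (a + b)) / (b - a)
cols = [v, t * v]
for _ in range(k - 1):
    cols.append(2 * t * cols[-1] - cols[-2])
W_cheb = np.column_stack(cols)

print(np.linalg.cond(W_mono), np.linalg.cond(W_cheb))
```

The monomial basis is numerically rank-deficient long before k = 15, while the Chebyshev basis stays well conditioned, which is why the polynomial-basis variant converges.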
Speedups of GMRES on 8-core Intel Clovertown [MHDY09]
Requires co-tuning kernels.
CA-BiCGStab
With Residual Replacement (RR) à la Van der Vorst and Ye:

Basis               Naive     Monomial                               Newton         Chebyshev
Replacement Its.    74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)   68 (1)
Tuning space for Krylov Methods
• Classification of sparse operators for avoiding communication: explicit indices or nonzero entries cause most communication, along with the vectors. Ex: with stencils (both implicit), all communication is for vectors.

                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
Nonzero entries explicit:     CSR and variations          Vision, climate, AMR, …
Nonzero entries implicit:     Graph Laplacian             Stencils

• Operations:
  • [x, Ax, A²x, …, Aᵏx] or [x, p₁(A)x, p₂(A)x, …, pₖ(A)x]
  • Number of columns in x
  • [x, Ax, A²x, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
  • Return all vectors or just the last one
• Cotuning and/or interleaving:
  • W = [x, Ax, A²x, …, Aᵏx] and {TSQR(W) or WᵀW or …}
  • Ditto, but throw away W
• Preconditioned versions
Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized; more underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning: Hierarchically Semiseparable Matrices
  – Autotuning and synthesis: pOSKI for SpMV – available at bebop.cs.berkeley.edu
  – Different kinds of "sparse matrices"
Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)
Recall optimal sequential Matmul
• Naïve code:
  for i = 1:n
    for j = 1:n
      for k = 1:n
        C(i,j) += A(i,k) * B(k,j)
• "Blocked" code (b x b matmul at the innermost level):
  for i1 = 1:b:n
    for j1 = 1:b:n
      for k1 = 1:b:n
        for i2 = 0:b-1
          for j2 = 0:b-1
            for k2 = 0:b-1
              i = i1+i2;  j = j1+j2;  k = k1+k2
              C(i,j) += A(i,k) * B(k,j)
• Thm: Picking b = M^1/2 attains the lower bound: words_moved = Ω(n³/M^1/2)
• Where does the 1/2 come from?
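A minimal runnable sketch of the blocked loop nest (NumPy, with each b x b block update expressed as a submatrix product; assumes b divides n):

```python
import numpy as np

def blocked_matmul(A, B, b):
    # b x b blocking: with b ~ M^(1/2), one block each of A, B, C fits in a
    # fast memory of size M, attaining words_moved = Omega(n^3 / M^(1/2))
    n = A.shape[0]
    C = np.zeros((n, n))
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C
```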
New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n: C(i,j) += A(i,k) * B(k,j)
• Record array indices in matrix Δ:

        i  j  k
        1  0  1    A
  Δ =   0  1  1    B
        1  1  0    C

• Solve LP for x = [xi, xj, xk]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]ᵀ, 1ᵀx = 3/2 = s
• Thm: words_moved = Ω(n³/M^(s−1)) = Ω(n³/M^1/2), attained by block sizes M^xi × M^xj × M^xk = M^1/2 × M^1/2 × M^1/2
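The LP above is small enough to solve directly. A hedged sketch (assuming SciPy is available; `linprog` minimizes, so the objective is negated):

```python
import numpy as np
from scipy.optimize import linprog

# Each row of Delta marks which loop indices (i, j, k) an array reference uses
Delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)

# max 1^T x  s.t.  Delta x <= 1, x >= 0
res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
print(res.x, -res.fun)   # x = [0.5, 0.5, 0.5], s = 1.5
```

Summing the three constraints gives 2(xi+xj+xk) ≤ 3, so s = 3/2 is optimal and forces x = [1/2, 1/2, 1/2].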
New Thm applied to Direct N-Body
• for i=1:n, for j=1:n: F(i) += force(P(i), P(j))
• Record array indices in matrix Δ:

        i  j
        1  0    F
  Δ =   1  0    P(i)
        0  1    P(j)

• Solve LP for x = [xi, xj]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1, 1], 1ᵀx = 2 = s
• Thm: words_moved = Ω(n²/M^(s−1)) = Ω(n²/M¹), attained by block sizes M^xi × M^xj = M¹ × M¹
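The block sizes from x = [1, 1] correspond to tiling the (i, j) pair loop into square blocks. A hedged sketch with a toy pairwise force of my own choosing (illustrative only, not the talk's kernel):

```python
import numpy as np

def force(pi, pj):
    # toy pairwise interaction: 1/(pi - pj), zero on the diagonal i == j
    d = pi - pj
    return np.where(d == 0, 0.0, 1.0 / np.where(d == 0, 1.0, d))

def nbody_blocked(P, b):
    # Per x = [1, 1]: square b x b blocks of (i, j) pairs with b ~ M reuse
    # each loaded particle ~M times, attaining words_moved = Omega(n^2 / M)
    n = len(P)
    F = np.zeros(n)
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            # one block of targets and one block of sources in fast memory
            F[i1:i1+b] += force(P[i1:i1+b, None], P[None, j1:j1+b]).sum(axis=1)
    return F
```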
N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles
11.8x speedup
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik
New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n:
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

        i1 i2 i3 i4 i5 i6
        1  0  1  0  0  1    A1
        1  1  0  1  0  0    A2
  Δ =   0  1  1  0  1  0    A3
        0  0  1  1  0  1    A3, A4
        0  1  0  0  0  1    A5
        1  0  0  1  1  0    A6

• Solve LP for x = [x1, …, x6]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀx = 15/7 = s
• Thm: words_moved = Ω(n⁶/M^(s−1)) = Ω(n⁶/M^(8/7)), attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)
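The same SciPy sketch checks this larger LP (the Δ rows below are read off the array subscripts as described above, e.g. A5(i2,i6) marks columns i2 and i6; SciPy is an assumed dependency):

```python
import numpy as np
from scipy.optimize import linprog

Delta = np.array([[1, 0, 1, 0, 0, 1],   # A1(i1,i3,i6)
                  [1, 1, 0, 1, 0, 0],   # A2(i1,i2,i4)
                  [0, 1, 1, 0, 1, 0],   # A3(i2,i3,i5)
                  [0, 0, 1, 1, 0, 1],   # A4(i3,i4,i6), same as A3's ref in func2
                  [0, 1, 0, 0, 0, 1],   # A5(i2,i6)
                  [1, 0, 0, 1, 1, 0]])  # A6(i1,i4,i5)

res = linprog(c=-np.ones(6), A_ub=Delta, b_ub=np.ones(6), bounds=(0, None))
print(res.x, -res.fun)   # objective 1^T x = 15/7
```

All six constraints are tight at x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], matching the stated optimum s = 15/7.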
Approach to generalizing lower bounds
• Matmul:
  for i=1:n, for j=1:n, for k=1:n:
    C(i,j) += A(i,k) * B(k,j)
  ⇒ for (i,j,k) in S = subset of Z³, access locations indexed by (i,j), (i,k), (k,j)
• General case:
  for i1=1:n, for i2 = i1:m, …, for ik = i3:i4:
    C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
    D(something else) = func(something else)
  ⇒ for (i1,i2,…,ik) in S = subset of Zᵏ, access locations indexed by "projections", e.g.
    φ_C(i1,i2,…,ik) = (i1+2*i3-i7), φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?
General Communication Bound
• Given S ⊆ Zᵏ and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
    |S| ≤ Πⱼ |φⱼ(S)|^(sⱼ)
• Def: Hölder–Brascamp–Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Zᵏ:  rank(H) ≤ Σⱼ sⱼ · rank(φⱼ(H))
• Thm: Given a program with array refs given by the φⱼ, choose the sⱼ to minimize s_HBL = Σⱼ sⱼ subject to the HBL-LP. Then
    words_moved = Ω(#iterations / M^(s_HBL − 1))
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett
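For matmul's three index projections, this is the discrete Loomis–Whitney inequality with s = (1/2, 1/2, 1/2). A quick stdlib-only sanity check on a random S ⊆ Z³:

```python
import random

# |S| <= (|phi_A(S)| * |phi_B(S)| * |phi_C(S)|)^(1/2) for the matmul projections
random.seed(0)
S = {tuple(random.randrange(8) for _ in range(3)) for _ in range(200)}
pC = {(i, j) for i, j, k in S}   # phi_C(S)
pA = {(i, k) for i, j, k in S}   # phi_A(S)
pB = {(k, j) for i, j, k in S}   # phi_B(S)
bound = (len(pA) * len(pB) * len(pC)) ** 0.5
print(len(S), bound)
```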
Is this bound attainable? (1/2)
• But first: can we write it down?
  – Thm (bad news): Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): Can write it down explicitly in many cases of interest (e.g., when all the φⱼ select subsets of the indices)
  – Thm (good news): Easy to approximate:
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of the constraints (may get s_HBL too large)
Is this bound attainable? (2/2)
• Depends on the loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all the φⱼ select subsets of the indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:      minimize 1ᵀs  s.t.  sᵀΔ ≥ 1ᵀ
    Dual-HBL-LP: maximize 1ᵀx  s.t.  Δx ≤ 1
  Then for a sequential algorithm, tile index iⱼ by M^(xⱼ)
• Ex: Matmul: s = [1/2, 1/2, 1/2]ᵀ = x
• Extends to unimodular transforms of the indices
Ongoing Work
• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers
For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013: www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org – free supercomputer accounts to do homework
Summary
Don't Communic…
Time to redesign all linear algebra and n-body algorithms and software (and compilers…)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- SUMMA Communication Costs
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
- Perfect Strong Scaling ndash in Time and Energy (22)
- Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU QR hellip
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU Using similar idea for TSLU as TSQR Use reduction
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- 25D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- For more details
- Summary
-
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3 Step 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors
Proc 1 Proc 2 Proc 3 Proc 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid
Proc 1 Proc 2 Proc 3 Proc 4
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost88
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
89
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
90
Requires Co-tuning Kernels
91
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis
bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)
bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)
bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from
b x b matmul
New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ
bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s
bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12
i j k1 0 1 A
Δ = 0 1 1 B1 1 0 C
New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ
bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s
bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1
i j1 0 F
Δ = 1 0 P(i)0 1 P(j)
N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles
118x speedup
K Yelick E Georganas M Driscoll P Koanantakool E Solomonik
New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ
bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s
bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47
i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1
1 1 0 1 0 0 A2
Δ = 0 1 1 0 1 0 A3
0 0 1 1 0 1 A3A4
0 0 1 1 0 1 A5
1 0 0 1 1 0 A6
Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3
Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk
Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip
bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip
General Communication Bound
bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))
bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then
words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by
ChristTaoCarberyBennett
bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|
bull Thm (ChristTaoCarberyBennett) Given s1hellipsm
|S| le Πj |φj(S)|sj
Is this bound attainable (12)
bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q
(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases
of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate
bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP
gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T
Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj
bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel
ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework
Summary
Donrsquot Communichellip
108
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- SUMMA Communication Costs
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
- Perfect Strong Scaling ndash in Time and Energy (22)
- Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU QR hellip
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU Using similar idea for TSLU as TSQR Use reduction
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- 25D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- For more details
- Summary
-
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3 Step 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors
Proc 1 Proc 2 Proc 3 Proc 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid
Proc 1 Proc 2 Proc 3 Proc 4
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost88
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
89
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
90
Requires Co-tuning Kernels
91
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: with stencils (all implicit), all communication is for vectors

                            Nonzero entries
    Indices            Explicit (O(nnz))     Implicit (o(nnz))
    Explicit (O(nnz))  CSR and variations    Vision, climate, AMR, …
    Implicit (o(nnz))  Graph Laplacian       Stencils

• Operations
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just the last one
• Co-tuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  • Ditto, but throw away W
• Preconditioned versions
Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
• More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV, available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"
Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)
Recall optimal sequential Matmul
• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)·B(k,j)
• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b−1, for j2 = 0:b−1, for k2 = 0:b−1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)·B(k,j)       … b×b matmul
• Thm: picking b = M^(1/2) attains the lower bound #words_moved = Ω(n^3/M^(1/2))
• Where does the 1/2 come from?
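A sketch of the blocked loop nest above (NumPy used only for storage and for checking the answer; in a real implementation b would be chosen so that three b×b tiles fit in fast memory, b ≈ (M/3)^(1/2)):

```python
import numpy as np

def blocked_matmul(A, B, b):
    """Blocked (tiled) matmul: each (i1, j1, k1) step multiplies one
    b-by-b tile of A by one b-by-b tile of B into a tile of C, so only
    three tiles need to live in fast memory at a time."""
    n = A.shape[0]
    assert n % b == 0, "for simplicity, assume b divides n"
    C = np.zeros((n, n))
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))
assert np.allclose(blocked_matmul(A, B, b=4), A @ B)
```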
New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)·B(k,j)
• Record array indices in matrix Δ:

         i  j  k
    Δ = [1  0  1]   A
        [0  1  1]   B
        [1  1  0]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^Tx s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^Tx = 3/2 = s
• Thm: #words_moved = Ω(n^3/M^(s−1)) = Ω(n^3/M^(1/2)). Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
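The little LP above can be solved mechanically. A sketch using `scipy.optimize.linprog` (which minimizes, so we negate the objective); the rows of Δ record which loop indices each array reference uses:

```python
import numpy as np
from scipy.optimize import linprog

# Rows of Delta: which of (i, j, k) each array reference touches
Delta = np.array([[1, 0, 1],   # A(i,k)
                  [0, 1, 1],   # B(k,j)
                  [1, 1, 0]])  # C(i,j)

# maximize 1^T x  s.t.  Delta x <= 1, x >= 0   (negated for linprog)
res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
x = res.x
s_hbl = x.sum()

assert np.allclose(x, [0.5, 0.5, 0.5])   # tile each index by M^(1/2)
assert np.isclose(s_hbl, 1.5)            # words_moved = Ω(n^3 / M^(1/2))
```

The optimum is unique here: summing the three constraints gives 2·(xi+xj+xk) ≤ 3, and equality forces all three constraints tight at x = (1/2, 1/2, 1/2).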
New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

         i  j
    Δ = [1  0]   F
        [1  0]   P(i)
        [0  1]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^Tx s.t. Δx ≤ 1
  – Result: x = [1, 1]^T, 1^Tx = 2 = s
• Thm: #words_moved = Ω(n^2/M^(s−1)) = Ω(n^2/M^1). Attained by block sizes M^xi, M^xj = M^1, M^1
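The exponent x = [1, 1] says to tile both loops by blocks of ~M particles: load M targets and M sources, do M^2 interactions, then move on, giving O(n^2/M) words moved instead of O(n^2). A minimal sketch with a made-up 1-D inverse-square force (purely illustrative; the function name `forces_tiled` is ours):

```python
import numpy as np

def forces_tiled(P, m):
    """Direct n-body with both loops tiled by m: each (i1, j1) pass
    loads one tile of targets and one tile of sources and performs
    m*m interactions before touching slow memory again."""
    n = len(P)
    F = np.zeros(n)
    for i1 in range(0, n, m):
        for j1 in range(0, n, m):
            Pi, Pj = P[i1:i1+m], P[j1:j1+m]        # two tiles in fast memory
            d = Pi[:, None] - Pj[None, :]           # pairwise separations
            with np.errstate(divide="ignore", invalid="ignore"):
                f = np.where(d != 0, 1.0 / (d * np.abs(d)), 0.0)  # skip self
            F[i1:i1+m] += f.sum(axis=1)
    return F

P = np.linspace(0.0, 1.0, 16)
# Reference: plain untiled double loop
F_ref = np.array([sum(1.0 / ((p - q) * abs(p - q)) for q in P if q != p)
                  for p in P])
assert np.allclose(forces_tiled(P, m=4), F_ref)
```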
N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles
11.8x speedup
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik
New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

         i1 i2 i3 i4 i5 i6
        [1  0  1  0  0  1]   A1
        [1  1  0  1  0  0]   A2
    Δ = [0  1  1  0  1  0]   A3
        [0  0  1  1  0  1]   A3, A4
        [0  1  0  0  0  1]   A5
        [1  0  0  1  1  0]   A6

• Solve LP for x = [x1, …, x6]^T: max 1^Tx s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7]^T, 1^Tx = 15/7 = s
• Thm: #words_moved = Ω(n^6/M^(s−1)) = Ω(n^6/M^(8/7)). Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)
Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)·B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2·i3−i7) = func(A(i2+3·i4, i1, i2, i1+i2, …), B(pnt(3·i4)), …)
      D(something else) = func(something else)
      …
  ⇒ for (i1, i2, …, ik) in S = subset of Z^k, access locations indexed by "projections", e.g.
      φ_C(i1, i2, …, ik) = (i1+2·i3−i7), φ_A(i1, i2, …, ik) = (i2+3·i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?
General Communication Bound
• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σ_j s_j · rank(φ_j(H))
• Thm: given a program with array refs given by the φ_j, choose the s_j to minimize s_HBL = Σ_j s_j subject to the HBL-LP. Then
    #words_moved = Ω(#iterations / M^(s_HBL − 1))
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett
• Given S a subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): given s1, …, sm,
    |S| ≤ Π_j |φ_j(S)|^(s_j)
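For matmul the CTCB inequality with s = (1/2, 1/2, 1/2) is the classical Loomis-Whitney inequality |S| ≤ (|φ_A(S)| · |φ_B(S)| · |φ_C(S)|)^(1/2). A quick numerical sanity check on random subsets of Z^3 (illustrative only, not part of any proof):

```python
import itertools
import math
import random

random.seed(0)
cube = list(itertools.product(range(6), repeat=3))  # a chunk of Z^3

for trial in range(100):
    S = random.sample(cube, random.randrange(1, len(cube)))
    # The three matmul projections: (i,k) for A, (k,j) for B, (i,j) for C
    A = {(i, k) for i, j, k in S}
    B = {(k, j) for i, j, k in S}
    C = {(i, j) for i, j, k in S}
    # Loomis-Whitney: |S| <= |A|^(1/2) * |B|^(1/2) * |C|^(1/2)
    assert len(S) <= math.sqrt(len(A) * len(B) * len(C)) + 1e-9
```

The bound is tight when S is a full box, which is exactly why block (tile) shapes attain it.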
Is this bound attainable? (1/2)
• But first: can we write it down?
  – Thm (bad news): reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest (e.g., all φ_j = subset of indices)
  – Thm (good news): easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get s_HBL too large)
Is this bound attainable? (2/2)
• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: when all φ_j = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:       minimize 1^Ts  s.t.  s^TΔ ≥ 1^T
    Dual-HBL-LP:  maximize 1^Tx  s.t.  Δx ≤ 1
  Then for a sequential algorithm, tile index i_j by M^(x_j)
• Ex: Matmul, s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices
Ongoing Work
• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound; can we prove this?
• Incorporate into compilers
For more details
• bebop.cs.berkeley.edu
• CS267, Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013: www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  – Free supercomputer accounts to do homework
Summary
Don't Communic…
Time to redesign all linear algebra and n-body algorithms and software (and compilers…)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- SUMMA Communication Costs
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
- Perfect Strong Scaling ndash in Time and Energy (22)
- Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU QR hellip
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU Using similar idea for TSLU as TSQR Use reduction
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- 25D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- For more details
- Summary
-
Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]
• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n = 32, k = 3 (figure: rows x, A·x, A^2·x, A^3·x over columns 1, 2, 3, 4, …, 32)
• Sequential algorithm: sweep across the region in Steps 1, 2, 3, 4
• Parallel algorithm: each processor communicates once with neighbors, then works on an (overlapping) trapezoid (Proc 1, Proc 2, Proc 3, Proc 4)
Same idea works for general sparse matrices: simple block-row partitioning becomes (hyper)graph partitioning, and top-to-bottom processing becomes a Traveling Salesman Problem.
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost88
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
89
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
90
Requires Co-tuning Kernels
91
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis
bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)
bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)
bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from
b x b matmul
New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ
bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s
bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12
i j k1 0 1 A
Δ = 0 1 1 B1 1 0 C
New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ
bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s
bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1
i j1 0 F
Δ = 1 0 P(i)0 1 P(j)
N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles
118x speedup
K Yelick E Georganas M Driscoll P Koanantakool E Solomonik
New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ
bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s
bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47
i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1
1 1 0 1 0 0 A2
Δ = 0 1 1 0 1 0 A3
0 0 1 1 0 1 A3A4
0 0 1 1 0 1 A5
1 0 0 1 1 0 A6
Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3
Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk
Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip
bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip
General Communication Bound
bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))
bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then
words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by
ChristTaoCarberyBennett
bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|
bull Thm (ChristTaoCarberyBennett) Given s1hellipsm
|S| le Πj |φj(S)|sj
Is this bound attainable (12)
bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q
(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases
of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate
bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP
gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T
Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj
bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel
ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework
Summary
Donrsquot Communichellip
108
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- SUMMA Communication Costs
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
- Perfect Strong Scaling ndash in Time and Energy (22)
- Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU QR hellip
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU Using similar idea for TSLU as TSQR Use reduction
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- 25D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- For more details
- Summary
-
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3 Step 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors
Proc 1 Proc 2 Proc 3 Proc 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid
Proc 1 Proc 2 Proc 3 Proc 4
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost88
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
89
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
90
Requires Co-tuning Kernels
91
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis
bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)
bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)
bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from
b x b matmul
New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ
bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s
bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12
i j k1 0 1 A
Δ = 0 1 1 B1 1 0 C
New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ
bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s
bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1
i j1 0 F
Δ = 1 0 P(i)0 1 P(j)
N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles
118x speedup
K Yelick E Georganas M Driscoll P Koanantakool E Solomonik
New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ
bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s
bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47
i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1
1 1 0 1 0 0 A2
Δ = 0 1 1 0 1 0 A3
0 0 1 1 0 1 A3A4
0 0 1 1 0 1 A5
1 0 0 1 1 0 A6
Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3
Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk
Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip
bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip
General Communication Bound
bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))
bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then
words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by
ChristTaoCarberyBennett
bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|
bull Thm (ChristTaoCarberyBennett) Given s1hellipsm
|S| le Πj |φj(S)|sj
Is this bound attainable (12)
bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q
(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases
of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate
bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP
gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T
Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj
bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel
ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework
Summary
Donrsquot Communichellip
108
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms www.cs.berk
- Why avoid communication? (1/2)
- Why avoid communication? (2/3)
- Why Minimize Communication? (2/2)
- Why Minimize Communication? (2/2) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all "direct" linear algebra
- Lower bound for all "direct" linear algebra (2)
- Lower bound for all "direct" linear algebra (3)
- Can we attain these lower bounds?
- Naïve Matrix Multiply
- Naïve Matrix Multiply (2)
- Naïve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound?
- Recursive Matrix Multiplication (RMM) (1/2)
- Recursive Matrix Multiplication (RMM) (2/2)
- How hard is hand-tuning matmul, anyway?
- How hard is hand-tuning matmul, anyway? (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
- SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
- SUMMA Communication Costs
- Can we do better?
- 2.5D Matrix Multiplication
- 2.5D Matrix Multiplication (2)
- 2.5D Matmul on BG/P, 16K nodes / 64K cores
- 2.5D Matmul on BG/P, 16K nodes / 64K cores (2)
- Perfect Strong Scaling – in Time and Energy (1/2)
- Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n=64K
- Perfect Strong Scaling – in Time and Energy (2/2)
- Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU, QR, …
- TSQR: QR of a Tall, Skinny matrix
- TSQR: QR of a Tall, Skinny matrix (2)
- TSQR: An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU: Using similar idea for TSLU as TSQR: Use reduction
- Exascale Machine Parameters Source: DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination: 2D CA
- 2.5D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better?
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym. Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs.
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speedups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable? (1/2)
- Is this bound attainable? (2/2)
- Ongoing Work
- For more details
- Summary
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx].
• Example: A tridiagonal, n = 32, k = 3.
[Figure: rows x, A·x, A²·x, A³·x over columns 1, 2, 3, 4, …, 32, built up step by step]
• Sequential algorithm: compute the entries block by block, in Steps 1 through 4, so each block of A is read from slow memory only once.
• Parallel algorithm (Proc 1, Proc 2, Proc 3, Proc 4): each processor communicates once with its neighbors, then works on an (overlapping) trapezoid of entries.
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• The same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
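The trapezoid idea above can be sketched concretely for the tridiagonal example. This is a minimal illustration, not the tuned kernel: the helper names `tridiag_matvec` and `matrix_powers_block` are hypothetical, and one exchange of k ghost entries per neighbor stands in for the single round of communication.

```python
import numpy as np

def tridiag_matvec(sub, main, sup, v):
    # y[i] = sub[i]*v[i-1] + main[i]*v[i] + sup[i]*v[i+1]
    y = main * v
    y[1:] += sub[1:] * v[:-1]
    y[:-1] += sup[:-1] * v[1:]
    return y

def matrix_powers_block(sub, main, sup, x, lo, hi, k):
    """Rows [lo, hi) of A·x, A²·x, ..., A^k·x for tridiagonal A.

    Touches only entries within distance k of the block (the trapezoid):
    one exchange of k ghost values per neighbor replaces k exchanges of 1.
    """
    n = len(x)
    a, b = max(0, lo - k), min(n, hi + k)   # owned block plus ghost zones
    v = x[a:b].copy()
    s, m, p = sub[a:b], main[a:b], sup[a:b]
    out = []
    for _ in range(k):
        v = tridiag_matvec(s, m, p, v)
        # each matvec invalidates one more entry at each interior edge,
        # but [lo, hi) stays valid because we padded by k
        out.append(v[lo - a:hi - a].copy())
    return out
```

Each processor would call `matrix_powers_block` on its own row range; no further communication is needed for all k products.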
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax − b||₂
• Standard GMRES:
    for i = 1 to k
      w = A · v(i-1)              … SpMV
      MGS(w, v(0), …, v(i-1))     … Modified Gram-Schmidt
      update v(i), H
    endfor
    solve LSQ problem with H
• Communication-avoiding GMRES:
    W = [v, Av, A²v, …, Aᵏv]
    [Q,R] = TSQR(W)               … "Tall Skinny QR"
    build H from R
    solve LSQ problem with H
• Sequential case: words moved decreases by a factor of k.
• Parallel case: number of messages decreases by a factor of k.
• Oops – W is from the power method, so precision is lost!
• "Monomial" basis [Ax, …, Aᵏx] fails to converge; a different polynomial basis [p₁(A)x, …, pₖ(A)x] does converge.
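The TSQR building block can be sketched in a few lines. A hedged illustration only (a flat one-level reduction tree, returning just the R factor needed to build H, with NumPy's QR standing in for a tuned local kernel):

```python
import numpy as np

def tsqr_r(W, nblocks=4):
    """R factor of tall-skinny W via a reduction tree:
    QR each row block locally, stack the small R factors, QR the stack.
    One reduction replaces the message-per-column pattern of MGS."""
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(B, mode='r') for B in blocks]
    return np.linalg.qr(np.vstack(Rs), mode='r')
```

Since the Q factors may differ by column signs, correctness is checked via RᵀR = WᵀW rather than by comparing R entrywise.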
Speedups of GMRES on 8-core Intel Clovertown [MHDY09]
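The monomial-vs-polynomial-basis remark can be checked numerically. A toy sketch under stated assumptions (A diagonal, so A·x is elementwise; Chebyshev-point shifts in [0,1] are this illustration's choice, not the tuned shifts used in practice):

```python
import numpy as np

n, k = 200, 10
lam = np.linspace(0.0, 1.0, n)   # eigenvalues of a diagonal A
x = np.ones(n)

W_mono = np.empty((n, k + 1)); W_mono[:, 0] = x
W_newt = np.empty((n, k + 1)); W_newt[:, 0] = x
# Chebyshev points in [0, 1], used as shifts for a Newton-like basis
theta = 0.5 + 0.5 * np.cos((2 * np.arange(k) + 1) * np.pi / (2 * k))
for j in range(k):
    v = lam * W_mono[:, j]                  # A·w_j (monomial basis)
    W_mono[:, j + 1] = v / np.linalg.norm(v)
    v = (lam - theta[j]) * W_newt[:, j]     # (A - θ_j·I)·w_j (shifted basis)
    W_newt[:, j + 1] = v / np.linalg.norm(v)

# The monomial basis is the power method: columns align and the basis
# becomes numerically ill-conditioned; the shifted basis stays usable.
print(np.linalg.cond(W_mono), np.linalg.cond(W_newt))
```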
Requires Co-tuning Kernels
CA-BiCGStab with Residual Replacement (RR), à la Van der Vorst and Ye
Replacement iterations (count in parentheses), by basis:
  Naive:     74  (1)
  Monomial:  [7, 15, 24, 31, …, 92, 97, 103]  (17)
  Newton:    [67, 98]  (2)
  Chebyshev: 68  (1)
Tuning space for Krylov Methods
• Classifications of sparse operators for avoiding communication:

                          Nonzero entries:
  Indices:                Explicit (O(nnz))    Implicit (o(nnz))
  Explicit (O(nnz))       CSR and variations   Vision, climate, AMR, …
  Implicit (o(nnz))       Graph Laplacian      Stencils

• Explicit indices or nonzero entries cause most communication, along with vectors.
• Ex: With stencils (all implicit), all communication is for vectors.
• Operations:
  – [x, Ax, A²x, …, Aᵏx] or [x, p₁(A)x, p₂(A)x, …, pₖ(A)x]
  – Number of columns in x
  – [x, Ax, A²x, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
  – Return all vectors or just the last one
• Co-tuning and/or interleaving:
  – W = [x, Ax, A²x, …, Aᵏx] and TSQR(W) or WᵀW or …
  – Ditto, but throw away W
• Preconditioned versions
Summary of Iterative Linear Algebra
• New: lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized; more underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning: Hierarchically Semiseparable Matrices
  – Autotuning and synthesis: pOSKI for SpMV – available at bebop.cs.berkeley.edu
  – Different kinds of "sparse matrices"
Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)
Recall optimal sequential Matmul
• Naïve code:
    for i = 1:n, for j = 1:n, for k = 1:n
      C(i,j) += A(i,k)·B(k,j)
• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)·B(k,j)       … b x b matmul
• Thm: Picking b = M^(1/2) attains the lower bound: words_moved = Ω(n³/M^(1/2)).
• Where does the 1/2 come from?
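The blocked loop nest above translates directly into code. A sketch only: NumPy's `@` stands in for the innermost b×b matmul, and in the real setting the three tiles would be resident in fast memory of size M.

```python
import numpy as np

def blocked_matmul(A, B, b):
    """Tiled C = A·B with b x b blocks. Choosing b ≈ M^(1/2), so three
    b x b tiles fit in fast memory of size M, attains the
    words_moved = Ω(n³/M^(1/2)) lower bound."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                # one b x b matmul per tile triple; each tile is reused
                # b times while resident in fast memory
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C
```

Ragged edges are handled automatically because NumPy slices clamp at the array boundary.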
New Thm applied to Matmul
• for i = 1:n, for j = 1:n, for k = 1:n
    C(i,j) += A(i,k)·B(k,j)
• Record array indices in matrix Δ:

              i  j  k
            [ 1  0  1 ]   A
    Δ =     [ 0  1  1 ]   B
            [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]ᵀ: max 1ᵀ·x s.t. Δ·x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]ᵀ, 1ᵀ·x = 3/2 = s
• Thm: words_moved = Ω(n³/M^(s−1)) = Ω(n³/M^(1/2)). Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2).
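The small LP above can be solved mechanically. A sketch using SciPy's `linprog` (an assumed dependency; `linprog` minimizes, so the objective is negated):

```python
import numpy as np
from scipy.optimize import linprog

# Rows of Δ record which loop indices (i, j, k) each array reference touches.
Delta = np.array([[1, 0, 1],   # A(i,k)
                  [0, 1, 1],   # B(k,j)
                  [1, 1, 0]])  # C(i,j)

# maximize 1ᵀ·x subject to Δ·x ≤ 1, x ≥ 0
res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
x, s = res.x, -res.fun
print(x, s)   # x = [1/2, 1/2, 1/2], s = 3/2
```

The optimum is unique here: summing the three constraints gives 2·(xi+xj+xk) ≤ 3, and equality forces all three constraints tight.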
New Thm applied to Direct N-Body
• for i = 1:n, for j = 1:n
    F(i) += force(P(i), P(j))
• Record array indices in matrix Δ:

              i  j
            [ 1  0 ]   F
    Δ =     [ 1  0 ]   P(i)
            [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]ᵀ: max 1ᵀ·x s.t. Δ·x ≤ 1
  – Result: x = [1, 1], 1ᵀ·x = 2 = s
• Thm: words_moved = Ω(n²/M^(s−1)) = Ω(n²/M¹). Attained by block sizes M^xi, M^xj = M¹, M¹.
N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles
11.8x speedup
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik
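The M¹-sized blocks from the theorem can be illustrated with a toy direct n-body sweep. A sketch only, with 1-D positions and a made-up inverse-square force; the tile size b plays the role of M:

```python
import numpy as np

def blocked_forces(P, b):
    """Direct n-body in b-sized tiles: each pair of tiles does b²
    interactions per O(b) words moved, matching words_moved = Ω(n²/M)
    when b ≈ M."""
    n = len(P)
    F = np.zeros(n)
    for i1 in range(0, n, b):
        Pi = P[i1:i1+b]
        for j1 in range(0, n, b):
            d = Pi[:, None] - P[j1:j1+b][None, :]   # pairwise separations
            # toy force sign(d)/d², guarded against the i == j diagonal
            f = np.divide(np.sign(d), d * d, out=np.zeros_like(d), where=d != 0)
            F[i1:i1+b] += f.sum(axis=1)
    return F
```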
New Thm applied to Random Code
• for i1 = 1:n, for i2 = 1:n, …, for i6 = 1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

              i1 i2 i3 i4 i5 i6
            [ 1  0  1  0  0  1 ]   A1
            [ 1  1  0  1  0  0 ]   A2
    Δ =     [ 0  1  1  0  1  0 ]   A3
            [ 0  0  1  1  0  1 ]   A3, A4
            [ 0  1  0  0  0  1 ]   A5
            [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]ᵀ: max 1ᵀ·x s.t. Δ·x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀ·x = 15/7 = s
• Thm: words_moved = Ω(n⁶/M^(s−1)) = Ω(n⁶/M^(8/7)). Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7).
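The same LP machinery handles the six-loop example; solving it recovers the stated exponents. A sketch with SciPy's `linprog` (assumed dependency), with Δ rows taken from the array subscripts above:

```python
import numpy as np
from scipy.optimize import linprog

Delta = np.array([[1, 0, 1, 0, 0, 1],   # A1(i1,i3,i6)
                  [1, 1, 0, 1, 0, 0],   # A2(i1,i2,i4)
                  [0, 1, 1, 0, 1, 0],   # A3(i2,i3,i5)
                  [0, 0, 1, 1, 0, 1],   # A3, A4 (i3,i4,i6)
                  [0, 1, 0, 0, 0, 1],   # A5(i2,i6)
                  [1, 0, 0, 1, 1, 0]])  # A6(i1,i4,i5)

# maximize 1ᵀ·x subject to Δ·x ≤ 1, x ≥ 0
res = linprog(c=-np.ones(6), A_ub=Delta, b_ub=np.ones(6), bounds=(0, None))
print(res.x * 7, -res.fun * 7)   # x = [2, 3, 1, 2, 3, 4]/7, s = 15/7
```

All six constraints are tight at the optimum, which makes the solution unique.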
Approach to generalizing lower bounds
• Matmul:
    for i = 1:n, for j = 1:n, for k = 1:n
      C(i,j) += A(i,k)·B(k,j)
  ⇒ for (i,j,k) in S = subset of Z³, access locations indexed by (i,j), (i,k), (k,j).
• General case:
    for i1 = 1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2·i3−i7) = func(A(i2+3·i4, i1, i2, i1+i2, …), B(pnt(3·i4)), …)
      D(something else) = func(something else), …
  ⇒ for (i1, i2, …, ik) in S = subset of Zᵏ, access locations indexed by "projections", e.g.
      φ_C(i1, i2, …, ik) = (i1+2·i3−i7)
      φ_A(i1, i2, …, ik) = (i2+3·i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on the #points in its images φ_C(S), φ_A(S), …?
General Communication Bound
• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Zᵏ:  rank(H) ≤ Σj sj · rank(φj(H))
  – Ex: for matmul with H = {(i,0,0)}, the constraint reads 1 ≤ s_A·1 + s_B·0 + s_C·1, which holds with equality at s = [1/2, 1/2, 1/2].
• Thm: Given a program with array refs given by the φj, choose the sj to minimize s_HBL = Σj sj subject to the HBL-LP. Then
    words_moved = Ω(#iterations / M^(s_HBL − 1))
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S a subset of Zᵏ and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|.
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm satisfying the HBL-LP,
    |S| ≤ Πj |φj(S)|^sj
Is this bound attainable? (1/2)
• But first: Can we write it down?
  – Thm (bad news): Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable).
  – Thm (good news): Can write it down explicitly in many cases of interest (e.g., when all φj = subsets of indices).
  – Thm (good news): Easy to approximate.
• If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but it is still worth trying to attain, since your algorithm will still communicate less.
• Getting a superset of the constraints is Tarski-decidable (this may make s_HBL too large).
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP
gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T
Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj
bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel
ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework
Summary
Donrsquot Communichellip
108
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- SUMMA Communication Costs
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
- Perfect Strong Scaling ndash in Time and Energy (22)
- Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU QR hellip
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU Using similar idea for TSLU as TSQR Use reduction
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- 25D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- For more details
- Summary
-
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3 Step 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors
Proc 1 Proc 2 Proc 3 Proc 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid
Proc 1 Proc 2 Proc 3 Proc 4
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost88
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
89
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
90
Requires Co-tuning Kernels
91
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis
bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)
bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)
bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from
b x b matmul
New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ
bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s
bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12
i j k1 0 1 A
Δ = 0 1 1 B1 1 0 C
New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ
bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s
bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1
i j1 0 F
Δ = 1 0 P(i)0 1 P(j)
N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles
118x speedup
K Yelick E Georganas M Driscoll P Koanantakool E Solomonik
New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ
bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s
bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47
i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1
1 1 0 1 0 0 A2
Δ = 0 1 1 0 1 0 A3
0 0 1 1 0 1 A3A4
0 0 1 1 0 1 A5
1 0 0 1 1 0 A6
Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3
Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk
Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip
bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip
General Communication Bound
bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))
bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then
words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by
ChristTaoCarberyBennett
bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|
bull Thm (ChristTaoCarberyBennett) Given s1hellipsm
|S| le Πj |φj(S)|sj
Is this bound attainable (12)
bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q
(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases
of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate
bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP
gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T
Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj
bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel
ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework
Summary
Donrsquot Communichellip
108
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- SUMMA Communication Costs
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
- Perfect Strong Scaling ndash in Time and Energy (22)
- Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU QR hellip
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU Using similar idea for TSLU as TSQR Use reduction
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- 25D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- For more details
- Summary
-
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3 Step 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors
Proc 1 Proc 2 Proc 3 Proc 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid
Proc 1 Proc 2 Proc 3 Proc 4
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost88
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
89
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
90
Requires Co-tuning Kernels
91
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis
bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Recall optimal sequential Matmul
• Naive code:
    for i = 1:n, for j = 1:n, for k = 1:n
      C(i,j) += A(i,k)*B(k,j)
• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1    ... b x b matmul
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)
• Thm: picking b = M^(1/2) attains the lower bound: words_moved = Ω(n^3 / M^(1/2))
• Where does 1/2 come from?
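The blocked loop nest above can be sketched directly in Python. This is a minimal sketch, not the tutorial's code: the helper name `matmul_blocked` is hypothetical, and the block size is chosen as roughly sqrt(M/3) so that the three b-by-b tiles of A, B, and C fit together in a fast memory of M words (a common refinement of b = M^(1/2)).

```python
import math

def matmul_blocked(A, B, n, M):
    # Tile size b ~ sqrt(M/3): one b x b tile each of A, B, C fits in fast memory.
    b = max(1, int(math.isqrt(M // 3)))
    C = [[0.0] * n for _ in range(n)]
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                # One b x b matmul: ~2b^3 flops per ~3b^2 words moved.
                for i in range(i1, min(i1 + b, n)):
                    for j in range(j1, min(j1 + b, n)):
                        s = C[i][j]
                        for k in range(k1, min(k1 + b, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

The inner three loops touch only the current tile triple, which is what brings words_moved down to O(n^3 / b) = O(n^3 / M^(1/2)).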
New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

               i  j  k
      Δ  =  [  1  0  1  ]   A
            [  0  1  1  ]   B
            [  1  1  0  ]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2)). Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
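The stated LP solution can be certified without an LP solver. A minimal sketch: by weak duality, exhibiting a primal-feasible x and a dual-feasible s with equal objective values proves both are optimal (the variable names below are illustrative, not from the tutorial).

```python
# Primal: max 1^T x  s.t. Delta x <= 1;  Dual: min 1^T s  s.t. Delta^T s >= 1.
Delta = [[1, 0, 1],   # A(i,k)
         [0, 1, 1],   # B(k,j)
         [1, 1, 0]]   # C(i,j)
x = [0.5, 0.5, 0.5]   # candidate primal solution
s = [0.5, 0.5, 0.5]   # candidate dual solution (certificate)

primal_feasible = all(sum(row[j] * x[j] for j in range(3)) <= 1 for row in Delta)
dual_feasible = all(sum(Delta[i][j] * s[i] for i in range(3)) >= 1 for j in range(3))
assert primal_feasible and dual_feasible
assert sum(x) == sum(s) == 1.5   # equal objectives => both optimal, s_HBL = 3/2
```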
New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force(P(i), P(j))
• Record array indices in matrix Δ:

               i  j
      Δ  =  [  1  0  ]   F
            [  1  0  ]   P(i)
            [  0  1  ]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x  s.t.  Δx ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = s
• Thm: words_moved = Ω(n^2 / M^(s-1)) = Ω(n^2 / M^1). Attained by block sizes M^xi, M^xj = M^1, M^1
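A minimal sketch of the blocked n-body loop the theorem says is optimal: tile both loops with block size b ~ M, so each pass streams one block of targets against one block of sources. The 1D "force" law and the helper name `forces_blocked` are toy illustrations, not the tutorial's code.

```python
def forces_blocked(P, b):
    # P: list of distinct particle positions; b: block size (~M in the theorem).
    n = len(P)
    F = [0.0] * n
    for i1 in range(0, n, b):        # block of targets F(i), P(i)
        for j1 in range(0, n, b):    # block of sources P(j)
            for i in range(i1, min(i1 + b, n)):
                for j in range(j1, min(j1 + b, n)):
                    if i != j:
                        F[i] += 1.0 / (P[i] - P[j])   # toy pairwise "force"
    return F
```

Any block size gives the same answer (up to roundoff from summation order); the point is that b ~ M minimizes the number of times each block of P is reloaded.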
N-Body Speedups on IBM-BG/P (Intrepid), 8K cores, 32K particles: 11.8x speedup
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik
New Thm applied to Random Code
• for i1=1:n, for i2=1:n, ..., for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

              i1 i2 i3 i4 i5 i6
      Δ  = [  1  0  1  0  0  1  ]   A1
           [  1  1  0  1  0  0  ]   A2
           [  0  1  1  0  1  0  ]   A3
           [  0  0  1  1  0  1  ]   A3, A4
           [  0  1  0  0  0  1  ]   A5
           [  1  0  0  1  1  0  ]   A6

• Solve LP for x = [x1, ..., x6]^T: max 1^T x  s.t.  Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: words_moved = Ω(n^6 / M^(s-1)) = Ω(n^6 / M^(8/7)). Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)
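The stated solution can be checked in exact arithmetic. A minimal sketch (the LP itself is not re-solved here, only the stated x is verified to be feasible with objective 15/7; at this point every constraint happens to be tight):

```python
from fractions import Fraction as Fr

Delta = [[1, 0, 1, 0, 0, 1],   # A1(i1,i3,i6)
         [1, 1, 0, 1, 0, 0],   # A2(i1,i2,i4)
         [0, 1, 1, 0, 1, 0],   # A3(i2,i3,i5)
         [0, 0, 1, 1, 0, 1],   # A4(i3,i4,i6), also A3 in func2
         [0, 1, 0, 0, 0, 1],   # A5(i2,i6)
         [1, 0, 0, 1, 1, 0]]   # A6(i1,i4,i5)
x = [Fr(2, 7), Fr(3, 7), Fr(1, 7), Fr(2, 7), Fr(3, 7), Fr(4, 7)]

# Every row of Delta sums x-entries to exactly 1, so x is feasible (and tight).
assert all(sum(r[j] * x[j] for j in range(6)) == 1 for r in Delta)
assert sum(x) == Fr(15, 7)   # s = 15/7, so the exponent is s - 1 = 8/7
```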
Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, ..., for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, ...), B(pnt(3*i4)), ...)
      D(something else) = func(something else)
      ...
  => for (i1,i2,...,ik) in S = subset of Z^k, access locations indexed by "projections", e.g.
      φ_C(i1,i2,...,ik) = (i1+2*i3-i7)
      φ_A(i1,i2,...,ik) = (i2+3*i4, i1, i2, i1+i2, ...)
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), ...?
General Communication Bound
• Given S = subset of Z^k and group homomorphisms φ1, φ2, ..., bound |S| in terms of |φ1(S)|, |φ2(S)|, ..., |φm(S)|
• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, ..., sm:
      for all subgroups H < Z^k:  rank(H) ≤ Σ_j s_j · rank(φ_j(H))
• Thm (Christ/Tao/Carbery/Bennett): given s1, ..., sm satisfying the HBL-LP,
      |S| ≤ Π_j |φ_j(S)|^(s_j)
• Thm: given a program with array refs given by the φ_j, choose the s_j to minimize s_HBL = Σ_j s_j subject to the HBL-LP. Then
      words_moved = Ω(#iterations / M^(s_HBL - 1))
  – Proof depends on the recent result in pure mathematics by Christ/Tao/Carbery/Bennett
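For the matmul projections this is the classical Loomis-Whitney inequality (s = [1/2, 1/2, 1/2] from the LP above), and it can be sanity-checked on an arbitrary finite point set. A minimal sketch:

```python
# Check |S| <= (|phi_C(S)| * |phi_A(S)| * |phi_B(S)|)^(1/2) for a random S in Z^3,
# where phi_C, phi_A, phi_B drop one coordinate each (the matmul projections).
import random

random.seed(0)
S = {(random.randrange(8), random.randrange(8), random.randrange(8))
     for _ in range(100)}
phi_C = {(i, j) for (i, j, k) in S}
phi_A = {(i, k) for (i, j, k) in S}
phi_B = {(k, j) for (i, j, k) in S}
# Squared form of the inequality avoids the square root:
assert len(S) ** 2 <= len(phi_C) * len(phi_A) * len(phi_B)
```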
Is this bound attainable? (1/2)
• But first: can we write the HBL-LP down?
  – Thm (bad news): reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest (e.g., all φ_j = subset of indices)
  – Thm (good news): easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get s_HBL too large)
Is this bound attainable? (2/2)
• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: when all φ_j = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
      HBL-LP:      minimize 1^T s  s.t.  s^T Δ ≥ 1^T
      Dual-HBL-LP: maximize 1^T x  s.t.  Δ x ≤ 1
  Then, for a sequential algorithm, tile loop i_j by M^(x_j)
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices
Ongoing Work
• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers
For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013: www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  – Free supercomputer accounts to do homework
Summary
Don't Communic...
Time to redesign all linear algebra, n-body, ... algorithms and software (and compilers...)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- SUMMA Communication Costs
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
- Perfect Strong Scaling ndash in Time and Energy (22)
- Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU QR hellip
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU Using similar idea for TSLU as TSQR Use reduction
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- 25D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- For more details
- Summary
-
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, ..., A^kx]
• Replace k iterations of y = A*x with [Ax, A^2x, ..., A^kx]
• Example: A tridiagonal, n = 32, k = 3
  [Figure: entries of x, A·x, A^2·x, A^3·x over columns 1, 2, 3, 4, ..., 32]
• Sequential algorithm: compute the entries in Steps 1, 2, 3, 4, one trapezoid of columns at a time
• Parallel algorithm: Procs 1–4 each work on an (overlapping) trapezoid; each processor communicates once with neighbors
• Same idea works for general sparse matrices:
  – Simple block-row partitioning -> (hyper)graph partitioning
  – Top-to-bottom processing -> Traveling Salesman Problem
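The parallel trapezoid idea can be sketched for the tridiagonal example. This is a minimal sketch, not the tutorial's code: A is fixed to the 1D [-1, 2, -1] stencil, and `powers_block` computes one processor's slice of [Ax, A^2x, ..., A^kx] from its block plus k ghost entries per side, i.e. a single neighbor exchange up front.

```python
def spmv_tridiag(x):
    # y = A*x for the tridiagonal stencil A = tridiag(-1, 2, -1), Dirichlet ends.
    n = len(x)
    return [(-x[i-1] if i > 0 else 0.0) + 2.0 * x[i]
            + (-x[i+1] if i < n - 1 else 0.0) for i in range(n)]

def powers_block(x, lo, hi, k):
    """Rows [A^1 x, ..., A^k x] restricted to columns [lo, hi), computed from
    x's block plus k ghost values on each side (one communication step)."""
    glo, ghi = max(0, lo - k), min(len(x), hi + k)
    local = x[glo:ghi]          # block + ghosts: the "trapezoid" base
    rows = []
    for _ in range(k):
        local = spmv_tridiag(local)   # valid interior shrinks by 1 per product
        rows.append(local)
    return [row[lo - glo: hi - glo] for row in rows]
```

Entries within k of a block edge would be wrong after k products, which is exactly why k ghost columns per side (the overlapping trapezoid) are fetched before any arithmetic.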
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, ..., A^k b} minimizing ||Ax-b||_2
• Standard GMRES:
    for i = 1 to k
      w = A * v(i-1)               ... SpMV
      MGS(w, v(0), ..., v(i-1))    ... Modified Gram-Schmidt
      update v(i), H
    endfor
    solve LSQ problem with H
• Communication-avoiding GMRES:
    W = [v, Av, A^2v, ..., A^kv]
    [Q,R] = TSQR(W)                ... "Tall Skinny QR"
    build H from R
    solve LSQ problem with H
• Sequential case: #words_moved decreases by a factor of k
  Parallel case: #messages decreases by a factor of k
• Oops – W from power method, precision lost!
  – "Monomial" basis [Ax, ..., A^kx] fails to converge
  – A different polynomial basis [p1(A)x, ..., pk(A)x] does converge

Speed ups of GMRES on 8-core Intel Clovertown [MHDY09]
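The "precision lost" failure mode above can be seen numerically: A^k v is exactly the power method, so successive columns of W line up with the dominant eigenvector and W becomes numerically rank deficient. A minimal sketch with a toy 2x2 matrix (illustrative, not from the tutorial):

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

A = [[10.0, 1.0], [0.0, 1.0]]   # eigenvalues 10 and 1
v = [1.0, 1.0]
W = [v]                          # monomial basis [v, Av, A^2 v, ...]
for _ in range(20):
    W.append(matvec(A, W[-1]))

# The last two basis vectors are nearly parallel: the basis has collapsed.
assert cosine(W[-1], W[-2]) > 0.9999
```

A better-conditioned polynomial basis (Newton or Chebyshev, as on the CA-BiCGStab slide) spreads the columns out instead of letting them collapse onto one direction.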
90
Requires Co-tuning Kernels
91
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis
bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)
bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)
bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from
b x b matmul
New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ
bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s
bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12
i j k1 0 1 A
Δ = 0 1 1 B1 1 0 C
New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ
bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s
bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1
i j1 0 F
Δ = 1 0 P(i)0 1 P(j)
N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles
118x speedup
K Yelick E Georganas M Driscoll P Koanantakool E Solomonik
New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ
bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s
bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47
i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1
1 1 0 1 0 0 A2
Δ = 0 1 1 0 1 0 A3
0 0 1 1 0 1 A3A4
0 0 1 1 0 1 A5
1 0 0 1 1 0 A6
Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3
Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk
Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip
bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip
General Communication Bound
bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))
bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then
words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by
ChristTaoCarberyBennett
bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|
bull Thm (ChristTaoCarberyBennett) Given s1hellipsm
|S| le Πj |φj(S)|sj
Is this bound attainable (12)
bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q
(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases
of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate
bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP
gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T
Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj
bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel
ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework
Summary
Donrsquot Communichellip
108
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- SUMMA Communication Costs
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
- Perfect Strong Scaling ndash in Time and Energy (22)
- Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU QR hellip
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU Using similar idea for TSLU as TSQR Use reduction
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- 25D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- For more details
- Summary
-
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3 Step 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors
Proc 1 Proc 2 Proc 3 Proc 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid
Proc 1 Proc 2 Proc 3 Proc 4
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost88
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
89
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
90
Requires Co-tuning Kernels
91
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis
bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)
bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)
bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from
b x b matmul
New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ
bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s
bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12
i j k1 0 1 A
Δ = 0 1 1 B1 1 0 C
New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ
bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s
bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1
i j1 0 F
Δ = 1 0 P(i)0 1 P(j)
N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles
118x speedup
K Yelick E Georganas M Driscoll P Koanantakool E Solomonik
New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ
bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s
bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47
i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1
1 1 0 1 0 0 A2
Δ = 0 1 1 0 1 0 A3
0 0 1 1 0 1 A3A4
0 0 1 1 0 1 A5
1 0 0 1 1 0 A6
Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3
Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk
Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip
bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip
General Communication Bound
bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))
bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then
words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by
ChristTaoCarberyBennett
bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|
bull Thm (ChristTaoCarberyBennett) Given s1hellipsm
|S| le Πj |φj(S)|sj
Is this bound attainable (12)
bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q
(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases
of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate
bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP
gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T
Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj
bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel
ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework
Summary
Donrsquot Communichellip
108
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- SUMMA Communication Costs
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
- Perfect Strong Scaling ndash in Time and Energy (22)
- Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU QR hellip
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU Using similar idea for TSLU as TSQR Use reduction
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- 25D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- For more details
- Summary
-
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors
Proc 1 Proc 2 Proc 3 Proc 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid
Proc 1 Proc 2 Proc 3 Proc 4
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax - b ||_2
• Standard GMRES:
    for i = 1 to k
      w = A · v(i-1)             … SpMV
      MGS(w, v(0), …, v(i-1))    … Modified Gram-Schmidt
      update v(i), H
    endfor
    solve LSQ problem with H
• Communication-avoiding GMRES:
    W = [ v, Av, A^2v, …, A^kv ]
    [Q, R] = TSQR(W)             … "Tall Skinny QR"
    build H from R
    solve LSQ problem with H
• Sequential case: #words moved decreases by a factor of k
• Parallel case: #messages decreases by a factor of k
• Oops – W from power method, precision lost!
• "Monomial" basis [Ax, …, A^kx] fails to converge
• A different polynomial basis [p1(A)x, …, pk(A)x] does converge
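Why the monomial basis loses precision: computing W = [v, Av, …, A^kv] column by column is the power method, so successive columns align with the dominant eigenvector and W becomes numerically rank deficient. A tiny illustrative example (a hypothetical 2x2 matrix, not from the slides) shows the cosines between consecutive basis vectors rushing toward 1:

```python
import math

def matvec(A, x):
    # dense matrix-vector product
    return [sum(a*b for a, b in zip(row, x)) for row in A]

def cosine(u, v):
    # cosine of the angle between two 2-vectors
    dot = sum(a*b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

A = [[2.0, 1.0], [1.0, 3.0]]          # symmetric, eigenvalues (5 ± sqrt(5))/2
vecs = [[1.0, 0.0]]
for _ in range(8):                    # monomial basis: v, Av, A^2 v, ...
    vecs.append(matvec(A, vecs[-1]))

angles = [cosine(vecs[i], vecs[i+1]) for i in range(len(vecs) - 1)]
# cosines approach 1: the basis vectors become nearly parallel, which is
# why CA-GMRES switches to a Newton or Chebyshev basis [p1(A)x, ..., pk(A)x]
```

The misalignment between consecutive columns decays like (λ2/λ1)^k, so in floating point the columns quickly become indistinguishable.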
Speedups of GMRES on 8-core Intel Clovertown [MHDY09]
Requires co-tuning kernels.
CA-BiCGStab
With Residual Replacement (RR) a la Van der Vorst and Ye:

  Basis:             Naive    Monomial                           Newton        Chebyshev
  Replacement its.:  74 (1)   [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)   68 (1)
Tuning space for Krylov Methods
• Classification of sparse operators for avoiding communication:

                                   Nonzero entries:
    Indices              Explicit (O(nnz))       Implicit (o(nnz))
    Explicit (O(nnz))    CSR and variations      Vision, climate, AMR, …
    Implicit (o(nnz))    Graph Laplacian         Stencils

  – Explicit indices or nonzero entries cause most communication, along with vectors
  – Ex: with stencils (all implicit), all communication is for vectors
• Operations:
  – [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  – Number of columns in x
  – [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  – Return all vectors or just the last one
• Co-tuning and/or interleaving:
  – W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  – Ditto, but throw away W
• Preconditioned versions
Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
• More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning: Hierarchically Semiseparable Matrices
  – Autotuning and synthesis: pOSKI for SpMV (available at bebop.cs.berkeley.edu); different kinds of "sparse matrices"
Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)
Recall optimal sequential Matmul
• Naive code:
    for i = 1:n, for j = 1:n, for k = 1:n,
      C(i,j) += A(i,k)·B(k,j)
• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n,
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1,
        i = i1+i2;  j = j1+j2;  k = k1+k2
        C(i,j) += A(i,k)·B(k,j)      … b x b matmul on blocks
• Thm: picking b = M^(1/2) attains the lower bound #words_moved = Ω(n^3 / M^(1/2))
• Where does the 1/2 come from?
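The blocked loop above can be written out directly; here is a runnable sketch in Python (the slide's pseudocode is Fortran-style). With fast memory of size M, b = M^(1/2) lets one b-by-b block of each of A, B, C stay resident while the inner b x b matmul runs, which is what attains the Ω(n^3 / M^(1/2)) bound.

```python
def matmul_blocked(A, B, n, b):
    """C = A*B on n-by-n lists of lists, tiled into b-by-b blocks."""
    assert n % b == 0
    C = [[0] * n for _ in range(n)]
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                # b x b matmul on the blocks at (i1,j1), (i1,k1), (k1,j1):
                # three blocks of b^2 words each are reused b times apiece
                for i2 in range(b):
                    for j2 in range(b):
                        for k2 in range(b):
                            i, j, k = i1 + i2, j1 + j2, k1 + k2
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

Swapping the loop order inside a block does not change the answer, only the reuse pattern; the communication count depends on the tiling, not the arithmetic.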
New Thm applied to Matmul
• for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)·B(k,j)
• Record array indices in matrix Δ:

          i  j  k
        [ 1  0  1 ]   A
    Δ = [ 0  1  1 ]   B
        [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: #words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
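The slide's LP solution can be certified with exact arithmetic and no LP solver: x = (1/2, 1/2, 1/2) is feasible for the primal (max 1^T x s.t. Δx ≤ 1), s = x is feasible for the dual (min 1^T s s.t. s^T Δ ≥ 1^T), and their objectives agree, which by LP duality proves both are optimal. A small sketch:

```python
from fractions import Fraction as F

Delta = [[1, 0, 1],   # A(i,k)
         [0, 1, 1],   # B(k,j)
         [1, 1, 0]]   # C(i,j)
x = [F(1, 2)] * 3

# primal feasibility: every entry of Delta * x is <= 1
assert all(sum(r * xi for r, xi in zip(row, x)) <= 1 for row in Delta)

# dual feasibility (take s = x): every column sum of s^T * Delta is >= 1
assert all(sum(si * c for si, c in zip(x, col)) >= 1 for col in zip(*Delta))

# matching objectives: 1^T x = 1^T s = 3/2, so s = 3/2 and the bound
# is words_moved = Omega(n^3 / M^(1/2))
assert sum(x) == F(3, 2)
```

The same primal/dual check works verbatim for the n-body and "random code" examples that follow; only Δ and x change.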
New Thm applied to Direct N-Body
• for i = 1:n, for j = 1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

          i  j
        [ 1  0 ]   F
    Δ = [ 1  0 ]   P(i)
        [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x  s.t.  Δx ≤ 1
  – Result: x = [1, 1]^T, 1^T x = 2 = s
• Thm: #words_moved = Ω(n^2 / M^(s-1)) = Ω(n^2 / M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
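Since x = (1, 1), the optimal blocking tiles both loops by blocks of M particles: each pair of M-sized blocks performs M^2 interactions per O(M) words loaded, giving (n/M)^2 · O(M) = O(n^2/M) words moved, matching the bound. A hypothetical sketch of that loop structure (not the tutorial's code):

```python
def forces_blocked(P, M, force):
    """Direct n-body with both loops tiled by blocks of M particles."""
    n = len(P)
    Fout = [0.0] * n
    for i0 in range(0, n, M):          # block P[i0:i0+M] stays resident
        for j0 in range(0, n, M):      # block P[j0:j0+M] loaded once
            # M^2 interactions per O(M) words of particle data
            for i in range(i0, min(i0 + M, n)):
                for j in range(j0, min(j0 + M, n)):
                    Fout[i] += force(P[i], P[j])
    return Fout
```

The blocking changes only the order of accumulation, not the result, so it is safe whenever `force` contributions commute.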
N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles: 11.8x speedup
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik
New Thm applied to Random Code
• for i1 = 1:n, for i2 = 1:n, …, for i6 = 1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

          i1 i2 i3 i4 i5 i6
        [ 1  0  1  0  0  1 ]   A1
        [ 1  1  0  1  0  0 ]   A2
    Δ = [ 0  1  1  0  1  0 ]   A3
        [ 0  0  1  1  0  1 ]   A3, A4
        [ 0  1  0  0  0  1 ]   A5
        [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x  s.t.  Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7]^T, 1^T x = 15/7 = s
• Thm: #words_moved = Ω(n^6 / M^(s-1)) = Ω(n^6 / M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)
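This solution can also be checked exactly. The Δ below is read off directly from the subscripts in the loop body (the transcript's row for A5 appears garbled; A5(i2,i6) gives the row (0, 1, 0, 0, 0, 1)). With x = (2/7, 3/7, 1/7, 2/7, 3/7, 4/7), every constraint of Δx ≤ 1 is in fact tight:

```python
from fractions import Fraction as F

Delta = [[1, 0, 1, 0, 0, 1],   # A1(i1,i3,i6)
         [1, 1, 0, 1, 0, 0],   # A2(i1,i2,i4)
         [0, 1, 1, 0, 1, 0],   # A3(i2,i3,i5)
         [0, 0, 1, 1, 0, 1],   # A3(i3,i4,i6) and A4(i3,i4,i6)
         [0, 1, 0, 0, 0, 1],   # A5(i2,i6)
         [1, 0, 0, 1, 1, 0]]   # A6(i1,i4,i5)
x = [F(2, 7), F(3, 7), F(1, 7), F(2, 7), F(3, 7), F(4, 7)]

rows = [sum(r * xi for r, xi in zip(row, x)) for row in Delta]
assert all(r == 1 for r in rows)   # every constraint is tight
assert sum(x) == F(15, 7)          # s = 15/7, so the exponent in the
                                   # bound is s - 1 = 8/7
```

That all six constraints are tight is a good sanity check that no array reference was dropped when forming Δ.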
Approach to generalizing lower bounds
• Matmul:
    for i = 1:n, for j = 1:n, for k = 1:n,
      C(i,j) += A(i,k)·B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3,
    access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1 = 1:n, for i2 = i1:m, …, for ik = i3:i4,
      C(i1+2·i3-i7) = func(A(i2+3·i4, i1, i2, i1+i2, …), B(pnt(3·i4)), …)
      D(something else) = func(something else)
      …
  ⇒ for (i1, i2, …, ik) in S = subset of Z^k,
    access locations indexed by "projections", e.g.
      φ_C(i1, i2, …, ik) = (i1+2·i3-i7)
      φ_A(i1, i2, …, ik) = (i2+3·i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?
General Communication Bound
• Given S = subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
    |S| ≤ Π_j |φ_j(S)|^(s_j)
• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σ_j s_j · rank(φ_j(H))
• Thm: Given a program with array refs given by the φ_j, choose the s_j to minimize s_HBL = Σ_j s_j subject to the HBL-LP. Then
    #words_moved = Ω(#iterations / M^(s_HBL - 1))
  – Proof depends on the recent result in pure mathematics by Christ/Tao/Carbery/Bennett above
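For the matmul projections the Christ/Tao/Carbery/Bennett inequality specializes to the classical Loomis-Whitney inequality, |S|^2 ≤ |φ_A(S)| · |φ_B(S)| · |φ_C(S)| (i.e., all s_j = 1/2), which is easy to spot-check numerically on random finite subsets of Z^3:

```python
import itertools
import random

random.seed(0)
universe = list(itertools.product(range(4), repeat=3))  # a 4x4x4 box in Z^3

for _ in range(100):
    S = random.sample(universe, 20)
    pA = {(i, k) for i, j, k in S}   # phi_A(i,j,k) = (i,k)
    pB = {(k, j) for i, j, k in S}   # phi_B(i,j,k) = (k,j)
    pC = {(i, j) for i, j, k in S}   # phi_C(i,j,k) = (i,j)
    # Loomis-Whitney: |S|^2 <= |pA| * |pB| * |pC|
    assert len(S) ** 2 <= len(pA) * len(pB) * len(pC)
```

This is only a sanity check on finite samples, of course; the theorem guarantees the inequality for every finite S and every feasible exponent vector from the HBL-LP.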
Is this bound attainable? (1/2)
• But first: can we write it down?
  – Thm (bad news): writing down the HBL-LP reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest (e.g., when all φ_j = subsets of indices)
  – Thm (good news): easy to approximate
• If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
• Tarski-decidable to get a superset of the constraints (may get s_HBL too large)
Is this bound attainable? (2/2)
• Depends on the loop dependencies
• Best case: none, or reductions (as in matmul)
• Thm: when all φ_j = subsets of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:       minimize 1^T s  s.t.  s^T Δ ≥ 1^T
    Dual-HBL-LP:  maximize 1^T x  s.t.  Δx ≤ 1
  Then for a sequential algorithm, tile loop i_j by M^(x_j)
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices
Ongoing Work
• Identify more decidable cases
  – Works for any 3 nested loops, or any 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers
For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
• www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  – Free supercomputer accounts to do homework
Summary
Don't Communic…
Time to redesign all linear algebra, n-body, … algorithms and software (and compilers…)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- SUMMA Communication Costs
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
- Perfect Strong Scaling ndash in Time and Energy (22)
- Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU QR hellip
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU Using similar idea for TSLU as TSQR Use reduction
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- 25D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- For more details
- Summary
-
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid
Proc 1 Proc 2 Proc 3 Proc 4
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost88
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
89
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
90
Requires Co-tuning Kernels
91
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis
bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)
bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)
bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from
b x b matmul
New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ
bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s
bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12
i j k1 0 1 A
Δ = 0 1 1 B1 1 0 C
New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ
bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s
bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1
i j1 0 F
Δ = 1 0 P(i)0 1 P(j)
N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles
118x speedup
K Yelick E Georganas M Driscoll P Koanantakool E Solomonik
New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ
bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s
bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47
i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1
1 1 0 1 0 0 A2
Δ = 0 1 1 0 1 0 A3
0 0 1 1 0 1 A3A4
0 0 1 1 0 1 A5
1 0 0 1 1 0 A6
Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3
Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk
Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip
bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip
General Communication Bound
bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))
bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then
words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by
ChristTaoCarberyBennett
bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|
bull Thm (ChristTaoCarberyBennett) Given s1hellipsm
|S| le Πj |φj(S)|sj
Is this bound attainable (12)
bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q
(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases
of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate
bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP
gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T
Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj
bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel
ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework
Summary
Donrsquot Communichellip
108
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- SUMMA Communication Costs
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
- Perfect Strong Scaling ndash in Time and Energy (22)
- Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU QR hellip
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU Using similar idea for TSLU as TSQR Use reduction
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- 25D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- For more details
- Summary
-
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost88
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
89
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
90
Requires Co-tuning Kernels
91
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis
bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)
bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)
bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from
b x b matmul
New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ
bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s
bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12
i j k1 0 1 A
Δ = 0 1 1 B1 1 0 C
New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ
bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s
bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1
i j1 0 F
Δ = 1 0 P(i)0 1 P(j)
N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles
118x speedup
K Yelick E Georganas M Driscoll P Koanantakool E Solomonik
New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ
bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s
bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47
i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1
1 1 0 1 0 0 A2
Δ = 0 1 1 0 1 0 A3
0 0 1 1 0 1 A3A4
0 0 1 1 0 1 A5
1 0 0 1 1 0 A6
Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3
Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk
Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip
bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip
General Communication Bound
bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))
bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then
words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by
ChristTaoCarberyBennett
bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|
bull Thm (ChristTaoCarberyBennett) Given s1hellipsm
|S| le Πj |φj(S)|sj
Is this bound attainable (12)
bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q
(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases
of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate
bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP
gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T
Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj
bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel
ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework
Summary
Donrsquot Communichellip
108
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- SUMMA Communication Costs
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
- Perfect Strong Scaling ndash in Time and Energy (22)
- Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU QR hellip
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU Using similar idea for TSLU as TSQR Use reduction
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- 25D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- For more details
- Summary
-
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax − b||_2
• Standard GMRES:
    for i = 1 to k
      w = A · v(i−1)                … SpMV
      MGS(w, v(0), …, v(i−1))       … Modified Gram-Schmidt
      update v(i), H
    endfor
    solve LSQ problem with H
• Communication-avoiding GMRES:
    W = [v, Av, A^2 v, …, A^k v]
    [Q, R] = TSQR(W)                … "Tall Skinny QR"
    build H from R
    solve LSQ problem with H
• Oops – W from power method, precision lost!
• Sequential case: # words moved decreases by a factor of k
• Parallel case: # messages decreases by a factor of k
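The TSQR(W) step above can be sketched with a one-level reduction tree; this is an illustrative numpy sketch, not the tuned implementation the slides benchmark (the block count and the explicit assembly of Q are choices made here for clarity; a real TSQR keeps Q implicit):

```python
import numpy as np

def tsqr(W, nblocks=4):
    """Tall Skinny QR by reduction: QR each row-block independently,
    then QR the stacked R factors (one level of the reduction tree;
    deeper trees recurse on the stacked R's the same way)."""
    blocks = np.array_split(W, nblocks, axis=0)
    Qs, Rs = zip(*(np.linalg.qr(b) for b in blocks))
    Q2, R = np.linalg.qr(np.vstack(Rs))      # combine the small R's
    # implicit Q = diag(Q1,...,Qnblocks) @ Q2; assembled explicitly here
    n = W.shape[1]
    Q = np.vstack([Qi @ Q2[i*n:(i+1)*n] for i, Qi in enumerate(Qs)])
    return Q, R

rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 6))           # tall and skinny
Q, R = tsqr(W)                               # Q: 1000x6, R: 6x6 upper triangular
```

Each row-block is factored with one message worth of data, and only the small n x n R factors travel up the tree, which is where the factor-of-k message savings comes from.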
• "Monomial" basis [Ax, …, A^k x] fails to converge
• A different polynomial basis [p_1(A)x, …, p_k(A)x] does converge
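The conditioning problem behind this can be seen numerically: the unorthogonalized monomial basis degenerates toward the dominant eigendirection, while a shifted (Newton-style) basis stays far better conditioned. A small numpy sketch, where the test matrix, the evenly spaced stand-in shifts, and `krylov_basis` are illustrative assumptions rather than the tuned basis choice of an actual s-step solver:

```python
import numpy as np

def krylov_basis(A, x, k, shifts=None):
    """Unorthogonalized basis [x, p_1(A)x, ..., p_k(A)x] with
    p_j(t) = (t - shift_j) * p_{j-1}(t); shifts=None gives the
    monomial basis [x, Ax, ..., A^k x]."""
    if shifts is None:
        shifts = np.zeros(k)
    W = np.empty((len(x), k + 1))
    W[:, 0] = x
    for j in range(1, k + 1):
        W[:, j] = A @ W[:, j - 1] - shifts[j - 1] * W[:, j - 1]
    return W

rng = np.random.default_rng(0)
n, k = 200, 12
# symmetric test matrix with known spectrum inside (0, 2)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.linspace(0.05, 1.95, n)) @ Q.T
x = rng.standard_normal(n)

cond_mono = np.linalg.cond(krylov_basis(A, x, k))
# Newton-style shifts: evenly spaced points in the spectrum (a crude
# stand-in for the Ritz values a real s-step solver would use)
cond_newton = np.linalg.cond(krylov_basis(A, x, k, np.linspace(0.1, 1.9, k)))
# cond_mono is typically orders of magnitude larger than cond_newton
```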
Speed ups of GMRES on 8-core Intel Clovertown [MHDY09]

Requires co-tuning kernels
CA-BiCGStab with Residual Replacement (RR) à la Van der Vorst and Ye

Replacement its.:
  Naive:      74 (1)
  Monomial:   [7, 15, 24, 31, …, 92, 97, 103] (17)
  Newton:     [67, 98] (2)
  Chebyshev:  68 (1)
Tuning space for Krylov Methods
• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: with stencils (all implicit), all communication is for vectors

                            Indices
  Nonzero entries       Explicit (O(nnz))      Implicit (o(nnz))
  Explicit (O(nnz))     CSR and variations     Vision, climate, AMR, …
  Implicit (o(nnz))     Graph Laplacian        Stencils

• Operations
  • [x, Ax, A^2x, …, A^kx] or [x, p_1(A)x, p_2(A)x, …, p_k(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^T y, (A^T)^2 y, …, (A^T)^k y], or [y, A^T A y, (A^T A)^2 y, …, (A^T A)^k y]
  • Return all vectors or just the last one?
• Cotuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^T W or …}
  • Ditto, but throw away W
• Preconditioned versions
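The kernel at the center of this tuning space, [x, Ax, A^2x, …, A^kx], has a simple reference form. A communication-avoiding version would block A (e.g., with ghost zones for stencils) and compute all k vectors per block read; the minimal sketch below just shows the kernel's contract:

```python
import numpy as np

def matrix_powers(A, x, k):
    """Reference Akx kernel: W = [x, Ax, A^2 x, ..., A^k x] as columns.
    A communication-avoiding version computes the same W but reads each
    block of A (plus ghost data) only once for all k products."""
    W = np.empty((len(x), k + 1))
    W[:, 0] = x
    for j in range(1, k + 1):
        W[:, j] = A @ W[:, j - 1]
    return W

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
x = np.array([1.0, 1.0])
W = matrix_powers(A, x, k=2)   # columns: [1,1], [3,3], [9,9]
```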
Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"
Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)
Recall optimal sequential Matmul
• Naïve code:
    for i = 1:n
      for j = 1:n
        for k = 1:n
          C(i,j) += A(i,k) * B(k,j)
• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b−1, for j2 = 0:b−1, for k2 = 0:b−1
        i = i1+i2; j = j1+j2; k = k1+k2
        C(i,j) += A(i,k) * B(k,j)        … b x b matmul
• Thm: Picking b = M^{1/2} attains the lower bound: words_moved = Ω(n^3 / M^{1/2})
• Where does 1/2 come from?
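The blocked loop nest above, in runnable form; a minimal numpy sketch, where b is the tile width and the slides' b = M^{1/2} choice corresponds to keeping the three resident b x b tiles (about 3b^2 words) inside a fast memory of M words:

```python
import numpy as np

def blocked_matmul(A, B, b):
    """Tiled C = A @ B with b x b tiles. Choosing b ~ (M/3)^(1/2), so the
    three resident tiles (~3*b^2 words) fit in fast memory of M words,
    attains words_moved = Omega(n^3 / M^(1/2))."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                # one b x b matmul per (i1, j1, k1) tile triple
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))
C = blocked_matmul(A, B, b=4)   # matches A @ B for any tile width
```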
New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k) * B(k,j)
• Record array indices in matrix Δ:

          i  j  k
          1  0  1    A
   Δ =    0  1  1    B
          1  1  0    C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: words_moved = Ω(n^3 / M^{s−1}) = Ω(n^3 / M^{1/2})
  Attained by block sizes M^{xi}, M^{xj}, M^{xk} = M^{1/2}, M^{1/2}, M^{1/2}
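The LP on this slide is small enough to check directly; a sketch using scipy's `linprog` (which minimizes, so the objective is negated):

```python
import numpy as np
from scipy.optimize import linprog

# One row of delta per array reference; columns are loop indices i, j, k.
delta = np.array([[1, 0, 1],   # A(i,k)
                  [0, 1, 1],   # B(k,j)
                  [1, 1, 0]])  # C(i,j)

# max 1^T x  s.t.  delta @ x <= 1, x >= 0  (negate c: linprog minimizes)
res = linprog(c=[-1, -1, -1], A_ub=delta, b_ub=[1, 1, 1], bounds=(0, None))
x = res.x          # [1/2, 1/2, 1/2]
s_hbl = -res.fun   # 3/2, so words_moved = Omega(n^3 / M^(1/2))
```

Summing the three constraints gives 2(xi + xj + xk) ≤ 3, so the solver's 3/2 is provably optimal here.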
New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force(P(i), P(j))
• Record array indices in matrix Δ:

          i  j
          1  0    F
   Δ =    1  0    P(i)
          0  1    P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = s
• Thm: words_moved = Ω(n^2 / M^{s−1}) = Ω(n^2 / M^1)
  Attained by block sizes M^{xi}, M^{xj} = M^1, M^1
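The M^1-by-M^1 blocking the theorem prescribes looks like this in practice. A toy numpy sketch: the linear `P(j) − P(i)` interaction is a stand-in for a real force law, chosen so the self term vanishes automatically:

```python
import numpy as np

def blocked_forces(P, b):
    """Direct n-body with i and j tiled in blocks of b particles: the
    M^1-by-M^1 blocking of the theorem, with b playing the role of M.
    The toy force P(j) - P(i) stands in for a real force law."""
    n = len(P)
    F = np.zeros(n)
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            Pi = P[i1:i1+b]                      # one block of targets
            Pj = P[j1:j1+b]                      # one block of sources
            F[i1:i1+b] += (Pj[None, :] - Pi[:, None]).sum(axis=1)
    return F

P = np.arange(6.0)
F = blocked_forces(P, b=2)   # F(i) = sum_j (P(j) - P(i)) = 15 - 6*P(i)
```

Each pair of blocks is loaded once and all b^2 interactions between them are computed, which is how the Ω(n^2 / M) word bound is attained.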
N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik
New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

          i1 i2 i3 i4 i5 i6
           1  0  1  0  0  1    A1
           1  1  0  1  0  0    A2
   Δ =     0  1  1  0  1  0    A3
           0  0  1  1  0  1    A3, A4
           0  1  0  0  0  1    A5
           1  0  0  1  1  0    A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: words_moved = Ω(n^6 / M^{s−1}) = Ω(n^6 / M^{8/7})
  Attained by block sizes M^{2/7}, M^{3/7}, M^{1/7}, M^{2/7}, M^{3/7}, M^{4/7}
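The same LP check works for this loop nest; a scipy sketch (note the A5 row encodes the reference A5(i2,i6)):

```python
import numpy as np
from scipy.optimize import linprog

#                  i1 i2 i3 i4 i5 i6
delta = np.array([[1, 0, 1, 0, 0, 1],   # A1(i1,i3,i6)
                  [1, 1, 0, 1, 0, 0],   # A2(i1,i2,i4)
                  [0, 1, 1, 0, 1, 0],   # A3(i2,i3,i5)
                  [0, 0, 1, 1, 0, 1],   # A3, A4(i3,i4,i6)
                  [0, 1, 0, 0, 0, 1],   # A5(i2,i6)
                  [1, 0, 0, 1, 1, 0]])  # A6(i1,i4,i5)

res = linprog(c=-np.ones(6), A_ub=delta, b_ub=np.ones(6), bounds=(0, None))
s_hbl = -res.fun   # 15/7, so words_moved = Omega(n^6 / M^(8/7))
```

All six constraints are tight at the optimum, so the block sizes M^{x_j} exactly fill fast memory per array reference.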
Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k) * B(k,j)
  => for (i,j,k) in S = subset of Z^3
     Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3−i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else)
  => for (i1,i2,…,ik) in S = subset of Z^k
     Access locations indexed by "projections", e.g.
       φ_C(i1,i2,…,ik) = (i1+2*i3−i7)
       φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …)
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?
General Communication Bound
• Given S = subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
    |S| ≤ Π_j |φ_j(S)|^{s_j}
• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σ_j s_j · rank(φ_j(H))
• Thm: Given a program with array refs given by φ_j, choose s_j to minimize s_HBL = Σ_j s_j subject to the HBL-LP. Then
    words_moved = Ω(#iterations / M^{s_HBL−1})
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett
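For matmul the pieces above assemble into the familiar bound; a worked check, where the segment-counting step is the standard argument restated under the assumption that each projection image within a segment holds O(M) points:

```latex
% Matmul: S \subseteq \mathbb{Z}^3 with \phi_A(i,j,k)=(i,k),
% \phi_B(i,j,k)=(k,j), \phi_C(i,j,k)=(i,j), and s_A=s_B=s_C=1/2:
|S| \;\le\; |\phi_A(S)|^{1/2}\,|\phi_B(S)|^{1/2}\,|\phi_C(S)|^{1/2}
% Tight for the full iteration space:
n^3 \;=\; (n^2)^{1/2}\,(n^2)^{1/2}\,(n^2)^{1/2}
% In any execution segment moving M words, each image holds O(M) points,
% so a segment covers at most O(M^{3/2}) iterations, and
\mathrm{words\_moved} \;=\; \Omega\!\left(\frac{n^3}{M^{3/2}}\cdot M\right)
\;=\; \Omega\!\left(\frac{n^3}{M^{1/2}}\right)
```

Here s_HBL = 3/2, so the exponent s_HBL − 1 = 1/2 matches the earlier matmul slides.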
Is this bound attainable? (1/2)
• But first: can we write it down?
  – Thm (bad news): reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest (e.g., all φ_j = subset of indices)
  – Thm (good news): easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get s_HBL too large)
Is this bound attainable? (2/2)
• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φ_j = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:       minimize 1^T s  s.t.  s^T Δ ≥ 1^T
    Dual-HBL-LP:  maximize 1^T x  s.t.  Δ x ≤ 1
  Then for a sequential algorithm, tile index i_j by M^{x_j}
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices
Ongoing Work
• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers
For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  – Free supercomputer accounts to do homework
Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers…)
91
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis
bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)
bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)
bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from
b x b matmul
New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k) * B(k,j)
• Record which loop indices each array reference uses in a matrix Δ:

           i  j  k
         [ 1  0  1 ]  A
    Δ  = [ 0  1  1 ]  B
         [ 1  1  0 ]  C

• Solve the LP for x = [xi, xj, xk]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2)), attained by block sizes (M^xi, M^xj, M^xk) = (M^(1/2), M^(1/2), M^(1/2))
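For an LP this small the claimed solution can be verified directly with exact rational arithmetic: x = [1/2, 1/2, 1/2] is primal feasible, and the same vector is a feasible dual certificate with equal objective, so by LP duality it is optimal. A sketch:

```python
from fractions import Fraction

# Rows of Delta: which of the loop indices (i, j, k) each reference uses.
Delta = [(1, 0, 1),   # A(i,k)
         (0, 1, 1),   # B(k,j)
         (1, 1, 0)]   # C(i,j)

x = [Fraction(1, 2)] * 3
# Primal feasible for: max 1^T x  s.t.  Delta x <= 1 (every row is tight).
assert all(sum(r * xi for r, xi in zip(row, x)) == 1 for row in Delta)

s = [Fraction(1, 2)] * 3
# Dual feasible for: min 1^T s  s.t.  s^T Delta >= 1^T.
assert all(sum(si * Delta[m][c] for m, si in enumerate(s)) >= 1 for c in range(3))

# Equal objectives => both optimal; s_HBL = 3/2, words_moved = Ω(n^3 / M^(1/2)).
assert sum(x) == sum(s) == Fraction(3, 2)
```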
New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force(P(i), P(j))
• Record which loop indices each array reference uses in a matrix Δ:

           i  j
         [ 1  0 ]  F
    Δ  = [ 1  0 ]  P(i)
         [ 0  1 ]  P(j)

• Solve the LP for x = [xi, xj]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = s
• Thm: words_moved = Ω(n^2 / M^(s-1)) = Ω(n^2 / M^1), attained by block sizes (M^xi, M^xj) = (M^1, M^1)
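The schedule the theorem suggests — tile both loops so one block of P(i) and one block of P(j), each of ~M words, are reused for a full block of M^2 interactions — can be sketched as follows (`force` here is a toy stand-in for any pairwise interaction):

```python
def force(p, q):
    """Toy pairwise interaction (stand-in for the real force law)."""
    return q - p

def forces_tiled(P, M):
    """All-pairs forces computed one M-by-M tile of (i, j) pairs at a time.

    Each tile reads ~2M particle words and performs ~M^2 interactions,
    matching the lower bound words_moved = Ω(n^2 / M)."""
    n = len(P)
    F = [0.0] * n
    for i1 in range(0, n, M):
        for j1 in range(0, n, M):
            for i in range(i1, min(i1 + M, n)):
                for j in range(j1, min(j1 + M, n)):
                    if i != j:
                        F[i] += force(P[i], P[j])
    return F

P = [float(v) for v in range(10)]
naive = [sum(force(P[i], P[j]) for j in range(10) if j != i) for i in range(10)]
# For each fixed i the tiles visit j in the same ascending order as the
# naive loop, so the floating-point sums agree exactly.
assert forces_tiled(P, 4) == naive
```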
N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles
11.8x speedup
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik
New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n,
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record which loop indices each array reference uses in a matrix Δ:

           i1 i2 i3 i4 i5 i6
         [ 1  0  1  0  0  1 ]  A1
         [ 1  1  0  1  0  0 ]  A2
    Δ  = [ 0  1  1  0  1  0 ]  A3
         [ 0  0  1  1  0  1 ]  A3, A4
         [ 0  1  0  0  0  1 ]  A5
         [ 1  0  0  1  1  0 ]  A6

• Solve the LP for x = [x1, …, x6]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: words_moved = Ω(n^6 / M^(s-1)) = Ω(n^6 / M^(8/7)), attained by block sizes (M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7))
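The claimed optimum can be checked exactly against Δ (A5(i2,i6) contributes the row [0 1 0 0 0 1], since it touches only i2 and i6): every constraint is tight at x, and the objective is 15/7.

```python
from fractions import Fraction

# Rows of Delta: which of i1..i6 each array reference uses.
Delta = [(1, 0, 1, 0, 0, 1),   # A1(i1,i3,i6)
         (1, 1, 0, 1, 0, 0),   # A2(i1,i2,i4)
         (0, 1, 1, 0, 1, 0),   # A3(i2,i3,i5)
         (0, 0, 1, 1, 0, 1),   # A3(i3,i4,i6) and A4(i3,i4,i6)
         (0, 1, 0, 0, 0, 1),   # A5(i2,i6)
         (1, 0, 0, 1, 1, 0)]   # A6(i1,i4,i5)

x = [Fraction(v, 7) for v in (2, 3, 1, 2, 3, 4)]

# Every row of Delta x <= 1 is tight at the claimed solution ...
assert all(sum(r * xi for r, xi in zip(row, x)) == 1 for row in Delta)
# ... and the objective 1^T x is 15/7, so words_moved = Ω(n^6 / M^(8/7)).
assert sum(x) == Fraction(15, 7)
```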
Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n,
      C(i,j) += A(i,k) * B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4,
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else)
      …
  ⇒ for (i1, i2, …, ik) in S = subset of Z^k, access locations indexed by "projections", e.g.
      φ_C(i1, i2, …, ik) = (i1+2*i3-i7)
      φ_A(i1, i2, …, ik) = (i2+3*i4, i1, i2, i1+i2, …)
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?
General Communication Bound
• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σ_j s_j · rank(φ_j(H))
• Thm: Given a program with array references given by the φ_j, choose the s_j to minimize s_HBL = Σ_j s_j subject to the HBL-LP. Then
    words_moved = Ω(#iterations / M^(s_HBL - 1))
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S = subset of Z^k and group homomorphisms φ1, φ2, …, φm, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm satisfying the HBL-LP,
    |S| ≤ Π_j |φ_j(S)|^(s_j)
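For matmul this theorem specializes to the discrete Loomis-Whitney inequality, |S|^2 ≤ |φ_C(S)| · |φ_A(S)| · |φ_B(S)|, which is easy to sanity-check on random finite subsets of Z^3:

```python
import random

random.seed(1)
for _ in range(100):
    # A random finite subset S of Z^3 (here drawn from an 8x8x8 cube).
    S = {(random.randrange(8), random.randrange(8), random.randrange(8))
         for _ in range(random.randrange(1, 200))}
    ij = {(i, j) for i, j, k in S}   # phi_C(S): projection onto (i, j)
    ik = {(i, k) for i, j, k in S}   # phi_A(S): projection onto (i, k)
    kj = {(k, j) for i, j, k in S}   # phi_B(S): projection onto (k, j)
    # |S| <= (|phi_C(S)| * |phi_A(S)| * |phi_B(S)|)^(1/2); squared to stay in integers.
    assert len(S) ** 2 <= len(ij) * len(ik) * len(kj)
```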
Is this bound attainable? (1/2)
• But first: can we write the LP down?
  – Thm (bad news): in general, reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest (e.g., when all φ_j select a subset of the indices)
  – Thm (good news): easy to approximate:
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of the constraints (may get s_HBL too large)
Is this bound attainable? (2/2)
• Depends on the loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φ_j select a subset of the indices, the dual of the HBL-LP gives the optimal tile sizes:
    HBL-LP:      minimize 1^T s  s.t.  s^T Δ ≥ 1^T
    Dual-HBL-LP: maximize 1^T x  s.t.  Δ x ≤ 1
  Then, for a sequential algorithm, tile index i_j by M^(x_j)
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of the indices
Ongoing Work
• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers
For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
• www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  – Free supercomputer accounts to do homework
Summary
Don't Communic…
Time to redesign all linear algebra, n-body, … algorithms and software (and compilers…)
With Residual Replacement (RR) à la Van der Vorst and Ye – iterations at which a residual replacement occurred, per basis (total count in parentheses):
• Naive: 74 (1)
• Monomial: [7, 15, 24, 31, …, 92, 97, 103] (17)
• Newton: [67, 98] (2)
• Chebyshev: 68 (1)
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis
bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)
bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)
bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from
b x b matmul
New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ
bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s
bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12
i j k1 0 1 A
Δ = 0 1 1 B1 1 0 C
New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ
bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s
bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1
i j1 0 F
Δ = 1 0 P(i)0 1 P(j)
N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles
118x speedup
K Yelick E Georganas M Driscoll P Koanantakool E Solomonik
New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ
bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s
bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47
i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1
1 1 0 1 0 0 A2
Δ = 0 1 1 0 1 0 A3
0 0 1 1 0 1 A3A4
0 0 1 1 0 1 A5
1 0 0 1 1 0 A6
Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3
Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk
Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip
bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip
General Communication Bound
bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))
bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then
words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by
ChristTaoCarberyBennett
bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|
bull Thm (ChristTaoCarberyBennett) Given s1hellipsm
|S| le Πj |φj(S)|sj
Is this bound attainable (12)
bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q
(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases
of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate
bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP
gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T
Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj
bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel
ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework
Summary
Donrsquot Communichellip
108
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- SUMMA Communication Costs
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
- Perfect Strong Scaling ndash in Time and Energy (22)
- Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU QR hellip
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU Using similar idea for TSLU as TSQR Use reduction
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- 25D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- For more details
- Summary
-
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis
bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)
bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)
bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from
b x b matmul
New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ
bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s
bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12
i j k1 0 1 A
Δ = 0 1 1 B1 1 0 C
New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ
bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s
bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1
i j1 0 F
Δ = 1 0 P(i)0 1 P(j)
N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles
118x speedup
K Yelick E Georganas M Driscoll P Koanantakool E Solomonik
New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ
bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s
bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47
i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1
1 1 0 1 0 0 A2
Δ = 0 1 1 0 1 0 A3
0 0 1 1 0 1 A3A4
0 0 1 1 0 1 A5
1 0 0 1 1 0 A6
Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3
Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk
Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip
bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip
General Communication Bound
bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))
bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then
words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by
ChristTaoCarberyBennett
bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|
bull Thm (ChristTaoCarberyBennett) Given s1hellipsm
|S| le Πj |φj(S)|sj
Is this bound attainable (12)
bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q
(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases
of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate
bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP
gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T
Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj
bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel
ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework
Summary
Donrsquot Communichellip
108
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- SUMMA Communication Costs
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
- Perfect Strong Scaling ndash in Time and Energy (22)
- Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU QR hellip
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU Using similar idea for TSLU as TSQR Use reduction
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- 25D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- For more details
- Summary
-
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis
bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing
arrays (eg n-body)
Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)
bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)
bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from
b x b matmul
New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ
bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s
bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12
i j k1 0 1 A
Δ = 0 1 1 B1 1 0 C
New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ
bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s
bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1
i j1 0 F
Δ = 1 0 P(i)0 1 P(j)
N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles
118x speedup
K Yelick E Georganas M Driscoll P Koanantakool E Solomonik
New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ
bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s
bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47
i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1
1 1 0 1 0 0 A2
Δ = 0 1 1 0 1 0 A3
0 0 1 1 0 1 A3A4
0 0 1 1 0 1 A5
1 0 0 1 1 0 A6
Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3
Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk
Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip
bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip
General Communication Bound
bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))
bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then
words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by
ChristTaoCarberyBennett
bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|
bull Thm (ChristTaoCarberyBennett) Given s1hellipsm
|S| le Πj |φj(S)|sj
Is this bound attainable (12)
bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q
(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases
of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate
bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP
gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T
Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj
bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel
ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework
Summary
Donrsquot Communichellip
108
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (1/2)
- Why avoid communication (2/3)
- Why Minimize Communication (2/2)
- Why Minimize Communication (2/2) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all "direct" linear algebra
- Lower bound for all "direct" linear algebra (2)
- Lower bound for all "direct" linear algebra (3)
- Can we attain these lower bounds?
- Naïve Matrix Multiply
- Naïve Matrix Multiply (2)
- Naïve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound?
- Recursive Matrix Multiplication (RMM) (1/2)
- Recursive Matrix Multiplication (RMM) (2/2)
- How hard is hand-tuning matmul, anyway?
- How hard is hand-tuning matmul, anyway? (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA – n x n matmul on P^1/2 x P^1/2 grid
- SUMMA – n x n matmul on P^1/2 x P^1/2 grid (2)
- SUMMA Communication Costs
- Can we do better?
- 2.5D Matrix Multiplication
- 2.5D Matrix Multiplication (2)
- 2.5D Matmul on BG/P, 16K nodes / 64K cores
- 2.5D Matmul on BG/P, 16K nodes / 64K cores (2)
- Perfect Strong Scaling – in Time and Energy (1/2)
- Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n=64K
- Perfect Strong Scaling – in Time and Energy (2/2)
- Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU, QR, …
- TSQR: QR of a Tall, Skinny matrix
- TSQR: QR of a Tall, Skinny matrix (2)
- TSQR: An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU: Using similar idea for TSLU as TSQR: Use reduction
- Exascale Machine Parameters (Source: DOE Exascale Workshop)
- Exascale predicted speedups for Gaussian Elimination: 2D CA
- 2.5D vs 2D LU: With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better?
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speedups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BG/P (Intrepid), 8K cores, 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable? (1/2)
- Is this bound attainable? (2/2)
- Ongoing Work
- For more details
- Summary
Outline
• "Direct" Linear Algebra
– Lower bounds on communication
– New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)
Recall optimal sequential Matmul
• Naïve code:
  for i = 1:n, for j = 1:n, for k = 1:n
    C(i,j) += A(i,k) * B(k,j)
• "Blocked" code:
  for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
    for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
      i = i1+i2, j = j1+j2, k = k1+k2
      C(i,j) += A(i,k) * B(k,j)   … b x b matmul
• Thm: Picking b = M^1/2 attains the lower bound words_moved = Ω(n^3/M^1/2)
• Where does the 1/2 come from?
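The two loop nests above translate line for line into runnable code. This sketch (plain Python lists, with toy values n = 4 and b = 2 standing in for b ≈ M^1/2) checks that the blocked version computes the same C as the naïve one:

```python
import random

n, b = 4, 2  # n divisible by b; in practice b is chosen near M**0.5
A = [[random.random() for _ in range(n)] for _ in range(n)]
B = [[random.random() for _ in range(n)] for _ in range(n)]

# Naive: for i, for j, for k: C(i,j) += A(i,k)*B(k,j)
C1 = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        for k in range(n):
            C1[i][j] += A[i][k] * B[k][j]

# Blocked: outer loops step by b; the inner three loops are a b-by-b matmul
C2 = [[0.0] * n for _ in range(n)]
for i1 in range(0, n, b):
    for j1 in range(0, n, b):
        for k1 in range(0, n, b):
            for i2 in range(b):
                for j2 in range(b):
                    for k2 in range(b):
                        i, j, k = i1 + i2, j1 + j2, k1 + k2
                        C2[i][j] += A[i][k] * B[k][j]

same = all(abs(C1[i][j] - C2[i][j]) < 1e-12 for i in range(n) for j in range(n))
```

The blocking changes only the order in which the iteration space is traversed, which is exactly why it can reduce data movement without changing the result.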
New Thm applied to Matmul
• for i = 1:n, for j = 1:n, for k = 1:n
  C(i,j) += A(i,k) * B(k,j)
• Record array indices in matrix Δ:
        i  j  k
        1  0  1    A
  Δ =   0  1  1    B
        1  1  0    C
• Solve LP for x = [xi, xj, xk]^T: max 1^T·x s.t. Δ·x ≤ 1
– Result: x = [1/2, 1/2, 1/2]^T, 1^T·x = 3/2 = sHBL
• Thm: words_moved = Ω(n^3/M^(sHBL−1)) = Ω(n^3/M^1/2), attained by block sizes M^xi, M^xj, M^xk = M^1/2, M^1/2, M^1/2
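This LP is small enough to solve by brute force. The sketch below (an illustration, not from the slides) enumerates x over the grid {0, 1/2, 1}^3 — for this particular Δ the optimum is attained at such a half-integer point — and confirms the maximizer x = [1/2, 1/2, 1/2] with objective 3/2:

```python
from fractions import Fraction as F

# Delta for matmul: rows A(i,k), B(k,j), C(i,j); columns i, j, k.
Delta = [[1, 0, 1],
         [0, 1, 1],
         [1, 1, 0]]

# Brute-force the LP: max 1^T x  s.t.  Delta x <= 1, over x in {0, 1/2, 1}^3.
grid = [F(k, 2) for k in range(3)]  # 0, 1/2, 1
best, argbest = F(0), None
for xi in grid:
    for xj in grid:
        for xk in grid:
            x = (xi, xj, xk)
            feasible = all(sum(Delta[r][c] * x[c] for c in range(3)) <= 1
                           for r in range(3))
            if feasible and sum(x) > best:
                best, argbest = sum(x), x
```

A real LP solver (e.g., `scipy.optimize.linprog`) would of course be used for larger Δ.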
New Thm applied to Direct N-Body
• for i = 1:n, for j = 1:n
  F(i) += force(P(i), P(j))
• Record array indices in matrix Δ:
        i  j
        1  0    F
  Δ =   1  0    P(i)
        0  1    P(j)
• Solve LP for x = [xi, xj]^T: max 1^T·x s.t. Δ·x ≤ 1
– Result: x = [1, 1], 1^T·x = 2 = sHBL
• Thm: words_moved = Ω(n^2/M^(sHBL−1)) = Ω(n^2/M^1), attained by block sizes M^xi, M^xj = M^1, M^1
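The corresponding blocked implementation is the same tiling idea applied to the 2-D (i, j) iteration space: sweep b-by-b tiles of particle pairs. A minimal sketch — the names P, F, and force mirror the slide, but the toy data and the stand-in interaction are invented for illustration:

```python
n, b = 8, 4
P = [float(i * i % 7) for i in range(n)]  # toy particle data

def force(pi, pj):
    # Stand-in pairwise interaction; any f(P(i), P(j)) blocks the same way.
    return pi - pj

# Naive: F(i) += force(P(i), P(j)) over all pairs (i, j)
F1 = [0.0] * n
for i in range(n):
    for j in range(n):
        F1[i] += force(P[i], P[j])

# Blocked: b-by-b tiles of the (i, j) iteration space
F2 = [0.0] * n
for i1 in range(0, n, b):
    for j1 in range(0, n, b):
        for i2 in range(b):
            for j2 in range(b):
                F2[i1 + i2] += force(P[i1 + i2], P[j1 + j2])

same = all(abs(F1[i] - F2[i]) < 1e-12 for i in range(n))
```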
N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles
11.8x speedup
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik
New Thm applied to Random Code
• for i1 = 1:n, for i2 = 1:n, …, for i6 = 1:n
  A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
  A5(i2,i6) += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:
        i1 i2 i3 i4 i5 i6
        1  0  1  0  0  1    A1
        1  1  0  1  0  0    A2
  Δ =   0  1  1  0  1  0    A3
        0  0  1  1  0  1    A3, A4
        0  1  0  0  0  1    A5
        1  0  0  1  1  0    A6
• Solve LP for x = [x1,…,x6]^T: max 1^T·x s.t. Δ·x ≤ 1
– Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T·x = 15/7 = sHBL
• Thm: words_moved = Ω(n^6/M^(sHBL−1)) = Ω(n^6/M^(8/7)), attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)
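Reading Δ directly off the subscripts in the loop body, the claimed solution can be verified with exact rational arithmetic: all six constraints hold with equality, and the objective is 15/7. A sketch (not from the slides):

```python
from fractions import Fraction as F

# One row per array reference, read off the subscripts in the loop body;
# columns are i1..i6.
Delta = [
    [1, 0, 1, 0, 0, 1],  # A1(i1,i3,i6)
    [1, 1, 0, 1, 0, 0],  # A2(i1,i2,i4)
    [0, 1, 1, 0, 1, 0],  # A3(i2,i3,i5)
    [0, 0, 1, 1, 0, 1],  # A4(i3,i4,i6), also A3(i3,i4,i6) in func2
    [0, 1, 0, 0, 0, 1],  # A5(i2,i6)
    [1, 0, 0, 1, 1, 0],  # A6(i1,i4,i5)
]
x = [F(2, 7), F(3, 7), F(1, 7), F(2, 7), F(3, 7), F(4, 7)]

# Each constraint row of Delta * x; equality to 1 means the matching block
# sizes M^(2/7), ..., M^(4/7) exactly fill fast memory.
rows = [sum(r[c] * x[c] for c in range(6)) for r in Delta]
s_HBL = sum(x)
```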
Approach to generalizing lower bounds
• Matmul:
  for i = 1:n, for j = 1:n, for k = 1:n
    C(i,j) += A(i,k) * B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
  for i1 = 1:n, for i2 = i1:m, …, for ik = i3:i4
    C(i1+2i3−i7) = func(A(i2+3i4, i1, i2, i1+i2, …), B(pnt(3i4)), …)
    D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k, access locations indexed by "projections", e.g.
    φC(i1,i2,…,ik) = (i1+2i3−i7)
    φA(i1,i2,…,ik) = (i2+3i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φC(S), φA(S), …?
General Communication Bound
• Def: Hölder–Brascamp–Lieb Linear Program (HBL-LP) for s1,…,sm:
  for all subgroups H < Z^k: rank(H) ≤ Σj sj · rank(φj(H))
• Thm: Given a program with array refs given by the φj, choose sj to minimize sHBL = Σj sj subject to the HBL-LP. Then
  words_moved = Ω(#iterations / M^(sHBL−1))
– Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S = subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
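As a worked instance of the inequality: for matmul, S = {1,…,n}^3, the projections are φA(i,j,k) = (i,k), φB(i,j,k) = (k,j), φC(i,j,k) = (i,j), and s = (1/2, 1/2, 1/2) is feasible for the HBL-LP, so

```latex
|S| \;\le\; |\varphi_A(S)|^{1/2}\,|\varphi_B(S)|^{1/2}\,|\varphi_C(S)|^{1/2}
\;=\; (n^2)^{1/2}\,(n^2)^{1/2}\,(n^2)^{1/2} \;=\; n^3 .
```

Here the bound holds with equality, which is why the resulting lower bound words_moved = Ω(n^3/M^(1/2)) is attainable for matmul.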
bull Thm (ChristTaoCarberyBennett) Given s1hellipsm
|S| le Πj |φj(S)|sj
Is this bound attainable (12)
bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q
(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases
of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate
bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP
gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T
Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj
bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel
ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework
Summary
Donrsquot Communichellip
108
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- SUMMA Communication Costs
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
- Perfect Strong Scaling ndash in Time and Energy (22)
- Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU QR hellip
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU Using similar idea for TSLU as TSQR Use reduction
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- 25D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- For more details
- Summary
-
Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)
bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)
bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from
b x b matmul
New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ
bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s
bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12
i j k1 0 1 A
Δ = 0 1 1 B1 1 0 C
New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ
bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s
bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1
i j1 0 F
Δ = 1 0 P(i)0 1 P(j)
N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles
118x speedup
K Yelick E Georganas M Driscoll P Koanantakool E Solomonik
New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ
bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s
bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47
i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1
1 1 0 1 0 0 A2
Δ = 0 1 1 0 1 0 A3
0 0 1 1 0 1 A3A4
0 0 1 1 0 1 A5
1 0 0 1 1 0 A6
Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3
Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk
Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip
bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip
General Communication Bound
bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))
bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then
words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by
ChristTaoCarberyBennett
bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|
bull Thm (ChristTaoCarberyBennett) Given s1hellipsm
|S| le Πj |φj(S)|sj
Is this bound attainable (12)
bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q
(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases
of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate
bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP
gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T
Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj
bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel
ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework
Summary
Donrsquot Communichellip
108
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- SUMMA Communication Costs
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
- Perfect Strong Scaling ndash in Time and Energy (22)
- Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU QR hellip
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU Using similar idea for TSLU as TSQR Use reduction
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- 25D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- For more details
- Summary
-
New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ
bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s
bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12
i j k1 0 1 A
Δ = 0 1 1 B1 1 0 C
New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ
bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s
bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1
i j1 0 F
Δ = 1 0 P(i)0 1 P(j)
N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles
118x speedup
K Yelick E Georganas M Driscoll P Koanantakool E Solomonik
New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ
bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s
bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47
i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1
1 1 0 1 0 0 A2
Δ = 0 1 1 0 1 0 A3
0 0 1 1 0 1 A3A4
0 0 1 1 0 1 A5
1 0 0 1 1 0 A6
Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3
Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk
Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip
bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip
General Communication Bound
bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))
bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then
words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by
ChristTaoCarberyBennett
bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|
bull Thm (ChristTaoCarberyBennett) Given s1hellipsm
|S| le Πj |φj(S)|sj
Is this bound attainable (12)
bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q
(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases
of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate
bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP
gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T
Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj
bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel
ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework
Summary
Donrsquot Communichellip
108
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- SUMMA Communication Costs
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
- Perfect Strong Scaling ndash in Time and Energy (22)
- Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU QR hellip
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU Using similar idea for TSLU as TSQR Use reduction
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- 25D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- For more details
- Summary
-
New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ
bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s
bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1
i j1 0 F
Δ = 1 0 P(i)0 1 P(j)
N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles
118x speedup
K Yelick E Georganas M Driscoll P Koanantakool E Solomonik
New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ
bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s
bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47
i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1
1 1 0 1 0 0 A2
Δ = 0 1 1 0 1 0 A3
0 0 1 1 0 1 A3A4
0 0 1 1 0 1 A5
1 0 0 1 1 0 A6
Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3
Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk
Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip
bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip
General Communication Bound
bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))
bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then
words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by
ChristTaoCarberyBennett
bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|
bull Thm (ChristTaoCarberyBennett) Given s1hellipsm
|S| le Πj |φj(S)|sj
Is this bound attainable (12)
bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q
(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases
of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate
bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP
gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T
Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj
bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel
ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework
Summary
Donrsquot Communichellip
108
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication? (1/2)
- Why avoid communication? (2/3)
- Why Minimize Communication? (2/2)
- Why Minimize Communication? (2/2) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all "direct" linear algebra
- Lower bound for all "direct" linear algebra (2)
- Lower bound for all "direct" linear algebra (3)
- Can we attain these lower bounds?
- Naïve Matrix Multiply
- Naïve Matrix Multiply (2)
- Naïve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound?
- Recursive Matrix Multiplication (RMM) (1/2)
- Recursive Matrix Multiplication (RMM) (2/2)
- How hard is hand-tuning matmul, anyway?
- How hard is hand-tuning matmul, anyway? (2)
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
- SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid (2)
- SUMMA Communication Costs
- Can we do better?
- 2.5D Matrix Multiplication
- 2.5D Matrix Multiplication (2)
- 2.5D Matmul on BG/P, 16K nodes / 64K cores
- 2.5D Matmul on BG/P, 16K nodes / 64K cores (2)
- Perfect Strong Scaling – in Time and Energy (1/2)
- Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n=64K
- Perfect Strong Scaling – in Time and Energy (2/2)
- Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimi
- Communication in sequential One-sided Factorizations (LU, QR, …
- TSQR: QR of a Tall, Skinny matrix
- TSQR: QR of a Tall, Skinny matrix (2)
- TSQR: An Architecture-Dependent Algorithm
- TSQR Performance Results
- Back to LU: Using similar idea for TSLU as TSQR: Use reduction
- Exascale Machine Parameters (Source: DOE Exascale Workshop)
- Exascale predicted speedups for Gaussian Elimination: 2D CA
- 2.5D vs 2D LU With and Without Pivoting
- Summary of dense sequential algorithms attaining communication
- Summary of dense parallel algorithms attaining communication l
- Summary of dense parallel algorithms attaining communication l (2)
- Summary of dense parallel algorithms attaining communication l (3)
- Summary of dense parallel algorithms attaining communication l (4)
- Can we do even better?
- Symmetric Band Reduction
- Slide 56
- Slide 57
- Slide 58
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- Slide 64
- Slide 65
- Slide 66
- Conventional vs CA-SBR
- Speedups of Sym. Band Reduction vs DSBTRD
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 71
- Summary of Direct Linear Algebra
- Outline (3)
- Avoiding Communication in Iterative Linear Algebra
- Slide 75
- Slide 76
- Slide 77
- Slide 78
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Minimizing Communication of GMRES to solve Ax=b
- Slide 89
- Speedups of GMRES on 8-core Intel Clovertown
- Slide 91
- Slide 92
- Slide 93
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- Outline (4)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BG/P (Intrepid), 8K cores, 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable? (1/2)
- Is this bound attainable? (2/2)
- Ongoing Work
- For more details
- Summary
Is this bound attainable? (2/2)
• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
  – HBL-LP: minimize 1^T s s.t. s^T Δ ≥ 1^T
  – Dual-HBL-LP: maximize 1^T x s.t. Δ x ≤ 1
  – Then for a sequential algorithm, tile loop i_j by M^(x_j)
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices
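To make the matmul instance concrete: for C(i,j) += A(i,k)·B(k,j), the matrix Δ has one row per array and one column per loop index, with a 1 where the array's subscript uses that index. The sketch below (an illustration of the slide's claim, not code from the tutorial) checks that x = [1/2, 1/2, 1/2] is feasible and optimal for the dual LP Δx ≤ 1, and that tiling each loop by M^(1/2) gives communication scaling as n³/√M.

```python
import math
from itertools import product

# Subscript matrix for C(i,j) += A(i,k) * B(k,j):
# rows = arrays (C, A, B), columns = loop indices (i, j, k).
DELTA = [
    [1, 1, 0],  # C(i, j)
    [1, 0, 1],  # A(i, k)
    [0, 1, 1],  # B(k, j)
]

def feasible(x, delta=DELTA):
    """Dual-HBL-LP feasibility: Delta @ x <= 1, componentwise."""
    return all(sum(d * xj for d, xj in zip(row, x)) <= 1 + 1e-12
               for row in delta)

# The slide's solution x = [1/2, 1/2, 1/2] is feasible, and a coarse
# grid search finds no feasible point with a larger objective 1^T x
# (summing the three constraints gives 2 * 1^T x <= 3, so 3/2 is optimal).
x_star = [0.5, 0.5, 0.5]
assert feasible(x_star)
grid = [i / 10 for i in range(11)]
best = max(sum(x) for x in product(grid, repeat=3) if feasible(x))
assert abs(best - sum(x_star)) < 1e-9      # optimum 1^T x = 3/2

# Tiling loop i_j by M^(x_j) = sqrt(M): three b-by-b tiles of
# edge b = sqrt(M/3) fit in fast memory, and traffic is O(n^3 / sqrt(M)).
def words_moved(n, M):
    b = int(math.sqrt(M / 3))              # tile edge: 3 tiles fit in M words
    nb = math.ceil(n / b)                  # tiles per dimension
    return 3 * nb**3 * b**2                # 3 tiles moved per block-multiply

ratio = words_moved(4096, 1 << 14) / words_moved(4096, 1 << 16)
print(round(ratio, 2))                     # ~2: 4x more memory, ~sqrt(4) less traffic
```

The grid search is only a sanity check; the argument in the comment (summing the rows of Δx ≤ 1) is what pins the optimum at 3/2.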
Ongoing Work
• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers
For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  – Free supercomputer accounts to do homework
Summary
Don't Communic…
108
Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)