
Page 1: Oded Schwartz

How to Compute and Prove

Lower and Upper Bounds on the

Communication Costs of Your Algorithm

Part II: Geometric embedding

Oded Schwartz

CS294, Lecture #3, Fall 2011. Communication-Avoiding Algorithms

www.cs.berkeley.edu/~odedsc/CS294

Based on:

D. Irony, S. Toledo, and A. Tiskin:

Communication lower bounds for distributed-memory matrix multiplication.

G. Ballard, J. Demmel, O. Holtz, and O. Schwartz:

Minimizing communication in linear algebra.

Page 2: Oded Schwartz

2

Last time: the models

Two kinds of costs:
• Arithmetic (FLOPs)
• Communication: moving data between
  • levels of a memory hierarchy (sequential case)
  • processors over a network (parallel case)

[Figure: the two models: a parallel machine (several CPU+RAM nodes connected by a network); a sequential machine (CPU, cache, RAM); and a memory hierarchy M1, M2, M3, …, Mk.]

Page 3: Oded Schwartz

3

Last time: Communication Lower Bounds

Approaches:

1. Reduction [Ballard, Demmel, Holtz, S. 2009]

2. Geometric Embedding [Irony, Toledo, Tiskin 04],

[Ballard, Demmel, Holtz, S. 2011a]

3. Graph Analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]

Proving that your algorithm/implementation is as good as it gets.

Page 4: Oded Schwartz

4

Last time: Lower bounds for matrix multiplication

Bandwidth:
• [Hong & Kung 81] Sequential: BW = Ω(n³ / M^{1/2})
• [Irony, Toledo, Tiskin 04] Sequential and parallel: BW = Ω(n³ / (P·M^{1/2}))

Latency: divide by M, i.e., Ω(n³ / M^{3/2}) and Ω(n³ / (P·M^{3/2})).

Page 5: Oded Schwartz

5

Last time: Reduction (1st approach) [Ballard, Demmel, Holtz, S. 2009a]

Thm: Cholesky and LU decompositions are (communication-wise) as hard as matrix multiplication.

Proof: By a reduction (from matrix multiplication) that preserves communication bandwidth, latency, and arithmetic.

Cor: Any classical O(n³) algorithm for Cholesky and LU decomposition requires:

Bandwidth: Ω(n³ / M^{1/2})
Latency: Ω(n³ / M^{3/2})

(a similar corollary holds for the parallel model).
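As a reminder of how such a reduction works (a sketch recalled from the cited paper, not spelled out on this slide), matrix multiplication can be embedded into a single LU factorization via a block identity:

    \begin{pmatrix} I & 0 & -B \\ A & I & 0 \\ 0 & 0 & I \end{pmatrix}
    =
    \begin{pmatrix} I & 0 & 0 \\ A & I & 0 \\ 0 & 0 & I \end{pmatrix}
    \begin{pmatrix} I & 0 & -B \\ 0 & I & A\,B \\ 0 & 0 & I \end{pmatrix}

Any LU routine applied to the left-hand matrix therefore produces A·B inside its U factor, so a communication lower bound for matrix multiplication carries over to LU; a related construction handles Cholesky.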

Page 6: Oded Schwartz

6

Today: Communication Lower Bounds

Approaches:

1. Reduction [Ballard, Demmel, Holtz, S. 2009]

2. Geometric Embedding [Irony, Toledo, Tiskin 04],

[Ballard, Demmel, Holtz, S. 2011a]

3. Graph Analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]

Proving that your algorithm/implementation is as good as it gets.

Page 7: Oded Schwartz

7

Lower bounds for matrix multiplication using geometric embedding

[Hong & Kung 81] Sequential: BW = Ω(n³ / M^{1/2})

[Irony, Toledo, Tiskin 04] Sequential and parallel: BW = Ω(n³ / (P·M^{1/2}))

Now: prove both, using the geometric embedding approach of [Irony, Toledo, Tiskin 04].

Page 8: Oded Schwartz

8

Geometric Embedding (2nd approach) [Irony,Toledo,Tiskin 04], based on [Loomis & Whitney 49]

Matrix multiplication form:

∀ (i,j) ∈ [n] × [n]:  C(i,j) = Σ_k A(i,k) · B(k,j)

Thm: If an algorithm agrees with this form (regardless of the order of computation) then

BW = Ω(n³ / M^{1/2})

BW = Ω(n³ / (P·M^{1/2})) in the P-parallel model.
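To make the form concrete, here is a minimal Python sketch (my illustration, not part of the slides) of a classical O(n³) multiplication; any permutation of the three loops performs the same set of multiplications A(i,k)·B(k,j), which is all the lower bound argument uses:

    import numpy as np

    def classical_matmul(A, B):
        # Classical algorithm: C(i,j) = sum_k A(i,k) * B(k,j).
        n = A.shape[0]
        C = np.zeros((n, n))
        # The (i, j, k) loop order is arbitrary: the lower bound holds for
        # every ordering of these n^3 multiply-add operations.
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    C[i, j] += A[i, k] * B[k, j]
        return C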

Page 9: Oded Schwartz

9

[Figure: a run partitioned into segments S1, S2, S3, … of reads, writes, and FLOPs along the time axis; example of a partition with M = 3.]

For a given run (algorithm, machine, input):
1. Partition the computation into segments of M reads/writes each.
2. Any segment S has at most 3M inputs/outputs.
3. Show that the number of multiplications in S is at most k.
4. The total communication is BW = (BW of one segment) · (#segments) ≥ M · #mults / k.

Geometric Embedding (2nd approach) [Irony,Toledo,Tiskin 04], based on [Loomis & Whitney 49]

Page 10: Oded Schwartz

10

Volume of a box: V = x·y·z = (xz · zy · yx)^{1/2}

Thm (Loomis & Whitney, 1949): Volume of a 3D set V ≤ (area(A shadow) · area(B shadow) · area(C shadow))^{1/2}

[Figure: a box with side lengths x, y, z and a general 3D set V, together with their projections ("A shadow", "B shadow", "C shadow") onto the three coordinate planes.]

Matrix multiplication form:

∀ (i,j) ∈ [n] × [n]:  C(i,j) = Σ_k A(i,k) · B(k,j)

Geometric Embedding (2nd approach) [Irony,Toledo,Tiskin 04], based on [Loomis & Whitney 49]

Page 11: Oded Schwartz

11

[Figure: a run partitioned into segments S1, S2, S3, … of reads, writes, and FLOPs along the time axis; example of a partition with M = 3.]

For a given run (algorithm, machine, input):
1. Partition the computation into segments of M reads/writes each.
2. Any segment S has at most 3M inputs/outputs.
3. Show that the number of multiplications in S is at most k.
4. The total communication is BW ≥ M · #mults / k = M · n³ / k.
5. By Loomis-Whitney, k ≤ (3M)^{3/2}, so BW ≥ M · n³ / (3M)^{3/2} = Ω(n³ / M^{1/2}).

Geometric Embedding (2nd approach) [Irony,Toledo,Tiskin 04], based on [Loomis & Whitney 49]

Page 12: Oded Schwartz

12

From Sequential Lower bound to Parallel Lower Bound

We showed: Any classical O(n³) algorithm for matrix multiplication in the sequential model requires:
Bandwidth: Ω(n³ / M^{1/2})
Latency: Ω(n³ / M^{3/2})

Cor: Any classical O(n³) algorithm for matrix multiplication on a P-processor machine (with balanced workload) requires:

2D layout: M = O(n² / P)
Bandwidth: Ω(n³ / (P·M^{1/2})) = Ω(n² / P^{1/2})
Latency: Ω(n³ / (P·M^{3/2})) = Ω(P^{1/2})
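A quick check of the substitution (my arithmetic, using M = Θ(n²/P) for the 2D layout):

    \frac{n^3}{P\,M^{1/2}} = \frac{n^3}{P\,(n^2/P)^{1/2}} = \frac{n^2}{P^{1/2}},
    \qquad
    \frac{n^3}{P\,M^{3/2}} = \frac{n^3}{P\,(n^2/P)^{3/2}} = P^{1/2}.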

Page 13: Oded Schwartz

13

From Sequential Lower bound to Parallel Lower Bound

Proof: Observe one processor.

Is it always true?

[Figure: a 3D set and its "A shadow", "B shadow", "C shadow" projections, as before.]

Let Alg be an algorithm with a communication lower bound B = B(n, M).
Does every parallel implementation of Alg have a communication lower bound B'(n, M, P) = B(n, M) / P?

Page 14: Oded Schwartz

Proof of Loomis-Whitney inequality

• T = 3D set of 1×1×1 cubes on the lattice
• N = |T| = #cubes
• Tx = projection of T onto the x = 0 plane
• Nx = |Tx| = #squares in Tx; similarly Ty, Ny, Tz, Nz
• Goal: N ≤ (Nx · Ny · Nz)^{1/2}

14

• T(x=i) = subset of T with x = i
• T(x=i | y) = projection of T(x=i) onto the y = 0 plane
• N(x=i) = |T(x=i)|, etc.

N = Σ_i N(x=i) = Σ_i (N(x=i))^{1/2} · (N(x=i))^{1/2}
  ≤ Σ_i (Nx)^{1/2} · (N(x=i))^{1/2}                                   [since N(x=i) ≤ Nx]
  ≤ (Nx)^{1/2} · Σ_i (N(x=i | y) · N(x=i | z))^{1/2}                  [since N(x=i) ≤ N(x=i | y) · N(x=i | z)]
  = (Nx)^{1/2} · Σ_i (N(x=i | y))^{1/2} · (N(x=i | z))^{1/2}
  ≤ (Nx)^{1/2} · (Σ_i N(x=i | y))^{1/2} · (Σ_i N(x=i | z))^{1/2}      [Cauchy-Schwarz]
  = (Nx)^{1/2} · (Ny)^{1/2} · (Nz)^{1/2}

[Figure: the slice T(x=i) of T and its projections T(x=i | y) and T(x=i | z), illustrating N(x=i) ≤ N(x=i | y) · N(x=i | z).]
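As a small sanity check (my addition, not from the slides), the inequality N ≤ (Nx · Ny · Nz)^{1/2} can be verified numerically on random sets of unit cubes:

    import random

    def loomis_whitney_holds(cubes):
        # cubes: a set of (x, y, z) integer coordinates of unit lattice cubes.
        N = len(cubes)
        Nx = len({(y, z) for (x, y, z) in cubes})  # shadow on the x = 0 plane
        Ny = len({(x, z) for (x, y, z) in cubes})  # shadow on the y = 0 plane
        Nz = len({(x, y) for (x, y, z) in cubes})  # shadow on the z = 0 plane
        return N * N <= Nx * Ny * Nz

    random.seed(0)
    for _ in range(1000):
        cubes = {(random.randrange(5), random.randrange(5), random.randrange(5))
                 for _ in range(random.randrange(1, 60))}
        assert loomis_whitney_holds(cubes)
    print("Loomis-Whitney held on all random samples")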

Page 15: Oded Schwartz

15

Communication Lower Bounds

Approaches:

1. Reduction [Ballard, Demmel, Holtz, S. 2009]

2. Geometric Embedding [Irony, Toledo, Tiskin 04],

[Ballard, Demmel, Holtz, S. 2011a]

3. Graph Analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]

Proving that your algorithm/implementation is as good as it gets.

Page 16: Oded Schwartz

How to generalize this lower bound

16

Matrix multiplication form:

∀ (i,j) ∈ [n] × [n]:  C(i,j) = Σ_k A(i,k) · B(k,j)

(1) Generalized form:

∀ (i,j) ∈ S:  C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ),  where k1, k2, … ∈ S_ij

• C(i,j) is any unique memory location; same for A(i,k) and B(k,j). A, B, and C may overlap.
• The lower bound holds for all reorderings of the computation, incorrect ones too.
• It does assume that each operand generates a load/store. It turns out QR, eig, and SVD may all avoid this, and need a different analysis. Not today…
• f_ij and g_ijk are "nontrivial" functions.

Page 17: Oded Schwartz

17

Geometric Embedding (2nd approach)

(1) Generalized form: ∀ (i,j) ∈ S:  C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ),  where k1, k2, … ∈ S_ij

Thm [Ballard, Demmel, Holtz, S. 2011a]: If an algorithm agrees with the generalized form then

BW = Ω(G / M^{1/2}), where G = |{ g_{i,j,k} : (i,j) ∈ S, k ∈ S_ij }|

BW = Ω(G / (P·M^{1/2})) in the P-parallel model.

Page 18: Oded Schwartz

18

Example: Application to Cholesky decomposition

(1) Generalized form: ∀ (i,j) ∈ S:  C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ),  where k1, k2, … ∈ S_ij

Cholesky:

L(i,i) = ( A(i,i) − Σ_{k<i} L(i,k)² )^{1/2}
L(i,j) = ( A(i,j) − Σ_{k<j} L(i,k) · L(j,k) ) / L(j,j),   for i > j

The products L(i,k) · L(j,k) (and the squares L(i,k)²) play the role of the g_{i,j,k}, so Cholesky agrees with Form (1).
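For concreteness, a minimal unblocked Cholesky sketch in Python (my illustration, not from the slides); the products L[i,k]·L[j,k] in the inner sums are exactly the g_ijk operations counted by G:

    import numpy as np

    def cholesky_unblocked(A):
        # Returns lower-triangular L with A = L @ L.T, for symmetric positive definite A.
        n = A.shape[0]
        L = np.zeros((n, n))
        for j in range(n):
            # Diagonal: L(j,j) = sqrt( A(j,j) - sum_{k<j} L(j,k)^2 )
            s = sum(L[j, k] * L[j, k] for k in range(j))    # g_{j,j,k} operations
            L[j, j] = (A[j, j] - s) ** 0.5
            for i in range(j + 1, n):
                # Off-diagonal: L(i,j) = ( A(i,j) - sum_{k<j} L(i,k) L(j,k) ) / L(j,j)
                s = sum(L[i, k] * L[j, k] for k in range(j))  # g_{i,j,k} operations
                L[i, j] = (A[i, j] - s) / L[j, j]
        return L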

Page 19: Oded Schwartz

19

From Sequential Lower bound to Parallel Lower Bound

We showed: Any algorithm that agrees with Form (1) in the sequential model requires:
Bandwidth: Ω(G / M^{1/2})
Latency: Ω(G / M^{3/2})
where G is the number of g_ijk operations.

Cor: Any algorithm that agrees with Form (1), on a P-processor machine where at least two processors each perform Ω(1/P) of the g_ijk operations, requires:

Bandwidth: Ω(G / (P·M^{1/2}))
Latency: Ω(G / (P·M^{3/2}))

Page 20: Oded Schwartz

20

Geometric Embedding (2nd approach) [Ballard, Demmel, Holtz, S. 2011a]. Follows [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]

Lower bounds for algorithms with the "flavor" of 3 nested loops: BLAS, LU, Cholesky, LDL^T, and QR factorizations, eigenvalues and SVD, i.e., essentially all direct methods of linear algebra.

• Dense or sparse matrices. In the sparse case the bandwidth bound is a function of NNZ (the number of nonzeros).
• Bandwidth and latency.
• Sequential, hierarchical, and parallel (distributed- and shared-memory) models.
• Compositions of linear algebra operations.
• Certain graph optimization problems [Demmel, Pearson, Poloni, Van Loan 11], [Ballard, Demmel, S. 11]
• Tensor contractions

For dense matrices: BW = Ω(n³ / M^{1/2}) sequentially, and BW = Ω(n³ / (P·M^{1/2})) in the P-parallel model.

Page 21: Oded Schwartz

21

Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?

Mostly not.

Are there other algorithms that do? Mostly yes.
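One standard example of an algorithm that attains the matrix-multiplication bound in the sequential model is blocked (tiled) multiplication; a minimal Python sketch (my illustration), with block size b chosen so that three b×b blocks fit in fast memory, moving O(n³ / M^{1/2}) words overall:

    import numpy as np

    def blocked_matmul(A, B, M):
        # Tiled classical multiplication with block size b, where 3*b*b <= M.
        n = A.shape[0]
        b = max(1, int((M / 3) ** 0.5))
        C = np.zeros((n, n))
        for i0 in range(0, n, b):
            for j0 in range(0, n, b):
                for k0 in range(0, n, b):
                    # Each block update touches O(b^2) words and does O(b^3) flops,
                    # for O(n^3 / b) = O(n^3 / M^{1/2}) words moved in total.
                    C[i0:i0+b, j0:j0+b] += A[i0:i0+b, k0:k0+b] @ B[k0:k0+b, j0:j0+b]
        return C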

Page 22: Oded Schwartz

22

Dense Linear Algebra: Sequential Model

Lower bounds and attaining algorithms (bandwidth and latency). The lower bounds, BW = Ω(n³ / M^{1/2}) and latency Ω(n³ / M^{3/2}), are from [Ballard, Demmel, Holtz, S. 11]; algorithms attaining them:

• Matrix multiplication: [Frigo, Leiserson, Prokop, Ramachandran 99]
• Cholesky: [Ahmad, Pingali 00], [Ballard, Demmel, Holtz, S. 09]
• LU: [Toledo 97], [DGX08]
• QR: [EG98], [DGHL08a]
• Symmetric eigenvalues: [Ballard, Demmel, Dumitriu 10]
• SVD: [Ballard, Demmel, Dumitriu 10]
• (Generalized) nonsymmetric eigenvalues: [Ballard, Demmel, Dumitriu 10]

Page 23: Oded Schwartz

Dense 2D parallel algorithms
• Assume n×n matrices on P processors, memory per processor = O(n² / P)
• ScaLAPACK assumes the best block size b is chosen
• Many references (see reports); blue entries are new
• Recall the lower bounds: #words_moved = Ω(n² / P^{1/2}) and #messages = Ω(P^{1/2})

Factors by which each algorithm exceeds the lower bounds (#words_moved, #messages):
• Matrix multiply, [Cannon, 69]: 1, 1
• Cholesky, ScaLAPACK: log P, log P
• LU, [GDX08]: log P, log P; ScaLAPACK: log P, (N / P^{1/2}) · log P
• QR, [DGHL08]: log P, log³ P; ScaLAPACK: log P, (N / P^{1/2}) · log P
• Sym Eig, SVD, [BDD10]: log P, log³ P; ScaLAPACK: log P, N / P^{1/2}
• Nonsym Eig, [BDD10]: log P, log³ P; ScaLAPACK: P^{1/2} · log P, N · log P

Relax the memory assumption: 2.5D algorithms, Solomonik & Demmel '11

Page 24: Oded Schwartz

24

[Figure: a run partitioned into segments S1, S2, S3, … of reads, writes, and FLOPs along the time axis; example of a partition with M = 3.]

For a given run (algorithm, machine, input):
1. Partition the computation into segments of M reads/writes each.
2. Any segment S has at most 3M inputs/outputs.
3. Show that S performs at most k FLOPs g_ijk.
4. The total communication is BW ≥ M · G / k, where G is the number of g_{i,j,k} operations.

Geometric Embedding (2nd approach)

Page 25: Oded Schwartz

25

Geometric Embedding (2nd approach) [Ballard, Demmel, Holtz, S. 2011a]. Follows [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]

(1) Generalized form: ∀ (i,j) ∈ S:  C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ),  where k1, k2, … ∈ S_ij

Volume of a box: V = x·y·z = (xz · zy · yx)^{1/2}

Thm (Loomis & Whitney, 1949): Volume of a 3D set V ≤ (area(A shadow) · area(B shadow) · area(C shadow))^{1/2}

[Figure: a box with side lengths x, y, z and a general 3D set V, with their "A shadow", "B shadow", "C shadow" projections, as before.]

Page 26: Oded Schwartz

26

[Figure: a run partitioned into segments S1, S2, S3, … of reads, writes, and FLOPs along the time axis; example of a partition with M = 3.]

of M reads / writes

2. Any segment S has 3M inputs/outputs.

3. Show that S performs k FLOPs gijk

4. The total communication BW isBW = BW of one segment #segments

M G / kwhere G is #gi,j,k

5. By Loomis-Whitney:BW M G / (3M)3/2

...

Geometric Embedding (2nd approach)

Page 27: Oded Schwartz

27

Applications

BW = Ω(G / M^{1/2}), where G = |{ g_{i,j,k} : (i,j) ∈ S, k ∈ S_ij }|
BW = Ω(G / (P·M^{1/2})) in the P-parallel model.

(1) Generalized form: ∀ (i,j) ∈ S:  C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ),  where k1, k2, … ∈ S_ij

Page 28: Oded Schwartz

28

Geometric Embedding (2nd approach) [Ballard, Demmel, Holtz, S. 2011a]. Follows [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]

(1) Generalized form: ∀ (i,j) ∈ S:  C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ),  where k1, k2, … ∈ S_ij

But many algorithms just don’t fit the generalized form!

For example: Strassen’s fast matrix multiplication

Page 29: Oded Schwartz

29

Beyond 3-nested loops

How about the communication costs of algorithms that have a more complex structure?

Page 30: Oded Schwartz

30

Communication Lower Bounds – to be continued…

Approaches:

1. Reduction [Ballard, Demmel, Holtz, S. 2009]

2. Geometric Embedding [Irony, Toledo, Tiskin 04],

[Ballard, Demmel, Holtz, S. 2011a]

3. Graph Analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]

Proving that your algorithm/implementation is as good as it gets.

Page 31: Oded Schwartz

31

Further reduction techniques: Imposing reads and writes

Example: computing ||A·B||, where each matrix element is given by a formula and computed only once.

Problem: the input/output does not agree with Form (1).
Solution:
• Impose writes/reads of the (computed) entries of A and B.
• Impose writes of the entries of C.
• The new algorithm then has the lower bound BW = Ω(n³ / M^{1/2}).

For the original algorithm this gives

BW = Ω(n³ / M^{1/2} − c·n²),

i.e., BW = Ω(n³ / M^{1/2}) for M ≤ c'·n² (which we assume anyway).

(1) Generalized form: ∀ (i,j) ∈ S:  C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ),  where k1, k2, … ∈ S_ij

Page 32: Oded Schwartz

32

Further reduction techniques: Imposing reads and writes

The previous example generalizes to other "black-box" uses of algorithms that fit Form (1).

Consider a more general class of algorithms:
• Some arguments of the generalized form may be computed "on the fly" and discarded immediately after use. …

Page 33: Oded Schwartz

33

[Figure: a run partitioned into segments S1, S2, S3, … of reads, writes, and FLOPs along the time axis; example of a partition with M = 3.]

For a given run (algorithm, machine, input):
1. Partition the computation into segments of M reads/writes each.
2. Any segment S has at most 3M inputs/outputs.
3. Show that S performs at most G(3M) FLOPs g_ijk.
4. The total communication is BW ≥ M · G / G(3M).

But now some operands inside a segment may be computed on the fly and discarded, so no read/write is performed for them.

Recall…

Page 34: Oded Schwartz

How to generalize this lower bound: how to deal with operands generated on the fly

34

• Need to distinguish the sources and destinations of each operand in fast memory during a segment.

• Possible sources:
  R1: Already in fast memory at the start of the segment, or read during it; at most 2M.
  R2: Created during the segment; no bound without more information.

• Possible destinations:
  D1: Left in fast memory at the end of the segment, or written; at most 2M.
  D2: Discarded; no bound without more information.


Page 35: Oded Schwartz

How to generalize this lower bound: how to deal with operands generated on the fly

35

There are at most 4M operands of types R1/D1, R1/D2, and R2/D1.

We need to assume (or prove) that there are not too many R2/D2 operands; then we can apply Loomis-Whitney and obtain the lower bound of Form (1).

Bounding the number of R2/D2 operands is sometimes quite subtle.


Page 36: Oded Schwartz

36

Composition of algorithms

Many algorithms and applications are compositions of other (linear algebra) algorithms.

How do we compute lower and upper bounds for such cases?

Example: dense matrix powering. Compute A^n by repeated squaring (log n squarings):

A → A² → A⁴ → … → A^n

Each squaring step agrees with Form (1). Do we get

BW = Ω(log n · n³ / M^{1/2}),

or is there a way to reorder (interleave) the computations to reduce communication?
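A minimal sketch of the phased computation (my illustration, assuming n is a power of two and some communication-optimal multiply is used for each squaring):

    import numpy as np

    def matrix_power_by_squaring(A, n):
        # Computes A^n by log2(n) successive squarings: A -> A^2 -> A^4 -> ... -> A^n.
        # Each squaring is one Form (1) computation.
        X = A.copy()
        for _ in range(int(np.log2(n))):
            X = X @ X   # stand-in for a blocked, communication-optimal multiply
        return X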

Page 37: Oded Schwartz

37

Communication hiding vs. Communication avoiding

Q. The model assumes that computation and communication do not overlap. Is this a realistic assumption? Can we not gain time by such overlapping?

A. Right. This is called communication hiding. It is done in practice, and ignored in our model. It may save up to a factor of 2 in the running time. Note that the speedup gained by avoiding (minimizing) communication is typically larger than a constant factor.

Page 38: Oded Schwartz

38

Two-nested loops: when the input/output size dominates

Q. Do two-nested-loops algorithms fall into the paradigm of Form (1)? For example, what lower bound do we obtain for matrix-vector multiplication?

A. Yes, but the lower bound we obtain is BW = Ω(n² / M^{1/2}), whereas just reading the input already costs BW = Ω(n²).

More generally, the communication cost lower bound for algorithms that agree with Form (1) is

BW = Ω( max( LW, #inputs + #outputs ) ),

where LW is the bound obtained from the geometric embedding and #inputs + #outputs is the total size of the inputs and outputs.

For some algorithms LW dominates; for others #inputs + #outputs dominates.

Page 39: Oded Schwartz

39

Composition of algorithms

Claim: any implementation of A^n by repeated squaring (log n squarings) requires

BW = Ω(log n · n³ / M^{1/2}).

Therefore we cannot reduce communication by more than a constant factor (compared to log n separate calls to matrix multiplication) by reordering the computations.

(1) Generalized form: ∀ (i,j) ∈ S:  C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ),  where k1, k2, … ∈ S_ij

Page 40: Oded Schwartz

40

Composition of algorithms

Proof: by imposing reads/writes on each entry of every intermediate matrix.
The total number of g_{i,j,k} operations is Θ(n³ log n).
The total number of imposed reads/writes is Θ(n² log n).
Hence the lower bound for the original algorithm is

BW = Ω( (n³ log n) / M^{1/2} − n² log n ) = Ω( (n³ log n) / M^{1/2} ).

(1) Generalized form: ∀ (i,j) ∈ S:  C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ),  where k1, k2, … ∈ S_ij

Page 41: Oded Schwartz

41

Composition of algorithms: when interleaving does matter

Example 1: Input: A, v1, v2, …, vn. Output: Av1, Av2, …, Avn.

The phased solution (n separate matrix-vector multiplications) costs BW = Θ(n³).

But we already know that we can save a factor of M^{1/2}: set B = (v1, v2, …, vn) and compute A·B; then the cost is BW = Θ(n³ / M^{1/2}).

Other examples?
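A small Python sketch of the two variants (my illustration): the phased version streams A from slow memory once per vector, while the interleaved version stacks the vectors into a matrix B and performs a single matrix product, which a blocked algorithm carries out with Θ(n³ / M^{1/2}) words moved:

    import numpy as np

    def apply_phased(A, vs):
        # n separate matrix-vector products: A is re-read for each vector,
        # so the total bandwidth cost is Theta(n^3).
        return [A @ v for v in vs]

    def apply_batched(A, vs):
        # Interleaved version: B = (v1, ..., vn); one matrix-matrix product,
        # performed by a blocked algorithm with Theta(n^3 / M^{1/2}) words moved.
        B = np.column_stack(vs)
        return A @ B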

Page 42: Oded Schwartz

42

Composition of algorithms: when interleaving does matter

Example 2: Input: A, B, t. Output: C(k) = A · B(k) for k = 1, 2, …, t, where B(k)_{i,j} = (B_{i,j})^{1/k}.

Phased solution:

Upper bound: BW = O(t · n³ / M^{1/2}) (by adding up the bandwidth cost of t matrix multiplication calls).

Lower bound: BW = Ω(t · n³ / M^{1/2}) (by imposing writes/reads between phases).

Page 43: Oded Schwartz

43

Composition of algorithms: when interleaving does matter

Example 2: Input: A, B, t. Output: C(k) = A · B(k) for k = 1, 2, …, t, where B(k)_{i,j} = (B_{i,j})^{1/k}.

Can we do better than BW = Θ(t · n³ / M^{1/2})?

Page 44: Oded Schwartz

44

Composition of algorithms: when interleaving does matter

Example 2: Input: A, B, t. Output: C(k) = A · B(k) for k = 1, 2, …, t, where B(k)_{i,j} = (B_{i,j})^{1/k}.

Can we do better than BW = Θ(t · n³ / M^{1/2})?

Yes.
Claim: There exists an implementation of the above computation with communication cost (tight lower and upper bounds)

BW = Θ( t · n³ / (t·M)^{1/2} ),

i.e., a factor of t^{1/2} better than the phased solution.

Page 45: Oded Schwartz

45

Composition of algorithms: when interleaving does matter

Example 2: Input: A, B, t. Output: C(k) = A · B(k) for k = 1, 2, …, t, where B(k)_{i,j} = (B_{i,j})^{1/k}.

Proof idea:

• Upper bound: having both A(i,k) and B(k,j) in fast memory lets us do up to t evaluations of g_ijk.

• Lower bound: the union of all these t·n³ operations does not match Form (1), since the inputs B(k,j) cannot be indexed in a one-to-one fashion. We need a more careful argument bounding the number of g_ijk operations in a segment as a function of the number of accessed elements of A, B, and the C(k).

Page 46: Oded Schwartz

46

Composition of algorithms: when interleaving does matter

Can you think of natural examples where reordering / interleaving of known algorithms may improve the communication costs, compared to the phased implementation?

Page 47: Oded Schwartz

47

Summary

How to compute an upper bound on the communication costs of your algorithm?

Typically straightforward. Not always.

How to compute and prove a lower bound on the communication costs of your algorithm?

• Reductions: from another algorithm/problem, or from another model of computing.
• By using the generalized form ("flavor" of 3 nested loops) and imposing reads/writes, either black-box-wise or by bounding the number of R2/D2 operands.
• By carefully composing the lower bounds of the building blocks.

Next time: by graph analysis

Page 48: Oded Schwartz

48

Open Problems

Find algorithms that attain the lower bounds:
• Sparse matrix algorithms
• Algorithms that auto-tune or are cache-oblivious
• Cache-oblivious algorithms for the parallel (distributed-memory) model
• Cache-oblivious parallel matrix multiplication? (Cilk++?)

Address complex heterogeneous hardware:
• Lower bounds and algorithms [Demmel, Volkov 08], [Ballard, Demmel, Gearhart 11]

Page 49: Oded Schwartz

How to Compute and Prove

Lower Bounds on the

Communication Costs of your Algorithm

Oded Schwartz

CS294, Lecture #2, Fall 2011. Communication-Avoiding Algorithms

Based on:

D. Irony, S. Toledo, and A. Tiskin:

Communication lower bounds for distributed-memory matrix multiplication.

G. Ballard, J. Demmel, O. Holtz, and O. Schwartz:

Minimizing communication in linear algebra.

Thank you!