Reducing Communication in Parallel Graph Computations
Aydın Buluç Berkeley Lab August 8, 2013 ASCR Applied Math PI Meeting Albuquerque, NM
Acknowledgments (past & present)
• Krste Asanovic (UC Berkeley) • Grey Ballard (UC Berkeley) • Scott Beamer (UC Berkeley) • Jim Demmel (UC Berkeley) • Erika Duriakova (UC Dublin) • Armando Fox (UC Berkeley) • John Gilbert (UCSB) • Laura Grigori (INRIA) • Shoaib Kamil (MIT) • Ben Lipshitz (UC Berkeley) • Adam Lugowski (UCSB)
• Lenny Oliker (Berkeley Lab) • Dave Patterson (UC Berkeley) • Steve Reinhardt (YarcData) • Oded Schwartz (UC Berkeley) • Edgar Solomonik (UC Berkeley) • Sivan Toledo (Tel Aviv Univ) • Sam Williams (Berkeley Lab)
This work is funded by:
Informal Outline
• Large-scale graphs
• Communication costs of parallel graph computations
• Communication takes energy
• Linear algebraic primitives enable scalable graph computation
• Semiring abstraction glues graphs to linear algebra
• Direction-optimizing BFS does >6X less communication
• New communication-optimal sparse matrix-matrix multiply
• New communication-optimal dense all-pairs shortest-paths
Graph abstracVons in Computer Science
Compiler optimization: control flow graphs, graph dominators; register allocation: graph coloring.
Scientific computing: preconditioning: support graphs, spanning trees; sparse direct solvers: chordal graphs; parallel computing: graph separators.

[Figure: filled graph G+(A) on vertices 1-9, which is chordal]

Computer networks: routing: shortest path algorithms; web crawling: graph traversal; interconnect design: Cayley graphs.
Large graphs are everywhere
WWW snapshot, courtesy Y. Hyun Yeast protein interaction network, courtesy H. Jeong
Internet structure Social interactions
Scientific datasets: biological, chemical, cosmological, ecological, …
Two kinds of costs: arithmetic (FLOPs) and communication (moving data).

Running time = γ ⋅ #FLOPs + β ⋅ #Words + (α ⋅ #Messages)

[Figure: sequential model — one CPU with fast/local memory of size M backed by RAM; distributed model — P processors, each a CPU with its own local memory M]
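The cost model above can be tried numerically. This is a minimal sketch; the γ, β, α constants below are illustrative assumptions, not machine measurements.

```python
# Sketch of the two-cost performance model from the slide:
# time = gamma * flops + beta * words + alpha * messages.
# All constants here are illustrative, not measured on real hardware.

def model_time(flops, words, messages,
               gamma=1e-9,   # seconds per FLOP (assumed)
               beta=1e-8,    # seconds per word moved (assumed)
               alpha=1e-6):  # seconds per message (assumed)
    """Return predicted running time under the alpha-beta-gamma model."""
    return gamma * flops + beta * words + alpha * messages

# A communication-bound example: one word moved per FLOP means the
# bandwidth term dominates by beta/gamma.
t = model_time(flops=1e9, words=1e9, messages=1e4)
```

With these assumed constants the bandwidth term is 10x the arithmetic term, illustrating why moving data, not FLOPs, dominates at scale.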
[Figure: processor-DRAM performance gap — processor performance historically grew ~59%/year while memory performance grew ~23-26%/year]

2004: trend transition into multicore, further increasing communication costs.
Develop faster algorithms: minimize communication (down to the lower bound, if possible).
Model and Motivation
Communication-avoiding algorithms: Save time
Communication-avoiding algorithms: Save energy
Image courtesy of John Shalf (LBNL)
Communication → Energy
Communication crucial for graphs:
- Often no surface-to-volume ratio to exploit.
- Very little data reuse in existing algorithmic formulations.
- Already heavily communication bound.
[Figure: percentage of time spent in computation vs. communication as the core count scales from 1 to 4096; communication dominates at scale]

2D sparse matrix-matrix multiply emulating graph contraction and AMG restriction operations. Scale 23 R-MAT (scale-free graph) times an order-4 restriction operator. Cray XT4, Franklin, NERSC.
B., Gilbert. Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments. SIAM Journal on Scientific Computing, 2012.
Linear-algebraic primitives

• Sparse matrix-sparse matrix multiplication (Sparse GEMM)
• Sparse matrix-sparse vector multiplication (SpMV)
• Element-wise operations
• Sparse matrix indexing

The Combinatorial BLAS implements these, and more, on arbitrary semirings, e.g. (×, +), (and, or), (+, min).
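To make the semiring abstraction concrete, here is a toy sketch (not the Combinatorial BLAS API) of a sparse GEMM parameterized by the semiring's add and multiply; matrices are plain dicts mapping (row, col) to value.

```python
# Toy semiring SpGEMM: C = A (semiring-)times B.
# Sparse matrices are dicts of {(row, col): value}; this is an
# illustrative sketch, not the Combinatorial BLAS implementation.

def spgemm(A, B, add, mul):
    """Multiply coordinate-dict sparse matrices over a semiring."""
    C = {}
    # Group B's nonzeros by row so we can stream over A's nonzeros.
    B_rows = {}
    for (k, j), v in B.items():
        B_rows.setdefault(k, []).append((j, v))
    for (i, k), a in A.items():
        for j, b in B_rows.get(k, []):
            prod = mul(a, b)
            C[(i, j)] = add(C[(i, j)], prod) if (i, j) in C else prod
    return C

# (+, x) recovers ordinary matrix multiplication;
# (min, +) gives one relaxation step of all-pairs shortest paths.
A = {(0, 0): 1, (0, 1): 2}
B = {(0, 0): 3, (1, 0): 4}
C_plus_times = spgemm(A, B, add=lambda x, y: x + y, mul=lambda x, y: x * y)
C_min_plus = spgemm(A, B, add=min, mul=lambda x, y: x + y)
```

Swapping only the `add`/`mul` callables changes the algorithm the multiply expresses, which is exactly the glue between graphs and linear algebra the slide describes.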
The case for sparse matrices

Many irregular applications contain coarse-grained parallelism that can be exploited by abstractions at the proper level.

Traditional graph computations:
- Data driven, unpredictable communication.
- Irregular and unstructured, poor locality of reference.
- Fine grained data accesses, dominated by latency.

Graphs in the language of linear algebra:
- Fixed communication patterns.
- Operations on matrix blocks exploit memory hierarchy.
- Coarse grained parallelism, bandwidth limited.
Breadth-first search in the language of linear algebra

[Figure: a 7-vertex directed graph, its adjacency matrix A^T (rows "to", columns "from"), the sparse frontier vector x, and the product A^T x yielding the next frontier and its parents]

Particular semiring operations: multiply = select, add = minimum.
[Figure: second BFS step — the frontier x now holds vertices 2 and 4; A^T x discovers the next level and records parents]

Multiply traverses outgoing edges; add chooses among incoming edges.
[Figure: third BFS step — several frontier vertices reach the same unvisited vertex]

Select the vertex with the minimum label as parent.
[Figure: final BFS step — the last vertex is discovered and the search terminates]

Result: deterministic breadth-first search.
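The steps above can be sketched as repeated sparse "matrix-vector products" where multiply selects a parent along an edge and add keeps the minimum parent id. This is an illustrative toy, not the CombBLAS code; the dict-of-adjacency-lists representation is an assumption.

```python
# BFS over the (select, min) semiring: each step is conceptually
# A^T x on the sparse frontier x. Illustrative sketch only.

def bfs_linear_algebra(adj, source):
    """adj: dict vertex -> list of out-neighbors. Returns parent map."""
    parents = {source: source}
    frontier = {source}                      # the sparse vector x
    while frontier:
        next_frontier = {}
        for v in frontier:                   # "multiply": traverse out-edges
            for w in adj.get(v, []):
                if w not in parents:
                    # "add" = min: deterministic choice among candidate parents
                    next_frontier[w] = min(next_frontier.get(w, v), v)
        parents.update(next_frontier)
        frontier = set(next_frontier)
    return parents
```

Because ties are broken by minimum label, the traversal is deterministic regardless of iteration order, matching the slide's claim.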
Performance observations of level-synchronous BFS

[Figure: YouTube social network — per-phase percentage of total for frontier size vs. execution time across 15 BFS phases; a synthetic network shows the same pattern]

When the frontier is at its peak, almost all edge examinations "fail" to claim a child.
Bottom-up BFS algorithm [single-node: Beamer et al. 2012]

The classical (top-down) algorithm is optimal in the worst case, but pessimistic for low-diameter graphs (previous slide).

Direction-optimizing approach: switch from top-down to bottom-up search when the majority of the vertices are discovered.
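A toy sketch of the switching idea follows. The frontier-size threshold and the dict-based graph representation are illustrative assumptions, not the tuned heuristic from Beamer et al.

```python
# Direction-optimizing BFS sketch: top-down while the frontier is small,
# bottom-up once it covers a large fraction of the vertices.
# switch_fraction is an illustrative threshold, not the paper's heuristic.

def direction_optimizing_bfs(adj, radj, source, n, switch_fraction=0.1):
    """adj: out-neighbors, radj: in-neighbors, vertices 0..n-1."""
    parents = {source: source}
    frontier = {source}
    while frontier:
        nxt = set()
        if len(frontier) < switch_fraction * n:
            # Top-down: expand outgoing edges of frontier vertices.
            for v in frontier:
                for w in adj.get(v, []):
                    if w not in parents:
                        parents[w] = v
                        nxt.add(w)
        else:
            # Bottom-up: each undiscovered vertex scans its in-edges
            # and stops at the first parent found in the frontier.
            for w in range(n):
                if w in parents:
                    continue
                for v in radj.get(w, []):
                    if v in frontier:
                        parents[w] = v
                        nxt.add(w)
                        break
        frontier = nxt
    return parents
```

The early `break` in the bottom-up branch is the source of the savings: when the frontier is huge, most vertices find a parent after examining only a few incoming edges instead of the frontier examining every outgoing edge.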
Direction-optimizing BFS on distributed memory

• Kronecker (Graph500): 16 billion vertices and 256 billion edges.
• Implemented on top of Combinatorial BLAS (i.e., uses 2D decomposition).

Beamer, B., Asanović, and Patterson, "Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search", MTAAP'13, in conjunction with IPDPS.
Direction-optimizing BFS reduces communication

k: average vertex degree of the graph.

Communication volume reduction based on an analytical model derived in the paper, assuming 4 bottom-up steps (typical for power-law graphs).
Direction-optimizing BFS saves energy

For a fixed-size real input, the direction-optimizing algorithm completes in the same time using 1/16th of the processors (and energy).
Betweenness Centrality using Sparse GEMM

• Parallel breadth-first search is implemented with sparse GEMM.
• Work-efficient, highly parallel implementation of Brandes' algorithm.

[Figure: one BFS step as A^T times a tall sparse matrix B of concurrent frontiers]

Betweenness centrality is a measure of influence. C_B(v): among all the shortest paths, what fraction pass through v?
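For reference, a plain sequential version of Brandes' algorithm (unweighted case) is sketched below. This single-node toy shows what the parallel sparse-GEMM formulation computes; it is not the slide's implementation.

```python
# Brandes' betweenness centrality, sequential sketch for unweighted
# directed graphs. adj: dict vertex -> list of out-neighbors, n vertices.
from collections import deque

def brandes_bc(adj, n):
    bc = [0.0] * n
    for s in range(n):
        # Phase 1: BFS from s, counting shortest paths (sigma).
        sigma = [0] * n; sigma[s] = 1
        dist = [-1] * n; dist[s] = 0
        preds = [[] for _ in range(n)]
        order = []
        q = deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in adj.get(v, []):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: dependency accumulation in reverse BFS order.
        delta = [0.0] * n
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += (sigma[v] / sigma[w]) * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

The two phases map to the sparse-GEMM view: phase 1 is the BFS expressed as A^T times a frontier matrix; phase 2 runs the same products in reverse to accumulate dependencies work-efficiently.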
Sparse Matrix-Matrix Multiplication

Why sparse matrix-matrix multiplication? Used for algebraic multigrid, graph clustering, betweenness centrality, graph contraction, subgraph extraction, cycle detection, quantum chemistry, …

How do dense and sparse GEMM compare?
Dense: lower bounds match algorithms; allows extensive data reuse.
Sparse: significant gap; inherent poor reuse?

What do we obtain here? An improved (higher) lower bound and new optimal algorithms, but only for random matrices: Erdős-Rényi(n,d) graphs, aka G(n, p=d/n).
The computation cube and sparsity-independent algorithms

Matrix multiplication: ∀(i,j) ∈ n × n, C(i,j) = Σ_k A(i,k)·B(k,j)

The computation (discrete) cube:
• A face for each (input/output) matrix A, B, C.
• A grid point for each multiplication.

1D, 2D, and 3D algorithms partition the cube differently. How about sparse algorithms?
Sparsity-independent algorithms: the assignment of grid points to processors is independent of the sparsity structure. In particular: if C_ij is nonzero, who holds it? All standard algorithms are sparsity independent.

Assumptions: sparsity-independent algorithms; input (and output) are sparse; the algorithm is load balanced.
Algorithms attaining lower bounds

Previous sparse classical lower bound [Ballard et al. SIMAX'11]:

    W = Ω( #FLOPs / (P·√M) ) = Ω( d²n / (P·√M) )

New lower bound for Erdős-Rényi(n,d) [here], expected (under some technical assumptions):

    W = Ω( min{ dn/√P, d²n/P } )

✗ No previous algorithm attains this bound!
✓ Two new algorithms achieve the bounds (up to a logarithmic factor):
  i.  Recursive 3D, based on [Ballard et al. SPAA'12]
  ii. Iterative 3D, based on [Solomonik & Demmel EuroPar'11]
Ballard, B., Demmel, Grigori, Lipshitz, Schwartz, and Toledo. Communication optimal parallel multiplication of sparse random matrices. In SPAA 2013.
3D Iterative Sparse GEMM [dense case: Solomonik and Demmel, 2011]

Optimal number of replicas (theoretically): c = Θ(p/d²)

3D algorithms: using extra memory (c replicas) reduces communication volume by a factor of c^(1/2) compared to 2D.
Performance results (Erdős-Rényi graphs)

Iterative 2D-3D performance results (R-MAT graphs)

[Figure: running time vs. number of grid layers c ∈ {1, 4, 16} for p = 4096 (scale 20), 16384 (scale 22), and 65536 (scale 24), broken down into imbalance, merge, hypersparse GEMM, communication reduce, and communication broadcast; the 3D algorithm is 2X faster, and 5X faster if load balanced]
All-pairs shortest-paths problem

• Input: directed graph with "costs" on edges.
• Find least-cost paths between all reachable vertex pairs.
• Classical algorithm: Floyd-Warshall.
• It turns out a previously overlooked recursive version is more parallelizable than the triple nested loop:

for k = 1:n   // the induction sequence
    for i = 1:n
        for j = 1:n
            if (w(i→k) + w(k→j) < w(i→j))
                w(i→j) := w(i→k) + w(k→j)

[Figure: the k = 1 case on a 5-vertex example]
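The triple loop above runs directly as code; a minimal runnable version (0-indexed, with float("inf") marking absent edges, an assumed input convention):

```python
# Classical Floyd-Warshall on a dense n x n cost matrix.
# w[i][j] = edge cost, float("inf") if no edge, w[i][i] = 0.

def floyd_warshall(w):
    n = len(w)
    d = [row[:] for row in w]            # copy so the input is untouched
    for k in range(n):                   # the induction sequence
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```

Every value of k depends on the previous one, which is exactly why this formulation resists parallelization and motivates the recursive blocked version on the following slides.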
[Figure: a 6-vertex weighted digraph and its weight matrix, partitioned into 2×2 blocks A, B, C, D over vertex halves V1, V2]

    0   5   9   ∞   ∞   4
    ∞   0   1  −2   ∞   ∞
    3   ∞   0   4   ∞   3
    ∞   ∞   5   0   4   ∞
    ∞   ∞  −1   ∞   0   ∞
   −3   ∞   ∞   ∞   7   0

A = A*;   % recursive call
B = AB;   C = CA;
D = D + CB;
D = D*;   % recursive call
B = BD;   C = DC;
A = A + BC;

+ is "min", × is "add"
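The eight-step recursion above can be sketched sequentially over the (min, +) semiring. This is an illustrative distances-only toy (dense lists of lists, n a power of two, no negative cycles assumed, no parent tracking), not the distributed implementation.

```python
# Recursive blocked APSP over the (min, +) semiring.
INF = float("inf")

def min_plus(A, B):
    """Semiring "matrix multiply": (min, +) product of square blocks."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def elem_min(A, B):
    """Semiring "addition": element-wise min of two blocks."""
    n = len(A)
    return [[min(A[i][j], B[i][j]) for j in range(n)] for i in range(n)]

def apsp(W):
    """Return W*, the all-pairs shortest-path closure of cost matrix W."""
    n = len(W)
    if n == 1:
        return [[min(W[0][0], 0)]]       # empty path has cost 0
    h = n // 2
    A = [row[:h] for row in W[:h]]; B = [row[h:] for row in W[:h]]
    C = [row[:h] for row in W[h:]]; D = [row[h:] for row in W[h:]]
    A = apsp(A)                          # A = A*
    B = min_plus(A, B)                   # B = AB
    C = min_plus(C, A)                   # C = CA
    D = elem_min(D, min_plus(C, B))      # D = D + CB
    D = apsp(D)                          # D = D*
    B = min_plus(B, D)                   # B = BD
    C = min_plus(D, C)                   # C = DC
    A = elem_min(A, min_plus(B, C))      # A = A + BC
    top = [A[i] + B[i] for i in range(h)]
    bot = [C[i] + D[i] for i in range(h)]
    return top + bot
```

Each step is a block-sized semiring GEMM, so the recursion replaces Floyd-Warshall's serial k-loop with matrix multiplications that parallelize well.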
[Figure: after C = CA, entry (3,2) of the distance matrix becomes 8 — the cost of the 3-1-2 path]

Distances:
    0   5   9   ∞   ∞   4
    ∞   0   1  −2   ∞   ∞
    3   8   0   4   ∞   3
    ∞   ∞   5   0   4   ∞
    ∞   ∞  −1   ∞   0   ∞
   −3   ∞   ∞   ∞   7   0

Parents:
Π =
    1 1 1 1 1 1
    2 2 2 2 2 2
    3 1 3 3 3 3
    4 4 4 4 4 4
    5 5 5 5 5 5
    6 6 6 6 6 6

If a(3,2) = a(3,1) + a(1,2), then Π(3,2) = Π(1,2).
D = D*: no change.

[Figure: after B = BD, entry (1,3) improves from 9 to 6 via the path 1-2-3]

Distances:
    0   5   6   ∞   ∞   4
    ∞   0   1  −2   ∞   ∞
    3   8   0   4   ∞   3
    ∞   ∞   5   0   4   ∞
    ∞   ∞  −1   ∞   0   ∞
   −3   ∞   ∞   ∞   7   0

Parents:
Π =
    1 1 2 1 1 1
    2 2 2 2 2 2
    3 1 3 3 3 3
    4 4 4 4 4 4
    5 5 5 5 5 5
    6 6 6 6 6 6

Path 1-2-3: if a(1,3) = a(1,2) + a(2,3), then Π(1,3) = Π(2,3).
Bandwidth: W_bc-2.5D(n, p) = O( n² / √(cp) )
Latency:   S_bc-2.5D(p) = O( √(cp) · log²(p) )

c: number of replicas. Optimal for any memory size!
Communication-avoiding APSP on distributed memory

[Figure: GFlops vs. number of compute nodes (1 to 1024) for n = 4096 and n = 8192, with replication factors c = 1, 4, 16]
Solomonik, B., and Demmel. "Minimizing communication in all-pairs shortest paths", Proceedings of IPDPS, 2013.
Conclusions