Reducing Communication in Parallel Graph Computations
Aydın Buluç Berkeley Lab August 8, 2013 ASCR Applied Math PI Meeting Albuquerque, NM
Acknowledgments (past & present)
• Krste Asanovic (UC Berkeley) • Grey Ballard (UC Berkeley) • Scott Beamer (UC Berkeley) • Jim Demmel (UC Berkeley) • Erika Duriakova (UC Dublin) • Armando Fox (UC Berkeley) • John Gilbert (UCSB) • Laura Grigori (INRIA) • Shoaib Kamil (MIT) • Ben Lipshitz (UC Berkeley) • Adam Lugowski (UCSB)
• Lenny Oliker (Berkeley Lab) • Dave Patterson (UC Berkeley) • Steve Reinhardt (YarcData) • Oded Schwartz (UC Berkeley) • Edgar Solomonik (UC Berkeley) • Sivan Toledo (Tel Aviv Univ) • Sam Williams (Berkeley Lab)
This work is funded by:
Informal Outline
• Large-scale graphs
• Communication costs of parallel graph computations
• Communication takes energy
• Linear algebraic primitives enable scalable graph computation
• Semiring abstraction glues graphs to linear algebra
• Direction-optimizing BFS does >6X less communication
• New communication-optimal sparse matrix-matrix multiply
• New communication-optimal dense all-pairs shortest-paths
Graph abstracVons in Computer Science
Compiler optimization: control flow graphs, graph dominators; register allocation: graph coloring.
Scientific computing: preconditioning: support graphs, spanning trees; sparse direct solvers: chordal graphs; parallel computing: graph separators.

[Figure: filled graph G+(A) on vertices 1-9, which is chordal]

Computer networks: routing: shortest path algorithms; web crawling: graph traversal; interconnect design: Cayley graphs.
Large graphs are everywhere
WWW snapshot, courtesy Y. Hyun Yeast protein interaction network, courtesy H. Jeong
Internet structure Social interactions
Scientific datasets: biological, chemical, cosmological, ecological, …
Two kinds of costs: arithmetic (FLOPs) and communication (moving data).

Running time = γ ⋅ #FLOPs + β ⋅ #Words + (α ⋅ #Messages)

[Figure: sequential model — one CPU with fast/local memory of size M backed by RAM; distributed model — P processors, each a CPU with its own local memory M]
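The cost model above can be tried numerically. This is a minimal sketch; the γ, β, α constants below are illustrative assumptions, not machine measurements.

```python
# Sketch of the two-cost performance model from the slide:
# time = gamma * flops + beta * words + alpha * messages.
# All constants here are illustrative, not measured on real hardware.

def model_time(flops, words, messages,
               gamma=1e-9,   # seconds per FLOP (assumed)
               beta=1e-8,    # seconds per word moved (assumed)
               alpha=1e-6):  # seconds per message (assumed)
    """Return predicted running time under the alpha-beta-gamma model."""
    return gamma * flops + beta * words + alpha * messages

# A communication-bound example: one word moved per FLOP means the
# bandwidth term dominates by beta/gamma.
t = model_time(flops=1e9, words=1e9, messages=1e4)
```

With these assumed constants the bandwidth term is 10x the arithmetic term, illustrating why moving data, not FLOPs, dominates at scale.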
[Figure: processor-DRAM performance gap — processor performance historically grew ~59%/year while memory performance grew ~23-26%/year]

2004: trend transition into multicore, further increasing communication costs.
Develop faster algorithms: minimize communication (down to the lower bound, if possible).
Model and Motivation
Communication-avoiding algorithms: Save time
Communication-avoiding algorithms: Save energy
Image courtesy of John Shalf (LBNL)
Communication → Energy
Communication crucial for graphs:
- Often no surface-to-volume ratio to exploit.
- Very little data reuse in existing algorithmic formulations.
- Already heavily communication bound.
[Figure: percentage of time spent in computation vs. communication as the core count scales from 1 to 4096; communication dominates at scale]

2D sparse matrix-matrix multiply emulating graph contraction and AMG restriction operations. Scale 23 R-MAT (scale-free graph) times an order-4 restriction operator. Cray XT4, Franklin, NERSC.
B., Gilbert. Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments. SIAM Journal on Scientific Computing, 2012.
Linear-algebraic primitives

• Sparse matrix-sparse matrix multiplication (Sparse GEMM)
• Sparse matrix-sparse vector multiplication (SpMV)
• Element-wise operations
• Sparse matrix indexing

The Combinatorial BLAS implements these, and more, on arbitrary semirings, e.g. (×, +), (and, or), (+, min).
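To make the semiring abstraction concrete, here is a toy sketch (not the Combinatorial BLAS API) of a sparse GEMM parameterized by the semiring's add and multiply; matrices are plain dicts mapping (row, col) to value.

```python
# Toy semiring SpGEMM: C = A (semiring-)times B.
# Sparse matrices are dicts of {(row, col): value}; this is an
# illustrative sketch, not the Combinatorial BLAS implementation.

def spgemm(A, B, add, mul):
    """Multiply coordinate-dict sparse matrices over a semiring."""
    C = {}
    # Group B's nonzeros by row so we can stream over A's nonzeros.
    B_rows = {}
    for (k, j), v in B.items():
        B_rows.setdefault(k, []).append((j, v))
    for (i, k), a in A.items():
        for j, b in B_rows.get(k, []):
            prod = mul(a, b)
            C[(i, j)] = add(C[(i, j)], prod) if (i, j) in C else prod
    return C

# (+, x) recovers ordinary matrix multiplication;
# (min, +) gives one relaxation step of all-pairs shortest paths.
A = {(0, 0): 1, (0, 1): 2}
B = {(0, 0): 3, (1, 0): 4}
C_plus_times = spgemm(A, B, add=lambda x, y: x + y, mul=lambda x, y: x * y)
C_min_plus = spgemm(A, B, add=min, mul=lambda x, y: x + y)
```

Swapping only the `add`/`mul` callables changes the algorithm the multiply expresses, which is exactly the glue between graphs and linear algebra the slide describes.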
The case for sparse matrices

Many irregular applications contain coarse-grained parallelism that can be exploited by abstractions at the proper level.

Traditional graph computations:
- Data driven, unpredictable communication.
- Irregular and unstructured, poor locality of reference.
- Fine grained data accesses, dominated by latency.

Graphs in the language of linear algebra:
- Fixed communication patterns.
- Operations on matrix blocks exploit memory hierarchy.
- Coarse grained parallelism, bandwidth limited.
Breadth-first search in the language of linear algebra

[Figure: a 7-vertex directed graph, its adjacency matrix A^T (rows "to", columns "from"), the sparse frontier vector x, and the product A^T x yielding the next frontier and its parents]

Particular semiring operations: multiply = select, add = minimum.
[Figure: second BFS step — the frontier x now holds vertices 2 and 4; A^T x discovers the next level and records parents]

Multiply traverses outgoing edges; add chooses among incoming edges.
[Figure: third BFS step — several frontier vertices reach the same unvisited vertex]

Select the vertex with the minimum label as parent.
[Figure: final BFS step — the last vertex is discovered and the search terminates]

Result: deterministic breadth-first search.
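The steps above can be sketched as repeated sparse "matrix-vector products" where multiply selects a parent along an edge and add keeps the minimum parent id. This is an illustrative toy, not the CombBLAS code; the dict-of-adjacency-lists representation is an assumption.

```python
# BFS over the (select, min) semiring: each step is conceptually
# A^T x on the sparse frontier x. Illustrative sketch only.

def bfs_linear_algebra(adj, source):
    """adj: dict vertex -> list of out-neighbors. Returns parent map."""
    parents = {source: source}
    frontier = {source}                      # the sparse vector x
    while frontier:
        next_frontier = {}
        for v in frontier:                   # "multiply": traverse out-edges
            for w in adj.get(v, []):
                if w not in parents:
                    # "add" = min: deterministic choice among candidate parents
                    next_frontier[w] = min(next_frontier.get(w, v), v)
        parents.update(next_frontier)
        frontier = set(next_frontier)
    return parents
```

Because ties are broken by minimum label, the traversal is deterministic regardless of iteration order, matching the slide's claim.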
Performance observations of level-synchronous BFS

[Figure: YouTube social network — per-phase percentage of total for frontier size vs. execution time across 15 BFS phases; a synthetic network shows the same pattern]

When the frontier is at its peak, almost all edge examinations "fail" to claim a child.
Bottom-up BFS algorithm [single-node: Beamer et al. 2012]

The classical (top-down) algorithm is optimal in the worst case, but pessimistic for low-diameter graphs (previous slide).

Direction-optimizing approach: switch from top-down to bottom-up search when the majority of the vertices are discovered.
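A toy sketch of the switching idea follows. The frontier-size threshold and the dict-based graph representation are illustrative assumptions, not the tuned heuristic from Beamer et al.

```python
# Direction-optimizing BFS sketch: top-down while the frontier is small,
# bottom-up once it covers a large fraction of the vertices.
# switch_fraction is an illustrative threshold, not the paper's heuristic.

def direction_optimizing_bfs(adj, radj, source, n, switch_fraction=0.1):
    """adj: out-neighbors, radj: in-neighbors, vertices 0..n-1."""
    parents = {source: source}
    frontier = {source}
    while frontier:
        nxt = set()
        if len(frontier) < switch_fraction * n:
            # Top-down: expand outgoing edges of frontier vertices.
            for v in frontier:
                for w in adj.get(v, []):
                    if w not in parents:
                        parents[w] = v
                        nxt.add(w)
        else:
            # Bottom-up: each undiscovered vertex scans its in-edges
            # and stops at the first parent found in the frontier.
            for w in range(n):
                if w in parents:
                    continue
                for v in radj.get(w, []):
                    if v in frontier:
                        parents[w] = v
                        nxt.add(w)
                        break
        frontier = nxt
    return parents
```

The early `break` in the bottom-up branch is the source of the savings: when the frontier is huge, most vertices find a parent after examining only a few incoming edges instead of the frontier examining every outgoing edge.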
Direction-optimizing BFS on distributed memory

• Kronecker (Graph500): 16 billion vertices and 256 billion edges.
• Implemented on top of Combinatorial BLAS (i.e., uses 2D decomposition).

Beamer, B., Asanović, and Patterson, "Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search", MTAAP'13, in conjunction with IPDPS.
Direction-optimizing BFS reduces communication

k: average vertex degree of the graph.

Communication volume reduction based on an analytical model derived in the paper, assuming 4 bottom-up steps (typical for power-law graphs).
Direction-optimizing BFS saves energy

For a fixed-size real input, the direction-optimizing algorithm completes in the same time using 1/16th of the processors (and energy).
Betweenness Centrality using Sparse GEMM

• Parallel breadth-first search is implemented with sparse GEMM.
• Work-efficient, highly parallel implementation of Brandes' algorithm.

[Figure: one BFS step as A^T times a tall sparse matrix B of concurrent frontiers]

Betweenness centrality is a measure of influence. C_B(v): among all the shortest paths, what fraction pass through v?
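For reference, a plain sequential version of Brandes' algorithm (unweighted case) is sketched below. This single-node toy shows what the parallel sparse-GEMM formulation computes; it is not the slide's implementation.

```python
# Brandes' betweenness centrality, sequential sketch for unweighted
# directed graphs. adj: dict vertex -> list of out-neighbors, n vertices.
from collections import deque

def brandes_bc(adj, n):
    bc = [0.0] * n
    for s in range(n):
        # Phase 1: BFS from s, counting shortest paths (sigma).
        sigma = [0] * n; sigma[s] = 1
        dist = [-1] * n; dist[s] = 0
        preds = [[] for _ in range(n)]
        order = []
        q = deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in adj.get(v, []):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: dependency accumulation in reverse BFS order.
        delta = [0.0] * n
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += (sigma[v] / sigma[w]) * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

The two phases map to the sparse-GEMM view: phase 1 is the BFS expressed as A^T times a frontier matrix; phase 2 runs the same products in reverse to accumulate dependencies work-efficiently.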
Sparse Matrix-Matrix Multiplication

Why sparse matrix-matrix multiplication? Used for algebraic multigrid, graph clustering, betweenness centrality, graph contraction, subgraph extraction, cycle detection, quantum chemistry, …

How do dense and sparse GEMM compare?
Dense: lower bounds match algorithms; allows extensive data reuse.
Sparse: significant gap; inherent poor reuse?

What do we obtain here? An improved (higher) lower bound and new optimal algorithms, but only for random matrices: Erdős-Rényi(n,d) graphs, aka G(n, p=d/n).
The computation cube and sparsity-independent algorithms

Matrix multiplication: ∀(i,j) ∈ n × n, C(i,j) = Σ_k A(i,k)·B(k,j)

The computation (discrete) cube:
• A face for each (input/output) matrix A, B, C.
• A grid point for each multiplication.

1D, 2D, and 3D algorithms partition the cube differently. How about sparse algorithms?
Sparsity-independent algorithms: the assignment of grid points to processors is independent of the sparsity structure. In particular: if C_ij is nonzero, who holds it? All standard algorithms are sparsity independent.

Assumptions: sparsity-independent algorithms; input (and output) are sparse; the algorithm is load balanced.
Algorithms attaining lower bounds

Previous sparse classical lower bound [Ballard et al. SIMAX'11]:

    W = Ω( #FLOPs / (P·√M) ) = Ω( d²n / (P·√M) )

New lower bound for Erdős-Rényi(n,d) [here], expected (under some technical assumptions):

    W = Ω( min{ dn/√P, d²n/P } )

✗ No previous algorithm attains this bound!
✓ Two new algorithms achieve the bounds (up to a logarithmic factor):
  i.  Recursive 3D, based on [Ballard et al. SPAA'12]
  ii. Iterative 3D, based on [Solomonik & Demmel EuroPar'11]
Ballard, B., Demmel, Grigori, Lipshitz, Schwartz, and Toledo. Communication optimal parallel multiplication of sparse random matrices. In SPAA 2013.
3D Iterative Sparse GEMM [dense case: Solomonik and Demmel, 2011]

Optimal number of replicas (theoretically): c = Θ(p/d²)

3D algorithms: using extra memory (c replicas) reduces communication volume by a factor of c^(1/2) compared to 2D.
Performance results (Erdős-Rényi graphs)

Iterative 2D-3D performance results (R-MAT graphs)

[Figure: running time vs. number of grid layers c ∈ {1, 4, 16} for p = 4096 (scale 20), 16384 (scale 22), and 65536 (scale 24), broken down into imbalance, merge, hypersparse GEMM, communication reduce, and communication broadcast; the 3D algorithm is 2X faster, and 5X faster if load balanced]
All-pairs shortest-paths problem

• Input: directed graph with "costs" on edges.
• Find least-cost paths between all reachable vertex pairs.
• Classical algorithm: Floyd-Warshall.
• It turns out a previously overlooked recursive version is more parallelizable than the triple nested loop:

for k = 1:n   // the induction sequence
    for i = 1:n
        for j = 1:n
            if (w(i→k) + w(k→j) < w(i→j))
                w(i→j) := w(i→k) + w(k→j)

[Figure: the k = 1 case on a 5-vertex example]
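The triple loop above runs directly as code; a minimal runnable version (0-indexed, with float("inf") marking absent edges, an assumed input convention):

```python
# Classical Floyd-Warshall on a dense n x n cost matrix.
# w[i][j] = edge cost, float("inf") if no edge, w[i][i] = 0.

def floyd_warshall(w):
    n = len(w)
    d = [row[:] for row in w]            # copy so the input is untouched
    for k in range(n):                   # the induction sequence
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```

Every value of k depends on the previous one, which is exactly why this formulation resists parallelization and motivates the recursive blocked version on the following slides.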
[Figure: a 6-vertex weighted digraph and its weight matrix, partitioned into 2×2 blocks A, B, C, D over vertex halves V1, V2]

    0   5   9   ∞   ∞   4
    ∞   0   1  −2   ∞   ∞
    3   ∞   0   4   ∞   3
    ∞   ∞   5   0   4   ∞
    ∞   ∞  −1   ∞   0   ∞
   −3   ∞   ∞   ∞   7   0

A = A*;   % recursive call
B = AB;   C = CA;
D = D + CB;
D = D*;   % recursive call
B = BD;   C = DC;
A = A + BC;

+ is "min", × is "add"
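The eight-step recursion above can be sketched sequentially over the (min, +) semiring. This is an illustrative distances-only toy (dense lists of lists, n a power of two, no negative cycles assumed, no parent tracking), not the distributed implementation.

```python
# Recursive blocked APSP over the (min, +) semiring.
INF = float("inf")

def min_plus(A, B):
    """Semiring "matrix multiply": (min, +) product of square blocks."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def elem_min(A, B):
    """Semiring "addition": element-wise min of two blocks."""
    n = len(A)
    return [[min(A[i][j], B[i][j]) for j in range(n)] for i in range(n)]

def apsp(W):
    """Return W*, the all-pairs shortest-path closure of cost matrix W."""
    n = len(W)
    if n == 1:
        return [[min(W[0][0], 0)]]       # empty path has cost 0
    h = n // 2
    A = [row[:h] for row in W[:h]]; B = [row[h:] for row in W[:h]]
    C = [row[:h] for row in W[h:]]; D = [row[h:] for row in W[h:]]
    A = apsp(A)                          # A = A*
    B = min_plus(A, B)                   # B = AB
    C = min_plus(C, A)                   # C = CA
    D = elem_min(D, min_plus(C, B))      # D = D + CB
    D = apsp(D)                          # D = D*
    B = min_plus(B, D)                   # B = BD
    C = min_plus(D, C)                   # C = DC
    A = elem_min(A, min_plus(B, C))      # A = A + BC
    top = [A[i] + B[i] for i in range(h)]
    bot = [C[i] + D[i] for i in range(h)]
    return top + bot
```

Each step is a block-sized semiring GEMM, so the recursion replaces Floyd-Warshall's serial k-loop with matrix multiplications that parallelize well.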
[Figure: after C = CA, entry (3,2) of the distance matrix becomes 8 — the cost of the 3-1-2 path]

Distances:
    0   5   9   ∞   ∞   4
    ∞   0   1  −2   ∞   ∞
    3   8   0   4   ∞   3
    ∞   ∞   5   0   4   ∞
    ∞   ∞  −1   ∞   0   ∞
   −3   ∞   ∞   ∞   7   0

Parents:
Π =
    1 1 1 1 1 1
    2 2 2 2 2 2
    3 1 3 3 3 3
    4 4 4 4 4 4
    5 5 5 5 5 5
    6 6 6 6 6 6

If a(3,2) = a(3,1) + a(1,2), then Π(3,2) = Π(1,2).
D = D*: no change.

[Figure: after B = BD, entry (1,3) improves from 9 to 6 via the path 1-2-3]

Distances:
    0   5   6   ∞   ∞   4
    ∞   0   1  −2   ∞   ∞
    3   8   0   4   ∞   3
    ∞   ∞   5   0   4   ∞
    ∞   ∞  −1   ∞   0   ∞
   −3   ∞   ∞   ∞   7   0

Parents:
Π =
    1 1 2 1 1 1
    2 2 2 2 2 2
    3 1 3 3 3 3
    4 4 4 4 4 4
    5 5 5 5 5 5
    6 6 6 6 6 6

Path 1-2-3: if a(1,3) = a(1,2) + a(2,3), then Π(1,3) = Π(2,3).
Bandwidth: W_bc-2.5D(n, p) = O( n² / √(cp) )
Latency:   S_bc-2.5D(p) = O( √(cp) · log²(p) )

c: number of replicas. Optimal for any memory size!
Communication-avoiding APSP on distributed memory

[Figure: GFlops vs. number of compute nodes (1 to 1024) for n = 4096 and n = 8192, with replication factors c = 1, 4, 16]
Solomonik, B., and Demmel. "Minimizing communication in all-pairs shortest paths", Proceedings of IPDPS, 2013.
Conclusions