Collective communication2.5D algorithms
Modelling exascale
Improving communication performance in denselinear algebra via topology-aware collectives
Edgar Solomonik∗, Abhinav Bhatele†, and James Demmel∗
∗ University of California, Berkeley† Lawrence Livermore National Laboratory
Supercomputing, November 2011
Edgar Solomonik Mapping dense linear algebra 1/ 29
Collective communication2.5D algorithms
Modelling exascale
Outline
Collective communicationRectangular collectives
2.5D algorithms2.5D Matrix Multiplication2.5D LU factorization
Modelling exascaleMulticast performanceMM and LU performance
Edgar Solomonik Mapping dense linear algebra 2/ 29
Collective communication2.5D algorithms
Modelling exascaleRectangular collectives
Performance of multicast (BG/P vs Cray)
128
256
512
1024
2048
4096
8192
8 64 512 4096
Ban
dwid
th (M
B/s
ec)
#nodes
1 MB multicast on BG/P, Cray XT5, and Cray XE6
BG/PXE6XT5
Edgar Solomonik Mapping dense linear algebra 3/ 29
Collective communication2.5D algorithms
Modelling exascaleRectangular collectives
Why the performance discrepancy in multicasts?
I Cray machines use binomial multicastsI Form spanning tree from a list of nodesI Route copies of message down each branchI Network contention degrades utilization on a 3D torus
I BG/P uses rectangular multicastsI Require network topology to be a k-ary n-cubeI Form 2n edge-disjoint spanning trees
I Route in different dimensional orderI Use both directions of bidirectional network
Edgar Solomonik Mapping dense linear algebra 4/ 29
Collective communication2.5D algorithms
Modelling exascaleRectangular collectives
2D rectangular multicasts trees
root2D 4X4 Torus Spanning tree 1 Spanning tree 2
Spanning tree 3 Spanning tree 4 All 4 trees combined
[Watts and Van De Geijn 95]
Edgar Solomonik Mapping dense linear algebra 5/ 29
Collective communication2.5D algorithms
Modelling exascaleRectangular collectives
Another look at that first plot
How much better are rectangularalgorithms on P = 4096 nodes?
I Binomial collectives on XE6I 1/30th of link
bandwidth
I Rectangular collectives onBG/P
I 4X the link bandwidth
I 120X improvement inefficiency!
How can we apply this?
128
256
512
1024
2048
4096
8192
8 64 512 4096
Ban
dwid
th (M
B/s
ec)
#nodes
1 MB multicast on BG/P, Cray XT5, and Cray XE6
BG/PXE6XT5
Edgar Solomonik Mapping dense linear algebra 6/ 29
Collective communication2.5D algorithms
Modelling exascale
2.5D Matrix Multiplication2.5D LU factorization
Matrix multiplication
A
BA
B
A
B
AB
Edgar Solomonik Mapping dense linear algebra 7/ 29
Collective communication2.5D algorithms
Modelling exascale
2.5D Matrix Multiplication2.5D LU factorization
2D matrix multiplication
A
BA
B
A
B
AB
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
16 CPUs (4x4)
[Cannon 69], [Van De Geijn and Watts 97]
Edgar Solomonik Mapping dense linear algebra 8/ 29
Collective communication2.5D algorithms
Modelling exascale
2.5D Matrix Multiplication2.5D LU factorization
3D matrix multiplication
A
BA
B
A
B
AB
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
64 CPUs (4x4x4)
4 copies of matrices
[Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Bernsten 89]
Edgar Solomonik Mapping dense linear algebra 9/ 29
Collective communication2.5D algorithms
Modelling exascale
2.5D Matrix Multiplication2.5D LU factorization
2.5D matrix multiplication
A
BA
B
A
B
AB
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
CPU CPU CPU CPU
32 CPUs (4x4x2)
2 copies of matrices
Edgar Solomonik Mapping dense linear algebra 10/ 29
Collective communication2.5D algorithms
Modelling exascale
2.5D Matrix Multiplication2.5D LU factorization
Strong scaling matrix multiplication
0
20
40
60
80
100
256 512 1024 2048
Per
cent
age
of m
achi
ne p
eak
#nodes
2.5D MM on BG/P (n=65,536)
2.5D SUMMA2D SUMMA
ScaLAPACK PDGEMM
Edgar Solomonik Mapping dense linear algebra 11/ 29
Collective communication2.5D algorithms
Modelling exascale
2.5D Matrix Multiplication2.5D LU factorization
2.5D MM on 65,536 cores
0
20
40
60
80
100
8192 131072
Per
cent
age
of m
achi
ne p
eak
n
Matrix multiplication on 16,384 nodes of BG/P
12X faster
2.7X faster
Using 16 matrix copies
2D MM2.5D MM
Edgar Solomonik Mapping dense linear algebra 12/ 29
Collective communication2.5D algorithms
Modelling exascale
2.5D Matrix Multiplication2.5D LU factorization
Cost breakdown of MM on 65,536 cores
0
0.2
0.4
0.6
0.8
1
1.2
1.4
n=8192, 2D
n=8192, 2.5D
n=131072, 2D
n=131072, 2.5D
Exe
cutio
n tim
e no
rmal
ized
by
2D
Matrix multiplication on 16,384 nodes of BG/P
95% reduction in comm computationidle
communication
Edgar Solomonik Mapping dense linear algebra 13/ 29
Collective communication2.5D algorithms
Modelling exascale
2.5D Matrix Multiplication2.5D LU factorization
2.5D LU factorization
L₀₀
U₀₀
U₀₃
U₀₃
U₀₁
L₂₀L₃₀
L₁₀
(A)
U₀₀
U₀₀
L₀₀
L₀₀
U₀₀
L₀₀
Edgar Solomonik Mapping dense linear algebra 14/ 29
Collective communication2.5D algorithms
Modelling exascale
2.5D Matrix Multiplication2.5D LU factorization
2.5D LU factorization
L₀₀
U₀₀
U₀₃
U₀₃
U₀₁
L₂₀L₃₀
L₁₀
(A)
(B)
U₀₀
U₀₀
L₀₀
L₀₀
U₀₀
L₀₀
Edgar Solomonik Mapping dense linear algebra 15/ 29
Collective communication2.5D algorithms
Modelling exascale
2.5D Matrix Multiplication2.5D LU factorization
2.5D LU factorization
Look at how thisupdate is distributed.
A
BA
B
A
B
AB
Same 3D updatein multiplication
Edgar Solomonik Mapping dense linear algebra 16/ 29
Collective communication2.5D algorithms
Modelling exascale
2.5D Matrix Multiplication2.5D LU factorization
2.5D LU factorization
L₀₀
U₀₀
U₀₃
U₀₃
U₀₁
L₂₀L₃₀
L₁₀
(A)
(B)
U
L
(C)(D)
U₀₀
U₀₀
L₀₀
L₀₀
U₀₀
L₀₀
[Solomonik and Demmel, EuroPar ’11, Distinguished Paper]
Edgar Solomonik Mapping dense linear algebra 17/ 29
Collective communication2.5D algorithms
Modelling exascale
2.5D Matrix Multiplication2.5D LU factorization
2.5D LU on 65,536 cores
0
20
40
60
80
100
NO-pivot 2D
NO-pivot 2.5D
CA-pivot 2D
CA-pivot 2.5D
Tim
e (s
ec)
LU on 16,384 nodes of BG/P (n=131,072)
2X faster
2X faster
computeidle
communication
Edgar Solomonik Mapping dense linear algebra 18/ 29
Collective communication2.5D algorithms
Modelling exascale
2.5D Matrix Multiplication2.5D LU factorization
Rectangular (RCT) vs binomial (BNM) collectives
0
10
20
30
40
50
60
70
80
MM LU LU+PVT
Per
cent
age
of m
achi
ne p
eak
Binomial vs rectangular collectives on BG/P (n=131,072, p=16,384)
2D BNM2.5D BNM2.5D RCT
Edgar Solomonik Mapping dense linear algebra 19/ 29
Collective communication2.5D algorithms
Modelling exascale
Multicast performanceMM and LU performance
A model for rectangular multicasts
tmcast = m/Bn + 2(d + 1) · o + 3L + d · P1/d · (2o + L)
Our multicast model consists of 3 terms
1. m/Bn, the bandwidth cost
2. 2(d + 1) · o + 3L, the multicast start-up overhead
3. d · P1/d · (2o + L), the path overhead
Edgar Solomonik Mapping dense linear algebra 20/ 29
Collective communication2.5D algorithms
Modelling exascale
Multicast performanceMM and LU performance
A model for binomial multicasts
tbnm = log2(P) · (m/Bn + 2o + L)
I The root of the binomial tree sends log2(P) copies of message
I The setup overhead is overlapped with the path overhead
I We assume no contention
Edgar Solomonik Mapping dense linear algebra 21/ 29
Collective communication2.5D algorithms
Modelling exascale
Multicast performanceMM and LU performance
Model verification: one dimension
200
400
600
800
1000
1 8 64 512 4096 32768 262144
Ban
dwid
th (M
B/s
ec)
msg size (KB)
DCMF Broadcast on a ring of 8 nodes of BG/P
trect modelDCMF rectangle dput
tbnm modelDCMF binomial
Edgar Solomonik Mapping dense linear algebra 22/ 29
Collective communication2.5D algorithms
Modelling exascale
Multicast performanceMM and LU performance
Model verification: two dimensions
0
500
1000
1500
2000
1 8 64 512 4096 32768 262144
Ban
dwid
th (M
B/s
ec)
msg size (KB)
DCMF Broadcast on 64 (8x8) nodes of BG/P
trect modelDCMF rectangle dput
tbnm modelDCMF binomial
Edgar Solomonik Mapping dense linear algebra 23/ 29
Collective communication2.5D algorithms
Modelling exascale
Multicast performanceMM and LU performance
Model verification: three dimensions
0
500
1000
1500
2000
2500
3000
1 8 64 512 4096 32768 262144
Ban
dwid
th (M
B/s
ec)
msg size (KB)
DCMF Broadcast on 512 (8x8x8) nodes of BG/P
trect modelFaraj et al data
DCMF rectangle dputtbnm model
DCMF binomial
Edgar Solomonik Mapping dense linear algebra 24/ 29
Collective communication2.5D algorithms
Modelling exascale
Multicast performanceMM and LU performance
Modelling collectives at exascale (p = 262, 144)
0
20
40
60
80
100
120
140
160
180
200
0.125 1 8 64 512 4096 32768 262144
Ban
dwid
th (G
B/s
ec)
msg size (MB)
Exascale broadcast performance
trect 6Dtrect 5Dtrect 4Dtrect 3Dtrect 2Dtrect 1Dtbnm 3D
Edgar Solomonik Mapping dense linear algebra 25/ 29
Collective communication2.5D algorithms
Modelling exascale
Multicast performanceMM and LU performance
Modelling matrix multiplication at exascale
0.4
0.5
0.6
0.7
0.8
0.9
1
1 8 64
Par
alle
l effi
cien
cy
z dimension of partition
MM strong scaling at exascale (xy plane to full xyz torus)
2.5D with rectangular (c=z)2.5D with binomial (c=z)
2D with binomial
Edgar Solomonik Mapping dense linear algebra 26/ 29
Collective communication2.5D algorithms
Modelling exascale
Multicast performanceMM and LU performance
Modelling LU factorization at exascale
0
0.2
0.4
0.6
0.8
1
1 8 64
Par
alle
l effi
cien
cy
z dimension of partition
LU strong scaling at exascale (xy plane to full xyz torus)
2.5D with rectangular (c=z)2.5D with binomial (c=z)
2D LU with binomial
Edgar Solomonik Mapping dense linear algebra 27/ 29
Collective communication2.5D algorithms
Modelling exascale
Multicast performanceMM and LU performance
Conclusion
I Topology-aware schedulingI Present in IBM BG but not in Cray supercomputersI Avoids network contention/congestionI Enables optimized communication collectivesI Leads to simple communication performance models
I Future workI An automated framework for topology-aware mappingI Tensor computations mappingI Better models for network contention
Edgar Solomonik Mapping dense linear algebra 28/ 29
Collective communication2.5D algorithms
Modelling exascale
Multicast performanceMM and LU performance
Acknowledgements
I Krell CSGF DOE fellowship (DE-AC02-06CH11357)I Resources at Argonne National Lab and Lawrence Berkeley
National LabI DE-SC0003959I DE-SC0004938I DE-FC02-06-ER25786I DE-SC0001845I DE-AC02-06CH11357
I Berkeley ParLab fundingI Microsoft (Award #024263) and Intel (Award #024894)I U.C. Discovery (Award #DIG07-10227)
I Released by Lawrence Livermore National Laboratory asLLNL-PRES-514231
Edgar Solomonik Mapping dense linear algebra 29/ 29
Reductions
Backup slides
Edgar Solomonik Mapping dense linear algebra 30/ 29
Reductions
A model for rectangular reductions
tred = max[m/(8γ), 3m/β,m/Bn]+2(d+1)·o+3L+d ·P1/d ·(2o+L)
I Any multicast tree can be inverted to produce a reduction treeI The reduction operator must be applied at each node
I each node operates on 2m dataI both the memory bandwidth and computation cost can be
overlapped
Edgar Solomonik Mapping dense linear algebra 31/ 29
Reductions
Rectangular reduction performance on BG/P
50
100
150
200
250
300
350
400
8 64 512
Ban
dwid
th (M
B/s
ec)
#nodes
DCMF Reduce peak bandwidth (largest message size)
torus rectangle ringtorus binomial
torus rectangletorus short binomial
BG/P rectangular reduction performs significantly worse thanmulticast
Edgar Solomonik Mapping dense linear algebra 32/ 29
Reductions
Performance of custom line reduction
100
200
300
400
500
600
700
800
512 4096 32768
Ban
dwid
th (M
B/s
ec)
msg size (KB)
Performance of custom Reduce/Multicast on 8 nodes
MPI BroadcastCustom Ring Multicast
Custom Ring Reduce 2Custom Ring Reduce 1
MPI Reduce
Edgar Solomonik Mapping dense linear algebra 33/ 29
Reductions
A new LU latency lower bound
k₁
k₀
k₂
k₃
k₄
k
A₀₀
A₂₂
A₃₃
A₄₄
A
n
n
critical path
d-1,d-1d-1
A₁₁
flops lower bound requires d = Ω(√p) blocks/messages
bandwidth lower bound required d = Ω(√cp) blocks/messages
Edgar Solomonik Mapping dense linear algebra 34/ 29
Reductions
2.5D LU strong scaling (without pivoting)
0
20
40
60
80
100
256 512 1024 2048
Per
cent
age
of m
achi
ne p
eak
#nodes
LU without pivoting on BG/P (n=65,536)
ideal scaling2.5D LU
2D LU
Edgar Solomonik Mapping dense linear algebra 35/ 29
Reductions
Strong scaling of 2.5D LU with tournament pivoting
0
20
40
60
80
100
256 512 1024 2048
Per
cent
age
of m
achi
ne p
eak
#nodes
LU with tournament pivoting on BG/P (n=65,536)
ideal scaling2.5D LU
2D LUScaLAPACK PDGETRF
Edgar Solomonik Mapping dense linear algebra 36/ 29
Top Related