Efficient Sparse Matrix-Matrix Multiplication on Heterogeneous High Performance Systems
AACEC 2010 – Heraklion, Crete, Greece
Jakob Siegel1, Oreste Villa2, Sriram Krishnamoorthy2, Antonino Tumeo2 and Xiaoming Li1
1 University of Delaware   2 Pacific Northwest National Laboratory
September 24th, 2010
Overview

• Introduction
• Cluster level
• Node level
• Results
• Conclusion
• Future Work
Sparse Matrix-Matrix Multiply - Challenges

The efficient implementation of sparse matrix-matrix multiplication on HPC systems poses several challenges:
• Large size of the input matrices, e.g. 10^6 × 10^6 with 30 × 10^6 nonzero elements
• Compressed representation
• Partitioning
• Density of the output matrices
• Load balancing: large differences in density cause large differences in computation times

Matrices taken from Timothy A. Davis, University of Florida Sparse Matrix Collection, available online at: http://www.cise.ufl.edu/davis/sparse
Sparse Matrix-Matrix Multiply

Cross-cluster implementation:
• Partitioning
• Data distribution
• Load balancing
• Communication/scaling
• Result handling

In-node implementation:
• Multiple efficient SpGEMM algorithms
• CPU/GPU implementation
• Double buffering
• Exploiting heterogeneity
Sparse Matrix-Matrix Multiply - Cluster level

• Blocking: the block size depends on the sparsity of the input matrices and on the number of processing elements, chosen so that NumOfBlocksX × NumOfBlocksY >> NumOfProcessingElements.
• Data layout: which format and ordering allow easy and fast access?
• Communication and storage are implemented using Global Arrays (GA), which offers a set of primitives for non-blocking operations and for contiguous and non-contiguous data transfers.
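As a sketch, the blocking rule can be turned into a simple grid-size heuristic. Here `choose_grid` and the fixed oversubscription factor are illustrative assumptions, not part of the presented framework:

```python
import math

def choose_grid(num_pes, oversubscription=16):
    """Pick a square tile grid so that
    NumOfBlocksX * NumOfBlocksY >> NumOfProcessingElements,
    leaving enough tasks for dynamic load balancing.
    The '>>' is approximated here by a fixed oversubscription factor."""
    target = num_pes * oversubscription
    n = math.ceil(math.sqrt(target))
    return n, n
```

For example, 16 nodes with 7 processes each (112 processing elements) would get at least a 43 × 43 tile grid.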
Sparse Matrix-Matrix Multiply - Data representation and Tiling

[Figure: blocked matrices A, B and C with C = A × B]

• Blocked matrix representation: each block is stored in CSR* form.

Example block and its CSR representation:

 1 -1  0  0  0
 0  5  0  0  0
 0  0  4  6  0
-2  0  2  7  0
 0  0  0  0  5

data (1 -1 5 4 6 -2 2 7 5)
col  (0 1 1 2 3 0 2 3 4)
row  (0 2 3 5 8 9)

*CSR: Compressed Sparse Row
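The CSR form above can be reproduced with a short sketch (`dense_to_csr` is an illustrative helper name, not part of the presented framework):

```python
def dense_to_csr(m):
    """Convert a dense matrix (list of rows) to CSR arrays."""
    data, col, row = [], [], [0]
    for r in m:
        for j, x in enumerate(r):
            if x != 0:
                data.append(x)
                col.append(j)
        row.append(len(data))  # row pointer: nonzeros seen so far
    return data, col, row

block = [
    [ 1, -1, 0, 0, 0],
    [ 0,  5, 0, 0, 0],
    [ 0,  0, 4, 6, 0],
    [-2,  0, 2, 7, 0],
    [ 0,  0, 0, 0, 5],
]
data, col, row = dense_to_csr(block)
# data = [1, -1, 5, 4, 6, -2, 2, 7, 5]
# col  = [0, 1, 1, 2, 3, 0, 2, 3, 4]
# row  = [0, 2, 3, 5, 8, 9]
```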
Sparse Matrix-Matrix Multiply - Data representation and Tiling

[Figure: tiles of A serialized into the GA space as data/column/row arrays: Tile 0, Tile 2, ...]

Matrix A:
• The single CSR tiles are stored serialized into the GA space.
• Tile sizes and offsets are stored in a 2D array.
• Tiles with 0 nonzero elements are not represented in the GA dataset.
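A minimal sketch of this serialization scheme, assuming each tile is already in CSR form. All names are illustrative; the real implementation stores the buffer in Global Arrays:

```python
def serialize_tiles(tiles):
    """Serialize the nonempty CSR tiles into one flat buffer (a stand-in
    for the GA space) and record (offset, size) for each tile position."""
    buffer = []
    meta = {}  # (tile_row, tile_col) -> (offset, size); empty tiles omitted
    for pos, (data, col, row) in tiles.items():
        if not data:                # tiles with 0 nonzeros are not stored
            continue
        payload = data + col + row  # serialized CSR arrays of this tile
        meta[pos] = (len(buffer), len(payload))
        buffer.extend(payload)
    return buffer, meta

tiles = {
    (0, 0): ([1, -1], [0, 1], [0, 2]),  # tile with 2 nonzeros
    (0, 1): ([], [], [0]),              # empty tile: not represented
    (1, 1): ([5], [0], [0, 1]),
}
buffer, meta = serialize_tiles(tiles)
# meta holds the offsets: tile (0, 0) at offset 0, tile (1, 1) at offset 6
```

The `meta` table plays the role of the 2D array of tile sizes and offsets described above.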
Sparse Matrix-Matrix Multiply - Data representation and Tiling

Matrix B:
• Tiles are serialized in transposed order.
• Depending on the algorithm used to calculate the single tiles, the data inside a tile can be stored transposed or not transposed.
• For the Gustavson algorithm, the representation of the data inside the tiles themselves is not transposed.

not transposed:          or transposed:
 1 -1  0  0  0            1  0  0 -2  0
 0  5  0  0  0           -1  5  0  0  0
 0  0  4  6  0            0  0  4  2  0
-2  0  2  7  0            0  0  6  7  0
 0  0  0  0  5            0  0  0  0  5
Sparse Matrix-Matrix Multiply - Tasking and Data Movement

[Figure: blocks of C numbered 0, 1, 2, 3, 4, 5, 6, 7, 8, ... distributed over nodes 0, 1, ..., N-1]

• Each block of C represents a task.
• Nodes grab tasks, and the additional data they need, whenever they have computational power available.
• Results are stored locally.
• The metadata of the result blocks in each node is distributed to determine the offsets of the tiles in the GA space.
• Tiles are then put into the GA space in the right order.
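The task-grabbing scheme can be sketched with a shared counter (a simplified, single-machine stand-in for the GA-based distributed queue; all names are illustrative):

```python
import threading

class TaskQueue:
    """Global counter from which each worker atomically grabs the next task id."""
    def __init__(self, num_tasks):
        self.next_task = 0
        self.num_tasks = num_tasks
        self.lock = threading.Lock()

    def grab(self):
        """Return the next unclaimed task id, or None when all are taken."""
        with self.lock:
            if self.next_task >= self.num_tasks:
                return None
            task = self.next_task
            self.next_task += 1
            return task
```

Faster workers call `grab()` more often and so naturally receive more tasks, which is the load-balancing effect the results section demonstrates.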
Sparse Matrix-Matrix Multiply - Tasking and Data Movement

[Figure: C = A × B with A split into row stripes 0 ... Sa-1 and B into column stripes 0 ... Sb-1, over nodes 0 ... N-1]

• Each node fetches the data needed by the task it handles.
• E.g., for task/tile 5 the node has to load the data of stripe sa = 1 of A and stripe sb = 0 of B.
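For a row-major numbering of the C tiles, the stripes a task needs follow directly from its id. A sketch, where the width of the tile grid is an assumed parameter:

```python
def stripes_for_task(task_id, tiles_per_row):
    """Map a task (one tile of C, numbered row-major) to the
    row stripe of A and column stripe of B it needs."""
    sa = task_id // tiles_per_row  # row stripe of A
    sb = task_id % tiles_per_row   # column stripe of B
    return sa, sb
```

With 5 tiles per row, task 5 maps to sa = 1 and sb = 0, matching the example above.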
Sparse Matrix-Matrix Multiply - Gustavson

The algorithm is based on the equation

c_i = Σ_{v : a_iv ≠ 0} a_iv · b_v

i.e. the i-th row of C is a linear combination of the rows v of B for which a_iv is nonzero, where A has the dimensions p×q and B has q×r.

Example, C = A × B:

A =  2  3  0  0  0  0        B =  1 -1  0  0  0
     0 -1  0  2  3  0             0  5  0  0  0
     0  0 -3  1  0  0             0  0  4  6  0
     0  0  2  3  0  0            -2  0  0  7 -4
     1  0  0  2  2  0             0  1  0  0  5
     0  0  0  2 -1  4             0  0  0  1  2

A in CSR: data (2,3,-1,2,3,-3,1,2,3,1,2,2,2,-1,4), col (0,1,1,3,4,2,3,2,3,0,3,4,3,4,5), row (0,2,5,7,9,12,15)
B in CSR: data (1,-1,5,4,6,-2,7,-4,1,5,1,2), col (0,1,1,2,3,0,3,4,1,4,3,4), row (0,2,3,5,8,10,12)

Accumulation of row i = 1 of C (the nonzeros of A in row 1 are a_1,1 = -1, a_1,3 = 2, a_1,4 = 3):

i = 1:               ( 0  0  0  0  0)
i = 1, v = 1: -1·b_1 → ( 0 -5  0  0  0)
i = 1, v = 3: + 2·b_3 → (-4 -5  0 14 -8)
i = 1, v = 4: + 3·b_4 → (-4 -2  0 14  7)
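The row-by-row scheme above can be sketched as a sequential SpGEMM on CSR inputs (illustrative only; the implementations in the talk are parallel):

```python
def spgemm_gustavson(a, b, r):
    """C = A*B for CSR matrices a, b given as (data, col, row) triples;
    r is the number of columns of B. Row i of C is accumulated as the
    sum of a_iv * (row v of B) over the nonzeros a_iv of row i of A."""
    a_data, a_col, a_row = a
    b_data, b_col, b_row = b
    c_data, c_col, c_row = [], [], [0]
    for i in range(len(a_row) - 1):
        acc = [0] * r                      # dense accumulator for row i
        for k in range(a_row[i], a_row[i + 1]):
            v, a_iv = a_col[k], a_data[k]  # nonzero a_iv selects row v of B
            for l in range(b_row[v], b_row[v + 1]):
                acc[b_col[l]] += a_iv * b_data[l]
        for j, x in enumerate(acc):        # compress row i back to CSR
            if x != 0:
                c_data.append(x)
                c_col.append(j)
        c_row.append(len(c_data))
    return c_data, c_col, c_row

# The A and B of the example above:
A = ([2, 3, -1, 2, 3, -3, 1, 2, 3, 1, 2, 2, 2, -1, 4],
     [0, 1, 1, 3, 4, 2, 3, 2, 3, 0, 3, 4, 3, 4, 5],
     [0, 2, 5, 7, 9, 12, 15])
B = ([1, -1, 5, 4, 6, -2, 7, -4, 1, 5, 1, 2],
     [0, 1, 1, 2, 3, 0, 3, 4, 1, 4, 3, 4],
     [0, 2, 3, 5, 8, 10, 12])
c_data, c_col, c_row = spgemm_gustavson(A, B, 5)
# row i = 1 of C comes out as (-4, -2, 0, 14, 7)
```

The dense accumulator followed by a compression pass mirrors the structure of the CUDA implementation described on the next slide.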
Sparse Matrix-Matrix Multiply - Gustavson

In the CUDA implementation:
• Each result row c_i is handled by the 16 threads of a half warp (1/2W).
• For each nonzero element a_iv in A, one 1/2W performs the multiplications with row b_v in parallel.
• The results are kept in dense form until all calculations are complete.
• Then the results are compressed on the device.

[Figure: half-warp 0, half-warp 1, half-warp 2, ... each accumulate one row of the result, from an all-zero dense buffer to the final dense C = A × B:

 2 13   0   0   0
-4 -2   0  14   7
-2  0 -12 -11  -4
-6  0   8  33 -12
-3  1   0  14   2
-4 -1   0  18  -5]
Sparse Matrix-Matrix Multiply - Case Study

Midsize matrix from the University of Florida Sparse Matrix Collection*:
• 2D/3D problem
• size 72,000 × 72,000 with 28,715,634 nonzero elements
• blocked into 5041 (71 × 71) tiles
• the matrix is multiplied with itself

[Figure: density plot of the tiles; darker colors represent higher densities of nonzero elements.]

*http://www.cise.ufl.edu/davis/sparse
Sparse Matrix-Matrix Multiply - Results

[Figure: scaling of SpGEMM with the different approaches. Execution time in seconds (0 to 350) over number of nodes (1, 2, 4, 8, 16), for Static, LB-Hom and LB-Het.]
Sparse Matrix-Matrix Multiply - Results

[Figure: number of tasks (200 to 400) executed by each node (node ids 0 to 15), for Static, LB-Hom and LB-Het.]

[Figure: time in seconds (0 to 30) to complete all assigned tasks per process, by node id (7 processes per node), for Static, LB-Hom and LB-Het.]
Sparse Matrix-Matrix Multiply - Results

Even inside a node, where different compute elements are used, the load-balancing mechanism still performs well:
• The processes using the CUDA devices complete almost 5× more tasks than the pure CPU processes.

[Figure: tasks per core (0 to 120) in one of the nodes, for Static (CPU0 to CPU6) and LB-Het (CUDA1, CPU1, CPU3, ...).]

[Figure: time in seconds (0 to 25) to complete all assigned tasks for each processor, for Static (CPU0 to CPU6) and LB-Het (CUDA1, CPU1, CPU3, ...).]
Sparse Matrix-Matrix Multiply

We presented a parallel framework using a co-design approach which takes into account the characteristics of:
• the selected application (here SpGEMM)
• the underlying hardware (a heterogeneous cluster)

• The difficulties of static partitioning approaches show that a global load-balancing method is needed.
• Different optimized implementations of the Gustavson algorithm are presented and are used depending on the available compute element.
• For the selected case study, optimal load balancing with uniform computation time across all processing elements is achieved.
Future Work - General Tasking Framework for Heterogeneous GPU Clusters

• More general task definition
• More flexibility in input and output data definition
• Exploring the limits imposed on tasks by a heterogeneous system
• A feedback loop during execution that allows more efficient assignment of tasks
• Introducing heterogeneous execution on GPU and CPU in one process/core
• Locality-aware task queue(s) and work stealing
• Task reinsertion or generation at the node level
Thank you
Questions?