Matrix Multiplication on Two Interconnected Processors
Brett A. Becker and Alexey Lastovetsky
Heterogeneous Computing Laboratory
School of Computer Science and Informatics
University College Dublin
_______________________________________________________
HeteroPar’06 Barcelona Sept. 28, 2006
Outline
● Motivation and Goals
● Introduction: ‘Straight-Line’ Partitionings
● The ‘Square-Corner’ Partitioning - Minimizing the Total Volume of Communication
● MPI Experiments / Results
● Conclusion / Future Work
Motivation and Goals
● Partitioning algorithms for matrix-matrix multiplication (MMM) designed for n processors produce partitionings that are not always optimal on a small number of processors.
● We seek to lower the Total Volume of Communication by utilizing a new partitioning strategy.
● Our ultimate interest is to determine whether the Square-Corner partitioning is a viable technique for deployment on 2 interconnected clusters.
Background: Straight-Line Partitioning
Total Volume of Inter-Processor Communication (TVC) is proportional to the Sum of Half-Perimeters (S):
$S = \sum_{i=1}^{p} (w_i + h_i)$
Lower Bound (L) of S is when all partitions are square:
$L = 2 \sum_{i=1}^{p} \sqrt{a_i}$
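The two quantities above can be sketched in a few lines of Python. This is only an illustration: the function names and the 12 x 12 example with three column strips are assumptions, not taken from the talk.

```python
import math

def sum_half_perimeters(rects):
    """S: sum of (w_i + h_i) over all rectangular partitions."""
    return sum(w + h for w, h in rects)

def lower_bound(areas):
    """L: 2 * sum of sqrt(a_i), reached when every partition is a square."""
    return 2 * sum(math.sqrt(a) for a in areas)

# Hypothetical example: a 12 x 12 matrix cut into three column strips.
N = 12
rects = [(6, N), (4, N), (2, N)]        # (width, height) of each strip
areas = [w * h for w, h in rects]       # a_i = w_i * h_i

S = sum_half_perimeters(rects)
L = lower_bound(areas)
print(S, round(L, 2))
```

For strips like these S stays above L, because none of the partitions is square.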
Straight-Line Partitioning
From: Olivier Beaumont, Vincent Boudet, Fabrice Rastello and Yves Robert, “Matrix-Matrix Multiplication on Heterogeneous Platforms”, IEEE Transactions on Parallel and Distributed Systems, 2001, Vol.12, No.10, pp.1033-1051.
[Plot: average and minimum values of the ratio between S and L, for two million randomly generated areas]
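Background: Straight-Line Partitioning, 2 Processors
$S = \sum_{i=1}^{2} (w_i + h_i) = w_1 + h_1 + w_2 + h_2 = 3N$
$L = 2 \sum_{i=1}^{2} \sqrt{a_i} = 2(\sqrt{a_1} + \sqrt{a_2})$
$L \to 2N$ as the smaller area $a_2 \to 0$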
The Straight-Line Partitioning cannot meet the lower bound, L
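A small numeric check of this claim, as a sketch only. It assumes the matrix is split by a single vertical line and that the slower processor receives 1/(ratio + 1) of the columns; the function name and the `ratio` parameter are illustrative.

```python
import math

def straight_line(N, ratio):
    """Return (S, L) for an N x N matrix split by one vertical line.

    `ratio` is an assumed fast:slow power ratio, so the slow processor's
    strip holds 1/(ratio + 1) of the columns.
    """
    w2 = N / (ratio + 1)          # width of the slow processor's strip
    w1 = N - w2                   # width of the fast processor's strip
    S = (w1 + N) + (w2 + N)       # both strips have height N
    L = 2 * (math.sqrt(w1 * N) + math.sqrt(w2 * N))
    return S, L

results = {ratio: straight_line(1000, ratio) for ratio in (1, 3, 10, 100)}
for ratio, (S, L) in results.items():
    print(ratio, round(S, 1), round(L, 1))
```

Under these assumptions S stays at 3N for every ratio, while L never exceeds 2√2·N, so the gap between them does not close.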
Background: Straight-Line Partitioning, 2 Processors
Total Volume of Inter-Processor Communication: $\mathrm{TVC} = N^2$
$\mathrm{TVC} \to N^2$ as the smaller area $a_2 \to 0$
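One way to see where this value comes from, as a sketch under assumptions not stated on the slide: A, B and C are all cut into the same two column strips of widths $w_1$ and $w_2$ (with $w_1 + w_2 = N$), and each processor computes its own strip of C. Each processor then needs all of A, so it must receive the $N$ rows of the other processor's strip of A:

```latex
\mathrm{TVC}
  = \underbrace{N w_2}_{\text{received by processor 1}}
  + \underbrace{N w_1}_{\text{received by processor 2}}
  = N (w_1 + w_2)
  = N^2
```

The result does not depend on where the line is drawn, which is why shrinking the slower processor's strip does not reduce the communication volume.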
Introduction: Square-Corner Partitioning
$\mathrm{TVC} = 2NX$, where $X$ is the side length of the square partition
$\mathrm{TVC} \to 0$ as $X \to 0$
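The communication volumes of both partitionings can be checked by brute-force counting on a small matrix. The sketch below assumes that A, B and C share the same ownership map, that each processor computes the C elements it owns, and that the second processor's square of side X sits in the bottom-right corner; `tvc`, `square_corner` and `straight_line` are hypothetical helper names.

```python
def tvc(N, owner):
    """Count elements of A and B that must cross the link.

    owner(i, j) returns 0 or 1, the processor owning element (i, j)
    of A, B and C alike (an assumed layout).
    """
    need_rows = [set(), set()]    # rows of A each processor needs
    need_cols = [set(), set()]    # columns of B each processor needs
    for i in range(N):
        for j in range(N):
            p = owner(i, j)
            need_rows[p].add(i)   # C[i][j] uses all of row i of A
            need_cols[p].add(j)   # ... and all of column j of B
    total = 0
    for p in (0, 1):
        for i in need_rows[p]:
            total += sum(1 for k in range(N) if owner(i, k) != p)
        for j in need_cols[p]:
            total += sum(1 for k in range(N) if owner(k, j) != p)
    return total

N, X = 12, 3

def square_corner(i, j):
    return 1 if (i >= N - X and j >= N - X) else 0

def straight_line(i, j):
    return 1 if j >= N - X else 0

print(tvc(N, square_corner), 2 * N * X)   # counted vs. 2NX
print(tvc(N, straight_line), N * N)       # counted vs. N^2
```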
Square-Corner Partitioning
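$S = \sum_{i=1}^{2} (w_i + h_i) = w_1 + h_1 + w_2 + h_2$
$S \to 2N$ as the smaller area $a_2 \to 0$
$L = 2 \sum_{i=1}^{2} \sqrt{a_i} = 2(\sqrt{a_1} + \sqrt{a_2})$
$L \to 2N$ as the smaller area $a_2 \to 0$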
The Square-Corner Partitioning can meet the lower bound, L
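A worked version of both limits, under assumptions that go beyond the slide: the slower processor holds a square of side $X$ (a symbol introduced here for illustration), and the other processor's non-rectangular partition is counted with half-perimeter $2N$.

```latex
S = (N + N) + (X + X) = 2N + 2X \;\longrightarrow\; 2N \quad (X \to 0)

L = 2\left(\sqrt{N^2 - X^2} + \sqrt{X^2}\right)
  = 2\sqrt{N^2 - X^2} + 2X \;\longrightarrow\; 2N \quad (X \to 0)
```

Both expressions approach the same value as the square shrinks, whereas the Straight-Line value of $S$ stays at $3N$.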
Square-Corner Partitioning
[Plot: average and minimum values of the ratio between S and L, for 2 million randomly generated areas]
Power Ratio > 3:1
Adapted From: Olivier Beaumont, Vincent Boudet, Fabrice Rastello and Yves Robert, “Matrix-Matrix Multiplication on Heterogeneous Platforms”, IEEE Transactions on Parallel and Distributed Systems, 2001, Vol.12, No.10, pp.1033-1051.
Square-Corner Partitioning: Minimizing the TVC
Theorem: The Square-Corner Partitioning has a lower Total Volume of Communication than the Straight-Line Partitioning, provided the processor power ratio is > 3:1.
Theorem: The Total Volume of Communication is minimized when the slower processor’s partition is a square.
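The 3:1 threshold can be illustrated numerically. This sketch assumes the TVC is N² for the Straight-Line Partitioning and 2·N·side for the Square-Corner Partitioning, with the slower processor's square holding 1/(ratio + 1) of the matrix elements; the function names are illustrative.

```python
import math

def tvc_straight_line(N):
    """Assumed model: N^2 elements cross the link."""
    return N * N

def tvc_square_corner(N, ratio):
    """Assumed model: 2 * N * side, where the slower processor's square
    holds 1/(ratio + 1) of the N * N elements."""
    side = N / math.sqrt(ratio + 1)
    return 2 * N * side

N = 6500
for ratio in (1, 2, 3, 4, 5, 10):
    relative = tvc_square_corner(N, ratio) / tvc_straight_line(N)
    print(f"{ratio}:1  square-corner / straight-line = {relative:.3f}")
```

At a ratio of 3:1 the square has side N/2 and the two volumes coincide; above that ratio the Square-Corner volume is the smaller one.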
Results: Square-Corner Partitioning
Matrix-Matrix Multiplication, N=6500, Bandwidth = 80Mb/s
Lower TVC → Lower Communication Time → Lower Execution Time
Average Reduction in Communication Time = 45%
Average Reduction in Execution Time = 14%
Results: Square-Corner Partitioning
Matrix-Matrix Multiplication, N=6500, Bandwidth = 380Mb/s
Average Reduction in Communication Time = 44%
Lower TVC → Lower Communication Time → Lower Execution Time
Average Reduction in Execution Time = 10%
Square-Corner Partitioning: Overlapping Communication and Computation
A sub-partition of Processor 1’s C Partition is Immediately Calculable
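The immediately calculable region can be found by checking, for each element of C, whether its whole row of A and whole column of B are local. The sketch assumes Processor 2 owns an X x X square in the bottom-right corner of A, B and C; that placement and the function name are assumptions for illustration.

```python
def immediately_calculable(N, X):
    """Return Processor 1's C elements that need no remote data,
    assuming Processor 2 owns the X x X bottom-right corner of
    A, B and C (an assumed layout)."""
    def owned_by_p2(i, j):
        return i >= N - X and j >= N - X

    ready = []
    for i in range(N):
        for j in range(N):
            if owned_by_p2(i, j):
                continue                      # not Processor 1's element
            row_local = not any(owned_by_p2(i, k) for k in range(N))
            col_local = not any(owned_by_p2(k, j) for k in range(N))
            if row_local and col_local:       # row i of A and column j of B
                ready.append((i, j))          # are entirely on Processor 1
    return ready

N, X = 8, 3
ready = immediately_calculable(N, X)
print(len(ready), (N - X) ** 2)
```

Under this layout the ready elements form the (N - X) x (N - X) block opposite the corner square, so Processor 1 can start on it while the remaining data is still in transit.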
Square-Corner Partitioning: Overlapping Communication and Computation
Overlapping more than doubled the advantage of the Square-Corner algorithm.
● No Overlapping → 17% faster than Straight-Line algorithm.
● Overlapping → 39% faster than Straight-Line algorithm.

| Algorithm | Execution Time | Speedup |
| --- | --- | --- |
| Straight-Line | 83s | 0.94 |
| Square-Corner (No Overlapping) | 69s | 1.13 |
| Square-Corner (Overlapping) | 51s | 1.53 |
| Sequential | 78s | N/A |

MM Multiplication, N=4500, Bandwidth=100Mb/s, Ratio=5:1
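The speedups and percentages quoted on this slide can be recomputed from the execution times. The sketch assumes speedup means sequential time divided by parallel time, and that "faster" means the relative reduction in execution time against Straight-Line.

```python
times = {
    "Straight-Line": 83,
    "Square-Corner (No Overlapping)": 69,
    "Square-Corner (Overlapping)": 51,
}
sequential = 78   # seconds, from the slide

results = {}
for name, t in times.items():
    speedup = sequential / t                  # relative to sequential run
    saving = 1 - t / times["Straight-Line"]   # time saved vs. Straight-Line
    results[name] = (round(speedup, 2), round(100 * saving))
    print(f"{name}: speedup {speedup:.2f}, {saving:.0%} less time than Straight-Line")
```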
Square-Corner Partitioning: Two Cluster Architecture
Total of 20 Homogeneous Nodes in 2 Clusters
Square-Corner Partitioning: Two Clusters

| Algorithm | Execution Time | Speedup |
| --- | --- | --- |
| Straight-Line | 123s | 1.04 |
| Square-Corner | 115s | 1.11 |
| Sequential | 128s | N/A |

MM Multiplication, N=9000, Bandwidth=100Mb/s
All Machines are Homogeneous. One Cluster of 4, One Cluster of 16
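As a rough illustration of why this configuration is a candidate for the Square-Corner Partitioning, the sketch below treats each cluster as a single aggregate processor whose power is proportional to its node count. That simplification, and the use of the two-processor volume model (N² versus 2·N·side) for inter-cluster traffic, are assumptions and not results from the talk.

```python
import math

nodes_small, nodes_large = 4, 16      # homogeneous nodes per cluster (from the slide)
N = 9000

ratio = nodes_large / nodes_small     # assumed aggregate power ratio
side = N / math.sqrt(ratio + 1)       # square holding 1/(ratio + 1) of the elements
tvc_square_corner = 2 * N * side      # assumed inter-cluster volume
tvc_straight_line = N * N

print(f"power ratio {ratio:.0f}:1, square side ~{side:.0f}")
print(f"square-corner / straight-line volume = {tvc_square_corner / tvc_straight_line:.3f}")
```

With a 4:1 ratio the configuration sits above the 3:1 threshold, so under this model the inter-cluster volume is somewhat lower for the Square-Corner layout.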
Conclusions
● The Square-Corner Partitioning reduces the Total Volume of Communication provided the processor power ratio is > 3:1
● The possibility of Overlapping Communication and Computation can bring further reductions in Execution Time
● The Square-Corner Partitioning is viable on Two Clusters
_______________________________________________________
Current and Future Work
● We have successfully extended the Square-Corner Partitioning to Three Processors
To do:
● Experiment on more Two-Cluster architectures
● Overlap Communication and Computation on Two Clusters
● Extend the Three-Processor Algorithm to Three Clusters
_______________________________________________________
Acknowledgements
This work was supported by: