Satisfying Your Dependencies with SuperMatrix
September 17-20, 2007 Cluster 2007 1
Satisfying Your Dependencies with SuperMatrix
Ernie Chan
Motivation
Transparent Parallelization of Matrix Operations for SMP and Multi-Core Architectures
  Schedule submatrix operations out-of-order via dependency analysis
Programmability
  High-level abstractions to hide details of parallelization from user
Outline
  SuperMatrix
  Implementation
  Performance Results
  Conclusion
SuperMatrix
SuperMatrix
FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) &&
        FLA_Obj_width ( ATL ) < FLA_Obj_width ( A ) )
{
  b = min( FLA_Obj_length( ABR ), nb_alg );

  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                      /* ************* */   /* ******************** */
                                              &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                         b, b, FLA_BR );
  /*------------------------------------------------------------------*/
  FLA_LU_nopiv( A11 );
  FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
            FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
            FLA_ONE, A11, A12 );
  FLA_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
            FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
            FLA_ONE, A11, A21 );
  FLA_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
            FLA_MINUS_ONE, A21, A12, FLA_ONE, A22 );
  /*------------------------------------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                   A10, A11, /**/ A12,
                          /* ************** */  /* ****************** */
                            &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                            FLA_TL );
}
SuperMatrix
LU Factorization Without Pivoting
  Iteration 1
  [Figure: blocks labeled with this iteration's tasks: 1 LU, 4 TRSM, 4 GEMM]
SuperMatrix
LU Factorization Without Pivoting
  Iteration 2
  [Figure: blocks labeled with this iteration's tasks: 1 LU, 2 TRSM, 1 GEMM]
SuperMatrix
LU Factorization Without Pivoting
  Iteration 3
  [Figure: the last block labeled with this iteration's only task: 1 LU]
SuperMatrix
FLASH
  Matrix of matrices
SuperMatrix
FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) &&
        FLA_Obj_width ( ATL ) < FLA_Obj_width ( A ) )
{
  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                      /* ************* */   /* ******************** */
                                              &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                         1, 1, FLA_BR );
  /*------------------------------------------------------------------*/
  FLASH_LU_nopiv( A11 );
  FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
              FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
              FLA_ONE, A11, A12 );
  FLASH_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
              FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
  FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, A12, FLA_ONE, A22 );
  /*------------------------------------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                   A10, A11, /**/ A12,
                          /* ************** */  /* ****************** */
                            &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                            FLA_TL );
}

FLASH_Queue_exec( );
SuperMatrix
Analyzer
  Delay execution and place tasks on queue
    Tasks are function pointers annotated with input/output information
  Compute dependence information (flow, anti, output) between all tasks
  Create DAG of tasks
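
The three dependence kinds named above can be illustrated with a small classifier. This is only a sketch under stated assumptions, not the SuperMatrix or libflame implementation: the `dep_task_t` record, the integer block ids, and `classify_dependence` are hypothetical names introduced for illustration. A block that a task updates in place is listed in both its read set and its write slot.

```c
#include <stdbool.h>

/* Hypothetical task record (not libflame's): block ids are plain integers,
   and -1 marks an unused read slot. A block that is updated in place
   appears in both `reads` and `writes`. */
typedef struct {
    int reads[3];  /* blocks the task reads */
    int writes;    /* block the task writes */
} dep_task_t;

enum { DEP_NONE = 0, DEP_FLOW = 1, DEP_ANTI = 2, DEP_OUTPUT = 4 };

/* True if task t reads the given block. */
static bool task_reads(const dep_task_t *t, int block) {
    if (block < 0) return false;
    for (int i = 0; i < 3; i++)
        if (t->reads[i] == block) return true;
    return false;
}

/* Classify how `later` depends on `earlier` (earlier was enqueued first).
   Returns a bitwise OR of DEP_* flags. */
int classify_dependence(const dep_task_t *earlier, const dep_task_t *later) {
    int deps = DEP_NONE;
    if (task_reads(later, earlier->writes))   /* read after write  */
        deps |= DEP_FLOW;
    if (task_reads(earlier, later->writes))   /* write after read  */
        deps |= DEP_ANTI;
    if (earlier->writes >= 0 && earlier->writes == later->writes)
        deps |= DEP_OUTPUT;                   /* write after write */
    return deps;
}
```

For example, an LU task that updates block 0 followed by a TRSM task that reads block 0 and updates block 1 yields only a flow dependence, while two GEMM tasks that update the same block yield all three.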
SuperMatrix
Dispatcher
  Use DAG to execute tasks out-of-order in parallel
  Akin to Tomasulo’s algorithm and instruction-level parallelism on blocks of computation
    SuperScalar vs. SuperMatrix
SuperMatrix
Dispatcher
  4 threads
  5 x 5 matrix of blocks
  55 tasks
  18 stages
  [Figure: diagram of the LU, TRSM, and GEMM tasks]
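
The figure of 55 tasks for a 5 x 5 matrix of blocks can be checked by counting. Assuming the loop body shown earlier is applied one block at a time, an iteration with m remaining block rows and columns beyond the diagonal block enqueues 1 LU, 2m TRSM, and m*m GEMM tasks. The sketch below tallies these; `task_count_t` and `count_lu_tasks` are hypothetical names used only for this illustration.

```c
/* Task counts for LU without pivoting on an n x n matrix of blocks.
   Hypothetical helper for illustration; not part of libflame. */
typedef struct {
    int lu, trsm, gemm, total;
} task_count_t;

task_count_t count_lu_tasks(int n) {
    task_count_t c = { 0, 0, 0, 0 };
    for (int k = 0; k < n; k++) {
        int m = n - k - 1;   /* blocks to the right of / below the diagonal block */
        c.lu   += 1;         /* factor the diagonal block                 */
        c.trsm += 2 * m;     /* one row panel and one column panel solve  */
        c.gemm += m * m;     /* update every block of the trailing matrix */
    }
    c.total = c.lu + c.trsm + c.gemm;
    return c;
}
```

With n = 5 this gives 5 LU, 20 TRSM, and 30 GEMM tasks, 55 in total, matching the slide.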
Outline
  SuperMatrix
  Implementation
  Performance Results
  Conclusion
Implementation
Analyzer
  [Figure: Task Queue of LU, TRSM, and GEMM tasks, and the corresponding DAG of tasks]
Implementation
Analyzer
  FLASH routines enqueue tasks onto global task queue
  Dependencies between each task are calculated and stored in the task structure
    Each submatrix block stores the last task enqueued that writes to it
    Flow dependencies occur when a subsequent task reads that block
  DAG is embedded in task queue
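
The last-writer bookkeeping described above can be sketched as follows. This is an illustrative model only, assuming integer block ids and fixed-size arrays; `task_queue_t`, `queue_init`, and `queue_enqueue` are hypothetical names rather than libflame routines, and only flow dependencies are recorded, as on the slide.

```c
#include <assert.h>

#define MAX_BLOCKS 64
#define MAX_TASKS  128
#define MAX_READS  3

/* Hypothetical task entry: the flow dependencies are stored in the task. */
typedef struct {
    int n_deps;
    int deps[MAX_READS];   /* indices of tasks that must finish first */
} queued_task_t;

/* Hypothetical global queue: the DAG is embedded in the task entries. */
typedef struct {
    int n_tasks;
    queued_task_t tasks[MAX_TASKS];
    int last_writer[MAX_BLOCKS];  /* per block: last enqueued writer, -1 if none */
} task_queue_t;

void queue_init(task_queue_t *q) {
    q->n_tasks = 0;
    for (int b = 0; b < MAX_BLOCKS; b++)
        q->last_writer[b] = -1;
}

/* Enqueue a task that reads `reads[0..n_reads-1]` and writes block `writes`.
   Returns the index of the new task in the queue. */
int queue_enqueue(task_queue_t *q, const int *reads, int n_reads, int writes) {
    assert(q->n_tasks < MAX_TASKS);
    assert(n_reads >= 0 && n_reads <= MAX_READS);
    assert(writes >= 0 && writes < MAX_BLOCKS);

    int id = q->n_tasks++;
    queued_task_t *t = &q->tasks[id];
    t->n_deps = 0;

    for (int i = 0; i < n_reads; i++) {
        assert(reads[i] >= 0 && reads[i] < MAX_BLOCKS);
        int writer = q->last_writer[reads[i]];
        if (writer < 0) continue;          /* block not yet written by any task */
        int seen = 0;                      /* skip a writer already recorded    */
        for (int j = 0; j < t->n_deps; j++)
            if (t->deps[j] == writer) seen = 1;
        if (!seen) t->deps[t->n_deps++] = writer;
    }
    q->last_writer[writes] = id;           /* this task is now the last writer  */
    return id;
}
```

Enqueuing the five tasks of a 2 x 2 matrix of blocks (LU, two TRSM, GEMM, LU) in program order leaves the GEMM depending on both TRSM tasks and the final LU depending on the GEMM.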
Implementation
Dispatcher
  [Figure: Task Queue, Waiting Queue, and Threads, with LU, TRSM, and GEMM tasks]
Implementation
Dispatcher
  Place ready and available tasks on global waiting queue
    First task on task queue always ready and available
  Threads asynchronously dequeue tasks from head of waiting queue
  Once a task completes execution, notify dependent tasks and update waiting queue
  Loop until all tasks complete execution
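
The dispatch loop above can be approximated with a single-threaded simulation. This is a simplification, not the actual runtime: real threads dequeue asynchronously, whereas this sketch assumes lock-step stages in which up to `n_threads` tasks are taken from the head of the waiting queue, finish together, and then notify their dependents. The names `dag_edge_t` and `simulate_dispatch` are hypothetical.

```c
#include <assert.h>

#define SIM_MAX_TASKS 128

/* Hypothetical DAG edge: task `to` must wait for task `from`. */
typedef struct { int from, to; } dag_edge_t;

/* Simulate dispatching n_tasks tasks with n_threads workers in lock-step
   stages. Writes the execution order into order[0..n_tasks-1] and returns
   the number of stages, or -1 if some task never became ready. */
int simulate_dispatch(int n_tasks, const dag_edge_t *edges, int n_edges,
                      int n_threads, int *order) {
    assert(n_tasks >= 0 && n_tasks <= SIM_MAX_TASKS && n_threads > 0);

    int unmet[SIM_MAX_TASKS] = { 0 };   /* unsatisfied dependencies per task */
    int waiting[SIM_MAX_TASKS];         /* FIFO waiting queue                */
    int head = 0, tail = 0, done = 0, stages = 0;

    for (int e = 0; e < n_edges; e++)
        unmet[edges[e].to]++;
    for (int t = 0; t < n_tasks; t++)   /* tasks with no dependencies are ready */
        if (unmet[t] == 0) waiting[tail++] = t;

    while (head < tail) {               /* loop until all tasks have executed */
        int batch = tail - head;
        if (batch > n_threads) batch = n_threads;
        int first = head;
        head += batch;                  /* threads dequeue from the head */

        for (int i = first; i < first + batch; i++)
            order[done++] = waiting[i]; /* "execute" the task */

        /* On completion, notify dependents and enqueue those now ready. */
        for (int i = first; i < first + batch; i++)
            for (int e = 0; e < n_edges; e++)
                if (edges[e].from == waiting[i] && --unmet[edges[e].to] == 0)
                    waiting[tail++] = edges[e].to;
        stages++;
    }
    return done == n_tasks ? stages : -1;
}
```

For the five-task DAG of a 2 x 2 matrix of blocks (LU, then two TRSM, then GEMM, then LU), this model takes 4 stages with two threads and 5 stages with one.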
Outline
  SuperMatrix
  Implementation
  Performance Results
  Conclusion
Performance Results
Target Architectures

Architecture   Processing Elements   Peak (GFLOPS)   BLAS Library
Itanium2       16                    96.0            MKL 8.1
Xeon           8                     41.6            MKL 9.0
Opteron        8                     41.6            ACML 3.6
POWER5         8                     60.8            ESSL 4.2
Performance Results
GotoBLAS 1.13 installed on all machines
Supported Operations
  LAPACK-level functions
    Cholesky factorization
    LU factorization without pivoting
  All level-3 BLAS
    GEMM, TRMM, TRSM
    SYMM, SYRK, SYR2K
    HEMM, HERK, HER2K
Performance Results
Implementations
  SuperMatrix + serial BLAS
  FLAME + multithreaded BLAS
  LAPACK + multithreaded BLAS
Block size = 192
Processing elements = 8
Performance Results
SuperMatrix Implementation
  Fixed block size
    Varying block sizes can lead to better performance
    Experiments show 192 is generally the best
  Simplest scheduling
    No sorting to execute tasks on the critical path earlier
    No attempt to improve data locality in these experiments
Performance Results
  [Slides 24-29: performance plots, not reproduced in this transcript]
Outline
  SuperMatrix
  Implementation
  Performance Results
  Conclusion
Conclusion
Apply out-of-order execution techniques to schedule tasks
The whole is greater than the sum of the parts
  Exploit parallelism between operations
Despite having to calculate dependencies, SuperMatrix incurs only small performance penalties
Conclusion
Programmability
  Code at a high level without needing to deal with aspects of parallelization
Authors
Ernie Chan
Field G. Van Zee
Enrique S. Quintana-Ortí
Gregorio Quintana-Ortí
Robert van de Geijn

The University of Texas at Austin
Universidad Jaume I
Acknowledgements
We thank the Texas Advanced Computing Center (TACC) for access to their machines and their support
Funding
  NSF Grants
    CCF-0540926
    CCF-0702714
References
[1] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix Out-of-Order Scheduling of Matrix Operations on SMP and Multi-Core Architectures. In SPAA '07: Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116-125, San Diego, CA, USA, June 2007.
[2] Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A Multithreaded Runtime Scheduling System for Algorithms-by-Blocks. Submitted to PPoPP 2008.
[3] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert A. van de Geijn, and Field G. Van Zee. Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures. Submitted to Euromicro PDP 2008.
Conclusion
More Information
http://www.cs.utexas.edu/users/flame
Questions?