Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model...
-
Upload
penelope-casey -
Category
Documents
-
view
215 -
download
1
Transcript of Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model...
![Page 1: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/1.jpg)
Implementing Dense Linear Algebra Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Algorithms on Multi-Core Processors
Using Dataflow Execution ModelUsing Dataflow Execution Model
Jakub KurzakJack Dongarra
University of Tennessee
John ShalfLawrence Berkeley National Laboratory
SIAM Parallel Processing for Scientific ComputingMS18: Dataflow 2.0: The Re-emergence and Evolution
of Dataflow Programming Techniques for Parallel ComputingMarch 13, 2008
![Page 2: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/2.jpg)
New Class of AlgorithmsNew Class of Algorithms
LU, QR, Cholesky, one-sided, factorizations Inspired by out-of-memory algorithmsProcessing of input by small tilesBlock data layoutFine granularitySuitable for dynamic schedulingSuitable for dynamic scheduling
LAPACK Working Note 190LAPACK Working Note 191 LAPACK Working Note 184
![Page 3: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/3.jpg)
New Class of AlgorithmsNew Class of Algorithms
Classic multi-core (x86)Faster than MKLFaster than ACMLWay faster than LAPACK
Exotic multi-core (CELL)Extremely fastNo alternatives
![Page 4: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/4.jpg)
New Algorithms – LU on x86New Algorithms – LU on x86
![Page 5: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/5.jpg)
New Algorithms – QR on x86New Algorithms – QR on x86
![Page 6: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/6.jpg)
New Algorithms – Cholesky on x86New Algorithms – Cholesky on x86
![Page 7: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/7.jpg)
New Algorithms – Cholesky on the CELLNew Algorithms – Cholesky on the CELL
![Page 8: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/8.jpg)
Going “Out of Sync”Going “Out of Sync”
CELL Cholesky – 8 cores
CELL Cholesky – 16 cores
![Page 9: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/9.jpg)
PLASMA FrameworkPLASMA Framework
Don't reinvent the wheelSCHEDULEHeNCEParametrized DAGs (Cosnard, Jeannot)Dataflow Systems (Simulink, Gedae)Systolic Arrays
![Page 10: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/10.jpg)
PLASMA Goals PLASMA Goals (WHAT)(WHAT)
ScalabilityMassive parallelism
PerformanceEfficient runtime
PortabilityCluster / MPPSMP / CMPOut-of-memory / CELL
![Page 11: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/11.jpg)
PLASMA Principles PLASMA Principles (HOW)(HOW)
Data-driven (dynamic) execution Explicit expression of parallelism Implicit communication Component architecture
“Language” Runtime Tools
Dataflow visualization DAG visualization Profiling Tracing ..... There is no such thing as autoparallelization.
BTW,
There is no such thing as auto SIMD'zation.
(see LAPACK Working Note 189)
![Page 12: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/12.jpg)
Current InfrastructureCurrent Infrastructure
MPIMPI
ScaLAPACScaLAPACKK
LAPACKLAPACKPBLASPBLAS
BLASBLAS
BLACSBLACS
![Page 13: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/13.jpg)
Current InfrastructureCurrent Infrastructure
MPIMPI
LAPACKLAPACKPBLASPBLAS
BLASBLAS
BLACSBLACS
ScaLAPACScaLAPACKK
![Page 14: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/14.jpg)
Target InfrastructureTarget Infrastructure
TRACTRACEE
VIVISS PLASMAPLASMA
““LanguageLanguage””
μ-kernelsμ-kernels(“μ(“μBLAS”)BLAS”)
RDMA-likeRDMA-like(“μMPI(“μMPI”)”)
PROFPROFPLASMAPLASMARuntimeRuntime
![Page 15: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/15.jpg)
DAG / Task GraphDAG / Task Graph
view computation as a DAG
use dependency / data driven scheduling
pursue the critical path
![Page 16: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/16.jpg)
DAGs can be hugeDifficult / impossible to storeDifficult / impossible to manage
I don't need to build the DAGI need the ability to traverse the DAG
DAG / Task GraphDAG / Task Graph
![Page 17: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/17.jpg)
Producer Taskknows where its output goes
Consumer Taskknows where its input comes from
Nobody knows the DAG
Parents know their children
Children know their parents
DAG / Task GraphDAG / Task Graph
![Page 18: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/18.jpg)
P
P
P
P
T
T
T
T
T T
0
1
2
3
0,00,0 0,1 0,2
1,0 1,1
2,0
0
1
2
3
0, 1, 2, 3
Task ID / Task CoordinatesTask ID / Task Coordinates
![Page 19: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/19.jpg)
P
P
P
P
T
T
T
T
T T
0
1
2
3
0,00,0 0,1 0,2
1,0 1,1
2,0
// Task NameP
// Execution Spacei = 0 ... N
// Parallel Partitioningi % Ncores = coreID
// Parametersinput T ( i – 1, 0 )output T ( i, 0 ... i )
Task DefinitionTask Definition
![Page 20: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/20.jpg)
P
P
P
P
T
T
T
T
T T
0
1
2
3
0,00,0 0,1 0,2
1,0 1,1
2,0
// Task NameT
// Execution Spacei = 0 ... N – 1j = 0 ... N – 1 – i
// Parallel Partitioningi % Ncores = coreID
// Parametersinput 1 P ( i )input 2 T ( i – 1, j + 1 )output T ( i + 1, j – 1 )
Task DefinitionTask Definition
![Page 21: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/21.jpg)
// NamePOTRF(k)
// Execution spacek = 0..SIZE
// Parallel partitioningk % GRIDrows == rowRANKk % GRIDcols == colRANK
// ParametersINOUT T <- (k == 0) ? IN(k, k) : T SYRK(k, k) -> T TRSM(k, k+1..SIZE) -> OUT(k, k)
// Bodylapack_spotrf(lapack_lower, NB, T, NB, &info);
// NameSYRK(k, n)
// Execution spacek = 0..SIZEn = 0..k-1
// Parallel partitioningk % GRIDrows == rowRANKk % GRIDcols == colRANK
// ParametersIN A <- C TRSM(n, k)INOUT T <- (n == 0) ? IN(k, k) : T SYRK(k, n-1) -> (n == k-1) ? T POTRF(k) : T SYRK(k, n+1)
// Bodycblas_ssyrk(CblasColMajor, CblasLower, CblasNoTrans, NB, NB, 1.0, A, NB, 1.0, T, NB);
// NameTRSM(k, m)
// Execution spacek = 0..SIZEm = k+1..SIZE
// Parallel partitioningm % GRIDrows == rowRANKk % GRIDcols == colRANK
// ParametersIN T <- T POTRF(k)INOUT C <- (k == 0) ? IN(m, k) : C GEMM(k, m, k-1) -> A GEMM(m, m+1..SIZE, k) -> B GEMM(k+1..m-1, m, k) -> A SYRK(m, k) -> OUT(k, m)
// Bodycblas_strsm(CblasColMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit, NB, NB, 1.0, T, NB, B, NB);
// NameGEMM(k, m, n)
// Execuiton spacek = 0..SIZEm = k+1..SIZEn = 0..k-1
// Parallel partitioningm % GRIDrows == rowRANKk % GRIDcols == colRANK
// ParametersIN A <- C TRSM(n, k)IN B <- C TRSM(n, m)INOUT C <- (n == 0) ? IN(m, n) : C GEMM(k, m, n-1) -> (n == k-1) ? C TRSM(k, m) : C GEMM(k, m, n+1)
// Bodycblas_sgemm(CblasColMajor, CblasNoTrans, CblasTrans, NB, NB, NB, 1.0, B, NB, A, NB, 1.0, C, NB);
// GlobalsGRIDrows, GRIDcols, NB
![Page 22: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/22.jpg)
depreq
datadata data
depdep req
req
execute
notify
notifynotify
Task Life-CycleTask Life-Cycle
![Page 23: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/23.jpg)
data
requ
est
POTRF
GEMM
TRSM
res
pons
e
response
data
request
POTRFcompletionnotifications
GEMMcompletionnotifications
TRSMdataavailable
TRSMdependencies satisfied
TRSMcompletion notifications
Task Life-CycleTask Life-Cycle
![Page 24: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/24.jpg)
no dependencies satisfied
some dependencies satisfied waiting for all dependencies
all dependencies satisfied some data delivered waiting for all data
all data delivered waiting for execution
Task Life-CycleTask Life-Cycle
![Page 25: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/25.jpg)
Existing compiler technologytask body – standard C / Fortranformulas – C / Fortran expressionsmetadata – human readable ASCII
CompilationCompilation
![Page 26: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/26.jpg)
Schedulermanagement of task bins
Communication systemschared mem – pointer aliasingdistrib mem – messagingout-of-memory – mix
RuntimeRuntime
![Page 27: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/27.jpg)
Standalone ASCII codeLittle / No error checking
Visual tools to assist developmentplot dataflow diagraminstantiate small DAG
Visual ToolsVisual Tools
![Page 28: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/28.jpg)
POTRF
T
TRSM
T
C
SYRK
A
T
GEMM
B
C
A
OUTOUTININ
Dataflow VisualizationDataflow Visualization
![Page 29: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/29.jpg)
Task Graph VisualizationTask Graph Visualization
Tile LU Tile QR
Tile Cholesky
![Page 30: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/30.jpg)
Connectivity / TopologyConnectivity / Topology
4D cubequad-core
![Page 31: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/31.jpg)
Connectivity / TopologyConnectivity / Topology
communication matrix should resemble connectivity matrix
![Page 32: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.](https://reader034.fdocuments.in/reader034/viewer/2022050714/56649e2a5503460f94b1889f/html5/thumbnails/32.jpg)
Thank You !Thank You !
Questions ?Questions ?