High Performance Linear Algebra with Intel Xeon Phi Coprocessors
University of Tennessee, Knoxville
Stan Tomov, Azzam Haidar, Khairul Kabir, J. Dongarra, M. Gates, Y. Jia, P. Luszczek
ORNL Seminar, TN September 9, 2013
MAGMA: LAPACK for hybrid systems
• MAGMA: a new generation of high-performance linear algebra libraries
  • Provides LAPACK/ScaLAPACK functionality on hybrid architectures
  • http://icl.cs.utk.edu/magma/
• MAGMA MIC 1.0
  • For hybrid, shared-memory systems featuring Intel Xeon Phi coprocessors
  • Includes the main dense linear algebra algorithms: one- and two-sided factorizations and solvers
• Open-source software ( http://icl.cs.utk.edu/magma )
• MAGMA developers & collaborators
  • UTK, UC Berkeley, UC Denver, INRIA (France), KAUST (Saudi Arabia)
  • Community effort, similar to LAPACK/ScaLAPACK
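For orientation, a minimal calling sketch is shown below; it assumes the LAPACK-style CPU interface of the MAGMA 1.4 release (magma_init, magma_dgetrf, magma_finalize), and the exact routine names in the MIC port may differ.

/* Minimal sketch of MAGMA's LAPACK-style interface (assumes the
 * MAGMA 1.4 CPU-interface naming; MIC-port names may differ). */
#include <stdlib.h>
#include "magma.h"

int main(void)
{
    magma_int_t n = 8192, lda = n, info;
    magma_int_t *ipiv = malloc(n * sizeof(magma_int_t));
    double *A = malloc((size_t)lda * n * sizeof(double));

    magma_init();                                /* initialize the library and devices */
    /* ... fill A with the matrix to factor ... */
    magma_dgetrf(n, n, A, lda, ipiv, &info);     /* hybrid LU, LAPACK calling sequence */
    magma_finalize();

    free(ipiv); free(A);
    return (int)info;
}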
Key aspects of MAGMA
A New Generation of DLA Software
Software/Algorithms follow hardware evolution in time:
• LINPACK (70's): vector operations; relies on Level-1 BLAS operations
• LAPACK (80's): blocking, cache friendly; relies on Level-3 BLAS operations
• ScaLAPACK (90's): distributed memory; relies on PBLAS and message passing
• PLASMA (00's): new, many-core friendly algorithms; relies on a DAG/scheduler, block data layout, and some extra kernels
• MAGMA: hybrid algorithms (heterogeneity friendly); relies on a hybrid scheduler and hybrid kernels
MAGMA Libraries & Software Stack
• MAGMA 1.4, MAGMA MIC 1.0, clMAGMA 1.0: hybrid LAPACK/ScaLAPACK and tile algorithms, plus 2-stage eigensolvers, for single, multi-device, and distributed systems
• MAGMA 1.4 hybrid tile algorithms through MORSE (StarPU run-time system, PLASMA/QUARK)
• MAGMA SPARSE: kernels, data formats, iterative & direct solvers
• Built on CPU BLAS and LAPACK, GPU/coprocessor BLAS over CUDA / OpenCL / MIC, and MAGMA BLAS (panels, auxiliary kernels)
• Support: Linux, Windows, Mac OS X; C/C++, Fortran; MATLAB, Python
MAGMA Functionality
• 80+ hybrid algorithms have been developed (320+ routines in total)
• Every algorithm is in 4 precisions (s/c/d/z)
• There are 3 mixed-precision algorithms (zc & ds)
• These are hybrid algorithms, expressed in terms of BLAS
• MAGMA MIC provides support for Intel Xeon Phi coprocessors
Methodology overview
• MAGMA MIC uses a hybridization methodology based on
  • Representing linear algebra algorithms as collections of tasks and data dependencies among them
  • Properly scheduling the tasks' execution over multicore CPUs and manycore coprocessors
• Successfully applied to fundamental linear algebra algorithms
  • One- and two-sided factorizations and solvers
  • Iterative linear solvers and eigensolvers
• Productivity: 1) high level; 2) leverages prior developments; 3) exceeds the performance of homogeneous solutions
Hybrid CPU+MIC algorithms (small tasks for multicores and large tasks for MICs)
A methodology to use all available resources:
[Figure: a multicore host working together with several MIC coprocessors]
Hybrid Algorithms
• Hybridization
  • Panels (Level-2 BLAS) are factored on the CPU using LAPACK
  • Trailing matrix updates (Level-3 BLAS) are done on the MIC using "look-ahead" (see the sketch below)
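To make the division of labor concrete, here is a minimal sketch of one such hybrid one-sided factorization loop with look-ahead; the cpu_/mic_ routine names are illustrative placeholders, not the actual MAGMA MIC kernels.

/* Sketch of the right-looking hybrid factorization loop with look-ahead.
 * cpu_/mic_ wrappers are illustrative placeholders, not the real MAGMA calls. */
mic_get_panel(0, hA);                         /* start with panel 0 on the CPU  */
for (i = 0; i < npanels; i++) {
    cpu_factor_panel(hA);                     /* Level-2 BLAS panel on the CPU  */
    mic_set_panel(i, hA);                     /* send the factored panel back   */
    mic_update_panel(i, i + 1);               /* look-ahead: update panel i+1   */
    mic_get_panel_async(i + 1, hA);           /* panel i+1 heads to the CPU ... */
    mic_update_trailing_async(i, i + 2);      /* ... while the MIC applies the  */
                                              /* Level-3 BLAS trailing update   */
}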
One-Sided Factorizations (LU, QR, and Cholesky)
[Figure: matrix partitioned into panel i, the trailing submatrix i, and the next panel i+1 that is updated first for look-ahead]
Programming LA on Hybrid Systems
• Algorithms expressed in terms of BLAS: use vendor-optimized BLAS
• Algorithms expressed in terms of tasks: use a scheduling/run-time system
Intel Xeon Phi-specific considerations
• Intel Xeon Phi coprocessors (vs. GPUs) are less dependent on the host [one can log in to the coprocessor, develop, and run programs in native mode]
• There is OpenCL support (as of Intel SDK XE 2013 Beta) facilitating Intel Xeon Phi's use from the host
• There is a pragma-based offload API, but it may be too high-level for high-performance numerical libraries (see the sketch below)
• We used Intel Xeon Phi's Low-Level API (LLAPI) to develop the MAGMA API [this allows us to handle hybrid systems uniformly]
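For comparison, the pragma-based offload model looks roughly like the sketch below (Intel compiler offload pragmas); the whole annotated region is shipped to the card as one unit, which illustrates why the granularity can be too coarse for a tuned library.

/* Rough sketch of the pragma offload model (Intel compiler, Xeon Phi).
 * The annotated region runs on the coprocessor as one coarse unit. */
void scale(double *x, int n, double alpha)
{
    #pragma offload target(mic) inout(x : length(n))
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            x[i] *= alpha;
    }
}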
MAGMA MIC programming model
• The Intel Xeon Phi acts as a coprocessor
• On the Intel Xeon Phi, MAGMA runs a "server"
• Communications are implemented using LLAPI
[Figure: the host's main() and the Xeon Phi's server() communicating over PCIe via LLAPI]
A Hybrid Algorithm Example
• Left-looking hybrid Cholesky factorization in MAGMA
• The difference with LAPACK: the 4 additional lines in red
• Line 8 (done on the CPU) is overlapped with work on the MIC (from line 6)
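The slide shows this code as an image, so a minimal sketch of the same structure is reproduced below; dA(), hwork, and the cpu_/mic_ wrappers are placeholders, not the actual MAGMA MIC calls. The point is that the CPU dpotrf of the diagonal block (the counterpart of line 8) overlaps with the dgemm the MIC is still running (the counterpart of line 6).

/* Sketch of the left-looking hybrid Cholesky (lower triangular), one block
 * column j of width nb per iteration; dA(), hwork, and the cpu_/mic_
 * wrappers are placeholders for the actual MAGMA MIC interface. */
for (j = 0; j < n; j += nb) {
    jb = min(nb, n - j);
    mic_dsyrk(jb, j, dA(j, 0), dA(j, j));               /* update diagonal block  */
    mic_get_async(dA(j, j), hwork, jb, jb);             /* start copy to the CPU  */
    mic_dgemm(n - j - jb, jb, j,
              dA(j + jb, 0), dA(j, 0), dA(j + jb, j));  /* update below-diagonal  */
    mic_queue_sync(copy_queue);                         /* wait for the copy only */
    cpu_dpotrf(jb, hwork);                              /* CPU factors the block, */
                                                        /* overlapped with dgemm  */
    mic_set_async(hwork, dA(j, j), jb, jb);             /* send the block back    */
    mic_dtrsm(n - j - jb, jb, dA(j, j), dA(j + jb, j)); /* finish the column      */
}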
MAGMA MIC programming model
• The host program calls the Intel Xeon Phi interface
• The interface sends asynchronous requests to the MIC, where they are queued and executed
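A hypothetical sketch of that exchange is shown below; the request structure and the llapi_send/llapi_recv/dispatch names are invented for illustration and do not reproduce the actual MAGMA MIC interface or the low-level API's real calls.

/* Hypothetical sketch of the host-to-coprocessor request flow
 * (names invented for illustration; the real MAGMA MIC interface
 * uses the low-level API and its own message format). */
typedef struct {
    int  op;         /* which kernel to run, e.g. OP_DGEMM                  */
    char args[64];   /* packed scalar arguments and device-buffer offsets   */
} request_t;

/* Host side: enqueue a request and return immediately. */
void mic_submit(request_t *req)
{
    llapi_send(req, sizeof(*req));   /* non-blocking send over PCIe         */
}

/* Coprocessor side: the MAGMA "server" loop. */
void server_loop(void)
{
    request_t req;
    while (llapi_recv(&req, sizeof(req)) == 0)
        dispatch(req.op, req.args);  /* run the requested kernel on the MIC */
}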
MAGMA MIC Performance ( QR )
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s
Coprocessor: Intel Xeon Phi (60 cores @ 1.23 GHz), DP peak 1180 GFlop/s
System DP peak: 1512 GFlop/s; MPSS 2.1.6720-15, compiler_xe_2013.4.183
[Figure: MAGMA_DGEQRF vs. CPU_DGEQRF performance in GFlop/s for matrix sizes N x N up to 25,000]
MAGMA MIC Performance (Cholesky)
[Figure: MAGMA_DPOTRF vs. CPU_DPOTRF performance in GFlop/s for matrix sizes N x N up to 25,000]
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s
Coprocessor: Intel Xeon Phi (60 cores @ 1.23 GHz), DP peak 1180 GFlop/s
System DP peak: 1512 GFlop/s; MPSS 2.1.6720-15, compiler_xe_2013.4.183
MAGMA MIC Performance (LU)
[Figure: MAGMA_DGETRF vs. CPU_DGETRF performance in GFlop/s for matrix sizes N x N up to 25,000]
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s
Coprocessor: Intel Xeon Phi (60 cores @ 1.23 GHz), DP peak 1180 GFlop/s
System DP peak: 1512 GFlop/s; MPSS 2.1.6720-15, compiler_xe_2013.4.183
MAGMA MIC Two-sided Factorizations
[Figure: data movement in the hybrid two-sided panel: the panel dP is copied to the CPU, the Householder vector v is sent to the coprocessor, the resulting y is copied back to the CPU, with workspaces dV, dY, and dWork kept on the coprocessor]
• Panels are also hybrid, using both the CPU and the MIC (vs. just the CPU, as in the one-sided factorizations)
• Fast Level-2 BLAS is needed; it runs on the MIC to exploit the MIC's high memory bandwidth (see the sketch below)
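The sketch below illustrates one column step of such a hybrid two-sided panel; the cpu_/mic_ wrappers and array names are placeholders, and the point is only that the memory-bound matrix-vector product runs on the coprocessor.

/* Sketch of one column step of a hybrid two-sided panel.
 * cpu_/mic_ wrappers and array names are illustrative placeholders. */
for (j = 0; j < nb; j++) {
    cpu_householder(hA, j, v, &tau);          /* small work: reflector v_j on CPU   */
    mic_set_async(v, dV + j * n, n);          /* send v_j to the coprocessor        */
    mic_dgemv(dA, dV + j * n, dY + j * n);    /* y_j = A * v_j: memory-bound Level-2 */
                                              /* BLAS uses the MIC's high bandwidth  */
    mic_get_async(dY + j * n, y, n);          /* bring y_j back to the CPU          */
    cpu_update_next_column(hA, j, v, y, tau); /* prepare column j+1 of the panel    */
}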
MAGMA MIC Performance (Hessenberg)
[Figure: MAGMA_DGEHRD vs. CPU_DGEHRD performance in GFlop/s for matrix sizes N x N up to 25,000]
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s
Coprocessor: Intel Xeon Phi (60 cores @ 1.23 GHz), DP peak 1180 GFlop/s
System DP peak: 1512 GFlop/s; MPSS 2.1.6720-15, compiler_xe_2013.4.183
MAGMA MIC Performance (Bidiagonal)
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s
Coprocessor: Intel Xeon Phi (60 cores @ 1.23 GHz), DP peak 1180 GFlop/s
System DP peak: 1512 GFlop/s; MPSS 2.1.6720-15, compiler_xe_2013.4.183
[Figure: MAGMA_DGEBRD vs. CPU_DGEBRD performance in GFlop/s for matrix sizes N x N up to 25,000]
MAGMA MIC Performance (Tridiagonal)
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s
Coprocessor: Intel Xeon Phi (60 cores @ 1.23 GHz), DP peak 1180 GFlop/s
System DP peak: 1512 GFlop/s; MPSS 2.1.6720-15, compiler_xe_2013.4.183
[Figure: MAGMA_DSYTRD vs. CPU_DSYTRD performance in GFlop/s for matrix sizes N x N up to 25,000]
Toward fast Eigensolver
[Figure: DSYTRD performance in GFlop/s for matrix sizes from 2k to 30k, comparing DSYTRD_3GPUs, DSYTRD_2GPUs, DSYTRD_1GPU, and DSYTRD_MKL]
Keeneland system, using one node: 3 NVIDIA GPUs (M2090 @ 1.1 GHz, 5.4 GB) and 2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB)
Flops formula: (4n^3/3) / time; higher is faster
A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks," ICL Technical Report, 03/2012.
Toward fast Eigensolver
Characteristics
• Stage 1: BLAS-3, increased computational intensity
• Stage 2: BLAS-1.5, new cache-friendly kernel
• 4x/12x faster than the standard approach
• Bottleneck: if all eigenvectors are required, there is the extra cost of one back transformation
Keeneland system, using one node: 3 NVIDIA GPUs (M2090 @ 1.1 GHz, 5.4 GB) and 2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB)
Flops formula: (4n^3/3) / time; higher is faster
A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks," ICL Technical Report, 03/2012.
[Figure: DSYTRD performance in GFlop/s for matrix sizes from 2k to 30k, comparing the two-stage DSYTRD on 1, 2, and 3 GPUs with the standard DSYTRD on 1, 2, and 3 GPUs and with MKL]
Acceleration with 3 GPUs: 15x vs. 12 Intel cores
[Figure: sparsity patterns during the two-stage reduction: the first stage reduces the dense symmetric matrix to band form, and the second stage reduces the band to tridiagonal form]
Toward fast Eigensolver
Characteristics
• Stage 1: BLAS-3, increased computational intensity
• Stage 2: BLAS-1.5, new cache-friendly kernel
• 4x/12x faster than the standard approach
• Bottleneck: if all eigenvectors are required, there is the extra cost of one back transformation
[Figure: DSYTRD performance in GFlop/s for matrix sizes from 1k to 26k, comparing the two-stage algorithm (MAGMA_DSYTRD_2stages_16SB) on MIC, Kepler, and Fermi with the standard algorithm (MAGMA_DSYTRD_standard_16SB) on Kepler and MIC and with MKL_DSYTRD_standard_16SB]
Flops formula: (4n^3/3) / time; higher is faster
A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks," ICL Technical Report, 03/2012.
From Single to Multi-MIC Support
• Data distribution
  • 1-D block-cyclic distribution
• Algorithm
  • The MIC holding the current panel sends it to the CPU
  • All updates are done in parallel on the MICs
  • Look-ahead is done by the MIC holding the next panel (see the sketch below)
[Figure: 1-D block-cyclic distribution of nb-wide block columns over MIC 0, MIC 1, MIC 2, MIC 0, ...]
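A small sketch of the resulting multi-MIC loop is shown below; the owner() computation is the 1-D block-cyclic mapping, while the device wrappers are illustrative placeholders rather than actual MAGMA MIC calls.

/* Sketch of the multi-MIC loop with 1-D block-cyclic panel ownership
 * (device wrappers are placeholders; only the owner computation and the
 * look-ahead structure are the point). */
int owner(int panel, int nmic) { return panel % nmic; }   /* 1-D block cyclic */

for (i = 0; i < npanels; i++) {
    dev = owner(i, nmic);
    mic_get_panel(dev, i, hA);           /* owner sends panel i to the CPU   */
    cpu_factor_panel(hA);
    for (d = 0; d < nmic; d++)           /* broadcast the factored panel     */
        mic_set_panel(d, i, hA);

    /* The MIC owning panel i+1 updates it first (look-ahead), then all
     * MICs update their local parts of the trailing matrix in parallel.     */
    mic_update_async(owner(i + 1, nmic), i, i + 1);
    for (d = 0; d < nmic; d++)
        mic_update_async(d, i, /*from=*/ i + 2);
}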
MAGMA MIC Scalability LU Factorization Performance in DP
[Figure: MAGMA DGETRF performance in DP (GFlop/s) vs. matrix size N x N up to 40,000, shown for 1, 2, 3, and 4 MIC cards]
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s
Coprocessor: Intel Xeon Phi (60 cores @ 1.09 GHz), DP peak 1046 GFlop/s
System DP peak: 1378 GFlop/s; MPSS 2.1.4346-16, compiler_xe_2013.1.117
MAGMA MIC Scalability QR Factorization Performance in DP
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s
Coprocessor: Intel Xeon Phi (60 cores @ 1.09 GHz), DP peak 1046 GFlop/s
System DP peak: 1378 GFlop/s; MPSS 2.1.4346-16, compiler_xe_2013.1.117
[Figure: MAGMA DGEQRF performance (multiple cards) in GFlop/s vs. matrix size N x N up to 40,000, for 1, 2, 3, and 4 MICs]
MAGMA MIC Scalability Cholesky Factorization Performance in DP
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s
Coprocessor: Intel Xeon Phi (60 cores @ 1.09 GHz), DP peak 1046 GFlop/s
System DP peak: 1378 GFlop/s; MPSS 2.1.4346-16, compiler_xe_2013.1.117
[Figure: MAGMA DPOTRF performance (multiple cards) in GFlop/s vs. matrix size N x N up to 40,000, for 1, 2, 3, and 4 MICs]
Current and Future MIC-specific directions
• Explore the use of tasks of lower granularity on the MIC [GPU tasks in general are large and data parallel]
• Reduce synchronizations [fewer fork-join synchronizations]
• Schedule less-parallel tasks on the MIC [on GPUs, these tasks are typically offloaded to the CPUs] [to reduce CPU-MIC communications]
• Develop more algorithms, porting the newest developments in linear algebra while discovering further MIC-specific optimizations
Current and Future MIC-specific directions
• Synchronization-avoiding algorithms using dynamic runtime systems
[Figure: 48 cores running POTRF, TRTRI, and LAUUM; the matrix is 4000 x 4000 and the tile size is 200 x 200]
Current and Future MIC-specific directions
High productivity with dynamic runtime systems: from sequential nested-loop code to parallel execution
for (k = 0; k < min(MT, NT); k++) {
    zgeqrt(A[k;k], ...);                         /* factor the diagonal tile          */
    for (n = k+1; n < NT; n++)
        zunmqr(A[k;k], A[k;n], ...);             /* apply it across tile row k        */
    for (m = k+1; m < MT; m++) {
        ztsqrt(A[k;k], A[m;k], ...);             /* factor a tile below the diagonal  */
        for (n = k+1; n < NT; n++)
            ztsmqr(A[m;k], A[k;n], A[m;n], ...); /* update the trailing tiles         */
    }
}
Current and Future MIC-specific directions
High productivity with dynamic runtime systems: from sequential nested-loop code to parallel execution
for (k = 0; k < min(MT, NT); k++) {
    Insert_Task(&cl_zgeqrt, k, k, ...);          /* same loop nest, but each call     */
    for (n = k+1; n < NT; n++)                   /* only inserts a task; the runtime  */
        Insert_Task(&cl_zunmqr, k, n, ...);      /* tracks the dependencies and       */
    for (m = k+1; m < MT; m++) {                 /* executes ready tasks in parallel  */
        Insert_Task(&cl_ztsqrt, m, k, ...);
        for (n = k+1; n < NT; n++)
            Insert_Task(&cl_ztsmqr, m, n, k, ...);
    }
}
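In this task-insertion form the runtime builds the DAG from the data each task reads and writes; the hedged sketch below shows how access modes might be attached to the arguments (the flag names and calling sequence are illustrative, not the exact QUARK or StarPU API).

/* Illustrative only: access-mode flags tell the runtime which arguments a
 * task reads (INPUT) or modifies (INOUT), and the dependencies follow.
 * Names and calling sequence are placeholders, not the exact runtime API. */
for (k = 0; k < min(MT, NT); k++) {
    Insert_Task(&cl_zgeqrt, A[k;k], INOUT, T[k;k], OUTPUT, 0);
    for (n = k+1; n < NT; n++)
        Insert_Task(&cl_zunmqr, A[k;k], INPUT, T[k;k], INPUT, A[k;n], INOUT, 0);
    /* ... ztsqrt and ztsmqr are inserted the same way; any two tasks with
     * no read/write conflict on a tile can then execute in parallel. */
}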
Contact: magma-devel@eecs.utk.edu
Innovative Computing Laboratory, University of Tennessee, Knoxville
MAGMA team: http://icl.cs.utk.edu/magma
PLASMA team: http://icl.cs.utk.edu/plasma
Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver; INRIA, France (StarPU team); KAUST, Saudi Arabia
Collaborators and support