High Performance Linear Algebra with Intel Xeon Phi Coprocessors
University of Tennessee, Knoxville
Stan Tomov, Azzam Haidar, Khairul Kabir, J. Dongarra, M. Gates, Y. Jia, P. Luszczek
ORNL Seminar, TN September 9, 2013
MAGMA: LAPACK for hybrid systems
• MAGMA: a new generation of high-performance linear algebra libraries
  • Provides LAPACK/ScaLAPACK functionality on hybrid architectures
  • http://icl.cs.utk.edu/magma/
• MAGMA MIC 1.0
  • For hybrid, shared-memory systems featuring Intel Xeon Phi coprocessors
  • Includes the main dense linear algebra algorithms: one- and two-sided factorizations and solvers
• Open-source software ( http://icl.cs.utk.edu/magma )
• MAGMA developers & collaborators
  • UTK, UC Berkeley, UC Denver, INRIA (France), KAUST (Saudi Arabia)
  • Community effort, similar to LAPACK/ScaLAPACK
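For orientation, a minimal calling sketch is shown below; it assumes the LAPACK-style CPU interface of the MAGMA 1.4 release (magma_init, magma_dgetrf, magma_finalize), and the exact routine names in the MIC port may differ.

/* Minimal sketch of MAGMA's LAPACK-style interface (assumes the
 * MAGMA 1.4 CPU-interface naming; MIC-port names may differ). */
#include <stdlib.h>
#include "magma.h"

int main(void)
{
    magma_int_t n = 8192, lda = n, info;
    magma_int_t *ipiv = malloc(n * sizeof(magma_int_t));
    double *A = malloc((size_t)lda * n * sizeof(double));

    magma_init();                                /* initialize the library and devices */
    /* ... fill A with the matrix to factor ... */
    magma_dgetrf(n, n, A, lda, ipiv, &info);     /* hybrid LU, LAPACK calling sequence */
    magma_finalize();

    free(ipiv); free(A);
    return (int)info;
}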
Key aspects of MAGMA
A New Generation of DLA Software
Software/Algorithms follow hardware evolution in time:
• LINPACK (70's): vector operations; relies on Level-1 BLAS operations
• LAPACK (80's): blocking, cache friendly; relies on Level-3 BLAS operations
• ScaLAPACK (90's): distributed memory; relies on PBLAS and message passing
• PLASMA (00's): new, many-core friendly algorithms; relies on a DAG/scheduler, block data layout, and some extra kernels
• MAGMA: hybrid algorithms (heterogeneity friendly); relies on a hybrid scheduler and hybrid kernels
MAGMA Libraries & Software Stack
• MAGMA 1.4, MAGMA MIC 1.0, clMAGMA 1.0: hybrid LAPACK/ScaLAPACK and tile algorithms, plus 2-stage eigensolvers, for single, multi-device, and distributed systems
• MAGMA 1.4 hybrid tile algorithms through MORSE (StarPU run-time system, PLASMA/QUARK)
• MAGMA SPARSE: kernels, data formats, iterative & direct solvers
• Built on CPU BLAS and LAPACK, GPU/coprocessor BLAS over CUDA / OpenCL / MIC, and MAGMA BLAS (panels, auxiliary kernels)
• Support: Linux, Windows, Mac OS X; C/C++, Fortran; MATLAB, Python
MAGMA Functionality
• 80+ hybrid algorithms have been developed (320+ routines in total)
• Every algorithm is in 4 precisions (s/c/d/z)
• There are 3 mixed-precision algorithms (zc & ds)
• These are hybrid algorithms, expressed in terms of BLAS
• MAGMA MIC provides support for Intel Xeon Phi coprocessors
Methodology overview
• MAGMA MIC uses a hybridization methodology based on
  • Representing linear algebra algorithms as collections of tasks and data dependencies among them
  • Properly scheduling the tasks' execution over multicore CPUs and manycore coprocessors
• Successfully applied to fundamental linear algebra algorithms
  • One- and two-sided factorizations and solvers
  • Iterative linear solvers and eigensolvers
• Productivity: 1) high level; 2) leverages prior developments; 3) exceeds the performance of homogeneous solutions
Hybrid CPU+MIC algorithms (small tasks for multicores and large tasks for MICs)
A methodology to use all available resources:
[Figure: a multicore host working together with several MIC coprocessors]
Hybrid Algorithms
• Hybridization
  • Panels (Level-2 BLAS) are factored on the CPU using LAPACK
  • Trailing matrix updates (Level-3 BLAS) are done on the MIC using "look-ahead" (see the sketch below)
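To make the division of labor concrete, here is a minimal sketch of one such hybrid one-sided factorization loop with look-ahead; the cpu_/mic_ routine names are illustrative placeholders, not the actual MAGMA MIC kernels.

/* Sketch of the right-looking hybrid factorization loop with look-ahead.
 * cpu_/mic_ wrappers are illustrative placeholders, not the real MAGMA calls. */
mic_get_panel(0, hA);                         /* start with panel 0 on the CPU  */
for (i = 0; i < npanels; i++) {
    cpu_factor_panel(hA);                     /* Level-2 BLAS panel on the CPU  */
    mic_set_panel(i, hA);                     /* send the factored panel back   */
    mic_update_panel(i, i + 1);               /* look-ahead: update panel i+1   */
    mic_get_panel_async(i + 1, hA);           /* panel i+1 heads to the CPU ... */
    mic_update_trailing_async(i, i + 2);      /* ... while the MIC applies the  */
                                              /* Level-3 BLAS trailing update   */
}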
One-Sided Factorizations (LU, QR, and Cholesky)
[Figure: matrix partitioned into panel i, the trailing submatrix i, and the next panel i+1 that is updated first for look-ahead]
Programming LA on Hybrid Systems
• Algorithms expressed in terms of BLAS: use vendor-optimized BLAS
• Algorithms expressed in terms of tasks: use a scheduling/run-time system
Intel Xeon Phi-specific considerations
• Intel Xeon Phi coprocessors (vs. GPUs) are less dependent on the host [one can log in to the coprocessor, develop, and run programs in native mode]
• There is OpenCL support (as of Intel SDK XE 2013 Beta) facilitating Intel Xeon Phi's use from the host
• There is a pragma-based offload API, but it may be too high-level for high-performance numerical libraries (see the sketch below)
• We used Intel Xeon Phi's Low-Level API (LLAPI) to develop the MAGMA API [this allows us to handle hybrid systems uniformly]
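For comparison, the pragma-based offload model looks roughly like the sketch below (Intel compiler offload pragmas); the whole annotated region is shipped to the card as one unit, which illustrates why the granularity can be too coarse for a tuned library.

/* Rough sketch of the pragma offload model (Intel compiler, Xeon Phi).
 * The annotated region runs on the coprocessor as one coarse unit. */
void scale(double *x, int n, double alpha)
{
    #pragma offload target(mic) inout(x : length(n))
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            x[i] *= alpha;
    }
}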
MAGMA MIC programming model
• The Intel Xeon Phi acts as a coprocessor
• On the Intel Xeon Phi, MAGMA runs a "server"
• Communications are implemented using LLAPI
[Figure: the host's main() and the Xeon Phi's server() communicating over PCIe via LLAPI]
A Hybrid Algorithm Example
• Left-looking hybrid Cholesky factorization in MAGMA
• The difference with LAPACK: the 4 additional lines in red
• Line 8 (done on the CPU) is overlapped with work on the MIC (from line 6)
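The slide shows this code as an image, so a minimal sketch of the same structure is reproduced below; dA(), hwork, and the cpu_/mic_ wrappers are placeholders, not the actual MAGMA MIC calls. The point is that the CPU dpotrf of the diagonal block (the counterpart of line 8) overlaps with the dgemm the MIC is still running (the counterpart of line 6).

/* Sketch of the left-looking hybrid Cholesky (lower triangular), one block
 * column j of width nb per iteration; dA(), hwork, and the cpu_/mic_
 * wrappers are placeholders for the actual MAGMA MIC interface. */
for (j = 0; j < n; j += nb) {
    jb = min(nb, n - j);
    mic_dsyrk(jb, j, dA(j, 0), dA(j, j));               /* update diagonal block  */
    mic_get_async(dA(j, j), hwork, jb, jb);             /* start copy to the CPU  */
    mic_dgemm(n - j - jb, jb, j,
              dA(j + jb, 0), dA(j, 0), dA(j + jb, j));  /* update below-diagonal  */
    mic_queue_sync(copy_queue);                         /* wait for the copy only */
    cpu_dpotrf(jb, hwork);                              /* CPU factors the block, */
                                                        /* overlapped with dgemm  */
    mic_set_async(hwork, dA(j, j), jb, jb);             /* send the block back    */
    mic_dtrsm(n - j - jb, jb, dA(j, j), dA(j + jb, j)); /* finish the column      */
}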
MAGMA MIC programming model
• The host program calls the Intel Xeon Phi interface
• The interface sends asynchronous requests to the MIC, where they are queued and executed
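A hypothetical sketch of that exchange is shown below; the request structure and the llapi_send/llapi_recv/dispatch names are invented for illustration and do not reproduce the actual MAGMA MIC interface or the low-level API's real calls.

/* Hypothetical sketch of the host-to-coprocessor request flow
 * (names invented for illustration; the real MAGMA MIC interface
 * uses the low-level API and its own message format). */
typedef struct {
    int  op;         /* which kernel to run, e.g. OP_DGEMM                  */
    char args[64];   /* packed scalar arguments and device-buffer offsets   */
} request_t;

/* Host side: enqueue a request and return immediately. */
void mic_submit(request_t *req)
{
    llapi_send(req, sizeof(*req));   /* non-blocking send over PCIe         */
}

/* Coprocessor side: the MAGMA "server" loop. */
void server_loop(void)
{
    request_t req;
    while (llapi_recv(&req, sizeof(req)) == 0)
        dispatch(req.op, req.args);  /* run the requested kernel on the MIC */
}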
MAGMA MIC Performance ( QR )
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s
Coprocessor: Intel Xeon Phi (60 cores @ 1.23 GHz), DP peak 1180 GFlop/s
System DP peak: 1512 GFlop/s; MPSS 2.1.6720-15, compiler_xe_2013.4.183
[Figure: MAGMA_DGEQRF vs. CPU_DGEQRF performance in GFlop/s for matrix sizes N x N up to 25,000]
MAGMA MIC Performance (Cholesky)
[Figure: MAGMA_DPOTRF vs. CPU_DPOTRF performance in GFlop/s for matrix sizes N x N up to 25,000]
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s
Coprocessor: Intel Xeon Phi (60 cores @ 1.23 GHz), DP peak 1180 GFlop/s
System DP peak: 1512 GFlop/s; MPSS 2.1.6720-15, compiler_xe_2013.4.183
MAGMA MIC Performance (LU)
[Figure: MAGMA_DGETRF vs. CPU_DGETRF performance in GFlop/s for matrix sizes N x N up to 25,000]
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s
Coprocessor: Intel Xeon Phi (60 cores @ 1.23 GHz), DP peak 1180 GFlop/s
System DP peak: 1512 GFlop/s; MPSS 2.1.6720-15, compiler_xe_2013.4.183
MAGMA MIC Two-sided Factorizations
[Figure: data movement in the hybrid two-sided panel: the panel dP is copied to the CPU, the Householder vector v is sent to the coprocessor, the resulting y is copied back to the CPU, with workspaces dV, dY, and dWork kept on the coprocessor]
• Panels are also hybrid, using both the CPU and the MIC (vs. just the CPU, as in the one-sided factorizations)
• Fast Level-2 BLAS is needed; it runs on the MIC to exploit the MIC's high memory bandwidth (see the sketch below)
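The sketch below illustrates one column step of such a hybrid two-sided panel; the cpu_/mic_ wrappers and array names are placeholders, and the point is only that the memory-bound matrix-vector product runs on the coprocessor.

/* Sketch of one column step of a hybrid two-sided panel.
 * cpu_/mic_ wrappers and array names are illustrative placeholders. */
for (j = 0; j < nb; j++) {
    cpu_householder(hA, j, v, &tau);          /* small work: reflector v_j on CPU   */
    mic_set_async(v, dV + j * n, n);          /* send v_j to the coprocessor        */
    mic_dgemv(dA, dV + j * n, dY + j * n);    /* y_j = A * v_j: memory-bound Level-2 */
                                              /* BLAS uses the MIC's high bandwidth  */
    mic_get_async(dY + j * n, y, n);          /* bring y_j back to the CPU          */
    cpu_update_next_column(hA, j, v, y, tau); /* prepare column j+1 of the panel    */
}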
MAGMA MIC Performance (Hessenberg)
[Figure: MAGMA_DGEHRD vs. CPU_DGEHRD performance in GFlop/s for matrix sizes N x N up to 25,000]
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s
Coprocessor: Intel Xeon Phi (60 cores @ 1.23 GHz), DP peak 1180 GFlop/s
System DP peak: 1512 GFlop/s; MPSS 2.1.6720-15, compiler_xe_2013.4.183
MAGMA MIC Performance (Bidiagonal)
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s
Coprocessor: Intel Xeon Phi (60 cores @ 1.23 GHz), DP peak 1180 GFlop/s
System DP peak: 1512 GFlop/s; MPSS 2.1.6720-15, compiler_xe_2013.4.183
[Figure: MAGMA_DGEBRD vs. CPU_DGEBRD performance in GFlop/s for matrix sizes N x N up to 25,000]
MAGMA MIC Performance (Tridiagonal)
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s
Coprocessor: Intel Xeon Phi (60 cores @ 1.23 GHz), DP peak 1180 GFlop/s
System DP peak: 1512 GFlop/s; MPSS 2.1.6720-15, compiler_xe_2013.4.183
[Figure: MAGMA_DSYTRD vs. CPU_DSYTRD performance in GFlop/s for matrix sizes N x N up to 25,000]
Toward fast Eigensolver
[Figure: DSYTRD performance in GFlop/s for matrix sizes from 2k to 30k, comparing DSYTRD_3GPUs, DSYTRD_2GPUs, DSYTRD_1GPU, and DSYTRD_MKL]
Keeneland system, using one node: 3 NVIDIA GPUs (M2090 @ 1.1 GHz, 5.4 GB) and 2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB)
Flops formula: (4n^3/3) / time; higher is faster
A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks," ICL Technical Report, 03/2012.
Toward fast Eigensolver
Characteristics
• Stage 1: BLAS-3, increased computational intensity
• Stage 2: BLAS-1.5, new cache-friendly kernel
• 4x/12x faster than the standard approach
• Bottleneck: if all eigenvectors are required, there is the extra cost of one back transformation
Keeneland system, using one node: 3 NVIDIA GPUs (M2090 @ 1.1 GHz, 5.4 GB) and 2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB)
Flops formula: (4n^3/3) / time; higher is faster
A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks," ICL Technical Report, 03/2012.
[Figure: DSYTRD performance in GFlop/s for matrix sizes from 2k to 30k, comparing the two-stage DSYTRD on 1, 2, and 3 GPUs with the standard DSYTRD on 1, 2, and 3 GPUs and with MKL]
Acceleration with 3 GPUs: 15x vs. 12 Intel cores
[Figure: sparsity patterns during the two-stage reduction: the first stage reduces the dense symmetric matrix to band form, and the second stage reduces the band to tridiagonal form]
Toward fast Eigensolver
Characteristics
• Stage 1: BLAS-3, increased computational intensity
• Stage 2: BLAS-1.5, new cache-friendly kernel
• 4x/12x faster than the standard approach
• Bottleneck: if all eigenvectors are required, there is the extra cost of one back transformation
[Figure: DSYTRD performance in GFlop/s for matrix sizes from 1k to 26k, comparing the two-stage algorithm (MAGMA_DSYTRD_2stages_16SB) on MIC, Kepler, and Fermi with the standard algorithm (MAGMA_DSYTRD_standard_16SB) on Kepler and MIC and with MKL_DSYTRD_standard_16SB]
Flops formula: (4n^3/3) / time; higher is faster
A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks," ICL Technical Report, 03/2012.
From Single to Multi-MIC Support
• Data distribution
  • 1-D block-cyclic distribution
• Algorithm
  • The MIC holding the current panel sends it to the CPU
  • All updates are done in parallel on the MICs
  • Look-ahead is done by the MIC holding the next panel (see the sketch below)
[Figure: 1-D block-cyclic distribution of nb-wide block columns over MIC 0, MIC 1, MIC 2, MIC 0, ...]
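A small sketch of the resulting multi-MIC loop is shown below; the owner() computation is the 1-D block-cyclic mapping, while the device wrappers are illustrative placeholders rather than actual MAGMA MIC calls.

/* Sketch of the multi-MIC loop with 1-D block-cyclic panel ownership
 * (device wrappers are placeholders; only the owner computation and the
 * look-ahead structure are the point). */
int owner(int panel, int nmic) { return panel % nmic; }   /* 1-D block cyclic */

for (i = 0; i < npanels; i++) {
    dev = owner(i, nmic);
    mic_get_panel(dev, i, hA);           /* owner sends panel i to the CPU   */
    cpu_factor_panel(hA);
    for (d = 0; d < nmic; d++)           /* broadcast the factored panel     */
        mic_set_panel(d, i, hA);

    /* The MIC owning panel i+1 updates it first (look-ahead), then all
     * MICs update their local parts of the trailing matrix in parallel.     */
    mic_update_async(owner(i + 1, nmic), i, i + 1);
    for (d = 0; d < nmic; d++)
        mic_update_async(d, i, /*from=*/ i + 2);
}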
MAGMA MIC Scalability LU Factorization Performance in DP
[Figure: MAGMA DGETRF performance in DP (GFlop/s) vs. matrix size N x N up to 40,000, shown for 1, 2, 3, and 4 MIC cards]
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s
Coprocessor: Intel Xeon Phi (60 cores @ 1.09 GHz), DP peak 1046 GFlop/s
System DP peak: 1378 GFlop/s; MPSS 2.1.4346-16, compiler_xe_2013.1.117
MAGMA MIC Scalability QR Factorization Performance in DP
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s
Coprocessor: Intel Xeon Phi (60 cores @ 1.09 GHz), DP peak 1046 GFlop/s
System DP peak: 1378 GFlop/s; MPSS 2.1.4346-16, compiler_xe_2013.1.117
[Figure: MAGMA DGEQRF performance (multiple cards) in GFlop/s vs. matrix size N x N up to 40,000, for 1, 2, 3, and 4 MICs]
MAGMA MIC Scalability Cholesky Factorization Performance in DP
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s
Coprocessor: Intel Xeon Phi (60 cores @ 1.09 GHz), DP peak 1046 GFlop/s
System DP peak: 1378 GFlop/s; MPSS 2.1.4346-16, compiler_xe_2013.1.117
[Figure: MAGMA DPOTRF performance (multiple cards) in GFlop/s vs. matrix size N x N up to 40,000, for 1, 2, 3, and 4 MICs]
Current and Future MIC-specific directions
• Explore the use of tasks of lower granularity on the MIC [GPU tasks in general are large and data parallel]
• Reduce synchronizations [fewer fork-join synchronizations]
• Schedule less-parallel tasks on the MIC [on GPUs, these tasks are typically offloaded to the CPUs] [to reduce CPU-MIC communications]
• Develop more algorithms, porting the newest developments in linear algebra while discovering further MIC-specific optimizations
Current and Future MIC-specific directions
• Synchronization-avoiding algorithms using dynamic runtime systems
[Figure: 48 cores running POTRF, TRTRI, and LAUUM; the matrix is 4000 x 4000 and the tile size is 200 x 200]
Current and Future MIC-specific directions
High productivity with dynamic runtime systems: from sequential nested-loop code to parallel execution
for (k = 0; k < min(MT, NT); k++) {
    zgeqrt(A[k;k], ...);                         /* factor the diagonal tile          */
    for (n = k+1; n < NT; n++)
        zunmqr(A[k;k], A[k;n], ...);             /* apply it across tile row k        */
    for (m = k+1; m < MT; m++) {
        ztsqrt(A[k;k], A[m;k], ...);             /* factor a tile below the diagonal  */
        for (n = k+1; n < NT; n++)
            ztsmqr(A[m;k], A[k;n], A[m;n], ...); /* update the trailing tiles         */
    }
}
Current and Future MIC-specific directions
High productivity with dynamic runtime systems: from sequential nested-loop code to parallel execution
for (k = 0; k < min(MT, NT); k++) {
    Insert_Task(&cl_zgeqrt, k, k, ...);          /* same loop nest, but each call     */
    for (n = k+1; n < NT; n++)                   /* only inserts a task; the runtime  */
        Insert_Task(&cl_zunmqr, k, n, ...);      /* tracks the dependencies and       */
    for (m = k+1; m < MT; m++) {                 /* executes ready tasks in parallel  */
        Insert_Task(&cl_ztsqrt, m, k, ...);
        for (n = k+1; n < NT; n++)
            Insert_Task(&cl_ztsmqr, m, n, k, ...);
    }
}
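In this task-insertion form the runtime builds the DAG from the data each task reads and writes; the hedged sketch below shows how access modes might be attached to the arguments (the flag names and calling sequence are illustrative, not the exact QUARK or StarPU API).

/* Illustrative only: access-mode flags tell the runtime which arguments a
 * task reads (INPUT) or modifies (INOUT), and the dependencies follow.
 * Names and calling sequence are placeholders, not the exact runtime API. */
for (k = 0; k < min(MT, NT); k++) {
    Insert_Task(&cl_zgeqrt, A[k;k], INOUT, T[k;k], OUTPUT, 0);
    for (n = k+1; n < NT; n++)
        Insert_Task(&cl_zunmqr, A[k;k], INPUT, T[k;k], INPUT, A[k;n], INOUT, 0);
    /* ... ztsqrt and ztsmqr are inserted the same way; any two tasks with
     * no read/write conflict on a tile can then execute in parallel. */
}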
Contact: magma-devel@eecs.utk.edu
Innovative Computing Laboratory, University of Tennessee, Knoxville
MAGMA team: http://icl.cs.utk.edu/magma
PLASMA team: http://icl.cs.utk.edu/plasma
Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver; INRIA, France (StarPU team); KAUST, Saudi Arabia
Collaborators and support