Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on...

79
Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen [email protected] ComplexHPC Spring School 2013 Heterogeneous computing - Impact on algorithms June 7th, 2013 Uppsala University, Sweden Paolo Bientinesi (AICES, RWTH Aachen) 1 / 34

Transcript of Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on...

Page 1: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Dense Linear Algebra on Heterogeneous Platforms:State of the Art and Trends

Paolo Bientinesi

AICES, RWTH [email protected]

ComplexHPC Spring School 2013Heterogeneous computing - Impact on algorithms

June 7th, 2013Uppsala University, Sweden

Paolo Bientinesi (AICES, RWTH Aachen) 1 / 34

Page 2: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

1 Setting the stage

2 Part 1: blocked algorithms

3 Part 2: multithreading, fork-join

4 Part 3: multihtreading, algorithms-by-blocks

5 Part 4: streaming

Paolo Bientinesi (AICES, RWTH Aachen) 2 / 34

Page 3: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Dense Linear Algebra

M =

××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××

Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34

Page 4: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Dense Linear Algebra

M =

××× × ××× ×××××××××××××××××× ×××××××××××××××××××× ××××× ×××××× ××××× ×××× × ××× ×××××××××××××××××× ×××××××××××××××××××× ××××× ×××××× ××××× ×

Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34

Page 5: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Dense Linear Algebra

M =

×× × × × ×× × ××× ×× ××× × ××× ×× ×× ×× × ×× ×× ××× × ×× × × × × ××× ××××× ××× × ××××× × ×××× × × ×× ××× ×× × ×× ×

Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34

Page 6: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Dense Linear Algebra

M =

×××××××××××××××××× ×××××××××× ××××××××××××××××××××× ×××××××××××××××××××××××

Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34

Page 7: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Dense Linear Algebra

M =

××××××××××××××××××××××××××××××××××

Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34

Page 8: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Dense Linear Algebra

Matrix equations

AX +XB = C, A = A+A−1

2 , . . .

Linear systemsAx = b, AX = B, least squares, . . .

EigenproblemsAx = λx, AX = BXΛ, SVD, . . .

Support routinesfactorizations, reductions, . . .

Paolo Bientinesi (AICES, RWTH Aachen) 4 / 34

Page 9: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Dense Linear Algebra

Matrix equations

AX +XB = C, A = A+A−1

2 , . . .

Linear systemsAx = b, AX = B, least squares, . . .

EigenproblemsAx = λx, AX = BXΛ, SVD, . . .

Support routinesfactorizations, reductions, . . .

Paolo Bientinesi (AICES, RWTH Aachen) 4 / 34

Page 10: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Dense Linear Algebra

Matrix equations

AX +XB = C, A = A+A−1

2 , . . .

Linear systemsAx = b, AX = B, least squares, . . .

EigenproblemsAx = λx, AX = BXΛ, SVD, . . .

Support routinesfactorizations, reductions, . . .

Paolo Bientinesi (AICES, RWTH Aachen) 4 / 34

Page 11: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Organization in layers

other librariesScaLAPACK, Elemental, PETSc, . . .

LAPACKLU = A, LLT = A, QR = A, QTTQ = A, . . .

BLAS

BLAS-3: C :=C +AB A,B,C,L ∈ Rn×n

C :=L−1B

BLAS-2: y := y +Ax A,L ∈ Rn×n, x, y ∈ Rn

y :=L−1x

BLAS-1: y := y + αx x, y ∈ Rn, α ∈ Rdot :=α+ xT y

Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

Page 12: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Organization in layers

other librariesScaLAPACK, Elemental, PETSc, . . .

LAPACKLU = A, LLT = A, QR = A, QTTQ = A, . . .

BLAS

BLAS-3: C :=C +AB A,B,C,L ∈ Rn×n

C :=L−1B

BLAS-2: y := y +Ax A,L ∈ Rn×n, x, y ∈ Rn

y :=L−1x

BLAS-1: y := y + αx x, y ∈ Rn, α ∈ Rdot :=α+ xT y

Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

Page 13: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Organization in layers

other librariesScaLAPACK, Elemental, PETSc, . . .

LAPACKLU = A, LLT = A, QR = A, QTTQ = A, . . .

BLAS

BLAS-3: C :=C +AB A,B,C,L ∈ Rn×n

C :=L−1B

BLAS-2: y := y +Ax A,L ∈ Rn×n, x, y ∈ Rn

y :=L−1x

BLAS-1: y := y + αx x, y ∈ Rn, α ∈ Rdot :=α+ xT y

Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

Page 14: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Organization in layers

other librariesScaLAPACK, Elemental, PETSc, . . .

LAPACKLU = A, LLT = A, QR = A, QTTQ = A, . . .

BLAS

BLAS-3: C :=C +AB A,B,C,L ∈ Rn×n

C :=L−1B

BLAS-2: y := y +Ax A,L ∈ Rn×n, x, y ∈ Rn

y :=L−1x

BLAS-1: y := y + αx x, y ∈ Rn, α ∈ Rdot :=α+ xT y

Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

Page 15: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Organization in layers

other librariesScaLAPACK, Elemental, PETSc, . . .

LAPACKLU = A, LLT = A, QR = A, QTTQ = A, . . .

BLASBLAS-3: C :=C +AB A,B,C,L ∈ Rn×n

C :=L−1B

BLAS-2: y := y +Ax A,L ∈ Rn×n, x, y ∈ Rn

y :=L−1x

BLAS-1: y := y + αx x, y ∈ Rn, α ∈ Rdot :=α+ xT y

Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

Page 16: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Organization in layers

other librariesScaLAPACK, Elemental, PETSc, . . .

LAPACKLU = A, LLT = A, QR = A, QTTQ = A, . . .

BLASBLAS-3: C :=C +AB A,B,C,L ∈ Rn×n

C :=L−1B

BLAS-2: y := y +Ax A,L ∈ Rn×n, x, y ∈ Rn

y :=L−1x

BLAS-1: y := y + αx x, y ∈ Rn, α ∈ Rdot :=α+ xT y

Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

Page 17: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Organization in layers

other librariesScaLAPACK, Elemental, PETSc, . . .

LAPACKLU = A, LLT = A, QR = A, QTTQ = A, . . .

BLASBLAS-3: C :=C +AB A,B,C,L ∈ Rn×n

C :=L−1B

BLAS-2: y := y +Ax A,L ∈ Rn×n, x, y ∈ Rn

y :=L−1x

BLAS-1: y := y + αx x, y ∈ Rn, α ∈ Rdot :=α+ xT y

Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

Page 18: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Example: AX = B (full A)

AX = BLinear System

LU = ALU

Factorization

LX = BTriangularSystem

C = AB + CGemm

LX = BTriangularSystem

C = AB + CGemm

C = AB + CGemm

Paolo Bientinesi (AICES, RWTH Aachen) 6 / 34

Page 19: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Why BLAS-3? Why GEMM?

BLAS #FLOPS Mem. refs. Ratio

Level 3 2n3 4n2 n/2

Level 2 2n2 n2 2

Level 1 2n 3n 2/3

BLAS-3: C :=C +AB A,B,C,∈ Rn×n

BLAS-2: y := y +Ax A ∈ Rn×n, x, y ∈ Rn

BLAS-1: y := y + αx x, y ∈ Rn, α ∈ R

Morale BLAS-3: The larger the problem the better, as long as it fits in memory.GEMM is the building block for all the other BLAS-3 kernels, and for LAPACK.

Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

Page 20: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Why BLAS-3? Why GEMM?

BLAS #FLOPS Mem. refs. Ratio

Level 3 2n3 4n2 n/2

Level 2 2n2 n2 2

Level 1 2n 3n 2/3

BLAS-3: C :=C +AB A,B,C,∈ Rn×n

BLAS-2: y := y +Ax A ∈ Rn×n, x, y ∈ Rn

BLAS-1: y := y + αx x, y ∈ Rn, α ∈ R

Morale BLAS-3: The larger the problem the better, as long as it fits in memory.GEMM is the building block for all the other BLAS-3 kernels, and for LAPACK.

Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

Page 21: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Why BLAS-3? Why GEMM?

BLAS #FLOPS Mem. refs. Ratio

Level 3 2n3 4n2 n/2

Level 2 2n2 n2 2

Level 1 2n 3n 2/3

BLAS-3: C :=C +AB A,B,C,∈ Rn×n

BLAS-2: y := y +Ax A ∈ Rn×n, x, y ∈ Rn

BLAS-1: y := y + αx x, y ∈ Rn, α ∈ R

Morale BLAS-3: The larger the problem the better, as long as it fits in memory.GEMM is the building block for all the other BLAS-3 kernels, and for LAPACK.

Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

Page 22: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Why BLAS-3? Why GEMM?

BLAS #FLOPS Mem. refs. Ratio

Level 3 2n3 4n2 n/2

Level 2 2n2 n2 2

Level 1 2n 3n 2/3

BLAS-3: C :=C +AB A,B,C,∈ Rn×n

BLAS-2: y := y +Ax A ∈ Rn×n, x, y ∈ Rn

BLAS-1: y := y + αx x, y ∈ Rn, α ∈ R

Morale BLAS-3: The larger the problem the better, as long as it fits in memory.GEMM is the building block for all the other BLAS-3 kernels, and for LAPACK.

Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

Page 23: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Why BLAS-3? Why GEMM?

BLAS #FLOPS Mem. refs. Ratio

Level 3 2n3 4n2 n/2

Level 2 2n2 n2 2

Level 1 2n 3n 2/3

BLAS-3: C :=C +AB A,B,C,∈ Rn×n

BLAS-2: y := y +Ax A ∈ Rn×n, x, y ∈ Rn

BLAS-1: y := y + αx x, y ∈ Rn, α ∈ R

Morale BLAS-3: The larger the problem the better, as long as it fits in memory.GEMM is the building block for all the other BLAS-3 kernels, and for LAPACK.

Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

Page 24: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Why BLAS-3? Why GEMM?

BLAS #FLOPS Mem. refs. Ratio

Level 3 2n3 4n2 n/2

Level 2 2n2 n2 2

Level 1 2n 3n 2/3

BLAS-3: C :=C +AB A,B,C,∈ Rn×n

BLAS-2: y := y +Ax A ∈ Rn×n, x, y ∈ Rn

BLAS-1: y := y + αx x, y ∈ Rn, α ∈ R

Morale BLAS-3: The larger the problem the better, as long as it fits in memory.GEMM is the building block for all the other BLAS-3 kernels, and for LAPACK.

Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

Page 25: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Why BLAS-3? Why GEMM?

BLAS #FLOPS Mem. refs. Ratio

Level 3 2n3 4n2 n/2

Level 2 2n2 n2 2

Level 1 2n 3n 2/3

BLAS-3: C :=C +AB A,B,C,∈ Rn×n

BLAS-2: y := y +Ax A ∈ Rn×n, x, y ∈ Rn

BLAS-1: y := y + αx x, y ∈ Rn, α ∈ R

Morale BLAS-3: The larger the problem the better, as long as it fits in memory.GEMM is the building block for all the other BLAS-3 kernels, and for LAPACK.

Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

Page 26: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Why BLAS-3? Why GEMM?

BLAS #FLOPS Mem. refs. Ratio

Level 3 2n3 4n2 n/2

Level 2 2n2 n2 2

Level 1 2n 3n 2/3

BLAS-3: C :=C +AB A,B,C,∈ Rn×n

BLAS-2: y := y +Ax A ∈ Rn×n, x, y ∈ Rn

BLAS-1: y := y + αx x, y ∈ Rn, α ∈ R

Morale BLAS-3: The larger the problem the better, as long as it fits in memory.GEMM is the building block for all the other BLAS-3 kernels, and for LAPACK.

Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

Page 27: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Part 1: Blocked algorithms

Simple example: Cholesky factorization

Input: Matrix A, symmetric and positive definite.

Goal: Determine L (lower triangular matrix) such that LLT = A

L =

(LTL

LBL LBR

)= ?

Paolo Bientinesi (AICES, RWTH Aachen) 8 / 34

Page 28: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Cholesky factorizationiteration i

DONE

DONE

Paolo Bientinesi (AICES, RWTH Aachen) 9 / 34

Page 29: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Cholesky factorizationiteration i+ 1

@@R

-

sqrt

trsv syr

unblocked algorithm

CHOL

TRSM SYRK

blocked algorithm

Paolo Bientinesi (AICES, RWTH Aachen) 9 / 34

Page 30: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Cholesky factorizationiteration i+ 1

DONE

DONE

unblocked algorithm

DONE

DONE

blocked algorithm

Paolo Bientinesi (AICES, RWTH Aachen) 9 / 34

Page 31: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Cholesky: unblocked vs. blocked algorithms

Paolo Bientinesi (AICES, RWTH Aachen) 10 / 34

Page 32: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Part 2: Parallelism? fork-join

Solution #1: Multithreaded BLAS (+ vector instructions)

Chol

TRSM GEMM

LU TRSM

TRSM GEMM

Advantage: ease of use. Legacy code!

Drawback: unnecessary synchronization points

OpenBLAS, ATLAS, BLIS, old versions of MKL, . . .

Paolo Bientinesi (AICES, RWTH Aachen) 11 / 34

Page 33: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Part 2: Parallelism? fork-join

Solution #1: Multithreaded BLAS (+ vector instructions)

Chol

TRSM GEMM

LU TRSM

TRSM GEMM

Advantage: ease of use. Legacy code!

Drawback: unnecessary synchronization points

OpenBLAS, ATLAS, BLIS, old versions of MKL, . . .

Paolo Bientinesi (AICES, RWTH Aachen) 11 / 34

Page 34: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Multithreaded BLASXeon, 32 physical cores, MKL

0.2

0.4

0.6

0.8

1

0 1000 2000 3000 4000 5000 6000 7000 8000

Effic

ienc

y

Matrix dimension

Efficiency of GEMM

1 thread2 threads4 threads8 threads16 threads32 threads

Paolo Bientinesi (AICES, RWTH Aachen) 12 / 34

Page 35: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Development of LA libraries

- New architecture / new architectural features

- GEMM → peak performance [FFT, SpMV]

- BLAS3, factorizations, AX=B → LINPACK benchmark

- BLAS3, factorizations, AX=B

- factorizations, AX=B

- factorizations, AX=B, support routines

...

...

- eigenproblems

Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

Page 36: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Development of LA libraries

- New architecture / new architectural features

- GEMM

→ peak performance [FFT, SpMV]

- BLAS3, factorizations, AX=B → LINPACK benchmark

- BLAS3, factorizations, AX=B

- factorizations, AX=B

- factorizations, AX=B, support routines

...

...

- eigenproblems

Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

Page 37: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Development of LA libraries

- New architecture / new architectural features

- GEMM → peak performance [FFT, SpMV]

- BLAS3, factorizations, AX=B → LINPACK benchmark

- BLAS3, factorizations, AX=B

- factorizations, AX=B

- factorizations, AX=B, support routines

...

...

- eigenproblems

Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

Page 38: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Development of LA libraries

- New architecture / new architectural features

- GEMM → peak performance [FFT, SpMV]

- BLAS3, factorizations, AX=B

→ LINPACK benchmark

- BLAS3, factorizations, AX=B

- factorizations, AX=B

- factorizations, AX=B, support routines

...

...

- eigenproblems

Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

Page 39: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Development of LA libraries

- New architecture / new architectural features

- GEMM → peak performance [FFT, SpMV]

- BLAS3, factorizations, AX=B → LINPACK benchmark

- BLAS3, factorizations, AX=B

- factorizations, AX=B

- factorizations, AX=B, support routines

...

...

- eigenproblems

Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

Page 40: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Development of LA libraries

- New architecture / new architectural features

- GEMM → peak performance [FFT, SpMV]

- BLAS3, factorizations, AX=B → LINPACK benchmark

- BLAS3, factorizations, AX=B

- factorizations, AX=B

- factorizations, AX=B, support routines

...

...

- eigenproblems

Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

Page 41: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Development of LA libraries

- New architecture / new architectural features

- GEMM → peak performance [FFT, SpMV]

- BLAS3, factorizations, AX=B → LINPACK benchmark

- BLAS3, factorizations, AX=B

- factorizations, AX=B

- factorizations, AX=B, support routines

...

...

- eigenproblems

Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

Page 42: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Development of LA libraries

- New architecture / new architectural features

- GEMM → peak performance [FFT, SpMV]

- BLAS3, factorizations, AX=B → LINPACK benchmark

- BLAS3, factorizations, AX=B

- factorizations, AX=B

- factorizations, AX=B, support routines

...

...

- eigenproblems

Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

Page 43: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Development of LA libraries

- New architecture / new architectural features

- GEMM → peak performance [FFT, SpMV]

- BLAS3, factorizations, AX=B → LINPACK benchmark

- BLAS3, factorizations, AX=B

- factorizations, AX=B

- factorizations, AX=B, support routines

...

...

- eigenproblems

Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

Page 44: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Development of LA libraries

- New architecture / new architectural features

- GEMM → peak performance [FFT, SpMV]

- BLAS3, factorizations, AX=B → LINPACK benchmark

- BLAS3, factorizations, AX=B

- factorizations, AX=B

- factorizations, AX=B, support routines

...

...

- eigenproblems

Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

Page 45: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Examples 2005–2006: multicores

GEMM, mtBLAS

HPL2005-7: Blue Gene Ldual cores (PowerPC A2)

2008-9: Roadrunnerdual cores (Opteron)

2009: Jaguar6-cores (Opteron)

2010-11: K Computer8-cores SPARC64. . .

libFLAME, PLASMA,MKL

Eigensolvers: 2010

Paolo Bientinesi (AICES, RWTH Aachen) 14 / 34

Page 46: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Examples 2005–2006: Cell

GEMM: 99% [FFT]no BLAS

HPL2008-9: Roadrunner>1 PetaFLOP6k * (Opteron + 2 Cell)

no libs

2009: discontinued

Eigensolvers: –

Paolo Bientinesi (AICES, RWTH Aachen) 15 / 34

Page 47: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Examples 2005: GPGPUs

cuBLAS (*)

HPL 2010: Tianhe-1A2.5k dual-GPU (ATI Radeon)

CULA (*)

libFLAME, MAGMA(multiGPU)

Eigensolvers: 2010-11

Paolo Bientinesi (AICES, RWTH Aachen) 16 / 34

Page 48: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Present Intel Xeon Phi (MIC)

GEMM (*), MKL

HPL (predicted)2013: Tianhe-216k * (2 Ivy Bridge + 3 Phi)

-

-

-

Also ARM, FPGAs, DSPs, . . .

Paolo Bientinesi (AICES, RWTH Aachen) 17 / 34

Page 49: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Present Intel Xeon Phi (MIC)

GEMM (*), MKL

HPL (predicted)2013: Tianhe-216k * (2 Ivy Bridge + 3 Phi)

-

-

-

Also ARM, FPGAs, DSPs, . . .

Paolo Bientinesi (AICES, RWTH Aachen) 17 / 34

Page 50: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Future 2017: Exascale (?)

??? heterogeneousarchitecture

Paolo Bientinesi (AICES, RWTH Aachen) 18 / 34

Page 51: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Back to the algorithms: Chol. fact

Fork-join⇒ unnecessary synchronizations

CHOL

SYRKTRSM

CHOL

SYRKTRSM

Iteration 1 Iteration 2

Paolo Bientinesi (AICES, RWTH Aachen) 19 / 34

Page 52: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Part 3: Parallelism? algorithms-by-blocks

Idea: decompose the tasksSolution #2: dependent tasks + scheduling

BA

X

=

A0

A1

A2

B0

B1

B2

X

=

A0X = B0

A1X = B1

A2X = B2

Paolo Bientinesi (AICES, RWTH Aachen) 20 / 34

Page 53: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Part 3: Parallelism? algorithms-by-blocks

Idea: decompose the tasksSolution #2: dependent tasks + scheduling

BA

X

=

A0

A1

A2

B0

B1

B2

X

=

A0X = B0

A1X = B1

A2X = B2

Paolo Bientinesi (AICES, RWTH Aachen) 20 / 34

Page 54: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Part 3: Parallelism? algorithms-by-blocks

Idea: decompose the tasksSolution #2: dependent tasks + scheduling

BA

X

=

A0

A1

A2

B0

B1

B2

X

=

A0X = B0

A1X = B1

A2X = B2

Paolo Bientinesi (AICES, RWTH Aachen) 20 / 34

Page 55: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Storage by blocks

CHOL

SYRKTRSM

CHOL

SYRKTRSM

Iteration 1 Iteration 2

Paolo Bientinesi (AICES, RWTH Aachen) 21 / 34

Page 56: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Creating tasksIteration 1

CHOL

TRSM

TRSM

TRSM GEMM

GEMM

SYRK

SYRK

GEMM SYRK

Paolo Bientinesi (AICES, RWTH Aachen) 22 / 34

Page 57: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Creating tasksIteration 2

CHOL

TRSM

TRSM

SYRK

GEMM SYRK

Paolo Bientinesi (AICES, RWTH Aachen) 22 / 34

Page 58: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Creating tasksIteration 3

CHOL

TRSM SYRK

Paolo Bientinesi (AICES, RWTH Aachen) 22 / 34

Page 59: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Creating tasksIteration 4

CHOL

Paolo Bientinesi (AICES, RWTH Aachen) 22 / 34

Page 60: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Dependencies

CHOL

TRSM

TRSM

TRSM GEMM

GEMM

SYRK

SYRK

GEMM SYRK

Paolo Bientinesi (AICES, RWTH Aachen) 23 / 34

Page 61: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Dependencies

CHOL

TRSM

TRSM

TRSM GEMM

GEMM

SYRK

SYRK

GEMM SYRK

Paolo Bientinesi (AICES, RWTH Aachen) 23 / 34

Page 62: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Dependencies

CHOL

TRSM

TRSM

TRSM GEMM

GEMM

SYRK

SYRK

GEMM SYRK

Paolo Bientinesi (AICES, RWTH Aachen) 23 / 34

Page 63: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Dependencies

CHOL

TRSM

TRSM

TRSM

GEMM

GEMM

SYRK

SYRK

GEMM SYRK

Paolo Bientinesi (AICES, RWTH Aachen) 23 / 34

Page 64: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Dependencies

CHOL

TRSM

TRSM

TRSM

GEMM

GEMM

SYRK

SYRK

GEMM SYRK

Paolo Bientinesi (AICES, RWTH Aachen) 23 / 34

Page 65: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

DAG - Dependencies4× 4-tile matrix �� ��CHOL

?����HHHj�� ��TRSM

�����9 ��= ZZ~

�� ��TRSM

�����9 ZZ~XXXXXz

�� ��TRSM

�����9 ZZ~XXXXXz�� ��SYRK

?

�� ��GEMM

?

�� ��GEMM

?

�� ��SYRK

?

�� ��GEMM

?

�� ��SYRK

?

�� ��CHOL

@@R

PPPPPq�� ��TRSM PPPPPq

XXXXXXXXz

�� ��TRSM PPPPPq

XXXXXXXXz�� ��SYRK

?

�� ��GEMM

?

�� ��SYRK

?

�� ��CHOL

HHHj�� ��TRSM

HHHj�� ��SYRK

?�� ��CHOL

Paolo Bientinesi (AICES, RWTH Aachen) 24 / 34

Page 66: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Task Execution4× 4-tile matrix

Stage Scheduled Tasks1 CHOL

2 TRSM TRSM TRSM TRSM

3 SYRK GEMM SYRK GEMM

4 GEMM SYRK GEMM GEMM

5 GEMM SYRK CHOL

6 TRSM TRSM TRSM

7 SYRK GEMM SYRK GEMM

8 GEMM SYRK CHOL

9 TRSM TRSM

10 SYRK GEMM SYRK

11 CHOL

12 TRSM

13 SYRK

14 CHOL

Paolo Bientinesi (AICES, RWTH Aachen) 25 / 34

Page 67: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

SPD Inverse: Chol+Inv+GEMM5× 5-tile matrix

Stage Scheduled Tasks

1 CHOL

2 TRSM TRSM TRSM TRSM

3 SYRK GEMM SYRK GEMM

4 GEMM SYRK GEMM GEMM

5 GEMM SYRK CHOL TRSM

6 TRSM TRSM TRSM TRSM

7 TRSM TRSM TRINV SYRK

8 GEMM SYRK GEMM GEMM

9 SYRK TTMM CHOL TRSM

10 TRSM TRSM TRSM TRSM

11 GEMM GEMM GEMM SYRK

12 GEMM SYRK TRSM CHOL

13 TRSM TRSM TRINV SYRK

14 TRSM GEMM GEMM GEMM

15 GEMM TRMM SYRK TRSM

16 TRSM TTMM CHOL TRSM

17 SYRK TRINV GEMM SYRK

18 GEMM GEMM GEMM TRMM

19 TRMM TRSM TRSM TRSM

20 TRSM TRSM TRSM TRSM

21 TTMM SYRK GEMM SYRK

22 TRINV GEMM GEMM TRINV

23 SYRK SYRK GEMM SYRK

24 TRMM GEMM TRMM GEMM

25 TRMM SYRK GEMM GEMM

26 TTMM GEMM TRMM TRMM

27 SYRK TRMM

28 TRMM

29 TTMM

Paolo Bientinesi (AICES, RWTH Aachen) 26 / 34

Page 68: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

SPD Inverse: Chol+Inv+GEMM5× 5-tile matrix

Stage Scheduled Tasks

1 CHOL

2 TRSM TRSM TRSM TRSM

3 SYRK GEMM SYRK GEMM

4 GEMM SYRK GEMM GEMM

5 GEMM SYRK CHOL TRSM

6 TRSM TRSM TRSM TRSM

7 TRSM TRSM TRINV SYRK

8 GEMM SYRK GEMM GEMM

9 SYRK TTMM CHOL TRSM

10 TRSM TRSM TRSM TRSM

11 GEMM GEMM GEMM SYRK

12 GEMM SYRK TRSM CHOL

13 TRSM TRSM TRINV SYRK

14 TRSM GEMM GEMM GEMM

15 GEMM TRMM SYRK TRSM

16 TRSM TTMM CHOL TRSM

17 SYRK TRINV GEMM SYRK

18 GEMM GEMM GEMM TRMM

19 TRMM TRSM TRSM TRSM

20 TRSM TRSM TRSM TRSM

21 TTMM SYRK GEMM SYRK

22 TRINV GEMM GEMM TRINV

23 SYRK SYRK GEMM SYRK

24 TRMM GEMM TRMM GEMM

25 TRMM SYRK GEMM GEMM

26 TTMM GEMM TRMM TRMM

27 SYRK TRMM

28 TRMM

29 TTMM

Paolo Bientinesi (AICES, RWTH Aachen) 26 / 34

Page 69: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Blocked vs. By-blocks

Paolo Bientinesi (AICES, RWTH Aachen) 27 / 34

Page 70: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Blocked vs. By-blocks: crossover

0

10

20

30

40

50

60

70

80

0 2000 4000 6000 8000 10000 12000 14000

GF

LOP

S

Matrix dimension

Performance of Cholesky ( LLT = A )

SuperMatrix

FLAME

Paolo Bientinesi (AICES, RWTH Aachen) 28 / 34

Page 71: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Multithreaded BLAS vs. Algorithms-by-blocks

No absolute winner: crossover!

3 Ease of use

6 Synchronization

3 Out of order execution

3 Parallelism dictated by datadependencies

6 Plateux

libFLAME, MKL, PLASMA, . . .

Heterogeneous systems↔ schedulers

Paolo Bientinesi (AICES, RWTH Aachen) 29 / 34

Page 72: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Part 4: Streaming

The entire problem does not fit even in main memory

Example b :=(XTM−1X

)−1XTM−1y

Linear regression with non-independent outcomes

Inputs

M ∈ Rn×n, SPD(M), n ∈ [103, . . . , 104]

X ∈ Rn×p, p ∈ [1, . . . , 20], full ranky ∈ Rn

Outputb ∈ Rp

?Sequence of thousands of problems?

Paolo Bientinesi (AICES, RWTH Aachen) 30 / 34

Page 73: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Part 4: Streaming

The entire problem does not fit even in main memory

Example b :=(XTM−1X

)−1XTM−1y

Linear regression with non-independent outcomes

Inputs

M ∈ Rn×n, SPD(M), n ∈ [103, . . . , 104]

X ∈ Rn×p, p ∈ [1, . . . , 20], full ranky ∈ Rn

Outputb ∈ Rp

?Sequence of thousands of problems?

Paolo Bientinesi (AICES, RWTH Aachen) 30 / 34

Page 74: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Algorithm

LLT = M

X := L−1X

S := XTX

GGT = S

y := L−1y

b := XT y

b := G−1b

b := G−T b

Paolo Bientinesi (AICES, RWTH Aachen) 31 / 34

Page 75: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Algorithm – bottleneck?

LLT = M

X := L−1X → to accelerator (TRSM)S := XTX

GGT = S

y := L−1y

b := XT y

b := G−1b

b := G−T b

Paolo Bientinesi (AICES, RWTH Aachen) 31 / 34

Page 76: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

double+triple buffering

GPU

CPU

HDD

t

Read b+3

Send b+2

GPU trsm b+1GPU trsm b

Read b+2

Recv b-1

Send b+1

CPU comp b-1

Write b-1

Recv b CPU b

CPU ⇄ GPU transfer

HDD ⇄ CPU transfer

GPU computation

CPU computation

Data dependencies

Asynchronous dispatch

Paolo Bientinesi (AICES, RWTH Aachen) 32 / 34

Page 77: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

double+triple buffering

b-1

β

bb-1b-2 b+1 b+2 b+3b-3HDD

CPU/RAM

GPU

b-1

b-2 b-1b-3Results r

Data X

b

trsm

α

b+2

A

b-1

B

b+1

C

Paolo Bientinesi (AICES, RWTH Aachen) 32 / 34

Page 78: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

Summary

Dense linear algebra: it’s matter of layersGEMM = foundations, THE building block

Irrespective of the architecture, need for highly optimized BLAS

How to use threads/cores?Multithreaded BLAS vs. algorithms-by-blocks

Large/many problems but limited memory⇒ streaming

Paolo Bientinesi (AICES, RWTH Aachen) 33 / 34

Page 79: Dense Linear Algebra on Heterogeneous Platforms: State of · PDF fileDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen

References

MKLhttp://software.intel.com/en-us/intel-mkl

cuBLAS, CULAhttps://developer.nvidia.com/cublas

libFLAMEhttp://wiki.cs.utexas.edu/flame

BLIShttp://code.google.com/p/blis/

OpenBLAShttp://xianyi.github.io/OpenBLAS/

Magmahttp://icl.cs.utk.edu/magma/

Plasmahttp://icl.cs.utk.edu/plasma/

Paolo Bientinesi (AICES, RWTH Aachen) 34 / 34