RWTH Aachen :: High Performance Matrix Computation

High Performance Linear Algebra on Data Parallel Co-Processors II

Transcript of RWTH Aachen :: High Performance Matrix Computation

Page 1: RWTH Aachen :: High Performance Matrix Computation

High Performance Linear Algebra on Data Parallel Co-Processors II

Page 2: RWTH Aachen :: High Performance Matrix Computation


[email protected] University of Toronto / Technische Universität Berlin

Data parallel co-processors

Figure: architecture block diagram. Labels: Host, Input Assembler, Thread Scheduler, eight multiprocessor blocks each with SMEM and L1, one L2, and Global Memory.

Page 3: RWTH Aachen :: High Performance Matrix Computation

Memory latency

GPU             8800GTX   8600GTS   9800GTX   GTX280
at pins, GB/s   86        32        70        141
aligned copy    89%       83%       85%       89%
misaligned      9%        10%       9%        51%
stride 2        9%        10%       9%        45%
stride 10       10%       10%       9%        10%
stride 1000     0.9%      2.1%      1.1%      1.1%

Source: Volkov, V., and J. W. Demmel. Benchmarking GPUs to Tune Dense Linear Algebra. 2008.


Page 5: RWTH Aachen :: High Performance Matrix Computation

Linear algebra libraries

• CUBLAS: BLAS implementation for CUDA.

• CUSPARSE: Sparse matrix operations.

• CUFFT: FFT implementation for CUDA.

• CULA: LAPACK implementation for CUDA.

• CURAND: (Quasi) random number generation.

• NAG GPU library: NAG for GPUs.

• Thrust: STL style library for common functionality (sort, reduce, ...).

Page 6: RWTH Aachen :: High Performance Matrix Computation

BLAS performance

Source: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_Tools_Cohen.pdf

Page 7: RWTH Aachen :: High Performance Matrix Computation

LU factorization using BLAS

Figure: "Panel factorization: CPU / GPUs / estimates". Gflop/s (0 to 30) against height of panel (64 to 32768).

Source: Volkov, V., and J. W. Demmel. Benchmarking GPUs to Tune Dense Linear Algebra. 2008.

Page 8: RWTH Aachen :: High Performance Matrix Computation

CUFFT performance

Source: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_Tools_Cohen.pdf

Page 9: RWTH Aachen :: High Performance Matrix Computation

CUSPARSE performance

Source: http://www.nvidia.com/content/GTC-2010/pdfs/2070_GTC2010.pdf


Page 11: RWTH Aachen :: High Performance Matrix Computation


GEMM

• General weighted matrix-matrix multiplication: C = αAB + βC

• Algorithm has to be blocked to avoid being memory bandwidth limited.

• Many implementation details have to be aligned with the details of the hardware architecture ...
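To illustrate the blocking idea, here is a minimal CPU-side C++ sketch. It is not the CUBLAS implementation or the one benchmarked on the following slides. The function name `gemm_blocked`, the row-major square layout, and the `tile` parameter are assumptions made for this example. Each tile triple is finished before the next one starts, so its operands can be reused from fast memory instead of being fetched again from main memory.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// C = alpha * A * B + beta * C for square n-by-n row-major matrices.
// Hypothetical helper; `tile` is an illustrative block size.
void gemm_blocked(std::size_t n, float alpha, const std::vector<float>& A,
                  const std::vector<float>& B, float beta,
                  std::vector<float>& C, std::size_t tile = 32) {
    // Scale C once up front so the tiled loops only accumulate.
    for (float& c : C) c *= beta;
    for (std::size_t ii = 0; ii < n; ii += tile)
        for (std::size_t kk = 0; kk < n; kk += tile)
            for (std::size_t jj = 0; jj < n; jj += tile) {
                const std::size_t i_end = std::min(ii + tile, n);
                const std::size_t k_end = std::min(kk + tile, n);
                const std::size_t j_end = std::min(jj + tile, n);
                // Work on one tile of A, B and C at a time.
                for (std::size_t i = ii; i < i_end; ++i)
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const float a = alpha * A[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

On a co-processor the tiles of A and B would typically be staged in shared memory or registers; a good tile size depends on the hardware, which is the point of the last bullet above.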

Page 15: RWTH Aachen :: High Performance Matrix Computation

GEMM

Figure: fraction of peak (0% to 70%) against N (64 to 4096). Labels: our implementation (60%); multiply and add with an operand in shared memory (66%); CUBLAS 1.1 (37%).

Source: http://www.cs.berkeley.edu/~volkov/volkov08-sc08talk.pdf

Page 16: RWTH Aachen :: High Performance Matrix Computation

GEMM

Figure: "SGEMM, matrices N×N". Gflop/s (0 to 400) against N (64 to 4096) for 8800GTX (16 × 1.35GHz), 9800GTX (16 × 1.67GHz), GTX280 (30 × 1.30GHz), 8600GTS (4 × 1.45GHz), Core2 Quad 3.0GHz, and Core2 Duo 2.67GHz.

Source: Volkov, V., and J. W. Demmel. Benchmarking GPUs to Tune Dense Linear Algebra. 2008.

Page 17: RWTH Aachen :: High Performance Matrix Computation


Bisection for eigenvalue spectrum

• Sought: all eigenvalues of tridiagonal, symmetric input matrix T.

• Ingredients: Gerschgorin interval G(T) = [l, u] and Sturm counts C(y) = k (the number of eigenvalues smaller than y).

• Recipe: Bracket eigenvalues starting from the Gerschgorin interval using Sturm counts.
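The Gerschgorin interval can be computed in one pass over the matrix. The following C++ sketch assumes the diagonal is stored in `a` (length n) and the off-diagonal in `b` (length n-1, with `b[i]` coupling rows i and i+1). The function name and this storage layout are choices made for the example, not taken from the slides.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Returns [l, u], an interval containing all eigenvalues of the symmetric
// tridiagonal matrix with diagonal a and off-diagonal b (a must be non-empty).
std::pair<float, float> gerschgorin_interval(const std::vector<float>& a,
                                             const std::vector<float>& b) {
    const std::size_t n = a.size();
    float l = a[0], u = a[0];
    for (std::size_t i = 0; i < n; ++i) {
        // Radius of disc i: sum of the absolute off-diagonal entries in row i.
        float r = 0.0f;
        if (i > 0) r += std::fabs(b[i - 1]);
        if (i + 1 < n) r += std::fabs(b[i]);
        l = std::min(l, a[i] - r);
        u = std::max(u, a[i] + r);
    }
    return {l, u};
}
```

For the 3-by-3 matrix with diagonal (2, 2, 2) and off-diagonal (-1, -1) this gives [0, 4], which encloses the eigenvalues 2 - √2, 2 and 2 + √2.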

Page 20: RWTH Aachen :: High Performance Matrix Computation

Bisection for eigenvalue spectrum

Figure: bisection tree, built up level by level. The root is the Gerschgorin interval [l, u] with Sturm counts 0 and n. It is split at the midpoint m1 with count C(m1); the next levels add the midpoints m21, m22, m23 and then m31, m32, each with its Sturm count, down to a generic interval [mlk, mlk+1] with counts C(mlk) and C(mlk+1).

Page 28: RWTH Aachen :: High Performance Matrix Computation

Bisection for eigenvalue spectrum

Count = 0;
d = 1;
for i = 1 to n
    d = a_i - x - (b_{i-1} * b_{i-1}) / d;    (with b_0 = 0)
    if d < 0
        Count += 1;
    endif
endfor
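The recurrence above translates almost line by line into C++. This is a sketch under the same storage assumption as before (diagonal `a`, off-diagonal `b` with `b[i-1]` coupling rows i-1 and i, and no off-diagonal term in the first step). Like the pseudocode, it has no safeguard for a pivot `d` that becomes exactly zero; production codes add one.

```cpp
#include <cstddef>
#include <vector>

// Sturm count: number of eigenvalues of the symmetric tridiagonal matrix
// (diagonal a, off-diagonal b) that are smaller than x.
int sturm_count(const std::vector<float>& a, const std::vector<float>& b,
                float x) {
    int count = 0;
    float d = 1.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // The first row has no off-diagonal neighbour above it.
        const float off = (i == 0) ? 0.0f : b[i - 1];
        d = a[i] - x - off * off / d;
        if (d < 0.0f) ++count;
    }
    return count;
}
```

For the matrix with diagonal (2, 2, 2) and off-diagonal (-1, -1), whose eigenvalues are roughly 0.586, 2 and 3.414, the count at x = 1.5 is 1 and the count at x = 2.5 is 2.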

Page 29: RWTH Aachen :: High Performance Matrix Computation

Bisection for eigenvalue spectrum

Figure: the same bisection tree, with the levels annotated 1X, 2X, 3X, and so on, down to ≈ nX at the last level.

Page 37: RWTH Aachen :: High Performance Matrix Computation


Bisection for eigenvalue spectrum

• Strategy: one thread per tree node (or interval).

– Tree is processed level by level.

• Implementation “details”?

– Memory management? Can and should we use shared memory?

– How can we handle large matrices?

Page 39: RWTH Aachen :: High Performance Matrix Computation


Bisection for eigenvalue spectrum

1. Compute Gerschgorin interval.

2. Move data to co-processor.

– Matrix data (a_i, b_i).

– Gerschgorin interval.

3. Launch kernel.

Page 42: RWTH Aachen :: High Performance Matrix Computation

Bisection for eigenvalue spectrum

// move matrix data to shared mem
__shared__ float a[MAX_MAT_SIZE];
__shared__ float b[MAX_MAT_SIZE];
// each thread copies every num_threads-th entry, starting at its own tid
for( unsigned int i = tid; i < mat_size; i += num_threads) {
  a[i] = a_g[i];
  b[i] = b_g[i];
}
// wait until all threads have finished copying
__syncthreads();

Page 43: RWTH Aachen :: High Performance Matrix Computation

Bisection for eigenvalue spectrum

struct SharedMem {
  float left[MAX_THREADS_BLOCK];
  float right[MAX_THREADS_BLOCK];
  short left_count[MAX_THREADS_BLOCK];
  short right_count[MAX_THREADS_BLOCK];
  short compact[MAX_THREADS_BLOCK+1];
};

Page 44: RWTH Aachen :: High Performance Matrix Computation

Bisection for eigenvalue spectrum

// move Gerschgorin data to shared mem
__shared__ SharedMem smem;
if( 0 == tid) {
  smem.left[0] = lg;
  smem.right[0] = ug;
  smem.left_count[0] = 0;
  smem.right_count[0] = mat_size+1;
}

Page 45: RWTH Aachen :: High Performance Matrix Computation

Bisection for eigenvalue spectrum

  // compute midpoint of interval
  // compute Sturm count for midpoint
  // write intervals to shared memory
  smem.left[tid] = left;
  smem.right[tid] = mid;
  smem.left[tid+1] = mid;
  smem.right[tid+1] = right;
  ...
}

Page 46: RWTH Aachen :: High Performance Matrix Computation

Bisection for eigenvalue spectrum

if( tid < num_active_threads) {
  float left = smem.left[tid];
  float right = smem.right[tid];
  ...
  // compute midpoint of interval
  float mid = midpoint( left, right);
  // compute Sturm count for midpoint
  mid_count = sturm_count( left, right, mid);

Page 47: RWTH Aachen :: High Performance Matrix Computation

Bisection for eigenvalue spectrum

  // compute Sturm count for midpoint
  mid_count = sturm_count( left, right, mid);
  // write intervals to shared memory
  __syncthreads();
  smem.left[2*tid] = left;
  smem.right[2*tid] = mid;
  smem.left[2*tid+1] = mid;
  smem.right[2*tid+1] = right;
  num_intervals *= 2;
}

Page 48: RWTH Aachen :: High Performance Matrix Computation


Interlude: prefix sum (scan)

• Swiss army knife of data-parallel programming.

• Sort, max, min, reduce, ...

• Divide-and-Conquer strategy.

– Traverse tree first bottom up and then top down.

– Bottom up: gather information.

– Top down: distribute information.
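The two traversals can be sketched sequentially in C++. The version below is the classic work-efficient exclusive scan (up-sweep followed by down-sweep), written here as plain loops; on a co-processor each inner loop would be executed by parallel threads with a barrier between levels. The function name and the power-of-two length restriction are assumptions of this example.

```cpp
#include <cstddef>
#include <vector>

// Exclusive prefix sum, in place. data.size() must be a power of two.
void exclusive_scan(std::vector<int>& data) {
    const std::size_t n = data.size();
    if (n == 0) return;
    // Bottom up (gather): each internal node receives the sum of its subtree.
    for (std::size_t stride = 1; stride < n; stride *= 2)
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride)
            data[i] += data[i - stride];
    // Top down (distribute): each node passes its prefix on to its children.
    data[n - 1] = 0;
    for (std::size_t stride = n / 2; stride >= 1; stride /= 2)
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            const int left = data[i - stride];
            data[i - stride] = data[i];
            data[i] += left;
        }
}
```

For the input 3 1 7 0 4 1 6 3 the result is 0 3 4 11 11 15 16 22: every entry holds the sum of all entries before it.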

Page 50: RWTH Aachen :: High Performance Matrix Computation

Bisection for eigenvalue spectrum

  // write intervals to shared memory
  __syncthreads();
  smem.left[2*tid] = left;
  smem.right[2*tid] = mid;
  smem.left[2*tid+1] = mid;
  smem.right[2*tid+1] = right;
  // compact intervals in shared memory
  ...
  num_intervals = smem.compact[num_intervals];
}
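The elided compaction step is where the prefix sum comes in. A sequential C++ sketch of the idea follows; it is not the kernel code from the slides. A flag per entry marks the survivors, an exclusive prefix sum over the flags yields each survivor's output slot, and the last entry of the scan is the new count, which mirrors reading `smem.compact[num_intervals]` above. The function name `compact_values` and the flag array are assumptions of this example.

```cpp
#include <cstddef>
#include <vector>

// Keeps the entries of `values` whose flag in `keep` is non-zero, preserving
// their order, and returns the number of surviving entries.
std::size_t compact_values(std::vector<float>& values,
                           const std::vector<int>& keep) {
    const std::size_t n = values.size();
    // Exclusive prefix sum of the flags: slot[i] is the output index of
    // entry i if it survives, slot[n] is the total number of survivors.
    std::vector<std::size_t> slot(n + 1, 0);
    for (std::size_t i = 0; i < n; ++i)
        slot[i + 1] = slot[i] + (keep[i] ? 1 : 0);
    // Scatter: every entry can be moved independently of the others.
    std::vector<float> out(slot[n]);
    for (std::size_t i = 0; i < n; ++i)
        if (keep[i]) out[slot[i]] = values[i];
    values = out;
    return slot[n];
}
```

In the bisection kernel the flag would say whether a child interval still contains eigenvalues, so empty intervals do not occupy threads in the next level.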

Page 51: RWTH Aachen :: High Performance Matrix Computation


Bisection for eigenvalue spectrum

1. Read data from global to shared memory.

2. Until all intervals are converged:

– Read interval into registers.

– Bisect and compute Sturm count.

– Write to shared memory.

– Compact intervals.

3. Write eigenvalues to global memory.
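The steps above can be mirrored by a sequential C++ reference, which is handy for checking a kernel's output. This is a sketch under stated assumptions, not the kernel from the slides: the names `Interval`, `count_below` and `bisect_eigenvalues`, the absolute tolerance `tol`, and the clamping of the midpoint count are choices made here. The loop over `active` stands in for one thread per interval, and dropping empty child intervals stands in for the compaction step.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One node of the bisection tree: [left, right] and the Sturm counts at its ends.
struct Interval {
    float left, right;
    int left_count, right_count;
};

// Number of eigenvalues below x (no safeguard against a zero pivot d).
static int count_below(const std::vector<float>& a,
                       const std::vector<float>& b, float x) {
    int count = 0;
    float d = 1.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float off = (i == 0) ? 0.0f : b[i - 1];
        d = a[i] - x - off * off / d;
        if (d < 0.0f) ++count;
    }
    return count;
}

std::vector<float> bisect_eigenvalues(const std::vector<float>& a,
                                      const std::vector<float>& b,
                                      float tol = 1e-5f) {
    std::vector<float> eigenvalues;
    const std::size_t n = a.size();
    if (n == 0) return eigenvalues;
    // Gerschgorin interval [l, u] as the root of the tree.
    float l = a[0], u = a[0];
    for (std::size_t i = 0; i < n; ++i) {
        float r = 0.0f;
        if (i > 0) r += std::fabs(b[i - 1]);
        if (i + 1 < n) r += std::fabs(b[i]);
        l = std::min(l, a[i] - r);
        u = std::max(u, a[i] + r);
    }
    std::vector<Interval> active;
    active.push_back({l, u, 0, static_cast<int>(n)});
    // Process the tree level by level until no interval is left.
    while (!active.empty()) {
        std::vector<Interval> next;
        for (const Interval& iv : active) {  // one GPU thread per interval
            const float mid = 0.5f * (iv.left + iv.right);
            // Stop when narrow enough or when floats cannot split further.
            const bool converged = (iv.right - iv.left < tol) ||
                                   mid <= iv.left || mid >= iv.right;
            if (converged) {
                for (int k = iv.left_count; k < iv.right_count; ++k)
                    eigenvalues.push_back(mid);
                continue;
            }
            int mid_count = count_below(a, b, mid);
            // Clamp so rounding errors cannot create or lose eigenvalues.
            mid_count = std::max(iv.left_count,
                                 std::min(mid_count, iv.right_count));
            // Compaction: only children that contain eigenvalues survive.
            if (mid_count > iv.left_count)
                next.push_back({iv.left, mid, iv.left_count, mid_count});
            if (iv.right_count > mid_count)
                next.push_back({mid, iv.right, mid_count, iv.right_count});
        }
        active = next;
    }
    std::sort(eigenvalues.begin(), eigenvalues.end());
    return eigenvalues;
}
```

For the matrix with diagonal (2, 2, 2) and off-diagonal (-1, -1) this returns three values close to 2 - √2, 2 and 2 + √2.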

Page 52: RWTH Aachen :: High Performance Matrix Computation

Bisection for eigenvalue spectrum

for( j=0; j<(mat_size/blockDim.x); ++j){
  // read data with all threads in the block
  // make sure the shared memory is not used
  __syncthreads();
  // read data for current block
  if( addr < mat_size) {
    d_data[tid] = g_d[addr];
    lld_data[tid] = g_lld[addr];
  }
  __syncthreads();

Page 53: RWTH Aachen :: High Performance Matrix Computation

Bisection for eigenvalue spectrum

  // read data for current block
  if( addr < mat_size) {
    a[tid] = a_g[addr];
    b[tid] = b_g[addr];
  }
  __syncthreads();
  // for all active intervals
  if(tid < num_intervals) {
    // Sturm count computation for current block
    ...

Page 54: RWTH Aachen :: High Performance Matrix Computation

Bisection for eigenvalue spectrum

Figure: "Execution Time". Time (ms), 0 to 18000, against Matrix Size, 0 to 4000. Series: sstemr mean, sstemr min, sstemr max, mrrr_dp mean, mrrr_dp min, mrrr_dp max.

Page 55: RWTH Aachen :: High Performance Matrix Computation

How to get started?

• http://developer.nvidia.com/cuda/

– CUDA 3.0 has device emulation mode.

– SDK has many, many code examples.

– NVIDIA linear algebra libraries.

• http://www.cs.berkeley.edu/~volkov/

– Performance analysis of linear algebra on data parallel co-processors.

Page 56: RWTH Aachen :: High Performance Matrix Computation

Conclusion

• Linear algebra on data parallel co-processors can be significantly more efficient than on CPUs.

• But it requires architecture-specific design and optimizations.

• For many algorithms it is still unclear how to implement them efficiently on a data parallel co-processor.