
Transcript of Slides meyer116

Utility of Graphics Processing Units for dense matrix calculations in computing and inverting genomic relationship matrices

Karin Meyer and Bruce Tier

Animal Genetics and Breeding Unit, University of New England, Armidale, Australia

AAABG 2013

GPUs for GRMs | Introduction

Computing in the genomics age

Requirements for mixed model analyses have changed!

Pre-genomics: mixed model equations sparse (few non-zero elements)

– inverse of pedigree-based relationship matrix (NRM)
– concentrated on exploiting sparsity in computations

Now: genomic relationship matrix (GRM) is dense (large proportion of elements non-zero)

– requires manipulation of large, dense matrices to
  – compute the GRM
  – invert the GRM
  – predict breeding values in the single-step approach
– computationally demanding
– exploit new hardware and software libraries
  – computing in parallel

K. M. | 2 / 16

GPUs for GRMs | Introduction

Objectives

A first look, to examine the scope for

– speeding up computations to calculate & invert the GRM
– using parallel computations
– exploiting a Graphics Processing Unit (GPU)

“An excursion into the land of computer gaming,

modern compilers and software libraries”

K. M. | 3 / 16

GPUs for GRMs | Introduction

What is a GPU?

Graphics Processing Unit

initially: co-processor for computer gaming

– graphics card
– fast recalculation of pixel values
– remember: the i387 math co-processor for the i386

now: GP-GPU, adapted to General Purpose computing

– hundreds to thousands of cores
– capable of double precision calculations
– turns a desktop PC into a personal super-computer
– but: limited memory (currently 6 GB max.)

specialised interface for application programming

– CUDA (proprietary to NVIDIA) −→ Fortran interface
– OpenCL (C++ like)

K. M. | 4 / 16

GPUs for GRMs | Introduction

Tesla K20X

2688 cores

6 GB RAM

Peak: 3.95/1.31 Tflops (single/double precision)

K. M. | 5 / 16

GPUs for GRMs | Calculating the GRM

Memory needed: G = ZZ′

Gigabytes – single precision matrices

      n    G (n×n)    Z (n×s), s = 100K    500K    1000K
  5 000        0.1                  1.9     9.3     18.6
 10 000        0.4                  3.7    18.6     37.3
 20 000        1.5                  7.5    37.3     74.5
 30 000        3.4                 11.2    55.9    111.8
 50 000        9.3                 18.6    93.1    186.3

−→ break calculations into blocks

−→ use out-of-core storage

↪→ parallel computing literature

K. M. | 6 / 16
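The table entries follow directly from 4 bytes per single-precision element; a quick Python check (values come out in binary gigabytes, as used in the table):

    def gib_single_precision(rows, cols):
        """Storage for a dense single-precision matrix, in binary gigabytes."""
        return 4 * rows * cols / 2**30    # 4 bytes per float32 element

    print(round(gib_single_precision(50_000, 50_000), 1))     # G at n = 50 000            -> 9.3
    print(round(gib_single_precision(50_000, 1_000_000), 1))  # Z at n = 50 000, s = 1000K -> 186.3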

GPUs for GRMs | Calculating the GRM

Blockwise multiplication

All-in-one:

    G = ZZ′

Column-wise division (Z split into column blocks Z.1, …, Z.4):

    G = ∑k Z.k Z′.k    i.e.  Z.1 Z′.1 + · · · + Z.4 Z′.4

Z: n animals × s SNPs;  G symmetric −→ s n(n+1)/2 flops

K. M. | 7 / 16
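The column-wise scheme maps onto repeated symmetric rank-k updates (SSYRK in BLAS terms). A minimal NumPy sketch of the idea, not the code used for the timings; snp_blocks stands for any source that yields one block of (centred, scaled) genotype columns at a time, so the full Z never has to sit in memory:

    import numpy as np

    def build_grm_by_columns(snp_blocks, n):
        """Accumulate G = sum_k Z.k Z.k' over column blocks of Z."""
        G = np.zeros((n, n), dtype=np.float32)
        for Zk in snp_blocks:            # one (n x s_k) block of SNP columns
            G += Zk @ Zk.T               # rank-s_k update, SSYRK-like
        return G

    # toy usage: 4 blocks of 250 SNPs for 1000 animals
    rng = np.random.default_rng(0)
    blocks = [rng.standard_normal((1000, 250)).astype(np.float32) for _ in range(4)]
    G = build_grm_by_columns(blocks, n=1000)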

GPUs for GRMs | Calculating the GRM

Blockwise multiplication – cont’d

Row-wise division (Z split into row blocks Zi.):

    Gij = Zi. Z′j.

Row- and column-wise division (Z split into tiles Zik):

    Gij = ∑k Zik Z′jk

K. M. | 8 / 16
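When both n and s are large, Z can be tiled in both directions and the tiles kept out of core; only two tiles and one block of G need to be resident at a time. A sketch under the assumptions of equal-sized row blocks and a hypothetical load_tile(i, k) that returns tile Zik on demand; only the lower triangle of the symmetric G is formed:

    import numpy as np

    def build_grm_by_tiles(load_tile, n_row_blocks, n_col_blocks, block_rows):
        """G_ij = sum_k Z_ik Z_jk' for the lower triangle (i >= j)."""
        n = n_row_blocks * block_rows
        G = np.zeros((n, n), dtype=np.float32)
        for i in range(n_row_blocks):
            for j in range(i + 1):                        # G is symmetric
                Gij = np.zeros((block_rows, block_rows), dtype=np.float32)
                for k in range(n_col_blocks):
                    Gij += load_tile(i, k) @ load_tile(j, k).T   # SGEMM-like
                G[i*block_rows:(i+1)*block_rows,
                  j*block_rows:(j+1)*block_rows] = Gij
        return G

    # toy check with an in-memory Z playing the role of the out-of-core tiles
    Z = np.random.default_rng(0).standard_normal((400, 600)).astype(np.float32)
    tile = lambda i, k: Z[i*100:(i+1)*100, k*200:(k+1)*200]
    G = build_grm_by_tiles(tile, n_row_blocks=4, n_col_blocks=3, block_rows=100)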

GPUs for GRMs | Inverting the GRM

Blockwise inversion

        | GPP  GPC  GPT |
    G = | GCP  GCC  GCT |
        | GTP  GTC  GTT |

1. Subdivide matrix into blocks: P = previous, C = current (chosen size), T = trailing
2. Factor & invert chol(GCC)
3. Adjust GPP & GPC
4. ‘Absorb’ into GTT & GCT
5. Calculate inverse
6. Redefine C, P & T; repeat

Gauss-Jordan elimination

    GCC := chol(GCC)
    GCC := GCC⁻¹
    GPC := GPC GCC
    GPP := GPP + GPC G′PC
    GCT := G′CC GCT
    GTT := GTT − G′CT GCT
    GPT := GPT − GPC GCT
    GCT := −(GCC GCT)
    GPC := GPC G′CC
    GCC := GCC G′CC

−→ break matrix products into blocks as needed

K. M. | 9 / 16
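The panel sweep above is awkward to reproduce in a few lines, but its core idea, that the inverse of a symmetric positive definite matrix can be assembled from block-sized products only, can be illustrated with a single level of partitioning. The sketch below uses the standard Schur-complement form rather than the exact recursion on the slide:

    import numpy as np

    def blockwise_spd_inverse(G, m):
        """Inverse of SPD G = [[A, B], [B', D]] with A of size m x m."""
        A, B, D = G[:m, :m], G[:m, m:], G[m:, m:]
        Ai = np.linalg.inv(A)                  # invert leading block
        S = D - B.T @ Ai @ B                   # Schur complement of A
        Si = np.linalg.inv(S)
        top_right = -Ai @ B @ Si
        top_left = Ai + Ai @ B @ Si @ B.T @ Ai
        return np.block([[top_left, top_right],
                         [top_right.T, Si]])

    # check on a small SPD matrix
    rng = np.random.default_rng(1)
    X = rng.standard_normal((50, 80))
    G = X @ X.T + 50 * np.eye(50)
    assert np.allclose(blockwise_spd_inverse(G, m=20) @ G, np.eye(50), atol=1e-6)

Looping this over successive panels, with the trailing block updated (‘absorbed’) as on the slide, gives the blocked Gauss-Jordan variant whose working set fits into limited GPU memory.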

GPUs for GRMs | Computing environment

Hardware

‘Standard’ multi-core CPU

– Intel i7-3930K
– 3.2 GHz, 12 MB cache
– 64 GB RAM
– 6 cores

GPU

– Nvidia GTX 580
– 512 cores
– 3 GB RAM
– capable of double precision calculations

↪→ high-end gaming card, not a GP-GPU

K. M. | 10 / 16

GPUs for GRMs | Computing environment

Software

Linux (CentOS 6)

gfortran, gcc 4.4.7; Intel composer XE 13.0.1

Intel MKL library 11.0

– BLAS: SSYRK, SGEMM, STRTRI, STRMM
– LAPACK: SPOTRF, SPOTRI

CUDA 5.0

– CUBLAS: CUBLAS_SSYRK, CUBLAS_SGEMM, CUBLAS_STRTRI, CUBLAS_STRMM

CULA dense R16a

– CULA_SPOTRF, CULA_SPOTRI, CULA_STRTRI

MAGMA 1.3

– MAGMAF_SLAUUM

K. M. | 11 / 16
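For orientation only, the CPU route through the routines listed above can be sketched with SciPy's thin wrappers around the same BLAS/LAPACK calls (ssyrk to build G, spotrf/spotri to factor and invert); a single-precision illustration, not the Fortran driver used for the timings:

    import numpy as np
    from scipy.linalg.blas import ssyrk
    from scipy.linalg.lapack import spotrf, spotri

    rng = np.random.default_rng(2)
    Z = rng.standard_normal((2000, 10000)).astype(np.float32)  # n animals x s SNPs

    G = ssyrk(1.0, Z, lower=1)          # G = Z Z' (lower triangle only)

    L, info = spotrf(G, lower=1)        # Cholesky factor, G = L L'
    assert info == 0
    Ginv, info = spotri(L, lower=1)     # inverse assembled from the factor
    assert info == 0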

GPUs for GRMs | Results

Time to ‘make’ GRM∗

[Figure: system time (min, log scale from 10⁻¹ to 10³) against number of animals (0–80 000) for GPU, CPU6 and CPU1; reference lines at 1 hour and 5 hours.]

∗Single precision calculations, s = 500 000

K. M. | 12 / 16


[Inset on the figure above: speed-up ratios GPU/CPU1 and CPU6/CPU1 (vertical scale 5–20).]

GPUs for GRMs | Results

Time to invert GRM†

[Figure: system time (min, log scale from 10⁻⁴ to 10²) against number of animals (0–80 000) for GPU, CPU6 and CPU1; reference line at 1 hour.]

†Single precision calculations

K. M. | 13 / 16


[Inset on the figure above: speed-up ratios GPU/CPU1 and CPU6/CPU1 (vertical scale 5–15).]

GPUs for GRMs | Results

Time to invert GRM‡

[Figure: system time (min, log scale from 10⁻³ to 10²) against number of animals (0–80 000) for GPU, CPU6 and CPU1; reference line at 1 hour.]

‡Double precision calculations

K. M. | 14 / 16

[Inset on the figure above: speed-up ratios GPU/CPU1 and CPU6/CPU1 (vertical scale 4–8).]

GPUs for GRMs | Results | Conclusions

Conclusions

Build & invert GRM

−→ dense matrix multiplications
−→ highly parallelisable

BLAS and LAPACK library routines awesome

– highly optimised
– exploit memory architecture of modern processors
– multi-threaded versions

GPU: powerful new hardware for parallel computing

– thousands of processors
– can use multiple GPUs simultaneously
– BLAS/LAPACK etc. libraries available
– but: limited memory −→ blocked algorithms
– but: specialised interface, not much Fortran support yet

K. M. | 15 / 16


GPUs for GRMs | Results | Conclusions

K. M. | 16 / 16