Utility of Graphics Processing Units
for dense matrix calculations in
computing and inverting genomic
relationship matrices
Karin Meyer and Bruce Tier
Animal Genetics and Breeding Unit, University of New England,
Armidale, Australia
AAABG 2013
GPUs for GRMs | Introduction
Computing in the genomics age
Requirements for mixed model analyses have changed!
Pre-genomics: mixed model equations sparse (few non-zero elements)
– inverse of pedigree-based relationship matrix (NRM)
– concentrated on exploiting sparsity in computations
Now: genomic relationship matrix (GRM)
– dense (large proportion of elements non-zero)
– requires manipulation of large, dense matrices to
  – compute the GRM
  – invert the GRM
  – predict breeding values in single-step approach
– computationally demanding
– exploit new hardware and software libraries
  – computing in parallel
K. M. | 2 / 16
Objectives
A first look to examine scope for
– speeding up computations to calculate & invert GRM
– using parallel computations
– exploiting a Graphics Processing Unit (GPU)
“An excursion into the land of computer gaming,
modern compilers and software libraries”
What is a GPU?
Graphics Processing Unit
initially: co-processor for computer gaming
– graphics card
– fast recalculation of pixel values
– remember: i387 math processor for i386
now: GP-GPU adapted to General Purpose computing
– hundreds to thousands of cores
– capable of double precision calculations
– turns desktop PC into personal super-computer
– but: limited memory (currently 6GB max.)
specialised interface for application programming
– CUDA (proprietary of NVIDIA) −→ Fortran interface
– OPENCL (C++ like)
Tesla K20X
2688 cores
6 GB RAM
Peak: 3.95/1.31 Tflops (single/double precision)
GPUs for GRMs | Calculating the GRM
Memory needed: G = ZZ′
Gigabytes – single precision matrices
      n   G (n×n)  Z, s=100K  Z, s=500K Z, s=1000K
  5 000       0.1        1.9        9.3       18.6
 10 000       0.4        3.7       18.6       37.3
 20 000       1.5        7.5       37.3       74.5
 30 000       3.4       11.2       55.9      111.8
 50 000       9.3       18.6       93.1      186.3
−→ break calculations into blocks
−→ use out-of-core storage
↪→ parallel computing literature
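The memory figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming 4 bytes per single precision element and gigabytes of 2^30 bytes (an inference that matches the tabulated values, not something stated on the slide); the function name `matrix_gib` is made up for illustration.

```python
def matrix_gib(rows, cols, bytes_per_element=4):
    """Storage for a dense rows x cols matrix, in units of 2**30 bytes."""
    return rows * cols * bytes_per_element / 2**30

# One line per number of animals n: G is n x n, Z is n x s.
for n in (5_000, 10_000, 20_000, 30_000, 50_000):
    g = matrix_gib(n, n)
    z = [matrix_gib(n, s) for s in (100_000, 500_000, 1_000_000)]
    print(f"{n:>6}  G: {g:5.1f}  Z: " + "  ".join(f"{v:6.1f}" for v in z))
```

Setting `bytes_per_element=8` gives the corresponding double precision requirements, which are twice as large.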
Blockwise multiplication
Z: n animals × s SNPs
G: symmetric, $s\,n(n+1)/2$ flops
All-in-one
$G = ZZ'$
Column-wise division
$Z = (Z_{\cdot 1} \; \cdots \; Z_{\cdot 4})$
$G = Z_{\cdot 1} Z_{\cdot 1}' + \cdots + Z_{\cdot 4} Z_{\cdot 4}' = \sum_k Z_{\cdot k} Z_{\cdot k}'$
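To give a feel for the operation count quoted for the symmetric product, here is a small sketch that evaluates s n(n+1)/2 and compares it with the s n² needed if symmetry were ignored. The function names are hypothetical, and treating one "flop" as one multiply-add is an assumption about how the slide counts.

```python
def grm_multiply_adds(n, s):
    """Multiply-adds for one triangle of G = ZZ' (n animals, s SNPs)."""
    return s * n * (n + 1) // 2   # n(n+1) is always even, so this is exact

def full_product_multiply_adds(n, s):
    """Multiply-adds if all n*n elements of G were computed."""
    return s * n * n

n, s = 50_000, 500_000
saving = grm_multiply_adds(n, s) / full_product_multiply_adds(n, s)
print(f"fraction of work when exploiting symmetry: {saving:.5f}")
```

For large n the fraction approaches one half, which is why a symmetric rank-k update is preferred over a general matrix product for this step.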
Blockwise multiplication (cont’d)
Row-wise division
$G_{ij} = Z_{i\cdot} Z_{j\cdot}'$
Row- and column-wise division
$G_{ij} = \sum_k Z_{ik} Z_{jk}'$
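The row- and column-wise division can be sketched in NumPy as below. This is an illustrative sketch only, not the code used for the timings reported here: the function name, block sizes and the use of full (rather than triangular) storage are assumptions. Only blocks on or below the diagonal are computed, and each is mirrored to the other triangle.

```python
import numpy as np

def grm_blockwise(Z, row_block, col_block):
    """Accumulate G = ZZ' block by block: G_ij = sum_k Z_ik Z_jk'."""
    n, s = Z.shape
    G = np.zeros((n, n), dtype=Z.dtype)
    for i0 in range(0, n, row_block):
        i1 = min(i0 + row_block, n)
        for j0 in range(0, i1, row_block):      # blocks with j <= i only
            j1 = min(j0 + row_block, n)
            for k0 in range(0, s, col_block):   # sum over column blocks
                k1 = min(k0 + col_block, s)
                G[i0:i1, j0:j1] += Z[i0:i1, k0:k1] @ Z[j0:j1, k0:k1].T
            if j0 != i0:
                G[j0:j1, i0:i1] = G[i0:i1, j0:j1].T   # mirror by symmetry
    return G

rng = np.random.default_rng(1)
Z = rng.standard_normal((7, 11))
print(np.allclose(grm_blockwise(Z, row_block=3, col_block=4), Z @ Z.T))
```

In an out-of-core or GPU setting, each `Z[i0:i1, k0:k1]` slice would be the unit that is read from disk or copied to device memory, so the block sizes are chosen to fit the available memory.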
GPUs for GRMs | Inverting the GRM
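Blockwise inversion
$$G = \begin{pmatrix} G_{PP} & G_{PC} & G_{PT} \\ G_{CP} & G_{CC} & G_{CT} \\ G_{TP} & G_{TC} & G_{TT} \end{pmatrix}$$
1. Subdivide matrix into blocks
– C current −→ chosen size
– P previous
– T trailing
2. Factor & invert $\mathrm{chol}(G_{CC})$
3. Adjust $G_{PP}$ & $G_{PC}$
4. ‘Absorb’ into $G_{TT}$ & $G_{CT}$
5. Calculate inverse
6. Redefine C, P & T; repeat
Gauss-Jordan Elimination
$G_{CC} := \mathrm{chol}(G_{CC})$
$G_{CC} := G_{CC}^{-1}$
$G_{PC} := G_{PC} G_{CC}$
$G_{PP} := G_{PP} + G_{PC} G_{PC}'$
$G_{CT} := G_{CC}' G_{CT}$
$G_{TT} := G_{TT} - G_{CT}' G_{CT}$
$G_{PT} := G_{PT} - G_{PC} G_{CT}$
$G_{CT} := -(G_{CC} G_{CT})$
$G_{PC} := G_{PC} G_{CC}'$
$G_{CC} := G_{CC} G_{CC}'$
−→ break matrix products into blocks as needed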
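The sequence of block updates above can be tried out on a small matrix with the NumPy sketch below. It is a reading of the slide under stated assumptions, not the authors' implementation: chol(G_CC) is taken as the upper triangular factor U with G_CC = U′U, full storage is used instead of one triangle, and `np.linalg.inv` stands in for a dedicated triangular inversion routine. The name `blockwise_invert` and the variable `W` (holding the inverse of U) are made up for illustration.

```python
import numpy as np

def blockwise_invert(G, block=2):
    """Invert a symmetric positive definite matrix one diagonal block at a time."""
    A = np.array(G, dtype=float)            # work on a copy
    n = A.shape[0]
    for start in range(0, n, block):
        stop = min(start + block, n)
        P, C, T = slice(0, start), slice(start, stop), slice(stop, n)
        L = np.linalg.cholesky(A[C, C])     # A_CC = L L' = U'U with U = L'
        W = np.linalg.inv(L).T              # G_CC := chol(G_CC)^{-1} = U^{-1}
        PC = A[P, C] @ W                    # G_PC := G_PC G_CC
        A[P, P] += PC @ PC.T                # G_PP := G_PP + G_PC G_PC'
        CT = W.T @ A[C, T]                  # G_CT := G_CC' G_CT
        A[T, T] -= CT.T @ CT                # G_TT := G_TT - G_CT' G_CT
        A[P, T] -= PC @ CT                  # G_PT := G_PT - G_PC G_CT
        A[C, T] = -(W @ CT)                 # G_CT := -(G_CC G_CT)
        A[P, C] = PC @ W.T                  # G_PC := G_PC G_CC'
        A[C, C] = W @ W.T                   # G_CC := G_CC G_CC'
        # keep the other triangle in step (only needed with full storage)
        A[T, P], A[T, C], A[C, P] = A[P, T].T, A[C, T].T, A[P, C].T
    return A

rng = np.random.default_rng(2)
Z = rng.standard_normal((7, 20))
G = Z @ Z.T + 7 * np.eye(7)                 # well-conditioned test matrix
print(np.allclose(blockwise_invert(G, block=3), np.linalg.inv(G)))
```

After the last block has been processed, P covers the whole matrix and holds the inverse. Every update is a matrix product involving at most one block of `block` rows or columns, which is what allows the work to be passed to level 3 BLAS routines and split further when memory is limited.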
GPUs for GRMs | Computing environment
Hardware
‘Standard’ multi-core CPU
– Intel i7-3930K
– 3.2 GHz, 12M cache
– 64GB RAM
– 6 cores
GPU
– Nvidia GTX 580
– 512 cores
– 3GB RAM
– capable of double precision calculations
↪→ high end gaming card, not GP-GPU
Software
Linux (CentOS 6)
gfortran, gcc 4.4.7; Intel composer XE 13.0.1
Intel MKL library 11.0
– BLAS: SSYRK, SGEMM, STRTRI, STRMM
– LAPACK: SPOTRF, SPOTRI
CUDA 5.0
– CUBLAS: CUBLAS_SSYRK, CUBLAS_SGEMM, CUBLAS_STRTRI, CUBLAS_STRMM
CULA dense R16a
– CULA_SPOTRF, CULA_SPOTRI, CULA_STRTRI
MAGMA 1.3
– MAGMAF_SLAUUM
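As a rough guide to what the inversion routines listed above do, the sketch below walks through a Cholesky-based inverse in NumPy and notes, in comments, which routine plays the corresponding role. The mapping is for orientation only: NumPy is used here as a stand-in, `np.linalg.inv` does not exploit the triangular structure, and the function name is hypothetical.

```python
import numpy as np

def invert_spd_via_cholesky(G):
    """Inverse of a symmetric positive definite matrix via its Cholesky factor."""
    L = np.linalg.cholesky(G)     # factorisation, the role of SPOTRF: G = L L'
    L_inv = np.linalg.inv(L)      # triangular inverse, the role of STRTRI
    return L_inv.T @ L_inv        # product of triangular factors, the role of
                                  # SLAUUM: G^{-1} = L^{-T} L^{-1}

G = np.array([[4.0, 2.0], [2.0, 3.0]])
print(np.allclose(invert_spd_via_cholesky(G) @ G, np.eye(2)))
```

SPOTRI combines the last two steps, taking the factor from SPOTRF as input.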
GPUs for GRMs | Results
Time to ‘make’ GRM∗
[Figure: system time (min, log scale from 10^-1 to 10^3) against no. of animals (0 to 80 000) for GPU, CPU6 and CPU1, with reference lines at 1 hour and 5 hours. Second panel: GPU / CPU1 and CPU6 / CPU1, axis ticks 5 to 20.]
∗Single precision calculations, s = 500 000
Time to invert GRM†
[Figure: system time (min, log scale from 10^-4 to 10^2) against no. of animals (0 to 80 000) for GPU, CPU6 and CPU1, with a reference line at 1 hour. Second panel: GPU / CPU1 and CPU6 / CPU1, axis ticks 5 to 15.]
†Single precision calculations
Time to invert GRM‡
[Figure: system time (min, log scale from 10^-3 to 10^2) against no. of animals (0 to 80 000) for GPU, CPU6 and CPU1, with a reference line at 1 hour. Second panel: GPU / CPU1 and CPU6 / CPU1, axis ticks 4 to 8.]
‡Double precision calculations
GPUs for GRMs | Results | Conclusions
Conclusions
Build & invert GRM
−→ dense matrix multiplications
−→ highly parallelisable
BLAS and LAPACK library routines awesome
– highly optimised
– exploit memory architecture of modern processors
– multi-threaded versions
GPU: powerful new hardware for parallel computing
– thousands of processors
– can use multiple GPUs simultaneously
– BLAS/LAPACK etc. libraries available
– but: limited memory −→ blocked algorithms
– but: specialised interface, not much Fortran support yet