Isaac Lyngaas ([email protected])
John Paige ([email protected])
Advised by: Srinath Vadlamani ([email protected]) & Doug Nychka ([email protected])
SIParCS, July 31, 2014
Why use HPC with R?
Accelerating mKrig & Krig
Parallel Cholesky
◦ Software Packages
Parallel Eigen Decomposition
Conclusions & Future Work
Accelerate the ‘fields’ Krig and mKrig functions
Survey of parallel linear algebra software
◦ Multicore (Shared Memory)
◦ GPU
◦ Xeon Phi
Many developers & users in the field of Statistics
◦ Readily available code base
Problem: R is slow for large problem sizes
Bottleneck in linear algebra operations
◦ mKrig – Cholesky Decomposition
◦ Krig – Eigen Decomposition
R uses sequential algorithms
Strategy: Use C-interoperable libraries to parallelize linear algebra
◦ C functions callable through the R environment
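As a minimal sketch of this strategy (not the project's actual wrapper code), a C routine compiled into a shared library can be loaded and called from R; the library name "parallel_chol.so" and the routine "par_chol" below are hypothetical.

# Hypothetical illustration: a C Cholesky routine compiled into
# "parallel_chol.so" and exposed to R through the .C() interface.
# The routine name and its argument list are assumptions.
dyn.load("parallel_chol.so")

parallelChol <- function(A) {
  n <- nrow(A)
  out <- .C("par_chol",
            A = as.double(A),    # matrix contents passed as a double vector
            n = as.integer(n))
  matrix(out$A, n, n)            # reshape the factored result
}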
Symmetric positive definite -> triangular
◦ A = LL^T
◦ Nice properties for determinant calculation
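One such property: since det(A) = det(L)^2, the log-determinant used in likelihood evaluation follows directly from the diagonal of the Cholesky factor. A small base-R check:

# log det(A) from the Cholesky factor of a symmetric positive definite A
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
A <- crossprod(X) + diag(5)

# Base R's chol() returns the upper-triangular R with A = R^T R,
# so det(A) = prod(diag(R))^2 and log det(A) = 2 * sum(log(diag(R)))
R <- chol(A)
logdetA <- 2 * sum(log(diag(R)))

# Agrees with determinant(), which also works on the log scale
all.equal(logdetA, as.numeric(determinant(A, logarithm = TRUE)$modulus))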
PLASMA (Multicore Shared Memory)
◦ http://icl.cs.utk.edu/plasma/
MAGMA (GPU & Xeon Phi)
◦ http://icl.cs.utk.edu/magma/
CULA (GPU)
◦ http://www.culatools.com/
Multicore (Shared Memory)
Block Scheduling
◦ Determines which operations are done on which core
Block Size optimization
◦ Dependent on cache memory
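The block-size tuning behind the following plots can be explored with a simple timing loop; plasma_chol() below is a hypothetical R wrapper around the PLASMA Cholesky, with core-count and block-size arguments assumed for illustration.

# plasma_chol(A, cores, block_size) is a hypothetical wrapper; the real
# interface may differ. Times the factorization over candidate block
# sizes to find the fastest choice for this machine.
n <- 15000                               # matches the slide; shrink for a quick test
A <- matrix(0.1, n, n); diag(A) <- 1     # cheap symmetric positive definite matrix

block_sizes <- c(128, 256, 384, 512, 768, 1024)
times <- sapply(block_sizes, function(nb) {
  system.time(plasma_chol(A, cores = 16, block_size = nb))["elapsed"]
})
best_block <- block_sizes[which.min(times)]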
[Figure: Plasma using 1 Node (# of Observations = 25000): speedup vs. 1 core plotted against # of cores (1, 2, 4, 8, 12, 16), measured speedup compared with optimal speedup.]
[Figure: PLASMA on Dual Socket Sandy Bridge (# of Observations = 15000, Cores = 16): time (sec) vs. block size (roughly 500 to 1500), with the 256 Kb and 40 Mb cache sizes marked.]
[Figure: PLASMA Optimal Block Sizes (Cores = 16): optimal block size vs. # of observations (0 to 40000).]
Utilizes GPUs or Xeon Phi for parallelization
◦ Multiple GPU & multiple Xeon Phi implementations available
◦ 1 CPU core drives 1 GPU
Block Scheduling
◦ Similar to PLASMA
Block Size dependent on Accelerator Architecture
Proprietary CUDA-based linear algebra package
Capable of performing LAPACK operations using 1 GPU
API written in C
Dense & sparse operations available
1 Node of Caldera or Pronghorn
◦ 2 x 8-core Intel Xeon E5-2670 (Sandy Bridge) processors per node
  64 GB RAM (~59 GB available)
  Cache per core: L1 = 32 Kb, L2 = 256 Kb
  Cache per socket: L3 = 20 Mb
◦ 2 x Nvidia Tesla M2070Q GPUs (Caldera)
  ~5.2 GB RAM per device
  1 core drives 1 GPU
◦ 2 x Xeon Phi 5110P (Pronghorn)
  ~7.4 GB RAM per device
• Serial R: ~3 GFLOP/sec
• Theoretical Peak Performance
◦ 16-core Xeon Sandy Bridge: ~333 GFLOP/sec
◦ 1 Nvidia Tesla M2070Q: ~512 GFLOP/sec
◦ 1 Xeon Phi 5110P: ~1,011 GFLOP/sec
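The achieved GFLOP/sec values in the next plot come from dividing the operation count by the measured factorization time; a double-precision Cholesky of an n x n matrix costs roughly n^3 / 3 floating-point operations. A small R illustration of the arithmetic, using base R's chol() for the timing:

# Achieved Cholesky performance in GFLOP/sec
n <- 4000
A <- matrix(0.1, n, n); diag(A) <- 1     # cheap symmetric positive definite matrix

elapsed <- system.time(chol(A))["elapsed"]
gflops  <- (n^3 / 3) / elapsed / 1e9
gflops    # compare against the ~3 GFLOP/sec serial R figure above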
[Figure: Accelerated Hardware has Room for Improvement: achieved GFLOP/sec (0 to 400) vs. # of observations (0 to 40000) for Plasma (16 cores), Magma 1 GPU, Magma 2 GPUs, Magma 1 MIC, Magma 2 MICs, and CULA.]
[Figure: All Parallel Cholesky Implementations are Faster than Serial R: time (sec, log scale 0.01 to 1000) vs. # of observations (0 to 40000) for Serial R, Plasma (16 cores), CULA, Magma 1 GPU, Magma 2 GPUs, Magma 1 Xeon Phi, and Magma 2 Xeon Phis.]
• >100 times speedup over serial R when # of observations = 10k
• ~6 times speedup over serial R when # of observations = 10k
[Figure: Eigendecomposition also Faster on Accelerated Hardware: time (sec, 0 to 300) vs. # of observations (0 to 10000) for Serial R, CULA, Magma 1 GPU, and Magma 2 GPUs.]
[Figure: Can Run ~30 Cholesky Decompositions per Eigen Decomposition: ratio of eigendecomposition time to Cholesky time (0 to 30) vs. # of observations (0 to 10000); both times taken using MAGMA w/ 2 GPUs.]
• If we want to do 16 Cholesky decompositions in parallel, we are guaranteed better performance when the speedup is > 16
[Figure: Parallel Cholesky Beats Parallel R for Moderate to Large Matrices: speedup vs. parallel R plotted against # of observations (0 to 20000) for Plasma and Magma 2 GPUs.]
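The "parallel R" baseline here appears to mean running many independent serial factorizations at once, one per core; a minimal sketch of that baseline using base R's parallel package (independent of the PLASMA/MAGMA wrappers):

library(parallel)

# "Parallel R" baseline: 16 independent serial Cholesky factorizations,
# one per core, using forked workers (Unix-only mclapply).
n <- 2000
A <- matrix(0.1, n, n); diag(A) <- 1     # cheap symmetric positive definite matrix
mats <- replicate(16, A, simplify = FALSE)

t_parallel_R <- system.time(
  factors <- mclapply(mats, chol, mc.cores = 16)
)["elapsed"]

# A single parallel factorization (e.g. PLASMA on 16 cores) wins for this
# workload only if its per-matrix speedup over serial chol() exceeds 16.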
Using Caldera
◦ Single Cholesky Decomposition
◦ Matrix size < 20k: use PLASMA (16 cores w/ optimal block size)
◦ Matrix size 20k – 35k: use MAGMA w/ 2 GPUs
◦ Matrix size > 35k: use PLASMA (16 cores w/ optimal block size)
Dependent on computing resources available
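As a sketch of how such a recommendation could be wired into a single entry point: plasma_chol() and magma_chol() below stand in for the actual R wrappers around PLASMA and MAGMA, and their interfaces are assumed for illustration.

# Hypothetical dispatcher choosing a backend by matrix size,
# following the Caldera recommendation above.
fastChol <- function(A, cores = 16, block_size = 256) {
  n <- nrow(A)
  if (n < 20000 || n > 35000) {
    plasma_chol(A, cores = cores, block_size = block_size)   # multicore path
  } else {
    magma_chol(A, ngpu = 2)                                  # GPU path for 20k to 35k
  }
}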
Explored implementation on accelerated hardware
◦ GPUs
◦ Multicore (Shared Memory)
◦ Xeon Phis
Installed third-party linear algebra packages & programmed wrappers that call these packages from R
◦ Installation instructions and programs available through a Bitbucket repo; for access contact Srinath Vadlamani
Future Work
◦ Multicore Distributed Memory
◦ Single Precision
Douglas Nychka, Reinhard Furrer, and Stephan Sain. fields: Tools for spatial data, 2014b. URL: http://CRAN.R-project.org/package=fields. R package version 7.1.
Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In Journal of Physics: Conference Series, volume 180, page 012037. IOP Publishing, 2009.
Hatem Ltaief, Stanimire Tomov, Rajib Nath, Peng Du, and Jack Dongarra. A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators. Proc. of VECPAR'10, Berkeley, CA, June 22-25, 2010.
Jack Dongarra, Mark Gates, Azzam Haidar, Yulu Jia, Khairul Kabir, Piotr Luszczek, and Stanimire Tomov. Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi. PPAM 2013, Warsaw, Poland, September, 2013.
[Backup figure: task graph of the tiled Cholesky factorization, showing the xPOTRF, xTRSM, xSYRK, and xGEMM kernels applied across steps 0-3.]
• http://www.netlib.org/lapack/lawnspdf/lawn223.pdf