Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et...
Transcript of Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et...
![Page 1: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/1.jpg)
Accelerating ‘fields’ by
revamping the Cholesky
decomposition Vinay Ramakrishnaiah
Raghu Raj P Kumar
John Paige
Dr. Dorit Hammerling
1
07/29/2015
![Page 2: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/2.jpg)
Introduction What is ‘fields’?
Widely used R package for spatial statistics.
Literally thousands of users.
Developed by Dr. Douglas Nychka, director of The Institute for Mathematics Applied to Geosciences (IMAGe).
Major methods – thin plate splines, Kriging and Compact covariances for large data sets.
‘fields’ - single thread version on CPU.
http://burandtfieldmethods.blogspot.com/
2
![Page 3: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/3.jpg)
Introduction
Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop.
Make it user friendly.
MAGMA: Matrix algebra on GPU and Multicore Architectures
Aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures. [http://icl.cs.utk.edu/magma/]
Biggest computational problem in Kriging
Evaluation of data likelihood function given the covariance.
Requires Cholesky Decomposition
3
![Page 4: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/4.jpg)
Motivation
Unconventional behavior
http://www.nor-tech.com/solutions/hybrid-gpu-solutions-from-nor-tech/8-gpu-
server/
http://wccftech.com/nvidia-reportedly-preparing-geforce-titan-gpu-gk110/
http://www.aarp.org/money/investing/info-2014/5-bad-money-
moves.html#slide1
http://www.kaizen-news.com/focus-performance/
4
Performance
![Page 5: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/5.jpg)
Motivation
Opposite impacts
http://www.bogotobogo.com/cplusplus/constructor.php
5
![Page 6: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/6.jpg)
Motivation
There is always more room!
http://funny-pictures.funmunch.com/funny-picture-545.html
6
![Page 7: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/7.jpg)
Approach
7
Identify performance
bottlenecks/cause for
unconventional behavior
Analyze the existing
code
See how to further
accelerate the code
![Page 8: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/8.jpg)
Performance analysis - Profiling
NVIDIA visual profiler
Why profile GPU?
Why NVVP?
8
![Page 9: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/9.jpg)
Performance analysis – Profiling
dpotrf_m 9 Non concurrent
memory transfers
Computations
CPU involved in
device to device data
exchange
Time----
Time----
![Page 10: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/10.jpg)
Performance analysis – Profiling
dpotrf_mgpu 10
Asynchronous
simultaneous data
transfer
Computations
Time----
![Page 11: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/11.jpg)
Performance analysis - Timing
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
16.00
0 5000 10000 15000 20000 25000 30000 35000
Tim
e in
seco
nd
s
Square matrix dimension N
Comparison of dpotrf_m and dpotrf_mgpu called from the R wrapper
dpotrf_mgpu (s)
dpotrf_m (s)
11
![Page 12: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/12.jpg)
Program Flow 12 R Program
magmaChol(…)
Rwrapper.r
dyn.load(…)
.Call(…)
fieldsMagma.r
C function
MAGMA wrapper
CUDA calls
Compute
![Page 13: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/13.jpg)
Overheads in R 13
0
2
4
6
8
10
12
14
16
18
20
magmaChol .Call dpotrf_m magmaChol .Call dpotrf_mgpu
dpotrf_m dpotrf_mgpu
Tim
e (
s)
Sample square matrix size N=32768
Overheads in R
overhead
overhead
![Page 14: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/14.jpg)
OPTIMIZATION APPROACHES
Reduce function call overheads in R.
Optimize R environment for
‘fieldsMAGMA’.
Accelerate the underlying C function.
14
![Page 15: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/15.jpg)
Reducing R overheads
Intel LD_PRELOAD
Single dynamic load
Move the dynamic loads to the highest
level of R call
15 R Program
magmaChol(…)
Rwrapper.r
dyn.load(…)
.Call(…)
fieldsMagma.r
C function
MAGMA wrapper
CUDA calls
Compute
![Page 16: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/16.jpg)
Reducing R overheads 16
0
5
10
15
20
25
0 5000 10000 15000 20000 25000 30000 35000
Tim
e (
s)
Matrix dimension N
Reduced overheads
Before
After
![Page 17: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/17.jpg)
Performance of double precision Cholesky
Decomposition on Intel Compilers
0
5
10
15
20
25
30
35
40
0 5000 10000 15000 20000 25000 30000 35000 40000 45000
Tim
e (
s)
Matrix Dimension N
Comparison of accelerated code (double precision) - Intel 2012 vs Intel 2015 Compiler
Intel 12.1.5
Intel 15.0.3
17
![Page 18: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/18.jpg)
18 R Program
magmaChol(…)
Rwrapper.r
dyn.load(…)
.Call(…)
fieldsMagma.r
C function
MAGMA wrapper
CUDA calls
Compute
![Page 19: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/19.jpg)
Pinned memory
http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/
19
For single precision calculations we obtained an improvement up to
30% for large sized matrices.
![Page 20: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/20.jpg)
Mysteries of Deep
Copy
Deep copy performed in R acts as a overhead.
Used ‘perf’ to unravel the mystery!
20
Deep Copy In-place
Page Faults High Low
LLC Misses High Low
dTLB Misses Constant Constant
Abbreviations:
SP – Single Precision
DP – Double Precision
DPCPY – Deep copy
IP – In Place
![Page 21: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/21.jpg)
Results 21
0
10
20
30
40
50
60
70
80
90
0 5000 10000 15000 20000 25000 30000 35000
Sp
ee
d u
p
Square matrix dimension N
Speed up of accelerated single precision versions with respect to CPU implementation
SP_DCPY_nGPU1
SP_DCPY_nGPU2
SP_IP_nGPU1
SP_IP_nGPU2
![Page 22: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/22.jpg)
Results 22
0
10
20
30
40
50
60
70
80
90
0 5000 10000 15000 20000 25000 30000 35000
Sp
ee
d u
p
Square matrix dimension N
Speed up of accelerated double precision versions with respect to CPU implementation
DP_DPCPY_nGPU1
DP_DPCPY_nGPU2
DP_IP_nGPU1
DP_IP_nGPU2
![Page 23: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/23.jpg)
Conclusion
Achieved an improvement in performance of about 40% - 60% over the previous version of fieldsMAGMA for large matrices.
Ability to allocate pinned memory makes significant difference.
Reduce R overheads.
Fine-tuned the R environment for fieldsMAGMA.
Detailed technical report.
23
![Page 24: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/24.jpg)
Future work 24
Courtesy of NVIDIA Corporation
![Page 25: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/25.jpg)
References Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J.,
Langou, J., et al. Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA projects. Journel of Physics: Conference series , 180 (012037).
http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/. (n.d.).
https://software.intel.com/en-us/articles/optimizing-without-breaking-a-sweat. (n.d.).
Katzfuss, M., & Cressie, N. (2012). Bayesian hierarchical spatio-temporal smoothing for very large datasets (Vol. 23). Environmetrics.
Paige, J., Lyngaas, I., Ramakrishnaiah, V., Hammerling , D., Kumar, R., & Nychka, D. (2015). Incorporating MAGMA into the 'fields' spatial statistics package. Boulder: UCAR/NCAR.
Tomov, S., Dongarra, J., Volkov, V., & Demmel, J. MAGMA library. University of Tenessee and University of California, Knoxville, TN, and Berkeley, CA.
25
![Page 26: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/26.jpg)
Thank you!
Questions?
26
![Page 27: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/27.jpg)
Introduction
27
![Page 28: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/28.jpg)
Initial timing results
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0
100
200
300
400
500
600
700
800
900
1000
0 2000 4000 6000 8000 10000 12000
Tim
e (
s)
GF
lop
s
Square matrix dimension N
Comparison of testing_dpotrf & testing_dpotrf_mgpu
testing_dpotrf_mgpu -Gflops
testing_dpotrf - Gflop/s
testing_dpotrf_mgpu -time (s)
testing_dpotrf - time (s)
28
![Page 29: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/29.jpg)
Single precision calculations
0
2
4
6
8
10
12
14
16
18
20
0 5000 10000 15000 20000 25000 30000 35000
Tim
e (
s)
Square matrix dimension N
Using pinned memory for single precision
non-pinned
pinned
29
![Page 30: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/30.jpg)
Results
0
5
10
15
20
25
30
0 5000 10000 15000 20000 25000 30000 35000
Tim
e (
s)
Square matrix dimension N
Comparison of execution times of different implementations
DP_DPCPY_nGPU1
DP_DPCPY_nGPU2
DP_IP_nGPU1
DP_IP_nGPU2
SP_DCPY_nGPU1
SP_DCPY_nGPU2
SP_IP_nGPU1
SP_IP_nGPU2
30
![Page 31: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/31.jpg)
Results
0
1
2
3
4
5
6
2481018
26224981
756310311
1349117369
2320126633
Sp
eed
up
Square matrix dimension N
Performance improvement of the accelerated Kriging over CPU version
mKrig_1_DP mKrig_1_SP
mKrig_2_DP mKrig_2_SP
31
![Page 32: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly.](https://reader035.fdocuments.in/reader035/viewer/2022081411/60a933700c271a35a055c870/html5/thumbnails/32.jpg)
Results
0
0.5
1
1.5
2
2.5
3
3.5
4
2481018
26224981
756310311
1349117369
Sp
ee
d u
p
Square matrix dimension N
Performance improvement of the accelerated Kriging workflow over the CPU version
mKrigWF_1_DP
mKrigWF_2_SP
mKrigWF_2_DP
mKrigWF_1_SP
32