Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et...

Accelerating ‘fields’ by

revamping the Cholesky

decomposition Vinay Ramakrishnaiah

Raghu Raj P Kumar

John Paige

Dr. Dorit Hammerling

07/29/2015

Introduction What is ‘fields’?

Widely used R package for spatial statistics.

Literally thousands of users.

Developed by Dr. Douglas Nychka, director of The Institute for Mathematics Applied to Geosciences (IMAGe).

Major methods – thin plate splines, Kriging and Compact covariances for large data sets.

‘fields’ - single thread version on CPU.

http://burandtfieldmethods.blogspot.com/

Introduction

Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop.

Make it user friendly.

MAGMA: Matrix algebra on GPU and Multicore Architectures

Aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures. [http://icl.cs.utk.edu/magma/]

Biggest computational problem in Kriging

Evaluation of data likelihood function given the covariance.

Requires Cholesky Decomposition

Motivation

Unconventional behavior

http://www.nor-tech.com/solutions/hybrid-gpu-solutions-from-nor-tech/8-gpu-

server/

http://wccftech.com/nvidia-reportedly-preparing-geforce-titan-gpu-gk110/

http://www.aarp.org/money/investing/info-2014/5-bad-money-

moves.html#slide1

http://www.kaizen-news.com/focus-performance/

Performance

Motivation

Opposite impacts

http://www.bogotobogo.com/cplusplus/constructor.php

Motivation

There is always more room!

http://funny-pictures.funmunch.com/funny-picture-545.html

Approach

Identify performance

bottlenecks/cause for

unconventional behavior

Analyze the existing

See how to further

accelerate the code

Performance analysis - Profiling

NVIDIA visual profiler

Why profile GPU?

Why NVVP?

Performance analysis – Profiling

dpotrf_m 9 Non concurrent

memory transfers

Computations

CPU involved in

device to device data

exchange

Time----

Performance analysis – Profiling

dpotrf_mgpu 10

Asynchronous

simultaneous data

transfer

Computations

Time----

Performance analysis - Timing

0 5000 10000 15000 20000 25000 30000 35000

Square matrix dimension N

Comparison of dpotrf_m and dpotrf_mgpu called from the R wrapper

dpotrf_mgpu (s)

dpotrf_m (s)

Program Flow 12 R Program

magmaChol(…)

Rwrapper.r

dyn.load(…)

.Call(…)

fieldsMagma.r

C function

MAGMA wrapper

CUDA calls

Compute

Overheads in R 13

magmaChol .Call dpotrf_m magmaChol .Call dpotrf_mgpu

dpotrf_m dpotrf_mgpu

Sample square matrix size N=32768

Overheads in R

overhead

OPTIMIZATION APPROACHES

Reduce function call overheads in R.

Optimize R environment for

‘fieldsMAGMA’.

Accelerate the underlying C function.

Reducing R overheads

Intel LD_PRELOAD

Single dynamic load

Move the dynamic loads to the highest

level of R call

15 R Program

magmaChol(…)

Rwrapper.r

dyn.load(…)

.Call(…)

fieldsMagma.r

C function

MAGMA wrapper

CUDA calls

Compute

Reducing R overheads 16

0 5000 10000 15000 20000 25000 30000 35000

Matrix dimension N

Reduced overheads

Before

Performance of double precision Cholesky

Decomposition on Intel Compilers

0 5000 10000 15000 20000 25000 30000 35000 40000 45000

Matrix Dimension N

Comparison of accelerated code (double precision) - Intel 2012 vs Intel 2015 Compiler

Intel 12.1.5

Intel 15.0.3

18 R Program

magmaChol(…)

Rwrapper.r

dyn.load(…)

.Call(…)

fieldsMagma.r

C function

MAGMA wrapper

CUDA calls

Compute

Pinned memory

http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/

For single precision calculations we obtained an improvement up to

30% for large sized matrices.

Mysteries of Deep

Deep copy performed in R acts as a overhead.

Used ‘perf’ to unravel the mystery!

Deep Copy In-place

Page Faults High Low

LLC Misses High Low

dTLB Misses Constant Constant

Abbreviations:

SP – Single Precision

DP – Double Precision

DPCPY – Deep copy

IP – In Place

Results 21

0 5000 10000 15000 20000 25000 30000 35000

Speed up of accelerated single precision versions with respect to CPU implementation

SP_DCPY_nGPU1

SP_DCPY_nGPU2

SP_IP_nGPU1

SP_IP_nGPU2

Results 22

0 5000 10000 15000 20000 25000 30000 35000

Speed up of accelerated double precision versions with respect to CPU implementation

DP_DPCPY_nGPU1

DP_DPCPY_nGPU2

DP_IP_nGPU1

DP_IP_nGPU2

Conclusion

Achieved an improvement in performance of about 40% - 60% over the previous version of fieldsMAGMA for large matrices.

Ability to allocate pinned memory makes significant difference.

Reduce R overheads.

Fine-tuned the R environment for fieldsMAGMA.

Detailed technical report.

Future work 24

Courtesy of NVIDIA Corporation

References Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J.,

Langou, J., et al. Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA projects. Journel of Physics: Conference series , 180 (012037).

http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/. (n.d.).

https://software.intel.com/en-us/articles/optimizing-without-breaking-a-sweat. (n.d.).

Katzfuss, M., & Cressie, N. (2012). Bayesian hierarchical spatio-temporal smoothing for very large datasets (Vol. 23). Environmetrics.

Paige, J., Lyngaas, I., Ramakrishnaiah, V., Hammerling , D., Kumar, R., & Nychka, D. (2015). Incorporating MAGMA into the 'fields' spatial statistics package. Boulder: UCAR/NCAR.

Tomov, S., Dongarra, J., Volkov, V., & Demmel, J. MAGMA library. University of Tenessee and University of California, Knoxville, TN, and Berkeley, CA.

Thank you!

Questions?

Introduction

Initial timing results

0 2000 4000 6000 8000 10000 12000

Comparison of testing_dpotrf & testing_dpotrf_mgpu

testing_dpotrf_mgpu -Gflops

testing_dpotrf - Gflop/s

testing_dpotrf_mgpu -time (s)

testing_dpotrf - time (s)

Single precision calculations

0 5000 10000 15000 20000 25000 30000 35000

Using pinned memory for single precision

non-pinned

pinned

Results

0 5000 10000 15000 20000 25000 30000 35000

Comparison of execution times of different implementations

DP_DPCPY_nGPU1

DP_DPCPY_nGPU2

DP_IP_nGPU1

DP_IP_nGPU2

SP_DCPY_nGPU1

SP_DCPY_nGPU2

SP_IP_nGPU1

SP_IP_nGPU2

Results

2481018

26224981

756310311

1349117369

2320126633

Performance improvement of the accelerated Kriging over CPU version

mKrig_1_DP mKrig_1_SP

mKrig_2_DP mKrig_2_SP

Results

2481018

26224981

756310311

1349117369

Performance improvement of the accelerated Kriging workflow over the CPU version

mKrigWF_1_DP

mKrigWF_2_SP

mKrigWF_2_DP

mKrigWF_1_SP

Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et...

Documents

Transcript of Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et...

Emerson Hannon SIParCS Intern Presentation July 26, 2011 E... · 2020-01-07 · SIParCS Intern Presentation. July 26, 2011 • Hometown: Arvada, Colorado • School: Electrical Engineering

Gemini - Paige

PT paige barcus

Paige Sandilands Portfolio

PAIGE CAPITAL MANAGEMENT, LLC, ) PAIGE OPPORTUNITY

Ukraine Paige Allen. Ukraine Flag two equal horizontal bands of azure blue (top) and golden yellow (bottom) represent grain fields under a blue sky.

Paige Jamaica Options

Paige Swiernik's Portfolio

Paige Melin - BXFall18

Paige Jacobs' Portfolio

Mendenhall, Paige 29

Paige E. Sullivan

Becoming AntiFragile - Dr Paige Williams Home | Dr Paige ...

Paige Miller

Peer Review- Paige

Paige Green 2009

Paige Jones

Paige Mosman Portfolio

paige tarjan

Paige Deery Presentation