Efficient GPU Programming in Modern C++
Gordon Brown
Principal Software Engineer, SYCL & C++
CppCon 2019 – Sep 2019
© 2019 Codeplay Software Ltd.2
This talk is based on the SYCL programming model
Terminology may differ for other programming models
Agenda
Why use the GPU?
Brief introduction to SYCL
SYCL programming model
Optimising GPU programs
● Choosing the right algorithm
● Basic GPU programming principles
● Ideas for further optimisations
Why use the GPU?
“The free lunch is over”
“The end of Moore’s Law”
“The future is parallel”
Take a typical Intel chip
Intel Core i7 7th Gen
○ 4x CPU cores
■ Each with hyperthreading
■ Each with support for 256-bit AVX2 instructions
○ Intel Gen 9.5 GPU
■ With 1280 processing elements
Regular sequential C++ code (non-vectorised) running on a single thread only takes advantage of a very small amount of the available resources of the chip
Vectorisation allows you to fully utilise a single CPU core
Multi-threading allows you to fully utilise all CPU cores
Heterogeneous dispatch allows you to fully utilise the entire chip
GPGPU programming was once a niche technology
● Limited to specific domains
● Separate source solutions
● Verbose low-level APIs
● Very steep learning curve
This is not the case anymore
● Almost everything has a GPU now
● Single source solutions
● Modern C++ programming models
● More accessible to the average C++ developer
C++ AMP, SYCL, CUDA, Agency, Kokkos, HPX, Raja
Brief introduction to SYCL
Cross-platform, single-source, high-level, C++ programming layer
Built on top of OpenCL and based on standard C++11
Delivering a heterogeneous programming solution for C++
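[Diagram: the SYCL software stack. Applications and C++ template libraries sit on top of SYCL for OpenCL, which sits on OpenCL and OpenCL-enabled devices. A host compiler and a device compiler process the source, the latter producing device IR (SPIR, SPIR-V, etc.)]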
// CUDA
__global__ void vec_add(float *a, float *b, float *c) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // index of this thread
  c[i] = a[i] + b[i];
}
float *a, *b, *c;
vec_add<<<numBlocks, blockSize>>>(a, b, c);
// Pragma-based
vector<float> a, b, c;
#pragma parallel_for
for(int i = 0; i < a.size(); i++) {
  c[i] = a[i] + b[i];
}
// SYCL
cgh.parallel_for<vec_add>(range, [=](cl::sycl::id<2> idx) {
  c[idx] = a[idx] + b[idx];
});
// C++ AMP
array_view<float> a, b, c;
extent<2> e(64, 64);
parallel_for_each(e, [=](index<2> idx) restrict(amp) {
  c[idx] = a[idx] + b[idx];
});
int main(int argc, char *argv[]) {
}
#include <CL/sycl.hpp>
using namespace cl::sycl;
int main(int argc, char *argv[]) {
}
The whole SYCL API is included in the CL/sycl.hpp header file
#include <CL/sycl.hpp>
using namespace cl::sycl;
int main(int argc, char *argv[]) {
queue gpuQueue{gpu_selector{}};
}
A queue is used to enqueue work to a device such as a GPU
A device selector is a function object which provides a heuristic for selecting a suitable device
#include <CL/sycl.hpp>
using namespace cl::sycl;
int main(int argc, char *argv[]) {
queue gpuQueue{gpu_selector{}};
gpuQueue.submit([&](handler &cgh){
});
}
A command group describes a unit of work to be executed by a device
A command group is created by a function object passed to the submit function of the queue
#include <CL/sycl.hpp>
using namespace cl::sycl;
int main(int argc, char *argv[]) {
std::vector<float> dA{ … }, dB{ … }, dO{ … };
queue gpuQueue{gpu_selector{}};
buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
buffer<float, 1> bufB(dB.data(), range<1>(dB.size()));
buffer<float, 1> bufO(dO.data(), range<1>(dO.size()));
gpuQueue.submit([&](handler &cgh){
});
}
Buffers take ownership of data and manage it across the host and any number of devices
#include <CL/sycl.hpp>
using namespace cl::sycl;
int main(int argc, char *argv[]) {
std::vector<float> dA{ … }, dB{ … }, dO{ … };
queue gpuQueue{gpu_selector{}};
{
buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
buffer<float, 1> bufB(dB.data(), range<1>(dB.size()));
buffer<float, 1> bufO(dO.data(), range<1>(dO.size()));
gpuQueue.submit([&](handler &cgh){
});
}
}
Buffers synchronize on destruction via RAII, waiting for any command groups that need to write back to them
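The RAII behaviour described here can be imitated in plain C++ without any SYCL runtime. The following is only a sketch of the idea, not how a SYCL implementation manages memory: `ScopedBuffer` and `double_in_scope` are hypothetical names, and the "device" is just a private copy of the host data that is written back in the destructor.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a buffer: works on a private copy of the host
// data and writes the result back when it goes out of scope (RAII).
class ScopedBuffer {
 public:
  explicit ScopedBuffer(std::vector<float> &host)
      : host_(host), device_(host) {}  // copy host data "to the device"

  // The destructor plays the role of the synchronization point: the host
  // vector only sees the new values once the buffer is destroyed.
  ~ScopedBuffer() { host_ = device_; }

  ScopedBuffer(const ScopedBuffer &) = delete;
  ScopedBuffer &operator=(const ScopedBuffer &) = delete;

  float &operator[](std::size_t i) { return device_[i]; }
  std::size_t size() const { return device_.size(); }

 private:
  std::vector<float> &host_;   // storage owned by the caller
  std::vector<float> device_;  // private working copy
};

// Doubles every element through a ScopedBuffer and reports whether the host
// data was still unchanged just before the buffer was destroyed.
bool double_in_scope(std::vector<float> &data) {
  const std::vector<float> before = data;
  bool untouched_inside_scope = false;
  {
    ScopedBuffer buf(data);
    for (std::size_t i = 0; i < buf.size(); ++i) buf[i] *= 2.0f;
    untouched_inside_scope = (data == before);
  }  // buf is destroyed here: results are copied back to data
  return untouched_inside_scope;
}
```

Calling `double_in_scope` on a vector returns true (the host copy is untouched while the buffer is alive) and leaves the doubled values in the vector afterwards, which is the same visibility rule the inner scope on the slide relies on.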
#include <CL/sycl.hpp>
using namespace cl::sycl;
int main(int argc, char *argv[]) {
std::vector<float> dA{ … }, dB{ … }, dO{ … };
queue gpuQueue{gpu_selector{}};
{
buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
buffer<float, 1> bufB(dB.data(), range<1>(dB.size()));
buffer<float, 1> bufO(dO.data(), range<1>(dO.size()));
gpuQueue.submit([&](handler &cgh){
auto inA = bufA.get_access<access::mode::read>(cgh);
auto inB = bufB.get_access<access::mode::read>(cgh);
auto out = bufO.get_access<access::mode::write>(cgh);
});
}
}
Accessors describe the way in which you would like to access a buffer
They are also used to access the data from within a kernel function
#include <CL/sycl.hpp>
using namespace cl::sycl;
class add;
int main(int argc, char *argv[]) {
std::vector<float> dA{ … }, dB{ … }, dO{ … };
queue gpuQueue{gpu_selector{}};
{
buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
buffer<float, 1> bufB(dB.data(), range<1>(dB.size()));
buffer<float, 1> bufO(dO.data(), range<1>(dO.size()));
gpuQueue.submit([&](handler &cgh){
auto inA = bufA.get_access<access::mode::read>(cgh);
auto inB = bufB.get_access<access::mode::read>(cgh);
auto out = bufO.get_access<access::mode::write>(cgh);
cgh.parallel_for<add>(range<1>(dA.size()),
[=](id<1> i){ out[i] = inA[i] + inB[i]; });
});
}
}
Commands such as parallel_for can be used to define kernel functions
The first argument here is a range, specifying the iteration space
The second argument is a function object that represents the entry point for the SYCL kernel
The function object must take an id parameter that describes the current iteration being executed
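To make the roles of the two arguments concrete, here is a host-only sketch in standard C++. `host_parallel_for` and `vector_add` are hypothetical helpers, not part of SYCL: the helper walks a one-dimensional range sequentially and calls the kernel once per index, whereas a real device is free to run those invocations in parallel and in any order.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for a 1D parallel_for: invokes the kernel
// once for every index in [0, range). A device would run these invocations
// in parallel; here they run one after another.
template <typename Kernel>
void host_parallel_for(std::size_t range, Kernel kernel) {
  for (std::size_t i = 0; i < range; ++i) {
    kernel(i);  // i plays the role of the id parameter
  }
}

// Vector addition written as a per-element kernel, mirroring the slide.
// Assumes a and b have the same length.
std::vector<float> vector_add(const std::vector<float> &a,
                              const std::vector<float> &b) {
  std::vector<float> out(a.size());
  host_parallel_for(a.size(), [&](std::size_t i) { out[i] = a[i] + b[i]; });
  return out;
}
```

The kernel body is the same single expression as in the SYCL version; only the mechanism that supplies the index differs.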
Kernel functions defined using lambdas must be given a type name, passed as the template argument of parallel_for
The reason for this is that C++ does not have a standard ABI for lambdas, so they may be represented differently by the host and device compilers
The rest of this talk will focus on kernels and how to optimize them
SYCL programming model
1. A processing element executes a single work-item
2. Each work-item can access private memory, a dedicated memory region for each processing element
3. A compute unit is composed of a number of processing elements and executes one or more work-groups, which are composed of a number of work-items
4. Each work-item can access the local memory of its work-group, a dedicated memory region for each compute unit
5. A device can execute multiple work-groups
6. Each work-item can access global memory, a single memory region available to all processing elements
[Diagram labels: processing element, private memory, compute unit, local memory, global memory, work-item, work-group(s)]
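The relationship between work-items and work-groups can be sketched with some index arithmetic. This assumes a one-dimensional iteration space in which global ids are assigned contiguously and every work-group has the same size; `WorkItemIndex`, `decompose` and `global_id_of` are illustrative names, not SYCL API.

```cpp
#include <cstddef>

// Position of one work-item in a 1D iteration space (illustrative type).
struct WorkItemIndex {
  std::size_t group_id;  // which work-group the work-item belongs to
  std::size_t local_id;  // position of the work-item inside that work-group
};

// Splits a global id into a work-group id and a local id.
// Assumes local_size is non-zero and ids are assigned contiguously.
WorkItemIndex decompose(std::size_t global_id, std::size_t local_size) {
  return {global_id / local_size, global_id % local_size};
}

// Rebuilds the global id from the work-group id and local id.
std::size_t global_id_of(WorkItemIndex idx, std::size_t local_size) {
  return idx.group_id * local_size + idx.local_id;
}
```

For example, with a work-group size of 32, global id 70 is local id 6 of work-group 2. Work-items that share a group id are the ones that can share local memory.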
Private memory < Local memory < Global memory
GPUs execute a large number of work-items
They are not all guaranteed to execute concurrently; however, most GPUs do execute a number of work-items uniformly (in lock-step)
The number that are executed concurrently varies between different GPUs
There is no guarantee as to the order in which they execute
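Because execution order is unspecified, a kernel is easiest to reason about when each work-item writes only its own output element. The sketch below emulates this on the host by running the same per-element function over the indices in forward or reverse order; `run_in_order` and `square_all` are hypothetical helpers used only for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Runs kernel(i) for each index in the given order (host-side emulation).
template <typename Kernel>
void run_in_order(const std::vector<std::size_t> &order, Kernel kernel) {
  for (std::size_t i : order) kernel(i);
}

// Squares each element. Every invocation touches only element i, so the
// result cannot depend on the order in which the invocations run.
std::vector<int> square_all(const std::vector<int> &in, bool reversed) {
  std::vector<std::size_t> order(in.size());
  std::iota(order.begin(), order.end(), std::size_t{0});  // 0, 1, 2, ...
  if (reversed) std::reverse(order.begin(), order.end());

  std::vector<int> out(in.size());
  run_in_order(order, [&](std::size_t i) { out[i] = in[i] * in[i]; });
  return out;
}
```

Both orders produce the same output. A kernel in which one work-item reads a value written by another would not have this property without explicit synchronization.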
What are GPUs good at?
➢ Highly parallel
○ GPUs can run a very large number of processing elements in parallel
➢ Efficient at floating point operations
○ GPUs can achieve very high FLOPS (floating-point operations per second)
➢ Large bandwidth
○ GPUs are optimised for throughput and can handle a very large bandwidth of data
Optimising GPU programs
There are different levels of optimisations you can apply
➢ Choosing the right algorithm
➢ This means choosing an algorithm that is well suited to parallelism
➢ Basic GPU programming principles
➢ Such as coalescing global memory access or using local memory
➢ Architecture specific optimisations
➢ Optimising for register usage or avoiding bank conflicts
➢ Micro-optimisations
➢ Such as floating point denorm hacks
This talk will mostly focus on the first two: choosing the right algorithm and basic GPU programming principles
Choosing the right algorithm
What to parallelise on a GPU
➢ Find hotspots in your code base
○ Look for areas of your codebase that are hit often and well suited to parallelism on the GPU
➢ Follow an adaptive optimisation approach such as APOD
○ Assess -> Parallelise -> Optimise -> Deploy
➢ Avoid over-optimisation
○ You may reach a point where optimisations provide diminishing returns
What to look for in an algorithm
➢ Naturally data parallel
○ Performing the same operation on multiple items in the computation
➢ Large problem
○ Enough work to utilise the GPU’s processing elements
➢ Independent progress
○ Little or no dependencies between items in the computation
➢ Non-divergent control flow
○ Little or no branch or loop divergence
As a motivational example we will be looking at an image convolution
➢ The image convolution algorithm is “embarrassingly parallel”
○ Each item in the computation can be calculated entirely independently
➢ The image convolution algorithm is very computation heavy
○ A large number of operations have to be calculated for each item in the computation, particularly when using larger filters
➢ Image processing requires a large bandwidth
○ A lot of data must be passed through the GPU to process an image, particularly if the image is very high resolution
Input image (16x16 pixel values):
1 7 5 8 2 3 8 3 4 6 2 2 4 5 8 3
1 3 4 3 2 4 3 4 5 6 1 6 5 7 8 5
9 2 1 8 1 4 6 9 5 1 4 5 1 9 4 7
3 6 2 0 2 2 9 8 2 7 9 4 2 6 1 5
1 7 2 2 8 4 6 8 4 7 6 8 3 2 4 1
4 9 9 5 1 3 7 3 8 1 7 4 1 5 9 4
4 0 6 3 6 9 9 6 8 5 9 9 0 2 1 5
3 8 1 2 4 7 1 7 6 7 7 2 6 3 6 7
6 7 5 4 3 1 4 4 2 6 3 0 5 0 7 0
1 3 4 2 2 8 1 6 4 9 5 3 7 1 2 4
7 5 4 3 7 0 4 0 3 0 4 4 2 8 9 0
0 9 9 8 0 2 9 8 2 1 6 0 6 3 4 1
6 4 0 1 9 1 7 4 8 3 0 5 0 2 0 6
1 5 7 6 3 0 6 5 4 6 0 4 1 8 7 0
3 3 0 5 9 8 2 4 7 1 5 2 0 4 9 7
1 9 0 4 0 3 0 6 1 2 8 7 0 1 2 9
Approximate Gaussian blur 3x3 filter (weights scaled by 1/16):
1 2 1
2 4 2
1 2 1
Output value shown on the slide: 3
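As a reference for what each work-item has to compute, here is a sequential, single-channel sketch of the 3x3 filter in standard C++. The integer arithmetic, the row-major layout and the decision to leave border pixels at zero are assumptions made for illustration; the slides do not specify them, and `blur3x3` is a hypothetical function name.

```cpp
#include <cstddef>
#include <vector>

// 3x3 approximate Gaussian blur weights; the weights sum to 16.
constexpr int kFilter[3][3] = {{1, 2, 1}, {2, 4, 2}, {1, 2, 1}};

// Blurs a single-channel, row-major image. Border pixels are left at zero
// (an assumption; a real implementation must choose a border policy).
std::vector<int> blur3x3(const std::vector<int> &in, std::size_t width,
                         std::size_t height) {
  std::vector<int> out(in.size(), 0);
  for (std::size_t y = 1; y + 1 < height; ++y) {
    for (std::size_t x = 1; x + 1 < width; ++x) {
      int sum = 0;
      // Weighted sum over the 3x3 neighbourhood centred on (x, y).
      for (std::size_t fy = 0; fy < 3; ++fy) {
        for (std::size_t fx = 0; fx < 3; ++fx) {
          sum += kFilter[fy][fx] * in[(y + fy - 1) * width + (x + fx - 1)];
        }
      }
      out[y * width + x] = sum / 16;  // integer division truncates
    }
  }
  return out;
}
```

Applied to the top-left 3x3 block of the image (1 7 5 / 1 3 4 / 9 2 1), the weighted sum is 56, so the centre pixel becomes 56 / 16 = 3 with integer division. Each output pixel depends only on the input image, which is what makes the two outer loops a natural fit for one work-item per pixel.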
Basic GPU programming principles
Optimizing GPU programs means maximizing throughput
➢ Maximise compute operations per cycle
➢ Make effective utilisation of the GPU’s hardware
➢ Reduce time spent on memory operations
➢ Reduce latency of memory access
Avoid divergent control flow
➢ Divergent branches and loops can cause inefficient utilisation
➢ If consecutive work-items execute different branches they must execute separate instructions
➢ If some work-items execute more iterations of a loop than neighbouring work-items, the neighbours are left doing nothing until the loop finishes
a[globalId] = 0;
if (globalId < 4) {
a[globalId] = x();
} else {
a[globalId] = y();
}
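A rough way to see the cost of this branch is to count how many branch bodies a lock-step group has to execute. This is a simplified model, not a description of any particular GPU: it assumes a group of work-items shares one instruction stream, so a group whose members disagree on the condition executes both sides one after the other. `passes_for_group` and its parameters are illustrative.

```cpp
#include <cstddef>

// Number of branch bodies a lock-step group must execute for
//   if (globalId < threshold) x(); else y();
// Returns 1 if all work-items in the group agree, 2 if they diverge.
int passes_for_group(std::size_t first_global_id, std::size_t group_width,
                     std::size_t threshold) {
  bool any_taken = false;      // some work-item takes the x() side
  bool any_not_taken = false;  // some work-item takes the y() side
  for (std::size_t id = first_global_id; id < first_global_id + group_width;
       ++id) {
    if (id < threshold) {
      any_taken = true;
    } else {
      any_not_taken = true;
    }
  }
  return (any_taken && any_not_taken) ? 2 : 1;
}
```

Under this model, with a threshold of 4, a group covering global ids 0 to 7 needs two passes, while groups of width 4 starting at 0 and at 4 need one pass each. Aligning branch conditions with group boundaries is one way to reduce divergence.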
…
for (int i = 0; i < globalId; i++) {
do_something();
}
…
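A rough way to see the cost of a loop like this is to model a group of work-items that execute in lock-step, so the whole group is busy for as many steps as its slowest member. The sketch below is plain host C++ rather than SYCL. The names `lockstep_utilisation` and `iterations_for_ids` are hypothetical, and the lock-step model is a simplification of what real hardware does.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Fraction of lane-steps doing useful work when a group of work-items runs
// in lock-step and work-item i needs iterations[i] loop iterations.
// Simplified model: the whole group is busy for max(iterations) steps.
double lockstep_utilisation(const std::vector<int>& iterations) {
  if (iterations.empty()) return 1.0;
  int longest = *std::max_element(iterations.begin(), iterations.end());
  if (longest == 0) return 1.0;
  long useful = 0;
  for (int n : iterations) useful += n;
  return static_cast<double>(useful) /
         (static_cast<double>(longest) * iterations.size());
}

// Iteration counts of `for (i = 0; i < globalId; i++)` for ids 0..groupSize-1.
std::vector<int> iterations_for_ids(int groupSize) {
  std::vector<int> iters(groupSize);
  for (int id = 0; id < groupSize; ++id) iters[id] = id;
  return iters;
}
```

Under this model, eight work-items with ids 0 to 7 give a utilisation of 0.5: half of the lane-steps are spent waiting for the longest loop.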
cgh.parallel_for<naive>(cl::sycl::nd_range<2>(globalRange, localRange),
  [=](cl::sycl::nd_item<2> item) {
    // Linear position of this work-item's first channel in global memory
    int rowOffset = item.get_global_id(0) * WIDTH * NUM_CHANNELS;
    int my = NUM_CHANNELS * item.get_global_id(1) + rowOffset;
    int fIndex = 0;
    float sumR = 0.0f, sumG = 0.0f, sumB = 0.0f, sumA = 0.0f;
    // Walk the filter window row by row, advancing fIndex per element
    for (int r = -HALF_FILTER_SIZE; r <= HALF_FILTER_SIZE; r++) {
      int curRow = my + r * (WIDTH * NUM_CHANNELS);
      for (int c = -HALF_FILTER_SIZE; c <= HALF_FILTER_SIZE;
           c++, fIndex += NUM_CHANNELS) {
        int offset = c * NUM_CHANNELS;
        sumR += inputAcc[curRow + offset] * filterAcc[fIndex];
        sumG += inputAcc[curRow + offset + 1] * filterAcc[fIndex + 1];
        sumB += inputAcc[curRow + offset + 2] * filterAcc[fIndex + 2];
        sumA += inputAcc[curRow + offset + 3] * filterAcc[fIndex + 3];
      }
    }
    // Write the per-channel sums back to global memory
    outputAcc[my] = sumR;
    outputAcc[my + 1] = sumG;
    outputAcc[my + 2] = sumB;
    outputAcc[my + 3] = sumA;
  });
First we calculate the linear position of the data element within global memory relative to the current work-item
Then we loop over each element in the filter, incrementing an offset as we go
Then we multiply each data element in global memory with the corresponding element of the filter and add it to a sum, for each channel
Finally we write out the sums to global memory again
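When experimenting with a kernel like this it can help to have a host-side reference to compare results against. The following is a minimal sketch in plain C++, not part of the talk: `convolve_reference` is a hypothetical name, the data layout (interleaved channels, one filter weight per channel per tap) mirrors how the kernel indexes with `fIndex`, and clamping reads at the image border is an assumption, since the kernel on the slide shows no border handling.

```cpp
#include <algorithm>
#include <vector>

// Host-side reference convolution over an image with interleaved channels.
// Assumption: reads outside the image are clamped to the nearest edge pixel.
std::vector<float> convolve_reference(const std::vector<float>& input,
                                      const std::vector<float>& filter,
                                      int width, int height, int channels,
                                      int halfFilterSize) {
  std::vector<float> output(input.size(), 0.0f);
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      int fIndex = 0;  // index of the current filter tap's first channel
      for (int r = -halfFilterSize; r <= halfFilterSize; ++r) {
        int yy = std::min(std::max(y + r, 0), height - 1);
        for (int c = -halfFilterSize; c <= halfFilterSize;
             ++c, fIndex += channels) {
          int xx = std::min(std::max(x + c, 0), width - 1);
          for (int ch = 0; ch < channels; ++ch) {
            output[(y * width + x) * channels + ch] +=
                input[(yy * width + xx) * channels + ch] * filter[fIndex + ch];
          }
        }
      }
    }
  }
  return output;
}
```

Comparing the device output against this reference for interior pixels is one way to catch indexing mistakes before tuning for performance.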
[Chart: kernel time (ms) of the naive image convolution for 3x3, 5x5, 7x7, 9x9 and 11x11 filters; 512x512 source image, Intel HD Graphics 530]
Coalesced global memory access
➢ Reading from and writing to global memory is very expensive
➢ It often means copying across an off-chip bus
➢ Global memory is read and written in chunks
➢ This means accessing data that is physically close together in memory is more efficient
float data[size];
...
f(data[globalId]);
100% global access utilisation
float data[size];
...
f(data[globalId * 2]);
50% global access utilisation
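The utilisation figures can be reproduced with a small host-side model: count how many chunks a group of consecutive work-items touches and compare the elements used with the elements fetched. This is an illustrative sketch only. `access_utilisation` is a hypothetical helper, and the chunk size is an assumed parameter that differs between devices.

```cpp
#include <cstddef>
#include <set>

// Fraction of fetched elements actually used when `workItems` consecutive
// work-items each read element `globalId * stride` and memory is fetched in
// chunks of `chunkElems` elements (an assumed, device-dependent size).
double access_utilisation(std::size_t workItems, std::size_t stride,
                          std::size_t chunkElems) {
  if (workItems == 0 || chunkElems == 0) return 0.0;
  std::set<std::size_t> chunks;  // distinct chunks touched by the group
  for (std::size_t id = 0; id < workItems; ++id) {
    chunks.insert((id * stride) / chunkElems);
  }
  return static_cast<double>(workItems) /
         static_cast<double>(chunks.size() * chunkElems);
}
```

With 16 work-items and 16-element chunks, a stride of 1 gives 1.0 and a stride of 2 gives 0.5, matching the two cases above.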
This becomes very important when dealing with multiple dimensions
It’s important to ensure that the order in which work-items are executed aligns with the order in which data elements are accessed
This maintains coalesced global memory access
auto id0 = get_global_id(0);
auto id1 = get_global_id(1);
auto linearId = (id1 * 4) + id0;
a[linearId] = f();
Data access: row-major
[Figure: 4x4 grid of work-items, global_id(0) along the horizontal axis and global_id(1) along the vertical axis]
Here data elements are accessed in row-major order and work-items are executed in row-major order
Global memory access is coalesced
Work-item execution order (global_id(0) across, global_id(1) down):
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
Memory order: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Data access: row-major (linearId = (id1 * 4) + id0)
Work-item execution: row-major
If the work-items were executed in column-major order
Global memory access is no longer coalesced
Work-item execution order (global_id(0) across, global_id(1) down):
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
Memory order: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Data access: row-major (linearId = (id1 * 4) + id0)
Work-item execution: column-major
However if you were to switch the data access pattern to column-major
Global memory access is coalesced again
Work-item execution order (global_id(0) across, global_id(1) down):
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
Memory order: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
auto id0 = get_global_id(0);
auto id1 = get_global_id(1);
auto linearId = (id0 * 4) + id1;
a[linearId] = f();
Data access: column-major
Work-item execution: column-major
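The four combinations of execution order and data layout can be checked on the host by listing the addresses in the order the work-items are assumed to run and testing whether each step moves to the next consecutive address. This is a sketch under that assumption. `Order`, `addresses_in_execution_order` and `is_coalesced` are hypothetical names, and real devices schedule work-items in hardware-specific groups rather than one at a time.

```cpp
#include <cstddef>
#include <vector>

// Order in which work-items of a width x height range are assumed to run.
enum class Order { RowMajor, ColumnMajor };

// Linear addresses touched, in execution order. With rowMajorData the
// address is id1 * width + id0, otherwise id0 * height + id1.
std::vector<int> addresses_in_execution_order(int width, int height,
                                              Order execution,
                                              bool rowMajorData) {
  std::vector<int> addresses;
  for (int step = 0; step < width * height; ++step) {
    int id0, id1;
    if (execution == Order::RowMajor) {
      id0 = step % width;   // id0 varies fastest
      id1 = step / width;
    } else {
      id0 = step / height;
      id1 = step % height;  // id1 varies fastest
    }
    addresses.push_back(rowMajorData ? id1 * width + id0
                                     : id0 * height + id1);
  }
  return addresses;
}

// True when every step moves to the next consecutive address.
bool is_coalesced(const std::vector<int>& addresses) {
  for (std::size_t i = 1; i < addresses.size(); ++i) {
    if (addresses[i] != addresses[i - 1] + 1) return false;
  }
  return true;
}
```

For the 4x4 case, the access pattern is consecutive only when the execution order and the data layout match.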
cgh.parallel_for<naive>(cl::sycl::nd_range<2>(globalRange, localRange),
  [=](cl::sycl::nd_item<2> item) {
    // Global ids swapped relative to the naive kernel
    int rowOffset = item.get_global_id(1) * WIDTH * NUM_CHANNELS;
    int my = NUM_CHANNELS * item.get_global_id(0) + rowOffset;
    int fIndex = 0;
    float sumR = 0.0f, sumG = 0.0f, sumB = 0.0f, sumA = 0.0f;
    for (int r = -HALF_FILTER_SIZE; r <= HALF_FILTER_SIZE; r++) {
      int curRow = my + r * (WIDTH * NUM_CHANNELS);
      for (int c = -HALF_FILTER_SIZE; c <= HALF_FILTER_SIZE;
           c++, fIndex += NUM_CHANNELS) {
        int offset = c * NUM_CHANNELS;
        sumR += inputAcc[curRow + offset] * filterAcc[fIndex];
        sumG += inputAcc[curRow + offset + 1] * filterAcc[fIndex + 1];
        sumB += inputAcc[curRow + offset + 2] * filterAcc[fIndex + 2];
        sumA += inputAcc[curRow + offset + 3] * filterAcc[fIndex + 3];
      }
    }
    outputAcc[my] = sumR;
    outputAcc[my + 1] = sumG;
    outputAcc[my + 2] = sumB;
    outputAcc[my + 3] = sumA;
  });
Reversing the global ids will flip the linearization from row-major to column-major
Whether column-major or row-major linearization is more efficient depends on the device you are on
[Chart: kernel time (% of base), naive vs coalesced image convolution, for 3x3, 5x5, 7x7, 9x9 and 11x11 filters; 512x512 source image, Intel HD Graphics 530]
Make use of vector operations
➢ GPUs are vector processors
➢ Each processing element is capable of wide instructions which can operate on multiple elements of data at once
➢ Many compilers can auto-vectorise
➢ Where the compiler already auto-vectorises, vectorising your kernels by hand may yield a smaller performance gain
float rS, gS, bS, aS;
float r1, g1, b1, a1;
float r2, g2, b2, a2;
...
rS = r1 + r2;  // 32bit FP add
gS = g1 + g2;  // 32bit FP add
bS = b1 + b2;  // 32bit FP add
aS = a1 + a2;  // 32bit FP add
cl::sycl::float4 vS;
cl::sycl::float4 v1;
cl::sycl::float4 v2;
...
vS = v1 + v2;  // 128bit FP vector add
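To make explicit what the single vector statement expresses, here is a minimal host-side stand-in for a four-wide float vector. `Float4` is a hypothetical type written for illustration. It is not `cl::sycl::float4` and only mimics the element-wise `+`, `*` and `+=` used in these slides. Whether the operations map to one wide instruction depends on the compiler and device.

```cpp
// Minimal stand-in for a 4-wide float vector (illustration only).
struct Float4 {
  float x, y, z, w;
};

// Element-wise addition: one expression covers all four channels.
inline Float4 operator+(Float4 a, Float4 b) {
  return {a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w};
}

// Element-wise multiplication.
inline Float4 operator*(Float4 a, Float4 b) {
  return {a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w};
}

// Accumulate in place, as in `sum += input * filter`.
inline Float4& operator+=(Float4& a, Float4 b) {
  a = a + b;
  return a;
}
```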
cgh.parallel_for<naive>(cl::sycl::nd_range<2>(globalRange, localRange),
  [=](cl::sycl::nd_item<2> item) {
    // Indices now count float4 elements, so NUM_CHANNELS drops out
    int rowOffset = item.get_global_id(1) * WIDTH;
    int my = item.get_global_id(0) + rowOffset;
    int fIndex = 0;
    cl::sycl::float4 sum = cl::sycl::float4{0.0f};
    for (int r = -HALF_FILTER_SIZE; r <= HALF_FILTER_SIZE; r++) {
      int curRow = my + r * WIDTH;
      for (int c = -HALF_FILTER_SIZE; c <= HALF_FILTER_SIZE; c++) {
        sum += inputAcc[curRow + c] * filterAcc[fIndex];
        fIndex++;
      }
    }
    outputAcc[my] = sum;
  });
To vectorise the kernel define all accessors in terms of SYCL vector types
This allows us to remove the calculations to factor in the number of channels
This also allows us to reduce the multiplications and assignments to single vector operators
[Chart: kernel time (% of base), coalesced vs vectorised image convolution, for 3x3, 5x5, 7x7, 9x9 and 11x11 filters; vertical axis from 82 to 102; 512x512 source image, Intel HD Graphics 530]
Make use of local memory
➢ Local memory has much lower access latency than global memory
➢ Cache commonly accessed data and temporary results in local memory rather than reading and writing to global memory
➢ Using local memory is not necessarily always more efficient
➢ If data is not accessed frequently enough to warrant the copy to local memory you may not see a performance gain
Input (16x16):
1 7 5 8 2 3 8 3 4 6 2 2 4 5 8 3
1 3 4 3 2 4 3 4 5 6 1 6 5 7 8 5
9 2 1 8 1 4 6 9 5 1 4 5 1 9 4 7
3 6 2 0 2 2 9 8 2 7 9 4 2 6 1 5
1 7 2 2 8 4 6 8 4 7 6 8 3 2 4 1
4 9 9 5 1 3 7 3 8 1 7 4 1 5 9 4
4 0 6 3 6 9 9 6 8 5 9 9 0 2 1 5
3 8 1 2 4 7 1 7 6 7 7 2 6 3 6 7
6 7 5 4 3 1 4 4 2 6 3 0 5 0 7 0
1 3 4 2 2 8 1 6 4 9 5 3 7 1 2 4
7 5 4 3 7 0 4 0 3 0 4 4 2 8 9 0
0 9 9 8 0 2 9 8 2 1 6 0 6 3 4 1
6 4 0 1 9 1 7 4 8 3 0 5 0 2 0 6
1 5 7 6 3 0 6 5 4 6 0 4 1 8 7 0
3 3 0 5 9 8 2 4 7 1 5 2 0 4 9 7
1 9 0 4 0 3 0 6 1 2 8 7 0 1 2 9
Filter: 1/16 × [1 2 1; 2 4 2; 1 2 1]
Each item in the computation needs to read neighbouring elements
This means each element of data is read multiple times
• 3x3 filter: up to 9 ops
• 5x5 filter: up to 25 ops
• And so on…
If each of these operations loads from global memory this can be very expensive
A common technique for using local memory is to break up your input into tiles
Then each tile can be moved to local memory while the work-group is working on it
[Figure: the 16x16 input split into 8x8 tiles; the top-right tile (rows 0 to 7, columns 8 to 15) is shown copied into local memory]
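A back-of-the-envelope count shows why tiling helps. The sketch below compares global-memory loads per output element with and without a local-memory tile. The function names are hypothetical, and it assumes the tile is copied together with a halo of `filterSize / 2` elements on each side so that border elements of the tile can read their neighbours. It ignores hardware caches, so treat the numbers as an upper bound on the benefit rather than a prediction.

```cpp
// Global loads per output element when every filter tap reads global memory.
long naive_loads_per_element(int filterSize) {
  return static_cast<long>(filterSize) * filterSize;
}

// Global loads needed to fill local memory for one square tile, assuming the
// tile is copied with a halo of filterSize / 2 elements on each side.
long tiled_loads_per_tile(int tileSize, int filterSize) {
  long side = tileSize + 2L * (filterSize / 2);
  return side * side;
}

// Average global loads per output element with tiling.
double tiled_loads_per_element(int tileSize, int filterSize) {
  return static_cast<double>(tiled_loads_per_tile(tileSize, filterSize)) /
         (static_cast<double>(tileSize) * tileSize);
}
```

For an 8x8 tile and a 3x3 filter this gives 100 loads per tile, or about 1.56 global loads per element instead of 9.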
Synchronise work-groups when necessary
➢ Synchronising with a work-group barrier waits for all work-items to reach the same point
➢ Use a work-group barrier if you are copying data to local memory that neighbouring work-items will need to access
➢ Use a work-group barrier if you have temporary results that will be shared with other work-items
Remember that work-items are not all guaranteed to execute concurrently
A work-item can share results with other work-items via local and global memory
This means it is possible for a work-item to read a result that has not been written yet: you have a data race
This problem can be solved by a synchronisation primitive called a work-group barrier
Work-items will block until all work-items in the work-group have reached that point
So now you can be sure that all of the results that you want to read have been written
However, this does not apply across work-group boundaries: results shared between work-group 0 and work-group 1 can still be read before they are written, and you have a data race again
cgh.parallel_for<naive>(cl::sycl::nd_range<2>(globalRange, localRange),
  [=](cl::sycl::nd_item<2> item) {
    int globalRowOffset = item.get_global_id(1) * WIDTH;
    int global = item.get_global_id(0) + globalRowOffset;
    int localRowOffset = item.get_local_id(1) * WIDTH;
    int local = item.get_local_id(0) + localRowOffset;
    int fIndex = 0;
    cl::sycl::float4 sum = cl::sycl::float4{0.0f};
    copy_tile(scratchpad, inputAcc, local, global);
    // Wait until every work-item in the work-group has copied its part
    item.barrier(cl::sycl::access::fence_space::local_space);
    for (int r = -HALF_FILTER_SIZE; r <= HALF_FILTER_SIZE; r++) {
      int curRow = local + r * WIDTH;
      for (int c = -HALF_FILTER_SIZE; c <= HALF_FILTER_SIZE; c++) {
        sum += scratchpad[curRow + c] * filterAcc[fIndex];
        fIndex++;
      }
    }
    outputAcc[global] = sum;
  });
To use local memory we need to also calculate the linear position in the current work-group
We can then use this to copy a tile from global memory into the local memory of the current work-group
Now the multiply operators within the loop are reading from local memory
Since we’re moving a tile into local memory and then performing operations on it there, we need a barrier to ensure all elements of the tile have been copied before any work-item reads them
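The copy, barrier, compute pattern can be emulated on the host to see why the barrier sits where it does. The sketch below is a sequential, one-dimensional emulation in plain C++, not SYCL. `tiled_sum3` is a hypothetical function, the 3-point sum stands in for the filter, and clamping at the ends of the input is an assumption. The end of the first inner loop plays the role of the work-group barrier: no work-item starts reading the tile until every work-item in the group has written its part.

```cpp
#include <vector>

// Sequential emulation of a tiled 3-point sum with one halo element per side.
std::vector<float> tiled_sum3(const std::vector<float>& input, int groupSize) {
  const int n = static_cast<int>(input.size());
  std::vector<float> output(n, 0.0f);
  if (groupSize < 1) return output;
  // Assumption: out-of-range reads are clamped to the nearest valid element.
  auto clamped = [&](int i) { return input[i < 0 ? 0 : (i >= n ? n - 1 : i)]; };
  for (int groupStart = 0; groupStart < n; groupStart += groupSize) {
    std::vector<float> local(groupSize + 2);  // tile plus halo ("local memory")
    // Phase 1: each work-item copies its own element; the first and last
    // work-items of the group also copy the halo elements.
    for (int lid = 0; lid < groupSize; ++lid) {
      int gid = groupStart + lid;
      local[lid + 1] = clamped(gid);
      if (lid == 0) local[0] = clamped(gid - 1);
      if (lid == groupSize - 1) local[groupSize + 1] = clamped(gid + 1);
    }
    // A work-group barrier would go here on a real device.
    // Phase 2: every work-item reads its neighbours from the tile.
    for (int lid = 0; lid < groupSize; ++lid) {
      int gid = groupStart + lid;
      if (gid >= n) break;
      output[gid] = local[lid] + local[lid + 1] + local[lid + 2];
    }
  }
  return output;
}
```

If phase 2 for one work-item ran before phase 1 had finished for its neighbours, it would read tile elements that had not been written yet, which is exactly the data race the barrier prevents.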
[Chart: kernel time (% of base), vectorised vs local memory image convolution, for 3x3, 5x5, 7x7, 9x9 and 11x11 filters; 512x512 source image, Intel HD Graphics 530]
Choosing a good work-group size
➢ The occupancy of a kernel can be limited by a number of factors of the GPU:
➢ Total number of processing elements
➢ Total number of compute units
➢ Total registers available to the kernel
➢ Total local memory available to the kernel
➢ You can query the preferred work-group size once the kernel is compiled
➢ However this is not guaranteed to give you the best performance
➢ It’s good practice to benchmark various work-group sizes and choose the best
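When benchmarking, a simple starting point is to enumerate the work-group sizes that are legal for a given global range and time the kernel with each. The helper below is a hypothetical host-side sketch for one dimension. It assumes the local size must divide the global size evenly and must not exceed a device limit, where `maxWorkGroupSize` stands for a value obtained from a device query made elsewhere.

```cpp
#include <cstddef>
#include <vector>

// Candidate work-group sizes for a 1D range: every size that divides the
// global size evenly and does not exceed the (assumed) device limit.
std::vector<std::size_t> candidate_work_group_sizes(
    std::size_t globalSize, std::size_t maxWorkGroupSize) {
  std::vector<std::size_t> sizes;
  for (std::size_t s = 1; s <= maxWorkGroupSize && s <= globalSize; ++s) {
    if (globalSize % s == 0) sizes.push_back(s);
  }
  return sizes;
}
```

For a 512-wide range and a limit of 16 this yields 1, 2, 4, 8 and 16. Each candidate can then be timed over several runs and the fastest kept.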
[Chart: kernel time (% of base), vectorised vs local memory with 8x8 work-groups vs local memory with 16x16 work-groups, for 3x3, 5x5, 7x7, 9x9 and 11x11 filters; 512x512 source image, Intel HD Graphics 530]
Ideas for further optimisations
Use constant memory
➢ Some GPUs provide a region of global memory that is read-only
➢ This can be faster to access as it doesn’t require caching
[Diagram: Private memory < Local memory < Global memory; Constant memory]
Use texture memory
➢ Most GPUs have texture memory
➢ This can be faster to access for data that is represented as pixels
➢ This also provides sampling operations
[Diagram: Private memory < Local memory < Global memory; Texture memory]
Batch work together
➢ Hitting occupancy limitations of a GPU can lead to drops in performance gain
➢ This is because each work-item has to process more than one chunk of work
➢ Batching work for each work-item allows reusing cached data
➢ Batching work that shares neighbouring data allows you to further share local memory and registers
Use double buffering
➢ If you hit occupancy limitations you will have more tiles than can be computed at once
➢ This means each work-group will compute more than one tile
[Figure: local memory and four tiles labelled 0,0, 1,0, 0,1 and 1,1]
![Page 110: Efficient GPU Programming in Modern C++€¦ · Efficient GPU Programming in Modern C++ Gordon Brown Principal Software Engineer, SYCL & C++ CppCon 2019 –Sep 2019](https://reader034.fdocuments.in/reader034/viewer/2022042917/5f58cb7203fcc9685e36c2fa/html5/thumbnails/110.jpg)
[Figure: timeline for tiles {0, 0}, {1, 0}, {0, 1} and {1, 1}, each with a Copy stage followed by a Compute stage. Legend: Copy, Compute]
Overlapping copy and compute within kernels allows for better utilisation of GPU processing elements and therefore better throughput
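A minimal host-side sketch of the double-buffering loop, in plain C++ rather than SYCL device code. The `TILE` size and the names `copy_tile`, `compute_tile` and `process_tiles_double_buffered` are hypothetical, and the compute stage is just a sum. On the host the two stages run one after the other; the point is the ping-pong structure that lets a GPU keep the copy of tile t+1 in flight while tile t is being computed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 4;  // hypothetical number of elements per tile

// Stand-in for the copy stage: global memory -> one local buffer.
void copy_tile(const std::vector<int>& global, std::size_t tile,
               std::array<int, TILE>& local) {
  for (std::size_t i = 0; i < TILE; ++i) local[i] = global[tile * TILE + i];
}

// Stand-in for the compute stage: here, simply sum the tile.
int compute_tile(const std::array<int, TILE>& local) {
  int sum = 0;
  for (int v : local) sum += v;
  return sum;
}

// One simulated work-group working through several tiles with two buffers.
std::vector<int> process_tiles_double_buffered(const std::vector<int>& global) {
  assert(global.size() % TILE == 0);
  const std::size_t numTiles = global.size() / TILE;
  std::vector<int> results(numTiles);
  if (numTiles == 0) return results;

  std::array<std::array<int, TILE>, 2> buffers{};
  copy_tile(global, 0, buffers[0]);  // prologue: fetch the first tile
  for (std::size_t t = 0; t < numTiles; ++t) {
    const std::size_t cur = t % 2;
    // Start copying tile t+1 into the idle buffer. On a GPU this copy could
    // overlap with the compute below; here it simply runs first.
    if (t + 1 < numTiles) copy_tile(global, t + 1, buffers[1 - cur]);
    results[t] = compute_tile(buffers[cur]);
    // Assumption: a real kernel would synchronise the work-group here
    // before the two buffers swap roles.
  }
  return results;
}
```

For the input 1 to 12 with `TILE = 4`, the function returns the per-tile sums 10, 26 and 42.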
![Page 111: Efficient GPU Programming in Modern C++€¦ · Efficient GPU Programming in Modern C++ Gordon Brown Principal Software Engineer, SYCL & C++ CppCon 2019 –Sep 2019](https://reader034.fdocuments.in/reader034/viewer/2022042917/5f58cb7203fcc9685e36c2fa/html5/thumbnails/111.jpg)
Loop unrolling
cgh.parallel_for<naive>(cl::sycl::nd_range<2>(globalRange, localRange),
    [=](cl::sycl::nd_item<2> item) {
  // Linear index of this work-item's pixel: row * WIDTH + column.
  int rowOffset = item.get_global_id(1) * WIDTH;
  int my = item.get_global_id(0) + rowOffset;
  cl::sycl::float4 sum = cl::sycl::float4{0.0f};
  // Row above is my - WIDTH, current row is my, row below is my + WIDTH.
  // No bounds checks: assumes every neighbour index is in range.
  sum += inputAcc[(my - WIDTH) - 1] * filterAcc[0];
  sum += inputAcc[(my - WIDTH)] * filterAcc[1];
  sum += inputAcc[(my - WIDTH) + 1] * filterAcc[2];
  sum += inputAcc[my - 1] * filterAcc[3];
  sum += inputAcc[my] * filterAcc[4];
  sum += inputAcc[my + 1] * filterAcc[5];
  sum += inputAcc[(my + WIDTH) - 1] * filterAcc[6];
  sum += inputAcc[(my + WIDTH)] * filterAcc[7];
  sum += inputAcc[(my + WIDTH) + 1] * filterAcc[8];
  outputAcc[my] = sum;
});
➢ Here we unroll the loop over the filter
➢ This gives the compiler more freedom in how it vectorises and allocates registers
➢ However, it also makes the code harder to read and less flexible
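To make the trade-off concrete, here is a host-side comparison of a rolled and an unrolled 3x3 filter in plain C++, using a scalar `float` per pixel instead of `float4`. The function names and parameters are hypothetical and not taken from the talk. Both forms read the same nine neighbours in the same order, so they produce the same result.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Rolled form: two loops over the 3x3 filter around pixel (x, y).
float filter_rolled(const std::vector<float>& in, const std::array<float, 9>& f,
                    std::size_t width, std::size_t x, std::size_t y) {
  assert(x >= 1 && y >= 1 && x + 1 < width && (y + 2) * width <= in.size());
  float sum = 0.0f;
  std::size_t fIndex = 0;
  for (std::size_t r = y - 1; r <= y + 1; ++r) {
    for (std::size_t c = x - 1; c <= x + 1; ++c) {
      sum += in[r * width + c] * f[fIndex++];
    }
  }
  return sum;
}

// Unrolled form: the same nine multiply-adds written out by hand.
float filter_unrolled(const std::vector<float>& in,
                      const std::array<float, 9>& f, std::size_t width,
                      std::size_t x, std::size_t y) {
  assert(x >= 1 && y >= 1 && x + 1 < width && (y + 2) * width <= in.size());
  const std::size_t my = y * width + x;  // linear index of the centre pixel
  float sum = 0.0f;
  sum += in[my - width - 1] * f[0];  // row above
  sum += in[my - width] * f[1];
  sum += in[my - width + 1] * f[2];
  sum += in[my - 1] * f[3];          // current row
  sum += in[my] * f[4];
  sum += in[my + 1] * f[5];
  sum += in[my + width - 1] * f[6];  // row below
  sum += in[my + width] * f[7];
  sum += in[my + width + 1] * f[8];
  return sum;
}
```

The rolled form adapts to another filter size by changing the loop bounds, whereas the unrolled form has to be rewritten line by line, which is the loss of flexibility noted on the slide.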
![Page 112: Efficient GPU Programming in Modern C++€¦ · Efficient GPU Programming in Modern C++ Gordon Brown Principal Software Engineer, SYCL & C++ CppCon 2019 –Sep 2019](https://reader034.fdocuments.in/reader034/viewer/2022042917/5f58cb7203fcc9685e36c2fa/html5/thumbnails/112.jpg)
Further tips
➢ Use profiling tools to gather more accurate information about your programs
➢ SYCL provides kernel profiling
➢ Most OpenCL implementations provide proprietary profiler tools
➢ Follow vendor optimisation guides
➢ Most OpenCL vendors provide optimisation guides with recommendations on how to optimise programs for their respective GPUs
![Page 113: Efficient GPU Programming in Modern C++€¦ · Efficient GPU Programming in Modern C++ Gordon Brown Principal Software Engineer, SYCL & C++ CppCon 2019 –Sep 2019](https://reader034.fdocuments.in/reader034/viewer/2022042917/5f58cb7203fcc9685e36c2fa/html5/thumbnails/113.jpg)
Takeaways
➢ Identify which parts of your code to offload and which algorithms to use
➢ Look for hotspots in your code that are bottlenecks
➢ Identify opportunities for parallelism
➢ Optimising GPU programs means maximising throughput
➢ Maximise compute operations
➢ Minimise time spent on memory operations
➢ Use profilers to analyse your GPU programs and consult optimisation guides
![Page 114: Efficient GPU Programming in Modern C++€¦ · Efficient GPU Programming in Modern C++ Gordon Brown Principal Software Engineer, SYCL & C++ CppCon 2019 –Sep 2019](https://reader034.fdocuments.in/reader034/viewer/2022042917/5f58cb7203fcc9685e36c2fa/html5/thumbnails/114.jpg)
@codeplaysoft · /codeplaysoft · codeplay.com
Thank you for listening