Intro to Cuda
![Page 1: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/1.jpg)
GPU Algorithms
David Hauck
github.com/davidhauck
@david_hauck_mke
davidhauck40.blogspot.com
![Page 2: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/2.jpg)
Graphics Processing Unit
![Page 3: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/3.jpg)
![Page 4: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/4.jpg)
![Page 5: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/5.jpg)
Why?
![Page 6: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/6.jpg)
![Page 7: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/7.jpg)
Graphics Processing Unit
![Page 8: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/8.jpg)
Graphics Processing Unit
General Purpose
![Page 9: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/9.jpg)
TERMS
![Page 10: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/10.jpg)
HOST
![Page 11: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/11.jpg)
DEVICE
![Page 12: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/12.jpg)
![Page 13: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/13.jpg)
PCI Bus
Copy initial data to DEVICE
![Page 14: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/14.jpg)
PCI Bus
Run DEVICE Executable
![Page 15: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/15.jpg)
PCI Bus
Copy Results Back To HOST
![Page 16: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/16.jpg)
Still Running on CPU
![Page 17: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/17.jpg)
Still Running on CPU
GPU is a Resource
![Page 18: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/18.jpg)
![Page 19: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/19.jpg)
MEMORY CONSCIOUSNESS
![Page 20: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/20.jpg)
HOST POINTERS
DEVICE POINTERS
![Page 21: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/21.jpg)
int *a;
![Page 22: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/22.jpg)
int *a;
int *d_a;
![Page 23: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/23.jpg)
arr = malloc(size);
![Page 24: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/24.jpg)
arr = malloc(size);
cudaMalloc(&d_arr, size);
![Page 25: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/25.jpg)
free(arr);
![Page 26: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/26.jpg)
free(arr);
cudaFree(d_arr);
![Page 27: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/27.jpg)
memcpy(dest, source, size);
![Page 28: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/28.jpg)
memcpy(dest, source, size);
cudaMemcpy(dest, src, size, …);
![Page 29: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/29.jpg)
1: HOST → DEVICE
2: EXECUTE
3: DEVICE → HOST
![Page 30: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/30.jpg)
1: HOST → DEVICE
3: DEVICE → HOST
cudaMemcpy();
![Page 31: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/31.jpg)
1: HOST → DEVICE
cudaMemcpy(dest, source, size, ..hostToDevice);
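The memory calls on the preceding slides can be combined into one round trip. The sketch below is not from the talk: `roundTrip` and `back` are hypothetical names, and it assumes `count` is positive. It copies an array to the device and straight back, then checks that nothing changed.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: copies `count` ints host -> device -> host.
// Returns true only if every CUDA call succeeded and the data survived.
bool roundTrip(const int *src, int count) {
    size_t size = count * sizeof(int);
    int *d_arr = nullptr;                    // device pointer
    int *back = (int *)malloc(size);         // host buffer for the copy coming back
    if (back == nullptr) return false;

    // cudaMalloc takes the ADDRESS of the pointer so it can fill it in.
    bool ok = cudaMalloc(&d_arr, size) == cudaSuccess;
    // cudaMemcpy takes the destination pointer itself, not its address.
    ok = ok && cudaMemcpy(d_arr, src, size, cudaMemcpyHostToDevice) == cudaSuccess;
    ok = ok && cudaMemcpy(back, d_arr, size, cudaMemcpyDeviceToHost) == cudaSuccess;
    ok = ok && memcmp(src, back, size) == 0;

    cudaFree(d_arr);                         // a no-op if d_arr is still nullptr
    free(back);
    return ok;
}
```

The last argument of `cudaMemcpy` names the direction of the copy; `cudaMemcpyHostToDevice` and `cudaMemcpyDeviceToHost` are the two values used here.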
![Page 32: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/32.jpg)
EXECUTION
![Page 33: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/33.jpg)
__global__ void myKernel(int *a){}
![Page 34: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/34.jpg)
myKernel<<<1,1>>>(d_arr);
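A minimal sketch of a complete launch, assuming a trivial kernel body. The names `setAnswer` and `runSetAnswer` are made up for illustration and do not appear in the talk.

```cuda
#include <cuda_runtime.h>

// __global__ marks a function that runs on the device and is launched from the host.
__global__ void setAnswer(int *a) {
    *a = 42;   // `a` must be a device pointer
}

// Launches the kernel with 1 block of 1 thread and returns the value it wrote,
// or -1 if any CUDA call failed.
int runSetAnswer() {
    int *d_a = nullptr;
    int result = -1;
    if (cudaMalloc(&d_a, sizeof(int)) != cudaSuccess) return -1;

    setAnswer<<<1, 1>>>(d_a);                      // <<<blocks, threads per block>>>
    bool ok = cudaGetLastError() == cudaSuccess;   // reports launch errors
    // cudaMemcpy waits for the kernel to finish before it copies.
    ok = ok && cudaMemcpy(&result, d_a, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_a);
    return ok ? result : -1;
}
```

Here `&result` is correct because `result` is a plain `int`, so its address is the destination pointer.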
![Page 35: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/35.jpg)
Let’s do an example
![Page 36: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/36.jpg)
abcd
+
efgh
=
ijkl
![Page 37: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/37.jpg)
abcd
+
efgh
=
ijkl
![Page 38: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/38.jpg)
abcd
+
efgh
=
ijkl
threadIdx.x
0
1
2
3
![Page 39: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/39.jpg)
int index = threadIdx.x;
c[index] = a[index] + b[index];
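Wrapped in a kernel and combined with the three steps (HOST → DEVICE, EXECUTE, DEVICE → HOST), the addition might look like the sketch below. `addKernel` and `addOnDevice` are hypothetical names, and the sketch assumes the whole array fits in a single block.

```cuda
#include <cuda_runtime.h>

// Each thread adds one pair of elements, selected by its thread index.
__global__ void addKernel(const int *a, const int *b, int *c) {
    int index = threadIdx.x;
    c[index] = a[index] + b[index];
}

// Adds two host arrays of n ints on the device. Assumes n fits in one block
// (at most 1024 threads on current hardware). Returns true on success.
bool addOnDevice(const int *a, const int *b, int *c, int n) {
    size_t size = n * sizeof(int);
    int *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    bool ok = cudaMalloc(&d_a, size) == cudaSuccess
           && cudaMalloc(&d_b, size) == cudaSuccess
           && cudaMalloc(&d_c, size) == cudaSuccess;

    // 1: HOST -> DEVICE
    ok = ok && cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice) == cudaSuccess
            && cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        // 2: EXECUTE with one block of n threads
        addKernel<<<1, n>>>(d_a, d_b, d_c);
        ok = cudaGetLastError() == cudaSuccess;
    }
    // 3: DEVICE -> HOST
    ok = ok && cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return ok;
}
```

For four elements, threads 0 through 3 each write one entry of `c`, matching the `threadIdx.x` values on the previous slide.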
![Page 40: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/40.jpg)
Let’s invent an ALGORITHM
![Page 41: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/41.jpg)
K-Means Clustering
![Page 42: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/42.jpg)
![Page 43: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/43.jpg)
![Page 44: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/44.jpg)
![Page 45: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/45.jpg)
![Page 46: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/46.jpg)
![Page 47: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/47.jpg)
![Page 48: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/48.jpg)
![Page 49: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/49.jpg)
![Page 50: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/50.jpg)
![Page 51: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/51.jpg)
![Page 52: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/52.jpg)
![Page 53: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/53.jpg)
![Page 54: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/54.jpg)
![Page 55: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/55.jpg)
CODE
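The code shown in the talk is not part of this transcript. As a rough illustration only, the assignment step of k-means (each point picks its nearest centroid) maps naturally onto one thread per point. Everything below is an assumption: 2D points stored as separate x and y arrays, and the hypothetical names `assignClusters` and `assignOnDevice`.

```cuda
#include <cuda_runtime.h>
#include <cfloat>

// One thread per point: find the nearest centroid and record its index.
__global__ void assignClusters(const float *px, const float *py, int numPoints,
                               const float *cx, const float *cy, int k,
                               int *assignment) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPoints) return;          // guard the partially filled last block

    float best = FLT_MAX;
    int bestCluster = 0;
    for (int c = 0; c < k; ++c) {
        float dx = px[i] - cx[c];
        float dy = py[i] - cy[c];
        float dist = dx * dx + dy * dy;  // squared distance is enough to compare
        if (dist < best) { best = dist; bestCluster = c; }
    }
    assignment[i] = bestCluster;
}

// Hypothetical host wrapper: copies inputs over, launches the kernel,
// and copies the assignments back. Returns true on success.
bool assignOnDevice(const float *px, const float *py, int numPoints,
                    const float *cx, const float *cy, int k, int *assignment) {
    size_t pBytes = numPoints * sizeof(float);
    size_t cBytes = k * sizeof(float);
    size_t aBytes = numPoints * sizeof(int);
    float *d_px = nullptr, *d_py = nullptr, *d_cx = nullptr, *d_cy = nullptr;
    int *d_assign = nullptr;
    bool ok = cudaMalloc(&d_px, pBytes) == cudaSuccess
           && cudaMalloc(&d_py, pBytes) == cudaSuccess
           && cudaMalloc(&d_cx, cBytes) == cudaSuccess
           && cudaMalloc(&d_cy, cBytes) == cudaSuccess
           && cudaMalloc(&d_assign, aBytes) == cudaSuccess;
    ok = ok && cudaMemcpy(d_px, px, pBytes, cudaMemcpyHostToDevice) == cudaSuccess
            && cudaMemcpy(d_py, py, pBytes, cudaMemcpyHostToDevice) == cudaSuccess
            && cudaMemcpy(d_cx, cx, cBytes, cudaMemcpyHostToDevice) == cudaSuccess
            && cudaMemcpy(d_cy, cy, cBytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 256;
        int blocks = (numPoints + threads - 1) / threads;   // round up
        assignClusters<<<blocks, threads>>>(d_px, d_py, numPoints,
                                            d_cx, d_cy, k, d_assign);
        ok = cudaGetLastError() == cudaSuccess;
    }
    ok = ok && cudaMemcpy(assignment, d_assign, aBytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_px); cudaFree(d_py); cudaFree(d_cx); cudaFree(d_cy); cudaFree(d_assign);
    return ok;
}
```

A full k-means loop would alternate this step with recomputing each centroid as the mean of its assigned points until the assignments stop changing; the update step is left out of this sketch.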
![Page 56: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/56.jpg)
Shared Memory
• ~48 KB
• Multiple GB device memory (100x higher latency)
• Access memory in order
  1 2 3
  4 5 6
  7 8 9
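One way these points fit together, as a sketch under stated assumptions: a single block sums an array by reading each element from device memory once, in order, and doing the rest of the work in shared memory. `blockSum`, `sumOnDevice`, and `THREADS` are hypothetical names.

```cuda
#include <cuda_runtime.h>

#define THREADS 256   // threads per block; a power of two for the loop below

// Sums up to THREADS ints with one block. Thread t reads in[t], so neighbouring
// threads read neighbouring addresses (memory is accessed in order).
__global__ void blockSum(const int *in, int n, int *out) {
    __shared__ int cache[THREADS];        // on-chip memory shared by the block
    int t = threadIdx.x;
    cache[t] = (t < n) ? in[t] : 0;       // one read from slower device memory
    __syncthreads();                      // wait until the whole block has loaded

    // Tree reduction in shared memory: halve the active threads each step.
    for (int stride = THREADS / 2; stride > 0; stride /= 2) {
        if (t < stride) cache[t] += cache[t + stride];
        __syncthreads();
    }
    if (t == 0) *out = cache[0];
}

// Hypothetical host wrapper; n must be between 1 and THREADS.
bool sumOnDevice(const int *data, int n, int *result) {
    if (n < 1 || n > THREADS) return false;
    int *d_in = nullptr, *d_out = nullptr;
    bool ok = cudaMalloc(&d_in, n * sizeof(int)) == cudaSuccess
           && cudaMalloc(&d_out, sizeof(int)) == cudaSuccess;
    ok = ok && cudaMemcpy(d_in, data, n * sizeof(int), cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        blockSum<<<1, THREADS>>>(d_in, n, d_out);
        ok = cudaGetLastError() == cudaSuccess;
    }
    ok = ok && cudaMemcpy(result, d_out, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

The 256-element `int` cache uses 1 KB of shared memory, well inside the ~48 KB mentioned above.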
![Page 57: Intro to Cuda](https://reader035.fdocuments.in/reader035/viewer/2022062522/5884cdd81a28ab767c8b593f/html5/thumbnails/57.jpg)
Considerations
• Transistors are allocated to arithmetic, not memory. It is sometimes better to recompute a value than to cache it.
• Copying to/from the host takes a while. Sequential operations can sometimes stay on the GPU.
• Avoid serialization (shared memory bank conflicts).
• Asynchronous memory operations
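Two of these points can be sketched together: keeping intermediate results on the GPU between consecutive kernels, and using asynchronous copies on a stream. The kernels and the `scaleThenOffset` helper are hypothetical, and the sketch assumes the host buffer was allocated with `cudaMallocHost` (pinned memory).

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *v, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= factor;
}

__global__ void offset(float *v, int n, float amount) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] += amount;
}

// Computes v[i] = v[i] * factor + amount for n floats, in place.
// Assumes `pinned` came from cudaMallocHost so the async copies
// do not have to block the host thread.
bool scaleThenOffset(float *pinned, int n, float factor, float amount) {
    size_t size = n * sizeof(float);
    float *d_v = nullptr;
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return false;
    bool ok = cudaMalloc(&d_v, size) == cudaSuccess;

    // Everything below is queued on one stream, so it runs in order on the
    // device while the host thread is free to do other work.
    ok = ok && cudaMemcpyAsync(d_v, pinned, size, cudaMemcpyHostToDevice, stream) == cudaSuccess;
    if (ok) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads, 0, stream>>>(d_v, n, factor);
        offset<<<blocks, threads, 0, stream>>>(d_v, n, amount);  // no copy back in between
        ok = cudaGetLastError() == cudaSuccess;
    }
    ok = ok && cudaMemcpyAsync(pinned, d_v, size, cudaMemcpyDeviceToHost, stream) == cudaSuccess;
    ok = ok && cudaStreamSynchronize(stream) == cudaSuccess;     // wait before reading `pinned`

    cudaFree(d_v);
    cudaStreamDestroy(stream);
    return ok;
}
```

The data crosses the bus once in each direction even though two kernels run, which is the saving the second bullet describes.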