NVIDIA Kepler Architecture
Transcript of NVIDIA Kepler Architecture
Paul Bissonnette, Rizwan Mohiuddin, Ajith Herga
Compute Unified Device Architecture
• Hybrid CPU/GPU code
• Low-latency code runs on the CPU
  – Result is immediately available
• High-latency, high-throughput code runs on the GPU
  – Result is returned over the bus
  – The GPU has many more cores than the CPU
CPU/GPU Code
A CUDA program is split into GPU routines and CPU routines:
• NVCC compiles the GPU routines into a GPU object
• GCC compiles the CPU routines into a CPU object
• The linker combines both objects into a single CUDA binary
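This build flow can be sketched with a minimal program (the file name `app.cu` and the kernel are illustrative, not from the slides): NVCC compiles the `__global__` routine for the GPU, hands the host code to the system compiler, and the linker produces one binary.

```cuda
// app.cu — build with: nvcc app.cu -o app
#include <cstdio>

__global__ void gpuRoutine() {   // device code: compiled by NVCC
    printf("hello from the GPU\n");
}

int main() {                     // host code: compiled by GCC (or the system compiler)
    gpuRoutine<<<1, 1>>>();      // launch one thread block containing one thread
    cudaDeviceSynchronize();     // wait for the GPU to finish
    return 0;
}
```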
Execution Model (Overview)
Control alternates between the CPU and the GPU: the CPU makes RPC-style calls to the GPU, the GPU returns results (including intermediate results that may trigger further calls), and execution passes back and forth.
Execution Model (GPU)
Hierarchy: threads are grouped into thread blocks; each thread block runs on a streaming multiprocessor; the thread blocks together form a thread grid, which runs across the graphics card.
Execution Model (GPU)
• Each procedure runs as a “kernel”
• An instance of a kernel runs on a thread block
  – A thread block executes on a single streaming multiprocessor
• All instances of a particular kernel form a thread grid
  – A thread grid executes on a single graphics card, across several streaming multiprocessors
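The hierarchy above can be sketched in code (the kernel name and sizes are illustrative):

```cuda
__global__ void scale(float *data, float factor, int n) {
    // Each thread finds its position in the overall thread grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

void launchScale(float *d_data, int n) {
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    // The grid of `numBlocks` thread blocks is one instance of the kernel;
    // each block is scheduled onto a single streaming multiprocessor.
    scale<<<numBlocks, threadsPerBlock>>>(d_data, 2.0f, n);
}
```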
Thread Cooperation
• Multiple levels of sharing
• Thread blocks are similar to MPI groups
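One level of sharing is block-local shared memory; a hedged sketch (a per-block sum, not from the slides) of the threads in one block cooperating through `__shared__` storage and the `__syncthreads()` barrier:

```cuda
__global__ void blockSum(const float *in, float *out) {
    __shared__ float buf[256];                 // visible only within this thread block
    int t = threadIdx.x;
    buf[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                           // barrier for all threads in the block

    // Tree reduction within the block (assumes blockDim.x == 256).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride) buf[t] += buf[t + stride];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = buf[0];      // one partial sum per block
}
```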
GPU Execution of Kernels
• In Kepler, threads can spawn new thread blocks/grids
• Less time spent on the CPU
• More natural recursion
• Completion depends on child grids
CUDA Languages
• CUDA C/C++ and CUDA Fortran
• Scientific computing
• Highly parallel applications
• NVIDIA-specific (unlike OpenCL)
• Specialized for specific tasks
  – Highly optimized single-precision floating point
  – Specialized data-sharing instructions within thread blocks
Hyper-Q
Without Hyper-Q:
• Only one work queue is available, so the GPU can receive work from only one queue at a time.
• It is difficult for a single CPU core to keep the GPU busy.
With Hyper-Q:
• Allows connections from multiple CUDA streams, Message Passing Interface (MPI) processes, or multiple threads of the same process.
• 32 concurrent work queues: the GPU can receive work from up to 32 processes at the same time.
• Up to a 3× performance increase over Fermi.
• Removes the problem of false dependencies between streams that arise when all work is multiplexed into a single queue.
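In code, the independent queues are fed through CUDA streams; a hedged sketch (kernel, sizes, and stream count are illustrative) in which, on Kepler, each stream can be serviced by its own hardware queue instead of serializing:

```cuda
__global__ void work(float *d, float f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

void launchConcurrent(float *d_data, int chunk) {
    const int NSTREAMS = 8;
    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s)
        cudaStreamCreate(&streams[s]);

    // Kernels in different streams have no ordering among themselves;
    // with Hyper-Q each stream can map to a separate hardware work queue.
    for (int s = 0; s < NSTREAMS; ++s)
        work<<<64, 256, 0, streams[s]>>>(d_data + s * chunk, 2.0f, chunk);

    for (int s = 0; s < NSTREAMS; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```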
Dynamic Parallelism
Without Dynamic Parallelism:
• Data travels back and forth between the CPU and GPU many times.
• This is because the GPU cannot create more work for itself based on the data.
With Dynamic Parallelism:
• The GPU can generate work for itself based on intermediate results, without CPU involvement.
• Permits dynamic run-time decisions.
• Leaves the CPU free to do other work, conserving power.
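A hedged sketch of a device-side launch (names are illustrative; requires compute capability 3.5+ and building with `nvcc -arch=sm_35 -rdc=true`): the parent kernel inspects an intermediate result and launches a child grid itself, with no CPU round trip.

```cuda
__global__ void refine(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 0.5f;               // stand-in for the extra work
}

__global__ void parent(float *data, float *error, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Dynamic run-time decision made on the GPU itself:
        // if the intermediate result is not good enough, spawn a child grid.
        if (*error > 1e-3f)
            refine<<<(n + 255) / 256, 256>>>(data, n);
        // The parent grid does not complete until its child grids finish.
    }
}
```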
• Application Example: Adaptive Grid Simulation
• Application Example: Quicksort
  – Streams spawning streams: the CPU launches quicksort once, and the GPU spawns the sub-sorts itself.
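The pattern can be sketched as below (a simplified, single-thread-per-partition outline in the spirit of NVIDIA's dynamic-parallelism quicksort sample, not a tuned implementation):

```cuda
__global__ void quicksort(int *a, int lo, int hi, int depth) {
    if (lo >= hi) return;

    // Single-thread Lomuto partition — simplified for illustration.
    int pivot = a[hi], p = lo;
    for (int i = lo; i < hi; ++i)
        if (a[i] < pivot) { int t = a[i]; a[i] = a[p]; a[p] = t; ++p; }
    int t = a[p]; a[p] = a[hi]; a[hi] = t;

    if (depth < 16) {                  // cap device-side recursion depth;
                                       // a real version would fall back to
                                       // an in-place sort past this depth
        // Each half is sorted by a child grid in its own stream, so the
        // two sub-sorts may run concurrently: streams spawning streams.
        cudaStream_t s1, s2;
        cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
        cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);
        quicksort<<<1, 1, 0, s1>>>(a, lo, p - 1, depth + 1);
        quicksort<<<1, 1, 0, s2>>>(a, p + 1, hi, depth + 1);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }
}
// Host side launches the sort exactly once:
//   quicksort<<<1, 1>>>(d_array, 0, n - 1, 0);
```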
CPU-GPU Stack Exchange
• Runs on the CPU
• Loops based on intermediate results
• Checks whether the GPU has returned any more intermediate results
• The CPU spawns a stream to be computed on the GPU
Memory Organization
Core Stream
Stream Processor
Kepler Architecture
Scheduling
Warp Scheduler
Thread Block-Level / Grid Scheduling
References
• NVIDIA Whitepapers
  – http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf
  – http://developer.download.nvidia.com/assets/cuda/files/CUDADownloads/TechBrief_Dynamic_Parallelism_in_CUDA.pdf
• NVIDIA Keynote Presentation
  – http://www.youtube.com/watch?v=TxtZwW2Lf-w
• Georgia Tech Presentation
  – http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf
• http://www.anandtech.com/show/6446/nvidia-launches-tesla-k20-k20x-gk110-arrives-at-last/4
• http://gpuscience.com/code-examples/tesla-k20-gpu-quicksort-with-dynamic-parallelism