Accelerate your Application with Kepler
Peter Messmer
© NVIDIA Corporation 2012
Goals
How to analyze/optimize an existing application for GPUs
What to consider when designing new GPU applications
How to optimize performance with Kepler/CUDA 5
GPU Acceleration
Here: Focus on Programming Languages
[Figure: three approaches to accelerating applications: Libraries, OpenACC Directives, Programming Languages]
APOD – A Systematic Path to Performance
Assess → Parallelize → Optimize → Deploy
Starting point: Matrix transpose
for(int j=0; j < N; j++)
for(int i=0; i < N; i++)
out[j][i] = in[i][j];
Matrix transpose on CPU
void transpose(float in[], float out[])
{
for(int j=0; j < N; j++)
for(int i=0; i < N; i++)
out[i*N+j] = in[j*N+i];
}
float in[N*N], out[N*N];
transpose(in, out);
An initial CUDA Version
__global__ void transpose(float in[], float out[])
{
for(int j=0; j < N; j++)
for(int i=0; i < N; i++)
out[i*N+j] = in[j*N+i];
}
float *in, *out;  // device pointers, allocated with cudaMalloc
cudaMemcpy(in, in_host, N*N*sizeof(float), cudaMemcpyHostToDevice);
transpose<<<1,1>>>(in, out);
An initial CUDA Version
+ Quickly implemented - Performance weak
__global__ void transpose(float in[], float out[])
{
for(int j=0; j < N; j++)
for(int i=0; i < N; i++)
out[i*N+j] = in[j*N+i];
}
float *in, *out;
…
transpose<<<1,1>>>(in, out);
An initial CUDA Version
+ Quickly implemented - Performance weak
=> Express Parallelism!
Recap: Kernel Execution Model
Thread: sequential execution unit
- All threads execute the same sequential program
- Threads execute in parallel
Thread block: a group of threads
- Executes on a single Streaming Multiprocessor (SM)
- Threads within a block can cooperate: light-weight synchronization, data exchange
Grid: a group of thread blocks
- Thread blocks of a grid execute on multiple SMs
- Communication between blocks is expensive
First Parallelization: Inner Loop
Process source rows independently
[Figure: thread tid reads column tid of in and writes row tid of out]
__global__ void transpose(float in[], float out[])
{
int tid = threadIdx.x;
for(int j=0; j < N; j++)
out[tid*N+j] = in[j*N+tid];
}
float *in, *out;
…
transpose<<<1,N>>>(in, out);
Second Parallelization: One Block per Row
__global__ void transpose(float in[], float out[])
{
int tid = threadIdx.x;
int bid = blockIdx.x;
out[tid*N+bid] = in[bid*N+tid];
}
float *in, *out;
…
transpose<<<N,N>>>(in, out);
[Figure: block bid handles row bid of in; within each block, thread tid writes element (tid, bid) of out]
NVVP – NVIDIA Visual Profiler
- Application analysis
- Kernel properties
Application Assessment with NVVP
Source-Level Hot-spot Analysis in NVVP
What does Uncoalesced Store mean?
Global memory access happens in transactions of 32 bytes
Coalesced access: a group of 32 threads (a “warp”) accessing adjacent bytes
Uncoalesced access: a group of 32 threads accessing scattered bytes, resulting in up to 32 transactions
Memory Access Patterns
Array access:
- ~OK: x[i] = a[i+1] - a[i]
- Bad: x[i] = a[64*i] - a[i]
SoA vs AoS:
- OK: point.x[i]
- Bad: point[i].x
Random access:
- Bad: a[rand_fun(i)]
How can we improve the write?
Coalesced read
Scattered write (stride N)
Process matrix tile, not single row/column
Transpose matrix tile within block
=> Need threads in a block to cooperate
=> Use shared memory
Shared memory
- Accessible by all threads in a block
- Fast compared to global memory: low access latency, high bandwidth (almost like registers)
- Common uses: software-managed cache, data layout conversion
[Figure: each SM has its own registers and shared memory (SMEM); all SMs share global memory (DRAM)]
Transpose with coalesced read/write
__global__ void transpose(float in[], float out[])
{
__shared__ float tile[TILE][TILE];
int x = blockIdx.x*TILE + threadIdx.x;
int y = blockIdx.y*TILE + threadIdx.y;
int glob_in = x + y*N;
x = blockIdx.y*TILE + threadIdx.x;  // output tile: block indices swapped
y = blockIdx.x*TILE + threadIdx.y;
int glob_out = x + y*N;
tile[threadIdx.y][threadIdx.x] = in[glob_in];
__syncthreads();
out[glob_out] = tile[threadIdx.x][threadIdx.y];
}
dim3 grid(N/TILE, N/TILE, 1);
dim3 threads(TILE, TILE, 1);
transpose<<<grid, threads>>>(in, out);
What happens at the barrier?
- Synchronization in kernel:
tile[y][x] = in[in_index]
__syncthreads()
out[out_index] = tile[x][y]
- Keep the number of threads blocked at the barrier to a minimum
[Figure: a thread block staging a matrix tile through shared memory]
Thread Serialization
- Use more thread blocks, but:
- # blocks per SM limited by # threads/block
Solution: Reduce number of threads per block
Impact of Reduced Serialization
Shared Memory Organization
Organized in 32 independent banks
Optimal access: disjoint banks or multicast
Multiple accesses to the same bank: serialization
Solution for transpose: padding
tile[16][16] => tile[16][17]
Final Solution
APOD Cycle Summary
Assessment: algorithm highly parallel
Parallelization: 1 thread per column – 12 GB/s
Parallelization: 1 thread per element – 99 GB/s
Optimization: memory access coalescing – 93 GB/s
Optimization: latency hiding – 124 GB/s
Optimization: bank conflict resolution – 170 GB/s
=> Ready for deployment
APOD applied at each P/O step
Additional Metrics
Control Flow
if ( ... )
{
// then-clause
}
else
{
// else-clause
}
Execution within warps is coherent
[Figure: instructions over time for two warps (threads 0–31 and 32–63); a warp is a “vector” of threads issuing the same instruction together]
Execution diverges within a warp
[Figure: instructions over time when threads 0–31 of one warp take different branches; the two paths are serialized]
Solution: Group threads with similar control flow
Occupancy
Need independent threads per SM to hide latencies:
- Memory access
- Instruction
Hardware resources determine the maximum number of threads/thread blocks per SM
Consumed resources determine the actual number
Occupancy = N_actual / N_max
Occupancy
Limiting resources:
- Number of threads
- Number of registers per thread
- Number of blocks
- Amount of shared memory per block
No need for 100% occupancy
- Depends on kernel
Occupancy Calculator
Analyze the effect of resource consumption on occupancy
Alternatives to NVVP: nvprof
- Command-line profiler
- Access to hardware counters
- List of supported counters: --query-events
%nvprof --print-gpu-trace ./transpose
Profiling result:
Start Duration Grid Size Block Size Regs* Size Throughput Name
577.11ms 874.57us - - - 4.19MB 4.80GB/s [CUDA memcpy HtoD]
598.45ms 1.67ms (1 1 1) (1024 1 1) 22 - - transposeNaive(float*,
600.12ms 1.67ms (1 1 1) (1024 1 1) 22 - - transposeNaive(float*,
601.79ms 1.67ms (1 1 1) (1024 1 1) 22 - - transposeNaive(float*,
nvprof --print-gpu-trace --aggregate-mode-off --events sm_cta_launched ./transpose
Profiling result:
Device Event Name, Kernel, Values
0 sm_cta_launched, transposeNaive(float*, ..), 76 73 72 72 73 74 75 73 73 72 73 73 72 73
Alternatives to NVVP: Instrumentation
cudaEventRecord(start, 0);
transpose<<<grid, threads>>>(..);
cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
[Figure: timeline showing the start event, the transpose kernel, and the stop event; cudaEventSynchronize blocks the host until the stop event completes]
Characteristics of an Ideal GPU Candidate
Sufficient parallelism
- K20X: up to 28’672 threads in flight
- Rather >10’000-way than ~10-way
Memory access patterns
- Ideally close to stride-1 access possible
Control-flow patterns
- Low divergence, at least for groups of threads
Streams
Kernel launches within the same stream are in-order
[Figure: the host app submits grids to Stream 0; the Grid Management Unit dispatches them to the SMs, which share global memory (DRAM)]
Kernel launches
- within the same stream are in-order
- in different streams can be concurrent
All kernel launches are asynchronous to the host
[Figure: the host app submits Grids 1–3 to Stream 1 and Grids 5–7 to Stream 2; the Grid Management Unit runs grids from different streams concurrently on the SMs]
Asynchronous Data Transfer / Pipelining
[Figure: CPU and GPU timelines; run serially, the host-to-device copy and the kernel execute back-to-back, but split into chunks on separate streams the copies overlap with compute and total time shrinks]
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// async copies overlap only with pinned (page-locked) host memory
cudaMemcpyAsync(dst, src, size, dir, stream1);
kernel<<<grid, block, 0, stream2>>>(…);
Hyper-Q Enables Efficient Scheduling
Grid management unit can select most appropriate grid from 32 streams
Improves scheduling of concurrently executed grids
Particularly interesting for MPI applications
Strong Scaling of MPI Application
GPU parallelizable part / CPU parallel part / Serial part
[Figure: multicore CPU only; as the rank count grows (N=1, 2, 4, 8) the parallel parts shrink but the serial part remains]
GPU Accelerated MPI Application
[Figure: at N=1, offloading the GPU-parallelizable part makes the GPU-accelerated run much shorter than the multicore-CPU-only run]
GPU Accelerated Strong Scaling
[Figure: as ranks scale (N=1, 2, 4, 8), each rank’s GPU share shrinks and the GPU idles; with Hyper-Q/Proxy (available in K20), concurrent work from multiple ranks keeps the GPU busy]
Example: Hyper-Q/Proxy for CP2K
How to use Hyper-Q
- No application modifications necessary
- Proxy process between user processes and GPU
- nvidia-proxy-server-control -d
Don’t Forget Large-Scale Behavior
Profile in a realistic environment; get the profile at scale
- Tau, Scalasca, VampirTrace+Vampir, CrayPat, ...
Fix messaging problems first! GPUs accelerate your compute and amplify messaging problems
- Fixing them will also help CPU-only code
[Figure: per-rank timeline at Nrank = 384, showing compute time vs. wasted wait time]
CUDA Dynamic Parallelism
[Figure: as a co-processor, the GPU relies on the CPU to launch every kernel; with dynamic parallelism, the GPU generates its own work autonomously]
Dynamic Work Generation
[Figure: initial grid]
Statically assigning a conservative worst-case grid gives a fixed grid
Dynamically assigning performance where accuracy is required gives a dynamic grid
CUDA Dynamic Parallelism
Kernels launch grids with the same syntax as the host
Device-side CUDA runtime functions are in the cudadevrt library
__global__ void childKernel()
{
printf("Hello %d", threadIdx.x);
}
__global__ void parentKernel()
{
childKernel<<<1,10>>>();
cudaDeviceSynchronize();
printf("World!\n");
}
int main(int argc, char *argv[])
{
parentKernel<<<1,1>>>();
cudaDeviceSynchronize();
return 0;
}
Characteristics of an Ideal GPU Candidate
Sufficient parallelism
- K20X: up to 28’672 threads in flight
- Rather >10’000-way than ~10-way
- Concurrent grids
Memory access patterns
- Ideally close to stride-1 access possible
Control-flow patterns
- Low divergence, at least for groups of threads
Before CUDA 5: Whole-Program Compilation
Earlier CUDA releases required a single source file for each kernel; linking with external device code was not supported
[Figure: a.cu, b.cu, c.cu and main.cpp are all #included together to build program.exe]
CUDA 5: Separate Compilation & Linking
Separate compilation allows building independent object files (a.cu, b.cu, c.cu compile to a.o, b.o, c.o)
CUDA 5 can link multiple object files into one program (program.exe, together with main.cpp)
Benefits of Separate Compilation & Linking
Easier to reuse your existing code
- No need to include all files together any more
- “extern” attribute is respected
Incremental compilation reduces build time
- e.g. 47,000 line single-file: 50s down to 4s
Use 3rd-party GPU-callable libraries or create your own
- GPU-callable BLAS library (libcublas_device.a) included in CUDA Toolkit 5.0; uses Dynamic Parallelism
CUDA 5: GPU Callable Libraries
Can combine object files into static libraries; link and externally call device code
[Figure: a.o and b.o combine into ab.a, which links with foo.cu and main.cpp into program.exe]
CUDA 5: GPU Callable Libraries
Combine object files into static libraries
Facilitates code reuse, reduces compile time
[Figure: the same ab.a links into program.exe (foo.cu + main.cpp) and program2.exe (bar.cu + main2.cpp)]
CUDA 5: Callbacks
Enables closed-source device libraries to call user-defined device callback functions
[Figure: vendor.a links with callback.cu, foo.cu and main.cpp into program.exe]
Device Linker Invocation
Introduction of an optional link step for device code
nvcc -arch=sm_20 -dc a.cu b.cu
nvcc -arch=sm_20 -dlink a.o b.o -o link.o
g++ a.o b.o link.o -L<path> -lcudart
Link the device-runtime library for dynamic parallelism
Currently, linking occurs at the cubin level (PTX not yet supported)
nvcc -arch=sm_35 -dc a.cu b.cu
nvcc -arch=sm_35 -dlink a.o b.o -lcudadevrt -o link.o
g++ a.o b.o link.o -L<path> -lcudadevrt -lcudart
GPUDirect enables GPU-aware MPI
GPU-to-GPU transfer across the NIC without CPU participation
Unified Virtual Addressing allows the MPI library to detect whether a buffer pointer refers to host or device memory
GPUDirect enables GPU-aware MPI
cudaMemcpy(s_buf_h, s_buf_d, size, cudaMemcpyDeviceToHost);
MPI_Send(s_buf_h, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
MPI_Recv(r_buf_h, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
cudaMemcpy(r_buf_d, r_buf_h, size, cudaMemcpyHostToDevice);
Simplifies to
MPI_Send(s_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
MPI_Recv(r_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
(for CPU and GPU buffers)
GPU Management: nvidia-smi
Multi-GPU systems are widely available; different systems are set up differently
Want quick information on:
- Approximate GPU utilization
- Approximate memory footprint
- Number of GPUs
- ECC state
- Driver version
Inspect and modify GPU state
Thu Nov 1 09:10:29 2012
+------------------------------------------------------+
| NVIDIA-SMI 4.304.51 Driver Version: 304.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name | Bus-Id Disp. | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20X | 0000:03:00.0 Off | Off |
| N/A 30C P8 28W / 235W | 0% 12MB / 6143MB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K20X | 0000:85:00.0 Off | Off |
| N/A 28C P8 26W / 235W | 0% 12MB / 6143MB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| No running compute processes found |
+-----------------------------------------------------------------------------+
Where to find additional information
CUDA documentation [1]
- Best Practice Guide [2]
- Kepler Tuning Guide [3]
Kepler whitepaper [4]
[1] http://docs.nvidia.com
[2] http://docs.nvidia.com/cuda/cuda-c-best-practices-guide
[3] http://docs.nvidia.com/cuda/kepler-tuning-guide
[4] http://www.nvidia.com/object/nvidia-kepler.html
Where to find additional information: GTC
Kepler architecture:
GTC12 Session S0642: Inside Kepler
Assessing performance limiters:
GTC10 Session 2012: Analysis-driven Optimization (slides 5-19):
http://www.nvidia.com/content/GTC-2010/pdfs/2012_GTC2010v2.pdf
Profiling tools:
GTC12 sessions:
S0419: Optimizing Application Performance with CUDA Performance Tools
S0420: Nsight IDE for Linux and Mac
...
CUPTI documentation (describes all the profiler counters)
Included in every CUDA toolkit (/cuda/extras/cupti/doc/Cupti_Users_Guide.pdf)
GPU computing webinars in general:
http://developer.nvidia.com/gpu-computing-webinars
http://www.gputechconf.com/gtcnew/on-demand-gtc.php
Kepler and CUDA5: Powerful yet Easy
Kepler and CUDA5 simplify GPU acceleration
Bypass optimization trial/error with APOD
Profile and analyze efficiently with NVVP
Improve MPI scalability with Hyper-Q/Proxy
Parallelize with CUDA Dynamic Parallelism
Thank you!
Backup Slides