IAP09 CUDA@MIT/6.963
David Luebke NVIDIA Research
The “New” Moore’s Law
Computers no longer get faster, just wider
You must re-think your algorithms to be parallel!
Data-parallel computing is most scalable solution
Enter the GPU
Massive economies of scale
Massively parallel
Enter CUDA
Scalable parallel programming model
Minimal extensions to familiar C/C++ environment
Heterogeneous serial-parallel computing
Sound Bite
GPUs + CUDA =
The Democratization of Parallel Computing
Massively parallel computing has become a commodity technology
MOTIVATION
[Figure: reported application speedups across domains, ranging from 17X to 149X.]
CUDA: ‘C’ FOR PARALLELISM
Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel C code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
Hierarchy of concurrent threads
Parallel kernels are composed of many threads:
  all threads execute the same sequential program
Threads are grouped into thread blocks:
  threads in the same block can cooperate
Threads and blocks have unique IDs

[Figure: a kernel foo() is a set of blocks; each block b contains threads t0, t1, ..., tB.]
Hierarchical organization
Each level of the thread hierarchy has a corresponding memory space:
  Thread: per-thread local memory
  Block: per-block shared memory
  Device: per-device global memory

[Figure: threads within a block synchronize at local barriers; a global barrier separates dependent kernels (Kernel 0, Kernel 1) sharing global memory.]
Heterogeneous Programming
CUDA = serial program with parallel kernels, all in C:
  serial C code executes in a CPU thread
  parallel kernel C code executes in thread blocks across multiple processing elements

[Figure: execution alternates between serial code on the CPU and parallel kernels on the GPU, e.g. foo<<<nBlk, nTid>>>(args); ... serial code ... bar<<<nBlk, nTid>>>(args);]
Thread = virtualized scalar processor
An independent thread of execution has its own PC, variables (registers), processor state, etc.
This implies nothing about how threads are scheduled:
  CUDA threads might be physical threads, as on NVIDIA GPUs
  CUDA threads might be virtual threads: one might pick 1 block = 1 physical thread on a multicore CPU
Block = virtualized multiprocessor
Provides programmer flexibility:
  freely choose processors to fit the data
  freely customize for each kernel launch
Thread block = a (data-)parallel task:
  all blocks in a kernel have the same entry point
  but may execute any code they want
Thread blocks of a kernel must be independent tasks:
  the program must be valid for any interleaving of block executions
Blocks must be independent
Any possible interleaving of blocks should be valid:
  presumed to run to completion without pre-emption
  can run in any order
  can run concurrently OR sequentially
Blocks may coordinate but not synchronize:
  shared queue pointer: OK
  shared lock: BAD ... can easily deadlock
The independence requirement gives scalability.
Scalable Execution Model
A kernel launched by the host is distributed across the machine: blocks run on multiprocessors.

[Figure: an array of multiprocessors, each with scalar processors (SP), a multithreaded instruction unit (MT IU), and shared memory, all connected to device memory.]
Synchronization & Cooperation
Threads within a block may synchronize with barriers:

    ... Step 1 ...
    __syncthreads();
    ... Step 2 ...

Blocks coordinate via atomic memory operations:
  e.g., increment a shared queue pointer with atomicInc()

There is an implicit barrier between dependent kernels:

    vec_minus<<<nblocks, blksize>>>(a, b, c);
    vec_dot<<<nblocks, blksize>>>(c, c);
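To make the "shared queue pointer" idea concrete, here is a minimal sketch (mine, not from the original slides) of blocks claiming work items with atomicInc(); the names head and drain_queue, and the doubling "work", are illustrative assumptions:

    // Hypothetical sketch: threads across all blocks claim work items by
    // atomically bumping a global queue pointer. No inter-block locking.
    __device__ unsigned int head = 0;   // host zeroes this before launch

    __global__ void drain_queue(const int *items, unsigned int n_items, int *out)
    {
        while (true)
        {
            // atomicInc returns the old value; with bound 0xFFFFFFFF it
            // behaves like a plain atomic increment
            unsigned int my = atomicInc(&head, 0xFFFFFFFF);
            if (my >= n_items) break;   // queue exhausted
            out[my] = 2 * items[my];    // stand-in for real per-item work
        }
    }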
Using per-block shared memory
Variables shared across a block:

    __shared__ int *begin, *end;

Scratchpad memory:

    __shared__ int scratch[blocksize];
    scratch[threadIdx.x] = begin[threadIdx.x];
    // ... compute on scratch values ...
    begin[threadIdx.x] = scratch[threadIdx.x];

Communicating values between threads:

    scratch[threadIdx.x] = begin[threadIdx.x];
    __syncthreads();
    int left = scratch[threadIdx.x - 1];   // assumes threadIdx.x > 0
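Putting those fragments together, a minimal sketch of a complete kernel (mine; the name adjacent_diff and the 256-thread block size are assumptions) that stages data in shared memory and reads a neighbor's value after the barrier:

    // Hypothetical sketch: each thread emits the difference between its
    // element and its left neighbor, staged through per-block shared memory.
    __global__ void adjacent_diff(const int *in, int *out, int n)
    {
        __shared__ int scratch[256];            // blockDim.x assumed to be 256
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) scratch[threadIdx.x] = in[i];
        __syncthreads();                        // all loads visible block-wide

        if (i < n)
            out[i] = (threadIdx.x > 0)
                   ? scratch[threadIdx.x] - scratch[threadIdx.x - 1]
                   : scratch[0];                // no staged left neighbor at block edge
    }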
Example: Parallel Reduction
Summing up a sequence with 1 thread:

    int sum = 0;
    for (int i = 0; i < N; ++i)
        sum += x[i];

Parallel reduction builds a summation tree:
  each thread holds 1 element
  stepwise partial sums
  N threads need log N steps
  one possible approach: butterfly pattern
Parallel Reduction for 1 Block
    // INPUT: Thread i holds value x_i
    int i = threadIdx.x;
    __shared__ int sum[blocksize];

    // One thread per element
    sum[i] = x_i;
    __syncthreads();

    for (int bit = blocksize/2; bit > 0; bit /= 2)
    {
        int t = sum[i] + sum[i^bit];
        __syncthreads();
        sum[i] = t;
        __syncthreads();
    }
    // OUTPUT: Every thread now holds the total in sum[i]
Summing Up CUDA
CUDA = C + a few simple extensions:
  makes it easy to start writing basic parallel programs
Three key abstractions:
  1. hierarchy of parallel threads
  2. corresponding levels of synchronization
  3. corresponding memory spaces
Supports massive parallelism of manycore GPUs
More efficient reduction tree
    template<class T>
    __device__ T reduce(T *x)
    {
        unsigned int i = threadIdx.x;
        unsigned int n = blockDim.x;

        for (unsigned int offset = n/2; offset > 0; offset /= 2)
        {
            if (i < offset)           // the original read "tid"; the index is i
                x[i] += x[i + offset];
            __syncthreads();
        }
        return x[i];   // Note that only thread 0 holds the full sum
    }
Reduction tree execution example
Input (shared memory):
    10  1  8 -1  0 -2  3  5  -2 -3  2  7  0 11  0  2

x[i] += x[i+8];   active threads 0-7:
     8 -2 10  6  0  9  3  7  -2 -3  2  7  0 11  0  2
x[i] += x[i+4];   active threads 0-3:
     8  7 13 13  0  9  3  7  -2 -3  2  7  0 11  0  2
x[i] += x[i+2];   active threads 0-1:
    21 20 13 13  0  9  3  7  -2 -3  2  7  0 11  0  2
x[i] += x[i+1];   active thread 0:
    41 20 13 13  0  9  3  7  -2 -3  2  7  0 11  0  2

Final result: 41, in x[0].
Pattern 1: Fit kernel to the data
The code lets a B-thread block reduce a B-element array.
For larger sequences:
  launch N/B blocks to reduce each B-element subsequence
  write the N/B partial sums to a temporary array
  repeat until done
We'll be done in log_B(N) steps; a host-side driver is sketched below.
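A minimal host-side driver for Pattern 1 (a sketch under assumptions: reduce_kernel is a wrapper, not shown in the slides, in which each B-thread block reduces B elements of in and writes one partial sum to out):

    // Hypothetical driver: repeatedly reduce B-element subsequences until
    // a single value remains; takes log_B(N) kernel launches.
    const int B = 256;
    int   len = n;
    float *in = d_data, *out = d_partial;   // device buffers, assumed allocated
    while (len > 1)
    {
        int nblocks = (len + B - 1) / B;
        reduce_kernel<<<nblocks, B>>>(in, out, len);
        float *tmp = in; in = out; out = tmp;   // ping-pong the buffers
        len = nblocks;
    }
    // the total is now in in[0]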
Pattern 2: Fit kernel to the machine
For a block size of B:
  1) Launch B blocks to sum N/B elements each
  2) Launch 1 block to combine the B partial sums
We'll be done in 2 steps.
BUT ... we need to make sure B blocks fill the machine.
[Figure: Stage 1: many blocks; each block sums one subsequence of the input 3 1 7 0 | 4 1 6 3 | 4 7 5 9, producing partial sums 11, 14, 25. Stage 2: 1 block reduces the partial sums.]
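A sketch of Pattern 2 (mine, under assumptions: 256-thread blocks and the kernel name reduce_machine): each block strides through its share of the input before the in-block tree, so the same kernel serves both stages:

    // Hypothetical Pattern 2 kernel: grid-stride accumulation, then an
    // in-block tree; each block emits one partial sum.
    __global__ void reduce_machine(const float *x, float *partial, int n)
    {
        float my = 0.0f;
        for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < n;
             i += gridDim.x*blockDim.x)
            my += x[i];                          // each block covers ~n/gridDim.x elements

        __shared__ float sum[256];               // blockDim.x assumed to be 256
        sum[threadIdx.x] = my;
        __syncthreads();
        for (int offset = blockDim.x/2; offset > 0; offset /= 2)
        {
            if (threadIdx.x < offset) sum[threadIdx.x] += sum[threadIdx.x + offset];
            __syncthreads();
        }
        if (threadIdx.x == 0) partial[blockIdx.x] = sum[0];
    }

    // Stage 1: B blocks produce B partial sums; Stage 2: 1 block combines them.
    reduce_machine<<<B, 256>>>(d_x, d_partial, n);
    reduce_machine<<<1, 256>>>(d_partial, d_result, B);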
Pattern 3: Atomic accumulation
Create a global accumulator:

    __device__ int total;   // the original slide omitted the type

At the end of each block, call atomicAdd() to accumulate the block's partial sum into total.
We'll be done in 1 step.
BUT ... we've created a global serialization point between blocks:
  contention may be high, especially for small block sizes
  atomics support only limited types / operators
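A sketch of Pattern 3 (mine; the kernel name and int data are assumptions): each block reduces its chunk in shared memory, then thread 0 folds the block's sum into the global total:

    // Hypothetical sketch: in-block tree plus one atomicAdd per block.
    __device__ int total;   // host must zero this before launch,
                            // e.g. via cudaMemcpyToSymbol

    __global__ void reduce_atomic(const int *x, int n)
    {
        __shared__ int sum[256];                 // blockDim.x assumed to be 256
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        sum[threadIdx.x] = (i < n) ? x[i] : 0;
        __syncthreads();
        for (int offset = blockDim.x/2; offset > 0; offset /= 2)
        {
            if (threadIdx.x < offset) sum[threadIdx.x] += sum[threadIdx.x + offset];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            atomicAdd(&total, sum[0]);           // the global serialization point
    }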
THINK DATA-PARALLEL: SCAN
Common Situations in Parallel Computation
Many parallel threads need to generate a single result value → Reduce
Many parallel threads need to partition data → Split
Many parallel threads produce variable output per thread → Compact / Expand / Allocate
Split Operation

Given an array of true and false elements (and payloads)
Return an array with all true elements at the beginning
Examples: sorting, building trees

    Flag:     F T F F T F F T   →   F F F F F T T T
    Payload:  3 6 1 4 0 7 1 3   →   3 1 4 7 1 6 0 3

(The figure groups the false elements first; the operation is the same up to flag polarity.) A scan-based sketch follows.
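A sketch of split (mine, not from the slides) for a single block, built on the inclusive plus_scan shown later in this deck; it packs false elements first, matching the figure:

    // Hypothetical sketch: split one block's payload by its flags.
    // Assumes one block of blockDim.x <= 256 threads.
    __global__ void split_block(const int *payload, const int *flag, int *out)
    {
        unsigned int i = threadIdx.x;
        unsigned int n = blockDim.x;
        __shared__ unsigned int f[256];
        f[i] = flag[i] ? 0u : 1u;                // count the false elements
        __syncthreads();

        unsigned int c = plus_scan(f);           // inclusive # of falses up to i
        unsigned int total_falses = f[n-1];      // after the scan: all falses

        // falses fill positions 0 .. total_falses-1; trues fill the rest
        unsigned int dest = flag[i] ? total_falses + (i - c) : c - 1;
        out[dest] = payload[i];
    }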
Variable Output Per Thread: Compact

Remove null elements:

    3 0 7 0 4 1 0 3   →   3 7 4 1 3

A scan-based sketch appears below.
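Compact follows the same recipe; a sketch under the same assumptions (one block, plus_scan as defined later in the deck):

    // Hypothetical sketch: compact the non-null elements of one block.
    __global__ void compact_block(const int *in, int *out)
    {
        unsigned int i = threadIdx.x;
        __shared__ unsigned int keep[256];       // blockDim.x assumed <= 256
        keep[i] = (in[i] != 0) ? 1u : 0u;        // predicate: element is non-null
        __syncthreads();

        unsigned int c = plus_scan(keep);        // inclusive count of kept elements
        if (in[i] != 0)
            out[c - 1] = in[i];                  // c-1 is the exclusive scan value
    }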
Variable Output Per Thread: Allocate

Allocate variable storage per thread
Examples: marching cubes, geometry generation

[Figure: five threads produce 2, 1, 0, 3, and 2 output items respectively; the items A B, C, -, D E F, G H are packed contiguously into the output.]
“Where do I write my output?”
In all of these situations, each thread must answer that simple question
The answer is:
“That depends (on how much the other threads need to write)!”
“Scan” is an efficient way to answer this question in parallel
Applications of Scan
Scan is a simple and useful parallel building block for many parallel algorithms:
  radix sort
  quicksort (segmented scan)
  string comparison
  lexical analysis
  stream compaction
  run-length encoding
  polynomial evaluation
  solving recurrences
  tree operations
  histograms
  allocation
  etc.
Fascinating, since scan is unnecessary in sequential computing!
Parallel Scan (a.k.a. prefix sum)
Given a sequence A of n elements and an associative operator ⊕ with identity I:

  Inclusive scan:  scan(A, ⊕)    = [ A0, A0 ⊕ A1, ..., A0 ⊕ ··· ⊕ A(n-1) ]
  Exclusive scan:  prescan(A, ⊕) = [ I, A0, ..., A0 ⊕ ··· ⊕ A(n-2) ]

Example:  A = [3 1 7 0 4 1 6 3]
          prescan(A, +) = [0 3 4 11 11 15 16 22]
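For reference, the serial version of prescan is a one-line loop; this sketch (mine) matches the definition above for ⊕ = + and I = 0:

    // Serial exclusive (+)-scan: out[i] = sum of everything before A[i].
    void prescan_serial(const int *A, int *out, int n)
    {
        int running = 0;                 // the identity element I
        for (int i = 0; i < n; ++i)
        {
            out[i]   = running;
            running += A[i];
        }
    }
    // On A = [3 1 7 0 4 1 6 3] this yields [0 3 4 11 11 15 16 22].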
Inclusive Scan across 1 block
    // After Hillis & Steele, Data Parallel Algorithms, CACM 1986.
    template<class T>
    __device__ T plus_scan(T *x)
    {
        unsigned int i = threadIdx.x;
        unsigned int n = blockDim.x;

        for (unsigned int offset = 1; offset < n; offset *= 2)
        {
            T t;
            if (i >= offset) t = x[i-offset];
            __syncthreads();
            if (i >= offset) x[i] = t + x[i];
            __syncthreads();
        }
        return x[i];   // the original omitted this return statement
    }
Combining results across blocks
Much like the reduction case:
  Pattern 1: fit kernel to the data ... O(log_B N) steps
  Pattern 2: fit kernel to the machine ... O(1) steps

Now we need to add the partial sums back into the partial results:
  kernels need to save their partial results
  an additional kernel adds the partial sums back in

Sketch of the algorithm using Pattern 2:
  1) B blocks compute partial results for subsequences
  2) 1 block performs a scan of the partial sums from (1)
  3) N/B blocks add the results from (2) into the results from (1)

A host-side sketch follows.
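A host-side sketch of that three-launch structure (mine; the kernel names scan_blocks, scan_partials, and add_offsets are assumptions):

    // Hypothetical driver for a multi-block scan:
    //   scan_blocks:   each block scans its subsequence, saving its total
    //   scan_partials: one block scans the array of block totals
    //   add_offsets:   each block adds its scanned total into its results
    int nblocks = (n + B - 1) / B;
    scan_blocks<<<nblocks, B>>>(d_data, d_block_sums, n);   // step 1
    scan_partials<<<1, nblocks>>>(d_block_sums);            // step 2
    add_offsets<<<nblocks, B>>>(d_data, d_block_sums, n);   // step 3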
Directions for improvement
Templating for generic operators:
  scan<plus>() is much more powerful than plus_scan()
Pay attention to operator properties:
  associative? commutative? has inverse? has identity?
  the fewer that are true, the more careful the code must be
  e.g., the reduction examples assume commutativity and an identity
Many opportunities for code optimization:
  remove inefficiencies
  (auto-)tuning for different cases
Directions for optimization
Loop unrolling with fixed block sizes:
  almost always a good idea (see the sketch below)
Minimize unutilized threads:
  the reduction example leaves half its threads idle
Improve work efficiency:
  the scan example does O(n log n) work
Watch out for divergence and incoherent loads/stores:
  sometimes at odds with the earlier directions
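A sketch of the unrolling direction (mine): with the block size as a compile-time template parameter, the reduction loop disappears and each outer test below is resolved by the compiler:

    // Hypothetical sketch: fully unrolled in-block reduction. The outer
    // 'if (BLOCK >= ...)' tests are compile-time; no loop overhead remains.
    template<unsigned int BLOCK>
    __device__ int reduce_unrolled(int *x)
    {
        unsigned int i = threadIdx.x;
        if (BLOCK >= 512) { if (i < 256) x[i] += x[i+256]; __syncthreads(); }
        if (BLOCK >= 256) { if (i < 128) x[i] += x[i+128]; __syncthreads(); }
        if (BLOCK >= 128) { if (i <  64) x[i] += x[i+ 64]; __syncthreads(); }
        if (BLOCK >=  64) { if (i <  32) x[i] += x[i+ 32]; __syncthreads(); }
        if (BLOCK >=  32) { if (i <  16) x[i] += x[i+ 16]; __syncthreads(); }
        if (BLOCK >=  16) { if (i <   8) x[i] += x[i+  8]; __syncthreads(); }
        if (BLOCK >=   8) { if (i <   4) x[i] += x[i+  4]; __syncthreads(); }
        if (BLOCK >=   4) { if (i <   2) x[i] += x[i+  2]; __syncthreads(); }
        if (BLOCK >=   2) { if (i <   1) x[i] += x[i+  1]; __syncthreads(); }
        return x[0];
    }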
GPU ARCHITECTURE

GeForce GTX 280 / Tesla T10:
  240 scalar cores
  on-chip memory
  texture units

[Figure: chip-level view showing the communication fabric, memory & I/O, and fixed-function acceleration blocks.]
Key Architectural Ideas

Massive multithreading:
  hide latency with computation, not cache
SIMT execution (Single Instruction, Multiple Thread):
  warps - threads run in groups of 32
  HW handles divergence gracefully

[Figure: one SM (streaming multiprocessor): instruction cache, MT issue unit, constant cache, 8 SPs, 2 SFUs, a DP unit, and shared memory.]
Reduction tree redux
Input (shared memory):
    10  1  8 -1  0 -2  3  5  -2 -3  2  7  0 11  0  2

Sequential addressing, as traced earlier: x[i] += x[i+8], then x[i] += x[i+4], x[i] += x[i+2], x[i] += x[i+1], with active threads 0-7, 0-3, 0-1, then 0; final result 41 in x[0].
Compare to interleaved addressing:

Input (shared memory):
    10  1  8 -1  0 -2  3  5  -2 -3  2  7  0 11  0  2

x[i] += x[i+1];   active threads 0, 2, 4, ..., 14:
    11  1  7 -1 -2 -2  8  5  -5 -3  9  7 11 11  2  2
x[i] += x[i+2];   active threads 0, 4, 8, 12:
    18  1  7 -1  6 -2  8  5   4 -3  9  7 13 11  2  2
x[i] += x[i+4];   active threads 0, 8:
    24  1  7 -1  6 -2  8  5  17 -3  9  7 13 11  2  2
x[i] += x[i+8];   active thread 0:
    41  1  7 -1  6 -2  8  5  17 -3  9  7 13 11  2  2

The interleaved pattern scatters the active threads across the block, producing divergent warps; the sequential pattern keeps them contiguous.
SOME FINAL THOUGHTS
"Faster is not 'just faster'" -- David Kirk, Chief Scientist

2-3x faster is "just faster":
  do a little more, wait a little less
  doesn't change how you work
5-10x faster is "significant":
  worth upgrading
  worth re-writing (parts of) the application
100x (or more!) faster is "fundamentally different":
  worth considering a new platform
  worth re-architecting the application
  makes new applications possible
  drives "time to discovery", changes science
SOME FINAL THOUGHTS
We should teach parallel computing in CS 1 or CS 2.
Remember: computers don't get faster, just wider.

Heapsort and mergesort:
  both O(n lg n)
  one parallel-friendly, one not
  students need to understand this early
Conclusion
GPUs are massively parallel manycore computers:
  Ubiquitous - most successful parallel processor in history
  Useful - users achieve huge speedups on real problems
CUDA is a powerful parallel programming model:
  Heterogeneous - mixed serial-parallel programming
  Scalable - hierarchical thread execution model
  Accessible - minimal but expressive changes to C
Together they provide tremendous scope for innovative research.

Questions?
Questions & Discussion
Lots more slides…
  CUDA programming
  Advanced examples: sparse matrix-vector, ray tracing
  Performance: tools and applications
    BLAS, FFT, sort, sparse matrix-vector
    fluid dynamics, molecular dynamics, n-body, discontinuous Galerkin
  GPU architecture
  Myths of GPU Computing

Otherwise: dluebke@nvidia.com
Example: Sparse Matrix-Vector
Sparse matrix-vector multiplication
Sparse matrices have relatively few non-zero entries:
  frequently O(n) rather than O(n²)
  only store & operate on these non-zero entries

Example: Compressed Sparse Row (CSR) format

    Matrix:   [ 3 0 1 0 ]
              [ 0 0 0 0 ]
              [ 0 2 4 1 ]
              [ 1 0 0 1 ]

    Av[7] = { 3, 1, 2, 4, 1, 1, 1 };   // non-zero values
    Aj[7] = { 0, 2, 1, 2, 3, 0, 3 };   // column indices
    Ap[5] = { 0, 2, 2, 5, 7 };         // row pointers (row 1 is empty)
Sparse matrix-vector multiplication
    float multiply_row(uint rowsize,  // number of non-zeros in this row
                       uint *Aj,      // column indices for the row
                       float *Av,     // non-zero entries for the row
                       float *x)      // the RHS vector
    {
        float sum = 0;
        for (uint column = 0; column < rowsize; ++column)
            sum += Av[column] * x[Aj[column]];
        return sum;
    }
Sparse matrix-vector multiplication
    float multiply_row(uint size, uint *Aj, float *Av, float *x);

    void csrmul_serial(uint *Ap, uint *Aj, float *Av,
                       uint num_rows, float *x, float *y)
    {
        for (uint row = 0; row < num_rows; ++row)
        {
            uint row_begin = Ap[row];
            uint row_end   = Ap[row+1];
            y[row] = multiply_row(row_end - row_begin,
                                  Aj + row_begin, Av + row_begin, x);
        }
    }
Sparse matrix-vector multiplication
    float multiply_row(uint size, uint *Aj, float *Av, float *x);

    __global__ void csrmul_kernel(uint *Ap, uint *Aj, float *Av,
                                  uint num_rows, float *x, float *y)
    {
        uint row = blockIdx.x*blockDim.x + threadIdx.x;
        if (row < num_rows)
        {
            uint row_begin = Ap[row];
            uint row_end   = Ap[row+1];
            y[row] = multiply_row(row_end - row_begin,
                                  Aj + row_begin, Av + row_begin, x);
        }
    }
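The slides don't show the launch itself; a minimal host-side sketch, assuming one thread per row and a 256-thread block size:

    // Hypothetical launch: one thread per matrix row.
    unsigned int blocksize = 256;
    unsigned int nblocks   = (num_rows + blocksize - 1) / blocksize;
    csrmul_kernel<<<nblocks, blocksize>>>(Ap, Aj, Av, num_rows, x, y);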
Compare with OpenMP:

OpenMP Kernel:

    void csrmul_openmp(… … … … … …)
    {
        #pragma omp parallel for
        for (uint row = 0; row < num_rows; ++row)
        {
            uint row_begin = Ap[row];
            uint row_end   = Ap[row+1];
            y[row] = multiply_row(… … … …);
        }
    }

CUDA Kernel:

    __global__ void csrmul_kernel(… … … … … …)
    {
        uint row = blockIdx.x*blockDim.x + threadIdx.x;
        if (row < num_rows)
        {
            uint row_begin = Ap[row];
            uint row_end   = Ap[row+1];
            y[row] = multiply_row(… … … …);
        }
    }
Adding a simple caching scheme
    __global__ void csrmul_cached(… … … … … …)
    {
        uint begin = blockIdx.x*blockDim.x, end = begin + blockDim.x;
        uint row   = begin + threadIdx.x;

        __shared__ float cache[blocksize];   // array to cache rows
        if (row < num_rows)
            cache[threadIdx.x] = x[row];     // fetch to cache
        __syncthreads();

        if (row < num_rows)
        {
            uint row_begin = Ap[row], row_end = Ap[row+1];
            float sum = 0;

            // inlined multiply_row()
            for (uint col = row_begin; col < row_end; ++col)
            {
                uint j = Aj[col];
                // Fetch from cached rows when possible
                float x_j = (j >= begin && j < end) ? cache[j-begin] : x[j];
                sum += Av[col] * x_j;
            }
            y[row] = sum;
        }
    }
Example: Scan
Example: Multigrid Poisson Solver
Results: Multigrid Poisson Solver
    do k = 1, nz
      do j = 1, ny
        do i = 1 + mod(k+j+redblack, 2), nx, 2
          residual = -hsq*B(i,j,k) - U(i,j,k)*6 + &
                     (U(i+1,j,k) + U(i-1,j,k)) + &
                     (U(i,j+1,k) + U(i,j-1,k)) + &
                     (U(i,j,k-1) + U(i,j,k+1))
          U(i,j,k) = U(i,j,k) + (omega/6)*residual
        end do
      end do
    end do
Gauss-Seidel Red-Black smoother main loop - FORTRAN
Results: Multigrid Poisson Solver
    __global__ void Relax(Grid U, Grid B, float omega, int red_black)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        int j = blockIdx.y*blockDim.y + threadIdx.y;
        int k = blockIdx.z*blockDim.z + threadIdx.z;
        if ((i+j+k) % 2 != red_black) return;

        int idx = i + j*U.jstride + k*U.kstride;
        float residual = -U.hsq*B.buf[idx] - 6.0f*U.buf[idx]
                       + U.buf[idx + U.istride] + U.buf[idx - U.istride]
                       + U.buf[idx + U.jstride] + U.buf[idx - U.jstride]
                       + U.buf[idx + 1]         + U.buf[idx - 1];
        U.buf[idx] += omega * residual / 6.0f;
    }
Gauss-Seidel Red-Black smoother kernel - CUDA
Results: Multigrid Poisson Solver
The optimal version is slightly more complex: fast math intrinsics, optimized memory layout, etc.

Single precision, 1 step of full multigrid, GTX 280 vs. 3.16 GHz Core2 Duo E8500 (using 1 core):

    Grid size          GTX 280    Core2 E8500
    32 x 32 x 32       0.01 sec   0.02 sec
    64 x 64 x 64       0.02 sec   0.14 sec
    128 x 128 x 128    0.06 sec   1.16 sec
    256 x 256 x 256    0.33 sec   9.27 sec
Example: Vector Add w/ Host Code
Example: Vector Addition Kernel

Device Code:

    // Compute vector sum C = A+B
    // Each thread performs one pair-wise addition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        C[i] = A[i] + B[i];
    }

Host Code:

    int main()
    {
        // Run N/256 blocks of 256 threads each
        vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
    }
Example: Host code for vecAdd
    // allocate and initialize host (CPU) memory
    float *h_A = …, *h_B = …;

    // allocate device (GPU) memory
    float *d_A, *d_B, *d_C;
    cudaMalloc( (void**) &d_A, N * sizeof(float) );
    cudaMalloc( (void**) &d_B, N * sizeof(float) );
    cudaMalloc( (void**) &d_C, N * sizeof(float) );

    // copy host memory to device
    cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

    // execute the kernel on N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
Example: Ray Tracing
Implications & Strategy
GTX 280 supports up to 30,720 concurrent threads!
Implication: minimize state per thread
Strategy: replace traversal stack with short stack
Horn et al., Interactive k-D Tree GPU Raytracing, I3D 2008
The following slides are borrowed from Daniel Horn’s presentation
KD-Tree

[Figure: a k-d tree over a scene: root X splits into children Y and Z; leaves A, B, C, D hold the geometry. A ray is parameterized by the interval (tmin, tmax).]
KD-Tree Traversal

[Figure: traversing the tree for a ray: descend from root X toward the near child, pushing far children (here Z, then A) onto a stack to visit later.]
KD-Restart

Standard traversal, but omit stack operations:
  proceed to the 1st leaf
  if no intersection, advance (tmin, tmax) and restart from the root
  proceed to the next leaf

[Figure: the ray re-descends from root X after each missed leaf.]
KD-Restart with short stack (size 1)

[Figure: as in standard traversal, but only the most recent stack entry (here A) is kept; when the short stack underflows, fall back to restarting from the root.]
Short Stack Cache
Even better:
  each thread stores its full stack in memory (non-blocking writes)
  cache the top of the stack locally (registers or shared memory)
Enables BVHs as well as k-d trees
5-10% faster in our current implementation
Products
Tesla S1070 1U System:
  4 Teraflops (single precision)
  800 watts (typical power)
Tesla C1060 Board:
  957 Gigaflops (single precision)
  160 watts (typical power)
Building a 100TF datacenter

                        CPU 1U Server             Tesla 1U System
    Per unit            4 CPU cores, 0.07 TF      4 GPUs: 960 cores, 4 TF
    Price               $2,000                    $8,000
    Power               400 W                     800 W

    100 TF deployment   1429 CPU servers          25 CPU servers + 25 Tesla systems
    Cost                $3.1 M                    $0.31 M   (10x lower cost)
    Power               571 kW                    27 kW     (21x lower power)
Tesla Personal Supercomputer
Supercomputing performance:
  massively parallel CUDA architecture
  960 cores, 4 TeraFlops
  250x the performance of a desktop
Personal:
  one researcher, one supercomputer
  plugs into a standard power strip
Accessible:
  program in C for Windows, Linux
  available now worldwide under $10,000
CUDA SDK
[Diagram: integrated CPU + GPU C source code is compiled by the NVIDIA C compiler into CPU host code (built with a standard C compiler) and NVIDIA assembly for computing (executed on the GPU via the CUDA driver). The SDK also provides libraries (FFT, BLAS, ...), example source code, a debugger, and a profiler; applications sit on top.]
Example: Single Precision BLAS
[Figure: BLAS (SGEMM) on CUDA: GFLOPS (0-350) vs. matrix size, comparing CUDA against ATLAS with 1 thread and ATLAS with 4 threads.]

CUBLAS: CUDA 2.0b2 on Tesla C1060 (10-series GPU). ATLAS 3.81 on dual 2.8 GHz dual-core Opteron.
Example: Double Precision BLAS
[Figure: BLAS (DGEMM) on CUDA: GFLOPS (0-70) vs. matrix size, comparing CUBLAS against ATLAS parallel and ATLAS single-threaded.]

CUBLAS: CUDA 2.0b2 on Tesla C1060 (10-series). ATLAS 3.81 on Intel Xeon E5440 quad-core, 2.83 GHz.
DGEMM Performance
GPU performance includes data copies over PCIe gen-2
Direct Coulomb Summation (VMD)

CUDA-Simple: 14.8x faster than CPU, 33% of the fastest GPU kernel
CUDA-Unroll8clx: fastest GPU kernel, 44x faster than CPU, 291 GFLOPS on GeForce 8800 GTX

GPU computing. J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, J. Phillips. Proceedings of the IEEE, 2008. In press.
http://www.ks.uiuc.edu/Research/namd/
Fluid Dynamics
2-D compressible Euler equations:
  irregular structured grid
  finite volume method (Ni, 1981)

E. H. Phillips, Y. Zhang, R. L. Davis, J. D. Owens. A multi-grid solver for the 2D compressible Euler equations on a GPU cluster. http://www.ece.ucdavis.edu/cerl/techreports/2008-2/

Solving Euler Equations

Divide the domain into tiles & distribute them amongst blocks.
As always, try for:
  maximal work per tile
  minimal data exchange between tiles
  load balance within/between tiles
Parallel speedup vs. a single core, by domain size (nodes):

    Domain size         1.6K   25K   400K   1.6M   6.4M
    1x QuadroFX 5600       4    18     22     22     22
    2x QuadroFX 5600       4    22     39     42     43
    4x QuadroFX 5600       2    22     69     80     85
    8x QuadroFX 5600       1    14    110    136    160
Discontinuous Galerkin Methods
Weak coupling between elements:
  limited intra-element communication
  allows non-conforming meshes
  allows heterogeneous element types
Support local higher-order operations:
  increased arithmetic per element, but still local to the element

Full time-dependent electromagnetic scattering in an open volume surrounding an aircraft (from T. Warburton, A. Klockner, J. Hesthaven).

Multi-GPU Parallel Scaling: the CPU code achieves ~3 GFLOPS/core; we see just over 700 GFLOPS on 3x GTX 280.
Sparse Linear Algebra Results
Sparse Matrix-Vector Multiplication (SpMV) on CUDA
Experimented with several data structures:
  CSR: Compressed Sparse Row
  HYB: hybrid of ELLPACK (ELL) and Coordinate (COO) formats
HYB gave the best results:
  the speed of ELL with the flexibility of COO

Results: Sparse Matrix-Vector Multiplication (SpMV) on CUDA

Benchmarked against matrices, and CPU results taken, from "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", S. Williams et al., Supercomputing 2007.
Double Precision
T10 Double Precision Floating Point
    Precision                           IEEE 754
    Rounding modes for FADD and FMUL    All 4 IEEE: round to nearest, zero, inf, -inf
    Denormal handling                   Full speed
    NaN support                         Yes
    Overflow and Infinity support       Yes
    Flags                               No
    FMA                                 Yes
    Square root                         Software with low-latency FMA-based convergence
    Division                            Software with low-latency FMA-based convergence
    Reciprocal estimate accuracy        24 bit
    Reciprocal sqrt estimate accuracy   23 bit
    log2(x) and 2^x estimates accuracy  23 bit
Double Precision Floating Point

    Feature                    NVIDIA Tesla T10                 SSE2                         Cell SPE
    Precision                  IEEE 754                         IEEE 754                     IEEE 754
    Rounding (FADD/FMUL)       All 4 IEEE modes                 All 4 IEEE modes             All 4 IEEE modes
    Denormal handling          Full speed                       Supported, 1000's of cycles  Results only (input denormals flushed to 0)
    NaN support                Yes                              Yes                          Yes
    Overflow/Infinity support  Yes                              Yes                          Yes
    Flags                      No                               Yes                          Yes
    FMA                        Yes                              No                           Yes
    Square root                Software, FMA-based convergence  Hardware                     Software only
    Division                   Software, FMA-based convergence  Hardware                     Software only
    Reciprocal estimate        24 bit                           12 bit                       12 bit + step
    Reciprocal sqrt estimate   23 bit                           12 bit                       12 bit + step
    log2(x), 2^x estimates     23 bit                           No                           No
Single Precision Floating Point

    Feature                    G80                      SSE                          IBM Altivec                  Cell SPE
    Precision                  IEEE 754                 IEEE 754                     IEEE 754                     IEEE 754
    Rounding (FADD/FMUL)       Nearest and zero         All 4 IEEE modes             Nearest only                 Zero/truncate only
    Denormal handling          Flush to zero            Supported, 1000's of cycles  Supported, 1000's of cycles  Flush to zero
    NaN support                Yes                      Yes                          Yes                          No
    Overflow/Infinity support  Yes, clamps to max norm  Yes                          Yes                          No infinity
    Flags                      No                       Yes                          Yes                          Some
    Square root                Software only            Hardware                     Software only                Software only
    Division                   Software only            Hardware                     Software only                Software only
    Reciprocal estimate        24 bit                   12 bit                       12 bit                       12 bit
    Reciprocal sqrt estimate   23 bit                   12 bit                       12 bit                       12 bit
    log2(x), 2^x estimates     23 bit                   No                           12 bit                       No
Quotes
"GPUs have evolved to the point where many real-world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs."
  -- Jack Dongarra, Professor, University of Tennessee; author of Linpack

"We've all heard 'desktop supercomputer' claims in the past, but this time it's for real: NVIDIA and its partners will be delivering outstanding performance and broad applicability to the mainstream marketplace. Heterogeneous computing is what makes such a breakthrough possible."
  -- Burton Smith, Technical Fellow, Microsoft; formerly Chief Scientist at Cray