IAP09 CUDA@MIT/6.963
David Luebke NVIDIA Research
The “New” Moore’s Law
Computers no longer get faster, just wider
You must re-think your algorithms to be parallel!
Data-parallel computing is most scalable solution
Enter the GPU
Massive economies of scale
Massively parallel
Enter CUDA
Scalable parallel programming model
Minimal extensions to familiar C/C++ environment
Heterogeneous serial-parallel computing
Sound Bite
GPUs + CUDA =
The Democratization of Parallel Computing
Massively parallel computing has become a commodity technology
MOTIVATION
[Figure: reported application speedups across domains, ranging from 17X to 149X.]
CUDA: ‘C’ FOR PARALLELISM
Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel C code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
Hierarchy of concurrent threads
Parallel kernels are composed of many threads:
  all threads execute the same sequential program
Threads are grouped into thread blocks:
  threads in the same block can cooperate
Threads and blocks have unique IDs

[Figure: a kernel foo() is a set of blocks; each block b contains threads t0, t1, ..., tB.]
Hierarchical organization
Each level of the thread hierarchy has a corresponding memory space:
  Thread: per-thread local memory
  Block: per-block shared memory
  Device: per-device global memory

[Figure: threads within a block synchronize at local barriers; a global barrier separates dependent kernels (Kernel 0, Kernel 1) sharing global memory.]
Heterogeneous Programming
CUDA = serial program with parallel kernels, all in C:
  serial C code executes in a CPU thread
  parallel kernel C code executes in thread blocks across multiple processing elements

[Figure: execution alternates between serial code on the CPU and parallel kernels on the GPU, e.g. foo<<<nBlk, nTid>>>(args); ... serial code ... bar<<<nBlk, nTid>>>(args);]
Thread = virtualized scalar processor
An independent thread of execution has its own PC, variables (registers), processor state, etc.
This implies nothing about how threads are scheduled:
  CUDA threads might be physical threads, as on NVIDIA GPUs
  CUDA threads might be virtual threads: one might pick 1 block = 1 physical thread on a multicore CPU
Block = virtualized multiprocessor
Provides programmer flexibility:
  freely choose processors to fit the data
  freely customize for each kernel launch
Thread block = a (data-)parallel task:
  all blocks in a kernel have the same entry point
  but may execute any code they want
Thread blocks of a kernel must be independent tasks:
  the program must be valid for any interleaving of block executions
Blocks must be independent
Any possible interleaving of blocks should be valid:
  presumed to run to completion without pre-emption
  can run in any order
  can run concurrently OR sequentially
Blocks may coordinate but not synchronize:
  shared queue pointer: OK
  shared lock: BAD ... can easily deadlock
The independence requirement gives scalability.
Scalable Execution Model
A kernel launched by the host is distributed across the machine: blocks run on multiprocessors.

[Figure: an array of multiprocessors, each with scalar processors (SP), a multithreaded instruction unit (MT IU), and shared memory, all connected to device memory.]
Synchronization & Cooperation
Threads within a block may synchronize with barriers:

    ... Step 1 ...
    __syncthreads();
    ... Step 2 ...

Blocks coordinate via atomic memory operations:
  e.g., increment a shared queue pointer with atomicInc()

There is an implicit barrier between dependent kernels:

    vec_minus<<<nblocks, blksize>>>(a, b, c);
    vec_dot<<<nblocks, blksize>>>(c, c);
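To make the "shared queue pointer" idea concrete, here is a minimal sketch (mine, not from the original slides) of blocks claiming work items with atomicInc(); the names head and drain_queue, and the doubling "work", are illustrative assumptions:

    // Hypothetical sketch: threads across all blocks claim work items by
    // atomically bumping a global queue pointer. No inter-block locking.
    __device__ unsigned int head = 0;   // host zeroes this before launch

    __global__ void drain_queue(const int *items, unsigned int n_items, int *out)
    {
        while (true)
        {
            // atomicInc returns the old value; with bound 0xFFFFFFFF it
            // behaves like a plain atomic increment
            unsigned int my = atomicInc(&head, 0xFFFFFFFF);
            if (my >= n_items) break;   // queue exhausted
            out[my] = 2 * items[my];    // stand-in for real per-item work
        }
    }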
Using per-block shared memory
Variables shared across a block:

    __shared__ int *begin, *end;

Scratchpad memory:

    __shared__ int scratch[blocksize];
    scratch[threadIdx.x] = begin[threadIdx.x];
    // ... compute on scratch values ...
    begin[threadIdx.x] = scratch[threadIdx.x];

Communicating values between threads:

    scratch[threadIdx.x] = begin[threadIdx.x];
    __syncthreads();
    int left = scratch[threadIdx.x - 1];   // assumes threadIdx.x > 0
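Putting those fragments together, a minimal sketch of a complete kernel (mine; the name adjacent_diff and the 256-thread block size are assumptions) that stages data in shared memory and reads a neighbor's value after the barrier:

    // Hypothetical sketch: each thread emits the difference between its
    // element and its left neighbor, staged through per-block shared memory.
    __global__ void adjacent_diff(const int *in, int *out, int n)
    {
        __shared__ int scratch[256];            // blockDim.x assumed to be 256
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) scratch[threadIdx.x] = in[i];
        __syncthreads();                        // all loads visible block-wide

        if (i < n)
            out[i] = (threadIdx.x > 0)
                   ? scratch[threadIdx.x] - scratch[threadIdx.x - 1]
                   : scratch[0];                // no staged left neighbor at block edge
    }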
Example: Parallel Reduction
Summing up a sequence with 1 thread:

    int sum = 0;
    for (int i = 0; i < N; ++i)
        sum += x[i];

Parallel reduction builds a summation tree:
  each thread holds 1 element
  stepwise partial sums
  N threads need log N steps
  one possible approach: butterfly pattern
Parallel Reduction for 1 Block
    // INPUT: Thread i holds value x_i
    int i = threadIdx.x;
    __shared__ int sum[blocksize];

    // One thread per element
    sum[i] = x_i;
    __syncthreads();

    for (int bit = blocksize/2; bit > 0; bit /= 2)
    {
        int t = sum[i] + sum[i^bit];
        __syncthreads();
        sum[i] = t;
        __syncthreads();
    }
    // OUTPUT: Every thread now holds the total in sum[i]
Summing Up CUDA
CUDA = C + a few simple extensions:
  makes it easy to start writing basic parallel programs
Three key abstractions:
  1. hierarchy of parallel threads
  2. corresponding levels of synchronization
  3. corresponding memory spaces
Supports massive parallelism of manycore GPUs
More efficient reduction tree
    template<class T>
    __device__ T reduce(T *x)
    {
        unsigned int i = threadIdx.x;
        unsigned int n = blockDim.x;

        for (unsigned int offset = n/2; offset > 0; offset /= 2)
        {
            if (i < offset)           // the original read "tid"; the index is i
                x[i] += x[i + offset];
            __syncthreads();
        }
        return x[i];   // Note that only thread 0 holds the full sum
    }
Reduction tree execution example
Input (shared memory):
    10  1  8 -1  0 -2  3  5  -2 -3  2  7  0 11  0  2

x[i] += x[i+8];   active threads 0-7:
     8 -2 10  6  0  9  3  7  -2 -3  2  7  0 11  0  2
x[i] += x[i+4];   active threads 0-3:
     8  7 13 13  0  9  3  7  -2 -3  2  7  0 11  0  2
x[i] += x[i+2];   active threads 0-1:
    21 20 13 13  0  9  3  7  -2 -3  2  7  0 11  0  2
x[i] += x[i+1];   active thread 0:
    41 20 13 13  0  9  3  7  -2 -3  2  7  0 11  0  2

Final result: 41, in x[0].
Pattern 1: Fit kernel to the data
The code lets a B-thread block reduce a B-element array.
For larger sequences:
  launch N/B blocks to reduce each B-element subsequence
  write the N/B partial sums to a temporary array
  repeat until done
We'll be done in log_B(N) steps; a host-side driver is sketched below.
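A minimal host-side driver for Pattern 1 (a sketch under assumptions: reduce_kernel is a wrapper, not shown in the slides, in which each B-thread block reduces B elements of in and writes one partial sum to out):

    // Hypothetical driver: repeatedly reduce B-element subsequences until
    // a single value remains; takes log_B(N) kernel launches.
    const int B = 256;
    int   len = n;
    float *in = d_data, *out = d_partial;   // device buffers, assumed allocated
    while (len > 1)
    {
        int nblocks = (len + B - 1) / B;
        reduce_kernel<<<nblocks, B>>>(in, out, len);
        float *tmp = in; in = out; out = tmp;   // ping-pong the buffers
        len = nblocks;
    }
    // the total is now in in[0]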
Pattern 2: Fit kernel to the machine
For a block size of B:
  1) Launch B blocks to sum N/B elements each
  2) Launch 1 block to combine the B partial sums
We'll be done in 2 steps.
BUT ... we need to make sure B blocks fill the machine.
[Figure: Stage 1: many blocks; each block sums one subsequence of the input 3 1 7 0 | 4 1 6 3 | 4 7 5 9, producing partial sums 11, 14, 25. Stage 2: 1 block reduces the partial sums.]
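A sketch of Pattern 2 (mine, under assumptions: 256-thread blocks and the kernel name reduce_machine): each block strides through its share of the input before the in-block tree, so the same kernel serves both stages:

    // Hypothetical Pattern 2 kernel: grid-stride accumulation, then an
    // in-block tree; each block emits one partial sum.
    __global__ void reduce_machine(const float *x, float *partial, int n)
    {
        float my = 0.0f;
        for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < n;
             i += gridDim.x*blockDim.x)
            my += x[i];                          // each block covers ~n/gridDim.x elements

        __shared__ float sum[256];               // blockDim.x assumed to be 256
        sum[threadIdx.x] = my;
        __syncthreads();
        for (int offset = blockDim.x/2; offset > 0; offset /= 2)
        {
            if (threadIdx.x < offset) sum[threadIdx.x] += sum[threadIdx.x + offset];
            __syncthreads();
        }
        if (threadIdx.x == 0) partial[blockIdx.x] = sum[0];
    }

    // Stage 1: B blocks produce B partial sums; Stage 2: 1 block combines them.
    reduce_machine<<<B, 256>>>(d_x, d_partial, n);
    reduce_machine<<<1, 256>>>(d_partial, d_result, B);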
Pattern 3: Atomic accumulation
Create a global accumulator:

    __device__ int total;   // the original slide omitted the type

At the end of each block, call atomicAdd() to accumulate the block's partial sum into total.
We'll be done in 1 step.
BUT ... we've created a global serialization point between blocks:
  contention may be high, especially for small block sizes
  atomics support only limited types / operators
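A sketch of Pattern 3 (mine; the kernel name and int data are assumptions): each block reduces its chunk in shared memory, then thread 0 folds the block's sum into the global total:

    // Hypothetical sketch: in-block tree plus one atomicAdd per block.
    __device__ int total;   // host must zero this before launch,
                            // e.g. via cudaMemcpyToSymbol

    __global__ void reduce_atomic(const int *x, int n)
    {
        __shared__ int sum[256];                 // blockDim.x assumed to be 256
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        sum[threadIdx.x] = (i < n) ? x[i] : 0;
        __syncthreads();
        for (int offset = blockDim.x/2; offset > 0; offset /= 2)
        {
            if (threadIdx.x < offset) sum[threadIdx.x] += sum[threadIdx.x + offset];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            atomicAdd(&total, sum[0]);           // the global serialization point
    }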
THINK DATA-PARALLEL: SCAN
Common Situations in Parallel Computation
Many parallel threads need to generate a single result value → Reduce
Many parallel threads need to partition data → Split
Many parallel threads produce variable output per thread → Compact / Expand / Allocate
Split Operation

Given an array of true and false elements (and payloads)
Return an array with all true elements at the beginning
Examples: sorting, building trees

    Flag:     F T F F T F F T   →   F F F F F T T T
    Payload:  3 6 1 4 0 7 1 3   →   3 1 4 7 1 6 0 3

(The figure groups the false elements first; the operation is the same up to flag polarity.) A scan-based sketch follows.
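A sketch of split (mine, not from the slides) for a single block, built on the inclusive plus_scan shown later in this deck; it packs false elements first, matching the figure:

    // Hypothetical sketch: split one block's payload by its flags.
    // Assumes one block of blockDim.x <= 256 threads.
    __global__ void split_block(const int *payload, const int *flag, int *out)
    {
        unsigned int i = threadIdx.x;
        unsigned int n = blockDim.x;
        __shared__ unsigned int f[256];
        f[i] = flag[i] ? 0u : 1u;                // count the false elements
        __syncthreads();

        unsigned int c = plus_scan(f);           // inclusive # of falses up to i
        unsigned int total_falses = f[n-1];      // after the scan: all falses

        // falses fill positions 0 .. total_falses-1; trues fill the rest
        unsigned int dest = flag[i] ? total_falses + (i - c) : c - 1;
        out[dest] = payload[i];
    }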
Variable Output Per Thread: Compact

Remove null elements:

    3 0 7 0 4 1 0 3   →   3 7 4 1 3

A scan-based sketch appears below.
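Compact follows the same recipe; a sketch under the same assumptions (one block, plus_scan as defined later in the deck):

    // Hypothetical sketch: compact the non-null elements of one block.
    __global__ void compact_block(const int *in, int *out)
    {
        unsigned int i = threadIdx.x;
        __shared__ unsigned int keep[256];       // blockDim.x assumed <= 256
        keep[i] = (in[i] != 0) ? 1u : 0u;        // predicate: element is non-null
        __syncthreads();

        unsigned int c = plus_scan(keep);        // inclusive count of kept elements
        if (in[i] != 0)
            out[c - 1] = in[i];                  // c-1 is the exclusive scan value
    }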
Variable Output Per Thread: Allocate

Allocate variable storage per thread
Examples: marching cubes, geometry generation

[Figure: five threads produce 2, 1, 0, 3, and 2 output items respectively; the items A B, C, -, D E F, G H are packed contiguously into the output.]
“Where do I write my output?”
In all of these situations, each thread must answer that simple question
The answer is:
“That depends (on how much the other threads need to write)!”
“Scan” is an efficient way to answer this question in parallel
Applications of Scan
Scan is a simple and useful parallel building block for many parallel algorithms:
  radix sort
  quicksort (segmented scan)
  string comparison
  lexical analysis
  stream compaction
  run-length encoding
  polynomial evaluation
  solving recurrences
  tree operations
  histograms
  allocation
  etc.
Fascinating, since scan is unnecessary in sequential computing!
Parallel Scan (a.k.a. prefix sum)
Given a sequence A of n elements and an associative operator ⊕ with identity I:

  Inclusive scan:  scan(A, ⊕)    = [ A0, A0 ⊕ A1, ..., A0 ⊕ ··· ⊕ A(n-1) ]
  Exclusive scan:  prescan(A, ⊕) = [ I, A0, ..., A0 ⊕ ··· ⊕ A(n-2) ]

Example:  A = [3 1 7 0 4 1 6 3]
          prescan(A, +) = [0 3 4 11 11 15 16 22]
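For reference, the serial version of prescan is a one-line loop; this sketch (mine) matches the definition above for ⊕ = + and I = 0:

    // Serial exclusive (+)-scan: out[i] = sum of everything before A[i].
    void prescan_serial(const int *A, int *out, int n)
    {
        int running = 0;                 // the identity element I
        for (int i = 0; i < n; ++i)
        {
            out[i]   = running;
            running += A[i];
        }
    }
    // On A = [3 1 7 0 4 1 6 3] this yields [0 3 4 11 11 15 16 22].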
Inclusive Scan across 1 block
    // After Hillis & Steele, Data Parallel Algorithms, CACM 1986.
    template<class T>
    __device__ T plus_scan(T *x)
    {
        unsigned int i = threadIdx.x;
        unsigned int n = blockDim.x;

        for (unsigned int offset = 1; offset < n; offset *= 2)
        {
            T t;
            if (i >= offset) t = x[i-offset];
            __syncthreads();
            if (i >= offset) x[i] = t + x[i];
            __syncthreads();
        }
        return x[i];   // the original omitted this return statement
    }
Combining results across blocks
Much like the reduction case:
  Pattern 1: fit kernel to the data ... O(log_B N) steps
  Pattern 2: fit kernel to the machine ... O(1) steps

Now we need to add the partial sums back into the partial results:
  kernels need to save their partial results
  an additional kernel adds the partial sums back in

Sketch of the algorithm using Pattern 2:
  1) B blocks compute partial results for subsequences
  2) 1 block performs a scan of the partial sums from (1)
  3) N/B blocks add the results from (2) into the results from (1)

A host-side sketch follows.
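A host-side sketch of that three-launch structure (mine; the kernel names scan_blocks, scan_partials, and add_offsets are assumptions):

    // Hypothetical driver for a multi-block scan:
    //   scan_blocks:   each block scans its subsequence, saving its total
    //   scan_partials: one block scans the array of block totals
    //   add_offsets:   each block adds its scanned total into its results
    int nblocks = (n + B - 1) / B;
    scan_blocks<<<nblocks, B>>>(d_data, d_block_sums, n);   // step 1
    scan_partials<<<1, nblocks>>>(d_block_sums);            // step 2
    add_offsets<<<nblocks, B>>>(d_data, d_block_sums, n);   // step 3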
Directions for improvement
Templating for generic operators:
  scan<plus>() is much more powerful than plus_scan()
Pay attention to operator properties:
  associative? commutative? has inverse? has identity?
  the fewer that are true, the more careful the code must be
  e.g., the reduction examples assume commutativity and an identity
Many opportunities for code optimization:
  remove inefficiencies
  (auto-)tuning for different cases
Directions for optimization
Loop unrolling with fixed block sizes:
  almost always a good idea (see the sketch below)
Minimize unutilized threads:
  the reduction example leaves half its threads idle
Improve work efficiency:
  the scan example does O(n log n) work
Watch out for divergence and incoherent loads/stores:
  sometimes at odds with the earlier directions
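A sketch of the unrolling direction (mine): with the block size as a compile-time template parameter, the reduction loop disappears and each outer test below is resolved by the compiler:

    // Hypothetical sketch: fully unrolled in-block reduction. The outer
    // 'if (BLOCK >= ...)' tests are compile-time; no loop overhead remains.
    template<unsigned int BLOCK>
    __device__ int reduce_unrolled(int *x)
    {
        unsigned int i = threadIdx.x;
        if (BLOCK >= 512) { if (i < 256) x[i] += x[i+256]; __syncthreads(); }
        if (BLOCK >= 256) { if (i < 128) x[i] += x[i+128]; __syncthreads(); }
        if (BLOCK >= 128) { if (i <  64) x[i] += x[i+ 64]; __syncthreads(); }
        if (BLOCK >=  64) { if (i <  32) x[i] += x[i+ 32]; __syncthreads(); }
        if (BLOCK >=  32) { if (i <  16) x[i] += x[i+ 16]; __syncthreads(); }
        if (BLOCK >=  16) { if (i <   8) x[i] += x[i+  8]; __syncthreads(); }
        if (BLOCK >=   8) { if (i <   4) x[i] += x[i+  4]; __syncthreads(); }
        if (BLOCK >=   4) { if (i <   2) x[i] += x[i+  2]; __syncthreads(); }
        if (BLOCK >=   2) { if (i <   1) x[i] += x[i+  1]; __syncthreads(); }
        return x[0];
    }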
GPU ARCHITECTURE

GeForce GTX 280 / Tesla T10:
  240 scalar cores
  on-chip memory
  texture units

[Figure: chip-level view showing the communication fabric, memory & I/O, and fixed-function acceleration blocks.]
Key Architectural Ideas

Massive multithreading:
  hide latency with computation, not cache
SIMT execution (Single Instruction, Multiple Thread):
  warps - threads run in groups of 32
  HW handles divergence gracefully

[Figure: one SM (streaming multiprocessor): instruction cache, MT issue unit, constant cache, 8 SPs, 2 SFUs, a DP unit, and shared memory.]
Reduction tree redux
Input (shared memory):
    10  1  8 -1  0 -2  3  5  -2 -3  2  7  0 11  0  2

Sequential addressing, as traced earlier: x[i] += x[i+8], then x[i] += x[i+4], x[i] += x[i+2], x[i] += x[i+1], with active threads 0-7, 0-3, 0-1, then 0; final result 41 in x[0].
Compare to interleaved addressing:

Input (shared memory):
    10  1  8 -1  0 -2  3  5  -2 -3  2  7  0 11  0  2

x[i] += x[i+1];   active threads 0, 2, 4, ..., 14:
    11  1  7 -1 -2 -2  8  5  -5 -3  9  7 11 11  2  2
x[i] += x[i+2];   active threads 0, 4, 8, 12:
    18  1  7 -1  6 -2  8  5   4 -3  9  7 13 11  2  2
x[i] += x[i+4];   active threads 0, 8:
    24  1  7 -1  6 -2  8  5  17 -3  9  7 13 11  2  2
x[i] += x[i+8];   active thread 0:
    41  1  7 -1  6 -2  8  5  17 -3  9  7 13 11  2  2

The interleaved pattern scatters the active threads across the block, producing divergent warps; the sequential pattern keeps them contiguous.
SOME FINAL THOUGHTS
"Faster is not 'just faster'" -- David Kirk, Chief Scientist

2-3x faster is "just faster":
  do a little more, wait a little less
  doesn't change how you work
5-10x faster is "significant":
  worth upgrading
  worth re-writing (parts of) the application
100x (or more!) faster is "fundamentally different":
  worth considering a new platform
  worth re-architecting the application
  makes new applications possible
  drives "time to discovery", changes science
SOME FINAL THOUGHTS
We should teach parallel computing in CS 1 or CS 2.
Remember: computers don't get faster, just wider.

Heapsort and mergesort:
  both O(n lg n)
  one parallel-friendly, one not
  students need to understand this early
Conclusion
GPUs are massively parallel manycore computers:
  Ubiquitous - most successful parallel processor in history
  Useful - users achieve huge speedups on real problems
CUDA is a powerful parallel programming model:
  Heterogeneous - mixed serial-parallel programming
  Scalable - hierarchical thread execution model
  Accessible - minimal but expressive changes to C
Together they provide tremendous scope for innovative research.

Questions?
Questions & Discussion
Lots more slides…
  CUDA programming
  Advanced examples: sparse matrix-vector, ray tracing
  Performance: tools and applications
    BLAS, FFT, sort, sparse matrix-vector
    fluid dynamics, molecular dynamics, n-body, discontinuous Galerkin
  GPU architecture
  Myths of GPU Computing

Otherwise: dluebke@nvidia.com
Example: Sparse Matrix-Vector
Sparse matrix-vector multiplication
Sparse matrices have relatively few non-zero entries:
  frequently O(n) rather than O(n²)
  only store & operate on these non-zero entries

Example: Compressed Sparse Row (CSR) format

    Matrix:   [ 3 0 1 0 ]
              [ 0 0 0 0 ]
              [ 0 2 4 1 ]
              [ 1 0 0 1 ]

    Av[7] = { 3, 1, 2, 4, 1, 1, 1 };   // non-zero values
    Aj[7] = { 0, 2, 1, 2, 3, 0, 3 };   // column indices
    Ap[5] = { 0, 2, 2, 5, 7 };         // row pointers (row 1 is empty)
Sparse matrix-vector multiplication
    float multiply_row(uint rowsize,  // number of non-zeros in this row
                       uint *Aj,      // column indices for the row
                       float *Av,     // non-zero entries for the row
                       float *x)      // the RHS vector
    {
        float sum = 0;
        for (uint column = 0; column < rowsize; ++column)
            sum += Av[column] * x[Aj[column]];
        return sum;
    }
Sparse matrix-vector multiplication
    float multiply_row(uint size, uint *Aj, float *Av, float *x);

    void csrmul_serial(uint *Ap, uint *Aj, float *Av,
                       uint num_rows, float *x, float *y)
    {
        for (uint row = 0; row < num_rows; ++row)
        {
            uint row_begin = Ap[row];
            uint row_end   = Ap[row+1];
            y[row] = multiply_row(row_end - row_begin,
                                  Aj + row_begin, Av + row_begin, x);
        }
    }
Sparse matrix-vector multiplication
    float multiply_row(uint size, uint *Aj, float *Av, float *x);

    __global__ void csrmul_kernel(uint *Ap, uint *Aj, float *Av,
                                  uint num_rows, float *x, float *y)
    {
        uint row = blockIdx.x*blockDim.x + threadIdx.x;
        if (row < num_rows)
        {
            uint row_begin = Ap[row];
            uint row_end   = Ap[row+1];
            y[row] = multiply_row(row_end - row_begin,
                                  Aj + row_begin, Av + row_begin, x);
        }
    }
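The slides don't show the launch itself; a minimal host-side sketch, assuming one thread per row and a 256-thread block size:

    // Hypothetical launch: one thread per matrix row.
    unsigned int blocksize = 256;
    unsigned int nblocks   = (num_rows + blocksize - 1) / blocksize;
    csrmul_kernel<<<nblocks, blocksize>>>(Ap, Aj, Av, num_rows, x, y);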
Compare with OpenMP:

OpenMP Kernel:

    void csrmul_openmp(… … … … … …)
    {
        #pragma omp parallel for
        for (uint row = 0; row < num_rows; ++row)
        {
            uint row_begin = Ap[row];
            uint row_end   = Ap[row+1];
            y[row] = multiply_row(… … … …);
        }
    }

CUDA Kernel:

    __global__ void csrmul_kernel(… … … … … …)
    {
        uint row = blockIdx.x*blockDim.x + threadIdx.x;
        if (row < num_rows)
        {
            uint row_begin = Ap[row];
            uint row_end   = Ap[row+1];
            y[row] = multiply_row(… … … …);
        }
    }
Adding a simple caching scheme
    __global__ void csrmul_cached(… … … … … …)
    {
        uint begin = blockIdx.x*blockDim.x, end = begin + blockDim.x;
        uint row   = begin + threadIdx.x;

        __shared__ float cache[blocksize];   // array to cache rows
        if (row < num_rows)
            cache[threadIdx.x] = x[row];     // fetch to cache
        __syncthreads();

        if (row < num_rows)
        {
            uint row_begin = Ap[row], row_end = Ap[row+1];
            float sum = 0;

            // inlined multiply_row()
            for (uint col = row_begin; col < row_end; ++col)
            {
                uint j = Aj[col];
                // Fetch from cached rows when possible
                float x_j = (j >= begin && j < end) ? cache[j-begin] : x[j];
                sum += Av[col] * x_j;
            }
            y[row] = sum;
        }
    }
Example: Scan
Example: Multigrid Poisson Solver
Results: Multigrid Poisson Solver
    do k = 1, nz
      do j = 1, ny
        do i = 1 + mod(k+j+redblack, 2), nx, 2
          residual = -hsq*B(i,j,k) - U(i,j,k)*6 + &
                     (U(i+1,j,k) + U(i-1,j,k)) + &
                     (U(i,j+1,k) + U(i,j-1,k)) + &
                     (U(i,j,k-1) + U(i,j,k+1))
          U(i,j,k) = U(i,j,k) + (omega/6)*residual
        end do
      end do
    end do
Gauss-Seidel Red-Black smoother main loop - FORTRAN
Results: Multigrid Poisson Solver
    __global__ void Relax(Grid U, Grid B, float omega, int red_black)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        int j = blockIdx.y*blockDim.y + threadIdx.y;
        int k = blockIdx.z*blockDim.z + threadIdx.z;
        if ((i+j+k) % 2 != red_black) return;

        int idx = i + j*U.jstride + k*U.kstride;
        float residual = -U.hsq*B.buf[idx] - 6.0f*U.buf[idx]
                       + U.buf[idx + U.istride] + U.buf[idx - U.istride]
                       + U.buf[idx + U.jstride] + U.buf[idx - U.jstride]
                       + U.buf[idx + 1]         + U.buf[idx - 1];
        U.buf[idx] += omega * residual / 6.0f;
    }
Gauss-Seidel Red-Black smoother kernel - CUDA
Results: Multigrid Poisson Solver
The optimal version is slightly more complex: fast math intrinsics, optimized memory layout, etc.

Single precision, 1 step of full multigrid, GTX 280 vs. 3.16 GHz Core2 Duo E8500 (using 1 core):

    Grid size          GTX 280    Core2 E8500
    32 x 32 x 32       0.01 sec   0.02 sec
    64 x 64 x 64       0.02 sec   0.14 sec
    128 x 128 x 128    0.06 sec   1.16 sec
    256 x 256 x 256    0.33 sec   9.27 sec
Example: Vector Add w/ Host Code
Example: Vector Addition Kernel

Device Code:

    // Compute vector sum C = A+B
    // Each thread performs one pair-wise addition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        C[i] = A[i] + B[i];
    }

Host Code:

    int main()
    {
        // Run N/256 blocks of 256 threads each
        vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
    }
Example: Host code for vecAdd
    // allocate and initialize host (CPU) memory
    float *h_A = …, *h_B = …;

    // allocate device (GPU) memory
    float *d_A, *d_B, *d_C;
    cudaMalloc( (void**) &d_A, N * sizeof(float) );
    cudaMalloc( (void**) &d_B, N * sizeof(float) );
    cudaMalloc( (void**) &d_C, N * sizeof(float) );

    // copy host memory to device
    cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

    // execute the kernel on N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
Example: Ray Tracing
Implications & Strategy
GTX 280 supports up to 30,720 concurrent threads!
Implication: minimize state per thread
Strategy: replace traversal stack with short stack
Horn et al., Interactive k-D Tree GPU Raytracing, I3D 2008
The following slides are borrowed from Daniel Horn’s presentation
KD-Tree

[Figure: a k-d tree over a scene: root X splits into children Y and Z; leaves A, B, C, D hold the geometry. A ray is parameterized by the interval (tmin, tmax).]
KD-Tree Traversal

[Figure: traversing the tree for a ray: descend from root X toward the near child, pushing far children (here Z, then A) onto a stack to visit later.]
KD-Restart

Standard traversal, but omit stack operations:
  proceed to the 1st leaf
  if no intersection, advance (tmin, tmax) and restart from the root
  proceed to the next leaf

[Figure: the ray re-descends from root X after each missed leaf.]
KD-Restart with short stack (size 1)

[Figure: as in standard traversal, but only the most recent stack entry (here A) is kept; when the short stack underflows, fall back to restarting from the root.]
Short Stack Cache
Even better:
  each thread stores its full stack in memory (non-blocking writes)
  cache the top of the stack locally (registers or shared memory)
Enables BVHs as well as k-d trees
5-10% faster in our current implementation
Products
Tesla S1070 1U System:
  4 Teraflops (single precision)
  800 watts (typical power)
Tesla C1060 Board:
  957 Gigaflops (single precision)
  160 watts (typical power)
Building a 100TF datacenter

                        CPU 1U Server             Tesla 1U System
    Per unit            4 CPU cores, 0.07 TF      4 GPUs: 960 cores, 4 TF
    Price               $2,000                    $8,000
    Power               400 W                     800 W

    100 TF deployment   1429 CPU servers          25 CPU servers + 25 Tesla systems
    Cost                $3.1 M                    $0.31 M   (10x lower cost)
    Power               571 kW                    27 kW     (21x lower power)
Tesla Personal Supercomputer
Supercomputing performance:
  massively parallel CUDA architecture
  960 cores, 4 TeraFlops
  250x the performance of a desktop
Personal:
  one researcher, one supercomputer
  plugs into a standard power strip
Accessible:
  program in C for Windows, Linux
  available now worldwide under $10,000
CUDA SDK
[Diagram: integrated CPU + GPU C source code is compiled by the NVIDIA C compiler into CPU host code (built with a standard C compiler) and NVIDIA assembly for computing (executed on the GPU via the CUDA driver). The SDK also provides libraries (FFT, BLAS, ...), example source code, a debugger, and a profiler; applications sit on top.]
Example: Single Precision BLAS
[Figure: BLAS (SGEMM) on CUDA: GFLOPS (0-350) vs. matrix size, comparing CUDA against ATLAS with 1 thread and ATLAS with 4 threads.]

CUBLAS: CUDA 2.0b2 on Tesla C1060 (10-series GPU). ATLAS 3.81 on dual 2.8 GHz dual-core Opteron.
Example: Double Precision BLAS
[Figure: BLAS (DGEMM) on CUDA: GFLOPS (0-70) vs. matrix size, comparing CUBLAS against ATLAS parallel and ATLAS single-threaded.]

CUBLAS: CUDA 2.0b2 on Tesla C1060 (10-series). ATLAS 3.81 on Intel Xeon E5440 quad-core, 2.83 GHz.
DGEMM Performance
GPU performance includes data copies over PCIe gen-2
Direct Coulomb Summation (VMD)

CUDA-Simple: 14.8x faster than CPU, 33% of the fastest GPU kernel
CUDA-Unroll8clx: fastest GPU kernel, 44x faster than CPU, 291 GFLOPS on GeForce 8800 GTX

GPU computing. J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, J. Phillips. Proceedings of the IEEE, 2008. In press.
http://www.ks.uiuc.edu/Research/namd/
Fluid Dynamics
2-D compressible Euler equations:
  irregular structured grid
  finite volume method (Ni, 1981)

E. H. Phillips, Y. Zhang, R. L. Davis, J. D. Owens. A multi-grid solver for the 2D compressible Euler equations on a GPU cluster. http://www.ece.ucdavis.edu/cerl/techreports/2008-2/

Solving Euler Equations

Divide the domain into tiles & distribute them amongst blocks.
As always, try for:
  maximal work per tile
  minimal data exchange between tiles
  load balance within/between tiles
Parallel speedup vs. a single core, by domain size (nodes):

    Domain size         1.6K   25K   400K   1.6M   6.4M
    1x QuadroFX 5600       4    18     22     22     22
    2x QuadroFX 5600       4    22     39     42     43
    4x QuadroFX 5600       2    22     69     80     85
    8x QuadroFX 5600       1    14    110    136    160
Discontinuous Galerkin Methods
Weak coupling between elements:
  limited intra-element communication
  allows non-conforming meshes
  allows heterogeneous element types
Support local higher-order operations:
  increased arithmetic per element, but still local to the element

Full time-dependent electromagnetic scattering in an open volume surrounding an aircraft (from T. Warburton, A. Klockner, J. Hesthaven).

Multi-GPU Parallel Scaling: the CPU code achieves ~3 GFLOPS/core; we see just over 700 GFLOPS on 3x GTX 280.
Sparse Linear Algebra Results
Sparse Matrix-Vector Multiplication (SpMV) on CUDA
Experimented with several data structures:
  CSR: Compressed Sparse Row
  HYB: hybrid of ELLPACK (ELL) and Coordinate (COO) formats
HYB gave the best results:
  the speed of ELL with the flexibility of COO

Results: Sparse Matrix-Vector Multiplication (SpMV) on CUDA

Benchmarked against matrices, and CPU results taken, from "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", S. Williams et al., Supercomputing 2007.
Double Precision
T10 Double Precision Floating Point
    Precision                           IEEE 754
    Rounding modes for FADD and FMUL    All 4 IEEE: round to nearest, zero, inf, -inf
    Denormal handling                   Full speed
    NaN support                         Yes
    Overflow and Infinity support       Yes
    Flags                               No
    FMA                                 Yes
    Square root                         Software with low-latency FMA-based convergence
    Division                            Software with low-latency FMA-based convergence
    Reciprocal estimate accuracy        24 bit
    Reciprocal sqrt estimate accuracy   23 bit
    log2(x) and 2^x estimates accuracy  23 bit
Double Precision Floating Point

    Feature                    NVIDIA Tesla T10                 SSE2                         Cell SPE
    Precision                  IEEE 754                         IEEE 754                     IEEE 754
    Rounding (FADD/FMUL)       All 4 IEEE modes                 All 4 IEEE modes             All 4 IEEE modes
    Denormal handling          Full speed                       Supported, 1000's of cycles  Results only (input denormals flushed to 0)
    NaN support                Yes                              Yes                          Yes
    Overflow/Infinity support  Yes                              Yes                          Yes
    Flags                      No                               Yes                          Yes
    FMA                        Yes                              No                           Yes
    Square root                Software, FMA-based convergence  Hardware                     Software only
    Division                   Software, FMA-based convergence  Hardware                     Software only
    Reciprocal estimate        24 bit                           12 bit                       12 bit + step
    Reciprocal sqrt estimate   23 bit                           12 bit                       12 bit + step
    log2(x), 2^x estimates     23 bit                           No                           No
Single Precision Floating Point

    Feature                    G80                      SSE                          IBM Altivec                  Cell SPE
    Precision                  IEEE 754                 IEEE 754                     IEEE 754                     IEEE 754
    Rounding (FADD/FMUL)       Nearest and zero         All 4 IEEE modes             Nearest only                 Zero/truncate only
    Denormal handling          Flush to zero            Supported, 1000's of cycles  Supported, 1000's of cycles  Flush to zero
    NaN support                Yes                      Yes                          Yes                          No
    Overflow/Infinity support  Yes, clamps to max norm  Yes                          Yes                          No infinity
    Flags                      No                       Yes                          Yes                          Some
    Square root                Software only            Hardware                     Software only                Software only
    Division                   Software only            Hardware                     Software only                Software only
    Reciprocal estimate        24 bit                   12 bit                       12 bit                       12 bit
    Reciprocal sqrt estimate   23 bit                   12 bit                       12 bit                       12 bit
    log2(x), 2^x estimates     23 bit                   No                           12 bit                       No
Quotes
"GPUs have evolved to the point where many real-world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs."
  -- Jack Dongarra, Professor, University of Tennessee; author of Linpack

"We've all heard 'desktop supercomputer' claims in the past, but this time it's for real: NVIDIA and its partners will be delivering outstanding performance and broad applicability to the mainstream marketplace. Heterogeneous computing is what makes such a breakthrough possible."
  -- Burton Smith, Technical Fellow, Microsoft; formerly Chief Scientist at Cray