Day1 02a Programming Overview


Transcript of Day1 02a Programming Overview

  • Slide 1/47: CUDA Programming Model

    Gernot Ziegler, NVIDIA UK (material by Gregory Ruetsch)

  • Slide 2/47: Programming in C for CUDA

    C for CUDA = C + a few simple extensions

    As a C developer, it is easy to start writing parallel programs.

    Three key abstractions:
    1. parallel threads on the device (GPU)
    2. management of the corresponding memory spaces
    3. the corresponding synchronization

    Host: device management API

    Additionally, the Runtime API & nvcc let you use the language extensions even in host code!

  • Slide 3/47: Basics

    Set up the GPU for computation

    GPU device and memory management

    GPU kernel launches (execution configuration)

    Some specifics of GPU/device code

    Some additional features:
    Vector types
    Asynchronous execution
    CUDA error handling
    CUDA Events

    Note: only the basic features are covered; the Programming Guide and Reference Manual contain more information.

  • Slide 4/47: Device Management

    First task: the CPU queries and selects GPU devices

    cudaGetDeviceCount( int* count )
    cudaSetDevice( int device )
    cudaGetDevice( int* current_device )
    cudaGetDeviceProperties( cudaDeviceProp* prop, int device )
    cudaChooseDevice( int* device, cudaDeviceProp* prop )

    Multi-GPU setup:
    device 0 is used by default; careful with combinations of a graphics card and a Tesla card!
    (usually, one CPU thread controls one GPU each, but the driver API allows more)
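    A minimal host-side sketch of this query-and-select flow (the property fields printed and the choice of device 0 are just illustrative):

    // Hedged sketch: enumerate CUDA devices and select one.
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        printf("Found %d CUDA device(s)\n", count);

        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("Device %d: %s, %d multiprocessor(s)\n",
                   dev, prop.name, prop.multiProcessorCount);
        }

        cudaSetDevice(0);   // device 0 is the default anyway; shown for completeness
        return 0;
    }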

  • Slide 5/47: Managing Memory

    The host/CPU also manages device/GPU memory:
    Allocate & free memory
    Copy data to and from the device's global memory (GPU DRAM, e.g. 4 GB on Tesla)

    cudaMalloc(void **pointer, size_t nbytes)
    cudaMemset(void *pointer, int value, size_t count)
    cudaFree(void *pointer)

    Host and device have separate memory spaces!

  • Slide 6/47: Example: Managing memory (no data transfer)

    int n = 1024;
    int nbytes = 1024*sizeof(int);
    int *d_a = 0;

    cudaMalloc( (void**)&d_a, nbytes );
    cudaMemset( d_a, 0, nbytes );
    cudaFree( d_a );

  • Slide 7/47: CUDA: Runtime support

    Explicit memory allocation returns pointers to GPU memory
    cudaMalloc(), cudaFree()

    Explicit memory copies for host ↔ device, device ↔ device
    cudaMemcpy(), cudaMemcpy2D(), ...

    Texture management
    cudaBindTexture(), cudaBindTextureToArray(), ...

    OpenGL & DirectX interoperability
    cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), ...

  • Slide 8/47: Example: Host Code's memory management

    // allocate host memory
    int numBytes = N * sizeof(float);
    float* h_A = (float*)malloc(numBytes);

    // allocate device memory
    float* d_A = 0;
    cudaMalloc((void**)&d_A, numBytes);

    // copy data from host to device
    cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);

    // execute the kernel on the GPU: [ NEXT SLIDE ]
    gpu_func (params)

    // copy data from device back to host
    cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);

    // free device memory
    cudaFree(d_A);

  • Slide 9/47: Kernel creation

    How to...
    gpu_func (params)
    ... write a kernel!

    First, a re-cap of the CUDA architecture...

  • Slide 10/47: Device code: Thread bundles

    Kernel = device code call

    A kernel is executed by a grid of thread blocks

    A thread block is a batch of threads that can cooperate through shared memory

    Threads from different blocks cannot cooperate

    [Figure: the host launches Kernel 1 and Kernel 2 on the device; each launch is a grid of thread blocks (Grid 1: blocks (0,0)..(2,1)), and each block, e.g. Block (1,1), contains a 2D arrangement of threads (0,0)..(4,2).]

  • Slide 11/47: Blocks must be independent

    "Threads from different blocks cannot cooperate"

    Why? Any possible interleaving of blocks should be valid:
    presumed to run to completion without pre-emption
    can run in any order
    can run concurrently OR sequentially (GPU scaling)

    Blocks may coordinate but not synchronize:
    shared queue pointer: OK
    shared lock: BAD, can easily deadlock

    So: the independence requirement gives scalability across different GPU sizes.

  • Slide 12/47: Device code: Thread IDs

    Threads and blocks have IDs, so each thread can decide what data to work on

    Block ID: 1D or 2D
    Thread ID: 1D, 2D, or 3D

    2D/3D IDs simplify addressing when processing multidimensional data:
    Image processing
    Solving PDEs on volumes

    [Figure: same grid/block/thread diagram as before, with Grid 1 made of blocks (0,0)..(2,1) and Block (1,1) made of threads (0,0)..(4,2), illustrating block and thread indices.]

  • Slide 13/47: Programming Model: Memory Spaces

    Each thread can:

    Read/write per-thread registers

    (Read/write per-thread local memory)

    Read/write per-block shared memory

    Read/write per-grid global memory

    Read only per-grid constant memory

    Read only per-grid texture memory

    [Figure: memory hierarchy. Each thread has registers and local memory; each block has shared memory; the whole grid shares global, constant, and texture memory.]

    Host: the host can read/write global, constant, and texture memory
    (all stored in GPU DRAM)

  • Slide 14/47: Qualifiers for variable storage (device code)

    __device__
    Stored in device memory, aka global memory (e.g. 4 GB on Tesla)
    Large capacity, BUT: high latency, uncached
    Allocated with cudaMalloc
    Accessible by all threads

    __shared__
    On-chip memory (SRAM, low latency), 16 kB per multiprocessor
    Allocated by the execution configuration or at compile time
    Shared access by all threads in the same thread block
    Short-lived (only while the block runs)

    All unqualified variables:
    Scalars and built-in vector types are stored in registers
    Arrays may be in registers, or in local memory (a special form of global memory / DRAM)
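    A short device-code sketch of these qualifiers (array sizes and names are made up for illustration):

    // Hedged sketch: storage qualifiers in device code.
    __device__ float g_table[256];            // global memory: visible to all threads, high latency

    __global__ void scale_kernel(float *out, int n)
    {
        __shared__ float s_buf[128];          // per-block shared memory: on-chip, low latency
        int tid = threadIdx.x;
        float factor = 2.0f;                  // unqualified scalar: lives in a register
        if (tid < 128)
            s_buf[tid] = g_table[tid] * factor;
        __syncthreads();
        int idx = blockIdx.x * blockDim.x + tid;
        if (idx < n)
            out[idx] = s_buf[tid % 128];
    }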

  • Slide 15/47: Launching kernels

    Modified C function call syntax:
    kernel<<<dim3 grid, dim3 block>>>(...)

    Execution Configuration ("<<< >>>"):
    grid dimensions: x and y
    thread-block dimensions: x, y, and z

    dim3 grid(16, 16);
    dim3 block(16, 16);
    kernel<<<grid, block>>>(...);

  • Slide 16/47: Example: Host Code

    // allocate host memory
    int numBytes = N * sizeof(float);
    float* h_A = (float*)malloc(numBytes);

    // allocate device memory
    float* d_A = 0;
    cudaMalloc((void**)&d_A, numBytes);

    // copy data from host to device
    cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);

    // execute the kernel
    increment_gpu<<<...>>>(d_A, b);

    // copy data from device back to host
    cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);

    // free device memory
    cudaFree(d_A);

  • Slide 17/47: CUDA Built-in Device Variables

    All __global__ and __device__ functions have access to these automatically defined variables:

    dim3 gridDim;
    Dimensions of the grid in blocks (at most 2D)

    dim3 blockDim;
    Dimensions of the block in threads

    dim3 blockIdx;
    Block index within the grid

    dim3 threadIdx;
    Thread index within the block
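    A typical use of these variables is deriving a unique per-thread global index; a minimal sketch for the 1D case:

    // Hedged sketch: each thread computes its own element index from the built-in variables.
    __global__ void scale_by_two(float *data, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // unique 1D index across the whole grid
        if (idx < n)                                      // guard: the grid may be larger than n
            data[idx] *= 2.0f;
    }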

  • Slide 18/47: Example: Increment Array Elements

    CPU program vs. CUDA program

    void increment_cpu(float *a, float b, int N)
    {
        for (int idx = 0; idx < N; idx++)
            a[idx] = a[idx] + b;
    }
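    The CUDA column of this slide did not survive extraction. Below is a sketch of what the GPU version presumably looks like, consistent with the increment_gpu launch shown on the previous slide (the blockSize name and the assumption that N divides evenly are illustrative):

    // Hedged reconstruction of the CUDA side of the increment example.
    __global__ void increment_gpu(float *a, float b, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = a[idx] + b;
    }

    // host side (assuming N is a multiple of blockSize):
    //   dim3 dimBlock(blockSize);
    //   dim3 dimGrid(N / blockSize);
    //   increment_gpu<<<dimGrid, dimBlock>>>(d_a, b, N);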

  • Slide 19/47: Other extras (device code)

    Other language extras...

  • Slide 20/47: Built-in Vector Types

    [u]char[1..4], [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4]

    Structures accessed with x, y, z, w fields:
    uint4 param;
    int y = param.y;

    dim3
    Based on uint3
    Used to specify dimensions
    Default value (1,1,1)
    Can be used in GPU and CPU code (if compiled by nvcc)
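    A small sketch of how a built-in vector type and dim3 are used (names are illustrative):

    // Hedged sketch: float4 fields and make_float4 in device code, dim3 in host code.
    __global__ void set_w_to_one(float4 *v, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float4 p = v[i];                        // fields accessed as .x .y .z .w
            v[i] = make_float4(p.x, p.y, p.z, 1.0f);
        }
    }

    // host code (compiled by nvcc); unspecified dim3 components default to 1:
    //   dim3 block(256);       // (256, 1, 1)
    //   dim3 grid(1024);       // (1024, 1, 1)
    //   set_w_to_one<<<grid, block>>>(d_vec, 1024 * 256);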

  • Slide 21/47: Thread Synchronization

    void __syncthreads();
    Synchronizes all threads in a block

    Generates a barrier synchronization instruction

    No thread can pass this barrier until all threads in the block reach it

    Often needed for shared-memory write/read synchronization between threads
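    A common pattern is one barrier between writing and reading shared memory; a minimal sketch (reversing each block's chunk, purely illustrative, and assuming n is a multiple of blockDim.x == 256):

    // Hedged sketch: write shared memory, barrier, then read data written by other threads.
    __global__ void reverse_per_block(int *d, int n)
    {
        __shared__ int s[256];
        int t   = threadIdx.x;
        int idx = blockIdx.x * blockDim.x + t;
        if (idx < n) s[t] = d[idx];
        __syncthreads();                 // all shared-memory writes visible before any read
        if (idx < n) d[idx] = s[blockDim.x - 1 - t];
    }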

  • Slide 22/47: GPU Atomic Integer Operations

    Atomic operations on integers in global memory:
    Associative operations on signed/unsigned ints
    add, sub, min, max, ...
    and, or, xor
    Increment, decrement
    Exchange, compare-and-swap

    32-bit: hardware with compute capability >= 1.1
    64-bit: hardware with compute capability >= 1.2
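    A small sketch of an atomic integer operation on global memory (a byte histogram; the technique, not the bin count, is the point):

    // Hedged sketch: atomicAdd on a global-memory counter array
    // (32-bit global atomics require compute capability >= 1.1).
    __global__ void histogram(const unsigned char *in, int n, unsigned int *bins)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            atomicAdd(&bins[in[idx]], 1u);   // safe even when many threads hit the same bin
    }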

  • Slide 23/47: C for CUDA: Summary

    Function qualifiers:
    __global__ void MyKernel() { }
    __device__ float MyDeviceFunc() { }

    Variable qualifiers:
    __constant__ float MyConstantArray[32];
    __shared__ float MySharedArray[32];

    Execution configuration:
    dim3 dimGrid(100, 50);  // 5000 thread blocks
    dim3 dimBlock(4, 8, 8); // 256 threads per block
    MyKernel<<<dimGrid, dimBlock>>>(...); // Launch kernel

    Built-in variables and functions valid in device code:
    dim3 gridDim;   // Grid dimension
    dim3 blockDim;  // Block dimension
    dim3 blockIdx;  // Block index
    dim3 threadIdx; // Thread index
    void __syncthreads(); // Thread synchronization (see Programming Guide)

  • Slide 24/47: Runtime API: More features

    Other runtime specialties for host code...

  • Slide 25/47: Asynchronous operation

    CUDA calls are enqueued in streams and executed one after another; usually there is one default stream (0).

    Kernel launches are asynchronous:
    control returns to the CPU immediately
    the kernel executes after all previous CUDA calls

    cudaMemcpy() is synchronous:
    the copy starts after all previous CUDA calls have completed
    control returns to the CPU after the copy completes
    (async memcopies are possible, too)

    Thus: GPU output that is required on the host leads to a sync.

  • Slide 26/47: Example: Async operation

    // allocate host memory
    int numBytes = N * sizeof(float);
    float* h_A = (float*)malloc(numBytes);

    // allocate device memory
    float* d_A = 0;
    cudaMalloc((void**)&d_A, numBytes);

    // copy data from host to device
    cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);

    // "execute the kernel"
    // truly: the CPU enqueues kernel calls, the GPU executes them asynchronously
    kernel_A<<<...>>>(...);
    kernel_B<<<...>>>(...);
    kernel_C<<<...>>>(...);

    // copy data from device back to host - CPU/GPU SYNC
    cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);

    // free device memory
    cudaFree(d_A);

  • Slide 27/47: CUDA Error Reporting

    All CUDA calls return an error code (cudaError_t type),
    except for kernel launches.

    cudaGetLastError()
    Returns the code for the last error ("no error" has a code, too)
    Even gets errors from kernel execution

    char *cudaGetErrorString(code)
    Returns a string describing the error

    printf("%s\n", cudaGetErrorString( cudaGetLastError() ) );
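    A common host-side pattern built on these calls; the checkCuda helper name is made up for illustration:

    // Hedged sketch: check API return codes and kernel launches.
    #include <stdio.h>
    #include <cuda_runtime.h>

    static void checkCuda(cudaError_t err, const char *what)
    {
        if (err != cudaSuccess)
            printf("%s failed: %s\n", what, cudaGetErrorString(err));
    }

    // usage:
    //   checkCuda(cudaMalloc((void**)&d_A, numBytes), "cudaMalloc");
    //   my_kernel<<<grid, block>>>(d_A);
    //   checkCuda(cudaGetLastError(), "kernel launch");  // launches themselves return no error code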

  • Slide 28/47: Textures in CUDA

    Textures are known from graphics; in CUDA, textures are used for data reading.

    Benefits:
    Addressable in 1D, 2D, or 3D
    Data is cached (optimized for 2D locality)
    Helpful for irregular data access
    Filtering: linear / bilinear / trilinear, in dedicated hardware
    Wrap modes (for out-of-bounds addresses)

    Usage:
    Host code binds data to a texture reference
    Kernel reads data by calling a fetch function, e.g. tex1Dfetch()
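    A sketch of the (legacy) texture-reference usage described here, fetching from linear memory in 1D; names are illustrative:

    // Hedged sketch: bind linear device memory to a texture reference, read it via tex1Dfetch.
    texture<float, 1, cudaReadModeElementType> texRef;   // file-scope texture reference

    __global__ void read_via_texture(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(texRef, i);              // cached read of element i
    }

    // host side:
    //   cudaBindTexture(NULL, texRef, d_in, n * sizeof(float));
    //   read_via_texture<<<grid, block>>>(d_out, n);
    //   cudaUnbindTexture(texRef);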

  • Slide 29/47: CUDA Event API

    CUDA call streams can be interspersed with events.

    Usage scenarios:
    measure elapsed time for CUDA calls (clock-cycle precision!)
    query the status of an asynchronous CUDA call
    block the CPU until CUDA calls prior to the event are completed
    asyncAPI sample in the CUDA SDK

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    kernel<<<...>>>(...);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    cudaEventDestroy(start); cudaEventDestroy(stop);

  • Slide 30/47: Driver API

    Up to this point, the host code we've seen has been from the runtime API: cuda*() functions...
    Driver API: cu*() functions

    Advantages:
    Plain C interface, so you can use any CPU compiler for the host code (e.g. icc, etc.)
    More control over devices
    One CPU thread can control multiple GPUs
    PTX Just-In-Time (JIT) compilation (Parallel Thread eXecution (PTX) is our "GPU assembly language")
    No dependency on the runtime library

    Disadvantages:
    No device emulation
    More verbose code

    Note: device code is identical, regardless of using the runtime or driver API.

  • Slide 31/47: Once more: Runtime and Driver API

    Best place to start for virtually all developers: the Runtime API

    Easy to migrate to the driver API if/when it is needed

    Anything which can be done in the runtime API can also be done in the driver API, but not vice versa

    Much, much more information on both APIs in the CUDA Reference Manual

  • Slide 32/47: New Features in CUDA 2.2

    Zero copy:
    CUDA threads can directly read/write host (CPU) memory
    Requires pinned (non-pageable) memory
    Main benefits:
    More efficient than small PCIe data transfers
    May give better performance when there is no opportunity for data reuse from device DRAM

    2D texturing from linear memory:
    Allows simpler write-to-texture in CUDA
    Useful for image processing

  • Slide 33/47: nvcc is a C compiler

    Advanced C++ constructs (classes with inheritance and virtual functions) make it stumble in device code!

    If problems occur, and CUDART is still desirable:
    Let nvcc compile only the .cu files that contain the kernels, let the customer's compiler handle the C++ code in their own files, and link the two parts.

    Last resort: the CUDA driver API
    (nvcc compiles kernels into PTX or binaries, which the application loads via C calls)

  • Slide 34/47: C for CUDA: Optimization

  • Slide 35/47: Optimize Algorithms for the GPU

    Maximize data-parallelism in the algorithm (SIMD):
    Think threads for data elements, not for specific tasks

    Reduce thread divergence
    (performance impact from branch serialization when groups smaller than 32 threads start to diverge)

    Do more computation on the GPU rather than costly device-host data transfers:
    Even low-parallelism computations can sometimes be faster than transferring back and forth to the host

  • Slide 36/47: Optimize Algorithms for the GPU: Maths

    Maximize arithmetic intensity (math per memory transfer)

    Sometimes it's better to recompute results than to introduce serial dependencies:
    the GPU spends its transistors on ALUs, not memory

    Double-precision algorithms:
    Consider moving parts or all of the computation to single precision

    The hardware has built-in math functions (at reduced precision): __sinf(), __expf(), etc.
    Try -use_fast_math (implicitly converts e.g. sin() to __sinf()), or carefully replace individual function calls, considering the reduced accuracy

  • Slide 37/47: Optimize Memory Access

    Coalescing: the "optimal" memory access pattern
    Coalesced vs. non-coalesced = an order of magnitude!

    Shared memory: a user-managed cache

    Advanced concepts:
    Shared memory bank conflicts
    Make use of spatial locality for the texture and constant caches

  • Slide 38/47: Coalescing

    Compute capability 1.0 and 1.1:
    The k-th thread must access the k-th word in the segment (or the k-th word in 2 contiguous 128B segments for 128-bit words); not all threads need to participate

    [Figure: coalesced access: 1 transaction; out-of-sequence access: 16 transactions; misaligned access: 16 transactions]

  • Slide 39/47: Coalescing

    Compute capability 1.2 and higher:
    The MMU is more advanced and relaxes the coalescing requirements
    Coalescing is achieved for any pattern of addresses that fits into a segment of size: 32B for 8-bit words, 64B for 16-bit words, 128B for 32- and 64-bit words
    Smaller transactions may be issued to avoid wasted bandwidth due to unused words
    Exact rules in the Programming Guide

    [Figure: example access pattern served by 1 transaction on a 64B segment]
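    A small sketch contrasting a coalesced and a strided access pattern (the stride parameter is illustrative):

    // Hedged sketch: consecutive threads touching consecutive words coalesce into few transactions.
    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread k reads word k: coalesced
        if (i < n) out[i] = in[i];
    }

    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;  // scattered: many transactions
        if (i < n) out[i] = in[i];
    }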

  • Slide 40/47: Take Advantage of Shared Memory

    Hundreds of times faster than global memory

    Threads can cooperate via shared memory

    Use one / a few threads to load / compute data shared by all threads

    Use it to avoid non-coalesced access:
    Stage loads and stores in shared memory to re-order non-coalesceable addressing

  • Slide 41/47: Use Parallelism Efficiently

    Partition your computation to keep the GPU multiprocessors equally busy:
    Many threads, many thread blocks

    Keep threads' resource usage low enough to support multiple blocks per multiprocessor
    Resources: registers, shared memory

  • Slide 42/47: Host-Device Data Transfers

    Device-host memory bandwidth is much lower than device-device bandwidth:
    8 GB/s peak (PCIe x16 Gen 2) vs. 102 GB/s peak (Tesla C1060)

    Minimize transfers:
    Don't transfer intermediate data: it can be allocated, operated on, and deallocated without ever copying it to host memory

    Group transfers:
    One large transfer is much better than many small ones

  • Slide 43/47: Overlapping Data Transfers and Computation

    The Stream and Async APIs allow overlapping host-device data transfers with computation:
    CPU computation can overlap data transfers on all CUDA-capable devices
    On devices with "concurrent copy and execution" (compute capability >= 1.1), kernel computation can overlap data transfers, controlled via streams and events.

    Stream = a sequence of CUDA calls that execute in order
    Calls in different streams can be interleaved
    The stream ID is an argument to async calls and kernel launches
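    A minimal sketch of the stream-based overlap described above, splitting the work across two streams (the process kernel and the assumption that n is even are illustrative; pinned host memory is required for async copies):

    // Hedged sketch: overlap host-device copies with kernel execution using two streams.
    #include <cuda_runtime.h>

    __global__ void process(float *p, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] += 1.0f;
    }

    void run_overlapped(int n)                        // assumes n is even
    {
        int half = n / 2;
        size_t halfBytes = half * sizeof(float);
        float *h = 0, *d = 0;
        cudaMallocHost((void**)&h, 2 * halfBytes);    // pinned host memory
        cudaMalloc((void**)&d, 2 * halfBytes);

        cudaStream_t s[2];
        for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

        for (int i = 0; i < 2; ++i) {
            float *hs = h + i * half, *ds = d + i * half;
            cudaMemcpyAsync(ds, hs, halfBytes, cudaMemcpyHostToDevice, s[i]);
            process<<<(half + 255) / 256, 256, 0, s[i]>>>(ds, half);   // 4th launch argument = stream
            cudaMemcpyAsync(hs, ds, halfBytes, cudaMemcpyDeviceToHost, s[i]);
        }
        for (int i = 0; i < 2; ++i) { cudaStreamSynchronize(s[i]); cudaStreamDestroy(s[i]); }
        cudaFreeHost(h);
        cudaFree(d);
    }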

  • Slide 44/47: Shared Memory

    ~A hundred times faster than global memory

    Use it to cache data from global memory accesses

    Use it to avoid non-coalesced access:
    Stage loads and stores in shared memory to re-order non-coalesceable addressing

    Threads can cooperate via shared memory:
    share results with each other
    contribute to a common result, e.g. a block min/max/avg

  • Slide 45/47: Grid/Block Size Heuristics

    # of blocks > # of multiprocessors
    So all multiprocessors have at least one block to execute

    # of blocks / # of multiprocessors > 2
    Multiple blocks can run concurrently on a multiprocessor
    Blocks that aren't waiting at a __syncthreads() keep the hardware busy
    Subject to resource availability: registers, shared memory

    # of blocks > 100 to scale to future devices
    Blocks are executed in pipeline fashion
    1000 blocks per grid will scale across multiple generations

  • Slide 46/47: Accuracy

    GPU and CPU results may differ, but they are equally accurate (to the specified ulp accuracy)

    CPU operations aren't strictly limited to 0.5 ulp:
    sequences of operations can be even more accurate due to 80-bit extended-precision ALUs
    Compare GPU calculations to CPU SSE

    And: floating-point arithmetic is not associative!

    Complex area (ask if unsure)

  • Slide 47/47: Summary

    GPU hardware can achieve great performance on data-parallel computations if you follow a few simple guidelines:

    Use parallelism efficiently
    Coalesce memory accesses if possible
    Take advantage of shared memory
    Explore other memory spaces: Texture, Constant
    (Reduce shared memory bank conflicts)

    See the Programming Guide, Best Practices Guide and Reference Manual

    If that doesn't help: ask your local DevTech-Compute engineer :)