Transcript of Programming Massively Parallel Processors Using CUDA & C++AMP, Lecture 1 - Introduction (Wen-mei Hwu, Izzat El Hajj, CEA-EDF-Inria Summer School 2013)

Page 1

Programming Massively Parallel Processors Using CUDA & C++AMP

Lecture 1 - Introduction

Wen-mei Hwu, Izzat El Hajj
CEA-EDF-Inria Summer School 2013

Page 2

Course Goals

• Learn how to program massively parallel processors and achieve
  – high performance
  – functionality and maintainability
  – scalability across future generations

• Technical subjects
  – principles and patterns of parallel algorithms
  – processor architecture features and constraints
  – programming API, tools and techniques

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 3

Text/Notes

• D. Kirk and W. Hwu, “Programming Massively Parallel Processors – A Hands-on Approach,” 2nd Edition, Morgan Kaufmann Publishers, 2012

• NVIDIA, NVIDIA CUDA C Programming Guide, Version 5.0, NVIDIA, 2013 (reference book)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 4

Course Sessions Overview

Monday: Lecture 1, Lecture 2, Lab Session 1

Tuesday: Lecture 3, Lecture 4

Wednesday: Lecture 5, Lecture 6, Lab Session 2

Thursday: Lab Session 3

Friday: Lab Session 4, Lab Session 5

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 5

Blue Waters Hardware

• Cray System & Storage cabinets: >300
• Compute nodes: >25,000
• Usable Storage Bandwidth: >1 TB/s
• System Memory: >1.5 Petabytes
• Memory per core module: 4 GB
• Gemini Interconnect Topology: 3D Torus
• Usable Storage: >25 Petabytes
• Peak performance: >11.5 Petaflops
• Number of AMD Interlagos processors: >49,000
• Number of AMD x86 core modules: >380,000
• Number of NVIDIA Kepler GPUs: >4,000

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 6

Cray XK7 Compute Node

[Node diagram: host processor and GPU connected by PCIe Gen2, HT3 links to host memory and to the Gemini interconnect, with torus links in the X, Y, and Z directions]

XK7 Compute Node Characteristics
• Host Processor: AMD Series 6200 (Interlagos)
• Host Memory: 32 GB, 1600 MT/s DDR3
• GPU: NVIDIA Tesla X2090, 6 GB GDDR5 memory capacity (NVIDIA Keplers in the final installation)
• Gemini High Speed Interconnect

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 7

CPUs and GPUs have very different design philosophies

GPU: Throughput Oriented Cores
[Chip diagram: many compute units, each with cache/local memory, registers, SIMD units, and heavy threading]

CPU: Latency Oriented Cores
[Chip diagram: a few cores, each with a local cache, registers, a SIMD unit, and substantial control logic]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 8

CPUs: Latency Oriented Design

• Large caches
  – Convert long latency memory accesses to short latency cache accesses

• Sophisticated control
  – Branch prediction for reduced branch latency
  – Data forwarding for reduced data latency

• Powerful ALU
  – Reduced operation latency

[Chip diagram: a CPU with a large cache, control logic, and a few powerful ALUs, attached to DRAM]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 9

GPUs: Throughput Oriented Design

• Small caches
  – To boost memory throughput

• Simple control
  – No branch prediction
  – No data forwarding

• Energy efficient ALUs
  – Many, long latency but heavily pipelined for high throughput

• Require massive number of threads to tolerate latencies

[Chip diagram: a GPU with many small cores, attached to DRAM]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 10

Winning Applications Use Both CPU and GPU

• CPUs for sequential parts where latency matters
  – CPUs can be 10+X faster than GPUs for sequential code

• GPUs for parallel parts where throughput wins
  – GPUs can be 10+X faster than CPUs for parallel code

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 11

Heterogeneous parallel computing is catching on.

• 280 submissions to GPU Computing Gems
  – 110 articles included in two volumes

Application areas include: Financial Analysis, Scientific Simulation, Engineering Simulation, Data Intensive Analytics, Medical Imaging, Digital Audio Processing, Computer Vision, Digital Video Processing, Biomedical Informatics, Electronic Design Automation, Statistical Modeling, Ray Tracing Rendering, Interactive Physics, Numerical Methods

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 12

What is at stake?

• Scalable and portable software lasts through many hardware generations

Scalable algorithms and libraries can be the best legacy we can leave behind from this era

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 13

Introduction to CUDA C

Page 14

Objective

• To learn about data parallelism and the basic features of CUDA C that enable exploitation of data parallelism
  – Hierarchical thread organization
  – Main interfaces for launching parallel execution
  – Thread index(es) to data index mapping

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 15

Parallel Programming Work Flow

• Identify compute intensive parts of an application
• Adopt scalable algorithms
• Optimize data arrangements to maximize locality
• Performance tuning
• Pay attention to code portability and maintainability

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 16

Massive Parallelism - Regularity

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 17

CUDA/OpenCL – Execution Model

• Integrated host+device application C program
  – Serial or modestly parallel parts in host C code
  – Highly parallel parts in device SPMD kernel C code

Serial Code (host)
        . . .
Parallel Kernel (device)
        KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
        . . .
Parallel Kernel (device)
        KernelB<<< nBlk, nTid >>>(args);

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 18

The Von-Neumann Model

[Diagram: Memory and I/O connected to the Processor; the Processor contains a Processing Unit (ALU and Register File) and a Control Unit (PC and IR)]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 19

Arrays of Parallel Threads

• A CUDA kernel is executed by a grid (array) of threads
  – Each thread is a virtualized Von-Neumann processor
  – All threads in a grid run the same kernel code (SPMD)
  – Each thread has an index that it uses to compute memory addresses and make control decisions

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

tid = compute thread index
work(tid)
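As a concrete illustration of this pattern, here is a minimal sketch (the kernel name scaleKernel and the scaling operation are illustrative, not from the slides; the built-in index variables are introduced on the following slides): every thread runs the same code, computes its own index, and works on its own element.

__global__ void scaleKernel(float* data, float factor, int n)
{
    // each thread computes its own index ...
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // ... and uses it to pick the element it works on
    if (tid < n) data[tid] = data[tid] * factor;
}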

Page 20

Thread Blocks: Scalable Cooperation

• Divide thread array into multiple blocks
  – Threads within a block cooperate via shared memory, atomic operations and barrier synchronization (see the sketch below)
  – Threads in different blocks cannot cooperate
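A minimal sketch of intra-block cooperation (the kernel blockSumKernel is illustrative, not from the slides, and assumes a launch with 256 threads per block): threads stage data in shared memory, meet at a barrier, and one thread per block combines the staged values with an atomic operation.

// Each block sums its 256 elements of data into *result
__global__ void blockSumKernel(float* data, float* result, int n)
{
    __shared__ float partial[256];                     // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? data[i] : 0.0f;
    __syncthreads();                                   // barrier: wait for all threads in the block
    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int j = 0; j < blockDim.x; ++j) sum += partial[j];
        atomicAdd(result, sum);                        // atomic add across blocks (float atomicAdd needs compute capability 2.0+)
    }
}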

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 21

Computing the Thread Index

gridDim.x: number of blocks in the grid

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 22

Computing the Thread Index

gridDim.x: number of blocks in the grid
blockIdx.x: position of the block in the grid

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 23

Computing the Thread Index

gridDim.x: number of blocks in the grid
blockIdx.x: position of the block in the grid
blockDim.x: number of threads in the block

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 24

Computing the Thread Index

gridDim.x: number of blocks in the grid
blockIdx.x: position of the block in the grid
blockDim.x: number of threads in the block
threadIdx.x: position of the thread in the block

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 25

Computing the Thread Index

gridDim.x: number of blocks in the grid
blockIdx.x: position of the block in the grid
blockDim.x: number of threads in the block
threadIdx.x: position of the thread in the block

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

tid = blockIdx.x*blockDim.x + threadIdx.x
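For example, with 256 threads per block (blockDim.x = 256), the thread with threadIdx.x = 5 in block blockIdx.x = 2 computes tid = 2*256 + 5 = 517, so consecutive threads within a block cover consecutive elements and each block covers its own 256-element chunk of the data.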

Page 26

blockIdx and threadIdx

• Each thread uses indices to decide what data to work on
  – blockIdx: 1D, 2D, or 3D (CUDA 4.0)
  – threadIdx: 1D, 2D, or 3D

• Simplifies memory addressing when processing multidimensional data (see the 2D sketch below)
  – Image processing
  – Solving PDEs on volumes
  – …
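A minimal 2D-indexing sketch for image processing (the kernel brightenKernel, the row-major image layout, and the brightening operation are illustrative assumptions, not from the slides):

__global__ void brightenKernel(unsigned char* img, int width, int height, int offset)
{
    // 2D block and thread indices map directly to pixel coordinates
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        int idx = row * width + col;       // row-major pixel index
        int v = img[idx] + offset;
        img[idx] = (v > 255) ? 255 : v;    // clamp to the 8-bit range
    }
}

A matching launch might use dim3 block(16, 16, 1) and dim3 grid((width+15)/16, (height+15)/16, 1).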

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 27

Vector Addition – Conceptual View

[Figure: vector A (A[0], A[1], A[2], A[3], A[4], ..., A[N-1]) and vector B (B[0], B[1], B[2], B[3], B[4], ..., B[N-1]) are added element by element to produce vector C (C[0], C[1], C[2], C[3], C[4], ..., C[N-1])]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 28

Vector Addition – Traditional C Code

// Compute vector sum C = A+B
void vecAdd(float* A, float* B, float* C, int n)
{
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}

int main()
{
    // Memory allocation for A_h, B_h, and C_h
    // I/O to read A_h and B_h, N elements
    ...
    vecAdd(A_h, B_h, C_h, N);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 29

Heterogeneous Computing vecAdd Host Code

#include <cuda.h>
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;
    ...
1.  // Allocate device memory for A, B, and C
    // Copy A and B to device memory
2.  // Kernel launch code – to have the device
    // perform the actual vector addition
3.  // Copy C from the device memory
    // Free device vectors
}

[Figure: the CPU and host memory handle Parts 1 and 3; the GPU and device memory handle Part 2]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 30

Control Flow

[Sequence diagram, CPU and GPU: the CPU calls cudaMalloc(...) to allocate device memory, cudaMemcpy(...) to move input data to the device, launches kernel<<<...>>>(...) which executes on the GPU, then calls cudaMemcpy(...) to copy results back and cudaFree(...) to release device memory]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 31

Partial Overview of CUDA Memories

• Device code can:
  – R/W per-thread registers
  – R/W per-grid global memory

• Host code can:
  – Transfer data to/from per-grid global memory

[Figure: a (device) grid containing global memory and blocks (0,0) and (1,0); each block contains threads (0,0) and (1,0), each with its own registers; the host connects to the device's global memory]

We will cover more later.
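As a small illustration of these access rules (the kernel incrementKernel is illustrative, not from the slides): automatic variables in device code live in per-thread registers, while data reached through pointers passed in from the host lives in per-grid global memory.

__global__ void incrementKernel(float* data, int n)    // data points to global memory
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // i lives in a per-thread register
    if (i < n) data[i] = data[i] + 1.0f;                 // read and write global memory
}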

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 32

CUDA Device Memory Management API functions

• cudaMalloc() (see the usage sketch below)
  – Allocates object in the device global memory
  – Two parameters
    • Address of a pointer to the allocated object
    • Size of allocated object in terms of bytes

• cudaFree()
  – Frees object from device global memory
    • Pointer to freed object

[Figure: same device grid / global-memory diagram as the previous slide]
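A minimal usage sketch (the variable names are illustrative): the address of the pointer is passed so that cudaMalloc() can write the device address into it, and the same pointer is later handed to cudaFree().

float *A_d;                          // will hold an address in device global memory
int size = 1024 * sizeof(float);     // size in bytes
cudaMalloc((void **) &A_d, size);    // address of the pointer, then the byte count
...                                  // use A_d in kernels and cudaMemcpy() calls
cudaFree(A_d);                       // release the device allocation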

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 33

Host-Device Data Transfer API functions

• cudaMemcpy()
  – Memory data transfer
  – Requires four parameters (see the usage sketch below)
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type/direction of transfer
  – Transfer to device is asynchronous

[Figure: same device grid / global-memory diagram as the previous slides]
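A minimal usage sketch (variable names illustrative), one copy in each direction:

cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);   // destination, source, bytes, direction
cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);   // copy a result back to the host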

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 34

void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

1.  // Transfer A and B to device memory
    cudaMalloc((void **) &A_d, size);
    cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &B_d, size);
    cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);

    // Allocate device memory for C
    cudaMalloc((void **) &C_d, size);

2.  // Kernel invocation code – to be shown later
    ...

3.  // Transfer C from device to host
    cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);
    // Free device memory for A, B, C
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 35

Example: Vector Addition Kernel

[Figure: input vectors 1 and 2 and the output vector, with 1 thread assigned per element]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 36

Example: Vector Addition Kernel

[Figure: input vectors 1 and 2 and the output vector, showing quantization of the global thread count due to block organization]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 37

Example: Vector Addition Kernel

[Figure: input vectors 1 and 2 and the output vector, showing boundary checks]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 38

Example: Vector Addition Kernel

[Figure: input vectors 1 and 2 and the output vector, showing data padding]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 39

Example: Vector Addition Kernel (Device Code)

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

int vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 40

Example: Vector Addition Kernel (Host Code)

The same code as on the previous slide; this slide highlights the host-side function vecAdd, which computes the launch configuration and launches vecAddKernel on the device.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 41

More on Kernel Launch (Host Code)

int vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each
    dim3 DimGrid(n/256, 1, 1);
    if (n%256) DimGrid.x++;
    dim3 DimBlock(256, 1, 1);

    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking (see the sketch below)
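One common way to block the host until a kernel has finished is cudaDeviceSynchronize(). A minimal sketch (not from the slides):

vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
cudaDeviceSynchronize();   // block the host until all previously launched device work completes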

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 42

Kernel execution in a nutshell

__global__
void vecAddKernel(float *A_d, float *B_d, float *C_d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

__host__
void vecAdd()
{
    dim3 DimGrid(ceil(n/256.0), 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

[Figure: the kernel's blocks (Blk 0 ... Blk N-1) are scheduled onto the GPU's multiprocessors (M0 ... Mk), which share the device RAM]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 43

More on CUDA Function Declarations

                                   Executed on the:    Only callable from the:
__device__ float DeviceFunc()      device              device
__global__ void  KernelFunc()      device              host
__host__   float HostFunc()        host                host

• __global__ defines a kernel function
  • Each “__” consists of two underscore characters
  • A kernel function must return void
• __device__ and __host__ can be used together (see the sketch below)
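A minimal sketch of __device__ and __host__ used together (the helper square() is illustrative, not from the slides): the function is compiled once for the host and once for the device, so it can be called from both sides.

__host__ __device__ float square(float x)
{
    return x * x;   // callable from host code and from kernels
}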

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 44

Compiling A CUDA Program

[Flow: integrated C programs with CUDA extensions go to the NVCC Compiler, which separates Host Code (passed to the host C compiler/linker) from Device Code in PTX form (passed to the device just-in-time compiler); the result runs on a heterogeneous computing platform with CPUs and GPUs]
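In practice a single command such as nvcc -o vecadd vecAdd.cu drives this whole flow (the file and output names are illustrative): nvcc splits the host and device code, compiles both, and links the final executable.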

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 45

ANY MORE QUESTIONS?

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013