Transcript of Programming Massively Parallel Processors Using CUDA & C++AMP, Lecture 1 - Introduction (Wen-mei Hwu, Izzat El Hajj, CEA-EDF-Inria Summer School 2013)

Page 1

Programming Massively Parallel Processors Using CUDA & C++AMP

Lecture 1 - Introduction

Wen-mei Hwu, Izzat El Hajj
CEA-EDF-Inria Summer School 2013

Page 2

Course Goals

• Learn how to program massively parallel processors and achieve
  – high performance
  – functionality and maintainability
  – scalability across future generations

• Technical subjects
  – principles and patterns of parallel algorithms
  – processor architecture features and constraints
  – programming API, tools and techniques

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 3

Text/Notes

• D. Kirk and W. Hwu, “Programming Massively Parallel Processors – A Hands-on Approach,” 2nd Edition, Morgan Kaufmann Publishers, 2012

• NVIDIA, NVIDIA CUDA C Programming Guide, Version 5.0, NVIDIA, 2013 (reference book)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 4

Course Sessions Overview

Monday: Lecture 1, Lecture 2, Lab Session 1

Tuesday: Lecture 3, Lecture 4

Wednesday: Lecture 5, Lecture 6, Lab Session 2

Thursday: Lab Session 3

Friday: Lab Session 4, Lab Session 5

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 5

Blue Waters Hardware

• Cray System & Storage cabinets: >300
• Compute nodes: >25,000
• Usable Storage Bandwidth: >1 TB/s
• System Memory: >1.5 Petabytes
• Memory per core module: 4 GB
• Gemini Interconnect Topology: 3D Torus
• Usable Storage: >25 Petabytes
• Peak performance: >11.5 Petaflops
• Number of AMD Interlagos processors: >49,000
• Number of AMD x86 core modules: >380,000
• Number of NVIDIA Kepler GPUs: >4,000

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 6

Cray XK7 Compute Node

[Node diagram: host processor and GPU connected by PCIe Gen2, HT3 links to host memory and to the Gemini interconnect, with torus links in the X, Y, and Z directions]

XK7 Compute Node Characteristics
• Host Processor: AMD Series 6200 (Interlagos)
• Host Memory: 32 GB, 1600 MT/s DDR3
• GPU: NVIDIA Tesla X2090, 6 GB GDDR5 memory capacity (NVIDIA Keplers in the final installation)
• Gemini High Speed Interconnect

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 7

CPUs and GPUs have very different design philosophies

GPU: Throughput Oriented Cores
[Chip diagram: many compute units, each with cache/local memory, registers, SIMD units, and heavy threading]

CPU: Latency Oriented Cores
[Chip diagram: a few cores, each with a local cache, registers, a SIMD unit, and substantial control logic]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 8

CPUs: Latency Oriented Design

• Large caches
  – Convert long latency memory accesses to short latency cache accesses

• Sophisticated control
  – Branch prediction for reduced branch latency
  – Data forwarding for reduced data latency

• Powerful ALU
  – Reduced operation latency

[Chip diagram: a CPU with a large cache, control logic, and a few powerful ALUs, attached to DRAM]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 9

GPUs: Throughput Oriented Design

• Small caches
  – To boost memory throughput

• Simple control
  – No branch prediction
  – No data forwarding

• Energy efficient ALUs
  – Many, long latency but heavily pipelined for high throughput

• Require massive number of threads to tolerate latencies

[Chip diagram: a GPU with many small cores, attached to DRAM]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 10

Winning Applications Use Both CPU and GPU

• CPUs for sequential parts where latency matters
  – CPUs can be 10+X faster than GPUs for sequential code

• GPUs for parallel parts where throughput wins
  – GPUs can be 10+X faster than CPUs for parallel code

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 11

Heterogeneous parallel computing is catching on.

• 280 submissions to GPU Computing Gems
  – 110 articles included in two volumes

Application areas include: Financial Analysis, Scientific Simulation, Engineering Simulation, Data Intensive Analytics, Medical Imaging, Digital Audio Processing, Computer Vision, Digital Video Processing, Biomedical Informatics, Electronic Design Automation, Statistical Modeling, Ray Tracing Rendering, Interactive Physics, Numerical Methods

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 12

What is at stake?

• Scalable and portable software lasts through many hardware generations

Scalable algorithms and libraries can be the best legacy we can leave behind from this era

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 13

Introduction to CUDA C

Page 14

Objective

• To learn about data parallelism and the basic features of CUDA C that enable exploitation of data parallelism
  – Hierarchical thread organization
  – Main interfaces for launching parallel execution
  – Thread index(es) to data index mapping

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 15

Parallel Programming Work Flow

• Identify compute intensive parts of an application
• Adopt scalable algorithms
• Optimize data arrangements to maximize locality
• Performance tuning
• Pay attention to code portability and maintainability

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 16

Massive Parallelism - Regularity

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 17

CUDA/OpenCL – Execution Model

• Integrated host+device application C program
  – Serial or modestly parallel parts in host C code
  – Highly parallel parts in device SPMD kernel C code

Serial Code (host)
        . . .
Parallel Kernel (device)
        KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
        . . .
Parallel Kernel (device)
        KernelB<<< nBlk, nTid >>>(args);

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 18

The Von-Neumann Model

[Diagram: Memory and I/O connected to the Processor; the Processor contains a Processing Unit (ALU and Register File) and a Control Unit (PC and IR)]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 19

Arrays of Parallel Threads

• A CUDA kernel is executed by a grid (array) of threads
  – Each thread is a virtualized Von-Neumann processor
  – All threads in a grid run the same kernel code (SPMD)
  – Each thread has an index that it uses to compute memory addresses and make control decisions

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

tid = compute thread index
work(tid)
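As a concrete illustration of this pattern, here is a minimal sketch (the kernel name scaleKernel and the scaling operation are illustrative, not from the slides; the built-in index variables are introduced on the following slides): every thread runs the same code, computes its own index, and works on its own element.

__global__ void scaleKernel(float* data, float factor, int n)
{
    // each thread computes its own index ...
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // ... and uses it to pick the element it works on
    if (tid < n) data[tid] = data[tid] * factor;
}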

Page 20

Thread Blocks: Scalable Cooperation

• Divide thread array into multiple blocks
  – Threads within a block cooperate via shared memory, atomic operations and barrier synchronization (see the sketch below)
  – Threads in different blocks cannot cooperate
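A minimal sketch of intra-block cooperation (the kernel blockSumKernel is illustrative, not from the slides, and assumes a launch with 256 threads per block): threads stage data in shared memory, meet at a barrier, and one thread per block combines the staged values with an atomic operation.

// Each block sums its 256 elements of data into *result
__global__ void blockSumKernel(float* data, float* result, int n)
{
    __shared__ float partial[256];                     // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? data[i] : 0.0f;
    __syncthreads();                                   // barrier: wait for all threads in the block
    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int j = 0; j < blockDim.x; ++j) sum += partial[j];
        atomicAdd(result, sum);                        // atomic add across blocks (float atomicAdd needs compute capability 2.0+)
    }
}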

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 21

Computing the Thread Index

gridDim.x: number of blocks in the grid

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 22

Computing the Thread Index

gridDim.x: number of blocks in the grid
blockIdx.x: position of the block in the grid

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 23

Computing the Thread Index

gridDim.x: number of blocks in the grid
blockIdx.x: position of the block in the grid
blockDim.x: number of threads in the block

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 24

Computing the Thread Index

gridDim.x: number of blocks in the grid
blockIdx.x: position of the block in the grid
blockDim.x: number of threads in the block
threadIdx.x: position of the thread in the block

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 25

Computing the Thread Index

gridDim.x: number of blocks in the grid
blockIdx.x: position of the block in the grid
blockDim.x: number of threads in the block
threadIdx.x: position of the thread in the block

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

tid = blockIdx.x*blockDim.x + threadIdx.x
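For example, with 256 threads per block (blockDim.x = 256), the thread with threadIdx.x = 5 in block blockIdx.x = 2 computes tid = 2*256 + 5 = 517, so consecutive threads within a block cover consecutive elements and each block covers its own 256-element chunk of the data.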

Page 26

blockIdx and threadIdx

• Each thread uses indices to decide what data to work on
  – blockIdx: 1D, 2D, or 3D (CUDA 4.0)
  – threadIdx: 1D, 2D, or 3D

• Simplifies memory addressing when processing multidimensional data (see the 2D sketch below)
  – Image processing
  – Solving PDEs on volumes
  – …
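A minimal 2D-indexing sketch for image processing (the kernel brightenKernel, the row-major image layout, and the brightening operation are illustrative assumptions, not from the slides):

__global__ void brightenKernel(unsigned char* img, int width, int height, int offset)
{
    // 2D block and thread indices map directly to pixel coordinates
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        int idx = row * width + col;       // row-major pixel index
        int v = img[idx] + offset;
        img[idx] = (v > 255) ? 255 : v;    // clamp to the 8-bit range
    }
}

A matching launch might use dim3 block(16, 16, 1) and dim3 grid((width+15)/16, (height+15)/16, 1).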

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 27

Vector Addition – Conceptual View

[Figure: vector A (A[0], A[1], A[2], A[3], A[4], ..., A[N-1]) and vector B (B[0], B[1], B[2], B[3], B[4], ..., B[N-1]) are added element by element to produce vector C (C[0], C[1], C[2], C[3], C[4], ..., C[N-1])]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 28

Vector Addition – Traditional C Code

// Compute vector sum C = A+B
void vecAdd(float* A, float* B, float* C, int n)
{
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}

int main()
{
    // Memory allocation for A_h, B_h, and C_h
    // I/O to read A_h and B_h, N elements
    ...
    vecAdd(A_h, B_h, C_h, N);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 29

Heterogeneous Computing vecAdd Host Code

#include <cuda.h>
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;
    ...
1.  // Allocate device memory for A, B, and C
    // Copy A and B to device memory
2.  // Kernel launch code – to have the device
    // perform the actual vector addition
3.  // Copy C from the device memory
    // Free device vectors
}

[Figure: the CPU and host memory handle Parts 1 and 3; the GPU and device memory handle Part 2]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 30

Control Flow

[Sequence diagram, CPU and GPU: the CPU calls cudaMalloc(...) to allocate device memory, cudaMemcpy(...) to move input data to the device, launches kernel<<<...>>>(...) which executes on the GPU, then calls cudaMemcpy(...) to copy results back and cudaFree(...) to release device memory]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 31

Partial Overview of CUDA Memories

• Device code can:
  – R/W per-thread registers
  – R/W per-grid global memory

• Host code can:
  – Transfer data to/from per-grid global memory

[Figure: a (device) grid containing global memory and blocks (0,0) and (1,0); each block contains threads (0,0) and (1,0), each with its own registers; the host connects to the device's global memory]

We will cover more later.
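As a small illustration of these access rules (the kernel incrementKernel is illustrative, not from the slides): automatic variables in device code live in per-thread registers, while data reached through pointers passed in from the host lives in per-grid global memory.

__global__ void incrementKernel(float* data, int n)    // data points to global memory
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // i lives in a per-thread register
    if (i < n) data[i] = data[i] + 1.0f;                 // read and write global memory
}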

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 32

CUDA Device Memory Management API functions

• cudaMalloc() (see the usage sketch below)
  – Allocates object in the device global memory
  – Two parameters
    • Address of a pointer to the allocated object
    • Size of allocated object in terms of bytes

• cudaFree()
  – Frees object from device global memory
    • Pointer to freed object

[Figure: same device grid / global-memory diagram as the previous slide]
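A minimal usage sketch (the variable names are illustrative): the address of the pointer is passed so that cudaMalloc() can write the device address into it, and the same pointer is later handed to cudaFree().

float *A_d;                          // will hold an address in device global memory
int size = 1024 * sizeof(float);     // size in bytes
cudaMalloc((void **) &A_d, size);    // address of the pointer, then the byte count
...                                  // use A_d in kernels and cudaMemcpy() calls
cudaFree(A_d);                       // release the device allocation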

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 33

Host-Device Data Transfer API functions

• cudaMemcpy()
  – Memory data transfer
  – Requires four parameters (see the usage sketch below)
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type/direction of transfer
  – Transfer to device is asynchronous

[Figure: same device grid / global-memory diagram as the previous slides]
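A minimal usage sketch (variable names illustrative), one copy in each direction:

cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);   // destination, source, bytes, direction
cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);   // copy a result back to the host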

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 34

void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

1.  // Transfer A and B to device memory
    cudaMalloc((void **) &A_d, size);
    cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &B_d, size);
    cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);

    // Allocate device memory for C
    cudaMalloc((void **) &C_d, size);

2.  // Kernel invocation code – to be shown later
    ...

3.  // Transfer C from device to host
    cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);
    // Free device memory for A, B, C
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 35

Example: Vector Addition Kernel

[Figure: input vectors 1 and 2 and the output vector, with 1 thread assigned per element]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 36

Example: Vector Addition Kernel

[Figure: input vectors 1 and 2 and the output vector, showing quantization of the global thread count due to block organization]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 37

Example: Vector Addition Kernel

[Figure: input vectors 1 and 2 and the output vector, showing boundary checks]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 38

Example: Vector Addition Kernel

[Figure: input vectors 1 and 2 and the output vector, showing data padding]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 39

Example: Vector Addition Kernel (Device Code)

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

int vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 40

Example: Vector Addition Kernel (Host Code)

The same code as on the previous slide; this slide highlights the host-side function vecAdd, which computes the launch configuration and launches vecAddKernel on the device.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 41

More on Kernel Launch (Host Code)

int vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each
    dim3 DimGrid(n/256, 1, 1);
    if (n%256) DimGrid.x++;
    dim3 DimBlock(256, 1, 1);

    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking (see the sketch below)
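One common way to block the host until a kernel has finished is cudaDeviceSynchronize(). A minimal sketch (not from the slides):

vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
cudaDeviceSynchronize();   // block the host until all previously launched device work completes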

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 42

Kernel execution in a nutshell

__global__
void vecAddKernel(float *A_d, float *B_d, float *C_d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

__host__
void vecAdd()
{
    dim3 DimGrid(ceil(n/256.0), 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

[Figure: the kernel's blocks (Blk 0 ... Blk N-1) are scheduled onto the GPU's multiprocessors (M0 ... Mk), which share the device RAM]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 43

More on CUDA Function Declarations

                                   Executed on the:    Only callable from the:
__device__ float DeviceFunc()      device              device
__global__ void  KernelFunc()      device              host
__host__   float HostFunc()        host                host

• __global__ defines a kernel function
  • Each “__” consists of two underscore characters
  • A kernel function must return void
• __device__ and __host__ can be used together (see the sketch below)
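A minimal sketch of __device__ and __host__ used together (the helper square() is illustrative, not from the slides): the function is compiled once for the host and once for the device, so it can be called from both sides.

__host__ __device__ float square(float x)
{
    return x * x;   // callable from host code and from kernels
}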

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 44

Compiling A CUDA Program

[Flow: integrated C programs with CUDA extensions go to the NVCC Compiler, which separates Host Code (passed to the host C compiler/linker) from Device Code in PTX form (passed to the device just-in-time compiler); the result runs on a heterogeneous computing platform with CPUs and GPUs]
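In practice a single command such as nvcc -o vecadd vecAdd.cu drives this whole flow (the file and output names are illustrative): nvcc splits the host and device code, compiles both, and links the final executable.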

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

Page 45

ANY MORE QUESTIONS?

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013