CUDA Memory Types
CPS343
Parallel and High Performance Computing
Spring 2013
CPS343 (Parallel and HPC) CUDA Memory Types Spring 2013 1 / 24
Outline
1 Device memory
  CUDA memory types and uses
  CUDA Type qualifiers
  Programming Scenarios
2 Matrix multiplication
  Matrix-matrix multiplication
  Global memory version
  Shared memory version
Acknowledgements
Some material used in creating these slides comes from
NVIDIA's CUDA C Programming Guide
http://www.cvg.ethz.ch/teaching/2011spring/gpgpu/cuda memory.pdf
Compute Capability 1.x
Global memory (read and write)
  slow and uncached
  requires sequential, aligned 16-byte reads/writes to be fast (coalesced read/write)
Texture memory (read only)
  cache optimized for 2D access patterns
Constant memory
  where constants and kernel arguments are stored
  slow, but with cache
Shared memory (16 KB per SM)
  fast, but subject to bank conflicts
  permits exchange of data between threads in a block
Local memory
  used for whatever doesn't fit into registers
  part of global memory, so slow and uncached
Registers
  fast, but thread scope only
Compute Capability 2.x
Global memory (read and write)
  slow, but cached
Texture memory (read only)
  cache optimized for 2D access patterns
Constant memory
  where constants and kernel arguments are stored
  special LoaD Uniform (LDU) instruction
Shared memory (48 KB per SM)
  fast, but subject to (different) bank conflicts
Local memory
  used for whatever doesn't fit into registers
  part of global memory; slow, but now cached
Registers
  32,768 32-bit registers per SM
Memory limitations
Global Memory
Best if 64 or 128 bytes (16 or 32 single-precision values, 8 or 16 double-precision values) are read at once...
Coalesced read/writes:
  parallel reads/writes from threads in a block...
  ...to sequential memory locations...
  ...with appropriate alignment
...otherwise up to 10x slower!
Shared Memory
Fastest if all threads read from the same shared memory location and/or all threads index a shared array via a permutation (e.g. linear reads/writes)
otherwise there can be bank conflicts
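As a sketch (not from the slides) of what coalesced versus strided global access looks like in code, assuming single-precision data and illustrative kernel names:

```cuda
// Coalesced: consecutive threads read consecutive addresses, so each
// warp's loads combine into a small number of memory transactions.
__global__ void coalescedRead( float* out, const float* in, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n ) out[i] = in[i];
}

// Strided: consecutive threads touch addresses `stride` elements apart,
// so one warp's loads span many memory segments and cannot coalesce.
__global__ void stridedRead( float* out, const float* in, int n, int stride )
{
    int i = ( blockIdx.x * blockDim.x + threadIdx.x ) * stride;
    if ( i < n ) out[i] = in[i];
}
```

On compute capability 1.x hardware the strided version can be an order of magnitude slower; on 2.x the cache softens, but does not remove, the penalty.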
CUDA Type qualifiers
Variable declaration              Memory    Scope   Lifetime
int localVar;                     register  thread  thread
int localArray[10];               local     thread  thread
__shared__ int sharedVar;         shared    block   block
__device__ int globalVar;         global    grid    application
__constant__ int constantVar;     constant  grid    application
Automatic variables without any qualifier reside in a register...
...except arrays (reside in local memory)
...or if there are not enough registers
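To make the qualifier table concrete, a minimal sketch showing where each qualifier appears in practice; all names here are illustrative:

```cuda
__constant__ float coeff[16];  // constant memory: grid scope, application lifetime
__device__   int   flag;       // global memory: grid scope, application lifetime

__global__ void example( float* data )
{
    int tid = threadIdx.x;                      // automatic scalar: lives in a register
    float localArray[4] = { 0.f, 0.f, 0.f, 0.f }; // automatic array: placed in local memory
    __shared__ float tile[128];                 // shared memory: one copy per block

    tile[tid] = data[tid] * coeff[0];
    __syncthreads();
    data[tid] = tile[tid] + localArray[0];
}
```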
CUDA Type performance
Variable declaration              Memory    Performance penalty
int localVar;                     register  1x
int localArray[10];               local     100x
__shared__ int sharedVar;         shared    1x
__device__ int globalVar;         global    100x
__constant__ int constantVar;     constant  1x
Scenario 1
Task:
  Load data from global memory
  Do thread-local computations
  Store result to global memory
Solution:
Load data from global memory (coalesced)
float a = d_ptr[blockIdx.x*blockDim.x + threadIdx.x];
Do computation with registers
float res = f(a);
Store result (coalesced)
d_ptr[blockIdx.x*blockDim.x + threadIdx.x] = res;
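Assembled into a single kernel, the three steps might look like the sketch below; the function f and the kernel name are placeholders:

```cuda
// Scenario 1 as one kernel: coalesced load, register computation,
// coalesced store. f() stands in for the thread-local work.
__device__ float f( float x ) { return 2.0f * x + 1.0f; }

__global__ void scenario1( float* d_ptr, int n )
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if ( idx < n )
    {
        float a   = d_ptr[idx];  // coalesced load into a register
        float res = f( a );      // thread-local computation
        d_ptr[idx] = res;        // coalesced store
    }
}
```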
Scenario 2
Task:
  Load data from global memory
  Do block-local computations
  Store result to global memory
Solution:
Load data to shared memory
__shared__ float a_sh[BLOCK_SIZE];
int idx = blockIdx.x*blockDim.x + threadIdx.x;
a_sh[threadIdx.x] = d_ptr[idx];
__syncthreads(); // important!
Do computation
float res = f(a_sh[threadIdx.x]);
Store result (coalesced)
d_ptr[idx] = res;
Matrix-matrix multiplication
Consider our familiar matrix product C = AB:
for ( i = 0; i < A.height; i++ )
{
for ( j = 0; j < B.width; j++ )
{
c[i][j] = 0;
for ( k = 0; k < A.width; k++ )
{
c[i][j] += a[i][k] * b[k][j];
}
}
}
How many times is each element of matrix A accessed? B.width
How many times is each element of matrix B accessed? A.height
How many times is each element of matrix C accessed? A.width
Matrix-matrix multiplication
Consider an element c[row][col]. There are B.width elements in a row of C and A.height elements in a column of C.
To compute each of these elements, we access a row of A and a column of B.
We therefore access each row of A B.width times and each column of B A.height times.
Kernel development
A CUDA kernel to compute the matrix product is straightforward
In this simple implementation we assume that our matrices are square N × N and stored using linear arrays
Access to the (i, j) element is facilitated via the macro
#define IDX(i,j,n) ((i)*(n)+(j))
Matrix multiply: Global memory version kernel
// matrix-matrix kernel using only global memory
__global__ void matmulGlobal( float* c, float* a, float* b,
                              int N )
{
    // compute row and column for our matrix element
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if ( col < N && row < N )
    {
        float sum = 0.0;
        for ( int k = 0; k < N; k++ )
        {
            sum += a[IDX(row,k,N)] * b[IDX(k,col,N)];
        }
        c[IDX(row,col,N)] = sum;
    }
}
Note: sum will be stored in a register, so this kernel makes only one reference to C per element.
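A host-side wrapper that launches this kernel might look like the sketch below; the block size of 16, the name multiply, and the absence of error checking are all simplifying assumptions:

```cuda
#define BLOCK 16  // assumed thread-block edge: 16x16 = 256 threads

// Sketch: allocate device memory, copy operands, launch, copy result back.
void multiply( float* c, const float* a, const float* b, int N )
{
    size_t bytes = (size_t) N * N * sizeof(float);
    float *d_a, *d_b, *d_c;
    cudaMalloc( &d_a, bytes );
    cudaMalloc( &d_b, bytes );
    cudaMalloc( &d_c, bytes );
    cudaMemcpy( d_a, a, bytes, cudaMemcpyHostToDevice );
    cudaMemcpy( d_b, b, bytes, cudaMemcpyHostToDevice );

    // enough blocks to cover the matrix even when N % BLOCK != 0
    dim3 block( BLOCK, BLOCK );
    dim3 grid( ( N + BLOCK - 1 ) / BLOCK, ( N + BLOCK - 1 ) / BLOCK );
    matmulGlobal<<< grid, block >>>( d_c, d_a, d_b, N );

    cudaMemcpy( c, d_c, bytes, cudaMemcpyDeviceToHost );
    cudaFree( d_a ); cudaFree( d_b ); cudaFree( d_c );
}
```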
Matrix multiply: Shared memory version kernel (part 1 of 3)
// matrix-matrix kernel using global and shared memory
__global__ void matmulShared( float* c, float* a, float* b,
                              int N )
{
    // compute row and column for our matrix element
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // compute the number of blocks we need
    int M = ( N + BlockSize - 1 ) / BlockSize;
Matrix multiply: Shared memory version kernel (part 2 of 3)
    float sum = 0.0;
    for ( int m = 0; m < M; m++ )
    {
        // all threads in block copy their element from matrix a and
        // matrix b to shared memory; threads past the matrix edge
        // load zeros so the global reads stay in bounds
        __shared__ float a_s[BlockSize][BlockSize];
        __shared__ float b_s[BlockSize][BlockSize];
        int cc = m * BlockSize + threadIdx.x;
        int rr = m * BlockSize + threadIdx.y;
        a_s[threadIdx.y][threadIdx.x] =
            ( row < N && cc < N ) ? a[IDX(row,cc,N)] : 0.0f;
        b_s[threadIdx.y][threadIdx.x] =
            ( rr < N && col < N ) ? b[IDX(rr,col,N)] : 0.0f;
        // make sure all threads are finished
        __syncthreads();
Matrix multiply: Shared memory version kernel (part 3 of 3)
        // compute partial sum using shared memory block;
        // K is the block size except at the right or bottom edge,
        // where we may not have a full block of data
        int K = ( m == M - 1 ? N - m * BlockSize : BlockSize );
        for ( int k = 0; k < K; k++ )
        {
            sum += a_s[threadIdx.y][k] * b_s[k][threadIdx.x];
        }
        __syncthreads();
    }
    if ( col < N && row < N ) c[IDX(row,col,N)] = sum;
}