Transcript of "Computation to Core Mapping" - Virginia Tech (accel.cs.vt.edu/files/lecture9.pdf)

Computation to Core Mapping
Lessons learned from a simple application

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

A Simple Application: Matrix Multiplication

Used as an example throughout the course

Goal for today: Show the concept of “Computation-to-Core Mapping”
  Block schedule, Occupancy, and thread schedule

Assumption: Deal with square matrices for simplicity; leave memory issues for later
  With global memory and local registers


The algorithm and CPU code

P = M * N of size WIDTH x WIDTH

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {   // Pay attention here!
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}

(Figure: the dot product of row i of M and column j of N produces element (i, j) of P; all three matrices are WIDTH x WIDTH.)
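A quick usage sketch for the host version (the driver and its values are illustrative, not from the slides):

#include <stdio.h>

// Hypothetical driver: multiply two 2 x 2 row-major matrices on the host.
int main(void) {
    float M[4] = {1, 2, 3, 4};
    float N[4] = {5, 6, 7, 8};
    float P[4];
    MatrixMulOnHost(M, N, P, 2);
    printf("%g %g\n%g %g\n", P[0], P[1], P[2], P[3]);  // expect 19 22 / 43 50
    return 0;
}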



First Mapping Scheme

Thread mapping: Define the finest computational unit, and map it onto each thread
  Main criterion: no dependency
  In our first scheme: Unit = calculation of one element of P
Block mapping: Simple: one block

(Figure: thread (tx, ty) of the single block computes element P(tx, ty) of the WIDTH x WIDTH result.)


Step 1: Memory layout

(Figure: elements are named M(column#, row#); the 4 x 4 matrix is linearized row by row in memory:
M0,0 M1,0 M2,0 M3,0  M0,1 M1,1 M2,1 M3,1  M0,2 M1,2 M2,2 M3,2  M0,3 M1,3 M2,3 M3,3)
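A minimal sketch of this row-major convention (the helper name is illustrative, not from the slides):

// Element M(col, row) of a Width x Width row-major matrix lives at
// linear offset row * Width + col.
float GetElement(const float* M, int Width, int col, int row) {
    return M[row * Width + col];
}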


Step 2: Input Matrix Data Transfer (Host Code)

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;
    ...

    // 1. Allocate and load M, N to device memory
    cudaMalloc(&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc(&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc(&Pd, size);
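The slides omit error handling for brevity; a minimal sketch of checking CUDA return codes (the CHECK macro is an illustrative helper, not part of the lecture code):

#include <stdio.h>
#include <stdlib.h>

// Illustrative helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: CHECK(cudaMalloc(&Md, size));
//        CHECK(cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice));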


Step 3: Output Matrix Data Transfer (Host Code)

    // 2. Kernel invocation code - to be shown later
    ...

    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}


Step 4: Kernel Function

// Matrix multiplication kernel - per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;


Step 4: Kernel Function (cont.)

    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y*Width + k];
        float Nelement = Nd[k*Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y*Width + threadIdx.x] = Pvalue;
}

(Figure: thread (tx, ty) walks row ty of Md and column tx of Nd with index k, accumulating into Pd(tx, ty).)
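Putting the two slides together, the complete first-scheme kernel reads (assembled directly from the fragments above):

// Matrix multiplication kernel - per thread code (first mapping scheme)
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue accumulates the one element of Pd this thread computes
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y*Width + k];   // walk row ty of Md
        float Nelement = Nd[k*Width + threadIdx.x];   // walk column tx of Nd
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y*Width + threadIdx.x] = Pvalue;
}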


Step 5: Kernel Invocation (Host Code)

    // Setup the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
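For reference, steps 1-3 and the invocation assemble into one host function; this sketch simply splices the slide fragments together (error handling still omitted, as on the slides):

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate and load M, N to device memory; allocate P
    cudaMalloc(&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc(&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    cudaMalloc(&Pd, size);

    // 2. Launch one block of Width x Width threads
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    // 3. Read P back and free device matrices
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}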


Issues with the First Mapping Scheme

One block of threads computes the entire matrix Pd, so the other multiprocessors are not used.

(Figure: Grid 1 holds a single Block 1; e.g., Thread (2, 2) computes one element of Pd from a row of Md and a column of Nd.)

Issues with the First Mapping Scheme (cont.)

Each thread:
  Loads a row of matrix Md
  Loads a column of matrix Nd
  Performs one multiply and one addition for each pair of Md and Nd elements

The compute to off-chip memory access ratio is close to 1:1 (not very high): each loop iteration issues two global loads for only two floating-point operations.

Issues with the First Mapping Scheme (cont.)

The size of the matrix is limited by the number of threads allowed in a thread block:
  Maximum threads per block: 512
  Can only do a 22 x 22 matrix (22 x 22 = 484 threads fits within 512; 23 x 23 = 529 does not)

You can put a loop around the kernel call for cases when Width > 22, but multiple kernel launches will cost you.

Solution: the Second Mapping Scheme

Thread mapping: the same as in the first scheme
Block mapping:
  Create 2D thread blocks, each of which computes a (TILE_WIDTH)^2 sub-matrix (tile) of the result matrix
  Each block has (TILE_WIDTH)^2 threads
  Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks

(Figure: block (bx, by) covers one TILE_WIDTH x TILE_WIDTH tile of Pd; thread (tx, ty) computes one element within that tile.)

About the Second Mapping

More blocks: (WIDTH/TILE_WIDTH)^2
Supports larger matrices:
  The maximum size of each dimension of a grid of thread blocks is 65535
  Max matrix size = 65535^2 x TILE_WIDTH^2 elements, i.e., Max Width = 65535 x TILE_WIDTH
Uses more multiprocessors

(Figure: same tiling layout as on the previous slide.)

Algorithm concept using tiles

Break up Pd into tiles:
  Each block calculates one tile
  Each thread calculates one element
  Block size equals tile size

(Figure: block (bx, by) computes the TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub; thread indices tx, ty run from 0 to TILE_WIDTH-1 within the block.)

Example

TILE_WIDTH = 2

(Figure: a 4 x 4 result matrix split into four 2 x 2 tiles:
  Block(0,0): P0,0 P1,0 P0,1 P1,1    Block(1,0): P2,0 P3,0 P2,1 P3,1
  Block(0,1): P0,2 P1,2 P0,3 P1,3    Block(1,1): P2,2 P3,2 P2,3 P3,3)

Block Computation

(Figure: Block(0,0) computes the tile Pd0,0 Pd1,0 Pd0,1 Pd1,1, reading rows 0-1 of Md and columns 0-1 of Nd; the remaining tiles of the 4 x 4 Pd are computed by the other blocks.)

Kernel Code using Tiles

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];

    Pd[Row*Width+Col] = Pvalue;
}
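The slide kernel assumes Width is an exact multiple of TILE_WIDTH. A minimal sketch of a common variation (mine, not from the lecture) that guards against out-of-range threads when that assumption does not hold:

// Assumes TILE_WIDTH is a compile-time constant, e.g. #define TILE_WIDTH 16,
// just as the slide kernel does.
__global__ void MatrixMulKernelGuarded(float* Md, float* Nd, float* Pd, int Width)
{
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    // Threads of a partial edge tile can fall outside the matrix; skip them.
    if (Row >= Width || Col >= Width) return;

    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];
    Pd[Row*Width+Col] = Pvalue;
}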


Revised Kernel Invocation (Host Code)

    // Setup the execution configuration
    dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
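Paired with the guarded kernel sketched above, the configuration would round the grid size up so partial tiles are still covered (again my sketch, not from the lecture):

    // Ceiling division: enough blocks to cover Width even when it is
    // not a multiple of TILE_WIDTH.
    int tiles = (Width + TILE_WIDTH - 1) / TILE_WIDTH;
    dim3 dimGrid(tiles, tiles);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    MatrixMulKernelGuarded<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);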


Questions?

For Matrix Multiplication using multiple blocks, should I use 8x8, 16x16 or 32x32 blocks? Why?


Block Scheduling

Up to 8 blocks are assigned to each SM, as resources allow.
An SM in G80 can take up to 768 threads:
  Could be 256 (threads/block) * 3 blocks
  Or 128 (threads/block) * 6 blocks, etc.
An SM in GT200 can take up to 1024 threads.

(Figure: queues of blocks feeding SM 0 and SM 1, each with its SPs, MT issue unit, and shared memory.)
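A small sketch of this resource rule in code (the helper is illustrative; the constants are the G80 limits quoted above):

// How many blocks can be resident on one SM, limited both by the
// 8-block cap and by the SM's thread capacity (768 on G80).
int BlocksPerSM(int threadsPerBlock) {
    const int maxBlocks = 8, maxThreads = 768;  // G80 limits from this slide
    int byThreads = maxThreads / threadsPerBlock;
    return byThreads < maxBlocks ? byThreads : maxBlocks;
}
// BlocksPerSM(256) == 3 and BlocksPerSM(128) == 6, matching the examples.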


Thread scheduling in Multiprocessing

Each block is executed as 32-thread warps.
If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  Each block is divided into 256/32 = 8 warps
  There are 8 * 3 = 24 warps

(Figure: warps t0..t31 from Block 1 and Block 2 feed a streaming multiprocessor: instruction L1 cache, instruction fetch/dispatch, SPs, SFUs, and shared memory.)


Occupancy of Multiprocessor

How much a multiprocessor is occupied:
  Occupancy = active warps / maximum allowed warps

A GT200 SM allows 32 warps; a G80 SM allows 24 warps.

For example (using the GT200 limit of 32 warps):
  One block per SM, 32 threads per block: (32/32) * 1 / 32 = 3.125% (very bad)
  4 blocks per SM, 256 threads per block: (256/32) * 4 / 32 = 100% (very good)
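The same two examples in code form, as a minimal sketch (the helper name is illustrative; the warp size of 32 and the SM warp limits come from the slide):

// Occupancy = active warps on the SM / maximum warps the SM allows.
float Occupancy(int threadsPerBlock, int blocksPerSM, int maxWarps) {
    int warpsPerBlock = threadsPerBlock / 32;   // 32 threads per warp
    int activeWarps = warpsPerBlock * blocksPerSM;
    return (float)activeWarps / maxWarps;
}
// Occupancy(32, 1, 32)  == 0.03125  -> 3.125% (very bad)
// Occupancy(256, 4, 32) == 1.0      -> 100%   (very good)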


CUDA Occupancy Calculator

There are three factors:
  Maximum number of warps
  Maximum register usage
  Maximum shared memory usage

Demo
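Beyond the spreadsheet calculator demoed here, later CUDA toolkits (6.5 and newer, so after these 2009 notes) expose the same calculation programmatically; a minimal sketch using the tiled kernel and a 16x16 = 256-thread block:

#include <stdio.h>

int main(void) {
    int blocksPerSM = 0;
    // Ask the runtime how many blocks of MatrixMulKernel can be
    // resident per SM with 256 threads/block and no dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, MatrixMulKernel, 256, 0);
    printf("resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}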


Answers to Our Questions

For Matrix Multiplication using multiple blocks, should I use 8x8, 16x16 or 32x32 blocks?

For the G80 GPU:
  For 8x8, we have 64 threads per block. Since each SM can take up to 768 threads, that would be 12 blocks. However, each SM can only take up to 8 blocks, so only 8 * 64 = 512 threads will go into each SM! (Occupancy = 512/768 = 66.6%)
  For 16x16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule. (Occupancy = 100%)
  For 32x32, we have 1024 threads per block. Not even one block can fit into an SM! (Cannot support)

Answers to Our Questions (cont.)

For Matrix Multiplication using multiple blocks, should I use 8x8, 16x16 or 32x32 blocks?

For the GT200 GPU:
  For 8x8, we have 64 threads per block. Since each SM can take up to 1024 threads, that would be 16 blocks. However, each SM can only take up to 8 blocks, so only 8 * 64 = 512 threads will go into each SM! (Occupancy = 512/1024 = 50%)
  For 16x16, we have 256 threads per block. Each SM takes 4 blocks and achieves full capacity, unless other resource considerations overrule. (Occupancy = 100%)
  For 32x32, we have 1024 threads per block. Each SM takes 1 block and achieves full capacity, unless other resource considerations overrule. (Occupancy = 100%; note, though, that compute capability 1.x hardware such as GT200 still caps a single block at 512 threads, so a 32x32 block cannot actually be launched there.)

Computation-to-Core Mapping (Summary)

Step 1: Define your computational unit, and map each unit to a thread
  Avoid dependency
  Increase the compute to memory access ratio

Step 2: Group your threads into blocks
  Eliminate hardware limits
  Take advantage of shared memory
