Introduction to CUDA

2009/04/07, Yun-Yang Ma

Outline

Overview
What is CUDA
◦ Architecture
◦ Programming Model
◦ Memory Model
H.264 Motion Estimation on CUDA
◦ Method
◦ Experimental Results
◦ Conclusion

Overview

In the past few years, the processing capability of Graphics Processing Units (GPUs) has grown rapidly

General-purpose Computation on GPUs (GPGPU)
◦ Not only for accelerating graphics display but also for speeding up non-graphics applications
  ◦ Linear algebra computation
  ◦ Scientific simulation

What is CUDA?

Compute Unified Device Architecture
◦ http://www.nvidia.com.tw/object/cuda_home_tw.html# (NVIDIA CUDA Zone)
◦ Single program, multiple data (SPMD) computing device

[Figures: Fast Object Detection, Leukocyte Tracking, Real-time 3D Modeling. Source: http://www.nvidia.com.tw/object/cuda_home_tw.html]

Architecture

Programming Model
◦ A program executes in two parts
  ◦ Host: CPU
  ◦ Device: GPU

[Figure: the main program runs on the host until "End of Main"; in between, kernels are launched on the device, where each runs in parallel until "End of Kernel".]

Thread Batching
◦ CUDA creates a large number of threads on the device; each thread executes the kernel program on different data
◦ Threads in the same thread block can cooperate with each other through shared memory
◦ The number of threads in a thread block is limited
  ◦ Thread blocks with the same dimensions can be organized into a grid to batch threads

[Figure: Thread Batching. The host launches Kernel 1 and Kernel 2; on the device each kernel runs as a grid (Grid 1, Grid 2). A grid consists of thread blocks such as Block (0,0), Block (1,0) and Block (2,0); Block (1,0) is expanded to show its threads, Thread (0,0) through Thread (4,2), i.e. 5 x 3 threads.]
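The grid / block / thread hierarchy can be mimicked on the CPU. The following is a minimal Python sketch, not CUDA code: it enumerates every thread of a grid and derives a global coordinate with the usual "block index x block size + thread index" rule. The 3 x 1 grid and 5 x 3 block shape are read off the figure labels; the function name `global_coords` is hypothetical.

```python
def global_coords(grid_dim, block_dim):
    """Yield ((bx, by), (tx, ty), (gx, gy)) for every thread in the grid."""
    grid_w, grid_h = grid_dim
    block_w, block_h = block_dim
    for by in range(grid_h):
        for bx in range(grid_w):
            for ty in range(block_h):
                for tx in range(block_w):
                    # Global position = block offset + position inside the block.
                    gx = bx * block_w + tx
                    gy = by * block_h + ty
                    yield (bx, by), (tx, ty), (gx, gy)


# Assumed shapes: 3 x 1 blocks per grid, 5 x 3 threads per block.
threads = list(global_coords(grid_dim=(3, 1), block_dim=(5, 3)))
by_ids = {(block, thread): pos for block, thread, pos in threads}

print(len(threads))               # total number of threads in the grid
print(by_ids[((1, 0), (4, 2))])   # global position of Thread (4,2) in Block (1,0)
```

On a real device the threads run concurrently; the nested loops here only reproduce the index arithmetic.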

Memory Model
◦ DRAM
◦ On-chip memory

[Figure: within a grid, each block (Block (0, 0), Block (1, 0)) has its own shared memory; each thread (Thread (0, 0), Thread (1, 0)) has its own registers and local memory; global memory is shared by all blocks.]

Example: Vector addition

[Figure: kernel program listing]
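The kernel listing itself did not survive in this transcript. As a stand-in, here is a minimal Python sketch that emulates what a vector-addition kernel does: every (block, thread) pair handles one element. It is a CPU emulation under stated assumptions, not the code from the slide; `vec_add_kernel` and `launch` are hypothetical names, and the block size of 4 is chosen only to keep the example small.

```python
def vec_add_kernel(block_idx, block_dim, thread_idx, a, b, c):
    """Body executed by one emulated thread: add a single element."""
    i = block_idx * block_dim + thread_idx
    if i < len(c):  # guard: the last block may have spare threads
        c[i] = a[i] + b[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Run the kernel once per (block, thread) pair, sequentially."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


n = 10
a = [float(i) for i in range(n)]
b = [2.0 * i for i in range(n)]
c = [0.0] * n

block_dim = 4                                # threads per block (assumed)
grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n elements
launch(vec_add_kernel, grid_dim, block_dim, a, b, c)
print(c)
```

The bounds check matters because the grid is rounded up to whole blocks, so some threads in the last block have no element to process.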

H.264 ME on CUDA

In [1], an efficient block-level parallel algorithm is used for variable block size motion estimation in H.264/AVC

MB mode
◦ P_16x16, P_16x8, P_8x16, P_8x8
◦ Each 8x8 block of P_8x8 can be further partitioned as 8x8, 8x4, 4x8 or 4x4

[1] "H.264/AVC Motion Estimation Implementation on Compute Unified Device Architecture (CUDA)", IEEE International Conference on Multimedia & Expo (2008)

Two steps for deciding the final coding mode
◦ Step 1: Find the best motion vectors for each MB mode
◦ Step 2: Evaluate the R-D performance and choose the best mode

The H.264 ME algorithm is extremely complex and time-consuming
◦ Fast motion estimation methods (TSS, DS, etc.): too many branch instructions
◦ In [1], the authors focus on Full Search ME

Method

First stage: Calculate integer-pixel MVs
◦ Compute all SAD values between each block and all reference candidates in parallel
◦ Merge 4x4 SADs to form all block sizes
◦ Find the minimal SAD and determine the integer-pixel MV

[Figure: partitions of a 16x16 macroblock, with block dimensions labelled 4, 8 and 16.]
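To make the first and last bullets concrete, the following Python sketch runs a full search for a single 4x4 block: it computes the SAD (sum of absolute differences) for every candidate displacement and keeps the smallest. It is a sequential illustration, not the implementation from [1]; the function names, the tiny +/-2 search window and the synthetic frames are all assumptions made for the example.

```python
def sad(cur, ref, bx, by, rx, ry, size=4):
    """Sum of absolute differences between a block of cur and a block of ref."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[ry + y][rx + x])
    return total


def full_search(cur, ref, bx, by, search=2, size=4):
    """Return (best_sad, (dx, dy)) over all in-frame candidates."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = bx + dx, by + dy
            if 0 <= rx <= w - size and 0 <= ry <= h - size:
                cost = sad(cur, ref, bx, by, rx, ry, size)
                if best is None or cost < best[0]:
                    best = (cost, (dx, dy))
    return best


# Synthetic 12x12 reference frame; the current frame is the reference
# shifted by (2, 1), so the block at (4, 4) should match at MV (2, 1).
N = 12
ref = [[(7 * x + 13 * y) % 31 for x in range(N)] for y in range(N)]
cur = [[ref[y + 1][x + 2] if y + 1 < N and x + 2 < N else 0
        for x in range(N)] for y in range(N)]

best_sad, best_mv = full_search(cur, ref, bx=4, by=4)
print(best_sad, best_mv)
```

Every candidate is evaluated with the same straight-line arithmetic, which is why this step maps naturally to one thread per candidate.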

Second stage: Calculate fractional-pixel MVs
◦ The reference frame is interpolated using the six-tap filter and the bilinear filter defined in H.264/AVC
◦ Calculate the SADs at the 24 fractional-pixel positions adjacent to the best integer MV

[Figure: the integer pixel surrounded by 24 half-pixel and quarter-pixel positions, numbered 1 to 24.]

4x4 Block-Size SAD Calculation
◦ Sequence resolution: 4CIF (704x576)
◦ Search range: 32 x 32 (leads to 1024 candidates)
◦ Each candidate SAD is computed by a thread
◦ 256 threads are executed in a thread block
◦ Every 256 candidates of one 4x4 block's SAD calculation are assigned to a thread block

$$\text{Thread\_Block\_Number} = \frac{\text{Frame\_Width}}{4} \times \frac{\text{Frame\_Height}}{4} \times \text{Search\_Range}^2 \times \frac{1}{256}$$

where (Frame_Width/4) x (Frame_Height/4) is the number of 4x4 blocks in a frame, Search_Range^2 is the number of ME search candidates, and 1/256 accounts for the 256 threads in a thread block

$$= \frac{704}{4} \times \frac{576}{4} \times 32^2 \times \frac{1}{256} = 101376$$
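The thread block count can be checked with a few lines of Python. This is only an arithmetic sketch; the function name `thread_block_count` is hypothetical, and the frame size, search range and 256 threads per block are the values given on the slide.

```python
def thread_block_count(frame_width, frame_height, block_size,
                       search_range, threads_per_block=256):
    """Thread blocks needed when each thread handles one search candidate."""
    blocks_per_frame = (frame_width // block_size) * (frame_height // block_size)
    candidates = search_range * search_range
    return blocks_per_frame * candidates // threads_per_block


# 4CIF frame, 4x4 blocks, 32 x 32 search range.
blocks_4x4 = thread_block_count(704, 576, 4, 32)
print(blocks_4x4)  # 101376
```

Each 4x4 block has 1024 candidates, so it occupies 1024 / 256 = 4 thread blocks, and 176 x 144 = 25344 such blocks give 101376 thread blocks in total.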

[Figure: Block diagram of 4x4 block SAD calculation. The 1024 candidates of a 4x4 block are split across four thread blocks (B1 to B4) of 256 threads each (T1 to T256); the kernel runs thread blocks B1 to B101376, and each thread block writes its 256 SADs to DRAM.]

Variable Block-Size SAD Generation
◦ Merge the 4x4 SADs obtained in the previous step
◦ Each thread fetches the sixteen 4x4 SADs of one MB at a candidate position and combines them to form the SADs of the other block sizes

$$\text{Thread\_Block\_Number} = \frac{\text{Frame\_Width}}{16} \times \frac{\text{Frame\_Height}}{16} \times \text{Search\_Range}^2 \times \frac{1}{256}$$

$$= \frac{704}{16} \times \frac{576}{16} \times 32^2 \times \frac{1}{256} = 6336$$
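The merging step is a set of partial sums over the 4 x 4 grid of 4x4 SADs belonging to one macroblock at one candidate position. The Python sketch below shows one way to form them; it is illustrative only. The function name `merge_sads`, the row/column layout of the input and the width x height naming of block sizes (so 16x8 means 16 wide, 8 tall) are assumptions.

```python
def merge_sads(s):
    """s[r][c] is the SAD of the 4x4 block at row r, column c of one macroblock.

    Returns the SADs of all larger partitions, keyed by width x height.
    """
    def region(r0, c0, rows, cols):
        # Sum the 4x4 SADs covering a rows x cols patch of 4x4 blocks.
        return sum(s[r][c]
                   for r in range(r0, r0 + rows)
                   for c in range(c0, c0 + cols))

    return {
        "8x4": [region(r, c, 1, 2) for r in range(4) for c in (0, 2)],
        "4x8": [region(r, c, 2, 1) for r in (0, 2) for c in range(4)],
        "8x8": [region(r, c, 2, 2) for r in (0, 2) for c in (0, 2)],
        "16x8": [region(r, 0, 2, 4) for r in (0, 2)],
        "8x16": [region(0, c, 4, 2) for c in (0, 2)],
        "16x16": [region(0, 0, 4, 4)],
    }


example = [[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12],
           [13, 14, 15, 16]]
merged = merge_sads(example)
print({name: len(values) for name, values in merged.items()})
print(merged["8x8"], merged["16x16"])
```

Because SAD is additive over disjoint pixel sets, no pixel data needs to be re-read: the larger block sizes follow from the sixteen stored values alone.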

[Figure: Block diagram of variable block size SAD calculation. In the kernel, each thread (T1 to T256) of thread blocks B1 to B6336 reads sixteen 4x4 SADs from DRAM and writes back eight 4x8 SADs, eight 8x4 SADs, four 8x8 SADs, two 8x16 SADs, two 16x8 SADs and one 16x16 SAD to DRAM.]

Integer Pixel SAD Comparison
◦ All 1024 SADs of one block are compared, and the candidate with the smallest SAD is chosen as the integer-pixel MV
◦ Each block size (16x16 to 4x4) has its own kernel for SAD comparison
◦ Seven kernels are implemented and executed sequentially

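[Figure: Block diagram of integer pixel SAD comparison. Thread block B1 reads the 1024 SADs of a block from DRAM; each of its 256 threads (T1 to T256) compares 4 SADs and writes the smallest to shared memory. The SADs in shared memory are then reduced over n iterations: in iteration n, threads T1 to T(128/2^(n-1)) compare 256/2^(n-1) SADs, until only the integer-pel MV remains.]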

During the thread reduction process, a problem may occur
◦ Shared memory bank conflict

To avoid it, sequential addressing with non-divergent branching is adopted

SAD comparison using sequential addressing with non-divergent branching

[Figure: one reduction iteration]
Shared memory (SAD value & index): 8 6 3 4 7 3 7 8 4 7 5 1 9 4 3 6
Thread ID (do comparison): 1 2 3 4 5 6 7 8
Shared memory after the iteration: 4 6 3 1 7 3 3 6 4 7 5 1 9 4 3 6
Thread ID (next iteration): 1 2 3 4 …
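The iteration shown in the figure can be reproduced with a short Python emulation of the reduction. This is a sequential sketch of the access pattern, not device code: in each iteration the active "threads" are the contiguous IDs 0 to stride-1 (0-based here, 1-based on the slide), and thread `tid` compares its entry with the one at `tid + stride`. The function name `argmin_reduce` and the tie-breaking rule (lower index wins) are assumptions.

```python
def argmin_reduce(sads):
    """Tree reduction with sequential addressing.

    Returns ((min_sad, index), snapshots), where snapshots holds the SAD
    values in shared memory after each iteration. len(sads) must be a
    power of two.
    """
    n = len(sads)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    shared = [(value, index) for index, value in enumerate(sads)]
    snapshots = []
    stride = n // 2
    while stride > 0:
        # Active threads are 0..stride-1; each reads from the upper half
        # and writes to the lower half, so the sequential loop sees the
        # same values a parallel step would.
        for tid in range(stride):
            if shared[tid + stride] < shared[tid]:
                shared[tid] = shared[tid + stride]
        snapshots.append([value for value, _ in shared])
        stride //= 2
    return shared[0], snapshots


sads = [8, 6, 3, 4, 7, 3, 7, 8, 4, 7, 5, 1, 9, 4, 3, 6]
(best_sad, best_index), snapshots = argmin_reduce(sads)
print(snapshots[0])
print(best_sad, best_index)
```

The first snapshot matches the second row of shared memory in the figure, and four iterations are enough to reduce 16 values to one.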

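Fractional Pixel MV Refinement
◦ Find the best fractional-pixel motion vector around the integer motion vector of every block

[Figure: the integer pixel surrounded by 24 half-pixel and quarter-pixel positions, numbered 1 to 24.]

[Figure: block diagram of fractional pixel MV refinement. The encoding frame, the integer-pel MV and the reference frame are read from DRAM into the shared memory of thread block B1; threads T1 to T24 each write one SAD to shared memory; in iteration n, threads T1 to T(12/2^(n-1)) compare 24/2^(n-1) SADs, until only the fractional-pel MV remains.]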
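One way to enumerate the 24 fractional candidates is as offsets in quarter-pixel units on a 5 x 5 grid centred on the best integer position. The sketch below assumes exactly that layout, which is inferred from the figure's 24 numbered positions rather than stated in the text; the function name `fractional_offsets` is hypothetical.

```python
def fractional_offsets():
    """Return (half_pel, quarter_pel) offsets in quarter-pixel units.

    Assumes the candidates lie within +/-2 quarter pixels (i.e. +/-1/2
    pixel) of the integer position, excluding the integer position itself.
    """
    half_pel, quarter_pel = [], []
    for dy in range(-2, 3):
        for dx in range(-2, 3):
            if (dx, dy) == (0, 0):
                continue  # the integer-pel position was already evaluated
            if dx % 2 == 0 and dy % 2 == 0:
                half_pel.append((dx, dy))     # both offsets are multiples of 1/2
            else:
                quarter_pel.append((dx, dy))  # at least one odd quarter offset
    return half_pel, quarter_pel


half_pel, quarter_pel = fractional_offsets()
print(len(half_pel), len(quarter_pel), len(half_pel) + len(quarter_pel))
```

Under this assumption there are 8 half-pixel and 16 quarter-pixel candidates, one per thread T1 to T24.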

Experimental Results

Environment
◦ AMD Athlon 64 X2 Dual Core 2.1 GHz with 2 GB of memory
◦ NVIDIA GeForce 8800GTX with 768 MB DRAM
◦ CUDA Toolkit and SDK 1.1

Parameters
◦ ME algorithm: Full Search
◦ Search range: 32x32

The average execution time in ms for processing one frame using the proposed algorithm

Steps                                          ms        Percentage (%)
Step 1. 4x4 Block Size SADs Calculation        33.98     31.24
Step 2. Variable Block Size SADs Generation    30.64     28.16
Step 3. Integer Pixel SAD Comparison            9.69      8.90
Step 4. Fractional Pixel Interpolation          7.10      6.52
Step 5. Fractional Pixel ME Refinement         10.87      9.99
Others                                         16.49     15.16
Total                                         108.77    100

The ME performance comparison between CPU only and using GPU

Sequence        Frame rate (fps) using AMD CPU    Frame rate (fps) using GPU    Speed-up
Stefan (CIF)    3.04                              31.54                         10.38
City (4CIF)     0.78                               9.19                         11.78
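The speed-up column is the ratio of the two frame rates. A small Python check, using only the figures from the table (the names `results` and `speed_up` are hypothetical):

```python
# (CPU fps, GPU fps) per test sequence, copied from the table.
results = {
    "Stefan (CIF)": (3.04, 31.54),
    "City (4CIF)": (0.78, 9.19),
}


def speed_up(cpu_fps, gpu_fps):
    """Speed-up of the GPU version over the CPU-only version."""
    return gpu_fps / cpu_fps


for name, (cpu_fps, gpu_fps) in results.items():
    print(f"{name}: {speed_up(cpu_fps, gpu_fps):.2f}x")
```

Both ratios agree with the reported speed-ups to within rounding.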

Conclusions

In [1], the authors present an efficient block-level parallel algorithm for variable block size motion estimation on a CUDA GPU.

A GPU acting as a coprocessor can effectively accelerate computation over massive amounts of data.
