Introduction to CUDA

Introduction to CUDA 2009/04/07 Yun-Yang Ma


Transcript of Introduction to CUDA

Page 1: Introduction to CUDA

Introduction to CUDA
2009/04/07
Yun-Yang Ma

Page 2: Introduction to CUDA

Outline

Overview
What is CUDA
◦ Architecture
◦ Programming Model
◦ Memory Model
H.264 Motion Estimation on CUDA
◦ Method
◦ Experimental Results
◦ Conclusion

Page 3: Introduction to CUDA

Overview

In the past few years, the processing capability of graphics processing units (GPUs) has grown rapidly

Page 4: Introduction to CUDA

Overview

General-purpose computation on GPUs (GPGPU)
◦ GPUs are used not only to accelerate graphics display but also to speed up non-graphics applications, such as linear algebra computation and scientific simulation

Page 5: Introduction to CUDA

What is CUDA?

Compute Unified Device Architecture
◦ http://www.nvidia.com.tw/object/cuda_home_tw.html# (NVIDIA CUDA Zone)
◦ Single program, multiple data (SPMD) computing device

Application examples: fast object detection, leukocyte tracking, real-time 3D modeling

Page 6: Introduction to CUDA

What is CUDA?

Architecture

Page 7: Introduction to CUDA

What is CUDA?

Programming Model
◦ A program executes in two parts: the host (CPU) and the device (GPU)

[Figure: the main program runs on the host from start to End of Main; in between, it launches a kernel on the device, which executes in parallel until End of Kernel.]

Page 8: Introduction to CUDA

What is CUDA?

Thread Batching
◦ CUDA creates a large number of threads on the device, and each thread executes the kernel program on different data
◦ Threads in the same thread block can cooperate with each other through shared memory
◦ The number of threads in a thread block is limited
◦ Thread blocks with the same dimensions can be organized as a grid for thread batching
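The batching scheme above can be illustrated on the CPU. The following is a minimal Python sketch, not CUDA code: it enumerates every (block, thread) pair of a two-dimensional grid and computes the global coordinates each thread would work on. The grid and block dimensions (3 x 2 blocks of 5 x 3 threads) and the function name `global_coords` are illustrative assumptions.

```python
from itertools import product

def global_coords(grid_dim, block_dim):
    """Yield (block, thread, global_xy) for every thread of a 2D grid.

    grid_dim  = (blocks in x, blocks in y)
    block_dim = (threads per block in x, threads per block in y)
    """
    for by, bx in product(range(grid_dim[1]), range(grid_dim[0])):
        for ty, tx in product(range(block_dim[1]), range(block_dim[0])):
            # Same arithmetic as blockIdx * blockDim + threadIdx in CUDA.
            gx = bx * block_dim[0] + tx
            gy = by * block_dim[1] + ty
            yield (bx, by), (tx, ty), (gx, gy)

# Illustrative sizes: a grid of 3 x 2 blocks, each with 5 x 3 threads.
threads = list(global_coords((3, 2), (5, 3)))
print(len(threads))   # total number of threads: 3 * 2 * 5 * 3
print(threads[-1])    # block, thread, and global coordinates of the last thread
```

Every thread gets a distinct global coordinate, which is how each one ends up processing different data with the same kernel program.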

Page 9: Introduction to CUDA

What is CUDA?

Thread Batching

[Figure: the host launches Kernel 1 and Kernel 2, which execute on the device as Grid 1 and Grid 2. The grids shown contain 3 x 2 thread blocks, Block (0,0) to Block (2,1), and 3 x 4 thread blocks, Block (0,0) to Block (2,3). Block (1,0) is expanded to show its 5 x 3 threads, Thread (0,0) to Thread (4,2).]

Page 10: Introduction to CUDA

What is CUDA?

Memory Model
◦ DRAM
◦ On-chip memory

[Figure: a grid containing Block (0, 0) and Block (1, 0). Each block has its own shared memory; each thread, Thread (0, 0) and Thread (1, 0), has its own registers and local memory; global memory is accessible to all blocks of the grid.]

Page 11: Introduction to CUDA

What is CUDA?

Example: Vector addition

Kernel program
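A vector-addition kernel can be emulated on the CPU as follows. This is a minimal Python sketch, not the slide's CUDA code: `vec_add_kernel` plays the role of the kernel body run by one thread, and `launch` is a hypothetical helper that stands in for a kernel launch by running every thread of every block in turn. The block size of 4 is an arbitrary assumption.

```python
def vec_add_kernel(block_idx, block_dim, thread_idx, a, b, c):
    """Body executed by one thread: add a single pair of elements."""
    i = block_idx * block_dim + thread_idx   # global element index
    if i < len(c):                           # guard: the last block may be partly idle
        c[i] = a[i] + b[i]

def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a kernel launch by running every thread of every block in turn."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)

n = 10
a = [float(i) for i in range(n)]
b = [2.0 * i for i in range(n)]
c = [0.0] * n

block_dim = 4                                  # threads per block (assumed)
grid_dim = (n + block_dim - 1) // block_dim    # enough blocks to cover n elements
launch(vec_add_kernel, grid_dim, block_dim, a, b, c)
print(c)
```

On a real device the threads run concurrently rather than in a loop, but the index arithmetic and the bounds guard are the same idea.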

Page 12: Introduction to CUDA

H.264 ME on CUDA

In [1], an efficient block-level parallel algorithm is used for variable block size motion estimation in H.264/AVC

MB modes
◦ P_16x16, P_16x8, P_8x16, P_8x8
◦ P_8x8 sub-partitions: 8x8, 8x4, 4x8, 4x4

[1] "H.264/AVC Motion Estimation Implementation on Compute Unified Device Architecture (CUDA)", IEEE International Conference on Multimedia & Expo (2008)

Page 13: Introduction to CUDA

H.264 ME on CUDA

Two steps for deciding the final coding mode
◦ Step 1: Find the best motion vectors for each MB mode
◦ Step 2: Evaluate the R-D performance and choose the best mode

The H.264 ME algorithm is extremely complex and time consuming
◦ Fast motion estimation methods (TSS, DS, etc.): too many branch instructions
◦ In [1], the authors focus on full search ME

Page 14: Introduction to CUDA

Method

First stage: Calculate integer-pixel MVs
◦ Compute all SAD values between each block and all reference candidates in parallel
◦ Merge 4x4 SADs to form all block sizes
◦ Find the minimal SAD and determine the integer-pixel MV

[Figure: a 16x16 macroblock and its partitions into smaller blocks, down to 4x4.]

Page 15: Introduction to CUDA

Method

Second stage: Calculate fractional-pixel MVs
◦ The reference frame is interpolated using the six-tap filter and the bilinear filter defined in H.264/AVC
◦ Calculate the SADs at the 24 fractional-pixel positions adjacent to the best integer MV

[Figure: the 24 fractional-pixel candidates, numbered 1 to 24, at half-pixel and quarter-pixel positions around the integer-pixel position.]
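The interpolation step can be sketched in one dimension. The Python below is a simplified illustration, not the paper's implementation: it applies the H.264/AVC six-tap luma filter (1, -5, 20, 20, -5, 1) to obtain a half-pixel sample and a rounded average to obtain a quarter-pixel sample. The sample values and function names are assumptions, and the standard's additional rules (for example, for the centre half-pixel position) are not covered.

```python
def clip255(v):
    """Clamp a filtered value to the 8-bit sample range."""
    return max(0, min(255, v))

def half_pel(samples):
    """Half-pixel sample between samples[2] and samples[3].

    Uses the six-tap filter (1, -5, 20, 20, -5, 1) with rounding and a
    division by 32, applied to six neighbouring integer-pixel samples.
    """
    e, f, g, h, i, j = samples
    return clip255((e - 5 * f + 20 * g + 20 * h - 5 * i + j + 16) >> 5)

def quarter_pel(p, q):
    """Quarter-pixel sample: rounded average (bilinear) of two neighbours."""
    return (p + q + 1) >> 1

row = [0, 10, 20, 30, 40, 50]      # illustrative luma samples on a ramp
h = half_pel(row)                  # sample halfway between 20 and 30
q = quarter_pel(row[2], h)         # sample between 20 and the half-pixel sample
print(h, q)
```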

Page 16: Introduction to CUDA

Method

4x4 Block-Size SAD Calculation
◦ Sequence resolution: 4CIF (704x576)
◦ Search range: 32 x 32 (leads to 1024 candidates)
◦ Each candidate SAD is computed by one thread
◦ 256 threads are executed in a thread block
◦ Every 256 candidates of one 4x4 block SAD calculation are assigned to a thread block

\[
\text{Thread\_Block\_Number} = \frac{\text{Frame\_Width}}{4} \times \frac{\text{Frame\_Height}}{4} \times \text{Search\_Range}^2 \times \frac{1}{256} = \frac{704}{4} \times \frac{576}{4} \times 32^2 \times \frac{1}{256} = 101376
\]

Here (Frame_Width / 4) x (Frame_Height / 4) is the number of 4x4 blocks in a frame, Search_Range^2 is the number of ME search candidates, and 256 is the number of threads in a thread block
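The per-thread work and the thread-block count can be checked with a short CPU-side Python sketch. It is an illustration under assumptions, not the paper's kernel: the function names are hypothetical, the frames are tiny synthetic arrays, and the row-major candidate order and centred search window in `candidate_offset` are guesses about a mapping the slides do not specify.

```python
def sad_4x4(cur, ref, cx, cy, rx, ry):
    """Sum of absolute differences between the 4x4 block of the current frame
    at (cx, cy) and the 4x4 block of the reference frame at (rx, ry)."""
    return sum(abs(cur[cy + y][cx + x] - ref[ry + y][rx + x])
               for y in range(4) for x in range(4))

def thread_blocks(width, height, block, search_range, threads_per_block=256):
    """Thread blocks needed when each thread handles one search candidate."""
    blocks_in_frame = (width // block) * (height // block)
    return blocks_in_frame * search_range ** 2 // threads_per_block

def candidate_offset(thread_block, thread, search_range=32):
    """Map (thread block within one 4x4 block, thread) to a search offset.
    Row-major order and a window centred on the block are assumptions."""
    k = thread_block * 256 + thread            # candidate index, 0..1023
    return (k % search_range - search_range // 2,
            k // search_range - search_range // 2)

# Tiny synthetic frames: pixel value = x + y.
cur = [[x + y for x in range(16)] for y in range(16)]
ref = [[x + y for x in range(16)] for y in range(16)]

print(sad_4x4(cur, ref, 4, 4, 4, 4))     # same position in identical frames
print(sad_4x4(cur, ref, 4, 4, 5, 4))     # candidate shifted by one pixel in x
print(thread_blocks(704, 576, 4, 32))    # 4CIF, 4x4 blocks, 32 x 32 search range
print(candidate_offset(0, 0), candidate_offset(3, 255))
```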

Page 17: Introduction to CUDA

Method

Block diagram of 4x4 block SAD calculation

[Figure: the 1024 candidates of a 4x4 block are divided among four thread blocks (B1 to B4) of 256 threads each (T1 to T256). The kernel runs thread blocks B1 to B101376, and each thread block writes its 256 SADs to DRAM.]

Page 18: Introduction to CUDA

Method

Variable Block-Size SAD Generation
◦ Merge the 4x4 SADs obtained in the previous step
◦ Each thread fetches the sixteen 4x4 SADs of one MB at a candidate position and combines them to form the SADs of the other block sizes

\[
\text{Thread\_Block\_Number} = \frac{\text{Frame\_Width}}{16} \times \frac{\text{Frame\_Height}}{16} \times \text{Search\_Range}^2 \times \frac{1}{256} = \frac{704}{16} \times \frac{576}{16} \times 32^2 \times \frac{1}{256} = 6336
\]
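The merge performed by one thread can be sketched on the CPU. This Python is a minimal illustration, not the paper's kernel: `merge_sads` is a hypothetical name, block sizes are written width x height, and the input SAD values are made up for the example.

```python
def merge_sads(s):
    """Combine the sixteen 4x4 SADs of one macroblock at one candidate position.

    s[r][c] is the SAD of the 4x4 block in row r, column c of the macroblock.
    Returns a dict keyed by block size, written width x height.
    """
    sad_8x4 = [s[r][c] + s[r][c + 1] for r in range(4) for c in (0, 2)]
    sad_4x8 = [s[r][c] + s[r + 1][c] for r in (0, 2) for c in range(4)]
    # 8x8 order: top-left, top-right, bottom-left, bottom-right
    sad_8x8 = [s[r][c] + s[r][c + 1] + s[r + 1][c] + s[r + 1][c + 1]
               for r in (0, 2) for c in (0, 2)]
    sad_16x8 = [sad_8x8[0] + sad_8x8[1], sad_8x8[2] + sad_8x8[3]]   # top, bottom
    sad_8x16 = [sad_8x8[0] + sad_8x8[2], sad_8x8[1] + sad_8x8[3]]   # left, right
    sad_16x16 = [sum(sad_8x8)]
    return {"8x4": sad_8x4, "4x8": sad_4x8, "8x8": sad_8x8,
            "16x8": sad_16x8, "8x16": sad_8x16, "16x16": sad_16x16}

# Made-up 4x4 SADs: the values 0..15 in row-major order.
s = [[r * 4 + c for c in range(4)] for r in range(4)]
out = merge_sads(s)
for size, values in out.items():
    print(size, values)
```

Building the larger sizes from the 8x8 sums keeps the number of additions small, since each 4x4 SAD is read once per partition shape.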

Page 19: Introduction to CUDA

Method

Block diagram of variable block size SAD calculation

[Figure: in the kernel, each thread (T1 to T256) of thread blocks B1 to B6336 reads 16 SADs from DRAM and writes the merged SADs back to DRAM: 4x8 SAD x8, 8x4 SAD x8, 8x8 SAD x4, 8x16 SAD x2, 16x8 SAD x2, 16x16 SAD x1.]

Page 20: Introduction to CUDA

Method

Integer Pixel SAD Comparison
◦ All 1024 SADs of one block are compared, and the candidate with the least SAD is chosen as the integer-pixel MV
◦ Each block size (16x16 to 4x4) has its own kernel for SAD comparison
◦ Seven kernels are implemented and executed sequentially

Page 21: Introduction to CUDA

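Method

Block diagram of integer pixel SAD comparison

[Figure: in the kernel, thread block B1 with threads T1 to T256 reads the 1024 SADs of one block from DRAM. Each thread takes 4 SADs and places one SAD in shared memory. The SADs in shared memory are then reduced over n iterations: in iteration n, threads T1 to T(128/2^(n-1)) compare 256/2^(n-1) SADs. The result is the integer-pel MV.]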

Page 22: Introduction to CUDA

Method

During the thread reduction process, a problem may occur
◦ Shared memory bank conflicts

To address this, a sequential addressing strategy with non-divergent branching is adopted

Page 23: Introduction to CUDA

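Method

SAD comparison using sequential addressing with non-divergent branching

[Figure: shared memory holds 16 entries (SAD value and index): 8 6 3 4 7 3 7 8 4 7 5 1 9 4 3 6. Threads 1 to 8 each compare one entry of the first half with the matching entry of the second half and keep the smaller, giving 4 6 3 1 7 3 3 6 in the first eight entries. Threads 1 to 4 then repeat the comparison on those eight entries, and so on.]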
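The comparison pattern on this slide can be traced with a CPU-side Python sketch. It is an emulation under assumptions, not the paper's kernel: the function name is hypothetical, indices are 0-based, ties keep the lower-half entry, and the inner loop stands in for threads that would run concurrently with a barrier between passes.

```python
def reduce_min(entries):
    """Tree reduction over (sad, index) pairs using sequential addressing.

    In each pass, the first `stride` threads compare entry t with entry
    t + stride and keep the smaller one in place. len(entries) must be
    a power of two.
    """
    data = list(entries)
    stride = len(data) // 2
    passes = []
    while stride > 0:
        for t in range(stride):                  # threads 0 .. stride-1 are active
            if data[t + stride][0] < data[t][0]:
                data[t] = data[t + stride]
        passes.append([sad for sad, _ in data[:stride]])
        stride //= 2
    return data[0], passes

sads = [8, 6, 3, 4, 7, 3, 7, 8, 4, 7, 5, 1, 9, 4, 3, 6]   # values from the slide
best, passes = reduce_min(list(zip(sads, range(len(sads)))))
print(passes[0])   # first eight entries after the first pass
print(best)        # (minimum SAD, 0-based index of its candidate)
```

Because the active threads are always the contiguous range 0 to stride-1 and they read consecutive entries, the pattern avoids both scattered shared-memory accesses and branches that split neighbouring threads.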

Page 24: Introduction to CUDA

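Method

Fractional Pixel MV Refinement
◦ Find the best fractional-pixel motion vector around the integer motion vector of every block

[Figure: the 24 fractional-pixel candidates, numbered 1 to 24, at half-pixel and quarter-pixel positions around the integer-pixel position.]

[Figure: the kernel reads the encoding frame, the integer-pel MV, and the reference frame from DRAM. In thread block B1, threads T1 to T24 place their SADs in shared memory. The SADs are then reduced over n iterations: in iteration n, threads T1 to T(12/2^(n-1)) compare 24/2^(n-1) SADs. The result is the fractional-pel MV.]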
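The candidate set for this refinement can be enumerated in a few lines of Python. This is a sketch under assumptions: the 5 x 5 quarter-pixel window without its centre is inferred from the figure's numbering of 24 positions, the function names are hypothetical, and `sad_at` is a made-up cost function standing in for the SAD computed on the interpolated reference frame.

```python
def fractional_candidates():
    """The 24 offsets around an integer-pixel MV, in quarter-pixel units.

    A 5 x 5 window from -2 to +2 in each direction, without the centre
    (the integer-pixel position itself).
    """
    return [(dx, dy) for dy in range(-2, 3) for dx in range(-2, 3)
            if (dx, dy) != (0, 0)]

def refine(int_mv, sad_at):
    """Return the best fractional MV in quarter-pixel units.

    int_mv is in integer pixels; sad_at(x, y) returns the SAD at a
    quarter-pixel position. Only the 24 candidates are compared here.
    """
    base = (int_mv[0] * 4, int_mv[1] * 4)
    best = min(fractional_candidates(),
               key=lambda d: sad_at(base[0] + d[0], base[1] + d[1]))
    return (base[0] + best[0], base[1] + best[1])

cands = fractional_candidates()
half = [d for d in cands if d[0] % 2 == 0 and d[1] % 2 == 0]   # half-pixel positions
print(len(cands), len(half))

# Made-up cost with its minimum at quarter-pixel position (13, -6).
mv = refine((3, -1), lambda x, y: abs(x - 13) + abs(y + 6))
print(mv)
```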

Page 25: Introduction to CUDA

Experimental Results

Environment
◦ AMD Athlon 64 X2 Dual Core 2.1 GHz with 2 GB memory
◦ NVIDIA GeForce 8800 GTX with 768 MB DRAM
◦ CUDA Toolkit and SDK 1.1

Parameters
◦ ME algorithm: Full Search
◦ Search range: 32x32

Page 26: Introduction to CUDA

Experimental Results

The average execution time in ms for processing one frame using the proposed algorithm

Steps                                           ms        Percentage (%)
Step 1. 4x4 Block Size SADs Calculation         33.98     31.24
Step 2. Variable Block Size SADs Generation     30.64     28.16
Step 3. Integer Pixel SAD Comparison             9.69      8.90
Step 4. Fractional Pixel Interpolation           7.10      6.52
Step 5. Fractional Pixel ME Refinement          10.87      9.99
Others                                          16.49     15.16
Total                                          108.77    100

Page 27: Introduction to CUDA

Experimental Results

The ME performance comparison between CPU only and using GPU

Sequence        Frame rate (fps) using AMD CPU    Frame rate (fps) using GPU    Speed-up
Stefan (CIF)    3.04                              31.54                         10.38
City (4CIF)     0.78                              9.19                          11.78
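The speed-up column is the ratio of the two frame rates, which can be checked with a few lines of Python using only the numbers reported on the slides. The function name is an assumption. The last line converts the 108.77 ms total per frame from the previous slide into a frame rate, which appears consistent with the City (4CIF) GPU figure.

```python
def speed_up(cpu_fps, gpu_fps):
    """Speed-up of the GPU implementation over the CPU-only one."""
    return gpu_fps / cpu_fps

# (CPU fps, GPU fps) as reported in the table.
results = {"Stefan (CIF)": (3.04, 31.54), "City (4CIF)": (0.78, 9.19)}
for name, (cpu, gpu) in results.items():
    print(name, round(speed_up(cpu, gpu), 2))

frame_ms = 108.77                       # total time per frame from the timing table
print(round(1000.0 / frame_ms, 2))      # frames per second implied by that total
```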

Page 28: Introduction to CUDA

Conclusions

The paper [1] presents an efficient block-level parallelized algorithm for variable block size motion estimation on a CUDA GPU.

A GPU acting as a coprocessor can effectively accelerate massive data computation.