CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications


Page 1: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Yi Yang, NEC Labs
Huiyang Zhou, NCSU

PPoPP'2014, 2/17/2014
www.nec-labs.com

Page 2: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Outline
• Background
• Motivation
• CUDA-NP
• Experiments
• Conclusions

Page 3: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Background
• Many-core architectures
  • Overcome the limitation of instruction-level parallelism (ILP)
  • Achieve high performance at lower energy
• Thread-level parallelism (TLP) has been the key to utilizing many-core architectures
  • CPUs support 10+ threads
  • Intel Many Integrated Core (MIC) supports 200+ threads
  • GPGPUs support 10K+ threads
• TLP is used to
  • Occupy a large number of cores
  • Hide the off-chip memory latency

Page 4: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

GPGPU architecture
• Single-instruction, multiple-data (SIMD)
  • A warp of threads (32 threads) executes the same instruction on different data
  • A thread can read registers from another thread in the same warp using the shfl instruction (latest NVIDIA Kepler GPUs)
• Memory coalescing
  • A warp of threads accesses data in a single cache line to maximize memory bandwidth
• A thread block contains multiple warps
  • Threads in the same thread block can communicate through shared memory (a software-managed on-chip cache)
  • Threads in the same thread block run on the same SM
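As a quick illustration of warp-level register exchange (a sketch, not from the slides; CUDA 9.0+ spells the intrinsic __shfl_sync with an extra mask argument):

__global__ void rotate_in_warp(float *data) {
    int lane = threadIdx.x % warpSize;
    float v = data[threadIdx.x];
    // read the register value held by the next lane of the same warp,
    // with no trip through shared or global memory
    float next = __shfl(v, (lane + 1) % warpSize);
    data[threadIdx.x] = next;
}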

Page 5: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Parallel programs to enable TLP
• Parallel programming languages
  • OpenMP
  • CUDA and OpenCL
  • OpenACC
• To write a correct parallel program, developers have to
  • Identify parallel code sections or parallel loops
  • Modify those code sections or loops using a specific language
• To achieve high performance, developers have to
  • Understand the hardware platform
• None of these steps is easy

Page 6: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

How to write a parallel (CUDA) program
• Two loops in the single-thread program: which one do you prefer to parallelize?

void tmv_single_thread(float *a, float *b, float *c, int w, int h) {
    for (int k = 0; k < w; k++) {
        float sum = 0;
        for (int i = 0; i < h; i++)
            sum += a[i*w + k] * b[i];
        c[k] = sum;
    }
}

Transposed-matrix-vector multiplication (TMV)

Parallelizing the outer loop gives one thread per output element:

__global__ void tmv_kernel(float *a, float *b, float *c, int w, int h) {
    int k = threadIdx.x + blockIdx.x * blockDim.x;
    ...
}

Why not parallelize the inner loop?

Page 7: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Outline
• Background
• Motivation
• CUDA-NP
• Experiments
• Conclusions

Page 8: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Why not parallelize the inner loop
• Developers prefer to parallelize outer loops
• How to handle reduction or scan variables (sum +=)?
• How to utilize the GPGPU features when parallelizing the nested loop?

__global__ void tmv(float *a, float *b, float *c, int w, int h) {
    float sum = 0;
    int tx = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = 0; i < h; i++)
        sum += a[i*w + tx] * b[i];
    c[tx] = sum;
}

Kernel code of transposed-matrix-vector multiplication (TMV)

So we can find nested parallelism in many parallel programs.

Page 9: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Impact of nested parallelism
• The overall thread-level parallelism is not fully utilized
  • If we parallelize the nested parallelism, we get more TLP to make use of under-utilized resources
  • 10K threads per GPU = 100 threads from the outer loop x 100 threads from the inner loop
• The workload/resource usage of each thread is heavy
  • If we parallelize the nested parallelism, we can reduce the workload/resource per thread
  • With limited resources, we can then have more threads

Page 10: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

NVIDIA dynamic parallelism
• NVIDIA dynamic parallelism: launch child kernels from within a GPU thread
• Memory-copy microbenchmark
  • We launch each child kernel from a parent thread
  • Each thread of a child kernel copies one element
  • The overall data to be copied (number of child kernels x number of threads per child kernel): 64M floats
• 142 GB/s without dynamic parallelism; up to 63 GB/s with dynamic parallelism enabled
• For the same overall workload, increasing the number of child kernels reduces performance, e.g., 4K child kernel launches (16K threads per child kernel): 34 GB/s

[Figure: memory-copy bandwidth (GB/s) versus the number of threads per child kernel launch, from 256 to 64M]
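A minimal sketch of the kind of parent/child launch this microbenchmark measures (names and launch configuration are illustrative; dynamic parallelism requires compute capability 3.5+ and compiling with -rdc=true):

__global__ void child_copy(float *dst, const float *src) {
    // each child thread copies one element
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    dst[i] = src[i];
}

__global__ void parent(float *dst, const float *src, int n_per_child) {
    // each parent thread launches one child kernel over its own slice
    size_t base = (size_t)(threadIdx.x + blockIdx.x * blockDim.x) * n_per_child;
    child_copy<<<n_per_child / 256, 256>>>(dst + base, src + base);
}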

Page 11: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Limitations of NVIDIA dynamic parallelism
• Child kernel launch overhead
• Communication between child and parent kernels
  • Significant overhead, as it has to go through global memory
  • Complicates code development
• Not a good fit for applications with small loop counts

Page 12: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Outline
• Background
• Motivation
• CUDA-NP
• Experiments
• Conclusions

Page 13: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Our solution: CUDA-NP
• Developers add an OpenMP-like pragma to the parallel loop
• Our compiler framework generates optimized code that leverages the nested parallelism

__global__ void tmv(float *a, float *b, float *c, int w, int h) {
    float sum = 0;
    int tx = threadIdx.x + blockIdx.x * blockDim.x;
#pragma np parallel for reduction(+:sum)
    for (int i = 0; i < h; i++)
        sum += a[i*w + tx] * b[i];
    c[tx] = sum;
}

Kernel code of transposed-matrix-vector multiplication (TMV)

Page 14: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Execution diagram
• Assume each thread block of the baseline has 8 threads
• The optimized kernel has 8x4 threads per thread block
• 4 slave threads are used to process the parallel section

[Diagram: (a) execution of the baseline: the threads run sequential section, loop section, sequential section; (b) execution of the optimized kernel: the master threads run the sequential sections, and the slave threads join them for the parallel section]

Page 15: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Example after CUDA-NP
• Introduce threads in the Y dimension as slave threads
• Process the parallel section using multiple slave threads
• Apply a reduction after the parallel section
• The master thread executes the non-parallel sections

__global__ void tmv_np(float *a, float *b, float *c, int w, int h) {
    float sum = 0;
    int tx = threadIdx.x + blockIdx.x * blockDim.x;
    int slave_id = threadIdx.y;
    int slave_size = blockDim.y;
    for (int i = slave_id; i < h; i += slave_size)
        sum += a[i*w + tx] * b[i];
    sum = reduction(sum);
    if (slave_id == 0) c[tx] = sum;
}

Kernel code of transposed-matrix-vector multiplication (TMV)
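The reduction() helper is not spelled out on the slide; a minimal shared-memory version for inter-warp NP, assuming the 8x4 thread-block shape from the execution diagram (BLOCK_X and BLOCK_Y are illustrative constants), could look like:

#define BLOCK_X 8   // master threads per block (threadIdx.x); assumed
#define BLOCK_Y 4   // slave threads per master (threadIdx.y); assumed

__device__ float reduction(float val) {
    __shared__ float buf[BLOCK_Y][BLOCK_X];
    buf[threadIdx.y][threadIdx.x] = val;
    __syncthreads();
    // tree-reduce the partial sums of the slaves sharing one master
    for (int s = BLOCK_Y / 2; s > 0; s >>= 1) {
        if (threadIdx.y < s)
            buf[threadIdx.y][threadIdx.x] += buf[threadIdx.y + s][threadIdx.x];
        __syncthreads();
    }
    return buf[0][threadIdx.x];
}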

Page 16: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Slave threads organization
• Inter-warp nested parallelism
  • For a master thread, we allocate its slave threads in different warps
  • Master thread id 0: slave thread ids 0, 8, 16, 24

Inter-warp NP (warp size is 8):
   0  1  2  3  4  5  6  7   master thread id
   8  9 10 11 12 13 14 15   slave thread id
  16 17 18 19 20 21 22 23   slave thread id
  24 25 26 27 28 29 30 31   slave thread id

Page 17: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Slave threads organization
• Intra-warp nested parallelism
  • For a master thread, we allocate its slave threads in the same warp
  • Master thread id 0: slave thread ids 0, 1, 2, 3

Intra-warp NP:
   0  4  8 12 16 20 24 28   master thread id
   1  5  9 13 17 21 25 29   slave thread id
   2  6 10 14 18 22 26 30   slave thread id
   3  7 11 15 19 23 27 31   slave thread id
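In code, the two organizations differ only in how a thread derives its master/slave role from its linear thread id; a hedged sketch matching the diagrams above (num_masters = 8 and slave_size = 4 here, both illustrative):

// inter-warp NP: the slaves of one master occupy the same lane of later warps
int master_inter = tid % num_masters;   // master 0 owns tids 0, 8, 16, 24
int slave_inter  = tid / num_masters;

// intra-warp NP: a master and its slaves are adjacent lanes of one warp
int master_intra = tid / slave_size;    // master 0 owns tids 0, 1, 2, 3
int slave_intra  = tid % slave_size;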

Page 18: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Variables across parallel sections
• Scalar variables
  • Inputs/live-ins to parallel sections
  • Outputs/live-outs from parallel sections
• Array variables
  • Inputs/live-ins to parallel sections
  • Outputs/live-outs from parallel sections

Page 19: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Scalar variables
• Inputs/live-ins to parallel sections
  • A scalar variable of the master thread has to be broadcast to its slave threads
  • Intra-warp NP on Kepler: __shfl can be used to broadcast a scalar variable to the slave threads
  • Intra-warp NP on legacy hardware, or inter-warp NP: shared memory
• Scalar outputs/live-outs from parallel sections
  • Reduction and scan variables
  • Intra-warp NP on Kepler: __shfl can be used
  • Intra-warp NP on legacy hardware, or inter-warp NP: shared-memory implementation
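For intra-warp NP on Kepler, both directions can stay in registers; a sketch using the sub-warp width argument of the shuffle intrinsics (slave_size must be a power of two; CUDA 9.0+ uses the *_sync variants):

__device__ float broadcast_from_master(float x, int slave_size) {
    // sub-lane 0 of each slave_size-wide group is the master thread
    return __shfl(x, 0, slave_size);
}

__device__ float group_reduce_sum(float v, int slave_size) {
    // butterfly reduction within each slave_size-wide group
    for (int off = slave_size / 2; off > 0; off >>= 1)
        v += __shfl_xor(v, off, slave_size);
    return v;   // every lane of the group now holds the group's sum
}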

Page 20: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Array structures across parallel sections
• Global memory or shared memory arrays
  • Visible to all slave threads
• Local arrays (local memory or registers)
  • Replace the local array with global memory
  • Replace the local array with shared memory
  • Partition the local array into a small local array per slave thread
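For the partitioning option, a sketch of what the transformation might do to a per-thread buffer (the sizes, the 4-slave split, and f() are all illustrative):

__device__ float f(int i) { return (float)i; }  // placeholder computation

// baseline: each (master) thread privately fills a 16-element local array
__device__ void baseline_fill() {
    float tmp[16];
    for (int i = 0; i < 16; i++) tmp[i] = f(i);
}

// after partitioning across 4 slaves: each slave keeps only the elements
// of the iterations it executes in the strided parallel loop
__device__ void partitioned_fill(int slave_id) {
    float tmp[4];
    for (int i = slave_id; i < 16; i += 4) tmp[i / 4] = f(i);
}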

Page 21: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Inter-Warp NP vs. Intra-Warp NP

                                               Inter-warp NP   Intra-warp NP
Can use __shfl (the only advantage
  of intra-warp NP)                                  N               Y
Imbalanced workload among slave threads              N               Y
Negative impact on memory coalescing                 N               Y
Negative impact on constant memory                   N               Y
Number of slave threads must be a power of 2         N               Y

Page 22: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Outline
• Background
• Motivation
• CUDA-NP
• Experiments
• Conclusions

Page 23: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Experimental Results
• NVIDIA GTX 680 GPU
• CUDA SDK 5.0
• Benchmarks
  • NVIDIA SDK: MarchingCubes (MC)
  • GPGPU-Sim: Libor (LIB)
  • Rodinia: Lud (LU), Leukocyte (LE), Streamcluster (SS), Computational Fluid Dynamics (CFD), BucketSort (BK), and Nearest Neighbor (NN)
  • TMV and MV

Page 24: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Best speedup over baseline
• CUDA-NP achieves speedups from 1.36x to 6.69x
• On average, CUDA-NP achieves a 2.18x speedup across the ten benchmarks

[Figure: best speedup over the baseline for MC, LU, LE, MV, SS, LIB, CFD, BK, TMV, NN, and the geometric mean (GM); y-axis 0.5 to 3.5, one bar annotated 6.69]

Page 25: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Intra-warp NP vs. inter-warp NP
• Most benchmarks prefer inter-warp NP
• LU has control divergence in the baseline
• NN prefers intra-warp NP due to un-coalesced memory accesses in the baseline

[Figure: speedups of inter-warp (INTER) and intra-warp (INTRA) NP with 2, 4, 8, 16, and 32 slave threads for MC, LU, LE, MV, SS, LIB, CFD, BK, TMV, and NN; y-axis 0 to 3.5, peaks annotated at 5.19, 6.69, and 6.17]

Page 26: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Number of slave threads
• More TLP is not always useful
• Most benchmarks prefer 4 or 8 slave threads for the best performance

[Figure: the same speedup chart as above, read here for the effect of 2, 4, 8, 16, and 32 slave threads]

Page 27: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Performance comparison for TMV
• CUBLAS 5.0 is a highly optimized library by NVIDIA
• For a 1K input, the CUDA-NP version delivers a 4.9x speedup over CUBLAS
• CUDA-NP doesn't hurt performance for large input sizes

[Figure: TMV performance (Gflops/s, 0 to 80) versus the width of the input matrix (1K to 64K) for CUBLAS, the baseline (BL), and CUDA-NP]

Page 28: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Benefit of the shfl instruction
• The __shfl instruction is very useful for MC and LU because it saves shared memory
• MC and LU use shared memory intensively

[Figure: speedup (0.8 to 1.4) from using __shfl for MC, LU, LE, MV, SS, LIB, CFD, TMV, and NN]

Page 29: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Conclusions
• Many benchmarks have nested parallelism with small loop counts
• We propose CUDA-NP, a compiler framework that supports directive-based nested parallelism
• Our compiler explores both intra-warp NP and inter-warp NP, and handles live variables across code sections
• 2.18x speedup on average

Page 30: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Thanks

Page 31: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Local array replacement

[Figure: speedup (0.8 to 1.6) of local array replacement for LE and LIB, comparing shared memory, global memory, and the register file]

Page 32: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Comparison
• NVIDIA dynamic parallelism
  • NN, TMV, LE, LIB, and CFD are 28.92, 7.61, 13.45, 125.67, and 52.29 times slower than their baselines, respectively
  • MC, LU, MV, SS, and BK use shared memory
    • They would have to copy data from shared memory to global memory to use NVIDIA dynamic parallelism
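The shared-memory issue arises because a child kernel cannot see its parent's shared memory, so the working set must first be staged to global memory. A sketch of the extra copy this forces (names and sizes are illustrative):

__global__ void child(float *data) {
    // the child can only operate on the staged copy in global memory
    data[threadIdx.x] *= 2.0f;
}

__global__ void parent(float *staging) {
    __shared__ float tile[256];
    tile[threadIdx.x] = (float)threadIdx.x;   // shared-memory working set
    __syncthreads();
    // stage the tile out to global memory before the child launch
    staging[blockIdx.x * 256 + threadIdx.x] = tile[threadIdx.x];
    __syncthreads();
    if (threadIdx.x == 0)
        child<<<1, 256>>>(staging + blockIdx.x * 256);
}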

Page 33: CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Experimental Methodology
• NVIDIA K20c
• Benchmarks