Many-Thread Aware Prefetching Mechanisms for GPGPU Applications Jaekyu Lee Nagesh B. Lakshminarayana Hyesoon Kim Richard Vuduc

Page 2: Introduction

Many-Thread Aware Prefetching Mechanisms (MICRO-43)

General-purpose GPUs (GPGPUs) are becoming popular
  High-performance capability (NVIDIA GeForce GTX 580: 1.5 TFLOPS)
  Many cores with large-scale multi-threading and SIMD units
CUDA programming model
  SIMT (Single Instruction, Multiple Threads)
  Hierarchy of thread groups: thread, thread block

[Figure labels: Core, SIMD Execution, Shared Memory, Memory Request Buffer, DRAM]

Page 3: Memory Latency Problem

Tolerating memory latency is critical in CPUs
  Many techniques have been proposed: caches, prefetching, multi-threading, etc.
GPGPUs have employed multi-threading
Memory latency is critical in GPGPUs as well
  Limited thread-level parallelism
    Application behavior: algorithmically, lack of parallelism
    Limited by resource constraints: # registers per thread, # threads per block, shared memory usage per block

Page 4: Multi-threading Example

Legend: C = computation, M = memory, D = dependent on memory

Example 1: Enough threads (4 active threads, T0 to T3)
  The core switches to another thread each time a thread issues a memory request
  The memory latency is fully overlapped with the other threads' computation: no stall

Example 2: Not enough threads (2 active threads, T0 and T1)
  After one switch, no ready thread remains
  The core stalls for the rest of the memory latency (stall cycles)
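The two examples can be captured with a back-of-the-envelope model. The sketch below is not from the slides: it assumes that every thread runs a fixed number of compute cycles and then waits on one memory request, and that thread switching is free. The function name and the numbers in the note are illustrative only.

```cpp
#include <algorithm>
#include <cassert>

// Simplified model (an assumption, not taken from the slides): each thread
// runs `compute_cycles` of independent work, then issues one memory request
// that takes `mem_latency` cycles. Switching between threads costs nothing.
// Returns the cycles per round in which no thread is ready to run.
int stall_cycles_per_round(int active_threads, int compute_cycles, int mem_latency) {
    assert(active_threads >= 1);
    // Work the other threads can execute while one thread waits on memory.
    int overlap = (active_threads - 1) * compute_cycles;
    return std::max(0, mem_latency - overlap);
}
```

With hypothetical values of 2 compute cycles per thread and a 6-cycle memory latency, four active threads give zero stall cycles, while two active threads leave 4 stall cycles per round.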

Page 5: Prefetching in GPGPUs

Problem: when multi-threading is not enough, we need other mechanisms to hide memory latency.
Other solutions
  Caching (NVIDIA Fermi)
  Prefetching (in this talk)
Many prefetchers have been proposed for CPUs
  Stride, stream, Markov, CDP, GHB, helper thread, etc.
Question: Will the existing mechanisms work in GPGPUs?

Page 6: Characteristic #1. Many Threads

Problem #1. Training of prefetcher
  Accesses from many threads are interleaved
  Thread ID indexing
    Reduced effective prefetcher size
    Scalability

[Figure: prefetching in CPU (one prefetcher serving 1 or 2 threads) versus prefetching in GPGPU (one prefetcher serving many threads)]

Page 7: Characteristic #2. Data Level Parallelism

Problem #2. Short thread lifetime
  Due to parallelization, the length of a thread in parallel programs is shorter
  This removes prefetching opportunities

[Figure: in a sequential thread, a prefetch issued one memory latency ahead of its demand access is useful. In parallel threads, the time from thread creation to termination is too short compared with the memory latency, so there is no opportunity to prefetch.]

Page 8: Characteristic #3. SIMT

Problem #3. Single-Configuration Many-Threads (SCMT)
  Too many threads are controlled together
  Prefetch degree: # of prefetches per trigger
    Prefetch degree 1: < cache size (fits in the prefetch cache)
    Prefetch degree 2: >> cache size (capacity misses)
Problem #4. Amplified negative effects
  One useless prefetch request per thread becomes many useless prefetches
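The SCMT problem is essentially arithmetic: because one configuration applies to every thread, raising the prefetch degree by one multiplies the prefetched data by the thread count. The sketch below is a rough illustration under stated assumptions. The 16 KB cache size is the per-core figure quoted in the evaluation slides, while the thread count (192) and block size (64 bytes) in the note are hypothetical values chosen for the example.

```cpp
#include <cassert>
#include <cstddef>

// Bytes brought in when every thread issues `degree` prefetches at once.
std::size_t prefetch_footprint_bytes(std::size_t threads, std::size_t degree,
                                     std::size_t block_bytes) {
    return threads * degree * block_bytes;
}

// True if the combined prefetches of all threads fit in the prefetch cache.
bool fits_in_prefetch_cache(std::size_t threads, std::size_t degree,
                            std::size_t block_bytes, std::size_t cache_bytes) {
    assert(block_bytes > 0);
    return prefetch_footprint_bytes(threads, degree, block_bytes) <= cache_bytes;
}
```

With 192 threads and 64-byte blocks, degree 1 needs 12,288 bytes and fits in a 16 KB cache, whereas degree 2 needs 24,576 bytes and does not.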

Page 10: Goal

Design hardware/software prefetching mechanisms for GPGPU applications
Step 1. Prefetcher for many-thread architectures
  Many-Thread Aware Prefetching Mechanisms (scalability, short thread lifetime)
  * H/W prefetcher: in this talk; S/W prefetcher: in the paper
Step 2. Feedback mechanism to reduce negative effects
  Prefetch Throttling (SCMT, amplified negative effects)

Page 11: Many-Thread Aware Hardware Prefetcher

(Conventional) stride prefetcher
Promotion table for the stride prefetcher (scalability)
Inter-thread prefetcher (short thread lifetime)
Decision logic

[Figure: block diagram with three tables (Promotion Table, IP Pref., Stride Pref.) feeding the Decision Logic, which outputs the prefetch address. Inputs are labeled "PC, ADDR" and "PC, ADDR, TID". A "Stride Promotion" arrow links the stride prefetcher to the Promotion Table.]

Page 12: Solving Scalability Problem

Problem #1. Training of prefetcher (scalability)
Stride promotion
  Threads have similar (or even identical) access patterns (SIMT)
  Without promotion, the table is occupied by redundant entries
  With promotion, storage is managed effectively
  Training time is reduced by using earlier threads' information

Conventional stride table (redundant entries):

| PC   | TID | STRIDE |
|------|-----|--------|
| 0x1a | 1   | 65536  |
| 0x1a | 3   | 65536  |
| 0x1a | 10  | 65536  |
| 0x1a | 7   | 65536  |
| …    | …   | …      |

Promotion table (after promotion):

| PC   | STRIDE |
|------|--------|
| 0x1a | 65536  |
| …    | …      |
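A small software model may help make stride promotion concrete. This is a sketch under assumptions, not the hardware design from the talk: the struct and member names are made up, real tables have fixed capacity and a replacement policy that is not modeled here, and the promotion threshold of three matching threads is a hypothetical parameter that the slide does not specify.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Software model of a per-thread stride table plus a promotion table.
struct StrideTables {
    // (PC, TID) -> trained stride: the conventional, per-thread table.
    std::map<std::pair<std::uint64_t, int>, std::int64_t> per_thread;
    // PC -> stride shared by all threads: the promotion table.
    std::map<std::uint64_t, std::int64_t> promoted;
    // Hypothetical threshold: threads that must agree before promotion.
    int promote_threshold = 3;

    // Record that thread `tid` has trained `stride` for the load at `pc`.
    void record(std::uint64_t pc, int tid, std::int64_t stride) {
        assert(promote_threshold >= 1);
        if (promoted.count(pc)) return;  // already promoted: no per-thread entry needed
        per_thread[std::make_pair(pc, tid)] = stride;

        // Count the threads that trained the same stride for this PC.
        int matching = 0;
        for (const auto& entry : per_thread)
            if (entry.first.first == pc && entry.second == stride) ++matching;
        if (matching < promote_threshold) return;

        // Promote: keep one PC-indexed entry and free the redundant ones.
        promoted[pc] = stride;
        for (auto it = per_thread.begin(); it != per_thread.end();) {
            if (it->first.first == pc) it = per_thread.erase(it);
            else ++it;
        }
    }
};
```

Feeding the model the slide's entries (PC 0x1a, stride 65536, threads 1, 3, 10) promotes the stride on the third thread and empties the per-thread table. Thread 7 then needs no entry and no training of its own.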

Page 13: Solving Short Thread Lifetime Problem

Problem #2. Short thread lifetime
  Highly parallelized code often reduces prefetching opportunities

Sequential code (loop: each iteration can prefetch for a later iteration, one memory latency ahead of the demand access):

    // D: prefetch distance, in iterations
    for (ii = 0; ii < 100; ++ii) {
        prefetch(A[ii+D]);
        prefetch(B[ii+D]);
        C[ii] = A[ii] + B[ii];
    }

Parallel code (no loop, few instructions, no opportunity):

    // there are 100 threads
    __global__ void KernelFunction(…) {
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        int varA = aa[tid];
        int varB = bb[tid];
        C[tid] = varA + varB;
    }

Page 14: Inter-Thread Prefetching

Instead, we can prefetch for other threads: Inter-Thread Prefetching (IP)
In CUDA, the memory index is a function of the thread ID

    // there are 100 threads
    __global__ void KernelFunction(…) {
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        int next_tid = tid + 32;
        prefetch(aa[next_tid]);
        prefetch(bb[next_tid]);
        int varA = aa[tid];
        int varB = bb[tid];
        C[tid] = varA + varB;
    }

[Figure: threads T0 to T3 prefetch the data accessed by threads T32 to T35, which in turn prefetch for T64 onward. Each prefetch targets a memory access in another thread.]

Page 15: IP Pattern Detection in Hardware

Detecting strides across threads
Launch prefetch requests

IP table entry after each incoming request:

| Step    | Incoming request         | PC   | Addr 1 | TID 1 | Addr 2 | TID 2 | Train | Delta |
|---------|--------------------------|------|--------|-------|--------|-------|-------|-------|
| Initial | -                        | -    | -      | -     | -      | -     | -     | -     |
| Req 1   | PC:0x1a Addr:400 TID:3   | 0x1a | 400    | 3     | -      | -     | -     | -     |
| Req 2   | PC:0x1a Addr:1100 TID:10 | 0x1a | 400    | 3     | 1100   | 10    | -     | -     |
| Req 3   | PC:0x1a Addr:200 TID:1   | 0x1a | 400    | 3     | 1100   | 10    | √     | 100   |

Delta = Addr ∆ / TID ∆
  Delta (Req1-Req2) = (400 - 1100) / (3 - 10) = 100
  Delta (Req3-Req1) = (200 - 400) / (1 - 3) = 100
  Delta (Req3-Req2) = (200 - 1100) / (1 - 10) = 100
All three deltas are the same: we found a pattern

Req 4 (PC:0x1a Addr:2100 TID:1): prefetch (addr + stride), with Addr: 2100 and Stride: 100
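The training steps above can be sketched in software. This is a minimal model under assumptions, not the hardware implementation: the type and member names are invented, an entry that fails to train is never replaced, and the `tid_distance` parameter is a hypothetical generalization (the slide's "addr + stride" corresponds to a distance of 1).

```cpp
#include <cassert>
#include <cstdint>

struct Request {
    std::int64_t addr;
    std::int64_t tid;
};

// One IP table entry for a single PC, mirroring the slide's columns.
struct IpEntry {
    Request first{};         // Addr 1 / TID 1
    Request second{};        // Addr 2 / TID 2
    int seen = 0;            // requests stored so far (0 to 2)
    bool trained = false;    // the "Train" column
    std::int64_t delta = 0;  // address change per thread-ID step

    // Address delta divided by TID delta; false if it is not a whole number.
    static bool per_tid_delta(const Request& a, const Request& b, std::int64_t& out) {
        std::int64_t dt = a.tid - b.tid;
        std::int64_t da = a.addr - b.addr;
        if (dt == 0 || da % dt != 0) return false;
        out = da / dt;
        return true;
    }

    // Feed one demand request; returns true once the entry is trained.
    bool observe(const Request& r) {
        if (trained) return true;
        if (seen == 0) { first = r; seen = 1; return false; }
        if (seen == 1) { second = r; seen = 2; return false; }
        std::int64_t d12 = 0, d31 = 0, d32 = 0;
        if (per_tid_delta(first, second, d12) && per_tid_delta(r, first, d31) &&
            per_tid_delta(r, second, d32) && d12 == d31 && d31 == d32) {
            trained = true;  // all three pairwise deltas agree
            delta = d12;
        }
        return trained;
    }

    // Address to prefetch for the thread `tid_distance` IDs ahead.
    std::int64_t prefetch_address(std::int64_t addr, std::int64_t tid_distance) const {
        assert(trained);
        return addr + delta * tid_distance;
    }
};
```

Replaying the slide's three requests trains the entry with a delta of 100, and the fourth request at address 2100 then yields a prefetch for address 2200.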

Page 16: MT-aware Hardware Prefetcher

Decision logic
  Priority: Promotion table > Stride prefetcher > IP prefetcher
  Stride behavior within a thread is more common
  Entries in the Promotion table have been trained for a longer time

[Figure: the block diagram from Page 11, annotated with Cycle 1, Cycle 2, and Cycle 3]

| Promotion | IP Table  | Stride Prefetcher | Action |
|-----------|-----------|-------------------|--------|
| 1st cycle | 2nd cycle | 3rd cycle         |        |
| HIT       | HIT       | Not accessed      | Generate stride prefetch requests |
| HIT       | MISS      | Not accessed      | Generate stride prefetch requests |
| MISS      | HIT       | Not accessed      | Generate IP prefetch requests |
| MISS      | MISS      | Accessed          | Generate stride prefetch requests, if hit. Update Promotion Table, if necessary |
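The action table reduces to a short priority chain. The sketch below simply encodes the four rows. The enum and function names are illustrative, and the promotion-table update mentioned in the last row is left out.

```cpp
#include <cassert>

enum class Action {
    StridePrefetchFromPromotion,    // Promotion table hit
    IpPrefetch,                     // Promotion miss, IP table hit
    StridePrefetchFromStrideTable,  // both missed, stride prefetcher hit
    None                            // nothing is trained for this PC yet
};

// Encodes the action table row by row. The stride prefetcher result only
// matters when the Promotion table and the IP table both miss, which is
// the only case in which the slide marks it as accessed.
Action decide(bool promotion_hit, bool ip_hit, bool stride_hit) {
    if (promotion_hit) return Action::StridePrefetchFromPromotion;
    if (ip_hit) return Action::IpPrefetch;
    if (stride_hit) return Action::StridePrefetchFromStrideTable;
    return Action::None;
}
```

Checking the Promotion table first means that a stride already confirmed by several threads is used without consulting the per-thread stride table at all.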

Page 17: Goal

Design a hardware/software prefetcher for GPGPU applications
Step 1. Prefetcher for many-thread architectures
  Many-Thread Aware Prefetching Mechanisms (scalability, short thread lifetime)
Step 2. Feedback mechanism to reduce negative effects
  Prefetch Throttling (SCMT, amplified negative effects)

Page 18: Design GPGPU Prefetch Throttling

Need GPGPU-specific metrics to identify whether prefetching is effective
Extension of feedback-directed prefetching for CPUs [Srinath07]
  Useful prefetches: accurate and timely
  Harmful prefetches: inaccurate or too early
Some late prefetches can be tolerated by multi-threading, so they are less harmful

Page 19: Throttling Metrics

Merged memory requests
  A new request with the same address as an existing entry
  Merged inside a core (in the MSHR)
  In CPUs, they indicate late prefetches
  In GPGPUs, they indicate accuracy (due to massive multi-threading)
Early block eviction from the prefetch cache
  Due to capacity misses, regardless of accuracy
Periodic updates
  To cope with runtime behavior

Page 20: Heuristic for Prefetch Throttling

Throttle degree
  Varies from 0 (prefetch all) to 5 (no prefetch)
  Default: 2

| Early Eviction | Merge Ratio | Action        | Note           |
|----------------|-------------|---------------|----------------|
| High           | High        | NO prefetch   | Too aggressive |
| Medium         | -           | LESS prefetch |                |
| Low            | High        | MORE prefetch |                |
| Low            | Low         | NO prefetch   | Inaccurate *   |
| High           | Low         | NO prefetch   | Inaccurate     |

* The ideal case (accurate prefetches with perfect timing) will have low early eviction and a low merge ratio.
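One possible reading of the heuristic as a periodic update rule is sketched below. Several details are assumptions rather than statements from the talk: "NO prefetch" is modeled as jumping to the maximum throttle degree, "LESS" and "MORE" as single steps, and the classification of the measured rates into Low/Medium/High is left to the caller because the slide gives no thresholds. All names are illustrative.

```cpp
#include <algorithm>
#include <cassert>

enum class Level { Low, Medium, High };

constexpr int kMinThrottle = 0;      // prefetch all
constexpr int kMaxThrottle = 5;      // no prefetch
constexpr int kDefaultThrottle = 2;  // default from the slide

// Returns the throttle degree for the next period, given the current degree
// and the levels measured during the period that just ended.
int next_throttle_degree(int current, Level early_eviction, Level merge_ratio) {
    assert(current >= kMinThrottle && current <= kMaxThrottle);
    if (early_eviction == Level::Medium)  // LESS prefetch
        return std::min(current + 1, kMaxThrottle);
    if (early_eviction == Level::Low && merge_ratio == Level::High)  // MORE prefetch
        return std::max(current - 1, kMinThrottle);
    // High eviction (any merge ratio), or low eviction without a high merge
    // ratio: NO prefetch, modeled here as the maximum throttle degree.
    return kMaxThrottle;
}
```

Starting from the default degree of 2, a period with medium early eviction moves the degree to 3, and a period with low eviction and a high merge ratio moves it to 1.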

Page 21: Outline

Motivation
Step 1. Many-Thread Aware Prefetching
Step 2. Prefetch Throttling
Evaluation
Conclusion

Page 22: Evaluation Methodology

MacSim simulator
  A cycle-accurate, in-house simulator
  A trace-driven simulator (traces from GPUOcelot [Diamos10])
Baseline
  14 cores (8-wide SIMD), frequency: 900 MHz
  16 banks / 8 channels, 1.2 GHz memory frequency, 900 MHz bus, FR-FCFS
  NVIDIA G80 architecture
14 memory-intensive benchmarks
  CUDA SDK, Merge, Rodinia, and Parboil
  Stride, MP (massively parallel), and uncoalesced types
  Non-memory-intensive benchmarks (in the paper)

Page 23: Evaluation Methodology

Prefetch
  Stream, Stride, and GHB prefetchers evaluated
  16 KB cache per core (results for other sizes are in the paper)
  Prefetch distance: 1, degree: 1 (the optimal configuration)
Results
  Hardware prefetcher
  Software prefetcher (in the paper)

Page 24: Results: MT Hardware Prefetching

GHB/Stride do not work in mp-type and uncoal-type benchmarks
IP (Inter-Thread Prefetching) can be effective
Stride Promotion improves the performance of a few benchmarks

[Figure: speedup (y-axis, 0.5 to 3) of GHB, Stride, Stride+Promotion, and Stride+IP for black, conv, mersenne, monte, pns, scalar, stream, backprop, cell, ocean, bfs, cfd, linear, sepia, and AVG, grouped into stride-type, mp-type, and uncoal-type. Annotation: 15% over Stride.]

Page 25: Results: MT-HWP with Throttling

GHB+F improves performance
MT-HWP+T eliminates the negative effect (stream)
* The feedback mechanism is more effective in software prefetching

[Figure: speedup (y-axis, 0.5 to 3) of GHB, GHB+F, StridePC, StridePC+T, MT-HWP, and MT-HWP+T for black, conv, mersenne, monte, pns, scalar, stream, backprop, cell, ocean, bfs, cfd, linear, sepia, and AVG, grouped into stride-type, mp-type, and uncoal-type. Annotation: 15% over Stride + Throttling.]

Page 27: Conclusion

Memory latency is an important problem in GPGPUs as well.
GPGPU prefetching has four problems: scalability, short thread lifetime, SCMT, and amplified negative effects
Goal: design a hardware/software prefetcher
  Step 1. Many-thread aware prefetcher (promotion, IP)
  Step 2. Prefetch throttling
The MT-aware hardware prefetcher shows a 15% performance improvement over the Stride prefetcher, and prefetch throttling removes all the negative effects.
Future work
  Study other many-thread architectures
  Other programming models, architectures with caches

Page 28: THANK YOU!

Page 30: NVIDIA Fermi Result

[Figure: speedup (y-axis, 0 to 2.5) of HP and HP+Throttle for black, conv, mersenne, monte, pns, scalar, stream, backprop, ocean, bfs, cfd, linear, and AVG, grouped into STRIDE, MP, and UNCOALESCED types.]

Page 31: Different Prefetch Cache Size

[Figure: speedup (y-axis, 0.9 to 1.5) of MT-HWP, MT-HWP+T, MT-SWP, and MT-SWP+T for prefetch cache sizes of 1K, 2K, 4K, 8K, 16K, 32K, 64K, and 128K.]

Page 32: Software MT Prefetcher Results

[Figure: speedup (y-axis, 0.0 to 4.0) of Register, Stride, MT-SWP, and MT-SWP+Throttle for black, conv, mersenne, monte, pns, scalar, stream, backprop, cell, ocean, bfs, cfd, linear, sepia, and AVG, grouped into stride-type, mp-type, and uncoal-type.]

Page 33: Hardware prefetcher without TID

[Figure: speedup (y-axis, 0.6 to 1.4) of Stride, StridePC, Stream, and GHB for black, conv, mersenne, monte, pns, scalar, stream, backprop, cell, ocean, bfs, cfd, linear, sepia, and AVG, grouped into stride-type, mp-type, and uncoal-type.]

Page 34: Hardware prefetcher with TID

[Figure: speedup (y-axis, 0.6 to 1.6) of Stride, StridePC, Stream, and GHB for black, conv, mersenne, monte, pns, scalar, stream, backprop, cell, ocean, bfs, cfd, linear, sepia, and AVG, grouped into stride-type, mp-type, and uncoal-type.]

Page 35: Benefit Because of Few Threads?

| black | conv | mersenne | monte | pns | scalar | stream |
|-------|------|----------|-------|-----|--------|--------|
| 12    | 12   | 8        | 16    | 8   | 16     | 16     |

| backprop | cell | ocean | bfs | cfd | linear | sepia |
|----------|------|-------|-----|-----|--------|-------|
| 16       | 16   | 16    | 16  | 6   | 16     | 24    |

Some benchmarks have enough threads but still cannot fully hide memory latency.

[Figure: speedup (y-axis, 0.5 to 3.5) of GHB, Stride, Stride+Promotion, and Stride+IP for black, conv, mersenne, monte, pns, scalar, stream, backprop, cell, ocean, bfs, cfd, linear, sepia, and AVG, grouped into stride-type, mp-type, and uncoal-type.]

Page 36: Inter-Thread Prefetching

IP may not be useful in some cases
Case 1. Demand requests have already been generated
  Threads are not executed in a strict sequential order (out-of-order execution among threads)
  Redundant prefetches: the requests will be merged in the memory system, so they are less harmful
Case 2. Out-of-array-range effect
  The last thread in a block generates a request for another thread that is mapped to a different core
  Unless inter-core merging occurs in the DRAM controller, these are useless prefetches