Mascar: Speeding up GPU Warps by Reducing Memory Pitstops
Ankit Sethia*, D. Anoushe Jamshidi, Scott Mahlke
University of Michigan
2
Graphics
Simulation
Linear Algebra
Data Analytics
Machine Learning
Computer Vision
All kinds of applications, both compute and memory intensive, are targeting GPUs
GPU usage is expanding
3
Performance variation of kernels
[Chart: % of peak IPC for compute intensive benchmarks (PF, kmeans-2, sgemm, leuko-2, tpacf, cutcp, histo-2, BP-1, stencil, mrig-3, histogram, sad-1, mri-q, lavaMD, leuko-3) and memory intensive benchmarks (leuko-1, BP-2, srad-1, srad-2, mrig-1, histo-3, spmv, mrig-2, histo-1, lbm, sad-2, mummer, particle, bfs, kmeans-1); y-axis 0-100%]
Memory intensive kernels saturate bandwidth and get lower performance
4
Impact of memory saturation - I
[Diagram: multiple SMs, each with FPUs, an LSU, and an L1 cache, all sharing the memory system]
• Memory intensive kernels serialize memory requests
• Critical to prioritize the order of requests from SMs
5
Impact of memory saturation
[Chart: fraction of peak IPC and fraction of cycles the LSU is stalled, for the same compute intensive and memory intensive benchmarks as above; y-axis 0-100%]
Significant stalls in the LSU correspond to low performance in memory intensive kernels
6
Impact of memory saturation - II
[Diagram: SMs with FPUs, LSU, and L1 connected to the memory system; warps W0 and W1 are queued at the LSU while W1's data already sits in the L1 cache blocks]
• Data is present in the cache, but the LSU can't access it
• Unable to feed enough data for processing
7
Increasing memory resources
[Chart: speedup (0-2x) of memory intensive kernels with large MSHRs + queues, full associativity, +20% frequency, and all combined]
Large # of MSHRs + full associativity + 20% bandwidth boost: UNBUILDABLE
8
During memory saturation:
• Serialization of memory requests causes less overlap between memory accesses and compute: Memory Aware Scheduling (MAS)
• Data present in the cache cannot be reused, as the data cache cannot accept any request: Cache Access Re-execution (CAR)
MAS + CAR = Mascar
9
Memory Aware Scheduling
[Diagram: during memory saturation, Warp 0, Warp 1, and Warp 2 each have outstanding memory requests]
Serving one request and switching to another warp (RR):
• No warp is ready to make forward progress
10
Memory Aware Scheduling
[Diagram: during memory saturation, Warp 0, Warp 1, and Warp 2 each have outstanding memory requests]
Serving one request and switching to another warp (RR):
• No warp is ready to make forward progress
Serving all requests from one warp and then switching to another (MAS):
• One warp is ready to begin computation early
GTO issues instructions from another warp whenever:
• There is no instruction in the i-buffer for that warp
• There is a dependency between instructions
This makes GTO similar to RR, as multiple warps may issue memory requests
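The RR-versus-MAS contrast can be sketched with a toy model (not the paper's simulator): assume every warp must issue the same number of memory requests, the memory system retires one request per cycle, and a warp can begin computing once all of its requests have returned.

```python
from collections import Counter

def first_compute_cycle(schedule):
    """Given the order in which requests are served (a list of warp ids),
    return the cycle at which the earliest warp has all its requests done."""
    need = Counter(schedule)          # total requests per warp
    seen = Counter()
    for cycle, warp in enumerate(schedule, start=1):
        seen[warp] += 1
        if seen[warp] == need[warp]:  # this warp's last request just finished
            return cycle
    return None

warps, reqs = 3, 4
rr  = [w for _ in range(reqs) for w in range(warps)]  # round-robin: 0,1,2,0,1,2,...
mas = [w for w in range(warps) for _ in range(reqs)]  # MAS: all of warp 0, then warp 1, ...

print(first_compute_cycle(rr))   # 10: no warp finishes until near the end
print(first_compute_cycle(mas))  # 4: warp 0 overlaps compute with warps 1-2's memory
```

Under RR-style interleaving the first warp cannot compute until cycle 10; serving one warp's requests back to back lets it start at cycle 4, which is the overlap MAS is after.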
11
MAS operation
[Flowchart: check whether the kernel is memory intensive (MSHRs or miss queue almost full).
  N → schedule in Equal Priority mode.
  Y → schedule in Memory Priority (MP) mode:
      assign a new owner warp (only the owner's requests can go beyond the L1);
      execute memory instructions only from the owner, while other warps can execute compute instructions;
      when the next instruction of the owner depends on an already issued load, assign a new owner warp.]
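The mode decision and ownership hand-off described above can be sketched in Python. The capacities, the 90% "almost full" threshold, and the round-robin hand-off are illustrative assumptions, not the paper's parameters.

```python
# Illustrative capacities -- assumptions, not Mascar's actual configuration.
MSHR_CAPACITY, MISS_Q_CAPACITY = 64, 8
ALMOST_FULL = 0.9  # "almost full" threshold (assumed)

def select_mode(mshrs_used, miss_q_used):
    """Enter Memory Priority (MP) mode when the memory system is saturating."""
    if (mshrs_used >= ALMOST_FULL * MSHR_CAPACITY or
            miss_q_used >= ALMOST_FULL * MISS_Q_CAPACITY):
        return "MP"   # only the owner warp may issue requests past the L1
    return "EP"       # Equal Priority: any warp may issue memory instructions

def step_mp(owner, warps, dependent_on_issued_load):
    """In MP mode, hand ownership to the next warp once the owner's next
    instruction depends on a load that is still in flight."""
    if dependent_on_issued_load:
        return warps[(warps.index(owner) + 1) % len(warps)]
    return owner

print(select_mode(mshrs_used=60, miss_q_used=2))  # MP: MSHRs almost full
print(select_mode(mshrs_used=10, miss_q_used=2))  # EP: memory not saturated
```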
12
Implementation of MAS
[Diagram: the scheduler receives ordered warps from Decode via the I-Buffer and Scoreboard; the Warp Readiness Checker (WRC) holds the owner warp id, a stall bit, and the memory saturation flag; the Warp Status Table (WST) records each warp's op type; Mem_Q and Comp_Q heads feed the issued warp from the RF]
• Divide warps into memory and compute warps in the ordered warps list
• Warp Readiness Checker (WRC): tests whether a warp should be allowed to issue memory instructions
• Warp Status Table (WST): decides whether the scheduler should schedule from a warp
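A minimal sketch of this bookkeeping: a WST entry per warp with an op type and stall bit, and a selection routine that enforces the MP-mode rule (memory instructions only from the owner, compute from anyone). Field names mirror the slide; the selection policy itself is a simplification, not the hardware's exact priority logic.

```python
from dataclasses import dataclass

@dataclass
class WSTEntry:
    op_type: str   # "mem" or "comp" -- type of the warp's next instruction
    stalled: bool  # set by the WRC when the warp may not issue

def pick_warp(wst, owner, saturated):
    """Return the id of a warp allowed to issue this cycle, else None."""
    ready = [w for w, e in wst.items() if not e.stalled]
    if saturated:
        # MP mode: memory instructions only from the owner warp;
        # other warps may still issue compute instructions.
        for w in ready:
            if wst[w].op_type == "mem" and w == owner:
                return w
        for w in ready:
            if wst[w].op_type == "comp":
                return w
        return None
    return ready[0] if ready else None  # EP mode: any ready warp

wst = {0: WSTEntry("mem", False), 1: WSTEntry("mem", False), 2: WSTEntry("comp", False)}
print(pick_warp(wst, owner=0, saturated=True))  # 0: owner's memory inst. issues
wst[1].stalled = True
print(pick_warp(wst, owner=1, saturated=True))  # 2: only the compute warp can issue
```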
13
During memory saturation:
• Serialization of memory requests causes less overlap between memory accesses and compute: Memory Aware Scheduling (MAS)
• Data present in the cache cannot be reused, as the data cache cannot accept any request: Cache Access Re-execution (CAR)
MAS + CAR = Mascar
14
Cache access re-execution
[Diagram: the Load Store Unit is backed by a re-execution queue holding W1 and W0; W2's access hits in the L1 cache while W1's data is already in the cache blocks]
Better than adding more MSHRs:
• More MSHRs cause faster saturation of memory
• They also cause faster thrashing of the data cache
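A sketch of the re-execution idea under a toy cache model: when the L1 cannot accept a miss (e.g. MSHRs full), the access is parked in a small queue instead of stalling the LSU, and is retried later, by which point the line may have arrived and the access becomes a hit under miss. The class, method names, and capacities are illustrative assumptions.

```python
from collections import deque

class LSUWithCAR:
    """Toy Load Store Unit with a Cache Access Re-execution (CAR) queue."""
    def __init__(self, reexec_capacity=4):
        self.cache = set()        # addresses currently resident in the L1
        self.mshrs_full = False   # back-pressure: the L1 cannot accept misses
        self.reexec_q = deque(maxlen=reexec_capacity)

    def access(self, addr):
        if addr in self.cache:
            return "hit"          # hits proceed even when MSHRs are full
        if self.mshrs_full:
            self.reexec_q.append(addr)  # park the miss instead of stalling
            return "queued"
        return "miss-sent"        # allocate an MSHR and go to memory

    def retry(self):
        """Re-execute parked accesses; still-blocked misses re-queue themselves."""
        pending = list(self.reexec_q)
        self.reexec_q.clear()
        return [self.access(a) for a in pending]

lsu = LSUWithCAR()
lsu.mshrs_full = True
print(lsu.access(0x100))  # queued: the L1 cannot accept the miss
lsu.cache.add(0x100)      # the line arrives while the request waits
print(lsu.retry())        # ['hit']: the data is reused without stalling the LSU
```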
15
Experimental methodology
• GPGPU-Sim 3.2.2 – GTX 480 architecture
• SMs – 15, 32 PEs/SM
• Schedulers – LRR, GTO, OWL and CCWS
• L1 cache – 32kB, 64 sets, 4-way, 64 MSHRs
• L2 cache – 768kB, 8-way, 6 partitions, 200 core cycle latency
• DRAM – 32 requests/partition, 440 core cycle latency
16
Performance of compute intensive kernels
[Chart: speedup w.r.t. RR (0-1.25x) of GTO, OWL, CCWS, and Mascar for PF, kmeans-2, sgemm, leuko-2, tpacf, cutcp, histo-2, BP-1, stencil, mrig-3, histogram, sad-1, mri-q, lavaMD, leuko-3, and their GMEAN]
Performance of compute intensive kernels is insensitive to scheduling policies
17
Performance of memory intensive kernels
[Chart: speedup w.r.t. RR (0-2x) of GTO, OWL, CCWS, MAS, and CAR on bandwidth intensive and cache sensitive kernels; bars for BP-2, histo-3, histo-1, and sad-2 are clipped at 3.0, 4.8, and 4.24]
Speedup over RR:
Scheduler             GTO   OWL   CCWS   Mascar
Bandwidth Intensive    4%    4%     4%      17%
Cache Sensitive       24%    4%    55%      56%
Overall               13%    4%    24%      34%
18
Conclusion
During memory saturation:
• Serialization of memory requests causes less overlap between memory accesses and compute: Memory Aware Scheduling (MAS) allows one warp to issue all its requests and begin computation early
• Data present in the cache cannot be reused, as the data cache cannot accept any request: Cache Access Re-execution (CAR) exploits more hit-under-miss opportunities through a re-execution queue
34% speedup, 12% energy savings
Mascar: Speeding up GPU Warps by Reducing Memory Pitstops
Questions?