
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow

Wilson W. L. Fung, Ivan Sham, George Yuan, Tor M. Aamodt
Electrical and Computer Engineering, University of British Columbia

MICRO-40, Dec 5, 2007


Motivation

GPU: a massively parallel architecture. The SIMD pipeline extracts the most computation from the least silicon and energy.

Goal: apply the GPU to non-graphics computing. Many challenges remain; this talk presents a hardware mechanism for efficient control flow.

[Chart: GFLOPS (log scale, 1 to 1000) vs. year, 2001-2008, for GPU, CPU-Scalar, and CPU-SSE.]


Programming Model

Modern graphics pipeline (OpenGL/DirectX): vertex shader and pixel shader stages.

CUDA-like programming model: hides the SIMD pipeline from the programmer. Single-Program-Multiple-Data (SPMD): the programmer expresses parallelism using threads, similar to stream processing.


Programming Model

Warp = threads grouped into a SIMD instruction.

From the Oxford Dictionary: in the textile industry, the "warp" is "the threads stretched lengthwise in a loom to be crossed by the weft".
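As a minimal sketch (Python; a warp width of 32 is assumed here, matching the CUDA convention of the era, and is not stated on the slide), the grouping of scalar threads into warps and SIMD lanes is just integer division and remainder:

```python
WARP_SIZE = 32  # assumed warp width (CUDA convention); not given on the slide

def warp_of(tid):
    """Index of the warp a scalar thread is grouped into."""
    return tid // WARP_SIZE

def lane_of(tid):
    """SIMD lane the thread occupies within that warp."""
    return tid % WARP_SIZE

# Thread 70 executes in warp 2, lane 6, in lockstep with threads 64-95.
print(warp_of(70), lane_of(70))  # 2 6
```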


The Problem: Control Flow

The GPU uses a SIMD pipeline to save area on control logic, grouping scalar threads into warps. Branch divergence occurs when threads inside a warp branch to different execution paths: the warp must then execute Path A and Path B serially, with only a subset of lanes active on each.

[Diagram: a branch splits a warp's threads between Path A and Path B, which execute one after the other.]

50.5% performance loss with SIMD width = 16.
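The serialization cost can be sketched in a few lines (a toy model with a hypothetical 4-wide warp; instruction counts stand in for cycles):

```python
def divergent_cost(cond_mask, path_a_len, path_b_len):
    """Cycles an if/else costs a SIMD warp: each path taken by at least
    one lane issues for the whole warp, so divergent warps pay for both."""
    cost = 0
    if any(cond_mask):          # some lanes take path A
        cost += path_a_len
    if not all(cond_mask):      # some lanes take path B
        cost += path_b_len
    return cost

print(divergent_cost([True] * 4, 10, 10))                 # 10: uniform warp
print(divergent_cost([True, False, True, True], 10, 10))  # 20: both paths run
```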


Dynamic Warp Formation

Consider multiple warps at once: when several warps diverge at the same branch, the threads that take the same path can be regrouped into new, fuller warps.

[Diagram: two diverged warps combined so that Path A executes with its lanes filled from both warps.]

20.7% speedup with 4.7% area increase.


Outline

Introduction, Baseline Architecture, Branch Divergence, Dynamic Warp Formation and Scheduling, Experimental Results, Related Work, Conclusion


Baseline Architecture

[Diagram: multiple shader cores connected through an interconnection network to memory controllers, each backed by GDDR3 DRAM. Timeline: the CPU spawns a kernel on the GPU, the GPU signals "done", and the CPU spawns the next kernel.]


SIMD Execution of Scalar Threads

All threads run the same kernel. Warp = threads grouped into a SIMD instruction.

[Diagram: scalar threads W, X, Y, Z grouped into a thread warp sharing a common PC and issued down the SIMD pipeline; thread warps 3, 7, and 8 await their turn.]


Latency Hiding via Fine-Grained Multithreading

Interleave warp execution to hide latencies. The register values of all threads stay in the register file, so nothing is saved or restored when switching warps. Roughly 100~1000 threads are needed to cover memory latency; graphics has millions of pixels.

[Pipeline diagram: I-Fetch, Decode, per-lane register files and ALUs, D-Cache, Writeback. Warps whose memory accesses all hit remain available for scheduling; warps with misses wait in the memory hierarchy.]
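A sketch of the idea: a round-robin scheduler that each cycle issues from the next warp whose threads are not waiting on memory. Warp state is reduced to a single hypothetical ready flag here; the real pipeline tracks outstanding misses per warp.

```python
def next_ready_warp(warps, last_issued):
    """Round-robin over warps, skipping any stalled on a cache miss.
    warps: list of {'ready': bool}. Returns a warp index, or None."""
    n = len(warps)
    for step in range(1, n + 1):
        idx = (last_issued + step) % n
        if warps[idx]["ready"]:
            return idx
    return None  # every warp is waiting: the pipeline bubbles

warps = [{"ready": False}, {"ready": True}, {"ready": True}]
print(next_ready_warp(warps, 0))  # 1: warp 0's miss is hidden by warp 1
```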


SPMD Execution on SIMD Hardware: The Branch Divergence Problem

[Control-flow graph of basic blocks A through G with branches and reconvergence points; a warp of four threads (1-4) shares a common PC until it reaches a divergent branch.]


Baseline: PDOM (Immediate Post-Dominator Reconvergence)

[Control-flow graph annotated with active masks: A/1111 and B/1111, diverging into C/1001 and D/0110, reconverging at E/1111 and continuing to G/1111.]

Each warp carries a reconvergence stack of (reconvergence PC, next PC, active mask) entries. At the divergent branch in B, the warp pushes an entry for each side: path C with mask 1001 and path D with mask 0110, both with reconvergence PC E. Each path then executes serially under its partial mask; when a path reaches E its entry is popped, and once both sides are done the warp continues from E with the full mask 1111.

[Timeline: the warp executes A and B with all four lanes active, C and D serially with partial masks, then E and G with the full mask restored.]
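A minimal simulation of the stack for the slide's example (simplified: entries here carry only the next PC and active mask, whereas the real table also stores the reconvergence PC used to decide when to pop):

```python
def run_pdom(taken_mask):
    """One if/else under a PDOM reconvergence stack for a 4-thread warp.
    Returns the (block, active mask) pairs executed after the branch."""
    full = tuple(1 for _ in taken_mask)           # mask 1111
    not_taken = tuple(1 - b for b in taken_mask)
    # The reconvergence entry for E is pushed first, then the two sides;
    # the top of the stack executes first.
    stack = [("E", full), ("D", not_taken), ("C", taken_mask)]
    trace = []
    while stack:
        trace.append(stack.pop())
    return trace

# Threads 1 and 4 take C (mask 1001), threads 2 and 3 take D (mask 0110).
print(run_pdom((1, 0, 0, 1)))
# [('C', (1, 0, 0, 1)), ('D', (0, 1, 1, 0)), ('E', (1, 1, 1, 1))]
```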


Dynamic Warp Formation: Key Idea

Form new warps at divergence: consider multiple warps together, so that enough threads branch to each path to create full new warps.


Dynamic Warp Formation: Example

[Control-flow graph annotated with per-warp active masks: at A, warps x and y both run with mask 1111; they diverge through B (x/1110, y/0011), C (x/1000, y/0010), D (x/0110, y/0001), and F (x/0001, y/1100), reconverging at E (x/1110, y/0011) and G (x/1111, y/1111).]

Baseline timeline: each warp serializes its divergent paths independently. With dynamic warp formation, a new warp is created from the scalar threads of both warp x and warp y executing at basic block D, filling more lanes and shortening the overall schedule.
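The regrouping step itself can be sketched as a bucket-by-PC pass over the pool of pending threads (lane conflicts are ignored in this sketch; the hardware must also match lanes):

```python
WARP_SIZE = 4  # hypothetical width, matching the slide's 4-thread warps

def form_warps(pending):
    """Group (tid, pc) pairs into warps of threads at the same PC."""
    by_pc = {}
    for tid, pc in pending:
        by_pc.setdefault(pc, []).append(tid)
    warps = []
    for pc, tids in by_pc.items():
        for i in range(0, len(tids), WARP_SIZE):
            warps.append((pc, tids[i:i + WARP_SIZE]))
    return warps

# Warp x sends threads 2 and 3 to D; warp y sends thread 8 to D: one
# merged warp at D instead of two mostly empty ones.
print(form_warps([(2, "D"), (3, "D"), (8, "D"), (1, "C"), (4, "C")]))
# [('D', [2, 3, 8]), ('C', [1, 4])]
```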


Dynamic Warp Formation: Hardware Implementation

[Block diagram: a thread scheduler feeds the SIMD pipeline (I-Cache, Decode, per-lane register files and ALUs addressed by (TID, Reg#), Commit/Writeback). When a warp diverges, two warp update registers, one for the taken PC and one for the not-taken PC, collect the TIDs going each way. A PC-Warp LUT, indexed by a hash (H) of the PC, maps each target PC to a warp entry being assembled in the warp pool (TID x N, PC, priority); a warp allocator assigns pool entries (OCC, PC, IDX), and issue logic selects assembled warps for execution.]

Lane placement: each thread keeps its home SIMD lane so that its registers stay in that lane's register file bank. In the example, warps X (threads 1-4) and Y (threads 5-8) diverge at "A: BEQ R2, B": threads 2 and 3 of X and threads 5 and 8 of Y branch to B and are merged into warp Z = 5, 2, 3, 8, whose threads occupy distinct lanes, so there is no lane conflict.
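The lane rule can be sketched as follows (TIDs are 1-indexed as on the slide; home lane = (TID - 1) mod width is an assumption consistent with the Z = 5, 2, 3, 8 example):

```python
WARP_SIZE = 4  # width of the slide's example

def try_place(warp_lanes, tid):
    """Insert a thread into a forming warp only if its home lane is
    free, so its registers stay in that lane's register file bank."""
    lane = (tid - 1) % WARP_SIZE   # assumed home-lane mapping
    if warp_lanes[lane] is None:
        warp_lanes[lane] = tid
        return True
    return False  # lane conflict: the thread waits for another warp

# Threads 5, 2, 3, 8 (from warps X and Y) all branch to the same block.
warp = [None] * WARP_SIZE
placed = [try_place(warp, t) for t in (5, 2, 3, 8)]
print(warp)  # [5, 2, 3, 8] -- every home lane distinct, no conflict
```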


Methodology

Created a new cycle-accurate simulator from SimpleScalar (version 3.0d). Benchmarks were selected from SPEC CPU2006, SPLASH-2, and the CUDA demos, then manually parallelized under a programming model similar to CUDA.


Experimental Results

[Bar chart: IPC (0 to 128) for hmmer, lbm, Black, Bitonic, FFT, LU, Matrix, and the harmonic mean (HM), comparing the PDOM baseline, dynamic warp formation, and MIMD.]


Dynamic Warp Scheduling

[Bar chart: IPC (0 to 128) for the same benchmarks under the baseline and the DMaj, DMin, DTime, DPdPri, and DPC warp scheduling policies.]

Lane conflict ignored (~5% difference).


Area Estimation

CACTI 4.2 (90nm process). Size of one scheduler = 2.471 mm²; 8 x 2.471 mm² + 2.628 mm² = 22.39 mm², or 4.7% of a GeForce 8800 GTX (~480 mm²).
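A quick check of the arithmetic (8 schedulers plus the remaining 2.628 mm², which I take to be structures shared across the cores, as the slide does not say):

```python
scheduler_mm2 = 2.471   # per-core scheduler, CACTI 4.2 at 90nm
other_mm2 = 2.628       # remaining DWF hardware (role assumed, figure from slide)
total = 8 * scheduler_mm2 + other_mm2          # 22.396, quoted as 22.39
print(round(total, 2), round(total / 480 * 100, 1))  # 22.4 4.7
```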


Related Work

Predication: converts control dependence into data dependence.
Lorie and Strong: JOIN and ELSE instructions at the beginning of divergence.
Cervini: abstract/software proposal for "regrouping" on an SMT processor.
Liquid SIMD (Clark et al.): forms SIMD instructions from scalar instructions.
Conditional Routing (Kapasi): transforms code into multiple kernels to eliminate branches.


Conclusion

Branch divergence can significantly degrade a GPU's performance: 50.5% performance loss with SIMD width = 16. Dynamic warp formation and scheduling performs 20.7% better on average than reconvergence, at a 4.7% area cost.

Future work: the area/performance tradeoff of warp scheduling.


Thank You.

Questions?


Shared Memory

Banked local memory accessible by all threads within a shader core (a block).

Idea: break each load/store into two micro-ops, address calculation and memory access. After address calculation, use a bit vector to track bank accesses, just like the lane-conflict tracking in the scheduler.
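A sketch of that bit-vector bookkeeping (16 banks and word-interleaved address mapping are assumptions for illustration; the slide gives neither):

```python
NUM_BANKS = 16  # assumed bank count for illustration

def bank_accesses(addresses):
    """After address calculation, set a bit per bank touched by the
    warp's accesses; a bank requested twice is a conflict (replayed)."""
    vec = [0] * NUM_BANKS
    conflicts = 0
    for addr in addresses:
        bank = (addr // 4) % NUM_BANKS  # 4-byte words interleaved across banks
        if vec[bank]:
            conflicts += 1              # serviced in an extra pass
        vec[bank] = 1
    return vec, conflicts

vec, conflicts = bank_accesses([0, 4, 8, 64])  # 0 and 64 both map to bank 0
print(conflicts)  # 1
```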