Escaping the SIMD vs. MIMD mindset
A new class of hybrid microarchitectures between GPUs and CPUs
Sylvain Collange, Università degli Studi di Siena
Séminaire DALI, December 15, 2011

Page 1: Escaping the SIMD vs. MIMD mindset

A new class of hybrid microarchitectures between GPUs and CPUs

Sylvain Collange, Università degli Studi di Siena
[email protected]

Séminaire DALI, December 15, 2011

Page 2: This talk is not about GPUs

Yesterday (2000-2010): homogeneous multi-core and discrete components: a Central Processing Unit (CPU) with latency-optimized cores next to a Graphics Processing Unit (GPU) with throughput-optimized cores.

Today (2011-...): chip-level integration: Intel Sandy Bridge, AMD Fusion, the NVIDIA Denver/Maxwell project…

Tomorrow: heterogeneous multi-core chips combining latency-optimized cores, throughput-optimized cores, and hardware accelerators.

This talk focuses on the throughput-optimized part. Programming model: SPMD.

Page 3: Outline

SIMT architectures

Parallel locality and its exploitation

Revisiting Flynn's Taxonomy

How to keep threads synchronized

Two instructions, multiple data

Parallel value locality

Page 4: Locality, regularity in sequential apps

Application behavior is likely to follow regular patterns. Consider a sequential loop:

    for(i…) {
        if(f(i)) { }
        j = g(i);
        x = a[j];
    }

Control regularity: over iterations i=0, 1, 2, 3, the branch outcome is taken, taken, taken, taken in the regular/local case, versus taken, taken, not taken, not taken in the irregular case.

Memory locality and regularity: the index j takes nearby values (j=17, 18, 20, 19) in the regular case, versus scattered values (j=21, 4, 2, 17) in the irregular case.

Value locality: the value x repeats across iterations (x=42, 42, 42, 42) in the regular case, versus varies (x=15, 0, 52, 2) in the irregular case.

Applications rely on hardware that exploits this regularity: caches, branch prediction, instruction prefetch, data prefetch, write combining…

Page 5: Regularity in parallel applications

Regularity in parallel applications means similarity in behavior between SPMD threads (here, threads 1-4 over time).

Parallel control regularity: in switch(i) { case 2: … case 17: … case 21: … }, the regular case has all threads take the same branch (i=17, 17, 17, 17); in the irregular case each thread takes a different one (i=21, 4, 2, 17).

Parallel memory locality: in r = A[i], the regular case has threads access contiguous locations (load A[8], A[9], A[10], A[11]); the irregular case has scattered accesses (load A[8], A[0], A[11], A[3]).

Parallel value locality: in r = a*b, the regular case has operands identical across threads (a=32 everywhere, b=52 everywhere); the irregular case has distinct values per thread (a=17, -5, 11, 42; b=15, 0, -2, 52).

Page 6: How to exploit parallel locality?

Multi-threading implementation options:

Replication: different resources, same time. Chip Multi-Processing (CMP): threads T0-T3 run side by side in space.

Time-multiplexing: same resource, different times. Multi-Threading (MT): threads T0-T3 take turns on one resource over time.

Factorization: same resource, same time, if we have parallel locality. Single-Instruction Multi-Threading (SIMT): threads T0-T3 share the resource simultaneously.

Page 7: Single Instruction, Multiple Threads (SIMT)

SIMT is area- and power-efficient thanks to parallel locality: the fetch/decode and load-store units are factorized across threads T0-T3.

Fetch 1 instruction on behalf of several threads: a single fetched mul is executed once per thread in the EX stage, as (0) mul, (1) mul, (2) mul, (3) mul.

Read 1 memory location and broadcast it to several registers: a single (0-3) load or (0-3) store traverses the shared load-store unit (LSU) to memory.
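The factorized-fetch idea above can be sketched as a tiny interpreter. This is a hypothetical model (the opcode names and register layout are invented for illustration), not any specific hardware:

```python
# Minimal SIMT sketch: ONE fetched instruction is executed on behalf of
# every thread of the group; a broadcast load reads memory once.
def simt_step(program, pc, regs, n_threads):
    op, dst, src = program[pc]          # single fetch/decode for the group
    if op == "broadcast_load":          # one memory read, broadcast to all
        for t in range(n_threads):
            regs[dst][t] = src          # src stands in for the loaded value
    elif op == "mul":                   # per-thread execution
        for t in range(n_threads):
            regs[dst][t] = regs[src][t] * regs[dst][t]
    return pc + 1

regs = {"r0": [1, 2, 3, 4], "r1": [0, 0, 0, 0]}
program = [("broadcast_load", "r1", 10), ("mul", "r0", "r1")]
pc = 0
for _ in program:
    pc = simt_step(program, pc, regs, 4)
# 2 fetches served 8 thread-operations; r0 is now [10, 20, 30, 40]
```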

Page 8: Flynn's taxonomy revisited

For each pipeline-stage resource, threads T0-T3 can share a single resource or use multiple resources:

    Resource: pipeline stage      Single resource   Multiple resources
    Instruction Fetch (F)         SIMT              MIMT
    Memory (Address) (M)          SAMT              MAMT
    RF, Execute (Data) (X)        SDMT              MDMT

The three axes are mostly orthogonal: mix and match to build your own _I_D_A_T pipeline!

Page 9: Examples: conventional design points

Multi-core: MIMD, i.e. MI MD MA MT: each thread T0-T2 gets its own full F, M, X pipeline.

GPU: SIMT, read as SI MD SA MT: one instruction fetch F and one memory port M shared by all threads, with one execute unit X per thread.

Short-vector SIMD: SIMD, read as SI MD SA ST: the same shared pipeline with multiple execute units X, but the lanes belong to a single thread T0 rather than to independent threads.

Page 10: A GPU: NVIDIA GeForce GTX 580

SIMT: warps of 32 threads.
16 SMs / chip; 2×16 cores / SM; 48 warps / SM.
1580 Gflop/s; up to 24576 threads in flight.

Within each SM (SM1 … SM16), the two groups of 16 cores are time-multiplexed across warps: e.g. cores 1-16 run warps 1, 3, …, 47 while cores 17-32 run warps 2, 4, …, 48.

Page 11: Outline

SIMT architectures

How to keep threads synchronized

The old way: mask stacks

The new way: distributed control and arbitration

Two instructions, multiple data

Parallel value locality

Page 12: How to keep threads synchronized?

Issue: control divergence.

Rules of the game:
One thread per Processing Element (PE): thread 0 on PE 0, …, thread 3 on PE 3.
All PEs execute the same instruction: 1 instruction is broadcast to PE 0-PE 3.
PEs can be individually disabled.

    x = 0;
    if(tid > 17) {    // Uniform condition
        x = 1;
    }
    if(tid < 2) {     // Divergent conditions
        if(tid == 0) {
            x = 2;
        }
        else {
            x = 3;
        }
    }

Page 13: The standard way: mask stack

The same code runs against a mask stack holding 1 activity bit per thread (tid = 0 … 3):

    x = 0;             // mask 1111
    if(tid > 17) {     // uniform condition: false for every thread,
        x = 1;         // mask stays 1111 and the block is skipped
    }
    if(tid < 2) {      // divergent: push -> mask 1100
        if(tid == 0) { // divergent: push -> mask 1000
            x = 2;
        }
        else {         // pop, push -> mask 0100
            x = 3;
        }
    }                  // pop, pop -> mask 1111
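The mask-stack mechanics can be made concrete with a small interpreter over the slide's 4-thread example. This is a sketch (the helper names are invented); it reproduces the mask sequence 1111, 1100, 1000, 1100, 0100, 1100, 1111:

```python
# Mask-stack interpreter: 1 activity bit per thread, masks as bit strings.
def run_masked(n_threads=4):
    x = [None] * n_threads
    stack = []                          # saved masks
    mask = [1] * n_threads              # current activity bits: 1111
    trace = ["".join(map(str, mask))]

    def push(cond):
        nonlocal mask
        stack.append(mask)
        mask = [m & (1 if cond(t) else 0) for t, m in enumerate(mask)]
        trace.append("".join(map(str, mask)))

    def pop():
        nonlocal mask
        mask = stack.pop()
        trace.append("".join(map(str, mask)))

    def assign(v):                      # predicated write: active lanes only
        for t in range(n_threads):
            if mask[t]:
                x[t] = v

    assign(0)                   # x = 0;
    # if(tid > 17): false for every thread (uniform), block skipped
    push(lambda t: t < 2)       # if(tid < 2)   -> 1100
    push(lambda t: t == 0)      # if(tid == 0)  -> 1000
    assign(2)                   # x = 2;
    pop()                       #               -> 1100
    push(lambda t: t != 0)      # else          -> 0100
    assign(3)                   # x = 3;
    pop()                       #               -> 1100
    pop()                       #               -> 1111
    return x, trace
```

Running it yields x = [2, 3, 0, 0]: thread 0 took the inner then-branch, thread 1 the else-branch, threads 2-3 neither.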

Page 14: Goto considered harmful?

Control instructions in some CPU and GPU instruction sets:

MIPS: j, jal, jr, syscall
Intel GMA Gen4 (2006): jmpi, if, iff, else, endif, do, while, break, cont, halt, msave, mrest, push, pop
Intel GMA SB (2011): jmpi, if, else, endif, case, while, break, cont, halt, call, return, fork
AMD Cayman (2011): push, push_else, pop, push_wqm, pop_wqm, else_wqm, jump_any, reactivate, reactivate_wqm, loop_start, loop_start_no_al, loop_start_dx10, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after
AMD R600 (2007): push, push_else, pop, loop_start, loop_start_no_al, loop_start_dx10, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after
AMD R500 (2005): jump, loop, endloop, rep, endrep, breakloop, breakrep, continue
NVIDIA Tesla (2007): bar, bra, brk, brkpt, cal, cont, kil, pbk, pret, ret, ssy, trap, .s
NVIDIA Fermi (2010): bar, bpt, bra, brk, brx, cal, cont, exit, jcal, jmp, jmx, longjmp, pbk, pcnt, plongjmp, pret, ret, ssy, .s

Why so many? They expose the control-flow structure to the instruction sequencer; there is no generic support for arbitrary control flow.

Page 15: Alternative: 1 PC / thread

Keep one Program Counter (PC) per thread (PC0 … PC3, for tid = 0 … 3), plus a master PC. Threads whose PC matches the master PC are active; threads with no match are inactive.

    x = 0;
    if(tid > 17) {
        x = 1;
    }
    if(tid < 2) {
        if(tid == 0) {
            x = 2;     // master PC here: the PCs match as 1 0 0 0,
        }              // so only thread 0 is active
        else {
            x = 3;
        }
    }

Page 16: Scheduling policy: min(SP:PC)

Which PC should be elected as the master PC?

Conditionals and loops: follow the order of code addresses, i.e. min(PC). For if(…){}else{} compiled as "p? br else; …; br endif; else: …; endif:", the then-block, else-block and join point are visited in increasing address order; likewise for while(…){} compiled as "start: …; p? br start; …", all remaining iterations run before the loop exit.

Functions: favor the maximum nesting depth, i.e. min(SP), so that a callee ("call f; … f: …; ret" for void f(){…}) completes before its caller resumes.

With compiler support, this handles unstructured control flow too, with no code duplication. G. Diamos, A. Kerr, H. Wu, S. Yalamanchili, B. Ashbaugh, S. Maiyuran. SIMD re-convergence at thread frontiers. MICRO 44, 2011.
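The min(PC) policy can be simulated on a divergent if/else. The 5-address program below is a hypothetical example (addresses and semantics invented for illustration): threads 0-1 take the then-path, threads 2-3 the else-path, and all reconverge at the halt address because lower addresses always issue first:

```python
# min(PC) master-PC arbitration on a divergent branch.
#   0: if tid >= 2 goto 3      (divergent branch)
#   1: x = 2                   (then-block)
#   2: goto 4                  (jump over else)
#   3: x = 3                   (else-block)
#   4: halt                    (reconvergence point)
def run_min_pc(n_threads=4):
    pc = [0] * n_threads
    x = [None] * n_threads
    issued = 0
    while any(p != 4 for p in pc):
        mpc = min(p for p in pc if p != 4)   # elect master PC
        issued += 1
        for t in range(n_threads):
            if pc[t] != mpc:                 # no match: thread inactive
                continue
            if mpc == 0:
                pc[t] = 3 if t >= 2 else 1
            elif mpc == 1:
                x[t] = 2; pc[t] = 2
            elif mpc == 2:
                pc[t] = 4
            elif mpc == 3:
                x[t] = 3; pc[t] = 4
    return x, issued
```

Four issue slots suffice for both paths, and every thread ends at the reconvergence point.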

Page 17: Our new SIMT pipeline

The fetch stage takes a vote among the per-thread PCs (PC0 … PCn) to elect a master PC (MPC), fetches the instruction at MPC, and broadcasts (Insn, MPC) to every lane. Each lane matches MPC against its own PC: on a match it executes the instruction and updates its PC; on no match it discards the instruction.

S. Collange. Une architecture unifiée pour traiter la divergence de contrôle et la divergence mémoire en SIMT. SympA'14, 2011.

Page 18: Benefits of multiple-PC arbitration

Before (stack, counters):
O(d) or O(log d) memory, where d is the nesting depth
C-style structured control flow only
1 R/W port to memory
Exceptions: stack overflow and underflow
Partial SIMD semantics (Bougé-Levaire)
Specific instruction sets

After (multiple PCs):
O(1) memory, no shared state
Arbitrary control flow
Allows thread suspension, restart, and migration
Full SPMD semantics (multi-thread)
Traditional languages, compilers, and instruction sets

This enables many new architecture ideas.

Page 19: Outline

SIMT architectures

How to keep threads synchronized

Two instructions, multiple data

From divergent branches

From multiple warps

Parallel value locality

Page 20: Sharing 2 resources

Extend the taxonomy with an intermediate design point: 2 copies of a resource, between a single copy (1) and one per thread (M).

    Resource type                         1      2      M
    Instruction Fetch (F)                 SIMT   DIMT   MIMT
    Memory port (Address) (M)             SAMT   DAMT   MAMT
    Computation / registers (Data) (X)    SDMT   DDMT   MDMT

A. Glew. Coherent vector lane threading. Berkeley ParLab Seminar, 2009.

Page 21: Simultaneous Branch Interweaving

Co-issue instructions from divergent branches: fill idle lanes using parallelism from divergent paths of the control-flow graph. Where baseline SIMT serializes the two sides of a divergent branch, SBI issues a second instruction from the same warp, taken from the other path, onto the lanes left idle.

Page 22: Secondary scheduler policy

Primary scheduler: MPC1 = min over threads i of PCi.
Secondary scheduler: MPC2 = min over threads i of PCi, restricted to PCi ≠ MPC1.

Enforce control-flow reconvergence: annotate reconvergence points with a pointer to their dominator, and make a thread wait for any thread of the warp still between PCdiv and PCrec. Example: T0 and T2 (at F) wait for T1 (in D), while T3 (in B) can proceed in parallel.
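The two-scheduler election can be written down directly; this sketch only models the PC selection, not the reconvergence bookkeeping:

```python
# SBI's cascaded election: the primary scheduler picks MPC1 = min(PC);
# the secondary picks the smallest remaining PC, so two divergent paths
# of the same warp can issue in the same cycle.
def pick_mpcs(pcs):
    mpc1 = min(pcs)
    others = [p for p in pcs if p != mpc1]
    mpc2 = min(others) if others else None   # None: warp not diverged
    return mpc1, mpc2
```

For a warp with PCs [12, 7, 12, 7] this elects (7, 12); for a converged warp [5, 5, 5, 5] it elects (5, None) and the secondary slot stays free for another warp.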

Page 23: Implementation

Fermi GPUs already have 2 instruction schedulers; direct both schedulers to the same execution units.

Fermi: warp size 32, 2 warps / clock, 1 instruction / warp.
SBI: warp size 64, 1 warp / clock, 2 instructions / warp.

Page 24: Simultaneous Warp Interweaving

Co-issue instructions from different warps: a transposition of Simultaneous Multi-Threading (SMT) into the SIMD world. SWI fills one warp's idle lanes with an instruction from a different warp; SBI+SWI combines both techniques.

Page 25: Implementation: cascaded scheduling

The secondary scheduler refines the initial scheduling: it looks for a warp instruction whose set of active threads is disjoint from the one already selected.

SBI/SWI: warp size 64, 1 warp / clock.

Page 26: Detecting compatible warps

Bitset inclusion test with a Content-Addressable Memory: compare the active mask of the selected instruction against the masks of warps W0-W6, treating zeros as don't-care bits, and signal a hit on a disjoint candidate. A fully-associative lookup across all warps is power-hungry.

Set-associative lookup: split the warps into sets and restrict the lookup to 1 set, so only warps of the same set are compared. More power-efficient.
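The compatibility test itself is just bitmask disjointness. A sketch with invented helper names, showing both the fully-associative scan and the set-restricted variant:

```python
# Two warp instructions can co-issue if their active-thread bitsets are
# disjoint (in the CAM, zeros act as don't-care bits).
def compatible(mask_a, mask_b):
    return (mask_a & mask_b) == 0

def find_partner(mask, candidate_masks):
    """Fully-associative lookup: scan every candidate warp."""
    for i, m in enumerate(candidate_masks):
        if compatible(mask, m):
            return i
    return None

def find_partner_set_assoc(mask, candidate_masks, my_set, n_sets):
    """Set-associative lookup: only examine warps mapped to one set."""
    for i, m in enumerate(candidate_masks):
        if i % n_sets == my_set and compatible(mask, m):
            return i
    return None
```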

Page 27: Set-associative lookup is good enough

3-way: 97% of fully-associative (23-way) performance

Direct-mapped: 96%

Page 28: Using divergence correlations

Issue: unbalanced divergence introduces conflicts. In a parallel reduction, for example, each warp keeps only its low-numbered threads active, so warp 0 is never compatible with warp 2: both need lane 0.

Solution: static lane shuffling. Apply a different lane permutation to each warp, so the threads numbered 0 of different warps map to different physical lanes: no conflict. Because the permutation is static, inter-thread memory locality is preserved.
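One simple permutation choice, rotating each warp's thread-to-lane mapping by the warp index, illustrates the fix (the rotation is a hypothetical example; any per-warp permutation would do):

```python
# Static lane shuffling: thread t of warp w runs on lane (t + w) % warp_size.
def physical_lanes(active_threads, warp_id, warp_size=4, shuffle=True):
    rot = warp_id if shuffle else 0
    return {(t + rot) % warp_size for t in active_threads}

# A reduction leaves only thread 0 alive in each warp.
# Without shuffling, warps 0 and 2 both occupy lane 0: never compatible.
conflict = (physical_lanes({0}, 0, shuffle=False)
            & physical_lanes({0}, 2, shuffle=False))
# With per-warp rotation, their active lanes are disjoint: co-issue works.
w0 = physical_lanes({0}, 0)
w2 = physical_lanes({0}, 2)
```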

Page 29: Results

Collaboration with Nicolas Brunie (LIP, ENS Lyon / Kalray) and Gregory Diamos (Georgia Tech / NVIDIA).

    Speedup    Regular applications   Irregular applications
    SBI        +15%                   +41%
    SWI        +25%                   +33%
    SBI+SWI    +23%                   +40%

Page 30: Outline

SIMT architectures

How to keep threads synchronized

Two instructions, multiple data

Parallel value locality

Dynamic scalarization

Affine vector cache

Affine-aware register allocation

Page 31: 32 birds with 1 stone

What about SI SD SA MT: a single instruction fetch F, a single memory port M, and a single data path X, all shared by threads T0-T2? Not as crazy as it looks.

Phenomenon: parallel value locality. Applications: instruction sharing, register sharing.

Page 32: What are we computing on?

Uniform data: within a warp, v[tid] = c. Example: c = 5 gives the vector 5 5 5 5 5 5 5 5.

Affine data: within a warp, v[tid] = b + tid × s, with base b and stride s. Example: b = 8, s = 1 gives 8 9 10 11 12 13 14 15 for threads 0, 1, …

A bar chart (0% to 100%) gives the average frequency of Uniform, Affine, and Other inputs in GPGPU applications, measured both over RF reads and over operations.
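The two definitions above translate directly into a classifier over a warp's register values (a sketch; the function name is invented):

```python
# Classify a warp-wide register vector following the slide's definitions:
# uniform  v[tid] = c, affine  v[tid] = b + tid*s, otherwise unknown.
def classify(v):
    if all(x == v[0] for x in v):
        return ("uniform", v[0])
    stride = v[1] - v[0]
    if all(v[t] == v[0] + t * stride for t in range(len(v))):
        return ("affine", v[0], stride)         # (base, stride)
    return ("unknown",)
```

Note that uniform is the stride-0 special case of affine, so the uniform test must come first.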

Page 33: Dynamic scalarization: tagging registers

Tags: associate a tag to each vector register: Uniform (U), Affine (A), or unKnown (K), and propagate the tags across arithmetic instructions. 2 lanes are enough to encode uniform and affine vectors.

Trace (instruction, then tag transfer):

    mov    i ← tid          A ← A
    loop:
    load   t ← X[i]         K ← U[A]
    mul    t ← a×t          K ← U×K
    store  X[i] ← t         U[A] ← K
    add    i ← i+tcnt       A ← A+U
    branch i<n? loop        A < U?

Here i is affine (per-thread values 0, 1, 2, 3, …), a and tcnt are uniform, and the loaded t is unknown (arbitrary per-thread values).
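The propagation rules used in the trace can be sketched as a small tag algebra. The rule set below is an illustrative assumption consistent with the trace (affine plus uniform stays affine, uniform times anything linear stays linear, unknown absorbs everything):

```python
# Tag propagation for dynamic scalarization.
# "U" = uniform, "A" = affine, "K" = unknown.
def prop_add(a, b):
    if "K" in (a, b):
        return "K"          # unknown absorbs
    if "A" in (a, b):
        return "A"          # A+U, U+A, A+A stay affine
    return "U"              # U+U stays uniform

def prop_mul(a, b):
    if "K" in (a, b):
        return "K"          # e.g. the trace's  mul t ← a×t : U×K → K
    if a == "A" and b == "A":
        return "K"          # product of two affine vectors is quadratic
    return "A" if "A" in (a, b) else "U"   # scaling keeps linearity
```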

Page 34: Dynamic scalarization: clock-gating

Pipeline: Fetch, Decode, then a De-duplication stage that uses the tags to turn each register ID into a register ID + tag, then operand read from a scalar RF and a vector RF, then Execute and Branch/Mask.

Uniform and affine operations let parts of the vector datapath be clock-gated: inactive for 24% of instructions, and for 38% of operands.

S. Collange, D. Defour, Y. Zhang. Dynamic detection of uniform and affine vectors in GPGPU computations. Euro-Par HPPC workshop, 2009.

Page 35: Why on-chip memory size matters

Conventional wisdom: according to the cache-area figure in the NVIDIA CUDA Programming Guide, GPUs devote far less on-chip memory than CPUs.

Actual data: GPU register files + caches amount to 3.9 MB on the NVIDIA GF110 and 7.7 MB on the AMD Cayman. At this rate, GPUs will catch up with CPUs by 2012…

Page 36: What is inside thread-private memory?

Private memory is an extension of the RF: it contains the call stack, local arrays, and spilled registers. 80% of private-memory traffic is affine (the corresponding figure for RF traffic was 50%).

Page 37: Affine Vector Cache

Used as a level-1 cache. Private memory is physically interleaved across threads, so 1 cache line = 1 spilled vector register. Affine vectors are stored as (base, stride) only: 16× more compact.

Research project of Alexandre Kouyoumdjian, LIP, ENS Lyon, April-May 2011.
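A sketch of such a cache line, assuming a warp of 32 one-word elements: an affine line stores 2 words (base, stride) instead of 32, which is the 16× compaction quoted above:

```python
# Affine cache line: store a spilled affine vector as (base, stride)
# instead of one word per thread; fall back to a full line otherwise.
WARP_SIZE = 32

def compress(vector):
    base, stride = vector[0], vector[1] - vector[0]
    if all(vector[t] == base + t * stride for t in range(len(vector))):
        return ("affine", base, stride)      # 2 words instead of 32
    return ("full", list(vector))            # uncompressible line

def decompress(line, n=WARP_SIZE):
    if line[0] == "affine":
        _, base, stride = line
        return [base + t * stride for t in range(n)]
    return line[1]
```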

Page 38: What is inside a GPU register file?

50%-92% of the GPU RF contains affine variables: a larger share than among register reads, because non-affine variables are short-lived. This also explains the private-memory traffic.

Non-affine registers alive in the inner loop: MatrixMul, 3 non-affine out of 14; Needleman-Wunsch, 2 non-affine out of 24; Convolution, 4 non-affine in the hotspot out of 14.

Research project of Élie Gédéon, LIP, ENS Lyon, June-July 2011.

Page 39: Compilers to the rescue

Static analysis identifies affine registers. Issue: divergent control flow introduces dependencies. Solution: gated-SSA form plus live-range splitting. Application: spill affine variables to shared memory, yielding up to 40% speedup on current GPUs for 8 registers / thread.

Collaboration with Fernando Magno Quintão Pereira, Diogo Sampaio, Rafael Martins, Universidade Federal de Minas Gerais, Brazil.

Page 40: Future direction: affine cache as RF

With SIMT execution, only 1 tag lookup per operand is needed: the same translation serves all lanes. An affine ALU handles most control flow and address computation, while the vector ALUs/FPUs do the heavy lifting.

In the sketched pipeline, instruction fetch and decode translate architectural register IDs into microarchitectural register IDs through the tags; an affine array backed by the affine ALU holds uniform/affine registers and broadcasts expanded values to per-lane L0 arrays feeding the ALUs/FPUs, under a replacement policy. Open question: should warp scheduling and the replacement policy be coordinated?

Page 41: Bottom line: the missing link

A new micro-architecture space opens between Clustered Multi-Threading (CMT) and SIMD, with new ways to exploit parallel value locality for higher perf/W.

The design spectrum runs from the SIMD programming model to the multi-thread programming model: SIMD, stack-based SIMT, PC-based SIMT, then CMT, SMT, CMP. Moving toward SIMD optimizes for parallel locality; moving toward CMP allows more flexibility.

Page 42: Conclusion: research factorization?

Clustered multi-thread architectures: choose between replication, time-multiplexing, and (new!) factorization.

Instruction fetch policy in multi-thread processors: balance instruction throughput, fairness, and (new!) parallel locality.

Control-flow reconvergence points: for latency, they reduce the branch misprediction penalty; (new!) for throughput, they restore thread synchronization.

Cross-fertilization with ideas from "classical" superscalar microarchitecture?

Page 43: Escaping the SIMD vs. MIMD mindset

A new class of hybrid microarchitectures between GPUs and CPUs

Sylvain Collange, Università degli Studi di Siena
[email protected]

Séminaire DALI, December 15, 2011