Transcript of EECS 583 Class 16 – Automatic Parallelization Via Decoupled Software Pipelining, University of Michigan, November 7, 2018

Page 1:

EECS 583 – Class 16

Automatic Parallelization Via Decoupled Software Pipelining

University of Michigan

November 7, 2018

Page 2:

Announcements + Reading Material

Midterm exam – Next Wednesday in class

» Starts at 10:40am sharp

» Open book/notes

No regular class next Monday

» Group office hours during class time in 2246 SRB

Reminder – Sign up for a paper presentation slot on my door if you haven't done so already

Today's class reading

» "Automatic Thread Extraction with Decoupled Software Pipelining," G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, November 2005.

» "Revisiting the Sequential Programming Model for Multi-Core," M. J. Bridges, N. Vachharajani, Y. Zhang, T. Jablin, and D. I. August, Proceedings of the 40th IEEE/ACM International Symposium on Microarchitecture, December 2007.

Page 3:

Moore’s Law

[Figure: transistor counts over time. Source: Intel/Wikipedia]

Page 4:

Single-Threaded Performance Not Improving

Page 5:

What about Parallel Programming? – or – What is Good About the Sequential Model?

Sequential is easier

» People think about programs sequentially

» Simpler to write a sequential program

Deterministic execution

» Reproducing errors for debugging

» Testing for correctness

No concurrency bugs

» Deadlock, livelock, atomicity violations

» Locks are not composable

Performance extraction

» Sequential programs are portable

   Are parallel programs? Ask GPU developers

» Performance debugging of sequential programs is straightforward

Page 6:

Compilers are the Answer? - Proebsting’s Law

“Compiler Advances Double Computing Power Every 18 Years”

Run your favorite set of benchmarks with your favorite state-of-the-art optimizing compiler. Run the benchmarks both with and without optimizations enabled. The ratio of those numbers represents the entirety of the contribution of compiler optimizations to speeding up those benchmarks. Let's assume that this ratio is about 4X for typical real-world applications, and let's further assume that compiler optimization work has been going on for about 36 years. Therefore, compiler optimization advances double computing power every 18 years. QED.
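Spelling out the arithmetic behind the tongue-in-cheek claim (using the 4X and 36-year figures assumed above):

\[ 4\times = 2^{2} \text{ over } 36 \text{ years} \;\Rightarrow\; \frac{36 \text{ years}}{2 \text{ doublings}} = 18 \text{ years per doubling} \]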

[Chart: speedup on a log scale from 1 to 100,000,000, plotted against the years 1971–2003]

Conclusion – Compilers not about performance!

Page 7:

Can We Convert Single-threaded Programs into Multi-threaded?

Page 8:

Loop Level Parallelization

[Diagram: a loop over i = 0–39 is split into chunks (i = 0–9, 10–19, 20–29, 30–39) that are distributed across Thread 0 and Thread 1]

Bad news: limited number of parallel loops in general-purpose applications

» 1.3x speedup for SpecINT2000 on 4 cores
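To make the picture concrete, here is a minimal sketch of chunking a DOALL loop across threads, assuming a pthreads runtime and a trivially independent loop body (the names and chunk sizes are illustrative, not from the slides):

#include <pthread.h>
#include <stdio.h>

#define N        40
#define NTHREADS 2

static int a[N];

struct chunk { int lo, hi; };

/* Each worker runs one contiguous chunk of the iteration space.
   This is only legal because the iterations are independent (DOALL). */
static void *worker(void *arg) {
   struct chunk *c = arg;
   for (int i = c->lo; i < c->hi; i++)
      a[i] = a[i] + 1;                     /* the original loop body */
   return NULL;
}

int main(void) {
   pthread_t tid[NTHREADS];
   struct chunk chunks[NTHREADS];

   for (int t = 0; t < NTHREADS; t++) {
      chunks[t].lo = t * (N / NTHREADS);        /* thread 0: 0-19, thread 1: 20-39 */
      chunks[t].hi = (t + 1) * (N / NTHREADS);
      pthread_create(&tid[t], NULL, worker, &chunks[t]);
   }
   for (int t = 0; t < NTHREADS; t++)
      pthread_join(tid[t], NULL);

   printf("a[0]=%d a[N-1]=%d\n", a[0], a[N - 1]);
   return 0;
}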

Page 9:

DOALL Loop Coverage

[Bar chart: fraction of sequential execution spent in DOALL loops for each benchmark in SPEC FP, SPEC INT, Mediabench, and Utilities, plus the average; y-axis from 0 to 1]

Page 10:

What’s the Problem?

for (i=0; i<100; i++) {

. . . = *p;

*q = . . .

}

1. Memory dependence analysis

Memory dependence profiling

and speculative parallelization
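A small illustration of why this is hard to prove statically: whether the loop above is DOALL depends entirely on whether p and q alias, which often cannot be determined from the code the compiler can see. The function below is hypothetical, just to make the point concrete:

/* Whether this loop can be run as DOALL depends entirely on aliasing. */
void update(float *p, float *q, int n) {
   for (int i = 0; i < n; i++)
      q[i] = p[i] + 1.0f;        /* load through p, store through q each iteration */
}

void caller(void) {
   float buf[100];
   for (int i = 0; i < 100; i++) buf[i] = (float)i;

   update(buf, buf + 50, 50);    /* no overlap: iterations independent, DOALL is safe */
   update(buf, buf + 1,  50);    /* overlap: the store q[i] feeds the load p[i+1], a
                                    loop-carried memory dependence, so DOALL is unsafe.
                                    Static analysis of the callee alone cannot tell these
                                    apart; a memory dependence profile (plus speculation
                                    with a run-time check) can. */
}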

Page 11:

DOALL Coverage – Provable and Profiled

[Bar charts: fraction of sequential execution in provable DOALL loops vs. profiled DOALL loops for each benchmark in SPEC FP, SPEC INT, Mediabench, and Utilities, plus the average; y-axis from 0 to 1]

Still not good enough!

Page 12:

What’s the Next Problem?

2. Data dependences

while (ptr != NULL) {

. . .

ptr = ptr->next;

sum = sum + foo;

}

Compiler transformations

Page 13:

We Know How to Break Some of These Dependences – Recall ILP Optimizations

[Diagram: the sequential loop computes sum += x each iteration; after expansion, Thread 0 computes sum1 += x, Thread 1 computes sum2 += x, and the results are combined with sum = sum1 + sum2]

Apply accumulator variable expansion!
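A minimal sketch of what accumulator expansion looks like in code, assuming two pthread workers and a statically split iteration space (the names are illustrative):

#include <pthread.h>
#include <stdio.h>

#define N 1000
static int x[N];

/* Each thread accumulates into its own private sum, breaking the
   loop-carried dependence on the single "sum" variable. */
struct work { int lo, hi; long sum; };

static void *accumulate(void *arg) {
   struct work *w = arg;
   long s = 0;
   for (int i = w->lo; i < w->hi; i++)
      s += x[i];                       /* was: sum += x[i]; */
   w->sum = s;
   return NULL;
}

int main(void) {
   for (int i = 0; i < N; i++) x[i] = 1;

   pthread_t t1, t2;
   struct work w1 = { 0,     N / 2, 0 };
   struct work w2 = { N / 2, N,     0 };
   pthread_create(&t1, NULL, accumulate, &w1);
   pthread_create(&t2, NULL, accumulate, &w2);
   pthread_join(t1, NULL);
   pthread_join(t2, NULL);

   long sum = w1.sum + w2.sum;         /* sum = sum1 + sum2 */
   printf("sum = %ld\n", sum);
   return 0;
}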

Page 14:

Data Dependences Inhibit Parallelization

Accumulator, induction, and min/max expansion only capture a small set of dependences

2 options

» 1) Break more dependences – new transformations

» 2) Parallelize in the presence of dependences – more than DOALL parallelization

We will talk about both, but for now ignore this issue

Page 15:

What's the Next Problem?

3. C/C++ too restrictive

char *memory;
void * alloc(int size);

void * alloc(int size) {
   void * ptr = memory;
   memory = memory + size;
   return ptr;
}

[Diagram: low-level reality – calls alloc1 through alloc6 issued from Core 1, Core 2, and Core 3 over time]

Page 16:

char *memory;
void * alloc(int size);

void * alloc(int size) {
   void * ptr = memory;
   memory = memory + size;
   return ptr;
}

[Diagram: low-level reality – calls alloc1 through alloc6 from Core 1, Core 2, and Core 3 over time]

Loops cannot be parallelized even if computation is independent

Page 17:

Commutative Extension

Interchangeable call sites

» Programmer doesn't care about the order in which a particular function is called

» Multiple different orders are all defined as correct

» Impossible to express in C

Prime example is a memory allocation routine

» Programmer does not care which address is returned on each call, just that the proper space is provided

Enables the compiler to break the dependences that flow from one invocation to the next, which otherwise force sequential behavior
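A rough sketch of what this enables: once alloc is known to be commutative, each call only needs to be atomic, and any interleaving of calls across threads is acceptable. Here a mutex stands in for whatever synchronization the runtime would actually insert (the @Commutative spelling follows the slides and is not standard C):

#include <pthread.h>

static char *memory;
static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

/* @Commutative: callers do not care which address each call returns,
   only that each call gets its own chunk of the proper size.  Making
   the body atomic lets parallel threads call alloc in any order. */
void *alloc(int size) {
   pthread_mutex_lock(&alloc_lock);
   void *ptr = memory;
   memory = memory + size;
   pthread_mutex_unlock(&alloc_lock);
   return ptr;
}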

Page 18:

char *memory;
void * alloc(int size);

@Commutative
void * alloc(int size) {
   void * ptr = memory;
   memory = memory + size;
   return ptr;
}

[Diagram: with alloc marked @Commutative, calls alloc1 through alloc6 from Core 1, Core 2, and Core 3 over time]

Page 19:

char *memory;
void * alloc(int size);

@Commutative
void * alloc(int size) {
   void * ptr = memory;
   memory = memory + size;
   return ptr;
}

Implementation dependences should not cause serialization.

[Diagram: calls alloc1 through alloc6 from Core 1, Core 2, and Core 3 over time]

Page 20:

What is the Next Problem?

4. C does not allow any prescribed non-determinism

» Thus sequential semantics must be assumed even though they are not necessary

» Restricts parallelism (useless dependences)

Non-deterministic branch: the programmer does not care about individual outcomes

» They attach a probability to control how often, statistically, the branch should be taken

» Allows the compiler to trade off 'quality' (e.g., compression rate) for performance

   Example: when to create a new dictionary in a compression scheme

Page 21:

[Diagram: the sequential program restarts its dictionary at data-dependent points (segments of 280, 3000, 70, 520 chars, each ending when compression stops being profitable); the parallel program restarts both on !profitable and at a fixed 100-char cutoff (segments of 100, 100, 80, 100, 20 chars), making the work predictable]

Sequential Program:

dict = create_dict();
while((char = read(1))) {
   profitable = compress(char, dict);
   if (!profitable) {
      dict = restart(dict);
   }
}
finish_dict(dict);

Parallel Program:

#define CUTOFF 100
dict = create_dict();
count = 0;
while((char = read(1))) {
   profitable = compress(char, dict);
   if (!profitable)
      dict = restart(dict);
   if (count == CUTOFF) {
      dict = restart(dict);
      count = 0;
   }
   count++;
}
finish_dict(dict);

Page 22:

2-Core Parallel Program vs. 64-Core Parallel Program

[Diagram: with the Y-branch, the compiler chooses the reset frequency – the 2-core parallel program resets the dictionary every 2500 characters, the 64-core version every 100 characters, in addition to the !profitable resets]

dict = create_dict();
while((char = read(1))) {
   profitable = compress(char, dict);
   @YBRANCH(probability=.01)
   if (!profitable) {
      dict = restart(dict);
   }
}
finish_dict(dict);

Compilers are best situated to make the tradeoff between output quality and performance

Page 23:

Capturing Output/Performance Tradeoff: Y-Branches in 164.gzip

Original:

dict = create_dict();
while((char = read(1))) {
   profitable = compress(char, dict);
   if (!profitable) {
      dict = restart(dict);
   }
}
finish_dict(dict);

Manual version with a fixed cutoff:

#define CUTOFF 100000
dict = create_dict();
count = 0;
while((char = read(1))) {
   profitable = compress(char, dict);
   if (!profitable)
      dict = restart(dict);
   if (count == CUTOFF) {
      dict = restart(dict);
      count = 0;
   }
   count++;
}
finish_dict(dict);

Y-branch version:

dict = create_dict();
while((char = read(1))) {
   profitable = compress(char, dict);
   @YBRANCH(probability=.00001)
   if (!profitable) {
      dict = restart(dict);
   }
}
finish_dict(dict);

Page 24:

164.gzip 26 x x x

175.vpr 1 x x x

176.gcc 18 x x x x

181.mcf 0 x

186.crafty 9 x x x x x

197.parser 3 x x

253.perlbmk 0 x x x

254.gap 3 x x x

255.vortex 0 x x x

256.bzip2 0 x x

300.twolf 1 x x x

Modified only 60 LOC out of ~500,000 LOC

Page 25:

What prevents the automatic extraction of parallelism?

Lack of an Aggressive Compilation Framework

Sequential Programming Model

Performance Potential

Page 26:

What About Non-Scientific Codes???

Scientific Codes (FORTRAN-like):

for(i=1; i<=N; i++)         // C
   a[i] = a[i] + 1;         // X

Independent Multithreading (IMT)
Example: DOALL parallelization

General-purpose Codes (legacy C/C++):

while(ptr = ptr->next)            // LD
   ptr->val = ptr->val + 1;       // X

Cyclic Multithreading (CMT)
Example: DOACROSS [Cytron, ICPP 86]

[Timing diagrams: under IMT, Core 1 and Core 2 each execute whole iterations (C:i, X:i) independently; under CMT, iterations alternate between the cores and each LD:i+1 must wait for LD:i from the other core]

Page 27:

Alternative Parallelization Approaches

while(ptr = ptr->next)            // LD
   ptr->val = ptr->val + 1;       // X

Cyclic Multithreading (CMT)

[Timing diagram: iterations alternate between Core 1 and Core 2; each LD:i+1 waits for LD:i from the other core]

Pipelined Multithreading (PMT)
Example: DSWP [PACT 2004]

[Timing diagram: Core 1 executes only the LD stage (LD:1, LD:2, ...); Core 2 executes only the X stage (X:1, X:2, ...), one stage behind]

Page 28:

Comparison: IMT, PMT, CMT

[Timing diagrams on two cores: IMT runs whole iterations (C:i, X:i) independently on each core; PMT runs the LD stage on Core 1 and the X stage on Core 2; CMT alternates iterations (LD:i, X:i) between the cores]

Page 29:

Comparison: IMT, PMT, CMT

lat(comm) = 1:

» IMT: 1 iter/cycle

» PMT: 1 iter/cycle

» CMT: 1 iter/cycle

[Timing diagrams for the three schemes with single-cycle communication latency]

Page 30:

Comparison: IMT, PMT, CMT

lat(comm) = 1: IMT 1 iter/cycle, PMT 1 iter/cycle, CMT 1 iter/cycle

lat(comm) = 2: IMT 1 iter/cycle, PMT 1 iter/cycle, CMT 0.5 iter/cycle

[Timing diagrams: with 2-cycle communication latency, IMT and PMT still complete one iteration per cycle, while CMT stalls waiting for the loop-carried value and completes only one iteration every two cycles]

Page 31:

Comparison: IMT, PMT, CMT

[Timing diagrams for IMT, PMT, and CMT on two cores]

Cross-thread Dependences → Wide Applicability

Thread-local Recurrences → Fast Execution

Page 32:

Our Objective: Automatic Extraction of Pipeline Parallelism using DSWP

[Diagram: the 197.parser loop as a pipeline – Find English Sentences → Parse Sentences (95%) → Emit Results – parallelized with Decoupled Software Pipelining / PS-DSWP (speculative DOALL middle stage)]

Page 33:

Decoupled Software Pipelining

Page 34:

Decoupled Software Pipelining (DSWP)  [MICRO 2005]

A: while(node)
B:    ncost = doit(node);
C:    cost += ncost;
D:    node = node->next;

[Diagram: the dependence graph over A, B, C, D; its DAG of strongly connected components (DAG_SCC); and the resulting partition – Thread 1 runs {A, D}, Thread 2 runs {B, C}, linked by a communication queue. Legend: intra-iteration vs. loop-carried dependences; register and control edges.]

Inter-thread communication latency is a one-time cost
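As a concrete (if simplified) picture of the transformation, here is a hand-written sketch of the two-thread partition for this loop, assuming a pthreads build and using a small blocking queue in place of the communication queue the papers assume; everything here (queue size, list setup, the stand-in for doit) is illustrative:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct node { int val; struct node *next; };

/* A tiny blocking single-producer/single-consumer queue standing in
   for the inter-thread communication queue. */
#define QSIZE 8
static struct node *slots[QSIZE];
static int head, tail, count;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

static void produce(struct node *n) {
   pthread_mutex_lock(&qlock);
   while (count == QSIZE) pthread_cond_wait(&not_full, &qlock);
   slots[tail] = n; tail = (tail + 1) % QSIZE; count++;
   pthread_cond_signal(&not_empty);
   pthread_mutex_unlock(&qlock);
}

static struct node *consume(void) {
   pthread_mutex_lock(&qlock);
   while (count == 0) pthread_cond_wait(&not_empty, &qlock);
   struct node *n = slots[head]; head = (head + 1) % QSIZE; count--;
   pthread_cond_signal(&not_full);
   pthread_mutex_unlock(&qlock);
   return n;
}

/* Thread 1: the traversal SCC {A, D} -- walks the list and streams
   node pointers to the other stage; NULL marks loop exit. */
static void *stage1(void *arg) {
   struct node *node = arg;
   while (node) {                      /* A */
      produce(node);
      node = node->next;               /* D */
   }
   produce(NULL);
   return NULL;
}

/* Thread 2: the work SCC {B, C} -- consumes node pointers, does the
   work, and accumulates the cost. */
static long cost;
static void *stage2(void *arg) {
   (void)arg;
   struct node *node;
   while ((node = consume()) != NULL) {
      long ncost = node->val;          /* B: stands in for doit(node) */
      cost += ncost;                   /* C */
   }
   return NULL;
}

int main(void) {
   /* Build a short list: 1 -> 2 -> ... -> 10 */
   struct node *list = NULL;
   for (int i = 10; i >= 1; i--) {
      struct node *n = malloc(sizeof *n);
      n->val = i; n->next = list; list = n;
   }

   pthread_t t1, t2;
   pthread_create(&t2, NULL, stage2, NULL);
   pthread_create(&t1, NULL, stage1, list);
   pthread_join(t1, NULL);
   pthread_join(t2, NULL);
   printf("cost = %ld\n", cost);       /* 55 for this list */
   return 0;
}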

Page 35:

Implementing DSWP

[Diagram: the example loop L1 and auxiliary code Aux with their DFG. Legend: intra-iteration vs. loop-carried dependences; register, memory, and control edges.]

Page 36:

Optimization: Node Splitting To Eliminate Cross Thread Control

[Diagram: node splitting applied to the DFG (blocks L1, L2) to remove a cross-thread control dependence. Legend: intra-iteration vs. loop-carried dependences; register, memory, and control edges.]

Page 37:

Optimization: Node Splitting To Reduce Communication

[Diagram: node splitting applied to the DFG (blocks L1, L2) to reduce inter-thread communication. Legend: intra-iteration vs. loop-carried dependences; register, memory, and control edges.]

Page 38:

Constraint: Strongly Connected Components

Consider:

[Diagram: a dependence graph containing a dependence cycle that would span both threads. Legend: intra-iteration vs. loop-carried dependences; register, memory, and control edges.]

Eliminates the pipelined/decoupled property

Solution: DAG_SCC

Page 39:

2 Extensions to the Basic Transformation

Speculation

» Break statistically unlikely dependences

» Form better-balanced pipelines

Parallel Stages

» Execute multiple copies of certain "large" stages

» Stages that contain inner loops are perfect candidates

Page 40:

Why Speculation?

A: while(node)
B:    ncost = doit(node);
C:    cost += ncost;
D:    node = node->next;

[Diagram: the dependence graph over A, B, C, D and its DAG_SCC, in which A and D collapse into one SCC while B and C remain separate. Legend: intra-iteration vs. loop-carried dependences; register and control edges; communication queues.]

Page 41:

Why Speculation?

A: while(cost < T && node)
B:    ncost = doit(node);
C:    cost += ncost;
D:    node = node->next;

[Diagram: with the early-exit condition, the dependence cycles grow and the DAG_SCC collapses A, B, C, and D into a single SCC; the edges responsible are marked as predictable dependences. Legend: intra-iteration vs. loop-carried dependences; register and control edges; communication queues.]

Page 42:

Why Speculation?

A: while(cost < T && node)
B:    ncost = doit(node);
C:    cost += ncost;
D:    node = node->next;

[Diagram: after speculating the predictable loop-exit dependences, the DAG_SCC splits into separate nodes for B, C, D, and A. Legend: intra-iteration vs. loop-carried dependences; register and control edges; communication queues.]

Page 43:

Execution Paradigm

[Diagram: the speculative pipeline of DAG_SCC nodes D, B, C, A executing iterations; when misspeculation is detected, misspeculation recovery reruns iteration 4. Legend: intra-iteration vs. loop-carried dependences; register and control edges; communication queues.]

Page 44:

Understanding PMT Performance

[Timing diagrams: with a balanced pipeline the slowest thread takes 1 cycle/iter, giving an iteration rate of 1 iter/cycle; with an unbalanced pipeline the slowest thread takes 2 cycles/iter, giving 0.5 iter/cycle and idle time on the other core]

Iteration period T = max_i(t_i), i.e., the pipeline runs at the rate of its slowest thread

1. Rate t_i is at least as large as the longest dependence recurrence.

2. NP-hard to find longest recurrence.

3. Large loops make problem difficult in practice.
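Putting point 1 in symbols (just a restatement of the slide, with t_i the per-iteration time of thread i):

\[ \text{throughput} = \frac{1}{T}, \qquad T = \max_i t_i \;\ge\; \max_{\text{recurrence } R}\; \sum_{op \in R} \text{latency}(op) \]

since a dependence recurrence (an SCC) cannot be split across pipeline stages, the slowest stage can never be faster than the longest recurrence.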

Page 45:

Selecting Dependences To Speculate

A: while(cost < T && node)
B:    ncost = doit(node);
C:    cost += ncost;
D:    node = node->next;

[Diagram: the dependence graph and speculative DAG_SCC partitioned across four threads – Thread 1: D, Thread 2: B, Thread 3: C, Thread 4: A – with communication queues between them. Legend: intra-iteration vs. loop-carried dependences; register and control edges.]

Page 46:

Detecting Misspeculation

[Diagram: DAG_SCC with nodes D, B, C, A, one per thread]

Thread 1:
A1: while(consume(4))
D :    node = node->next
       produce({0,1},node);

Thread 2:
A2: while(consume(5))
B :    ncost = doit(node);
       produce(2,ncost);
D2:    node = consume(0);

Thread 3:
A3: while(consume(6))
B3:    ncost = consume(2);
C :    cost += ncost;
       produce(3,cost);

Thread 4:
A : while(cost < T && node)
B4:    cost = consume(3);
C4:    node = consume(1);
       produce({4,5,6}, cost < T && node);

Page 47:

Detecting Misspeculation

[Diagram: DAG_SCC with nodes D, B, C, A, one per thread]

Thread 1:
A1: while(TRUE)
D :    node = node->next
       produce({0,1},node);

Thread 2:
A2: while(TRUE)
B :    ncost = doit(node);
       produce(2,ncost);
D2:    node = consume(0);

Thread 3:
A3: while(TRUE)
B3:    ncost = consume(2);
C :    cost += ncost;
       produce(3,cost);

Thread 4:
A : while(cost < T && node)
B4:    cost = consume(3);
C4:    node = consume(1);
       produce({4,5,6}, cost < T && node);

Page 48:

Detecting Misspeculation

[Diagram: DAG_SCC with nodes D, B, C, A, one per thread]

Thread 1:
A1: while(TRUE)
D :    node = node->next
       produce({0,1},node);

Thread 2:
A2: while(TRUE)
B :    ncost = doit(node);
       produce(2,ncost);
D2:    node = consume(0);

Thread 3:
A3: while(TRUE)
B3:    ncost = consume(2);
C :    cost += ncost;
       produce(3,cost);

Thread 4:
A : while(cost < T && node)
B4:    cost = consume(3);
C4:    node = consume(1);
       if(!(cost < T && node))
          FLAG_MISSPEC();

Page 49:

Breaking False Memory Dependences

[Diagram: the dependence graph over D, B, C, A with a false memory dependence; memory is versioned (Memory Version 3 shown), and the oldest version is committed by the recovery thread. Legend: intra-iteration vs. loop-carried dependences; register and control edges; communication queues; false memory dependences.]

Page 50:

Adding Parallel Stages to DSWP

while(ptr = ptr->next)            // LD = 1 cycle
   ptr->val = ptr->val + 1;       // X  = 2 cycles

Comm. latency = 2 cycles

Throughput:
» DOACROSS: 1/2 iteration/cycle
» DSWP:     1/2 iteration/cycle
» PS-DSWP:  1 iteration/cycle

[Timing diagram: Core 1 runs the LD stage (LD:1, LD:2, ...); the X stage is replicated, with Core 2 executing X:1, X:3, X:5 and Core 3 executing X:2, X:4 in parallel]
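A quick sanity check on the stated throughputs, under the slide's assumptions (LD = 1 cycle, X = 2 cycles, and X has no loop-carried dependence so it may be replicated):

\[ \text{DSWP: } \frac{1}{\max(t_{LD},\, t_{X})} = \frac{1}{\max(1,2)} = \tfrac{1}{2} \text{ iter/cycle}, \qquad \text{PS-DSWP (X on 2 cores): } \frac{1}{\max\!\left(t_{LD},\, \tfrac{t_{X}}{2}\right)} = 1 \text{ iter/cycle} \]

The 2-cycle communication latency only delays the pipeline fill; it does not change the steady-state rate.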

Page 51:

Things to Think About – Speculation

How do you decide what dependences to speculate?

» Look solely at profile data?

» How do you ensure enough profile coverage?

» What about code structure?

» What if you are wrong? Undo speculation decisions at run-time?

How do you manage speculation in a pipeline?

» Traditional definition of a transaction is broken

» Transaction execution spread out across multiple cores

Page 52:

Things to Think About 2 – Pipeline Structure

When is a pipeline a good/bad choice for parallelization?

Is pipelining good or bad for cache performance?

» Is DOALL better/worse for cache?

Can a pipeline be adjusted when the number of available cores increases/decreases, or based on what else is running on the processor?

How many cores can DSWP realistically scale to?