EECS 583 – Class 16
Automatic Parallelization Via
Decoupled Software Pipelining
University of Michigan
November 7, 2018
- 1 -
Announcements + Reading Material
Midterm exam – Next Wednesday in class
» Starts at 10:40am sharp
» Open book/notes
No regular class next Monday
» Group office hours in class in 2246 SRB
Reminder – Sign up for paper presentation slot on my
door if you haven’t done so already
Today’s class reading
» “Automatic Thread Extraction with Decoupled Software Pipelining,” G.
Ottoni, R. Rangan, A. Stoler, and D. I. August, Proceedings of the 38th
IEEE/ACM International Symposium on Microarchitecture, Nov. 2005.
» “Revisiting the Sequential Programming Model for Multi-Core,” M. J.
Bridges, N. Vachharajani, Y. Zhang, T. Jablin, and D. I. August, Proceedings
of the 40th IEEE/ACM International Symposium on Microarchitecture, Dec. 2007.
- 2 -
Moore’s Law
Source: Intel/Wikipedia
- 3 -
Single-Threaded Performance Not Improving
- 4 -
What about Parallel Programming? – or –
What is Good About the Sequential Model?
Sequential is easier
» People think about programs sequentially
» Simpler to write a sequential program
Deterministic execution
» Reproducing errors for debugging
» Testing for correctness
No concurrency bugs
» Deadlock, livelock, atomicity violations
» Locks are not composable
Performance extraction
» Sequential programs are portable
Are parallel programs? Ask GPU developers
» Performance debugging of sequential programs is straightforward
- 5 -
Compilers are the Answer? - Proebsting’s Law
“Compiler Advances Double Computing Power Every 18 Years”
Run your favorite set of benchmarks with your favorite state-of-the-art optimizing compiler. Run the benchmarks both with and without optimizations enabled. The ratio of those numbers represents the entire contribution of compiler optimizations to speeding up those benchmarks. Let's assume that this ratio is about 4X for typical real-world applications, and let's further assume that compiler optimization work has been going on for about 36 years. Since 4X is two doublings over 36 years, compiler optimization advances double computing power every 18 years. QED.
[Chart: speedup on a log scale (1 to 10^8) vs. years 1971–2003]
Conclusion – Compilers not about performance!
Can We Convert Single-threaded
Programs into Multi-threaded?
- 7 -
Loop Level Parallelization
[Figure: loop chunking – iterations i = 0-39 divided into chunks (i = 0-9, 10-19, 20-29, 30-39) distributed across Thread 0 and Thread 1]
Bad news: limited number of parallel loops in general purpose applications
» 1.3x speedup for SpecINT2000 on 4 cores
- 8 -
DOALL Loop Coverage
[Bar chart: fraction of sequential execution (0–1) spent in DOALL loops for each benchmark in SPEC FP, SPEC INT, Mediabench, and Utilities, plus the average]
- 9 -
What’s the Problem?
for (i=0; i<100; i++) {
. . . = *p;
*q = . . .
}
1. Memory dependence analysis
Memory dependence profiling
and speculative parallelization
- 10 -
[Bar chart: fraction of sequential execution (0–1) that is provably DOALL for each benchmark in SPEC FP, SPEC INT, Mediabench, and Utilities, plus the average]
DOALL Coverage – Provable and Profiled
[Bar chart: same benchmarks, with profiled DOALL coverage stacked on top of provable DOALL coverage]
Still not good enough!
- 11 -
What’s the Next Problem?
2. Data dependences
while (ptr != NULL) {
. . .
ptr = ptr->next;
sum = sum + foo;
}
Compiler transformations
- 12 -
We Know How to Break Some of These
Dependences – Recall ILP Optimizations
[Figure: the accumulator sum += x is split into sum1 += x on Thread 0 and sum2 += x on Thread 1, merged afterward with sum = sum1 + sum2]
Apply accumulator variable expansion!
- 13 -
Data Dependences Inhibit Parallelization
Accumulator, induction, and min/max expansion
only capture a small set of dependences
2 options
» 1) Break more dependences – New transformations
» 2) Parallelize in the presence of dependences – more
than DOALL parallelization
We will talk about both, but for now ignore this
issue
- 14 -
char *memory;
void * alloc(int size);
void * alloc(int size) {
void * ptr = memory;
memory = memory + size;
return ptr;
}
Low Level Reality
[Figure: alloc1–alloc6 executing across Core 1, Core 2, and Core 3 over time]
What’s the Next Problem?
3. C/C++ too restrictive
- 15 -
char *memory;
void * alloc(int size);
void * alloc(int size) {
void * ptr = memory;
memory = memory + size;
return ptr;
}
Low Level Reality
[Figure: alloc1–alloc6 executing across Core 1, Core 2, and Core 3 over time]
Loops cannot be parallelized even if
computation is independent
- 16 -
Commutative Extension
Interchangeable call sites
» Programmer doesn’t care about the order that a
particular function is called
» Multiple different orders are all defined as correct
» Impossible to express in C
Prime example is memory allocation routine
» Programmer does not care which address is returned
on each call, just that the proper space is provided
Enables compiler to break dependences that flow from
one invocation to the next, forcing sequential behavior
- 17 -
char *memory;
void * alloc(int size);
@Commutative
void * alloc(int size) {
void * ptr = memory;
memory = memory + size;
return ptr;
}
Low Level Reality
[Figure: with @Commutative, alloc1–alloc6 run concurrently across Core 1, Core 2, and Core 3]
- 18 -
char *memory;
void * alloc(int size);
@Commutative
void * alloc(int size) {
void * ptr = memory;
memory = memory + size;
return ptr;
}
Implementation dependences should not cause serialization.
Low Level Reality
[Figure: with @Commutative, alloc1–alloc6 run concurrently across Core 1, Core 2, and Core 3]
- 19 -
What is the Next Problem?
4. C does not allow any prescribed non-
determinism
» Thus sequential semantics must be assumed even
though they are not necessary
» Restricts parallelism (useless dependences)
Non-deterministic branch: programmer does
not care about individual outcomes
» They attach a probability to control how statistically
often the branch should be taken
» Allow compiler to tradeoff ‘quality’ (e.g.,
compression rates) for performance
When to create a new dictionary in a compression scheme
- 20 -
[Figure: the sequential program restarts the dictionary at unpredictable points (after 280, 3000, 70, and 520 chars); the parallel program restarts every 100 chars (final chunk 80 + 20 chars), making chunks independent]
Sequential Program:
dict = create_dict();
while((char = read(1))) {
  profitable = compress(char, dict)
  if (!profitable) {
    dict = restart(dict);
  }
}
finish_dict(dict);

Parallel Program:
#define CUTOFF 100
dict = create_dict();
count = 0;
while((char = read(1))) {
  profitable = compress(char, dict)
  if (!profitable)
    dict = restart(dict);
  if (count == CUTOFF) {
    dict = restart(dict);
    count = 0;
  }
  count++;
}
finish_dict(dict);
- 21 -
[Figure: 2-core parallel program processes large chunks (1500, 2500, 800, 1200, 1700, 1000, 1300 chars); 64-core parallel program processes many small chunks]
dict = create_dict();
while((char = read(1))) {
  profitable = compress(char, dict)
  @YBRANCH(probability=.01)
  if (!profitable) {
    dict = restart(dict);
  }
}
finish_dict(dict);
Compilers are best situated to make the tradeoff between output quality
and performance
[Timelines: reset every 2500 characters (2-core) vs. reset every 100 characters (64-core)]
- 22 -
Capturing Output/Performance Tradeoff:
Y-Branches in 164.gzip
dict = create_dict();
while((char = read(1))) {
  profitable = compress(char, dict)
  if (!profitable) {
    dict = restart(dict);
  }
}
finish_dict(dict);
#define CUTOFF 100000
dict = create_dict();
count = 0;
while((char = read(1))) {
  profitable = compress(char, dict)
  if (!profitable)
    dict = restart(dict);
  if (count == CUTOFF) {
    dict = restart(dict);
    count = 0;
  }
  count++;
}
finish_dict(dict);

dict = create_dict();
while((char = read(1))) {
  profitable = compress(char, dict)
  @YBRANCH(probability=.00001)
  if (!profitable) {
    dict = restart(dict);
  }
}
finish_dict(dict);
- 23 -
164.gzip 26 x x x
175.vpr 1 x x x
176.gcc 18 x x x x
181.mcf 0 x
186.crafty 9 x x x x x
197.parser 3 x x
253.perlbmk 0 x x x
254.gap 3 x x x
255.vortex 0 x x x
256.bzip2 0 x x
300.twolf 1 x x x
Modified only 60 LOC out of ~500,000 LOC
- 24 -
Lack of an Aggressive Compilation Framework
What prevents the automatic extraction of parallelism?
Sequential Programming Model
Performance Potential
- 25 -
What About Non-Scientific Codes???
Scientific Codes (FORTRAN-like):
for(i=1; i<=N; i++)        // C
  a[i] = a[i] + 1;         // X

General-purpose Codes (legacy C/C++):
while(ptr = ptr->next)     // LD
  ptr->val = ptr->val + 1; // X
[Schedules on two cores:
Independent Multithreading (IMT), example: DOALL parallelization – iterations C:i/X:i run fully independently, split across Core 1 and Core 2.
Cyclic Multithreading (CMT), example: DOACROSS [Cytron, ICPP 86] – iterations LD:i/X:i alternate between cores, serialized by the loop-carried pointer chase.]
- 26 -
Alternative Parallelization Approaches
[CMT schedule: LD:i/X:i pairs alternate between Core 1 and Core 2]
while(ptr = ptr->next) // LD
ptr->val = ptr->val + 1; // X
[PMT schedule – Pipelined Multithreading, example: DSWP [PACT 2004]: Core 1 runs the LD stream while Core 2 runs the X stream, overlapped in pipeline fashion; contrast with Cyclic Multithreading (CMT)]
- 27 -
Comparison: IMT, PMT, CMT
[Side-by-side schedules for IMT, CMT, and PMT on two cores]
- 28 -
Comparison: IMT, PMT, CMT
lat(comm) = 1: IMT 1 iter/cycle, PMT 1 iter/cycle, CMT 1 iter/cycle
[Schedules for the three paradigms with one-cycle communication latency]
- 29 -
Comparison: IMT, PMT, CMT
lat(comm) = 2: IMT 1 iter/cycle, PMT 1 iter/cycle, CMT 0.5 iter/cycle
[Schedules with two-cycle communication latency: CMT's cross-core recurrence halves its rate, while IMT and PMT pay only a one-time pipeline-fill delay]
- 30 -
Comparison: IMT, PMT, CMT
[Schedules for IMT, PMT, and CMT]
Cross-thread dependences → wide applicability
Thread-local recurrences → fast execution
- 31 -
Our Objective: Automatic Extraction of
Pipeline Parallelism using DSWP
[Pipeline for 197.parser: Find English Sentences → Parse Sentences (95%) → Emit Results]
Decoupled Software Pipelining + PS-DSWP (speculative DOALL middle stage)
Decoupled Software Pipelining
- 33 -
Decoupled Software Pipelining (DSWP)
A: while(node)
B: ncost = doit(node);
C: cost += ncost;
D: node = node->next;
Inter-thread communication
latency is a one-time cost
[MICRO 2005]
[Figure: dependence graph with intra-iteration and loop-carried register/control dependences – D feeds itself (loop-carried pointer chase), A controls B, C, and D, and C is a loop-carried accumulation. DAG_SCC collapses {A,D} into one SCC. DSWP assigns {A,D} to Thread 1 and {B,C} to Thread 2, linked by a communication queue]
- 34 -
Implementing DSWP
[Figure: dataflow graph (DFG) for loop L1 and auxiliary code, annotated with intra-iteration and loop-carried register, memory, and control dependences]
- 35 -
Optimization: Node Splitting
To Eliminate Cross Thread Control
[Figure: splitting a node between partitions L1 and L2 removes a cross-thread control dependence]
- 36 -
Optimization: Node Splitting To Reduce Communication
[Figure: node splitting applied between partitions L1 and L2 to reduce cross-thread communication]
- 37 -
Constraint: Strongly Connected Components
Consider: [Figure: a dependence cycle split across threads – this eliminates the pipelined/decoupled property]
Solution: DAG_SCC
- 38 -
2 Extensions to the Basic Transformation
Speculation
» Break statistically unlikely dependences
» Form better-balanced pipelines
Parallel Stages
» Execute multiple copies of certain “large” stages
» Stages that contain inner loops perfect candidates
- 39 -
Why Speculation?
A: while(node)
B: ncost = doit(node);
C: cost += ncost;
D: node = node->next;
[Dependence graph and DAG_SCC: {A,D} form one SCC; B and C are downstream pipeline stages]
- 40 -
Why Speculation?
A: while(cost < T && node)
B: ncost = doit(node);
C: cost += ncost;
D: node = node->next;
[Dependence graph: the exit test in A adds dependences from C and D back to A, collapsing {A,B,C,D} into a single SCC – but the added dependences are highly predictable]
- 41 -
Why Speculation?
A: while(cost < T && node)
B: ncost = doit(node);
C: cost += ncost;
D: node = node->next;
[Speculating the predictable dependences removes A from the recurrence: DAG_SCC becomes D → B → C, with A's exit condition checked speculatively]
- 42 -
Execution Paradigm
[Figure: the pipeline executes iterations speculatively; when misspeculation is detected, misspeculation recovery reruns iteration 4]
- 43 -
Understanding PMT Performance
[Schedules: a balanced pipeline (slowest thread: 1 cycle/iter → iteration rate 1 iter/cycle) vs. an unbalanced one (slowest thread: 2 cycles/iter → 0.5 iter/cycle, with idle time on the other core)]
T = max(t_i)
1. Rate t_i is at least as large as the longest dependence recurrence.
2. NP-hard to find longest recurrence.
3. Large loops make problem difficult in practice.
- 44 -
Selecting Dependences To Speculate
A: while(cost < T && node)
B: ncost = doit(node);
C: cost += ncost;
D: node = node->next;
[Dependence graph and DAG_SCC partitioned across Threads 1–4; dependences are selected for speculation so the SCC can be split into pipeline stages]
- 45 -
Detecting Misspeculation
[DAG_SCC: D → B → C, with A speculated]
Thread 1:
A1: while(consume(4))
D :   node = node->next
      produce({0,1},node);
Thread 2:
A2: while(consume(5))
B :   ncost = doit(node);
      produce(2,ncost);
D2:   node = consume(0);
Thread 3:
A3: while(consume(6))
B3:   ncost = consume(2);
C :   cost += ncost;
      produce(3,cost);
Thread 4:
A : while(cost < T && node)
B4:   cost = consume(3);
C4:   node = consume(1);
      produce({4,5,6},cost < T && node);
- 46 -
Detecting Misspeculation
[DAG_SCC: D → B → C, with A speculated]
Thread 1:
A1: while(TRUE)
D :   node = node->next
      produce({0,1},node);
Thread 2:
A2: while(TRUE)
B :   ncost = doit(node);
      produce(2,ncost);
D2:   node = consume(0);
Thread 3:
A3: while(TRUE)
B3:   ncost = consume(2);
C :   cost += ncost;
      produce(3,cost);
Thread 4:
A : while(cost < T && node)
B4:   cost = consume(3);
C4:   node = consume(1);
      produce({4,5,6},cost < T && node);
- 47 -
Detecting Misspeculation
[DAG_SCC: D → B → C, with A speculated]
Thread 1:
A1: while(TRUE)
D :   node = node->next
      produce({0,1},node);
Thread 2:
A2: while(TRUE)
B :   ncost = doit(node);
      produce(2,ncost);
D2:   node = consume(0);
Thread 3:
A3: while(TRUE)
B3:   ncost = consume(2);
C :   cost += ncost;
      produce(3,cost);
Thread 4:
A : while(cost < T && node)
B4:   cost = consume(3);
C4:   node = consume(1);
      if(!(cost < T && node))
        FLAG_MISSPEC();
- 48 -
Breaking False Memory Dependences
[Figure: memory versioning – each speculative iteration writes to its own memory version (e.g., Version 3); the oldest version is committed by the recovery thread. This breaks the false memory dependence in the dependence graph (A, B, C, D)]
- 49 -
Adding Parallel Stages to DSWP
LD = 1 cycle; X = 2 cycles; comm. latency = 2 cycles
while(ptr = ptr->next)     // LD
  ptr->val = ptr->val + 1; // X
Throughput – DOACROSS: 1/2 iteration/cycle; DSWP: 1/2 iteration/cycle; PS-DSWP: 1 iteration/cycle
[Schedule: Core 1 runs the LD stage; Cores 2 and 3 replicate the X stage, alternating iterations]
- 50 -
Things to Think About – Speculation
How do you decide what dependences to speculate?
» Look solely at profile data?
» How do you ensure enough profile coverage?
» What about code structure?
» What if you are wrong? Undo speculation decisions at run-time?
How do you manage speculation in a pipeline?
» Traditional definition of a transaction is broken
» Transaction execution spread out across multiple cores
- 51 -
Things to Think About 2 – Pipeline Structure
When is a pipeline a good/bad choice for parallelization?
Is pipelining good or bad for cache performance?
» Is DOALL better/worse for cache?
Can a pipeline be adjusted when the number of available
cores increases/decreases, or based on what else is running
on the processor?
How many cores can DSWP realistically scale to?