Inter-Iteration Scalar Replacement in the Presence of Control-Flow

Mihai Budiu – Microsoft Research, Silicon Valley

Seth Copen Goldstein – Carnegie Mellon University

ODES 2005

Summary

• What: compiler optimization

• Where: dense, regular matrix codes (FORTRAN, some media processing)

• Goal: reduce number of memory accesses

• How: allocate array elements to registers

• New: optimal algorithm based on predication

Outline

• Scalar Replacement

• Predicated PRE

• Combining the two

• Results

Scalar Replacement

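Source code:

a[i] = a[i] + 2;
a[i] <<= 4;

Front-end view (source after scalar replacement):

tmp = a[i];
tmp += 2;
tmp <<= 4;
a[i] = tmp;

Back-end view (memory operations):

Before: ld a[i]; arith …; st a[i]; ld a[i]; arith …; st a[i]
After:  ld a[i]; arith …; arith …; st a[i]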

Inter-Iteration Scalar Replacement

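for (i=0; i < N; i++)
    a[i] += a[i+1];

Runtime, original loop:

i=0: ld a[0]; ld a[1]; st a[0]
i=1: ld a[1]; ld a[2]; st a[1]

After inter-iteration scalar replacement:

tmp0 = a[0];
for (i=0; i < N; i++) {
    tmp1 = a[i+1];
    a[i] = tmp0 + tmp1;
    tmp0 = tmp1;
}

Runtime, transformed loop (a[1] is carried from i=0 to i=1 in tmp1):

before loop: ld a[0]
i=0: ld a[1]; st a[0]
i=1: ld a[2]; st a[1]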
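The effect of the transformation on this slide can be checked with a small sketch that runs both loop versions and counts loads of a[]. The function names and the load counter are illustrative assumptions, not taken from the slides; a is assumed to hold n + 1 elements.

```c
#include <stddef.h>

/* Original loop: two loads of a[] per iteration. Returns the load count. */
static long sum_next_original(int *a, size_t n) {
    long loads = 0;
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] + a[i + 1];      /* ld a[i]; ld a[i+1]; st a[i] */
        loads += 2;
    }
    return loads;
}

/* Scalar-replaced loop: the value of a[i+1] loaded in iteration i is
 * reused as a[i] in iteration i+1. Returns the load count. */
static long sum_next_scalarized(int *a, size_t n) {
    long loads = 0;
    if (n == 0) return 0;            /* original loop would touch nothing */
    int tmp0 = a[0];                 /* single load before the loop */
    loads++;
    for (size_t i = 0; i < n; i++) {
        int tmp1 = a[i + 1];         /* the only load in the loop body */
        loads++;
        a[i] = tmp0 + tmp1;
        tmp0 = tmp1;                 /* carry a[i+1] into the next iteration */
    }
    return loads;
}
```

For n iterations the first version performs 2n loads and the second n + 1, while both leave the same values in a[].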

Rotating Scalars

for (i=0; i < N; i++)
    a[i] += a[i+3];

Invariant:
tmp0 = a[i+0]
tmp1 = a[i+1]
tmp2 = a[i+2]
tmp3 = a[i+3]

for (…) {
    ….
    tmp0 = tmp1;
    tmp1 = tmp2;
    tmp2 = tmp3;
    tmp3 = a[i+4];
}

Itanium has hardware support for rotating registers.
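A minimal software sketch of the rotation, assuming no hardware rotating registers: the four scalars are shifted by explicit copies at the end of each iteration. The function name is hypothetical, a is assumed to hold n + 3 elements, and the guard on the last load is an addition so the sketch never reads past the array.

```c
#include <stddef.h>

/* Computes a[i] += a[i+3] for i in [0, n) with one load per iteration
 * in steady state. a must hold n + 3 elements. */
static void add_ahead3_rotating(int *a, size_t n) {
    if (n == 0) return;
    /* Establish the invariant tmpk == a[i+k] for i = 0. */
    int tmp0 = a[0], tmp1 = a[1], tmp2 = a[2], tmp3 = a[3];
    for (size_t i = 0; i < n; i++) {
        a[i] = tmp0 + tmp3;          /* uses only scalars, one store */
        /* Rotate so the invariant holds for i + 1. */
        tmp0 = tmp1;
        tmp1 = tmp2;
        tmp2 = tmp3;
        if (i + 1 < n)               /* skip the load after the last iteration */
            tmp3 = a[i + 4];
    }
}
```

The copies tmp0 = tmp1, and so on, are what rotating-register hardware would make free by renaming instead of moving values.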

Control-Flow

for (i=0; i < N; i++)
    if (i & 1)
        a[i] += a[i+3];

Availability

y = a[i];
...
if (x) {
    ...
    ... = a[i];    /* a[i] is available in y here */
}

Conservative Analysis

if (x) {
    ...
    y = a[i];
}
...
... = a[i];    /* y? a[i] is in y only if x was true */

Predicated PRE

flag = false;
if (x) {
    ...
    y = a[i];
    flag = true;
}
...
... = flag ? y : a[i];

Invariant: flag = true ⇒ y = a[i]
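The pattern on this slide can be written as a small self-contained function. This is a sketch under assumptions: the function name, the way the loaded values are combined, and the load counter are all hypothetical and exist only to make the behavior observable.

```c
#include <stdbool.h>

/* Uses a[i] once inside the branch (if x) and once after it, but loads
 * it from memory at most once. *loads is incremented on each real load. */
static int use_twice(const int *a, int i, bool x, int *loads) {
    bool flag = false;               /* invariant: flag implies y == a[i] */
    int y = 0;
    int sum = 0;
    if (x) {
        y = a[i];                    /* first load, only on this path */
        (*loads)++;
        flag = true;
        sum += y;
    }
    int z;
    if (flag) {
        z = y;                       /* reuse the value: no memory access */
    } else {
        z = a[i];                    /* value not available: load it now */
        (*loads)++;
    }
    return sum + z;
}
```

Whichever way the branch goes, exactly one load of a[i] executes, which a purely static availability analysis could not guarantee.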

Scalars and Flags

for (i=0; i < N; i++)
    if (i & 1)
        a[i] += a[i+3];

Invariant (each bool flag guards one scalar):

(valid0 = true) ⇒ tmp0 = a[i+0]
(valid1 = true) ⇒ tmp1 = a[i+1]
(valid2 = true) ⇒ tmp2 = a[i+2]
(valid3 = true) ⇒ tmp3 = a[i+3]

Scalar Replacement Algorithm

Load: ld a[i+k] becomes

if (! validk) {
    tmpk = a[i+k];
    validk = true;
}

Store: st a[i+k], v becomes

tmpk = v;
validk = true;

Can be implemented with predication or conditional moves.
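The load and store rules can be combined with rotation into a runnable sketch. Several things here are assumptions rather than the authors' implementation: the function names, the load counters, a hypothetical cond[] array standing in for control flow the compiler cannot analyze, and a write-through store (the store is issued immediately as well as updating the scalar). a is assumed to hold n + 3 elements.

```c
#include <stdbool.h>
#include <stddef.h>

/* Reference loop. Returns the number of loads of a[]. */
static long guarded_original(int *a, const bool *cond, size_t n) {
    long loads = 0;
    for (size_t i = 0; i < n; i++) {
        if (cond[i]) {
            a[i] += a[i + 3];
            loads += 2;
        }
    }
    return loads;
}

/* Same loop with scalars tmp[k] and flags valid[k].
 * Invariant at the top of iteration i: valid[k] implies tmp[k] == a[i+k]. */
static long guarded_scalarized(int *a, const bool *cond, size_t n) {
    int tmp[4] = {0, 0, 0, 0};
    bool valid[4] = {false, false, false, false};
    long loads = 0;
    for (size_t i = 0; i < n; i++) {
        if (cond[i]) {
            if (!valid[0]) { tmp[0] = a[i];     valid[0] = true; loads++; }
            if (!valid[3]) { tmp[3] = a[i + 3]; valid[3] = true; loads++; }
            tmp[0] += tmp[3];        /* store rule: update the scalar ...   */
            a[i] = tmp[0];           /* ... and write through (assumption)  */
        }
        /* Rotate scalars and flags; the new slot 3 starts out invalid. */
        for (int k = 0; k < 3; k++) {
            tmp[k] = tmp[k + 1];
            valid[k] = valid[k + 1];
        }
        valid[3] = false;
    }
    return loads;
}
```

When every cond[i] is true, the scalarized version loads each touched element once (n + 3 loads instead of 2n); when no reuse exists, it performs the same loads as the original and nothing more.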

Optimality

• No scalarized memory location is read or written more than once

• The resulting program touches exactly the same memory locations as the original program

• Proof: trivial, based on the valid-flags invariant

[given perfect dependence analysis and enough registers]

Additional Details

• Initialize validk to false

• Rotate scalars and valid flags

• Use ‘dirtyk’ flags to avoid extra stores

• Postlude for missing stores: if (validk) a[N+k] = tmpk

• Lift loop-invariant accesses (finding loop-invariant predicates)

• Hardware support (for rotating registers and flags)

(see paper)
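The dirty flags and the postlude can be illustrated with a store-only loop in which a[i+1] may be written in iteration i and overwritten as a[i] in iteration i+1. Everything in this sketch is an assumption made for illustration: the loop itself, the arrays c, d, x, y, the function names, and the choice to guard the postlude with the dirty flag. a is assumed to hold n + 1 elements.

```c
#include <stdbool.h>
#include <stddef.h>

/* Reference loop. Returns the number of stores to a[]. */
static long stores_original(int *a, const bool *c, const bool *d,
                            const int *x, const int *y, size_t n) {
    long stores = 0;
    for (size_t i = 0; i < n; i++) {
        if (d[i]) { a[i + 1] = y[i]; stores++; }
        if (c[i]) { a[i] = x[i];     stores++; }
    }
    return stores;
}

/* Deferred stores: tmp0/dirty0 track a[i], tmp1/dirty1 track a[i+1].
 * A value is written back once, when it rotates out of the window. */
static long stores_deferred(int *a, const bool *c, const bool *d,
                            const int *x, const int *y, size_t n) {
    int tmp0 = 0, tmp1 = 0;
    bool dirty0 = false, dirty1 = false;
    long stores = 0;
    for (size_t i = 0; i < n; i++) {
        if (d[i]) { tmp1 = y[i]; dirty1 = true; }
        if (c[i]) { tmp0 = x[i]; dirty0 = true; }   /* may overwrite older value */
        if (dirty0) { a[i] = tmp0; stores++; }      /* a[i] leaves the window */
        tmp0 = tmp1;                                /* rotate scalars and flags */
        dirty0 = dirty1;
        dirty1 = false;
    }
    if (dirty0) { a[n] = tmp0; stores++; }          /* postlude: pending store */
    return stores;
}
```

With every c[i] and d[i] true, the reference loop issues 2n stores and the deferred version n + 1, one per location written.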

Redundant Stores

[Bar chart: % reduction in stores per benchmark, with two series, "% st promo" and "% st PRE". The y-axis runs from 0 to 30; a value label of 53 also appears. Benchmarks on the x-axis: adpcm_e, adpcm_d, gsm_e, gsm_d, epic_e, epic_d, mpeg2_e, mpeg2_d, jpeg_e, jpeg_d, pegwit_e, pegwit_d, g721_e, g721_d, pgp_e, pgp_d, rasta, mesa, 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex, 183.equake, 188.ammp, 164.gzip, 175.vpr, 176.gcc, 181.mcf, 197.parser, 254.gap, 300.twolf]

Redundant Loads

[Bar chart: % reduction in loads per benchmark, with two series, "% ld promo" and "% ld PRE". The y-axis runs from 0 to 45. Benchmarks on the x-axis: adpcm_e, adpcm_d, gsm_e, gsm_d, epic_e, epic_d, mpeg2_e, mpeg2_d, jpeg_e, jpeg_d, pegwit_e, pegwit_d, g721_e, g721_d, pgp_e, pgp_d, rasta, mesa, 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex, 183.equake, 188.ammp, 164.gzip, 175.vpr, 176.gcc, 181.mcf, 197.parser, 254.gap, 300.twolf]

Performance Impact

[Chart: % reduction in running time; target: Spatial Computation]

Removed accesses tend to be cache hits: small contribution to running time.

Conclusions

• Use predicates to dynamically detect redundant memory accesses

• Simple algorithm gives “optimal” result even with un-analyzable control flow

• Can dramatically reduce memory accesses

Related Work

• Morel & Renvoise, CACM 1979: Partial Redundancy Elimination (not across remote iterations)

• Carr & Kennedy, PLDI 1990: Scalar Replacement (arrays, no control flow)

• Carr & Kennedy, SPE 1994: Generalized Scalar Replacement (restricted control-flow)

• Scholz, Europar 2003: Predicated PRE (single iteration, no writes)

• This work, ODES 2005: PPRE across iterations (optimal)

The slide groups these under two labels: non-speculative promotion and speculative promotion.