Inter-Iteration Scalar Replacement
in the Presence of Control-Flow
Mihai Budiu – Microsoft Research, Silicon Valley
Seth Copen Goldstein – Carnegie Mellon University
ODES 2005
2
Summary
• What: compiler optimization
• Where: dense regular matrix codes
  – FORTRAN
  – some media processing
• Goal: reduce number of memory accesses
• How: allocate array elements to registers
• New: optimal algorithm based on predication
3
Outline
• Scalar Replacement
• Predicated PRE
• Combining the two
• Results
4
Scalar Replacement
Front-end (source level)
Before:
    a[i] = a[i] + 2;
    a[i] <<= 4;
After:
    tmp = a[i];
    tmp += 2;
    tmp <<= 4;
    a[i] = tmp;
Back-end (memory operations)
Before: ld a[i]; arith …; st a[i]; ld a[i]; arith …; st a[i]
After:  ld a[i]; arith …; arith …; st a[i]
5
Inter-Iteration Scalar Replacement
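for (i=0; i < N; i++)
    a[i] += a[i+1];
Runtime (original):
    i=0: ld a[0]; ld a[1]; st a[0]
    i=1: ld a[1]; ld a[2]; st a[1]
After scalar replacement:
tmp0 = a[0];
for (i=0; i < N; i++) {
    tmp1 = a[i+1];
    a[i] = tmp0 + tmp1;
    tmp0 = tmp1;
}
Runtime (after):
    ld a[0]
    i=0: ld a[1]; st a[0]
    i=1: ld a[2]; st a[1]
The value loaded into tmp1 is reused in the next iteration, so a[1] is loaded only once.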
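The load savings can be checked with a small C sketch. The function names and the load counter are hypothetical and exist only for illustration; the array is assumed to hold n + 1 elements because a[i+1] is read.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: two loads and one store per iteration. */
void sum_next_original(int *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] += a[i + 1];
}

/* Scalar-replaced loop: the value loaded as a[i+1] in iteration i
   is reused as a[i] in iteration i+1.
   Returns the number of loads performed (illustrative counter). */
size_t sum_next_scalar(int *a, size_t n) {
    assert(a != NULL || n == 0);
    if (n == 0)
        return 0;               /* avoid touching a[0] when the loop is empty */
    size_t loads = 0;
    int tmp0 = a[0];
    loads++;
    for (size_t i = 0; i < n; i++) {
        int tmp1 = a[i + 1];
        loads++;
        a[i] = tmp0 + tmp1;
        tmp0 = tmp1;            /* carry the value into the next iteration */
    }
    return loads;
}
```

For n iterations the sketch performs n + 1 loads, where the original loop performs 2n.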
6
Rotating Scalars
for (i=0; i < N; i++)
    a[i] += a[i+3];
Invariant:
    tmp0 = a[i+0]
    tmp1 = a[i+1]
    tmp2 = a[i+2]
    tmp3 = a[i+3]
for (…) {
    ….
    tmp0 = tmp1;
    tmp1 = tmp2;
    tmp2 = tmp3;
    tmp3 = a[i+4];
}
Itanium has hardware support for rotating registers.
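A minimal C sketch of the rotation for a reuse distance of 3 follows. The function name and load counter are hypothetical. Unlike the slide, this sketch loads a[i+3] at the top of each iteration, so it never reads past a[n+2]; the array is assumed to hold n + 3 elements.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar-replaced form of: for (i = 0; i < n; i++) a[i] += a[i+3];
   Returns the number of loads performed (illustrative counter). */
size_t add_ahead3_rotating(int *a, size_t n) {
    assert(a != NULL || n == 0);
    if (n == 0)
        return 0;
    /* Prelude: tmpK holds a[i+K] for K = 0..2 at the top of iteration i. */
    int tmp0 = a[0], tmp1 = a[1], tmp2 = a[2];
    size_t loads = 3;
    for (size_t i = 0; i < n; i++) {
        int tmp3 = a[i + 3];    /* the only load in the loop body */
        loads++;
        a[i] = tmp0 + tmp3;
        /* Rotate: each scalar moves one position closer to index i. */
        tmp0 = tmp1;
        tmp1 = tmp2;
        tmp2 = tmp3;
    }
    return loads;
}
```

The copies in the rotation step are what hardware rotating registers would make free.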
7
Control-Flow
for (i=0; i < N; i++)
    if (i & 1)
        a[i] += a[i+3];
9
Availability
y = a[i];
...
if (x) {
    ...
    ... = a[i];   // a[i] is available in y
}
10
Conservative Analysis
if (x) {
    ...
    y = a[i];
}
...
... = a[i];   // reuse y? Only valid if x was true
11
Predicated PRE
flag = false;
if (x) {
    ...
    y = a[i];
    flag = true;
}
...
... = flag ? y : a[i];
Invariant: flag = true ⇒ y = a[i]
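The flag pattern can be made runnable as a small C sketch. The counted_load helper, the load_count counter, and the function name are hypothetical instrumentation added only to make the saved load observable.

```c
#include <assert.h>
#include <stdbool.h>

int load_count = 0;             /* illustrative counter, not part of the technique */

/* Stands in for a real memory load so that loads can be counted. */
int counted_load(const int *p) {
    load_count++;
    return *p;
}

/* Adds a[i] once or twice depending on x, loading a[i] only once. */
int ppre_example(const int *a, int i, bool x) {
    bool flag = false;          /* invariant: flag == true implies y == a[i] */
    int y = 0;
    int acc = 0;
    if (x) {
        y = counted_load(&a[i]);
        acc += y;
        flag = true;
    }
    /* Reuse y when the flag proves it holds a[i]; otherwise load. */
    int z = flag ? y : counted_load(&a[i]);
    return acc + z;
}
```

On either path exactly one load of a[i] executes, whereas the unoptimized code would load twice when x is true.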
13
Scalars and Flags
for (i=0; i < N; i++)
    if (i & 1)
        a[i] += a[i+3];
Invariant (bool ⇒ scalar):
    (valid0 = true) ⇒ tmp0 = a[i+0]
    (valid1 = true) ⇒ tmp1 = a[i+1]
    (valid2 = true) ⇒ tmp2 = a[i+2]
    (valid3 = true) ⇒ tmp3 = a[i+3]
14
Scalar Replacement Algorithm
ld a[i+k]      becomes:   if (!validk) { tmpk = a[i+k]; validk = true; }
               (can be implemented with predication or conditional moves)
st a[i+k], v   becomes:   tmpk = v; validk = true;
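The rewriting rules can be sketched in C for a loop of the form `if (take[i]) a[i] += a[i+3];`. Several things here are assumptions for illustration: the predicate is passed as a hypothetical `take` array so that different control-flow patterns can be tried, the function name and load counter are invented, and stores are written through to memory immediately rather than deferred. The array is assumed to hold n + 3 elements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DIST 3   /* largest offset k used in a[i+k] */

/* Returns the number of loads performed (illustrative counter). */
size_t cond_add_flags(int *a, const bool *take, size_t n) {
    assert(n == 0 || (a != NULL && take != NULL));
    int  tmp[DIST + 1]   = {0};      /* tmp[k] caches a[i+k] when valid[k] */
    bool valid[DIST + 1] = {false};
    size_t loads = 0;

    for (size_t i = 0; i < n; i++) {
        if (take[i]) {
            /* ld a[i+0]: load only if no valid copy exists */
            if (!valid[0]) { tmp[0] = a[i]; valid[0] = true; loads++; }
            /* ld a[i+3] */
            if (!valid[DIST]) { tmp[DIST] = a[i + DIST]; valid[DIST] = true; loads++; }
            /* st a[i+0], v: update the scalar, then write through (assumption) */
            tmp[0] += tmp[DIST];
            valid[0] = true;
            a[i] = tmp[0];
        }
        /* Rotate scalars and flags so slot k tracks a[(i+1)+k]. */
        for (int k = 0; k < DIST; k++) {
            tmp[k] = tmp[k + 1];
            valid[k] = valid[k + 1];
        }
        valid[DIST] = false;
    }
    return loads;
}
```

With every iteration taken and n = 6, this sketch performs 9 loads instead of 12. With the slide's `i & 1` predicate there is no reuse to exploit, and the flags simply add no extra loads.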
15
Optimality
• No scalarized memory location is read or written twice
• The resulting program touches exactly the same memory locations as the original program
• Proof: trivial based on valid flags invariant
[given perfect dependence analysis and enough registers]
16
Additional Details
• Initialize validk to false
• Rotate scalars and valid flags
• Use ‘dirtyk’ flags to avoid extra stores
• Postlude for missing stores:
    if (validk) a[N+k] = tmpk
• Lift loop-invariant accesses (finding loop-invariant predicates)
• Hardware support (for rotating registers and flags)
(see paper)
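One possible reading of the dirty flags and the postlude is sketched below in C for a loop of the form `if (take[i]) { a[i] += 1; a[i+1] += 1; }`, where each location can be stored in two consecutive iterations. The loop, the `take` predicate array, the function name, the store counter, and the choice to write a scalar back when it rotates out of the window are all assumptions made for illustration; the paper's actual scheme may differ. The array is assumed to hold n + 1 elements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns the number of stores performed (illustrative counter). */
size_t bump_pairs_deferred(int *a, const bool *take, size_t n) {
    assert(n == 0 || (a != NULL && take != NULL));
    int  tmp[2]   = {0, 0};          /* tmp[k] caches a[i+k] */
    bool valid[2] = {false, false};  /* initialized to false */
    bool dirty[2] = {false, false};  /* set when tmp[k] is newer than memory */
    size_t stores = 0;

    for (size_t i = 0; i < n; i++) {
        if (take[i]) {
            for (int k = 0; k < 2; k++) {
                if (!valid[k]) { tmp[k] = a[i + k]; valid[k] = true; }
                tmp[k] += 1;         /* the store is deferred */
                dirty[k] = true;
            }
        }
        /* a[i] leaves the window: write it back only if it changed. */
        if (dirty[0]) { a[i] = tmp[0]; stores++; }
        /* Rotate scalars, valid flags, and dirty flags. */
        tmp[0] = tmp[1]; valid[0] = valid[1]; dirty[0] = dirty[1];
        valid[1] = false; dirty[1] = false;
    }
    /* Postlude: slot 0 now tracks a[n]; flush it if still dirty. */
    if (dirty[0]) { a[n] = tmp[0]; stores++; }
    return stores;
}
```

With every iteration taken and n = 4, the sketch performs 5 stores where the original loop performs 8.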
18
Redundant Stores
[Figure: bar chart of % reduction in stores per benchmark. Y-axis: % reduction, 0 to 30, with one off-scale bar labeled 53. Series: % st promo, % st PRE. Benchmarks: adpcm_e, adpcm_d, gsm_e, gsm_d, epic_e, epic_d, mpeg2_e, mpeg2_d, jpeg_e, jpeg_d, pegwit_e, pegwit_d, g721_e, g721_d, pgp_e, pgp_d, rasta, mesa, 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex, 183.equake, 188.ammp, 164.gzip, 175.vpr, 176.gcc, 181.mcf, 197.parser, 254.gap, 300.twolf.]
19
Redundant Loads
[Figure: bar chart of % reduction in loads per benchmark. Y-axis: % reduction, 0 to 45. Series: % ld promo, % ld PRE. Benchmarks: adpcm_e, adpcm_d, gsm_e, gsm_d, epic_e, epic_d, mpeg2_e, mpeg2_d, jpeg_e, jpeg_d, pegwit_e, pegwit_d, g721_e, g721_d, pgp_e, pgp_d, rasta, mesa, 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex, 183.equake, 188.ammp, 164.gzip, 175.vpr, 176.gcc, 181.mcf, 197.parser, 254.gap, 300.twolf.]
20
Performance Impact
[Figure: bar chart of % reduction in running time]
[target: Spatial Computation]
Removed accesses tend to be cache hits: small contribution to running time.
21
Conclusions
• Use predicates to dynamically detect redundant memory accesses
• Simple algorithm gives “optimal” result even with un-analyzable control flow
• Can dramatically reduce memory accesses
22
Related Work
• Carr & Kennedy, PLDI 1990: Scalar Replacement (arrays, no control flow)
• Carr & Kennedy, SPE 1994: Generalized Scalar Replacement (restricted control-flow)
• Morel & Renvoise, CACM 1979: Partial Redundancy Elimination (not across remote iterations)
• Scholz, Europar 2003: Predicated PRE (single iteration, no writes)
• This work, ODES 2005: PPRE across iterations (optimal)
Diagram labels: Non-speculative promotion; Speculative promotion