Spatial Computation
![Page 1: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/1.jpg)
Spatial Computation
Thesis committee: Seth Goldstein
Peter Lee
Todd Mowry
Babak Falsafi
Nevin Heintze
Ph.D. Thesis defense, December 8, 2003
SCS
Mihai Budiu, CMU CS
![Page 2: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/2.jpg)
A model of general-purpose computation based on Application-Specific Hardware.
![Page 3: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/3.jpg)
Thesis Statement
Application-Specific Hardware (ASH):
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, with very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
![Page 4: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/4.jpg)
Outline
• Introduction
• Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
![Page 5: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/5.jpg)
CPU Problems
• Complexity
• Power
• Global Signals
• Limited ILP
![Page 6: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/6.jpg)
Design Complexity
[Chart, from Michael Flynn’s FCRC 2003 talk (source: S. Malik, orig. Sematech): logic transistors per chip (chip size, K transistors) and design productivity (transistors per staff-month) on log scales, 1981 to 2009, with growth-rate annotations of 58%/year and 21%/year and labels 2.5, .35, .10.]
Design Time: CAD productivity favors FPL
![Page 7: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/7.jpg)
Communication vs. Computation
Gate delay: 5ps; wire delay: 20ps
Power consumption on wires is also dominant
![Page 8: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/8.jpg)
Our Approach: ASH
Application-Specific Hardware
![Page 9: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/9.jpg)
Resource Binding Time
[Diagram: CPU vs. ASH, each with numbered steps 1 and 2; “Programs” is step 2 in one column and step 1 in the other.]
![Page 10: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/10.jpg)
Hardware Interface
[Diagram: CPU vs. ASH. On the CPU, the ISA sits between software and hardware. On ASH, a virtual ISA sits between software and hardware (gates).]
![Page 11: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/11.jpg)
Application-Specific Hardware
C program → Compiler → Dataflow IR → Reconfigurable/custom hw
![Page 12: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/12.jpg)
Contributions
[Diagram: research areas placed along an axis from theory to systems.]
Compilation
Computer architecture
Reconfigurable computing
Embedded systems
Asynchronous circuits
High-level synthesis
Dataflow machines
Nanotechnology
![Page 13: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/13.jpg)
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
![Page 14: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/14.jpg)
Computation = Dataflow
• Operations ⇒ functional units
• Variables ⇒ wires
• No interpretation
Programs:
x = a & 7;
...
y = x >> 2;
Circuits:
[Diagram: a and 7 feed an & unit; its output wire x and the constant 2 feed a >> unit.]
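The mapping on this slide can be mimicked in software. The sketch below is only an illustration, not the CASH compiler's output: each operation gets its own "functional unit" function, and the variable x exists only as the value passed from one unit to the next. The names `and_unit`, `shr_unit`, and `circuit` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Each operation in the program becomes its own functional unit. */
static uint32_t and_unit(uint32_t lhs, uint32_t rhs)
{
    return lhs & rhs;
}

static uint32_t shr_unit(uint32_t value, uint32_t amount)
{
    assert(amount < 32);   /* shifting by 32 or more is undefined in C */
    return value >> amount;
}

/* The "circuit" for:  x = a & 7;  y = x >> 2;
 * The variable x is just the wire from the & unit to the >> unit. */
uint32_t circuit(uint32_t a)
{
    uint32_t x_wire = and_unit(a, 7u);
    return shr_unit(x_wire, 2u);
}
```

For example, `circuit(13)` computes 13 & 7 = 5 and then 5 >> 2 = 1. There is no instruction fetch or decode step in this picture, which is the sense of "no interpretation".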
![Page 15: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/15.jpg)
Basic Operation
[Diagram: a + unit with an output latch; signals: data, valid, ack.]
![Page 16: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/16.jpg)
Asynchronous Computation
[Diagram, animated in 8 steps: + units with output latches exchanging data, valid, and ack signals.]
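A software model can make the data/valid/ack handshake concrete. The sketch below is an assumption-laden simplification, not the circuit from the talk: a `Latch` holds one value plus a valid bit, and clearing an input's valid bit stands in for the ack signal. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One output latch with its handshake state (hypothetical model). */
typedef struct {
    int  data;
    bool valid;   /* producer to consumer: data is ready */
} Latch;

/* An adder fires only when both inputs are valid and its output latch is free.
 * Firing consumes the inputs; clearing their valid bits plays the role of ack. */
bool adder_fire(Latch *in_a, Latch *in_b, Latch *out)
{
    assert(in_a != NULL && in_b != NULL && out != NULL);
    if (!in_a->valid || !in_b->valid || out->valid)
        return false;              /* wait: input missing or output not yet consumed */
    out->data   = in_a->data + in_b->data;
    out->valid  = true;            /* raise valid towards the consumer */
    in_a->valid = false;           /* ack: producers may send new values */
    in_b->valid = false;
    return true;
}
```

In this model an adder stalls while its previous result is still sitting in the output latch, so back-pressure propagates from consumers to producers without any global clock or controller.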
![Page 17: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/17.jpg)
Distributed Control Logic
[Diagram: + and - units, each with a small FSM exchanging rdy and ack signals.]
• asynchronous control
• short, local wires
![Page 18: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/18.jpg)
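Forward Branches
if (x > 0) y = -x;
else y = b*x;
[Diagram: x feeds a negation (-) and a comparison with 0 (>); b and x feed a multiplier (*); the comparison result and its complement (!) select which value becomes y. The critical path is marked.]
Conditionals ⇒ Speculation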
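In software terms, turning a forward branch into speculation looks roughly like the sketch below: both arms are evaluated unconditionally and the predicate only selects a result. This is a hand-written illustration of the idea, with a hypothetical function name, and it is safe only because neither arm has side-effects.

```c
/* if (x > 0) y = -x; else y = b*x;  evaluated without a branch.
 * Assumes the arithmetic does not overflow. */
int forward_branch(int x, int b)
{
    int pred     = x > 0;    /* comparison unit */
    int then_val = -x;       /* both sides execute speculatively ... */
    int else_val = b * x;
    return pred ? then_val : else_val;   /* ... and the predicate selects one */
}
```

The result is ready only after the slowest of the three computations finishes, which is why the multiplier can end up on the critical path even when the cheap negation is the arm that is selected.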
![Page 19: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/19.jpg)
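Control Flow ⇒ Data Flow
[Diagram: three node types. Merge (label) joins data inputs; Gateway passes data under a predicate; Split (branch) routes data according to p and its complement (!).]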
![Page 20: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/20.jpg)
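Loops
int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;
[Diagram: the loop as a dataflow graph. i starts at 0, is incremented (+1) and compared (< 100); sum starts at 0 and accumulates i*i (*, +); when the loop predicate is false (!), sum flows to ret.]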
![Page 21: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/21.jpg)
Predication and Side-Effects
[Diagram: a Load node with inputs addr, pred, and token, and outputs data and token; it is connected to memory.]
• pred: no speculation
• token: sequencing of side-effects
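One way to picture a predicated load with a token input is the sketch below. It is a software analogy under stated assumptions: the token is modeled as a counter that is passed along whether or not the load executes, and `LoadResult` and `predicated_load` are hypothetical names, not part of the thesis.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int      data;
    unsigned token;   /* passed on to the next side-effecting operation */
} LoadResult;

/* Predicated load: memory is touched only when pred is true (no speculation).
 * The token is always forwarded so that later memory operations stay ordered. */
LoadResult predicated_load(const int *addr, bool pred, unsigned token_in)
{
    LoadResult r = { 0, token_in + 1u };
    if (pred) {
        assert(addr != NULL);
        r.data = *addr;
    }
    return r;
}
```

When `pred` is false the address is never dereferenced, so an invalid pointer on a not-taken path is harmless, yet the token still advances and downstream operations are not blocked.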
![Page 22: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/22.jpg)
Thesis Statement
Application-Specific Hardware:
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, with very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
![Page 23: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/23.jpg)
Outline
• Introduction
• CASH: Compiling for ASH
– An optimization on the SIDE
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
![Page 24: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/24.jpg)
Availability Dataflow Analysis
y = a*b;
...
if (x) {
    ...
    ... = a*b;
}
[Annotation: the second a*b is replaced by y.]
![Page 25: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/25.jpg)
Dataflow Analysis Is Conservative
if (x) {
    ...
    y = a*b;
}
...
... = a*b;
[Annotation on the last a*b: y?]
![Page 26: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/26.jpg)
Static Instantiation, Dynamic Evaluation
flag = false;
if (x) {
    ...
    y = a*b;
    flag = true;
}
...
... = flag ? y : a*b;
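The effect of the flag can be checked with a small experiment. The sketch below wraps the multiplication in a counting helper and compares the untransformed code with the flag-based version; `mul`, `mul_count`, `original`, and `with_side` are names invented for this illustration, and adding y to the final use is just a convenient way to return both values.

```c
#include <stdbool.h>

static int mul_count;                       /* counts dynamic multiplications */

static int mul(int a, int b)
{
    mul_count++;
    return a * b;
}

/* Untransformed: a*b is computed twice whenever x is true. */
int original(bool x, int a, int b)
{
    int y = 0;
    if (x)
        y = mul(a, b);
    return y + mul(a, b);
}

/* Flag-based version: a flag records at run time whether y already holds a*b. */
int with_side(bool x, int a, int b)
{
    int  y    = 0;
    bool flag = false;
    if (x) {
        y    = mul(a, b);
        flag = true;
    }
    return y + (flag ? y : mul(a, b));
}
```

Both functions return the same value, but `with_side` performs exactly one multiplication on every path, whereas `original` performs two when x is true. The redundancy is removed dynamically even though a static availability analysis cannot prove it.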
![Page 27: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/27.jpg)
SIDE Register Promotion Impact
[Two bar charts of % reduction. Stores: series %st promo and %st PRE, axis 0 to 30, with one off-scale bar labeled 53. Loads: series % ld promo and % ld PRE, axis 0 to 45. Benchmarks: adpcm_e, adpcm_d, gsm_e, gsm_d, epic_e, epic_d, mpeg2_e, mpeg2_d, jpeg_e, jpeg_d, pegwit_e, pegwit_d, g721_e, g721_d, pgp_e, pgp_d, rasta, mesa, 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex, 183.equake, 188.ammp, 164.gzip, 175.vpr, 176.gcc, 181.mcf, 197.parser, 254.gap, 300.twolf.]
![Page 28: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/28.jpg)
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
![Page 29: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/29.jpg)
Performance Evaluation
[Diagram: ASH accesses memory through an LSQ with limited BW, backed by L1 (8K), L2 (1/4M), and Mem.]
CPU: 4-way OOO
Assumption: all operations have the same latency.
![Page 30: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/30.jpg)
Media Kernels, vs 4-way OOO
[Bar chart: times faster, axis 0 to 3, for adpcm_d, adpcm_e, epic_d, epic_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mesa, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, rasta. Bars exceeding the axis carry value labels (extracted as “125.85.8”).]
![Page 31: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/31.jpg)
Media Kernels, IPC
[Bar chart: Base IPC and ASH IPC, axis 0 to 25, with a marker at 4, for adpcm_d, adpcm_e, epic_d, epic_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mesa, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, rasta.]
![Page 32: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/32.jpg)
Speed-up IPC Correlation
[Bar chart: Speed-up and IPC Ratio (times bigger), axis 0 to 10, with one off-scale bar labeled 12, for adpcm_d, adpcm_e, epic_d, epic_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mesa, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, rasta.]
![Page 33: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/33.jpg)
Low-Level Evaluation
[Tool flow: C → CASH core → Verilog back-end → Synopsys, Cadence P/R → ASIC]
Annotations on two points of the flow: “Results shown so far. All results in thesis.” and “Results in the next two slides.”
180nm std. cell library, 2V (~1999 technology)
![Page 34: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/34.jpg)
Area
[Bar chart: area in square mm, axis 0 to 12, for adpcm_d, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e.]
Reference: P4 in 180nm has 217 mm²
![Page 35: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/35.jpg)
Power
vs 4-way OOO superscalar, 600 MHz, with clock gating (Wattch), ~6W
[Bar chart: times smaller than OOO, axis 0 to 350.]
Power ratio: adpcm_d 70, g721_d 41, g721_e 41, gsm_d 129, gsm_e 147, jpeg_d 94, mpeg2_d 121, mpeg2_e 136, pegwit_d 303, pegwit_e 303
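To get a feel for these ratios in absolute terms, one can divide the baseline power by each ratio. The sketch below assumes, purely for illustration, that the "~6W" baseline is exactly 6 W and applies uniformly to every kernel; the slide gives only the approximate figure, and `ash_power_mw` is a hypothetical helper.

```c
#include <assert.h>

#define BASELINE_WATTS 6.0   /* "~6W" from the slide, treated as exact here */

/* Estimated ASH power in milliwatts for a given "times smaller than OOO" ratio. */
double ash_power_mw(double power_ratio)
{
    assert(power_ratio > 0.0);
    return BASELINE_WATTS * 1000.0 / power_ratio;
}
```

Under that assumption a ratio of 303 (pegwit) corresponds to roughly 19.8 mW, and the smallest ratio, 41 (g721), to roughly 146 mW.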
![Page 36: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/36.jpg)
Thesis Statement
Application-Specific Hardware:
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, with very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
![Page 37: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/37.jpg)
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
– dataflow pipelining
• ASH vs. superscalar processors
• Conclusions
![Page 38: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/38.jpg)
Pipelining
int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;
[Diagram, cycle=1: the i loop (+ 1, <= 100) feeds a pipelined multiplier (8 stages), whose output feeds the sum loop (+).]
![Page 39: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/39.jpg)
Pipelining
[Pipelining diagram (i loop, multiplier, sum loop), cycle=2.]
![Page 40: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/40.jpg)
Pipelining
[Pipelining diagram (i loop, multiplier, sum loop), cycle=3.]
![Page 41: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/41.jpg)
Pipelining
[Pipelining diagram (i loop, multiplier, sum loop), cycle=4.]
![Page 42: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/42.jpg)
Pipelining
[Pipelining diagram, cycle=5: the values for i=0 and i=1 are in flight in the multiplier.]
pipeline balancing
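The benefit of overlapping iterations in the multiplier can be estimated with a cycle-by-cycle model. The sketch below is an idealized simulation, not the circuit animated on these slides: it assumes the i loop can issue a new value every cycle and the sum loop can absorb one product per cycle, which is the best case that pipeline balancing aims for. `MUL_STAGES`, `N`, and `pipelined_sum` are names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

#define MUL_STAGES 8     /* "pipelined multiplier (8 stages)" */
#define N 100            /* loop trip count from the slide's code */

/* Idealized model of: for (i = 0; i < N; i++) sum += i*i;
 * Returns the final sum and stores the number of simulated cycles in *cycles. */
int pipelined_sum(int *cycles)
{
    int pipe[MUL_STAGES] = {0};   /* products in flight, one per stage */
    int full[MUL_STAGES] = {0};   /* 1 if the stage currently holds a product */
    int i = 0, sum = 0, done = 0, t = 0;

    assert(cycles != NULL);
    while (done < N) {
        t++;
        /* The last stage delivers its product to the adder of the sum loop. */
        if (full[MUL_STAGES - 1]) {
            sum += pipe[MUL_STAGES - 1];
            done++;
        }
        /* Advance every product by one stage. */
        for (int s = MUL_STAGES - 1; s > 0; s--) {
            pipe[s] = pipe[s - 1];
            full[s] = full[s - 1];
        }
        /* The i loop issues the next multiplication, if any remain. */
        full[0] = (i < N);
        if (i < N) {
            pipe[0] = i * i;
            i++;
        }
    }
    *cycles = t;
    return sum;
}
```

With these assumptions the model finishes in 108 cycles (100 issues plus the 8-stage latency) and returns 328350. The animation above shows a slower issue rate, since only two values are in the multiplier at cycle 5, and the label “pipeline balancing” points to how that gap is addressed.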
![Page 43: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/43.jpg)
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
![Page 44: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/44.jpg)
This Is Obvious!
ASH runs at full dataflow speed, so a CPU cannot do any better (if the compilers are equally good).
![Page 45: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/45.jpg)
SpecInt95, ASH vs 4-way OOO
[Bar chart: percent slower / faster, axis -50 to 30, for 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex.]
![Page 46: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/46.jpg)
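Branch Prediction
for (i=0; i < N; i++) {
    ...
    if (exception) break;
}
• Exception branch: predicted not taken. Effectively a noop for CPU!
• Loop branch: predicted taken.
[Diagram: dataflow loop for i (+ 1, <) whose predicate is combined (&) with !exception; the ASH crit path and the CPU crit path are marked.]
result available before inputs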
![Page 47: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/47.jpg)
SpecInt95, perfect prediction
[Bar chart: percent slower/faster, axis -60 to 60, series baseline and prediction, for 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex; one entry is marked “no data”.]
![Page 48: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/48.jpg)
ASH Problems
• Both branch and join not free
• Static dataflow (no re-issue of same instr)
• Memory is “far”
• Fully static
– No branch prediction
– No dynamic unrolling
– No register renaming
• Calls/returns not lenient
• ...
![Page 49: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/49.jpg)
Thesis Statement
Application-Specific Hardware:
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, with very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
![Page 50: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/50.jpg)
Outline
Introduction
+ CASH: Compiling for ASH
+ Media processing on ASH
+ ASH vs. superscalar processors
= Conclusions
![Page 51: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/51.jpg)
Strengths
• low power
• simple verification?
• specialized to app.
• unlimited ILP
• simple hardware
• no fixed window
• economies of scale
• highly optimized
• branch prediction
• control speculation
• full-dataflow
• global signals/decision
![Page 52: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/52.jpg)
Conclusions
• Compiling “around the ISA” is a fruitful research approach.
• Distributed computation structures require more synchronization overhead.
• Spatial Computation efficiently implements high-ILP computation with very low power.
![Page 53: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/53.jpg)
Backup Slides
• Control logic
• Pipeline balancing
• Lenient execution
• Dynamic Critical Path
• Memory PRE
• Critical path analysis
• CPU + ASH
![Page 54: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/54.jpg)
Control Logic
[Diagram: a register (Reg) with two elements labeled C; signals: rdy_in, ack_in, rdy_out, ack_out, data_in, data_out.]
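If the boxes labeled C are Muller C-elements, which is the usual meaning of that symbol in asynchronous design but is an assumption here, their behavior can be modeled in a few lines. The function name `c_element` is hypothetical.

```c
#include <stdbool.h>

/* Muller C-element: the output follows the inputs when they agree
 * and holds its previous value when they differ. */
bool c_element(bool a, bool b, bool previous)
{
    return (a && b) || (previous && (a || b));
}
```

This hold-until-both-agree behavior is what lets a handshake wait for two events, such as incoming data being ready and the previous result having been acknowledged, before the register captures a new value.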
![Page 55: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/55.jpg)
Last-Arrival Events
[Diagram: a + unit with data, valid, and ack signals.]
• Event enabling the generation of a result
• May be an ack
• Critical path = collection of last-arrival edges
![Page 56: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/56.jpg)
Dynamic Critical Path
1. Start from last node
2. Trace back along last-arrival edges
3. Some edges may repeat
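The trace-back procedure can be sketched over a recorded execution trace. The code below assumes a hypothetical trace format in which every dynamic event stores the index of the event whose signal arrived last and enabled it (or -1 for a source), and in which that index is always smaller than the event's own index. `Event` and `critical_path` are illustrative names, not taken from the thesis tools.

```c
#include <assert.h>
#include <stddef.h>

/* One dynamic event (an operation firing). last_arrival is the index of the
 * event whose signal arrived last and enabled this one, or -1 for a source. */
typedef struct {
    int node_id;
    int last_arrival;
} Event;

/* Walk back from the final event along last-arrival edges.
 * Writes the node ids on the path (last to first) into path[]
 * and returns the path length. */
size_t critical_path(const Event *events, int last_event,
                     int *path, size_t capacity)
{
    size_t len = 0;
    for (int e = last_event; e >= 0; e = events[e].last_arrival) {
        assert(len < capacity);
        path[len++] = events[e].node_id;
    }
    return len;
}
```

Because the walk is over dynamic events rather than static nodes, the same node, and therefore the same static edge, can show up several times on the path, for example once per loop iteration.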
![Page 57: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/57.jpg)
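Critical Paths
if (x > 0) y = -x;
else y = b*x;
[Diagram: x feeds a negation (-) and a comparison with 0 (>); b and x feed a multiplier (*); the comparison result and its complement (!) select which value becomes y.]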
![Page 58: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/58.jpg)
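Lenient Operations
if (x > 0) y = -x;
else y = b*x;
[Diagram: x feeds a negation (-) and a comparison with 0 (>); b and x feed a multiplier (*); the comparison result and its complement (!) select which value becomes y.]
Solve the problem of unbalanced paths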
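One reading of leniency, offered here as an assumption since the slide does not define it, is that an operation may produce its output as soon as the inputs that determine it have arrived. The sketch below models a lenient selection for the example above; `Signal` and `lenient_select` are hypothetical names.

```c
#include <stdbool.h>

typedef struct {
    int  value;
    bool ready;   /* true once the value has arrived */
} Signal;

/* Lenient 2-way selection for: y = pred ? then_in : else_in.
 * A strict version would wait for all three inputs; this one produces y
 * as soon as the predicate and the selected input are ready. */
Signal lenient_select(Signal pred, Signal then_in, Signal else_in)
{
    Signal out = { 0, false };
    if (!pred.ready)
        return out;
    Signal chosen = pred.value ? then_in : else_in;
    if (chosen.ready) {
        out.value = chosen.value;
        out.ready = true;
    }
    return out;
}
```

For x > 0 the result -x can be emitted before the slow multiplication b*x has finished, so the long path no longer delays the short one.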
![Page 59: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/59.jpg)
Pipelining
[Pipelining diagram, cycle=6: the values for i=0 and i=1 are in flight in the multiplier.]
![Page 60: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/60.jpg)
Pipelining
[Pipelining diagram, cycle=7, annotated: i’s loop, sum’s loop, long latency pipe, predicate.]
![Page 61: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/61.jpg)
Pipelining
[Pipelining diagram with i’s loop, sum’s loop, and the critical path marked.]
Predicate ack edge is on the critical path.
![Page 62: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/62.jpg)
Pipeline balancing
[Pipelining diagram, cycle=7, with a decoupling FIFO added; i’s loop and sum’s loop are marked.]
![Page 63: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/63.jpg)
Pipeline balancing
[Pipelining diagram with a decoupling FIFO; i’s loop, sum’s loop, and the critical path are marked.]
![Page 64: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/64.jpg)
Register Promotion
Before: *p=… (p1) followed by …=*p (p2)
After: *p=… (p1) followed by …=*p (p2 ∧ ¬p1)
Load is executed only if store is not
![Page 65: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/65.jpg)
Register Promotion (2)
Before: *p=… (p1) followed by …=*p (p2)
After: *p=… (p1) followed by …=*p (false)
• When p2 ⇒ p1 the load becomes dead...
• ...i.e., when store dominates load in CFG
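The predicate rewrite for a store followed by a load of the same address can be tried out in plain C. In the sketch below, which uses hypothetical names and a load counter added only for observation, the load runs under p2 ∧ ¬p1 and the stored value is forwarded whenever the store executed.

```c
#include <stdbool.h>

/* Predicate of the load after promotion: p2 AND NOT p1. */
bool load_pred_after(bool p1_store, bool p2_load)
{
    return p2_load && !p1_store;
}

/* Models:  *p = v (p1);  r = *p (p2)  with the stored value forwarded
 * to the load. *loads_executed counts real memory loads. */
int store_then_load(int *p, int v, bool p1, bool p2, int *loads_executed)
{
    int r = 0;
    if (p1)
        *p = v;                          /* store, predicate p1 */
    if (load_pred_after(p1, p2)) {
        r = *p;                          /* real load, only if the store did not run */
        (*loads_executed)++;
    } else if (p2) {
        r = v;                           /* value forwarded from the store */
    }
    return r;
}
```

When p2 implies p1, `load_pred_after` is false for every reachable combination of predicates, so the memory load never executes. That is the case in which the load is dead.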
![Page 66: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/66.jpg)
≈ PRE
...=*p (p1) and ...=*p (p2) become ...=*p (p1 ∨ p2)
This corresponds in the CFG to lifting the load to a basic block dominating the original loads
![Page 67: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/67.jpg)
Store-store (1)
Before: *p=… (p1) followed by *p=... (p2)
After: *p=… (p1 ∧ ¬p2) followed by *p=... (p2)
• When p1 ⇒ p2 the first store becomes dead...
• ...i.e., when second store post-dominates first in CFG
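The store-store rewrite can be modeled the same way. The sketch below assumes nothing reads *p between the two stores, and returns the number of stores actually executed so the saving is visible; `store_store` is a hypothetical name.

```c
#include <stdbool.h>

/* Models:  *p = a (p1);  *p = b (p2)
 * after narrowing the first store's predicate to p1 AND NOT p2. */
int store_store(int *p, int a, int b, bool p1, bool p2)
{
    int stores = 0;
    if (p1 && !p2) {        /* first store, narrowed predicate */
        *p = a;
        stores++;
    }
    if (p2) {               /* second store, unchanged */
        *p = b;
        stores++;
    }
    return stores;          /* number of stores actually executed */
}
```

The final memory contents match the original two-store sequence for every combination of p1 and p2, but at most one store executes.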
![Page 68: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/68.jpg)
Store-store (2)
Before: *p=… (p1) followed by *p=... (p2)
After: *p=… (p1 ∧ ¬p2) followed by *p=... (p2)
• Token edge eliminated, but...
• ...transitive closure of tokens preserved
![Page 69: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/69.jpg)
A Code Fragment
for (i = 0; i < 64; i++) {
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
    Y[i] = X[j].q;
}
SpecINT95:124.m88ksim:init_processor, stylized
![Page 70: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/70.jpg)
Dynamic Critical Path
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
[Dataflow graph of the loop with the dynamic critical path marked; labels: load predicate, loop predicate, sizeof(X[j]), definition.]
![Page 71: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/71.jpg)
MIPS gcc Code
LOOP:
L1: beq $v0,$a1,EXIT ; X[j].r == i
L2: addiu $v1,$v1,20 ; &X[j+1].r
L3: lw $v0,0($v1) ; X[j+1].r
L4: addiu $a0,$a0,1 ; j++
L5: bne $v0,$a3,LOOP ; X[j+1].r != 0xF
EXIT:
L1 → L2 → L3 → L5 → L1
4-instructions loop-carried dependence
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
![Page 72: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/72.jpg)
If Branch Prediction Correct
L1 → L2 → L3 → L5 → L1
Superscalar is issue-limited!
2 cycles/iteration sustained
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
LOOP:
L1: beq $v0,$a1,EXIT ; X[j].r == i
L2: addiu $v1,$v1,20 ; &X[j+1].r
L3: lw $v0,0($v1) ; X[j+1].r
L4: addiu $a0,$a0,1 ; j++
L5: bne $v0,$a3,LOOP ; X[j+1].r != 0xF
EXIT:
![Page 73: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/73.jpg)
Critical Path with Prediction
Loads are not speculative
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
![Page 74: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/74.jpg)
Prediction + Load Speculation
~4 cycles!
Load not pipelined (self-anti-dependence)
[Dataflow graph of the loop with the critical path marked; it includes an ack edge.]
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
![Page 75: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/75.jpg)
OOO Pipe Snapshot
[Diagram: pipeline stages IF, DA, EX, WB, CT holding the instruction groups (L5 L1 L2), (L1 L2 L3 L4), (L1 L3), (L5 L3 L2), (L1 L3 L3); annotated “register renaming”.]
LOOP:
L1: beq $v0,$a1,EXIT ; X[j].r == i
L2: addiu $v1,$v1,20 ; &X[j+1].r
L3: lw $v0,0($v1) ; X[j+1].r
L4: addiu $a0,$a0,1 ; j++
L5: bne $v0,$a3,LOOP ; X[j+1].r != 0xF
EXIT:
![Page 76: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/76.jpg)
Unrolling?
for (i = 0; i < 64; i++) {
    for (j = 0; X[j].r != 0xF; j+=2) {
        if (X[j].r == i)
            break;
        if (X[j+1].r == 0xF)
            break;
        if (X[j+1].r == i)
            break;
    }
    Y[i] = X[j].q;
}
when 1 iteration
![Page 77: Spatial Computation](https://reader036.fdocuments.in/reader036/viewer/2022062309/56813e2c550346895da80bd6/html5/thumbnails/77.jpg)
Ideal Architecture
[Diagram: a CPU (low-ILP computation + OS + VM) and ASH (high-ILP computation) sharing Memory.]