Register Pressure in Instruction Level Parallelism
-
Upload
price-walters -
Category
Documents
-
view
53 -
download
0
description
Transcript of Register Pressure in Instruction Level Parallelism
![Page 1: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/1.jpg)
Register Pressure in Instruction Level Parallelism
TOUATI Sid-Ahmed-Ali
![Page 2: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/2.jpg)
19/04/23 Thesis defense 2
Outline
Prologue
Part one : Basic Blocks
Part two : Simple Innermost Loops
Epilogue
![Page 3: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/3.jpg)
19/04/23 Thesis defense 3
Memory BottleneckSpec2000 Performance Statistics
0
0,5
1
1,5
2
2,5
3
3,5
IPC
Real
Perfect L2
Perfect Mem
From [Lin et al 01], in HPCA 2001. Simulated performance on an Alpha 21364 processor (1.6Ghz). Recent Compaq compiler (peak-optimization compiler flags).
![Page 4: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/4.jpg)
19/04/23 Thesis defense 4
Solutions by Software
To avoidTo avoid
Using registers
To tolerateTo tolerate
ILP/TLP Software prefetching
![Page 5: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/5.jpg)
19/04/23 Thesis defense 5
Combined Complex Pass
Original Graph
Scheduling +Register Allocation
Scheduling +Register Allocation
We do not advocate this method
![Page 6: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/6.jpg)
19/04/23 Thesis defense 6
Why Decoupling ?
Memory wallMemory bottleneck >> ILP enhancementUseless spilling Early RA still generates faster codes [Freudenberg et al 92, Brasier et al 95, Janssen 01].
Register constraints are more genericHeterogeneous, complex resource constraints.Registers : types and number of available registers.
Complexity of register constraintsWhile there is always a schedule for a DDG under any resource constraints, it is not the case with limited number of registers (spilling is sometimes unavoidable).
![Page 7: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/7.jpg)
19/04/23 Thesis defense 7
Our chart
I. Performance enhancement priority to registers against ILP scheduling, but the former must respect the latter (if possible).
II. Portability No major re-writing of compilers (investment cost).
Generic processor : meet most of existing ILP processors.
![Page 8: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/8.jpg)
19/04/23 Thesis defense 8
Our Generic ILP processor
Non restriction on ILP degree.Infinite parallelism (infinite resources).
Multiple register types.A statement may produces multiple values, but with distinct types.
Visible delays in reading from and writing into registers.
A register is not occupied until the result is available (later after the issue time in the pipeline).
![Page 9: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/9.jpg)
19/04/23 Thesis defense 9
First Strategy : Register Pressure Management
Original DDG
Register Pressure ManagementRegister Pressure Management
Modified DDG
Register Constraints
Code SchedulingCode SchedulingRegister AllocationRegister Allocation
MinimizeCritical Path
Increase
![Page 10: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/10.jpg)
19/04/23 Thesis defense 10
Register Saturation and Sufficiency
RS
RS
RS
RF
RF
RFR
Add arcs
Spilling
![Page 11: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/11.jpg)
19/04/23 Thesis defense 11
Second Strategy : Schedule Independent Register Allocation
Original DDG
Early Register AllocationEarly Register Allocation
Allocated DDG
Register Constraints
Code SchedulingCode Scheduling
MinimizeCritical Path
Increase
![Page 12: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/12.jpg)
19/04/23 Thesis defense 12
Part I : Basic Blocks
1. Register Requirement (Use)
2. Register Saturation
3. Register Sufficiency
4. Local Schedule Independent Register Allocation
5. Related Work
6. Conclusion
![Page 13: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/13.jpg)
19/04/23 Thesis defense 13
Local Register Requirement
1
2
34
5
6
7
8
9
10
11
12
+
x
+
++
+
+
st
ld
+
st
x
++
++
+
+
ld
![Page 14: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/14.jpg)
19/04/23 Thesis defense 14
Without Assuming a Schedule…
Value lifetime intervals are not definedRegister Requirement not defined.
Two notions in this case:Register Saturation per register type (max RR)
Guarantees that registers do not constraint the ILP scheduling.
Register Sufficiency per register type (min RR)Prevents from obsolete spilling.
![Page 15: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/15.jpg)
19/04/23 Thesis defense 15
Computing Register Saturation
Given a DAG, compute the exact maximal register requirement for all valid schedules.
NP-complete problem [Touati 00].Optimal method (integer linear programming).
Algorithmic heuristics.
![Page 16: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/16.jpg)
19/04/23 Thesis defense 16
Integer Programming Techniques
We use binary variables for expressing disjunction, implication, equivalence and max operator.
Disjunction : the domain set of the variables must be bounded.
1,0
)1()(
.)(
0)(
0)(
gxg
fxf
xg
xf
![Page 17: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/17.jpg)
19/04/23 Thesis defense 17
Integer Programming Techniques
Implication
Equivalence
Max
0)(
0)(
0)(0)(
xg
xf
xgxf
0)(
0)(
0)(
0)(0)(0)(
xg
xf
xg
xfxgxf
yzxy
xzyxyxz ),max(
![Page 18: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/18.jpg)
19/04/23 Thesis defense 18
Optimal RS Computation
Scheduling constraints :e=(u,v) : v - u (e)
Killing dates : kill(ut)=max(v+r(v)), v reads ut
Interferences :St
u,v =1 (kill(u)def(v) kill(v) def(u))
Maximal clique = independent set in the complementary graph:
Stu,v
= 0 xut + xv
t 1
Objective function = maximize (independent set)Maximize xu
t
At most O(n2) variables and O(n2+m) constraints.
![Page 19: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/19.jpg)
19/04/23 Thesis defense 19
Problem Formulation with Graphs
RS computation chose a unique killer for each value.
Computing a killing function that associates a unique killer to each value.
Two constraints :The killing function must not introduce a circuit in the DAG.
The killing function must maximize the register requirement.
![Page 20: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/20.jpg)
19/04/23 Thesis defense 20
Killing Function...
++
x
+
++
+
+
st
ld
++
x
+
++
+
+
ld
Killing function Disjoint Value DAG : interval order
![Page 21: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/21.jpg)
19/04/23 Thesis defense 21
Register Saturation Problem
Find a valid killing function such that the maximal antichain in the disjoint value DAG is maximal among other killing functions.
NP-complete Problem.
Polynomial heuristics.
![Page 22: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/22.jpg)
19/04/23 Thesis defense 22
Our Heuristics (Greedy-k)
Decompose the potential killing graph into connected bipartite components
cb=(S, T, Ecb)
Find a Saturating Killing Set: maximize the parallel values with S (minimize the number of arcs in the disjoint value DAG).
![Page 23: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/23.jpg)
19/04/23 Thesis defense 23
Saturating Killing Set
S
T
Descendant values
S
T’
Descendant values
T-T’
Descendant values
![Page 24: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/24.jpg)
19/04/23 Thesis defense 24
Greedy-k versus Optimal RS
Benchmarks : 27 loops from Spec-FP-95, whetstone, livermore.
DAGs=unrolled loops.
134 experimented DAGs (#nodes up to 120).
Maximal difference empirical difference between optimal RS and approximated RS* by Greedy-k is 1 FP register (5% of DAGs).
![Page 25: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/25.jpg)
19/04/23 Thesis defense 25
Representative RS Behaviour
RS vs. loop unrolling
0
5
10
15
20
25
30
35
40
45
unroll
RS
whet-loop1
whet-loop2
whet-loop3
![Page 26: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/26.jpg)
19/04/23 Thesis defense 26
Reducing Register Saturation
Problem : does there exist an extended DDG G’ from G such that RS(G’)R and Critical Path P ?
NP-hard problem [Touati 01]Optimal solution with integer programming.
Algorithmic heuristics.
![Page 27: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/27.jpg)
19/04/23 Thesis defense 27
Optimal RS Reduction
The problem is equivalent to computing a schedule that does not require more than R registers (NP-complete), while the total schedule time is P.
Given such schedule, we report arcs into G so as to guarantee the same interval order as defined by .
![Page 28: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/28.jpg)
19/04/23 Thesis defense 28
Integer Program for RS Reduction
We bound the register requirement for each register type, and the total schedule time:
xut Rt
P
The objective function maximizes the RR of a considered register type:
Maximize xut
At most O(n2) variables and O(n2+m) constraints
![Page 29: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/29.jpg)
19/04/23 Thesis defense 29
Algorithmic Heuristics for RS Reduction
Serialize Saturating Values lifetime intervals.
Do not extend the critical path if possible.
![Page 30: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/30.jpg)
19/04/23 Thesis defense 30
Interval Serialization
++
x
+
++
+
+
st
ld
r-w
DAGExtended
![Page 31: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/31.jpg)
19/04/23 Thesis defense 31
Experiments (RS Reduction)
Optimal versus approximatedLoops were unrolled till 4, #nodes up to 80.
We parameterise R (#available registers) as 1, next power of 2, and 32.
Maximal empirical error is two registers.
![Page 32: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/32.jpg)
19/04/23 Thesis defense 32
Experiments (RS reduction)
RS reduction (R=32)
0
10
20
30
40
50
60
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
unroll
red
uce
d R
S
spec-spice-loo7
spec-fp-loop1
liv-loop23
![Page 33: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/33.jpg)
19/04/23 Thesis defense 33
Experiments (ILP loss)
RS reduction and ILP loss
71%
19%
5%
1%
4%
RS=RS* ILP=ILP*
RS=RS* ILP<ILP*
RS>RS* ILP=ILP*
RS>RS* ILP<ILP*
RS>RS* ILP>ILP*
![Page 34: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/34.jpg)
19/04/23 Thesis defense 34
Part I : Basic Blocks
1. Register Requirement (Use)
2. Register Saturation
3. Register Sufficiency
4. Local Schedule Independent Register Allocation
5. Related Work
6. Conclusion
![Page 35: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/35.jpg)
19/04/23 Thesis defense 35
Computing Register Sufficiency
Its complexity is still an open problem !!Proved NP-complete for sequential codes, but not for parallel ones.Proved NP-complete for ILP codes if we restrict the total schedule time.
Integer programmingSame intLP system as RS, but we bound the register requirement : xu
t Rt
Algorithmic heuristicsLifetime interval serialization (as RS reduction)Do not care about critical path increase.Set R=1 (reduce RS as low as possible)
![Page 36: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/36.jpg)
19/04/23 Thesis defense 36
Experiments (RF Computation)
27 loop bodies, maximal empirical error is 1 register (7 cases).
Optimal vs. Approximated RF
0
1
2
3
4
5
6
7
8
9
loop body
R
Optimal RF
Approximated RF
RS
![Page 37: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/37.jpg)
19/04/23 Thesis defense 37
Part I : Basic Blocks
1. Register Requirement (Use)
2. Register Saturation
3. Register Sufficiency
4. Local Schedule Independent Register Allocation
5. Related Work
6. Conclusion
![Page 38: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/38.jpg)
19/04/23 Thesis defense 38
Example of Early RA
ld
+
++
x
++
+
+
Register Allocation is
a minimal chain decomposition
![Page 39: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/39.jpg)
19/04/23 Thesis defense 39
Two Critical Loops
ILP loss after Register Allocation
0
0,1
0,2
0,3
0,4
0,5
0,6
0,7
0,8
0,9
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
unroll
ILP
lo
ss liv-loop23
spec-spice-loop8
spec-spice-loop4
![Page 40: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/40.jpg)
19/04/23 Thesis defense 40
Related Work (RS)
Our RS study is an extension to URSA framework [Berson 96]. We provide an adequate formulation to this problem.
URSA AssumptionDAG=pure data-flow graph. No multiple register types, no delays, all nodes are assumed values.
The URSA problem formalisation is not correct.
The efficiency of URSA was not compared to the optimal solutions.
![Page 41: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/41.jpg)
19/04/23 Thesis defense 41
Conclusion of Part I
RS and RF are analysed before ILP scheduling : the DAG becomes free from register constraints.
RS management maximizes the register requirement in order to minimize the # of introduced false dependences.
RF analysis enables to check if spill code is useless.
Our heuristics are nearly optimal (empirical results).
![Page 42: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/42.jpg)
19/04/23 Thesis defense 42
Part II : Simple Innermost Loops
1. Cyclic Register Requirement (Use)
2. Cyclic Register Saturation
3. Cyclic Register Sufficiency
4. Cyclic Schedule Independent Register Allocation
5. Related Work
6. Conclusion
![Page 43: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/43.jpg)
19/04/23 Thesis defense 43
Software Pipelining Motif
0123456789
1011121314151617
- c ae - b- d -- - -
cn2 1 0
rn
0123
h
h
h
h
iterations
ab--c-d--e
1st
ab--c-d--e
2nd
ab--c-d--e
3rd
time
L
![Page 44: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/44.jpg)
19/04/23 Thesis defense 44
Cyclic Register Requirement
h=4
v1 v2 v3
0
2
13
01234567891011121314151617181920
It i
It i+1
It i+2
h=4v1
v2
v3
v1
v2
v3
v1
v2
v3
v1
v2
v3
![Page 45: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/45.jpg)
19/04/23 Thesis defense 45
Computing CRN
In a circular interval graph, the size of a maximal clique in the interference graph the width [Tucker 75].
We decompose the circular graph into two parts :
1. Complete turns around the circle (# of distinct interfering instances)
2. In_fraction_of_h intervals.
![Page 46: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/46.jpg)
19/04/23 Thesis defense 46
In_fraction_of_h Intervals
In_fraction_of_h intervals are the remainder of circular intervals after removing the complete turns around the circle.
If we unroll twice the kernel of the in_fraction_of_h intervals, the maximal clique of the interference graph is equal to the width of the in_fraction_of_h intervals [Touati 2002].
h=4
v1 v2 v3 h
v1 v2 v3
h
![Page 47: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/47.jpg)
19/04/23 Thesis defense 47
Part II : Simple Innermost Loops
1. Cyclic Register Requirement (Use)2. Cyclic Register Saturation (CRS)
a) Computing Cyclic Register Saturation b) Reducing Cyclic Register Saturation
3. Cyclic Register Sufficiency4. Cyclic Schedule Independent Register
Allocation5. Related Work6. Conclusion
![Page 48: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/48.jpg)
19/04/23 Thesis defense 48
Computing CRS
CRSt is the exact maximal cyclic register requirement of all valid SWP schedules.
Absolute CRSt is infinite.
If MII=0 (acyclic DDG), the loop is completely parallel : cannot be implemented by a SWP kernel.
If L is not bounded, we may have an infinite # of values simultaneously live.
Optimal method by integer programming.
![Page 49: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/49.jpg)
19/04/23 Thesis defense 49
Optimal CRS Computation (1)
The intLP system is written for a fixed h and a bounded L.
At most O(n2) variables and O(n2+m) constraints.
Scheduling constraints :
Killing dates :
Eeehe vu ),(.)(
Eeehe vu ),(.)(
tRrv
EvueuConsv
tu Vuehvk
tR
t ,
),()(
,)(.)(max
,
![Page 50: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/50.jpg)
19/04/23 Thesis defense 50
Optimal CRS Computation (2)
# of complete turns around the circle
Two acyclic intervals (]a,b], ]a’,b’]) for each in_fraction_of_h intervals (]l,r]).
h
phulifetime
t
tt
u
uu
t
)(
hbb
haa
hrblr
rblr
la
tt
tt
tttt
tttt
tt
uu
uu
uuuu
uuuu
uu
'
'
![Page 51: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/51.jpg)
19/04/23 Thesis defense 51
Optimal CRS Computation (3)
The interferences of acyclic intervals are computed as in the acyclic case
A maximal clique is computed like in the acyclic case
Objective function
10, jIJI xxs
)( )( )()(1 S JI, IbeginJendJbeginIend
I
IVu
uxpMax
tRt
t
acyclic,
![Page 52: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/52.jpg)
19/04/23 Thesis defense 52
CRS Experiments
Upper-bound of CRS
0,00%
20,00%
40,00%
60,00%
80,00%
100,00%
120,00%
4 8 16 32 64
R
% (
CR
S <
=)
![Page 53: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/53.jpg)
19/04/23 Thesis defense 53
Reducing CRS
Problem : Given a DDG G, does there exist an extended DDG G’ such that CRSt(G’) Rt and its critical circuit is h ?
NP-hard [Touati 2002].
Is reduced from the problem of computing , a SWP motif, with a initiation interval equal to h and does not require more than Rt registers.
G’ is constructed from G such that we guarantee the same lifetime interval order as defined by .
![Page 54: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/54.jpg)
19/04/23 Thesis defense 54
Optimal CRS Reduction
The same intLP system as CRS computation, except that we bound the objective function by Rt
tI
IVu
uRxp
tRt
t acyclic,
![Page 55: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/55.jpg)
19/04/23 Thesis defense 55
Part II : Simple Innermost Loops
1. Cyclic Register Requirement (Use)2. Cyclic Register Saturation3. Cyclic Register Sufficiency4. Cyclic Schedule Independent Register
Allocation1. With loop unrolling.2. Rotating register files.3. Polynomial cases.
5. Related Work6. Conclusion
![Page 56: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/56.jpg)
19/04/23 Thesis defense 56
Motivating Example
u
v
- R1 R2 R1
u
It i
v v
It i+1
u
It i+
v
u
It i+
u
v
R
-
![Page 57: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/57.jpg)
19/04/23 Thesis defense 57
Killing Tasks
u
v1
1
v2
2
u’
v1
’1
v2
’2
ku’ku
-1 -2 -’1 -’2
’
’
![Page 58: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/58.jpg)
19/04/23 Thesis defense 58
Reuse Graphs
5
1
4
3
7
8
2
6
5
1
4
3
8
2
6
1
2
3 4
6
5
8
2
6
2 6
7 7
![Page 59: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/59.jpg)
19/04/23 Thesis defense 59
SIRA Problem
Given a DDG, find a valid reuse graph for each register type such that (C) R while the critical circuit is minimized.Classical NP-complete problem [Eisenbeis and Gasperoni 95].
Minimizing the unrolling degree is a difficult problem (non linear).
![Page 60: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/60.jpg)
19/04/23 Thesis defense 60
SIRA Exact Formulation (1)
The existence of a SWP
Reuse Arc (bijection)
Anti-dependences
),( ),()( vueehe vu
uv
tuv
v
tvu ,1,,
tRt
vuvwkt
vu Vvuhvtu
,,, , ,)(1
![Page 61: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/61.jpg)
19/04/23 Thesis defense 61
SIRA Exact Formulation (2)
Register Requirement
Objective function
At most O(n2) variables and O(n2+m) constraints.
tRtvu
tvu VvuR ,
,, , ,
tRvu
tvu Vvu ,
,, , , Minimize
![Page 62: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/62.jpg)
19/04/23 Thesis defense 62
Rotating Register FilesNo need to unroll the kernel to allocate registers.Cost (if we do not unroll the loop) : at most one extra register for the same h [Rau et al 92, Lelait96].
R1=…h
Physical registers
r5
r4
r3
r2r1r0
kernel iterationh
![Page 63: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/63.jpg)
19/04/23 Thesis defense 63
Hamiltonian Ordering
(u,v) is a reuse arc (ho(v)=ho(u)+1) mod n
01
2
3
45
6
![Page 64: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/64.jpg)
19/04/23 Thesis defense 64
Polynomial Problem
Theorem [Touati 2002]: if we fix statically the reuse arcs, computing the distances so as to minimize the register requirement under a fixed execution rate has a totally unimodular constraints matrix.
![Page 65: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/65.jpg)
19/04/23 Thesis defense 65
Polynomial Instances
Disable register sharing among statements : fix statically a self reuse scheme (each variables reuses the register freed by itself) [Ning 93].
Fix an arbitrary (on in a cleverer way) hamiltonian reuse circuit.
Look for other cleverer method !
![Page 66: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/66.jpg)
19/04/23 Thesis defense 66
Experiments (SIRA vs. Ham SIRA)
Hamiltonian SIRA needs at most one extra register than SIRA (under the same II) in
very few cases.
![Page 67: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/67.jpg)
19/04/23 Thesis defense 67
Optimal vs. Polynomial SIRA
Two strategies : self reuse (disable register sharing) and arbitrary hamiltonian reuse.To synthesize :
Self-reuse strategy is the worst decision in terms of register requirement. Arbitrary Hamiltonian reuse scheme exhibits better behaviour.Self-reuse strategy exhibits better behaviour in terms of unrolling degrees (if no RRF exists).
![Page 68: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/68.jpg)
19/04/23 Thesis defense 68
Related Work in LoopsAs far I we know, nothing about cyclic register saturation and sufficiency.SWP under register constraints
Heuristics : [Huff 93, Ning93, Wang et al 95, Sanchez96, Llosa 96]
Parallelism vs. Storage : [Strout et al 98, Thies et al 01].
Optimal : buffers [Altman 95], stage scheduling [Eichenberger et al 96, Huard 01], MAXLIVE [Sawya 96, Fimmel et al 01].
Register Allocation of an already scheduled loopRRF : [Rau et al 92], [Duesterwald et al 92].Loop unrolling : [Hendren et al 92].Meeting graph [Lelait et al 96].
![Page 69: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/69.jpg)
19/04/23 Thesis defense 69
Conclusion of Part II
Contrary to acyclic RS, absolute CRS may be infinite. Practically, we can compute it if we bound L.
CRF exists, and we can compute it.
SIRA allows to construct an optimal (early) cyclic register allocation with a minimal critical circuit of the loop.
Fixing at compile time the reuse arcs makes the problem polynomial.
![Page 70: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/70.jpg)
19/04/23 Thesis defense 70
Future Research Proposals (1)
A fact :a thesis can never be exhaustive ! Our efforts open a wide range of future work subjects.
Open ProblemsComplexity of Register Sufficiency in ILP Codes.
ExperimentsCRS, CRF, and spilling heuristics.
Machine with finite resources.
![Page 71: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/71.jpg)
19/04/23 Thesis defense 71
Future Research Proposals (2)
Extend processor model (write multiple results per type).
Loops with branches.
Multi-dimensional scheduling : loop nest.
![Page 72: Register Pressure in Instruction Level Parallelism](https://reader031.fdocuments.in/reader031/viewer/2022032205/56812f21550346895d94b97b/html5/thumbnails/72.jpg)
19/04/23 Thesis defense 72
Conclusion : our THESIS in two points
Registers should be handled (early) at the intermediate level of compilation, but with register saturation (maximize and not minimize the register requirement).Avoid spilling, even if you degrade static ILP extraction.