Register Pressure in Instruction Level Parallelism

Register Pressure in Instruction Level Parallelism

TOUATI Sid-Ahmed-Ali

19/04/23 Thesis defense 2

Outline

Prologue

Part one : Basic Blocks

Part two : Simple Innermost Loops

Epilogue


Memory BottleneckSpec2000 Performance Statistics

0

0,5

1

1,5

2

2,5

3

3,5

IPC

Real

Perfect L2

Perfect Mem

From [Lin et al 01], in HPCA 2001. Simulated performance on an Alpha 21364 processor (1.6Ghz). Recent Compaq compiler (peak-optimization compiler flags).


Solutions by Software

To avoidTo avoid

Using registers

To tolerateTo tolerate

ILP/TLP Software prefetching


Combined Complex Pass

Original Graph

Scheduling +Register Allocation

Scheduling +Register Allocation

We do not advocate this method


Why Decoupling ?

Memory wallMemory bottleneck >> ILP enhancementUseless spilling Early RA still generates faster codes [Freudenberg et al 92, Brasier et al 95, Janssen 01].

Register constraints are more genericHeterogeneous, complex resource constraints.Registers : types and number of available registers.

Complexity of register constraintsWhile there is always a schedule for a DDG under any resource constraints, it is not the case with limited number of registers (spilling is sometimes unavoidable).


Our chart

I. Performance enhancement priority to registers against ILP scheduling, but the former must respect the latter (if possible).

II. Portability No major re-writing of compilers (investment cost).

Generic processor : meet most of existing ILP processors.


Our Generic ILP processor

Non restriction on ILP degree.Infinite parallelism (infinite resources).

Multiple register types.A statement may produces multiple values, but with distinct types.

Visible delays in reading from and writing into registers.

A register is not occupied until the result is available (later after the issue time in the pipeline).


First Strategy : Register Pressure Management

Original DDG

Register Pressure ManagementRegister Pressure Management

Modified DDG

Register Constraints

Code SchedulingCode SchedulingRegister AllocationRegister Allocation

MinimizeCritical Path

Increase


Register Saturation and Sufficiency

RS

RS

RS

RF

RF

RFR

Add arcs

Spilling


Second Strategy : Schedule Independent Register Allocation

Original DDG

Early Register AllocationEarly Register Allocation

Allocated DDG

Register Constraints

Code SchedulingCode Scheduling

MinimizeCritical Path

Increase


Part I : Basic Blocks

1. Register Requirement (Use)

2. Register Saturation

3. Register Sufficiency

4. Local Schedule Independent Register Allocation

5. Related Work

6. Conclusion


Local Register Requirement

1

2

34

5

6

7

8

9

10

11

12

+

x

+

++

+

+

st

ld

+

st

x

++

++

+

+

ld


Without Assuming a Schedule…

Value lifetime intervals are not definedRegister Requirement not defined.

Two notions in this case:Register Saturation per register type (max RR)

Guarantees that registers do not constraint the ILP scheduling.

Register Sufficiency per register type (min RR)Prevents from obsolete spilling.


Computing Register Saturation

Given a DAG, compute the exact maximal register requirement for all valid schedules.

NP-complete problem [Touati 00].Optimal method (integer linear programming).

Algorithmic heuristics.


Integer Programming Techniques

We use binary variables for expressing disjunction, implication, equivalence and max operator.

Disjunction : the domain set of the variables must be bounded.

1,0

)1()(

.)(

0)(

0)(

gxg

fxf

xg

xf


Integer Programming Techniques

Implication

Equivalence

Max

0)(

0)(

0)(0)(

xg

xf

xgxf

0)(

0)(

0)(

0)(0)(0)(

xg

xf

xg

xfxgxf

yzxy

xzyxyxz ),max(


Optimal RS Computation

Scheduling constraints :e=(u,v) : v - u (e)

Killing dates : kill(ut)=max(v+r(v)), v reads ut

Interferences :St

u,v =1 (kill(u)def(v) kill(v) def(u))

Maximal clique = independent set in the complementary graph:

Stu,v

= 0 xut + xv

t 1

Objective function = maximize (independent set)Maximize xu

t

At most O(n2) variables and O(n2+m) constraints.


Problem Formulation with Graphs

RS computation chose a unique killer for each value.

Computing a killing function that associates a unique killer to each value.

Two constraints :The killing function must not introduce a circuit in the DAG.

The killing function must maximize the register requirement.


Killing Function...

++

x

+

++

+

+

st

ld

++

x

+

++

+

+

ld

Killing function Disjoint Value DAG : interval order


Register Saturation Problem

Find a valid killing function such that the maximal antichain in the disjoint value DAG is maximal among other killing functions.

NP-complete Problem.

Polynomial heuristics.


Our Heuristics (Greedy-k)

Decompose the potential killing graph into connected bipartite components

cb=(S, T, Ecb)

Find a Saturating Killing Set: maximize the parallel values with S (minimize the number of arcs in the disjoint value DAG).


Saturating Killing Set

S

T

Descendant values

S

T’

Descendant values

T-T’

Descendant values


Greedy-k versus Optimal RS

Benchmarks : 27 loops from Spec-FP-95, whetstone, livermore.

DAGs=unrolled loops.

134 experimented DAGs (#nodes up to 120).

Maximal difference empirical difference between optimal RS and approximated RS* by Greedy-k is 1 FP register (5% of DAGs).


Representative RS Behaviour

RS vs. loop unrolling

0

5

10

15

20

25

30

35

40

45

unroll

RS

whet-loop1

whet-loop2

whet-loop3


Reducing Register Saturation

Problem : does there exist an extended DDG G’ from G such that RS(G’)R and Critical Path P ?

NP-hard problem [Touati 01]Optimal solution with integer programming.

Algorithmic heuristics.


Optimal RS Reduction

The problem is equivalent to computing a schedule that does not require more than R registers (NP-complete), while the total schedule time is P.

Given such schedule, we report arcs into G so as to guarantee the same interval order as defined by .


Integer Program for RS Reduction

We bound the register requirement for each register type, and the total schedule time:

xut Rt

P

The objective function maximizes the RR of a considered register type:

Maximize xut

At most O(n2) variables and O(n2+m) constraints


Algorithmic Heuristics for RS Reduction

Serialize Saturating Values lifetime intervals.

Do not extend the critical path if possible.


Interval Serialization

++

x

+

++

+

+

st

ld

r-w

DAGExtended


Experiments (RS Reduction)

Optimal versus approximatedLoops were unrolled till 4, #nodes up to 80.

We parameterise R (#available registers) as 1, next power of 2, and 32.

Maximal empirical error is two registers.


Experiments (RS reduction)

RS reduction (R=32)

0

10

20

30

40

50

60

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

unroll

red

uce

d R

S

spec-spice-loo7

spec-fp-loop1

liv-loop23


Experiments (ILP loss)

RS reduction and ILP loss

71%

19%

5%

1%

4%

RS=RS* ILP=ILP*

RS=RS* ILP<ILP*

RS>RS* ILP=ILP*

RS>RS* ILP<ILP*

RS>RS* ILP>ILP*







5. Related Work

6. Conclusion


Computing Register Sufficiency

Its complexity is still an open problem !!Proved NP-complete for sequential codes, but not for parallel ones.Proved NP-complete for ILP codes if we restrict the total schedule time.

Integer programmingSame intLP system as RS, but we bound the register requirement : xu

t Rt

Algorithmic heuristicsLifetime interval serialization (as RS reduction)Do not care about critical path increase.Set R=1 (reduce RS as low as possible)


Experiments (RF Computation)

27 loop bodies, maximal empirical error is 1 register (7 cases).

Optimal vs. Approximated RF

0

1

2

3

4

5

6

7

8

9

loop body

R

Optimal RF

Approximated RF

RS







5. Related Work

6. Conclusion


Example of Early RA

ld

+

++

x

++

+

+

Register Allocation is

a minimal chain decomposition


Two Critical Loops

ILP loss after Register Allocation

0

0,1

0,2

0,3

0,4

0,5

0,6

0,7

0,8

0,9

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

unroll

ILP

lo

ss liv-loop23

spec-spice-loop8

spec-spice-loop4


Related Work (RS)

Our RS study is an extension to URSA framework [Berson 96]. We provide an adequate formulation to this problem.

URSA AssumptionDAG=pure data-flow graph. No multiple register types, no delays, all nodes are assumed values.

The URSA problem formalisation is not correct.

The efficiency of URSA was not compared to the optimal solutions.


Conclusion of Part I

RS and RF are analysed before ILP scheduling : the DAG becomes free from register constraints.

RS management maximizes the register requirement in order to minimize the # of introduced false dependences.

RF analysis enables to check if spill code is useless.

Our heuristics are nearly optimal (empirical results).


Part II : Simple Innermost Loops

1. Cyclic Register Requirement (Use)

2. Cyclic Register Saturation

3. Cyclic Register Sufficiency

4. Cyclic Schedule Independent Register Allocation

5. Related Work

6. Conclusion


Software Pipelining Motif

0123456789

1011121314151617

- c ae - b- d -- - -

cn2 1 0

rn

0123

h

h

h

h

iterations

ab--c-d--e

1st

ab--c-d--e

2nd

ab--c-d--e

3rd

time

L


Cyclic Register Requirement

h=4

v1 v2 v3

0

2

13

01234567891011121314151617181920

It i

It i+1

It i+2

h=4v1

v2

v3

v1

v2

v3

v1

v2

v3

v1

v2

v3


Computing CRN

In a circular interval graph, the size of a maximal clique in the interference graph the width [Tucker 75].

We decompose the circular graph into two parts :

1. Complete turns around the circle (# of distinct interfering instances)

2. In_fraction_of_h intervals.


In_fraction_of_h Intervals

In_fraction_of_h intervals are the remainder of circular intervals after removing the complete turns around the circle.

If we unroll twice the kernel of the in_fraction_of_h intervals, the maximal clique of the interference graph is equal to the width of the in_fraction_of_h intervals [Touati 2002].

h=4

v1 v2 v3 h

v1 v2 v3

h



1. Cyclic Register Requirement (Use)2. Cyclic Register Saturation (CRS)

a) Computing Cyclic Register Saturation b) Reducing Cyclic Register Saturation

3. Cyclic Register Sufficiency4. Cyclic Schedule Independent Register

Allocation5. Related Work6. Conclusion


Computing CRS

CRSt is the exact maximal cyclic register requirement of all valid SWP schedules.

Absolute CRSt is infinite.

If MII=0 (acyclic DDG), the loop is completely parallel : cannot be implemented by a SWP kernel.

If L is not bounded, we may have an infinite # of values simultaneously live.

Optimal method by integer programming.


Optimal CRS Computation (1)

The intLP system is written for a fixed h and a bounded L.


Scheduling constraints :

Killing dates :

Eeehe vu ),(.)(

Eeehe vu ),(.)(

tRrv

EvueuConsv

tu Vuehvk

tR

t ,

),()(

,)(.)(max

,



# of complete turns around the circle

Two acyclic intervals (]a,b], ]a’,b’]) for each in_fraction_of_h intervals (]l,r]).

h

phulifetime

t

tt

u

uu

t

)(

hbb

haa

hrblr

rblr

la

tt

tt

tttt

tttt

tt

uu

uu

uuuu

uuuu

uu

'

'



The interferences of acyclic intervals are computed as in the acyclic case

A maximal clique is computed like in the acyclic case

Objective function

10, jIJI xxs

)( )( )()(1 S JI, IbeginJendJbeginIend

I

IVu

uxpMax

tRt

t

acyclic,


CRS Experiments

Upper-bound of CRS

0,00%

20,00%

40,00%

60,00%

80,00%

100,00%

120,00%

4 8 16 32 64

R

% (

CR

S <

=)


Reducing CRS

Problem : Given a DDG G, does there exist an extended DDG G’ such that CRSt(G’) Rt and its critical circuit is h ?

NP-hard [Touati 2002].

Is reduced from the problem of computing , a SWP motif, with a initiation interval equal to h and does not require more than Rt registers.

G’ is constructed from G such that we guarantee the same lifetime interval order as defined by .


Optimal CRS Reduction

The same intLP system as CRS computation, except that we bound the objective function by Rt

tI

IVu

uRxp

tRt

t acyclic,



1. Cyclic Register Requirement (Use)2. Cyclic Register Saturation3. Cyclic Register Sufficiency4. Cyclic Schedule Independent Register

Allocation1. With loop unrolling.2. Rotating register files.3. Polynomial cases.

5. Related Work6. Conclusion


Motivating Example

u

v

- R1 R2 R1

u

It i

v v

It i+1

u

It i+

v

u

It i+

u

v

R

-


Killing Tasks

u

v1

1

v2

2

u’

v1

’1

v2

’2

ku’ku

-1 -2 -’1 -’2

’

’


Reuse Graphs

5

1

4

3

7

8

2

6

5

1

4

3

8

2

6

1

2

3 4

6

5

8

2

6

2 6

7 7


SIRA Problem

Given a DDG, find a valid reuse graph for each register type such that (C) R while the critical circuit is minimized.Classical NP-complete problem [Eisenbeis and Gasperoni 95].

Minimizing the unrolling degree is a difficult problem (non linear).


SIRA Exact Formulation (1)

The existence of a SWP

Reuse Arc (bijection)

Anti-dependences

),( ),()( vueehe vu

uv

tuv

v

tvu ,1,,

tRt

vuvwkt

vu Vvuhvtu

,,, , ,)(1


SIRA Exact Formulation (2)

Register Requirement

Objective function


tRtvu

tvu VvuR ,

,, , ,

tRvu

tvu Vvu ,

,, , , Minimize


Rotating Register FilesNo need to unroll the kernel to allocate registers.Cost (if we do not unroll the loop) : at most one extra register for the same h [Rau et al 92, Lelait96].

R1=…h

Physical registers

r5

r4

r3

r2r1r0

kernel iterationh


Hamiltonian Ordering

(u,v) is a reuse arc (ho(v)=ho(u)+1) mod n

01

2

3

45

6


Polynomial Problem

Theorem [Touati 2002]: if we fix statically the reuse arcs, computing the distances so as to minimize the register requirement under a fixed execution rate has a totally unimodular constraints matrix.


Polynomial Instances

Disable register sharing among statements : fix statically a self reuse scheme (each variables reuses the register freed by itself) [Ning 93].

Fix an arbitrary (on in a cleverer way) hamiltonian reuse circuit.

Look for other cleverer method !


Experiments (SIRA vs. Ham SIRA)

Hamiltonian SIRA needs at most one extra register than SIRA (under the same II) in

very few cases.


Optimal vs. Polynomial SIRA

Two strategies : self reuse (disable register sharing) and arbitrary hamiltonian reuse.To synthesize :

Self-reuse strategy is the worst decision in terms of register requirement. Arbitrary Hamiltonian reuse scheme exhibits better behaviour.Self-reuse strategy exhibits better behaviour in terms of unrolling degrees (if no RRF exists).


Related Work in LoopsAs far I we know, nothing about cyclic register saturation and sufficiency.SWP under register constraints

Heuristics : [Huff 93, Ning93, Wang et al 95, Sanchez96, Llosa 96]

Parallelism vs. Storage : [Strout et al 98, Thies et al 01].

Optimal : buffers [Altman 95], stage scheduling [Eichenberger et al 96, Huard 01], MAXLIVE [Sawya 96, Fimmel et al 01].

Register Allocation of an already scheduled loopRRF : [Rau et al 92], [Duesterwald et al 92].Loop unrolling : [Hendren et al 92].Meeting graph [Lelait et al 96].


Conclusion of Part II

Contrary to acyclic RS, absolute CRS may be infinite. Practically, we can compute it if we bound L.

CRF exists, and we can compute it.

SIRA allows to construct an optimal (early) cyclic register allocation with a minimal critical circuit of the loop.

Fixing at compile time the reuse arcs makes the problem polynomial.


Future Research Proposals (1)

A fact :a thesis can never be exhaustive ! Our efforts open a wide range of future work subjects.

Open ProblemsComplexity of Register Sufficiency in ILP Codes.

ExperimentsCRS, CRF, and spilling heuristics.

Machine with finite resources.


Future Research Proposals (2)

Extend processor model (write multiple results per type).

Loops with branches.

Multi-dimensional scheduling : loop nest.


Conclusion : our THESIS in two points

Registers should be handled (early) at the intermediate level of compilation, but with register saturation (maximize and not minimize the register requirement).Avoid spilling, even if you degrade static ILP extraction.

Register Pressure in Instruction Level Parallelism

Documents

Transcript of Register Pressure in Instruction Level Parallelism