Transcript of Just-In-Time Software Pipelining
Just-In-Time Software Pipelining
Hongbo Rong Hyunchul Park
Youfeng Wu Cheng Wang
Programming Systems Lab
Intel Labs, Santa Clara
What is software pipelining?
2
A loop optimization exposing instruction-level parallelism (ILP):

    for (i = 0; i < N; i++) {
      a: x = y + 1
      b: y = A[i] + x
      c: B[i+2] = B[i] * x
      d: A[i+2] = B[i+2]
    }

[Figure: dependence graph over a, b, c, d, with local dependences and loop-carried dependences of distances <2>, <1>, and <1>.]
3
[Figure: the loop's operations are split into a 2-stage kernel (stage 1 = a, b, c; stage 2 = d). Across iterations 0-3, a new iteration starts every 2 cycles, so successive iterations execute different stages at the same time. Initiation interval (II) = 2.]
• Different iterations work on different stages in parallel
• II is the performance indicator
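To make the kernel picture concrete, here is one way the 2-stage pipeline could be rendered at source level: a minimal sketch, assuming N >= 1 and that A and B hold at least N+2 elements. On the actual hardware the kernel's two stages issue in parallel, which is where II = 2 comes from; this C version serializes them while preserving the original loop's semantics.

    #include <stddef.h>

    /* Sketch of the 2-stage pipeline (stage 1 = {a,b,c}, stage 2 = {d}).
       Assumes N >= 1 and A, B each have at least N+2 elements. */
    void pipelined_loop(int *A, int *B, int y, size_t N) {
      int x;
      /* prologue: stage 1 of iteration 0 */
      x = y + 1;                 /* a(0) */
      y = A[0] + x;              /* b(0) */
      B[2] = B[0] * x;           /* c(0) */
      /* kernel: stage 2 of iteration i beside stage 1 of iteration i+1 */
      for (size_t i = 0; i + 1 < N; i++) {
        A[i+2] = B[i+2];         /* d(i) */
        x = y + 1;               /* a(i+1) */
        y = A[i+1] + x;          /* b(i+1) */
        B[i+3] = B[i+1] * x;     /* c(i+1) */
      }
      /* epilogue: stage 2 of the last iteration */
      A[N+1] = B[N+1];           /* d(N-1) */
    }

The prologue fills the pipeline and the epilogue drains it; the kernel body is what repeats every II cycles.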
Software pipelining has been static
4
• Extensively studied for 3 decades, and efficient for wide-issue architectures
  – VLIW [Lam 1988]
  – superscalar [Ruttenberg et al. 1996]
• It is seen only in static compilers
• Most works aim to minimize II, not compile overhead
It is time now to extend software pipelining to dynamic compilers!
5
• Dynamic languages are increasingly popular
  – JavaScript and PHP are used by 88.9% of client-side and 81.5% of server-side websites (W3Techs)
• Huge amount of legacy code
  – Small optimization scope: a loop iteration
  – Software pipelining enlarges the scope to many iterations
• Minimizing compile overhead must be the 1st objective
  – Only simple/fast algorithms can be used
  – Linear-time algorithms are preferred
Challenges
6
• Memory aliases kill parallelism
  – Hardware: Atomic region + rotating alias registers [MICRO-46]
• Costly rollback
  – Software: Light-weight checkpointing
• Scheduling is expensive

Original optimization scope:

    for (i = 0; i < N; i++) {
      a; b; c; d
    }

Strip-mined so that each block of M iterations becomes an atomic region:

    for (j = 0; j < N; j += M) {   // atomic region
      for (i = j; i < j+M; i++) {
        a; b; c; d
      }
    }
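A minimal sketch of how a runtime might wrap each strip-mined block in an atomic region. checkpoint(), alias_fault(), and rollback_and_reexecute() are hypothetical hooks standing in for CMS's light-weight checkpointing and the rotating-alias-register check; they are not the CMS API.

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical runtime hooks (assumptions, not the CMS interface). */
    static void checkpoint(void)                 { /* snapshot live state */ }
    static bool alias_fault(void)                { return false; /* would test alias regs */ }
    static void rollback_and_reexecute(size_t j) { (void)j; /* undo region, re-run safely */ }

    void strip_mined(size_t N, size_t M) {
      for (size_t j = 0; j < N; j += M) {        /* one atomic region per block */
        checkpoint();
        for (size_t i = j; i < j + M && i < N; i++) {
          /* software-pipelined bodies of a, b, c, d execute here */
        }
        if (alias_fault())                       /* a speculated memory alias fired */
          rollback_and_reexecute(j);             /* undo the block, re-run unpipelined */
      }
    }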
Framework (on Transmeta CMS)
7
[Diagram: x86 binary → Interpreter & profiler → Hot region optimizations → Acyclic scheduler → Assembler → Code cache]
8
[Diagram: loop selection routes each hot loop either to the software pipeliner (Initialization → Scheduling → Rotating alias register allocation → Code generation) or to the acyclic scheduler. Hardware support: rotating alias register file and atomic regions.]
Scheduling is expensive
9
• Finding an optimal schedule is an NP-complete problem [Colland et al. 1996]
• O(V³) at least, exponential at worst [Rau et al. 1992]
  – V: number of operations
• Can we linearize software pipelining?
Keep scheduling time under control
10
• Schedule in linear time
  – Use either simple or fast sub-algorithms
    • Avoid cubic or exponential complexity
  – For iterative sub-algorithms, set a threshold: the maximum number of iterations allowed
    • The smaller the threshold, the less the compile overhead
    • Once the threshold is exceeded, abort software pipelining
    • Key question: how small can the threshold be?
• Find a good-enough schedule
  – No backtracking
  – Priority function: approximate, and never updated
  – Separate dependence and resource constraints
  – Separate local and loop-carried dependences
• Iteratively improve a schedule
Just-In-Time Software Pipelining
11
[Flow: Prepartition → Local scheduling → Kernel expansion → Rotation → Exit; Prepartition is bounded by the thresholds H and B, Rotation by ≤ 3 passes.]
• Prepartition quickly creates an initial schedule to start with
• Local scheduling handles local dependences and resources
• Kernel expansion adjusts the schedule for loop-carried dependences
• Rotation iteratively improves the schedule
• Time complexity: O(V+E)
  – V: #operations, E: #dependences
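Read as control flow, the four phases compose roughly as follows: a minimal sketch with an opaque Loop type and placeholder phase functions (illustrative assumptions, not the paper's actual interfaces); the ≤ 3 rotation budget and the abort-on-threshold behavior follow the slide.

    #include <stdbool.h>

    typedef struct Loop Loop;   /* opaque loop/dependence-graph handle (illustrative) */

    /* Placeholder phases, stand-ins for the paper's sub-algorithms. */
    static bool prepartition(Loop *L, int H, int B) { (void)L; (void)H; (void)B; return true; }
    static void local_schedule(Loop *L)             { (void)L; }
    static void kernel_expand(Loop *L)              { (void)L; }
    static bool improvable(Loop *L)                 { (void)L; return false; }
    static void rotate(Loop *L)                     { (void)L; }

    /* Each phase is linear, so the whole pass is O(V + E). */
    bool jit_software_pipeline(Loop *L, int H, int B) {
      if (!prepartition(L, H, B))     /* initial schedule; H, B bound its iterative parts */
        return false;                 /* threshold exceeded: abort pipelining */
      local_schedule(L);              /* local dependences + resources */
      kernel_expand(L);               /* repair loop-carried dependences */
      for (int r = 0; r < 3; r++) {   /* rotation budget <= 3, per the slide */
        if (!improvable(L)) break;
        rotate(L);                    /* iteratively improve the kernel */
      }
      return true;
    }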
Illustration
12
[Figure: Prepartition. Operations a, b, c, d of iterations 0-3 are partitioned into a kernel, with RecMII marked; local dependences, loop-carried dependences, and resources are shown.]
Calculating RecMII
13
• Recurrence Minimum II
  – II determined by the biggest dependence cycles
  – Needed by almost every software pipelining method
• Turn the problem into a Markov decision process
  – 1st time the Howard algorithm is applied to pipelining
• Linearize Howard with a small constant H
• A by-product: critical operations

Complexity: conventional O(V³); Howard policy iteration O(exponential · E); linearized Howard O(H · E)
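For context, the sketch below shows what RecMII computes, using the conventional scheme the experiments compare against (binary search over II with a positive-cycle test, in the spirit of [Rau et al. 1992]) rather than the paper's linearized Howard iteration. Under edge weights w(e) = latency(e) - II · distance(e), an II is feasible exactly when no cycle has positive total weight; the Edge layout and the MAX_OPS cap are assumptions.

    #include <stdbool.h>

    #define MAX_OPS 256   /* sketch-only cap on graph size */

    /* Dependence edge u->v: `latency` cycles, `distance` iterations. */
    typedef struct { int u, v, latency, distance; } Edge;

    /* Bellman-Ford-style positive-cycle test: start every node at 0,
       relax nV full passes; if an edge still relaxes, a positive cycle
       exists and II is too small. */
    static bool ii_infeasible(const Edge *E, int nE, int nV, int II) {
      int dist[MAX_OPS] = {0};
      for (int pass = 0; pass < nV; pass++)
        for (int e = 0; e < nE; e++) {
          int w = E[e].latency - II * E[e].distance;
          if (dist[E[e].u] + w > dist[E[e].v])
            dist[E[e].v] = dist[E[e].u] + w;
        }
      for (int e = 0; e < nE; e++) {
        int w = E[e].latency - II * E[e].distance;
        if (dist[E[e].u] + w > dist[E[e].v]) return true;
      }
      return false;
    }

    /* RecMII = smallest II admitting no positive-weight recurrence cycle.
       Feasibility is monotone in II, so binary search applies; Rau et al.
       find the upper bound maxII by doubling (the "exponential backoff"). */
    int rec_mii(const Edge *E, int nE, int nV, int maxII) {
      int lo = 1, hi = maxII;
      while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (ii_infeasible(E, nE, nV, mid)) lo = mid + 1; else hi = mid;
      }
      return lo;
    }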
Divide operations into stages
14
• Also an iterative sub-algorithm
• Linearize it with a constant B
• Specific order in visiting edges
  – Scan nodes in sequential order, and visit their incoming edges
    • Values are propagated along local edges in the 1st iteration
    • Values are propagated along loop-carried edges in the 2nd iteration

Complexity: Bellman-Ford O(V · E); linearized Bellman-Ford O(B · E)
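A minimal sketch of this bounded relaxation, assuming stages are computed as a longest-path fixed point with an illustrative per-edge offset delta (the paper's exact constraint is not spelled out here). Nodes are scanned in program order and their incoming edges relaxed, so forward (local) edges settle in the 1st pass and backward (loop-carried) edges in the 2nd; exceeding B passes aborts pipelining.

    #include <stdbool.h>

    /* Incoming edge u->v with an illustrative stage offset `delta`
       (derived from latency, distance, and II in the real algorithm). */
    typedef struct { int u, delta; } InEdge;

    /* in[v] holds node v's incoming edges; nIn[v] their count.
       Returns false if B passes do not reach a fixed point. */
    bool partition_stages(int nV, const InEdge *const in[], const int nIn[],
                          int stage[], int B) {
      for (int v = 0; v < nV; v++) stage[v] = 0;
      for (int pass = 0; pass < B; pass++) {
        bool changed = false;
        for (int v = 0; v < nV; v++)          /* sequential program order */
          for (int k = 0; k < nIn[v]; k++) {
            int s = stage[in[v][k].u] + in[v][k].delta;
            if (s > stage[v]) { stage[v] = s; changed = true; }
          }
        if (!changed) return true;            /* fixed point: stages final */
      }
      return false;                           /* threshold B exceeded: abort */
    }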
Illustration
15
[Figure: the prepartitioned kernel from slide 12, entering the Local scheduling step.]
Illustration (Cont.)
16
[Figure: local scheduling compacts each iteration's operations (a; b c; d), shrinking II; two dependences, marked X, are now violated.]
Local scheduling
17
• Any local scheduling algorithm can be used
  – E.g., list scheduling
• Weakness:
  – Loop-carried dependences may be violated
• To reduce the chance of violation:
  – Before scheduling, the priority function considers loop-carried dependences in advance
  – Prioritize critical operations
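A generic list-scheduling sketch of this step, under stated assumptions: priorities are precomputed once and never updated, and are assumed to already fold in loop-carried criticality, as the slide prescribes; the flat successor matrix, the MAX_SUCC cap, and a single issue-width resource are illustrative, not the paper's machine model. The caller initializes priority and npred.

    #include <stdbool.h>

    #define MAX_SUCC 4   /* sketch-only cap on successors per op */

    typedef struct {
      int priority;  /* precomputed once, never updated; folds in
                        loop-carried criticality so critical ops go first */
      int npred;     /* not-yet-scheduled local predecessors */
      int ready;     /* earliest cycle allowed by scheduled local deps */
      int cycle;     /* assigned cycle, -1 while unscheduled */
    } Op;

    /* Greedy list scheduler: each cycle, up to `width` ready ops issue,
       highest priority first. Only local dependences and issue width are
       modeled; loop-carried dependences are left to kernel expansion. */
    void list_schedule(Op *op, int n, const int succ[][MAX_SUCC],
                       const int nsucc[], const int lat[], int width) {
      for (int i = 0; i < n; i++) { op[i].cycle = -1; op[i].ready = 0; }
      int scheduled = 0;
      for (int t = 0; scheduled < n; t++) {
        int slots = width;                    /* resource constraint per cycle */
        while (slots > 0) {
          int best = -1;
          for (int i = 0; i < n; i++)         /* highest-priority ready op */
            if (op[i].cycle < 0 && op[i].npred == 0 && op[i].ready <= t &&
                (best < 0 || op[i].priority > op[best].priority))
              best = i;
          if (best < 0) break;                /* nothing ready this cycle */
          op[best].cycle = t; scheduled++; slots--;
          for (int k = 0; k < nsucc[best]; k++) {
            int s = succ[best][k];            /* release local successors */
            op[s].npred--;
            if (op[s].ready < t + lat[best]) op[s].ready = t + lat[best];
          }
        }
      }
    }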
Illustration (Cont.)
18
[Figure: the locally scheduled kernel, with the two violated dependences still marked X, entering the Kernel expansion step.]
19
[Figure: kernel expansion moves d into a later stage, expanding the kernel to respect the marked loop-carried dependences.]
20
[Figure: rotation compacts the expanded schedule into the final kernel; no dependences are violated, and the algorithm exits.]
Experiments
21
• Transmeta CMS on SPEC2k traces
• Functional simulator, to comprehensively
  – Explore thresholds H and B
  – Evaluate compile overhead and schedules' quality
• Cycle-accurate simulator
  – Simulates cache misses, latencies, …
  – Initial performance study
Linearized Howard vs. exponential backoff + binary search [Rau et al. 1992]
22
[Scatter plot: per-loop comparison; y axis 0.1 to 1000 (log scale), x axis #dependences (0 to 700).]
• Average 9X faster
• Peak 812X faster
• H ≤ 3 for 96% of 11,992 loops
• H ≤ 14 for all loops
Linearized Bellman-Ford
23
[Histogram: % of total loops vs. B (2 to 5).]
• B ≤ 3 for 98.8% of 11,992 loops
• B ≤ 5 for all the loops
• From now on, we set H=10 and B=5 → 11,910 loops scheduled
Scheduling overhead & schedules' quality
24
[Bar charts comparing RS2, DESP, and JITSP on scheduling overhead (JITSP normalized to 1X; values up to 12X) and on % loops with optimal schedules (95% for JITSP; 13%, 23%, and 25% also shown).]
• JITSP achieves optimal schedules for 95% of loops
Compile overhead distribution
25
[Pie chart: Hot region optimizations 34%, Software pipelining 33%, Acyclic scheduler 27%, Assembler 6%.]
Note: the acyclic scheduler handles acyclic code, or loops NOT selected for software pipelining
Preliminary performance
26
• 40 hot loops
[Pie chart: successfully generated code 25%, filtered 19%, too many registers 55%, too many unrolls 1%.]
• The architecture has a bottleneck in registers
  – 2/7/24/32 predicate/static alias/integer/floating-point registers available for pipelining
Preliminary performance (Cont.)
27
[Chart: II/MII per loop (1-10) for RS2, DESP, and JITSP, with the optimal (II/MII = 1.00) marked; the lower, the better.]
• JITSP achieves optimal schedules for all but 1 loop
Preliminary performance (Cont.)
28
[Chart: speedup per loop (1-10) for RS2, DESP, and JITSP, ranging from -40% to 40%.]
• JITSP: 10~36% speedup, better than the others
• Exception: loop 2 gets an optimal schedule but slows down due to memory aliases → retranslation needed
• Speedups: swim (5.3%), ammp (4.4%), mcf (3.6%)
Conclusion
29
• 1st linear software pipelining algorithm implemented for dynamic compilers
• Turns a traditionally expensive optimization into linear time, O(V+E)
  – Taking advantage of results from various domains:
    • hardware circuit design (retiming)
    • stochastic control (Howard algorithm)
    • graph algorithms (Bellman-Ford)
    • software pipelining (rotation scheduling and DESP)
• Generates optimal or near-optimal schedules with reasonable compile overhead
Future work
30
• Register availability
  – Add more architectural registers
  – Algorithm: make it register-pressure-aware
• Implementation
  – Loop selection
  – Re-translation
• Evaluation
  – Benchmarks
Backup slides
31
Howard algorithm
32
• Each operation is rewarded to reach a cycle via one policy edge
  – The bigger the cycle, the more the reward
[Figure: the example dependence graph (a, b, c, d), shown across successive policy-iteration steps.]
Distribution statistics of JITSP
33