Transcript of Just-In-Time Software Pipelining
Just-In-Time Software Pipelining
Hongbo Rong Hyunchul Park
Youfeng Wu Cheng Wang
Programming Systems Lab
Intel Labs, Santa Clara
What is software pipelining?
2
A loop optimization exposing instruction-level parallelism (ILP):

    for (i = 0; i < N; i++) {
      a: x = y + 1
      b: y = A[i] + x
      c: B[i+2] = B[i] * x
      d: A[i+2] = B[i+2]
    }

[Figure: dependence graph over a, b, c, d, with local dependences and loop-carried dependences of distances <2>, <1>, and <1>.]
3
[Figure: the loop's operations are split into a 2-stage kernel (stage 1 = a, b, c; stage 2 = d). Across iterations 0-3, a new iteration starts every 2 cycles, so successive iterations execute different stages at the same time. Initiation interval (II) = 2.]
• Different iterations work on different stages in parallel
• II is the performance indicator
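To make the kernel picture concrete, here is one way the 2-stage pipeline could be rendered at source level: a minimal sketch, assuming N >= 1 and that A and B hold at least N+2 elements. On the actual hardware the kernel's two stages issue in parallel, which is where II = 2 comes from; this C version serializes them while preserving the original loop's semantics.

    #include <stddef.h>

    /* Sketch of the 2-stage pipeline (stage 1 = {a,b,c}, stage 2 = {d}).
       Assumes N >= 1 and A, B each have at least N+2 elements. */
    void pipelined_loop(int *A, int *B, int y, size_t N) {
      int x;
      /* prologue: stage 1 of iteration 0 */
      x = y + 1;                 /* a(0) */
      y = A[0] + x;              /* b(0) */
      B[2] = B[0] * x;           /* c(0) */
      /* kernel: stage 2 of iteration i beside stage 1 of iteration i+1 */
      for (size_t i = 0; i + 1 < N; i++) {
        A[i+2] = B[i+2];         /* d(i) */
        x = y + 1;               /* a(i+1) */
        y = A[i+1] + x;          /* b(i+1) */
        B[i+3] = B[i+1] * x;     /* c(i+1) */
      }
      /* epilogue: stage 2 of the last iteration */
      A[N+1] = B[N+1];           /* d(N-1) */
    }

The prologue fills the pipeline and the epilogue drains it; the kernel body is what repeats every II cycles.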
Software pipelining has been static
4
• Extensively studied for 3 decades, and efficient for wide-issue architectures
  – VLIW [Lam 1988]
  – superscalar [Ruttenberg et al. 1996]
• It is seen only in static compilers
• Most works aim to minimize II, not compile overhead
It is time now to extend software pipelining to dynamic compilers!
5
• Dynamic languages are increasingly popular
  – JavaScript and PHP are used by 88.9% of client-side and 81.5% of server-side websites (W3Techs)
• Huge amount of legacy code
  – Small optimization scope: a loop iteration
  – Software pipelining enlarges the scope to many iterations
• Minimizing compile overhead must be the 1st objective
  – Only simple/fast algorithms can be used
  – Linear-time algorithms are preferred
Challenges
6
• Memory aliases kill parallelism
  – Hardware: Atomic region + rotating alias registers [MICRO-46]
• Costly rollback
  – Software: Light-weight checkpointing
• Scheduling is expensive

Original optimization scope:

    for (i = 0; i < N; i++) {
      a; b; c; d
    }

Strip-mined so that each block of M iterations becomes an atomic region:

    for (j = 0; j < N; j += M) {   // atomic region
      for (i = j; i < j+M; i++) {
        a; b; c; d
      }
    }
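A minimal sketch of how a runtime might wrap each strip-mined block in an atomic region. checkpoint(), alias_fault(), and rollback_and_reexecute() are hypothetical hooks standing in for CMS's light-weight checkpointing and the rotating-alias-register check; they are not the CMS API.

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical runtime hooks (assumptions, not the CMS interface). */
    static void checkpoint(void)                 { /* snapshot live state */ }
    static bool alias_fault(void)                { return false; /* would test alias regs */ }
    static void rollback_and_reexecute(size_t j) { (void)j; /* undo region, re-run safely */ }

    void strip_mined(size_t N, size_t M) {
      for (size_t j = 0; j < N; j += M) {        /* one atomic region per block */
        checkpoint();
        for (size_t i = j; i < j + M && i < N; i++) {
          /* software-pipelined bodies of a, b, c, d execute here */
        }
        if (alias_fault())                       /* a speculated memory alias fired */
          rollback_and_reexecute(j);             /* undo the block, re-run unpipelined */
      }
    }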
Framework (on Transmeta CMS)
7
[Diagram: x86 binary → Interpreter & profiler → Hot region optimizations → Acyclic scheduler → Assembler → Code cache]
8
[Diagram: loop selection routes each hot loop either to the software pipeliner (Initialization → Scheduling → Rotating alias register allocation → Code generation) or to the acyclic scheduler. Hardware support: rotating alias register file and atomic regions.]
Scheduling is expensive
9
• Finding an optimal schedule is an NP-complete problem [Colland et al. 1996]
• O(V³) at least, exponential at worst [Rau et al. 1992]
  – V: number of operations
• Can we linearize software pipelining?
Keep scheduling time under control
10
• Schedule in linear time
  – Use either simple or fast sub-algorithms
    • Avoid cubic or exponential complexity
  – For iterative sub-algorithms, set a threshold: the maximum number of iterations allowed
    • The smaller the threshold, the less the compile overhead
    • Once the threshold is exceeded, abort software pipelining
    • Key question: how small can the threshold be?
• Find a good-enough schedule
  – No backtracking
  – Priority function: approximate, and never updated
  – Separate dependence and resource constraints
  – Separate local and loop-carried dependences
• Iteratively improve a schedule
Just-In-Time Software Pipelining
11
[Flow: Prepartition → Local scheduling → Kernel expansion → Rotation → Exit; Prepartition is bounded by the thresholds H and B, Rotation by ≤ 3 passes.]
• Prepartition quickly creates an initial schedule to start with
• Local scheduling handles local dependences and resources
• Kernel expansion adjusts the schedule for loop-carried dependences
• Rotation iteratively improves the schedule
• Time complexity: O(V+E)
  – V: #operations, E: #dependences
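Read as control flow, the four phases compose roughly as follows: a minimal sketch with an opaque Loop type and placeholder phase functions (illustrative assumptions, not the paper's actual interfaces); the ≤ 3 rotation budget and the abort-on-threshold behavior follow the slide.

    #include <stdbool.h>

    typedef struct Loop Loop;   /* opaque loop/dependence-graph handle (illustrative) */

    /* Placeholder phases, stand-ins for the paper's sub-algorithms. */
    static bool prepartition(Loop *L, int H, int B) { (void)L; (void)H; (void)B; return true; }
    static void local_schedule(Loop *L)             { (void)L; }
    static void kernel_expand(Loop *L)              { (void)L; }
    static bool improvable(Loop *L)                 { (void)L; return false; }
    static void rotate(Loop *L)                     { (void)L; }

    /* Each phase is linear, so the whole pass is O(V + E). */
    bool jit_software_pipeline(Loop *L, int H, int B) {
      if (!prepartition(L, H, B))     /* initial schedule; H, B bound its iterative parts */
        return false;                 /* threshold exceeded: abort pipelining */
      local_schedule(L);              /* local dependences + resources */
      kernel_expand(L);               /* repair loop-carried dependences */
      for (int r = 0; r < 3; r++) {   /* rotation budget <= 3, per the slide */
        if (!improvable(L)) break;
        rotate(L);                    /* iteratively improve the kernel */
      }
      return true;
    }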
Illustration
12
[Figure: Prepartition. Operations a, b, c, d of iterations 0-3 are partitioned into a kernel, with RecMII marked; local dependences, loop-carried dependences, and resources are shown.]
Calculating RecMII
13
• Recurrence Minimum II
  – II determined by the biggest dependence cycles
  – Needed by almost every software pipelining method
• Turn the problem into a Markov decision process
  – 1st time the Howard algorithm is applied to pipelining
• Linearize Howard with a small constant H
• A by-product: critical operations

Complexity: conventional O(V³); Howard policy iteration O(exponential · E); linearized Howard O(H · E)
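For context, the sketch below shows what RecMII computes, using the conventional scheme the experiments compare against (binary search over II with a positive-cycle test, in the spirit of [Rau et al. 1992]) rather than the paper's linearized Howard iteration. Under edge weights w(e) = latency(e) - II · distance(e), an II is feasible exactly when no cycle has positive total weight; the Edge layout and the MAX_OPS cap are assumptions.

    #include <stdbool.h>

    #define MAX_OPS 256   /* sketch-only cap on graph size */

    /* Dependence edge u->v: `latency` cycles, `distance` iterations. */
    typedef struct { int u, v, latency, distance; } Edge;

    /* Bellman-Ford-style positive-cycle test: start every node at 0,
       relax nV full passes; if an edge still relaxes, a positive cycle
       exists and II is too small. */
    static bool ii_infeasible(const Edge *E, int nE, int nV, int II) {
      int dist[MAX_OPS] = {0};
      for (int pass = 0; pass < nV; pass++)
        for (int e = 0; e < nE; e++) {
          int w = E[e].latency - II * E[e].distance;
          if (dist[E[e].u] + w > dist[E[e].v])
            dist[E[e].v] = dist[E[e].u] + w;
        }
      for (int e = 0; e < nE; e++) {
        int w = E[e].latency - II * E[e].distance;
        if (dist[E[e].u] + w > dist[E[e].v]) return true;
      }
      return false;
    }

    /* RecMII = smallest II admitting no positive-weight recurrence cycle.
       Feasibility is monotone in II, so binary search applies; Rau et al.
       find the upper bound maxII by doubling (the "exponential backoff"). */
    int rec_mii(const Edge *E, int nE, int nV, int maxII) {
      int lo = 1, hi = maxII;
      while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (ii_infeasible(E, nE, nV, mid)) lo = mid + 1; else hi = mid;
      }
      return lo;
    }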
Divide operations into stages
14
• Also an iterative sub-algorithm
• Linearize it with a constant B
• Specific order in visiting edges
  – Scan nodes in sequential order, and visit their incoming edges
    • Values are propagated along local edges in the 1st iteration
    • Values are propagated along loop-carried edges in the 2nd iteration

Complexity: Bellman-Ford O(V · E); linearized Bellman-Ford O(B · E)
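A minimal sketch of this bounded relaxation, assuming stages are computed as a longest-path fixed point with an illustrative per-edge offset delta (the paper's exact constraint is not spelled out here). Nodes are scanned in program order and their incoming edges relaxed, so forward (local) edges settle in the 1st pass and backward (loop-carried) edges in the 2nd; exceeding B passes aborts pipelining.

    #include <stdbool.h>

    /* Incoming edge u->v with an illustrative stage offset `delta`
       (derived from latency, distance, and II in the real algorithm). */
    typedef struct { int u, delta; } InEdge;

    /* in[v] holds node v's incoming edges; nIn[v] their count.
       Returns false if B passes do not reach a fixed point. */
    bool partition_stages(int nV, const InEdge *const in[], const int nIn[],
                          int stage[], int B) {
      for (int v = 0; v < nV; v++) stage[v] = 0;
      for (int pass = 0; pass < B; pass++) {
        bool changed = false;
        for (int v = 0; v < nV; v++)          /* sequential program order */
          for (int k = 0; k < nIn[v]; k++) {
            int s = stage[in[v][k].u] + in[v][k].delta;
            if (s > stage[v]) { stage[v] = s; changed = true; }
          }
        if (!changed) return true;            /* fixed point: stages final */
      }
      return false;                           /* threshold B exceeded: abort */
    }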
Illustration
15
[Figure: the prepartitioned kernel from slide 12, entering the Local scheduling step.]
Illustration (Cont.)
16
[Figure: local scheduling compacts each iteration's operations (a; b c; d), shrinking II; two dependences, marked X, are now violated.]
Local scheduling
17
• Any local scheduling algorithm can be used
  – E.g., list scheduling
• Weakness:
  – Loop-carried dependences may be violated
• To reduce the chance of violation:
  – Before scheduling, the priority function considers loop-carried dependences in advance
  – Prioritize critical operations
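A generic list-scheduling sketch of this step, under stated assumptions: priorities are precomputed once and never updated, and are assumed to already fold in loop-carried criticality, as the slide prescribes; the flat successor matrix, the MAX_SUCC cap, and a single issue-width resource are illustrative, not the paper's machine model. The caller initializes priority and npred.

    #include <stdbool.h>

    #define MAX_SUCC 4   /* sketch-only cap on successors per op */

    typedef struct {
      int priority;  /* precomputed once, never updated; folds in
                        loop-carried criticality so critical ops go first */
      int npred;     /* not-yet-scheduled local predecessors */
      int ready;     /* earliest cycle allowed by scheduled local deps */
      int cycle;     /* assigned cycle, -1 while unscheduled */
    } Op;

    /* Greedy list scheduler: each cycle, up to `width` ready ops issue,
       highest priority first. Only local dependences and issue width are
       modeled; loop-carried dependences are left to kernel expansion. */
    void list_schedule(Op *op, int n, const int succ[][MAX_SUCC],
                       const int nsucc[], const int lat[], int width) {
      for (int i = 0; i < n; i++) { op[i].cycle = -1; op[i].ready = 0; }
      int scheduled = 0;
      for (int t = 0; scheduled < n; t++) {
        int slots = width;                    /* resource constraint per cycle */
        while (slots > 0) {
          int best = -1;
          for (int i = 0; i < n; i++)         /* highest-priority ready op */
            if (op[i].cycle < 0 && op[i].npred == 0 && op[i].ready <= t &&
                (best < 0 || op[i].priority > op[best].priority))
              best = i;
          if (best < 0) break;                /* nothing ready this cycle */
          op[best].cycle = t; scheduled++; slots--;
          for (int k = 0; k < nsucc[best]; k++) {
            int s = succ[best][k];            /* release local successors */
            op[s].npred--;
            if (op[s].ready < t + lat[best]) op[s].ready = t + lat[best];
          }
        }
      }
    }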
Illustration (Cont.)
18
[Figure: the locally scheduled kernel, with the two violated dependences still marked X, entering the Kernel expansion step.]
19
[Figure: kernel expansion moves d into a later stage, expanding the kernel to respect the marked loop-carried dependences.]
20
[Figure: rotation compacts the expanded schedule into the final kernel; no dependences are violated, and the algorithm exits.]
Experiments
21
• Transmeta CMS on SPEC2k traces
• Functional simulator, to comprehensively
  – Explore thresholds H and B
  – Evaluate compile overhead and schedules' quality
• Cycle-accurate simulator
  – Simulates cache misses, latencies, …
  – Initial performance study
Linearized Howard vs. exponential backoff + binary search [Rau et al. 1992]
22
[Scatter plot: per-loop comparison; y axis 0.1 to 1000 (log scale), x axis #dependences (0 to 700).]
• Average 9X faster
• Peak 812X faster
• H ≤ 3 for 96% of 11,992 loops
• H ≤ 14 for all loops
Linearized Bellman-Ford
23
[Histogram: % of total loops vs. B (2 to 5).]
• B ≤ 3 for 98.8% of 11,992 loops
• B ≤ 5 for all the loops
• From now on, we set H=10 and B=5 → 11,910 loops scheduled
Scheduling overhead & schedules' quality
24
[Bar charts comparing RS2, DESP, and JITSP on scheduling overhead (JITSP normalized to 1X; values up to 12X) and on % loops with optimal schedules (95% for JITSP; 13%, 23%, and 25% also shown).]
• JITSP achieves optimal schedules for 95% of loops
Compile overhead distribution
25
[Pie chart: Hot region optimizations 34%, Software pipelining 33%, Acyclic scheduler 27%, Assembler 6%.]
Note: the acyclic scheduler handles acyclic code, or loops NOT selected for software pipelining
Preliminary performance
26
• 40 hot loops
[Pie chart: successfully generated code 25%, filtered 19%, too many registers 55%, too many unrolls 1%.]
• The architecture has a bottleneck in registers
  – 2/7/24/32 predicate/static alias/integer/floating-point registers available for pipelining
Preliminary performance (Cont.)
27
[Chart: II/MII per loop (1-10) for RS2, DESP, and JITSP, with the optimal (II/MII = 1.00) marked; the lower, the better.]
• JITSP achieves optimal schedules for all but 1 loop
Preliminary performance (Cont.)
28
[Chart: speedup per loop (1-10) for RS2, DESP, and JITSP, ranging from -40% to 40%.]
• JITSP: 10~36% speedup, better than the others
• Exception: loop 2 gets an optimal schedule but slows down due to memory aliases → retranslation needed
• Speedups: swim (5.3%), ammp (4.4%), mcf (3.6%)
Conclusion
29
• 1st linear software pipelining algorithm implemented for dynamic compilers
• Turns a traditionally expensive optimization into linear time, O(V+E)
  – Taking advantage of results from various domains:
    • hardware circuit design (retiming)
    • stochastic control (Howard algorithm)
    • graph algorithms (Bellman-Ford)
    • software pipelining (rotation scheduling and DESP)
• Generates optimal or near-optimal schedules with reasonable compile overhead
Future work
30
• Register availability
  – Add more architectural registers
  – Algorithm: make it register-pressure-aware
• Implementation
  – Loop selection
  – Re-translation
• Evaluation
  – Benchmarks
Backup slides
31
Howard algorithm
32
• Each operation is rewarded to reach a cycle via one policy edge
  – The bigger the cycle, the more the reward
[Figure: the example dependence graph (a, b, c, d), shown across successive policy-iteration steps.]
Distribution statistics of JITSP
33