Static Code Scheduling
CS 671, April 1, 2008
2 CS 671 – Spring 2008
Code Scheduling
Scheduling or reordering instructions to improve performance and/or guarantee correctness
• Important for dynamically-scheduled architectures
• Crucial (assumed!) for statically-scheduled architectures, e.g. VLIW or EPIC
Takes into account anticipated latencies
• Machine-specific, performed later in the optimization pass
How does this contrast with our earlier exploration of code motion?
Why Must the Compiler Schedule?
Many machines are pipelined and expose some aspects of pipelining to the user (compiler)
Examples:
• Branch delay slots!
• Memory-access delays
• Multi-cycle operations
Some machines don’t have scheduling hardware
Example
Assume loads take 2 cycles and branches have a delay slot.
instruction        start time
r2 ← [r1]
r3 ← [r1+4]
r4 ← r2 + r3
r5 ← r2 + 1
goto L1
nop
____ cycles
Example
Assume loads take 2 cycles and branches have a delay slot.
instruction        start time
r2 ← [r1]
r3 ← [r1+4]
r5 ← r2 + 1
goto L1
r4 ← r2 + r3
____ cycles
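The two orderings can be compared with a small timing sketch. The model is an assumption, not something the slides state: single issue, in order, a load's result usable 2 cycles after the load starts, every other instruction 1 cycle. The helper `start_times` is hypothetical.

```python
# Assumed timing model (not stated on the slides): single issue, in order;
# an instruction starts in the first cycle in which all of its sources are ready.
def start_times(instrs):
    """instrs: list of (text, dest, sources, latency).
    Returns (start cycle of each instruction, total issue cycles)."""
    ready = {}   # register -> first cycle in which its value can be used
    cycle = 0    # next free issue cycle
    starts = []
    for _text, dest, srcs, latency in instrs:
        start = max([cycle] + [ready.get(r, 0) for r in srcs])
        starts.append(start)
        if dest is not None:
            ready[dest] = start + latency
        cycle = start + 1
    return starts, cycle

LOAD, ALU = 2, 1  # "loads take 2 cycles"; everything else assumed 1 cycle

original = [
    ("r2 <- [r1]",    "r2", ["r1"], LOAD),
    ("r3 <- [r1+4]",  "r3", ["r1"], LOAD),
    ("r4 <- r2 + r3", "r4", ["r2", "r3"], ALU),
    ("r5 <- r2 + 1",  "r5", ["r2"], ALU),
    ("goto L1",       None, [], ALU),
    ("nop",           None, [], ALU),   # delay slot
]
# Reordered version: r5 fills the load stall, r4 fills the delay slot.
reordered = [original[0], original[1], original[3], original[4], original[2]]

print(start_times(original))   # ([0, 1, 3, 4, 5, 6], 7)
print(start_times(reordered))  # ([0, 1, 2, 3, 4], 5)
```

Under this model the reordering hides the load stall and replaces the nop, so the block issues in 5 cycles instead of 7. A different latency convention would shift the numbers.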
Code Scheduling Strategy
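Get resources operating in parallel
• Integer data path
• Integer multiply / divide hardware
• FP adder, multiplier, divider
Method
• Fill the gap between starting an operation and using its result with computations that do not require the result or the same hardware resources
Drawbacks
• Highly hardware dependent
[Figure: timeline from "Start Op" to "Use Op"; the gap between them is labeled "Try to fill"]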
Scheduling Approaches
Local
• Branch scheduling
• Basic-block scheduling
Global
• Cross-block scheduling
• Software pipelining
• Trace scheduling
• Percolation scheduling
Branch Scheduling
Two problems:
• Branches often take some number of cycles to complete
• There can be a delay between a compare and its associated branch
A compiler will try to fill these slots with valid instructions (rather than nop)
• Delay slots – present in PA-RISC, SPARC, MIPS
• Condition delay – PowerPC, Pentium
Recall from Architecture…
IF – Instruction Fetch
ID – Instruction Decode
EX – Execute
MA – Memory access
WB – Write back
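[Figure: three consecutive instructions overlapped in the pipeline, each starting one cycle after the previous one]
IF  ID  EX  MA  WB
    IF  ID  EX  MA  WB
        IF  ID  EX  MA  WB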
Control Hazards
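[Figure: pipeline stages for a taken branch; the instruction after the branch is fetched and then squashed]
Taken Branch:       IF  ID  EX  MA  WB
Instr + 1:          IF  --- --- --- ---
Branch Target:      IF  ID  EX  MA  WB
Branch Target + 1:  IF  ID  EX  MA  WB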
Data Dependences
If two operations access the same register and at least one of them writes it, they are dependent
Types of data dependences
Flow (read after write):
r1 = r2 + r3
r4 = r1 * 6
Output (write after write):
r1 = r2 + r3
r1 = r4 * 6
Anti (write after read):
r1 = r2 + r3
r2 = r5 * 6
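As a minimal sketch, the three cases can be told apart by comparing the destination and source registers of two instructions. The `(dest, sources)` representation and the function `classify` are illustrative assumptions.

```python
def classify(first, second):
    """Each instruction is (dest, sources); dest may be None.
    Returns the kinds of dependence that keep `second` after `first`."""
    d1, s1 = first
    d2, s2 = second
    kinds = set()
    if d1 is not None and d1 in s2:
        kinds.add("flow")    # second reads what first wrote
    if d1 is not None and d1 == d2:
        kinds.add("output")  # both write the same register
    if d2 is not None and d2 in s1:
        kinds.add("anti")    # second overwrites what first read
    return kinds

first = ("r1", {"r2", "r3"})                 # r1 = r2 + r3
print(classify(first, ("r4", {"r1"})))       # {'flow'}
print(classify(first, ("r1", {"r4"})))       # {'output'}
print(classify(first, ("r2", {"r5"})))       # {'anti'}
```

An empty result means the pair may be reordered as far as registers are concerned. Memory dependences are not modeled here.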
Data Hazards
lw  R1,0(R2)     IF     ID     EX     MA     WB
add R3,R1,R4            IF     ID     stall  EX     MA     WB
Memory latency: data not ready
Data Hazards
addf R3,R1,R2    IF     ID     EX     EX     MA     WB
addf R3,R3,R4           IF     ID     stall  EX     EX     MA     WB
Assumes floating point ops take 2 execute cycles
Instruction latency: execute takes > 1 cycle
Multi-cycle Instructions
• Scheduling is particularly important for multi-cycle operations
• Alpha instructions > 1 cycle latency (partial list)

Instruction   Operation                          Latency (cycles)
mull          32-bit integer multiply            8
mulq          64-bit integer multiply            16
addt          fp add                             4
mult          fp multiply                        4
divs          fp single-precision divide         10
divt          fp double-precision divide         23
Avoiding data hazards
• Move loads earlier and stores later (assuming this does not violate correctness)
• Other stalls may require more sophisticated re-ordering, e.g. ((a+b)+c)+d becomes (a+b)+(c+d)
• How can we do this in a systematic way?
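To see why the re-association helps, a small sketch can count the longest chain of dependent additions in each form. The nested-tuple encoding and the function `depth` are illustrative assumptions.

```python
def depth(expr):
    """expr is a variable name, or a pair (left, right) meaning left + right.
    Returns the number of additions on the longest dependent chain."""
    if isinstance(expr, str):
        return 0
    left, right = expr
    return 1 + max(depth(left), depth(right))

chain = ((("a", "b"), "c"), "d")      # ((a+b)+c)+d
balanced = (("a", "b"), ("c", "d"))   # (a+b)+(c+d)

print(depth(chain))     # 3: every add waits for the previous one
print(depth(balanced))  # 2: a+b and c+d are independent of each other
```

The balanced form exposes two independent adds that can overlap. For floating-point values the re-association can change the rounded result, so it is only safe when that is acceptable.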
Example: Without Scheduling
Assume:
• memory instrs take 3 cycles
• mult takes 2 cycles (to have result in register)
• rest take 1 cycle

Code                 Start Time
lw   r1, w
add  r1, r1, r1
lw   r2, x
mult r1, r1, r2
lw   r2, y
mult r1, r1, r2
lw   r2, z
mult r1, r1, r2
sw   r1, a
____ cycles
Basic Block Dependence DAGs
Nodes - instructions
Edges - dependence between I1 and I2
• When we cannot determine whether there is a dependence, we must assume there is one
a) lw R2, (R1)
b) lw R3, (R1) 4
c) R4 ← R2 + R3
d) R5 ← R2 - 1
[Figure: dependence DAG with edges a → d, a → c, and b → c, each labeled with latency 2]
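A minimal sketch of building the flow-dependence edges for this four-instruction block. The tuple encoding, the `LATENCY` table (loads 2 cycles, as in the earlier example), and the function `flow_edges` are assumptions; anti, output, and memory dependences are left out.

```python
# Assumed latencies: loads 2 cycles, ALU operations 1 cycle.
LATENCY = {"lw": 2, "add": 1, "sub": 1}

block = [
    ("a", "lw",  "R2", ["R1"]),
    ("b", "lw",  "R3", ["R1"]),
    ("c", "add", "R4", ["R2", "R3"]),
    ("d", "sub", "R5", ["R2"]),
]

def flow_edges(block):
    """Return {(producer, consumer): latency} for flow (true) dependences only."""
    last_def = {}   # register -> (label, opcode) of its most recent writer
    edges = {}
    for label, op, dest, srcs in block:
        for reg in srcs:
            if reg in last_def:
                producer, producer_op = last_def[reg]
                edges[(producer, label)] = LATENCY[producer_op]
        last_def[dest] = (label, op)
    return edges

print(flow_edges(block))
# {('a', 'c'): 2, ('b', 'c'): 2, ('a', 'd'): 2}
```

Each edge weight is the latency of the producing instruction, which is how long the consumer must wait after the producer starts.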
Example – Build the DAG
Assume:
• memory instrs = 3 cycles
• mult = 2 cycles (to have result in register)
• rest = 1 cycle

Code
a lw   r1, w
b add  r1, r1, r1
c lw   r2, x
d mult r1, r1, r2
e lw   r2, y
f mult r1, r1, r2
g lw   r2, z
h mult r1, r1, r2
i sw   r1, a
Creating a schedule
• Create a DAG of dependences
• Determine priority
• Schedule instructions with
  – Ready operands
  – Highest priority
• Heuristics: If multiple possibilities, fall back on other priority functions
Operation Priority
Priority – Need a mechanism to decide which ops to schedule first (when you have choices)
Common priority functions
• Height – Distance from exit node
  – Give priority to amount of work left to do
• Slackness – inversely proportional to slack
  – Give priority to ops on the critical path
• Register use – priority to nodes with more source operands and fewer destination operands
  – Reduces number of live registers
• Uncover – high priority to nodes with many children
  – Frees up more nodes
• Original order – when all else fails
Computing Priorities
Height(n) =
• exec(n) if n is a leaf
• max(height(m)) + exec(n) over all m, where m is a successor of n
Critical path(s) = path through the dependence DAG with longest latency
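A sketch of the height recurrence applied to the nine-instruction example (a through i) from the Build the DAG slide. It assumes the DAG contains only flow dependences, which ignores the reuse of r2; the `EXEC` and `SUCCS` tables below encode that assumption and are not taken from the slides.

```python
from functools import lru_cache

# Execution times: memory = 3, mult = 2, rest = 1.
EXEC = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}
# Successors under flow dependences only (assumption: r2 reuse is ignored).
SUCCS = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
         "f": ["h"], "g": ["h"], "h": ["i"], "i": []}

@lru_cache(maxsize=None)
def height(n):
    """exec(n) for a leaf, otherwise exec(n) plus the tallest successor."""
    if not SUCCS[n]:
        return EXEC[n]
    return EXEC[n] + max(height(m) for m in SUCCS[n])

def critical_path(start):
    """Follow the tallest successor from `start` down to a leaf."""
    path = [start]
    while SUCCS[path[-1]]:
        path.append(max(SUCCS[path[-1]], key=height))
    return path

heights = {n: height(n) for n in EXEC}
print(heights)
print(critical_path(max(EXEC, key=height)))
```

With these tables the tallest node is a, with height 13, and the path found from it is a, b, d, f, h, i.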
Example – Determine Height and CP
Code
a lw r1, w
b add r1,r1,r1
c lw r2,x
d mult r1,r1,r2
e lw r2,y
f mult r1,r1,r2
g lw r2,z
h mult r1,r1,r2
i sw r1, a
Assume: memory instrs = 3, mult = 2 (to have result in register), rest = 1 cycle
Critical path: _______
Example – List Scheduling
Code
a lw r1, w
b add r1,r1,r1
c lw r2,x
d mult r1,r1,r2
e lw r2,y
f mult r1,r1,r2
g lw r2,z
h mult r1,r1,r2
i sw r1, a
Schedule: start time for each instruction
_____ cycles
Scheduling vs. Register Allocation
Code
a lw r1 ← (r12)
b lw r2 ← (r12+4)
c r1 ← r1+r2
d stw (r12) ← r1
e lw r1 ← (r12+8)
f lw r2 ← (r12+12)
g r2 ← r1+r2
Register Renaming
Code
a lw r1 ← (r12)
b lw r2 ← (r12+4)
c r3 ← r1+r2
d stw (r12) ← r3
e lw r4 ← (r12+8)
f lw r5 ← (r12+12)
g r6 ← r4+r5
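One way to sketch the renaming step is to give every definition a fresh register while leaving live-in registers (here r12) alone. The `(dest, sources)` encoding and the function `rename` are illustrative assumptions; the sketch also assumes an unlimited supply of registers and that no renamed value is needed under its old name after the block.

```python
import itertools

def rename(block):
    """block: list of (dest, sources); dest is None for stores.
    Every definition gets a fresh register; live-in registers keep their names."""
    defined, live_in = set(), set()
    for dest, srcs in block:
        live_in.update(r for r in srcs if r not in defined)
        if dest is not None:
            defined.add(dest)
    fresh = (f"r{n}" for n in itertools.count(1))
    current = {}   # original name -> name of its latest definition
    renamed = []
    for dest, srcs in block:
        new_srcs = [current.get(r, r) for r in srcs]
        new_dest = None
        if dest is not None:
            # Take the next fresh name that does not clash with a live-in register.
            new_dest = next(name for name in fresh if name not in live_in)
            current[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed

before = [
    ("r1", ["r12"]),        # a lw r1 <- (r12)
    ("r2", ["r12"]),        # b lw r2 <- (r12+4)
    ("r1", ["r1", "r2"]),   # c r1 <- r1+r2
    (None, ["r12", "r1"]),  # d stw (r12) <- r1
    ("r1", ["r12"]),        # e lw r1 <- (r12+8)
    ("r2", ["r12"]),        # f lw r2 <- (r12+12)
    ("r2", ["r1", "r2"]),   # g r2 <- r1+r2
]
for dest, srcs in rename(before):
    print(dest, srcs)
```

On this input the destinations come out as r1, r2, r3, (store), r4, r5, r6, so loads e and f no longer conflict with c and d and can be moved ahead of them.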
VLIW
• Very Long Instruction Word
• Compiler determines exactly what is issued every cycle (before the program is run)
• Schedules also account for latencies
• All hardware changes result in a compiler change
• Usually embedded systems (hence simple HW)
• Itanium is actually an EPIC-style machine (accounts for most parallelism, not latencies)
Sample VLIW code
Add/Sub      Add/Sub      Mul/Div      Ld/St         Branch
c = a + b    d = a - b    e = a * b    ld j = [x]    nop
g = c + d    h = c - d    nop          ld k = [y]    nop
nop          nop          i = j * c    ld f = [z]    br g

VLIW processor: 5 issue
• 2 Add/Sub units (1 cycle)
• 1 Mul/Div unit (2 cycle, unpipelined)
• 1 LD/ST unit (2 cycle, pipelined)
• 1 Branch unit (no delay slots)
Multi-Issue Scheduling Example
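Machine: 2 issue, 1 memory port, 1 ALU
Memory port = 2 cycles, non-pipelined
ALU = 1 cycle
[Figure: dependence DAG over operations 1 to 10; operations 2m, 3m, 5m, and 7m are memory operations]
RU_map: table with columns time, ALU, MEM for times 0 to 9 (left blank)
Schedule: table with columns time, Ready, Placed for times 0 to 9 (left blank)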
Earliest Latest Sets
Machine: 2 issue, 1 memory port, 1 ALU
Memory port = 2 cycles, pipelined
ALU = 1 cycle
[Figure: dependence DAG over operations 1 to 10; operations 1m, 2m, 4m, and 9m are memory operations]
List Scheduling Algorithm
Build dependence graph, calculate priority
Add all ops to UNSCHEDULED set
time = 0
while (UNSCHEDULED is not empty)
    time++
    READY = UNSCHEDULED ops whose incoming deps have been satisfied
    Sort READY using priority function
    For each op in READY (highest to lowest priority)
        op can be scheduled at current time? (resources free?)
            Yes: schedule it, op.issue_time = time
                 Mark resources busy in RU_map relative to issue time
                 Remove op from UNSCHEDULED/READY sets
            No: continue
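The loop above can be sketched in Python for the nine-instruction example (a through i). Several simplifications are assumptions rather than part of the slides: the RU_map is reduced to a plain issue-width limit, priority is the node height, and only flow dependences are used, so the resulting order is legal only if r2 is renamed as on the Register Renaming slide.

```python
# Execution times: memory = 3, mult = 2, rest = 1.
EXEC = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}
# Flow-dependence predecessors only (assumes r2 has been renamed).
PREDS = {"a": [], "b": ["a"], "c": [], "d": ["b", "c"], "e": [],
         "f": ["d", "e"], "g": [], "h": ["f", "g"], "i": ["h"]}
# Priority = height of each node in the dependence DAG.
HEIGHT = {"a": 13, "b": 10, "c": 12, "d": 9, "e": 10, "f": 7, "g": 8, "h": 5, "i": 3}

def list_schedule(exec_time, preds, priority, issue_width=1):
    """Returns {op: issue time}. Times start at 1, as in the pseudocode."""
    unscheduled = set(exec_time)
    issue = {}
    time = 0
    while unscheduled:
        time += 1
        # An op is ready once every predecessor has issued and finished.
        ready = [op for op in unscheduled
                 if all(p in issue and issue[p] + exec_time[p] <= time
                        for p in preds[op])]
        ready.sort(key=lambda op: (-priority[op], op))
        for op in ready[:issue_width]:   # resource check: issue slots only
            issue[op] = time
            unscheduled.remove(op)
    return issue

schedule = list_schedule(EXEC, PREDS, HEIGHT)
length = max(t + EXEC[op] for op, t in schedule.items()) - 1
print(sorted(schedule.items(), key=lambda kv: kv[1]))
print(length, "cycles")
```

Under these assumptions the issue order is a, c, e, b, d, g, f, h, i and the last instruction completes in cycle 13. A multi-issue machine would need a per-unit resource map instead of the single `issue_width` limit.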
Improving Basic Block Scheduling
• Loop unrolling – creates longer basic blocks
• Register renaming – can change register usage in blocks to remove immediate reuse of registers
Summary
• Static scheduling complements (or replaces) dynamic scheduling by the hardware