Compiler Optimisation 6 – Instruction Scheduling
Hugh Leather (IF 1.18a)
hleather@inf.ed.ac.uk
Institute for Computing Systems Architecture
School of Informatics
University of Edinburgh
2019
Introduction

This lecture:
- Scheduling to hide latency and exploit ILP
- Dependence graph
- Local list scheduling + priorities
- Forward versus backward scheduling
- Software pipelining of loops
Latency, functional units, and ILP

- Instructions take clock cycles to execute (latency)
- Modern machines issue several operations per cycle
- Cannot use results until ready, but can do something else
- Execution time is order-dependent
- Latencies are not always constant (cache, early exit, etc.)

Operation            Cycles
load, store          3
load (not in cache)  100s
loadI, add, shift    1
mult                 2
div                  40
branch               0–8
Machine types

- In order: deep pipelining allows multiple instructions in flight
- Superscalar: multiple functional units, can issue > 1 instruction per cycle
- Out of order: large window of instructions can be reordered dynamically
- VLIW: compiler statically allocates operations to FUs
Effect of scheduling
Superscalar, 1 FU: new op each cycle if operands ready

Simple schedule: a := 2*a*b*c

Cycle  Operation                 Operands waiting
1      loadAI rarp, @a ⇒ r1      r1
2                                r1
3                                r1
4      add r1, r1 ⇒ r1           r1
5      loadAI rarp, @b ⇒ r2      r2
6                                r2
7                                r2
8      mult r1, r2 ⇒ r1          r1
9      loadAI rarp, @c ⇒ r2      r1, r2   (next op does not use r1, so it can issue)
10                               r2
11                               r2
12     mult r1, r2 ⇒ r1          r1
13                               r1
14     storeAI r1 ⇒ rarp, @a     store to complete
15                               store to complete
16                               store to complete
Done in 16 cycles.

(loads/stores take 3 cycles, mults 2, adds 1)
Effect of scheduling
Superscalar, 1 FU: new op each cycle if operands ready

Schedule loads early: a := 2*a*b*c

Cycle  Operation                 Operands waiting
1      loadAI rarp, @a ⇒ r1      r1
2      loadAI rarp, @b ⇒ r2      r1, r2
3      loadAI rarp, @c ⇒ r3      r1, r2, r3
4      add r1, r1 ⇒ r1           r1, r2, r3
5      mult r1, r2 ⇒ r1          r1, r3
6                                r1
7      mult r1, r3 ⇒ r1          r1
8                                r1
9      storeAI r1 ⇒ rarp, @a     store to complete
10                               store to complete
11                               store to complete
Done in 11 cycles. Uses one more register.

11 versus 16 cycles – 31% faster!

(loads/stores take 3 cycles, mults 2, adds 1)
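The two cycle counts can be reproduced with a few lines of code. Below is a sketch of our own (not from the lecture) of a single-issue, in-order machine: an operation issues in the first cycle in which all of its operands are ready, and its result becomes available a latency later.

```python
# Latencies from the slides: loads/stores 3 cycles, mults 2, adds 1.
LATENCY = {"loadAI": 3, "storeAI": 3, "add": 1, "mult": 2}

def simulate(ops):
    """ops: (opcode, sources, dest) triples in program order.
    Returns the cycle in which the last operation completes."""
    ready = {}                 # register -> cycle its value becomes usable
    cycle, finish = 1, 1
    for opcode, srcs, dst in ops:
        # Stall until every source operand is available (in-order issue).
        issue = max([cycle] + [ready.get(r, 1) for r in srcs])
        done = issue + LATENCY[opcode]     # first cycle after completion
        if dst is not None:
            ready[dst] = done
        cycle = issue + 1                  # single issue: one op per cycle
        finish = max(finish, done)
    return finish - 1

naive = [("loadAI", [], "r1"), ("add", ["r1", "r1"], "r1"),
         ("loadAI", [], "r2"), ("mult", ["r1", "r2"], "r1"),
         ("loadAI", [], "r2"), ("mult", ["r1", "r2"], "r1"),
         ("storeAI", ["r1"], None)]

loads_early = [("loadAI", [], "r1"), ("loadAI", [], "r2"),
               ("loadAI", [], "r3"), ("add", ["r1", "r1"], "r1"),
               ("mult", ["r1", "r2"], "r1"), ("mult", ["r1", "r3"], "r1"),
               ("loadAI", [], "r1")[:0] or ("storeAI", ["r1"], None)]

print(simulate(naive), simulate(loads_early))   # → 16 11
```

Reordering the same seven operations (at the cost of one extra register) is all it takes to go from 16 cycles to 11.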
Scheduling problem

A schedule maps each operation to a cycle: ∀a ∈ Ops, S(a) ∈ ℕ
- Respect latency: ∀a, b ∈ Ops, a depends on b ⟹ S(a) ≥ S(b) + λ(b)
- Respect functional units: no more ops of each type per cycle than the FUs can handle

Length of schedule: L(S) = max over a ∈ Ops of (S(a) + λ(a))
Schedule S is time-optimal if ∀S1, L(S) ≤ L(S1)

Problem: Find a time-optimal schedule³

Even local scheduling with many restrictions is NP-complete.

³ A schedule might also be optimal in terms of registers, power, or space.
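These definitions transcribe almost directly into code. The sketch below (function and operation names are ours) checks a schedule against the latency and FU constraints and computes L(S); for the 11-cycle "loads early" schedule it returns a length of 12, the first free cycle after the store completes at the end of cycle 11.

```python
def is_valid(S, deps, latency, fu_limit):
    """S: op -> issue cycle; deps: (a, b) pairs meaning 'a depends on b';
    latency: op -> cycles; fu_limit: ops that may issue per cycle."""
    # Respect latency: S(a) >= S(b) + lambda(b) for every dependence.
    for a, b in deps:
        if S[a] < S[b] + latency[b]:
            return False
    # Respect functional units: no cycle may be oversubscribed.
    per_cycle = {}
    for op, c in S.items():
        per_cycle[c] = per_cycle.get(c, 0) + 1
    return all(n <= fu_limit for n in per_cycle.values())

def schedule_length(S, latency):
    # L(S) = max over a of S(a) + lambda(a)
    return max(S[a] + latency[a] for a in S)

# The 11-cycle "loads early" schedule, with ops named for readability.
latency = {"load_a": 3, "load_b": 3, "load_c": 3,
           "add": 1, "mult1": 2, "mult2": 2, "store": 3}
deps = [("add", "load_a"), ("mult1", "add"), ("mult1", "load_b"),
        ("mult2", "mult1"), ("mult2", "load_c"), ("store", "mult2")]
S = {"load_a": 1, "load_b": 2, "load_c": 3, "add": 4,
     "mult1": 5, "mult2": 7, "store": 9}

print(is_valid(S, deps, latency, fu_limit=1))   # → True
print(schedule_length(S, latency))              # → 12
```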
List scheduling

Local greedy heuristic to produce schedules for single basic blocks:
1. Rename to avoid anti-dependences
2. Build dependency graph
3. Prioritise operations
4. For each cycle:
   1. Choose the highest-priority ready operation and schedule it
   2. Update the ready queue
List scheduling
Dependence/Precedence graph

- Schedule an operation only when its operands are ready
- Build a dependency graph of read-after-write (RAW) deps
- Label it with latency and FU requirements
- Anti-dependences (WAR) restrict movement – renaming removes them

Example: a = 2*a*b*c
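A RAW dependence graph can be built in one pass over the block by remembering each value's most recent definition. The sketch below uses our own names for the renamed values of a = 2*a*b*c, with 2*a computed as a + a as in the slides:

```python
# Ops after renaming: (name, values defined, values used).
block = [
    ("load_a", ["a0"], []),
    ("load_b", ["b0"], []),
    ("load_c", ["c0"], []),
    ("add",    ["t1"], ["a0", "a0"]),   # 2*a computed as a + a
    ("mult1",  ["t2"], ["t1", "b0"]),
    ("mult2",  ["t3"], ["t2", "c0"]),
    ("store",  [],     ["t3"]),
]

def raw_edges(block):
    last_def = {}                        # value -> op that defined it
    edges = set()
    for name, defs, uses in block:
        for v in uses:
            if v in last_def:
                edges.add((last_def[v], name))   # producer -> consumer
        for v in defs:
            last_def[v] = name
    return sorted(edges)

for src, dst in raw_edges(block):
    print(src, "->", dst)
```

Because the block is already renamed, only true (RAW) dependences appear; WAR edges would need a second map from each value to its last readers.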
List scheduling

List scheduling algorithm:

    Cycle ← 1
    Ready ← leaves of D
    Active ← ∅
    while (Ready ∪ Active ≠ ∅)
        ∀ a ∈ Active where S(a) + λ(a) ≤ Cycle
            Active ← Active − a
            ∀ b ∈ succs(a) where isready(b)
                Ready ← Ready ∪ b
        if ∃ a ∈ Ready such that ∀ b ∈ Ready, priority(a) ≥ priority(b)
            Ready ← Ready − a
            S(a) ← Cycle
            Active ← Active ∪ a
        Cycle ← Cycle + 1
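The pseudocode translates almost line for line into Python. This sketch (our own naming; single-issue for simplicity) reproduces the 11-cycle schedule for a = 2*a*b*c when driven by precomputed critical-path priorities:

```python
def list_schedule(deps, latency, priority):
    """deps: op -> list of producer ops it depends on. Single-issue."""
    succs = {a: [] for a in deps}
    for a, preds in deps.items():
        for b in preds:
            succs[b].append(a)
    waiting = {a: len(preds) for a, preds in deps.items()}  # unfinished preds
    ready = {a for a, n in waiting.items() if n == 0}       # the leaves of D
    active, S, cycle = set(), {}, 1
    while ready or active:
        # Retire ops whose results are available this cycle; wake successors.
        for a in sorted(active):
            if S[a] + latency[a] <= cycle:
                active.discard(a)
                for b in succs[a]:
                    waiting[b] -= 1
                    if waiting[b] == 0:
                        ready.add(b)
        if ready:                       # issue the highest-priority ready op
            a = max(ready, key=priority)
            ready.discard(a)
            S[a] = cycle
            active.add(a)
        cycle += 1
    return S

LAT = {"load_a": 3, "load_b": 3, "load_c": 3,
       "add": 1, "mult1": 2, "mult2": 2, "store": 3}
DEPS = {"load_a": [], "load_b": [], "load_c": [], "add": ["load_a"],
        "mult1": ["add", "load_b"], "mult2": ["mult1", "load_c"],
        "store": ["mult2"]}
PRIO = {"load_a": 11, "load_b": 10, "load_c": 8, "add": 8,
        "mult1": 7, "mult2": 5, "store": 3}   # critical-path lengths

S = list_schedule(DEPS, LAT, PRIO.get)
print(S)   # load_a..load_c in cycles 1-3, store issues in cycle 9
```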
List scheduling
Priorities

Many different priorities are used; the quality of the schedules depends on a good choice.

The longest-latency path, or critical path, is a good priority. Tie breakers:
- Last use of a value – decreases demand for registers by moving the use nearer its definition
- Number of descendants – encourages the scheduler to pursue multiple paths
- Longer latency first – others can fit in its shadow
- Random
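For the running example, critical-path priorities can be computed bottom-up over the RAW graph. A sketch, with our own op names and the latencies from the earlier footnote:

```python
from functools import lru_cache

LAT = {"load_a": 3, "load_b": 3, "load_c": 3,
       "add": 1, "mult1": 2, "mult2": 2, "store": 3}
SUCCS = {"load_a": ["add"], "load_b": ["mult1"], "load_c": ["mult2"],
         "add": ["mult1"], "mult1": ["mult2"], "mult2": ["store"],
         "store": []}

@lru_cache(maxsize=None)
def critical_path(op):
    # Latency of op plus the longest path through any successor.
    rest = max((critical_path(s) for s in SUCCS[op]), default=0)
    return LAT[op] + rest

prios = {op: critical_path(op) for op in LAT}
print(prios)   # load_a has the longest path (11), so it issues first
```

Scheduling by descending priority issues load_a, load_b, load_c before the add, exactly the "loads early" order that saved five cycles.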
List scheduling
Example: Schedule with priority by critical path length
List scheduling
Forward vs backward

- Can also schedule from roots to leaves (backward)
- May change schedule time
- List scheduling is cheap, so try both and choose the best
Opcode    loadI  lshift  add  addI  cmp  store
Latency   1      1       2    1     1    4
Forwards (cbr issues in cycle 13):

Cycle  Int FU 1  Int FU 2  Stores
1      loadI1    lshift
2      loadI2    loadI3
3      loadI4    add1
4      add2      add3
5      add4      addI      store1
6      cmp                 store2
7                          store3
8                          store4
9                          store5
13     cbr

Backwards (cbr issues in cycle 12):

Cycle  Int FU 1  Int FU 2  Stores
1      loadI4
2      addI      lshift
3      add4      loadI3
4      add3      loadI2    store5
5      add2      loadI1    store4
6      add1                store3
7                          store2
8                          store1
11     cmp
12     cbr
Scheduling Larger Regions

- Schedule extended basic blocks (EBBs)
- Superblock cloning
- Schedule traces
- Software pipelining
Scheduling Larger Regions
Extended basic blocks

An EBB is a maximal set of blocks such that:
- the set has a single entry, Bi
- each block Bj other than Bi has exactly one predecessor
Scheduling Larger Regions
Extended basic blocks

- Schedule entire paths through EBBs
- Example has four EBB paths
- Having B1 in both causes conflicts
- Moving an op out of B1 causes problems – must insert compensation code
- Moving an op into B1 causes problems
Scheduling Larger Regions
Superblock cloning

- Join points create context problems
- Clone blocks to create more context
- Merge any simple control flow
- Schedule EBBs
Scheduling Larger Regions
Trace scheduling

- Edge frequency from profile (not block frequency)
- Pick "hot" path
- Schedule with compensation code
- Remove from CFG
- Repeat
Loop scheduling

- Loop structures can dominate execution time
- Specialist technique: software pipelining
- Allows application of list scheduling to loops

Why not loop unrolling?
- Allows the loop overhead to become arbitrarily small, but
- Code growth, cache pressure, register pressure
Software pipelining

Consider a simple loop to sum an array.

Schedule on 1 FU – 5 cycles
(load 3 cycles, add 1 cycle, branch 1 cycle)
Schedule on a VLIW with 3 FUs – 4 cycles
A better steady-state schedule exists
Requires a prologue and epilogue (may schedule other ops in the epilogue)
Respect dependences and latency – including loop carries
Complete code
Software pipelining
Some definitions

Initiation interval (ii): the number of cycles between initiating loop iterations
- Original loop had an ii of 5 cycles
- Final loop had an ii of 2 cycles

Recurrence: a loop-based computation whose value is used in a later loop iteration
- Might be several iterations later
- Has dependency chain(s) on itself
- Recurrence latency is the latency of the dependency chain
Software pipelining
Algorithm

1. Choose an initiation interval, ii
   - Compute lower bounds on ii
   - Shorter ii means faster overall execution
2. Generate a loop body that takes ii cycles
   - Try to schedule into ii cycles, using a modulo scheduler
   - If it fails, increase ii by one and try again
3. Generate the needed prologue and epilogue code
   - For the prologue, work backward from upward-exposed uses in the scheduled loop body
   - For the epilogue, work forward from downward-exposed definitions in the scheduled loop body
Software pipelining
Initial initiation interval (ii)

Starting value for ii based on minimum resource and recurrence constraints.

Resource constraint:
- ii must be large enough to issue every operation
- Let Nu = number of FUs of type u
- Let Iu = number of operations needing a FU of type u
- ⌈Iu/Nu⌉ is a lower bound on ii for type u
- max over u of ⌈Iu/Nu⌉ is a lower bound on ii
Recurrence constraint:
- ii cannot be smaller than the longest recurrence latency
- Recurrence r spans kr iterations with latency λr
- ⌈λr/kr⌉ is a lower bound on ii for recurrence r
- max over r of ⌈λr/kr⌉ is a lower bound on ii
Start value = max(max over u of ⌈Iu/Nu⌉, max over r of ⌈λr/kr⌉)
For the simple loop:

    a = A[i]
    b = b + a
    i = i + 1
    if i < n goto start
Resource constraint:
           Memory  Integer  Branch
Iu         1       2        1
Nu         1       1        1
⌈Iu/Nu⌉    1       2        1

Recurrence constraint:
           b   i
kr         1   1
λr         2   1
⌈λr/kr⌉    2   1
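Both lower bounds are one-line computations. A sketch plugging in the numbers from the tables above:

```python
import math

I_u = {"memory": 1, "integer": 2, "branch": 1}   # ops needing each FU type
N_u = {"memory": 1, "integer": 1, "branch": 1}   # FUs of each type
recurrences = {"b": (2, 1), "i": (1, 1)}         # r -> (lambda_r, k_r)

# Resource bound: every op must issue within one initiation interval.
resource_bound = max(math.ceil(I_u[u] / N_u[u]) for u in I_u)
# Recurrence bound: a loop-carried chain must fit its iteration span.
recurrence_bound = max(math.ceil(lam / k) for lam, k in recurrences.values())

ii = max(resource_bound, recurrence_bound)
print(ii)   # → 2, matching the 2-cycle steady state found earlier
```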
Software pipelining
Modulo scheduling

Schedule with the cycle taken modulo the initiation interval.
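A minimal modulo scheduler can be sketched as follows. This is our own simplification, not the lecture's implementation: ops are visited in a dependence-respecting order and each is placed at the earliest cycle whose slot (cycle mod ii) still has a free FU of the right type; loop-carried dependences and backtracking, which a real modulo scheduler must handle, are omitted. With ii = 2 it finds a 2-cycle kernel for the array-sum loop: the load and i+1 share one modulo slot, the add and branch the other.

```python
def modulo_schedule(ops, deps, latency, fu_of, n_fus, ii):
    """ops: topological order; returns op -> cycle, or None if ii too small."""
    table = {u: [0] * ii for u in n_fus}      # modulo reservation table
    S = {}
    for op in ops:
        earliest = max([0] + [S[b] + latency[b] for b in deps[op]])
        for c in range(earliest, earliest + ii):   # ii distinct modulo slots
            if table[fu_of[op]][c % ii] < n_fus[fu_of[op]]:
                table[fu_of[op]][c % ii] += 1
                S[op] = c
                break
        else:
            return None                        # caller bumps ii and retries
    return S

# The array-sum loop: load a, b = b + a, i = i + 1, branch.
deps    = {"load": [], "incr": [], "add": ["load"], "br": ["incr"]}
latency = {"load": 3, "incr": 1, "add": 1, "br": 1}
fu_of   = {"load": "mem", "incr": "int", "add": "int", "br": "br"}
n_fus   = {"mem": 1, "int": 1, "br": 1}

S = modulo_schedule(["load", "incr", "add", "br"], deps, latency, fu_of, n_fus, ii=2)
print(S)
```

The add lands in cycle 3 because the load takes 3 cycles; since 3 mod 2 = 1, in the steady state each iteration's add executes alongside the next iterations' loads, which is exactly the pipelined kernel.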
Software pipelining
Current research

- Much research into different software pipelining techniques
- Difficult when there is general control flow in the loop
- Predication (in IA-64, for example) really helps here
- Some recent work on exhaustive scheduling, i.e. solving the NP-complete problem for basic blocks
Summary

- Scheduling to hide latency and exploit ILP
- Dependence graph – dependences between instructions + latency
- Local list scheduling + priorities
- Forward versus backward scheduling
- Scheduling EBBs, superblock cloning, trace scheduling
- Software pipelining of loops
PPar CDT Advert

The biggest revolution in the technological landscape for fifty years

Now accepting applications! Find out more and apply at:
pervasiveparallelism.inf.ed.ac.uk

• 4-year programme: MSc by Research + PhD
• Collaboration between:
  ▶ University of Edinburgh's School of Informatics – ranked top in the UK by 2014 REF
  ▶ Edinburgh Parallel Computing Centre – UK's largest supercomputing centre
• Full funding available
• Industrial engagement programme includes internships at leading companies
• Research-focused: work on your thesis topic from the start
• Research topics in software, hardware, theory and application of:
  ▶ Parallelism ▶ Concurrency ▶ Distribution