Topic 2: Foundamentals in Basic Back-End Optimization
description
Transcript of Topic 2: Foundamentals in Basic Back-End Optimization
112/04/22 \course\cpeg421-10F\Topic-2.ppt 1
Topic 2: Foundamentals in Basic Back-End Optimization
Instruction Selection
Instruction scheduling
Register allocation
112/04/22 \course\cpeg421-10F\Topic-2.ppt 2
ABET Outcome
• Ability to apply knowledge of basic code generation techniques,
e.g. Instruction selection, instruction scheduling, register
allocation, to solve code generation problems.
• Ability to analyze the basic algorithms on the above techniques
and conduct experiments to show their effectiveness.
• Ability to use a modern compiler development platform and tools
for the practice of above.
• A Knowledge on contemporary issues on this topic.
112/04/22 \course\cpeg421-10F\Topic-2.ppt 3
• Good IPO• Good LNO• Good global
optimization• Good integration of
IPO/LNO/OPT• Smooth information
passing between FE and CG
• Complete and flexible support of inner-loop scheduling (SWP), instruction scheduling and register allocation
Inter-ProceduralOptimization (IPA)
Loop NestOptimization (LNO)
Global Optimization(OPT)
Source
InnermostLoop
scheduling
Global instscheduling
Reg alloc
Local instscheduling
Executable
ArchModels
BE/CG
ME
General Compiler Framework
112/04/22 \course\cpeg421-10F\Topic-2.ppt 4
Three Basic Back-End Optimization
Instruction selection• Mapping IR into assembly code• Assume a fixed storage mapping & code shape• Combining operations, using address modes
Instruction scheduling• Reordering operations to hide latencies• Assume a fixed program (set of operations)• Changes demand for registers
Register allocation• Deciding which values will reside in registers• Changes the storage mapping may add false sharing• Concerns about placement of data & memory operations
112/04/22 \course\cpeg421-10F\Topic-2.ppt 5
Instruction Selection
Some slides are from CS 640 lecture in
George Mason University
112/04/22 \course\cpeg421-10F\Topic-2.ppt 6
Reading List
Some slides are from CS 640 lecture in
George Mason University
(1) K. D. Cooper & L. Torczon, Engineering a Compiler, Chapter 11
(2) Dragon Book, Chapter 8.7, 8.9
112/04/22 \course\cpeg421-10F\Topic-2.ppt 7
Objectives
• Introduce the complexity and importance
of instruction selection
• Study practical issues and solutions
• Case study: EBO (Extended Basic block
Optimization) in Open64
112/04/22 \course\cpeg421-10F\Topic-2.ppt 8
Instruction Selection: Retargetable
Machine description should also help with scheduling & allocation
Front End Back EndMiddle End
Infrastructure
Tables
PatternMatchingEngine
Back-endGenerator
Machinedescription
Description-based retargeting
This is simplistic but useful view
112/04/22 \course\cpeg421-10F\Topic-2.ppt 9
Complexity of Instruction Selection
Modern computers have many ways to do things.
Consider a register-to-register copy
• Obvious operation is: mv rj, ri
• Many others exist add rj, ri,0 sub rj, ri, 0 rshiftI rj, ri, 0
mul rj, ri, 1 or rj, ri, 0 divI rj, r, 1
xor rj, ri, 0 others …
112/04/22 \course\cpeg421-10F\Topic-2.ppt 10
Complexity of Instruction Selection
(Cont.)
• Multiple addressing modes
• Each alternate sequence has its cost Complex ops (mult, div): several cycles
Memory ops: latency vary
• Sometimes, cost is context related
• Use under-utilized FUs
• Dependent on objectives: speed, power, code size
112/04/22 \course\cpeg421-10F\Topic-2.ppt 11
Complexity of Instruction Selection
(Cont.)• Additional constraints on specific operations
Load/store multiple words: contiguous registers
Multiply: need special register Accumulator
• Interaction between instruction selection,
instruction scheduling, and register allocation For scheduling, instruction selection predetermines latencies and
function units
For register allocation, instruction selection pre-colors some
variables. e.g. non-uniform registers (such as registers for
multiplication)
112/04/22 \course\cpeg421-10F\Topic-2.ppt 12
Instruction Selection Techniques
Tree Pattern-Matching
• Tree-oriented IR suggests pattern matching on trees
• Tree-patterns as input, matcher as output
• Each pattern maps to a target-machine instruction sequence
• Use dynamic programming or bottom-up rewrite systems
Peephole-based Matching
• Linear IR suggests using some sort of string matching
• Inspired by peephole optimization
• Strings as input, matcher as output
• Each string maps to a target-machine instruction sequence
In practice, both work well; matchers are quite different.
112/04/22 \course\cpeg421-10F\Topic-2.ppt 13
A Simple Tree-Walk Code Generation Method
• Assume starting with a Tree-like IR
• Starting from the root, recursively walking
through the tree
• At each node use a simple (unique) rule to
generate a low-level instruction
112/04/22 \course\cpeg421-10F\Topic-2.ppt 14
Tree Pattern-Matching
Assumptions
tree-like IR - an AST
Assume each subtree of IR – there is a corresponding set of tree
patterns (or “operation trees” - low-level abstract syntax tree)
Problem formulation: Find a best mapping of the AST to
operations by “tiling” the AST with operation trees (where
tiling is a collection of (AST-node, operation-tree) pairs).
112/04/22 \course\cpeg421-10F\Topic-2.ppt 15
Tile AST
gets
+ -
val num
ref *
ref num
ref
val num
lab num
+ +
Tile 6
Tile 1
Tile 2
Tile 3
Tile 4
Tile 5
An AST tree
112/04/22 \course\cpeg421-10F\Topic-2.ppt 16
Goal is to “tile” AST with operation trees. • A tiling is collection of <ast-node, op-tree > pairs ◊ ast-node is a node in the AST ◊ op-tree is an operation tree ◊ <ast-node, op-tree> means that op-tree could
implement the subtree at ast-node • A tiling ‘implements” an AST if it covers every
node in the AST and the overlap between any two trees is
limited to a single node ◊ <ast-node, op-tree> tiling means ast-node is
also covered by a leaf in another operation tree in the tiling, unless it is the root
◊ Where two operation trees meet, they must be compatible (expect the value in the same location)
Tile AST with Operation Trees
112/04/22 \course\cpeg421-10F\Topic-2.ppt 17
Tree Walk by Tiling: An Example
a = a + 22;
MOVE
SP
+
a
+
MEM
SP
+
a
22
112/04/22 \course\cpeg421-10F\Topic-2.ppt 18
Example
a = a + 22;
MOVE
+
MEM
SP
+
a
22SP
+
at1
t2t3 ld t1, [sp+a]
st [t3], t2
add t2, t1, 22add t3, sp, a
112/04/22 \course\cpeg421-10F\Topic-2.ppt 19
Example: An Alternative
a = a + 22;
MOVE
+
MEM
SP
+
a
22SP
+
at1
t2 ld t1, [sp+a]
st [sp+a], t2add t2, t1, 22
112/04/22 \course\cpeg421-10F\Topic-2.ppt 20
Finding Matches to Tile the Tree
• Compiler writer connects operation trees to AST
subtrees ◊ Provides a set of rewrite rules ◊ Encode tree syntax, in linear form ◊ Associated with each is a code template
112/04/22 \course\cpeg421-10F\Topic-2.ppt 21
Generating Code in Tilings
Given a tiled tree
• Postorder treewalk, with node-dependent order for children
◊ Do right child before its left child • Emit code sequence for tiles, in order
• Tie boundaries together with register names
◊ Can incorporate a “real” register allocator or
can simply use “NextRegister++” approach
112/04/22 \course\cpeg421-10F\Topic-2.ppt 22
Optimal Tilings
• Best tiling corresponds to least cost
instruction sequence
• Optimal tiling
no two adjacent tiles can be combined to a
tile of lower cost
112/04/22 \course\cpeg421-10F\Topic-2.ppt 23
Dynamic Programming for Optimal Tiling
• For a node x, let f(x) be the cost of the optimal
tiling for the whole expression tree rooted at x.
Then
)(min)( Ty xT
f(y)Txf tile of child covering tile
)cost(
112/04/22 \course\cpeg421-10F\Topic-2.ppt 24
Dynamic Programming for Optimal Tiling (Con’t)
• Maintain a table: node x the optimal tiling
covering node x and its cost
• Start from root recursively: check in table for optimal tiling for this node
If not computed, try all possible tiling and find the optimal,
store lowest-cost tile in table and return
• Finally, use entries in table to emit code
112/04/22 \course\cpeg421-10F\Topic-2.ppt 25
Peephole-based Matching
• Basic idea inspired by peephole optimization
• Compiler can discover local improvements locally
◊ Look at a small set of adjacent operations ◊ Move a “peephole” over code & search for improvement
A Classic example is store followed by loadst $r1,($r0)ld $r2,($r0)
st $r1,($r0) move $r2,$r1
Original code Improved code
112/04/22 \course\cpeg421-10F\Topic-2.ppt 26
Implementing Peephole Matching
• Early systems used limited set of hand-coded patterns
• Window size ensured quick processing
• Modern peephole instruction selectors break problem
into three tasks
ExpanderIRLLIR
SimplifierLLIRLLIR
MatcherLLIRASM
IR LLIR LLIR ASM
LLIR: Low Level IR
ASM: Assembly Code
112/04/22 \course\cpeg421-10F\Topic-2.ppt 27
ExpanderIRLLIR
SimplifierLLIRLLIR
MatcherLLIRASM
IR LLIR LLIR ASM
Implementing Peephole Matching (Con’t)
Simplifier• Looks at LLIR through
window and rewrites it
• Uses forward substitution, algebraic simplification, local constant propagation, and dead-effect elimination
• Performs local optimization within window
• This is the heart of the peephole system and benefit of peephole optimization shows up in this step
Expander• Turns IR code into a low-
level IR (LLIR)• Operation-by-operation,
template-driven rewriting
• LLIR form includes all direct effects
• Significant, albeit constant, expansion of size
Matcher• Compares simplified LLIR
against a library of patterns
• Picks low-cost pattern that captures effects
• Must preserve LLIR effects, may add new ones
• Generates the assembly code output
112/04/22 \course\cpeg421-10F\Topic-2.ppt 28
Some Design Issues of Peephole Optimization
• Dead values Recognizing dead values is critical to remove useless
effects, e.g., condition code
Expander Construct a list of dead values for each low-level operation by
backward pass over the code
Example: consider the code sequence:
r1=ri*rj
cc=fx(ri, rj) // is this dead ?
r2=r1+ rk
cc=fx(r1, rk)
112/04/22 \course\cpeg421-10F\Topic-2.ppt 29
Some Design Issues of Peephole Optimization (Cont.)
• Control flow and predicated operations
A simple way: clear the simplifier’s window when
it reaches a branch, a jump, or a labeled or
predicated instruction
A more aggressive way: to be discussed next
112/04/22 \course\cpeg421-10F\Topic-2.ppt 30
Some Design Issues of Peephole Optimization (Cont.)
• Physical vs. Logical Window
Simplifier uses a window containing adjacent low
level operations
However, adjacent operations may not operate on
the same values
In practice, they may tend to be independent for
parallelism or resource usage reasons
112/04/22 \course\cpeg421-10F\Topic-2.ppt 31
Some Design Issues of Peephole Optimization (Cont.)
• Use Logical Window
Simplifier can link each definition with the next use
of its value in the same basic block
Simplifier largely based on forward substitution
No need for operations to be physically adjacent
More aggressively, extend to larger scopes beyond a basic
block.
112/04/22 \course\cpeg421-10F\Topic-2.ppt 32
An Example
Original IR Code
OP Arg1 Arg2 Result
mult 2 y t1
sub x t1 w
LLIR Code
r10 2 r11 @y r12 r0 + r11
r13 MEM(r12) r14 r10 * r13
r15 @x r16 r0 + r15 r17 MEM(r16) r18 r17 - r14
r19 @w r20 r0 + r19
MEM(r20) r18
Expand
where (@x,@y,@w are offsets of x, y and w from aglobal location stored in r0
R12: y mem addressR20: w mem addressR13: yR14: t1R17: xR18: w
112/04/22 \course\cpeg421-10F\Topic-2.ppt 33
An Example (Con’t)
LLIR Code
r10 2 r11 @y r12 r0 + r11
r13 MEM(r12) r14 r10 * r13
r15 @x r16 r0 + r15 r17 MEM(r16) r18 r17 - r14
r19 @w r20 r0 + r19
MEM(r20) r18
Simplify
LLIR Code
r13 MEM(r0+ @y) r14 2 * r13
r17 MEM(r0 + @x) r18 r17 - r14
MEM(r0 + @w) r18
Original IR Code
OP Arg1 Arg2 Result
mult 2 y t1
sub x t1 w
112/04/22 \course\cpeg421-10F\Topic-2.ppt 34
• Introduced all memory operations & temporary names
• Turned out pretty good code
An Example (Con’t)
MatchLLIR Code
r13 MEM(r0+ @y) r14 2 * r13
r17 MEM(r0 + @x) r18 r17 - r14
MEM(r0 + @w) r18
ILOC Assembly CodeloadAI r0,@y r13
multI 2 * r13 r14 loadAI r0,@x r17 sub r17 - r14 r18
storeAI r18 r0,@w
Original IR Code
OP Arg1 Arg2 Result
mult 2 y t1
sub x t1 w
112/04/22 \course\cpeg421-10F\Topic-2.ppt 35
Simplifier (3-operation window)
LLIR Code
r10 2 r11 @y r12 r0 + r11
r13 MEM(r12) r14 r10 * r13
r15 @x r16 r0 + r15 r17 MEM(r16) r18 r17 - r14
r19 @w r20 r0 + r19
MEM(r20) r18
r10 2r11 @yr12 r0 + r11
r10 2r12 r0 + @yr13 MEM(r12)
112/04/22 \course\cpeg421-10F\Topic-2.ppt 36
Simplifier (3-operation window)
LLIR Code r10 2 r11 @y r12 r0 + r11
r13 MEM(r12) r14 r10 * r13
r15 @x r16 r0 + r15 r17 MEM(r16) r18 r17 - r14
r19 @w r20 r0 + r19
MEM(r20) r18
r10 2r12 r0 + @yr13 MEM(r12)
r10 2r13 MEM(r0 + @y)r14 r10 x r13
112/04/22 \course\cpeg421-10F\Topic-2.ppt 37
Simplifier (3-operation window)
LLIR Code
r10 2 r11 @y r12 r0 + r11
r13 MEM(r12) r14 r10 * r13
r15 @x r16 r0 + r15 r17 MEM(r16) r18 r17 - r14
r19 @w r20 r0 + r19
MEM(r20) r18
r13 MEM(r0 + @y)r14 2 * r13
r15 @x
r10 2r13 MEM(r0 + @y)r14 r10 * r13
112/04/22 \course\cpeg421-10F\Topic-2.ppt 38
Simplifier (3-operation window)
LLIR Code
r10 2 r11 @y r12 r0 + r11
r13 MEM(r12) r14 r10 * r13
r15 @x r16 r0 + r15 r17 MEM(r16) r18 r17 - r14
r19 @w r20 r0 + r19
MEM(r20) r18
r13 MEM(r0 + @y)r14 2 * r13
r15 @x
r14 2 * r13
r15 @xr16 r0 + r15
1st op it has rolled out of window
r13 MEM(r0+ @y)
112/04/22 \course\cpeg421-10F\Topic-2.ppt 39
Simplifier (3-operation window)
LLIR Code
r10 2 r11 @y r12 r0 + r11
r13 MEM(r12) r14 r10 * r13
r15 @x r16 r0 + r15 r17 MEM(r16) r18 r17 - r14
r19 @w r20 r0 + r19
MEM(r20) r18
r14 2 * r13
r15 @xr16 r0 + r15
r14 2 * r13
r16 r0 + @xr17 MEM(r16)
r13 MEM(r0+ @y)
112/04/22 \course\cpeg421-10F\Topic-2.ppt 40
Simplifier (3-operation window)
LLIR Code
r10 2 r11 @y r12 r0 + r11
r13 MEM(r12) r14 r10 * r13
r15 @x r16 r0 + r15 r17 MEM(r16) r18 r17 - r14
r19 @w r20 r0 + r19
MEM(r20) r18
r14 2 * r13
r17 MEM(r0+@x) r18 r17 - r14
r14 2 * r13
r16 r0 + @xr17 MEM(r16)
r13 MEM(r0+ @y)
112/04/22 \course\cpeg421-10F\Topic-2.ppt 41
Simplifier (3-operation window)
LLIR Code
r10 2 r11 @y r12 r0 + r11
r13 MEM(r12) r14 r10 * r13
r15 @x r16 r0 + r15 r17 MEM(r16) r18 r17 - r14
r19 @w r20 r0 + r19
MEM(r20) r18
r17 MEM(r0+@x) r18 r17 - r14
r19 @w
r14 2 * r13
r17 MEM(r0+@x) r18 r17 - r14
r13 MEM(r0+ @y) r14 2 * r13
112/04/22 \course\cpeg421-10F\Topic-2.ppt 42
Simplifier (3-operation window)
LLIR Code
r10 2 r11 @y r12 r0 + r11
r13 MEM(r12) r14 r10 * r13
r15 @x r16 r0 + r15 r17 MEM(r16) r18 r17 - r14
r19 @w r20 r0 + r19
MEM(r20) r18
r18 r17 - r14
r19 @wr20 r0 + r19
r17 MEM(r0+@x) r18 r17 - r14
r19 @w
r13 MEM(r0+ @y) r14 2 * r13 r17 MEM(r0 + @x)
112/04/22 \course\cpeg421-10F\Topic-2.ppt 43
Simplifier (3-operation window)
LLIR Code
r10 2 r11 @y r12 r0 + r11
r13 MEM(r12) r14 r10 * r13
r15 @x r16 r0 + r15 r17 MEM(r16) r18 r17 - r14
r19 @w r20 r0 + r19
MEM(r20) r18
r18 r17 - r14
r20 r0 + @wMEM(r20) r18
r18 r17 - r14
r19 @wr20 r0 + r19
r13 MEM(r0+ @y) r14 2 * r13 r17 MEM(r0 + @x)
112/04/22 \course\cpeg421-10F\Topic-2.ppt 44
Simplifier (3-operation window)
LLIR Code
r10 2 r11 @y r12 r0 + r11
r13 MEM(r12) r14 r10 * r13
r15 @x r16 r0 + r15 r17 MEM(r16) r18 r17 - r14
r19 @w r20 r0 + r19
MEM(r20) r18
r18 r17 - r14
MEM(r0 + @w) r18
r18 r17 - r14
r20 r0 + @wMEM(r20) r18
r13 MEM(r0+ @y) r14 2 * r13 r17 MEM(r0 + @x)
112/04/22 \course\cpeg421-10F\Topic-2.ppt 45
Simplifier (3-operation window)
LLIR Code
r10 2 r11 @y r12 r0 + r11
r13 MEM(r12) r14 r10 * r13
r15 @x r16 r0 + r15 r17 MEM(r16) r18 r17 - r14
r19 @w r20 r0 + r19
MEM(r20) r18
r18 r17 - r14
MEM(r0 + @w) r18
r18 r17 - r14
r20 r0 + @wMEM(r20) r18
r13 MEM(r0+ @y) r14 2 * r13 r17 MEM(r0 + @x)
112/04/22 \course\cpeg421-10F\Topic-2.ppt 46
An Example (Con’t)
LLIR Code
r10 2 r11 @y r12 r0 + r11
r13 MEM(r12) r14 r10 * r13
r15 @x r16 r0 + r15 r17 MEM(r16) r18 r17 - r14
r19 @w r20 r0 + r19
MEM(r20) r18
Simplify
LLIR Code r13 MEM(r0+ @y) r14 2 * r13
r17 MEM(r0 + @x) r18 r17 - r14
MEM(r0 + @w) r18
112/04/22 \course\cpeg421-10F\Topic-2.ppt 47
Making It All Work
• LLIR is largely machine independent
• Target machine described as LLIR ASM pattern
• Actual pattern matching Use a hand-coded pattern matcher
Turn patterns into grammar & use LR parser
• Several important compilers use this technology
• It seems to produce good portable instruction selectors
• Key strength appears to be late low-level optimization
112/04/22 \course\cpeg421-07s\Topic-5.ppt 48112/04/22 \course\cpeg421-07s\Topic-5.ppt 48
Case Study: LLIR and Instruction Selection
• LLIR stands for low-level intermediate representation
• GCC RTL
• Open64 CGIR
• Instruction Selection Techniques in Real Compiler:
• GCC: peephole based matching
• LCC: tree pattern match method
112/04/22 \course\cpeg421-10F\Topic-2.ppt 49
Case Study: Peephole Optimization in Kcc/Open64
• Aggressive simplifier using a logical window
that spans a set of basic blocks (Extended
Basic Blocks: EBO)
• Basis One definition for multiple uses
Constant propagation
Forward substitution
Others: pattern matching
112/04/22 \course\cpeg421-10F\Topic-2.ppt 50
KCC/Open64: Where Instruction Selection Happens?
f90
Very High WHIRL
VHO(Very High WHIRL Optimizer)
Standalone Inliner
W2C/W2F
Fortran
High WHIRL
Middle WHIRL
Low WHIRL
Very Low WHIRL
lowering
Source to IR
Scanner →Parser → RTL → WHIRL
LNO
• Loop unrolling/
• Loop reversal/Loop fission/Loop fussion
• Loop tiling/Loop peeling…
WOPT
• SSAPRE(Partial Redundency Elimination)
• VNFRE(Value Numbering based Full Redundancy Elimination)
RVI-1(Register Variable Identification)
lowering
• RVI-2
• IVR(Induction Variable Recognition)
lowering
• Cflow(control flow opt), HBS (hyperblock schedule)
• EBO (Extended Block Opt.) • GCM (Global Code Motion)
• PQS (Predicate Query System)
• SWP, Loop unrolling
Assembly Code
W2C/W2F
CFG/DDG
WHIRL-to-TOP lowering IGLS(pre-pass)
GRA
LRA
IGLS(post-pass)
DDG
gfecc gfec
C++ C
IPA
• IPL(Pre_IPA)
• IPA_LINK(main_IPA)
◦ Analysis
◦ Optimization
PREOPT
SSA
• IGLS(Global and Local Instruction Scheduling)
• GRA(Global Register Allocation)
• LRA(Local Register Allocation)
Machine Description
GCC Compile
Mach
ine M
odel
Fron
t En
dM
idd
le En
dB
ack E
nd
SSA
CGIR
Some peephole optimization
lowering
lowering
112/04/22 \course\cpeg421-10F\Topic-2.ppt 51
Case Study: Peephole Optimization in Kcc/Open64 (cont.)
• Summary recognize an extended block/sequence
perform peephole optimizations on the instructions within
a logical window (The size of the logical window grows as
the basic block sequence grows)
• Optimizations forward propagation of constants
redundant expression elimination
dead expression elimination
112/04/22 \course\cpeg421-10F\Topic-2.ppt 52
Flowchart of Kcc/Open64 Code Generator
WHIRL
Process Inner Loops: unrolling, EBO
Loop prep, software pipelining
IGLS: pre-passGRA, LRA, EBOIGLS: post-passControl Flow Opt
Code Emission
WHIRL-to-TOP Lowering
CGIR: Quad Op List
Control Flow Opt IEBO
Hyperblock Formation Critical-Path Reduction
Control Flow Opt IIEBO
EBO:Extendedbasic blockoptimizationpeephole,etc.
PQS:PredicateQuery System
112/04/22 \course\cpeg421-10F\Topic-2.ppt 53
Case Study: Peephole Optimization in Kcc/Open64 (cont.)
• Design issues
need to recognize the definition and use
connections between values.
called several times during compilation by
different modules, and may have different
information available each time it is called