CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs...
![Page 1: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/1.jpg)
CS8803: Compilers for Embedded System
Santosh Pande – Summer 2007
Chapter 8
Compiling for VLIWs and ILP
![Page 2: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/2.jpg)
Outline
• 8.1 Profiling
• 8.2 Scheduling
  – Acyclic Region Types and Shapes
  – Region Formation
  – Schedule Construction
  – Resource Management During Scheduling
  – Loop Scheduling
  – Clustering
• 8.3 Register Allocation
• 8.4 Speculation and Predication
• 8.5 Instruction Selection
![Page 3: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/3.jpg)
Overview
• This chapter…
  – Focuses on optimizations, or code transformations
  – These topics are common across all types of ILP processors, for both general-purpose and embedded applications
  – Compilers and toolchains used for embedded processors are very similar to those used for general-purpose computers
![Page 4: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/4.jpg)
1. Profiling
• Profiles
  – Statistics about how a program spends its time and resources
  – Many ILP optimizations require good profile information
• Two types of profiles
  – “Point profiles”
    • Call graphs and CFGs
  – “Path profiles”
![Page 5: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/5.jpg)
Types of Profiles
• Call graph
  – Nodes: procedures
  – Edges: procedure calls
  – Information
    • How many times was each procedure called?
    • How many times did each caller invoke a given callee?
  – Limitation:
    • Identifies procedures that might benefit from optimization, but can’t tell what to do with them
![Page 6: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/6.jpg)
Types of Profiles (cont.)
• Control Flow Graph (CFG)
  – Nodes: basic blocks
    • Basic block: a straight-line sequence of instructions that always execute together
  – Edges: one basic block can execute after another basic block
  – Information
    • How many times was a particular basic block executed?
    • How many times did control flow from one basic block to one of its immediate neighbors?
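A CFG point profile can be held as a table of edge counts, from which block counts and branch probabilities follow. The sketch below is only an illustration: the block names, the counts, and the helpers `block_counts_from_edges` and `edge_probabilities` are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical CFG point profile: how often each edge (src, dst) was taken.
edge_counts = {
    ("B0", "B1"): 80,
    ("B0", "B2"): 20,
    ("B1", "B3"): 80,
    ("B2", "B3"): 20,
}

def block_counts_from_edges(edge_counts, entry, entry_count):
    """Derive per-block execution counts by summing incoming edge counts."""
    counts = defaultdict(int)
    counts[entry] = entry_count  # the entry block has no incoming edges here
    for (_, dst), n in edge_counts.items():
        counts[dst] += n
    return dict(counts)

def edge_probabilities(edge_counts):
    """Probability of taking each edge, given that its source block executed."""
    out_totals = defaultdict(int)
    for (src, _), n in edge_counts.items():
        out_totals[src] += n
    return {edge: n / out_totals[edge[0]] for edge, n in edge_counts.items()}

blocks = block_counts_from_edges(edge_counts, "B0", 100)
probs = edge_probabilities(edge_counts)
```

Each probability here is local to one branch; nothing in the table says which sequence of blocks a given execution followed, which is the gap path profiles address.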
![Page 7: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/7.jpg)
[Figure: example call graph and control flow graph]
![Page 8: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/8.jpg)
Types of Profiles (cont.)
• Path profiles
  – Measure the number of times a path, or sequence of contiguous blocks in the CFG, is executed
  – Optimizations using path profiles have appeared in research compilers, but not yet in production compilers
  – Note that call graphs and CFGs are “point profiles”
![Page 9: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/9.jpg)
Profile Collection
• Instrumentation
  – Extra code is inserted into the program to gather data
  – Can be done by the compiler or by a post-compilation tool
    • e.g. Pin: dynamic instrumentation tools and API
      – http://rogue.colorado.edu/pin/
  – Hardware techniques
    • Special registers record statistics on various events
    • Statistical-sampling profilers
![Page 10: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/10.jpg)
Synthetic Profiles (Heuristics in Lieu of Profiles)
• Synthetic profile
  – Assigns weights to each part of the program based solely on the structure of the source program
  – Pros
    • No need to collect statistics from actual program runs
  – Cons
    • Can’t see how the program behaves with real data
  – No synthetic-profile technique does as well as actual profiling
![Page 11: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/11.jpg)
2. Scheduling
• Instruction scheduling
  – Directly responsible for identifying and grouping operations that can be executed in parallel
• Taxonomy
  – Cyclic: operates on loops in the program
  – Acyclic: handles loop-free regions, not loops directly
  – Current compilers include both kinds of scheduler
• Hardware support
  – Can widen the choices available to the scheduler
![Page 12: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/12.jpg)
![Page 13: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/13.jpg)
Acyclic Region Types and Shapes
• Shapes of regions
  – Basic blocks, traces, …
• Basic blocks
  – A “degenerate” form of region
  – Maximal straight-line code fragments
![Page 14: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/14.jpg)
Acyclic Region Types and Shapes (cont.)
• Traces: the first proposed region
  – Linear paths through the code, with multiple entrances and exits
  – A trace consists of the operations from a list of basic blocks B0, B1, …, Bn with the following properties
    • Each basic block is a predecessor of the next on the list
      – e.g. Bk falls through or branches to Bk+1
    • For any i and k, there is no path Bi->Bk->Bi except for those that go through B0
      – i.e. the code is cycle-free, except that the entire region can be part of some encompassing loop
  – Allowing forward branches and so on makes traces complex!
![Page 15: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/15.jpg)
Acyclic Region Types and Shapes (cont.)
[Figure: a CFG of basic blocks and control-flow edges, highlighting a trace (a linear, multiple-entry, multiple-exit region) and a side entrance]
![Page 16: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/16.jpg)
Acyclic Region Types and Shapes (cont.)
• Superblocks
  – Traces with an added restriction
    • Single-entry, multiple-exit traces
  – Same properties as traces, with one addition
    • There may be no branches into a block in the region, except to B0. These outlawed branches are referred to in the superblock literature as side entrances
  – Tail duplication: a region-enlarging technique
    • Avoids side entrances and adds compensation code
![Page 17: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/17.jpg)
Acyclic Region Types and Shapes (cont.)
[Figure: tail duplication to eliminate side entrances, producing a superblock; profile weights are scaled on the duplicated blocks, e.g. 70*0.8=56]
![Page 18: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/18.jpg)
Acyclic Region Types and Shapes (cont.)
• Hyperblocks
  – Single-entry, multiple-exit regions with internal control flow
  – Variants of superblocks that employ predication to fold multiple control paths into a single superblock
  – Predication removes some control-flow complexity
![Page 19: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/19.jpg)
Acyclic Region Types and Shapes (cont.)
[Figure: hyperblock formed by if-conversion of basic blocks B2 and B5]
![Page 20: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/20.jpg)
Acyclic Region Types and Shapes (cont.)
• Treegions
  – Regions containing a tree of basic blocks within the control flow of the program
  – Properties
    • Each basic block Bj except for B0 has exactly one predecessor.
    • That predecessor, Bi, is on the list, where i < j.
  – Any path through a treegion yields a superblock
    • i.e. a trace with no side entrances
  – Treegion-2: without the restriction on side entrances
![Page 21: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/21.jpg)
Acyclic Region Types and Shapes (cont.)
[Figure: a CFG partitioned into Treegion 1, Treegion 2, and Treegion 3; label: “Trace-2: trace”]
![Page 22: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/22.jpg)
Acyclic Region Types and Shapes (cont.)
• Percolation Scheduling
  – Many code motion rules are applied to regions that resemble traces
  – One of the earliest versions of DAG scheduling
    • DAG scheduling: the most general form of acyclic scheduling
• Cyclic schedulers
  – Limited region shapes
    • A single innermost loop
    • An inner loop that has very simple control flow
![Page 23: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/23.jpg)
Acyclic Region Types and Shapes (cont.)
![Page 24: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/24.jpg)
Region Formation
• So far, we have discussed region shapes
• Two questions remain
  – Region formation
    • How does one divide a program into regions?
    • Region formation is more than selecting good regions from the CFG; it also includes duplication (region enlargement)
  – Schedule construction
    • How does one build schedules for them?
    • Well-selected regions are critical for schedule construction
      – Using profiles: how frequently is each region executed?
![Page 25: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/25.jpg)
Region Formation (cont.)
• Region Selection
  – Trace growing
    • The most popular algorithm
  – Uses the mutual-most-likely heuristic
  – Steps
    • A is the last block of the current trace
    • Block B is A’s most likely successor, and A is B’s most likely predecessor
      – A and B are “mutually most likely”
    • Add B to the trace
    • Repeat until there is no mutually-most-likely successor
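The trace-growing steps can be sketched as follows. This is a simplified illustration under stated assumptions: it grows a trace forward only from a given seed block, the edge counts are made up, and the helper names (`most_likely_successor`, `most_likely_predecessor`, `grow_trace`) are hypothetical. A production trace selector would also choose seeds by execution frequency and grow backward.

```python
# Hypothetical edge profile: (src, dst) -> execution count.
edge_counts = {
    ("B0", "B1"): 80, ("B0", "B2"): 20,
    ("B1", "B3"): 80, ("B2", "B3"): 20,
    ("B3", "B4"): 90, ("B3", "B5"): 10,
}

def most_likely_successor(block, edge_counts):
    cands = [(n, d) for (s, d), n in edge_counts.items() if s == block]
    return max(cands)[1] if cands else None  # ties broken by block name

def most_likely_predecessor(block, edge_counts):
    cands = [(n, s) for (s, d), n in edge_counts.items() if d == block]
    return max(cands)[1] if cands else None

def grow_trace(seed, edge_counts):
    """Extend the trace while the last block and its best successor are mutually most likely."""
    trace = [seed]
    while True:
        a = trace[-1]
        b = most_likely_successor(a, edge_counts)
        if b is None or b in trace:  # no successor, or a back edge (traces stay acyclic)
            break
        if most_likely_predecessor(b, edge_counts) != a:
            break                    # not mutually most likely: stop growing
        trace.append(b)
    return trace

hot_trace = grow_trace("B0", edge_counts)
cold_trace = grow_trace("B2", edge_counts)
```

Starting from B2, growth stops immediately because B3 is most often reached from B1, not B2.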
![Page 26: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/26.jpg)
Region Formation (cont.)
• Region Selection
  – Shortcomings of using point profiles
    • Cumulative effect of conditional probability
    • Point profiles measure each branch probability independently
    • The probability of remaining on the trace decreases rapidly
    • Example:
      – A trace that crosses ten splits, each with a 90% probability of staying on the trace, appears to have only a 35% (≈0.9^10) probability of running from start to end
    • Solutions:
      – Building differently shaped regions
      – Using predication to remove branches
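The arithmetic in the example can be checked directly. The helper `on_trace_probability` is a hypothetical name; it multiplies the per-split probabilities, which is exactly the independence assumption a point profile forces on the compiler.

```python
def on_trace_probability(stay_probs):
    """Multiply per-split stay-on-trace probabilities, treating them as independent."""
    p = 1.0
    for q in stay_probs:
        p *= q
    return p

p10 = on_trace_probability([0.9] * 10)  # ten splits, 90% each
```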
![Page 27: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/27.jpg)
Region Formation (cont.)
• Region Selection
  – Hyperblock formation
    • Based on mutual-most-likely trace formation
    • Also considers block size and execution frequency
    • Predication can remove unpredictable branches
  – Research on better statistics
    • Using global, bounded-length path profiles to improve static branch prediction
![Page 28: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/28.jpg)
Region Formation (cont.)
• Enlargement Techniques
  – Region selection alone is not enough
  – ILP must be increased through region enlargement
    • Code size increases, but the code schedules better
    • Relies on the fact that programs iterate (loop)
  – Loop unrolling
    • Performed before region selection so that the larger unrolled code is available to the region selector
    • Induction variable simplification, etc., is performed to expose more parallelism across iterations
![Page 29: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/29.jpg)
Region Formation (cont.)
• Simplified example of variants of loop unrolling
  – For while loops: the most general case
  – For for loops: counted loops
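Since the figure did not survive extraction, the two variants can be sketched in source form. This is an illustration only: a compiler unrolls at the intermediate-code level, and the function names below are hypothetical. The while-loop version must keep an exit test after every copy of the body; the counted version knows the trip count, so it tests once per two bodies and finishes with a remainder loop.

```python
def sum_rolled(a):
    total, i = 0, 0
    while i < len(a):
        total += a[i]
        i += 1
    return total

def sum_unrolled_while(a):
    """General case, unrolled by 2: each body copy keeps its own exit test."""
    total, i = 0, 0
    while i < len(a):
        total += a[i]
        i += 1
        if not (i < len(a)):  # exit test between the two copies
            break
        total += a[i]
        i += 1
    return total

def sum_unrolled_counted(a):
    """Counted loop, unrolled by 2: one test per two bodies, plus a remainder loop."""
    n = len(a)
    main = n - n % 2          # largest multiple of 2 not exceeding n
    total = 0
    for i in range(0, main, 2):
        total += a[i]
        total += a[i + 1]
    for i in range(main, n):  # at most one leftover iteration
        total += a[i]
    return total
```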
![Page 30: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/30.jpg)
Region Formation (cont.)
• Induction variable manipulations for loops
![Page 31: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/31.jpg)
Region Formation (cont.)
• Enlargement Techniques
  – Different approaches for superblocks
    • Superblock loop unrolling
      – Unrolls superblock loops (superblocks whose most likely exit jumps back to their beginning)
    • Superblock loop peeling
      – Used when the profile suggests a small number of iterations for the superblock loop
      – The expected number of iterations is copied
    • Superblock target expansion
      – Similar to the mutual-most-likely heuristic for growing traces
      – If superblock A ends in a likely branch to B, then B is added to A
![Page 32: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/32.jpg)
[Figure: superblock-enlarging optimizations: target expansion, loop unrolling, loop peeling]
![Page 33: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/33.jpg)
Region Formation (cont.)
• Phase-ordering Considerations
  – Which one first?
    • Multiflow compiler: enlargement before trace selection
    • Superblock-based compilers: chose and formed superblocks first, before enlarging them
    • Neither is clearly preferable
  – Other transformations
    • e.g. dependence height reduction should be run before region formation
![Page 34: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/34.jpg)
Schedule Construction
• So far, we have discussed region formation
  – Selecting and enlarging individual regions
• A schedule
  – A set of annotations that indicate the unit assignment and cycle time of the operations in a region
  – Depends on the shape of the region
• Goal: minimizing an objective function
  – Estimated completion time, plus code size or energy efficiency (in embedded systems)
![Page 35: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/35.jpg)
Schedule Construction (cont.)
• Analyzing Programs for Schedule Construction
  – Dependences (data and control) prohibit reordering
    • They impose a partial ordering on the pieces of code
    • Represented as a DAG or one of its variants
      – DDG (data dependence graph)
      – PDG (program dependence graph)
    • Creating the DDG and PDG is typically O(n^2)
      – where n is the number of operations
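The quadratic cost comes from comparing every pair of operations. The sketch below shows a minimal pairwise DDG builder for register dependences only; the operation tuples and the name `build_ddg` are hypothetical, and the approach is conservative (it does not prune edges that an intervening write would make redundant, and it ignores memory dependences).

```python
from itertools import combinations

# Hypothetical straight-line region: (name, destination register, source registers).
ops = [
    ("op0", "r1", ["r2", "r3"]),  # r1 = r2 + r3
    ("op1", "r4", ["r1"]),        # r4 = r1 * 2
    ("op2", "r2", ["r5"]),        # r2 = r5
    ("op3", "r1", ["r6"]),        # r1 = r6
]

def build_ddg(ops):
    """Compare every earlier/later pair of operations: O(n^2) pairs."""
    edges = []
    for (ni, di, si), (nj, dj, sj) in combinations(ops, 2):
        if di in sj:
            edges.append((ni, nj, "true"))    # later op reads what earlier op wrote
        if dj in si:
            edges.append((ni, nj, "anti"))    # later op overwrites what earlier op read
        if di == dj:
            edges.append((ni, nj, "output"))  # both ops write the same register
    return edges

ddg_edges = build_ddg(ops)
```

Every edge points from an earlier operation to a later one, so the result is a DAG by construction.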
![Page 36: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/36.jpg)
[Figure: data dependences example, showing an output dependence and a true dependence]
![Page 37: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/37.jpg)
[Figure: control dependence example and control flow example]
![Page 38: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/38.jpg)
Schedule Construction (cont.)
• Compaction Techniques
  – Cycle versus operation scheduling
    • Two strategies to minimize an objective function
    • 1) Operation scheduling
      – Selects an operation in the region and places it in the “best” cycle that its dependences allow
    • 2) Cycle scheduling
      – Fills a cycle with operations from the region, proceeding to the next cycle only after exhausting the available operations
    • Operation scheduling is theoretically more powerful because it can take long-latency operations into account
![Page 39: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/39.jpg)
Schedule Construction (cont.)
• Compaction Techniques
  – Linear techniques
    • Algorithms that use the DDG are O(n^2)
    • In practice, linear O(n) techniques are used in modern compilers
    • Two techniques
    • 1) As-soon-as-possible (ASAP) scheduling
      – Places each operation in the earliest possible cycle (top-down linear scan)
    • 2) As-late-as-possible (ALAP) scheduling
      – Places each operation in the latest possible cycle (bottom-up linear scan)
    • Example: critical-path scheduling uses ASAP followed by ALAP to identify the operations on the critical path
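The ASAP-then-ALAP idea can be sketched as two scans over a region whose operations are already in topological order. The operations, latencies, and function names below are hypothetical, and resource limits are ignored; an operation whose ASAP and ALAP cycles coincide has no slack and therefore lies on the critical path.

```python
# Hypothetical region: op -> (latency, successors), listed in topological order.
ddg = {
    "load": (3, ["mul"]),
    "add":  (1, ["cmp"]),
    "mul":  (2, ["cmp"]),
    "cmp":  (2, []),
}

def asap(ddg):
    start = {op: 0 for op in ddg}
    for op, (lat, succs) in ddg.items():        # top-down scan
        for s in succs:
            start[s] = max(start[s], start[op] + lat)
    return start

def alap(ddg, length):
    start = {op: length - lat for op, (lat, _) in ddg.items()}
    for op in reversed(list(ddg)):              # bottom-up scan
        lat, succs = ddg[op]
        for s in succs:
            start[op] = min(start[op], start[s] - lat)
    return start

early = asap(ddg)
length = max(early[op] + ddg[op][0] for op in ddg)  # schedule length in cycles
late = alap(ddg, length)
critical = [op for op in ddg if early[op] == late[op]]  # zero-slack operations
```

Here `add` can start anywhere from cycle 0 to cycle 4 without lengthening the schedule, while `load`, `mul`, and `cmp` cannot move.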
![Page 40: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/40.jpg)
Schedule Construction (cont.)
• Compaction Techniques
  – Graph-based techniques (list scheduling)
    • Linear techniques can’t see the global properties of the DDG
    • Repeatedly assigns a cycle to an operation without backtracking (a greedy algorithm): O(n log n)
    • Steps
      – Select an operation from the data-ready queue (DRQ)
      – An operation is ready when all of its DDG predecessors have been scheduled
      – Once scheduled, the operation is removed from the DRQ
    • Performance depends on the order in which candidates are selected, i.e. on the scheduler’s greediness
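A minimal cycle-by-cycle list scheduler, following the steps above, might look like the sketch below. Several things are assumptions rather than content from the slides: the machine is modeled only by an issue `width`, all latencies are at least one cycle, and priority is the longest latency path to the end of the region (one common choice). For clarity the data-ready queue is rebuilt every cycle, so this version does not achieve the O(n log n) bound that a priority-queue implementation would.

```python
# Hypothetical region: op -> (latency, successors).
ddg = {
    "load": (3, ["mul"]),
    "add":  (1, ["cmp"]),
    "mul":  (2, ["cmp"]),
    "cmp":  (2, []),
}

def list_schedule(ddg, width):
    """Greedy list scheduling without backtracking. Returns op -> issue cycle."""
    preds = {op: [] for op in ddg}
    for op, (_, succs) in ddg.items():
        for s in succs:
            preds[s].append(op)

    height = {}
    def h(op):  # priority: longest latency path from op to the region's end
        if op not in height:
            lat, succs = ddg[op]
            height[op] = lat + max((h(s) for s in succs), default=0)
        return height[op]

    schedule, cycle = {}, 0
    while len(schedule) < len(ddg):
        # Data-ready queue: unscheduled ops whose predecessors have all completed.
        drq = [op for op in ddg if op not in schedule
               and all(p in schedule and schedule[p] + ddg[p][0] <= cycle
                       for p in preds[op])]
        drq.sort(key=lambda op: -h(op))
        for op in drq[:width]:  # issue at most `width` ops this cycle
            schedule[op] = cycle
        cycle += 1
    return schedule

narrow = list_schedule(ddg, width=1)
wide = list_schedule(ddg, width=2)
```

With one issue slot, the scheduler prefers `load` over `add` in cycle 0 because `load` heads the longer path; with two slots both issue together.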
![Page 41: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/41.jpg)
Schedule Construction (cont.)
• Compensation Code
  – Restores the correct flow of data and control
  – Four basic scenarios
• (a) No compensation
  – Code motion doesn’t change the relative order of operations with respect to joins and splits
  – Also covers moving operations above a split point (they become speculative)
  – Recall that compensation code for speculative code motion depends on the recovery model
![Page 42: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/42.jpg)
Schedule Construction (cont.)
• Compensation Code
• (b) Join compensation
  – B moves above a join point A
  – Drop a copy of B (B’) in the join path
• (c) Split compensation
  – Split operation B (i.e. a branch) moves above a previous operation A
  – Produces a copy of A (A’) in the split path
![Page 43: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/43.jpg)
Schedule Construction (cont.)
• Compensation Code
• (d) Join-split compensation
  – Splits moved above joins (in the figure)
  – Splits moved above splits
  [Figure label: Z-B-W path]
• Summary
  – In general, make sure to preserve all paths from the original sequence in the transformed control flow after scheduling
![Page 44: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/44.jpg)
Resource Management During Scheduling
• Resource hazards
  – The scheduler must respect dependences, operation latencies, and available resources (e.g., functional units)
• Approaches
  – Reservation tables: a simple and early method
  – Finite-state automata
![Page 45: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/45.jpg)
Resource Management During Scheduling (cont.)
• Resource Vectors
  – Make it easy to schedule instructions
  – Rows: one per cycle of the schedule
  – Columns: one per resource in the machine
  – Recent work has focused on reducing the size of the table
[Figure: reservation table with busy entries marked]
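A reservation table of this shape can be sketched as a growable list of rows, one per cycle, each mapping a resource to a busy flag. The class name, resource names, and the `(cycle offset, resource)` usage format are hypothetical choices for illustration.

```python
class ReservationTable:
    """Rows are cycles, columns are resources; an entry is True when busy."""

    def __init__(self, resources):
        self.resources = list(resources)
        self.rows = []

    def _row(self, cycle):
        while len(self.rows) <= cycle:  # grow the table on demand
            self.rows.append({r: False for r in self.resources})
        return self.rows[cycle]

    def can_place(self, cycle, usage):
        """usage: (cycle offset, resource) pairs the operation occupies."""
        return all(not self._row(cycle + off)[res] for off, res in usage)

    def place(self, cycle, usage):
        if not self.can_place(cycle, usage):
            return False
        for off, res in usage:
            self._row(cycle + off)[res] = True
        return True

table = ReservationTable(["alu", "mem"])
mul = [(0, "alu"), (1, "alu")]  # assumed: a multiply holds the ALU for two cycles
add = [(0, "alu")]
load = [(0, "mem")]
```

A scheduler would call `can_place` while probing candidate cycles and `place` once it commits an operation.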
![Page 46: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/46.jpg)
Resource Management During Scheduling (cont.)
• Finite-state Automata
  – Intuition
    • Is this instruction sequence a resource-legal schedule?
      – Similar to “Does this FSA accept this string?”
    • A schedule is a sequence of instructions
      – Similar to “a string is a sequence of alphabet characters”
      – Resource-valid schedules = a language
  – FSAs are sufficient to accept these languages
  – Several approaches for improving efficiency
    • Breaking them into “factor” automata, reversing automata, and nondeterminism
![Page 47: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/47.jpg)
Resource Management During Scheduling (cont.)
• Finite-state Automata
• Original automaton: represents a two-resource machine
• Factored automata: “Letter” and “Number”, since the two operate independently
• The cross-product of the factored automata is equivalent to the original automaton
![Page 48: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/48.jpg)
Resource Management During Scheduling (cont.)
• TODO:
  – Reverse automata?
  – Nondeterminism?
![Page 49: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/49.jpg)
Loop Scheduling
• Loop scheduling approaches
  – Most execution time is spent in loops
  – The simplest approach is loop unrolling
  – Software pipelining
    • Exploits inter-iteration ILP: parallelism across iterations
    • Modulo scheduling
      – Produces a kernel of code
      – Kernel: multiple overlapped iterations of a loop, arranged so that neither data dependences nor resource constraints are violated
    • Prologue and epilogue code is needed for correctness
      – This increases code size; hardware techniques can reduce the overhead
![Page 50: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/50.jpg)
Loop Scheduling
• Conceptual illustration of software pipelining
![Page 51: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/51.jpg)
Loop Scheduling (cont.)
• Modulo Scheduling
  – Initiation Interval (II)
    • The length of the kernel: the constant interval between the starts of successive kernel iterations
    • Minimum II (MII)
      – A lower bound on the II
    • Two constraints on the MII
      – Recurrence-constrained minimum II (RecMII)
      – Resource-constrained minimum II (ResMII)
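The two bounds can be computed with the usual textbook formulation, which the slides do not spell out and which is therefore stated here as an assumption: ResMII is the largest ratio of resource uses per iteration to available units, and RecMII is the largest ratio of total latency to total iteration distance over the dependence cycles of the loop. The function names and example numbers are hypothetical.

```python
from math import ceil

def res_mii(uses, units):
    """uses: resource -> uses per iteration; units: resource -> units available."""
    return max(ceil(uses[r] / units[r]) for r in uses)

def rec_mii(recurrences):
    """recurrences: (total latency, total iteration distance) per dependence cycle."""
    # With no recurrence there is no constraint beyond II >= 1.
    return max((ceil(lat / dist) for lat, dist in recurrences), default=1)

def min_ii(uses, units, recurrences):
    return max(res_mii(uses, units), rec_mii(recurrences))

# Assumed loop body: 5 ALU ops on 2 ALUs, 2 memory ops on 1 memory port,
# and two recurrences: latency 4 across 1 iteration, latency 6 across 2 iterations.
uses = {"alu": 5, "mem": 2}
units = {"alu": 2, "mem": 1}
recurrences = [(4, 1), (6, 2)]
mii = min_ii(uses, units, recurrences)
```

In this example the recurrence with latency 4 and distance 1 is the binding constraint, so adding more functional units would not lower the MII.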
![Page 52: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/52.jpg)
Loop Scheduling (cont.)
• Modulo Scheduling
  – Goal
    • Arrange operations so that they can be repeated at the smallest possible II (which determines throughput)
      – Rather than minimizing the stage count of each iteration, which would minimize latency
      – But the stage count also matters, because it relates to the prologue (pipeline filling) and epilogue (pipeline draining)
  – Downsides of modulo scheduling
    • Nested loops are hard to handle
    • Control flow in the loop can be handled only by predication
![Page 53: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/53.jpg)
• Conceptual model of modulo scheduling
  – 4-wide machine; load (3 cycles), multiply and compare (2 cycles)
  – How many inter-iteration dependences?
![Page 54: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/54.jpg)
Loop Scheduling (cont.)
• Modulo Scheduling
  – Modulo Reservation Table (MRT)
    • Used to find a resource-conflict-free schedule over multiple II intervals
    • Ensures that the same resource is not used more than once in the same cycle
    • The MRT records and checks resource usage for each cycle (modulo II)
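A modulo reservation table differs from an ordinary reservation table in one respect: it has exactly II rows, and an operation issued at cycle c occupies row c mod II, because the kernel repeats every II cycles. The class below is a minimal sketch; its name, the per-resource unit counts, and the single-cycle usage model are assumptions made for illustration.

```python
class ModuloReservationTable:
    def __init__(self, ii, units):
        self.ii = ii
        self.units = dict(units)  # resource -> number of identical units
        self.used = [{r: 0 for r in units} for _ in range(ii)]

    def can_place(self, cycle, resource):
        # Cycles c and c + II collide: they are the same slot in the kernel.
        return self.used[cycle % self.ii][resource] < self.units[resource]

    def place(self, cycle, resource):
        if not self.can_place(cycle, resource):
            return False
        self.used[cycle % self.ii][resource] += 1
        return True

mrt = ModuloReservationTable(ii=2, units={"mem": 1, "alu": 2})
```

For example, with II = 2 and one memory port, a memory operation at cycle 0 blocks another at cycle 2, since both land in row 0 of the kernel.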
![Page 55: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/55.jpg)
Modulo Reservation Table
![Page 56: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/56.jpg)
Loop Scheduling (cont.)
• Modulo Scheduling
  – Searching for the II
    • Find two bounds: minII and maxII
    • maxII: trivial, the sum of the latencies of all operations in the loop
    • minII: more complex, max(ResMII, RecMII)
      – Considers resource constraints, and both intra- and inter-iteration dependences
    • Then find a legal schedule within that range
      – Usually with a modified list scheduler that checks resources for each assignment through the MRT
![Page 57: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/57.jpg)
Loop Scheduling (cont.)
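• Modulo Scheduling
  – Searching for the II
    • Basic scheme of iterative modulo scheduling

    minII = compute_minII();    /* max(ResMII, RecMII) */
    maxII = compute_maxII();    /* sum of all operation latencies */
    found = false;
    II = minII;
    while (!found && II <= maxII) {    /* maxII itself is a valid candidate */
        found = try_to_modulo_schedule(II, budget);
        if (!found)
            II = II + 1;    /* on success, II keeps the value that worked */
    }
    if (!found)
        trouble();    /* wrong maxII */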
![Page 58: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/58.jpg)
Loop Scheduling (cont.)
• Modulo Scheduling
  – Prologues and Epilogues
    • Partial copies of the kernel
    • More complex for multiple-exit loops
    • In practice, multiple epilogues are almost always a necessity (but this is beyond our scope!)
    • Kernel-only loop scheduling
      – Condition 1: prologues and epilogues are proper subsets of the kernel code in which some operations have been disabled
      – Condition 2: a fully predicated architecture
[Figure: kernel-only code using predicates]
![Page 59: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/59.jpg)
Loop Scheduling (cont.)
• Modulo Scheduling
  – Modulo Variable Expansion
    • The MRT yields a correct resource schedule for a given II
    • What about register allocation when the lifetime of a value within an iteration exceeds the II?
      – A simple register allocation policy won’t work: the value is overwritten!
    • Solution: artificially extend the II without performance degradation by unrolling the loop body -> Modulo Variable Expansion
    • Must unroll by at least a factor k = ceil(v / II)
      – v = the length of the longest lifetime
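The unroll factor is a one-line computation. In the sketch below the lifetimes, the function names, and the `r1_0`, `r1_1` naming scheme for the expanded copies are all hypothetical; the only thing taken from the slide is k = ceil(v / II).

```python
from math import ceil

def mve_unroll_factor(lifetimes, ii):
    """lifetimes: value -> lifetime in cycles. Returns k = ceil(v / II), at least 1."""
    v = max(lifetimes.values())  # the longest lifetime
    return max(1, ceil(v / ii))

def expanded_names(reg, k):
    """One register name per unrolled copy of the kernel."""
    return [f"{reg}_{j}" for j in range(k)]

lifetimes = {"r1": 3, "r2": 1}   # assumed: r1 lives for 3 cycles
k = mve_unroll_factor(lifetimes, ii=2)
copies = expanded_names("r1", k)
```

With II = 2 and a 3-cycle lifetime, two copies of the kernel are needed so that consecutive iterations write different registers.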
![Page 60: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/60.jpg)
Loop Scheduling (cont.)
• Modulo Scheduling
  – Modulo Variable Expansion
    • Drawbacks: longer kernel code, more register pressure, …
    • Solution: rotating registers
      – A physical register is selected by combining a logical identifier with a register base that is incremented at every iteration
        • A reference to register r at iteration i points to a different location than the same reference at iteration i+1
      – This makes it possible to avoid modulo variable expansion
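The renaming can be modeled in a few lines. This is a toy model, not a description of any particular processor: the class name and file size are hypothetical, and the base is incremented per iteration to match the wording above (real machines may rotate in the other direction).

```python
class RotatingRegisterFile:
    def __init__(self, size):
        self.regs = [None] * size
        self.base = 0

    def _phys(self, logical):
        # Physical index = (register base + logical identifier) mod file size.
        return (self.base + logical) % len(self.regs)

    def write(self, logical, value):
        self.regs[self._phys(logical)] = value

    def read(self, logical):
        return self.regs[self._phys(logical)]

    def next_iteration(self):
        self.base = (self.base + 1) % len(self.regs)

rf = RotatingRegisterFile(4)
rf.write(1, "x from iteration 0")
rf.next_iteration()
rf.write(1, "x from iteration 1")  # same logical name, different physical register
```

After the rotation, the value from the previous iteration is still intact and is reachable under a neighboring logical name, so the kernel needs no unrolled copies.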
![Page 61: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/61.jpg)
Register r1 needs to hold the same variable in two different iterations, but the lifetimes overlap
Unroll the kernel twice!
![Page 62: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/62.jpg)
Two registers (r1, r11) are used to resolve the overlap
Same throughput, but code size suffers
![Page 63: CS8803: Compilers for Embedded System Santosh Pande – Summer 2007 Chapter 8 Compiling for VLIWs and ILP 1.](https://reader035.fdocuments.in/reader035/viewer/2022062718/56649eb65503460f94bbf27a/html5/thumbnails/63.jpg)
Loop Scheduling (cont.)
• Modulo Scheduling
  – Iterative Modulo Scheduling
    • Sometimes it is hard to find a schedule because of a complex MRT
    • To improve the probability of finding a schedule, allow a controlled form of backtracking (unscheduling and rescheduling of instructions)
  – Advanced Modulo Scheduling Techniques
    • So far, several heuristics: e.g. guessing a good minII
    • Recent techniques
      – e.g. Hypernode reduction modulo scheduling (HRMS):
        » reduces loop-variant lifetimes while keeping II constant
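A common starting guess for minII is the larger of a resource bound and a recurrence bound. The sketch below assumes a simplified machine model in which each operation uses exactly one resource for one cycle, and recurrences are supplied as (total latency, total iteration distance) pairs; the function names and sample data are hypothetical.

```python
import math
from collections import Counter

def res_mii(op_resources, capacity):
    """Resource bound: uses of each resource divided by the units available."""
    uses = Counter(op_resources.values())
    return max(math.ceil(n / capacity[r]) for r, n in uses.items())

def rec_mii(recurrences):
    """Recurrence bound over dependence cycles given as (latency, distance)."""
    return max((math.ceil(lat / dist) for lat, dist in recurrences), default=1)

def min_ii(op_resources, capacity, recurrences):
    return max(res_mii(op_resources, capacity), rec_mii(recurrences))

# Hypothetical loop body: five ALU operations and two memory operations.
ops = {"a": "alu", "b": "alu", "c": "alu", "d": "alu", "e": "alu",
       "ld": "mem", "st": "mem"}
units = {"alu": 2, "mem": 1}
print(min_ii(ops, units, recurrences=[(4, 1)]))
```

A scheduler would typically try this II first and increase it by one each time no schedule is found within its backtracking budget.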
Loop Scheduling (cont.)
• Clustering
  – Review of the need for clustering
    • A practical answer to high register demands, as an alternative to a multiported register file or bypassing logic
      – Multiported register files are expensive and scale poorly
    • A clustered architecture divides the machine into separate clusters
    • Each cluster has its own register bank and functional units
    • In general, explicit intercluster operations are needed
  – Compilers’ new role
    • Minimizing intercluster moves and balancing clusters
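The two compiler goals named above, few intercluster moves and balanced clusters, can be expressed as simple metrics over a cluster assignment. This is an illustrative sketch; the data-flow graph, the assignment, and the function names are hypothetical.

```python
from collections import Counter

def intercluster_moves(edges, cluster_of):
    """Count copies needed: one per (produced value, foreign consumer cluster)."""
    copies = {(src, cluster_of[dst])
              for src, dst in edges
              if cluster_of[src] != cluster_of[dst]}
    return len(copies)

def cluster_load(cluster_of):
    """Number of operations placed on each cluster."""
    return Counter(cluster_of.values())

# Hypothetical data-flow edges (producer, consumer) and a two-cluster assignment.
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d")]
assignment = {"a": 0, "b": 0, "c": 0, "d": 1}
print(intercluster_moves(edges, assignment), dict(cluster_load(assignment)))
```

Here two values (a and c) must be copied to cluster 1, and the load is unbalanced at three operations against one, so both metrics argue for a different assignment.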
Loop Scheduling (cont.)
• Clustering
  – Preassignment techniques
    • In general, clustering is done before scheduling
    • Two techniques
      – Bottom-up-greedy (BUG)
        » Two phases: traversing from exit to entry, and assignment
      – Partial-component clustering (PCC)
        » Reduces complexity by constructing macronodes
  – Clustering overheads
    • Two clusters: 15~20% lost cycles; four clusters: 25~30%
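To make "clustering before scheduling" concrete, here is a deliberately simple greedy preassignment. It is neither BUG nor PCC: it walks operations in topological order and picks the cluster that needs the fewest incoming copies, breaking ties by load. All names and the sample graph are hypothetical.

```python
def greedy_preassign(ops, edges, n_clusters):
    """ops must be in topological order; edges are (producer, consumer) pairs."""
    preds = {op: [] for op in ops}
    for src, dst in edges:
        preds[dst].append(src)

    cluster_of = {}
    load = [0] * n_clusters
    for op in ops:
        def cost(c):
            # Copies needed if op goes to cluster c, then current load, then index.
            copies = sum(1 for p in preds[op] if cluster_of[p] != c)
            return (copies, load[c], c)

        best = min(range(n_clusters), key=cost)
        cluster_of[op] = best
        load[best] += 1
    return cluster_of

# Two independent dependence chains: a -> c -> e and b -> d -> f.
ops = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "c"), ("b", "d"), ("c", "e"), ("d", "f")]
print(greedy_preassign(ops, edges, n_clusters=2))
```

On this graph the two chains end up on separate clusters with no intercluster copies; real code is rarely this separable, which is where the lost cycles quoted above come from.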
3. Register Allocation
• Register allocation
  – Memory >> register space
  – NP-hard problem
  – This problem is old and well known
    • Standard technique: coloring of the interference graph
    • Recent: nonstandard register allocation techniques
      – Faster or better than graph coloring
      – Linear-scan allocators
        » Of interest for JIT compilation and dynamic translation
    • Tradeoffs between compile time and run time
      – Feasible today because of faster machines
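A linear-scan allocator can be sketched in a few lines. The version below follows the general shape of the well-known linear-scan scheme (sort live intervals by start, expire finished ones, spill the interval that ends last); the interval data and function name are hypothetical, and details such as spill-code insertion are left out.

```python
def linear_scan(intervals, num_regs):
    """intervals: name -> (start, end), inclusive. Returns (reg_of, spilled)."""
    if num_regs < 1:
        raise ValueError("need at least one register")
    free = list(range(num_regs))
    active = []            # (end, name) pairs, kept sorted by end point
    reg_of = {}
    spilled = set()
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Expire intervals that ended before this one starts.
        while active and active[0][0] < start:
            _, old = active.pop(0)
            free.append(reg_of[old])
        if free:
            reg_of[name] = free.pop(0)
            active.append((end, name))
            active.sort()
        else:
            last_end, last = active[-1]
            if last_end > end:
                # Spill the interval that ends last and reuse its register.
                reg_of[name] = reg_of.pop(last)
                spilled.add(last)
                active[-1] = (end, name)
                active.sort()
            else:
                spilled.add(name)
    return reg_of, spilled

# Hypothetical live intervals and a two-register machine.
intervals = {"a": (0, 10), "b": (1, 3), "c": (2, 8), "d": (4, 6)}
print(linear_scan(intervals, num_regs=2))
```

The single pass over sorted intervals is what makes the approach attractive when compile time counts, as in JIT compilers.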
Phase-ordering Issues
• Phase ordering is a hard problem
  – Should register allocation be done before, after, or at the same time as scheduling?
  – Register allocation and scheduling have conflicting goals
    • Register allocator tries to minimize spills and restores, creating sequential constraints through register reuse
    • Scheduler tries to fill all parallel units
    • How to order them?
      – Very tricky problem
Phase-ordering Issues
• Scheduling followed by Register Allocation followed by Post-scheduling
  – The most popular choice (common for modern RISC)
  – Favors ILP over efficient register utilization
    • Assumes enough registers are available
  – Post-scheduler rearranges the code
Scheduling: without regard for the number of physical registers actually available
Register allocation: though no allocation might exist that makes the schedule legal, so insert spills/restores
Post-scheduling: after inserting spills/restores, fix up schedule, making it legal, with least possible added cycles
Phase-ordering Issues
• Register Allocation followed by Scheduling
  – Favors register use over exploiting ILP
  – Works well with few GPRs (e.g. x86)
  – But the register allocator introduces additional dependences every time it reuses a register
Register allocation: producing code with all registers assigned
Scheduling (though not very effectively, because the register allocation has inserted many false dependences)
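The false dependences mentioned above can be made visible by comparing the dependence sets before and after allocation. This is a small sketch with a hypothetical four-operation sequence; each operation is written as (destination, sources), and the function name is illustrative.

```python
def dependences(code):
    """code: list of (dest, sources). Returns a set of (kind, earlier, later)."""
    last_writer = {}
    readers = {}           # register -> readers since its last write
    deps = set()
    for j, (dst, srcs) in enumerate(code):
        for s in srcs:
            if s in last_writer:
                deps.add(("RAW", last_writer[s], j))     # true dependence
            readers.setdefault(s, []).append(j)
        if dst in last_writer:
            deps.add(("WAW", last_writer[dst], j))       # output dependence
        for i in readers.get(dst, []):
            if i != j:
                deps.add(("WAR", i, j))                  # anti dependence
        last_writer[dst] = j
        readers[dst] = []
    return deps

def false_deps(code):
    return {d for d in dependences(code) if d[0] != "RAW"}

# Two independent load/add pairs, first with virtual registers...
virtual = [("v1", ()), ("v2", ("v1",)), ("v3", ()), ("v4", ("v3",))]
# ...then after allocation onto only two physical registers.
allocated = [("r1", ()), ("r2", ("r1",)), ("r1", ()), ("r2", ("r1",))]
print(len(false_deps(virtual)), len(false_deps(allocated)))
```

Before allocation the two pairs could issue in parallel; after allocation the WAR and WAW dependences on r1 and r2 force them into sequence, which is what limits a scheduler that runs afterwards.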
Phase-ordering Issues
• Combined Register Allocation and Scheduling
  – Potentially very powerful, but very complex
  – A list-scheduling algorithm may not converge
• Cooperative Approaches
  – Scheduler monitors register resources and estimates register pressure in its heuristics
Scheduling and register allocation done together (difficult engineering, and it is difficult to ensure that scheduling will ever terminate)
4. Speculation and Predication
• Speculation and Predication
  – Remove and transform control dependences
  – Usually they are independent techniques, and in a given situation one is much more appropriate than the other
  – Note that predication is important in software pipelining
Control and Data Speculation
• Control and Data Speculation
  – Recall exception behavior in the recovery model
    • Nonexcepting parts and sentinel (checking) parts
    • From the compiler’s perspective
      – Supporting nonexcepting loads is complicated because of recovery-code handling
  – Speculative code motion (or code hoisting)
    • Removes actual control dependences, unlike predication
    • Compiler needs to consider the supported exception model and speculative memory operations
Speculative code motion example
load operation becomes speculative load (load.s)
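The load.s idea can be mimicked in a toy model: the speculative load never faults but returns a marker, and a later check (the sentinel) triggers recovery only on the path that really needs the value. Everything here (the NAT marker, the dict-based memory, the function names) is a hypothetical illustration, not an actual ISA interface.

```python
NAT = object()   # hypothetical "not a thing" marker for a deferred exception

def load(memory, addr):
    return memory[addr]              # excepting load: raises KeyError if invalid

def load_s(memory, addr):
    return memory.get(addr, NAT)     # nonexcepting load: defers the fault

def check(value, recover):
    # Sentinel: run recovery code only if the speculative load failed.
    return recover() if value is NAT else value

def original(memory, p):
    if p is not None:
        return load(memory, p) + 1
    return 0

def hoisted(memory, p):
    t = load_s(memory, p)            # load moved above the branch
    if p is not None:
        return check(t, lambda: load(memory, p)) + 1
    return 0

memory = {100: 41}
print(original(memory, 100), hoisted(memory, 100), hoisted(memory, None))
```

The hoisted version starts the load before the branch is resolved, yet a null pointer on the untaken path causes no fault, and a bad address on the taken path still raises the same exception as the original.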
Predicated Execution
• Compiler techniques for predication
  – Examples: if-conversion, logical reduction of predicates, reverse if-conversion, and hyperblock-based scheduling
  – If-conversion
    • Translates control dependence into data dependence
    • Converts an acyclic subset of the CFG from unpredicated code into straight-line code with predication
    • Also tries to minimize the number of predicate values
      – Logical reduction of predicates
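If-conversion can be shown on a two-way branch. The sketch below uses a tiny hypothetical predicated "machine": each operation carries an optional guard, and guarded operations whose predicate is false are nullified. The operation encoding and names are assumptions made for illustration.

```python
def branchy(a, b):
    # Original code with a control dependence.
    if a > b:
        x = a - b
    else:
        x = b - a
    return x

# If-converted form: one compare defines predicate p, both arms become guarded ops.
PREDICATED = [
    (None, "p", lambda e: e["a"] > e["b"]),   # p = (a > b), always executed
    ("p",  "x", lambda e: e["a"] - e["b"]),   # (p)  x = a - b
    ("!p", "x", lambda e: e["b"] - e["a"]),   # (!p) x = b - a
]

def run_predicated(ops, env):
    """Execute straight-line code; skip ops whose guard evaluates to false."""
    for guard, dest, fn in ops:
        if guard is not None:
            negate = guard.startswith("!")
            if env[guard.lstrip("!")] == negate:
                continue                       # nullified operation
        env[dest] = fn(env)
    return env

def if_converted(a, b):
    return run_predicated(PREDICATED, {"a": a, "b": b})["x"]

print(branchy(7, 3), if_converted(7, 3), if_converted(3, 7))
```

The branch is gone: the choice between the two assignments now depends on the data value p rather than on control flow, which lets a scheduler place both arms freely in the same straight-line region.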
Predicated Execution (cont.)
• Compiler techniques for predication
  – Reverse if-conversion
    • Removes predicates, returning to unpredicated code
    • May be worthwhile to if-convert
    • When there are insufficient predicate registers, selectively reverse if-convert
  – Hyperblock-based scheduling
    • Unified framework for both speculation and predication
    • First choose a hyperblock region, then apply if-conversion
      – Gives the schedule constructor much more freedom to schedule, and removes speculative constraints
Example of predicated code
Always executed
Predicated Execution (cont.)
• Case studies in embedded systems
  – Usually not fully predicated, unlike the IPF architecture
  – ARM includes a 4-bit predicate in every operation
    • Looks as if every operation is always predicated
    • But the predicate register is the usual set of condition code flags instead of an index into general predicates
  – TI C6x supports full predication
    • Five of the GPRs can be specified as condition registers
Prefetching
• Memory prefetching
  – A form of speculation, and invisible to programs
  – Compiler-supported prefetching is better than pure hardware prefetching in many cases
• Compiler assistance in prefetching
  – ISA includes a prefetch instruction
    • Only a hint to the hardware
  – Automatic insertion requires understanding loop behavior
  – Unneeded prefetches waste resources
Other Topics
• Data Layout Methods
  – Increase locality by taking cache lines into account
• Static and Hybrid Branch Prediction
  – Profiles are used to set static branch prediction
  – More sophisticated approach
    • Hybrid method: statically or dynamically
      – e.g. IPF includes four branch encoding hints
        • static taken, static not-taken, dynamic taken, and dynamic not-taken
5. Instruction Selection
• Instruction Selection
  – Translates from a tree-structured, linguistically oriented IR to an operation- and machine-oriented IR
  – Especially important with complex instruction sets
  – Recent technique
    • Cost-based pattern-matching rewriting systems
      – “Match” or “cover” the parse tree produced by the front end using a minimum-cost set of operation subtrees
    • e.g. BURS (bottom-up rewriting systems)
      – 1st pass labels each node in the parse tree
      – 2nd pass reads the labels and generates target machine operations
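The two-pass structure can be sketched with a toy tree matcher. This is not a real BURS generator: the rules are hand-coded, there is a single "register" nonterminal, and the opcodes (li, add, addi, mul, madd) and their costs are hypothetical.

```python
def label(node):
    """Pass 1 (bottom-up): return (cost, rule, kids) for the cheapest cover."""
    op = node[0]
    if op == "REG":
        return (0, "reg", [])            # value already in a register
    if op == "CONST":
        return (1, "li", [])             # load immediate
    left, right = label(node[1]), label(node[2])
    candidates = []
    if op == "ADD":
        candidates.append((1 + left[0] + right[0], "add", [left, right]))
        if node[2][0] == "CONST":
            candidates.append((1 + left[0], "addi", [left]))
        if node[1][0] == "MUL":
            a, b = left[2]               # operands of the multiply being fused
            candidates.append((3 + a[0] + b[0] + right[0], "madd", [a, b, right]))
    elif op == "MUL":
        candidates.append((3 + left[0] + right[0], "mul", [left, right]))
    else:
        raise ValueError(f"no rule for operator {op!r}")
    return min(candidates, key=lambda c: c[0])

def emit(labelled, out):
    """Pass 2: read the labels and emit opcodes, operands first."""
    _, rule, kids = labelled
    for kid in kids:
        emit(kid, out)
    if rule != "reg":
        out.append(rule)
    return out

tree = ("ADD", ("MUL", ("REG", "a"), ("REG", "b")), ("REG", "c"))
labelled = label(tree)
print(labelled[0], emit(labelled, []))
```

For a*b + c the labeller prefers the fused multiply-add (cost 3) over a separate multiply and add (cost 4), so the second pass emits a single operation.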