Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers Stephen Hines,...
-
Upload
frederick-campbell -
Category
Documents
-
view
218 -
download
1
Transcript of Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers Stephen Hines,...
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers
Stephen Hines, David Whalley and Gary TysonComputer Science Dept.Florida State University
October 23, 2006
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 2/17
Instruction Packing
Store frequently occurring instructions as specified by the compiler in a small, low-power Instruction Register File (IRF)
Allow multiple instruction fetches from the IRF by packing instruction references together Tightly packed – multiple IRF references Loosely packed – piggybacks an IRF reference
onto an existing instruction Facilitate parameterization of some
instructions using an Immediate Table (IMM)
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 3/17
Instruction Cache
PC
IF/ID
IRF
IMM
IRWP
packed instruction
packed instruction insn1
insn1insn2
insn2
imm3
insn3
insn3
imm3
insn4
insn4
Execution of IRF Instructions
Instruction Fetch Stage First Half of Instruction Decode Stage
To InstructionDecoder
Executing a Tightly Packed Param4c Instruction
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 4/17
Outline
Introduction Improved Promotion to the IRF Compiler Optimizations
Instruction Selection Register Re-assignment Instruction Scheduling
Experimental Evaluation Conclusions & Future Work
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 5/17
Improved Promotion to the IRF
Different classes of instructions can consume 1 – 5 slots More accurately model the benefits of promoting from one
class of instruction to another Original IRF papers did not promote multiple I-type instructions
with different default immediate values addi $3, $3, 4 and addi $3, $3, 1 would not both reside in the
IRF, no matter how frequently they occurred
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 6/17
Mixed Profiling
Static profiling is best for decreasing code size
Dynamic profiling is best for reducing energy consumption
Can simultaneously weight static and dynamic profile data to obtain a mixed result that has both good code compression and reduced energy consumption
Can obtain most of the benefits of individual static/dynamic profiling
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 7/17
Compiler Optimizations
Instruction Selection Choose beneficial encodings for increasing redundancy
Register Re-assignment Attempts to rename registers such that instructions can be
accessed via IRF Instruction Scheduling
Intra-block – focus on reordering instructions so that dense packs are formed (both tight and loose)
Inter-block – attempt to move instructions between blocks to fill up packs ending with branches/jumps Code duplication Predication
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 8/17
Intra-block Instruction Scheduling
3 1
2
4
5
InstructionDependence DAG
1
2
3
4 4’
5
1
2
3
4 4’
5
2
5
1
54 4’
Without InstructionScheduling
With InstructionScheduling
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 9/17
a
c
Code Duplication to Reduce Code Size
• • •
5
b
5’
2
W
X Y
Z 11
3 3’
4 4’
3 3’4 4’
6 slots is too manyto fit in a singlepacked instruction …but we can duplicatea single instruction …resulting in the abilityto pack the remaining5 slots together.
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 10/17
a
Predication – Forward Branches
• • •
Cond Branch
X
• • •
4 4’
Y
Z
Fall-through
Branchtaken path
2 3 2’b4 4’1Instructions packedafter forward brancheswill only be executedwhen the branch isnot taken
4 4’2’32
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 11/17
Predication – Backward Branches
• • •
a b c
2 2’
Branch d e f
• • •
11 2
Branchoffset
2’
Instructions packedafter backwardbranches will only beexecuted when thebranch is taken
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 12/17
Predication Advantages with IRF
IRF facilitates a form of predication for the MIPS – a baseline architecture that traditionally does not support predication
No need to waste instruction encoding space specifying predicate bits for most/all instructions (even ARM traded away general predication for reducing code size with Thumb and Thumb2)
No need to fetch, decode and possibly execute instructions that are annulled after the branch within a pack (reducing energy consumption and execution time)
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 13/17
Experimental Evaluation
MiBench embedded benchmark suite – 6 categories representing common tasks for various domains
SimpleScalar MIPS/PISA architectural simulator Out-of-order, single issue embedded machine with 8KB 4-
way set associative L1 instruction and data caches and 128-entry bimodal branch predictor
Wattch/Cacti extensions for modeling energy consumption (inactive portions of pipeline only dissipate 10% of normal energy when using cc3 clock gating)
VPO – Very Portable Optimizer targeted for SimpleScalar MIPS/PISA
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 14/17
Energy Consumption
70.0%
75.0%
80.0%
85.0%
90.0%
95.0%
100.0%
Automotive
Consumer
NetworkOffic
e
Security
Telecomm
Average
Benchmark Category
Tot
al E
nerg
yNo optimizations Promotion Inst Selection Reg Re-assign Intra-sched Inter-sched
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 15/17
Static Code Size
57.5%
62.5%
67.5%
72.5%
77.5%
82.5%
87.5%
92.5%
97.5%
Automotive
Consumer
NetworkOffic
e
Security
Telecomm
Average
Benchmark Category
Sta
tic C
ode
Siz
eNo optimizations Promotion Inst Selection Reg Re-assign Intra-sched Inter-sched
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 16/17
IRF Promotion with Mixed Profiling
67.5%
72.5%
77.5%
82.5%
87.5%
92.5%
97.5%
100/0(Dynamic)
75/25 50/50 25/75 0/100(Static)
Dynamic/Static Mixture
Re
lativ
e M
ea
sure
(%
)Code Size Optimized Code Size Total Energy Optimized Total Energy
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 17/17
Conclusions & Future Work
Compiler optimizations targeted specifically for IRF can further reduce energy (12.2%15.8%), code size (16.8%28.8%) and execution time
Unique transformation opportunities exist due to IRF, such as code duplication for code size reduction and predication
As processor designs become more idiosyncratic, it is increasingly important to explore the possibility of evolving existing compiler optimizations
Register targeting and loop unrolling should also be explored with instruction packing
Enhanced parameterization techniques
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 19/17
Tightly Packed Instruction Format
New opcodes for this T-format of MISA instructions Supports sequential execution of up to 5 RISA instructions
from the IRF Unnecessary fields are padded with nop
Supports up to 2 parameters replacing instruction slots Parameters can come from 32-entry IMM Each IRF entry also retains a default immediate value as well Branches use these 5 bits for displacements R-type RISA instructions can use parameter to replace RD field
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 20/17
MIPS Instruction Format Modifications
Creating Loosely Packed instructions R-type: Removed shamt field and merged with rs I-type: Shortened immediate values (16-bit 11bit)
Lui now uses 21-bit immediate values, hence no loose packing
J-type: Unchanged
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 22/17
Introduction
Embedded Processor Design Constraints Energy Consumption Static Code Size Execution Time
Fetch logic is responsible for 36% of total processor power on StrongARM
Two Primary Areas for Improvement in Instruction Fetch Better fetch mechanism and storage
Instruction Cache and/or ROM – Lower power than main memory, but still a fairly large, flat storage method
Better instruction encodings Instruction encodings are wasteful with bits
Maximize functionality, but simplify decoding (fixed length) Most applications only apply a subset of available instructions Nowhere near theoretical compression limits
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 23/17
Instruction Redundancy
Profiled largest benchmark in each of six MiBench categories
Most frequent 32 instructions comprise 66.5% of total dynamic and 31% of total static instructions
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 24/17
Access of Data & Instructions
Each lower layer is designed to improve accessibility of current/frequent items, albeit at a reduction in number of available items.
Caching is beneficial, but compilers can do better for the “most frequently” accessed data items (e.g. Register Allocation).
Instructions have no analogue to the Data Register File (RF).
L1 Data Cache
?????Data Register File
L1 Instruction Cache
L2 Cache
Main Memory
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 25/17
Instruction Packing
Store frequently occurring instructions as specified by the compiler in a small, low-power Instruction Register File (IRF)
Allow multiple instruction fetches from the IRF by packing instruction references together
Facilitate parameterization of some instructions using an Immediate Table (IMM)
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 26/17
Outline
Introduction IRF Instruction Set Architecture IRF Register Windowing Compiler Optimizations Experimental Framework Results Interaction with Other Techniques Proposed Enhancements for Instruction Packing Conclusions Publication Plan
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 27/17
IRF Instruction Set Architecture
MIPS ISA – commonly known and provides simple encoding
RISA (Register ISA) – instructions available via IRF access
MISA (Memory ISA) – instructions available in memory New instruction formats that can reference
multiple RISA instructions – Tightly Packed Original instructions modified to pack an
additional RISA reference – Loosely Packed
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 28/17
Instruction Cache
PC
IF/ID
IRF
IMM
IRWP
packed instruction
packed instruction insn1
insn1insn2
insn2
imm3
insn3
insn3
imm3
insn4
insn4
Execution of IRF Instructions
Instruction Fetch Stage First Half of Instruction Decode Stage
To InstructionDecoder
Executing a Tightly Packed Param4c Instruction
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 29/17
Tightly Packed Instruction Format
New opcodes for this T-format of MISA instructions Supports sequential execution of up to 5 RISA instructions
from the IRF Unnecessary fields are padded with nop
Supports up to 2 parameters replacing instruction slots Parameters can come from 32-entry IMM Each IRF entry also retains a default immediate value as well Branches use these 5 bits for displacements R-type RISA instructions can use parameter to replace RD field
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 30/17
MIPS Instruction Format Modifications
Creating Loosely Packed instructions R-type: Removed shamt field and merged with rs I-type: Shortened immediate values (16-bit 11bit)
Lui now uses 21-bit immediate values, hence no loose packing
J-type: Unchanged
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 31/17
Compilation Framework
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 32/17
Packing Instructions
Greedy selection of instructions based on static or dynamic profile data
Examine a sliding window of instructions for each basic block, attempting to pack adjacent RISA instructions together Denser packs are more favorable (e.g. tight5) Branch offset distances can slip into range when
we apply an iterative packing algorithm
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 33/17
IRF Register Windowing
Simply increasing the number of IRF registers reduced number of RISA instructions available in a single MISA instruction
Instead, keep format the same and provide multiple IRF windows to switch between on a per-function basis
Can dynamically switch IRF contents on calls/returns or provide actual physically separate register windows
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 34/17
Software Windowing
Add new instructions to dynamically save/restore instruction registers
Calculate cost/benefit analysis of creating a new window/partition for a given function
Insert appropriate instructions for saving/restoring IRs between call/return sequences
Drawbacks Requires the complete call graph for the application Extra instructions for saving/restoring can add to
execution time
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 35/17
Hardware Windowing
IRF windows are shared amongst functions with similar instruction mix
Function addresses are modified to incorporate an instruction register window pointer (IRWP) Call instruction – transfers to new function and switches
the window by adjusting the IRWP; also saves the return address and current IRWP
Return Instruction – transfers control back to the caller and restores the previous IRWP
Inactive windows can be kept in a low-power state
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 36/17
Compiler Optimizations
Improved Instruction Promotion Adapt existing techniques to improve
instruction packing Instruction Selection Register Re-assignment Instruction Scheduling
Increase application redundancy Eliminate constraints on packing
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 37/17
Improving Instruction Promotion
Different Classes of instructions – can consume between one and five of the potentially available slots for a tight pack
Goal is to more accurately model the benefits of promoting from one class of instruction to another Original IRF papers did not consider the promotion of multiple I-
type instructions with different default immediate values addi $3, $3, 4 and addi $3, $3, 1 would not both reside in the
IRF, no matter how frequently they occurred Can simultaneously weight static and dynamic profile data to
obtain a mixed result that has both good code compression and reduced energy consumption
Can obtain most of the benefits of individual static/dynamic profiling
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 38/17
Instruction Selection
Encode short jumps as branches (now parameterizable) j label beq $0, $0, rel_label
Replace simple instructions with equivalent parameterizable forms addu $2, $3, $0 addiu $2, $3, 0
Ensure that commutative operations always have the same order of operands addu $2, $4, $2 addu $2, $2, $4
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 39/17
Register Re-assignment
Performed after registers have been assigned in the second compilation with packing
Use a Register Interference Graph to represent register live ranges in the function
Re-assign live ranges to other registers if this improves packing density Live ranges ordered by dynamic frequency Modify register in conflicting live ranges if
beneficial
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 40/17
Instruction Scheduling
Intra-block – focus on reordering instructions so that dense packs are formed (both tight and loose)
Inter-block – attempt to move instructions between blocks to fill up packs ending with branches/jumps Code duplication Predication
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 41/17
Experimental Framework
MiBench embedded benchmark suite – 6 categories representing common tasks for various domains
SimpleScalar MIPS/PISA architectural simulator Out-of-order, single issue embedded machine with 8KB 4-
way set associative L1 instruction and data caches and 128-entry bimodal branch predictor
Wattch/Cacti extensions for modeling energy consumption (inactive portions of pipeline only dissipate 10% of normal energy when using cc3 clock gating)
VPO – Very Portable Optimizer targeted for SimpleScalar MIPS/PISA
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 42/17
Results – Processor Energy
15.82%
-2.50%
0.00%
2.50%
5.00%
7.50%
10.00%
12.50%
15.00%
17.50%
20.00%
22.50%
25.00%
27.50%
30.00%
Bas
icm
ath
Bitc
ount
Qso
rt
Sus
an
Jpeg
Lam
e
Tiff
2bw
Dijk
stra
Pat
ricia
Ispe
ll
Rsy
nth
Str
ings
earc
h
Blo
wfis
h
Pgp
Rijn
dael
Sha
CR
C32
FF
T
Adp
cm
Gsm
Ave
rage
Benchmark
Cu
mu
lati
ve E
ner
gy
Red
uct
ion
(%
)
IRF Parameterization Window ing Mixed Profile Optimizations
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 43/17
Results – Static Code Size
28.83%
0.00%
5.00%
10.00%
15.00%
20.00%
25.00%
30.00%
35.00%
40.00%
45.00%
50.00%
55.00%
60.00%
65.00%
Bas
icm
ath
Bitc
ount
Qso
rt
Sus
an
Jpeg
Lam
e
Tiff
2bw
Dijk
stra
Pat
ricia
Ispe
ll
Rsy
nth
Str
ings
earc
h
Blo
wfis
h
Pgp
Rijn
dael
Sha
CR
C32
FF
T
Adp
cm
Gsm
Ave
rage
Benchmark
Cu
mu
lati
ve C
od
e S
ize
Red
uct
ion
(%
)
IRF Parameterization Window ing Mixed Profile Optimizations
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 44/17
Results – Execution Cycles
1.30%
-4.00%
-3.00%
-2.00%
-1.00%
0.00%
1.00%
2.00%
3.00%
4.00%
5.00%
Bas
icm
ath
Bitc
ount
Qso
rt
Sus
an
Jpeg
Lam
e
Tiff
2bw
Dijk
stra
Pat
ricia
Ispe
ll
Rsy
nth
Str
ings
earc
h
Blo
wfis
h
Pgp
Rijn
dael
Sha
CR
C32
FF
T
Adp
cm
Gsm
Ave
rage
Benchmark
Cu
mu
lati
ve C
ycle
Tim
e Im
pro
vem
ent
(%)
IRF Parameterization Window ing Mixed Profile Optimizations
22.34%
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 45/17
Results – Mixed Profiling
67.5%
72.5%
77.5%
82.5%
87.5%
92.5%
97.5%
100/0(Dynamic)
75/25 50/50 25/75 0/100(Static)
Dynamic/Static Mixture
Re
lativ
e M
ea
sure
(%
)Code Size Optimized Code Size Total Energy Optimized Total Energy
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 46/17
Interaction with Other Techniques
Loop Cache – small buffer that captures innermost loops dynamically IRF can operate synergistically, increasing utilization of
loop cache when innermost loops are larger than normal Fetch Energy reduced by 56% with 8-entry loop cache
and 4 window IRF L0 (Filter) Cache – small instruction cache with low
energy consumption, but can negatively impact execution time due to cache misses IRF reduces working set size and can mask a portion of
the cache miss penalty due to overlapped fetch Fetch energy reduced by 70% with 256-byte L0 cache and
4 window IRF
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 47/17
Proposed Enhancements for Instruction Packing
Splitting of MISA/RISA Split Opcode/Operand Encoding in
RISA Improved Parameterization Link-Time Instruction Packing Instruction Promotion as a Side Effect
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 48/17
Splitting of MISA/RISA
No need to arbitrarily limit RISA to same instructions as MISA
Tailor each ISA to particular design goals MISA – reduced code size (small instructions) RISA – improved performance &
expressiveness (large instructions) Baseline ISA choices
ARM/Thumb – 32/16 bit dual width ISA FITS – opcodes mapped to 16 bit instructions ARM/Thumb2 – 32/16 bit variable length ISA
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 49/17
Split MISA Encodings
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 50/17
Splitting Opcodes/Operands
Existing code compression schemes have benefited from separating the encoding of opcodes and operands
Can re-encode RISA to separate opcode and operand streams
Loosely packed instructions can use traditional or new split encodings
Tightly packed instructions could even continue supporting parameterization
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 51/17
Split Opcode/Operand Encodings
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 52/17
Improved Parameterization
Current IRF parameterization is limited to immediate values and destination registers Branches use parameter for offset distance Other I-type instructions can index into IMM
based on the parameter value R-type instructions use parameter to replace RD
Extend parameterization to any field: RS, RT, RD, immediate, and even Opcode!!!
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 53/17
Parameterization Encoding
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 54/17
Link-Time Instruction Packing
Current implementation cannot pack statically linked library routines
Incremental packing strategy with a more precise evaluation of exact benefits of packing particular instructions
Integration with existing tools like DIABLO for ARM/Thumb or MIPS
Support for new inter-procedural optimizations like procedural abstraction and dead code elimination
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 55/17
Promotion as a Side Effect
Current implementation uses a statically allocated IRF that does not change at run-time Prior dynamic scheme experienced a great deal
of overhead for saving/restoring IRF entries Change promotion mechanism to make
saves and restores simpler Capture simple loop behaviors in the same
manner as other techniques like Zero Overhead Loop Buffer (ZOLB)
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 56/17
Loosely Packed Promotion
Use entire 5-bit field to specify IRF entry to write into All RISA executions must come from tightly
packed instructions Reserve 1-bit in the loosely packed field to
represent read/write of IRF entry Write this instruction (after executing it) into the
corresponding 4-bit IRF entry Reduces available number of IRs
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 57/17
IRF Calling Conventions
Maintain some IRs as non-scratch (callee-save) and some as scratch (caller-save)
Use appropriate registers for varying behavior (tight loops with and without function calls)
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 58/17
Dynamic Promotion
Peel loop and then pack instructions in the actual loop Special handling for conditional control flow Can alleviate huge code size increases when combined
with loop unrolling Added code size and potentially extra instructions
for save/restore, but huge potential for energy savings in tight loops Link-time analysis can eliminate saves/restores Potential for actual rotating register file (like SPARC)
Surpasses existing ZOLB and loop cache techniques since loops with conditional control flow can be captured
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 59/17
Conclusions
Efficient fetch mechanism that simultaneously reduces Processor Energy Consumption (15.8% reduction) Static Code Size (28.8% reduction) Execution Time (1.3% reduction)
Improvements via parameterization, windowing, compiler optimizations and integration with other processor enhancements
Future work is rich and diverse: Specialized MISA and RISA encodings Improved parameterization Link-time analysis and instruction packing Improved methods for IRF promotion
Demonstrates the need for careful design/balance of instruction sets and instruction fetch mechanisms
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 60/17
Previous Publications
1. Hines, S., Green, J., Tyson, G., and Whalley, D. Improving program efficiency by packing instructions into registers. In Proceedings of the 2005 ACM/IEEE International Symposium on Computer Architecture (June 2005), IEEE Computer Society, pp. 260–271.
2. Hines, S., Tyson, G., and Whalley, D. Reducing instruction fetch cost by packing instructions into register windows. In Proceedings of the 38th annual ACM/IEEE International Symposium on Microarchitecture (November 2005), IEEE Computer Society, pp. 19–29.
3. Hines, S., Tyson, G., and Whalley, D. Improving the energy and execution efficiency of a small instruction cache by using an instruction register file. In Proceedings of the 2nd Watson Conference on Interaction between Architecture, Circuits, and Compilers (September 2005), pp. 160–169.
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 61/17
Publications to be Submitted
4. Hines, S., Whalley, D., and Tyson, G. Adapting compilation techniques to enhance the packing of instructions into registers. Submitting to the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, October 2006.
5. Hines, S., Tyson, G., and Whalley, D. Improving fetch efficiency by using an instruction register file. Planning to submit.
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 62/17
Other Publications
6. Hines, S., Kulkarni, P., Whalley, D., and Davidson, J. Using de-optimization to re-optimize code. In Proceedings of the 5th International Conference on Embedded Software (September 2005), ACM Press, pp. 114–123.
7. Kulkarni, P., Hines, S., Hiser, J., Whalley, D., Davidson, J., and Jones, D. Fast searches for effective optimization phase sequences. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (June 2004), pp. 171–182.
8. Kulkarni, P. A., Hines, S. R., Whalley, D. B., Hiser, J. D., Davidson, J. W., and Jones, D. Fast and efficient searches for effective optimization phase sequences. ACM Transactions on Architecture and Code Optimization (June 2005), 165–198.
9. Kulkarni, P., Zhao, W., Hines, S., Whalley, D., Yuan, X., van Engelen, R., Gallivan, K., Hiser, J., Davidson, J., Cai, B., Bailey, M., Moon, H., Cho, K., Paek, Y., and Jones, D. VISTA: VPO interactive system for tuning applications. ACM Transactions on Embedded Computing Systems, 2006.
10. Kleffner, M., Jones, D., Hiser, J., Kulkarni, P., Parent, J., Hines, S., Whalley, D., Davidson, J., and Gallivan, K. On the use of compilers in DSP laboratory instruction. In Proceedings of the 2006 IEEE International Conference on Acoustics, Speech, and Signal Processing (May 2006).
11. Kreahling, W., Hines, S., Whalley, D., and Tyson, G. Reducing the cost of conditional transfers of control by using comparison specifications. In Proceedings of the 2006 ACM SIGPLAN conference on Languages, Compilers, and Tools for Embedded Systems (June 2006), ACM Press.
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 63/17
Conferences/Journals
Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)
Conference on Languages, Compilers and Tools for Embedded Systems (LCTES)
Conference on Programming Language Design and Implementation (PLDI)
International Symposium on Computer Architecture (ISCA) International Symposium on Microarchitecture (MICRO) ACM Transactions on Architecture and Code Optimizations
(TACO) IEEE Transactions on Computers (TOC) ACM Transactions on Computer Systems (TOCS) ACM Transactions in Embedded Computing Systems (TECS)
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 64/17
Publication Plan
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 65/17
The End
Questions ???
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 68/17
Compiler Optimization Legend
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 69/17
Intra-block Instruction Scheduling
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 70/17
Code Duplication to Reduce Code Size
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 71/17
Predication – Forward Branches
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 72/17
Predication – Backward Branches
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 73/17
Predication Advantages with IRF
IRF facilitates a form of predication for the MIPS – a baseline architecture that traditionally does not support predication
No need to waste instruction encoding space specifying predicate bits for most/all instructions (even ARM traded away general predication for reducing code size with Thumb and Thumb2)
No need to fetch, decode and possibly execute instructions that are annulled after the branch within a pack (reducing energy consumption and execution time)
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 74/17
Energy Consumption
70.0%
75.0%
80.0%
85.0%
90.0%
95.0%
100.0%
Automotive
Consumer
NetworkOffic
e
Security
Telecomm
Average
Benchmark Category
Tot
al E
nerg
yNo optimizations Promotion Inst Selection Reg Re-assign Intra-sched Inter-sched
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 75/17
Static Code Size
57.5%
62.5%
67.5%
72.5%
77.5%
82.5%
87.5%
92.5%
97.5%
Automotive
Consumer
NetworkOffic
e
Security
Telecomm
Average
Benchmark Category
Sta
tic C
ode
Siz
eNo optimizations Promotion Inst Selection Reg Re-assign Intra-sched Inter-sched
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 76/17
Execution Cycles
90%
92%
94%
96%
98%
100%
102%
Automotive
Consumer
NetworkOffic
e
Security
Telecomm
Average
Benchmark Category
Exe
cutio
n C
ycle
sNo optimizations Promotion Inst Selection Reg Re-assign Intra-sched Inter-sched
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 77/17
Conclusions
IRF provides dense encodings for frequently occurring instructions
Compiler optimizations targeted specifically for IRF can further reduce energy (12.2%15.8%), code size (16.8%28.8%) and execution time
Unique transformation opportunities exist due to IRF, such as code duplication for code size reduction and predication
As processor designs become more idiosyncratic, it is increasingly important to explore the possibility of evolving existing compiler optimizations