Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers Stephen Hines,...

Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers

Stephen Hines, David Whalley and Gary TysonComputer Science Dept.Florida State University

October 23, 2006

Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers 2/17

Instruction Packing

Store frequently occurring instructions as specified by the compiler in a small, low-power Instruction Register File (IRF)

Allow multiple instruction fetches from the IRF by packing instruction references together Tightly packed – multiple IRF references Loosely packed – piggybacks an IRF reference

onto an existing instruction Facilitate parameterization of some

instructions using an Immediate Table (IMM)


Instruction Cache

PC

IF/ID

IRF

IMM

IRWP

packed instruction

packed instruction insn1

insn1insn2

insn2

imm3

insn3

insn3

imm3

insn4

insn4

Execution of IRF Instructions

Instruction Fetch Stage First Half of Instruction Decode Stage

To InstructionDecoder

Executing a Tightly Packed Param4c Instruction


Outline

Introduction Improved Promotion to the IRF Compiler Optimizations

Instruction Selection Register Re-assignment Instruction Scheduling

Experimental Evaluation Conclusions & Future Work


Improved Promotion to the IRF

Different classes of instructions can consume 1 – 5 slots More accurately model the benefits of promoting from one

class of instruction to another Original IRF papers did not promote multiple I-type instructions

with different default immediate values addi $3, $3, 4 and addi $3, $3, 1 would not both reside in the

IRF, no matter how frequently they occurred


Mixed Profiling

Static profiling is best for decreasing code size

Dynamic profiling is best for reducing energy consumption

Can simultaneously weight static and dynamic profile data to obtain a mixed result that has both good code compression and reduced energy consumption

Can obtain most of the benefits of individual static/dynamic profiling


Compiler Optimizations

Instruction Selection Choose beneficial encodings for increasing redundancy

Register Re-assignment Attempts to rename registers such that instructions can be

accessed via IRF Instruction Scheduling

Intra-block – focus on reordering instructions so that dense packs are formed (both tight and loose)

Inter-block – attempt to move instructions between blocks to fill up packs ending with branches/jumps Code duplication Predication


Intra-block Instruction Scheduling

3 1

2

4

5

InstructionDependence DAG

1

2

3

4 4’

5

1

2

3

4 4’

5

2

5

1

54 4’

Without InstructionScheduling

With InstructionScheduling


a

c

Code Duplication to Reduce Code Size

• • •

5

b

5’

2

W

X Y

Z 11

3 3’

4 4’

3 3’4 4’

6 slots is too manyto fit in a singlepacked instruction …but we can duplicatea single instruction …resulting in the abilityto pack the remaining5 slots together.


a

Predication – Forward Branches

• • •

Cond Branch

X

• • •

4 4’

Y

Z

Fall-through

Branchtaken path

2 3 2’b4 4’1Instructions packedafter forward brancheswill only be executedwhen the branch isnot taken

4 4’2’32


Predication – Backward Branches

• • •

a b c

2 2’

Branch d e f

• • •

11 2

Branchoffset

2’

Instructions packedafter backwardbranches will only beexecuted when thebranch is taken


Predication Advantages with IRF

IRF facilitates a form of predication for the MIPS – a baseline architecture that traditionally does not support predication

No need to waste instruction encoding space specifying predicate bits for most/all instructions (even ARM traded away general predication for reducing code size with Thumb and Thumb2)

No need to fetch, decode and possibly execute instructions that are annulled after the branch within a pack (reducing energy consumption and execution time)


Experimental Evaluation

MiBench embedded benchmark suite – 6 categories representing common tasks for various domains

SimpleScalar MIPS/PISA architectural simulator Out-of-order, single issue embedded machine with 8KB 4-

way set associative L1 instruction and data caches and 128-entry bimodal branch predictor

Wattch/Cacti extensions for modeling energy consumption (inactive portions of pipeline only dissipate 10% of normal energy when using cc3 clock gating)

VPO – Very Portable Optimizer targeted for SimpleScalar MIPS/PISA


Energy Consumption

70.0%

75.0%

80.0%

85.0%

90.0%

95.0%

100.0%

Automotive

Consumer

NetworkOffic

e

Security

Telecomm

Average

Benchmark Category

Tot

al E

nerg

yNo optimizations Promotion Inst Selection Reg Re-assign Intra-sched Inter-sched


Static Code Size

57.5%

62.5%

67.5%

72.5%

77.5%

82.5%

87.5%

92.5%

97.5%

Automotive

Consumer

NetworkOffic

e

Security

Telecomm

Average

Benchmark Category

Sta

tic C

ode

Siz

eNo optimizations Promotion Inst Selection Reg Re-assign Intra-sched Inter-sched


IRF Promotion with Mixed Profiling

67.5%

72.5%

77.5%

82.5%

87.5%

92.5%

97.5%

100/0(Dynamic)

75/25 50/50 25/75 0/100(Static)

Dynamic/Static Mixture

Re

lativ

e M

ea

sure

(%

)Code Size Optimized Code Size Total Energy Optimized Total Energy


Conclusions & Future Work

Compiler optimizations targeted specifically for IRF can further reduce energy (12.2%15.8%), code size (16.8%28.8%) and execution time

Unique transformation opportunities exist due to IRF, such as code duplication for code size reduction and predication

As processor designs become more idiosyncratic, it is increasingly important to explore the possibility of evolving existing compiler optimizations

Register targeting and loop unrolling should also be explored with instruction packing

Enhanced parameterization techniques


Tightly Packed Instruction Format

New opcodes for this T-format of MISA instructions Supports sequential execution of up to 5 RISA instructions

from the IRF Unnecessary fields are padded with nop

Supports up to 2 parameters replacing instruction slots Parameters can come from 32-entry IMM Each IRF entry also retains a default immediate value as well Branches use these 5 bits for displacements R-type RISA instructions can use parameter to replace RD field


MIPS Instruction Format Modifications

Creating Loosely Packed instructions R-type: Removed shamt field and merged with rs I-type: Shortened immediate values (16-bit 11bit)

Lui now uses 21-bit immediate values, hence no loose packing

J-type: Unchanged


Introduction

Embedded Processor Design Constraints Energy Consumption Static Code Size Execution Time

Fetch logic is responsible for 36% of total processor power on StrongARM

Two Primary Areas for Improvement in Instruction Fetch Better fetch mechanism and storage

Instruction Cache and/or ROM – Lower power than main memory, but still a fairly large, flat storage method

Better instruction encodings Instruction encodings are wasteful with bits

Maximize functionality, but simplify decoding (fixed length) Most applications only apply a subset of available instructions Nowhere near theoretical compression limits


Instruction Redundancy

Profiled largest benchmark in each of six MiBench categories

Most frequent 32 instructions comprise 66.5% of total dynamic and 31% of total static instructions


Access of Data & Instructions

Each lower layer is designed to improve accessibility of current/frequent items, albeit at a reduction in number of available items.

Caching is beneficial, but compilers can do better for the “most frequently” accessed data items (e.g. Register Allocation).

Instructions have no analogue to the Data Register File (RF).

L1 Data Cache

?????Data Register File

L1 Instruction Cache

L2 Cache

Main Memory


Instruction Packing

Store frequently occurring instructions as specified by the compiler in a small, low-power Instruction Register File (IRF)

Allow multiple instruction fetches from the IRF by packing instruction references together

Facilitate parameterization of some instructions using an Immediate Table (IMM)


Outline

Introduction IRF Instruction Set Architecture IRF Register Windowing Compiler Optimizations Experimental Framework Results Interaction with Other Techniques Proposed Enhancements for Instruction Packing Conclusions Publication Plan


IRF Instruction Set Architecture

MIPS ISA – commonly known and provides simple encoding

RISA (Register ISA) – instructions available via IRF access

MISA (Memory ISA) – instructions available in memory New instruction formats that can reference

multiple RISA instructions – Tightly Packed Original instructions modified to pack an

additional RISA reference – Loosely Packed


Instruction Cache

PC

IF/ID

IRF

IMM

IRWP

packed instruction

packed instruction insn1

insn1insn2

insn2

imm3

insn3

insn3

imm3

insn4

insn4

Execution of IRF Instructions

Instruction Fetch Stage First Half of Instruction Decode Stage

To InstructionDecoder

Executing a Tightly Packed Param4c Instruction


Tightly Packed Instruction Format

New opcodes for this T-format of MISA instructions Supports sequential execution of up to 5 RISA instructions

from the IRF Unnecessary fields are padded with nop

Supports up to 2 parameters replacing instruction slots Parameters can come from 32-entry IMM Each IRF entry also retains a default immediate value as well Branches use these 5 bits for displacements R-type RISA instructions can use parameter to replace RD field


MIPS Instruction Format Modifications

Creating Loosely Packed instructions R-type: Removed shamt field and merged with rs I-type: Shortened immediate values (16-bit 11bit)

Lui now uses 21-bit immediate values, hence no loose packing

J-type: Unchanged


Compilation Framework


Packing Instructions

Greedy selection of instructions based on static or dynamic profile data

Examine a sliding window of instructions for each basic block, attempting to pack adjacent RISA instructions together Denser packs are more favorable (e.g. tight5) Branch offset distances can slip into range when

we apply an iterative packing algorithm


IRF Register Windowing

Simply increasing the number of IRF registers reduced number of RISA instructions available in a single MISA instruction

Instead, keep format the same and provide multiple IRF windows to switch between on a per-function basis

Can dynamically switch IRF contents on calls/returns or provide actual physically separate register windows


Software Windowing

Add new instructions to dynamically save/restore instruction registers

Calculate cost/benefit analysis of creating a new window/partition for a given function

Insert appropriate instructions for saving/restoring IRs between call/return sequences

Drawbacks Requires the complete call graph for the application Extra instructions for saving/restoring can add to

execution time


Hardware Windowing

IRF windows are shared amongst functions with similar instruction mix

Function addresses are modified to incorporate an instruction register window pointer (IRWP) Call instruction – transfers to new function and switches

the window by adjusting the IRWP; also saves the return address and current IRWP

Return Instruction – transfers control back to the caller and restores the previous IRWP

Inactive windows can be kept in a low-power state


Compiler Optimizations

Improved Instruction Promotion Adapt existing techniques to improve

instruction packing Instruction Selection Register Re-assignment Instruction Scheduling

Increase application redundancy Eliminate constraints on packing


Improving Instruction Promotion

Different Classes of instructions – can consume between one and five of the potentially available slots for a tight pack

Goal is to more accurately model the benefits of promoting from one class of instruction to another Original IRF papers did not consider the promotion of multiple I-

type instructions with different default immediate values addi $3, $3, 4 and addi $3, $3, 1 would not both reside in the

IRF, no matter how frequently they occurred Can simultaneously weight static and dynamic profile data to

obtain a mixed result that has both good code compression and reduced energy consumption

Can obtain most of the benefits of individual static/dynamic profiling


Instruction Selection

Encode short jumps as branches (now parameterizable) j label beq $0, $0, rel_label

Replace simple instructions with equivalent parameterizable forms addu $2, $3, $0 addiu $2, $3, 0

Ensure that commutative operations always have the same order of operands addu $2, $4, $2 addu $2, $2, $4


Register Re-assignment

Performed after registers have been assigned in the second compilation with packing

Use a Register Interference Graph to represent register live ranges in the function

Re-assign live ranges to other registers if this improves packing density Live ranges ordered by dynamic frequency Modify register in conflicting live ranges if

beneficial


Instruction Scheduling

Intra-block – focus on reordering instructions so that dense packs are formed (both tight and loose)

Inter-block – attempt to move instructions between blocks to fill up packs ending with branches/jumps Code duplication Predication


Experimental Framework

MiBench embedded benchmark suite – 6 categories representing common tasks for various domains

SimpleScalar MIPS/PISA architectural simulator Out-of-order, single issue embedded machine with 8KB 4-

way set associative L1 instruction and data caches and 128-entry bimodal branch predictor

Wattch/Cacti extensions for modeling energy consumption (inactive portions of pipeline only dissipate 10% of normal energy when using cc3 clock gating)

VPO – Very Portable Optimizer targeted for SimpleScalar MIPS/PISA


Results – Processor Energy

15.82%

-2.50%

0.00%

2.50%

5.00%

7.50%

10.00%

12.50%

15.00%

17.50%

20.00%

22.50%

25.00%

27.50%

30.00%

Bas

icm

ath

Bitc

ount

Qso

rt

Sus

an

Jpeg

Lam

e

Tiff

2bw

Dijk

stra

Pat

ricia

Ispe

ll

Rsy

nth

Str

ings

earc

h

Blo

wfis

h

Pgp

Rijn

dael

Sha

CR

C32

FF

T

Adp

cm

Gsm

Ave

rage

Benchmark

Cu

mu

lati

ve E

ner

gy

Red

uct

ion

(%

)

IRF Parameterization Window ing Mixed Profile Optimizations


Results – Static Code Size

28.83%

0.00%

5.00%

10.00%

15.00%

20.00%

25.00%

30.00%

35.00%

40.00%

45.00%

50.00%

55.00%

60.00%

65.00%

Bas

icm

ath

Bitc

ount

Qso

rt

Sus

an

Jpeg

Lam

e

Tiff

2bw

Dijk

stra

Pat

ricia

Ispe

ll

Rsy

nth

Str

ings

earc

h

Blo

wfis

h

Pgp

Rijn

dael

Sha

CR

C32

FF

T

Adp

cm

Gsm

Ave

rage

Benchmark

Cu

mu

lati

ve C

od

e S

ize

Red

uct

ion

(%

)



Results – Execution Cycles

1.30%

-4.00%

-3.00%

-2.00%

-1.00%

0.00%

1.00%

2.00%

3.00%

4.00%

5.00%

Bas

icm

ath

Bitc

ount

Qso

rt

Sus

an

Jpeg

Lam

e

Tiff

2bw

Dijk

stra

Pat

ricia

Ispe

ll

Rsy

nth

Str

ings

earc

h

Blo

wfis

h

Pgp

Rijn

dael

Sha

CR

C32

FF

T

Adp

cm

Gsm

Ave

rage

Benchmark

Cu

mu

lati

ve C

ycle

Tim

e Im

pro

vem

ent

(%)


22.34%


Results – Mixed Profiling

67.5%

72.5%

77.5%

82.5%

87.5%

92.5%

97.5%

100/0(Dynamic)

75/25 50/50 25/75 0/100(Static)

Dynamic/Static Mixture

Re

lativ

e M

ea

sure

(%

)Code Size Optimized Code Size Total Energy Optimized Total Energy


Interaction with Other Techniques

Loop Cache – small buffer that captures innermost loops dynamically IRF can operate synergistically, increasing utilization of

loop cache when innermost loops are larger than normal Fetch Energy reduced by 56% with 8-entry loop cache

and 4 window IRF L0 (Filter) Cache – small instruction cache with low

energy consumption, but can negatively impact execution time due to cache misses IRF reduces working set size and can mask a portion of

the cache miss penalty due to overlapped fetch Fetch energy reduced by 70% with 256-byte L0 cache and

4 window IRF


Proposed Enhancements for Instruction Packing

Splitting of MISA/RISA Split Opcode/Operand Encoding in

RISA Improved Parameterization Link-Time Instruction Packing Instruction Promotion as a Side Effect


Splitting of MISA/RISA

No need to arbitrarily limit RISA to same instructions as MISA

Tailor each ISA to particular design goals MISA – reduced code size (small instructions) RISA – improved performance &

expressiveness (large instructions) Baseline ISA choices

ARM/Thumb – 32/16 bit dual width ISA FITS – opcodes mapped to 16 bit instructions ARM/Thumb2 – 32/16 bit variable length ISA


Split MISA Encodings


Splitting Opcodes/Operands

Existing code compression schemes have benefited from separating the encoding of opcodes and operands

Can re-encode RISA to separate opcode and operand streams

Loosely packed instructions can use traditional or new split encodings

Tightly packed instructions could even continue supporting parameterization


Split Opcode/Operand Encodings


Improved Parameterization

Current IRF parameterization is limited to immediate values and destination registers Branches use parameter for offset distance Other I-type instructions can index into IMM

based on the parameter value R-type instructions use parameter to replace RD

Extend parameterization to any field: RS, RT, RD, immediate, and even Opcode!!!


Parameterization Encoding


Link-Time Instruction Packing

Current implementation cannot pack statically linked library routines

Incremental packing strategy with a more precise evaluation of exact benefits of packing particular instructions

Integration with existing tools like DIABLO for ARM/Thumb or MIPS

Support for new inter-procedural optimizations like procedural abstraction and dead code elimination


Promotion as a Side Effect

Current implementation uses a statically allocated IRF that does not change at run-time Prior dynamic scheme experienced a great deal

of overhead for saving/restoring IRF entries Change promotion mechanism to make

saves and restores simpler Capture simple loop behaviors in the same

manner as other techniques like Zero Overhead Loop Buffer (ZOLB)


Loosely Packed Promotion

Use entire 5-bit field to specify IRF entry to write into All RISA executions must come from tightly

packed instructions Reserve 1-bit in the loosely packed field to

represent read/write of IRF entry Write this instruction (after executing it) into the

corresponding 4-bit IRF entry Reduces available number of IRs


IRF Calling Conventions

Maintain some IRs as non-scratch (callee-save) and some as scratch (caller-save)

Use appropriate registers for varying behavior (tight loops with and without function calls)


Dynamic Promotion

Peel loop and then pack instructions in the actual loop Special handling for conditional control flow Can alleviate huge code size increases when combined

with loop unrolling Added code size and potentially extra instructions

for save/restore, but huge potential for energy savings in tight loops Link-time analysis can eliminate saves/restores Potential for actual rotating register file (like SPARC)

Surpasses existing ZOLB and loop cache techniques since loops with conditional control flow can be captured


Conclusions

Efficient fetch mechanism that simultaneously reduces Processor Energy Consumption (15.8% reduction) Static Code Size (28.8% reduction) Execution Time (1.3% reduction)

Improvements via parameterization, windowing, compiler optimizations and integration with other processor enhancements

Future work is rich and diverse: Specialized MISA and RISA encodings Improved parameterization Link-time analysis and instruction packing Improved methods for IRF promotion

Demonstrates the need for careful design/balance of instruction sets and instruction fetch mechanisms


Previous Publications

1. Hines, S., Green, J., Tyson, G., and Whalley, D. Improving program efficiency by packing instructions into registers. In Proceedings of the 2005 ACM/IEEE International Symposium on Computer Architecture (June 2005), IEEE Computer Society, pp. 260–271.

2. Hines, S., Tyson, G., and Whalley, D. Reducing instruction fetch cost by packing instructions into register windows. In Proceedings of the 38th annual ACM/IEEE International Symposium on Microarchitecture (November 2005), IEEE Computer Society, pp. 19–29.

3. Hines, S., Tyson, G., and Whalley, D. Improving the energy and execution efficiency of a small instruction cache by using an instruction register file. In Proceedings of the 2nd Watson Conference on Interaction between Architecture, Circuits, and Compilers (September 2005), pp. 160–169.


Publications to be Submitted

4. Hines, S., Whalley, D., and Tyson, G. Adapting compilation techniques to enhance the packing of instructions into registers. Submitting to the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, October 2006.

5. Hines, S., Tyson, G., and Whalley, D. Improving fetch efficiency by using an instruction register file. Planning to submit.


Other Publications

6. Hines, S., Kulkarni, P., Whalley, D., and Davidson, J. Using de-optimization to re-optimize code. In Proceedings of the 5th International Conference on Embedded Software (September 2005), ACM Press, pp. 114–123.

7. Kulkarni, P., Hines, S., Hiser, J., Whalley, D., Davidson, J., and Jones, D. Fast searches for effective optimization phase sequences. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (June 2004), pp. 171–182.

8. Kulkarni, P. A., Hines, S. R., Whalley, D. B., Hiser, J. D., Davidson, J. W., and Jones, D. Fast and efficient searches for effective optimization phase sequences. ACM Transactions on Architecture and Code Optimization (June 2005), 165–198.

9. Kulkarni, P., Zhao, W., Hines, S., Whalley, D., Yuan, X., van Engelen, R., Gallivan, K., Hiser, J., Davidson, J., Cai, B., Bailey, M., Moon, H., Cho, K., Paek, Y., and Jones, D. VISTA: VPO interactive system for tuning applications. ACM Transactions on Embedded Computing Systems, 2006.

10. Kleffner, M., Jones, D., Hiser, J., Kulkarni, P., Parent, J., Hines, S., Whalley, D., Davidson, J., and Gallivan, K. On the use of compilers in DSP laboratory instruction. In Proceedings of the 2006 IEEE International Conference on Acoustics, Speech, and Signal Processing (May 2006).

11. Kreahling, W., Hines, S., Whalley, D., and Tyson, G. Reducing the cost of conditional transfers of control by using comparison specifications. In Proceedings of the 2006 ACM SIGPLAN conference on Languages, Compilers, and Tools for Embedded Systems (June 2006), ACM Press.


Conferences/Journals

Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)

Conference on Languages, Compilers and Tools for Embedded Systems (LCTES)

Conference on Programming Language Design and Implementation (PLDI)

International Symposium on Computer Architecture (ISCA) International Symposium on Microarchitecture (MICRO) ACM Transactions on Architecture and Code Optimizations

(TACO) IEEE Transactions on Computers (TOC) ACM Transactions on Computer Systems (TOCS) ACM Transactions in Embedded Computing Systems (TECS)


Publication Plan


The End

Questions ???


Compiler Optimization Legend


Intra-block Instruction Scheduling


Code Duplication to Reduce Code Size


Predication – Forward Branches


Predication – Backward Branches


Predication Advantages with IRF

IRF facilitates a form of predication for the MIPS – a baseline architecture that traditionally does not support predication

No need to waste instruction encoding space specifying predicate bits for most/all instructions (even ARM traded away general predication for reducing code size with Thumb and Thumb2)

No need to fetch, decode and possibly execute instructions that are annulled after the branch within a pack (reducing energy consumption and execution time)


Energy Consumption

70.0%

75.0%

80.0%

85.0%

90.0%

95.0%

100.0%

Automotive

Consumer

NetworkOffic

e

Security

Telecomm

Average

Benchmark Category

Tot

al E

nerg

yNo optimizations Promotion Inst Selection Reg Re-assign Intra-sched Inter-sched


Static Code Size

57.5%

62.5%

67.5%

72.5%

77.5%

82.5%

87.5%

92.5%

97.5%

Automotive

Consumer

NetworkOffic

e

Security

Telecomm

Average

Benchmark Category

Sta

tic C

ode

Siz

eNo optimizations Promotion Inst Selection Reg Re-assign Intra-sched Inter-sched


Execution Cycles

90%

92%

94%

96%

98%

100%

102%

Automotive

Consumer

NetworkOffic

e

Security

Telecomm

Average

Benchmark Category

Exe

cutio

n C

ycle

sNo optimizations Promotion Inst Selection Reg Re-assign Intra-sched Inter-sched


Conclusions

IRF provides dense encodings for frequently occurring instructions

Compiler optimizations targeted specifically for IRF can further reduce energy (12.2%15.8%), code size (16.8%28.8%) and execution time

Unique transformation opportunities exist due to IRF, such as code duplication for code size reduction and predication

As processor designs become more idiosyncratic, it is increasingly important to explore the possibility of evolving existing compiler optimizations

Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers Stephen Hines,...

Documents

Transcript of Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers Stephen Hines,...