Hardware Support for Compiler Speculation

• Compiler needs to move instructions before branch, possibly before condition

• Requirements:

– Instructions that can be moved without disrupting data flow

– Exceptions that can be ignored until outcome is known

– Ability to speculatively access memory with potential address conflicts

Exception Support

• Four methods:

– Hardware and OS cooperate to ignore exceptions for speculative instructions

– Speculative instructions never raise exceptions; explicit checks must be made

– Poison bits used to mark registers with invalid results; use causes exception

– Speculative results are buffered until it is certain the instruction is no longer speculative
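The poison-bit method can be sketched in C. This is only an illustrative model, not a description of any real hardware: the `SpecReg` type, the function names, and the use of a NULL address to stand in for a faulting access are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register model: a value plus one poison bit. */
typedef struct {
    int64_t value;
    bool poison; /* set when a speculative instruction faulted */
} SpecReg;

/* Speculative load: a fault (modelled here as a NULL address) raises
 * no exception; it only poisons the destination register. */
static SpecReg spec_load(const int64_t *addr) {
    SpecReg r = {0, false};
    if (addr == NULL) {
        r.poison = true;
    } else {
        r.value = *addr;
    }
    return r;
}

/* Speculative add: the poison bit propagates from either source. */
static SpecReg spec_add(SpecReg a, SpecReg b) {
    SpecReg r = {0, a.poison || b.poison};
    if (!r.poison) {
        r.value = a.value + b.value;
    }
    return r;
}

/* Nonspeculative use: returns false when the register is poisoned,
 * i.e. the point at which the deferred exception would be raised. */
static bool commit(SpecReg r, int64_t *out) {
    if (r.poison) {
        return false;
    }
    *out = r.value;
    return true;
}
```

A poisoned result that is never committed (because the branch went the other way) costs nothing; only a nonspeculative use exposes the fault.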

Exception Handling

• Nonterminating exceptions can be handled normally (e.g. page fault)

– May cause serious performance loss

Memory Reference Speculation

• Moving loads across stores is only safe if the addresses do not conflict

• Special instructions check for address conflicts
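One way to picture the address check is a small table that remembers the addresses of loads moved above stores. The sketch below is a simplified model under stated assumptions: the table size, the `ConflictTable` type, the function names, and exact-address matching are all invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 8 /* assumed capacity, chosen only for illustration */

/* Hypothetical table: one entry per destination register. */
typedef struct {
    uintptr_t addr[TABLE_SIZE];
    bool valid[TABLE_SIZE];
} ConflictTable;

/* Record the address of a load that was moved above a store.
 * reg must be in the range 0..TABLE_SIZE-1. */
static void advanced_load(ConflictTable *t, int reg, const int *p) {
    t->addr[reg] = (uintptr_t)p;
    t->valid[reg] = true;
}

/* Every store invalidates any entry that recorded the same address. */
static void store(ConflictTable *t, int *p, int v) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (t->valid[i] && t->addr[i] == (uintptr_t)p) {
            t->valid[i] = false;
        }
    }
    *p = v;
}

/* Check at the load's original position: true means the early value
 * is still good; false means the load must be redone. */
static bool check_load(const ConflictTable *t, int reg) {
    return t->valid[reg];
}
```

If `check_load` fails, the compiler-generated recovery code reloads the value (and repeats any dependent work) before continuing.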

4.6. Crosscutting Issues: Hardware vs. Software Speculation

• A number of trade-offs and limitations

– Disambiguating memory references is hard for a compiler

– Hardware branch prediction is usually better

– Precise exceptions easier in hardware

– Hardware does not require “housekeeping” code

– Compilers can “look” further

– Hardware techniques are more portable

Hardware/Software Speculation

• Major disadvantage of hardware: complexity!

• Some architectures combine hardware and software approaches

4.7. Putting It All Together: IA-64 and Itanium

• IA-64 – RISC-style

• Register-register

• Emphasis on software-based optimisations

• Features:

– 128 × 65-bit integer registers

– 128 × 82-bit FP registers

– 64 predicate registers; 8 branch registers

Registers

• Integer registers

– Use windowing mechanism

• 0–31 always visible

• Remainder arranged in overlapping windows

– Local and out areas (variable size)

– Hardware for over-/underflow

• Int and FP registers support register rotation

– Supports software pipelining

Instruction Format and VLIW

• Compiler schedules parallel instructions; flags dependences

• Instruction group

– Sequence of (register) independent instructions

– Compiler marks boundaries between groups (stop)

• Bundle

– 128 bits: 5-bit template + 3 × 41-bit instructions
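The bundle arithmetic (5 + 3 × 41 = 128 bits) can be made concrete with a small pack/unpack sketch. The slides give only the field widths; placing the template in the low bits followed by slots 0 to 2, and the names `Bundle`, `pack_bundle`, `bundle_template` and `bundle_slot`, are assumptions for this example.

```c
#include <stdint.h>

#define SLOT_MASK ((UINT64_C(1) << 41) - 1) /* 41 one-bits */

/* A 128-bit bundle held as two 64-bit halves (lo = bits 0..63). */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} Bundle;

/* Assumed layout: template in bits 0..4, slot 0 in bits 5..45,
 * slot 1 in bits 46..86, slot 2 in bits 87..127. */
static Bundle pack_bundle(unsigned tmpl, uint64_t s0, uint64_t s1, uint64_t s2) {
    Bundle b;
    s0 &= SLOT_MASK;
    s1 &= SLOT_MASK;
    s2 &= SLOT_MASK;
    /* Slot 1 straddles the two halves: 18 bits in lo, 23 bits in hi. */
    b.lo = (uint64_t)(tmpl & 0x1Fu) | (s0 << 5) | (s1 << 46);
    b.hi = (s1 >> 18) | (s2 << 23);
    return b;
}

static unsigned bundle_template(Bundle b) {
    return (unsigned)(b.lo & 0x1Fu);
}

/* Extract instruction slot i (0, 1 or 2; any other value yields slot 2). */
static uint64_t bundle_slot(Bundle b, int i) {
    switch (i) {
    case 0:
        return (b.lo >> 5) & SLOT_MASK;
    case 1:
        return ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;
    default:
        return (b.hi >> 23) & SLOT_MASK;
    }
}
```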

Instruction Bundle

• Template specifies stops and execution unit

– I-unit (int + special: multimedia, etc.)

– M-unit (int + memory access)

– F-unit (FP)

– B-unit (branches)

– L+X (extended instructions)

Example

• Unrolled seven times

– Optimised for size:

• 9 bundles; 15% nops

• 21 cycles (3 per calculation)

– Optimised for performance:

• 11 bundles; 30% nops

• 12 cycles (1.7 per calculation)

for (int k = 0; k < 1000; k++) { x[k] = x[k] + s; }
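At source level, unrolling the loop above seven times might look like the sketch below. The element type `double` and the function name are assumptions; the slides do not give them. Because 1000 = 142 × 7 + 6, a short clean-up loop handles the last six elements.

```c
#include <stddef.h>

/* x[k] = x[k] + s for n elements, with the body unrolled seven times. */
static void add_scalar_unrolled7(double *x, size_t n, double s) {
    size_t k = 0;

    /* Main loop: seven independent updates per iteration, which gives
     * the scheduler seven calculations to pack into bundles. */
    for (; k + 7 <= n; k += 7) {
        x[k]     += s;
        x[k + 1] += s;
        x[k + 2] += s;
        x[k + 3] += s;
        x[k + 4] += s;
        x[k + 5] += s;
        x[k + 6] += s;
    }

    /* Clean-up loop for the elements left when n is not a multiple of 7. */
    for (; k < n; k++) {
        x[k] += s;
    }
}
```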

Instructions

• 41 bits long

– 4-bit opcode (+ template bits)

– 6-bit predicate register specifier

• Predication

– Almost all instructions can be predicated

• Branch is jump with predicate check!

– Complex comparisons set two predicate registers
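Predication can be modelled in C as a comparison that produces a pair of complementary predicates, followed by operations that each take effect only when their guard is true. This is a conceptual sketch: the `PredPair` type and function names are assumptions, and the C `if` statements stand in for guarded instructions rather than real branches.

```c
#include <stdbool.h>

/* Hypothetical result of one compare: a predicate and its complement. */
typedef struct {
    bool p_true;
    bool p_false;
} PredPair;

/* One comparison sets both predicate registers. */
static PredPair cmp_gt(int a, int b) {
    PredPair p = {a > b, !(a > b)};
    return p;
}

/* Predicated form of: if (a > b) x = a - b; else x = b - a; */
static int abs_diff_predicated(int a, int b) {
    PredPair p = cmp_gt(a, b);
    int x = 0;
    /* Both subtractions are issued; each writes x only when its
     * guarding predicate is true, so no branch is needed. */
    if (p.p_true) {
        x = a - b; /* guarded by the first predicate */
    }
    if (p.p_false) {
        x = b - a; /* guarded by the complementary predicate */
    }
    return x;
}
```

Exactly one of the two predicates is true, so exactly one guarded operation updates `x`.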

Speculation

• Exceptions can be deferred

– Uses poison bits (65-bit registers)

– Nonspeculative and chk instructions raise exception

• Speculative loads

– Called advanced load (ld.a)

– Stores check addresses

Itanium

• First implementation of IA-64

• Issues up to six instructions per cycle (two bundles)

• Nine functional units

– 2 × I, 2 × M, 3 × B, 2 × F

• 10-stage pipeline

• Multilevel dynamic branch predictor

Itanium

• Complex hardware with many features of dynamically scheduled pipelines!

– Branch prediction

– Register renaming

– Scoreboarding

– Deep pipeline

– etc.

Itanium: Performance

• SPECint not too impressive

– 85% of Alpha 21264 (older, more power-efficient processor!)

• FP better

– Faster, even with slower clock!

– But skewed by one benchmark for Pentium

– Alpha compilers need improvement

4.8. Another View: ILP in Embedded Processors

• Trimedia (see chapter 2)

– “Classic” VLIW

– Hardware decompression of code

• Crusoe

– Software translation of 80x86 to VLIW

– Low power

Trimedia TM32 Architecture

• VLIW

– Instruction specifies five operations

– Static scheduling

– No hardware hazard detection

– 23 functional units (11 types)

Transmeta Crusoe

• Low power design

• Emulates 80x86

• VLIW

– 64-bit (2 op) and 128-bit (4 op) instructions

– Five types of operations:

• ALU (int, register-register)

• Compute (int ALU, FP, multimedia)

• Memory

• Branch

• Immediate

Crusoe

• Simple, in-order pipeline

– Integer: 6-stage (IF1, IF2, DEC, OP, EX, WB)

– FP: 10-stage (5 EX stages)

Crusoe

• Software interpretation of 80x86 code:

– Basic blocks cached

– Exception handling complicated

• Crusoe has good support for speculative reordering

• Memory writes buffered and committed only when safe

Crusoe Performance

• Hard to measure accurately

• Power consumption is low (⅓ of Pentium)

4.9. Fallacies and Pitfalls

• Fallacy: There is a simple approach to multiple-issue (high performance with low complexity)

– Big gap between peak and sustained performance for multiple-issue processors

• Need dynamic scheduling, speculation support, branch prediction, sophisticated prefetch, etc.

• Sophisticated compilers are required

4.10. Concluding Comments

• “Hardware” techniques migrating to “software” and vice versa

• Multiprocessors may be important in future

Chapter 5: Memory Hierarchy Design

Memory Hierarchies

• Not a new idea!

• Takes advantage of the principle of locality

– Temporal

– Spatial

• Small, fast memories close to processor

Memory Hierarchies

Registers

Cache

Memory

I/O Devices (virtual memory)

(Figure axes: speed and cost per byte fall from registers down to I/O devices; size grows)

Introduction

• Usually includes responsibility for memory protection

• Performance is a major problem

Figure 5.2

Characterising Levels of the Memory Hierarchy

• Four questions:

– Where can a block be placed? (placement)

– How is a block found? (identification)

– Which block should be replaced on a miss? (replacement)

– What happens on a write? (write strategy)

Example

• The Alpha 21264 is used as an example throughout

Caches

• Where is a block placed in a cache?

– Three possible answers, giving three different types:

• Anywhere: fully associative

• Only into one block: direct mapped

• Into a subset of blocks: set associative

Cache Categories

• Set associative

– n-way set associative, where n is number of blocks in set

– Commonly, n = 2 or n = 4

• Direct-mapped

– “1-way set associative”

• Fully associative

– “m-way set associative” (m is total number of blocks in cache)
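Since all three categories are special cases of n-way set associativity, one placement rule covers them: a block maps to set (block address) mod (number of sets). A minimal sketch, assuming the number of ways is at least 1 and divides the number of blocks evenly; the function names are hypothetical.

```c
#include <stddef.h>

/* Number of sets in a cache of num_blocks blocks with `ways` blocks
 * per set. Assumes ways >= 1 and that ways divides num_blocks. */
static size_t num_sets(size_t num_blocks, size_t ways) {
    return num_blocks / ways;
}

/* Set that a memory block may be placed in.
 * ways == 1          -> direct mapped (one candidate block)
 * ways == num_blocks -> fully associative (a single set, index 0) */
static size_t set_index(size_t block_addr, size_t num_blocks, size_t ways) {
    return block_addr % num_sets(num_blocks, ways);
}
```

For an 8-block cache and memory block 13, `set_index(13, 8, 1)` gives 5 (direct mapped), `set_index(13, 8, 2)` gives 1 (2-way), and `set_index(13, 8, 8)` gives 0 (fully associative).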