1/36
by Martin Labrecque
How to Fake 1000 Registers
Oehmke, Binkert, Mudge, Reinhart
to appear in Nov @ Micro 2005
2/36
Outline
● Motivation:
– Observations on registers
● Idea
– Virtual Context Architecture
● Evaluation in 2 types of applications
3/36
Some definitions
● Activation record:
Data structure {
● variables belonging to one particular scope (e.g. a procedure body)
● links to other activation records
};
Synonyms: "data frame", "stack frame"
● Context:
– Activation record of a thread of execution
A register is only meaningful to the current activation record
4/36
Key observation
● Virtual Memory:
– From the ISA standpoint: each process has an 'infinite' amount of memory available
– Memory is managed in caches, RAM and disk
– Memory is context free
● This is not true for registers
– Limited resource
Need to virtualize registers
5/36
How registers are used
● Compiler:
– Source code: variables
– IR: virtual registers
– Register allocation: virtual registers → logical registers
● Pipeline:
– Binary: logical registers
– Decode/Rename: logical registers → physical registers
– Data path: physical registers
6/36
Registers are useful
● Can't get rid of registers:
– Efficient address encoding in instructions
– Unambiguous data dependences
– Efficient integration in the microarchitecture
7/36
Dawn of a New Idea
Attach a memory address to the content of the register!
8/36
Virtualizing registers
9/36
Mapping registers to memory
● Registers are virtualized because they hold the content of a memory location
● 2 options
– At register allocation, map compiler virtual registers to memory
  ● Memory to memory operations
  ● Doesn't make use of ISA registers
– Map ISA registers to memory
  ● Key Idea of the Virtual Context Architecture
10/36
Programming the VCA
● Where are the registers mapped in memory?
● The Stack Pointer is the Reference
– Allows memory to be 'allocated' dynamically
– Efficient way of passing parameters to a function
– Need some architectural support to address with offsets to the stack pointer
11/36
Renaming
● To get the register memory address, combine:
– the source/destination register index of the binary program
– base pointer (stack pointer)
● ISA register index → register memory address → physical register
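The two-step translation on this slide can be sketched as follows. This is only an illustration, not the hardware described in the paper: the 8-byte register size, the function names, and the dictionary standing in for the rename table are all assumptions.

```python
# Sketch of: ISA register index -> register memory address -> physical register.
# Assumptions (not from the slides): 8-byte registers, and a plain dict
# standing in for the hardware rename table.

REG_SIZE = 8  # bytes per register (assumed)


def register_address(base_pointer: int, reg_index: int) -> int:
    """Step 1: combine the base pointer with the ISA register index."""
    return base_pointer + reg_index * REG_SIZE


def rename(rename_table: dict, base_pointer: int, reg_index: int):
    """Step 2: look the register memory address up in the rename table.

    Returns the physical register number, or None when the address is
    not currently mapped (a register "cache" miss).
    """
    return rename_table.get(register_address(base_pointer, reg_index))


# r3 of the context based at 0x1000 currently lives in physical register 17.
table = {register_address(0x1000, 3): 17}
hit = rename(table, 0x1000, 3)   # mapped
miss = rename(table, 0x2000, 3)  # same index, different context: not mapped
```

Because the lookup key is a memory address rather than a bare register index, the same index under a different base pointer names a different register.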
12/36
Register memory address → physical reg.
● The address = base pointer + offset
● Exploit locality of the addresses to compress the number of bits in the conversion, low probability of capacity miss
13/36
Register File is a Cache
● Hardware controlled cache
● An instruction requires its source operands and destination register to execute
What happens on a “cache” miss?
We need some hardware control!
14/36
Some additional HW
● Each register has 3 new attributes:
1) A reference count:
● Incremented when an instruction using it goes through rename
● Decremented when the instruction is committed
● Non-zero value means that the register cannot be reallocated to other logical registers
● Guarantees correct instruction execution
15/36
Some additional HW (ctnd)
2) A 'committed' bit
● Valid, non speculative value
3) A 'dirty' bit
● Value more up-to-date than memory
• Using those attributes, a state machine controls which registers are available or not
• Branch recovery works by having a duplicate renaming table containing the committed architectural state
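A rough sketch of the three per-register attributes follows. The slides do not spell out the state machine, so the `needs_spill` rule (write a committed, dirty value back before reuse) and all names here are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class PhysReg:
    ref_count: int = 0       # in-flight instructions that use this register
    committed: bool = False  # holds a valid, non-speculative value
    dirty: bool = False      # value is more up to date than memory

    def on_rename(self) -> None:
        # An instruction using this register went through rename.
        self.ref_count += 1

    def on_commit(self) -> None:
        # That instruction has been committed.
        self.ref_count -= 1

    def can_reallocate(self) -> bool:
        # A non-zero count pins the register to its current mapping.
        return self.ref_count == 0

    def needs_spill(self) -> bool:
        # Assumed policy: a committed value newer than memory must be
        # written back before the register is handed to someone else.
        return self.committed and self.dirty


reg = PhysReg()
reg.on_rename()
pinned = not reg.can_reallocate()  # still referenced by an in-flight instruction
reg.committed = True
reg.dirty = True
reg.on_commit()
free_now = reg.can_reallocate()
must_write_back = reg.needs_spill()
```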
16/36
Source operand to physical register conversion
17/36
Destination logical register to physical register conversion
18/36
Allocation of an entry for destination register
● Replacement policy in rename table
19/36
Pipeline modifications
● Changes in the renaming
● ATSQ: architectural state transfer queue
– Adds to the queue upon fills and spills
– Has priority over the instruction to execute
– Addresses for fills and spills are pre-calculated
– No memory disambiguation required
– No data dependences
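The queue behaviour listed above can be sketched with a toy model. The class layout, the `step` function, and the dictionaries standing in for the register file and memory are assumptions; only the properties on the slide (pre-calculated addresses, priority over instructions) come from the source.

```python
from collections import deque


class ATSQ:
    """Toy architectural state transfer queue (names are illustrative)."""

    def __init__(self):
        self.ops = deque()

    def add_spill(self, address, phys_reg):
        # The address is already known when the entry is queued.
        self.ops.append(("spill", address, phys_reg))

    def add_fill(self, address, phys_reg):
        self.ops.append(("fill", address, phys_reg))


def step(atsq, instructions, regfile, memory):
    """Issue one operation; queued fills and spills go before instructions."""
    if atsq.ops:
        kind, address, phys_reg = atsq.ops.popleft()
        if kind == "spill":
            memory[address] = regfile[phys_reg]   # register -> memory
        else:
            regfile[phys_reg] = memory[address]   # memory -> register
        return kind
    if instructions:
        return instructions.popleft()
    return None


memory = {0x1018: 42}
regfile = {5: 7, 6: 0}
atsq = ATSQ()
atsq.add_spill(0x2000, 5)
atsq.add_fill(0x1018, 6)
instructions = deque(["add"])
issued = [step(atsq, instructions, regfile, memory) for _ in range(3)]
```

In this run the spill and the fill are issued before the waiting `add`, which mirrors the priority rule on the slide.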
20/36
Outline
● Motivation:
– Observations on registers
● Idea
– Virtual Context Architecture
● Evaluation in 2 types of applications
– Baseline & Methodology
– Register windows w/ results
– SMT w/ results
– Combined register windows + SMT
21/36
Baseline machine
22/36
More on methodology
● Uses SimPoints to find representative simulation intervals
● SPEC CPU 2000
● Baseline doesn't have register windows
– (Alpha’s register remapping with issue queues)
● Window overflow/underflow: 10 cycles
23/36
Applications
● Register windows
● Multithreading
http://en.wikipedia.org/wiki/Register_window
http://www.sics.se/~psm/sparcstack.html
24/36
Register Windows
● Global register allocation
– How many registers should we reserve for the current procedure versus the rest of the program?
– SPARC example:
  ● usually contains as many as 128 GPRs
  ● At any point only 32 are available:
    – 8 global, 8 params in, 8 params out, 8 local values
    – Up to 32 windows
    – Windows changed by an instruction usually along with 'call' and 'return'
    – Partial overlap: 'params out' of caller are 'params in' of callee
– Also used in Itanium (variable sized window)
– Alternative is e.g.: renaming with reservation stations
Save some memory (stack) traffic on function calls
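The partial overlap between caller and callee windows can be illustrated with a small index calculation. The register numbering follows the SPARC convention (r0-r7 global, r8-r15 out, r16-r23 local, r24-r31 in); the physical layout, the function name, and the choice that a call moves to window + 1 are simplifying assumptions (real SPARC hardware decrements its window pointer on a call).

```python
N_GLOBALS = 8
REGS_PER_WINDOW = 16  # 8 'in' + 8 'local'; 'out' aliases the next window's 'in'


def physical_index(window: int, reg: int, n_windows: int) -> int:
    """Map logical register r0..r31 of a window to a physical register index."""
    if not 0 <= reg < 32:
        raise ValueError("logical register must be in r0..r31")
    if reg < 8:    # r0-r7: globals, shared by every window
        return reg
    if reg < 16:   # r8-r15: params out, stored where the callee's params in live
        base = ((window + 1) % n_windows) * REGS_PER_WINDOW
        return N_GLOBALS + base + (reg - 8)
    if reg < 24:   # r16-r23: local values, private to this window
        return N_GLOBALS + window * REGS_PER_WINDOW + 8 + (reg - 16)
    # r24-r31: params in
    return N_GLOBALS + window * REGS_PER_WINDOW + (reg - 24)


caller_out0 = physical_index(2, 8, n_windows=8)   # caller's first 'out' register
callee_in0 = physical_index(3, 24, n_windows=8)   # callee's first 'in' register
```

Here `caller_out0` and `callee_in0` name the same physical register, so passing a parameter needs no copy and no stack traffic.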
25/36
Register Windows Caveats
● Problem:
– Overflow of windows: call depth too deep
– Underflow of window: need to restore a window from memory
● Solution
– Operating system handler
– typical scheme saves and restores windows
– VCA handles registers individually
Performance Advantage of the Register Stack in Intel® Itanium™ Processors
26/36
Register windows evaluation
‘Ideal’: fills and spills are free
VCA is especially good with few registers
Close to ideal at 256 registers
VCA 4% faster than baseline @ 256 regs
Less registers means less in-flight instructions and less branch misprediction increase
For others decrease
27/36
Single data cache port experiment
● Normalized to 2-port baseline
● 7% faster than baseline @ 256 regs
● 0.5% slower than ideal @ 256 regs
28/36
2nd App: multi-threading
29/36
SMT: simultaneous multi-threading
● Lots of replicated resources (larger register file)
● VCA: renaming table is not replicated, only base thread pointer
● VCA:
– # of in-flight instructions determines the number of registers required
– not # of threads
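A small sketch of why one rename table can serve several threads: each thread contributes only a base pointer, and the table is keyed by register memory address. The class, the 8-byte register size, the per-thread base values, and the allocate-on-miss behaviour without eviction are assumptions made for illustration.

```python
REG_SIZE = 8  # bytes per register (assumed)


class SharedRenameTable:
    """One table for all threads, keyed by register memory address."""

    def __init__(self):
        self.mapping = {}  # register memory address -> physical register

    def lookup(self, base_pointer: int, reg_index: int) -> int:
        address = base_pointer + reg_index * REG_SIZE
        if address not in self.mapping:
            # Miss: hand out the next physical register (no eviction modeled).
            self.mapping[address] = len(self.mapping)
        return self.mapping[address]


table = SharedRenameTable()
thread_base = {0: 0x10000, 1: 0x20000}  # hypothetical per-thread base pointers

p0 = table.lookup(thread_base[0], 3)        # thread 0 touches r3
p1 = table.lookup(thread_base[1], 3)        # thread 1 touches r3
p0_again = table.lookup(thread_base[0], 3)  # thread 0 touches r3 again
```

Only registers that are actually referenced occupy physical registers, which is the sense in which demand follows in-flight instructions rather than the thread count.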
30/36
SMT: 2 and 4 threads
● Normalized to single thread baseline 256 regs (not shown)
● @ 192 regs, VCA 2T is 97% of baseline @ 320 regs (baseline is at 88%)
● @192 regs, VCA 4T is at 98.7% of baseline @448 regs
31/36
Combined SMT w/ register windows
● Normalized to single thread baseline @ 256 regs
● VCA 4T: 98% of peak performance @ 192 regs
32/36
SMT + register windows
● Register windows reduce cache accesses while SMT increases them
● VCA 4T non-windowed @ 192 regs reaches 98% of baseline performance but still has 24% more cache accesses; adding windows brings cache accesses 5% below baseline
33/36
VCA summarized
● unifies support for both multiple independent threads and register windowing within each thread;
● backwards compatible with existing ISAs at the application level for multithreaded contexts;
● requires only minimal ISA changes for register windowing;
● requires no changes to the physical register file design and the performance-critical schedule/execute/writeback loop;
● builds on existing rename logic to map logical registers to physical registers and handles register cache misses in the decode/rename stages;
34/36
VCA summarized (ctnd)
● completely decouples physical register file size from the number of logical registers by using memory as a backing store, rather than another larger register file;
● does not involve speculation or prediction, avoiding the need for recovery mechanisms.
35/36
Conclusions
● A VCA-based implementation of register windows in an out-of-order processor reduces execution time by 4% while reducing data cache accesses by nearly 20% compared to a non-windowed machine, with an even larger performance advantage over a conventional register-window implementation.
● VCA's data cache traffic reduction is large enough that it can achieve the same performance with one cache port as an otherwise similar conventional machine would with two cache ports.
36/36
Conclusions (ctnd)
● VCA is also able to manage thread contexts efficiently, enabling effective implementation of simultaneous multithreading (SMT) using as few as half the registers of a standard architecture.
● VCA allows SMT to be combined with register windows with no additional physical registers.
● a 4-thread VCA machine with 192 registers can achieve higher performance than a conventional non-windowed SMT machine with twice as many registers.