6/15/06 Derek Chiou, UT Austin, RAMP 1
Confessions of a RAMP Heretic: Fast, Full-System, Cycle-Accurate
x86/PowerPC/ARM/Sparc Simulators
Derek Chiou
University of Texas at Austin
Electrical and Computer Engineering
FAST Goals
  Fast: as fast as possible
    2-3 orders of magnitude slower than target?
    Fast enough to run real datasets to completion
    Interactive?
  Accurate: produce cycle-accurate numbers for modern microprocessors (Pentium M)
  Complete: run unmodified operating systems, applications, ISAs,…
  Transparent: full visibility, no performance hit
  Inexpensive: need thousands
  Usable: quick changes, use RTL to generate
  I/O: the MOST important part of systems
Functional/Timing Partitioning
  Proven partitioning
    Asim, Simplescalar, Timing-First, Memoized, etc.
  Simplifies simulator. Promotes reuse
  Same performance in software
    Asim at 10KHz
    Most of the time spent in timing model!
  Hardware???
[Figure: Functional Model (ISA: instructions, architectural registers, peripheral functionality, …) passes an instruction stream to the Timing Model (micro-architecture: fetch, decode, rename, reservation stations, scheduling window, reorder buffer, …)]
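The partitioning can be illustrated with a small sketch. The toy three-opcode ISA, the latencies, and the names `functional_model`, `timing_model`, and `TraceEntry` are all assumptions made for illustration and are not part of FAST. The functional model computes values and emits an instruction stream; the timing model sees only opcodes and register names and counts cycles.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

# Hypothetical toy ISA: (opcode, dest, src1, src2_or_immediate)
Inst = Tuple[str, int, int, int]


@dataclass
class TraceEntry:
    pc: int
    opcode: str
    dest: int
    srcs: Tuple[int, ...]


def functional_model(program: List[Inst], regs: Dict[int, int]) -> Iterator[TraceEntry]:
    """Executes instructions (ISA behavior only) and emits an instruction stream."""
    pc = 0
    while pc < len(program):
        op, rd, a, b = program[pc]
        if op == "addi":
            regs[rd] = regs.get(a, 0) + b
            srcs: Tuple[int, ...] = (a,)
        elif op == "add":
            regs[rd] = regs.get(a, 0) + regs.get(b, 0)
            srcs = (a, b)
        elif op == "mul":
            regs[rd] = regs.get(a, 0) * regs.get(b, 0)
            srcs = (a, b)
        else:
            raise ValueError(f"unknown opcode {op}")
        yield TraceEntry(pc, op, rd, srcs)
        pc += 1


# Assumed latencies, for illustration only.
LATENCY = {"addi": 1, "add": 1, "mul": 3}


def timing_model(stream: Iterator[TraceEntry]) -> int:
    """Counts cycles for a single-issue in-order pipeline.

    It tracks when each register becomes ready but never sees a data value.
    """
    ready: Dict[int, int] = {}  # register -> cycle its value becomes available
    cycle = 0
    for entry in stream:
        issue = max([cycle] + [ready.get(r, 0) for r in entry.srcs])
        ready[entry.dest] = issue + LATENCY[entry.opcode]
        cycle = issue + 1
    return max([cycle] + list(ready.values()))


program = [("addi", 1, 0, 5), ("addi", 2, 0, 7), ("mul", 3, 1, 2), ("add", 4, 3, 1)]
regs: Dict[int, int] = {}
total_cycles = timing_model(functional_model(program, regs))
print(regs[4], total_cycles)
```

Because the only coupling is the stream of `TraceEntry` records, either side could in principle be replaced (for example, the timing side by hardware) without changing the other.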
FAST
  Functional model could be
    Pure software (QEMU, Bochs, Simics, SimNow)
      Use JIT for performance, very fast
      No better hardware for executing ISA than processor
      Can operate under the covers (flush cache for example)
    Pure hardware (Hoe et al)
    Hybrid (Hoe et al)
  Timing model very simple hardware
[Figure: Functional Model (ISA) passes an instruction stream to the Timing Model (micro-architecture); labels: FPGA, Full-System Simulator]
What is a FAST Timing Model?
[Figure: pipelined processor diagram fed by a trace: PC, instruction memory, GPR file, immediate extend, ALU, data memory, pipeline registers (IR, A, B, Y, MD1, MD2, R), bypass/interlock logic]
More Complexity
  Caches/TLBs?
    Keep tags, pass address (virtual and physical if necessary)
    Hits, misses determined but don’t need data
  Superscalar (multiple issue)?
    “Fetch and issue” multiple instructions assuming they meet boundary constraints
    Multiple “functional units”
    Schedulers
    Reorder buffer/instruction window
    Pipeline control along with instructions
  NO DATAPATH (and only part of control path)!!!!
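The "keep tags, no data" idea for caches can be sketched as follows. This is an illustrative software model, not the FAST hardware; the class name `TagOnlyCache` and the LRU replacement policy are assumptions. The model stores only tags, so it can decide hit or miss for any address without holding a single byte of cached data.

```python
from collections import OrderedDict
from typing import List


class TagOnlyCache:
    """Set-associative cache model that stores tags only (LRU is an assumption)."""

    def __init__(self, size_bytes: int, ways: int, line_bytes: int) -> None:
        self.ways = ways
        self.line_bytes = line_bytes
        self.num_sets = size_bytes // (ways * line_bytes)
        # One ordered dict of tags per set; order tracks recency of use.
        self.sets: List[OrderedDict] = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Returns True on a hit, False on a miss; updates tags and counters."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.move_to_end(tag)  # mark as most recently used
            self.hits += 1
            return True
        if len(tags) >= self.ways:
            tags.popitem(last=False)  # evict the least recently used tag
        tags[tag] = None
        self.misses += 1
        return False


cache = TagOnlyCache(size_bytes=32 * 1024, ways=4, line_bytes=16)
results = [cache.access(a) for a in (0x1000, 0x1004, 0x1010, 0x1000)]
print(results, cache.hits, cache.misses)
```

The first and third accesses touch new lines and miss; the second and fourth fall in an already-tracked line and hit.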
Driving a Timing Model
[Figure: Functional Model driving a timing model built from iTLB, iCache, Align & Pick, three Decode units, four Sched units, dTLB, dCache, L2 Cache, and memory & I/O timing models]
Complexity: BP
[Figure: Functional Model driving a timing model built from iTLB, iCache, Align & Pick, three Decode units, four Sched units, dTLB, dCache, L2 Cache, and memory & I/O timing models]
  Wrong-path instructions!
    Implement BP in timing model
    Timing model forces ISA simulator to mis-speculate
    Rollback, restore
  BP only works in processor if it’s fairly accurate
    Degrades to trace driven!
  FAST simulators take advantage of the fact that most of the time micro-architecture is on the right path
    Most complexity (BP, parallelism) can be handled this way
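A small sketch of how a timing model might detect the (rare) points where it leaves the right path. The 2-bit saturating-counter predictor, the names `TwoBitPredictor` and `find_mispredictions`, and the example PC are assumptions for illustration. The predictor lives on the timing side; it is compared against the actual outcomes reported by the functional model, and each mismatch marks a point where the functional model would have to be steered down the wrong path and later rolled back.

```python
from typing import Dict, Iterable, List, Tuple


class TwoBitPredictor:
    """Per-PC 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self) -> None:
        self.counters: Dict[int, int] = {}

    def predict(self, pc: int) -> bool:
        # Counters start at 1 (weakly not taken); this initial value is an assumption.
        return self.counters.get(pc, 1) >= 2

    def update(self, pc: int, taken: bool) -> None:
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)


def find_mispredictions(branches: Iterable[Tuple[int, bool]]) -> List[int]:
    """Returns indices of dynamic branches whose prediction disagrees with the
    actual outcome taken from the functional model's right-path stream."""
    bp = TwoBitPredictor()
    wrong: List[int] = []
    for i, (pc, taken) in enumerate(branches):
        if bp.predict(pc) != taken:
            wrong.append(i)
        bp.update(pc, taken)
    return wrong


# A loop branch at hypothetical PC 0x40: taken nine times, then falls through.
outcomes = [(0x40, True)] * 9 + [(0x40, False)]
mispredicted = find_mispredictions(outcomes)
print(mispredicted)
```

Only the first and last of the ten dynamic branches are mispredicted here, which is the situation the slide relies on: rollbacks are needed at a small fraction of branches.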
Parallelism: Detect Problem & Rollback
[Figure: four functional models (FM) with Memory; four timing models (TM) with a Network and a Memory Model]
Functional Model Rollback
  Need to
    Rollback, force branch
    Rollback, restore and continue
  How? set_pc(inst_num, pc)
    Set a particular dynamic instance of an instruction to a particular instruction pointed to by PC
    Sufficient
  Currently implemented with checkpoints
    ISA state, memory, peripherals
  Works for parallelism too
[Figure: diagram of branch (BR) instructions]
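A minimal sketch of how `set_pc(inst_num, pc)` could be built on checkpoints. Only the call shape `set_pc(inst_num, pc)` comes from the slide; the toy two-opcode ISA, the `FunctionalModel` class, and the per-instruction register checkpoint are assumptions, and the sketch checkpoints registers only, whereas the slide lists ISA state, memory, and peripherals.

```python
import copy
from typing import Dict, List, Tuple

# Hypothetical toy ISA: ("addi", rd, rs, imm) or ("beqz", rs, target, 0).
Inst = Tuple[str, int, int, int]


class FunctionalModel:
    def __init__(self, program: List[Inst]) -> None:
        self.program = program
        self.pc = 0
        self.inst_num = 0  # number of dynamic instructions executed so far
        self.regs: Dict[int, int] = {}
        # checkpoints[n] holds the register state just before dynamic instruction n.
        self.checkpoints: List[Dict[int, int]] = []

    def step(self) -> None:
        self.checkpoints.append(copy.deepcopy(self.regs))
        op, a, b, c = self.program[self.pc]
        if op == "addi":
            self.regs[a] = self.regs.get(b, 0) + c
            self.pc += 1
        elif op == "beqz":
            self.pc = b if self.regs.get(a, 0) == 0 else self.pc + 1
        else:
            raise ValueError(f"unknown opcode {op}")
        self.inst_num += 1

    def set_pc(self, inst_num: int, pc: int) -> None:
        """Rolls back so that dynamic instruction inst_num is fetched from pc."""
        self.regs = copy.deepcopy(self.checkpoints[inst_num])
        del self.checkpoints[inst_num:]
        self.inst_num = inst_num
        self.pc = pc


program = [
    ("addi", 1, 0, 1),   # r1 = 1
    ("beqz", 1, 4, 0),   # not taken, because r1 != 0
    ("addi", 2, 0, 10),  # right path
    ("addi", 3, 0, 20),
    ("addi", 2, 0, 99),  # wrong path (branch target)
]
fm = FunctionalModel(program)
for _ in range(3):
    fm.step()
right_path_regs = dict(fm.regs)

fm.set_pc(2, 4)  # rollback, force branch: dynamic instruction 2 now comes from PC 4
fm.step()
wrong_path_regs = dict(fm.regs)

fm.set_pc(2, 2)  # rollback, restore and continue on the right path
fm.step()
print(right_path_regs, wrong_path_regs, fm.regs)
```

Both cases on the slide use the same call: forcing a mis-speculation passes the wrong-path PC, and recovering passes the correct one.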
RTL to Timing Model
[Figure: pipelined processor diagram fed by a trace: PC, instruction memory, GPR file, immediate extend, ALU, data memory, pipeline registers (IR, A, B, Y, MD1, MD2, R), bypass/interlock logic]
  Timing model perfectly models RTL
  Verification???
Current FAST System
[Figure: Xilinx ML310/XUP board with a Virtex FPGA (XC2VP30); labels: FM, TM, Linux, embedded PowerPC (two), FPGA fabric]
Status
  x86 functional model boots Linux, targeting 80486 to Pentium D-like and beyond (Dam Sunwoo)
    Modified Bochs and QEMU
  Branch-predicted multi-function unit, OOO timing model compiles in Bluespec (FAST group)
    Synthesized for FPGA, 8.5K lines of code, rated Top 5 User!
  Memory, disk models
    Hope to have network model soon
  Have straight pipeline 486 model with TLBs and caches
  Preliminary statistics gathered in hardware timing model
  RTL-to-timing model (Nikhil Patil)
  Defining tools for ISA extension and timing model assembly
Timing Model Resources
  OOO, superscalar, 2b branch prediction, five functional units, 32KB DCache
    [INTERFACE: Fast_if] + [TM: IfcVB (interface bt. Bluespec & Verilog)/CmdQ/Fetch/Decode/Rename/Execute]: 26% of V2P30 (3593 slices)
    22 block RAMs (out of 136)
    ROB broken right now
  Early configurable cache model (state shouldn’t change much)
    32KB 4-way set-associative cache with 16B cache-lines
      165 slices (1% of a 2VP30)
      17 block RAMs (12% of a 2VP30)
    2MB 4-way set-associative cache with 64B cache-lines
      140 slices (1% of a 2VP30)
      40 block RAMs (29% of a 2VP30)
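For reference, the geometry of the two cache configurations above can be worked out as follows. The 32-bit address width and the helper name `cache_geometry` are assumptions; the result counts tag bits only (no valid, dirty, or replacement state) and is not meant to reproduce the reported block RAM counts.

```python
from typing import Dict


def cache_geometry(size_bytes: int, ways: int, line_bytes: int,
                   addr_bits: int = 32) -> Dict[str, int]:
    """Derives line count, set count, and tag width for a set-associative cache.

    Assumes power-of-two sizes and an addr_bits-wide address (32 is an assumption).
    """
    lines = size_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return {
        "lines": lines,
        "sets": sets,
        "tag_bits": tag_bits,
        "tag_storage_bits": lines * tag_bits,
    }


small = cache_geometry(32 * 1024, ways=4, line_bytes=16)
large = cache_geometry(2 * 1024 * 1024, ways=4, line_bytes=64)
print(small)
print(large)
```

Under these assumptions the 32KB cache has 2048 lines in 512 sets with 19-bit tags, and the 2MB cache has 32768 lines in 8192 sets with 13-bit tags.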
Current Performance
  Functional model
    Up to 500K x86 inst/sec today on V2P30 FPGA
      Includes rollbacks assuming 5% mis-speculation
      Not that optimized
    5MIPS unmodified
    10M+ on 3.0GHz Pentium 4
      DRC box should give this performance
    PowerPC ISA should be much faster!
      PowerPC on PowerPC
  Timing model
    Not bottleneck!
Conclusions
  1MHz to 100MHz, cycle-accurate, full-system, multiprocessor x86, x86-64, PowerPC, ARM, Sparc simulator
  Leverage extant full-system simulators
  FPGA timing models maximize performance and statistic gathering capabilities
  Pretty much any timing model seems to fit into a single FPGA (Pentium M in V2P30?)
    Uniprocessor, multi-processor capable
  Tools can minimize creation/modification effort