Overview of SimpleScalar - Chalmers, 2009-10-21
2007-11-14 1
Overview of SimpleScalar
Mafijul Islam
Department of Computer Science and Engineering
Acknowledgement
• SimpleScalar Tutorial available at www.simplescalar.com
• http://hpc5.cs.tamu.edu/docs/SimplescalarOverview2006.ppt
SimpleScalar TutorialSimpleScalar TutorialPage 4
A Computer Architecture Simulator Primer
• What is an architectural simulator?
– a tool that reproduces the behavior of a computing device
• Why use a simulator?
– leverages a faster, more flexible S/W development cycle
– permits more design space exploration
– facilitates validation before H/W becomes available
– level of abstraction can be throttled to the design task
– possible to increase/improve system instrumentation
[Figure: Device Simulator with System Inputs, System Outputs, and System Metrics]
Simulators
• Almost 40 simulators listed at http://www.cs.wisc.edu/arch/www/tools.html
• First simulator on the list
– SimpleScalar (uni-processor, superscalar)
– Developed by Todd Austin (U Michigan) while at U of Wisconsin-Madison
– Still widely used in academia and industry
A Taxonomy of Simulation Tools
• shaded tools are part of SimpleScalar
[Figure: taxonomy tree of architectural simulators. Functional: trace-driven, exec-driven (interpreters, direct execution). Performance: inst. schedulers, cycle timers.]
Functional vs. performance simulators
• Functional simulators implement the architecture
- Perform the actual execution
- Implement what programmers see
• Performance (or timing) simulators implement the microarch.
- Model system resources/internals
- Measure time
- Implement what programmers do not see
Trace Driven vs. Execution Driven Simulators
• Trace-Driven
– Simulator reads a ‘trace’ of the instructions captured during a previous execution
– Easy to implement
– No functional components necessary
– No feedback to trace (e.g., mis-prediction)
• Execution-Driven
– Simulator runs the program (trace-on-the-fly)
– Hard to implement
– Advantages
  • Faster than tracing
  • No need to store traces
  • Register and memory values are available (they usually are not in a trace)
  • Supports mis-speculation cost modeling
Instruction Schedulers vs. Cycle Timers
• Instruction Schedulers
– Simulator schedules each instruction when resources are available
– Instructions are processed one at a time
– Simpler, but less detailed
• Cycle Timers
– Simulator tracks microarchitecture state each cycle
– Simulator state == microarchitecture state
– Perfect for microarchitecture simulation
The SimpleScalar Tool Set
• computer architecture research test bed
– compilers, assembler, linker, libraries, and simulators
– targeted to the virtual SimpleScalar PISA architecture
– hosted on almost any Unix-like machine
• developed during Austin's dissertation work at UW-Madison
– third generation simulation tool (Sohi → Franklin → SimpleScalar)
– in development since ‘94
– first public release (1.0) in July ‘96
– second public release (2.0) testing completed in January ‘97
• available with source and docs from SimpleScalar LLC http://www.simplescalar.com
SimpleScalar Tool Set Overview
• compiler chain is GNU tools ported to SimpleScalar
• Fortran codes are compiled with AT&T’s f2c
• libraries are GLIBC ported to SimpleScalar
[Figure: tool flow from Fortran code (F2C) and C code (GCC) through assembly code (GAS) and object files (GLD, with libf77.a, libm.a, libc.a) to executables run on the simulators; Bin Utils shown alongside]
Advantages of SimpleScalar
• Extensible
- source for compiler, libraries, simulators
- user-extensible instruction format
• Portable
- runs on NT and most UNIX platforms
- target can support multiple ISAs
• Detailed
- Interfaces support simulators of arbitrary detail
- Multiple simulators included with distribution
• Fast (millions of instructions per second)
The Zen of Simulator Design
• design goals will drive which aspects are optimized
• the SimpleScalar Tool Set
– optimizes performance and flexibility
– in addition, provides portability and varied detail
[Figure: Performance, Detail, Flexibility: "Pick Two"]
Performance: speeds design cycle
Flexibility: maximizes design scope
Detail: minimizes risk
Simulation Suite Overview
[Figure: simulators arranged along an axis from Performance to Detail]
• Sim-Fast: 420 lines, functional, 4+ MIPS
• Sim-Safe: 350 lines, functional w/ checks
• Sim-Cache/Sim-Cheetah/Sim-BPred: < 1000 lines, functional, cache stats, pred stats
• Sim-Profile: 900 lines, functional, lot of stats
• Sim-Outorder: 3900 lines, performance, OoO issue, branch pred., mis-spec., ALUs, cache, TLB, 200+ KIPS
Sim-Fast
• Functional simulation
• Optimized for speed
• Assumes no cache
• Assumes no instruction checking
• Does not support Dlite!
• Does not allow command line arguments
• <300 lines of code
Sim-Safe
• Functional simulation
• Checks for instruction errors
• Optimized for speed
• Assumes no cache
• Supports Dlite!
• Does not allow command line arguments
Sim-Cache
• Cache simulation
• Ideal for fast simulation of caches (if the effect of cache
performance on execution time is not necessary)
• Accepts command line arguments for:
– level 1 & 2 instruction and data caches
– TLB configuration (data and instruction)
– Flush and compress
– and more
• Ideal for performing high-level cache studies that don’t
take access time of the caches into account
Sim-Bpred
• Simulate different branch prediction mechanisms
• Generate prediction hit and miss rate reports
• Does not simulate the effect of branch prediction on total
execution time
• Supported predictor types:
– nottaken: always predict not taken
– taken: always predict taken
– perfect: perfect predictor
– bimod: bimodal predictor
– 2lev: 2-level adaptive predictor
– comb: combined predictor (bimodal and 2-level)
Sim-Profile
• Program profiler
• Generates detailed profiles, by symbol and by address
• Keeps track of and reports
– Dynamic instruction counts
– Instruction class counts
– Branch class counts
– Usage of address modes
– Profiles of the text & data segment
Sim-Outorder
• Most complicated and detailed simulator
• Supports out-of-order issue and execution
• Provides reports
– branch prediction
– cache
– external memory
– various configurations
Generating SimpleScalar Binaries
• compiling a C program, e.g.,
    ssbig-na-sstrix-gcc -g -O -o foo foo.c -lm
• compiling a Fortran program, e.g.,
    ssbig-na-sstrix-f77 -g -O -o foo foo.f -lm
• compiling a SimpleScalar assembly program, e.g.,
    ssbig-na-sstrix-gcc -g -O -o foo foo.s -lm
• running a program, e.g.,
    sim-safe [-sim opts] program [-program opts]
• disassembling a program, e.g.,
    ssbig-na-sstrix-objdump -x -d -l foo
• building a library, use:
    ssbig-na-sstrix-{ar,ranlib}
Global Simulator Options
• supported on all simulators:
    -h                  - print simulator help message
    -d                  - enable debug message
    -i                  - start up in DLite! debugger
    -q                  - quit immediately (use w/ -dumpconfig)
    -config <file>      - read config parameters from <file>
    -dumpconfig <file>  - save config parameters into <file>
• configuration files:
– to generate a configuration file:
  – specify non-default options on command line
  – and, include “-dumpconfig <file>” to generate configuration file
– comments allowed in configuration files, all after “#” ignored
– reload configuration files using “-config <file>”
The SimpleScalar Instruction Set
• clean and simple instruction set architecture:
– MIPS/DLX plus more addressing modes, minus delay slots
• bi-endian instruction set definition
– facilitates portability, build to match host endian
• 64-bit inst encoding facilitates instruction set research
– 16-bit space for hints, new insts, and annotations
– four operand instruction format, up to 256 registers
[Figure: 64-bit instruction format. Fields: 16-annote, 16-opcode, 8-ru, 8-rt, 8-rs, 8-rd, with an alternative 16-imm field; bit positions marked 0, 8, 16, 24, 32, 48, 63]
SimpleScalar Instructions
Control:
    j - jump
    jal - jump and link
    jr - jump register
    jalr - jump and link register
    beq - branch == 0
    bne - branch != 0
    blez - branch <= 0
    bgtz - branch > 0
    bltz - branch < 0
    bgez - branch >= 0
    bct - branch FCC TRUE
    bcf - branch FCC FALSE
Load/Store:
    lb - load byte
    lbu - load byte unsigned
    lh - load half (short)
    lhu - load half (short) unsigned
    lw - load word
    dlw - load double word
    l.s - load single-precision FP
    l.d - load double-precision FP
    sb - store byte
    sbu - store byte unsigned
    sh - store half (short)
    shu - store half (short) unsigned
    sw - store word
    dsw - store double word
    s.s - store single-precision FP
    s.d - store double-precision FP
    addressing modes: (C), (reg + C) (w/ pre/post inc/dec), (reg + reg) (w/ pre/post inc/dec)
Integer Arithmetic:
    add - integer add
    addu - integer add unsigned
    sub - integer subtract
    subu - integer subtract unsigned
    mult - integer multiply
    multu - integer multiply unsigned
    div - integer divide
    divu - integer divide unsigned
    and - logical AND
    or - logical OR
    xor - logical XOR
    nor - logical NOR
    sll - shift left logical
    srl - shift right logical
    sra - shift right arithmetic
    slt - set less than
    sltu - set less than unsigned
SimpleScalar Instructions
Floating Point Arithmetic:
    add.s - single-precision add
    add.d - double-precision add
    sub.s - single-precision subtract
    sub.d - double-precision subtract
    mult.s - single-precision multiply
    mult.d - double-precision multiply
    div.s - single-precision divide
    div.d - double-precision divide
    abs.s - single-precision absolute value
    abs.d - double-precision absolute value
    neg.s - single-precision negation
    neg.d - double-precision negation
    sqrt.s - single-precision square root
    sqrt.d - double-precision square root
    cvt - integer, single, double conversion
    c.s - single-precision compare
    c.d - double-precision compare
Miscellaneous:
    nop - no operation
    syscall - system call
    break - declare program error
SimpleScalar Architected State
[Figure: architected state]
• Integer Reg File: r0 to r31 (32 bits each); r0 is the 0 source/sink
• FP Reg File (SP and DP views): f0 to f31 (32 bits each)
• Other registers: PC, HI, LO, FCC
• Virtual Memory (0x00000000 to 0x7fffffff): unused region from 0x00000000, text (code) from 0x00400000, data (init, bss) from 0x10000000, stack with args & env at 0x7fffc000
Simulator I/O
• a useful simulator must implement some form of I/O
– I/O implemented via SYSCALL instruction
– supports a subset of Ultrix system calls, proxied out to host
• basic algorithm (implemented in syscall.c):
– decode system call
– copy arguments (if any) into simulator memory
– perform system call on host
– copy results (if any) into simulated program memory
[Figure: the simulated program calls write(fd, p, 4); the simulator performs sys_write(fd, p, 4), with args in and results out]
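
The proxy steps above can be sketched in C roughly as follows. This is a minimal illustration and not the code in syscall.c: the flat sim_mem array, the sim_addr_t type, and the functions mem_copy_out and proxy_write are hypothetical names introduced here, and the host call is plain POSIX write().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define SIM_MEM_SIZE 4096u           /* hypothetical size of simulated memory */

typedef uint32_t sim_addr_t;         /* hypothetical simulated address type */

static unsigned char sim_mem[SIM_MEM_SIZE];  /* flat simulated memory (assumption) */

/* Copy len bytes from simulated memory at addr into a host buffer. */
static void mem_copy_out(void *host_buf, sim_addr_t addr, size_t len)
{
    assert(addr <= SIM_MEM_SIZE && len <= SIM_MEM_SIZE - addr);
    memcpy(host_buf, &sim_mem[addr], len);
}

/* Proxy a simulated write(fd, p, n): fetch the bytes the simulated program
 * points at, perform the call on the host, and return the host result so the
 * caller can place it in the simulated result register. */
static long proxy_write(int fd, sim_addr_t p, size_t n)
{
    unsigned char buf[256];
    if (n > sizeof buf)
        n = sizeof buf;              /* sketch only: truncate large writes */
    mem_copy_out(buf, p, n);         /* arguments in */
    return (long)write(fd, buf, n);  /* system call on the host; result out */
}
```

A real proxy also has to translate file descriptors, flags, and error codes between the simulated OS and the host, which this sketch leaves out.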
Simulator S/W Architecture
• interface programming style
– all “.c” files have an accompanying “.h” file with same base
– “.h” files define public interfaces “exported” by module
  – mostly stable, documented with comments, study these files
– “.c” files implement the exported interfaces
  – not as stable, study these if you need to hack the functionality
• simulator modules
– sim-*.c files, each implements a complete simulator core
• reusable S/W components facilitate “rolling your own”
– system components
– simulation components
– “really useful” components
Simulator S/W Architecture
• most of performance core is optional
• most projects will enhance the “simulator core”
[Figure: layered S/W architecture. User Programs: SimpleScalar Program Binary. Prog/Sim Interface: SimpleScalar ISA, POSIX System Calls. Functional Core: Machine Definition, Proxy Syscall Handler. Performance Core: BPred, Cache, Memory, Regs, Loader, Resource, Stats, Dlite!, Simulator Core.]
SIM-OUTORDER: H/W Architecture
• implemented in sim-outorder.c and components
[Figure: pipeline Fetch → Dispatch → Scheduler (Register Scheduler, Memory Scheduler) → Exec/Mem → Writeback → Commit, with I-Cache (IL1, IL2), D-Cache (DL1, DL2), I-TLB, D-TLB, and Virtual Memory]
• stage functions: ruu_fetch (Fetch), ruu_dispatch (Dispatch), ruu_issue and lsq_refresh (Scheduler, Exec/Mem), ruu_writeback (Writeback), ruu_commit (Commit)
Sim-Outorder (Main Loop)
• sim_main() in sim-outorder.c
    ruu_init();
    for (;;) {
        ruu_commit();
        ruu_writeback();
        lsq_refresh();
        ruu_issue();
        ruu_dispatch();
        ruu_fetch();
    }
• Loop body executed once for each simulated machine cycle
• Walks pipeline from Commit to Fetch
– Reverse traversal handles inter-stage latch synchronization in only one pass
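
Why the reverse walk needs only one pass can be seen in a toy model: each stage drains its input latch before the upstream stage refills it, so a value written in one cycle is consumed in the next. The sketch below is a hypothetical three-stage pipeline, not sim-outorder itself; all names in it are made up for illustration.

```c
#include <assert.h>

/* Toy three-stage pipeline: fetch -> execute -> commit.
 * Each latch holds one instruction id, or 0 when empty. */
static int fetch_to_exec;    /* latch written by fetch, read by execute */
static int exec_to_commit;   /* latch written by execute, read by commit */
static int next_inst = 1;    /* next instruction id to fetch */
static int last_committed;   /* id of the most recently committed instruction */

static void stage_commit(void)
{
    if (exec_to_commit != 0)
        last_committed = exec_to_commit;
    exec_to_commit = 0;
}

static void stage_execute(void)
{
    exec_to_commit = fetch_to_exec;  /* consume the value fetched LAST cycle */
    fetch_to_exec = 0;
}

static void stage_fetch(void)
{
    fetch_to_exec = next_inst++;     /* visible to execute only NEXT cycle */
}

/* One simulated cycle: stages are visited last to first, so every stage
 * drains its input latch before the upstream stage refills it. */
static void cycle(void)
{
    stage_commit();
    stage_execute();
    stage_fetch();
}
```

In this model instruction 1 is fetched in cycle 1, executed in cycle 2, and committed in cycle 3. Visiting the stages front to back instead would let an instruction fall through all three stages in a single cycle unless every latch were double-buffered.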
Sim-Outorder (RUU/LSQ)
• RUU (Register Update Unit)
– Handles register synchronization/communication
– Serves as reorder buffer and reservation stations
• LSQ (Load/Store Queue)
– Handles memory synchronization/communication
– Contains all loads and stores in program order
• Relationship between RUU and LSQ
– Memory dependencies are resolved by LSQ
– Load/Store effective address calculated in RUU
Fetch Stage Implementation
[Figure: Fetch stage; input: misprediction (from Writeback); output: to instruction fetch queue (IFQ)]
• models machine fetch bandwidth
• implemented in ruu_fetch()
• inputs:
– program counter
– predictor state (see bpred.[hc])
– misprediction detection from branch execution unit(s)
• outputs:
– fetched instructions sent to instruction fetch queue (IFQ)
Fetch Stage Implementation
• procedure (once per cycle):
– fetch instructions from one I-cache line, block until I-cache or I-TLB misses are resolved
– queue fetched instructions to instruction fetch queue (IFQ)
– probe branch predictor for cache line to access in next cycle
Dispatch Stage Implementation
[Figure: Dispatch stage; input: instructions from IFQ; output: to RUU or LSQ]
• models machine decode, rename, RUU/LSQ allocation bandwidth, implements register renaming
• implemented in ruu_dispatch()
• inputs:
– instructions from IFQ, from Fetch stage
– RUU/LSQ occupancy
– rename table (create_vector)
– architected machine state (for execution)
• outputs:
– updated RUU/LSQ, rename table, machine state
Dispatch Stage Implementation
• procedure (once per cycle):
– fetch insts from IFQ
– decode and execute instructions
  – permits early detection of branch mis-predicts
  – facilitates simulation of “oracle” studies
– if branch misprediction occurs:
  – start copy-on-write of architected state to speculative state buffers
– enter instructions into RUU and LSQ (load/store queue)
  – link to sourcing instruction(s) using RS_LINK structure
  – loads/stores are split into two insts: ADD + Load/Store
  – improves performance of memory dependence checking
Scheduler Stage Implementation
[Figure: Register Scheduler and Memory Scheduler; input: RUU, LSQ; output: to functional units]
• models instruction wakeup, selection, and issue
– separate schedulers track register and memory dependencies
• implemented in ruu_issue() and lsq_refresh()
• inputs:
– RUU/LSQ
• outputs:
– updated RUU/LSQ
– updated functional unit state
Scheduler Stage Implementation
• procedure (once per cycle):
– locate instructions with all register inputs ready
  – in ready queue, inserted when dependent insts enter Writeback
– locate loads with all memory inputs ready
  – determined by walking the load/store queue
  – if load addr unknown, then stall issue (and poll again next cycle)
  – if earlier store w/ unknown addr, then stall issue (and poll again)
  – if earlier store w/ matching addr, then forward store data
  – else, access D-cache
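
The load checks above can be written as a small decision function. The sketch below is an assumption-laden illustration rather than the logic of lsq_refresh(): struct lsq_entry, enum load_action, and check_load are hypothetical names, the queue is a plain array in program order, and addresses are compared for exact equality (access sizes are ignored).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical LSQ entry; field names are illustrative only. */
struct lsq_entry {
    bool     is_store;
    bool     addr_known;
    uint32_t addr;
};

enum load_action { LOAD_STALL, LOAD_FORWARD, LOAD_ACCESS_DCACHE };

/* Decide what to do with the load at index load_idx; entries
 * 0..load_idx-1 are earlier in program order. */
static enum load_action check_load(const struct lsq_entry *lsq, size_t load_idx)
{
    const struct lsq_entry *ld = &lsq[load_idx];
    assert(!ld->is_store);
    if (!ld->addr_known)
        return LOAD_STALL;                 /* load address unknown */

    /* Walk earlier entries from youngest to oldest. */
    for (size_t i = load_idx; i-- > 0; ) {
        if (!lsq[i].is_store)
            continue;
        if (!lsq[i].addr_known)
            return LOAD_STALL;             /* earlier store, unknown address */
        if (lsq[i].addr == ld->addr)
            return LOAD_FORWARD;           /* earlier store, matching address */
    }
    return LOAD_ACCESS_DCACHE;             /* no conflict: go to the D-cache */
}
```

Walking from the youngest earlier store backwards means the first match found is the store whose data the load should see; a stalled load is simply checked again in the next cycle.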
Execute Stage Implementation
[Figure: Exec/Mem stage; input: insts issued by Scheduler; outputs: completed insts to Writeback, requests to memory hierarchy]
• models functional units and D-cache
– access port bandwidths, issue and execute latencies
• implemented in ruu_issue()
• inputs:
– instructions ready to execute, issued by Scheduler stage
– functional unit and D-cache state
• outputs:
– updated functional unit and D-cache state, Writeback events
Execute Stage Implementation
• procedure (once per cycle):
– get ready instructions (as many as supported by issue B/W)
– find free functional unit and access port
– reserve unit for entire issue latency
– schedule writeback event using operation latency of functional unit
  – for loads satisfied in D-cache, probe D-cache for access latency
  – also probe D-TLB, stall future issue on a miss
  – D-TLB misses serviced in Commit with fixed latency
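
The difference between issue latency (how long the unit stays reserved) and operation latency (when the result is written back) can be illustrated with a small model. This is only a sketch under assumed names: struct func_unit and try_issue are hypothetical and do not reproduce the resource code used by sim-outorder.

```c
#include <assert.h>

/* Hypothetical functional unit model; names and fields are illustrative. */
struct func_unit {
    int issue_lat;    /* cycles before the unit can accept another op */
    int oper_lat;     /* cycles until the result is available */
    long busy_until;  /* first cycle at which the unit is free again */
};

/* Try to issue an operation at cycle `now`.
 * Returns the cycle at which its writeback event fires, or -1 if the unit
 * is still reserved by an earlier operation. */
static long try_issue(struct func_unit *fu, long now)
{
    if (now < fu->busy_until)
        return -1;                          /* unit busy: retry next cycle */
    fu->busy_until = now + fu->issue_lat;   /* reserve for the issue latency */
    return now + fu->oper_lat;              /* schedule the writeback event */
}
```

A fully pipelined unit has an issue latency of 1 and accepts a new operation every cycle, while an unpipelined unit has an issue latency equal to its operation latency and rejects new operations until the current one finishes.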
Writeback Stage Implementation
[Figure: Writeback stage; input: finished insts from Execute; outputs: insts ready to Commit, detected mispredictions to Fetch]
• models writeback bandwidth, wakes up ready insts, detects mispredictions, initiates misprediction recovery
• implemented in ruu_writeback()
• inputs:
– completed instructions as indicated by event queue
– RUU/LSQ state (for wakeup walks)
• outputs:
– updated event queue, RUU/LSQ, ready queue
– branch misprediction recovery updates
Writeback Stage Implementation
• procedure (once per cycle):
– get finished instructions (specified by event queue)
– if mispredicted branch, recover state:
  – recover RUU
    – walk newest instruction to mispredicted branch
    – unlink instructions from output dependence chains (tag increment)
  – recover architected state
    – roll back to checkpoint (copy-on-write bits reset, spec mem freed)
– wakeup walk: walk output dependence chains of finished insts
  – mark dependent instruction’s input as now ready
  – if deps satisfied, wake up inst (memory checked in lsq_refresh())
Commit Stage Implementation
[Figure: Commit stage; input: insts ready to Commit]
• models in-order retirement of instructions, store commits to the D-cache, and D-TLB miss handling
• implemented in ruu_commit()
• inputs:
– completed instructions in RUU/LSQ that are ready to retire
– D-cache state (for store commits)
• outputs:
– updated RUU, LSQ, D-cache state
Commit Stage Implementation
• procedure (once per cycle):
– while head of RUU/LSQ is ready to commit (in-order retirement)
  – if D-TLB miss, then service it
  – if store, attempt to retire store into D-cache, stall commit otherwise
  – commit instruction result to the architected register file, update rename table to point to architected register file
  – reclaim RUU/LSQ resources (adjust head pointer)
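
In-order retirement from the head of a circular buffer can be sketched as below. The structure is a deliberately stripped-down assumption: struct ruu, struct ruu_entry, RUU_SIZE, and commit_cycle are hypothetical names, the commit_width limit is a parameter added here for illustration, and store retirement and D-TLB miss handling are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define RUU_SIZE 8   /* hypothetical capacity */

/* Hypothetical, simplified RUU entry. */
struct ruu_entry {
    bool completed;  /* result has been written back */
};

struct ruu {
    struct ruu_entry slot[RUU_SIZE];
    int head;        /* oldest instruction */
    int num;         /* number of occupied entries */
};

/* Retire up to commit_width completed instructions from the head, in
 * program order; stop at the first instruction that is not yet complete.
 * Returns the number of instructions retired this cycle. */
static int commit_cycle(struct ruu *r, int commit_width)
{
    int retired = 0;
    while (r->num > 0 && retired < commit_width &&
           r->slot[r->head].completed) {
        r->slot[r->head].completed = false;  /* free the entry */
        r->head = (r->head + 1) % RUU_SIZE;  /* adjust head pointer */
        r->num--;
        retired++;
    }
    return retired;
}
```

Because the loop stops at the first incomplete entry, a younger instruction that finished early waits behind it, which is what keeps the architected state precise.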
Sim-Outorder parameters
• Instruction fetch queue size, decode and issue bandwidth
• Capacity of RUU and LSQ
• Branch mis-prediction latency
• Number of functional units
– integer ALU, integer multipliers/dividers
– FP ALU, FP multipliers/dividers
• Latency of I-cache/D-cache, memory and TLB
• Record statistics by text address
Specifying Cache Configurations
• all caches and TLB configurations specified with same format:
    <name>:<nsets>:<bsize>:<assoc>:<repl>
• where:
    <name>   - cache name (make this unique)
    <nsets>  - number of sets
    <bsize>  - block size in bytes (for TLBs, the page size)
    <assoc>  - associativity (number of “ways”)
    <repl>   - set replacement policy
               l - for LRU
               f - for FIFO
               r - for RANDOM
• examples:
    il1:1024:32:2:l     2-way set-assoc 64k-byte cache, LRU
    dtlb:1:4096:64:r    64-entry fully assoc TLB w/ 4k pages, random replacement
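
The capacities quoted in the examples follow from sets × block size × ways: 1024 × 32 × 2 = 65536 bytes (64 KB) for il1, and 1 × 64 = 64 entries for dtlb. A minimal C sketch of parsing the format and checking that arithmetic is shown below; struct cache_cfg, parse_cache_cfg, and cache_bytes are hypothetical helpers, not the simulator's own option parser.

```c
#include <assert.h>
#include <stdio.h>

/* Parsed form of "<name>:<nsets>:<bsize>:<assoc>:<repl>". */
struct cache_cfg {
    char name[32];
    unsigned nsets;
    unsigned bsize;
    unsigned assoc;
    char repl;       /* 'l', 'f', or 'r' */
};

/* Parse a configuration string; returns 1 on success, 0 on failure. */
static int parse_cache_cfg(const char *s, struct cache_cfg *c)
{
    if (sscanf(s, "%31[^:]:%u:%u:%u:%c",
               c->name, &c->nsets, &c->bsize, &c->assoc, &c->repl) != 5)
        return 0;
    return c->repl == 'l' || c->repl == 'f' || c->repl == 'r';
}

/* Total capacity in bytes: sets x block size x ways. */
static unsigned long cache_bytes(const struct cache_cfg *c)
{
    return (unsigned long)c->nsets * c->bsize * c->assoc;
}
```

For a TLB the same product gives the amount of memory the TLB can map (its reach), since the block size field holds the page size.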
Specifying Cache Hierarchies
• specify all cache parameters if no unified levels exist, e.g.,
    -cache:il1 il1:128:64:1:l -cache:il2 il2:128:64:4:l
    -cache:dl1 dl1:256:32:1:l -cache:dl2 dl2:1024:64:2:l
[Figure: split hierarchy, il1 over il2 and dl1 over dl2]
• to unify any level of the hierarchy, “point” an I-cache level into the data cache hierarchy:
    -cache:il1 il1:128:64:1:l -cache:il2 dl2
    -cache:dl1 dl1:256:32:1:l -cache:dl2 ul2:1024:64:2:l
[Figure: unified hierarchy, il1 and dl1 both over ul2]
Specifying the Branch Predictor
• specifying the branch predictor type:
    -bpred <type>
  the supported predictor types are:
    nottaken   always predict not taken
    taken      always predict taken
    perfect    perfect predictor
    bimod      bimodal predictor (BTB w/ 2 bit counters)
    2lev       2-level adaptive predictor
• configuring the bimodal predictor (only useful when “-bpred bimod” is specified):
    -bpred:bimod <size>    size of direct-mapped BTB
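
A bimodal predictor is a direct-mapped table of 2-bit saturating counters indexed by the branch address. The sketch below shows the idea only; it is not the code in bpred.c. The table size, the index function (which assumes 8-byte instructions and so drops three address bits), and all function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BIMOD_SIZE 2048u   /* table entries; must be a power of two (assumption) */

static uint8_t bimod_table[BIMOD_SIZE];  /* 2-bit counters, 0..3; start at 0 */

/* Hypothetical index function: drop the low address bits, then mask. */
static unsigned bimod_index(uint32_t branch_addr)
{
    return (branch_addr >> 3) & (BIMOD_SIZE - 1);
}

/* Predict taken when the counter is in one of the two upper states. */
static int bimod_predict(uint32_t branch_addr)
{
    return bimod_table[bimod_index(branch_addr)] >= 2;
}

/* Train with the actual outcome, saturating at 0 and 3. */
static void bimod_update(uint32_t branch_addr, int taken)
{
    uint8_t *ctr = &bimod_table[bimod_index(branch_addr)];
    if (taken) {
        if (*ctr < 3)
            (*ctr)++;
    } else {
        if (*ctr > 0)
            (*ctr)--;
    }
}
```

The two-bit hysteresis means a single not-taken outcome in a mostly taken branch (a loop exit, for example) does not flip the next prediction.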
Specifying the Branch Predictor (cont.)
• configuring the 2-level adaptive predictor (only useful when “-bpred 2lev” is specified):
    -bpred:2lev <l1size> <l2size> <hist_size>
  where:
    <l1size>      size of the first level table
    <l2size>      size of the second level table
    <hist_size>   history (pattern) width
[Figure: the branch address indexes a first-level pattern history table of l1size entries, each hist_size bits wide; the history pattern indexes a second-level table of l2size 2-bit predictors, which yields the branch prediction]
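
How the three parameters fit together can be sketched as follows: the branch address selects a history register in the first-level table, and that history selects a 2-bit counter in the second-level table. The code is a simplified illustration with made-up sizes and names; the real predictor may form its second-level index differently, for example by mixing in address bits.

```c
#include <assert.h>
#include <stdint.h>

#define L1SIZE    4u    /* entries in the first-level history table */
#define L2SIZE    16u   /* entries in the second-level counter table */
#define HIST_SIZE 4u    /* bits of history kept per first-level entry */

static uint32_t history[L1SIZE];   /* per-entry branch outcome shift registers */
static uint8_t  counters[L2SIZE];  /* 2-bit saturating counters, 0..3 */

static unsigned l1_index(uint32_t branch_addr)
{
    return (branch_addr >> 3) % L1SIZE;   /* hypothetical address hash */
}

static unsigned l2_index(uint32_t branch_addr)
{
    return history[l1_index(branch_addr)] % L2SIZE;
}

static int twolev_predict(uint32_t branch_addr)
{
    return counters[l2_index(branch_addr)] >= 2;
}

static void twolev_update(uint32_t branch_addr, int taken)
{
    uint8_t  *ctr  = &counters[l2_index(branch_addr)];
    uint32_t *hist = &history[l1_index(branch_addr)];

    if (taken && *ctr < 3)
        (*ctr)++;
    else if (!taken && *ctr > 0)
        (*ctr)--;

    /* Shift the outcome into the history, keeping HIST_SIZE bits. */
    *hist = ((*hist << 1) | (taken ? 1u : 0u)) & ((1u << HIST_SIZE) - 1u);
}
```

Unlike a bimodal table, this structure can learn a strictly alternating taken/not-taken branch, because the two history patterns 0101 and 1010 train two different counters.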