1 RAMP Models and Platforms Krste Asanovic UC Berkeley RAMP Retreat, Berkeley, CA January 15, 2009.
1
RAMP Models and Platforms
Krste Asanovic, UC Berkeley
RAMP Retreat, Berkeley, CA, January 15, 2009
2
Much confusion about RAMP
Frequently asked questions:
  When will RAMP be finished/usable?
  What ISA does RAMP use?
  Can RAMP model my new feature “X”?
  How accurate is RAMP?
  Why so many different RAMP projects?
  Why is there not more sharing among projects?
3
Not much confusion about software simulators
Rarely asked questions:
  When will software simulation be finished/usable?
  What ISA do software simulators use?
  Can a software simulator model my new feature “X”?
  How accurate is software simulation?
  Why so many software simulators?
  Why is there not more sharing among software simulators?
4
RAMP is a consortium, not a project
Many projects with different goals
  sometimes multiple per site
So far, much sharing of ideas and techniques
  Very healthy and active community
Some sharing of low-level infrastructure
  Boards + platform-level interfaces to DRAM, Ethernet, etc.
Not a single complete infrastructure that everyone uses
  and that’s been OK, and might continue to be OK
5
Run Model of Target on Host Platform
[Figure: Target Machine (four CPUs, Interconnect Network, DRAM) and Host Platform, with the step between them labeled "Hard Work"]
6
RAMP Projects’ Goals
Model some target machine, trading off:
  Fidelity
  Model design effort
  Emulation speed (and capacity)
7
Space of Target Machines
Which ISA?
  x86, SPARC, PowerPC, Alpha, ARM, MIPS?
In-order or out-of-order cores?
How many cores?
  1, 16, 256, 1M?
Processor+memory of general-purpose machine, or whole SoC including I/O devices?
Accelerators, GPUs?
Which operating system? Hypervisor?
8
ISA Wars
Original pick to standardize around was SPARC
  Open standard
  Available verification suite
  Simplest ISA with extensive general-purpose software support (i.e., desktop/server development environment available)
    SGI/MIPS sorely missed…
  Leon implementation for FPGA
  Simics
But the intent was always to support multiple ISAs
9
ISA usage in RAMP models
UCB RAMP Blue: Microblaze++
  Xilinx soft core modified to add 64-bit FPU
Stanford RAMP Red: PowerPC
  Used Virtex-II Pro hard cores
UT FAST: x86
  Functional simulation in software on front-end machine (or on PowerPC hard core)
UT RAMP White: PowerPC -> SPARC
  Initial version used hard PowerPC cores, moving to Leon soft cores
MIT/Intel HASIM: Alpha -> x86?
  Initially Alpha ISA, eventually to form basis of x86/uOP machine
CMU ProtoFLEX: SPARC
  “SPARC three ways” (own core + emulation on hard PowerPC core + emulation on front-end machine)
UCB RAMP Gold & Internet-in-a-Box: SPARC
  Own core design
UCB/LBNL Green Flash: Tensilica
  RTL generated from Tensilica tools
10
Supporting new ISAs
x86 still very desirable, but difficult
  FAST software functional model is probably current best approach if want to play with different timings
  Microcoded functional model would be good way to go if had resources (HASIM?)
  Even with working functional model, timing model is difficult? Adding new features difficult?
ARM also desirable for mobile device modeling
  Renewed interest in engaging here
MIT/IBM PowerPC work in progress, could form functional model
But nobody does this for fun - only to advance their own research goals…
11
Commercial/Existing RTL Cores
Originally seen as big benefit of RAMP
But didn’t turn out that way in practice (except for prototyping usage model - see later)
  Cores don’t provide features we need, too big, too difficult to modify
For simple ISAs (i.e. non-x86), biggest help is ISA verification suites, and/or *really* simple synthesizable ISA pipeline to form basis of functional model
12
Operating System Support
Currently only ProtoFLEX, FAST, RAMP-White support OS
  Others can run one application with proxy mechanism for I/O
Reflects interests of groups. OS is not primary subject of research for groups building models so far.
RAMP Gold to add support for ParLab OS work (Tessellation)
Green Flash to add support for HPC-style microkernel
13
Target systems
From a few, to millions of cores
  Scaling simulation to 100s of cores was a shared goal
  But smaller core counts (16-128) very interesting also
  Huge core counts (>1E6) also of interest
Single node versus clusters
  RAMP Blue & Internet-in-a-Box are message-passing clusters
  Rest are shared-memory systems
Memory hierarchy and cache coherence protocols
  Wide variety of possibilities
Desktop/Laptop/Server versus Handheld or SoC
  What is important to model for given research topic?
Accelerators/GPUs
  Even wider variety than CPU ISAs/microarchitectures
14
Wide variety, how to reuse?
Proposal:
  ISA functional models
    also FPU across ISAs
    Perhaps even common uOP engine across all ISAs?
  CPU microarchitecture timing model
    E.g., in-order superscalar, out-of-order with unified physical register file
  Memory functional model
    Host-level caches + memory interleaving
  Memory hierarchy timing models
    On-chip network types as subset
  I/O bus shims
    To allow random RTL to be attached for I/O devices and non-GPU accelerators
This won’t be easy, as have to agree on interfaces between these components, might need further specialization
Definitely need more experience doing all of the above
15
Simulator Types
Functional model only (no timing)
RTL models (functional includes timing)
  Also used for chip prototyping
Split functional and timing models
+ Hybrids of above
16
Simulator Mapping Styles
Gate-level emulator (Quickturn, Palladium)
  ~1MHz
Direct RTL emulator
  5-20MHz
FPGA-tuned RTL emulator
  20-50MHz
Virtualized RTL emulator
  50-100MHz
Host-multithreaded models
  >100MHz
18
RAMP Blue Release 2/25/2008
- design available from RAMP website
- ramp.eecs.berkeley.edu
19
Climate System Design Concept
Strawman Design Study
10PF sustained, ~120 m2, <3MWatts, < $75M
100 racks @ ~25KW
  power + comms
32 boards per rack
32 chip + memory clusters per board (2.7 TFLOPS @ 700W)
VLIW CPU:
• 128b load-store + 2 DP MUL/ADD + integer op/DMA per cycle
• Synthesizable at 650MHz in commodity 65nm
• 1mm2 core, 1.8-2.8mm2 with inst cache, data cache, data RAM, DMA interface, 0.25mW/MHz
• Double precision SIMD FP: 4 ops/cycle (2.7GFLOPs)
• Vectorizing compiler, cycle-accurate simulator, debugger GUI (existing part of Tensilica Tool Set)
• 8 channel DMA for streaming from on/off chip DRAM
• Nearest neighbor 2D communications grid
[Figure: processor chip block diagram. Proc Array of 32 CPU tiles, each with data RAM (64-128K D, 2x128b), 32K instruction cache (I) and 8-channel DMA; optional 8MB embedded DRAM; external DRAM interfaces; Master Processor; Comm Link Control]
8 DRAM per processor chip: ~50 GB/s
32 processors per 65nm chip
83 GFLOPS @ 7W
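The per-level figures in the strawman study can be multiplied out as a quick consistency check. The sketch below is only arithmetic on numbers quoted on the slide (650MHz, 4 DP ops/cycle, 32 processors per chip, 32 chips per board, 32 boards per rack, 100 racks, 7W per chip, 700W per board, ~25KW per rack). The variable names are illustrative, and it covers peak rates and power only, not the sustained-performance target.

```python
# Roll-up of the figures quoted on the slide (peak rates and power only).
CLOCK_HZ = 650e6        # synthesizable clock in commodity 65nm
FLOPS_PER_CYCLE = 4     # double-precision SIMD ops per cycle
CORES_PER_CHIP = 32
CHIPS_PER_BOARD = 32
BOARDS_PER_RACK = 32
RACKS = 100
BOARD_WATTS = 700       # chips plus memory, per the slide
RACK_WATTS = 25_000     # "~25KW" per rack, including power + comms

core_gflops = CLOCK_HZ * FLOPS_PER_CYCLE / 1e9
chip_gflops = core_gflops * CORES_PER_CHIP
board_tflops = chip_gflops * CHIPS_PER_BOARD / 1e3
total_cores = CORES_PER_CHIP * CHIPS_PER_BOARD * BOARDS_PER_RACK * RACKS
boards_only_rack_kw = BOARD_WATTS * BOARDS_PER_RACK / 1e3
system_mw = RACK_WATTS * RACKS / 1e6

print(f"per core : {core_gflops:.1f} GFLOPS peak")
print(f"per chip : {chip_gflops:.1f} GFLOPS peak")
print(f"per board: {board_tflops:.2f} TFLOPS peak")
print(f"cores    : {total_cores:,}")
print(f"rack     : {boards_only_rack_kw:.1f} kW for boards alone")
print(f"system   : {system_mw:.1f} MW")
```

The per-chip and per-board results land on the slide's 83 GFLOPS and 2.7 TFLOPS after rounding, and the rack and system power stay within the quoted ~25KW and <3MWatts.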
20
Virtualized RTL Improves FPGA Resource Usage
RAMP allows units to run at varying target-host clock ratios to optimize area and overall performance
Example 1: Multiported register file
  E.g., Sun Niagara has 3 read ports and 2 write ports to 6KB of register storage
  If RTL mapped directly, requires 48K flip-flops
    Slow cycle time, large area
  If mapping into block RAMs (one read + one write per cycle), takes 3 host cycles and 3x2KB block RAMs
    Faster cycle time (~3X) and far fewer resources
Example 2: Large L2/L3 caches
  Current FPGAs only have ~1MB of on-chip SRAM
  Use on-chip SRAM to build cache of active piece of L2/L3 cache, stall target cycle if access misses and fetch data from off-chip DRAM
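Example 1 can be sketched in software. The model below is an illustration under stated assumptions, not any RAMP project's design: a 3-read/2-write target register file is served by a host memory that allows one read and one write per host cycle, so one target cycle takes 3 host cycles. The class name, the delayed-write buffer and the forwarding check are all assumptions added so that reads never observe a write from the same target cycle.

```python
READ_PORTS, WRITE_PORTS = 3, 2
HOST_CYCLES_PER_TARGET_CYCLE = 3

# Direct mapping: every bit of the 6KB register storage becomes a flip-flop.
flip_flops = 6 * 1024 * 8   # 49152, i.e. the slide's "48K"

class TimeMultiplexedRegFile:
    """3R/2W target register file on a 1R/1W host memory (hypothetical sketch)."""

    def __init__(self, num_words):
        self.mem = [0] * num_words   # stands in for the block RAMs
        self.pending = []            # writes issued in the previous target cycle
        self.host_cycles = 0

    def target_cycle(self, read_addrs, writes):
        """Serve one target cycle; writes is a list of (addr, data) pairs."""
        assert len(read_addrs) <= READ_PORTS and len(writes) <= WRITE_PORTS
        # Reads must see last cycle's writes even before they reach memory.
        forward = dict(self.pending)
        values = []
        for h in range(HOST_CYCLES_PER_TARGET_CYCLE):
            if h < len(self.pending):        # the single host write port
                addr, data = self.pending[h]
                self.mem[addr] = data
            if h < len(read_addrs):          # the single host read port
                addr = read_addrs[h]
                values.append(forward.get(addr, self.mem[addr]))
            self.host_cycles += 1
        self.pending = list(writes)          # committed during the next target cycle
        return values

rf = TimeMultiplexedRegFile(16)
first = rf.target_cycle([1, 2, 3], [(1, 10), (2, 20)])
second = rf.target_cycle([1, 2, 1], [(1, 11)])
third = rf.target_cycle([1], [])
print(first, second, third, rf.host_cycles)
```

Each target cycle costs three host cycles, which is the trade the slide describes: a lower target-to-host clock ratio in exchange for block RAM storage instead of flip-flops.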
21
Host Multithreading
(Zhangxi Tan (UCB), Chung (CMU))
Multithreading emulation engine reduces FPGA resource use and improves emulator throughput
Hides emulation latencies (e.g., communicating across FPGAs)
[Figure: Target Model with CPU1-CPU4, and a Multithreaded Host Emulation Engine (on FPGA): a single hardware pipeline (PC, I$, IR, GPR, D$) with multiple copies of CPU state (one PC and one GPR set per target CPU)]
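The idea of one pipeline with several copies of CPU state can be illustrated with a toy interpreter. This is a minimal sketch, not the RAMP Gold or ProtoFLEX engine: the three-instruction ISA, the `CpuContext` class and the round-robin schedule are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class CpuContext:
    """Architectural state of one target CPU (one copy per target CPU)."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 4)

# Toy program run by every target CPU, as (op, a, b, c) tuples.
PROGRAM = [
    ("addi", 1, 1, 1),   # r1 = r1 + 1
    ("add", 2, 2, 1),    # r2 = r2 + r1
    ("jmp", 0, 0, 0),    # pc = 0
]

def execute_one(ctx, program):
    """The single shared pipeline: run one instruction for the given context."""
    op, a, b, c = program[ctx.pc]
    if op == "addi":
        ctx.regs[a] = ctx.regs[b] + c
        ctx.pc += 1
    elif op == "add":
        ctx.regs[a] = ctx.regs[b] + ctx.regs[c]
        ctx.pc += 1
    elif op == "jmp":
        ctx.pc = a
    else:
        raise ValueError(f"unknown op {op}")

def run_host(contexts, program, host_cycles):
    """Each host cycle serves a different target CPU, round-robin."""
    for cycle in range(host_cycles):
        execute_one(contexts[cycle % len(contexts)], program)

# Four target CPUs, each starting with a different r1 to show state is separate.
contexts = [CpuContext(regs=[0, i, 0, 0]) for i in range(4)]
run_host(contexts, PROGRAM, host_cycles=24)   # 6 instructions per target CPU
for i, ctx in enumerate(contexts):
    print(f"CPU{i + 1}: pc={ctx.pc} regs={ctx.regs}")
```

Only `execute_one` would correspond to hardware that is built once; adding target CPUs adds state, not pipelines. In a real engine the cycles between two visits to the same context are also what hide long emulation latencies.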
22
Split Functional/Timing Models
(HASIM: Emer (MIT/Intel); FAST: Chiou (UT Austin))
Functional model executes CPU ISA correctly, no timing information
  Only need to develop functional model once for each ISA
Timing model captures pipeline timing details, does not need to execute code
  Much easier to change timing model for architectural experimentation
  Without RTL design, cannot be 100% certain that timing is accurate
Many possible splits between timing and functional model
[Figure: Functional Model and Timing Model blocks]
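One possible split can be sketched as a functional model that emits an instruction trace and a timing model that only counts cycles over that trace. This is an illustrative sketch, not how HASIM or FAST divide the work: the toy ISA, the trace record fields and the latency parameters are all assumptions.

```python
def functional_model(program, regs, mem, max_insts=1000):
    """Execute the toy ISA correctly; yield one trace record per instruction."""
    pc = 0
    executed = 0
    while pc < len(program) and executed < max_insts:
        op, a, b, c = program[pc]
        taken = False
        if op == "addi":            # regs[a] = regs[b] + c
            regs[a] = regs[b] + c
        elif op == "load":          # regs[a] = mem[regs[b] + c]
            regs[a] = mem[regs[b] + c]
        elif op == "bnez":          # branch to c if regs[a] != 0
            taken = regs[a] != 0
        else:
            raise ValueError(f"unknown op {op}")
        yield {"pc": pc, "op": op, "taken": taken}
        pc = c if taken else pc + 1
        executed += 1

def timing_model(trace, load_latency, branch_penalty):
    """Count target cycles from the trace alone; no instruction is executed here."""
    cycles = 0
    for rec in trace:
        cycles += 1
        if rec["op"] == "load":
            cycles += load_latency - 1
        if rec["taken"]:
            cycles += branch_penalty
    return cycles

PROGRAM = [
    ("addi", 1, 0, 3),    # r1 = 3 (loop count)
    ("load", 3, 2, 0),    # r3 = mem[r2]
    ("addi", 2, 2, 1),    # r2 += 1
    ("addi", 1, 1, -1),   # r1 -= 1
    ("bnez", 1, 0, 1),    # loop while r1 != 0
]

regs = [0, 0, 0, 0]
trace = list(functional_model(PROGRAM, regs, mem=[10, 20, 30]))

# Two hypothetical timing configurations evaluated on the same functional trace.
ideal_cycles = timing_model(trace, load_latency=1, branch_penalty=0)
slow_cycles = timing_model(trace, load_latency=3, branch_penalty=2)
print(len(trace), ideal_cycles, slow_cycles, regs)
```

Changing the timing experiment means changing only the arguments to `timing_model`; the functional trace is reused unchanged, which is the reuse argument the slide makes.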
23
RAMP White
Hari Angepat, Derek Chiou (UT Austin)
Scalable Coherent Shared Memory Multiprocessor
Support standard shared memory programming models
[Figure: RAMP-White block diagram. Labels: Leon 3 (Mst, Slv, DbgInt), Leon3 shim, AHB bus, AHB shim, MPIntCntrl, DSU, Eth, DDR2, Intersection Unit, NIU, Router]
24
Multithreaded Func. & Timing Models
(RAMP Gold: UCB)
MT-Unit multiplexes multiple target units on a single host engine
MT-Channel multiplexes multiple target channels over a single host link
[Figure: Functional Model Pipeline (Arch State) and Timing Model Pipeline (Timing State); labels: MT-Unit, MT-Channels]
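The MT-Channel idea, several target channels sharing one host link, can be sketched as tagged messages that are multiplexed on the sender side and demultiplexed on the receiver side. This is a hypothetical illustration, not RAMP Gold's implementation: the class, its method names and the round-robin arbitration are assumptions.

```python
from collections import deque

class MTChannel:
    """Carries several target channels over one host link (hypothetical sketch)."""

    def __init__(self, num_target_channels):
        self.tx = [deque() for _ in range(num_target_channels)]  # sender-side queues
        self.rx = [deque() for _ in range(num_target_channels)]  # receiver-side queues
        self.next_channel = 0                                    # round-robin pointer

    def send(self, channel, msg):
        self.tx[channel].append(msg)

    def host_cycle(self):
        """Move at most one message across the host link per host cycle."""
        n = len(self.tx)
        for offset in range(n):
            ch = (self.next_channel + offset) % n
            if self.tx[ch]:
                tagged = (ch, self.tx[ch].popleft())  # channel tag travels with payload
                self.next_channel = (ch + 1) % n
                tag, payload = tagged
                self.rx[tag].append(payload)          # receiver demultiplexes by tag
                return tagged
        return None                                   # link idle this host cycle

    def receive(self, channel):
        return self.rx[channel].popleft() if self.rx[channel] else None

link = MTChannel(3)
link.send(0, "a0")
link.send(0, "a1")
link.send(2, "c0")
transfers = [link.host_cycle() for _ in range(4)]
print(transfers)
```

Per-channel ordering is preserved because each target channel keeps its own FIFO on both sides; only the interleaving on the shared link changes.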
25
CMU Simics/RAMP Simulator
[Figure: target is a 16-CPU shared-memory UltraSPARC III server (SunFire 3800) with CPUs, Memory, MMU, DMA, Graphics, NIC, SCSI, Terminal and PCI. Host labels: BEE2 Platform, Xilinx XCV2P70, DDR2 Mem, Interleaved Pipeline, 16x CPU context, PowerPC, Simics (PC), Simulated I/O devices]
26
What Hardware Platforms?
RTL mapping approaches
  Need large amounts of logic
  Selected BEE2, and then designed BEE3 for this emulation style
  Observed that don’t need much interconnect bandwidth (memory + inter-board links) because RTL cores are slow and latency sensitive
Host-multithreading allows large systems to be mapped to small (one?) FPGA (e.g., 64-128 cores on ML505)
  Logic gate count not as critical, need to focus on on-chip capacity, off-chip memory bandwidth and total memory capacity per FPGA (conventional processor memory hierarchy issues multiplied by multithreading factor)
  One big FPGA with lots of fast memory channels would be ideal
Software functional emulation (FAST) or transplant (ProtoFLEX)
  Focus on fast coherent connection to front-end x86 CPU
  Hypertransport, FSB, QPI interfaces better than PCI I/O connections