1 RAMP Models and Platforms Krste Asanovic UC Berkeley RAMP Retreat, Berkeley, CA January 15, 2009.

1

RAMP Models and Platforms


Krste Asanovic, UC Berkeley

RAMP Retreat, Berkeley, CA, January 15, 2009




2

Much confusion about RAMP

Frequently asked questions:
  When will RAMP be finished/usable?
  What ISA does RAMP use?
  Can RAMP model my new feature “X”?
  How accurate is RAMP?
  Why so many different RAMP projects?
  Why is there not more sharing among projects?

3

Not much confusion about software simulators

Rarely asked questions:
  When will software simulation be finished/usable?
  What ISA do software simulators use?
  Can a software simulator model my new feature “X”?
  How accurate is software simulation?
  Why so many software simulators?
  Why is there not more sharing among software simulators?

4

RAMP is a consortium, not a project

Many projects with different goals
  sometimes multiple per site
So far, much sharing of ideas and techniques
  Very healthy and active community
Some sharing of low-level infrastructure
  Boards + platform-level interfaces to DRAM, Ethernet, etc.
Not a single complete infrastructure that everyone uses
  and that’s been OK, and might continue to be OK

5

Run Model of Target on Host Platform

[Figure labels: Host Platform; Target Machine; CPU (x4), Interconnect Network, DRAM; "Hard Work"]

6

RAMP Projects’ Goals

Model some target machine trading off:
  Fidelity
  Model design effort
  Emulation speed (and capacity)

7

Space of Target Machines

Which ISA?
  x86, SPARC, PowerPC, Alpha, ARM, MIPS?
In-order or out-of-order cores?
How many cores?
  1, 16, 256, 1M?
Processor+memory of general-purpose machine, or whole SoC including I/O devices?
Accelerators, GPUs?
Which operating system? Hypervisor?

8

ISA Wars

Original pick to standardize around was SPARC
  Open standard
  Available verification suite
  Simplest ISA with extensive general-purpose software support (i.e., desktop/server development environment available)
    SGI/MIPS sorely missed…
  Leon implementation for FPGA
  Simics
But the intent was always to support multiple ISAs

9

ISA usage in RAMP models

UCB RAMP Blue: Microblaze++
  Xilinx soft core modified to add 64-bit FPU
Stanford RAMP Red: PowerPC
  Used Virtex-II Pro hard cores
UT FAST: x86
  Functional simulation in software on front-end machine (or on PowerPC hard core)
UT RAMP White: PowerPC -> SPARC
  Initial version used hard PowerPC cores, moving to Leon soft cores
MIT/Intel HASIM: Alpha -> x86?
  Initially Alpha ISA, eventually to form basis of x86/uOP machine
CMU ProtoFLEX: SPARC
  “SPARC three ways” (own core + emulation on hard PowerPC core + emulation on front-end machine)
UCB RAMP Gold & Internet-in-a-Box: SPARC
  Own core design
UCB/LBNL Green Flash: Tensilica
  RTL generated from Tensilica tools

10

Supporting new ISAs

x86 still very desirable, but difficult
  FAST software functional model is probably the current best approach if you want to play with different timings
  Microcoded functional model would be a good way to go if resources were available (HASIM?)
  Even with a working functional model, timing model is difficult? Adding new features difficult?
ARM also desirable for mobile device modeling
  Renewed interest in engaging here
MIT/IBM PowerPC work in progress, could form functional model
But nobody does this for fun - only to advance their own research goals…

11

Commercial/Existing RTL Cores

Originally seen as big benefit of RAMP
But didn’t turn out that way in practice (except for prototyping usage model - see later)
  Cores don’t provide features we need, too big, too difficult to modify
For simple ISAs (i.e., non-x86), biggest help is ISA verification suites, and/or *really* simple synthesizable ISA pipeline to form basis of functional model

12

Operating System Support

Currently only ProtoFLEX, FAST, RAMP-White support OS
  Others can run one application with proxy mechanism for I/O
Reflects interests of groups. OS is not primary subject of research for groups building models so far.
  RAMP Gold to add support for ParLab OS work (Tessellation)
  Green Flash to add support for HPC-style microkernel

13

Target systems

From a few, to millions of cores
  Scaling simulation to 100s of cores was a shared goal
  But smaller core counts (16-128) very interesting also
  Huge core counts (>1E6) also of interest
Single node versus clusters
  RAMP Blue & Internet-in-a-box are message-passing clusters
  Rest are shared-memory systems
Memory hierarchy and cache coherence protocols
  Wide variety of possibilities
Desktop/Laptop/Server versus Handheld or SoC
  What is important to model for given research topic?
Accelerators/GPUs
  Even wider variety than CPU ISAs/microarchitectures

14

Wide variety, how to reuse?

Proposal:
ISA functional models
  also FPU across ISAs
  Perhaps even common uOP engine across all ISAs?
CPU Microarchitecture timing model
  E.g., in-order superscalar, out-of-order with unified physical register file
Memory functional model
  Host-level caches + memory interleaving
Memory hierarchy timing models
  On-chip network types as subset
I/O bus shims
  To allow random RTL to be attached for I/O devices and non-GPU accelerators

This won’t be easy, as have to agree on interfaces between these components, might need further specialization
Definitely need more experience doing all of the above

15

Simulator Types

Functional model only (no timing)
RTL models (functional includes timing)
  Also used for chip prototyping
Split functional and timing models
+ Hybrids of above

16

Simulator Mapping Styles

Gate-level emulator (Quickturn, Palladium)
  ~1MHz
Direct RTL emulator
  5-20MHz
FPGA-tuned RTL emulator
  20-50MHz
Virtualized RTL emulator
  50-100MHz
Host-multithreaded models
  >100MHz

17



18



RAMP Blue Release 2/25/2008
- design available from RAMP website
- ramp.eecs.berkeley.edu

19

Climate System Design Concept
Strawman Design Study

10PF sustained
~120 m2
<3MWatts
< $75M

100 racks @ ~25KW
power + comms
32 boards per rack
32 chip + memory clusters per board (2.7 TFLOPS @ 700W)
32 processors per 65nm chip, 83 GFLOPS @ 7W
8 DRAM per processor chip: ~50 GB/s

VLIW CPU:
• 128b load-store + 2 DP MUL/ADD + integer op/DMA per cycle
• Synthesizable at 650MHz in commodity 65nm
• 1mm2 core, 1.8-2.8mm2 with inst cache, data cache, data RAM, DMA interface, 0.25mW/MHz
• Double precision SIMD FP: 4 ops/cycle (2.7GFLOPs)
• Vectorizing compiler, cycle-accurate simulator, debugger GUI (Existing part of Tensilica Tool Set)
• 8 channel DMA for streaming from on/off chip DRAM
• Nearest neighbor 2D communications grid

[Figure: processor chip block diagram. ProcArray of 32 CPU tiles, each with D, I, and DMA blocks (CPU detail: 64-128K D, 2x128b, 32K I, 8 chan DMA); four RAM blocks; Opt. 8MB embedded DRAM; four External DRAM interfaces; Master Processor; Comm Link Control.]
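The per-core, per-chip, per-board, and per-rack figures on this slide can be cross-checked with a short roll-up. The sketch below uses only numbers quoted on the slide; the helper name and constants are illustrative. Note that 4 ops/cycle at 650MHz gives 2.6 GFLOPS, slightly below the quoted 2.7; the slide does not say whether that reflects rounding or a different clock.

```python
# Roll-up of the figures quoted on the slide. Input numbers come from the
# slide; the helper name and structure are illustrative assumptions.

def peak_gflops(ops_per_cycle, clock_mhz, units=1):
    """Peak GFLOPS of `units` identical cores at the given clock."""
    return ops_per_cycle * clock_mhz * units / 1000.0

CORES_PER_CHIP = 32
CHIPS_PER_BOARD = 32
BOARDS_PER_RACK = 32
RACKS = 100

core_gflops = peak_gflops(4, 650)
chip_gflops = peak_gflops(4, 650, CORES_PER_CHIP)
board_tflops = peak_gflops(4, 650, CORES_PER_CHIP * CHIPS_PER_BOARD) / 1000.0

rack_kw = BOARDS_PER_RACK * 700 / 1000.0   # boards only, 700 W each
system_mw = RACKS * 25 / 1000.0            # ~25 kW per rack, as quoted

print(f"core:   {core_gflops:.1f} GFLOPS (slide: 2.7)")
print(f"chip:   {chip_gflops:.1f} GFLOPS (slide: 83)")
print(f"board:  {board_tflops:.2f} TFLOPS (slide: 2.7)")
print(f"rack:   {rack_kw:.1f} kW from boards (slide: ~25 kW)")
print(f"system: {system_mw:.1f} MW (slide: <3 MW)")
```

The board-only rack power comes out a little under the quoted ~25KW, which is consistent with the slide listing power and comms alongside the racks.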

20

Virtualized RTL Improves FPGA Resource Usage

RAMP allows units to run at varying target-host clock ratios to optimize area and overall performance

Example 1: Multiported register file
  E.g., Sun Niagara has 3 read ports and 2 write ports to 6KB of register storage
  If RTL mapped directly, requires 48K flip-flops
    Slow cycle time, large area
  If mapping into block RAMs (one read + one write per cycle), takes 3 host cycles and 3x2KB block RAMs
    Faster cycle time (~3X) and far fewer resources
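A minimal sketch of the register-file example above, assuming a host RAM with one read and one write port per host cycle. The class and method names are hypothetical and this is a behavioral model, not RAMP code. It shows how a 3-read/2-write target cycle is spread over 3 host cycles.

```python
import math

class TimeMultiplexedRegFile:
    """Target register file with many ports, hosted on a RAM that offers
    only `host_reads` read and `host_writes` write ports per host cycle."""

    def __init__(self, num_regs, host_reads=1, host_writes=1):
        self.mem = [0] * num_regs
        self.host_reads = host_reads
        self.host_writes = host_writes
        self.host_cycles = 0

    def target_cycle(self, read_addrs, writes):
        """Run one target cycle; returns the values read.

        Assumption: reads observe the state from before this target
        cycle's writes, so all writes are committed after the reads.
        """
        values = [self.mem[a] for a in read_addrs]
        for addr, value in writes:
            self.mem[addr] = value
        # Host cycles needed to push every access through the narrow ports.
        self.host_cycles += max(
            math.ceil(len(read_addrs) / self.host_reads),
            math.ceil(len(writes) / self.host_writes),
            1,
        )
        return values

# Direct mapping cost quoted on the slide: 6KB of state held in flip-flops.
FLIP_FLOPS_DIRECT = 6 * 1024 * 8   # 49152, i.e. 48K
```

With the default single read port, a target cycle that issues 3 reads and 2 writes costs 3 host cycles, matching the slide; the target-visible behavior is unchanged.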

Example 2: Large L2/L3 caches
  Current FPGAs only have ~1MB of on-chip SRAM
  Use on-chip SRAM to build cache of active piece of L2/L3 cache, stall target cycle if access misses and fetch data from off-chip DRAM
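The second example can be sketched the same way: a small host SRAM holds the active part of the target cache's contents, and a host-level miss stalls the target while the line is fetched from host DRAM. Everything below (names, LRU replacement, the latency value) is an assumption made for illustration; the slide does not specify a replacement policy or latency.

```python
from collections import OrderedDict

class HostCachedTargetMemory:
    """Target cache contents held in host DRAM, with a small host SRAM
    caching the active lines. Host misses cost host cycles only."""

    def __init__(self, host_lines, dram_latency=50):
        self.host_lines = host_lines      # capacity of on-chip SRAM, in lines
        self.dram_latency = dram_latency  # assumed host cycles per DRAM fetch
        self.sram = OrderedDict()         # line -> data, kept in LRU order
        self.dram = {}                    # off-chip backing store
        self.host_cycles = 0
        self.stalls = 0

    def _touch(self, line):
        if line in self.sram:
            self.sram.move_to_end(line)   # mark as most recently used
            self.host_cycles += 1
            return
        # Host miss: the target cycle is stalled while DRAM is accessed.
        self.stalls += 1
        self.host_cycles += 1 + self.dram_latency
        if len(self.sram) >= self.host_lines:
            victim, data = self.sram.popitem(last=False)  # evict LRU line
            self.dram[victim] = data
        self.sram[line] = self.dram.get(line, 0)

    def read(self, line):
        self._touch(line)
        return self.sram[line]

    def write(self, line, data):
        self._touch(line)
        self.sram[line] = data
```

The stall affects only how long the host takes to emulate the access; the data returned to the target is the same whether the line was in SRAM or DRAM.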

21

Host Multithreading
(Zhangxi Tan (UCB), Chung (CMU))

Multithreading emulation engine reduces FPGA resource use and improves emulator throughput
Hides emulation latencies (e.g., communicating across FPGAs)

[Figure: a Target Model with CPU1-CPU4 is mapped onto a Multithreaded Host Emulation Engine (on FPGA): a single hardware pipeline (+1, I$, IR, X, Y, D$) with multiple copies of CPU state (PC and GPR).]
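The idea of a single hardware pipeline with multiple copies of CPU state can be sketched as a round-robin interpreter. This is a toy model: the one-instruction ISA (`addi`), the class names, and the strict round-robin schedule are all assumptions for illustration, not details of the RAMP Gold or ProtoFLEX engines.

```python
from dataclasses import dataclass, field

@dataclass
class TargetCPU:
    """Per-target architectural state: one copy per emulated CPU."""
    pc: int = 0
    gpr: list = field(default_factory=lambda: [0] * 4)

class HostMultithreadedEngine:
    """One shared pipeline; each host cycle serves a different target CPU."""

    def __init__(self, program, num_targets):
        self.program = program  # list of (op, rd, imm) tuples, shared by all CPUs
        self.cpus = [TargetCPU() for _ in range(num_targets)]
        self.host_cycle = 0

    def step(self):
        """Advance one host cycle, executing one instruction for the
        target CPU whose turn it is in round-robin order."""
        cpu = self.cpus[self.host_cycle % len(self.cpus)]
        if cpu.pc < len(self.program):
            op, rd, imm = self.program[cpu.pc]
            if op == "addi":          # toy ISA: gpr[rd] += imm
                cpu.gpr[rd] += imm
            cpu.pc += 1
        self.host_cycle += 1

    def run(self, host_cycles):
        for _ in range(host_cycles):
            self.step()
```

With 4 targets, each target CPU advances one instruction every 4 host cycles, so a long-latency host operation for one target can overlap with useful work for the others.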

22

Split Functional/Timing Models
(HASIM: Emer (MIT/Intel), FAST: Chiou (UT Austin))

Functional model executes CPU ISA correctly, no timing information
  Only need to develop functional model once for each ISA
Timing model captures pipeline timing details, does not need to execute code
  Much easier to change timing model for architectural experimentation
  Without RTL design, cannot be 100% certain that timing is accurate
Many possible splits between timing and functional model

[Figure: Functional Model and Timing Model blocks.]
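One possible split, sketched under simple assumptions: the functional model executes a toy ISA and emits a trace, and the timing model assigns cycles to that trace without executing anything. The ISA (`addi`, `mul`, `bnez`), the trace format, and the latency tables are hypothetical; HASIM and FAST partition the work differently and in more detail.

```python
def functional_model(program, regs):
    """Executes instructions correctly and yields (op, branch_taken)
    records. It carries no timing information."""
    pc = 0
    while 0 <= pc < len(program):
        op, r, k = program[pc]
        taken = False
        if op == "addi":
            regs[r] += k
        elif op == "mul":
            regs[r] *= k
        elif op == "bnez":            # branch by offset k if regs[r] != 0
            taken = regs[r] != 0
        yield op, taken
        pc = pc + k if taken else pc + 1

def timing_model(trace, latency, branch_penalty):
    """Counts target cycles for a trace; never executes code."""
    return sum(latency[op] + (branch_penalty if taken else 0)
               for op, taken in trace)

# A loop that doubles r1 three times while counting r0 down to zero.
program = [("mul", 1, 2), ("addi", 0, -1), ("bnez", 0, -2)]
regs = [3, 1]
trace = list(functional_model(program, regs))

# Two candidate microarchitectures evaluated against the same trace.
slow_mul = timing_model(trace, {"addi": 1, "mul": 3, "bnez": 1}, branch_penalty=2)
fast_mul = timing_model(trace, {"addi": 1, "mul": 1, "bnez": 1}, branch_penalty=0)
```

Changing the latency table or branch penalty requires no change to the functional model, which is the reuse argument made on the slide.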

23

RAMP White
Hari Angepat, Derek Chiou (UT Austin)

Scalable Coherent Shared Memory Multiprocessor
Support standard shared memory programming models

[Figure: RAMP-White block diagram. Two Leon 3 cores (Mst, Slv, DbgInt ports), each with a Leon3 shim, an AHB bus with AHB shim, an Intersection Unit, an NIU, and a Router; other blocks: MP IntCntrl, DSU, Eth, DDR2.]

24

Multithreaded Func. & Timing Models
(RAMP Gold: UCB)

MT-Unit multiplexes multiple target units on a single host engine
MT-Channel multiplexes multiple target channels over a single host link

[Figure: Functional Model Pipeline with Arch State and Timing Model Pipeline with Timing State, each an MT-Unit, connected by MT-Channels.]
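The channel side of this can be illustrated with a tagged queue: several target channels share one host link, and each message carries its channel id so the receiver can demultiplex. This is only a sketch of the multiplexing idea; the class name and interface are hypothetical and do not describe the actual RAMP Gold MT-Channel design.

```python
from collections import deque

class MultiplexedLink:
    """Several target channels carried over one host link."""

    def __init__(self, num_channels):
        self.link = deque()                               # the single host link
        self.rx = [deque() for _ in range(num_channels)]  # per-channel receive queues

    def send(self, channel_id, payload):
        # Tag each message with its target channel before it enters the link.
        self.link.append((channel_id, payload))

    def deliver(self):
        """Drain the host link, routing each message by its tag."""
        while self.link:
            channel_id, payload = self.link.popleft()
            self.rx[channel_id].append(payload)

    def recv(self, channel_id):
        return self.rx[channel_id].popleft()
```

Ordering is preserved within each target channel even though messages from different channels interleave on the host link.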

25

CMU Simics/RAMP Simulator

[Figure: the target, a 16-CPU shared-memory UltraSPARC III server (SunFire 3800) with CPUs, Memory, MMU, DMA, Graphics, NIC, SCSI, Terminal, and PCI, is split across two hosts. BEE2 Platform (Xilinx XC2VP70): interleaved pipeline holding 16x CPU contexts, DDR2 memory, PowerPC. Simics (PC): simulated I/O devices.]

26

What Hardware Platforms?

RTL mapping approaches
  Need large amounts of logic
  Selected BEE2, and then designed BEE3 for this emulation style
  Observed that don’t need much interconnect bandwidth (memory + inter-board links) because RTL cores are slow and latency sensitive
Host-multithreading allows large systems to be mapped to small (one?) FPGA (e.g., 64-128 cores on ML505)
  Logic gate count not as critical, need to focus on on-chip capacity, off-chip memory bandwidth and total memory capacity per FPGA (conventional processor memory hierarchy issues multiplied by multithreading factor)
  One big FPGA with lots of fast memory channels would be ideal
Software functional emulation (FAST) or transplant (ProtoFLEX)
  Focus on fast coherent connection to front-end x86 CPU
  Hypertransport, FSB, QPI interfaces better than PCI I/O connections

27

Summary

Many reasons for great divergence in RAMP projects
  Different ISAs, different target machines, different research topics, different emulation styles
Sharing possible, but hard work and more experience needed

Questions?
