Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung,...

Post on 21-Dec-2015

Computer Architecture Lab at

ProtoFlex: Status Update and Design Experiences

Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, Ken Mai

{echung, enurvita, jhoe, babak, kenmai}@ece.cmu.edu

PROTOFLEX

Our work in this area has been supported in part by NSF, IBM, Intel, and Xilinx.

Full-system Functional Simulation

• Effective substitute for real (or non-existent) HW
– Can boot OS, run commercial apps
– Important in SW research & computer architecture
• But too slow for large-scale MP studies
– Multicore won’t help existing tools
– Is a serious challenge for large-MP (1000-way) simulation

REVIEW

Alternative: FPGA-based simulation

• Only 10x slower in clock frequency than custom HW
• But FPGAs are harder to use than software
– Simulating large-MP (100- to 1000-way) can’t be done trivially
– Full-system support needs devices + the entire ISA

The “build-all” strategy in FPGAs = significant effort + resources

[Figure: a full-system target built entirely in FPGAs: CPUs, memory, PCI bus, Ethernet controller, graphics card, I/O MMU controller, disks, DMA controller, IRQ controller, terminal, SCSI controller]

Reducing complexity w/ virtualization

Hybrid Full-System Simulation
• Making multiple physical resources appear as a single logical resource
• Only frequent behaviors hosted in FPGA. Relegate infrequent to SW.
[Figure: target full-system behaviors split across host resources: frequent behaviors on the FPGA, infrequent behaviors in software]

Virtualized MP Simulation
• Making a single physical resource appear as multiple logical resources
• Logical CPUs multiplexed onto fewer physical CPUs.
[Figure: five target CPUs mapped onto host resources consisting of 1 FPGA CPU]

Outline

• Hybrid Full-System Simulation

• Virtualized Multiprocessor Simulation

• BlueSPARC Implementation

• Design Experiences

• Future Work

Hybrid Full-System Simulation

• 3 ways to map a target component to the hybrid simulation host: (1) FPGA-only, (2) Simulation-only, (3) Transplantable
• CPUs can fall back to SW by “transplanting” between hosts
– Only common-case instructions/behaviors implemented in FPGA
– Remaining behaviors relegated to SW (turns out to be many of the complex ones)

Transplants reduce full-system design effort

[Figure: software-only simulation, with CPUs, memory, MMU, Fibre, graphics, NIC, PCI, terminal, and SCSI all in the software full-system simulator host, compared with hybrid simulation, which adds an FPGA host; an I/O instruction triggers a CPU transplant between hosts]
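
The transplant idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the opcode names, the `FPGA_OPS` set, the `CpuState` fields, and both `*_execute` functions are hypothetical stand-ins, not the ProtoFlex interfaces. The point is that one architectural state object moves between hosts, so software only runs the instructions the FPGA host does not implement.

```python
from dataclasses import dataclass, field

# Hypothetical set of opcodes the FPGA host implements natively.
FPGA_OPS = {"add", "sub", "load", "store"}


@dataclass
class CpuState:
    """Architectural state handed between hosts on a transplant."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 4)


def fpga_execute(op, state):
    """Stand-in for the FPGA host executing a common-case instruction."""
    state.pc += 4
    return state


def software_execute(op, state):
    """Stand-in for the software simulator executing a rare instruction."""
    if op == "io_read":
        state.regs[0] = 0xAB  # pretend a simulated device returned this value
    state.pc += 4
    return state


def run(program):
    """Run a list of opcodes; return the final state and the transplant count."""
    state = CpuState()
    transplants = 0
    for op in program:
        if op in FPGA_OPS:
            state = fpga_execute(op, state)
        else:
            # Transplant: hand the state to software, then resume on the FPGA.
            transplants += 1
            state = software_execute(op, state)
    return state, transplants
```

Usage: `run(["add", "io_read", "store"])` performs a single transplant, for the `io_read`.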

Virtualized Multiprocessor Simulation

• Problem: large-scale simulation configurations are challenging to implement in FPGAs using structurally-accurate approaches

Structural accuracy: 1-to-1 mapping between target and host CPUs
• Pros: fastest possible solution, only 10x slower than real HW
• Cons: difficult to build for large-scale configs (e.g., >100-way)

[Figure: # processors in target model vs. # host processors implemented in FPGA; the 1-to-1 mapping is 10x slower than real HW]

Virtualized Multiprocessor Simulation

Host interleaving: multiplex target processors onto fewer FPGA-hosted processors

Advantages:
• Decouple logical target system size from FPGA host size
• Scale FPGA host as needed to deliver required performance
• High target-to-host ratio (TH) simplifies/consolidates HW (e.g., fewer nodes in cache coherence, interconnect)

[Figure: # processors in target model vs. # host “engines” implemented in FPGA; a 4-to-1 mapping is 40x slower than real HW]
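
The 10x (1-to-1) and 40x (4-to-1) figures are consistent with a simple back-of-the-envelope model: slowdown = clock-frequency slowdown x target-to-host ratio. The sketch below assumes each host engine retires one target instruction per cycle and ignores transplants and cache misses. The function name and parameters are illustrative, not from the talk.

```python
def simulation_slowdown(target_cpus, host_engines, clock_slowdown=10.0):
    """Estimated slowdown versus real HW under the simple model above.

    Assumption: each host engine issues one target instruction per host
    cycle, so every target CPU gets 1/TH of an engine's cycles.
    """
    if target_cpus <= 0 or host_engines <= 0:
        raise ValueError("CPU and engine counts must be positive")
    th_ratio = target_cpus / host_engines
    return clock_slowdown * th_ratio
```

Usage: `simulation_slowdown(4, 4)` gives the 1-to-1 case and `simulation_slowdown(4, 1)` the 4-to-1 case. Other ratios are model outputs, not measurements.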

What’s inside an FPGA host processor?

• An “engine” that architecturally executes multiple contexts
– Existing multithreaded designs are good candidates
– Choice is influenced by TH ratio (target-to-host ratio)
• We propose an interleaved pipeline (e.g., TERA-style)
– Best suited for high TH ratio
– Switch in new CPU context on each cycle
– Simple, efficient design w/ no stalling or forwarding
– Long-latency tolerance (e.g., cache miss, transplants)
– Coherence is “free” between CPUs mapped onto same engine

[Figure: several target CPUs mapped onto one host CPU]
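
A minimal sketch of the per-cycle context switch, assuming a round-robin selector that skips contexts waiting on a long-latency event. The function names and the `blocked` map are hypothetical. `hazard_free` states a sufficient condition for dropping forwarding under strict round-robin with every context ready: if there are at least as many contexts as pipeline stages, an instruction has left the pipeline before the next one from the same context enters.

```python
def schedule(num_contexts, cycles, blocked=None):
    """Return the context issued on each cycle (None if all are blocked).

    `blocked` maps a context id to the first cycle at which it is ready
    again, e.g. after a cache miss or a transplant.
    """
    blocked = dict(blocked or {})
    issued = []
    next_ctx = 0
    for cycle in range(cycles):
        choice = None
        # Scan in round-robin order starting from the next context in line.
        for offset in range(num_contexts):
            ctx = (next_ctx + offset) % num_contexts
            if blocked.get(ctx, 0) <= cycle:
                choice = ctx
                break
        issued.append(choice)
        if choice is not None:
            next_ctx = (choice + 1) % num_contexts
    return issued


def hazard_free(num_contexts, pipeline_depth):
    """Sufficient condition for no same-context overlap in the pipeline.

    Assumes strict round-robin with all contexts ready; skipping blocked
    contexts shortens the reissue distance and is not covered here.
    """
    return num_contexts >= pipeline_depth
```

Usage: `schedule(4, 8, blocked={1: 5})` keeps issuing from contexts 0, 2, and 3 while context 1 waits. `hazard_free(16, 14)` is true for the 16 contexts and 14 stages reported for BlueSPARC later in the talk.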

Implementation: BlueSPARC simulator

Target: 16-CPU shared-memory UltraSPARC III server (SunFire 3800) with memory, MMU, DMA, graphics, NIC, SCSI, terminal, and PCI

[Figure: the target mapped onto two hosts. BEE2 platform (Xilinx XC2VP70): interleaved pipeline holding 16 CPU contexts, DDR2 memory, and a PowerPC. Simics (PC): simulated I/O devices.]

BlueSPARC Simulator (continued)

Processing nodes: 16 64-bit UltraSPARC III contexts; 14-stage instruction-interleaved pipeline
L1 caches: split I/D, 64KB, 64B, direct-mapped, writeback; non-blocking loads/stores; 16-entry MSHR, 4-entry store buffer
Clock frequency: 90MHz on Xilinx V2P70
Main memory: 4GB total
Resources (Xilinx V2P70): 33,508 LUTs (50%), 222 BRAMs (67%) w/o stats+debug; 43,206 LUTs (65%), 238 BRAMs (72%) with stats+debug
Instrumentation: all internal state fully traceable; attachable to FPGA-based CMP cache simulator*
EDA tools: Xilinx EDK 9.2i, Bluespec System Verilog
Statistics: 25K lines Bluespec, 511 rules, 89 module types
Checkpointing: fully compatible with Simics checkpoints; can load AND generate checkpoints

BlueSPARC host microarchitecture

[Figure: 14-stage pipeline with stage groups 1; 2,3; 4,5; 6; 7; 8; 9,10,11; 12,13; 14. Blocks shown: Context Selector; PC and state (x16); 64KB I-cache; I-TLBs (16-entry direct-mapped x16; 128-entry 2-way x16); Decode; RegFile; ALU1 and ALU2; 64KB D-cache; D-TLBs (16-entry fully-assoc x16; 512-entry 2-way x16); Trap Unit; Writeback Unit; Assist Unit; Transplant Unit; PowerPC405 (transplant service processor). Legend: normal pipeline stage, multi-context state, transplant support unit.]

64-bit ISA, SW-visible MMU, and complex memory lead to a high # of pipeline stages

Hybrid host partitioning choices

BlueSPARC (FPGA), on-chip
• add/sub/shift/logical
• multiply/divide
• register windows
• 38/103 SPARC ASIs
• interprocessor x-calls
• device interrupts
• I-/D-MMU + TLB miss
• loads/stores/atomics
• VIS block memory

Micro-transplant (on-chip simulation, PowerPC405)
• 65/103 SPARC ASIs
• VIS I/II multimedia
• FP add/sub/mul/div + traps
• FP/INT conversion
• trap on integer arithmetic
• alignment
• fixed-point arithmetic
• TLB/cache diagnostics
• TLB demap

Transplant (off-chip simulation, Simics on PC)
• PCI bus
• ISP2200 Fibre Channel
• I21152 PCI bridge
• IRQ bus
• Fibre Channel SCSI disk/cdrom
• Text console
• SBBC PCI device
• Serengeti I/O PROM
• Cheerio-hme NIC
• SCSI bus
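
One way to reason about a partitioning like this is to weight it by an instruction-mix profile and see what share of dynamic events each host serves. The sketch below is illustrative only: the behavior keys are hypothetical names for a few of the listed categories, and any profile counts passed in are made up by the caller, not measured data from the talk.

```python
from collections import Counter

FPGA, MICRO, OFFCHIP = "fpga", "micro-transplant", "transplant"

# Partial routing table using a few of the listed categories (hypothetical keys).
PARTITION = {
    "add": FPGA, "load": FPGA, "store": FPGA, "tlb_miss": FPGA,
    "fp_add": MICRO, "vis_multimedia": MICRO, "tlb_demap": MICRO,
    "pci_bus": OFFCHIP, "scsi_bus": OFFCHIP, "text_console": OFFCHIP,
}


def host_shares(profile):
    """Return the fraction of dynamic events served by each host.

    `profile` maps a behavior name to its dynamic count. An unknown
    behavior raises KeyError so that gaps in the table are noticed.
    """
    total = sum(profile.values())
    if total == 0:
        raise ValueError("profile is empty")
    counts = Counter()
    for behavior, n in profile.items():
        counts[PARTITION[behavior]] += n
    return {host: counts[host] / total for host in (FPGA, MICRO, OFFCHIP)}
```

Usage: with an invented profile such as `{"add": 900, "load": 90, "fp_add": 9, "pci_bus": 1}`, the FPGA share is 99% of events. That is the kind of skew that makes relegating rare behaviors to slower hosts tolerable.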

Performance

• Perf comparable to Simics-fast
• 39x speedup on average over Simics-trace

[Figure: bar chart of MIPS (axis 0 to 70) for oracle, bzip2, crafty, gcc, gzip, parser, vortex, and the average; series: BlueSPARC (90MHz), Simics-fast (2.0GHz C2Duo), Simics-trace; one bar carries the value label 1.18]

Design experiences

2007 Timeline
• January-February: initial virtualization ideas; analysis + simulation of interleaving; ISA profiling of apps for hybrid partitioning; initial specifications for host pipeline
• March: Simics API wrappers + software experiments
• April-November: BlueSPARC RTL development; validation tools
• November-December: host performance instrumentation and writeup*

* To appear in FPGA’08

Design experiences (cont)

• What was important:
– Developing effective validation strategies (more on next slide)
– Existing reference model (Simics) to study and compare against
– Efficient mapping of state to FPGA resources (e.g., 16 PCs in 16-bit LUT-based distributed RAM)
– Coping with long Xilinx builds by easing up on timing constraints
– “Judicious” Bluespec
• What was NOT important:
– Meeting 100MHz timing for every Xilinx build (i.e., deep pipelining)
– Implementing every functionality as efficiently/fast as possible

Validation

• THE most challenging aspect of this project
• Strategies used
– Auto-generated torture tests + hand-written test cases
– Auto-port test cases from OpenSPARC T1 framework to UltraSPARC III
– Validated single-threaded + multithreaded ISA execution against Simics (both in Verilog simulations and in FPGA)
– Flight data recorder for non-deterministic interleaving of CPUs
– Batched Verilog simulations w/ varying parameters
– Validate non-blocking memory system with “shadow” flat memories during Verilog simulation (caught self-modifying code bugs)
– > 200 synthesizable assertions to Chipscope
– Built-in deadlock/error detectors
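
The “shadow” flat memory strategy can be sketched in software: every store also goes to a trivially correct flat memory, and every load from the device under test is compared against it. This is a toy illustration under stated assumptions. `WritebackCache` is a hypothetical tiny direct-mapped write-back cache, not the BlueSPARC memory system, and the actual checks ran inside Verilog simulation.

```python
class WritebackCache:
    """Toy direct-mapped write-back cache over a dict-backed memory."""

    def __init__(self, num_lines=4):
        self.num_lines = num_lines
        self.lines = {}   # index -> (addr, value, dirty)
        self.memory = {}  # backing store

    def _evict(self, index):
        entry = self.lines.pop(index, None)
        if entry is not None and entry[2]:
            self.memory[entry[0]] = entry[1]  # write dirty data back

    def store(self, addr, value):
        index = addr % self.num_lines
        if index in self.lines and self.lines[index][0] != addr:
            self._evict(index)
        self.lines[index] = (addr, value, True)

    def load(self, addr):
        index = addr % self.num_lines
        entry = self.lines.get(index)
        if entry is None or entry[0] != addr:
            self._evict(index)
            entry = (addr, self.memory.get(addr, 0), False)
            self.lines[index] = entry
        return entry[1]


class ShadowChecked:
    """Wrap a memory system and check every load against a flat shadow."""

    def __init__(self, dut):
        self.dut = dut
        self.shadow = {}  # flat reference memory, correct by construction

    def store(self, addr, value):
        self.shadow[addr] = value
        self.dut.store(addr, value)

    def load(self, addr):
        got = self.dut.load(addr)
        expected = self.shadow.get(addr, 0)
        if got != expected:
            raise AssertionError(
                f"mismatch at {addr:#x}: got {got}, expected {expected}")
        return got
```

Usage: wrap the memory system once, as in `ShadowChecked(WritebackCache(2))`, and drive all traffic through the wrapper. A cache that drops a dirty writeback is then flagged at the first load that observes the stale value.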

In retrospect…

• What I would have done differently to begin with
– Write entire USIII functional model myself in software first
– Take more advantage of Verilog PLI for validation (interface to C)
– Don’t over-engineer HDL
– Don’t upgrade tools unless necessary (e.g., trial license runs out)
– Validation infrastructure w/ batching capabilities (do earlier!)
– Automated “binary search” tool for bug hunting
– Re-write DDR2 async FIFOs without BRAMs
– Fast memory checkpoint loader (3GB images per run = 25 min)
– Simple, correct >> Fast, buggy
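
An automated “binary search” for bug hunting could look like the sketch below. It assumes runs are deterministic from a checkpoint and that once the design diverges from the reference model it stays diverged. `matches_reference` is a hypothetical callback that runs both models for `n` instructions and compares their architectural state.

```python
def first_divergence(matches_reference, max_steps):
    """Return the smallest step count at which the design diverges.

    Returns None if the design still matches after `max_steps`.
    Assumes divergence is monotone: no match after the first mismatch.
    """
    if matches_reference(max_steps):
        return None
    lo, hi = 0, max_steps  # invariant: matches at lo, diverged at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if matches_reference(mid):
            lo = mid
        else:
            hi = mid
    return hi
```

Usage: with a 10,000-instruction window this takes about 14 comparison runs instead of a step-by-step scan, which matters when each run starts from a slow checkpoint load.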

Future Work

• Scalability
– Burden-of-proof for 1000-way simulation?
– Investigate cache-coherence/interconnect mechanisms for combining multiple interleaved pipelines
• Virtualization design spaces
– On-chip storage virtualization (e.g., architectural state)
– Memory + disk capacity (e.g., HW-based demand paging?)
– Virtualizing instrumentation (e.g., paging functional cache tags)
• Fast instrumentation tools
– Understanding systems at multiple levels of abstraction (beyond ISA)
– Validation+analysis: beyond ISA, how to sanity-check app+sys behavior?

BlueSPARC Demo on BEE2

• Demo application
– On-Line Transaction Processing benchmark (TPC-C) in Oracle
– Runs in Solaris 8 (unmodified binary)
– FPGA + memory directly loaded from Simics checkpoint

[Figure: BEE2 platform, labeled: Virtex-II Pro 70 (PowerPC & BlueSPARC); 4 DDR2 controllers + 4 GB memory; Ethernet (to Simics on PC); RS232 (debugging)]

Conclusion

• “Build-all” simulation approach in FPGAs is challenging
• Two virtualization techniques for reducing complexity
– Hybrid: attain full-system by deferring rare behaviors to SW
– Virtualized MP: decouples target system size from host size
• BlueSPARC proof-of-concept
– Models 16-CPU UltraSPARC III server
– Comparable perf to Simics-fast, 39x faster on avg than Simics-trace
• Thanks! Questions? echung@ece.cmu.edu
• PROTOFLEX (http://www.ece.cmu.edu/~simflex/protoflex.html)