Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung,...

24
Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, Ken Mai {echung, enurvita, jhoe, babak, kenmai}@ece.cmu.edu PROTOFLEX Our work in this area has been supported in part by NSF, IBM, Intel, and Xilinx.
  • date post

    21-Dec-2015
  • Category

    Documents

  • view

    215
  • download

    0

Transcript of Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung,...

Page 1: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

Computer Architecture Lab at

1

ProtoFlex: Status Update and Design Experiences

Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi,James C. Hoe, Babak Falsafi, Ken Mai

{echung, enurvita, jhoe, babak, kenmai}@ece.cmu.edu

PROTOFLEX

Our work in this area has been supported in part by NSF, IBM, Intel, and Xilinx.

Page 2: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

222

Full-system Functional Simulation• Effective substitute for real (or non-existent) HW

– Can boot OS, run commercial apps

– Important in SW research & computer architecture

• But too slow for large-scale MP studies– Multicore won’t help existing tools

– Is serious challenge for large-MP (1000-way) simulation

REVIEW

Page 3: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

333

Alternative: FPGA-based simulation• Only 10x slower in clock freq than custom HW

• But FPGAs harder to use than software– Simulating large-MP (100- to 1000-way) can’t be done trivially

– Simulating full-system support need devices + entire ISA

The “build-all” strategy in FPGAs = significant effort + resources

Memory

PCI Bus

Ethernetcontroller

Graphics card

I/O MMUcontroller

DiskDisk

DMAcontroller

IRQ controller

Terminal

SCSIcontroller

CPU CPUFPGAs

Page 4: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

444

Reducing complexity w/ virtualization

Hybrid Full-System SimulationVirtualized MP Simulation

Only frequent behaviors hosted in FPGA. Relegate infrequent to SW.

Target full-system behaviors

FPGA Software

frequent infrequent

CPU CPU CPU CPU CPU

Logical CPUs multiplexed onto fewer physical CPUs.

Host resources

1 FPGA CPU

Host resources

Making multiple physical resources appear as a single logical resource

Making a single physical resource appear as multiple logical resources

21

Page 5: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

555

Outline

• Hybrid Full-System Simulation

• Virtualized Multiprocessor Simulation

• BlueSPARC Implementation

• Design Experiences

• Future Work

Page 6: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

666

3CPU

Hybrid Full-System Simulation

• 3 ways to map target component to hybrid simulation hostFPGA-only Simulation-only Transplantable

• CPUs can fallback to SW by “transplanting” between hosts– Only common-case instructions/behaviors implemented in FPGA

– Remaining behavs relegated to SW (turns out many of complex ones)

1 2 3

CPU CPU

Memory

MMU Fibre

Graphics NIC PCI

Terminal

SCSI

Software full-system simulator host

Hybrid Simulation

FPGA host

12

I/O instr

CPUCPU

transplant

Transplants reduce full-system design effort

CPUCPU CPU

Memory

MMU Fibre

Graphics NIC PCI

Terminal

SCSI

Software full-system simulator host

CPU

Software-only simulation

Page 7: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

777

Outline

• Hybrid Full-System Simulation

• Virtualized Multiprocessor Simulation

• BlueSPARC Implementation

• Design Experiences

• Future Work

Page 8: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

8

Virtualized Multiprocessor Simulation• Problem: large-scale simulation configurations challenging to

implement in FPGAs using structurally-accurate approaches

# processors in target model

Structural-accuracy1-to-1 mapping

between target and host CPUs

# host processors implemented in FPGA

Pros: fastest possible solution, only 10x slower than real HW

Cons: difficult to build for large-scale configs (e.g., >100-way)

10x slower than real HW

1-to-1

Page 9: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

999

Virtualized Multiprocessor Simulation

Advantages:

• Decouple logical target system size from FPGA host size

• Scale FPGA host as-needed to deliver required performance

• High target-to-host ratio (TH) simplifies/consolidates HW (e.g., fewer # nodes in cache coherence, interconnect)

# processors in target model

HostInterleavingMultiplex target

processors onto fewer # FPGA-hosted processors

# host “engines” implemented in FPGA

40x slower than real HW

4-to-1

Page 10: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

101010

What’s inside an FPGA host processor?

• An “engine” that architecturally executes multiple contexts– Existing multithreaded designs are good candidates

– Choice is influenced by TH ratio (target-to-host ratio)

• We propose an interleaved pipeline (e.g., TERA-style)– Best suited for high TH ratio

– Switch in new CPU context on each cycle

– Simple, efficient design w/ no stalling or forwarding

– Long-latency tolerance (e.g., cache miss, transplants)

– Coherence is “free” between CPUs mapped onto same engine

CPU CPU CPU

HOSTCPU

Page 11: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

111111

Outline

• Hybrid Full-System Simulation

• Virtualized Multiprocessor Simulation

• BlueSPARC Implementation

• Design Experiences

• Future Work

Page 12: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

1212

Implementation: BlueSPARC simulator

16-CPU Shared-memory UltraSPARC III Server

(SunFire 3800) Memory

MMU DMA

Graphics NIC SCSI

Terminal

PCI

CPUCPU CPU..

BEE2 Platform Simics (PC)Xilinx XCV2P70

DDR2MemDDR2Mem

InterleavedPipeline

CPUcontextCPU

context16xCPU

PowerPC

SimulatedI/O devices

Page 13: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

1313

BlueSPARC Simulator (continued)Processing Nodes 16 64-bit UltraSPARC III contexts

14-stage instruction-interleaved pipeline

L1 caches Split I/D, 64KB, 64B, direct-mapped, writebackNon-blocking loads/stores16-entry MSHR, 4-entry store buffer

Clock frequency 90MHz on Xilinx V2P70

Main memory 4GB total

Resources (Xilinx V2P70)

33,508 LUTs (50%), 222 BRAMs (67%) w/o stats+debug43,206 LUTs (65%), 238 BRAMs (72%)

Instrumentation All internal state fully traceableAttachable to FPGA-based CMP cache simulator*

EDA tools Xilinx EDK 9.2i, Bluespec System Verilog

Statistics 25K lines Bluespec, 511 rules, 89 module types

Checkpointing Fully compatible with Simics checkpointsCan load AND generate checkpoints

Page 14: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

1414

BlueSPARC host microarchitecture

TransplantUnit

TransplantUnit

1 2, 3 4,5 6 87 9,10,11 12,13 14

PowerPC405 (transplant service processor)

64KBI-cache64KB

I-cache

I-TLB16-entry(direct-

mapped)x16

I-TLB16-entry(direct-

mapped)x16

I-TLB128-entry

(2-way)x16

I-TLB128-entry

(2-way)x16

ALU1ALU1ALU2ALU2

64KBD-cache64KB

D-cache

TrapUnitTrapUnit

WritebackUnit

WritebackUnit

D-TLB512-entry

(2-way)x16

D-TLB512-entry

(2-way)x16

D-TLB16-entry

(fully-assoc)

x16

D-TLB16-entry

(fully-assoc)

x16

RegFileRegFile

DecodeDecodePC, statex16

PC, statex16

ContextSelectorContextSelector

AssistUnit

AssistUnit

Normal pipeline stageNormal pipeline stage Multi-context stateMulti-context state Transplant support unitTransplant support unit

64-bit ISA, SW-visible MMU, complex memory high # of pipeline stages

Page 15: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

1515

Hybrid host partitioning choices

BlueSPARC (FPGA) Micro-transplant (on-chip simulation)• add/sub/shift/logical• multiply/divide• register windows• 38/103 SPARC ASIs• interprocessor x-calls• device interrupts• I-/D-MMU + tlb miss• Loads/stores/atomics• VIS block memory

• 65/103 SPARC ASIs• VIS I/II multimedia• FP add/sub/mul/div + traps• FP/INT conversion• trap on integer arithmetic• alignment• fixed-point arithmetic• tlb/cache diagnostics• tlb demap

Transplant (off-chip simulation)•PCI bus•ISP2200 Fibre Channel•I21152 PCI bridge•IRQ bus•Fibre Channel SCSI disk/cdrom

•Text Console•SBBC PCI device•Serengeti I/O PROM•Cheerio-hme NIC•SCSI bus

BlueSPARC Micro-transplants(PowerPC405)

ON-CHIP FPGATransplants

(Simics on PC)

OFF-CHIP

Page 16: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

1616

Performance

Perf comparable to Simics-fast39x speedup on average over Simics-trace

010203040506070

orac

lebz

ip2

craft

y

gcc

gzip

pars

ervo

rtex

aver

age

MIP

S

BlueSPARC (90mhz)Simics-fast (2.0GHz C2Duo)Simics-trace

1.18

Page 17: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

171717

Outline

• Hybrid Full-System Simulation

• Virtualized Multiprocessor Simulation

• BlueSPARC Implementation

• Design Experiences

• Future Work

Page 18: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

18

Design experiences

2007 TimelineJanuary-February

Initial virtualization ideasAnalysis + simulation of interleavingISA profiling of apps for hybrid partitioningInitial specifications for host pipeline

March Simics API wrappers + software experimentsApril-November

BlueSPARC RTL developmentValidation tools

November-December

Host performance instrumentation and writeup*

* To appear in FPGA’08

Page 19: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

19

Design experiences (cont)

• What was important:– Developing effective validation strategies (more on next slide)

– Existing reference model (Simics) to study and compare against

– Efficient mapping of state to FPGA resources (e.g., 16 PCs 16-bit LUT-based distributed RAM)

– Coping with long Xilinx builds by easing up on timing constraints

– “Judicious” Bluespec

• What was NOT important:– Meeting 100MHz timing for every Xilinx build (i.e., deep pipelining)

– Implementing every functionality as efficiently/fast as possible

Page 20: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

20

Validation

• THE most challenging aspect of this project

• Strategies used– Auto-generated torture tests + hand-written test cases

– Auto-port test-cases from OpenSPARC T1 framework to UltraSPARC III

– Validated single-threaded + multithreaded ISA execution against Simics (both in Verilog Simulations and in FPGA)

– Flight data recorder for non-deterministic interleaving of CPUs

– Batched Verilog simulations w/ varying parameters

– Validate non-blocking memory system with “shadow” flat memories during Verilog simulation caught self-modifying code bugs

– > 200 synthesizable assertions to Chipscope

– Built-in deadlock/error detectors

Page 21: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

21

In retrospect…

• What I would have done differently to begin with– Write entire USIII functional model myself in software first

– Take more advantage of Verilog PLI for validation (interface to C)

– Don’t over-engineer HDL

– Don’t upgrade tools unless necessary (e.g., trial license runs out)

– Validation infrastructure w/ batching capabilities (do earlier!)

– Automated “binary search” tool for bug hunting

– Re-write DDR2 Async FIFOs without BRAMs

– Fast memory checkpoint loader (3GB images per run = 25m)

– Simple, correct >> Fast, buggy

Page 22: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

22

Future Work

• Scalability– Burden-of-proof for 1000-way simulation?

– Investigate cache-coherence/interconnect mechanisms for combining multiple interleaved pipelines

• Virtualization design spaces– On-chip storage virtualization (e.g., architectural state)

– Memory + disk capacity (e.g., HW-based demand paging?)

– Virtualizing instrumentation (e.g., paging functional cache tags)

• Fast instrumentation tools– Understanding systems at multiple levels of abstraction (beyond ISA)

– Validation+analysis: beyond ISA, how to sanity-check app+sys behavior?

Page 23: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

23

BlueSPARC Demo on BEE2

23

• Demo application– On-Line Transaction

Processing benchmark (TPC-C) in Oracle

– Runs in Solaris 8 (unmodified binary)

– FPGA + Memory directly loaded from Simics checkpoint

4 DDR2 Controllers + 4 GB memory

Ethernet (to Simics

on PC)

Virtex-II Pro 70 (PowerPC & BlueSPARC) RS232 (Debugging)

BEE2 Platform

Page 24: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

242424

Conclusion

• “Build-all” simulation approach in FPGAs is challenging

• Two virtualization techniques for reducing complexity

– Hybrid: attain full-system by deferring rare behavs to SW

– Virtualized MP: decouples target system size from host size

• BlueSPARC proof-of-concept

– Models 16-cpu UltraSPARC III server

– Comparable perf to Simics-fast, 39x on avg faster than Simics-trace

• Thanks! Questions? [email protected]• PROTOFLEX (http://www.ece.cmu.edu/~simflex/protoflex.html)