
Computer Architecture Lab at Carnegie Mellon

ProtoFlex: Status Update and Design Experiences

Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, Ken Mai

{echung, enurvita, jhoe, babak, kenmai}@ece.cmu.edu

PROTOFLEX

Our work in this area has been supported in part by NSF, IBM, Intel, and Xilinx.

Full-system Functional Simulation

• Effective substitute for real (or non-existent) HW

– Can boot OS, run commercial apps

– Important in SW research & computer architecture

• But too slow for large-scale MP studies

– Multicore won’t help existing tools

– Is a serious challenge for large-MP (1000-way) simulation

REVIEW

Alternative: FPGA-based simulation

• Only 10x slower in clock freq than custom HW

• But FPGAs are harder to use than software

– Simulating large-MP (100- to 1000-way) can’t be done trivially

– Simulating full-system support needs devices + entire ISA

The “build-all” strategy in FPGAs = significant effort + resources

[Figure: full-system target built entirely in FPGAs. Labeled components: CPUs, memory, PCI bus, Ethernet controller, graphics card, I/O MMU controller, disks, DMA controller, IRQ controller, terminal, SCSI controller.]

Reducing complexity w/ virtualization

Hybrid Full-System Simulation

• Only frequent behaviors hosted in FPGA. Relegate infrequent to SW.

• Making multiple physical resources appear as a single logical resource

[Figure: target full-system behaviors split across host resources: frequent behaviors on the FPGA, infrequent behaviors in software.]

Virtualized MP Simulation

• Logical CPUs multiplexed onto fewer physical CPUs.

• Making a single physical resource appear as multiple logical resources

[Figure: several logical CPUs mapped onto host resources consisting of 1 FPGA CPU.]

Outline

• Hybrid Full-System Simulation

• Virtualized Multiprocessor Simulation

• BlueSPARC Implementation

• Design Experiences

• Future Work

Hybrid Full-System Simulation

• 3 ways to map a target component to a hybrid simulation host: FPGA-only, Simulation-only, Transplantable

• CPUs can fall back to SW by “transplanting” between hosts

– Only common-case instructions/behaviors implemented in FPGA

– Remaining behaviors relegated to SW (these turn out to include many of the complex ones)

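[Figure: Hybrid simulation vs. software-only simulation. In hybrid simulation, CPUs run on the FPGA host while the software full-system simulator host holds the remaining components (memory, MMU, Fibre, graphics, NIC, PCI, terminal); an I/O instruction causes a CPU to transplant from the FPGA host to the software host. In software-only simulation, the CPUs and all of these components reside in the software full-system simulator host.]

Transplants reduce full-system design effort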
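The transplant idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ProtoFlex implementation: the opcode set `FPGA_OPS`, the `CpuContext` fields, and the two `*_execute` functions are hypothetical stand-ins. The point it shows is that a CPU context runs on the FPGA host for common-case operations and is handed to the software host only when it meets an operation the FPGA does not implement.

```python
from dataclasses import dataclass, field

# Hypothetical set of common-case operations implemented in the FPGA host.
FPGA_OPS = {"add", "sub", "load", "store"}


@dataclass
class CpuContext:
    """Architectural state that moves between hosts (fields are illustrative)."""
    pc: int = 0
    regs: dict = field(default_factory=dict)
    host: str = "fpga"
    transplants: int = 0


def fpga_execute(ctx, op):
    # Stand-in for one instruction executed by the FPGA-hosted pipeline.
    ctx.pc += 4


def software_execute(ctx, op):
    # Stand-in for the same step executed by the software simulator.
    ctx.pc += 4


def step(ctx, op):
    """Execute one operation, transplanting to software if the FPGA lacks it."""
    if op in FPGA_OPS:
        fpga_execute(ctx, op)
    else:
        ctx.host = "software"      # hand the context to the software host
        ctx.transplants += 1
        software_execute(ctx, op)
        ctx.host = "fpga"          # return the context to the FPGA host
    return ctx


def run(program):
    ctx = CpuContext()
    for op in program:
        step(ctx, op)
    return ctx
```

For example, `run(["add", "io_write", "load", "fp_div"])` executes two operations on the FPGA path and transplants twice.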


Virtualized Multiprocessor Simulation

• Problem: large-scale simulation configurations are challenging to implement in FPGAs using structurally-accurate approaches

• Structural accuracy: 1-to-1 mapping between target and host CPUs

– Pros: fastest possible solution, only 10x slower than real HW

– Cons: difficult to build for large-scale configs (e.g., >100-way)

[Figure: # processors in target model vs. # host processors implemented in FPGA; 1-to-1 mapping, 10x slower than real HW.]

Virtualized Multiprocessor Simulation

Advantages:

• Decouple logical target system size from FPGA host size

• Scale FPGA host as-needed to deliver required performance

• High target-to-host ratio (TH) simplifies/consolidates HW (e.g., fewer # nodes in cache coherence, interconnect)

# processors in target model

• Host interleaving: multiplex target processors onto fewer FPGA-hosted processors

[Figure: # processors in target model vs. # host “engines” implemented in FPGA; 4-to-1 mapping, 40x slower than real HW.]
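The two data points on these slides (1-to-1 at 10x slower than real HW, 4-to-1 at 40x slower) are consistent with slowdown growing in proportion to the target-to-host ratio. The sketch below simply encodes that proportionality; treating it as linear beyond those two points is an assumption made here for illustration, and the function name is hypothetical.

```python
import math

# From the slides: a 1-to-1 FPGA host runs about 10x slower than real HW.
BASE_SLOWDOWN = 10


def estimated_slowdown(target_cpus, host_engines, base=BASE_SLOWDOWN):
    """Estimate slowdown vs. real HW, assuming it scales linearly with TH ratio."""
    if target_cpus < 1 or host_engines < 1:
        raise ValueError("need at least one target CPU and one host engine")
    # Target-to-host ratio: how many target CPUs share each host engine.
    th_ratio = math.ceil(target_cpus / host_engines)
    return base * th_ratio
```

Under this assumption, a 16-CPU target on a single host engine (TH = 16) would be estimated at 160x slower than real HW, while adding host engines lowers the ratio and recovers speed.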

What’s inside an FPGA host processor?

• An “engine” that architecturally executes multiple contexts

– Existing multithreaded designs are good candidates

– Choice is influenced by TH ratio (target-to-host ratio)

• We propose an interleaved pipeline (e.g., TERA-style)

– Best suited for high TH ratio

– Switch in new CPU context on each cycle

– Simple, efficient design w/ no stalling or forwarding

– Long-latency tolerance (e.g., cache miss, transplants)

– Coherence is “free” between CPUs mapped onto same engine

[Figure: multiple target CPUs mapped onto one host CPU.]
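A minimal sketch of the context-selection idea behind an interleaved pipeline: each cycle, the next ready context in round-robin order issues, and a context waiting on a long-latency event (a cache miss or a transplant) is skipped so the others keep the pipeline busy. The function and its `blocked_until` parameter are hypothetical and model only the selection policy, not the BlueSPARC hardware.

```python
def schedule(num_contexts, cycles, blocked_until=None):
    """Return the context issued on each cycle (None means no context was ready).

    blocked_until maps a context id to the first cycle at which it is ready
    again; contexts not listed are always ready.
    """
    blocked_until = blocked_until or {}
    issued = []
    next_ctx = 0  # round-robin pointer
    for cycle in range(cycles):
        choice = None
        # Scan all contexts once, starting at the round-robin pointer.
        for offset in range(num_contexts):
            ctx = (next_ctx + offset) % num_contexts
            if blocked_until.get(ctx, 0) <= cycle:
                choice = ctx
                break
        issued.append(choice)
        if choice is not None:
            next_ctx = (choice + 1) % num_contexts
    return issued
```

With four ready contexts the issue order is simply 0, 1, 2, 3, 0, 1, ...; if context 1 is blocked until cycle 4, the selector skips it and the other three contexts fill those slots.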


Implementation: BlueSPARC simulator

[Figure: target and host systems.

Target: 16-CPU shared-memory UltraSPARC III server (SunFire 3800) with CPUs, memory, MMU, DMA, graphics, NIC, SCSI, and terminal.

Host: BEE2 platform (Xilinx XC2VP70) with DDR2 memory, an interleaved pipeline holding 16 CPU contexts, and a PowerPC, connected to Simics on a PC, which hosts the simulated I/O devices.]

BlueSPARC Simulator (continued)

Processing nodes: 16 64-bit UltraSPARC III contexts; 14-stage instruction-interleaved pipeline

L1 caches: split I/D, 64KB, 64B, direct-mapped, writeback; non-blocking loads/stores; 16-entry MSHR, 4-entry store buffer

Clock frequency: 90MHz on Xilinx V2P70

Main memory: 4GB total

Resources (Xilinx V2P70): 33,508 LUTs (50%), 222 BRAMs (67%) w/o stats+debug; 43,206 LUTs (65%), 238 BRAMs (72%) with stats+debug

Instrumentation: all internal state fully traceable; attachable to FPGA-based CMP cache simulator*

EDA tools: Xilinx EDK 9.2i, Bluespec SystemVerilog

Statistics: 25K lines Bluespec, 511 rules, 89 module types

Checkpointing: fully compatible with Simics checkpoints; can load AND generate checkpoints
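As a worked example of the L1 cache parameters listed above, the sketch below derives the line count and address-field widths for a 64KB direct-mapped cache. It assumes that "64B" denotes the cache line size, which the slide does not state explicitly; the helper names are hypothetical.

```python
def cache_geometry(size_bytes, line_bytes, ways=1):
    """Return (num_lines, num_sets, offset_bits, index_bits) for a cache."""
    num_lines = size_bytes // line_bytes
    num_sets = num_lines // ways
    # Sizes are powers of two, so bit_length() - 1 equals log2.
    offset_bits = line_bytes.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    return num_lines, num_sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, index, offset) fields."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset
```

Under that assumption, `cache_geometry(64 * 1024, 64)` gives 1024 lines, a 6-bit offset, and a 10-bit index.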

BlueSPARC host microarchitecture

[Figure: BlueSPARC host pipeline, stages numbered 1 to 14. Labeled blocks: Context Selector; PC and state (x16); I-TLB, 16-entry direct-mapped (x16); I-TLB, 128-entry 2-way (x16); 64KB I-cache; Decode; RegFile; ALU1; ALU2; D-TLB, 16-entry fully-associative; D-TLB, 512-entry 2-way (x16); 64KB D-cache; Trap Unit; Writeback Unit; Assist Unit; Transplant Unit; PowerPC405 (transplant service processor). Legend: normal pipeline stage, multi-context state, transplant support unit.]

64-bit ISA, SW-visible MMU, and complex memory lead to a high # of pipeline stages

Hybrid host partitioning choices

BlueSPARC (FPGA), on-chip

• add/sub/shift/logical

• multiply/divide

• register windows

• 38/103 SPARC ASIs

• interprocessor x-calls

• device interrupts

• I-/D-MMU + tlb miss

• Loads/stores/atomics

• VIS block memory

Micro-transplant (on-chip simulation, PowerPC405)

• 65/103 SPARC ASIs

• VIS I/II multimedia

• FP add/sub/mul/div + traps

• FP/INT conversion

• trap on integer arithmetic

• alignment

• fixed-point arithmetic

• tlb/cache diagnostics

• tlb demap

Transplant (off-chip simulation, Simics on PC)

• PCI bus

• ISP2200 Fibre Channel

• I21152 PCI bridge

• IRQ bus

• Fibre Channel SCSI disk/cdrom

• Text Console

• SBBC PCI device

• Serengeti I/O PROM

• Cheerio-hme NIC

• SCSI bus
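The three-way partitioning above amounts to a lookup from a target behavior to the host tier that services it. The sketch below encodes a handful of the listed entries to show the shape of such a table; the key names and the `tier_counts` helper are hypothetical, and only a subset of the slide's entries is included.

```python
from collections import Counter
from enum import Enum


class Tier(Enum):
    FPGA = "BlueSPARC pipeline (FPGA)"
    MICRO_TRANSPLANT = "micro-transplant (on-chip PowerPC405)"
    TRANSPLANT = "transplant (off-chip, Simics on PC)"


# A few entries taken from the partitioning lists; key names are illustrative.
PARTITION = {
    "integer_alu": Tier.FPGA,
    "multiply_divide": Tier.FPGA,
    "register_windows": Tier.FPGA,
    "load_store_atomic": Tier.FPGA,
    "fp_arithmetic": Tier.MICRO_TRANSPLANT,
    "vis_multimedia": Tier.MICRO_TRANSPLANT,
    "tlb_demap": Tier.MICRO_TRANSPLANT,
    "pci_bus": Tier.TRANSPLANT,
    "scsi_bus": Tier.TRANSPLANT,
    "text_console": Tier.TRANSPLANT,
}


def tier_for(behavior):
    """Look up which host tier services a behavior (KeyError if unlisted)."""
    return PARTITION[behavior]


def tier_counts(trace):
    """Count how often each tier is used by a trace of behaviors."""
    return Counter(tier_for(b) for b in trace)
```

Counting tiers over an instruction trace in this way is one simple means of checking that the frequent behaviors really do land on the FPGA tier.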

Performance

• Perf comparable to Simics-fast

• 39x speedup on average over Simics-trace

[Figure: bar chart comparing BlueSPARC (90MHz), Simics-fast (2.0GHz Core 2 Duo), and Simics-trace; axis from 0 to 70.]


Design experiences

2007 Timeline

January-February: initial virtualization ideas; analysis + simulation of interleaving; ISA profiling of apps for hybrid partitioning; initial specifications for host pipeline

March: Simics API wrappers + software experiments

April-November: BlueSPARC RTL development; validation tools

November-December: host performance instrumentation and writeup*

* To appear in FPGA’08

Design experiences (cont)

• What was important:

– Developing effective validation strategies (more on next slide)

– Existing reference model (Simics) to study and compare against

– Efficient mapping of state to FPGA resources (e.g., 16 PCs mapped to 16-bit LUT-based distributed RAM)

– Coping with long Xilinx builds by easing up on timing constraints

– “Judicious” Bluespec

• What was NOT important:

– Meeting 100MHz timing for every Xilinx build (i.e., deep pipelining)

– Implementing every functionality as efficiently/fast as possible
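The state-mapping example above (16 PCs in 16-bit LUT-based distributed RAM) can be checked with simple arithmetic. The sketch assumes each LUT is used as a single-port 16x1 RAM and that a PC is 64 bits wide, as in a 64-bit ISA; multi-ported storage would need more LUTs. The function name is hypothetical.

```python
import math

# Assumption: one LUT configured as distributed RAM stores 16 entries x 1 bit.
LUT_RAM_DEPTH = 16


def lut_ram_count(entries, width_bits, depth=LUT_RAM_DEPTH):
    """Number of LUTs needed for a single-port RAM of entries x width_bits."""
    banks = math.ceil(entries / depth)  # depth-wise replication
    return banks * width_bits           # one LUT per bit per bank
```

With one entry per context, the 16 contexts fill the 16-deep LUT RAM exactly, so `lut_ram_count(16, 64)` gives 64 LUTs for the whole set of PCs.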

Validation

• THE most challenging aspect of this project

• Strategies used

– Auto-generated torture tests + hand-written test cases

– Auto-port test-cases from OpenSPARC T1 framework to UltraSPARC III

– Validated single-threaded + multithreaded ISA execution against Simics (both in Verilog Simulations and in FPGA)

– Flight data recorder for non-deterministic interleaving of CPUs

– Batched Verilog simulations w/ varying parameters

– Validated non-blocking memory system against “shadow” flat memories during Verilog simulation (caught self-modifying code bugs)

– > 200 synthesizable assertions to ChipScope

– Built-in deadlock/error detectors
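The "shadow" flat memory strategy listed above can be illustrated in software: every store is mirrored into a trivially correct flat memory, and every load from the memory system under test is compared against it. The sketch below is an assumption-laden toy, not the project's Verilog testbench; `WritebackCache` is a tiny hypothetical direct-mapped model with one word per line, and `DroppedWritebackCache` injects a bug so the checker has something to catch.

```python
class WritebackCache:
    """Toy direct-mapped write-back cache (one word per line)."""

    def __init__(self, num_lines=4):
        self.num_lines = num_lines
        self.lines = {}    # index -> (addr, value, dirty)
        self.backing = {}  # main memory behind the cache

    def _evict_if_conflict(self, addr):
        line = self.lines.get(addr % self.num_lines)
        if line and line[0] != addr and line[2]:
            self.backing[line[0]] = line[1]  # write dirty victim back

    def store(self, addr, value):
        self._evict_if_conflict(addr)
        self.lines[addr % self.num_lines] = (addr, value, True)

    def load(self, addr):
        idx = addr % self.num_lines
        line = self.lines.get(idx)
        if line and line[0] == addr:
            return line[1]                   # hit
        self._evict_if_conflict(addr)
        value = self.backing.get(addr, 0)    # miss: fill from main memory
        self.lines[idx] = (addr, value, False)
        return value


class DroppedWritebackCache(WritebackCache):
    """Injected bug: dirty victims are silently discarded."""

    def _evict_if_conflict(self, addr):
        pass


class ShadowChecked:
    """Wrap a memory system and check every load against a flat shadow copy."""

    def __init__(self, memory):
        self.memory = memory
        self.shadow = {}

    def store(self, addr, value):
        self.shadow[addr] = value
        self.memory.store(addr, value)

    def load(self, addr):
        got = self.memory.load(addr)
        want = self.shadow.get(addr, 0)
        if got != want:
            raise AssertionError(f"mismatch at {addr:#x}: got {got}, want {want}")
        return got
```

Two stores that conflict in the same cache line followed by a load of the first address pass with the correct cache and raise an `AssertionError` with the buggy one.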

In retrospect…

• What I would have done differently to begin with

– Write entire USIII functional model myself in software first

– Take more advantage of Verilog PLI for validation (interface to C)

– Don’t over-engineer HDL

– Don’t upgrade tools unless necessary (e.g., trial license runs out)

– Validation infrastructure w/ batching capabilities (do earlier!)

– Automated “binary search” tool for bug hunting

– Re-write DDR2 Async FIFOs without BRAMs

– Fast memory checkpoint loader (3GB images per run = 25 min)

– Simple, correct >> Fast, buggy
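One way to picture the automated “binary search” tool mentioned above is a bisection over simulated cycles that finds the first cycle at which the design under test disagrees with the reference model. The sketch assumes deterministic replay and that, once the two states diverge, they stay diverged; `ref_state_at` and `dut_state_at` are hypothetical callbacks returning a comparable state snapshot for a given cycle.

```python
def first_divergence(ref_state_at, dut_state_at, last_cycle):
    """Return the first cycle where the two states differ, or None if none do.

    Assumes the states match up to some cycle and differ at every cycle after.
    """
    if ref_state_at(last_cycle) == dut_state_at(last_cycle):
        return None
    lo, hi = 0, last_cycle  # invariant: states differ at hi
    while lo < hi:
        mid = (lo + hi) // 2
        if ref_state_at(mid) == dut_state_at(mid):
            lo = mid + 1    # still matching: the bug is later
        else:
            hi = mid        # already diverged: the bug is here or earlier
    return lo
```

Each probe corresponds to one replay up to the chosen cycle, so a run of N cycles needs about log2(N) replays rather than a linear scan.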

Future Work

• Scalability

– Burden-of-proof for 1000-way simulation?

– Investigate cache-coherence/interconnect mechanisms for combining multiple interleaved pipelines

• Virtualization design spaces

– On-chip storage virtualization (e.g., architectural state)

– Memory + disk capacity (e.g., HW-based demand paging?)

– Virtualizing instrumentation (e.g., paging functional cache tags)

• Fast instrumentation tools

– Understanding systems at multiple levels of abstraction (beyond ISA)

– Validation+analysis: beyond ISA, how to sanity-check app+sys behavior?

BlueSPARC Demo on BEE2

• Demo application

– On-Line Transaction Processing benchmark (TPC-C) in Oracle

– Runs in Solaris 8 (unmodified binary)

– FPGA + Memory directly loaded from Simics checkpoint

[Figure: BEE2 platform, with labels: 4 DDR2 controllers + 4 GB memory; Ethernet (to Simics on PC); Virtex-II Pro 70 (PowerPC & BlueSPARC); RS232 (debugging).]

Conclusion

• “Build-all” simulation approach in FPGAs is challenging

• Two virtualization techniques for reducing complexity

– Hybrid: attain full-system support by deferring rare behaviors to SW

– Virtualized MP: decouples target system size from host size

• BlueSPARC proof-of-concept

– Models 16-CPU UltraSPARC III server

– Comparable perf to Simics-fast, 39x on avg faster than Simics-trace

• Thanks! Questions? echung@ece.cmu.edu

• PROTOFLEX (http://www.ece.cmu.edu/~simflex/protoflex.html)
