OpenSPARC – An Open Platform for Hardware Reliability ...

OpenSPARC – An Open Platform for Hardware Reliability Experimentation. Ishwar Parulkar and Alan Wood (Sun Microsystems, Inc.); James C. Hoe and Babak Falsafi (Carnegie Mellon University); Sarita V. Adve and Josep Torrellas (University of Illinois at Urbana-Champaign); Subhasish Mitra (Stanford University). IEEE SELSE 4, March 26, 2008.

Page 1: OpenSPARC – An Open Platform for Hardware Reliability ...

OpenSPARC – An Open Platform for Hardware Reliability Experimentation

Ishwar Parulkar and Alan Wood, Sun Microsystems, Inc.

James C. Hoe and Babak Falsafi, Carnegie Mellon University

Sarita V. Adve and Josep Torrellas, University of Illinois at Urbana-Champaign

Subhasish Mitra, Stanford University

IEEE SELSE 4 - March 26, 2008

Page 2: OpenSPARC – An Open Platform for Hardware Reliability ...

www.OpenSPARC.net

Outline

1. Chip Multi-threading (CMT)
2. OpenSPARC T2 and T1 processors
3. Reliability in OpenSPARC processors
4. What is available in OpenSPARC
5. Current university research using OpenSPARC
6. Future research directions

Page 3: OpenSPARC – An Open Platform for Hardware Reliability ...

World's First 64-bit Open Source Microprocessor

OpenSPARC.net

Governed by GPLv2

Complete processor architecture & implementation
• Register Transfer Level (RTL)
• Hypervisor API
• Verification suite and architectural models
• Simulation model for operating system bringup on s/w

Page 4: OpenSPARC – An Open Platform for Hardware Reliability ...

Chip Multithreading (CMT)

Page 5: OpenSPARC – An Open Platform for Hardware Reliability ...

Memory Bottleneck

[Figure: relative performance (log scale, 1 to 10,000) versus year (1980 to 2005) for CPU frequency and DRAM speeds, with the widening gap between the two curves marked. CPU: 2x every 2 years. DRAM: 2x every 6 years. Source: Sun World Wide Analyst Conference, Feb. 25, 2003]
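The two growth rates quoted on this slide (CPU 2x every 2 years, DRAM 2x every 6 years) imply an exponentially widening gap. The sketch below works out that arithmetic; the function names and the 1980 starting point (taken from the chart's x-axis) are illustrative choices, and it assumes smooth exponential growth rather than the measured data behind the chart.

```python
def growth(years: float, doubling_period_years: float) -> float:
    """Relative performance after `years`, starting from 1.0."""
    return 2.0 ** (years / doubling_period_years)


def cpu_dram_gap(years: float) -> float:
    """Ratio of CPU to DRAM relative performance after `years`.

    Uses the slide's rates: CPU doubles every 2 years, DRAM every 6 years.
    """
    return growth(years, 2.0) / growth(years, 6.0)


if __name__ == "__main__":
    # Assumed baseline: both curves start at 1.0 in 1980.
    for year in (1985, 1995, 2005):
        print(year, round(cpu_dram_gap(year - 1980), 1))
```

Under these assumptions the gap itself doubles every 3 years, since 1/2 - 1/6 = 1/3 doublings per year.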

Page 6: OpenSPARC – An Open Platform for Hardware Reliability ...

Single Threaded Performance

Single threading: HURRY UP AND WAIT!

[Figure: timeline of one thread alternating between compute (C) and memory latency (M) phases]

Typical processor utilization: 15–25%

Up to 85% of cycles spent waiting for memory

Page 7: OpenSPARC – An Open Platform for Hardware Reliability ...

The Power of CMT

Single Threaded Performance vs. Chip Multi-threaded (CMT) Performance

[Figure: timelines of four threads (Thread 1 to Thread 4), each alternating between compute (C) and memory latency (M) phases over time]

UltraSPARC T1 core processor utilization: up to 85%
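A back-of-the-envelope model shows why interleaving threads raises utilization. It is a simplified sketch, not Sun's performance model: it assumes every thread repeats the same compute/memory pattern, that switching threads is free, and that memory requests never contend. The 1:4 compute-to-memory ratio is an assumed value chosen to land in the 15–25% single-thread range quoted earlier.

```python
def core_utilization(compute_cycles: int, memory_cycles: int, threads: int) -> float:
    """Fraction of cycles in which the core does useful work.

    Each thread computes for `compute_cycles`, then waits `memory_cycles`.
    With ideal interleaving, other threads fill the waiting time, so the
    utilization scales with the thread count until the core saturates.
    """
    period = compute_cycles + memory_cycles
    return min(1.0, threads * compute_cycles / period)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "threads:", core_utilization(1, 4, n))
```

In this idealized model four threads lift a 20% utilization to 80%; real cores fall short of the ideal because of pipeline and cache sharing.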

Page 8: OpenSPARC – An Open Platform for Hardware Reliability ...

Chip Multi-Threading (CMT)

CMP (chip multiprocessing): n cores per processor

HMT (hardware multithreading): m threads per core

CMT (chip multithreading): n x m threads per processor
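The n x m relationship can be checked against the core and thread counts given elsewhere in this deck (8 cores with 64 threads for T2 and 32 threads for T1). The `CmtProcessor` class is a hypothetical helper for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CmtProcessor:
    """A chip multithreaded processor: n cores, m hardware threads per core."""
    name: str
    cores: int
    threads_per_core: int

    @property
    def hardware_threads(self) -> int:
        # CMT = CMP x HMT: n cores times m threads per core.
        return self.cores * self.threads_per_core


t1 = CmtProcessor("UltraSPARC T1", cores=8, threads_per_core=4)
t2 = CmtProcessor("UltraSPARC T2", cores=8, threads_per_core=8)

if __name__ == "__main__":
    for cpu in (t1, t2):
        print(cpu.name, cpu.hardware_threads)
```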

Page 9: OpenSPARC – An Open Platform for Hardware Reliability ...

CMT Paradigm Shift!

CMT technology allows simple, compact system designs, which deliver:
> Higher reliability
> Better performance
> Lower cost
> Faster installation
> More efficient energy use
> Lower HVAC cost
> Faster time-to-repair
> ... and more

Everybody has changed to multi-core (CMP) and/or chip multi-threaded (CMT) processors: Sun (CMT), IBM (CMT), Intel (CMP), AMD (CMP)

Page 10: OpenSPARC – An Open Platform for Hardware Reliability ...

UltraSPARC T2 and T1 CMT Processors

Page 11: OpenSPARC – An Open Platform for Hardware Reliability ...

UltraSPARC T2 Die Photo

8 SPARC cores, 8 threads each

Shared 4MB L2, 8 banks, 16-way associative

Four dual-channel FBDIMM memory controllers

Two 10/1 Gb Enet ports w/onboard packet classification and filtering

One PCI-E x8 port

Cryptographic coprocessor on chip

1831 pins, 711 signal I/O

342 mm² die in 65 nm
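The L2 figures above (4MB, 8 banks, 16-way associative) determine the cache geometry once a line size is fixed. The sketch below assumes a 64-byte line, which is not stated on the slide; `l2_geometry` is a hypothetical helper.

```python
def l2_geometry(capacity_bytes: int, banks: int, ways: int, line_bytes: int):
    """Return (bytes per bank, sets per bank) for a banked set-associative cache."""
    bank_bytes = capacity_bytes // banks
    sets_per_bank = bank_bytes // (ways * line_bytes)
    return bank_bytes, sets_per_bank


# Slide values: 4MB shared L2, 8 banks, 16-way associative.
# Assumption (not on the slide): 64-byte cache lines.
bank_bytes, sets_per_bank = l2_geometry(4 * 1024 * 1024, banks=8, ways=16, line_bytes=64)

if __name__ == "__main__":
    print(bank_bytes // 1024, "KB per bank,", sets_per_bank, "sets per bank")
```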

Page 12: OpenSPARC – An Open Platform for Hardware Reliability ...

UltraSPARC T2 Block Diagram

Page 13: OpenSPARC – An Open Platform for Hardware Reliability ...

UltraSPARC T2

Page 14: OpenSPARC – An Open Platform for Hardware Reliability ...

UltraSPARC T2 Reliability

• Extensive error detection and correction
  > Parity protection on I$, D$ tags and data, ITLB, DTLB, CAM and data, modular arithmetic, store address buffer
  > ECC on integer RF, floating point RF, store data buffer, trap stack, L2$ and other internal arrays
• Combination of hardware and software correction flows
  > Hardware re-fetch for I$ and D$
  > Software recovery for other errors
  > Offlining of a thread, group of threads or physical core
• Hardware error injection for verification
• Selective disabling of detection and reporting for bringup
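The difference between parity (detect only) and ECC (detect and correct) can be illustrated with a toy example. The sketch below uses a single even-parity bit and a textbook Hamming(7,4) code; these are stand-ins chosen for brevity, not the codes actually used in the UltraSPARC T2 arrays, and all function names are hypothetical.

```python
from typing import Tuple


def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if `word` has an odd number of 1 bits."""
    return bin(word).count("1") % 2


def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit codeword (bit i holds position i+1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))


def hamming74_decode(codeword: int) -> Tuple[int, int]:
    """Return (corrected nibble, flipped position 1..7, or 0 if no error)."""
    bits = [(codeword >> i) & 1 for i in range(7)]
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s4 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s4 << 2)
    if syndrome:
        bits[syndrome - 1] ^= 1  # the syndrome is the position of the bad bit
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, syndrome


if __name__ == "__main__":
    word = 0b1011
    # Parity notices a single flipped bit but cannot say which one.
    print("parity mismatch:", parity_bit(word) != parity_bit(word ^ 0b0100))
    # ECC locates and repairs it.
    print("ecc decode:", hamming74_decode(hamming74_encode(word) ^ (1 << 5)))
```

This mirrors the split described above: parity-protected structures such as the I$ can recover by re-fetching clean data, while ECC-protected structures hold the only copy of their data and need correction.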

Page 15: OpenSPARC – An Open Platform for Hardware Reliability ...

UltraSPARC T2 Reliability: Faster Can Be Cooler (1)

[Figure: thermal map of a single-core processor (not to scale), with labels C1 to C8 and a temperature scale from 58C to 107C]

Page 16: OpenSPARC – An Open Platform for Hardware Reliability ...

UltraSPARC T2 Reliability: Faster Can Be Cooler (2)

[Figure: thermal maps of a single-core processor and the T2 processor with cores C1 to C8 (not to scale), with a temperature scale from 58C to 107C]

Page 17: OpenSPARC – An Open Platform for Hardware Reliability ...

OpenSPARC

Page 18: OpenSPARC – An Open Platform for Hardware Reliability ...

OpenSPARC Communities

• Chip Designers: SoC designs, hard macros, telecom applications
• Hardware IP Suppliers: PCI cores, SERDES etc.
• EDA Vendors: benchmarking, reference flow, FPGA, emulation, verification, physical design, multi-threaded tools
• CMT Tools: compilers, threading, optimization, performance analysis
• Academia/Universities: architecture, ISA, VLSI course work; threading, scaling, parallelization; benchmarks
• Operating Systems: OpenSolaris, Linux, BSD variants, embedded OSs

Page 19: OpenSPARC – An Open Platform for Hardware Reliability ...

What's Available in OpenSPARC

1. Chip design and verification
• UltraSPARC Architecture 2005 spec
• UltraSPARC T2/T1 implementation spec
• Full RTL (Verilog) of OpenSPARC T2/T1 (8 cores, 64/32 threads; more than 4 million lines of code!)
• Verification test suites
• Full OpenSPARC simulation environment
• Synthesis scripts for RTL
• FPGA implementation support
  > Reduced (to fit capacity), synthesizable version of RTL
  > Synplicity scripts for FPGA synthesis

Page 20: OpenSPARC – An Open Platform for Hardware Reliability ...

What's Available in OpenSPARC

2. Architecture and performance modeling
• SAM – SPARC Architectural Model (including source code)
• Legion – Instruction accurate simulator (incl. source code)
• OBP – Open Boot PROM source code
• Hypervisor source code
• Solaris images for simulation
• RST Trace Tool – trace format for SPARC instruction-level traces

Page 21: OpenSPARC – An Open Platform for Hardware Reliability ...

What's Available in OpenSPARC

3. Tools for tuning and debug
• ATS – Binary reoptimization and recompilation tool for tuning and troubleshooting applications
• Corestat – Online monitoring of core and FPU utilization
• Discover – Runtime detection of programming errors in allocating and using program memory
• Thread Analyzer – Checking of multi-threaded programming errors such as data races and deadlocks
• More...

Page 22: OpenSPARC – An Open Platform for Hardware Reliability ...

What's Available in OpenSPARC

4. Tools for software developers
• Sun Studio 12 – C, C++, Fortran compilers for Solaris/Linux combined with NetBeans, etc.
• BIT – Binary Improvement Tool analyzes and optimizes SPARC binaries for performance and code coverage
• SPOT – produces detailed report on conditions that impact performance of an application
• Source code analysis tool to identify incompatible APIs between Solaris and Linux to speed up migration
• More...

Page 23: OpenSPARC – An Open Platform for Hardware Reliability ...

University research in hardware reliability using OpenSPARC

Page 24: OpenSPARC – An Open Platform for Hardware Reliability ...

Architectural Fingerprints

Problem: Error detection for the processor pipeline (soft, wearout, …)

Solution: Architectural fingerprints
• Summarize retiring architectural updates into compact hash (regs, stores)
• Periodically compare hash with reference (another core, previous execution)

Results:
• Multithreaded OpenSPARC T1 RTL implementation: less than 4% area overhead
• Scalable to wide-issue superscalar BW
• Soft fault injection: effective detection for errors propagated to arch. state

[Figure: pipeline diagram (Decode, Ex, Mem, Writeback; ALU, D-Cache, RegFile x4, Store Buffer, to L2) with fingerprint logic (Hash, Queue, FP Match, Compare)]

[Figure: bar chart of the fraction of architectural errors (0.0 to 1.0) for units byp, exu, fcl, fdp, lsu, swl, tlu and the full SPARC core, classified as Silent Data Corruption, Hang, or Loop]

Prof. Hoe and Prof. Falsafi @ Carnegie Mellon University
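The fingerprinting idea on this slide, hashing retiring architectural updates and periodically comparing against a reference, can be sketched in software. This is a minimal illustration, not the CMU hardware design: CRC-32 stands in for whatever hash the RTL uses, and the update record format, the `interval` parameter, and the function names are all assumptions.

```python
import zlib
from typing import Iterable, List, Optional, Tuple

# Hypothetical record of one retired update:
# (kind, register index or store address, value written)
Update = Tuple[str, int, int]


def fingerprint(updates: Iterable[Update], seed: int = 0) -> int:
    """Fold a stream of architectural updates into one 32-bit hash."""
    crc = seed
    for kind, target, value in updates:
        crc = zlib.crc32(f"{kind}:{target}:{value}".encode(), crc)
    return crc


def first_mismatch(run_a: List[Update], run_b: List[Update], interval: int) -> Optional[int]:
    """Compare fingerprints every `interval` updates.

    Returns the index of the first interval whose fingerprints differ,
    or None if the two executions agree throughout.
    """
    for start in range(0, max(len(run_a), len(run_b)), interval):
        fp_a = fingerprint(run_a[start:start + interval])
        fp_b = fingerprint(run_b[start:start + interval])
        if fp_a != fp_b:
            return start // interval
    return None


if __name__ == "__main__":
    golden = [("reg", i % 8, i * 3) for i in range(32)]
    faulty = list(golden)
    kind, target, value = faulty[21]
    faulty[21] = (kind, target, value ^ 0b100)  # inject a single bit flip
    print("mismatching interval:", first_mismatch(golden, faulty, interval=8))
```

Only the compact hash crosses between the two executions, which is why the comparison bandwidth stays small; a shorter interval trades more comparisons for lower detection latency.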

Page 25: OpenSPARC – An Open Platform for Hardware Reliability ...

FIRST – Detecting Emerging Wearout Faults

Problem: Detecting device wearout during soft breakdown stage
• Faults initially hidden by guardbands & masking

Solution: Periodically test processor cores for signs of growing wearout
• Reduce freq./voltage guardbands until marginal
• Test w/ Arch. or μArch. fingerprints
• Observe fails at incr. conservative conditions

Results: Wearout fault injection in OpenSPARC
• Arch. and μArch. fingerprints equivalent for wide-spread wearout
• μArch. needed for isolated wearout

[Figure: fraction of fails detected (0 to 1) versus stress past guardband (0 to 200 ps), with curves for Arch, μArch, and Timeout]

Prof. Hoe and Prof. Falsafi @ Carnegie Mellon University
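The test procedure above, eating into the guardband until failures appear and watching whether they appear at increasingly conservative conditions, can be sketched with a toy timing model. Everything here is an assumption for illustration: paths are reduced to a single slack number in picoseconds, a path fails once the applied stress exceeds its slack, and the path names are made up.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def first_failing_stress(
    path_slack_ps: Dict[str, float],
    stress_steps_ps: Iterable[float],
) -> Optional[Tuple[float, List[str]]]:
    """Increase stress step by step until at least one path fails.

    Returns (stress, failing path names), or None if no path fails
    within the tested range.
    """
    for stress in stress_steps_ps:
        failing = sorted(p for p, slack in path_slack_ps.items() if slack < stress)
        if failing:
            return stress, failing
    return None


if __name__ == "__main__":
    steps = range(0, 201, 50)
    fresh = {"exu_add": 180.0, "lsu_tag": 220.0}  # hypothetical paths
    aged = {"exu_add": 90.0, "lsu_tag": 220.0}    # wearout slowed one path
    print("fresh:", first_failing_stress(fresh, steps))
    print("aged: ", first_failing_stress(aged, steps))
```

A failure that shows up at a smaller stress than in earlier test rounds is the signal of growing wearout, long before the path fails under normal operating conditions.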

Page 26: OpenSPARC – An Open Platform for Hardware Reliability ...

SWAT – SoftWare Anomaly Treatment

Motivation

Low cost solutions needed for in-field detection, diagnosis, recovery and repair for failures due to aging, soft errors, inadequate burn-in, design defects, …

SWAT Framework Components
• Detection: Software symptoms, minimal backup hardware
• Recovery: Software/hardware checkpoint and rollback
• Diagnosis: Firmware-controlled rollback/replay on multicore
• Repair/reconfiguration: Redundant, reconfigurable hardware

[Figure: timeline with two checkpoints, showing Fault, Error, and Symptom detected, followed by Recovery, Diagnosis, and Repair. Annotations: "Always-on, zero or low cost" and "May have high overhead, rarely invoked"]

Prof. S. Adve, V. Adve and Y. Zhou @ University of Illinois at U-C

Page 27: OpenSPARC – An Open Platform for Hardware Reliability ...

SWAT – Status and Ongoing Work

Status
• Detection techniques with > 95% coverage for most structures [ASPLOS’08, SELSE’08, DSN’08]
• Microarchitecture level, firmware-driven diagnosis with > 97% coverage [SELSE’08, DSN’08]
• So far, used microarchitecture-level fault injection in simulation

Ongoing/future work with OpenSPARC
• Gate-level fault modeling
• Hypervisor implementation

Prof. S. Adve, V. Adve and Y. Zhou @ University of Illinois at U-C

Page 28: OpenSPARC – An Open Platform for Hardware Reliability ...

SWAT – Ongoing Work: High-level fault models and validation

Goals
• Understand how gate level faults propagate to microarch & s/w
• Abstract fault models at microarchitecture level
• Evaluate reliability solutions and validate results

Methodology
• Perform fault injections at gate level
• For better simulation speed: hierarchical integration of microarchitecture level full system simulator with lower-level simulation of faulty unit
• Using OpenSPARC Verilog model

Prof. S. Adve, V. Adve and Y. Zhou @ University of Illinois at U-C
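Gate-level fault injection, as named in the methodology above, can be illustrated on a circuit far smaller than OpenSPARC. The sketch below injects stuck-at faults into a one-bit full adder and measures how often each fault reaches an output. The circuit, the net names, and the stuck-at fault model are illustrative assumptions, not the SWAT infrastructure.

```python
from itertools import product
from typing import Optional, Tuple

Fault = Optional[Tuple[str, int]]  # (net name, stuck-at value) or None


def full_adder(a: int, b: int, cin: int, stuck: Fault = None) -> Tuple[int, int]:
    """Gate-level full adder; `stuck` forces one named net to a fixed value."""
    def net(name: str, value: int) -> int:
        return stuck[1] if stuck and stuck[0] == name else value

    x1 = net("x1", a ^ b)
    a1 = net("a1", a & b)
    a2 = net("a2", x1 & cin)
    s = net("sum", x1 ^ cin)
    cout = net("cout", a1 | a2)
    return s, cout


def propagation_rate(stuck: Fault) -> float:
    """Fraction of input vectors for which the fault corrupts an output."""
    vectors = list(product((0, 1), repeat=3))
    visible = sum(full_adder(*v, stuck=stuck) != full_adder(*v) for v in vectors)
    return visible / len(vectors)


if __name__ == "__main__":
    for fault in (("x1", 0), ("a1", 1), ("a2", 0)):
        print(fault, propagation_rate(fault))
```

Faults that are activated but do not reach an output are masked; a higher-level fault model has to reproduce this masking behavior to give valid results.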

Page 29: OpenSPARC – An Open Platform for Hardware Reliability ...

SWAT – Future Work: Hypervisor implementation

Plan to use OpenSPARC hypervisor to prototype and evaluate firmware part of SWAT

Methodology
• Leverage, extend interface between hypervisor/hardware and hypervisor/OS
• Extend hypervisor for functionality
• Use for error detection, recovery, diagnosis, repair

Prof. S. Adve, V. Adve and Y. Zhou @ University of Illinois at U-C

Page 30: OpenSPARC – An Open Platform for Hardware Reliability ...

VARIUS – Process Parameter Variation

Problem: Parameter variation in present and future multicore chips

Goals:
• Model parameter variation and resulting timing errors
• Design multicore microarchitectures to detect and tolerate variation-induced errors
• Develop new microarchitectural techniques to mitigate variation and variation-induced errors

Prof. Torrellas @University of Illinois at U-C

Page 31: OpenSPARC – An Open Platform for Hardware Reliability ...

VARIUS – Process Parameter Variation: Accomplishments

VARIUS model of parameter variation and resulting timing errors for microarchitects [TSM08]

ReCycle: Pipeline rebalance under process variation [ISCA07]

Fine-grain adaptive body bias (ABB) to mitigate variation in multicores [MICRO07]

Workload scheduling and DVFS power management in multicores under variation [ISCA08]

Paceline: Core pairing for reliability under process variation [PACT07]

Prof. Torrellas @University of Illinois at U-C

Page 32: OpenSPARC – An Open Platform for Hardware Reliability ...

VARIUS – Process Parameter Variation: Using OpenSPARC

Goal: Get insights into the effect of parameter variation on a real processor
• Measure the distribution of the path delays
• Apply the variation model

Evaluation Flow:
• Synopsys dc_shell-t: compile RTL to gate-level netlist
• Cadence SOCEncounter: floorplan, placement, routing, timing analysis
• Synopsys Primetime: static timing analysis & timing debugging
• Cadence NCSim: simulation

[Figure: evaluation flow from design entry through Compile RTL (dc_shell-t), SOCEncounter, and Primetime to NCSim. Data passed between stages: RTL, timing constraints and library; netlist, timing constraints and physical library; placement, timing report and routing; netlist and timing info]

Prof. Torrellas @ University of Illinois at U-C
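Applying a variation model to measured path delays can be sketched with a small Monte Carlo experiment. This is a deliberately simplified stand-in for the VARIUS model: it assumes each gate delay varies independently with a Gaussian distribution and ignores spatially correlated (systematic) variation. The gate delays, the sigma, and the function names are illustrative assumptions.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def sample_path_delay(nominal_gate_delays_ps: Sequence[float],
                      sigma_fraction: float,
                      rng: random.Random) -> float:
    """One sample of a path's delay with independent per-gate variation."""
    return sum(rng.gauss(d, sigma_fraction * d) for d in nominal_gate_delays_ps)


def timing_error_fraction(nominal_gate_delays_ps: Sequence[float],
                          clock_period_ps: float,
                          sigma_fraction: float,
                          samples: int = 10_000,
                          seed: int = 0) -> Tuple[float, float]:
    """Return (fraction of samples slower than the clock, mean path delay)."""
    rng = random.Random(seed)
    delays: List[float] = [
        sample_path_delay(nominal_gate_delays_ps, sigma_fraction, rng)
        for _ in range(samples)
    ]
    late = sum(d > clock_period_ps for d in delays)
    return late / samples, statistics.mean(delays)


if __name__ == "__main__":
    path = [50.0] * 10  # hypothetical 10-gate path, 500 ps nominal
    for period in (500.0, 525.0, 550.0):
        frac, mean = timing_error_fraction(path, period, sigma_fraction=0.1)
        print(f"period {period} ps: error fraction {frac:.4f}, mean delay {mean:.1f} ps")
```

In a real flow the nominal delays would come from the static timing reports produced by the tools listed above rather than from a hand-written list.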

Page 33: OpenSPARC – An Open Platform for Hardware Reliability ...

CASP – Concurrent Autonomous Chip Self-test using Stored Patterns: Motivation

[Figure: bathtub curve of failure rate versus time, with infant mortality, normal lifetime, and wearout regions. Annotations: "Burn-in difficult" and "Circuit aging dominant"]

Solution: EXTREMELY THOROUGH online self-test

Soft errors: effective techniques exist

Prof. Mitra @ Stanford University

Page 34: OpenSPARC – An Open Platform for Hardware Reliability ...

CASP – Test Flow

1. Test Scheduling: Core 4 selected for test
2. Pre-processing: Prepare core for test; Core 4 temporarily isolated
3. Test Application: Core 4 under test; thorough scan & functional testing; recovery if failed
4. Post-processing: Bring core from test to normal operation; Core 4 resumes operation; schedule test on next core

In every phase, the other cores (Core N) remain in normal operation.

Prof. Mitra @ Stanford University
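The four phases of the test flow can be written down as a small state machine that walks over the cores one at a time. This is a behavioral sketch only: the state and phase names follow the slide, but `casp_round`, the `run_self_test` callback, and the choice to simply record failing cores (instead of modeling recovery) are assumptions.

```python
from enum import Enum
from typing import Callable, List, Tuple


class CoreState(Enum):
    NORMAL = "normal operation"
    SELECTED = "selected for test"
    ISOLATED = "temporarily isolated"
    UNDER_TEST = "under test"


# (phase name, state of the core being tested during that phase)
PHASES = [
    ("test scheduling", CoreState.SELECTED),
    ("pre-processing", CoreState.ISOLATED),
    ("test application", CoreState.UNDER_TEST),
    ("post-processing", CoreState.NORMAL),
]


def casp_round(num_cores: int,
               run_self_test: Callable[[int], bool]) -> Tuple[List[Tuple[int, str]], List[int]]:
    """Test every core once, one core at a time.

    Returns (log of (core, phase) steps, list of cores whose test failed).
    """
    states = [CoreState.NORMAL] * num_cores
    log: List[Tuple[int, str]] = []
    failed: List[int] = []
    for core in range(num_cores):
        for phase, state in PHASES:
            states[core] = state
            if state is CoreState.UNDER_TEST and not run_self_test(core):
                failed.append(core)  # recovery itself is not modeled here
            # All other cores stay in normal operation throughout.
            assert sum(s is not CoreState.NORMAL for s in states) <= 1
            log.append((core, phase))
    return log, failed


if __name__ == "__main__":
    steps, bad = casp_round(8, run_self_test=lambda core: core != 4)
    print(len(steps), "steps; failing cores:", bad)
```

Because only one core leaves normal operation at a time, the remaining cores keep the system running while the test proceeds, which is what makes the self-test concurrent.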

Page 35: OpenSPARC – An Open Platform for Hardware Reliability ...

OpenSPARC Modifications for CASP

[Figure: OpenSPARC block diagram with 8 processor cores (modified for CASP support), cross-bar switch (modified for CASP support), L2 cache, FPU, DRAM control, and Jbus interface, plus a CASP controller with CASP control logic, an on-chip buffer (7.5KB) for scan test data, and CASP off-chip storage (52MB)]

Architectural modifications
➢ Before a core is tested
  ➢ stalling/draining pipeline
  ➢ disabling communication with core under test
  ➢ saving critical state
  ➢ invalidating D$
➢ After a core is tested
  ➢ restoring critical state
  ➢ enabling communication with core under test
  ➢ restarting pipeline

● 8000 lines of new Verilog code

● Verification regression used to simulate normal operation of chip

Prof. Mitra @ Stanford University

Page 36: OpenSPARC – An Open Platform for Hardware Reliability ...

Future research possibilities in hardware reliability using OpenSPARC

Page 37: OpenSPARC – An Open Platform for Hardware Reliability ...

Future research possibilities

• Using CMT hardware resources for error detection and recovery
  > cores, threads, structures used by cores/threads
• Understanding errors in the context of CMT architectural constructs
  > thread arbitration and scheduling
  > speculative threading
• Validate error management solutions using a state-of-the-art microprocessor design

Page 38: OpenSPARC – An Open Platform for Hardware Reliability ...

Future research possibilities

• Study impact of reliability solutions on microprocessor performance
  > use performance tools available in OpenSPARC
• Firmware and software solutions for hardware reliability
  > FPGA implementation and T1000/2000 servers with Solaris/Hypervisor source and other tools
• Study impact of error detectors in processor on chip level and application failure rates
  > enable error detection selectively, use simulators
• Several more...

Page 39: OpenSPARC – An Open Platform for Hardware Reliability ...

Conclusions

OpenSPARC is an open source community based around UltraSPARC T1 and T2 CMT microprocessors

OpenSPARC provides a rich, state-of-the-art infrastructure for research in hardware reliability

Many universities are actively using OpenSPARC in their research, with a lot of success

There is a lot more research in hardware reliability that can be done using OpenSPARC

Page 40: OpenSPARC – An Open Platform for Hardware Reliability ...

Acknowledgment

We would like to acknowledge the students (past and present) from Carnegie Mellon University, University of Illinois at U-C and Stanford University who contributed to the research described in this presentation.