Sidestepping performance bottlenecks and design crises with

Better-Than-Worst-Case design

Munich - March 7th, 2005

ARM, Ltd. Cambridge, UK

Dept. of EECS, University of Michigan, Ann Arbor, MI, USA

Krisztian [email protected]

Valeria [email protected]

Todd [email protected]

2

Introduction

Limitations of traditional design approaches in light of current technology trends

Pressing Design Challenges in the Nanometer Regime

Design complexity
• Billions of transistors lead to untenable designs…

Uncertainty in design parameters
• Process and temperature variation, supply noise…

Soft errors upset logic and memory
• Cosmic rays, alpha particles, neutrons, etc…

Power demands
• Bounding performance, area, battery life

4

Design Complexity Trends

[Chart: transistor count per chip (log scale, 1.E+04 to 1.E+09) versus year (1980 to 2005). Labeled parts: i286, i386, i486, Pentium, PentiumPro, PentiumII, PentiumIV, Prescott, K7, Crusoe, PowerIV, Riva TNT2, Nvidia NV2, GeforceII, GeForceIV, GeForceFX, Radeon97.]

5

The Burden of Verification

Immense test space
• Impossible to fully test the system
• Example: 32 regs, 8k caches, 300 pins = 2^132396 states

Done with respect to ill-defined reference
• What is correct? Often defined by old designs + gurus

Expensive
• Large fraction of design team dedicated to verification
• Increases time-to-market, often as much as 1-2 years

High-risk
• Typically only one chance to “get it right”
• Failures can be costly: replacement parts, bad PR, lawsuits, fatalities
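
The state-count example on this slide (2 raised to the power 132396) can be reproduced with a short calculation. The sketch below is only a plausible reading of the numbers: the 32-bit register width and the assumption of two 8 KB caches are not stated in the source, but together with one bit per pin they add up to the quoted exponent.

```python
import math

# Assumptions (not stated on the slide): 32-bit registers and
# two 8 KB caches (instruction and data), one state bit per pin.
REG_BITS = 32 * 32              # 32 registers x 32 bits each
CACHE_BITS = 2 * 8 * 1024 * 8   # two 8 KB caches, 8 bits per byte
PIN_BITS = 300                  # 300 pins

total_bits = REG_BITS + CACHE_BITS + PIN_BITS

# Number of decimal digits in 2**total_bits, computed with logarithms
# so that the huge integer never has to be printed.
decimal_digits = math.floor(total_bits * math.log10(2)) + 1

print(f"state bits: {total_bits}")
print(f"2^{total_bits} has about {decimal_digits} decimal digits")
```

Even at a billion states per second, a space of this size cannot be enumerated, which is the point the slide makes about exhaustive testing.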

6

Extreme Variations

[Figure, three panels:
• Heat flux (W/cm2, 0 to 250): results in Vcc variation
• Temperature variation (°C, 40 to 110): results in hot spots
• Random dopant fluctuations: mean number of dopant atoms (log scale, 10 to 10000) versus technology node (1000, 500, 250, 130, 65, 32 nm)]

7

Uncertainty in Design Parameters

Uncertainty leads to performance and power overheads
• Increasing uncertainty with design scaling
• Intra-die process/temperature variations, inductive noise, deep

Key Observation: worst-case scenario is highly improbable
• Significant gain for circuits optimized for common case
• Efficient mechanisms needed to tolerate worst-case scenarios

[Figure: Temperature + Si variation + Noise + Model uncertainty = total margin; axes labeled f, Yield, MTTF and Vdd.]

8

Impact of Neutron Strike on a Si Device

Strikes release electron & hole pairs that can be absorbed by source & drain to alter the state of the device

Secondary source of upsets: alpha particles from packaging

[Figure: neutron strike on a transistor device, releasing charge between source and drain.]

9

Soft-Error Trends

SER per chip of logic circuits
• Nine orders of magnitude increase from 600 nm to 50 nm
• Dominant source of soft errors after 50 nm

[Chart: soft error rate (FIT/chip, log scale, 1.0E-07 to 1.0E+04) versus technology generation; legend: SRAM latch 6 FO4s logic 6 FO4s.]

Technology generation: 1992 600nm, 1994 350nm, 1997 250nm, 1999 180nm, 2002 130nm, 2005 100nm, 2008 70nm, 2011 50nm

[P. Shivkumar et al., DSN 2002]

10

Source: The New York Times, 25 June 2002

Fried Egg a la Athlon XP1500+

11

Power Density Trends

Power doubles every 4 years
5-year projection: 200W total, 125 W/cm2!

P=VI: 75W @ 1.5V = 50 A!

[Chart: power density (Watts/cm2, log scale, 1 to 1000) versus process generation (1.5μ, 1μ, 0.7μ, 0.5μ, 0.35μ, 0.25μ, 0.18μ, 0.13μ, 0.1μ, 0.07μ) for i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4, with reference levels for a hot plate, a nuclear reactor, and a rocket nozzle.]

* “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies”, Fred Pollack, Intel Corp., Micro32 conference key note, 1999. Courtesy Avi Mendelson, Intel.

12

Better-Than-Worst-Case (BTWC) design

Traditional worst-case design works to avoid errors/faults by assuming worst-case conditions for design validation

Better-than-worst-case design couples a complex design with a checker component that validates correctness during operation

Reduces design effort and enables typical-case optimizations

13

What is this tutorial about

BTWC design
• Basic Concepts
• DIVA Checker
• Razor Logic
• Other BTWC solutions

CAD challenges and opportunities
• Typical-Case design Optimization (TCO)
• Circuit-level observability and system-level performance

Open discussion

Conclusion

14

Goals of this tutorial

1. Introduce and motivate the concept of Better Than Worst-Case design
2. Familiarize the attendees with a number of BTWC designs (ours and others)
3. Introduce efforts (circuit-aware architectural simulation, typical-case optimization) that highlight the challenges and opportunities that BTWC poses to CAD
4. Facilitate an open discussion on the implications of BTWC design on CAD

15

Better-Than-Worst-Case design

16

Traditional Worst-Case Design

[Figure: design-time verification and optimization, with low-to-high (L/H) gauges for time-to-market and performance.]

17

Better-Than-Worst-Case (BTWC) design

[Figure: run-time verification by online checker hardware combined with typical-case optimization, with low-to-high (L/H) gauges for time-to-market and performance.]

18

Outline
• DIVA Checker
• Razor Logic
• Other BTWC designs

19

Motivating Observations

Online functional verification covers most faults
• Single-event upsets and noise-related faults
• Design faults and incomplete implementation
• Untestable silicon defects and in-field circuit failures
• Utilize N(2)-version hardware to detect and correct faults

Increasing speculation reduces exposure to faults
• Predictors need not be correct, functionally or electrically
• Approach leverages a maximally speculative architecture

While complex, processors have simple semantics
• Need not validate all internals, only exposed semantics
• Only check instruction semantics for low overheads

20

Example BTWC Design: DIVA Checker [Austin’99]

All core function is validated by checker
• Simple checker detects and corrects faulty results, restarts core

Checker relaxes burden of correctness on core processor
• Tolerates design errors, electrical faults, defects, and failures
• Core has burden of accurate prediction, as checker is 15x slower

Core does heavy lifting, removes hazards that slow checker

[Figure: the core (IF, ID, REN, REG, SCHEDULER, EX/MEM), responsible for performance, sends speculative instructions in order with PC, inst, inputs, and addr to the checker (CHK, CT), the online checker hardware responsible for correctness.]
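
The check-and-correct idea can be illustrated with a toy model. The sketch below is not the DIVA implementation: the two-operation ISA, the `Prediction` record, and the `Checker` class are hypothetical names invented for illustration. It only shows the principle that the checker re-executes each instruction on its own architected state, commits its own result, and flags a recovery when the core's result disagrees.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    """One completed instruction as sent by the core (hypothetical format)."""
    op: str       # "add" or "sub" in this toy ISA
    dst: int      # destination register index
    src1: int     # first source register index
    src2: int     # second source register index
    result: int   # value the core claims to have computed

OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b}

class Checker:
    def __init__(self, num_regs=8):
        self.regs = [0] * num_regs   # architected state, owned by the checker
        self.recoveries = 0

    def commit(self, p):
        """Re-execute p in order; return True if the core's result was correct."""
        correct = OPS[p.op](self.regs[p.src1], self.regs[p.src2])
        ok = (correct == p.result)
        if not ok:
            self.recoveries += 1     # a real design would flush and restart the core
        self.regs[p.dst] = correct   # always commit the checker's own value
        return ok

checker = Checker()
checker.regs[1], checker.regs[2] = 5, 7
stream = [
    Prediction("add", 3, 1, 2, 12),  # correct core result
    Prediction("sub", 4, 3, 1, 99),  # faulty core result (r3 - r1 is 7)
    Prediction("add", 5, 4, 1, 12),  # correct, and uses the repaired r4
]
outcomes = [checker.commit(p) for p in stream]
print(outcomes, checker.regs, checker.recoveries)
```

Because the checker only ever commits values it computed itself, a faulty core costs performance (a recovery) rather than correctness.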

21

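Checker Processor Architecture

[Figure: checker pipeline (IF, ID, EX, MEM, CT) fed by the core processor prediction stream. Each stage compares the core's PC, inst, regs, and res/addr/nextPC against values obtained from the checker's own I-cache, RF, and D-cache; CT commits the result when the comparisons are OK. A WT block is also shown.]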

22

Check Mode

[Figure: the checker processor architecture in check mode. Each stage (IF, ID, EX, MEM) compares its result against the core processor prediction stream (core PC, core inst, core regs, core res/addr/nextPC) before CT commits.]

23

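Recovery Mode

[Figure: the checker processor architecture in recovery mode. No core predictions or comparators are shown; each stage (IF, ID, EX, MEM, CT) passes its own PC, inst, regs, and res/addr to the next, using the I-cache, RF, and D-cache.]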

24

How Can the Simple Checker Keep Up?

Slipstream

Slipstream reduces power requirements of trailing car

Checker processor executes inside core processor’s slipstream
• fast moving air ⇒ branch predictions and cache prefetches
• Core processor slipstream reduces complexity requirements of checker
• Checker rarely sees branch mispredictions, data hazards, or cache misses

27

Checker Performance Impacts

Checker throughput bounds core IPC
• Only cache misses stall checker pipeline
• Core warms cache, leaving few stalls

Checker latency stalls retirement
• Stalls decode when speculative state buffers fill (LSQ, ROB)
• Stalled instructions mostly nuked!

Storage hazards stall core progress
• Checker may stall core if it lacks resources

Faults flush core to recover state
• Small impact if faults are infrequent

[Chart: relative CPI (0.97 to 1.05) for the configurations Uber-Checker, Pico-Checker, 12-cycle Checker, 1/4 Cache Size, and 1k Faults.]

28

REMORA: Physical Checker Design

Physical checker design effort underway
• Alpha integer ISA subset
• 4-wide checker, 0.5k I-cache, 4k D-cache
• Synthesized design (using Synopsys)

Physical design estimates
• 950 MHz clock speed (degree-8 pipe)
• 12 mm2 total area in 0.25um technology
• 941 mW worst-case power

Design also includes:
• Pipelined checker design, simple core
• Clock/voltage tuning infrastructure
• Extensive BIST support

[Figure: area comparison. Alpha 21264: 205 mm2 (in 0.25um). REMORA checker (data cache, inst cache, pipeline, BIST): 12 mm2 (in 0.25um).]

29

Verifying the Checker Processor

Simple checker permits complete functional verification
• In-order blocking pipelines (trivial scheduler, no rename/reorder/commit)
• No “internal” non-architected state

Fully verified design using Sakallah’s GRASP SAT-solver
• For Alpha integer ISA without exceptions
• With small register file and memory, and small data types

[Figure: the checker model and a reference model (ISA sim) both receive unspecified core predictions; their outputs are compared (==) for identical state. The formula ϕ is always true if uArch model == Ref model.]

30


31


32

Motivating Study: Voltage vs. Circuit Error Rate

33

Circuit Under Test

[Figure: test circuit built from 48-bit LFSRs and 18x18 multipliers (18-bit operands, 36-bit products). Slow Pipeline A and Slow Pipeline B are clocked at clk/2; the Fast Pipeline is clocked at clk. Outputs are compared (!=) after a stabilize stage, and mismatches increment 40-bit error counters.]

34

Error Rate Studies – Empirical Results

18x18-bit Multiplier Block at 90 MHz and 27 C

[Chart: error rate (log scale, 0.0000001% to 100%) versus supply voltage (1.14 V to 1.78 V), random input data.]

Chart annotations:
• Zero-margin @ 1.54 V
• Safety-margin @ 1.63 V
• Environmental-margin @ 1.69 V
• 35% energy savings with 1.3% error
• 22% saving
• once every 20 seconds!

35

Error Rate Studies – SPICE-Level Simulations

Based on SPICE-level simulations of a Kogge-Stone adder

Kogge-Stone Adder at 870 MHz and 27 C

[Chart: error rate (log scale, 0.01% to 100%) versus supply voltage (0.6 V to 2 V) for random, bzip, and ammp inputs; annotation: 200 mV.]

36

Another BTWC Design: Razor Logic [Ernst’03]

[Figure: a Razor flip-flop between pipeline stages. The main FF is clocked by clk and the shadow latch by the delayed clock clk_del; the example path passes through the MEM stage. Razor is tagged as online checker hardware.]

Double-sampling metastability tolerant latches detect timing errors
• Second sample is correct-by-design

Microarchitectural support restores state
• Timing errors treated like branch mispredictions
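
The double-sampling behaviour can be sketched as a simple timing model. This is an illustrative abstraction, not the Razor circuit: the function name `razor_capture` and its parameters are hypothetical, setup/hold times and metastability are ignored, and the shadow latch is assumed to always see the settled value (the correct-by-design property).

```python
def razor_capture(old_value, new_value, path_delay, t_clk, t_del):
    """Model one capture by a Razor flip-flop (all times in the same unit).

    old_value:  value on the data input before the logic settles
    new_value:  settled (correct) value
    path_delay: time after the launching edge at which new_value arrives
    t_clk:      clock period (the main flip-flop samples at t_clk)
    t_del:      delay of clk_del (the shadow latch samples at t_clk + t_del)
    Returns (main, shadow, error).
    """
    if path_delay > t_clk + t_del:
        # Outside the window Razor can tolerate: the shadow latch would be wrong too.
        raise ValueError("path delay exceeds the shadow latch sampling point")
    main = new_value if path_delay <= t_clk else old_value  # may capture stale data
    shadow = new_value                                      # second sample is correct
    error = (main != shadow)                                # comparator output
    return main, shadow, error

# On-time path: both samples agree, no error.
print(razor_capture(0, 1, path_delay=0.9, t_clk=1.0, t_del=0.5))
# Late path: the main flip-flop holds stale data and the mismatch raises the error signal.
print(razor_capture(0, 1, path_delay=1.2, t_clk=1.0, t_del=0.5))
```

On an error, the shadow value is the one restored into the pipeline, which is why the recovery mechanism can reuse the machinery for branch mispredictions.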

37

Razor Short Path Constraint

[Figure: main FF and shadow latch clocked by clk and clk_del around the MEM stage; the hold constraint spans about 1/2 cycle.]

Short-path timing constraint prevents shadow latch corruption

38

Razor Flip-Flop

[Figure: Razor flip-flop schematic. Signals: D_in, Q, Clk_p, Clk_n, Restore, Restore_p, Restore_n, Rbar_latched, Error; includes P-skewed and N-skewed elements and an output driver.]

39

Centralized Pipeline Recovery Control

[Figure: pipeline (IF, ID, EX, MEM, WB (reg/mem)) with Razor FFs between stages and on the PC. Each Razor FF raises an error signal and receives a recover signal. Timeline: inst1 to inst6 over clock cycles 0 to 6.]

One-cycle penalty for timing failure

Global synchronization may be difficult for fast, complex designs

40

Distributed Pipeline Recovery

[Figure: pipeline (IF, ID, EX, MEM (read-only), WB (reg/mem)) with Razor FFs between stages, a stabilizer FF, and flush control. Errors insert bubbles (error bubble) and send flushID signals back to recover the PC. Timeline: inst1 to inst8 over cycles 0 to 9, with inst2, inst3, and inst4 issued again after the error.]

Builds on existing branch prediction framework

Multiple cycle penalty for timing failure

Scalable design as all communication is local

41

Shaving Voltage Margins with Razor

Goal: reduce voltage margins with in-situ error detection and correction

Approach:
• Tune processor voltage based on error rate
• Eliminate margins, run below critical voltage

Trade-off: power savings vs. overhead of correction

[Chart: percentage errors (0 to 60) versus supply voltage (0.8 to 2.0), marking the traditional DVS, zero-margin, and sub-critical operating regions.]

42

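Razor Opportunity: Typical-Case Energy Reduction

[Figure: voltage control loop. Error signals from the pipeline are summed (Σ) into Esample, which is subtracted from the reference Eref; the difference Ediff drives the voltage control function, which sets Vdd through the voltage regulator.]

Ediff = Eref - Esample

Energy reduction can be realized with a simple proportional control function
• Control algorithm implemented in software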
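
One step of such a proportional controller might look like the sketch below. Only the relation Ediff = Eref - Esample comes from the slide; the function name `next_vdd`, the gain `k_p`, the voltage limits, and the sign convention (lower Vdd when fewer errors are observed than the reference) are assumptions made for illustration.

```python
def next_vdd(vdd, e_sample, e_ref, k_p, v_min, v_max):
    """Return the next supply voltage for one control interval.

    e_sample: error rate measured over the last interval
    e_ref:    target error rate
    k_p:      proportional gain in volts per unit of error rate (hypothetical)
    """
    e_diff = e_ref - e_sample             # Ediff = Eref - Esample
    vdd_new = vdd - k_p * e_diff          # e_diff > 0: too few errors, so lower Vdd
    return min(v_max, max(v_min, vdd_new))  # clamp to the regulator's range

# Fewer errors than the target: the controller shaves some voltage.
print(next_vdd(1.60, e_sample=0.00, e_ref=0.01, k_p=2.0, v_min=1.2, v_max=1.8))
# More errors than the target: the controller backs off to a higher voltage.
print(next_vdd(1.60, e_sample=0.05, e_ref=0.01, k_p=2.0, v_min=1.2, v_max=1.8))
```

A real controller would also have to account for the regulator's response time; this sketch updates the voltage instantaneously.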

43

Energy/Performance Characteristics

[Chart: energy and pipeline throughput (IPC) versus decreasing supply voltage. Curves: energy of processor operations (Eproc), energy of pipeline recovery (Erecovery), total energy (Etotal), and energy of the processor without Razor support. The optimal Etotal point is marked; annotations: 50% and 1%.]

Total energy: Etotal = Eproc + Erecovery

44

Simulation Results: Optimal Voltage Sweep

BZIP: 0.31% Error Rate, 58% Energy Savings

[Chart: relative IPC and energy (0.3 to 1.5) versus voltage (0.6 to 1.8 in steps of 0.05); series: Rel Energy, Rel Performance.]

Recovery cost includes energy to recover entire pipeline (18x an add)

45

Voltage Controller Performance

[Chart: percentage error rate (0 to 16) and voltage output of controller (1.35 to 1.80) versus runtime samples (0 to 600), at 120MHz and 27C.]

46

Simulation Results: Razor DVS Performance

[Chart: relative energy (%) from 0 to 100 for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr, and Average; series: Total Energy, DVS Energy, IPC, DVS IPC.]

47

Razor Prototype Silicon

4 stage 64-bit Alpha pipeline
• 120 - 160MHz operation
• 0.18μm technology, 1.8V

Razor overhead:
• Total of 192 Razor flip-flops out of 2408 total (9%)
• Error-free power overhead:
  - Razor flip-flops: < 1%
  - Short path buffer: 2.1%
• Recovery power overhead:
  - Razor latch power overhead: 2% at 10% error rate
  - Additional power overhead due to re-execution of instructions

[Die photo: 3.3 mm x 3 mm, showing I-Cache, Register File, IF, ID, EX, MEM, WB, and D-Cache.]

48

Razor Prototype Testbed

49


50


51

Error Rate and Normalized Energy Savings

[Chart: percentage error rate (log scale, 1E-8 to 10) and normalized energy (0.70 to 1.20) versus voltage in volts (1.52 to 1.76), at 120MHz and 140MHz.]

52

Point of 0.1% Error Rate and Point of First Failure

[Two scatter plots of chips, one at 120MHz and one at 140MHz: voltage at 0.1% error rate versus voltage at first failure (about 1.4 V to 1.8 V). Linear fits shown: y = 0.78685x + 0.22117 and y = 0.55976x + 0.61752.]

53


54

Temperature Margins

[Chart: percentage error rate (log scale, 1E-9 to 0.1) versus voltage (1.40 to 1.52) at 120MHz, for 45C, 65C, and 95C.]

55

Razor Energy Savings @120MHz, 45C

[Chart: measured power (in mW, axis 80 to 160) for chip1 and chip2 in three configurations: measured power with supply, temperature, and process margins (160.5mW, 162.8mW); power with Razor DVS when operating at point of first failure; power with Razor DVS when operating at point of 0.1% error rate. Bars are annotated with the power and voltage attributed to each margin: power supply integrity (27.3mW and 27.7mW, 180mV), temperature (11.3mW and 11.5mW, 80mV), and process (17.3mW at 130mV, 4.2mW at 30mV). Other labeled power levels: 119.4mW, 104.5mW, 99.6mW, 89.7mW.]

56


57


58

Other Better Than Worst-Case designs

Algorithmic-Noise Tolerance, Shanbhag et al.
• Converting circuit faults to S/N component

Approximate Circuits, Lu et al.
• Architecture-level speculation on computation

TEAtime Adaptive Clock, Uht et al.
• Adaptive clock control

On-Chip Self-Calibrating Busses, Worm et al.
• Error recovery logic for on-chip busses

Self-Tuning Circuits, Kehl et al.
• Early work on dynamic timing error avoidance

Time Based Transient Fault Detection, Anghel et al.
• Double sampling latches for speed testing

March 2004

59

Algorithmic Noise Tolerance

[Shanbhag ’04]

60

Approximate Circuits

[Lu ’04]

61

TEAtime Adaptive Clock

[Uht ’04]

62

On-Chip Self-Calibrating Busses

[Worm ’04]

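[Figure: self-calibrating bus. An encoder and FIFO drive the channel and a decoder and FIFO receive from it; the decoder reports errors and Ack to a controller. Labeled quantities: v_dd, v_ch, F_ch, n.]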

63

Self-Tuning Circuits

[Kehl ’93]

64

Time Redundancy BasedTransient Fault Detection

[Anghel ’00]

65

CAD opportunities for BTWC

66

Key observation

In a BTWC context, infrequent faults in the core design are tolerable.

• A fault is infrequent if:
  - Environmental conditions trigger it rarely
  - Normal system operation activates a faulty configuration rarely

67

CAD opportunities

Synthesis
• Optimize performance/power for the most common scenarios (typical-case optimization)
• Flexible synthesis tools
  - Optimization constraints can be relaxed
• Finer granularity of synthesis objectives: probability density curves

Verification
• Accurate evaluation
  - Statistical analysis of execution scenarios
• Verification focuses on “frequent” transactions/paths of execution
• Verification focuses on critical components
• Safety mechanisms detect “rare” problems at a performance cost
  - Performance and verification are intertwined

68

Outline
• Synthesis - Typical-Case Optimized Adders
• Performance/verification - Circuit-Aware simulation
• Verification - Beta-release processors

69

Kogge-Stone Adder

[Figure: 16-bit Kogge-Stone adder. Generate/propagate inputs G0P0 through G15P15 feed a parallel prefix computation that produces the carries C1 through C16.]

70

BTWC Opportunity: Typical-Case Optimized Adder

[Figure: Kogge-Stone adder with inputs G0P0 through G15P15 and Cin.]

71

Carry Propagations for Random Data

[3-D chart: probability (0 to 0.01) of carry start and carry propagation, by bit position and carry distance (axis ticks 0 to 56 and 0 to 64).]

72

Carry Propagations for Typical Data

[3-D chart: probability (0 to 0.16) of carry start and carry propagation, by carry distance and bit position (axis ticks 0 to 56 and 0 to 64).]

73

Typical Case Optimized Adder

[Figure: adder with inputs G0P0 through G15P15 and Cin, combining a ripple carry circuit with a carry-lookahead circuit.]

74

Benefits of Typical Case Optimization

Typical-case performance is much better than worst case, which is relevant in a TCO context

                  Latency (in gate delays)
Adder Topology    Worst-Case   Typical-Case   Random
Kogge-Stone       8            5.08           7.09
TCO Adder         128          3.03           3.69
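
The reason a ripple-style adder can do well in the typical case is that long carry chains are rare. The sketch below measures the longest carry chain for a given pair of operands; the function `longest_carry_chain` and the uniform random operands are illustrative assumptions, not the measurement setup behind the table.

```python
import random

def longest_carry_chain(a, b, width=64):
    """Longest run of bit positions that a single carry travels through in a + b."""
    longest = run = 0
    carry = 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        generate, propagate = x & y, x ^ y
        if generate:
            carry, run = 1, 1      # a new carry starts at this bit
        elif propagate and carry:
            run += 1               # the incoming carry ripples through this bit
        else:
            carry, run = 0, 0      # the carry is killed, or there was none
        longest = max(longest, run)
    return longest

# Worst case for 64 bits: a carry generated at bit 0 ripples to the top.
print(longest_carry_chain(2**64 - 1, 1))

# Average over uniformly random operands (seeded for repeatability).
rng = random.Random(0)
samples = [longest_carry_chain(rng.getrandbits(64), rng.getrandbits(64))
           for _ in range(10_000)]
print(sum(samples) / len(samples))
```

For random data the average chain is far shorter than the 64-bit worst case, so logic sized for short chains is correct most of the time and needs a checker only for the rare long ones.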

75

Outline
• Synthesis - Typical-Case Optimized Adders
• Performance/verification - Circuit-Aware simulation
• Verification - Beta-release processors

76

Simulation for BTWC

Electrical, transient phenomena affect the performance of BTWC designs

Simulation tools need to be “electrically accurate”

Functional correctness is re-defined in a statistical context

Need ability to gather statistical simulation data

Example: Razor latches

[Figure: Razor flip-flop with main FF, shadow latch, clk, and clk_del.]

77

Key Challenges: Speed and Fidelity

[Chart: fidelity versus speed for SPICE (~4 hr/cycle), analytical circuit model (30min/prog), and circuit-aware simulation (5 hrs/prog).]

GOAL: Balance between accuracy and speed of simulation

78

Circuit-Aware is not only for BTWC designs

There is a recent trend in computer architecture design toward systems that can adapt to circuit-level phenomena
• e.g., di/dt, thermal throttling

These novel circuit-aware architectural optimizations share a requirement for detailed circuit modeling
• There needs to be interaction between architectural state and circuit behavior (e.g., device switching activity, detailed timing information of pipeline states)

Analytical circuit modeling has been widely used
• Simple and fast
• At the cost of accuracy

79

Circuit-Aware Architectural Simulation Platform Overview

[Figure: the architectural simulator (inputs: App and Arch Config; output: Arch Metrics) provides speed and scope. For the pipeline stages (IF, ID, EX, MEM, WB) it passes inputs, voltage, and constraints to the circuit simulator (inputs: module circuit models and tech models; output: circuit metrics), which returns delay, power, and switching and provides fidelity and observability.]

80

Architectural Simulator Structure

[Figure: SimpleScalar block diagram. User programs (SimpleScalar program binary) sit above the Prog/Sim interface (SimpleScalar ISA, POSIX system calls), the functional core (machine definition, proxy syscall handler), and the performance core (simulator core with BPred, Cache, Memory, Regs, Loader, Resource, Stats, and Dlite!).]

81

Standard Circuit Simulation Approach

[Figure: gate-level netlist (cells U7 to U11) annotated with extracted wire capacitances (a: 0.00336631, b: 0.00262017, c: 0.00143158, d: 0.0046271, n45: 0.0389713, n46: 0.0239953, n47: 0.0569968, f: 0.0233215, g: 0.0234223) and gate input capacitances from typical.lib. Each SPICE-characterized standard cell maps input slew, output capacitance, and voltage to output slew, delay, and power.]

82

Event Driven Implementation

[Figure: transition events propagating through cells U7 to U11 on nets n45, n46, n47, f, and g, with a glitch marked on one net.]

• Event-driven simulation can capture glitch behavior
• Validation against a set of SPICE simulations
• Error rates are consistently less than 11%, with most less than 3%
• The initial speed of simulation without optimization is 150 insts / sec
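
How an event-driven simulator exposes a glitch can be shown with a very small model. The sketch below is a generic transport-delay simulator written for illustration, not the tool described in the tutorial: the `simulate` function, the netlist format, and the unit gate delays are all assumptions. The example circuit computes y = a AND (NOT a), which is logically always 0 but pulses briefly when a rises.

```python
import heapq

def simulate(gates, state, stimuli):
    """Event-driven gate simulation with fixed (transport) delays.

    gates:   list of (output_net, function, input_nets, delay)
    state:   dict mapping each net to its initial value (must be consistent)
    stimuli: list of (time, net, value) input transitions
    Returns the list of (time, net, value) transitions that actually occurred.
    """
    queue, seq = [], 0
    for t, net, val in stimuli:
        heapq.heappush(queue, (t, seq, net, val))
        seq += 1
    transitions = []
    while queue:
        t, _, net, val = heapq.heappop(queue)
        if state[net] == val:
            continue                          # value unchanged: not an event
        state[net] = val
        transitions.append((t, net, val))
        for out, fn, ins, delay in gates:     # re-evaluate every gate this net drives
            if net in ins:
                new_out = fn(*(state[i] for i in ins))
                heapq.heappush(queue, (t + delay, seq, out, new_out))
                seq += 1
    return transitions

gates = [
    ("na", lambda a: 1 - a, ("a",), 1),            # inverter, delay 1
    ("y", lambda a, na: a & na, ("a", "na"), 1),   # AND gate, delay 1
]
state = {"a": 0, "na": 1, "y": 0}
transitions = simulate(gates, state, [(0, "a", 1)])
print(transitions)
```

The output net y rises and falls one time unit apart even though its steady-state value never changes. A purely functional (zero-delay) simulation would miss this pulse, along with the switching energy it costs.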

83

Some serious performance boosters

Constraint-based pruning
• Static pruning
• Dynamic pruning

Circuit timing memoization (a.k.a. caching of electrical simulation results)

SimPoints

84

Constraint-Based Pruning – An overview

[Figure: logic under simulation between the pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB), with a constraint check on its output.]

Check constraints: if (delay < Tclk) stop, else simulate

Some analyses have a “don’t care” scenario, e.g., “less than X switches”, “less than tclk”

Architecture does not react to “don’t care” sets (constraints not violated)

85

Static Constraint Pruning

At each new voltage and temperature, domain-specific STA computes worst-case values of the constraint measures; logic can be pruned where the constraints cannot be violated

Whenever the supply voltage is changed, constraint-based pruning is re-evaluated before simulation continues

[Figure: domain-specific STA on a circuit with nets a to e. Clock cycle time: 1 ns @ 1.8V. Paths whose worst-case delay is less than the clock threshold are statically pruned. Labeled values: 1.32 ns, 0.65 ns, 0.83 ns; t_max_req 0.87.]

86

Dynamic Pruning

During simulation, an event can be dropped if a particular input vector causes a transition such that: tevent + Tmax2output < Tconstraint
• Guarantees that simulation will reach output net without a timing violation
• Must still perform logic simulation to compute circuit state

[Figure: example with a clock cycle time of 5 ns. Events arrive at successive gates at 1.23 ns, 2.23 ns, 2.93 ns, 3.23 ns, and 4.23 ns; the Tmax2output timing budgets are 3.2ns, 2.5ns, 2ns, and 1ns. Pruned events continue with fast logic simulation.]
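
The dynamic pruning test itself is a single comparison. In the sketch below the function name `can_prune` is hypothetical and the event times are illustrative values chosen for a 5 ns constraint; they are not a reading of how the figure pairs arrival times with timing budgets.

```python
def can_prune(t_event, t_max_to_output, t_constraint):
    """True if an event can skip detailed timing simulation.

    t_event:         time at which the transition occurs on its net
    t_max_to_output: worst-case delay from that net to any output (from STA)
    t_constraint:    timing constraint, e.g. the clock period
    """
    return t_event + t_max_to_output < t_constraint

T_CLK = 5.0  # ns, illustrative
events = [
    (1.23, 3.2),  # 4.43 ns worst case: cannot violate timing, prune
    (2.23, 3.2),  # 5.43 ns worst case: may violate timing, keep detailed timing
    (4.23, 1.0),  # 5.23 ns worst case: may violate timing, keep detailed timing
]
pruned = [can_prune(t, budget, T_CLK) for t, budget in events]
print(pruned)
```

A pruned event still has to be evaluated logically so that the circuit state stays correct; only the timing calculation is skipped.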

87

Constraint-Based Circuit Pruning

In our case study of a 200 MHz Razor system
• At 1.8V nominal voltage, pruning eliminated 64% of the prime inputs of the circuit
• At the most highly constrained voltage, 1.4V, 24% of the prime inputs of the circuit are eliminated

Static and dynamic pruning together achieve 445 instructions per second

88

Circuit Timing Memoization

Remember previous circuit evaluations, reuse results if they recur

Leverage value locality by recording switching history
• Previous vector encodes the internal node values of the circuit, input vector indicates the new input transition

[Figure: a 16-bit adder (inputs opA and opB, output Sum). The previous vector (0x00AE, 0x003B) and the new input vector (0x01FC, 0x0012) are checked against a hash table; on a miss, full circuit simulation computes the result (2.5ns, 250pJ), which is stored in the table.]

89

Circuit Timing Memoization

Size of hash table is limited to 256MB

Dynamic reordering of hash bucket chains
• Bringing the most recently used (MRU) element to the head of the chain reduces the average number of hops
• At most 50% hit rate on average

Per-opcode input vector filtering mechanism
• Observation:
  - load/store instructions ignore “operand B”
• Each instruction opcode indicates with a mask which inputs do not influence stage logic evaluation
• 70% hit rate on average
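
The memoization scheme with per-opcode filtering can be sketched as follows. This is an illustration under stated assumptions: the `TimingMemo` class, the mask table, and the stand-in `fake_simulate` function are hypothetical, and Python's `OrderedDict` with least-recently-used eviction is swapped in for the MRU-reordered hash bucket chains described above.

```python
from collections import OrderedDict

class TimingMemo:
    """Cache of circuit evaluations keyed by (opcode, previous vector, new vector)."""

    def __init__(self, simulate, operand_masks, capacity=1024):
        self.simulate = simulate        # slow path: full circuit simulation
        self.masks = operand_masks      # opcode -> (mask for operand A, mask for operand B)
        self.capacity = capacity        # maximum number of cached entries
        self.table = OrderedDict()
        self.hits = self.misses = 0

    def lookup(self, opcode, prev, new):
        """prev and new are (opA, opB) pairs; returns the (delay, energy) result."""
        mask_a, mask_b = self.masks.get(opcode, (~0, ~0))  # default: keep all bits
        key = (opcode,
               prev[0] & mask_a, prev[1] & mask_b,
               new[0] & mask_a, new[1] & mask_b)
        if key in self.table:
            self.hits += 1
            self.table.move_to_end(key)       # mark as most recently used
            return self.table[key]
        self.misses += 1
        result = self.simulate(opcode, prev, new)
        self.table[key] = result
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)    # evict the least recently used entry
        return result

calls = []
def fake_simulate(opcode, prev, new):
    """Stand-in for the gate-level simulator; returns (delay in ns, energy in pJ)."""
    calls.append(opcode)
    return (2.5, 250.0)

# Hypothetical mask table: loads ignore operand B entirely.
memo = TimingMemo(fake_simulate, {"load": (0xFFFF, 0x0000)})
r1 = memo.lookup("load", (0x00AE, 0x003B), (0x01FC, 0x0012))
r2 = memo.lookup("load", (0x00AE, 0x0099), (0x01FC, 0x0077))  # differs only in operand B
print(r1, r2, memo.hits, memo.misses, len(calls))
```

Masking out inputs that cannot affect the stage logic makes more lookups collide on the same key, which is the mechanism behind the higher hit rate reported for per-opcode filtering.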

90

SimPoint Analysis

After marshalling all optimization techniques, we achieve approximately 1000 instructions per second

We deploy a recently developed simulation sampling technique, SimPoint
• Uses Basic Block Distribution Analysis to extract representative samples (10 million insts) of the original benchmark (1 billion insts)
• Drastically reduces the number of instructions observed to characterize the program’s performance
• Error analysis indicates an error of less than 10% (typically less than 3%) for a wide variety of benchmarks

A full benchmark execution can be completed within 5 hours

91

Circuit-aware simulation to evaluate Razor

Initial simulators utilized a hand-generated EX-stage circuit model
• insufficient performance

Challenge: instruction latency (in cycles) depends on circuit evaluation latency
• Cycle count may vary with input, voltage, temperature, process variation

Circuit-Aware Architectural Simulation combines architectural and circuit simulation
• SimpleScalar architectural-level simulation
• Gate-level timing simulation of per-stage logic blocks

92

Case Study: Razor Timing Speculation

Benchmark: GCC

[Chart: relative performance and energy (0 to 1.4) versus voltage (1.2 to 1.8); series: REL_ENERGY, REL_PERF.]

93

Case Study: Razor Timing Speculation

Benchmark: GCC

[Chart: voltage (0.2 to 2) and error rate (0.0% to 60.0%) versus cycles.]

94

Circuit-Aware simulation in summary

Challenge is integration between architectural simulation and circuit simulation
• Must balance fidelity and speed of simulation

Three optimizations were utilized to enhance speed of simulation
• Constraint-based Pruning
• Circuit Simulation Memoization
• SimPoint Simulation Sampling

Demonstrated with Razor pipeline simulations
• Razor architecture reacts cycle-by-cycle to circuit-level phenomena

95

Outline
• Synthesis - Typical-Case Optimized Adders
• Performance/verification - Circuit-Aware simulation
• Verification - Beta-release designs

96

Beta-Release Designs

Traditional verification stalls launch until debug complete

Checked processor verification could overlap with launch
• Beta-release when checker works
• Launch when performance stable
• Step as needed without recalls

[Two charts of design errors (0 to 20) and performance (0 to 100) over time. Traditional verification: tape out, then launch. Checked processor verification: tape out, beta, launch, step.]

97

Additional CAD Opportunities

For synthesis:
• Typical-case library characterization (e.g., pdf of delay)
• Synthesize design for target performance, power, etc…
• TCO-style optimizations possible for macro-modules

For verification:
• Full formal verification for checker components
• Profile-directed simulation-based verification for core

For testing:
• Checker component can facilitate software-based manufacturing test of core components

98

Open discussion

99

Conclusion

100

Conclusions

Better than worst-case design abandons traditional worst-case design constraints

Couples complex designs with checkers

Enables CAD opportunities for typical-case optimization

Requires tool support for observability, synthesis and verification

For more information: http://www.eecs.umich.edu/razor

101

References

• T. Austin, “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design,” ACM/IEEE 32nd Annual Symposium on Microarchitecture (MICRO-32), November 1999.
• D. Ernst, N. S. Kim, S. Das, S. Pant, T. Pham, R. Rao, C. Ziesler, D. Blaauw, T. Austin, T. Mudge, and K. Flautner, “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation,” 36th Annual Int’l Symposium on Microarchitecture (MICRO-36), December 2003.
• N. R. Shanbhag, “Reliable and efficient system-on-chip design,” Computer, Vol. 37, Iss. 3, March 2004.
• A. K. Uht, “Going beyond worst-case specs with TEAtime,” Computer, Vol. 37, Iss. 3, March 2004.
• T. Austin, D. Blaauw, T. Mudge, and K. Flautner, “Making typical silicon matter with Razor,” Computer, Vol. 37, Iss. 3, March 2004.
• S.-L. Lu, “Speeding up processing with approximation circuits,” Computer, Vol. 37, Iss. 3, March 2004.
• F. Worm, P. Ienne, P. Thiran, and G. DeMicheli, “A Robust Self-Calibrating Transmission Scheme for On-Chip Networks,” IEEE Trans. on VLSI Systems, Vol. 13, Iss. 1, January 2005.
• T. Kehl, “Hardware self-tuning and circuit performance monitoring,” in Proceedings of International Conference on Computer Design, 1993.
• L. Anghel and M. Nicolaidis, “Cost reduction and evaluation of temporary faults detecting technique,” in Proceedings of the conference on Design, Automation and Test in Europe (DATE-2000), March 2000.

102

Supplemental Materials

103

More Details on Meta-Stability

Sub-critical operation invites meta-stability
• Meta-stability detector itself can become meta-stable
• Double-latch error signal to obtain sufficiently small probability

[Figure: Razor flip-flop (D, Q; clocks clk, clk_b, clk_del, clk_del_b) with a meta-stability detector (pos, neg) and a dynamic OR / latch producing the error and fail signals and the restore, bubble, and flush controls.]

• Flush entire pipe
• No forward progress
• Reduce frequency

104

Razor Short Path Constraint

[Figure: main FF and shadow latch clocked by clk and clk_del around the MEM stage; the hold constraint spans about 1/2 cycle.]

Double-sampling metastability tolerant latches detect timing errors
• Second sample correct-by-design, use guarantees forward progress

Microarchitectural support restores correct program state
• Timing errors treated in the same way as branch mispredictions

105

Overcoming Short Path Constraints

Delayed clock imposes a short-path constraint
• Razor necessary only for latches on slow paths
• Pad fast path for latches with mixed path delays (pad with extra delay)
• Trade-off between DVS headroom and short path constraints

[Figure: long paths and short paths converging on a Razor_ff, with an ordinary ff elsewhere. Timing diagram of clock and clock_del shows tdelay, thold, and the minimum path delay, for the intended path and the short path.]

Min. Path Delay > tdelay + thold
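
The short-path constraint translates directly into a padding calculation. The sketch below is illustrative: the function `required_padding` is a hypothetical helper, and times are kept as integer picoseconds so that the strict inequality can be met with the smallest whole-picosecond pad.

```python
def required_padding(min_path_delay, t_delay, t_hold):
    """Smallest integer padding (ps) so that the padded path satisfies
    min_path_delay + padding > t_delay + t_hold.

    min_path_delay: shortest combinational delay into the Razor flip-flop (ps)
    t_delay:        delay of the shadow-latch clock relative to the main clock (ps)
    t_hold:         hold time of the shadow latch (ps)
    """
    return max(0, t_delay + t_hold - min_path_delay + 1)

# Illustrative numbers: clk_del lags by 500 ps and the hold time is 100 ps.
print(required_padding(300, 500, 100))  # short path: needs buffers
print(required_padding(700, 500, 100))  # already long enough: no padding
```

A larger tdelay gives Razor more room to catch late signals (more DVS headroom) but increases the padding that short paths need, which is the trade-off listed above.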

106

Power Overhead of the Razor Flip-Flop

38% error-free latch overhead (assuming 20% switching activity)

42% latch overhead with errors (20% switching, 1% error rate)

Overhead mitigated by latch-frugal architecture

Error Free Operation
  Standard Flip-Flop Energy (static/switching)    49fJ / 125fJ
  RFF Energy (static/switching)                   60fJ / 203fJ

Error Detection and Recovery
  Energy of RFF per error event                   260fJ

107

Simultaneous Events

[Figure: NOR gate with inputs A and B and output F (transistor widths WN and WP, load CL=10pF, between Vdd and GND). Simultaneous opposite transitions on A and B produce a software glitch on F.]

Cancel a pair of close events that may cause a software static glitch, which cannot occur in real circuits; source of inaccuracy

108

Accuracy of Simulators

Accuracy of simulation
• Validation against a set of SPICE simulations with a number of circuit topologies at varied voltages and input slew rates
• Error rates are consistently less than 11%, with most less than 3%

The initial speed of simulation without optimization is 150 instructions per second (comparable to VCS)

VCS stands for Verilog Compiler Simulator

Core Processor Execution and Checker Execution (same instruction sequence on both):

    ld f1,(X)
    f4 = f1 * f2 + f3
    br f4 < 0, skip
    r8 = r8 + 1
    skip: ...

[Figure: core timeline of ld, *, +, br, + stretched by a cache miss (ld), a long operation (*), and a misprediction (br); checker timeline of ld, *, +, br, + with every operation marked ok]

Speeding the Checker with Core Computation

Checker executes in wake of core
• Leverages non-binding predictions & prefetches

Virtually no stalls remain to slow checker
• Control hazards resolved during core execution
• Data hazards eliminated by prefetches and input value predictions

Complex microarchitectural structures only necessary in core
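The idea of a checker that re-executes in the wake of the core can be illustrated with a toy model. This is a minimal sketch, not the checker design itself: it assumes a register-only machine with two operations, treats each core result as a non-binding prediction, and omits memory, branches, and pipelining. All names are hypothetical.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}


def check_and_commit(core_trace, regs):
    """Re-execute each core result in program order and commit checked values.

    core_trace: list of (op, dst, src1, src2, core_result) tuples, where the
    core's result is treated as a prediction only. Returns the number of
    core errors that were corrected.
    """
    corrected = 0
    for op, dst, src1, src2, core_result in core_trace:
        # Operands come from already-checked state, so the checker never
        # depends on an unverified core value.
        good = OPS[op](regs[src1], regs[src2])
        if good != core_result:
            corrected += 1  # a real design would also flush the core here
        regs[dst] = good    # only the checker's value reaches committed state
    return corrected


regs = {"r1": 3, "r2": 4, "r3": 0, "r4": 0}
trace = [
    ("mul", "r3", "r1", "r2", 12),  # core computed 3 * 4 correctly
    ("add", "r4", "r3", "r1", 99),  # faulty core result; should be 12 + 3
]
print(check_and_commit(trace, regs), regs["r4"])  # 1 15
```

Because every operand is already available when the checker runs, each check is a single simple operation, which is why the complex structures are needed only in the core.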

Motivating Observations

Speculative execution is fault-tolerant
• Design errors, timing errors, and electrical faults only manifest as performance divots
• Correct checking mechanism will fix errors

What if all computation, communication, control, and progress were speculative?
• Any incorrect computation fixed - maximally speculative
• Any core fault fixed - minimally correct

[Figure: branch predictor array with PC input. Labels: stuck-at fault (X), always not taken]

Optimized System Architecture

Performance impacts eliminated
• Checker RF allows core commit
• No storage hazards
• Few checker cache misses
• Less expensive core storage architecture (same as baseline)

Core cache failures affect checker

Slowdowns: -0.4% (Best: -3.2%, Worst: 0.2%)

[Figure: Core and Checker block diagram. Pipeline: IF, ID, REN, REG, SCHEDULER, EX/MEM, CHK, CT. Storage: two RFs (8 ports each); L0 Data, 4 KB, 4 ports; L0 Inst, 0.5 KB, 2 ports; L1 D-cache, 64 KB, 2 ports; L1 I-cache, 64 KB, 1 port; L2 Unified Cache, 256 KB, 1 port]

Fully Decoupled System Architecture

Checker fully decoupled
• Core L1 caches may fail
• All L2 writebacks from checker
• Core caches flushed on fault
• Core accesses and misses warm up checker caches

Eliminates common mode core cache failures
• But, generates more L2 traffic
• Further optimizations possible

Slowdowns: 1.2% (Best: 0%, Worst: 6.7%)

[Figure: Core and Checker block diagram with a prefetch stream. Pipeline: IF, ID, REN, REG, SCHEDULER, EX/MEM, CHK, CT. Storage: two RFs (8 ports each); L1 Data, 4 KB, 4 ports; L1 Inst, 0.5 KB, 2 ports; L1 D-cache, 64 KB, 2 ports; L1 I-cache, 64 KB, 1 port; L2 Unified Cache, 256 KB, 1 port]

Intelligent Energy Management

Slack detector
• Automatic tuning mechanism
  - ARM’s Intelligent Energy Manager (IEM)
  - Processor voltage automatically tuned to external ambient conditions
  - Inverter chain designed to track most restrictive critical path, margin still required

[Figure: chip floorplan. Blocks: L2 Cache (two), L2 Cache control, L2 tags, Floating point and graphics, Data cache, Cache control, Ex Unit, Control Unit, IO Unit, Mem Control]

EX-Stage Analysis – Optimal Voltage Sweep (GCC)

[Figure: relative IPC and energy (y-axis, 0.3 to 1.5) versus voltage (x-axis, 0.6 to 1.8). Series: Rel Energy, Rel Performance. Annotation: 1.62% Error Rate, 24% Energy Savings]

Simulation Results: Energy-Optimal Voltage

[Figure: bar chart of relative energy (%, y-axis, 0 to 100) per benchmark: bzip, crafty, eon, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr, Average. Series: Total Energy, IPC]

Low-Cost SER and Noise Protection

Only need to address transients in checker
• Checker detects and corrects noise-related faults in core
• Core processor designed without regard to strikes (e.g., no ECC…)

Recycle checker inputs on suspected core fault
• If no error on third execution, transient strike in checker processor
• If error on third execution, core processor fault occurred (e.g., SER, design error)

Protect critical checker control with triple-modular redundant (TMR) logic
• TMR on simple control results in only 1.3% larger checker (synthesized design)
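The third-opinion rule and the TMR vote can be sketched in a few lines. This is an illustrative model only: `checker_execute` is a hypothetical callable standing in for one checker execution, the function is assumed to be called after a first check has already flagged a mismatch, and the bitwise vote is the standard 2-of-3 majority rather than the synthesized control logic.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 vote, as used by triple-modular redundant (TMR) logic."""
    return (a & b) | (a & c) | (b & c)


def classify_fault(core_result, checker_execute, inputs):
    """Decide where a suspected fault came from by recycling the checker inputs.

    The core's result is the first execution and the checker's mismatching
    result the second, so this re-execution is the third opinion.
    """
    third = checker_execute(inputs)
    if third == core_result:
        # No error on the third execution: the checker suffered a transient.
        return "transient in checker", core_result
    # Error again on the third execution: the core result was wrong.
    return "core fault", third


def add(operands):
    """Hypothetical checker operation used for the demonstration."""
    return operands[0] + operands[1]


# A control word corrupted in one of three copies is outvoted.
print(majority(0b1011, 0b1011, 0b0011))      # 11

print(classify_fault(7, add, (3, 4)))        # ('transient in checker', 7)
print(classify_fault(9, add, (3, 4)))        # ('core fault', 7)
```

Only the small control logic needs the three copies, which is consistent with the modest area increase quoted above.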

[Figure: core pipeline (IF, ID, REN, REG, SCHEDULER, EX/MEM) feeding checker stages CHK IF, CHK ID/REG, CHK EX, CHK MEM, and CT; three redundant CTL blocks; path labeled "3rd opinion"]

Fully Testable Microprocessor Designs

Checker structure facilitates manufacturing tests
• All checker inputs exposed to built-in-self-test logic
• Checker provides built-in test signature compression

Checker can be fully tested with small BIST module
• Less than 0.5% area increase

Reduces burden of testing on core
• Missed core defects corrected
• Checker acts as core tester

[Figure: checker pipeline (IF, ID, EX, MEM, CT) with comparators on inst, regs, and res/addr producing OK signals; inputs PC, inst, regs, addr, result; I-cache, D-cache, RF, WT; BIST ROM and Control block with output "Defect Free?"]

Error Rate and Normalized Energy Savings for Chip1

[Figure: percentage error rate (left y-axis, log scale, 1E-8 to 10) and normalized energy (right y-axis, 0.80 to 1.20) versus voltage in volts (x-axis, 1.48 to 1.64). Series: 120MHz, 140MHz]

Measured Power and Energy (120MHz, 27C)

Energy per Instruction = Power/IPC/Freq

         Point of First Failure                    Point of 0.1% Error Rate
         Power (mW)   Energy per Instruction (pJ)  Power (mW)   Energy per Instruction (pJ)
Chip1    104.5        870                          89.7         740
Chip2    119.4        990                          99.6         830
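The energy-per-instruction figures follow from the Power/IPC/Freq formula given on the slide. The sketch below works through one entry; the slide does not list IPC, so the value of 1.0 used here is an assumption, and the function name is illustrative.

```python
def energy_per_instruction_pj(power_mw, ipc, freq_mhz):
    """Energy per instruction in picojoules, computed as Power / IPC / Freq."""
    power_w = power_mw * 1e-3
    freq_hz = freq_mhz * 1e6
    return power_w / ipc / freq_hz * 1e12


# 99.6 mW at 120 MHz. IPC = 1.0 is an assumption, not a value from the slide.
print(round(energy_per_instruction_pj(99.6, 1.0, 120.0)))  # 830
```

With the same assumption, the other power values land within a few picojoules of the tabulated energies, which suggests the table entries are rounded.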
