Design for Reliability - University of California, Los...

Aug. 14, 2007 IC-DFN Design for Reliability -- Tim Cheng Electrical and Computer Engineering University of California, Santa Barbara From Self-Test to Self-Recovery


Page 1: Design for Reliability - University of California, Los Angelescadlab.cs.ucla.edu/icsoc/protected-dir/IC-DFN_Agenda_Aug_2007/Tim Cheng... · lead to problems for Sony PS3? * “With

Aug. 14, 2007 IC-DFN

Design for Reliability --

Tim Cheng
Electrical and Computer Engineering

University of California, Santa Barbara

From Self-Test to Self-Recovery


Increasing Failure Sources and Failure Rates

[Figure: on-die temperature variations; temperature scale from 40 to 110 °C]

Failure sources:
• SEU
• Random defects
• Parametric variations
• Soft errors
• Design errors


Harder to Design Reliable Chips

• First-silicon success rate has been dropping
  – ~30% for complex ASIC/[email protected] (according to an ASIC vendor)
  – Pre-silicon logic bugs have been increasing at 3X-4X per generation for Intel's processors
• Yield has been dropping for volume production, and it takes longer to ramp up the yield
  – IBM's 8-core Cell processor chips: ~10-20% yield (July 2006)
• "Better than worst-case" design resulting in failures w/o defects
  – Increase in variation of process parameters with scaling
  – Worst-case design getting way too conservative


Design for Reliability

• Systems must be designed to cope with failures
• Efficient silicon debug is becoming a must
  – Design for debugging would become necessary
• Must have embedded self-test for error detection
  – For both testing in manufacturing line and in-the-field testing
  – Both on-line and off-line testing
• Re-configurability and adaptability for error recovery make better sense
  – Using spares to replace defective parts
  – Using redundancy to mask errors
  – Using tuning to compensate variations


From Test to Recovery/Reconfiguration – Some Examples

• Memory: BIST → BISD → BISR a common practice

• On-line sensing and tuning
  – On-chip leakage sensing → Leakage control (adaptive body bias)
  – On-chip thermal sensing → Cooling adjustment
  – On-chip delay sensing → Performance tuning

• Analog/RF/High-speed IO components
  – Digitally-assisted analog design and test

• Multicore system with spares


Analog Circuit Design Trade-offs

• Power dissipation vs. speed and precision trade-offs
  – Feature size ↓ → ratio of power dissipation of analog to digital ↑
  – E.g., a 12-bit ADC consumes as much energy as switching 300K gates in 90nm technology

Source: B. Murmann, IEEE Micro, March-April 2006.


Digitally-Assisted Analog Design

• Leverage powerful digital computing capability
  – Pros: improve precision and/or speed of analog circuit without dramatically increasing its power dissipation
• Use a less-precise/lower-performance analog circuit plus a digital processor which:
  – measures errors of the analog block
  – then compensates its errors and tunes its performance

[Figure: analog input → relaxed analog circuit → analog/digital output; a complex digital processor measures the error and adjusts/compensates the analog circuit]


Example: Pipelined ADC

• ADC suffers from nonlinearity due to capacitor mismatch, finite opamp gain, etc.
• System output is down-sampled and compared with signal from slow-but-accurate ADC
  – Update tap values in digital FIR filter to minimize the errors of inaccurate ADC

[Figure: calibration loop spanning the analog and digital domains. Blocks: TH, inaccurate ADC, slow-but-accurate ADC, digital filter, down-sample, adaptation. Signals: Input, Output, N, error ε]
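The adaptation loop on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the scheme from the talk: it assumes an LMS update rule, a hypothetical distortion model for the inaccurate ADC (gain error plus a small contribution from the previous sample), and an ideal slow-but-accurate reference ADC that samples every fourth input. All names and constants are illustrative.

```python
import random


def adapt_correction_filter(num_samples=20000, num_taps=3,
                            decimation=4, mu=0.1, seed=0):
    """LMS-adapt FIR taps that correct a (hypothetical) inaccurate ADC."""
    rng = random.Random(seed)
    taps = [1.0] + [0.0] * (num_taps - 1)   # start as a pass-through filter
    history = [0.0] * num_taps              # newest fast-ADC sample first
    prev_x = 0.0
    sq_errors = []
    for n in range(num_samples):
        x = rng.uniform(-1.0, 1.0)          # true sample value
        raw = 0.9 * x + 0.1 * prev_x        # assumed ADC distortion model
        prev_x = x
        history = [raw] + history[:-1]
        corrected = sum(w * h for w, h in zip(taps, history))
        if n % decimation == 0:
            # The slow reference ADC (assumed ideal) is only available
            # on every `decimation`-th sample, so adapt only then.
            err = x - corrected
            sq_errors.append(err * err)
            taps = [w + mu * err * h for w, h in zip(taps, history)]
    return taps, sq_errors


taps, sq_errors = adapt_correction_filter()
early = sum(sq_errors[:50]) / 50    # mean squared error at the start
late = sum(sq_errors[-50:]) / 50    # mean squared error after adaptation
print("taps:", taps)
print("early MSE:", early, "late MSE:", late)
```

In this toy setup the first tap settles near the inverse of the assumed gain error, and the squared error against the reference drops well below its starting value.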


Example: RF Polar Transmitter

• VCO-gain and loop linearity in PLL are critical to accuracy of the polar modulator
• Adjust capacitance in VCO and current in CP to calibrate VCO-gain and tune loop-gain

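[Figure: PLL with self-alignment. Blocks: FD, CP, LPF, VCO, divider, frequency measurement, alignment algorithm. Signals: Ref, TX Data, TX Out]

[Figure: polar modulator. Digital TX data → coordinate rotation → amplitude and phase paths; the phase path drives the PLL (Ref input), followed by the modulator and PA to RF Out; digital/analog boundary marked]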


Example: Adaptive Equalizer in High-Speed Serial-Link (HSSL) Receiver

• EQ in RX is realized by a FIR filter
• Tap coefficients (C1, C2, …) in EQ are adjusted by the adaptation engine

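[Figure: RX consisting of EQ, CDR (clock) and adaptation block with error ε; the EQ is a FIR filter with delay elements and taps C1, C2, C3 between EQ input and EQ output; waveforms at points (A) through (D) illustrate channel-induced ISI and its equalization]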


Observability Problem of Embedded Analog Blocks

• Direct observation of analog signal is problematic
• Self-compensation by digital block causes masking of defects in analog block

[Figure: within the SOC, analog input → relaxed analog circuit → analog output → next stage; a complex digital processor measures the error and adjusts/compensates; the analog output is marked "hard to observe & fault masking"]


Digitally-Assisted Analog Testing

• Proposed solution: Applying specific stimulus to analog block and analyzing the digital calibration and adaptation results stored in digital block for fault detection and diagnosis

[Figure: analog input/test stimulus → relaxed analog circuit → analog/digital output; a complex digital processor measures the error and adjusts/compensates (digitally-assisted analog design); the digital calibration/adaptation result is analyzed (digitally-assisted analog testing)]


Fault Detection for Pipelined ADC & PLL of RF Transmitter

• Analyze tap values in the digital filter of ADC
• Analyze calibration data in self-alignment unit of PLL

[Figure: the PLL self-alignment loop (FD, CP, LPF, VCO, divider, frequency measurement, alignment algorithm) and the pipelined ADC calibration loop (TH, inaccurate ADC, slow-but-accurate ADC, digital filter, down-sample, adaptation, error ε), each annotated "capture calibration results"]


Existing Methods for Testing Adaptive EQ in HSSL RX

• External scope to capture EQ output via access point
  – Signal integrity is degraded due to extra loading
• On-chip waveform monitor
  – Increases circuit complexity and device cost

[Figure: RX with EQ and CDR (clock and data outputs); the equalized signal is observed either through an access point and drive circuit or by an on-chip monitor, producing an eye diagram]


Testable Design for Adaptive EQ

• Insert extra DfT circuitry:
  – FF chain to store digital tap coefficients ci
  – A switch and a pattern generator to replace the slicer output
• DfT circuitry is all digital

[Figure: EQ and CDR (clock) between input and output; adaptation block with error ε producing tap coefficients ci; added pattern generator and FF chain with scan out]


Experiment: Detecting Defects in a 5-Tap Feed-forward EQ

• Illustrating 3 types of single-fault instances:
  – Fault (a): Stuck-at faults at one of the 5 taps
  – Fault (b): 20% gain error at one of the 5 taps
  – Fault (c): 10% DC offset due to nonzero common input

[Figure: 5-tap feed-forward EQ with delay elements (D), taps C1–C5, main cursor tap and adaptation block; Fault (a) marked at a tap, Fault (b) shown as a tap at 80% gain]


Fault Masking of Fault (a) if Detection is Made by Examining EQ Output

• The difference in eye-diagrams between fault-free and faulty EQ is small

[Figure: 5-tap EQ with Fault (a) at a tap, and eye diagrams over a 1-bit period for the fault-free EQ and the faulty EQ; the values η = 1.287 and η = 1.217 are shown]


Testing Faults (a) & (b) by Proposed Method

• Test Fault (a): with stimuli AI(1) & DI
• Test Fault (b): with stimuli AI(2) & DI
  – Locate the fault in the 1st tap

[Figure: test setup. ATE (PE card, DSP, pattern generator) applies analog stimuli AI(1), AI(2), AI(3) and digital stimulus DI to the DUT (EQ, CDR, adaptation) and reads back the tap coefficients ci; stimulus waveforms are drawn over a 1-bit period]

[Figure: 5-tap EQ with Fault (a) and Fault (b) (tap at 80% gain)]

[Figure: three plots of tap weight versus tap number (1–5):
 – Without vs. with stuck-at fault, stimuli: AI(1) & DI
 – Without vs. with gain error, stimuli: AI(1) & DI
 – Without vs. with gain error, stimuli: AI(2) & DI]
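The comparison behind these plots (adapted tap weights of the device under test versus those of a fault-free EQ) can be sketched as below. This is only an illustration: the tolerance, the function name, and the tap values are hypothetical and are not taken from the experiment.

```python
def find_suspect_taps(measured, reference, tolerance=0.1):
    """Return 1-based tap numbers whose adapted weight deviates from the
    fault-free reference by more than `tolerance`."""
    if len(measured) != len(reference):
        raise ValueError("tap vectors must have the same length")
    return [k + 1
            for k, (m, r) in enumerate(zip(measured, reference))
            if abs(m - r) > tolerance]


# Hypothetical tap weights scanned out after adaptation (5-tap EQ).
reference = [-0.10, 1.00, -0.30, 0.10, -0.05]   # fault-free device
measured = [0.40, 0.95, -0.28, 0.10, -0.05]     # device under test

suspects = find_suspect_taps(measured, reference)
print("suspect taps:", suspects)
```

Because the decision uses only the digital coefficients, it needs no analog observation point; choosing a sensible tolerance would depend on the spread of coefficients across known-good devices.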


From Test to Recovery/Reconfiguration – Some Examples

• Memory: BIST → BISD → BISR a common practice

• On-line sensing and tuning
  – On-chip leakage sensing → Leakage control (adaptive body bias)
  – On-chip thermal sensing → Cooling adjustment
  – On-chip delay sensing → Performance tuning

• Analog/RF/High-speed IO components
  – Digitally-assisted analog design and test

• Multicore system with spares


Could 10-20% yields for Cell processors lead to problems for Sony PS3? *

“With standard SiGe single-core processors, IBM can achieve yields of up to 95%. But with a chip like the Cell processor, you're lucky to get 10 or 20 percent.”

“If you really want to be focused on reliability and up-time availability, you can design one of these chips to self-detect. You can ship it with eight cores working, blow one of them, and from a user perspective you would have self-healed it in the field.”

“With such systems in place, yields could conceivably increase in a best-case scenario to 40% - still significantly lower than the 95% yields that IBM and others enjoyed during the single-core, "one-by-one" era.”

* Electronic News 7/7/06 and TGDaily 7/14/06, Interview of Tom Reeves, VP of semiconductor and technology services at IBM


Need New Test Strategy and Yield Analysis for Multi-core Systems with Spares

• Understanding impact of core yield, test quality and spare scheme on final system yield and cost
  – How many spare cores should be included?
  – How many working spares in a shipped chip would be sufficient?
  – What is the required core defect quality to achieve required system reliability?
  – Can we skip burn-in and repair infant mortality in the field?

[Figures: IBM CELL Processor, 8 SPE (ISSCC05); Sun Niagara, 8 Sparc cores (IEEE Micro 2005); Intel 80-tile network on chip (ISSCC07)]


M-out-of-N-core System

• Definition: A system that has N cores in total and requires at least M defect-free cores for operation
  – Cell for PS3 is a 7-out-of-8-core system
• System effective yield is a function of:
  – core yield
  – number of active cores (M)
  – total number of cores (N)
  – number of partitions of a core
  – defect coverage
• Finer spare granularity gives better spare utilization but lower core yield
  – Partitioning requires additional control and configuring logic, thus increasing core area

[Figure: two cores (Core 1, Core 2) with partition blocks B1–B4]


Scenarios for Consideration

1. Manufacturing testing and repair in the manufacturing line
  – Screening defective chips from shipping
  – Improving effective yield
2. Self-testing (on-line or off-line) and repair in the field
  – Covering defects missed by manufacturing testing and new failures in-the-field
  – Reducing service cost


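Yield Model for One Core

• Raw yield of a core, $y_{C_i}$, is a function of area (A), defect density (λ), and clustering factor (α):

$$ y_{C_i}(A, \lambda, \alpha) = \left(1 + \frac{A \times \lambda}{\alpha}\right)^{-\alpha} $$

  – α is the degree to which defects are clustered

• Prob(Ci passing testing) $= y'_{C_i}$:

$$ y'_{C_i}(A, \lambda, \alpha, \Omega) = \left(1 + \frac{A \times \lambda \times \Omega}{\alpha}\right)^{-\alpha} $$

• Prob(core Ci is defect free | Ci passes testing):

$$ \frac{y_{C_i}}{y'_{C_i}} $$

* de Sousa and Agrawal, DATE 2000
* Kuo and Kim, Proc. of IEEE 1999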
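As a numeric illustration of this yield model, the sketch below assumes the clustered (negative-binomial) form y = (1 + Aλ/α)^(−α) for the raw yield and y' = (1 + AλΩ/α)^(−α) for the probability of passing test, with Ω assumed to be the defect coverage of the test. The function names and sample parameter values are hypothetical.

```python
def raw_core_yield(area, defect_density, alpha):
    """Raw yield y: probability that a core has no defect."""
    return (1.0 + area * defect_density / alpha) ** (-alpha)


def pass_probability(area, defect_density, alpha, coverage):
    """y': probability that a core passes a test with the given coverage
    (coverage is the assumed meaning of Omega, between 0 and 1)."""
    return (1.0 + area * defect_density * coverage / alpha) ** (-alpha)


def defect_free_given_pass(area, defect_density, alpha, coverage):
    """Prob(core is defect free | core passes testing) = y / y'."""
    return (raw_core_yield(area, defect_density, alpha)
            / pass_probability(area, defect_density, alpha, coverage))


# Hypothetical numbers: 1 cm^2 core, 0.5 defects/cm^2, alpha = 2, 95% coverage.
y = raw_core_yield(1.0, 0.5, 2.0)
y_pass = pass_probability(1.0, 0.5, 2.0, 0.95)
quality = defect_free_given_pass(1.0, 0.5, 2.0, 0.95)
print(y, y_pass, quality)
```

With perfect coverage (Ω = 1) the two yields coincide and every core that passes is defect-free; with lower coverage more cores pass than are actually good.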


Example: 1-out-of-2-Core System

• A chip could be shipped if:
  – Default core C1 passes testing: Prob = y'C1
  – C1 fails testing and spare C2 passes: Prob = (1 − y'C1) × y'C2

• A shipped chip is indeed a working chip if:
  – C1 passes testing and is indeed fault-free: Prob = yC1
  – C1 fails testing, C2 passes and is indeed fault-free: Prob = (1 − y'C1) × yC2

• Effective yield (probability that a chip can be shipped and is indeed working): yC1 + (1 − y'C1) × yC2

• The reject ratio can be easily calculated
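A small sketch of the 1-out-of-2 bookkeeping follows. The slide gives no formula for the reject ratio; here it is assumed to mean the fraction of shipped chips that are not actually working. The function name and sample yields are hypothetical.

```python
def one_out_of_two(y1, y1_pass, y2, y2_pass):
    """Return (ship probability, effective yield, reject ratio) for a
    1-out-of-2-core chip with default core C1 and spare C2.

    y*      : probability that the core is truly fault-free
    y*_pass : probability that the core passes testing (y*_pass >= y*)
    """
    shipped = y1_pass + (1.0 - y1_pass) * y2_pass
    effective = y1 + (1.0 - y1_pass) * y2
    # Assumed definition: shipped chips that are in fact faulty.
    reject_ratio = 1.0 - effective / shipped
    return shipped, effective, reject_ratio


# Hypothetical per-core yields, identical for both cores.
shipped, effective, reject_ratio = one_out_of_two(0.64, 0.653, 0.64, 0.653)
print(shipped, effective, reject_ratio)
```

When test coverage is perfect (y' = y), the shipped and effective yields are equal and the reject ratio is zero.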


Effective Yield for M-out-of-N System

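$$ y_e(N, M, y_C, y'_C) = \sum_{i=0}^{M} \binom{M}{i} (1 - y'_C)^{i} \, (y_C)^{M-i} \, P(N-M,\, i) $$

$$ P(S, i) = \sum_{j=i}^{S} \binom{S}{j} (1 - y'_C)^{S-j} \, (y_C)^{j} $$

• $\binom{M}{i} (1 - y'_C)^{i} (y_C)^{M-i}$: probability that (M−i) out of M default cores are fault-free
• $P(N-M, i)$: probability that at least i out of (N−M) spares are fault-free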
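The effective-yield sum can be evaluated directly. The sketch below assumes the form y_e = Σ_{i=0..M} C(M,i) (1 − y'_C)^i (y_C)^(M−i) P(N−M, i) with P(S, i) = Σ_{j=i..S} C(S,j) (y_C)^j (1 − y'_C)^(S−j), and identical yields for all cores; the function names are hypothetical.

```python
from math import comb


def spares_sufficient(S, i, y, y_pass):
    """P(S, i): probability that at least i of S spares are fault-free."""
    return sum(comb(S, j) * y ** j * (1.0 - y_pass) ** (S - j)
               for j in range(i, S + 1))


def effective_yield(N, M, y, y_pass):
    """Effective yield of an M-out-of-N system.

    y      : probability that a core is truly fault-free
    y_pass : probability that a core passes testing
    """
    return sum(comb(M, i) * (1.0 - y_pass) ** i * y ** (M - i)
               * spares_sufficient(N - M, i, y, y_pass)
               for i in range(M + 1))


# 3-out-of-6 system, core yield 0.65, perfect test coverage (y' = y).
ye = effective_yield(6, 3, 0.65, 0.65)
print(round(ye, 4))
```

For a 3-out-of-6 system with core yield 0.65 and 100% coverage this evaluates to roughly 88%. With perfect coverage the result equals the probability that at least M of the N cores are good.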


System Yield vs Core Yield (9-out-of-N-core Systems)

[Figure: four plots of system yield for 9-out-of-N-core systems, one each for core yield = 80%, 70%, 60% and 50%]


Example: 3-out-of-6-Core System

• Sample assumptions:
  – Manufacturing test defect coverage: 100%
  – Core yield: 0.65
• Effective chip yield: 88.2%
• Shipped chips with S remaining spares
  – (A) = 8.5%, (B) = 27.7%, (C) = 37.2%, (D) = 26.6%

[Figure: example configurations of default cores and spares (C1–C6) for (A) 3 remaining spares, (B) 2 remaining spares, (C) 1 remaining spare and (D) no remaining spare; the case counts shown are "total 3 cases", "total 6 cases" and "total 7 cases"]

Should we ship chips without remaining spares?
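The split of shipped chips by remaining spares can be approximated with a plain binomial model. This sketch assumes 100% test coverage, independent and identical cores, and that a chip with k good cores ships with k − M remaining spares; it is not necessarily the exact model behind the slide's figures.

```python
from math import comb


def remaining_spare_shares(N, M, y):
    """Share of shipped chips having s = 0..N-M remaining spares,
    assuming perfect coverage and independent, identical cores."""
    probs = {k - M: comb(N, k) * y ** k * (1.0 - y) ** (N - k)
             for k in range(M, N + 1)}
    shipped = sum(probs.values())   # probability the chip can ship at all
    return {s: p / shipped for s, p in probs.items()}


shares = remaining_spare_shares(6, 3, 0.65)
for s in sorted(shares, reverse=True):
    print(s, "remaining spares:", round(100 * shares[s], 1), "%")
```

Under these assumptions the shares come out within about a tenth of a percentage point of the (A) to (D) values quoted above.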


Scenarios for Consideration

1. Manufacturing testing and repair in the manufacturing line
  • Screening defective chips from shipping
  • Improving effective yield
2. Self-testing (on-line or off-line) and repair in the field
  – Off-line BIST or on-line checking for fault detection in the field
    • Covering defects missed by manufacturing testing as well as new failures that occur after chip shipment
  – Reconfiguration in the field


Should We Ship Chips Without Fault-Free Spares?

• Not shipping them reduces effective yield and, thus, increases unit manufacturing cost

• Shipping them increases field return rate and, thus, increases unit service cost

• Factors for consideration:
  – Core failure rate in the field
  – r = (cost of replacing/servicing an irreparable chip in the field) / (manufacturing cost per shipped chip)
  – Number of fault-free spares


Core Failure Rate in the Field

• Weibull distribution model for a core's lifecycle*:
  – 2 parameters: shape (β) & scale (λ)
  – Scale parameter: the time at which 63.2% of units will fail
  – Shape parameter:
    • < 1, infant mortality
    • = 1, grace period
    • > 1, breakdown period

$$ f(t; \lambda, \beta) = \frac{\beta}{\lambda} \left(\frac{t}{\lambda}\right)^{\beta - 1} $$

[Figure: failure rate (f) versus time (t), with three regions: infant mortality, grace period, breakdown period]

* Carulli and Anderson, IEEE Design & Test of Computers, March/April 2006
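The Weibull failure rate and the 63.2% property of the scale parameter can be checked numerically. The sketch assumes the standard two-parameter Weibull form, f(t; λ, β) = (β/λ)(t/λ)^(β−1) for the failure rate and 1 − exp(−(t/λ)^β) for the fraction failed; the function names and sample values are hypothetical.

```python
import math


def weibull_failure_rate(t, scale, shape):
    """Failure (hazard) rate at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def weibull_fraction_failed(t, scale, shape):
    """Fraction of units that have failed by time t."""
    return 1.0 - math.exp(-((t / scale) ** shape))


# At t equal to the scale parameter, 1 - 1/e (about 63.2%) of units have
# failed, whatever the shape.
at_scale = weibull_fraction_failed(10.0, 10.0, 0.5)
print(round(at_scale, 3))

# The shape parameter sets the trend of the failure rate over time.
for shape in (0.5, 1.0, 2.0):
    early = weibull_failure_rate(1.0, 10.0, shape)
    later = weibull_failure_rate(2.0, 10.0, shape)
    print(shape, early, later)
```

A shape below 1 gives a falling failure rate (infant mortality), a shape of 1 gives a constant rate, and a shape above 1 gives a rising rate (breakdown).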


Failure Rate for M-out-of-N Systems

• Core field-failure-rate over the time t:

$$ F_C = \frac{y_C}{y'_C} \times e^{-(t/\lambda)^{\beta}} $$

• Probability of an M-out-of-N chip NOT failing at time t:

$$ R_{M\text{-out-of-}N}(t) = \sum_{i=0}^{N-M} P_S(i) \times R(i, t) $$

  – $P_S(i)$: probability of a shipped chip with i spares passing test
  – $R(i, t)$: probability of a chip with i available spares not failing at time t


Yield Analysis Framework for Multicore Systems with Spares

• We developed an analysis framework that can be used to:
  – Calculate effective system yield
  – Determine the number of spares for cost minimization
  – Analyze feasibility of eliminating burn-in for multi-core systems
  – Determine whether to ship chips with no or few working spares

• High-quality testing (both manufacturing testing and in-field testing) remains one of the most critical requirements for multi-core systems


Conclusions

• Systems must be designed to cope with failures
• Cost-effective embedded self-test will replace existing manufacturing test methodologies for heterogeneous SoC/SiP
• Post-silicon tuning/calibration/reconfiguration is becoming promising, and necessary, for Si nano systems