Design for Reliability - University of California, Los...
Aug. 14, 2007 IC-DFN
Design for Reliability: From Self-Test to Self-Recovery
Tim Cheng
Electrical and Computer Engineering
University of California, Santa Barbara
Increasing Failure Sources and Failure Rates
[Figure: on-die temperature variations, temperature scale 40 to 110 °C]
Failure sources: SEU, random defects, parametric variations, soft errors, design errors
Harder to Design Reliable Chips
• First-silicon success rate has been dropping
– ~30% for complex ASIC/[email protected] (according to an ASIC vendor)
– Pre-silicon logic bugs have been increasing at 3X-4X per generation for Intel’s processors
• Yield has been dropping for volume production, and it takes longer to ramp up the yield
– IBM’s 8-core Cell processor chips: ~10-20% yield (July 2006)
• “Better than worst-case” design results in failures without defects
– Increase in variation of process parameters with scaling
– Worst-case design is getting way too conservative
Design for Reliability
• Systems must be designed to cope with failures
• Efficient silicon debug is becoming a must
– Design for debugging will become necessary
• Must have embedded self-test for error detection
– For both testing in the manufacturing line and in-the-field testing
– Both on-line and off-line testing
• Re-configurability and adaptability for error recovery make better sense
– Using spares to replace defective parts
– Using redundancy to mask errors
– Using tuning to compensate for variations
From Test to Recovery/Reconfiguration – Some Examples
• Memory: BIST → BISD → BISR a common practice
• On-line sensing and tuning
– On-chip leakage sensing → Leakage control (adaptive body bias)
– On-chip thermal sensing → Cooling adjustment
– On-chip delay sensing → Performance tuning
• Analog/RF/High-speed IO components
– Digitally-assisted analog design and test
• Multicore system with spares
Analog Circuit Design Trade-offs
• Power dissipation vs. speed and precision trade-offs
– feature size↓ → ratio of power dissipation of analog to digital ↑
– E.g., a 12-bit ADC consumes as much energy as switching 300K gates in 90nm technology
Source: B. Murmann, IEEE Micro, March-April 2006.
Digitally-Assisted Analog Design
• Leverage powerful digital computing capability
– Pros: improve precision and/or speed of the analog circuit without dramatically increasing its power dissipation
• Use a less-precise/lower-performance analog circuit plus a digital processor which:
– measures the errors of the analog block
– then compensates for its errors and tunes its performance
[Figure: block diagram with labels: analog input, relaxed analog circuit, analog/digital output, complex digital processor, measure error, adjust/compensate]
Example: Pipelined ADC
• ADC suffers from nonlinearity due to capacitor mismatch, finite opamp gain, etc.
• System output is down-sampled and compared with the signal from a slow-but-accurate ADC
– Update tap values in the digital FIR filter to minimize the errors of the inaccurate ADC
[Figure: block diagram with labels: input, TH, inaccurate ADC (N bits), digital filter (N bits), output, down-sample, slow-but-accurate ADC, adaptation, error ε; analog and digital domains marked]
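The adaptation loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the inaccurate ADC is modeled with a hypothetical linear error (a gain error plus a small dependence on the previous sample) rather than the nonlinear errors of a real pipelined ADC, the slow-but-accurate ADC is treated as ideal, and the update rule is a plain LMS step. The function names, step size, and decimation factor are not from the slides.

```python
import random


def mean_square(values):
    """Average of squared values; used to compare error power."""
    return sum(v * v for v in values) / len(values)


def lms_calibrate(n_samples=20000, taps=3, mu=0.05, decimation=4,
                  gain=0.9, memory=0.05, seed=1):
    """Adapt FIR taps so the corrected output tracks a reference ADC.

    gain and memory define a hypothetical error model for the
    inaccurate ADC: raw[n] = gain * x[n] + memory * x[n-1].
    """
    rng = random.Random(seed)
    weights = [1.0] + [0.0] * (taps - 1)   # start as a pass-through filter
    history = [0.0] * taps                 # recent raw ADC samples, newest first
    x_prev = 0.0
    errors = []
    for n in range(n_samples):
        x = rng.uniform(-1.0, 1.0)         # analog input sample
        raw = gain * x + memory * x_prev   # inaccurate (fast) ADC output
        x_prev = x
        history = [raw] + history[:-1]
        if n % decimation == 0:            # reference ADC only sees every 4th sample
            corrected = sum(w * h for w, h in zip(weights, history))
            err = x - corrected            # reference ADC assumed ideal
            errors.append(err)
            # LMS update: move each tap along the error gradient
            weights = [w + mu * err * h for w, h in zip(weights, history)]
    return weights, errors


weights, errors = lms_calibrate()
print("converged taps:", [round(w, 4) for w in weights])
print("error power, first 50 updates:", mean_square(errors[:50]))
print("error power, last 500 updates:", mean_square(errors[-500:]))
```

With this error model the first tap should settle near 1/gain, and the converged tap values form the kind of digital record that the later slides propose to inspect for fault detection.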
Example: RF Polar Transmitter
• VCO gain and loop linearity in the PLL are critical to the accuracy of the polar modulator
• Adjust capacitance in the VCO and current in the CP to calibrate VCO gain and tune loop gain
[Figure: polar modulator block diagrams with labels: digital TX data, coordinate rotation, amplitude and phase paths, PLL (Ref, FD, CP, LPF, VCO, divider), frequency measurement, alignment algorithm, modulator, PA, RF out; analog and digital domains marked]
Example: Adaptive Equalizer in High-Speed Serial-Link (HSSL) Receiver
• EQ in RX is realized by an FIR filter
• Tap coefficients (C1, C2, …) in EQ are adjusted by the adaptation engine
[Figure: RX block diagram with labels: input, EQ (delay elements, taps C1, C2, C3), EQ output, CDR, clock, adaptation, error ε; waveforms at nodes (A) to (D), including channel-induced ISI]
Observability Problem of Embedded Analog Blocks
• Direct observation of the analog signal is problematic
• Self-compensation by the digital block masks defects in the analog block
[Figure: SoC block diagram with labels: analog input, relaxed analog circuit, complex digital processor (measure error, adjust/compensate), analog output, next stage; annotated "Hard to observe & fault masking"]
Digitally-Assisted Analog Testing
• Proposed solution: apply a specific stimulus to the analog block and analyze the digital calibration and adaptation results stored in the digital block for fault detection and diagnosis
[Figure: block diagram with labels: analog input/test stimulus, relaxed analog circuit, complex digital processor (measure error, adjust/compensate), analog/digital output, analyze digital calibration/adaptation result; the design loop is marked "digitally-assisted analog design" and the analysis path "digitally-assisted analog testing"]
Fault Detection for Pipelined ADC & PLL of RF Transmitter
• Analyze tap values in the digital filter of the ADC
• Analyze calibration data in the self-alignment unit of the PLL
[Figure: the pipelined ADC and PLL block diagrams from the earlier examples, annotated "Capture calibration results"]
Existing Methods for Testing Adaptive EQ in HSSL RX
• External scope to capture EQ output via an access point
– Signal integrity is degraded due to extra loading
• On-chip waveform monitor
– Increases circuit complexity and device cost
[Figure: RX block diagram with labels: signal, EQ, equalized signal, CDR, clock, data, access point, drive circuit, on-chip monitor, eye diagram]
Testable Design for Adaptive EQ
• Insert extra DfT circuitry:
– FF chain to store digital tap coefficients ci
– A switch and a pattern generator to replace the slicer output
• DfT circuitry is all digital
[Figure: block diagram with labels: input, EQ, CDR, clock, output, adaptation, error ε, tap coefficients Ci, pattern generator, FF chain, scan out]
Experiment: Detecting Defects in a 5-Tap Feed-forward EQ
• Illustrating 3 types of single-fault instances:
– Fault (a): Stuck-at fault at one of the 5 taps
– Fault (b): 20% gain error at one of the 5 taps
– Fault (c): 10% DC offset due to nonzero common input
[Figure: 5-tap feed-forward EQ with labels: EQ input, delay elements D, taps C1 to C5, adaptation, EQ output, main cursor tap; Fault (a) and Fault (b) (80%) marked]
Fault Masking of Fault (a) if Detection is Made by Examining EQ Output
• The difference in eye-diagrams between fault-free and faulty EQ is small
[Figure: 5-tap EQ with Fault (a) marked, and eye diagrams over a 1-bit period labeled "Fault-Free EQ" and "Faulty EQ"; extracted labels η = 1.287 and η = 1.217]
Testing Faults (a) & (b) by Proposed Method
• Test Fault (a): with stimuli AI(1) & DI
• Test Fault (b): with stimuli AI(2) & DI
– Locate the fault in the 1st tap
[Figure: test setup with labels: ATE, DSP, PE card, pattern generator, DUT (EQ, CDR, adaptation, clock), stimuli AI(1), AI(2), AI(3), and DI over a 1-bit period, tap coefficients ci; 5-tap EQ with Fault (a) and Fault (b) (80%) marked]
[Plots: tap weight vs. tap number (1 to 5). Panel 1, stimuli AI(1) & DI: without vs. with stuck-at fault. Panel 2, stimuli AI(1) & DI: without vs. with gain error. Panel 3, stimuli AI(2) & DI: without vs. with gain error.]
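The comparison behind these plots can be illustrated with a small Python sketch: tap weights scanned out after adaptation are compared against a fault-free signature, and any tap that deviates by more than a tolerance is flagged. The tap values, the tolerance, and the function name are hypothetical and chosen only for illustration; they are not the measured data from the experiment.

```python
def diagnose_taps(measured, golden, tolerance=0.1):
    """Return 1-based indices of taps whose converged weight deviates
    from the fault-free (golden) signature by more than tolerance."""
    if len(measured) != len(golden):
        raise ValueError("tap vectors must have the same length")
    return [
        index + 1
        for index, (m, g) in enumerate(zip(measured, golden))
        if abs(m - g) > tolerance
    ]


# Hypothetical converged tap weights for one stimulus.
golden_taps = [1.00, -0.30, 0.10, -0.05, 0.02]   # fault-free reference
scanned_taps = [0.80, -0.30, 0.10, -0.05, 0.02]  # values read from the FF chain

print(diagnose_taps(scanned_taps, golden_taps))  # prints [1]
```

Because only digital values are compared, a check of this kind needs no analog observation point, which is the motivation for storing the coefficients in a scannable FF chain.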
From Test to Recovery/Reconfiguration – Some Examples
• Memory: BIST → BISD → BISR a common practice
• On-line sensing and tuning
– On-chip leakage sensing → Leakage control (adaptive body bias)
– On-chip thermal sensing → Cooling adjustment
– On-chip delay sensing → Performance tuning
• Analog/RF/High-speed IO components
– Digitally-assisted analog design and test
• Multicore system with spares
Could 10-20% yields for Cell processors lead to problems for Sony PS3? *
“With standard SiGe single-core processors, IBM can achieve yields of up to 95%. But with a chip like the Cell processor, you're lucky to get 10 or 20 percent."
“If you really want to be focused on reliability and up-time availability, you can design one of these chips to self-detect. You can ship it with eight cores working, blow one of them, and from a user perspective you would have self-healed it in the field.”
“With such systems in place, yields could conceivably increase in a best-case scenario to 40% - still significantly lower than the 95% yields that IBM and others enjoyed during the single-core, "one-by-one" era.”
* Electronic News 7/7/06 and TGDaily 7/14/06, Interview of Tom Reeves, VP of semiconductor and technology services at IBM
Need New Test Strategy and Yield Analysis for Multi-core Systems with Spares
• Understanding impact of core yield, test quality and spare scheme on final system yield and cost
– How many spare cores should be included?
– How many working spares in a shipped chip would be sufficient?
– What is the required core defect quality to achieve required system reliability?
– Can we skip burn-in and repair infant mortality in the field?
[Figure: examples of multi-core chips: IBM Cell processor (8 SPE; ISSCC05), Sun Niagara (8 Sparc cores; IEEE Micro 2005), Intel 80-tile network on chip (ISSCC07)]
M-out-of-N-core System
• Definition: A system that has N cores in total and requires at least M defect-free cores for operation
– Cell for PS3 is a 7-out-of-8-core system
• System effective yield is a function of:
– core yield
– number of active cores (M)
– total number of cores (N)
– number of partitions of a core
– defect coverage
• Finer spare granularity gives better spare utilization but lower core yield
– Partitioning requires additional control and configuring logic, thus increasing core area
[Figure: Core 1 and Core 2, with a core partitioned into blocks B1 to B4]
Scenarios for Consideration
1. Manufacturing testing and repair in the manufacturing line
– Screening defective chips from shipping
– Improving effective yield
2. Self-testing (on-line or off-line) and repair in the field
– Covering defects missed by manufacturing testing and new failures in the field
– Reducing service cost
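Yield Model for One Core
• Raw yield of a core, y_Ci, is a function of area (A), defect density (λ), and clustering factor (α):
$y_{C_i}(A, \lambda, \alpha) = \left(1 + \frac{A \times \lambda}{\alpha}\right)^{-\alpha}$
– α is the degree to which defects are clustered
• Probability that core Ci passes testing, y'_Ci, where Ω denotes the defect coverage of the test:
$y'_{C_i}(A, \lambda, \alpha, \Omega) = \left(1 + \frac{A \times \lambda \times \Omega}{\alpha}\right)^{-\alpha}$
• Prob (core Ci is defect free | Ci passes testing):
$\frac{y_{C_i}}{\mathrm{Prob}(C_i \text{ passing testing})} = \frac{y_{C_i}}{y'_{C_i}}$
* de Sousa and Agrawal, DATE 2000 * Kuo and Kim, Proc. of IEEE 1999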
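A short Python sketch of the yield model, written directly from the two formulas on this slide, with Ω read as the defect coverage of the test. The function names and the numeric inputs are illustrative assumptions.

```python
def core_yield(area, defect_density, alpha):
    """Raw yield y_C = (1 + A*lambda/alpha)^(-alpha)."""
    return (1.0 + area * defect_density / alpha) ** (-alpha)


def pass_probability(area, defect_density, alpha, coverage):
    """Probability of passing test, y'_C = (1 + A*lambda*Omega/alpha)^(-alpha)."""
    return (1.0 + area * defect_density * coverage / alpha) ** (-alpha)


def prob_good_given_pass(area, defect_density, alpha, coverage):
    """Prob(core is defect free | core passes testing) = y_C / y'_C."""
    return (core_yield(area, defect_density, alpha)
            / pass_probability(area, defect_density, alpha, coverage))


# Hypothetical inputs: unit area, 0.5 defects per unit area, alpha = 2.
print(core_yield(1.0, 0.5, 2.0))                  # raw yield
print(pass_probability(1.0, 0.5, 2.0, 0.9))       # 90% defect coverage
print(prob_good_given_pass(1.0, 0.5, 2.0, 0.9))   # quality of passing cores
```

With 100% coverage the two probabilities coincide and every passing core is defect free; lower coverage lets more cores pass, so the conditional probability drops below 1.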
Example: 1-out-of-2-Core System
• A chip could be shipped if:
– Default core C1 passes testing: Prob = y'C1
– C1 fails testing and spare C2 passes: Prob = (1 − y'C1) × y'C2
• A shipped chip is indeed a working chip if:
– C1 passes testing and is indeed fault-free: Prob = yC1
– C1 fails testing, C2 passes and is indeed fault-free: Prob = (1 − y'C1) × yC2
• Effective yield (probability that a chip can be shipped and is indeed working): yC1 + (1 − y'C1) × yC2
• The reject ratio can be easily calculated
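The probabilities above can be combined as in the following Python sketch. It takes the reject ratio to mean the fraction of shipped chips that are not actually working, which is an assumption about the slide's intent; the function name and the example yields are also illustrative.

```python
def one_out_of_two(y1, y1_pass, y2, y2_pass):
    """Shipping statistics for a 1-out-of-2-core chip.

    y1, y2: probability that core C1 / C2 is fault-free.
    y1_pass, y2_pass: probability that C1 / C2 passes testing.
    """
    shipped = y1_pass + (1.0 - y1_pass) * y2_pass   # chip can be shipped
    effective = y1 + (1.0 - y1_pass) * y2           # shipped and working
    reject_ratio = 1.0 - effective / shipped        # shipped but faulty (assumed definition)
    return shipped, effective, reject_ratio


# Hypothetical values: 65% of cores are fault-free, 70% pass the test.
shipped, effective, reject_ratio = one_out_of_two(0.65, 0.70, 0.65, 0.70)
print(f"shipped: {shipped:.4f}")
print(f"effective yield: {effective:.4f}")
print(f"reject ratio: {reject_ratio:.4f}")
```

When the test has full coverage (pass probability equal to yield), the shipped fraction and the effective yield coincide and the reject ratio is zero.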
Effective Yield for M-out-of-N System
$y_e(N, M, y_C, y'_C) = \sum_{i=0}^{M} \binom{M}{i} (1 - y'_C)^{i} (y_C)^{M-i} \, P(N-M, i)$
$P(S, i) = \sum_{j=i}^{S} \binom{S}{j} (y_C)^{j} (1 - y'_C)^{S-j}$
• $\binom{M}{i} (1 - y'_C)^{i} (y_C)^{M-i}$: probability that (M − i) out of M default cores are fault-free
• $P(N-M, i)$: probability that at least i out of (N − M) spares are fault-free
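A direct Python transcription of the two sums above, offered as a sketch. The function names are illustrative, and all cores are assumed to share the same yield y_C and pass probability y'_C, as in the formula.

```python
from math import comb


def spares_fault_free(spares, needed, y, y_pass):
    """P(S, i): at least `needed` of `spares` spare cores are fault-free."""
    return sum(
        comb(spares, j) * y ** j * (1.0 - y_pass) ** (spares - j)
        for j in range(needed, spares + 1)
    )


def effective_yield(n, m, y, y_pass):
    """y_e(N, M, y_C, y'_C) for an M-out-of-N-core system."""
    return sum(
        comb(m, i) * (1.0 - y_pass) ** i * y ** (m - i)
        * spares_fault_free(n - m, i, y, y_pass)
        for i in range(m + 1)
    )


# 3-out-of-6 system, core yield 0.65, 100% defect coverage (y' = y).
print(round(effective_yield(6, 3, 0.65, 0.65), 4))
# 1-out-of-2 system with the same cores.
print(round(effective_yield(2, 1, 0.65, 0.65), 4))
```

With full coverage the 1-out-of-2 case reduces to y + (1 − y) × y, matching the earlier example, and the 3-out-of-6 case evaluates to roughly 0.883, close to the 88.2% quoted for that configuration.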
System Yield vs Core Yield (9-out-of-N-core Systems)
[Plots: four panels, for core yield = 80%, 70%, 60%, and 50%]
Example: 3-out-of-6-Core System
• Sample assumptions:
– Manufacturing test defect coverage: 100%
– Core yield: 0.65
• Effective chip yield: 88.2%
• Shipped chips with S remaining spares:
– (A) 3 remaining spares: 8.5%
– (B) 2 remaining spares: 27.7%
– (C) 1 remaining spare: 37.2%
– (D) No remaining spare: 26.6%
[Figure: example assignments of cores C1 to C6 to default cores and spares for cases (A) to (D); groups of configurations are labeled "total 6 cases", "total 3 cases", and "total 7 cases"]
Should we ship chips without remaining spares?
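The split of shipped chips by remaining spares can be approximated with a short Python sketch. It assumes independent cores with identical yield and 100% defect coverage, so the number of fault-free cores is binomial; the function name is illustrative. Under these assumptions the results land within roughly 0.1 percentage point of the figures quoted above, and the small differences may come from rounding or from modeling details not visible in the slides.

```python
from math import comb


def remaining_spare_distribution(n, m, y):
    """Among shippable chips (at least m good cores out of n), return
    {remaining spares: fraction of shipped chips}."""
    good = {
        k: comb(n, k) * y ** k * (1.0 - y) ** (n - k)
        for k in range(m, n + 1)
    }
    shipped = sum(good.values())   # effective yield under full coverage
    return {k - m: p / shipped for k, p in good.items()}


distribution = remaining_spare_distribution(6, 3, 0.65)
for spares in sorted(distribution, reverse=True):
    print(f"{spares} remaining spares: {distribution[spares]:.1%}")
```

About a quarter of the shipped chips have no spare left in this setting, which is what makes the shipping question above worth asking.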
Scenarios for Consideration
1. Manufacturing testing and repair in the manufacturing line
– Screening defective chips from shipping
– Improving effective yield
2. Self-testing (on-line or off-line) and repair in the field
– Off-line BIST or on-line checking for fault detection in the field
• Covering defects missed by manufacturing testing as well as new failures that occur after chip shipment
– Reconfiguration in the field
Should We Ship Chips Without Fault-Free Spares?
• Not shipping them reduces effective yield and, thus, increases unit manufacturing cost
• Shipping them increases field return rate and, thus, increases unit service cost
• Factors for consideration:
– Core failure rate in the field
– r = (cost of replacing/servicing an irreparable chip in the field) / (manufacturing cost per shipped chip)
– Number of fault-free spares
Core Failure Rate in the Field
• Weibull distribution model for a core’s lifecycle*:
– 2 parameters: shape (β) and scale (λ)
– Scale parameter: the time at which 63.2% of units will fail
– Shape parameter:
• < 1, infant mortality
• = 1, grace period
• > 1, breakdown period
– Failure rate: $f(t; \lambda, \beta) = \frac{\beta}{\lambda} \left(\frac{t}{\lambda}\right)^{\beta - 1}$
* Carulli and Anderson, IEEE Design & Test of Computers, March/April 2006
[Figure: failure rate (f) vs. time (t), showing the infant mortality, grace period, and breakdown period regions]
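A small Python sketch of the Weibull quantities on this slide: the failure rate f(t; λ, β) = (β/λ)(t/λ)^(β−1) and the cumulative fraction of failed units, 1 − exp(−(t/λ)^β). Function names and the numeric parameters are illustrative assumptions.

```python
import math


def weibull_failure_rate(t, scale, shape):
    """Failure rate f(t) = (shape/scale) * (t/scale)^(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def weibull_fraction_failed(t, scale, shape):
    """Fraction of units that have failed by time t."""
    return 1.0 - math.exp(-((t / scale) ** shape))


scale = 10.0  # hypothetical scale parameter, in arbitrary time units
for shape in (0.5, 1.0, 2.0):
    early = weibull_failure_rate(1.0, scale, shape)
    late = weibull_failure_rate(5.0, scale, shape)
    failed_at_scale = weibull_fraction_failed(scale, scale, shape)
    print(f"shape={shape}: rate(1)={early:.4f}, rate(5)={late:.4f}, "
          f"failed by t=scale: {failed_at_scale:.3f}")
```

The three shape values reproduce the three regions: a falling rate for β < 1, a constant rate for β = 1, and a rising rate for β > 1. In every case the failed fraction at t = λ is 1 − 1/e, about 63.2%.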
Failure Rate for M-out-of-N Systems
• Core field-failure-rate over the time t:
$F_C = \frac{y_C}{y'_C} \times e^{-(t/\lambda)^{\beta}}$
• Probability of an M-out-of-N chip NOT failing at time t:
$R_{M\text{-out-of-}N}(t) = \sum_{i=0}^{N-M} P_S(i) \times R(i, t)$
– $P_S(i)$: probability of a shipped chip with i spares passing test
– $R(i, t)$: probability of a chip with i available spares not failing at time t
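The weighted sum above can be sketched in Python as follows. The slide does not spell out R(i, t), so this sketch assumes that a chip with i available spares survives when at least M of its M + i working cores are still functional, with cores failing independently; that modeling choice, the function names, and the Weibull parameters are assumptions. The spare distribution P_S(i) reuses the percentages from the 3-out-of-6 example.

```python
import math
from math import comb


def core_term(t, scale, shape, y=1.0, y_pass=1.0):
    """F_C from the slide: (y_C / y'_C) * exp(-(t/scale)^shape)."""
    return (y / y_pass) * math.exp(-((t / scale) ** shape))


def chip_survival(m, spares, p_core):
    """Assumed R(i, t): at least m of the m + spares cores still work,
    given each core works independently with probability p_core."""
    total = m + spares
    return sum(
        comb(total, k) * p_core ** k * (1.0 - p_core) ** (total - k)
        for k in range(m, total + 1)
    )


def system_reliability(m, spare_distribution, p_core):
    """Sum over i of P_S(i) * R(i, t)."""
    return sum(
        p_s * chip_survival(m, spares, p_core)
        for spares, p_s in spare_distribution.items()
    )


# P_S(i) taken from the 3-out-of-6 example: spares -> fraction of shipped chips.
spare_distribution = {3: 0.085, 2: 0.277, 1: 0.372, 0: 0.266}
# Hypothetical Weibull parameters and evaluation time; full coverage (y = y').
p_core = core_term(t=1.0, scale=20.0, shape=1.0)
print(f"per-core term: {p_core:.4f}")
print(f"chip reliability: {system_reliability(3, spare_distribution, p_core):.4f}")
```

Chips shipped with more spares dominate the sum, so comparing the result with and without the zero-spare entry gives a rough feel for how much shipping spare-less chips costs in field reliability.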
Yield Analysis Framework for Multicore Systems with Spares
• We developed an analysis framework that can be used to:
– Calculate effective system yield
– Determine the number of spares for cost minimization
– Analyze feasibility of eliminating burn-in for multi-core systems
– Determine whether to ship chips with no or few working spares
• High-quality testing (both manufacturing testing and in-field testing) remains one of the most critical requirements for multi-core systems
Conclusions
• Systems must be designed to cope with failures
• Cost-effective embedded self-test will replace existing manufacturing test methodologies for heterogeneous SoC/SiP
• Post-silicon tuning/calibration/reconfiguration is becoming promising, and necessary, for Si nano systems