Dealing with Low-Level Impairments - UCSBparhami/pres_folder/f33-ft-computing-lec… · Testing is...


Oct. 2007 Defect Avoidance and Circumvention Slide 1

Fault-Tolerant Computing

Dealing with Low-Level Impairments

Oct. 2007 Defect Avoidance and Circumvention Slide 2

About This Presentation

Edition    Released     Revised
First      Oct. 2006    Oct. 2007

This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing) by Behrooz Parhami, Professor of Electrical and Computer Engineering at University of California, Santa Barbara. The material contained herein can be used freely in classroom teaching or any other educational setting. Unauthorized uses are prohibited. © Behrooz Parhami

Oct. 2007 Defect Avoidance and Circumvention Slide 3

Defect Avoidance and Circumvention

Oct. 2007 Defect Avoidance and Circumvention Slide 4

Oct. 2007 Defect Avoidance and Circumvention Slide 5

Multilevel Model

[Figure: multilevel model of impairments. Levels: Component, Logic, Information, System, Service, Result. States: Ideal, Defective, Faulty, Erroneous, Malfunctioning, Degraded, Failed. Groupings: Low-Level Impaired, Mid-Level Impaired, High-Level Impaired. Other labels: Initial Entry. Legend: Entry, Deviation, Remedy, Tolerance.]

Oct. 2007 Defect Avoidance and Circumvention Slide 6

The Manufacturing Process for an IC Part

[Figure: Silicon crystal ingot → Slicer → Blank wafer with defects → Processing: 20-30 steps → Patterned wafer (100s of simple or scores of complex processors) → Dicer → Die → Die tester → Good die → Mounting → Microchip or other part → Part tester → Usable part to ship. Dimension labels: 15-30 cm, 30-60 cm, 0.2 cm, ~1 cm (die), ~1 cm (good die).]

Oct. 2007 Defect Avoidance and Circumvention Slide 7

Effect of Die Size on Yield

[Figure: two wafers, one with 120 smaller dies (109 good) and one with 26 larger dies (15 good). Caption: The dramatic decrease in yield with larger dies]

Shown are some random defects; there are also bulk or clustered defects that affect a large region

Die yield =def (Number of good dies) / (Total number of dies)

Die yield = Wafer yield × [1 + (Defect density × Die area) / a]^(-a)

Die cost = (Cost of wafer) / (Total number of dies × Die yield) = (Cost of wafer) × (Die area / Wafer area) / (Die yield)

The parameter a ranges from 3 to 4 for modern CMOS processes
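
The yield and cost formulas above can be tried out numerically. The following is a minimal sketch, not part of the original slides: the function names and all process parameters (wafer yield, defect density, wafer area, wafer cost) are hypothetical values chosen only for illustration, and the exponent a is set to 3, the low end of the stated range.

```python
def die_yield(wafer_yield, defect_density, die_area, a=3.0):
    """Die yield = wafer yield * [1 + (defect density * die area) / a] ** (-a)."""
    return wafer_yield * (1.0 + defect_density * die_area / a) ** (-a)


def die_cost(wafer_cost, wafer_area, die_area, yield_):
    """Die cost = wafer cost * (die area / wafer area) / die yield."""
    return wafer_cost * (die_area / wafer_area) / yield_


# Observed yields for the two wafers in the slide's figure.
small_die_observed = 109 / 120   # roughly 0.91
large_die_observed = 15 / 26     # roughly 0.58

# Hypothetical process parameters (assumptions, not from the slides).
WAFER_YIELD = 1.0        # fraction of usable wafers
DEFECT_DENSITY = 0.5     # defects per cm^2
WAFER_AREA = 700.0       # cm^2, roughly a 30 cm wafer
WAFER_COST = 5000.0      # arbitrary currency units

for area in (0.5, 1.0, 2.0, 4.0):
    y = die_yield(WAFER_YIELD, DEFECT_DENSITY, area)
    c = die_cost(WAFER_COST, WAFER_AREA, area, y)
    print(f"die area {area:4.1f} cm^2: yield {y:.3f}, cost per good die {c:8.2f}")
```

Under these assumed numbers, the cost per good die grows faster than the die area because the yield term in the denominator shrinks as the area grows.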

Oct. 2007 Defect Avoidance and Circumvention Slide 8

Effects of Yield on Testing and Part Reliability

Assume die yield = 50%

Out of 2,000,000 dies manufactured, ≈1,000,000 are defective

To achieve the goal of 100 defects per million (DPM) in parts shipped, we must catch 999,900 of the 1,000,000 defective parts

Therefore, we need a test coverage of 99.99%

Testing is imperfect: missed faults (coverage), false positives

Going from a coverage of 99.9% to 99.99% involves a significant investment in test development and application times

False positives are not a source of difficulty in this context

Discarding another 1-2% due to false positives in testing does not change the scale of the loss
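
The coverage arithmetic on this slide can be checked with a short calculation. This is a sketch under stated assumptions: false positives are ignored, shipped parts are taken to be the good dies plus the defective dies that escape testing, and the function names are hypothetical.

```python
def required_coverage(total_dies, die_yield, target_dpm):
    """Return (allowed_escapes, coverage) needed to meet a shipped-part DPM target.

    Assumes no good die is discarded and that
    shipped parts = good dies + defective dies that escape the test.
    """
    good = total_dies * die_yield
    defective = total_dies - good
    frac = target_dpm / 1e6
    # escapes / (good + escapes) = frac, solved for escapes
    escapes = good * frac / (1.0 - frac)
    coverage = 1.0 - escapes / defective
    return escapes, coverage


def shipped_dpm(total_dies, die_yield, coverage):
    """Defects per million among shipped parts for a given test coverage."""
    good = total_dies * die_yield
    defective = total_dies - good
    escapes = defective * (1.0 - coverage)
    return 1e6 * escapes / (good + escapes)


escapes, coverage = required_coverage(2_000_000, 0.5, 100)
print(f"allowed escapes: {escapes:.1f}, required coverage: {coverage:.4%}")
print(f"DPM at 99.9% coverage:  {shipped_dpm(2_000_000, 0.5, 0.999):.0f}")
print(f"DPM at 99.99% coverage: {shipped_dpm(2_000_000, 0.5, 0.9999):.0f}")
```

With a 50% yield, each additional "9" of coverage cuts the shipped defect rate by about a factor of ten, which is why the step from 99.9% to 99.99% matters.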

Oct. 2007 Defect Avoidance and Circumvention Slide 9

Examples of Random Defects in ICs

Resistive open due to unfilled via [R. Madge et al., IEEE D&T, 2003]

Particle embedded between layers

Oct. 2007 Defect Avoidance and Circumvention Slide 10

Defect Modeling

Extra-material defects are modeled as circular areas

Pinhole defects are tiny breaches in the dielectric

From: http://www.see.ed.ac.uk/research/IMNS/papers/IEE_SMT95_Yield/IEEAbstract.html

Oct. 2007 Defect Avoidance and Circumvention Slide 11

Sensitivity of Layouts to Defects

VLSI layout must be done with defect patterns and their impacts in mind

A balance must be struck with regard to sensitivity to different defect types

[Figure labels: Extra material, Missing material, Killer defect, Latent defect]

Actual photo of a missing-material defect

http://www.midasvision.com/v3.htm

Oct. 2007 Defect Avoidance and Circumvention Slide 12

The Bathtub Curve

Many components fail early on because of residual or latent defects

Components may also wear out due to aging (less so for electronics)

In between the two high-mortality regions lies the useful life period

[Figure: failure rate vs. time for mechanical and electronic components; regions: infant mortality (primarily due to latent defects), useful life (low, constant failure rate λ), end-of-life wearout]

Oct. 2007 Defect Avoidance and Circumvention Slide 13

Survival Probability of Electronic Components

From: http://www.weibull.com/hotwire/issue21/hottopics21.htm

[Figure: percent of parts still working vs. time in years; annotations: Infant mortality, No significant wear-out, Bathtub curve]

Oct. 2007 Defect Avoidance and Circumvention Slide 14

Burn-In and Stress Testing

From: http://www.weibull.com/hotwire/issue21/hottopics21.htm

[Figure: percent of parts still working vs. time in years]

Burn-in and stress tests are done in accelerated form

Difficult to perform on complex and delicate ICs without damaging good parts

Expensive “ovens” are required

Oct. 2007 Defect Avoidance and Circumvention Slide 15

Defect Avoidance vs. Circumvention

Defect Avoidance
  Defect awareness in design, particularly layout and routing
  Extensive quality control during the manufacturing process
  Comprehensive screening, including burn-in and stress tests

Defect Circumvention (Removal)
  Built-in dynamic redundancy on the die or wafer
  Identification of defective parts (visual inspection, testing, association)
  Bypassing or reconfiguration via embedded switches

Defect Circumvention (Tolerance)
  Built-in static redundancy on the die or wafer
  Identification of defective parts (external test or self-test)
  Adjustment or tuning of redundant structures

Oct. 2007 Defect Avoidance and Circumvention Slide 16

Defect Bypassing via Reconfiguration

Works best when the system on die has regular, repetitive structure:
  Memory
  FPGA
  Multicore chip
  CMP (chip multiprocessor)

Irregular (random) logic implies greater redundancy due to replication:
  Replicated structures must not be close to each other
  They should not be very far either (wiring/switching overhead)

Oct. 2007 Defect Avoidance and Circumvention Slide 17

Defects in Memory Arrays

Defect circumvention (removal)
  Provide several extra (spare) rows and/or columns
  Route external connections to defect-free rows and columns

Defect circumvention (tolerance)
  Error-correcting code

With m rows and s spares, can model as m-out-of-(m + s)

Somewhat more complex with both spare rows and columns (still combinational, though)

Modeling with coded scheme to be discussed at the info level

Methods in use since the 1970s; e.g., IBM’s defect-tolerant chip

[Figure: two memory arrays, each with spare rows and spare columns; labels: Peripheral reconfiguration elements, Defective row, Defective column]
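
The m-out-of-(m + s) model mentioned on this slide can be sketched as a binomial sum. The sketch below assumes rows fail independently and that each row is defect-free with the same probability; the function names and the sample numbers (1024 rows, per-row probability 0.999) are hypothetical and not from the slides.

```python
from math import comb


def k_out_of_n(k, n, p):
    """Probability that at least k of n independent units are good,
    where each unit is good with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def array_yield(m, s, p_row):
    """Yield of an array needing m good rows out of m + s fabricated rows."""
    return k_out_of_n(m, m + s, p_row)


# Assumed example: 1024 required rows, each defect-free with probability 0.999.
for spares in range(5):
    print(f"{spares} spare rows: yield {array_yield(1024, spares, 0.999):.4f}")
```

Handling spare rows and spare columns together needs a more careful count of which defect patterns are repairable, which this one-dimensional sketch does not attempt.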

Oct. 2007 Defect Avoidance and Circumvention Slide 18

Yield Improvement in Memory Arrays

Example of IBM’s experimental 16 Mb memory chip

Combines the use of spare rows/columns in memory arrays with ECC

Four quadrants, each with 16 spare rows & 24 spare columns

ECC corrects any single error via 9 check bits (137 data bits)

Bits assigned to the same word are separated by 8 bit positions

[Figure: yield (0 to 100) vs. avg. number of failing cells per chip (0 to 4000); curves: ECC only, Spares only, ECC and spares]

Oct. 2007 Defect Avoidance and Circumvention Slide 19

Defects in FPGAs

[Figure: (a) Portion of PAL with storable output (8-input ANDs, D flip-flop, muxes); (b) Generic structure of an FPGA (I/O blocks, configurable logic blocks (CLBs), programmable connections)]

Defect circumvention (removal)
  Provide several extra (spare) CLBs, I/O blocks, and connections
  Route external connections to available blocks

Defect circumvention (tolerance)
  Not applicable

Oct. 2007 Defect Avoidance and Circumvention Slide 20

Defects in Multicore Chips or CMPs

Defect circumvention (removal)

Similar to FPGAs, except that processors are the replacement entities

Interprocessor interconnection network is the main challenge

Will discuss the switching and reconfiguration aspects in more detail when we get to the malfunction level in our multilevel model

Oct. 2007 Defect Avoidance and Circumvention Slide 21

Circumventing Defects in Processor Arrays

Extensive research done on how to salvage a working array from one that has been damaged by defects

Proposed methods differ in
  Types and placement of switches (e.g., 4-port, single/double-track)
  Types and placement of spares
  Algorithms for determining working configurations
  Ways of effecting reconfiguration
  Methods of assessing resilience

The next few slides show some methods based on 4-port, 2-state switches

Oct. 2007 Defect Avoidance and Circumvention Slide 22

Defect Tolerance Schemes for Linear Arrays

[Figures: two linear arrays of processors P0 to P3; labels: Spare or Defective, Bypassed, I/O, Test, Mux. Captions: "A linear array with a spare processor and embedded switching" and "A linear array with a spare processor and reconfiguration switches"]
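
The bypassing idea behind these linear-array schemes can be sketched in software as a mapping from logical array positions to working physical processors. This is only an illustration of the logical effect of the switches, not a model of the switch hardware; the function name and the example sizes are assumptions.

```python
def reconfigure_linear(num_needed, num_physical, defective):
    """Map logical positions 0..num_needed-1 to working physical processors.

    Defective units are bypassed in order; returns None when too few
    working processors remain to form the required array.
    """
    working = [p for p in range(num_physical) if p not in defective]
    if len(working) < num_needed:
        return None
    return working[:num_needed]


# Assumed example: a 4-processor array built from 5 physical units (one spare).
print(reconfigure_linear(4, 5, set()))     # no defects: the spare stays unused
print(reconfigure_linear(4, 5, {2}))       # physical unit 2 is bypassed
print(reconfigure_linear(4, 5, {1, 3}))    # two defects exceed the single spare
```

With a single spare, any one defective unit can be bypassed, so the array behaves as a 4-out-of-5 system under this simplified view.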

Oct. 2007 Defect Avoidance and Circumvention Slide 23

Defect Tolerance in 2D Arrays

[Figure: two panels, each showing processors Pa, Pb, Pc, Pd; other label: Mux. Caption: Two types of reconfiguration switching for 2D arrays]

Assumption: A defective unit can be bypassed in its row/column by means of a separate switching mechanism (not shown)

Oct. 2007 Defect Avoidance and Circumvention Slide 24

A Reconfiguration Scheme for 2D Arrays

Spare Row

Spare Column

A 5 × 5 working array salvaged from a 6 × 6 redundant mesh through reconfiguration switching

Seven defective processors in a 5 × 5 array and their associated compensation paths

Oct. 2007 Defect Avoidance and Circumvention Slide 25

Limits of Reconfigurability

A set of three defective nodes, one of which cannot be accommodated by the compensation-path method.

No compensation path exists for this faulty node

Extension: We can go beyond the 3-defect limit by providing spare rows on top and bottom and spare columns on either side

Oct. 2007 Defect Avoidance and Circumvention Slide 26

Combinational Modeling

[Figure label: No compensation path exists for this faulty node]

Pessimistic/Easy: Any 3 bad cells lead to failure

Model m × m array as (m^2 – 2)-out-of-m^2 system

Realistic/Hard: Enumerate all combinations of bad cells that lead to failure and assess the probability of at least one of these combinations occurring
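
The pessimistic model can be evaluated directly: the array is counted as working only when at most two of its m × m cells are bad. The sketch below assumes cells fail independently with a common probability p of being good; the function name and the sample values of p are hypothetical. The realistic model is not attempted here, since it depends on the exact compensation-path rules.

```python
from math import comb


def pessimistic_array_reliability(m, p):
    """(m^2 - 2)-out-of-m^2 model: the array works iff at most 2 cells are bad.

    p is the probability that a single cell is good; cells are assumed
    to fail independently.
    """
    n = m * m
    q = 1.0 - p
    return sum(comb(n, bad) * q**bad * p ** (n - bad) for bad in range(3))


# Assumed example: a 6 x 6 mesh with several per-cell probabilities.
for p in (0.99, 0.95, 0.90):
    print(f"p = {p:.2f}: pessimistic reliability {pessimistic_array_reliability(6, p):.4f}")
```

Because some patterns of three or more bad cells can still be accommodated, this value is a lower bound on what the realistic enumeration would give under the same assumptions.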