Soft Errors re-examined - ChipEx · Jamil R. Mazzawi Founder and CEO v1.2. April 30, 2014 Optima...

Soft Errors re-examined

Jamil R. Mazzawi, Founder and CEO

Optima Design Automation Ltd ©

www.optima-da.com

v1.2, April 30, 2014


Topics:

• Soft errors: definitions

• FIT Rate

• Soft-errors problem strengthening in new nodes

• Logical masking and derating

• Mitigation techniques

• Flip-flop selection

• CosmicASIC™


Soft-errors

• Cosmic particles affect our chips

• Particles can flip the values in flops and memory bits


Measuring soft-errors: FIT rate

• FIT – Failure In Time

– Number of failures per 1 billion ($10^9$) hours of operation

– FIT = $10^9$ / MTBF (hours)

• FIT of a system = $\sum_i \mathrm{FIT}_i$, summed over all its components $i$

• FIT for a server farm = sum of the FIT of all its servers, routers, etc.

• FIT for a chip = sum of the FIT of all its flops, memory bits, combinational logic, etc.
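The FIT definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a tool from the deck: the function names are made up for the example, and the farm figures (1000 servers, 1000 years MTBF each, 365.25-day years) are borrowed from the deck's server-farm example.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours


def fit_from_mtbf_hours(mtbf_hours):
    """FIT = expected failures per 10^9 hours of operation."""
    return 1e9 / mtbf_hours


def mtbf_hours_from_fit(fit):
    """Inverse of fit_from_mtbf_hours."""
    return 1e9 / fit


def system_fit(component_fits):
    """The FIT of a system is the sum of the FIT of its components."""
    return sum(component_fits)


# 1000 servers, each with an MTBF of 1000 years
server_fit = fit_from_mtbf_hours(1000 * HOURS_PER_YEAR)
farm_fit = system_fit([server_fit] * 1000)
farm_mtbf_years = mtbf_hours_from_fit(farm_fit) / HOURS_PER_YEAR
print(f"farm FIT = {farm_fit:.0f}, farm MTBF = {farm_mtbf_years:.2f} years")
```

Because FIT rates add while MTBF values do not, budgets are easier to split across components when they are expressed in FIT.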


Example: FIT requirement of a chip

• Server farm for bank XYZ, with 1000 servers

• Required MTBF(the farm) = 1 year

• MTBF(each server) = 1000 years

– Includes power supply, fan, memory, the CPU chip, other chips

• MTBF(CPU chip) = 1200 years

• FIT(CPU) = $\frac{10^9}{1200 \times 365.25 \times 24} = \frac{114077}{1200} \approx 95$

• Given: FIT(single flip-flop) = 0.01 (@NYC)

• Given: Chip has 300,000 flops

• FIT(all flops) = 300,000 × 0.01 = 3,000 > 95, so we have a problem

Does not include:

1- Derating factors

2- Other components of the chip (e.g., memories)
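The arithmetic of this example can be checked with a short script. It is only a sketch of the numbers on the slide; the variable names are illustrative, and the "overshoot" ratio is an added convenience showing by how much the raw flop FIT exceeds the budget.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours

# FIT budget implied by the required CPU MTBF of 1200 years
cpu_mtbf_years = 1200
fit_budget = 1e9 / (cpu_mtbf_years * HOURS_PER_YEAR)

# Raw (underated) FIT contributed by the flip-flops alone
fit_per_flop = 0.01   # given on the slide, at NYC
num_flops = 300_000   # given on the slide
raw_flop_fit = fit_per_flop * num_flops

# How many times the raw flop FIT exceeds the budget
overshoot = raw_flop_fit / fit_budget

print(f"budget = {fit_budget:.1f} FIT")
print(f"raw flop FIT = {raw_flop_fit:.0f}")
print(f"overshoot = {overshoot:.1f}x")
```

The overshoot is the combined reduction that derating and mitigation would have to deliver for the flops alone, before memories and other logic are counted.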


Problem strengthening these days

• Newer technologies are more sensitive

– Smaller transistor dimensions

=> Smaller critical charge

=> The electrical charge of the particles is relatively bigger than the critical charge

• Two effects that cancel each other:

– Smaller area per transistor decreases the per-transistor FIT rate

– More transistors per mm² increases the total FIT (of the chip)

Together, they have almost no influence on the FIT rate


Where is it important:

• Memories: Solved problem!

– Was the only area that needed protection in older nodes

– Solution: ECC protection

• Flip-flops: Hottest unsolved problem!

– Flops must be protected in newer technology nodes

• Combinatorial logic: Not a problem yet

– Second-degree problem


Single Event Upset vs. Soft-Error

• SEU: A particle caused a flip-flop or memory bit to flip its value

• Soft-Error: An SEU has propagated and caused a system failure visible from outside

• Most SEUs do not convert to Soft-errors


Most SEUs do not convert to Soft-errors

• Definition: FIT rate with derating factors

– FIT calculated taking into account the SEUs that vanish without becoming soft errors

(Ilan Beer, IBM, HVC 2008)
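One way to read the derated FIT rate is as the raw FIT of each flop scaled by the fraction of its SEUs that actually become soft errors. The sketch below assumes that simple multiplicative model; the function name, the grouping of flops, and the derating factors are all hypothetical values chosen for illustration, not measurements from the deck.

```python
def derated_chip_fit(groups):
    """Sum the derated FIT over groups of flops.

    groups: iterable of (flop_count, raw_fit_per_flop, derating_factor),
    where derating_factor is the assumed fraction of SEUs (0.0 to 1.0)
    that convert to soft errors.
    """
    total = 0.0
    for count, raw_fit, derating in groups:
        if not 0.0 <= derating <= 1.0:
            raise ValueError("derating factor must be between 0 and 1")
        total += count * raw_fit * derating
    return total


# Hypothetical split of a 300,000-flop chip at 0.01 FIT per flop
groups = [
    (100_000, 0.01, 0.50),  # assumed: half of these SEUs become soft errors
    (200_000, 0.01, 0.05),  # assumed: these SEUs are mostly masked
]
print(f"derated FIT = {derated_chip_fit(groups):.0f}")
```

With a derating factor of 1.0 for every flop, the result falls back to the raw FIT sum.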


Common mitigation methods:

• TMR with Majority voting

• DMR with C-Element

• Soft-error detectors

• SE detection with Parity tree

• More….


Solution 1: TMR with Majority voter

TMR – Triple Modular Redundancy.

Extra area ~ +205%, extra power ~ +205%, FIT = 0 (-100%)
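A behavioral sketch of the majority voter, assuming the three copies are modeled as Python integers so that the vote is taken bit by bit. The function names are illustrative; real TMR is implemented in the cell library or RTL, not in software.

```python
def majority(a, b, c):
    """Bitwise majority vote: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def tmr_read(copies):
    """Return the voted value of three redundant copies of a register."""
    a, b, c = copies
    return majority(a, b, c)


stored = 0b1011
copies = [stored, stored, stored]
copies[1] ^= 0b0100  # an SEU flips one bit in one of the three copies
print(bin(tmr_read(copies)))
```

A single upset in any one copy is outvoted by the other two. The vote fails only if the same bit is upset in two copies before the value is refreshed.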


Solution 2: DMR with C-element

DMR – Dual Modular Redundancy using an additional C-element.

Additional area and power > +100%, FIT reduced to 5%
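A C-element passes its inputs through only when they agree and otherwise holds its previous output. The class below is a purely behavioral sketch under that definition; the class and method names are invented for the example, and it does not model the residual failure rate quoted on the slide.

```python
class CElement:
    """Behavioral model of a C-element fed by two redundant flop copies."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        """Follow the inputs when they agree; hold the last output when they differ."""
        if a == b:
            self.out = a
        return self.out


c = CElement()
print(c.update(1, 1))  # both copies hold 1, the output follows
print(c.update(1, 0))  # an SEU flipped one copy, the output holds
print(c.update(0, 0))  # both copies were legitimately rewritten to 0
```

Because a single upset makes the two copies disagree, the corrupted value is blocked at the C-element rather than propagated downstream.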


Solution 3: Soft-Error detectors

These techniques are usually used for detecting single-bit flips in pipeline storage elements.

One simple method is to duplicate the critical node and connect both outputs to an XOR gate.

The additional area and power are about 100%.
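The duplicate-and-XOR detector can be sketched behaviorally as follows, assuming the main register and its duplicate (called a shadow copy here, a name chosen for the example) are modeled as integers. Any nonzero XOR result means the two copies disagree.

```python
def mismatch_mask(main, shadow):
    """XOR of the two copies: a 1 marks each bit position where they differ."""
    return main ^ shadow


def seu_detected(main, shadow):
    """True when the register and its duplicate disagree in at least one bit."""
    return mismatch_mask(main, shadow) != 0


main = shadow = 0b0110
print(seu_detected(main, shadow))  # copies agree

shadow ^= 0b0010  # an SEU flips one bit of the duplicate
print(seu_detected(main, shadow), bin(mismatch_mask(main, shadow)))
```

This only detects the upset. It cannot tell which copy is correct, so recovery (for example, a pipeline flush and replay) has to come from elsewhere in the design.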


Summary of different solutions

• TMR (Triple Modular Redundancy), TMR with majority voting: triplicated storage elements with a majority voter at the output. Extra area +200%, extra power +200%, FIT down to 0.

• Three time-delayed storage nodes.

• DMR (Dual Modular Redundancy), C-element: copy of the storage element. Extra area +105%, extra power +100%, FIT down by 95%.

• Using already existing scan design-for-testability: extra area +20%, extra power ~+15%.

• Error detection, using a duplicated storage element with XOR: extra area +103%, extra power ~+100%.

• Using a transient detector, used in pipelines and recoverable models: performance penalty, not always possible, FIT down to 0.

• Parity tree.

• Etc.


Flip-flop selection is needed

• Hardening all flops is not viable

– Silicon costs: 25%-35%

– Influence on: Unit cost, NRE cost and Power

• Solution:

– Apply these solutions selectively

– Harden flops that are more sensitive to SEUs

• “A flop Sensitive to SEU” means:

– An SEU on the flop has a higher probability of converting to a soft error
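Selective hardening can be pictured as a greedy choice: harden the most sensitive flops first until the FIT target is met. The sketch below is one possible illustration, not the method of any particular tool. The flop names and FIT values are hypothetical, and the assumption that a hardened flop keeps 5% of its FIT is borrowed from the DMR figure quoted earlier in the deck.

```python
def select_flops_to_harden(flops, fit_target, residual=0.05):
    """Greedy selection of flops to harden.

    flops: dict mapping flop name -> effective FIT (raw FIT x derating).
    residual: assumed fraction of a flop's FIT that remains after hardening.
    Returns (hardened flop names, resulting total FIT).
    """
    total = sum(flops.values())
    hardened = []
    # Visit flops from most to least sensitive
    for name, fit in sorted(flops.items(), key=lambda kv: kv[1], reverse=True):
        if total <= fit_target:
            break
        total -= fit * (1.0 - residual)
        hardened.append(name)
    return hardened, total


# Hypothetical per-flop effective FIT values
flops = {"fsm_state": 0.009, "ctrl_valid": 0.006, "data_lo": 0.001, "debug_cnt": 0.0001}
hardened, total = select_flops_to_harden(flops, fit_target=0.005)
print(hardened, f"{total:.5f}")
```

If the returned total is still above the target after every flop has been hardened, the target is not reachable with the assumed residual, and the caller has to check for that case.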


Existing selection methods: Error Injection simulation

• Run a lot of simulations

• Each simulation injects a single error on a random flop, at a random cycle (simulating SEU)

• If the test bench detects an error, this SEU counts as a soft error.

• How many simulations to run?

– Option 1: Loop over all flops and all cycles

– Option 2: Inject errors on randomly selected flops and cycles (lower accuracy)
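The two options can be compared on a toy model. The sketch below is an assumption-laden illustration, not a real RTL flow: the three-flop "design", its flop names, and the comparison against a fault-free run (standing in for the test bench) are all invented for the example.

```python
import random

FLOPS = ("toggle", "enable", "debug")


def simulate(num_cycles, flip=None):
    """Run the toy design; flip=(flop, cycle) inverts that flop at that cycle."""
    state = {"toggle": 0, "enable": 1, "debug": 0}
    outputs = []
    for cycle in range(num_cycles):
        if flip is not None and flip[1] == cycle:
            state[flip[0]] ^= 1  # inject the SEU
        outputs.append(state["toggle"] & state["enable"])  # the only observed signal
        state = {
            "toggle": state["toggle"] ^ 1,  # free-running toggle
            "enable": 1,                    # reloaded every cycle
            "debug": state["toggle"],       # never reaches the output
        }
    return outputs


def is_soft_error(num_cycles, flop, cycle):
    """An SEU counts as a soft error if the observed outputs differ from the clean run."""
    return simulate(num_cycles, (flop, cycle)) != simulate(num_cycles)


def exhaustive_sensitivity(num_cycles):
    """Option 1: inject on every flop at every cycle."""
    return {
        flop: sum(is_soft_error(num_cycles, flop, c) for c in range(num_cycles)) / num_cycles
        for flop in FLOPS
    }


def sampled_sensitivity(num_cycles, num_runs, seed=0):
    """Option 2: inject on random flops at random cycles."""
    rng = random.Random(seed)
    hits = {flop: 0 for flop in FLOPS}
    runs = {flop: 0 for flop in FLOPS}
    for _ in range(num_runs):
        flop = rng.choice(FLOPS)
        cycle = rng.randrange(num_cycles)
        runs[flop] += 1
        hits[flop] += is_soft_error(num_cycles, flop, cycle)
    return {flop: hits[flop] / runs[flop] for flop in FLOPS if runs[flop]}


print(exhaustive_sensitivity(8))
print(sampled_sensitivity(8, num_runs=20))
```

The exhaustive loop costs one simulation per flop per cycle, which is what makes it impractical on a real chip. Sampling cuts the run count, but the per-flop estimates become noisy when each flop is hit only a few times.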


Error Injection simulation

• Benefits:

– Almost the only available option now

• Drawbacks:

– Time-consuming: 2-4 weeks with a low sample rate

– Consumes compute resources

• 2-4 weeks x 10-20 machines during peak project time

– Internal/in-house solution: needs someone to develop it and maintain it

– Solution available only for big companies


Introducing: CosmicASIC™

• 1000x faster than existing solutions

• Plug-and-play solution

• 100% accuracy


Summary

• The Soft-errors problem is strengthening

• Mitigation techniques exist:

– But can cost 25%-35% in silicon, NRE and power

• Flip-flop selection is a must

– Solves the soft-error problem at a fraction of the cost

• CosmicASIC™: Flip-flop selection EDA tool

Visit us at booth A03 in the exhibition area

Or at: http://www.optima-da.com