Soft Errors re-examined - ChipEx · Jamil R. Mazzawi Founder and CEO v1.2. April 30, 2014 Optima...
April 30, 2014 · Optima Design Automation Ltd ©
April 30, 2014
Soft Errors re-examined
Jamil R. Mazzawi, Founder and CEO
www.optima-da.com
v1.2
Topics:
• Soft errors: definitions
• FIT Rate
• The soft-error problem is growing in new nodes
• Logical masking and derating
• Mitigation techniques
• Flip-flop selection
• CosmicASIC™
Soft errors
• Cosmic particles affect our chips
• Particles can flip the values stored in flops and memory bits
Measuring soft-errors: FIT rate
• FIT: Failures In Time
– The number of failures expected in 1 billion hours of operation
– $\mathrm{FIT} = 10^9 / \mathrm{MTBF}$, with MTBF in hours
• FIT of a system = $\sum_i \mathrm{FIT}_i$, where $i$ runs over all its components
• FIT for a server farm = sum of the FIT of all its servers, routers, etc.
• FIT for a chip = sum of the FIT of all flops, memory bits, combinational logic, etc.
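The definitions above are plain arithmetic, so a short Python sketch may help. The function names and the component FIT values below are illustrative assumptions, not figures from the talk.

```python
BILLION_HOURS = 1e9  # FIT counts failures per 10^9 hours of operation


def fit_from_mtbf(mtbf_hours: float) -> float:
    """FIT = 10^9 / MTBF, with MTBF given in hours."""
    return BILLION_HOURS / mtbf_hours


def mtbf_from_fit(fit: float) -> float:
    """Inverse relation: MTBF (hours) = 10^9 / FIT."""
    return BILLION_HOURS / fit


def system_fit(component_fits) -> float:
    """FIT values add up: the system FIT is the sum over its components."""
    return sum(component_fits)


# Hypothetical chip: the three FIT values are made up for illustration only.
chip_parts = {"flops": 40.0, "memory bits": 25.0, "combinational logic": 5.0}
chip_fit = system_fit(chip_parts.values())
print(f"chip FIT = {chip_fit}, MTBF = {mtbf_from_fit(chip_fit):.0f} hours")
```

Because FIT is additive while MTBF is not, budgets are usually easier to split across components in FIT than in MTBF.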
Example: FIT req. of a chip
• Server farm for bank XYZ, with 1000 servers
• Required MTBF(the farm) = 1 year
• So MTBF(each server) = 1000 years, because the farm's FIT is the sum of 1000 server FITs
– Includes power supply, fan, memory, the CPU chip, other chips
• MTBF(CPU chip) = 1200 years
• $\mathrm{FIT(CPU)} = \dfrac{10^9}{1200 \times 365.25 \times 24} = \dfrac{114077}{1200} \approx 95$
• Given: FIT(single flip-flop) = 0.01 (@NYC)
• Given: Chip has 300,000 flops
• FIT(all flops) = 0.01 × 300,000 = 3,000 > 95: we have a problem
• This estimate does not include:
– Derating factors
– Other components of the chip (e.g., memories)
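The arithmetic of this example can be replayed in a few lines of Python. All input numbers come from the slide; the variable and function names are illustrative assumptions.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours


def fit_from_mtbf_years(mtbf_years: float) -> float:
    """FIT = 10^9 / MTBF, with MTBF converted from years to hours."""
    return 1e9 / (mtbf_years * HOURS_PER_YEAR)


n_servers = 1000
farm_fit = fit_from_mtbf_years(1)          # farm MTBF of 1 year
server_fit = farm_fit / n_servers          # equal share per server -> MTBF of 1000 years
cpu_budget_fit = fit_from_mtbf_years(1200)  # about 95

flop_fit = 0.01       # FIT of a single flip-flop (at NYC), as given
n_flops = 300_000
raw_flop_fit = flop_fit * n_flops          # about 3,000, before any derating

overshoot = raw_flop_fit / cpu_budget_fit  # how many times over budget the flops are
print(f"CPU budget: {cpu_budget_fit:.1f} FIT, flops alone: {raw_flop_fit:.0f} FIT")
print(f"flops exceed the budget by a factor of about {overshoot:.0f}")
```

The flops alone overshoot the CPU budget by roughly a factor of 30, which is why either derating or hardening has to close the gap.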
The problem is growing these days
• Newer technologies are more sensitive
– Smaller transistor dimensions
=> Smaller critical charge
=> The electrical charge of a particle is larger relative to the critical charge
• Two effects that cancel each other:
– Smaller area per transistor decreases the per-transistor FIT rate
– More transistors per mm² increase the total FIT of the chip
– Together, they have almost no influence on the FIT rate
Where is it important:
• Memories: solved problem!
– Were the only area that needed protection in older nodes
– Solution: ECC protection
• Flip-flops: hottest unsolved problem!
– Flops must be protected in newer technology nodes
• Combinatorial logic: not a problem yet
– Second-order problem
Single Event Upset vs. Soft-Error
• SEU: a particle causes a flip-flop or memory bit to flip its value
• Soft error: an SEU that has propagated and caused a system failure visible from outside
• Most SEUs do not become soft errors
Most SEUs do not convert to Soft-errors
• Definition: FIT rate with derating factors
– FIT calculated taking into account the SEUs that vanish
Ilan Beer, IBM, HVC 2008
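As a rough sketch of how derating enters the numbers, the raw FIT can be scaled by the fraction of SEUs that actually become soft errors. The single-factor model and the names below are simplifying assumptions; the 3,000 and 95 FIT figures are reused from the earlier chip example.

```python
def derated_fit(raw_fit: float, propagation_fraction: float) -> float:
    """Scale the raw FIT by the fraction of SEUs that become soft errors.

    propagation_fraction is an assumed single derating factor in [0, 1];
    real designs have a different factor per flop.
    """
    if not 0.0 <= propagation_fraction <= 1.0:
        raise ValueError("propagation_fraction must be between 0 and 1")
    return raw_fit * propagation_fraction


raw_fit = 3000.0   # FIT of all flops before derating (earlier example)
budget_fit = 95.0  # FIT budget of the CPU chip (earlier example)

# Largest propagation fraction at which the flops would fit the budget unprotected.
max_fraction = budget_fit / raw_fit
print(f"SEUs may become soft errors at most {max_fraction:.1%} of the time")
```

If the measured fraction is higher than this limit, derating alone is not enough and some flops need hardening.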
Common mitigation methods:
• TMR with Majority voting
• DMR with C-Element
• Soft-error detectors
• SE detection with Parity tree
• More….
Solution 1: TMR with majority voter
• TMR: Triple Modular Redundancy
• Extra area ~ +205%, extra power ~ +205%, FIT = 0 (-100%)
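A behavioral sketch of the voter in Python, assuming one stored bit kept in three copies. It is a software model for illustration, not a description of any particular cell library.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority vote: the output follows at least two of the three inputs."""
    return (a & b) | (a & c) | (b & c)


def tmr_read(copies: list) -> int:
    """Read a TMR-protected bit stored as three redundant copies."""
    a, b, c = copies
    return majority(a, b, c)


stored = [1, 1, 1]   # the same bit written to all three flops
stored[1] ^= 1       # an SEU flips one copy
print(tmr_read(stored))  # the voter still returns the original value

# Limitation of the scheme: two upsets in the same triple out-vote the good copy.
stored[2] ^= 1
print(tmr_read(stored))
```

The voter masks any single upset in a triple, which is where the "FIT = 0" figure comes from; the cost is the two extra flops plus the voter itself.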
Solution 2: DMR with C-element
• DMR: Dual Modular Redundancy, using an additional C-element
• Additional area and power > +100%, FIT reduced to 5% of the original
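A behavioral sketch of the idea, assuming the usual definition of a C-element: the output follows the inputs only when they agree and otherwise holds its previous value. The class name is hypothetical and the model ignores timing.

```python
class CElement:
    """Behavioral model of a C-element with two inputs."""

    def __init__(self, initial: int = 0):
        self.out = initial

    def update(self, a: int, b: int) -> int:
        # The output changes only when both inputs agree; otherwise it holds.
        if a == b:
            self.out = a
        return self.out


c = CElement(initial=0)

# Both redundant flops hold 1: the C-element passes the value through.
print(c.update(1, 1))

# An SEU flips one copy: the inputs disagree, so the old (correct) value is held.
print(c.update(0, 1))
```

The output is only held rather than actively restored, so the protection relies on the disagreement being cleared before the held value is lost. The model does not show that effect; it may be one reason the quoted FIT reduction is 95% rather than 100%.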
Solution 3: Soft-Error detectors
These techniques are usually used for detecting single-bit flips in pipeline storage elements.
One simple method is to duplicate the critical node and connect the two outputs to an XOR gate.
Additional area and power are about 100%.
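The duplicate-and-XOR method can be sketched as follows. The pipeline-stage representation and the function names are assumptions made for illustration.

```python
def mismatch(primary: int, shadow: int) -> int:
    """XOR of a flop and its duplicate: 1 means the two copies disagree."""
    return primary ^ shadow


def stage_error(primary_bits, shadow_bits) -> bool:
    """Flag an error if any flop in the pipeline stage disagrees with its duplicate."""
    return any(mismatch(p, s) for p, s in zip(primary_bits, shadow_bits))


stage = [1, 0, 1, 1]
shadow = list(stage)                # duplicated storage elements
print(stage_error(stage, shadow))   # no upset: the copies agree

stage[2] ^= 1                       # an SEU flips one primary flop
print(stage_error(stage, shadow))   # the XOR exposes the mismatch
```

The XOR only reports that the copies differ, not which one is wrong, so some separate recovery step (for example replaying the pipeline stage) has to follow the detection.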
Summary of different solutions

| Family | Technique | Description | Extra area | Extra power | FIT |
|---|---|---|---|---|---|
| TMR (Triple Modular Redundancy) | TMR with majority voting | Triplicated storage elements with a majority voter at the output | +200% | +200% | Down to 0 |
| TMR (Triple Modular Redundancy) | | Three time-delayed storage nodes | | | |
| DMR (Dual Modular Redundancy) | C-element | Copied storage element | +105% | +100% | Down by 95% |
| Error detection | | Uses the already existing scan design-for-testability | +20% | ~+15% | |
| Error detection | | Uses a duplicated storage element with XOR | +103% | ~+100% | |
| Parity tree | Parity tree | Uses a transient detector; used in pipelines and recoverable models | --- | --- | Down to 0; performance penalty; not always possible |
| Etc. | … | … | … | … | … |
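The talk gives no detail on the parity-tree approach, so the sketch below assumes the common scheme: a parity bit is computed over a group of flops when they are written, and an XOR tree recomputes it later for comparison. All names are illustrative.

```python
from functools import reduce
from operator import xor


def parity(bits) -> int:
    """XOR-reduce the bits, as an XOR tree would in hardware."""
    return reduce(xor, bits, 0)


def parity_error(bits, stored_parity: int) -> bool:
    """True when the recomputed parity no longer matches the stored parity bit."""
    return parity(bits) != stored_parity


word = [1, 0, 1, 1, 0, 0, 1, 0]
stored_parity = parity(word)       # computed when the flops are written

word[3] ^= 1                       # a single SEU
print(parity_error(word, stored_parity))   # detected

word[5] ^= 1                       # a second SEU in the same group
print(parity_error(word, stored_parity))   # an even number of flips goes unnoticed
```

One parity bit covers many flops, which keeps the area cost low, but only an odd number of flips in a group is detected and a recovery mechanism is still needed.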
Flip-flop selection is needed
• Hardening all flops is not viable
– Silicon cost: 25%-35%
– Affects unit cost, NRE cost and power
• Solution:
– Apply these techniques selectively
– Harden the flops that are more sensitive to SEUs
• “A flop sensitive to SEUs” means:
– An SEU on that flop has a higher probability of becoming a soft error
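One possible selection strategy, sketched under stated assumptions: each flop has a known probability that an SEU on it becomes a soft error, a hardened flop contributes no FIT (the TMR figure), and flops are hardened greedily from most to least sensitive until a FIT budget is met. The flop names, sensitivities and budget are made up for illustration; this is not the method of any specific tool.

```python
def select_flops_to_harden(sensitivity: dict, flop_fit: float, budget_fit: float):
    """Greedy selection: harden the most sensitive flops first.

    sensitivity maps a flop name to the (assumed known) probability that an
    SEU on that flop becomes a soft error. Returns the hardened flops and the
    derated FIT that remains on the unprotected ones.
    """
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    remaining = sum(sensitivity.values()) * flop_fit
    hardened = []
    for name in ranked:
        if remaining <= budget_fit:
            break
        hardened.append(name)
        remaining -= sensitivity[name] * flop_fit  # hardened flop assumed to add 0 FIT
    return hardened, remaining


# Hypothetical flops and sensitivities, for illustration only.
flops = {"state_reg": 0.9, "ctrl_valid": 0.6, "data_0": 0.1, "dbg_cnt": 0.01}
hardened, remaining = select_flops_to_harden(flops, flop_fit=0.01, budget_fit=0.005)
print(hardened, remaining)
```

In this toy case two of the four flops carry most of the derated FIT, so hardening only those meets the budget. That is the effect selective hardening relies on.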
Existing selection methods: Error Injection simulation
• Run a lot of simulations
• Each simulation injects a single error on a random flop at a random cycle (simulating an SEU)
• If the test bench detects an error, this SEU counts as a soft error
• How many simulations to run?
– Option 1: Loop over all flops and all cycles
– Option 2: Inject errors on randomly selected flops and cycles, at the cost of lower accuracy
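The flow above can be illustrated on a toy design. The sketch assumes a made-up circuit (a 3-bit counter whose top bit is the only output, plus a debug flop that is never observed) and a test bench that simply compares outputs against a fault-free run. Real flows run RTL or gate-level simulators; everything here is hypothetical.

```python
import random

N_FLOPS = 4    # flops 0-2: counter bits; flop 3: debug toggle, never observed
N_CYCLES = 32


def step(state):
    """Advance the toy design by one clock; return (next_state, output)."""
    count = state[0] | state[1] << 1 | state[2] << 2
    count = (count + 1) % 8
    nxt = [count & 1, (count >> 1) & 1, (count >> 2) & 1, state[3] ^ 1]
    return nxt, nxt[2]  # only the counter MSB is visible outside


def run(inject=None):
    """Simulate N_CYCLES; inject=(flop, cycle) flips that flop before that cycle."""
    state, outputs = [0, 0, 0, 0], []
    for cycle in range(N_CYCLES):
        if inject is not None and inject[1] == cycle:
            state[inject[0]] ^= 1  # the simulated SEU
        state, out = step(state)
        outputs.append(out)
    return outputs


GOLDEN = run()  # fault-free reference outputs


def is_soft_error(flop, cycle):
    """The 'test bench': any output mismatch against the golden run."""
    return run((flop, cycle)) != GOLDEN


def exhaustive():
    """Option 1: every flop at every cycle."""
    return {f: sum(is_soft_error(f, c) for c in range(N_CYCLES)) / N_CYCLES
            for f in range(N_FLOPS)}


def sampled(n_samples, seed=0):
    """Option 2: random flops and cycles; cheaper but only an estimate."""
    rng = random.Random(seed)
    hits, tries = [0] * N_FLOPS, [0] * N_FLOPS
    for _ in range(n_samples):
        f, c = rng.randrange(N_FLOPS), rng.randrange(N_CYCLES)
        tries[f] += 1
        hits[f] += is_soft_error(f, c)
    return {f: hits[f] / tries[f] if tries[f] else None for f in range(N_FLOPS)}


print(exhaustive())
print(sampled(40))
```

The exhaustive loop costs flops × cycles simulations, which is what makes Option 1 impractical on real designs; the sampled version trades that cost for statistical error in the per-flop sensitivity.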
Error Injection simulation
• Benefits:
– Almost the only available option now
• Drawbacks:
– Time-consuming: 2-4 weeks, even with a low sample rate
– Compute-intensive
• 2-4 weeks x 10-20 machines during peak project time
– Internal/in-house solution: someone has to develop and maintain it
– In practice, available only to big companies
Introducing: CosmicASIC™
• 1000x faster than existing solutions
• Plug-and-play solution
• 100% accuracy
Summary
• The soft-error problem is growing
• Mitigation techniques exist
– But they can cost 25%-35% in silicon, NRE and power
• Flip-flop selection is a must
– Solves the soft-error problem at a fraction of the cost
• CosmicASIC™: Flip-flop selection EDA tool
Visit us at booth A03 in the exhibition area
Or at: http://www.optima-da.com