Post on 16-Aug-2015
Copyright 2015 QuEST Forum. All Rights Reserved.
The action against Soft-errors
to prevent service outages
NTT Network Service Systems Laboratories
Hidenori Iwashita
2015 APAC QuEST Forum APAC Best Practices Conference
April 2015
Agenda
1. Soft error problems
   Laboratory non-reproducible errors
   Silent errors
2. Soft error mechanisms
   Soft errors are caused by cosmic rays
3. The increase of soft errors
   With miniaturization of LSI design rules, soft errors are increasing rapidly
4. Practices
   Soft error test using a compact accelerator neutron source
5. Results
6. Conclusion
   NTT can reduce service outages and failure recovery costs due to soft errors.
1. Soft error problems
Laboratory non-reproducible errors
Network System
Network operations center
① Error
② Alarm
Manufacturer factory
③ Return
④ Tests
⑤ Test OK
1. Soft error problems
Silent errors
Network System
Network operations center
① User complaint: "I can’t connect!"
• Not alarmed
• Fault node unknown
Prolonged
Significant failure
Press release (Newspaper, TV)
2. Soft error mechanisms
Neutrons generated by cosmic rays
Sun
Supernova explosion
Cosmic rays (High energy particles)
Earth
Nuclear reactions in the atmosphere
High energy particles
Nuclei (O or N)
Destruction
Neutron
Proton
Muon
π-meson
2. Soft error mechanisms
Nuclear reactions in the device
Neutrons
Network System
Neutron
Silicon nuclei
Proton
Destruction
Secondary ions
Soft error (Bit error)
3. The increase of soft errors
Miniaturization of LSI design rules (higher integration)
Soft errors increase
Past: only in space or the sky
Current: at ground level
3. The increase of soft errors
How often do soft errors occur ?
SRAM
Since SRAMs have less critical charge (are more sensitive), soft errors occur more frequently.
FPGA
The FPGA contains large-capacity SRAM. Without soft error mitigation, its soft error rate exceeds 10,000 FIT.
E.g.
FPGA ×6 per unit, ×1000 units in the network
About 1.5 devices fail per day
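The "about 1.5 per day" figure can be checked from the definition of FIT (Failures In Time: expected failures per 10^9 device-hours). The following is a minimal sketch, assuming 10,000 FIT per FPGA, 6 FPGAs per unit, and 1000 units; the function and parameter names are illustrative and do not come from the slides.

```python
# FIT (Failures In Time) = expected failures per 1e9 device-hours.
DEVICE_HOURS_PER_FIT = 1e9


def failures_per_day(fit_per_device: float, devices_per_unit: int, units: int) -> float:
    """Expected soft errors per day across the whole network."""
    total_fit = fit_per_device * devices_per_unit * units
    failures_per_hour = total_fit / DEVICE_HOURS_PER_FIT
    return failures_per_hour * 24


# Assumed inputs: 10,000 FIT per FPGA, 6 FPGAs per unit, 1000 units.
rate = failures_per_day(fit_per_device=10_000, devices_per_unit=6, units=1_000)
print(f"{rate:.2f} soft errors per day")  # 1.44 soft errors per day
```

The result, 1.44 per day, rounds to the 1.5 quoted on the slide.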
4. Practices
Developing and applying soft error countermeasures
4. Practices
Step 1. Specifying requirements
Planned network scale
E.g. 1000 units on the network
Specify requirements
E.g. 1 failure per month on the network
⇒ about 1300 FIT / unit
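The per-unit requirement can be derived by running the FIT definition in reverse. This is a rough sketch, assuming an average month of 730 hours (8760 / 12); the slides do not state which month length was used, so the result only approximately matches the rounded value above.

```python
# Assumption: one month = 8760 h / 12 = 730 h. The slides do not specify this.
HOURS_PER_MONTH = 730.0
DEVICE_HOURS_PER_FIT = 1e9  # FIT = failures per 1e9 device-hours


def fit_budget_per_unit(failures_per_month: float, units: int) -> float:
    """Maximum FIT each unit may have to meet a network-wide failure target."""
    failures_per_hour_per_unit = failures_per_month / HOURS_PER_MONTH / units
    return failures_per_hour_per_unit * DEVICE_HOURS_PER_FIT


budget = fit_budget_per_unit(failures_per_month=1, units=1_000)
print(f"{budget:.0f} FIT per unit")
```

With these assumptions the budget comes out between 1300 and 1400 FIT per unit, in line with the slide's "about 1300 FIT".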
4. Practices
Step 2. Simulating soft errors
Device       Design rule [nm]   Size [Mb]   Soft error rate [FIT]
CPU SRAM     65                 2           200
FPGA SRAM    28                 100         10000
ASIC SRAM    90                 2           150
DRAM ①       40                 500         10
DRAM ②       40                 500         10
DRAM ③       40                 500         10
DRAM ④       40                 500         10
SRAM ①       65                 10          1000
SRAM ②       65                 1           100
SRAM ③       65                 10          1000
SRAM ④       65                 2           200
SRAM ⑤       65                 10          1000
Flash Mem    90                 50          50
[Figure: example substrate carrying an FPGA, an ASIC, a CPU, SRAMs, DRAMs, and flash memory; four devices are marked "High"]
E.g. we simulate the soft error rate of each device to find the devices with high rates.
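One way to read the example table is as a per-unit FIT budget: add up the device rates, compare the total with the requirement from Step 1, and flag the largest contributors. The sketch below assumes that the table describes a single unit, that device rates simply add, and that "High" means 1000 FIT or more; that threshold is a guess chosen for illustration, not a value given in the slides.

```python
# (device, design rule [nm], size [Mb], soft error rate [FIT]) from the example table.
DEVICES = [
    ("CPU SRAM", 65, 2, 200),
    ("FPGA SRAM", 28, 100, 10_000),
    ("ASIC SRAM", 90, 2, 150),
    ("DRAM 1", 40, 500, 10),
    ("DRAM 2", 40, 500, 10),
    ("DRAM 3", 40, 500, 10),
    ("DRAM 4", 40, 500, 10),
    ("SRAM 1", 65, 10, 1_000),
    ("SRAM 2", 65, 1, 100),
    ("SRAM 3", 65, 10, 1_000),
    ("SRAM 4", 65, 2, 200),
    ("SRAM 5", 65, 10, 1_000),
    ("Flash Mem", 90, 50, 50),
]

REQUIREMENT_FIT = 1_300     # per-unit requirement from Step 1
HIGH_THRESHOLD_FIT = 1_000  # hypothetical cut-off for "High"

# Assumption: the unit's rate is the sum of its devices' rates.
total_fit = sum(fit for _, _, _, fit in DEVICES)
high_devices = [name for name, _, _, fit in DEVICES if fit >= HIGH_THRESHOLD_FIT]

print(f"Total: {total_fit} FIT (requirement: {REQUIREMENT_FIT} FIT)")
print("High:", ", ".join(high_devices))
```

Under these assumptions the unmitigated total is about ten times the requirement, and four devices (the FPGA SRAM and the three 10 Mb SRAMs) account for most of it.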
4. Practices
Step 3. Applying soft error countermeasures
Selecting the appropriate soft error countermeasures to suit functions

(1) Reducing soft errors
    Devices with low soft error rates
    • MRAM, special devices: low spec, high cost

(2) Protection from soft errors
    Using memory devices with error correction functions such as ECC*
    • 1 bit correction, 2 bit detection: low cost
    • 2 bit correction, 3 bit detection: high cost
    *Error Correction Code

(3) Recovery from soft errors
    Systems automatically restart or overwrite if a soft error occurs.
    • Firmware: low cost
    • ASIC: long-term development
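To make "1 bit correction, 2 bit detection" concrete, here is a minimal sketch of a textbook extended Hamming code on a 4-bit value. It is an illustration only, not the ECC used in the devices discussed here; memory ECC in real hardware works on much wider words. The names `encode` and `decode` are hypothetical.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 form Hamming(7,4),
    with check bits at positions 1, 2, 4 and data at positions 3, 5, 6, 7.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the parity of all 8 bits even
    return code


def decode(received: list) -> tuple:
    """Return (value, status); value is None for an uncorrectable error."""
    code = list(received)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    parity_odd = sum(code) % 2 == 1  # odd overall parity means an odd number of flips

    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        code[syndrome] ^= 1          # single flip; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "double-bit error detected"

    value = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return value, status


word = encode(0b1011)
word[5] ^= 1                 # one soft error: corrected
print(decode(word))          # (11, 'corrected')
word[2] ^= 1                 # a second soft error: detected, not corrected
print(decode(word))          # (None, 'double-bit error detected')
```

Detection without correction still matters: it turns a silent error into an alarmed one, which the recovery measures in (3) can then act on.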
4. Practices
Step 4. Soft error tests with real products
We developed soft error testing technology using Hokkaido
University’s compact accelerator-driven neutron source.
Hokkaido University’s compact
accelerator-driven neutron source
5. Results
Comparison of neutron soft error rates
We measured the device to confirm the soft error rate reduction using
the accelerator neutron source.
[Figure: three bar charts, each on an axis from 0 to 10]
  FPGA based device vs. ASIC based device
  w/o ECC function vs. w/ ECC function
  w/o auto recovery function vs. w/ auto recovery function
Reductions annotated on the charts: 80%, 90%, 80%
On the real network, the number of soft errors decreased substantially.
6. Conclusion
We successfully reproduced soft errors using a compact
accelerator-driven neutron source.
We were able to investigate soft error tolerance, and check
the fault detection process and the process of switching to a
backup network system.
We conclude that NTT can reduce service outages and
failure recovery costs due to soft errors.
Message
Have you ever experienced troubles with unknown
causes on your network?
They might be caused by soft errors!
Soft errors can be dealt with!
We hope that all carriers and manufacturers around
the world will be freed from these problems!
Special thanks:
Fujitsu, Ltd.
Hitachi, Ltd.
NEC corp.