1 P189/MAPLD2004Carmichael
A Triple Module Redundancy Scheme for SEU Mitigation of Static Latch-Based FPGAs
Carl Carmichael (1), Brendan Bridgford (1), Gary Swift (2), Matt Napier (3)
(1) Xilinx Corporation, San Jose CA
(2) Jet Propulsion Laboratory, Pasadena CA
(3) Sandia National Laboratories, Albuquerque NM
"This work was carried out in part by the Jet Propulsion Laboratory, California Institute of Technology, under contract with the National Aeronautics and Space Administration."
"Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology."
ABSTRACT
“Xilinx Triple Module Redundancy,” or XTMR, is an SEU mitigation technique and design methodology intended to remove all single points of failure within the configuration control cells and user logic elements, including those in the voting circuitry, as well as preventing the propagation of single event transients, by “triplicating” all inputs, outputs, logic, clock domains and voters. Voters are also inserted on all state logic feedback paths, conferring full SEU and SET immunity while allowing for autonomous re-synchronization of just-reconfigured state logic to the redundant domains.
This paper presents the fundamental philosophy of the XTMR method, the automated implementation of XTMR provided by the new release of the “Xilinx TMRTool”, as well as Single Event Effects testing and analysis of the combined SEU mitigation technique of XTMR and autonomous partial re-configuration (scrubbing).
The SEE test analysis demonstrates that this combined SEU mitigation technique pushes the cross-section for functional error for any design in any orbit to at least one order of magnitude below the established cross-sections for device level Single Event Functional Interrupts (SEFI). This study has the potential to relieve many users of the requirement to perform independent SEE testing on individual design implementations.
XTMR SEU Mitigation
• Xilinx Triple Module Redundancy (XTMR)
  – Single Point Failures are eliminated by triplication of every logic node (gates & nets).
  – XTMR confers SEU and SET immunity
  – XTMR does not protect against SEFIs!
  – Any digital design can be XTMRed by:
    • “Triplication” of throughput (combinational & sequential) logic
    • “Triplication” of feedback logic and inserting majority voters
    • Adding redundant IO (outputs with minority voters)
    • Design cleanup (removing half-latches, SRL16s, etc.)
XTMR State-Machines
[Figure: a state machine shown “Pre-TMR” and “Post-XTMR”.]
• XTMR provides autonomous re-synchronization of the separate redundant domains of a state-machine by inserting majority voters at the origin of any registered feed-back “Looped” path.
• When a configuration upset disables one domain, the other two domains continue to operate providing a correct majority representation of state data and functionality.
• When “Scrubbing” fixes the configuration of the upset domain, the embedded redundant voters automatically correct the state of the upset domain without any external intervention.
• As long as the scrub rate is greater than the upset rate, a single bit upset cannot disturb more than one redundant domain.
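The re-synchronization behavior described above can be illustrated with a small software model. This is only a sketch, not the TMRTool implementation: the 8-bit counter, the function names, and the injected corruption value are assumptions chosen for illustration. Each redundant domain has its own majority voter on the feedback path, so a corrupted domain picks up the voted state on the next clock.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def step(states, width=8):
    """Advance three redundant counters by one clock.

    Every domain computes its next state from the voted feedback value
    rather than from its own register (one voter per domain).
    """
    mask = (1 << width) - 1
    voted = [majority(*states) for _ in range(3)]
    return [(v + 1) & mask for v in voted]


states = [5, 5, 5]
states[1] = 0xA7       # domain 1 holds a wrong state, e.g. just after being scrubbed
states = step(states)  # one clock later all three domains agree again
print(states)
```

In this model the upset domain needs no external reset: the two healthy domains outvote it and its register is reloaded with the correct next state.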
XTMR Inputs
• Effective SEU Mitigation requires the use of triple redundant input pins for every input signal.
• Not triplicating input Global signals (clk, rst, etc) can seriously compromise SEU resistance.
• Triplication of input data paths can be traded for EDAC.
• SEU resistance is sometimes a trade-off for resource utilization.
XTMR Outputs with Minority Voters
• Outputs can be triplicated, using three pins for each output signal.
• Minority voters monitor each of the triplicated design modules.
• If one module is different from the others, its output pin is driven to High-Z
• Voters are triplicated
[Figure: three minority voters, one for each redundant domain (TR0, TR1, TR2), each associated with an output labeled P. The convergence point is outside the FPGA, at the trace.]
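The minority-voter behavior can be sketched as a truth-table model for a single-bit output. The function names and the 'Z' marker for a tri-stated pin are assumptions made for this illustration, not names from the XTMR tooling.

```python
def drive_outputs(tr0, tr1, tr2):
    """Return what each of the three output pins drives: 0, 1, or 'Z' (High-Z).

    A domain is in the minority when it disagrees with both other domains;
    its own voter then tri-states that domain's pin.
    """
    domains = (tr0, tr1, tr2)
    pins = []
    for i, own in enumerate(domains):
        others = [d for j, d in enumerate(domains) if j != i]
        in_minority = own != others[0] and own != others[1]
        pins.append('Z' if in_minority else own)
    return pins


def trace_value(pins):
    """Value seen where the three pins converge on the board trace."""
    driven = {p for p in pins if p != 'Z'}
    if len(driven) != 1:
        raise ValueError("contention or floating trace")
    return driven.pop()


pins = drive_outputs(1, 0, 1)  # domain TR1 is upset
print(pins, trace_value(pins))
```

With a single upset domain, the two agreeing pins keep driving the trace and the faulty pin is removed from the connection.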
Xilinx TMRTool
• The Xilinx TMRTool is a graphical application that automates the implementation of XTMR for FPGA designs.
• The designer is provided the flexibility to selectively apply XTMR to their design at the instance, component, and hierarchical levels.
• Custom mitigation methods may be employed for specific portions of the design through user-created library macros.
• Designs are imported from a Xilinx netlist (NGO/NGC) and exported as a single standard EDIF project source.
XTMR SEE Testing
• Validation of mitigation of architectural resources by superposition.
  – Separate experiments were created to cover the major elements of the Virtex-II architecture:
    • Configurable Logic Block
      – Combinatorial Logic, Sequential Logic, Arithmetic, Multiplexing.
      – Design implementation is an array of state-machines.
    • Multipliers
      – Dedicated 18 x 18 bit multiply function blocks.
      – Design implementation is an array of Multiply and Accumulate functions.
    • Block Memories
      – Synchronous Dual Port 18k bit RAM blocks.
      – Design implemented as a single large memory space for high speed store and fetch functions.
    • Input Output Blocks
      – Multi-standard discrete & bi-directional un/registered device IO.
      – Design implemented as feed-thru channels from IOB to IOB.
    • Digital Clock Managers
      – Clock frequency synthesis and phase delay re-alignment.
      – This will be tested in future work.
2V6000 Dynamic SEU Test
[Figure: test setup inside the target room. Labels: Configuration Monitor / Strip Chart; Functional Monitor / Strip Chart; Back Side; Front Side; BEAM; Thinned DUT.]
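CLB Test Design
[Figure: block diagram of the CLB test. DUT: modules mod0 to mod15, each containing 32-bit counters (+1 to +32) feeding a 32x1x32 MUX; the module outputs feed a 32x1x16 MUX with a 16-bit output. SERVICE: Configuration Manager Core (SelectMAP), Functional Monitor FSM, Error Counters.]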
CLB Test Functional Description
• The CLB test “pre-TMR” design consists of 512 (32 bit) counters created as 16 modules of 32 counters per module. Each counter in the module increments by a different value. The output of each module is a multiplex of the 32 counters. The outputs of all the modules are again multiplexed to a single 16 bit bus. A 10 bit address bus is used to select the output of a specific counter and select between the upper and lower 16 bit banks of the 32 bit module outputs.
• The Xilinx TMRTool software is used to process the design into a fully XTMR mitigated design. Both the TMR and pre-TMR designs undergo active scrubbing (partial reconfiguration for SEU correction) for the configuration of the DUT.
• All counters run continuously. Each counter is selected sequentially for sampling of its current state and operation.
• For each module sample taken, the actual and expected values are recorded, along with the sequential count of state errors and the running count of event errors, into a strip chart file on the host PC.
• When counters are observed to be permanently in the wrong state, the design is reset to regain a fully functioning test.
• The final error count is calculated as the number of events in which a counter either lost its state or moved to the wrong state.
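The sampling scheme described above (512 counters, a 10 bit address, 16 bit samples) can be modeled on the host side roughly as follows. The slides do not give the bit layout of the address, the reset value, or the exact increments, so the layout below (bit 9 = bank, bits 8..5 = module, bits 4..0 = counter), a reset value of 0, and an increment of k+1 for counter k are all assumptions made for this sketch.

```python
def decode_address(addr):
    """Split a 10-bit address into (module, counter, bank).

    Assumed layout, not taken from the slides:
    bit 9 = upper/lower bank, bits 8..5 = module, bits 4..0 = counter.
    """
    if not 0 <= addr < (1 << 10):
        raise ValueError("address must fit in 10 bits")
    bank = (addr >> 9) & 0x1
    module = (addr >> 5) & 0xF
    counter = addr & 0x1F
    return module, counter, bank


def expected_sample(addr, clocks):
    """16-bit value the monitor would expect after `clocks` clock cycles."""
    _module, counter, bank = decode_address(addr)
    increment = counter + 1                    # assumption: counter k adds k+1 per clock
    value = (clocks * increment) & 0xFFFFFFFF  # 32-bit wraparound, reset value 0
    return (value >> (16 * bank)) & 0xFFFF


print(decode_address(610), expected_sample(610, 100000))
```

A functional monitor built this way compares each sampled 16 bit word against `expected_sample` and logs a state error on any mismatch.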
Multiplier Test Design
[Figure: block diagram of the multiplier test. DUT: modules mod0 to mod15, each containing three 36-bit MACs (+1 x1, +1 x10, +1 x11) feeding a 3x2x32 MUX; the module outputs feed a 32x1x16 MUX with a 16-bit output. SERVICE: Configuration Manager Core (SelectMAP), Functional Monitor FSM, Error Counters.]
Multiplier Test Functional Description
• The Multiplier test “pre-TMR” design consists of 48 (18x18x36 bit) Multiply and Accumulate (MAC) blocks created as 16 modules of 3 MACs per module. Each MAC in the module increments by 1 and multiplies by a different constant (1, 10, and 11, respectively). The output of each module is a multiplex of the 3 MACs and a select of the lower 32 bits and upper 4 bits of the 36 bit registered multiplier output. The outputs of all the modules are again multiplexed to a single 16 bit bus. An 8 bit address bus is used to select the output of a specific MAC and select between the upper and lower 16 bit banks of the 32 bit module outputs.
• The Xilinx TMRTool software is used to process the design into a fully TMR mitigated design. Both the TMR and pre-TMR designs undergo active scrubbing (partial reconfiguration for SEU correction) for the configuration of the DUT.
• All MACs are constantly accumulating. Each MAC is selected sequentially for a periodic sampling of its sequence.
• For each module sample taken, the actual and expected values are recorded, along with the sequential count of state errors and the running count of event errors, into a strip chart file on the host PC.
• When MACs are observed to be permanently in the wrong state, the design is reset to regain a fully functioning test.
• The final error count is calculated as the number of events in which a MAC lost its state or produced an incorrect result.
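BRAM Test Design
[Figure: block diagram of the BRAM test. DUT: 128k byte RAM with ADDRESS and DATA ports (16-bit buses), the data derived from the address through a -1 stage. SERVICE: Configuration Manager Core (SelectMAP), Functional Monitor FSM, Error Counters.]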
BRAM Test Functional Description
• The Block Memory test “pre-TMR” design consists of a single large 128k byte single port memory space created from 64 memory blocks of 16k bits each.
• The Xilinx TMRTool software is used to process the design into a fully TMR mitigated design. Both the TMR and pre-TMR designs undergo active scrubbing (partial reconfiguration for SEU correction) for the configuration of the DUT.
• Separate WRITE and READ routines are executed to all memory address locations. The data is derived from a decrement of the address value. The entire memory space is refreshed with a write operation and then the data is retrieved with a read operation.
• During the read operation the retrieved data is compared against the expected value.
• For each data sample taken, the actual and expected values are recorded with the running count of event errors into a strip chart file on the host PC.
• Each error event is measured for its total word error size in bits: 1, 32, 64, 512, 1024, etc.
• The final error count is calculated as the number of separate events of word errors.
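The write and read routines can be sketched as below. This is a host-side model only: the 16 bit word width is inferred from the 16 bit buses in the block diagram, the function names are hypothetical, and the injected upset exists only to show what an error record looks like. Here the error size is counted per word, which may differ from how the original test grouped bits into events.

```python
WORDS = 64 * 1024  # 128k bytes organized as 16-bit words (assumed word width)


def pattern(addr):
    """Test data: the address decremented by one, wrapped to 16 bits."""
    return (addr - 1) & 0xFFFF


def write_pass(mem):
    """Refresh the entire memory space with the test pattern."""
    for addr in range(len(mem)):
        mem[addr] = pattern(addr)


def read_pass(mem):
    """Compare every word; return (addr, actual, expected, bits_in_error) records."""
    errors = []
    for addr, actual in enumerate(mem):
        expected = pattern(addr)
        if actual != expected:
            bits = bin(actual ^ expected).count("1")
            errors.append((addr, actual, expected, bits))
    return errors


mem = [0] * WORDS
write_pass(mem)
mem[0x1234] ^= 0x0001  # inject a single-bit upset for illustration
errors = read_pass(mem)
print(errors)
```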
Configuration Error Detection and Correction Algorithm
[Flowchart. Elements: START; CONFIGURE; DONE check (0/1); READBACK (CONFIG CRC); CHECK PORT; SCRUB; READBACK (SCRUB CRC); comparisons CONFIG = SCRUB and PREV = SCRUB (YES/NO); PREV CRC; CRC ERROR counter (+1, = 2, = 0); SEFI.]
• Configure target FPGA with configuration data stored in the configuration PROM(s).
• Read back configuration programming data from target FPGA and calculate 16 bit CRC. Store CRC value as “Config-CRC”.
• Perform a Write/Read check on the internal Frame Address Register of target FPGA.
• Scrub (background refresh) configuration data of target FPGA.
• Read back configuration programming data from target FPGA and calculate 16 bit CRC. Store CRC value as “Rdbk-CRC” and perform bit-for-bit error detection of configuration data.
• Compare “Rdbk-CRC” with “Config-CRC”.
• If the CRC values mismatch a second time, assert SEFI_ERROR and RECONFIGURE.
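The steps above can be sketched as a control loop. This is a loose model under several assumptions, not the Configuration Manager's actual logic: `FakeFpga` is a hypothetical stand-in for the target device, `inject` is a test hook standing in for radiation upsets, the CRC polynomial is not stated in the slides (Python's `binascii.crc_hqx` is used as one available 16 bit CRC), and the bit-for-bit comparison step is omitted.

```python
import binascii


class FakeFpga:
    """Hypothetical stand-in for the target device; not a real API."""

    def __init__(self, golden):
        self.golden = bytes(golden)      # contents of the configuration PROM
        self.frames = bytearray(golden)  # live configuration memory
        self.far_ok = True               # Frame Address Register write/read check
        self.scrub_works = True          # False models a fault scrubbing cannot repair

    def configure(self):
        self.frames = bytearray(self.golden)
        self.far_ok = True
        self.scrub_works = True

    def scrub(self):
        if self.scrub_works:
            self.frames = bytearray(self.golden)

    def readback(self):
        return bytes(self.frames)


def crc16(data):
    return binascii.crc_hqx(data, 0xFFFF)


def monitor(fpga, cycles, inject=lambda cycle, dev: None):
    """Run scrub cycles and return a log of (cycle, "SEFI_ERROR") events."""
    log = []
    fpga.configure()
    config_crc = crc16(fpga.readback())  # "Config-CRC"
    crc_errors = 0
    for cycle in range(cycles):
        inject(cycle, fpga)
        sefi = not fpga.far_ok
        if not sefi:
            fpga.scrub()
            if crc16(fpga.readback()) == config_crc:  # "Rdbk-CRC" matches
                crc_errors = 0
            else:
                crc_errors += 1
                sefi = crc_errors >= 2  # second consecutive mismatch
        if sefi:
            log.append((cycle, "SEFI_ERROR"))
            fpga.configure()  # RECONFIGURE
            crc_errors = 0
    return log


def upset(cycle, dev):
    if cycle == 1:
        dev.frames[0] ^= 0x01    # ordinary upset: scrubbing repairs it
    if cycle == 3:
        dev.frames[0] ^= 0x01
        dev.scrub_works = False  # upset that scrubbing cannot repair


log = monitor(FakeFpga(b"\x12\x34\x56\x78"), cycles=6, inject=upset)
print(log)
```

In this model an ordinary configuration upset is silently repaired by the scrub, while a mismatch that survives two consecutive scrub cycles is treated as a SEFI and triggers a full reconfiguration.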
Previous SEE Test Methodology for Mitigation
• The assertion of the combined mitigation method of XTMR & Scrubbing is that the complete removal of Single Event Functional Errors in the user logic reduces the overall error rate of any user design to the rate determined by the remaining Single Event Functional Interrupts. Therefore, a successful mitigation test is expected to produce zero errors other than SEFIs.
• Since the effectiveness of TMR depends on errors not accumulating in the configuration, experiments attempted to maintain an upset rate that did not exceed the scrub rate. This methodology had two significant flaws:
– One is an impracticality of testing at such low fluxes requiring unreasonably long run times and thus being incapable of reaching sufficient fluence for acceptable statistical significance of data.
– The other flaw is that a zero error rate result is not useful for making any calculations or extrapolations.
• These issues raise concerns over the validity of any results.
Improved SEE Test Methodology for Mitigation
• There is an expected physical relationship between the functional error rate of a mitigated system and the upset rate. The expected relationship predicts the increasing probability of upsetting bit combinations that will cause a mitigated (TMR) system to fail as a function of bit upset rate:
$$\mathrm{MER} = \frac{1}{2}\,\frac{N_B\,C_A}{T_S}\,R_U^{2}$$
  – MER = Mitigation Error Rate
  – $N_B$ = Number of Relevant Bits
  – $C_A$ = Average Cluster Size
  – $T_S$ = Scrub Time
  – $R_U$ = Upset Rate of Relevant Bits
• Therefore, testing at extremely high fluxes over several orders of magnitude variation can be performed to reveal this functional relationship between mitigation error rate and bit upset rate.
• This function can then be extrapolated to make predictions at the much lower upset rates of earth orbits.
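The quadratic dependence on upset rate is what makes the extrapolation possible. The sketch below evaluates the MER expression exactly as written on the slide; the function name and all input values are placeholders chosen for illustration, not measured data.

```python
def mitigation_error_rate(n_bits, cluster_size, scrub_time, upset_rate):
    """MER = (1/2) * (N_B * C_A / T_S) * R_U**2, as written on the slide."""
    return 0.5 * (n_bits * cluster_size / scrub_time) * upset_rate ** 2


# Placeholder inputs, for illustration only.
low = mitigation_error_rate(1.0e6, 2.0, 0.5, 1.0e-3)
high = mitigation_error_rate(1.0e6, 2.0, 0.5, 1.0e-2)
ratio = high / low  # a tenfold upset rate gives a hundredfold MER
print(ratio)
```

Because the other factors cancel in the ratio, a fit made at high flux can be scaled down to orbital upset rates without knowing the individual coefficients.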
Plot Definitions
• Predicted SEFI cross-section
  – Static and Dynamic SEE Characterization of the Virtex-II FPGA revealed several Single Event Functional Interrupt Modes: POR (2.5E-06), SMAP (1.72E-06), IOB (4.2E-06)
  – These combined cross-sections represent the minimum functional error cross-section for a single Virtex-II (XQR2V6000) device on orbit.
• Worst Case Orbital Upset Rate
  – CREME96 calculation of the worst case orbital upset rate for a XQR2V6000 is 7,740 bit-errors/day (9E-02 bit-errors/sec) in a GEO orbit at 36,000km during the worst day of an Anomalously Large Solar Flare accounting for both Heavy Ion and Proton. In a 40MeV Kr beam the exact same upset rate is achieved with a Flux of 1.25E-01 p/cm2/s. This denotes that the equivalent upset rates for all other orbits and solar conditions would reside to the LEFT of this line.
• Single Event Functional Interrupts
  – This is the average cross-section of the observed SEFI(s) while collecting the data represented in the plot. This cross-section is not Flux dependent. Variations from the predicted value are due to statistical significance of the total accumulated fluence during each test.
• Functional Errors
  – Data plot of the observed events when the Device Under Test returned an incorrect result. Cross-section is determined by the number of error events divided by total fluence at the specified flux. TMR denotes that the DUT design was fully mitigated with XTMR and scrubbing. The Unmitigated results were obtained with an identically functional design without XTMR, however scrubbing was also used for the unmitigated test.
• Extrapolation
  – A derived function describing the relation between Mitigation failure as a function of upset rate. Extension of the function predicts functional error cross-sections at worst case orbital upset rates to be less than SEFI cross-sections.
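The figures quoted in these definitions can be cross-checked with a few lines of arithmetic. The numbers are taken from the slide; the variable and function names are only illustrative, and the unit of the SEFI mode values is assumed to be cm2 per device.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Worst case orbital upset rate: 7,740 bit-errors/day expressed per second.
upsets_per_second = 7740 / SECONDS_PER_DAY

# Predicted SEFI cross-section: sum of the three SEFI modes from the slide.
sefi_modes = {"POR": 2.5e-6, "SMAP": 1.72e-6, "IOB": 4.2e-6}
predicted_sefi_sigma = sum(sefi_modes.values())


def cross_section(error_events, fluence):
    """Sigma = number of error events / total fluence (particles/cm2)."""
    if fluence <= 0:
        raise ValueError("fluence must be positive")
    return error_events / fluence


print(upsets_per_second, predicted_sefi_sigma)
```

The first value comes out near 9E-2 upsets per second and the second is 8.42E-6, the combined SEFI cross-section.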
XQR2V6000 Mitigation Error Statistics (CLB/IOB Logic: State-Machines)
[PLOT 1: Sigma (cm2/device), 1E-07 to 1E-02, versus Beam Flux (particles/cm2/s), 1E-02 to 1E+04, with a secondary x-axis of Configuration Bit Errors per Scrub Cycle, 3.5E-02 to 3.5E+03. Beam: 40 MeV Kr, LET = 22.3 MeV·cm2/mg. Series: Unmitigated Functional Errors; TMR Functional Errors; Extrapolation (square root function); Single Event Functional Interrupts (SEFIs); Worst Case Orbital Upset Rate (9E-2 Upsets/Sec); Predicted SEFI Cross-Section. Annotations: 36,000km GEO Orbit, Worst Day Solar Flare, 8,000 bit-errors/day; All other orbits; SEFIs drive error rate for all designs and all orbits; Mitigation errors on orbit are always less than SEFI errors by orders of magnitude.]
XQR2V6000 Mitigation Error Statistics (Dedicated Multipliers: Multiply-and-Accumulate)
[PLOT 2: Sigma (cm2/device), 1E-07 to 1E-02, versus Beam Flux (particles/cm2/s), 1E-02 to 1E+05, with a secondary x-axis of Configuration Bit Errors per Scrub Cycle. Beam: 40 MeV Kr, LET = 22.3 MeV·cm2/mg. Series: Unmitigated Functional Errors; TMR Functional Errors; Extrapolation (square root function); Single Event Functional Interrupts (SEFIs); Worst Case Orbital Upset Rate (9E-2 Upsets/Sec); Predicted SEFI Cross-Section. Annotations: 36,000km GEO Orbit, Worst Day Solar Flare, 8,000 bit-errors/day; All other orbits; SEFIs drive error rate for all designs and all orbits; Mitigation errors on orbit are always less than SEFI errors by orders of magnitude.]
XQR2V6000 Mitigation Error Statistics (Block Memory: Read/Write)
[PLOT 3: Sigma (cm2), 1E-12 to 1E+02, versus Beam Flux (particles/cm2/s), 1E-02 to 1E+05, with a secondary x-axis of Configuration Bit Errors per Scrub Cycle. Beam: 40 MeV Kr, LET = 22.3 MeV·cm2/mg. Series: TMR Functional Errors; Extrapolation (square root function); Single Event Functional Interrupts (SEFIs); Worst Case Orbital Upset Rate (9E-2 Upsets/Sec); Predicted SEFI Cross-Section. Annotations: 36,000km GEO Orbit, Worst Day Solar Flare, 8,000 bit-errors/day; All other orbits; SEFIs drive error rate for all designs and all orbits; Mitigation errors on orbit are always less than SEFI errors by orders of magnitude.]
SEE Test Analysis
• The experiments were conducted over a flux range of 7E+00 to 4E+04 (p/cm2/s).
• The Flux rates have been normalized in the secondary (top) x-axis of the plots to “average bit upsets per scrub cycle” (RS).
• Each experiment demonstrated a drop in failure cross-section over several orders of magnitude, crossing the SEFI cross-section at upset rates that are still several orders of magnitude above worst case orbital upset rates.
• Extrapolating this data for each experiment demonstrates a mitigation error cross-section at least one order of magnitude below the SEFI cross-section at worst case orbital upset rates.
• By superposition of the data fit functions, the total effective mitigated error rate cross-section is
$$\sigma_{TOTAL} = \sigma_{BRAM} + \sigma_{CLB} + \sigma_{MULT} + \sigma_{SEFI}$$
$$\sigma_{TOTAL} = 5.0\times10^{-8}\,(1.4\,R_S)^{2} + 5.0\times10^{-6}\,(0.7\,R_S)^{0.5} + 1.75\times10^{-6}\,(1.4\,R_S)^{0.35} + 8.42\times10^{-6}\ \mathrm{cm^2}$$
• Therefore, at the worst case orbital upset rate of 9E-2 upsets/sec ($R_S$ = 4.5E-2 upsets/scrub) the effective total cross-section for functional error is calculated:
$$\sigma_{TOTAL} = 1.05\times10^{-5}\ \mathrm{cm^2/device}\quad\text{(Orbital Worst Case)}$$
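The superposition can be evaluated numerically as a sanity check. This sketch assumes that the parenthesized trailing numbers in the fit functions are exponents (2, 0.5 and 0.35) and that the terms appear in the same order as in the symbolic sum (BRAM, CLB, MULT, SEFI); both readings are interpretations of the slide, and the function name is illustrative.

```python
def sigma_terms(rs):
    """Per-resource cross-sections (cm2) at `rs` bit upsets per scrub cycle."""
    return {
        "BRAM": 5.0e-8 * (1.4 * rs) ** 2,
        "CLB": 5.0e-6 * (0.7 * rs) ** 0.5,
        "MULT": 1.75e-6 * (1.4 * rs) ** 0.35,
        "SEFI": 8.42e-6,  # not dependent on upset rate
    }


terms = sigma_terms(4.5e-2)  # worst case orbital upset rate
total = sum(terms.values())
sefi_share = terms["SEFI"] / total
print(terms, total, sefi_share)
```

Under these assumptions the total comes out at roughly 1.0E-5 cm2/device, close to the quoted 1.05E-5; the small difference may come from rounding of the fit coefficients or from a different reading of the exponents. The SEFI term accounts for more than 80% of the total, consistent with the conclusion that SEFIs dominate the on-orbit error rate.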
Conclusions
• The efficiency and accuracy of validating mitigation techniques are greatly improved by testing at Flux rates that overwhelm the mitigation, which demonstrates the upset rate dependency of the mitigation method.
• The static SEFI cross-section is the dominating factor for calculating orbital error rates for any Virtex-II design when mitigated with Full XTMR & Scrubbing.
• Future Work
  – The authors recognize an anomaly in the data fit functions in that they were not all expressed as a square function. It is anticipated that this is due to the complexity of the bit clusters of the experimental designs. Additional research is called for to derive the separate coefficients for the MER equation for each design and explain their functional associations.