P189/MAPLD2004Carmichael 1 A Triple Module Redundancy Scheme for SEU Mitigation of Static...

A Triple Module Redundancy Scheme for SEU Mitigation of Static Latch-Based FPGAs

Carl Carmichael (1), Brendan Bridgford (1), Gary Swift (2), Matt Napier (3)

(1) Xilinx Corporation, San Jose CA

(2) Jet Propulsion Laboratory, Pasadena CA

(3) Sandia National Laboratories, Albuquerque NM

"This work was carried out in part by the Jet Propulsion Laboratory, California Institute of Technology, under contract with the National Aeronautics and Space Administration."

"Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology."


ABSTRACT

“Xilinx Triple Module Redundancy,” or XTMR, is an SEU mitigation technique and design methodology intended to remove all single points of failure within the configuration control cells and user logic elements, including those in the voting circuitry, and to prevent the propagation of single event transients, by “triplicating” all inputs, outputs, logic, clock domains and voters. Voters are also inserted on all state logic feedback paths, conferring full SEU and SET immunity while allowing for autonomous re-synchronization of just-reconfigured state logic to the redundant domains.

This paper presents the fundamental philosophy of the XTMR method, the automated implementation of XTMR provided by the new release of the “Xilinx TMRTool”, as well as Single Event Effects testing and analysis of the combined SEU mitigation technique of XTMR and autonomous partial re-configuration (scrubbing).

The SEE test analysis demonstrates that this combined SEU mitigation technique pushes the cross-section for functional error for any design in any orbit to at least one order of magnitude below the established cross-sections for device level Single Event Functional Interrupts (SEFI). This study has the potential to relieve many users of the requirement to perform independent SEE testing on individual design implementations.


XTMR SEU Mitigation

• Xilinx Triple Module Redundancy (XTMR)

– Single Point Failures are eliminated by triplication of every logic node (gates & nets).

– XTMR confers SEU and SET immunity.

– XTMR does not protect against SEFIs!

– Any digital design can be XTMRed by:

• “Triplication” of throughput (combinational & sequential) logic

• “Triplication” of feedback logic and inserting majority voters

• Adding redundant IO (outputs with minority voters)

• Design cleanup (removing half-latches, SRL16s, etc.)


XTMR State-Machines

[Figure: state-machine block diagrams, “Pre-TMR” and “Post-XTMR”]

• XTMR provides autonomous re-synchronization of the separate redundant domains of a state-machine by inserting majority voters at the origin of any registered feed-back “Looped” path.

• When a configuration upset disables one domain, the other two domains continue to operate providing a correct majority representation of state data and functionality.

• When “Scrubbing” fixes the configuration of the upset domain, the embedded redundant voters automatically correct the state of the upset domain without any external intervention.

• As long as the scrub rate is greater than the upset rate, a single bit upset cannot disturb more than one redundant domain.
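
The re-synchronization behavior described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the logic that the TMRTool generates: the 8-bit counter used as the state machine, the `majority` and `step` helpers, and the way an upset domain is modeled (its register is forced to 0) are all hypothetical.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (b & c) | (a & c)


def step(states: list[int], broken: set[int]) -> list[int]:
    """Advance three redundant 8-bit counters by one clock.

    Every domain votes on all three registered states before computing its
    next state (the voter sits on the feedback path). A domain listed in
    `broken` models a configuration upset: its register is forced to 0.
    """
    voted = majority(*states)
    nxt = (voted + 1) & 0xFF
    return [0 if d in broken else nxt for d in range(3)]


states = [0, 0, 0]
for _ in range(5):              # normal operation
    states = step(states, set())
for _ in range(3):              # domain 1 is upset; the other two keep counting
    states = step(states, {1})
voted_during_upset = majority(*states)

states = step(states, set())    # scrubbing has repaired domain 1
```

Because each domain's next state is computed from the voted value rather than from its own register, the repaired domain picks up the correct count on the first clock after the scrub, with no reset or external intervention.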


XTMR Inputs

• Effective SEU Mitigation requires the use of triple redundant input pins for every input signal.

• Not triplicating input Global signals (clk, rst, etc) can seriously compromise SEU resistance.

• Triplication of input data paths can be traded for EDAC.

• SEU resistance is sometimes a trade-off for resource utilization.


XTMR Outputs with Minority Voters

• Outputs can be triplicated, using three pins for each output signal.

• Minority voters monitor each of the triplicated design modules.

• If one module's output differs from the other two, its output pin is driven to High-Z.

• Voters are triplicated.

[Figure: triplicated outputs TR0, TR1 and TR2, each driving its own pin (P) through a Minority Voter. The convergence point is outside the FPGA, at the trace.]
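
The output scheme above can be sketched in a few lines. This is an illustrative model only, assuming single-bit outputs and at most one faulty domain; the function names (`minority_voter`, `drive_pins`, `trace_level`) are hypothetical, and `None` stands in for a High-Z pin.

```python
from typing import Optional


def majority(a: int, b: int, c: int) -> int:
    return (a & b) | (b & c) | (a & c)


def minority_voter(own: int, other1: int, other2: int) -> bool:
    """True when this domain's output disagrees with the majority."""
    return own != majority(own, other1, other2)


def drive_pins(tr0: int, tr1: int, tr2: int) -> list[Optional[int]]:
    """Return the level driven on each of the three pins (None = High-Z)."""
    outs = (tr0, tr1, tr2)
    pins: list[Optional[int]] = []
    for i, own in enumerate(outs):
        others = [outs[j] for j in range(3) if j != i]
        pins.append(None if minority_voter(own, *others) else own)
    return pins


def trace_level(pins: list[Optional[int]]) -> int:
    """Resolve the board trace where the three pins converge."""
    driven = {p for p in pins if p is not None}
    if len(driven) != 1:
        raise ValueError("contention or floating trace")
    return driven.pop()


pins = drive_pins(1, 0, 1)      # domain TR1 is wrong
level = trace_level(pins)
```

With a single faulty domain, the disagreeing pin floats and the two remaining pins drive the same level, so the trace never sees contention.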


Xilinx TMRTool

• The Xilinx TMRTool is a graphical application that automates the implementation of XTMR for FPGA designs.

• The designer is provided the flexibility to selectively apply XTMR to their design at the instance, component, and hierarchical levels.

• Custom mitigation methods may be employed for specific portions of the design through user-created library macros.

• Designs are imported from a Xilinx netlist (NGO/NGC) and exported as a single standard EDIF project source.


XTMR SEE Testing

• Validation of mitigation of architectural resources by superposition.

– Separate experiments were created to cover the major elements of the Virtex-II architecture:

• Configurable Logic Block

– Combinatorial Logic, Sequential Logic, Arithmetics, Multiplexing.

– Design implementation is an array of state-machines.

• Multipliers

– Dedicated 18 x 18 bit multiply function blocks.

– Design implementation is an array of Multiply and Accumulate functions.

• Block Memories

– Synchronous Dual Port 18k bit RAM blocks.

– Design implemented as a single large memory space for high speed store and fetch functions.

• Input Output Blocks

– Multi-standard discrete & bi-directional un/registered device IO.

– Design implemented as feed-thru channels from IOB to IOB.

• Digital Clock Managers

– Clock frequency synthesis and phase delay re-alignment.

– This will be tested in future work.


2V6000 Dynamic SEU Test

[Figure: photographs of the test setup. Labels: Inside target room; Configuration Monitor / Strip Chart; Functional Monitor / Strip Chart; Back Side; Front Side; BEAM; Thinned DUT.]


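CLB Test Design

[Figure: block diagram of the CLB test. DUT: modules mod0 through mod15, each containing 32-bit counters (+1 through +32) feeding a MUX 32x1x32, with the module outputs feeding a MUX 32x1x16 onto a 16-bit bus. SERVICE: Configuration Manager Core (SelectMAP), Functional Monitor FSM, Error Counters.]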


CLB Test Functional Description

• The CLB test “pre-TMR” design consists of 512 (32 bit) counters created as 16 modules of 32 counters per module. Each counter in the module increments by a different value. The output of each module is a multiplex of the 32 counters. The outputs of all the modules are again multiplexed to a single 16 bit bus. A 10 bit address bus is used to select the output of a specific counter and select between the upper and lower 16 bit banks of the 32 bit module outputs.

• The Xilinx TMRTool software is used to process the design into a fully XTMR mitigated design. Both the TMR and pre-TMR designs undergo active scrubbing (partial reconfiguration for SEU correction) for the configuration of the DUT.

• All counters are running continuously. Each counter is selected sequentially for sampling of its current state and operation.

• For each module sample taken, the actual and expected values are recorded along with a sequential count of state errors and the running count of event errors into a strip chart file on the host PC.

• When counters are observed to be permanently in the wrong state, the design is reset to regain a fully functioning test.

• The final error count is calculated as the number of events in which a counter either lost its state or moved to the wrong state.
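
The sampling scheme described above can be sketched as follows. The slides give the bus widths but not the bit layout, so the address split used here (4 module bits, 5 counter bits, 1 bank bit) and the increments of 1 through 32 are assumptions; `decode` and `expected_sample` are hypothetical names.

```python
MASK32 = 0xFFFFFFFF


def decode(addr: int) -> tuple[int, int, int]:
    """Split a 10-bit address into (module, counter, bank).

    Assumed layout: bits [9:6] module, bits [5:1] counter, bit [0] bank.
    """
    if not 0 <= addr < 1024:
        raise ValueError("address must fit in 10 bits")
    return (addr >> 6) & 0xF, (addr >> 1) & 0x1F, addr & 0x1


def expected_sample(addr: int, clocks: int) -> int:
    """Expected 16-bit sample after `clocks` cycles from a cleared state."""
    _module, counter, bank = decode(addr)
    increment = counter + 1                  # assumed: counter k steps by k + 1
    value = (increment * clocks) & MASK32    # 32-bit wrap-around
    return (value >> 16) & 0xFFFF if bank else value & 0xFFFF


low = expected_sample(62, 3)         # module 0, counter 31, lower bank
high = expected_sample(63, 2 ** 26)  # module 0, counter 31, upper bank
```

A functional monitor built this way would compare each sampled word against `expected_sample` and log any mismatch as a state error.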


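Multiplier Test Design

[Figure: block diagram of the multiplier test. DUT: modules mod0 through mod15, each containing three 36-bit constant MACs (+1 x1, +1 x10, +1 x11) feeding a MUX 3x2x32, with the module outputs feeding a MUX 32x1x16 onto a 16-bit bus. SERVICE: SelectMAP, Functional Monitor FSM, Error Counters.]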

Multiplier Test Functional Description

• The Multiplier test “pre-TMR” design consists of 48 (18x18x36 bit) Multiply and Accumulate (MAC) blocks created as 16 modules of 3 MACs per module. Each MAC in the module increments by 1 and multiplies by a different constant (1, 10, and 11, respectively). The output of each module is a multiplex of the 3 MACs and a select of the lower 32 bits and upper 4 bits of the 36 bit registered multiplier output. The outputs of all the modules are again multiplexed to a single 16 bit bus. An 8 bit address bus is used to select the output of a specific MAC and select between the upper and lower 16 bit banks of the 32 bit module outputs.

• The Xilinx TMRTool software is used to process the design into a fully TMR mitigated design. Both the TMR and pre-TMR designs undergo active scrubbing (partial reconfiguration for SEU correction) for the configuration of the DUT.

• All MACs are constantly accumulating. Each MAC is selected sequentially for a periodic sampling of its sequence.

• For each module sample taken, the actual and expected values are recorded along with a sequential count of state errors and the running count of event errors into a strip chart file on the host PC.

• When MACs are observed to be permanently in the wrong state, the design is reset to regain a fully functioning test.

• The final error count is calculated as the number of events in which a MAC lost its state or produced an incorrect result.


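BRAM Test Design

[Figure: block diagram of the BRAM test. DUT: 128k byte RAM with 16-bit ADDRESS and DATA buses, data derived from the address by a -1 stage. SERVICE: SelectMAP, Functional Monitor FSM, Error Counters.]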


BRAM Test Functional Description

• The Block Memory test “pre-TMR” design consists of a single large 128k byte single port memory space created from 64 memory blocks of 16k bits each.

• The Xilinx TMRTool software is used to process the design into a fully TMR mitigated design. Both the TMR and pre-TMR designs undergo active scrubbing (partial reconfiguration for SEU correction) for the configuration of the DUT.

• Separate WRITE and READ routines are executed to all memory address locations. The data is derived from a decrement of the address value. The entire memory space is refreshed with a write operation and then the data is retrieved with a read operation.

• During the read operation the retrieved data is compared against the expected value.

• For each data sample taken, the actual and expected values are recorded with the running count of event errors into a strip chart file on the host PC.

• Each error event is measured for its total word error size in bits: 1, 32, 64, 512, 1024, etc.

• The final error count is calculated as the number of separate events of word errors.
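
The write/read routine above can be sketched as a simple memory test. The organization as 64k words of 16 bits is an assumption based on the 16-bit buses in the block diagram; `pattern`, `write_pass`, `read_pass` and `bit_errors` are hypothetical names, and a Python list stands in for the device memory.

```python
WORDS = 64 * 1024            # 128k bytes viewed as 16-bit words (assumed)


def pattern(addr: int) -> int:
    """Expected data: the address decremented by one, wrapped to 16 bits."""
    return (addr - 1) & 0xFFFF


def write_pass(mem: list[int]) -> None:
    """Refresh the entire memory space with the address-derived pattern."""
    for addr in range(WORDS):
        mem[addr] = pattern(addr)


def read_pass(mem: list[int]) -> list[tuple[int, int, int]]:
    """Return (address, actual, expected) for every mismatching word."""
    return [(a, mem[a], pattern(a)) for a in range(WORDS) if mem[a] != pattern(a)]


def bit_errors(errors: list[tuple[int, int, int]]) -> int:
    """Total number of flipped bits across all mismatching words."""
    return sum(bin(actual ^ expected).count("1") for _, actual, expected in errors)


mem = [0] * WORDS
write_pass(mem)
clean = read_pass(mem)
mem[100] ^= 0x0004           # inject a single-bit upset
errors = read_pass(mem)
```

Deriving the data from the address lets the monitor compute the expected value on the fly, so no golden copy of the memory has to be stored on the service side.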


Configuration Error Detection and Correction Algorithm

[Figure: flowchart. Nodes: START, CONFIGURE, DONE, READBACK CONFIG CRC, CHECK PORT, SEFI, SCRUB, READBACK SCRUB CRC. Decisions: CONFIG = SCRUB, PREV = SCRUB (PREV CRC), CRC ERROR = 2. Actions: CRC ERROR +1, CRC ERROR = 0.]

• Configure target FPGA with configuration data stored in the configuration PROM(s).

• Read back configuration programming data from target FPGA and calculate 16 bit CRC. Store CRC value as “Config-CRC”.

• Perform a Write/Read check on the internal Frame Address Register of target FPGA.

• Scrub (background refresh) configuration data of target FPGA.

• Read back configuration programming data from target FPGA and calculate 16 bit CRC. Store CRC value as “Rdbk-CRC” and perform bit-for-bit error detection of configuration data.

• Compare “Rdbk-CRC” with “Config-CRC”.

• If CRC values mismatch a second time then assert SEFI_ERROR and RECONFIGURE.
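
The steps above can be sketched as a monitoring loop. Everything device-related here is an assumption: `FakeFpga` is a hypothetical stand-in, not a real driver API; the CRC is the standard-library CRC-CCITT because the slides do not name the 16-bit CRC used; and treating a failed Frame Address Register check as a SEFI is one reading of the flowchart.

```python
import binascii


class FakeFpga:
    """Hypothetical stand-in for the target device (not a real driver API)."""

    def __init__(self, bitstream: bytes):
        self.golden = bitstream        # contents of the configuration PROM
        self.config = bytearray()
        self.scrub_works = True        # False models a fault scrubbing cannot fix

    def configure(self) -> None:
        self.config = bytearray(self.golden)
        self.scrub_works = True        # assumed: full reconfiguration clears it

    def scrub(self) -> None:
        if self.scrub_works:
            self.config[:] = self.golden

    def readback(self) -> bytes:
        return bytes(self.config)

    def check_far(self) -> bool:
        return True                    # Frame Address Register write/read check


def crc16(data: bytes) -> int:
    return binascii.crc_hqx(data, 0xFFFF)


def monitor(fpga: FakeFpga, cycles: int, inject=None) -> list[str]:
    log = []
    fpga.configure()
    config_crc = crc16(fpga.readback())        # "Config-CRC"
    mismatches = 0
    for cycle in range(cycles):
        if inject is not None:
            inject(fpga, cycle)
        far_ok = fpga.check_far()
        fpga.scrub()
        rdbk_crc = crc16(fpga.readback())      # "Rdbk-CRC"
        if far_ok and rdbk_crc == config_crc:
            mismatches = 0
            log.append("OK")
            continue
        mismatches += 1
        if mismatches >= 2 or not far_ok:      # second mismatch in a row
            log.append("SEFI_ERROR")
            fpga.configure()                   # RECONFIGURE
            mismatches = 0
        else:
            log.append("CRC_ERROR")
    return log


def inject(fpga: FakeFpga, cycle: int) -> None:
    if cycle == 1:
        fpga.config[10] ^= 0x01                # ordinary upset, scrub repairs it
    if cycle == 2:
        fpga.config[20] ^= 0x80
        fpga.scrub_works = False               # upset that survives scrubbing


log = monitor(FakeFpga(bytes(range(256))), cycles=5, inject=inject)
```

In this model an ordinary configuration upset never shows up in the log because the scrub repairs it before the readback; only an error that persists through two consecutive scrub cycles is escalated to a SEFI and a full reconfiguration.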


Previous SEE Test Methodology for Mitigation

• The assertion behind the combined mitigation method of XTMR & Scrubbing is that the complete removal of Single Event Functional Errors in the user logic leaves any user design with an overall error rate determined only by the remaining Single Event Functional Interrupts. Therefore, a successful mitigation test is expected to produce zero errors other than SEFIs.

• Since the effectiveness of TMR is dependent upon no accumulation of errors in the configuration, experiments were attempted to maintain an upset rate that did not exceed the scrub rate. This methodology had two significant flaws:

– One is the impracticality of testing at such low fluxes: run times become unreasonably long, so the test cannot reach sufficient fluence for acceptable statistical significance of the data.

– The other flaw is that a zero error rate result is not useful for making any calculations or extrapolations.

• These issues raise concerns over the validity of any results.


Improved SEE Test Methodology for Mitigation

• The functional error rate of a mitigated system is expected to depend on the upset rate. The expected relationship predicts the increasing probability of upsetting bit combinations that will cause a mitigated (TMR) system to fail as the bit upset rate rises:

$$\mathrm{MER} = \frac{1}{2}\,\frac{N_B\,C_A}{T_S}\,R_U^{2}$$

– $\mathrm{MER}$ = Mitigation Error Rate

– $N_B$ = Number of Relevant Bits

– $C_A$ = Average Cluster Size

– $T_S$ = Scrub Time

– $R_U$ = Upset Rate of Relevant Bits

• Therefore, testing at extremely high fluxes over several orders of magnitude variation can be performed to reveal this functional relationship between mitigation error rate and bit upset rate.

• This function can then be extrapolated to make predictions at the much lower upset rates of earth orbits.
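
The quadratic dependence is what makes the extrapolation worthwhile, and a short sketch shows its effect. The function below simply evaluates the MER expression as given on this slide; the parameter values are hypothetical placeholders chosen for illustration, not measured figures.

```python
import math


def mitigation_error_rate(n_bits: float, cluster_size: float,
                          scrub_time: float, upset_rate: float) -> float:
    """MER = (1/2) * (N_B * C_A / T_S) * R_U**2, as given on the slide."""
    return 0.5 * (n_bits * cluster_size / scrub_time) * upset_rate ** 2


# Hypothetical parameter values, for illustration only.
params = dict(n_bits=1.0e6, cluster_size=10.0, scrub_time=0.5)

beam = mitigation_error_rate(upset_rate=1.0e-3, **params)
orbit = mitigation_error_rate(upset_rate=1.0e-4, **params)
ratio = beam / orbit
```

Lowering the upset rate by one order of magnitude lowers the predicted mitigation error rate by two, which is why a fit obtained at very high beam flux falls away so quickly when extended down to orbital upset rates.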


Plot Definitions

• Predicted SEFI cross-section

– Static and Dynamic SEE Characterization of the Virtex-II FPGA revealed several Single Event Functional Interrupt modes: POR (2.5E-06), SMAP (1.72E-06), IOB (4.2E-06).

– These combined cross-sections represent the minimum functional error cross-section for a single Virtex-II (XQR2V6000) device on orbit.

• Worst Case Orbital Upset Rate

– The CREME96 calculation of the worst case orbital upset rate for an XQR2V6000 is 7,740 bit-errors/day (9E-02 bit-errors/sec) in a GEO orbit at 36,000km during the worst day of an Anomalously Large Solar Flare, accounting for both Heavy Ion and Proton. In a 40MeV Kr beam the same upset rate is achieved with a Flux of 1.25E-01 p/cm2/s. The equivalent upset rates for all other orbits and solar conditions therefore lie to the LEFT of this line.

• Single Event Functional Interrupts

– This is the average cross-section of the observed SEFI(s) while collecting the data represented in the plot. This cross-section is not Flux dependent. Variations from the predicted value are due to the statistical significance of the total accumulated fluence during each test.

• Functional Errors

– Data plot of the observed events when the Device Under Test returned an incorrect result. Cross-section is determined by the number of error events divided by total fluence at the specified flux. TMR denotes that the DUT design was fully mitigated with XTMR and scrubbing. The Unmitigated results were obtained with a functionally identical design without XTMR; scrubbing was still used for the unmitigated test.

• Extrapolation

– A derived function describing mitigation failure as a function of upset rate. Extension of the function predicts functional error cross-sections at worst case orbital upset rates to be less than SEFI cross-sections.


XQR2V6000 Mitigation Error Statistics (CLB/IOB Logic: State-Machines)

[PLOT 1: log-log plot of Sigma (cm2/device), 1.00E-07 to 1.00E-02, versus Beam Flux (particles/cm2/s), 1.E-02 to 1.E+04, with a secondary x-axis of Configuration Bit Errors per Scrub Cycle, 3.5E-02 to 3.5E+03. Beam: 40 MeV Kr, LET = 22.3 MeV/cm2/mg. Series: Unmitigated Functional Errors; TMR Functional Errors; Extrapolation (square root function); Single Event Functional Interrupts (SEFIs); Worst Case Orbital Upset Rate (9E-2 Upsets/Sec); Predicted SEFI Cross-Section. Annotations: 36,000km GEO Orbit, Worst Day Solar Flare, 8,000 bit-errors/day; All other orbits; SEFIs drive error rate for all designs and all orbits; Mitigation errors on orbit are always less than SEFI errors by orders of magnitude.]


XQR2V6000 Mitigation Error Statistics (Dedicated Multipliers: Multiply-and-Accumulate)

[PLOT 2: log-log plot of Sigma (cm2/device), 1.00E-07 to 1.00E-02, with x-axis ticks from 1.E-02 to 1.E+05. Series: Unmitigated Functional Errors (other legend entries not recoverable). Annotations: 36,000km GEO Orbit, Worst Day Solar Flare, 8,000 bit-errors/day; All other orbits; SEFIs drive error rate for all designs and all orbits.]


XQR2V6000 Mitigation Error Statistics (Block Memory: Read/Write)

[PLOT 3: log-log plot of Sigma (cm2), 1.00E-12 to 1.00E+02, with x-axis ticks from 1.E-02 to 1.E+05 (legend entries not recoverable). Annotations: 36,000km GEO Orbit, Worst Day Solar Flare, 8,000 bit-errors/day; All other orbits; SEFIs drive error rate for all designs and all orbits.]


SEE Test Analysis

• The experiments were conducted over a flux range of 7E+00 to 4E+04 (p/cm2/s).

• The Flux rates have been normalized in the secondary (top) x-axis of the plots to “average bit upsets per scrub cycle” (RS).

• Each experiment demonstrated a drop in failure cross-section over several orders of magnitude, crossing the SEFI cross-section at upset rates that are still several orders of magnitude above worst case orbital upset rates.

• Extrapolating this data for each experiment clearly demonstrates a mitigation error cross-section at least one order of magnitude below the SEFI cross-section at worst case orbital upset rates.

• By superposition of the data fit functions, the total effective mitigated error rate cross-section is

$$\sigma_{TOTAL} = \sigma_{BRAM} + \sigma_{CLB} + \sigma_{MULT} + \sigma_{SEFI}$$

$$\sigma_{TOTAL} = 5.0\times10^{-8}\,(1.4\,R_S)^{2} + 5.0\times10^{-6}\,(0.7\,R_S)^{0.5} + 1.75\times10^{-6}\,(1.4\,R_S)^{0.35} + 8.42\times10^{-6}\ \ (\mathrm{cm^2})$$

• Therefore, at the worst case orbital upset rate of 9E-2 upsets/sec (RS=4.5E-2 upsets/scrub) the effective total cross-section for functional error is calculated:

$$\sigma_{TOTAL} = 1.05\times10^{-5}\ \ (\mathrm{cm^2/device})\quad\text{\{Orbital Worst Case\}}$$
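
The superposition can be evaluated numerically with a short sketch. It assumes that the trailing numbers in each fit term (2, 0.5, 0.35) are exponents, which agrees with the square root extrapolation named in the Plot 1 legend; the function name `sigma_total` is hypothetical.

```python
import math

# SEFI modes from the Plot Definitions slide: POR + SMAP + IOB.
SIGMA_SEFI = 2.5e-6 + 1.72e-6 + 4.2e-6


def sigma_total(rs: float) -> float:
    """Total functional error cross-section for a given RS (upsets per scrub)."""
    bram = 5.0e-8 * (1.4 * rs) ** 2
    clb = 5.0e-6 * (0.7 * rs) ** 0.5
    mult = 1.75e-6 * (1.4 * rs) ** 0.35
    return bram + clb + mult + SIGMA_SEFI


at_rs = sigma_total(4.5e-2)     # RS quoted on the slide
at_rate = sigma_total(9.0e-2)   # upsets/sec figure substituted for RS
sefi_share = SIGMA_SEFI / at_rs
```

Under this reading, RS = 4.5E-2 gives roughly 1.0E-5 cm2, and the quoted 1.05E-5 cm2 is matched when 9E-2 is substituted for RS. In both cases the SEFI term contributes more than 80% of the total, which is the point the slide makes.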


Conclusions

• Validation of mitigation techniques becomes far more efficient and accurate when the upset rate dependency of the mitigation method is demonstrated by testing at Flux rates that overwhelm the mitigation.

• The static SEFI cross-section is the dominating factor for calculating orbital error rates for any Virtex-II design when mitigated with Full XTMR & Scrubbing.

• Future Work

– The authors recognize an anomaly in the data fit functions in that they were not all expressed as a square function. It is anticipated that this is due to the complexity of the bit clusters of the experimental designs. Additional research is called for to derive the separate coefficients for the MER equation for each design and explain their functional associations.
