Comparison of Single-Event Effect Mitigation Methods using Design Impact and Application Performance Metrics Ian Troxel SEAKR Engineering, Inc. Centennial, CO Military and Aerospace Programmable Logic Devices (MAPLD) Conference NASA Goddard Space Flight Center in Greenbelt, MD August 31 - September 3, 2009


Comparison of Single-Event Effect Mitigation Methods using Design Impact and Application Performance Metrics

Ian Troxel
SEAKR Engineering, Inc.
Centennial, CO

Military and Aerospace Programmable Logic Devices (MAPLD) Conference
NASA Goddard Space Flight Center in Greenbelt, MD
August 31 - September 3, 2009


Motivation

o Flexible, multiuse payloads sought to limit NRE in space payload processors

o High-performance, SRAM-based FPGAs frequently required to meet mission requirements but require radiation mitigation to achieve fault tolerance goals

o Mitigation methods are application dependent
• SWAP constraints
• Processing performance
• Reliability requirements
• Design schedule
• Type of data and peripherals
• Latency constraints

o Optimum designs may use several methods

o SBIR Phase 1 topic compared mitigation methods for a particular application within an AFRL mission

[Figure: two plots with Processing Performance per unit of SWAP on the vertical axis, one against Level of Effort and one against Reliability]


Background

o SEAKR’s Application Independent Processor (AIP) formed baseline system along with other component options for onboard processor

o AIP Processing Features
• Mixture of processor and I/O capability
– Reconfigurable Computer Board(s)
– Xilinx® Virtex®-4 FPGAs
– COTS PowerPC®-based SBC(s)
– Gigabit Ethernet and SpaceWire
– Mezzanine cards for custom features
• Reconfigurable on-orbit
• Flexible, scalable architecture
• Adaptable fault tolerance

o Mitigation analysis focused on Xilinx Virtex-4 series

o Application analysis included Xilinx, Actel® and microprocessor options

[Figure: Baseline AIP Flight Unit]


AIP System Architecture

o Reconfigurable Computer Board(s)
• Xilinx Virtex-4, high-speed memory and SERDES backplane

[Figure: RCC Board Architecture. Block labels: Coprocessor Xilinx V4 (x3), High Speed Memory (x3), High Speed Mezzanine (x3), I/O (x3), Memory, PCI-PCI Bridge / Config, PCI, Configuration, cPCI, High Speed Serial Network, High-Speed SERDES]


Radiation Effect Mitigation Techniques

| Technique | Description and Comments |
|---|---|
| Chip-level Redundancy | Replicate FPGA devices and all I/O. External voting, or devices vote each other. Single device failures are masked if triplicated. Use of internal features allowed. |
| Full Internal Logic Redundancy | Replicate all logic within a design and vote on intermediate signals. Does not require an external voter but imposes limitations (no special features, area penalty). Not all SEFI modes can be addressed. |
| Continuous Read-back and Scrub | Read configuration memory, compare it to a “golden” standard and correct discrepancies. Unobtrusive and application-independent method; complements other methods. Potentially allows corrupt data propagation. |
| Application-based Fault Tolerance | Augment critical data with checksums to determine if an error occurred. Highly application-specific. |
| Data Replay | Buffer input data and provide a mechanism to temporally replicate computations to detect errors. |
| Selective Logic Redundancy | Selectively replicate via circuit analysis to trade area for robustness. Benefits of internal logic redundancy with reduced restrictions and area penalties. |
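
The voting step that the redundancy techniques above rely on can be sketched in a few lines. This is a minimal illustration in Python, not the voter used on the AIP; the function names and the choice to model each replica output as an integer word are assumptions made for the example.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value held by at least two replicas."""
    return (a & b) | (a & c) | (b & c)


def vote_with_status(a: int, b: int, c: int):
    """Return the voted word plus the names of the replicas that disagree with it."""
    voted = majority_vote(a, b, c)
    replicas = (("a", a), ("b", b), ("c", c))
    disagreeing = [name for name, value in replicas if value != voted]
    return voted, disagreeing


if __name__ == "__main__":
    # Hypothetical values: replica c has one flipped bit, which the voter masks.
    print(vote_with_status(0b1010, 0b1010, 0b0010))
```

With three replicas a single faulty output is masked, which matches the "single device failures masked if triplicated" comment for chip-level redundancy. The list of disagreeing replicas is only a convenience for the sketch; how a real system reacts to a disagreement is outside its scope.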


Characterization Metrics

| Metric | Description and Comments |
|---|---|
| Design Overhead | Increase in needed resources as compared to the original design. |
| External Voter | Does the technique require an external voter? |
| Non-standard Device Use | Does the technique allow the use of non-standard devices within the FPGA (e.g. BRAM, DSP blocks, etc.)? Restrictions enumerated. |
| Development Ease | Difficulty to incorporate into design. (5=easiest, 1=most difficult) |
| Performance Impact | What performance impacts does the technique impose on the design? Quantitative values provided where possible. |
| Other System Impacts | Does the technique impact other aspects of the external system (besides requiring a voter), such as a need for additional buffer space? |
| Error Coverage | What error types does the technique correct (e.g. SEU, SET, SEFI, etc.)? Error types enumerated. |
| Degree of Robustness | How many faults can the technique detect and correct? Does fault locality affect the technique’s robustness? (5=full coverage, 1=none) |
| Timeliness of Fault Correction | How fast can the technique detect errors? Can errors remain latent? The timeframe within which the technique can detect and correct faults includes a collection of robustness considerations. |


Technique Evaluation

o All techniques require external resources and/or the use of additional internal resources. DR does not require additional internal resources.

o External voters are required in all external schemes and may be used in internal schemes depending on implementation.

o Internal redundancy schemes often do not allow for the use of special internal components such as DSP blocks.

| Metric | CLR | FILR | RBAS | ABFT | DR | SILR |
|---|---|---|---|---|---|---|
| Overhead | N times devices and I/O | More than N times internal resources | Support devices and config. buffer | Additional device resources | Support devices | Less than N times internal resources |
| External Voter | Yes | Sometimes | Yes | Sometimes | Yes | Sometimes |
| Allow Special Components | Yes | Sometimes | Yes | Yes | Yes | Sometimes |

Legend: CLR=Chip-Level Redundancy, FILR=Full Internal Logic Redundancy, RBAS=Read-back & Scrub, ABFT=Application-based Fault Tolerance, DR=Data Replay, SILR=Selective Internal Logic Redundancy


Technique Evaluation (2)

o ABFT and SILR require substantial application development and an intimate knowledge of the application and design

o Design performance is typically limited by the voter; DR reduces system throughput but has no impact on logic usage

o All designs besides DR require additional logic resources

| Metric | CLR | FILR | RBAS | ABFT | DR | SILR |
|---|---|---|---|---|---|---|
| Ease of Development | 4 | 3 | 5 | 2 | 4 | 1.5 |
| Performance Impact | Limited to speed of voter | Limited to speed of internal voters | Virtually none | Application dependent | System throughput reduced | Limited to speed of internal voters |
| Other System Impacts | N times the cost and SWaP | Logic reduced by >N times | Config. buffer | Detailed analysis | Data buffer | Logic reduced by <N times, analysis |
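
The throughput cost and data buffer listed for DR follow from how data replay works: the same buffered input is processed more than once and the results are compared. A minimal sketch, assuming a software comparison in place of the external voter and a hypothetical `compute` callable standing in for the FPGA processing step:

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def replay_compute(
    compute: Callable[[Sequence[int]], T],
    buffered_input: Sequence[int],
    max_rounds: int = 3,
) -> T:
    """Run `compute` twice on the same buffered input and accept the result
    only when both runs agree; otherwise replay, up to `max_rounds` rounds."""
    for _ in range(max_rounds):
        first = compute(buffered_input)
        second = compute(buffered_input)  # temporal replica of the same computation
        if first == second:
            return first
    raise RuntimeError("replayed results did not agree within max_rounds")
```

Every accepted result costs at least two passes over the same input, so a unit that would otherwise process one block per interval delivers one block every two intervals. The retry limit and the exception are choices made for this sketch only.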



Technique Evaluation (3)

o CLR and DR are the only two approaches that ensure full coverage

o Other approaches may not catch all errors before they propagate

o FILR and SILR provide immediate error detection (if covered)

o CLR and DR require a processing interval to detect an error (external)

o The timeliness of RBAS and ABFT varies with implementation

| Metric | CLR | FILR | RBAS | ABFT | DR | SILR |
|---|---|---|---|---|---|---|
| Error Coverage | Full SEU, SET and SEFI | Full SEU, partial SET and SEFI | Partial SEU and SEFI and no SET | Full SEU, partial SET and SEFI | Full SEU, SET and SEFI | Partial SEU, SET and SEFI |
| Robustness | 5 | 4.5 | 2 | 4.5 | 5 | 3 |
| Timeliness of Fault Correction | One processing interval | Immediate | Varies based on scrub rate | Varies (application specific) | N times processing interval | Immediate if covered |
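
The dependence of RBAS timeliness on scrub rate can be seen in a toy model of one read-back-and-scrub pass. The sketch below is an assumption-laden illustration: configuration memory is modeled as a list of byte strings ("frames"), and no real device configuration interface is involved.

```python
from typing import List


def scrub_pass(live_frames: List[bytes], golden_frames: List[bytes]) -> List[int]:
    """Compare each live frame with the golden copy, rewrite any frame that
    differs, and return the indices of the frames that were corrected."""
    if len(live_frames) != len(golden_frames):
        raise ValueError("live and golden images must have the same number of frames")
    corrected = []
    for index, golden in enumerate(golden_frames):
        if live_frames[index] != golden:   # read back and compare
            live_frames[index] = golden    # scrub: rewrite from the golden copy
            corrected.append(index)
    return corrected
```

An upset frame stays corrupted until the pass reaches it, so the time to correction depends on how often passes run. Any data processed in the meantime may already have propagated, which is the error-propagation concern noted for this technique.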



Comparison Summary


| Technique | Pro | Con |
|---|---|---|
| CLR | All errors detectable; straightforward to implement | N times devices and external voter; error detect delay of one interval |
| FILR | Immediate fault detection | >N times logic and internal voters |
| RBAS | No performance impact; straightforward to implement | Support devices and config. buffer; high potential for error propagation |
| ABFT | Application dependent | Application dependent |
| DR | No additional logic resources; special structures allowed; all errors detectable | Halves throughput; external voter required but no impact; error detect delay of one interval |
| SILR | Immediate fault detection (for covered faults); less logic overhead than FILR | Special structures not allowed; partial fault coverage; detailed design analysis required |
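
ABFT is summarized above only as application dependent. As one concrete illustration of augmenting data with checksums, the sketch below protects an element-wise vector addition by predicting the checksum of the result from the inputs. The operation, the plain-sum checksum and the function names are all assumptions chosen for the example; they are not taken from the application studied here.

```python
from typing import List, Tuple


def add_with_checksum(a: List[int], b: List[int]) -> Tuple[List[int], int]:
    """Add two integer vectors and return the result with a predicted checksum.

    For element-wise addition, sum(result) must equal sum(a) + sum(b), so the
    checksum can be computed from the inputs before the result exists."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    expected_checksum = sum(a) + sum(b)
    result = [x + y for x, y in zip(a, b)]
    return result, expected_checksum


def verify(result: List[int], expected_checksum: int) -> bool:
    """Return True when the result still matches the predicted checksum."""
    return sum(result) == expected_checksum
```

A single corrupted element changes the sum and is detected; errors that happen to cancel in the sum are not, which is one reason the coverage of such a scheme depends on the application and on the checksum chosen.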


Recommendations


| Technique | Recommendations |
|---|---|
| CLR | Appropriate when board area and device cost are less of a concern than performance and robustness |
| FILR | Use instead of CLR when board space is at a premium and a given design can fit within 1/N of the logic cells on the FPGA |
| RBAS | Appropriate for all designs, combined with other techniques |
| ABFT | Use when the application is well understood and other options are not feasible |
| DR | Appropriate when board area and part count are at a premium and full error coverage is required; however, latency/throughput cannot be sacrificed |
| SILR | Use instead of FILR when board space is at a premium and a given design cannot fit within 1/N of the logic cells on the FPGA. Do not use if the design is not well understood or if hand placement is not an option. |

o “Best technique” is application- and mission-dependent and must be further investigated for each application
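
The recommendations above can be read as a rough decision procedure. The sketch below encodes one possible reading in Python; the function name, the parameter names, the default N of 3 and especially the order in which the rules are tried are assumptions, and the latency/throughput condition attached to DR is not modeled.

```python
from typing import List


def suggest_techniques(
    design_cells: int,
    fpga_cells: int,
    n: int = 3,
    board_space_at_premium: bool = True,
    full_coverage_required: bool = False,
    design_well_understood: bool = True,
    hand_placement_possible: bool = True,
) -> List[str]:
    """Return candidate techniques for a design, following the table's rules."""
    picks = ["RBAS"]  # recommended for all designs, combined with other techniques
    if not board_space_at_premium:
        picks.append("CLR")
    elif full_coverage_required:
        picks.append("DR")
    elif design_cells * n <= fpga_cells:
        picks.append("FILR")  # design fits within 1/N of the logic cells
    elif design_well_understood and hand_placement_possible:
        picks.append("SILR")
    elif design_well_understood:
        picks.append("ABFT")  # other options not feasible
    return picks
```

For example, with hypothetical sizes, `suggest_techniques(30_000, 100_000)` selects FILR alongside RBAS because three copies fit, while `suggest_techniques(50_000, 100_000)` falls through to SILR. A real selection would still need the per-application investigation the slide calls for.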


Application Analysis Setup

o Two system types examined for the space application
• Two types of architectures used for comparison
• FPGAs and microprocessors included in the analysis

[Figure: two architecture diagrams, labeled Internal Processor Resources and External Processor Resources. Each shows Sensors (FPAs) and Processor blocks, with Superframe and Data Product labels.]


Processor Options

| Processor Type | Distinguishing Features |
|---|---|
| Virtex-4 LX200 FPGA; Virtex-5 FX130 FPGA | Fine-grained Processing; High Performance; SEU Tolerant (with mitigation) |
| RTAX-2000 FPGA; RTAX-4000 FPGA | Fine-grained Processing; Limited Performance; SEU Immune |
| 603e (350nm) 1-core; 750FX (130nm) 1-core; 7448 (90nm) 1-core; 8641 (90nm) 2-core | Coarse-grained Processing; Commercial Performance; SEU Tolerant (mitigation) |
| LEON 3FT™ (250nm) 1-core; Rad750® (150nm) 1-core; MAESTRO (90nm) 49-core | RHBD or Process; Improved SEU Tolerance |


Analysis Assumptions

o Preliminary candidate systems constructed to meet algorithm and system requirements
• Focused on meeting memory and I/O performance requirements
• Two versions of application studied with three data rates
• Fault tolerance, radiation susceptibility, and cost examined as well

o Key assumptions
• Both algorithm versions focus on front-end sensor processing
• Nominal processor and FPGA performance capabilities have been de-rated based on typical performance achieved
• Only the highest speed processor interface is considered
• FPGAs can support at most 2 DDR interfaces along with a local bus and any required sensor connections
• Xilinx DRAM interfaces four times faster than Actel versions


FPGA Processor Analysis (1)

o Internal option is severely memory limited

o Using only internal resources is impractical due to limited FPGA resources

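Internal option:

| Version | Resource | Rate | Virtex-4 LX200 | Virtex-5 FX130 | RTAX2000 | RTAX4000 | DRAM |
|---|---|---|---|---|---|---|---|
| Basic | Memory | Rate 1 | 74 | 42 | 1518 | 810 | 0 |
| Basic | Memory | Rate 2 | 148 | 84 | 3035 | 1619 | 0 |
| Basic | Memory | Rate 3 | 592 | 334 | 12137 | 6473 | 0 |
| Basic | I/O | Rate 1 | 1 | 1 | 3 | 2 | 0 |
| Basic | I/O | Rate 2 | 2 | 2 | 5 | 4 | 0 |
| Basic | I/O | Rate 3 | 5 | 5 | 17 | 13 | 0 |
| Advanced | Memory | Rate 1 | 185 | 105 | 3793 | 2023 | 0 |
| Advanced | Memory | Rate 2 | 370 | 209 | 7586 | 4046 | 0 |
| Advanced | Memory | Rate 3 | 1480 | 835 | 30341 | 16182 | 0 |
| Advanced | I/O | Rate 1 | 1 | 1 | 3 | 2 | 0 |
| Advanced | I/O | Rate 2 | 2 | 2 | 5 | 4 | 0 |
| Advanced | I/O | Rate 3 | 5 | 5 | 17 | 13 | 0 |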


FPGA Processor Analysis (2)

o Design feasible using external memory resources but is I/O limited

o Designs must be implemented to achieve actual part count

o Radiation mitigation for Virtex would triple the required number of devices

o Actel I/O performance limits the application when scaling data rate

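External option:

| Version | Resource | Rate | Virtex-4 LX200 | Virtex-5 FX130 | RTAX2000 | RTAX4000 | Xilinx DRAM | Actel DRAM |
|---|---|---|---|---|---|---|---|---|
| Basic | Memory | Rate 1 | 1 | 1 | 2 | 2 | 2 | 2 |
| Basic | Memory | Rate 2 | 1 | 1 | 2 | 2 | 2 | 2 |
| Basic | Memory | Rate 3 | 2 | 1 | 3 | 3 | 4 | 4 |
| Basic | I/O | Rate 1 | 1 | 1 | 4 | 3 | 2 | 6 |
| Basic | I/O | Rate 2 | 2 | 1 | 12 | 8 | 4 | 16 |
| Basic | I/O | Rate 3 | 6 | 6 | 28 | 22 | 12 | 44 |
| Advanced | Memory | Rate 1 | 1 | 1 | 2 | 2 | 2 | 2 |
| Advanced | Memory | Rate 2 | 1 | 1 | 2 | 2 | 2 | 2 |
| Advanced | Memory | Rate 3 | 2 | 2 | 5 | 4 | 8 | 8 |
| Advanced | I/O | Rate 1 | 1 | 1 | 7 | 6 | 2 | 12 |
| Advanced | I/O | Rate 2 | 2 | 2 | 23 | 17 | 4 | 34 |
| Advanced | I/O | Rate 3 | 7 | 7 | 34 | 26 | 14 | 52 |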
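
The note that radiation mitigation would triple the number of Virtex devices can be combined with the table values in a short worked calculation. The sketch below uses the basic-version, external-architecture figures for two of the parts. It treats the table entries as device counts, takes the larger of the memory-limited and I/O-limited counts as the requirement, and applies no replication to the RTAX part; all three are assumptions made for illustration, not the author's stated method.

```python
# (memory-limited count, I/O-limited count) per data rate, basic version,
# external architecture, copied from the table for two of the parts.
BASIC_EXTERNAL = {
    "Virtex-4 LX200": {1: (1, 1), 2: (1, 2), 3: (2, 6)},
    "RTAX4000": {1: (2, 3), 2: (2, 8), 3: (3, 22)},
}

# Replication factor for radiation mitigation: tripled for Virtex per the
# slide; 1 for RTAX is an assumption.
MITIGATION_FACTOR = {"Virtex-4 LX200": 3, "RTAX4000": 1}


def devices_needed(part: str, rate: int) -> int:
    """Devices required once the binding limit and mitigation are applied."""
    memory_limited, io_limited = BASIC_EXTERNAL[part][rate]
    return max(memory_limited, io_limited) * MITIGATION_FACTOR[part]


if __name__ == "__main__":
    for rate in (1, 2, 3):
        print(rate, devices_needed("Virtex-4 LX200", rate), devices_needed("RTAX4000", rate))
```

Under these assumptions the two parts need 3 and 3 devices at Rate 1, 6 and 8 at Rate 2, and 18 and 22 at Rate 3. The calculation leaves out DRAM counts and processing capability.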


Microprocessor Analysis

o Microprocessors assumed to have DRAM attached (i.e. external architecture) and half of theoretical bandwidth assumed for I/O

o Processors are I/O limited even with RapidIO assumed for those capable

o Advanced algorithms would require substantially more memory capacity

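| Version | Resource | Rate | 603e | 750FX | 7448 | 8641D | LEON 3FT | RAD750 | MAESTRO |
|---|---|---|---|---|---|---|---|---|---|
| Basic | Memory | Rate 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Basic | Memory | Rate 2 | 2 | 1 | 1 | 1 | 1 | 3 | 1 |
| Basic | Memory | Rate 3 | 7 | 3 | 2 | 2 | 3 | 9 | 1 |
| Basic | I/O | Rate 1 | 3222 | 31 | 4 | 2 | 18 | 467 | 1 |
| Basic | I/O | Rate 2 | 6443 | 62 | 7 | 4 | 35 | 758 | 2 |
| Basic | I/O | Rate 3 | 25770 | 245 | 26 | 13 | 139 | 3890 | 7 |
| Advanced | Memory | Rate 1 | 2 | 1 | 1 | 1 | 1 | 3 | 1 |
| Advanced | Memory | Rate 2 | 4 | 1 | 1 | 1 | 1 | 6 | 1 |
| Advanced | Memory | Rate 3 | 16 | 4 | 4 | 2 | 4 | 21 | 2 |
| Advanced | I/O | Rate 1 | 3234 | 31 | 4 | 2 | 18 | 471 | 1 |
| Advanced | I/O | Rate 2 | 6468 | 62 | 7 | 4 | 35 | 764 | 2 |
| Advanced | I/O | Rate 3 | 25871 | 245 | 26 | 13 | 140 | 3925 | 7 |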


Analysis Summary

o Data Rate 1:
• Higher-end processors with high-speed serial I/O are viable options
• Xilinx and Actel roughly tie in memory and I/O performance when radiation mitigation added

o Data Rate 2:
• Microprocessors become less attractive than FPGAs
• Xilinx and Actel devices are still roughly equivalent

o Data Rate 3:
• Xilinx devices appear more attractive than Actel due to improved memory bandwidth (i.e. fewer DRAMs required)

o Note:
• Algorithm implementation and sizing/timing analysis on logic designs required to complete the analysis (need to include processing capability)
• Processors would likely become a factor when incorporating additional algorithms that are non-deterministic or require coarse-grained processing further down the application chain


Conclusions

o Radiation effect mitigation techniques examined

• Characteristics of several techniques compared

• Choice is application-specific and combination of methods often included

o Application performance requirements examined

• FPGA and microprocessor memory bandwidth and I/O capabilities examined due to the application being I/O intensive

• For this application, microprocessors become less attractive than FPGAs as data rates are increased

• Xilinx devices more attractive due to improved memory bandwidth

o Future Work

• Implement algorithms to include processing capability in the analysis

• Include additional application processing steps


Contact Information

Dr. Ian Troxel
Future Systems Architect
• 303-784-7673 [email protected]

SEAKR Engineering, Inc.
6221 South Racine Circle
Centennial, CO 80111-6427
main: 303 790 8499
fax: 303 790 8720
web: http://www.SEAKR.com