POLITECNICO DI MILANO
Lecturer: Antonio Miele
Involved People:
Cristiana Bolchini, Antonio Miele, Marco D. Santambrogio
SEU Mitigation for SRAM-Based FPGAs through Dynamic Partial Reconfiguration
3D-DRESD Second Edition
Motivations
Designing reliable systems implemented on FPGAs, able to cope with the effects of faults caused by radiation
Applying already known and well-studied detection and recovery techniques to novel scenarios
Exploiting dynamic partial reconfiguration to trigger the reconfiguration of the affected portion of the architecture
... while the rest of the system keeps working, without the need to reprogram the entire system
Outline
Goals
Starting point
  Fault tolerance and reliability
  Reconfigurable architecture
  Related work
The proposed approach
  Requirements
  Solution space exploration
Project roadmap
  Completed steps
  Work in progress
Other works
Conclusions and Future Work
Goals
Design space exploration w.r.t. reliability
Apply traditional, sound techniques in a different context, exploiting the peculiarities of the platform
Evaluate the alternative designs, comparing costs, performance and fault detection properties
Support the designer in selecting the most convenient solution
Fault Model & Reliability
Adopted fault model
  Caused by radiation and α-particles
  Single Event Transient (SET), Single Event Upset (SEU)
Bit-flip
  Temporary: data and control registers
  Permanent: configuration memory
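To make the difference between temporary and permanent bit-flips concrete, here is a minimal Python sketch built on a toy model (an assumption, not a description of any Xilinx device): a 2-input LUT stores its truth table in four configuration bits and its output is latched into a register every cycle. The names `lut_eval`, `flip_bit`, `run` and `AND_CONFIG` are hypothetical. A flip in the register corrupts a single cycle, while a flip in the configuration memory changes the implemented function itself.

```python
# Toy model (hypothetical): a 2-input LUT stores its truth table in 4 configuration bits;
# bit k holds the output for inputs (a, b) with k = 2*a + b.


def lut_eval(config: int, a: int, b: int) -> int:
    """Evaluate the LUT described by a 4-bit truth table."""
    return (config >> (2 * a + b)) & 1


def flip_bit(word: int, bit: int) -> int:
    """Model an SEU as a single bit-flip (XOR with a one-hot mask)."""
    return word ^ (1 << bit)


AND_CONFIG = 0b1000  # only inputs (1, 1), i.e. k = 3, produce a 1


def run(config: int, inputs, upset_cycle=None):
    """Latch the LUT output into a register at every cycle.

    If upset_cycle is given, the register content of that cycle is flipped:
    the error lasts one cycle, because the register is overwritten at the next one.
    """
    outputs = []
    for cycle, (a, b) in enumerate(inputs):
        reg = lut_eval(config, a, b)  # register reloaded every cycle
        if cycle == upset_cycle:
            reg = flip_bit(reg, 0)    # temporary upset in a data register
        outputs.append(reg)
    return outputs


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)] * 2
golden = run(AND_CONFIG, inputs)
transient = run(AND_CONFIG, inputs, upset_cycle=1)
permanent = run(flip_bit(AND_CONFIG, 0), inputs)  # upset in configuration bit 0

print(golden)     # [0, 0, 0, 1, 0, 0, 0, 1]
print(transient)  # [0, 1, 0, 1, 0, 0, 0, 1]
print(permanent)  # [1, 0, 0, 1, 1, 0, 0, 1]
```

The configuration upset corrupts every cycle that applies inputs (0, 0) and keeps doing so until the configuration is rewritten, which is why recovery from it needs reconfiguration rather than a simple restart.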
Reconfigurable Scenario
FPGAs: Xilinx family
  (Virtex, Virtex-II, Virtex-II Pro, Virtex-4, ...)
Reconfiguration: modular design flow
  E.g., Early Access Partial Reconfiguration (EAPR)
Related Work
TMR at different levels of abstraction: replication of the entire circuit or of each register
Periodic bitstream scrubbing
Bitstream readback
Drawbacks: area overhead, recovery latency and power consumption
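As a rough illustration of the two configuration-memory techniques listed above, the following Python sketch assumes the configuration memory is a list of equally sized frames and that a golden copy of the bitstream is available. The frame model and the functions `readback_check` and `scrub` are hypothetical simplifications, not an interface of any real device or tool: blind scrubbing rewrites every frame, while readback first locates the corrupted ones.

```python
from typing import List, Optional, Sequence

Frame = bytes  # hypothetical: one configuration frame as raw bytes


def readback_check(memory: Sequence[Frame], golden: Sequence[Frame]) -> List[int]:
    """Readback: compare each frame with the golden bitstream, return corrupted frame indices."""
    return [i for i, (cur, ref) in enumerate(zip(memory, golden)) if cur != ref]


def scrub(memory: List[Frame], golden: Sequence[Frame],
          frames: Optional[Sequence[int]] = None) -> int:
    """Scrubbing: rewrite frames from the golden copy and return how many were written.

    With frames=None every frame is rewritten (blind, periodic scrubbing).
    """
    targets = range(len(golden)) if frames is None else frames
    for i in targets:
        memory[i] = golden[i]
    return len(targets)


golden = [bytes([i] * 4) for i in range(8)]
memory = list(golden)
memory[5] = bytes([5, 5, 5 ^ 0b010, 5])  # SEU: one bit flipped in frame 5

corrupted = readback_check(memory, golden)
blind_writes = scrub(list(memory), golden)          # rewrites all 8 frames
targeted_writes = scrub(memory, golden, corrupted)  # rewrites only frame 5
print(corrupted, blind_writes, targeted_writes)     # [5] 8 1
```

Blind scrubbing needs no detection step but rewrites the whole memory every period; readback-driven repair writes less but must first read and compare everything.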
Proposed Approach
Fault detection and masking
  Duplication with comparison (DWC)
  Triple Modular Redundancy (TMR)
  Redundant codes
  (techniques presented in the '70s and '80s)
Recovery: partial dynamic reconfiguration
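The three detection and masking techniques can be summarized at word level in a few lines. In this Python sketch each replica's output is modeled as an integer, a simplification since in the proposed approach the replicas are circuits on the FPGA; the function names are illustrative. DWC detects an error but cannot tell which copy is right, TMR masks a single faulty replica through a bitwise majority voter, and a single parity bit is the simplest redundant code.

```python
def dwc_check(out_a: int, out_b: int) -> bool:
    """Duplication with comparison: True when the two copies disagree (error detected)."""
    return out_a != out_b


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority voter: each output bit takes the value shared by at least two replicas."""
    return (a & b) | (a & c) | (b & c)


def even_parity_bit(word: int) -> int:
    """Check bit making the total number of ones even (the simplest redundant code)."""
    return bin(word).count("1") % 2


def parity_error(word: int, parity: int) -> bool:
    """Detects any odd number of flipped bits in word (double errors go unnoticed)."""
    return even_parity_bit(word) != parity


correct = 0b1011_0010
faulty = correct ^ 0b0000_1000  # one replica hit by an SEU

print(dwc_check(correct, faulty))                     # True: detected, but which copy is right?
print(tmr_vote(correct, faulty, correct) == correct)  # True: the faulty replica is outvoted
p = even_parity_bit(correct)
print(parity_error(faulty, p))                        # True: single flip detected
```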
Requirements
Fault detection and characterization
  Identification of a mismatch
  Detection of whether the fault is transient or permanent
Fault localization
  Identification of the portion of the device where the fault occurred
Partial reconfiguration
  Reconfiguration of the smallest portion of the FPGA, if the fault effect is characterized as permanent
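The three requirements can be chained into a simple control loop. The following Python sketch assumes a TMR-protected function whose three replicas are modeled as Python callables: localization uses the voter inputs to identify the disagreeing replica, and characterization uses re-execution, so a fault that disappears on a retry is classified as transient and one that persists as permanent. The names `find_faulty`, `handle` and `make_transient`, and the retry-based policy itself, are illustrative assumptions rather than the project's actual mechanism.

```python
from typing import Callable, List, Optional

Replica = Callable[[int], int]  # hypothetical: one hardware replica modelled as a function


def find_faulty(outputs: List[int]) -> Optional[int]:
    """Localization with TMR: index of the replica that disagrees with the other two."""
    a, b, c = outputs
    if a == b == c:
        return None
    if a == b:
        return 2
    if a == c:
        return 1
    if b == c:
        return 0
    raise RuntimeError("all three replicas disagree: not correctable by TMR")


def handle(replicas: List[Replica], x: int, retries: int = 1) -> str:
    """Detect a mismatch, localize it, then re-execute to characterize it.

    Assumed policy: if the mismatch vanishes on re-execution the fault was transient;
    if it persists after `retries` attempts, it is treated as permanent (configuration
    memory) and only the region hosting that replica is reconfigured.
    """
    faulty = find_faulty([r(x) for r in replicas])
    if faulty is None:
        return "no fault"
    for _ in range(retries):
        if find_faulty([r(x) for r in replicas]) is None:
            return f"transient fault in replica {faulty}"
    return f"permanent fault: reconfigure the region of replica {faulty}"


def make_transient(f: Replica) -> Replica:
    """Wrap a replica so that only its first evaluation is corrupted (a one-off upset)."""
    calls = {"n": 0}

    def replica(x: int) -> int:
        calls["n"] += 1
        y = f(x)
        return y ^ 1 if calls["n"] == 1 else y

    return replica


def good(x: int) -> int:
    return 2 * x


def stuck(x: int) -> int:
    return (2 * x) | 1  # configuration upset forcing the LSB to 1


print(handle([good, good, good], 21))
print(handle([make_transient(good), good, good], 21))
print(handle([good, stuck, good], 21))
```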
Design Space Exploration
Several solutions applying DWC
Several solutions applying TMR
Design Space Exploration
Discarding disadvantageous solutions
  For instance, eliminating unnecessary error-checking modules (e.g., voters)
Design Space Exploration
The presented issues lead to the definition of a framework for design space exploration, which aims at:
  Estimating the costs and benefits of the different possible solutions
  Exploring the solution space on the basis of several metrics (e.g., size of the subsystems, data widths)
  Identifying the most promising solutions
Project roadmap: completed steps
Case Studies
Noekeon algorithm: block cipher (128-bit key, 128-bit block)
FIR filter: simple and regular architecture
  $$y(t) = \sum_{i=0}^{10} c_i \, x(t-i)$$
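The FIR equation y(t) = sum over i = 0..10 of c_i x(t - i) maps directly onto a chain of multiply-accumulate stages, which is what makes the architecture regular. A minimal Python reference model, assuming 11 coefficients c_0 ... c_10 and zero initial state (x(t) = 0 for t < 0); the coefficient values below are placeholders, not those of the case study:

```python
from typing import List, Sequence


def fir(coeffs: Sequence[float], x: Sequence[float]) -> List[float]:
    """Direct-form FIR filter: y(t) = sum_i c_i * x(t - i), assuming x(t) = 0 for t < 0."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for i, c in enumerate(coeffs):  # one multiply-accumulate per tap
            if t - i >= 0:
                acc += c * x[t - i]
        y.append(acc)
    return y


coeffs = [1.0] * 11           # 11 taps, i = 0..10; placeholder values
impulse = [1.0] + [0.0] * 14
print(fir(coeffs, impulse))   # the impulse response reproduces the coefficients
```

A model like this can serve as a fault-free reference when comparing the outputs of a hardened implementation.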
A first attempt
A few solutions have been implemented
  DWC (or TMR) has been adopted
  Each solution proposes a different grouping of system modules and a different placement on reconfigurable areas
Exhaustive exploration of the solution space
Considering TMR, all the possible solutions have been generated (not implemented!)
An all-to-all comparison has been performed to choose the most promising ones and to discard the least interesting
Area occupation has been used as the metric: the area of each solution has been estimated by adding up the area occupations of its individual modules
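A possible Python sketch of this exhaustive procedure: every grouping of modules into reconfigurable areas is a set partition of the module set, each partition is scored, and an all-to-all dominance check discards the solutions that are no better on any metric. The module areas, the one-voter-per-area cost and the second metric (area of the largest group, as a stand-in for recovery time) are illustrative assumptions; the exploration described above used area only. The number of partitions grows as the Bell numbers (15 for 4 modules, 115,975 for 10), which is what makes exhaustive enumeration expensive.

```python
from typing import Dict, Iterator, List, Tuple

Partition = List[List[str]]


def set_partitions(items: List[str]) -> Iterator[Partition]:
    """Yield every grouping of the items into non-empty, unordered groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        for k in range(len(partition)):  # add `first` to an existing group...
            yield partition[:k] + [[first] + partition[k]] + partition[k + 1:]
        yield [[first]] + partition      # ...or give it a group of its own


def evaluate(p: Partition, area: Dict[str, int], voter_area: int) -> Tuple[int, int]:
    """Hypothetical metrics for a TMR solution: total area (three copies of every group
    plus one voter per group) and area of the largest group (a proxy for recovery time)."""
    group_areas = [sum(area[m] for m in group) for group in p]
    total = sum(3 * a + voter_area for a in group_areas)
    return total, max(group_areas)


def dominates(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """a is no worse than b on every metric and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def explore(area: Dict[str, int], voter_area: int):
    candidates = [(p, evaluate(p, area, voter_area)) for p in set_partitions(list(area))]
    # all-to-all comparison: keep the solutions that no other solution dominates
    front = [(p, c) for p, c in candidates
             if not any(dominates(other, c) for _, other in candidates)]
    return candidates, front


area = {"A": 10, "B": 20, "C": 30, "D": 40}  # placeholder module areas
candidates, front = explore(area, voter_area=5)
print(len(candidates))                        # 15 groupings of 4 modules
for p, c in front:
    print(c, p)
```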
Project roadmap: work in progress
Exhaustive exploration of the solution space
Designing an algorithm that:
  Enables a "smart" exploration of the solution space
  Searches for the most promising solutions on the basis of an objective function that considers cost/benefit metrics
  Explores the design space considering more than one technique (e.g., TMR, DWC, redundant codes)
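One possible shape for such an algorithm, sketched in Python under purely illustrative assumptions: the objective adds area overhead to a weighted recovery penalty, each group of modules picks whichever of TMR or DWC scores better, and a greedy agglomerative search merges groups while the objective improves. The technique parameters, the quadratic penalty and the greedy strategy are all hypothetical choices, not the algorithm under development; a greedy search evaluates only the pairwise merges at each step instead of enumerating every partition, at the price of possibly missing the optimum. Redundant codes could be added as a further entry in `TECHNIQUES`.

```python
from typing import Dict, List, Tuple

# Hypothetical technique parameters: (replication factor, checker area, masks errors?)
TECHNIQUES = {"TMR": (3, 6, True), "DWC": (2, 3, False)}


def group_cost(group_area: int, weight: float) -> Tuple[float, str]:
    """Cheapest technique for one reconfigurable area under an illustrative objective:
    area overhead + weight * recovery penalty. The penalty, paid only when errors are
    detected but not masked, is modelled as group_area**2 (both the chance of being hit
    and the reconfiguration time are assumed to grow with the area)."""
    best = None
    for name, (factor, checker, masks) in TECHNIQUES.items():
        cost = factor * group_area + checker
        if not masks:
            cost += weight * group_area ** 2
        if best is None or cost < best[0]:
            best = (cost, name)
    return best


def greedy_explore(area: Dict[str, int], weight: float):
    """Start with one module per area; repeatedly apply the merge that lowers the
    objective the most; stop when no merge helps (a local optimum, not a guaranteed one)."""
    def total(groups: List[List[str]]) -> float:
        return sum(group_cost(sum(area[m] for m in g), weight)[0] for g in groups)

    groups = [[m] for m in area]
    current = total(groups)
    while True:
        best_move = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                merged = [g for k, g in enumerate(groups) if k not in (i, j)]
                merged.append(groups[i] + groups[j])
                cost = total(merged)
                if cost < current and (best_move is None or cost < best_move[0]):
                    best_move = (cost, merged)
        if best_move is None:
            break
        current, groups = best_move
    return current, [(g, group_cost(sum(area[m] for m in g), weight)[1]) for g in groups]


cost, solution = greedy_explore({"A": 8, "B": 8, "C": 32, "D": 2}, weight=0.0625)
print(cost, solution)
```

With these placeholder numbers the small modules stay in their own areas protected by DWC, while the largest module is grouped with the smallest one under TMR, showing how a single objective can mix techniques across areas.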
Implementing the framework
A first draft
[Figure: framework flow. Components: VHDL Parser, Graph Manipulator, VHDL Re-builder, Component Syntheses, Constraint File Builder. Inputs and outputs: Top Module VHDL, Transf. XML, Transf. Rules (Rec, TMR, ...), Project Lib, RoadRunner Lib (TRC, ...), Rec Lib (TRC, ...), Mod. VHDL, Rec Arch VHDL, Constr File.]
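The framework includes a Graph Manipulator driven by transformation rules (Rec, TMR, ...). As a rough, hypothetical sketch of what a TMR rule might do on a module-level graph, the following Python function replaces one node with three replicas fed by the same predecessors and inserts a voter in front of the original successors. The graph encoding, the node naming scheme and `apply_tmr` are assumptions, not the framework's actual data structures or its XML rule format.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # node -> list of successor nodes (signal flow)


def apply_tmr(graph: Graph, node: str) -> Graph:
    """Hypothetical TMR transformation rule on a module-level graph:
    `node` is replaced by three replicas fed by the same predecessors,
    and a majority voter merges their outputs towards the original successors."""
    replicas = [f"{node}_r{k}" for k in range(3)]
    voter = f"{node}_voter"
    out: Graph = {}
    for src, dsts in graph.items():
        if src == node:
            continue
        # every edge that entered `node` now fans out to all three replicas
        new_dsts: List[str] = []
        for d in dsts:
            new_dsts.extend(replicas if d == node else [d])
        out[src] = new_dsts
    for r in replicas:
        out[r] = [voter]
    out[voter] = list(graph.get(node, []))
    return out


netlist = {"in": ["fir"], "fir": ["out"], "out": []}
hardened = apply_tmr(netlist, "fir")
print(hardened)
```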
Other works
Another related work deals with the design of a fault injector for FPGAs
Motivations:
  Reliability assessment is an important task when designing reliable embedded systems
  It is usually performed by means of fault injection experiments
Requirements:
  Stop the execution, preserving the system state
  Inject a fault by downloading a partial bitstream
    It should allow corruption of both data registers and configuration memory
  Restart the execution
Important issues: observability and controllability of fault injection
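The stop / inject / restart sequence can be prototyped in software before targeting the device. The following Python sketch is a hypothetical model, not the injector itself: the "device" is a toy DWC system of two 2-input LUT replicas, injection flips one configuration bit in a copy of the configuration (standing in for downloading a partial bitstream), and each run is classified by comparing it with the fault-free (golden) run. All names are illustrative.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Workload = Sequence[Tuple[int, int]]


def lut_eval(config: int, a: int, b: int) -> int:
    """2-input LUT with a 4-bit truth table (bit 2*a + b holds the output)."""
    return (config >> (2 * a + b)) & 1


def run_dwc(configs: List[int], workload: Workload) -> Tuple[List[int], bool]:
    """Toy DWC system: two LUT replicas; returns replica 0's outputs and a mismatch flag."""
    outputs, mismatch = [], False
    for a, b in workload:
        y0, y1 = (lut_eval(c, a, b) for c in configs)
        outputs.append(y0)
        mismatch |= y0 != y1
    return outputs, mismatch


def campaign(golden_configs: List[int], workload: Workload, n_bits: int = 4) -> Counter:
    """Inject every single configuration bit-flip once and classify the outcome."""
    golden_out, _ = run_dwc(golden_configs, workload)
    results: Counter = Counter()
    for replica in range(len(golden_configs)):
        for bit in range(n_bits):
            # "stop" + "inject": work on a copy of the configuration with one bit flipped,
            # standing in for downloading a partial bitstream carrying the corrupted bit
            faulty = list(golden_configs)
            faulty[replica] ^= 1 << bit
            # "restart": rerun the workload and compare against the golden run
            out, mismatch = run_dwc(faulty, workload)
            if mismatch:
                results["detected"] += 1
            elif out != golden_out:
                results["silent corruption"] += 1
            else:
                results["no effect"] += 1
    return results


AND_CONFIG = 0b1000
print(campaign([AND_CONFIG, AND_CONFIG], workload=[(0, 1), (1, 1)]))
```

The "no effect" outcomes here are latent faults in configuration bits that the workload never exercises; telling them apart from truly harmless bits is one reason observability is an issue for the injector.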
Conclusions and Future Work
We proposed guidelines for evaluating alternative SEU mitigation techniques
We applied DWC and TMR to detect faults, and partial dynamic reconfiguration to recover
We exhaustively explored the solution space considering a single technique
Next steps:
  Automatic system partitioning into reliable areas
  Gathering alternative concurrent error detection techniques
  Designing an EAPR-based flow
Questions?