SERENE 2014 School: System-Level Concurrent Error Detection
SERENE'14 Autumn School: ENGINEERING RESILIENT CYBER PHYSICAL SYSTEMS
System-Level Concurrent Error Detection
Dr. Luigi Pomante, Università degli Studi dell'Aquila
Center of Excellence DEWS, [email protected]
System Level CED © 2014 Luigi Pomante
Introduction
[Figure: Resilience, Concurrent Error Detection, Fault Tolerance, Reliability]
Introduction
Error detection is one of the basic features needed to support reliability, and hence resilience, in CPS
This talk therefore focuses on error detection issues in the cyber part of a CPS
This part is normally a customized electronic digital system with an ad-hoc HW/SW architecture, typically embedded in a more complex heterogeneous system that interacts heavily with some physical processes
Introduction
Error Detection Methodologies: Off-line vs. Concurrent
System-Level Design Methodologies
System-Level Specification: functional characterization of the system without dealing with implementation aspects
Specification of implementation objectives and constraints: timing, power consumption, area
Estimation of the influence of different alternatives on the final implementation
HW/SW system composition: different processors and/or alternative technologies
Introduction
Typically, resilience/reliability aspects are neglected at the higher levels of the system synthesis process
They are postponed to lower abstraction levels, but resilience/reliability methodologies can significantly impact timing, energy and area
It is therefore necessary to move these aspects toward the upper levels of the synthesis flow by adding resilience/reliability constraints to the classical cost parameters
This work investigates design-for-reliability/resilience approaches at system level, when all implementation options for the device are still open, and presents a set of design methodologies that provide concurrent error detection (CED) properties to the final implementation
Goal
This broad resilience/reliability co-design project addresses the following aspects:
specification of systems in a co-design environment supporting resilience/reliability constraints
design methodologies providing the desired CED properties
hw/sw system partitioning on the basis of metrics taking into account both traditional co-design issues and resilience/reliability constraints
Overview
Problem Definition
Target System Architecture
Fault Model
System Specification
Design Methodologies for Reliability
Design Analysis and Metrics
Hw/Sw System Partitioning
A Case Study: a Reliable Pacemaker
Problem Definition
A Section is a subset of the system specification
A Critical Section is a section where the CED property is required
A Reliable Section is a critical section that propagates either error-free critical results or faulty critical results accompanied by an error indication
Problem Definition
The underlying assumption is that the input data processed by the reliable section are error-free
The upstream sections either provide correct data by definition or are designed to be reliable themselves
The downstream sections either must also be designed to be reliable, or no reliability constraint applies to them
In the former case reliability is extended to all downstream elements, in the latter the property has a pure local effect
Problem Definition
To formally define these two characterizations, the following definitions are introduced:
Local Reliability: the Local Reliability property of a critical section specifies that the reliability constraints involve only that critical section
Global Reliability: the Global Reliability property of a critical section specifies that the reliability constraints involve that section and, recursively, all its downstream sections
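The recursive part of the Global Reliability definition can be made concrete with a minimal Python sketch. It assumes the specification has already been reduced to a directed graph of sections whose edges point downstream; the function `constrained_sections` and the example graph are hypothetical and are not part of TOSCA or of the Occam II/SystemC extensions.

```python
from collections import deque
from typing import Dict, List, Set

def constrained_sections(graph: Dict[str, List[str]], critical: str, mode: str) -> Set[str]:
    """Sections on which the reliability constraint of `critical` applies."""
    if mode == "LOCAL":
        return {critical}                 # the property has a purely local effect
    if mode != "GLOBAL":
        raise ValueError("mode must be 'LOCAL' or 'GLOBAL'")
    reached = {critical}
    queue = deque([critical])
    while queue:                          # breadth-first walk of all downstream sections
        section = queue.popleft()
        for succ in graph.get(section, []):
            if succ not in reached:
                reached.add(succ)
                queue.append(succ)
    return reached

# Hypothetical section graph (edges point downstream): A -> B -> {C, D}, C -> E
graph = {"A": ["B"], "B": ["C", "D"], "C": ["E"]}
print(sorted(constrained_sections(graph, "B", "LOCAL")))   # ['B']
print(sorted(constrained_sections(graph, "B", "GLOBAL")))  # ['B', 'C', 'D', 'E']
```

Upstream sections (A here) are never added: by the assumption above, they already deliver correct data.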
Problem Definition
Local and Global Reliability Specification
[Figure: two example section graphs (sections A to E) comparing local and global reliability on B]
Local reliability on B: the data provided to A are reliable
Global reliability on B: the data provided to A and B are reliable
Problem Definition
Two kinds of reliability are needed because a specification may also include the environment description, which does not need any reliability property, or a set of functionalities of which only one must be reliable
For example, a digital control system specification for a car could include tachometer, temperature and ABS control: reliability is needed only for the ABS
To specify which sections must be reliable and which kind of reliability is desired, dedicated system-level specification languages (or suitable extensions of existing ones) are required
System Specification
Two languages have been considered for system specification: Occam II and SystemC
The first was selected because the TOSCA environment (a co-design environment for embedded systems), used in our studies to verify the proposed approaches, is based on it
The second is becoming increasingly popular for system-level specification, which makes its adoption almost a requirement when integrating the proposed approaches into a real design flow
System Specification
Reliability constraints in Occam II: the language has been extended with statements for identifying critical sections, to be added to the standard constraint definition section
    CS FROM label1 TO label2 IS LOCAL (GLOBAL)
Example:
    INT a, b:
    CHAN OF INT in, out:
    TAG A:
    SEQ
      a := 0
      WHILE TRUE
        TAG B:
        SEQ
          a := a + 1
          out ! a
          TAG C:
          in ? b
          a := a + b
    TAG D:
    MAXDELAY FROM B TO C IS 10:
    MAXRATE OF B IS 100:
    CS FROM A TO D IS LOCAL:
Annotations: TAG A ... TAG D are tag definitions; CHAN OF INT in, out declares a communication channel; MAXDELAY and MAXRATE are timing constraints; CS FROM A TO D IS LOCAL is the reliability constraint
System Specification
Reliability constraints in SystemC: constraints can be imposed at different abstraction levels, module or process
At module level, reliability constraints are imposed by extending the basic class through inheritance
– SC_MODULE_GCS, SC_MODULE_LCS
– A reliability constraint imposed on a module applies directly to all the processes it contains
At process level, macro mechanisms can be adopted, introducing additional macros for specifying critical sections and the local/global reliability constraint
– SC_GCS, SC_LCS
Target System Architecture
The reference architecture consists of the basic processor block (either general purpose or DSP), which executes software processes, main memory and a set of co-processors (ASIC or FPGA) implementing hardware functionalities if required
Communication between hardware modules uses the available bus, memory otherwise
[Figure: reference architecture with CPU, memory, I/O interface and co-processors]
Fault Model
The adopted fault model is represented by the Single Functional Failure, where any number of physical faults causes a functional module to perform incorrectly
The considered faults affect the hardware structure of the system, thereby also corrupting the behavior of the software, but no software failures are considered in this work
The modules that may fail are thus the main processor, the co-processors, the main memory, the system bus and the dedicated channels for hardware-hardware module communication
This single-failure model relies on a commonly adopted hypothesis: a module failure is detected before another module fails
Design Methodologies for Reliability
The resilience/reliability project has investigated design methodologies that guarantee error detection capabilities by adopting redundancy strategies:
Architectural and information redundancy
The methodologies that have been analyzed and developed can be classified:
By the functionality to be performed and controlled: data processing or communication
By the partitions involved: HW or SW
By the CED techniques adopted to guarantee the reliability properties
Design Methodologies for Reliability
The design approach considers as the basic element any functionality that the system must provide in a reliable way
Nominal (N): denotes such a basic element
Checking (C): identifies the redundant functional elements designed to provide error detection capabilities
Checker (CK): the functional element that detects a mismatch between the behavior of N and C due to failures
Each one of these three elements (N, C and CK) can be independently implemented in hardware or in software, leading to several classes of methodologies
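As an illustration of how N, C and CK cooperate to make a section reliable, here is a minimal duplication-with-comparison sketch in Python, with all three elements in software (as in a Class 1 solution). The functions `reliable_section`, `f` and `faulty_f` and the injected stuck-at fault are hypothetical; the talk's methodologies use a variety of Checking and Checker implementations, not necessarily plain duplication.

```python
from typing import Callable, Tuple

def reliable_section(nominal: Callable[[int], int],
                     checking: Callable[[int], int],
                     x: int) -> Tuple[int, bool]:
    """Return the nominal result together with an error indication."""
    n_out = nominal(x)          # Nominal (N)
    c_out = checking(x)         # Checking (C): redundant computation
    error = n_out != c_out      # Checker (CK): flags any mismatch
    return n_out, error

def f(x: int) -> int:
    return 3 * x + 1

def faulty_f(x: int) -> int:
    # Hypothetical physical fault: bit 2 of the result stuck at 1
    return f(x) | 0b100

print(reliable_section(f, f, 5))         # (16, False)
print(reliable_section(faulty_f, f, 5))  # (20, True)
```

Under the single-failure model, N and C cannot both fail, so a wrong result always comes with the error flag. A fault that does not alter the output (for example `faulty_f(1)`, where bit 2 is already 1) raises no flag, which is harmless because the propagated result is still correct.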
Design Methodologies for Reliability
Reliable Data Processing
[Figure: Nominal architecture, Checking architecture and Checker, each implementable in SW or HW]
Solution | Nominal | Checker | Checking
1 | SW | SW | SW
2 | SW | HW | SW
3 | SW | SW | HW
4 | SW | HW | HW
5 | HW | SW | SW
6 | HW | HW | SW
7 | HW | SW | HW
8 | HW | HW | HW
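The eight solutions are simply every HW/SW assignment of the three elements. A small Python sketch (the function name `ced_classes` is ours) enumerates them in the same numbering as the solution table on this slide:

```python
from itertools import product

IMPLEMENTATIONS = ("SW", "HW")

def ced_classes():
    """Number the 2**3 HW/SW assignments of (Nominal, Checker, Checking).

    product() varies its last position fastest; unpacking each tuple as
    (nominal, checking, checker) makes Nominal the slowest-varying and Checker
    the fastest-varying element, which reproduces the table's numbering.
    """
    classes = {}
    for number, (nominal, checking, checker) in enumerate(
            product(IMPLEMENTATIONS, repeat=3), start=1):
        classes[number] = {"Nominal": nominal, "Checker": checker, "Checking": checking}
    return classes

for number, impl in ced_classes().items():
    print(number, impl["Nominal"], impl["Checker"], impl["Checking"])
```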
Design Methodologies for Reliability
Reliable Data Processing
Class 1: SW Nominal, SW Checker, and SW Checking
Self-Checking SW (SCS); Assertions (A); Dual-Processor Checking (DP); VLIW Checking (VLIWS)
Class 2: SW Nominal, HW Checker, and SW Checking
Interface for Functional Redundancy Check (IFRC); DMA Checker (DMAC); VLIW Checking with HW Checker (VLIWH)
Design Methodologies for Reliability
Reliable Data Processing
Class 4: SW Nominal, HW Checker, and HW Checking
Dynamically Re-Configurable Checker (DCC)
Class 8: HW Nominal, HW Checker, and HW Checking
Device Duplication (D); TSC Scheduling (TSCS); TSC Devices (TSCD)
Design Methodologies for Reliability
Reliable Communications
It is necessary to guarantee that any fault on communication lines is detected
Either hardware redundancy (lines duplication) or information redundancy (data encoding) can be adopted
Two possibilities should be considered:
Communications between procedures implemented in HW
Other kinds of communication: SW-SW, SW-HW, HW-SW
Design Methodologies for Reliability
Reliable Communications
Communications between procedures implemented in HW: a pair of HW sections communicates by means of dedicated lines
– Line duplication vs. data encoding
Other kinds of communication: when the communication involves a SW section, it makes use of the system bus
– The only viable solution is the use of error detection codes
– The best results are obtained by keeping the data in memory in coded form and letting the CPU work only with non-coded data
» A HW TSC Encoder/Decoder/ChecKer (EDCK) for the processor and one (or more) for the HW devices
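To show what "keeping the data in memory in coded form" means in the simplest case, here is a Python sketch that assumes a single even-parity bit per word. The slides do not say which code the TSC Encoder/Decoder/Checker actually uses, and a real TSC checker is a hardware circuit; `parity`, `encode` and `decode_and_check` are hypothetical names.

```python
def parity(word: int) -> int:
    """Even-parity bit of a non-negative data word."""
    return bin(word).count("1") % 2

def encode(word: int) -> int:
    """Append the parity bit as the least significant bit (coded form kept in memory)."""
    return (word << 1) | parity(word)

def decode_and_check(coded: int):
    """Return (data word, error flag); the flag is raised on a parity mismatch."""
    word, p = coded >> 1, coded & 1
    return word, parity(word) != p

stored = encode(0b1011_0010)            # data written to memory in coded form
print(decode_and_check(stored))         # (178, False)
corrupted = stored ^ (1 << 5)           # hypothetical single-bit fault on the bus
print(decode_and_check(corrupted))      # (162, True)
```

Parity detects every single-bit error but misses, for instance, two flipped bits in the same word; codes with higher detection capability trade this against encoder/checker area.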
Design Methodologies for Reliability
Reliable Communications
[Figure: architecture with reliable communications: CPU, I/O interface, co-processors, memory holding coded data, three TSC EDCK blocks and one TSC CK]
Design Analysis and Metrics
All the methodologies have been analyzed in detail in order to highlight the main design issues and to evaluate benefits and costs
The design issues have been analyzed qualitatively according to a reference schema, in order to quickly show the main differences between the approaches
Benefits and costs have been analyzed by defining a set of significant parameters, which are the basic elements needed to build metrics for comparing the quality of different solutions; these metrics play an important role in the partitioning step
Design Analysis and Metrics
Design issues reference schema: key concepts
Selection of number and typology of processing elements
Detection of the need for a special architecture
Analysis of synchronization issues between processing elements
Analysis of possible physical and logical resource sharing
Detection of modification needs of the original specification
Selection of the execution policies for each processing element
Allocation of the checker memory space
Selection of the checking policies
Analysis of the checker structure and complexity
Selection of a mechanism enabling the checker to raise exceptions to report error detection
Design Analysis and Metrics
Benefits and Costs
Let us define the Efficiency of a given methodology as its characterization with respect to three factors:
Coverage
– The percentage of functional faults that can be detected with respect to the complete fault set
Detection Latency (DL)
– The time between the instant a fault causes an error and the instant the error is detected
Performance Degradation (PD)
– The overhead (i.e., additional execution time) caused by fault detection tasks with respect to the original system
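The three efficiency factors can be computed from a fault-injection campaign, as in the Python sketch below. The campaign data are invented for illustration, and aggregating DL as a mean over detected faults is our assumption (a worst-case maximum would be an equally reasonable choice).

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class InjectedFault:
    error_time: float              # instant the fault manifests as an error (ms)
    detect_time: Optional[float]   # instant the checker flags it, or None if missed

def coverage(faults: List[InjectedFault]) -> float:
    """Percentage of injected functional faults that were detected."""
    detected = sum(1 for f in faults if f.detect_time is not None)
    return 100.0 * detected / len(faults)

def mean_detection_latency(faults: List[InjectedFault]) -> float:
    """Average error-to-detection time over the detected faults."""
    latencies = [f.detect_time - f.error_time for f in faults if f.detect_time is not None]
    if not latencies:
        raise ValueError("no fault was detected")
    return sum(latencies) / len(latencies)

def performance_degradation(t_original: float, t_with_ced: float) -> float:
    """Additional execution time due to detection tasks, as % of the original time."""
    return 100.0 * (t_with_ced - t_original) / t_original

# Hypothetical campaign: four faults injected, one escapes detection
faults = [InjectedFault(10.0, 12.0), InjectedFault(20.0, 21.0),
          InjectedFault(30.0, None), InjectedFault(40.0, 43.0)]
print(coverage(faults))                      # 75.0
print(mean_detection_latency(faults))        # 2.0
print(performance_degradation(8.0, 10.0))    # 25.0
```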
Design Analysis and Metrics
Benefits and Costs
Let us define the Cost of a given solution as its overhead with respect to the original system:
Physical Cost (Cp)
– The cost of the physical components added to the original architecture
Design Cost (Cd)
– The effort needed to design and implement a given solution
Hw/Sw System Partitioning
Once the system, the constraints and the set of possible design solutions are specified, the partitioning step selects the implementation of each task, either hardware or software
The achieved solution is checked against the designer's constraints and, if they are met, the solution is accepted, otherwise a backtrack is performed and another allocation solution is pursued
This process is extremely complex and time-consuming, due to the large number of possible alternatives and to the fact that, although heuristics and tuned estimation functions have been defined, only the final co-simulation of the proposed implementation confirms whether it is a valid solution
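The check-and-backtrack loop can be sketched in Python as a depth-first search over HW/SW allocations. This is only an illustration under strong assumptions: task estimates are hypothetical numbers, tasks run sequentially so times simply add up, and an analytical check replaces the co-simulation that actually validates a solution. It is not the TOSCA algorithm (which, in the case study later, is a genetic algorithm).

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical task estimates: (name, SW time, HW time, HW area)
Task = Tuple[str, float, float, float]

def partition(tasks: List[Task], deadline: float,
              area_budget: float) -> Optional[Dict[str, str]]:
    """Depth-first HW/SW allocation that backtracks when a constraint is violated."""
    alloc: Dict[str, str] = {}

    def search(i: int, time: float, area: float) -> bool:
        if time > deadline or area > area_budget:
            return False                   # constraint violated: backtrack
        if i == len(tasks):
            return True                    # complete allocation meets the constraints
        name, t_sw, t_hw, a_hw = tasks[i]
        for choice, dt, da in (("SW", t_sw, 0.0), ("HW", t_hw, a_hw)):
            alloc[name] = choice
            if search(i + 1, time + dt, area + da):
                return True
        del alloc[name]                    # neither option works: undo this choice
        return False

    return dict(alloc) if search(0, 0.0, 0.0) else None

tasks = [("t1", 4, 1, 3), ("t2", 5, 2, 4), ("t3", 3, 1, 2)]
print(partition(tasks, deadline=9.0, area_budget=5.0))  # {'t1': 'SW', 't2': 'HW', 't3': 'SW'}
```

The search space doubles with every task, which is one reason heuristics are needed in practice.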
Hw/Sw System Partitioning
The reliability aspects add a significant number of parameters to the partitioning step, making the selection of the final implementation considerably more complex
To cope with the complexity of the partitioning step when reliability goals are also included, a two-level approach is proposed here
A first partitioning is performed that takes into account only the classical aspects and cost functions, meeting the usually stringent timing constraints
Given this first solution, a second-level partitioning considers the additional reliability constraints, analyzes the approaches among the defined methodologies that fulfill them, and provides the solution with the best tradeoff (if one exists)
Hw/Sw System Partitioning
[Figure: two-level reliability co-design flow. Recoverable labels: specification with timing tags, reliability tags and sections for reliability; partitioning driven by timing, power, area and cost constraints and by the architecture, producing an initial HW/SW/O.S./INT solution; a reliability-requirement decision (yes/no); reliability co-design partitioning based on the reliability model (hard/soft strength, fault coverage, detection latency, area overhead, performance degradation) and on optimization, producing a HW/SW/O.S./INT solution with fault detection for the specific architecture; HW/SW synthesis]
Hw/Sw System Partitioning
The second-level partitioning problem consists of two parts:
Reliability Model Identification: defining a criterion for relating each constrained procedure to the most suitable CED method
Optimization: optimizing the result produced by the assignment criteria with respect to the global solution
Hw/Sw System Partitioning
Reliability Model Identification
For each approach, either an exact evaluation or a qualitative estimation of each considered parameter is identified:
Methodology | Fault Coverage | Detection Latency | Performance Degradation | Area Overhead
SCS | min/med/max | med/max | med/max | med/max
A | min/med/max | min/med | med/max | med/max
DP | 100% | med/max | min/med | med/max
VLIWS | 100% | 0 | med/max | min
IFRC | 100% | 0 | 0 | max
DMAC | 100% | med/max | med/max | max
VLIWH | 100% | 0 | 0 | max
DCC | 100% | med | med | max
D | 100% | 0 | 0 | max
TSCS | 100% | med/max | med/max | med/max
TSCD | 100% | 0 | 0 | min/med
Hw/Sw System Partitioning
Reliability Model Identification
A crisp tag (100% fault coverage, 0 detection latency, etc.) represents a hard system constraint that has to be enforced at any cost
A fuzzy tag (i.e., min, med, max) represents a soft system requirement: a design directive on the effort required to identify anomalies during the device's operational time
Note that, for soft requirements, a maximum requirement also admits methodologies belonging to the medium or minimum partitions, and a medium requirement also admits the minimum partition
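The inclusion rule for soft requirements amounts to an ordering on the fuzzy tags. A minimal Python sketch follows; it assumes each methodology is classified at a single level, so the ranges in the table (such as "med/max") are not handled, and the names `SOFT_LEVELS` and `admits` are ours.

```python
# Ordinal scale of the fuzzy (soft) tags, from lowest to highest
SOFT_LEVELS = {"min": 0, "med": 1, "max": 2}

def admits(requirement: str, methodology_level: str) -> bool:
    """True if a soft requirement admits a methodology classified at the given level.

    A 'max' requirement also admits the 'med' and 'min' partitions, a 'med'
    requirement also admits 'min', and a 'min' requirement admits only 'min'.
    """
    return SOFT_LEVELS[methodology_level] <= SOFT_LEVELS[requirement]

for req in ("min", "med", "max"):
    print(req, [lvl for lvl in ("min", "med", "max") if admits(req, lvl)])
```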
Hw/Sw System Partitioning
Reliability Model Identification
Crisp tags force a partition on the methodologies set
In particular, 100% fault coverage induces the partitions hard_fc and soft_fc, 0 detection latency induces the partitions hard_dl and soft_dl, and 0 performance degradation induces the partitions hard_pd and soft_pd
Since the applicability of a methodology to a specific procedure depends on whether that procedure is implemented in hardware or software, a further (HW/SW) partition is induced
Hw/Sw System Partitioning
Reliability Model Identification
By analyzing the properties of the methodologies, the following partitions are identified:
swfc = { {IFRC, DP, DMAC, DCC, VLIWH, VLIWS} ; {A, SCS} }
hwfc = { {TSCS, TSCD, D} ; {} }
swdl = { {IFRC, VLIWH, VLIWS} ; {DP, DMAC, DCC, A, SCS} }
hwdl = { {D, TSCD} ; {TSCS} }
swpd = { {IFRC, VLIWH} ; {DMAC, DP, DCC, VLIWS, A, SCS} }
hwpd = { {D, TSCD} ; {TSCS} }
Hw/Sw System Partitioning
Reliability Model Identification
The second-level partitioning takes into account the hard parameters first to select suitable CED techniques, and uses the soft parameters to choose among them
More precisely, for each critical procedure, depending on its allocation in hardware or software, the partitions fulfilling the hard/soft requirements are selected, and their intersection provides the set of suitable CED techniques
The partitioning then proceeds with the next critical procedure until this local CED allocation analysis is complete. At the end, every procedure is associated with a set of admissible CED implementations
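The hard-parameter step of this procedure can be sketched in Python by encoding the {hard; soft} partitions listed above and intersecting the hard ones for each crisp-tagged parameter. Soft-tag filtering and ranking are deliberately left out; `PARTITIONS`, `admissible`, `local_ced_allocation` and the example procedures are hypothetical names.

```python
from typing import Dict, Set, Tuple

# {hard; soft} partitions from the slide, keyed by (allocation, parameter)
PARTITIONS: Dict[Tuple[str, str], Tuple[Set[str], Set[str]]] = {
    ("SW", "fc"): ({"IFRC", "DP", "DMAC", "DCC", "VLIWH", "VLIWS"}, {"A", "SCS"}),
    ("HW", "fc"): ({"TSCS", "TSCD", "D"}, set()),
    ("SW", "dl"): ({"IFRC", "VLIWH", "VLIWS"}, {"DP", "DMAC", "DCC", "A", "SCS"}),
    ("HW", "dl"): ({"D", "TSCD"}, {"TSCS"}),
    ("SW", "pd"): ({"IFRC", "VLIWH"}, {"DMAC", "DP", "DCC", "VLIWS", "A", "SCS"}),
    ("HW", "pd"): ({"D", "TSCD"}, {"TSCS"}),
}

def admissible(allocation: str, crisp_params: Set[str]) -> Set[str]:
    """CED techniques usable for a procedure allocated to 'SW' or 'HW'.

    Start from every technique applicable to that allocation, then keep only the
    hard partition of each parameter carrying a crisp tag ('fc', 'dl', 'pd').
    """
    candidates: Set[str] = set().union(
        *(hard | soft for (alloc, _), (hard, soft) in PARTITIONS.items() if alloc == allocation))
    for param in crisp_params:
        hard, _soft = PARTITIONS[(allocation, param)]
        candidates &= hard
    return candidates

def local_ced_allocation(procedures: Dict[str, Tuple[str, Set[str]]]) -> Dict[str, Set[str]]:
    """Associate every critical procedure with its set of admissible CED techniques."""
    return {name: admissible(alloc, crisp) for name, (alloc, crisp) in procedures.items()}

# Hypothetical critical procedures: (allocation, parameters with crisp tags)
procedures = {"p_sw": ("SW", {"fc", "pd"}), "p_hw": ("HW", {"fc", "dl"})}
print(local_ced_allocation(procedures))
```

For the SW procedure, requiring both 100% coverage and zero performance degradation leaves only IFRC and VLIWH, which agrees with the table of methodologies.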
Hw/Sw System Partitioning
Optimization
The global solution determining for each procedure the CED technique actually adopted is pursued by means of a process of solution extraction and simulation, to verify that the constraints of the first partitioning are still met
This process takes into account the fact that there are techniques with a global effect (such as IFRC, DP), which prevail over those with a local impact (A, SCS)
As an optimization policy, the final solution does not include overlapping methods, in order to keep the solution efficient
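One possible reading of this extraction policy is sketched below in Python: prefer a single global-effect technique that is admissible for every critical procedure (so no local technique overlaps it), otherwise fall back to one technique per procedure, and accept a candidate only if the first-level constraints still hold. The preference order and the fallback rule are our assumptions; `meets_constraints` is a hypothetical stand-in for co-simulation, and `GLOBAL_EFFECT` contains only the two examples named on the slide.

```python
from typing import Callable, Dict, Optional, Set

# Techniques the slide names as having a global effect
GLOBAL_EFFECT = {"IFRC", "DP"}

def extract_solution(admissible: Dict[str, Set[str]],
                     meets_constraints: Callable[[Dict[str, str]], bool]
                     ) -> Optional[Dict[str, str]]:
    """Choose one CED technique per critical procedure without overlapping methods."""
    common = set.intersection(*admissible.values())
    # 1) A global-effect technique admissible for every procedure covers all of them,
    #    so no local technique needs to be stacked on top of it.
    for tech in sorted(common & GLOBAL_EFFECT):
        candidate = {proc: tech for proc in admissible}
        if meets_constraints(candidate):     # stands in for co-simulation
            return candidate
    # 2) Otherwise assign one technique per procedure (first in alphabetical order).
    if all(admissible.values()):
        candidate = {proc: min(techs) for proc, techs in admissible.items()}
        if meets_constraints(candidate):
            return candidate
    return None

# Usage: DP is rejected by the (stubbed) timing check, so IFRC is chosen everywhere.
adm = {"PVARP": {"IFRC", "VLIWH", "DP"}, "AEIr": {"IFRC", "VLIWH", "DP"}}
print(extract_solution(adm, lambda sol: "DP" not in sol.values()))
# {'PVARP': 'IFRC', 'AEIr': 'IFRC'}
```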
A Case Study: a Reliable Pacemaker
The goal of this case study is to co-design a reliable pacemaker able to detect any anomalies in its behavior due to physical faults in its components
To achieve this goal, starting from the system-level specification and following a reliable co-design flow, the design space is explored to identify an optimal partitioning between hardware and software, validated through system-level co-simulation
Then, taking the reliability requirements into account, the CED methodologies able to meet all the constraints are selected, and the one with the best cost-benefit tradeoff is adopted for the final design
A Case Study: a Reliable Pacemaker
Behavioral analysis
[Figure: electrocardiographic diagram showing the relevant timing parameters: PVARP, AEIr, BP, CSW, AVI, LRL, AVIr]
Typical values for each interval:
Interval | Min-Max (ms)
PVARP | 300-400
AEIr | 0-400
BP | 25
CSW | 75
AVIr | 100
A Case Study: a Reliable Pacemaker
State Diagram
[Figure: pacemaker state diagram. States: Start, PVARP, AEIr, BP, CSW, AVIr, AVIrp. Transition labels (event / actions): Natural V; time_out / reset_timer, set_AEIr_timer; Natural V / reset_timer, set_PVARP_timer; Natural A / reset_timer, set_BP_timer; time_out / Stimulated A, reset_timer, set_BP_timer; time_out / set_CSW_timer; time_out / reset_timer, set_AVIr_timer; Natural V / set_AVIrp_timer; time_out / Stimulated V, reset_timer, set_PVARP_timer]
A Case Study: a Reliable Pacemaker
Timing Constraints
Other constraints: the other constraints considered in the first-level partitioning step are the classical ones (power dissipation, area and cost), which must be kept as low as possible
Timing bounds for the intervals:
State | Min-Max (ms)
PVARP | 300-400
AEIr | 300-800
BP | 325-825
CSW | 400-900
AVIr | 500-1000
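These bounds appear to be the typical interval durations from the behavioral analysis accumulated along one pacing cycle (for example, AEIr ends between 300 + 0 and 400 + 400 ms), and the final 500-1000 ms bound matches the MINDELAY/MAXDELAY values used later in the Occam specification of the pacemaker. That reading is ours, not stated on the slides; the Python sketch below reproduces the table under it.

```python
from itertools import accumulate

# Typical (min, max) duration of each interval in ms, in cycle order
INTERVALS = [("PVARP", (300, 400)), ("AEIr", (0, 400)), ("BP", (25, 25)),
             ("CSW", (75, 75)), ("AVIr", (100, 100))]

def cumulative_bounds(intervals):
    """Earliest and latest end time of each interval, counted from the start of PVARP."""
    names = [name for name, _ in intervals]
    mins = accumulate(lo for _, (lo, _hi) in intervals)
    maxs = accumulate(hi for _, (_lo, hi) in intervals)
    return dict(zip(names, zip(mins, maxs)))

print(cumulative_bounds(INTERVALS))
# {'PVARP': (300, 400), 'AEIr': (300, 800), 'BP': (325, 825), 'CSW': (400, 900), 'AVIr': (500, 1000)}
```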
A Case Study: a Reliable Pacemaker
Reliability Constraints
Given the criticality of the system for human safety, hard reliability is imposed on the whole system. In more detail:
100% fault coverage is required
Performance degradation is allowed as long as the timing constraints are still met
Detection latency and area overhead must be kept as low as possible
A Case Study: a Reliable Pacemaker
System Level Specification: the Environment
[Figure: environment-level specification: Main, Heart System (test bench/environment), channels and calls, RTS[0], RTS[1]]
A Case Study: a Reliable Pacemaker
System Level Specification: the System
[Figure: system-level specification: the Pacemaker with processes PVARP, AEIr, AVIr and Timeout[0] to Timeout[4], connected by system channels and calls]
A Case Study: a Reliable Pacemaker
Timing and Reliability Requirements Specification
    PROC Pacemaker (CHAN OF BIT R, V, P, A, inh_R, inh_P)
      BIT val:
      -- Main body
      SEQ
        R ? val
        WHILE TRUE
          SEQ
            TAG P1:
            PVARP[0] (R, V, P, A, inh_R, inh_P, val)
            TAG P2:
    :
    MINDELAY FROM P1 TO P2 IS 500 (MS):
    MAXDELAY FROM P1 TO P2 IS 1000 (MS):
    CS FROM P1 TO P2 IS GLOBAL:
A Case Study: a Reliable Pacemaker
1st-Level Partitioning (TOSCA)
Embedded Ultra-Low Power Intel 486 GX
Genetic Algorithm
Communication Costs
Selected solution: all-in-SW implementation (E486, 16 MHz)
Procedure allocation (Pacemaker, PVARP, AEIr, AVI, Timeout[0], [1], [2], [3], [4]) and test results:
Allocation | T1 | T2 | T3 | T4 | T5 | T6
SW SW SW SW SW SW SW SW SW | OK | OK | OK | OK | OK | OK
SW SW SW SW HW HW HW HW HW | OK | OK | Max AVI | Max AEIr | OK | Max AVI, PVARP
SW HW HW HW SW SW SW SW SW | OK | Max AVI | Max AEIr | Max AEIr | OK | Max AVI
HW HW HW HW HW HW HW HW HW | OK | OK | OK | OK | OK | OK
A Case Study: a Reliable Pacemaker
2nd-Level Partitioning: Reliability Constraints
FC = 100% | PD = medium | DL = maximum | A (area overhead) = maximum
Partitions
FC 100%: swfc = {hard_fc} = {IFRC, DP, DMAC, DCC, VLIWH, VLIWS}
PD medium: swpd = {hard_pd; soft_pd} = {{IFRC, VLIWH}; {DMAC, DP, DCC, VLIWS, A, SCS}}
– swpd = {{IFRC, VLIWH}; {DP}}
A Case Study: a Reliable Pacemaker
2nd-Level Partitioning: Potential Solutions
{IFRC, DP, VLIWH}
Methodologies comparison: IFRC and VLIWH do not affect system behavior, while DP requires co-simulation (Nominal, Checking, Checker)
– The timing constraints are not met: the DP solution is discarded
Test results (DP):
T1 | T2 | T3 | T4 | T5 | T6
OK | OK | Max AEIr | Max AVI, PVARP | OK | Max AEIr, PVARP
A Case Study: a Reliable Pacemaker
Selected Solution
The feasible solutions are IFRC and VLIWH
These alternatives have the same area overhead and detection latency, so they are equivalent
The designer can make the final choice by considering aspects related to other steps of the co-design flow
For example, IFRC is applicable independently of the number of reliable procedures, while VLIWH requires a specific software synthesis step for each reliable procedure
– The first solution thus has a cost that is independent of the number of critical sections, which is not true for the VLIWH solution
– Since in this case study all the system procedures are made reliable, the first architectural solution requires lower effort and design cost and may be preferable
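The design-cost argument can be illustrated with a toy model in Python: IFRC has a fixed cost, while VLIWH pays once per critical section. The unit costs (10 and 3) are invented purely for illustration; only the fixed-versus-linear shape comes from the slide. With all nine procedures of the first-level allocation table made reliable, the fixed-cost solution wins under any such model once the per-section cost multiplied by nine exceeds the fixed cost.

```python
def design_cost(method: str, n_sections: int,
                ifrc_fixed: float = 10.0, vliwh_per_section: float = 3.0) -> float:
    """Hypothetical design-cost model (arbitrary units) for the two feasible solutions."""
    if method == "IFRC":
        return ifrc_fixed                       # one interface serves every reliable procedure
    if method == "VLIWH":
        return vliwh_per_section * n_sections   # one software synthesis step per critical section
    raise ValueError(f"unknown method: {method}")

def preferred(n_sections: int) -> str:
    """Method with the lower design cost; ties go to IFRC."""
    if design_cost("IFRC", n_sections) <= design_cost("VLIWH", n_sections):
        return "IFRC"
    return "VLIWH"

print(preferred(3))   # VLIWH (10.0 vs 9.0)
print(preferred(9))   # IFRC  (10.0 vs 27.0)
```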
A Case Study: a Reliable Pacemaker
Selected Solution
The selected solution does not allow any significant back-annotation to the first-level partitioning, since the initial HW/SW partitioning already achieved an acceptable all-in-software solution, efficiently loading all tasks on one processor
[Figure: final architectural solution for the reliable pacemaker: CPU, CPU_chk, bus interface and checker, I/O interface, memory]
Conclusions
The resilience/reliability co-design project aims at integrating in a standard co-design flow the elements for achieving a final system able to autonomously detect the occurrence of faults during the operational life of the system
The entire flow has been presented in this work, discussing the key elements of the proposed framework
Specification, Design Methodologies, System Partitioning
Conclusions
Language specification extensions have been defined to specify reliability requirements
A set of possible hw/sw architectural design methodologies has been analyzed considering the possibilities to implement any part of the complete system (nominal, checking and checker) either in hardware or in software
A metric has been introduced taking into account the peculiar elements of reliability properties
Conclusions
A two-level hw/sw partitioning process has been defined, acting initially as a traditional approach to determine a valid solution, while the second step explores the alternatives taking into account the fault detection properties
A case study shows the results of our work
Further research efforts are directed toward the tuning of metrics with respect to the selected suite of design methodologies, to better support the partitioning step