SERENE 2014 School: System-Level Concurrent Error Detection

SERENE'14 Autumn School: ENGINEERING RESILIENT CYBER PHYSICAL SYSTEMS

System-Level Concurrent Error Detection

Dr. Luigi Pomante, Università degli Studi dell'Aquila

Center of Excellence DEWS

System Level CED © 2014 - Luigi Pomante

Introduction

[Figure: diagram relating Concurrent Error Detection, Fault Tolerance, Reliability and Resilience]



Error detection is one of the basic features needed to support reliability, and in turn resilience, in cyber-physical systems (CPS)

So, this talk focuses on error detection issues in the cyber part of a CPS

Such a part is normally a customized digital electronic system with an ad-hoc hw/sw architecture, typically embedded in a more complex heterogeneous system that interacts heavily with physical processes



Error Detection Methodologies: Off-line vs. Concurrent

System-Level Design Methodologies

System-Level Specification

Functional characterization of the system without dealing with implementation aspects

Specification of implementation objectives and constraints: Timing, Power Consumption, Area

Estimation of the influence of different alternatives on the final implementation

HW/SW system composition: different processors and/or alternative technologies



Typically, system resilience/reliability aspects are neglected at the higher levels of the system synthesis process

They are postponed to lower abstraction levels, but the use of resilience/reliability methodologies can significantly impact timing, energy and area

It is therefore necessary to move these aspects toward the upper levels of the synthesis flow by adding a resilience/reliability constraint to the classical cost parameters

This work investigates the adoption of design-for-reliability/resilience approaches at system level, when all implementation options for the device are still open, and presents a set of design methodologies that provide concurrent error detection (CED) properties to the final implementation


Goal

This wide resilience/reliability co-design project addresses the following aspects

specification of systems in a co-design environment supporting resilience/reliability constraints

design methodologies providing the desired CED properties

hw/sw system partitioning on the basis of metrics taking into account both traditional co-design issues and resilience/reliability constraints


Overview

Problem Definition

Target System Architecture

Fault Model

System Specification

Design Methodologies for Reliability

Design Analysis and Metrics

Hw/Sw System Partitioning

A Case Study: a Reliable Pacemaker


Problem Definition

A Section is a subset of the system specification

A Critical Section is a section where the CED property is required

A Reliable Section is a critical section that propagates either error free critical results or faulty critical results associated with an error indication



The underlying assumption is that the input data processed by the reliable section are error-free

The upstream sections provide either correct data by definition or they are designed to be reliable themselves

The downstream sections also need to be designed reliable or no reliability constraint applies to them

In the former case reliability is extended to all downstream elements, in the latter the property has a pure local effect



To define these two characterizations formally, the following definitions are introduced

Local Reliability: the Local Reliability property of a critical section specifies that the reliability constraints involve only that critical section

Global Reliability: the Global Reliability property of a critical section specifies that the reliability constraints involve that section and, recursively, all the downstream sections
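The two definitions can be read as a reachability question on the graph of sections. The following is a minimal sketch under stated assumptions: the dictionary-based graph, the function name `reliable_sections` and the example sections are hypothetical and are not part of the talk or of TOSCA.

```python
from collections import deque


def reliable_sections(graph, critical, kind):
    """Return the set of sections covered by the reliability constraint.

    graph    -- dict mapping each section to the list of its downstream sections
    critical -- the section tagged as critical
    kind     -- "LOCAL" or "GLOBAL"
    """
    if kind == "LOCAL":
        # Local reliability: only the critical section itself is involved.
        return {critical}
    if kind != "GLOBAL":
        raise ValueError(f"unknown reliability kind: {kind}")
    # Global reliability: the critical section plus, recursively,
    # every downstream section (breadth-first traversal).
    covered = {critical}
    queue = deque([critical])
    while queue:
        section = queue.popleft()
        for successor in graph.get(section, []):
            if successor not in covered:
                covered.add(successor)
                queue.append(successor)
    return covered


# Hypothetical data-flow graph: A feeds B, B feeds C and D, C feeds E.
example = {"A": ["B"], "B": ["C", "D"], "C": ["E"], "D": [], "E": []}
local_b = reliable_sections(example, "B", "LOCAL")    # {"B"}
global_b = reliable_sections(example, "B", "GLOBAL")  # {"B", "C", "D", "E"}
```

In this reading, the upstream section A is never covered: its outputs are assumed correct, as stated by the underlying assumption.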



Local and Global Reliability Specification

[Figure: two example graphs of sections (A to E) illustrating local and global reliability on B]

Local reliability on B: the data provided to A are reliable

Global reliability on B: the data provided to A and B are reliable



Two kinds of reliability are needed because a specification may also include a description of the environment, which does not need any reliability property, or a set of functionalities of which only one must be reliable

For example, the specification of a digital control system for a car could include tachometer, temperature and ABS control: reliability is needed only for the ABS

To specify which sections must be reliable and what kind of reliability is desired, dedicated system-level specification languages (or proper extensions to existing ones) are required


System Specification

Two languages have been considered for system specification: Occam II and SystemC

The first one has been selected since the TOSCA environment (a Co-design environment for embedded systems), used in our studies to verify the proposed approaches, is based on it

The second language is becoming increasingly popular for system level specification, thus making its adoption almost a requirement when pursuing the integration of the proposed approaches in a real design flow



Reliability constraints in Occam II

The language has been extended with statements for identifying critical sections, to be added to the standard constraint definition section

    CS FROM label1 TO label2 IS LOCAL (GLOBAL)

Example

    INT a,b
    CHAN OF INT in,out:              -- declaration of a communication channel
    TAG A:                           -- tag definition
    SEQ
      a:=0
      WHILE TRUE
        TAG B:
        SEQ
          a:=a+1
          out ! a
          TAG C:
          in ? b
          a:=a+b
    TAG D:
    MAXDELAY FROM B TO C IS 10:      -- timing constraints
    MAXRATE OF B IS 100:
    CS FROM A TO D IS LOCAL:         -- reliability constraint
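To illustrate how a tool might read such a constraint section, here is a minimal sketch that extracts the `CS` and `MINDELAY`/`MAXDELAY` statements with regular expressions. The function `parse_constraints` and its return format are assumptions made for this example; this is not the TOSCA front end.

```python
import re

# "CS FROM <tag> TO <tag> IS LOCAL|GLOBAL:" (reliability constraint)
CS_RE = re.compile(
    r"CS\s+FROM\s+(\w+)\s+TO\s+(\w+)\s+IS\s+(LOCAL|GLOBAL)\s*:", re.IGNORECASE
)
# "MINDELAY|MAXDELAY FROM <tag> TO <tag> IS <n>" (timing constraint)
DELAY_RE = re.compile(
    r"(MIN|MAX)DELAY\s+FROM\s+(\w+)\s+TO\s+(\w+)\s+IS\s+(\d+)", re.IGNORECASE
)


def parse_constraints(text):
    """Return (critical_sections, delay_bounds) found in a constraint section."""
    critical, delays = [], []
    for raw in text.splitlines():
        line = raw.strip()
        match = CS_RE.fullmatch(line)
        if match:
            start, end, kind = match.groups()
            critical.append((start, end, kind.upper()))
            continue
        match = DELAY_RE.match(line)
        if match:
            bound, start, end, value = match.groups()
            delays.append((bound.upper(), start, end, int(value)))
        # Any other statement (e.g. MAXRATE) is ignored by this sketch.
    return critical, delays


constraints = """\
MAXDELAY FROM B TO C IS 10:
MAXRATE OF B IS 100:
CS FROM A TO D IS LOCAL:
"""
critical_sections, delay_bounds = parse_constraints(constraints)
```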



Reliability constraints in SystemC

The language allows an intervention at different abstraction levels: module or process

While working at module level, reliability constraints are imposed by extending the basic class using the inheritance mechanisms: SC_MODULE_GCS, SC_MODULE_LCS

– A reliability constraint imposed on the module applies directly to all processes included in the module itself

When moving to process level, macro mechanisms can be adopted, by introducing additional macros for specifying critical sections and the local/global reliability constraint: SC_GCS, SC_LCS


Target System Architecture

The reference architecture consists of the basic processor block (either general purpose or DSP), which executes software processes, main memory and a set of co-processors (ASIC or FPGA) implementing hardware functionalities if required

Communication between hardware modules uses the available bus; otherwise it goes through memory

[Figure: reference architecture with CPU, Memory, I/O Interface and Co-Processors]


Fault Model

The adopted fault model is represented by the Single Functional Failure, where any number of physical faults causes a functional module to perform incorrectly

The considered faults affect the hardware structure of the system and thereby undermine the behavior of the software too, but no software failures are considered in this work

The modules that may fail are, thus, the main processor, the co-processors, the main memory, the system bus and the dedicated channels for hardware-hardware module communication

Such a single failure model is based on a commonly adopted hypothesis: module failure is detected before another module fails


Design Methodologies for Reliability

The resilience/reliability project has investigated design methodologies for guaranteeing error detection capabilities based on the adoption of redundancy strategies

Architectural and information redundancy

The methodologies that have been analyzed and developed can be classified

On the basis of the functionality to be performed and controlled: Data Processing or Communication

On the partitions involved: HW or SW

On the CED techniques adopted for guaranteeing the reliability properties



The design approach considers as the basic element any functionality that the system must provide in a reliable way

Nominal (N): denotes such a basic element

Checking (C): identifies the redundant functional elements designed to provide error detection capabilities

Checker (CK): the functional element that detects a mismatching behavior between N and C due to failures

Each one of these three elements (N, C and CK) can be independently implemented in hardware or in software, leading to several classes of methodologies
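The interplay of the three elements can be sketched in a few lines. This is only an illustration, assuming a simple duplication-with-comparison scheme; the names `checked_call`, `nominal_sum`, `checking_sum`, `equality_checker` and `faulty_sum` are hypothetical and do not come from the talk.

```python
def checked_call(nominal, checking, checker, *args):
    """Run N and C on the same inputs and let CK compare their results.

    Returns (result, error), where `error` is the error indication that
    accompanies a possibly faulty result.
    """
    nominal_result = nominal(*args)
    checking_result = checking(*args)
    return nominal_result, not checker(nominal_result, checking_result)


def nominal_sum(values):
    # Nominal (N): the functionality that must be provided reliably.
    return sum(values)


def checking_sum(values):
    # Checking (C): a redundant implementation of the same functionality.
    total = 0
    for value in reversed(values):
        total += value
    return total


def equality_checker(nominal_result, checking_result):
    # Checker (CK): reports agreement between N and C.
    return nominal_result == checking_result


def faulty_sum(values):
    # Simulated functional failure of N: one bit of the result is flipped.
    return sum(values) ^ 0x4


ok_result = checked_call(nominal_sum, checking_sum, equality_checker, [1, 2, 3])
bad_result = checked_call(faulty_sum, checking_sum, equality_checker, [1, 2, 3])
# ok_result  == (6, False): error-free result, no error indication
# bad_result == (2, True):  faulty result associated with an error indication
```

This matches the definition of a Reliable Section given earlier: the result is either error-free or propagated together with an error indication. In the methodologies of the talk each of the three callables could live in hardware or in software; here all three are software only for the sake of the example.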



Reliable Data Processing

[Figure: Nominal Architecture, Checking Architecture and Checker, each implemented in HW or SW]

Solution  Nominal  Checker  Checking
1         SW       SW       SW
2         SW       HW       SW
3         SW       SW       HW
4         SW       HW       HW
5         HW       SW       SW
6         HW       HW       SW
7         HW       SW       HW
8         HW       HW       HW
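The eight classes are simply all HW/SW combinations of the three elements. A short sketch that reproduces the numbering of the table, assuming the Checker varies fastest, then Checking, then Nominal (an ordering read off the table, not stated in the talk):

```python
from itertools import product

IMPLEMENTATIONS = ("SW", "HW")

# product() varies its last position fastest, so unpacking each tuple as
# (nominal, checking, checker) reproduces the row order of the table.
solutions = [
    {"solution": index, "nominal": n, "checker": ck, "checking": c}
    for index, (n, c, ck) in enumerate(product(IMPLEMENTATIONS, repeat=3), start=1)
]

for row in solutions:
    print(row["solution"], row["nominal"], row["checker"], row["checking"])
```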



Reliable Data Processing

Class 1: SW Nominal, SW Checker, and SW Checking
– Self-Checking SW (SCS)
– Assertions (A)
– Dual-Processor Checking (DP)
– VLIW Checking (VLIWS)

Class 2: SW Nominal, HW Checker, and SW Checking
– Interface for Functional Redundancy Check (IFRC)
– DMA Checker (DMAC)
– VLIW Checking with HW Checker (VLIWH)

Class 4: SW Nominal, HW Checker, and HW Checking
– Dynamically Re-Configurable Checker (DCC)

Class 8: HW Nominal, HW Checker, and HW Checking
– Device Duplication (D)
– TSC Scheduling (TSCS)
– TSC Devices (TSCD)



Reliable Communications

It is necessary to guarantee that any fault on communication lines is detected

Either hardware redundancy (lines duplication) or information redundancy (data encoding) can be adopted

Two possibilities should be considered

Communications between procedures implemented in HW

Other kinds of communication: SW-SW, SW-HW, HW-SW



Communications between procedures implemented in HW: a pair of HW sections communicates by means of dedicated lines
– Line Duplication vs. Data Encoding

Other kinds of communication: when the communication involves a SW section, it makes use of the system bus
– The only viable solution is the use of error detection codes
– The best results are obtained by keeping the data in memory in coded form and letting the CPU work only with non-coded data
» HW TSC (totally self-checking) Encoder/Decoder/Checker for the processor and one (or more) for the HW devices
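As an illustration of information redundancy, the sketch below uses a single even-parity bit, the simplest error detection code. The choice of parity, the 8-bit word width and the function names are assumptions made for this example; the talk does not state which code the TSC encoder/decoder/checker implements.

```python
def encode(word, width=8):
    """Append an even-parity bit to `word` as the least significant bit."""
    if not 0 <= word < (1 << width):
        raise ValueError("word does not fit in the given width")
    parity = bin(word).count("1") % 2
    return (word << 1) | parity


def check(codeword):
    """Return True when no error is detected (overall parity is even)."""
    return bin(codeword).count("1") % 2 == 0


def decode(codeword):
    """Strip the parity bit and return the non-coded data word."""
    return codeword >> 1


# Data are kept in coded form on the bus and in memory ...
codeword = encode(0b10110010)
# ... and decoded (after checking) before the CPU works on them.
assert check(codeword)
data = decode(codeword)

# A single bit flipped on a communication line is detected by the checker.
corrupted = codeword ^ (1 << 3)
error_detected = not check(corrupted)
```

A single parity bit detects any odd number of flipped bits but misses an even number, so a real design would pick the code according to the fault model.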



Reliable Communications: architecture with reliable communications

[Figure: CPU, I/O Interface, Co-Processors and Memory (Coded Data), with three TSC EDCK blocks and one TSC CK block]


Design Analysis and Metrics

All the methodologies have been analyzed in detail in order to highlight the main design issues and to evaluate benefits and costs

The design issues have been analyzed qualitatively according to a reference schema, in order to quickly show the main differences between the approaches

Benefits and costs have been analyzed defining a set of significant parameters, constituting the basic elements needed to build metrics useful to compare the quality of different solutions, metrics that play an important role in the partitioning step



Design issues reference schema: key concepts

– Selection of number and typology of processing elements
– Detection of the need for a special architecture
– Analysis of synchronization issues between processing elements
– Analysis of possible physical and logical resource sharing
– Detection of modification needs of the original specification
– Selection of the execution policies for each processing element
– Allocation of the checker memory space
– Selection of the checking policies
– Analysis of the checker structure and complexity
– Selection of a mechanism to enable the checker to raise exceptions to report error detection



Benefits and Costs

Let us define the Efficiency of a given methodology as its characterization with respect to three factors

Coverage
– The percentage of functional faults that can be detected with respect to the complete fault set

Detection Latency (DL)
– The time between the instant a fault causes an error and the instant the error is detected

Performance Degradation (PD)
– It is related to the overhead (i.e., additional execution time) caused by fault detection tasks with respect to the original system
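The three efficiency factors can be written as small formulas. The sketch below is one possible reading: coverage and detection latency follow the definitions above, while expressing performance degradation as a relative overhead is an assumption (the talk only says that PD is related to the additional execution time). All numeric values are hypothetical.

```python
def coverage(detected_faults, total_faults):
    """Percentage of functional faults detected over the complete fault set."""
    if total_faults <= 0:
        raise ValueError("the fault set must not be empty")
    return 100.0 * detected_faults / total_faults


def detection_latency(error_time, detection_time):
    """Time between the instant an error occurs and the instant it is detected."""
    return detection_time - error_time


def performance_degradation(original_time, protected_time):
    """Relative execution-time overhead of the system with fault detection."""
    return (protected_time - original_time) / original_time


# Hypothetical measurements, for illustration only.
fc = coverage(detected_faults=95, total_faults=100)                      # 95.0 (%)
dl = detection_latency(error_time=10.0, detection_time=12.5)             # 2.5
pd = performance_degradation(original_time=200.0, protected_time=250.0)  # 0.25
```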



Let us define the Cost of a given solution as the overhead with respect to the original system

Physical Cost (Cp)
– It represents the cost of the physical components added to the original architecture

Design Cost (Cd)
– It represents the effort needed to design and implement a given solution


Hw/Sw System Partitioning

Once the system, the constraints, and the set of possible design solutions are specified, the partitioning step selects the implementation of each task, either hardware or software

The achieved solution is checked against the designer's constraints and, if they are met, the solution is accepted, otherwise a backtrack is performed and another allocation solution is pursued

This process is extremely complex and time consuming, due to the large number of possible alternatives and to the fact that, although heuristics and tuned estimation functions have been defined, it is the final co-simulation of the suggested system implementation that confirms it to be a solution or not
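The accept-or-try-another-allocation loop can be sketched as follows. This is a deliberately naive illustration: it enumerates allocations exhaustively, whereas real tools rely on heuristics and co-simulation, and the cost tables and constraint values are invented for the example.

```python
from itertools import product

# Hypothetical per-task estimates (arbitrary units).
SW_TIME = {"t1": 6, "t2": 5, "t3": 2}
HW_TIME = {"t1": 1, "t2": 1, "t3": 1}
HW_AREA = {"t1": 4, "t2": 3, "t3": 2}


def estimate(allocation):
    """Return (total_time, total_area) for a task -> "SW"/"HW" allocation."""
    time = sum(
        SW_TIME[task] if side == "SW" else HW_TIME[task]
        for task, side in allocation.items()
    )
    area = sum(HW_AREA[task] for task, side in allocation.items() if side == "HW")
    return time, area


def constraints_met(time, area, max_time=10, max_area=4):
    """Check the designer's constraints (hypothetical bounds)."""
    return time <= max_time and area <= max_area


def first_feasible(tasks):
    """Try allocations one by one and return the first that meets the constraints."""
    for sides in product(("SW", "HW"), repeat=len(tasks)):
        candidate = dict(zip(tasks, sides))
        if constraints_met(*estimate(candidate)):
            return candidate
    return None  # no allocation satisfies the constraints


solution = first_feasible(["t1", "t2", "t3"])
# With the numbers above: {"t1": "SW", "t2": "HW", "t3": "SW"} (time 9, area 3)
```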



The reliability aspects add a significant number of parameters to the partitioning step for the selection of the final implementation, making the task too complex to tackle in a single step

In order to cope with the complexity of the partitioning step when reliability goals are also included, a two-level approach is here proposed

A first partitioning is performed which takes into account only the classical aspects and cost functions, meeting the usually stringent time constraints

Given the first assessed solution, a second-level partitioning considers the additional reliability constraints, analyzes the possible approaches, within the set of defined methodologies which fulfill them, and provides the solution that has the best tradeoff (if it exists)



[Figure: two-level reliability co-design flow. First level: the specification, tagged with timing constraints, is partitioned against timing, power, area and cost parameters onto the HW/SW/O.S./INT architecture, producing an initial solution. If there are no reliability requirements, the flow goes to HW/SW synthesis. Otherwise the sections tagged for reliability enter the reliability co-design partitioning: reliability model (hard/soft strength; fault coverage, detection latency, area overhead, performance degradation), optimization and HW/SW synthesis, producing a solution with fault detection on a solution-specific architecture]



The second-level partitioning problem consists of both

Reliability Model Identification: defining a criterion for identifying the relation between the constrained procedure and the most suitable CED method

Optimization: optimizing the result produced by the assignment criteria with respect to the global solution



Reliability Model Identification

For each approach, an exact evaluation or a qualitative estimate of each considered parameter is identified

Methodology  Fault Coverage  Detection Latency  Performance Degradation  Area Overhead
SCS          min/med/max     med/max            med/max                  med/max
A            min/med/max     min/med            med/max                  med/max
DP           100%            med/max            min/med                  med/max
VLIWS        100%            0                  med/max                  min
IFRC         100%            0                  0                        max
DMAC         100%            med/max            med/max                  max
VLIWH        100%            0                  0                        max
DCC          100%            med                med                      max
D            100%            0                  0                        max
TSCS         100%            med/max            med/max                  med/max
TSCD         100%            0                  0                        min/med




A crisp tag (100% fault coverage, 0 detection latency, etc.) represents a hard system constraint that has to be enforced at any cost

A fuzzy tag (i.e., min, med, max) represents a soft system requirement, that is, a design directive on the effort required for the identification of anomalies during the operational time of the device

Note that, for soft requirements, a maximum requirement includes methodologies belonging to the medium or minimum partitions, and a medium requirement includes those belonging to the minimum partition




Crisp tags force a partition on the methodologies set

In particular, 100% fault coverage induces the partitions hard_fc and soft_fc, 0 detection latency induces the partitions hard_dl and soft_dl, and 0 performance degradation induces the partitions hard_pd and soft_pd

Since the applicability of a methodology to a specific procedure depends on its hardware/software characteristic, a further partition is induced



Reliability Model Identification

By analyzing the properties of the methodologies, the following partitions are identified (each written as {hard set ; soft set}):

swfc = { {IFRC, DP, DMAC, DCC, VLIWH, VLIWS} ; {A, SCS} }

hwfc = { {TSCS, TSCD, D} ; {} }

swdl = { {IFRC, VLIWH, VLIWS} ; {DP, DMAC, DCC, A, SCS} }

hwdl = { {D, TSCD} ; {TSCS} }

swpd = { {IFRC, VLIWH} ; {DMAC, DP, DCC, VLIWS, A, SCS} }

hwpd = { {D, TSCD} ; {TSCS} }
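These partitions can be derived mechanically from the parameter table: a methodology falls in the hard set of a parameter when its table entry equals the crisp value (100% fault coverage, 0 detection latency, 0 performance degradation). The sketch below assumes that rule; the data structures and the `partition` helper are hypothetical, and the software/hardware split of the methodologies is read off the partitions listed above.

```python
# (fault coverage, detection latency, performance degradation) per methodology,
# copied from the Reliability Model Identification table.
TABLE = {
    "SCS":   ("min/med/max", "med/max", "med/max"),
    "A":     ("min/med/max", "min/med", "med/max"),
    "DP":    ("100%", "med/max", "min/med"),
    "VLIWS": ("100%", "0", "med/max"),
    "IFRC":  ("100%", "0", "0"),
    "DMAC":  ("100%", "med/max", "med/max"),
    "VLIWH": ("100%", "0", "0"),
    "DCC":   ("100%", "med", "med"),
    "D":     ("100%", "0", "0"),
    "TSCS":  ("100%", "med/max", "med/max"),
    "TSCD":  ("100%", "0", "0"),
}
SOFTWARE = {"SCS", "A", "DP", "VLIWS", "IFRC", "DMAC", "VLIWH", "DCC"}
HARDWARE = {"D", "TSCS", "TSCD"}

COLUMN = {"fc": 0, "dl": 1, "pd": 2}
CRISP = {"fc": "100%", "dl": "0", "pd": "0"}


def partition(methods, parameter):
    """Split `methods` into (hard, soft) sets for the given parameter."""
    hard = {m for m in methods if TABLE[m][COLUMN[parameter]] == CRISP[parameter]}
    return hard, set(methods) - hard


swfc, hwfc = partition(SOFTWARE, "fc"), partition(HARDWARE, "fc")
swdl, hwdl = partition(SOFTWARE, "dl"), partition(HARDWARE, "dl")
swpd, hwpd = partition(SOFTWARE, "pd"), partition(HARDWARE, "pd")
```

Under these assumptions the six results coincide with the sets listed above, for instance `swdl == ({"IFRC", "VLIWH", "VLIWS"}, {"DP", "DMAC", "DCC", "A", "SCS"})`.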




The second-level partitioning takes into account the hard parameters first for selecting suitable CED techniques, and uses the soft parameters for selecting among them

More precisely, for each critical procedure, on the basis of its allocation in hardware or in software, the partitions fulfilling the hard/soft requirements are selected, and the intersection between them provides the set of suitable CED techniques

The partitioning thus proceeds with the next critical procedure and moves toward the end of this local CED allocation analysis. At the end, all procedures are associated with a set of admissible CED implementations
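The per-procedure selection step reduces to a set intersection. The sketch below is a minimal illustration; the helper name `admissible_techniques` is hypothetical, and the two input sets are the ones reported in the pacemaker case study of this talk for a software procedure with 100% fault coverage and medium performance degradation.

```python
def admissible_techniques(selected_partitions):
    """Intersect the partitions that fulfill each requirement of a procedure."""
    sets = [set(partition) for partition in selected_partitions]
    if not sets:
        raise ValueError("at least one requirement is needed")
    return set.intersection(*sets)


# Software methodologies with 100% fault coverage (hard_fc of swfc).
fc_100_sw = {"IFRC", "DP", "DMAC", "DCC", "VLIWH", "VLIWS"}
# Software methodologies admitted by a medium performance degradation
# requirement, as listed in the case study.
pd_medium_sw = {"IFRC", "VLIWH", "DP"}

candidates = admissible_techniques([fc_100_sw, pd_medium_sw])
# candidates == {"IFRC", "DP", "VLIWH"}
```

Repeating the call for every critical procedure yields the sets of admissible CED implementations that the optimization step then chooses from.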



Optimization

The global solution determining for each procedure the CED technique actually adopted is pursued by means of a process of solution extraction and simulation, to verify that the constraints of the first partitioning are still met

This process takes into account the fact that there are techniques with a global effect (such as IFRC, DP), which prevail over those with a local impact (A, SCS)

As an optimization policy, the final solution does not include overlapped methods in order to achieve a significant efficiency


A Case Study: a Reliable Pacemaker

The goal of this case study is to co-design a reliable pacemaker able to detect any anomalies in its behavior due to physical faults in its components

To reach this goal, starting from a system-level specification and following a reliable co-design flow, the design space is explored to identify an optimal partitioning between hardware and software, validated through system-level co-simulation

Hence, by taking into account the reliability requirements, the proper CED methodologies able to meet all the constraints are selected and then the one with the best cost-benefit tradeoff is identified and adopted for the final design



Behavioral analysis

[Figure: electrocardiographic diagram showing the relevant timing parameters (PVARP, AEIr, BP, CSW, AVI, AVIr, LRL)]

Typical values for each interval

Time Interval  Min-Max (ms)
PVARP          300-400
AEIr           0-400
BP             25
CSW            75
AVIr           100



State Diagram

[Figure: state diagram with states Start, PVARP, AEIr, BP, CSW, AVIr and AVIrp. Transition labels (event / actions) recoverable from the slide:
– Natural V
– time_out / reset_timer, set_AEIr_timer
– Natural V / reset_timer, set_PVARP_timer
– Natural A / reset_timer, set_BP_timer
– time_out / Stimulated A, reset_timer, set_BP_timer
– time_out / set_CSW_timer
– time_out / reset_timer, set_AVIr_timer
– Natural V / set_AVIrp_timer
– time_out / Stimulated V, reset_timer, set_PVARP_timer]



Timing Constraints

Other Constraints

The other constraints to be considered in the first-level partitioning step are the classical ones: power dissipation, area and cost

They must be kept as close as possible to their minimum values

Timing bounds for the intervals

State  Min-Max (ms)
PVARP  300-400
AEIr   300-800
BP     325-825
CSW    400-900
AVIr   500-1000
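A co-simulation test could compare each measured interval with these bounds. The sketch below is an illustration only: the function `check_interval` and the result labels are hypothetical, and reading a label such as "Max AEIr" as a violation of the upper bound of AEIr is an assumption.

```python
# Timing bounds in milliseconds, copied from the table above.
BOUNDS_MS = {
    "PVARP": (300, 400),
    "AEIr": (300, 800),
    "BP": (325, 825),
    "CSW": (400, 900),
    "AVIr": (500, 1000),
}


def check_interval(state, elapsed_ms):
    """Classify a measured time against the bounds of the given state."""
    low, high = BOUNDS_MS[state]
    if elapsed_ms < low:
        return f"Min {state}"   # lower bound violated
    if elapsed_ms > high:
        return f"Max {state}"   # upper bound violated
    return "OK"


results = [
    check_interval("PVARP", 350),  # "OK"
    check_interval("AEIr", 900),   # "Max AEIr"
    check_interval("BP", 300),     # "Min BP"
]
```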



Reliability Constraints

Considering how critical the system is for human safety, a hard reliability constraint is imposed on the whole system

More in detail
– 100% fault coverage is required
– Performance degradation is allowed as long as timing constraints are still met
– Detection latency and area overhead must be kept as close as possible to their minimum values



System Level Specification: the Environment

[Figure: block diagram with Main, Heart, System, Test bench and Environment blocks and the RTS[0] and RTS[1] links; legend: Channels, Calls; callout: "The heart ... inside"]



System Level Specification: the System

[Figure: block diagram of the System with the Pacemaker, PVARP, AEIr, AVIr and Timeout[0] to Timeout[4] procedures; legend: System Channels, Calls]



Timing and Reliability Requirements Specification

    PROC Pacemaker( CHAN OF BIT R; CHAN OF BIT V; CHAN OF BIT P;
                    CHAN OF BIT A; CHAN OF BIT inh_R; CHAN OF BIT inh_P )
      BIT val:
      -- Main body
      SEQ
        R ? val
        WHILE (TRUE)
          SEQ
            TAG P1:
            PVARP[0]( R, V, P, A, inh_R, inh_P, val)
            TAG P2:
    :
    MINDELAY FROM P1 TO P2 IS 500 (MS):
    MAXDELAY FROM P1 TO P2 IS 1000 (MS):
    CS FROM P1 TO P2 IS GLOBAL:



1st Level Partitioning

TOSCA
– Embedded Ultra-Low Power Intel 486 GX
– Genetic Algorithm
– Communication Costs

Selected Solution: all-in-sw implementation (E486 16 MHz)

Procedures allocation (Pacemaker to Timeout[4]) and test results (T1 to T6)

Pacemaker  PVARP  AEIr  AVI  Timeout[0]  [1]  [2]  [3]  [4]  T1  T2       T3        T4        T5  T6
SW         SW     SW    SW   SW          SW   SW   SW   SW   OK  OK       OK        OK        OK  OK
SW         SW     SW    SW   HW          HW   HW   HW   HW   OK  OK       Max AVI   Max AEIr  OK  Max AVI PVARP
SW         HW     HW    HW   SW          SW   SW   SW   SW   OK  Max AVI  Max AEIr  Max AEIr  OK  Max AVI
HW         HW     HW    HW   HW          HW   HW   HW   HW   OK  OK       OK        OK        OK  OK



2nd Level Partitioning

Reliability Constraints
– FC = 100%
– PD = medium
– DL = maximum
– A = maximum

Partitions

FC 100%
– swfc = {hard_fc} = {IFRC, DP, DMAC, DCC, VLIWH, VLIWS}

PD medium
– swpd = {hard_pd; soft_pd} = {{IFRC, VLIWH}; {DMAC, DP, DCC, VLIWS, A, SCS}}
– swpd = {{IFRC, VLIWH}; {DP}}



Potential Solutions

{IFRC, DP, VLIWH}

Methodologies Comparison

IFRC and VLIWH do not affect system behavior

DP requires co-simulation (Nominal, Checking, Checker)
– The timing constraints aren't met: the solution is discarded

Test results

T1  T2  T3        T4             T5  T6
OK  OK  Max AEIr  Max AVI PVARP  OK  Max AEIr PVARP



Selected Solution

The feasible solutions are IFRC and VLIWH

These alternatives are characterized by the same area overhead and detection latency, so they are equivalent

The designer can make the final choice by considering aspects related to other steps of the co-design flow

For example, the IFRC is applicable independently of the number of reliable procedures, while VLIWH requires a specific software synthesis step for each reliable procedure

– The first solution thus has a cost that is independent of the number of critical sections, which is not true for VLIWH solutions

– Since in the present case study all the system procedures are made reliable, the first architectural solution requires a lower effort and design cost and may be preferable



Selected Solution: the final architectural solution for the reliable pacemaker

The selected solution doesn't allow any significant back annotation to the first-level partitioning, since the initial hw/sw partitioning achieved an acceptable all-in-software solution, loading all tasks efficiently on one processor

[Figure: final architecture with CPU, CPU_chk, BUS Interface and Checker, I/O Interface and Memory]


Conclusions

The resilience/reliability co-design project aims at integrating in a standard co-design flow the elements for achieving a final system able to autonomously detect the occurrence of faults during the operational life of the system

The entire flow has been presented in this work, discussing the key elements of the proposed framework

– Specification
– Design Methodologies
– System Partitioning



Language specification extensions have been defined to specify reliability requirements

A set of possible hw/sw architectural design methodologies has been analyzed considering the possibilities to implement any part of the complete system (nominal, checking and checker) either in hardware or in software

A metric has been introduced taking into account the peculiar elements of reliability properties



A two-level hw/sw partitioning process has been defined, acting initially as a traditional approach to determine a valid solution, while the second step explores the alternatives taking into account the fault detection properties

A case study shows the results of our work

Further research efforts are directed toward the tuning of metrics with respect to the selected suite of design methodologies, to better support the partitioning step


References

L. Pomante. "System Level Concurrent Error Detection", Technical Report No. 2001.62, Politecnico di Milano, 2001

L. Pomante. "System-Level Co-Design of Heterogeneous Multiprocessor Embedded Systems", PhD Thesis, Politecnico di Milano, 2002

L. Pomante, C. Bolchini, F. Salice, D. Sciuto. "Reliability Properties Assessment at System Level: a Co Design Framework", Journal of Electronic Testing - Theory and Application (JETTA), Kluwer Academic Publishers, 2002

L. Pomante, A. Miele, F. Salice, C. Bolchini, D. Sciuto. "Reliable System Co-Design: the FIR Case Study", IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT 2004)

L. Pomante, F. Salice, C. Bolchini, D. Sciuto. "Reliable System Specification for Self-Checking Data-Paths", Design, Automation and Test in Europe - Conference & Exhibition (DATE 2005), 2005

L. Pomante, D. Sciuto, F. Salice, W. Fornaciari, C. Brandolese. "Affinity-Driven System Design Exploration for Heterogeneous Multiprocessor SoC", IEEE Transactions on Computers, vol. 55, no. 5, 2006

L. Pomante. "System-Level Design Space Exploration for Dedicated Heterogeneous Multi-Processor Systems", IEEE International Conference on Application-specific Systems, Architectures and Processors, 2011

L. Pomante. "HW/SW Co-Design of Dedicated Heterogeneous Parallel Systems: an Extended Design Space Exploration Approach", IET Computers & Digital Techniques, Institution of Engineering and Technology, 2013
