© 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering...

50
© 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter H. Feiler Jan 24, 2013

Transcript of © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering...

Page 1: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

© 2013 Carnegie Mellon University

Analytical Architecture Fault Models

Software Engineering InstituteCarnegie Mellon UniversityPittsburgh, PA 15213

Peter H. FeilerJan 24, 2013

Page 2: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

2

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Copyright 2012 Carnegie Mellon University and IEEE

This material is based upon work funded and supported by the Department of Defense under Contract No. FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center.

NO WARRANTY. THIS CARNEGIE MELLON UNIVERSITY AND SOFTWARE ENGINEERING INSTITUTE MATERIAL IS FURNISHED ON AN “AS-IS” BASIS. CARNEGIE MELLON UNIVERSITY MAKES NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED, AS TO ANY MATTER INCLUDING, BUT NOT LIMITED TO, WARRANTY OF FITNESS FOR PURPOSE OR MERCHANTABILITY, EXCLUSIVITY, OR RESULTS OBTAINED FROM USE OF THE MATERIAL. CARNEGIE MELLON UNIVERSITY DOES NOT MAKE ANY WARRANTY OF ANY KIND WITH RESPECT TO FREEDOM FROM PATENT, TRADEMARK, OR COPYRIGHT INFRINGEMENT.

This material has been approved for public release and unlimited distribution.

This material may be reproduced in its entirety, without modification, and freely distributed in written or electronic form without requesting formal permission. Permission is required for any other use. Requests for permission should be directed to the Software Engineering Institute at [email protected].

DM-0000087

Page 3: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

3

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Outline

SAE AADL and the Error Model AnnexHazards & Fault Impact AnalysisComponent Error BehaviorCompositional AnalysisOperational and Failure ModesThe Intricacies of Desired Timing BehaviorSummary and Conclusion

Page 4: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

4

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Mismatched Assumptions in Embedded SWSystem

EngineerControl

Engineer

System

Under Contro

l

Control

System

Physical Plant CharacteristicsLag, proximity

Operator ErrorAutomation & human actions

Sy

ste

m

Us

er/

En

vir

on

me

nt

HazardsImpact of

system failures

Ap

plic

atio

n

De

velo

perComput

ePlatform

RuntimeArchitectur

e

Application

Software

Embedded SW System Engineer

Data Stream Characteristics

ETE Latency (F16)State delta (NASA)

Measurement UnitsAriane 4/5Air Canada

Concurrency Communication

ITunes crashes on dual-cores

Distribution & RedundancyVirtualization of HW

(ARPA-Net split)

Hardware

Engineer

Why do system level failures still occur despite fault tolerance techniques being deployed in systems?

Page 5: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

5

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

SAE Architecture Analysis & Design Language (AADL) for Software-reliant Systems

The Computer System

The System

Computer SystemHardware & OS

Physical platformAircraft

ControlGuidance

Deployed onUtilizes

Physical interfacePlatform component

AADL focuses on interaction between the three elements of a software-intensive system based on architectural abstractions of each.

Embedded Application Software

Flight control & MissionThe Software

Focus on software runtime architecture

Page 6: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

6

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

System Level Fault Root Causes

Violation of data stream assumptions• Stream miss rates, Mismatched data representation, Latency jitter & age

Partitions as Isolation Regions• Space, time, and bandwidth partitioning• Isolation not guaranteed due to undocumented resource sharing• fault containment, security levels, safety levels, distribution

Virtualization of time & resources• Logical vs. physical redundancy• Time stamping of data & asynchronous systems

Inconsistent System States & Interactions• Modal systems with modal components• Concurrency & redundancy management• Application level interaction protocols

Performance impedance mismatches• Processor, memory & network resources• Compositional & replacement performance mismatches• Unmanaged computer system resources

Fault propagation Security analysis

Architectural redundancy patterns

End-to-end latency analysisPort connection consistency

Partitioned architecture modelsModel compliance

Resource budget analysis & task roll-up analysisResource allocation &

deployment configurations

Virtual processors & busesSynchronization domains

Page 7: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

7

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Architecture-Centric Modeling Approach

Security• Intrusion

• Integrity

• Confidentiality

Safety & Reliability

• MTBF

• FMEA

• Hazard analysis

Real-timePerformance• Execution time/

Deadline

• Deadlock/starvation

• Latency

ResourceConsumption• Bandwidth

• CPU time

• Power consumption

• Data precision/accuracy

• Temporal correctness

• Confidence

Data Quality

Architecture Model

Single Annotated Architecture Model Addresses Impact Across Non-Functional Properties

Auto-generated analytical models

Page 8: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

8

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

AADL Error Model Scope and Purpose

System safety process uses many individual methods and analyses, e.g.•hazard analysis• failure modes and effects analysis• fault trees•Markov processes

Related analyses are also useful for other purposes, e.g.•maintainability•availability• Integrity

Goal: a general facility for modeling fault/error/failure behaviors that can be used for several modeling and analysis activities.

SAE ARP 4761 Guidelines and Methods for Conducting the Safety Assessment Process on Civil Airborne Systems and Equipment

Annotated architecture model permits checking for consistency and completeness between these various

declarations.

System

Component

Subsystem

Capture FMEA model

Capture hazards

Capture risk mitigation architecture

Page 9: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

9

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Error Propagation Paths

Component A Component B

Processor 1 Processor 2Bus

Error_Free Failed Error_Free Failed

Error_Free Failed Error_Free FailedError_Free Failed

Error Model and the Architecture

Error eventError propagation Color: Different types of error

Error propagation pathError flow

Propagation of errors of different types from error sources along propagation paths between architecture components.Error flows as abstractions of propagation through components.Component error behavior as transitions between states triggered by error events and incoming propagations.Composite error behavior in terms of component error behavior states.

Page 10: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

10

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Error Propagation PathsAn Error Propagation View

Error eventError propagation Color: Different types of error

Error propagation pathError flow

Components can have incoming and outgoing error propagationsError propagations follow propagation paths determined by connections and bindings in the architecture or undocumented observable paths.Error flows specify flow of errors through a component, or the component acting as propagation source or sink.

Component AComponent B Error Propagation View

Page 11: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

11

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Fault Propagation & Transformation Calculus

Default mapping• Any to all

Single in port to multiple out ports• late → (value, *, late)

Multiple in ports• (late, _) → (value, late) : wildcard• (late, f) → (f, late) : variable

Overlapping rules• (*, late) → (*, *)• (*, f) → (*, f)

Conditional mappings (considered too complex)• (f, g) → late, if f = late• *, if f = value and g = value• value, if f = g = value

Build on York U. work

Leverage architecture model

Overcome short comings

to address practical problems

Fault mapping rules

Page 12: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

12

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Component Error Propagation

Component CNoData

NoDataBadValue

ProcessorMemory

Bus NoResource

Error Flow:

Path P1.NoData->P3.NoData

Source P2.BadData;

Path processor.NoResource -> P2.NoData

P3

P1

LateData

ValueError

Incoming/Assumed• Propagated errors

• Errors not propagated

Outgoing/Intention• Propagated errors

• Errors not propagated

Bound resources• Propagated errors

• Errors not propagated

• Propagation to resource

Incoming

Outgoing

Binding

NoData

P2BadValue

“Not“ indicates that this error type is not intended to be propagated.

This allows us to determine whether propagation specification is complete.

Propagation of Error Types

Not propagated

Propagated Error Type

Error Flow through component

Path P1.NoData->P2.NoData

Source P2.BadData

Path processor.NoResource -> P2.NoData

P1 Port

ProcessorHW Binding

Legend

Direction

Error propagation specification without error behavior state machine is possible.

Page 13: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

13

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Consistency in Error Propagation

Component ANoData

NoDataBadData

ProcessorMemory

Bus NoResource

P2P1

LateDataBadData

Component BNoData

ProcessorMemory

Bus NoResource

P2P1

BadData

Mismatched fault propagation and masking assumptions

Robustness to unexpected fault propagation

Component A

Outgoing Propagation

P2

Not propagated

Component BP1

Not propagated

Propagated

Propagated

Not propagated

Propagated

Not propagated

PropagatedUnspecified

UnspecifiedPropagatedUnspecified

Unspecified

Unspecified

IncomingPropagation

Page 14: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

14

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Error Propagation PathsComponent Error Behavior Specification

Error eventError propagation Color: Different types of error

Propagation pathError flow

Components can have error behavior specified by an error behavior state machineTransitions between states triggered by error events and incoming propagations.Conditions for outgoing propagations are specified in terms of the current state and incoming propagations.Detection of error states and incoming propagations is mapped into a reporting port in the system architecture model

Detection report Binding

Port/access point

Component A

Operational Failed

Detection

Recover/repair event

Page 15: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

15

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Compositional Abstraction of Fault Model

• Abstracted error behavior of FGS and subcomponents• Error behavior and propagation specification

• Composite error behavior specification of FGS• State in terms of subcomponent states

[1 ormore(FG1.Failed or AP1.Failed) and

1 ormore(FG2.Failed or AP2.Failed) or AC.Failed]->Failed

FG1

FG2

AP1

AP2

AC

FGSOperational Failed

Operational FailedOperational Failed

Operational Failed

Operational Failed

NoValue

NoValueNoValue

NoValue

NoValue

FG Operational Failed

NoValue

Failure

Page 16: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

16

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Error Types & Error Type SetsError type declarationsTimingError: type ;EarlyValue: type extends TimingError;LateLate: type extends TimingError;ValueError: type ;BadValue: type extends ValueError;

An error type set represents a set of type tokens •Elements in a type set are mutually exclusive•An error type with subtypes includes tokens of any subtype•A type product represents a simultaneously occurring types

–Combinations of subtypesInputOutputError : type set {TimingError, ValueError, TimingError+ValueError};

An error type token represents a typed error instance •Represents actual event, propagation, or state types{LateValue + BadValue} or {LateValue}

Error Type Set as Constraint{T1} tokens of one type hierarchy{T1, T2} tokens of one of two error type hierarchies{T1+T2} type product (one error type from each error type hierarchy){NoError} represents the empty setConstraint on state, propagation, flow, transition condition, detection condition, outgoing propagation condition, composite state condition

Page 17: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

17

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

A Common Set of Error Propagation TypesIndependent hierarchies of error typesCan be combined to characterize failure modes, resulting error states, and types of error propagation

Use aliases to adapt to domainNoPower = ServiceOmission

Introduce user defined types

Page 18: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

18

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Formal Error Type SpecificationsService is a sequence of action items. Service errors with respect to the service as a whole rather than individual service items

• Item Omission is perceived as a transient fault in that a service item is not provided. Service Omission: ai A | ai= where is the empty action.

•Service Omission is perceived as a permanent fault in that no service items are provided after the point of failure. Service Omission: [ai aj] A |S([ai aj]) with S defined as ak [ai aj] | ak= .

Value errors with respect to the value of an individual service item•Out Of Range error indicating that the value is outside the expected range of

values, a detectable error. V represents the range of acceptable values across the service sequence.Out Of Range error: aiA | vi > max(V) OR vi < min(V)

Page 19: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

19

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Outline

SAE AADL and the Error Model AnnexHazards & Fault Impact AnalysisComponent Error BehaviorCompositional AnalysisOperational and Failure ModesThe Intricacies of Desired Timing BehaviorSummary and Conclusion

Page 20: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

20

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Hazard Information in AADL Error Model

Hazards modeled as error propagations in the AADL Error Model

Page 21: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

21

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Sample FHA in Spreadsheet View

Hazard information exported to spreadsheet format

Architecture Fault Model allows hazards to be recorded for use throughout all development phases

Page 22: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

22

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Original System Safety Analysis

Auto Pilot

FMSProcessor

Operational

Failed

Flight Mgnt System

Anticipated: No Stall Propagation

FMS Power

AirspeedData

Failed

ActuatorCmd

Stall

NoService

Anticipated: NoService

Operational

NoData

EGI

EGI HW

EGI Logic

Oper’l

Failed

Oper’l

Failed

Anticipated: No EGI data

NoData

Page 23: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

23

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Discovery of Unexpected PSSA Hazard

Auto Pilot

FMSProcessor

Operational

Failed

Flight Mgnt System

FMS Power

AirspeedData

Failed

ActuatorCmd

Stall

NoService

Anticipated: NoService

CorruptedData

Unexpected propagation of corrupted Airspeed data results in Stall due to miss-correction

Operational

NoData

EGI

EGI HW

EGI Logic

Oper’l

Failed

Oper’l

Failed

Corrupted

EGI maintainer adds corrupted data hazard to model. Error Model analysis detects unhandled propagation.

Transient hardware failure corrupts EGI data

Anticipated: No EGI data

Stall caused by corrupted data

Page 24: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

24

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Recent Automated FMEA Experience

Failure Modes and Effects Analyses are rigorous and comprehensive reliability and safety design evaluations•Required by industry standards and Government policies•When performed manually are usually done once due to cost and schedule• If automated allows for

–multiple iterations from conceptual to detailed design–Tradeoff studies and evaluation of alternatives–Early identification of potential problems

Largest analysis of satellite to date consists of 26,000 failure modes• Includes detailed model of satellite bus•20 states perform failure mode•Longest failure mode sequences have 25 transitions (i.e., 25 effects)

24

Myron Hecht, Aerospace Corp.Safety Analysis for JPL, member of DO-178C committee

Page 25: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

25

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Page 26: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

26

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Outline

SAE AADL and the Error Model AnnexHazards & Fault Impact AnalysisComponent Error BehaviorCompositional AnalysisOperational and Failure ModesThe Intricacies of Desired Timing BehaviorSummary and Conclusion

Page 27: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

27

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Simple Error Behavior State Machine

annex Error_Model {** error behavior Example events -- both events will have mode-specific occurrence values for powered,unpowered SelfCheckedFault: error event; UncoveredFault: error event; SelfRepair: recover event; Fix: repair event; states Operational: initial state ; FailStopped: state; FailTransient: state; FailUnknown: state; transitions SelfFail: Operational -[SelfCheckedFault]-> (FailStopped with 0.7, FailTransient with 0.3); Recover: FailTransient -[SelfRepair]-> Operational; UncoveredFail: Operational -[UncoveredFault]-> FailUnknown;end behavior;**};

Simple state machine with branching transition

Page 28: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

28

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Component Error Behavior SpecificationComponent-specific behavior specification• Identifies an error behavior state machine•Specifies transition trigger conditions in terms of incoming propagated errors

or working condition of connected component•Specifies propagation conditions for outgoing propagated errors in terms of

states & incoming propagated errors•Specifies detection events, i.e., conditions under which an event is raised in

the component (core AADL model)component error behavior use types MyErrorLibrary ; use behavior MyErrorLibrary::SWStateBehavior ; transitions Operational-[port1{BadData}]->FailTransient; FailStopped-[port1{BadData}]->Mask; propagations all –[2 ormore (Port1{BadData}, Port2{BadData},Port3{BadData})]-> Outport3{BadData}; detections FailedState –[]-> Self.Failed ; properties Occurrence => 0.0005 Poisson applies to BadDataFault; end behavior;

Page 29: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

29

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Transient ErrorsIn error behavior state machine•Probability that error event results in transient error behavior

–Branching transition with branch probability•Error state with transition triggered by recover event

–Recover event has property to indicate time (range) and distribution over time (range)• Distribution over single value vs. range

Operational

FailStop

TransientFailure

Recover Event

Error EventPermenant failure

Probability 0.1

Page 30: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

30

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Presence & Absence of Incoming Propagations

Conjunctions on incoming propagations with single type• Inport1{LateValue} vs. Inport1{LateValue} and inport2{EarlyValue} vs.

Inport1{LateValue} and inport2{noerror}•noerror means that no error propagation occurs at a given time on a feature

Conditions on incoming propagation type set tuples• Inport1{LateValue} vs. Inport1{LateValue,BadValue} •Tuple with just a LateValue means that no other error types have occurred

Page 31: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

31

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Error Detection and Error Recovery

Error detection and reporting•Example: novalue as detectable persistent fault•Self detection and reporting (internal to component)•Detection of error propagation by recipient component

Error recovery with probability of success and failure•Branch transitions

Operational

BadValueState

BadValueEvent

out propagation BadValue

Outport2

BadValueState –[RecoverEvent]-> ( Operational with 0.99, PersistentFaultState with 0.01);

PersistentFaultState

ErrorOutport

RecoverEventOn BranchTransition

Recover FailureRecover Success

detectionsPersistentFaultState-[]-> ErrorOutPort {NoValue} ;

Component internal detection and reporting

External detection and reporting

Inport1

detectionsall -[Inport1{NoValue}]-> ErrorOutPort {NoValue} ;

ErrorOutport

out propagation NoValue

Page 32: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

32

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Outline

SAE AADL and the Error Model AnnexHazards & Fault Impact AnalysisComponent Error BehaviorCompositional AnalysisOperational and Failure ModesThe Intricacies of Desired Timing BehaviorSummary and Conclusion

Page 33: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

33

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Error Model at Each Architecture Level

• Abstracted error behavior of FMS• Error behavior and propagation specification

• Composite error behavior specification of FMS• State in terms of subcomponent states

[1 ormore(FG1.Failed or AP1.Failed) and

1 ormore(FG2.Failed or AP2.Failed) or AC.Failed]->Failed

FG1

FG2

AP1

AP2

AC

FMSOperational Failed

Operational FailedOperational Failed

Operational Failed

Operational Failed

NoValue

NoValueNoValue

NoValue

NoValue

FMSOperational Failed

NoValue

Failure

Consistency Checking Across Levels of the

Hierarchy

Composite error models lead to fault trees and reliability predictions

Fault occurrence probability

Fault occurrence probability

Page 34: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

34

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Impact of Deployment Configuration Changes

FMS Failure on 2 or 3 processor configuration (CPU failure rate = 10-5)

FMS Failure Rate 0 5*10-6 5*10-5

MTTF – One CPU operational 112,000 67,000 14,000

MTTF – Two CPU operational 48,000 31,000 7,000

Side effects of design and deployment decisions in reliability predictions

Workload balancing of partitions later in development affects reliability

3 processor configuration can be less reliable than 2 processor configuration

Example: AP and FG distributed across two processors

Page 35: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

35

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

SAFEbus with Dual Communication Channel•Network is aware of dual host nature

• Reflected in replication factor properties for connections and bus access • Dual channel communication as abstraction reflected in Replication Errors

•Integrity gate keeper for host system• Maps inconsistent value/omission errors into item omissions• Expressed by error path from incoming to outgoing binding propagation

•Fail-op/fail-stop tolerance of SAFEbus faults• Operational, SingleErrorOp, Failed states: SAFEbus error events cause state change• Operational: gate keeper mappings• SingleErrorOp: no error source & full gate keeper mappings• Failed: service omission as sole outgoing propagation• Modeled by error behavior state machine and component behavior conditions

SAFEbus

SAFEbus as Application Gate Keeper and Source of Error

2X 2X

Inconsistent Value Item/ServiceOmission

Inconsistent omissionInconsistent Value

Corrupt Value

Page 36: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

36

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Assumptions & Hazards in Use of SAFEbus

•Assumption: no replicated Host stand-by operation• Item/service omission on one Host-S channel => Item/service omission for both Host-

T channels• Modeled as Inconsistent Omission propagation in stand-by operational mode

•Assumption: no identical corrupt/bad value propagation from sender• Need to ensure no replication of corrupt/bad value (e.g., sensor fan-out)• Modeled as no outgoing symmetric value error from sending host

•SAFEbus as shared resource• Manage babbling host (item/service commission)• Modeled as propagated commission that is expected to be masked

SAFEbus

Host-S Host-T

Replicated Host System

2X

Item/ServiceOmission

Service/item Commission

Service/item Commission

Mask incoming Mask incoming

Single sourceCorrupt/Bad value

Corrupt/Bad value2X

Inconsistent ValueCorrupt/Bad value

Inconsistent Omission

Page 37: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

37

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Multi-layered Platforms

Component A Component B

Processor 1 Processor 2Bus

Operational Failed Operational Failed

Operational Failed Operational FailedOperational Failed

Processor

Bindings

Processor

Bindings

Bindings

Binding

Virtual Bus

Binding

Bindings

Operational Failed

Type transformation rules specify how binding propagations affect /augment nominal communication and error propagation between component A and component B

Connection Bindingto buses, processors, devices,

virtual processors, virtual buses

Virtual Processor

Partition X

Bindings

Binding

Platform layers exposed via virtual bus and virtual processor

Page 38: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

38

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Outline

SAE AADL and the Error Model AnnexHazards & Fault Impact AnalysisComponent Error BehaviorCompositional AnalysisOperational and Failure ModesThe Intricacies of Desired Timing BehaviorSummary and Conclusion

Page 39: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

39

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Operational Modes and Failure Modes•Nominal system behavior

• Operational modes• Functional behavior• Represented by AADL modes and Behavior Annex

•Anomalous behavior reflecting failure modes• Deviation from nominal behavior• Due to design error• Due to physical failure (operational error)• Represented by AADL Error Model Annex

•System awareness of anomalous behavior• Detection, reporting/recording• Transformation, masking• Represented by AADL Error Model Annex

Page 40: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

40

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Example: GPS•Operational modes: Hi-Precision, Lo-Precision, Off

• User initiated transitions•Failures: Sensor failure, Processing failure•Error states of GPS:

• Dual Sensor Op(operational) • Sensor1{NoError} and Sensor2{NoError} and Processing{NoError}

• Single Sensor Op (Degraded)• 1 orless(Sensor1{Failed}, Sensor2{Failed})

• Dual sensor failure or processing failure (FailStop)• 1 ormore(Sensor1{Failed} and Sensor2{Failed},

Processing{Failed})

•Mapping of Error States onto Operational Modes

HiP LoP

HiP, LoP, Off

Sensor1

Sensor2

Processing

Off

Degraded

Operational

Sensor Fails

FailStopOmission

Value

Recover Event

(Reset)

Processing Fails

HiP LoP Off

Operational

DegradedFailStop

Page 41: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

41

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Combined GPS Behavior Model•Error states constrain operational modes

• Degraded supports LoP and Off• FailStop shows Off behavior

•Forced mode transitions• New error state that excludes current mode• Explicit reflection of forced transition in mode behavior

• Detection event and “Emergency” mode transition•Behavioral record of error state (Mode state/Error state pairs)

• Specify detection and reporting of error state• Allows for detection and reporting of limits on user initiated transitions

Expected mode

Operational Degraded Fail Stop

HiP HiP Force transition to LoP{ValueError}

Force transition to Off{Omission}

LoP LoP LoPOk: LoP <-> Off

Force transition to Off{Omission}

Off Off Off Off Perceived {NoError}

HiP-O LoP-O Off-O

Operational

Degraded

FailStop

LoP-D Off-D

Off-F

Page 42: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

42

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Outline

SAE AADL and the Error Model AnnexHazards & Fault Impact AnalysisComponent Error BehaviorCompositional AnalysisOperational and Failure ModesThe Intricacies of Desired Timing BehaviorSummary and Conclusion

Page 43: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

46

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Analyzable Architecture Fault Models

Through model-based analysis identify architecture induced unhandled, testable, and untestable faults and understand the root cause, impact, and potential mitigation options.

Fault Impact & FDIR Analysis

Architecture Fault Model Analysis

Discover testable and untestable faults

Discover unhandled faults &

safety violations

FADEC Operational Mode & Fault Mgnt Behavior Analysis

Model validation

Requirements

Faults that can be tested

Decision coverage

Faults that cannot be testedRace conditions

Improved documentation &

design

Faults that are unhandled

Transient data loss in protocol

C

Fault Impact Analysis

Omission

Detection of Unhandled Data Loss Fault

Sequence

Fault propagation Effects Engine Control Mode to Issue Shut Down Engine Sequence

Reachability AnalysisOf Unsafe States

Root Cause of Data Loss Is Non-deterministic Temporal Buffer Read/Write Ordering

ProcessorCyclic Executive RMS

Config1 Config2

Read/write Timeline Analysis Under Cyclic Executive &

Preemptive Scheduler

Page 44: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

47

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

End-to-end Latency in Control Systems

System Engineer

Control Engineer

System

Under

Control

Control

System

Operational

Environment

•Processing latency•Sampling latency•Physical signal latency

Impact of Scheduler Choice on Controller Stability

A. Cervin, Lund U., CCACSD 2006

Impact of Software Implemented Tasks

Jitter is typically managed by Periodic I/O

AADL immediate & delayed connections specify deterministic sampling

Page 45: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

48

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Software-Based Latency Contributors

Execution time variation: algorithm, use of cacheProcessor speedResource contentionPreemptionLegacy & shared variable communicationRate group optimizationProtocol specific communication delayPartitioned architectureMigration of functionalityFault tolerance strategy

Quantification of Timing Errors

How much jitter is acceptable?

When is early/late too early/late?

Page 46: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

49

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Migration of Dual Fault Tolerant Control SystemHighly unstable system being controlledControl software fault tolerance•Simplex: Baseline, experimental, recovery controllers•Monitor experimental, recover to baseline control•Unrelated higher priority task shares processor

Dual redundancy to address hardware failures•Two instances of Simplex control system•Distributed leadership decision making•Asynchronous processor hardware

Migration to dual core processors•One core dedicated to Simplex control system•Unrelated task on second core•=> Failure to provide control

PC1 PC2

LeaderLogic

LeaderLogic

SimplexControl

SimplexControl

Ind

epen

den

tTa

sk

Ind

epen

den

tTa

sk

Core1Core2

PC1

IndependentTask

SimplexControl

Synchronous system was not explicitly specified

PALS protocol [Sha] provides time drift containment

Accidental Resilience

Page 47: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

50

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Outline

Challenges in Safety-critical Software-intensive systemsSAE AADL and the Error Model AnnexHazards & Fault Impact AnalysisComponent Error BehaviorCompositional AnalysisOperational and Failure ModesThe Intricacies of Desired Timing BehaviorSummary and Conclusion

Page 48: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

51

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Architecture Fault Modeling Summary

Architecture Fault Modeling with AADL•SAE AADL V2.1 published in Sept 2012 (V1 published in 2004)•Error Model Annex was published in 2006

–Supported in AADL V1 and AADL V2•Error Model concepts and ontology not specific to AADL, can be applied to

other modeling notations•Revised Error Model Annex (V2) based on user experiences currently in

review

Safety Analysis and Verification •Error Model Annex front-end available in OSATE open source toolset

–Allows for integration with in-house safety analysis tools•Multiple tool chains support various forms of safety analysis (Honeywell,

Aerospace Corp., AVSI SAVI, ESA COMPASS, WW Technology)•FHA, FMEA, fault tree, Markov models, stochastic Petri net generation from

AADL/Error Model–Open source implementation as part of Error Model V2 publication

Page 49: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

52

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

References

Website www.aadl.info

Public Wiki https://wiki.sei.cmu.edu/aadl

Dependability Modeling with AADL (V1), SEI Technical Report, 2006.

Draft Error Model V2 Annex & Architecture Fault Modeling report, in progress.

Requirements Definition and Analysis Annex supporting FAA Requirements Engineering Handbook Process and Example, Dominique Blouin. https://wiki.sei.cmu.edu/aadl/index.php/Jan_2012_User_Day

Reliability Validation and Improvement Framework, SEI Special Report CMU/SEI-2012-SR-013, November 2012.

Virtual Upgrade Validation (VUV) Method, SEI Technical Report CMU/SEI-2012-TR-005, June 2012.

Page 50: © 2013 Carnegie Mellon University Analytical Architecture Fault Models Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Peter.

53

Analytical Architecture Fault ModelsFeiler, Jan 24, 2013

© 2013 Carnegie Mellon University

Contact Information

Peter H. FeilerSenior MTSRTSSTelephone: +1 412-268-7790Email: [email protected]

U.S. MailSoftware Engineering InstituteCustomer Relations4500 Fifth AvenuePittsburgh, PA 15213-2612USA

Webwww.sei.cmu.eduwww.sei.cmu.edu/contact.cfm

Customer RelationsEmail: [email protected] Phone: +1 412-268-5800SEI Fax: +1 412-268-6257