SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus...

25
SPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center [email protected] May 11, 2005 Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Transcript of SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus...

Page 1: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

SPIDER: A Fault-Tolerant Bus Architecture

Lee Pike

Formal Methods GroupNASA Langley Research Center

[email protected]

May 11, 2005

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 2: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 3: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

Motivation

I Safety-critical distributed x-by-wire applications are beingdeployed in inhospitable environments.

I Failure rates must be on the order of 10−9 per hour ofoperation.

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 4: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

Desiderata1

I IntegrationI Off-the-shelf application integrationI Off-the-shelf fault-toleranceI Eliminate redundancy

I PartitioningI Fault-partitioningI Modular certification

I PredictabilityI Hard real-time guaranteesI A “virtual” TDMA bus

1John Rushby’s A Comparison of Bus Architectures for Safety-CriticalEmbedded Systems

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 5: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

Prominent Architectures

I TTTech’s Time-Triggered Architecture (TTA)

I Honeywell’s SAFEbus

I FlexRay (being developed by an automotive consortium)

I NASA Langley’s Scalable Processor-Independent Design forEnhanced Reliability (SPIDER)

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 6: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

SPIDER

“Time turns the improbable into the inevitable”

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 7: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

Collaborators

I Permanent InvestigatorsI Alfons Geser (formerly National Inst. of Aerospace)I Jeffrey Maddalon (NASA)I Mahyar Malekpour (NASA)I Paul Miner (NASA)I Radu Siminiceanu (National Inst. of Aerospace)I Wilfredo Torres-Pomales (NASA)

I Industry PartnersI DSI, Inc.I National Institute of Aerospace

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 8: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

SPIDER Architecture

ProcessorElements

Middleware

OS Drivers

App. ASoftware

ROBUS

PE 1 PE 2

Hardware

App. B App. B

RMU RMU RMU

Middleware

OS Drivers

InterfacePE−ROBUS

InterfacePE−ROBUS

BIU BIU BIU

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 9: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

SPIDER Services

I Fault-tolerant time-reference and synchronization

I Diagnostic consensus and reconfiguration

I (Application-level) reintegration

I Communication with guaranteed consensus and latency

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 10: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

BIU/RMU Modes of Operation

I Self-Test ModeI Initialization Mode

I Initial DiagnosisI Initial SynchronizationI Collective Diagnosis

I Preservation ModeI Clock SynchronizationI Collective DiagnosisI PE Communication

I Reintegration Mode

Continuous on-line diagnosis. . .

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 11: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

A Hybrid Fault Model

I Nonfaulty The correct message is received at the scheduledtime.

I Benign The message is detectably faulty by all receivers:I The message is received is outside the communication window.I The message is corrupted (or not present).

I Symmetric All receivers detect the same fault.

I Asymmetric (Byzantine) The messages received arearbitrary (in time and value).

I Omissive Asymmetric Each receiver determines the senderto be either nonfaulty or benign.

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 12: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

The Dynamic Maximum Fault Assumption

I For each BIU or RMU i , let Ei be i ’s eligibility set: the set ofnodes i believes to be nonfaulty.

I Let N be the set of nonfaulty nodes.

I Let B be the set of benign nodes.

I Let A be the set of asymmetric nodes.

1. 2|N ∩ Ei | > |Ei \ B| for all nodes i .

2. |A ∩ Er | = 0 for all RMUs r , or |A ∩ Eb| = 0 for all BIUs b.

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 13: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

Motivation

I Fault-injection testing cannot demonstrate 10−9 reliability

I Criticality warrants effort

I Complexity warrants effort

I Formal methods being integrated into certification standards

I Improved and structured design and understanding

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 14: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

Formal Methods Challenges

I Modeling faultsI Variety of faults and locationsI Nondeterminism in when they occur and duration

I Protocol/mode interaction and interdependence

I Protocols are distributed

I Protocols are real-time

I Varying degrees of synchrony

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 15: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

Formal Methods Tools for SPIDER

I Mechanical theorem-proving PVS (SRI)I Model-checking and decision procedures

I SAL (SRI)I SMART (William & Mary and National Institute of Aerospace)

I Interactive synthesis from Lisp-like language to a HDL

DRS (Derivation Systems, Inc. and Indiana University)

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 16: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

Reintegration Overview

Allows a node that has suffered a transient fault to regain stateconsistent with the operational nodes. The node must regain:

I Clock synchronization

I Diagnostic data

I Dynamic scheduling data and other volatile state

I Developers: Wilfredo Torres-Pomales, Mahyar Malekpour, andPaul Miner (NASA)

I Formal Verification: Lee Pike (NASA)

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 17: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

The Frame Property

time

P > lπ + 2πgoodechos

tn tn+1tn − π

I l : number of faulty nodes not accused by the reintegrator

I π: maximum skew of nonfaulty nodes

I P: frame duration

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 18: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

State Variables & Initialization

I accs: ARRAY of booleans, one for each monitored node

I seen: ARRAY of naturals, one for each monitored node

I mode: {prelim diag , frame synch, synch capture}I clock: R0≤

I fs finish: R0≤

I pd finish: R0≤

for each i, accs[i ] := false;mode := prelim diag;for each i, seen[i ] := 0;

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 19: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

Preliminary Diagnosis Mode

pd finish := clock + P + π;while clock < pd finish do {

for each i, when echo(i) do {if (seen[i ] < 2 and not accs[i ])then seen[i ] := seen[i ] + 1else accs[i ] := true;};

};for each i, if seen[i ] = 0 then accs[i ];mode := frame synch;

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 20: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

Frame Synchronization Mode

for each i, seen[i ] := 0;fs finish := clock;while clock − fs finish < π do {for each i, when echo(i) do {if (seen[i ] = 0 and not accs[i ])then {fs finish := clock;seen[i ] := seen[i ] + 1;};else accs[i ] := true;};

};mode := synch capture;

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 21: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

Synchronization Capture Mode

for each i, seen[i ] := 0;while seen cnt ≤ trusted/2 do {for each i, when echo(i) do {if (seen[i ] = 0 and not accs[i ])then seen[i ] := seen[i ] + 1;};

};clock := 0;

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 22: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

Safety Properties

Theorem (No Operational Accusations)

For all operational nodes i , accs[i ] does not hold during thereintegration protocol.

Theorem (Synchronization Acquisition)

For all operational nodes i , |clock − echo(i)| < π upon terminationof the reintegration protocol.

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 23: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

Recent Successes

I A unified fault-tolerance protocol

I A fault-tolerant distributed system verification library

I Time-triggered schedule verification

I Case-study for research in model-checking, theorem-proving,and decision-procedures

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 24: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

Future Work

I Intrusion-tolerance

I OS and middleware

I Flight-testing

I Self-stablization

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture

Page 25: SPIDER: A Fault-Tolerant Bus Architecturelepike/talks/spider_intel.pdfSPIDER: A Fault-Tolerant Bus Architecture Lee Pike Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov

Further Information

Some Talks & Papers

http://www.cs.indiana.edu/~lepike/Google: lee pike

SPIDER Homepage

http://shemesh.larc.nasa.gov/fm/spider/Google: formal methods spider

NASA Langley Research Center Formal Methods Group

http://shemesh.larc.nasa.gov/fm/Google: nasa formal methods

Lee Pike SPIDER: A Fault-Tolerant Bus Architecture