Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.


Page 1: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Fault-Tolerant Platforms for Automotive Safety-Critical Applications

Baver Şahin

2006701344

Page 2: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Agenda

- Introduction
- Fault-Tolerance in SOC
- Fault-Tolerant Multi-Processor Architectures
- SOC Fault-Tolerant Architecture Implementation
- Implementation Issues & Comparisons
- Concluding Remarks

Page 3: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Introduction

Electronics in the car

- In the late 70’s
  - digitally controlled combustion engines
  - digitally controlled anti-lock brake systems (ABS)

- Synergy between mechanics and electronics
  - better fuel economy
  - better vehicle performance
  - driver assisting functions (ABS, TCS, ESP, BA & safety features)

Page 4: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Introduction

- X-by-wire systems: To design cars with better performance and a higher level of safety, engineers must substitute mechanical interfaces between the driver and the vehicle with electronic systems.

- throttle pedal, brake pedal, gear selector, steering wheel

- the electrical output is processed by micro-controllers that manage the power-train, braking and steering activities via electrical actuators.

Page 5: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Introduction

- An example of a Brake-by-Wire system:

It consists of several computer nodes controlling various sensors and actuators that communicate through a fault-tolerant real-time network, and together form a distributed real-time computer system.

Page 6: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Introduction

Fault-Tolerance Requirements: Because drive-by-wire systems have no mechanical backup, they are assigned a high Safety Integrity Level. This means that their design must incorporate all the necessary techniques for achieving fault-tolerance.

Fault-Tolerant Design Approaches - hardware redundancy:

1) Static redundancy is based on voting on the outputs of a number of modules to mask the effects of a fault within these units. The simplest form of this arrangement consists of three modules and a voter and is termed a triple modular redundant (TMR) system.

2) Dynamic redundancy, on the other hand, is based on fault detection rather than fault masking. This is achieved by using two modules and some sort of comparison on their outputs that can detect possible faults. This method has a lower component count but is not suitable for real-time applications.

3) Hybrid redundancy uses a combination of voting, fault detection and module switching, thus combining static and dynamic redundancy.
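The difference between masking (static redundancy) and detection (dynamic redundancy) can be sketched in a few lines of Python. This is only an illustration: the function names `tmr_vote` and `duplex_compare` are hypothetical, and module outputs are assumed to be plain integers.

```python
from collections import Counter
from typing import Optional, Sequence


def tmr_vote(outputs: Sequence[int]) -> Optional[int]:
    """Static redundancy: return the majority value of three module outputs.

    Returns None when no two modules agree (more than one fault).
    """
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three module outputs")
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= 2 else None


def duplex_compare(a: int, b: int) -> bool:
    """Dynamic redundancy: report whether two module outputs agree.

    A mismatch detects a fault but cannot tell which module is wrong.
    """
    return a == b
```

With three modules a single wrong output is outvoted and the correct value is still delivered; with two modules the fault is only detected, so some recovery action has to follow.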

Page 7: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Fault-Tolerance in SOC

New trends in the automotive industry, like the development of drive-by-wire systems, have generated the need for computer systems with high levels of fault tolerance and also low cost. This can be achieved by using system-on-chip (SoC) design methods.

Common mode failures
- clock tree
- power supply
- silicon substrate

Page 8: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Fault-Tolerance in SOC

Experienced Faults

- hard fails: permanent failures that are caused by an irreversible physical change; the name derives from the term ‘hardware failure’.

- soft fails (single event upsets, SEU): soft fails (or soft errors) are defined as a spontaneous error or change in stored information which cannot be reproduced.
  - external electronic noise
  - nuclear particles that come either from the decay of radioactive atoms

Page 9: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Fault-Tolerance in SOC

While the occurrence of a permanent fault may impair or even stop the correct functionality of the system, soft errors caused by transient faults often drastically reduce the system availability. As a matter of fact, it is often the case that soft error avoidance is strongly required to maintain the system availability at an acceptable level.

- static temporal redundancy
  - triple execution and majority voting
  - mask any single soft error
- dynamic technique
  - duplication and comparison
  - deploying error detection
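The two temporal techniques listed above can be illustrated with a small Python sketch. All names here (`TransientFaultTask`, `execute_with_voting`, `execute_with_comparison`) are hypothetical, and the transient fault is simulated as a single bit flip on one chosen call.

```python
from collections import Counter
from typing import Callable, Optional, Tuple


class TransientFaultTask:
    """Hypothetical task that returns a corrupted value on one chosen call."""

    def __init__(self, correct: int, faulty_call: int) -> None:
        self.correct = correct
        self.faulty_call = faulty_call
        self.calls = 0

    def __call__(self) -> int:
        self.calls += 1
        if self.calls == self.faulty_call:
            return self.correct ^ 0x1  # simulated single bit flip
        return self.correct


def execute_with_voting(task: Callable[[], int]) -> Optional[int]:
    """Static temporal redundancy: run three times, return the majority."""
    results = [task() for _ in range(3)]
    value, count = Counter(results).most_common(1)[0]
    return value if count >= 2 else None


def execute_with_comparison(task: Callable[[], int]) -> Tuple[bool, int]:
    """Dynamic technique: run twice and report whether the results agree."""
    first, second = task(), task()
    return first == second, first
```

In this sketch, triple execution masks the simulated soft error, whereas duplication and comparison only flags it and leaves recovery (for example a re-run) to the caller.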

Page 10: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Fault-Tolerance in SOC

While error detection drastically simplifies system roll-back and restart, error masking eliminates (or at least reduces) this need, thus maintaining the provided availability at an acceptable level.

Page 11: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Fault-Tolerant Multi-Processor Architectures

Lock-Step Dual Processor Architecture:

Page 12: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Fault-Tolerant Multi-Processor Architectures

Lock-Step Dual Processor Architecture:

- two processors (master & checker): execute the same code while being strictly synchronized.

- master: has access to the system memory and drives all system outputs.

- checker: continuously executes the instructions moving on the bus (i.e. those fetched by the master processor).

Page 13: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Fault-Tolerant Multi-Processor Architectures

- compare logic (monitor): consists of a comparator circuit at the master’s and checker’s bus interfaces that checks the consistency of their data, address and control lines. The detection of a disagreement on the value of any pair of duplicated bus lines reveals the presence of a fault on either CPU, without making it possible to identify the faulty CPU.

- source of common-mode failure: bus and memory errors
  - error detection (correction) techniques
  - parity bits
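As a rough software model of the compare logic and of parity protection, the following Python sketch may help. It is not the hardware implementation: `BusState`, `buses_agree`, `even_parity_bit` and `parity_ok` are hypothetical names, and even parity is an assumption, since the slides do not say which parity scheme is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusState:
    """Snapshot of one CPU's bus interface in a given clock cycle."""
    address: int
    data: int
    control: int


def buses_agree(master: BusState, checker: BusState) -> bool:
    """Compare logic: any differing line means a fault on one of the CPUs.

    The comparison says nothing about which CPU is the faulty one.
    """
    return master == checker


def even_parity_bit(word: int) -> int:
    """Parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def parity_ok(word: int, parity: int) -> bool:
    """Detect any odd number of bit errors in a stored or transferred word."""
    return even_parity_bit(word) == parity
```

The comparator covers faults inside the two CPUs, while the parity check addresses the bus and memory, which both CPUs share and which the comparator alone cannot cover.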

Page 14: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Fault-Tolerant Multi-Processor Architectures

The lock-step architecture can be employed as a fail-silent node, providing the capability of detecting any single error (100% coverage), whether permanent or transient, occurring on the CPU, memory or communication sub-system. Error correcting codes are required when errors on buses and memories turn out to be relatively frequent due to the occurrence of transient faults.
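To make the idea of an error correcting code concrete, here is a minimal Hamming(7,4) sketch in Python. The slides do not say which code is used; Hamming(7,4) is chosen here only as a well-known single-error-correcting example, and the function names are hypothetical.

```python
from typing import List


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code: List[int]) -> List[int]:
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

Unlike a parity bit, which only detects the error, the syndrome locates the flipped bit so that a single transient error on a memory word or bus transfer can be corrected without a restart.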

Page 15: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Fault-Tolerant Multi-Processor Architectures

Loosely-Synchronized Dual Processor Architecture:

Page 16: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Fault-Tolerant Multi-Processor Architectures

Loosely-Synchronized Dual Processor Architecture:

- two CPUs: run independently, with access to distinct memory subsystems.

- A real-time operating system running on both CPUs
  - interprocessor communication
  - synchronization
  - error detection (e.g. by means of cross-checks), correction and containment (e.g. memory protection)

- A subset of the tasks executed by the processors is defined as critical. The image of the critical tasks is duplicated on both memories. Critical tasks are executed in parallel as software replicas and their outputs are exchanged after each run on a time-triggered basis. Both processors are responsible for checking their consistency.

Page 17: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Fault-Tolerant Multi-Processor Architectures

- A mismatch: indicates a fault on the CPU, memory or communication sub-system and prevents outputs from being committed.
  - cross-check mismatch
  - sanity-check
  - self-testing
  - commitment of agreed outputs

- First technique: to prevent outputs from being committed before being cross-checked, time guardians can restrict CPU access to system outputs to a predefined time window.
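The interplay of cross-check and time guardian described above can be sketched as follows. This is a simplified software model under stated assumptions: the names `within_window` and `commit_output` are hypothetical, times are integer microseconds, and the window is treated as half-open.

```python
from typing import Optional, Tuple


def within_window(now_us: int, start_us: int, end_us: int) -> bool:
    """Time guardian: outputs may only be written inside [start_us, end_us)."""
    return start_us <= now_us < end_us


def commit_output(local: int, remote: int, now_us: int,
                  window: Tuple[int, int]) -> Optional[int]:
    """Return the value to commit, or None when the output is withheld.

    The output is withheld if the two replicas disagree (cross-check
    mismatch) or if the commit is attempted outside the allowed window.
    """
    start_us, end_us = window
    if local != remote:
        return None
    if not within_window(now_us, start_us, end_us):
        return None
    return local
```

Placing the window after the scheduled exchange of replica outputs is what keeps a CPU from driving an actuator with a value that has not yet been cross-checked.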

Page 18: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Fault-Tolerant Multi-Processor Architectures

- Second technique: each processor adds its own signature to the outputs of critical tasks and the receiver checks for both signatures before accepting the data.

- Depending on the subset of critical tasks, the architecture can appear in several different configurations. At one end, fully critical applications must be entirely replicated, thus requiring twice as much memory while providing the same performance as a single-processor architecture.
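The second technique (two signatures checked by the receiver) might look like the Python sketch below. The slides do not specify the signature scheme; HMAC-SHA256 with one key per CPU is used here purely as a stand-in, and the names `sign` and `receiver_accepts`, the keys and the payload are all hypothetical.

```python
import hashlib
import hmac


def sign(key: bytes, payload: bytes) -> bytes:
    """Signature that one CPU attaches to a critical-task output."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def receiver_accepts(payload: bytes, sig_a: bytes, sig_b: bytes,
                     key_a: bytes, key_b: bytes) -> bool:
    """Accept the data only if both CPUs have signed this exact payload."""
    ok_a = hmac.compare_digest(sig_a, sign(key_a, payload))
    ok_b = hmac.compare_digest(sig_b, sign(key_b, payload))
    return ok_a and ok_b
```

Because each CPU signs only what it computed itself, a message carrying two valid signatures shows that both replicas produced the same output.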

Page 19: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Fault-Tolerant Multi-Processor Architectures

The execution of a function on both CPUs guarantees the detection of any error (100% coverage) occurring on one of the CPUs, buses or memories. Since buses and memories (at least for critical tasks) are replicated, no other form of redundancy (e.g. parity bits) is needed to detect errors on these components. Nevertheless, ECCs may be employed in the case of a high memory (or bus) failure rate.

Page 20: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Fault-Tolerant Multi-Processor Architectures

Triple Modular Redundant (TMR) Architecture:

Page 21: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Fault-Tolerant Multi-Processor Architectures

Triple Modular Redundant (TMR) Architecture:

- three identical CPUs: execute the same code in lock-step.

- majority voter: majority vote of the outputs masks any possible single CPU fault.

Memory and communication sub-system faults can be masked by employing ECC (Error Correcting Codes) techniques.

Page 22: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Fault-Tolerant Multi-Processor Architectures

Dual Lock-Step Architecture:

Page 23: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Fault-Tolerant Multi-Processor Architectures

Dual Lock-Step Architecture: A configuration widely employed in multi-chip fault-tolerant systems consists of the combination of two fail-silent channels, each one consisting of a lock-step architecture like the one presented in Lock-Step Dual Processor Architecture, building up a single fail-operational unit. In this case, the architecture provides fault-tolerance only for the replicated tasks, whose outputs are checked before being committed.

Software design errors can be prevented as well.

Page 24: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Fault-Tolerant Multi-Processor Architectures

In contrast to the solution presented in Loosely-Synchronized Dual Processor Architecture, the execution of sanity-checks is no longer required, since self-checking capabilities are already provided in hardware by means of duplication and yield 100% fault coverage.

Page 25: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

SOC Fault-Tolerant Architecture Implementation

Cost: Due to the costs associated with the higher integration level, single-chip implementations should have enough flexibility to support a wide range of applications in order to share the silicon development cost across a set of different final electronic systems.

Flexibility: the capability of a silicon solution to correctly adapt to the performance, cost and fault-tolerance requirements of a set of applications, after silicon production.

Page 26: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

SOC Fault-Tolerant Architecture Implementation

In contrast to multi-chip solutions, in a single-chip dual-processor architecture the memory sub-system can be shared between the processors at much lower cost. Since the two cores can run independently, the memory and communication sub-systems are likely to become a major performance bottleneck. For this reason the memory sub-system is split into 4 banks (2 each for code and data) and the traditional bus is replaced by a higher-performance crossbar switch, which guarantees sufficient bandwidth between the processor and memory subsystems.

Page 27: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

SOC Fault-Tolerant Architecture Implementation

The single-chip loosely-synchronized dual-processor architecture is called Shared-Memory (SM) Loosely-Synchronized Dual-Processor. Since the memory sub-system is shared between the processors, the duplication of critical code becomes a trade-off between system integrity, memory size and performance: while duplicated critical code takes up costly memory space, non-duplicated critical code, which must be executed on both cores, runs at half the speed of a single processor.

Page 28: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

SOC Fault-Tolerant Architecture Implementation

Shared-Memory (SM) Loosely-Synchronized Dual-Processor Architecture:

Page 29: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

SOC Fault-Tolerant Architecture Implementation

SM Dual Lock-Step architecture: The two fail-silent channels share the same memory sub-system. This solution largely enhances flexibility, since it covers the TMR solution (same fault-tolerance properties), while implementing the dual lock-step architecture.

Lock-Step mode: When fail-operational capability is required, the two channels can be arranged in lock-step mode, in which case the architecture provides masking capabilities for CPU faults as in the TMR solution.

Page 30: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

SOC Fault-Tolerant Architecture Implementation

Parallel Mode: The two channels can be used as two completely parallel fail-silent channels providing double performance.

Memories and buses are protected using ECCs in order to retain error masking capabilities on these components when operating in lock-step mode.

Page 31: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

SOC Fault-Tolerant Architecture Implementation

SM Dual Lock-Step architecture:

Page 32: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Implementation Issues & Comparisons

The performance and the fault-tolerance features of the different solutions are compared and their costs are evaluated on the basis of area estimates. The table summarizes the area of memory components (both RAM and FLASH) and buses, normalized to the CPU footprint.

Area of embedded memory components normalized to CPU footprint

Page 33: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Implementation Issues & Comparisons

Cost of different architectures for low-/mid-range X-by-wire systems

Page 34: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Implementation Issues & Comparisons

The single CPU architecture can be considered as a reference design satisfying computational and memory requirements but not providing any fault-tolerance capability.

The lock-step architecture: The lock-step architecture cannot provide any performance boost over the single-processor solution, since the two cores are bound to execute the same code cycle by cycle. Rather, due to the introduction of the compare logic and the ECC coders/decoders in the critical path, the clock rate may be decreased.

Page 35: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Implementation Issues & Comparisons

However, with a relatively low area overhead, this solution provides 100% fault coverage with an error detection time in the order of the clock period.

Since both processors execute the same code, the lock-step configuration does not provide any protection against software design errors.

Page 36: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Implementation Issues & Comparisons

SM loosely-synchronized dual-processor architecture: In the SM loosely-synchronized dual-processor architecture the two CPUs can run independently, having full access to the memory sub-system and system I/O. Only critical tasks must be duplicated to meet safety requirements. Like the lock-step configuration, the SM loosely-synchronized architecture provides 100% error detection when running fully critical applications. However, this requires roughly twice as much memory space to accommodate the duplicated code. The memory footprint is mostly responsible for the huge area overhead, as shown in the table.

Page 37: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Implementation Issues & Comparisons

Moreover, fault diagnosis is complicated by the longer error detection time, proportional to the check execution period, and by the fact that error detection is only performed on selected outputs. Nonetheless, in contrast to the lock-step solution, the SM loosely-synchronized architecture has the ability to support both hardware and software design diversity and provides a degraded mode of operation.

Both configurations presented above provide no fault masking mechanism, except for the possible implementation of ECCs on buses and memories. This may be a major drawback, especially in the case of a high transient fault rate.

Page 38: Fault-Tolerant Platforms for Automotive Safety-Critical Applications Baver Şahin 2006701344.

Implementation Issues & Comparisons

Triple modular redundant architecture: The TMR configuration represents a “low-cost” solution. In fact, the area overhead over the lock-step architecture is as low as 9% and 1.5% for low- and mid-range systems respectively. However, it also inherits almost all of the features and flaws of the lock-step architecture. Apart from its unique capability of masking any single fault, at the cost of an additional CPU, it offers 100% error detection coverage within a single clock period.


Implementation Issues & Comparisons

SM dual lock-step architecture: The SM dual lock-step architecture combines the advantages of the SM loosely-synchronized solution in terms of flexibility with the fault masking capabilities provided by the TMR architecture. When the two cores execute the same code in lock-step, they provide fault-tolerance capabilities. On the other hand, if the fail-silence property suffices for the application at hand, the two channels can operate completely independently and the architecture behaves like a “traditional” dual processor solution.
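The two operating modes described above can be sketched as follows. This is a simplified software model under stated assumptions: each channel is modelled as two CPUs plus a comparator, and the names `run_channel`, `fault_tolerant_mode`, `independent_mode` and the `fault_in_cpu` hook are hypothetical, introduced only for illustration.

```python
def run_channel(task, inputs, fault_in_cpu=None):
    """One lock-step channel: two CPUs run the same task, a comparator checks them.

    Returns the result, or None if the comparator sees a mismatch
    (the channel goes silent instead of emitting a wrong value).
    `fault_in_cpu` (0 or 1) is a hypothetical hook that corrupts one CPU's result.
    """
    results = [task(inputs), task(inputs)]
    if fault_in_cpu is not None:
        results[fault_in_cpu] ^= 1   # flip the lowest bit to model a fault
    return results[0] if results[0] == results[1] else None

def fault_tolerant_mode(task, inputs, faults=(None, None)):
    """Both channels run the same task; any non-silent channel supplies the output."""
    outputs = [run_channel(task, inputs, f) for f in faults]
    alive = [o for o in outputs if o is not None]
    return alive[0] if alive else None

def independent_mode(tasks, inputs, faults=(None, None)):
    """Each channel runs its own task; a faulty channel just goes silent."""
    return [run_channel(t, inputs, f) for t, f in zip(tasks, faults)]

def double(x):
    return 2 * x

def square(x):
    return x * x

# Fault in channel 0: it goes silent, channel 1 still delivers the result.
print(fault_tolerant_mode(double, 21, faults=(0, None)))         # 42
# Independent tasks: channel 1 is faulty and silent, channel 0 is unaffected.
print(independent_mode((double, square), 5, faults=(None, 1)))   # [10, None]
```

In the first mode a single fault is tolerated because the redundant channel covers for the silent one; in the second mode the same hardware delivers twice the throughput but only fail-silent behaviour.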


Implementation Issues & Comparisons

This great deal of flexibility comes at a relatively low price. In fact, compared with the fault-tolerant TMR architecture, the introduction of the 4th CPU yields a 10% overhead for low-range applications, while the overhead falls to just 2-3% for more memory-demanding applications. Notice that to cover software design faults via design diversity, we need to double the memory footprint, as done for the SM loosely-synchronized architecture. Also in this case, comparing the two alternatives, we come out with a modest increase in area, in the order of about 8% and 2% for low- and mid-range applications respectively.
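One way to see why the overhead of the extra CPU shrinks for memory-demanding applications is a simple area ratio. The sketch below is an assumed first-order model, not the paper's area estimate: the function `extra_cpu_overhead` is hypothetical and the area values are arbitrary units chosen only to show the trend.

```python
def extra_cpu_overhead(cpu_area, memory_area, other_area=0.0, base_cpus=3):
    """Relative area increase from adding one CPU to a chip with `base_cpus` CPUs.

    Assumed model: total area = CPUs + memory + everything else, so the
    extra CPU is divided by a base that grows with the memory size.
    """
    base = base_cpus * cpu_area + memory_area + other_area
    return cpu_area / base

# Arbitrary, illustrative areas (not measured values from the paper).
small_memory = extra_cpu_overhead(cpu_area=1.0, memory_area=2.0)
large_memory = extra_cpu_overhead(cpu_area=1.0, memory_area=20.0)

print(f"small memory: {small_memory:.1%}, large memory: {large_memory:.1%}")
```

As the memory term dominates the base area, the same extra CPU accounts for a smaller fraction of the chip, which matches the direction of the figures quoted above.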


Implementation Issues & Comparisons

Tradeoff analysis:

- SM loosely-synchronized architecture
  - most area-demanding solution
- lock-step and TMR architectures
  - cannot provide any performance improvement over the single processor solution, while representing “low-cost” solutions
- SM dual lock-step architecture
  - 100% single fault-tolerance
  - wider range of applications
  - reduced engineering costs
  - best alternative among the four architectures


Concluding Remarks

A single-chip solution is proposed, devised for fault-tolerant automotive applications, which is based on the use of two lock-step channels (4 CPUs overall), a cross-bar communication architecture and embedded memories.


References

[1] R. Baumann, “The Impact of Technology Scaling on Soft Error Rate Performance and Limits to the Efficacy of Error Correction,” in Digest of the International Electron Devices Meeting (IEDM ’02), 2002, pp. 329–332.

[2] R. C. Baumann, “Soft Errors in Advanced Semiconductor Devices - Part I: The Three Radiation Sources,” IEEE Transactions on Device and Materials Reliability, vol. 1, no. 1, pp. 17–22, March 2001.

[3] E. Böhl, Th. Lindenkreuz, and R. Stephan, “The Fail-Stop Controller AE11,” in Proceedings of the International Test Conference, November 1997, pp. 567–577.

[4] M. Baleani, A. Ferrari, L. Mangeruca, M. Peri, and S. Pezzini, “Fault-Tolerant Platforms for Automotive Safety-Critical Applications,” in Proceedings of the 2003 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, 2003, pp. 170–177.

[5] R. Isermann, R. Schwarz, and S. Stölzl, “Fault-Tolerant Drive-by-Wire Systems,” IEEE Control Systems Magazine, vol. 22, no. 5, pp. 64–81, October 2002.

[6] K. Ahlström and J. Torin, “Future Architecture of Flight Control Systems,” IEEE Aerospace and Electronic Systems Magazine, vol. 17, no. 12, pp. 21–27, December 2002.

[7] P. H. Jesty, K. M. Hobley, R. Evans, and I. Kendall, “Safety Analysis of Vehicle-Based Systems,” in Proceedings of the Eighth Safety-critical Systems Symposium, 2000, pp. 90–110.

[8] C. Constantinescu, “Trends and Challenges in VLSI Circuit Reliability,” IEEE Micro, vol. 23, no. 4, pp. 14–19, July-August 2003.


THANKS

Q&A