Fault Tolerance in Automotive Systems_report

7/24/2019 Fault Tolerance in Automotive Systems_report


Fault Tolerance in Automotive Systems

Adithya Hrudhayan Krishnamurthy, Ramkumar Ravikumar

Department of Electrical and Computer Engineering
University of Wisconsin, Madison

{akrishnamur3, ravikumar}@wisc.edu

Abstract - The design of fault-tolerant electronics has become a standard requirement in the automotive sector. These systems increase overall vehicle and passenger safety by liberating the driver from routine tasks and by assisting the driver during critical situations. In this paper we present detailed information about some of the commonly used fault-tolerant design techniques in the automotive domain. We start by analyzing X-by-Wire systems, which are fault-tolerant distributed systems that are fail-operational and can maintain a reliable state at all times. A case study on Steer-by-Wire is included. We then investigate the fault tolerance techniques used in the design of automotive software and how they help improve the overall reliability and dependability of the system. Common techniques used in the design of fail-safe sensors and actuators are presented. We conclude the paper with a section on the design of automotive communication systems and protocols and their ability to ensure reliable communication between the various ECUs in the vehicle.

1. Introduction

Advancements in the field of automotive electronics have helped in realizing the potential of sophisticated vehicular control systems. In addition to liberating the driver from routine tasks, such systems assist the driver during critical situations, thereby enhancing vehicular safety and performance. Among these systems, X-by-Wire systems (where driving, steering and braking are electronically controlled) have provided feasible electronic and electromechanical solutions resulting in enhanced fault tolerance and reliability. Traditional mechanical and hydraulic systems employed in automobiles and aircraft are being replaced by electronic control systems such as X-by-Wire systems. A current premium car, for instance, implements about 270 functions that a user interacts with, deployed over about 70 embedded platforms. Altogether, the software amounts to about 100 MB of binary code. Ensuring fault tolerance in automotive software is an active area of research. Safety-critical systems such as X-by-Wire systems and most ECUs typically rely on a large number of sensors to perform their functions. Hence sensors and actuators, which form the backbone of most commonly used electronic systems, need to be fault tolerant as well. Major automotive subsystems such as chassis, air-bag, powertrain, body and comfort electronics, diagnostics, X-by-Wire, multimedia and infotainment, and wireless rely on automotive communication systems. Fault-tolerant communication systems are built to tolerate defective circuits, line failures and similar faults, and are constructed using redundant hardware and software architectures to ensure reliable communication between the different subsystems in a vehicle. In this paper we present detailed information about fault tolerance techniques used in the automotive industry.

2. X-by-Wire in Automotive Systems

Following the aviation industry, the automotive industry adopted X-by-Wire systems, also called Drive-by-Wire. In the automotive environment, X denotes the commanded action, such as accelerating, braking or steering. Drive-by-Wire systems sense driver requests and translate them into optimum steering, braking, and acceleration manoeuvres. X-by-Wire systems can be classified into Brake-by-Wire, Throttle-by-Wire and Steer-by-Wire systems depending upon the commanded action. The following sections discuss Brake-by-Wire and Throttle-by-Wire briefly. A later section is devoted to a detailed study of Steer-by-Wire systems.

2.1. Brake-by-Wire

Brake-by-Wire (BBW) systems are realized through electro-mechanical actuators and communication networks instead of conventional hydraulic devices. BBW offers enhanced safety and cuts the costs associated with the manufacture and maintenance of mechanical brakes and brake fluids. It also eliminates the environmental concerns caused by hydraulic systems. There are two ways to realize a BBW system. In the first, the system is based on the traditional hydraulic brake system, and the by-wire function is realized through hydraulic pumps and additional electrically controlled valves; this is called an Electric Hydraulic Brake (EHB) [24]. In an EHB system, a hydraulic backup can be realized with the help of valves: once a fault is detected, a direct hydraulic brake circuit is closed. In the second, the brake-by-wire system is based on electro-mechanical actuators and is called an Electric Mechanical Brake (EMB). In an EMB system the brake force and brake control are realized by electric components [11]. Since a hydraulic backup cannot be realized, the system must be extremely reliable. BBW systems enable simple integration of vehicle traction and stability control.

2.2. Throttle-by-Wire

Conventional throttle systems consist of a cable running from the gas pedal to the throttle body. This cable slides within a housing as it winds its way around various components. Such a system is relatively bulky and prone to wear. Automotive manufacturers have implemented a new means of throttle control known as Throttle-by-Wire. A Throttle-by-Wire system uses a sensor that measures pedal position. The data acquired by the sensor is sent to the Engine Control Module (ECM), which determines the parameters to change. The ECM coordinates components such as the Anti-lock Braking System (ABS), gear selection, fuel and air intake, and traction control. This embedded intelligence results in increased fuel efficiency, reduced emissions, improved performance, and reduced frictional losses. Throttle-by-Wire allows the engine computer to integrate torque management with traction control and stability control.

3. Constraints for the Design of X-by-Wire Systems

3.1. Power Constraint

A challenge in implementing X-by-Wire systems in vehicles has been accommodating them on the 14 V power nets used in automobiles. The number of electrical components in cars has steadily increased in recent years. To guarantee the proper functioning of all electrical components, a stable voltage supply is necessary. The automotive industry considered migrating to 42 V systems. With 42 V in the system, vehicles could use thinner-gauge wires and smaller motors, because for the same power a higher voltage means lower current (which dictates the size of the wire).
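The effect of the higher supply voltage on current can be checked with a few lines of arithmetic. The sketch below assumes a purely illustrative 420 W load and a 0.01 ohm wire resistance (neither figure comes from the paper) and applies I = P / V and P_loss = I^2 R.

```python
def supply_current_a(power_w: float, voltage_v: float) -> float:
    """Current drawn by a load of the given power at the given voltage."""
    return power_w / voltage_v


def wire_loss_w(power_w: float, voltage_v: float, wire_resistance_ohm: float) -> float:
    """Resistive loss in the supply wire for the same load."""
    current = supply_current_a(power_w, voltage_v)
    return current * current * wire_resistance_ohm


LOAD_W = 420.0    # hypothetical actuator load
WIRE_OHM = 0.01   # hypothetical wire resistance

current_14v = supply_current_a(LOAD_W, 14.0)
current_42v = supply_current_a(LOAD_W, 42.0)
loss_ratio = wire_loss_w(LOAD_W, 14.0, WIRE_OHM) / wire_loss_w(LOAD_W, 42.0, WIRE_OHM)
```

Under these assumptions, tripling the voltage cuts the current to one third and the resistive loss in the same wire to one ninth, which is why thinner wires become acceptable.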

3.2. Real-Time Constraint

X-by-Wire systems are intrinsically real-time distributed systems. They implement complex multivariable control laws and deliver real-time information to intelligent devices that are physically distant (for example, the four wheels). They have stringent time constraints, such as a sampling period of only a few milliseconds [26]. An occasional missing sample or out-of-bound delay at the controller or actuator level, for instance due to frame loss, does not necessarily lead to vehicle instability, but it degrades steering performance (or Quality of Service). This is because most of the control laws in use are designed with specific mechanisms that compensate for delay and/or missing samples. End-to-end response times, such as the time between a request from the driver and the response of the physical system, must be bounded, typically to less than a few tens of milliseconds. An excessive end-to-end response time of a control loop may not only degrade performance but also cause instability of the vehicle.

3.3. Dependability Constraints

For a critical X-by-Wire system, the following must be ensured: a system failure does not lead to a state in which human life, economics, or the environment is endangered, and a single failure of one component must not lead to the failure of the whole X-by-Wire system [26]. The system must also be able to tolerate at least one major critical fault without loss of functionality for long enough to reach a safe parking area. Finally, X-by-Wire systems should offer the same availability and maintainability as their mechanical/hydraulic counterparts.

4. Case Study: Steer-by-Wire

Traditionally, vehicle wheels have been turned by a direct mechanical linkage between the steering wheel, the steering gears and the road wheels. In such a system, the driver turns the steering wheel to make the steering gears turn the vehicle wheels. The torque encountered by the steering system as the wheels are turned is fed back to the driver through the mechanical linkages. This torque feedback is critical as it gives the driver a sense of the road conditions, such as the traction between the wheels and the road surface. Steer-by-Wire systems eliminate the mechanical components between the steering wheel and the turning wheels. The system aims to control the wheel direction according to the driver's request and to provide a mechanical-like force feedback to the hand wheel. The following sections deal with the functionality of the architecture, the real-time constraints imposed on the system, and the protocol used by the nodes to communicate with each other.

4.1. Functional Description and Operational Architecture of Steer-by-Wire System

The two critical services provided by the Steer-by-Wire system are front axle actuation and hand wheel force feedback [26].

Front Axle Control - This function computes the orders that are given to the motor of the front axle, based on the state of the front axle and the commands given by the driver through the hand wheel. The driver's requests are translated depending on the hand wheel angle, torque and speed.

Hand Wheel Force Feedback - This function computes the orders provided to the hand wheel motor based on the speed of the vehicle, the front axle position and the front tie rod force.

An operational architecture for the functions mentioned above is realized as follows. The operational architecture includes four ECUs (micro-controllers) named, respectively, HW ECU1 (Hand Wheel ECU1), HW ECU2 (Hand Wheel ECU2), FAA ECU1 (Front Axle Actuator ECU1), and FAA ECU2 (Front Axle Actuator ECU2). Each node is connected to the two TDMA-based communication channels (BUS1 and BUS2). Three sensors, named as1, as2 and as3, placed near the hand wheel measure the requests of the driver in a similar way, the latter being translated into a 3-tuple. Three other sensors, named rps1, rps2 and rps3, are dedicated to the measurement of the front axle position. Finally, two motors (FAA Motor1 and FAA Motor2), configured in active redundancy, act on the front axle, while two other motors (HW Motor1 and HW Motor2) realize the force feedback control on the hand wheel. Sensors as1, as2 and as3 (respectively, sensors rps1, rps2 and rps3) are connected by point-to-point links to both HW ECU1 and HW ECU2 (respectively, FAA ECU1 and FAA ECU2).

Implementation of the Front Axle Control function - The requests of the driver are measured by the three replicated sensors as1, as2 and as3 and sent to both HW ECU1 and HW ECU2. Each ECU performs a majority vote on the three received values and transmits the result on both communication channels, BUS1 and BUS2. The two ECUs placed behind the front axle, FAA ECU1 and FAA ECU2, consume this data, together with the last wheel position, in order to compute the commands to be applied to FAA Motor1 and FAA Motor2.
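The voting step performed by each Hand Wheel ECU might look like the following sketch. The paper does not say how agreement between analog readings is decided, so the tolerance band and the choice of the median as output are assumptions made for illustration.

```python
from typing import Optional, Sequence


def vote_2oo3(readings: Sequence[float], tolerance: float) -> Optional[float]:
    """Two-out-of-three vote over replicated sensor readings.

    Returns the median reading if at least two readings agree within
    `tolerance`, otherwise None (no majority can be formed).
    """
    if len(readings) != 3:
        raise ValueError("expected exactly three readings")
    low, mid, high = sorted(readings)
    # After sorting, any agreeing pair must include the median, so it is
    # enough to compare the median with its two neighbours.
    if (mid - low) <= tolerance or (high - mid) <= tolerance:
        return mid
    return None
```

For example, with readings from as1, as2 and as3 of 10.0, 55.0 and 10.1 and a tolerance of 0.5, the outlier 55.0 is outvoted and the median 10.1 is forwarded on both buses.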

Implementation of the Force Feedback Control function - In a way similar to the previous function, measurements taken by rps1, rps2 and rps3 are transmitted to both FAA ECU1 and FAA ECU2. Each of these ECUs computes the information to be transmitted on the network. The consumers of this information are HW ECU1 and HW ECU2, which compute the commands transmitted to HW Motor1 and HW Motor2.

4.2. Fault Classification and Redundancy

Fault Classification - A Byzantine fault is a fault whose effect can be perceived differently by different observers. A Coherent fault is a fault whose effect is seen the same way by all observers. A Transient fault is a fault whose duration is short enough that the system does not reach an unsafe state. A Permanent fault is a non-transient one. The property of fail silence, assumed for some components, leads to another class of faults. A component is said to be fail silent if, each time it is not silent, we can conclude that it is functioning properly.

ECU redundancy - Two functions need to be implemented in ECUs: Front Axle Control and Force Feedback Control. To avoid numerous and costly wires, ECUs have to be placed close to the sensors, and communication between ECUs has to be multiplexed.

Fig 1: Steer-by-Wire operational architecture

Dependability analyses are generally based on a strong hypothesis assuming that, in the whole system, n simultaneous component failures can never occur for any set of redundant components. Lamport states that 3n+1 redundant components are necessary to tolerate n Byzantine faults [29]. To tolerate n coherent faults, it is sufficient to have 2n+1 redundant components. According to Lamport's rule, tolerating a single Byzantine fault (n = 1) requires at least 4 redundant Hand Wheel ECUs (respectively, Front Axle ECUs). This solution is mainly used in the aeronautic domain. Where it is cost prohibitive, a classical solution is to use Fail-Silent ECUs. In this case, only two Hand Wheel ECUs (respectively, Front Axle ECUs) are necessary.
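The replication rules quoted above can be written as a small helper. The function name and the fault-class labels are illustrative; the n+1 rule for fail-silent components is an assumption that generalizes the two-ECU case stated in the text.

```python
def min_replicas(n_faults: int, fault_class: str) -> int:
    """Minimum number of redundant components needed to tolerate n_faults."""
    if n_faults < 0:
        raise ValueError("n_faults must be non-negative")
    if fault_class == "byzantine":
        return 3 * n_faults + 1   # Lamport's bound, as cited in the text
    if fault_class == "coherent":
        return 2 * n_faults + 1   # a simple majority suffices
    if fault_class == "fail_silent":
        return n_faults + 1       # assumed: one healthy, non-silent unit remains
    raise ValueError(f"unknown fault class: {fault_class}")
```

For a single fault this gives 4 Byzantine-tolerant ECUs, 3 ECUs for a coherent fault, and 2 fail-silent ECUs, which matches the figures used in the steer-by-wire architecture.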

Hand Wheel Sensor and Actuator redundancy - A Hand Wheel sensor produces information for the two Hand Wheel ECUs. Three Hand Wheel sensors are necessary to ensure that each Hand Wheel ECU, assumed to run a voting algorithm, is able to tolerate one Byzantine fault (and consequently one coherent fault or one fail-silent sensor). A single actuator can take charge of piloting the Front Axle, and it is assumed that an actuator can never wrongly apply an order received from a Front Axle ECU. Given the fail-silence property of a Front Axle ECU, only two (Front Axle ECU, actuator) pairs are necessary to tolerate at most one fault.



If the chosen fault-tolerance strategy is failure recovery, redundant ECUs operate only when the primary ECU fails, and failure detection must be quick and reliable. If instead the strategy is failure compensation, redundant ECUs work in parallel. Because of the stringent real-time constraint, our architecture must provide failure compensation. Redundancy of identical ECUs does not protect the architecture from common mode failures: the hardware of redundant ECUs should be supplied by different suppliers and their software developed by different teams.

4.3. Communication Protocol (TTP/C)

TTP/C is a protocol based on the TTP (Time-Triggered Protocol) and implemented so that it meets the SAE requirements for a Class C automotive protocol. Class C protocols are suitable for high-speed, single-failure-operational, safety-critical applications. A principal reason that TTP/C is the first protocol to qualify as Class C is that the previous protocols are all event triggered. Event-triggered systems are susceptible to several serious failure modes, such as the babbling idiot failure. TTP/C is a Time Division Multiple Access (TDMA) protocol in which periodic time slots are assigned to individual processing nodes in a system. The braking unit has its own TTP/C nodes, each of which is replicated as a Fault Tolerant Unit (FTU).

Node Architecture: System nodes in a TTP/C system consist of a Host, a Controller Network Interface (CNI), and a TTP/C communications controller. The Host runs the application software for the relevant system function (for instance, the control software for the braking system in an automobile). The CNI stands between the Host and the TTP/C controller, effectively decoupling the application-level software from the network. Within the CNI are a Message Descriptor List, which contains the information controlling bus access, and a data sharing interface, which is typically implemented with dual-port RAM (allowing the Host and the TTP/C controller to access the shared memory independently). The TTP/C communications controller provides the actual connection between the TTP/C node and the shared network. The controller supports the protocol with several essential services, such as guaranteed transmission times with minimal latency and jitter, fault-tolerant clock synchronization, and error detection.

Scheduling and State Messages: System scheduling in the TTP/C protocol is static. The points in time at which the various nodes in a system are authorized to transmit form the lattice points of a TTP/C action lattice. The time difference between two adjacent points in the action lattice represents the basic cycle time of the system and sets a lower bound on the response time of the system.
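A static TDMA schedule of this kind can be sketched as a fixed table of slot owners. The slot length, the slot order and the node identifiers below are hypothetical and are not taken from the TTP/C specification; the sketch only shows that bus ownership is a pure function of global time.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TdmaSchedule:
    """Static schedule: one fixed-length slot per entry, repeated every round."""
    slot_owners: Tuple[str, ...]   # node that may transmit in each slot
    slot_length_us: int            # assumed uniform slot length

    @property
    def round_length_us(self) -> int:
        return len(self.slot_owners) * self.slot_length_us

    def owner_at(self, global_time_us: int) -> str:
        """Node that owns the bus at the given global time."""
        offset = global_time_us % self.round_length_us
        return self.slot_owners[offset // self.slot_length_us]

    def may_transmit(self, node: str, global_time_us: int) -> bool:
        return self.owner_at(global_time_us) == node


# Hypothetical schedule for the four steer-by-wire ECUs.
schedule = TdmaSchedule(("HW_ECU1", "HW_ECU2", "FAA_ECU1", "FAA_ECU2"), 250)
```

Because the table is fixed at design time, every node can evaluate `owner_at` locally and reach the same answer without any arbitration on the bus.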

Fig 2: Communication Subsystem with FTUs

Under TTP/C, nodes generate a state message in each TDMA round, which is posted to the CNI by the host for transmission over the system network. Nodes receiving messages do not respond with a receipt acknowledgement.

Clock Synchronization: Because bus access is governed purely by time, a commonly maintained global time base is critical to the proper operation of the protocol. Within a cluster of TTP/C nodes, all nodes know which node has access to the bus during a given time slot, given the a priori scheduling allocation. By comparing the time at which messages are received from other nodes (TTP/C is a broadcast protocol, so all nodes hear all messages) with the known schedule, a node can calculate the difference between the clock of the sending node and its own clock.
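The deviation measurement described above can be sketched as follows. The paper only states that each node computes the difference between expected and observed arrival times; combining several measurements with a trimmed (fault-tolerant) average is an assumption added here, and a real TTP/C controller may use a different procedure.

```python
from typing import Sequence


def clock_deviation_us(expected_arrival_us: int, observed_arrival_us: int) -> int:
    """Deviation of one frame from the schedule, measured on the local clock.

    A positive value means the frame arrived later than the local clock
    expected it to.
    """
    return observed_arrival_us - expected_arrival_us


def fault_tolerant_average(deviations: Sequence[float]) -> float:
    """Average after discarding the largest and smallest measurement.

    Dropping the extremes keeps a single wildly wrong clock from
    dominating the correction term (assumed strategy, see lead-in).
    """
    if len(deviations) < 3:
        raise ValueError("need at least three measurements")
    trimmed = sorted(deviations)[1:-1]
    return sum(trimmed) / len(trimmed)
```

With measured deviations of 2, -1, 3 and 40 microseconds, the outlier 40 and the minimum -1 are discarded and the correction term is the mean of 2 and 3.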

Composability: TTP/C supports a robust level of composability, that is, the capability to carry a thoroughly tested subsystem into a larger system and to depend on the subsystem retaining the same characteristics that it demonstrated in isolation [32]. In the auto industry, this feature potentially allows the rapid integration of components provided by multiple suppliers into a larger framework, without the need to perform extensive system integration tuning and testing.

Reliability and Fault Tolerance: Several aspects of the TTP/C protocol, and of the way it is implemented in real-world systems, serve to provide reliability and fault-tolerant behavior.

a. Membership: The role of the membership service of a real-time protocol is to inform all system nodes of the failure of a node with minimal delay [32]. Under TTP/C, a node membership field, maintained as a status register in the CNI, contains a so-called node membership vector. This vector contains one bit for every node in a TTP/C cluster, with the bit set to True if the node in question is operating correctly and to False if the node is not operating or is faulty. The node membership vector is updated by checking that the expected messages from other nodes in the cluster are received, and by analyzing the cyclic redundancy check (CRC) fields in the messages received.

b. Fail-Silence: TTP/C nodes are designed to detect faults in their own operation. The principle of operation is that every node must deliver results that are correct in both the value and the time domain, or no results at all. If a node detects an abnormality in its operation, it switches itself off. At the software level, TTP/C supports a variety of techniques, including double execution, double execution with reference checks, validity checks, assertion checking, and signature checks.

c. Bus Guardian: The Bus Guardian (BG) is a hardware element of the TTP/C controller that serves as a portal to the system bus. The key role of the Bus Guardian is to enable the bus driver only during the transmission slot of its node, and otherwise to guarantee that the bus driver is in a disabled state. This serves to prevent babbling idiot failures [32].

d. Replication of System Components: The TTP/C protocol supports replication of the hardware elements of a node, dual system buses and error detection [32].
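Two of these mechanisms lend themselves to a short sketch: the membership vector update and the Bus Guardian gate. The bit layout, function names and slot numbering are assumptions made for illustration and are not taken from the TTP/C specification.

```python
def update_membership(membership: int, node_index: int,
                      frame_received: bool, crc_ok: bool) -> int:
    """Return the membership vector after node_index's slot has elapsed.

    The node's bit is set only if its expected frame arrived and passed
    the CRC check; otherwise the bit is cleared.
    """
    bit = 1 << node_index
    if frame_received and crc_ok:
        return membership | bit
    return membership & ~bit


def is_member(membership: int, node_index: int) -> bool:
    """True if the node is currently considered to be operating correctly."""
    return bool((membership >> node_index) & 1)


def bus_driver_enabled(own_slot: int, current_slot: int) -> bool:
    """Bus Guardian rule: the driver is enabled only in the node's own slot."""
    return own_slot == current_slot
```

A node that keeps trying to transmit outside its slot (a babbling idiot) is blocked by `bus_driver_enabled`, while a node whose frame is missing or corrupt is dropped from the membership vector seen by every other node.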

5. Fault Tolerance in Automotive Software

Improving software fault tolerance is a common interest for aeronautic, railway and automotive software-based systems. However, the automotive context faces more stringent economic constraints and resource limitations, due to the higher volume of vehicle production [1]. The amount of software in a car has grown from zero to tens of millions of lines of code. Today, more than 80% of the innovations in a car come from computer systems; software has thus become a major contributor to the value of contemporary cars [4]. A study by Mercer Management Consulting and Hypovereinsbank in 2001 claims that the total value of software in cars will rise from 4% to 13% by 2010 [3], and digital hardware and software are expected to account for up to 30% of the overall cost of a car [5].

5.1. Automotive Software Classification

Automotive software is very diverse, ranging from entertainment and office-related software to safety-critical real-time control software. It can be clustered according to application area and the associated nonfunctional requirements. The following five clusters are usually distinguished [4]: multimedia, telematics, and HMI software; body/comfort software; software for safety electronics; powertrain and chassis control software; and infrastructure software. Automotive software development poses great challenges to automotive manufacturers, since an automobile is inherently distributed and subject to fault-tolerance and real-time requirements. Reliability and robustness are critical requirements of ECU software, so fault tolerance mechanisms should handle detected faults locally, without propagation to other software components [2].

5.2. Current Approaches

A fault-tolerant architecture based on computational reflection is proposed in [1]. The reflection paradigm for fault tolerance relies on the ability of a system to check and correct itself at a separate abstraction level. The software architecture is divided into two parts (functional software and defense software) that interact through an interface. It is assumed that the defense software has enough knowledge of the structure and expected behavior of the functional software to control it. The defense software detects errors by checking safety properties and performs recovery using generic instrumentation and infrastructure functionalities. As failures can impact both data and control, the failure model is structured into two parts: data flow and control flow. Critical control flow failures can disrupt control events, the sequence of execution and execution time. Critical data flow failures can affect both value and timing in the system. The runtime behavior of a system is described as a sequence of scheduled entities that are triggered by events and generate triggering events. The control flow of a scheduled entity relates to the control events starting or stopping its execution. These events are produced by the environment or by other entities. In parallel, the data flow of a scheduled entity corresponds to the input data it consumes and the output data it produces during its execution.

The defense software is organized around logging tables and three types of services that handle information logging, error detection and error recovery. Logging and tracing are mechanisms often needed for debugging and diagnosis. The logging strategy has to select rigorously the necessary and sufficient critical information to collect at runtime, according to fault tolerance concerns. The logging architecture is organized into several bracket-tables that are updated and used at runtime. Each table is associated with a dedicated logging routine that preferably uses existing infrastructure services to obtain information. Once application-specific safety properties are specified, the corresponding error detection routine is developed as an executable assertion. An assertion is verified at runtime within a corresponding checking routine. When an error is detected, the checking routine triggers a recovery routine.
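The detect-then-recover cycle can be illustrated with a minimal sketch. The class and method names, the steering-angle property and the 45-degree limit are all hypothetical; the architecture in [1] is considerably richer (hooks, logging tables, OS-level recovery).

```python
from typing import Callable, List, Tuple

Assertion = Callable[[dict], bool]
Recovery = Callable[[dict], None]


class DefenseMonitor:
    """Pairs each executable assertion with the recovery routine it triggers."""

    def __init__(self) -> None:
        self.error_log: List[str] = []   # stands in for a logging table
        self._checks: List[Tuple[str, Assertion, Recovery]] = []

    def register(self, name: str, assertion: Assertion, recovery: Recovery) -> None:
        self._checks.append((name, assertion, recovery))

    def run_checks(self, state: dict) -> List[str]:
        """Evaluate every assertion; log and recover from each violation."""
        detected = []
        for name, assertion, recovery in self._checks:
            if not assertion(state):
                self.error_log.append(name)
                recovery(state)
                detected.append(name)
        return detected


MAX_ANGLE_DEG = 45.0  # hypothetical safety limit


def angle_in_range(state: dict) -> bool:
    return abs(state["wheel_angle_deg"]) <= MAX_ANGLE_DEG


def clamp_angle(state: dict) -> None:
    # Restore a valid value and put the application into a safe mode.
    angle = state["wheel_angle_deg"]
    state["wheel_angle_deg"] = max(-MAX_ANGLE_DEG, min(MAX_ANGLE_DEG, angle))
    state["mode"] = "safe"
```

Keeping the assertions and recovery routines in a separate monitor mirrors the split between functional and defense software: the functional code only exposes its state, and the monitor decides whether that state is acceptable.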

At the application level, once an error is detected, the application is put into a safe state. Degraded data can be restored to valid values, and new communication requests may arise for missed data acknowledgements. At the infrastructure level, recovery actions on control flow include resetting, terminating and restarting a task or a set of OS objects. In the proposed fault-tolerant architecture, each checking routine is associated with one or more recovery routines that call the available executive services and update logging tables if necessary. The recovery action depends on the detected error. Two types of software instrumentation are considered: hooks and basic services. Hooks are the means to tie the defense software to the functional software and to insert code. Basic services play the role of software sensors and actuators. Fault injection techniques are used to measure fault-tolerance coverage and to detect remaining software errors in the defense software. The paper, however, does not mention the amount of data that can be handled.

An alternative method for ensuring fault tolerance in automotive software is proposed in [6]. Since it may be difficult for application programmers to implement fault tolerance features themselves while designing automotive software, it is desirable that fault tolerance be provided by the middleware and hidden from application programmers. Middleware is a software layer that connects and manages application components running on distributed hosts. It sits between network operating systems and application components. Middleware hides and abstracts many of the complex details of distributed programming from application developers. Figure 3 shows the structure of the proposed middleware for automotive systems. At the top level is the publish/subscribe interface used by application components. Beneath the interface are a QoS Configuration module for specifying real-time requirements, a Resource Allocation module for guaranteeing timely operation, and a Clock Synchronization module for providing a global time base. All incoming and outgoing messages pass through the Fault-Tolerance Layer in order to guarantee reliability. Finally, messages are transmitted and received by the Transport Layer.

Fig. 3: Structure of proposed middleware

Because software is very flexible and relatively cheap, it is a very desirable medium for implementing fault tolerance mechanisms [7]. Redundancy in software can be accomplished through redundant computation and redundant storage. The most common computational redundancy technique involves the use of a software-controlled watchdog timer. As the program executes, the timer is periodically reset by writing a new value into the timer register. If a failure prevents the reset and the timer expires, the watchdog restarts the processor from a re-entry point in the code. This technique is particularly useful for escaping deadlock conditions caused by communication failures. A simple technique for data redundancy is complement data write. Rather than simply duplicating the data, the copy is complemented first. This adds some fault tolerance by providing a means to detect stuck-at faults. As duplicating every piece of data might be expensive, often only safety-critical data is duplicated [8]. Corruption of the instruction memory can be detected and tolerated by duplicating the program in memory and executing each copy in sequence. Redundant Orthogonal Coding (ROC), a technique similar to n-version programming, can be used to increase reliability [8]. The difference is that in ROC, different algorithms are intentionally used to perform the same calculation. By explicitly ensuring dissimilarity, this addresses the problem in n-version programming that different people tend to design programs for a given task in very similar ways and to make similar mistakes in the design. This technique may also catch programming errors. Most error detection techniques focus on memory integrity. A common technique is to use checksums to verify the contents of memory [8]. Another technique for determining that the system memory is still working correctly is to test the RAM by writing various patterns of ones and zeros into every bit in memory. Another common error detection technique is to use assertion tests inside a program. Assertion checks can catch programming errors as well as errors arising from unusual conditions.
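Complement data write and a simple memory checksum might be sketched as below. The 16-bit word size, the function names and the additive checksum are assumptions chosen for brevity; [8] does not prescribe them.

```python
WORD_MASK = 0xFFFF  # assumed 16-bit memory word


def store_with_complement(value: int) -> tuple:
    """Return the pair (value, bitwise complement) to be written to memory."""
    value &= WORD_MASK
    return value, (~value) & WORD_MASK


def load_checked(value: int, complement: int) -> int:
    """Return value if the two stored copies are still exact complements."""
    if (value ^ complement) != WORD_MASK:
        raise ValueError("stored word and its complement disagree")
    return value


def additive_checksum(block: bytes) -> int:
    """8-bit additive checksum over a memory block (illustrative only)."""
    return sum(block) & 0xFF
```

The advantage over plain duplication shows up with a stuck-at fault: if the same bit is stuck at 0 in both copies, two identical duplicates would still match, whereas the value and its complement need opposite levels on that bit, so the mismatch is detected.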

Currently, the amount of error diagnosis and error recovery in cars is rather lightweight [9]. Some error logging takes place in the CPUs, but errors at the level of the network and the functional distribution are neither considered nor logged. There is no comprehensive error diagnosis and no systematic error recovery beyond individual CPUs. Failure logging for the purpose of better maintenance diagnosis has emerged as a relevant research problem. There are some fail-safe and graceful degradation techniques found in cars today, but a systematic and comprehensive error treatment is missing. With the upcoming multi-core controllers for embedded applications, an interesting area for research is how they can also be exploited for redundancy/recovery strategies. Several key areas of research have been identified in [4]. Reliability and safety concerns are important for all functions relevant to driving, from engine control and passenger safety functions to X-by-Wire functions, where mechanical transmission is replaced by electrical signals.

6. Fault tolerant actuators and sensors

Safety-critical automotive applications, such as steer-by-wire systems, are in most cases control systems with hard real-time requirements. Such systems typically have a number of sensors (inputs) [12] connected to them, whose values are processed in order to produce the control actions. Consequently, the sensors are first in the flow of information, and control computations rely on their values. Therefore it is important that the sensors can be trusted. Actuators are essential for the reliable operation of various components in the automobile, including brakes, valves and cylinders.

6.1. Fault tolerant sensors

A fault-tolerant sensor configuration should be at least fail-operational for one sensor fault [11]. This can be obtained by hardware redundancy with sensors of the same type, or by analytical redundancy with different sensors and process models. Sensor systems with static redundancy are realized with a triplex system and a voter. A configuration with dynamic redundancy needs at least two sensors and fault detection for each sensor. The fault detection can be performed by self-tests.

Fig. 4: Triplex system with static redundancy and duplex system with dynamic redundancy

Depending on the importance of the sensed quantity, additional sensors may or may not be needed to obtain the required dependability at the system level. A fail-silent sensor requires some form of sensor-internal redundancy, in the form of a built-in self-test (BIST) and/or internal replication. If a particular sensor is not fail-silent, it can be replicated in order to obtain the fail-silent property at the system level. In this case, the values of the sensors are collected and analyzed by an intelligent unit that decides which value to use in further calculations. The required degree of replication depends on the criticality of the sensor. If knowledge of the sensed quantity is critical, sensor triplication is necessary. However, if a lack of knowledge about the sensed quantity is acceptable, dual redundancy is sufficient.

A prototype of a steering angle sensor is demonstrated in [10]. By extension to a diverse four-sensor system (there are six pairs of sensor elements), the steering angle sensor becomes fault tolerant: it is able to tolerate the loss of one or two sensor elements and to diagnose the failed sensor elements. Hardware and/or software Instrument Fault Detection, Isolation and Accommodation (IFDIA) schemes are finding more and more applications in automotive measurement and control systems. A scheme for the detection, location and accommodation of faults is presented in [13]. This scheme was designed to identify and accommodate some kinds of faults that may affect manifold pressure, crankshaft speed and throttle valve angle position sensors. It is reported that the realized scheme is able to identify and accommodate even small faults in all considered sensors.

6.2. Fault tolerant actuators

Actuators generally consist of different parts: input transformer, actuation converter, actuation transformer, and actuation element. The actuation converter converts one type of energy (e.g., electrical or pneumatic) into another (e.g., mechanical or hydraulic). Fault-tolerant actuators can be designed by using multiple complete actuators in parallel, with either static redundancy or dynamic redundancy with cold or hot standby [11]. Another possibility is to limit the redundancy to the parts of the actuator that have the lowest reliability. To achieve fault-tolerant control, either the actuator must not be a single point of failure or it has to be fault tolerant itself. For functions without inherent redundancy, actuators must be replicated [12]. When only two actuators are used, the failure of one actuator must not affect the operation of the remaining unit. Sensors are typically used to continuously monitor the actuator behavior (e.g., injector current, motor current, motion, force, torque). When these sensors indicate a serious actuator failure, the power to the actuator should be switched off. As cost and weight are generally higher for actuators than for sensors, actuators with a fail-operational duplex configuration are preferred [11]. Either static redundant structures, where both parts operate continuously, or dynamic redundant structures with hot or cold standby can be chosen. For dynamic redundancy, fault-detection methods for the actuator parts are required.
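The monitoring rule described above (watch a measured quantity and cut power on a serious fault) can be sketched as follows. This is only an illustration: the `Actuator` type, the `supervise` function and the idea of a fixed permitted current window are assumptions, not a method taken from [11] or [12].

```python
from dataclasses import dataclass


@dataclass
class Actuator:
    """Hypothetical actuator with a power switch controlled by the supervisor."""
    name: str
    powered: bool = True


def supervise(actuator: Actuator, measured_current_a: float,
              min_a: float, max_a: float) -> bool:
    """Switch the actuator off if its current leaves the window [min_a, max_a].

    The switch-off latches: once power is removed, a later in-range
    measurement does not restore it. Returns True while still powered.
    """
    if not (min_a <= measured_current_a <= max_a):
        actuator.powered = False
    return actuator.powered
```

In a duplex configuration each unit would get its own supervisor, so that switching off one unit leaves the other in operation.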

A prototype of an actuator is demonstrated in [10]. Typical electromagnetic faults, such as a winding open circuit or a winding short circuit, may occur within an actuator. To develop a single drive that can continue to operate with any one of these faults, the most successful design approach was found to be a multiple-phase drive in which each phase may be regarded as a single module. The operation of any one module must have minimal impact upon the others, so that if that module fails the others can continue to operate unaffected. When sensor and actuator failures occur at the same time, their mutual effects on the residuals make fault isolation difficult [14]. A hexadecimal decision table that relates all possible failure patterns to the residual code has been proposed in [14], and it achieves detection and isolation of multiple sensor and actuator failures in automotive engines. Simulation and experimental results indicate that the proposed diagnostic system not only can be applied to cases where all failures occur in the same sector, but is also appropriate for isolating multiple failures occurring simultaneously in sensors and actuators.

7. Fault tolerant Automotive Communication Systems

The specific requirements of the different car domains have led to the development of a large number of automotive networks such as LIN, J1850, CAN, TTP/C, FlexRay, Media Oriented Systems Transport (MOST), IDB-1394, etc. One of the important requirements of an automotive communication system is fault tolerance [16]. Fault-tolerant (typically safety-critical) communication systems are built to tolerate defective circuits, line failures, etc., and are constructed using redundant hardware and software architectures.

7.1. Event triggered vs. Time triggered Systems

There are two main paradigms for communications in automotive systems [15]: time triggered and event triggered. Event triggered means that messages are transmitted to signal the occurrence of significant events (e.g., a door has been closed). In this case, the system is able to take into account, as quickly as possible, any asynchronous event such as an alarm. Event-triggered communication is very efficient in terms of bandwidth usage since only necessary messages are transmitted. In time-triggered systems, frames are transmitted at predetermined points in time, which is well suited to the periodic transmission of messages required in distributed control loops. Each frame is scheduled for transmission in one predefined interval of time, usually termed a slot, and the schedule repeats itself indefinitely. As the frame scheduling is statically defined, the temporal behavior is fully predictable; thus, it is easy to check whether the timing constraints expressed on data exchanges are met.
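To illustrate why a static schedule makes timing easy to check, the sketch below models a repeating slot table. The frame names, the slot length and both function names are hypothetical; no particular protocol is implied.

```python
from typing import List


def slot_owner(schedule: List[str], slot_length_us: int, time_us: int) -> str:
    """Return the frame that owns the bus at time_us in a repeating schedule."""
    slot_index = (time_us // slot_length_us) % len(schedule)
    return schedule[slot_index]


def max_gap_us(schedule: List[str], slot_length_us: int, frame: str) -> int:
    """Longest time between the starts of two consecutive slots of `frame`."""
    positions = [i for i, name in enumerate(schedule) if name == frame]
    if not positions:
        raise ValueError(f"{frame!r} is not in the schedule")
    n = len(schedule)
    gaps = []
    for k, pos in enumerate(positions):
        nxt = positions[(k + 1) % len(positions)]
        # A frame with a single slot repeats only after the full cycle.
        gaps.append((nxt - pos) % n or n)
    return max(gaps) * slot_length_us


SCHEDULE = ["steer", "brake", "steer", "door"]  # hypothetical frame names
SLOT_US = 250                                   # hypothetical slot length
```

A timing constraint such as "the steering frame must be sent at least every 600 microseconds" then reduces to the offline comparison `max_gap_us(SCHEDULE, SLOT_US, "steer") <= 600`.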

7.2. Controller Area Network (CAN)

CAN (Controller Area Network) is the most widely used in-vehicle network. It was designed by Bosch in the mid-1980s for multiplexing communication between ECUs in vehicles, and thus for reducing the overall wire harness (the length of wires and the number of dedicated wires). CAN on a twisted pair of copper wires became an ISO standard in 1994 and is now a de facto standard in Europe for data transmission in automotive applications, due to its low cost, its robustness and its bounded communication delays.

CAN has several mechanisms for error detection [17]. For instance, it is checked that the CRC transmitted in the frame is identical to the CRC computed at the receiver end, that the structure of the frame is valid and that no bit-stuffing error occurred. Each station that detects an error sends an "error flag", a particular type of frame composed of 6 consecutive dominant bits that makes all the stations on the bus aware of the transmission error. CAN also has fault-confinement mechanisms aimed at identifying permanent failures due to hardware malfunctions at the level of the micro-controller, communication controller or physical layer. The scheme is based on error counters that are increased and decreased according to particular events. The main drawback is that a node has to diagnose itself, which can lead to the non-detection of some critical errors. Without additional fault-tolerance facilities, CAN is not suited for safety-critical applications such as future X-by-Wire systems [17]. For instance, a single node can perturb the functioning of the whole network by sending messages outside its specification (i.e., the length and period of the frames). A framework that provides selective fault tolerance for messages with various fault-tolerance requirements scheduled on CAN is proposed in [19]. The set of messages is analyzed off-line, and scheduling attributes are provided that ensure feasible transmission of the messages, as well as retransmissions upon error occurrences, in a way that satisfies the fault-tolerance requirements.
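The error-counter scheme can be illustrated with a simplified sketch. The thresholds (error-passive above 127, bus-off when the transmit counter exceeds 255) and the basic steps (+8 on a transmit error, +1 on a receive error, -1 on success) follow the CAN specification, but the specification's exceptions to these rules are omitted here, and the class and method names are hypothetical.

```python
class CanErrorCounters:
    """Simplified model of CAN fault confinement for a single node."""

    def __init__(self) -> None:
        self.tec = 0  # transmit error counter
        self.rec = 0  # receive error counter

    def on_transmit_error(self) -> None:
        self.tec += 8

    def on_receive_error(self) -> None:
        self.rec += 1

    def on_transmit_success(self) -> None:
        self.tec = max(0, self.tec - 1)

    def on_receive_success(self) -> None:
        # Simplification: the special handling of counters above 127 is omitted.
        self.rec = max(0, self.rec - 1)

    @property
    def state(self) -> str:
        if self.tec > 255:
            return "bus-off"        # node must stop taking part in bus traffic
        if self.tec > 127 or self.rec > 127:
            return "error-passive"  # node may no longer send active error flags
        return "error-active"
```

Because each node updates only its own counters, the model also shows the drawback noted above: a node whose controller misbehaves without counting its own errors is never confined.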

7.3. Time-Triggered CAN (TTCAN)

TTCAN uses the CAN standard but, in addition, requires that the controllers be able to disable automatic retransmission of frames upon transmission errors and to provide the upper layers with the point in time at which the first bit of a frame was sent or received. The key idea is to offer a flexible time-triggered/event-triggered protocol. TTCAN defines a basic cycle as the concatenation of one or several time-triggered (exclusive) windows and one event-triggered (arbitrating) window. Although TTCAN is built on a well-mastered and low-cost technology, CAN, it does not provide important dependability services such as a bus guardian, a membership service and reliable acknowledgment [17]. It does not provide the same level of fault tolerance as TTP and FlexRay, which are the other two candidates for X-by-wire [16]. Strong points of TTCAN are its support for coexisting event-triggered and time-triggered traffic and the fact that it is standardized by ISO. It is also built on top of standard CAN, which allows for an easy transition from CAN to TTCAN.



7.4. FlexRay

The FlexRay network is very flexible with regard to topology and transmission support redundancy. It can be configured as a bus, a star or a multistar. It is not mandatory for each station to possess replicated channels or a bus guardian, even though this should be the case for critical functions such as Steer-by-Wire [15]. FlexRay also provides fault tolerance by distributed time-triggered synchronization (clock synchronization) and by error containment on the physical layer through an independent bus guardian. FlexRay allows both time-triggered and event-triggered communication by means of a communication cycle in which a time-triggered (static) window and an event-triggered (dynamic) window are concatenated. The time-triggered window uses TDMA like TTP but, unlike TTP, a given node may be able to access the bus multiple times before all remaining nodes access it. The event-triggered window uses a technique called Flexible TDMA (FTDMA) to provide event-triggered behavior without collisions. According to the FlexRay specification [21], a frame contains a 24-bit CRC checksum to ensure the integrity of the frame transmission. The probability of undetected network errors is less than 6 × 10^-8. Adequately addressing fault tolerance was one of the key aspects that needed to be considered during the design of FlexRay. To allow a single communication system to support the diverse needs of automotive applications across different application domains, the consortium decided to introduce a concept of scalable fault tolerance. Scalable fault tolerance aims at allowing FlexRay to be used economically in distributed non-fault-tolerant systems as well as in distributed fault-tolerant systems.

In addition, FlexRay can be deployed using optional local or remote channel guardians that protect the communication channels from transmission faults that violate the TDMA scheme. The clock synchronization algorithm supports fault-tolerant as well as non-fault-tolerant synchronization. For fault-tolerant synchronization, the algorithm considers the transient/permanent fault class as well as the symmetric/asymmetric fault class [22]. In this protocol, the synchronization of the global time happens at the macrotick level, with the use of a cluster-wide clock synchronization algorithm. Unlike a master-slave synchronization algorithm, this clock synchronization algorithm continues to operate even in the event of an ECU failure in the system. Table 1 summarizes the key differences between the automotive protocols discussed so far.
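A rule commonly described for this kind of distributed, fault-tolerant clock correction is the fault-tolerant midpoint: each node measures its clock deviation against the other nodes, discards the k largest and k smallest measurements, and averages the two extremes that remain. The sketch below is an illustration under stated assumptions: the function name is hypothetical, and the discard counts (0 for up to two values, 1 for three to seven, 2 for more than seven) are given as understood from the FlexRay specification and should be checked against [21].

```python
from typing import Sequence


def fault_tolerant_midpoint(deviations: Sequence[float]) -> float:
    """Return a clock correction term from measured deviations.

    The k largest and k smallest values are dropped so that up to k
    faulty clocks cannot pull the result outside the range spanned by
    the correct ones.
    """
    n = len(deviations)
    if n == 0:
        raise ValueError("at least one measured deviation is required")
    k = 0 if n <= 2 else 1 if n <= 7 else 2  # assumed discard counts
    kept = sorted(deviations)[k:n - k]
    return (kept[0] + kept[-1]) / 2
```

For the measurements `[-2, 0, 1, 3, 100]` the outlier 100 and the value -2 are discarded, so a single faulty clock does not affect the result. No node acts as master, which is why the loss of one ECU does not stop synchronization.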

7.5. Recent Work

A simulation study of fault-tolerant sensor networks for on-board car control is presented in [18]. The on-board communication and control networks are built using Gigabit Ethernet. Sensors are smart and are the sources of traffic; actuators are smart and are the sinks of traffic.

Table 1: Summary of Automotive Protocols

The controller is a personal computer. The sources of real-time traffic (sensors) are tripled while the number of sink nodes (actuators) is not increased. This increase in the number of sensors is made to test the possibility of building triple-modular redundancy (TMR) at the sensor level for fault tolerance. The disadvantage of TMR is cost, since three sensors are required to produce an output that could be generated by just one sensor; its advantage is an increase in reliability. The outputs have to go through a voter. Voting can also be done in software, which is the approach used in this study. The controller, after reading the outputs of the three sensors, executes a routine that compares these three outputs. A major problem here is that the three outputs may not completely agree because the sensors, while identical, may not produce exactly the same output. The first solution to this problem is the mid-value select technique. The second is to ignore the least significant bits of the data; the number of significant bits that have to agree depends on the application. With TMR, the data from the three copies of the sensor is compared and voted upon. As long as there is only one failed sensor, the controller will know which of the three packets to discard and the system will remain operational. When a second sensor fails, the entire system will fail.

A methodology for interconnecting the automotive bus networks in a fault-tolerant way is proposed in [20].
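The two software voting techniques described for the TMR sensors can be sketched as follows. This is an illustrative sketch, not the routine used in [18]: the function names are hypothetical, and the second function assumes the readings are non-negative integers so that low-order bits can be dropped with a shift.

```python
from collections import Counter
from typing import Optional, Sequence


def mid_value_select(a: float, b: float, c: float) -> float:
    """Return the median of three readings, which masks one outlier."""
    return sorted((a, b, c))[1]


def masked_majority(values: Sequence[int], ignored_lsbs: int) -> Optional[int]:
    """Majority vote after ignoring the lowest ignored_lsbs bits.

    Returns the first original reading that belongs to the winning
    group, or None when no two readings agree.
    """
    masked = [v >> ignored_lsbs for v in values]
    winner, count = Counter(masked).most_common(1)[0]
    if count < 2:
        return None
    for value, key in zip(values, masked):
        if key == winner:
            return value
    return None  # not reached: the winner always has a matching reading
```

Mid-value selection always produces an output, whereas the bit-masking vote can report disagreement. Note that two readings that differ by one count can still fall on opposite sides of a truncation boundary, so the number of ignored bits has to be chosen with the sensor tolerance in mind.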

USAGE             CAN     TTCAN   FlexRay
Chassis           YES     YES     NO
Airbags           YES     NO      NO
Powertrain        YES     YES     SOME
X-by-wire         SOME    YES     YES
Multimedia        NO      NO      NO
Telematics        NO      NO      NO
Diagnostics       YES     SOME    SOME

REQUIREMENTS      CAN     TTCAN   FlexRay
Fault tolerance   SOME    SOME    YES
Determinism       YES     YES     YES
Bandwidth         SOME    SOME    YES
Flexibility       YES     YES     YES
Security          NO      NO      NO

8. Conclusion

In this paper, we have provided a survey of fault-tolerant design techniques and methodologies used in the automotive industry. X-by-wire systems, which are discussed at length, are expected to be integrated into most automobiles in the future. Automobile manufacturers such as Toyota, Nissan and BMW have already introduced brake-by-wire technology in some of their recent models. As the amount of software that goes into a modern car increases steadily with the introduction of navigation systems and instrument clusters, software fault tolerance techniques will become a mandatory requirement in the automotive industry. FlexRay is expected to be the network of choice in future X-by-wire designs due to its high bandwidth. We feel that a key challenge in the area of automotive software will be handling the huge amount of data that is to be processed (such as map data in a navigation system) while still providing fault tolerance without hindering the user experience. As modern cars slowly transition to functioning as information hubs by providing connectivity to laptops, PDAs and cell phones, another challenge will be to introduce fault-tolerant services in protocols such as Bluetooth, ZigBee and MOST. This represents a research opportunity for engineers from all backgrounds.

References

[1] C. Lu, Jean-Charles Fabre, Marc-Olivier Killijian, An approach for improving Fault-Tolerance in Automotive Modular Embedded Software, Proc. of the 17th International Conference on Real-Time and Network Systems, 2009.
[2] Xi Chen, Requirements and concepts for future automotive electronic architectures from the view of integrated safety, PhD Thesis, Universitätsverlag Karlsruhe, 2008.
[3] H. Gustavsson, J. Sterner, An industrial case study of Design Methodology and Decision Making for Automotive Electronics, Proc. of the ASME International Design Engineering Technical Conferences & Computers and Information in Engineering Conference, 2008.
[4] M. Broy, I.H. Kruger, A. Pretschner, C. Salzmann, Engineering Automotive Software, Proc. of IEEE, Vol. 95, Issue 2, pp. 356-373, 2007.
[5] M. Broy, Automotive Software and Systems Engineering, Proc. of the 2nd ACM/IEEE Conference on Formal Methods and Models for Co-Design, 2005.
[6] J. Park, S. Kim, W. Yoo, S. Hong, Designing Real-Time and Fault-Tolerant Middleware for Automotive Software, Proc. of International Joint Conference SICE-ICASE, pp. 4409-4413, 2006.
[7] D. Palsetia, S. Pieper, Fault Tolerance in Automotive X-by-Wire, Project Report, Department of Electrical and Computer Engineering, UW-Madison, 2005.
[8] E.G. Leaphart, B.J. Czerny, J.G. D'Ambrosio, B.T. Murray, C.L. Denlinger, D. Littlejohn, Survey of Software Failsafe Techniques for Safety-Critical Automotive Applications, SAE 2005-01-0779, SAE World Congress, 2005.
[9] A. Pretschner, M. Broy, I.H. Kruger, T. Stauner, Software Engineering for Automotive Systems: A Roadmap, Proc. of Future of Software Engineering, pp. 55-71, 2007.
[10] E. Digler, R. Karrelmeyer, B. Straube, Fault Tolerant Mechatronics, Proc. of 10th International On-Line Testing Symposium, 2004.
[11] R. Isermann, R. Schwarz, S. Stolzl, Fault-Tolerant Drive-by-wire Systems, IEEE Control Systems Magazine, Vol. 22, Issue 5, pp. 64-81, 2002.
[12] A. Manzone, A. Pincetti, D. De Costantini, Fault Tolerant Automotive Systems: An Overview, Proc. of the 7th International On-Line Testing Workshop, pp. 117-121, 2001.
[13] D. Capriglione, C. Liguori, C. Pianese, A. Pietrosanto, On-Line Sensor Fault Detection, Isolation, and Accommodation in Automotive Engines, IEEE Transactions on Instrumentation and Measurement, Vol. 52, Issue 4, pp. 182-189, 2003.
[14] P. Hsu, K. Lin, L. Shen, Diagnosis of Multiple Sensor and Actuator Failures in Automotive Engines, IEEE Transactions on Vehicular Technology, Vol. 44, Issue 4, pp. 779-789, 1995.
[15] N. Navet, Y. Song, F. Simonot-Lion, C. Wilwert, Trends in automotive communication systems, Proc. of IEEE, Vol. 93, Issue 6, pp. 1204-1223, 2005.
[16] T. Nolte, H. Hansson, L.L. Bello, Automotive Communications - Past, Current and Future, Proc. of 10th IEEE Conference on Emerging Technologies and Factory Automation, Vol. 1, pp. 985-992, 2005.
[17] N. Navet, F. Simonot-Lion, A Review of Embedded Automotive Protocols, Technical Report, Nancy Université, 2008.
[18] R.M. Daoud, H.H. Amer, H.M. Elsayed, Y. Sallez, Fault-Tolerant Ethernet-Based Vehicle On-Board Networks, Proc. of 32nd Annual Conference on Industrial Electronics, pp. 4662-4665, 2006.
[19] H. Aysan, A. Thekkilakattil, R. Dobrin, S. Punnekkat, Fault Tolerant Scheduling on Controller Area Network (CAN), Proc. of Emerging Technologies and Factory Automation Conference, pp. 1-8, 2010.
[20] H. Kimm, Ho-Sang Ham, Integrated Fault Tolerant System for Automotive Bus Networks, Proc. of 2nd International Conference on Computer Engineering and Applications, pp. 486-490, 2010.
[21] FlexRay Consortium. (2004, June) FlexRay Communication System, Protocol Specification, Version 2.0. [Online]. Available: http://www.flexray.com
[22] C. Temple, Networking the FlexRay Way - An overview of the FlexRay Communications System, Technical Report, Freescale Semiconductor.
[23] R. Garbenfeldt, X-by-wire: Driving Your Car and the Semiconductor Industry, Technical Report, 2005.
[24] F. Seidel, X-by-wire, Technical Report, Chemnitz University of Technology, 2009.
[25] B. Selic, Fault tolerance techniques for Distributed Systems. [Online]. Available: http://www.ibm.com/developerworks/rational/library/114.html#N101B5
[26] C. Wilwert, N. Navet, Y. Song, F. Simonot-Lion, Design of automotive X-by-wire systems, Technical Report.
[27] L. He, Z. Yu, C. Zong, H. Zhao, The Dual-core Fault-tolerant control for Electronic Control Unit of Steer-By-wire System, Proc. of International Conference on Computer, Mechatronics, Control and Electronic Engineering, pp. 436-439, 2010.
[28] E. Touloupis, J.A. Flint, V.A. Chouliaras, A Fault-Tolerant Architecture For Automotive Applications, Technical Report, Loughborough University.
[29] L. Lamport, R. Shostak, M. Pease, The Byzantine Generals Problem, ACM Transactions on Programming Languages and Systems, Vol. 4, No. 3, pp. 382-401, 1982.
[30] IEC 61508-1, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems - Part 1: General requirements, IEC/SC65A, 1998.
[31] D. Jhalani, S. Dhir, Survey of Fault Tolerant Techniques in Automotives, University of Wisconsin Madison.
[32] H. Curtis, R. France, Time Triggered Protocol (TTP/C): A Safety-Critical System Protocol, Literature Survey, University of Texas-Austin, 1999.
