Fault Tolerance in Automotive Systems_report


  • 7/24/2019 Fault Tolerance in Automotive Systems_report


    Fault Tolerance in Automotive Systems

    Adithya Hrudhayan Krishnamurthy, Ramkumar Ravikumar

    Department of Electrical and Computer Engineering

    University of Wisconsin, Madison

    {akrishnamur3, ravikumar}@wisc.edu

    Abstract - The design of fault-tolerant electronics has become a standard requirement in the automotive sector. These systems increase overall vehicle and passenger safety by liberating the driver from routine tasks and by assisting the driver during critical situations. In this paper we present extensive information about some of the commonly used fault-tolerant design techniques in the automotive domain. We start by analyzing X-by-wire systems, which are fault-tolerant distributed systems that are fail-operational and can maintain a reliable state at all times. A case study on Steer-by-wire is included. Fault tolerance techniques used in the design of automotive software, and how they help improve the overall reliability and dependability of the system, are investigated. Common techniques used in the design of fail-safe sensors and actuators are presented. We conclude the paper with a section on the design of automotive communication systems and protocols and their ability to ensure reliable communication between the various ECUs in the vehicle.

    1. Introduction

    Advancements in the field of automotive electronics

    have helped in realizing the potential of sophisticated

    vehicular control systems. In addition to liberating the driver from routine tasks, such systems assist the driver during critical situations, thereby enhancing

    vehicular safety and performance. Among these

    systems, X-by-Wire systems (where driving, steering and braking are electronically controlled) have

    provided feasible electronic and electromechanical

    solutions resulting in enhanced fault tolerance and

    reliability. Traditional mechanical and hydraulic

    systems employed in automotive and aviation

    systems are being replaced by electronic control

    systems such as X-by-Wire systems. A current

    premium car, for instance, implements about 270 functions a user interacts with, deployed over about

    70 embedded platforms. Altogether, the software

    amounts to about 100 MB of binary code. Ensuring

    fault tolerance in automotive software is an active area of research. Safety-critical systems such as X-by-wire systems and most ECUs typically use a lot of

    sensors for performing their functions. Hence sensors

    and actuators, which form the backbone of most

    commonly used electronic systems, need to be fault

    tolerant as well. Major automotive subsystems such

    as chassis, air-bag, powertrain, body and comfort

    electronics, diagnostics, x-by-wire, multimedia and

    infotainment, and wireless rely on automotive

    communication systems. Fault tolerant communication systems are built so they are tolerant

    to defective circuits, line failures etc., and

    constructed using redundant hard- and software

    architectures to ensure reliable communication

    between different sub-systems in a vehicle. In this

    paper we present exhaustive information about fault

    tolerance techniques used in the automotive industry.

    2. X-by-Wire in Automotive Systems

    Following the aviation industry, the automotive industry took to X-by-Wire systems, called Drive-By-Wire. In the automotive environment, X denotes the commanded action such as accelerating, braking or steering. Drive-by-Wire systems sense

    driver requests and translate them into optimum steer,

    brake, and acceleration manoeuvres. X-by-Wire

    systems can be classified into Brake-by-Wire,

    Throttle-by-Wire and Steer-by-Wire systems

    depending upon the commanded action. The

    following sections discuss Brake-by-Wire and Throttle-by-Wire briefly. A later section is devoted to a detailed study of Steer-by-Wire systems.

    2.1. Brake-by-Wire

    Brake-by-Wire (BBW) systems are realized through electro-mechanical actuators and communication networks instead of conventional hydraulic devices. They offer enhanced safety and cut the costs associated with the manufacture and maintenance of mechanical brakes and brake fluids. They also eliminate the environmental concerns caused by hydraulic systems. There are two ways to realize a BBW system. In the first, the system is based on the traditional hydraulic brake system, and the by-wire function is realized through hydraulic pumps and additional electrically controlled valves; this is called an Electric Hydraulic Brake (EHB) [24]. In an EHB system, a hydraulic backup can be realized with the help of valves: once a fault is detected, a direct hydraulic brake circuit is closed. In the second, the system is based on electro-mechanical actuators and is called an Electric Mechanical Brake (EMB). In an EMB system the brake force and brake control are realized by electric components [11]. Since a hydraulic backup cannot be realized, the system must be extremely


    reliable. BBW systems enable simple integration of

    vehicle traction and stability control.

    2.2. Throttle-by-Wire

    Conventional throttle systems consist of a cable running from the gas pedal into the throttle body. This cable slides within a housing as it winds its way around various components. Such a system is

    relatively bulky and prone to wear. Automotive

    manufacturers have implemented a new means of

    throttle control known as Throttle-by-Wire. A Throttle-by-Wire system uses a sensor that reports the pedal

    position. The data acquired by the sensor is sent to

    the Engine Control Module (ECM) that determines

    the parameters to change. The ECM coordinates

    components such as Anti-lock Braking System

    (ABS), gear selection, fuel and air intake, and traction control. This embedded intelligence results in increased fuel efficiency, reduced emissions, improved performance, and reduced frictional losses. Throttle-By-Wire allows the engine computer to integrate torque management with traction control

    and stability control.

    3. Constraints for the Design of X-by-Wire Systems

    3.1. Power Constraint

    The challenge faced during the implementation of X-by-Wire systems in vehicles has been to accommodate such systems within the 14 V power nets of automobiles. The number of electrical components in cars has steadily increased in recent years. To guarantee the proper functioning of all

    electrical components, a stable voltage supply is necessary. The automotive industry considered

    migrating to 42 Volt systems. With 42 Volts in the

    system, vehicles could use thinner-gauge wires and

    smaller motors because higher voltage would mean

    reduced amperage (which dictates the size of the

    wire).
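    The effect of the supply voltage on current can be checked with a short calculation. The following is a minimal sketch, assuming a hypothetical 1 kW actuator load (the load value and the function name are illustrative and do not come from the paper); it uses only the relation I = P / V.

```python
def current_amps(power_watts: float, voltage_volts: float) -> float:
    """Current drawn by a load of the given power at the given voltage (I = P / V)."""
    if voltage_volts <= 0:
        raise ValueError("voltage must be positive")
    return power_watts / voltage_volts


# Hypothetical load, chosen only for illustration.
LOAD_WATTS = 1000.0

i_14 = current_amps(LOAD_WATTS, 14.0)  # current on a 14 V power net
i_42 = current_amps(LOAD_WATTS, 42.0)  # current on a 42 V power net
ratio = i_42 / i_14                    # 42 V needs one third of the current

# Resistive loss in the same wire scales with the square of the current.
loss_ratio = ratio ** 2
```

    For the same delivered power, tripling the voltage cuts the current to one third and the resistive loss in a given wire to one ninth, which is why thinner-gauge wiring becomes feasible.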

    3.2. Real Time Constraint

    X-by-Wire systems are intrinsically real-time distributed systems. They implement complex multivariable control laws and deliver real-time

    information to intelligent devices that are physically

    distant (for example, the four wheels). They have

    stringent time constraints, such as a sampling period

    of only a few milliseconds [26]. Occasional absence

    of samples or out-of-bound delays at the controller or

    actuator level, for instance, due to frame loss, does

    not necessarily lead to vehicle instability, but

    degrades steering performance (or Quality of Service). This is because most of the control laws that are used are designed with specific mechanisms that compensate for delay and/or missing sample data.

    End-to-end response times, such as the time between

    a request from the driver and the response of the

    physical system, must be bounded, typically lower

    than a few tens of milliseconds. An excessive end-to-end response time of a control loop may not only

    induce performance degradation but also cause the

    instability of the vehicle.
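    One simple way to tolerate an occasional missing sample is to hold the last valid value for a bounded number of periods. The sketch below illustrates that idea only; the paper does not describe which compensation mechanism the control laws use, and the class name and the `max_missed` parameter are assumptions made for illustration.

```python
from typing import Optional, Tuple


class HoldLastSample:
    """Reuse the last valid sample while at most `max_missed` samples are lost in a row."""

    def __init__(self, max_missed: int, initial: float = 0.0) -> None:
        self.max_missed = max_missed
        self.last = initial
        self.missed = 0

    def update(self, sample: Optional[float]) -> Tuple[float, bool]:
        """Return (value to feed the control law, True while the loss is still tolerable)."""
        if sample is not None:
            self.last = sample
            self.missed = 0
        else:
            # Frame lost or late: keep the previous value and count the miss.
            self.missed += 1
        return self.last, self.missed <= self.max_missed
```

    A caller would treat a `False` flag as the point where degraded performance turns into a fault that needs handling, for example by switching to a redundant channel.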

    3.3. Dependability Constraints

    For a critical X-by-Wire system, the following must be ensured: a system failure must not lead to a state in which human life, economics, or the environment is endangered, and a single failure of one component must not lead to the failure of the whole X-by-Wire system [26]. Also, the system must be able to tolerate at least one major critical fault without loss of functionality for long enough to reach a safe parking area. X-by-Wire systems should also offer the same availability and maintainability as their mechanical/hydraulic counterparts.

    4. Case Study: Steer-by-Wire

    Traditionally, vehicle wheels have been turned by a direct mechanical linkage between the steering wheel, steering gears and the actual wheels. In such a system, the driver turns the steering wheel to request

    the steering gears to turn the vehicle wheels. The

    feedback of the torque encountered by the steering

    system as the wheels are turned is provided to the

    driver through mechanical linkages. This torque feedback is critical as it provides the driver with a sense of the road conditions, such as the traction experienced by the wheels with the road surface.

    Steer-by-Wire systems eliminate mechanical

    components between the steering wheel and turning

    wheels. The system aims to control the wheel direction according to the driver's request and provide

    a mechanical-like force feedback to the hand wheel.

    The following sections deal with the functionality of

    the architecture, the real-time constraints imposed on

    the system, and the protocol used by the nodes to

    communicate between each other.

    4.1. Functional Description and Operational Architecture of Steer-by-Wire System

    The two critical services provided by the Steer-by-Wire system involve the front axle actuation and the

    hand wheel force feedback [26].

    Front Axle Control - This function computes the orders that are given to the motor of the front axle, based on the state of this front axle and the commands given by the driver through the hand wheel. The driver's requests are translated depending on the hand wheel angle, torque and speed.

    Hand Wheel Force Feedback - This function computes the orders provided to the hand wheel motor based on the speed of the vehicle, front axle position and the front tie rod force.

    An operational architecture for the functions

    mentioned above is realized as follows. The operational


    architecture includes four ECUs (micro-controllers)

    named, respectively, HW ECU1 (Hand Wheel

    ECU1), HW ECU2 (Hand Wheel ECU2), FAA

    ECU1 (Front Axle Actuator ECU1), and FAA ECU2

    (Front Axle Actuator ECU2). Each node is connected to the two TDMA-based communication channels (BUS1 and BUS2). Three sensors, named as1, as2 and as3, placed near the hand wheel measure the requests of the driver in a similar way, the latter being translated into a 3-tuple. Three other

    sensors, named rps1, rps2 and rps3, are dedicated to

    the measurement of the front axle position. Finally,

    two motors (FAA Motor1 and FAA Motor2),

    configured in active redundancy, act on the front axle

    while two other motors (HW Motor1 and HW

    Motor2) realize the force feedback control on the hand wheel. Sensors as1, as2 and as3 (respectively, sensors rps1, rps2 and rps3) are connected by point-to-point links, both to HW ECU1 and HW ECU2

    (respectively, FAA ECU1 and FAA ECU2).

    Implementation of the Front axle control function -

    The requests of the driver are measured by the three

    replicated sensors as1, as2 and as3 and sent to both

    HW ECU1 and HW ECU2. Each ECU performs a

    majority vote on the 3 received values and transmits

    the data on both communication channels BUS1 and BUS2. The two ECUs, FAA ECU1 and FAA ECU2, placed behind the front axle, consume this data, as well as the last wheel position, in order to elaborate

    the commands that are to be applied to FAA Motor 1

    and 2.
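    The vote performed by each Hand Wheel ECU on its three sensor readings can be sketched as follows. This is a minimal illustration, not the voter described in [26]: it assumes analog readings that agree only within a tolerance, and the function name and `tolerance` parameter are hypothetical.

```python
from typing import Optional, Sequence


def vote_2oo3(readings: Sequence[float], tolerance: float) -> Optional[float]:
    """Two-out-of-three vote on analog readings.

    Returns the middle reading when it agrees with at least one neighbour
    within `tolerance`; returns None when no two readings agree.
    """
    if len(readings) != 3:
        raise ValueError("exactly three readings are required")
    low, mid, high = sorted(readings)
    if mid - low <= tolerance or high - mid <= tolerance:
        # The middle value always lies between two readings, so a single
        # faulty sensor cannot drag the result outside the healthy range.
        return mid
    return None
```

    With readings from as1, as2 and as3, a single arbitrarily wrong sensor is outvoted by the two that agree; if all three disagree, the caller has to treat the measurement as unavailable.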

    Implementation of the Force feedback control function - In a way similar to the previous function,

    measurements taken by rps1, rps2 and rps3 are

    transmitted both to FAA ECU1 and FAA ECU2.

    Each of these ECUs elaborates information

    transmitted on the network. The consumers of this

    information are both HW ECU1 and HW ECU2, which compute the command transmitted to HW Motor 1 and 2.

    4.2. Fault Classification and Redundancy

    Fault Classification - A Byzantine fault is a fault whose effect can be perceived differently by different observers. A Coherent fault is one whose effect is seen the same way by all observers. A Transient fault is a fault whose duration is such that the system does not reach an unsafe state. A Permanent fault is a non-transient one. The property of fail silence, assumed for some components, leads to another class of faults. A component is said to be fail silent if, each time it is not silent, we can conclude that it is functioning properly.

    ECU redundancy - Two functions need to be

    implemented in the ECUs: Front Axle Control and Force

    Feedback Control. To avoid costly and numerous

    wires, ECUs have to be placed close to the sensors,

    and communication between ECUs has to be

    multiplexed.

    Fig 1: Steer-by-Wire operational architecture

    Dependability analyses are generally based on a strong hypothesis assuming that, in the whole system, n simultaneous component failures can never occur for any set of redundant components. Lamport states that 3n+1 redundant components are necessary to tolerate n Byzantine faults [29]. In order to tolerate n coherent faults, it is sufficient to have 2n+1

    redundant components. According to the rule given

    by Lamport, the minimum number of redundant

    Hand Wheel ECUs (respectively, Front Axle ECUs)

    should be 4. This solution is mainly used in the

    aeronautic domain. A classical solution in cost-prohibitive situations is therefore to use Fail-Silent ECUs.

    In this case, only two Hand Wheel ECUs

    (respectively, Front Axle ECUs) are necessary.
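    The replica counts quoted above can be written down directly. In this sketch the 3n+1 and 2n+1 rules come from the text; the n+1 rule for fail-silent components is a generalization assumed here from the statement that two fail-silent ECUs suffice for one fault. The function names are illustrative.

```python
def replicas_for_byzantine(n_faults: int) -> int:
    """Redundant components needed to tolerate n Byzantine faults (3n + 1)."""
    return 3 * n_faults + 1


def replicas_for_coherent(n_faults: int) -> int:
    """Redundant components sufficient to tolerate n coherent faults (2n + 1)."""
    return 2 * n_faults + 1


def replicas_for_fail_silent(n_faults: int) -> int:
    """Assumed generalization: n + 1 fail-silent components tolerate n faults."""
    return n_faults + 1


# Tolerating a single fault (n = 1) in each fault model.
single_fault = (
    replicas_for_byzantine(1),
    replicas_for_coherent(1),
    replicas_for_fail_silent(1),
)
```

    For one fault this gives 4, 3 and 2 components respectively, which matches the four ECUs required by Lamport's rule and the two fail-silent ECUs used in the architecture.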

    Hand Wheel Sensor and Actuator redundancy - A

    Hand Wheel sensor produces information for two

    Hand Wheel ECUs. Three Hand Wheel sensors are necessary for ensuring that each Hand Wheel ECU,

    assumed to provide a voting algorithm, is able to

    tolerate one Byzantine fault (and subsequently one

    coherent fault or one fail-silent sensor). A single

    actuator can take charge of piloting the Front Axle, and it is assumed that an actuator can never wrongly apply an order received from a Front Axle ECU. Considering the fail silence property of a Front Axle ECU, only 2 pairs (Front Axle ECU, actuator) are

    necessary for the tolerance of, at most, one fault.


    If the chosen fault-tolerance strategy is failure

    recovery, redundant ECUs will work only in the case

    of the primary ECU failing. Failure detection must be

    quick and reliable. Otherwise, if the strategy is failure

    compensation, redundant ECUs will work in parallel. Because of the stringent real-time constraint, our architecture must provide failure compensation. Redundancy of identical ECUs does not protect the

    architecture from common mode failures: the

    hardware of redundant ECUs should be furnished by

    different suppliers and their software realized by

    different teams.

    4.3. Communication Protocol (TTP/C)

    TTP/C is a protocol based on the TTP (Time-Triggered Protocol) and implemented so that it meets the SAE requirements for a class C automotive protocol. Class C protocols are suitable for high-speed, single-failure-operational, safety-critical applications. A principal reason that TTP/C is the first protocol to qualify as Class C is that the previous protocols are all event triggered. Event-triggered systems are susceptible to several serious failure modes such as the babbling idiot failure.

    TTP/C is a Time Division Multiple Access (TDMA)

    protocol in which periodic time slots are assigned to

    individual processing nodes in a system. The braking

    unit has its own TTP/C nodes, each of which is replicated as a Fault Tolerant Unit (FTU).

    Node Architecture: System nodes in a TTP/C system

    consist of a Host, a Controller Network Interface

    (CNI), and a TTP/C communications controller. The

    Host runs the application software for the relevant system function (for instance, the control software for

    the braking system in an automobile). The CNI

    stands between the Host and the TTP/C controller,

    effectively de-coupling the applications-level

    software from the network. Within the CNI is a

    Message Descriptor List, which contains information controlling bus access, and a data sharing interface which is typically implemented with dual port RAM (allowing the Host and the TTP/C controller to access

    the shared memory independently). The TTP/C

    communications controller provides the actual

    connection between the TTP/C node and the shared

    network. The controller supports the protocol with several essential services such as guaranteed transmission times with minimal latency and jitter, fault-tolerant clock synchronization, and error detection.

    Scheduling and State Messages: The system scheduling in TTP/C protocol is static. The points in time at which the various nodes in a system are authorized to transmit form the lattice points of a TTP/C action lattice. The time difference between

    two adjacent points in the action lattice represents the

    basic cycle time of the system, and sets a lower

    bound on the response time of the system.
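    Static TDMA scheduling means that the owner of the bus at any instant follows from the global time alone. The sketch below assumes a hypothetical slot length of 250 microseconds and a round of four equal slots, one per ECU of the steer-by-wire case study; neither value comes from the TTP/C specification or from the paper.

```python
# Hypothetical static schedule: one equal-length slot per node, repeated forever.
SLOT_US = 250
SCHEDULE = ("HW_ECU1", "HW_ECU2", "FAA_ECU1", "FAA_ECU2")


def round_length_us() -> int:
    """Duration of one full TDMA round."""
    return SLOT_US * len(SCHEDULE)


def slot_owner(global_time_us: int) -> str:
    """Node allowed to transmit at the given global time."""
    slot_index = (global_time_us // SLOT_US) % len(SCHEDULE)
    return SCHEDULE[slot_index]
```

    Because every node evaluates the same table against the same global time, no arbitration messages are needed, and a node's worst-case wait for its next slot is bounded by the round length.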

    Fig 2: Communication Subsystem with FTUs

    Under TTP/C, each node generates a state message in each TDMA round, which is posted to the CNI by the TTP/C host for transmission over the system network. Nodes receiving messages do not respond with a receipt acknowledgement.

    Clock Synchronization: A commonly maintained

    keeper of global time is thus critical to the proper

    operation of the protocol. Within a cluster of TTP/C

    nodes, all nodes are aware of which node has access

    to the bus during a specified time slot, given the a

    priori scheduling allocation. By comparing the time when messages are received from other nodes (TTP/C is a broadcast protocol, so all nodes hear all messages) with the known schedule, a node can calculate the difference between the clock of the sending node and

    its own clock.
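    The clock difference described here is the gap between the scheduled and the observed arrival time of a message. The sketch below computes that deviation and then combines several deviations with an average that discards the largest and smallest value; that combining step is an assumption added for illustration and is not described in the paper. All names are hypothetical.

```python
from typing import Sequence


def clock_deviation_us(expected_arrival_us: int, observed_arrival_us: int) -> int:
    """Difference between the sender's clock and the local clock, as seen locally."""
    return observed_arrival_us - expected_arrival_us


def fault_tolerant_average(deviations_us: Sequence[int]) -> float:
    """Average of the deviations after dropping the smallest and the largest one.

    Dropping the extremes keeps a single wildly wrong clock from skewing
    the correction applied to the local clock.
    """
    if len(deviations_us) < 3:
        raise ValueError("need at least three deviations")
    trimmed = sorted(deviations_us)[1:-1]
    return sum(trimmed) / len(trimmed)
```

    A node would apply the resulting value as a correction to its local clock once per round.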

    Composability: TTP/C supports a robust level of

    composability, the capability to carry a thoroughly tested subsystem into a larger system, and to be able

    to depend on the subsystem retaining the same

    characteristics that it demonstrated in isolation [32].

    In the auto industry, this feature potentially allows

    the rapid integration of components provided by multiple suppliers into a larger framework, without the need to perform extensive system integration tuning and testing.

    Reliability and Fault Tolerance: Several aspects of

    the TTP/C protocol, and the way it is implemented in

    real-world systems, serve to provide reliability and

    fault-tolerant behavior.

    a. Membership: The role of the membership service

    component of a real-time protocol is to inform all

    system nodes of the failure of a node with minimal

    delay [32]. Under TTP/C, a node membership field,

    maintained as a status register in the CNI, contains a so-called node membership vector. This node membership vector contains one bit for every node in a TTP/C cluster, with the bit set to True if the node in

    question is operating correctly and to False if the

    node is not operating or is flawed. The node

    membership vector is updated by checking that the

    expected messages from other nodes in the cluster are


    received, and by analyzing the cyclic-redundancy

    check (CRC) fields in the messages received.

    b. Fail-Silence: TTP/C nodes are designed to detect faults in their own operation. The principle of operation is that each node must deliver results that are correct in both the value and the time domain, or no results at all. If a node detects an abnormality in its operation, it switches itself off. At the software level, TTP/C supports a variety of techniques, including double execution, double execution with reference checks, validity checks, assertion checking, and signature checks.

    c. Bus Guardian: The Bus Guardian (BG) is a

    hardware element of the TTP/C Controller which

    serves as a portal to the system bus. The key role of

    the Bus Guardian is to enable the bus driver only

    during the transmission slot for its node, and otherwise to guarantee that the bus driver is in a disabled state. This serves to prevent babbling idiot failures [32].

    d. Replication of System Components: The TTP/C protocol supports replication of the hardware

    elements of a node, dual system buses and error

    detection [32].
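    The membership update and the bus guardian check can be sketched together. This is an illustration under stated assumptions: frames are modelled as (payload, CRC) pairs, Python's `zlib.crc32` stands in for the protocol's own CRC, and the function names are hypothetical rather than part of TTP/C.

```python
import zlib
from typing import Dict, Optional, Sequence, Tuple

Frame = Tuple[bytes, int]  # (payload, CRC transmitted with the payload)


def make_frame(payload: bytes) -> Frame:
    """Build a frame whose CRC matches its payload."""
    return payload, zlib.crc32(payload)


def update_membership(expected_nodes: Sequence[str],
                      received: Dict[str, Optional[Frame]]) -> Dict[str, bool]:
    """One membership flag per node: True only if its frame arrived with a valid CRC."""
    membership = {}
    for node in expected_nodes:
        frame = received.get(node)
        membership[node] = frame is not None and zlib.crc32(frame[0]) == frame[1]
    return membership


def bus_driver_enabled(node: str, current_slot_owner: str) -> bool:
    """Bus guardian rule: the driver is enabled only during the node's own slot."""
    return node == current_slot_owner
```

    A node whose frame is missing or corrupted is marked False in every receiver's vector within the same round, and a node that tries to transmit outside its slot is blocked by its guardian before it can disturb the bus.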

    5. Fault tolerance in Automotive Software

    Improving software fault tolerance is a common interest for aeronautics, railway and automotive software-based systems. However, the automotive context meets more stringent economic constraints and resource limitations, due to higher volume of

    vehicle production [1]. The amount of software has

    evolved from zero to tens of millions of lines of code.

    Today, more than 80% of the innovations in a car come from computer systems; software has thus

    become a major contributor to the value of

    contemporary cars [4]. A study made by Mercer

    Management Consulting and Hypovereinsbank in

    2001 claims that the total value of software in cars

    will rise from 4% to 13% by 2010 [3], and digital hardware and software are expected to account for up to 30% of the overall cost of a car [5].

    5.1. Automotive Software Classification

    Automotive software is very diverse, ranging from entertainment and office-related software to safety-critical real-time control software. It can be clustered according to application area and the associated nonfunctional requirements. The following five clusters are usually distinguished [4]: multimedia, telematics, and HMI software; body/comfort software; software for safety electronics; power train and chassis control software; and infrastructure software. Automotive software development poses great challenges to automotive manufacturers since

    an automobile is inherently distributed and subject to

    fault-tolerance and real-time requirements. Reliability and robustness are critical requirements of ECU software: fault tolerance mechanisms should handle detected faults locally, without propagation to other SW-components [2].

    5.2. Current Approaches

    A fault-tolerant architecture based on computational reflection is proposed in [1]. The reflection paradigm for fault tolerance relies on the ability of a

    system to check and to correct itself in a separate

    abstraction level. The software architecture is divided

    into two parts (functional and defense software) that

    interact together through an interface. It is assumed

    that the defense software has enough knowledge of

    the structure and expected behavior of functional

    software, to control it. The defense software detects

    errors by checking safety properties and performs

    recovery using generic instrumentation and infrastructure functionalities. As failures can impact both data and control, the failure model is structured into two parts: data flow and control flow. Critical control flow failures can disrupt control events, sequence of execution and execution time. Critical data flow failures can affect both value and timing in the system. The runtime behavior of a system is described as a sequence of scheduled entities that are triggered by events and generate triggering events. Control flow of a scheduled entity relates to the control events starting or stopping its execution. These events are produced by the environment or other entities. In parallel, data flow of a scheduled entity corresponds to the input data it consumes and

    the output data it produces during its execution.

    The Defense software is organized around logging tables and three types of services that control

    information logging, error detection and error

    recovery. Logging or tracing are mechanisms often

    needed for debugging and diagnosis issues. The

    logging strategy has to select rigorously the necessary and sufficient critical information to get at runtime, according to fault tolerance concerns. The logging architecture is organized into several bracket-tables that are updated and used at runtime.

    Each table is associated with a dedicated logging

    routine that uses preferably existing infrastructure

    services to get information. Once application-specific

    safety properties are specified, the corresponding error detection routine is developed as an executable

    assertion. An assertion is verified at runtime within a

    corresponding checking routine. When an error is

    detected, the checking routine triggers a recovery

    routine.
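    The pairing of an executable assertion with a recovery routine can be sketched as below. This is a minimal illustration of the pattern only, not the defense software of [1]; the steering-angle safety property, its limit, and all names are hypothetical.

```python
from typing import Callable, Dict, List

State = Dict[str, float]


class CheckingRoutine:
    """Pairs an executable assertion with the recovery routine it triggers."""

    def __init__(self, name: str,
                 assertion: Callable[[State], bool],
                 recovery: Callable[[State], None]) -> None:
        self.name = name
        self.assertion = assertion
        self.recovery = recovery

    def run(self, state: State, log: List[str]) -> bool:
        """Check the safety property; on violation, log it and recover."""
        if self.assertion(state):
            return True
        log.append(self.name)
        self.recovery(state)
        return False


# Hypothetical safety property: the commanded steering angle stays in range.
MAX_ANGLE_DEG = 35.0


def angle_in_range(state: State) -> bool:
    return abs(state["steer_cmd_deg"]) <= MAX_ANGLE_DEG


def clamp_angle(state: State) -> None:
    # Recovery: force the command back to the nearest valid value.
    state["steer_cmd_deg"] = max(-MAX_ANGLE_DEG,
                                 min(MAX_ANGLE_DEG, state["steer_cmd_deg"]))


steer_check = CheckingRoutine("steer_cmd_range", angle_in_range, clamp_angle)
```

    Keeping the assertion and the recovery as separate callables mirrors the split between detection and recovery services, and the log list plays the role of a logging table.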

    At the application level, once an error is detected, the application is put into a safe state. Degraded data can be restored to valid values, and new communication requests may arise for missed data

    acknowledgement. At the infrastructure level,


    recovery actions on control flow include reset,

    terminate and restart a task or a set of OS objects. In

    the proposed fault-tolerant architecture, each checking routine is associated with one or more recovery routines that call available executive services and update logging tables, if necessary. The recovery action depends on the detected error. Two types of software instrumentation considered are hooks and basic services. Hooks are the means to tie

    up defense software to functional software, and to

    insert code. Basic services play the role of software

    sensors and actuators. Fault injection techniques are

    used to measure fault-tolerance coverage, and to

    detect remaining software errors of defense software.

    The paper does not, however, state the amount of data that can be handled.

    An alternate method for ensuring fault tolerance in automotive software is proposed in [6]. Since it might be difficult for application programmers to implement fault tolerance features by themselves while designing automotive software, it is desirable

    that the fault tolerance is provided by the middleware

    and hidden from application programmers.

    Middleware is a software layer that connects and

    manages application components running on

    distributed hosts. It exists between network operating

    systems and application components. Middleware hides and abstracts many of the complex details of distributed programming from application developers. Figure 3 shows the structure of the

    proposed middleware for automotive systems. At the

    top level, there exists the publish/subscribe interface

    that is used by application components. Beneath the interface, there are a QoS Configuration module for specifying real-time requirements, a Resource Allocation module for guaranteeing timely operation, and a Clock Synchronization module for providing a global time base. All incoming and outgoing messages pass through the Fault-Tolerance Layer in order to guarantee reliability. Finally, messages are transmitted and received by the Transport Layer.

    Fig. 3: Structure of proposed middleware

    Because software is very flexible and relatively

    cheap, it is a very desirable medium for implementing fault tolerance mechanisms [7].

    Redundancy in software can be accomplished

    through redundant computation and redundant

    storage. The most common computational

    redundancy technique involves the use of a software

    controlled watchdog timer. As the program executes,

    the timer is periodically reset by writing a new value

    into the timer register. In the event of a failure, the

    watchdog will restart the processor from a re-entry point in the code. This technique is particularly useful for avoiding deadlock conditions due to communication failures. A simple technique for data redundancy is complement data write. Rather than simply duplicating data, the copy is complemented first. This adds some additional fault tolerance by providing the means to detect stuck-at faults. As duplicating every piece of data might be expensive, often only safety-critical data is duplicated [8].

    Corruption of the instruction memory can be detected

    and tolerated by duplicating the program in memory

    and executing each copy in sequence. Redundant Orthogonal Coding (ROC), a technique similar to n-version programming, can be used to increase reliability [8]. The difference is that in ROC, different algorithms are intentionally used to perform the same calculation. This addresses the problem of n-version programming where different people tend to design programs for a given task in very similar

    ways and also to make similar mistakes in the design

    by explicitly ensuring dissimilarity. This technique

    may also catch programming errors. Most of the error

    detection techniques focus on memory integrity. A common technique is to use checksums to verify the contents of the memory [8]. Another technique to determine that the system memory is still working

    correctly is to test the RAM by writing various

    patterns of ones and zeros into every bit in memory.

    Another common error detection technique is to use assertion tests inside a program. Using assertion

    checks can catch programming errors as well as

    errors arising from unusual conditions.
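    Two of the data-integrity techniques above, complement data write and a memory checksum, can be sketched in a few lines. The sketch assumes 8-bit cells and a simple additive checksum; both choices, and the function names, are illustrative assumptions rather than details taken from [8].

```python
from typing import Tuple

MASK = 0xFF  # assumed 8-bit storage cells


def store_with_complement(value: int) -> Tuple[int, int]:
    """Return the value and its bitwise complement, as written to two cells."""
    value &= MASK
    return value, (~value) & MASK


def read_checked(stored: int, complement: int) -> int:
    """Return the stored value, or raise if the two cells are no longer complementary."""
    if (stored ^ complement) != MASK:
        raise ValueError("stored value and complement disagree")
    return stored


def additive_checksum(block: bytes) -> int:
    """Simple 8-bit additive checksum over a memory block."""
    return sum(block) & MASK
```

    If the same bit is stuck at 1 in both cells, two plain copies would still match each other, whereas the value and its complement can no longer XOR to all ones, so the fault is detected on the next read.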

    Currently, the amount of error diagnosis and error

    recovery in cars is rather lightweight [9]. In the CPUs some error logging takes place, but there is neither consideration nor logging of errors at the level of the network and the functional distribution. There is no

    comprehensive error diagnosis and no systematic

    error recovery beyond individual CPUs. Failure

    logging for the purpose of better diagnosis during maintenance has emerged as a relevant research problem. There are some fail-safe and graceful

    degradation techniques found in cars today, but a

    systematic and comprehensive error treatment is

    missing. With the upcoming multi-core controllers

    for embedded applications, an interesting area for research is how these can also be exploited for redundancy/recovery strategies. Several key areas of research have been identified in [4]. Reliability and

    safety concerns are important for all functions

    relevant to driving, from engine control and

    passenger safety functions to X-by-wire functions

  • 7/24/2019 Fault Tolerance in Automotive Systems_report

    7/10

    7

    where mechanical transmission is replaced by

    electrical signals.

    6. Fault tolerant actuators and sensors

    Safety-critical automotive applications, such as steer-by-wire systems, are in most cases
    control systems with hard real-time requirements. Such systems typically have a number of
    sensors (inputs) [12] connected to them, whose values are processed in order to produce the
    control actions. The sensors are therefore first in the flow of information, and the control
    computations rely on their values, so it is important that the sensors can be trusted.
    Actuators are essential for the reliable operation of various components in the automobile,
    including brakes, valves and cylinders.

    6.1. Fault tolerant sensors

    A fault-tolerant sensor configuration should be at least fail-operational for one sensor
    fault [11]. This can be obtained by hardware redundancy with the same type of sensors or by
    analytical redundancy with different sensors and process models. Sensor systems with static
    redundancy are realized with a triplex system and a voter. A configuration with dynamic
    redundancy needs at least two sensors and fault detection for each sensor. The fault
    detection can be performed by self-tests.

    Fig. 4: Triplex system with static redundancy and duplex system with dynamic redundancy
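    The two configurations can be illustrated with a short sketch. It assumes analog sensor
    readings and uses mid-value selection as the voter, which is one common choice and not
    necessarily the voter used in [11]; the function names and the (value, self_test_ok) tuple
    layout are hypothetical.

```python
from statistics import median


def triplex_vote(a: float, b: float, c: float) -> float:
    """Static redundancy: mid-value selection masks one arbitrarily wrong reading."""
    return median((a, b, c))


def duplex_select(primary, standby):
    """Dynamic redundancy with two sensors.

    Each argument is a (value, self_test_ok) tuple. The primary reading is used
    while its self-test passes; otherwise the standby is used. Returns None when
    both sensors report a fault, so the caller can enter a fail-safe state.
    """
    for value, self_test_ok in (primary, standby):
        if self_test_ok:
            return value
    return None
```

    The triplex voter needs no fault detection because the outlier is simply outvoted, whereas
    the duplex arrangement depends entirely on the quality of the self-tests.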

    Depending on the importance of the sensed quantity, additional sensors may or may not be
    needed to obtain the required dependability at the system level. A fail-silent sensor
    requires some form of sensor-internal redundancy, such as a built-in self-test (BIST) and/or
    internal replication. If a particular sensor is not fail-silent, it can be replicated in
    order to obtain the fail-silent property at the system level. In this case, the values of
    the sensors are collected and analyzed by an intelligent unit that decides which value to
    use in further calculations. The required degree of replication depends on the criticality
    of the sensor. If knowledge of the sensed quantity is critical, sensor triplication is
    necessary. However, if a lack of knowledge about the sensed quantity is acceptable, dual
    redundancy is sufficient.

    A prototype of a steering angle sensor is demonstrated in [10]. Extended to a diverse
    four-sensor system (there are six pairs of sensor elements), the steering angle sensor is
    fault tolerant: it is able to tolerate the loss of one or two sensor elements and to
    diagnose the failed sensor elements. Hardware and/or software Instrument Fault Detection,
    Isolation and Accommodation (IFDIA) schemes are finding more and more applications in
    automotive measurement and control systems. A scheme for detection, location and
    accommodation of faults is presented in [13]. This scheme was designed to identify and
    accommodate some kinds of faults that may affect manifold pressure, crankshaft speed and
    throttle valve angle position sensors. It is reported that the realized scheme is able to
    identify and accommodate even small faults in all considered sensors.
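    The general idea behind detection, isolation and accommodation can be sketched as a residual
    check against a model estimate. This is not the scheme of [13]: the fixed thresholds, the
    sensor names used as dictionary keys and the function ifdia_step are assumptions made for
    illustration, and the model that produces the estimates is left outside the sketch.

```python
def ifdia_step(readings: dict, estimates: dict, thresholds: dict):
    """One detection/isolation/accommodation step.

    readings   -- measured value per sensor name
    estimates  -- value predicted for each sensor by a process model
    thresholds -- largest residual still accepted as fault-free

    Returns (outputs, faulty): the value to use for each sensor and the set of
    sensors whose residual exceeded its threshold.
    """
    outputs = {}
    faulty = set()
    for name, measured in readings.items():
        residual = abs(measured - estimates[name])
        if residual > thresholds[name]:
            faulty.add(name)                 # detection and isolation
            outputs[name] = estimates[name]  # accommodation: use the estimate
        else:
            outputs[name] = measured
    return outputs, faulty
```

    The threshold sets the trade-off: a small value catches small faults but raises the risk of
    false alarms caused by model error and measurement noise.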

    6.2. Fault tolerant actuators

    Actuators generally consist of several parts: input transformer, actuation converter,
    actuation transformer, and actuation element. The actuation converter converts one type of
    energy (e.g., electrical or pneumatic) into another (e.g., mechanical or hydraulic).
    Fault-tolerant actuators can be designed by using multiple complete actuators in parallel,
    with either static redundancy or dynamic redundancy with cold or hot standby [11]. Another
    possibility is to limit the redundancy to the parts of the actuator that have the lowest
    reliability. To achieve fault tolerant control, either the actuator must not be a single
    point of failure or it has to be fault tolerant itself. For functions without inherent
    redundancy, actuators must be replicated [12]. When only two actuators are used, the failure
    of one actuator must not affect operation of the remaining unit. Sensors are typically used
    to continuously monitor the actuator behavior (e.g., injector current, motor current,
    motion, force, torque). When these sensors indicate a serious actuator failure, the power to
    the actuator should be switched off. Because actuators generally cost and weigh more than
    sensors, a fail-operational duplex configuration is preferred for actuators [11]. Either
    static redundant structures, where both parts operate continuously, or dynamic redundant
    structures with hot or cold standby can be chosen. For dynamic redundancy, fault-detection
    methods for the actuator parts are required.
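    A duplex actuator with hot standby and current monitoring might be organized as in the
    sketch below. The class names, the use of a simple current window as the fault-detection
    method and the limits passed to the constructor are all assumptions for illustration; they
    are not taken from [11] or [12].

```python
from dataclasses import dataclass


@dataclass
class Actuator:
    name: str
    powered: bool = True


class DuplexActuator:
    """Dynamic redundancy with hot standby: both units are powered, one drives the load."""

    def __init__(self, primary: Actuator, standby: Actuator, low_a: float, high_a: float):
        self.units = [primary, standby]
        self.low_a = low_a    # lowest plausible current in amperes
        self.high_a = high_a  # highest plausible current in amperes

    @property
    def active(self):
        """First unit that is still powered, or None if both were switched off."""
        return next((unit for unit in self.units if unit.powered), None)

    def monitor(self, measured_current_a: float) -> bool:
        """Check the active unit's current and switch over on a fault.

        Returns False when no healthy unit remains.
        """
        unit = self.active
        if unit is None:
            return False
        if not (self.low_a <= measured_current_a <= self.high_a):
            unit.powered = False  # serious fault: remove power from the failed unit
        return self.active is not None
```

    Switching off the failed unit is what keeps its fault from affecting the remaining one,
    which is the requirement stated above for two-actuator configurations.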

    A prototype of an actuator is demonstrated in [10]. Typical electromagnetic faults, such as
    a winding open circuit or a winding short circuit, may occur within an actuator. To develop
    a single drive that can continue to operate with any one of these faults, the most
    successful design approach proved to be a multiple-phase drive in which each phase may be
    regarded as a single module. The operation of any one module must have minimal impact upon
    the others, so that if that module fails the others can continue to operate unaffected. When
    sensor and actuator failures occur at the same time, their mutual effects on the residuals
    make fault isolation difficult [14]. A hexadecimal decision table that relates all possible
    failure patterns to the residual code is proposed in [14], and it achieves detection and
    isolation of multiple sensor and actuator failures in automotive engines. Simulation and
    experimental results indicate that the proposed diagnostic system not only can be applied to
    cases where all failures occur in the same sector, but is also appropriate for isolating
    multiple failures occurring simultaneously in sensors and actuators.
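    The idea of looking up a failure pattern from a residual code can be sketched as follows.
    The table contents, component names and thresholds here are invented purely for
    illustration; the actual residuals and decision table of [14] are not reproduced in this
    paper.

```python
def residual_code(residuals, thresholds) -> int:
    """Pack thresholded residuals into one integer: residual i sets bit i."""
    code = 0
    for i, (residual, threshold) in enumerate(zip(residuals, thresholds)):
        if abs(residual) > threshold:
            code |= 1 << i
    return code


# Hypothetical decision table: residual code -> set of failed components.
DECISION_TABLE = {
    0x0: frozenset(),
    0x3: frozenset({"throttle_sensor"}),
    0x5: frozenset({"pressure_sensor"}),
    0x6: frozenset({"throttle_actuator"}),
    0xF: frozenset({"throttle_sensor", "throttle_actuator"}),
}


def diagnose(residuals, thresholds):
    """Return (code, failures); failures is None if the code is not in the table."""
    code = residual_code(residuals, thresholds)
    return code, DECISION_TABLE.get(code)
```

    With four residuals the code fits in a single hexadecimal digit, which is presumably why a
    hexadecimal table is a convenient representation.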

    7. Fault tolerant Automotive Communication Systems

    The specific requirements of the different car domains have led to the development of a
    large number of automotive networks such as LIN, J1850, CAN, TTP/C, FlexRay, Media Oriented
    Systems Transport (MOST), IDB1394, etc. One of the important requirements of an automotive
    communication system is fault tolerance [16]. Fault tolerant (typically safety-critical)
    communication systems are built to tolerate defective circuits, line failures, etc., and
    are constructed using redundant hardware and software architectures.

    7.1. Event triggered vs. Time triggered Systems

    There are two main paradigms for communications in automotive systems [15]: time triggered
    and event triggered. Event triggered means that messages are transmitted to signal the
    occurrence of significant events (e.g., a door has been closed). In this case, the system
    possesses the ability to take into account, as quickly as possible, any asynchronous events
    such as an alarm. Event-triggered communication is very efficient in terms of bandwidth
    usage since only necessary messages are transmitted. In time triggered systems, frames are
    transmitted at predetermined points in time, which is well suited for the periodic
    transmission of messages as required in distributed control loops. Each frame is scheduled
    for transmission at one predefined interval of time, usually termed a slot, and the schedule
    repeats itself indefinitely. As the frame scheduling is statically defined, the temporal
    behavior is fully predictable; thus, it is easy to check whether the timing constraints
    expressed on data exchanges are met.
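    The predictability of a static schedule can be made concrete with a small sketch. It assumes
    the simplest case, one slot per frame per cycle and equal slot lengths; the frame names, the
    slot length and both function names are hypothetical.

```python
def slot_start_ms(schedule: list, frame: str, cycle: int, slot_ms: float) -> float:
    """Start time of `frame` in the given cycle (cycle 0 starts at t = 0)."""
    return (cycle * len(schedule) + schedule.index(frame)) * slot_ms


def meets_constraints(schedule: list, slot_ms: float, max_period_ms: dict) -> dict:
    """Check each frame's required period against the fixed cycle length.

    With one slot per cycle, every frame is sent exactly once per cycle, so the
    check reduces to comparing the cycle length with the required period.
    """
    cycle_ms = len(schedule) * slot_ms
    return {frame: cycle_ms <= max_period_ms[frame] for frame in schedule}
```

    Because every transmission time follows from the schedule alone, the check can be done
    off-line before the system is deployed, which is not possible in the same way for
    event-triggered traffic.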

    7.2. Controller Area Network (CAN)

    CAN (Controller Area Network) is the most widely used in-vehicle network. It was designed by
    Bosch in the mid-1980s for multiplexing communication between ECUs in vehicles, and thus for
    reducing the overall wire harness (the length of wires and the number of dedicated wires).
    CAN on a twisted pair of copper wires became an ISO standard in 1994 and is now a de facto
    standard in Europe for data transmission in automotive applications, due to its low cost,
    its robustness and its bounded communication delays. CAN has several mechanisms for error
    detection [17]. For instance, the receiver checks that the CRC transmitted in the frame is
    identical to the CRC it computes itself, that the structure of the frame is valid and that
    no bit-stuffing error occurred. Each station that detects an error sends an "error flag", a
    particular type of frame composed of 6 consecutive dominant bits that makes all the stations
    on the bus aware of the transmission error. CAN also has fault-confinement mechanisms aimed
    at identifying permanent failures due to hardware malfunctions at the level of the
    micro-controller, communication controller or physical layer. The scheme is based on error
    counters that are increased and decreased according to particular events. The main drawback
    is that a node has to diagnose itself, which can leave some critical errors undetected.
    Without additional fault-tolerance facilities, CAN is not suited for safety-critical
    applications such as future X-by-Wire systems [17]. For instance, a single node can perturb
    the functioning of the whole network by sending messages outside their specification (i.e.,
    length and period of the frames). A framework that provides selective fault tolerance for
    messages with various fault-tolerance requirements scheduled on CAN is proposed in [19]. The
    set of messages is analyzed off-line, and scheduling attributes are derived that ensure
    feasible transmission of messages, as well as retransmissions upon errors, that satisfy the
    fault-tolerance requirements.
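    The counter-based fault confinement can be sketched as a small state machine. The sketch
    follows the basic rules of the CAN specification as we understand them (transmit errors add
    8, receive errors add 1, successes subtract 1, error-passive above 127, bus-off above 255)
    and deliberately omits the exceptions and special cases of the standard; the class and
    method names are our own.

```python
class CanErrorCounters:
    """Simplified CAN fault confinement: two counters and three node states."""

    def __init__(self):
        self.tec = 0  # transmit error counter
        self.rec = 0  # receive error counter

    def on_tx_error(self):
        self.tec += 8

    def on_rx_error(self):
        self.rec += 1

    def on_tx_success(self):
        self.tec = max(0, self.tec - 1)

    def on_rx_success(self):
        self.rec = max(0, self.rec - 1)

    @property
    def state(self) -> str:
        if self.tec > 255:
            return "bus-off"        # node must stop taking part in bus traffic
        if self.tec > 127 or self.rec > 127:
            return "error-passive"  # node may no longer send active error flags
        return "error-active"
```

    The asymmetric increments mean that a node with a persistent transmit fault removes itself
    from the bus quickly, but the decision still rests on the faulty node's own counters, which
    is the self-diagnosis drawback noted above.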

    7.3. Time-Triggered CAN (TTCAN)

    TTCAN uses the CAN standard but, in addition, requires that the controllers be able to
    disable automatic retransmission of frames upon transmission errors and to provide the upper
    layers with the point in time at which the first bit of a frame was sent or received. The
    key idea is to offer a flexible time-triggered/event-triggered protocol. TTCAN defines a
    basic cycle as the concatenation of one or several time-triggered (exclusive) windows and
    one event-triggered (arbitrating) window. Although TTCAN is built on CAN, a well-mastered
    and low-cost technology, it does not provide important dependability services such as a bus
    guardian, a membership service or reliable acknowledgment [17]. It does not provide the same
    level of fault tolerance as TTP and FlexRay, which are the other two candidates for
    X-by-wire [16]. Strong points of TTCAN are its support for coexisting event- and
    time-triggered traffic and the fact that it is standardized by ISO. It also sits on top of
    standard CAN, which allows for an easy transition from CAN to TTCAN.

    7.4. FlexRay

    The FlexRay network is very flexible with regard to topology and transmission support
    redundancy. It can be configured as a bus, a star or a multistar. It is not mandatory for
    each station to possess replicated channels or a bus guardian, even though this should be
    the case for critical functions such as Steer-by-Wire [15]. FlexRay also provides fault
    tolerance through distributed time-triggered clock synchronization and error containment on
    the physical layer through an independent bus guardian. FlexRay allows both time-triggered
    and event-triggered communication by means of a communication cycle in which a
    time-triggered (static) window and an event-triggered (dynamic) window are concatenated.
    The time-triggered window uses TDMA like TTP but, unlike TTP, a given node may be able to
    access the bus multiple times before all remaining nodes have accessed it. The
    event-triggered window uses a technique called Flexible TDMA (FTDMA) to provide
    event-triggered behavior without collisions. According to the FlexRay specification [21], a
    frame contains a 24-bit CRC checksum to check the integrity of the transmitted frame. The
    probability of undetected network errors is less than 6*10^-8. Adequately addressing fault
    tolerance was one of the key aspects that needed to be considered during the design of
    FlexRay. To allow a single communications system to support the diverse needs of automotive
    applications across different application domains, the consortium decided to introduce a
    concept of scalable fault tolerance. Scalable fault tolerance aims at allowing FlexRay to be
    used economically in distributed non-fault-tolerant systems as well as in distributed
    fault-tolerant systems.

    In addition, FlexRay can be deployed using optional local or remote channel guardians that
    protect the communication channels from transmission faults that violate the TDMA scheme.
    The clock synchronization algorithm supports fault-tolerant as well as non-fault-tolerant
    synchronization. For fault-tolerant synchronization, the synchronization algorithm
    considers the transient/permanent fault class as well as the symmetric/asymmetric fault
    class [22]. In this protocol, the global time is synchronized at the macrotick level by a
    cluster-wide clock synchronization algorithm. Unlike a master-slave synchronization
    algorithm, this algorithm continues to operate even in the event of an ECU failure in the
    system. Table 1 summarizes the key differences between the automotive protocols discussed
    so far.
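    The text above does not name the synchronization algorithm. As far as we are aware, FlexRay
    derives its clock correction with a fault-tolerant midpoint rule, and the sketch below
    illustrates that rule under this assumption. The function name and the use of floating-point
    values are ours; the specification works with integer microticks.

```python
def fault_tolerant_midpoint(deviations) -> float:
    """Combine measured clock deviations while discarding possible outliers.

    The k largest and k smallest values are dropped and the midpoint of the
    remaining extremes is returned. k grows with the number of measurements:
    0 for fewer than 3 values, 1 for 3 to 7 values, 2 for more than 7.
    """
    values = sorted(deviations)
    n = len(values)
    if n == 0:
        raise ValueError("at least one deviation measurement is required")
    if n < 3:
        k = 0
    elif n <= 7:
        k = 1
    else:
        k = 2
    trimmed = values[k:n - k]
    return (trimmed[0] + trimmed[-1]) / 2
```

    Because no single node's measurement is trusted, one faulty ECU (or two, with enough
    measurements) cannot pull the cluster time away, which is the property that distinguishes
    this approach from master-slave synchronization.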

    Table 1: Summary of Automotive Protocols

    USAGE            CAN    TTCAN   FlexRay
    Chassis          YES    YES     NO
    Airbags          YES    NO      NO
    Powertrain       YES    YES     SOME
    X-by-wire        SOME   YES     YES
    Multimedia       NO     NO      NO
    Telematics       NO     NO      NO
    Diagnostics      YES    SOME    SOME

    REQUIREMENTS     CAN    TTCAN   FlexRay
    Fault tolerance  SOME   SOME    YES
    Determinism      YES    YES     YES
    Bandwidth        SOME   SOME    YES
    Flexibility      YES    YES     YES
    Security         NO     NO      NO

    7.5. Recent Work

    A simulation study of fault-tolerant sensor networks for on-board control in cars is
    presented in [18]. The on-board communication and control networks are built using Gigabit
    Ethernet. Sensors are smart and are the sources of traffic; actuators are smart and are the
    sinks of traffic. The controller is a personal computer. The sources of real-time traffic
    (sensors) are tripled, while the number of sink nodes (actuators) is not increased. This
    increase in the number of sensors is made to test the possibility of building triple-modular
    redundancy (TMR) at the sensor level for fault tolerance. The disadvantage of TMR is
    obviously cost, since three sensors are required to produce an output that could be
    generated by just one sensor. On the other hand, the advantage of TMR is an increase in
    reliability. The outputs have to go through a voter. Voting can also be done in software,
    which is the approach used in this study. The controller, after reading the outputs of the
    three sensors, executes a routine that compares these three outputs. A major problem here is
    that the three outputs may not completely agree because the sensors, while identical, may
    not produce exactly the same output. The first solution to this problem is the mid-value
    select technique. The second is to ignore the least significant bits of the data; the number
    of significant bits that have to agree depends on the application. With TMR, the data from
    the three copies of the sensor are compared and voted upon. As long as only one sensor has
    failed, the controller will know which of the three packets to discard and the system will
    remain operational. When a second sensor fails, the entire system will fail. A methodology
    for interconnecting automotive bus networks in a fault tolerant way is proposed in [20].

    8. Conclusion

    In this paper, we have provided a survey of fault tolerant design techniques and
    methodologies used in the automotive industry. X-by-wire systems, which are discussed at
    length, are expected to be integrated into most automobiles in the future. Automobile
    manufacturers such as Toyota, Nissan and BMW have already introduced Brake-by-wire
    technology in some of their recent models. As the amount of

    software that goes into a modern car increases steadily with the introduction of navigation
    systems and instrument clusters, software fault tolerance techniques will become a mandatory
    requirement in the automotive industry. FlexRay is expected to be the network of choice in
    future X-by-wire designs due to its high bandwidth. We feel that a key challenge in the area
    of automotive software will be handling the huge amount of data to be processed (such as map
    data in a navigation system) while still providing fault tolerance without hindering the
    user experience. As modern cars slowly transition to functioning as information hubs by
    providing connectivity to laptops, PDAs and cell phones, another challenge will be to
    introduce fault tolerant services in protocols such as Bluetooth, ZigBee and MOST. This
    represents a research opportunity for engineers from all backgrounds.

    References

    [1] C. Lu, Jean-Charles Fabre, Marc-Olivier Killijian, An approach for improving Fault-Tolerance in Automotive Modular Embedded Software, Proc. of the 17th International Conference on Real-Time and Network Systems, 2009.
    [2] Xi Chen, Requirements and concepts for future automotive electronic architectures from the view of integrated safety, PhD Thesis, Universitätsverlag Karlsruhe, 2008.
    [3] H. Gustavsson, J. Sterner, An industrial case study of Design Methodology and Decision Making for Automotive Electronics, Proc. of the ASME International Design Engineering Technical Conferences & Computers and Information in Engineering Conference, 2008.
    [4] M. Broy, I.H. Kruger, A. Pretschner, C. Salzmann, Engineering Automotive Software, Proc. of IEEE, Vol. 95, Issue 2, pp. 356-373, 2007.
    [5] M. Broy, Automotive Software and Systems Engineering, Proc. of the 2nd ACM/IEEE Conference on Formal Methods and Models for Co-Design, 2005.
    [6] J. Park, S. Kim, W. Yoo, S. Hong, Designing Real-Time and Fault-Tolerant Middleware for Automotive Software, Proc. of International Joint Conference SICE-ICASE, pp. 4409-4413, 2006.
    [7] D. Palsetia, S. Pieper, Fault Tolerance in Automotive X-by-Wire, Project Report, Department of Electrical and Computer Engineering, UW-Madison, 2005.
    [8] E.G. Leaphart, B.J. Czerny, J.G. D'Ambrosio, B.T. Murray, C.L. Denlinger, D. Littlejohn, Survey of Software Failsafe Techniques for Safety-Critical Automotive Applications, SAE 2005-01-0779, SAE World Congress, 2005.
    [9] A. Pretschner, M. Broy, I.H. Kruger, T. Stauner, Software Engineering for Automotive Systems: A Roadmap, Proc. of Future of Software Engineering, pp. 55-71, 2007.
    [10] E. Digler, R. Karrelmeyer, B. Straube, Fault Tolerant Mechatronics, Proc. of 10th International On-Line Testing Symposium, 2004.
    [11] R. Isermann, R. Schwarz, S. Stolzl, Fault-Tolerant Drive-by-wire Systems, IEEE Control Systems Magazine, Vol. 22, Issue 5, pp. 64-81, 2002.
    [12] A. Manzone, A. Pincetti, D. De Costantini, Fault Tolerant Automotive Systems: An Overview, Proc. of the 7th International On-Line Testing Workshop, pp. 117-121, 2001.
    [13] D. Capriglione, C. Liguori, C. Pianese, A. Pietrosanto, On-Line Sensor Fault Detection, Isolation, and Accommodation in Automotive Engines, IEEE Transactions on Instrumentation and Measurement, Vol. 52, Issue 4, pp. 182-189, 2003.
    [14] P. Hsu, K. Lin, L. Shen, Diagnosis of Multiple Sensor and Actuator Failures in Automotive Engines, IEEE Transactions on Vehicular Technology, Vol. 44, Issue 4, pp. 779-789, 1995.
    [15] N. Navet, Y. Song, F. Simonot-Lion, C. Wilwert, Trends in automotive communication systems, Proc. of IEEE, Vol. 93, Issue 6, pp. 1204-1223, 2005.
    [16] T. Nolte, H. Hansson, L.L. Bello, Automotive Communications - Past, Current and Future, Proc. of 10th IEEE Conference on Emerging Technologies and Factory Automation, Vol. 1, pp. 985-992, 2005.
    [17] N. Navet, F. Simonot-Lion, A Review of Embedded Automotive Protocols, Technical Report, Nancy Université, 2008.
    [18] R.M. Daoud, H.H. Amer, H.M. Elsayed, Y. Sallez, Fault-Tolerant Ethernet-Based Vehicle On-Board Networks, Proc. of 32nd Annual Conference on Industrial Electronics, pp. 4662-4665, 2006.
    [19] H. Aysan, A. Thekkilakattil, R. Dobrin, S. Punnekkat, Fault Tolerant Scheduling on Controller Area Network (CAN), Proc. of Emerging Technologies and Factory Automation Conference, pp. 1-8, 2010.
    [20] H. Kimm, Ho-Sang Ham, Integrated Fault Tolerant System for Automotive Bus Networks, Proc. of 2nd International Conference on Computer Engineering and Applications, pp. 486-490, 2010.
    [21] FlexRay Consortium. (2004, June) FlexRay Communication System, Protocol Specification, Version 2.0. [Online]. Available: http://www.flexray.com
    [22] C. Temple, Networking the FlexRay Way - An overview of the FlexRay Communications System, Technical Report, Freescale Semiconductor.
    [23] R. Garbenfeldt, X-by-wire: Driving Your Car and the Semiconductor Industry, Technical Report, 2005.
    [24] F. Seidel, X-by-wire, Technical Report, Chemnitz University of Technology, 2009.
    [25] B. Selic, Fault tolerance techniques for Distributed Systems, [Online]. Available: http://www.ibm.com/developerworks/rational/library/114.html#N101B5
    [26] C. Wilwert, N. Navet, Y. Song, F. Simonot-Lion, Design of automotive X-by-wire systems, Technical Report.
    [27] L. He, Z. Yu, C. Zong, H. Zhao, The Dual-core Fault-tolerant control for Electronic Control Unit of Steer-By-wire System, Proc. of International Conference on Computer, Mechatronics, Control and Electronic Engineering, pp. 436-439, 2010.
    [28] E. Touloupis, J.A. Flint, V.A. Chouliaras, A Fault-Tolerant Architecture For Automotive Applications, Technical Report, Loughborough University.
    [29] L. Lamport, R. Shostak, M. Pease, The Byzantine Generals Problem, ACM Transactions on Programming Languages and Systems, Vol. 4, No. 3, pp. 382-401, 1982.
    [30] IEC61508-1, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems - Part 1: General requirements, IEC/SC65A, 1998.
    [31] D. Jhalani, S. Dhir, Survey of Fault Tolerant Techniques in Automotives, University of Wisconsin Madison.
    [32] H. Curtis, R. France, Time Triggered Protocol (TTP/C): A Safety-Critical System Protocol, Literature Survey, University of Texas-Austin, 1999.