Intrusion Tolerant Distributed Systems – Algorithms and Architectures

Angelo Corsaro & Venkita Subramonian

DOC Group, Washington University

Software Systems Research Seminar

March 21, 2003

Security: State of the Art

Most secure systems today are built by trying to prevent attacks

Several techniques and tools have been developed to build more secure systems, detect system weaknesses, and protect systems: new programming languages; software tools such as code analyzers and system profilers; new hardware/software components; etc.

Yet, systems’ security keeps being compromised!!!

Nowadays pervasive interconnectivity introduces more challenges for security

The lesson learned in securing systems is that this brute force approach does not work.

Experience has led to the key observation that it isn’t practical/feasible to build 100% secure systems

Classical Secure Distributed Systems

Classical secure distributed systems are based on the assumption that some part of the system is trusted

The basic and recurrent idea is that of connecting distributed components together so as to form a global secure infrastructure

This approach requires large trusted parts on all computers on the network


One of the most used and deployed distributed security systems is Kerberos

It was designed and implemented at MIT as part of Project Athena

The core assumptions underlying Kerberos’s design are the following:

Client workstations are totally under the control of the user, i.e., they can’t be trusted

Remote services can be accessed only via an authentication service

Servers are trusted and are physically protected

The servers are under the complete control and responsibility of the administrator

The master server is replicated on passive slaves, which can replace the server when it fails




1. Request for a TGS ticket

2. Ticket for TGS

3. Request for Server Ticket

4. Server Ticket

5. Request for Service



Kerberos’s Security Problems

The security administrator can misuse his privileges to perform unauthorized actions

Replicas (Kerberos uses passive replication) can also provide information to intruders if not well protected

If the Kerberos server fails, the most recent database changes are lost

Nothing is done to prevent “covert” channels

There is a single point of failure!!!

Security: New Trends

Eliminating all the flaws that make systems insecure is not feasible (especially for legacy systems)

Currently adopted solutions for distributed systems’ security have quite a few problems

How about building systems that can continue critical operations in the face of attacks?

Can we build systems that tolerate attacks instead of trying to prevent them?

Architectures for Intrusion Tolerance

Intrusion Tolerance: The Idea

Intrusion Tolerant Systems are designed in such a way that they can tolerate a bounded number of misuses

If one or more intruders bypass the protection mechanisms, and the number of misuses they commit is below a given threshold, the security properties of the system (confidentiality, integrity, and availability) are still ensured

The key observation at the basis of intrusion tolerant systems is that an intrusion can be thought of as a Byzantine fault

Types of Intrusion Tolerance

Confidentiality: Read access to a subset of confidential data gives no information about the data

Integrity: The change of a subset of data does not change the data perceived by legitimate users

Availability: The change or deletion of a subset of data or of a server does not produce a denial of service to legitimate users

A threshold Tp is defined for each property P

Reading, modifying, or destroying a part X of the data or servers D is tolerated for property P as long as |X| < Tp


Data Intrusion Tolerance

Data intrusion-tolerance techniques have existed for a long time

Confidentiality can be ensured by cryptographic tools such as threshold schemes

The data is split into shares called shadows, each shadow being stored on one security site

A number of shadows equal to the threshold is sufficient to reconstruct the data

This scheme ensures availability and integrity

To prevent denial of service, the servers are replicated

Different sites cannot make decisions independently; they must agree by communicating data and local decisions

This last point requires replication and agreement
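The threshold idea above can be sketched with Shamir's secret sharing, one well-known threshold scheme. The slides do not say which scheme is meant, so that choice, the prime modulus, and the names `split_secret` and `recover_secret` are assumptions made for illustration only.

```python
import secrets

# Assumption: a fixed Mersenne prime as the field size for this sketch.
PRIME = 2**127 - 1

def split_secret(secret, n, t):
    """Split `secret` into n shadows such that any t of them rebuild it."""
    if not (1 <= t <= n) or not (0 <= secret < PRIME):
        raise ValueError("need 1 <= t <= n and 0 <= secret < PRIME")
    # Random polynomial of degree t-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(t - 1)]
    shadows = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shadows.append((x, y))
    return shadows

def recover_secret(shadows):
    """Rebuild the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shadows):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shadows):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

For example, `split_secret(s, 5, 3)` yields five shadows, one per security site, and `recover_secret` applied to any three of them returns `s`; two shadows alone are not enough to determine it.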

Intrusion Tolerant Security Service

Intrusion Tolerant Security Server

The goal of an intrusion tolerant distributed security server is to provide a trusted service out of a set of potentially untrusted computers

This way, an intrusion into one or some of the computers won’t compromise the security of the global system

All the sites that are part of the security service, called security sites, have to provide a series of services: registration, authentication, sensitive data management, and an audit and recovery service

Registration Service

Registration enrolls a user in the system for future access to secured services

This operation must be carried out independently on each security site to prevent a single site from using information to impersonate the user

The operation is done under the control of the security administrator of each site

Authentication Service

The role of this service is to verify the claimed identity of a subject

In a distributed system with several authentication servers, each server must independently authenticate the subject

Notice that the security sites are untrusted and one site could fake the authentication information

An agreement protocol is used to make sure that the user is authenticated only if a majority of the servers authenticate the user successfully

Upon authentication the server sends the user some session information, such as session id, key etc.

Authorization Service

The role of the authorization service is to check that a subject's access to a secured service is authorized according to its access rights

Access rights could be implemented in a UNIX-like manner

The authorization service is made intrusion tolerant by implementing it on the security servers

Authorization phases are:

The client asks the security server for permission to access a secured service

The access rights stored on the security sites determine whether the client has the proper rights

The security sites vote to decide if the access is authorized

If the sites agree to permit access they send a ticket to the client, and another to the server

Using the ticket the client can now open a session with the server

Sensitive Data Management Service

The role of this service is to store, manage and retrieve the sensitive information on the security servers

The data management service must enforce the three main security properties: confidentiality, integrity, and availability

Integrity is provided by a modification detection mechanism based on, for example, cryptographic signatures

Replication can be used to ensure availability, while threshold techniques could be used for confidentiality and availability

Sensitive Data Management Service

If data is replicated on N sites, then:

With respect to availability, up to N-1 replicas can be lost

With respect to confidentiality, one replica is sufficient to observe the data

If one data item is shared on N security sites using a threshold of T, then:

With respect to availability, N-T shadows can be lost

With respect to confidentiality, T shadows are necessary and sufficient to observe the data
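The comparison between replication and threshold sharing can be made concrete with a small worked example. The function names and dictionary keys below are hypothetical; the arithmetic is taken directly from the bullets above.

```python
def replication_tolerance(n):
    """Plain replication on n sites."""
    if n < 1:
        raise ValueError("need at least one replica")
    return {
        "copies_that_can_be_lost": n - 1,   # availability
        "copies_needed_to_read": 1,         # confidentiality
    }

def threshold_tolerance(n, t):
    """Threshold sharing on n sites with threshold t."""
    if not (1 <= t <= n):
        raise ValueError("need 1 <= t <= n")
    return {
        "copies_that_can_be_lost": n - t,   # availability
        "copies_needed_to_read": t,         # confidentiality
    }
```

With N = 5 and T = 3, replication survives the loss of four replicas but a single compromised site exposes the data, whereas the threshold scheme survives the loss of two shadows and an intruder must compromise three sites to read the data.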

The Audit and Recovery Service

The role of this service is to audit the security information sent by the services

There are two kinds of information: authorized operations, and attempted or successful intrusions or misuse

Note that determining what constitutes an intrusion or a misuse is not the role of this service

Analysis of the audit is done offline by security administrators

The recovery service acts as an error recovery mechanism to correct certain modified data

Voting Algorithms for Intrusion Tolerance

Need for voting algorithms

Authentication Authorization

FT Node architecture

Distributed Voting

Two phases:

1. Local computation: compute results locally and broadcast them

2. Majority reconciliation: determine whether a majority exists; initiate fault diagnostics if necessary

Distributed algorithm for both phases

Coordinator commits the majority vote


Distributed algorithm that runs on every voter

Receive result from all voters
If my result is the same as all other results
    we have a unanimous vote
    commit vote
Else if more than 50% of the results are the same
    we have a majority
    if I am the coordinator and my result is NOT the same as the majority result
        select a new coordinator from among the “majority processors”
    commit vote
    if I am the coordinator
        initiate fault recovery in minority nodes
Else
    we do not have a majority
    start local diagnostics
    if my status = “okay”
        select new coordinator from among “okay” processors
    repeat voting process
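The reconciliation step above can be sketched as a pure function over the results a voter has received. This is an illustration under assumptions, not the presenters' implementation: the name `reconcile`, the tuple it returns, and the use of the largest node id to pick the replacement coordinator (borrowed from the coordinator-selection slide) are all choices made for this sketch. Diagnostics and fault recovery are left out.

```python
from collections import Counter

def reconcile(coordinator_id, results):
    """Classify one voting round.

    results: dict mapping voter id -> result (results must be hashable).
    Returns (outcome, agreed_result, coordinator_after_round).
    """
    counts = Counter(results.values())
    value, votes = counts.most_common(1)[0]
    if votes == len(results):
        return ("unanimous", value, coordinator_id)
    if 2 * votes > len(results):  # strictly more than 50%
        coordinator = coordinator_id
        if results[coordinator_id] != value:
            # The coordinator disagrees with the majority: replace it with
            # a member of the majority (assumed rule: largest id).
            coordinator = max(v for v, r in results.items() if r == value)
        return ("majority", value, coordinator)
    # No majority: the caller would run diagnostics and repeat the vote.
    return ("no-majority", None, None)
```

For instance, with results `{1: "b", 2: "a", 3: "a"}` and coordinator 1, the round ends with a majority for `"a"` and voter 3 as the new coordinator.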

Choosing a new coordinator

New coordinator chosen from a processor set

The candidate processor set is either all processors (when there is no majority) or the set of processors belonging to the majority

Check local node status
If status = “okay”
    broadcast status to other processors
    wait until broadcasts from other processors arrive
    if my node has the largest node id among “okay” processors
        I declare myself new coordinator
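The selection rule above amounts to picking the largest id among the nodes that reported “okay”. A minimal sketch, assuming the status broadcasts have already been collected into a dictionary; the name `elect_coordinator` and its parameters are hypothetical.

```python
def elect_coordinator(statuses, candidates=None):
    """Return the id of the new coordinator, or None if nobody is okay.

    statuses: dict mapping node id -> status string received by broadcast.
    candidates: optional iterable restricting the candidate set, e.g. the
    processors belonging to the majority; defaults to all processors.
    """
    pool = statuses.keys() if candidates is None else candidates
    okay = [node for node in pool if statuses.get(node) == "okay"]
    return max(okay) if okay else None
```

Because every “okay” node sees the same broadcasts and applies the same rule, each one reaches the same conclusion about who the coordinator is without a further message exchange, provided the broadcasts are delivered consistently.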

Committing a Vote

Coordinator responsible for committing majority vote

If I am the coordinator
    broadcast result to majority
    wait for ack from all processors in majority
Else
    wait for result from coordinator
    send ack to coordinator

Problems with 2 Phase protocol

What if the coordinator fails right before committing the majority vote? The user (client) will receive a bad result

The probability is very low and within acceptable risk parameters

But transient faults could have an adverse effect on security

An attacker could control which result a user sees, so the majority no longer matters

Security and transient faults

Transient faults could hamper security

Example: illuminating a single transistor in an IC using a laser

Serious threat to smartcard technology

Attack invented and perfected by Sergei Skorobogatov, Cambridge University

“Sergei's work will trigger a generation change in smartcard technology. The immediate effect of his work is that many attacks on computer systems that were developed as theoretical possibilities by the research communities in the 1990s have suddenly become practical”

– EE Times, May 2002

A Solution

Algorithm by Castro and Liskov

Pros: commit is done by all voters as opposed to just one coordinator, hence more secure than the 2-Phase algorithm

Cons: does not scale well, since the client has to wait for f+1 replies
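The f+1 figure can be read as follows: if at most f voters are faulty, any f+1 matching replies must include at least one from a non-faulty voter, so the client can trust that result. The sketch below illustrates only this client-side rule; the name `accept_reply` and the reply format are assumptions, and the rest of the Castro and Liskov protocol is not modelled.

```python
from collections import Counter

def accept_reply(replies, f):
    """Return the first result backed by f+1 distinct replicas, else None.

    replies: iterable of (replica_id, result) pairs in arrival order.
    f: assumed upper bound on the number of faulty replicas.
    """
    seen = set()
    counts = Counter()
    for replica_id, result in replies:
        if replica_id in seen:
            continue  # count each replica at most once
        seen.add(replica_id)
        counts[result] += 1
        if counts[result] >= f + 1:
            return result
    return None
```

With f = 1, the replies `[(0, "x"), (1, "bad"), (2, "x")]` are accepted as `"x"` once the third reply arrives, while a single replica repeating `"bad"` never reaches the threshold.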

Other algorithms

More algorithms in literature

Inexact voting

Drawbacks of the previous algorithms: they assume state machine replication in all voters, i.e., two different non-faulty voters will produce the same result

There are use cases where this assumption does not hold, e.g., sensor values

Inexact voting: values that fall within a range of tolerance are considered equal (equivalence classes)

Algorithms can be modified to handle inexact voting

BUT the performance overhead is large, since multiple inexact comparisons are needed to determine the majority
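A naive form of inexact voting for numeric readings can be sketched as below; it also shows where the overhead comes from, since every value is compared against every other value. The function name and the choice of returning the median of the agreeing cluster are assumptions of this sketch. Note that "within tolerance" is not transitive, so the clusters here are only an approximation of true equivalence classes.

```python
def inexact_majority(values, tolerance):
    """Return a representative of a strict-majority cluster, or None.

    Two readings are treated as equal when they differ by at most
    `tolerance`. Cost is O(n^2) comparisons for n readings.
    """
    n = len(values)
    for candidate in values:
        close = [v for v in values if abs(v - candidate) <= tolerance]
        if 2 * len(close) > n:
            # Report the median of the agreeing readings (assumed policy).
            return sorted(close)[len(close) // 2]
    return None
```

For sensor readings such as `[20.1, 20.0, 19.9, 35.0, 20.2]` with a tolerance of 0.5, four of the five values agree and the outlier 35.0 is outvoted.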

Proposed Algorithm Assumptions

Network with atomic broadcast capability: bounded message delay and fair sharing of the broadcast medium

No voter will commit an answer until all voters are ready. This is enforced using application-dependent thresholds; any commits before this threshold are considered invalid

A majority of voters are fault-free, for reliable working of the system

Each voter can vote only once; this is enforced by the User Interface module

Proposed Algorithm (1/2)


Components: Client, Interface Module, voters

1. Commit, if not committed already

2. Compare with committed result

3. Timer expires, send result to client


Proposed Algorithm (2/2)


Components: Client, Interface Module, voters

1. Commit, if not committed already

2. Compare with committed result

3. Dissent, if no match

4. Commit new vote

5. Reset timer expiry
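One possible reading of the numbered steps is sketched below. The slides give only the step labels, so the class `InterfaceModule`, the helper `voter_step`, the use of logical timestamps instead of a real timer, and the rule that an agreeing voter does not consume its single vote are all assumptions of this sketch rather than the presenters' design.

```python
class InterfaceModule:
    """Holds the currently committed result and releases it on timeout."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.deadline = None
        self.committed = None
        self.voted = set()

    def commit(self, voter, result, now):
        """Steps 1 and 4: accept a commit, at most one per voter."""
        if voter in self.voted:
            return False  # a vote from an already committed voter is ignored
        self.voted.add(voter)
        self.committed = result
        self.deadline = now + self.timeout  # step 5: reset timer expiry
        return True

    def result_for_client(self, now):
        """Step 3 of the first slide: release the result once the timer expires."""
        if self.deadline is not None and now >= self.deadline:
            return self.committed
        return None

def voter_step(module, voter, my_result, now):
    """Steps 1 to 4 from one voter's point of view."""
    if module.committed is None or module.committed != my_result:
        # Nothing committed yet, or dissent: commit my own result.
        return module.commit(voter, my_result, now)
    return True  # committed result matches mine; nothing to do
```

In this reading, a faulty first commit is overwritten by the first dissenting voter, and the faulty voter cannot commit again because it has already used its vote.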


Uniqueness of this algorithm

Security increased:

No specific coordinator node, hence reduced vulnerability

Even if the first commit to the User Interface module is compromised, it gets invalidated by dissenting voters

“Denial of Service” (vote-rigging) eliminated, since a vote from an already committed voter is ignored

Fault-tolerance properties maintained as before: result still based on majority

Concerns about the User Interface module:

Single point of failure, BUT this module is very simple with very little computation

User Interface module can be isolated from the voter complex

Less intensive computation on the client: it does not have to reconcile all results from voters


Voters must be authenticated by User Interface module before accepting commits

This should not increase the complexity of the module

Strong authentication with minimal interaction between voters and the interface module preferred

Example mechanism: use SKEY authentication

SKEY authentication scheme

Messages from Voter to Interface Module:

vote, f^n(R)

vote, f^(n-1)(R)

...

vote, f(R)

f is a one-way function
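The one-way chain can be sketched as follows. SHA-256 stands in for the one-way function f (the original S/KEY used a different hash), and the sketch assumes the interface module is provisioned with f^n(R) beforehand, after which the voter reveals f^(n-1)(R), f^(n-2)(R), and so on, one value per vote. The names `chain` and `Verifier` are hypothetical.

```python
import hashlib

def f(value: bytes) -> bytes:
    """One-way function; SHA-256 is an assumed stand-in."""
    return hashlib.sha256(value).digest()

def chain(seed: bytes, n: int) -> bytes:
    """Compute f applied n times to the seed R."""
    for _ in range(n):
        seed = f(seed)
    return seed

class Verifier:
    """Interface-module side: stores only the last accepted chain value."""

    def __init__(self, anchor: bytes):
        self.last = anchor  # initially f^n(R)

    def check(self, token: bytes) -> bool:
        # A valid token is the preimage of the stored value under f.
        if f(token) == self.last:
            self.last = token  # the next vote must reveal the next preimage
            return True
        return False
```

The verifier never learns R, each token works once (a replayed token no longer hashes to the stored value), and each check costs a single hash, which fits the goal of keeping the interface module simple.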

Distributed voting in WAN

Centralized voting not appropriate in a WAN setting

Multiple hops for vote to reach from voter to coordinator

Link failures could partition the network

Network congestion in the vicinity of the coordinator

Inexact voting could be computationally very intensive, e.g., for sensor data from a vast coverage area

A single coordinator is a target for malicious attacks


Reliable transport

Messages are digitally signed and subject to verification before delivery to upper layer

Unverifiable messages are discarded

Presence of Public-Key infrastructure

Every voter knows the public key of every other voter

Secure voting

Message exchange among voters:

1. Send signed vote to other voters, hash the result and save it

2. Verify signature and compare with own result

3. Hash sender’s result, sign it and send endorsement back

4. Verify the endorsement and compare it with the value saved in step 1


Time complexity:

Each voter signs its result and broadcasts it: O(1)

Each voter waits to receive one signed vote from every other voter: O(n)

Each voter does vote comparison: O(1)

Each voter receives an endorsement from every other voter: O(n)

Overall complexity is O(n)

Number of messages:

Every voter sends its vote to every other voter: n(n-1)

Every voter sends an endorsement to every other voter: n(n-1)

Total is O(n^2)
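As a worked example of the message count, the two n(n-1) terms can simply be added; the function name below is hypothetical.

```python
def secure_voting_messages(n):
    """Total point-to-point messages for one round with n voters."""
    votes = n * (n - 1)          # each voter sends its vote to the others
    endorsements = n * (n - 1)   # each voter endorses every other voter
    return votes + endorsements
```

For 4 voters this gives 24 messages and for 10 voters 180, i.e. 2n(n-1), which grows quadratically with the number of voters.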

Concluding Remarks

The intrusion tolerance mechanisms described provide a much more robust way of enforcing security than traditional techniques

The intrusion tolerance mechanism based on fragmentation-scattering ensures confidentiality and integrity of data and availability of services

Efficient and secure voting algorithms are an essential part of intrusion tolerant systems

More research needed to make intrusion tolerance a “real” technology

Scope for further research overlapping security and fault-tolerance

Fault tolerance vs Security

Fault-tolerant Design | Secure Design

Guard against faulty system components or random faults | Guard against malicious outside attacks

Optimistic | Pessimistic

Probabilistic phenomena | Directed intelligent attack

Redundancy as a solution | Redundancy as an adversity

Redundancy – a boon or a bane?

[Figure: fault tolerance and security plotted against degree of redundancy, with the desired security behavior marked]


