
IMPLEMENTATION OF A FAULT TOLERANT NETWORK MANAGEMENT

SYSTEM USING VOTING

High Performance Computing and Simulation Research Lab

Department of Electrical and Computer Engineering

University of Florida

Dr. Alan D. George and Edwin Hernandez

Abstract

Fault tolerance (FT) can be achieved through replication, and N-Module Redundancy (NMR) systems are widely applied in both software and hardware architectures. NMR software systems may use static or dynamic voting mechanisms: static voters reduce the complexity of the control algorithms but introduce their own performance bottlenecks, while dynamic voters carry higher control complexity along with different performance bottlenecks. A Network Management System (NMS) requires replication to increase reliability and availability. The software implementation of a Fault-Tolerant Network Management System (FT-NMS) presented here follows the NMR approach. The contributions of the paper concern the implementation issues of the system and measurements of its performance at the protocol and application levels. The measurements lead to the conclusion that the system can mitigate these bottlenecks while remaining simple by taking synchronization and buffer-management techniques into account.

Keywords: Network Management, Fault-Tolerant Distributed Applications, SNMP, NMR.

1. INTRODUCTION

Several local-area and wide-area networks rely on network management information for decision making and monitoring. Therefore, a Network Management System (NMS) has to maintain high availability and reliability in order to complete those tasks. Consequently, the implementation of an NMS has to be fault tolerant and include software replication. Moreover, fault masking and voting are techniques added to the system as a consequence of the replication [Prad97]. The most prevalent solutions in distributed systems for creating FT server models use replicated servers, redundant servers, and event-based servers [Lan98]. If replicated servers are chosen, the system selects between static and dynamic coordination. Static coordination does not require a leader-election mechanism, so fault masking through voting is easier to achieve. Dynamic coordination introduces protocol overhead, and leader-election and voting-coordination algorithms have to be included in every transaction. There are several replica control


protocols involving leader election; they depend on quorum size, availability, and system distribution, for example Coterie, the Quorum Consensus protocol, Dynamic Voting, Maekawa's, Grid, Tree, and Hierarchical protocols, described in [WU93] and [Beed95]. The algorithms mentioned above introduce overhead and can degrade the performance of the system in terms of communication or processing [SAHA93]. Other replica control protocols, defined in [Paris94] and [Sing94], address the placement of database replicas or of replicas across a network; however, the protocol complexity is not avoided and remains present. Triple Module Redundancy (TMR) and masking fault tolerance can be found in [Aro95] and [Brasi95]. A TMR voting protocol requires neither a leader nor leader-election protocols, but it does require message broadcasting from each node to all nodes in the group, so the protocol complexity lies in message ordering and coordination. Experimental measurements of the TMR nodes yielded total processing times between 50 and 200 ms using 100 messages of 64 bytes each. Static voting architectures require high reliability of the network node where the voter resides, because the voter becomes a single point of failure (SPF). However, the protocol overhead is light and fault recovery can be achieved almost instantaneously. These advantages support the use of a voter in high-performance networks and the implementation of static TMR systems such as the one presented in this paper.

In addition to the requirements of group communication, failure detection is generally done by combining a heartbeat with either a predefined [Maffe96], [Prad96], [Landi98] or a self-adapting [Agui97] timeout period. A predefined timeout was used in the FT-NMS implemented here. There are also other techniques for handling faults in network managers and agents, such as the hierarchical adaptive distributed system-level diagnosis (HADSD) defined in [Duar96].

Moreover, monitoring is one of the main tasks of any NMS. For this purpose a traditional approach was used, in which a set of replica managers is organized in a tree structure running at a set of network nodes. The managers monitor a set of agents using Simple Network Management Protocol (SNMP) requests as defined in [Rose90] and [Rose94]. Decentralization [Scho97] and database monitoring [Wolf91] were not implemented for this application.


This paper is organized in the following manner. Section 2 presents the assumptions used for the experiments and for the design of the system architecture. Section 3 describes the system model and its algorithms. Finally, Section 4 presents performance measurements run on different testbeds, namely Myrinet and an ATM LAN.

2. Assumptions

The FT-NMS application relies on a simple heartbeat error-detection mechanism and sender-based message logging to generate the monitored information (the replicas, not the voter, are responsible for initiating the communication). Failures are detected when a system stops replying to heartbeats, in which case it is considered to have a transient fault. If a transient fault lasts longer than the timeout period, the faulty unit is considered down and, consequently, unable to provide useful information to the system. In short, a machine fails by crashing, i.e., with fail-stop behavior.

A TCP/IP environment with naming services has to be available for the replica units. In addition, it is assumed that SNMP daemons are already installed and working correctly in all the agents to be monitored. The data types that can be retrieved from an agent's Management Information Base (MIB) are INTEGER, INTEGER32, and COUNTER, as defined in the ASN.1 standard [Feit97], [OSI87]. Each node should be in the same sub-network in order to avoid long communication delays between managers; otherwise the timeouts have to be adjusted. Finally, the heartbeat interval used is one second.

3. System’s Model

The system model comprises two sub-systems: the managers (Section 3.1) and the voter-gateway (Section 3.2). The managers depend on the voter, which also works as a coordinator. All the applications mentioned here are multithreaded client/server applications. The manager makes use of the Carnegie Mellon University SNMP Application Programming Interface (CMU-SNMP API). The voter runs on a separate, highly reliable computation node, while the replica managers run at different network nodes. The manager modules are shown in Figure 1.


Figure 1. SNMP Manager Application, using the CMU-API to handle SNMP packets for the agents

3.1. Manager

The manager application uses the CMU-SNMP API to issue snmpget packets to the agents. It has access to a local MIB database (Figure 1) handled by the HCS_SNMP object (Figure 2); this object is used to support the different MIBs found in the network management agents. In addition, a simple UDP echo server runs concurrently to handle the heartbeat service provided by the voter.

class HCS_SNMP {
private:
    struct snmp_session session, *ss;
    struct snmp_pdu *pdu, *response1;
    struct variable_list *vars;
    char *gateway, *community;
    int Port;
    oid name[MAX_NAME_LEN];
    int name_length;
    int HCS_SNMPCommunication(snmp_pdu* response, char** varname, char** value);

public:
    HCS_SNMP(char* gateway, char* community);
    ~HCS_SNMP();
    int HCS_SNMPGet(char** namesobjid, int number, char** varname, char** value);
};

Figure 2. Class description of the SNMP Object

The main goal of the FT-NMS is a distributed method for the reliable monitoring of MIBs handled at different agents. The system is designed to keep a heavyweight process running for each agent (Figure 1), and each heavyweight process is able to handle 64 simultaneous Object Identifiers (OIDs) from the MIB at any agent. The SNMP API provides the libraries and services to convert from Abstract Syntax Notation One (ASN.1) to the different data types used in C++. Each manager reads a table of the OIDs it has to monitor from the agents using polling; in addition, the manager defines a polling strategy and writes all the responses to a file. Furthermore, each manager creates and sends a TCP packet with the format shown in Figure 6 to the main voter application, reporting the results gathered from the agents being monitored.
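To make the HCS_SNMP interface of Figure 2 concrete, the following is a minimal, hypothetical usage sketch; the return-value convention, the ownership of the output buffers, and the assumption that the class declaration is in scope are illustrative choices, not details taken from the paper.

// Hypothetical usage of the HCS_SNMP wrapper of Figure 2.
#include <cstdio>

int poll_agent_once(char* agent, char* community) {
    // Two of the interface counters listed in Table 2, in numeric OID form.
    char* oids[2] = {
        (char*) ".1.3.6.1.2.1.2.2.1.10.3",   // ifInOctets.3  (CNTR32)
        (char*) ".1.3.6.1.2.1.2.2.1.16.3"    // ifOutOctets.3 (CNTR32)
    };
    char* names[2];    // assumed to be filled by the API with the variable names
    char* values[2];   // assumed to be filled with the decoded values as text

    HCS_SNMP snmp(agent, community);          // Figure 2 names the first parameter "gateway"
    int ok = snmp.HCS_SNMPGet(oids, 2, names, values);
    if (ok > 0)                               // assumed: a positive return means success
        for (int i = 0; i < 2; i++)
            std::printf("%s = %s\n", names[i], values[i]);
    return ok;
}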

Also, the sampling time and the number of samples to monitor are defined for every OID. In order to achieve accurate measurements, synchronization is required so that the polls are concurrent; otherwise the value sampled from agent_i at T_0, and at any subsequent delta T, will be misaligned


with the values sampled from agent_{i+1} by another replica. This misalignment may not matter for large sampling intervals, where the sampling time is several orders of magnitude greater than the misalignment, but since the sampling time in high-performance networks must be very small, synchronization has to be treated as a priority. Therefore, a Two-Phase Commit protocol is used to synchronize samples and to coordinate the group of replica monitoring applications (Section 3.1.1).

3.1.1. Sample synchronization through a Two Phase Commit Protocol (2PC).

As previously stated, the main weakness of a distributed network management system is time synchronization. Although the Network Time Protocol (NTP) could supply useful timing information, its network traffic and lack of precision are not suitable for high-performance networks (HPNs). A 2PC protocol is easier to implement and provides the required synchronization. In addition, the implementation of the heartbeat was merged with the 2PC, avoiding any possibility of deadlock (Figure 3b).
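For concreteness, the following is a minimal sketch of what the manager-side TwoPC() step of Figure 4 might look like against the voter's commit thread of Figure 3b; the port argument, the exact request and reply payloads ("COMMIT"/"GO"), and the error handling are assumptions.

// Sketch of the manager-side request-to-commit (assumptions noted above).
#include <cstring>
#include <unistd.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>

int TwoPC(const char* voter_host, int commit_port) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) return -1;

    hostent* h = gethostbyname(voter_host);      // naming service assumed (Section 2)
    if (h == 0) { close(s); return -1; }

    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(commit_port);
    std::memcpy(&addr.sin_addr, h->h_addr_list[0], h->h_length);

    if (connect(s, (sockaddr*) &addr, sizeof(addr)) < 0) { close(s); return -1; }

    write(s, "COMMIT", 6);                       // request-to-commit
    char reply[3] = {0, 0, 0};
    read(s, reply, 2);                           // blocks until the voter answers "GO"
    close(s);
    return (reply[0] == 'G' && reply[1] == 'O') ? 0 : -1;
}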

Monitoring at the manager is done using the pseudo-code in Figure 4. The manager waits to execute the sampling action (commit) on the agent until the voter delivers the commit packet to all non-faulty managers. The manager then delivers the SNMP Get packet to the agent, polling the required information, and finally transmits the sampled information to the gateway-voter element. Observe that a correction_factor is subtracted from the waiting time between samples. This keeps the sampling time accurate and reduces the error introduced by the time spent on synchronization and on the round-trip communication to the agent.

As can also be drawn from Figure 4, the minimum sampling time at the manager is given by eq. 1:

T_min = T_2PC + T_SNMPget + T_msgResponse    (eq. 1)
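As a purely illustrative reading of eq. 1 (the numbers below are hypothetical, not measurements from this paper): if T_2PC were 8 ms, T_SNMPget 20 ms, and T_msgResponse 5 ms, the smallest sampling interval the manager could honor would be roughly 33 ms; any configured sampling time below the sum of the three components cannot be met. The measured values of these components are reported in Section 4.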


Figure 3. Two-Phase Commit protocol used to achieve synchronization: (a) manager-side exchange and timing; (b) multithreaded commit handling at the voter

while (n_samples > I) {
    I++;
    Start = gethrtime();
    HCS_MAN->TwoPC();                           // TCP socket connection
    HCS_MAN->SNMPGet(OIDs, agent, &Response);   // UDP socket connection
    HCS_MAN->SendResponse(Response, Gateway);   // TCP socket connection
    Correction_factor = gethrtime() - Start;
    Wait(sampling_time - Correction_factor);
}

Figure 4. Pseudo-code executed at each replica manager

3.2. The voter–gateway (GW).

With different replicas monitoring the same agent, the voter collects all the non-faulty measurements. In order to achieve congruent results and generate a voted output, an instance of the voter-gateway has to be running on a highly reliable network node. The gateway, or voter, is shown in Figure 5. As mentioned above, the use of a voter avoids the implementation of complex leader-election schemes in replica management. Fault masking is easily achieved in transitions from N to N-1 replicas, and the processing delays are almost non-existent; in other words, any manager application can fail with only graceful degradation of the overall system. However, the voter becomes a performance bottleneck in the sense that all the traffic from the managers is directed to it, so its performance issues have to be identified and ways to improve them found.

[Figure 3 contents: (a) the manager-side exchange for one sample, with timing T_2PC + T_msg + T_SNMPget + Interval, where the manager only waits Interval seconds before taking the next sample; (b) the multithreaded 2PC commit at the voter: each manager's request-to-commit is served by a bounded commit thread, while an accepting-commit LWP monitors a timeout, so that if one of the available managers fails to commit the whole operation fails by timeout and a CANCEL is sent. The goal is to use the 2PC to achieve synchronicity between the managers and the voting application. The per-connection commit thread of panel (b) follows.]

// Thread content of Figure 3b: one commit thread per manager connection
void* commit(void* sockdesc) {
    read(sock_desc, commit_request);
    P(mutex);
    n_commits++;
    V(mutex);
    wait_until(n_commits == n_available_managers);
    write(sock_desc, "GO");
    P(mutex);
    n_commits--;
    V(mutex);
}


Figure 5. Gateway Application Architecture.

As shown in Figure 5, several objects work together to achieve total monitoring, that is, N replica managers monitoring M agents concurrently. The voter-gateway was tested using two approaches: asynchronous messages from the managers, and total synchronization using the 2PC protocol [Chow97]. The communication frame between a manager and the voter is presented in Figure 6.

Figure 6. SNMP information frame, manager-to-gateway (total length: 560 bytes): Manager Name (16 bytes), Time Stamp (16 bytes), Agent Name (16 bytes), OID in octet format (256 bytes), and Value Measured (256 bytes), carried in a TCP segment
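A minimal C++ sketch of this fixed-size frame is shown below; the field offsets follow the strncpy() offsets used by fillbuffer() in Figure 9a (0, 16, 32, 48, 304), while the struct name is an illustrative assumption.

// Sketch of the 560-byte manager-to-voter frame of Figure 6 (name assumed).
struct ManagerToVoterFrame {
    char manager[16];      // replica manager name
    char timestamp[16];    // time stamp taken at the manager
    char agent[16];        // monitored agent name
    char oid[256];         // Object Identifier in octet (dotted) format
    char value[256];       // value measured for that OID
};

static_assert(sizeof(ManagerToVoterFrame) == 560,
              "layout must match the 560-byte frame of Figure 6");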

As shown here, the voter has local objects that are initialized with the information of all the replicas, such as the OIDs monitored by each manager and, for each agent, its name and the parameters to be measured from it. The overall scenario is drawn in Figure 7.

[Figure 5 contents: server side with a heartbeat thread (1-second timing), local objects, UDP and TCP sockets, a reception thread that spawns one thread per TCP connect(), a buffer of received data, one voting thread per monitored agent, per-agent voted databases, and a general log file; the client side connects the main application to the managers and agents.]


Figure 7. Distributed fault-tolerant network managers and the gateway accessing a network agent

First, the voter creates, via RPC calls, NxM instances of managers (hcs_snmp_man objects) across the network nodes, where M is the number of agents to be monitored by the voter and N is the number of replicas. There are K servers that perform as network nodes, or peers, for the remote execution of the NxM instances.

Then, once all the instances are executing, each replica manager polls its corresponding agent. The replica gathers the responses for all the OIDs it is defined to monitor and, as a consequence, generates the manager-to-voter frame with the gathered information (Figure 6). The tasks performed at the voter are summarized as follows (a sketch of the connection-handling step appears after the list):

a) The voter reads all the configuration files describing the agents to monitor and the OIDs to be read by the managers, and generates the local objects of the environment.

b) Remote shells are executed on each of the nodes of the system.

c) The voter activates the threads for heartbeat processing and failure detection, and the TCP port listener for SNMP results and commits.

d) Concurrently, the replica managers communicate with the agents, collecting network management information.

e) The replica managers send the queried information to the voter, using the frame of Figure 6.

f) The voter handles all the arriving information using one of two techniques, a linear buffer or a hash table. Simultaneously, local threads are created per agent; each thread digs into the collected data and generates the voted result for the agent it monitors.
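The sketch below illustrates steps (c), (e), and (f): a listener on the voter accepts one TCP connection per replica manager and hands the socket to a fillbuffer() thread (Figure 9a), one bound thread per connect(), as drawn in Figure 5. The port argument, the backlog, and the error handling are assumptions.

// Sketch of the voter's TCP listener (assumptions noted above).
#include <cstring>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <thread.h>                          // Solaris threads, as in Figures 9 and 10

extern void* fillbuffer(void* sock_desc);    // defined in Figure 9a / Figure 10

void accept_loop(int port) {
    int listener = socket(AF_INET, SOCK_STREAM, 0);

    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(port);
    bind(listener, (sockaddr*) &addr, sizeof(addr));
    listen(listener, 16);

    for (;;) {
        int sock = accept(listener, 0, 0);   // one connection per replica manager
        if (sock < 0) continue;
        thread_t tid;
        // one bound thread per TCP connect(), as in Figure 5
        thr_create(0, 0, fillbuffer, (void*)(long) sock, THR_BOUND, &tid);
    }
}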

Each message that arrives at the voter is converted into the format of the Msgformat class shown in Figure 8.

[Figure 7 contents: the voter/NM proxy with its voted database, heartbeat/rpc-shell handling, and collection of votes; replica managers, each with a database of local agents, issuing SNMPGET requests and returning per-OID votes over the ATM network; and the network agent running snmpd with its MIB.]


class Msgformat {
public:
    int marked;               // message marked to be deleted
    char* manager;            // manager name
    char* agent;              // agent owner of the information
    char* timestamp_client;   // timestamps at client and server (manager and GW)
    char* timestamp_server;
    char* OID;                // Object Identifier according to the MIB
    char* SNMPResponse;       // response from the agent
    hrtime_t start;           // for performance measurements
    Msgformat();
    ~Msgformat();
};

Figure 8. Msgformat class used at the voter to collect the information from the replica managers.

The Msgformat instance is stored either in the buffer or in the hash table. A thread (fillbuffer in Figure 9a) is created for each message stream arriving from the replica managers, so if all the replica managers sample concurrently the number of threads created will be N_Agents*N_Managers. The access policy to the shared structure (buffer or hash table) used by the threads is round-robin. In addition to these threads serving the replicas, there are "voter" threads created for each agent. Each voter thread is in charge of pulling from the buffer and generating the voted output. This process can be viewed as a join between two tables: the first composed of the agents and OIDs to monitor, and the "table" of messages, which is the buffer. The JOIN operation executed is msg->OID == agents[thread_id].OID[j] && msg->agent == agents[thread_id].agentname (Voter in Figure 9b).

void* fillbuffer(void* sock_desc) {
    while (message[1] != 'q') {
        if (read(sock_desc, &message, SIZE_MSG) == SIZE_MSG) {
            msg = new Msgformat;
            msg->start = gethrtime();
            strncpy(msg->manager, message, 16);
            strncpy(msg->timestamp_client, message + 16, 16);
            strncpy(msg->agent, message + 32, 16);
            strncpy(msg->OID, message + 48, 256);
            strncpy(msg->SNMPResponse, message + 304, 256);
            gettimeStamp(msg->timestamp_server);
            msg->marked = 0;
            P(db);
            buffer->append((void*) msg);
            V(db);
        }
    }
    close(sock_desc);
    thr_exit((void*) 0);
}

void* Voter(int agent_id) {
    while (NOT(cancel)) {
        for each agents[agent_id].OID[j] do {
            P(db);
            k = 0;
            while (buffer->length() >= k) {
                k++;
                msg = buffer->pop();
                if (msg->agent == agents[agent_id].agentname &&
                    msg->OID == agents[agent_id].OID[j]) {
                    T_buffer->append(msg);
                }
            }
            if (T_buffer->length() == getAvailableManagers()) {
                agents[agent_id].file << Tstamp << Average_Values(T_buffer);
                delete_elements_in_buffer();
            }
            delete T_buffer;
            V(db);
        }
    }
}

Figure 9. Threads for (a) filling and (b) removing elements from the linear buffer (messages from the replica managers to the voter)

The pseudo-code for the process of filling the buffer is shown in Figure 9a, and that of the voter threads in Figure 9b. As seen here, the voting function depends upon the number of available managers,


getAvailableManagers(), which is used to determine whether the number of messages in the buffer is valid or not. In case of a failure of a replica manager, the number of messages will be greater or less than the number of available managers, so the sample is simply lost and not processed; only with the next sequence of messages arriving in the queue does processing continue normally. As mentioned before, the access method for all the voter threads is a round-robin sequence, and they share the context with all the fillbuffer() threads generated upon the arrival of the manager-to-voter SNMP packets (Figure 6).
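The voting step itself is only named in the figures (Average_Values()); the sketch below treats the voted output for one OID as the arithmetic mean of the values reported by the available managers, which is an assumption about its behavior, and it takes the collected messages as a plain array rather than the T_buffer list of Figure 9b.

// Sketch of Average_Values(): mean of the values reported by the available
// managers for one OID (the averaging behavior is assumed, not specified).
#include <cstdlib>

// Msgformat is the class of Figure 8 (declaration assumed to be in scope).
double Average_Values(Msgformat** msgs, int n_msgs) {
    if (n_msgs <= 0) return 0.0;
    double sum = 0.0;
    for (int i = 0; i < n_msgs; i++)
        sum += std::strtod(msgs[i]->SNMPResponse, 0);  // CNTR32/INTEGER arrives as text
    return sum / n_msgs;
}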

Instead of using the linear buffer (Figures 9a and 9b), this structure can be substituted by a doubly hashed array whose hashing functions are based on the OID and the agent name (see Figure 10).

void* VOTER_h(void* agent_id) {
    while (NOT(cancel)) {
        j = 0;
        for each agents[agent_id].OID do {
            P(db);
            j++;
            if (Hashed_buf[agent_id][j]->length() >= getAvailableManagers()) {
                agents[agent_id].file << Tstamp << Average_Values(Hashed_buf[agent_id][j]);
                delete_elements_in_buffer(agent_id, j);
            }
            V(db);
        }
    }
}

void* fillbuffer(void* sock_desc) {
    Msgformat* msg;
    char message[600];
    int k, m;
    while (message[1] != 'q') {
        if (read((int) sock_desc, &message, SIZE_MSG) == SIZE_MSG) {
            msg = new Msgformat;
            msg->start = gethrtime();
            strncpy(msg->manager, message, 16);
            strncpy(msg->timestamp_client, message + 16, 16);
            strncpy(msg->agent, message + 32, 16);
            strncpy(msg->OID, message + 48, 256);
            strncpy(msg->SNMPResponse, message + 304, 256);
            gettimeStamp(msg->timestamp_server);
            P(db);
            m = getAgentIndex(msg->agent);   // hashing functions
            k = getOIDIndex(msg->OID, m);
            if ((m > 0) && (k > 0)) {
                Hashed_buf[m][k]->append((void*) msg);
            } else {
                ERROR();
                delete msg;
            }
            V(db);
        }
    }
    close((int) sock_desc);
    thr_exit((void*) 0);
}

Figure 10. Threads for filling and reading the doubly hashed table

As shown here, the number of iterations is reduced from O(n) for the linear buffer to approximately O(log n) for the hashed array. Dynamic voting is achieved in a similar way to the linear buffer; the difference is that instead of losing the sample, the hash structure temporarily holds more than the allowed number of elements, given by getAvailableManagers(). Therefore, in the iteration following a failure an error of one sample is introduced into the measurement, but all subsequent measurements proceed normally. The failure-detection system and the definition of the number of available managers are presented in Section 3.3.
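A sketch of the two index functions used in Figure 10 follows; because the mapping is one-to-one and defined by the voter's own configuration, they reduce to lookups of the configured agent and OID tables. The agents[] layout shown (field names taken from Figure 9b) and the -1 error convention are assumptions.

// Sketch of the one-to-one "hash" functions of Figure 10 (assumptions above).
#include <cstring>

struct AgentEntry {
    char agentname[16];     // as referenced in Figure 9b
    char OID[64][256];      // up to 64 OIDs per agent (Section 3.1)
    int  n_oids;
};

extern AgentEntry agents[];
extern int n_agents;

int getAgentIndex(const char* agent_name) {
    for (int m = 0; m < n_agents; m++)
        if (std::strncmp(agents[m].agentname, agent_name, 16) == 0)
            return m;                        // first key of the doubly hashed array
    return -1;                               // unknown agent -> ERROR() path
}

int getOIDIndex(const char* oid, int m) {
    if (m < 0) return -1;
    for (int k = 0; k < agents[m].n_oids; k++)
        if (std::strncmp(agents[m].OID[k], oid, 256) == 0)
            return k;                        // second key: position of the OID
    return -1;
}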


3.3. Heartbeat and Status of the Nodes, Managers, and Agents

As stated in the assumptions, heartbeats are issued to the managers every second (or at an interval defined at compile time). A simple echo server runs on each node, and a timeout mechanism is used to switch a manager in the set of managers from the NORMAL state into the FAULTY state. If a second timeout is reached and the manager or node has not returned to the NORMAL state, it is erased from the group and declared DOWN. The number of available managers is decreased by one as soon as a FAULTY state is detected.

A recovery action to keep the number of available managers above a threshold can easily be achieved by finding the next available network node and starting the monitoring application on it with a remote shell command. Agents are likewise considered NORMAL, FAULTY, or DOWN. An agent is declared FAULTY after a timeout during which the manager receives no SNMP responses, or only null responses, from it. The agent switches from the FAULTY state into the DOWN state after a second timeout; the corresponding manager then kills itself and the agent is no longer monitored.
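The state transitions described above can be summarized with the small sketch below; the missed-heartbeat counter and the exact bookkeeping are illustrative assumptions, since the paper only fixes the three states and the two-timeout rule.

// Sketch of the NORMAL / FAULTY / DOWN transitions (assumptions noted above).
enum ManagerState { NORMAL, FAULTY, DOWN };

struct ManagerStatus {
    ManagerState state;
    int missed;               // consecutive heartbeat timeouts
};

// Called once per heartbeat round (one second) for every manager in the group.
void update_state(ManagerStatus& m, bool replied, int& n_available_managers) {
    if (m.state == DOWN) return;              // already erased from the group
    if (replied) {                            // echo came back within the timeout
        if (m.state == FAULTY) n_available_managers++;   // transient fault cleared
        m.state  = NORMAL;
        m.missed = 0;
        return;
    }
    m.missed++;
    if (m.state == NORMAL) {                  // first timeout: transient fault
        m.state = FAULTY;
        n_available_managers--;               // getAvailableManagers() drops by one
    } else if (m.missed >= 2) {               // second timeout: declared DOWN
        m.state = DOWN;
    }
}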

4. Experiments, Fault Injection and Performance Measurements.

Performance experiments were executed at the HCS Lab over an ATM LAN and a Myrinet SAN, using the following workstation architectures as nodes:

• Ultra 30/300, 128 MB of RAM (managers)

• Ultra 2/200, 256 MB of RAM (gateway station, managers, and agents)

• Ultra 1/170, 128 MB of RAM (managers and agents)

• SPARCstation 20/85, 64 MB of RAM (managers and agents)

• An ATM Fore hub and a Fore ATM switch

All the measurements were made in terms of the latency added by the voting and monitoring processes, on both sides: the manager and the voter-gateway.


4.1. Performance of SNMP at the managers.

Testbed measurements were made to determine the latency of retrieving different OIDs with the CMU-SNMP API over the Myrinet and ATM LANs.

Figure 11. Latency measurements using CNTR32 and OCTET STRING data types in the Myrinet and ATM testbeds

As shown in Figure 11, the latency of different combinations of SNMPGET commands using the CMU-SNMP API grows steadily as the number of OIDs is increased.

[Figure 11 panels: round-trip latency of SNMP using OCTET STRING data types (1 to 32 OIDs, Myrinet vs. ATM, in ms); round-trip latency of SNMP using CNTR32 data types (1 to 64 OIDs); time distribution of an SNMPGET request at the manager using CNTR32 data types (encoding, decoding, application, protocol and agent); and the same time distribution using OCTET STRING data types.]


Panels (a) and (c) were produced using the OCTET STRING data type, and panels (b) and (d) using the CNTR32 data type. On average, CNTR32 and INTEGER requests issued with the SNMPGET frame cover more than 85% of all requests; for this reason, the performance experiments run against the agents included only CNTR32 data types.

The percentage of time spent processing each request to the agent, split among ASN.1 encoding/decoding, the application, and the protocol, is shown in Table 1. The protocol/agent component accounts for 43.8% to 94.9% of the overhead. This turns out to be the first performance bottleneck found in the whole process. It is important to remember that at the agent the ASN.1 encoding/decoding overhead is also incurred, and in addition the agent has to access its control and status registers and create the frame. The experiments were run on the Myrinet link between two Ultra 2/200s with 256 MB of RAM and, for ATM, on one Ultra 2/200 polling information from the ATM Fore switch.

Table 1. Processing overhead distribution at the manager

                       1 CNTR32 (ATM)   64 CNTR32 (ATM)   1 CNTR32 (Myrinet)   64 CNTR32 (Myrinet)
encoding                    1.06%             2.41%              1.58%                 1.32%
decoding                    1.78%             4.74%              1.97%                 2.33%
application                40.98%             3.04%             52.65%                 1.44%
protocol and agent         56.18%            89.80%             43.80%                94.90%

No. of samples: 500; sampling time: 1 second


Figure 12. Average latency at the replica managers for different levels of replication and different numbers of monitored agents

In Figure 12, both (a) and (b) correspond to measurements made at the manager; here the component labeled SNMPGET corresponds to the whole process described in Figure 11. For these measurements the number of OIDs was fixed at eight, and the eight OIDs selected are those listed in Table 2. The main constraint for all the experiments is that the sampling time is fixed and the number of samples is exactly the same for all the agents being monitored. As a consequence, the TCPconnect and latency-to-voter components correspond to the transmission of the eight OIDs from the replica manager.

Table 2. OIDs selected for the performance experiments

OID                                                                              ASN.1 data type

.iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifInOctets.3 CNTR32

.iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifInOctets.4 CNTR32

.iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifInOctets.5 CNTR32

.iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifInOctets.6 CNTR32

.iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifOutOctets.3 CNTR32

.iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifOutOctets.4 CNTR32

.iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifOutOctets.5 CNTR32

.iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifOutOctets.6 CNTR32

[Figure 12 panels: (a) latency at the manager with 1 replica (0 to 20 agents) and (b) latency at the manager with 5 replicas (0 to 10 agents); each plots the 2PC, SNMPGET, TCPconnect, and latency-to-the-voter components in ms.]


These entries correspond to the number of octets in and out of the different devices monitored by the snmpd at each workstation polled by the manager. According to the results in Figure 12, the Two-Phase Commit (2PC) protocol and the SNMPGET account for more than 95% of the overhead. Table 3 also shows that the SNMPGET covers more than 65% of the process, compared with the 2PC, which occupies less than 30%. The algorithm is shown in Figure 4.

Table 3. Comparison of 2PC and SNMPGET at the manager for different numbers of agents and replicas (percentage of utilization)

                  2 agents                 16 agents
Replicas      2PC       SNMPGET        2PC       SNMPGET
   1         23.80%      69.50%       32.45%      62.82%
   2         23.10%      68.10%       24.50%      47.10%
   3         21.80%      72.07%       25.90%      66.45%
   4         27.83%      76.87%       28.22%      63.45%
   5         27.14%      60.60%       29.80%      60.60%

Therefore, the overhead introduced by the manager has to be compensated for and taken into account when defining the minimum sampling time stated in Section 3.1, eq. 1.

4.2. Testbed experiments using the 2PC.

Adding the 2PC protocol to each manager introduces an overhead of roughly 25% of the total processing time at the manager. It is important to point out that the "total processing time" here refers to the voter system and includes the inter-arrival time between gateway-voter SNMP messages, the searching time in the shared buffer, and the corresponding I/O to disk. In addition, the total processing time is per thread; in other words, N_Agents threads are able to process the incoming OIDs concurrently within the average time mentioned above.


Figure 13. Inter-arrival time and total processing time at the voter

In both cases the dominant factor is the inter-arrival time of messages to the queue. This time is the difference between the arrival at the shared buffer of a message identifying an OID and the moment at which the last message coming from a different manager arrives at the shared structure. It is important to remember that access to the structure is shared between the threads filling the buffer, fillbuffer(), and the voters (one per agent); thus the inter-arrival rate is affected by the processing time of the voters, that is, by all the "join" processes mentioned in Section 3.2, Figure 9. The number of messages received per sample is N_Managers*N_Agents*N_OIDs; in other words, a system with 16 agents, 5 managers, and 8 OIDs will send 640 messages to the voter.

[Figure 13 panels: total processing time (TPT) of eight concurrent OIDs at the voter and inter-arrival time of eight concurrent OIDs at the voter, both versus the number of agents (0 to 18) for 1 to 5 managers, in ms.]


Each message has a fixed size of 560 bytes (Figure 6), which represents 358,400 bytes received by the voter in that particular iteration. For instance, in an iteration with two replicas and 16 agents the average processing time is 500 ms, and with 8 OIDs per agent the minimum sampling time is therefore about 4 seconds (8 x 500 ms). Any shorter sampling time leads to erroneous monitoring, since the results cannot keep up with the sampling interval. To avoid these problems, the sampling time for the five-manager experiments was set to 30 seconds and the number of samples to 20.

4.2.2. Testbed experiments with the hash table.

In order to reduce the search time in the shared buffer structure and have that reduction reflected in the performance of the whole application, the linear search was replaced by the hash table (Figure 10). The results of using the hash table are shown in Figure 14. Reducing the search time leaves more time for the other threads to run and shrinks the sections of the buffer for which semaphore-protected access is required. The hash functions are very simple and the mapping is one-to-one, since it is the voter that defines the managers and agents to be monitored. In comparison with the linear search used in the previous structure, an improvement of 50% was achieved with three or more replicas, while the results remain unchanged for one or two replicas. An interesting behavior appears with 4 or 5 replicas, where the variation between processing times is no more than hundreds of milliseconds, which is expected given the nature of the search time in the hash table.


Figure 14. Inter-arrival time and total processing time at the voter using the hashed table

4.2.3 Testbed experiments with an asynchronous system

If the 2PC protocol is not included, each manager becomes fully independent and its overhead is reduced by more than 20%. However, the absence of the replica control protocol causes the system performance to degrade drastically. The total processing time measured at the voter ranges from a minimum of 200 ms to a maximum of 2.2 seconds (with 4 replicas and 16 agents). Therefore, in this particular system, the loss of synchronization drastically degrades performance. Observe from Table 4 and Figure 15 that this degradation is on the order of 60% with respect to the hash-table method using the 2PC. Preliminary measurements showed that the degradation is even worse if the hash table is replaced by the linear search over the shared buffer structure.

[Figure 14 panels: total processing time (TPT) and inter-arrival time of eight concurrent OIDs at the voter using the hashed table, for 1, 2, 4, 8, and 16 agents and 1 to 5 managers, in ms.]


Table 4. Total processing time without synchronization, using the hash table

                                Total processing time (ms)
                                    Number of replicas
No. of agents   No. of OIDs       1        2        3        4
      1              8           220      436      483      461
      2              8           297      556      580      926
      4              8           345      778      890     1207
      8              8           322      800     1137     1659
     16              8           344      975     1325     2210

As in the previous two sections, the total processing time grows with the number of agents; in Figure 15 the same behavior is observed, but in greater proportion.

Figure 15. Average total processing time (TPT) introduced at the voter in the asynchronous system.

4.3 Comparison between “non-voted” and “voted” outputs.

One of the main goals of a fault-tolerant system is transparency for every measurement, i.e., minimizing the noise added as a product of the replication. In order to determine whether the FT system reduces or gracefully degrades the accuracy of the measurements, Figure 16 shows a sample of voted and non-voted measurements.

[Figure 15: total processing time of eight concurrent OIDs at the voter versus the number of agents (0 to 18), for one to four replicas, in ms.]


The experimental conditions for Figure 16 are three replicated managers monitoring a Fore ATM router/hub (hcs-gw).

Figure 16. Fault-free voted and non-voted measurements of the Gb-Ethernet port at hcs-gw (router), using .iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifOutOctets.3

As shown in Figure 16, the replica at dagger is taken as the point of reference. The collected data show that the error between the voted and non-voted measurements is not greater than 0.03%, and this behavior is also visible qualitatively in the figure.

4.4. Performance Degradation after Fault Injections

One of the main assumptions of the FT system is the fail-stop failure model. After the failure of one of the managers, monitoring should continue by means of fault masking. The experiments run on the testbed consisted of a set of five managers and one agent, using 8 OIDs per request. Faults are injected by killing the remote shell; after that, the system depends on the heartbeat service for failure detection. Figure 17 shows how the values at the voter are modified after a failure in one of the managers. The measurement starts with five managers, and every minute the number of managers is decreased by one.

[Figure 16 plot: voted and non-voted data collected by one of the replica management applications (dagger, an Ultra 2); ifOutOctets of the GB port at hcs-gw (router) versus time in seconds, with curves Voted-IfOutOctets and Dagger-IfOutOctets.]


As seen there, the voted and the directly measured values differ only slightly. In fact, after a failure the value at the time of the measurement is lost, and an interpolation is required to determine the sample in the gap between the system having N and N-1 replicas.

Figure 17. Comparison of voted and non-voted results for ifInOctets after fail-stop faults at the managers

[Figure 17 plot: comparison of the voted output versus one replica manager's local information under fault injection; ifInOctets of the GB port at hcs-gw (router) versus the number of non-faulty managers (5 down to 1), with curves Voted-ifInOctets and Dagger-ifInOctets.]


The behavior of the throughput is presented in Figures 18a and 18b; in both cases the reference is the manager at the node dagger.

Figure 18. Input and output throughput measured from one of the router ports under fault injection

In both cases the graph generated by dagger is closely followed by the graph generated by the voter. Observe that graceful degradation is achieved by avoiding gaps between two or more samples. Moreover, when only one manager (dagger) is left, both graphs are, as expected, exactly the same, as shown in the figure.

5. Future Work

Group communication is very important for maintaining concurrency at the application level, and there are several possible improvements to the system. The use of lightweight agents, replacing UDP sockets with XTP [ ] or SCALE messages [George98a], would decrease the overhead at the protocol layer in every manager. In addition, the message interchange between the voter and the managers could be done using multicast lightweight communication. Reducing the thread context switching and speeding up the shared-buffer access at the voter can also be achieved.

[Figure 18 plots: output throughput and input throughput (octets/sec) measured at the voter and from the local information at the replica on dagger, versus time slots of 10 seconds labeled with the number of non-faulty managers (5 down to 1); curves Voted-IfOutOctets/s vs. Dagger-IfOutOctets/s and Voted-IfInOctets/s vs. Dagger-IfInOctets/s.]


As a network management application, the voter should be able to re-run or re-schedule one of the replicas after detecting a dead node. A combination of leader-election replica management and voting could also reduce the load on every management node. A lightweight CORBA framework [George98b] could be added to communicate with the agents instead of using SNMP alone. Finally, one of the major improvements for an FT application is the use of faster data structures and an SQL engine [Wolf91] to relate the parameters from the replicas to the originals and act as a failure detector, which would also provide built-in checkpointing and error recovery.

6. Conclusions

• FT distributed applications require a well-defined, efficient replica communication protocol. The combination of fault detection and the 2PC protocol provides a simple way to achieve synchronization.

• The timing and latency overheads at the voter and the managers have to be taken into consideration, especially when defining small sampling intervals in high-performance networks.

• The use of a voting system provides an efficient way to monitor and to gracefully degrade measurements in a network management application.

7. Acknowledgements

The authors thank the members of the HCS Lab for their comments and reviews.


8. References

[Agui97] M. Aguilera, W. Chen, S. Toueg. "Heartbeat: A Timeout-Free Failure Detector for Quiescent Reliable Communication", Cornell University, July 1997.

[Alvi96] L. Alvisi, K. Marzullo. "Message Logging: Pessimistic, Optimistic and Causal", IEEE Int. Symp. on Distributed Computing, pp. 229-235, 1995.

[Aro95] A. Arora, S. Kulkarni. "Designing Masking Fault Tolerance via Non-Masking Fault Tolerance", IEEE Symposium on Reliable Distributed Systems, 1995, pp. 174-185.

[Beed95] G. Beedubahil, A. Karmarkar, U. Pooch. "Fault Tolerant Object Replication Algorithm", TR-95-042, Dept. of Computer Science, Texas A&M University, October 1995.

[Begu97] A. Beguelin, E. Seligman, P. Stephan. "Application Level Fault Tolerance in Heterogeneous Networks of Workstations", Journal of Parallel and Distributed Computing, Vol. 43, pp. 147-155, 1997.

[Brasi95] F. Brasileiro, P. Ezhilchelvan. "TMR Processing without Explicit Clock Synchronisation", IEEE Symposium on Reliable Distributed Systems, 1995, pp. 186-195.

[Doer90] W. Doeringer, D. Dykeman, et al. "A Survey of Light-Weight Transport Protocols for High-Speed Networks", IEEE Trans. on Communications, Vol. 38, No. 11, pp. 2025-2035, 1990.

[Dol97] S. Dolev, A. Israeli, S. Moran. "Uniform Dynamic Self-Stabilizing Leader Election", IEEE Transactions on Parallel and Distributed Systems, Vol. 8, No. 4, April 1997, pp. 424-440.

[Duar96] E. Duarte, T. Nanya. "Hierarchical Adaptive Distributed System-Level Diagnosis Applied for SNMP-Based Network Fault Management", IEEE Symposium on Reliable Distributed Systems, 1996, pp. 98-107.

[Feit97] S. Feit. "SNMP", McGraw-Hill, NY, 1997.

[George98a] Paper being fixed by Dave and Tim.

[George98b] Luises Thesis.

[John97] D. Johnson. "Sender-Based Message Logging", IEEE Fault-Tolerant Computing Symposium, pp. 14-19, 1987.

[Landi98] S. Landis, R. Stento. "CORBA with Fault Tolerance", Object Magazine, March 1998.

[Maffe96] S. Maffeis. "Fault Tolerant Name Server", IEEE Symposium on Reliable Distributed Systems, 1996, pp. 188-197.

[OSI87] OSI. Information Processing Systems – Specification of the Abstract Syntax Notation One (ASN.1), ISO 8824, December 1987.

[Paris94] J.-F. Paris. "A Highly Available Replication Control Protocol Using Volatile Witnesses", IEEE Intl. Conference on Distributed Computing Systems, 1994, pp. 536-543.

[Prad96] D. Pradhan. "Fault Tolerant Computer System Design", Prentice Hall, NJ, 1995.

[Rose90] M. Rose and K. McCloghrie. "Structure and Identification of Management Information for TCP/IP-based Internets", RFC 1155, 1990.

[Rose94] M. Rose. "The Simple Book – An Introduction to Internet Management", 2nd edition, Prentice Hall, Englewood Cliffs, NJ, 1994.

[Saha93] D. Saha, S. Rangarajan, S. Tripathi. "Average Message Overhead of Replica Control Protocols", IEEE Intl. Conf. on Distributed Computing Systems, pp. 474-481, 1993.

[Scho97] J. Schowalder. "Network Management by Delegation", Computer Networks and ISDN Systems, No. 29, 1997, pp. 1843-1852.

[Sing94] G. Singh, M. Bommareddy. "Replica Placement in a Dynamic Network", IEEE Intl. Conference on Distributed Computing Systems, pp. 528-535, 1994.

[Wolf91] O. Wolfson, S. Sengupta, Y. Yemini. "Managing Communication Networks by Monitoring Databases", IEEE Transactions on Software Engineering, Vol. 17, No. 9, September 1991, pp. 944-953.

[WU93] C. Wu. "Replica Control Protocols that Guarantee High Availability and Low Access Cost", Ph.D. thesis, University of Illinois at Urbana-Champaign, 1993.

[XU96] J. Xu, B. Randell, et al. "Fault Tolerance in Concurrent Object-Oriented Software through Coordinated Error Recovery", FTCS-25 submission, University of Newcastle, 1996.