RunTimeAwithRef1.doc

Project Description

1.0 Introduction

A common vision of the future is one where our everyday environments are replete with smart cyber physical objects networked to form complicated systems of systems. People will interact with these embedded systems both explicitly and implicitly. The systems will be heterogeneous, need to exist for many years, and operate in the context of real world communication, sensing and failure realities. Many of the systems will be unattended (at least for large periods of time) and often performing very important tasks. The systems will be open in the sense that they will permit access to their functions from humans and other cyber physical systems. The current rapid development and deployment of wireless sensor networks and ubiquitous computing systems and their interactions are exacerbating the need for high confidence embedded systems.

Achieving high confidence embedded systems will require new assurance technologies both off-line and on-line. For off-line solutions we expect to utilize formal methods and new analysis techniques to produce high quality software. However, even when these off-line solutions are effective there will still be a great need for run time assurance technologies because these systems operate in the noisy, error-prone physical world. Run time assurance is given by explicitly added software that demonstrates (periodically or on demand) that the system is capable of providing its important services. Most current solutions deal with faults and reliability and not with application level semantics and associated assurances. The few works that do address run time assurance at the application semantics level are preliminary or are primarily developed for other purposes such as debugging or activity recognition and hypothesized to also be useful for run time assurances. This proposal addresses developing comprehensive solutions for run time assurances in high confidence embedded systems.

Consider the following motivating application example. New, low cost wireless sensor networks (WSN) can be embedded into large city skyscrapers to support fire detection and reaction. Such systems must reliably detect a fire on any floor, activate alarms, notify fire stations and announce and illuminate egress routes. These systems are passively monitored for hazards and are unattended. However, such systems require high confidence in their operation and must also be able to demonstrate that they are operational on a periodic inspection basis (at a minimum). This is a very difficult and important problem, e.g., there may be 100 floors, with over 100 nodes per floor all operating with complex semantics and policies. It is critical to understand the operation of the system. Many surveillance and tracking systems, pollution monitoring, and medical applications have similar high confidence requirements.

Many embedded systems today employ fault tolerance, self-healing, health monitoring and various other reliability mechanisms. These are considered as part of the core system requirements. However, we are adding a layer of requirements that specifically addresses what is needed to demonstrate the critical functionality of the system. It is necessary to understand how these sets of requirements interact and how the necessary run time support for these sets of requirements can leverage each other. We propose to develop an understanding and solution for this problem by creating a framework and embedding it in multiple different applications.

Our proposed research is novel in several ways. First, we develop a requirements language that permits designers to specify, via a combination of declarative statements, invariants, and rules, the run time assurances required for high confidence. The language addresses application semantics, the statistical nature of WSN, costs, future predictions on system performance, and monitoring needs for various mechanisms including data mining. Second, we develop a run time assurance framework that supports specific demonstrations of a system’s key capabilities on demand and offers a well defined set of diagnosis and repair capabilities when the system fails to meet its assurances. Third, we develop and use various run time mechanisms in novel ways including virtual event generation, real event replay and data mining. Fourth, an implementation and evaluation for multiple application domains are undertaken. Note that off-line high confidence techniques and issues regarding security attacks are outside the scope of this proposal.

The main intellectual contributions are determining how to specify and support at run time a collection of solutions that enable embedded systems to improve confidence and demonstrate application operability. The broad impact of this work can be extensive since there is a proliferation of embedded systems being deployed or contemplated for critical applications such as fire fighting, pollution control, disaster response, tracking, military surveillance, and medical assistance. Providing systems with the capability to certify that they are operational is one of the next steps to seeing such technology widely used. Without effective run time assurances, systems will be unsafe or just not deployed in many situations.

For educational purposes we will create and incorporate a run time assurance class module into two current course offerings at UVA: CS-451 Wireless Sensor Networks and CS-651 Cyber Physical Systems. We will also make the corresponding teaching materials (slides and labs) available for use at other Universities via the Web. We will also offer graduate seminars dedicated to high confidence embedded systems. The PI will utilize the School of Engineering Office of Minority Affairs to match students with this research.

We believe that we are well positioned to succeed in this research because of our extensive experience in WSN and, in particular, our experiences with building robust WSN. In WSN, our experience and success in the DARPA NEST project forms a basis for this research. In the DARPA project we designed, built and evaluated a wireless sensor network of 203 nodes which performed surveillance, tracking and classification. The system, called VigilNet, was successfully demonstrated many times at Fort MacDill, Avon Park, Berkeley, the Rayburn House, and at the University of Virginia. A version of the system is now classified and being developed for deployment by Northrop-Grumman. Many novel ideas were developed in this system regarding self-healing, power management, sensor fusion, localization, routing, and data aggregation. Over 50 papers were published including three best paper awards [24][25][80] and a nomination for best paper [31]. Our work on this system also clearly identified the key issues that need to be addressed for high confidence solutions under severe constraints and in noisy real-world cyber-physical environments. We have also constructed two other wireless sensor network testbed systems: AlarmNet [79][80] (an assisted living system) and Luster [65] (an environmental science application). These systems also demonstrate the need for run time assurances. We have also developed Envirolog (a run time event replay service) [54] and an on-line monitoring and maintenance service [13] both of which provide key background for one part of this proposed work. However, this proposed work goes well beyond either of these latter systems both of which will be briefly discussed later in the proposal.

2.0 Research Approach

To develop high confidence embedded systems it is necessary for the designers to explicitly address run time assurance at requirements time and throughout the lifetime of the system. Since the systems run in real world settings and may evolve over time it is also necessary to have a run time framework that can provide assurances when necessary (continuously, on demand or periodically). If the system does not meet the run time assurance goals then aids in diagnosis and repair are also crucial. In this work we propose a methodology and framework to support run time assurance for open embedded systems. The goal is creating high confidence open embedded systems.

2.1 Requirements Specifications

It is good software engineering practice to carefully specify the requirements for a system. For high confidence embedded systems, we propose that it is also necessary for designers to precisely specify what is required for run time assurance. Once this is done, then these assurances are supported by our proposed run time framework. In this section we describe our approach and research questions that must be answered for the requirements specifications stage.

We propose to develop a specification language for run time assurances of high confidence embedded systems. The specifications must indicate what application level functionality and what system level functionality need to be assured. For example, in the fire fighting application, it is necessary to show that a fire on any floor is detected, alarms are activated, fire stations are notified and egress routes are identified and illuminated. For each such function the designers specify when the assurance capability is to be invoked. Generally, the invocation of run time assurance modules may occur continuously, periodically, on a non-failure event, or due to a fault. For example, the above capabilities may have to be demonstrated to fire inspectors once per month for the lifetime of the system or whenever certain (sets of) faults occur. System level assurances may be required to demonstrate various features such as reliable routing paths are operational, nodes have sufficient energy, and egress lighting can be activated. Consequently, it is possible to execute application and system level assurances independently.

We also suggest that sometimes it is possible to specify that if an application level function can not be demonstrated, then the system should perform a set of system level assurances to help find the cause of the problem. For example, if the run time assurance periodic test (using virtual event replays – discussed later) indicates that a fire on floor 33 can not currently be detected, then there may be a specific chain (set) of other system level assurance tests specified to execute such as (i) run a spanning tree route detection test, and if broken notify the systems administrator, else (ii) ping nodes on that floor, and if any are not responding then identify them for replacement. Of course, if the cause of the problem is something other than what is being investigated, then these tests cannot find the cause. In this case, standard debugging techniques have to be employed.
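As an illustration only, the chain of system level tests in this example might be sketched in Python as follows. The function and parameter names (`run_diagnosis_chain`, `spanning_tree_ok`, `ping`), the node identifiers, and the report strings are hypothetical placeholders and not part of the proposed framework; real tests would be issued over the network rather than through local callbacks.

```python
from typing import Callable, Iterable, List


def run_diagnosis_chain(
    floor_nodes: Iterable[int],
    spanning_tree_ok: Callable[[], bool],
    ping: Callable[[int], bool],
) -> List[str]:
    """Run the system level tests that follow a failed application level test.

    Step (i): check the routing spanning tree; if it is broken, notify the
    administrator and stop. Step (ii): otherwise ping every node on the floor
    and flag the ones that do not respond.
    """
    if not spanning_tree_ok():
        return ["notify administrator: spanning tree broken"]
    findings = [f"replace node {n}" for n in floor_nodes if not ping(n)]
    # An empty list means the cause lies outside the specified chain,
    # so standard debugging techniques are needed.
    return findings or ["cause not found: fall back to standard debugging"]


# Hypothetical floor 33 with three nodes, one of which is silent.
alive = {3301: True, 3302: False, 3303: True}
report = run_diagnosis_chain(alive, lambda: True, lambda n: alive[n])
print(report)
```

The chain stops at the first test that explains the failure, which mirrors the "if broken ... else" structure of the example.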

We believe that it is critical to address run time assurances as a first design principle and specifically identify the key application semantics, the key system capabilities, and their interactions. The hypothesis is that this strategy helps achieve high confidence embedded systems.

New research on the requirements problem for high confidence embedded systems is needed in three areas. What specification language can best articulate the application and system level run time assurances and their interactions? How can the specification aid in not only identifying system capabilities, but also better support diagnosis and repair when the assurance test fails? How do the run time assurance requirements relate to the overall system requirements?

Specification language: Our approach is to combine declarative specifications with explicit statements of invariants in a manner that (1) has formal semantics, (2) addresses application level assurances, (3) addresses system level assurances, (4) addresses the statistical nature of WSN, (5) accounts for costs, (6) deals with future system projections and (7) identifies monitoring conditions that are usable for the various run time mechanisms including data mining. No current specification scheme or language supports this collection of issues, and existing ones are usually very general. We are focusing only on run time assurance requirements and embedded systems issues. See also the state of the art section for further comparison to current requirements languages.

While it may sometimes be difficult to distinguish between application and system level assurances, it is more important to simply identify what the system should certify at run time and when. In reality there is actually no need to distinguish the two types. However, we separate them to emphasize that application semantic support is the overall goal and system level assurances are specified to help achieve that goal.

Consider the following examples. At the application level, declarative statements enable specifying requirements such as any fire must be detected; we might also specify an invariant that on each floor there must be at least one active alarm node; we might need to state operational capabilities under probabilities of dense smoke or excessive temperatures; we might need the system to predict future capabilities such as projecting the remaining lifetime of nodes based on current power (battery) levels; and we might specify what data to collect so that, on false alarms, the data mining techniques can better assess their causes. We believe that dealing with this collection of issues is novel. Therefore, we propose a requirements language that can specify predicates, invariants, rules, conditionals (including probabilistic conditionals) and costs. The exact syntax and power of the language is one of the main research questions to be answered. However, we expect that the invariants will be specified with declarative statements that contain a “scope” (for a node, for all nodes, for the system, there exists), a “where” clause that indicates the conditionals, and a “requires” clause that specifies facts, actions and rules. Of course, the same types of needs appear for assisted living and many other applications.
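To make the scope / where / requires structure concrete, the following Python sketch encodes the alarm-node invariant from the examples above. It is only an assumption about how such a statement could be represented: the `Invariant` class, its field names, the two scope keywords, and the node dictionaries are all hypothetical, and the actual syntax of the language remains an open research question.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Node = Dict[str, Any]


@dataclass
class Invariant:
    """One run time assurance requirement (hypothetical structure)."""
    name: str
    scope: str                        # "for_all" or "there_exists"
    where: Callable[[Node], bool]     # selects the nodes the invariant covers
    requires: Callable[[Node], bool]  # fact that must hold for those nodes

    def holds(self, nodes: List[Node]) -> bool:
        selected = [n for n in nodes if self.where(n)]
        checks = (self.requires(n) for n in selected)
        return all(checks) if self.scope == "for_all" else any(checks)


def alarm_invariant(floor: int) -> Invariant:
    """'On each floor there must be at least one active alarm node.'"""
    return Invariant(
        name=f"active alarm on floor {floor}",
        scope="there_exists",
        where=lambda n: n["floor"] == floor and n["role"] == "alarm",
        requires=lambda n: n["active"],
    )


# Made-up node states for two floors.
nodes = [
    {"floor": 1, "role": "alarm", "active": True},
    {"floor": 1, "role": "alarm", "active": False},
    {"floor": 2, "role": "alarm", "active": False},
]
results = {f: alarm_invariant(f).holds(nodes) for f in (1, 2)}
print(results)
```

Probabilistic conditionals and costs, which the language must also express, are deliberately left out of this sketch.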

The same language should also be used for system level requirements that relate to run time assurance, diagnosis and recovery. At the system level, we might specify the need to check spanning tree coverage, link quality for key links, and sufficient memory availability.

Support diagnosis and repair: To build high confidence systems we require that designers also consider what services are needed to help with diagnosis and repair. The specifications must permit rules that describe the diagnosis and what functions to invoke to attempt repair. If designers are successful in prescribing good diagnosis tests then when the run time assurances fail, these tests can be invoked via our framework and the causes identified. If the causes were not anticipated then standard debugging techniques or diagnosis tools like Sympathy [58] must be used. Similarly, for some causes of failures recovery mechanisms will exist and can be activated. For example, running an application assurance test may determine that floor 23 is not detecting a fire; the system run time assurance might then find that the spanning tree is broken (diagnosis), and the recovery is to re-run the spanning tree creation protocol.

Run time assurance and overall system requirements: Many embedded systems today employ fault tolerance, self-healing, health monitoring and various other reliability mechanisms. These are considered as part of the core system requirements. However, we are adding a layer of requirements that specifically address what is needed to demonstrate the critical functionality of the system. It is necessary to understand how these sets of requirements interact and how the necessary run time support for these sets of requirements can leverage each other. We propose to develop an understanding and solution for this problem by creating our framework and embedding it in multiple different applications (see evaluation section).

In summary, the methodology we are proposing is that designers explicitly identify the application semantics and the system level capabilities that must be assured at run time. For each, they must specify when they are to be invoked. We expect the linkage between the requirements and the run time framework to include a combination of automatic code generation, activation rules, explicit dependencies, and library routines. This set of information and code is then incorporated into the run time framework. While there is potentially an extremely large number of requirements that one might be tempted to specify, it is important to note that run time assurances are neither debugging schemes nor a complete set of tests to demonstrate all the functions of the system. Rather, the run time assurances need to focus on only the key functions. We believe that it is possible to identify the key application semantics and key system capabilities and that this will be a relatively small set compared to all the detailed requirements. We will validate this hypothesis by working with application experts in multiple domains (see evaluation section).

2.2 Run Time Framework

A second major component of our proposed research is the design, implementation and evaluation of a run time framework that permits run time assurances. The architecture of the framework must be general enough to easily port and execute in many different applications. It needs to support the loading and executing of run time assurance modules controlled by the activation semantics specified. It also needs to interact with underlying mechanisms as provided by the specific application system implementation. These support mechanisms are of two types: the first is a set of new mechanisms specifically added to support run time assurances. The second is the integration and utilization of the run time framework with the application system’s native capabilities such as its communications, monitoring and data collection features. The essence of the framework is to enable a largely unattended embedded system to (on demand) execute code that demonstrates key functionality of the system is operational. When such assurances cannot be given, then some explicit support is provided for diagnosis and repair.

Underlying Mechanisms

A key problem to resolve in developing run time assurances is the creation of an effective set of underlying mechanisms. This set of mechanisms must provide cost-effective support for monitoring, diagnosis and repair. Even deciding what mechanisms are necessary is an open question. In addition, for each mechanism there exists a set of research questions. Our approach is to separate the required mechanisms into two categories: (i) adding specific new mechanisms that are needed to support run time assurances, and (ii) re-using existing functions already available (required) in the functional aspects of the system. Below, we treat each category separately.

Additional Required Mechanisms

Our initial idea is that at least 3 new mechanisms are required, each providing different types of support for run time assurances. Our hypothesis is that these 3 mechanisms are sufficiently powerful that when combined with those mechanisms already in the system they can provide most of what is required in monitoring and diagnosis to support run time assurances. Explicit development of automatic repair capabilities will only be partly addressed. We believe that some repair capabilities fall out naturally from the monitoring and diagnosis and these types of repair capabilities will be addressed. However, complete solutions for repair are extremely difficult and often require human interaction and this is outside the scope of this proposal.

The three additional mechanisms are: (i) virtual event generators, (ii) real event replay capabilities, and (iii) data mining.

Virtual Event Generators

Virtual event generators create events to activate specific event-trigger routines and their subsequent processing. To support run time assurance each node runs a virtual event handler and maintains an event table. For each specified event handled at this node, the table records the entry of the corresponding reaction routine. Virtual events are activated in two ways. First, the assurance routines can generate a specific event by sending a command message. Upon receiving the message, the handler produces the event immediately. This method can be used for the immediate node level validation, and by broadcasting can set up a system-wide validation. Second, the event handler can generate events periodically according to a predefined schedule or at a future time. Note that events can be positive events such as emulating the appearance of a target or a fire, or negative events such as a node failing or losing a message.
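A minimal sketch of such a per-node handler is given below, assuming a simple in-memory event table and a tick-driven timer. The class and method names (`VirtualEventHandler`, `on_command`, `schedule`, `tick`) and the reaction strings are illustrative assumptions only; periodic generation, which the text also calls for, could be obtained by rescheduling an event each time it fires.

```python
import heapq
from typing import Callable, Dict, List, Tuple


class VirtualEventHandler:
    """Per-node handler: an event table plus a schedule of future events."""

    def __init__(self) -> None:
        # Maps each event handled at this node to its reaction routine.
        self.event_table: Dict[str, Callable[[], str]] = {}
        self._schedule: List[Tuple[float, str]] = []  # min-heap of (time, event)

    def register(self, event: str, reaction: Callable[[], str]) -> None:
        self.event_table[event] = reaction

    def on_command(self, event: str) -> str:
        """Command message received: produce the event immediately."""
        return self.event_table[event]()

    def schedule(self, when: float, event: str) -> None:
        """Request that the event be generated at a future time."""
        heapq.heappush(self._schedule, (when, event))

    def tick(self, now: float) -> List[str]:
        """Fire every scheduled event whose time has arrived."""
        fired = []
        while self._schedule and self._schedule[0][0] <= now:
            _, event = heapq.heappop(self._schedule)
            fired.append(self.event_table[event]())
        return fired


handler = VirtualEventHandler()
handler.register("fire", lambda: "alarm activated")          # positive event
handler.register("node_failure", lambda: "reroute started")  # negative event
immediate = handler.on_command("fire")
handler.schedule(10.0, "node_failure")
early, due = handler.tick(5.0), handler.tick(10.0)
print(immediate, early, due)
```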

Our research in this area is to solve several complicating issues. One is that some modules may need event parameters for more sophisticated event processing and associated assurance tests. In this case, such parameters are predefined and stored in the event table or downloaded at run time. Another is that for complex validation, which may require a trace of event logs and parameters, we need to develop more implementation support. In this case we can combine virtual event generation with some of the capabilities being developed for event logging as discussed below. Because we are dealing with embedded systems in physical worlds we must also provide the capability to create events probabilistically. We must develop a scheme for mapping from specified run time assurance requirements into the set of virtual events and their invocation times. Another main concern in supporting virtual event generation is run time cost. However, the memory cost is low because the handler only involves a mapping table, a timer to schedule events, and a message receiver to accept the commands. Communication and energy costs are necessary to provide the assurances, but efficient techniques will be investigated. Another problem is how to decide the level of event to generate and where and how to monitor the results of a generated event. Finally, an especially difficult problem to solve is addressing the simultaneous firing of a virtual event with an actual event from the system.

Real Event Replays

Many high confidence embedded systems provide services for safety critical tasks which do not occur often. Examples of these tasks are detecting fires, pollution or an elderly person falling down. It is also often difficult to produce the actual event for a run time assurance test – we don’t want to start fires or make people fall down. One option is to use the virtual event discussed above. However, such virtual events do not always include enough realities, e.g., they occur after the sensing modules themselves. Another option is to run real world system events and log their details in the system. Then when an assurance test is run, the event as recorded from real world sensors can be replayed. This can even be possible for fires where a real fire can be created in a controlled setting and the low level system reaction to it recorded for replay many times.

To develop real event replays we propose to extend EnviroLog [54], a real event logger and replay system that we developed several years ago. EnviroLog has the key features required although it was originally used only for debugging. We now propose to explore and extend it for run time assurance.

Fig. 1. Record and Replay: Two Main Stages of EnviroLog

As shown in Fig. 1, during the recording stage EnviroLog logs into flash memory, at each node, all function calls and their parameters issued by the log modules in response to real world events (note that the same logging mechanism can be used for virtual events). In people tracking experiments we performed, a 512KB flash held up to 90 minutes of raw sensor data. Since run time assurance rarely requires recording events that last this long, the recording capabilities of EnviroLog are adequate (and flash memory sizes are increasing). During the replay stage, EnviroLog disables the log modules and issues the previously recorded function calls at the right time and in the right sequence at each node of the system. Very little extra overhead is incurred when replaying since the system acts in the same manner as when the real environmental event occurred. It was also shown that the replays are very accurate [54], assuming that the system is still operational.

Events recorded and replayed by EnviroLog are not limited to direct reflection of environment events such as raw sensor reading, but they can be any system-level events or statistical information of interest of the runtime assurance framework. However, currently EnviroLog logs all events from different specifications and replays all the events in the sequence of their recording. For the purpose of providing different granularities of runtime assurance, we propose to extend EnviroLog to record and replay events according to semantic definitions. For example, we might define magnetic sensor events and routing events as part of a weapons detection event (semantic tag). We might also define temperature and chemical sensor events as pollution events (semantic tag). For runtime assurance even though all the events are recorded, events are tagged with semantic definitions, so during replay stage, events can be replayed according to the semantic tags.
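The proposed tag-based replay can be illustrated with the short Python sketch below. It is not EnviroLog's interface: the `LoggedEvent` record, the `SEMANTIC_TAGS` table, and the `replay` function are hypothetical, and the tag definitions simply restate the two examples given above.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class LoggedEvent:
    timestamp: float
    kind: str      # e.g. "magnetic", "routing", "temperature", "chemical"
    payload: str


# Hypothetical tag definitions, following the examples in the text.
SEMANTIC_TAGS: Dict[str, Set[str]] = {
    "weapons_detection": {"magnetic", "routing"},
    "pollution": {"temperature", "chemical"},
}


def replay(log: List[LoggedEvent], tag: str) -> List[LoggedEvent]:
    """Return the events belonging to `tag`, in their recorded order."""
    kinds = SEMANTIC_TAGS[tag]
    return sorted((e for e in log if e.kind in kinds), key=lambda e: e.timestamp)


# Everything is recorded; only the events matching the tag are replayed.
log = [
    LoggedEvent(2.0, "temperature", "31C"),
    LoggedEvent(1.0, "magnetic", "spike"),
    LoggedEvent(3.0, "routing", "next hop changed"),
    LoggedEvent(4.0, "chemical", "CO detected"),
]
weapons = [e.kind for e in replay(log, "weapons_detection")]
pollution = [e.kind for e in replay(log, "pollution")]
print(weapons, pollution)
```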

Since system events caused by real world activities can be recorded at different levels of granularity from raw signal readings to function calls, we must also develop schemes to permit a tradeoff between the granularity of the events being recorded and their costs in flash memory.
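A back-of-the-envelope calculation shows the shape of this tradeoff. The sketch below takes the 512KB and 90 minute figures reported above, assumes 1KB = 1024 bytes and that the flash is filled completely, and derives the implied average logging rate; the "one quarter of the raw rate" granularity is a purely hypothetical example, not a measured value.

```python
FLASH_BYTES = 512 * 1024  # 512KB flash, as in the tracking experiments
RAW_MINUTES = 90          # reported duration for raw sensor data

# Average logging rate implied by the two figures above (about 97 bytes/s).
raw_rate = FLASH_BYTES / (RAW_MINUTES * 60)


def recording_minutes(bytes_per_second: float, flash_bytes: int = FLASH_BYTES) -> float:
    """Minutes of recording that fit in flash at a given logging rate."""
    return flash_bytes / bytes_per_second / 60


# Hypothetical coarser granularity: logging at one quarter of the raw rate.
coarse_minutes = recording_minutes(raw_rate / 4)
print(round(raw_rate, 1), round(coarse_minutes))
```

Under these assumptions, recording time scales inversely with the logging rate, so a coarser event granularity buys a proportionally longer recording window at the cost of replay fidelity.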

We believe that it is necessary to support both virtual and real event systems. Virtual event capabilities are more flexible and efficient than real replays. Also, virtual events can create tests for conditions which have not been a priori determined or are too complicated to create in the real world. On the other hand, real event replay provides an added degree of reality for a key subset of required assurances.

Data Mining

In spite of best efforts to produce correct systems, high confidence embedded systems operating in an open fashion will still likely experience difficult scenarios and anomalous behaviors from time to time. For example, because of real world realities a system may produce false positives and false negatives. Or the system may be experiencing highly unusual communication delays. The causes of these, hopefully rare, events are often difficult to assess. Problems due to concurrency, race conditions, faults, and real-time non-deterministic events can also cause unexpected results. By requiring the inclusion of specific monitoring and data collection and then making that information available to off-line data mining tools we hope to enable a system to identify unexpected behaviors and causes of unexpected events. Over time such a capability can enable a system to improve, thereby enhancing the overall confidence in its operation.

Our plan is to collect the monitored data into a data warehouse. When executing assurances we can then identify when the assurances show that the system is operational and when not. Based on this information we can construct models of the system and its operation. Then, via standard data mining algorithms we can search for conditions that are common for successful tests and those for unsuccessful tests. Certain trends may be detected, for example that the system does not operate when large numbers of people are on the same floor (due to a department party) because this causes a communication break in the routing spanning tree. Finding such previously unknown patterns can then result in improved design and implementation. Equally intriguing is being able to use data mining for prediction of potential future problems. For example, it may be learned that when the overall system load reaches 85% then nodes and links begin to fail and within 1 day the system begins to lose its operational capability. Hence, upon seeing such a system load condition, actions can be undertaken to reduce load and avoid the loss of (partial) operation.
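The simplest form of this search, comparing how often assurance tests fail with and without a given condition, can be sketched as follows. This is a toy stand-in for the standard data mining algorithms mentioned above, not a proposal for which technique to use; the record layout, the `failure_rates` function, the `crowded_floor` condition name, and the five records are all made up for illustration.

```python
from typing import Dict, List, Tuple

# (observed conditions, did the assurance test pass?)
Record = Tuple[Dict[str, bool], bool]


def failure_rates(records: List[Record], condition: str) -> Tuple[float, float]:
    """Failure rate of assurance tests with and without `condition` present."""

    def rate(group: List[bool]) -> float:
        return sum(not passed for passed in group) / len(group) if group else 0.0

    with_c = [passed for conds, passed in records if conds.get(condition, False)]
    without_c = [passed for conds, passed in records if not conds.get(condition, False)]
    return rate(with_c), rate(without_c)


# Toy warehouse contents: tests tend to fail when the floor is crowded.
records: List[Record] = [
    ({"crowded_floor": True}, False),
    ({"crowded_floor": True}, False),
    ({"crowded_floor": True}, True),
    ({"crowded_floor": False}, True),
    ({"crowded_floor": False}, True),
]
rates = failure_rates(records, "crowded_floor")
print(rates)
```

A large gap between the two rates flags the condition as a candidate cause worth investigating; it does not by itself establish causation.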

Research questions include identifying which data to monitor, how to collect it efficiently, which data mining techniques are most effective, how to find problems caused by race conditions, and how effective prediction is.

Integration with the Application System

High confidence embedded applications will normally require the following capabilities: (i) distributed state monitoring and (ii) state information collection.

Note that systems like LiveNet [17], Momento [62] and our Self-healing VigilNet [13] provide such capabilities. Our research will develop means for integrating with such capabilities in order to extend these types of health monitoring systems to run time assurances.

Distributed State Monitoring

We expect that most distributed embedded applications will need a distributed state monitoring capability for their core operation. Our run time assurance framework also requires distributed state monitoring. We propose one integrated mechanism to cover these two situations: a monitor object. A monitor object is a piece of code that resides on each device and executes the monitoring tasks. The monitor object maintains a list of what states to monitor, the frequency at which to monitor, and what to do with the data (e.g., transmit to the base station for run time assurance). These parameters need to be dynamically re-configurable. A monitor object also includes information processing operations, such as a comparison operation. This overarching monitor does not preclude individual protocols and services from using their own monitoring, e.g., a MAC protocol might monitor channel delay. This individual protocol monitoring would not be part of the monitor object. By providing this monitor object solution, both the core functions of the system and run time assurance functionality can use the same monitoring framework.
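The monitor object described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our implementation: the names MonitorObject, MonitorEntry, configure, and tick, and the read_state callback standing in for platform-specific state access, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class MonitorEntry:
    period_s: float                    # how often to sample this state
    action: str                        # what to do with the data, e.g. "report"
    trigger: Callable[[float], bool]   # comparison operation applied to each sample
    next_due_s: float = 0.0            # a new entry is due on the next tick


@dataclass
class MonitorObject:
    read_state: Callable[[str], float]  # assumed platform-specific state reader
    entries: Dict[str, MonitorEntry] = field(default_factory=dict)

    def configure(self, state: str, period_s: float, action: str,
                  trigger: Callable[[float], bool] = lambda value: True) -> None:
        """Add or replace an entry at run time (dynamic re-configuration)."""
        self.entries[state] = MonitorEntry(period_s, action, trigger)

    def remove(self, state: str) -> None:
        """Stop monitoring a state."""
        self.entries.pop(state, None)

    def tick(self, now_s: float) -> List[Tuple[str, float, str]]:
        """Sample every state that is due and return (state, value, action)
        for the samples whose comparison holds."""
        out = []
        for state, entry in self.entries.items():
            if now_s >= entry.next_due_s:
                entry.next_due_s = now_s + entry.period_s
                value = self.read_state(state)
                if entry.trigger(value):
                    out.append((state, value, entry.action))
        return out
```

In this sketch both the application and the assurance framework would call configure on the same object, which is the sense in which one monitoring framework serves both purposes.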

Monitoring is not a simple issue and it intimately interacts with state information collection. Our initial design considers a two tier monitoring architecture as shown in Fig. 2. The global monitor object consists of a collector and a controller and enables the base station to collect and process the performance and/or state information from each individual node in the network via the collector. The processed information then feeds into the controller. The controller generates a list of virtual or real events, task activations, and protocols to execute and transmits the list to the network. Each node runs a local monitor object that consists of a collector, a reporter and a controller. The local monitor acquires the information on what to monitor from the base station, collects the requested information through the local collector, and reports the requested information back to the base station if required.

Fig. 2: Two Tier Monitor Architecture
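The division of roles in the two tier architecture can be sketched as below. The class and method names are hypothetical, the network is replaced by direct method calls, and the global controller is reduced to flagging nodes for follow-up; the actual controller would emit events, task activations, and protocols to execute.

```python
from typing import Callable, Dict, List


class LocalMonitor:
    """Per-node monitor: controller, collector, and reporter."""

    def __init__(self, node_id: str, sensors: Dict[str, Callable[[], float]]):
        self.node_id = node_id
        self.sensors = sensors            # state name -> reader on this node
        self.requested: List[str] = []

    def control(self, requested: List[str]) -> None:
        # Controller: record what the base station asked this node to monitor.
        self.requested = [s for s in requested if s in self.sensors]

    def collect(self) -> Dict[str, float]:
        # Collector: read only the requested states.
        return {s: self.sensors[s]() for s in self.requested}

    def report(self) -> Dict[str, Dict[str, float]]:
        # Reporter: package the readings for the base station.
        return {self.node_id: self.collect()}


class GlobalMonitor:
    """Base-station monitor: collector plus controller."""

    def __init__(self, nodes: List[LocalMonitor]):
        self.nodes = nodes

    def request(self, states: List[str]) -> None:
        # Tell every node which states to monitor.
        for node in self.nodes:
            node.control(states)

    def collect(self) -> Dict[str, Dict[str, float]]:
        # Collector: merge the reports from all nodes.
        merged: Dict[str, Dict[str, float]] = {}
        for node in self.nodes:
            merged.update(node.report())
        return merged

    def control(self, collected: Dict[str, Dict[str, float]],
                state: str, minimum: float) -> List[str]:
        # Controller (simplified): nodes whose value for `state` is below
        # `minimum` and therefore need a follow-up task.
        return sorted(
            node_id for node_id, values in collected.items()
            if state in values and values[state] < minimum
        )
```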

Key ideas are to make the monitors capable of collecting specified state information in 5 categories and to enable the 5 categories to be flexibly integrated with many existing systems. The categories are: (1) States that are available via directly interfacing with the hardware layer, for instance, energy remaining, RSSI levels, and the clock. (2) States that are obtainable through the interfaces provided in the original system without the need to interact with the nodes in the neighborhood, such as the maximum number of neighbors in a node's neighbor table or the maximum number of parents of a node. (3) States that require cooperation among the nodes in a neighborhood with explicit message exchanges. Link quality is one example of such states. (4) States that are specified as states to be monitored, but for which the original system does not provide interfaces. (5) States that are not maintained by any component in the original system. With these capabilities it will be possible to flexibly integrate this monitoring structure with the core application functionality and enable the same monitoring framework to be used in many applications.
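The five categories can be written down as a small enumeration, which makes it explicit how each kind of state would be acquired. The identifiers below are hypothetical, and the helper needs_added_code encodes an assumption (that categories 4 and 5 require the monitor to add its own hooks) rather than a settled design decision.

```python
from enum import Enum


class StateCategory(Enum):
    HARDWARE = 1          # (1) read directly from hardware: energy, RSSI, clock
    LOCAL_INTERFACE = 2   # (2) exposed by an existing interface, no neighbor traffic
    NEIGHBORHOOD = 3      # (3) needs explicit message exchange with neighbors
    UNEXPOSED = 4         # (4) kept by the system, but no interface exposes it
    UNMAINTAINED = 5      # (5) not kept by any component of the original system


# Example states taken from the category descriptions; the key names are illustrative.
EXAMPLE_STATES = {
    "energy_remaining": StateCategory.HARDWARE,
    "neighbor_table_size": StateCategory.LOCAL_INTERFACE,
    "link_quality": StateCategory.NEIGHBORHOOD,
}


def needs_radio(category: StateCategory) -> bool:
    """True when collecting the state itself requires messages between neighbors."""
    return category is StateCategory.NEIGHBORHOOD


def needs_added_code(category: StateCategory) -> bool:
    """Assumption: categories 4 and 5 require the monitor to add its own hooks."""
    return category in (StateCategory.UNEXPOSED, StateCategory.UNMAINTAINED)
```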

Energy consumption of the monitors is one overhead concern. The main energy consumer of the monitors is the radio, i.e., both exchanging beacon messages in a neighborhood to collect state information and reporting the states to the base station. We have developed a preliminary implementation of this monitoring structure and measured its overhead. Fig. 3 shows the overhead of a node for both local and global activities for different beacon periods. We can see that the beacon period has an approximately linear impact on the overhead. However, the absolute overheads for both local and global services are minimal (less than 1.2 mAh) compared with a battery's capacity (2,848 mAh).

Fig. 3. Overhead of Monitoring per Node

State Information Collection

State information collection is a second system capability that can be used by our run time assurance framework. Many systems require data to be collected at one or more locations. Our run time assurance framework requires state information to be collected at the validation site. Consequently, we can (sometimes) use the same solution for both purposes. One complication is that taking this approach places an extra load on the data collection process. Sometimes this is not acceptable and would invalidate the assurance test itself. Our research is to provide a two part solution that allows use of the normal system state collection protocols when acceptable, but provides overhearing or parallel path solutions when it is not. We will investigate parallel paths, extra overhearing nodes, piggybacking, operating only when the system is idle or projected to be idle, etc., to minimize or eliminate the impact state collection has on the normal operation of the system. We will also consider techniques to minimize the memory costs, such as keeping state information no longer than necessary and using compression techniques.
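The two part solution amounts to a decision about which path assurance state takes to the validation site. A minimal sketch of that decision follows; the function name, the load estimates, and the three return labels are assumptions for illustration, and the real policy is a subject of the proposed research.

```python
def choose_collection_path(extra_load: float, load_budget: float,
                           system_idle: bool) -> str:
    """Pick how assurance state reaches the validation site.

    extra_load:  estimated additional traffic the assurance data would add to
                 the normal collection protocol (fraction of its capacity).
    load_budget: largest extra load the normal protocol can carry without
                 distorting the assurance test.
    system_idle: whether the system is idle or projected to be idle.
    """
    if extra_load <= load_budget:
        return "normal"          # reuse the system's own collection protocol
    if system_idle:
        return "deferred_idle"   # use the normal protocol while the system is idle
    return "parallel"            # overhearing nodes or a parallel path
```

For example, choose_collection_path(0.05, 0.10, False) selects the normal protocol, while a load above the budget on a busy system falls back to the parallel path.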

In summary, some of the key research questions for the run time assurances framework are:

How to efficiently execute assurance code in parallel with the current operation?

How to keep the cost and overheads of the assurance system to a minimum?

How to determine fundamental limits for run time assurance?

How to effectively use the framework to support diagnosis and repair? Is it possible to leverage existing monitoring and diagnosis tools like Sympathy?

How does the framework interact with fault tolerance and self-healing mechanisms? Can our solutions reveal improved fault tolerance and self-healing methods to make the system more conducive to run time assurance?

3.0 Evaluation

The first step in our evaluation is to determine the breadth and completeness of the requirements language. To do this we will specify the necessary run time assurances for three diverse systems that we built in the past and for one new system. The three past systems are VigilNet [32][33] – a military surveillance, tracking and classification system, AlarmNet [79][80] – a smart environment for assisted living, and Luster [65] – an environmental science application. These systems differ in the frequencies and types of activities and the levels of run time assurance required. For example, in assisted living there is continuous human activity with frequent need for assurances. In military surveillance there is less frequent activity, but with potentially dire consequences. For our environmental science application there are long periods of inactivity, no humans involved and less critical activities. The new system will be an emulation of a firefighting system for skyscrapers. This application combines a number of the requirements from the previous systems: safety critical operation, human interactions, and long periods of inactivity. It will also serve as a test for applying our solutions to a system not previously built. To ensure that we specify the key run time assurances we will work with domain experts. For the assisted living application we are working with the Medical Automation Research Lab, the Geriatrics department, and the Nursing department, all at the UVA medical school. For the environmental science application we are partnering with the UVA Environmental Science department, with whom we interacted heavily in the original construction of Luster. For fire fighting we are interacting with Windowman Inc., a company that focuses on fire fighting in skyscrapers in New York. If our requirements language is capable of addressing the concerns of this diverse set of applications, then it should be widely applicable.

We only briefly describe the AlarmNet testbed since this is the one we will tackle first. AlarmNet emulates an assisted living facility. It is a joint project between the UVA-PI and the University of Virginia Medical School. AlarmNet integrates heterogeneous devices, some worn by the patient and some placed inside the living space. Together they inform the healthcare provider about the health status and activities of the resident. Data is collected, aggregated, pre-processed, stored, and acted upon using a variety of replaceable sensors and devices in the architecture (activity sensors, physiological sensors, environmental sensors, pressure sensors, RFID tags, pollution sensors, floor sensors, etc.). There are many users of the system including doctors, nurses, technicians, patients, patients’ family members, and administrators. Privacy of data is paramount, and data should be revealed only on a need to know basis or as permitted by the patient. Traditional healthcare provider networks may connect to the system by a residential gateway, or directly to their distributed databases. Some elements of the network are mobile, such as the body networks and some of the infrastructure network nodes, while others are stationary. Some nodes can use line power, but others depend on batteries. The system is designed to exist across a large number of living units. The components of the architecture are shown in Figure 4, dividing devices into strata based on their roles and physical interconnect.

Figure 4: Architecture of AlarmNet Testbed

In our evaluation we will implement the run time assurance framework and its associated mechanisms and incorporate them into each of the AlarmNet, VigilNet, and Luster testbeds. The first implementation on AlarmNet will be the most difficult, but subsequent porting will be easier. We will measure the amount of code that needs to change when porting and assess overall portability capabilities. We will also measure overhead costs of running the framework under various conditions as well as memory, communication and energy costs. Note that there is no need to measure how confident the system is. Confidence is defined precisely in terms of the stated requirements.

After gaining experience with defining run time assurances and our run time framework on these three previously constructed applications, we will undertake a fourth application from the beginning: fire fighting in skyscrapers. A testbed that emulates fire systems in buildings will be built. Benefits and costs will be measured as for the first three applications.

It is important to note that this work does not compute an overall reliability or confidence metric. The overall quality of the system depends on many factors such as the rigorous off-line process, on-line techniques for fault tolerance and self-healing, and the environment. Instead, this work concentrates on, periodically or on-demand, demonstrating that the system does or does not meet its carefully specified run time assurance requirements. How frequently the run time assurance framework answers in the affirmative depends on the environment and the quality of the system implementation. When the answer is no, our framework provides some help with recovery.
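The yes/no character of an assurance check, as opposed to a reliability score, can be illustrated with a short sketch. The function run_assurance, the requirement names, and the state fields below are hypothetical; actual requirements would be written in the proposed requirements language.

```python
from typing import Callable, Dict, List, Tuple

# A requirement is modeled here as a predicate over collected system state.
Requirement = Callable[[Dict[str, float]], bool]


def run_assurance(requirements: Dict[str, Requirement],
                  state: Dict[str, float]) -> Tuple[bool, List[str]]:
    """Return (all requirements met?, names of the failed requirements).

    No overall confidence metric is computed; the list of failed
    requirements is what a recovery step would start from.
    """
    failed = sorted(name for name, check in requirements.items()
                    if not check(state))
    return (not failed, failed)


# Hypothetical requirements for the skyscraper fire example.
EXAMPLE_REQUIREMENTS: Dict[str, Requirement] = {
    "alarm_latency": lambda s: s["alarm_latency_s"] <= 5.0,
    "floor_coverage": lambda s: s["floors_covered"] >= 40,
}
```

The same function can be invoked from a periodic timer or on demand; only the source of the state snapshot differs.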

Each key mechanism will also be evaluated individually. For example, delays in identifying actual events due to run time assurances being executed in parallel will be measured. True event logging and replay will be tested for a wide range of event types, for its costs, for its effectiveness and with comparisons to other techniques such as virtual events. The effectiveness of data mining in supporting the predictive aspects of run time assurance will be determined.

4.0 Outline of Year by Year Plan and Summary

Year 1: During the first year we will solve the research problems in developing the requirements language and create additional fundamental concepts for the major mechanisms: virtual events, real event replay, data mining, distributed state monitoring and state information collection.

Year 2: A full implementation and evaluation of the framework and its mechanisms will be completed on the AlarmNet testbed (medical applications) by the middle of the second year. We will also port the framework to the VigilNet application towards the end of year 2. Data mining solutions will be developed. Solutions for the simultaneous operation of the actual system with validation tests will be completed.

Year 3: We will finalize our framework and our methodology will be generalized. We will consider how our solutions might feed back ideas into improving fault tolerance and self-healing in a manner to facilitate run time assurance. Refinement of the requirements language will be made and re-evaluations conducted. We will port the framework to Luster and create the fire fighting testbed. We will perform extensive experiments to assess the capabilities of our overall solution.

In summary, when this research is complete embedded system designers will have a requirements language focused on run time assurance, a portable and reusable software framework to support run time assurance, and a methodology to understand and create high confidence embedded systems. Fundamental research questions on how to create high confidence embedded systems will be answered. This includes questions such as what are the central principles of run time assurance, how to capture the intricacies and uncertainties of the physical world, understanding of the costs and value of various assurance mechanisms, as well as the other research questions brought up throughout the proposal. Users of systems employing this technology will benefit because the system can show that it is operational as expected, thereby increasing confidence in the system. When the system is not operating according to the requirements, explicit support is given to bring it into compliance.

5.0 State of the Art

The work proposed here is highly related to, but also distinct from a number of sensor network research areas including fault tolerance, self-healing, debugging, health monitoring, and system management.

There is an incredible array of fault tolerance, testing and reliability techniques developed over 50 years. Many of these run time techniques can be applied to WSN. Recent work in this area includes [58][18][57][64][83]. We expect that any WSN that must operate in high confidence mode will utilize many of these schemes. However, the low level fault tolerance techniques themselves are not the subject of this work, and our work operates in conjunction with such techniques. In addition, most existing approaches aim to improve the robustness of an individual component. For example, eScan [87] can actively monitor remaining energy levels and detect potential node-failure faults, while CODA [75] monitors channel loading conditions to detect congestion for packet loss faults. However, it is difficult to use such schemes to validate the high-level functionality of systems, which involves interactions among multiple components.

Most wireless sensor networks utilize decentralized algorithms to achieve some degree of reliability. Most of these greatly enhance reliability of individual services, but do not deal with correctness guarantees of the whole system. Self-healing, to various degrees, is also a property of autonomic systems and many WSN [8][32][86]. Such techniques attempt to avoid violations of run time assurances and thus should be used. Again, our framework can be used in conjunction with reliability and self-healing techniques.

Debugging is a complicated process for WSN and many different approaches exist. Some allow setting distributed breakpoints [76] [82], others use overhearing [60], some are based on invariants [37], and some attempt to use data mining to discover especially difficult to find bugs [46]. Debugging relies on extensive testing. We expect that debugging tools and testing are used prior to deployment or when major problems occur and a cause must be determined. Typically, these solutions are used to fix coding errors, but not exclusively. In our context we expect that if a run time assurance fails and the pre-specified state assurances fail to find the cause, then we have to resort to use of debugging and diagnosis techniques. Tools like Sympathy [58] could provide help in this regard. However, many of these tools are tightly interwoven with specific applications. For example, Sympathy can associate communications with related faults, but it works for tree-like networks with periodic data collection traffic. It may be difficult to adopt the scheme for other systems.

Health monitoring systems such as MANNA [63], LiveNet [17] and Memento [62] and others [61] employ sniffers or specific embedded code to monitor the status of a system. One work uses correctness monitoring based on invariants [37]. Such tools can be used as the monitoring component of our framework if they are available. Information from systems like these would then be passed to our framework for run time assurance checks or to activate further system state checks.

There are very few overall system management systems for WSN [6] [73]. Some of them manage only a few things such as energy [41], topology or bandwidth. However, we have not found any that address run time assurances for high confidence systems in the manner or depth proposed here.

The field of requirements engineering has a long history [22][27][42][44]. The common trend in this field is to develop languages based on a formal semantics [23]. Note that these languages are distinct from software specification languages [5][71]. While these general requirements languages provide guidance for our research, our work has two key differences: (i) our requirements are simpler because we focus only on run time assurances, and (ii) we must address the complications of open, high confidence embedded systems.

6.0 Education and Outreach

We will create and incorporate a run time assurance class module into two current course offerings: CS-451 Wireless Sensor Networks and CS-651 Cyber Physical Systems. Both classes have already been offered and are taught at the graduate and undergraduate levels. We will also make the corresponding teaching materials (slides and labs) available for use at other Universities via the Web. We will enhance the labs associated with these courses to include experiments that use virtual event generators and real event replays. See the following URL http://www.cs.virginia.edu/cs651.wsn/labs.htm for a description of current labs. We will also offer graduate seminars dedicated to high confidence embedded systems. These will include an application component where guest speakers from fire fighting and medical systems will discuss real assurance requirements. The software framework and assurance mechanisms will be available via SourceForge.

The PI has a strong commitment to include underrepresented students in this proposed work. The PI will utilize the School of Engineering Office of Minority Affairs to match students with this research. Recently, the University of Virginia was ranked number 1 in terms of percentage of African American students enrolled among the top research universities.

Results from Previous Grants – Example

NSF CCR-0325197
Amount: $500,000
8/15/03 to 7/31/07
TITLE: Spatiotemporal Protocols and Analysis for Sensor Nets
PI C. Lu (Washington University), PI Stankovic (UVA), Co-PI Abdelzaher (UIUC)

This work develops real-time communication protocols and associated analysis for WSNs, explicitly addressing both the space and time dimensions [30][31][35]. Currently, we are combining the work in [30] and [53] to produce a highly efficient real-time communication protocol. Importantly, our work has been applied to a real-time surveillance application for target detection and tracking [2][3][32]. The results included one of the first complete systems of Berkeley motes and its evaluation. A key service of wireless sensor networks is data aggregation. Here we have combined our expertise in feedback control with our expertise in wireless sensor networks to produce the beginnings of an analysis for such networks [3]. We have also developed a state-free routing protocol [34] which improves end-to-end delivery by more than 10 times over the best solutions in the literature for mobile environments, and a hardware solution for device wakeup by taking energy from a communication signal [25]. This latter solution saves more than 70% of the energy when compared to other energy saving schemes.

REFERENCES

[1] T. Abdelzaher, J. Stankovic, S. Son, B. Blum, T. He, A. Wood, and C. Lu, “A Communication Architecture and Programming Abstractions for Real-Time Embedded Sensor Networks.” Invited paper, In Workshop on Data Distribution for Real-Time Systems, May 2003.

[2] T. Abdelzaher, B. Blum, D. Evans, J. George, S. George, L. Gu, T. He, C. Huang, P. Nagaraddi, S. Son, P. Sorokin, J. Stankovic, and A. Wood, “EnviroTrack: Towards an Environmental Computing Paradigm for Distributed Sensor Networks.” In Proceedings of IEEE ICDCS, April 2004.

[3] T. Abdelzaher, T. He, and J. Stankovic, “Feedback Control of Data Aggregation in Sensor Networks.” Invited paper, Conference on Decision and Control, February 2004.

[4] T. Abdelzaher, B. Blum, Q. Cao, L. Gu, T. He, S. Krishnamurthy, L. Luo, S. Son, J. Stankovic, R. Stoleru, and A. Wood, “Programming and Execution Support for Surveillance in Sensor Networks.” In submission.

[5] Alloy, http://alloy.mit.edu/alloy4/

[6] Y. J. Al-Raisi, D. J. Parish, Approximate wireless sensor network health monitoring. In Proceedings of the International Conference on Wireless Communications and Mobile Computing, IWCMC 2007.

[7] J. Alves-Foss, W.S. Harrison, P. Oman and C. Taylor, “The MILS Architecture for High-Assurance Embedded Systems.” In the International Journal of Embedded Systems, Feb. 2005.

[8] Autonomic computing. http://www.research.ibm.com/autonomic/

[9] H. Balakrishnan, V. Padmanabhan, and R.H. Katz, “The Effects of Asymmetry on TCP Performance.” In Mobile Networks and Applications, pages 219-241, 1999.

[10] V. Bharghavan, A. Demers, S. Shenker and L. Zhang, “MACAW: A Media Access Protocol for Wireless LANs.” In Proceedings of ACM SIGCOMM, pages 212-225, 1994.

[11] B. Blum, P. Nagaraddi, A. Wood, T. Abdelzaher, S. Son, J. Stankovic, “An Entity Maintenance and Connection Service for Sensor Networks.” In Proceedings of the International Conference on Mobile Systems, Applications, and Services (Mobisys), San Francisco, CA, May 2003

[12] N. Bulusu, J. Heidemann, D. Estrin, “GPS-less Low Cost Outdoor Localization for Very Small Devices.” In IEEE Personal Communications Magazine, Special Issue on Smart Spaces and Environments, 2000.

[13] Q. Cao and J. Stankovic, “An In-Field Maintenance Framework for Wireless Sensor Networks.” DCOSS, June 2008.

[14] A. Cerpa and D. Estrin, “ASCENT: Adaptive Self-Configuring Sensor Networks Topologies.” In Proceedings of the IEEE Infocom, 2002.

[15] A. Cerpa, N. Busek and D. Estrin, “SCALE: A Tool for Simple Connectivity Assessment in Lossy Environments.” In CENS Technical Report 0021, September 2003.

[16] B. Chen, K. Jamieson, H. Balakrishnan and R. Morris, “Span: An Energy-Efficient Coordination Algorithm for Topology Maintenance in Ad-hoc Wireless Networks.” In ACM MobiCom, July 2001.

[17] B. Chen, G. Peterson, G. Mainland and M. Walsh, “LiveNet: Using Passive Monitoring to Reconstruct Sensor Network Dynamics.” Accepted to the International Conference on Distributed Computing in Sensor Systems (DCOSS), 2008.

[18] T. Clouqueur, K. K. Saluja, and P. Ramanathan, Fault Tolerance in Collaborative Sensor Networks for Target Detection. IEEE Transactions on Computers, Vol. 53, No. 3, pp. 320–333, March 2004.

[19] J. Deng, R. Han, and S. Mishra, “Insens: Intrusion-Tolerant Routing in Wireless Sensor Networks.” In Proceedings of the 23rd IEEE International Conference on Distributed Computing Systems (ICDCS 2003), Providence, RI, May 2003.

[20] T. Garfinkel, B. Pfaff, J. Chow, M. Rosenblum, D. Boneh, “Terra: A Virtual Machine-Based Platform for Trusted Computing.” Proceedings of ACM Symposium on Operating Systems Principles (SOSP), 2003.

[21] D. Garlan, V. Poladian, B. Schmerl, and J. P. Sousa, “Task-based Self-Adaptation.” Proceedings of the ACM SIGSOFT 2004 Workshop on Self-Managing Systems (WOSS'04), Newport Beach, CA, Oct/Nov 2004.

[22] S. Greenspan, A. Borgida, J. Mylopoulos, “A Requirements Modeling Language and Its Logic.” Information Systems, 11(1), pp. 9-23, 1986. Also appears in Knowledge Base Management Systems, M. Brodie and J. Mylopoulos, Eds. Springer-Verlag, 1986.

[23] S. Greenspan, J. Mylopoulos, A. Borgida, “On Formal Requirements Modeling Languages: RML Revisited.” In the Proceedings of the 16th International Conference on Software Engineering, 1994.

[24] L. Gu and J. A. Stankovic, “t-kernel: Providing Reliable OS Support for Wireless Sensor Networks.” In Proc. of ACM Conf. on Embedded Networked Sensor Systems (SenSys'06), 2006.

[25] L. Gu and J. Stankovic, “Radio-Triggered Wake-Up Capability for Sensor Networks.” In RTAS, May 2004.

[26] H. Gupta, S. R. Das, and Q. Gu, “Connected Sensor Cover: Self-Organization of Sensor Networks for Efficient Query Execution.” In Proceeding of MobiHoc ’03, Annapolis, Maryland, June 2003.

[27] J. Hagelstein, D. Roelents, P. Wodon, “Formal Requirements Made Practical.” In ESEC93, pp. 127-144, 1993.

[28] A. Harter, A. Hopper, P. Steggles, A.Ward, and P.Webster, “The Anatomy of a Context-Aware Application.” In Proceedings of the MOBICOM ’99, 1999.

[29] T. He, C. Huang, B. M. Blum, J. A. Stankovic and T. F. Abdelzaher, “Range-Free Localization Schemes in Large Scale Sensor Networks.” In Proc. MOBICOM, 2003.

[30] T. He, J. Stankovic, C. Lu and T. Abdelzaher, “A Spatiotemporal Communication Protocol for Wireless Sensor Networks.” IEEE Transactions on Parallel and Distributed Systems, October 2003.

[31] T. He, J. Stankovic, C. Lu, and T. Abdelzaher, “SPEED: A Stateless Protocol for Real-Time Communication in Ad Hoc Sensor Networks.” In International Conference on Distributed Computing Systems (ICDCS), 2003.

[32] T. He, S. Krishnamurthy, J. Stankovic, T. Abdelzaher, L. Luo, T. Yan, J. Hui and B. Krogh, “Energy Efficient Surveillance Systems Using Wireless Sensor Networks.” In Mobisys, 2004.

[33] T. He, B. Blum, Q. Cao, J. Stankovic, S. Son and T. Abdelzaher, “Robust and Timely Communication over Highly Dynamic Sensor Networks.” Special issue of Real-Time Systems Journal, acceptance rate 13%, Vol. 37, No. 3, December 2007, pp. 261-289.

[34] T. He, B. Blum, J. Stankovic, S, Son and T. Abdelzaher, “A Lazy-Binding Communication Protocol for Highly Dynamic Wireless Sensor Networks.” In ACM Transactions on Embedded Computer Systems, Vol. 4, Issue 4, 2005.

[35] T. He, B. Blum, J. Stankovic, and T. Abdelzaher, “AIDA: Application Independent Data Aggregation in Wireless Sensor Networks.” Special issue of ACM TECS, 2006.

[36] T. He, C. Huang, B. Blum, J. Stankovic, T. Abdelzaher, “Range-Free Localization and Its Impact on Large Scale Sensor Networks.” ACM Transactions on Embedded Computer Systems, Vol. 4, Issue 4, 2005.

[37] D. Herbert, V. Sundaram, Y. Lu, S. Bagchi, Z. Li, Adaptive Correctness Monitoring for Wireless Sensor Networks Using Hierarchical Distributed Run-Time Invariant Checking. In ACM Transactions on Autonomous and Adaptive Systems (TAAS) 2007.

[38] Y.C. Hu, D. B. Johnson, and A. Perrig, “SEAD: Secure Efficient Distance Vector Routing for Mobile Wireless Ad Hoc Networks.” In Proceedings of the 4th IEEE Workshop on Mobile Computing Systems and Applications (WMCSA 2002), June 2002, pp. 3–13.

[39] C. Intanagonwiwat, R. Govindan and D. Estrin, “Directed Diffusion: A Scalable and Robust Communication Paradigm for Sensor Networks.” In Proc. MOBICOM, pp. 56-67, March 2000.

[40] D.N. Jayasimha, “Fault Tolerance in Multi-Sensor Networks.” IEEE Transactions on Reliability, vol.45, no.2, pp.308-15, June 1996.

[41] X. Jiang, J. Taneja, J. Ortiz, A. Tavakoli, P. Dutta, J. Jeong, D. Culler, P. Levis, and S. Shenker, “An Architecture for Energy Management in Wireless Sensor Networks.” In SIGBED, 2007.

[42] W.L. Johnson, M. Feather, and D. Harris, “Representing and Presenting Requirements Knowledge.” In IEEE Transactions on SE, 1992.

[43] K. D. Kang, S. Son, J. Stankovic, and T. Abdelzaher, “A QoS-Sensitive Approach for Timeliness and Freshness Guarantees in Real-Time Databases.” In EuroMicro Real-Time Systems Conference, June 2002.

[44] S. Kent, T. Maibaum, and W. Quirk, “Formally Specifying Temporal Constraints and Error Recovery.” In RE93, pp. 208-215.

[45] I. Khalil, S. Bagchi, C. Nita-Rotaru, “Dicas: Detection, Diagnosis and Isolation of Control Attacks in Sensor Networks.” Proceedings of the First International Conference on Security and Privacy for Emerging Areas in Communications Networks (SECURECOMM), 2005.

[46] M. Khan, T. Abdelzaher, K. Gupta, “Towards Diagnostic Simulation in Sensor Networks.” Accepted to the International Conference on Distributed Computing in Sensor Systems (DCOSS), 2008.

[47] S. Kim, S. Son, J. Stankovic, S. Li, and Y. Choi, “SAFE: A Data Dissemination Protocol for Periodic Updates in Sensor Networks.” First Workshop on Data Distribution in Real-Time Systems, May 2003.

[48] S. Kim, T. Kwon, Y. Choi, S. Son and J. Stankovic, “Multi-Rate Multicast for Data Dissemination in Sensor Networks.” submitted to IEEE Computer.

[49] S. Kim, S. Son, J. Stankovic, and Y. Choi, “Data Dissemination over Wireless Sensor Networks.” IEEE Communications Letters, Sept. 2004.

[50] L. Lazos, R. Poovendran, and S. Capkun, “ROPE: Robust Position Estimation in Wireless Sensor Networks.” In Proceedings of International Symposium on Information Processing in Sensor Networks (IPSN ‘05), 2005.

[51] S. Li, S. Son, and J. Stankovic, “Event Detection Services Using Data Service Middleware in Distributed Sensor Networks.” 2nd International Workshop on Information Processing in Sensor Networks (IPSN'03), 2003.

[52] S. Li, Y. Lin, S. Son, J. Stankovic and Y. Wei, “Event Detection Using Data Service Middleware in Distributed Sensor Networks.” Special Issue on Wireless Sensor Networks of Telecommunications Systems, Kluwer, Aug. 2004.

[53] C. Lu, B. Blum, T. Abdelzaher, J. Stankovic, and T. He, “RAP: A Real-Time Communication Architecture for Large-Scale Wireless Sensor Networks.” RTAS, June 2002.

[54] L. Luo, T. He, T. Abdelzaher, J. Stankovic, G. Zhou and L. Gu, “Achieving Repeatability of Asynchronous Events in Wireless Sensor Networks with EnviroLog.” Infocom, 2006.

[55] K. Marzullo, “Tolerating Failures of Continuous Valued Sensors.” In ACM Transactions on Computer Systems, vol.8, no.4, pp.284-304, November 1990.

[56] J.C. Navas and T. Imielinski, “Geographic Addressing and Routing.” In Proceedings of MOBICOM ’97, Budapest, Hungary, September 26, 1997.

[57] L. Paradis and Q. Han, “A Survey of Fault Management in Wireless Sensor Networks.” In Journal of Network and Systems Management, Vol 15, No. 2, June 2007.

[58] N. Ramanathan, K. Chang, L. Girod, R. Kapur, E. Kohler, and D. Estrin, Sympathy for the sensor network debugger. In Proceedings of SenSys, 2005.

[59] R. Rajagopal, X. Nguyen, S. Ergen and P. Varaiya, Distributed online simultaneous fault detection for multiple sensors. In Proceedings of the 7th International Conference on Information Processing in Sensor Networks (IPSN) 2008.

[60] M. Ringwald, K. Romer, A. Vitaletti, SNTS: Sensor Network Troubleshooting Suite, In Proceedings of the 3rd IEEE International Conference on Distributed Computing in Sensor Systems DCOSS 2007.

[61] M. Ringwald and K. Romer, Passive Inspection of Sensor Networks, DCOSS 2007.

[62] S. Rost and H. Balakrishnan, “Memento: A Health Monitoring System for Wireless Sensor Networks.” In Proceedings of IEEE SECON, 2006.

[63] L.B. Ruiz, J. Nogueira and A. Loureiro, “MANNA: A Management Architecture for Wireless Sensor Networks.” In IEEE Communications Magazine, Feb. 2003.

[64] L.B. Ruiz, I. Siqueira, L. Oliveira, H. Wong, J. Nogueira, A. Loureiro, “Fault Management in Event-Driven Wireless Sensor Networks.” In MSWiM’04, 2004.

[65] L. Selavo, A. Wood, Q. Cao, A. Srinivasan, H. Liu, T. Sookoor, J. Stankovic, “Luster: Wireless Sensor Network for Environmental Research.” ACM SenSys, Nov. 2007.

[66] J. Stankovic, T. Abdelzaher, C. Lu, L. Sha and J. Hou, “Real-Time Communication and Coordination in Embedded Sensor Networks.” Invited paper, IEEE Proceedings, Vol. 91, No. 7, July 2003, pp. 1002-1022.

[67] M. Steinder and A.S. Sethi, “Probabilistic Fault Diagnosis in Communication Systems Through Incremental Hypothesis Updating.” Computer Networks Vol. 45, 4 (July 2004), pp. 537-562.

[68] R. Stoleru, J.A. Stankovic, S.H. Son, “Robust Node Localization for Wireless Sensor Networks.” International Workshop on Embedded Networked Sensor Systems (EmNetS), 2007.

[69] R. Stoleru, T. He, J. Stankovic, “A High Accuracy, Low-Cost Localization System for Wireless Sensor Networks.” Sensys 2005.

[70] M. Strasser, H. Vogt, “Autonomous and Distributed Node Recovery in Wireless Sensor Networks.” Proceedings of the Fourth ACM Workshop on Security of Ad Hoc and Sensor Networks (SASN), 2006.

[71] R. Thayer, M. Dorfman, System and Software Requirements Engineering, (two volumes), IEEE Computer Society Press, 1990.

[72] D. Tian and N. D. Georganas, “A Node Scheduling Scheme for Energy Conservation in Large Wireless Sensor Networks.” In Wireless Communications and Mobile Computing Journal, May 2003.

[73] G. Tolle and D. Culler, “Design of an Application-Cooperative Management System for Wireless Sensor Networks.” In Proceedings of EWSN, 2005.

[74] H. Vogt, M. Ringwald, M. Strasser, “Intrusion Detection and Failure Recovery in Sensor Nodes.” Proceedings of the Fourth ACM Workshop on Security of Ad Hoc and Sensor Networks (SASN), 2006.

[75] C. Wan, S. Eisenman, A. Campbell, “CODA: Congestion Detection and Avoidance in Sensor Networks,” SenSys 2003.

[76] K. Whitehouse et al., “Marionette: Using RPC for Interactive Development and Debugging of Wireless Embedded Networks.” IPSN, 2006.

[77] E. Wohlstadter, B. Toone, and P. Devanbu, “A Framework for Flexible Evolution in Distributed Heterogeneous Systems.” In Proceedings of the Workshop on Principles of Software Evolution, 2002.

[78] A. Woo, T. Tong and D. Culler, “Taming the Underlying Challenges of Reliable Multihop Routing in Sensor Networks.” In SenSys, Los Angeles, CA 2003.

[79] A. Wood, G. Virone, Y. Wu, L. Selavo, Q. Cao, R. Stoleru, S. Lin, Z. He, J. Stankovic, T. Doan, and L. Fang, “AlarmNet: Wireless Sensor Networks for Assisted-Living and Residential Monitoring.” Univ. of Virginia, TR CS2006-11, 2006.

[80] A. Wood, L. Selavo and J. Stankovic, “SenQ: An Extensible Query System for Streaming Data Heterogeneous Interactive Wireless Sensor Networks.” To appear in DCOSS, June 2008.

[81] A. D. Wood, L. Fang, J. A. Stankovic, T. He, “SIGF: A Family of Configurable, Secure Routing Protocols for Wireless Sensor Networks.” In ACM Workshop on Security of Ad Hoc and Sensor Networks (SASN 2006), 2006.

[82] J. Yang, M.L. Soffa, L. Selavo, and K. Whitehouse, “Clairvoyant: A Comprehensive Source-Level Debugger for Wireless Sensor Networks.” In Proceedings of SenSys, 2007.

[83] M. Yu, H. Mokhtar, M. Merabti, “Fault Management in Wireless Sensor Networks.” In IEEE Wireless Communications, Dec. 2007.

[84] Y. Xu, J. Heidemann and D. Estrin, “Geography-informed Energy Conservation for Ad Hoc Routing.” In ACM MOBICOM 2001, 2001.

[85] Y. Yu, R. Govindan, and D. Estrin, “Geographical and Energy Aware Routing: A Recursive Data Dissemination Protocol for Wireless Sensor Networks.” Technical Report UCLA/CSD-TR-01-0023, UCLA, Department of Computer Science, May 2001.

[86] H. Zhang, A. Arora, “GS3: Scalable Self-Configuration and Self-Healing in Wireless Sensor Networks,” Computer Networks (Elsevier), 43(4):459-480, November 15, 2003.

[87] Y. Zhao, R. Govindan, D. Estrin, “Residual Energy Scan for Monitoring Sensor Networks,” WCNC 2002.