1
Software Fault Tolerance (SWFT)
SWFT for Wireless Sensor Networks (Lec 2)
Dependable Embedded Systems & SW Group www.deeds.informatik.tu-darmstadt.de
Prof. Neeraj Suri
Abdelmajid Khelil
Dept. of Computer Science, TU Darmstadt, Germany
2
Motivation
Last lecture: FT environmental monitoring, e.g. target detection; data fusion.
But the WSN ages! The system is evolvable!
"So who watches the watchmen?"
3
• Ideally, a self-healing network could:
• Identify the problem: that a bird landed on a node
• Identify a fix: the bird needs to be removed
• Fix the problem: actuation to remove the bird
Self-healing
Goal: enable the WSN to self-monitor system health and debug autonomously.
Begin by enabling human debugging, in order to learn which metrics and techniques are useful, and thereby pave the way for autonomous system debugging.
4
Debugging Is Hard in WSN
Wide range of failures: node crashes, sensor failures, code bugs, transient environmental changes to the network
Bugs are multi-causal, non-repeatable, timing-sensitive, and have ephemeral triggers
Transient problems are common and not necessarily indicative of failures
Interactions between sensor hardware, protocols, and environmental characteristics are impossible to predict
Limited visibility: hard accessibility of the system; minimal resources (RAM, communication)
WSN application design is an iterative process between debugging and deployment.
5
Some Debugging Challenges
Minimal resources: cannot remotely log on to nodes; bugs are hard to track down
Evolvable system/conditions: application behavior changes after deployment; operating conditions change (energy, ...)
Extracting debugging information: existing fault-tolerance techniques are limited
Ensuring system health
6
Scenario
After deploying a sensor network...
Very little data arrives at the sink. It could be... anything!
The sink is receiving fluctuating averages from a region, which could be caused by: environmental fluctuations, bad sensors, the channel dropping the data, calculation/algorithmic errors, or bad nodes.
7
Existing Works
Simulators / emulators / visualizers, e.g. EmTOS, EmView, MoteView, TOSSIM:
• Provide real-time information
• Do not capture historical context or aid in root-causing a failure
SNMS: interactive health monitoring; focuses on infrastructure to deliver metrics; high code size. Log files contain excessive data, which can obfuscate important events.
[1] focuses on metric collection and not on metric content.
Memento, Sympathy
[Screenshot: EmView]
8
Sympathy: A Debugging System for Sensor Networks
Nithya Ramanathan, Kevin Chang, Rahul Kapur, Lewis Girod, Eddie Kohler, and Deborah Estrin.
SenSys '05
9
Overview
System Model
Approach
Architecture
Evaluation
10
System Model
Networks feature some regular traffic: sensor nodes are expected to generate traffic of some kind ("monitored traffic"): routing updates, time-synchronization beacons, periodic data, ...
Sympathy suspects a failure when a node generates less monitored traffic than expected.
Sympathy generates additional metrics traffic.
No malicious behavior is assumed.
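The suspicion rule above can be sketched in a few lines. This is a minimal illustration, not Sympathy's actual code: the function name and the per-node packet counts are assumptions, and the expectation threshold stands in for whatever the application defines as "sufficient" monitored traffic per epoch.

```python
def suspect_failures(observed, expected_pkts):
    """Return node ids whose monitored traffic in the last epoch fell
    below the application-defined expectation (illustrative sketch)."""
    return {node for node, count in observed.items() if count < expected_pkts}

# Usage: with 5 packets expected per epoch, nodes 3 and 7 are suspects.
observed = {1: 10, 3: 2, 5: 9, 7: 0}
print(sorted(suspect_failures(observed, expected_pkts=5)))  # [3, 7]
```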
11
Model For Correct Data Flow
The sink may not receive sufficient traffic from a node for multiple reasons.
To determine where and why traffic is lost, Sympathy outlines high-level requirements for data to flow through the network.
12
Node Not Connected
If destination node is not connected, then it may not receive the packet, and it will not respond
13
Node Does Not Receive Packet
If destination node does not receive certain packets (e.g. a query) from source, it may not transmit expected traffic
14
Node Does Not Transmit Traffic
Destination node may receive traffic, but due to software or hardware failure it may not transmit expected traffic
15
Sink Does Not Receive Traffic
Sink may not receive traffic due to collisions or other problems along the route
16
Tracking Traffic Through The Data Flow Model
[Flowchart: the sink tracks traffic by asking, in order: Node connected? Should node transmit traffic? Did node transmit traffic? Did sink receive traffic? A "no" at the connectivity check yields "Node NOT connected", due either to a node crash or to asymmetric communication.]
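The four requirements above form a chain, and the first check that fails names where the flow broke. A minimal sketch of that chain (function and parameter names are illustrative, not from Sympathy):

```python
def locate_break(connected, received_query, transmitted, sink_received):
    """Walk the data-flow requirements in order; report the first
    requirement that is violated (illustrative sketch)."""
    if not connected:
        return "node not connected"
    if not received_query:
        return "node did not receive packet"
    if not transmitted:
        return "node did not transmit traffic"
    if not sink_received:
        return "sink did not receive traffic"
    return "data flow OK"

# Usage: the node got the query but never transmitted.
print(locate_break(True, True, False, False))  # node did not transmit traffic
```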
17
Design Requirements
Tool for detecting and debugging failures in pre- and post-deployment phases.
Debugging information should provide the most precise and meaningful failure detection:
• Accuracy
• Latency
Lowest overhead: transmitted debugging information must be minimized.
18
3 Challenges in WSN Debugging
Has a failure happened? Which failure happened? Is the failure important?
Sympathy aids users in finding (detecting and localizing) and fixing failures by attempting to answer these questions.
19
Sympathy Approach
[Diagram: (1) the sink collects statistics passively and actively; (2) it monitors data flow from nodes/components; (3) it identifies and localizes failures; (4) it highlights failure dependencies and event correlations.]
Idea: "There is a direct relationship between the amount of data collected at the sink and the existence of failures in the system."
20
Has a Failure Happened?
The network's purpose is to communicate. If nodes are communicating sufficiently, the network is working. The simplest solution is the best:
"Insufficient" traffic => failure. The application defines "sufficient".
Sympathy detects many different failure types by tracking application end-to-end data flow: channel contention, node crash, asymmetric links, sensor failure, no sensor data.
21
Which Failure Happened?
[Decision tree, following the data-flow questions (Node connected? Should node transmit? Did node transmit? Did sink receive traffic?):
• Nobody has heard from the node (sink timestamp / neighbor tables) -> Node Crash
• Time awake increased unexpectedly -> Node Reboot
• No valid neighbor table -> No Neighbors
• No valid route table -> No Route
• Insufficient #pkts received at the node -> Bad Path to Node
• Insufficient #pkts transmitted -> Bad Node Transmit
• Insufficient #pkts received at the sink -> Bad Path to Sink]
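The tree above can be sketched as a sequence of checks over a node's reported metrics. This is an illustrative reconstruction, not Sympathy's implementation: the dictionary keys and the single shared expectation threshold are assumptions made for compactness.

```python
def classify(node):
    """Walk the 'Which failure happened?' checks in order and return
    the first root cause that matches (illustrative sketch)."""
    if not node["heard_by_anyone"]:       # nobody heard from node
        return "Node Crash"
    if node["rebooted"]:                  # time-awake counter reset
        return "Node Reboot"
    if not node["neighbor_table"]:        # no valid neighbor table
        return "No Neighbors"
    if not node["route_table"]:           # no valid route table
        return "No Route"
    if node["pkts_rx"] < node["expected"]:
        return "Bad Path to Node"
    if node["pkts_tx"] < node["expected"]:
        return "Bad Node Transmit"
    if node["sink_rx"] < node["expected"]:
        return "Bad Path to Sink"
    return "No Failure"

# Usage: node is alive and routing, but the sink sees too few packets.
node = {"heard_by_anyone": True, "rebooted": False,
        "neighbor_table": [18], "route_table": [(0, 18, 0.9)],
        "pkts_rx": 10, "pkts_tx": 10, "sink_rx": 2, "expected": 8}
print(classify(node))  # Bad Path to Sink
```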
22
Is the Failure Important?
Analyze failure dependencies to highlight primary failures, based on reported node topology: "Can failure X be caused by failure Y?"
Deemphasize secondary failures to focus the user's attention.
Does NOT identify all failures or debug failures down to a line of code.
[Diagram: a primary failure with its secondary failures.]
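One way to sketch this dependency analysis: a node's failure is deemphasized as secondary when some node on its route to the sink has also failed, since the upstream failure can explain the missing traffic. The route representation and function name below are assumptions for illustration, not Sympathy's actual algorithm.

```python
def primary_failures(failed, routes):
    """failed: set of failed node ids; routes: node id -> list of hop
    node ids on its route to the sink. A failure is primary when no
    node on its route has also failed (illustrative sketch)."""
    return {n for n in failed if not any(hop in failed for hop in routes[n])}

# Usage: node 25 routes through 18; since 18 also failed, 25's failure
# is secondary and only 18 is reported as primary.
routes = {25: [18, 15, 12], 18: [15, 12], 15: [12]}
failed = {25, 18}
print(sorted(primary_failures(failed, routes)))  # [18]
```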
23
Failure Localization
Determining why data is missing: physically narrow down the cause, e.g. where is the data lost?
Was the data even sent by the component? Or: where in the transmission path was the data lost?
24
Final Goal
Detect failures; determine what the failure is (root cause); determine if the failure is important (primary failures).
[Diagram, example network with primary failures in red boxes: Bad Node Transmit (ADC failure), Node Crashed, No Neighbors, and Bad Path to Node / Bad Path to Sink, both due to contention at the sink.]
25
Sympathy System: Architecture
[Diagram: each sensor node runs Sympathy alongside Routing, ..., and Apps; the sink runs Sympathy (CollectMetrics) alongside Routing, ..., Apps, and user processes.]
26
Architecture Definitions
Network: a sink and distributed nodes
Component: node components and sink components
Sympathy-sink: communicates with sink components; understands all packet formats sent to the sink; a non-resource-constrained node
Sympathy-node: statistics period; epoch
[Diagram: Sympathy-sink and sink component on the sink; Sympathy-node and node component on each sensor node.]
27
Architecture: Overview
[Diagram, step 1: on the sink, Sympathy runs alongside Routing, ..., and components. The Sympathy pipeline: CollectStats -> PerformDiagnostic -> if no/insufficient data, run tests -> run fault-localization algorithm -> report to the user. Statistics flow in from the sink components and from the nodes.]
28
Sympathy Code on Sensor Node
Each component is monitored independently; components return generic or app-specific statistics.
[Diagram: Sympathy-node sits above the routing and MAC layers; a Stats Recorder & Event Processor retrieves component statistics into a ring buffer and handles data return for each instrumented component.]
29
Metrics
Metrics are collected in 3 ways:
• Sympathy code on each sensor node actively reports to the sink (periodically or on demand)
• The sink passively snoops its own transmission area
• Sympathy code on the sink extracts sink metrics from the sink application
3 metric categories:
• Connectivity: ROUTING TABLE and NEIGHBOUR LIST from each node, collected either passively or actively
• Flow: PACKETS SENT, PACKETS RECEIVED, #SINK PACKETS TRANSMITTED from each sensor node, and #SINK PACKETS RECEIVED and SINK LAST TIMESTAMP from the sink
• Node: sensor nodes actively report UPTIME, BAD PACKETS RECEIVED, GOOD PACKETS RECEIVED; the sink also maintains BAD and GOOD PACKETS RECEIVED
All metrics time out after an EPOCH.
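The epoch timeout can be sketched as a last-seen table at the sink: a metric is only trusted while its most recent report is younger than one epoch. The class, method names, and epoch length below are illustrative assumptions, not Sympathy's data structures.

```python
EPOCH = 60.0  # seconds; illustrative value, not from the paper

class MetricStore:
    """Sketch of per-(node, metric) bookkeeping with epoch expiry."""

    def __init__(self):
        self.last_seen = {}  # (node, metric) -> (timestamp, value)

    def record(self, node, metric, value, now):
        self.last_seen[(node, metric)] = (now, value)

    def fresh(self, node, metric, now):
        """True iff the metric was reported within the last EPOCH."""
        entry = self.last_seen.get((node, metric))
        return entry is not None and now - entry[0] <= EPOCH

# Usage: a routing-table report is fresh at 30 s but stale at 120 s.
store = MetricStore()
store.record(25, "ROUTING TABLE", [(0, 18, 0.9)], now=0.0)
print(store.fresh(25, "ROUTING TABLE", now=30.0))   # True
print(store.fresh(25, "ROUTING TABLE", now=120.0))  # False
```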
30
Node Statistics
Collected passively (in the sink's broadcast domain) and actively transmitted by nodes.
Statistic Name   | Description
ROUTING TABLE    | (Sink, next hop, quality) tuples
NEIGHBOUR LIST   | Neighbors and associated ingress/egress
UP TIME          | Time the node has been awake
#Statistics tx   | #Statistics packets transmitted to the sink
#Pkts routed     | #Packets routed by the node
31
Component Statistics
Actively transmitted by a node to the sink, for each instrumented component.
Statistic Name       | Description
#Reqs rx             | Number of packets the component received
#Pkts tx             | Number of packets the component transmitted
SINK LAST TIMESTAMP  | Timestamp of the last data stored by the component
32
Sympathy System
[Diagram, step 2: CollectStats runs both on the sink (gathering statistics from sink components) and on the nodes (gathering statistics from Routing, ..., and each instrumented component).]
33
Sink Interface
Sympathy passes component-specific statistics using a packet queue; components return ASCII translations for Sympathy to print to the log file.
[Diagram: Sympathy exchanges component-specific statistics with components 1-3 and receives ASCII translations of statistics / data received.]
34
Sympathy System
[Diagram, step 3: on the sink, after CollectStats and PerformDiagnostic, if there is no or insufficient data, Sympathy runs tests and then the failure-localization algorithm.]
35
Failure Localization: Decision Tree
(Tx: transmit, Rx: receive)
[Decision tree run by the diagnostic:
• No statistics received and no node has heard this node -> Node Crashed; if no node has a route to the sink -> No Route to Sink; if no node has the sink on its neighbor list -> No Sink Neighbor; a packet was received from the node but it has rebooted -> Node Rebooted; otherwise the failure cannot be localized -> No Data.
• Statistics received but data is insufficient:
  - All of the component's data received, and the component has no data to Tx -> NO FAILURE
  - Component did not Rx requests -> Node not Rx Reqs
  - Component Rx requests but did not Tx responses -> Node not Tx Resps
  - Component Tx responses but the sink did not Rx them -> Sink not Rx Resps]
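The component-level branch of the tree above (the "insufficient data" side) can be sketched as a short fall-through. This is an illustrative reconstruction; the field names are assumptions, not Sympathy's statistics format.

```python
def diagnose_component(comp):
    """Walk the component-level diagnostic checks in order and return
    the first localization that matches (illustrative sketch)."""
    if not comp["has_data"]:
        return "no failure (component has no data to Tx)"
    if comp["reqs_rx"] == 0:       # component never saw a request
        return "node not Rx reqs"
    if comp["resps_tx"] == 0:      # component never answered
        return "node not Tx resps"
    if comp["sink_rx"] == 0:       # answers lost on the way back
        return "sink not Rx resps"
    return "no failure"

# Usage: the component received 4 requests but transmitted no responses.
comp = {"has_data": True, "reqs_rx": 4, "resps_tx": 0, "sink_rx": 0}
print(diagnose_component(comp))  # node not Tx resps
```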
36
Functional "No Data" Failure Localization
Failure          | Description
Node Crash       | Node has crashed and not come back
No Route to Sink | No valid route exists to the sink from a node
No Data          | No data received from a node, and Sympathy cannot localize the failure
37
Performance "Insufficient Data" Failure Localization
Failure          | Description
Node Reboot      | Node has rebooted
Congestion       | Correlated failures on packet reception
No requests rx   | Component is not receiving requests from the sink
No response tx   | Component is not transmitting data in response to requests
No response rx   | Sink is not receiving data transmitted by a component
No statistics rx | Sink has not received Sympathy statistics on the component
38
Source Localization
Root causes with associated metrics and source. Three localized sources for failures:
• The node itself (crash, reboot, local bug, connectivity issue, ...)
• The path between node and sink (relay failure, collisions, ...)
• The sink
39
Sympathy System
[Diagram, step 4: after the tests and the fault-localization algorithm run on the sink, the results are reported to the USER.]
40
Informational Log File
Node 25
Time: Node awake: 78 (mins)  Sink awake: 78 (mins)
Route: 25 -> 18 -> 15 -> 12 -> 10 -> 8 -> 6 -> 2
Num neighbors heard this node: 6

Pkt-type    #Rx     Mins-since-last   #Rx-errors   Mins-since-last
1:Beacon    15(2)   0 mins            1(0)         52 mins
3:Route     3(0)    37 mins           0(0)         INF
Symp-stats  12(2)   1 mins

Reported Stats from Components
------------------------------
**Sympathy: #metrics tx/#stats tx/#metrics expected/#pkts routed: 13(2)/12(2)/13(1)/0(0)

Node-ID   Egress   Ingress
--------------------------
8         128      71
13        128      121
24        249      254
41
Failure Log File
Node 18
Node awake: 0 (mins)  Sink awake: 3 (mins)
Node Failure Category: Node Failed!

TESTS
Received stats from module [FAILED]
Received data this period [FAILED]
Node thinks it is transmitting data [FAILED]
Node has been claimed by other nodes as a neighbor [FAILED]
Sink has heard some packets from node [FAILED]
Received data this period: Num pkts rx: 0(0)
Received stats from module: Num pkts rx: 0(0)

Node's next-hop has no failures
42
Spurious Failures
A spurious failure is an artifact of another failure. Sympathy highlights failure dependencies in order to distinguish spurious failures.
[Diagram: to the Sympathy sink, a crashed node appears to not be sending data, while congestion makes another node appear to be sending very little data.]
43
Testing Methodology
Application: Sympathy run with the ESS (Extensible Sensing System) application, in simulation, emulation, and deployment.
Traffic conditions: no traffic, application traffic, congestion.
Node failures:
• Node reboot: only requires information from the node
• Node crash: requires spatial information from neighboring nodes to diagnose
A failure is injected into one node per run, for each node.
18-node network, with a maximum of 7 hops to the sink.
44
Evaluation Metrics
Accuracy of failure detection: number of primary-failure notifications.
Latency of failure detection/notification: time from when the failure is injected to when Sympathy notifies the user about the failure.
There is a tradeoff between accuracy and latency.
45
Notification Latency
Does Sympathy always detect an injected failure? Detection = assign a root cause of node crash and highlight the failure as primary.
[Plot: notification latency, measured in EPOCHs.]
46
Notification Accuracy
47
Memory Footprint (TinyOS, mica2)
Binary           | RAM    | ROM
ESS w/o Sympathy | 3089 B | 96094 B
ESS w/ Sympathy  | 3160 B | 104802 B
Difference       | 71 B   | 8708 B
Sympathy         | 47 B   | 1558 B
48
Extensibility
Adding new metrics requires ~5 lines of code on the nodes and ~10 lines of code on the sink.
Extensible to application classes with predictable data flow within the bounds of an epoch; the user specifies the expected amount of data.
Extensible to different routing layers due to the modular design: the Multihop routing plug-in was 140 lines; the MintRoute routing plug-in was 100 lines.
49
Conclusion
A deployed system that aids in debugging by detecting and localizing failures.
A small list of statistics that are effective in identifying and localizing failures.
A behavioral model for a certain application class that provides a simple diagnostic to measure system health.
50
Literature
[1] J. Zhao, R. Govindan, D. Estrin, "Computing Aggregates for Monitoring Wireless Sensor Networks", SNPA 2003.
[2] N. Ramanathan, K. Chang, R. Kapur, L. Girod, E. Kohler, D. Estrin, "Sympathy for the Sensor Network Debugger", SenSys 2005.
[3] S. Rost, H. Balakrishnan, "Memento: A Health Monitoring System for Wireless Sensor Networks", SECON 2006.