Query Optimization for Sensor Networks

Harshavardhan Achrekar, University of Massachusetts-Lowell

Basic architecture for Querying in TinyDB

Query submitted at a PC (base station), parsed, optimized

Query sent into the sensor network, disseminated, processed

Result flows back up the routing tree that was formed as the query propagated

Disadvantages of this architecture

Data is extracted from the sensor network in a predefined way and stored in a database located on the front end.

Query processing takes place on the centralized database and outputs results of predefined queries over historical data.

Nodes near the access point become traffic hot spots and central points of failure, and may be depleted of energy prematurely.

Does not take advantage of in-network aggregation of data to reduce communication load, when only aggregate data needs to be reported

Goal of this Research Proposal

Design a scheme to support multiple data acquisition and aggregation queries in a wireless sensor network, in order to minimize the amount of radio activity and energy consumption.

Exploit correlation among similar queries to share the limited communication and computational resources.

Devise a final optimal query plan by applying successive transformation rules to the initial query plan.

Example: Flood Warning System

A user from an emergency management agency sends a query to the flood sensor DB: “For the next 3 hours, retrieve every 10 minutes the maximum rainfall level in each county in Southern California, if it is greater than 3.0 inches”

SELECT max(rainfall_level), county FROM Sensors WHERE state = 'Southern California' GROUP BY county HAVING max(rainfall_level) > 3.0 IN DURATION [now, now + 180 min] SAMPLING PERIOD 10 min

Classification of Queries

Long-running, continuous queries: report results over an extended time window. ex: “for the next 3 hours, retrieve every 10 minutes the rainfall level in California”

Snapshot queries: data in the network at a given point in time. ex: “retrieve the current rainfall level for all sensors in California”

Historical queries: aggregate information over historical data. ex: “retrieve the average rainfall level at all sensors for the last 3 months of the previous year”

Optimization of a Long Continuous Query

join(SI1, SI2): a join operator that relates tuples having the same timestamp TS. For every new tuple read on one of the input streams, the join operator checks whether the last tuple read from the other stream has the same timestamp.

sync-join(SI1, SI2): a join in which SI2 is an on-demand stream. The sync-join requests the activation of SI2 only when a tuple arrives on SI1.
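The difference between the two operators can be sketched in Python. This is a minimal illustration under assumptions, not the operators' actual implementation: readings are modelled as (timestamp, value) pairs, and the names `TimestampJoin`, `sync_join`, and `sample_si2` are hypothetical.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple

Reading = Tuple[int, float]  # (timestamp TS, value); an assumed tuple layout


class TimestampJoin:
    """Plain join: both streams push tuples; only the last tuple per stream is kept."""

    def __init__(self) -> None:
        self.last = {1: None, 2: None}

    def push(self, stream: int, reading: Reading) -> Optional[Tuple[int, float, float]]:
        self.last[stream] = reading
        other = self.last[2 if stream == 1 else 1]
        # Emit a joined tuple only if the other stream's last tuple has the same TS.
        if other is not None and other[0] == reading[0]:
            left, right = (reading, other) if stream == 1 else (other, reading)
            return (reading[0], left[1], right[1])
        return None


def sync_join(
    si1: Iterable[Reading],
    sample_si2: Callable[[int], Optional[Reading]],
) -> Iterator[Tuple[int, float, float]]:
    """Sync-join: SI2 is on-demand and is activated only when a tuple arrives on SI1."""
    for ts, v1 in si1:
        r2 = sample_si2(ts)  # the second sensor is sampled only at this point
        if r2 is not None and r2[0] == ts:
            yield (ts, v1, r2[1])
```

In the sync-join the second sensor is never sampled unless the first stream delivers a tuple, which is where the energy saving comes from.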

Transformation rules

Use sync-joins and on-demand streams when possible.

Given that a sync-join requires a sensor stream on the right side, trees representing query plans should be unbalanced to the left (left-deep join trees).

Unary operators such as selections, projections, and temporal aggregates (which reduce the amount of data being forwarded) should be moved as close as possible to the node where data is acquired.

Query optimization example:

SELECT * FROM 1.Magnetism, 2.Acceleration, 3.Temperature WHERE p1(1.Magnetism) and p2(2.Acceleration) and p3(3.Temperature) EVERY 1000

where p1, p2, and p3 are predicates on magnetism, acceleration, and temperature readings, respectively, with probabilities Pr(p1) = 0.01, Pr(p2) = 0.05, Pr(p3) = 0.1

Analysis of Cost of execution

QP1 is obtained by applying the left-deep join trees rule.

QP2 is obtained from QP1 by applying the selection push-down rule and allocating the selections on the nodes where the data are generated.

QP3 is obtained from QP2 by applying the rules for transforming joins into sync-joins.
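To see why sync-joins pay off for this example, here is a rough back-of-the-envelope sketch. It assumes (these are not stated in the slides) that the three predicates are independent, that each sensor acquisition has unit cost, and that in a sync-join chain a sensor is sampled only when all earlier predicates have passed. The function name `expected_acquisitions` is hypothetical.

```python
def expected_acquisitions(probs):
    """Expected number of sensor samples per epoch for a chain of sync-joins.

    probs[i] is the probability that the i-th predicate in the chain holds.
    The first sensor is always sampled; each later sensor is sampled only
    if every earlier predicate was satisfied (independence is assumed).
    """
    expected = 0.0
    reach = 1.0  # probability that the chain reaches the current sensor
    for p in probs:
        expected += reach
        reach *= p
    return expected


# Most selective predicate first: Magnetism, Acceleration, Temperature.
selective_first = expected_acquisitions([0.01, 0.05, 0.1])
# Reverse order, for comparison.
selective_last = expected_acquisitions([0.1, 0.05, 0.01])
```

Under these assumptions the chain ordered Magnetism, Acceleration, Temperature needs about 1.0105 acquisitions per epoch (1 + 0.01 + 0.01 · 0.05), against about 1.105 for the reverse order and 3 for a plan that samples all three sensors every epoch.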

Two-Tier Multiple Query Optimization

This scheme supports both aggregation and data acquisition queries while minimizing the average transmission time in the sensor network.

Tier One: Base Station Optimization Algorithm (a cost-based approach that heuristically rewrites user queries into “synthetic” queries before injecting them into the sensor network)

Tier Two: In-Network Optimization Algorithm (sensor nodes make local decisions themselves and adaptively handle the query workload over time)

Base Station Optimization Algorithm

User Query Structure: <qid, attribute list | agg list, predicate list, epoch duration, qid'>

(a) qid: unique identifier of the query.

(b) attribute list: list of attributes that data acquisition query qid retrieves from the sensor network.

(c) agg list: list of <operator, attribute> pairs that aggregation query qid acquires.

(d) predicate list: list of <attribute, min, max> entries.

(e) qid': denotes which synthetic query this query qid has been rewritten into.

Synthetic Query Structure: <qid, attribute list | agg list, predicate list, epoch duration, qid', count(epoch), from list, flag, benefit>

(a) A count field is associated with the epoch duration field as well as with each entry in the various lists (attribute list, agg list, and predicate list). It denotes the number of user queries that require that piece of data, which facilitates maintenance of the synthetic query when user queries terminate.

(b) A from list field contains the user queries for which the synthetic query is responsible.

(c) A flag field denotes the current status of this synthetic query.

(d) A benefit field indicates the benefit that can be gained by the synthetic query (in comparison to processing the individual user queries).
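The two structures and the role of the count field can be sketched as Python dataclasses. This is only an illustration under assumptions: it tracks counts for the attribute list alone (the agg list, predicate list, and epoch counts would be handled the same way), and the field names, the `"active"` flag value, and the default epoch are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Predicate = Tuple[str, float, float]  # <attribute, min, max>


@dataclass
class UserQuery:
    qid: int
    attributes: List[str] = field(default_factory=list)
    agg_list: List[Tuple[str, str]] = field(default_factory=list)  # <operator, attribute>
    predicates: List[Predicate] = field(default_factory=list)
    epoch_ms: int = 2048                  # assumed default, for illustration only
    synthetic_qid: Optional[int] = None   # the qid' field


@dataclass
class SyntheticQuery:
    qid: int
    epoch_ms: int
    attribute_counts: Dict[str, int] = field(default_factory=dict)  # count per attribute
    from_list: List[int] = field(default_factory=list)
    flag: str = "active"                  # hypothetical status value
    benefit: float = 0.0

    def add(self, q: UserQuery) -> None:
        """Make this synthetic query responsible for user query q."""
        for attr in q.attributes:
            self.attribute_counts[attr] = self.attribute_counts.get(attr, 0) + 1
        self.from_list.append(q.qid)
        q.synthetic_qid = self.qid

    def remove(self, q: UserQuery) -> None:
        """User query q terminates; drop attributes no other query still needs."""
        for attr in q.attributes:
            self.attribute_counts[attr] -= 1
            if self.attribute_counts[attr] == 0:
                del self.attribute_counts[attr]
        self.from_list.remove(q.qid)
        q.synthetic_qid = None
```

When a user query terminates, any attribute whose count drops to zero no longer has to be retrieved from the network.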

Benefit Estimation-Cost Model

The transmission cost of a result message of query qi from one node to another can be estimated as Cstart + Ctrans · len(qi), where Cstart is the fixed start-up cost of a transmission, Ctrans is the cost per unit of message length, and len(qi) is the length of a result message of qi.

To measure the average transmission cost incurred by qi per unit of time, we have to estimate the number of transmissions per unit of time incurred by qi. This depends on the number of result messages generated by the sensors and on the number of hops required to forward the messages back to the base station.

First, we look at the per-unit-time number of result messages generated by a set of sensor nodes $N_k$, denoted $\mathrm{result}(q_i, N_k)$. At the end of each epoch of $q_i$, one result message is generated by each sensor node whose readings satisfy the predicates of $q_i$. Therefore, we have

$$\mathrm{result}(q_i, N_k) = \frac{\mathrm{sel}(q_i, N_k) \cdot |N_k|}{\mathrm{epoch}_i} \tag{1}$$

where $\mathrm{sel}(q_i, N_k)$ is the selectivity of the query predicates over $N_k$, equal to the fraction of sensor nodes in $N_k$ whose readings satisfy the query predicates, and $\mathrm{epoch}_i$ is the epoch length of $q_i$.


Second, the number of forwarding hops of a result message is determined by the location of the message's source node in the data routing tree. Based on Eq. (1), the number of message transmissions incurred by $q_i$ is

$$\mathrm{trans}(q_i) = \sum_{k=1}^{\mathrm{max\_depth}} \mathrm{result}(q_i, N_k) \cdot k \tag{2}$$

where $N_k$ is the set of sensor nodes at the $k$th level of the routing tree and $\mathrm{max\_depth}$ is the maximum depth of the routing tree.

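The cost of a query, $\mathrm{cost}(q_i)$, is the number of transmissions multiplied by the cost of one transmission:

$$\mathrm{cost}(q_i) = \mathrm{trans}(q_i) \cdot \bigl(C_{start} + C_{trans} \cdot \mathrm{len}(q_i)\bigr) \tag{3}$$

The benefit of merging two queries $q_1$ and $q_2$ into a synthetic query $q_{12}$ is

$$\mathrm{Benefit}(q_1, q_2) = \mathrm{cost}(q_1) + \mathrm{cost}(q_2) - \mathrm{cost}(q_{12})$$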
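The cost model of Eqs. (1) to (3) translates directly into a few lines of Python. The sketch below is illustrative only: the function names mirror the symbols in the equations, and the numeric values in the usage example are made up rather than taken from the slides.

```python
def result_rate(sel, n_nodes, epoch):
    """Eq. (1): result messages per unit time from a set of n_nodes sensors."""
    return sel * n_nodes / epoch


def trans(sel_by_level, nodes_by_level, epoch):
    """Eq. (2): transmissions per unit time; level k nodes need k hops."""
    return sum(
        result_rate(sel, n, epoch) * k
        for k, (sel, n) in enumerate(zip(sel_by_level, nodes_by_level), start=1)
    )


def cost(trans_rate, msg_len, c_start, c_trans):
    """Eq. (3): transmissions times the cost of one message."""
    return trans_rate * (c_start + c_trans * msg_len)


def benefit(cost_q1, cost_q2, cost_q12):
    """Benefit of replacing q1 and q2 with the synthetic query q12."""
    return cost_q1 + cost_q2 - cost_q12


# Hypothetical two-level tree: 2 nodes at level 1, 4 nodes at level 2,
# selectivity 0.5 at both levels, epoch length 2.
t = trans([0.5, 0.5], [2, 4], epoch=2)           # 0.5*1 + 1.0*2 = 2.5
c = cost(t, msg_len=4, c_start=1, c_trans=0.5)   # 2.5 * (1 + 2) = 7.5
```

A positive benefit means the synthetic query is cheaper than running the two user queries separately.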


Base Station Optimization Algorithm

In-Network Optimization Algorithm

Sharing over time: more progressive sharing over time is achieved by scheduling the data acquisition and transmission of all queries as a whole.

At the end of a query’s propagation phase, setSampleRate is triggered, which may start (or restart) the node’s clock so that it fires at an interval equal to the GCD of the epoch durations of all the queries. The epoch start time on the sensor nodes is aligned to a multiple of the epoch duration rather than to the arrival time of a new query (here we assume that every epoch duration is a multiple of 2048 ms).
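The clock setting can be illustrated with a short Python sketch. It assumes time is measured in milliseconds from a common origin and that a query is due whenever the current time is a multiple of its epoch duration; the helper names `clock_period` and `due_queries` are hypothetical and are not part of the setSampleRate interface.

```python
from functools import reduce
from math import gcd


def clock_period(epochs_ms):
    """Clock interval: GCD of the epoch durations of all running queries."""
    return reduce(gcd, epochs_ms)


def due_queries(now_ms, epoch_by_qid):
    """Queries whose epoch starts at now_ms (epochs aligned to multiples of their duration)."""
    return [qid for qid, epoch in epoch_by_qid.items() if now_ms % epoch == 0]


# Hypothetical workload: three queries, all epochs multiples of 2048 ms.
epochs = {1: 4096, 2: 6144, 3: 8192}
period = clock_period(epochs.values())   # the clock fires every 2048 ms
```

Because epoch starts are aligned to multiples of the epoch duration, queries with related epochs sample at the same clock ticks and can share one acquisition.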

Sharing over space - After the sample rate has been set at each node, data will be retrieved periodically and transmitted out of the network to the base station. During the query result collection, we use the optimization heuristics to aggressively share data over space.

Each sensor node dynamically selects a route (parent) that is aware of the query space (unlike TinyDB, which selects parents by link quality). Meanwhile, it tries to take advantage of the broadcast nature of the radio channel to satisfy multiple queries with one message.


Query Propagation Phase

Queries are flooded throughout the network from the base station. The exact set of sensors that have data for a query is not known a priori to the base station, and this set can vary with time.

Let every sensor decide where to propagate the query based on its local information about its neighbors.

When a query is propagated from node x at level i to level i + 1, node x checks whether it has the data the query retrieves, and piggybacks this information down.

Meanwhile, the DAG is formed by adding an edge from every node to each of its upper-level neighbors (if the network is dense, not all neighbors need be maintained, only those that also have query results to transmit).

If the data at node x does not satisfy any query, x switches into sleep mode and will wake up after a predefined time.

When it wakes up, if it finds that its current data satisfies a query, it sends a one-hop broadcast message so that its lower level neighbors would consider the node as an option to relay its data.


Result Collection Phase

Epoch-based mechanism: each epoch (sampling period) is divided into time intervals. The number of intervals reflects the depth of the routing tree.

Aggregation results reported at the end of each sampling period

When a node broadcasts a query, it specifies the time interval within which it expects to hear the result from its children.

During its scheduled interval, each node:

listens for the packets from the children, receives them (gray)

computes a new partial state record by combining its own data and the partial state records from its children (black)

sends the result up the tree to its parent (white)
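The combine step can be sketched in Python for one concrete aggregate. The sketch assumes an AVG query whose partial state record is a (sum, count) pair, which is one common way to make AVG mergeable; the slides do not fix the record layout, and the function names are hypothetical.

```python
def merge(a, b):
    """Combine two partial state records (sum, count) for AVG."""
    return (a[0] + b[0], a[1] + b[1])


def node_step(own_reading, child_records):
    """What a node does in its interval: fold its own reading into the children's records."""
    record = (own_reading, 1)
    for child in child_records:
        record = merge(record, child)
    return record  # this is what gets sent up to the parent


def finalize(record):
    """At the base station: turn the final partial state record into the AVG."""
    total, count = record
    return total / count


# Hypothetical three-node tree: two leaves reporting to one parent.
leaf_a = node_step(10.0, [])
leaf_b = node_step(20.0, [])
root = node_step(30.0, [leaf_a, leaf_b])
```

Each node forwards a single fixed-size record regardless of how many descendants it has, which is what keeps the communication load low for aggregate queries.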


Example: DAG-Based In-Network Algorithm

Nodes D, E, F, G, and H are queried by query qi.

Nodes D, G, and H are queried by qj.

Without the DAG: 8 nodes are involved, and 12 (messages for qj) + 8 (messages for qi) = 20 radio messages are transmitted.

Using the DAG, G chooses D instead of C as the relay for both qi and qj, so nodes C and A can be instructed to sleep.

With the DAG: 6 nodes are involved, and 4 (messages for qj) + 8 (messages for qi) = 12 radio messages are transmitted.
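One possible query-aware parent selection rule is sketched below. The slides only state that the chosen parent is aware of the query space; the scoring rule used here (prefer the upper-level neighbor that shares the most queries with the node) is an assumption, as are the function name `choose_parent` and the neighbor sets.

```python
def choose_parent(my_queries, candidates, queries_at):
    """Pick the upper-level neighbor that has results for the most of my queries.

    my_queries: set of query ids this node has results for.
    candidates: upper-level neighbors, in preference order (ties keep the first).
    queries_at: mapping from node name to the set of queries that node answers.
    """
    return max(candidates, key=lambda n: len(my_queries & queries_at.get(n, set())))


# Node G from the example: queried by qi and qj, with C and D as possible relays.
# D is queried by both qi and qj, while C is queried by neither.
queries_at = {"C": set(), "D": {"qi", "qj"}}
parent_of_g = choose_parent({"qi", "qj"}, ["C", "D"], queries_at)
```

With this rule G picks D, since D already has to transmit results for both queries, and C is left with nothing to relay.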

My Proposal for Multi Query Optimization

Suppose that for a system with n queries we choose n distinct root stations (assuming the number of sensor nodes exceeds the number of concurrent queries).

Queries are flooded throughout the network from each of the root nodes, which are connected directly to the base station.

We divide the epoch duration into n equal intervals, and every processing node transmits the sensed data relevant to one query in each interval of the epoch.
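The division of the epoch into n intervals can be sketched as a simple slot computation. This assumes one interval per query, assigned in a fixed order, and time measured in milliseconds from the epoch origin; the function name `slot_index` is hypothetical.

```python
def slot_index(now_ms, epoch_ms, n_queries):
    """Index (0 .. n_queries-1) of the interval that now_ms falls into within its epoch."""
    return (now_ms % epoch_ms) * n_queries // epoch_ms


# Hypothetical setting: 4 concurrent queries sharing a 2048 ms epoch,
# so each query gets a 512 ms interval.
queries = ["Query1", "Query2", "Query3", "Query4"]
active = queries[slot_index(600, 2048, len(queries))]
```

In this setting a node at time 600 ms is in the second interval and works on Query2.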


The twist in the algorithm lies in processing the next query in a scheduled round-robin fashion whenever a node would otherwise be in sleep mode, as described in the previous discussion.

At the end of the epoch, we output the results of all concurrent queries simultaneously.

Apply the correlation algorithm to reduce the amount of transmitted data.

Problem: an “overloaded node” is a single node that acts as a parent for two different queries in the same epoch. Under normal circumstances, a collision occurs.

Solution: apply an exponential back-off algorithm to the contending queries.
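A minimal sketch of binary exponential back-off follows. The slides do not specify the variant, so the details here are assumptions: after the c-th collision the sender waits a random number of slots drawn from 0 to 2^c - 1, with the exponent capped at a hypothetical `max_exponent`.

```python
import random


def backoff_slots(collisions, rng, max_exponent=5):
    """Number of slots to wait after the given number of consecutive collisions."""
    window = 2 ** min(collisions, max_exponent)  # window doubles with each collision
    return rng.randrange(window)                  # uniform in [0, window - 1]


# Two contending queries at an overloaded node each draw their own delay.
rng = random.Random()
delay_q1 = backoff_slots(1, rng)
delay_q2 = backoff_slots(1, rng)
```

If the two queries draw the same delay they collide again, and the next draw comes from a window twice as large, so repeated collisions become increasingly unlikely.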

Data Flow View

Query1: Select light sample period 2 ms

Query2: Select temp sample period 2 ms

Query3: Select humidity sample period 2 ms

Query4: Select pressure sample period 2 ms

Figure labels: child for Query 1; root for Query 4; parent of all leaf children.

Experimental Evaluations

Number of concurrent correlated queries vs. average transmission time

For a fixed number of multiple queries, I would study the relation of average transmission time to the number of nodes.

Benefit ratio vs. average number of synthetic queries

Communication cost vs. computation cost

Conclusion

After studying current techniques for optimizing sensor network queries, I proposed a multi-query architecture that could serve as a basis for future sensor network query processing.

Thank you. Questions?
