Post on 19-Dec-2015
Query Processing for the Semantic Sensor Web
Antonios Deligiannakis, Technical University of Crete
Vision for Semantic Sensor Networks
Universal, web-based access to sensor data
– Simpler to consider collections of sensor networks
• Each network with some kind of authority and administration
Of Course, Nothing Comes Easy…
Requires additional info for collected data
– Location/orientation of sensor, time, authority, measured quantities, units, errors etc.
• Some of them are static, some may change (time, location…)
– Additional info may significantly impact volume of transmitted data
Requires proper languages for querying data
– Query execution still needs to be optimized within each network
Large Scale Querying
“Semantic Reality - Connecting the Real and the Virtual World”
– Manfred Hauswirth, Stefan Decker
“Query processing, reasoning, and planning based on real world sensor information bases will be core functionalities to exploit the full potential of Semantic Reality. However, the size and the physical distribution of data will require new approaches which will have to trade logical correctness with statistical”
More complex, large-scale processing:
– Issued queries are transformed
– Relevant networks are identified
– Queries issued over data of individual networks
• Query may involve log of historical data extracted from each network; or
• Online query: Its execution still needs to be optimized within each network
– Results annotated, combined
Topic of our talk
Data Collection – Potential Queries
Collecting all data (SELECT * queries)
– Entire network or a subregion, periodically or frequently
Collect aggregates of data
– Report AVG, SUM, MAX quantities in an area/network
Data Reduction based on user-specified data quality
– In all types of queries (historical, aggregate…)
– Minimize bandwidth based on quality (or the dual problem)
Detecting Outliers
– “Strange” readings: Interesting phenomenon or malfunction?
Joins
– Report information based on combined readings of sensors (e.g., report when a lion is close to a deer)
– Harder to optimize. Naïve solution of sending potential joining tuples (or projected attributes of them) to base station is often not far from best case
One-shot vs Continuous Queries
One-shot Queries
– Ask a query, get results, DONE
Continuous Queries
– Perform a query
• Specify how OFTEN it should be executed (query epoch)
• Specify until WHEN it should run (optional)
– “Report avg temp per room every 30 sec, for the next hour”
– “Report all measurements of sensor nodes in Room X every 1 min”
– More typical for monitoring applications
– More data, more chances of doing something clever…
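As a minimal sketch of the epoch idea above, a continuous query can be represented by its text, its epoch, and an optional lifetime. The class name, field names, and SQL-like query string are hypothetical illustrations, not the syntax of any system mentioned in the talk.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class ContinuousQuery:
    text: str                       # the query, kept as an opaque string (hypothetical syntax)
    epoch_sec: float                # how OFTEN to execute (query epoch)
    duration_sec: Optional[float]   # until WHEN to run; None means "no end"


def execution_times(q: ContinuousQuery, start: float = 0.0) -> Iterator[float]:
    """Yield the start time of each execution, one per epoch, until the query expires."""
    k = 0
    while q.duration_sec is None or k * q.epoch_sec < q.duration_sec:
        yield start + k * q.epoch_sec
        k += 1


# "Report avg temp per room every 30 sec, for the next hour"
q = ContinuousQuery("SELECT AVG(temp) FROM sensors GROUP BY room", 30, 3600)
runs = list(execution_times(q))   # 120 executions, the last one at t = 3570 s
```

A one-shot query is the degenerate case: a lifetime no longer than one epoch yields a single execution.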
Data Collection – Types of Queries
Pull vs Push Based Queries
– Pull-based queries: sensors transmit data only in response to queries that request them
– Push-based queries: transmit data to cache node proactively
• If I know that someone will likely request it soon, transmit to avoid query propagation, organization etc…
– Tradeoff based on how often data is requested by different queries
• Hybrid Push-Pull Query Processing for Sensor Networks
– Niki Trigoni, Yong Yao, Alan Demers, Johannes Gehrke, Rajmohan Rajaraman
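The tradeoff can be illustrated with a toy message-count model. This is an assumption made for illustration only, not the cost model of the cited paper: a push costs one message per hop for every new reading, while a pull costs one message per hop for the request and one per hop for the reply.

```python
def messages_per_second(hops: int, update_rate: float, query_rate: float):
    """Return (push_cost, pull_cost) in messages per second under the toy model."""
    push = update_rate * hops        # every new reading travels to the cache node
    pull = query_rate * 2 * hops     # request travels down, reply travels back
    return push, pull


def choose_mode(hops: int, update_rate: float, query_rate: float) -> str:
    """Pick the cheaper strategy for one sensor; ties go to pull."""
    push, pull = messages_per_second(hops, update_rate, query_rate)
    return "push" if push < pull else "pull"
```

Under this model, data that changes often but is rarely requested should be pulled, and data requested more than half as often as it changes should be pushed.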
Data Collection – Types of Queries
Pull vs Push Based Queries
– Data are usually pulled based on user queries
– Exceptions for queries to historical data
• Node may need to push its batch of latest measurements if memory becomes full
– Metadata changes likely need to be pushed to some external directory
• Allows queries to be performed based on “correct” knowledge
• Is this overkill for metadata that change constantly?
– E.g., time of acquired measurements is common
– Careful organization helps. E.g., time can sometimes be inferred
– Possible to combine both approaches
Outline
Sensor Nodes
– Brief Intro: Parts of sensors, capabilities, constraints
Techniques for Query Processing
– Topologies for Data Collection
– Data Reduction based on user-specified data quality
– Detecting Outliers
Conclusions
Parts of a Sensor
Sensing equipment (sensor and data acquisition boards)
– Internal (“built-in”) vs external sensing capabilities
CPU
Memory
Battery
– Some sensors may gather energy from the sun, vibrations etc.
Radio to transmit/receive data from other sensors
Sensor Parts - Example
            Berkeley Mica2             Stargate (Intel PXA255 CPU)   Constraint
Battery     2 AA                       Li-Ion                        Conserve to increase network lifetime
CPU         7.38 MHz                   400 MHz                       Computationally cheap algorithms
Memory      4 KB SRAM, 512 KB EEPROM   up to 256 MB FLASH            Algorithms with low memory requirements
Radio       300 meters                 Depends on radio model        Transmission range, bandwidth (bits/sec)
Main Constraint
Energy Constraints
Battery capacity increases only 3-5% per year
– CPU speed increases much faster
• However, energy per CPU instruction decreases
Some applications: unattended deployment
– E.g., disaster scenarios, military environments…
– Often hard or impossible to replace batteries
Maximizing network lifetime is the main target
– Cost-effective only if sensor networks last long
– Applications with sensors without power constraints are much easier to handle
Sources of Energy Drain
CPU computations
Measurements from sensing equipment (cost depends on what you sense)
Very small energy consumption in sleep mode
Radio is main source
– cost(Transmitting) > cost(Receiving) ≥ cost(idle listening)
– Popular goal: reduce #transmitted bits
– Synchronization + communication protocols equally important
• E.g., cost of transmitting K bits depends on duty cycle (percentage of time sensor is awake to listen for data)
• Idle listening for too long is extremely costly
Assumptions and Goals in Subsequent Algorithms
Research emphasis on more constrained environments
– Wireless communication, short transmission ranges
• One or more base stations with increased capabilities may exist
• Candidates for gateways to the semantic sensor web
– Energy limitations (battery powered sensors)
Goal of algorithms in all applications that follow:
– Preserve energy
– Organize sensors and their schedules
• Good schedules allow sensors to power down their radios/CPUs and go into a sleep mode
– Reduce size of transmitted data
Processing (esp. aggregation) focuses on numeric measurements
Implication of having a strict schedule on when to collect data: base station knows when quantities are collected
– Such metadata may not even need to be transmitted
Outline
Sensor Nodes
– Brief Intro: Parts of sensors, capabilities, constraints
Techniques for Query Processing
– Topologies for Data Collection
– Data Reduction (SELECT * and aggregate queries)
– Detecting Outliers
Conclusions
June 1, 2009 Antonios Deligiannakis 15
1a. Data Collection: TAG
TAG: a Tiny Aggregation Service for Ad-Hoc Sensor Networks
– Samuel Madden, Michael Franklin, Joseph Hellerstein
Goals
– Specify SQL-type query
– Organize sensors and schedule them to reduce energy consumption
– Emphasizes/uses IN-NETWORK query processing
– Targets aggregate queries, but similar functionality can be used for SELECT * queries
Results of paper incorporated into TinyDB
– Data processing system built on top of TinyOS
TAG Operation
Users pose queries at a base station
Messages flooded towards the sensors
– Reverse aggregation tree is formed
– Each sensor belongs to a level, based on hops to root
Each epoch is (equally) partitioned amongst the levels
– Nodes listen ONLY when children nodes send data
– When nodes transmit, parent node has radio open
– Synchronization allows each node to transmit ONCE per epoch
• Transmission includes aggregate for subtree
[Figure labels: Base Station, Area A. The picture is from the paper]
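The operation described above can be sketched for an AVG query. This is an illustrative simulation, not TinyDB code: the function names are hypothetical, the tree is given as a child-to-parent map, and each node forwards a (sum, count) partial state for its subtree exactly once per epoch.

```python
def levels(parent: dict, root) -> dict:
    """Level of each node = number of hops to the root."""
    def depth(n):
        d = 0
        while n != root:
            n = parent[n]
            d += 1
        return d
    return {n: depth(n) for n in list(parent) + [root]}


def transmit_slot(level: int, max_level: int) -> int:
    """Deepest level transmits in slot 0; each level above it one slot later."""
    return max_level - level


def in_network_avg(parent: dict, root, readings: dict):
    """Return (average, messages sent) for one epoch of in-network AVG."""
    lvl = levels(parent, root)
    # Partial state per node: (sum, count). Nodes without a reading contribute (0, 0).
    partial = {n: (readings[n], 1) if n in readings else (0.0, 0) for n in lvl}
    messages = 0
    # Children transmit before their parents, so process the deepest nodes first.
    for n in sorted(parent, key=lambda n: lvl[n], reverse=True):
        s, c = partial[n]
        ps, pc = partial[parent[n]]
        partial[parent[n]] = (ps + s, pc + c)   # parent merges the child's subtree state
        messages += 1                           # one transmission per node per epoch
    s, c = partial[root]
    return s / c, messages
```

Forwarding (sum, count) instead of a local average keeps the result exact, and the message count equals the number of sensors regardless of tree depth.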
TAG Query Language
The picture is from the paper
TAG Contributions
Support of multiple aggregate functions AND group-by queries
– Classification and behavior based on type of function
In-network processing dramatically reduces transmitted data
Synchronization allows sensors to sleep most of the time
Further optimizations for monotonic aggregates
– E.g., MAX aggregate: Don’t transmit aggregate if you overhear a sibling that reports a larger aggregate
Considerations for message loss etc…
1b. Data Collection: WaveScheduling
WaveScheduling: Energy-Efficient Data Dissemination for Sensor Networks
– Niki Trigoni, Yong Yao, Alan J. Demers, Johannes Gehrke, Rajmohan Rajaraman
– Proposed in the Cougar system
Observation: In TAG, nodes at the same level transmit at the same time
– Many collisions, message losses
– Many retransmissions, energy drain
Goal: Organize nodes in order to minimize message collisions
Main Idea: WaveScheduling
Organize nodes in a grid
– Each grid area has a leader
– Nodes within each area transmit data to their leader
– Leaders communicate at specific intervals and directions
East Wave example
WaveScheduling Paths
The large right-turn path has a lower latency for N, E, S, W
Waves follow a North, East, South, West direction
[Figure labels: Source, Destination]
WaveScheduling Contributions
Transmissions without collisions
– Much lower energy drain
– But also, significantly larger delays
• Few nodes transmit at each time, long paths to follow
2a. Compressing Historical Measurements
Compressing Historical Information in Sensor Networks
– A. Deligiannakis, Y. Kotidis, N. Roussopoulos
Application: If past measurements are important, they should be periodically transmitted to base station, before the memory is exhausted
Transmitting all measurements is costly
2a. Compressing Historical Measurements
Observation: Sensors may measure several quantities
Potential correlations
– Within the same quantity
• Periodicity, similar trends
– Between different quantities
• Temperature and Voltage [Deshpande04], pressure and humidity
– Between different sensors in an area
• E.g., similar temperature and noise levels
Can we take advantage of such correlations to compress the data?
Examples of Correlated Signals
Regression at XY level = scaling and shifting
[Figure: signals X and Y, and the XY graph]
Main Idea
Create small dictionary of trends that appear frequently
Partition data into intervals
Encode each interval through some part of the dictionary
– Use linear regression for encoding: $Y = aX + b$
[Figure labels: Dictionary, Part of Data, intervals of width W, Regression Parameters (a, b)]
What is Transmitted
[Figure labels: Sensor, Base Station, Dictionary, M_Base, Measurements (rows 1, 2, …, N; M), Compressed Data (Size = B), Dictionary Updates, Dictionary Log, Log with received compressed data]
For each data interval transmit 4 values: 1) Start position in data array; 2) Location of best approximation in dictionary; 3-4) Regression parameters
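A minimal sketch of this encoding step, assuming a fixed interval width and an exhaustive scan of the dictionary (the actual search strategy and interval partitioning of the paper are not shown here, and all function names are hypothetical): each interval is matched against every dictionary window with a least-squares line, and the four transmitted values are the ones listed above.

```python
def fit_line(x, y):
    """Least-squares (a, b) minimising sum((a*x_i + b - y_i)^2); also returns that SSE."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx if sxx else 0.0          # flat dictionary window: fall back to a constant
    b = my - a * mx
    sse = sum((a * xi + b - yi) ** 2 for xi, yi in zip(x, y))
    return a, b, sse


def encode_interval(data, start, width, dictionary):
    """Return ((start, dict_pos, a, b), sse) for the best-matching dictionary window."""
    y = data[start:start + width]
    best = None
    for pos in range(len(dictionary) - width + 1):
        a, b, sse = fit_line(dictionary[pos:pos + width], y)
        if best is None or sse < best[0]:
            best = (sse, pos, a, b)
    sse, pos, a, b = best
    return (start, pos, a, b), sse


def decode_interval(record, width, dictionary):
    """Base-station side: rebuild the approximation from the 4 transmitted values."""
    _start, pos, a, b = record
    return [a * x + b for x in dictionary[pos:pos + width]]
```

Swapping `fit_line` for a fit that minimises a different error metric changes what the encoding optimises without touching the rest.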
Cooperative Compression
Exploring spatial correlations…
Group leader partitions part of bandwidth to each sensor
– To save energy, group leader may transmit its dictionary updates to the group
Sensors compress their data, report resulting error (NOT data)
More space is assigned to nodes with larger errors
Compressed data transmitted to group leader
Combination with its own data and transmission
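[Figure labels: sensors 1, 2, 3, …, S-1; Group Leader; Base Station; Query, Bandwidth; initial bandwidths B1, B2, B3, …, BS-1; reported errors E1, E2, E3, …, ES-1; reassigned bandwidths B1’, B2’, B3’, …, BS-1’]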
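The reassignment step ("more space is assigned to nodes with larger errors") can be sketched with a simple proportional rule. The rule itself, the `min_share` floor, and the function name are assumptions for illustration; the slides do not specify the exact allocation policy.

```python
def reallocate_bandwidth(total: int, errors: list, min_share: int = 1) -> list:
    """Split `total` transmitted values among sensors, proportionally to reported errors.

    Every sensor keeps at least `min_share`; leftovers from rounding down stay unassigned.
    """
    n = len(errors)
    spare = total - n * min_share
    if spare < 0:
        raise ValueError("total bandwidth is too small for the minimum shares")
    err_sum = sum(errors)
    if err_sum == 0:
        extra = [spare // n] * n                       # nobody reports error: split evenly
    else:
        extra = [int(spare * e / err_sum) for e in errors]
    return [min_share + x for x in extra]
```

Only the error values travel to the group leader in this step, which is far cheaper than shipping the data itself.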
Algorithm Benefits
To minimize different error metrics, simply change regression algorithm
– E.g., SSE, SSRE and Max errors are simple to handle
More space to difficult signals/sensors
Group organization saves team the need to compute dictionary (expensive part)
– Need to rotate group leader selection, to avoid draining energy
– Can apply HEED protocol for group leader selection
• Prob. of becoming group leader is proportional to Ecurr / Einit
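A hedged sketch of energy-proportional leader election, loosely following the slide's Ecurr / Einit rule. The scaling constant `c_prob`, the floor `p_min`, and the single-round election are simplifying assumptions; the full HEED protocol iterates and also uses communication cost, which is omitted here.

```python
import random


def leader_probability(e_curr: float, e_init: float,
                       c_prob: float = 0.05, p_min: float = 1e-4) -> float:
    """Probability of volunteering as group leader, proportional to remaining energy."""
    return max(p_min, min(1.0, c_prob * e_curr / e_init))


def elect_leaders(energy: dict, e_init: float, c_prob: float = 0.05, rng=None) -> list:
    """One election round: each node volunteers independently with its own probability."""
    rng = rng or random.Random()
    return [node for node, e in energy.items()
            if rng.random() < leader_probability(e, e_init, c_prob)]
```

Because the probability shrinks as a node's battery drains, leadership rotates toward nodes with more remaining energy over time.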
2b. Model-Driven Data Acquisition
Model-driven Data Acquisition in Sensor Networks
– Amol Deshpande, Carlos Guestrin, Samuel R. Madden, Joseph M. Hellerstein, Wei Hong
Idea: Learn a probabilistic model of past measurements/patterns
– Multivariate Gaussian pdfs
– Also learn transition probabilities P(X_{t+1} | X_t)
At each epoch, base station decides for which sensors (and which quantities) it is confident about their current values
– This confidence decays over time if no samples are taken
Generate query plan to retrieve only the remaining quantities
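The confidence test can be illustrated in one dimension. This sketch assumes a single Gaussian-distributed quantity whose variance grows linearly with each epoch that passes without a sample (a random-walk assumption); the paper's model is multivariate and also exploits correlations between quantities, which is not shown. All names are hypothetical.

```python
import math


def confidence(sigma: float, eps: float) -> float:
    """P(|X - mu| <= eps) for X ~ N(mu, sigma^2)."""
    if sigma == 0:
        return 1.0
    return math.erf(eps / (sigma * math.sqrt(2)))


def sigma_after(sigma0: float, process_var: float, epochs: int) -> float:
    """Standard deviation after `epochs` without a sample, under random-walk drift."""
    return math.sqrt(sigma0 ** 2 + epochs * process_var)


def must_sample(sigma0: float, process_var: float, epochs: int,
                eps: float, delta: float) -> bool:
    """True if the model alone cannot answer within +/- eps with confidence delta."""
    return confidence(sigma_after(sigma0, process_var, epochs), eps) < delta
```

A query asking for the value within ±eps at confidence delta is answered from the model while `must_sample` is false; once the uncertainty has grown too large, the quantity goes into the data collection plan.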
Method Sketch (slide from the authors’ presentation)
[Figure labels: New Query (SQL query, with desired confidence bounds), Probabilistic Model, Model Estimate, Query, Data Collection Plan, Feed Model, D_t; three pdf plots with x axis 0-30 and y axis 0-0.4]
Algorithm Characteristics
Takes advantage of correlations
– May decide to sample voltage instead of temperature, to decrease energy
Model can be used for missing data (inaccessible sensors)
Can handle point and range queries
However, hard to handle previously unseen patterns
– “Thus, for models to perform accurate predictions, they must be trained in the kind of environment where they will be used”
Centralized model, difficult to scale
– Subsequent work by same authors proposed a more distributed system (KEN)
2c. Snapshot Queries
Snapshot Queries: Towards Data-Centric Sensor Networks
– Yannis Kotidis
Idea: Nodes in proximity likely observe similar things
Find if you are needed to answer queries
– Does there exist a representative that can approximate your data accurately?
Nodes that are needed (representatives) constitute network snapshot
– Can estimate and answer queries for remaining nodes as well
Completely decentralized approach
– Representatives tested, change with few messages
Snapshot Queries
Q’: select loc, temperature from Sensors where loc in SOUTH_EAST_QUADRANT use snapshot
[Figure labels: Q’, A, B, C, D]
Example of Network Snapshot
Dark nodes show representatives
Network inside the network
Immediate visualization of common data patterns
[Figure labels: A, B, C, Q]
Testing for Representatives
Ni maintains model for each neighbor Nj
– Constructed using cache of measurements from Nj
– Updated at random epochs by measurements transmitted by Nj
Can use model to predict measurement of Nj if:
$\mathrm{error}(x'_j(t) - x_j(t)) \le T$
Supports multiple error functions, user-provided threshold T
Snapshot adapts over time to evolving data characteristics
No training is necessary
[Figure: temperature over time for xi(t) and xj(t)]
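A minimal sketch of the representative test, under two assumptions not stated on the slide: the model kept by Ni for neighbor Nj is a least-squares line from Ni's own reading to Nj's reading, and the error function is the absolute difference. The class and method names are hypothetical.

```python
class NeighborModel:
    """Model kept by node Ni for one neighbor Nj, built from a small cache of pairs."""

    def __init__(self, max_cache: int = 8):
        self.cache = []             # (own reading x_i, neighbor reading x_j) pairs
        self.max_cache = max_cache

    def update(self, x_i: float, x_j: float) -> None:
        """Called when Nj transmits a measurement; keeps only the newest pairs."""
        self.cache.append((x_i, x_j))
        self.cache = self.cache[-self.max_cache:]

    def predict(self, x_i: float) -> float:
        """Estimate x'_j from Ni's current reading with a least-squares line."""
        if not self.cache:
            raise ValueError("no cached measurements for this neighbor")
        xs, ys = zip(*self.cache)
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        a = sum((x - mx) * (y - my) for x, y in self.cache) / sxx if sxx else 0.0
        return a * x_i + (my - a * mx)

    def can_represent(self, x_i: float, x_j: float, threshold: float) -> bool:
        """True if |x'_j - x_j| <= T, i.e. Ni may answer on behalf of Nj."""
        return abs(self.predict(x_i) - x_j) <= threshold
```

Since the cache is bounded and refreshed by occasional transmissions, the model follows changing conditions without a separate training phase.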
Approximate Aggregate Queries
In TAG, each node transmits aggregates at EACH epoch
– Can we do better than that?
– Aggregates (e.g., avg temperature) may change slowly
Processing approximate aggregate queries in wireless sensor networks
– A. Deligiannakis, Y. Kotidis, N. Roussopoulos
Application: If application tolerates E_Global = $|V - \hat{V}|$, organize aggregation to minimize #messages
– $V$, $\hat{V}$: Real/estimated aggregate result
Idea:
– Don’t transmit small changes in aggregates of subtrees
Dual problem a little harder
– Bandwidth Constrained Queries in Sensor Networks (same authors)
Error Filters
Algorithm applies error filters to sensor nodes
– Interval [L..H]
Each node computes aggregate for subtree
– Transmit only if new aggregate is outside the filter
– At each transmission, re-center error filter
Of course, the trick is how to decide how large each error filter should be
– Always respect accuracy constraint E_Global
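The filter logic above fits in a few lines. This is a sketch with hypothetical names: the filter is the interval [center - width, center + width], a value inside it is suppressed, and a value outside it is transmitted and becomes the new center.

```python
from typing import Optional


class ErrorFilter:
    """Interval [L..H] around the last transmitted aggregate of a node's subtree."""

    def __init__(self, center: float, width: float):
        self.width = width
        self.low, self.high = center - width, center + width

    def observe(self, aggregate: float) -> Optional[float]:
        """Return the value to transmit, or None if the change is suppressed."""
        if self.low <= aggregate <= self.high:
            return None                                  # inside the filter: stay silent
        # Outside the filter: transmit and re-center the interval on the new value.
        self.low, self.high = aggregate - self.width, aggregate + self.width
        return aggregate
```

For example, a filter created with center 50 and width 4 covers [46, 54]; a new aggregate of 52 is suppressed, while 55 is transmitted and moves the filter to [51, 59].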
Example of Error Filters
[Figure: aggregation tree with nodes 1-8 below the Base Station. Filters shown include [16, 18] (E=1), [46, 54] (E=4) and [62, 68] (E=3); later panels show updated values and re-centered filters such as [19, 21] and [41, 49].]
E_Global = $\sum_i E_i$
Why not Uniform Allocation of Error?
Reasons not to select uniform filters
– Different range of aggregates in nodes/subtrees
– Changes in subtrees may cancel out
• E.g., vehicle movement from observation area of node 4 to the one of node 5
Goal: Use larger filters where you expect larger decrease in transmitted messages
Filters periodically adjusted
– Adapt to changes in data characteristics
[Figure labels: nodes 5, 4, 2]
Sketch Of Technique
Filters periodically shrink to W x F
– Creates error budget (1-F) x E_Global to redistribute to nodes
Algorithm computes simple statistics to estimate gain of increasing filter
– Statistics computed per node
[Figure: #messages as a function of filter width, marked at F x W (shrink), W (default) and W + dW (expanded), with labels Bshrink, Bexpand and UpdGain; a filter [C-W, C+W] centered at C shrinks to [C-W*F, C+W*F]]
Sketch of Technique (2)
Statistics aggregated bottom-up before reorganization
– Important note: Compare this cost (1 aggregate/sensor) with cost of transmitting individual statistics to base station (SELECT *)
Error budget redistributed top-down
– Partition error budget to your subtrees (and yourself), proportionally to the gain of each subtree
Often 5-fold reduction in transmitted data compared to uniform allocation, order of magnitude when compared
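The shrink-and-redistribute cycle can be sketched as follows. This is a flat simplification under stated assumptions: the budget is split directly among nodes rather than top-down through subtrees, E_Global is taken as the sum of the filter widths, and the per-node gain estimates are supplied by the caller. The function name is hypothetical.

```python
def adjust_filters(widths: dict, gains: dict, shrink_factor: float) -> dict:
    """Shrink every filter to W * F, then hand the freed budget out by estimated gain."""
    e_global = sum(widths.values())
    budget = (1 - shrink_factor) * e_global           # error budget freed by shrinking
    shrunk = {n: w * shrink_factor for n, w in widths.items()}
    total_gain = sum(gains.values())
    if total_gain == 0:
        # No node expects a benefit: return the budget evenly.
        return {n: shrunk[n] + budget / len(widths) for n in widths}
    return {n: shrunk[n] + budget * gains[n] / total_gain for n in widths}
```

The widths still sum to E_Global after the adjustment, so the accuracy constraint holds while error budget moves to the nodes where it is expected to save the most messages.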
The Need for Outlier Detection
Outlier Detection is Useful
– Outliers may denote malfunctioning sensors
• Sensor measurements are often unreliable
• Sensors may fail-dirty
– Outliers may also represent interesting events detected by few sensors
• Fire detected by a sensor
Results of aggregate queries are often meaningless
– Consider a MAX/MIN calculation in the presence of outlier measurements
– Other aggregates are also influenced
What do we Need?
Goals for aggregate queries:
– A “clean” aggregate
– Reporting of outlier values
– Both in a SINGLE, in-network framework
What to Consider as an Outlier?
Need to support several similarity metrics
Also consider characteristics of monitored quantity
– Measurements may depend on distance from source (e.g., noise, heat)
– Simply relying on values for testing similarity between sensors is not enough; comparing recent trends may be more appropriate
Provide provision for user-specified “minimum support”
– How many other sensors need to be similar to you, so that you are not considered an outlier?
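A minimal sketch of the minimum-support idea, using the correlation coefficient of recent readings as one possible trend-based similarity metric. The metric, the 0.8 threshold, and the centralized loop are assumptions for illustration; the framework supports several metrics and runs the tests in-network.

```python
import math


def correlation(a, b) -> float:
    """Pearson correlation of two equally long series (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def find_outliers(recent: dict, min_support: int, threshold: float = 0.8) -> list:
    """Nodes whose recent trend is similar to fewer than `min_support` other nodes."""
    outliers = []
    for node, series in recent.items():
        support = sum(1 for other, s in recent.items()
                      if other != node and correlation(series, s) >= threshold)
        if support < min_support:
            outliers.append(node)
    return outliers
```

Comparing trends rather than raw values lets sensors at different distances from the same event support each other even when their absolute readings differ.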
Motivational Example
• S6, S7 and S8 observe a fire
• Their measurements fluctuate more
A voting process at S3 will reject the reading of S6
Smoothing at S3 also obscures the reading
S10 and S9 fail-dirty and need to be excluded
Example is partitioned in 2 areas: Our framework supports group-by queries
Sketch of Framework
1. Assume a required minimum support of 2
2. S6, S7 and S8 can only be tested for similarity at their closest common ancestor (S2)
3. At node S2, their values can be merged into aggregate
4. S12, S4 and S5 are in the same group. In S3 their values can be tested for similarity
5. Similarly, test S3, S2 and S11 for similarity in S2 etc
6. Nodes S10 and S9 have failed-dirty. Readings without the minimum support are not included in the aggregate
Even if the readings of S6, S7 and S8 are incorporated into the aggregate, one of these readings (revealing the fire) will be received by the root node
Framework Features
Performs similarity tests over the latest K readings of sensors
– Can plug similarity functions with minimal cost
Allows for minimum support
GROUP-BY support
– Grouping based on latest measurement, OR static predicates (area, id etc.)
Can limit tests within each group using a CONSTRAIN TEST clause
– Semantic information (e.g., location) could be useful here
– E.g., only perform tests between sensors on the same floor
Collection tree periodically reorganized
– Move towards places you will find witnesses, outliers
Just to See what Happens…
Some More Notable Approaches
Cluster-Based Communication
– LEACH, PEGASIS, HEED
– Goal: Organize sensors into clusters
– Intra-cluster and inter-cluster communication
– Overhead for clusterheads, so probabilistic election and rotation
Optimizing Scheduling
– Nodes have different loads to transmit
– Can determine minimal times to transmit/listen based on worst-case estimates of time to transmit/link
– See: “Workload-aware Optimization of Query Routing Trees in Wireless Sensor Networks” (MicroPulse) by P. Andreou, D. Zeinalipour-Yazti, P. Chrysanthis, G. Samaras
Some More Notable Approaches (2)
Sensor Localization
– Few sensor nodes are actually GPS-enabled
• Catalog for Crossbow gives only 1 such data acquisition board (MTS420)
– Several algorithms for sensor localization
• Option 1: Sensors are not localized, but there exist landmarks with GPS knowledge of themselves
• Option 2: Mobile sensors can localize even without any infrastructure, if (a) the sensors are free to move, (b) they have a common reference point (e.g., direction of North)
– GPS-Free Node Localization in Mobile Wireless Sensor Networks, by Huseyin Akcan, Vassil Kriakov, Herve Bronnimann, Alex Delis
Duplicate-insensitive sketches for aggregate queries
– Minimizes impact of data loss
Conclusions
Presented several query processing techniques
– All aim to process data in-network
• Crucial for network lifetime
– Techniques for extraction of historical measurements, or online querying (point, range, aggregate, group-by)
– All data reduction techniques presented are easily tunable
• Given desired error/accuracy, minimize bandwidth consumption; or
• Given bandwidth consumption, minimize error of produced results
• Both useful to satisfy different user requirements
– Important to properly schedule nodes
• Increases available time to sleep, decreases the energy drain
• If possible, avoid creating a different schedule for each query
– Too many conflicts, sensor continuously working…