Query Processing for the Semantic Sensor Web Antonios Deligiannakis Technical University of Crete.


Vision for Semantic Sensor Networks

Universal, web-based access to sensor data
– Simpler to consider collections of sensor networks
• Each network with some kind of authority and administration

Of Course, Nothing Comes Easy…

Requires additional info for collected data
– Location/orientation of sensor, time, authority, measured quantities, units, errors etc.
• Some of them are static, some may change (time, location…)
– Additional info may significantly impact volume of transmitted data

Requires proper languages for querying data
– Query execution still needs to be optimized within each network

Large Scale Querying

“Semantic Reality - Connecting the Real and the Virtual World”
– Manfred Hauswirth, Stefan Decker

“Query processing, reasoning, and planning based on real world sensor information bases will be core functionalities to exploit the full potential of Semantic Reality. However, the size and the physical distribution of data will require new approaches which will have to trade logical correctness with statistical”

More complex, large-scale processing:
– Issued queries are transformed
– Relevant networks are identified
– Queries issued over data of individual networks
• Query may involve log of historical data extracted from each network; or
• Online query: Its execution still needs to be optimized within each network
– Results annotated, combined

Topic of our talk

Data Collection – Potential Queries

Collecting all data (SELECT * queries)
– Entire network or a subregion, periodically or frequently

Collect aggregates of data
– Report AVG, SUM, MAX quantities in an area/network

Data Reduction based on user-specified data quality
– In all types of queries (historical, aggregate…)
– Minimize bandwidth based on quality (or the dual problem)

Detecting Outliers
– “Strange” readings: Interesting phenomenon or malfunction?

Joins
– Report information based on combined readings of sensors (e.g., report when a lion is close to a deer)
– Harder to optimize. Naïve solution of sending potential joining tuples (or projected attributes of them) to base station is often not far from best case

One-shot vs Continuous Queries

One-shot Queries
– Ask a query, get results, DONE

Continuous Queries
– Perform a query
• Specify how OFTEN it should be executed (query epoch)
• Specify until WHEN it should run (optional)
– “Report avg temp per room every 30 sec, for the next hour”
– “Report all measurements of sensor nodes in Room X every 1 min”
– More typical for monitoring applications
– More data, more chances of doing something clever…
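A minimal sketch of how the two parameters of a continuous query (the epoch and the optional duration) determine when it executes. The function name `execution_times` is hypothetical and not part of any system mentioned in the talk.

```python
from itertools import count, islice

def execution_times(epoch_sec, duration_sec=None):
    """Yield the start offset (seconds from now) of each query execution.

    epoch_sec: how OFTEN the query runs.
    duration_sec: optional bound on until WHEN it runs (None = until cancelled).
    """
    for k in count():
        t = k * epoch_sec
        if duration_sec is not None and t >= duration_sec:
            return
        yield t

# "Report avg temp per room every 30 sec, for the next hour"
hourly = list(execution_times(30, 3600))
print(len(hourly), hourly[:3])

# No duration given: the query keeps running, so only peek at the first epochs.
print(list(islice(execution_times(60), 3)))
```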

Data Collection – Types of Queries

Pull vs Push Based Queries
– Pull-Based Queries: sensors transmit data only in response to queries that request them
– Push-based Queries: transmit data to cache node proactively
• If I know that someone will likely request it soon, transmit to avoid query propagation, organization etc…
– Tradeoff based on how often data is requested by different queries
• Hybrid Push-Pull Query Processing for Sensor Networks
– Niki Trigoni, Yong Yao, Alan Demers, Johannes Gehrke, Rajmohan Rajaraman

Data Collection – Types of Queries

Pull vs Push Based Queries
– Data are usually pulled based on user queries
– Exceptions for queries to historical data
• Node may need to push its batch of latest measurements if memory becomes full
– Metadata changes likely need to be pushed to some external directory
• Allows queries to be performed based on “correct” knowledge
• Is this overkill for metadata that change constantly?
– E.g., time of acquired measurements is common
– Careful organization helps. E.g., time can sometimes be inferred
– Possible to combine both approaches

Outline

Sensor Nodes
– Brief Intro: Parts of sensors, capabilities, constraints

Techniques for Query Processing
– Topologies for Data Collection
– Data Reduction based on user-specified data quality
– Detecting Outliers

Conclusions

Parts of a Sensor

Sensing equipment (sensor and data acquisition boards)
– Internal (“built-in”) vs external sensing capabilities

CPU

Memory

Battery
– Some sensors may harvest energy from the sun, vibrations etc.

Radio to transmit/receive data from other sensors

Sensor Parts - Example

          Berkeley Mica2             Stargate (Intel PXA255 CPU)   Constraint
Battery   2 AA                       Li-Ion                        Conserve to increase network lifetime
CPU       7.38 MHz                   400 MHz                       Computationally cheap algorithms
Memory    4 KB SRAM, 512 KB EEPROM   up to 256 MB FLASH            Algorithms with low memory requirements
Radio     300 meters                 Depends on radio model        Transmission range, bandwidth (bits/sec)

Main Constraint

Energy Constraints

Battery capacity increases only 3-5% per year
– CPU speed increases much faster
• However, energy per CPU instruction decreases

Some applications: unattended deployment
– E.g.: Disaster scenarios, military environments…
– Often hard or impossible to replace batteries

Maximizing network lifetime is the main target
– Cost-effective only if sensor networks last long
– Applications with sensors without power constraints are much easier to handle

Sources of Energy Drain

CPU computations

Measurements from sensing equipment (cost depends on what you sense)

Very small energy consumption in sleep mode

Radio is main source
– cost(Transmitting) > cost(Receiving) ≥ cost(idle listening)
– Popular goal: reduce #transmitted bits
– Synchronization + communication protocols equally important
• E.g., cost of transmitting K bits depends on duty cycle (percentage of time sensor is awake to listen for data)
• Idle listening for too long is extremely costly

Assumptions and Goals in Subsequent Algorithms

Research Emphasis on more constrained environments
– Wireless communication, short transmission ranges
• One or more base stations with increased capabilities may exist
• Candidates for gateways to the semantic sensor web
– Energy limitations (battery powered sensors)

Goal of algorithms in all applications that follow:
– Preserve Energy
– Organize sensors and their schedules
• Good schedules allow sensors to power down their radios/CPUs and go into a sleep mode
– Reduce size of transmitted data

Processing (esp. aggregation) focuses on numeric measurements

Implication of having a strict schedule on when to collect data: base station knows when quantities are collected
– Such metadata may not even need to be transmitted

Outline

Techniques for Query Processing
– Topologies for Data Collection
– Data Reduction (SELECT * and aggregate queries)
– Detecting Outliers

Conclusions

June 1, 2009 Antonios Deligiannakis

1a. Data Collection: TAG

TAG: a Tiny Aggregation Service for Ad-Hoc Sensor Networks
– Samuel Madden, Michael Franklin, Joseph Hellerstein

Goals
– Specify SQL type query
– Organize sensors and schedule them to reduce energy consumption
– Emphasizes/uses IN-NETWORK query processing
– Targets aggregate queries, but similar functionality can be used for SELECT * queries

Results of paper incorporated into TinyDB
– Data processing system built on top of TinyOS

TAG Operation

Users pose queries at a base station

Messages flooded towards the sensors
– Reverse aggregation tree is formed
– Each sensor belongs to a level, based on hops to root

Each epoch is (equally) partitioned amongst the levels
– Nodes listen ONLY when children nodes send data
– When nodes transmit, parent node has radio open
– Synchronization allows each node to transmit ONCE per epoch
• Transmission includes aggregate for subtree

[Figure labels: Base Station, Area A. The picture is from the paper]
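The operation described above can be sketched as a small simulation of one epoch of an in-network AVG. This is an illustration under stated assumptions, not TAG's code: the tree is given as a hypothetical `{child: parent}` dict, the base station has no reading of its own, message loss is ignored, and AVG is carried as a (sum, count) pair so that partial results can be merged.

```python
def build_levels(parent):
    """Return {node: hops to root} for a tree given as {child: parent} (root maps to None)."""
    level = {}

    def depth(n):
        if n not in level:
            level[n] = 0 if parent[n] is None else depth(parent[n]) + 1
        return level[n]

    for n in parent:
        depth(n)
    return level

def tag_avg_epoch(parent, reading):
    """Simulate one epoch: every non-root node sends ONE (sum, count) message to its parent."""
    level = build_levels(parent)
    # Each node starts with its own reading; nodes without a sensor contribute nothing.
    partial = {n: (reading[n], 1) if n in reading else (0, 0) for n in parent}
    messages = 0
    # The deepest level transmits first, so a node has heard all children before its own slot.
    for n in sorted(parent, key=lambda n: -level[n]):
        p = parent[n]
        if p is None:
            continue
        s, c = partial[n]
        ps, pc = partial[p]
        partial[p] = (ps + s, pc + c)  # parent merges the aggregate of the child's subtree
        messages += 1
    root = next(n for n in parent if parent[n] is None)
    s, c = partial[root]
    return s / c, messages

parent = {"base": None, "a": "base", "b": "base", "c": "a", "d": "a"}
reading = {"a": 20, "b": 22, "c": 24, "d": 26}
print(tag_avg_epoch(parent, reading))  # average of the four readings, four messages
```

Each node sends a constant-size message regardless of the size of its subtree, which is where the reduction in transmitted data comes from.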

TAG Query Language

The picture is from the paper

TAG Contributions

Support of multiple aggregate functions AND group-by queries
– Classification and behavior based on type of function

In-network processing dramatically reduces transmitted data

Synchronization allows sensors to sleep most of the time

Further optimizations for monotonic aggregates
– E.g., MAX aggregate: Don’t transmit aggregate if you overhear a sibling that reports a larger aggregate

Considerations for message loss etc…
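A minimal sketch of the MAX optimization mentioned above. It assumes, for illustration only, that siblings transmit one after another and that every sibling can overhear all earlier transmissions; the function name is hypothetical.

```python
def max_with_suppression(sibling_values):
    """Return (max seen by the parent, list of values actually transmitted)."""
    sent = []
    best_heard = float("-inf")
    for v in sibling_values:
        # Stay silent only if a sibling was overheard reporting a larger aggregate.
        if v >= best_heard:
            sent.append(v)
            best_heard = v
    return best_heard, sent

print(max_with_suppression([17, 25, 19, 30]))  # 19 is never transmitted
```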

1b. Data Collection: WaveScheduling

WaveScheduling: Energy-Efficient Data Dissemination for Sensor Networks
– Niki Trigoni, Yong Yao, Alan J. Demers, Johannes Gehrke, Rajmohan Rajaraman
– Proposed in the Cougar system

Observation: In TAG, nodes at the same level transmit at the same time
– Many collisions, message losses
– Many retransmissions, energy drain

Goal: Organize nodes in order to minimize message collisions

Main Idea: WaveScheduling

Organize nodes in a grid
– Each grid area has a leader
– Nodes within each area transmit data to their leader
– Leaders communicate at specific intervals and directions

East Wave example

WaveScheduling Paths

Waves follow a North, East, South, West direction

The large right-turn path has a lower latency for N, E, S, W

[Figure labels: Source, Destination]

WaveScheduling Contributions

Transmissions without collisions
– Much lower energy drain
– But also, significantly larger delays
• Few nodes transmit at each time, long paths to follow


2a. Compressing Historical Measurements

Compressing Historical Information in Sensor Networks
– A. Deligiannakis, Y. Kotidis, N. Roussopoulos

Application: If past measurements are important, they should be periodically transmitted to base station, before the memory is exhausted

Transmitting all measurements is costly

2a. Compressing Historical Measurements

Observation: Sensors may measure several quantities

Potential correlations
– Within the same quantity
• Periodicity, similar trends
– Between different quantities
• Temperature and Voltage [Deshpande04], pressure and humidity
– Between different sensors in an area
• E.g., similar temperature and noise levels

Can we take advantage of such correlations to compress the data?

Regression at XY level = scaling and shifting

Examples of Correlated Signals

XY Graph

Main Idea

Create small dictionary of trends that appear frequently

Partition data into intervals

Encode each interval through some part of the dictionary
– Use linear regression for encoding: $Y = aX + b$

Dictionary

Part of Data

Regression Parameters (a,b)

What is Transmitted

[Figure labels: Sensor, Base Station, Dictionary, Measurements, Dictionary Updates, Dictionary Log, Log with received compressed data, Compressed Data, Size = B, M_Base]

For each data interval transmit 4 values: 1) Start position in data array, 2) Location of best approx. in dictionary; 3-4) Regression Parameters
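The four transmitted values per interval can be illustrated with a small encoder. This is a sketch under simplifying assumptions that are not from the talk: intervals have a fixed length, the dictionary is at least one interval long, and the best dictionary position is found by exhaustive search on the sum of squared errors. All function names are hypothetical.

```python
def fit_line(x, y):
    """Least-squares (a, b) for y ≈ a*x + b, plus the resulting sum of squared errors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx if sxx else 0.0  # a flat dictionary segment can only shift
    b = my - a * mx
    sse = sum((a * xi + b - yi) ** 2 for xi, yi in zip(x, y))
    return a, b, sse

def encode(data, dictionary, interval_len):
    """Return one (start, dict_pos, a, b) record per data interval."""
    records = []
    for start in range(0, len(data) - interval_len + 1, interval_len):
        y = data[start:start + interval_len]
        best = None
        for pos in range(len(dictionary) - interval_len + 1):
            a, b, sse = fit_line(dictionary[pos:pos + interval_len], y)
            if best is None or sse < best[0]:
                best = (sse, pos, a, b)
        _, pos, a, b = best
        records.append((start, pos, a, b))
    return records

def decode(records, dictionary, interval_len):
    """Rebuild the approximation at the base station from the records and the dictionary."""
    out = []
    for _, pos, a, b in records:
        out.extend(a * x + b for x in dictionary[pos:pos + interval_len])
    return out

dictionary = [0, 1, 4, 9, 16, 25]
data = [5, 11, 21, 1.0, 3.5, 7.0]  # 2*x+3 over [1,4,9], then 0.5*x-1 over [4,9,16]
records = encode(data, dictionary, 3)
print(records)
print(decode(records, dictionary, 3))
```

Swapping `fit_line` for a regression that minimizes a different error metric changes what the encoder optimizes without touching the rest.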

Cooperative Compression

Exploring spatial correlations…

Group leader partitions part of bandwidth to each sensor
– To save energy, group leader may transmit its dictionary updates to the group

Sensors compress their data, report resulting error (NOT data)

More space is assigned to nodes with larger errors

Compressed data transmitted to group leader

Combination with its own data and transmission

[Figure labels: Group Leader, Base Station, Query, Bandwidth, B1’, B2’, B3’, …, BS-1’]

Algorithm Benefits

To minimize different error metrics, simply change regression algorithm
– E.g., SSE, SSRE and Max errors are simple to handle

More space to difficult signals/sensors

Group organization saves group members the need to compute dictionary (expensive part)
– Need to rotate group leader selection, to avoid draining energy
– Can apply HEED protocol for group leader selection
• Prob. of becoming group leader is proportional to Ecurr / Einit
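A minimal sketch of energy-aware leader election, showing only the idea that the election probability scales with Ecurr / Einit. The real HEED protocol is iterative and also uses communication cost; the `base_prob` parameter, the function names and the single-round election are assumptions made for illustration.

```python
import random

def leader_probability(e_curr, e_init, base_prob=0.2):
    """Probability of volunteering as group leader, scaled by remaining energy."""
    return min(1.0, base_prob * e_curr / e_init)

def elect_leaders(energy, initial_energy, base_prob=0.2, seed=0):
    """One election round over {node: current energy}; returns the nodes that volunteer."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    leaders = []
    for node, e_curr in energy.items():
        p = leader_probability(e_curr, initial_energy[node], base_prob)
        if rng.random() < p:
            leaders.append(node)
    return leaders

energy = {"s1": 90.0, "s2": 40.0, "s3": 5.0}
initial = {"s1": 100.0, "s2": 100.0, "s3": 100.0}
print({n: leader_probability(energy[n], initial[n]) for n in energy})
print(elect_leaders(energy, initial))
```

Nodes that have already spent most of their battery rarely volunteer, so the expensive leader role rotates towards nodes with more energy left.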

2b. Model-Driven Data Acquisition

Model-driven Data Acquisition in Sensor Networks
– Amol Deshpande, Carlos Guestrin, Samuel R. Madden, Joseph M. Hellerstein, Wei Hong

Idea: Learn a probabilistic model of past measurements/patterns
– Multivariate Gaussian pdfs
– Also learn transitional probabilities P(Xt+1 | Xt)

At each epoch, base station decides for which sensors (and which quantities) it is confident about their current values
– This confidence decays over time if no samples are taken

Generate query plan to retrieve only the remaining quantities
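The idea can be illustrated with a two-variable Gaussian over (temperature, voltage). This is a sketch, not the authors' system: the model parameters are made-up numbers, the planner only considers three options, and `read_voltage` / `read_temperature` are hypothetical callables standing in for sensor acquisition. It uses the standard formulas for conditioning a bivariate Gaussian.

```python
import math

def condition(mu_t, mu_v, var_t, var_v, cov_tv, v_obs):
    """Mean and variance of temperature given an observed voltage (bivariate Gaussian)."""
    mean = mu_t + cov_tv / var_v * (v_obs - mu_v)
    var = var_t - cov_tv ** 2 / var_v
    return mean, var

def confidence(var, eps):
    """P(|X - mean| <= eps) for a Gaussian with the given variance."""
    return math.erf(eps / math.sqrt(2 * var))

def plan(mu_t, mu_v, var_t, var_v, cov_tv, eps, delta, read_voltage, read_temperature):
    """Answer 'temperature within +/- eps with confidence delta' with the cheapest option."""
    if confidence(var_t, eps) >= delta:
        return mu_t, "model only"
    mean, var = condition(mu_t, mu_v, var_t, var_v, cov_tv, read_voltage())
    if confidence(var, eps) >= delta:
        return mean, "sampled voltage"
    return read_temperature(), "sampled temperature"

# Assumed model: temperature ~ N(20, 4), voltage ~ N(2.7, 0.01), covariance 0.19.
estimate = plan(20.0, 2.7, 4.0, 0.01, 0.19, eps=1.0, delta=0.85,
                read_voltage=lambda: 2.75, read_temperature=lambda: 21.3)
print(estimate)
```

With these assumed numbers the prior alone is not confident enough, but one cheap voltage sample narrows the temperature estimate sufficiently, so the expensive temperature reading is skipped.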

Method Sketch (slide from the authors’ presentation)

[Figure labels: SQL query, with desired confidence bounds; Probabilistic Model; New Query; Data Collection; Feed Model; Model Estimate]

Algorithm Characteristics

Takes advantage of correlations
– May decide to sample voltage instead of temperature, to decrease energy

Model can be used for missing data (inaccessible sensors)

Can handle point and range queries

However, hard to handle previously unseen patterns
– “Thus, for models to perform accurate predictions they must be trained in the kind of environment where they will be used”

Centralized model, difficult to scale
– Subsequent work by same authors proposed a more distributed system (KEN)

2c. Snapshot Queries

Snapshot Queries: Towards Data-Centric Sensor Networks
– Yannis Kotidis

Idea: Nodes in proximity likely observe similar things

Find if you are needed to answer queries
– Does there exist a representative that can approximate your data accurately?

Nodes that are needed (representatives) constitute network snapshot
– Can estimate and answer queries for remaining nodes as well

Completely decentralized approach
– Representatives tested, change with few messages

Snapshot Queries

Q’: select loc, temperature from Sensors where loc in SOUTH_EAST_QUADRANT use snapshot

Example of Network Snapshot

Dark nodes show representatives

Network inside the network

Immediate visualization of common data patterns

Testing for Representatives

Ni maintains model for each neighbor Nj
– Constructed using cache of measurements from Nj
– Updated at random epochs by measurements transmitted by Nj

Can use model to predict measurement of Nj if:

$\mathrm{error}(x'_j(t) - x_j(t)) \le T$

Supports multiple error functions, user provided threshold T

Snapshot adapts over time to evolving data characteristics

No training is necessary
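A minimal sketch of the representative test. It assumes, for illustration, that Ni's model of Nj is a straight line fitted over cached (own reading, neighbor reading) pairs and that the error function is the absolute difference; the class and function names are hypothetical.

```python
class NeighborModel:
    """Model kept by node Ni for one neighbor Nj, built from a cache of paired readings."""

    def __init__(self):
        self.cache = []  # (x_i, x_j) pairs observed in the same epoch

    def observe(self, x_i, x_j):
        self.cache.append((x_i, x_j))

    def predict(self, x_i):
        """Estimate Nj's current reading from Ni's own reading (least-squares line)."""
        n = len(self.cache)
        mx = sum(p[0] for p in self.cache) / n
        my = sum(p[1] for p in self.cache) / n
        sxx = sum((p[0] - mx) ** 2 for p in self.cache)
        sxy = sum((p[0] - mx) * (p[1] - my) for p in self.cache)
        a = sxy / sxx if sxx else 0.0
        return a * x_i + (my - a * mx)

def can_represent(model, x_i, x_j, threshold, error=abs):
    """True if the prediction error for Nj's actual reading is within threshold T."""
    return error(model.predict(x_i) - x_j) <= threshold

model = NeighborModel()
for pair in [(20, 21), (22, 23), (24, 25)]:
    model.observe(*pair)
print(can_represent(model, 26, 27.3, threshold=0.5))  # prediction is close enough
print(can_represent(model, 26, 29.0, threshold=0.5))  # neighbor has drifted away
```

Passing a different `error` callable is one way to picture the "multiple error functions" mentioned on the slide.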

Approximate Aggregate Queries

In TAG, each node transmits aggregates at EACH epoch
– Can we do better than that?
– Aggregates (e.g., avg temperature) may change slowly

Processing approximate aggregate queries in wireless sensor networks
– A. Deligiannakis, Y. Kotidis, N. Roussopoulos

Application: If application tolerates an error E_Global = $|V - \hat{V}|$, organize aggregation to minimize #messages
– $V$, $\hat{V}$: Real/estimated aggregate result

Idea:
– Don’t transmit small changes in aggregates of subtrees

Dual problem a little harder
– Bandwidth Constrained Queries in Sensor Networks (same authors)

Error Filters

Algorithm applies error filters to sensor nodes
– Interval [L..H]

Each node computes aggregate for subtree
– Transmit only if new aggregate is outside the filter
– At each transmission, re-center error filter

Of course, the trick is how to decide how large each error filter should be
– Always respect accuracy constraints E_Global
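The filter behaviour described above can be sketched as follows. The class name and the use of a half-width `width` around a center (so the interval is [center - width, center + width]) are illustrative assumptions; how the widths are chosen is a separate question.

```python
class ErrorFilter:
    """Suppress transmissions while the subtree aggregate stays inside [L..H]."""

    def __init__(self, width, initial):
        self.width = width
        self.center = initial  # last value the parent knows about

    def update(self, value):
        """Return True (and re-center the filter) if the new aggregate must be sent."""
        low, high = self.center - self.width, self.center + self.width
        if low <= value <= high:
            return False  # small change: stay silent, the parent keeps the old value
        self.center = value
        return True

f = ErrorFilter(width=2, initial=20)
print([f.update(v) for v in [21, 19, 23, 24, 26]])
```

While the node is silent, the value held by its parent is off by at most `width`, which is the node's share of the tolerated error.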

Example of Error Filters

[Figure: aggregation tree under the Base Station with a per-node error filter E_i (e.g. E=1, E=4)]

E_Global = $\sum_i E_i$

Why not Uniform Allocation of Error?

Reasons not to select uniform filters
– Different range of aggregates in nodes/subtrees
– Changes in subtrees may cancel out
• E.g., vehicle movement from observation area of node 4 to the one of node 5

Goal: Use larger filters where you expect larger decrease in transmitted messages

Filters periodically adjusted
– Adapt to changes in data characteristics

Sketch Of Technique

[Figure: #messages as a function of Filter Width, with the shrunk, default (W) and expanded (W + dW) filters marked; the filter [C-W, C+W] shrinks to [C-W*F, C+W*F]]

Filters periodically shrink to W x F
– Creates error budget (1-F) x E_Global to redistribute to nodes

Algorithm computes simple statistics to estimate gain of increasing filter
– Statistics computed per node

Sketch of Technique (2)

Statistics aggregated bottom-up before reorganization
– Important note: Compare this cost (1 aggregate/sensor) with cost of transmitting individual statistics to base station (SELECT *)

Error budget redistributed top-down
– Partition error budget to your subtrees (and yourself), proportionally to the gain of each subtree

Often 5-time reduction in transmitted data compared to uniform allocation, order of magnitude when compared
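A flat sketch of the shrink-and-redistribute step. It simplifies the technique in ways that are assumptions, not the authors' algorithm: nodes are treated as a flat set instead of a tree, the total error is taken to be the sum of the filter widths, and the per-node `gains` are given rather than estimated from statistics.

```python
def redistribute(widths, gains, shrink=0.8):
    """Shrink every filter to W*F, then hand the freed budget out proportionally to gain."""
    shrunk = {n: w * shrink for n, w in widths.items()}
    budget = sum(widths.values()) - sum(shrunk.values())  # (1 - F) of the total width
    total_gain = sum(gains.values())
    if total_gain == 0:
        return dict(widths)  # nobody benefits from a wider filter: keep the old widths
    return {n: shrunk[n] + budget * gains[n] / total_gain for n in widths}

widths = {"a": 2.0, "b": 2.0, "c": 1.0}
gains = {"a": 3.0, "b": 1.0, "c": 0.0}  # hypothetical estimates of messages saved
new_widths = redistribute(widths, gains)
print(new_widths, sum(new_widths.values()))
```

The total width is preserved, so the global accuracy constraint still holds while the nodes expected to save the most messages end up with wider filters.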


The Need for Outlier Detection

Outlier Detection is Useful
– Outliers may denote malfunctioning sensors
• Sensor measurements are often unreliable
• Sensors may fail-dirty
– Outliers may also represent interesting events detected by few sensors
• Fire detected by a sensor

Results of aggregate queries are often meaningless
– Consider a MAX/MIN calculation in the presence of outlier measurements
– Other aggregates are also influenced

What do we Need?

Goals for aggregate queries:
– A “clean” aggregate
– Reporting of outlier values
– Both in a SINGLE, in-network framework

What to Consider as an Outlier?

Need to support several similarity metrics

Also consider characteristics of monitored quantity
– Measurements may depend on distance from source (e.g., noise, heat)
– Simply relying on values for testing similarity between sensors is not enough; comparing recent trends may be more appropriate

Make provision for user-specified “minimum support”
– How many other sensors need to be similar to you, so that you are not considered an outlier?
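A centralized sketch of the minimum-support rule, for illustration only (the framework in the talk performs these tests in-network). The similarity function (mean absolute difference over the latest readings), the threshold and the sample readings are all assumptions.

```python
def similar(a, b, threshold):
    """Assumed similarity test: mean absolute difference of the latest readings."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a) <= threshold

def find_outliers(recent, min_support, threshold):
    """Return the nodes that have fewer than min_support similar peers."""
    outliers = []
    for node, series in recent.items():
        support = sum(
            1 for other, s in recent.items()
            if other != node and similar(series, s, threshold)
        )
        if support < min_support:
            outliers.append(node)
    return outliers

recent = {
    "S4": [20, 21, 20], "S5": [21, 20, 21], "S12": [20, 20, 21],  # normal area
    "S6": [40, 55, 48], "S7": [42, 53, 50], "S8": [41, 56, 47],   # fluctuating, but they agree
    "S9": [0, 0, 0],                                             # failed dirty
}
print(find_outliers(recent, min_support=2, threshold=3))
```

Unusual readings that are witnessed by enough other sensors are kept, while a reading that nobody supports is flagged.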

Motivational Example

• S6, S7 and S8 observe a fire

• Their measurements fluctuate more

A voting process at S3 will reject the reading of S6

Smoothing at S3 also obscures the reading

S10 and S9 fail dirty and need to be excluded

Example is partitioned in 2 areas: Our framework supports group-by queries

Sketch of Framework

1. Assume a required minimum support of 2

2. S6, S7 and S8 can only be tested for similarity at their closest common ancestor (S2)

3. At node S2, their values can be merged into aggregate

4. S12, S4 and S5 are in the same group. In S3 their values can be tested for similarity

5. Similarly, test S3, S2 and S11 for similarity in S2 etc

6. Nodes S10 and S9 have failed-dirty. Readings without the minimum support are not included in the aggregate

Even if the readings of S6, S7 and S8 are incorporated into the aggregate, one of these readings (revealing the fire) will be received by the root node

Framework Features

Performs similarity tests over the latest K readings of sensors
– Can plug similarity functions with minimal cost

Allows for minimum support

GROUP-BY support
– Grouping based on latest measurement, OR static predicates (area, id etc.)

Can limit tests within each group using a CONSTRAIN TEST clause
– Semantic information (e.g., location) could be useful here
– E.g., only perform tests between sensors on the same floor

Collection Tree periodically reorganized
– Move towards places you will find witnesses, outliers

Just to See what Happens…

Some More Notable Approaches

Cluster-Based Communication
– LEACH, PEGASIS, HEED
– Goal: Organize sensors into clusters
– Intra-cluster and inter-cluster communication
– Overhead for clusterheads, so probabilistic election and rotation

Optimizing Scheduling
– Nodes have different loads to transmit
– Can determine minimal times to transmit/listen based on worst case estimates of time to transmit/link
– See: “Workload-aware Optimization of Query Routing Trees in Wireless Sensor Networks” (MicroPulse) by P. Andreou, D. Zeinalipour-Yazti, P. Chrysanthis, G. Samaras

Some More Notable Approaches (2)

Sensor Localization
– Few sensor nodes are actually GPS-enabled
• Catalog for Crossbow gives only 1 such data acquisition board (MTS420)
– Several algorithms for sensor localization
• Option 1: Sensors are not localized, but there exist landmarks with GPS knowledge of themselves
• Option 2: Mobile sensors can localize even without any infrastructure, if (a) the sensors are free to move, (b) they have a common reference point (e.g., direction of North)
– GPS-Free Node Localization in Mobile Wireless Sensor Networks, by Huseyin Akcan, Vassil Kriakov, Herve Bronnimann, Alex Delis

Duplicate-Insensitive sketches for aggregate queries
– Minimizes impact of data loss
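One well-known duplicate-insensitive structure is the Flajolet-Martin sketch for counting distinct items; whether the talk had exactly this variant in mind is an assumption. The sketch below shows the property that matters here: merging is a bitwise OR, so a reading that reaches the base station over several paths is counted once. Function names and the 32-bit width are illustrative choices.

```python
import hashlib

BITS = 32

def _rho(item):
    """Index of the least-significant 1 bit in a 32-bit hash of the item."""
    h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:4], "big")
    if h == 0:
        return BITS - 1
    return (h & -h).bit_length() - 1

def fm_add(sketch, item):
    """Insert an item; inserting the same item again leaves the sketch unchanged."""
    return sketch | (1 << _rho(item))

def fm_merge(a, b):
    """Combine two sketches; order and repetition of merges do not matter."""
    return a | b

def fm_estimate(sketch):
    """Rough distinct-count estimate from the position of the lowest unset bit."""
    r = 0
    while (sketch >> r) & 1:
        r += 1
    return (2 ** r) / 0.77351

left = 0
for node in ["n1", "n2", "n3"]:
    left = fm_add(left, node)
right = 0
for node in ["n2", "n3", "n4"]:  # n2 and n3 are also reported along a second path
    right = fm_add(right, node)
print(fm_estimate(fm_merge(left, right)))
```

A single sketch gives only a coarse estimate; practical schemes average many of them, which is omitted here.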

Conclusions

Presented several query processing techniques
– All aim to process data in-network
• Crucial for network lifetime
– Techniques for extraction of historical measurements, or online querying (point, range, aggregate, group-by)
– All data reduction techniques presented are easily tunable
• Given desired error/accuracy, minimize bandwidth consumption; or
• Given bandwidth consumption, minimize error of produced results
• Both useful to satisfy different user requirements
– Important to properly schedule nodes
• Increases available time to sleep, decreases the energy drain
• If possible, avoid creating a different schedule for each query
– Too many conflicts, sensor continuously working…
