Intelligent Data Processing for the Internet of Things

80
1 Intelligent Data Processing for the Internet of Things Payam Barnaghi Institute for Communication Systems (ICS) University of Surrey Guildford, United Kingdom International “IoT 360″ Summer School October 2015 – Rome, Italy

Transcript of Intelligent Data Processing for the Internet of Things

Page 1: Intelligent Data Processing for the Internet of Things

1

Intelligent Data Processing for the Internet of Things

Payam BarnaghiInstitute for Communication Systems (ICS)University of SurreyGuildford, United Kingdom

International “IoT 360″ Summer SchoolOctober 2015 – Rome, Italy

Page 2: Intelligent Data Processing for the Internet of Things

Wireless Sensor (and Actuator) Networks

Sinknode Gateway

Core networke.g. InternetGateway

End-user

Computer services

- The networks typically run Low Power Devices- Consist of one or more sensors, could be different type of sensors (or actuators)

Operating Systems?

Services?

Protocols?Protocols?

In-node Data

Processing

Data Aggregation/

Fusion

Inference/Processing of IoT data

Page 3: Intelligent Data Processing for the Internet of Things

3

Key characteristics of IoT devices

−Often inexpensive sensors (actuators) equipped with a radio transceiver for various applications, typically low data rate ~ 10-250 kbps.

−Deployed in large numbers−The sensors should coordinate to perform the desired task.−The acquired information (periodic or event-based) is

reported back to the information processing centre (or some cases in-network processing is required)

−Solutions are often application-dependent.

3

Page 4: Intelligent Data Processing for the Internet of Things

4

Beyond conventional sensors

− Human as a sensor (citizen sensors)− e.g. tweeting real world data and/or events

− Software sensors− e.g. Software agents/services generating/representing

data

Road block, A3

Road block, A3

Suggest a different route

Page 5: Intelligent Data Processing for the Internet of Things

5P. Barnaghi et al., "Digital Technology Adoption in the Smart Built Environment", IET Sector Technical Briefing, The Institution of Engineering and Technology (IET), I. Borthwick (editor), March 2015.

Page 6: Intelligent Data Processing for the Internet of Things

Internet of Things: The story so far

P. Barnaghi, A. Sheth, “The Internet of Things: The story so far”, IEEE IoT Newsletter, September 2014.

Page 7: Intelligent Data Processing for the Internet of Things

IoT data- challenges

− Multi-modal, distributed and heterogeneous− Noisy and incomplete− Time and location dependent − Dynamic and varies in quality − Crowdsourced data can be unreliable − Requires (near-) real-time analysis− Privacy and security are important issues− Data can be biased- we need to know our data!

7P. Barnaghi, A. Sheth, C. Henson, "From data to actionable knowledge: Big Data Challenges in the Web of Things," IEEE Intelligent Systems, vol.28 , issue.6, Dec 2013.

Page 8: Intelligent Data Processing for the Internet of Things

Data is not what we want or is it?

Page 9: Intelligent Data Processing for the Internet of Things

What we need are insights and actionable-information.

Page 10: Intelligent Data Processing for the Internet of Things

Diffusion of innovation

image source: Wikipedia

IoT

Page 11: Intelligent Data Processing for the Internet of Things

Problem #1

Data: We seem to have lots of it…

Real World Data: it is always difficult to get (silos, format, privacy, business interests or lack of interest!...)

Page 12: Intelligent Data Processing for the Internet of Things

Problem #2

Data: interoperability and metadata frameworks…

Real World Data: there are solutions for service based (RESTful) access, meta-data/semantic representation frameworks (e.g. W3C SSN, HyperCat,…) but none of them are still widely adapted.

Page 13: Intelligent Data Processing for the Internet of Things

Problem #3

Data: quality, reliability…

Real World Data: data can be noisy, crowed source data can be inaccurate, contradictory, delay in accessing/processing the data…

Page 14: Intelligent Data Processing for the Internet of Things

Problem #4

Data: having too much data and using analytics tools alone won’t solve the problem…

Real World Data: in addition to the HPC issues, we need new methods/solutions that can provide real-time analysis of dynamic, variable quality and multi-modal streams…

Page 15: Intelligent Data Processing for the Internet of Things

Problem #5

Data: abstraction, discovering the associations…

Real World Data: co-occurrence vs. causation; we need hypothesis, background knowledge,…After all data is not what we are really after…

Page 16: Intelligent Data Processing for the Internet of Things

We need more linked open data…

Page 17: Intelligent Data Processing for the Internet of Things

(near) real-time linked open data Streams

Sometimes it’s even better if we have:

Page 18: Intelligent Data Processing for the Internet of Things

(near) real-time linked open data Streams+meta-data (semantic annotations)+Adaptable and scalable analytics tools+Sufficient background knowledge

or even better than that if we have:

Page 19: Intelligent Data Processing for the Internet of Things

IoT Data

19

Page 20: Intelligent Data Processing for the Internet of Things

20

“Each single data item could be important.”

“Relying merely on data from sources that are unevenly distributed, without considering background information or social context, can lead to imbalanced interpretations and decisions.”?

Page 21: Intelligent Data Processing for the Internet of Things

Current focus on Big Data

− Emphasis on power of data and data mining solutions

− Technology solutions to handle large volumes of data; e.g. Hadoop, NoSQL, Graph Databases, …

− Trying to find patterns and trends from large volumes of data…

Page 22: Intelligent Data Processing for the Internet of Things

Myths About Big Data

− Big Data is only about massive data volume− Big Data means Hadoop− Big Data means unstructured data− If we have enough data we can draw conclusions

(enough here often means massive amounts)− NoSQL means No SQL− It is about increasing computational power and

taking more data and running data mining algorithms.

22Some of the items are adapted from: Brain Gentile, http://mashable.com/2012/06/19/big-data-myths/

Page 23: Intelligent Data Processing for the Internet of Things

What happens if we only focus on data

− Number of burgers consumed per day.− Number of cats outside.− Number of people checking their facebook

account.

− What insight would you draw?

23

Page 24: Intelligent Data Processing for the Internet of Things

It is also important to note what type of problems we expect to solve.

Page 25: Intelligent Data Processing for the Internet of Things

Back to the future

25

Page 26: Intelligent Data Processing for the Internet of Things

26Source LAT Times, http://documents.latimes.com/la-2013/

A smart City exampleFuture cities: A view from 1998

Page 27: Intelligent Data Processing for the Internet of Things

27Source: http://robertluisrabello.com/denial/traffic-in-la/#gallery[default]/0/

Source: wikipedia

Back to the Future: 2013

Page 28: Intelligent Data Processing for the Internet of Things

Common problems

28

Guildford, Surrey

Page 29: Intelligent Data Processing for the Internet of Things

29

Data alone is not enough

− Domain knowledge− Machine interpretable meta-data− Delivery, sharing and representation services− Query, discovery, aggregation services− Publish, subscribe, notification, and access

interfaces/services− More open solutions for innovation and citizen

participation− Efficient feedback and control mechanisms − Social network and social system analysis− In cities, interactions with people and social

systems is the key.

Page 30: Intelligent Data Processing for the Internet of Things

30

Some of the technical challenges

− Discovery: finding appropriate device and data sources− Access: Availability and (open) access to data

resources and data− Search: querying for data− Integration: dealing with heterogeneous devices,

networks and data− Large-scale data mining, adaptable learning and

efficient computing and processing − Interpretation: translating data to knowledge that can

be used by people and applications− Scalability: dealing with large numbers of devices and

a myriad of data and the computational complexity of interpreting the data.

Page 31: Intelligent Data Processing for the Internet of Things

31

Accessing IoT data

− Publish/Subscribe (long-term/short-term)− Ad-hoc query− The typical types of data request for sensory data:

− Query based on− ID (resource/service) – for known resources− Location− Type− Time – requests for freshness data or historical data; − One of the above + a range [+ Unit of Measurement]− Type/Location/Time + A combination of Quality of Information

attributes − An entity of interest (a feature of an entity on interest)− Complex Data Types (e.g. pollution data could be a combination of

different types)

Page 32: Intelligent Data Processing for the Internet of Things

32

IoT Data in the Cloud

Image courtesy: http://images.mathrubhumi.comhttp://www.anacostiaws.org/userfiles/image/Blog-Photos/river2.jpg

Page 33: Intelligent Data Processing for the Internet of Things

IoT Data Processing

WSN

WSN

WSN

WSNWSN

Network-enabled Devices

Network-enabled Devices

Network services/storag

e and processing

unitsData/service

access at application

level

Data collections and

processing within the networks

Data Discovery

Service/Resource Discovery

Page 34: Intelligent Data Processing for the Internet of Things

34

In-network processing

− Mobile Ad-hoc Networks can be seen as a set of nodes that deliver bits from one end to the other;

− WSNs, on the other end, are expected to provide information, not necessarily original bits− Gives additional options− e.g., manipulate or process the data in the network

− Main example: aggregation − Applying aggregation functions to a obtain an average value of

measurement data− Typical functions: minimum, maximum, average, sum, … − Not amenable functions: median

Source: Protocols and Architectures for Wireless Sensor Networks, Protocols and Architectures for Wireless Sensor NetworksHolger Karl, Andreas Willig, chapter 3, Wiley, 2005 .

Page 35: Intelligent Data Processing for the Internet of Things

35

In-network processing

− Depending on application, more sophisticated processing of data can take place within the network− Example edge detection: locally exchange raw data with

neighboring nodes, compute edges, only communicate edge description to far away data sinks.

− Example tracking/angle detection of signal source: Conceive of sensor nodes as a distributed microphone array, use it to compute the angle of a single source, only communicate this angle, not all the raw data.

− Exploit temporal and spatial correlation− Observed signals might vary only slowly in time; so no need

to transmit all data at full rate all the time.− Signals of neighboring nodes are often quite similar; only try

to transmit differences (details a bit complicated, see later).

Source: Protocols and Architectures for Wireless Sensor Networks, Protocols and Architectures for Wireless Sensor NetworksHolger Karl, Andreas Willig, chapter 3, Wiley, 2005 .

Page 36: Intelligent Data Processing for the Internet of Things

36

Data-centric networking

− In typical networks (including ad-hoc networks), network transactions are addressed to the identities of specific nodes− A “node-centric” or “address-centric” networking paradigm

− In a redundantly deployed sensor networks, specific source of an event, alarm, etc. might not be important− Redundancy: e.g., several nodes can observe the same area

− Thus: focus networking transactions on the data directly instead of their senders and transmitters ! data-centric networking− Specially this idea is reinforced by the fact that we might

have multiple sources to provide information and observations form the same or similar areas.

− Principal design change

Source: Protocols and Architectures for Wireless Sensor Networks, Protocols and Architectures for Wireless Sensor NetworksHolger Karl, Andreas Willig, chapter 3, Wiley, 2005 .

Page 37: Intelligent Data Processing for the Internet of Things

Data Aggregation

− Computing a smaller representation of a number of data items (or messages) that is extracted from all the individual data items.

− For example computing min/max or mean of sensor data.

− More advance aggregation solutions could use approximation techniques to transform high-dimensionality data to lower-dimensionality abstractions/representations.

− The aggregated data can be smaller in size, represent patterns/abstractions; so in multi-hop networks, nodes can receive data form other node and aggregate them before forwarding them to a sink or gateway.

− Or the aggregation can happen on a sink/gateway node.

Page 38: Intelligent Data Processing for the Internet of Things

Aggregation example

− Reduce number of transmitted bits/packets by applying an aggregation function in the network

1

1

31

1

6

1

1

11

1

1

Source: Holger Karl, Andreas Willig, Protocols and Architectures for Wireless Sensor Networks, Protocols and Architectures for Wireless Sensor Networks, chapter 3, Wiley, 2005 .

Page 39: Intelligent Data Processing for the Internet of Things

Efficacy of an aggregation mechanism

− Accuracy: difference between the resulting value or representation and the original data− Some solutions can be lossless or lossly depending on the

applied techniques. − Completeness: the percentage of all the data items that are

included in the computation of the aggregated data.− Latency: delay time to compute and report the aggregated

data− Computation foot-print; complexity;

− Overhead: the main advantage of the aggregation is reducing the size of the data representation;− Aggregation functions can trade-off between accuracy, latency

and overhead;

− Aggregation should happen close to the source.

Page 40: Intelligent Data Processing for the Internet of Things

Publish/Subscribe

− Achieved by publish/subscribe paradigm− Idea: Entities can publish data under certain names− Entities can subscribe to updates of such named data

− Conceptually: Implemented by a software bus− Software bus stores subscriptions, published data; names used

as filters; subscribers notified when values of named data changes

Software bus

Publisher 1 Publisher 2

Subscriber 1 Subscriber 2 Subscriber 3

− Variations− Topic-based P/S –

inflexible − Content-based

P/S – use general predicates over named data

Source: Holger Karl, Andreas Willig, Protocols and Architectures for Wireless Sensor Networks, Protocols and Architectures for Wireless Sensor Networks, chapter 12, Wiley, 2005 .

Page 41: Intelligent Data Processing for the Internet of Things

MQTT Pub/Sub Protocol

− MQ Telemetry Transport (MQTT) is a lightweight broker-based publish/subscribe messaging protocol.

− MQTT is designed to be open, simple, lightweight and easy to implement. − These characteristics make MQTT ideal for use in

constrained environments, for example in IoT. −Where the network is expensive, has low bandwidth or

is unreliable −When run on an embedded device with limited

processor or memory resources;− A small transport overhead (the fixed-length header is just 2

bytes), and protocol exchanges minimised to reduce network traffic

− MQTT was developed by Andy Stanford-Clark of IBM, and Arlen Nipper of Cirrus Link Solutions.

Source: MQTT V3.1 Protocol Specification, IBM, http://public.dhe.ibm.com/software/dw/webservices/ws-mqtt/mqtt-v3r1.html

Page 42: Intelligent Data Processing for the Internet of Things

MQTT

− It supports publish/subscribe message pattern to provide one-to-many message distribution and decoupling of applications

− A messaging transport that is agnostic to the content of the payload − The use of TCP/IP to provide basic network connectivity − Three qualities of service for message delivery:

− "At most once", where messages are delivered according to the best efforts of the underlying TCP/IP network. Message loss or duplication can occur.

− This level could be used, for example, with ambient sensor data where it does not matter if an individual reading is lost as the next one will be published soon after.

− "At least once", where messages are assured to arrive but duplicates may occur.

− "Exactly once", where message are assured to arrive exactly once. This level could be used, for example, with billing systems where duplicate or lost messages could lead to incorrect charges being applied.

Source: MQTT V3.1 Protocol Specification, IBM, http://public.dhe.ibm.com/software/dw/webservices/ws-mqtt/mqtt-v3r1.html

Page 43: Intelligent Data Processing for the Internet of Things

MQTT Message Format

− The message header for each MQTT command message contains a fixed header.

− Some messages also require a variable header and a payload. − The format for each part of the message header:

Source: MQTT V3.1 Protocol Specification, IBM, http://public.dhe.ibm.com/software/dw/webservices/ws-mqtt/mqtt-v3r1.html

— DUP: Duplicate delivery— QoS: Quality of Service— RETAIN: RETAIN flag

—This flag is only used on PUBLISH messages. When a client sends a PUBLISH to a server, if the Retain flag is set (1), the server should hold on to the message after it has been delivered to the current subscribers.—This allows new subscribers to instantly receive data with the retained, or Last Known Good, value.

Page 44: Intelligent Data Processing for the Internet of Things

Sensor Data as time-series data

− The sensor data (or IoT data in general) can be seen as time-series data.

− A sensor stream refers to a source that provide sensor data over time.

− The data can be sampled/collected at a rate (can be also variable) and is sent as a series of values.

− Over time, there will be a large number of data items collected.

− Using time-series processing techniques can help to reduce the size of the data that is communicated; −Let’s remember, communication can consume

more energy than communication;

Page 45: Intelligent Data Processing for the Internet of Things

Sensor Data as time-series data

− Different representation method that introduced for time-series data can be applied.

− The goal is to reduce the dimensionality (and size) of the data, to find patterns, detect anomalies, to query similar data;

− Dimensionality reduction techniques transform a data series with n items to a representation with w items where w < n.− This functions are often lossy in comparison with solutions

like normal compression that preserve all the data. − One of these techniques is called Symbolic Aggregation

Approximation (SAX).− SAX was originally proposed for symbolic representation of

time-series data; it can be also used for symbolic representation of time-series sensor measurements.

− The computational foot-print of SAX is low; so it can be also used as a an in-network processing technique.

Page 46: Intelligent Data Processing for the Internet of Things

46

In-network processingUsing Symbolic Aggregate Approximation (SAX)

SAX Pattern (blue) with word length of 20 and a vocabulary of 10 symbolsover the original sensor time-series data (green)

Source: P. Barnaghi, F. Ganz, C. Henson, A. Sheth, "Computing Perception from Sensor Data", in Proc. of the IEEE Sensors 2012, Oct. 2012.

fggfffhfffffgjhghfff

jfhiggfffhfffffgjhgi

fggfffhfffffgjhghfff

Page 47: Intelligent Data Processing for the Internet of Things

Symbolic Aggregate Approximation (SAX)

− SAX transforms time-series data into symbolic string representations. − Symbolic Aggregate approXimation was proposed by Jessica Lin et al

at the University of California –Riverside;− http://www.cs.ucr.edu/~eamonn/SAX.htm .

− It extends Piecewise Aggregate Approximation (PAA) symbolic representation approach.

− SAX algorithm is interesting for in-network processing in WSN because of its simplicity and low computational complexity.

− SAX provides reasonable sensitivity and selectivity in representing the data.

− The use of a symbolic representation makes it possible to use several other algorithms and techniques to process/utilise SAX representations such as hashing, pattern matching, suffix trees etc.

Page 48: Intelligent Data Processing for the Internet of Things

Processing Steps in SAX

− SAX transforms a time-series X of length n into the string of arbitrary length, where typically, using an alphabet A of size a > 2.

− The SAX algorithm has two main steps: −Transforming the original time-series into a

PAA representation−Converting the PAA intermediate

representation into a string during. − The string representations can be used for

pattern matching, distance measurements, outlier detection, etc.

Page 49: Intelligent Data Processing for the Internet of Things

Piecewise Aggregate Approximation

− In PAA, to reduce the time series from n dimensions to w dimensions, the data is divided into w equal sized “frames.”

− The mean value of the data falling within a frame is calculated and a vector of these values becomes the data-reduced representation.

− Before applying PAA, each time series to have a needs to be normalised to achieve a mean of zero and a standard deviation of one.−The reason is to avoid comparing time series

with different offsets and amplitudes;

Source: Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. 2003. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery (DMKD '03). ACM, New York, NY, USA, 2-11.

Page 50: Intelligent Data Processing for the Internet of Things

SAX- normalisation before PAA

Timeseries (c): 2, 3, 4.5, 7.6, 4, 2, 2, 2, 3, 1

Mean (μ): μ= (2+3+4.5+7.6+4+2+2+2+3+1)/10= 3.11

Standard deviation (σ): (2-3.11)2 = 1.2321(3-3.11)2 = 0.0121(4.5-3.11)2 = 1.9321(7.6-3.11)2 = 20.1601(4-3.11)2 = 0.7921(2-3.11)2 = 1.2321(2-3.11)2 = 1.2321(2-3.11)2 = 1.2321(3-3.11)2 = 0.0121(1-3.11)2 = 4.4521

1.2321+0.0121+ 1.9321+ 20.1601+ 0.7921+ 1.2321+ 1.2321+ 1.2321+ 1.2321+ 0.0121+4.4521 = 33.5211

σ = √ (33.5211/10) = 1.83087683911

Page 51: Intelligent Data Processing for the Internet of Things

NormalisationTimeseries (c): 2, 3, 4.5, 7.6, 4, 2, 2, 2, 3, 1

Normalised: zi = (ci – μ)/ σ

σ = 1.83087683911μ = 3.11z1 = (2- 3.11)/1.83087683911 = -0.606z2 = (3-3.11)/ 1.83087683911= -0.600z3 = (4.5-3.11)/ 1.83087683911= 2.452z4 = (7.6-3.11)/ 1.83087683911= -0.600z5 = (4-3.11)/ 1.83087683911= 0.486z6 = (2-3.11)/ 1.83087683911= -0.606z7 = (2-3.11)/ 1.83087683911= -0.606z8 = (2-3.11)/ 1.83087683911= -0.606z9 = (3-3.11)/ 1.83087683911= -0.600z10 = (1-3.11)/ 1.83087683911= -1.152

Normalised Timeseries (z): -0.606, -0.600, 2.452, -0.600, 0.486, -0.606, -0.606, -0.606 , -0.600, -1.152

Page 52: Intelligent Data Processing for the Internet of Things

PAA calculation

Timeseries (c): 2, 3, 4.5, 7.6, 4, 2, 2, 2, 3, 1Normalised Timeseries (z): -0.606, -0.600, 2.452, -0.600,

0.486, -0.606, -0.606, -0.606 , -0.600, -1.152

PAA (w=5): -0.603, 0.926, -0.06, -0.606, 0.273

Page 53: Intelligent Data Processing for the Internet of Things

PAA to SAX Conversion

− Conversion of the PAA representation of a time-series into SAX is based on producing symbols that correspond to the time-series features with equal probability.

− The SAX developers have shown that time-series which are normalised (zero mean and standard deviation of 1) follow a Normal distribution (Gaussian distribution).

− The SAX method introduces breakpoints that divides the PAA representation to equal sections and assigns an alphabet for each section.− For defining breakpoints, Normal inverse cumulative

distribution function

Page 54: Intelligent Data Processing for the Internet of Things

Breakpoints in SAX

− “Breakpoints: breakpoints are a sorted list of numbers B = β 1,…, β a-1 such that the area under a N(0,1) Gaussian curve from βi to βi+1 = 1/a”.

Source: Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. 2003. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery (DMKD '03). ACM, New York, NY, USA, 2-11.

Page 55: Intelligent Data Processing for the Internet of Things

Alphabet representation in SAX

− Let’s assume that we will have 4 symbols alphabet: a,b,c,d

− As shown in the table in the previous slide, the cut lines for this alphabet (also shown as the thin red lines on the plot below) will be { -0.67, 0, 0.67 }

Source: JMOTIF Time series mining, http://code.google.com/p/jmotif/wiki/SAX

Page 56: Intelligent Data Processing for the Internet of Things

SAX Represetantion

Timeseries (c): 2, 3, 4.5, 7.6, 4, 2, 2, 2, 3, 1Normalised Timeseries (z): -0.606, -0.600, 2.452, -

0.600, 0.486, -0.606, -0.606, -0.606 , -0.600, -1.152

PAA (w=5): -0.603, 0.926, -0.06, -0.606, 0.273

Cut off ranges: {-0.67, 0, 0.67}Alphabet: a ,b ,c, d

SAX representation: bdbbc

Page 57: Intelligent Data Processing for the Internet of Things

Features of the SAX technique

−SAX divides a time series data into equal segments and then creates a string representation for each segment.

−The SAX patterns create the lower-level abstractions that are used to create the higher-level interpretation of the underlying data.

−The string representation of the SAX mechanism enables to compare the patterns using a specific type of string similarity function.

Page 58: Intelligent Data Processing for the Internet of Things

Adaptive learning

Slides: Daniel Puschmann, Frieder Ganz (now with Adobe)

Page 59: Intelligent Data Processing for the Internet of Things

Creating Patterns- Adaptive sensor SAX

59F. Ganz, P. Barnaghi, F. Carrez, "Information Abstraction for Heterogeneous Real World Internet Data”, IEEE Sensors Journal, 2013.

Page 60: Intelligent Data Processing for the Internet of Things

Data abstraction

60F. Ganz, P. Barnaghi, F. Carrez, "Information Abstraction for Heterogeneous Real World Internet Data", IEEE Sensors Journal, 2013.

Page 61: Intelligent Data Processing for the Internet of Things

From patterns/events to a situation ontology

61F. Ganz, P. Barnaghi, F. Carrez, "Automated Semantic Knowledge Acquisition from Sensor Data", IEEE Systems Journal, 2014.

Page 62: Intelligent Data Processing for the Internet of Things

Learning ontology from sensory data

62

Page 63: Intelligent Data Processing for the Internet of Things

KAT- Knowledge Acquisition Toolkit

http://kat.ee.surrey.ac.uk/F. Ganz, D. Puschmann, P. Barnaghi, F. Carrez, "A Practical Evaluation of Information Processing and Abstraction Techniques for the Internet of Things", IEEE Internet of Things Journal, 2015. 63

Page 64: Intelligent Data Processing for the Internet of Things

64

https://github.com/UniSurreyIoT/KAT

Website: http://kat.ee.surrey.ac.uk

Page 65: Intelligent Data Processing for the Internet of Things

Social data analysis: A case study

Slides: Pramod Anantharam, Kno.e.sis Centre, Wright State University

P. Anantharam, P. Barnaghi, K. Thirunarayan, A.P. Sheth, "Extracting City Traffic Events from Social Streams", ACM Trans. on Intelligent Systems and Technology, 2015.

Page 66: Intelligent Data Processing for the Internet of Things

Social data analysis: A case study

− Are people talking about city infrastructure on twitter?

− Can we extract city infrastructure related events from twitter?

− How can we leverage event and location knowledge bases for event extraction?

− How well can we extract city events?

Source: Pramod Anantharam, Kno.e.sis Centre, Wright State University.

Page 67: Intelligent Data Processing for the Internet of Things

Do people talk about city infrastructure on Twitter?

67Source: Pramod Anantharam, Kno.e.sis Centre, Wright State University.

Page 68: Intelligent Data Processing for the Internet of Things

Some challenges in extracting events from Tweets− No well accepted definition of “events related to a

city”.− Tweets are short (140 characters) and its informal

nature make it hard to analyze− Entity, location, time, and type of the event

− Multiple reports of the same event and sparse report of some events (biased sample)− Numbers don’t necessarily indicate intensity

− Validation of the solution is hard due to the open domain nature of the problem

68Source: Pramod Anantharam, Kno.e.sis Centre, Wright State University.

Page 69: Intelligent Data Processing for the Internet of Things

Event extraction techniques

− N-grams + Regression − Text analysis to extract uni- and bi-grams (event markers)− Feature selection to select best possible event markers− Apply regression to predict P(Y|X) where Y is the target (rainfall)

and X is the input (event marker).− Clustering

− Create event clusters incrementally over time− Identify clusters of interest based on its relevance (manual

inspection)− Granularity remains at the tweet/cluster level (tweets are

assigned to clusters of interest)− Sequence Labeling (CRFs)

− Text analysis to extract features such as named entities, POS1 tagging

− Each event indicator is modeled as a mixture of event types that are latent variables

− Each type corresponds to a distribution over named entities n (labels assigned to event types by manual inspection) and other features

Source: Pramod Anantharam, Kno.e.sis Centre, Wright State University.

Page 70: Intelligent Data Processing for the Internet of Things

City Event Extraction

− Event extraction should be open domain (no a priori event types) with event metadata.

− Incorporate background knowledge related to city related events e.g., 511.org hierarchy, SCRIBE ontology, city location names.

− Assess the intensity of an event using content and network cues.

− Robust to noise, informal nature, and variability of data.

Source: Pramod Anantharam, Kno.e.sis Centre, Wright State University.

Page 71: Intelligent Data Processing for the Internet of Things

City Infrastructure

Tweets from a city POS Tagging

Hybrid NER+ Event term extraction

Geohashing

Temporal Estimation

Impact Assessment

Event AggregationOSM Locations SCRIBE ontology

511.org hierarchy

City Event Extraction

City Event Extraction Solution Architecture

City Event Annotation

P. Anantharam, P. Barnaghi, K. Thirunarayan, A.P. Sheth, "Extracting City Traffic Events from Social Streams", ACM Trans. on Intelligent Systems and Technology, 2015.

Page 72: Intelligent Data Processing for the Internet of Things

City Event Extraction - Key Insights

a) Space: events reported within a grid (gi ∈G where G is a set of all grids in a city)at a certain time are most likely reporting the same event

b) Time: events reported within a time ∆t in a grid gi are most likely to be reporting the same event

c) Theme: events with similar entities within a grid gi and time ∆t are most likely reporting the same event

Source: Pramod Anantharam, Kno.e.sis Centre, Wright State University.

Page 73: Intelligent Data Processing for the Internet of Things

CRF formalisation – for annotation

73

A General CRF Model

Page 74: Intelligent Data Processing for the Internet of Things

0.6 miles

Max-lat

Min-lat

Min-long

Max-long

0.38 miles

37.7545166015625, -122.40966796875

37.7490234375, -122.40966796875

37.7545166015625, -122.420654296875

37.7490234375, -122.420654296875

437.74933, -122.4106711

Hierarchical spatial structure of geohash for representing locations with variable precision.Here the location string is 5H34

0 1 2 3 4 5 67 8 9 B C D EF G H I J K L

0 172 3 4

5 6 8 9

0 1 2 3 4

5 6 7

0 1 23 4 5

6 7 8

City Event Extraction – Geohashing

Page 75: Intelligent Data Processing for the Internet of Things

Implmentation

− City Event Annotation − Automated creation of training data − Annotation task (our CRF model vs. baseline CRF model)

− City Event Extraction− Use aggregation algorithm for event extraction− Extracted events AND ground truth

− Dataset (Aug – Nov 2013) ~ 8 GB of data on disk− Over 8 million tweets− Over 162 million sensor data points− 311 active events and 170 scheduled events

Source: Pramod Anantharam, Kno.e.sis Centre, Wright State University.

Page 76: Intelligent Data Processing for the Internet of Things

Extracted Events and Ground Truth

Page 77: Intelligent Data Processing for the Internet of Things

Automated creation of training data

77

Evaluation over 500 randomly chosen tweets from around 8,000 annotated tweets

Page 78: Intelligent Data Processing for the Internet of Things

Conclusions

− The ultimate goal is to transform raw data into insights and actionable information and/or create effective representations for machines and also human users.− This requires data from multiple sources, (near-) real

time analytics and visualisation and/or semantic representations.

− We need solutions such as streaming processing, pattern recognition, multi-view and deep learning.

− Applications in areas such as intelligent urban development and combining physical-cyber-social data for developing smart city services, and healthcare services (e.g. elderly-care, remote patient monitoring).

78

Page 79: Intelligent Data Processing for the Internet of Things

Acknowledgements

− Some parts of the content are adapted from:− Holger Karl, Andreas Willig, Protocols and Architectures for

Wireless Sensor Networks, Protocols and Architectures for Wireless Sensor Networks, chapters 3 and 12, Wiley, 2005 .

− Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. 2003. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery (DMKD '03). ACM, New York, NY, USA, 2-11.

− JMOTIF Time series mining, http://code.google.com/p/jmotif/wiki/SAX

Page 80: Intelligent Data Processing for the Internet of Things

Q&A

− Thank you.

http://personal.ee.surrey.ac.uk/Personal/P.Barnaghi/

@pbarnaghi

[email protected]