Data Stream Processing and Analytics
![Page 1: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/1.jpg)
Data Stream Processing and Analytics
Summer schools – CEA – EDF – INRIA, 06-20-2014
Alexis Bondu, EDF R&D
Thanks to Vincent Lemaire, OrangeLab
![Page 2: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/2.jpg)
Outline
• Introduction on data-streams
• Part 1 : Querying
• Part 2 : Unsupervised Learning
• Part 3 : Supervised Learning
• Tutorial : Let’s try a CEP !
2
![Page 3: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/3.jpg)
![Page 4: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/4.jpg)
VLDB vs. CEP
• Very Large Database (~ more than 10 TB) :
– Static data
– Storage : distributed on several computers
– Query & Analysis : distributed and parallel processing
• Complex Event Processing (~ more than 1000 operations / sec) :
– Data in motion
– Storage : none (only a buffer in memory)
– Query & Analysis : processing on the fly (and parallel)
Two complementary approaches
4
![Page 5: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/5.jpg)
Application Areas
• Finance : High frequency trading
– Find correlations between the prices of stocks within the historical data;
– Evaluate the stationarity of these correlations over time;
– Give more weight to recent data.
• Banking : Detection of credit card fraud
– Automatically monitor a large number of transactions;
– Detect patterns of events that indicate a likelihood of fraud;
– Stop the processing and send an alert for human adjudication.
• Medicine : Health monitoring
– Perform automatic medical analysis to reduce the workload on nurses;
– Analyze measurements from devices to detect early signs of disease;
– Help doctors to make a diagnosis in real time.
• Smart Cities & Smart grid :
– Optimization of public transport;
– Management of the local production of electricity;
– Flattening of the peaks of consumption.
5
![Page 6: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/6.jpg)
An example of data stream
Input data stream
A tuple : (1,1);(1,2);(2,2);(1,3)
All tuples can be coded by 4 pairs of integers
Online processing : rotate and combine tuples in a compact way
6
![Page 7: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/7.jpg)
Specific constraints of stream-processing
What is a tuple ?
• A small piece of information in motion
• Composed of several variables
• All tuples share the same structure (i.e. the variables)
What is a data stream ?
• A data stream continuously emits tuples
• The order of tuples is not controlled
• The emission rate of tuples is not controlled
• Stream processing is an on-line process
In the end, the quality of the processing is the adjusting variable
7
![Page 8: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/8.jpg)
How to manage the time?
• A timestamp is associated with each tuple :
– Explicit timestamp : defined as a variable within the structure of the data stream
– Implicit timestamp : assigned by the system when tuples are processed
• Two ways of representing the time :
– Logical time : only the order of processed tuples is considered
– Physical time : characterizes the time when the tuple was emitted
• Buffer issues :
– The tuples are not necessarily received in order
– How long should the system wait for a missing tuple ?
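The buffer issue can be illustrated with a minimal sketch of a reordering buffer. It assumes explicit numeric timestamps; the function name `reorder` and the parameter `max_delay` (how long a late tuple is waited for) are hypothetical and not taken from any CEP product.

```python
import heapq


def reorder(tuples, max_delay):
    """Yield (timestamp, value) pairs in timestamp order.

    A tuple is released once a tuple at least max_delay newer has been seen.
    A tuple older than the last released one arrives too late and is dropped:
    the quality of the result is the adjusting variable.
    """
    heap = []
    newest = float("-inf")
    last_released = float("-inf")
    for ts, value in tuples:
        if ts < last_released:
            continue  # too late: the output has already moved past it
        newest = max(newest, ts)
        heapq.heappush(heap, (ts, value))
        while heap and heap[0][0] <= newest - max_delay:
            last_released, item = heapq.heappop(heap)
            yield last_released, item
    while heap:  # end of stream: flush what is still buffered
        yield heapq.heappop(heap)
```

A larger `max_delay` loses fewer late tuples but increases latency and memory use.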
8
![Page 9: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/9.jpg)
Complex Events Processing (CEP)
• An operator implements a query or a more complex analysis
• An operator processes data in motion with a low latency
• Several operators run at the same time, parallelized on several CPUs and/or computers
• The graph of operators is defined before the processing of data-streams
• Connectors make it possible to interact with external data streams, static data in a DBMS, and visualization tools
[Figure: graph of operators between the input data stream and the output data stream; connector labels: RSS, Stocks, XML, Visualization, Database]
9
![Page 10: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/10.jpg)
Complex Events Processing (CEP)
Main features:
• High frequency processing
• Parallel computing
• Fault-tolerant
• Robust to imperfect and asynchronous data
• Extensible (implementation of new operators)
Notable products:
• StreamBase (Tibco)
• InfoSphere Streams (IBM)
• STORM (Open source – Twitter)
• KINESIS (API - Amazon)
• SQLstream
• Apama
10
![Page 11: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/11.jpg)
Outline
• Introduction on data-streams
• Part 1 : Querying
• Part 2 : Unsupervised Learning
• Part 3 : Supervised Learning
• Tutorial : Let’s try a CEP !
11
![Page 12: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/12.jpg)
Time-window
• A query is performed on a finite part of the past tuples
[Figure: time axis, Past / Current time / Future]
Define a time-window:
• Fixed window : “June 2000”
• Sliding window : “last week”
• Landmark window : “since 1 January 2000”
Update interval:
• A result is produced at each update
• The type of window depends on the update interval
[Figure: three panels, each showing successive windows t1, t2, t3: Sliding window, Jumping window, Tumbling window]
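The link between window size and update interval can be sketched in a few lines. This is only an illustration: the function `windows` and its parameters `size` and `advance` are hypothetical names, and the vocabulary assumed here (tumbling when the update interval equals the window size, overlapping windows when it is smaller) varies between products.

```python
def windows(timestamps, size, advance):
    """Return (start, end, members) for each window [start, start + size).

    A new window starts every `advance` time units.
    """
    if not timestamps:
        return []
    result = []
    start, last = min(timestamps), max(timestamps)
    while start <= last:
        end = start + size
        members = [t for t in timestamps if start <= t < end]
        result.append((start, end, members))
        start += advance
    return result


ticks = [0, 1, 2, 3, 4, 5]
tumbling = windows(ticks, size=2, advance=2)     # no overlap between windows
overlapping = windows(ticks, size=3, advance=1)  # each tuple falls in several windows
```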
12
![Page 13: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/13.jpg)
SQL-like language
• Most CEP engines provide a SQL-like language
• Few CEP engines provide a user-friendly interface
• Each software publisher proposes its own language (not standardized)
• Main features :
– Define the structure of the connection of the data streams
– Define time-windows on data streams
– Extend the SQL language (able to run SQL queries on relational databases)
– Run queries on data streams within time-windows
• Additional functions :
– Statistics (min, max, mean, standard deviation, etc.)
– Math (trigonometry, logarithm, exponential, etc.)
– String (regular expression, trim, substring, etc.)
– Date (getDayType, getSecond, now, etc.)
13
![Page 14: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/14.jpg)
SQL-like language
A simple example with StreamBase (geek zone) :

    CREATE INPUT STREAM InputStream (
        Compteur string(12),
        Type string(12),
        Souscription int,
        C_index int,
        Date timestamp
    );

    CREATE WINDOW OneMinuteWindow (SIZE 60 ADVANCE 60 ON Date);
14
![Page 15: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/15.jpg)
SQL-like language
A simple example with StreamBase (geek zone) :

    CREATE OUTPUT STREAM OutputStream;

    SELECT firstval(Compteur) AS Compteur,
           lastval(C_index) - firstval(C_index) AS Conso,
           openval(Date) AS StartTime,
           closeval(Date) AS EndTime
    FROM InputStream[OneMinuteWindow]
    INTO OutputStream;
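To make the semantics of this query concrete, here is a rough Python equivalent, given only as a sketch. It assumes that readings arrive sorted by date for a single meter, that dates are expressed in seconds, and that `StartTime` and `EndTime` are the bounds of each one-minute window; the function name `consumption_per_minute` is hypothetical.

```python
def consumption_per_minute(readings):
    """readings: iterable of (compteur, c_index, date_in_seconds), sorted by date.

    Groups readings into one-minute tumbling windows and returns, per window,
    the difference between the last and the first index value.
    """
    grouped = {}
    for compteur, c_index, date in readings:
        grouped.setdefault(date // 60, []).append((compteur, c_index))
    results = []
    for minute, rows in grouped.items():
        results.append({
            "Compteur": rows[0][0],
            "Conso": rows[-1][1] - rows[0][1],
            "StartTime": minute * 60,
            "EndTime": (minute + 1) * 60,
        })
    return results


sample = [("M1", 100, 0), ("M1", 105, 30), ("M1", 110, 59),
          ("M1", 112, 60), ("M1", 120, 119)]
per_minute = consumption_per_minute(sample)
```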
15
![Page 16: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/16.jpg)
SQL-like language
A simple example with StreamBase : user-friendly interface
16
![Page 17: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/17.jpg)
Outline
• Introduction on data-streams
• Part 1 : Querying
• Part 2 : Unsupervised Learning
• Part 3 : Supervised Learning
• Tutorial : Let’s try a CEP !
17
![Page 18: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/18.jpg)
Online vs. Batch mode
What is unsupervised learning ?
• Mining data in order to find new knowledge
• No idea about the expected result
Batch mode :
• An entire dataset is available
• The examples can be processed several times
• Weak constraint on the computing time
• The distribution of data does not change over time
Online processing :
• Tuples are emitted one by one
• Tuples are processed on the fly due to their high rate (one-pass algorithms)
• Real-time computing (low latency)
• The distribution of tuples changes over time (drift)
18
![Page 19: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/19.jpg)
Summarizing data streams
Why do we need to summarize data streams ?
• The number of tuples is infinite …
• Their emission rate is potentially very high …
• The hardware resources are limited (CPU, RAM & I/O)
What is a summary ?
• A compact representation of the past tuples
• With a controlled memory space, accuracy and latency
• Which makes it possible to query (or analyze) the history of the stream, in an approximate way
The objective is to maximize the accuracy of the queries, given technical constraints (stream rate, CPU, RAM & I/O)
19
![Page 20: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/20.jpg)
Two types of summary
Specific summaries : dedicated to a single query (or few)
• Flajolet-Martin Sketch : approximates the number of unique objects in a stream;
• Bloom Filter : efficiently tests if an element is a member of a set;
• Count-Sketch : efficiently finds the k most frequent elements of a set;
• Count-Min Sketch : estimates the number of elements with a particular value, or within an interval of values.
Generic summaries : allow a large range of queries on any past period
• StreamSamp : based on successive windowing and sampling;
• CluStream : based on micro-clustering;
• DenStream : based on evolving micro-clustering;
Detailed in this talk
20
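As an illustration of a specific summary, here is a minimal Bloom filter sketch. The class name, the default sizes (`n_bits`, `n_hashes`) and the use of SHA-1 with a seed prefix to simulate several hash functions are choices made for this example only. A Bloom filter never misses an element that was added, but it may wrongly report an element that was not (false positive).

```python
import hashlib


class BloomFilter:
    """Approximate set membership in a fixed number of bits."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0  # the bit array is stored in a Python integer

    def _positions(self, item):
        # One bit position per simulated hash function.
        for seed in range(self.n_hashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits |= 1 << position

    def __contains__(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))


seen = BloomFilter()
for meter in ["M1", "M2", "M3"]:
    seen.add(meter)
```

The memory footprint is fixed by `n_bits`, whatever the number of tuples emitted by the stream; only the false positive rate grows as more elements are added.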
![Page 21: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/21.jpg)
Flajolet-Martin Sketch [1]
Problem statement:
• S is a collection of N elements : S = {s1, s2, …, sN}
• Two elements of S may be identical
• S includes only F distinct elements
• The objective is to efficiently estimate F in terms of:
– Time complexity (one-pass algorithm)
– Space complexity
– Probabilistic guarantee
21
![Page 22: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/22.jpg)
Flajolet-Martin Sketch [1]
Hash function : h(.)
• Associates an element si with a random binary value
• h(.) is a deterministic function
• w is the length of binary values (number of bits)
• w is an integer such that [formula missing]
• Random values are uniformly drawn within $[0, 2^w - 1]$
Intuition :
Given a large set of random binary values,
• 1/2 of them begin with “1”
• 1/4 of them begin with “11”
• 1/8 of them begin with “111”
• $1/2^k$ of them begin with k “1”
Example binary values : 10010110, 00100011, 10001010, 01011011, 00011001, 01010011
22
![Page 23: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/23.jpg)
Flajolet-Martin Sketch [1]
Location of the first “1” within h(.)
t(.) is the function which keeps only the first “1” (counting from left); the other bits are set to “0”
Example : h(si) = 01001111011010, t(h(si)) = 01000000000000
Fusion of binary words :
B is the fusion of all the binary words t(h(si)), obtained with the bitwise OR operator
t(h(si)) values : 10000000, 00100000, 10000000, 01000000, 00010000, 01000000
B = 11110000
R is the rank of the first “0” (counting from left) within B. It is a random variable related to F !
R = 5
23
![Page 24: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/24.jpg)
A single-pass algorithm (FM Sketch) :
h(a) = 01001111011010 t(h(a)) = 01000000000000
B = 00000000000000
Input data stream
R = 0
Flajolet-Martin Sketch [1]
24
![Page 25: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/25.jpg)
A single-pass algorithm (FM Sketch) :
h(a) = 01001111011010 t(h(a)) = 01000000000000
B = 01000000000000
Input data stream
h(b) = 10001010011011 t(h(b)) = 10000000000000
R = 2
Flajolet-Martin Sketch [1]
25
![Page 26: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/26.jpg)
A single-pass algorithm (FM Sketch) :
h(a) = 01001111011010 t(h(a)) = 01000000000000
B = 11000000000000
Input data stream
h(b) = 10001010011011 t(h(b)) = 10000000000000
R = 2
h(a) = 01001111011010 t(h(a)) = 01000000000000
Flajolet-Martin Sketch [1]
26
![Page 27: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/27.jpg)
A single-pass algorithm (FM Sketch) :
h(a) = 01001111011010 t(h(a)) = 01000000000000
B = 11000000000000
Input data stream
h(b) = 10001010011011 t(h(b)) = 10000000000000
R = 2
h(a) = 01001111011010 t(h(a)) = 01000000000000
h(c) = 00010110010110 t(h(c)) = 00010000000000
Flajolet-Martin Sketch [1]
27
![Page 28: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/28.jpg)
A single-pass algorithm (FM Sketch) :
h(a) = 01001111011010 t(h(a)) = 01000000000000
B = 11010000000000
Input data stream
h(b) = 10001010011011 t(h(b)) = 10000000000000
R = 4
h(a) = 01001111011010 t(h(a)) = 01000000000000
h(c) = 00010110010110 t(h(c)) = 00010000000000
h(b) = 10001010011011 t(h(b)) = 10000000000000
Flajolet-Martin Sketch [1]
28
![Page 29: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/29.jpg)
A single-pass algorithm (FM Sketch) :
h(a) = 01001111011010 t(h(a)) = 01000000000000
B = 11010000000000
Input data stream
h(b) = 10001010011011 t(h(b)) = 10000000000000
R = 4
h(a) = 01001111011010 t(h(a)) = 01000000000000
h(c) = 00010110010110 t(h(c)) = 00010000000000
h(b) = 10001010011011 t(h(b)) = 10000000000000
Flajolet-Martin Sketch [1]
• This single-pass algorithm is suited to data streams
• Little information needs to be stored in RAM
• R is a random variable such that : $E(R) \approx \log_2(\varphi F)$, with $\varphi \approx 0.77351$ [1]
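The single-pass algorithm walked through above can be sketched as follows. Several points are assumptions of this example: SHA-1 is used as the deterministic hash h(.), the word length is set to w = 32, the rank R is counted from 0 (so B = 11110000 gives R = 4), and the estimate uses the relation E(R) ≈ log2(φF) with φ ≈ 0.77351 from [1]. With a rank counted from 1, the exponent would be R - 1.

```python
import hashlib

W = 32          # number of bits kept from the hash (assumed value)
PHI = 0.77351   # correction constant given by Flajolet and Martin [1]


def h(element, w=W):
    """Deterministic pseudo-random w-bit value associated with an element."""
    digest = hashlib.sha1(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - w)


def t(value):
    """Keep only the first '1' counting from the left; 0 stays 0."""
    if value == 0:
        return 0
    return 1 << (value.bit_length() - 1)


def rank_of_first_zero(b, w=W):
    """0-based position, counting from the left, of the first '0' bit of b."""
    for position in range(w):
        if not (b >> (w - 1 - position)) & 1:
            return position
    return w


def fm_estimate(stream, w=W):
    """One pass over the stream; only the w-bit word B is kept in memory."""
    b = 0
    for element in stream:
        b |= t(h(element, w))  # fusion with the OR operator
    return 2 ** rank_of_first_zero(b, w) / PHI
```

Because B is built with OR, receiving the same element again never changes the sketch. A single sketch gives a very rough estimate (a power of two divided by φ), which is why several sketches are combined in practice.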
29
![Page 30: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/30.jpg)
Flajolet-Martin Sketch [1]
How to estimate E(R) ?
[Figure: the input data stream is dispatched to a collection of m FM Sketches by a deterministic routing of h(si); the m-th sketch holds Bm = 11010000000000 and Rm = 4]
• The first bits of h(si) are used to assign each element to a sketch : if m = 16, the first 4 bits of h(si) give the ID of the corresponding sketch
  h(s1) = 0011001000101 -> 0011 = 3 -> s1 is assigned to sketch number 3
• Each sketch counts approximately F/m distinct elements
30
![Page 31: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/31.jpg)
Flajolet-Martin Sketch [1]
How to estimate E(R) ?
[Figure: the input data stream is dispatched to a collection of m FM Sketches by a deterministic routing of h(si); the m-th sketch holds Bm = 11010000000000 and Rm = 4]
Stochastic average : E(R) is estimated by averaging the ranks R1, …, Rm of the m sketches
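The routing and averaging steps can be sketched as below. The assumptions are the same as for a single sketch (SHA-1 as hash function, ranks counted from 0, φ ≈ 0.77351 from [1]); in addition, m is assumed to be a power of two and the final estimate (m/φ)·2^mean(R) follows the stochastic averaging scheme of [1]. The names `hash64` and `estimate_distinct` are hypothetical.

```python
import hashlib

PHI = 0.77351  # correction constant given by Flajolet and Martin [1]


def hash64(element):
    """Deterministic pseudo-random 64-bit value associated with an element."""
    digest = hashlib.sha1(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def first_zero_rank(bitmap, w):
    """0-based position, counting from the left, of the first '0' bit."""
    for position in range(w):
        if not (bitmap >> (w - 1 - position)) & 1:
            return position
    return w


def estimate_distinct(stream, m=64, w=32):
    """Estimate the number of distinct elements with m FM sketches."""
    id_bits = m.bit_length() - 1  # m is assumed to be a power of two
    bitmaps = [0] * m
    for element in stream:
        value = hash64(element)
        sketch_id = value >> (64 - id_bits)                    # first bits: routing
        rest = (value >> (64 - id_bits - w)) & ((1 << w) - 1)  # next w bits
        if rest:
            bitmaps[sketch_id] |= 1 << (rest.bit_length() - 1)
    mean_rank = sum(first_zero_rank(b, w) for b in bitmaps) / m
    return (m / PHI) * 2 ** mean_rank
```

Each sketch sees roughly F/m distinct elements, so the average rank estimates log2(φF/m), which explains the factor m in the final estimate. The estimate is only reliable when F is large compared with m.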
31
![Page 32: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/32.jpg)
![Page 33: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/33.jpg)
Sampling based summaries
Objectives of a generic summary :
• Summarizes the entire history of the data stream
• Requires a bounded memory space
• Allows a large range of queries, including supervised and unsupervised analysis
Summarize by sampling the tuples :
• Sampling techniques are well suited to incremental processing
• A limited number of tuples are stored
• The stored tuples constitute a representative sample
• The recent past can be favored in terms of accuracy (i.e. sampling rate)
33
![Page 34: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/34.jpg)
Sampling based summaries: Reservoir sampling [2]
• The reservoir is a uniform sample of the tuples emitted so far;
• The sampling rate decreases over time;
• The probability that a tuple is included in the reservoir is : k / Nb_Emitted_Tuples
[Figure: at time t the reservoir holds k tuples (Tuple 1 … Tuple k); at time t + 1 the i-th tuple is inserted with probability k/i (uniform sampling) and replaces a tuple selected uniformly in the reservoir]
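Reservoir sampling can be written in a few lines. The sketch below follows the basic scheme described above (keep the first k tuples, then insert the i-th tuple with probability k/i in place of a uniformly chosen one); the function name and the `seed` parameter are added for this example.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return a uniform sample of at most k tuples, in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, tup in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(tup)              # the reservoir is not full yet
        elif rng.random() < k / i:             # insert with probability k/i
            reservoir[rng.randrange(k)] = tup  # uniform choice of the tuple to delete
    return reservoir
```

The memory used never exceeds k tuples, whatever the length of the stream.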
34
![Page 35: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/35.jpg)
Sampling based summaries
StreamSamp [3]
[Figure: uniform sampling of the input data stream; columns Order 0, Order 1, Order 2; Sample 1 = Tuples 1, 2, 3, 4]
35
![Page 36: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/36.jpg)
Sampling based summaries
StreamSamp [3]
[Figure: uniform sampling of the input data stream; columns Order 0, Order 1, Order 2; Sample 2 = Tuples 5, 6, 7, 8; Sample 1 = Tuples 1, 2, 3, 4]
36
![Page 37: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/37.jpg)
Sampling based summaries
StreamSamp [3]
[Figure: uniform sampling of the input data stream; columns Order 0, Order 1, Order 2; Sample 3 = Tuples 9, 10, 11, 12; fusion of Sample 2 (Tuples 5, 6, 7, 8) and Sample 1 (Tuples 1, 2, 3, 4) into Sample 1-2 = Tuples 2, 3, 5, 8]
37
![Page 38: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/38.jpg)
Sampling based summaries
StreamSamp [3]
[Figure: uniform sampling of the input data stream; columns Order 0, Order 1, Order 2; Sample 4 = Tuples 13, 14, 15, 16; Sample 3 = Tuples 9, 10, 11, 12; Sample 1-2 = Tuples 2, 3, 5, 8]
38
![Page 39: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/39.jpg)
Sampling based summaries
StreamSamp [3]
[Figure: uniform sampling of the input data stream; columns Order 0, Order 1, Order 2; Sample 5 = Tuples 17, 18, 19, 20; Sample 4 = Tuples 13, 14, 15, 16; Sample 3 = Tuples 9, 10, 11, 12; Sample 1-2 = Tuples 2, 3, 5, 8]
39
![Page 40: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/40.jpg)
Sampling based summaries
StreamSamp [3]
[Figure: uniform sampling of the input data stream; columns Order 0, Order 1, Order 2; Sample 5 = Tuples 17, 18, 19, 20; fusion of Sample 4 (Tuples 13, 14, 15, 16) and Sample 3 (Tuples 9, 10, 11, 12) into Sample 3-4 = Tuples 9, 11, 14, 16; Sample 1-2 = Tuples 2, 3, 5, 8]
40
![Page 41: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/41.jpg)
Sampling based summaries
StreamSamp [3]
[Figure: uniform sampling of the input data stream feeds Order 0; samples move from Order 0 to Order 1 and from Order 1 to Order 2 by successive fusions]
41
![Page 42: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/42.jpg)
Sampling based summaries
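StreamSamp [3]
• A sample gathers k uniformly drawn tuples
• A collection of samples gathers h samples
• Each collection has an order o
• The sampling rate of the samples of order o is equal to $\alpha / 2^o$, where $\alpha$ is the sampling rate applied to the input stream
[Figure: time axis from (Past) to (Present), with the collections of Order 0, Order 1 and Order 2]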
42
![Page 43: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/43.jpg)
Sampling based summaries
StreamSamp [3]
How to exploit this summary offline ?
• Fusion of all samples
• Weighting of tuples to keep their representativeness
• Use of any data mining approach able to process a weighted training set
[Figure: time axis with the collections of Order 0, Order 1 and Order 2]
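The fusion mechanism and the offline weighting can be sketched as below. This is a simplified illustration rather than the implementation of [3]: the function names are hypothetical, each order keeps at most h samples, the two oldest samples of a full order are fused by drawing k tuples uniformly from their union, and a tuple stored at order o is assumed to receive the weight 2^o / rate.

```python
import random


def stream_samp(stream, k=4, h=2, rate=1.0, seed=0):
    """Return orders, where orders[o] is the list of samples of order o."""
    rng = random.Random(seed)
    orders = [[]]
    current = []
    for tup in stream:
        if rng.random() >= rate:
            continue                      # tuple not sampled
        current.append(tup)
        if len(current) == k:             # the sample is full: store it at order 0
            _insert(orders, current, 0, k, h, rng)
            current = []
    return orders


def _insert(orders, sample, order, k, h, rng):
    while True:
        if order == len(orders):
            orders.append([])
        orders[order].append(sample)
        if len(orders[order]) <= h:
            return
        # Too many samples at this order: fuse the two oldest ones.
        first, second = orders[order].pop(0), orders[order].pop(0)
        sample = rng.sample(first + second, k)
        order += 1


def weighted_tuples(orders, rate=1.0):
    """Fusion of all samples, each tuple weighted by what it represents."""
    return [(tup, 2 ** o / rate)
            for o, samples in enumerate(orders)
            for sample in samples
            for tup in sample]


orders = stream_samp(range(1, 21), k=4, h=2)
```

With k = 4 and h = 2, the 20 tuples of the illustration end up as one sample at order 0 and two fused samples at order 1. The weights sum to the number of emitted tuples, so recent tuples are represented more finely than old ones.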
43
![Page 44: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/44.jpg)
![Page 45: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/45.jpg)
Micro-clustering based summaries
What is micro-clustering ?
• Micro-clusters (MC) are small groups of tuples,
• MC are represented by features which locally describe the density of tuples,
• Micro-clustering approaches handle evolving data,
• MC are maintained in RAM within a bounded memory space,
• MC summarize the density of the input data stream, while giving more importance to the recent past.
45
![Page 46: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/46.jpg)
Micro-clustering based summaries: DenStream [4]
• Density based micro-clustering
• Weighting of the tuples over time
46
![Page 47: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/47.jpg)
Micro-clustering based summaries: DenStream [4]
Initialization of the micro-clusters with DBSCAN
Each micro-cluster is denoted MC(cj, rj, wj)
Parameters :
- Minimum weight of a micro-cluster
- Maximum variance of a micro-cluster
- Fading factor
[Figure: scatter plot, axes Var 1 and Var 2]
47
![Page 48: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/48.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] Received tuple
48
![Page 49: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/49.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] Fading of the micro-clusters
49
![Page 50: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/50.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] Closest micro-cluster
50
![Page 51: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/51.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] Test on the new variance
51
![Page 52: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/52.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] Here, the new variance is greater than the maximum variance
52
![Page 53: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/53.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] An “outlier” micro-cluster is created
53
![Page 54: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/54.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] Received tuple
54
![Page 55: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/55.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] Fading of the micro-clusters
55
![Page 56: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/56.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] Closest micro-cluster
56
![Page 57: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/57.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] Test on the new variance
57
![Page 58: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/58.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] Here, the new variance is less than the maximum variance
58
![Page 59: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/59.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2]
1) The tuple is assigned to the micro-cluster
2) The weight, the variance and the mean are updated
59
![Page 60: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/60.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] Received tuple
60
![Page 61: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/61.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] Fading of the micro-clusters
61
![Page 62: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/62.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] Closest micro-cluster
62
![Page 63: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/63.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] Test on the new variance, which is too large
63
![Page 64: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/64.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] Closest “outlier” micro-cluster
64
![Page 65: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/65.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] Test on the new variance
65
![Page 66: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/66.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] In this case, the tuple is assigned to the “outlier” micro-cluster
66
![Page 67: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/67.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] The weight of the “outlier” micro-cluster is greater than the minimum weight
67
![Page 68: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/68.jpg)
Micro-clustering based summaries: DenStream [4]
[Figure: axes Var 1 and Var 2] The “outlier” micro-cluster becomes a regular micro-cluster
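The update steps walked through above (fading, search for the closest micro-cluster, test on the new variance, creation and promotion of “outlier” micro-clusters) can be sketched as follows. This is a simplified illustration and not the algorithm of [4]: the DBSCAN initialization and the pruning of faded micro-clusters are omitted, fading is applied once per received tuple, and all names (`MicroCluster`, `update`, `fading`, etc.) are hypothetical.

```python
import math


class MicroCluster:
    """Weighted statistics of a small group of tuples."""

    def __init__(self, point):
        self.w = 1.0                          # weight
        self.ls = list(point)                 # weighted linear sum
        self.ss = [x * x for x in point]      # weighted squared sum

    def fade(self, factor):
        self.w *= factor
        self.ls = [v * factor for v in self.ls]
        self.ss = [v * factor for v in self.ss]

    def center(self):
        return [v / self.w for v in self.ls]

    def variance_with(self, point):
        """Variance (summed over variables) if the point were absorbed."""
        w = self.w + 1.0
        total = 0.0
        for ls, ss, x in zip(self.ls, self.ss, point):
            total += max((ss + x * x) / w - ((ls + x) / w) ** 2, 0.0)
        return total

    def absorb(self, point):
        self.w += 1.0
        self.ls = [a + x for a, x in zip(self.ls, point)]
        self.ss = [a + x * x for a, x in zip(self.ss, point)]


def _try_absorb(clusters, point, max_variance):
    if not clusters:
        return None
    closest = min(clusters, key=lambda mc: math.dist(mc.center(), point))
    if closest.variance_with(point) <= max_variance:
        closest.absorb(point)
        return closest
    return None


def update(regular, outliers, point, min_weight, max_variance, fading=0.99):
    for mc in regular + outliers:
        mc.fade(fading)                              # fading of the micro-clusters
    if _try_absorb(regular, point, max_variance) is not None:
        return
    mc = _try_absorb(outliers, point, max_variance)
    if mc is None:
        outliers.append(MicroCluster(point))         # new "outlier" micro-cluster
    elif mc.w > min_weight:
        outliers.remove(mc)                          # promotion to regular
        regular.append(mc)
```

Only the weight and two sums per variable are stored for each micro-cluster, which keeps the memory bounded as long as the number of micro-clusters is bounded.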
68
![Page 69: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/69.jpg)
Micro-clustering based summaries
DenStream [4]
How to exploit this summary to estimate the density of the data stream ?
… the example of the Parzen window estimator [5] …
69
![Page 70: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/70.jpg)
Micro-clustering based summaries
DenStream [4]
Adapted Parzen window [5] :
• W : total weight of the data stream
• C : number of micro-clusters
• wj : weight of the j-th micro-cluster
• rj : standard deviation of the j-th micro-cluster
• [symbol missing] : smoothing parameter
Law of total variance
Hypothesis : each tuple represents a set of unobserved tuples, with a fixed size and a standard deviation equal to [value missing]
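The formula of the adapted estimator is not reproduced in this transcript, so the sketch below is only one plausible reading of the notations listed above, in one dimension. It assumes a Gaussian kernel centered on each micro-cluster center, weighted by wj / W, whose variance is the sum of the squared smoothing parameter and of rj squared (in the spirit of the law of total variance). The function name and the symbol `h` for the smoothing parameter are assumptions.

```python
import math


def adapted_parzen(x, micro_clusters, h):
    """Density estimate at x from micro-clusters given as (c_j, r_j, w_j).

    Assumed form: sum_j (w_j / W) * N(x; c_j, h^2 + r_j^2), with W = sum_j w_j.
    """
    total_weight = sum(w for _, _, w in micro_clusters)
    density = 0.0
    for c, r, w in micro_clusters:
        variance = h * h + r * r
        kernel = math.exp(-(x - c) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)
        density += (w / total_weight) * kernel
    return density


summary = [(0.0, 1.0, 3.0), (5.0, 0.5, 1.0)]  # hypothetical micro-clusters
value = adapted_parzen(0.0, summary, h=0.5)
```

Under this assumption the estimate is a proper density: it is non-negative and integrates to 1, whatever the weights of the micro-clusters.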
70
![Page 71: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/71.jpg)
Conclusion
Main ideas to retain :
• Summaries make it possible to process data streams with a very high emission rate,
• while using limited hardware resources (CPU, RAM).
• In most cases, a trade-off must be reached between the accuracy and the available memory.
• There are two types of summary (specific and generic).
• Limitation : most generic summaries involve user parameters.
71
![Page 72: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/72.jpg)
References
[1] P. Flajolet and G. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209, 1985.
[2] J. S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37–57, 1985.
[3] B. Csernel, F. Clerot, and G. Hebrail. Streamsamp : Datastream clustering over tilted windows through sampling. In ECML PKDD 2006 Workshop on Knowledge Discovery from Data Streams, 2006.
[4] F. Cao, M. Ester, W. Qian, and A. Zhou. Density-based clustering over an evolving data stream with noise. In SIAM Conference on Data Mining, pp. 328–339, 2006.
[5] A. Bondu, B. Grossin, M.L. Picard. Density estimation on data streams : an application to Change Detection. In EGC (Extraction et Gestion de la connaissance), 2010.
Related documents :
Thesis of Gasbi, N. Extension et interrogation des resumes de flux de donnees, 2011
72
![Page 73: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/73.jpg)
Outline
• Introduction on data-streams
• Part 1 : Querying
• Part 2 : Unsupervised Learning
• Part 3 : Supervised Learning
• Tutorial : Let’s try a CEP !
73
![Page 74: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/74.jpg)
From Batch mode to Online Learning
What is supervised learning ?
• Output : prediction of a target variable for new observations
• Data : a supervised model is learned from labeled examples
• Objective : learn regularities from the training set and generalize them (with parsimony)
Several types of supervised models :
• Categorical target variable -> Classifier
• Numeric target variable -> Regression
• Time series -> Forecasting
In this talk : classifiers
74
![Page 75: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/75.jpg)
From Batch mode to Online Learning
Training set :

| Var 1 | Var 2 | … | Class |
|-------|-------|---|-------|
| O     | 12    | … | A     |
| Y     | 98    | … | B     |
| Y     | 4     | … | A     |

Classifier : Var 1, Var 2, …, Var k -> Class (A / B)
A learning algorithm exploits the training set to automatically adjust the classifier
75
![Page 76: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/76.jpg)
From Batch mode to Online Learning
Batch mode learning :
• An entire dataset is available
• The examples can be processed several times
• Weak constraint on the computing time
• The distribution of data does not change
Any time learning algorithm :
• Can be interrupted before its end
• Returns a valid classifier at any time
• Is expected to find better and better classifiers
• Relevant for time-critical applications
76
![Page 77: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/77.jpg)
From Batch mode to Online Learning
Incremental learning algorithm :
• Only a single pass on the training examples is required.
• The classifier is updated at each example.
• Avoids the exhaustive storage of the examples in the RAM.
• Relevant for time-critical applications and for progressively recorded data.
Online learning algorithm :
• The training set is replaced by an input data stream
• The classifier is continually updated over time,
• By exploiting the current tuple,
• With a very low latency.
• The distribution of data can change over time (concept drift)
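To make the single-pass, per-tuple update concrete, here is a minimal sketch in Python. It uses a perceptron purely as an illustration; the slides do not prescribe this model, and the class name `OnlinePerceptron`, the toy stream and its labeling rule are all assumptions made for the example.

```python
class OnlinePerceptron:
    """Binary linear classifier updated one tuple at a time (labels are +1 / -1)."""

    def __init__(self, n_features, learning_rate=1.0):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        score = self.bias + sum(w * v for w, v in zip(self.weights, x))
        return 1 if score >= 0 else -1

    def update(self, x, y):
        # The model changes only when the current tuple is misclassified.
        if self.predict(x) != y:
            self.weights = [w + self.learning_rate * y * v
                            for w, v in zip(self.weights, x)]
            self.bias += self.learning_rate * y


def labeled_stream():
    # Hypothetical stream: the label is +1 when the first variable exceeds the second.
    tuples = [(2.0, 1.0), (0.5, 3.0), (4.0, 0.0), (1.0, 2.5), (3.0, 1.5), (0.0, 1.0)]
    for x in tuples * 20:
        yield x, (1 if x[0] > x[1] else -1)


model = OnlinePerceptron(n_features=2)
for x, y in labeled_stream():
    # Single pass: each tuple is used once, then discarded (nothing is stored).
    model.update(x, y)
```

The memory footprint is the weight vector only, whatever the length of the stream. Nothing here handles concept drift; that would need one of the strategies discussed later in this part.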
77
![Page 78: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/78.jpg)
Implementation of on-line classifiers
Input stream : explanatory variables -> Online classifier -> Output stream : predicted labels
78
![Page 79: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/79.jpg)
Implementation of on-line classifiers
Online classifier
• Second input stream : real labels
• Comparison of real and predicted labels
• Update of the classifier
79
![Page 80: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/80.jpg)
Implementation of on-line classifiers
Online classifier -> Update, Evaluation
[Plot : performance of the classifier over time]
80
![Page 81: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/81.jpg)
Implementation of on-line classifiers
Online classifier -> Update, Evaluation
[Plot : performance of the classifier over time]
In practice, the input stream of real labels may be delayed
An online classifier predicts the class label of tuples before receiving the true label …
81
![Page 82: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/82.jpg)
Implementation of on-line classifiers
Example : online advertising targeting
• Input tuples : couples “User – Ad”
• Online classifier
• Output tuples : P(click | User, Ad), the estimated probability that a User clicks on an Ad
82
![Page 83: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/83.jpg)
Implementation of on-line classifiers
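Example : online advertising targeting
• The online classifier estimates P(click | User, Ad) for each candidate Ad
• ArgMax(Ads) : the Ad with the highest estimated probability is selected
• Sending the Ad to the browser
• Waiting for a click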
83
![Page 84: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/84.jpg)
Implementation of online classifiers
Example : online advertising targeting
• The online classifier estimates P(click | User, Ad)
• Sending the Ad to the browser
• Waiting for a click ($ if clicked)
• After a fixed delay, the real label is sent back to update the classifier
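The advertising loop (estimate, ArgMax, send, wait, delayed update) can be sketched as follows. This is only an illustration under strong assumptions: the click probability is estimated from smoothed counts rather than by a real classifier, the delay is counted in tuples, and the names `ClickModel`, `choose_ad` and `DELAY`, as well as the simulated user behaviour, are hypothetical.

```python
from collections import defaultdict, deque


class ClickModel:
    """Estimates P(click | user, ad) from counts, with Laplace smoothing."""

    def __init__(self):
        self.shown = defaultdict(int)
        self.clicked = defaultdict(int)

    def probability(self, user, ad):
        return (self.clicked[(user, ad)] + 1) / (self.shown[(user, ad)] + 2)

    def update(self, user, ad, click):
        self.shown[(user, ad)] += 1
        self.clicked[(user, ad)] += int(click)


def choose_ad(model, user, ads):
    # ArgMax over the candidate ads of the estimated click probability.
    return max(ads, key=lambda ad: model.probability(user, ad))


DELAY = 3  # hypothetical fixed delay, counted in tuples
pending = deque()  # predictions still waiting for their real label
model = ClickModel()
ads = ["ad_a", "ad_b"]

for step in range(20):
    user = "user_1"
    ad = choose_ad(model, user, ads)
    # Simulated ground truth (assumption): this user only clicks on "ad_b".
    pending.append((step, user, ad, ad == "ad_b"))
    # A real label becomes available only once the delay has elapsed.
    while pending and pending[0][0] <= step - DELAY:
        _, past_user, past_ad, click = pending.popleft()
        model.update(past_user, past_ad, click)
```

During the first `DELAY` steps the model predicts without any feedback, which is exactly the situation described above. The selection is purely greedy; a production system would normally add some exploration, which is outside the scope of this sketch.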
84
![Page 85: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/85.jpg)
Evaluation of online classifiers
A – Holdout Evaluation
• The stream of labeled tuples is split : one part updates the online classifier, the other part is kept for the evaluation
• Evaluation on the recent past : sliding window over [t - k, t]
• Use of standard evaluation criteria (Accuracy, BER, Lift curve, AUC, etc.)
• Unbiased evaluation
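A minimal sketch of holdout evaluation on a sliding window, in Python. The splitting rule (one tuple out of `holdout_every` is held out), the toy `MajorityClassifier` and the synthetic stream are assumptions chosen to keep the example short; only the principle (held-out tuples never update the classifier, accuracy is computed on the k most recent ones) comes from the slide.

```python
from collections import Counter, deque


class MajorityClassifier:
    """Toy online classifier: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def update(self, x, y):
        self.counts[y] += 1


def holdout_accuracy(stream, classifier, holdout_every, k):
    window = deque(maxlen=k)  # outcomes of the k most recent held-out tuples
    for index, (x, y) in enumerate(stream):
        if index % holdout_every == 0:
            # Held-out tuple: used for evaluation only, never for the update.
            window.append(classifier.predict(x) == y)
        else:
            classifier.update(x, y)
    return sum(window) / len(window)


# Hypothetical stream: label "B" every fourth tuple, "A" otherwise.
stream = [(i, "B" if i % 4 == 0 else "A") for i in range(100)]
accuracy = holdout_accuracy(stream, MajorityClassifier(), holdout_every=5, k=10)
```

Because the held-out tuples are never seen by the learner, the estimate is unbiased, at the cost of discarding part of the labeled stream for training.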
85
![Page 86: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/86.jpg)
Evaluation of online classifiers
B – Prequential Evaluation
Each labeled tuple is used twice :
• 1 - Update of the online evaluation
• 2 - Update of the online classifier
$$S = \sum_{i=1}^{n} L(y_i, \hat{y}_i)$$
The loss can be accumulated :
• From the beginning of the stream
• On the recent past (buffer on a sliding window)
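The prequential ("test, then train") scheme can be sketched as below. The loss L is taken to be the 0/1 loss, and the function name `prequential`, the majority-class learner and the synthetic stream are assumptions made for the illustration. The function returns both variants of S: accumulated from the beginning of the stream, and restricted to a sliding window.

```python
from collections import Counter, deque


def prequential(stream, predict, update, window_size):
    total_loss = 0                             # S from the beginning of the stream
    recent_losses = deque(maxlen=window_size)  # S on the recent past
    for x, y in stream:
        loss = 0 if predict(x) == y else 1     # 1 - the tuple first tests the classifier
        total_loss += loss
        recent_losses.append(loss)
        update(x, y)                           # 2 - then it updates the classifier
    return total_loss, sum(recent_losses)


# Toy learner (assumption): predicts the most frequent label seen so far.
counts = Counter()


def predict(x):
    return counts.most_common(1)[0][0] if counts else None


def update(x, y):
    counts[y] += 1


# Hypothetical stream with an abrupt change of label after 10 tuples.
stream = [(i, "A") for i in range(10)] + [(i, "B") for i in range(10, 20)]
total, recent = prequential(stream, predict, update, window_size=5)
```

On this stream the windowed loss reacts much more strongly to the change than the cumulative one, which is why the sliding-window variant is usually preferred when drift is expected.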
86
![Page 87: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/87.jpg)
Evaluation of online classifiers
C – Kappa Statistic
• p0: prequential accuracy of the classifier
• pc: probability that a random classifier makes a correct prediction.
Κ = (p0 − pc)/(1 − pc)
• K = 1 if the classifier is always correct
• K = 0 if the predictions coincide with the correct ones as often as those of the random classifier
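A short Python sketch of the Kappa statistic. One point is an assumption: the "random classifier" is taken to be a chance classifier that predicts each class with the same frequency as the evaluated classifier, so that pc is the sum over classes of (true frequency × predicted frequency). The function name `kappa` is hypothetical.

```python
from collections import Counter


def kappa(y_true, y_pred):
    n = len(y_true)
    # p0: observed accuracy of the classifier.
    p0 = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # pc: accuracy expected from a chance classifier with the same
    # distribution of predictions (assumption, see the lead-in).
    true_freq = Counter(y_true)
    pred_freq = Counter(y_pred)
    pc = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    return (p0 - pc) / (1 - pc)


labels = ["A", "A", "B", "B"]
perfect = kappa(labels, ["A", "A", "B", "B"])
chance_level = kappa(labels, ["A", "B", "A", "B"])
```

The function is undefined when pc = 1 (a single class, always predicted), a case that a real implementation would have to guard against.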
87
![Page 88: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/88.jpg)
Concept drift
What does it mean ?
• The input stream is not stationary
• The distribution of data changes over time
• Two strategies : adaptive learning or drift detection
• Several types of concept drift, given that P(x,y) = P(x) . P(y|x) :
– Virtual drift [B] (or covariate shift) : P(x) changes
– Concept shift [A] : P(y|x) changes
[Figure : original data compared with the data after each type of drift]
88
![Page 89: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/89.jpg)
Concept drift
What kinds of drift can be expected [C] ?
• Abrupt
• Gradual
• Incremental
• Reoccurring
Two families of approaches :
• On-line adaptive learning
• Drift detection & models management
89
![Page 90: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/90.jpg)
Concept drift
Some specific constraints to manage :
• Adapt to concept drift as soon as possible
• Distinguish noise from changes (robust to noise, adaptive to changes)
• Recognize and react to reoccurring contexts
• Adapt with limited hardware resources (CPU, RAM, I/O)
90
![Page 91: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/91.jpg)
Incremental classifier
The example of the Hoeffding Trees [D]
Reminder on decision trees :
• Each node contains a test on an attribute
• Each branch corresponds to a possible outcome of the test
• Each leaf contains a class prediction
A decision tree is progressively learned by replacing leaves with test nodes, starting at the root.
[Figure : example of a decision tree with the test nodes "Age<30?" and "Car Type=Sports Car?", Yes / No branches, and leaves labeled A and B]
91
![Page 92: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/92.jpg)
Incremental classifier
The example of the Hoeffding Trees [D]
How is an incremental decision tree learned ?
• Single pass algorithm,
• With a low latency,
• Which avoids the exhaustive storage of training examples in the RAM.
• The drift is not managed
Input stream : labeled examples. Training examples are processed one by one.

| Var 1 | Var 2 | … | Class |
|-------|-------|---|-------|
| O     | 12    | … | A     |
| Y     | 98    | … | B     |
| Y     | 4     | … | A     |
92
![Page 93: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/93.jpg)
Incremental classifier
The example of the Hoeffding Trees [D]
Key ideas :
The best attribute at a node is found by exploiting a small subset of the labeled examples that pass through that node :
• The first examples are exploited to choose the root attribute
• Then, the other examples are passed down to the corresponding leaves
• The attributes to be split are recursively chosen …
The Hoeffding bound answers the question : how many examples are required to split an attribute ?
[Figure : the input stream reaches the root test "Age<30?" and is divided into a sub-stream (Yes) and a sub-stream (No), which feed the tests "Car Type=Sports Car?" and "Status =Married?"]
93
![Page 94: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/94.jpg)
Incremental classifier
The example of the Hoeffding Trees [D]
The Hoeffding bound :
• Consider a random variable $a$ whose range is $R$
• Suppose we have $n$ observations of $a$
• The mean value is denoted by $\bar{a}$
• The Hoeffding bound states : with probability $1 - \delta$, the true mean of $a$ is at least $\bar{a} - \epsilon$, where
$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$
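As a numerical illustration, the Hoeffding bound in its usual form, ε = sqrt(R² ln(1/δ) / (2n)), can be evaluated and inverted in a few lines of Python. The function names and the numerical values (R = 1, δ = 10⁻⁷, ε = 0.1) are assumptions chosen for the example.

```python
import math


def hoeffding_epsilon(value_range, delta, n):
    """Margin epsilon guaranteed with probability 1 - delta after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range, delta, epsilon):
    """Smallest n for which the margin is at most epsilon (bound solved for n)."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


n_required = examples_needed(value_range=1.0, delta=1e-7, epsilon=0.1)
margin = hoeffding_epsilon(value_range=1.0, delta=1e-7, n=n_required)
```

The required n grows only logarithmically with 1/δ but quadratically with 1/ε, so a very high confidence is cheap while a very small margin is expensive.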
94
![Page 95: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/95.jpg)
Incremental classifier
The example of the Hoeffding Trees [D]
How many examples ?
• Let $G(X_i)$ be the criterion used to choose the test attributes (e.g. Information Gain, Gini), and $\overline{G}(X_i)$ its value observed on the examples processed so far
• $X_a$ : the attribute with the highest observed value after having processed $n$ examples.
• $X_b$ : the attribute with the second highest observed value after having processed $n$ examples.
• If, after having processed $n$ examples at a node, $\overline{G}(X_a) - \overline{G}(X_b) > \epsilon$, the Hoeffding bound guarantees that the true $G(X_a) - G(X_b) > 0$, with probability $1 - \delta$.
• This node can then be split using $X_a$ ; the succeeding examples will be passed to the new leaves.
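The split test can be sketched in Python as follows: the two best attributes are ranked by their observed criterion, and the node is split only when their difference exceeds the Hoeffding margin ε = sqrt(R² ln(1/δ) / (2n)). The function name `should_split`, the attribute names and the gain values are hypothetical; R = 1 corresponds to the information gain on a two-class problem.

```python
import math


def should_split(gains, n, value_range, delta):
    """gains maps each attribute to its criterion G observed on n examples."""
    ranked = sorted(gains, key=gains.get, reverse=True)
    best, second = ranked[0], ranked[1]
    epsilon = math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
    # Split only if the observed advantage of the best attribute exceeds epsilon.
    return gains[best] - gains[second] > epsilon, best


# Hypothetical observed gains at a leaf.
gains = {"age": 0.30, "car_type": 0.10, "status": 0.05}
early_decision, _ = should_split(gains, n=100, value_range=1.0, delta=1e-7)
late_decision, chosen = should_split(gains, n=1000, value_range=1.0, delta=1e-7)
```

With the same observed gains, the leaf is not split after 100 examples (the margin is still larger than the gap) but it is after 1000.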
95
![Page 96: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/96.jpg)
Incremental classifier
The example of the Hoeffding Trees [D]
The algorithm :
• On the input stream :
– Find the two best attributes
– Check the condition
• If not satisfied : continue with the next examples of the input stream
• If satisfied :
– Create a new test at the current node
– Split the stream of examples
– Create 2 new leaves
• Recursively run the algorithm on the new leaves
[Figure : tree with the test nodes "Age<30?", "Car Type=Sports Car?" and "Status =Married?"]
This algorithm has been adapted in order to manage concept drift [E] :
• By maintaining an incremental tree on a sliding window,
• Which allows the old tuples to be forgotten.
• A collection of alternative sub-trees is maintained in memory and used in case of drift.
96
![Page 97: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/97.jpg)
Incremental classifier
The example of the Hoeffding Trees [D]
To learn more about incremental decision trees …
The 4 essential elements :
• Bounds : how many examples before splitting on an attribute ?
• Split criterion : which attribute and which cut point ?
• Summaries in the leaves : how to manage high speed data streams ?
• Local models : how to improve the classifier ?
97
![Page 98: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/98.jpg)
Incremental classifier
The example of the Hoeffding Trees [D]
To learn more about incremental decision trees …
Examples of summaries in the leaves :
• Numerical attributes
– Exhaustive counts [Gama2003]
– Incremental Discretization [Gama2006]
– Gaussian approximation [Pfahringer2008]
– Quantiles based summary [GK2001]
• Categorical attributes
– For each categorical variable and for each modality, the number of occurrences is stored
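As an illustration of a Gaussian approximation kept in a leaf, the sketch below maintains the mean and variance of a numerical attribute incrementally, one summary per class, using Welford's update. This is a generic technique and not necessarily the exact scheme of [Pfahringer2008]; the names `GaussianSummary` and `leaf_summaries` and the sample values are assumptions.

```python
from collections import defaultdict


class GaussianSummary:
    """Incremental mean and variance of a numerical attribute (Welford's update)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


# One summary per class label for a given attribute of a given leaf.
leaf_summaries = defaultdict(GaussianSummary)
for value, label in [(2, "A"), (4, "A"), (4, "A"), (4, "A"),
                     (5, "A"), (5, "A"), (7, "A"), (9, "A"), (30, "B")]:
    leaf_summaries[label].add(value)
```

Each summary holds three numbers whatever the number of tuples seen, which is what makes it suitable for high speed streams.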
98
![Page 99: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/99.jpg)
Drift detection
General schema :
• Fixed classifier (applied online)
• Drift detection
• If detected :
– Train a new classifier on the recent past
– Adapt the size of the history
– Replace the classifier
99
![Page 100: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/100.jpg)
Drift detection
How to detect the drift ?
Based on the online evaluation :
• Main idea : if the performance of the classifier changes, a drift is occurring ...
• For instance : if the error rate increases, the size of the sliding window decreases and the classifier is retrained [F].
• Limitation : the user has to define a threshold
[Diagram : the classifier is updated and evaluated (performance over time) ; if a drift is detected, the learning algorithm is run again]
100
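A deliberately simplified sketch of a detector based on the online evaluation: the error rate is computed on a sliding window and an alarm is raised when it exceeds a user-defined threshold. This is not the window-resizing scheme of [F]; the class name `ErrorRateDriftDetector`, the window size, the threshold and the synthetic error sequence are assumptions.

```python
from collections import deque


class ErrorRateDriftDetector:
    """Raises an alarm when the error rate on a sliding window exceeds a threshold."""

    def __init__(self, window_size, threshold):
        self.errors = deque(maxlen=window_size)
        self.threshold = threshold

    def add(self, is_error):
        self.errors.append(1 if is_error else 0)
        window_is_full = len(self.errors) == self.errors.maxlen
        return window_is_full and sum(self.errors) / len(self.errors) > self.threshold


# Hypothetical outcome of the online evaluation: 10 % of errors on the
# first 50 tuples, then the classifier is always wrong (abrupt drift).
errors = [i % 10 == 0 for i in range(50)] + [True] * 30

detector = ErrorRateDriftDetector(window_size=20, threshold=0.25)
first_alarm = None
for i, is_error in enumerate(errors):
    if detector.add(is_error) and first_alarm is None:
        first_alarm = i
```

The alarm comes a few tuples after the change: a larger window or a higher threshold makes the detector more robust to noise but slower to react, which is the limitation mentioned above.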
![Page 101: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/101.jpg)
Drift detection
How to detect the drift ?
Based on the distribution of tuples :
• Main idea : if the distributions of the “current window” and the “reference window” are significantly different, a drift is occurring …
[Figure : time axis showing the reference window followed by the current window]
101
![Page 102: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/102.jpg)
Drift detection
How to detect the drift ?
Based on the distribution of tuples :
Detection of covariate shift : P(X)
• In [G], the author uses statistical tests in order to compare the two distributions :
– Welch test : are the mean values the same ?
– Kolmogorov-Smirnov test : do both samples of tuples come from the same distribution ?
• A classifier can be exploited to discriminate the tuples belonging to the two windows [H] :
– Explanatory variables : X
– Target variable : W (the window)
– If the quality of the classifier is good, a drift is occurring …
Detection of concept shift : P(Y|X)
• In [I], a classifier is exploited, and the class value is considered as an additional input variable :
– Explanatory variables : X and Y
– Target variable : W (the window)
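The two test statistics mentioned for comparing the reference window and the current window can be computed with the Python standard library alone, as sketched below on a single numerical variable. Only the statistics are computed; turning them into a decision requires a critical value or a p-value, which is left out. The function names and the two sample windows are assumptions.

```python
import math
from statistics import mean, variance


def welch_t(reference, current):
    """Welch's t statistic: compares the means without assuming equal variances."""
    standard_error = math.sqrt(variance(reference) / len(reference)
                               + variance(current) / len(current))
    return (mean(reference) - mean(current)) / standard_error


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two
    empirical cumulative distribution functions."""
    def ecdf(sample, v):
        return sum(s <= v for s in sample) / len(sample)

    points = sorted(set(reference) | set(current))
    return max(abs(ecdf(reference, v) - ecdf(current, v)) for v in points)


# Hypothetical windows on one explanatory variable.
reference_window = [1, 2, 3, 4, 5]
current_window = [11, 12, 13, 14, 15]

t_value = welch_t(reference_window, current_window)
ks_value = ks_statistic(reference_window, current_window)
```

A statistic close to 0 suggests no change; here both values are extreme because the two windows do not overlap at all.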
102
![Page 103: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/103.jpg)
Conclusion
Examples of incremental learning algorithms :
• Decision Tree
– ID4 (Schlimmer - ML’86)
– ID5/ITI (Utgoff – ML’97)
– SPRINT (Shafer - VLDB’96)
• Naive Bayes
– Incremental (for the standard NB)
– Learns quickly with a low variance (Domingos – ML’97)
– Can be combined with a decision tree : NBTree (Kohavi – KDD’96)
• Neural Networks
– IOLIN (Cohen - TDM’04)
– learn++ (Polikar - IJCNN’02), …
• Support Vector Machine
– TSVM (Transductive SVM – Klinkenberg IJCAI’01)
– PSVM (Proximal SVM – Mangasarian KDD’01)
– LASVM (Bordes 2005)
• Rules based systems
– AQ15 (Michalski - AAAI’86), AQ-PM (Maloof/Michalski - ML’00)
– STAGGER (Schlimmer - ML’86)
– FLORA (Widmer - ML’96)
103
![Page 104: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/104.jpg)
Conclusion
Examples of online learning algorithms (an extensive literature) :
• Rules
– FACIL (Ferrer-Troyano – SAC’04,05,06)
• Ensemble
– SEA (Street - KDD’01) based on C4.5
• K-nn
– ANNCAD (Law – LNCS’05)
– IBLS-Stream (Shaker et al – “Evolving Systems” journal 2012)
• SVM
– CVM (Tsang – JMLR’06)
• Decision Tree
– Domingos : VFDT (KDD’00), CVFDT (KDD’01)
– Gama : VFDTc (KDD’03), UFFT (SAC’04)
– Kirkby : Ensemble of Hoeffding Trees (KDD’09)
– del Campo-Avila : IADEM (LNCS’06)
104
![Page 105: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/105.jpg)
Conclusion
Main ideas to retain :
• Online learning algorithms are designed in accordance with specific constraints :
– One pass
– Low latency
– Adaptive … etc
• In practice, the true labels are delayed : an online classifier predicts the labels before observing them
• The evaluation of the classifiers is specific to data stream processing
• The distribution of the tuples may change over time :
– Some approaches detect the drifts, and then update the classifier (abrupt drift)
– Other approaches progressively adapt the classifier (incremental drift)
• In practice, the type of expected drift must be known in order to choose an appropriate approach
• The distinction between noise and drifts can be viewed as a plasticity / stability dilemma
105
![Page 106: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/106.jpg)
References
[A] M. Salganicoff. Density-adaptive learning and forgetting. In Proc. of the Int. Conf. on Mach. Learn. (ICML), pp. 276–283, 1993.
[B] G. Widmer and M. Kubat. Effective learning in dynamic environments by explicit context tracking. In Proc. of the Eur. Conf. on Mach. Learn. (ECML), pp. 227–243, 1993.
[C] I. Zliobaite. Learning under Concept Drift : an Overview.
[D] P. Domingos and G. Hulten. Mining High-Speed Data Streams. KDD, 2000.
[E] A. Bifet and R. Gavalda. Adaptive Parameter-free Learning from Evolving Data Streams. In Advances in Intelligent Data Analysis VIII, Springer, 2009.
[F] G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1) :69–101, 1996.
[G] G. K. Kanji. 100 Statistical Tests, volume 129. Sage, 2006.
[H] A. Bondu, B. Grossin, and M.L. Picard. Density estimation on data streams : an application to Change Detection. In EGC (Extraction et Gestion de la connaissance), 2010.
[I] C. Salperwyck, M. Boulle, and V. Lemaire. Grille bivariee pour la detection de changement dans un flux etiquete. EGC, 2013.
Related documents :
Thesis of Salperwyck, C. Apprentissage incremental en-ligne sur flux de donnees, 2012
106
![Page 107: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/107.jpg)
Outline
• Introduction on data-streams
• Part 1 : Querying
• Part 2 : Unsupervised Learning
• Part 3 : Supervised Learning
• Tutorial : Let’s try a CEP !
107
![Page 108: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/108.jpg)
Tutorial
An introduction to StreamBase
Thanks to Benoît Grossin
[Screenshot of the StreamBase interface, annotated : files of your project ; drag and drop operators ; running graph of operators ; run your project]
108
![Page 109: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/109.jpg)
Tutorial
An introduction to StreamBase
Used data :
• Open Data
• Provided by EnerNOC (a provider of energy intelligence software)
• Two data files :
– Information on 100 industrial sites
– 5-minute electricity consumption of these sites over 2 years
Objective of this tutorial :
• Detect the peaks of consumption
• Implement an “energy efficiency score” for each type of industry
109
![Page 110: Data Stream Processing and Analytics](https://reader034.fdocuments.in/reader034/viewer/2022051516/56814290550346895daebc6f/html5/thumbnails/110.jpg)
Tutorial
An introduction to StreamBase
Overview :
110