Evaluating Classification Algorithms Applied To Data Streams Esteban Donato
Evaluating classification algorithms applied to data streams
Author: Ing. Esteban D. Donato
Advisor: Dr. Fazel Famili
Co-Advisor: Dra. Ana S. Haedo
Dec-2009
Maestría en Explotación de Datos y Descubrimiento del Conocimiento
Introduction
The majority of companies and organizations collect and maintain gigantic databases that grow by millions of records per day.
Current algorithms for mining complex models cannot process even a fraction of these data in useful time.
Concept drift: occurs when the underlying data distribution changes over time.
Objective
To perform a benchmarking analysis of several well-known algorithms applied to data streams.
The algorithms chosen for this study are: UFFT, CVFDT, and VFDTc.
The analysis focuses on aspects that every algorithm applied to data streams has to deal with.
Related work
A data stream is a sequence of data items x1, …, xi, …, xn. The items are read one at a time, in increasing order of the indices.
Off-line learning: assumes that the dataset resides in a static database and has been generated from a static distribution. It also assumes that all the data is available before training and that all the examples fit into memory.
Incremental learning: the items are time-ordered and the distribution that generates them varies over time. Systems evolve and change the concept definition as new observations are processed.
Related work (Cont.): Data Stream Mining
A subarea of incremental learning, where data accumulates faster than it can be mined. A data stream mining algorithm:
- must require small constant time per record;
- must use only a fixed amount of main memory;
- must be able to build a model using at most one scan of the data;
- must make a usable model available at any point in time;
- ideally, should produce a model equivalent to the one that would be obtained by the corresponding ordinary database mining algorithm;
- should keep the model up-to-date at any time.
Types of algorithms: sets of rules, induction trees, and ensemble methods.
Related work (Cont.): Very Fast Decision Tree (VFDT)
Requires each example to be read only once, and a small constant time to process it.
Building process: given a stream of examples, the first ones are used to choose the root, and the following examples are passed down to the corresponding leaves.
To decide how many examples are needed at each node, the Hoeffding bound is used.
The Hoeffding bound: with probability 1 - φ, the true mean of a random variable with range R is at least r - e, where r is the observed mean and

e = √( R² ln(1/φ) / (2n) )

Let ∆G = G(Xa) - G(Xb) ≥ 0 be the observed difference between the evaluation measures of the two best attributes. If ∆G > e, then the true difference satisfies ∆G - e > 0 with probability 1 - φ, so Xa can safely be chosen for the split.
Other features: pre-pruning, different evaluation measures, ties, memory management, dropping poor attributes, initialization, rescans.
Drawback: it does not detect concept drift.
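The bound and the split rule above can be sketched in Python; `value_range` and `phi` are illustrative names for the slide's R and φ, and `should_split` is a hypothetical helper, not VFDT's actual code:

```python
import math

def hoeffding_bound(value_range: float, phi: float, n: int) -> float:
    """e = sqrt(R^2 * ln(1/phi) / (2n)): with probability 1 - phi, the
    true mean of a variable with range R is within e of the observed
    mean of n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / phi) / (2.0 * n))

def should_split(gain_best: float, gain_second: float,
                 value_range: float, phi: float, n: int) -> bool:
    """Split once the observed gain difference exceeds the bound, so the
    best attribute is very likely the true best."""
    return (gain_best - gain_second) > hoeffding_bound(value_range, phi, n)
```

Note how the bound shrinks as n grows, so a leaf that sees more examples needs a smaller gain difference before it can split.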
Related work (Cont.): Concept Drift
A change in the target concept.
It depends on hidden attributes, not given explicitly in the form of predictive features.
Examples: weather prediction, customers' buying preferences, etc.
A concept drift handling system should be able to:
- quickly adapt to concept drift;
- be robust to noise and distinguish it from concept drift;
- recognize and treat recurring contexts.
Types: sudden, gradual, frequent, and virtual concept drift.
Conclusion of the literature review
A data stream is a sequence of time-ordered items, arriving faster than the time needed to mine them.
Changes in the underlying data distribution may occur, requiring the algorithms to detect and adapt to these changes.
The main challenge in incremental learning is how to detect and adapt to concept drift.
To deal with fast-arriving data, the algorithms must require a small constant processing time per record.
One of the first algorithms developed was VFDT, which uses the Hoeffding bound.
For concept drift, a difficult problem is distinguishing between true concept drift and noise.
Algorithm: VFDTc (Very Fast Decision Tree for Continuous attributes)
An extension of VFDT in three directions: continuous data, functional leaves, and concept drift.
For a continuous attribute, the split test is a condition of the form attri <= cut_point.
Information gain is used to choose the cut_point.
Functional tree leaves: an innovative aspect of this algorithm is its ability to use naive Bayes classifiers at the tree leaves.
A leaf must see nmin examples before computing the evaluation function.
Concept drift handling is based on the assumption that, whatever the cause of the drift, the decision surface moves. It supports two methods:
- Drift detection based on error estimates (EE/EBP)
- Drift detection based on the affinity coefficient (AC)
Reacting to drift: the method pushes all the information of the descending leaves up to the node. This is a forgetting mechanism.
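The choice of cut_point by information gain can be sketched as follows. This is a batch illustration with a hypothetical helper `best_cut_point`; VFDTc itself maintains incremental sufficient statistics at each leaf rather than sorting stored examples:

```python
import math
from collections import Counter

def entropy(labels) -> float:
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut_point(values, labels):
    """Try each threshold midway between consecutive distinct sorted values
    and return the cut point of the test attr <= cut_point that maximizes
    information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_cut = -1.0, None
    n = len(pairs)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) \
                    - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_cut, best_gain
```

For a perfectly separable attribute (e.g. values 1, 2 labeled 'a' and 8, 9 labeled 'b') this picks the midpoint between the two groups with a gain of one full bit.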
Algorithm: UFFT (Ultra Fast Forest Tree)
- Generates a forest of binary trees.
- Processes each example in constant time.
- Uses analytical techniques to choose the splitting criteria, and information gain to estimate the merit of each possible splitting test.
- Maintains a short-term memory for initializing the leaves.
- To expand a leaf node: the information gain must be positive and have statistical support.
- Functional leaves.
- Concept drift detection: the error rate of the naive Bayes classifier is calculated at each node. The error follows a binomial distribution. Two confidence interval levels are used: warning and drift.
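The two-level error-rate monitoring can be sketched as below. The class name and the 2-sigma / 3-sigma thresholds are assumptions (they follow Gama et al.'s drift detection method, which UFFT's scheme resembles), not UFFT's exact code:

```python
import math

class DriftDetector:
    """Track the node's observed error rate p and its binomial standard
    deviation s; remember the minimum of p + s, and signal 'warning' or
    'drift' when the current p + s crosses the 2-sigma or 3-sigma level
    above that minimum."""

    def __init__(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, is_error: bool) -> str:
        self.n += 1
        self.errors += int(is_error)
        if self.n < 30 or self.errors == 0:   # too little evidence to judge
            return "in-control"
        p = self.errors / self.n              # observed error rate
        s = math.sqrt(p * (1 - p) / self.n)   # binomial standard deviation
        if p + s < self.p_min + self.s_min:   # remember the best point so far
            self.p_min, self.s_min = p, s
        if p + s >= self.p_min + 3 * self.s_min:
            return "drift"
        if p + s >= self.p_min + 2 * self.s_min:
            return "warning"
        return "in-control"
```

While the tree keeps learning the same concept, the error rate decreases and the detector stays in control; an abrupt rise in errors first raises a warning, then signals drift.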
Algorithm: CVFDT (Concept-adapting Very Fast Decision Tree)
An extension of VFDT with support for concept drift.
It works by keeping its model consistent with a sliding window of examples, updating only the sufficient statistics.
It uses information gain for selecting the best attribute.
When a node's attribute is no longer the best, it grows an alternative subtree with the new best attribute at its root.
It periodically scans HT and all alternate trees, looking for internal nodes whose alternate subtrees are performing better than the original nodes.
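The window-consistency idea, incrementing statistics for the arriving example and decrementing them for the one leaving the window, can be sketched as follows; `WindowStats` is a hypothetical structure, not Hulten et al.'s implementation:

```python
from collections import deque, defaultdict

class WindowStats:
    """Keep class counts per (attribute, value) consistent with a sliding
    window: add counts for the new example, subtract them for the evicted
    one, so the statistics always describe exactly the window's contents."""

    def __init__(self, window_size: int):
        self.window = deque()
        self.window_size = window_size
        self.counts = defaultdict(int)   # (attr_index, value, label) -> count

    def add(self, x, label):
        self.window.append((x, label))
        for i, v in enumerate(x):
            self.counts[(i, v, label)] += 1
        if len(self.window) > self.window_size:
            old_x, old_label = self.window.popleft()   # forget the oldest
            for i, v in enumerate(old_x):
                self.counts[(i, v, old_label)] -= 1
```

Because only counters change, each example costs constant time regardless of window size, which is what lets CVFDT track a drifting concept without retraining from scratch.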
Performance measures
- Capacity to detect and respond to concept drift
- Capacity to detect and respond to virtual concept drift
- Capacity to detect and respond to recurring concept drift
- Capacity to adapt to sudden concept drift
- Capacity to adapt to gradual concept drift
- Capacity to adapt to frequent concept drift
- Accuracy of the classification task
- Capacity to deal with outliers
- Capacity to deal with noisy data
- Speed (time taken to process an item in the stream)
Data sets generated
Data sets are based on a moving hyperplane in the d-dimensional space [0, 1]^d, generated with the MOA (Massive Online Analysis) tool:
http://sourceforge.net/projects/moa-datastream/ (released under the GNU license, free and open source).
The hyperplane is defined by

∑_{i=1}^{d} w_i x_i = w_0

and examples on one side of it are labeled positive, those on the other side negative.
Current configurable attributes: instanceRandomSeed, numClasses, numAtts, numDriftAtts, magChange, noisePercentage, sigmaPercentage.
New configurable attributes: driftFreq, driftTran, outlierPercentage, distributionPercentage.
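A moving-hyperplane stream in this spirit can be sketched in a few lines. The parameter names loosely mirror the MOA attributes listed above, but the generator itself is illustrative, not MOA's implementation:

```python
import random

def hyperplane_stream(d=10, n=1000, mag_change=0.0, noise_pct=0.05, seed=1):
    """Yield (x, label) pairs from a hyperplane in [0, 1]^d: points with
    sum(w_i * x_i) >= w_0 are positive. Perturbing the weights by
    mag_change per example moves the decision surface (concept drift)."""
    rng = random.Random(seed)
    w = [rng.random() for _ in range(d)]
    for _ in range(n):
        x = [rng.random() for _ in range(d)]
        w0 = 0.5 * sum(w)   # threshold at half the weight mass -> balanced classes
        label = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= w0 else 0
        if rng.random() < noise_pct:   # class noise: flip the label
            label = 1 - label
        yield x, label
        # gradual drift: nudge each weight up or down by mag_change
        w = [wi + mag_change * rng.choice([-1, 1]) for wi in w]
```

With `mag_change=0` the concept is stationary; increasing it, or changing weights in bursts, produces the gradual, sudden, and frequent drift scenarios evaluated below.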
Data sets generated
[Four scatter plots of positive and negative examples from the generated datasets:]
- Dataset with no concept drift, outliers, or noise
- Dataset with 10% of noisy data
- Dataset with 1% of outliers
- Dataset with 3 concept drifts
Results: Capacity to detect and respond to concept drift
Results: Capacity to detect and respond to virtual concept drift
Results: Capacity to detect and respond to recurring concept drift
Results: Capacity to adapt to sudden concept drift
Results: Capacity to adapt to gradual concept drift
Results: Capacity to adapt to frequent concept drift
Results: Accuracy of the classification task

Confusion matrices (percentages, with counts in parentheses):

VFDTc (CA):

|                | Predicted Class 1 | Predicted Class 2 |
|----------------|-------------------|-------------------|
| Actual Class 1 | 44.5% (887)       | 5.5% (109)        |
| Actual Class 2 | 5% (101)          | 45% (903)         |

VFDTc (EBP):

|                | Predicted Class 1 | Predicted Class 2 |
|----------------|-------------------|-------------------|
| Actual Class 1 | 39% (777)         | 11% (219)         |
| Actual Class 2 | 9% (173)          | 41% (831)         |

UFFT:

|                | Predicted Class 1 | Predicted Class 2 |
|----------------|-------------------|-------------------|
| Actual Class 1 | 46% (928)         | 3.5% (68)         |
| Actual Class 2 | 2.5% (48)         | 48% (956)         |

CVFDT:

|                | Predicted Class 1 | Predicted Class 2 |
|----------------|-------------------|-------------------|
| Actual Class 1 | 34.5% (685)       | 15.5% (311)       |
| Actual Class 2 | 15.5% (312)       | 34.5% (692)       |

Measures derived from the confusion matrix:

| Algorithm    | Accuracy (AC) | True positive (TP) | False positive (FP) | True negative (TN) | False negative (FN) | Precision (P) |
|--------------|---------------|--------------------|---------------------|--------------------|---------------------|---------------|
| VFDTc (CA)   | 0.89          | 0.89               | 0.10                | 0.90               | 0.11                | 0.90          |
| VFDTc (EBP)  | 0.80          | 0.78               | 0.17                | 0.83               | 0.22                | 0.82          |
| UFFT         | 0.94          | 0.93               | 0.05                | 0.95               | 0.07                | 0.95          |
| CVFDT        | 0.69          | 0.69               | 0.31                | 0.69               | 0.31                | 0.69          |
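The measures follow directly from the confusion-matrix counts. A minimal sketch (`metrics` is an illustrative helper) that, as a check, reproduces the UFFT row from its matrix:

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive accuracy, TP/FP/TN/FN rates, and precision from
    confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "tp_rate": tp / (tp + fn),    # recall / sensitivity
        "fp_rate": fp / (fp + tn),
        "tn_rate": tn / (fp + tn),    # specificity
        "fn_rate": fn / (tp + fn),
        "precision": tp / (tp + fp),
    }

# UFFT's matrix above: TP=928, FN=68, FP=48, TN=956
m = metrics(tp=928, fp=48, fn=68, tn=956)
```

Rounding the results to two decimals gives 0.94, 0.93, 0.05, 0.95, 0.07, 0.95, matching the UFFT row of the table.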
Results: Dealing with outliers
Results: Dealing with noisy data
Results: Speed (time taken to process an item in the stream)
Conclusions & future work
- Data can be generated very fast, which gives us a new and challenging way of developing data mining algorithms.
- We have to develop them keeping in mind that the training phase can never end.
- Changes in the data distribution are another challenging scenario that data stream mining has to deal with.
- VFDT was one of the first data stream mining algorithms developed. It implemented the Hoeffding bound.
- We generated different datasets using the moving hyperplane algorithm.
- UFFT is suited for short-term predictions; CVFDT for long-term solutions.
- No impact was observed for virtual concept drift or recurring concept drift.
Conclusions & future work
- VFDTc (CA) is not suitable for gradual or sudden concept drift.
- Neither VFDTc (CA) nor UFFT is suitable for frequent concept drift.
- VFDTc (EBP) and CVFDT are the best options for data streams with outliers.
- CVFDT is the best option for data streams with noisy points.
- CVFDT and UFFT are the fastest algorithms.

Future work:
- Clustering algorithms applied to data streams.
- Classification algorithms applied to data streams of unstructured datasets (text, images, etc.).