Mining High-Speed Data Streams
Pedro Domingos, Geoff Hulten
Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) - 2000

Presented by: Afsoon Yousefi
Outline
- Introduction
- Hoeffding Trees
- The VFDT System
- Performance Study
- Conclusion
- Qs & As
Introduction

In today's information society, extracting knowledge is becoming a very important task for many people. We live in an age of knowledge revolution.

Many organizations have very large databases that grow at a rate of several million records per day. This brings both opportunities and challenges.

The main limited resources in knowledge discovery systems:
- Time
- Memory
- Sample size
Introduction—cont.

Traditional systems:
- Only a small amount of data is available.
- Use a fraction of the available computational power.

Current systems:
- The bottleneck is time and memory.
- Use only a fraction of the available samples of data.
- Try to mine databases that do not fit in main memory.

Available algorithms are either:
- Efficient, but with no guarantee that the learned model is similar to the one learned in batch mode:
  - Never recover from an unfavorable set of early examples.
  - Sensitive to example ordering.
- Or able to produce the same model as the batch version, but not efficiently:
  - Slower than the batch algorithm.
Introduction—cont.

Requirements for algorithms to overcome these problems:
- Operate continuously and indefinitely.
- Incorporate examples as they arrive, never losing potentially valuable information.
- Build a model using at most one scan of the data.
- Use only a fixed amount of main memory.
- Require small constant time per record.
- Make a usable model available at any point in time.
- Produce a model equivalent to the one obtained by an ordinary database mining algorithm.
- When the data-generating process changes over time, keep the model up-to-date at all times.
Introduction—cont.

Such requirements are fulfilled by:
- Incremental learning methods
- Online methods
- Successive methods
- Sequential methods
Hoeffding Trees

Classic decision tree learners (CART, ID3, C4.5):
- Require all examples simultaneously in main memory.

Disk-based decision tree learners (SLIQ, SPRINT):
- Store examples on disk.
- Expensive for learning complex trees or very large datasets.

Instead, consider a subset of the training examples to find the best attribute at each node:
- Works for extremely large datasets.
- Reads each example at most once.
- Directly mines online data sources.
- Builds complex trees with acceptable computational cost.
Hoeffding Trees—cont.

Given a set of examples of the form (x, y):
- N: number of examples
- y: discrete class label
- x: a vector of attributes (symbolic or numeric)

Goal: produce a model y = f(x) that predicts the classes y of future examples x with high accuracy.
Hoeffding Trees—cont.

Given a stream of examples:
- Use the first ones to choose the root test.
- Pass succeeding ones down to the corresponding leaves.
- Pick the best attributes there.
- ... and so on recursively.

How many examples are necessary at each node? The Hoeffding bound (also known as the additive Chernoff bound), a statistical result, answers this.
Hoeffding Trees—cont.

Hoeffding bound:
- G(·): heuristic measure used to choose test attributes (C4.5: information gain; CART: Gini index). Assume G is to be maximized.
- Ḡ(Xi): heuristic measure for attribute Xi after seeing n examples.
- Xa: attribute with highest observed Ḡ; Xb: second-best attribute.
- ΔḠ = Ḡ(Xa) − Ḡ(Xb) ≥ 0: difference between them.
- δ: probability of choosing the wrong attribute.
- R: range of G.

The Hoeffding bound states that, with probability 1 − δ, the true mean of a random variable of range R does not differ from the mean of n observations by more than

    ε = √( R² ln(1/δ) / (2n) )

It therefore guarantees that Xa is the correct choice with probability 1 − δ if ΔḠ > ε once n examples have been seen at this node.
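As a quick sketch of the formula in plain Python (the function name and the two-class example are my own):

```python
import math

def hoeffding_bound(R: float, delta: float, n: int) -> float:
    """Epsilon such that the true mean of a variable with range R differs
    from the mean of n observations by more than epsilon with probability
    at most delta."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

# With 2 classes, information gain has range R = log2(2) = 1.
eps = hoeffding_bound(R=1.0, delta=1e-7, n=1000)
# A leaf may split once the observed gain gap exceeds eps.
```

Note that ε shrinks as 1/√n: quadrupling the number of examples halves the bound.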
Hoeffding Trees—cont.

Hoeffding bound:
- If ΔḠ > ε, then Xa is the best attribute with probability 1 − δ.
- A node needs to accumulate examples from the stream until ε becomes smaller than ΔḠ.
- The bound is independent of the probability distribution generating the observations.
- This makes it more conservative than distribution-dependent bounds.
Hoeffding Tree algorithm

Inputs:
- S: a sequence of examples
- X: a set of discrete attributes
- G(·): a split evaluation function
- δ: the desired probability of choosing the wrong attribute at any given node

Output:
- HT: a decision tree
Hoeffding Tree algorithm—cont.

Procedure HoeffdingTree(S, X, G, δ)
  Let HT be a tree with a single leaf l1 (the root).
  Let X1 = X ∪ {X∅}.
  Let Ḡ1(X∅) be the Ḡ obtained by predicting the most frequent class in S.
  For each class yk
    For each value xij of each attribute Xi ∈ X
      Let nijk(l1) = 0.
  For each example (x, y) in S
    Sort (x, y) into a leaf l using HT.
    For each xij in x such that Xi ∈ Xl
      Increment nijk(l).
    Label l with the majority class among the examples seen so far at l.
    Compute Ḡl(Xi) for each attribute Xi ∈ Xl using the counts nijk(l).
    Let Xa be the attribute with highest Ḡl.
    Let Xb be the attribute with second-highest Ḡl.
    Compute ε using the Hoeffding bound.
    If Ḡl(Xa) − Ḡl(Xb) > ε and Xa ≠ X∅, then
      Replace l by an internal node that splits on Xa.
      For each branch of the split
        Add a new leaf lm and let Xm = Xl − {Xa}.
        Let Ḡm(X∅) be the Ḡ obtained by predicting the most frequent class at lm.
        For each class yk and each value xij of each attribute Xi ∈ Xm − {X∅}
          Let nijk(lm) = 0.
  Return HT.
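To make the per-leaf bookkeeping concrete, here is a minimal, hypothetical Python sketch for binary attributes and information gain. It illustrates the counts nijk and the split check only; it is not the authors' implementation:

```python
import math
from collections import defaultdict

def entropy(counts):
    """Entropy (in bits) of a dict mapping class -> count."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)

class Leaf:
    """Sufficient statistics n_ijk: counts per (attribute i, value j, class k)."""
    def __init__(self, n_attrs):
        self.n = 0
        self.class_counts = defaultdict(int)
        self.nijk = [defaultdict(lambda: defaultdict(int)) for _ in range(n_attrs)]

    def update(self, x, y):
        self.n += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.nijk[i][v][y] += 1

    def info_gain(self, i):
        """Information gain of splitting this leaf on attribute i."""
        gain = entropy(self.class_counts)
        for counts in self.nijk[i].values():
            gain -= (sum(counts.values()) / self.n) * entropy(counts)
        return gain

    def should_split(self, delta, R=1.0):
        """Hoeffding check: is the best attribute a clear winner?"""
        gains = sorted((self.info_gain(i) for i in range(len(self.nijk))),
                       reverse=True)
        eps = math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * self.n))
        return gains[0] - gains[1] > eps
```

For example, feeding a leaf 1000 examples where attribute 0 perfectly predicts the class and attribute 1 is uninformative makes `should_split` return True even for a very small δ.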
Hoeffding Trees—cont.

- p: leaf probability, the probability that an example reaches a leaf (assume this is constant).
- HTδ: tree produced by the Hoeffding tree algorithm with desired δ, given an infinite sequence of examples S.
- DT*: decision tree induced by choosing at each node the attribute with the true greatest G.
- Δi(HTδ, DT*): intensional disagreement between the two decision trees, i.e. the probability that an example takes a different path through each:

    Δi(DT1, DT2) = Σx P(x) I[Path1(x) ≠ Path2(x)]

- P(x): probability that the attribute vector x will be observed.
- I(·): indicator function (1 if the argument is true, 0 otherwise).

THEOREM: E[Δi(HTδ, DT*)] ≤ δ/p.
Hoeffding Trees—cont.

Suppose the best and second-best attributes differ by 10% of the range of G (ΔḠ = 0.1R).

Solving ΔḠ > ε for n gives n > (R² / 2ΔḠ²) ln(1/δ) = 50 ln(1/δ):
- δ ≈ 1/2000 requires about 380 examples.
- Shrinking δ by a further factor of 1000 requires only about 345 more examples.

An exponential improvement in δ can be obtained with a linear increase in the number of examples.
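This arithmetic can be checked directly. In the sketch below, R = 1 and ΔḠ = 0.1 are taken from the example above; the specific δ values are back-solved assumptions of mine, chosen so the counts match the slide's figures:

```python
import math

def required_n(delta_g: float, R: float, delta: float) -> float:
    """Examples needed before epsilon drops below the observed gain gap."""
    return (R * R) / (2.0 * delta_g ** 2) * math.log(1.0 / delta)

n1 = required_n(0.1, 1.0, 1 / 2000)        # ~380 examples
n2 = required_n(0.1, 1.0, 1 / 2_000_000)   # ~725 examples: a 1000x smaller
                                           # delta costs only ~345 more
```

Because n grows only logarithmically in 1/δ, each multiplicative reduction of δ costs the same additive number of examples.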
The VFDT System

Very Fast Decision Tree learner (VFDT):
- A decision tree learning system based on the Hoeffding tree algorithm.
- Uses either information gain or the Gini index as the attribute evaluation measure.
- Includes a number of refinements to the Hoeffding tree algorithm:
  - Ties
  - G computation
  - Memory
  - Poor attributes
  - Initialization
  - Rescans
The VFDT System—cont.

Ties:
- Two or more attributes may have very similar Ḡ's.
- Potentially many examples would be required to decide between them with high confidence.
- Yet it makes little difference which attribute is chosen.
- So, given a user-specified tie threshold τ: if ΔḠ < ε < τ, split on the current best attribute.
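A sketch of the resulting split rule (the function name and default τ are illustrative, not prescribed by the paper):

```python
import math

def decide_split(gain_best, gain_second, n, delta, R=1.0, tau=0.05):
    """Split when the gap is statistically significant, or when epsilon has
    shrunk below the tie threshold tau (more examples won't help much)."""
    eps = math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))
    if gain_best - gain_second > eps:
        return True           # clear winner
    if eps < tau:
        return True           # near-tie: split on the current best anyway
    return False              # keep accumulating examples
```

Without the τ branch, a leaf whose two best attributes are genuinely tied would wait forever, since ΔḠ never exceeds ε.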
The VFDT System—cont.

G computation:
- The most significant part of the time cost per example is recomputing Ḡ.
- Computing Ḡ for every new example is inefficient, since a single example rarely changes the decision.
- Instead, a user-specified number nmin of new examples must be accumulated at a leaf before Ḡ is recomputed.
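The nmin refinement amounts to simple batching; a hypothetical sketch:

```python
class LeafCounter:
    """Recompute G only after n_min new examples have arrived at a leaf."""
    def __init__(self, n_min=200):
        self.n_min = n_min
        self.since_last_check = 0

    def observe(self) -> bool:
        """Returns True when it is time to re-evaluate the split heuristic."""
        self.since_last_check += 1
        if self.since_last_check >= self.n_min:
            self.since_last_check = 0
            return True
        return False
```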
The VFDT System—cont.

Memory:
- VFDT's memory use is dominated by the memory required to keep the counts nijk for all growing leaves.
- If the maximum available memory is reached, VFDT deactivates the least promising leaves.
- The least promising leaves are considered to be the ones with the lowest values of pl·el, where pl is the probability that an example reaches leaf l and el is the observed error rate at l.
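A sketch of the deactivation policy (the promise measure pl·el follows the description above; the data structures are illustrative):

```python
def deactivate_least_promising(leaves, max_active):
    """Keep only the max_active leaves with the highest p_l * e_l.

    leaves: dict mapping leaf id -> (p_l, e_l), where p_l is the probability
    an example reaches the leaf and e_l its observed error rate.
    Returns (active_ids, deactivated_ids).
    """
    ranked = sorted(leaves, key=lambda l: leaves[l][0] * leaves[l][1],
                    reverse=True)
    return set(ranked[:max_active]), set(ranked[max_active:])
```

The intuition behind pl·el: splitting a leaf that is rarely reached, or that already classifies well, reduces overall error the least, so such leaves are the first to lose their counts.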
The VFDT System—cont.

Poor attributes:
- VFDT's memory usage is also minimized by dropping, early on, attributes that do not look promising.
- As soon as the difference between an attribute's Ḡ and the best attribute's Ḡ becomes greater than ε, the attribute can be dropped.
- The memory used to store the corresponding counts can then be freed.
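A sketch of this pruning of poor attributes (names are illustrative):

```python
import math

def keep_promising_attributes(gains, n, delta, R=1.0):
    """Return the attribute names worth keeping: an attribute is dropped
    once its gain trails the best attribute's gain by more than epsilon."""
    eps = math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))
    best = max(gains.values())
    return {a for a, g in gains.items() if best - g <= eps}
```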
The VFDT System—cont.

Initialization:
- VFDT can be initialized with the tree produced by a conventional RAM-based learner on a small subset of the data.
- The tree can either be used as-is, or over-pruned first.
- This gives VFDT a "head start".
The VFDT System—cont.

Rescans:
- VFDT can rescan previously-seen examples.
- Rescanning can be activated if:
  - the data arrives slowly enough that there is time for it, or
  - the dataset is finite and small enough that it is feasible.
Synthetic Data Study

- Compared VFDT with C4.5 release 8.
- Both systems were restricted to the same amount of RAM.
- VFDT used information gain as the G function.
- 14 concepts were used, all with 2 classes and 100 attributes.
- Target trees were generated as follows:
  - For each level after the first 3, a fraction of the nodes was replaced by leaves; the rest became splits on a random attribute.
  - At depth 18, all remaining nodes were replaced with leaves.
  - Each leaf was randomly assigned a class.
- A stream of training examples was then generated by sampling uniformly from the instance space, assigning classes according to the target tree, and adding various levels of class and attribute noise.
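The target-tree generation can be sketched as follows (the leaf fraction value and binary attributes are assumptions of mine; this is an illustrative reconstruction, not the authors' generator):

```python
import random

def make_random_tree(depth=0, n_attrs=100, leaf_frac=0.15, max_depth=18):
    """Random target concept: below the first 3 levels each node becomes a
    leaf with probability leaf_frac; at max_depth everything becomes a leaf."""
    if depth >= max_depth or (depth >= 3 and random.random() < leaf_frac):
        return {"leaf": random.randint(0, 1)}        # randomly assigned class
    attr = random.randrange(n_attrs)                 # split on a random attribute
    return {"attr": attr,
            "children": [make_random_tree(depth + 1, n_attrs, leaf_frac, max_depth),
                         make_random_tree(depth + 1, n_attrs, leaf_frac, max_depth)]}

def classify(tree, x):
    """Label an attribute vector x (0/1 values) according to the target tree."""
    while "leaf" not in tree:
        tree = tree["children"][x[tree["attr"]]]
    return tree["leaf"]
```

Training examples are then drawn by sampling x uniformly and labeling it with `classify`, optionally flipping class or attribute bits to simulate noise.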
Synthetic Data Study—cont.

[Figure: accuracy as a function of the number of training examples.]
Synthetic Data Study—cont.

[Figure: tree size as a function of the number of training examples.]
Synthetic Data Study—cont.

[Figure: accuracy as a function of the noise level; 4 runs on the same concept (C4.5: 100k examples, VFDT: 20 million examples).]
Lesion Study

[Figure: effect of initializing VFDT with C4.5, with and without over-pruning.]
Web Data

Applying VFDT to mining the stream of Web page requests from the whole University of Washington main campus.

To mine 1.6 million examples:
- VFDT took 1540 seconds to do one pass over the training data, of which 983 seconds were spent reading data from disk.
- C4.5 took 24 hours to mine the same 1.6 million examples.
Conclusion

Hoeffding trees:
- A method for learning online from high-volume data streams.
- Learn in very small constant time per example.
- Guarantee high similarity to the corresponding batch-learned trees.

VFDT system:
- A high-performance data mining system based on Hoeffding trees.
- Effective in taking advantage of massive numbers of examples.
Qs & As

Q: Name the requirements for algorithms that overcome the limitations of currently available disk-based algorithms.

A:
- Operate continuously and indefinitely.
- Incorporate examples as they arrive, never losing potentially valuable information.
- Build a model using at most one scan of the data.
- Use only a fixed amount of main memory.
- Require small constant time per record.
- Make a usable model available at any point in time.
- Produce a model equivalent to the one obtained by an ordinary database mining algorithm.
- When the data-generating process changes over time, keep the model up-to-date at all times.
Qs & As

Q: What are the benefits of considering only a subset of the training examples to find the best attribute?

A:
- Works for extremely large datasets.
- Reads each example at most once.
- Directly mines online data sources.
- Builds complex trees with acceptable computational cost.
Qs & As

Q: How does VFDT's tie refinement to the Hoeffding tree algorithm work?

A:
- Two or more attributes may have very similar Ḡ's.
- Potentially many examples would be required to decide between them with high confidence.
- Yet it makes little difference which attribute is chosen.
- So if ΔḠ < ε < τ (the user-specified tie threshold), split on the current best attribute.