A General Framework for Mining Massive Data Streams
A General Framework for Mining Massive Data Streams
Geoff Hulten
Advised by Pedro Domingos
Mining Massive Data Streams
• High-speed data streams abundant
  – Large retailers
  – Long distance & cellular phone call records
  – Scientific projects
  – Large Web sites
• Build model of the process creating data
• Use model to interact more efficiently
Growing Mismatch Between Algorithms and Data
• State of the art data mining algorithms
  – One shot learning
  – Work with static databases
  – Maximum of 1 million – 10 million records
• Properties of data streams
  – Data stream exists over months or years
  – 10s – 100s of millions of new records per day
  – Process generating data changing over time
The Cost of This Mismatch
• Fraction of data we can effectively mine shrinking towards zero
• Models learned from heuristically selected samples of data
• Models out of date before being deployed
Need New Algorithms
• Monitor a data stream and have a model available at all times
• Improve the model as data arrives
• Adapt the model as process generating data changes
• Have quality guarantees
• Work within strict resource constraints
Solution: General Framework
• Applicable to algorithms based on discrete search
• Semi-automatically converts algorithm to meet our design needs
• Uses sampling to select data size for each search step
• Extensions to continuous searches and relational data
Outline
• Introduction
• Scaling up Decision Trees
• Our Framework for Scaling
• Other Applications and Results
• Conclusion
Decision Trees
• Examples: ⟨x₁, …, x_D, y⟩
• Encode: y = F(x₁, …, x_D)
• Nodes contain tests
• Leaves contain predictions

  Gender?
    Male → False
    Female → Age?
      < 25 → False
      >= 25 → True
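The example tree on this slide can be sketched as a small data structure; the nested-tuple encoding, the `predict` helper, and the attribute names are illustrative assumptions, not part of the original system.

```python
# A minimal sketch of the slide's example tree. Internal nodes are
# (attribute, {value: subtree}) tuples; leaves are plain predictions.
# Attribute names ("gender", "age<25") are illustrative assumptions.

def predict(tree, example):
    """Follow the tests until a leaf (a non-tuple value) is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

# Gender? -> male: False; female: Age? -> < 25: False, >= 25: True
tree = ("gender", {
    "male": False,
    "female": ("age<25", {True: False, False: True}),
})

print(predict(tree, {"gender": "male"}))                     # False
print(predict(tree, {"gender": "female", "age<25": False}))  # True
```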
Decision Tree Induction
DecisionTree(Data D, Tree T, Attributes A)
  If D is pure:
    Let T be a leaf predicting the class in D
    Return
  Let X be the best of A according to D and G()
  Let T be a node that splits on X
  For each value V of X:
    Let D_V be the portion of D with value V for X
    Let T_V be the child of T for V
    DecisionTree(D_V, T_V, A − X)
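The recursion on this slide can be sketched in runnable form; the data layout (a list of dicts with a `"y"` label key) and the use of information gain for G() are illustrative assumptions, not the original code.

```python
# A hedged, minimal sketch of the DecisionTree procedure above.
# Data layout assumption: each example is a dict with attribute keys
# plus a "y" label key; `attrs` is a set of attribute names.
from collections import Counter
from math import log2

def entropy(rows):
    n = len(rows)
    return -sum(c / n * log2(c / n)
                for c in Counter(r["y"] for r in rows).values())

def gain(rows, attr):
    """G(): entropy reduction from splitting rows on attr."""
    parts = {}
    for r in rows:
        parts.setdefault(r[attr], []).append(r)
    return entropy(rows) - sum(len(p) / len(rows) * entropy(p)
                               for p in parts.values())

def decision_tree(rows, attrs):
    labels = {r["y"] for r in rows}
    if len(labels) == 1 or not attrs:      # D is pure, or no attributes left
        return Counter(r["y"] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))   # best of A by G()
    parts = {}
    for r in rows:                          # D_V: portion of D with V for X
        parts.setdefault(r[best], []).append(r)
    return (best, {v: decision_tree(p, attrs - {best})
                   for v, p in parts.items()})
```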
VFDT (Very Fast Decision Tree)
• To pick the split attribute for a node, looking at a few examples may be sufficient
• Given a stream of examples:
  – Use the first ones to pick the split at the root
  – Sort succeeding ones to the leaves
  – Pick the best attribute there
  – Continue…
• Leaves predict the most common class
• Very fast, incremental, anytime decision tree induction algorithm
How Much Data?
• Make sure the best attribute is better than the second best
  – That is: G(X₁) − G(X₂) > 0
• Using a sample, so we need the Hoeffding bound
  – Collect data until: G(X₁) − G(X₂) > ε, where
    ε = sqrt( R² ln(1/δ) / (2n) )
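The Hoeffding bound on this slide can be computed directly; the helper names below are illustrative.

```python
# A small sketch of the slide's Hoeffding bound:
#     eps = sqrt(R^2 * ln(1/delta) / (2 * n))
# R is the range of the score G (e.g. lg(#classes) for information gain),
# delta the allowed error probability, n the number of examples seen.
from math import ceil, log, sqrt

def hoeffding_epsilon(R, delta, n):
    return sqrt(R * R * log(1.0 / delta) / (2.0 * n))

def examples_needed(R, delta, target_eps):
    """Smallest n for which hoeffding_epsilon(R, delta, n) <= target_eps."""
    return ceil(R * R * log(1.0 / delta) / (2.0 * target_eps ** 2))
```

With R = 1 and δ = 10⁻⁷ (the setting quoted in the experiments later in the talk), roughly 800 examples suffice to separate scores that differ by 0.1.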
Core VFDT Algorithm
Procedure VFDT(Stream, δ)
  Let T = tree with a single leaf (the root)
  Initialize sufficient statistics at the root
  For each example (X, y) in Stream:
    Sort (X, y) to a leaf using T
    Update sufficient statistics at the leaf
    Compute G for each attribute
    If G(best) − G(2nd best) > ε, then:
      Split the leaf on the best attribute
      For each branch:
        Start a new leaf and initialize its sufficient statistics
  Return T

(Example tree: x1? → male: y=0, female: x2?; x2? → > 65: y=0, <= 65: y=1.)
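The per-leaf bookkeeping and the Hoeffding split test in the loop above can be sketched for the simplest case, binary attributes and a binary class; the count layout, class structure, and names below are illustrative assumptions, not the paper's implementation.

```python
# A compact sketch of one VFDT leaf: sufficient statistics are per-attribute
# counts, G is information gain, and try_split applies the Hoeffding test.
# Binary attributes and labels only; structure and names are assumptions.
from math import log, log2, sqrt

def entropy(pos, neg):
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -p * log2(p) - (1 - p) * log2(1 - p)

class Leaf:
    def __init__(self, attrs, delta=1e-7):
        self.attrs = list(attrs)
        self.delta = delta
        self.n = 0
        # counts[a][v] = [negatives, positives] among examples with x[a] == v
        self.counts = {a: {0: [0, 0], 1: [0, 0]} for a in self.attrs}

    def update(self, x, y):
        """Update sufficient statistics with one example (then discard it)."""
        self.n += 1
        for a in self.attrs:
            self.counts[a][x[a]][y] += 1

    def gain(self, a):
        c = self.counts[a]
        before = entropy(c[0][1] + c[1][1], c[0][0] + c[1][0])
        after = sum((c[v][0] + c[v][1]) / self.n * entropy(c[v][1], c[v][0])
                    for v in (0, 1))
        return before - after

    def try_split(self):
        """Return the split attribute if G(best) - G(2nd best) > eps, else None."""
        if self.n == 0 or len(self.attrs) < 2:
            return None
        gains = sorted((self.gain(a), a) for a in self.attrs)
        eps = sqrt(log(1 / self.delta) / (2 * self.n))  # R = 1 for binary class
        if gains[-1][0] - gains[-2][0] > eps:
            return gains[-1][1]
        return None
```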
Quality of Trees from VFDT
• Model may contain incorrect splits; is it still useful?
• Bound the difference with the infinite-data tree
  – The chance that an arbitrary example takes a different path
• Intuition: an example on level i of the tree has i chances to go through a mistaken node

  E[Δ(HT_δ, DT_∞)] ≤ δ/p
  (p: probability that an example reaches a given leaf)
Complete VFDT System
• Memory management
  – Memory dominated by sufficient statistics
  – Deactivate less promising leaves when needed
• Ties
  – Wasteful to decide between identical attributes
• Check for splits periodically
• Pre-pruning
  – Only make splits that improve the value of G(.)
• Early stop on bad attributes
VFDT (Continued)
• Bootstrap with traditional learner
• Rescan dataset when time available
• Time changing data streams
• Post pruning
• Continuous attributes
• Batch mode
Experiments
• Compared VFDT and C4.5 (Quinlan, 1993)
• Same memory limit for both (40 MB)
  – 100k examples for C4.5
• VFDT settings: δ = 10^-7, τ = 5%
• Domains: 2 classes, 100 binary attributes
• Fifteen synthetic trees with 2.2k – 500k leaves
• Noise from 0% to 30%
(Three slides of experimental result graphs; figures only.)
Running Times
• Pentium III at 500 MHz running Linux
• C4.5 takes 35 seconds to read and process 100k examples; VFDT takes 47 seconds
• VFDT takes 6377 seconds for 20 million examples: 5752s to read, 625s to process
• VFDT processes 32k examples per second (excluding I/O)
(Two slides of result graphs; figures only.)
Real World Data Sets: Trace of UW Web Requests
• Stream of Web page requests from UW
• One week: 23k clients, 170 orgs., 244k hosts, 82.8M requests (peak: 17k/min), 20 GB
• Goal: improve cache by predicting requests
• 1.6M examples, 61% default class
• C4.5 on 75k examples, 2975 secs.
  – 73.3% accuracy
• VFDT: ~3000 secs., 74.3% accuracy
Outline
• Introduction
• Scaling up Decision Trees
• Our Framework for Scaling
• Overview of Applications and Results
• Conclusion
Data Mining as Discrete Search
• Initial state
  – Empty, prior, or random
• Search operators
  – Refine structure
• Evaluation function
  – Likelihood, or many others
• Goal state
  – Local optimum, etc.
Data Mining As Search
(Diagram: successive search states, each evaluated on the training data, with scores such as 1.5, 1.7, 1.8, 1.9, and 2.0.)
Example: Decision Tree
(Diagram: the root node refined by turning leaves into tests X1? … Xd?, each candidate scored on the training data, e.g. 1.7 and 1.5.)
• Initial state
  – Root node
• Search operators
  – Turn any leaf into a test on an attribute
• Evaluation
  – Entropy reduction: Entropy = −Σ_{i ∈ vals(y)} p_i lg p_i
• Goal state
  – No further gain
  – Post-prune
Overview of Framework
• Cast the learning algorithm as a search
• Begin monitoring the data stream
  – Use each example to update sufficient statistics where appropriate (then discard it)
  – Periodically pause and use statistical tests
• Take steps that can be made with high confidence
• Monitor old search decisions
  – Change them when the data stream changes
How Much Data is Enough?
(Diagram: two candidate refinements, X1? and Xd?, scored 1.65 and 1.38 on the full training data.)
How Much Data is Enough?
(Diagram: the candidates scored on a sample of the data: 1.6 ± ε and 1.4 ± ε.)
• Use statistical bounds
  – Normal distribution
  – Hoeffding bound: ε = sqrt( R² ln(1/δ) / (2n) )
• Applies to scores that are averages over examples
• Can select a winner if
  – Score₁ > Score₂ + ε
Global Quality Guarantee
• δ – probability of error in single decision
• b – branching factor of search
• d – depth of search
• c – number of checks for winner
δ* = δ · b · d · c
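The union-bound style relation on this slide can be sketched directly; the helper names below are illustrative.

```python
# A small sketch of the slide's global guarantee: with per-decision error
# probability delta and b*d*c individual decisions, the chance of any
# mistaken decision is at most delta* = delta * b * d * c.

def global_delta(delta, b, d, c):
    return delta * b * d * c

def per_decision_delta(delta_star, b, d, c):
    """delta to use per decision so the whole search errs w.p. <= delta_star."""
    return delta_star / (b * d * c)
```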
Identical States And Ties
• Fails if states are identical (or nearly so)
• τ – user supplied tie parameter
• Select a winner early if the alternatives differ by less than τ
  – Score₁ > Score₂ + ε, or
  – ε <= τ
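The tie rule above can be expressed as a single predicate; the function name and the default R are illustrative assumptions.

```python
# A hedged sketch of winner selection with ties: select when the observed
# lead exceeds eps, or when eps itself has shrunk below the user-supplied
# tie threshold tau (the alternatives are then effectively indistinguishable).
from math import log, sqrt

def can_select_winner(score1, score2, n, delta, tau, R=1.0):
    eps = sqrt(R * R * log(1.0 / delta) / (2.0 * n))
    return score1 > score2 + eps or eps <= tau
```

With δ = 10⁻⁷ and τ = 5% (the settings quoted earlier in the talk), near-identical attributes stop blocking a split once ε falls below 0.05.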
Dealing with Time Changing Concepts
• Maintain a window of the most recent examples
• Keep the model up to date with this window
• Effective when the window size is similar to the concept drift rate
• Traditional approach
  – Periodically reapply the learner
  – Very inefficient!
• Our approach
  – Monitor the quality of old decisions as the window shifts
  – Correct decisions in a fine-grained manner
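The sliding-window idea above can be sketched with a structure that adds each arriving example and forgets the oldest one; the count layout is an illustrative assumption (CVFDT itself maintains sufficient statistics at every node an example passes through).

```python
# A minimal sketch of windowed statistics: keep the last `window` examples,
# add each new example's contribution and subtract the oldest one's as it
# falls out, so the counts always reflect the current window.
from collections import Counter, deque

class WindowedCounts:
    def __init__(self, window=1000):
        self.window = window
        self.examples = deque()
        self.class_counts = Counter()

    def add(self, x, y):
        self.examples.append((x, y))
        self.class_counts[y] += 1
        if len(self.examples) > self.window:
            _, old_y = self.examples.popleft()   # forget the oldest example
            self.class_counts[old_y] -= 1

    def majority_class(self):
        return self.class_counts.most_common(1)[0][0]
```

After a concept shift, the counts converge to the new distribution within one window's worth of examples.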
Alternate Searches
• When a new test looks better, grow an alternate sub-tree
• Replace the old one when the new one is more accurate
• This smoothly adjusts to changing concepts

(Diagram: a tree with tests Gender?, Pets?, College?, and Hair? and true/false leaves, with an alternate sub-tree growing alongside the original.)
RAM Limitations
• Each search requires a sufficient-statistics structure
• Decision tree: O(avc) RAM
• Bayesian network: O(c^p) RAM
RAM Limitations (continued)

(Diagram: statistics structures marked Active or Temporarily inactive.)
Outline
• Introduction
• Data Mining as Discrete Search
• Our Framework for Scaling
• Application to Decision Trees
• Other Applications and Results
• Conclusion
Applications
• VFDT (KDD ’00) – Decision Trees
• CVFDT (KDD ’01) – VFDT + concept drift
• VFBN & VFBN2 (KDD ’02) – Bayesian Networks
• Continuous Searches
  – VFKM (ICML ’01) – K-Means clustering
  – VFEM (NIPS ’01) – EM for mixtures of Gaussians
• Relational Data Sets
  – VFREL (submitted) – Feature selection in relational data
CVFDT Experiments

(Figure only.)
Activity Profile for VFBN
![Page 39: A General Framework for Mining Massive Data Streams](https://reader030.fdocuments.in/reader030/viewer/2022033022/56812bc5550346895d900ad1/html5/thumbnails/39.jpg)
Other Real World Data Sets
• Trace of all Web requests from the UW campus
  – Use clustering to find good locations for proxy caches
• KDD Cup 2000 data set
  – 700k page requests from an e-commerce site
  – Categorize pages into 65 categories; predict which a session will visit
• UW CSE data set
  – 8 million sessions over two years
  – Predict which of 80 level-2 directories each session visits
• Web crawl of .edu sites
  – Two data sets, each with two million Web pages
  – Use relational structure to predict which pages will increase in popularity over time
Related Work
• DB Mine: A Performance Perspective (Agrawal, Imielinski, Swami ’93)
  – Framework for scaling rule learning
• RainForest (Gehrke, Ramakrishnan, Ganti ’98)
  – Framework for scaling decision trees
• ADtrees (Moore, Lee ’97)
  – Accelerate computing sufficient statistics
• PALO (Greiner ’92)
  – Accelerate hill-climbing search via sampling
• DEMON (Ganti, Gehrke, Ramakrishnan ’00)
  – Framework for converting incremental algorithms for time-changing data streams
Future Work
• Combine the framework for discrete search with frameworks for continuous search and relational learning
• Further study time-changing processes
• Develop a language for specifying data stream learning algorithms
• Use the framework to develop novel algorithms for massive data streams
• Apply the algorithms to more real-world problems
Conclusion
• Framework helps scale up learning algorithms based on discrete search
• Resulting algorithms:
  – Work on databases and data streams
  – Work with limited resources
  – Adapt to time-changing concepts
  – Learn in time proportional to concept complexity (independent of the amount of training data!)
• Benefits have been demonstrated in a series of applications