![Page 1: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/1.jpg)
A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions
Jing Gao† Wei Fan‡ Jiawei Han† Philip S. Yu‡
†University of Illinois at Urbana-Champaign
‡IBM T. J. Watson Research Center
![Page 2: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/2.jpg)
Introduction (1)
• Data Stream
– Continuously arriving data flow
– Applications: network traffic, credit card transaction flow, phone calling records, etc.
![Page 3: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/3.jpg)
Introduction (2)
• Stream Classification
– Construct a classification model based on past records
– Use the model to predict labels for new data
– Help decision making
[Figure: a classification model labeling a record; labels: Fraud?, Fraud]
![Page 4: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/4.jpg)
Framework
[Figure: a classification model predicts the unlabeled record (?) in the stream]
![Page 5: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/5.jpg)
Concept Drifts
• Changes in P(x,y)
– P(x,y) = P(y|x)P(x), where x is the feature vector and y is the class label
– No Change, Feature Change, Conditional Change, Dual Change
– Expected error is not a good indicator of concept drifts
– Training on the most recent data could help reduce expected error
[Figure panels: Time Stamp 1, Time Stamp 11, Time Stamp 21]
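The four drift cases follow from the factorization P(x,y) = P(y|x)P(x). As a minimal sketch (not from the slides), the hypothetical helpers below factor two small discrete joint distributions and report which factor changed. The dictionary representation, the tolerance, and the assumption that both distributions share the same feature values are illustrative choices only.

```python
from collections import defaultdict


def factor(joint):
    """Split a discrete joint P(x, y), given as {(x, y): prob}, into P(x) and P(y|x)."""
    px = defaultdict(float)
    for (x, _y), p in joint.items():
        px[x] += p
    pyx = {(x, y): p / px[x] for (x, y), p in joint.items() if px[x] > 0}
    return dict(px), pyx


def differs(a, b, tol=1e-9):
    """True if two probability tables disagree on any key by more than tol."""
    keys = set(a) | set(b)
    return any(abs(a.get(k, 0.0) - b.get(k, 0.0)) > tol for k in keys)


def drift_type(joint_old, joint_new):
    """Name the drift case by checking which factor of P(x, y) changed."""
    px_old, pyx_old = factor(joint_old)
    px_new, pyx_new = factor(joint_new)
    feature = differs(px_old, px_new)        # did P(x) change?
    conditional = differs(pyx_old, pyx_new)  # did P(y|x) change?
    if feature and conditional:
        return "Dual Change"
    if feature:
        return "Feature Change"
    if conditional:
        return "Conditional Change"
    return "No Change"


# Toy distributions over x in {"a", "b"} and y in {0, 1} (made-up numbers).
old = {("a", 0): 0.4, ("a", 1): 0.1, ("b", 0): 0.1, ("b", 1): 0.4}
feature_only = {("a", 0): 0.64, ("a", 1): 0.16, ("b", 0): 0.04, ("b", 1): 0.16}
conditional_only = {("a", 0): 0.2, ("a", 1): 0.3, ("b", 0): 0.1, ("b", 1): 0.4}
dual = {("a", 0): 0.3, ("a", 1): 0.5, ("b", 0): 0.1, ("b", 1): 0.1}
```

In `feature_only`, the share of "a" grows from 0.5 to 0.8 while P(y|x) stays fixed; in `conditional_only`, P(x) is unchanged but P(y=1|x="a") moves from 0.2 to 0.6.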
![Page 6: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/6.jpg)
Issues in Stream Classification(1)
• Generative Model
– P(y|x) follows some distribution
• Descriptive Model
– Let the data decide
• Stream Data
– Distribution unknown and evolving
![Page 7: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/7.jpg)
Issues in Stream Classification(2)
• Label Prediction
– Classify x into one class
• Probability Estimation
– x is assigned to all classes with different probabilities
• Stream Applications
– Stochastic, prediction confidence information is needed
![Page 8: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/8.jpg)
Mining Skewed Data Stream
• Skewed Distribution
– Credit card frauds, network intrusions
• Existing Stream Classification Algorithms
– Evaluated on balanced data
• Problems
– Ignore minority examples
– The cost of misclassifying minority examples is usually huge
[Figure: decision tree over positive (+) and negative (-) examples; every leaf node is classified as negative]
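A hedged toy example of why minority examples get ignored: the 1% positive rate below is an assumed figure, not one from the slides. A model that labels every record negative scores high accuracy yet detects no positive example.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall on the positive class) for 0/1 labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    positives = sum(1 for t in y_true if t == 1)
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    accuracy = correct / len(y_true)
    recall = true_positives / positives if positives else 0.0
    return accuracy, recall


# Assumed skew: 10 positives (e.g. frauds) among 1000 records.
y_true = [1] * 10 + [0] * 990
# A model whose every leaf predicts the negative class.
y_pred = [0] * 1000

accuracy, recall = accuracy_and_recall(y_true, y_pred)
```

Here `accuracy` is 0.99 while `recall` is 0.0, so accuracy alone hides the cost of missing every minority example.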
![Page 9: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/9.jpg)
Stream Ensemble Approach (1)
[Figure: past stream chunks followed by the chunk to predict (?)]
Training set? Insufficient positive examples!
Step 1: Sampling
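The sampling step can be sketched as follows, based on two points the slides make elsewhere: old positive examples are incorporated, and the classifiers are trained on disjoint sets of current negative examples. The function name `build_training_sets`, the `(features, label)` record layout, and the choice to use every current negative exactly once are assumptions for illustration, not the authors' stated procedure.

```python
import random


def build_training_sets(past_positives, current_chunk, k, seed=0):
    """Build k training sets for the most recent chunk.

    Each set holds all positives seen so far plus one disjoint subset of the
    current chunk's negatives. Records are (features, label) tuples, label 1
    for positive and 0 for negative (an assumed layout).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    positives = list(past_positives) + [r for r in current_chunk if r[1] == 1]
    negatives = [r for r in current_chunk if r[1] == 0]
    rng = random.Random(seed)
    rng.shuffle(negatives)
    # Deal the shuffled negatives round-robin into k disjoint subsets.
    subsets = [negatives[i::k] for i in range(k)]
    return [positives + subset for subset in subsets]


# Toy data: 3 old positives, a current chunk with 12 negatives and 1 positive.
past = [((i,), 1) for i in range(3)]
chunk = [((i,), 0) for i in range(10, 22)] + [((99,), 1)]
training_sets = build_training_sets(past, chunk, k=3)
```

Only negatives come from the current chunk alone, so they reflect the current concept, while the scarce positives are pooled across time to enlarge each training set.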
![Page 10: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/10.jpg)
Stream Ensemble Approach (2)
Step 2: Ensemble
[Figure: samples 1, 2, …, k are used to train classifiers C1, C2, …, Ck]
$f^E(x) = \frac{1}{k}\sum_{i=1}^{k} f^i(x)$
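The ensemble output is the mean of the k classifier outputs, f^E(x) = (1/k) Σ f^i(x). A minimal sketch, assuming each classifier is a callable that returns an estimate of the positive-class probability; the toy constant classifiers are placeholders, not trained models.

```python
def ensemble_probability(classifiers, x):
    """Average the probability estimates f^i(x) of the k classifiers."""
    if not classifiers:
        raise ValueError("need at least one classifier")
    return sum(f(x) for f in classifiers) / len(classifiers)


# Three placeholder classifiers that ignore x and return fixed estimates.
toy_classifiers = [lambda x: 0.25, lambda x: 0.5, lambda x: 0.75]
estimate = ensemble_probability(toy_classifiers, x=None)
```

Averaging probabilities, rather than taking a majority vote over labels, keeps the confidence information that the earlier slide on probability estimation calls for.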
![Page 11: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/11.jpg)
Why does this approach work?
• Incorporation of old positive examples
– increase the training size, reduce variance
– negative examples reflect current concepts, so the increase in boundary bias is small
• Ensemble
– reduce variance caused by single model
– disjoint sets of negative examples: the classifiers will make uncorrelated errors
• Bagging & Boosting
– running cost is much higher
– cannot generate reliable probability estimates for skewed distributions
![Page 12: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/12.jpg)
Analysis
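$f_c(x) = P(c|x) + \beta_c + \eta_c(x)$, $\sigma_b^2 = (\sigma_{\eta_{c_i}}^2 + \sigma_{\eta_{c_j}}^2)/s^2$
$f_c^E(x) = P(c|x) + \bar{\beta}_c + \bar{\eta}_c(x)$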
• Error Reduction
– Sampling
– Ensemble: $\sigma_{b^E}^2 = \frac{1}{k^2}\sum_{i=1}^{k}\sigma_{b_i}^2$
• Efficiency Analysis
– Single model: $O(d(n_p + k n_q)\log(n_p + k n_q))$
– Ensemble: $O(dk(n_p + n_q)\log(n_p + n_q))$
– Ensemble is more efficient
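The two quantitative claims on this slide can be checked numerically. The sketch below evaluates the ensemble boundary variance (1/k²) Σ σ²_{b_i} and the two cost expressions d(n_p + k·n_q)·log(n_p + k·n_q) for a single model and d·k·(n_p + n_q)·log(n_p + n_q) for the ensemble. The slide does not define its symbols, so reading d as the number of features, n_p as the number of positives, n_q as the number of negatives per subset, and k as the number of classifiers is an assumption, as are the example sizes.

```python
import math


def ensemble_boundary_variance(variances):
    """(1/k^2) * sum of the k individual boundary variances."""
    k = len(variances)
    if k == 0:
        raise ValueError("need at least one variance")
    return sum(variances) / k ** 2


def single_model_cost(d, n_p, n_q, k):
    """Cost term for one model trained on n_p positives and k*n_q negatives."""
    n = n_p + k * n_q
    return d * n * math.log(n)


def ensemble_cost(d, n_p, n_q, k):
    """Cost term for k models, each trained on n_p positives and n_q negatives."""
    n = n_p + n_q
    return d * k * n * math.log(n)


# Assumed sizes: 10 features, 100 positives, 1000 negatives per subset, k = 10.
single = single_model_cost(d=10, n_p=100, n_q=1000, k=10)
ensemble = ensemble_cost(d=10, n_p=100, n_q=1000, k=10)

# With equal individual variances, the ensemble variance shrinks by a factor of k.
reduced = ensemble_boundary_variance([0.04] * 4)
```

For these assumed sizes the ensemble term is the smaller of the two, and four classifiers with variance 0.04 each give an ensemble variance of about 0.01. The variance formula presumes uncorrelated errors, which is what the disjoint negative subsets are meant to provide.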
![Page 13: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/13.jpg)
Experiments
• Measures
– Mean Squared Error: $L = \frac{1}{n}\sum_{i=1}^{n}\left(f(x_i) - P(+|x_i)\right)^2$
– ROC Curve
– Recall-Precision Curve
• Baseline Methods
– NS: No sampling + Single Model
– SS: Sampling + Single Model
– SE: Sampling + Ensemble
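The mean squared error measure compares each estimated probability f(x_i) with the true positive-class posterior P(+|x_i) and averages the squared differences over n examples. A minimal sketch with made-up numbers follows; where the true posterior is unknown, using the 0/1 label as a stand-in is an assumption of this example.

```python
def mean_squared_error(estimates, true_posteriors):
    """L = (1/n) * sum_i (f(x_i) - P(+|x_i))^2."""
    if not estimates or len(estimates) != len(true_posteriors):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(estimates)
    return sum((f - p) ** 2 for f, p in zip(estimates, true_posteriors)) / n


# Made-up estimates and posteriors for three examples.
mse = mean_squared_error([0.5, 0.0, 1.0], [1.0, 0.0, 0.5])
```

Unlike accuracy, this measure penalizes poorly calibrated probabilities even when the predicted label would be correct.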
![Page 14: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/14.jpg)
Experimental Results (1)
Mean Squared Error on Synthetic Data
[Bar chart: mean squared error (y-axis, 0 to 0.9) of SE, NS, and SS under Feature, Conditional, and Dual change]
• Feature Change: only P(x) changes
• Conditional Change: only P(y|x) changes
• Dual Change: both P(x) and P(y|x) change
![Page 15: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/15.jpg)
Experimental Results (2)
Mean Squared Error on Real Data
[Bar chart: mean squared error (y-axis, 0 to 0.25) of SE, NS, and SS on Thyroid1, Thyroid2, Opt, Letter, and Covtype]
![Page 16: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/16.jpg)
Experimental Results (3)
[Figure: ROC curve and recall-precision plot on synthetic data]
![Page 17: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/17.jpg)
Experimental Results (4)
[Figure: ROC curve and recall-precision plot on real data]
![Page 18: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/18.jpg)
Experimental Results (5)
Training Time
![Page 19: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/19.jpg)
Conclusions
• General issues in stream classification
– concept drifts
– descriptive model
– probability estimation
• Mining skewed data streams
– sampling and ensemble techniques
– accurate and efficient
• Wide applications
– graph data
– airforce data
![Page 20: A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.](https://reader036.fdocuments.in/reader036/viewer/2022081602/551515ac550346a80c8b5d7a/html5/thumbnails/20.jpg)
Thanks!
• Any questions?