Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree
Transcript of Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree
![Page 1: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/1.jpg)
Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree
Wei Fan, Kun Zhang, Hong Cheng,
Jing Gao, Xifeng Yan, Jiawei Han,
Philip S. Yu, Olivier Verscheure
How to find good features from semi-structured raw data for classification
![Page 2: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/2.jpg)
Feature Construction
Most data mining and machine learning models assume structured data of the form (x1, x2, ..., xk) -> y, where the xi are independent variables and y is the dependent variable.
If y is drawn from a discrete set, the task is classification; if y is drawn from a continuous variable, it is regression.
When the feature vectors are good, differences in accuracy among learners are small.
Question: where do good features come from?
![Page 3: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/3.jpg)
Frequent Pattern-Based Feature Extraction
Data is often not in pre-defined feature vectors: transactions, biological sequences, graph databases.
Frequent patterns are good candidates for discriminative features. So, how do we mine them?
![Page 4: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/4.jpg)
FP: Sub-graph
A discovered sub-graph pattern shared by compounds NSC 4960, NSC 191370, NSC 40773, NSC 164863, and NSC 699181 (chemical structure diagrams).
(example borrowed from George Karypis's presentation)
![Page 5: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/5.jpg)
Frequent Pattern Feature Vector Representation

        P1  P2  P3
Data1    1   1   0
Data2    1   0   1
Data3    1   1   0
Data4    0   0   1
...

[Decision-tree figure on the iris data: split on Petal.Length < 2.45, then Petal.Width < 1.75, into setosa, versicolor, virginica]

Any classifiers you can name: NN, DT, SVM, LR.

Mining these predictive features is an NP-hard problem: 100 examples can yield up to 10^10 patterns, and most are useless.
![Page 6: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/6.jpg)
Example: 192 examples
At 12% support (at least 12% of examples contain the pattern), itemset mining returns 8,600 patterns: 192 examples vs. 8,600 patterns?
At 4% support, 92,000 patterns: 192 vs. 92,000??
Most patterns have no predictive power and cannot be used to construct features.
Our algorithm finds only 20 highly predictive patterns, which can construct a decision tree with about 90% accuracy.
![Page 7: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/7.jpg)
Data in a "bad" feature space
Discriminative patterns are a non-linear combination of single features; they increase the expressive and discriminative power of the feature space.
An example:

 X   Y  C
 0   0  0
 1   1  1
-1   1  1
 1  -1  1
-1  -1  1

The data is non-linearly separable in (x, y). [Scatter plot of the five points in the (x, y) plane]
![Page 8: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/8.jpg)
New Feature Space

Solving the problem: mine & transform, mapping the data to a different space.

 X   Y  C           X   Y  F: x=0, y=0   C
 0   0  0           0   0       1        0
 1   1  1    =>     1   1       0        1
-1   1  1          -1   1       0        1
 1  -1  1           1  -1       0        1
-1  -1  1          -1  -1       0        1

The data is linearly separable in (x, y, F).
ItemSet: F: x=0, y=0. Association rule: F: x=0 => y=0.
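The mapping above can be sketched in a few lines of Python (a minimal illustration; the function name `mined_feature` is mine, not from the paper): the five points are not linearly separable in (x, y), but the mined itemset feature F: (x=0, y=0) separates them with a single threshold.

```python
# The XOR-like table from the slide: not linearly separable in (x, y).
data = [  # (x, y, class)
    (0, 0, 0),
    (1, 1, 1),
    (-1, 1, 1),
    (1, -1, 1),
    (-1, -1, 1),
]

def mined_feature(x, y):
    """Binary feature for the mined itemset pattern {x=0, y=0}."""
    return 1 if (x == 0 and y == 0) else 0

# In the transformed space (x, y, F), the single linear rule
# "predict class 0 iff F == 1" classifies every example correctly.
for x, y, c in data:
    assert (0 if mined_feature(x, y) == 1 else 1) == c
```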
![Page 9: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/9.jpg)
Computational Issues
A pattern is measured by its "frequency" or support, e.g. frequent subgraphs with sup >= 10%: at least 10% of examples contain the pattern.
"Ordered" enumeration: cannot enumerate patterns with sup = 10% without first enumerating all patterns with sup > 10%.
An NP-hard problem, easily up to 10^10 patterns for a realistic problem; most patterns are non-discriminative, yet low-support patterns can have high discriminative power. Bad!
Random sampling does not work since it is not exhaustive: most patterns are useless, so randomly sampling patterns (or blindly enumerating without considering frequency) is useless.
Using a small number of examples does not work either: if they cover only a subset of the vocabulary, the search is incomplete; if they cover the complete vocabulary, it won't help much but introduces a sample selection bias problem, particularly missing low-support but high-information-gain patterns.
![Page 10: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/10.jpg)
Conventional Procedure: Two-Step Batch Method (Feature Construction and Selection)

1. Mine frequent patterns (sup > min_sup): DataSet --mine--> Frequent Patterns 1, 2, 3, 4, 5, 6, 7, ...
2. Select the most discriminative patterns: --select--> Mined Discriminative Patterns 1, 2, 4.
3. Represent data in the feature space using such patterns:

        F1  F2  F4
Data1    1   1   0
Data2    1   0   1
Data3    1   1   0
Data4    0   0   1
...

4. Build classification models: any classifiers you can name (NN, DT, SVM, LR).

[Decision-tree figure on the iris data]
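The four steps above can be sketched on toy transactions (a minimal sketch; the names `mine_frequent`, `info_gain`, the toy data, and the naive enumeration are mine, not the paper's code — a real step 1 would call a frequent itemset or subgraph miner):

```python
from itertools import combinations
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(pattern, transactions, labels):
    """Information gain of splitting the data on `pattern` membership."""
    n = len(labels)
    covered = [y for t, y in zip(transactions, labels) if pattern <= t]
    rest = [y for t, y in zip(transactions, labels) if not pattern <= t]
    cond = (len(covered) / n) * entropy(covered) + (len(rest) / n) * entropy(rest)
    return entropy(labels) - cond

def mine_frequent(transactions, minsup, max_size=2):
    """Step 1: naive enumeration of all itemsets with support >= minsup."""
    items = sorted(set().union(*transactions))
    return [frozenset(c)
            for k in range(1, max_size + 1)
            for c in combinations(items, k)
            if sum(set(c) <= t for t in transactions) / len(transactions) >= minsup]

transactions = [frozenset(t) for t in
                [{"a", "b"}, {"a", "c"}, {"a", "b", "d"}, {"c", "d"}]]
labels = [1, 0, 1, 0]

freq = mine_frequent(transactions, minsup=0.5)                 # step 1: mine
top = sorted(freq, key=lambda p: info_gain(p, transactions, labels),
             reverse=True)[:2]                                 # step 2: select
features = [[int(p <= t) for p in top] for t in transactions]  # step 3: represent
# Step 4: feed `features` and `labels` to any classifier (NN, DT, SVM, LR).
```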
![Page 11: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/11.jpg)
Two Problems
Mine step: combinatorial explosion.
1. Exponential explosion of frequent patterns.
2. Patterns are not considered at all if min_support isn't small enough.
![Page 12: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/12.jpg)
Two Problems (cont.)
Select step: issue of discriminative power.
3. InfoGain is computed against the complete dataset, NOT on subsets of examples.
4. Correlation among patterns is not directly evaluated on their joint predictability.
![Page 13: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/13.jpg)
Direct Mining & Selection via Model-based Search Tree

Basic Flow: divide-and-conquer based frequent pattern mining.

[Tree diagram: at the root node 1, "Mine & Select" runs on the full dataset with local support P = 20%, keeping the most discriminative feature F by information gain. Examples matching F follow the Y branch, the rest the N branch; each child node (2, 3, ..., 7, ...) repeats "Mine & Select P: 20%" on its own subset of examples, until a node is pure (+ / -) or has few data.]

The tree is both a feature miner and a classifier, and returns a compact set of highly discriminative patterns.

Global Support: a node holding 10 of 10,000 examples, mined at local support 20%, corresponds to 10*20%/10000 = 0.02% global support.
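The flow above can be sketched as a recursive procedure (a hedged sketch, not the paper's implementation: `mine_patterns` below is a stand-in single-item miner, where the actual algorithm plugs in any itemset/sequence/graph miner at the node-local support, e.g. P = 20%):

```python
from math import log2

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(pattern, transactions, labels):
    n = len(labels)
    covered = [y for t, y in zip(transactions, labels) if pattern <= t]
    rest = [y for t, y in zip(transactions, labels) if not pattern <= t]
    return entropy(labels) - ((len(covered) / n) * entropy(covered)
                              + (len(rest) / n) * entropy(rest))

def mine_patterns(transactions, local_sup=0.2):
    """Stand-in miner: frequent single items at the node-local support.
    The real algorithm plugs in any frequent pattern miner here."""
    n = len(transactions)
    items = set().union(*transactions)
    return [frozenset([i]) for i in items
            if sum(i in t for t in transactions) / n >= local_sup]

def build_mbt(transactions, labels, min_node=2):
    """Mine & select at this node, split on the best pattern, recurse."""
    if len(set(labels)) <= 1 or len(labels) < min_node:
        return {"leaf": max(set(labels), key=labels.count)}
    patterns = mine_patterns(transactions)
    best = max(patterns, default=None,
               key=lambda p: info_gain(p, transactions, labels))
    if best is None or info_gain(best, transactions, labels) == 0:
        return {"leaf": max(set(labels), key=labels.count)}
    yes = [(t, y) for t, y in zip(transactions, labels) if best <= t]
    no = [(t, y) for t, y in zip(transactions, labels) if not best <= t]
    return {"pattern": best,
            "yes": build_mbt([t for t, _ in yes], [y for _, y in yes], min_node),
            "no": build_mbt([t for t, _ in no], [y for _, y in no], min_node)}

def predict(node, t):
    while "pattern" in node:
        node = node["yes"] if node["pattern"] <= t else node["no"]
    return node["leaf"]
```

Because each recursive call mines only its own node's examples, a pattern that is frequent inside a small node can have a tiny global support, which is the point of the 10*20%/10000 = 0.02% figure above.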
![Page 14: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/14.jpg)
Analyses (I)
1. Scalability (Theorem 1): an upper bound on the number of patterns enumerated, and the "scale-down" ratio needed to obtain extremely low-support patterns.
2. Bound on the number of returned features (Theorem 2).
![Page 15: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/15.jpg)
Analyses (II)
3. Subspace is important for discriminative patterns.
On the original set there is no information gain when P1/C1 = P0/C0, where C1 and C0 are the numbers of examples belonging to class 1 and class 0, and P1 and P0 are the numbers of examples in C1 and C0 that contain a pattern α.
Subsets could still have information gain for the same pattern:
4. Non-overfitting.
5. Optimality under exhaustive search.
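A quick numeric check of point 3 (toy numbers of my own choosing, not from the paper): a pattern with P1/C1 = P0/C0 has zero information gain on the full set, yet positive gain on a subset where the proportions differ.

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy of a node with `pos` positive and `neg` negative examples."""
    n = pos + neg
    return -sum((k / n) * log2(k / n) for k in (pos, neg) if k)

def info_gain(c1, c0, p1, p0):
    """IG of splitting c1 positives / c0 negatives on a pattern alpha
    that covers p1 of the positives and p0 of the negatives."""
    n = c1 + c0
    cov, unc = p1 + p0, n - p1 - p0
    cond = ((cov / n) * entropy(p1, p0) if cov else 0.0) \
         + ((unc / n) * entropy(c1 - p1, c0 - p0) if unc else 0.0)
    return entropy(c1, c0) - cond

# Full set: C1 = C0 = 40, pattern covers P1 = 20, P0 = 20.
# P1/C1 = P0/C0 = 0.5, so the gain is exactly zero.
full = info_gain(40, 40, 20, 20)

# A subset deeper in the tree: C1 = 30, C0 = 10, P1 = 20, P0 = 0.
# The same pattern now has clearly positive gain.
sub = info_gain(30, 10, 20, 0)
```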
![Page 16: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/16.jpg)
Experimental Studies: Itemset Mining (I)

Scalability Comparison

[Bar charts: Log(DT #Pat) vs. Log(MbT #Pat), and Log(DT AbsSupport) vs. Log(MbT AbsSupport), on Adult, Chess, Hypo, Sick, Sonar]

Dataset | #Pat using MbT sup | Ratio (MbT #Pat / #Pat using MbT sup)
Adult   | 252,809            | 0.41%
Chess   | +∞                 | ~0%
Hypo    | 423,439            | 0.0035%
Sick    | 4,818,391          | 0.00032%
Sonar   | 95,507             | 0.00775%
![Page 17: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/17.jpg)
Experimental Studies: Itemset Mining (II)

Accuracy of Mined Itemsets

[Bar chart: DT Accuracy vs. MbT Accuracy, 70%-100%, on Adult, Chess, Hypo, Sick, Sonar: 4 wins, 1 loss for MbT]

[Bar chart: Log(DT #Pat) vs. Log(MbT #Pat) on the same datasets: MbT uses a much smaller number of patterns]
![Page 18: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/18.jpg)
Experimental Studies: Itemset Mining (III)
Convergence
![Page 19: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/19.jpg)
Experimental Studies: Graph Mining (I)

9 NCI anti-cancer screen datasets (The PubChem Project, pubchem.ncbi.nlm.nih.gov); active (positive) class: around 1%-8.3%.
2 AIDS anti-viral screen datasets (http://dtp.nci.nih.gov); H1: CM+CA, 3.5%; H2: CA, 1%.

[Chemical structure diagram of an example compound]
![Page 20: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/20.jpg)
Experimental Studies: Graph Mining (II)

Scalability

[Bar chart: DT #Pat vs. MbT #Pat, 0-1800, on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, H2]

[Bar chart: Log(DT Abs Support) vs. Log(MbT Abs Support), 0-4, on the same datasets]
![Page 21: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/21.jpg)
Experimental Studies: Graph Mining (III)

AUC and Accuracy

[Bar chart: AUC, 0.5-0.8, of DT vs. MbT on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, H2: 11 wins for MbT]

[Bar chart: Accuracy, 0.88-1, of DT vs. MbT on the same datasets: 10 wins, 1 loss]
![Page 22: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/22.jpg)
Experimental Studies: Graph Mining (IV)

AUC of MbT and DT; MbT vs. benchmarks: 7 wins, 4 losses.
![Page 23: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/23.jpg)
Summary

Model-based Search Tree:
Integrated feature mining and construction.
Dynamic support: can mine patterns with extremely small support.
Both a feature constructor and a classifier.
Not limited to one type of frequent pattern: plug-and-play.

Experimental results: itemset mining and graph mining.

Software and datasets available from: www.cs.columbia.edu/~wfan
![Page 24: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812e3a550346895d93aeba/html5/thumbnails/24.jpg)