CIKM 2009 - Efficient itemset generator discovery over a stream sliding window
Efficient Itemset Generator Discovery over a Stream Sliding Window
Chuancong Gao, Jianyong Wang
Database Laboratory, Department of Computer Science and Technology
Tsinghua University, Beijing 100084, China
C. Gao, J. Wang (Tsinghua Univ.) Efficient Itemset Generator Discovery over a Stream Sliding Window 1 / 28
Outline
- Introduction
  - What is Generator
  - Why We Need Generators
  - What have We done
- Related Work
- The StreamGen Algorithm
  - FP-Tree
  - Enumeration Tree
  - ADD and REMOVE Operations
- Extension for Mining Classification Rules
- Evaluation Results
- Conclusions
Introduction
What is Generator

Example: Given the 4 transactions below, with the minimum support threshold (supmin) of 2.

Transactions:
  A B C
  A D
  A B C D
  A B D

Frequent itemsets with their supports, by size:
  size 0:  Ø : 4
  size 1:  A : 4   B : 3   C : 2   D : 3
  size 2:  AB : 3   AC : 2   AD : 3   BC : 2   BD : 2
  size 3:  ABC : 2   ABD : 2

[Figure: the itemset lattice above, with each equivalence class outlined and its generator itemsets and closed itemset marked.]
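The support values in this example can be reproduced with a brute-force count. The sketch below is only an illustration of the definition of support, not part of StreamGen; the names `support` and `frequent_itemsets` are made up here, and items are assumed to be single characters.

```python
from itertools import combinations

# The 4 example transactions and the minimum support threshold from the slide.
TRANSACTIONS = [{"A", "B", "C"}, {"A", "D"}, {"A", "B", "C", "D"}, {"A", "B", "D"}]
MIN_SUPPORT = 2


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Enumerate all itemsets (brute force) and keep the frequent ones."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            count = support(frozenset(combo), transactions)
            if count >= min_support:
                # Key is the items joined into a string; "" stands for Ø.
                result["".join(combo)] = count
    return result


FREQUENT = frequent_itemsets(TRANSACTIONS, MIN_SUPPORT)
```

Brute-force enumeration is exponential in the number of items, so this is only suitable for toy inputs like the one above.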
Equivalence class: all the frequent itemsets contained in the same set of input transactions.
Closed itemset: the maximal itemset in an equivalence class.
Generator itemsets: the minimal itemsets in an equivalence class.

Characteristics:
- Same equivalence class ⇒ same input transactions ⇒ same data distribution ⇒ same support value and confidence value;
- A generator itemset has no sub-itemset in its equivalence class;
- A closed itemset has no super-itemset in its equivalence class;
- Each equivalence class has exactly one closed itemset, but one or more generator itemsets;
- An itemset can be both a generator itemset and a closed itemset.
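To make the three definitions concrete, the following sketch groups the frequent itemsets of the 4-transaction example by the set of transactions that contain them, then picks the closed itemset and the generators of each group. It is a brute-force illustration under the definitions above, not the StreamGen algorithm; all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Example transactions keyed by an ID (the IDs are only used for grouping).
TRANSACTIONS = {
    1: frozenset("ABC"),
    2: frozenset("AD"),
    3: frozenset("ABCD"),
    4: frozenset("ABD"),
}
MIN_SUPPORT = 2


def equivalence_classes(transactions, min_support):
    """Group frequent itemsets by the set of transaction IDs containing them."""
    items = sorted(set().union(*transactions.values()))
    classes = defaultdict(list)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            tids = frozenset(tid for tid, t in transactions.items() if itemset <= t)
            if len(tids) >= min_support:
                classes[tids].append(itemset)
    return dict(classes)


def closed_and_generators(members):
    """Closed itemset = the maximal member; generators = the minimal members."""
    closed = max(members, key=len)
    generators = [m for m in members if not any(other < m for other in members)]
    return closed, generators


def label(itemset):
    return "".join(sorted(itemset)) or "Ø"


# Map each class's closed itemset to the sorted labels of its generators.
SUMMARY = {}
for members in equivalence_classes(TRANSACTIONS, MIN_SUPPORT).values():
    closed, generators = closed_and_generators(members)
    SUMMARY[label(closed)] = sorted(label(g) for g in generators)
```

On this input each class happens to have a single generator; in general a class can have several.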
Why We Need Generators
- Form a concise representation of equivalence classes together with closed itemsets;
- As classification rules / features.
  - At least one generator shares the same support and confidence with the others in each equivalence class;
  - Their number is much smaller than that of all frequent itemsets;
  - They are the shortest itemsets in an equivalence class;
  - Their average size tends to be the smallest;
  - Preferred by the MDL (Minimum Description Length) principle.
What have We done
A novel algorithm to mine frequent generator itemsets over a stream sliding window.

Contributions:
- First algorithm mining frequent itemset generators over stream sliding windows;
- A novel enumeration tree structure and some effective optimization techniques;
- Extended to directly mine classification rules on a sliding window;
- An extensive performance study shows that StreamGen outperforms other algorithms performing similar tasks, and achieves high classification accuracy.
Related Work
Itemset Mining Algorithms:
- Mining frequent patterns without candidate generation: A frequent-pattern tree approach. J. Han, J. Pei, Y. Yin, and R. Mao. Data Min. Knowl. Discov., 2004.
- Closet: An efficient algorithm for mining frequent closed itemsets. J. Pei, J. Han, and R. Mao. SIGMOD Workshop DMKD, 2000.
- Minimum description length principle: Generators are preferable to closed patterns. J. Li, H. Li, L. Wong, J. Pei, and G. Dong. AAAI, 2006.
- Mining statistically important equivalence classes and delta-discriminative emerging patterns. J. Li, G. Liu, and L. Wong. SIGKDD, 2007.
Stream Itemset Mining Algorithms:
- Catch the moment: maintaining closed frequent itemsets over a data stream sliding window. Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Knowl. Inf. Syst., 2006.

Itemset based Classification Algorithms:
- On mining instance-centric classification rules. J. Wang and G. Karypis. IEEE Trans. Knowl. Data Eng., 2006.
- Discriminative frequent pattern analysis for effective classification. H. Cheng, X. Yan, J. Han, and C.-W. Hsu. ICDE, 2007.
The StreamGen Algorithm
Details of our algorithm here.

Example: A running example of stream data containing 6 transaction itemsets, with a window size of 4.

ID   Itemset
1    A B C
2    A D
3    A B C D
4    A B D
5    B C D
6    C D

[Figure: the transactions laid out along a time line, with three sliding windows. Window #1 covers IDs 1-4, Window #2 covers IDs 2-5, and Window #3 covers IDs 3-6.]
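The three windows of the running example can be enumerated with a bounded queue. This is a minimal sketch of the sliding-window idea only; `sliding_windows` is a hypothetical helper, not a routine from the talk.

```python
from collections import deque

# The running example: (transaction ID, itemset) pairs in arrival order.
STREAM = [(1, "ABC"), (2, "AD"), (3, "ABCD"), (4, "ABD"), (5, "BCD"), (6, "CD")]
WINDOW_SIZE = 4


def sliding_windows(stream, size):
    """Yield the contents of each full window as the stream advances."""
    window = deque(maxlen=size)  # the oldest entry is dropped automatically
    for entry in stream:
        window.append(entry)
        if len(window) == size:
            yield list(window)


# Keep only the IDs to compare against Window #1, #2 and #3 in the example.
WINDOWS = [[tid for tid, _ in w] for w in sliding_windows(STREAM, WINDOW_SIZE)]
```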
A Few Basic Theorems

Theorem: A frequent itemset S is a generator iff no subset of S with size |S| − 1 has the same support as S.
Hint: This gives an easy check for whether an itemset is a generator.

Theorem: Any subset of a generator is also a generator.

Theorem: Any superset of an unpromising itemset must be either unpromising or infrequent.
Hint: These help define the border between generators and non-generators, and form the foundation for the enumeration tree.
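The first theorem translates directly into a check that only looks at the subsets obtained by dropping one item. The sketch below applies it to the first sliding window of the running example; `support` and `is_generator` are illustrative names, and support is counted by a plain scan rather than by any structure from the talk.

```python
# First sliding window of the running example (IDs 1 to 4).
WINDOW = [frozenset("ABC"), frozenset("AD"), frozenset("ABCD"), frozenset("ABD")]


def support(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def is_generator(itemset, transactions, min_support):
    """A frequent itemset S is a generator iff no subset of size |S| - 1
    has the same support as S."""
    itemset = frozenset(itemset)
    own = support(itemset, transactions)
    if own < min_support:
        return False  # infrequent itemsets are not considered
    return all(support(itemset - {item}, transactions) != own for item in itemset)
```

For example, `is_generator("BD", WINDOW, 2)` holds because B and D both have support 3 while BD has support 2, whereas BC fails the check because C already has support 2.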
FP-Tree
A modified FP-Tree is used to store and compress the transactions in each sliding window.

Example: FP-Tree of the first sliding window in the previous example.

ID   Itemset
1    A B C
2    A D
3    A B C D
4    A B D

[Figure: FP-Tree of the first sliding window, drawn with a head table (items A, B, D, C) and an ID table (transaction IDs 1, 4, 3, 2).]
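The slide does not spell out how the modified FP-Tree is organised, so the following is only a rough sketch of the general idea: a plain prefix tree with per-node counts, plus a map from transaction ID to the last node of its path so that a transaction can be removed when it leaves the window. The class names, the alphabetical item order, and the `tails` map (loosely playing the role of an ID table) are all assumptions made for illustration.

```python
class Node:
    def __init__(self, item=None, parent=None):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


class PrefixTree:
    """Prefix tree with counts; NOT the authors' modified FP-Tree."""

    def __init__(self):
        self.root = Node()
        self.tails = {}  # transaction ID -> last node on its path (assumed design)

    def add(self, tid, itemset):
        node = self.root
        for item in sorted(itemset):  # assumed fixed item order: alphabetical
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
        self.tails[tid] = node

    def remove(self, tid):
        node = self.tails.pop(tid)
        while node.parent is not None:  # walk back up to the root
            node.count -= 1
            parent = node.parent
            if node.count == 0:
                del parent.children[node.item]  # prune empty branches
            node = parent

    def paths(self):
        """Return {prefix string: count} for every node in the tree."""
        out = {}
        stack = [(self.root, "")]
        while stack:
            node, prefix = stack.pop()
            for item, child in node.children.items():
                out[prefix + item] = child.count
                stack.append((child, prefix + item))
        return out
```

Shared prefixes such as A B are stored once with a count, which is where the compression comes from.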
Enumeration Tree
The enumeration tree maintains the information of the mined generators and the border between generators and non-generators.

3 types of nodes:
- Infrequent Node;
- Unpromising Node;
- Generator Node.

A hash table is prepared for each level of the enumeration tree to accelerate the checking operation.
Example: Enumeration tree of the first sliding window with minimum support 2.

ID   Itemset
1    A B C
2    A D
3    A B C D
4    A B D

Nodes of the tree with their supports, by level:
  level 0:  Ø : 4
  level 1:  A : 4   B : 3   C : 2   D : 3
  level 2:  BC : 2   BD : 2   CD : 1

[Figure: the enumeration tree above. Legend: solid-border ellipse = Generator Node; dotted-border ellipse = Unpromising Node; dotted-border rectangle = Infrequent Node.]
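The node set of this example can be reproduced by a small from-scratch construction. This sketch is not the incremental StreamGen procedure: it rebuilds the tree for a given window, labels each node G (generator), U (unpromising, taken here to mean frequent but not a generator) or I (infrequent), and expands only generator nodes. The child-generation rule (join a generator node with each generator sibling to its right) is an assumption inferred from the example trees, and items are assumed to be single characters in alphabetical order.

```python
def support(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= set(t))


def node_type(itemset, transactions, min_support):
    own = support(itemset, transactions)
    if own < min_support:
        return "I"
    # Unpromising: some subset with one item dropped has the same support.
    if any(support(itemset.replace(i, ""), transactions) == own for i in itemset):
        return "U"
    return "G"


def expand(siblings, transactions, min_support, tree):
    """Record a group of sibling nodes, then expand the generator ones."""
    for s in siblings:
        tree[s] = (support(s, transactions), node_type(s, transactions, min_support))
    generators = [s for s in siblings if tree[s][1] == "G"]
    for pos, s in enumerate(generators):
        # Assumed rule: join with each generator sibling to the right.
        children = [s + right[-1] for right in generators[pos + 1:]]
        if children:
            expand(children, transactions, min_support, tree)


def build_tree(transactions, min_support):
    """Return {itemset string: (support, type)}; "" stands for Ø."""
    tree = {"": (support("", transactions), node_type("", transactions, min_support))}
    if tree[""][1] == "G":
        items = sorted(set("".join(transactions)))
        expand(items, transactions, min_support, tree)
    return tree


FIRST_WINDOW_TREE = build_tree(["ABC", "AD", "ABCD", "ABD"], 2)
```

Under these assumptions A is unpromising (same support as Ø) and is therefore not expanded, which is why AB, AC and AD do not appear in the tree for this window.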
ADD and REMOVE Operations
Core part: the status transformation matrix for enumeration tree nodes.

            ADD                        REMOVE
Type   x < y   x = y   x > y     x < y   x = y   x > y
G      G       G       G         G       G/U     I/G
U      U       G/U     U         U       U       I/U
I      I       I       I/G/U     I       I       I

x = |itemset_n ∩ T|, y = |itemset_n| − 1, where itemset_n is the itemset of node n and T is the transaction being added or removed.
G = Generator, U = Unpromising, I = Infrequent
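The matrix can be read as a lookup keyed by operation, current node type, and the comparison of x with y. The sketch below encodes exactly the entries of the table and nothing more; it returns the candidate types a node may have afterwards, and deciding among them would still require the updated supports. The names `TRANSITIONS` and `possible_types` are hypothetical.

```python
# Entries copied from the matrix; columns are (x < y, x = y, x > y).
TRANSITIONS = {
    "ADD": {
        "G": ("G", "G", "G"),
        "U": ("U", "G/U", "U"),
        "I": ("I", "I", "I/G/U"),
    },
    "REMOVE": {
        "G": ("G", "G/U", "I/G"),
        "U": ("U", "U", "I/U"),
        "I": ("I", "I", "I"),
    },
}


def possible_types(operation, current_type, itemset, transaction):
    """Candidate node types after adding/removing `transaction`."""
    x = len(set(itemset) & set(transaction))  # x = |itemset_n ∩ T|
    y = len(set(itemset)) - 1                 # y = |itemset_n| - 1
    column = 0 if x < y else (1 if x == y else 2)
    return TRANSITIONS[operation][current_type][column].split("/")
```

For instance, when T = B C D is added, the infrequent node CD has x = 2 and y = 1, so it may stay infrequent or become a generator or unpromising node, while the generator node BD is guaranteed to remain a generator.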
Example of ADD Operation

Transaction added: T = B C D (ID 5).

ID   Itemset
1    A B C
2    A D
3    A B C D
4    A B D
5    B C D    (added)

Enumeration tree before the ADD, by level:
  level 0:  Ø : 4
  level 1:  A : 4   B : 3   C : 2   D : 3
  level 2:  BC : 2   BD : 2   CD : 1

ADD part of the transformation matrix:

Type   x < y   x = y   x > y
G      G       G       G
U      U       G/U     U
I      I       I       I/G/U

x = |itemset_n ∩ T|, y = |itemset_n| − 1

Enumeration tree after the ADD, by level:
  level 0:  Ø : 5
  level 1:  A : 4   B : 4   C : 3   D : 4
  level 2:  AB : 3   AC : 2   AD : 3   BC : 3   BD : 3   CD : 2
  level 3:  ABC : 2   ABD : 2   ACD : 1
Example of REMOVE Operation

Transaction removed: T = A B C (ID 1).

ID   Itemset
1    A B C    (removed)
2    A D
3    A B C D
4    A B D
5    B C D

Enumeration tree before the REMOVE, by level:
  level 0:  Ø : 5
  level 1:  A : 4   B : 4   C : 3   D : 4
  level 2:  AB : 3   AC : 2   AD : 3   BC : 3   BD : 3   CD : 2
  level 3:  ABC : 2   ABD : 2   ACD : 1

REMOVE part of the transformation matrix:

Type   x < y   x = y   x > y
G      G       G/U     I/G
U      U       U       I/U
I      I       I       I

x = |itemset_n ∩ T|, y = |itemset_n| − 1

Enumeration tree after the REMOVE, by level:
  level 0:  Ø : 4
  level 1:  A : 3   B : 3   C : 2   D : 4
  level 2:  AB : 2   AC : 1   BC : 2
Combine Two Operations
- For Sliding Window:
  - ADD when window is not full
  - REMOVE when window is full
- For Incremental:
  - Only ADD
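One way to picture how the two operations interleave is to generate the sequence of ADD and REMOVE calls for a stream. The slide does not state the exact ordering; the sketch below assumes the order used in the ADD and REMOVE examples, where transaction 5 is added first and transaction 1 is removed afterwards. `window_operations` is a hypothetical driver, and the `incremental` flag models the "only ADD" mode.

```python
from collections import deque


def window_operations(stream_ids, window_size, incremental=False):
    """Yield ("ADD", id) / ("REMOVE", id) in the assumed order."""
    window = deque()
    for tid in stream_ids:
        window.append(tid)
        yield ("ADD", tid)
        # Assumption: once the window overflows, the oldest transaction is removed.
        if not incremental and len(window) > window_size:
            yield ("REMOVE", window.popleft())


OPS = list(window_operations([1, 2, 3, 4, 5, 6], window_size=4))
```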
Extension for Mining Classification Rules

Algorithm 1: StreamGenRules(n)
Input: The root node n of the enumeration tree.
begin
    nodes ← getGenerators(n);
    sort nodes by info-gain;
    rules ← ∅;
    foreach cn ∈ nodes do
        if ∀r ∈ rules, r ⊄ cn then
            if cn covers at least one transaction then
                rules ← rules ∪ {cn};
                remove covered transactions;
                if no more transactions then
                    break;
    return rules;
end
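The covering loop of Algorithm 1 might be sketched as follows. Several details are assumptions rather than facts from the slide: the generators arrive as (itemset, info-gain) pairs with the info-gain already computed, sorting is in descending order of info-gain, ⊄ is read as "not a proper subset", and class labels are left out. `select_rules` is a hypothetical name, and the info-gain values in the usage note are made up.

```python
def select_rules(candidates, transactions):
    """Greedy rule selection over generator itemsets.

    candidates:   list of (itemset, info_gain) pairs (info_gain assumed given)
    transactions: list of itemsets in the current window
    """
    remaining = [frozenset(t) for t in transactions]
    rules = []
    # Assumed: higher info-gain first.
    for itemset, _gain in sorted(candidates, key=lambda c: c[1], reverse=True):
        itemset = frozenset(itemset)
        # Skip candidates that contain an already selected rule.
        if any(rule < itemset for rule in rules):
            continue
        # Keep the candidate only if it covers at least one remaining transaction.
        if not any(itemset <= t for t in remaining):
            continue
        rules.append(itemset)
        remaining = [t for t in remaining if not itemset <= t]
        if not remaining:
            break
    return rules
```

With the made-up scores `[("B", 0.9), ("BD", 0.8), ("D", 0.5), ("C", 0.4)]` on the first sliding window, B is selected first, BD is skipped because it contains B, and D covers the one remaining transaction, so the loop stops before reaching C.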
Evaluation Results
Datasets

Dataset     # Items   # Tran.   # Pos.   # Neg.   Avg. Len.
mushroom    116       8,124     4,208    3,916    21.695
horse       89        368       232      136      16.769
adult       128       48,842    11,687   37,155   13.868
breast      45        699       458      241      8.977
hepatitus   55        155       32       123      17.923
pima        40        768       500      268      8
chess       75        3,196     -        -        37
connect     129       67,557    -        -        43
pumsb       2,113     49,046    -        -        74

The upper part (mushroom through pima) is used for both runtime evaluation and classification evaluation; the bottom part (chess, connect, pumsb) is used only for runtime evaluation.
Runtime Comparison with Moment
Comparison with Moment, a frequent closed itemset mining algorithm on sliding windows:

[Figure: eight plots of runtime (in seconds) against minimum support threshold (in %) for Moment and StreamGen, one per dataset and window size: mushroom (window sizes 2,000 and 4,000), chess (1,000 and 2,000), pumsb (2,500 and 10,000), connect (30,000 and 60,000).]
Memory Use Comparison with Moment
Peak memory use of Moment and StreamGen, in KB:

Dataset     Window size   supmin   Moment      StreamGen
mushroom    4,000         0.1      14,476      10,108
mushroom    2,000         0.1      12,504      8,472
chess       2,000         0.6      103,180     31,636
chess       1,000         0.75     34,624      9,176
connect-4   60,000        0.95     141,756     98,236
connect-4   30,000        0.998    73,056      52,372
pumsb       10,000        0.7      1,732,136   75,316
pumsb       2,500         0.75     90,944      23,472
Runtime Comparison with DPM & DDPMine
Comparison with DPM, a frequent generator itemset mining algorithm on static data:

[Figure: four plots of runtime (in seconds) against minimum support threshold (in %) for DPM and StreamGen: mushroom (window size 4,000), chess (window size 1,000), connect (window size 67,000), pumsb (window size 49,000).]

Comparison with DDPMine, a frequent itemset based classification rule mining algorithm on static data:

[Figure: two plots of runtime (in seconds) against minimum support threshold (in %) for DDPMine and StreamGen: mushroom (window size 8,000), horse (window size 600).]

* The runtimes of DPM & DDPMine are only measured on full-sized windows.
Classification Experiment Results
Classification Accuracy:

            StreamGen                                     DDPMine
Dataset     Accuracy  max. len.  avg. len.  avg. num.     Accuracy  max. len.  avg. len.  avg. num.
breast      96.708    3          1.551      23.6          95.28     9          2.448      11.6
adult       82.146    3          1.831      13            81.292    14         4.583      7.2
mushroom    98.918    3          1.958      9.6           97.184    22         15.592     16.2
hepatitus   82.006    4          2.387      15            76.986    8          4.8        5
horse       81.512    2          1.389      3.6           81.246    20         4.88       10
pima        74.87     4          1.663      18.4          75.124    7          2.435      12.6
Rule Example on "mushroom" (each line is one rule, given as a list of item IDs):

StreamGen:
  38
  12 25
  13 25
  7 67
  66
  7 68
  11 18
  6 18 37
  4 53

DDPMine:
  17 39
  5 7 8 11 13 15 16 17 18 19 20 26
  8 17 18
  5 7 9 13 14 15 16 17 18 19 20 40 41 46 53 54
  2 7 9 11 13 14 15 16 17 18 19 20 21 38 40 44 53 54 76
  2 7 9 11 13 14 15 16 17 18 19 20 28 38 40 44 53 54 76
  2 7 9 11 13 14 15 16 17 18 19 20 32 38 40 53 54 65 76
  2 7 9 11 13 14 15 16 17 18 19 20 22 32 38 40 53 54 76
  2 7 9 11 13 14 15 16 17 18 19 20 28 32 38 40 46 53 54 76
  2 7 9 11 13 14 15 16 17 18 19 20 21 32 38 40 45 46 53 54 76
  2 7 9 11 13 14 15 16 17 18 19 20 21 32 34 38 40 46 48 53 54 76
Conclusions
- Explored a new and challenging problem: mining frequent itemset generators over a stream sliding window;
- Devised a novel enumeration tree structure;
- Also proposed effective optimization techniques;
- Outperformed other state-of-the-art algorithms in terms of efficiency and classification accuracy.
Conclusions
The End
Thank you for Listening!
Questions or Comments?