Cartesian Contour: A Concise Representation for a Collection of Frequent Sets
Ruoming Jin Kent State University
Joint work with Yang Xiang and Lin Liu (KSU)
Frequent Pattern Mining
• Summarizing the underlying datasets, providing key insights
• Key building block for data mining toolbox
– Association rule mining
– Classification
– Clustering
– Change Detection
– etc.
• Application Domains
– Business, biology, chemistry, WWW, computer/networking security, software engineering, …
The Problem
• The number of patterns is too large
• Attempts
– Maximal Frequent Itemsets
– Closed Frequent Itemsets
– Non-Derivable Itemsets
– Compressed or Top-k Patterns
– …
• Tradeoff
– Significant Information Loss
– Large Size
Pattern Summarization
• Using a small number of itemsets to best represent the entire collection of frequent itemsets
– The Spanning Set Approach [Afrati-Gionis-Mannila, KDD04]
– Exact Description = Maximal Frequent Itemsets
• Our problem:
– Can we find a concise representation that allows both exact and approximate summarization of a collection of frequent itemsets?
Basic Idea
{A,B,G,H}, {A,B,I,J}, {A,B,K,L}
{C,D,G,H}, {C,D,I,J}, {C,D,K,L}
{E,F,G,H}, {E,F,I,J}, {E,F,K,L}
9 itemsets, 36 items.
Cartesian Product: {{A,B},{C,D},{E,F}} × {{G,H},{I,J},{K,L}}
1 biclique, 6 itemsets, 12 items
[Figure: Picturing the Cartesian product; Covering; Cartesian Covering; Non-frequent itemsets]
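The counts on this slide can be checked with a short Python sketch. It expands the Cartesian product of the two itemset collections and compares the size of the explicit list with the size of the compact form. The function name `cartesian_product` is an illustrative choice, not something defined in the talk.

```python
from itertools import product

def cartesian_product(left, right):
    """Union every itemset on the left with every itemset on the right."""
    return [frozenset(a) | frozenset(b) for a, b in product(left, right)]

left = [{"A", "B"}, {"C", "D"}, {"E", "F"}]
right = [{"G", "H"}, {"I", "J"}, {"K", "L"}]

# Explicit listing: every itemset written out in full.
expanded = cartesian_product(left, right)
explicit_items = sum(len(s) for s in expanded)

# Compact form: only the two sides of the product are stored.
compact_itemsets = len(left) + len(right)
compact_items = sum(len(s) for s in left + right)

print(len(expanded), explicit_items)    # 9 36
print(compact_itemsets, compact_items)  # 6 12
```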
Problem Formulation
• Cartesian product
– e.g. {{A,B},{E}} × {{C,D}} = {{A,B,C,D},{C,D,E}}
• Cost of a Cartesian product
– e.g. 1 biclique, 3 itemsets, and 5 items
• Covering
– e.g. C({{A,B},{E}} × {{C,D}}) = 2^{A,B,C,D} ∪ 2^{C,D,E}, i.e. every subset of {A,B,C,D} or of {C,D,E}
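As a rough illustration of the covering definition, the sketch below enumerates everything covered by the product {{A,B},{E}} × {{C,D}}. It assumes the covering is the union of the power sets of the product's itemsets and leaves out the empty set for convenience; `nonempty_subsets` and `covering` are hypothetical helper names.

```python
from itertools import combinations, product

def nonempty_subsets(itemset):
    """All nonempty subsets of an itemset, as frozensets."""
    items = sorted(itemset)
    return {frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)}

def covering(left, right):
    """Itemsets covered by left x right: subsets of every union a | b."""
    covered = set()
    for a, b in product(left, right):
        covered |= nonempty_subsets(set(a) | set(b))
    return covered

cover = covering([{"A", "B"}, {"E"}], [{"C", "D"}])
print(len(cover))  # 19: 15 subsets of ABCD + 7 of CDE - 3 shared subsets of CD
```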
How can we use Cartesian products to concisely represent a collection of frequent itemsets?
Exact and Approximate Covering
Exact Representation
{{G}} × {{A,B}} ∪ {∅} × {{C,D}}
Cost: 2 bicliques, 4 itemsets, 6 items
False positive: none
Approximate Representation
{{G}} × {{A,B},{C,D}}
Cost: 1 biclique, 3 itemsets, 5 items
False positive: {G,C}, {G,D}, {G,C,D}
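The false positives listed on this slide can be reproduced with a small sketch. It assumes the underlying maximal frequent itemsets are {A,B,G} and {C,D}; the slide text does not state them, so this is an inference from the costs and false positives shown. The helper names are hypothetical, and the empty left side of the second exact product is modelled as an empty set.

```python
from itertools import combinations, product

def nonempty_subsets(itemset):
    items = sorted(itemset)
    return {frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)}

def covering(products):
    """Union of the coverings of several (left, right) Cartesian products."""
    covered = set()
    for left, right in products:
        for a, b in product(left, right):
            covered |= nonempty_subsets(set(a) | set(b))
    return covered

# Assumed ground truth: all nonempty subsets of the two assumed MFIs.
frequent = nonempty_subsets("ABG") | nonempty_subsets("CD")

approx = covering([([{"G"}], [{"A", "B"}, {"C", "D"}])])
exact = covering([([{"G"}], [{"A", "B"}]), ([set()], [{"C", "D"}])])

false_pos_approx = approx - frequent
false_pos_exact = exact - frequent

print(sorted("".join(sorted(s)) for s in false_pos_approx))  # ['CDG', 'CG', 'DG']
print(len(false_pos_exact))                                  # 0
```

Under these assumptions the single-biclique product is cheaper but admits three itemsets that are not frequent, while the two-biclique form covers the collection exactly.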
Covering Maximal Frequent Itemsets
{{ABC}, {CDE}}
{{GHI}, {JKL}}
{{MNO}, {PQR}}
{{STU}, {VWX}}
ABCSTU
ABCGHI
CDESTU
CDEGHI
CDEVWX
CDEJKL
MNOVWX
MNOGHI
PQRJKL
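One way to read this slide is as a bipartite graph: each maximal frequent itemset is an edge between an X-side itemset and a Y-side itemset, and a Cartesian product that covers only MFIs is a biclique in that graph. The sketch below builds the edges from the nine itemsets listed here. Splitting an MFI by plain string concatenation is an assumption that happens to fit this example, and `is_biclique` is a hypothetical helper.

```python
x_side = ["ABC", "CDE", "MNO", "PQR"]
y_side = ["GHI", "JKL", "STU", "VWX"]
mfis = ["ABCSTU", "ABCGHI", "CDESTU", "CDEGHI", "CDEVWX",
        "CDEJKL", "MNOVWX", "MNOGHI", "PQRJKL"]

# An edge (x, y) exists when the concatenation x + y is one of the MFIs.
edges = {(x, y) for x in x_side for y in y_side if x + y in mfis}

def is_biclique(xs, ys):
    """True when every pair (x, y) from xs and ys is an edge, i.e. an MFI."""
    return all((x, y) in edges for x in xs for y in ys)

print(len(edges))                                  # 9
print(is_biclique(["ABC", "CDE"], ["GHI", "STU"]))  # True
print(is_biclique(["ABC"], ["JKL"]))                # False
```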
Problem Reformulation
Given Maximal Frequent Itemsets: {{ABCI}, {ABDI}, {CEFJ}, {DEFJ}, {DGHK}, {EGHK}}
[Figure: Frequent Itemsets; Exact representation (C1, C2); Approximate representation (C1, C2)]
Minimal Biclique Set Cover Problem
Ground Set: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
1, 2, 3, 4, 6, 7, 8, 9
5, 10, 11
NP-hardness
• By reducing Minimal Biclique Set Cover to our problem, we can prove that both Problem 1 (exact) and Problem 2 (approximate) are NP-hard.
• Minimal Biclique Set Cover is a Variant of the Classical Set Cover Problem
Can we use the standard set-cover greedy algorithm?
Naïve greedy algorithm
• Greedy algorithm:
– Each time, choose the biclique C_i = X_i × Y_i with the lowest price, i.e. its cost divided by the number of not-yet-covered elements it covers.
– |X_i| + |Y_i| is the cost.
– This method has a logarithmic approximation bound.
• The problem?
– The number of candidate bicliques is 2^(|X|+|Y|)!
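The naive greedy rule is the standard set-cover greedy: repeatedly pick the candidate with the lowest price, where price is cost divided by the number of newly covered elements. A minimal sketch follows. The two large candidates use the element sets from the Minimal Biclique Set Cover slide; their costs and the third candidate are made up for illustration.

```python
from fractions import Fraction

def greedy_cover(ground, candidates):
    """candidates: list of (cost, frozenset of elements). Returns chosen candidates."""
    uncovered = set(ground)
    chosen = []
    while uncovered:
        best = None
        for cost, elems in candidates:
            new = len(elems & uncovered)
            if new == 0:
                continue  # this candidate adds nothing now
            price = Fraction(cost, new)
            if best is None or price < best[0]:
                best = (price, cost, elems)
        if best is None:
            raise ValueError("candidates cannot cover the ground set")
        chosen.append((best[1], best[2]))
        uncovered -= best[2]
    return chosen

ground = range(1, 12)
candidates = [
    (6, frozenset({1, 2, 3, 4, 6, 7, 8, 9})),  # hypothetical cost 6
    (4, frozenset({5, 10, 11})),               # hypothetical cost 4
    (3, frozenset({1, 2, 5})),                 # hypothetical extra candidate
]
solution = greedy_cover(ground, candidates)
print(sum(cost for cost, _ in solution))  # 10
```

The loop itself is cheap; the difficulty raised on the slide is that the candidate list would have to contain every possible biclique.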
Candidate Reduction
• Assume one side of the biclique candidate is known. How do we choose the other side?
Greedy Algorithm
Fixed!
Split and sort
Covering 4 Covering 3 Covering 3
Add 1st single Y-vertex Biclique
Cost = 1;
Add 2nd single Y-vertex Biclique
Cost = 5/7;
Add 3rd single Y-vertex Biclique
Cost = 6/8 > 5/7
Cheapest sub-biclique!
Biclique Candidate
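The walk-through above can be sketched as follows: with the X side fixed, sort the Y vertices by how many uncovered elements each adds, then keep adding them while the price (|X| + |Y|) / covered keeps dropping. This is a simplified reading of the slide, not necessarily the authors' exact procedure. The inputs are assumptions chosen to reproduce the prices 1, 5/7 and 6/8: an X side of size 3 and Y vertices that add 4, 3 and 1 new elements.

```python
from fractions import Fraction

def cheapest_sub_biclique(x_size, y_coverage):
    """Greedily grow the Y side while the price keeps decreasing.

    x_size: number of itemsets on the fixed X side.
    y_coverage: dict mapping each Y vertex to the uncovered elements it adds.
    """
    ranked = sorted(y_coverage.items(), key=lambda kv: kv[1], reverse=True)
    chosen, covered, best_price = [], 0, None
    for y, gain in ranked:
        if gain == 0:
            break  # nothing new to cover
        price = Fraction(x_size + len(chosen) + 1, covered + gain)
        if best_price is not None and price >= best_price:
            break  # adding this vertex no longer lowers the price
        chosen.append(y)
        covered += gain
        best_price = price
    return chosen, best_price

ys, price = cheapest_sub_biclique(3, {"y1": 4, "y2": 3, "y3": 1})
print(ys, price)  # ['y1', 'y2'] 5/7
```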
Approximation Bound of the Greedy Algorithm
The greedy SubBiclique procedure finds a sub-biclique whose price is at most e/(e-1) times the price of the optimal (cheapest) sub-biclique!
Further Reduction
• Using only Idea 1, the number of candidates drops from 2^(|X|+|Y|) to 2^|X|, so the time complexity is still exponential.
• How to reduce this further?
– Are all the combinations equally important?
– No, because some are more likely to connect to the Y side.
– Our solution: Frequent itemset mining!
Using Frequent Itemset Mining
Overall Algorithm
• Step 1: Use the Frequent Itemset Mining tool to find all the (one-side maximal) biclique candidates;
• Step 2: Calculate the cheapest sub-biclique for each candidate using the greedy procedure;
• Step 3: Compare all the sub-bicliques, choose the cheapest one;
• Step 4: If the MFIs are fully covered, stop; otherwise go to Step 2.
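The four steps can be put together in a rough sketch. Here the MFIs are edges of a bipartite graph, and `x_candidates` is a hand-written list standing in for the one-side candidates that Step 1 would obtain from a frequent itemset mining tool, which is not implemented here. The edge list reuses the itemset names from the Covering Maximal Frequent Itemsets slide; all function names are hypothetical.

```python
from fractions import Fraction

def cheapest_for(xs, edges, uncovered, y_side):
    """Step 2: cheapest sub-biclique with fixed X side xs (greedy on the Y side)."""
    gains = {y: sum((x, y) in uncovered for x in xs)
             for y in y_side if all((x, y) in edges for x in xs)}
    ys, covered, best = [], 0, None
    for y in sorted(gains, key=gains.get, reverse=True):
        if gains[y] == 0:
            break
        price = Fraction(len(xs) + len(ys) + 1, covered + gains[y])
        if best is not None and price >= best:
            break
        ys.append(y)
        covered += gains[y]
        best = price
    return best, ys

def cover_all(edges, x_candidates):
    y_side = sorted({y for _, y in edges})
    uncovered, result = set(edges), []
    while uncovered:                                   # Step 4: repeat until covered
        options = [(price, xs, ys) for xs in x_candidates
                   for price, ys in [cheapest_for(xs, edges, uncovered, y_side)]
                   if price is not None]
        if not options:
            raise ValueError("candidates cannot cover all edges")
        _, xs, ys = min(options, key=lambda o: o[0])   # Step 3: cheapest overall
        result.append((xs, tuple(ys)))
        uncovered -= {(x, y) for x in xs for y in ys}
    return result

edges = {("ABC", "STU"), ("ABC", "GHI"), ("CDE", "STU"), ("CDE", "GHI"),
         ("CDE", "VWX"), ("CDE", "JKL"), ("MNO", "VWX"), ("MNO", "GHI"),
         ("PQR", "JKL")}
# Assumed Step 1 output: a few X-side candidates, including all singletons.
x_candidates = [("ABC",), ("CDE",), ("MNO",), ("PQR",),
                ("ABC", "CDE"), ("CDE", "MNO")]
bicliques = cover_all(edges, x_candidates)
print(bicliques)
```

Because every singleton X side is among the candidates, each edge can always be covered, so the loop terminates on this input.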
Approximation Bound
Our algorithm has an (e/(e-1))(ln n + 1) approximation ratio with respect to the candidate set (all the sub-bicliques with one side coming from frequent itemset mining).
Speed-up techniques (1)
• Using Closed itemsets for X and Y
– Initially X and Y contain all the FI, respectively.
– Using X × Y = {X_1 ∪ Y_1, …, X_1 ∪ Y_m, …, X_l ∪ Y_1, …, X_l ∪ Y_m} to cover MFI is similar to factorizing MFI;
– MFI’s maximal factor itemsets are closed itemsets, whose number is much smaller!
Speed-up techniques (2)
[Figure: Frequent Itemset vs. Supporting Transaction, comparing a sparse graph and a dense graph]
Sparse Graph
• The number of frequent itemsets is small;
• Valuable biclique candidates are not fully used!
Dense Graph
• The number of frequent itemsets is big;
• Handling those candidates is too slow!
TRADEOFF
Speed-up techniques (3)
• Iterative procedure
– There is a large number of closed itemsets;
– Covering the MFIs in one pass can produce a huge number of biclique candidates;
– So cover the MFIs in several passes;
– The support level is reduced gradually!
Experiments
• Data sets:
Conclusion
• We propose an interesting summarization problem which considers the interaction between frequent patterns
• We transform this problem into a generalized minimal biclique covering problem and design an approximation algorithm with a provable bound
• The experimental results demonstrate the effectiveness and efficiency of our approach
Thank you !!!
Reference
[Bayardo98] Roberto J. Bayardo Jr. Efficiently mining long patterns from databases. SIGMOD98.
[Pasquier99] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Discovering frequent closed itemsets for association rules. ICDT99.
[Calder07] Toon Calders and Bart Goethals. Non-derivable itemset mining. Data Min. Knowl. Discover. 07.
[Han02] Jiawei Han, Jianyong Wang, Ying Lu and Petre Tzvetkov. Mining top-k frequent closed patterns without minimum support. ICDM02.
[Xin06] Dong Xin, Hong Cheng, Xifeng Yan, and Jiawei Han. Extracting redundancy-aware top-k patterns. KDD06.
[Xin05] Dong Xin, Jiawei Han, Xifeng Yan, and Hong Cheng. Mining compressed frequent-pattern sets. VLDB05.
[Afrati04] Foto Afrati, Aristides Gionis, and Heikki Mannila. Approximating a collection of frequent sets. KDD04.
[Yan05] Xifeng Yan, Hong Cheng, Jiawei Han, and Dong Xin. Summarizing itemset patterns: a profile-based approach. KDD05.
[Wang06] Chao Wang and Srinivasan Parthasarathy. Summarizing itemset patterns using probabilistic models. KDD06.
[Jin08] Ruoming Jin, Muad Abu-Ata, Yang Xiang, and Ning Ruan. Effective and efficient itemset pattern summarization: regression-based approaches. KDD08.
[Xiang08] Yang Xiang, Ruoming Jin, David Fuhry, and Feodor F. Dragan. Succinct summarization of transactional databases: an overlapped hyperrectangle scheme. KDD08.
Related Work
• K-itemset approximation: [Afrati04].
– Difference:
  • their work is a special case of our work;
  • their work is expensive for exact description;
  • our work uses set cover and max-k cover methods.
• Restoring the frequency of frequent itemsets: [Yan05, Wang06, Jin08].
• Hyperrectangle covering problem: [Xiang08].