Mining Frequent patterns without candidate generation Jiawei Han, Jian Pei and Yiwen Yin.
Problem: Mining Frequent Patterns
• I = {a1, a2, …, am} is a set of items.
• DB = {T1, T2, …, Tn} is the database of transactions, where each transaction is a non-empty subset of I.
• A pattern is also a subset of I.
• A pattern is frequent if it is contained in (supported by) at least a fixed number (ξ) of transactions.
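The definitions above can be sketched in a few lines of Python. This is only an illustration: the names `DB`, `support` and `is_frequent` are hypothetical, the transactions are the five from the example used in these slides, and the threshold is assumed to be inclusive (support ≥ ξ), which is what the worked example with ξ = 3 implies.

```python
# Example transactions (the "Items Bought" column of the running example).
DB = [
    set("facdgimp"),
    set("abcflmo"),
    set("bfhjo"),
    set("bcksp"),
    set("afcelpmn"),
]

def support(pattern, db):
    """Number of transactions that contain every item of `pattern`."""
    pattern = set(pattern)
    return sum(1 for t in db if pattern <= t)

def is_frequent(pattern, db, xi):
    """Assumption: a pattern is frequent when its support reaches xi."""
    return support(pattern, db) >= xi

print(support("fcam", DB))       # 3
print(is_frequent("cp", DB, 3))  # True
print(is_frequent("ap", DB, 3))  # False
```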
Previous work: Apriori
• It may need to generate a huge number of candidate itemsets. To discover a frequent pattern of size k it needs to generate at least 2^k − 1 candidates in total, one for every non-empty subset of the pattern.
• It may need to scan the database repeatedly and check for the frequencies of the candidates.
FP-growth
• FP-growth mines frequent patterns without generating the candidate sets. It grows the patterns from fragments.
• It builds an extended prefix tree (FP-tree) for the transaction database. This tree is a compressed representation of the database, so mining works on the tree and avoids repeated scans of the database.
FP-tree
TID   Items Bought       Frequent Items
100   f,a,c,d,g,i,m,p    f,c,a,m,p
200   a,b,c,f,l,m,o      f,c,a,b,m
300   b,f,h,j,o          f,b
400   b,c,k,s,p          c,b,p
500   a,f,c,e,l,p,m,n    f,c,a,m,p
Frequent items are sorted in descending order of frequency.
root
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
Minimum support (ξ) = 3
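The construction of this tree can be sketched as follows. This is a minimal illustration, not the authors' implementation: `Node`, `build_fp_tree` and `dump` are hypothetical names, and the `tie_break` string "fcabmp" is an assumption that reproduces the item order used on the slide when two items have the same frequency.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(db, xi, tie_break):
    # First scan: count how many transactions contain each item.
    counts = Counter(i for t in db for i in set(t))
    # Descending frequency; equal frequencies follow the order in tie_break.
    rank = lambda i: (-counts[i], tie_break.index(i))
    root = Node(None, None)
    # Second scan: insert the ordered frequent items of each transaction.
    for t in db:
        items = sorted((i for i in set(t) if counts[i] >= xi), key=rank)
        node = root
        for i in items:
            if i not in node.children:
                node.children[i] = Node(i, node)
            node = node.children[i]
            node.count += 1  # shared prefixes only increment counters
    return root

def dump(node, depth=0):
    """Return the tree as indented 'item:count' lines."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}:{child.count}")
        lines.extend(dump(child, depth + 1))
    return lines

DB = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
root = build_fp_tree(DB, xi=3, tie_break="fcabmp")
print("\n".join(dump(root)))
```

The database is read exactly twice: once to count items and once to insert the ordered transactions.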
Conditional FP-tree of p
Conditional pattern base for p (minimum support ξ = 3)
Items      Frequent Items
f,c,a,m    c
f,c,a,m    c
c,b        c

Conditional FP-tree of p
root
  c:3

The set of frequent patterns containing p is { p, cp }.
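The conditional pattern base can be reproduced with a small sketch. As a simplifying assumption, the prefixes are read directly from the ordered transactions rather than by following node-links in the FP-tree; the result is the same set of prefix paths. The names `ORDERED`, `conditional_pattern_base` and `conditional_frequent_items` are hypothetical.

```python
from collections import Counter

# "Frequent Items" column of the example, already in frequency order.
ORDERED = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

def conditional_pattern_base(ordered_db, item):
    """Prefix that precedes `item` in every ordered transaction containing it."""
    return [t[: t.index(item)] for t in ordered_db if item in t]

def conditional_frequent_items(base, xi):
    """Items of the base that still reach the support threshold."""
    counts = Counter(i for prefix in base for i in prefix)
    return {i: c for i, c in counts.items() if c >= xi}

base_p = conditional_pattern_base(ORDERED, "p")
print(base_p)                                 # ['fcam', 'cb', 'fcam']
print(conditional_frequent_items(base_p, 3))  # {'c': 3}
```

Only c survives the threshold, so the conditional FP-tree of p is the single node c:3.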
Frequent patterns containing m

Conditional pattern base for m
Items      Frequent Items
f,c,a      f,c,a
f,c,a      f,c,a
f,c,a,b    f,c,a

Conditional FP-tree of m
root
  f:3
    c:3
      a:3

Conditional pattern base for am
Items    Frequent Items
f,c      f,c
f,c      f,c
f,c      f,c

Conditional FP-tree of am
root
  f:3
    c:3

Conditional pattern base for cam
Items    Frequent Items
f        f
f        f
f        f

Conditional FP-tree of cam
root
  f:3

The set of frequent patterns containing m grows as the recursion proceeds:
{ m }
{ m, am }
{ m, am, cam }
{ m, am, cam, fcam }
{ m, am, cam, fcam, fam }
{ m, am, cam, fcam, fam, cm, fcm }
{ m, am, cam, fcam, fam, cm, fcm, fm }
Complete Frequent Pattern set
1 item:  f, c, a, b, m, p
2 items: fc, fa, fm, ca, cm, am, cp
3 items: fca, fcm, fam, cam
4 items: fcam
The patterns containing m are generated by the conditional FP-tree of m, which is a single path.
• A single path generates each combination of its nodes as a frequent pattern.
• The support of a pattern is equal to the minimum support among its nodes.
root
  f:3
    c:3
      a:3
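The single-path shortcut can be sketched with `itertools.combinations`. The function name `single_path_patterns` and the representation of a path as a list of (item, count) pairs are assumptions made for this illustration.

```python
from itertools import combinations

def single_path_patterns(path, alpha):
    """Every non-empty combination of path nodes, joined with the suffix alpha.

    path  : list of (item, count) pairs from the root down the single path
    alpha : the pattern the tree is conditioned on, e.g. "m"
    """
    patterns = {}
    for r in range(1, len(path) + 1):
        for beta in combinations(path, r):
            name = "".join(item for item, _ in beta) + alpha
            # Support is the smallest count among the chosen nodes.
            patterns[name] = min(count for _, count in beta)
    return patterns

result = single_path_patterns([("f", 3), ("c", 3), ("a", 3)], "m")
print(result)
# {'fm': 3, 'cm': 3, 'am': 3, 'fcm': 3, 'fam': 3, 'cam': 3, 'fcam': 3}
```

Together with m itself, these seven patterns are the eight frequent patterns containing m.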
Pseudocode
Procedure FP-growth(Tree, α)
  if Tree contains a single path P
    for each combination β of the nodes in P
      generate pattern β ∪ α with support = minimum support of a node in β
  else
    for each ai in the header of Tree do
      generate pattern β = α ∪ ai with support = ai.support
      construct β's conditional pattern base and conditional FP-tree Treeβ
      if Treeβ ≠ Ø
        call FP-growth(Treeβ, β)
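A compact Python sketch of this procedure follows. It is one possible reading of the pseudocode, not the authors' code: all names (`Node`, `build_tree`, `single_path`, `fp_growth`) are hypothetical, ties in item frequency are broken alphabetically (so the tree shape may differ from the slide, while the mined patterns do not), and the threshold is assumed inclusive.

```python
from collections import defaultdict
from itertools import combinations

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_tree(weighted_paths, xi):
    """weighted_paths: list of (items, count). Returns (root, header table)."""
    counts = defaultdict(int)
    for items, n in weighted_paths:
        for i in set(items):
            counts[i] += n
    root, header = Node(None, None), defaultdict(list)
    for items, n in weighted_paths:
        ordered = sorted((i for i in set(items) if counts[i] >= xi),
                         key=lambda i: (-counts[i], i))
        node = root
        for i in ordered:
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])  # node-links per item
            node = node.children[i]
            node.count += n
    return root, header

def single_path(root):
    """Return [(item, count), ...] if the tree is one path, else None."""
    path, node = [], root
    while node.children:
        if len(node.children) > 1:
            return None
        node = next(iter(node.children.values()))
        path.append((node.item, node.count))
    return path

def fp_growth(root, header, alpha, xi, out):
    path = single_path(root)
    if path is not None:
        for r in range(1, len(path) + 1):
            for beta in combinations(path, r):
                out[frozenset(i for i, _ in beta) | alpha] = min(c for _, c in beta)
        return
    for item, nodes in header.items():
        beta = alpha | {item}
        out[beta] = sum(n.count for n in nodes)
        base = []  # conditional pattern base of beta
        for n in nodes:
            prefix, p = [], n.parent
            while p.item is not None:
                prefix.append(p.item)
                p = p.parent
            if prefix:
                base.append((prefix, n.count))
        sub_root, sub_header = build_tree(base, xi)
        if sub_root.children:  # conditional FP-tree is not empty
            fp_growth(sub_root, sub_header, beta, xi, out)

DB = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
root, header = build_tree([(t, 1) for t in DB], xi=3)
patterns = {}
fp_growth(root, header, frozenset(), 3, patterns)
print(len(patterns))  # 18
```

On the example database with ξ = 3 this yields the 18 patterns of the complete frequent pattern set.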
Implementation issues
• Different support thresholds (ξ) give different FP-trees. We may choose ξ = 20 to build the tree if 98% of the queries have ξ ≥ 20.
• Updating the FP-tree after each new transaction may be costly. We may instead count the occurrence frequency of every item and update the tree only when the relative frequency of an item changes significantly.
New Challenges
• FP-growth may output a large number of frequent patterns for a small ξ and a very small number of frequent patterns for a large ξ. We may not know the right ξ for our purpose.
• Which frequent patterns are good instances for generating interesting association rules?
Top-K frequent closed patterns
• A closed pattern is a pattern whose support is strictly larger than the support of each of its proper super-patterns.
TID   Items Bought       Frequent Items
100   f,a,c,d,g,i,m,p    f,c,a,m,p
200   a,b,c,f,l,m,o      f,c,a,b,m
300   b,f,h,j,o          f,b
400   b,c,k,s,p          f,c,b,p
500   a,f,c,e,l,p,m,n    f,c,a,m,p

Frequent patterns with their supports:
1 item:  f:5, c:4, a:3, b:3, m:3, p:3
2 items: fc:4, fa:3, fm:3, fp:3, fb:3, ca:3, cm:3, am:3, cp:3
3 items: fca:3, fcm:3, fam:3, cam:3
4 items: fcam:3
• We can also specify the minimum length of the patterns.
• The top-2 frequent closed patterns with length ≥ 2 are fc and fcam.
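The definition of closed patterns and the top-k selection can be checked with a brute-force sketch. This only illustrates the definition by enumerating every itemset and is not the FP-tree based mining algorithm. The names are hypothetical, the transactions are the "Frequent Items" column of this slide, and breaking support ties in favour of longer patterns is an assumption made here so that the ranking is deterministic.

```python
from itertools import combinations

# "Frequent Items" column of the table on this slide.
DB = ["fcamp", "fcabm", "fb", "fcbp", "fcamp"]

def support(pattern, db):
    return sum(1 for t in db if set(pattern) <= set(t))

def closed_patterns(db, min_length):
    items = sorted({i for t in db for i in t})
    supports = {}
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            s = support(combo, db)
            if s > 0:
                supports[frozenset(combo)] = s
    closed = {}
    for p, s in supports.items():
        # Closed: every one-item extension has strictly smaller support.
        if len(p) >= min_length and all(
            supports.get(p | {i}, 0) < s for i in items if i not in p
        ):
            closed[p] = s
    return closed

def top_k_closed(db, k, min_length):
    closed = closed_patterns(db, min_length)
    # Assumption: ties on support are broken by longer pattern first.
    ranked = sorted(closed.items(),
                    key=lambda kv: (-kv[1], -len(kv[0]), sorted(kv[0])))
    return [("".join(sorted(p)), s) for p, s in ranked[:k]]

print(top_k_closed(DB, k=2, min_length=2))
# [('cf', 4), ('acfm', 3)], i.e. fc and fcam
```

Checking one-item extensions is sufficient because support never increases when items are added to a pattern.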
Mining Top-K closed FP
• The algorithm starts with an FP-tree built with a support threshold of 0.
• While building the tree, it prunes the patterns with length < min_length.
• After the tree is built, it prunes the relatively infrequent patterns by raising the support threshold.
• Mining is performed on the final pruned FP-tree.
Compressed Frequent Pattern
• FP-growth may end up with a large set of patterns.
• We can compress the set of frequent patterns by grouping it into a minimal number of clusters and selecting a representative pattern from each cluster.
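Frequent patterns:
1 item:  f, c, a, b, m, p
2 items: fc, fa, fm, ca, cm, am, cp
3 items: fca, fcm, fam, cam
4 items: fcam
Representative patterns: { fcam, cam, cp, b }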
Clustering Criterion
• For each cluster there must be a representative pattern Pr .
• D(P,Pr ) ≤ δ for all patterns inside the cluster of Pr .
• D(P1, P2) = 1 − |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)|
• T(P) is the set of transactions that support P.
• D is a metric for closed patterns.
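The distance D can be computed directly from the supporting transaction sets. The sketch below uses the ordered transactions of the original example database, keyed by TID; the function names `T` and `D` mirror the notation above, and returning 0 when neither pattern is supported is an assumption added to avoid a division by zero.

```python
# TID -> frequent items of the original example database.
DB = {100: "fcamp", 200: "fcabm", 300: "fb", 400: "cbp", 500: "fcamp"}

def T(pattern, db):
    """Set of transaction ids that support `pattern`."""
    return {tid for tid, items in db.items() if set(pattern) <= set(items)}

def D(p1, p2, db):
    """One minus the Jaccard similarity of the supporting transaction sets."""
    t1, t2 = T(p1, db), T(p2, db)
    union = t1 | t2
    if not union:  # assumption: two unsupported patterns have distance 0
        return 0.0
    return 1 - len(t1 & t2) / len(union)

print(D("fcam", "cam", DB))  # 0.0  (same supporting transactions)
print(D("f", "fcam", DB))    # 0.25
print(D("cp", "b", DB))      # 0.8
```

A pattern P can be represented by Pr when D(P, Pr) ≤ δ, so fcam and cam (distance 0) fall into the same cluster for any δ ≥ 0.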
Summary
• FP-tree is an extended prefix tree that summarizes the database in a compressed form.
• FP-growth is an algorithm for mining frequent patterns using FP-tree.
• FP-tree can also be used to mine Top-K frequent closed patterns and Compressed frequent patterns.
References
• Mining Frequent Patterns without Candidate Generation. Jiawei Han, Jian Pei and Yiwen Yin
• Mining Top-K Frequent Closed Patterns without Minimum Support. Jiawei Han, Jianyong Wang, Ying Lu and Petre Tzvetkov
• Mining Compressed Frequent-Pattern Sets. Dong Xin, Jiawei Han, Xifeng Yan and Hong Cheng
Thank You