Verifying and mining frequent patterns from large windows over data streams
Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo
ICDE 2008
Outline
Introduction and motivation
SWIM algorithm
DTV, DFV algorithms
Experiments
Conclusion
Introduction and motivation
Conditional counting
Verifiers: DTV and DFV verify the frequency of previously frequent itemsets over newly arriving windows
Fast verifier for incremental frequent itemset mining: Sliding window incremental miner (SWIM)
SWIM algorithm
The difficulty: when a new pattern is added to the pattern tree for the first time, its true frequency in the whole window is not known, since this pattern wasn't frequent in the previous n-1 slides
W: window
PT (pattern tree): a superset of the frequent patterns over W
aux_array: stores the frequency of a pattern for each window for which the frequency is unknown
p.fi: the frequency of p in the ith slide
p.freq: p's cumulative frequency in the current window
SWIM algorithm (cont.)
Example:
Slides: S2 S3 S4 S5 S6 S7 ...
Windows: W4 W5 W6 W7
W4: aux_array = <p.f4, p.f4>, p.freq = p.f4
W5: aux_array = <p.f2+p.f4, p.f4+p.f5>, p.freq = p.f4+p.f5
W6: aux_array = <p.f2+p.f3+p.f4, p.f3+p.f4+p.f5>, p.freq = p.f4+p.f5+p.f6
W7: p.freq = p.f5+p.f6+p.f7
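The bookkeeping in the example above can be traced with a small script. This is only a sketch, not the authors' implementation: it follows a single pattern p, assumes windows of n = 3 slides (as the sums in the example suggest), assumes p first enters the pattern tree at slide S4, and uses made-up per-slide frequencies. The function name `swim_trace` and its parameters are hypothetical.

```python
def swim_trace(f, n, k, last):
    """Trace aux_array and p.freq for one pattern p.

    f    : dict mapping slide index i to p.fi (made-up values here)
    n    : number of slides per window
    k    : index of the slide in which p is first added to the pattern tree
    last : index of the last window to trace
    Window Wj is assumed to cover slides S(j-n+1) .. Sj.
    """
    # Windows Wk .. W(k+n-2) contain slides older than Sk, so p's
    # frequency over them is not yet known; keep one partial sum each.
    aux = {j: f[k] for j in range(k, k + n - 1)}
    freq = f[k]
    trace = [(k, dict(aux), freq)]
    for t in range(k + 1, last + 1):
        if t >= k + n:
            aux = {}                 # all pending windows are resolved
        freq += f[t]                 # count p in the newly arrived slide St
        expired = t - n              # slide that leaves the window
        if expired >= k:
            freq -= f[expired]       # p was already counted in this slide
        for j in aux:
            if j - n + 1 <= t <= j:
                aux[j] += f[t]
            # p was not tracked before Sk, so its count in an older
            # slide is computed (verified) when that slide expires.
            if expired < k and j - n + 1 <= expired <= j:
                aux[j] += f[expired]
        trace.append((t, dict(aux), freq))
    return trace


# Hypothetical per-slide frequencies p.f2 .. p.f7
f = {2: 1, 3: 2, 4: 5, 5: 3, 6: 4, 7: 2}
for window, aux, freq in swim_trace(f, n=3, k=4, last=7):
    print(f"W{window}: aux_array={aux} p.freq={freq}")
```

With these numbers the entry for W4 grows from p.f4 to p.f2+p.f4 to p.f2+p.f3+p.f4, so the true frequency of p in W4 only becomes known two slides later, which is the n-1 delay discussed in the analysis.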
Analysis of SWIM algorithm
Delay: a pattern is reported late when its frequency over the whole window turns out to be larger than the minimum support only after its aux_array has been completed
Maximum delay: n-1 slides (n: number of slides per window)
Bottleneck: counting the frequencies of itemsets over a given dataset (delay = L, n-L+1 slides)
Conditional counting
Goal: verify the counts of a given set of patterns; for each pattern p, the verifier either
1. reports p's true frequency in D, if p has occurred at least min_freq times, or
2. reports that p has occurred fewer than min_freq times (the exact frequency is not required in this case, so the verifier can skip any pattern whose frequency is less than min_freq)
Conditional counting (cont.)
Verification
Given: a set of transactions T, a set of patterns P and a threshold s
Goal: find the exact frequency of each p ∈ P w.r.t. T, iff its frequency is ≥ s
If s = 0, verification = counting, but if s > 0 extra computation can be avoided
Proposed fast verifiers: DTV, DFV, hybrid
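To make the definition of verification concrete, here is a naive reference verifier. It is a baseline sketch for illustration only, not DTV or DFV: it scans the transactions once per pattern and gives up on a pattern as soon as the remaining transactions can no longer bring its count up to s. The function name `verify` and the use of `None` to mean "below the threshold" are assumptions made for this sketch.

```python
def verify(transactions, patterns, s):
    """Return {pattern: exact frequency} for patterns with frequency >= s,
    and {pattern: None} for patterns that occur fewer than s times."""
    tsets = [frozenset(t) for t in transactions]
    result = {}
    for p in patterns:
        pset = frozenset(p)
        count = 0
        for i, t in enumerate(tsets):
            remaining = len(tsets) - i      # transactions not yet examined
            if count + remaining < s:
                count = None                # threshold s is out of reach
                break
            if pset <= t:                   # pattern is a subset of t
                count += 1
        result[p] = count if count is not None and count >= s else None
    return result


T = [("a", "b", "c"), ("a", "b"), ("b", "c"), ("a", "c", "d")]
P = [("a", "b"), ("c", "d"), ("b",)]
print(verify(T, P, s=2))   # exact counts only for patterns reaching s
print(verify(T, P, s=0))   # with s = 0 this is plain counting
```

With s = 0 the early exit never fires and every pattern is counted in full, which matches the remark that verification reduces to counting in that case.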
Double-Tree Verifier (DTV)
[Figure: DTV example on an FP-tree and a pattern tree. Panels: original FP-tree; FP-tree conditionalized on g; FP-tree | g conditionalized on d; initial pattern tree; pattern tree | "g"; pattern tree | "g" after verification against the FP-tree; filling the original pattern tree using reverse pointers.]
Double-Tree Verifier (DTV) (cont.)
For very small min_freq values, it becomes impossible to run FP-growth because of the exponential number of paths
Advantage: it is useful when the minimum support decreases
Depth-First Verifier (DFV)
Ancestor Failure: if a path in the fp-tree has already proved to not contain a prefix of the pattern p, then it does not contain p itself either (apriori property)
Smaller Sibling Equivalence: if a path in the fp-tree has already been marked to (or not to) contain a smaller sibling of the pattern p, then it does (or does not) contain p itself too
Parent Success: if a path in the fp-tree has already been marked to contain the parent pattern of p, then it also contains p
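The Ancestor Failure rule can be illustrated on a single fp-tree path. The sketch below is a simplification, not DFV itself: it stores the patterns in a prefix tree built from nested dictionaries, walks it depth-first, and skips a whole subtree as soon as the path is missing an item of the prefix. The names `build_pattern_tree` and `patterns_on_path` are hypothetical, and patterns are assumed to be given in one fixed item order.

```python
def build_pattern_tree(patterns):
    """Prefix tree of patterns; each node is a dict {item: child node}."""
    root = {}
    for p in patterns:
        node = root
        for item in p:
            node = node.setdefault(item, {})
    return root


def patterns_on_path(tree, path_items, prefix=()):
    """Depth-first walk returning every pattern-tree node whose pattern
    is contained in the given fp-tree path (a set of items)."""
    found = []
    for item, child in tree.items():
        if item not in path_items:
            # Ancestor failure: the path does not contain this prefix,
            # so no pattern below it can be contained either.
            continue
        p = prefix + (item,)
        found.append(p)
        found.extend(patterns_on_path(child, path_items, p))
    return found


tree = build_pattern_tree([("a", "b", "d"), ("a", "c"), ("b", "e")])
print(patterns_on_path(tree, {"a", "b", "c"}))
```

Every prefix of a stored pattern is itself a node of the tree, so prefixes such as ("a",) appear in the result alongside the patterns that were inserted explicitly.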
Hybrid Version
Many transactions in the fp-tree and many patterns in the pattern tree: DTV is faster than DFV
Trees are small: DFV is faster than DTV
Hybrid: start with DTV until the conditionalized trees are "small enough", and after that point switch to DFV
Experiments
Experiments (cont.)
transactions = 100k
Conclusion
Verifiers can speed up many other applications:
• incremental mining (SWIM)
• enhancing static algorithms (counting phase)
• privacy-preserving techniques (long transactions)
• monitoring / concept shift detection
Hybrid: there is no exact point at which to switch from DTV to DFV