
Verifying and Mining Frequent Patterns from Large Windows over Data Streams

Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo

ICDE 2008


Outline

• Introduction and motivation
• SWIM algorithm
• DTV and DFV algorithms
• Experiments
• Conclusion


Introduction and motivation

Conditional counting

Verifiers (DTV, DFV): verify the frequency of previously frequent itemsets over newly arriving windows

Incremental frequent itemset mining built on a fast verifier: Sliding Window Incremental Miner (SWIM)


SWIM algorithm

The difficulty: when a new pattern is added to the pattern tree for the first time, its true frequency over the whole window is not known, since the pattern was not frequent in the previous n-1 slides

W: the window
PT (pattern tree): a superset of the frequent patterns over W
aux_array: stores the frequency of a pattern for each window for which the frequency is unknown
p.fi: the frequency of p in the i-th slide
p.freq: p's cumulative frequency in the current window


SWIM algorithm (cont.)

Example:

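Slides: S2 S3 S4 S5 S6 S7 ...
Windows: W4 W5 W6 W7 (as the p.freq values show, each window Wi covers the three slides ending at Si)

W4: aux_array = <p.f4, p.f4>, p.freq = p.f4
W5: aux_array = <p.f2+p.f4, p.f4+p.f5>, p.freq = p.f4+p.f5
W6: aux_array = <p.f2+p.f3+p.f4, p.f3+p.f4+p.f5>, p.freq = p.f4+p.f5+p.f6
W7: p.freq = p.f5+p.f6+p.f7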
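The bookkeeping in this example can be replayed with a few lines of Python. This is a minimal sketch, not the authors' implementation: the per-slide counts in `f`, the function name `trace_swim`, and the assumptions that each window covers three slides and that p enters the pattern tree at S4 are made up for illustration.

```python
def trace_swim(f, n, first, last):
    """Replay p.freq and aux_array for one pattern p.

    f     -- hypothetical per-slide counts of p (slide index -> count)
    n     -- slides per window; window W_i covers slides i-n+1 .. i
    first -- slide at which p enters the pattern tree
    last  -- last slide to process
    """
    def covers(w, s):
        return w - n + 1 <= s <= w

    # One aux entry per window whose true frequency is still unknown.
    aux = {w: f[first] for w in range(first, first + n - 1)}
    freq = f[first]
    history = {first: freq}

    for t in range(first + 1, last + 1):
        expired = t - n  # slide that leaves the window when S_t arrives
        if expired < first:
            # p was not counted when this slide arrived, so it is
            # verified now and credited to the windows that contained it.
            for w in aux:
                if covers(w, expired):
                    aux[w] += f[expired]
        else:
            freq -= f[expired]  # ordinary sliding-window update
        for w in aux:
            if covers(w, t):
                aux[w] += f[t]
        freq += f[t]
        history[t] = freq
    return aux, history


f = {2: 1, 3: 2, 4: 5, 5: 4, 6: 3, 7: 2}  # made-up counts
aux, history = trace_swim(f, n=3, first=4, last=7)
print(aux)      # {4: 8, 5: 11}, i.e. f2+f3+f4 and f3+f4+f5
print(history)  # {4: 5, 5: 9, 6: 12, 7: 9}
```

With these counts, the frequencies of W4 and W5 only become fully known once S3 has expired, which is the delay discussed in the analysis of SWIM.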


Analysis of SWIM algorithm

Delay: a pattern is reported late when its frequency in an earlier window only later turns out to be larger than the minimum support

Maximum delay: n-1 slides (n: number of slides per window)

Bottleneck: counting the frequencies of itemsets over a given dataset (delay = L, n-L+1 slides)


Conditional counting

Goal: verify the counts of a given set of patterns. For each pattern p, the verifier either

1. returns p's true frequency in D, if p has occurred at least min_freq times, or

2. reports that p has occurred fewer than min_freq times (the exact frequency is not required in this case, so the verifier can skip any pattern whose frequency is less than min_freq)


Conditional counting (cont.)

Verification: given a set of transactions T, a set of patterns P, and a threshold s

Goal: find the exact frequency of each p ∈ P w.r.t. T iff its frequency is ≥ s

If s = 0, verification = counting; but if s > 0, extra computation can be avoided

Proposed fast verifiers: DTV, DFV, hybrid
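To make the verifier contract concrete, here is a minimal sketch of a naive verifier. It is neither DTV nor DFV: it scans the transactions directly and uses a simple early exit (an assumption of this sketch) to show how a threshold s > 0 allows work to be skipped. The function name `verify` and the convention of returning `None` for "below threshold" are hypothetical.

```python
def verify(transactions, patterns, s):
    """Return {pattern: exact count if count >= s, else None}."""
    transactions = [frozenset(t) for t in transactions]
    result = {}
    for p in patterns:
        p = frozenset(p)
        count = 0
        remaining = len(transactions)
        for t in transactions:
            # Early exit: even if every remaining transaction matched,
            # p could no longer reach the threshold s.
            if count + remaining < s:
                break
            if p <= t:  # subset test: t contains every item of p
                count += 1
            remaining -= 1
        result[p] = count if count >= s else None
    return result


T = ["abc", "ab", "bc", "ac", "abd"]  # made-up transactions
print(verify(T, ["ab", "cd", "b"], s=3))
```

With s = 0 the early exit never fires and the function degenerates to plain counting, matching the remark that verification equals counting in that case.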


Double-Tree Verifier (DTV)

Figure: DTV example. FP-tree panels: original fp-tree; fp-tree conditionalized on g; fp-tree | g conditionalized on d. Pattern-tree panels: initial pattern tree; pattern tree | g; pattern tree | g after verification against the fp-tree; filling the original pattern tree using reverse pointers.


Double-Tree Verifier (DTV)

For very small min_freq values, it becomes impossible to run FP-growth due to the exponential number of paths

Advantage: DTV remains useful when the minimum support decreases


Depth-First Verifier (DFV)

Ancestor Failure: if a path in the fp-tree has already been shown not to contain a prefix of the pattern p, then it does not contain p itself either (Apriori property)

Smaller Sibling Equivalence: if a path in the fp-tree has already been marked as containing (or not containing) a smaller sibling of the pattern p, then it also does (or does not) contain p itself

Parent Success: if a path in the fp-tree has already been marked as containing the parent pattern of p, then it also contains p
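The rules above speed up one basic operation: deciding whether a path of the fp-tree contains a pattern. The following is a simplified sketch of that unoptimized check only; it does not implement the three marking rules. The class and function names are hypothetical, and items are ordered alphabetically rather than by frequency to keep the example short.

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions):
    """Build an fp-tree; header maps each item to its nodes."""
    root, header = Node(None, None), {}
    for t in transactions:
        node = root
        for item in sorted(set(t)):  # assumed global item order
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header


def frequency(header, pattern):
    """Count the transactions containing every item of a non-empty pattern."""
    items = sorted(set(pattern))
    if not items:
        raise ValueError("pattern must be non-empty")
    total = 0
    # Start at each node carrying the last item and walk towards the root.
    for node in header.get(items[-1], []):
        need = len(items) - 2  # index of the next item to find
        anc = node.parent
        while need >= 0 and anc.item is not None:
            if anc.item == items[need]:
                need -= 1
            elif anc.item < items[need]:
                break  # sorted path: the item cannot appear higher up
            anc = anc.parent
        if need < 0:
            total += node.count
    return total


_, header = build_fp_tree(["abc", "ab", "bc", "ac", "abd"])
print(frequency(header, "ab"), frequency(header, "bc"))  # 3 2
```

Every pattern here repeats the upward walk from scratch; the marks described above let a verifier reuse what earlier walks already established for a prefix, a sibling, or the parent pattern.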


Hybrid Version

Many transactions in the fp-tree and many patterns in the pattern tree: DTV is faster than DFV

Trees are small: DFV is faster than DTV

Hybrid: start with DTV until the conditionalized trees are "small enough", then switch to DFV


Experiments


Experiments (cont.)

Number of transactions: 100K


Conclusion

Fast verifiers speed up many other applications:
• incremental mining (SWIM)
• enhancing static algorithms (counting phase)
• privacy-preserving techniques (long transactions)
• monitoring / concept shift detection

Hybrid: there is no exact point at which to switch from DTV to DFV