Efficiently Mining Long Patterns from Databases Roberto J. Bayardo Jr. IBM Almaden Research Center.

Efficiently Mining Long Patterns from Databases

Roberto J. Bayardo Jr.

IBM Almaden Research Center

Abstract

• Max-Miner: scales roughly linearly in the number of maximal patterns, irrespective of the length of the longest pattern

• previous algorithms: scale exponentially with the length of the longest pattern

Introduction (Cont’d)

• pattern mining algorithms have been developed to operate on databases where the longest patterns are relatively short

• Interesting data-sets with long patterns

– sales transactions detailing the purchases made by regular customers over a large time window

– biological data from the fields of DNA and protein analysis

Introduction (Cont’d)

• Apriori-like algorithms are inadequate on data-sets with long patterns

• a bottom-up search

– enumerates every single frequent itemset

– exponential complexity

• in order to produce a frequent itemset of length l, it must produce all 2^l of its subsets since they too must be frequent

– restricts Apriori-like algorithms to discovering only short patterns
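To make the exponential cost concrete, the following minimal Python sketch counts the subsets of a frequent itemset of length l. The function name `subset_count` is an illustrative choice and does not come from the slides; the count includes the empty set, which is how the 2^l figure arises.

```python
def subset_count(length: int) -> int:
    """Number of subsets (including the empty set) of an itemset with `length` items."""
    return 2 ** length


# A bottom-up search has to visit every one of these subsets
# before it can report the long itemset itself.
for l in (10, 20, 40):
    print(l, subset_count(l))
```

A pattern of length 40 already has more than a trillion subsets, which is why a search that enumerates each of them is limited to short patterns in practice.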

Introduction

• Max-Miner algorithm

– for efficiently extracting only the maximal frequent itemsets

– roughly linear in the number of maximal frequent itemsets

– “look ahead”, not bottom-up search

– can prune all its subsets from consideration, by identifying a long frequent itemset early on

Max-Miner (Cont’d)

• Rymon’s generic set-enumeration tree search framework

• ex. Figure 1.

• breadth-first search

– in order to limit the number of passes

• pruning strategies

– subset infrequency pruning (as does Apriori)

– superset frequency pruning

Figure 1. Complete set-enumeration tree over four items

{ }

1 2 3 4

1,2 1,3 1,4 2,3 2,4 3,4

1,2,3 1,2,4 1,3,4 2,3,4

1,2,3,4
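The tree in Figure 1 can be generated mechanically: each node is extended only with items that come after its last item in the ordering. The sketch below illustrates that rule under the assumption that items are given as an ordered Python list; the function name `set_enumeration_tree` is hypothetical.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, ...]


def set_enumeration_tree(items: List[int]) -> Dict[Node, List[Node]]:
    """Map each node (an itemset stored as a tuple) to its child nodes.

    A node is extended only with items that follow its last item in the
    ordering, so every itemset is enumerated exactly once.
    """
    tree: Dict[Node, List[Node]] = {}
    frontier: List[Node] = [()]          # start at the root, the empty itemset
    while frontier:
        node = frontier.pop()
        start = items.index(node[-1]) + 1 if node else 0
        children = [node + (item,) for item in items[start:]]
        tree[node] = children
        frontier.extend(children)
    return tree


tree = set_enumeration_tree([1, 2, 3, 4])
print(tree[(1,)])   # children of the node enumerating {1}
print(len(tree))    # one node per subset of the four items
```

With four items the tree has 16 nodes, one for each subset, including the root.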

Max-Miner (Cont’d)

• candidate group g

– head, h(g)

• represents the itemset enumerated by the node

– tail, t(g)

• an ordered set

• contains all items not in h(g) that can potentially appear in any sub-node

– ex. the node enumerating itemset {1}: h(g) = {1}, t(g) = {2, 3, 4}
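One possible way to represent a candidate group in code is a small record holding the head and the ordered tail. The class name `CandidateGroup` and the method `full_itemset` below are illustrative assumptions, not names from the slides.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class CandidateGroup:
    head: FrozenSet[int]     # h(g): the itemset enumerated by the node
    tail: Tuple[int, ...]    # t(g): ordered items that may appear in sub-nodes

    def full_itemset(self) -> FrozenSet[int]:
        """h(g) united with t(g): the longest itemset any sub-node can enumerate."""
        return self.head | frozenset(self.tail)


# The node enumerating itemset {1} in the four-item tree.
g = CandidateGroup(head=frozenset({1}), tail=(2, 3, 4))
print(sorted(g.full_itemset()))
```

The tail is stored as a tuple rather than a set because its order determines which items each sub-node inherits.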

Max-Miner

• counting the support of a candidate group g

– computing the support of itemsets h(g), h(g) ∪ t(g), and h(g) ∪ {i} for all i ∈ t(g)

• superset-frequency pruning

– halting sub-node expansion at any candidate group g for which h(g) ∪ t(g) is frequent

• subset-infrequency pruning

– removing any tail item i for which h(g) ∪ {i} is infrequent from the candidate group before expanding its sub-nodes
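The three kinds of support counted for a candidate group, and the two pruning decisions they drive, can be sketched as follows. This is a minimal illustration assuming transactions are Python frozensets and the support threshold `min_sup` is an absolute count; the function name `count_group` and the sample data are made up for the example.

```python
from typing import Dict, FrozenSet, List, Tuple


def count_group(
    head: FrozenSet[int],
    tail: Tuple[int, ...],
    transactions: List[FrozenSet[int]],
) -> Tuple[int, int, Dict[int, int]]:
    """Return sup(h(g)), sup(h(g) ∪ t(g)) and sup(h(g) ∪ {i}) per tail item, in one pass."""
    head_sup = 0
    full_sup = 0
    item_sup = {i: 0 for i in tail}
    for t in transactions:
        if not head <= t:
            continue                      # transaction is irrelevant to this node
        head_sup += 1
        present = [i for i in tail if i in t]
        for i in present:
            item_sup[i] += 1
        if len(present) == len(tail):
            full_sup += 1                 # transaction contains h(g) ∪ t(g)
    return head_sup, full_sup, item_sup


transactions = [frozenset(t) for t in ([1, 2, 3, 4], [1, 2, 3], [1, 2, 4], [1, 3], [2, 4])]
min_sup = 3
head, tail = frozenset({1}), (2, 3, 4)
head_sup, full_sup, item_sup = count_group(head, tail, transactions)

# Superset-frequency pruning: stop expanding if h(g) ∪ t(g) is frequent.
halt_expansion = full_sup >= min_sup
# Subset-infrequency pruning: drop tail items i with infrequent h(g) ∪ {i}.
pruned_tail = tuple(i for i in tail if item_sup[i] >= min_sup)
print(halt_expansion, pruned_tail)
```

In this sample, h(g) ∪ t(g) = {1, 2, 3, 4} is infrequent, so the node is expanded, but item 4 is dropped from the tail first because {1, 4} does not reach the threshold.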

Figure 2. Max-Miner at its top level

MAX-MINER (Data-set T)
;; Returns the set of maximal frequent itemsets present in T
Set of Candidate Groups C ← { }
Set of Itemsets F ← {GEN-INITIAL-GROUPS(T, C)}
while C is non-empty do
    scan T to count the support of all candidate groups in C
    for each g ∈ C such that h(g) ∪ t(g) is frequent do
        F ← F ∪ {h(g) ∪ t(g)}
    Set of Candidate Groups Cnew ← { }
    for each g ∈ C such that h(g) ∪ t(g) is infrequent do
        F ← F ∪ {GEN-SUB-NODES(g, Cnew)}
    C ← Cnew
    remove from F any itemset with a proper superset in F
    remove from C any group g such that h(g) ∪ t(g) has a superset in F
return F

Figure 3. Generating the initial candidate groups

GEN-INITIAL-GROUPS (Data-set T, Set of Candidate Groups C)
;; C is passed by reference and returns the candidate groups
;; The return value of the function is a frequent 1-itemset
scan T to obtain F1, the set of frequent 1-itemsets
impose an ordering on the items in F1
for each item i in F1 other than the greatest item do
    let g be a new candidate with h(g) = {i}
    and t(g) = {j | j follows i in the ordering}
    C ← C ∪ {g}
return the itemset in F1 containing the greatest item

Figure 4. Generating sub-nodes

GEN-SUB-NODES (Candidate Group g, Set of Cand. Groups C)
;; C is passed by reference and returns the sub-nodes of g
;; The return value of the function is a frequent itemset
remove any item i from t(g) if h(g) ∪ {i} is infrequent
reorder the items in t(g)
for each i ∈ t(g) other than the greatest do
    let g' be a new candidate with h(g') = h(g) ∪ {i}
    and t(g') = {j | j ∈ t(g) and j follows i in t(g)}
    C ← C ∪ {g'}
return h(g) ∪ {m} where m is the greatest item in t(g), or h(g) if t(g) is empty
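The pseudocode in Figures 2 through 4 can be put together into a small runnable Python sketch. It is a simplified illustration, not the author's implementation: it assumes an absolute support threshold `min_sup`, represents a candidate group as a `(head, tail)` pair, and recomputes each support with its own scan through the hypothetical helper `support`, whereas the slides count all groups of a level in a single pass over T. Both the initial ordering and the reordering use increasing support, with ties broken by item value.

```python
from typing import FrozenSet, List, Set, Tuple

Itemset = FrozenSet[int]
Group = Tuple[Itemset, Tuple[int, ...]]   # (head h(g), ordered tail t(g))


def support(itemset: Itemset, transactions: List[Itemset]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def gen_initial_groups(transactions: List[Itemset], min_sup: int, groups: List[Group]) -> Itemset:
    """Fill `groups` with one group per frequent item; return the greatest 1-itemset."""
    sup = {}
    for t in transactions:
        for i in t:
            sup[i] = sup.get(i, 0) + 1
    frequent = sorted((i for i in sup if sup[i] >= min_sup), key=lambda i: (sup[i], i))
    for pos, item in enumerate(frequent[:-1]):
        groups.append((frozenset({item}), tuple(frequent[pos + 1:])))
    return frozenset(frequent[-1:])       # empty itemset if nothing is frequent


def gen_sub_nodes(group: Group, transactions: List[Itemset], min_sup: int,
                  new_groups: List[Group]) -> Itemset:
    """Prune and reorder the tail, append the sub-nodes, return a frequent itemset."""
    head, tail = group
    sup = {i: support(head | {i}, transactions) for i in tail}
    kept = sorted((i for i in tail if sup[i] >= min_sup), key=lambda i: (sup[i], i))
    for pos, item in enumerate(kept[:-1]):
        new_groups.append((head | {item}, tuple(kept[pos + 1:])))
    return head | {kept[-1]} if kept else head


def max_miner(transactions: List[Itemset], min_sup: int) -> Set[Itemset]:
    """Return the maximal frequent itemsets found by the level-wise search."""
    groups: List[Group] = []
    found: Set[Itemset] = {gen_initial_groups(transactions, min_sup, groups)}
    while groups:
        new_groups: List[Group] = []
        for head, tail in groups:
            full = head | frozenset(tail)
            if support(full, transactions) >= min_sup:
                found.add(full)           # superset-frequency pruning: do not expand
            else:
                found.add(gen_sub_nodes((head, tail), transactions, min_sup, new_groups))
        # Keep only itemsets without a proper superset in the result so far.
        found = {s for s in found if not any(s < other for other in found)}
        # Drop groups whose largest possible itemset is already covered.
        groups = [(h, t) for h, t in new_groups
                  if not any((h | frozenset(t)) <= s for s in found)]
    return found


transactions = [frozenset(t) for t in ([1, 2, 3, 4], [1, 2, 3], [1, 2, 4], [1, 3], [2, 4])]
print(sorted(sorted(s) for s in max_miner(transactions, 2)))
```

On this toy data with a threshold of 2, the maximal frequent itemsets are {1, 2, 3} and {1, 2, 4}; the group with head {4} is never expanded because its full itemset {1, 2, 4} is found frequent in the first pass.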

Item Ordering Policies

• to increase the effectiveness of superset-frequency pruning

• to position the most frequent items last

• ordering: Gen-Initial-Groups

– in increasing order of sup({i})

• reordering: Gen-Sub-Nodes

– in increasing order of sup(h(g) ∪ {i})

• consider only the subset of transactions relevant to the given node
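The initial ordering policy can be illustrated with a short Python sketch that sorts the frequent items by increasing support, so the most frequent items come last and therefore appear in the most tails. The function name `order_items_by_support`, the tie-breaking by item value, and the sample data are assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, List, Set


def order_items_by_support(transactions: Iterable[Set[int]], min_sup: int) -> List[int]:
    """Frequent items in increasing order of sup({i}); ties broken by item value."""
    counts = Counter(item for t in transactions for item in t)
    frequent = [i for i in counts if counts[i] >= min_sup]
    return sorted(frequent, key=lambda i: (counts[i], i))


transactions = [{1, 2, 3, 4}, {1, 2, 3}, {1, 2, 4}, {1, 3}, {2, 4}, {5}]
print(order_items_by_support(transactions, 2))
```

Here items 3 and 4 (support 3) are placed before items 1 and 2 (support 4), and item 5 is dropped as infrequent. The reordering step applies the same idea inside a node, using sup(h(g) ∪ {i}) instead of sup({i}).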