Frequent Itemsets and Association Rules
MTAT.03.183 Data Mining. 20.02.2014.
Konstantin Tretyakov
Frequent patterns
Humpty Dumpty sat on a wall,
Humpty Dumpty had a great fall;
All the King's horses and all the King's men
Couldn't put Humpty together again
The frequent pattern here: [H|D]umpty
Frequent patterns
Front page Search Product A Product B Product C
Front page Ad Product B Purchase Product C
Front page Front page Feedback
Front page Search Product B Search Product C Purchase
Front page Ad Product C Search Product B Purchase
Front page Front page Front page Front page
Why frequent patterns?
Repetition in data always indicates structure.
This structure may be due to known processes.
Otherwise, we want to know about it and explain it.
Consequently, searching for frequent patterns in data is one of the most basic procedures in descriptive data analysis.
Quiz
Is a frequent pattern always interesting?
No (e.g. "Front page" is frequent but uninteresting).
Is an interesting pattern always frequent?
No (e.g. "Diapers & Beer" may be a rare but very valuable pattern).

NB:
Finding frequent patterns is primarily an algorithmic problem.
Finding "interesting" patterns is primarily a statistical problem.
Frequent itemsets
In general, we may be interested in different types of data and types of patterns:
Natural language / Word sequences
Text / Regular expression patterns
Web log / Event sequences (various models)
Purchases / Item sets  (today's topic)
…
However, many of the notions and ideas apply to all data / pattern types.
Definitions
Let the data consist of transactions (i.e. sets of items).
The patterns we are interested in are itemsets.
E.g. {Milk, Eggs} is an itemset.
Definitions
The support of an itemset is the proportion of transactions in which it occurs:
support({Milk, Eggs}) = 0
support({Bread}) = 4/5 = 0.8
support({Milk, Beer}) = 2/5 = 0.4
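The transaction table on this slide is an image and is not reproduced in this text version. As a minimal sketch, assuming a five-transaction grocery dataset consistent with the support values quoted above (the table contents are a reconstruction, not taken from the slide), support can be computed directly:

```python
# Hypothetical reconstruction of the slide's transaction table
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return hits / len(transactions)

print(support({"Milk", "Eggs"}, transactions))  # 0.0
print(support({"Bread"}, transactions))         # 0.8
print(support({"Milk", "Beer"}, transactions))  # 0.4
```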
NB: in some treatments, support denotes the number of transactions containing the itemset (the "support count"), e.g. support({Bread}) = 4, support({Milk, Beer}) = 2.
Definitions
An itemset is frequent if its support is greater than or equal to some predefined threshold s_min.
Let s_min = 3 (as a support count). The frequent itemsets are: {Bread}, {Milk}, {Diaper}, {Beer}, {Bread, Milk}, {Bread, Diaper}, {Milk, Diaper}, {Diaper, Beer}, {}.
Explanations for itemsets
Suppose we found that {Beer, Milk, Diapers} is a frequent itemset. How do we explain it?
Perhaps it is frequent because there is a causal relationship: {Diapers, Milk} ⇒ Beer.
Association rules
An association rule is an implication of the form X ⇒ Y, where X and Y are itemsets.
The support of a rule is just the support of X ∪ Y:
support(X ⇒ Y) = support(X ∪ Y)
The confidence of a rule is the proportion of transactions satisfying the right part among the transactions which satisfy the left:
confidence(X ⇒ Y) = support(X ∪ Y) / support(X)
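These two definitions transcribe directly into code. A small sketch, reusing the assumed grocery dataset from the support example (the dataset is a reconstruction, not from the slide):

```python
def support(itemset, transactions):
    """support(X) = fraction of transactions containing X."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """confidence(X => Y) = support(X | Y) / support(X)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
print(confidence({"Diaper"}, {"Beer"}, transactions))  # ≈ 0.75
```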
Association rule mining
Given a set of transactions, find all rules X ⇒ Y such that
support(X ⇒ Y) ≥ s_min
confidence(X ⇒ Y) ≥ c_min
Solution:
Find frequent itemsets A with support ≥ s_min.
Then find a partitioning of each A into a left and a right part, so that the resulting rule has high confidence.
The algorithmic approach is the same in both steps. We'll start with the first one.
Frequent itemset mining
What would be the simplest method for finding all frequent itemsets?
Brute force approach:
Enumerate all possible sets of items.
For each set, compute its support in the database.
Output sets with support ≥ s_min.
for itemset in subsets(items):
    if support(itemset, data) >= s_min:
        yield itemset
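The slide's pseudocode leaves `subsets` and `support` undefined; a self-contained runnable sketch (stand-in names, with a toy dataset for illustration) could look like this:

```python
from itertools import combinations

def brute_force_frequent(transactions, s_min):
    """Enumerate every itemset and keep those with support count >= s_min."""
    items = sorted({i for t in transactions for i in t})
    for k in range(1, len(items) + 1):
        for itemset in combinations(items, k):  # all 2^n subsets, by size
            count = sum(1 for t in transactions if set(itemset) <= t)
            if count >= s_min:
                yield itemset, count

transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
print(list(brute_force_frequent(transactions, s_min=2)))
```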
Let there be n different items, m transactions, and average transaction size w. What is the complexity of this algorithm?
2^n iterations; each iteration scans m transactions, doing about w operations per transaction: O(2^n · m · w).
Faster itemset mining
Apriori: avoid scanning through all 2^n itemsets.
FP-tree: also store the transactions in an efficient data structure, speeding up matching.
The Apriori idea
Suppose that support({A, C}) = 42/100.
What does this tell you about support({A, B, C})?
It follows that support({A, B, C}) ≤ 42/100, because every transaction containing {A, B, C} also contains {A, C}:
(A, B, C, D, E)
(A, C, D, E)
(B, D, E, F)
(A, B, C, F)
(A, C, F, G)
(A, B, C, D)
Anti-monotonicity of support
In general, X ⊆ Y ⇒ support(X) ≥ support(Y).
It follows that:
If an itemset is not frequent, all of its supersets are also not frequent.
If an itemset is frequent, all of its subsets are also frequent.
This is known as the Apriori principle.
Basic Apriori algorithm
First generate frequent 1-sets,
next, generate frequent 2-sets from the 1-sets,
… then generate frequent 3-sets from the 2-sets,
… etc., until there are no frequent k-sets.
How do we generate the frequent 1-sets?
Simply count the frequency of each item and keep only the frequent ones:
{A}: 10
{B}: 15
{C}: 3
{D}: 4
{E}: 6
{F}: 10
{G}: 4
Let the minimum support count be 5.
The frequent 1-sets are then:
{A}: 10
{B}: 15
{E}: 6
{F}: 10
Next, generate frequent 2-sets. This is done in several steps.
1. Generate candidate 2-sets: {A,B}, {A,E}, {A,F}, {B,E}, {B,F}, {E,F}.
All subsets of a candidate set must be frequent; for 2-sets this simply means that both elements come from frequent 1-sets.
2. Count the actual support of each candidate (requires a full pass over the transaction database): {A,B}: 10, {A,E}: 3, {A,F}: 5, {B,E}: 6, {B,F}: 10, {E,F}: 5.
3. … and keep only the actually frequent ones: {A,B}: 10, {A,F}: 5, {B,E}: 6, {B,F}: 10, {E,F}: 5.
Now we have frequent 1- and 2-sets:
{A}: 10, {B}: 15, {E}: 6, {F}: 10, {A,B}: 10, {A,F}: 5, {B,E}: 6, {B,F}: 10, {E,F}: 5.
In many practical situations the algorithm stops here. Why?
1. Because there are so many items that enumerating beyond 2-sets is impractical.
2. Because knowing the frequent 2-sets is already useful enough (think of the "Beer/Diapers" example).
But let's try generating frequent 3-sets. We proceed as before.
1. Generate candidate 3-sets: augment each 2-set with an additional element and check that all 2-subsets of the resulting set are frequent. Candidates: {A,B,F}, {B,E,F}.
(This step can be optimized somewhat; see, e.g., http://www.dais.unive.it/~orlando/PAPERS/dawak01.pdf)
2. Count the actual support (again, a pass over the whole DB): {A,B,F}: 4, {B,E,F}: 5.
3. … and throw away the non-frequent ones: only {B,E,F}: 5 remains.
The frequent itemsets found so far: {A}: 10, {B}: 15, {E}: 6, {F}: 10, {A,B}: 10, {A,F}: 5, {B,E}: 6, {B,F}: 10, {E,F}: 5, {B,E,F}: 5.
Quiz: In this particular example, can there be frequent 4-sets?
No: if, say, {B,E,F,X} were frequent, then {B,E,X} would have to be frequent too!
Basic Apriori algorithm in Python:

from itertools import combinations

# Candidate 1-sets
C = [(i,) for i in items]
# Frequent 1-sets
F = [itemset for itemset in C
     if support(itemset, data) >= s_min]

# Frequent (k+1)-sets from k-sets
while len(F) != 0:
    output(F)
    # Generate candidates of size k+1 (item tuples are kept sorted)
    C = [itemset + (item,)
         for itemset in F
         for item in items
         if item > itemset[-1]]
    # Check that all k-subsets of each candidate are frequent
    C = [itemset for itemset in C
         if all(s in F for s in combinations(itemset, len(itemset) - 1))]
    # Count actual support and keep only the frequent ones
    F = [itemset for itemset in C
         if support(itemset, data) >= s_min]
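The slide code assumes pre-existing `items`, `data`, `support` and `output`. A self-contained, runnable sketch of the same level-wise loop, using a five-transaction dataset consistent with the Bread/Milk/Diaper/Beer example above (the dataset itself is an assumption, since the slide tables are images):

```python
from itertools import combinations

def apriori(transactions, s_min):
    """Level-wise frequent itemset search (support measured as a count)."""
    items = sorted({i for t in transactions for i in t})

    def count(itemset):
        return sum(1 for t in transactions if set(itemset) <= t)

    F = [(i,) for i in items if count((i,)) >= s_min]
    result = []
    while F:
        result.extend(F)
        # Candidates of size k+1 from frequent k-sets (tuples kept sorted)
        C = [s + (i,) for s in F for i in items if i > s[-1]]
        # Prune candidates that have an infrequent k-subset
        C = [s for s in C
             if all(sub in F for sub in combinations(s, len(s) - 1))]
        F = [s for s in C if count(s) >= s_min]
    return result

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent = apriori(transactions, s_min=3)
print(frequent)  # four frequent single items and four frequent pairs
```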
Many optimizations are possible:
Avoid generating the separate candidate set explicitly (do it on the fly while counting).
Store k-sets in a hash tree data structure (speeds up the counting/generation process).
Use a part of the whole transaction DB (sample or partition).
Use Bloom-filter-like data structures to reduce the candidate set.
Etc.
(See, e.g., http://i.stanford.edu/~ullman/mmds/ch6a.pdf)
Compact representation of itemsets
If {A,B,C,D,E} is frequent, then so are all of its subsets:
{A}, {B}, {C}, {D}, {E},
{A,B}, {A,C}, {A,D}, {A,E}, {B,C}, {B,D}, {B,E}, {C,D}, {C,E}, {D,E},
{A,B,C}, {A,B,D}, {A,B,E}, {A,C,D}, {A,C,E}, {A,D,E}, {B,C,D}, {B,C,E}, {B,D,E}, {C,D,E},
{A,B,C,D}, {A,B,C,E}, {A,B,D,E}, {A,C,D,E}, {B,C,D,E}.
Do we really want our algorithm to report all of those?
Closed itemsets
A frequent itemset is closed if none of its supersets has the same support.
Maximal frequent itemsets
A frequent itemset is maximal if none of its supersets is frequent.
Quiz
Can an itemset be:
Closed and not maximal? Yes.
E.g. adding any item will reduce support, but adding some items will still yield a frequent set.
Not closed and maximal? No.
Not closed means we can add some item without reducing support; the resulting superset is then frequent too.
Closed and maximal? Yes.
Adding any item will reduce support and make the set infrequent.
Closed vs maximal frequent itemsets
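Given all frequent itemsets with their support counts, both notions can be checked directly. A sketch using the {A},{B},{E},{F} counts from the Apriori walkthrough above (the function and its input layout are illustrative, not from the slides):

```python
def closed_and_maximal(freq):
    """freq: dict mapping frozenset -> support count, for all frequent itemsets."""
    closed, maximal = [], []
    for s, c in freq.items():
        frequent_supersets = [t for t in freq if s < t]
        # Closed: no superset has the same support
        if all(freq[t] < c for t in frequent_supersets):
            closed.append(s)
        # Maximal: no superset is frequent at all
        if not frequent_supersets:
            maximal.append(s)
    return closed, maximal

freq = {
    frozenset("A"): 10, frozenset("B"): 15, frozenset("E"): 6,
    frozenset("F"): 10, frozenset("AB"): 10, frozenset("AF"): 5,
    frozenset("BE"): 6, frozenset("BF"): 10, frozenset("EF"): 5,
    frozenset("BEF"): 5,
}
closed, maximal = closed_and_maximal(freq)
```

Note that for a frequent itemset it suffices to check its frequent supersets for closedness: a superset with the same support would itself be frequent.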
Intermediate summary
Frequent itemsets are interesting because they correspond to structure in the data.
Association rules are basically frequent itemsets with bells and whistles.
Apriori-like algorithms search for frequent sets better than brute force.
It is sufficient to find only the maximal itemsets (or the closed itemsets, if some flexibility in changing the cutoff later is needed).
Apriori is fine, but…
When the number of frequent patterns is large, the candidate sets are large and the repeated database scans are slow.
This happens, in particular, when there is a long frequent pattern (then all of its subsets are also frequent).
FP-tree
A frequent-pattern tree is a data structure that stores the transaction database and makes it possible to enumerate frequent itemsets efficiently.
Constructing the FP-tree
TID | Items
100 | F,A,C,D,G,I,M,P
200 | A,B,C,F,L,M,O
300 | B,F,H,J,O
400 | B,C,K,S,P
500 | A,F,C,E,L,P,M,N
1. Count the frequency of each item:
F: 4, A: 3, C: 4, D: 1, G: 1, I: 1, M: 3, S: 1, P: 3, B: 3, L: 2, O: 2, H: 1, J: 1, K: 1, E: 1, N: 1
2. Keep only the frequent items (F: 4, A: 3, C: 4, B: 3, M: 3, P: 3):
100 | F,A,C,M,P
200 | A,B,C,F,M
300 | B,F
400 | B,C,P
500 | A,F,C,P,M
3. Sort the items within each transaction by frequency (recommended):
100 | F,C,A,M,P
200 | F,C,A,B,M
300 | F,B
400 | C,B,P
500 | F,C,A,M,P
(Item order: F: 4, C: 4, A: 3, B: 3, M: 3, P: 3)
4. Create a header table with the frequent items (F, C, A, B, M, P) and a root node.
5. Now start inserting transactions. Each transaction becomes a path in the tree: the first transaction (F,C,A,M,P) creates the path root → F → C → A → M → P.
The header table keeps a linked list of the occurrences of each item in the tree.
The next transaction (F,C,A,B,M) shares a prefix (F,C,A) with the previously inserted path. We keep a count at each node to track this: root → F:2 → C:2 → A:2, which now branches into M:1 → P:1 and B:1 → M:1.
We update the header table lists (the new B and M nodes need to be included).
Inserting the third transaction (F,B) increments F's count to 3 and adds a new child B:1 under F.
… the fourth (C,B,P) adds a new path C:1 → B:1 → P:1 under the root.
… the fifth is exactly like the first, so we only update the counts along the path: F:4 → C:3 → A:3 → M:2 → P:2.
The resulting data structure holds all the information of the original database, but allows fast lookups. The final tree: root → F:4 → C:3 → A:3 → M:2 → P:2, with branches A → B:1 → M:1, F → B:1, and root → C:1 → B:1 → P:1.
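The construction steps above can be sketched in code. This is a minimal illustration, not the exact structure from the slides: the `Node`/header layout is an assumption, the header lists are plain Python lists, and frequency ties are broken by first appearance, so the tree shape may differ slightly from the slide's tree.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fp_tree(transactions, s_min):
    # 1. Count item frequencies
    freq = Counter(i for t in transactions for i in t)
    # 2.-3. Keep frequent items; rank them by descending frequency
    rank = {i: r for r, (i, c) in enumerate(freq.most_common()) if c >= s_min}
    root = Node(None, None)
    header = {}  # item -> list of its nodes (stands in for the linked lists)
    # 5. Insert each transaction as a path of its frequent items
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

transactions = [list("FACDGIMP"), list("ABCFLMO"), list("BFHJO"),
                list("BCKSP"), list("AFCELPMN")]
root, header = build_fp_tree(transactions, s_min=3)
```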
Quiz
How do you find the total number of transactions stored in this tree? (Sum the counts of the root's children.)
Quiz
How many transactions contain "M"? (Follow the header-table list for M and sum the node counts: 2 + 1 = 3.)
Quiz
Which transactions contain "B"? (Follow the header-table list for B; the path from each B node up to the root gives the frequent-item prefix of the matching transactions.)
Quiz
Build the tree that corresponds to all transactions containing "B".
Note the count changes! The result: root → F:2 with children C:1 → A:1 → B:1 → M:1 and B:1, plus root → C:1 → B:1 → P:1.
Quiz
Build the tree that corresponds to all transactions containing "B" and "C".
First build the tree for "B", then trim it further for "C". The result: root → F:1 → C:1 → A:1 → B:1 → M:1, plus root → C:1 → B:1 → P:1.
FP-growth algorithm
Frequent-pattern enumeration based on the FP-tree. Core idea:
• Recursively enumerate all subsets of items.
• For each subset, construct an FP-tree of the transactions that contain this subset.
• Output those subsets whose tree contains at least (minimum support count) transactions.
Core idea: recursively enumerate all subsets.

def subsets(items, current_subset):
    process(current_subset)
    for i in items:
        # Extend only with larger items, so each subset is produced once
        if not current_subset or i > current_subset[-1]:
            subsets(items, current_subset + [i])

# Start the enumeration with
subsets(items, [])
Now, while enumerating subsets, we keep track of the FP-tree holding exactly the transactions that contain the current subset:

def subsets(items, current_subset, fp_tree):
    process(current_subset, fp_tree)
    for i in items:
        if not current_subset or i > current_subset[-1]:
            new_tree = fp_tree.filter(i)  # transactions that also contain i
            subsets(items, current_subset + [i], new_tree)

# Start the enumeration with
subsets(items, [], full_tree)
Start from the bottom of the header table: extract all transactions that contain "P" and build the corresponding FP-tree.
The resulting P-conditional tree: root → F:2 → C:2 → A:2 → M:2 → P:2, plus root → C:1 → B:1 → P:1.
We can see that P is frequent (for support threshold 3). Now descend recursively to examine all sets of the form {P} ∪ {…}, starting with {P, M}.
The {P, M}-conditional tree contains only the path F:2 → C:2 → A:2 → M:2 → P:2, i.e. two transactions.
{P, M} is not frequent, so there is no need to dig deeper into {P, M, …}; we can return from the recursive call.
… back at level {P}, we descend to {P, B}.
… etc.:
{P,B} — not frequent, backtrack;
{P,A} — not frequent, backtrack;
{P,C} — frequent, recurse;
{P,C,F} — not frequent, backtrack;
{P,F} — not frequent, backtrack.
Now we are done with "P" completely! The recursion continues with {M}, then {M,B}, then {M,A} (frequent), then {M,A,C} (frequent), then {M,A,C,F} (frequent), then {M,A,F} (frequent), … etc.
• Several optimizations of the idea just described are possible.
• In a careful implementation you never need to go downwards in the tree (i.e. against the pointers): when processing each item you only care about the part of the tree above it.
• You can have the algorithm output only the maximal sets.
• Special cases such as a "chain" tree can be treated separately.
Quiz
Could you use the FP-tree with the "usual" Apriori algorithm?
Intermediate summary
Frequent itemsets are interesting because they correspond to structure in the data.
Association rules are basically frequent itemsets with bells and whistles.
Apriori-like algorithms search for frequent sets better than brute force.
The FP-tree is a data structure for storing transactions.
FP-growth can be more memory-efficient than Apriori.
Maximal and closed itemsets are your friends.
Back to association rules
Recall association rule mining:
Problem: find association rules X ⇒ Y with
support at least s_min,
confidence at least c_min.
Solution:
Find frequent itemsets with support at least s_min.
For each itemset, find a split into X ⇒ Y which ensures the required confidence.
Rule generation
Suppose we found that {A, B, C, D} is a frequent itemset with the necessary support. We can make a variety of rules from it:
{A} ⇒ {B, C, D}
{A, B} ⇒ {C, D}
{B, C, D} ⇒ {A}
How do we efficiently find those which satisfy the confidence threshold?
Rule generation
Recall that
confidence({A} ⇒ {B,C,D}) = support({A,B,C,D}) / support({A})
confidence({A,B} ⇒ {C,D}) = support({A,B,C,D}) / support({A,B})
confidence({A,C,D} ⇒ {B}) = support({A,B,C,D}) / support({A,C,D})
Consequently, among all rules built on the set {A, B, C, D}, confidence is inversely proportional to the support of the antecedent (the left part of the rule).
I.e., confidence is monotonic w.r.t. the antecedent (and anti-monotonic w.r.t. the consequent).
In other words, you can use an Apriori-style search to generate rules from a found frequent set.
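A naive sketch of the split search (it tries every split rather than pruning Apriori-style; `transactions` is the assumed grocery dataset used earlier, not data from the slides):

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions, c_min):
    """Yield every split X => Y of a frequent itemset with confidence >= c_min."""
    itemset = frozenset(itemset)

    def support(s):
        return sum(1 for t in transactions if s <= t) / len(transactions)

    s_full = support(itemset)
    for k in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(sorted(itemset), k)):
            conf = s_full / support(lhs)
            if conf >= c_min:
                yield set(lhs), set(itemset - lhs), conf

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
for lhs, rhs, conf in rules_from_itemset({"Diaper", "Beer"}, transactions, 0.7):
    print(lhs, "=>", rhs, round(conf, 2))
```

An Apriori-style implementation would use the monotonicity above to prune the splits instead of trying all of them.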
Quiz
Can you apply FP-growth to search for the rule split for a given frequent set?
Questions?
Image Credits
Title page image: © Marco Taiana, http://www.flickr.com/photos/marcotaiana/92122947/
Some diagrams borrowed from slides by Tan et al., http://www-users.cs.umn.edu/~kumar/dmbook/index.php