Frequent Itemsets and Association Rules
MTAT.03.183 Data Mining. 20.02.2014.
Konstantin Tretyakov
Frequent patterns
Humpty Dumpty sat on a wall,
Humpty Dumpty had a great fall;
All the King's horses and all the King's men
Couldn't put Humpty together again
The frequent pattern here: [H|D]umpty
Frequent patterns
Front page Search Product A Product B Product C
Front page Ad Product B Purchase Product C
Front page Front page Feedback
Front page Search Product B Search Product C Purchase
Front page Ad Product C Search Product B Purchase
Front page Front page Front page Front page
Why frequent patterns?
Repetition in data always indicates structure.
This structure may be due to known processes.
Otherwise, we want to know about it and explain it.
Consequently, searching for frequent patterns in data is one of the most basic procedures in descriptive data analysis.
Quiz
Is a frequent pattern always interesting?
No (e.g. "Front page" is frequent but uninteresting).
Is an interesting pattern always frequent?
No (e.g. "Diapers & Beer" may be a rare but very valuable pattern).

NB:
Finding frequent patterns is primarily an algorithmic problem.
Finding "interesting" patterns is primarily a statistical problem.
Frequent itemsets
In general, we may be interested in different types of data and types of patterns:
Natural language / Word sequences
Text / Regular expression patterns
Web log / Event sequences (various models)
Purchases / Item sets  (today's topic)
…
However, many of the notions and ideas apply to all data / pattern types.
Definitions
Let the data consist of transactions (i.e. sets of items).
The patterns we are interested in are itemsets.
E.g. {Milk, Eggs} is an itemset.
Definitions
The support of an itemset is the proportion of transactions in which it occurs:
support({Milk, Eggs}) = 0
support({Bread}) = 4/5 = 0.8
support({Milk, Beer}) = 2/5 = 0.4
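The transaction table on this slide is an image and is not reproduced in this text version. As a minimal sketch, assuming a five-transaction grocery dataset consistent with the support values quoted above (the table contents are a reconstruction, not taken from the slide), support can be computed directly:

```python
# Hypothetical reconstruction of the slide's transaction table
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return hits / len(transactions)

print(support({"Milk", "Eggs"}, transactions))  # 0.0
print(support({"Bread"}, transactions))         # 0.8
print(support({"Milk", "Beer"}, transactions))  # 0.4
```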
NB: in some treatments, support denotes the number of transactions containing the itemset (the "support count"), e.g. support({Bread}) = 4, support({Milk, Beer}) = 2.
Definitions
An itemset is frequent if its support is greater than or equal to some predefined threshold s_min.
Let s_min = 3 (as a support count). The frequent itemsets are: {Bread}, {Milk}, {Diaper}, {Beer}, {Bread, Milk}, {Bread, Diaper}, {Milk, Diaper}, {Diaper, Beer}, {}.
Explanations for itemsets
Suppose we found that {Beer, Milk, Diapers} is a frequent itemset. How do we explain it?
Perhaps it is frequent because there is a causal relationship: {Diapers, Milk} ⇒ Beer.
Association rules
An association rule is an implication of the form X ⇒ Y, where X and Y are itemsets.
The support of a rule is just the support of X ∪ Y:
support(X ⇒ Y) = support(X ∪ Y)
The confidence of a rule is the proportion of transactions satisfying the right part among the transactions which satisfy the left:
confidence(X ⇒ Y) = support(X ∪ Y) / support(X)
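These two definitions transcribe directly into code. A small sketch, reusing the assumed grocery dataset from the support example (the dataset is a reconstruction, not from the slide):

```python
def support(itemset, transactions):
    """support(X) = fraction of transactions containing X."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """confidence(X => Y) = support(X | Y) / support(X)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
print(confidence({"Diaper"}, {"Beer"}, transactions))  # ≈ 0.75
```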
Association rule mining
Given a set of transactions, find all rules X ⇒ Y such that
support(X ⇒ Y) ≥ s_min
confidence(X ⇒ Y) ≥ c_min
Solution:
Find frequent itemsets A with support ≥ s_min.
Then find a partitioning of each A into a left and a right part, so that the resulting rule has high confidence.
The algorithmic approach is the same in both steps. We'll start with the first one.
Frequent itemset mining
What would be the simplest method for finding all frequent itemsets?
Brute force approach:
Enumerate all possible sets of items.
For each set, compute its support in the database.
Output sets with support ≥ s_min.
for itemset in subsets(items):
    if support(itemset, data) >= s_min:
        yield itemset
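The slide's pseudocode leaves `subsets` and `support` undefined; a self-contained runnable sketch (stand-in names, with a toy dataset for illustration) could look like this:

```python
from itertools import combinations

def brute_force_frequent(transactions, s_min):
    """Enumerate every itemset and keep those with support count >= s_min."""
    items = sorted({i for t in transactions for i in t})
    for k in range(1, len(items) + 1):
        for itemset in combinations(items, k):  # all 2^n subsets, by size
            count = sum(1 for t in transactions if set(itemset) <= t)
            if count >= s_min:
                yield itemset, count

transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
print(list(brute_force_frequent(transactions, s_min=2)))
```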
Let there be n different items, m transactions, and average transaction size w. What is the complexity of this algorithm?
2^n iterations; each iteration scans m transactions, doing about w operations per transaction: O(2^n · m · w).
Faster itemset mining
Apriori: avoid scanning through all 2^n itemsets.
FP-tree: also store the transactions in an efficient data structure, speeding up matching.
The Apriori idea
Suppose that support({A, C}) = 42/100.
What does this tell you about support({A, B, C})?
It follows that support({A, B, C}) ≤ 42/100, because every transaction containing {A, B, C} also contains {A, C}:
(A, B, C, D, E)
(A, C, D, E)
(B, D, E, F)
(A, B, C, F)
(A, C, F, G)
(A, B, C, D)
Anti-monotonicity of support
In general, X ⊆ Y ⇒ support(X) ≥ support(Y).
It follows that:
If an itemset is not frequent, all of its supersets are also not frequent.
If an itemset is frequent, all of its subsets are also frequent.
This is known as the Apriori principle.
Basic Apriori algorithm
First generate frequent 1-sets,
next, generate frequent 2-sets from the 1-sets,
… then generate frequent 3-sets from the 2-sets,
… etc., until there are no frequent k-sets.
How do we generate the frequent 1-sets?
Simply count the frequency of each item and keep only the frequent ones:
{A}: 10
{B}: 15
{C}: 3
{D}: 4
{E}: 6
{F}: 10
{G}: 4
Let the minimum support count be 5.
The frequent 1-sets are then:
{A}: 10
{B}: 15
{E}: 6
{F}: 10
Next, generate frequent 2-sets. This is done in several steps.
1. Generate candidate 2-sets: {A,B}, {A,E}, {A,F}, {B,E}, {B,F}, {E,F}.
All subsets of a candidate set must be frequent; for 2-sets this simply means that both elements come from frequent 1-sets.
2. Count the actual support of each candidate (requires a full pass over the transaction database): {A,B}: 10, {A,E}: 3, {A,F}: 5, {B,E}: 6, {B,F}: 10, {E,F}: 5.
3. … and keep only the actually frequent ones: {A,B}: 10, {A,F}: 5, {B,E}: 6, {B,F}: 10, {E,F}: 5.
Now we have frequent 1- and 2-sets:
{A}: 10, {B}: 15, {E}: 6, {F}: 10, {A,B}: 10, {A,F}: 5, {B,E}: 6, {B,F}: 10, {E,F}: 5.
In many practical situations the algorithm stops here. Why?
1. Because there are so many items that enumerating beyond 2-sets is impractical.
2. Because knowing the frequent 2-sets is already useful enough (think of the "Beer/Diapers" example).
But let's try generating frequent 3-sets. We proceed as before.
1. Generate candidate 3-sets: augment each 2-set with an additional element and check that all 2-subsets of the resulting set are frequent. Candidates: {A,B,F}, {B,E,F}.
(This step can be optimized somewhat; see, e.g., http://www.dais.unive.it/~orlando/PAPERS/dawak01.pdf)
2. Count the actual support (again, a pass over the whole DB): {A,B,F}: 4, {B,E,F}: 5.
3. … and throw away the non-frequent ones: only {B,E,F}: 5 remains.
The frequent itemsets found so far: {A}: 10, {B}: 15, {E}: 6, {F}: 10, {A,B}: 10, {A,F}: 5, {B,E}: 6, {B,F}: 10, {E,F}: 5, {B,E,F}: 5.
Quiz: In this particular example, can there be frequent 4-sets?
No: if, say, {B,E,F,X} were frequent, then {B,E,X} would have to be frequent too!
Basic Apriori algorithm in Python:

from itertools import combinations

# Candidate 1-sets
C = [(i,) for i in items]
# Frequent 1-sets
F = [itemset for itemset in C
     if support(itemset, data) >= s_min]

# Frequent (k+1)-sets from k-sets
while len(F) != 0:
    output(F)
    # Generate candidates of size k+1 (item tuples are kept sorted)
    C = [itemset + (item,)
         for itemset in F
         for item in items
         if item > itemset[-1]]
    # Check that all k-subsets of each candidate are frequent
    C = [itemset for itemset in C
         if all(s in F for s in combinations(itemset, len(itemset) - 1))]
    # Count actual support and keep only the frequent ones
    F = [itemset for itemset in C
         if support(itemset, data) >= s_min]
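The slide code assumes pre-existing `items`, `data`, `support` and `output`. A self-contained, runnable sketch of the same level-wise loop, using a five-transaction dataset consistent with the Bread/Milk/Diaper/Beer example above (the dataset itself is an assumption, since the slide tables are images):

```python
from itertools import combinations

def apriori(transactions, s_min):
    """Level-wise frequent itemset search (support measured as a count)."""
    items = sorted({i for t in transactions for i in t})

    def count(itemset):
        return sum(1 for t in transactions if set(itemset) <= t)

    F = [(i,) for i in items if count((i,)) >= s_min]
    result = []
    while F:
        result.extend(F)
        # Candidates of size k+1 from frequent k-sets (tuples kept sorted)
        C = [s + (i,) for s in F for i in items if i > s[-1]]
        # Prune candidates that have an infrequent k-subset
        C = [s for s in C
             if all(sub in F for sub in combinations(s, len(s) - 1))]
        F = [s for s in C if count(s) >= s_min]
    return result

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent = apriori(transactions, s_min=3)
print(frequent)  # four frequent single items and four frequent pairs
```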
Many optimizations are possible:
Avoid generating the separate candidate set explicitly (do it on the fly while counting).
Store k-sets in a hash tree data structure (speeds up the counting/generation process).
Use a part of the whole transaction DB (sample or partition).
Use Bloom-filter-like data structures to reduce the candidate set.
Etc.
(See, e.g., http://i.stanford.edu/~ullman/mmds/ch6a.pdf)
Compact representation of itemsets
If {A,B,C,D,E} is frequent, then so are all of its subsets:
{A}, {B}, {C}, {D}, {E},
{A,B}, {A,C}, {A,D}, {A,E}, {B,C}, {B,D}, {B,E}, {C,D}, {C,E}, {D,E},
{A,B,C}, {A,B,D}, {A,B,E}, {A,C,D}, {A,C,E}, {A,D,E}, {B,C,D}, {B,C,E}, {B,D,E}, {C,D,E},
{A,B,C,D}, {A,B,C,E}, {A,B,D,E}, {A,C,D,E}, {B,C,D,E}.
Do we really want our algorithm to report all of those?
Closed itemsets
A frequent itemset is closed if none of its supersets has the same support.
Maximal frequent itemsets
A frequent itemset is maximal if none of its supersets is frequent.
Quiz
Can an itemset be:
Closed and not maximal? Yes.
E.g. adding any item will reduce support, but adding some items will still yield a frequent set.
Not closed and maximal? No.
Not closed means we can add some item without reducing support; the resulting superset is then frequent too.
Closed and maximal? Yes.
Adding any item will reduce support and make the set infrequent.
Closed vs maximal frequent itemsets
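Given all frequent itemsets with their support counts, both notions can be checked directly. A sketch using the {A},{B},{E},{F} counts from the Apriori walkthrough above (the function and its input layout are illustrative, not from the slides):

```python
def closed_and_maximal(freq):
    """freq: dict mapping frozenset -> support count, for all frequent itemsets."""
    closed, maximal = [], []
    for s, c in freq.items():
        frequent_supersets = [t for t in freq if s < t]
        # Closed: no superset has the same support
        if all(freq[t] < c for t in frequent_supersets):
            closed.append(s)
        # Maximal: no superset is frequent at all
        if not frequent_supersets:
            maximal.append(s)
    return closed, maximal

freq = {
    frozenset("A"): 10, frozenset("B"): 15, frozenset("E"): 6,
    frozenset("F"): 10, frozenset("AB"): 10, frozenset("AF"): 5,
    frozenset("BE"): 6, frozenset("BF"): 10, frozenset("EF"): 5,
    frozenset("BEF"): 5,
}
closed, maximal = closed_and_maximal(freq)
```

Note that for a frequent itemset it suffices to check its frequent supersets for closedness: a superset with the same support would itself be frequent.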
Intermediate summary
Frequent itemsets are interesting because they correspond to structure in the data.
Association rules are basically frequent itemsets with bells and whistles.
Apriori-like algorithms search for frequent sets better than brute force.
It is sufficient to find only the maximal itemsets (or the closed itemsets, if some flexibility in changing the cutoff later is needed).
Apriori is fine, but…
When the number of frequent patterns is large, the candidate sets are large and the repeated database scans are slow.
This happens, in particular, when there is a long frequent pattern (then all of its subsets are also frequent).
FP-tree
A frequent-pattern tree is a data structure that stores the transaction database and makes it possible to enumerate frequent itemsets efficiently.
Constructing the FP-tree
TID | Items
100 | F,A,C,D,G,I,M,P
200 | A,B,C,F,L,M,O
300 | B,F,H,J,O
400 | B,C,K,S,P
500 | A,F,C,E,L,P,M,N
1. Count the frequency of each item:
F: 4, A: 3, C: 4, D: 1, G: 1, I: 1, M: 3, S: 1, P: 3, B: 3, L: 2, O: 2, H: 1, J: 1, K: 1, E: 1, N: 1
2. Keep only the frequent items (F: 4, A: 3, C: 4, B: 3, M: 3, P: 3):
100 | F,A,C,M,P
200 | A,B,C,F,M
300 | B,F
400 | B,C,P
500 | A,F,C,P,M
3. Sort the items within each transaction by frequency (recommended):
100 | F,C,A,M,P
200 | F,C,A,B,M
300 | F,B
400 | C,B,P
500 | F,C,A,M,P
(Item order: F: 4, C: 4, A: 3, B: 3, M: 3, P: 3)
4. Create a header table with the frequent items (F, C, A, B, M, P) and a root node.
5. Now start inserting transactions. Each transaction becomes a path in the tree: the first transaction (F,C,A,M,P) creates the path root → F → C → A → M → P.
The header table keeps a linked list of the occurrences of each item in the tree.
The next transaction (F,C,A,B,M) shares a prefix (F,C,A) with the previously inserted path. We keep a count at each node to track this: root → F:2 → C:2 → A:2, which now branches into M:1 → P:1 and B:1 → M:1.
We update the header table lists (the new B and M nodes need to be included).
Inserting the third transaction (F,B) increments F's count to 3 and adds a new child B:1 under F.
… the fourth (C,B,P) adds a new path C:1 → B:1 → P:1 under the root.
… the fifth is exactly like the first, so we only update the counts along the path: F:4 → C:3 → A:3 → M:2 → P:2.
The resulting data structure holds all the information of the original database, but allows fast lookups. The final tree: root → F:4 → C:3 → A:3 → M:2 → P:2, with branches A → B:1 → M:1, F → B:1, and root → C:1 → B:1 → P:1.
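The construction steps above can be sketched in code. This is a minimal illustration, not the exact structure from the slides: the `Node`/header layout is an assumption, the header lists are plain Python lists, and frequency ties are broken by first appearance, so the tree shape may differ slightly from the slide's tree.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fp_tree(transactions, s_min):
    # 1. Count item frequencies
    freq = Counter(i for t in transactions for i in t)
    # 2.-3. Keep frequent items; rank them by descending frequency
    rank = {i: r for r, (i, c) in enumerate(freq.most_common()) if c >= s_min}
    root = Node(None, None)
    header = {}  # item -> list of its nodes (stands in for the linked lists)
    # 5. Insert each transaction as a path of its frequent items
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

transactions = [list("FACDGIMP"), list("ABCFLMO"), list("BFHJO"),
                list("BCKSP"), list("AFCELPMN")]
root, header = build_fp_tree(transactions, s_min=3)
```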
Quiz
How do you find the total number of transactions stored in this tree? (Sum the counts of the root's children.)
Quiz
How many transactions contain "M"? (Follow the header-table list for M and sum the node counts: 2 + 1 = 3.)
Quiz
Which transactions contain "B"? (Follow the header-table list for B; the path from each B node up to the root gives the frequent-item prefix of the matching transactions.)
Quiz
Build the tree that corresponds to all transactions containing "B".
Note the count changes! The result: root → F:2 with children C:1 → A:1 → B:1 → M:1 and B:1, plus root → C:1 → B:1 → P:1.
Quiz
Build the tree that corresponds to all transactions containing "B" and "C".
First build the tree for "B", then trim it further for "C". The result: root → F:1 → C:1 → A:1 → B:1 → M:1, plus root → C:1 → B:1 → P:1.
FP-growth algorithm
Frequent-pattern enumeration based on the FP-tree. Core idea:
• Recursively enumerate all subsets of items.
• For each subset, construct an FP-tree of the transactions that contain this subset.
• Output those subsets whose tree contains at least (minimum support count) transactions.
Core idea: recursively enumerate all subsets.

def subsets(items, current_subset):
    process(current_subset)
    for i in items:
        # Extend only with larger items, so each subset is produced once
        if not current_subset or i > current_subset[-1]:
            subsets(items, current_subset + [i])

# Start the enumeration with
subsets(items, [])
Now, while enumerating subsets, we keep track of the FP-tree holding exactly the transactions that contain the current subset:

def subsets(items, current_subset, fp_tree):
    process(current_subset, fp_tree)
    for i in items:
        if not current_subset or i > current_subset[-1]:
            new_tree = fp_tree.filter(i)  # transactions that also contain i
            subsets(items, current_subset + [i], new_tree)

# Start the enumeration with
subsets(items, [], full_tree)
Start from the bottom of the header table: extract all transactions that contain "P" and build the corresponding FP-tree.
The resulting P-conditional tree: root → F:2 → C:2 → A:2 → M:2 → P:2, plus root → C:1 → B:1 → P:1.
We can see that P is frequent (for support threshold 3). Now descend recursively to examine all sets of the form {P} ∪ {…}, starting with {P, M}.
The {P, M}-conditional tree contains only the path F:2 → C:2 → A:2 → M:2 → P:2, i.e. two transactions.
{P, M} is not frequent, so there is no need to dig deeper into {P, M, …}; we can return from the recursive call.
… back at level {P}, we descend to {P, B}.
… etc.:
{P,B} — not frequent, backtrack;
{P,A} — not frequent, backtrack;
{P,C} — frequent, recurse;
{P,C,F} — not frequent, backtrack;
{P,F} — not frequent, backtrack.
Now we are done with "P" completely! The recursion continues with {M}, then {M,B}, then {M,A} (frequent), then {M,A,C} (frequent), then {M,A,C,F} (frequent), then {M,A,F} (frequent), … etc.
• Several optimizations of the idea just described are possible.
• In a careful implementation you never need to go downwards in the tree (i.e. against the pointers): when processing each item you only care about the part of the tree above it.
• You can have the algorithm output only the maximal sets.
• Special cases such as a "chain" tree can be treated separately.
Quiz
Could you use the FP-tree with the "usual" Apriori algorithm?
Intermediate summary
Frequent itemsets are interesting because they correspond to structure in the data.
Association rules are basically frequent itemsets with bells and whistles.
Apriori-like algorithms search for frequent sets better than brute force.
The FP-tree is a data structure for storing transactions.
FP-growth can be more memory-efficient than Apriori.
Maximal and closed itemsets are your friends.
Back to association rules
Recall association rule mining:
Problem: find association rules X ⇒ Y with
support at least s_min,
confidence at least c_min.
Solution:
Find frequent itemsets with support at least s_min.
For each itemset, find a split into X ⇒ Y which ensures the required confidence.
Rule generation
Suppose we found that {A, B, C, D} is a frequent itemset with the necessary support. We can make a variety of rules from it:
{A} ⇒ {B, C, D}
{A, B} ⇒ {C, D}
{B, C, D} ⇒ {A}
How do we efficiently find those which satisfy the confidence threshold?
Rule generation
Recall that
confidence({A} ⇒ {B,C,D}) = support({A,B,C,D}) / support({A})
confidence({A,B} ⇒ {C,D}) = support({A,B,C,D}) / support({A,B})
confidence({A,C,D} ⇒ {B}) = support({A,B,C,D}) / support({A,C,D})
Consequently, among all rules built on the set {A, B, C, D}, confidence is inversely proportional to the support of the antecedent (the left part of the rule).
I.e., confidence is monotonic w.r.t. the antecedent (and anti-monotonic w.r.t. the consequent).
In other words, you can use an Apriori-style search to generate rules from a found frequent set.
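A naive sketch of the split search (it tries every split rather than pruning Apriori-style; `transactions` is the assumed grocery dataset used earlier, not data from the slides):

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions, c_min):
    """Yield every split X => Y of a frequent itemset with confidence >= c_min."""
    itemset = frozenset(itemset)

    def support(s):
        return sum(1 for t in transactions if s <= t) / len(transactions)

    s_full = support(itemset)
    for k in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(sorted(itemset), k)):
            conf = s_full / support(lhs)
            if conf >= c_min:
                yield set(lhs), set(itemset - lhs), conf

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
for lhs, rhs, conf in rules_from_itemset({"Diaper", "Beer"}, transactions, 0.7):
    print(lhs, "=>", rhs, round(conf, 2))
```

An Apriori-style implementation would use the monotonicity above to prune the splits instead of trying all of them.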
Quiz
Can you apply FP-growth to search for the rule split for a given frequent set?
Questions?
Image Credits
Title page image: © Marco Taiana, http://www.flickr.com/photos/marcotaiana/92122947/
Some diagrams borrowed from slides by Tan et al., http://www-users.cs.umn.edu/~kumar/dmbook/index.php