Post on 13-Feb-2022
How Many Words Is a Picture Worth?
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 2
E. Aiden and J-B Michel: Uncharted. Reverhead Books, 2013
Burnt or Burned?
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 3
E. Aiden and J-B Michel: Uncharted. Reverhead Books, 2013
Store Layout Design
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 4
http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg
Transaction Data
• Alphabet: a set of items – Example: all products sold in a store
• A transaction: a set of items involved in an activity – Example: the items purchased by a customer in
a visit • Other information is often associated
– Timestamp, price, salesperson, customer-id, store-id, …
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 5
How to Store Transaction Data?
• Transaction-id (t123, a, b, c) (t236, b, d)
• Relational storage • Transaction-based storage • Item-based (vertical) storage
– Item a: …, t123, … – Item b: …, t123, …, t236, … – …
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 7
Tid Item t123 a t123 b t123 c … … t236 b t236 d
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 8
Transaction Data Analysis
• Transactions: customers’ purchases of commodities – {bread, milk, cheese} if they are bought together
• Frequent patterns: product combinations that are frequently purchased together by customers
• Frequent patterns: patterns (set of items, sequence, etc.) that occur frequently in a database [AIS93]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 9
Why Frequent Patterns?
• What products were often purchased together?
• What are the frequent subsequent purchases after buying a iPod?
• What kinds of genes are sensitive to this new drug?
• What key-word combinations are frequently associated with web pages about game-evaluation?
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 10
Why Frequent Pattern Mining?
• Foundation for many data mining tasks – Association rules, correlation, causality,
sequential patterns, spatial and multimedia patterns, associative classification, cluster analysis, iceberg cube, …
• Broad applications – Basket data analysis, cross-marketing, catalog
design, sale campaign analysis, web log (click stream) analysis, …
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 11
Frequent Itemsets
• Itemset: a set of items – E.g., acm = {a, c, m}
• Support of itemsets – Sup(acm) = 3
• Given min_sup = 3, acm is a frequent pattern
• Frequent pattern mining: finding all frequent patterns in a database
TID Items bought 100 f, a, c, d, g, I, m, p 200 a, b, c, f, l, m, o 300 b, f, h, j, o 400 b, c, k, s, p 500 a, f, c, e, l, p, m, n
Transaction database TDB
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 12
A Naïve Attempt
• Generate all possible itemsets, test their supports against the database
• How to hold a large number of itemsets into main memory? – 100 items à 2100 – 1 possible itemets
• How to test the supports of a huge number of itemsets against a large database, say containing 100 million transactions? – A transaction of length 20 needs to update the
support of 220 – 1 = 1,048,575 itemsets
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 13
Transactions in Real Applications
• A large department store often carries more than 100 thousand different kinds of items – Amazon.com carries more than 17,000 books
relevant to data mining • Walmart has more than 20 million
transactions per day, AT&T produces more than 275 million calls per day
• Mining large transaction databases of many items is a real demand
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 14
How to Get an Efficient Method?
• Reducing the number of itemsets that need to be checked
• Checking the supports of selected itemsets efficiently
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 15
Candidate Generation & Test
• Any subset of a frequent itemset must also be frequent – an anti-monotonic property – A transaction containing {beer, diaper, nuts} also
contains {beer, diaper} – {beer, diaper, nuts} is frequent à {beer, diaper} must
also be frequent • In other words, any superset of an infrequent
itemset must also be infrequent – No superset of any infrequent itemset should be
generated or tested – Many item combinations can be pruned!
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 16
Apriori-Based Mining
• Generate length (k+1) candidate itemsets from length k frequent itemsets, and
• Test the candidates against DB
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 17
The Apriori Algorithm [AgSr94]
TID Items 10 a, c, d 20 b, c, e 30 a, b, c, e 40 b, e Min_sup=2
Itemset Sup a 2 b 3 c 3 d 1 e 3
Data base D 1-candidates
Scan D
Itemset Sup a 2 b 3 c 3 e 3
Freq 1-itemsets Itemset
ab ac ae bc be ce
2-candidates
Itemset Sup ab 1 ac 2 ae 1 bc 2 be 3 ce 2
Counting
Scan D
Itemset Sup ac 2 bc 2 be 3 ce 2
Freq 2-itemsets Itemset
bce
3-candidates
Itemset Sup bce 2
Freq 3-itemsets
Scan D
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 18
The Apriori Algorithm Level-wise, candidate generation and test • Ck: Candidate itemset of size k • Lk : frequent itemset of size k
• L1 = {frequent items}; • for (k = 1; Lk !=∅; k++) do
– Ck+1 = candidates generated from Lk; – for each transaction t in database do increment the
count of all candidates in Ck+1 that are contained in t – Lk+1 = candidates in Ck+1 with min_support
• return ∪k Lk;
Candidate generation
Test
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 19
Important Steps in Apriori
• How to find frequent 1- and 2-itemsets? • How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning • How to count supports of candidates?
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 20
Finding Frequent 1- & 2-itemsets
• Finding frequent 1-itemsets (i.e., frequent items) using a one dimensional array – Initialize c[item]=0 for each item – For each transaction T, for each item in T,
c[item]++; – If c[item]>=min_sup, item is frequent
• Finding frequent 2-itemsets using a 2-dimensional triangle matrix – For items i, j (i<j), c[i, j] is the count of itemset ij
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 21
Counting Array
• A 2-dimensional triangle matrix can be implemented using a 1-dimensional array
1 2 3 4 5 1 1 2 3 4 2 5 6 7 3 8 9 4 10 5
There are n items For items i, j (i>j), c[i,j] = c[(i-1)(2n-i)/2+j-i]; Example: c[3,5] =c[(3-1)*(2*5-3)/2+5-3]=c[9]
1 2 3 4 5 6 7 8 9 10
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 22
Example of Candidate-generation
• L3 = {abc, abd, acd, ace, bcd} • Self-joining: L3*L3
– abcd ß abc * abd – acde ß acd * ace
• Pruning: – acde is removed because ade is not in L3
• C4={abcd}
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 23
How to Generate Candidates? • Suppose the items in Lk-1 are listed in an order • Step 1: self-join Lk-1
INSERT INTO Ck SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1 FROM Lk-1 p, Lk-1 q WHERE p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 <
q.itemk-1
• Step 2: pruning – For each itemset c in Ck do
• For each (k-1)-subsets s of c do if (s is not in Lk-1) then delete c from Ck
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 24
How to Count Supports?
• Why is counting supports of candidates a problem? – The total number of candidates can be very huge – One transaction may contain many candidates
• Method – Candidate itemsets are stored in a hash-tree – A leaf node of hash-tree contains a list of itemsets and
counts – Interior node contains a hash table – Subset function: finds all the candidates contained in a
transaction
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 25
Example: Counting Supports
1,4,7 2,5,8
3,6,9 Subset function
2 3 4 5 6 7
1 4 5 1 3 6
1 2 4 4 5 7 1 2 5
4 5 8 1 5 9
3 4 5 3 5 6 3 5 7 6 8 9
3 6 7 3 6 8
Transaction: 1 2 3 5 6
1 + 2 3 5 6
1 2 + 3 5 6
1 3 + 5 6
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 26
Association Rules
• Rule c à am • Support: 3 (i.e., the support
of acm) • Confidence: 75% (i.e.,
sup(acm) / sup(c)) • Given a minimum support
threshold and a minimum confidence threshold, find all association rules whose support and confidence passing the thresholds
TID Items bought 100 f, a, c, d, g, I, m, p 200 a, b, c, f, l, m, o 300 b, f, h, j, o 400 b, c, k, s, p 500 a, f, c, e, l, p, m, n
Transaction database TDB
To-Do List
• Read Sections 6.1, 6.2.1 and 6.2.2 in the textbook
• Understand the concept of frequent itemsets and association rules
• Understand algorithm Apriori • Figure out how to use Weka to mine
frequent itemsets
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 27