Association Rule Mining
The UNIVERSITY of KENTUCKY
CS 685: Special Topics in Data Mining
Frequent Pattern Analysis
Finding inherent regularities in data
What products are often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What are the commonly occurring subsequences in a group of genes?
What are the shared substructures in a group of effective drugs?
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
Applications
Identify motifs in bio-molecules
DNA sequence analysis, protein structure analysis
Identify patterns in micro-arrays
Business applications:
Market basket analysis, cross-marketing, catalog design, sales campaign analysis, etc.
Data
An item is an element (a literal, a variable, a symbol, a descriptor, an attribute, a measurement, etc.)
A transaction is a set of items
A data set is a set of transactions
A database is a data set
Transaction-id   Items bought
100              f, a, c, d, g, i, m, p
200              a, b, c, f, l, m, o
300              b, f, h, j, o
400              b, c, k, s, p
500              a, f, c, e, l, p, m, n
Association Rules
Itemset X = {x1, …, xk}
Find all the rules X ⇒ Y with minimum support and confidence
support, s, is the probability that a transaction contains X ∪ Y
confidence, c, is the conditional probability that a transaction having X also contains Y
Let sup_min = 50%, conf_min = 50%
Association rules (for the five-transaction database above):
A ⇒ C (60%, 100%)
C ⇒ A (60%, 75%)
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]
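The support and confidence figures for the two rules can be checked with a minimal Python sketch. The names `DB`, `count`, `support` and `confidence` are illustrative assumptions, not part of the lecture; the data is the five-transaction table from the slides.

```python
# Transaction database from the slides (transaction id -> items bought).
DB = {
    100: set("facdgimp"),
    200: set("abcflmo"),
    300: set("bfhjo"),
    400: set("bcksp"),
    500: set("afcelpmn"),
}

def count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in db.values() if itemset <= t)

def support(itemset, db):
    """Fraction of transactions that contain `itemset`."""
    return count(itemset, db) / len(db)

def confidence(lhs, rhs, db):
    """Fraction of the transactions containing `lhs` that also contain `rhs`."""
    return count(set(lhs) | set(rhs), db) / count(lhs, db)

print(support("ac", DB), confidence("a", "c", DB))  # 0.6 1.0
print(support("ac", DB), confidence("c", "a", DB))  # 0.6 0.75
```

Both rules pass the 50% thresholds: a and c occur together in 3 of 5 transactions, every transaction with a also has c, and 3 of the 4 transactions with c also have a.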
Apriori-based Mining
Generate length (k+1) candidate itemsets from length k frequent itemsets, and
Test the candidates against DB
Apriori Algorithm
A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994)
Min_sup = 2
Database D
TID   Items
10    a, c, d
20    b, c, e
30    a, b, c, e
40    b, e
Scan D: 1-candidates (itemset: sup)
a: 2, b: 3, c: 3, d: 1, e: 3
Freq 1-itemsets (itemset: sup)
a: 2, b: 3, c: 3, e: 3
2-candidates
ab, ac, ae, bc, be, ce
Scan D: counting (itemset: sup)
ab: 1, ac: 2, ae: 1, bc: 2, be: 3, ce: 2
Freq 2-itemsets (itemset: sup)
ac: 2, bc: 2, be: 3, ce: 2
3-candidates
bce
Scan D: Freq 3-itemsets (itemset: sup)
bce: 2
Important Details of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?
How to Generate Candidates?
Suppose the items in L_{k-1} are listed in an order
Step 1: self-join L_{k-1}
    INSERT INTO C_k
    SELECT p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
    FROM L_{k-1} p, L_{k-1} q
    WHERE p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
Step 2: pruning
    For each itemset c in C_k do
        For each (k-1)-subset s of c do
            if (s is not in L_{k-1}) then delete c from C_k
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning: acde is removed because ade is not in L3
C4={abcd}
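The self-join and pruning steps can be sketched in Python and run on this example. The function name `apriori_gen` and the representation of itemsets as sorted tuples are assumptions made for illustration.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Build the size-k candidates from the frequent (k-1)-itemsets.

    Itemsets are sorted tuples of items (an assumed representation).
    """
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Self-join: p and q share their first k-2 items; because the
            # list is sorted, the last item of p precedes the last item of q.
            if p[:-1] == q[:-1]:
                c = p + (q[-1],)
                # Prune: keep c only if every (k-1)-subset is frequent.
                if all(s in prev_set for s in combinations(c, len(c) - 1)):
                    candidates.append(c)
    return candidates

L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
C4 = apriori_gen(L3)
print(["".join(c) for c in C4])  # ['abcd']
```

The join produces abcd and acde; acde is dropped in the pruning step because ade is missing from L3.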
How to Count Supports of Candidates?
Why is counting supports of candidates a problem?
The total number of candidates can be huge
One transaction may contain many candidates
Method:
Candidate itemsets are stored in a hash-tree
A leaf node of the hash-tree contains a list of itemsets and counts
An interior node contains a hash table
Subset function: finds all the candidates contained in a transaction
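A simplified sketch of support counting follows. It does not build a hash-tree: as a stated simplification, it keeps the candidates in a flat Python dictionary and implements the subset function by enumerating the k-subsets of each transaction. The name `count_supports` is an assumption, and all candidates are assumed to have the same size k.

```python
from itertools import combinations

def count_supports(transactions, candidates):
    """Count how many transactions contain each candidate k-itemset."""
    counts = {frozenset(c): 0 for c in candidates}
    if not counts:
        return counts
    k = len(next(iter(counts)))  # assumes every candidate has size k
    for t in transactions:
        # Subset function: enumerate the k-subsets of t and look each one up.
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:
                counts[key] += 1
    return counts

D = [set("acd"), set("bce"), set("abce"), set("be")]
C2 = ["ab", "ac", "ae", "bc", "be", "ce"]
counts = count_supports(D, C2)
print({"".join(sorted(c)): n for c, n in counts.items()})
```

A hash-tree serves the same purpose but avoids enumerating every k-subset of a long transaction, which is where this flat version becomes expensive.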
Apriori: Candidate Generation-and-test
Any subset of a frequent itemset must also be frequent (an anti-monotone property)
A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
{beer, diaper, nuts} is frequent ⇒ {beer, diaper} must also be frequent
No superset of any infrequent itemset should be generated or tested
Many item combinations can be pruned
The Apriori Algorithm
C_k: candidate itemsets of size k
L_k: frequent itemsets of size k
L_1 = {frequent items};
for (k = 1; L_k != ∅; k++) do
    C_{k+1} = candidates generated from L_k;
    for each transaction t in database do
        increment the count of all candidates in C_{k+1} that are contained in t
    L_{k+1} = candidates in C_{k+1} with min_support
return ∪_k L_k;
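The level-wise loop can be written as a short Python sketch. This is an illustrative version under stated assumptions: the function name `apriori`, sorted tuples as itemsets, and an absolute support count for `min_sup` are choices made here, and candidates are matched against each transaction by a plain linear check instead of a hash-tree.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L_1: count single items in one scan.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {c: n for c, n in counts.items() if n >= min_sup}
    frequent = dict(level)
    while level:
        prev = sorted(level)
        # C_{k+1}: join itemsets that share a prefix, then prune any
        # candidate that has an infrequent k-subset.
        candidates = []
        for i, p in enumerate(prev):
            for q in prev[i + 1:]:
                if p[:-1] == q[:-1]:
                    c = p + (q[-1],)
                    if all(s in level for s in combinations(c, len(c) - 1)):
                        candidates.append(c)
        # One scan of the database to count the surviving candidates.
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
    return frequent

D = [set("acd"), set("bce"), set("abce"), set("be")]
result = apriori(D, min_sup=2)
print({"".join(c): n for c, n in sorted(result.items())})
```

On the four-transaction database used in the earlier Apriori walkthrough, this returns the frequent itemsets a, b, c, e, ac, bc, be, ce and bce.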
Challenges of Frequent Pattern Mining
Challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce number of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates
DIC: Reduce Number of Scans
ABCD
ABC ABD ACD BCD
AB AC BC AD BD CD
A B C D
{}
Itemset lattice
Once both A and D are determined frequent, the counting of AD can begin
Once all length-2 subsets of BCD are determined frequent, the counting of BCD can begin
[Figure: counting schedule over the transactions. Apriori: 1-itemsets, 2-itemsets, … DIC: 1-itemsets, 2-itemsets, 3-itemsets]
S. Brin, R. Motwani, J. Ullman, and S. Tsur, 1997.
DHP: Reduce the Number of Candidates
If a hashing bucket count < min_sup, every candidate in the bucket is infrequent
Candidates: a, b, c, d, e
Hash entries: {ab, ad, ae} {bd, be, de} …
Large 1-itemset: a, b, d, e
The sum of counts of {ab, ad, ae} < min_sup ⇒ ab should not be a candidate 2-itemset
J. Park, M. Chen, and P. Yu, 1995
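The bucket-count filter can be sketched as follows. The function name `dhp_pair_candidates`, the toy hash function (sum of character codes modulo the number of buckets) and the default bucket count are illustrative assumptions, not the hash function of the cited paper.

```python
from collections import Counter
from itertools import combinations

def dhp_pair_candidates(transactions, min_sup, n_buckets=7):
    """First scan: count items and hash every pair into a bucket.

    Returns the candidate 2-itemsets that survive the bucket filter.
    """
    def bucket(pair):
        # Toy deterministic hash, chosen only for illustration.
        return sum(ord(ch) for item in pair for ch in item) % n_buckets

    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for t in transactions:
        item_counts.update(t)
        for pair in combinations(sorted(t), 2):
            bucket_counts[bucket(pair)] += 1

    frequent_items = sorted(i for i, n in item_counts.items() if n >= min_sup)
    # A pair is kept only if both items are frequent AND its bucket total
    # reaches min_sup; a bucket below min_sup cannot hold a frequent pair.
    return [p for p in combinations(frequent_items, 2)
            if bucket_counts[bucket(p)] >= min_sup]

D = [set("acd"), set("bce"), set("abce"), set("be")]
candidates = dhp_pair_candidates(D, min_sup=2)
print(candidates)
```

The filter never discards a truly frequent pair, because a bucket count is an upper bound on the support of every pair hashed into it. Collisions can only let extra infrequent pairs through.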
Partition: Scan Database Only Twice
Partition the database into n partitions
Itemset X is frequent ⇒ X is frequent in at least one partition
Scan 1: partition database and find local frequent patterns
Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski, and S. Navathe, 1995
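A minimal sketch of the two-scan idea, assuming the minimum support is given as a fraction `min_frac` of the database. The helper `mine_brute_force` is a deliberately naive stand-in for the local miner (it enumerates every subset of every transaction, so it only suits tiny examples); both function names are hypothetical.

```python
from itertools import combinations
from math import ceil

def mine_brute_force(transactions, min_count):
    """Naive stand-in miner: count every non-empty subset of every transaction."""
    counts = {}
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for s in combinations(items, k):
                counts[s] = counts.get(s, 0) + 1
    return {s: n for s, n in counts.items() if n >= min_count}

def partition_mine(transactions, min_frac, n_parts):
    size = ceil(len(transactions) / n_parts)
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Scan 1: the union of the locally frequent itemsets is the candidate set.
    candidates = set()
    for part in parts:
        local_min = ceil(min_frac * len(part))
        candidates |= set(mine_brute_force(part, local_min))
    # Scan 2: count every candidate against the whole database.
    global_min = ceil(min_frac * len(transactions))
    counts = {c: sum(1 for t in transactions if set(c) <= set(t))
              for c in candidates}
    return {c: n for c, n in counts.items() if n >= global_min}

D = [set("acd"), set("bce"), set("abce"), set("be")]
result = partition_mine(D, min_frac=0.5, n_parts=2)
print(sorted("".join(c) for c in result))
```

Scan 1 cannot miss a globally frequent itemset: if X fell below the threshold fraction in every partition, its total count would fall below the global threshold too.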
Sampling for Frequent Patterns
Select a sample of the original database, mine frequent patterns within the sample using Apriori
Scan database once to verify frequent itemsets found in sample; only borders of closure of frequent patterns are checked
Example: check abcd instead of ab, ac, …, etc.
Scan database again to find missed frequent patterns
H. Toivonen, 1996
Bottleneck of Frequent-pattern Mining
Multiple database scans are costly
Mining long patterns needs many passes of scanning and generates lots of candidates
To find frequent itemset i1i2…i100
# of scans: 100
# of candidates: $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$
Bottleneck: candidate-generation-and-test
Can we avoid candidate generation?
Set Enumeration Tree
Subsets of I can be enumerated systematically
I={a, b, c, d}
a b c d
ab ac ad bc bd cd
abc abd acd bcd
abcd
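The tree above can be generated depth-first with a short recursive sketch. The name `set_enumeration` is an assumption; the item order is the order of the input string.

```python
def set_enumeration(items, prefix=()):
    """Yield every subset of `items` exactly once, in depth-first tree order."""
    yield prefix
    for i, item in enumerate(items):
        # A node is extended only with items that come later in the order,
        # so each subset is reached along exactly one path from the root.
        yield from set_enumeration(items[i + 1:], prefix + (item,))

subsets = ["".join(s) for s in set_enumeration("abcd")]
print(subsets)
```

For I = {a, b, c, d} this yields all 16 subsets, starting with the empty set, then a, ab, abc, abcd, abd, and so on.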
Borders of Frequent Itemsets
Connected: if X and Y are frequent and X is an ancestor of Y, then all patterns between X and Y are frequent
a b c d
ab ac ad bc bd cd
abc abd acd bcd
abcd
Projected Databases
To find a child Xy of X, only the X-projected database is needed
The X-projected database is the sub-database of transactions containing X
Xy is frequent if and only if item y is frequent in the X-projected database
a b c d
ab ac ad bc bd cd
abc abd acd bcd
abcd
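Projection-based mining can be sketched as a depth-first recursion. The function name `mine_by_projection` is an assumption, and so is one design choice: each projected database keeps only the items that come after y in the item order, so that every itemset is produced once. The test data is the four-transaction database from the earlier Apriori walkthrough.

```python
def mine_by_projection(db, min_sup, prefix=()):
    """Depth-first mining; `db` is the `prefix`-projected database."""
    counts = {}
    for t in db:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    result = {}
    for y in sorted(counts):
        if counts[y] >= min_sup:
            child = prefix + (y,)
            result[child] = counts[y]
            # Projected database of the child: transactions containing y,
            # restricted to the items that come after y.
            projected = [tuple(i for i in t if i > y) for t in db if y in t]
            result.update(mine_by_projection(projected, min_sup, child))
    return result

D = [tuple(sorted(t)) for t in ("acd", "bce", "abce", "be")]
result = mine_by_projection(D, min_sup=2)
print({"".join(c): n for c, n in result.items()})
```

No candidate itemsets are generated here: each frequent child is found by counting single items inside its parent's projected database.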
Tree-Projection Method
Find frequent 2-itemsets
For each frequent 2-itemset xy, form a projected database
The sub-database containing xy
Recursive mining
If x’y’ is frequent in the xy-projected database, then xyx’y’ is a frequent pattern
Borders and Max-patterns
Max-patterns: borders of frequent patterns
Any subset of a max-pattern is frequent
Any proper superset of a max-pattern is infrequent
a b c d
ab ac ad bc bd cd
abc abd acd bcd
abcd
MaxMiner: Mining Max-patterns
Min_sup = 2
Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F
1st scan: find frequent items
A, B, C, D, E
2nd scan: find support for
AB, AC, AD, AE, ABCDE
BC, BD, BE, BCDE
CD, CE, CDE
DE
Potential max-patterns: ABCDE, BCDE, CDE
Since BCDE is a max-pattern, no need to check BCD, BDE, CDE in later scans
Bayardo’98
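The definition of a max-pattern can be checked on this database with a brute-force sketch. This is not the MaxMiner algorithm: it simply mines all frequent itemsets and keeps those with no frequent proper superset. The function names are assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Brute force: count every non-empty subset of every transaction."""
    counts = {}
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for s in combinations(items, k):
                key = frozenset(s)
                counts[key] = counts.get(key, 0) + 1
    return {s: n for s, n in counts.items() if n >= min_sup}

def max_patterns(transactions, min_sup):
    """Frequent itemsets that have no frequent proper superset."""
    freq = frequent_itemsets(transactions, min_sup)
    return {x for x in freq if not any(x < y for y in freq)}

D = [set("ABCDE"), set("BCDE"), set("ACDF")]
maximal = max_patterns(D, min_sup=2)
print(sorted("".join(sorted(m)) for m in maximal))  # ['ACD', 'BCDE']
```

BCDE comes out as a max-pattern, matching the slide; the only other max-pattern in this database is ACD, which is supported by transactions 10 and 30.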
Frequent Closed Patterns
For frequent itemset X, if there exists no item y not in X s.t. every transaction containing X also contains y, then X is a frequent closed pattern
“acdf” is a frequent closed pattern
Concise rep. of freq pats
Reduce # of patterns and rules
N. Pasquier et al. In ICDT’99
TID Items
10 a, c, d, e, f
20 a, b, e
30 c, e, f
40 a, c, d, f
50 c, e, f
Min_sup=2
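The closedness test can be sketched directly from the definition: an itemset is closed when the items shared by all transactions that contain it are exactly the itemset itself. The names `closure` and `is_frequent_closed` are assumptions.

```python
def closure(itemset, transactions):
    """Items shared by every transaction that contains `itemset`."""
    covering = [t for t in transactions if itemset <= t]
    return frozenset.intersection(*covering) if covering else frozenset()

def is_frequent_closed(itemset, transactions, min_sup):
    itemset = frozenset(itemset)
    support = sum(1 for t in transactions if itemset <= t)
    return support >= min_sup and closure(itemset, transactions) == itemset

# Transactions 10 to 50 from the table above.
D = [frozenset(t) for t in ("acdef", "abe", "cef", "acdf", "cef")]

print(is_frequent_closed("acdf", D, 2))  # True
print(is_frequent_closed("d", D, 2))     # False
```

acdf is contained in transactions 10 and 40, and those two share no further item, so it is closed. The single item d is not closed, because every transaction with d also contains a, c and f.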
CLOSET: Mining Frequent Closed Patterns
Flist: list of all frequent items in support ascending order
Flist: d-a-f-e-c
Divide search space
Patterns having d
Patterns having a but no d, etc.
Find frequent closed patterns recursively
Every transaction having d also has cfa ⇒ cfad is a frequent closed pattern
PHM’00
Min_sup = 2
TID   Items
10    a, c, d, e, f
20    a, b, e
30    c, e, f
40    a, c, d, f
50    c, e, f
Closed and Max-patterns
Closed pattern mining algorithms can be adapted to mine max-patterns
A max-pattern must be closed
Depth-first search methods have advantages over breadth-first search ones
Multiple-level Association Rules
Items often form hierarchy
Flexible support settings: items at the lower level are expected to have lower support
Transaction database can be encoded based on dimensions and levels
Explore shared multi-level mining
Level 1: Milk [support = 10%]
Level 2: 2% Milk [support = 6%], Skim Milk [support = 4%]
Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
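The effect of the two settings on the milk example can be worked out in a few lines. The dictionary layout and the function name `frequent_items` are assumptions; the supports and thresholds are the ones on the slide.

```python
# Supports and hierarchy levels from the slide.
supports = {"Milk": 0.10, "2% Milk": 0.06, "Skim Milk": 0.04}
level = {"Milk": 1, "2% Milk": 2, "Skim Milk": 2}

def frequent_items(min_sup_by_level):
    """Items whose support reaches the threshold of their own level."""
    return [item for item, s in supports.items()
            if s >= min_sup_by_level[level[item]]]

uniform = frequent_items({1: 0.05, 2: 0.05})
reduced = frequent_items({1: 0.05, 2: 0.03})
print(uniform)  # ['Milk', '2% Milk']
print(reduced)  # ['Milk', '2% Milk', 'Skim Milk']
```

With uniform support, Skim Milk (4%) misses the 5% threshold; lowering the level-2 threshold to 3% keeps it.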
Multi-dimensional Association Rules
Single-dimensional rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
Multi-dimensional rules: ≥ 2 dimensions or predicates
Inter-dimension association rules (no repeated predicates)
age(X, “19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
Hybrid-dimension association rules (repeated predicates)
age(X, “19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
Categorical attributes: finite number of possible values, no order among values
Quantitative attributes: numeric, implicit order
Quantitative/Weighted Association Rules
age(X, “33-34”) ∧ income(X, “30K - 50K”) ⇒ buys(X, “high resolution TV”)
Numeric attributes are dynamically discretized
to maximize the confidence or compactness of the rules
2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat
Cluster “adjacent” association rules to form general rules using a 2-D grid.
[Figure: 2-D grid with axes Age (32 to 38) and Income (<20k to 70-80k, in 10k bins)]
Constraint-based Data Mining
Find all the patterns in a database autonomously? The patterns could be too many but not focused!
Data mining should be interactive
User directs what to be mined
Constraint-based mining
User flexibility: provides constraints on what to be mined
System optimization: push constraints for efficient mining
Constraints in Data Mining
Knowledge type constraint: classification, association, etc.
Data constraint: using SQL-like queries
find product pairs sold together in stores in New York
Dimension/level constraint
in relevance to region, price, brand, customer category
Rule (or pattern) constraint
small sales (price < $10) triggers big sales (sum > $200)
Interestingness constraint
strong rules: support and confidence