Association Rule Mining

The UNIVERSITY of KENTUCKY

CS 685: Special Topics in Data Mining

2

Frequent Pattern Analysis

Finding inherent regularities in data

What products were often purchased together? Beer and diapers?!

What are the subsequent purchases after buying a PC?

What are the commonly occurring subsequences in a group of genes?

What are the shared substructures in a group of effective drugs?


3

What Is Frequent Pattern Analysis?

Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

Applications

Identify motifs in bio-molecules

DNA sequence analysis, protein structure analysis

Identify patterns in micro-arrays

Business applications:

Market basket analysis, cross-marketing, catalog design, sale campaign analysis, etc.


4

Data

An item is an element (a literal, a variable, a symbol, a descriptor, an attribute, a measurement, etc.)

A transaction is a set of items

A data set is a set of transactions

A database is a data set

Transaction-id   Items bought
100              f, a, c, d, g, i, m, p
200              a, b, c, f, l, m, o
300              b, f, h, j, o
400              b, c, k, s, p
500              a, f, c, e, l, p, m, n


5

Association Rules

Itemset X = {x1, …, xk}

Find all the rules X → Y with minimum support and confidence

support, s, is the probability that a transaction contains X ∪ Y

confidence, c, is the conditional probability that a transaction having X also contains Y

Let supmin = 50%, confmin = 50%

Association rules:

a → c (60%, 100%)

c → a (60%, 75%)

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

Transaction-id   Items bought
100              f, a, c, d, g, i, m, p
200              a, b, c, f, l, m, o
300              b, f, h, j, o
400              b, c, k, s, p
500              a, f, c, e, l, p, m, n
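The support and confidence definitions on this slide can be checked directly against the five transactions. The following is a minimal sketch; the function names `count`, `support`, and `confidence` are illustrative choices, not part of the slides, and the item printed as "I" in transaction 100 is assumed to be a lowercase "i".

```python
# Transactions from the table on this slide (IDs 100 to 500).
transactions = {
    100: {"f", "a", "c", "d", "g", "i", "m", "p"},
    200: {"a", "b", "c", "f", "l", "m", "o"},
    300: {"b", "f", "h", "j", "o"},
    400: {"b", "c", "k", "s", "p"},
    500: {"a", "f", "c", "e", "l", "p", "m", "n"},
}

def count(itemset, db):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in db.values() if itemset <= t)

def support(itemset, db):
    """Fraction of transactions that contain `itemset`."""
    return count(itemset, db) / len(db)

def confidence(lhs, rhs, db):
    """Estimate of P(rhs | lhs): count(lhs ∪ rhs) / count(lhs)."""
    return count(set(lhs) | set(rhs), db) / count(lhs, db)

print(support({"a", "c"}, transactions))       # 0.6
print(confidence({"a"}, {"c"}, transactions))  # 1.0
print(confidence({"c"}, {"a"}, transactions))  # 0.75
```

Both rules clear the 50% thresholds: items a and c occur together in transactions 100, 200, and 500, while c also occurs alone in transaction 400.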


6

Apriori-based Mining

Generate length-(k+1) candidate itemsets from length-k frequent itemsets, and test the candidates against the DB


7

Apriori Algorithm

A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994)

Min_sup = 2

Database D

TID   Items
10    a, c, d
20    b, c, e
30    a, b, c, e
40    b, e

Scan D: 1-candidates

Itemset   Sup
a         2
b         3
c         3
d         1
e         3

Freq 1-itemsets

Itemset   Sup
a         2
b         3
c         3
e         3

2-candidates: ab, ac, ae, bc, be, ce

Scan D: counting

Itemset   Sup
ab        1
ac        2
ae        1
bc        2
be        3
ce        2

Freq 2-itemsets

Itemset   Sup
ac        2
bc        2
be        3
ce        2

3-candidates: bce

Scan D: Freq 3-itemsets

Itemset   Sup
bce       2


8

Important Details of Apriori

How to generate candidates?

Step 1: self-joining Lk

Step 2: pruning

How to count supports of candidates?


9

How to Generate Candidates?

Suppose the items in Lk-1 are listed in an order

Step 1: self-join Lk-1

    INSERT INTO Ck
    SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1
    FROM Lk-1 p, Lk-1 q
    WHERE p.item1 = q.item1 AND … AND p.itemk-2 = q.itemk-2 AND p.itemk-1 < q.itemk-1

Step 2: pruning

    For each itemset c in Ck do
        For each (k-1)-subset s of c do
            if (s is not in Lk-1) then delete c from Ck


10

Example of Candidate-generation

L3={abc, abd, acd, ace, bcd}

Self-joining: L3*L3

abcd from abc and abd

acde from acd and ace

Pruning:

acde is removed because ade is not in L3

C4={abcd}
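The join and prune steps can be sketched in a few lines of Python. This is an illustrative sketch, not the original implementation: it assumes itemsets are stored as sorted tuples, and the function name `apriori_gen` is chosen here for convenience.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Build the candidate k-itemsets from the frequent (k-1)-itemsets.

    Itemsets are sorted tuples of items (an assumption of this sketch).
    """
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Join step: p and q share their first k-2 items; because the
            # list is sorted, p's last item is smaller than q's last item.
            if p[:-1] == q[:-1]:
                c = p + (q[-1],)
                # Prune step: every (k-1)-subset of c must be frequent.
                if all(s in prev_set for s in combinations(c, len(c) - 1)):
                    candidates.append(c)
    return candidates

L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
print(apriori_gen(L3))  # [('a', 'b', 'c', 'd')]
```

The join produces abcd and acde, and the prune step drops acde because its subset ade is not in L3, matching the example on this slide.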


11

How to Count Supports of Candidates?

Why is counting supports of candidates a problem?

The total number of candidates can be very large

One transaction may contain many candidates

Method:

Candidate itemsets are stored in a hash-tree

Leaf node of hash-tree contains a list of itemsets and counts

Interior node contains a hash table

Subset function: finds all the candidates contained in a transaction


12

Apriori: Candidate Generation-and-test

Any subset of a frequent itemset must also be frequent (an anti-monotone property)

A transaction containing {beer, diaper, nuts} also contains {beer, diaper}

{beer, diaper, nuts} is frequent → {beer, diaper} must also be frequent

No superset of any infrequent itemset should be generated or tested

Many item combinations can be pruned


13

The Apriori Algorithm

Ck: candidate itemset of size k

Lk: frequent itemset of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    return ∪k Lk;
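A compact Python sketch of the level-wise loop follows. It assumes itemsets are sorted tuples and counts candidates by a plain linear scan rather than a hash-tree; the function name `apriori` and the data layout are illustrative choices. The database `D` is the four-transaction example from the earlier Apriori slide.

```python
from itertools import combinations

def apriori(db, min_sup):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in db]
    # L1: count single items in one scan.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {c: n for c, n in counts.items() if n >= min_sup}
    frequent = dict(level)
    while level:
        prev = sorted(level)
        # Generate C(k+1): join on the shared prefix, then prune by subsets.
        candidates = []
        for i, p in enumerate(prev):
            for q in prev[i + 1:]:
                if p[:-1] == q[:-1]:
                    c = p + (q[-1],)
                    if all(s in level for s in combinations(c, len(c) - 1)):
                        candidates.append(c)
        # One scan of the database to count the candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
    return frequent

D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
result = apriori(D, min_sup=2)
print(result[("b", "c", "e")])  # 2
```

With Min_sup = 2 the sketch finds the same nine frequent itemsets as the worked example: a, b, c, e, ac, bc, be, ce, and bce.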


14

Challenges of Frequent Pattern Mining

Challenges

Multiple scans of transaction database

Huge number of candidates

Tedious workload of support counting for candidates

Improving Apriori: general ideas

Reduce number of transaction database scans

Shrink number of candidates

Facilitate support counting of candidates


15

DIC: Reduce Number of Scans

ABCD

ABC ABD ACD BCD

AB AC BC AD BD CD

A B C D

{}

Itemset lattice

Once both A and D are determined frequent, the counting of AD can begin

Once all length-2 subsets of BCD are determined frequent, the counting of BCD can begin

[Figure: counting schedule over the transactions. Apriori: 1-itemsets, 2-itemsets, …; DIC: 1-itemsets, 2-itemsets, 3-itemsets]

S. Brin, R. Motwani, J. Ullman, and S. Tsur, 1997.


16

DHP: Reduce the Number of Candidates

A hash bucket count < min_sup → every candidate in the bucket is infrequent

Candidates: a, b, c, d, e

Hash entries: {ab, ad, ae} {bd, be, de} …

Large 1-itemset: a, b, d, e

The sum of counts of {ab, ad, ae} < min_sup → ab should not be a candidate 2-itemset

J. Park, M. Chen, and P. Yu, 1995
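The bucket test can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the hash function, the number of buckets, and the function name `dhp_candidates` are hypothetical, and the data is the four-transaction database from the Apriori example rather than the buckets listed on this slide.

```python
from itertools import combinations

def dhp_candidates(db, min_sup, n_buckets=7):
    """Return (all 2-candidates from frequent items, those surviving the bucket test)."""
    def bucket(pair):
        # Hypothetical hash function for single-character items; any
        # deterministic function of the pair would serve the same purpose.
        x, y = pair
        return (ord(x) * 10 + ord(y)) % n_buckets

    item_counts = {}
    buckets = [0] * n_buckets
    # Single scan: count items and hash every 2-subset of each transaction.
    for t in db:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):
            buckets[bucket(pair)] += 1

    frequent_items = sorted(i for i, n in item_counts.items() if n >= min_sup)
    all_pairs = list(combinations(frequent_items, 2))
    # A pair whose bucket count is below min_sup cannot be frequent.
    kept = [p for p in all_pairs if buckets[bucket(p)] >= min_sup]
    return all_pairs, kept

D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
all_pairs, kept = dhp_candidates(D, min_sup=2)
print(len(all_pairs), kept)
```

A bucket count is an upper bound on the support of every pair hashed into it, so the filter never discards a frequent pair; it only removes candidates early. How many it removes depends on the hash function and the bucket count.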


17

Partition: Scan Database Only Twice

Partition the database into n partitions

Itemset X is frequent → X is frequent in at least one partition

Scan 1: partition database and find local frequent patterns

Scan 2: consolidate global frequent patterns

A. Savasere, E. Omiecinski, and S. Navathe, 1995
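The two scans can be sketched as follows. This is a simplified illustration under stated assumptions: the local miner is a brute-force enumeration standing in for whatever in-memory algorithm would really be used, the minimum support is given as a fraction, and the names `local_frequent` and `partition_mine` are hypothetical.

```python
from itertools import combinations
from math import ceil

def local_frequent(part, min_frac):
    """Brute-force miner for one partition (a stand-in for any in-memory miner)."""
    need = ceil(min_frac * len(part))
    counts = {}
    for t in part:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for s in combinations(items, k):
                counts[s] = counts.get(s, 0) + 1
    return {s for s, n in counts.items() if n >= need}

def partition_mine(db, min_frac, n_parts):
    size = ceil(len(db) / n_parts)
    parts = [db[i:i + size] for i in range(0, len(db), size)]
    # Scan 1: the union of local frequent itemsets is the global candidate set.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, min_frac)
    # Scan 2: count every candidate against the whole database.
    need = ceil(min_frac * len(db))
    result = {}
    for c in candidates:
        n = sum(1 for t in db if t.issuperset(c))
        if n >= need:
            result[c] = n
    return result

D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
found = partition_mine(D, min_frac=0.5, n_parts=2)
print(sorted(found))
```

Because a globally frequent itemset must be locally frequent in at least one partition, the candidate set from scan 1 cannot miss any frequent itemset; scan 2 only removes false positives.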


18

Sampling for Frequent Patterns

Select a sample of original database, mine frequent patterns within sample using Apriori

Scan database once to verify frequent itemsets found in sample, only borders of closure of frequent patterns are checked

Example: check abcd instead of ab, ac, …, etc.

Scan database again to find missed frequent patterns

H. Toivonen, 1996


19

Bottleneck of Frequent-pattern Mining

Multiple database scans are costly

Mining long patterns needs many passes of scanning and generates lots of candidates

To find frequent itemset i1i2…i100

# of scans: 100

# of candidates: $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$

Bottleneck: candidate-generation-and-test

Can we avoid candidate generation?


20

Set Enumeration Tree

Subsets of I can be enumerated systematically

I={a, b, c, d}

a b c d

ab ac ad bc bd cd

abc abd acd bcd

abcd
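One way to generate the tree is a depth-first recursion in which each node is extended only by items that come later in the item order, so every subset appears exactly once. The sketch below is illustrative; the generator name `enumerate_subsets` is hypothetical.

```python
def enumerate_subsets(items, prefix=()):
    """Yield every non-empty subset of `items` in depth-first tree order."""
    for i, item in enumerate(items):
        node = prefix + (item,)
        yield node
        # Children extend the node only with items later in the order.
        yield from enumerate_subsets(items[i + 1:], node)

nodes = ["".join(n) for n in enumerate_subsets(("a", "b", "c", "d"))]
print(nodes)
```

For I = {a, b, c, d} this visits the 15 non-empty subsets, starting a, ab, abc, abcd, abd, and so on; the tree on the slide shows the same nodes arranged level by level.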


21

Borders of Frequent Itemsets

Connected

X and Y are frequent and X is an ancestor of Y → all patterns between X and Y are frequent

a b c d

ab ac ad bc bd cd

abc abd acd bcd

abcd


22

Projected Databases

To find a child Xy of X, only X-projected database is needed

The sub-database of transactions containing X

Item y is frequent in X-projected database

a b c d

ab ac ad bc bd cd

abc abd acd bcd

abcd
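The projection idea can be sketched as below, using the four-transaction database from the Apriori example as illustrative data. The function names `project` and `frequent_extensions` are hypothetical, and the sketch assumes items are ordered alphabetically, so a child of X adds only an item that comes after the last item of X.

```python
def project(db, x):
    """X-projected database: the transactions that contain every item of X."""
    x = set(x)
    return [t for t in db if x <= t]

def frequent_extensions(db, x, min_sup):
    """Items y after the last item of X such that Xy is frequent, with counts."""
    last = max(x)
    counts = {}
    # Only the X-projected database is scanned, not the whole database.
    for t in project(db, x):
        for y in t:
            if y > last:
                counts[y] = counts.get(y, 0) + 1
    return {y: n for y, n in counts.items() if n >= min_sup}

D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
print(frequent_extensions(D, {"b"}, min_sup=2))       # c with count 2, e with count 3
print(frequent_extensions(D, {"b", "c"}, min_sup=2))  # e with count 2
```

Here the b-projected database holds three transactions, and counting inside it shows that bc and be are frequent children of b.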


23

Tree-Projection Method

Find frequent 2-itemsets

For each frequent 2-itemset xy, form a projected database

The sub-database containing xy

Recursive mining

If x’y’ is frequent in xy-proj db, then xyx’y’ is a frequent pattern


24

Borders and Max-patterns

Max-patterns: borders of frequent patterns

A subset of max-pattern is frequent

A superset of max-pattern is infrequent

a b c d

ab ac ad bc bd cd

abc abd acd bcd

abcd


25

MaxMiner: Mining Max-patterns

1st scan: find frequent items

A, B, C, D, E

2nd scan: find support for

AB, AC, AD, AE, ABCDE

BC, BD, BE, BCDE

CD, CE, CDE, DE

Potential max-patterns: ABCDE, BCDE, CDE

Since BCDE is a max-pattern, no need to check BCD, BDE, CDE in later scan

Baya’98

Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F

Min_sup=2
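The definition of a max-pattern can be checked by brute force on this three-transaction table. The sketch below is not MaxMiner: it enumerates every itemset, which is only feasible for tiny examples, and the function name `max_patterns` is hypothetical.

```python
from itertools import combinations

def max_patterns(db, min_sup):
    """Frequent itemsets that have no frequent proper superset (brute force)."""
    items = sorted(set().union(*db))
    frequent = []
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            if sum(1 for t in db if t.issuperset(c)) >= min_sup:
                frequent.append(frozenset(c))
    # Keep only itemsets that are not a proper subset of another frequent one.
    return [x for x in frequent if not any(x < y for y in frequent)]

db = [set("ABCDE"), set("BCDE"), set("ACDF")]
print(sorted("".join(sorted(p)) for p in max_patterns(db, min_sup=2)))
# ['ACD', 'BCDE']
```

BCDE is confirmed as a max-pattern, and the exhaustive check also reports ACD, which is supported by transactions 10 and 30 and has no frequent superset.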


26

Frequent Closed Patterns

For frequent itemset X, if there exists no item y ∉ X s.t. every transaction containing X also contains y, then X is a frequent closed pattern

“acdf” is a frequent closed pattern

Concise rep. of freq pats

Reduce # of patterns and rules

N. Pasquier et al. In ICDT’99

TID Items

10 a, c, d, e, f

20 a, b, e

30 c, e, f

40 a, c, d, f

50 c, e, f

Min_sup=2
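The closedness test can be written as a closure computation: intersect all transactions that contain X, and X is closed when the intersection adds nothing. A minimal sketch on the table above follows; the function names `closure` and `is_frequent_closed` are illustrative.

```python
def closure(itemset, db):
    """Intersection of all transactions that contain `itemset`."""
    covering = [t for t in db if set(itemset) <= t]
    return set.intersection(*covering) if covering else set()

def is_frequent_closed(itemset, db, min_sup):
    """True when `itemset` is frequent and equal to its own closure."""
    count = sum(1 for t in db if set(itemset) <= t)
    return count >= min_sup and closure(itemset, db) == set(itemset)

# Transactions 10 to 50 from the table on this slide.
db = [set("acdef"), set("abe"), set("cef"), set("acdf"), set("cef")]
print(is_frequent_closed("acdf", db, 2))  # True
print(is_frequent_closed("ad", db, 2))    # False
```

The itemset ad is frequent but not closed, because both transactions that contain it (10 and 40) also contain c and f; its closure is acdf.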


27

CLOSET: Mining Frequent Closed Patterns

Flist: list of all freq items in support asc. order

Flist: d-a-f-e-c

Divide search space

Patterns having d

Patterns having a but no d, etc.

Find frequent closed pattern recursively

Every transaction having d also has cfa → cfad is a frequent closed pattern

PHM’00

TID   Items
10    a, c, d, e, f
20    a, b, e
30    c, e, f
40    a, c, d, f
50    c, e, f

Min_sup=2


28

Closed and Max-patterns

Closed pattern mining algorithms can be adapted to mine max-patterns

A max-pattern must be closed

Depth-first search methods have advantages over breadth-first search ones


29

Multiple-level Association Rules

Items often form hierarchy

Flexible support settings: items at the lower level are expected to have lower support.

Transaction database can be encoded based on dimensions and levels

Explore shared multi-level mining

Level 1: Milk [support = 10%]

Level 2: 2% Milk [support = 6%], Skim Milk [support = 4%]

Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%

Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
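The effect of the two threshold settings on the milk example can be worked through with a short sketch. The dictionary layout and the function name `frequent_items` are illustrative assumptions; the supports and thresholds are the ones given on this slide.

```python
# (level, support) for each item, taken from the slide.
items = {"Milk": (1, 0.10), "2% Milk": (2, 0.06), "Skim Milk": (2, 0.04)}

def frequent_items(items, min_sup_by_level):
    """Keep items whose support reaches the threshold set for their level."""
    return [name for name, (level, sup) in items.items()
            if sup >= min_sup_by_level[level]]

print(frequent_items(items, {1: 0.05, 2: 0.05}))  # ['Milk', '2% Milk']
print(frequent_items(items, {1: 0.05, 2: 0.03}))  # ['Milk', '2% Milk', 'Skim Milk']
```

Under uniform support, Skim Milk (4%) falls below the 5% threshold and is dropped; under reduced support the level-2 threshold of 3% keeps it.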


30

Multi-dimensional Association Rules

Single-dimensional rules:

buys(X, “milk”) → buys(X, “bread”)

MD rules: 2 or more dimensions or predicates

Inter-dimension assoc. rules (no repeated predicates)

age(X, “19-25”) ∧ occupation(X, “student”) → buys(X, “coke”)

Hybrid-dimension assoc. rules (repeated predicates)

age(X, “19-25”) ∧ buys(X, “popcorn”) → buys(X, “coke”)

Categorical Attributes: finite number of possible values, no order among values

Quantitative Attributes: numeric, implicit order


31

Quantitative/Weighted Association Rules

age(X, “33-34”) ∧ income(X, “30K - 50K”) → buys(X, “high resolution TV”)

Numeric attributes are dynamically discretized to maximize the confidence or compactness of the rules

2-D quantitative association rules: Aquan1 ∧ Aquan2 → Acat

Cluster “adjacent” association rules to form general rules using a 2-D grid.

[Figure: 2-D grid with Age (32 to 38) on the horizontal axis and Income (<20k up to 70-80k, in 10k bins) on the vertical axis]


32

Constraint-based Data Mining

Find all the patterns in a database autonomously? The patterns could be too many but not focused!

Data mining should be interactive

User directs what to be mined

Constraint-based mining

User flexibility: provides constraints on what to be mined

System optimization: push constraints for efficient mining


33

Constraints in Data Mining

Knowledge type constraint: classification, association, etc.

Data constraint: using SQL-like queries, e.g. find product pairs sold together in stores in New York

Dimension/level constraint: in relevance to region, price, brand, customer category

Rule (or pattern) constraint: small sales (price < $10) triggers big sales (sum > $200)

Interestingness constraint: strong rules (support and confidence)
