
1

Frequent Itemsets:
Association rules and market basket analysis

CS240B -- UCLA. Notes by Carlo Zaniolo

Most slides borrowed from Jiawei Han, UIUC

May 2007

2

Association Rules & Correlations

Basic concepts
Efficient and scalable frequent itemset mining methods: Apriori and improvements, FP-growth
Rule derivation, visualization and validation
Multi-level associations
Temporal associations and frequent sequences
Other association mining methods
Summary

3

Market Basket Analysis: the context

Analyze customer buying habits by finding associations and correlations between the different items that customers place in their "shopping basket".

Customer1: Milk, eggs, sugar, bread
Customer2: Milk, eggs, cereal, bread
Customer3: Eggs, sugar

4

Market Basket Analysis: the context

Given: a database of customer transactions, where each transaction is a set of items

Find groups of items which are frequently purchased together

5

Goal of MBA

Extract information on purchasing behavior.
Actionable information: can suggest
  new store layouts
  new product assortments
  which products to put on promotion

MBA is applicable whenever a customer purchases multiple things in proximity:
  credit cards
  services of telecommunication companies
  banking services
  medical treatments

6

MBA: applicable to many other contexts

Telecommunications:
  Each customer is a transaction containing the set of the customer's phone calls.

Atmospheric phenomena:
  Each time interval (e.g., a day) is a transaction containing the set of observed events (rain, wind, etc.).

Etc.

7

Association Rules

Express how products/services relate to each other and tend to group together.

"If a customer purchases three-way calling, then she will also purchase call-waiting."

Simple to understand. Actionable information: bundle three-way calling and call-waiting in a single package.

8

Frequent Itemsets

Transaction representations:

  Relational format      Compact format
  <Tid, item>            <Tid, itemset>
  <1, item1>             <1, {item1, item2}>
  <1, item2>             <2, {item3}>
  <2, item3>

Item: a single element. Itemset: a set of items.
Support of an itemset I: the number of transactions containing I.

Minimum support σ: a threshold on support.

Frequent itemset: an itemset with support ≥ σ.

Frequent itemsets represent sets of items which are positively correlated.

9

Frequent Itemsets Example

Support({dairy}) = 3 (75%)
Support({fruit}) = 3 (75%)
Support({dairy, fruit}) = 2 (50%)

If σ = 60%, then {dairy} and {fruit} are frequent while {dairy, fruit} is not.

Transaction ID    Items Bought
1                 dairy, fruit
2                 dairy, fruit, vegetable
3                 dairy
4                 fruit, cereals
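A minimal Python sketch (not part of the original slides) that recomputes the supports above from this example transaction table:

```python
# Support counting over the example transactions (illustrative sketch only).
transactions = [
    {"dairy", "fruit"},               # 1
    {"dairy", "fruit", "vegetable"},  # 2
    {"dairy"},                        # 3
    {"fruit", "cereals"},             # 4
]

def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

for s in [{"dairy"}, {"fruit"}, {"dairy", "fruit"}]:
    n = support(s)
    print(sorted(s), n, f"({100 * n / len(transactions):.0f}%)")
# dairy: 3 (75%), fruit: 3 (75%), {dairy, fruit}: 2 (50%),
# so with σ = 60% only the singletons are frequent.
```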

10

Itemset support & Rules confidence

Let A and B be disjoint itemsets and let:
  s = support(A ∪ B), and
  c = support(A ∪ B) / support(A)

Then the rule A => B holds with support s and confidence c: we write A => B [s, c].

Objective of the mining task: find all rules with minimum support (minsup) and minimum confidence (minconf).
Thus A => B [s, c] holds if s ≥ minsup and c ≥ minconf.

11

Association Rules: Meaning

A => B [s, c]

Support: denotes the frequency of the rule within the transactions. A high value means that the rule involves a large part of the database.

  support(A => B [s, c]) = p(A ∪ B)

Confidence: denotes the percentage of transactions containing A which also contain B. It is an estimate of the conditional probability p(B|A).

  confidence(A => B [s, c]) = p(B|A) = p(A & B) / p(A)

12

Association Rules - Example

For the rule A => C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent.

Transaction ID    Items Bought
2000              A, B, C
1000              A, C
4000              A, D
5000              B, E, F

Frequent Itemset    Support
{A}                 75%
{B}                 50%
{C}                 50%
{A, C}              50%

Min. support 50%, min. confidence 50%.
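The same computation for the rule A => C, as a short sketch (not from the slides) reproducing the numbers in the table above:

```python
# Support and confidence of the rule A => C on the table above.
transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

s = support({"A", "C"})                    # support of A => C
c = support({"A", "C"}) / support({"A"})   # confidence of A => C
print(f"support = {s:.0%}, confidence = {c:.1%}")   # support = 50%, confidence = 66.7%
```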

13

Closed Patterns and Max-Patterns

A long pattern contains very many subpatterns -- a combinatorial explosion. Closed patterns and max-patterns address this.

An itemset is closed if none of its supersets has the same support. Closed patterns are a lossless compression of the frequent patterns, reducing the number of patterns and rules.

An itemset is maximal frequent if none of its supersets is frequent. But the support of its subsets is not known -- additional DB scans are needed to recover it.
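An illustrative sketch (names are mine, not from the slides) that classifies frequent itemsets as closed and/or maximal, assuming the full table of frequent itemsets and their supports is available:

```python
# Classify frequent itemsets as closed and/or maximal from their supports.
def classify(freq):
    """freq: dict mapping frozenset itemsets to support counts (all frequent itemsets)."""
    labels = {}
    for iset, sup in freq.items():
        supersets = [s for s in freq if iset < s]            # frequent proper supersets
        is_maximal = not supersets                           # no frequent superset at all
        is_closed = all(freq[s] < sup for s in supersets)    # no superset with equal support
        labels[iset] = (is_closed, is_maximal)
    return labels

freq = {frozenset("A"): 3, frozenset("AB"): 3, frozenset("ABC"): 2}
for iset, (closed, maximal) in classify(freq).items():
    print("".join(sorted(iset)),
          "closed" if closed else "not closed",
          "maximal" if maximal else "not maximal")
# A: not closed, not maximal;  AB: closed, not maximal;  ABC: closed, maximal
```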

14

Frequent Itemsets

Minimum support = 2

# Frequent = 13

TID Items

1 ABC

2 ABCD

3 BCE

4 ACDE

5 DE

[Figure: itemset lattice over {A, B, C, D, E}, each itemset annotated with the list of TIDs that contain it.]

15

Maximal Frequent Itemset: if none of its supersets is frequent

Minimum support = 2

# Frequent = 13

# Maximal = 4

TID Items

1 ABC

2 ABCD

3 BCE

4 ACDE

5 DE

[Figure: the same itemset lattice, with the 4 maximal frequent itemsets highlighted.]

16

Closed Frequent Itemset: none of its supersets has the same support

Minimum support = 2

# Frequent = 13

# Closed = 9

# Maximal = 4


TID Items

1 ABC

2 ABCD

3 BCE

4 ACDE

5 DE

[Figure: the same itemset lattice, distinguishing itemsets that are closed and maximal from those that are closed but not maximal.]

17

Maximal vs Closed Itemsets

[Figure: Venn diagram -- Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets.]

As we move from an itemset A to its supersets, the support can:

1. Remain the same for some superset: A is not closed.

2. Drop for every superset, but stay above the threshold for some: A is closed but not maximal.

3. Drop below the threshold for every superset: A is maximal (and closed).


18

Scalable Methods for Mining Frequent Patterns

The downward closure property of frequent patterns: every subset of a frequent itemset must be frequent [antimonotonic property].
If {beer, diaper, nuts} is frequent, so is {beer, diaper}: every transaction containing {beer, diaper, nuts} also contains {beer, diaper}.

Scalable mining methods: three major approaches
  Apriori (Agrawal & Srikant @VLDB'94)
  Frequent pattern growth (FP-growth -- Han, Pei & Yin @SIGMOD'00)
  Vertical data format approach (Charm -- Zaki & Hsiao @SDM'02)

19

Apriori: A Candidate Generation-and-Test Approach

Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)

Method:
  Initially, scan the DB once to get the frequent 1-itemsets.
  Generate length-(k+1) candidate itemsets from length-k frequent itemsets.
  Test the candidates against the DB.
  Terminate when no frequent or candidate set can be generated.
(A compact sketch of this loop follows.)
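A compact sketch of the whole Apriori loop (illustrative only; function and variable names are mine, not the course's reference code). It is checked against the four-transaction TDB used in the example two slides further on:

```python
# Compact Apriori sketch.
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frequent itemset: support count} for all itemsets with support >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([i]) for t in transactions for i in t}   # candidate 1-itemsets
    frequent = {}
    while candidates:
        # one DB scan: count every candidate
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(Lk)
        # self-join frequent k-itemsets, then prune candidates having an infrequent k-subset
        joined = {a | b for a in Lk for b in Lk if len(a | b) == len(a) + 1}
        candidates = {c for c in joined
                      if all(frozenset(s) in Lk for s in combinations(c, len(c) - 1))}
    return frequent

# The four-transaction TDB from the example below, with Supmin = 2:
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, 2))   # L1: A, B, C, E; L2: AC, BC, BE, CE; L3: BCE
```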

20

Association Rules & Correlations

Basic concepts
Efficient and scalable frequent itemset mining methods: Apriori and improvements

21

The Apriori Algorithm—An Example

Database TDB (Supmin = 2):

  Tid    Items
  10     A, C, D
  20     B, C, E
  30     A, B, C, E
  40     B, E

1st scan -> C1, then L1:

  Itemset    sup        Itemset    sup
  {A}        2          {A}        2
  {B}        3          {B}        3
  {C}        3          {C}        3
  {D}        1          {E}        3
  {E}        3

C2 (candidates), 2nd scan -> counted C2, then L2:

  Itemset    sup        Itemset    sup
  {A, B}     1          {A, C}     2
  {A, C}     2          {B, C}     2
  {A, E}     1          {B, E}     3
  {B, C}     2          {C, E}     2
  {B, E}     3
  {C, E}     2

C3, 3rd scan -> L3:

  Itemset       sup
  {B, C, E}     2

22

Important Details of Apriori

How to generate candidates?
  Step 1: self-joining Lk
  Step 2: pruning

How to count supports of candidates?

Example of candidate generation:

L3={abc, abd, acd, ace, bcd}

Self-joining: L3*L3
  abcd from abc and abd
  acde from acd and ace

Pruning: acde is removed because ade is not in L3

C4={abcd}

23

How to Generate Candidates?

Suppose the items in Lk-1 are listed in an order.

Step 1: self-joining Lk-1

  insert into Ck
  select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

  forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
      if (s is not in Lk-1) then delete c from Ck

(A Python rendering of this pseudocode follows.)
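A Python rendering of the self-join / prune pseudocode above, as a sketch that assumes each itemset of Lk-1 is kept as a sorted tuple over an ordered item domain:

```python
# Candidate generation: self-join Lk-1 with itself, then prune.
def gen_candidates(Lk_minus_1):
    members = set(Lk_minus_1)
    Ck = set()
    for p in members:
        for q in members:
            # join step: first k-2 items equal, last item of p precedes last item of q
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune step: every (k-1)-subset of c must appear in Lk-1
                if all(c[:i] + c[i + 1:] in members for i in range(len(c))):
                    Ck.add(c)
    return Ck

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")}
print(gen_candidates(L3))   # {('a', 'b', 'c', 'd')} -- acde is pruned since ade is not in L3
```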

24

How to Count Supports of Candidates?

Why is counting the supports of candidates a problem?
  The total number of candidates can be very large.
  One transaction may contain many candidates.

Data structures used: candidate itemsets can be stored in a hash-tree or in a prefix-tree (trie) -- see example.
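The hash-tree itself is not reproduced here; the sketch below (my own stand-in, not the slide's structure) only illustrates the idea of probing a hash-based candidate structure with each transaction's k-subsets:

```python
# Simple stand-in for the hash-tree: probe a hash set of candidates.
from itertools import combinations
from collections import Counter

def count_supports(transactions, candidates, k):
    cand = {tuple(sorted(c)) for c in candidates}     # hash-based membership structure
    counts = Counter()
    for t in transactions:
        for subset in combinations(sorted(t), k):     # a transaction holds many k-subsets
            if subset in cand:
                counts[subset] += 1
    return counts
```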

© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

25

Effect of Support Distribution

Many real data sets have skewed support distribution

[Figure: support distribution of a retail data set.]

© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

26

Effect of Support Distribution

How to set the appropriate minsup threshold?
  If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products).
  If minsup is set too low, mining becomes computationally expensive and the number of itemsets is very large.

Using a single minimum support threshold may not be effective.

© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

27

Rule Generation

How to efficiently generate rules from frequent itemsets?

  In general, confidence does not have an anti-monotone property:
  c(ABC => D) can be larger or smaller than c(AB => D).

  But the confidence of rules generated from the same itemset has an anti-monotone property.
  E.g., for L = {A,B,C,D}:  c(ABC => D) ≥ c(AB => CD) ≥ c(A => BCD)

  Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule.

© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

28

Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f => L − f satisfies the minimum confidence requirement.

If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L => ∅ and ∅ => L).

Example: if L = {A,B,C,D} is the frequent itemset, then the candidate rules are:

  ABC => D, ABD => C, ACD => B, BCD => A,
  A => BCD, B => ACD, C => ABD, D => ABC,
  AB => CD, AC => BD, AD => BC, BC => AD, BD => AC, CD => AB

But antimonotonicity will make things converge fast.
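A brute-force sketch of rule generation from a single frequent itemset L. It ignores the anti-monotone pruning discussed on the next slides, and it assumes `support` is the frequent-itemset support table produced earlier (e.g., by the Apriori sketch):

```python
# Brute-force rule generation from one frequent itemset L.
from itertools import combinations

def rules_from(L, support, minconf):
    """support: dict mapping frozenset itemsets to support counts (assumed precomputed)."""
    L = frozenset(L)
    rules = []
    for r in range(1, len(L)):                   # antecedent size, 1 .. |L|-1
        for f in map(frozenset, combinations(L, r)):
            conf = support[L] / support[f]       # confidence(f => L - f) = s(L) / s(f)
            if conf >= minconf:
                rules.append((f, L - f, conf))
    return rules

sup = {frozenset("A"): 3, frozenset("C"): 2, frozenset("AC"): 2}   # hypothetical counts
print(rules_from("AC", sup, 0.5))   # keeps A => C (conf 2/3) and C => A (conf 1.0)
```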

© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

29

Lattice of rules: confidence(f => L − f) = support(L) / support(f), for L = {A,B,C,D}

[Figure: lattice of all rules f => L − f, from ABCD => {} at the top down to A => BCD, B => ACD, C => ABD, D => ABC; once a rule is found to have low confidence, the rules below it in the lattice (same itemset, smaller antecedent) are pruned.]

© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

30

Rule Generation for Apriori Algorithm

1. A candidate rule is generated by merging two rules that share the same prefix in the rule consequent.

2. join(CD => AB, BD => AC) would produce the candidate rule D => ABC.

3. Prune rule D => ABC if its subset rule AD => BC does not have high confidence.

4. Finally, check the validity of rule D => ABC. (This is not an expensive operation, so we might skip step 3.)

31

Rules: some useful, some trivial, others inexplicable

Useful: “On Thursdays, grocery store consumers often purchase diapers and beer together”.

Trivial: “Customers who purchase maintenance agreements are very likely to purchase large appliances”.

Inexplicable: "When a new hardware store opens, one of the most sold items is toilet rings."

Conclusion: inferred rules must be validated by a domain expert before they can be used in the marketplace: post-mining of association rules.

32

Mining for Association Rules

The main steps in the process:
1. Select a minimum support/confidence level
2. Find the frequent itemsets
3. Find the association rules
4. Validate (post-mine) the rules so found

33

Mining for Association Rules: Checkpoint

Apriori opened up a big commercial market for DM: association rules came from the DB field, classifiers from AI, clustering precedes both … and DM.

Many open problem areas, including:

1. Performance: faster algorithms needed for frequent itemsets

2. Improving the statistical/semantic significance of rules

3. Data stream mining for association rules: even faster algorithms needed, incremental computation, adaptability, etc. Also the post-mining process becomes more challenging.

34

Performance: Efficient Implementation Apriori in SQL

Hard to get good performance out of pure SQL (SQL-92) based approaches alone.

Make use of object-relational extensions like UDFs, BLOBs, table functions, etc.
  S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD'98.

A much better solution: use UDAs -- native or imported.
  Haixun Wang and Carlo Zaniolo: ATLaS: A Native Extension of SQL for Data Mining. SIAM International Conference on Data Mining 2003, San Francisco, CA, May 1-3, 2003.

35

Performance for Apriori

Challenges:
  Multiple scans of the transaction database [not for data streams]
  Huge number of candidates
  Tedious workload of support counting for candidates

Many improvements suggested -- general ideas:
  Reduce the number of passes over the transaction database
  Shrink the number of candidates
  Facilitate the counting of candidates

36

Partition: Scan Database Only Twice

Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.

Scan 1: partition the database and find local frequent patterns.

Scan 2: consolidate global frequent patterns.

A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association in large databases. In VLDB'95.

Does this scale up to larger partitions?
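A sketch of the two-scan partition idea (it reuses the `apriori` helper sketched earlier; the number of partitions is an arbitrary illustration):

```python
# Two-scan partition sketch (assumes the apriori() sketch from a previous slide).
def partition_mine(transactions, minsup_ratio, n_parts=4):
    transactions = [frozenset(t) for t in transactions]
    parts = [transactions[i::n_parts] for i in range(n_parts)]
    # Scan 1: anything globally frequent must be locally frequent in some partition
    candidates = set()
    for p in parts:
        local_minsup = max(1, int(minsup_ratio * len(p)))   # flooring keeps the guarantee
        candidates |= set(apriori(p, local_minsup))
    # Scan 2: count every surviving candidate once over the whole database
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= minsup_ratio * len(transactions)}
```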

37

Sampling for Frequent Patterns

Select a sample S of the original database, and mine frequent patterns within the sample using Apriori.

To avoid losses, mine for a support less than that required.

Scan the rest of the database to find exact counts.

H. Toivonen. Sampling large databases for association rules. In VLDB'96.
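A sample-then-verify sketch (again reusing the `apriori` helper from earlier; the 0.8 slack factor is an illustrative lowering of the threshold, not Toivonen's negative-border construction):

```python
# Sample-then-verify sketch.
import random

def sample_mine(transactions, minsup_ratio, sample_frac=0.1, slack=0.8):
    transactions = [frozenset(t) for t in transactions]
    sample = random.sample(transactions, max(1, int(sample_frac * len(transactions))))
    lowered = max(1, int(slack * minsup_ratio * len(sample)))   # mine at a lowered support
    candidates = apriori(sample, lowered)
    # scan the full database to get exact counts and keep the truly frequent itemsets
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= minsup_ratio * len(transactions)}
```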

38

DIC: Reduce Number of Scans

[Figure: itemset lattice over {A, B, C, D}, from {} up to ABCD.]

Once both A and D are determined frequent, the counting of AD begins

Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins

[Figure: unlike Apriori, which counts 1-itemsets, then 2-itemsets, then 3-itemsets in separate full passes over the transactions, DIC starts counting 2-itemsets and 3-itemsets part-way through the scan.]

S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97.

39

Improving Performance (cont.)

Apriori:
  Multiple database scans are costly.
  Mining long patterns needs many passes of scanning and generates lots of candidates.

  To find the frequent itemset i1 i2 ... i100:
    # of scans: 100
    # of candidates: (100 choose 1) + (100 choose 2) + ... + (100 choose 100) = 2^100 - 1 = 1.27*10^30 !

Bottleneck: candidate generation and test.

Can we avoid candidate generation?

40

Mining Frequent Patterns Without Candidate Generation

FP-Growth Algorithm

FP-Growth Algorithm:

1. Build the FP-tree: items are listed by decreasing frequency.

2. For each suffix (recursively): build its conditionalized subtree and compute its frequent items.

An order of magnitude faster than Apriori

41

Frequent Patterns (FP) Algorithm

These slides are based on those by: Yousry Taha, Taghrid Al-Shallali, Ghada AL Modaifer, Nesreen AL Boiez

The algorithm consists of two steps:
Step 1: build the FP-tree (Frequent Pattern Tree).
Step 2: use the FP-Growth algorithm to find frequent itemsets from the FP-tree.

42

Frequent Pattern Tree Algorithm: Example

• The first scan of the database is the same as in Apriori: it derives the set of 1-itemsets and their support counts.

• The set of frequent items is sorted in order of descending support count.

• An FP-tree is constructed (a construction sketch follows the table below).

• The FP-tree is conditionalized and mined for frequent itemsets.

T-ID List of Items

101 Milk, bread, cookies, juice

792 Milk, juice

1130 Milk, eggs

1735 Bread, cookies, coffee
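A minimal FP-tree construction sketch for this example (class and field names are mine, not the course's reference code; ties in support are broken alphabetically):

```python
# Minimal FP-tree construction sketch.
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def build_fp_tree(transactions, minsup):
    counts = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in counts.items() if c >= minsup}    # frequent 1-itemsets
    root, header = Node(None, None), defaultdict(list)         # header: item -> node-links
    for t in transactions:
        node = root
        # keep only frequent items, sorted by descending support (ties broken by name)
        for i in sorted((x for x in t if x in freq), key=lambda x: (-freq[x], x)):
            child = node.children.get(i)
            if child is None:
                child = node.children[i] = Node(i, node)
                header[i].append(child)
            else:
                child.count += 1
            node = child
    return root, header

tdb = [["milk", "bread", "cookies", "juice"],   # 101
       ["milk", "juice"],                       # 792
       ["milk", "eggs"],                        # 1130
       ["bread", "cookies", "coffee"]]          # 1735
root, header = build_fp_tree(tdb, 2)
print({i: sum(n.count for n in header[i]) for i in header})
# -> {'milk': 3, 'bread': 2, 'cookies': 2, 'juice': 2}, matching the item header table
```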

43

[Figure: FP-tree built from the transactions of the previous slide -- the path NULL -> Milk:3 -> Bread:1 -> Cookies:1 -> Juice:1, a branch Milk:3 -> Juice:1, and a separate path NULL -> Bread:1 -> Cookies:1; node-links connect the item header table to the nodes.]

Item header table:

  Item Id    Support    Node-link
  milk       3
  bread      2
  cookies    2
  juice      2

44

FP-Growth Algorithm For Finding Frequent Itemsets

Steps:

1. Start from each frequent length-1 pattern (as an initial suffix pattern).

2. Construct its conditional pattern base, which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern.

3. Then construct its conditional FP-tree and perform mining on that tree.

4. The pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from the conditional FP-tree.

5. The union of all frequent patterns (generated by step 4) gives the required frequent itemsets. (A sketch of step 2 follows.)
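A sketch of step 2 for a single suffix, reusing the Node/header structures from the FP-tree sketch above: it collects the conditional pattern base by walking parent links from each node-link of the suffix.

```python
# Conditional pattern base for one suffix (reuses Node/header from the sketch above).
def conditional_pattern_base(header, suffix):
    base = []                                  # list of (prefix path, count) pairs
    for node in header[suffix]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

print(conditional_pattern_base(header, "juice"))
# -> [(['milk', 'bread', 'cookies'], 1), (['milk'], 1)], as in the table on the next slide
```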

45

FP-Growth: for each suffix find (1) its supporting paths, (2) its conditional FP-tree, and (3) the frequent patterns with such an ending (suffix)

Suffix     Tree paths supporting suffix         Conditional     Frequent pattern
           (conditional pattern base)           FP-tree         generated

juice      {(milk, bread, cookies: 1),          {milk: 2}       {juice, milk: 2}
           (milk: 1)}

cookies    {(milk, bread: 1), (bread: 1)}       {bread: 2}      {cookies, bread: 2}

bread      {(milk: 1)}                          -               -

milk       -                                    -               -

… then expand the suffix and repeat these operations.

46

Starting from the least frequent suffix: Juice.

[Figure: the FP-tree with the two paths ending in Juice:1 highlighted -- (Milk, Bread, Cookies, Juice:1) and (Milk, Juice:1).]

47

Conditionalized tree for suffix "Juice":

  NULL -> Milk:2

Thus: (Juice, Milk:2) is a frequent pattern.

48

Now patterns with suffix "Cookies":

  Item Id    Sup Count    Node-link
  milk       3            ..
  bread      2            Next
  cookies    2            NOW
  juice      Done         Done

[Figure: the two FP-tree paths supporting the suffix Cookies -- (Milk, Bread, Cookies:1) and (Bread, Cookies:1) -- and the resulting conditionalized tree NULL -> Bread:2.]

Thus: (Cookies, Bread:2) is frequent.

49

Why Frequent Pattern Growth Fast ?

• Performance studies show FP-growth is an order of magnitude faster than Apriori.

• Reasons:
  − No candidate generation, no candidate test
  − Uses a compact data structure
  − Eliminates repeated database scans
  − Basic operations are counting and FP-tree building

50

Other Types of Association Rules

• Association Rules among Hierarchies.

• Multidimensional Association

• Negative Association

51

FP-growth vs. Apriori: Scalability With the Support Threshold

[Figure: run time (sec.) vs. support threshold (%) for D1 FP-growth and D1 Apriori; data set T25I20D10K.]

52

FP-growth vs. Apriori: Scalability With Number of Transactions

[Figure: run time (sec.) vs. number of transactions (K) for FP-growth and Apriori; data set T25I20D100K, support threshold 1.5%.]

53

FP-Growth: pros and cons

The FP-tree is complete:
  It preserves complete information for frequent pattern mining.
  It never breaks a long pattern of any transaction.

The FP-tree is compact:
  It discards irrelevant information -- infrequent items are gone.
  Items are in frequency-descending order: the more frequently occurring, the more likely to be shared.
  It is never larger than the original database (not counting node-links and the count fields).

The FP-tree is generated in one scan of the database (data stream mining?).

However, deriving the frequent patterns from the FP-tree is still computationally expensive -- improved algorithms are needed for data streams.