
Page 1: Mining Generalized Association Rules. Ramakrishnan Srikant, Rakesh Agrawal. Data Mining Seminar, spring semester, 2003. Prof. Amos Fiat. Student: Idit Haran.

Mining Generalized Association Rules

Ramakrishnan Srikant, Rakesh Agrawal

Data Mining Seminar, spring semester, 2003

Prof. Amos Fiat

Student: Idit Haran

Page 2

Outline

Motivation
Terms & Definitions
Interest Measure
Algorithms for mining generalized association rules
Comparison
Conclusions

Page 3

Motivation

Find association rules of the form: Diapers → Beer
Different kinds of diapers: Huggies/Pampers, S/M/L, etc.
Different kinds of beers: Heineken/Maccabi, in a bottle/in a can, etc.
The information on the bar-code is of the type: Huggies Diapers, M; Heineken Beer in bottle.
The preliminary (bar-code-level) rule is not interesting, and probably will not have minimum support.

Page 4

Taxonomy: is-a hierarchies

Clothes
  Outwear
    Jackets
    Ski Pants
  Shirts
Footwear
  Shoes
  Hiking Boots

Page 5

Taxonomy - Example

Say we found the rule Outwear → Hiking Boots with minimum support and confidence.
The rule Jackets → Hiking Boots may not have minimum support.
The rule Clothes → Hiking Boots may not have minimum confidence.

Page 6

Taxonomy

Users are interested in generating rules that span different levels of the taxonomy.
Rules at lower levels may not have minimum support.
The taxonomy can be used to prune uninteresting or redundant rules.
Multiple taxonomies may be present. For example: category, price (cheap, expensive), "items-on-sale", etc.
Multiple taxonomies may be modeled as a forest, or a DAG.

Page 7

Notations

(figure: a taxonomy fragment with a parent node p and its children c1 and c2, plus a separate node z; an edge denotes an is_a relationship between a child and its parent/ancestors. Ancestors are marked with ^.)

Page 8

Notations

I = {i1, i2, ..., im} - the items.
T - a transaction, a set of items T ⊆ I (we expect the items in T to be leaves in the taxonomy T.)
D - the set of transactions.
T supports item x if x is in T or x is an ancestor of some item in T.
T supports X ⊆ I if it supports every item in X.
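The "supports" relation above can be sketched in Python. This is a minimal illustration using the slides' clothes/footwear taxonomy; the function names are my own, not the paper's.

```python
# Sketch of the "T supports X" test: a transaction supports an itemset X
# if every item of X is in T or is an ancestor of some item in T.
# The taxonomy is the slides' clothes/footwear example (child -> parent links).
parent = {
    "Jackets": "Outwear", "Ski Pants": "Outwear",
    "Outwear": "Clothes", "Shirts": "Clothes",
    "Shoes": "Footwear", "Hiking Boots": "Footwear",
}

def ancestors(item):
    """All proper ancestors of an item under the is_a hierarchy."""
    out = set()
    while item in parent:
        item = parent[item]
        out.add(item)
    return out

def supports(transaction, itemset):
    """True if the transaction (a set of leaf items) supports itemset X."""
    closure = set(transaction)
    for item in transaction:
        closure |= ancestors(item)
    return set(itemset) <= closure

print(supports({"Jackets", "Hiking Boots"}, {"Outwear", "Footwear"}))  # True
print(supports({"Shirts"}, {"Outwear"}))                               # False
```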

Page 9

Notations

A generalized association rule: X → Y, if X ⊂ I, Y ⊂ I, X ∩ Y = ∅, and no item in Y is an ancestor of any item in X.
The rule X → Y has confidence c in D if c% of the transactions in D that support X also support Y.
The rule X → Y has support s in D if s% of the transactions in D support X ∪ Y.

Page 10

Problem Statement

To find all generalized association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf) respectively.

Page 11

Example

Recall the taxonomy:

Clothes
  Outwear
    Jackets
    Ski Pants
  Shirts
Footwear
  Shoes
  Hiking Boots

Page 12

Example

minsup = 30%
minconf = 60%

Database D

Transaction   Items Bought
100           Shirt
200           Jacket, Hiking Boots
300           Ski Pants, Hiking Boots
400           Shoes
500           Shoes
600           Jacket

Frequent Itemsets

Itemset                   Support
{Jacket}                  2
{Outwear}                 3
{Clothes}                 4
{Shoes}                   2
{Hiking Boots}            2
{Footwear}                4
{Outwear, Hiking Boots}   2
{Clothes, Hiking Boots}   2
{Outwear, Footwear}       2
{Clothes, Footwear}       2

Rules

Rule                     Support   Confidence
Outwear → Hiking Boots   33%       66.6%
Outwear → Footwear       33%       66.6%
Hiking Boots → Outwear   33%       100%
Hiking Boots → Clothes   33%       100%
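The supports in the example can be reproduced mechanically by extending each transaction with the ancestors of its items; a quick sketch (item names as on the slide):

```python
# Recompute the frequent-itemset supports of the example database
# by extending every transaction with the ancestors of its items.
parent = {
    "Jacket": "Outwear", "Ski Pants": "Outwear",
    "Outwear": "Clothes", "Shirt": "Clothes",
    "Shoes": "Footwear", "Hiking Boots": "Footwear",
}

def extend(t):
    """Transaction plus all ancestors of its items."""
    out = set(t)
    for x in t:
        while x in parent:
            x = parent[x]
            out.add(x)
    return out

D = [{"Shirt"}, {"Jacket", "Hiking Boots"}, {"Ski Pants", "Hiking Boots"},
     {"Shoes"}, {"Shoes"}, {"Jacket"}]

def support(itemset):
    """Number of transactions whose ancestor-closure contains the itemset."""
    return sum(1 for t in D if itemset <= extend(t))

print(support({"Outwear"}))                  # 3
print(support({"Clothes"}))                  # 4
print(support({"Outwear", "Hiking Boots"}))  # 2
```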

Page 13

Observation 1

If the set {x, y} has minimum support, so do {x^, y^}, {x^, y} and {x, y^}.
For example: if {Jacket, Shoes} has minsup, so will {Outwear, Shoes}, {Jacket, Footwear}, and {Outwear, Footwear}.

Clothes
  Outwear
    Jackets
    Ski Pants
  Shirts
Footwear
  Shoes
  Hiking Boots

Page 14

Observation 2

If the rule x → y has minimum support and confidence, only x → y^ is guaranteed to have both minsup and minconf.
The rule Outwear → Hiking Boots has minsup and minconf. Therefore the rule Outwear → Footwear has both minsup and minconf.

Clothes
  Outwear
    Jackets
    Ski Pants
  Shirts
Footwear
  Shoes
  Hiking Boots

Page 15

Observation 2 - cont.

However, while the rules x^ → y and x^ → y^ will have minsup, they may not have minconf.
For example: the rules Clothes → Hiking Boots and Clothes → Footwear have minsup, but not minconf.

Clothes
  Outwear
    Jackets
    Ski Pants
  Shirts
Footwear
  Shoes
  Hiking Boots

Page 16

Interesting Rules - Previous Work

A rule X → Y is not interesting if: support(X ∪ Y) ≈ support(X) · support(Y)
Previous work does not consider the taxonomy.
This interest measure pruned less than 1% of the rules on a real database.

Page 17

Interesting Rules - Using the Taxonomy

Milk → Cereal (8% support, 70% confidence)
Milk is the parent of Skim Milk, and 25% of the sales of Milk are Skim Milk.
We therefore expect Skim Milk → Cereal to have 2% support and 70% confidence.

Page 18

R-Interesting Rules

A rule X → Y is R-interesting w.r.t. an ancestor X^ → Y^ if:
  real support(X → Y) > R × expected support of (X → Y) based on (X^ → Y^), or
  real confidence(X → Y) > R × expected confidence of (X → Y) based on (X^ → Y^).

With R = 1.1, about 40-55% of the rules were pruned.
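The R-interest test can be sketched in a few lines, using the Milk/Skim Milk numbers from the previous slide (the helper names are my own; only the support side of the test is shown):

```python
# Sketch of the R-interest test: a specialized rule is R-interesting w.r.t.
# an ancestor rule if its actual support (or confidence) exceeds R times the
# value expected from the ancestor rule.
def expected_support(ancestor_support, share_of_ancestor):
    # e.g. Milk -> Cereal has 8% support and Skim Milk is 25% of Milk sales,
    # so we expect Skim Milk -> Cereal to have 0.25 * 8% = 2% support.
    return ancestor_support * share_of_ancestor

def r_interesting(actual, expected, R=1.1):
    return actual > R * expected

exp = expected_support(0.08, 0.25)  # 0.02
print(r_interesting(0.02, exp))     # False: exactly as expected, not interesting
print(r_interesting(0.04, exp))     # True: twice the expected support
```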

Page 19

Problem Statement (new)

To find all generalized R-interesting association rules (R is a user-specified minimum interest called min-interest) that have support and confidence greater than minsup and minconf respectively.

Page 20

Algorithms - 3 Steps

1. Find all itemsets whose support is greater than minsup. These itemsets are called frequent itemsets.

2. Use the frequent itemsets to generate the desired rules: if ABCD and AB are frequent, then conf(AB → CD) = support(ABCD) / support(AB).

3. Prune all uninteresting rules from this set.

*All presented algorithms will only implement step 1.
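Step 2's rule derivation can be sketched directly from a table of itemset supports (a minimal sketch; the function name and data layout are my own):

```python
# Step 2 sketch: derive rules from frequent itemsets.
# conf(A -> B) = support(A ∪ B) / support(A); keep rules meeting minconf.
from itertools import combinations

def gen_rules(supports, minconf):
    """supports: dict mapping frozenset itemsets to their support counts."""
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                conf = sup / supports[lhs]
                if conf >= minconf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules

supports = {frozenset({"Outwear"}): 3,
            frozenset({"Hiking Boots"}): 2,
            frozenset({"Outwear", "Hiking Boots"}): 2}
for lhs, rhs, conf in sorted(gen_rules(supports, 0.6), key=lambda r: r[2]):
    print(set(lhs), "->", set(rhs), round(conf, 2))
```

On the slides' example this reproduces Outwear → Hiking Boots at 66.6% confidence and Hiking Boots → Outwear at 100%.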


Page 22

Algorithms (Step 1)

Input: Database, Taxonomy
Output: All frequent itemsets
3 algorithms (same output, different run-time): Basic, Cumulate, EstMerge

Page 23

Algorithm Basic - Main Idea

Is itemset X frequent? Does transaction T support X?
(X contains items from different levels of the taxonomy; T contains only leaves.)
T' = T + ancestors(T). Answer: T supports X ⟺ X ⊆ T'.

Page 24

Algorithm Basic

L1 := {frequent 1-itemsets};               // count item occurrences
for (k := 2; Lk-1 ≠ ∅; k++) do begin
    Ck := apriori-gen(Lk-1);               // generate new k-itemset candidates
    forall transactions t ∈ D do begin
        add-ancestors(t);                  // add all ancestors of each item in t to t, removing any duplicates
        Ct := subset(Ck, t);               // find the support of all the candidates
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk := {c ∈ Ck | c.count ≥ minsup};     // take only those with support over minsup
end
Answer := ∪k Lk;
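A compact Python rendering of Basic: plain Apriori, except every transaction is first extended with the ancestors of its items. This is my own simplified sketch (the join step is naive and skips apriori-gen's prune), not the paper's code.

```python
# Algorithm Basic, sketched: Apriori over ancestor-extended transactions.
def basic(D, parent, minsup):
    def extend(t):
        out = set(t)
        for x in t:
            while x in parent:
                x = parent[x]
                out.add(x)
        return out

    D = [extend(t) for t in D]              # add-ancestors, done once per t
    items = set().union(*D)
    L = {frozenset([i]) for i in items
         if sum(i in t for t in D) >= minsup}   # frequent 1-itemsets
    answer = set(L)
    k = 2
    while L:
        # naive join in place of apriori-gen: union pairs into k-itemsets
        cands = {a | b for a in L for b in L if len(a | b) == k}
        L = {c for c in cands if sum(c <= t for t in D) >= minsup}
        answer |= L
        k += 1
    return answer

parent = {"Jacket": "Outwear", "Ski Pants": "Outwear",
          "Outwear": "Clothes", "Shirt": "Clothes",
          "Shoes": "Footwear", "Hiking Boots": "Footwear"}
D = [{"Shirt"}, {"Jacket", "Hiking Boots"}, {"Ski Pants", "Hiking Boots"},
     {"Shoes"}, {"Shoes"}, {"Jacket"}]
print(frozenset({"Outwear", "Hiking Boots"}) in basic(D, parent, 2))  # True
```

Note that, as the Optimization 3 slide observes, Basic also generates uninteresting candidates such as {Jacket, Outwear}.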

Page 25

Candidate Generation

Join step: p and q are two frequent (k-1)-itemsets identical in their first k-2 items; join by adding the last item of q to p.

insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1;

Prune step: check all the subsets; remove any candidate with an infrequent ("small") subset.

forall itemsets c ∈ Ck do
    forall (k-1)-subsets s of c do
        if (s ∉ Lk-1) then
            delete c from Ck;
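The join and prune steps above can be sketched in Python (a minimal sketch; the function name is my own):

```python
# Sketch of apriori-gen: join frequent (k-1)-itemsets that agree on their
# first k-2 items, then prune candidates with an infrequent (k-1)-subset.
from itertools import combinations

def apriori_gen(L_prev):
    """L_prev: set of frozensets, all of size k-1 (sorted tuples for the join)."""
    prev = sorted(tuple(sorted(s)) for s in L_prev)
    k1 = len(prev[0])
    cands = set()
    # Join step: same prefix, last item of q strictly greater than last of p
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                cands.add(frozenset(p + (q[-1],)))
    # Prune step: every (k-1)-subset of a candidate must be frequent
    frequent = set(L_prev)
    return {c for c in cands
            if all(frozenset(s) in frequent for s in combinations(c, k1))}

L2 = {frozenset(x) for x in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}
print(sorted(sorted(c) for c in apriori_gen(L2)))  # [['A', 'B', 'C']]
```

{B, C, D} is joined but then pruned, because its subset {C, D} is not frequent.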

Page 26

Optimization 1: Filtering the ancestors added to transactions

We only need to add to transaction t the ancestors that appear in one of the candidates.
If the original item is not in any itemset, it can be dropped from the transaction.
Example: with candidates {Clothes, Shoes}, the transaction t = {Jacket, ...} can be replaced with {Clothes, ...}.

Clothes
  Outwear
    Jackets
    Ski Pants
  Shirts

Page 27

Optimization 2: Pre-computing ancestors

Rather than finding the ancestors of each item by traversing the taxonomy graph, we can pre-compute the ancestors for each item.
At the same time, we can drop ancestors that are not contained in any of the candidates.
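Pre-computing the ancestor table can be sketched as a memoized walk up the taxonomy (a minimal sketch; the function name is my own):

```python
# Optimization 2 sketch: compute each item's ancestor set once, instead of
# walking the taxonomy per transaction.
parent = {"Jackets": "Outwear", "Ski Pants": "Outwear",
          "Outwear": "Clothes", "Shirts": "Clothes",
          "Shoes": "Footwear", "Hiking Boots": "Footwear"}

def precompute_ancestors(parent):
    table = {}
    def anc(x):
        if x not in table:
            if x in parent:
                table[x] = {parent[x]} | anc(parent[x])  # memoized recursion
            else:
                table[x] = set()                         # root: no ancestors
        return table[x]
    for item in parent:
        anc(item)
    return table

T_star = precompute_ancestors(parent)
print(sorted(T_star["Jackets"]))  # ['Clothes', 'Outwear']
```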

Page 28

Optimization 3: Pruning itemsets containing an item and its ancestor

If we have {Jacket} and {Outwear}, we will get the candidate {Jacket, Outwear}, which is not interesting:
support({Jacket}) = support({Jacket, Outwear}).
Deleting {Jacket, Outwear} at k = 2 ensures it will not arise at k > 2 (because of the prune step of the candidate generation method).
Therefore, we need to prune candidates containing an item and its ancestor only at k = 2; in later passes no candidate will include an item together with its ancestor.
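The k = 2 prune can be sketched as a filter over the candidate set (a minimal sketch; the function name is my own):

```python
# Optimization 3 sketch: drop any candidate 2-itemset that contains an item
# together with one of its ancestors.
def prune_item_ancestor(C2, ancestors):
    """ancestors: item -> set of its ancestors (as precomputed above)."""
    return {c for c in C2
            if not any(a in ancestors.get(b, set()) for a in c for b in c)}

ancestors = {"Jacket": {"Outwear", "Clothes"}}
C2 = {frozenset({"Jacket", "Outwear"}), frozenset({"Jacket", "Shoes"})}
pruned = prune_item_ancestor(C2, ancestors)  # keeps only {Jacket, Shoes}
print(sorted(sorted(c) for c in pruned))
```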

Page 29

Algorithm Cumulate

Compute T*, the set of ancestors of each item, from the taxonomy T;   // Optimization 2
L1 := {frequent 1-itemsets};
for (k := 2; Lk-1 ≠ ∅; k++) do begin
    Ck := apriori-gen(Lk-1);
    if (k = 2) then prune(C2);            // Optimization 3: delete any candidate in C2 that consists of an item and its ancestor
    remove-unnecessary(T*, Ck);           // Optimization 1: delete any ancestors in T* that are not present in any of the candidates in Ck
    forall transactions t ∈ D do begin
        foreach item x ∈ t, add all ancestors of x in T* to t;   // Optimization 2
        remove any duplicates in t;
        Ct := subset(Ck, t);
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk := {c ∈ Ck | c.count ≥ minsup};
end
Answer := ∪k Lk;

Page 30

Stratification

Candidates: {Clothes, Shoes}, {Outwear, Shoes}, {Jacket, Shoes}
If {Clothes, Shoes} does not have minimum support, we don't need to count either {Outwear, Shoes} or {Jacket, Shoes}.
We will count in steps:
  step 1: count {Clothes, Shoes}, and if it has minsup -
  step 2: count {Outwear, Shoes}, and if it has minsup -
  step 3: count {Jacket, Shoes}

Clothes
  Outwear
    Jackets
    Ski Pants
  Shirts
Footwear
  Shoes
  Hiking Boots

Page 31

Version 1: Stratify

Depth of an itemset:
  itemsets with no parents are of depth 0;
  otherwise: depth(X) = max({depth(X^) | X^ is a parent of X}) + 1.

The algorithm:
  Count all itemsets C0 of depth 0.
  Delete candidates that are descendants of the itemsets in C0 that didn't have minsup.
  Count the remaining itemsets at depth 1 (C1).
  Delete candidates that are descendants of the itemsets in C1 that didn't have minsup.
  Count the remaining itemsets at depth 2 (C2), etc.
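The depth recursion can be sketched as follows. Note the simplification: I take the parents of an itemset to be the itemsets obtained by replacing one item with its taxonomy parent, computed against the full taxonomy rather than the current candidate set.

```python
# Sketch of itemset depth for Stratify: 0 for itemsets with no parents,
# else 1 + the maximum depth of the parent itemsets.
parent = {"Jacket": "Outwear", "Outwear": "Clothes",
          "Hiking Boots": "Footwear"}

def depth(itemset):
    # parents: replace one item by its taxonomy parent
    parents = [(itemset - {x}) | {parent[x]} for x in itemset if x in parent]
    if not parents:
        return 0
    return 1 + max(depth(frozenset(p)) for p in parents)

print(depth(frozenset({"Clothes", "Footwear"})))     # 0
print(depth(frozenset({"Outwear", "Footwear"})))     # 1
print(depth(frozenset({"Jacket", "Hiking Boots"})))  # 3
```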

Page 32

Tradeoff & Optimizations

There is a tradeoff between the number of candidates counted and the number of passes over the database:
  Cumulate counts all candidates in a single pass per size k;
  counting each depth on a different pass minimizes the candidates counted, at the cost of many passes.
Optimization 1: count multiple depths together from a certain level on.
Optimization 2: count more than 20% of the candidates per pass.

Page 33

Version 2: Estimate

Estimating candidates' support using a sample:
1st pass (C'k):
  count candidates that are expected to have minsup (candidates whose support in the sample is at least 0.9 × minsup);
  also count candidates whose parents are expected to have minsup.
2nd pass (C"k):
  count children of candidates in C'k that were not expected to have minsup.

Page 34

Example for Estimate

minsup = 5%

Candidate          Support in   Support in Database
Itemsets           Sample       Scenario A   Scenario B
{Clothes, Shoes}   8%           7%           9%
{Outwear, Shoes}   4%           4%           6%
{Jacket, Shoes}    2%           -            -

Page 35

Version 3: EstMerge

Motivation: eliminate the 2nd pass of algorithm Estimate.
Implementation: count the candidates of C"k together with the candidates in C'k+1.
Restriction: to create C'k+1 we assume that all candidates in C"k have minsup.
The tradeoff: the extra candidates counted by EstMerge vs. the extra pass made by Estimate.

Page 36

Algorithm EstMerge

Ds := generate-sample(D);                          // generate a sample of the database in the first pass
L1 := {frequent 1-itemsets};                       // count item occurrences
C"1 := ∅;
for (k := 2; Lk-1 ≠ ∅ or C"k-1 ≠ ∅; k++) do begin
    Ck := generate-candidates(Lk-1 ∪ C"k-1);       // generate new k-itemset candidates from Lk-1 ∪ C"k-1
    C'k := expected-frequent-and-sons(Ck, Ds);     // estimate Ck's support over Ds: C'k = candidates expected to have minsup + candidates whose parents are expected to have minsup
    find-support(C'k ∪ C"k-1, D);                  // find the support of C'k ∪ C"k-1 by making a pass over D
    prune-descendants(Ck);                         // delete candidates in Ck whose ancestors in C'k don't have minsup
    C"k := Ck - C'k;                               // remaining candidates in Ck that are not in C'k
    Lk-1 := Lk-1 ∪ {c ∈ C"k-1 | c.count ≥ minsup}; // add every candidate in C"k-1 with minsup
    Lk := {c ∈ C'k | c.count ≥ minsup};            // every candidate in C'k with minsup
end
Answer := ∪k Lk;

Page 37

Stratify - Variants

Page 38

Size of Sample

Pr[support in sample < a]

              p = 5%        p = 1%        p = 0.5%      p = 0.1%
              a=.8p  a=.9p  a=.8p  a=.9p  a=.8p  a=.9p  a=.8p  a=.9p
n = 1000      0.32   0.76   0.80   0.95   0.89   0.97   0.98   0.99
n = 10,000    0.00   0.07   0.11   0.59   0.34   0.77   0.80   0.95
n = 100,000   0.00   0.00   0.00   0.01   0.00   0.07   0.12   0.60
n = 1,000,000 0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.01
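One plausible way to generate entries like these (an assumption on my part; the slides do not say which bound was used) is a Chernoff-style lower-tail bound, Pr[X < (1-δ)np] ≤ exp(-npδ²/2), where n is the sample size and p the true support:

```python
import math

# Sketch of a Chernoff-style lower-tail bound on the probability that the
# sampled support of an itemset falls below a = a_frac * p.
def lower_tail_bound(n, p, a_frac):
    delta = 1.0 - a_frac          # a = (1 - delta) * p
    return math.exp(-n * p * delta ** 2 / 2)

# n = 1000, p = 5%, a = 0.9p gives exp(-0.25) ~ 0.78, near the table's 0.76
print(round(lower_tail_bound(1000, 0.05, 0.9), 2))
```

The bound decays rapidly with n, matching the table's drop toward 0.00 for n = 1,000,000.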

Page 39

Size of Sample

Page 40

Performance Evaluation

Compare the running time of 3 algorithms: Basic, Cumulate and EstMerge.
On synthetic data: the effect of each parameter on performance.
On real data: Supermarket Data, Department Store Data.

Page 41

Synthetic Data Generation

Parameter                                                          Default Value
|D|   Number of transactions                                       1,000,000
|T|   Average size of the transactions                             10
|I|   Average size of the maximal potentially frequent itemsets    4
|I|   Number of maximal potentially frequent itemsets              10,000
N     Number of items                                              100,000
R     Number of roots                                              250
L     Number of levels                                             4-5
F     Fanout                                                       5
D     Depth-ratio (probability that an item in a rule comes from
      level i / probability that it comes from level i+1)          1

Page 42

Minimum Support

Page 43

Number of Transactions

Page 44

Fanout

Page 45

Number of Items

Page 46

Reality Check

Supermarket Data:
  548,000 items
  Taxonomy: 4 levels, 118 roots
  ~1.5 million transactions
  Average of 9.6 items per transaction

Department Store Data:
  228,000 items
  Taxonomy: 7 levels, 89 roots
  570,000 transactions
  Average of 4.4 items per transaction

Page 47

Results

Page 48

Conclusions

Cumulate and EstMerge were 2 to 5 times faster than Basic on all synthetic datasets; on the supermarket database they were 100 times faster!
EstMerge was ~25-30% faster than Cumulate.
Both EstMerge and Cumulate exhibit linear scale-up with the number of transactions.

Page 49

Summary

The use of a taxonomy is necessary for finding association rules between items at any level of the hierarchy.
The obvious solution (algorithm Basic) is not very fast.
New algorithms that exploit the taxonomy are much faster.
We can also use the taxonomy to prune uninteresting rules.
