Mining Generalized Association Rules R. Srikant & R. Agrawal (IBM) Presentation by: Colin Cherry.


Mining Generalized Association Rules

R. Srikant & R. Agrawal (IBM)

Presentation by: Colin Cherry

Objectives

• What are generalized association rules?

• Why do we care?

• How can we get them efficiently?

• How can we reduce rule redundancy?

• Is the efficient method any good?

Motivation

• Association rule mining finds rules of the form X → Y, where X and Y are sets of items

• What if there is structure over your items?

• Structure can be used to generalize

Hierarchy Example

Pepsi Coke

Cola

Soft Drink

Beverage


Hierarchy Example

On Sale Not On Sale


• Goal of this paper: given hierarchies over items, capture interesting rules at all levels of multiple hierarchies

Simple Fix

• Just add the parents (and all higher ancestors) of each item to the transaction.

• {Coke, 7-up, ranch Doritos, bananas} would become: {Coke, 7-up, ranch Doritos, bananas, Doritos, cola, clear pop, soft drink, chips, junk food, fruit, produce}
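The expansion step can be sketched in a few lines of Python. This is only an illustration: the `PARENT` map is a hypothetical hierarchy inferred from the example transaction, and it assumes each item has a single parent (the paper's hierarchies may be more general).

```python
# Hypothetical child -> parent map, inferred from the example transaction.
# Assumption: every item has at most one parent.
PARENT = {
    "Coke": "cola",
    "7-up": "clear pop",
    "cola": "soft drink",
    "clear pop": "soft drink",
    "ranch Doritos": "Doritos",
    "Doritos": "chips",
    "chips": "junk food",
    "bananas": "fruit",
    "fruit": "produce",
}


def ancestors(item, parent=PARENT):
    """Return every ancestor of item by walking up the parent map."""
    found = set()
    while item in parent:
        item = parent[item]
        found.add(item)
    return found


def extend_transaction(transaction, parent=PARENT):
    """Return the transaction plus all ancestors of its items."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item, parent)
    return extended


basket = {"Coke", "7-up", "ranch Doritos", "bananas"}
extended = extend_transaction(basket)
```

With this map, `extended` holds the four original items plus the eight ancestors listed on the slide.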

Fix Cont’d

• Run Apriori on expanded database

• Redefine association rules:

Make sure that for X → Y: X ∩ Y = {}, and Y contains no ancestors of any item in X
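The two conditions on a rule can be checked mechanically. The sketch below assumes a hypothetical child-to-parent dictionary as the hierarchy representation; the function name `is_valid_rule` is made up for illustration.

```python
def is_valid_rule(x, y, parent):
    """Check that X and Y are disjoint and Y holds no ancestor of an item in X.

    parent is a hypothetical child -> parent dictionary (one parent per item).
    """
    x, y = set(x), set(y)
    if x & y:
        return False
    for item in x:
        node = item
        # Walk up the hierarchy; any ancestor found in Y invalidates the rule.
        while node in parent:
            node = parent[node]
            if node in y:
                return False
    return True


# Hypothetical hierarchy fragment for the usage example.
parent = {"Coke": "cola", "cola": "soft drink", "ranch Doritos": "chips"}

trivial = is_valid_rule({"Coke"}, {"cola"}, parent)       # Y has an ancestor of Coke
useful = is_valid_rule({"Coke"}, {"chips"}, parent)       # unrelated items
overlapping = is_valid_rule({"Coke", "chips"}, {"chips"}, parent)
```

A rule such as Coke → cola is rejected because it holds in every expanded transaction and so carries no information.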

Problems with the fix

• Counting may slow down: the total number of items and the average transaction size will grow

• Could get a lot of redundant rules: Milk → Cereal (70%) and Skim Milk → Cereal (70%)

Do we care?

An Efficient Algorithm

• “Cumulate”: filtering of the ancestors added to transactions, plus hierarchy-aware itemset pruning

• For more complicated, speculative algorithms, see paper

Filtering Ancestors

• Not counting soft drink? Don’t add it. Only add ancestors that are in at least one of the candidate itemsets

• Delete any items we are not counting. Not counting Doritos? Replace it with chips

• Each iteration: Pre-compute the ancestors for each item
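A minimal sketch of the filtering step, assuming a pre-computed ancestor table stored as a plain dictionary. The table contents, the candidate itemsets, and the function names are all hypothetical; this is not the paper's pseudocode.

```python
# Hypothetical pre-computed table: item -> set of all its ancestors.
ANCESTORS = {
    "Coke": {"cola", "soft drink"},
    "7-up": {"clear pop", "soft drink"},
    "ranch Doritos": {"Doritos", "chips", "junk food"},
    "bananas": {"fruit", "produce"},
}


def items_in_candidates(candidates):
    """Collect every item that appears in at least one candidate itemset."""
    counted = set()
    for itemset in candidates:
        counted |= set(itemset)
    return counted


def filter_transaction(transaction, ancestor_table, counted):
    """Keep only counted items and counted ancestors of the transaction."""
    kept = set()
    for item in transaction:
        if item in counted:
            kept.add(item)
        # Add only those ancestors that some candidate actually contains.
        kept |= ancestor_table.get(item, set()) & counted
    return kept


candidates = [("cola", "chips"), ("Coke", "chips")]
counted = items_in_candidates(candidates)
basket = {"Coke", "7-up", "ranch Doritos", "bananas"}
filtered = filter_transaction(basket, ANCESTORS, counted)
```

Here the twelve-item expanded transaction shrinks to three items: ranch Doritos is effectively replaced by chips, and 7-up and bananas disappear because nothing being counted involves them.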

Itemset Pruning

• No sense counting both {coke, cola, chips} and {coke, chips}: their supports will always be the same

• Take out {coke, cola} while counting itemsets of size 2, and you’ll never have to deal with larger itemsets that contain both

ŷ = ancestor(y)

∀X : (y ∈ X ∧ ŷ ∈ X) → sup(X) = sup(X − {ŷ})
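The pruning rule can be sketched as a filter over candidate itemsets. The ancestor table and the function name `prune_ancestor_pairs` are assumptions made for illustration.

```python
def prune_ancestor_pairs(candidates, ancestor_table):
    """Drop every candidate that contains an item together with its ancestor.

    ancestor_table is a hypothetical dict: item -> set of all its ancestors.
    """
    kept = []
    for itemset in candidates:
        items = set(itemset)
        # Redundant if any item's ancestor set overlaps the itemset itself.
        redundant = any(ancestor_table.get(i, set()) & items for i in items)
        if not redundant:
            kept.append(itemset)
    return kept


ancestor_table = {"coke": {"cola", "soft drink"}, "cola": {"soft drink"}}
size_two = [
    frozenset({"coke", "cola"}),
    frozenset({"coke", "chips"}),
    frozenset({"cola", "chips"}),
]
pruned = prune_ancestor_pairs(size_two, ancestor_table)
```

Only {coke, cola} is removed. Because Apriori builds size-3 candidates from surviving size-2 itemsets, {coke, cola, chips} is then never generated.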

Reducing Redundancy

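Milk → Cereal (8% sup, 70% conf)

Skim Milk → Cereal (2% sup, 70% conf)

• If Skim Milk accounts for 1/4 of Milk sales, then the 2nd rule is redundant: its support is exactly the expected 8% × 1/4 = 2%, and its confidence is unchanged

• Expected support and confidence (with respect to the hierarchy) will define what is interesting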

Close Ancestors

• An itemset Z’ is an ancestor of Z if: Z’ = Z with some items replaced by ancestors, and Z’ has the same number of items as Z

• Z’ is a close ancestor of Z if: no ancestor of Z has Z’ as an ancestor

Take {coke, bananas} as Z:

Z’ = {cola, bananas} is a close ancestor

Z’ = {soft drink, bananas} is not close

Z’ = {cola, fruit} is not close
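Both definitions can be implemented directly by brute force, which is enough to check the {coke, bananas} example. The sketch assumes a hypothetical child-to-parent dictionary with one parent per item and makes no attempt at efficiency.

```python
from itertools import product


def item_ancestors(item, parent):
    """Return the chain of ancestors of one item, nearest first."""
    chain = []
    while item in parent:
        item = parent[item]
        chain.append(item)
    return chain


def itemset_ancestors(z, parent):
    """All itemsets made by replacing some items of z with ancestors."""
    items = sorted(z)
    options = [[i] + item_ancestors(i, parent) for i in items]
    result = set()
    for combo in product(*options):
        candidate = frozenset(combo)
        # Same size as z, and at least one item actually replaced.
        if len(candidate) == len(items) and candidate != frozenset(items):
            result.add(candidate)
    return result


def close_ancestors(z, parent):
    """Ancestors of z that are not ancestors of another ancestor of z."""
    all_anc = itemset_ancestors(z, parent)
    return {
        a for a in all_anc
        if not any(a in itemset_ancestors(mid, parent) for mid in all_anc)
    }


# Hypothetical hierarchy fragment matching the slide's example.
parent = {"coke": "cola", "cola": "soft drink", "bananas": "fruit"}
close = close_ancestors({"coke", "bananas"}, parent)
```

Under this hierarchy the close ancestors of {coke, bananas} are {cola, bananas} and {coke, fruit}; {soft drink, bananas} and {cola, fruit} are ancestors but not close ones.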

Interestingness

• A rule X → Y is interesting if, for all interesting close ancestors X’ → Y’:

Sup({X,Y}) > R*ExpSup({X,Y}|{X’,Y’})

or:

Conf(X → Y) > R*ExpConf(X → Y | X’ → Y’)

• R is defined by the user

Putting it all together

Item       Sup
Clothes    20
Outerwear  8
Jackets    4

#  Rule                  Sup  (Exp)
1  Clothes → Footwear    10   (-)
2  Outerwear → Footwear  8    (4)
3  Jackets → Footwear    4    (4)

• #1 is interesting: it has no ancestor

• #2 is interesting: twice the expected support

• #3 is not interesting: it has exactly the expected support according to its closest ancestor (#2)
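The expected supports in the Clothes / Outerwear / Jackets example can be reproduced by scaling the ancestor rule's support by the child item's share of its ancestor's support. The sketch below covers the support test only, for a single specialized item, and the threshold R = 1.5 is an arbitrary value chosen for illustration, not one taken from the slides.

```python
def expected_support(ancestor_rule_sup, item_sup, ancestor_item_sup):
    """Scale the ancestor rule's support by the child item's share."""
    return ancestor_rule_sup * item_sup / ancestor_item_sup


def is_interesting_by_support(rule_sup, expected, r):
    """Support test from the slides: Sup > R * ExpSup."""
    return rule_sup > r * expected


R = 1.5  # hypothetical user-chosen threshold

# Rule 2 (Outerwear -> Footwear) judged against rule 1 (Clothes -> Footwear).
exp_rule2 = expected_support(ancestor_rule_sup=10, item_sup=8, ancestor_item_sup=20)
rule2_ok = is_interesting_by_support(8, exp_rule2, R)

# Rule 3 (Jackets -> Footwear) judged against rule 2, its closest ancestor.
exp_rule3 = expected_support(ancestor_rule_sup=8, item_sup=4, ancestor_item_sup=8)
rule3_ok = is_interesting_by_support(4, exp_rule3, R)
```

Both expected supports come out to 4, matching the (Exp) column: rule 2 has twice its expected support and passes, while rule 3 has exactly its expected support and fails.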

Experiments

• Lots of experiments on artificial data in paper.

• We’ll look at the results of using Cumulate on real data

• Compare to the quick fix - just adding in ancestors to transactions

Supermarket

Department Store

Interestingness Results

• Hierarchical Interestingness pruning:

R = 25% resulted in pruning roughly 40% of the rules

R = 50% resulted in pruning roughly 50% of the rules

• Pruning had a significant impact!

Objectives Revisited

• What are generalized association rules? Rules aware of hierarchies over items

• Why do we care? Support can be low for individual items

• How can we get them efficiently? Cumulate algorithm - hierarchy aware counting

• How can we reduce rule redundancy? Check surprise with respect to ancestors

• Is the efficient method any good? Yep!

Questions?


Hierarchy Example

Cans Bottles

Beverage

Fridge

Impulse


Pros

• Rules over items low in the tree may not have minimum support

• Can raise min support: shoot for fewer, more general rules. BUT: you can still catch rules at any level of the hierarchy

Data Sets

• Supermarket: 500,000 items, 1.5 million transactions; hierarchy has 4 levels, 118 roots

• Department Store: 200,000 items, 500,000 transactions; hierarchy has 7 levels, 89 roots

Summary

• Nothing ground-breaking in this paper

• But, it provides a solid, efficient method for working with hierarchies

• Generalization is a powerful tool to have available in association rules