Lecture16 - Advances topics on association rules PART III

Introduction to Machine Learning. Lecture 16: Advanced Topics in Association Rules Mining. Albert Orriols i Puig, http://www.albertorriols.net, [email protected]. Artificial Intelligence – Machine Learning, Enginyeria i Arquitectura La Salle, Universitat Ramon Llull


Transcript of Lecture16 - Advances topics on association rules PART III

Page 1: Lecture16 - Advances topics on association rules PART III

Introduction to Machine Learning

Lecture 16: Advanced Topics in Association Rules Mining

Albert Orriols i Puig
http://www.albertorriols.net
[email protected]

Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull

Page 2: Lecture16 - Advances topics on association rules PART III

Recap of Lectures 13-15

Ideas come from market basket analysis (MBA)

Let’s go shopping!

Customer1: Milk, eggs, sugar, bread
Customer2: Milk, eggs, cereal, bread
Customer3: Eggs, sugar

What do my customers buy? Which products are bought together?

Aim: Find associations and correlations between the different items that customers place in their shopping basket

Page 3: Lecture16 - Advances topics on association rules PART III

Recap of Lecture 15

Aim: Find associations between items

But wait!
There are many different diapers: Dodot, Huggies, …
There are many different beers: Heineken, Desperados, Kingfisher, … in bottle/can, …

Which rule do you prefer?
diapers ⇒ beer
Dodot diapers M ⇒ Dam beer in can

Which will have greater support?

[Figure: item taxonomy with the nodes Clothes, Outwear, Shirts, Jackets, Ski Pants]

Page 4: Lecture16 - Advances topics on association rules PART III

Today’s Agenda

Continuing our journey through some advanced topics in ARM

Mining frequent patterns without candidate generation

Multiple Level AR

Sequential Pattern Mining

Quantitative association rules

Mining class association rules

Beyond support & confidence

Applications

Page 5: Lecture16 - Advances topics on association rules PART III

Introduction to Seq. AR

So far, we have seen
Apriori
FP-growth
Mining multiple level AR

But none of them considers the order of transactions

However, is the sequence important? Which came first, the hen or the egg?

Sometimes, it is really important
Analyze the sequence of items bought by a customer
Web usage mining searches for navigational patterns of users

Page 6: Lecture16 - Advances topics on association rules PART III

An Example in Web Usage Mining

Web sequence: < {Homepage} {Electronics} {Computers} {Laptops} {Sony Vaio} {Order Confirmation} {Return to Shopping} >

Page 7: Lecture16 - Advances topics on association rules PART III

Definition

Defining the problem:

Let I = {i1, i2, …, im} be a set of items

Sequence: An ordered list of itemsets

Itemset/element: A non-empty set of items X ⊆ I. We denote a sequence s by ⟨a1a2…ar⟩, where ai is an itemset, which is also called an element of s

An element (or an itemset) of a sequence is denoted by {x1, x2, …, xk}, where xj ∈ I is an item

We assume without loss of generality that items in an element of a sequence are in lexicographic order

Page 8: Lecture16 - Advances topics on association rules PART III

Definition

Defining the problem:

Size: The size of a sequence is the number of elements (or itemsets) in the sequence

Length: The length of a sequence is the number of items in the sequence

A sequence of length k is called a k-sequence

A sequence s1 = ⟨a1a2…ar⟩ is a subsequence of another sequence s2 = ⟨b1b2…bv⟩, or s2 is a supersequence of s1, if there exist integers 1 ≤ j1 < j2 < … < jr−1 < jr ≤ v such that a1 ⊆ bj1, a2 ⊆ bj2, …, ar ⊆ bjr. We also say that s2 contains s1

Page 9: Lecture16 - Advances topics on association rules PART III

Example

Let I = {1, 2, 3, 4, 5, 6, 7, 8, 9}.

Sequence ⟨{3}{4, 5}{8}⟩ is contained in (or is a subsequence of) ⟨{6}{3, 7}{9}{4, 5, 8}{3, 8}⟩ because {3} ⊆ {3, 7}, {4, 5} ⊆ {4, 5, 8}, and {8} ⊆ {3, 8}.

However, ⟨{3}{8}⟩ is not contained in ⟨{3, 8}⟩ or vice versa.

The size of the sequence ⟨{3}{4, 5}{8}⟩ is 3, and the length of the sequence is 4
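
The definitions of size, length, and containment can be checked mechanically. The following is a minimal Python sketch, assuming a sequence is represented as a list of sets; the function names `size`, `length`, and `is_subsequence` are illustrative choices, not part of the lecture.

```python
def size(seq):
    """Number of elements (itemsets) in the sequence."""
    return len(seq)


def length(seq):
    """Total number of items over all elements of the sequence."""
    return sum(len(element) for element in seq)


def is_subsequence(s1, s2):
    """Return True if s1 is contained in s2.

    Greedy scan: each element of s1 is matched to the earliest element
    of s2, after the previous match, that is a superset of it.
    """
    j = 0
    for a in s1:
        while j < len(s2) and not a <= s2[j]:
            j += 1
        if j == len(s2):
            return False
        j += 1  # the next element of s1 must match strictly later in s2
    return True


s1 = [{3}, {4, 5}, {8}]
s2 = [{6}, {3, 7}, {9}, {4, 5, 8}, {3, 8}]

print(is_subsequence(s1, s2))                # True
print(is_subsequence([{3}, {8}], [{3, 8}]))  # False
print(is_subsequence([{3, 8}], [{3}, {8}]))  # False
print(size(s1), length(s1))                  # 3 4
```

The greedy earliest match is enough here: matching an element as early as possible never rules out a later match that would otherwise exist.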

Page 10: Lecture16 - Advances topics on association rules PART III

Objective

Objective of sequential pattern mining (SPM)

Input: A set S of input data sequences (or sequence database)

Goal: The problem of mining sequential patterns is to find all the sequences that have a user-specified minimum support

Each such sequence is called a frequent sequence, or a sequential pattern

The support for a sequence is the fraction of total data sequences in S that contain this sequence

Page 11: Lecture16 - Advances topics on association rules PART III

Example

Customer ID   Transaction time   Transaction (items bought)
1             July 20, 2005      30
1             July 25, 2005      90
2             July 9, 2005       10, 20
2             July 14, 2005      30
2             July 20, 2005      40, 60, 70
3             July 25, 2005      30, 50, 70
4             July 25, 2005      30
4             July 29, 2005      40, 70
4             August 2, 2005     90
5             July 12, 2005      90

Customer ID   Customer sequence
1             <(30) (90)>
2             <(10 20) (30) (40 60 70)>
3             <(30 50 70)>
4             <(30) (40 70) (90)>
5             <(90)>

Sequential patterns with support > 25%

1-sequence    <(30)> <(40)> <(70)> <(90)>
2-sequence    <(30)(40)> <(30)(70)> <(30)(90)> <(40 70)>
3-sequence    <(30)(40 70)>

Example borrowed from Bing Liu
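
As an illustration of how the customer sequences and the support values in this example come about, here is a small Python sketch. It assumes transactions are given as (customer ID, date, items) tuples; the helper names `customer_sequences`, `contains`, and `support` are hypothetical.

```python
from collections import defaultdict
from datetime import date

# (customer ID, transaction date, items bought), copied from the example.
transactions = [
    (1, date(2005, 7, 20), {30}),
    (1, date(2005, 7, 25), {90}),
    (2, date(2005, 7, 9), {10, 20}),
    (2, date(2005, 7, 14), {30}),
    (2, date(2005, 7, 20), {40, 60, 70}),
    (3, date(2005, 7, 25), {30, 50, 70}),
    (4, date(2005, 7, 25), {30}),
    (4, date(2005, 7, 29), {40, 70}),
    (4, date(2005, 8, 2), {90}),
    (5, date(2005, 7, 12), {90}),
]


def customer_sequences(rows):
    """Group transactions by customer and order each group by date."""
    by_customer = defaultdict(list)
    for customer, when, items in rows:
        by_customer[customer].append((when, items))
    return {
        c: [items for _, items in sorted(txs, key=lambda t: t[0])]
        for c, txs in by_customer.items()
    }


def contains(data_seq, pattern):
    """True if pattern is a subsequence of data_seq (greedy left-to-right match)."""
    j = 0
    for element in pattern:
        while j < len(data_seq) and not element <= data_seq[j]:
            j += 1
        if j == len(data_seq):
            return False
        j += 1
    return True


def support(pattern, sequences):
    """Fraction of data sequences that contain the pattern."""
    hits = sum(contains(s, pattern) for s in sequences.values())
    return hits / len(sequences)


db = customer_sequences(transactions)
print(support([{30}], db))            # 0.8
print(support([{30}, {40, 70}], db))  # 0.4
print(support([{30, 70}], db))        # 0.2
```

Note that <(30 70)> only reaches 20% support: customer 3 bought 30 and 70 together, whereas customers 2 and 4 bought them in different transactions, which supports <(30)(70)> instead.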

Page 12: Lecture16 - Advances topics on association rules PART III

GSP

GSP closely follows Apriori, but for sequential patterns

If a sequence S is not frequent, then none of the super-sequences of S is frequent
For instance, if <ab> is infrequent, so are <acb> and <(ca)b>

GSP follows these steps:
Initially, every item in DB is a candidate of length 1
For each level (i.e., sequences of length k) do
    Scan the database to collect the support count for each candidate sequence
    Generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
Repeat until no frequent sequence or no candidate can be found

Strength: Candidate pruning by Apriori
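
The level-wise loop above can be sketched in Python as follows. This is a simplified illustration, not the full GSP algorithm: instead of GSP's join of two frequent k-sequences, candidates are built by appending one frequent item to a frequent k-sequence, either as a new element or inside the last element. Patterns are tuples of sorted item tuples, and the names `contains`, `count`, `extend`, and `mine` are hypothetical.

```python
def contains(data_seq, pattern):
    """True if pattern is a subsequence of data_seq (greedy match)."""
    j = 0
    for element in pattern:
        while j < len(data_seq) and not set(element) <= data_seq[j]:
            j += 1
        if j == len(data_seq):
            return False
        j += 1
    return True


def count(pattern, db):
    """Support count: number of data sequences containing the pattern."""
    return sum(contains(s, pattern) for s in db)


def extend(pattern, items):
    """Yield candidates that have one more item than pattern."""
    for x in items:
        yield pattern + ((x,),)  # x starts a new element
        if x > pattern[-1][-1]:  # keep items in lexicographic order
            yield pattern[:-1] + (pattern[-1] + (x,),)  # x joins the last element


def mine(db, min_count):
    """Level-wise search; returns {k: set of frequent k-sequences}."""
    all_items = sorted(set().union(*(e for s in db for e in s)))
    level = {((x,),) for x in all_items if count(((x,),), db) >= min_count}
    items = sorted(p[0][0] for p in level)  # only frequent items can extend
    result, k = {}, 1
    while level:
        result[k] = level
        candidates = {c for p in level for c in extend(p, items)}
        level = {c for c in candidates if count(c, db) >= min_count}
        k += 1
    return result


# Customer sequences of the previous example; support > 25% of 5 means count >= 2.
db = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]
patterns = mine(db, min_count=2)
for k in sorted(patterns):
    print(k, sorted(patterns[k]))
```

On this database the sketch returns the same 1-, 2-, and 3-sequences as listed in the example. The search is complete because removing the last item of a frequent sequence always leaves a frequent sequence, which is the Apriori property stated above.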

Page 13: Lecture16 - Advances topics on association rules PART III

The Algorithm

Does this remind you of Apriori?

Page 14: Lecture16 - Advances topics on association rules PART III

Quantitative AR

Transaction ID   Age   Married   NumCars
1                23    No        1
2                25    Yes       1
3                29    No        0
4                34    Yes       2
5                38    Yes       2

<Age: 30..39> and <Married: Yes> ⇒ <NumCars: 2>

Support = 40%, Conf = 100%

How can we deal with these data?

Page 15: Lecture16 - Advances topics on association rules PART III

Map to Boolean Values

Record ID   Age [20..29]   Age [30..39]   Married Yes   Married No   NumCars 0   NumCars 1
100         1              0              0             1            0           1
200         1              0              1             0            0           1
300         1              0              0             1            1           0
400         0              1              1             0            0           0
500         0              1              1             0            0           0

Now, use any system for mining boolean AR
Apriori
FP-growth
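
To make the mapping concrete, the sketch below turns each record of the quantitative example into a set of boolean items and then measures the rule <Age: 30..39> and <Married: Yes> ⇒ <NumCars: 2>. The decade-wide age intervals follow the table; the item encoding as (attribute, value) pairs and the names `to_items`, `support`, and `confidence` are assumptions made for illustration.

```python
records = [
    {"Age": 23, "Married": "No", "NumCars": 1},
    {"Age": 25, "Married": "Yes", "NumCars": 1},
    {"Age": 29, "Married": "No", "NumCars": 0},
    {"Age": 34, "Married": "Yes", "NumCars": 2},
    {"Age": 38, "Married": "Yes", "NumCars": 2},
]


def to_items(record):
    """Map one record to a set of boolean items (attribute, value or interval)."""
    low = record["Age"] // 10 * 10  # assumed decade-wide intervals: 20..29, 30..39
    return {
        ("Age", f"{low}..{low + 9}"),
        ("Married", record["Married"]),
        ("NumCars", record["NumCars"]),
    }


baskets = [to_items(r) for r in records]


def support(itemset):
    """Fraction of records that contain every item of the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)


antecedent = {("Age", "30..39"), ("Married", "Yes")}
consequent = {("NumCars", 2)}
print(support(antecedent | consequent))    # 0.4
print(confidence(antecedent, consequent))  # 1.0
```

Once the records are sets of boolean items like `baskets`, any boolean association rule miner can be applied to them unchanged.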

Page 16: Lecture16 - Advances topics on association rules PART III

Problems with this Approach

MinSup
If the number of intervals is large, the support of a single interval can be lower

MinConf
Information is lost when partitioning values into intervals. Confidence can be lower as the number of intervals gets smaller

Example
In the partition used:
<NumCars:0> ⇒ <Married:No> c=100%

But now, assume that in the partition, NumCars:0 and NumCars:1 go to the same interval
<NumCars:0,1> ⇒ <Married:No> c=66.67%
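
The two confidence values in this example can be reproduced with a few lines of Python. This is only a sketch over the five records of the running example; the function name `confidence` and the representation of an interval as a set of NumCars values are assumptions.

```python
# (NumCars, Married) for the five records of the running example.
rows = [(1, "No"), (1, "Yes"), (0, "No"), (2, "Yes"), (2, "Yes")]


def confidence(num_cars_interval, married):
    """Confidence of <NumCars in interval> => <Married: married>."""
    covered = [m for n, m in rows if n in num_cars_interval]
    return sum(m == married for m in covered) / len(covered)


print(confidence({0}, "No"))               # 1.0
print(round(confidence({0, 1}, "No"), 4))  # 0.6667
```

Merging the values 0 and 1 into one interval pulls in the married customer with one car, so the rule now covers three records and holds for only two of them.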

Page 17: Lecture16 - Advances topics on association rules PART III

Problems with this Approach

How can we solve this problem?
Increase the number of intervals (to reduce information loss) while combining adjacent ones (to increase support)

ExecTime: Execution time blows up as the number of items per record increases

ManyRules: The number of rules also blows up. Many of them will not be interesting

Page 18: Lecture16 - Advances topics on association rules PART III

Second Approach

Other solutions?
Well, the problem was that the intervals were not the best ones
Let’s try to create the best intervals for our data

How? Discretizing/Clustering techniques
Apply a discretizing/clustering technique to find the best partitions
Employ those partitions

We’ll see how clustering techniques work in the next class. So, keep this in mind and put the pieces together next class!

Page 19: Lecture16 - Advances topics on association rules PART III

Third Approach

And what if we do not map the input to a boolean space?

Create interval-based association rules directly

So, decide the best interval and, then, count the support

Usually, these approaches do not provide all the association rules, but the ones with larger support and confidence

Fuzzy logic can also be applied here. But again, we’ll see GFS in two or three lectures

Page 20: Lecture16 - Advances topics on association rules PART III

Mining Class Association Rules

So far, we have seen ARM without any specific target
It finds all possible rules that exist in the data, i.e., any item can appear as a consequent or a condition of a rule

However, what if we are interested in some specific targets? E.g.:
The user has a set of text documents from some known topics. He/she wants to find out what words are associated or correlated with each topic

So, now, we want to find:
X ⇒ y, where X ⊆ I, and y ∈ Y

The algorithms are very similar to those of ARM

We are not going to see them in class. But you have information on the estudy
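
As a rough illustration of rules of the form X ⇒ y, the Python sketch below enumerates small word sets X and keeps those whose consequent is a topic label and that reach a minimum support and confidence. The documents are hypothetical toy data, the brute-force enumeration stands in for an Apriori-style search, and the name `class_rules` and its parameters are assumptions.

```python
from itertools import combinations

# Hypothetical toy data: (set of words in a document, topic label).
docs = [
    ({"goal", "team", "match"}, "sports"),
    ({"team", "coach", "goal"}, "sports"),
    ({"vote", "party", "team"}, "politics"),
    ({"vote", "law"}, "politics"),
]


def class_rules(data, min_sup, min_conf, max_size=2):
    """Return rules (X, y, support, confidence) whose consequent is a class label."""
    n = len(data)
    words = sorted(set().union(*(w for w, _ in data)))
    labels = sorted({y for _, y in data})
    rules = []
    for k in range(1, max_size + 1):
        for X in combinations(words, k):
            # Labels of the documents that contain every word of X.
            covered = [y for w, y in data if set(X) <= w]
            for y in labels:
                hits = covered.count(y)
                if hits and hits / n >= min_sup and hits / len(covered) >= min_conf:
                    rules.append((X, y, hits / n, hits / len(covered)))
    return rules


rules = class_rules(docs, min_sup=0.5, min_conf=0.9)
for X, y, sup, conf in rules:
    print(X, "=>", y, sup, conf)
```

On this toy data the word "team" alone is not kept as a rule for "sports": it also appears in a politics document, so its confidence is only 2/3.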

Page 21: Lecture16 - Advances topics on association rules PART III

Beyond Support and Confidence

Support and Confidence are the basic measures of interestingness

But many more have been proposed during the last few years
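
One widely used measure beyond support and confidence is lift, the ratio between the observed support of X and Y together and the support expected if X and Y were independent. The slide does not say which measures it has in mind, so lift is used here only as a well-known example. The sketch reuses the five records of the quantitative example; the string encoding of items and the names `n` and `lift` are assumptions.

```python
# Records of the quantitative example, encoded as sets of boolean items.
baskets = [
    {"Age:20..29", "Married:No", "NumCars:1"},
    {"Age:20..29", "Married:Yes", "NumCars:1"},
    {"Age:20..29", "Married:No", "NumCars:0"},
    {"Age:30..39", "Married:Yes", "NumCars:2"},
    {"Age:30..39", "Married:Yes", "NumCars:2"},
]


def n(itemset):
    """Number of records that contain every item of the itemset."""
    return sum(itemset <= b for b in baskets)


def lift(X, Y):
    """support(X and Y) / (support(X) * support(Y)), computed from counts."""
    return len(baskets) * n(X | Y) / (n(X) * n(Y))


print(lift({"Age:30..39", "Married:Yes"}, {"NumCars:2"}))  # 2.5
```

A lift above 1 indicates that the antecedent and the consequent occur together more often than independence would predict; a value near 1 suggests the rule carries little information even if its confidence is high.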

Page 22: Lecture16 - Advances topics on association rules PART III

Some Applications

Wal-Mart has used the technique for years to mine POS data and arrange their stores to maximize sales from such analysis

Medical databases, to discover commonly occurring diseases amongst groups of people

Lottery results databases, to discover those lucky combinations of numbers

Page 23: Lecture16 - Advances topics on association rules PART III

Some Applications

Power System Restoration
PSR is a multi-objective, multi-period, nonlinear, mixed integer optimization problem with various constraints and unforeseeable factors

Discovery of associations that help build heuristics for PSR

Actions in a PSR
start_black_start_unit(x)
energize_line(x)
pick_up_load(x)
synchronize(x,y)
connect_tie_line(x)
crank_unit(x)
energize_busbar(x)

Page 24: Lecture16 - Advances topics on association rules PART III

Some Applications

Correlations with color, spatial relationships, etc.

From coarse to fine resolution mining

Page 25: Lecture16 - Advances topics on association rules PART III

Next Class

Clustering
