Constraint Mining of Frequent Patterns in Long Sequences Presented by Yaron Gonen
Date posted: 21-Dec-2015

Page 1: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results.

Constraint Mining of Frequent Patterns in Long Sequences

Presented by Yaron Gonen

Outline

Introduction
  Problem definition and motivation
  Previous work
The CAMLS Algorithm
  Overview
  Main contributions
  Results
Future Work

Frequent Item-sets: The Market-Basket Model

A set of items, e.g., stuff sold in a supermarket

A set of baskets (later called events or transactions), each of which is a small set of the items, e.g., the things one customer buys on one day.

Support

Support for item-set I = the number of baskets containing all items in I (usually given as a percentage)

Given a support threshold minSup, sets of items that appear in at least minSup baskets are called frequent item-sets

Simplest question: find all frequent item-sets
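As a minimal sketch of the support definition above, support can be counted directly. The baskets and item names here are made-up illustration data, not taken from the slides.

```python
def support(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for basket in baskets if wanted <= set(basket))


# Hypothetical baskets, invented for illustration only.
baskets = [
    {"beer", "diapers", "milk"},
    {"beer", "diapers"},
    {"milk", "bread"},
]

count = support({"beer", "diapers"}, baskets)
fraction = count / len(baskets)  # the same support expressed as a fraction
print(count, round(fraction, 2))
```

Whether the threshold is applied to the count or to the fraction is only a matter of scaling by the number of baskets.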

Example

Items:

Minimum Support = 0.6 (2 baskets)

Application (1)

Items: products at a supermarket
Baskets: set of products a customer bought at one time.
Example: many people buy beer and diapers together.
  Place beer next to diapers to increase sales of both
  Run a sale on diapers and raise the price of beer.

Application (2) (Counter-Intuitive)

Items: species of plants
Baskets: each basket represents an attribute. A basket contains items (plants) that have that attribute
Frequent sets may indicate similarity between plants

Scale of Problem

Costco sells more than 120k different items, and has 57m members (from Wikipedia)
Botany has identified about 350k extant species of plants

The Naïve Algorithm

Generate all possible itemsets.
Check their support.
For a set of items I there are 2^|I| possible itemsets.

The Apriori Property

All nonempty subsets of a frequent itemset must also be frequent.

The Apriori Algorithm

1. Find frequent 1-itemsets.
2. Merge and prune to generate candidates of the next size. (Here's where the Apriori property is used.)
3. Has candidates? If no: End.
4. If yes: go through the whole DB to count support.
5. Candidates that meet min support are frequent itemsets; go back to step 2.

The number of passes over the DB equals the length of the largest itemset.
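The flowchart above can be sketched in a few lines. This is an illustrative version only: it treats an itemset as frequent when its count is at least `min_count` (an assumption that matches how the deck's later examples behave), and the baskets are hypothetical.

```python
from itertools import combinations


def apriori(baskets, min_count):
    """Level-wise search; returns {frozenset(itemset): support count}."""
    baskets = [frozenset(b) for b in baskets]

    def count(candidate):
        # A production version counts all candidates of a level in one DB pass;
        # rescanning per candidate keeps this sketch short.
        return sum(1 for b in baskets if candidate <= b)

    singletons = {frozenset([item]) for b in baskets for item in b}
    level = {c for c in singletons if count(c) >= min_count}
    frequent = {}
    k = 1
    while level:
        frequent.update({c: count(c) for c in level})
        # Merge: unions of two frequent k-itemsets that have exactly k+1 items.
        merged = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune (Apriori property): every k-subset must itself be frequent.
        candidates = {
            c for c in merged
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        level = {c for c in candidates if count(c) >= min_count}
        k += 1
    return frequent


# Hypothetical baskets, each written as a string of single-letter items.
result = apriori(["abc", "abd", "ab", "cd"], min_count=2)
print({"".join(sorted(s)): n for s, n in result.items()})
```

The prune step is the only place the Apriori property is used: a candidate is dropped before any counting if one of its subsets is already known to be infrequent.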

Vertical Format

Index on items.
Calculating support is fast

[Figure: vertical (item-indexed) layout of the example baskets]
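A small sketch of why support is cheap in the vertical format: each item maps to the set of basket ids that contain it, and the support of an itemset is the size of the intersection of those sets. The function names and baskets are hypothetical.

```python
from collections import defaultdict


def to_vertical(baskets):
    """Map each item to the set of basket ids (tids) that contain it."""
    index = defaultdict(set)
    for tid, basket in enumerate(baskets):
        for item in basket:
            index[item].add(tid)
    return dict(index)


def support(itemset, index):
    """Support = size of the intersection of the items' tid sets."""
    tid_sets = [index.get(item, set()) for item in itemset]
    return len(set.intersection(*tid_sets)) if tid_sets else 0


# Hypothetical baskets, each written as a string of single-letter items.
index = to_vertical(["abc", "abd", "ab", "cd"])
print(support("ab", index), support("cd", index))
```

No pass over the original baskets is needed once the index exists.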

Frequent Sequences: Taking it to the Next Level

A large set of sequences, each of which is a time-ordered list of events (baskets), e.g., everything a single customer buys over time

[Figure: a customer's baskets on a timeline, with gaps of 2 weeks and 5 days between them]

Support

Subsequence: a sequence whose events are all subsets of events of another sequence, in the same order (but not necessarily consecutive)

Support for subsequence s = the number of sequences containing s (usually given as a percentage)

Given a support threshold minSup, subsequences that appear in at least minSup sequences are called frequent subsequences

Simplest question: find all frequent subsequences

Notations

Items are letters: a, b, …
Events are parenthesized: (ab), (bdf), …
  Except for events with single items
Sequences are surrounded by <…>
Every sequence has an identifier sid

Example

sid  sequence
1    <a(abc)(ac)d(cf)>
2    <(ad)c(bc)(ae)>
3    <(ef)(ab)(df)cb>
4    <eg(af)cbc>

minSup = 0.5

Frequent sequences:
<a>:4
<(a)(a)>:2
<(a)(c)>:4
<(a)(bc)>:2
<(e)(a)(c)>:2
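The supports in this example can be checked mechanically. The sketch below parses the deck's notation and counts how many of the four sequences contain a pattern; `parse`, `contains` and `support` are hypothetical helper names, and the greedy left-to-right match is one straightforward way to test containment when no gap constraint applies.

```python
import re


def parse(text):
    """Turn '<a(abc)d>' into a list of events, each a frozenset of items."""
    tokens = re.findall(r"\([a-z]+\)|[a-z]", text)
    return [frozenset(tok.strip("()")) for tok in tokens]


def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence` (greedy match)."""
    pos = 0
    for event in pattern:
        while pos < len(sequence) and not event <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern event must match a strictly later event
    return True


db = {
    1: parse("<a(abc)(ac)d(cf)>"),
    2: parse("<(ad)c(bc)(ae)>"),
    3: parse("<(ef)(ab)(df)cb>"),
    4: parse("<eg(af)cbc>"),
}


def support(pattern_text):
    pattern = parse(pattern_text)
    return sum(contains(seq, pattern) for seq in db.values())


for p in ["<a>", "<aa>", "<ac>", "<a(bc)>", "<eac>"]:
    print(p, support(p))
```

Run on the four sequences above it prints 4, 2, 4, 2 and 2 for the five patterns; with minSup = 0.5 the threshold is 2 of the 4 sequences.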

Motivation

Customer shopping patterns
Stock market fluctuation
Weblog click stream analysis
Symptoms of diseases
DNA sequence analysis
Weather forecast
Machine anti aging
Many more…

Much Harder than Frequent Item-sets!

2^(m*n) possible candidates!

Where m is the number of items, and n is the number of transactions in the longest sequence

The Apriori Property

If a sequence is not frequent, then any sequence that contains it cannot be frequent

Constraints

Problems:
  Too many frequent sequences
  Most frequent sequences are not useful
Solution: remove them
Constraints are a way to define usefulness
The trick: do so while mining

Previous Work

GSP (Srikant and Agrawal, 1996)
  Generation-and-test Apriori-based approach
SPADE (Zaki, 2001)
  Generation-and-test Apriori-based approach
  Uses equivalence classes for memory optimization
  Uses a vertical-format db
PrefixSpan (Pei, 2004)
  No candidate generation
  Uses a db-projection method

Why a New Algorithm?

Huge set of candidate sequences / projected dbs generated
Multiple scans of the database needed
Inefficient for mining long sequential patterns
No exploitation of domain-specific properties
Weak constraint support

The CAMLS Algorithm

Constraint-based Apriori algorithm for Mining Long Sequences
Designed especially for efficient mining of long sequences
Outperforms SPADE and PrefixSpan on both synthetic and real data

The CAMLS Algorithm

Makes a logical distinction between two types of constraints:
  Intra-Event: not time related (e.g. mutually exclusive items)
  Inter-Event: addresses the temporal aspect of the data (e.g. values that can or cannot appear one after the other)

Event-wise Constraints

Event must/must not contain a specific item
Two items cannot occur at the same time
max_event_length: an event cannot contain more than a fixed number of items
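One way to picture these constraints is as a predicate applied to each candidate event. The function and its parameter names (`required`, `forbidden`, `exclusive_pairs`) are hypothetical; only `max_event_length` is named on the slide.

```python
def satisfies_event_constraints(event, max_event_length=None,
                                required=(), forbidden=(), exclusive_pairs=()):
    """Check one candidate event (an iterable of items) against event-wise constraints."""
    items = set(event)
    if max_event_length is not None and len(items) > max_event_length:
        return False
    if not set(required) <= items:   # items the event must contain
        return False
    if set(forbidden) & items:       # items the event must not contain
        return False
    # Two items that cannot occur at the same time.
    return not any(a in items and b in items for a, b in exclusive_pairs)


print(satisfies_event_constraints("acd", max_event_length=3))          # within the limit
print(satisfies_event_constraints("ab", exclusive_pairs=[("a", "b")])) # a and b clash
```

Because none of these checks depends on time, they can be applied while itemsets are being generated, before any sequence is built.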

Sequence-wise Constraints

max_sequence_length: a sequence cannot contain more than a fixed number of events
max_gap: a long time between events dismisses the pattern

CAMLS Overview

Input + Constraints (minSup, maxGap, …)
  → Event-wise phase
  → Frequent events + occurrence index
  → Sequence-wise phase
  → Output

What Do We Get?

The best of both worlds:
  Far fewer candidates are generated.
  Support check is fast.
  Worst case: works like SPADE.
  Tradeoff: uses a bit more memory (for storing the frequent item-sets).

Event-wise Phase

Input: sequence database and constraints
Output: frequent events + occurrence index
Use Apriori or FP-Growth to find frequent itemsets (both with minor modifications)

Event-wise

1. L1 = all frequent items
2. for k=2; Lk-1≠∅; k++ do
   1. generateCandidates(Lk-1)
   2. Lk = pruneCandidates()
   3. L = L ∪ Lk
3. end for

generateCandidates: if two frequent (k-1)-events have the same prefix, merge them to form a new candidate
pruneCandidates: prune, calculate support count and create occurrence index

Example soon!

Occurrence Index

A compact representation of all occurrences of a sequence
Structure: list of sids, each associated with a list of eids

Example on next slide!

[Figure: a sequence pointing to sid1, sid2 and sid3, each with its own list of eids (eid1 … eid9)]
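As a rough sketch of this structure, an occurrence index can be held as a dictionary from sid to a list of eids. The rows below are copied from the database used in the deck's event-wise example; the function name and the dictionary layout are assumptions, not the authors' implementation.

```python
# Rows (sid, eid, event) from the deck's event-wise example database.
ROWS = [
    (1, 0, "acd"), (1, 5, "bcd"), (1, 10, "b"),
    (2, 0, "a"), (2, 4, "c"), (2, 8, "bd"),
    (3, 0, "cde"), (3, 7, "e"), (3, 11, "acd"),
]


def occurrence_index(event, rows):
    """Map sid -> sorted list of eids whose event contains every item of `event`."""
    wanted = set(event)
    index = {}
    for sid, eid, items in rows:
        if wanted <= set(items):
            index.setdefault(sid, []).append(eid)
    return {sid: sorted(eids) for sid, eids in index.items()}


acd = occurrence_index("acd", ROWS)
support = len(acd)  # support = number of distinct sids in the index
print(acd, support)
```

Support falls out of the index for free: it is the number of sids, regardless of how many eids each one lists.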

Event-wise Example (Using Apriori)

event  eid  sid
(acd)  0    1
(bcd)  5    1
b      10   1
a      0    2
c      4    2
(bd)   8    2
(cde)  0    3
e      7    3
(acd)  11   3

minSup=2

All frequent items: a:3, b:2, c:3, d:3
candidates: (ab), (ac), (ad), (bc), …
Support count: (ac):2, (ad):2, (bd):2, (cd):2
candidates: (abc), (abd), (acd), …
Support count: (acd):2
No more candidates!

Occurrence index of (acd): sid 1: eid 0; sid 3: eid 11
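The walk-through above can be reproduced with a short level-wise sketch over the same nine rows. It is an illustration under assumptions: events are kept as sorted item tuples, support is the number of distinct sids, and `intersect` and `event_wise` are hypothetical names rather than the authors' code.

```python
from itertools import combinations

# Rows (sid, eid, event) from the example database above.
ROWS = [
    (1, 0, "acd"), (1, 5, "bcd"), (1, 10, "b"),
    (2, 0, "a"), (2, 4, "c"), (2, 8, "bd"),
    (3, 0, "cde"), (3, 7, "e"), (3, 11, "acd"),
]


def intersect(a, b):
    """Occurrences shared by two events: same sid and same eid."""
    out = {}
    for sid in a.keys() & b.keys():
        eids = [e for e in a[sid] if e in b[sid]]
        if eids:
            out[sid] = eids
    return out


def event_wise(rows, min_sup):
    """Return {sorted item tuple: occurrence index} for every frequent event."""
    level = {}
    for sid, eid, items in rows:
        for item in items:
            level.setdefault((item,), {}).setdefault(sid, []).append(eid)
    level = {ev: idx for ev, idx in level.items() if len(idx) >= min_sup}
    frequent = dict(level)
    while level:
        keys, next_level = sorted(level), {}
        for i, p in enumerate(keys):
            for q in keys[i + 1:]:
                if p[:-1] != q[:-1]:
                    continue  # merge only events that share a prefix
                cand = p + (q[-1],)
                # Apriori prune: every (k-1)-subset must be frequent.
                if any(s not in level for s in combinations(cand, len(p))):
                    continue
                idx = intersect(level[p], level[q])
                if len(idx) >= min_sup:
                    next_level[cand] = idx
        frequent.update(next_level)
        level = next_level
    return frequent


frequent = event_wise(ROWS, min_sup=2)
supports = {"".join(ev): len(idx) for ev, idx in frequent.items()}
print(supports)
```

Each new occurrence index is built from the indices of the two merged events, so the database rows are read only once, when the single-item indices are created.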

Sequence-wise Phase

Input: frequent events + occurrence index, constraints
Output: all frequent sequences
Similar to GSP's and SPADE's candidate generation phase, except using the frequent itemsets as seeds

Sequence-wise

1. L1 = all frequent 1-sequences
2. for k=2; Lk-1≠∅; k++ do
   1. generateCandidates(Lk-1)
   2. Lk = pruneAndSupCalc()
   3. L = L ∪ Lk
3. end for

Elaboration on the next two slides

Sequence-wise Candidate Generation

If two frequent k-sequences s' and s'' share a common (k-1)-prefix and s' is a generator, we form a new candidate

s' = <s'1 s'2 … s'k>
s'' = <s''1 s''2 … s''k>
<s'1 s'2 … s'k-1> = <s''1 s''2 … s''k-1>
New candidate: <s'1 s'2 … s'k s''k>
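The join rule can be sketched as follows, with sequences held as tuples of events. This shows candidate generation only, before any pruning or support test, and it assumes that only the first sequence of the pair needs to be a generator; the function name is hypothetical.

```python
def generate_candidates(frequent, generators):
    """Join k-sequences that share their first k-1 events.

    Sequences are tuples of events; only a generator may act as s'.
    """
    candidates = []
    for s1 in frequent:
        if s1 not in generators:
            continue
        for s2 in frequent:
            if s1[:-1] == s2[:-1]:
                candidates.append(s1 + (s2[-1],))
    return candidates


# Two frequent 2-sequences; here only <ac> is assumed to be a generator.
frequent = [("a", "c"), ("a", "b")]
candidates = generate_candidates(frequent, generators={("a", "c")})
print(candidates)
```

With these inputs <ac> joined with <ab> yields <acb>, while <abc> is never produced because <ab> is not a generator.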

Sequence-wise Pruning

1. Keep a radix-ordered list of the sequences pruned in the current iteration
2. Within one iteration it is possible for a k-sequence to contain another k-sequence.
3. With a new candidate:
   1. Check for a subsequence in the pruned list: very fast!
   2. Test for frequency
   3. Add to pruned list if needed

Support Calculation

A simple intersection operation between the occurrence indices of the forming sequences
When a new occurrence index is formed, calculation is trivial
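The slide does not spell out the intersection, so the following is a sketch under an assumption: each index stores, per sid, the eids at which the sequence's last event occurs (as in SPADE-style id-lists), and no maxGap is applied. The single-event indices are read off the deck's event-wise example database.

```python
def temporal_join(prefix_index, last_index):
    """Occurrence index of the sequence obtained by appending an event.

    Both arguments map sid -> sorted eids. An eid of the appended event is kept
    when it is later than the earliest occurrence of the preceding sequence.
    """
    joined = {}
    for sid in prefix_index.keys() & last_index.keys():
        earliest = prefix_index[sid][0]
        eids = [e for e in last_index[sid] if e > earliest]
        if eids:
            joined[sid] = eids
    return joined


# Occurrence indices of single events in the deck's example database.
a = {1: [0], 2: [0], 3: [11]}
b = {1: [5, 10], 2: [8]}
c = {1: [0, 5], 2: [4], 3: [0, 11]}
d = {1: [0, 5], 2: [8], 3: [0, 11]}

ab = temporal_join(a, b)
dc = temporal_join(d, c)
print(len(ab), len(dc))  # support = number of sids in the joined index
```

Once the joined index exists, the support is just its number of sids, which is why the calculation is described as trivial.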

The maxGap Constraint

maxGap is a special kind of constraint:
  Data dependent
  Apriori property not applicable
The occurrence index enables a fast maxGap check
A frequent sequence that does not satisfy maxGap is flagged as a non-generator.
Example:
  Assume <ab> is frequent but the gap between a and b > maxGap
  But <ac> and <ab> are frequent sequences, and in <acb> all maxGap constraints are satisfied!
  So <ab> is a non-generator, but it is kept in order not to prune <acb>
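The generator flag can be illustrated with a brute-force check on the deck's example database. This is not the occurrence-index method the slides describe; it simply counts support with and without a gap limit. It assumes the gap is the eid difference between consecutive matched events, that a gap equal to maxGap is allowed, and that a sequence is a generator when its support under maxGap still reaches minSup.

```python
# sid -> list of (eid, event), sorted by eid (the deck's example database).
DB = {
    1: [(0, "acd"), (5, "bcd"), (10, "b")],
    2: [(0, "a"), (4, "c"), (8, "bd")],
    3: [(0, "cde"), (7, "e"), (11, "acd")],
}
MIN_SUP, MAX_GAP = 2, 5


def occurs(pattern, events, max_gap=None, start=0, prev_eid=None):
    """True if `pattern` (a list of events) occurs in `events` within max_gap."""
    if not pattern:
        return True
    for i in range(start, len(events)):
        eid, items = events[i]
        if max_gap is not None and prev_eid is not None and eid - prev_eid > max_gap:
            break  # events are sorted by eid, so later ones are even further away
        if set(pattern[0]) <= set(items) and occurs(pattern[1:], events, max_gap, i + 1, eid):
            return True
    return False


def support(pattern, max_gap=None):
    return sum(occurs(pattern, events, max_gap) for events in DB.values())


def is_generator(pattern):
    return support(pattern, MAX_GAP) >= MIN_SUP


print(support(["a", "b"]), support(["a", "b"], MAX_GAP))  # frequent, but not within maxGap
print(support(["a", "c", "b"], MAX_GAP))                  # the longer pattern satisfies maxGap
```

Here <ab> has support 2 without the gap limit and 1 with it, so it is frequent but not a generator, while <acb> keeps support 2 under maxGap=5. Dropping <ab> outright would therefore lose a valid longer pattern.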

Sequence-Wise Example

Original DB:

event  eid  sid
(acd)  0    1
(bcd)  5    1
b      10   1
a      0    2
c      4    2
(bd)   8    2
(cde)  0    3
e      7    3
(acd)  11   3

minSup=2, maxGap=5

Event-wise output (g marks a generator):

g <(a)> : 3
g <(b)> : 2
g <(c)> : 3
g <(d)> : 3
g <(ac)> : 2
g <(ad)> : 2
g <(bd)> : 2
g <(cd)> : 2
g <(acd)> : 2

Candidate generation: <aa>, <ab>, <a(acd)>, <ba>, <bb>, <(acd)(acd)>

<aa> is added to pruned list.
<a(ac)> is a super-sequence of <aa>, therefore it is pruned.
<ab> does not pass maxGap, therefore it is not a generator.

Frequent 2-sequences:

  <ab> : 2
g <ac> : 2
  <ad> : 2
  <a(bd)> : 2
g <cb> : 2
g <cd> : 2
g <c(bd)> : 2
  <dc> : 2
  <dd> : 2
  <d(cd)> : 2

Candidate generation: <acb>, <acd>

<acb>:2

No more candidates!

Evaluation (1): Machine Anti Aging

How can sequence mining help?
  Data collected from a machine is a sequence
  Discover typical behavior leading to failure
  Monitor the machine and alert before failure
Domain:
  Light intensity for wavelengths (continuous)
Pre-process:
  Discretization
  Meta features (maxDisc, maxWL, isBurned)

Synm stands for a synthetic database simulating the machine behavior with m meta-features

Evaluation (2)

Real stock data values

Rn stands for stock data (10 different stocks) for n days

CAMLS Compared with PrefixSpan

CAMLS Compared with SPADE and PrefixSpan

So, What's CAMLS's Contribution?

Constraints distinction: easy implementation
Two phases
Handling of the maxGap constraint
Occurrence index data structure
Fast new pruning method

Future Research

Main issue: closed sequences
More constraints (aspiring to regexp)

Thank You!