Constraint Mining of Frequent Patterns in Long Sequences
Presented by Yaron Gonen
Outline
- Introduction
- Problem definition and motivation
- Previous work
- The CAMLS Algorithm: overview, main contributions, results
- Future work
Frequent Item-sets: The Market-Basket Model
- A set of items, e.g., stuff sold in a supermarket.
- A set of baskets (later called events or transactions), each of which is a small set of the items, e.g., the things one customer buys on one day.

Support
- Support for item-set I = the number of baskets containing all items in I (usually given as a percentage).
- Given a support threshold minSup, sets of items that appear in at least minSup baskets are called frequent item-sets.
- Simplest question: find all frequent item-sets.
Example
Items:
Minimum support = 0.6 (2 baskets)
Application (1)
- Items: products at a supermarket.
- Baskets: set of products a customer bought at one time.
- Example: many people buy beer and diapers together.
  - Place beer next to diapers to increase both sales.
  - Run a sale on diapers and raise the price of beer.
Application (2) (Counter-Intuitive)
- Items: species of plants.
- Baskets: each basket represents an attribute; a basket contains the items (plants) that have that attribute.
- Frequent sets may indicate similarity between plants.
Scale of the Problem
- Costco sells more than 120k different items and has 57m members (from Wikipedia).
- Botany has identified about 350k extant species of plants.
The Naïve Algorithm
- Generate all possible itemsets.
- Check their support.
[Diagram: enumeration of all possible itemsets over the item set I.]
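The naïve approach can be sketched as follows: enumerate every subset of the items and count how many baskets contain it. This is a minimal illustration with assumed names (`baskets`, `min_sup` as an absolute count), not code from the slides; it is exponential in the number of items, which is exactly what motivates Apriori.

```python
from itertools import combinations

def naive_frequent_itemsets(baskets, min_sup):
    """Enumerate every possible itemset and count its support.

    Exponential in the number of distinct items -- shown only to
    motivate Apriori. `baskets` is a list of sets of items; `min_sup`
    is an absolute support threshold.
    """
    items = sorted(set().union(*baskets))
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            cset = set(candidate)
            # support = number of baskets containing all items of the candidate
            support = sum(1 for b in baskets if cset <= b)
            if support >= min_sup:
                frequent[candidate] = support
    return frequent

baskets = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}]
print(naive_frequent_itemsets(baskets, 2))
# {('a',): 3, ('c',): 2, ('a', 'c'): 2}
```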
The Apriori Property
All nonempty subsets of a frequent itemset must also be frequent.
The Apriori Algorithm
1. Find frequent 1-itemsets.
2. Merge and prune to generate candidates of the next size (here is where the Apriori property is used).
3. Any candidates left? If no: end.
4. If yes: go through the whole DB to count support; candidates whose support passes minSup become frequent itemsets. Go back to step 2.
The DB is scanned once per level, i.e. as many times as the length of the largest frequent itemset.
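The level-wise loop above can be sketched in a few lines. This is a generic Apriori sketch under assumed names (`baskets`, `min_sup`), not the slides' exact code; note the single DB scan per candidate size.

```python
from itertools import combinations

def apriori(baskets, min_sup):
    """Level-wise Apriori sketch: merge, prune by the Apriori property,
    then one full DB pass per level to count support."""
    # L1: frequent 1-itemsets
    counts = {}
    for b in baskets:
        for item in b:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s for s, c in counts.items() if c >= min_sup}
    all_frequent = set(current)
    k = 2
    while current:
        # merge step: unions of frequent (k-1)-itemsets that have size k
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # prune step (Apriori property): every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # one DB scan to count support of the surviving candidates
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        current = {c for c, n in counts.items() if n >= min_sup}
        all_frequent |= current
        k += 1
    return all_frequent
```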
Vertical Format
- Index on items: each item maps to the list of baskets (transaction ids) that contain it.
- Calculating support is fast.
[Diagram: per-item lists of transaction ids.]
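The vertical format idea can be shown with a small sketch (assumed names, not from the slides): build a tid-list per item, then support of any itemset is just the size of an intersection, with no DB scan.

```python
def build_tidlists(baskets):
    """Vertical format: map each item to the set of basket ids containing it."""
    tidlists = {}
    for tid, basket in enumerate(baskets):
        for item in basket:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

def support(itemset, tidlists):
    """Support of an itemset = size of the intersection of its items' tid-lists."""
    ids = [tidlists[i] for i in itemset]
    return len(set.intersection(*ids))
```

For example, with baskets `[{"a","b"}, {"a","c"}, {"b","c"}]`, the tid-list of `a` is `{0, 1}` and `support(["b", "c"], ...)` intersects `{0, 2}` with `{1, 2}` to give 1.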
Frequent Sequences: Taking it to the Next Level
A large set of sequences, each of which is a time-ordered list of events (baskets), e.g., all the stuff a single customer buys over time.

Support
- Subsequence: a sequence whose events are all subsets of another sequence's events, in the same order (but not necessarily consecutive).
- Support for subsequence s = the number of sequences containing s (usually given as a percentage).
- Given a support threshold minSup, subsequences that appear in at least minSup sequences are called frequent subsequences.
- Simplest question: find all frequent subsequences.
Notations
- Items are letters: a, b, …
- Events are parenthesized: (ab), (bdf), … except for events with a single item.
- Sequences are surrounded by <…>.
- Every sequence has an identifier sid.
Example
sid sequence
1 <a(abc)(ac)d(cf)>
2 <(ad)c(bc)(ae)>
3 <(ef)(ab)(df)cb>
4 <eg(af)cbc>
Frequent subsequences (minSup = 0.5):
<a>:4
<(a)(a)>:3
<(a)(c)>:4
<(a)(bc)>:2
<(e)(a)(c)>:2
…
Motivation
- Customer shopping patterns
- Stock market fluctuation
- Weblog click-stream analysis
- Symptoms of a disease
- DNA sequence analysis
- Weather forecast
- Machine anti-aging
- Many more…
Much Harder than Frequent Item-sets!
2^(m·n) possible candidates, where m is the number of items and n is the number of transactions in the longest sequence.
The Apriori Property
If a sequence is not frequent, then any sequence that contains it cannot be frequent.
Constraints
Problems:
- Too many frequent sequences.
- Most frequent sequences are not useful.
Solution: remove them. Constraints are a way to define usefulness; the trick is to do so while mining.
Previous Work
- GSP (Srikant and Agrawal, 1996): generation-and-test, Apriori-based approach.
- SPADE (Zaki, 2001): generation-and-test, Apriori-based approach; uses equivalence classes for memory optimization; uses a vertical-format DB.
- PrefixSpan (Pei, 2004): no candidate generation; uses a DB-projection method.
Why a New Algorithm?
- A huge set of candidate sequences / projected DBs is generated.
- Multiple scans of the database are needed.
- Inefficient for mining long sequential patterns.
- No exploitation of domain-specific properties.
- Weak constraint support.
The CAMLS Algorithm
- Constraint-based Apriori algorithm for Mining Long Sequences.
- Designed especially for efficient mining of long sequences.
- Outperforms SPADE and PrefixSpan on both synthetic and real data.
The CAMLS Algorithm
Makes a logical distinction between two types of constraints:
- Intra-event: not time related (e.g. mutually exclusive items).
- Inter-event: addresses the temporal aspect of the data (e.g. values that can or cannot appear one after the other).
Event-wise Constraints
- An event must/must not contain a specific item.
- Two items cannot occur at the same time.
- max_event_length: an event cannot contain more than a fixed number of items.

Sequence-wise Constraints
- max_sequence_length: a sequence cannot contain more than a fixed number of events.
- max_gap: a long time between events dismisses the pattern.
CAMLS Overview
Input (sequence DB) + constraints (minSup, maxGap, …) → Event-wise phase → frequent events + occurrence index → Sequence-wise phase → Output (frequent sequences).
What Do We Get?
The best of both worlds:
- Far fewer candidates are generated.
- Support check is fast.
- Worst case: works like SPADE.
- Tradeoff: uses a bit more memory (for storing the frequent item-sets).
Event-wise Phase
- Input: sequence database and constraints.
- Output: frequent events + occurrence index.
- Uses Apriori or FP-Growth to find frequent itemsets (both with minor modifications).
Event-wise
1. L1 = all frequent items
2. for (k = 2; Lk-1 ≠ ∅; k++) do
   1. generateCandidates(Lk-1)  — if two frequent (k-1)-events have the same prefix, merge them to form a new candidate
   2. Lk = pruneCandidates()  — prune, calculate support counts and create the occurrence index
   3. L = L ∪ Lk
3. end for
Example soon!
Occurrence Index
A compact representation of all occurrences of a sequence.
Structure: a list of sids, each associated with a list of eids.
Example on next slide!
[Diagram: one sequence's occurrence index, e.g. sid1 → (eid1, eid2, eid3), sid2 → (eid4, eid5), sid3 → (eid6, eid7, eid8, eid9).]
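A minimal sketch of the structure for a single event, assuming the DB is stored as (event, eid, sid) rows like the example table (the names `db` and `build_occurrence_index` are illustrative, not the paper's):

```python
def build_occurrence_index(db, event):
    """Occurrence index of one event: a mapping from each sid to the
    list of eids at which the event occurs. `db` is a list of
    (event_items, eid, sid) rows; `event` is a set of items."""
    index = {}
    for items, eid, sid in db:
        if event <= items:  # the event's items are contained in this row
            index.setdefault(sid, []).append(eid)
    return index

# DB from the event-wise example slide:
db = [({"a", "c", "d"}, 0, 1), ({"b", "c", "d"}, 5, 1), ({"b"}, 10, 1),
      ({"a"}, 0, 2), ({"c"}, 4, 2), ({"b", "d"}, 8, 2),
      ({"c", "d", "e"}, 0, 3), ({"e"}, 7, 3), ({"a", "c", "d"}, 11, 3)]
print(build_occurrence_index(db, {"a", "c", "d"}))  # {1: [0], 3: [11]}
```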
Event-wise Example (Using Apriori)
event eid sid
(acd) 0 1
(bcd) 5 1
b 10 1
a 0 2
c 4 2
(bd) 8 2
(cde) 0 3
e 7 3
(acd) 11 3
minSup = 2
All frequent items: a:3, b:2, c:3, d:3
Candidates: (ab), (ac), (ad), (bc), …
Support count: (ac):2, (ad):2, (bd):2, (cd):2
Candidates: (abc), (abd), (acd), …
Support count: (acd):2
No more candidates!
Occurrence index of (acd): sid 1 → eid 0; sid 3 → eid 11.
Sequence-wise Phase
- Input: frequent events + occurrence index, constraints.
- Output: all frequent sequences.
- Similar to GSP's and SPADE's candidate generation phase, except it uses the frequent itemsets as seeds.
Sequence-wise
1. L1 = all frequent 1-sequences
2. for (k = 2; Lk-1 ≠ ∅; k++) do
   1. generateCandidates(Lk-1)
   2. Lk = pruneAndSupCalc()
   3. L = L ∪ Lk
3. end for
Elaboration on the next two slides.
Sequence-wise Candidate Generation
If two frequent k-sequences s' and s'' share a common (k-1)-prefix and s' is a generator, we form a new candidate:
s' = <s'1 s'2 … s'k>
s'' = <s''1 s''2 … s''k>
<s'1 s'2 … s'k-1> = <s''1 s''2 … s''k-1>
new candidate: <s'1 s'2 … s'k s''k>
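The merge rule can be sketched as follows (illustrative only; the generator flag and pruning are handled elsewhere, and the function name is an assumption):

```python
def generate_candidate(s1, s2):
    """Merge two k-sequences sharing a common (k-1)-prefix into one
    (k+1)-candidate: <s'1 ... s'k s''k>. Sequences are lists of events
    (sets of items). Returns None when the prefixes differ."""
    if s1[:-1] != s2[:-1]:
        return None
    return s1 + [s2[-1]]

# <ac> merged with <ab> yields the candidate <acb>:
print(generate_candidate([{"a"}, {"c"}], [{"a"}, {"b"}]))
```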
Sequence-wise Pruning
1. Keep a radix-ordered list of the sequences pruned in the current iteration.
2. Within an iteration, one k-sequence may contain another k-sequence from the same iteration.
3. For each new candidate:
   1. Check for a subsequence in the pruned list: very fast!
   2. Test for frequency.
   3. Add to the pruned list if needed.
Support Calculation
- A simple intersection operation between the occurrence indices of the forming sequences.
- Once the new occurrence index is formed, support calculation is trivial.
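The intersection idea can be sketched like this (a sketch of the idea under assumed names, not the paper's exact procedure): extend a sequence's index by an event's index, keeping only occurrences that come strictly later (and within maxGap, if given); support is then just the number of sids left.

```python
def extend_index(seq_index, event_index, max_gap=None):
    """Intersect the occurrence index of a sequence with that of a
    frequent event to get the index of the extended sequence. Each
    index maps sid -> sorted eids; an extension occurrence is an event
    eid strictly after some occurrence of the sequence (and within
    max_gap of it, when max_gap is set)."""
    new_index = {}
    for sid, seq_eids in seq_index.items():
        if sid not in event_index:
            continue  # the event never occurs in this sequence
        eids = [e for e in event_index[sid]
                if any(e > s and (max_gap is None or e - s <= max_gap)
                       for s in seq_eids)]
        if eids:
            new_index[sid] = eids
    return new_index

def seq_support(index):
    """Support = number of sequences (sids) in which the pattern occurs."""
    return len(index)
```

For example, extending the index of (acd), `{1: [0], 3: [11]}`, by the index of b, `{1: [5, 10], 2: [8]}`, gives `{1: [5, 10]}`, i.e. support 1; with `max_gap=5` only eid 5 survives.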
The maxGap Constraint
maxGap is a special kind of constraint:
- Data dependent.
- The Apriori property is not applicable.
- The occurrence index enables a fast maxGap check.
A frequent sequence that does not satisfy maxGap is flagged as a non-generator.
Example: assume <ab> is frequent but the gap between a and b exceeds maxGap, while <ac> and <cb> are frequent and in <acb> all maxGap constraints hold. Then <ab> is a non-generator, but it is kept in order not to prune <acb>.
Sequence-Wise Example
Original DB:
event  eid  sid
(acd)   0    1
(bcd)   5    1
b      10    1
a       0    2
c       4    2
(bd)    8    2
(cde)   0    3
e       7    3
(acd)  11    3
Event-wise output (minSup = 2, maxGap = 5):
g <(a)> : 3
g <(b)> : 2
g <(c)> : 3
g <(d)> : 3
g <(ac)> : 2
g <(ad)> : 2
g <(bd)> : 2
g <(cd)> : 2
g <(acd)> : 2
Candidate generation
<aa>
<ab>
…
<a(acd)>
<ba>
<bb>
…
<(acd) (acd)>
<aa> is added to the pruned list.
<a(ac)> is a super-sequence of <aa>, therefore it is pruned.
<ab> does not pass maxGap, therefore it is not a generator.
<ab> : 2
g <ac> : 2
<ad> : 2
<a(bd)> : 2
g <cb> : 2
g <cd> : 2
g <c(bd)> : 2
<dc> : 2
<dd> : 2
<d(cd)> : 2
<acb>
<acd>
…
<acb>:2
No more candidates!
Evaluation (1): Machine Anti-Aging
How can sequence mining help?
- Data collected from the machine is a sequence.
- Discover typical behavior leading to failure.
- Monitor the machine and alert before failure.
Domain: light intensity for wavelengths (continuous).
Pre-processing: discretization; meta-features (maxDisc, maxWL, isBurned).
Synm stands for a synthetic database simulating the machine behavior with m meta-features.
Evaluation (2): Real Stock Data Values
Rn stands for stock data (10 different stocks) for n days.
CAMLS Compared with PrefixSpan
CAMLS Compared with Spade and PrefixSpan
So, What's CAMLS' Contribution?
- Constraints distinction: easy implementation.
- Two phases.
- Handling of the maxGap constraint.
- Occurrence index data structure.
- Fast new pruning method.
Future Research
- Main issue: closed sequences.
- More constraints (aspiring to regexp).
Thank You!