Constraint Mining of Frequent Patterns in Long Sequences
Presented by Yaron Gonen
Outline
- Introduction
- Problem definition and motivation
- Previous work
- The CAMLS Algorithm: overview, main contributions, results
- Future work
Frequent Item-sets: The Market-Basket Model
- A set of items, e.g., stuff sold in a supermarket.
- A set of baskets (later called events or transactions), each of which is a small set of the items, e.g., the things one customer buys on one day.

Support
- Support for item-set I = the number of baskets containing all items in I (usually given as a percentage).
- Given a support threshold minSup, sets of items that appear in at least minSup baskets are called frequent item-sets.
- Simplest question: find all frequent item-sets.
Example
Items:
Minimum support = 0.6 (2 baskets)
Application (1)
- Items: products at a supermarket.
- Baskets: set of products a customer bought at one time.
- Example: many people buy beer and diapers together.
  - Place beer next to diapers to increase both sales.
  - Run a sale on diapers and raise the price of beer.
Application (2) (Counter-Intuitive)
- Items: species of plants.
- Baskets: each basket represents an attribute; a basket contains the items (plants) that have that attribute.
- Frequent sets may indicate similarity between plants.
Scale of the Problem
- Costco sells more than 120k different items and has 57m members (from Wikipedia).
- Botany has identified about 350k extant species of plants.
The Naïve Algorithm
- Generate all possible itemsets.
- Check their support.
[Diagram: enumeration of all possible itemsets over the item set I.]
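The naïve approach can be sketched as follows: enumerate every subset of the items and count how many baskets contain it. This is a minimal illustration with assumed names (`baskets`, `min_sup` as an absolute count), not code from the slides; it is exponential in the number of items, which is exactly what motivates Apriori.

```python
from itertools import combinations

def naive_frequent_itemsets(baskets, min_sup):
    """Enumerate every possible itemset and count its support.

    Exponential in the number of distinct items -- shown only to
    motivate Apriori. `baskets` is a list of sets of items; `min_sup`
    is an absolute support threshold.
    """
    items = sorted(set().union(*baskets))
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            cset = set(candidate)
            # support = number of baskets containing all items of the candidate
            support = sum(1 for b in baskets if cset <= b)
            if support >= min_sup:
                frequent[candidate] = support
    return frequent

baskets = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}]
print(naive_frequent_itemsets(baskets, 2))
# {('a',): 3, ('c',): 2, ('a', 'c'): 2}
```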
The Apriori Property
All nonempty subsets of a frequent itemset must also be frequent.
The Apriori Algorithm
1. Find frequent 1-itemsets.
2. Merge and prune to generate candidates of the next size (here is where the Apriori property is used).
3. Any candidates left? If no: end.
4. If yes: go through the whole DB to count support; candidates whose support passes minSup become frequent itemsets. Go back to step 2.
The DB is scanned once per level, i.e. as many times as the length of the largest frequent itemset.
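The level-wise loop above can be sketched in a few lines. This is a generic Apriori sketch under assumed names (`baskets`, `min_sup`), not the slides' exact code; note the single DB scan per candidate size.

```python
from itertools import combinations

def apriori(baskets, min_sup):
    """Level-wise Apriori sketch: merge, prune by the Apriori property,
    then one full DB pass per level to count support."""
    # L1: frequent 1-itemsets
    counts = {}
    for b in baskets:
        for item in b:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s for s, c in counts.items() if c >= min_sup}
    all_frequent = set(current)
    k = 2
    while current:
        # merge step: unions of frequent (k-1)-itemsets that have size k
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # prune step (Apriori property): every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # one DB scan to count support of the surviving candidates
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        current = {c for c, n in counts.items() if n >= min_sup}
        all_frequent |= current
        k += 1
    return all_frequent
```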
Vertical Format
- Index on items: each item maps to the list of baskets (transaction ids) that contain it.
- Calculating support is fast.
[Diagram: per-item lists of transaction ids.]
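The vertical format idea can be shown with a small sketch (assumed names, not from the slides): build a tid-list per item, then support of any itemset is just the size of an intersection, with no DB scan.

```python
def build_tidlists(baskets):
    """Vertical format: map each item to the set of basket ids containing it."""
    tidlists = {}
    for tid, basket in enumerate(baskets):
        for item in basket:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

def support(itemset, tidlists):
    """Support of an itemset = size of the intersection of its items' tid-lists."""
    ids = [tidlists[i] for i in itemset]
    return len(set.intersection(*ids))
```

For example, with baskets `[{"a","b"}, {"a","c"}, {"b","c"}]`, the tid-list of `a` is `{0, 1}` and `support(["b", "c"], ...)` intersects `{0, 2}` with `{1, 2}` to give 1.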
Frequent Sequences: Taking it to the Next Level
A large set of sequences, each of which is a time-ordered list of events (baskets), e.g., all the stuff a single customer buys over time.

Support
- Subsequence: a sequence whose events are all subsets of another sequence's events, in the same order (but not necessarily consecutive).
- Support for subsequence s = the number of sequences containing s (usually given as a percentage).
- Given a support threshold minSup, subsequences that appear in at least minSup sequences are called frequent subsequences.
- Simplest question: find all frequent subsequences.
Notations
- Items are letters: a, b, …
- Events are parenthesized: (ab), (bdf), … except for events with a single item.
- Sequences are surrounded by <…>.
- Every sequence has an identifier sid.
Example
sid sequence
1 <a(abc)(ac)d(cf)>
2 <(ad)c(bc)(ae)>
3 <(ef)(ab)(df)cb>
4 <eg(af)cbc>
Frequent subsequences (minSup = 0.5):
<a>:4
<(a)(a)>:3
<(a)(c)>:4
<(a)(bc)>:2
<(e)(a)(c)>:2
…
Motivation
- Customer shopping patterns
- Stock market fluctuation
- Weblog click-stream analysis
- Symptoms of a disease
- DNA sequence analysis
- Weather forecast
- Machine anti-aging
- Many more…
Much Harder than Frequent Item-sets!
2^(m·n) possible candidates, where m is the number of items and n is the number of transactions in the longest sequence.
The Apriori Property
If a sequence is not frequent, then any sequence that contains it cannot be frequent.
Constraints
Problems:
- Too many frequent sequences.
- Most frequent sequences are not useful.
Solution: remove them. Constraints are a way to define usefulness; the trick is to do so while mining.
Previous Work
- GSP (Srikant and Agrawal, 1996): generation-and-test, Apriori-based approach.
- SPADE (Zaki, 2001): generation-and-test, Apriori-based approach; uses equivalence classes for memory optimization; uses a vertical-format DB.
- PrefixSpan (Pei, 2004): no candidate generation; uses a DB-projection method.
Why a New Algorithm?
- A huge set of candidate sequences / projected DBs is generated.
- Multiple scans of the database are needed.
- Inefficient for mining long sequential patterns.
- No exploitation of domain-specific properties.
- Weak constraint support.
The CAMLS Algorithm
- Constraint-based Apriori algorithm for Mining Long Sequences.
- Designed especially for efficient mining of long sequences.
- Outperforms SPADE and PrefixSpan on both synthetic and real data.
The CAMLS Algorithm
Makes a logical distinction between two types of constraints:
- Intra-event: not time related (e.g. mutually exclusive items).
- Inter-event: addresses the temporal aspect of the data (e.g. values that can or cannot appear one after the other).
Event-wise Constraints
- An event must/must not contain a specific item.
- Two items cannot occur at the same time.
- max_event_length: an event cannot contain more than a fixed number of items.

Sequence-wise Constraints
- max_sequence_length: a sequence cannot contain more than a fixed number of events.
- max_gap: a long time between events dismisses the pattern.
CAMLS Overview
Input (sequence DB) + constraints (minSup, maxGap, …) → Event-wise phase → frequent events + occurrence index → Sequence-wise phase → Output (frequent sequences).
What Do We Get?
The best of both worlds:
- Far fewer candidates are generated.
- Support check is fast.
- Worst case: works like SPADE.
- Tradeoff: uses a bit more memory (for storing the frequent item-sets).
Event-wise Phase
- Input: sequence database and constraints.
- Output: frequent events + occurrence index.
- Uses Apriori or FP-Growth to find frequent itemsets (both with minor modifications).
Event-wise
1. L1 = all frequent items
2. for (k = 2; Lk-1 ≠ ∅; k++) do
   1. generateCandidates(Lk-1)  — if two frequent (k-1)-events have the same prefix, merge them to form a new candidate
   2. Lk = pruneCandidates()  — prune, calculate support counts and create the occurrence index
   3. L = L ∪ Lk
3. end for
Example soon!
Occurrence Index
A compact representation of all occurrences of a sequence.
Structure: a list of sids, each associated with a list of eids.
Example on next slide!
[Diagram: one sequence's occurrence index, e.g. sid1 → (eid1, eid2, eid3), sid2 → (eid4, eid5), sid3 → (eid6, eid7, eid8, eid9).]
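A minimal sketch of the structure for a single event, assuming the DB is stored as (event, eid, sid) rows like the example table (the names `db` and `build_occurrence_index` are illustrative, not the paper's):

```python
def build_occurrence_index(db, event):
    """Occurrence index of one event: a mapping from each sid to the
    list of eids at which the event occurs. `db` is a list of
    (event_items, eid, sid) rows; `event` is a set of items."""
    index = {}
    for items, eid, sid in db:
        if event <= items:  # the event's items are contained in this row
            index.setdefault(sid, []).append(eid)
    return index

# DB from the event-wise example slide:
db = [({"a", "c", "d"}, 0, 1), ({"b", "c", "d"}, 5, 1), ({"b"}, 10, 1),
      ({"a"}, 0, 2), ({"c"}, 4, 2), ({"b", "d"}, 8, 2),
      ({"c", "d", "e"}, 0, 3), ({"e"}, 7, 3), ({"a", "c", "d"}, 11, 3)]
print(build_occurrence_index(db, {"a", "c", "d"}))  # {1: [0], 3: [11]}
```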
Event-wise Example (Using Apriori)
event eid sid
(acd) 0 1
(bcd) 5 1
b 10 1
a 0 2
c 4 2
(bd) 8 2
(cde) 0 3
e 7 3
(acd) 11 3
minSup = 2
All frequent items: a:3, b:2, c:3, d:3
Candidates: (ab), (ac), (ad), (bc), …
Support count: (ac):2, (ad):2, (bd):2, (cd):2
Candidates: (abc), (abd), (acd), …
Support count: (acd):2
No more candidates!
Occurrence index of (acd): sid 1 → eid 0; sid 3 → eid 11.
Sequence-wise Phase
- Input: frequent events + occurrence index, constraints.
- Output: all frequent sequences.
- Similar to GSP's and SPADE's candidate generation phase, except it uses the frequent itemsets as seeds.
Sequence-wise
1. L1 = all frequent 1-sequences
2. for (k = 2; Lk-1 ≠ ∅; k++) do
   1. generateCandidates(Lk-1)
   2. Lk = pruneAndSupCalc()
   3. L = L ∪ Lk
3. end for
Elaboration on the next two slides.
Sequence-wise Candidate Generation
If two frequent k-sequences s' and s'' share a common (k-1)-prefix and s' is a generator, we form a new candidate:
s' = <s'1 s'2 … s'k>
s'' = <s''1 s''2 … s''k>
<s'1 s'2 … s'k-1> = <s''1 s''2 … s''k-1>
new candidate: <s'1 s'2 … s'k s''k>
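The merge rule can be sketched as follows (illustrative only; the generator flag and pruning are handled elsewhere, and the function name is an assumption):

```python
def generate_candidate(s1, s2):
    """Merge two k-sequences sharing a common (k-1)-prefix into one
    (k+1)-candidate: <s'1 ... s'k s''k>. Sequences are lists of events
    (sets of items). Returns None when the prefixes differ."""
    if s1[:-1] != s2[:-1]:
        return None
    return s1 + [s2[-1]]

# <ac> merged with <ab> yields the candidate <acb>:
print(generate_candidate([{"a"}, {"c"}], [{"a"}, {"b"}]))
```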
Sequence-wise Pruning
1. Keep a radix-ordered list of the sequences pruned in the current iteration.
2. Within an iteration, one k-sequence may contain another k-sequence from the same iteration.
3. For each new candidate:
   1. Check for a subsequence in the pruned list: very fast!
   2. Test for frequency.
   3. Add to the pruned list if needed.
Support Calculation
- A simple intersection operation between the occurrence indices of the forming sequences.
- Once the new occurrence index is formed, support calculation is trivial.
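The intersection idea can be sketched like this (a sketch of the idea under assumed names, not the paper's exact procedure): extend a sequence's index by an event's index, keeping only occurrences that come strictly later (and within maxGap, if given); support is then just the number of sids left.

```python
def extend_index(seq_index, event_index, max_gap=None):
    """Intersect the occurrence index of a sequence with that of a
    frequent event to get the index of the extended sequence. Each
    index maps sid -> sorted eids; an extension occurrence is an event
    eid strictly after some occurrence of the sequence (and within
    max_gap of it, when max_gap is set)."""
    new_index = {}
    for sid, seq_eids in seq_index.items():
        if sid not in event_index:
            continue  # the event never occurs in this sequence
        eids = [e for e in event_index[sid]
                if any(e > s and (max_gap is None or e - s <= max_gap)
                       for s in seq_eids)]
        if eids:
            new_index[sid] = eids
    return new_index

def seq_support(index):
    """Support = number of sequences (sids) in which the pattern occurs."""
    return len(index)
```

For example, extending the index of (acd), `{1: [0], 3: [11]}`, by the index of b, `{1: [5, 10], 2: [8]}`, gives `{1: [5, 10]}`, i.e. support 1; with `max_gap=5` only eid 5 survives.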
The maxGap Constraint
maxGap is a special kind of constraint:
- Data dependent.
- The Apriori property is not applicable.
- The occurrence index enables a fast maxGap check.
A frequent sequence that does not satisfy maxGap is flagged as a non-generator.
Example: assume <ab> is frequent but the gap between a and b exceeds maxGap, while <ac> and <cb> are frequent and in <acb> all maxGap constraints hold. Then <ab> is a non-generator, but it is kept in order not to prune <acb>.
Sequence-Wise Example
Original DB:
event  eid  sid
(acd)   0    1
(bcd)   5    1
b      10    1
a       0    2
c       4    2
(bd)    8    2
(cde)   0    3
e       7    3
(acd)  11    3
Event-wise output (minSup = 2, maxGap = 5):
g <(a)> : 3
g <(b)> : 2
g <(c)> : 3
g <(d)> : 3
g <(ac)> : 2
g <(ad)> : 2
g <(bd)> : 2
g <(cd)> : 2
g <(acd)> : 2
Candidate generation
<aa>
<ab>
…
<a(acd)>
<ba>
<bb>
…
<(acd) (acd)>
<aa> is added to the pruned list.
<a(ac)> is a super-sequence of <aa>, therefore it is pruned.
<ab> does not pass maxGap, therefore it is not a generator.
<ab> : 2
g <ac> : 2
<ad> : 2
<a(bd)> : 2
g <cb> : 2
g <cd> : 2
g <c(bd)> : 2
<dc> : 2
<dd> : 2
<d(cd)> : 2
<acb>
<acd>
…
<acb>:2
No more candidates!
Evaluation (1): Machine Anti-Aging
How can sequence mining help?
- Data collected from the machine is a sequence.
- Discover typical behavior leading to failure.
- Monitor the machine and alert before failure.
Domain: light intensity for wavelengths (continuous).
Pre-processing: discretization; meta-features (maxDisc, maxWL, isBurned).
Synm stands for a synthetic database simulating the machine behavior with m meta-features.
Evaluation (2): Real Stock Data Values
Rn stands for stock data (10 different stocks) for n days.
CAMLS Compared with PrefixSpan
CAMLS Compared with Spade and PrefixSpan
So, What's CAMLS' Contribution?
- Constraints distinction: easy implementation.
- Two phases.
- Handling of the maxGap constraint.
- Occurrence index data structure.
- Fast new pruning method.
Future Research
- Main issue: closed sequences.
- More constraints (aspiring to regexp).
Thank You!