Post on 07-Jan-2016
Mining Sequential Patterns: Generalizations and
Performance Improvements
R. Srikant, R. Agrawal
IBM Almaden Research Center
Advisor: Dr. Hsu
Presented by: M.H. Lin
Outline
• Motivation
• Objective
• Introduction
• Problem Statement
• The New Algorithm: GSP
• Performance Evaluation
• Conclusion
• Personal Opinion
Motivation
• The problem of mining sequential patterns was introduced only recently.
• Limitations of AprioriAll [Agrawal, 1995]:
  • Absence of time constraints
  • Rigid definition of a transaction
  • Absence of taxonomies
Objective
• We present GSP, a new algorithm that discovers these generalized sequential patterns.
• We empirically compare the performance of GSP with that of the AprioriAll algorithm.
Introduction
• Instance:
  • A database of sequences, called data-sequences
  • Each sequence is a list of transactions ordered by transaction-time
  • Each transaction is a set of items
• Definitions:
  • A sequential pattern consists of a list of itemsets
  • Support: the number of data-sequences that contain the pattern
• Problem: to discover all the sequential patterns with a user-specified minimum support
Example Of A Sequential Pattern
• Database of a book club: each data-sequence corresponds to all the book selections of a given customer, and each transaction contains the books that customer selected in one order.
• A sequential pattern: 5% of customers bought ‘Foundation’, then ‘Foundation and Empire’ and ‘Ringworld’, then ‘Second Foundation’.
Features of A Sequential Pattern
• E.g.: 5% of customers bought ‘Foundation’, then ‘Foundation and Empire’ and ‘Ringworld’, then ‘Second Foundation’
• Maximum and/or minimum time gaps between adjacent elements
  • E.g.: the time between buying ‘Foundation’ and then buying ‘Foundation and Empire’ and ‘Ringworld’ should be within 3 months
• A sliding time window over the sequence-pattern elements
  • E.g.: with a window of one week and purchases Mo: BK-a, Sa: BK-b, next Su: BK-c, the data-sequence supports the pattern “BK-a” and “BK-b”, then “BK-c”
• User-defined taxonomies
  • Example coming soon…
A User-defined Taxonomy
A customer who bought Foundation, then Perfect Spy, would support the following patterns:
•Foundation, then Perfect Spy
•Asimov, then Perfect Spy
•Science Fiction, then Le Carre
…
The Old Algorithm: AprioriAll
• A 3-phase algorithm
  • Phase 1: finds all frequent itemsets with minimum support
  • Phase 2: transforms the database so that each transaction contains only the frequent itemsets
  • Phase 3: finds sequential patterns
• Pros: can discover all frequent sequential patterns
• Cons:
  • Computationally expensive in both space and time
  • Not feasible to incorporate sliding windows
Problem Statement
• Definitions:
  • Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of literals, called items.
  • Let $T$ be a directed acyclic graph on the literals.
  • An itemset is a non-empty set of items.
  • A sequence is an ordered list of itemsets.
  • We denote a sequence $s$ by $\langle s_1 s_2 \ldots s_n \rangle$, where $s_j$ is an itemset.
  • We denote an element of a sequence by $(x_1, x_2, \ldots, x_m)$, where $x_j$ is an item.
  • A sequence $\langle a_1 a_2 \ldots a_n \rangle$ is a subsequence of another sequence $\langle b_1 b_2 \ldots b_m \rangle$ if there exist integers $i_1 < i_2 < \ldots < i_n$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$.
• E.g.: <(3)(4,5)(8)> is a subsequence of <(7)(3,8)(9)(4,5,6)(8)>
• E.g.: <(3)(5)> is not a subsequence of <(3,5)>
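The subsequence definition and the two examples above can be checked with a short Python sketch. The representation (a sequence as a tuple of item tuples) and the function names `is_subsequence` and `support` are assumptions made for illustration; they are not taken from the slides. Time constraints, sliding windows and taxonomies are ignored here.

```python
def is_subsequence(a, b):
    """Return True if sequence a is a subsequence of sequence b.

    Each element of a must be a subset of a distinct element of b,
    and the matched elements of b must appear in the same order.
    """
    pos = 0
    for elem in a:
        # Greedily take the earliest remaining element of b that contains elem.
        while pos < len(b) and not set(elem) <= set(b[pos]):
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # the next element of a must match strictly later in b
    return True


def support(pattern, database):
    """Number of data-sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, d) for d in database)


# The two examples from the slide.
print(is_subsequence([(3,), (4, 5), (8,)],
                     [(7,), (3, 8), (9,), (4, 5, 6), (8,)]))  # True
print(is_subsequence([(3,), (5,)], [(3, 5)]))                  # False
```

Greedy matching is sufficient in this unconstrained setting, because taking the earliest possible match for one element never rules out a match for a later element.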
Problem Statement (contd.)
A data-sequence contains a sequence s if s is a subsequence of the data-sequence.
• Plus taxonomies: a transaction $T$ contains an item $x \in I$ if $x$ is in $T$ or $x$ is an ancestor of some item in $T$.
• Plus sliding windows: a data-sequence $d = \langle d_1 \ldots d_m \rangle$ contains a sequence $s = \langle s_1 \ldots s_n \rangle$ if there exist integers $l_1 \le u_1 < l_2 \le u_2 < \ldots < l_n \le u_n$ such that
  1. $s_i$ is contained in $\bigcup_{k=l_i}^{u_i} d_k$, $1 \le i \le n$, and
  2. $\text{transaction-time}(d_{u_i}) - \text{transaction-time}(d_{l_i}) \le \text{window-size}$, $1 \le i \le n$.
• Plus time constraints:
  3. $\text{transaction-time}(d_{l_i}) - \text{transaction-time}(d_{u_{i-1}}) > \text{min-gap}$, $2 \le i \le n$, and
  4. $\text{transaction-time}(d_{u_i}) - \text{transaction-time}(d_{l_{i-1}}) \le \text{max-gap}$, $2 \le i \le n$.
Problem Definition
• Input:
  • Database D: data-sequences
  • Taxonomy T: a DAG, not a tree
  • User-specified min-gap and max-gap time constraints
  • A user-specified sliding-window size
  • A user-specified minimum support
• Goal: to find all sequences whose support is greater than the user-specified minimum support
Example
• Minimum support: 2 data-sequences
• With AprioriAll: <(Ringworld)(Ringworld Engineers)>
• A sliding window of 7 days adds the pattern <(Foundation, Ringworld)(Ringworld Engineers)>
• With a max-gap of 30 days, both patterns are dropped
• Adding the taxonomy, with no sliding window or time constraints, adds one pattern: <(Foundation)(Asimov)>
GSP: Basic Structure
• Phase 1: makes the first pass over the database to yield all the 1-element frequent sequences
• Phase 2: the kth pass
  • Starts with the seed set found in the (k-1)th pass and generates candidate sequences, each of which has one more item than a seed sequence
  • Makes a new pass over D to find the support of these candidate sequences
  • The frequent candidates become the seed for the next pass
• Phase 3: terminates when no more frequent sequences are found or no candidate sequences are generated
GSP: Implementation
• Generating candidates: to generate as few candidates as possible while maintaining completeness
• Counting candidates: to determine each candidate sequence’s support
• Implementing taxonomies
Candidate Generation
• Definitions:
  • k-sequence: a sequence with k items
  • $L_k$: the set of frequent k-sequences
  • $C_k$: the set of candidate k-sequences
• Goal: given the set of all frequent (k-1)-sequences, generate a candidate set that is a superset of all frequent k-sequences
• Algorithm:
  • Join phase: join $L_{k-1}$ with $L_{k-1}$; s1 joins with s2 if the subsequence obtained by dropping the first item of s1 is the same as the subsequence obtained by dropping the last item of s2
  • Prune phase: delete candidate sequences that have a contiguous (k-1)-subsequence whose support count is less than the minimum support
Candidate Generation: Example
• Join phase:
  • <(1,2)(3)> joins with <(2)(3,4)> => <(1,2)(3,4)>
  • <(1,2)(3)> joins with <(2)(3)(5)> => <(1,2)(3)(5)>
• Prune phase: <(1,2)(3)(5)> is dropped because its contiguous subsequence <(1)(3)(5)> is not in L3
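The join and prune steps can be sketched in Python as follows. The helper names (`drop_item`, `join`, `contiguous_subsequences`, `generate_candidates`) and the tuple-of-tuples representation are assumptions for illustration. The set `L3` is also an assumption: it contains the three sequences named on the slide plus three more, chosen so that the result matches the slide's example. The sketch omits the special case of joining L1 with L1, where the new item must be added both as part of the last element and as a separate element.

```python
from itertools import product


def drop_item(seq, e, i):
    """Remove item i of element e; an element left empty is removed too."""
    elem = seq[e][:i] + seq[e][i + 1:]
    return seq[:e] + ((elem,) if elem else ()) + seq[e + 1:]


def join(s1, s2):
    """Join s1 with s2, or return None if they do not join."""
    if drop_item(s1, 0, 0) != drop_item(s2, len(s2) - 1, len(s2[-1]) - 1):
        return None
    last = s2[-1][-1]
    if len(s2[-1]) == 1:
        # The last item of s2 was a separate element: append a new element.
        return s1 + ((last,),)
    # Otherwise the item joins the last element of s1.
    return s1[:-1] + (tuple(sorted(s1[-1] + (last,))),)


def contiguous_subsequences(seq):
    """Yield the contiguous subsequences of seq that have one item fewer.

    An item may be dropped from the first or last element, or from any
    element that has at least two items.
    """
    n = len(seq)
    for e, elem in enumerate(seq):
        if e in (0, n - 1) or len(elem) >= 2:
            for i in range(len(elem)):
                yield drop_item(seq, e, i)


def generate_candidates(frequent):
    """Join phase followed by prune phase."""
    frequent = set(frequent)
    joined = set()
    for s1, s2 in product(frequent, repeat=2):
        candidate = join(s1, s2)
        if candidate is not None:
            joined.add(candidate)
    return {c for c in joined
            if all(sub in frequent for sub in contiguous_subsequences(c))}


# Assumed frequent 3-sequences, consistent with the slide's example.
L3 = {((1, 2), (3,)), ((1, 2), (4,)), ((1,), (3, 4)),
      ((1, 3), (5,)), ((2,), (3, 4)), ((2,), (3,), (5,))}

print(generate_candidates(L3))  # {((1, 2), (3, 4))}
```

The candidate <(1,2)(3)(5)> is produced by the join but removed by the prune step, since <(1)(3)(5)> is missing from `L3`.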
Counting Candidates
Problem: given a set of candidate sequences C and a data sequence d, find all sequences in C that are contained in d.
• Two techniques are used:
  • Hash-tree data structure: to reduce the number of candidates in C that need to be checked
  • Transformation of the representation of the data-sequence d: to efficiently find whether a specific candidate is a subsequence of d
Hash-Tree Structure
• Purpose: reducing the number of candidates
• Leaf node: a list of sequences
• Interior node: a hash table
• Operations:
  • Adding candidate sequences to the hash-tree
  • Finding the candidates contained in a data-sequence
    • Min-gap
    • Max-gap
    • Sliding window size
Representation Transformation
• Purpose: to efficiently find the first occurrence of an element
• Transform the data-sequences into transaction-links, where each link is identified by one item
• E.g.: max-gap = 30, min-gap = 5, window-size = 0, <(1,2)(3)(4)>
• E.g.: window-size = 7, find (2,6) after time = 20
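One way to read the transaction-link idea is as a map from each item to the sorted list of times at which it occurs. The Python sketch below uses that reading to find the first occurrence of an element within a sliding window. The function names, the treatment of the time bound as inclusive ("at or after"), and the sample data-sequence are all assumptions for illustration; the slide's own figure is not available here.

```python
from bisect import bisect_left


def build_links(d):
    """Map each item to the sorted list of its transaction times.

    d: list of (transaction_time, items), sorted by time.
    """
    links = {}
    for t, items in d:
        for x in items:
            links.setdefault(x, []).append(t)
    return links


def find_element(links, element, t, window):
    """Find the first occurrence of element at or after time t.

    Returns (start_time, end_time) such that every item of the element
    occurs within [start_time, end_time] and end - start <= window,
    or None if there is no such occurrence.
    """
    while True:
        firsts = []
        for x in element:
            times = links.get(x, [])
            k = bisect_left(times, t)  # first time >= t
            if k == len(times):
                return None
            firsts.append(times[k])
        start, end = min(firsts), max(firsts)
        if end - start <= window:
            return start, end
        # Too spread out: any valid occurrence must start at or after
        # end - window, so move the lower bound forward and retry.
        t = end - window


# Hypothetical data-sequence, made up for this sketch.
d = [(10, {1, 2}), (25, {4, 6}), (45, {3}), (50, {1, 2}),
     (65, {3}), (90, {2, 4}), (95, {6})]
links = build_links(d)

print(find_element(links, (2, 6), 20, 7))  # (90, 95)
```

On this sample, the search first sees item 2 at time 50 and item 6 at time 25, moves the bound to 43, then sees 50 and 95, moves the bound to 88, and finally finds 90 and 95, which fit inside the 7-day window.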
Implementing Taxonomies
• Basic idea: replace each data-sequence d with an “extended sequence” d’, where each transaction di’ contains all the items in the corresponding transaction di, as well as all their ancestors.
  • E.g.: <(Foundation, Ringworld)(Second Foundation)> => <(Foundation, Ringworld, Asimov, Niven, Science Fiction)(Second Foundation, Asimov, Science Fiction)>
• Optimizations:
  • Pre-compute the ancestors of each item, and drop infrequent ancestors before a new pass
  • Do not count patterns with an element that contains both an item x and its ancestor y
• Problem: redundancy
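The extended-sequence idea can be sketched in a few lines of Python. The taxonomy is stored as a map from each item to its direct parents, which allows a DAG as well as a tree. The names `ancestors` and `extend_sequence` are assumptions for illustration, and the taxonomy edges are inferred from the slide's example.

```python
def ancestors(item, parents):
    """All ancestors of item in the taxonomy (a DAG given as item -> parents)."""
    seen = set()
    stack = list(parents.get(item, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, ()))
    return seen


def extend_sequence(d, parents):
    """Add to every transaction the ancestors of all its items."""
    return [set(t).union(*(ancestors(x, parents) for x in t)) for t in d]


# Taxonomy edges inferred from the slide's example (item -> direct parents).
parents = {
    "Foundation": ["Asimov"],
    "Second Foundation": ["Asimov"],
    "Ringworld": ["Niven"],
    "Asimov": ["Science Fiction"],
    "Niven": ["Science Fiction"],
}

d = [{"Foundation", "Ringworld"}, {"Second Foundation"}]
for transaction in extend_sequence(d, parents):
    print(sorted(transaction))
```

In practice the ancestor sets would be pre-computed once per item rather than recomputed for every transaction, in line with the first optimization listed on the slide.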
Performance Evaluation
• Comparison of GSP and AprioriAll
  • Result: GSP is 2 to 20 times faster
  • Contributing factors:
    • Fewer candidates
    • Directly finding the candidates
• Scale-up: GSP scales linearly with the number of data-sequences
• Effects of time constraints and sliding windows: there was no performance degradation
Experiment Result
Conclusion
• GSP is a generalized sequence-mining algorithm
  • Discovers all the sequential patterns
  • Good customizability
• It has been incorporated into IBM’s data mining product
Personal Opinion
• Hash-tree structure: main-memory limitation
• Multiple passes over the database
• Apply GSP to CIS data