Post on 28-Dec-2015
Data Mining Techniques Sequential Patterns
Sequential Pattern Mining
• Progress in bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data, referred to as basket data
• A record in such data typically consists of the transaction date and the items bought in the transaction
• Very often, data records also contain a customer-id, particularly when the purchase was made with a credit card or a frequent-buyer card
• Catalog companies also collect such data from the orders they receive
Sequential Pattern Mining
• An example of such a pattern is that customers typically rent “Star Wars”, then “Empire Strikes Back”, and then “Return of the Jedi”
• These rentals need not be consecutive
– Customers who rent some other videos in between also support this sequential pattern
• Elements of a sequential pattern need not be simple items
– “Computer Science and Programming Language”, followed by “Data Structure”, followed by “System Programs and Operating Systems” is an example of a sequential pattern in which the elements are sets of items
Sequential Pattern Mining
• Given: Transaction Time, Customer Id, Items Bought
[Figure: Original Database; Answer Set]
Definition
• The length of a sequence is the number of itemsets in the sequence
• A sequence of length k is called a k-sequence
• The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction
• The itemset i and the 1-sequence <i> have the same support
• An itemset with minimum support is called a large (frequent) itemset or litemset
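As a minimal sketch of the support definition above, the following counts the fraction of customers who bought every item of an itemset in a single transaction. The function name `itemset_support` and the toy customer sequences are assumptions made for illustration; they are not taken from the slides.

```python
def itemset_support(customer_sequences, itemset):
    """Fraction of customers with at least one transaction containing every item of `itemset`."""
    target = set(itemset)
    supporters = sum(
        1
        for sequence in customer_sequences
        if any(target <= set(transaction) for transaction in sequence)
    )
    return supporters / len(customer_sequences)


# Hypothetical toy database: one list of transactions (item sets) per customer.
toy_db = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]

print(itemset_support(toy_db, {40, 70}))  # supported by 2 of the 5 customers
print(itemset_support(toy_db, {30, 70}))  # only the third customer bought both together
```

Because a customer is counted at most once, buying the same itemset in several transactions does not raise its support.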
AprioriAll Algorithm
• Each itemset in a large sequence must have minimum support
• Any large sequence must be a list of litemsets
• Finding all sequential patterns in five phases
– Sort Phase
– Litemset Phase
– Transformation Phase
– Sequence Phase
– Maximal Phase
AprioriAll Algorithm: Sort Phase
Customer-Sequence Version of the Database
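One way to picture the sort phase is the sketch below: transactions are sorted with customer id as the major key and transaction time as the minor key, then grouped into one sequence per customer. The row layout `(customer id, transaction time, items)`, the function name `to_customer_sequences`, and the sample rows are all assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter


def to_customer_sequences(transactions):
    """Sort by (customer id, transaction time), then group into one sequence per customer."""
    ordered = sorted(transactions, key=itemgetter(0, 1))
    return {
        customer_id: [set(items) for _, _, items in group]
        for customer_id, group in groupby(ordered, key=itemgetter(0))
    }


# Hypothetical rows: (customer id, transaction time, items bought).
rows = [
    (2, 3, [40, 60, 70]),
    (1, 1, [30]),
    (2, 1, [10, 20]),
    (1, 2, [90]),
    (2, 2, [30]),
]

for customer_id, sequence in to_customer_sequences(rows).items():
    print(customer_id, sequence)
```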
AprioriAll Algorithm: Litemset Phase
Apriori/DHP
FP Growth
min_sup_count=2
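The slide names Apriori/DHP and FP Growth for this phase. The sketch below does not implement any of them; it swaps in a plain brute-force enumeration of every subset of every transaction, which is only workable on toy data, to show what the phase computes: itemsets supported by at least `min_sup_count` customers, each customer counted once. The function name and the toy database are assumptions.

```python
from collections import Counter
from itertools import combinations


def find_litemsets(customer_sequences, min_sup_count):
    """Return itemsets bought in a single transaction by at least `min_sup_count` customers."""
    counts = Counter()
    for sequence in customer_sequences:
        seen = set()  # each customer counts at most once per itemset
        for transaction in sequence:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                seen.update(combinations(items, size))
        counts.update(seen)
    return sorted(
        (itemset for itemset, count in counts.items() if count >= min_sup_count),
        key=lambda itemset: (len(itemset), itemset),
    )


# Hypothetical toy database: one list of transactions per customer.
toy_db = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]

print(find_litemsets(toy_db, min_sup_count=2))
```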
AprioriAll Algorithm: Transformation Phase
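The transformation phase can be sketched roughly as follows, assuming it replaces each transaction by the set of litemsets it contains and drops transactions that contain none. The integer ids assigned to litemsets, the function name `transform`, and the toy data are assumptions for illustration.

```python
def transform(customer_sequences, litemsets):
    """Replace each transaction by the ids of the litemsets it contains."""
    ids = {frozenset(litemset): i for i, litemset in enumerate(litemsets, start=1)}
    transformed = []
    for sequence in customer_sequences:
        new_sequence = []
        for transaction in sequence:
            contained = {i for litemset, i in ids.items() if litemset <= set(transaction)}
            if contained:  # transactions without any litemset are dropped
                new_sequence.append(contained)
        # Customers left with an empty sequence are dropped here as well;
        # they should still be counted in the total when computing support.
        if new_sequence:
            transformed.append(new_sequence)
    return transformed


# Hypothetical toy database and litemsets (ids 1..5 in list order).
toy_db = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]
toy_litemsets = [{30}, {40}, {70}, {40, 70}, {90}]

for sequence in transform(toy_db, toy_litemsets):
    print(sequence)
```

After this step, checking whether a customer supports a candidate sequence only requires comparing litemset ids, not re-testing item subsets.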
AprioriAll Algorithm: Sequence Phase
[Figure: Customer Sequences; Large 1-Sequences; Large 2-Sequences; Large 3-Sequences; Large 4-Sequences; Maximal Large Sequences]
Sequence Phase: Candidate Generation
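A hedged sketch of Apriori-style candidate generation over transformed sequences (tuples of litemset ids) follows. It assumes the usual join-then-prune scheme: two large (k-1)-sequences that agree on their first k-2 elements are joined, and a candidate is kept only if every (k-1)-subsequence is itself large. The function name, the choice to allow a sequence to join with itself, and the sample input are assumptions.

```python
from itertools import combinations


def generate_candidates(large_prev):
    """Build candidate k-sequences from the set of large (k-1)-sequences."""
    large_prev = set(large_prev)
    # Join step: p and q share their first k-2 elements; append q's last element to p.
    joined = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1]
    }
    # Prune step: every subsequence obtained by deleting one element must be large.
    return sorted(
        candidate
        for candidate in joined
        if all(sub in large_prev for sub in combinations(candidate, len(candidate) - 1))
    )


# Hypothetical large 3-sequences over litemset ids.
large_3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
print(generate_candidates(large_3))
```

In this sample the join produces several 4-sequences, but only `(1, 2, 3, 4)` survives pruning, because every other candidate has a 3-subsequence that is not in the large set.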
AprioriAll Algorithm: Maximal Phase
• The sequence <(3) (4 5) (8)> is contained in <(7) (3 8) (9) (4 5 6) (8)>, since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8)
• The sequence <(3) (5)> is not contained in <(3 5)> (and vice versa)
– The former represents items 3 and 5 being bought one after the other
– The latter represents items 3 and 5 being bought together
• In a set of sequences, a sequence s is maximal if s is not contained in any other sequence.
AprioriAll Algorithm
• With minimum support set to 25%, i.e., a minimum support of 2 customers
– <(30) (90)> and <(30) (40 70)> are maximal
– <(10 20) (30)>, which is supported only by customer 2, does not have minimum support
– <(30)>, <(40)>, <(70)>, <(90)>, <(30) (40)>, <(30) (70)> and <(40 70)>, though having minimum support, are not in the answer because they are not maximal
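The containment test and the maximal filter can be sketched as below. Sequences are written as tuples of item tuples, and the function names are assumptions. Containment is checked by greedy left-to-right matching: each element of the shorter sequence must be a subset of some later element of the longer one, in order. The list `large` holds the sequences this example states as having minimum support.

```python
def is_contained(a, b):
    """True if every element of `a` is a subset of a distinct element of `b`, in order."""
    pos = 0
    for element in a:
        while pos < len(b) and not set(element) <= set(b[pos]):
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # the next element of `a` must match strictly later in `b`
    return True


def maximal_sequences(sequences):
    """Keep only sequences not contained in any other sequence of the list."""
    return [
        s for s in sequences
        if not any(s != t and is_contained(s, t) for t in sequences)
    ]


print(is_contained(((3,), (4, 5), (8,)), ((7,), (3, 8), (9,), (4, 5, 6), (8,))))  # True
print(is_contained(((3,), (5,)), ((3, 5),)))  # False

# Sequences with minimum support, as listed in the example above.
large = [
    ((30,),), ((40,),), ((70,),), ((90,),),
    ((30,), (40,)), ((30,), (70,)), ((40, 70),),
    ((30,), (90,)), ((30,), (40, 70)),
]
print(maximal_sequences(large))
```

The filter returns the two maximal sequences named in the example, <(30) (90)> and <(30) (40 70)>.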
Answer Set
Summary
Discussions
• The AprioriAll algorithm will generate a huge set of candidate sequences
– If there are 1000 frequent sequences of length 1, the algorithm will generate 1000 × 1000 + (1000 × 999) / 2 = 1,499,500 candidate sequences of length 2
• Many database scans during mining
• Difficulty in mining long sequential patterns
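The candidate count quoted above can be checked with a few lines. The slide gives only the arithmetic; reading the first term as ordered two-element sequences <x y> (x may equal y) and the second as single-element sequences <(x y)> with two distinct items is an interpretation, and the function name is hypothetical.

```python
def candidate_2_sequence_count(n):
    """Number of length-2 candidates built from n frequent length-1 sequences."""
    two_elements = n * n            # <x y>: ordered pairs, x may equal y
    one_element = n * (n - 1) // 2  # <(x y)>: unordered pairs of distinct items
    return two_elements + one_element


print(candidate_2_sequence_count(1000))  # 1499500
```

The count grows quadratically in the number of frequent items, which is why candidate generation dominates the cost on large item vocabularies.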
Research Topics
• Time-Interval Sequential Patterns
• Time-Gap Sequential Patterns
• Non-redundant Sequential Patterns
• Constrained Sequential Pattern Mining
• Multi-dimensional Sequential Patterns
• Generalized Sequential Patterns
• Incremental Mining Sequential Patterns
• Data Stream Sequential Pattern Mining
• Interactive Mining Sequential Patterns
Exercise 6
A Sequence Database (min-sup = 50%)
SID   Customer sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>