“Sequential Sequence Mining (SSM) Technique
in Large Data base”
A Thesis submitted in partial fulfillment of the requirements for the
award of the degree of Doctor of Philosophy in the Faculty of
Engineering & Technology
Research Guide: Dr. J. S. Shah, M.E, PhD
Principal, Class-I, Govt. Engg. College, Patan
At & Po: Katpur

Research Scholar: Kiran R. Amin, B.E (Computer), M.E (Computer)
Asso. Prof. & Head (CE), U.V. Patel College of Engg., Ganpat Vidyanagar
Reg. No: EN/02/002/07
U. V. PATEL COLLEGE OF ENGINEERING
GANPAT UNIVERSITY
DECEMBER- 2012
© Copyright 2012
By
Kiran R. Amin
All Rights Reserved
CERTIFICATE
This is to certify that the thesis entitled “Sequential Sequence Mining (SSM)
Technique in Large Data base” submitted by Kirankumar Ramchandra Amin of U. V.
Patel College of Engineering is the bona fide work completed under my supervision and
guidance for the award of Degree of Doctor of Philosophy in the Faculty of Engineering &
Technology, Ganpat University, Ganpat Vidyanagar. The experimental work included in
the thesis was carried out at the Department of Computer Engineering, U. V. Patel College
of Engineering under my supervision and the work is up to my satisfaction.
Research Guide
Prof. (Dr.) J. S. Shah
M.E, PhD
Forwarded Through:-
Dr. N. D. Jotwani
PhD,
Dean, Faculty of Engineering & Technology
Date :
Place : Ganpat Vidyanagar
CERTIFICATE
This is to certify that Mr. Kirankumar Ramchandra Amin is a research scholar
pursuing his PhD under my supervision. He presented the findings of his research work
in the pre-synopsis seminar before the Doctoral Committee held on 30th August 2012 at
U. V. Patel College of Engineering, Ganpat Vidyanagar. He has incorporated in the thesis
entitled “Sequential Sequence Mining (SSM) Technique in Large Data base” all the
modifications/suggestions made by the oral defense committee (Ref No: F. No.
89/GNU/PhD/1176/2012, dated 28th September 2012).
Research Guide
Prof. (Dr.) J. S. Shah
M.E, PhD
Date :
Place : Ganpat Vidyanagar
THESIS APPROVAL SHEET
The PhD thesis entitled “Sequential Sequence Mining (SSM) Technique in Large
Data base” by Mr. Kirankumar Ramchandra Amin has been approved for the award of the
Degree of Doctor of Philosophy under the Faculty of Engineering & Technology, Ganpat University.
External Examiner(s) Research Guide
Date :
Place : Ganpat Vidyanagar
TABLE OF CONTENTS
CHAPTER PAGE
Declaration by Author i
Acknowledgement iv
Abstract vii
List of Figures viii
List of Tables x
Abbreviation xi
1 Chapter 1 Introduction 1
1.1 Background 1
1.2 Thesis organization 4
1.3 Aim of the Research 5
2 Chapter 2 Related work 6
2.1 Literature Survey and Critical Assessment 6
2.2 Sequential Sequence Mining Techniques 7
2.2.1 Apriori-based Techniques 7
2.2.2 Tree-based Techniques 8
2.2.3 Lattice-based Techniques 9
2.2.4 Regular Expression based Techniques 10
2.2.5 Prefix-based Techniques 11
2.2.6 Closed Sequential Sequences Techniques 12
2.2.7 Time interval Sequence Mining Techniques 12
2.3 State-of-the-art techniques in Sequential Sequence Mining 13
2.4 Categories of sequential sequence mining techniques 17
2.5 Empirical Analysis of State-of-Art techniques 21
2.5.1 Apriori Algorithm - Formal Description 21
2.5.1.1 Support 22
2.5.1.2 Formal Definition: Apriori property 22
2.5.1.3 Algorithm: Apriori 22
2.5.2 Algorithm - Apriori-gen 24
2.5.2.1 The join procedure - Apriori-gen algorithm 24
2.5.2.2 The prune procedure of the Apriori-gen algorithm 25
2.5.3 DHP Algorithm 25
2.5.4 Partitioning Algorithm - Formal Description 27
2.5.4.1 Algorithm - Partition 28
2.5.4.1.1 Phase I 28
2.5.4.1.2 Merge Phase 28
2.5.4.1.3 Phase II 28
2.5.5 Sampling Algorithm 30
2.5.6 DIC Algorithm 31
2.5.7 Improved Apriori Algorithm 31
2.5.8 AprioriAll - Formal Description 32
2.5.8.1 Sort Phase 32
2.5.8.2 Litemset Phase 32
2.5.8.3 Transformation Phase 34
2.5.8.4 Sequence Phase 34
2.5.8.5 Maximal Phase 35
2.5.9 Algorithm - AprioriAll 35
2.5.10 AprioriSome Algorithm 36
2.5.10.1 Algorithm - AprioriSome: Forward Phase 36
2.5.10.2 AprioriSome: Backward Phase 36
2.5.11 Relative performance - AprioriAll & AprioriSome 37
2.5.12 DynamicSome - Formal Description 37
2.5.13 Algorithm - DynamicSome 38
2.5.13.1 Initialization Phase 38
2.5.13.2 Forward Phase 38
2.5.13.3 Intermediate Phase 39
2.5.14 GSP 39
2.5.14.1 Formal Description 39
2.5.14.2 Join Phase 40
2.5.14.3 Prune Phase 40
2.5.14.4 Relative Performance 40
2.5.15 FreeSpan 44
2.5.16 SPADE 44
2.5.17 Prefixspan 48
2.5.18 SPAM 51
2.5.19 Allen's Algorithm 59
2.5.19.1 Generalization of temporal events - Formal Description 60
2.5.19.2 Algorithm: Generalization of temporal events 60
2.5.19.3 Algorithm: Temporal interval relation rule discovery 61
3 Chapter 3 Motivation 63
4 Chapter 4 Scope of Work 65
5 Chapter 5 Proposed Algorithms 67
5.1 Sequential Sequence Mining 67
5.1.1 Support 69
5.1.2 Super sequences and sub sequences 70
5.3 Formal Notations & New Equations: MySSM 70
5.3.1 Customer 70
5.3.2 Item 70
5.3.3 Transaction 70
5.3.4 SequenceID 70
5.3.5 Equation for time interval 71
5.3.6 Equation for same time interval items 71
5.3.7 Equation for support 71
5.4 Algorithms of MySSM 72
5.4.1 Algorithm 1 SYNTIM 72
5.4.2 Algorithm 2 GCON 73
5.4.3 Algorithm 3 FS & GSGT 74
5.4.4 Algorithm 4 GAS 75
5.4.5 Algorithm 5 CMEM 75
5.4.6 Algorithm 6 OUTR 76
5.4.7 Algorithm 7 MySSM 77
6 Chapter 6 Empirical Analysis & Comparative Results 82
7 Chapter 7 Conclusion & Future Scope 92
Bibliography 94
Own Publication List 101
My other research publications 102
GANPAT UNIVERSITY
DECLARATION BY THE AUTHOR OF THE THESIS
I, Kiran R. Amin, Reg. No: EN/02/002/07, a registered research scholar of the PhD
programme in the Faculty of Engineering & Technology, Ganpat University, do hereby
submit my thesis entitled “Sequential Sequence Mining (SSM) Technique in Large Data
base” (herein referred to as my thesis) in printed as well as electronic form for
holding in the library of records of the University.
I hereby declare that:
1. The electronic version of my thesis submitted herewith on CD-ROM is in PDF
format.
2. My thesis is my original work of which the copyright vests in me, and my thesis
does not infringe or violate the rights of anyone else.
3. The contents of the electronic version of my thesis submitted herewith are the
same as those submitted as the final hard copy of my thesis after my viva-voce and
adjudication of my thesis.
4. I agree to abide by the terms and conditions of the Ganpat University policy on
intellectual property (hereafter "policy") currently in effect, as approved by the
competent authority of the University.
5. I agree to allow the University to make available the abstract of my thesis to any
user in both hard copy (printed) and electronic forms.
6. For the University's own non-commercial, academic use, I grant to the University
the non-exclusive license to make limited copies of my thesis in whole or in part
and to loan such copies at the University's discretion to academic persons and
bodies approved from time to time by the University for non-commercial academic
use. All usage under this clause will be governed by relevant fair-use provisions
in the policy and by the Indian Copyright Act in force at the time of submission
of the thesis.
7. I agree to allow the University to place such copies of the electronic version of
my thesis in any format.
8. I agree to allow the University to place such copies of the electronic version of my
thesis on the private intranet maintained by the University for its own academic use.
9. If in the opinion of the University, my thesis contains patentable or copyrightable
material and the University decides to proceed with the process of securing
copyrights and/or patents, I expressly authorize the University to do so. I also
undertake not to disclose any of the patentable intellectual properties before being
permitted by the University to do so, or for a period of one year from the date of
final thesis examination, whichever is earlier.
10. In accordance with the intellectual property policy of the University, I accept that
any commercialized intellectual property contained in my thesis is the joint
property of me, my co-workers, my supervisors and the Institute. I authorize the
University to proceed with the protection of the intellectual property rights in
accordance with prevailing laws. I agree to abide by the provisions of the
University intellectual property rights policy to facilitate protection of the
intellectual property contained in my thesis.
11. If I intend to file a patent based on my thesis when the University does not wish
so, I shall notify my intention to the University. In such case, my thesis should be
marked as patentable intellectual property and access to my thesis is restricted. No
part of my thesis should be disclosed by the University to any person(s) without
my written authorization for one year after my information to the University to
protect the IP on my own, within 2 years after the date of submission of the thesis
or the period necessary for sealing the patent, whichever is earliest.
Name of Research Student Name of Guide
Kirankumar Ramchandra Amin Prof. (Dr.) J. S. Shah
M.E(Computer) ME, PhD
Signature of Research Student Signature of Guide
Date : 26th December 2012
Place : Ganpat Vidyanagar
Acknowledgement
The humble accomplishment of this thesis would not have been possible without
the contribution of many individuals, to whom I express my appreciation and gratitude.
Firstly, I am deeply grateful to my supervisor Dr. J. S. Shah, Professor &
Principal of Government Engineering College, Katpur (PATAN) who guided me every
step of the way and was a source of inspiration. I am thankful to him for his constant
guidance, support and valuable time throughout the course of this thesis, with which I
successfully overcame many difficulties and learned a lot. Despite his ill health, he
regularly reviewed my thesis progress, gave me valuable suggestions and made corrections.
His unflinching courage and conviction will always inspire me, and I hope to continue to
work with his noble thoughts.
I would like to thank Dr. L. N. Patel, Vice Chancellor of Ganpat University. I
gratefully acknowledge his encouragement and personal attention, which provided a good
and smooth basis for my Ph.D. tenure.
I am extremely indebted to Dr. N. D. Jotwani, Principal of U. V. Patel College of
Engineering and Dean of the Faculty of Engineering & Technology, for providing me the
required infrastructure. I am also thankful to him for his constant support and
encouragement in carrying out my research work.
I am thankful to Doctoral Committee Members, Dr. D.C. Jinwala, Professor,
SVNIT, Surat and Dr. M.V. Joshi, Professor, DAIICT, Gandhinagar for their helpful
suggestions, valuable advice, constructive criticism and helpful comments during my pre-
synopsis Seminar.
I am also thankful to Dr. Ketan Kotecha, Director, NIT, Ahmedabad, Dr. Y. P.
Kosta, Director, Marwadi Education Foundations, Rajkot and Dr. N. D. Jotwani for
giving their useful comments during my pre-synopsis Seminar.
I thank Shri Darshit Khambholja, Director, Bhavi Technolsoft, for providing
me the necessary resources to accomplish my research work.
I would like to express my appreciation to the Registrar, Deputy Registrar and
other staff members of the Ganpat University and U. V. Patel College of Engineering for
their unlimited support.
At this moment of accomplishment, I express my thanks to my well-wishers,
friends, colleagues and all those who contributed in many ways to the success of this
study and made it an unforgettable experience for me.
Last but not least, I am greatly indebted to my wife and children for all their
support; they have sacrificed a lot due to my research work.
Kiran Amin
Dedicated To
My wife Dr. Falguni
&
My Children Dhvani & Nisarg
Abstract
Sequential sequence mining is very important in data mining. It produces
useful sequences that occur frequently in a database. These sequences are used to find
users' purchasing behavior in the retail industry and users' access sequences to web
pages, and to identify the sequences that occur repeatedly and are responsible for a
particular disease. The current state-of-the-art methods have not succeeded in producing
sequences for a large database with a time-gap interval; they are found to be memory and
time consuming. This motivated us to produce the sequences in a large database while
reducing memory and time, by including the time gap between successive items of
transactions.

We have proposed a sequential sequence mining technique which produces the
sequences for a large database while reducing a considerable amount of memory and time.
Our algorithms outperform current state-of-the-art techniques in sequential sequence
mining not only in computing time and memory but also in scalability with respect to
various parameters.

The thesis focuses on sequential sequence mining techniques in large databases.
LIST OF FIGURES
FIGURE PAGE
2.1 Apriori Algorithm 23
2.2 Apriori-gen Algorithm 24
2.3 Apriori-gen Algorithm: Join Procedure 25
2.4 Apriori-gen Algorithm: Prune Procedure 25
2.5 Mining Frequent itemsets using Partition algorithm 28
2.6 Partition Algorithm 28
2.7 Algorithm AprioriAll 35
2.8 Algorithm AprioriSome 38
2.9 Algorithm DynamicSome 39
2.10 Relative Performance 42
2.11 Comparison - GSP, FreeSpan, SPADE 47
2.12 Comparison - FreeSpan, SPADE, PrefixSpan 51
2.13 Comparison - PrefixSpan with SPAM 54
2.14 No of Customers vs Memory 55
2.15 No of Transactions vs Memory 55
2.16 Memory - PrefixSpan vs SPAM 56
2.17 Support vs Memory 57
2.18 Allen's Algorithm 59
2.19 Generalization of events 60
2.20 Temporal interval relation rule discovery 61
5.1 Algorithm SYNTIM 72
5.2 Algorithm GCON 73
5.3 Algorithm FS & GSGT 74
5.4 Algorithm GAS 75
5.5 Algorithm CMEM 75
5.6 Algorithm OUTR 76
5.7 Algorithm MySSM 77
6.1 Number of Customers v/s Time(Milliseconds) for support =0.4 84
6.2 No of Customers v/s Memory(MB) for support =0.4 84
6.3 Number of Customers v/s Time(Milliseconds) for support=0.02 85
6.4 Number of Customers v/s Memory(MB) for support=0.02 85
6.5 Number of Customers v/s Time(Milliseconds) for support=0.3 86
6.6 Number of Customers v/s Memory(MB) for support=0.3 87
6.7 Number of Customers v/s Time(Milliseconds) 88
6.8 Number of Customers v/s Memory(MB) 88
6.9 Support v/s Time in Milliseconds 89
6.10 Support v/s Memory in MB 90
6.11 Support v/s Time in Milliseconds 90
6.12 No of different items v/s Total sequences 91
6.13 No of different sequences for number of different items=100 91
6.14 No of different sequences for number of different items=10 92
6.15 No of different sequences for number of different items=6 92
LIST OF TABLES
TABLE PAGE
2.1 Sample Database 23
2.2 Mining Frequent itemsets using AprioriAll 33
2.3 Mapping of sequence 33
2.4 Transformed sequence 34
2.5 Data set Example 46
2.6 Vertical Data format 46
2.7 Vertical Data format 46
2.8 Sequence Database 49
2.9 Projected Database 50
2.10 S-Matrix 50
2.11 Data set 52
2.12 Vertical format 52
2.13 S-step process 53
2.14 I-step process 53
5.1 Data set 1 68
5.2 Data set 2 69
5.3 Sequence Generator Table 78
5.4 Sequence Generator Table with Time stamp 78
5.5 Sequence Generator Table 79
5.6 Table of time interval sequence for 'p' 81
ABBREVIATION
SPAM Sequential Pattern Mining
PREFIXSPAN Prefix-projected Sequential pattern mining
SPADE Sequential Pattern Discovery using Equivalent Class
SPIRIT Sequential pattern mining with regular expression constraints
BIDE Bi-Directional Extension
CloSpan Closed sequential patterns
FTAPs Frequent temporal association pattern
CTMSP-Mine Cluster-based Temporal Mobile Sequential Pattern Mine
CTMSPs Cluster-based Temporal Mobile Sequential Patterns
CO-Smart-CAST Cluster-Object-based Smart Cluster Affinity Search Technique
DIC Dynamic Itemset Counting
GSP Generalised Sequential Pattern
SID Sequence Id
CID Customer ID
I-APRIORI Improved Apriori
I-PREFIXSPAN Improved PrefixSpan
SYNTIM Synthetic Time Date
MySSM My Sequential Sequence Mining
GCON Get Configuration
FS Find Sequence 0 items
GSGT Generate Sequence Generator Table
GAS Generate All Sequences
CMEM Check Memory
OUTR Output Result
Chapter 1
Introduction
1.1 Background
Data mining extracts implicit, potentially useful knowledge from large amounts of
data. It is also called knowledge mining, knowledge extraction, data/sequence/pattern
analysis, data archaeology and data dredging from databases. In other words, data mining
is the act of drilling through huge volumes of data to discover relationships or answer
queries too generalized for traditional query tools.
In general, data mining tasks can be classified into two categories:
Descriptive mining: It is the process of drawing out the essential characteristics or
general properties of the data in the database. Clustering, association and sequential
mining are among the descriptive mining techniques.
Predictive mining: This is the process of inferring sequences from data to make
predictions. Classification, regression and deviation detection are predictive mining
techniques.
Data mining techniques are useful in various areas, such as market basket analysis,
decision support, fraud detection, business management and telecommunications. Data
mining draws from Database Technology, Machine Learning, Artificial Intelligence,
Neural Networks, Statistics, Pattern Recognition, Knowledge-based Systems, Knowledge
Acquisition, Information Retrieval, High-performance Computation and Data
Visualization.
Many methods have been developed to extract such information. Sequential
sequence mining is one of the most important techniques that facilitate decision making
in various applications. The mining problem was first proposed by Agrawal and Srikant
[10]. It discovers sequential sequences which occur frequently in a sequence database.
In medicine, time interval sequences of diseases can be found from medical
records: diseases, treatments, durations of hospital stay, etc. are recorded in hospital
databases. However, all the events, such as suffering from and curing diseases or
occurring symptoms, are interval-based. Conventional sequential sequence mining is
not appropriate for discovering sequences in these events. On the other hand, time
interval sequences are more useful to identify whether or not a patient suffers from a
certain disease, and to predict the symptoms of a patient who has a certain disease.
In investment, whether a certain stock rises or falls is one of the important things
stock investors want to know. Further, owners are worried about the stock trends of
their own businesses. Stockholders and industry analysts also like to know the rise/fall of
certain stocks, which is one of the useful pieces of information extracted from the time
interval sequences of stock prices. Stock prices are recorded at every transaction
and act as historical data; we may find the time interval stock sequences from the
stock interval event database.
In e-marketing, some Internet vendors provide new selling methods like
group buying offers: vendors sell products at lower prices when someone collects a
crowd of people to buy the product. The duration from when an individual joins a group
buying session for a certain product until the closing of the session is considered an
interval-based event. Since many group buying customers may join buying sessions for a
number of products concurrently or later, these interval-based events form a set of
sequences, which may include some interesting time-oriented sequences. Discovering
time-oriented sequences from group buying records will help in understanding the
purchasing behavior of customers and in making effective marketing strategies.
Traditional association rule mining [10] works on transactional data. It
considers the various items purchased in a single transaction of a particular customer; it
does not consider the same customer purchasing items in different transactions. The
concept of sequential sequence mining arrived to consider items purchased across
different transactions: it covers the idea of the same customer purchasing items in more
than one transaction and at more than one time. However, the current state-of-the-art
techniques have limitations in memory and time performance, which are our focus.
Sequential sequence mining mines sequential sequences from a database with
efficient support counting. It is used to find frequent subsequences that occur with a
minimum support value. Unlike simple association rule mining, sequential sequence
mining focuses on sequences of events occurring frequently in a given dataset. For
example, a customer in an electronics retail shop purchases a computer system and then,
after some amount of time, purchases a scanner; the purchase of the scanner is made after
the purchase of the computer system. The order of the items plays a major role, so we use
an ordered dataset where all events are stored in a particular order. Traditional
sequential sequence mining does not consider the timing between the purchases of items.
The goal of our research work is to develop and evaluate the new algorithms of
MySSM, which efficiently produce sequential sequences in a large database with
significant improvement in execution time and memory.
1.2 Thesis organization
We have discussed the introductory part of our thesis in Chapter 1, which also
covers the organization of the thesis and the aim of our research work.

Chapter 2 focuses on the work related to our research. The first part of this
chapter is a literature survey. In the second section, we discuss various sequential
sequence mining techniques. The third section focuses on state-of-the-art techniques for
sequential sequence mining; these techniques are gradually compared with closely
related techniques. The results of the empirical analysis of the state-of-the-art methods
are discussed in the fourth section. This chapter helped us strengthen our technique by
considering the various evaluation parameters in the area of sequential sequence mining.
Chapter 3 provides the motivation of our research work. It focuses on our
inspiration to do the research work in the sequential sequence mining. The deficiency in
state-of-the-art methods motivated us to develop new sequential sequence mining
technique.
Chapter 4 focuses on the scope of work of our algorithm MySSM. We have
discussed the proposed algorithms in Chapter 5, which includes the steps of our
algorithm MySSM. We have proposed seven algorithms, named SYNTIM, GCON,
FS & GSGT, GAS, CMEM, OUTR and MySSM, all of which are discussed in this chapter.
Chapter 6 serves to experimentally validate the claims of efficiency in terms of
time and memory. In addition, we have empirically analyzed it for large databases with
various parameters, such as various support values, number of items per transaction,
number of transactions per customer and number of customers per database.
Chapter 7 summarizes the thesis and focuses on future scope of the work. This
chapter is followed by references used in our thesis.
1.3 Aim of the Research
The fundamental aim of our thesis is to study and develop a new sequential
sequence mining technique that produces sequential sequences from the large database. It
considers the time gap between successive items to be purchased by the customers. It
produces the sequential sequences with reasonable amount of Time and Memory.
Chapter 2
Related work
Sequential sequence mining is one of the important techniques in data mining.
From the literature, spanning association rule mining techniques to sequential sequence
mining techniques, we found that many efforts have been exerted in discovering sequential
sequences. To design a new algorithm for resolving these mining problems, we referred
to well-known sequential sequence mining techniques. These techniques and brief
critiques of them are presented here.
2.1 Literature Survey and Critical Assessment
We referred important literatures in the area of sequential sequence mining and
studied various techniques related to our work. These techniques are elaborated here.
Brief critique with gradual improvement over various techniques is discussed here.
The state-of-the-art techniques in sequential sequence mining algorithms are
classified in to different classes with respect to following:
(1) Methods and data-structures used for the candidate sequence generation.
(2) Pruning techniques used to accelerate the mining process.
(3) Final output set that the algorithms are targeting.
The above classes yield seven different techniques. These techniques are covered
in section 2.2. Based on our literature survey, we have discussed the various
state-of-the-art methods in sections 2.2 to 2.4. The empirically tested results are
compared in section 2.5.
2.2 Sequential Sequence Mining Techniques
A sequential sequence [7] is defined as follows: the data set is a set of sequences,
called data-sequences. Each data-sequence is a group of transactions, and each
transaction is a set of literals, called items or events. Typically, there is a transaction
time associated with each transaction. Sequential sequence mining finds all sequential
sequences with a user-defined minimum support.
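The definition above can be made concrete with a small sketch. In the following Python fragment (our own illustrative code, not taken from any of the cited algorithms), a data-sequence is a list of itemsets, and the support of a candidate sequence is the fraction of data-sequences that contain it in order:

```python
from typing import List

# A data-sequence is a list of itemsets (frozensets of items).
def is_subsequence(sub: List[frozenset], seq: List[frozenset]) -> bool:
    """True if every itemset of `sub` is contained, in order,
    in some itemset of `seq`."""
    i = 0
    for itemset in seq:
        if i < len(sub) and sub[i] <= itemset:
            i += 1
    return i == len(sub)

def support(sub: List[frozenset], db: List[List[frozenset]]) -> float:
    """Fraction of data-sequences in `db` that contain `sub`."""
    return sum(is_subsequence(sub, s) for s in db) / len(db)

db = [
    [frozenset("a"), frozenset("bc"), frozenset("d")],
    [frozenset("a"), frozenset("c")],
    [frozenset("b"), frozenset("d")],
]
print(support([frozenset("a"), frozenset("c")], db))  # contained in 2 of 3 data-sequences
```

A sequence is then reported as frequent when this value reaches the user-defined minimum support.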
2.2.1 Apriori-based Techniques
The first and simplest family of sequential sequence mining algorithms is Apriori-
based algorithms and their main characteristic is that they use Apriori principle [10]. The
problem of sequential sequence mining was introduced along with three other Apriori-
based algorithms (AprioriAll, AprioriSome and DynamicSome) [7]. At each step k, a set
of candidate frequent sequences Ck of size k is generated by performing a self-join on
Lk−1; Lk consists of all those sequences in Ck that satisfy a minimum support threshold.
The efficiency of support counting was improved by using a hash-tree structure.
A similar approach, GSP (Generalized Sequential Patterns) was developed [6]
that uses time constraints as well as the window constraints. This was proved to be more
efficient than its predecessors. Mannila et al. introduced the idea of mining frequent
episodes [17], i.e. frequent sequential sequences in a single long input sequence. They
used a sliding window to cut the input sequence into smaller segments and employed a
mining algorithm similar to that of AprioriAll.
Discovering all frequent sequential sequences in large databases is a very
challenging task since the search space is large: for a database with m attributes and
frequent sequences of length k, there are O(m^k) potentially frequent ones, and
increasing the number of objects may lead to a high computational cost. Apriori-based
algorithms utilize a bottom-up search that lists every single frequent sequence; to
produce a frequent sequence of length l, all 2^l subsequences have to be generated. This
exponential complexity restricts all the Apriori-based algorithms to discovering only
short sequences, since they only implement subset infrequency pruning, removing any
candidate sequence for which there exists a subsequence that does not belong to the set
of frequent sequences.
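The level-wise, generate-and-test behavior described above can be sketched as follows. This is a simplified illustration assuming single-item elements, not the original AprioriAll code:

```python
def frequent_sequences(db, min_sup):
    """Level-wise (Apriori-style) mining over sequences of single items,
    represented as tuples. `min_sup` is an absolute count (>= 1)."""
    def contains(sub, seq):
        # checks that `sub` occurs in `seq` in order (the iterator is consumed)
        it = iter(seq)
        return all(x in it for x in sub)

    def keep_frequent(cands):
        return {c for c in cands if sum(contains(c, s) for s in db) >= min_sup}

    L = keep_frequent({(x,) for s in db for x in s})  # frequent 1-sequences
    result = set(L)
    while L:
        # self-join: glue (k-1)-sequences that overlap on k-2 items,
        # mirroring the Ck-from-Lk-1 candidate generation step
        cands = {a + (b[-1],) for a in L for b in L if a[1:] == b[:-1]}
        L = keep_frequent(cands)  # candidates failing minimum support are dropped
        result |= L
    return result

db = [("a", "b", "c"), ("a", "c"), ("b", "c")]
print(sorted(frequent_sequences(db, 2)))  # [('a',), ('a', 'c'), ('b',), ('b', 'c'), ('c',)]
```

Note how candidate generation explodes combinatorially at each level; this is exactly the weakness discussed above.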
2.2.2 Tree-based Techniques
A faster and more efficient candidate generation can be attained by using a tree-
like structure [18]. The traversal is made in a depth-first search manner, applying both
subset infrequency and superset frequency pruning to the candidate sequences. Initially,
this idea was introduced for mining frequent itemsets, but it was then extended to
sequential sequences. Ayres et al. employed an efficient approach in SPAM [3]. SPAM
generates a sequence enumeration tree containing all the candidate frequent sequences.
Level k of the tree contains the complete set of sequences of size k that occur in the
database (with each node representing one sequence). The nodes of each level are
generated from the nodes of the previous level using two types of extensions:
(1) Itemset extension (the last itemset in the sequence is extended by adding one
more item to the set),
(2) Sequence extension (a sequence is extended by adding a new itemset at the
end of the sequence).
The candidate sequences are specified by traversing the tree using depth-first
search. If a sequence is found infrequent, the subtree of the node representing that
sequence is pruned. If a sequence is found to be frequent, then all its subsequences must
be frequent, so the tree nodes representing those sequences are skipped. For efficient
support counting, the database is represented by a bitmap, which further improves
performance over the lattice-based approaches [4] discussed in the next method.
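The depth-first traversal with subtree pruning can be illustrated as below. This is a minimal sketch of our own with single-item elements and sequence-extensions only; SPAM's bitmap representation and itemset-extensions are omitted:

```python
def dfs_mine(db, min_sup):
    """Depth-first enumeration of a sequence tree in the spirit of SPAM,
    simplified to sequence-extensions over single items (tuples)."""
    items = sorted({x for s in db for x in s})

    def sup(pat):
        def contains(seq):
            it = iter(seq)           # consuming the iterator enforces order
            return all(x in it for x in pat)
        return sum(contains(s) for s in db)

    results = []

    def dfs(pat):
        for x in items:
            cand = pat + (x,)        # sequence-extension: append a new element
            if sup(cand) >= min_sup:
                results.append(cand)
                dfs(cand)            # recurse only into frequent nodes
            # infrequent node: its whole subtree is pruned

    dfs(())
    return results

db = [("a", "b", "c"), ("a", "c"), ("b", "c")]
print(dfs_mine(db, 2))  # [('a',), ('a', 'c'), ('b',), ('b', 'c'), ('c',)]
```

The pruning step is what keeps the enumeration tractable: once a node is infrequent, none of its extensions can be frequent, so the entire subtree is skipped.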
2.2.3 Lattice-based Techniques
The lattice structure underlies another class of sequential sequence mining
algorithms, which enumerate the candidate sequences efficiently using a lattice-based
method. A lattice is a "tree-like" structure in which each node may have more than one
parent node. A node on the lattice represents a sequence s and is connected to all the
pairs of nodes on the previous level that can be joined to form s. For example, let
s = {d, (bc), a}; then all the following nodes should be connected to s on the lattice: {(bc),
a}, {d, b, a}, {d, (bc)}, {d, c, a}, since all pairs of these subsequences can be joined to
form s.
SPADE [4] used the above structure to efficiently specify the candidate sequences.
The basic characteristics of SPADE were:
(1) Vertical representation of the database using id-lists, where each sequence is
associated with a list of the database sequences in which it occurs.
(2) A lattice-based approach to decompose the original search space into
smaller subspaces.
(3) Within each sub-lattice, two different search strategies (breadth-first and
depth-first search) used for finding the frequent sequences.
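The id-list idea in characteristic (1) can be sketched as follows. This is illustrative code of our own: items are single characters and a data-sequence is a string, which is not SPADE's actual representation:

```python
from collections import defaultdict

def vertical_format(db):
    """SPADE-style id-lists: item -> list of (sid, position) occurrences."""
    idlists = defaultdict(list)
    for sid, seq in enumerate(db):
        for pos, item in enumerate(seq):
            idlists[item].append((sid, pos))
    return idlists

def temporal_join(idlist_a, idlist_b):
    """Join two id-lists: keep occurrences of b strictly after the first
    occurrence of a within the same data-sequence (pattern 'a then b')."""
    first_a = {}
    for sid, pos in idlist_a:
        first_a.setdefault(sid, pos)  # id-lists are built in position order
    return [(sid, pos) for sid, pos in idlist_b
            if sid in first_a and pos > first_a[sid]]

def support_count(idlist):
    return len({sid for sid, _ in idlist})

db = ["abc", "ac", "bc"]
v = vertical_format(db)
ac = temporal_join(v["a"], v["c"])
print(support_count(ac))  # the pattern 'a then c' occurs in 2 data-sequences
```

The appeal of the vertical format is that the support of a longer sequence is obtained by joining the id-lists of its generating subsequences, with no extra database scan.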
cSPADE, an extension of SPADE, was proposed in [4]; it allows a set of
constraints to be placed on the mined sequences. These constraints are:
(1) Length and width constraints
(2) Gap and window constraints
(3) Item constraints
(4) Class constraints
GO-SPADE [19] is a similar algorithm proposed later, introducing the idea of
generalized occurrences. The observation behind GO-SPADE is that in a sequence
database certain items may appear consecutively. To reduce the cost of the mining
process, GO-SPADE compacts all these consecutive occurrences by defining a
generalized occurrence of a sequence p as a tuple (sid, [min, max]), where sid is the
sequence id and [min, max] is the interval of the consecutive occurrences of the last
event of p.
2.2.4 Regular Expression based Techniques
The vast majority of the former algorithms focused on the discovery of frequent
sequential sequences based only on a support threshold, which limits the results to the
most common ones. Thus there is a lack of user-controlled focus in the sequence mining
process, which may sometimes lead to a great volume of useless sequences. A solution to
this problem was proposed in [20], where the mining process was restricted by a support
threshold and user-specified constraints modeled by regular expressions. Later on, the
series of SPIRIT [20] algorithms was introduced, where a set of constraints C is pushed
into the mining process along with the sequence database. The minimum support
requirement and the set of additional user-specified constraints are therefore applied
simultaneously, restricting the set of candidate sequences produced during the mining
process. To fulfill this, two different types [20] of pruning techniques were used.
The first was constraint-based and the second was support-based. The constraint-based
technique uses a relaxation C0 of C, ensuring that during each pass of candidate
generation all the candidate sequences satisfy C0. The support-based technique tries to
ensure that all the subsequences of a candidate sequence that satisfy C0 are present in the
current set of discovered frequent sequences.
Another characteristic of the SPIRIT [20] algorithms relates to anti-
monotonicity. Consider a given set of constraints C and a relaxation C0 of C; C0 is a
weaker, less restrictive constraint. When C0 is anti-monotone, support-based pruning is
maximized, since support information for every subsequence of a candidate sequence
satisfying C0 can be used for pruning. If C0 is not anti-monotone, the efficiency of both
support-based and constraint-based pruning depends on the choice of the relaxation C0.
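The interplay between an exact regular-expression constraint C and a weaker, anti-monotone relaxation C0 can be sketched as follows. Items are encoded as single letters for illustration, and the relaxation chosen here (membership in C's alphabet) is one simple anti-monotone possibility, not SPIRIT's actual relaxations:

```python
import re

# Hypothetical setup: items are single letters; a sequence is a string.
constraint = re.compile(r"a(b|c)+d")   # a user-specified constraint C
alphabet = set("abcd")                 # the items mentioned by C

def satisfies(seq: str) -> bool:
    """Exact check against C, applied to final results."""
    return constraint.fullmatch(seq) is not None

def satisfies_relaxation(seq: str) -> bool:
    """A weaker, anti-monotone relaxation C0: every item of the candidate
    must belong to C's alphabet. Any subsequence of a sequence satisfying
    C0 also satisfies C0, so it is safe to prune with during generation."""
    return set(seq) <= alphabet

candidates = ["abd", "abe", "acbd", "ad"]
pruned = [s for s in candidates if satisfies_relaxation(s)]
final = [s for s in pruned if satisfies(s)]
print(pruned)  # "abe" is pruned early: 'e' is not in C's alphabet
print(final)   # "ad" survives pruning but fails the exact check
```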
2.2.5 Prefix-based Techniques
Another class of sequential sequence mining algorithms is prefix-based [21]. In
this method, the database is projected with respect to a frequent prefix sequence; based
on the outcome of the projection, new frequent prefixes are identified and used for
further projections, until the projected database no longer satisfies the support threshold.
The main steps of a prefix-based algorithm are as follows:
(1) Scan the database for the frequent 1-sequences.
(2) For each frequent 1-sequence s found in the previous step, project the
database with respect to s.
(3) Scan the projected database for locally frequent items.
(4) Add each new frequent item to the end of the prefix and project the database
with respect to the new prefix.
(5) Repeat steps 3-4 for each new prefix, until the projected database is smaller
than the support threshold.
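The steps above can be sketched as a minimal PrefixSpan-style miner for single-item events; this is an illustrative simplification using an absolute support count, not the algorithm from [21] verbatim:

```python
from collections import defaultdict

def prefixspan(db, minsup, prefix=None):
    """Minimal prefix-projection miner. `db` is a list of item sequences;
    returns (pattern, support) pairs with support >= minsup (a count)."""
    prefix = prefix or []
    patterns = []
    # Steps 1/3: count locally frequent items in the (projected) database.
    counts = defaultdict(int)
    for seq in db:
        for item in set(seq):
            counts[item] += 1
    for item, sup in counts.items():
        if sup < minsup:
            continue
        new_prefix = prefix + [item]
        patterns.append((new_prefix, sup))
        # Steps 2/4: project the database on the extended prefix.
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        # Step 5: recurse; projections shrink until nothing is frequent.
        patterns.extend(prefixspan(projected, minsup, new_prefix))
    return patterns

db = [list("abc"), list("abd"), list("acd"), list("bcd")]
print(sorted(p for p, s in prefixspan(db, minsup=3)))
```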
2.2.6 Closed Sequential Sequences Techniques
As an alternative to mining the complete set of frequent sequences including all
their subsequences, closed frequent sequence techniques were proposed, e.g. by Zaki [4]
and Pei [5]. Two of the most efficient algorithms for mining frequent closed sequences
are BIDE [23] and CloSpan [48]. Both are based on the notion of the projected database
and use special techniques to limit the number of frequent sequences, finally keeping
only the closed ones.
CloSpan [48] uses the candidate maintenance-and-test approach: it first
generates a set of closed sequence candidates, stored in a hash-indexed tree structure,
and then prunes the search space using Common Prefix and Backward Subsequence
pruning. However, the drawback of CloSpan is that it consumes a great deal of memory
when there are many closed frequent sequences, since sequence closure checking leads
to a vast search space; it therefore does not scale well with the number of closed
sequences. To overcome this limitation, BIDE employs a BIDirectional Extension
paradigm for mining closed sequences: a forward directional extension grows the prefix
sequences and checks their closure, while a backward directional extension checks the
closure of a prefix sequence and prunes the search space. Overall, BIDE [23] shows
high efficiency in terms of speed (an order of magnitude faster than CloSpan [48]) and
scalability with respect to database size.
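The notion of closedness itself can be illustrated with a naive post-filter over already-mined patterns; note that this only demonstrates the definition, not BIDE's or CloSpan's search-space pruning:

```python
def is_subsequence(a, b):
    """True if pattern a is a (not necessarily contiguous) subsequence of b."""
    it = iter(b)
    return all(x in it for x in a)

def closed_patterns(patterns):
    """Keep only closed patterns: those with no proper super-pattern of
    identical support. A brute-force post-filter for illustration."""
    return [(p, sup) for p, sup in patterns
            if not any(sup == sup2 and p != q and is_subsequence(p, q)
                       for q, sup2 in patterns)]

pats = [(("a",), 4), (("a", "b"), 4), (("b",), 5)]
# ("a",) is absorbed by ("a", "b"), which has the same support 4.
print(closed_patterns(pats))  # -> [(('a', 'b'), 4), (('b',), 5)]
```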
2.2.7 Time interval Sequence Mining Techniques
Up to this point, events were considered to be instantaneous. There are several
techniques for discovering intervals that occur frequently in a transactional database
[24]. In most cases the intervals are not labelled and no relations between them are
considered. Vill. [25] extended the sequential sequence techniques by also including the
relations introduced previously. In time-interval sequential mining, the time between
events is considered.
2.3 State-of-the-art techniques in Sequential Sequence Mining
This section reviews existing and past research in the field of sequential
sequence mining, followed by the innovation of our research.
Chen [8] proposed a method for discovering time-interval sequential sequences
in sequence databases. Dhany Saputra [1] proposed an improved version of PrefixSpan
named i-prefixspan.
W. Li [28] proposed the novel concept of frequent time-interval association
sequences over multiple gene sequences. Their algorithm has several advantages over
traditional methods. A set of genes may simultaneously show complex time-interval
expression sequences recurrently across multiple microarray datasets; such time-interval
signals are hard to recognize in individual microarray datasets, but become significant
through their frequent occurrence across multiple datasets. They designed an efficient
two-stage algorithm to identify FTAPs [28]: first, for each gene, they recognized
expression trends that occurred frequently across multiple datasets; second, they
searched for sets of genes that simultaneously exhibit their respective trends recurrently
in multiple datasets. They applied this algorithm to 18 yeast time-series microarray
datasets. The majority of FTAPs identified by the algorithm were associated with
specific biological functions; moreover, a significant number of sequences included
genes that are functionally related but do not exhibit co-expression. Their approach
offers two advantages: (1) it can identify complex associations of time-interval trends in
gene expression, an important step towards understanding the complex mechanisms
governing cellular systems; and (2) it is capable of integrating time-series data with
different time scales and intervals.
Tsai [26] proposed a sequential sequence method to explore consumer
purchasing behavior. They concentrated on improving the accuracy and efficiency of
their methods and discussed how to detect sequential sequence changes between two
time periods. To help business managers understand the changing behaviors of their
customers, they proposed a three-phase sequential sequence change detection
framework. In phase I [26], two sequential sequence sets are generated from the two
time-period databases. In phase II, the dissimilarities between all pairs of sequential
sequences are evaluated using the proposed sequential sequence matching algorithm;
based on a set of judgment criteria, a sequential sequence is classified as one of three
change types: an emerging sequential sequence, an unexpected sequence change, or an
added sequential sequence. In phase III, significantly changed sequences are returned to
managers if the degree of change for a sequence is large enough.
Mirko B. [27] proposed a method for recognizing customer segments and
tracking their change over time. This is important for businesses that operate in dynamic
markets with customers who demand new innovations and competing products and have
rapidly changing demands and attitudes. They presented a system for customer
segmentation which accounts for the dynamics of today's markets. Their approach [27]
was based on the discovery of frequent itemsets and the analysis of their change over
time which, finally, resulted in a change-based notion of segment interestingness. Their
approach allowed them to detect arbitrary segments and analyze their temporal
development; it was assumption-free and proactive and could be run continuously.
Fabian Moerchen [22] surveyed temporal pattern mining for time point-based
and time interval-based methods, distinguishing point-based from interval-based
methods as well as univariate from multivariate methods. They presented symbolic
temporal data models and temporal operators used for pattern discovery in data mining
research, dividing temporal data models into time point vs. time interval data, univariate
vs. multivariate data, and numeric vs. symbolic data.
They categorized mining over time point data into mining subsequences with
suffix tries [29], mining sequential sequences [30], mining episodes [31], and mining
partial orders [32].
J. Kang and H. Yong [33] proposed mining spatio-temporal patterns in trajectory
data. The spatio-temporal sequences extracted from the historical trajectories of moving
objects expose important knowledge about movement behavior for location-based
services. Existing approaches transform trajectories into sequences of location symbols
and derive frequent subsequences by applying conventional sequential pattern mining
algorithms; however, a loss of spatio-temporal correlation occurs due to inappropriate
approximations of spatial and temporal properties. They addressed the problem of
mining spatio-temporal [33] sequences from trajectory data, noting that an inefficient
description of temporal information decreases both the mining efficiency and the
interpretability of the sequences. They provided an efficient representation of spatio-
temporal movements and proposed a new approach to discover spatio-temporal
sequences in trajectory data: their method first finds spatio-temporal regions using
prefix-projection methods and then extracts frequent spatio-temporal sequences.
With advances in mobile communication [33] and positioning technology, large
amounts of moving-object data are collected from various types of devices, such as
GPS-equipped mobile phones or vehicles with navigational equipment. From these
devices, movements of objects are collected in the form of trajectories. Spatio-temporal
sequences in trajectories, which represent the movement sequences of objects, can
provide useful information for high-quality Location-Based Services (LBS). They
addressed the problem of inefficient representation of spatio-temporal properties and
proposed new algorithms for mining spatio-temporal sequences. First, they introduced
two compact representations of object movements, which abstract original trajectories
into sequences of the regions that objects mostly visited. This spatio-temporal
abstraction of the data contributes to improving the mining efficiency and the
interpretability of the extracted sequences.
Yan H. [34] proposed a framework for mining sequential sequences from spatio-
temporal event data sets. In a large spatio-temporal database of events, where each event
consists of fields such as event ID, time, location, and event type, mining spatio-
temporal sequential sequences identifies significant event-type sequences. Such spatio-
temporal sequential sequences are critical for investigating spatial and temporal
evolutions in many applications. Earlier research explored sequential sequences on
transaction data and trajectory analysis on moving objects; however, these methods
cannot be directly applied to mining sequential sequences from a large number of
spatio-temporal events. Two major research challenges remained: 1) the definition of
significance measures for spatio-temporal sequential sequences that avoid spurious
ones, and 2) the algorithmic design under significance measures that do not guarantee
the downward closure property. In this paper [34], they proposed a sequence index as
the significance measure for spatio-temporal sequential sequences, which is meaningful
due to its interpretability using spatial statistics. They proposed slicing-STS-Miner to
tackle the algorithmic design challenge using the spatial sequence index, which does not
preserve the downward closure property.
Damian F. Zhang Chen [35] proposed sequential pattern mining of multimodal
data streams in dyadic interactions. Finding sequential sequences from multimodal data
is an important topic in various research fields, such as human-human communication,
human-agent or human-robot interaction, and human development and learning. Using a
multimodal human-robot interaction dataset, they showed that their ESM data mining
algorithm was able to detect and validate various kinds of reliable temporal sequences
from multi-streaming, multimodal data. They [35] proposed a sequential sequence
mining method to analyze multimodal data streams using a quantitative temporal
approach, presenting a new temporal data mining method focused on extracting the
exact timings and durations of sequential patterns from multiple temporal event streams,
whereas other related algorithms can only find the sequential order of temporal events.
They applied their method [35] to the detection and extraction of human sequential
behavioral sequences over multiple multimodal data streams in human-robot
interactions.
Eric Lu [36] proposed mining cluster-based temporal mobile sequential
sequences in Location-Based Service environments. Due to a wide range of potential
applications, research on Location-Based Services (LBS) has been emerging in recent
years. Earlier studies focused on discovering mobile sequences from whole logs;
however, such sequences may not be precise enough for prediction, since differentiated
mobile behaviors among users and across temporal periods were not considered. They
proposed an algorithm, Cluster-based Temporal Mobile Sequential Pattern Mine
(CTMSP-Mine), which discovers Cluster-based Temporal Mobile Sequential Patterns
(CTMSPs), along with a prediction strategy for subsequent mobile behaviors.
In CTMSP-Mine, user clusters are constructed by a novel algorithm named
Cluster-Object-based Smart Cluster Affinity Search Technique (CO-Smart-CAST), and
similarities between users are evaluated by the proposed measure, Location-Based
Service Alignment. In addition, a time segmentation approach is presented to find
segmenting time intervals in which mobile characteristics are similar. They worked on
mining and prediction of mobile behaviors while considering user relations and
temporal properties simultaneously. Through experimental evaluation under various
simulated conditions, their proposed methods were shown to deliver excellent
performance.
2.4 Categories of sequential sequence mining techniques
Sequential sequence mining is categorized into two classes of methods:
1. Point-based methods.
2. Interval-based methods.
Events (or items) in a data sequence that occur at a single time point are called
point-based events. Most existing sequential sequence mining methods find sequences
over data sequences of point-based events.
The point-based state-of-the-art methods can be categorized into the following
classes:
1. Performance-enhancing algorithms
2. Constraint-based sequential sequence mining
3. Incremental sequential sequence mining
4. Mining variants of sequential sequences
The variants of sequential sequences are:
1. Maximal sequences
2. Similar sequences
3. Fuzzy sequential sequences
4. Closed sequences
5. Multidimensional sequences
The interval-based methods include time-interval sequences.
These methods are elaborated below.
1. Performance-enhancing sequential sequence mining algorithms
These algorithms improve performance as measured by various evaluation
metrics. Many efforts have been devoted to improving the performance of discovering
sequential sequences by proposing new mining algorithms. The performance analysis is
discussed in Section 2.5.
2. Constraint-based Sequential Sequence Mining
In many applications, the requirements on the discovered sequences may differ.
SPIRIT [20] allows a user to discover user-specified sequential sequences by giving
regular expression constraints. Pei [39] proposed mining sequential sequences with
constraints, which improves the efficiency and effectiveness of the mining results.
3. Incremental Sequential Sequence Mining
In a dynamic environment, databases are updated continually. Re-mining the
whole database whenever it changes is inefficient; therefore, many incremental mining
methods have been developed to solve this problem [40], [41].
4. Mining variants of sequential sequences
When applying sequential sequence mining methods to real applications, users
may require variants of the discovered sequences. The following are some typical
variations.
(1) Maximal sequences
A sequential sequence is called maximal if it is not contained in any other
sequence in the set. Agrawal and Srikant mined maximal sequences [44]. Discovering
maximal sequences reduces the number of output sequences.
(2) Similar sequences
Similar sequences, found by similar-sequence mining methods, occur frequently
in data sequences and are discovered by processing similarity queries. The difference
between similar sequences and sequential sequences is that a similar sequence need not
occur exactly in the data sequences: a similarity query is satisfied if the similarity
between the query sequence and a data sequence is high enough.
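A similarity query of this kind might be sketched as follows, using 1 minus normalized edit distance as one plausible similarity measure (the actual measure varies from method to method):

```python
def edit_distance(a, b):
    """Classic dynamic-programming edit distance between two item sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,       # deletion
                                     dp[j - 1] + 1,   # insertion
                                     prev + (x != y)) # substitution
    return dp[-1]

def similarity_query(db, query, threshold):
    """Return data sequences whose similarity to `query` meets the threshold.
    Similarity = 1 - normalized edit distance (an illustrative choice)."""
    hits = []
    for seq in db:
        sim = 1 - edit_distance(query, seq) / max(len(query), len(seq))
        if sim >= threshold:
            hits.append(seq)
    return hits

db = [list("abcd"), list("abce"), list("xyz")]
print(similarity_query(db, list("abcd"), 0.7))
```

Here "abce" matches the query "abcd" although it never occurs exactly in it, which is precisely the difference from ordinary sequential sequences.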
(3) Periodic Sequences
Periodic sequence mining methods find sequences that recur in the database
[44], [45], [46], [47]. For example, events behaving cyclically in time series are of
interest in the marketing and biology domains.
(4) Closed sequences
A closed sequential sequence is a sequential sequence contained in no other
sequential sequence having exactly the same support [48], [30]. Discovering closed
sequential sequences generates more compact results and performs more efficiently.
(5) Episodes
An episode is a collection of events that follow a specified structure and occur
repeatedly in a time series [50], [51]. Episodes are useful and efficient for analyzing
time series data.
(6) Multidimensional sequences
While traditional sequential sequence mining considers only the time dimension
of items, multidimensional sequences consider additional dimensions, such as region,
time, and customer group [52], [11]. Multidimensional sequences give more information
than traditional methods.
(7) Fuzzy sequential sequences
Sequential sequences can be extended using fuzzy sets. Chen and Ko discovered
fuzzy time-interval sequential sequences [50]; moreover, Hong and Kuo proposed fuzzy
sequential sequences over quantitative data.
Yen Chen [8] used sequential sequence mining, which finds frequent
subsequences as sequences in a sequence database, and considered the time between
purchased items. They addressed sequential sequences that include time intervals, called
time-interval sequential sequences, and developed two efficient algorithms for mining
them: the first based on the conventional Apriori algorithm, the second on the
PrefixSpan algorithm.
2.5 Empirical Analysis of State-of-the-art techniques
Here we discuss the experimental evaluation of various state-of-the-art
techniques. Association rule mining [10] was introduced earlier by Agrawal and Srikant
and is described as follows.
2.5.1 Apriori Algorithm-Formal Description
Apriori was the first algorithm developed by R. Agrawal and R. Srikant for
association rule mining [10]; it generates candidate itemsets to find frequent itemsets.
The algorithm is defined over a set of items I = {i1, i2, i3, ..., im}. D is the set of
database transactions, where each transaction T is a set of items such that T ⊆ I. A TID
is associated with each transaction. Let A be a set of items; a transaction T is said to
contain A if A ⊆ T.
An association rule is an implication A ⇒ B, where A ⊂ I, B ⊂ I, and
A ∩ B = ∅. The rule A ⇒ B holds in the transaction set D with support s, where s is
the percentage of transactions in D that contain A ∪ B (i.e., both A and B). This is taken
as the probability P(A ∪ B).
2.5.1.1 Support and Confidence [10]
Support(A ⇒ B) = P(A ∪ B) =
(# tuples containing both A and B) / (total # of tuples)
Confidence(A ⇒ B) = P(B | A) =
(# tuples containing both A and B) / (# tuples containing A)
Rules satisfying both a minimum support threshold (min-sup) and a minimum
confidence threshold (min-conf) are called strong. For simplicity, we write support and
confidence values as percentages between 0% and 100% rather than between 0 and 1.0.
A set of items is referred to as an itemset; an itemset that contains k items is a k-itemset.
For example, the set {computer, financial_management_software} is a 2-itemset. The
set of frequent k-itemsets is commonly denoted by Lk.
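The support and confidence definitions above can be computed directly; the small transaction database below is illustrative:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, A, B):
    """conf(A => B) = support(A union B) / support(A)."""
    return support(transactions, set(A) | set(B)) / support(transactions, A)

db = [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}]
print(support(db, {1, 2}))       # 3 of 4 transactions contain both -> 0.75
print(confidence(db, {1}, {2}))  # 0.75 / 1.0 -> 0.75
```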
2.5.1.2 Formal Definition: Apriori property [10]
All nonempty subsets of a frequent itemset must also be frequent. Apriori is an
important algorithm for mining frequent itemsets for association rules. Its name reflects
the fact that the algorithm uses prior knowledge of frequent itemset properties. Apriori
employs an iterative approach in which k-itemsets are used to explore (k + 1)-itemsets.
First, the set of frequent 1-itemsets, denoted L1, is found; L1 is used to find L2, the set of
frequent 2-itemsets, which is used to find L3, and so on, until no more frequent
k-itemsets can be found. The Apriori property presented below is used to reduce the
search space.
The Apriori property says that all subsets of a frequent itemset must also be
frequent. It belongs to a special category of properties called anti-monotone, in the
sense that if a set cannot pass a test, all of its supersets will fail the same test as well; the
property is called anti-monotone because it is monotonic in the context of failing a test.
2.5.1.3 Algorithm : Apriori
Algorithm Apriori
// Join step: Ck is generated by joining Lk-1 with itself
// Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a
// frequent k-itemset
Begin
    Ck : candidate itemsets of size k
    Lk : frequent itemsets of size k
    L1 := {frequent 1-itemsets}
    for (k = 1; Lk ≠ Ø; k++) do
    Begin
        Ck+1 := candidates generated from Lk
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that
            are contained in t
        Lk+1 := candidates in Ck+1 with support ≥ min_support
    End
    Return ∪k Lk
End
Figure 2.1 Apriori Algorithm
Repeatedly scanning the database and checking a large set of candidates is
tedious: the Apriori algorithm scans the database many times and generates a large
number of candidate itemsets, which is inefficient.
Let us consider the sample database shown in Table 2.1.
Customer_Id Item_Id
1 {1,2,3,4,5}
2 {1,3}
3 {1,2}
4 {1,2,3,4}
Table 2.1 : Sample Database
For example, there are five different items (1 to 5) and four different
transactions. If we set the minimum support to 0.5, then the frequent itemsets are {1},
{2}, {3}, {4}, {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}, {1,2,3}, {1,2,4}, {1,3,4},
{2,3,4}, and {1,2,3,4}; each occurs in at least half of the transactions. An itemset is
called frequent if its support is at least the minimum support; otherwise it is called
infrequent.
For instance, itemset {1, 2} is frequent: three of the four transactions
(transactions 1, 3, and 4) contain items 1 and 2, so its support is 0.75, which exceeds
0.5. On the other hand, itemset {2, 5} is infrequent, since only one of the four
transactions contains both items 2 and 5, giving a support of 0.25, which is below the
minimum support of 0.5.
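A compact, illustrative Apriori implementation reproduces this example; it follows the join/prune scheme of Figure 2.1, though the code organization is our own:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise Apriori sketch with anti-monotone pruning."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets.
    level = {frozenset([i]) for i in items
             if sum(i in t for t in transactions) / n >= minsup}
    frequent = set(level)
    k = 2
    while level:
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must already be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) / n >= minsup}
        frequent |= level
        k += 1
    return frequent

db = [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}]
print(len(apriori(db, 0.5)))  # the 15 frequent itemsets listed above
```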
2.5.2 Algorithm - Apriori-gen
Algorithm Apriori-gen
Input: a database and a user-defined minimum support
Output: all frequent itemsets
Begin
    L0 := Ø; k := 1
    C1 := {{i} | i ∈ I}
    Answer := Ø
    while Ck ≠ Ø do
        read the database and count supports for Ck
        Lk := {frequent itemsets in Ck}
        Ck+1 := Apriori-gen(Lk)
        k := k + 1
        Answer := Answer ∪ Lk
    Return Answer
End
Figure 2.2 Apriori-gen Algorithm
2.5.2.1 The join procedure - Apriori-gen algorithm
Input: Lk, the set of frequent itemsets found in pass k
Output: preliminary candidate set Ck+1
Begin
    for i from 1 to |Lk| - 1
        for j from i + 1 to |Lk|
            if Lk.itemseti and Lk.itemsetj have the same (k-1)-prefix
                Ck+1 := Ck+1 ∪ {Lk.itemseti ∪ Lk.itemsetj}
            else
                break
End
Figure 2.3 Apriori-gen Algorithm: Join Procedure
2.5.2.2 The prune procedure of the Apriori-gen algorithm
Input: preliminary candidate set Ck+1 generated by the join procedure
Output: final candidate set Ck+1, which does not contain any infrequent subset
Begin
    for all itemsets c in Ck+1
        for all k-subsets s of c
            if s ∉ Lk
                delete c from Ck+1
End
Figure 2.4: Apriori-gen Algorithm: Prune Procedure
Apriori-gen [37] was developed later and uses the Apriori property; however,
its candidate generation process is divided into two steps. First, the preliminary
candidate set is computed as C'k = {X ∪ X' | X, X' ∈ Lk-1 and |X ∩ X'| = k-2},
whereas the actual candidates are generated by Ck = {X ∈ C'k | X contains k members
of Lk-1}.
Apriori-gen improves on Apriori by reducing the number of candidates, a
technique also used in the Partition, DHP, and Sampling algorithms.
2.5.3 DHP Algorithm [14]
DHP [14] improves on Apriori by using a hash filter while counting support for
the next pass. Reducing the number of candidate itemsets is one of the important tasks
for increasing efficiency, and the support value is used to eliminate candidates. The
algorithm reduces the number of candidates in the second pass, which is very large in
Apriori. The DHP technique [14] was thus proposed to reduce the number of candidates
Ck in the early passes (k > 1), and thereby the size of the database is also reduced. In
this method, support is counted by mapping the itemsets from the candidate list into the
buckets of a hash table. When an itemset is encountered, if its bucket already exists the
bucket count is incremented; otherwise it is inserted into a new bucket. At the end, every
bucket whose support count is less than the minimum support is removed from the
candidate set.
Here we use an example to show how the hash filter works. Suppose {1}, {2},
{3}, and {5} are the frequent 1-itemsets in a database over five items 1, 2, 3, 4, and 5. In
the first pass, while each transaction is examined, DHP [14] not only updates the
support of all 1-itemsets in the transaction but also updates the counts in a hash table for
2-itemsets, using a hash function.
Suppose the hash function is h({x, y}) = (10x + y) mod 7. The transaction
{1, 3, 5} increments the supports of the 1-itemsets {1}, {3}, and {5}; DHP also updates
the counts at indices h({1, 3}), h({1, 5}), and h({3, 5}) of the hash table, i.e., indices 6,
1, and 0. When the database is read again, if the count in a bucket is less than the
minimum support, the 2-itemsets in that bucket are considered infrequent and the value
0 is set in the filter; otherwise, the value 1 is set in the filter.
The candidates are pruned using the filter before reading the database in the next
pass. However, according to the experiments in [43], this optimization may not be as
effective as using a two-dimensional array, as discussed in [44]. Like Apriori, DHP
considers every frequent itemset.
Its limitation is that while it reduces candidate generation in the earlier stages, as
the level increases the bucket sizes also increase, making both the hash table and the
candidate set difficult to manage.
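The bucket counting from the example above can be reproduced directly; the helper below sketches only the first-pass hash counting, not the full DHP algorithm:

```python
from itertools import combinations

def dhp_bucket_counts(transactions, h, n_buckets):
    """Count 2-itemset occurrences per hash bucket during the first pass,
    as DHP does alongside 1-itemset support counting."""
    buckets = [0] * n_buckets
    for t in transactions:
        for x, y in combinations(sorted(t), 2):
            buckets[h(x, y)] += 1
    return buckets

h = lambda x, y: (10 * x + y) % 7   # the hash function from the text
# {1,3} -> 13 mod 7 = 6, {1,5} -> 15 mod 7 = 1, {3,5} -> 35 mod 7 = 0
print(dhp_bucket_counts([{1, 3, 5}], h, 7))  # -> [1, 1, 0, 0, 0, 0, 1]
```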
2.5.4 Partition Algorithm-Formal Description [12]
Apriori and DHP share two limitations. First, they scan the database as many
times as the length of the longest frequent itemset. Second, most records in the database
are not useful in the later passes, since many records may not contain any item of the
remaining candidates; in other words, a record that contains no item of any candidate
can be removed without affecting the support counting process.
The Partition algorithm [12] finds the frequent elements by partitioning the
database into n parts. It overcomes the memory problem for large databases that may
not fit into main memory, because small parts of the database fit easily. The algorithm
proceeds in two passes, as shown in Figure 2.5.
Step 1 : In the first pass, the whole database is divided into n parts based on the
size of the database.
Step 2 : Each partition is loaded into main memory one by one and the locally
frequent elements are found.
Step 3 : All locally frequent elements are combined into a global candidate set.
Step 4 : The globally frequent elements are found from this candidate set.
Figure 2.5: Mining frequent itemsets using the Partition algorithm [12]
The Partition algorithm is given in Figure 2.6.
2.5.4.1 Algorithm-Partition
P := partition_database(D)
n := number of partitions
2.5.4.1.1 Phase I
for i = 1 to n do
Begin
    read_in_partition(pi ∈ P)
    Li := gen_large_itemsets(pi)
End
2.5.4.1.2 Merge Phase
for (i = 2; Lij ≠ Ø, j = 1, 2, ..., n; i++) do
Begin
    CiG := ∪ j=1,2,...,n Lij
End
2.5.4.1.3 Phase II
for i = 1 to n do
Begin
    read_in_partition(pi ∈ P)
    for all candidates c ∈ CG do gen_count(c, pi)
    LG := {c ∈ CG | c.count ≥ minsup}
End
Figure 2.6: Partition Algorithm
To resolve the first issue, the database is horizontally divided into equal-sized
partitions that fit in main memory. Each partition is processed independently in the first
pass to produce a local frequent set for that partition. This process uses a bottom-up
approach similar to Apriori but with a different data structure. After all local frequent
sets are discovered, their union forms a superset of the actual frequent set, called the
global candidate set. This relies on the fact that if an itemset is frequent, it must be
frequent in at least one of the partitions; conversely, if an itemset is not frequent in any
partition, it must be globally infrequent.
During the second pass, the actual support of the global candidate set is obtained
by reading the database again; the entire process therefore finishes within two passes. It
uses a bottom-up approach, extending the length of the candidates by one in each loop
until no more candidates are generated.
To avoid reading the database each time the candidate length is incremented, the
database is transformed into TID-lists: each candidate stores a list of the transaction IDs
that support it. The database is partitioned into sizes that fit into main memory. The
TID-list solves the second issue, since only those transactions that support current
candidates appear in the TID-lists. However, TID-lists incur additional overhead: the ID
of a transaction containing m items may appear, in the worst case, in C(m, k) TID-lists
in the kth pass.
The Partition approach has three major limitations. First, it requires choosing a
good partition size to obtain good performance: if the partition is too big, the TID-lists
may grow too fast and fail to fit in main memory; but if the partition is too small, the
global candidate set becomes large and many of its candidates may turn out to be
infrequent.
The second limitation is that it is negatively impacted by data skew, which
causes the local frequent sets to differ greatly from each other; the global candidate set
then becomes very large.
The third limitation is that the algorithm considers more candidates than Apriori,
so it is infeasible for long maximal frequent itemsets.
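The two-phase scheme of Figure 2.5 can be sketched as follows; `brute_force_miner` is a stand-in for any local frequent-itemset miner (e.g. an Apriori implementation) and is exhaustive only to keep the example short:

```python
from itertools import combinations

def brute_force_miner(part, minsup):
    """Tiny exhaustive miner, used here only to keep the sketch runnable."""
    items = sorted({i for t in part for i in t})
    return {frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if sum(set(c) <= set(t) for t in part) / len(part) >= minsup}

def partition_mining(db, n_parts, minsup, local_miner):
    """Two-pass Partition sketch: mine each partition locally, union the
    local frequent sets into a global candidate set, then count the true
    supports in a second full pass."""
    size = -(-len(db) // n_parts)                 # ceiling division
    parts = [db[i:i + size] for i in range(0, len(db), size)]
    # Phase I: locally frequent itemsets per partition.
    candidates = set()
    for part in parts:
        candidates |= local_miner(part, minsup)
    # Phase II: verify global support with one more scan of the database.
    n = len(db)
    return {c for c in candidates
            if sum(c <= set(t) for t in db) / n >= minsup}

db = [{1, 2}, {1, 3}, {1, 2, 3}, {2, 3}]
print(sorted(map(sorted, partition_mining(db, 2, 0.5, brute_force_miner))))
# -> [[1], [1, 2], [1, 3], [2], [2, 3], [3]]
```

Note how {1, 2, 3} enters the global candidate set (it is locally frequent in the second partition) but is discarded in Phase II because its true support is only 0.25.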
2.5.5 Sampling Algorithm [13]
Since the Partition approach processes the whole database, it increases the I/O
overhead. To reduce this, the Sampling algorithm was proposed by Toivonen [13]. It
considers only a sample of the database and discovers an approximate frequent set using
a bottom-up approach. This random sampling approach also overcomes the data-skew
problem of the Partition algorithm.
The idea is to pick a random sample R of itemsets from the database instead of
the whole database D, chosen so that the entire sample fits in main memory. The
algorithm finds the frequent elements for the sample only; since globally frequent
elements may be missed in the sample, a lowered support threshold is used instead of
the actual minimum support to find the elements that are frequent locally in the sample.
This algorithm is a guess-and-correct algorithm [42]. It estimates an answer in the
first pass and corrects it in subsequent passes. Unlike the Partition algorithm, which
looks at the entire database, this algorithm looks only at part of the database in the first
pass. Therefore, an itemset frequent in the sample may not actually be frequent (a false
positive), and an itemset infrequent in the sample may turn out to be frequent (a false
negative). The false positive itemsets are removed by counting their support over the
entire database; recovering the missing frequent itemsets (the false negatives) is more
difficult. The performance of the Sampling algorithm depends on the sample chosen. It
considers at least the same candidates as Apriori, and therefore still has limitations when
the frequent itemsets are long.
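The guess-and-correct idea above can be sketched as follows. The function and its parameters (`sample_frac`, `lowered_factor`) are illustrative and not taken from [13], and only single items are mined for brevity; itemset mining proceeds analogously.

```python
import random

def sampling_frequent_items(db, min_sup, sample_frac=0.1, lowered_factor=0.8):
    """Mine a random sample at a lowered threshold (guess),
    then verify the candidates against the full database (correct)."""
    sample = [t for t in db if random.random() < sample_frac]
    lowered = lowered_factor * min_sup * len(sample)
    counts = {}
    for t in sample:                      # count once per transaction
        for item in set(t):
            counts[item] = counts.get(item, 0) + 1
    candidates = {i for i, c in counts.items() if c >= lowered}
    full = {i: 0 for i in candidates}
    for t in db:                          # correction pass over the whole db
        for i in candidates & set(t):
            full[i] += 1
    return {i for i, c in full.items() if c >= min_sup * len(db)}
```

False negatives, i.e. itemsets frequent globally but missed by the sample, would require a further pass, which is the difficult case noted above.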
2.5.6 Dynamic Itemset Counting (DIC) [15]
This algorithm is also used to reduce the number of database scans. It is based on the
downward closure property and adds candidate itemsets at different points of time
during the scan. The database is divided into dynamic blocks marked by start points
and, unlike the earlier Apriori-style techniques, the set of candidates is changed
dynamically during the database scan.
2.5.7 Improved versions of Apriori [16]
The improved version of the Apriori algorithm [16] is based on a combination of
forward and reverse scans of a given database. If certain conditions are satisfied, the
improved algorithm can greatly reduce the iterations and scanning time required for the
discovery of candidate itemsets.
If an itemset is frequent, all of its nonempty subsets are frequent. Based on this
observation, an improved Apriori method combining forward and reverse thinking was
proposed. First, it finds the maximal frequent itemsets starting from the maximum
itemset. It then derives all the nonempty subsets of these frequent itemsets, which are
known to be frequent by Apriori's property. Next it scans the database again from the
smallest itemsets and counts the frequent itemsets. During this scan, if an item is found
to be excluded from the frequent set, the itemsets associated with it are checked for
frequency; if they are frequent, they are added to the barrel structure. In this way all the
frequent itemsets are obtained. The key to this algorithm is finding the maximal
frequent itemset quickly.
R. Srikant and R. Agrawal introduced the problem of mining sequential
sequences over such databases. They proposed the algorithms AprioriAll and
AprioriSome [7] to solve this problem and evaluated their performance using synthetic
data. Both have comparable performance, although AprioriSome performs slightly
better when the minimum number of customers that must support a sequential sequence
is low. Scale-up
experiments show that both AprioriSome and AprioriAll [7] scale linearly with the
number of customer transactions. They also have excellent scale-up properties with
respect to the number of transactions per customer and the number of items in a
transaction. Let us examine these algorithms in detail.
2.5.8 AprioriAll - Formal Description [7]
It finds the frequent subsequence itemsets, which are generated with the help of the
candidate itemsets, and it scans the dataset in every pass to find the k-large sequences.
The algorithm uses five phases:
i) Sort phase
ii) Litemset phase
iii) Transformation phase
iv) Sequence phase
v) Maximal phase
2.5.8.1 Sort Phase:
The database is sorted with customer-id as the major key and transaction-time as the
minor key. This step converts the dataset into sequential order.
2.5.8.2 Litemset Phase:
In this phase the set of all litemsets L is found. This simultaneously yields the set of all
large 1-sequences, since this is just the problem of finding large itemsets in a given set
of customer transactions, although with a slightly different definition of support, as
considered in [7]. In itemset mining, the support for an itemset is defined as the fraction
of transactions in which the itemset is present.
The main difference is that here the support count is incremented only once per
customer, even if the customer buys the same set of items in two different transactions.
The set of litemsets is then mapped to a set of contiguous integers. For the data of
Table 2.2 the large itemsets are (20), (30), (60), (30 60) and (80), as shown in Table 2.3.
Customer_Id   Transaction Time   Items Bought
1             August 24, '12     20
1             August 29, '12     80
2             August 9, '12      5, 10
2             August 14, '12     20
2             August 19, '12     30, 50, 60
3             August 24, '12     20, 40, 60
4             August 24, '12     20
4             August 29, '12     30, 60
4             August 24, '12     80
5             August 11, '12     80
Table 2.2: Mining Frequent itemsets using AprioriAll
Customer_Id   Customer Sequence         Large Itemsets   Mapped To
1             <(20)(80)>                (20)             1
2             <(5 10)(20)(30 50 60)>    (30)             2
3             <(20 40 60)>              (60)             3
4             <(20)(30 60)(80)>         (30 60)          4
5             <(80)>                    (80)             5
Table 2.3: Mapping of sequences (min_sup_count = 2)
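The litemset-phase counting rule (support incremented once per customer) can be sketched as follows; the representation of customer sequences as lists of transactions is an assumption made for illustration:

```python
def litemset_support(customer_seqs, itemset):
    """Support = number of customers whose sequence contains at least one
    transaction including the itemset (counted once per customer)."""
    itemset = frozenset(itemset)
    return sum(any(itemset <= set(t) for t in seq) for seq in customer_seqs)

# customer sequences from Table 2.2
db = [
    [[20], [80]],
    [[5, 10], [20], [30, 50, 60]],
    [[20, 40, 60]],
    [[20], [30, 60], [80]],
    [[80]],
]
```

With min_sup_count = 2, for example, litemset_support(db, [30, 60]) gives 2 (customers 2 and 4), so (30 60) is large, matching Table 2.3.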
2.5.8.3 Transformation Phase:
This phase repeatedly determines which of a given set of large sequences are
contained in a customer sequence. To make this test fast, each customer sequence is
transformed into an alternative representation, as shown in Table 2.3 and Table 2.4. In a
transformed customer sequence, each transaction is replaced by the set of all litemsets
contained in that transaction. If a transaction does not contain any litemset, it is not
retained in the transformed sequence. If a customer sequence does not contain any
litemset, the sequence is dropped from the transformed database; however, it still
contributes to the count of the total number of customers. A customer sequence is thus
represented by a list of sets of litemsets.
Customer_Id   Original Customer Sequence   Transformed Customer Sequence          After Mapping
1             <(20)(80)>                   <{(20)} {(80)}>                        <{1} {5}>
2             <(5 10)(20)(30 50 60)>       <{(20)} {(30), (60), (30 60)}>         <{1} {2, 3, 4}>
3             <(20 40 60)>                 <{(20), (60)}>                         <{1, 3}>
4             <(20)(30 60)(80)>            <{(20)} {(30), (60), (30 60)} {(80)}>  <{1} {2, 3, 4} {5}>
5             <(80)>                       <{(80)}>                               <{5}>
Table 2.4: Transformed sequences
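The transformation rule just described can be sketched as follows; modelling itemsets as frozensets is an illustrative choice, not the representation used in [7]:

```python
def transform_sequence(customer_seq, litemsets):
    """Replace each transaction by the set of litemsets it contains;
    transactions containing no litemset are dropped."""
    transformed = []
    for trans in customer_seq:
        contained = {ls for ls in litemsets if ls <= trans}
        if contained:
            transformed.append(contained)
    return transformed

# customer 4 from Table 2.2: <(20)(30 60)(80)>
litemsets = [frozenset(s) for s in ([20], [30], [60], [30, 60], [80])]
seq = [frozenset(s) for s in ([20], [30, 60], [80])]
```

Here transform_sequence(seq, litemsets) yields <{(20)} {(30), (60), (30 60)} {(80)}>, as in Table 2.4.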
2.5.8.4 Sequence Phase:
The algorithm scans the dataset multiple times. Each scan starts with a seed set of
large sequences, which is used to generate new potentially large sequences, called
candidate sequences. The support for these candidate sequences is found during the
scan over the data. At the end of the scan, the algorithm determines which of the
candidate sequences are actually frequent; these large candidates become the seed for
the next scan. There are two families of algorithms, called count-all and count-some. A
count-all algorithm counts all the large sequences, including non-maximal sequences,
which must then be pruned out in the maximal phase. The authors presented one
count-all algorithm, called AprioriAll, and two count-some algorithms: AprioriSome
and DynamicSome.
2.5.8.5 Maximal Phase:
Having found the set of all large sequences S in the sequence phase, this phase finds
the maximal sequences among them. Let the length of the longest sequence be n.
The AprioriAll algorithm is shown in Figure 2.7.
2.5.9 Algorithm-AprioriAll [7]
L1 = {large 1-sequences}
for ( k = 2; Lk-1 ≠ ∅; k++ ) do
begin
    Ck = new candidates generated from Lk-1
    for each customer-sequence c in the database do
        increment the count of all candidates in Ck that are contained in c
    Lk = candidates in Ck with minimum support
end
Answer = maximal sequences in ∪k Lk
Figure 2.7 : Algorithm AprioriAll
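The count-all scheme of Figure 2.7 can be sketched as follows for sequences of mapped litemset ids. The candidate-generation join used here is a simplified one, and the helper names are illustrative, not those of [7]:

```python
def is_subseq(s, c):
    """True if sequence s occurs, in order, within customer sequence c."""
    it = iter(c)
    return all(x in it for x in s)

def apriori_all(db, min_sup):
    """db: list of customer sequences, each a tuple of litemset ids."""
    items = {x for c in db for x in c}
    L = {1: {(x,) for x in items if sum(x in c for c in db) >= min_sup}}
    k = 2
    while L[k - 1]:
        # join (k-1)-sequences that overlap in their k-2 middle ids
        C = {a + (b[-1],) for a in L[k - 1] for b in L[k - 1]
             if a[1:] == b[:-1]}
        L[k] = {s for s in C if sum(is_subseq(s, c) for c in db) >= min_sup}
        k += 1
    large = {s for ls in L.values() for s in ls}
    # maximal phase: keep only sequences not contained in a longer one
    return {s for s in large
            if not any(s != t and is_subseq(s, t) for t in large)}
```

For example, apriori_all([(1, 2, 3), (1, 3), (2, 3), (1, 2)], 2) returns the maximal large sequences {(1, 2), (1, 3), (2, 3)}.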
2.5.10 AprioriSome - Formal Description [7]
The AprioriSome algorithm [7] runs in a forward and a backward pass. In the forward
pass it counts sequences of only certain lengths; for example, it may count sequences of
lengths 1, 2, 4 and 6 in the forward pass and sequences of lengths 3 and 5 in the
backward pass. It saves time by not counting sub-sequences that are not maximal,
which is a gain in both time and memory when only the maximal sub-sequences, rather
than all frequent sub-sequences, are required. The detailed algorithm is given in
Figure 2.8.
2.5.10.1 Algorithm - AprioriSome: Forward Phase
L1 = {large 1-sequences}
C1 = L1
last = 1
for ( k = 2; Ck-1 ≠ ∅ and Llast ≠ ∅; k++ ) do
begin
    if (Lk-1 known) then
        Ck = new candidates generated from Lk-1
    else
        Ck = new candidates generated from Ck-1
    if (k == next(last)) then
    begin
        for each customer-sequence c in the database do
            increment the count of all candidates in Ck that are contained in c
        Lk = candidates in Ck with minimum support
        last = k
    end
end
2.5.10.2 AprioriSome: Backward Phase
for ( k--; k >= 1; k-- ) do
    if (Lk not found in forward phase) then
    begin
        delete all sequences in Ck contained in some Li, i > k
        for each customer-sequence c in DT do
            increment the count of all candidates in Ck that are contained in c
        Lk = candidates in Ck with minimum support
    end
    else
        delete all sequences in Lk contained in some Li, i > k
Answer = ∪k Lk
Here DT is the transformed database.
Figure 2.8: Algorithm AprioriSome
2.5.11 Relative performance - AprioriAll & AprioriSome
The major advantage of AprioriSome over AprioriAll is that it avoids counting many
non-maximal sequences. However, this advantage is reduced for two reasons. First,
candidates Ck in AprioriAll are generated using Lk-1, whereas AprioriSome sometimes
has to generate them from Ck-1, so the number of candidates generated by AprioriSome
can be larger. Second, although AprioriSome skips counting candidates of some
lengths, the candidates are still generated and stay memory resident. If memory fills up,
AprioriSome is forced to count the last set of candidates generated, even if the heuristic
suggests skipping more candidate sets. This effect decreases the skipping distance
between the candidate sets that are actually counted, and AprioriSome starts behaving
more like AprioriAll. For lower supports there are longer large sequences, hence more
non-maximal sequences, and AprioriSome does better.
2.5.12 DynamicSome - Formal Description [7]
DynamicSome generates candidates on-the-fly using the large sequences found in the
previous passes and the customer sequences read from the database. The algorithm has
four phases, shown in Figure 2.9. First is the initialization phase, in which all the large
sequences up to length step are counted, where step is a chosen parameter. Second is
the forward phase, in which all sequences whose length is a multiple of step are
counted. Third is the intermediate phase, in which the candidate sequences not counted
in the first two phases are generated; unlike in AprioriSome, these candidates were not
produced in the forward phase. The last phase is the backward phase, which is identical
to that of AprioriSome.
The limitation of this algorithm is main-memory capacity: it fails when there is little
main memory or there are many potentially large sequences.
2.5.13 Algorithm - DynamicSome
2.5.13.1 Initialization Phase
L1 = {large 1-sequences}
for ( k = 2; k <= step and Lk-1 ≠ ∅; k++ ) do
begin
    Ck = new candidates generated from Lk-1
    for each customer-sequence c in DT do
        increment the count of all candidates in Ck that are contained in c
    Lk = candidates in Ck with minimum support
end
2.5.13.2 Forward Phase
for ( k = step; Lk ≠ ∅; k += step ) do
begin
    Find Lk+step from Lk and Lstep:
    Ck+step = ∅
    for each customer-sequence c in DT do
    begin
        X = otf-generate(c, Lk, Lstep)
        for each sequence x ∈ X, increment its count in Ck+step
    end
    Lk+step = candidates in Ck+step with minimum support
end
2.5.13.3 Intermediate Phase
for ( k--; k > 1; k-- ) do
    if (Lk not yet determined) then
        if (Lk-1 known) then
            Ck = new candidates generated from Lk-1
        else
            Ck = new candidates generated from Ck-1
Figure 2.9: Algorithm DynamicSome
2.5.14 GSP [6]
R. Srikant and R. Agrawal introduced the GSP algorithm. It uses the downward-
closure property of sequential sequences and a multiple-pass, candidate
generate-and-test approach over a horizontal data format. During the first scan it finds
all the frequent items with minimum support; each such item gives a 1-event frequent
sequence. Candidate 2-sequences are formed from these frequent sequences, and the
higher-level candidates are generated from the candidates of the previous step. This
process is repeated until no more frequent sequences are found.
2.5.14.1 Formal Description
The algorithm [6] makes multiple passes over the data. The first pass determines the
support of each item; at its end, the algorithm knows which items are frequent, and each
such item yields a 1-element frequent sequence. Each subsequent pass starts with the
frequent sequences found in the previous pass, called the seed set. The seed set is used
to generate new potentially frequent sequences, named candidate sequences. Each
candidate sequence contains one more item than a seed sequence, so within a pass all
candidate sequences have the same number of items. The support of the candidate
sequences is found during the pass, and the algorithm then determines which of them
are actually frequent. These frequent candidates become the seed for the next pass. The
algorithm terminates when no frequent sequences are found, or no candidates are
generated, at the end of a pass.
It uses two key steps:
1. Candidate generation: candidate sequences are generated when the pass begins.
2. Counting candidates: the support count of the candidate sequences is found.
The candidates are generated in two steps:
2.5.14.2 Join Phase: It generates candidate sequences Ck+1 by joining Lk with Lk. A
candidate is generated by joining s1 with s2: the sequence s1 is extended with the last
item of s2. The added item becomes a separate element if it was a separate element in
s2, and is merged into the last element of s1 otherwise.
2.5.14.3 Prune Phase: It deletes candidate sequences that have a contiguous
subsequence whose support value is less than the minimum support.
When counting candidates, GSP uses a hash-tree data structure to reduce the number
of candidates in C that must be tested against a data-sequence. Each data-sequence d is
transformed into a representation in which it can be checked efficiently whether a
specific candidate is a subsequence of d.
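The join phase can be sketched as follows, with a sequence represented as a tuple of elements and each element a tuple of items; the helper names are illustrative and not from [6]:

```python
def drop_first(seq):
    """Sequence obtained by dropping the first item of the first element."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]

def drop_last(seq):
    """Sequence obtained by dropping the last item of the last element."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())

def join(s1, s2):
    """GSP join: s1 and s2 join if s1 minus its first item equals
    s2 minus its last item; the last item of s2 then extends s1."""
    if drop_first(s1) != drop_last(s2):
        return None
    last = s2[-1][-1]
    if len(s2[-1]) == 1:           # separate element in s2 -> new element
        return s1 + ((last,),)
    return s1[:-1] + (s1[-1] + (last,),)   # otherwise merge into last element
```

Joining <(1 2)(3)> with <(2)(3 4)> gives <(1 2)(3 4)>, while joining it with <(2)(3)(5)> gives <(1 2)(3)(5)>.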
2.5.14.4 Relative Performance:
Figure 2.10 shows the relative performance with respect to execution time. The
synthetic datasets were generated using the synthetic data generator of the IBM Quest
data mining project. The datasets are described by the following symbols with various
values: D is the number of customers in the dataset, C the average number of
transactions per customer, T the average number of items per transaction, S the average
length of maximal sequences, and I the average length of transactions within the
maximal sequences. The values taken for the empirical analysis were
D10000-C10-T2.5-S4-I1.25; that is, 10,000 customers, an average of 10 transactions
per customer, an average of 2.5 items per transaction, maximal sequences of average
length 4, and transactions of average length 1.25 within the maximal sequences.
For the three algorithms and the given synthetic dataset, the minimum support is
decreased from 1% to 0.2%. The graph for DynamicSome is not plotted because it
generates too many candidates and runs out of memory at low minimum support. Even
with more memory, the cost of finding the support for that many candidates would
make its execution time much larger than those of AprioriAll or AprioriSome.
The execution times of all the algorithms increase as the support is decreased,
because of a large increase in the number of large sequences in the result.
DynamicSome performs worse than the other two algorithms mainly because it
generates and counts a much larger number of candidates in the forward phase.
Execution time also increases with the number of customers and with the number of
transactions per customer, and decreases as the support value increases, since fewer
sequences qualify under the minimum support criterion.
Scale-up experiments show that both AprioriSome and AprioriAll scale linearly with
the number of customer transactions. The two algorithms have similar performance,
although AprioriSome performs a little better for lower values of the minimum number
of supporting customers.
The major advantage of AprioriSome over AprioriAll is that it avoids counting many
non-maximal sequences. As Figure 2.10 shows, the timing difference between
AprioriAll and AprioriSome grows as the minimum support decreases, so AprioriSome
performs better than AprioriAll for lower support values.
Figure 2.10: Relative Performance
GSP [6] has some limitations. A huge set of candidate sequences can be generated:
1,000 frequent length-1 sequences generate 1000 × 1000 + (1000 × 999)/2 = 1,499,500
length-2 candidates. Multiple scans of the database are needed, especially for the
2-item candidate sequences, and the length of each candidate grows by one at each
database scan. GSP is therefore inefficient for mining long sequential sequences.
GSP and DynamicSome generate too many candidate items for low values of
minimum support. The execution time of all the algorithms increases as the support
decreases because of a large increase in the number of large sequences in the result.
GSP and DynamicSome perform worse; DynamicSome generates and counts a much
larger number of candidates in the forward and intermediate phases.
The efficiency [38] of all frequent sequence mining algorithms can be analyzed as
follows. Let the minimum support threshold be given, and let n = |C| be the number of
different items in the item collection C. The number of different possible itemsets |I|,
where I is the power set of C excluding the empty set, is given by equation 2.1:

|I| = Σ_{j=1}^{n} C(n, j) = 2^n − 1    …Equation 2.1

Let the database contain sequences with at most m itemsets, each itemset having at
most one item. In this case there are n^m possible different sequences with exactly m
itemsets, and the number of sequences of arbitrary length up to m is given by
equation 2.2:

Σ_{k=1}^{m} n^k = (n^{m+1} − n) / (n − 1)    …Equation 2.2

Similarly, if each itemset may have an arbitrary number of items, there are S_m
possible frequent sequences with exactly m itemsets, where S_m is given by
equation 2.3:

S_m = |I|^m = (2^n − 1)^m    …Equation 2.3

The number of sequences S in general is given by equation 2.4:

S = Σ_{k=1}^{m} (2^n − 1)^k = ((2^n − 1)^{m+1} − (2^n − 1)) / (2^n − 2)    …Equation 2.4
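Equations 2.1 to 2.4 can be checked numerically for small values of n and m; the values n = 4, m = 3 below are chosen only for illustration:

```python
from math import comb

n, m = 4, 3

# Equation 2.1: number of non-empty itemsets over n items
assert sum(comb(n, j) for j in range(1, n + 1)) == 2 ** n - 1

# Equation 2.2: sequences of length 1..m over single-item itemsets
assert sum(n ** k for k in range(1, m + 1)) == (n ** (m + 1) - n) // (n - 1)

# Equation 2.3: sequences of exactly m arbitrary itemsets
S_m = (2 ** n - 1) ** m

# Equation 2.4: sequences of length 1..m with arbitrary itemsets
S = sum((2 ** n - 1) ** k for k in range(1, m + 1))
assert S == ((2 ** n - 1) ** (m + 1) - (2 ** n - 1)) // (2 ** n - 2)
```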
2.5.15 FreeSpan [5]
The FreeSpan algorithm [5] was introduced by Jiawei Han and Jian Pei. FreeSpan
uses projected sequence databases to confine the search and the growth of subsequence
fragments. It first scans the database to find the frequent items, i.e. the length-1 lists.
The complete set of sequential sequences is then divided into a number of subsets
according to these frequent items, and the frequent item lists are generated without
overlap. FreeSpan was the first algorithm to use a bi-level projection technique for
finding frequent sub-sequences; this technique reduces the number of projected
databases.
It has advantages over Apriori-based algorithms. The alternative-level projection in
FreeSpan [5] reduces the cost of scanning multiple projected databases and takes
advantage of Apriori-style candidate pruning. It works faster than Apriori because it
examines substantially fewer combinations of subsequences.
FreeSpan [5] also has bottlenecks. Its major overhead is that it may generate many
nontrivial projected databases: if a pattern appears in each sequence of a database, its
projected database does not shrink and is likely to be as large as the original database.
Moreover, since the growth of a subsequence is explored at any split point in a
candidate sequence, projection is very expensive and some unnecessary sequences are
generated.
2.5.16 SPADE [4]
SPADE (Sequential PAttern Discovery using Equivalence classes) [4] was developed
by Zaki. SPADE outperforms GSP (Generalized Sequential Patterns) [6] by a factor of
two, and by an order of magnitude when the supports of 2-sequences are precomputed.
SPADE [4] uses only simple temporal join operations on id-lists. As the length of a
frequent sequence increases, the size of its id-list decreases, resulting in very fast joins.
No complicated hash-tree structure is used, and no overhead of generating and
searching subsequences is incurred. SPADE [4] has excellent locality, since a join
requires only a linear scan of two lists.
As the minimum support is lowered, more and longer frequent sequences are found.
GSP makes a complete dataset scan in each iteration, whereas SPADE [4] usually
restricts itself to only three scans. The algorithm uses the vertical data format. A sample
dataset is shown in Table 2.5.
Seq ID Sequences
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
Table 2.5 : Data set Example
SPADE converts the dataset into the vertical format using SID and CID: SID
identifies the sequence an item belongs to, and CID identifies the transaction within
that sequence, as shown in Table 2.6.
SID   CID   Items
1     1     a
1     2     abc
1     3     ac
1     4     d
1     5     cf
2     1     ad
2     2     c
2     3     bc
2     4     ae
3     1     ef
3     2     ab
3     3     df
3     4     c
3     5     b
4     1     e
4     2     g
4     3     af
4     4     c
4     5     b
4     6     c
Table 2.6: Vertical data format
The SID of the first occurrence of 'a' is 1 (its sequence index is 1) and its CID is 1
(it occurred in the 1st transaction of that sequence). In the vertical data format, each
item occurrence thus has an SID and a CID. Scanning the dataset for frequent items
with a minimum support of 2 yields a, b, c, d, e and f, as shown in Table 2.7.
'a'                  'b'
SID   CID            SID   CID
1     1              1     2
1     2              2     3
1     3              3     2
2     1              3     5
2     4              4     5
3     2
4     3

ab                          ba
SID   CID(a)   CID(b)       SID   CID(b)   CID(a)
1     1        2            1     2        3
2     1        3            2     3        4
3     2        5
4     3        5
Table 2.7: Vertical data format with temporal joins
Next the 2-length sequences are found. Candidate sequences are generated from the
1-length sequences and then checked for frequency. To build the table for the sequence
'ab', SPADE joins those occurrences of 'a' and 'b' that have the same SID, where the
CID of 'a' must have a lower index than the CID of 'b'. For SID 1, the sequence 'ab' is
generated because the CIDs of 'a' and 'b' are 1 and 2 respectively, indicating that 'a'
occurs before 'b'. Similarly the 3-length sequences are generated, and so on.
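The temporal join on (SID, CID) id-lists can be sketched as follows, using the id-lists of Table 2.7; the function name is illustrative:

```python
def temporal_join(idlist_a, idlist_b):
    """Id-list of the sequence <a b>: keep (sid, cid_b) pairs where
    item a occurs in an earlier transaction of the same sequence."""
    return sorted({(sid_b, cid_b)
                   for sid_a, cid_a in idlist_a
                   for sid_b, cid_b in idlist_b
                   if sid_a == sid_b and cid_a < cid_b})

a = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 4), (3, 2), (4, 3)]
b = [(1, 2), (2, 3), (3, 2), (3, 5), (4, 5)]
ab = temporal_join(a, b)                 # id-list of <a b>
support = len({sid for sid, _ in ab})    # number of distinct sequences
```

This reproduces the 'ab' join of Table 2.7, and the support of <a b> is the number of distinct SIDs in the resulting id-list.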
This algorithm uses the id-lists to find the frequent items and works faster than GSP,
although the repeated joins still cost time.
The limitation of SPADE is that it needs an exponential number of short candidates:
about 10^30 candidate sequences are required to obtain a single length-100 sequential
sequence, since Σ_{i=1}^{100} C(100, i) = 2^100 − 1 ≈ 10^30.
The C10-T2.5-S4-I1.25 dataset is used in experiments with minimum support levels
ranging from 0.25% to 1%. The comparative results are shown in Figure 2.11. It is
observed that SPADE outperforms both GSP and FreeSpan, while FreeSpan performs
better than GSP because it reduces the cost of scanning multiple projected databases
several times. As the minimum support decreases, the execution time increases, as can
be seen in Figure 2.11.
Figure 2.11: Comparison – GSP, Freespan, SPADE
2.5.17 PrefixSpan [9]
This algorithm uses a pattern-growth approach and never generates candidates that
do not appear in the database. It uses several optimization methods, and for closed-
sequence mining it is easy to extend with other constraints. It follows a divide-and-
conquer technique: it first generates the projected databases and then finds the frequent
sequences.
To overcome the bottlenecks of FreeSpan, Jiawei Han and Jian Pei developed a new
algorithm called PrefixSpan [2]. It outperforms both Apriori and FreeSpan in almost all
settings, such as a huge number of sequences or low support. Different projection
methods are used for PrefixSpan [2]: level-by-level projection, bi-level projection, etc.
The PrefixSpan algorithm is given in Figure 2.12.
Algorithm PrefixSpan
Input: a sequence database S, the minimum support
Output: the complete set of sequential patterns
begin
    call PrefixSpan(<>, 0, S)
procedure PrefixSpan(α, L, S|α)
    scan S|α once and find each frequent item b such that
        b can be appended to α to form a sequential sequence
    for each frequent item b, append it to α to form a
        sequential sequence α' and output α'
    for each α', construct the α'-projected database S|α'
        and call PrefixSpan(α', L+1, S|α')
Figure 2.12: Algorithm PrefixSpan
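For sequences whose elements are single items, the recursion of Figure 2.12 can be sketched as follows; this is a simplification, since compound elements such as (bc) are not handled:

```python
def prefixspan(db, min_sup, prefix=()):
    """Pattern growth: grow the prefix by each locally frequent item,
    then recurse on the corresponding projected (suffix) database."""
    patterns = []
    counts = {}
    for seq in db:                       # count each item once per sequence
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item in sorted(counts):
        if counts[item] < min_sup:
            continue
        pattern = prefix + (item,)
        patterns.append(pattern)
        # project: keep the suffix after the first occurrence of item
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        patterns.extend(prefixspan(projected, min_sup, pattern))
    return patterns
```

For db = [('a','b','c'), ('a','c','b'), ('a','b')] and min_sup = 2 this yields the patterns ('a',), ('a','b'), ('a','c'), ('b',) and ('c',).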
The first step of PrefixSpan is to scan the sequence database to get the length-1
sequences, which are in fact the large 1-itemsets. The database is then divided into
partitions according to the length-1 sequences, each partition being the projection of the
sequence database that takes the corresponding length-1 sequence as prefix. Each
projected database contains only the suffixes of these sequences. All length-2
sequential sequences are generated from the projected databases of their length-1
prefixes, and the projected databases are then partitioned again by the length-2
sequences. The same process is repeated until the projected database is empty or no
more frequent length-k sequences are generated.
Consider the sequence database shown in Table 2.8 over the items {a, b, c, d}. The
sequence <ac(bc)d(abc)ad> has 7 elements: (a), (c), (bc), (d), (abc), (a) and (d).
Table 2.8: Sequence Database
Customer ID   Customer Sequence
1             <ac(bc)d(abc)ad>
2             <b(cd)ac(bd)>
3             <d(bc)(ac)(cd)>

The main cost of the above method is the time and space used to construct and scan
the projected databases, shown in Table 2.9. This is called level-by-level projection.
Another projection method, called bi-level projection, is used to reduce the number of
projected databases. The first step is the same: by scanning the sequence database we
get the frequent 1-sequences.
In the second step, instead of constructing projected databases, an n×n triangular
matrix is constructed as shown in Table 2.10. It represents the support of all length-2
sequences. For example, [<d>, <a>] = (3, 3, 0) means that the supports of <d a>,
<a d> and <(ad)> are 3, 3 and 0 respectively. Projected databases are then created only
for the length-2 sequences that are frequent and pass the minimum support threshold.
The pseudo-projection technique further reduces the number and size of projected
databases. The idea is as follows: instead of performing a physical projection, one can
register the index (or identifier) of the corresponding sequence and the starting
position of the projected suffix in the sequence. A physical projection of a sequence is
thus replaced by a pair <sid, offset>, where sid is the sequence identifier and offset is a
pointer into that sequence; e.g. <s, 2> means the projected suffix starts at the 2nd
position of sequence s. Pseudo-projection reduces the cost of projection substantially
when the projected database fits in main memory, but it may not be efficient for disk-
based access, since random disk access is expensive. Based on this observation, if the
original sequence database or the projected databases are too big to fit into main
memory, physical projection should be applied.
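The pseudo-projection idea can be sketched as follows; sequences are modelled as item tuples, only single-item prefixes are handled, and the names are illustrative:

```python
def pseudo_project(db, item):
    """Instead of copying suffixes, store <sid, offset> pairs pointing
    just past the first occurrence of the prefix item."""
    proj = []
    for sid, seq in enumerate(db):
        if item in seq:
            proj.append((sid, seq.index(item) + 1))
    return proj

db = [('a', 'c', 'b'), ('b', 'a'), ('c', 'b')]
pairs = pseudo_project(db, 'b')   # [(0, 3), (1, 1), (2, 2)]
```

Each pair costs a constant amount of memory, whereas a physical projection would copy the whole suffix.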
Large Itemset   Projected Database (suffix or postfix dataset)
a               <c(bc)d(abc)ad>, <c(bd)>, <(_c)(cd)>
b               <(_c)d(abc)ad>, <(cd)ac(bd)>, <(_c)(ac)(cd)>
c               <(bc)d(abc)ad>, <(_d)ac(bd)>, <(ac)(cd)>
d               <(abc)ad>, <ac(bd)>, <(bc)(ac)(cd)>
Table 2.9: Projected Database

<a>   0
<b>   (3,2,1)   0
<c>   (3,3,2)   (2,3,2)   0
<d>   (3,3,0)   (3,3,1)   (3,3,2)   0
      <a>       <b>       <c>       <d>
Table 2.10: S-Matrix
The main cost of PrefixSpan [2] is the time taken to scan the projected databases
when the dataset is huge and the support is low; this is improved by using the bi-level
projection and pseudo-projection techniques. PrefixSpan does not use a vertical
representation, so it may need to scan the database several times, the database has to be
stored in memory, and projection tables are generated for every sequence.
When the support threshold is high, there is a limited number of sequential sequences
and the sequences are short, so the methods are very close in runtime. However, as the
support threshold decreases, the time to generate the sequences grows. It is clearly seen
that FreeSpan and PrefixSpan outperform GSP, and that PrefixSpan is more efficient
than FreeSpan.
Figure 2.12: Comparison – Freespan, SPADE, PrefixSpan
2.5.18 SPAM [3]
Jay Ayres et al. developed another algorithm for sequential mining, SPAM
(Sequential PAttern Mining) [3]. It uses a bitmap representation, with which several
optimizations are possible. It uses a vertical representation of the database, so the
database needs to be scanned only once to create that representation, and the
intersection of two SID sets (sets of sequence ids) can be computed very quickly by a
logical AND of two bitmaps.
SPAM is basically based on the vertical representation of SPADE [4]. The authors
introduced a novel depth-first search strategy that integrates a depth-first traversal of
the search space with effective pruning mechanisms. The implementation combines a
vertical bitmap representation of the database with efficient support counting.
Corresponding to the two extension steps used in SPAM [3], it uses two pruning
techniques based on the Apriori heuristic to minimize the number of candidate items:
S-step pruning and I-step pruning. The S-step appends a whole new element at the end
of the sequence (e.g. <(a,b)(d)>), while the I-step adds an item to the last element of the
current sequence (e.g. <(a,b,d)>). The dataset is shown in Table 2.11.
SID Sequences
1 <(a,b,d),(b,c,d),(b,c,d)>
2 <b,(a,b,c)>
3 <(a,b),(b,c,d)>
Table 2.11: Data set
It uses the vertical data format, but differs from SPADE because SPAM works with a
bitmap representation. The vertical bitmap representation of the example dataset is
shown in Table 2.12.
SID TID {a} {b} {c} {d}
1 1 1 1 0 1
1 2 0 1 1 1
1 3 0 1 1 1
2 1 0 1 0 0
2 2 1 1 1 0
3 1 1 1 0 0
3 2 0 1 1 1
Table 2.12: Vertical format
The table represents a bitmap over the (SID, TID) pairs: 1 indicates that the item is
present in that SID and TID, and 0 that it is not, i.e. a binary format. The sequences are
generated by S-steps and I-steps. The S-step process is shown in Table 2.13.
{a}   ({a})s   {b}   ({a},{b})
1     0        1     0
0     1        1     1
0     1        1     1
0     0        1     0
1     0        1     0
1     0        1     0
0     1        1     1
Table 2.13: S-step process (({a})s AND {b})
The first column of Table 2.13 is the bitmap for {a}. The second column, ({a})s, is
derived from it and captures that item 'a' occurs in the same sequence before item 'b':
since 'a' is present in the 1st TID of SID 1, that bit is cleared and all successive TIDs of
the sequence are set to 1, indicating the positions after 'a' has appeared; in ({a})s the
bits remain 1 until the SID changes. The third column, for {b}, is unchanged. The AND
operation is then performed, and a result bit of 1 shows where <(a)(b)> can be formed
as a sequence. In this way the S-type sequences are created; the I-type sequences are
then generated, as shown in Table 2.14.
({a},{b})   {d}   ({a},{b,d})
0           1     0
1           1     1
1           1     1
0           0     0
0           0     0
0           0     0
1           1     1
Table 2.14: I-step process (({a},{b}) AND {d})
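The S-step transformation and the bitwise AND can be sketched directly on the bitmaps of Table 2.12. The list-of-bits representation is an illustrative assumption; SPAM itself packs the bits into machine words:

```python
def s_step_transform(bitmap, sid_ranges):
    """Within each sequence, clear bits up to and including the first
    set bit and set every later bit (the ({a})s transformation)."""
    out = []
    for start, end in sid_ranges:
        bits = bitmap[start:end]
        if 1 in bits:
            first = bits.index(1)
            bits = [0] * (first + 1) + [1] * (len(bits) - first - 1)
        out.extend(bits)
    return out

def bit_and(x, y):
    return [p & q for p, q in zip(x, y)]

# bitmaps from Table 2.12; the three sequences occupy rows 0-2, 3-4, 5-6
ranges = [(0, 3), (3, 5), (5, 7)]
a = [1, 0, 0, 0, 1, 1, 0]
b = [1, 1, 1, 1, 1, 1, 1]
d = [1, 1, 1, 0, 0, 0, 1]
ab = bit_and(s_step_transform(a, ranges), b)   # S-step: <(a)(b)>, Table 2.13
abd = bit_and(ab, d)                           # I-step: <(a)(b d)>, Table 2.14
```

The resulting columns match the result columns of Tables 2.13 and 2.14.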
SPAM [3] performs better on large datasets, though PrefixSpan [2] sometimes outperforms SPAM. With a huge number of items, however, SPAM works better than PrefixSpan. For small datasets SPAM consumes more memory than SPADE.
In Figure 2.13, the dataset D10000-D50000C10T5S3.5I1.25 was used, with 100 different items and 10,000 to 50,000 customers. The frequent sequences were generated and the running times (in seconds) compared. It is seen that PrefixSpan [2] runs faster than SPAM [3]. SPAM generates candidates and then computes the SID set of each candidate to calculate its support; it may generate many candidates that are not frequent, wasting time, and it can also generate candidates that do not appear in the database at all.
If the sequences are very long, memory usage goes up because each bitmap takes more space; the more frequent items there are, the more bitmaps must be kept in memory.
Figure 2.13: Comparison – PrefixSpan with SPAM
In Figure 2.13, the support is 3.5% with 100 different items, and the number of customers varies from 10,000 to 50,000. The frequent sequences are found and their running times (in seconds) compared; PrefixSpan runs faster than SPAM.
Figure 2.14: No of Customer v/s Memory
In Figure 2.14, the support is again 3.5% with 100 different items, and the number of customers varies from 10,000 to 50,000. The frequent sequences are found and the memory they use is compared. SPAM uses less memory than PrefixSpan [2], which means SPAM can deal with large datasets.
Figure 2.15: No of Transaction v/s Memory
Figure 2.15 shows the comparison of number of transactions versus memory, with the support fixed at 2.5% and 100 different items. As the transactions per customer increase, the memory used by PrefixSpan [2] increases steadily, whereas the memory used by SPAM is hardly affected. This indicates that SPAM manages its memory well.
In Figure 2.16, the support is fixed at 2.5% with 100 different items, and the number of transactions per customer varies. The frequent sequences are found and the time (in seconds) measured. As the number of transactions increases, the time to generate the sequences increases accordingly.
Figure 2.16: Memory Prefixspan v/s SPAM
In Figure 2.17, the dataset has 100 different items and the support varies from 0.04 down to 0.025. As the support decreases, the memory used increases, because more sequences qualify at lower support values.
Figure 2.17: Support v/s Memory
In 2009, Y.J. Lee [49] proposed the new algorithm for time interval sequential
mining technique based on Allen‟s theory. Their basic idea was to implement the
preprocessing algorithm in which they got time interval data from data with time points.
They worked on a medical database. For example, if a patient showed a symptom B daily between March and April, there would be several transactions recording symptom B at different time points. These transactions occurred uniformly over that period, and thus they could be summarized as a single transaction with an interval from March to April. Through this generalization process, they could produce time interval
data and reduce the size of search space for time interval sequences. The time interval
relation discovery algorithm could discover time interval relation rules among
summarized transactions involving time interval data.
They focused on the sequences of events of customers and proposed algorithms related to time interval sequence mining.
Let u ∈ U be a time granularity, where U is a set of time granularities. If a transaction is issued once a month, the time granule is one month; likewise, a sequence has a one-month time granule if each event of the sequence represents information about one month.
Given a time granularity u ∈ U [49] and a base time point v ∈ TS, an event sequence S is converted into a sequence S′.
Thus S′ = <(E1, [vs1, ve1]), (E2, [vs2, ve2]), . . ., (En, [vsn, ven])>,
where vei ≤ vsi+1 for i = 1, . . ., n − 1 and vsi, vei are positive numbers.
Also, the time interval of S′ = [vs1, ven] is converted into [1, m], where m is a positive number.
Each event pair (x, y) is included in a set of event pairs X = {(x, y) | x, y ∈ IE, x ≠ y} having a binary time interval relation R(x, y), where R is a binary time interval relation between the two events x and y.
A temporal interval relation is defined as R(x, y) = {P(x, y) | (x, y) ∈ X, P ∈ IO} [49]. The set of temporal interval operators is IO = {before, equals, meets, overlaps, during}, and P(x, y) is a binary predicate expressing the temporal interval relationship P between x and y. R(x, y) is defined as follows:
before(x, y) means that event x occurs prior to the event period of y: before(x, y) ⇔ x.ve < y.vs
equals(x, y) denotes that x and y occur in the same period: equals(x, y) ⇔ (x.vs = y.vs) ∧ (x.ve = y.ve)
meets(x, y) means that y happens immediately after the event period of x: meets(x, y) ⇔ x.ve = y.vs
overlaps(x, y) expresses that y starts before the end point of x: overlaps(x, y) ⇔ (x.vs < y.vs) ∧ (x.ve > y.vs)
during(x, y) represents that x occurs during the event period of y: during(x, y) ⇔ (x.vs > y.vs) ∧ (x.ve < y.ve)
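These five predicates translate directly into code. A sketch follows, representing each event's period as a (vs, ve) tuple (the tuple layout is an assumption made here for illustration):

```python
# Allen-style interval predicates over (vs, ve) start/end pairs,
# following the definitions in the text.

def before(x, y):    # x ends before y starts
    return x[1] < y[0]

def equals(x, y):    # same start and end points
    return x[0] == y[0] and x[1] == y[1]

def meets(x, y):     # y starts exactly when x ends
    return x[1] == y[0]

def overlaps(x, y):  # y starts after x starts but before x ends
    return x[0] < y[0] and x[1] > y[0]

def during(x, y):    # x lies strictly inside y
    return x[0] > y[0] and x[1] < y[1]
```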
They proved the theorem following Allen‟s equations. However, for large data the effort to discover temporal interval relations is too high: for all (x, y) ∈ X, the total number of possible temporal interval relations in R(x, y) is n(n − 1)m, where m is a constant, so the time complexity of Allen‟s algorithm is O(n²). Hence Allen‟s algorithm cannot be extended to large databases.
To solve this problem, Lee [49] presented a new algorithm for mining
temporal interval relation rules which is shown in Figure 2.18.
2.5.19 Allen’s Algorithm
Let IE be the event set
R(x, y) is the time interval relation
RS ← ∅
for each event x in IE
    for each event y in IE
        RS ← RS ∪ R(x, y)
Return RS
Figure 2.18 : Allen’s Algorithm [49]
Lee [49] proposed two sub algorithms. The first was an event generalization
algorithm designed for summarizing time interval sequences. It reduces the size of input
database. The second one is a time interval relation rule discovery algorithm. It discovers
time interval relation rules from time interval data that satisfies a given minimum
support.
2.5.19.1 Generalization of temporal events-Formal Description [49]
Each transaction of a given database DB consists of a customer id, a transaction time stamped with a time point, and a set of event types. A customer can issue several transactions; for example, a patient can periodically take a medical examination. Each medical examination is a transaction and shows multiple symptoms, and the symptoms are the events in the transaction. No customer has more than one transaction with the same timestamp, and all events in a transaction have the same timestamp.
2.5.19.2 Algorithm : Generalization of temporal
Input The transactions of a given database
Output A generalized events with time interval
Begin
Sort the transactions in a database DB as per
customerID(Cid) and the timestamps.
Calculate the frequent event types based on
customer ID.
Remove non-frequent event types from transactions
Calculate a set of event sequences per customer,
SS(Cid) = {ES(Cid, Ei) | Ei ∈ ETS(Cid)}, where
ETS(Cid) contains only frequent event types
Calculate a set of all event sequence set S(Cust)
Calculate a set of uniform event types. Calculate
a set of sequences having uniform event type
Delete non-uniform event types from S(Cust)
Generalize each event sequence in S(Cust) into a
generalized event with a time interval
End
Figure 2.19 : Generalization of events [49]
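The heart of this generalization step, grouping a customer's repeated time-point observations of an event type, dropping non-frequent event types, and collapsing each remaining group into one interval event, can be sketched as follows (the triple layout, threshold, and use of [min, max] as the interval are assumptions made for illustration):

```python
from collections import defaultdict

def generalize(transactions, min_count=2):
    """transactions: (customer_id, event_type, time_point) triples."""
    groups = defaultdict(list)
    for cid, event, t in transactions:
        groups[(cid, event)].append(t)
    result = []
    for (cid, event), times in groups.items():
        if len(times) >= min_count:              # drop non-frequent event types
            result.append((cid, event, (min(times), max(times))))
    return result

# A patient showing symptom 'B' at three time points becomes one
# interval event; the single observation of 'C' is dropped:
generalize([(1, 'B', 1), (1, 'B', 2), (1, 'B', 3), (1, 'C', 5)])
# [(1, 'B', (1, 3))]
```

The summarized interval events can then be fed to the relation-rule discovery step of Figure 2.20.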
2.5.19.3 Algorithm : Temporal interval relation rule discovery
Input data A database GD with generalized events & a time
interval.
Output A set of time interval relation rules {TR1, TR2,. .
.,TRn}
Begin
Find a set of all candidate time interval relations,
CR = CR(Cid_1) ∪ CR(Cid_2) ∪ . . . ∪ CR(Cid_k)
Find a set of frequent time interval relations,
FR = {Ri(x, y) | Supp(Ri(x, y)) / Ncust ≥ Suppmin and
Ri(x, y) ∈ CR}
Discover the time interval relation rules {TR1, TR2, . . ., TRn} from FR
End
Figure 2.20 : Temporal interval relation rule discovery [49]
Lee [49] proposed a new data mining technique to efficiently discover useful time
interval relation rules from time interval data on the basis of Allen‟s interval operators.
This technique is a combination of an event generalization algorithm and a time interval relation rule discovery algorithm. The event generalization algorithm summarizes events with time points and generalizes them into time interval data. The time interval relation rule discovery algorithm then generates time interval relation rules by discovering frequent time interval relations from the time interval data produced by the event generalization algorithm.
This technique has some significant advantages compared with existing methods. First, it discovers useful time interval rules from time interval data. Second, it enables us to extract time interval relation rules from a time interval database. To prove the effectiveness of the technique, Lee [49] performed several experiments while scaling up the datasets. First, the execution time of the algorithm increases slowly as the number of records increases, so it has significant performance benefits in comparison with Allen‟s algorithm. Second, the time interval relationship step and the event generalization step require the greatest amount of time among the steps of the algorithm. These algorithms apply the concept of time interval sequences; however, our proposed technique is still more effective than all the techniques discussed here.
The algorithm proposed by Dhany Saputra [1] uses the Seq-Tree framework and a separator table [1]. The separator database they propose stores the list of separator indices of each customer. Checking all items one by one in the original database is time-consuming, hence I-PrefixSpan [1] avoids doing so.
Dr. Chen [8] and his team proposed two efficient algorithms for mining time-interval sequential sequences. The first algorithm [8] is based on the conventional Apriori algorithm, while the second is based on the PrefixSpan algorithm. The second outperforms the first in computing time and scalability across various parameters.
Chapter 3
Motivation
Our literature survey and critique of various state-of-the-art methods motivated and directed us to propose a sequential sequence mining technique that overcomes their limitations and adds value to the state of the art.
Various sequential sequence mining techniques applied to data are critically evaluated and discussed in Chapter 2. These techniques can find sequential sequences in the desired manner, and each tries to overcome the deficits of earlier techniques and to improve performance on different parameters. The relevant techniques, with their limitations and merits, are elaborated in Chapter 2.
By analyzing these techniques, we came to know that very few concentrate on the memory usage and execution time needed to find sequential sequences, and very few researchers have considered the time interval between events/items. It was also observed that most state-of-the-art methods use sequential sequence mining as a feature in various applications, yet they have paid little attention to large database sizes. Hence we decided to focus on this issue, and it was a great challenge for us. We first proposed a sequential sequence mining technique for small datasets, then extended it to large databases by proposing further algorithms. With much effort, we achieved notable improvement with our technique.
Our proposed technique improves on the state-of-the-art techniques, and we believe our new approach will be useful to researchers in the area of sequential sequence mining. The proposed technique is theoretically discussed in Chapter 5 and empirically evaluated in Chapter 6, where the improved results are compared with other state-of-the-art methods.
Chapter 4
Scope of Work
Sequential sequence mining, which finds the frequent sequences in a sequential database, is a significant data mining problem with extensive applications, including the analysis of customers‟ purchase sequences or Web access sequences, the analysis of time-related processes such as scientific experiments, natural disasters and disease treatments, the analysis of DNA sequences, and so on. In the world of E-commerce, the purchasing behavior of customers can be extracted from log files, and Web managers can then actively send desired information to their customers. Thus customers not only experience the convenience of quickly obtaining information, but the likelihood that they purchase products from the company also increases. Manufacturers can analyze market demand, plan production schedules, and determine inventory levels so that they can react to market changes correctly and quickly.
The scope of our algorithm is to provide more efficient sequential sequences with respect to various evaluation metrics. The details are discussed in chapters 5 and 6.
Our algorithms improve the performance and efficiency compared with various algorithms developed for sequential sequences, such as DynamicSome, GSP, AprioriSome, AprioriAll, SPAM, PrefixSpan [2], I-PrefixSpan [8][1], etc. Our approach generates various time interval sequences by using a sequence generator table. We have analyzed various sequential mining techniques and compared them; our algorithm outperforms the other sequential sequence mining algorithms. Moreover, our algorithms have excellent scale-up properties.
While typical PrefixSpan [2] fails to provide sequences with a time interval gap [8] between them, our algorithm produces the sequences while taking care of the time interval between them.
In typical I-PrefixSpan [1], a projection table is created during the creation of every sequence, so it requires more memory and time while generating sequences. The database is also kept in memory after use, which makes the algorithm less memory-efficient. Our algorithm instead creates a sequence generator table from the original database, and the frequent sequences are created from that table. Hence it requires less memory and time and is very efficient compared with the latest algorithms developed to date.
Chapter 5
Proposed Algorithms
5.1 Sequential Sequence Mining
A sequence is defined by the order of events; sometimes events occur in one particular order. Sequential sequence mining is used to find all the frequent sequences, i.e. those which occur in the maximum number of transactions. For example, a customer who purchases a laser printer may come back to buy a printer in two months and then a scanner in three months. Let us discuss sequential sequence mining in detail.
Let two sequences α = <a1, a2, …, an> and β = <b1, b2, …, bm> be given. α is called a subsequence of β, denoted α ⊆ β, if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, an ⊆ bjn; β is then a super sequence of α. For example, <a(cd)f> is a subsequence of <a(bcd)(ef)ad>.
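This containment test can be implemented with a single greedy left-to-right scan over β; a small sketch using Python sets for itemsets:

```python
def is_subsequence(alpha, beta):
    """True iff alpha is a subsequence of beta (both lists of itemsets)."""
    j = 0
    for a in alpha:
        # find the next itemset of beta that contains a
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # later itemsets of alpha must match strictly later positions
    return True

# <a(cd)f> is a subsequence of <a(bcd)(ef)ad>:
alpha = [{'a'}, {'c', 'd'}, {'f'}]
beta  = [{'a'}, {'b', 'c', 'd'}, {'e', 'f'}, {'a'}, {'d'}]
is_subsequence(alpha, beta)   # True
```

The greedy earliest-match strategy is sufficient here because matching each itemset of α as early as possible never rules out a later match.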
The length of a sequence is the number of items it contains; a sequence of length k is called a k-sequence. For example:
Candidate 1-subsequences:
<i1>, <i2>, <i3>, …, <in>
Candidate 2-subsequences:
<i1, i2>, <i1, i3>, …, <(i1 i2)>, <(i1 i3)>, …, <(in−1 in)>
Let I = {i1, i2, …, in} be a set of items for transaction data. We call a subset X ⊆ I an itemset and |X| the size of X. A sequence K = (K1, K2, …, Km) is an ordered list of itemsets, where Ki ⊆ I for i ∈ {1, …, m}. The size m of a sequence is its number of itemsets, i.e. |K|. The length l of a sequence K = (K1, K2, …, Km) is defined as
l = |K1| + |K2| + … + |Km| = Σ_{i=1}^{m} |Ki|
Suppose K = (K1, K2, K3, K4), where K1 = {p}, K2 = {p, q}, K3 = {p, q, r} and K4 = {p, q, r, s}. Then
l = Σ_{i=1}^{4} |Ki| = 1 + 2 + 3 + 4 = 10
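In code, the length is simply the sum of the itemset sizes:

```python
def seq_length(K):
    """Length l of a sequence: total number of items over all itemsets."""
    return sum(len(itemset) for itemset in K)

K = [{'p'}, {'p', 'q'}, {'p', 'q', 'r'}, {'p', 'q', 'r', 's'}]
seq_length(K)   # 1 + 2 + 3 + 4 = 10
```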
Now let us see the following transactional dataset.
SID Sequences
1 <a(bc)(ef)ad>
2 <bcd>
3 <adb>
Table 5.1: Data set 1
Various transactions are shown in Data set 1. SID is the sequence ID of the customer, and the sequences represent the transactions made by the respective customers. Sequences are written in <…> brackets. For SID 1, the sequence is <a(bc)(ef)ad>, where a, b, c, d, e, f are item codes. Items inside (…) brackets were purchased by the customer at the same time, i.e. in a single transaction; if a customer purchases a single item in a transaction, the (…) brackets are not required. For SID 1 we have 5 transactions: in the 1st transaction item „a‟ is purchased; in the 2nd, items „b‟ and „c‟ are purchased together; in the 3rd, items „e‟ and „f‟ are purchased together; in the 4th, item „a‟ is purchased; and in the last, item „d‟ is purchased. Note that in sequence mining “ab” and “ba” have different meanings.
5.1.1 Support
The absolute support of a sequence Kp in the sequence representation of a database D is the number of sequences k ∈ D that contain Kp, and the relative support is the percentage of sequences k ∈ D that contain it. suppD(Kp) gives the support of Kp in the database, and minSup is the minimum support threshold. The sequence Kp is frequent if suppD(Kp) ≥ minSup. The problem of mining sequential sequences is to find all frequent sequential sequences in database D for a given support threshold.
The support indicates the occurrence of sequences in the database. PrefixSpan [2] gives only the frequent sequences; it does not give the time interval between successive items. Our new method produces the time interval sequences between successive items. Data set 2, shown in Table 5.2, includes the time interval between two successive items; here (a, 2) means item „a‟ occurs at time stamp 2.
Table 5.2: Data set 2
SID Sequences
1 <(a,2)(bc,4)(ef,7)(a,8)(d,9)>
2 <(b,4)(c,6)(d,7)>
3 <(a,2)(d,3)(b,6)>
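As a small illustration of relative support, the sketch below flattens the sequences of Data set 1 to plain item lists (ignoring itemsets and time stamps, a simplification for illustration) and computes the fraction of database sequences containing a pattern:

```python
def contains(pattern, sequence):
    """True iff pattern occurs as an order-preserving subsequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, database):
    """Relative support: fraction of data sequences containing pattern."""
    return sum(contains(pattern, s) for s in database) / len(database)

# The three sequences of Data set 1, flattened to plain item lists:
db = [list("abcefad"), list("bcd"), list("adb")]
support(["b", "c"], db)   # 2/3: <b, c> occurs in SIDs 1 and 2 but not 3
```

With minSup = 50%, the pattern <b, c> would therefore be reported as frequent.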
5.1.2 Super Sequence and Subsequence:
A sequence with length l is called an l-sequence. A sequence Kp = <p1, p2, …, pn> is contained in another sequence Kq = <q1, q2, …, qm> if there exist integers 1 ≤ i1 < i2 < … < in ≤ m such that p1 ⊆ qi1, p2 ⊆ qi2, …, pn ⊆ qin.
For example, let Kq = <q1, q2, q3>, where q1 = {p1, p2, p3}, q2 = {p4} and q3 = {p5, p6}; then Kq = <{p1, p2, p3}, {p4}, {p5, p6}>.
If sequence Kp is contained in sequence Kq, then Kp is called a subsequence of Kq and Kq is called a super sequence of Kp.
5.3 Formal Notations & New Equations
5.3.1 Customer : A customer C is the sequence of transactions T1, T2, . . ., Tn in the database D such that C = <T1, T2, . . ., Tn>, where Ti occurs before Tj for i < j. The customer ID represents the identity of the customer and is denoted CID.
5.3.2 Item : An event (item) I is defined as I = (E, t), where E is an item or event and t ∈ T, with T the time domain.
5.3.3 Transaction : A transaction is a set of items or events, T = (Cid, I, t), where Cid is a customer identifier, I is an item type and t is the time when the event occurred.
5.3.4 SequenceID : The sequence of transactions T1, T2, . . ., Tn such that C = <T1, T2, . . ., Tn>, where Ti occurs before Tj for i < j. All transactions of the same customer are denoted by the same SID value.
5.3.5 Equation for time interval
The equation for the time interval over sequential items P1 . . Pn and Q1 . . Qm is given for sequences of the form
<(P1 Q1 Q2 . . Qm, t1), (P2 Q1 Q2 . . Qm, t2), . . ., (Pn Q1 Q2 . . Qm, tn)>
The time interval equation is
Iαβ = tβ − tα, where α, β are time indices with β ≥ α …Equation 5.1
The time interval given by Equation 5.1 applies to items occurring at different times.
5.3.6 Equation for same time interval items
In Equation 5.1, if α = β then Iαβ = tβ − tα = 0, and the items are said to occur in the same time interval:
Iαβ = tβ − tα, where α, β are time indices with α = β …Equation 5.2
5.3.7 Equation for support
The support is the occurrence of a sequence s in the database D relative to all sequences of the database:
Support = P(s)/P(S) …Equation 5.3
where s = <(P1 Q1 Q2 . . Qm, t1), (P2 Q1 Q2 . . Qm, t2), . . ., (Pn Q1 Q2 . . Qm, tn)> and S is the total number of SIDs.
For a sequence SID, a sequence of items is represented by <i1, i2, . . ., in>, where ii = (I, ti), ii ∈ Ti and ti ≤ ti+1 for each i = 1, . . ., n − 1. The time interval between the first item i1 and the last item in is denoted [t1, tn].
5.4 Algorithms of MySSM
We have proposed a series of MySSM algorithms. The first, SYNTIM, generates synthetic data with different time intervals, transactions and items; it is given in Figure 5.1. Algorithm 2, called GCON, reads the “config.dat” file. Algorithm 3, FS & GSGT, finds the 0-sequences and generates the sequence generator table. Algorithm 4, GAS, generates all frequent sequences. Algorithm 5, CMEM, checks the memory. The 6th algorithm, OUTR, writes the sequences to “output.dat” and also generates the “analysis.dat” file. The 7th algorithm, MySSM, is the sequential sequence generation algorithm; it is the main algorithm and invokes all the others. These algorithms are shown in Figures 5.1 to 5.7.
5.4.1 Algorithm 1 : SYNTIM
Algorithm SYNTIM
Input Number of Customers, Number of Items
Output Dataset.dat,Datasetdetail.dat
Begin
Open dataset.dat file for writing
for i ← 0 to last customer do
    for j ← 0 to no of transactions do
        time ← random value
        item ← random value
    end for
end for
Close dataset.dat file
Open datasetdetail.dat file for writing
Average items per transaction ←
    total no of items / no of transactions
Average number of transactions per customer ←
    total number of transactions / total no of customers
Close datasetdetail.dat file
End
Figure 5.1 : Algorithm SYNTIM
The SYNTIM algorithm generates the customers‟ transactions with various time intervals and the items to be purchased, based on the number of transactions and the number of items available. This detail is stored in the “dataset.dat” file, which the MySSM algorithm later uses for finding sequential sequences. SYNTIM also computes the average items per transaction and the average transactions per customer, which are stored in the “datasetdetail.dat” file.
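A toy Python version of SYNTIM might look like the following; the row layout, time range, and fixed seed are assumptions made for illustration:

```python
import random

def syntim(n_customers, n_transactions, n_items, max_time=20, seed=0):
    """Generate (customer_id, time, item) rows, time-ordered per customer."""
    rng = random.Random(seed)        # fixed seed for reproducible datasets
    rows = []
    for cid in range(1, n_customers + 1):
        times = sorted(rng.randint(1, max_time) for _ in range(n_transactions))
        for t in times:
            rows.append((cid, t, rng.randint(1, n_items)))
    return rows

rows = syntim(n_customers=2, n_transactions=3, n_items=5)
# the averages written to datasetdetail.dat follow directly:
avg_tx_per_customer = len(rows) / 2   # 3.0
```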
5.4.2 Algorithm 2: GCON
Algorithm GCON
Input Config.dat
Output Time interval, range, items, support
Begin
Initialize line, data
Initialize interval, range, item, customer, minsup
Open config.dat file for reading
for line ← 1 to end of data do
    if (line == 1) then interval ← data
    else if (line == 2) then range ← data
    else if (line == 3) then item ← data
    else if (line == 4) then customer ← data
    else if (line == 5) then minsup ← data
    end if
end for
Close file
End
Figure 5.2 : Algorithm GCON
The GCON algorithm reads all the data from the “config.dat” file: first the interval of the time unit, then the range of the time interval, the items to be purchased, the number of customers and the minimum support. These values are used by the MySSM algorithm.
5.4.3 Algorithm 3: FS & GSGT
Algorithm FS & GSGT
Input dataset.dat
Output sequence generator table
Begin
Initialize datanum,indexno,i,item,time,count
Open dataset.dat
Repeat until end of file encountered
    read time index and item index
    Initialize counter, indexno
    Repeat until length of customer sequence
        Store the item index and the time where the sequence occurs
        Generate the sequence generator table
        Store using array index and time interval for each SID
        Read the item occurrences in all SIDs
        Increment the counter for each occurrence of the item
        If the counter value is more than the minimum support then
            add this item to the large item list
        else ignore it
    end repeat
end repeat
Close file
End
Figure 5.3 : Algorithm FS & GSGT
The FS & GSGT algorithm reads the “dataset.dat” file and generates the sequence
generator table, which stores the item index and time values. By using the sequence generator table, it finds the sequential sequences which occur frequently.
5.4.4 Algorithm 4: GAS
Algorithm GAS
Input sequence generator table
Output frequent sequential sequence
Begin
Declare the variables
Scan the sequence generator table
Repeat until end of file encountered
    Scan the sequence generator table by sequence ID
    Scan the sequence generator table by item ID
    Measure the repeated sequences with time ID
    If occurrence >= minimum support then
        keep it
    else ignore it
    Check other combinations
    If found then keep it
    else ignore it
end repeat
End
Figure 5.4 : Algorithm GAS
The GAS algorithm scans the sequence generator table using the sequence ID, item ID and time ID. It generates all the frequent sequences occurring in the database whose support count meets the minimum support.
5.4.5 Algorithm 5: CMEM
Algorithm CMEM
Input dataset.dat, config.dat
Output maximum memory used
Begin
Initialize maxMemory ← 0
Get total memory during runtime
Get total free memory during runtime
currentMemory ← totalMemory − freeMemory
If currentMemory >= maxMemory then
    maxMemory ← currentMemory
Return maxMemory in MB
End
Figure 5.5 : Algorithm CMEM
The CMEM algorithm finds the maximum memory used at run time. First it obtains the total memory during execution; the memory in use is the difference between the total memory and the free memory.
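The pseudocode mirrors the Java Runtime API (totalMemory() minus freeMemory()). A rough Python analogue that tracks peak allocation with the standard tracemalloc module, purely for illustration:

```python
import tracemalloc

def run_with_peak_memory(work):
    """Run work() and return (result, peak traced memory in MB)."""
    tracemalloc.start()
    result = work()
    _, peak = tracemalloc.get_traced_memory()   # (current, peak) in bytes
    tracemalloc.stop()
    return result, peak / (1024 * 1024)

# stand-in workload: allocate a large list and measure the peak
result, peak_mb = run_with_peak_memory(lambda: [0] * 100000)
```

Unlike the pseudocode's total-minus-free snapshot, tracemalloc reports only allocations made by the traced Python code, which is usually what one wants for per-algorithm comparisons.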
5.4.6 Algorithm 6: OUTR
Algorithm OUTR
Input Sequence generated by GAS
Output output.dat, analysis.dat
Begin
Open the output.dat file for writing
Write minimum support
Do while sequences exist
Write 0-sequences
Write all desired sequences generated by GAS algorithm
End do
Close file
Open analysis.dat file for writing
Write Number of Time Intervals, Gap between Time
interval, Minimum support
Write summary of all sequences generated by GAS
algorithm
Write Total number of sequence generated
Write Execution time in MilliSeconds & MaxMemory in MB
Close file
End
Figure 5.6 : Algorithm OUTR
The OUTR algorithm uses the sequences generated by the GAS algorithm. It creates the “output.dat” file and writes into it the minimum support, the 0-sequences and all frequent sequences generated by GAS. OUTR also records the status of the execution process: it creates the “analysis.dat” file, in which it writes a summary of the run, namely the number of time intervals, the gap between time intervals, the minimum support, the total number of sequences generated, the execution time in milliseconds and the maximum memory in MB. This algorithm is very important for the empirical analysis of our proposed algorithms.
5.4.7 Algorithm 7: MySSM
Algorithm MySSM
Input dataset.dat, config.dat
Output sequential sequences, Execution time, Memory used
Begin
Initialize time, range, item, support
Initialize t1, t2, maxMemory
Open dataset.dat and config.dat files
Initialize customer’s sequence, counter
Initialize arraylist for finding index and time
Call procedure GCON()
    Read the parameters from config.dat
t1 ← System.currentTimeMillis()
Call procedure FS&GSGT()
    Generate all sequences onwards from sequence-0
    Generate sequence generator table
    Return large sequences
Call procedure CMEM()
    Return memory used
Call procedure OUTR()
    Return time interval, gap, minimum support, sequences
t2 ← System.currentTimeMillis() − t1
Return sequential sequences
Close files
End
Figure 5.7 : Algorithm MySSM
The algorithm MySSM reads the data from the config.dat and dataset.dat files. It generates the large sequential sequences whose support count is greater than the minimum support, and it measures the time and memory used during execution. MySSM is the main algorithm; it executes all the other algorithms proposed by us. The running-time complexity of the MySSM algorithm is O(log n), which shows the improved performance of our algorithms compared with other algorithms available at present.
Let us take one dataset and find the time interval sequential sequences.
Sequence ID Sequence
1 <(p,2),(r,4),(p,5),(q,5),(p,7),(t,7),(r,11)>
2 <(s,4),(p,6),(q,6),(t,6),(s,8),(t,8),(r,13),(s,13)>
3 <(p,9),(q,9),(t,12),(s,14),(q,17),(r,17),(t,21)>
4 <(q,14),(f,16),(t,17),(q,21),(t,21)>
Table 5.3: Sequence Generator Table
First we transform the dataset so that items sharing a time stamp are grouped together. The table then looks as shown in Table 5.4.
Sequence ID Sequence
1 <(p,2),(r,4),(p,q,5),(p,t,7),(r,11)>
2 <(s,4),(p,q,t,6),(s,t,8),(r,s,13)>
3 <(p,q,9),(t,12),(s,14),(q,r,17),(t,21)>
4 <(q,14),(f,16),(t,17),(q,t,21)>
Table 5.4: Sequence Generator Table with Time stamp
The items in the same „( )‟ bracket have the same time stamp. The sequence generator table is then scanned. The minimum support is 50%, the number of time intervals is 4, and the gaps are as follows:
I0 : t = 0
I1 : 0 < t ≤ 5
I2 : 5 < t ≤ 10
I3 : 10 < t ≤ ∞
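Mapping a time gap to these interval labels is a simple range lookup. A sketch, reading the boundaries as 0 < t ≤ 5, 5 < t ≤ 10 and t > 10 (an assumption, since the listed bounds leave small gaps):

```python
def interval_of(gap):
    """Interval label for the time gap between two events."""
    if gap == 0:
        return "I0"   # same transaction
    if gap <= 5:
        return "I1"
    if gap <= 10:
        return "I2"
    return "I3"

# 'p' at time 2 and 'q' at time 5 give gap 3, hence the pattern <p, I1, q>:
interval_of(5 - 2)   # "I1"
```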
The 1st step of this algorithm is the same as in typical PrefixSpan. In the 1st scan of the dataset, we find the frequent items, which are called 1-sequences. In this example, <p>, <q>, <r>, <s> and <t> are the frequent items that satisfy the minimum support threshold. During this step, the algorithm also builds the sequence generator table shown in Table 5.5, which is used to find the time interval sequential sequences.
SID   <p>                 <q>             <r>              <s>                    <t>
1     (1,2),(5,5),(8,7)   (6,5)           (3,4),(11,11)    Ø                      (9,7)
2     (3,6)               (4,6)           (10,13)          (1,4),(7,8),(11,13)    (5,6),(8,8)
3     (1,9)               (2,9),(8,17)    (9,17),(11,21)   (6,14)                 (4,12)
4     Ø                   (1,14),(7,21)   (8,21)           Ø                      (5,17)
Table 5.5: Sequence generator Table
The sequence generator table looks like the pseudo projection table used in typical PrefixSpan [2], but here we store the time along with the item index. The 1st column of the sequence generator table gives the sequence ID and the 1st row lists the frequent sequences; the table is extended as more sequences are found. For sequence ID 1, <p> generates 3 pairs, (1, 2), (5, 5), (8, 7), which indicates that item <p> occurs 3 times in this sequence (in different transactions). In (1, 2), 1 represents the index of <p> and 2 the time when that occurrence of <p> happens; the same notation applies to all other cells. The symbol „Ø‟ indicates that the item does not occur in the sequence. To see how the sequence generator table is built, take the 1st sequence, <(p, 2), (r, 4), (p, q, 5), (p, t, 7), (r, 11)>: scanning it, the first „p‟ occurs at the 1st position with time stamp 2, giving the pair (1, 2). The algorithm then generates the sequential sequences using both tables.
A sequence can be generated in two forms: <p, q> or <(p, q)>. The first form indicates
that 'p' and 'q' occur in different transactions; the second indicates that they occur in
the same transaction. Now suppose we look for the sequence <p, q>. First we find the
indexes of 'p'. There are three pairs in the first sequence, (1, 2), (5, 5) and (8, 7), so
the first index of 'p' is 1. Next we find an index of 'q' that is greater than the index of
'p'; here the index of 'q' is 6, which is greater than 1. This comparison decides which
form occurs: if the index of 'q' is greater than the index of 'p' we get the sequence
<p, q>, otherwise we get the sequence <(p, q)>.
In our example the index of 'p' is 1 and the index of 'q' is 6. After scanning the
sequence generator table, the index of 'q' (6) is greater than the index of 'p' (1), so
<p, q> occurs across different transactions. To find the time interval between event 'p'
and event 'q', the algorithm takes the difference of their timestamps. Here the timestamp
of 'p' is 2 and that of 'q' is 5, so the gap is 5 - 2 = 3. The value 3 falls in the range
of I1, so the algorithm finds the sequence <p, I1, q>. If the algorithm cannot find any
index greater than the index of 'p' and smaller than the index of 'q', it assumes that
both events occur in the same transaction, and we get the sequence <p, I0, q>.
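As a hedged sketch of the interval classification just described: assuming that a gap of 0 means the two events share a transaction (interval I0) and that larger gaps fall into fixed-width buckets (the width of 3 here is illustrative, chosen only so that the example gap of 3 lands in I1; both helper names are our own), the lookup could read:

```python
import math

def interval_label(gap, width=3):
    """Map a time gap to an interval label I0..In.

    I0 = the two events share a transaction (gap 0); otherwise the gap
    falls into bucket Ik for (k-1)*width < gap <= k*width.  The bucket
    width is an illustrative assumption, not a value from the thesis.
    """
    if gap == 0:
        return "I0"
    return "I" + str(math.ceil(gap / width))

def first_pair_after(pairs, min_index):
    """First (index, time) pair whose index exceeds `min_index`, else None."""
    for index, time in pairs:
        if index > min_index:
            return (index, time)
    return None

# Running example: 'p' at (1, 2); 'q' occurs as the pair (6, 5).
p_index, p_time = (1, 2)
q = first_pair_after([(6, 5)], p_index)
gap = q[1] - p_time            # 5 - 2 = 3
print(interval_label(gap))     # -> I1, i.e. the pattern <p, I1, q>
```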
Interval  p  q  r  s  t
I0        0  3  0  0  2
I1        2  1  1  2  3
I2        0  1  4  1  0
I3        0  0  0  0  1
Table 5.6: Table of time interval sequences for 'p'
Table 5.6 shows the time interval sequences for 'p'. The first column lists the time
intervals and the first row lists the frequent items. Each cell holds the count of the
corresponding sequence and acts as a counter, incremented on every occurrence of that
sequence with 'p'; the same applies to the other items. Suppose we find <p I0 q>: the
index of this sequence is added to the sequence generator table, which helps in finding
the 3-sequences. Here we obtain (p I0 q), (p I0 t), (p I1 t) and (p I3 t), and in this
way the algorithm finds the various frequent sequences.
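The counter table can be maintained with a nested dictionary. This is a minimal sketch: the item alphabet and interval labels are taken from Table 5.6, while the list of found occurrences below is invented purely for illustration.

```python
def count_interval_sequences(patterns, items=("p", "q", "r", "s", "t"),
                             intervals=("I0", "I1", "I2", "I3")):
    """Build a count table like Table 5.6: rows = intervals, cols = items.

    `patterns` is a list of (prefix_item, interval, item) triples found
    while scanning the projected sequences; each occurrence bumps the
    counter in its (interval, item) cell.
    """
    table = {iv: {item: 0 for item in items} for iv in intervals}
    for _prefix, interval, item in patterns:
        table[interval][item] += 1
    return table

# Illustrative occurrences of 2-sequences with prefix 'p':
found = [("p", "I0", "q"), ("p", "I0", "q"), ("p", "I1", "t"), ("p", "I0", "t")]
table = count_interval_sequences(found)
print(table["I0"]["q"])   # -> 2
```

A cell whose count reaches the minimum support threshold corresponds to a frequent time-interval sequence such as <p, I0, q>.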
Chapter 6
Empirical Analysis & Comparative Results
To evaluate the performance of the algorithms over a wide range of data
characteristics, we generated a synthetic data set of customer transactions. This is the
basic step in evaluating the algorithm. We built a synthetic dataset generator similar
to the IBM synthetic dataset generator, and tested on a large database with 100 distinct
items and the transactions of 50,000 customers or more. The generator produces
sequences from 0-sequences up to the longest frequent sequence length possible at the
given minimum support. All experiments were run on a Java Virtual Machine with 2048 MB
of RAM on an Intel i5 processor with 8 GB DDR3 RAM and a 500 GB HDD, and we compared
the results with the state-of-the-art methods.
A few lines of the large dataset are given below.
Data Set
1 9 95 99 161 9 277 9 324 9 337 9 363 9 399 11 101
11 280 19 60 27 99 27 209 27 236 27 318 27 358 27 393
2 8 14 8 33 8 215 8 285 8 300 8 317 8 345 8
3 3 41 3 68 3 72 3 154 3 352 3 384 12 7 12 27 12 115
12 220 17 160
4 8 26 19 91 19 333 26 5 26 15 26
Here all the data are generated randomly, and the algorithm works well on the
synthetic large database. The comparative results are discussed in this section.
The first number on each line is the customer (sequence) ID, followed by alternating
times and item codes. In "2 8 14 8 33", 2 indicates the customer ID, 8 indicates the
time and 14 and 33 indicate the item codes.
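A minimal Python sketch of a parser for this line format follows; the function name and the returned shape are our own choices, not from the thesis, assuming only the alternating time/item layout described above.

```python
def parse_line(line):
    """Parse one dataset line: "<customer> <time> <item> <time> <item> ...".

    Returns (customer_id, [(time, item), ...]).
    """
    numbers = [int(tok) for tok in line.split()]
    customer, rest = numbers[0], numbers[1:]
    events = list(zip(rest[0::2], rest[1::2]))  # pair up (time, item)
    return customer, events

cust, events = parse_line("2 8 14 8 33 8 215")
print(cust, events)   # -> 2 [(8, 14), (8, 33), (8, 215)]
```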
Using the generated synthetic dataset, we tested the scalability of MySSM in both
runtime and memory usage under different evaluation parameters such as support, items
per transaction and transactions per customer. MySSM shows linear scalability in both
runtime and memory usage. We compared our results with i-prefixspan [1][8]. The
scale-up properties with respect to these parameters are shown in Figures 6.1 to 6.11.
The empirical analysis shows that the performance of our algorithm MySSM is better
than that of i-prefixspan.
Figure 6.1 shows the empirical analysis of number of customers versus time in
milliseconds, with 3 time intervals, a time-interval gap of 8, a support value of
0.4000, 10 different items, 11 transactions per customer and 3 items per transaction,
for 10,000 to 1,00,000 customers. As the number of customers increases, the time
increases for both algorithms.
Figure 6.1: Number of Customers v/s Time (Milliseconds) for support = 0.4
The same parameters as in Figure 6.1 were then tested with respect to memory. In
both cases the memory usage at runtime increases with the number of customers, as
shown in Figure 6.2.
Figure 6.2: Number of Customers v/s Memory (MB) for support = 0.4
The experiments are extended in Figure 6.3 with a support value of 0.0200, 3 time
intervals, a time-interval gap of 8, 100 different items, 11 transactions per customer
and 3 items per transaction, for 10,000 to 1,00,000 customers. The runtime increases as
the number of customers increases; there is a sudden jump in time between 50,000 and
1,00,000 customers because of the larger gap between these customer counts.
Figure 6.3: Number of Customers v/s Time (Milliseconds) for support = 0.02
Figure 6.4: Number of Customers v/s Memory (MB) for support = 0.02
The memory analysis with the same parameters as in Figure 6.3 is shown in Figure 6.4.
The storage space required increases with the number of customers.
Figure 6.5 shows the analysis of number of customers versus time in milliseconds,
with 3 time intervals, a time-interval gap of 8, a support value of 0.3, 10 different
items, 11 transactions per customer and 3 items per transaction, for 500 to 1,20,000
customers.
Figure 6.5: Number of Customers v/s Time (Milliseconds) for support = 0.3
Figure 6.6 shows the analysis of number of customers versus memory in MB, with 3 time
intervals, a time-interval gap of 8, a support value of 0.3, 10 different items, 11
transactions per customer and 3 items per transaction, for 500 to 1,20,000 customers.
The graph increases linearly as the number of customers increases.
Figure 6.6: Number of Customers v/s Memory (MB) for support = 0.3
The time and memory analysis for a support value of 0.0008 is shown in Figures 6.7
and 6.8 respectively, with 3 time intervals, a time-interval gap of 5, 100 different
items, 11 transactions per customer and 3 items per transaction, for 1,000 to 1,00,000
customers.
Figure 6.7: Number of Customers v/s Time (Milliseconds)
In both cases, as shown in Figures 6.7 and 6.8, the time and memory scale up linearly
with the increase in the number of customers.
Figure 6.8: Number of Customers v/s Memory (MB)
Figure 6.9 shows the analysis of support versus time in milliseconds for 10,000
customers, 100 different items, 11 transactions per customer and 3 items per
transaction, over support values ranging from 0.03 to 0.0008.
Figure 6.9: Support v/s Time in Milliseconds
The graphs are extended in Figures 6.10 and 6.11 for various support values.
Figure 6.10 plots support versus memory in MB for 10,000 customers, while Figure 6.11
plots support versus time for 50,000 customers, with 100 different items, 11
transactions per customer and 3 items per transaction, over support values ranging
from 0.03 to 0.0008.
As the support decreases, the time and memory both increase, because a lower support
produces a larger number of sequences.
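The usual way to relate a relative support value to an absolute occurrence threshold is shown below; treating support this way is an assumption about the implementation, not something stated explicitly in the text, and the function name is our own.

```python
import math

def min_support_count(support, num_customers):
    """Absolute occurrence threshold for a relative support value.

    A sequence is frequent when it appears in at least
    ceil(support * num_customers) customer sequences, which is the
    standard conversion used in sequential pattern mining.
    """
    return math.ceil(support * num_customers)

# Lower support -> lower threshold -> more sequences qualify as frequent.
print(min_support_count(0.4, 10_000))   # -> 4000
print(min_support_count(0.03, 10_000))  # -> 300
```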
Figure 6.10: Support v/s Memory in MB
Figure 6.11: Support v/s Time in Milliseconds
The empirical analysis in Figures 6.1 to 6.11, over the various evaluation
parameters, shows that MySSM outperforms i-prefixspan: MySSM takes less time and
utilizes less memory during execution.
Our test results for 30,000 customers, with 3 time intervals and a time-interval
range of 8, show that when the number of different items decreases, the total number
of sequences also decreases, as seen in Figure 6.12.
Figure 6.12: Number of different items v/s Total sequences
It is also observed, for 30,000 customers with 3 time intervals and a time-interval
range of 8, that when the number of different items decreases, the number of different
independent sequences increases, as seen in Figures 6.13, 6.14 and 6.15.
Figure 6.13: Number of different sequences for number of different items = 100
Figure 6.14: Number of different sequences for number of different items = 10
Figure 6.15: Number of different sequences for number of different items = 6
Chapter 7
Conclusion & Future Scope
The MySSM algorithm generates sequential sequences in a very efficient way. From our
observations and experiments we conclude that MySSM performs better than the earlier
algorithms proposed for sequential sequences. Our empirical analysis and test results
show that MySSM outperforms the state-of-the-art methods because of its sequence
generator table, which saves time and decreases memory usage during execution.
In future work, the performance of the algorithm may be further improved by extending
it to a multi-threaded architecture with parallel execution of threads, which may make
it more efficient and effective in terms of both time and memory.
Bibliography
[1]. Dhany, Saputra and Rambli Dayang, R.A. and Foong, Oi Mean, “Mining Sequential
Patterns Using I-PrefixSpan”, World Academy of Science, Engineering and Technology,
Dec. 2008.
[2]. J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M.-C. Hsu,
“Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach”, IEEE
Transactions on Knowledge and Data Engineering, Vol. 16, No. 11, Pages 1424-1440,
2004.
[3]. J. Ayres, J. Gehrke, T. Yiu, and J. Flannick, “Sequential Pattern Mining Using a
Bitmap Representation”, Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data
Mining (SIGKDD '02), Pages 429-435, July 2002.
[4]. M. Zaki, “SPADE: An Efficient Algorithm for Mining Frequent Sequences”,
Machine Learning, Vol. 40, Pages 31-60, 2001.
[5]. J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu, “FreeSpan:
Frequent pattern-projected sequential pattern mining”, In Proc. Int'l Conf. Knowledge
Discovery and Data Mining (KDD '00), Pages 355-359, Aug. 2000.
[6]. R. Srikant and R. Agrawal, “Mining Sequential Patterns: Generalizations and
Performance Improvements”, Proc. Fifth Int'l Conf. Extending Database Technology
(EDBT '96), Pages 3-17, Mar. 1996.
[7]. R. Agrawal and R. Srikant, “Mining Sequential Patterns”, Proc. 1995 Int'l Conf.
Data Eng. (ICDE '95), Pages 3-14, Mar. 1995.
[8]. Chen, Y.L., Chiang, M.C. and Ko, M.T., “Discovering time-interval sequential
patterns in sequence databases”, Expert Syst. Appl., Vol. 25, No. 3, Pages 343-354,
2003.
[9]. J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M.-C. Hsu,
“PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth”,
Proc. Int'l Conf. Data Eng. (ICDE '01), Pages 215-224, 2001.
[10]. R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules”, Proc.
20th Int'l Conf. Very Large Data Bases (VLDB), Pages 487-499, 1994.
[11]. C. C. Yu and Y.-L. Chen, “Mining sequential patterns from multi-dimensional
sequence data”, IEEE Transactions on Knowledge and Data Engineering, 17(1), Pages
136-140, 2005.
[12]. A. Savasere, E. Omiecinski, and S. Navathe, “An efficient algorithm for mining
association rules in large databases”, In Proc. Int'l Conf. Very Large Data Bases
(VLDB), Pages 432-443, Sept. 1995.
[13]. Toivonen, H., “Sampling large databases for association rules”, In Proc. Int'l
Conf. Very Large Data Bases (VLDB), Pages 134-145, 1996.
[14]. Park, J. S., M.S. Chen, P.S. Yu, “An effective hash-based algorithm for mining
association rules”, In Proc. ACM-SIGMOD Int'l Conf. Management of Data (SIGMOD),
San Jose, CA, Pages 175-186, 1995.
[15]. Brin, S., Motwani, R., Ullman, J.D., and S. Tsur, “Dynamic itemset counting and
implication rules for market basket analysis”, In Proc. ACM-SIGMOD Int'l Conf.
Management of Data (SIGMOD), Pages 255-264, 1997.
[16]. Dongme Sun, Shaohua Teng, Wei Zhang, Haibin Zhu, “An Algorithm to Improve the
Effectiveness of Apriori”, In Proc. 6th IEEE Int'l Conf. on Cognitive Informatics
(ICCI '07), 2007.
[17]. Mannila, H. and Toivonen, H., “Discovering generalized episodes using minimal
occurrences”, In Proc. of ACM Conference on Knowledge Discovery and Data Mining
(SIGKDD), Pages 146-151, 1996.
[18]. Bayardo, R., Agrawal, R., and Gunopulos, D., “Constraint-based rule mining in
large, dense databases”, In Proc. of IEEE Int'l Conf. on Data Engineering (ICDE),
Pages 188-197, 1999.
[19]. Leleu, M., Rigotti, C., Boulicaut, J., and Euvrard, G., “Go-SPADE: Mining
sequential patterns over databases with consecutive repetitions”, In Proc. of Int'l
Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM), Pages
293-306, 2003.
[20]. Garofalakis, M., Rastogi, R., and Shim, K., “SPIRIT: Sequential pattern mining
with regular expression constraints”, In Proc. of Int'l Conf. on Very Large Databases
(VLDB), Pages 223-234, 1999.
[21]. J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M.-C. Hsu,
“PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern
growth”, In Proc. of IEEE Int'l Conf. on Data Engineering (ICDE), Pages 215-224,
2001.
[22]. Fabian Moerchen, “Temporal pattern mining for time points, time intervals, and
semi-intervals”, Siemens Corporate Research, January 2011.
[23]. Wang, J. and Han, J., “BIDE: Efficient mining of frequent closed sequences”, In
Proc. of IEEE Int'l Conf. on Data Engineering (ICDE), Pages 79-90, 2004.
[24]. Lin, J. L., “Mining maximal frequent intervals” (technical report), In Proc. of
Annual ACM Symposium on Applied Computing (SAC), Pages 624-629, 2002.
[25]. Villafane, R., Hua, K. A., Tran, D., and Maulik, B., “Knowledge discovery from
series of interval events”, Intelligent Information Systems, 15(1), Pages 71-89, 2000.
[26]. Chieh-Yuan Tsai, Yu-Chen Shieh, “A change detection method for sequential
patterns”, Decision Support Systems, Vol. 46, Pages 501-511, Elsevier B.V., 2009.
[27]. Mirko B., Martin S., Detlef N., Rudolf K., “Mining changing customer segments in
dynamic markets”, Expert Systems with Applications, Vol. 36, ScienceDirect, Pages
155-164, 2009.
[28]. Wenyuan Li, Min Xu, Xianghong Jasmine Zhou, “Unraveling complex temporal
associations in cellular systems across multiple time-series microarray datasets”,
Journal of Biomedical Informatics, Vol. 43, Elsevier, ScienceDirect, Pages 550-559,
2010.
[29]. A. Apostolico, M. E. Bock, S. Lonardi, and X. Xu, “Efficient detection of unusual
words”, Journal of Computational Biology, 7(1-2), Pages 71-94, 2000.
[30]. J. Wang and J. Han, “BIDE: Efficient mining of frequent closed sequences”, In
Proceedings of the 20th Int'l Conf. on Data Engineering (ICDE '04), Pages 79-90, IEEE
Press, 2004.
[31]. S. Laxman, P. S. Sastry, and K. P. Unnikrishnan, “A fast algorithm for finding
frequent episodes in event streams”, In Proceedings of the 13th ACM SIGKDD Int'l
Conf. on Knowledge Discovery and Data Mining (KDD '07), Pages 410-419, 2007.
[32]. J. Pei, H. Wang, J. Liu, K. Wang, J. Wang, and P. S. Yu, “Discovering frequent
closed partial orders from strings”, IEEE Transactions on Knowledge and Data
Engineering, 18(11), Pages 1467-1481, 2006.
[33]. Juyoung Kang and Hwan-Seung Yong, “Mining Spatio-Temporal Patterns in Trajectory
Data”, Journal of Information Processing Systems, Vol. 6, No. 4, 2010.
[34]. Yan Huang, Liqin Zhang, and Pusheng Zhang, “A Framework for Mining Sequential
Patterns from Spatio-Temporal Event Data Sets”, IEEE Transactions on Knowledge and
Data Engineering, Vol. 20, No. 4, 2008.
[35]. Damian Fricker, Hui Zhang, Chen Yu, “Sequential Pattern Mining of Multimodal
Data Streams in Dyadic Interactions”, ICDL, 978-1-61284-990-4/11, IEEE, 2011.
[36]. Eric Hsueh-Chan Lu, Vincent S. Tseng, Philip S. Yu, “Mining Cluster-Based
Temporal Mobile Sequential Patterns in Location-Based Service Environments”, IEEE
Transactions on Knowledge and Data Engineering, Vol. 23, No. 6, 2011.
[37]. H. Mannila, H. Toivonen, and A. Verkamo, “Improved methods for finding
association rules”, In Proc. AAAI Workshop on Knowledge Discovery, 1994.
[38]. Claudia Antunes and Arlindo L. Oliveira, “Sequential Pattern Mining Algorithms:
Trade-offs between Speed and Memory”, In 2nd Workshop on Mining Graphs, Trees and
Sequences, 2004.
[39]. J. Pei, J. Han, and W. Wang, “Mining sequential patterns with constraints in
large databases”, Proceedings of the Eleventh Int'l Conf. on Information and Knowledge
Management, McLean, Virginia, USA, 2002.
[40]. S. Parthasarathy, et al., “Incremental and interactive sequence mining”,
Proceedings of the Eighth Int'l Conf. on Information and Knowledge Management, 1999.
[41]. M. Zhang, et al., “Efficient algorithms for incremental update of frequent
sequences”, Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data
Mining (PAKDD 2002), Taipei, Taiwan, 2002.
[42]. H. Mannila and H. Toivonen, “On an algorithm for finding all interesting
sentences”, In 13th European Meeting on Cybernetics and Systems Research, 1996.
[43]. R. Agrawal and J. Shafer, “Parallel mining of association rules”, IEEE Trans. on
Knowledge and Data Engineering, 1996.
[44]. R. Agrawal and R. Srikant, “Mining Sequential Patterns”, In Proc. 11th Int'l
Conf. on Data Engineering (ICDE), 1995.
[45]. J. Yang, W. Wang, and P. S. Yu, “Mining asynchronous periodic patterns in time
series data”, IEEE Transactions on Knowledge and Data Engineering, 15(3), Pages
613-628, 2003.
[46]. J. Han, W. Gong, and Y. Yin, “Mining segment-wise periodic patterns in
time-related databases”, Proc. Int'l Conf. on Knowledge Discovery and Data Mining,
1998.
[47]. S. Ma, et al., “Mining partially periodic event patterns with unknown periods”,
In Proceedings of the 17th Int'l Conf. on Data Engineering, 2001.
[48]. X. Yan, J. Han, and R. Afshar, “CloSpan: Mining closed sequential patterns in
large datasets”, Proceedings of the SIAM Int'l Conf. on Data Mining, 2003.
[49]. Y. J. Lee, J. W. Lee, D. J. Chai, B. H. Hwang, K. Ho Ryu, “Mining temporal
interval relational rules from temporal data”, The Journal of Systems and Software,
Vol. 82, Pages 155-167, 2009.
[50]. Y. L. Chen and T. C. K. Huang, “Discovering fuzzy time-interval sequential
patterns in sequence databases”, IEEE Transactions on Systems, Man and Cybernetics,
Part B, 35(5), Pages 959-972, 2005.
[51]. H. Mannila, H. Toivonen, and A. Inkeri Verkamo, “Discovery of frequent episodes
in event sequences”, Data Mining and Knowledge Discovery, 1(3), Pages 259-289, 1997.
[52]. H. Pinto, et al., “Multi-dimensional sequential pattern mining”, Proceedings of
the 10th Int'l Conf. on Information and Knowledge Management, 2001.
Own Publication List
Publications related to my research work
[International Journals/Conferences]
[1]. Kiran Amin, Dr. J. S. Shah, “Sequential Sequence Mining Technique in
Mammographic Information Analysis Database”, International Journal of Emerging
Technology and Advanced Engineering, ISSN 2250-2459, Vol. 2, Issue 5, May 2012.
[2]. Kiran Amin, Dr. J. S. Shah, “Improved Technique in Sequential Sequence Mining in
Large Database of Transaction”, International Journal of Engineering Research and
Technology, ISSN: 2278-0181, Vol. 1, Issue 4, June 2012.
[3]. Kiran Amin, Dr. J. S. Shah, “Gradual Evolution of Sequential Sequence Mining for
Customer Relation Database”, International Journal on Computer Science and
Engineering, ISSN: 2229-5631, Vol. 4, Issue 7, July 2012.
[4]. Kiran Amin, Dr. J. S. Shah, “Sequential Sequence Mining Technique in Large
Information Analysis Database”, 6th Int'l Conf. on Next Generation Web Services
Practices (NWeSP 2010), November 2010, Gwalior, India, available on IEEE Xplore.
[5]. Kiran Amin, Dr. J. S. Shah, “Sequential Sequence Mining Technique in Large
Database of Gene Sequence”, Int'l Conf. on Computational Intelligence and
Communication Networks (CICN 2010), November 2010, Bhopal, India, available on
IEEE Xplore.
My other research publications
[International Journals/Conferences]
[1]. Kiran Amin, “Web Search Result Rank Optimization Using Search Engine Query Log
Mining”, Int'l Conf. on Recent Advances in Engineering and Technology, ISBN:
978-81-923541-0-2, April 2012.
[2]. Kiran Amin, “Survey on Web Log Data in Terms of Web Usage Mining”, International
Journal of Engineering Research and Applications, ISSN: 2248-9622.
[3]. Kiran Amin, “Attribute Based Routing for Query Processing to Minimize Power
Consumption in Wireless Sensor Networks”, in Innovations in Embedded Systems, Mobile
Communication and Computing Technologies, Macmillan Publishers India Ltd., ISBN 13:
978-0230-63910-2, Macmillan Advanced Research Series; proceedings organized by the
Mobile Communication and Networking Center of Excellence (MCNC) and PES School of
Engineering, Bangalore, India, July 2009.
[4]. Kiran Amin, “Utilization of SIP Contact Header for Reducing the Load on Proxy
Servers in FoIP Application”, Int'l Conf. on Computational Intelligence, Communication
Systems and Networks (CICSyN 2009), published by the IEEE Computer Society (copyright
transferred to IEEE), proceedings jointly organized by the UK Simulation Society and
the Asia Modelling and Simulation Society, 2009.