Using Multi-Core Processors for Mining
Frequent Sequential Patterns
Abstract. The problem of mining frequent sequential patterns (FSPs) has attracted a great deal
of research attention. Although there are many efficient algorithms for mining FSPs, their
mining time remains high, especially for very large or dense datasets. Parallel
processing has been widely applied to improve processing speed for various problems. Based
on a multi-core processor architecture, this paper proposes a parallel approach called PIB-
PRISM for mining FSPs from very large datasets. Branches of the search tree are treated as
tasks in PIB-PRISM and are processed in parallel. Prime-encoding theory is also used to
quickly determine support values. Experiments are conducted to verify the effectiveness of
PIB-PRISM. Experimental results show that the proposed algorithm outperforms PRISM for
mining FSPs in terms of mining time.
1. Introduction
Mining sequential patterns (SPs) is a fundamental problem in data mining and is applied in
many domains, such as customer shopping cart analysis (Agrawal and Srikant, 1995), weblog
analysis (Weichbroth et al., 2012; Zubi et al., 2014), and DNA sequence analysis (Raza, 2013;
Zaki et al., 2001). The input data, called a sequence dataset, consist of a set of sequences. Each
sequence is a list of transactions, each of which contains a set of items. An SP is a list of
sets of items whose support is at least a user-specified minimum support, where the
support of an SP is the percentage of sequences that contain the pattern. The goal of SP mining is
to find all SPs in a sequence dataset.
An example of a sequence dataset is customer purchase behavior in a store. The dataset
contains the itemsets purchased in sequence by each customer, and thus the purchase behavior of
each customer can be represented in the form of [Customer ID, Ordered Sequence Events],
where each sequence event is a set of store items (e.g., bread, sugar, milk and tea). Below is an
example for the purchase behavior of two customers, C1 and C2: [C1, (bread, milk), (bread, milk,
sugar), (milk), (tea, sugar)]; [C2, (bread), (sugar, tea)]. The first customer, C1, purchases
(bread, milk), (bread, milk, sugar), (milk) and (tea, sugar) in sequence, and the second customer,
C2, purchases (bread) and (sugar, tea) in sequence.
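Such a dataset can be represented directly in code. Below is a minimal sketch of this example (the representation with dictionaries and sets is ours, chosen for illustration):

```python
# A sequence dataset as a mapping from customer ID to an ordered list
# of events, where each event is a set of items (illustrative names).
dataset = {
    "C1": [{"bread", "milk"}, {"bread", "milk", "sugar"},
           {"milk"}, {"tea", "sugar"}],
    "C2": [{"bread"}, {"sugar", "tea"}],
}

# C1's purchase sequence has 4 events and 8 items in total.
size_c1 = len(dataset["C1"])
length_c1 = sum(len(event) for event in dataset["C1"])
print(size_c1, length_c1)  # -> 4 8
```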
The general idea of all existing methods is to start with general (short) sequences and then
extend them towards specific (long) ones. Existing methods can be divided into the following
three main types.
(1) Horizontal methods: Datasets are organized in the horizontal format, where each row is a
transaction in sid-itemset form; sid is a sequence (customer) ID and itemset is the set of items
appearing in that sequence. Examples include the Generalized Sequential Pattern (GSP)
algorithm (Agrawal and Srikant, 1996; Zhang et al., 2002), AprioriAll (Agrawal and Srikant,
1995), and PSP (Masseglia et al., 1998), which follows the GSP principle; all are extensions of
the Apriori approach. The main disadvantage of this type of method is
that the datasets are scanned many times to determine the frequent sequential patterns (FSPs).
Therefore, the runtime and memory usage of these algorithms are high.
(2) Vertical methods: Datasets are organized in the vertical format, in which each row has the
form (item, sids), where sids is the set of IDs of sequences containing item. In this layout, the
support of a sequence is the size of its set of sids. In SPADE (Zaki, 2001a), FSPs are
determined by combining sids based on lattice theory to reduce the number of dataset scans.
In addition, the SPAM (Ayres et al., 2002) and PRISM (Gouda et al., 2010a) algorithms use
bit vectors to store sids to reduce memory usage.
(3) Projection methods: Projection methods are hybrid methods that combine the horizontal and
vertical approaches. Given any prefix sequence P, the main idea is to project the horizontal
dataset based on the pattern-growth methodology for mining frequent patterns in
transaction databases developed in the FP-growth algorithm (Han et al., 2000). The
projected (or conditional) dataset thus contains only those sequences that contain P, and the
frequency of extensions of P can be counted directly in the projected dataset. PrefixSpan (Pei et al.,
2004), an extension of FreeSpan (Han et al., 2000; Pei et al., 2004), is an example. Its general
idea is to examine only the prefix subsequences and project only their corresponding postfix
subsequences into projected datasets. This approach segments the data: projections are first
performed on a dataset to reduce the cost of data storage, and in each projected dataset,
sequential patterns are grown by exploring only locally frequent patterns. Instead of projecting
sequence databases by considering all possible occurrences of frequent subsequences, the
projection is based only on frequent prefixes, because any frequent subsequence can always
be found by growing a frequent prefix. In addition, tree structures have been used to organize
and store candidate sequences: IMSR_PreTree (Van et al., 2014) for mining sequential rules
and MNSR-PreTree (Pham et al., 2014) for mining non-redundant sequential rules are both
implemented on a prefix tree architecture.
In addition, a number of extended problems related to sequence mining have been proposed,
including the mining of frequent closed SPs (Tran et al., 2015; Yan et al., 2003), frequent closed
sequences without candidate generation (Wang et al., 2004), sequence generation (Lo et al.,
2008; Vijayarani et al., 2014), and inter-sequence patterns (Wanga et al., 2009).
All of these FSP mining algorithms are implemented with a sequential, single-task processing
strategy. This means that a task must be completed before the next one can be started. Hence,
these methods are time-consuming for large datasets, especially dense datasets. To improve
performance, some researchers have applied parallel computing approaches to speed up the
processing. For example, Zaki (2001a) proposed a parallel algorithm named pSPADE, which is
based on SPADE (Zaki, 2001b). It is used for fast discovery of sequential patterns on computers
with distributed memory. Next, Cong et al. (2005) proposed the Par-CSP (parallel closed
sequential pattern mining) algorithm, also for computers with distributed memory. In addition,
multi-core processors (Andrew, 2008) allow for multiple tasks to be executed in parallel to
enhance performance. Some data mining researchers have developed parallel algorithms for
multi-core processor architectures. Liu et al. (2007) proposed cache-conscious FP-array and a
mechanism for parallel data lock-free dataset tiling for mining frequent itemsets on multi-core
computers. Yu and Wu (2011) proposed a strategy for effective load balancing to reduce the
number of duplicate candidates generated. In addition, parallel mining has been applied to closed
itemsets (Liu et al., 2007; Negrevergne et al., 2010; Schlegel et al., 2013; Yu et al., 2011),
correlated pattern mining (Casali et al., 2013), and generic pattern mining (Negrevergne et al.,
2013).
The present study proposes an approach for mining SPs in parallel based on the PRISM
algorithm for multi-core processor architectures. It is called parallel independent branch PRISM
(PIB-PRISM). PIB-PRISM uses prime block encoding based on prime factor theory to quickly
determine the support values associated with SPs. Experiments were performed to show the
efficiency of PIB-PRISM compared to that of PRISM.
The rest of this paper is organized as follows. Section 2 reviews the basic concepts of FSP
mining, multi-core processor architectures, prime block coding, and PRISM. The PIB-PRISM
algorithm is proposed in Section 3. Experimental results are discussed in Section 4. Finally,
conclusions and ideas for future work are given in Section 5.
2. Related work
2.1. Preliminaries
Let I = {i1, i2, ..., im} be a set of m distinct attributes, also called items. An itemset X = {i1,
i2, ..., ik} is a non-empty unordered collection of items, denoted as a k-itemset. Without loss of
generality, it is assumed that the items of an itemset are sorted in increasing order. A sequence S
is an ordered list of itemsets, denoted as (S1 S2 ... Sn), where each sequence element Si is
an itemset and n (the size of the sequence) is the number of itemsets (or elements) in the sequence.
The length of a sequence is the total number of items in it, i.e., k = Σj=1..n |Sj|. A
sequence of length k is called a k-sequence. For example, the sequence (B)(AC) is a 3-sequence
of size 2. A sequence β = (b1 b2 ... bm) is called a subsequence of a sequence α = (a1 a2 ... an),
and α is a supersequence of β, denoted as β ⊆ α, if there exist integers 1 ≤ j1 < j2 < ... < jm ≤ n
such that bk ⊆ ajk for 1 ≤ k ≤ m. For example, the sequence (B)(AC) is a subsequence of
(AB)(E)(ACD), but (AB)(E) is not a subsequence of (ABE).
A sequence dataset D is a set of tuples (sid, S), where sid is a unique sequence identifier and
S = (S1 S2 ... Sn) is a sequence of itemsets. A pattern is a subsequence of ordered itemsets, and
each itemset in a pattern is called an element. The absolute support of a sequence p in a dataset
D is defined as the total number of sequences in D that contain p, denoted as sup(p) = |{Si ∈ D |
p ⊆ Si}|. The relative support of p is the fraction of sequences in the dataset that contain p.
Absolute and relative supports are sometimes used interchangeably. Given a user-specified
threshold, called the minimum support (denoted minSup), a sequence p is said to be a
frequent sequence if sup(p) ≥ minSup. A frequent sequence is maximal if it is not a
subsequence of any other frequent sequence. A frequent sequence is closed if it is not a
subsequence of any other frequent sequence with the same support. Given a sequence dataset D
and minSup, the problem of mining SPs is to find all the frequent sequences in the dataset.
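The definitions above can be made concrete with a short sketch: a containment test for one pattern, and a brute-force support count over a dataset (the function names and set-based representation are ours, not the paper's):

```python
def is_subsequence(pattern, sequence):
    """Check whether every itemset of `pattern` is contained, in order,
    in some itemset of `sequence` (the β ⊆ α relation)."""
    pos = 0
    for elem in pattern:
        while pos < len(sequence) and not elem <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, dataset):
    """Absolute support: number of sequences containing `pattern`."""
    return sum(is_subsequence(pattern, s) for s in dataset)

# The dataset of Table 1, with itemsets as Python sets.
D = [
    [{"A","B"},{"B"},{"B"},{"A","B"},{"B"},{"A","C"}],   # S1
    [{"A","B"},{"B","C"},{"B","C"}],                     # S2
    [{"B"},{"A","B"}],                                   # S3
    [{"B"},{"B"},{"B","C"}],                             # S4
    [{"A","B"},{"A","B"},{"A","B"},{"A"},{"B","C"}],     # S5
]
print(support([{"A","B"},{"C"}], D))  # p1 = (AB)(C) -> 3
```

This reproduces sup(p1) = 3 from the worked example that follows.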
Consider the sequence dataset D in Table 1. The set of items in the dataset is {A, B, C}.
Assume minSup is set at 2. Sequence S1 has six itemsets, including (AB), (B), (B), (AB), (B) and
(AC). The size and length of S1 are thus six and nine, respectively. Sequence p1 = (AB)(C) is a
subsequence of sequence S1. In the example (Table 1), only the three sequences S1, S2 and S5
contain p1. Therefore, sup(p1) = 3 and p1 is an FSP since sup(p1) > minSup.
Table 1. Example of a sequence dataset
SID Sequence dataset
S1 (AB)(B)(B)(AB)(B)(AC)
S2 (AB)(BC)(BC)
S3 (B)(AB)
S4 (B)(B)(BC)
S5 (AB)(AB)(AB)(A)(BC)
2.2. Multi-core processor architecture
In a multi-core architecture, a processor includes two or more independent cores in the same
physical package (Andrew, 2008), with each processing unit having its own private memory and
access to shared main memory. An example of a quad-core processor is shown in Figure 1.
Multi-core processors allow multiple tasks to be executed in parallel to increase performance.
Figure 1. Quad-core processor.
Given the benefits of the multi-core architecture, this paper proposes a method for mining
FSPs in parallel based on this architecture to speed up execution, thereby improving the
efficiency of intelligent systems.
2.3. Prime block encoding
The PRISM (Gouda et al., 2010) algorithm was proposed as an effective approach for
frequent-sequence mining via prime block encoding. It is briefly introduced as follows.
Let G be an ordered set of items and P(G) the set of all subsets of G. A set S ∈ P(G) can be
represented as a bit vector SB of n = |G| bits, where SB[i] = 1 if Gi ∈ S and SB[i] = 0 if Gi ∉ S.
For example, given G = {2, 3, 5, 7} and S = {2, 5}, we have SB = 1010.
For a set S = {S1, ..., Sn} ∈ P(G), the multiplication operator is defined as ⨂S =
S1×S2×...×Sn, with ⨂S = 1 if S = ∅. Then ⨂P(G) = {⨂S : S ∈ P(G)} is the set obtained by
applying ⨂ to every set S in P(G), and G is a generator set of ⨂P(G) under ⨂. If G contains
only prime integers, then G is called a prime generator.
Given a set G of prime numbers sorted in increasing order, assume that N is a multiple of |G|
and let B be a bit vector of length N. Then B can be partitioned into m = N/|G| contiguous
blocks, where each block Bi comprises the segment from B[(i - 1) × |G| + 1] to B[i × |G|],
1 ≤ i ≤ m. Each Bi ∈ {0, 1}^|G| is thus the indicator bit vector SB of some subset S ⊆ G.
Let Bi[j] be the j-th bit in block Bi and G[j] be the j-th prime number in G. The value of Bi
with respect to G, denoted v(Bi, G), is defined as v(Bi, G) = ⨂{G[j]^Bi[j]}. For example, let
Bi = 1001 and G = {2, 3, 5, 7}. Then v(Bi, G) = 2^1 × 3^0 × 5^0 × 7^1 = 2 × 7 = 14. If
Bi = 0000, then v(Bi, G) = 1.
The prime block encoding of a bit vector B with respect to a base prime set G is denoted v(B,
G), which is the set of all v(Bi, G), 1 ≤ i ≤ m. We write v(Bi, G) as v(Bi) and v(B, G) as v(B)
for simplicity. The bit vector is partitioned into blocks of length |G|. For example, given
G = {2, 3, 5, 7} and B = 100111100100, we can partition B into 12/4 = 3 blocks with B1 =
1001, B2 = 1110, and B3 = 0100. Thus, v(B1) = ⨂{2, 7} = 14, v(B2) = ⨂{2, 3, 5} = 30, and
v(B3) = ⨂{3} = 3, so the prime block encoding of B with respect to the given base prime set G
is v(B) = {14, 30, 3}. The inverse operation is defined as v-1({14, 30, 3}) = v-1(14) v-1(30)
v-1(3) = 100111100100 = B. Note that a bit vector with all zeros is encoded as 1.
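The encoding and its inverse can be sketched as follows (a simple illustration of the definition, not PRISM's actual implementation):

```python
from math import prod

G = [2, 3, 5, 7]  # base prime set

def encode(bits, g=G):
    """Prime block encoding v(B): split the bit string into |G|-bit
    blocks and multiply the primes at the '1' positions of each block.
    An all-zero block encodes to 1 (empty product)."""
    n = len(g)
    return [prod(p for p, b in zip(g, bits[i:i+n]) if b == "1")
            for i in range(0, len(bits), n)]

def decode(blocks, g=G):
    """Inverse operation v^-1: recover the bit string from the blocks."""
    return "".join("1" if v % p == 0 else "0" for v in blocks for p in g)

print(encode("100111100100"))  # -> [14, 30, 3]
print(decode([14, 30, 3]))     # -> 100111100100
```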
Given a bit vector A = A1 A2 ... Am, where each Ai is a block of |G| bits, let fA be the
position of the first "1" bit in A. The mask operator A⊳ is the bit vector whose j-th bit is
defined by A⊳[j] = 0 if j ≤ fA and A⊳[j] = 1 if j > fA. For example, assume
A = 001001100100. We have fA = 3 and A⊳ = 000111111111. The mask operator for prime
block encoding is defined correspondingly as (v(A))⊳ = v(A⊳). For example, (v(A))⊳ =
v(000111111111) = v(0001) v(1111) v(1111) = {7, 210, 210}, whereas v(A) =
v(001001100100) = v(0010) v(0110) v(0100) = {5, 15, 3}. Thus, according to this definition,
(v(A))⊳ = {7, 210, 210}.
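At the bit level, the mask operator can be sketched as below; the prime-block version (v(A))⊳ then follows by encoding the masked vector:

```python
def mask(bits):
    """Mask operator A⊳: zeros up to and including the first '1' bit,
    ones afterwards (all zeros if there is no '1' bit)."""
    f = bits.find("1")
    if f < 0:
        return "0" * len(bits)
    return "0" * (f + 1) + "1" * (len(bits) - f - 1)

print(mask("001001100100"))  # -> 000111111111
```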
Consider a set of items I = {i1, i2, ..., in}. Each item can appear in different sequences and in
different positions within a sequence. Therefore, two encodings are built for each item ij: an ID
encoding of the sequences and a position encoding of the item within each sequence. Let
P(SX, PX) denote the prime encoding of item X, where SX is the ID encoding of the sequences
that contain X and PX is the position encoding of X in each sequence. The prime block
encoding consists of the two steps described below.
Step 1. Position block encoding
The prime encoding of the positions at which an item X appears in a sequence S is conducted as
follows. First, a bit vector, named BIT, is built to represent the positions of X in S. The length
of the bit vector is the same as the size of the sequence. If the i-th itemset in S contains item X,
then BIT[i] = 1; otherwise, BIT[i] = 0. Extra "0" bits are appended to the bit vector so that its
size is a multiple of |G|. For example, Figure 2(a) shows an example of 5 sequences with
I = {A, B, C}.
Figure 2(a). An example of sequence dataset
In the first sequence, item A appears in positions 1, 4 and 6, so the bit encoding of A is
100101. Assume G = {2, 3, 5, 7}. Because |G| = 4, two extra "0" bits are added and the bit
encoding becomes 10010100. v(1001) and v(0100) are then calculated as 14 and 3, respectively.
Figure 2(b) shows all the encoded positions of item A in the given five sequences. The same
procedure is repeated for the other two items, B and C.
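Step 1 can be sketched as follows, reproducing the position blocks {14, 3} of item A in sequence S1 (a simplified illustration with Python sets; names are ours):

```python
from math import prod

G = [2, 3, 5, 7]

def position_blocks(sequence, item, g=G):
    """Build the position bit vector of `item` in `sequence`, pad it to
    a multiple of |G| with '0' bits, and encode each |G|-bit block."""
    bits = "".join("1" if item in itemset else "0" for itemset in sequence)
    if len(bits) % len(g):
        bits += "0" * (len(g) - len(bits) % len(g))
    return [prod(p for p, b in zip(g, bits[i:i+len(g)]) if b == "1")
            for i in range(0, len(bits), len(g))]

S1 = [{"A","B"},{"B"},{"B"},{"A","B"},{"B"},{"A","C"}]
print(position_blocks(S1, "A"))  # 100101 padded to 10010100 -> [14, 3]
```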
Figure 2(b). Bit-encoded position blocks of item A
Step 2. Sequence block encoding
Another bit vector is used to represent the sequences in which an item appears. In the
example above, since item A appears in all sequences except the 4th, the bit vector 11101
represents this case. The bit vector is then encoded into prime blocks based on the given prime
set G. Again, extra "0" bits are appended so that the vector size is a multiple of |G|. Thus, three
"0" bits are added, and the resulting bit vector of A is 11101000. The prime encoding v(A)
generated from this bit vector is v(1110) v(1000), which is {30, 2}. The results for all three
items are shown in Figure 2(c). Figure 2(d) shows the final prime encoding, including both
sequence and position encodings.
Figure 2(c). Bit-encoded sid for items
Figure 2(d). Full prime blocks
Compression of prime block encoding
An SP appears in only some sequences of a dataset, and within each such sequence it occurs in
only some positions. Therefore, a bit vector usually contains many zero bits. A block whose
bits are all zero is called an empty block; for example, block A = 0000 is empty. Empty blocks
are removed during the compression of prime encoding blocks to reduce their size. PRISM
retains only non-empty blocks after prime encoding, and keeps an index with each sequence
block to indicate which non-empty position blocks correspond to that sequence block.
Figure 2(e) shows the compact prime block encoding for item A. The first sequence block is
30, with a factor-cardinality of ||30||G = 3, meaning that this block covers 3 valid sequences
(those with non-empty position blocks). For each of these, the offsets of its position blocks are
stored.
Figure 2(e). Compact encoded blocks
For example, the offset of sequence 1 is 1, with the first two position blocks corresponding to
this sequence. The offset for sequence 2 is 3, and that for sequence 3 is 4. The sequences
represented by sequence block 30 can be read directly from the corresponding bit vector
v-1(30) = 1110, which indicates that sequence 4 is invalid because its block is empty. The
second sequence block for item A is v-1(2) = 1000, indicating that only sequence 5 is valid; its
position blocks begin at position 5. The benefit of this sparse representation becomes clear
when considering the prime encoding of item C: the full prime blocks in Figure 2(d) contain
much redundant information, which is removed in the compact prime block encoding in
Figure 2(e).
Support counting via prime block join
Consider a sequence s with prime encoding P(Ss, Ps), where Ss is the set of all encoded
sequence blocks of s and Ps is the set of all encoded position blocks of s. The support of s can
be determined directly from the prime block encoding as sup(s) = Σ_{vi ∈ Ss} ||vi||G. For
example, for s = (A) with v(A) = {30, 2}, we have sup(s) = ||30||G + ||2||G = 3 + 1 = 4.
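This support computation can be sketched directly, where ||v||G counts the base primes dividing v (function names are ours):

```python
G = [2, 3, 5, 7]

def factor_cardinality(v, g=G):
    """||v||_G: number of base primes dividing v, i.e. the number of
    '1' bits in the block that v encodes."""
    return sum(1 for p in g if v % p == 0)

def support_from_blocks(seq_blocks, g=G):
    """sup(s) = sum of ||v||_G over all encoded sequence blocks."""
    return sum(factor_cardinality(v, g) for v in seq_blocks)

print(support_from_blocks([30, 2]))  # ||30|| + ||2|| = 3 + 1 -> 4
```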
2.4. PRISM algorithm
We use PRISM (Gouda et al., 2010) as the basic serial algorithm for our parallel FSP mining
algorithm, for two reasons. First, PRISM is an effective algorithm for mining SPs. Second,
PRISM searches the space without maintaining a set of candidates, which facilitates parallel
processing. PRISM uses the vertical data format based on prime block encoding to represent
candidate sequences, and adopts join operations over the prime blocks to determine the
frequency of each candidate. It scans the dataset only once to find the set of SPs of size 1,
together with the block encodings corresponding to those patterns. A new pattern is identified
based on the block encoding of an existing pattern and the added item. PRISM consists of two
main steps:
Step 1. Construct the search tree from the set of frequent 1-itemset sequences.
Step 2. Extend the search tree and traverse the tree to find new SPs with a larger size using
itemset extension and sequence extension.
The key aspects of PRISM are its search space traversal strategy, the data structure used to
represent the database, and the method for computing supports; these are described in detail
below.
Search space
For a sequence dataset, the relation of subsequences is typically represented as a search tree,
defined recursively as follows. The root of the tree is at level zero and labeled with the null
sequence ∅. A node labeled with a k-sequence S at level k is repeatedly extended by adding one
item to generate a child node, a (k+1)-sequence, at the next level. A (k+1)-sequence can be
generated by sequence extension or itemset extension. In sequence extension, the item is
appended to the SP as a new itemset. In itemset extension, the item is added to the last itemset
of the pattern, so the item must be lexicographically greater than all items in that last itemset.
For example, for node S = (A)(A), after item B is added, (A)(A)(B) is the sequence extension
and (A)(AB) is the itemset extension. Figure 3 shows the process of extending the search
sequence in depth-first order.
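The two extension operations can be sketched as follows (the set-based pattern representation is ours, for illustration):

```python
def sequence_extension(pattern, item):
    """Append `item` as a new itemset at the end of the pattern."""
    return pattern + [{item}]

def itemset_extension(pattern, item):
    """Add `item` to the last itemset; `item` must be lexicographically
    greater than every item already in that itemset."""
    assert all(item > i for i in pattern[-1])
    return pattern[:-1] + [pattern[-1] | {item}]

p = [{"A"}, {"A"}]
print(sequence_extension(p, "B"))  # (A)(A)(B)
print(itemset_extension(p, "B"))   # (A)(AB)
```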
Figure 3. Search space for mining SPs with itemset extension and sequence extension.
Method for extending patterns and computing supports
Suppose that in the initialization step, we compute the prime block encoding of each single
item in the sequence dataset. The FSP mining process starts from the root of the search tree.
PRISM mines SPs as follows. For each node in the search tree, the pattern at that node is
extended by adding an item to create a new pattern, and the support of the new pattern is
computed by joining the prime blocks of the pattern being extended. For a node S, all of its
extensions are evaluated before the depth-first recursive call. The search stops when no new
frequent extensions are found.
Itemset extension
Consider a pattern (A) with a prime block encoding of P(S(A), P(A)) and item B with a
prime block encoding of P(SB, PB). The itemset extension applied to pattern (A) creates the new
pattern (AB). The sequence blocks S(A) and SB in Figure 4(a) contain all information about
the relevant sequence IDs in which A and B occur together.
Figure 4(a). Itemset extensions
The sequence block encoding for sequences containing both pattern (A) and item B is
computed via the greatest common divisor (gcd) of each pair of elements in the two blocks
S(A) and SB, from which the bit vector of the sequences containing both can be recovered. For
example, as shown in Figure 4(a), S(A) = {30, 2} and SB = {210, 2}. Then gcd(S(A), SB) =
{gcd(30, 210), gcd(2, 2)} = {30, 2}, and the inverse operation gives v-1({30, 2}) = 11101000.
Pattern (A) and item B thus occur together in sequences 1, 2, 3, and 5.
At the position level, for sequence S1 in Table 1, P1(A) = {14, 3} and P1(B) = {210, 2}. Then
gcd(P1(A), P1(B)) = {gcd(14, 210), gcd(3, 2)} = {14, 1}, which indicates that A and B occur
together at positions 1 and 4 in sequence 1.
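The gcd join can be sketched as below, reproducing the values above (names are ours):

```python
from math import gcd

def join_blocks(blocks_x, blocks_y):
    """Element-wise gcd of two prime block encodings: keeps exactly the
    '1' bits common to both underlying bit vectors."""
    return [gcd(a, b) for a, b in zip(blocks_x, blocks_y)]

# Sequence blocks of pattern (A) and item B from Figure 4(a):
print(join_blocks([30, 2], [210, 2]))  # -> [30, 2]
# Position blocks within sequence S1:
print(join_blocks([14, 3], [210, 2]))  # -> [14, 1]
```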
Sequence extension
Sequence extension applied to pattern (A) with item B creates the new pattern (A)(B). As with
itemset extension, all sequences that contain both pattern (A) and item B can be found. For
each such sequence, it is then checked whether item B appears after item A; if so, the sequence
contains (A)(B).
Figure 4(b). Sequence extension
For example, consider sequence 1 in Figure 4(b), with P1(A) = {14, 3} and P1(B) = {210, 2}.
Then gcd((P1(A))⊳, P1(B)) = gcd({105, 210}, {210, 2}) = {gcd(105, 210), gcd(210, 2)} =
{105, 2}. Since v-1({105, 2}) = 01111000, this precisely indicates the positions in sequence 1
of Table 1 at which (A)(B) occurs.
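The masked join for sequence extension can be sketched as follows, with the mask computed over the concatenated bit blocks as defined in Section 2.3 (names are ours):

```python
from math import gcd, prod

G = [2, 3, 5, 7]

def bits_of(v, g=G):
    return [1 if v % p == 0 else 0 for p in g]

def mask_blocks(blocks, g=G):
    """Prime-block mask (v(A))⊳: zero bits up to and including the
    first '1' bit across the whole encoded vector, ones afterwards."""
    flat = [b for v in blocks for b in bits_of(v, g)]
    f = flat.index(1)
    masked = [0] * (f + 1) + [1] * (len(flat) - f - 1)
    n = len(g)
    return [prod(p for p, b in zip(g, masked[i:i+n]) if b)
            for i in range(0, len(flat), n)]

# Sequence extension join for S1: gcd((P1(A))⊳, P1(B))
pa, pb = [14, 3], [210, 2]
print([gcd(a, b) for a, b in zip(mask_blocks(pa), pb)])  # -> [105, 2]
```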
3. Parallel PRISM for mining frequent sequential patterns
To improve the efficiency and reduce processing time of mining FSPs, this paper presents a
parallel model based on PRISM. The parallel model is shown in Figure 5.
Figure 5. Parallel model.
In the parallel mining for FSPs, each branch of the search tree can be regarded as a single
task, which is processed independently to generate FSPs. An example is given in Figure 6. It
shows that there are two tasks on level 1 of the task tree. Task 1 and task 2 process branches A
and B, respectively.
Figure 6. Task tree.
In general, this strategy is similar to the independent-class search strategy proposed by
Schlegel et al. (2013). Our proposed Parallel Independent Branch PRISM (PIB-PRISM) uses a
tree structure and a parallel implementation based on tasks instead of threads. The advantage of
PIB-PRISM is that each task is assigned to search branches of the tree and is processed
independently. The algorithm was implemented in .NET Framework 4.0. Using tasks has
advantages over using threads. First, tasks require less memory than threads do. Second, a
thread runs on only one core, whereas a task can run on multiple cores. Finally, threads require more
processing time than tasks do, because the operating system must allocate and destroy data
structures for each thread and perform context switches between threads.
The parallel mining process for finding FSPs from a task tree is shown as follows.
- Step 1. Construct the search tree from the set of frequent 1-itemset sequences.
- Step 2. Assign each branch of the tree to a processor and mine FSPs independently.
In PIB-PRISM, a dataset must be preprocessed into vertical format. This step is performed
only once. After preprocessing, the dataset is called a transformed dataset D' and has the
structure shown in Figure 7. The first line in Figure 7 indicates the number of transactions in
the dataset, the second line contains an item and its support count, and the following lines
(called support lines) have the form sid followed by positions, indicating that the item appears
at those positions in sequence sid. For example, the third line in Figure 7 indicates that item a
appears in positions 1, 4, and 6 of the first sequence. Similarly, the fourth line indicates that
item a appears in position 1 of the second sequence.
Figure 7. An example of a transformed dataset from Table 1
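A hedged sketch of reading this format follows; the exact file layout is inferred from the description above, so the field order and the digit-based line test are assumptions:

```python
def parse_transformed(lines):
    """Return (n_transactions, {item: {sid: [positions]}}) from the
    vertical format described for Figure 7 (layout assumed)."""
    it = iter(lines)
    n_transactions = int(next(it))      # first line: transaction count
    vertical = {}
    current = None
    for line in it:
        parts = line.split()
        if not line[0].isdigit():       # "item supportCount" line
            current = parts[0]          # support count ignored here
            vertical[current] = {}
        else:                           # "sid pos1 pos2 ..." line
            sid, *positions = map(int, parts)
            vertical[current][sid] = positions
    return n_transactions, vertical

n, v = parse_transformed(["5", "a 4", "1 1 4 6", "2 1"])
print(n, v["a"][1])  # -> 5 [1, 4, 6]
```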
The pseudo code of the PIB-PRISM strategy is shown in Figure 8.
Input: Dataset D', minSup
Output: All FSPs satisfying minSup
Procedure PIB-PRISM(D', minSup)
1 Begin
  // Find frequent 1-sequences satisfying minSup
2   dbpat = pGenerate_SPs(minSup, D')
3   List<string> listpatstring = null
4   For (int i = 0; i < |dbpat|; i++)
5   Begin
6     Add dbpat.Pats[i] to listpatstring
7     Task ti = new Task(() => {
8       extendTree(dbpat.Pats[i], dbpat, listpatstring); });
9   End
10  For each task in the list of created tasks do
11    collect the set of patterns (Ps) returned by the task
12    totalPs = totalPs ∪ Ps
13 End
Figure 8. The PIB-PRISM strategy
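The task-per-branch scheme of Figure 8 can be sketched in Python as follows (a thread pool stands in for .NET tasks, and mine_branch is a dummy placeholder for extendTree):

```python
from concurrent.futures import ThreadPoolExecutor

def mine_branch(prefix):
    """Placeholder for extendTree: expand one level-1 branch of the
    search tree and return the patterns found under it (dummy result)."""
    return [prefix, prefix + prefix]

frequent_1_sequences = ["(A)", "(B)", "(C)"]

# Each level-1 branch becomes an independent task; the pool, not the
# programmer, maps tasks onto processor cores.
with ThreadPoolExecutor() as pool:
    partial_results = list(pool.map(mine_branch, frequent_1_sequences))

# The final answer is the union of the partial results (line 12).
total_patterns = [p for ps in partial_results for p in ps]
print(total_patterns)
```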
In the PIB-PRISM strategy, the pGenerate_SPs procedure is called to identify all frequent 1-
sequences in the dataset D' and store them in dbpat (line 2). Next, PIB-PRISM creates a new
task for each pattern in dbpat (line 7) and executes the extendTree procedure (line 8) to
perform itemset and sequence extensions. Each task is performed independently on a processor
core to generate a partial set of FSPs, and the final set of FSPs is the union of the partial
results. The pGenerate_SPs procedure is given in Figure 9.
Procedure pGenerate_SPs(minSup, D')
1 Begin
2   Create n tasks, with n = ⌈I/P⌉, where I is the number of single items in D' and P is the number of processor cores
3   For each item i in I
4     Assign item i to tasks[i]
5     Generate_SP(i)
6   End Foreach
7 End
Procedure Generate_SP(item)
1 Begin
2   If support(item) ≥ minSup then
3     Encode the information of item based on PRISM
4     Store the information of item in dbpat
5 End
Figure 9. Procedure Generate_SP
In Figure 9, the pGenerate_SPs procedure creates n tasks (line 2) and assigns them to
processor cores to execute the Generate_SP procedure (line 5), which uses PRISM to find
frequent 1-sequences. The extendTree procedure is shown in Figure 10.
Procedure extendTree(Pattern p, DB_Pattern dbpat, int level, List<string> listpatstring)
1 Begin
2   extendItemset(p, dbpat, listpatstring)
3   extendSequence(p, dbpat, listpatstring)
4   extendTreeCollection(p.Itemset_ext_pattern, dbpat, level + 1, listpatstring)
5   extendTreeCollection(p.Sequence_ext_pattern, dbpat, level + 1, listpatstring)
6 End
Procedure extendItemset(Pattern p, DB_Pattern dbpat, List<string> listpatstring)
1 Begin
    // Let item_p be the last item of pattern p
    // and i the position of item_p in dbpat
2   While (i ≤ |dbpat|)
3   Begin while
4     Pattern pnew = createNewPattern(p, dbpat.Pats[i], true)
5     Add pnew to listpatstring
6     Add pnew to p.Itemset_ext_pattern
7     i++
8   End while
9 End
Procedure extendSequence(Pattern p, DB_Pattern dbpat, List<string> listpatstring)
1 Begin
2   For each (Pattern pi in dbpat.Pats)
3     Pattern pnew = createNewPattern(p, pi, false)
4     Add pnew to listpatstring
5     Add pnew to p.Sequence_ext_pattern
6   End foreach
7 End
Procedure extendTreeCollection(List<Pattern> listpats, DB_Pattern dbpat, int level, List<string> listpatstring)
1 Begin
2   For each (Pattern pi in listpats)
3     extendTree(pi, dbpat, level, listpatstring)
4   End foreach
5 End
Figure 10. Procedure extendTree
For each node in the search tree, the pattern at that node is extended by calling extendItemset
and extendSequence (Gouda et al., 2010) to create new patterns, as shown in Figure 10. An
example of the results of itemset extension and sequence extension executed on the data in
Table 1 is shown in Table 2. The set of frequent sequential patterns derived is shown in Table 3.
Table 2. Itemset extension and sequence extension results from Table 1 with minSup = 50%
Prefix    Itemset extension    Support ≥ minSup    Sequence extension    Support ≥ minSup
(A): 4 (AB):4 Yes (A)(A): 2 No
(AC): 1 No (A)(B):3 Yes
(A)(C): 3 Yes
(AB): 4 (ABC): 0 No (AB)(A): 2 No
(AB)(B): 3 Yes
(AB)(C): 3 Yes
(AB)(B): 3 (AB)(BC): 2 No (AB)(B)(A): 2 No
(AB)(B)(B): 3 Yes
(AB)(B)(C): 3 Yes
(AB)(B)(B): 3 (AB)(B)(BC): 2 No (AB)(B)(B)(A): 2 No
(AB)(B)(B)(B): 2 No
(AB)(B)(B)(C): 2 No
(AB)(B)(C): 3 (AB)(B)(C)(A): 0 No
(AB)(B)(C)(B): 0 No
(AB)(B)(C)(C): 0 No
(AB)(C): 3 (AB)(C)(A): 0 No
(AB)(C)(B): 1 No
(AB)(C)(C): 1 No
(A)(B): 3 (A)(BC): 2 No (A)(B)(A): 2 No
(A)(B)(B): 3 Yes
(A)(B)(C): 3 Yes
(A)(B)(B): 3 (A)(B)(BC): 2 No (A)(B)(B)(A): 2 No
(A)(B)(B)(B): 2 No
(A)(B)(B)(C): 2 No
(A)(B)(C): 3 (A)(B)(C)(A): 0 No
(A)(B)(C)(B): 0 No
(A)(B)(C)(C): 0 No
(A)(C): 3 (A)(C)(A): 0 No
(A)(C)(B): 1 No
(A)(C)(C): 1 No
(B): 5 (BC): 3 Yes (B)(A): 3 Yes
(B)(B): 5 Yes
(B)(C): 4 Yes
(BC): 3 (BC)(A): 0 No
(BC)(B): 1 No
(BC)(C): 1 No
(B)(A): 3 (B)(AB): 3 Yes (B)(A)(A): 2 No
(B)(AC): 1 No (B)(A)(B): 2 No
(B)(A)(C): 2 No
(B)(AB): 3 (B)(ABC): 0 No (B)(AB)(A): 2 No
(B)(AB)(B): 2 No
(B)(AB)(C): 2 No
(B)(B): 5 (B)(BC): 3 Yes (B)(B)(A): 2 No
(B)(B)(B): 4 Yes
(B)(B)(C): 4 Yes
(B)(BC): 3 (B)(BC)(A): 0 No
(B)(BC)(B): 1 No
(B)(BC)(C): 1 No
(B)(B)(B): 4 (B)(B)(BC): 3 Yes (B)(B)(B)(A): 2 No
(B)(B)(B)(B): 2 No
(B)(B)(B)(C): 2 No
(B)(B)(BC): 3 (B)(B)(BC)(A): 0 No
(B)(B)(BC)(B): 0 No
(B)(B)(BC)(C): 0 No
(B)(B)(C): 4 (B)(B)(C)(A): 0 No
(B)(B)(C)(B): 0 No
(B)(B)(C)(C): 0 No
(B)(C): 4 (B)(C)(A): 0 No
(B)(C)(B): 1 No
(B)(C)(C): 1 No
(C): 4 (C)(A): 0 No
(C)(B): 1 No
(C)(C): 1 No
Table 3. The set of frequent sequential patterns from Table 1 with minSup = 50%
No. FSP Support
1 (A) 4
2 (AB) 4
3 (AB)(B) 3
4 (AB)(B)(B) 3
5 (AB)(B)(C) 3
6 (AB)(C) 3
7 (A)(B) 3
8 (A)(B)(B) 3
9 (A)(B)(C) 3
10 (A)(C) 3
11 (B) 5
12 (BC) 3
13 (B)(A) 3
14 (B)(AB) 3
15 (B)(B) 5
16 (B)(BC) 3
17 (B)(B)(B) 4
18 (B)(B)(BC) 3
19 (B)(B)(C) 4
20 (B)(C) 4
21 (C) 4
4. Experimental results
Experiments were conducted to evaluate the proposed parallel algorithm. The experiments
were performed on a personal computer with an Intel Core i7-4790 3.6-GHz CPU with 8 cores, 3
MB of L3 cache, and 8 GB of RAM, running Windows 7. The algorithm was implemented using
the .NET Framework 4.0.
Three standard datasets were used to compare runtimes: two synthetic datasets from the IBM
data generator and the real-world Kosarak click-stream dataset. Information about the datasets
is shown in Table 4.
Table 4. The three datasets used in the experiments

Dataset | No. of sequences | No. of items
C6T5N1kD1k | 1,000 | 1,000
C6T5N1kD10k | 10,000 | 1,000
Kosarak25k | 25,000 | 41,270
The mining time of PRISM (sequential) and the proposed PIB-PRISM for various minSup
values is shown in Table 5.
Table 5. Mining time for PRISM and PIB-PRISM

Dataset | minSup | No. of patterns | Parallel mining time (sec.) | Sequential mining time (sec.)
C6T5N1kD1k | 0.8 | 11209 | 69.803 | 145.454
C6T5N1kD1k | 0.7 | 14802 | 98.048 | 197.684
C6T5N1kD1k | 0.6 | 20644 | 136.079 | 280.067
C6T5N1kD1k | 0.5 | 31189 | 216.58 | 431.34
C6T5N1kD10k | 0.8 | 8430 | 780.735 | 1108.835
C6T5N1kD10k | 0.7 | 10480 | 995.656 | 1423.003
C6T5N1kD10k | 0.6 | 13627 | 1349.231 | 1905.902
C6T5N1kD10k | 0.5 | 18461 | 1807.326 | 2632.395
Kosarak25k | 0.8 | 198 | 5.975 | 7.629
Kosarak25k | 0.7 | 244 | 9.453 | 11.263
Kosarak25k | 0.6 | 315 | 14.992 | 18.018
Kosarak25k | 0.5 | 419 | 26.13 | 31.824
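Expressed as a speedup ratio (sequential time divided by parallel time), these figures show PIB-PRISM running roughly twice as fast as PRISM on the synthetic datasets. A quick calculation over the C6T5N1kD1k rows, with the values copied from Table 5:

```python
# Speedup = sequential time / parallel time, C6T5N1kD1k rows of Table 5.
rows = {0.8: (69.803, 145.454), 0.7: (98.048, 197.684),
        0.6: (136.079, 280.067), 0.5: (216.58, 431.34)}
for minsup, (parallel, sequential) in rows.items():
    print(f"minSup = {minsup}: speedup = {sequential / parallel:.2f}x")
```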
Experiments were then conducted to compare the execution times of the two algorithms
for various minSup values (0.5-0.8%). The results are shown in Figures 11-13 for the three
datasets, respectively. With decreasing minSup values, more FSPs were obtained, and thus the
runtime increased.
Figure 11. Comparison of runtime for various minSup values for C6T5N1kD1k.
Figure 12. Comparison of runtime for various minSup values for C6T5N1kD10k.
Figure 13. Comparison of runtime for various minSup values for Kosarak25k.
The experimental results show that PIB-PRISM is faster than PRISM on all three datasets.
With minSup = 0.8%, the runtimes of PRISM and PIB-PRISM were 145.454 and 69.803 seconds,
respectively, on the C6T5N1kD1k dataset. Similar results were obtained for the other two
datasets. PIB-PRISM had a lower runtime because it divides the search into independent
branches that are processed in parallel. Overall, the proposed algorithm significantly
improves execution time compared to the original algorithm on the three datasets.
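The branch-level parallelism described above can be sketched as follows. This is a schematic illustration only: `mine_branch` is a placeholder introduced here, and a thread pool stands in for the paper's task scheduling.

```python
# Sketch: treat first-level branches of the search tree as independent
# tasks and process them in parallel.
from concurrent.futures import ThreadPoolExecutor

def mine_branch(prefix):
    # Placeholder for the depth-first expansion of the subtree rooted at
    # `prefix`; a real implementation would return every frequent pattern
    # found in that branch.
    return [prefix]

prefixes = [("A",), ("B",), ("C",)]  # frequent 1-patterns as branch roots
with ThreadPoolExecutor() as pool:
    results = [p for branch in pool.map(mine_branch, prefixes)
               for p in branch]
print(results)
```

Because the branches share no state, each task can run to completion independently and the results are simply concatenated.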
The search tree of PIB-PRISM might, however, be unbalanced: some branches contain more
nodes than others, a problem inherent to depth-first search. A solution is to sort items in
ascending order of their support values. After applying this ordering, most of the nodes in
the leftmost branches are infrequent and are pruned during the search, whereas nodes in the
rightmost branches are frequent and are not pruned. This strategy helps balance the search
tree for parallel processing.
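The ascending-support ordering amounts to a single sort over the frequent items. The sketch below assumes the 1-item supports have already been counted; the values are taken from Table 2, with an absolute minSup count of 3.

```python
# Sort frequent items by ascending support before building the search
# tree: low-support roots, whose branches prune early, come first, which
# helps balance the work across parallel tasks.
supports = {"A": 4, "B": 5, "C": 4}  # 1-item supports from Table 2
min_count = 3                        # absolute minSup threshold

frequent = [item for item, s in supports.items() if s >= min_count]
ordered = sorted(frequent, key=supports.get)  # ascending support
print(ordered)  # -> ['A', 'C', 'B']
```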
5. Conclusions and future work
This study has proposed a strategy for mining SPs in parallel on a multi-core architecture.
The proposed algorithm distributes the search for SPs across independent tasks on a multi-core
computer and uses an efficient data structure for fast SP mining. Experimental results show
that the proposed algorithm outperforms the PRISM algorithm.
This paper addressed only the problem of mining SPs using multi-core processors. In the
future, we will study parallel strategies for mining closed patterns and maximal patterns in
sequence datasets using multi-core processors. We will also study other architectures for
solving the mining problem efficiently.
References
Agrawal R, Srikant R. Mining Sequential Patterns. In ICDE’95, pp. 3–14 (1995)
Agrawal R, Srikant R. Mining Sequential Patterns: Generalizations and Performance
Improvements. In EDBT’96, pp. 3–17 (1996)
Andrew B. Multi-Core Processor Architecture Explained. In
http://software.intel.com/en-us/articles/multi-core-processor-architecture-explained: Intel
(2008)
Ayres J, Gehrke JE, Yiu T, Flannick J. Sequential Pattern Mining using a Bitmap Representation.
In SIGKDD’02, pp. 1–7 (2002)
Burdick D, Calimlim M, Gehrke J. MAFIA: A maximal frequent itemset algorithm for
transactional databases. In ICDE’01, pp. 443–452 (2001)
Casali A, Ernst C. Extracting correlated patterns on multicore architectures. In CD-ARES’13,
pp.118-133 (2013)
Cong S, Han J, Padua D. Parallel mining of closed sequential patterns. In: ACM SIGKDD’05,
pp. 562-567 (2005)
Gouda K, Hassaan M, Zaki M. Prism: An effective approach for frequent sequence mining via
prime-block encoding. Journal of Computer and System Sciences, 76(1), 88-102 (2010)
Han J, Kamber M and Pei J. Data Mining: Concepts and Techniques. 3rd Edition, Morgan
Kaufmann (2011)
Han J, Pei J, Mortazavi Asl B, Chen Q, Dayal U, Hsu M. Freespan: Frequent pattern-projected
sequential pattern mining. In KDD’00, pp. 355–359 (2000)
Han J, Pei J, and Yin Y. Mining frequent patterns without candidate generation. In ACM
SIGMOD, pp. 1–12 (2000).
Liu L, Li E, Zhang Y, Tang Z. Optimization of Frequent Itemset Mining on Multiple-Core
Processor. In VLDB '07, pp. 1275-1285 (2007)
Lo D, Khoo SC, Liu C. Mining and ranking generators of sequential patterns. In SDM’08, pp.
553-564 (2008)
Masseglia F, Cathala F, Poncelet P. The PSP Approach for Mining Sequential Patterns. In
PKDD’98, pp. 176-184 (1998)
Mannila H, Toivonen H, and Verkamo AI. Discovery of frequent episodes in event sequences.
Data Mining and Knowledge Discovery, 259-289 (1997)
Negrevergne B, Termier A, Méhaut JF, Uno T. Discovering closed frequent itemsets on
multicore: Parallelizing computations and optimizing memory accesses. In HPCS’10,
IEEE, pp. 521–528 (2010)
Negrevergne B, Termier A, Rousset MC, Méhaut JF. ParaMiner: a generic pattern mining
algorithm for multi-core architectures. Data Mining and Knowledge Discovery, 28(3), 1–
41 (2014)
Nguyen D, Vo B, Le B. Efficient strategies for parallel mining class association rules. Expert
Systems with Applications, 41(10), 4716–4729 (2014)
Slimani T, Lazzez A. Sequential mining: patterns and algorithms analysis. In International
Journal of Computer and Electronics Research, 2(5), 639-647 (2013)
Tran T, Le B, Vo B. Combination of dynamic bit vectors and transaction information for mining
frequent closed sequences efficiently. Engineering Applications of Artificial Intelligence,
183-189 (2015)
Pham T, Luo J, Hong TP, Vo B. An efficient method for mining non-redundant sequential rules
using attributed prefix trees. Engineering Applications of Artificial Intelligence 32, 88-99
(2014)
Van T, Vo B, Le B. IMSR_PreTree: an improved algorithm for mining sequential rules based on
the prefix-tree. Vietnam J. Computer Science, 1(2), 97-105 (2014)
Pham T, Luo J, Vo B. An effective algorithm for mining closed sequential patterns and their
minimal generators based on prefix trees. IJIIDS, 7(4), 324-339 (2013)
Pei J, Han J, Asl BM, Wang J, Pinto H, Chen Q, Dayal U, Hsu M. Mining Sequential Patterns
by Pattern-Growth: The PrefixSpan Approach. IEEE Trans. Knowl. Data Eng. 16(11),
1424-1440 (2004)
Schlegel B, Karnagel T, Kiefer T, Lehner W. Scalable frequent itemset mining on many-core
processors. In The 9th International Workshop on Data Management on New Hardware.
ACM. Article No. 3 (2013)
Raza K. Application of Data Mining In Bioinformatics. Indian Journal of Computer Science and
Engineering, 1(2), 114-118 (2013)
Vijayarani S, Deepa S. An efficient algorithm for sequence generation in data mining, IJCI, 3(1)
(2014)
Wang CS, Lee AJT. Mining inter-sequence patterns. Expert Systems with Applications, 36(4),
8649–8658 (2009)
Wang W, Yang J. Mining Sequential Patterns from Large Data Sets. Advances in Database
Systems 28, 1-161 (2005)
Wang J and Han J. BIDE: Efficient Mining of Frequent Closed Sequences. In ICDE '04, pp. 79 –
90 (2004)
Weichbroth P, Owoc M, Pleszkun M. Web User Navigation Patterns Discovery from WWW
Server Log Files. FedCSIS’12, 1177–1176 (2012)
Yan X, Han J, Afshar R. CloSpan: Mining Closed Sequential Patterns in Large Datasets. In
SDM’03, pp. 166-177 (2003)
Yu KM, Wu SH. An efficient load balancing multi-core frequent patterns mining algorithm. In
TrustCom’11, pp. 1408–1412 (2011)
Zaki J, Wang TL, Toivonen TT. BIOKDD01: Workshop on Data Mining in Bioinformatics. In
ACM SIGKDD Explorations, 3(2), pp. 71-73 (2002)
Zaki MJ. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning
Journal, 42, 31–60 (2001a)
Zaki J. Parallel sequence mining on shared-memory machines. Journal of Parallel and
Distributed Computing, 61(3), 401-426 (2001b)
Zubi ZS, Raiani MSE. Using web logs dataset via web mining for user behavior understanding.
International Journal of Computers and Communications, 8, 103-111 (2014)