Using Multi-Core Processors for Mining
Frequent Sequential Patterns
Abstract. The problem of mining frequent sequential patterns (FSPs) has attracted a great deal
of research attention. Although there are many efficient algorithms for mining FSPs, their
mining time remains high, especially for very large or dense datasets. Parallel
processing has been widely applied to improve processing speed for various problems. Based
on a multi-core processor architecture, this paper proposes a parallel approach called PIB-
PRISM for mining FSPs from very large datasets. Branches of the search tree are treated as
tasks in PIB-PRISM and are processed in parallel. Prime-encoding theory is also used to
quickly determine support values. Experiments are conducted to verify the effectiveness of
PIB-PRISM. Experimental results show that the proposed algorithm outperforms PRISM for
mining FSPs in terms of mining time.
1. Introduction
Mining sequential patterns (SPs) is a fundamental problem in data mining and is applied in
many domains, such as customer shopping cart analysis (Agrawal and Srikant, 1995), weblog
analysis (Weichbroth et al., 2012; Zubi et al., 2014), and DNA sequence analysis (Raza, 2013;
Zaki et al., 2001). The input data, called a sequence dataset, consist of a set of sequences. Each
sequence is a list of transactions, each of which contains a set of items. An SP is a list of
sets of items whose support is at least a user-specified minimum support, where the
support of an SP is the percentage of sequences that contain the pattern. The goal of SP mining is
to find all SPs in a sequence dataset.
An example of a sequence dataset is customer purchase behavior in a store. The dataset
contains the itemsets purchased in sequence by each customer, and thus the purchase behavior of
each customer can be represented in the form of [Customer ID, Ordered Sequence Events],
where each sequence event is a set of store items (e.g., bread, sugar, milk and tea). Below is an
example for the purchase behavior of two customers, C1 and C2: [C1, (bread, milk), (bread, milk,
sugar), (milk), (tea, sugar)]; [C2, (bread), (sugar, tea)]. The first customer, C1, purchases
(bread, milk), (bread, milk, sugar), (milk) and (tea, sugar) in sequence, and the second customer,
C2, purchases (bread) and (sugar, tea) in sequence.
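Such a dataset can be represented directly in code. Below is a minimal sketch of this example (the representation with dictionaries and sets is ours, chosen for illustration):

```python
# A sequence dataset as a mapping from customer ID to an ordered list
# of events, where each event is a set of items (illustrative names).
dataset = {
    "C1": [{"bread", "milk"}, {"bread", "milk", "sugar"},
           {"milk"}, {"tea", "sugar"}],
    "C2": [{"bread"}, {"sugar", "tea"}],
}

# C1's purchase sequence has 4 events and 8 items in total.
size_c1 = len(dataset["C1"])
length_c1 = sum(len(event) for event in dataset["C1"])
print(size_c1, length_c1)  # -> 4 8
```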
The general idea of all existing methods is to start with general (short) sequences and then
extend them towards specific (long) ones. Existing methods can be divided into the following
three main types.
(1) Horizontal methods: Datasets are organized in the horizontal format, where each row is a
transaction in sid-itemset form; sid is a sequence (customer) ID and itemset is the set of items
appearing in that sequence. Examples include the Generalized Sequential Pattern (GSP)
algorithm (Agrawal and Srikant, 1996; Zhang et al., 2002), AprioriAll (Agrawal and Srikant,
1995), and PSP (Masseglia et al., 1998), which follows the GSP principle; all are extensions of
the Apriori approach. The main disadvantage of this type of method is
that the datasets are scanned many times to determine the frequent sequential patterns (FSPs).
Therefore, the runtime and memory usage of these algorithms are high.
(2) Vertical methods: Datasets are organized in the vertical format, in which each row has the
form (item, sids), where sids is the set of IDs of sequences containing item. In this layout, the
support of a sequence is the size of its set of sids. In SPADE (Zaki, 2001a), FSPs are
determined by combining sids based on lattice theory to reduce the number of dataset scans.
In addition, the SPAM (Ayres et al., 2002) and PRISM (Gouda et al., 2010a) algorithms use
bit vectors to store sids to reduce memory usage.
(3) Projection methods: Projection methods are hybrid methods that combine the horizontal and
vertical approaches. Given any prefix sequence P, the main idea is to project the horizontal
dataset based on the pattern-growth methodology for mining frequent patterns in
transaction databases developed in the FP-growth algorithm (Han et al., 2000). The
projected (or conditional) dataset thus contains only those sequences that contain P, and the
frequency of extensions of P can be counted directly in the projected dataset. PrefixSpan (Pei et al.,
2004), an extension of FreeSpan (Han et al., 2000; Pei et al., 2004), is an example. Its general
idea is to examine only the prefix subsequences and project only their corresponding postfix
subsequences into projected datasets. This approach segments the data: projections are first
performed on a dataset to reduce the cost of data storage, and in each projected dataset,
sequential patterns are grown by exploring only locally frequent patterns. Instead of projecting
sequence databases by considering all possible occurrences of frequent subsequences, the
projection is based only on frequent prefixes, because any frequent subsequence can always
be found by growing a frequent prefix. In addition, tree structures have been used to organize
and store candidate sequences: IMSR_PreTree (Van et al., 2014) for mining sequential rules
and MNSR-PreTree (Pham et al., 2014) for mining non-redundant sequential rules are both
implemented on a prefix tree architecture.
In addition, a number of extended problems related to sequence mining have been proposed,
including the mining of frequent closed SPs (Tran et al., 2015; Yan et al., 2003), frequent closed
sequences without candidate generation (Wang et al., 2004), sequence generation (Lo et al.,
2008; Vijayarani et al., 2014), and inter-sequence patterns (Wanga et al., 2009).
All of these FSP mining algorithms are implemented with a sequential, single-task processing
strategy. This means that a task must be completed before the next one can be started. Hence,
these methods are time-consuming for large datasets, especially dense datasets. To improve
performance, some researchers have applied parallel computing approaches to speed up the
processing. For example, Zaki (2001a) proposed a parallel algorithm named pSPADE, which is
based on SPADE (Zaki, 2001b). It is used for fast discovery of sequential patterns on computers
with distributed memory. Next, Cong et al. (2005) proposed the Par-CSP (parallel closed
sequential pattern mining) algorithm, also for computers with distributed memory. In addition,
multi-core processors (Andrew, 2008) allow for multiple tasks to be executed in parallel to
enhance performance. Some data mining researchers have developed parallel algorithms for
multi-core processor architectures. Liu et al. (2007) proposed cache-conscious FP-array and a
mechanism for parallel data lock-free dataset tiling for mining frequent itemsets on multi-core
computers. Yu and Wu (2011) proposed a strategy for effective load balancing to reduce the
number of duplicate candidates generated. In addition, parallel mining has been applied to closed
itemsets (Liu et al., 2007; Negrevergne et al., 2010; Schlegel et al., 2013; Yu et al., 2011),
correlated pattern mining (Casali et al., 2013), and generic pattern mining (Negrevergne et al.,
2013).
The present study proposes an approach for mining SPs in parallel based on the PRISM
algorithm for multi-core processor architectures. It is called parallel independent branch PRISM
(PIB-PRISM). PIB-PRISM uses prime block encoding based on prime factor theory to quickly
determine the support values associated with SPs. Experiments were performed to show the
efficiency of PIB-PRISM compared to that of PRISM.
The rest of this paper is organized as follows. Section 2 reviews the basic concepts of FSP
mining, multi-core processor architectures, prime block coding, and PRISM. The PIB-PRISM
algorithm is proposed in Section 3. Experimental results are discussed in Section 4. Finally,
conclusions and ideas for future work are given in Section 5.
2. Related work
2.1. Preliminaries
Let I = {i1, i2, ..., im} be a set of m distinct attributes, also called items. An itemset X = {i1,
i2, ..., ik} is a non-empty unordered collection of items, denoted as a k-itemset. Without loss of
generality, it is assumed that the items of an itemset are sorted in increasing order. A sequence S
is an ordered list of itemsets, denoted as (S1 S2 ... Sn), where each sequence element Si is
an itemset and n (the size of the sequence) is the number of itemsets (or elements) in the sequence.
The length of a sequence is the total number of items in it, i.e., k = Σj=1..n |Sj|. A
sequence of length k is called a k-sequence. For example, the sequence (B)(AC) is a 3-sequence
of size 2. A sequence β = (b1 b2 ... bm) is called a subsequence of a sequence α = (a1 a2 ... an),
and α is a supersequence of β, denoted as β ⊆ α, if there exist integers 1 ≤ j1 < j2 < ... < jm ≤ n
such that bk ⊆ ajk for 1 ≤ k ≤ m. For example, the sequence (B)(AC) is a subsequence of
(AB)(E)(ACD), but (AB)(E) is not a subsequence of (ABE).
A sequence dataset D is a set of tuples (sid, S), where sid is a unique sequence identifier and
S = (S1 S2 ... Sn) is a sequence of itemsets. A pattern is a subsequence of ordered itemsets, and
each itemset in a pattern is called an element. The absolute support of a sequence p in a dataset
D is defined as the total number of sequences in D that contain p, denoted as sup(p) = |{Si ∈ D |
p ⊆ Si}|. The relative support of p is the fraction of sequences in the dataset that contain p.
Absolute and relative supports are sometimes used interchangeably. Given a user-specified
threshold, called the minimum support (denoted minSup), a sequence p is said to be a
frequent sequence if sup(p) ≥ minSup. A frequent sequence is maximal if it is not a
subsequence of any other frequent sequence. A frequent sequence is closed if it is not a
subsequence of any other frequent sequence with the same support. Given a sequence dataset D
and minSup, the problem of mining SPs is to find all the frequent sequences in the dataset.
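The definitions above can be made concrete with a short sketch: a containment test for one pattern, and a brute-force support count over a dataset (the function names and set-based representation are ours, not the paper's):

```python
def is_subsequence(pattern, sequence):
    """Check whether every itemset of `pattern` is contained, in order,
    in some itemset of `sequence` (the β ⊆ α relation)."""
    pos = 0
    for elem in pattern:
        while pos < len(sequence) and not elem <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, dataset):
    """Absolute support: number of sequences containing `pattern`."""
    return sum(is_subsequence(pattern, s) for s in dataset)

# The dataset of Table 1, with itemsets as Python sets.
D = [
    [{"A","B"},{"B"},{"B"},{"A","B"},{"B"},{"A","C"}],   # S1
    [{"A","B"},{"B","C"},{"B","C"}],                     # S2
    [{"B"},{"A","B"}],                                   # S3
    [{"B"},{"B"},{"B","C"}],                             # S4
    [{"A","B"},{"A","B"},{"A","B"},{"A"},{"B","C"}],     # S5
]
print(support([{"A","B"},{"C"}], D))  # p1 = (AB)(C) -> 3
```

This reproduces sup(p1) = 3 from the worked example that follows.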
Consider the sequence dataset D in Table 1. The set of items in the dataset is {A, B, C}.
Assume minSup is set at 2. Sequence S1 has six itemsets, including (AB), (B), (B), (AB), (B) and
(AC). The size and length of S1 are thus six and nine, respectively. Sequence p1 = (AB)(C) is a
subsequence of sequence S1. In the example (Table 1), only the three sequences S1, S2 and S5
contain p1. Therefore, sup(p1) = 3 and p1 is an FSP since sup(p1) > minSup.
Table 1. Example of a sequence dataset
SID Sequence dataset
S1 (AB)(B)(B)(AB)(B)(AC)
S2 (AB)(BC)(BC)
S3 (B)(AB)
S4 (B)(B)(BC)
S5 (AB)(AB)(AB)(A)(BC)
2.2. Multi-core processor architecture
In a multi-core architecture, a processor includes two or more independent cores in the same
physical package (Andrew, 2008), with each processing unit having its own private memory and
access to shared main memory. An example of a quad-core processor is shown in Figure 1.
Multi-core processors allow multiple tasks to be executed in parallel to increase performance.
Figure 1. Quad-core processor.
Given the benefits of the multi-core architecture, this paper proposes a method for mining
FSPs in parallel based on this architecture to speed up execution, thereby improving the
efficiency of intelligent systems.
2.3. Prime block encoding
The PRISM (Gouda et al., 2010) algorithm was proposed as an effective approach for
frequent-sequence mining via prime block encoding. It is briefly introduced as follows.
Let G be an ordered set of items and P(G) the set of all subsets of G. A set S ∈ P(G) can be
represented as a bit vector SB of n = |G| bits, where SB[i] = 1 if Gi ∈ S and SB[i] = 0 if Gi ∉ S.
For example, given G = {2, 3, 5, 7} and S = {2, 5}, we have SB = 1010.
For a set S = {S1, ..., Sn} ∈ P(G), the multiplication operator is defined as ⨂S =
S1×S2×...×Sn, with ⨂S = 1 if S = ∅. Then ⨂P(G) = {⨂S : S ∈ P(G)} is the set obtained by
applying ⨂ to every set S in P(G), and G is a generator set of ⨂P(G) under ⨂. If G contains
only prime integers, then G is called a prime generator.
Given a set G of prime numbers sorted in increasing order, assume that N is a multiple of |G|
and let B be a bit vector of length N. Then B can be partitioned into m = N/|G| contiguous
blocks, where each block Bi comprises the segment from B[(i - 1) × |G| + 1] to B[i × |G|],
1 ≤ i ≤ m. Each Bi ∈ {0, 1}^|G| is thus the indicator bit vector SB of some subset S ⊆ G.
Let Bi[j] be the j-th bit in block Bi and G[j] be the j-th prime number in G. The value of Bi
with respect to G, denoted v(Bi, G), is defined as v(Bi, G) = ⨂{G[j]^Bi[j]}. For example, let
Bi = 1001 and G = {2, 3, 5, 7}. Then v(Bi, G) = 2^1 × 3^0 × 5^0 × 7^1 = 2 × 7 = 14. If
Bi = 0000, then v(Bi, G) = 1.
The prime block encoding of a bit vector B with respect to a base prime set G is denoted v(B,
G), which is the set of all v(Bi, G), 1 ≤ i ≤ m. We write v(Bi, G) as v(Bi) and v(B, G) as v(B)
for simplicity. The bit vector is partitioned into blocks of length |G|. For example, given
G = {2, 3, 5, 7} and B = 100111100100, we can partition B into 12/4 = 3 blocks with B1 =
1001, B2 = 1110, and B3 = 0100. Thus, v(B1) = ⨂{2, 7} = 14, v(B2) = ⨂{2, 3, 5} = 30, and
v(B3) = ⨂{3} = 3, so the prime block encoding of B with respect to the given base prime set G
is v(B) = {14, 30, 3}. The inverse operation is defined as v-1({14, 30, 3}) = v-1(14) v-1(30)
v-1(3) = 100111100100 = B. Note that a bit vector with all zeros is encoded as 1.
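The encoding and its inverse can be sketched as follows (a simple illustration of the definition, not PRISM's actual implementation):

```python
from math import prod

G = [2, 3, 5, 7]  # base prime set

def encode(bits, g=G):
    """Prime block encoding v(B): split the bit string into |G|-bit
    blocks and multiply the primes at the '1' positions of each block.
    An all-zero block encodes to 1 (empty product)."""
    n = len(g)
    return [prod(p for p, b in zip(g, bits[i:i+n]) if b == "1")
            for i in range(0, len(bits), n)]

def decode(blocks, g=G):
    """Inverse operation v^-1: recover the bit string from the blocks."""
    return "".join("1" if v % p == 0 else "0" for v in blocks for p in g)

print(encode("100111100100"))  # -> [14, 30, 3]
print(decode([14, 30, 3]))     # -> 100111100100
```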
Given a bit vector A = A1 A2 ... Am, where each Ai is a block of |G| bits, let fA be the
position of the first "1" bit in A. The mask operator A⊳ is the bit vector whose j-th bit is
defined by A⊳[j] = 0 if j ≤ fA and A⊳[j] = 1 if j > fA. For example, assume
A = 001001100100. We have fA = 3 and A⊳ = 000111111111. The mask operator for prime
block encoding is defined correspondingly as (v(A))⊳ = v(A⊳). For example, (v(A))⊳ =
v(000111111111) = v(0001) v(1111) v(1111) = {7, 210, 210}, whereas v(A) =
v(001001100100) = v(0010) v(0110) v(0100) = {5, 15, 3}. Thus, according to this definition,
(v(A))⊳ = {7, 210, 210}.
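At the bit level, the mask operator can be sketched as below; the prime-block version (v(A))⊳ then follows by encoding the masked vector:

```python
def mask(bits):
    """Mask operator A⊳: zeros up to and including the first '1' bit,
    ones afterwards (all zeros if there is no '1' bit)."""
    f = bits.find("1")
    if f < 0:
        return "0" * len(bits)
    return "0" * (f + 1) + "1" * (len(bits) - f - 1)

print(mask("001001100100"))  # -> 000111111111
```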
Consider a set of items I = {i1, i2, ..., in}. Each item can appear in different sequences and in
different positions within a sequence. Therefore, two encodings are built for each item ij: an ID
encoding of the sequences and a position encoding of the item within each sequence. Let
P(SX, PX) denote the prime encoding of item X, where SX is the ID encoding of the sequences
that contain X and PX is the position encoding of X in each sequence. The prime block
encoding consists of the two steps described below.
Step 1. Position block encoding
The prime encoding of the positions at which an item X appears in a sequence S is conducted as
follows. First, a bit vector, named BIT, is built to represent the positions of X in S. The length
of the bit vector is the same as the size of the sequence. If the i-th itemset in S contains item X,
then BIT[i] = 1; otherwise, BIT[i] = 0. Extra "0" bits are appended to the bit vector so that its
size is a multiple of |G|. For example, Figure 2(a) shows an example of 5 sequences with
I = {A, B, C}.
Figure 2(a). An example of sequence dataset
In the first sequence, item A appears in positions 1, 4 and 6, so the bit encoding of A is
100101. Assume G = {2, 3, 5, 7}. Because |G| = 4, two extra "0" bits are added and the bit
encoding becomes 10010100. v(1001) and v(0100) are then calculated as 14 and 3, respectively.
Figure 2(b) shows all the encoded positions of item A in the given five sequences. The same
procedure is repeated for the other two items, B and C.
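Step 1 can be sketched as follows, reproducing the position blocks {14, 3} of item A in sequence S1 (a simplified illustration with Python sets; names are ours):

```python
from math import prod

G = [2, 3, 5, 7]

def position_blocks(sequence, item, g=G):
    """Build the position bit vector of `item` in `sequence`, pad it to
    a multiple of |G| with '0' bits, and encode each |G|-bit block."""
    bits = "".join("1" if item in itemset else "0" for itemset in sequence)
    if len(bits) % len(g):
        bits += "0" * (len(g) - len(bits) % len(g))
    return [prod(p for p, b in zip(g, bits[i:i+len(g)]) if b == "1")
            for i in range(0, len(bits), len(g))]

S1 = [{"A","B"},{"B"},{"B"},{"A","B"},{"B"},{"A","C"}]
print(position_blocks(S1, "A"))  # 100101 padded to 10010100 -> [14, 3]
```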
Figure 2(b). Bit-encoded position blocks of item A
Step 2. Sequence block encoding
Another bit vector is used to represent the sequences in which an item appears. In the
example above, since item A appears in all sequences except the 4th, the bit vector 11101
represents this case. The bit vector is then encoded into prime blocks based on the given prime
set G. Again, extra "0" bits are appended so that the vector size is a multiple of |G|. Thus, three
"0" bits are added, and the resulting bit vector of A is 11101000. The prime encoding v(A)
generated from this bit vector is v(1110) v(1000), which is {30, 2}. The results for all three
items are shown in Figure 2(c). Figure 2(d) shows the final prime encoding, including both
sequence and position encodings.
Figure 2(c). Bit-encoded sid for items
Figure 2(d). Full prime blocks
Compression of prime block encoding
An SP appears in only some sequences of a dataset, and within each such sequence it occurs in
only some positions. Therefore, a bit vector usually contains many zero bits. A block whose
bits are all zero is called an empty block; for example, block A = 0000 is empty. Empty blocks
are removed during the compression of prime encoding blocks to reduce their size. PRISM
retains only non-empty blocks after prime encoding, and keeps an index with each sequence
block to indicate which non-empty position blocks correspond to that sequence block.
Figure 2(e) shows the compact prime block encoding for item A. The first sequence block is
30, with a factor-cardinality of ||30||G = 3, meaning that this block covers 3 valid sequences
(those with non-empty position blocks). For each of these, the offsets of its position blocks are
stored.
Figure 2(e). Compact encoded blocks
For example, the offset of sequence 1 is 1, with the first two position blocks corresponding to
this sequence. The offset for sequence 2 is 3, and that for sequence 3 is 4. The sequences
represented by sequence block 30 can be read directly from the corresponding bit vector
v-1(30) = 1110, which indicates that sequence 4 is invalid because its block is empty. The
second sequence block for item A is v-1(2) = 1000, indicating that only sequence 5 is valid; its
position blocks begin at position 5. The benefit of this sparse representation becomes clear
when considering the prime encoding of item C: the full prime blocks in Figure 2(d) contain
much redundant information, which is removed in the compact prime block encoding in
Figure 2(e).
Support counting via prime block join
Consider a sequence s with prime encoding P(Ss, Ps), where Ss is the set of all encoded
sequence blocks of s and Ps is the set of all encoded position blocks of s. The support of s can
be determined directly from the prime block encoding as sup(s) = Σ_{vi ∈ Ss} ||vi||G. For
example, for s = (A) with v(A) = {30, 2}, we have sup(s) = ||30||G + ||2||G = 3 + 1 = 4.
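This support computation can be sketched directly, where ||v||G counts the base primes dividing v (function names are ours):

```python
G = [2, 3, 5, 7]

def factor_cardinality(v, g=G):
    """||v||_G: number of base primes dividing v, i.e. the number of
    '1' bits in the block that v encodes."""
    return sum(1 for p in g if v % p == 0)

def support_from_blocks(seq_blocks, g=G):
    """sup(s) = sum of ||v||_G over all encoded sequence blocks."""
    return sum(factor_cardinality(v, g) for v in seq_blocks)

print(support_from_blocks([30, 2]))  # ||30|| + ||2|| = 3 + 1 -> 4
```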
2.4. PRISM algorithm
We use PRISM (Gouda et al., 2010) as the basic serial algorithm for our parallel FSP mining
algorithm, for two reasons. First, PRISM is an effective algorithm for mining SPs. Second,
PRISM searches the space without maintaining a set of candidates, which facilitates parallel
processing. PRISM uses the vertical data format based on prime block encoding to represent
candidate sequences, and adopts join operations over the prime blocks to determine the
frequency of each candidate. It scans the dataset only once to find the set of SPs of size 1,
together with the block encodings corresponding to those patterns. A new pattern is identified
based on the block encoding of an existing pattern and the added item. PRISM consists of two
main steps:
Step 1. Construct the search tree from the set of frequent 1-itemset sequences.
Step 2. Extend the search tree and traverse the tree to find new SPs with a larger size using
itemset extension and sequence extension.
The key aspects of PRISM are its search space traversal strategy, the data structure used to
represent the database, and the method for computing supports; these are described in detail
below.
Search space
For a sequence dataset, the relation of subsequences is typically represented as a search tree,
defined recursively as follows. The root of the tree is at level zero and labeled with the null
sequence ∅. A node labeled with a k-sequence S at level k is repeatedly extended by adding one
item to generate a child node, a (k+1)-sequence, at the next level. A (k+1)-sequence can be
generated by sequence extension or itemset extension. In sequence extension, the item is
appended to the SP as a new itemset. In itemset extension, the item is added to the last itemset
of the pattern, so the item must be lexicographically greater than all items in that last itemset.
For example, for node S = (A)(A), after item B is added, (A)(A)(B) is the sequence extension
and (A)(AB) is the itemset extension. Figure 3 shows the process of extending the search
sequence in depth-first order.
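The two extension operations can be sketched as follows (the set-based pattern representation is ours, for illustration):

```python
def sequence_extension(pattern, item):
    """Append `item` as a new itemset at the end of the pattern."""
    return pattern + [{item}]

def itemset_extension(pattern, item):
    """Add `item` to the last itemset; `item` must be lexicographically
    greater than every item already in that itemset."""
    assert all(item > i for i in pattern[-1])
    return pattern[:-1] + [pattern[-1] | {item}]

p = [{"A"}, {"A"}]
print(sequence_extension(p, "B"))  # (A)(A)(B)
print(itemset_extension(p, "B"))   # (A)(AB)
```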
Figure 3. Search space for mining SPs with itemset extension and sequence extension.
Method for extending patterns and computing supports
Suppose that in the initialization step, we compute the prime block encoding of each single
item in the sequence dataset. The FSP mining process starts from the root of the search tree.
PRISM mines SPs as follows. For each node in the search tree, the pattern at that node is
extended by adding an item to create a new pattern, and the support of the new pattern is
computed by joining the prime blocks of the pattern being extended. For a node S, all of its
extensions are evaluated before the depth-first recursive call. The search stops when no new
frequent extensions are found.
Itemset extension
Consider a pattern (A) with a prime block encoding of P(S(A), P(A)) and item B with a
prime block encoding of P(SB, PB). The itemset extension applied to pattern (A) creates the new
pattern (AB). The sequence blocks S(A) and SB in Figure 4(a) contain all information about
the relevant sequence IDs in which A and B occur together.
Figure 4(a). Itemset extensions
The sequence block encoding for sequences containing both pattern (A) and item B is
computed via the greatest common divisor (gcd) of each pair of elements in the two blocks
S(A) and SB, from which the bit vector of the sequences containing both can be recovered. For
example, as shown in Figure 4(a), S(A) = {30, 2} and SB = {210, 2}. Then gcd(S(A), SB) =
{gcd(30, 210), gcd(2, 2)} = {30, 2}, and the inverse operation gives v-1({30, 2}) = 11101000.
Pattern (A) and item B thus occur together in sequences 1, 2, 3, and 5.
At the position level, for sequence S1 in Table 1, P1(A) = {14, 3} and P1(B) = {210, 2}. Then
gcd(P1(A), P1(B)) = {gcd(14, 210), gcd(3, 2)} = {14, 1}, which indicates that A and B occur
together at positions 1 and 4 in sequence 1.
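The gcd join can be sketched as below, reproducing the values above (names are ours):

```python
from math import gcd

def join_blocks(blocks_x, blocks_y):
    """Element-wise gcd of two prime block encodings: keeps exactly the
    '1' bits common to both underlying bit vectors."""
    return [gcd(a, b) for a, b in zip(blocks_x, blocks_y)]

# Sequence blocks of pattern (A) and item B from Figure 4(a):
print(join_blocks([30, 2], [210, 2]))  # -> [30, 2]
# Position blocks within sequence S1:
print(join_blocks([14, 3], [210, 2]))  # -> [14, 1]
```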
Sequence extension
Sequence extension applied to pattern (A) with item B creates the new pattern (A)(B). As with
itemset extension, all sequences that contain both pattern (A) and item B can be found. For
each such sequence, it is then checked whether item B appears after item A; if so, the sequence
contains (A)(B).
Figure 4(b). Sequence extension
For example, consider sequence 1 in Figure 4(b), with P1(A) = {14, 3} and P1(B) = {210, 2}.
Then gcd((P1(A))⊳, P1(B)) = gcd({105, 210}, {210, 2}) = {gcd(105, 210), gcd(210, 2)} =
{105, 2}. Since v-1({105, 2}) = 01111000, this precisely indicates the positions in sequence 1
of Table 1 at which (A)(B) occurs.
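The masked join for sequence extension can be sketched as follows, with the mask computed over the concatenated bit blocks as defined in Section 2.3 (names are ours):

```python
from math import gcd, prod

G = [2, 3, 5, 7]

def bits_of(v, g=G):
    return [1 if v % p == 0 else 0 for p in g]

def mask_blocks(blocks, g=G):
    """Prime-block mask (v(A))⊳: zero bits up to and including the
    first '1' bit across the whole encoded vector, ones afterwards."""
    flat = [b for v in blocks for b in bits_of(v, g)]
    f = flat.index(1)
    masked = [0] * (f + 1) + [1] * (len(flat) - f - 1)
    n = len(g)
    return [prod(p for p, b in zip(g, masked[i:i+n]) if b)
            for i in range(0, len(flat), n)]

# Sequence extension join for S1: gcd((P1(A))⊳, P1(B))
pa, pb = [14, 3], [210, 2]
print([gcd(a, b) for a, b in zip(mask_blocks(pa), pb)])  # -> [105, 2]
```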
3. Parallel PRISM for mining frequent sequential patterns
To improve the efficiency and reduce processing time of mining FSPs, this paper presents a
parallel model based on PRISM. The parallel model is shown in Figure 5.
Figure 5. Parallel model.
In the parallel mining for FSPs, each branch of the search tree can be regarded as a single
task, which is processed independently to generate FSPs. An example is given in Figure 6. It
shows that there are two tasks on level 1 of the task tree. Task 1 and task 2 process branches A
and B, respectively.
Figure 6. Task tree.
In general, this strategy is similar to the independent-class search strategy proposed by
Schlegel et al. (2013). Our proposed Parallel Independent Branch PRISM (PIB-PRISM) uses a
tree structure and a parallel implementation based on tasks instead of threads. The advantage of
PIB-PRISM is that each task is assigned to search branches of the tree and is processed
independently. The algorithm was implemented in .NET Framework 4.0. Using tasks has
advantages over using threads. First, tasks require less memory than threads do. Second, a
thread runs on only one core, whereas a task can run on multiple cores. Finally, threads require more
processing time than tasks do, because the operating system must allocate and destroy data
structures for each thread and perform context switches between threads.
The parallel mining process for finding FSPs from a task tree is shown as follows.
- Step 1. Construct the search tree from the set of frequent 1-itemset sequences.
- Step 2. Assign each branch of the tree to a processor and mine FSPs independently.
In PIB-PRISM, a dataset must be preprocessed into vertical format. This step is performed
only once. After preprocessing, the dataset is called a transformed dataset D' and has the
structure shown in Figure 7. The first line in Figure 7 indicates the number of transactions in
the dataset, the second line contains an item and its support count, and the following lines
(called support lines) have the form sid followed by positions, indicating that the item appears
at those positions in sequence sid. For example, the third line in Figure 7 indicates that item a
appears in positions 1, 4, and 6 of the first sequence. Similarly, the fourth line indicates that
item a appears in position 1 of the second sequence.
Figure 7. An example of a transformed dataset from Table 1
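A hedged sketch of reading this format follows; the exact file layout is inferred from the description above, so the field order and the digit-based line test are assumptions:

```python
def parse_transformed(lines):
    """Return (n_transactions, {item: {sid: [positions]}}) from the
    vertical format described for Figure 7 (layout assumed)."""
    it = iter(lines)
    n_transactions = int(next(it))      # first line: transaction count
    vertical = {}
    current = None
    for line in it:
        parts = line.split()
        if not line[0].isdigit():       # "item supportCount" line
            current = parts[0]          # support count ignored here
            vertical[current] = {}
        else:                           # "sid pos1 pos2 ..." line
            sid, *positions = map(int, parts)
            vertical[current][sid] = positions
    return n_transactions, vertical

n, v = parse_transformed(["5", "a 4", "1 1 4 6", "2 1"])
print(n, v["a"][1])  # -> 5 [1, 4, 6]
```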
The pseudo code of the PIB-PRISM strategy is shown in Figure 8.
Input: Dataset D', minSup
Output: All FSPs satisfying minSup
Procedure PIB-PRISM(D', minSup)
1 Begin
  // Find frequent 1-sequences satisfying minSup
2   dbpat = pGenerate_SPs(minSup, D')
3   List<string> listpatstring = null
4   For (int i = 0; i < |dbpat|; i++)
5   Begin
6     Add dbpat.Pats[i] to listpatstring
7     Task ti = new Task(() => {
8       extendTree(dbpat.Pats[i], dbpat, listpatstring); });
9   End
10  For each task in the list of created tasks do
11    collect the set of patterns (Ps) returned by the task
12    totalPs = totalPs ∪ Ps
13 End
Figure 8. The PIB-PRISM strategy
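The task-per-branch scheme of Figure 8 can be sketched in Python as follows (a thread pool stands in for .NET tasks, and mine_branch is a dummy placeholder for extendTree):

```python
from concurrent.futures import ThreadPoolExecutor

def mine_branch(prefix):
    """Placeholder for extendTree: expand one level-1 branch of the
    search tree and return the patterns found under it (dummy result)."""
    return [prefix, prefix + prefix]

frequent_1_sequences = ["(A)", "(B)", "(C)"]

# Each level-1 branch becomes an independent task; the pool, not the
# programmer, maps tasks onto processor cores.
with ThreadPoolExecutor() as pool:
    partial_results = list(pool.map(mine_branch, frequent_1_sequences))

# The final answer is the union of the partial results (line 12).
total_patterns = [p for ps in partial_results for p in ps]
print(total_patterns)
```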
In the PIB-PRISM strategy, the pGenerate_SPs procedure is called to identify all frequent 1-
sequences in the dataset D' and store them in dbpat (line 2). Next, PIB-PRISM creates a new
task for each pattern in dbpat (line 7) and executes the extendTree procedure (line 8) to
perform itemset and sequence extensions. Each task is performed independently on a processor
core to generate a partial set of FSPs, and the final set of FSPs is the union of the partial
results. The pGenerate_SPs procedure is given in Figure 9.
Procedure pGenerate_SPs(minSup, D')
1 Begin
2   Create n tasks, with n = ⌈I/P⌉, where I is the number of single items in D' and P is the number of processor cores
3   For each item i in I
4     Assign item i to tasks[i]
5     Generate_SP(i)
6   End Foreach
7 End
Procedure Generate_SP(item)
1 Begin
2   If support(item) ≥ minSup then
3     Encode the information of item based on PRISM
4     Store the information of item in dbpat
5 End
Figure 9. Procedure Generate_SP
In Figure 9, the pGenerate_SPs procedure creates n tasks (line 2) and assigns them to
processor cores to execute the Generate_SP procedure (line 5), which uses PRISM to find
frequent 1-sequences. The extendTree procedure is shown in Figure 10.
Procedure extendTree(Pattern p, DB_Pattern dbpat, int level, List<string> listpatstring)
1 Begin
2   extendItemset(p, dbpat, listpatstring)
3   extendSequence(p, dbpat, listpatstring)
4   extendTreeCollection(p.Itemset_ext_pattern, dbpat, level + 1, listpatstring)
5   extendTreeCollection(p.Sequence_ext_pattern, dbpat, level + 1, listpatstring)
6 End
Procedure extendItemset(Pattern p, DB_Pattern dbpat, List<string> listpatstring)
1 Begin
    // Let item_p be the last item of pattern p
    // and i the position of item_p in dbpat
2   While (i ≤ |dbpat|)
3   Begin while
4     Pattern pnew = createNewPattern(p, dbpat.Pats[i], true)
5     Add pnew to listpatstring
6     Add pnew to p.Itemset_ext_pattern
7     i++
8   End while
9 End
Procedure extendSequence(Pattern p, DB_Pattern dbpat, List<string> listpatstring)
1 Begin
2   For each (Pattern pi in dbpat.Pats)
3     Pattern pnew = createNewPattern(p, pi, false)
4     Add pnew to listpatstring
5     Add pnew to p.Sequence_ext_pattern
6   End foreach
7 End
Procedure extendTreeCollection(List<Pattern> listpats, DB_Pattern dbpat, int level, List<string> listpatstring)
1 Begin
2   For each (Pattern pi in listpats)
3     extendTree(pi, dbpat, level, listpatstring)
4   End foreach
5 End
Figure 10. Procedure extendTree
For each node in the search tree, the pattern at that node is extended by calling extendItemset
and extendSequence (Gouda et al., 2010) to create new patterns, as shown in Figure 10. An
example of the results of itemset extension and sequence extension executed on the data in
Table 1 is shown in Table 2. The set of frequent sequential patterns derived is shown in Table 3.
Table 2. Itemset extension and sequence extension results from Table 1 with minSup = 50%
Prefix    Itemset extension    Support ≥ minSup    Sequence extension    Support ≥ minSup
(A): 4 (AB):4 Yes (A)(A): 2 No
(AC): 1 No (A)(B):3 Yes
(A)(C): 3 Yes
(AB): 4 (ABC): 0 No (AB)(A): 2 No
(AB)(B): 3 Yes
(AB)(C): 3 Yes
(AB)(B): 3 (AB)(BC): 2 No (AB)(B)(A): 2 No
(AB)(B)(B): 3 Yes
(AB)(B)(C): 3 Yes
(AB)(B)(B): 3 (AB)(B)(BC): 2 No (AB)(B)(B)(A): 2 No
(AB)(B)(B)(B): 2 No
(AB)(B)(B)(C): 2 No
(AB)(B)(C): 3 (AB)(B)(C)(A): 0 No
(AB)(B)(C)(B): 0 No
(AB)(B)(C)(C): 0 No
(AB)(C): 3 (AB)(C)(A): 0 No
(AB)(C)(B): 1 No
(AB)(C)(C): 1 No
(A)(B): 3 (A)(BC): 2 No (A)(B)(A): 2 No
(A)(B)(B): 3 Yes
(A)(B)(C): 3 Yes
(A)(B)(B): 3 (A)(B)(BC): 2 No (A)(B)(B)(A): 2 No
(A)(B)(B)(B): 2 No
(A)(B)(B)(C): 2 No
(A)(B)(C): 3 (A)(B)(C)(A): 0 No
(A)(B)(C)(B): 0 No
(A)(B)(C)(C): 0 No
(A)(C): 3 (A)(C)(A): 0 No
(A)(C)(B): 1 No
(A)(C)(C): 1 No
(B): 5 (BC): 3 Yes (B)(A): 3 Yes
(B)(B): 5 Yes
(B)(C): 4 Yes
(BC): 3 (BC)(A): 0 No
(BC)(B): 1 No
(BC)(C): 1 No
(B)(A): 3 (B)(AB): 3 Yes (B)(A)(A): 2 No
(B)(AC): 1 No (B)(A)(B): 2 No
(B)(A)(C): 2 No
(B)(AB): 3 (B)(ABC): 0 No (B)(AB)(A): 2 No
(B)(AB)(B): 2 No
(B)(AB)(C): 2 No
(B)(B): 5 (B)(BC): 3 Yes (B)(B)(A): 2 No
(B)(B)(B): 4 Yes
(B)(B)(C): 4 Yes
(B)(BC): 3 (B)(BC)(A): 0 No
(B)(BC)(B): 1 No
(B)(BC)(C): 1 No
(B)(B)(B): 4 (B)(B)(BC): 3 Yes (B)(B)(B)(A): 2 No
(B)(B)(B)(B): 2 No
(B)(B)(B)(C): 2 No
(B)(B)(BC): 3 (B)(B)(BC)(A): 0 No
(B)(B)(BC)(B): 0 No
(B)(B)(BC)(C): 0 No
(B)(B)(C): 4 (B)(B)(C)(A): 0 No
(B)(B)(C)(B): 0 No
(B)(B)(C)(C): 0 No
(B)(C): 4 (B)(C)(A): 0 No
(B)(C)(B): 1 No
(B)(C)(C): 1 No
(C): 4 (C)(A): 0 No
(C)(B): 1 No
(C)(C): 1 No
Table 3. The set of frequent sequential patterns from Table 1 with minSup = 50%
No. FSP Support
1 (A) 4
2 (AB) 4
3 (AB)(B) 3
4 (AB)(B)(B) 3
5 (AB)(B)(C) 3
6 (AB)(C) 3
7 (A)(B) 3
8 (A)(B)(B) 3
9 (A)(B)(C) 3
10 (A)(C) 3
11 (B) 5
12 (BC) 3
13 (B)(A) 3
14 (B)(AB) 3
15 (B)(B) 5
16 (B)(BC) 3
17 (B)(B)(B) 4
18 (B)(B)(BC) 3
19 (B)(B)(C) 4
20 (B)(C) 4
21 (C) 4
4. Experimental results
Experiments were conducted to evaluate the proposed parallel algorithm. The experiments
were performed on a personal computer with an Intel Core i7-4790 3.6-GHz CPU with 8 cores, 3
MB of L3 cache, and 8 GB of RAM, running Windows 7. The algorithm was implemented using
the .NET Framework 4.0.
Three standard datasets were used to compare runtimes: two synthetic datasets from the IBM
data generator and the real-world Kosarak click-stream dataset. Information about the datasets
is shown in Table 4.
Table 4. The three datasets used in the experiments

Dataset | No. of sequences | No. of items
C6T5N1kD1k | 1,000 | 1,000
C6T5N1kD10k | 10,000 | 1,000
Kosarak25k | 25,000 | 41,270
The mining time of PRISM (sequential) and the proposed PIB-PRISM for various minSup
values is shown in Table 5.
Table 5. Mining time for PRISM and PIB-PRISM

Dataset | minSup | No. of patterns | Parallel mining time (sec.) | Sequential mining time (sec.)
C6T5N1kD1k | 0.8 | 11209 | 69.803 | 145.454
C6T5N1kD1k | 0.7 | 14802 | 98.048 | 197.684
C6T5N1kD1k | 0.6 | 20644 | 136.079 | 280.067
C6T5N1kD1k | 0.5 | 31189 | 216.58 | 431.34
C6T5N1kD10k | 0.8 | 8430 | 780.735 | 1108.835
C6T5N1kD10k | 0.7 | 10480 | 995.656 | 1423.003
C6T5N1kD10k | 0.6 | 13627 | 1349.231 | 1905.902
C6T5N1kD10k | 0.5 | 18461 | 1807.326 | 2632.395
Kosarak25k | 0.8 | 198 | 5.975 | 7.629
Kosarak25k | 0.7 | 244 | 9.453 | 11.263
Kosarak25k | 0.6 | 315 | 14.992 | 18.018
Kosarak25k | 0.5 | 419 | 26.13 | 31.824
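Expressed as a speedup ratio (sequential time divided by parallel time), these figures show PIB-PRISM running roughly twice as fast as PRISM on the synthetic datasets. A quick calculation over the C6T5N1kD1k rows, with the values copied from Table 5:

```python
# Speedup = sequential time / parallel time, C6T5N1kD1k rows of Table 5.
rows = {0.8: (69.803, 145.454), 0.7: (98.048, 197.684),
        0.6: (136.079, 280.067), 0.5: (216.58, 431.34)}
for minsup, (parallel, sequential) in rows.items():
    print(f"minSup = {minsup}: speedup = {sequential / parallel:.2f}x")
```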
Experiments were then conducted to compare the execution times of the two algorithms
for various minSup values (0.5-0.8%). The results are shown in Figures 11-13 for the three
datasets, respectively. With decreasing minSup values, more FSPs were obtained, and thus the
runtime increased.
Figure 11. Comparison of runtime for various minSup values for C6T5N1kD1k.
Figure 12. Comparison of runtime for various minSup values for C6T5N1kD10k.
Figure 13. Comparison of runtime for various minSup values for Kosarak25k.
The experimental results show that PIB-PRISM is faster than PRISM on all three datasets.
With minSup = 0.8%, the runtimes of PRISM and PIB-PRISM were 145.454 and 69.803 seconds,
respectively, on the C6T5N1kD1k dataset. Similar results were obtained for the other two
datasets. PIB-PRISM had a lower runtime because it divides the search into independent
branches that are processed in parallel. Overall, the proposed algorithm significantly
improves execution time compared to the original algorithm on the three datasets.
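The branch-level parallelism described above can be sketched as follows. This is a schematic illustration only: `mine_branch` is a placeholder introduced here, and a thread pool stands in for the paper's task scheduling.

```python
# Sketch: treat first-level branches of the search tree as independent
# tasks and process them in parallel.
from concurrent.futures import ThreadPoolExecutor

def mine_branch(prefix):
    # Placeholder for the depth-first expansion of the subtree rooted at
    # `prefix`; a real implementation would return every frequent pattern
    # found in that branch.
    return [prefix]

prefixes = [("A",), ("B",), ("C",)]  # frequent 1-patterns as branch roots
with ThreadPoolExecutor() as pool:
    results = [p for branch in pool.map(mine_branch, prefixes)
               for p in branch]
print(results)
```

Because the branches share no state, each task can run to completion independently and the results are simply concatenated.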
The search tree of PIB-PRISM might, however, be unbalanced: some branches contain more
nodes than others, a problem inherent to depth-first search. A solution is to sort items in
ascending order of their support values. After applying this ordering, most of the nodes in
the leftmost branches are infrequent and are pruned during the search, whereas nodes in the
rightmost branches are frequent and are not pruned. This strategy helps balance the search
tree for parallel processing.
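The ascending-support ordering amounts to a single sort over the frequent items. The sketch below assumes the 1-item supports have already been counted; the values are taken from Table 2, with an absolute minSup count of 3.

```python
# Sort frequent items by ascending support before building the search
# tree: low-support roots, whose branches prune early, come first, which
# helps balance the work across parallel tasks.
supports = {"A": 4, "B": 5, "C": 4}  # 1-item supports from Table 2
min_count = 3                        # absolute minSup threshold

frequent = [item for item, s in supports.items() if s >= min_count]
ordered = sorted(frequent, key=supports.get)  # ascending support
print(ordered)  # -> ['A', 'C', 'B']
```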
5. Conclusions and future work
This study has proposed a strategy for mining SPs in parallel on a multi-core architecture.
The proposed algorithm distributes the search for SPs across independent tasks on a multi-core
computer and uses an efficient data structure for fast SP mining. Experimental results show
that the proposed algorithm outperforms the PRISM algorithm.
This paper addressed only the problem of mining SPs using multi-core processors. In the
future, we will study parallel strategies for mining closed patterns and maximal patterns in
sequence datasets using multi-core processors. We will also study other architectures for
solving the mining problem efficiently.
References
Agrawal R, Srikant R. Mining Sequential Patterns. In ICDE’95, pp. 3–14 (1995)
Agrawal R, Srikant R. Mining Sequential Patterns: Generalizations and Performance
Improvements. In EDBT’96, pp. 3–17 (1996)
Andrew B. Multi-Core Processor Architecture Explained. In
http://software.intel.com/en-us/articles/multi-core-processor-architecture-explained: Intel
(2008)
Ayres J, Gehrke JE, Yiu T, Flannick J. Sequential Pattern Mining using a Bitmap Representation.
In SIGKDD’02, pp. 1–7 (2002)
Burdick D, Calimlim M, Gehrke J. MAFIA: A maximal frequent itemset algorithm for
transactional databases. In ICDE’01, pp. 443–452 (2001)
Casali A, Ernst C. Extracting correlated patterns on multicore architectures. In CD-ARES’13,
pp.118-133 (2013)
Cong S, Han J, Padua D. Parallel mining of closed sequential patterns. In: ACM SIGKDD’05,
pp. 562-567 (2005)
Gouda K, Hassaan M, Zaki M. Prism: An effective approach for frequent sequence mining via
prime-block encoding. Journal of Computer and System Sciences, 76(1), 88-102 (2010)
Han J, Kamber M and Pei J. Data Mining: Concepts and Techniques. 3rd Edition, Morgan
Kaufmann (2011)
Han J, Pei J, Mortazavi Asl B, Chen Q, Dayal U, Hsu M. Freespan: Frequent pattern-projected
sequential pattern mining. In KDD’00, pp. 355–359 (2000)
Han J, Pei J, and Yin Y. Mining frequent patterns without candidate generation. In ACM
SIGMOD, pp. 1–12 (2000).
Liu L, Li E, Zhang Y, Tang Z. Optimization of Frequent Itemset Mining on Multiple-Core
Processor. In VLDB '07, pp. 1275-1285 (2007)
Lo D, Khoo SC, Liu C. Mining and ranking generators of sequential patterns. In SDM’08, pp.
553-564 (2008)
Masseglia F, Cathala F, Poncelet P. The PSP Approach for Mining Sequential Patterns. In
PKDD’98, pp. 176-184 (1998)
Mannila H, Toivonen H, and Verkamo AI. Discovery of frequent episodes in event sequences.
Data Mining and Knowledge Discovery, 259-289 (1997)
Negrevergne B, Termier A, Méhaut JF, Uno T. Discovering closed frequent itemsets on
multicore: Parallelizing computations and optimizing memory accesses. In HPCS’10,
IEEE, pp. 521–528 (2010)
Negrevergne B, Termier A, Rousset MC, Méhaut JF. ParaMiner: a generic pattern mining
algorithm for multi-core architectures. Data Mining and Knowledge Discovery, 28(3), 1–
41 (2014)
Nguyen D, Vo B, Le B. Efficient strategies for parallel mining class association rules. Expert
Systems with Applications, 41(10), 4716–4729 (2014)
Slimani T, Lazzez A. Sequential mining: patterns and algorithms analysis. In International
Journal of Computer and Electronics Research, 2(5), 639-647 (2013)
Tran T, Le B, Vo B. Combination of dynamic bit vectors and transaction information for mining
frequent closed sequences efficiently. Engineering Applications of Artificial Intelligence,
183-189 (2015)
Pham T, Luo J, Hong TP, Vo B. An efficient method for mining non-redundant sequential rules
using attributed prefix trees. Engineering Applications of Artificial Intelligence 32, 88-99
(2014)
Van T, Vo B, Le B. IMSR_PreTree: an improved algorithm for mining sequential rules based on
the prefix-tree. Vietnam J. Computer Science, 1(2), 97-105 (2014)
Pham T, Luo J, Vo B. An effective algorithm for mining closed sequential patterns and their
minimal generators based on prefix trees. IJIIDS, 7(4), 324-339 (2013)
Pei J, Han J, Asl BM, Wang J, Pinto H, Chen Q, Dayal U, Hsu M. Mining Sequential Patterns
by Pattern-Growth: The PrefixSpan Approach. IEEE Trans. Knowl. Data Eng. 16(11),
1424-1440 (2004)
Schlegel B, Karnagel T, Kiefer T, Lehner W. Scalable frequent itemset mining on many-core
processors. In The 9th International Workshop on Data Management on New Hardware.
ACM. Article No. 3 (2013)
Raza K. Application of Data Mining In Bioinformatics. Indian Journal of Computer Science and
Engineering, 1(2), 114-118 (2013)
Vijayarani S, Deepa S. An efficient algorithm for sequence generation in data mining, IJCI, 3(1)
(2014)
Wang CS, Lee AJT. Mining inter-sequence patterns. Expert Systems with Applications, 36(4),
8649–8658 (2009)
Wang W, Yang J. Mining Sequential Patterns from Large Data Sets. Advances in Database
Systems 28, 1-161 (2005)
Wang J and Han J. BIDE: Efficient Mining of Frequent Closed Sequences. In ICDE '04, pp. 79 –
90 (2004)
Weichbroth P, Owoc M, Pleszkun M. Web User Navigation Patterns Discovery from WWW
Server Log Files. FedCSIS’12, 1177–1176 (2012)
Yan X, Han J, Afshar R. CloSpan: Mining Closed Sequential Patterns in Large Datasets. In
SDM’03, pp. 166-177 (2003)
Yu KM, Wu SH. An efficient load balancing multi-core frequent patterns mining algorithm. In
TrustCom’11, pp. 1408–1412 (2011)
Zaki J, Wang TL, Toivonen TT. BIOKDD01: Workshop on Data Mining in Bioinformatics. In
ACM SIGKDD Explorations, 3(2), pp. 71-73 (2002)
Zaki MJ. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning
Journal, 42, 31–60 (2001a)
Zaki J. Parallel sequence mining on shared-memory machines. Journal of Parallel and
Distributed Computing, 61(3), 401-426 (2001b)
Zubi ZS, Raiani MSE. Using web logs dataset via web mining for user behavior understanding.
International Journal of Computers and Communications, 8, 103-111 (2014)