Using Glocal Event Alignment for Comparing Sequences of ... · Using Glocal Event Alignment for...

15
Using Glocal Event Alignment for Comparing Sequences of Significantly Different Lengths Vinh-Trung Luu, Mathis Ripken, Germain Forestier, Fr´ ed´ eric Fondement, Pierre-Alain Muller MIPS, Universit´ e de Haute Alsace, 12, rue des fr` eres Lumi` ere - 68093 Mulhouse cedex - France {trung.luu-vinh, mathis.ripken, germain.forestier, frederic.fondement, pierre-alain.muller}@uha.fr Abstract. This work takes place in the context of conversion rate op- timization by enhancing the user experience during navigation on e- commerce web sites. The requirement is to be able to segment visi- tors into meaningful clusters, which can then be targeted with specific call-to-actions, in order to increase the web site turnover. This paper presents an original approach, which equally combines global- and local- alignment techniques (Needleman-Wunsch and Smith-Waterman) in or- der to automatically segment visitors according to the sequence of vis- ited pages. Experimental results on synthetic datasets show that our approach out-performs other typically used alignment metrics, such as hybrid approaches or Dynamic Time Warping. Keywords: web mining, sequential pattern mining, clustering 1 Introduction Conversion rate optimization is considered as one of the most promising ap- proaches for improving the turnover of e-commerce web sites. A lot of researches have already focused on understanding web browsing event-patterns, in order to improve the online content delivery. Clustering visitors into meaningful seg- ments, associated to targeted call-to-actions and related item recommendation, is one of the techniques typically used for cross- and up-selling. As clustering aims at organizing similar items into the same group with no prior knowledge of item class, it is seen as an approach of unsupervised learning. In web usage mining context, the similarity of page visits and their order in a session is one of the relevant information to cluster. For example, cluster analysis helps to reach people who are interested in some specific kind of goods or services so that the owner can recommend to such groups other related things, or offer them some discounts. The clustering result can also be applied to advertising placement organization on web sites, based on page visiting frequency in each cluster. In this paper, we present our work for computing the similarity between event-sequences of significantly different lengths. Our proposal is based on a new This is a pre-print of an article published at MLDM 2016. The final authenticated version is available online at: http://link.springer.com/chapter/10.1007%2F978-3-319-41920-6_5

Transcript of Using Glocal Event Alignment for Comparing Sequences of ... · Using Glocal Event Alignment for...

Page 1: Using Glocal Event Alignment for Comparing Sequences of ... · Using Glocal Event Alignment for Comparing Sequences 3 mon pages between two browsing behaviors is a signi cant factor

Using Glocal Event Alignment for ComparingSequences of Significantly Different Lengths

Vinh-Trung Luu, Mathis Ripken, Germain Forestier,Frederic Fondement, Pierre-Alain Muller

MIPS, Universite de Haute Alsace,12, rue des freres Lumiere - 68093 Mulhouse cedex - France{trung.luu-vinh, mathis.ripken, germain.forestier,

frederic.fondement, pierre-alain.muller}@uha.fr

Abstract. This work takes place in the context of conversion rate op-timization by enhancing the user experience during navigation on e-commerce web sites. The requirement is to be able to segment visi-tors into meaningful clusters, which can then be targeted with specificcall-to-actions, in order to increase the web site turnover. This paperpresents an original approach, which equally combines global- and local-alignment techniques (Needleman-Wunsch and Smith-Waterman) in or-der to automatically segment visitors according to the sequence of vis-ited pages. Experimental results on synthetic datasets show that ourapproach out-performs other typically used alignment metrics, such ashybrid approaches or Dynamic Time Warping.

Keywords: web mining, sequential pattern mining, clustering

1 Introduction

Conversion rate optimization is considered as one of the most promising ap-proaches for improving the turnover of e-commerce web sites. A lot of researcheshave already focused on understanding web browsing event-patterns, in orderto improve the online content delivery. Clustering visitors into meaningful seg-ments, associated to targeted call-to-actions and related item recommendation,is one of the techniques typically used for cross- and up-selling. As clusteringaims at organizing similar items into the same group with no prior knowledgeof item class, it is seen as an approach of unsupervised learning. In web usagemining context, the similarity of page visits and their order in a session is one ofthe relevant information to cluster. For example, cluster analysis helps to reachpeople who are interested in some specific kind of goods or services so that theowner can recommend to such groups other related things, or offer them somediscounts. The clustering result can also be applied to advertising placementorganization on web sites, based on page visiting frequency in each cluster.

In this paper, we present our work for computing the similarity betweenevent-sequences of significantly different lengths. Our proposal is based on a new

This is a pre-print of an article published at MLDM 2016. The final authenticated version is availableonline at: http://link.springer.com/chapter/10.1007%2F978-3-319-41920-6_5

Page 2: Using Glocal Event Alignment for Comparing Sequences of ... · Using Glocal Event Alignment for Comparing Sequences 3 mon pages between two browsing behaviors is a signi cant factor

2 Luu et al.

way of equally combining global- and local-alignment techniques (i.e. Needleman-Wunsch [21] and Smith-Waterman [29]). The originality of our measure is to takeinto account the length of longest sequence in the pair of compared sequences.

Thus, regardless of the difference in sequence lengths, the result providedby our metric is accurate and can be used to perform clustering. Experimen-tal results show that our approach outperforms other typically used similaritymeasures, such as hybrid approaches or Dynamic Time Warping (DTW), in thecontext of event-sequences of different lengths. This paper is divided into fivesections with the following structure: Section 2 explains the proposed method.Section 3 describes experimental results. The discussion of these results is in Sec-tion 4. Section 5 presents related work. Finally, Section 6 concludes the paperand gives some future research directions.

2 Proposed method

Before discussing our sequence alignment approach, we introduce a few basicconcepts: Given a finite set Σ whose elements are characters, called alphabet,any possible string of length k > 0 over Σ is a k-tuple built by charactersfrom Σ. For example, if Σ = {A,B,C}, a set S = {s1, s2, . . . , sn} of n finitestrings over Σ can consist of s1 = AB, s2 = ABC, . . . , sn = ACB. In our model,each web session contains a series of page visits is assumed to be a sequence(i.e., visits which are ordered). Hence, each sequence si is composed as a stringfrom Σ, representing a session. A set of navigation sequences S, as mentioned,contains sessions from multiple visitors. To group sessions based on visit order ofvisitors, our method works as follows: S is processed to create clusters containingcomparable sequences that are dissimilar to sequences in other clusters. For thispurpose, our alignment-based similarity measure is proposed. An alignment overa set of sessions S = {s1, s2, . . . , sn} can be described as another set Sa ={s1a, s2a, . . . , sna} of equal length sessions which built by adding necessary gap“-”to si, for 1 ≤ i ≤ n [7]. Next, elements at the same index of session stringsare compared and scored by a scoring scheme, so-called similarity definition.

Sequence similarity definition of specific context (or application-dependent)is essential to perform a relevant similarity evaluation. For instance, the corre-spondence of DNA sequences [26] is not identical to time-series [18], and both ofthem are different to web session similarity. Therefore, session similarity measurehas to be adapted to web usage situation. At first, it has to deal with the varietyof session lengths and thus traditional vector distances like Euclidean, Manhat-tan or even Hamming cannot be applied. Such distances require equal-lengthsequences like s1 = ABCD, s2 = ABCD. Secondly, as sessions are expected tobe differentiated by page visit orders, the appropriate metric has to take intoaccount this order. Thus, metrics such as Levenshtein and VLVD [25] are inap-propriate as they consider the visit of pageA before B and vice versa to be thesame. As a result, s1 = ABCD and s2 = BDCA are identical. These statis-tical approaches count the occurrence of element in each sequence to measuretheir similarity, regardless of element order. Additionally, the continuity of com-

Page 3: Using Glocal Event Alignment for Comparing Sequences of ... · Using Glocal Event Alignment for Comparing Sequences 3 mon pages between two browsing behaviors is a signi cant factor

Using Glocal Event Alignment for Comparing Sequences 3

mon pages between two browsing behaviors is a significant factor to evaluatetheir correspondence. Therefore, LCS or SAM[11] should not be used as it con-sider sessions started by page A and ended by B, that are common pages, likes1 = AB and s2 = ACDFB to be the same, regardless of how many uniquepages between them. As the matter of fact, there is a meaningful difference inweb visitor interest when one hits C,D and F between A and B and the otherhits no pages between those two but they are not counted in this kind of metrics.In summary, an applicable approach in web usage mining context should be ableto (1) process sequences with variable length, (2) take the order and successionof common pages into consideration. Such measures are suitable to compute thesimilarity between two sets of visits.

The Needleman-Wunsch (NW) method is a dynamic programming algorithmfor sequence alignment which was developed by Needleman and Wunsch in 1970.Dynamic programming makes it possible to find the optimal alignment of se-quences, is easy to implement and popular in computer science. When aligningelements of a sequence, matching and mismatching scoring scheme are given. Acorresponding score matrix is then established to find the highest score of allpossible alignments. In NW, this alignment is carried out from beginning to endof each sequence, it is called a global alignment. Global alignment is appropriateto work with sequences of similar length to find their best alignment. However,sequences may inherently not have the same length but might contains similarsubsequences. Thus, a local alignment is relevant to detect them. To address thisissue, the Smith-Waterman (SW) method, introduced by Smith and Watermanin 1981, performs the alignment by taking high comparable regions within se-quences into account, regardless of the dissimilar parts and even the differenceof sequence lengths. These two dynamic programming algorithms are commonlyadopted for aligning protein or nucleotide sequences [14,19].

As NW and SW alignment methods are somehow opposite, each one has theirown advantages and drawbacks. NW finds the optimal similarity of the entiresequence, while SW detects regions of likeliness between two sequences. As aresult, a combination of both methods is better than using a single one, sincethe correspondence between sequences can be evaluated correctly (i.e. globallyand locally). We pointed out the effectiveness of the NW and SW rules combi-nation compared to DTW [24] and hybrid metric [8,6] in our previous work [16].Following this finding, we propose a new similarity metric called combinationmetric. This metric is based on NW, SW and the size of the longest sequence:

S(si, sj) =

[NW (si, sj)

l

]+

[SW (si, sj)

(2 ∗ l)

](1)

with NW (si, sj) and SW (si, sj) respectively NW and SW scores betweenthe two sequences si and sj , l the length of longest sequence in the pair (i.e.max(|si|, |sj |)), the NW scoring scheme of +1 for matching and -1 for non-matching pair of items in sequences, the SW scoring scheme of +2 for matchingand -1 for non-matching inside matching, ignore non-matching outside.

Page 4: Using Glocal Event Alignment for Comparing Sequences of ... · Using Glocal Event Alignment for Comparing Sequences 3 mon pages between two browsing behaviors is a signi cant factor

4 Luu et al.

Another way to combine NW and SW was proposed previously in hybridmetric scores similarity [8,6] between two sequences si and sj , is defined as:

S(si, sj) = (1− p) ∗ SW (si, sj) + p ∗NW (si, sj) (2)

with the defined parameter p = |si|/|sj |. From this definition (Eq. 2) , onecan notice that hybrid metric does not equally take both NW and SW intoaccount because of the difference in sequence lengths. As a consequence, theadvantage of combination metric (Eq. 1) over the hybrid metric (Eq. 2) is thatthe similarity measure works better when sequence lengths are very different. Inthis case, hybrid metric only focuses on SW, while combination metric focuseson both NW and SW. Consequently, the more different in lengths the sequencepair are, the more the hybrid metric will focus on SW to measure their similarity.Therefore, the hybrid metric would consider two sequences of different classesto be in same class, if their difference in length and SW are important enough.On the other hand, two sequences of same class with comparable lengths maynot be similar enough to be in the same cluster because their similarity scoreby hybrid metric is smaller than in the previous case. To illustrate this scenario,we created cluster dendrograms with a toy example. The Figure 1a shows thesimilarity evaluation of the hybrid metric in case of sequences with quite differentlength. As the two sequences of blue class are not similar enough to be merged inagglomerative hierarchical clustering, the resulting clustering is of poor quality.As illustrated in Figure 1b, this case does not happen using the combinationmetric as both NW and SW are considered equally, regardless of the lengthdifference between the sequences. Consequently, two clusters green and blue ofperfect accuracy are obtained when the dendrogram is cut.

(a) Hybrid metric. (b) Combination metric. (c) DTW

Fig. 1: Example of clustering of 4 sequences of 2 classes (blue and green) withquite different length for hybrid (a), combination (b) and DTW (c) metrics.

Dynamic Time Warping (DTW) is known to be effective to find good align-ment between time-series. In the context of sequence pairs of quite differentlengths and no duplicate as in Figure 1, DTW works mostly as good as combi-nation method. As illustrated in Figure 1c, it makes blue and green sequencesmerged in agglomerative hierarchical clustering. However, DTW minimizes dis-

Page 5: Using Glocal Event Alignment for Comparing Sequences of ... · Using Glocal Event Alignment for Comparing Sequences 3 mon pages between two browsing behaviors is a signi cant factor

Using Glocal Event Alignment for Comparing Sequences 5

tance of one sequence to another by allowing flexible transformation so thattime-series with similar shapes can be detected. This feature leads to a problemwhen identical consecutive elements in sequences are merged. Thus, the warpingpath of sequence pairs is vertical or horizontal. In other words, a single elementfrom one sequence is aligned with many successive and duplicate elements in theother as they are all identical. This feature is a drawback in web usage mining assessions containing duplicate web pages should not be ”skipped” but mined forweb visitor behavior. Our proposed metric considers them to be traversal patternand takes this duplication into account while DTW regards them as only onepage no matter how many visits are duplicated. Figure 2b and 2c respectivelyillustrate the similarity evaluation of the combination metric and DTW in caseof sequences with duplicate elements. Furthermore, with sequence pairs whichcontain duplicate elements and quite different in length like in Figure 2, hybridmetric is less effective than the combination, as presented in Figure 2a.

(a) Hybrid metric. (b) Combination metric. (c) DTW

Fig. 2: Example of clustering of 4 sequences of 2 classes (blue and green) withduplicated elements for hybrid (a) and combination (b) and DTW (c) metrics.

3 Experimental results

3.1 Synthetic data

In order to evaluate the performance of our combination metric, clustering vali-dation including internal and external measures are considered. As some appro-priate classified data is not currently available for external validation, we usedinternal validation on generated synthetic datasets. Nevertheless, Liu et al. [13]analyzed both kind of validations and revealed the relevance of internal vali-dation measure in many aspects over some other external validation measures.Rendon et al. [27] also concluded that they can get better precision by validityinternal indexes than external ones, on their datasets and scenarios. To evaluatethe performance of the proposed metric and competitors (i.e. hybrid metric andDTW ), we generated 10 synthetic datasets randomly. Each of them containsmore than 500 sequences of sessions (about 520 in average) which are groupedinto three defined classes with following features:

Page 6: Using Glocal Event Alignment for Comparing Sequences of ... · Using Glocal Event Alignment for Comparing Sequences 3 mon pages between two browsing behaviors is a signi cant factor

6 Luu et al.

– Class 1: About 170 sequences of lengths [20 − 22], sharing a common sub-sequence, for instance: ABCDU3YU31DQ6Q4FO2JGHW, ABCDSI5OPHH9-EDGLPFLST, ABCDAFF5UAK7GEX3XJIU. Generated sequences in otherclasses are related to this common sub sequence ;

– Class 2: About 170 sequences of lengths [3 − 4], sharing a common sub-sequence (that is also sub-sequence of common sub-sequence in Class 1), forinstance: ZBC, ABC5, BC0 ;

– Class 3: About 170 sequences of lengths [18−20], mostly containing identicaland consecutive symbols (that appear in the common sub-sequence in Class1), for instance: DDDDDDDDBBCCCCCCCC, DDDDDDDDCCAAAA-AAA, BBBBBBBBBADBBBBBBBBB ;

Note that all the datasets used in the experiments are available to downloadhere1. As shown in Section 2, sequences with common subsequence but differentlengths such as Class 1 and 2 are likely to be misclassified by hybrid metric. Yetsequences with same consecutive symbols in Class 3 are likely to be confused withsequences of Class 2 using DTW. These classes are assumed to be representativeof behaviors that can be witnessed when analyzing real web sessions.

Using these sequence datasets, similarity matrices for each metric were com-puted. In order to implement agglomerative hierarchical clustering [9], thesematrices are then used as input with three well known hierarchical methods:single-linkage (sing.), complete-linkage (compl.) and average-linkage (avg.). Thisvariety of hierarchical methods contributes to the effectiveness of the evaluation.Table 1 shows the means (µ) and standard deviations (σ) of clustering resultprecision for the three methods with the three hierarchical methods over the 10datasets. Note that as hierarchical clustering is deterministic, running the exper-iments multiple times is not required. Thus, the means and standard deviationscorrespond to the execution on the 10 different datasets.

Table 1: Results for the three methods on the 10 datasets.

Original datasets

Hybrid DTW Combination

sing. compl. avg. sing. compl. avg. sing. compl. avg.

µ 100% 58.5% 100% 90.5% 85.5% 89.7% 100% 98.2% 100%

σ ±0% ±8.5% ±0% ±9.7% ±1.4% ±9.8% ±0% ±5.7% ±-0%

The correlation of experimental results in Table 1 illustrated by dendrogramsin Figure 3, 4 and 5 on sample set of the sequences (for sake of clarity) with leavevalues as defined classes. By cutting dendrograms at the desired level, clusters areseparated into frames and their quantity matching the number of defined classes.As shown in Figure 3, there is one cluster with unexpected accuracy containing

1 https://www.dropbox.com/sh/b6wxv5opn1u3n6n/AAB8ObwvqBPbDsnXvB9xZ_yca

Page 7: Using Glocal Event Alignment for Comparing Sequences of ... · Using Glocal Event Alignment for Comparing Sequences 3 mon pages between two browsing behaviors is a signi cant factor

Using Glocal Event Alignment for Comparing Sequences 7

sequences from both Class 1 and 3. The number of inaccurate clusters in Figure4 is two as they correspond to Class 2, 3 and Class 1, 3 sequences. However,clusters in Figure 5 achieve a perfect accuracy. These figures also illustrate thatcombination metric works very well compared to hybrid and DTW.

Fig. 3: Hierarchical clustering using hybrid metric on original dataset.

Fig. 4: Hierarchical clustering using DTW metric on original dataset.

Fig. 5: Hierarchical clustering using combination metric on original dataset.

Following, we present additional results performed to consider two popularaspects of data in web usage mining: the noise and unbalanced density of classes(i.e. classes with important difference in number of elements).

Page 8: Using Glocal Event Alignment for Comparing Sequences of ... · Using Glocal Event Alignment for Comparing Sequences 3 mon pages between two browsing behaviors is a signi cant factor

8 Luu et al.

Noise: About 15 sequences of lengths [3, 24] were randomly generated fromalphabet and numbers, for instance: APE8V98MDTIH77I, H96YXT7N, M9AK-KAA, etc. were added to the original datasets with 3 classes. The accuracy ofclustering results on these datasets is presented in Table 2.

Table 2: Results for the methods on the 10 datasets with noise.

Datasets with noise

Hybrid DTW Combination

sing. compl. avg. sing. compl. avg. sing. compl. avg.

µ 90.4% 84.3% 90% 65.8% 73.1% 86.7% 100% 89.2% 100%

σ ±11.6% ±3% ±7.2% ±1% ±9% ±11% ±0% ±6% ±-0%

Similarly to the previous results, dendrograms on sample of sequences withleave values as defined classes are presented on Figure 6, 7 and 8. These figuresillustrate the correlation of experimental results presented in Table 2. Dendro-grams are cut at the desired level to make cluster quantity match the definednumber of classes and separate them into red frames.

Fig. 6: Hierarchical clustering using hybrid metric on dataset with noise.

Fig. 7: Hierarchical clustering using DTW metric on dataset with noise.

Page 9: Using Glocal Event Alignment for Comparing Sequences of ... · Using Glocal Event Alignment for Comparing Sequences 3 mon pages between two browsing behaviors is a signi cant factor

Using Glocal Event Alignment for Comparing Sequences 9

Fig. 8: Hierarchical clustering using combination metric on dataset with noise.

Unbalanced density: The number of sequences of Class 1, 2 and 3 are respectivelyaround 320, 170 and 10 in the first three datasets. In the next four datasets,number of sequences in Class 1, 2 and 3 are respectively around 170, 320 and 10.Lastly, in the remaining three datasets, sequence numbers are around 10, 170 and320 for Class 1, 2 and 3. As the number of users having the same usage can bevery different according to specific behaviors, this kind of datasets are assumedto be representative of the data available in web usage mining. The results ofthe experiments using these unbalanced datasets are presented in Table 3.

Table 3: Results for the methods on the 10 datasets with unbalanced classes.

Unbalanced density datasets

Hybrid DTW Combination

sing. compl. avg. sing. compl. avg. sing. compl. avg.

µ 81.6% 73.2% 84.8% 90.9% 91.1% 91.2% 100% 93.7% 91.1%

σ ±19% ±18.9% ±24.5% ±16.5% ±12.7% ±12.8% ±0% ±10% ±-0%

As mentioned above, experimental results in Table 3 are illustrated by sampledata dendrograms in Figure 9, 10 and 11, with leave values as defined classes.Similarly, red frames separate elements into class defined number of clusters bycutting the tree at desired levels. Similar to the previous dendrograms, it presentsthe best accuracy obtained using combination metric compared to the others.

3.2 Real data

The dataset used for our external validation was collected from a commercialwebsite. The dataset was provided by the Beampulse company which commer-cializes a product written in Javascript and Java, which collects informationabout web visitors behaviours such as page visit order, activity time or durationof page visit. As shown in dendrograms in Figure 12,13 and 14 that is a sampleextracted from experimental result on 1500 individual sessions, each containsnumbered page(s), the advantages of our metric is highlighted compared to the

Page 10: Using Glocal Event Alignment for Comparing Sequences of ... · Using Glocal Event Alignment for Comparing Sequences 3 mon pages between two browsing behaviors is a signi cant factor

10 Luu et al.

Fig. 9: Hierarchical clustering using hybrid metric on unbalanced dataset.

Fig. 10: Hierarchical clustering using DTW metric on unbalanced dataset.

Fig. 11: Hierarchical clustering using combination metric on unbalanced dataset.

Page 11: Using Glocal Event Alignment for Comparing Sequences of ... · Using Glocal Event Alignment for Comparing Sequences 3 mon pages between two browsing behaviors is a signi cant factor

Using Glocal Event Alignment for Comparing Sequences 11

other methods. Using our metric, sessions with similar pages, similar page orderand similar length are likely to be grouped in the same cluster. However, suchfeatures are not obtained using hybrid and DTW metrics.

Fig. 12: Hierarchical clustering using DTW metric on real dataset

Fig. 13: Hierarchical clustering using hybrid metric on real dataset

Fig. 14: Hierarchical clustering using combination metric on real dataset

4 Discussion

As shown in Table 1, throughout the 10 normal datasets (i.e., with neither noisenor unbalanced clusters density), similarity matrix produced by DTW outputsthe lowest precision clustering using single- and average-linkage. However, DTWprecision is higher than hybrid metric by complete linkage. Hybrid metric is as

Page 12: Using Glocal Event Alignment for Comparing Sequences of ... · Using Glocal Event Alignment for Comparing Sequences 3 mon pages between two browsing behaviors is a signi cant factor

12 Luu et al.

good as combination metric using single and average-linkage but is significantlyworse using complete-linkage where combination metric is mostly stable.

As noise may impact the clustering algorithm performance, it is good toobtain datasets without noise before clustering. However, clustering algorithmsshould have the ability to deal with noise because of the difficulty to avoid itspresence, especially in big datasets. Our metric maintains a perfect precisionusing average and single-linkage, and also provides the highest precision usingcomplete-linkage in Table 2, where 10 datasets including noise are used as input.Meanwhile, hybrid handles noise more accurately than DTW in this context.

Similarly, clustering methods are challenged by various density datasets be-cause of its importance in the clustering process. On unbalanced datasets, thegood clustering precision obtained using our metric remain almost the sameusing single and average hierarchical methods. In contrast to DTW, hybrid met-ric is highly influenced by unbalanced density. Again, our metric reached thebest accuracy using single-linkage and always works better than the other twomethods using complete and average linkage (see Table 3).

5 Related work

The goal of web usage mining is to identify hidden pattern from visitor browsingdata. This involves clustering of different visits having similar navigational pat-terns. One of the most popular approaches to discover these clusters is sessionclassification. Among various classification forms, many previous studies havefocused on sequence alignment algorithms to evaluate the similarity of sessions.Mandal et al. [17] presented the calculation of distance between two sessions us-ing Cosine measure but it requires sessions with exactly the same length, which ismore suitable to vectors than web accesses. Furthermore, these approaches alsoignore regions of local similarity of session pages. In [5], Chitraa et al. intendedto find usage patterns using k-means algorithm application, yet clustering is todiscover hidden pattern without specific input parameters.Similarly, defining kby the granularity that clusters should be in order to group web users proposedby Jianfeng et al. in [28] is inappropriate to browsing sessions.

Pairwise alignment is commonly used to compare sequences optimally. Thereexist sequence alignment algorithms, both global and local adopted throughpairwise alignment such as NW [12] and SW [31]. These two algorithms takeinto account the similarity between sequences in different alignment [30] andhave their own strengths and drawbacks [10]. Lu et al. [15] studied how togenerate significant usage patterns using NW, however it ignores consecutionthat is essential to evaluate similarity of web session pairs. In contrary, a localalignment algorithm such as SW can only detect partial similarities [2].

Consequently, there have been previous works on integrating one into theother to take advantage of the combination. For example, Brudno et al. [4]proposed a system to align genomes with biological features glocally. Chordia etal. [6] described a hybrid metric, concerning a consolidation of global and localsequence similarity scoring. Correspondingly, the same metric was developed by

Page 13: Using Glocal Event Alignment for Comparing Sequences of ... · Using Glocal Event Alignment for Comparing Sequences 3 mon pages between two browsing behaviors is a signi cant factor

Using Glocal Event Alignment for Comparing Sequences 13

Dimopoulos et al. [8] to measure the similarity of two sequences. This formulacomputes the distance between sequences by taking global and local alignmentand their weights into consideration. These weights are in inverse proportionto each other, depending on how different sequences are in length. Specifically,local alignment weight would be greater if sequence lengths are different. As localalignment scoring does not take the difference in sequence lengths into account,this computation may work in some specific situations but not in web accessessimilarity because that difference reveals the dissimilarity in visitor browsingbehavior. Similarly, Algiriyage et al. [1] used Levenshtein distance as a similaritymetric for web session pairs but it does not identify the succession of commonvisits. On the other hand, with regard to sequential characteristic of web session,there has been approach such as [3] compares homogeneity of sequence pair byfrequency of common items occurrence but does not take into account their order,that is key feature to differentiate sessions. DTW, that is a popular algorithmin comparing two sequences of events [20] or data points [23], is also used insymbolic sequence comparison. However, this kind of approach is not effectivein web usage mining since DTW ignores duplicated elements. Consequently, it isnot able to evaluate the dissimilarity in browsing behavior comparing a specificweb page loaded many times with the same page loaded only one time.

Note that the lack of available benchmarks in the domain of web usage miningmakes the comparison of the different existing methods difficult. In order toaddress this issue, we released with this paper all datasets2 (i.e. original, withnoise and with unbalanced density) that were used for the experiments. We hopethat these datasets will be used in future research to compare new contributions.

6 Conclusion

In this paper, we have presented our contribution to event-sequence comparison,with a specific focus on sequences of significantly different lengths. This newway of combining global- and local-alignment techniques is based on the equalcombination of both approaches. We experimentally evaluated this approach incontext of clustering web site visitors, by analyzing the browsing patterns. Underthose settings, it was observed that our sequence similarity metric outperformedother related techniques. In the close future, we plan to introduce mutations inthe current datasets to better stress the combination technique in the presenceof noise. We also want to compare the approach with other techniques such asPAM/k-medoids, ROCK or Ward. Furthermore, other distance measures suchas Hamming or Levenshtein, which have been studied in a variety of sequencecomparisons including spectra [22] should be considered in upcoming works.

Supplementary materials

All the datasets used in the experiments are available here: https://www.dropbox.com/sh/b6wxv5opn1u3n6n/AAB8ObwvqBPbDsnXvB9xZ_yca.

2 https://www.dropbox.com/sh/bse1ifyu2gdywm4/AACRSbKysPVGudinjTBg6Ocsa

Page 14: Using Glocal Event Alignment for Comparing Sequences of ... · Using Glocal Event Alignment for Comparing Sequences 3 mon pages between two browsing behaviors is a signi cant factor

14 Luu et al.

References

1. Algiriyage, N., Jayasena, S., Dias, G.: Web user profiling using hierarchical clus-tering with improved similarity measure. In: Moratuwa Engineering Research Con-ference (MERCon), 2015. pp. 295–300. IEEE (2015)

2. Aruk, T., Ustek, D., Kursun, O.: A comparative analysis of smith-waterman basedpartial alignment. In: Computers and Communications (ISCC), 2012 IEEE Sym-posium on. pp. 000250–000252. IEEE (2012)

3. Bouguessa, M.: A practical approach for clustering transaction data. In: MachineLearning and Data Mining in Pattern Recognition, pp. 265–279. Springer (2011)

4. Brudno, M., Malde, S., Poliakov, A., Do, C.B., Couronne, O., Dubchak, I., Bat-zoglou, S.: Glocal alignment: finding rearrangements during alignment. Bioinfor-matics 19(suppl 1), i54–i62 (2003)

5. Chitraa, V., Thanamni, A.S.: An enhanced clustering technique for web usagemining. In: International Journal of Engineering Research and Technology. vol. 1.ESRSA Publications (2012)

6. Chordia, B.S., Adhiya, K.P.: Grouping web access sequences using sequence align-ment method. Indian Journal of Computer Science and Engineering (IJCSE) 2(3),308–314 (2011)

7. Della Vedova, G.: Multiple Sequence Alignment and Phylogenetic Reconstruction:Theory and Methods in Biological Data Analysis. Ph.D. thesis, Citeseer (2000)

8. Dimopoulos, C., Makris, C., Panagis, Y., Theodoridis, E., Tsakalidis, A.: A webpage usage prediction scheme using sequence indexing and clustering techniques.Data & Knowledge Engineering 69(4), 371–382 (2010)

9. Duraiswamy, K., Mayil, V.V.: Similarity matrix based session clustering by se-quence alignment using dynamic programming. Computer and Information Science1(3), p66 (2008)

10. Giegerich, R., Wheeler, D.: Pairwise sequence alignment. BioComputing HypertextCoursebook 2 (1996)

11. Hay, B., Wets, G., Vanhoof, K.: Clustering navigation patterns on a website using asequence alignment method. Intelligent Techniques for Web Personalization: IJCAIpp. 1–6 (2001)

12. Likic, V.: The needleman-wunsch algorithm for sequence alignment. Lecture givenat the 7th Melbourne Bioinformatics Course, Bi021 Molecular Science and Biotech-nology Institute, University of Melbourne (2008)

13. Liu, Y., Li, Z., Xiong, H., Gao, X., Wu, J.: Understanding of internal clusteringvalidation measures. In: International Conference on Data Mining. pp. 911–916.IEEE (2010)

14. Liu, Y., Hong, Y., Lin, C.Y., Hung, C.L.: Accelerating smith-waterman alignmentfor protein database search using frequency distance filtration scheme based oncpu-gpu collaborative system. International journal of genomics 2015 (2015)

15. Lu, L., Dunham, M., Meng, Y.: Discovery of significant usage patterns from clustersof clickstream data. In: Proc. of WebKDD. pp. 21–24. Citeseer (2005)

16. Luu, V.T., Forestier, G., Fondement, F., Muller, P.A.: Web site audience segmenta-tion using hybrid alignment techniques. In: Trends and Applications in KnowledgeDiscovery and Data Mining, Lecture Notes in Computer Science, vol. 9441, pp.29–40. Springer International Publishing (2015)

17. Mandal, O.P., Azad, H.K.: Web access prediction model using clustering and arti-ficial neural network. In: International Journal of Engineering Research and Tech-nology. vol. 3. ESRSA Publications (2014)

Page 15: Using Glocal Event Alignment for Comparing Sequences of ... · Using Glocal Event Alignment for Comparing Sequences 3 mon pages between two browsing behaviors is a signi cant factor

Using Glocal Event Alignment for Comparing Sequences 15

18. Meesrikamolkul, W., Niennattrakul, V., Ratanamahatana, C.A.: Shape-based clus-tering for time series data. In: Advances in Knowledge Discovery and Data Mining(PAKDD), pp. 530–541. Springer (2012)

19. Muhamad, F.N., Ahmad, R., Asi, S.M., Murad, M.: Reducing the search spaceand time complexity of needleman-wunsch algorithm (global alignment) andsmith-waterman algorithm (local alignment) for dna sequence alignment. JurnalTeknologi 77(20) (2015)

20. Nakamura, A., Kudo, M.: Packing alignment: alignment for sequences of variouslength events. In: Advances in Knowledge Discovery and Data Mining (PAKDD),pp. 234–245. Springer (2011)

21. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search forsimilarities in the amino acid sequence of two proteins. Journal of molecular biology48(3), 443–453 (1970)

22. Perner, P.: A novel method for the interpretation of spectrometer signals basedon delta-modulation and similarity determination. In: Advanced Information Net-working and Applications (AINA), 2014 IEEE 28th International Conference on.pp. 1154–1160. IEEE (2014)

23. Petitjean, F., Forestier, G., Webb, G., Nicholson, A.E., Chen, Y., Keogh, E., et al.:Dynamic time warping averaging of time series allows faster and more accurateclassification. In: International Conference on Data Mining. pp. 470–479. IEEE(2014)

24. Petitjean, F., Gancarski, P.: Summarizing a set of time series by averaging: Fromsteiner sequence to compact multiple alignment. Theoretical Computer Science414(1), 76–91 (2012)

25. Poornalatha, G., Raghavendra, P.S.: Web user session clustering using modified k-means algorithm. In: Advances in Computing and Communications, pp. 243–252.Springer (2011)

26. Qi, Z., Redding, S., Lee, J.Y., Gibb, B., Kwon, Y., Niu, H., Gaines, W.A., Sung,P., Greene, E.C.: Dna sequence alignment by microhomology sampling during ho-mologous recombination. Cell 160(5), 856–869 (2015)

27. Rendon, E., Abundez, I., Arizmendi, A., Quiroz, E.: Internal versus external clustervalidation indexes. International Journal of computers and communications 5(1),27–34 (2011)

28. Si, J., Li, Q., Qian, T., Deng, X.: Discovering k web user groups with specificaspect interests. In: Machine Learning and Data Mining in Pattern Recognition,pp. 321–335. Springer (2012)

29. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences.Journal of molecular biology 147(1), 195–197 (1981)

30. Yan, R., Xu, D., Yang, J., Walker, S., Zhang, Y.: A comparative assessment andanalysis of 20 representative sequence alignment methods for protein structureprediction. Scientific reports 3 (2013)

31. Zahid, S.K., Hasan, L., Khan, A.A., Ullah, S.: A novel structure of the smith-waterman algorithm for efficient sequence alignment. In: International Conferenceon Digital Information, Networking, and Wireless Communications (DINWC). pp.6–9. IEEE (2015)