The Cyclic Model Analysis on Sequential Patterns
Ding-An Chiang, Cheng-Tzu Wang, Shao-Ping Chen, and Chun-Chi Chen
Abstract—Sequential pattern mining has been used to predict various aspects of customer buying behavior for a long time.
Discovered sequence reveals the chronological relation between items and provides valuable information to aid in developing
marketing strategies. Nevertheless, we can hardly know whether the buying is cyclic and how long the interval between the two
consecutive items in the sequential pattern is. To solve this problem, in this paper, data mining skills and the fundamentals of statistics
are combined to develop a set of algorithms to unearth the cyclic properties of discovered sequential patterns. The algorithms, coupled
with the sequential pattern mining process, constitute a thorough scheme to analyze and predict likely consumer behavior. The
proposed algorithms are implemented and applied to test against real data collected from a consumer goods company. The
experimental results illustrate how the model can be used to predict likely purchases within a certain time frame. Consequently,
marketing professionals can execute campaigns to favorably impact customers’ behaviors.
Index Terms—Association rules, data mining, frequency, sequential pattern, polynomial regression.
1 INTRODUCTION
DISCOVERING sequential relationships present in transaction data is important to many application domains. It is particularly useful in the analysis of customers, where certain buying patterns can be identified, such as likely follow-up purchases. When the data can be interpreted from a temporal or sequential perspective, the task of sequential pattern mining is to identify the sequences whose statistical significance in the database is above the user-specified threshold.
Sequential pattern mining is applicable in a wide range of applications. For example, the sequential patterns discovered from supermarket transactions can provide insight for developing marketing and product strategies, the patterns mined from Web usage logs can suggest a better way to arrange the Web site, and the alarm patterns occurring in telecom networks are very useful for alarm prediction and alarm control.
The problem of discovering sequences was first introduced by Agrawal and Srikant [2]. Many algorithms were developed afterward and successfully improved the efficiency of mining sequential patterns, so a great diversity of algorithms for sequential pattern mining exists. The most basic and earliest algorithms are based upon the Apriori algorithm [1]. The core of the Apriori property is that any subpattern of a frequent pattern must be frequent.
Based on this heuristic, a series of Apriori-like algorithms such as AprioriAll, AprioriSome, DynamicSome [3], and GSP [17] was developed. Later on, different kinds of algorithms were proposed by different researchers. For example, FreeSpan [8] and PrefixSpan [9] were developed by the data projection approach. SPADE [19] is a lattice-based algorithm, MEMISP [11] is a memory-indexing-based approach, while SPIRIT [6] integrates constraints by using regular expressions. The above algorithms focus on the chronological order only.
Some have tried to extend the mining of sequential patterns to periodical patterns. Periodicity analysis attempts to analyze the data to identify patterns that repeat or recur in a time series. In general, full periodicity indicates the situation where all data points contribute to the behavior of the series, whereas partial periodicity means that only certain points contribute to the behavior of the series. Cyclical periodicity relates to a set of events that occurs periodically.
Han et al. [7] proposed two algorithms for mining partial periodic patterns: single period and multiple periods. Yang et al. [18] proposed distance-based pruning to find periodic patterns that may contain a disturbance of length up to a certain threshold. The mining of frequent partial periodic sequential patterns in a time series is to find such patterns, possibly with some restriction or disturbance.
Ozden et al. [14] proposed the sequential algorithm and the interleaved algorithm to determine cyclic association rules. Association rules capture interrelationships between various items. Cycle Pruning, Cycle Skipping, and Cycle Elimination aim to identify the association rules that have the minimum confidence and support occurring at regular intervals. Chiang et al. [5] and Chen et al. [4] proposed algorithms to determine the intervals of recurring patterns.
The approaches presented above primarily target the order of items purchased, or cyclic patterns occurring within a time window defined by users. However, by these works,
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 11, NOVEMBER 2009 1617
. D.A. Chiang, S.P. Chen, and C.C. Chen are with the Department of Computer Science and Information Engineering, Tamkang University, Tamsui, Taipei County, Taiwan 25137, ROC. E-mail: [email protected], {694190033, 894190049}@s94.tku.edu.tw.
. C.-T. Wang is with the Department of Computer Science, National Taipei University of Education, Taipei 106, Taiwan, ROC. E-mail: [email protected].
Manuscript received 15 Dec. 2006; revised 21 Dec. 2007; accepted 6 Jan. 2009; published online 16 Jan. 2009. Recommended for acceptance by W. Wang. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-0566-1206. Digital Object Identifier no. 10.1109/TKDE.2009.36.
1041-4347/09/$25.00 © 2009 IEEE Published by the IEEE Computer Society
the cycle and interval of items purchased are hardly known. Thus, the best time to recommend the right products to the right person is hardly known either. In fact, periodical patterns are common in daily life. A time-interval sequential pattern provides more information than a conventional sequential pattern does, and discovering the time interval between successive item sets is the first step toward a more accurate analysis of customer behavior. Therefore, in this paper, we develop a set of algorithms to analyze the periodical properties of time intervals over sequential patterns.
Data mining skills and the fundamentals of statistics are combined to introduce an algorithm, Cyclic Model Analysis (CMA), to find the model of recurring purchasing. The modeling process commences with the discovery of sequential patterns from the transactional database. Then the existence of periodicity is identified and the interval of successive events is computed by Generalized Periodicity Detection (GPD)/Trend Modeling (TM), which will be explained in more detail later. Next, the CMA algorithm is used to obtain the period and trends of the quantities purchased. Consequently, marketing people can recommend the right products to the right customers at the right time.
This section has reviewed prior work related to sequential pattern mining and explained the motivation and research objectives of this paper. The remainder of this paper is organized as follows: The mathematical models that portray the sequential buying behavior are constructed in Section 2. The proposed algorithms are presented in Section 3. The experimental results are shown in Section 4. A short conclusion and a discussion of future directions are given in Section 5.
2 PROBLEM STATEMENT
An item set i, denoted by $(x_1, x_2, \ldots, x_t)$, is a nonempty set of items. A sequence S, denoted by $\langle i_1, i_2, \ldots, i_q \rangle$, is an ordered set of item sets. The size of a sequence S, written as $|S|$, is the number of elements in S. A sequence is a k-sequence if $|S| = k$. For example, sequence <a, b, c, d> is a 4-sequence.
A sequence $\langle a_1, a_2, \ldots, a_n \rangle$ is a subsequence of another sequence $\langle b_1, b_2, \ldots, b_m \rangle$ if there exist integers $1 \le i_1 < i_2 < i_3 < \cdots < i_n \le m$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots$, and $a_n \subseteq b_{i_n}$. We also say that the sequence $\langle a_1, a_2, \ldots, a_n \rangle$ is contained in the sequence $\langle b_1, b_2, \ldots, b_m \rangle$. For example, the sequence <a, b> is a subsequence of <(a, c), (b, d)> since $a \subseteq (a, c)$ and $b \subseteq (b, d)$. On the other hand, the sequence <(a, c), b> is not contained in <(a, c, b)>, and vice versa.
Given a database D of customer transactions, each transaction is characterized by the fields <customer-id>, <time stamp>, and <items purchased>. More precisely, each transaction is a set of item sets and each sequence is a list of transactions ordered by transaction time. Usually, the list of all the transactions of a customer is called the customer sequence.
A customer supports a sequence s if s is contained in the corresponding customer sequence. The support for a sequence s is defined as the fraction of customers whose customer sequences contain s. The definition of support for a sequence s can be written as follows:
$$\mathrm{Support}(s) = \frac{\text{number of customers supporting sequence } s}{\text{total number of customers}}.$$
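The containment test and the support ratio defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `is_subsequence`, `support`, and `seq`, as well as the toy customer database, are assumptions made for this example and do not reproduce Table 1.

```python
from typing import FrozenSet, List, Sequence

ItemSet = FrozenSet[str]


def is_subsequence(pattern: Sequence[ItemSet], data: Sequence[ItemSet]) -> bool:
    """Return True if `pattern` is contained in `data`.

    Each pattern element must be a subset of a distinct data element,
    and the matched data elements must appear in the same order.
    """
    pos = 0
    for element in pattern:
        # Greedily take the earliest data element that covers this one.
        while pos < len(data) and not element <= data[pos]:
            pos += 1
        if pos == len(data):
            return False
        pos += 1
    return True


def support(pattern: Sequence[ItemSet],
            customer_sequences: Sequence[Sequence[ItemSet]]) -> float:
    """Fraction of customers whose customer sequence contains `pattern`."""
    supporters = sum(1 for s in customer_sequences if is_subsequence(pattern, s))
    return supporters / len(customer_sequences)


def seq(*elements: str) -> List[ItemSet]:
    """Helper: seq("ac", "bd") builds <(a, c), (b, d)> from one-letter items."""
    return [frozenset(e) for e in elements]


# Hypothetical toy database with four customers (not Table 1 of the paper).
customers = [seq("ac", "bd"), seq("acb"), seq("a", "b"), seq("d")]
print(support(seq("a", "b"), customers))
```

With this toy database, two of the four customer sequences contain <a, b>, so the pattern would pass a minsup threshold of 0.5.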
A sequence is maximal if it is not contained in any other sequence. Given a database D of customer transactions, sequential pattern mining is the process of finding the maximal sequences among all sequences that have a certain user-specified minimum support. Each such maximal sequence represents a sequential pattern. The user-specified minimum support threshold (denoted by minsup) represents the statistical significance of a sequence in the database.
Table 1 gives a simple example, which contains four customers and their activities over one month. Given the threshold minsup = 0.5, three frequent sequences <A, F>, <F, H>, and <D, E> are found. The support of <A, F> is 3/4 = 0.75. The support of <F, H> and <D, E> is 2/4 = 0.5. Hence, there are three sequential patterns in the example database.
A sequential pattern indicates the correlation between transactions. The sequence mined from the transaction databases represents the order of purchases by the same customer, where the items come from different transactions. A typical example of such a sequential pattern is a customer who buys a personal computer, then a laser printer. As discussed in the previous section, there are many algorithms developed by researchers to address the problem of efficiently discovering sequences. However, prior works seldom address our major concerns: Tendency and Periodicity. Whether the next purchase will happen or how long the purchase behavior will last is hard to tell. A tool to capture the characteristics of discovered sequences is needed. To simplify the discussion, the case for a 2-sequence $\langle i_1, i_2 \rangle$, where $i_1, i_2$ are item sets, is considered. The item set is a collection of items. Thus, the case can be extended to more complicated situations. Given a 2-sequence $\langle i_1, i_2 \rangle$ mined from transactions, the definition of the Trend Distribution Function (TDF) of the 2-sequence is stated in Definition 1.
Definition 1. The sequential pattern $s = \langle i_1, i_2 \rangle$ is a 2-sequence mined from the transaction database over the designated time frame $T = [t_1, t_n]$. The Trend Distribution Function of the given sequence s, denoted by $f(x_j)$, is a nonnegative function defined on $[0, t_n - t_1]$. A sequence s is said to be an $x_j$-interval-sequence if the interval difference between $i_1$ and $i_2$ is $x_j$. The value of $f(x_j)$ is the total number of occurrences of the $x_j$-interval-sequence in the transaction database D.
The pseudocode for computing the value of the trend distribution function is presented in Fig. 1.
TABLE 1. The Transactions by Four Customers over One Month
For a 2-sequence <a, b>, many complicated distributions exist. The examples below illustrate how the TDF is computed. For the sake of brevity, the postfix increment operator ++ means adding 1 to the value of the expression. Therefore, $f(d_n)$++ means that the value of the function at $d_n$ is incremented by one. Considering the example 2-sequence <a, b>, the different types of appearances are as follows:
1. Simple: $\langle (a, t_1), (b, t_2) \rangle$
⇒ $f(t_2 - t_1)$++.
2. Repeated appearance: $\langle (a, t_1), (b, t_2), (b, t_3), (b, t_4), \ldots, (b, t_n) \rangle$
⇒ $f(t_2 - t_1)$++, $f(t_3 - t_1)$++, $f(t_4 - t_1)$++, ..., $f(t_n - t_1)$++.
3. Multiple appearance: $\langle (a, t_1), (b, t_2), (a, t_3), (b, t_4), (a, t_5), (b, t_6) \rangle$
⇒ $f(t_2 - t_1)$++, $f(t_4 - t_3)$++, $f(t_6 - t_5)$++.
4. More complicated: $\langle (a, t_1), (b, t_2), (a, t_3), (a, t_4), (b, t_5), (a, t_6), (b, t_7), (b, t_8) \rangle$
⇒ $f(t_2 - t_1)$++, $f(t_5 - t_4)$++, $f(t_7 - t_6)$++, $f(t_8 - t_6)$++.
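The four cases above are all consistent with one counting rule: every purchase of b is paired with the most recent preceding purchase of a by the same customer. The sketch below implements that rule in Python. It is inferred from the listed cases rather than taken from Fig. 1, so the rule itself, the function name `trend_distribution`, and the event representation are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, int]  # (item, time stamp)


def trend_distribution(customer_events: Iterable[List[Event]],
                       first: str, second: str) -> Dict[int, int]:
    """Count interval occurrences for the 2-sequence <first, second>.

    Assumed rule: each purchase of `second` is paired with the most
    recent earlier purchase of `first` by the same customer.
    """
    f: Counter = Counter()
    for events in customer_events:
        last_first = None  # time of the most recent purchase of `first`
        for item, t in sorted(events, key=lambda e: e[1]):
            if item == first:
                last_first = t
            elif item == second and last_first is not None:
                f[t - last_first] += 1  # f(interval)++
    return dict(f)


# Case 4 ("more complicated") with hypothetical time stamps t1..t8.
case4 = [("a", 1), ("b", 2), ("a", 4), ("a", 5),
         ("b", 9), ("a", 10), ("b", 13), ("b", 17)]
print(trend_distribution([case4], "a", "b"))
```

For case 4 the sketch increments f at t2 − t1, t5 − t4, t7 − t6, and t8 − t6, matching the list above. Events with identical time stamps are not handled specially here.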
The distribution function defined in Definition 1 is a time-series representation of the transactions associated with the sequence discovered. The traditional sequential formulation reveals the chronological order of purchase only. Therefore, the distribution function model is constructed to portray the sequential purchasing phenomenon. Take the sequence $\langle i_1, i_2 \rangle$ as an example. The distribution function is characterized by the interval between the purchases of the consecutive items. The graph of the function is the set of all points (x, f(x)) in the xy-plane such that x is in the domain of f and y = f(x). If the plot of the function is sketched, the movement along the curve shows the tendency of the purchase.
Since the function is a portrait of the real world, the graph of the function will never be a simple straight line. What is more, it is unlikely to be a periodic function with the standard sine (or cosine) wave graph either. The curve goes along the x-axis with the inclination moving upward or downward with slight vibration. To better facilitate the study of the tendency of the function, the slope of the regression line of the function within a designated domain is used to characterize the inclination of the function.
For any given subset of the domain of the function, a simple linear regression is used to construct the regression line of the function within the subdomain. If the slope of the obtained straight line is a positive number, the purchase of item-i2 increases at a certain rate. If the slope is a negative number, the sales volume of item-i2 decreases at a certain rate.
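As a small illustration of this step, the following Python sketch fits the least-squares line y = ax + b over a subdomain and reports the sign of the slope. The helper names `fit_line` and `trend_direction` and the sample values are assumptions introduced for this example.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def trend_direction(xs: Sequence[float], ys: Sequence[float]) -> str:
    """Classify the inclination of the function on the given subdomain."""
    a, _ = fit_line(xs, ys)
    if a > 0:
        return "increasing"
    if a < 0:
        return "decreasing"
    return "flat"


# Hypothetical counts of follow-up purchases at intervals 0..3.
print(trend_direction([0, 1, 2, 3], [5, 4, 4, 2]))
```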
The distribution of interest can be categorized into three types. The first type is the simply ascending type. The second one is the plainly descending distribution. The third type is the most common one and occurs frequently in real-world scenarios: first, consumers are more likely to buy item-i2 after the initial purchase of item-i1; then, the sales volume of item-i2 decreases after a certain point is reached.
Fig. 2 is the plot of the distribution functions obtained from a real-world database. The upper half of Fig. 2 illustrates the third distribution type; however, the function is of type one on the domain [0, 90]. The follow-up sales of item-i2 go up for a span of time after item-i1 is purchased; then, the sales volume decreases after the ascending duration. The lower half of Fig. 2 shows another common type, in which the purchase of item-i2 decreases at a moderate rate.
After the mathematical model is constructed, the characteristics of the model and the relations between the model and the real world will be explained. The periodicity and degeneration will be the major concerns. The study of the characteristics of the model is useful for uncovering previously hidden facts underlying patterns, such as "how long will the repeated behavior last?" and "when will the next purchase happen?"
CHIANG ET AL.: THE CYCLIC MODEL ANALYSIS ON SEQUENTIAL PATTERNS 1619
Fig. 1. Pseudocode for computing the value of trend distribution function for sequence s.
Fig. 2. The illustrations of the three types of distributions.
If the function repeats its value after a certain period, the function is called a periodic function. The periodicity implies a cyclic sales volume. In other words, it can be predicted when the customers are likely to buy item-i2. Therefore, the definition of periodicity of the function is vital to the study. The following definition considers the formulation of a simple periodic function.
Definition 2. Let f(x) be a trend distribution function of sequence s defined on $X = [x_1, x_n]$. If, for each x in X, $f(x) \cong f(x + \lambda)$ and $\lambda \le (x_n - x_1)/2$, then f(x) is said to be a periodical trend distribution function of the sequence s with period $\lambda$.
In daily life, the purchasing amount varies over time and is hardly ever constant. Consider curves with a downward tendency such as
$$f(x) = \frac{24 - x}{18}\,(\sin(x) + 1.5). \tag{1}$$
The function shown in Fig. 3 is not a strictly (monotone) decreasing function, but the movement along the curve goes downward steadily. As mentioned before in this section, the inclination of the function within the designated domain can be characterized by the slope of the regression line within the domain. Given any subset of the whole domain, the function is called a linearly increasing distribution function if the slope of the regression line is positive. The function is a linearly decreasing function if the slope of the regression line is negative. Accordingly, the following is the definition of a linearly periodical distribution function.
Definition 3. Let f(x) be a linearly periodical trend distribution function of sequence s defined on the domain $X = [x_1, x_n]$. The straight line $y = ax + b$ is the trend line constructed by linear regression. If, for each x in X, $f(x) \cong f(x + \lambda) + ax$ and $\lambda \le (x_n - x_1)/2$, then f(x) is said to be a linearly increasing periodic trend distribution function of the sequence with period $\lambda$ on the domain X. The function f(x) is said to be a linearly decreasing periodical trend distribution function if $f(x) \cong f(x + \lambda) - ax$ and $\lambda \le (x_n - x_1)/2$.
Fig. 4 is a sample of a typical linearly decreasing function. The graph of the function goes downward along the x-axis at a certain rate, and the curve repeats its shape after a period of 63. That is, the function decreases steadily with period = 63. Then the function reaches zero at a certain point, that is, x = 300.
As mentioned, periodicity is not the only interest. The degeneration phenomenon is another major concern. Since the distribution function is a nonnegative function defined on the domain, the value of the function will be greater than or equal to zero. Hence, the point at which the function reaches zero is important for the study of the model as well.
If f(x) is a linearly decreasing distribution function and reaches zero within the domain, the point x is called the degeneration point of the function. Real-world experience shows that periodical purchases will not last indefinitely, for many reasons. The degeneration point signifies the fact that the customers tend to stop purchasing. The business owner must be alerted before the degeneration point is reached. Thus, the marketers can take actions to favorably impact the customers.
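One possible reading of this definition is sketched below in Python: the function is first checked for a negative regression slope, and the first sampled x at which f reaches zero is reported. The names `fit_line` and `degeneration_point`, the choice of the first zero as the reported point, and the sample data are assumptions; the paper does not spell out a procedure at this point.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def degeneration_point(xs: Sequence[float],
                       fs: Sequence[float]) -> Optional[float]:
    """Return the first x where a linearly decreasing f reaches zero.

    Returns None if the regression slope is not negative or if f never
    reaches zero on the sampled domain.
    """
    a, _ = fit_line(xs, fs)
    if a >= 0:
        return None
    for x, y in zip(xs, fs):
        if y <= 0:
            return x
    return None


# Hypothetical distribution that dies out at x = 3.
print(degeneration_point([0, 1, 2, 3, 4], [4, 3, 1, 0, 0]))
```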
In the next section, the data mining skills and some mathematical tools will be combined to formulate the algorithms to construct an automated and attainable analysis procedure for both engineers and marketing professionals.
3 ALGORITHMS
In this section, a set of algorithms designed to deal with the distribution functions obtained from the transaction databases is presented. The procedures proposed here are a synthesis of data mining techniques and mathematical tools. More specifically, the aim of this research is to devise a scheme to analyze the trend underlying the patterns. The scheme is to be integrated with traditional sequential pattern mining to offer a comprehensive analysis procedure, which can more easily be adopted by marketers. The scheme presented in this paper takes a two-phase approach to cope with all periodicity-related problems that occur in the analysis of sequential patterns mined from transactions.
The core theme of the research is Simple is Beauty. It is well known that a host of algorithms have been developed for efficient mining of sequential patterns. To solve the periodicity problem, a mathematical model is constructed to portray the sequential pattern mined from the database. The structure proposed to describe the nature of the pattern can reveal not only the periodicity but also the tendency of the occurrence of purchasing actions. Then the mathematical tool is used to determine whether the periodicity exists. If the periodicity exists, a procedure is proposed to analyze the likely consumer behavior.
The scheme comprises the sequential pattern mining technique and the algorithms presented in this section. Given the result of sequential pattern mining, the primary concern is to know whether there are regularities that can be found. Thus, the value of the trend distribution function is computed and then GPD is introduced to detect the periodicity of the function. If the periodicity can be
Fig. 3. The plot of the function (1).
Fig. 4. The plot of a linearly decreasing function.
identified, the analysts can have a better understanding of the patterns and decide on the next appropriate action.
Next, the TM procedure is developed to find an equation to approximate the trend distribution function obtained from the sequential pattern mined from databases. Thus, the nature of the phenomenon represented by the sequence can be identified and more or less described by the mathematical tool introduced here.
The third procedure presented in this paper is CMA. The CMA algorithm is used to depict the tendency and periodicity of the buying activities within the investigated time frame. Coupled with industry domain knowledge and marketers' expertise, the constructed model helps to predict likely buying behavior.
3.1 Generalized Periodicity Detection
In a real-world scenario, purchasing behavior is extremely dynamic. Purchases will not happen in a strictly regular way. However, the marketers and analysts are eager to identify the nature of the phenomenon. Periodicity detection is one of the primary areas of interest. Therefore, a systematic way has to be found to determine whether the function repeats itself at a regular period and what the value of the period is. As mentioned in Section 1, simple periodicity is rarely found in the real world. Thus, the conceptual definition of periodicity has been generalized. Then an empirical measure is proposed to determine the generalized periodicity of the function. The procedure GPD determines whether the function has periodicity and computes the value of the periods if periodicity exists.
Three parameters need to be determined beforehand. Given a trend distribution function f defined on $X = \{x_1, x_2, \ldots, x_n\}$, the parameter n stands for the number of elements in X. The parameter min_period is the smallest possible value of the period. The possibility of existence of periodicity is calculated from the designated minimum possible value to the maximum possible value. The value of the period usually cannot exceed half of the investigated time frame (n/2). The next parameter is max_error, which is the error threshold used to judge whether the investigated value is subject to the generalized periodicity. Fig. 5 is the pseudocode of the procedure GPD. The output of GPD is empty if the periodicity does not exist.
Since the given trend distribution function will not be a straight line or a monotone increasing (or decreasing) function, the measure $\sum_i |f(x_{i+\lambda}) - f(x_i)|$ is not sufficient to judge whether the function has the period $\lambda$. Hence, a more general measure to complete the task is proposed. The first step is to find the simple linear model $y = ax + b$ of the given function f. Then the function g(x) is defined as $f(x)/(ax + b)$. The function g(x) is used to construct the measure
$$\mathrm{error}(\lambda) = \frac{\sum_{i=1}^{n-\lambda} |g(x_{i+\lambda}) - g(x_i)|}{\sum_{i=1}^{n-\lambda} |g(x_i)|}. \tag{2}$$
Equation (2) is used to calculate the error measure that determines whether the examined value is a period prospect. For each prospect value of the period, the value of the measure is computed. If $\mathrm{error}(x_i)$ is less than the threshold max_error predefined by users, the value $x_i$ is a period prospect. Fig. 6 shows the result of applying GPD to the given input (1) mentioned before in Section 2.
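The steps described above (fit the trend line, detrend by division, evaluate (2) for every candidate shift up to n/2) can be sketched as follows. This is a hedged Python illustration, not the pseudocode of Fig. 5: the names `fit_line`, `periodicity_error`, and `gpd`, the assumption of equally spaced samples, and the guard against a trend line that crosses zero are choices made for this sketch.

```python
import math
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def periodicity_error(g: Sequence[float], lag: int) -> float:
    """Measure (2) for a shift of `lag` samples."""
    pairs = len(g) - lag
    num = sum(abs(g[i + lag] - g[i]) for i in range(pairs))
    den = sum(abs(g[i]) for i in range(pairs))
    return num / den if den > 0 else math.inf


def gpd(xs: Sequence[float], fs: Sequence[float],
        min_period: float, max_error: float) -> List[Tuple[float, float]]:
    """Return (period, error) prospects, best first; empty if none."""
    a, b = fit_line(xs, fs)
    trend = [a * x + b for x in xs]
    if any(abs(t) < 1e-12 for t in trend):
        raise ValueError("trend line crosses zero; g(x) is undefined there")
    g = [f / t for f, t in zip(fs, trend)]
    step = xs[1] - xs[0]  # assumes equally spaced samples
    first_lag = max(1, math.ceil(min_period / step))
    prospects = []
    for lag in range(first_lag, len(xs) // 2 + 1):
        err = periodicity_error(g, lag)
        if err < max_error:
            prospects.append((lag * step, err))
    return sorted(prospects, key=lambda p: p[1])


# Function (1) sampled at 100 points on [0, 18].
xs = [i * 18 / 100 for i in range(1, 101)]
fs = [(24 - x) / 18 * (math.sin(x) + 1.5) for x in xs]
print(gpd(xs, fs, min_period=0.5, max_error=0.2)[:3])
```

Because the shift is measured in whole samples, the periods reported by this sketch are multiples of the sampling step, so they need not coincide exactly with the values reported in the paper.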
Let f(x) be defined on X = [0, 18], and divide the domain into 100 partitions. Thus, we have $x_i = i\,(18/100)$ for $1 \le i \le 100$. The value of the function will be computed at each $x_i$.
Set the input parameters min_period = 0.5 and max_error = 0.2. The first step is to find the value of
Fig. 5. The GPD procedure determines the possible values of period.
Fig. 6. The first step of the GPD procedure is to find the regression line y = ax + b. Then the iterative computation of the error measure suggests that 6.39 is the most likely period of the function.
a, b to determine the linear model y = ax + b. Next, the value of the error indicator $\mathrm{error}(x_i)$ is computed for each period prospect $x_i$. Then, we have a = −0.112, b = 2.2326, and $\lambda$ = 6.39.
Fig. 6 illustrates the result of the examination. Fig. 6a shows the graph of the function and the regression line y = −0.112x + 2.2326 found by step 1 of the procedure. Fig. 6b shows the vertical bar at the discovered period for the given function.
3.2 Trend Modeling
Frequency is a good indicator of the importance of a pattern. In real life, the environment may change constantly and patterns discovered from databases may also change over time. Once the existence of periodicity has been identified, the characteristics of the sequence can be captured more precisely.
Hence, a mathematical structure that symbolizes and describes the transactions occurring in the real world is needed. The results of the modeling process must map the relationships between relevant attributes in the transaction databases to the relationships between relevant attributes of the function obtained by the computing rules presented in Definition 1.
The purchasing behavior observed in the real world is highly dynamic. The graph of the function is usually a visual aid to illustrate the dynamics of the observed facts. The salient features of the (x, y) plot are the evidence. Therefore, a sophisticated model is needed.
In many cases, broad movements can be discerned that evolve more gradually than the other movements that are evident. These gradual changes are called trends. The changes that are of a transitory nature are described as fluctuations. When dealing with time-series data, there is an inclination to break down the time series into the trend component to capture the trend characteristics of the function. Since regression analysis is frequently used for fitting equations to data, regression techniques are applied to construct the Trend Modeling procedure.
Obviously, the simple linear regression model, which is used to find a straight line representing the inclination of the scatter of the (x, y) plot, is not sufficient to describe the trend component underlying the series data. A model with higher-order terms is needed; this type of model is called a polynomial regression.
The simple linear regression model $Y = \beta_0 + \beta_1 x + \varepsilon$ can be generalized as an mth-order polynomial regression in one variable, which is written as
$$y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_m x^m + e_i. \tag{3}$$
To approximate the substantial curvature commonly observed in the real world, neither the simple linear regression model nor the general model represented by (3) is appropriate to fit the data. In addition, there are pitfalls that one must be aware of [10], [13].
The polynomial model can be highly ill-conditioned, even for low-order polynomials. This may lead to substantial round-off errors. Generally, as low an order as possible should be used to obtain a satisfactory fit. To deal with the ill-conditioning issue, the polynomial model (4) is brought in:
$$(ax + b) \sum_{i=0}^{m} c_i \cdot (x \bmod \lambda)^i. \tag{4}$$
However, the overfitting problem needs to be addressed. To limit overfitting, the degree of the polynomial is usually left as a user input parameter [13].
The TM algorithm is straightforward. It begins with the establishment of the simple linear regression model y = ax + b. The regression line obtained characterizes the inclination of the variable Y. Then the polynomial model described in (4) is computed. Combined with the result of GPD, the complete model characterizing the given input distribution function is obtained. The pseudocode of the algorithm is presented in Fig. 7.
To illustrate how Trend Modeling is used to find the approximating model of a given input, two typical linearly increasing/decreasing periodic functions are used as sample input.
It has been shown that (1) is a typical linearly decreasing function, which has period 6.39. TM is invoked to find the polynomial that approximates (1).
Let f(x) be defined on X = [0, 18], and divide the domain into 100 partitions. Given the predetermined parameters min_period = 0.5, max_error = 0.2, and degree = 6, Trend Modeling is applied to find the approximating model of (1). This gives the following:
1. Applying GPD to f(x) gives a = −0.112, b = 2.2326, and $\lambda$ = 6.39.
2. Compute the polynomial
$$f'(x) = (-0.112x + 2.2326)\{0.9390 + 0.5577(x \bmod 6.39) + 0.5251(x \bmod 6.39)^2 - 0.5153(x \bmod 6.39)^3 + 0.1321(x \bmod 6.39)^4 - 0.0144(x \bmod 6.39)^5 + 0.00066(x \bmod 6.39)^6\}.$$
Fig. 7. The pseudocode of the CMA procedure.
Fig. 8 shows the result of applying TM to the function f(x). The darker line is the polynomial f'(x) determined by regression and the other line is the input function f(x).
Next, TM is used to find the polynomial of the function (5), which is similar to the previously inspected example (1) but is a linearly increasing function:
$$g(x) = \frac{24 + x}{18}\,(\sin(x) + 1.5). \tag{5}$$
The function g(x) is defined on the same domain X = [0, 18], and the domain is again divided into 100 partitions. The same input parameters as for (1) are used. Let min_period = 0.5, max_error = 0.2, and degree = 6; then, apply Trend Modeling to find the approximating model of g(x). This gives the following:
1. Invoking GPD on g(x) gives a = 0.023, b = 2.526, and $\lambda$ = 6.42.
2. Compute the polynomial
$$g'(x) = (0.023x + 2.526)\{0.9012 + 0.5577(x \bmod 6.42) + 0.5251(x \bmod 6.42)^2 - 0.5153(x \bmod 6.42)^3 + 0.1321(x \bmod 6.42)^4 - 0.0144(x \bmod 6.42)^5 + 0.00066(x \bmod 6.42)^6\}.$$
The darker line in Fig. 9 is the polynomial g'(x), which is determined by regression on the input g(x); the lighter one is the plot of the input function g(x).
These two examples demonstrate how TM can closely approximate the descending and ascending types of linearly periodic functions.
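The fitting step of TM can be sketched in Python as follows. The paper specifies the form of model (4) but, in this section, not how the coefficients $c_i$ are estimated; the sketch assumes ordinary least squares on the ratio f(x)/(ax + b) as a polynomial in (x mod λ), with a, b, and λ supplied by GPD. The function names and the synthetic test data are hypothetical.

```python
from typing import List, Sequence


def solve_linear(A: List[List[float]], y: List[float]) -> List[float]:
    """Solve A c = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [rhs] for row, rhs in zip(A, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    c = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * c[k] for k in range(r + 1, n))
        c[r] = (M[r][n] - tail) / M[r][r]
    return c


def fit_trend_model(xs: Sequence[float], fs: Sequence[float], a: float,
                    b: float, period: float, degree: int) -> List[float]:
    """Least-squares coefficients c_0..c_degree of model (4) (assumed method)."""
    us = [x % period for x in xs]
    ratios = [f / (a * x + b) for x, f in zip(xs, fs)]
    size = degree + 1
    # Normal equations (V^T V) c = V^T r for the Vandermonde matrix V in u.
    A = [[sum(u ** (i + j) for u in us) for j in range(size)]
         for i in range(size)]
    y = [sum(r * u ** i for u, r in zip(us, ratios)) for i in range(size)]
    return solve_linear(A, y)


def evaluate_trend_model(x: float, a: float, b: float, period: float,
                         coeffs: Sequence[float]) -> float:
    """Evaluate (a*x + b) * sum_i c_i * (x mod period)**i."""
    u = x % period
    return (a * x + b) * sum(c * u ** i for i, c in enumerate(coeffs))


# Synthetic check: data generated from a known model is recovered.
true_c = [1.0, 0.5, -0.1]
xs = [i * 0.25 for i in range(40)]
fs = [evaluate_trend_model(x, -0.1, 5.0, 4.0, true_c) for x in xs]
print(fit_trend_model(xs, fs, -0.1, 5.0, 4.0, degree=2))
```

Solving the normal equations directly is adequate for low degrees; as the text notes, polynomial fits become ill-conditioned quickly, so a production implementation would likely prefer a numerically more stable solver.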
The polynomial gained by the TM process is an aid to identify the nature of the sequence mined. The graphical representation of the polynomial is an extremely good aid to help observers have a better understanding of the tendency of the pattern. And the analysis of the characteristics of the polynomial itself is helpful in describing the phenomenon of the mined pattern. Hence, an elaborate and systematic plan of action is needed to complete the task. That is why we developed CMA.
3.3 Cyclic Model Analysis
The purpose of the establishment of the mathematical model is to help analysts obtain a better understanding of the whole picture of what happened and predict what is likely to happen. With GPD and TM, it can be determined if customers tend to repeat buying at a regular period and an equation can be formulated to approximate the distribution obtained from the sequence. In short, the mathematical model established by GPD/TM is used to describe the characteristics of the sequential patterns mined from the designated time frame.
Next, CMA is proposed to analyze and describe thecharacteristics of the sequential pattern mined from thetransaction databases. Users must determine the value ofparameters min_period, max_error, degree, and trcd. Themeaning of the parameters min_period, max_error, and degreeis the same as defined in GPD and TM. The value of trcd isthe terminating condition of the process. If the length of thedomain is too short, the process should be stopped since itis meaningless to investigate the characteristics of repeatedpatterns. The procedural steps are shown in Fig. 10.
The trend distribution function of a given sequence is defined in Definition 1. The type of the function is then determined by locating its local maximum. If the local maximum lies at the end of the domain of the function, the function is of the ascending type; if it lies at the beginning of the domain, the function is of the descending type.
If the distribution function is of the ascending or descending type, GPD/TM is applied directly to obtain the polynomial approximating the pattern and to find the period of the distribution function. If the distribution is of neither type, the whole time frame is partitioned into two subframes and CMA is invoked recursively on each until the distribution function of the subframe is simplified. If the length of the inspected subframe is smaller than the predefined terminating condition trcd, the process stops.
In other words, CMA takes a divide-and-conquer approach to collect knowledge of the designated distribution function. Analysts combine the synthesized mathematical model with product knowledge to interpret the meaning of the model discovered by the proposed algorithms. The interpretations are then translated into marketing insight and marketing practice.
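The recursion described above can be sketched in Python as follows. This is a minimal illustration rather than the pseudocode of Fig. 10: the function names `classify` and `cma`, the midpoint split, and the `fit` callback standing in for GPD/TM are all assumptions made for the example, since the text does not specify where a frame is partitioned.

```python
from typing import Callable, List, Sequence, Tuple

# (start, end, type, model) for each simplified subframe; end is exclusive.
Result = Tuple[int, int, str, object]


def classify(values: Sequence[float]) -> str:
    """Classify a frame by where its maximum lies (first maximum on ties)."""
    peak = max(range(len(values)), key=lambda i: values[i])
    if peak == len(values) - 1:
        return "ascending"
    if peak == 0:
        return "descending"
    return "mixed"


def cma(values: Sequence[float], start: int, end: int, trcd: int,
        fit: Callable[[Sequence[float]], object] = lambda frame: None
        ) -> List[Result]:
    """Divide-and-conquer sketch: fit simple frames, split mixed ones."""
    # Stop when the subframe is shorter than the terminating condition.
    if end - start < max(trcd, 1):
        return []
    frame = values[start:end]
    kind = classify(frame)
    if kind != "mixed":
        # Hypothetical stand-in for invoking GPD/TM on the subframe.
        return [(start, end, kind, fit(frame))]
    mid = (start + end) // 2  # assumed split point; the text does not state one
    return cma(values, start, mid, trcd, fit) + cma(values, mid, end, trcd, fit)


if __name__ == "__main__":
    demo = [1, 2, 3, 4, 5, 4, 3, 2, 1, 0]  # made-up distribution values
    for segment in cma(demo, 0, len(demo), trcd=3):
        print(segment)
```

Splitting at the midpoint is only one possible choice; splitting at the position of the maximum would be another, and the source does not say which is used.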
Below is an illustration of CMA applied to real-world databases. The data collected were transactions of a domestic cosmetic supplier. The marketing department discovered several sequential patterns of interest. They found that the pattern <37, 27> was unusual; therefore, they chose this pattern to demonstrate the CMA process.
Fig. 8. The plot of the function f(x) and the polynomial f'(x).
Fig. 9. The plot of the function f(x) and the polynomial f'(x).
First, the analyst computed the trend distribution function according to its definition. The input parameters were set to min_period = 5, max_error = 0.5, degree = 6, and trcd = 30, and GPD/TM was then invoked to find the period of the distribution and the approximating polynomial. The results obtained by GPD were a = -0.039, b = 3.760, and period = 42. The approximating polynomial produced by TM is as follows:
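\[
f(x) =
\begin{cases}
\begin{aligned}
(-0.039x + 3.760)\bigl\{ & 1.6473 - 0.4335\,(x \bmod 42) + 0.0691\,(x \bmod 42)^2 \\
& - 0.052\,(x \bmod 42)^3 + 0.0002\,(x \bmod 42)^4 \\
& - 0.000\,(x \bmod 42)^5 + 0.000\,(x \bmod 42)^6 \bigr\},
\end{aligned}
& \text{if } x < 170, \\[4pt]
0, & \text{otherwise.}
\end{cases}
\]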
The plot of the distribution function of pattern <37, 27> is shown in Fig. 11. The following was learned from the picture and the polynomial:
1. Consumers usually purchased <27> 42 days after they purchased <37> (period = 42).
2. In the first six months after the first purchase of <37>, more and more purchases were made.
3. After six months, all buying stopped abruptly. This is unusual for consumer goods.
After consultation with business analysts, the strange pattern was explained. Item <27> faced internal and external competition during the first half-year of the investigated time frame. The marketers conducted a price-cut campaign, which stimulated the sales volume. After six months, the product was replaced with an alternative manufactured by the brand owner. Thus, the sales of <27> decreased dramatically.
As mentioned before, the interpretation of the model established by GPD/TM must be based on the abstraction of purchasing behavior, knowledge of products, and industry know-how. More comprehensive examples and discussions are detailed in the next section.
4 EXPERIMENTS
This section describes the experiments conducted to demonstrate the use of the proposed Cyclic Model Analysis procedure and discusses how the proposed method can affect marketers’ decisions and actions.
4.1 Analysis Process and Data Sets
In general, consumer markets have several characteristics in common, such as repeat buying over the relevant time frame, a large number of customers, and a wealth of information detailing past customer purchases. Hence, a well-known cosmetic company was approached to conduct the sequential pattern mining project.
The first experiment illustrates how to interpret the analysis and how to capture the cyclic characteristics of patterns discovered by sequential pattern mining. Next, it is demonstrated that the consistent behavior pattern shown in the first experiment remained significant in the following year.
1624 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 11, NOVEMBER 2009
Fig. 11. (a) The plot of the distribution function of pattern <37, 27>. (b) After applying GPD/TM, the vertical bar used to indicate the discovered period was drawn on the picture.
Fig. 10. The pseudocode of the CMA algorithm.
The process of applying CMA to real-world data is displayed in Fig. 12. After the mining process was completed, the sequential patterns to be investigated were selected. The distribution function of the designated pattern was computed, and the function was plotted to determine whether it was of interest to marketers. Then CMA was applied to capture the knowledge of the mathematical model. CMA divides the domain into smaller segments such that the distribution function defined on each segment is of the simply increasing or decreasing type. GPD and TM were then invoked to obtain the polynomial approximating the pattern defined on the segment.
The results of the divide-and-conquer process are collected and synthesized. The synthesized knowledge is then used to help analyze past and likely future behaviors. From the collected knowledge, marketers can elicit more facts from the transaction records and, consequently, take appropriate action to favorably impact customers’ behavior.
In the first experiment, GPD/TM/CMA were applied to analyze the transactions that occurred in the year 2000. The experiment showed how the procedure is implemented and how the customers’ likely behavior can be predicted with the aid of a simple and elegant mathematical model.
Next, the patterns that were identified as vital in 2000 were selected, and the results of the analysis conducted against the 2001 data sets were examined.
The purchasing records of a total of 41 products between the years 2000 and 2001 were collected, and IBM Intelligent Miner was used as the mining tool. In summary, there were 160,334 and 215,854 transactions in the years 2000 and 2001, respectively. In the year 2000, the average number of products purchased in a single transaction was 2.93 items. Assuming that each product was equally likely to be sold, a given product is expected to appear in about (1/41) × 2.93 ≈ 0.0715 of the transactions. Thus, we set min_sup = 7.0 percent as the threshold for frequent item sets. After applying the sequential mining process to the transaction database, the sequences found are listed in Table 2.
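The threshold calculation above is simple arithmetic and can be checked with a few lines of Python. The function name `uniform_support` is hypothetical; it only restates the assumption that all products are equally likely to be sold.

```python
def uniform_support(avg_items_per_transaction: float, n_products: int) -> float:
    """Expected fraction of transactions containing one given product,
    assuming every product is equally likely to be sold."""
    return avg_items_per_transaction / n_products


if __name__ == "__main__":
    support = uniform_support(2.93, 41)
    print(f"{support:.4f}")  # prints 0.0715
```

Rounding the result down to 7.0 percent gives the min_sup value used in the experiment.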
4.2 Finding Periodicity and Tendency
Working with the owner of the transaction database, we identified two patterns, <38, 20> and <38, 36>, as typical cyclic purchasing patterns of interest to marketing professionals.
In terms of product sales, the existence of sequence <38, 20> means that the customer purchases product 38 first and then buys product 20. This information, however, does not reveal when the customer will buy product 20. Therefore, CMA was applied to <38, 20> to discover the periodicity and tendency of the pattern.
The first step of the process is to compute the distribution function of the pattern and find its regression line. The GPD procedure is invoked to find the linear model and the optimal candidate for the period of the function. Invoking GPD with parameters min_period = 5, max_error = 0.5, and degree = 6 yielded a = -0.08, b = 25.87, and period = 35.
Then TM was used to obtain the approximating polynomial. The polynomial generated by TM is as follows:
\[
f(x) =
\begin{cases}
\begin{aligned}
(-0.08x + 25.87)\bigl\{ & 1.756 - 0.361\,(x \bmod 35) + 0.070\,(x \bmod 35)^2 \\
& - 0.007\,(x \bmod 35)^3 + 0.000\,(x \bmod 35)^4 \\
& - 0.000\,(x \bmod 35)^5 + 0.000\,(x \bmod 35)^6 \bigr\},
\end{aligned}
& \text{if } x < 305, \\[4pt]
0, & \text{otherwise.}
\end{cases}
\]
Fig. 12. The process flow of how the CMA is applied to the real-world data.
TABLE 2. The Result of Sequential Pattern Mining
However, the approximating polynomial is an abstraction of the pattern that is hard for nontechnical people to interpret. With the help of a visual representation of the distribution function, engineers, marketers, and business owners can communicate among themselves easily. Thus, the results of the modeling process can easily be incorporated into marketing practice.
The plot of the distribution function and its regression line are shown in Fig. 13. It can easily be seen that <38, 20> is a simple descending sequence. The pattern has a period of 35 days and degenerates at day x = 305. Thus, the analysis suggested the following:
• The majority of customers tended to buy product <20> every 35 days.
• The purchasing decreases moderately.
• Customers will not buy product <20> more than 305 days after the initial purchase of product <38>.
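To make the reading of the model concrete, the piecewise function for <38, 20> can be evaluated as in the sketch below, which treats the model as the product of the trend line and a polynomial in x mod period, as the printed formula suggests. The function names `linearly_periodic` and `model_38_20` are assumptions. Note that the coefficients are copied as printed, rounded to three decimals, so the higher-order terms vanish and the computed values are only meaningful where x mod 35 is small.

```python
from typing import Sequence


def linearly_periodic(x: float, a: float, b: float, period: float,
                      coeffs: Sequence[float], cutoff: float) -> float:
    """Evaluate (a*x + b) * P(x mod period) for x < cutoff, else 0."""
    if x >= cutoff:
        return 0.0
    r = x % period
    # coeffs[k] multiplies r**k, i.e. the list is in ascending order of degree.
    poly = sum(c * r ** k for k, c in enumerate(coeffs))
    return (a * x + b) * poly


# Coefficients of the <38, 20> model exactly as printed (rounded values).
COEFFS_38_20 = [1.756, -0.361, 0.070, -0.007, 0.000, -0.000, 0.000]


def model_38_20(x: float) -> float:
    """Printed model for pattern <38, 20>: period 35 days, cutoff at day 305."""
    return linearly_periodic(x, a=-0.08, b=25.87, period=35,
                             coeffs=COEFFS_38_20, cutoff=305)


if __name__ == "__main__":
    for day in (0, 35, 70, 305):
        print(day, round(model_38_20(day), 3))
```

At the start of each cycle (x mod 35 = 0) the value reduces to the trend line times the constant coefficient, which shows the moderate decrease from one cycle to the next; from day 305 onward the function returns zero.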
The characteristics of the patterns were learned from the mathematical model and the visualization of the distribution function. Marketing professionals combine the information gained from CMA with their knowledge of a product and then adapt the marketing practice to impact consumers’ likely behavior.
Next, CMA was applied to <38, 36>. Similarly, sequence <38, 36> means that the customer purchases product 38 first and then buys product 36. Marketers require more information than the sequence alone reveals. Thus, GPD was invoked with the predetermined parameters min_period = 5, max_error = 0.5, and degree = 6. The results of GPD were a = -0.07, b = 22.13, and period = 63. The approximating polynomial of the distribution function of <38, 36> is
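\[
f(x) =
\begin{cases}
\begin{aligned}
(-0.07x + 22.13)\bigl\{ & 1.995 - 0.231\,(x \bmod 63) + 0.016\,(x \bmod 63)^2 \\
& - 0.001\,(x \bmod 63)^3 + 0.000\,(x \bmod 63)^4 \\
& - 0.000\,(x \bmod 63)^5 + 0.000\,(x \bmod 63)^6 \bigr\},
\end{aligned}
& \text{if } x < 299, \\[4pt]
0, & \text{otherwise.}
\end{cases}
\]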
The plot of the distribution function and its regression line are shown in Fig. 14. From the picture of the model together with the characteristics of the polynomial, the following was learned:
• The majority of customers tended to buy product <36> every 63 days.
• The purchasing decreases moderately.
• Customers will not buy product <36> more than 299 days after the initial purchase of product <38>.
The results indicate that CMA performs well in exploring the trends of repeat-buying behaviors and provides a practical model for predicting when customers tend to purchase and when they are likely to stop buying. Consequently, marketers can allocate resources to build and execute marketing campaigns that favorably impact the behavior of these customers.
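The trend lines reported in this section (the values a and b) are regression lines fitted to the distribution function. A minimal ordinary least-squares sketch in plain Python is given below; it covers only the straight-line fit, not the period search performed by GPD, and the function name `fit_line` and the sample data are made up for illustration.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (a, b) minimizing the squared error of y = a*x + b."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope of the trend line
    b = mean_y - a * mean_x  # intercept of the trend line
    return a, b


if __name__ == "__main__":
    # Synthetic points lying on a descending line, for illustration only.
    days = list(range(0, 300, 10))
    counts = [-0.08 * d + 25.87 for d in days]
    print(fit_line(days, counts))
```

A negative slope a corresponds to the descending type discussed above, and a positive slope to the ascending type.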
4.3 Consistent Buying Behaviors
Next, transactions that occurred in the year 2001 were examined to see whether the patterns that CMA had shown to be vital in the year 2000 had the same characteristics. Hence, we applied GPD/TM to find the regression polynomial of each pattern in the years 2000 and 2001.
Fig. 15 is the plot of the trend line and regression polynomial determined by GPD/TM for the pattern <38, 20> in the years 2000 and 2001, respectively. Fig. 16 is the plot of the trend line and regression polynomial discovered by GPD/TM for the pattern <38, 36> in the years 2000 and 2001, respectively.
Fig. 13. The plot of the pattern <38, 20> and its regression line.
Fig. 14. The plot of the pattern <38, 36> and its regression line.
Fig. 15. The regression line and approximation polynomial of <38, 20> in the years 2000 and 2001. We found that the shapes of the plots of the two polynomials are similar.
The patterns were examined in turn, and some facts were found to be in common. The sketches of the regression polynomials for the years 2000 and 2001 are similar in shape, but the sales volume varied. Figs. 15 and 16 show the similarity in shape of the function plots.
In addition, it was found that the polynomials for both years exhibit periodicity, but with a slight difference. The period of <38, 20> in the year 2000 is 35, whereas the period of the repeated buying in the year 2001 is 42. The period of <38, 36> found by GPD for the years 2000 and 2001 is the same, i.e., 20.
The similarity of the plots and the computed periods showed remarkable consistency. The variation in sales volume and periodicity can be explained: many factors can affect sales, such as the economic outlook (boom or recession), the emergence of product replacements, and the marketing practice conducted in the first year.
Analysis of the customer profile database revealed an interesting fact. Some customers bought the cosmetics during both 2000 and 2001, some purchased the products only in 2000 and bought nothing in 2001, and some purchased the products only in 2001.
These facts reveal that the repeat-buying pattern holds for different sets of customers. This indicates that the selected patterns are user-independent, since CMA considers only the items bought and the chronological order of the purchases.
The consistency of the repeat-buying behavior over time and the item-based nature of the CMA algorithm suggest that a hybrid recommendation system can be formed to provide better prediction. Marketing professionals would thus have a better tool with which to retain their customers.
5 CONCLUSION
The main purpose of this work is to design a time-interval analysis algorithm that can be applied to a wide array of applications, especially to analyze time-variant purchase behavior and establish a model for predicting consumers’ likely behavior. The algorithms proposed in this paper were designed to work with traditional mining algorithms to provide a better understanding of customers and predict the likely actions of observed targets.
It has been shown that CMA performs well in describing the buying patterns of consumers and predicting likely behaviors. However, it leaves some room for improvement. First, the periodicity-detecting procedure can be improved by applying fuzzy techniques and other statistical tools. In addition, the CMA algorithm should be implemented as an add-on to existing mining tools.
Nowadays, mass production is an outmoded business model, and companies must provide goods and services that fit customers’ individual needs. Mass customization has become the new paradigm. The GPD/TM/CMA procedures can be incorporated into personalized recommendation systems. A hybrid recommender can be formed to build an automated process for discovering time-relevant knowledge of customers, predicting customers’ likely actions, and providing useful suggestions for marketing practice.
REFERENCES
[1] R. Agrawal, T. Imieli�nski, and A. Swami, “Mining AssociationRules between Sets of Items in Large Databases,” Proc. ACMSIGMOD ’93, pp. 207-216, 1993.
[2] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc.1995 IEEE 11th Int’l Conf. Data Eng. (ICDE ’95), pp. 3-14, 1995.
[3] R. Agrawal and R. Srikant, “Fast Algorithms for MiningAssociation Rules in Large Databases,” Proc. 20th Int’l Conf. VeryLarge Data Bases (VLDB ’94), pp. 487-499, 1994.
[4] Y. Chen, M. Chiang, and M. Ko, “Discovering Time-IntervalSequential Patterns in Sequence Databases,” Expert Systems withApplications, vol. 25, pp. 343-354, 2003.
[5] D. Chiang, S. Lee, C. Chen, and M. Wang, “Mining IntervalSequential Patterns,” Int’l J. Intelligent Systems, vol. 20, pp. 359-373,2005.
[6] M. Garofalakis, R. Rastogi, and K. Shim, “Mining SequentialPatterns with Regular Expression Constraints,” IEEE Trans.Knowledge and Data Eng., vol. 14, no. 3, pp. 530-552, May 2002.
[7] J. Han, G. Dong, and Y. Yin, “Efficient Mining of Partial PeriodicPatterns in Time Series Database,” Proc. Int’l Conf. Data Eng. (ICDE’99), p. 106-115, 1999.
[8] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. Hsu,“FreeSpan: Frequent Pattern-Projected Sequential PatternMining,” Proc. ACM SIGKDD Int’l Conf. Knowledge Discovery andData Mining (SIGKDD ’00), pp. 355-359, 2000.
[9] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, andM.-C. Hsu, “PrefixSpan: Mining Sequential Patterns Efficiently byPrefix-Projected Pattern Growth,” Proc. 17th Int’l Conf. Data Eng.(ICDE ’01), pp. 215-224, 2001.
[10] J. Neter, M.H. Kutner, W. Wasserman, and C.J. Nachtsheim,Applied Linear Statistics Model, fourth ed. McGraw-Hill, 1996.
[11] M. Lin and S. Lee, “Fast Discovery of Sequential Patterns byMemory Indexing,” Proc. Fourth Int’l Conf. Data WarehousingKnowledge Discovery (DaWaK ’02), pp. 150-160, 2002.
[12] F. Masseglia, F. Cathala, and P. Poncelet, “The PSP Approach forMining Sequential Patterns,” Proc. Second European Symp. Princi-ples Data Mining Knowledge Discovery (PKDD ’98), pp. 176-184,1998.
[13] M.A. Golberg and H.A. Cho, Introduction to Regression Analysis,vol. 1. WIT Press, 2003.
[14] B. Ozden, S. Ramaswamy, and A. Silberschatz, “Cyclic Associa-tion Rules,” Proc. 14th Int’l Conf. Data Eng. (ICDE ’98), pp. 412-421,1998.
[15] J. Pei, J. Han, and W. Wang, “Mining Sequential Patterns withConstraints in Large Databases,” Proc. 11th Int’l Conf. Informationand Knowledge Management (CIKM ’02), pp. 18-25, 2002.
[16] P. Rolland, “FlExPat: Flexible Extraction of Sequential Patterns,”Proc. IEEE Int’l Conf. Data Mining (ICDM ’01), pp. 481-488, 2001.
[17] R. Srikant and R. Agrawal, “Mining Sequential Patterns: General-izations and Performance Improvements,” Proc. Fifth Int’l Conf.Extending Database Technology, (EDBT ’96), p. 3-17, 1996.
[18] J. Yang, W. Wang, P.S. Yu, and J. Han, “Mining Long SequentialPatterns in a Noisy Environment,” Proc. ACM SIGMOD ’02,pp. 406-417, 2002.
[19] M.J.E. Zaki, “SPADE: An Efficient Algorithm for Mining FrequentSequences,” Machine Learning, vol. 42, nos. 1/2, pp. 31-60, 2001.
Fig. 16. The regression line and approximation polynomial of <38, 36> in the years 2000 and 2001. The results of the experiment are similar to those of the experiment conducted on <38, 20>.
Ding-An Chiang received the BS degree in hydraulic engineering from Chung Yuan Christian University, Taiwan, in 1981, and the MS and PhD degrees in computer science from the University of Southwestern Louisiana in 1986 and 1990, respectively. He is currently a professor in the Department of Computer Science and Information Engineering and the dean of student affairs at Tamkang University. His research interests include fuzzy, relational databases and data mining.
Cheng-Tzu Wang received the MS and PhD degrees from the Center for Advanced Computer Studies at the University of Louisiana in 1991 and 1994, respectively. He is currently an associate professor in the Department of Computer Science at the National Taipei University of Education, Taiwan. His research interests include software engineering, hybrid soft computing models, and data mining.
Shao-Ping Chen is currently working toward the PhD degree in computer science and information engineering at Tamkang University in Taipei, Taiwan. His research interests include data mining, e-commerce, and cyber culture.
Chun-Chi Chen received the MS degree in computer science and information engineering from Tamkang University in Taipei, Taiwan, in 2003. His research interests include relational databases and data mining.